1 server alarm message
[monitoring alarm] alarm Name: the machine of commercial business digital technology center has oom,
Status: CRITICAL,
Environment: xxx - Alibaba cloud - production cluster (production) - production,
Alarm content: log.sys.oom (max, 120s) > 0, current value: 1.00,
Resource type: server (n9e),
Alarm object: 10.231.82.xxx,
Trigger time: 2021-11-02 18:26:30,
View Details: http://monitor.longhu.net/Linux/xxxxxxx
2. HDP cluster component nail alarm
3 view system log
Nov 2 18:26:37 c2-kl-snamenode kernel: titanagent invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 Nov 2 18:26:38 c2-kl-snamenode kernel: titanagent cpuset=/ mems_allowed=0 Nov 2 18:26:38 c2-kl-snamenode kernel: CPU: 5 PID: 13361 Comm: titanagent Kdump: loaded Tainted: G OE ------------ 3.10.0-957.12.2.el7.x86_64 #1 Nov 2 18:26:38 c2-kl-snamenode kernel: Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014 Nov 2 18:26:38 c2-kl-snamenode kernel: Call Trace: Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b63041>] dump_stack+0x19/0x1b Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b5da6a>] dump_header+0x90/0x229 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5501212>] ? ktime_get_ts64+0x52/0xf0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb555845f>] ? delayacct_end+0x8f/0xb0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55ba7b4>] oom_kill_process+0x254/0x3d0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55ba25d>] ? oom_unkillable_task+0xcd/0x120 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55ba306>] ? find_lock_task_mm+0x56/0xc0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55baff6>] out_of_memory+0x4b6/0x4f0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b5e56e>] __alloc_pages_slowpath+0x5d6/0x724 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55c13d4>] __alloc_pages_nodemask+0x404/0x420 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb560e288>] alloc_pages_current+0x98/0x110 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55b6617>] __page_cache_alloc+0x97/0xb0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55b9278>] filemap_fault+0x298/0x490 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffc03be186>] ext4_filemap_fault+0x36/0x50 [ext4] Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55e476a>] __do_fault.isra.59+0x8a/0x100 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb542a621>] ? __switch_to+0x151/0x580 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb54e0216>] ? update_curr+0x86/0x1e0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55e4d1c>] do_read_fault.isra.61+0x4c/0x1b0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb54e0669>] ? update_cfs_shares+0xa9/0xf0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55e96c4>] handle_pte_fault+0x2f4/0xd10 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb54d0977>] ? finish_task_switch+0x57/0x1c0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55ec1fd>] handle_mm_fault+0x39d/0x9b0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b70603>] __do_page_fault+0x203/0x4f0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b709d6>] trace_do_page_fault+0x56/0x150 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b6ff62>] do_async_page_fault+0x22/0xf0 Nov 2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b6c798>] async_page_fault+0x28/0x30 Nov 2 18:26:38 c2-kl-snamenode kernel: Mem-Info: Nov 2 18:26:38 c2-kl-snamenode kernel: active_anon:3870217 inactive_anon:96 isolated_anon:0#012 active_file:1922 inactive_file:4988 isolated_file:162#012 unevictable:0 dirty:32 writeback:3 unstable:0#012 slab_reclaimable:31911 slab_unreclaimable:16332#012 mapped:830 shmem:205 pagetables:13667 bounce:0#012 free:33838 free_pcp:0 free_cma:0 Nov 2 18:26:38 c2-kl-snamenode kernel: Node 0 DMA free:15908kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Nov 2 18:26:38 c2-kl-snamenode kernel: lowmem_reserve[]: 0 2812 15866 15866 Nov 2 18:26:38 c2-kl-snamenode kernel: Node 0 DMA32 free:63996kB min:11968kB low:14960kB high:17952kB active_anon:2672360kB inactive_anon:64kB active_file:1412kB inactive_file:6028kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129192kB managed:2882968kB mlocked:0kB dirty:32kB writeback:0kB mapped:1120kB shmem:172kB slab_reclaimable:38752kB slab_unreclaimable:14104kB kernel_stack:8912kB pagetables:8508kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:73256 all_unreclaimable? yes Nov 2 18:26:38 c2-kl-snamenode kernel: lowmem_reserve[]: 0 0 13053 13053 Nov 2 18:26:38 c2-kl-snamenode kernel: Node 0 Normal free:55448kB min:55548kB low:69432kB high:83320kB active_anon:12808508kB inactive_anon:320kB active_file:6276kB inactive_file:13924kB unevictable:0kB isolated(anon):0kB isolated(file):648kB present:13631488kB managed:13366892kB mlocked:0kB dirty:96kB writeback:12kB mapped:2200kB shmem:648kB slab_reclaimable:88892kB slab_unreclaimable:51224kB kernel_stack:28688kB pagetables:46160kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:37125 all_unreclaimable? yes Nov 2 18:26:38 c2-kl-snamenode kernel: lowmem_reserve[]: 0 0 0 0 Nov 2 18:26:38 c2-kl-snamenode kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB Nov 2 18:26:38 c2-kl-snamenode kernel: Node 0 DMA32: 1151*4kB (UEM) 1297*8kB (UEM) 1229*16kB (UEM) 416*32kB (UEM) 106*64kB (UEM) 11*128kB (UE) 31*256kB (UE) 3*512kB (U) 0*1024kB 0*2048kB 0*4096kB = 65620kB Nov 2 18:26:38 c2-kl-snamenode kernel: Node 0 Normal: 3680*4kB (UEM) 2890*8kB (UE) 1103*16kB (UEM) 13*32kB (EM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55904kB Nov 2 18:26:38 c2-kl-snamenode kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB Nov 2 18:26:38 c2-kl-snamenode kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB Nov 2 18:26:38 c2-kl-snamenode kernel: 7729 total pagecache pages Nov 2 18:26:38 c2-kl-snamenode kernel: 0 pages in swap cache Nov 2 18:26:38 c2-kl-snamenode kernel: Swap cache stats: add 0, delete 0, find 0/0 Nov 2 18:26:38 c2-kl-snamenode kernel: Free swap = 0kB Nov 2 18:26:38 c2-kl-snamenode kernel: Total swap = 0kB Nov 2 18:26:38 c2-kl-snamenode kernel: 4194168 pages RAM Nov 2 18:26:38 c2-kl-snamenode kernel: 0 pages HighMem/MovableOnly Nov 2 18:26:38 c2-kl-snamenode kernel: 127726 pages reserved Nov 2 18:26:38 c2-kl-snamenode kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Nov 2 18:26:38 c2-kl-snamenode kernel: [ 1790] 0 1790 64980 103 132 0 0 systemd-journal Nov 2 18:26:38 c2-kl-snamenode kernel: [ 1811] 0 1811 47590 100 28 0 0 lvmetad Nov 2 18:26:38 c2-kl-snamenode kernel: [ 1829] 0 1829 11129 154 24 0 -1000 systemd-udevd Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3413] 0 3413 13880 118 27 0 -1000 auditd Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3445] 999 3445 153085 1682 62 0 0 polkitd Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3446] 32 3446 17316 135 37 0 0 rpcbind Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3452] 0 3452 5415 101 15 0 0 irqbalance Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3464] 0 3464 6657 173 17 0 0 systemd-logind Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3467] 81 3467 14554 156 35 0 -900 dbus-daemon Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3503] 0 3503 48775 119 34 0 0 gssproxy Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3544] 0 3544 31579 180 19 0 0 crond Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3559] 0 3559 6476 53 17 0 0 atd Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3582] 0 3582 27523 32 10 0 0 agetty Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3583] 0 3583 27523 33 10 0 0 agetty Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3785] 0 3785 25710 524 49 0 0 dhclient Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3846] 0 3846 143483 2796 97 0 0 tuned Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3854] 0 3854 232412 56050 382 0 0 rsyslogd Nov 2 18:26:38 c2-kl-snamenode kernel: [ 3904] 0 3904 340102 19651 119 0 0 n9e-collector Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4177] 0 4177 5731 89 14 0 0 argusagent Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4179] 0 4179 308752 3450 60 0 0 /usr/local/clou Nov 2 18:26:38 c2-kl-snamenode kernel: [12470] 0 12470 28216 256 58 0 -1000 sshd Nov 2 18:26:38 c2-kl-snamenode kernel: [23550] 998 23550 29448 130 30 0 0 chronyd Nov 2 18:26:38 c2-kl-snamenode kernel: [25161] 0 25161 56017 505 113 0 0 httpd Nov 2 18:26:38 c2-kl-snamenode kernel: [17170] 0 17170 37325 897 29 0 0 python Nov 2 18:26:38 c2-kl-snamenode kernel: [17174] 0 17174 717657 40285 223 0 0 python Nov 2 18:26:38 c2-kl-snamenode kernel: [12170] 1016 12170 731784 66076 344 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [12326] 1009 12326 147520 2550 72 0 0 python2.7 Nov 2 18:26:38 c2-kl-snamenode kernel: [14099] 1007 14099 1354046 56626 295 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [14496] 1016 14496 714714 33040 173 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [ 310] 1020 310 47777 1440 44 0 0 python Nov 2 18:26:38 c2-kl-snamenode kernel: [ 311] 1020 311 1904045 34660 179 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [21603] 0 21603 109262 189 28 0 0 AliSecGuard Nov 2 18:26:38 c2-kl-snamenode kernel: [ 5807] 1018 5807 2897189 254199 1008 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [11866] 0 11866 2687 55 11 0 0 jsvc Nov 2 18:26:38 c2-kl-snamenode kernel: [11880] 1018 11880 697340 38125 159 0 0 jsvc Nov 2 18:26:38 c2-kl-snamenode kernel: [ 2797] 1018 2797 769655 102332 361 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [22546] 1009 22546 28845 86 13 0 0 bash Nov 2 18:26:38 c2-kl-snamenode kernel: [22560] 1009 22560 1020455 500837 1235 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [22616] 1009 22616 1652841 221421 684 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [20278] 1010 20278 1039497 287501 657 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [ 6637] 1010 6637 1055965 118005 318 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [14342] 0 14342 201779 1185 13 0 0 aliyun-service Nov 2 18:26:38 c2-kl-snamenode kernel: [14445] 0 14445 4451 123 13 0 0 assist_daemon Nov 2 18:26:38 c2-kl-snamenode kernel: [17671] 0 17671 10482 394 20 0 0 AliYunDunUpdate Nov 2 18:26:38 c2-kl-snamenode kernel: [23521] 1004 23521 28845 85 14 0 0 bash Nov 2 18:26:38 c2-kl-snamenode kernel: [23535] 1004 23535 1431755 103152 391 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [ 1182] 1020 1182 28845 85 13 0 0 bash Nov 2 18:26:38 c2-kl-snamenode kernel: [ 1196] 1020 1196 854319 146975 527 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [26591] 0 26591 127511 342 33 0 0 AliNet Nov 2 18:26:38 c2-kl-snamenode kernel: [26746] 0 26746 578461 2993 79 0 0 AliHips Nov 2 18:26:38 c2-kl-snamenode kernel: [27031] 1018 27031 754075 83566 375 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [ 744] 0 744 44485 12650 88 0 0 AliYunDun Nov 2 18:26:38 c2-kl-snamenode kernel: [26067] 1016 26067 1593210 610784 1391 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [32663] 0 32663 144907 557 42 0 0 AliDetect Nov 2 18:26:38 c2-kl-snamenode kernel: [24142] 1004 24142 28845 87 13 0 0 bash Nov 2 18:26:38 c2-kl-snamenode kernel: [24156] 1004 24156 1926525 1023090 2199 0 0 java Nov 2 18:26:38 c2-kl-snamenode kernel: [25057] 48 25057 56017 466 111 0 0 httpd Nov 2 18:26:38 c2-kl-snamenode kernel: [25058] 48 25058 56017 466 111 0 0 httpd Nov 2 18:26:38 c2-kl-snamenode kernel: [25059] 48 25059 56017 466 111 0 0 httpd Nov 2 18:26:38 c2-kl-snamenode kernel: [25060] 48 25060 56017 466 111 0 0 httpd Nov 2 18:26:38 c2-kl-snamenode kernel: [25061] 48 25061 56017 466 111 0 0 httpd Nov 2 18:26:38 c2-kl-snamenode kernel: [13131] 0 13131 118125 13948 130 0 0 titanagent Nov 2 18:26:38 c2-kl-snamenode kernel: [13133] 0 13133 21926 122 13 0 0 titan_monitor Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4393] 0 4393 57508 4449 67 0 0 python Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4564] 0 4564 3437 54 12 0 0 ambari-sudo.sh Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4565] 0 4565 28844 71 14 0 0 bash Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4569] 0 4569 22553 145 49 0 0 su Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4573] 0 4573 28811 53 14 0 0 bash Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4575] 0 4575 28844 88 13 0 0 bash Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4598] 0 4598 52172 2653 56 0 0 python Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4607] 1014 4607 29343 67 12 0 0 bash Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4657] 0 4657 45598 250 46 0 0 crond Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4658] 0 4658 45598 250 46 0 0 crond Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4674] 1014 4674 18754 186 40 0 0 curl Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4705] 0 4705 3437 54 12 0 0 ambari-sudo.sh Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4706] 0 4706 3437 54 12 0 0 ambari-sudo.sh Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4713] 0 4713 22048 135 49 0 0 su Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4714] 0 4714 22048 136 47 0 0 su Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4734] 0 4734 2365 23 8 0 0 sh Nov 2 18:26:38 c2-kl-snamenode kernel: [ 4735] 0 4735 2365 22 9 0 0 sh Nov 2 18:26:38 c2-kl-snamenode kernel: Out of memory: Kill process 24156 (java) score 252 or sacrifice child Nov 2 18:26:38 c2-kl-snamenode kernel: Killed process 24156 (java) total-vm:7706100kB, anon-rss:4092360kB, file-rss:0kB, shmem-rss:0kB Nov 2 18:26:39 c2-kl-snamenode systemd: Started Session 125873 of user root. Nov 2 18:26:39 c2-kl-snamenode systemd: Started Session 125872 of user root.
View HDP component timeline_ Start time of reader process
From the above logs and alarms, we can see that HDP cluster timeline_ The reader component is kill ed by the system, but Ambari is configured with automatic restart, which will restart after hanging up.
4 basic concepts of OOM
Linux has a feature: OOM Killer, a protection mechanism, which is used to avoid serious problems when memory is insufficient and kill some irrelevant processes first, that is, when memory is seriously insufficient, the system will select a process and kill it to release memory and alleviate memory shortage. However, this protection is limited, Can not completely protect the operation of the process.
In many cases, it is often seen that when there is still remaining memory, the oom killer still kills the process. The phenomenon is that the following information is in the / var/log/messages log file:
Out of Memory: Killed process [PID] [process name].
The problem is low memory exhaustion because the kernel uses low memory to track all memory allocations.
When low memory is exhausted, no matter how much high memory is left, oom killer will kill the process to keep the system running normally.
Under the 32-bit CPU, the addressing range is limited. The Linux kernel defines the following three areas:
### You can see from the log above # DMA: 0x00000000 - 0x00999999 (0 - 16 MB) # LowMem: 0x01000000 - 0x037999999 (16 - 896 MB) - size: 880MB # Highmem: 0x038000000 - < hardware specific >
Data belonging to the process, such as Stacks, Heaps, etc. Can be further decomposed into
- active_anon
- inactive_anon
[root@hdp101 ~]# ps -ef | grep 10412 root 10412 1 0 06:41 ? 00:02:58 /usr/java/jdk1.8.0_162/bin/java -server -XX:NewRatio=3 -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 -Djava.io.tmpdir=/var/lib/smartsense/hst-server/tmp -Dlog.file.name=hst-server.log -Xms1024m -Xmx2048m -cp /etc/hst/conf:/usr/hdp/share/hst/hst-common/lib/* com.hortonworks.support.tools.server.SupportToolServer root 12725 12592 0 20:25 pts/0 00:00:00 grep --color=auto 10412 [root@hdp101 ~]# [root@hdp101 ~]# [root@hdp101 ~]# ### When OOM is enabled, each process has OOM parameters that can be set [root@hdp101 ~]# ll /proc/10412/oom* -rw-r--r-- 1 root root 0 Nov 6 20:26 /proc/10412/oom_adj -r--r--r-- 1 root root 0 Nov 6 20:26 /proc/10412/oom_score -rw-r--r-- 1 root root 0 Nov 6 20:26 /proc/10412/oom_score_adj [root@hdp101 ~]#
In view of the shortage of server resources, we have to expand memory.