Linux server OOM analysis

  1 server alarm message

[monitoring alarm] alarm Name: the machine of commercial business digital technology center has oom,

Status: CRITICAL,

Environment: xxx - Alibaba cloud - production cluster (production) - production,

Alarm content: log.sys.oom (max, 120s) > 0, current value: 1.00,

Resource type: server (n9e),

Alarm object: 10.231.82.xxx,

Trigger time: 2021-11-02 18:26:30,

View Details: http://monitor.longhu.net/Linux/xxxxxxx

2. HDP cluster component nail alarm

3 view system log

Nov  2 18:26:37 c2-kl-snamenode kernel: titanagent invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Nov  2 18:26:38 c2-kl-snamenode kernel: titanagent cpuset=/ mems_allowed=0
Nov  2 18:26:38 c2-kl-snamenode kernel: CPU: 5 PID: 13361 Comm: titanagent Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.12.2.el7.x86_64 #1
Nov  2 18:26:38 c2-kl-snamenode kernel: Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
Nov  2 18:26:38 c2-kl-snamenode kernel: Call Trace:
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b63041>] dump_stack+0x19/0x1b
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b5da6a>] dump_header+0x90/0x229
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5501212>] ? ktime_get_ts64+0x52/0xf0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb555845f>] ? delayacct_end+0x8f/0xb0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55ba7b4>] oom_kill_process+0x254/0x3d0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55ba25d>] ? oom_unkillable_task+0xcd/0x120
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55ba306>] ? find_lock_task_mm+0x56/0xc0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55baff6>] out_of_memory+0x4b6/0x4f0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b5e56e>] __alloc_pages_slowpath+0x5d6/0x724
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55c13d4>] __alloc_pages_nodemask+0x404/0x420
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb560e288>] alloc_pages_current+0x98/0x110
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55b6617>] __page_cache_alloc+0x97/0xb0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55b9278>] filemap_fault+0x298/0x490
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffc03be186>] ext4_filemap_fault+0x36/0x50 [ext4]
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55e476a>] __do_fault.isra.59+0x8a/0x100
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb542a621>] ? __switch_to+0x151/0x580
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb54e0216>] ? update_curr+0x86/0x1e0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55e4d1c>] do_read_fault.isra.61+0x4c/0x1b0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb54e0669>] ? update_cfs_shares+0xa9/0xf0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55e96c4>] handle_pte_fault+0x2f4/0xd10
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb54d0977>] ? finish_task_switch+0x57/0x1c0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55ec1fd>] handle_mm_fault+0x39d/0x9b0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b70603>] __do_page_fault+0x203/0x4f0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b709d6>] trace_do_page_fault+0x56/0x150
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b6ff62>] do_async_page_fault+0x22/0xf0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b6c798>] async_page_fault+0x28/0x30
Nov  2 18:26:38 c2-kl-snamenode kernel: Mem-Info:
Nov  2 18:26:38 c2-kl-snamenode kernel: active_anon:3870217 inactive_anon:96 isolated_anon:0#012 active_file:1922 inactive_file:4988 isolated_file:162#012 unevictable:0 dirty:32 writeback:3 unstable:0#012 slab_reclaimable:31911 slab_unreclaimable:16332#012 mapped:830 shmem:205 pagetables:13667 bounce:0#012 free:33838 free_pcp:0 free_cma:0
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 DMA free:15908kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Nov  2 18:26:38 c2-kl-snamenode kernel: lowmem_reserve[]: 0 2812 15866 15866
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 DMA32 free:63996kB min:11968kB low:14960kB high:17952kB active_anon:2672360kB inactive_anon:64kB active_file:1412kB inactive_file:6028kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129192kB managed:2882968kB mlocked:0kB dirty:32kB writeback:0kB mapped:1120kB shmem:172kB slab_reclaimable:38752kB slab_unreclaimable:14104kB kernel_stack:8912kB pagetables:8508kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:73256 all_unreclaimable? yes
Nov  2 18:26:38 c2-kl-snamenode kernel: lowmem_reserve[]: 0 0 13053 13053
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 Normal free:55448kB min:55548kB low:69432kB high:83320kB active_anon:12808508kB inactive_anon:320kB active_file:6276kB inactive_file:13924kB unevictable:0kB isolated(anon):0kB isolated(file):648kB present:13631488kB managed:13366892kB mlocked:0kB dirty:96kB writeback:12kB mapped:2200kB shmem:648kB slab_reclaimable:88892kB slab_unreclaimable:51224kB kernel_stack:28688kB pagetables:46160kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:37125 all_unreclaimable? yes
Nov  2 18:26:38 c2-kl-snamenode kernel: lowmem_reserve[]: 0 0 0 0
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 DMA32: 1151*4kB (UEM) 1297*8kB (UEM) 1229*16kB (UEM) 416*32kB (UEM) 106*64kB (UEM) 11*128kB (UE) 31*256kB (UE) 3*512kB (U) 0*1024kB 0*2048kB 0*4096kB = 65620kB
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 Normal: 3680*4kB (UEM) 2890*8kB (UE) 1103*16kB (UEM) 13*32kB (EM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55904kB
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov  2 18:26:38 c2-kl-snamenode kernel: 7729 total pagecache pages
Nov  2 18:26:38 c2-kl-snamenode kernel: 0 pages in swap cache
Nov  2 18:26:38 c2-kl-snamenode kernel: Swap cache stats: add 0, delete 0, find 0/0
Nov  2 18:26:38 c2-kl-snamenode kernel: Free swap  = 0kB
Nov  2 18:26:38 c2-kl-snamenode kernel: Total swap = 0kB
Nov  2 18:26:38 c2-kl-snamenode kernel: 4194168 pages RAM
Nov  2 18:26:38 c2-kl-snamenode kernel: 0 pages HighMem/MovableOnly
Nov  2 18:26:38 c2-kl-snamenode kernel: 127726 pages reserved
Nov  2 18:26:38 c2-kl-snamenode kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 1790]     0  1790    64980      103     132        0             0 systemd-journal
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 1811]     0  1811    47590      100      28        0             0 lvmetad
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 1829]     0  1829    11129      154      24        0         -1000 systemd-udevd
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3413]     0  3413    13880      118      27        0         -1000 auditd
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3445]   999  3445   153085     1682      62        0             0 polkitd
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3446]    32  3446    17316      135      37        0             0 rpcbind
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3452]     0  3452     5415      101      15        0             0 irqbalance
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3464]     0  3464     6657      173      17        0             0 systemd-logind
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3467]    81  3467    14554      156      35        0          -900 dbus-daemon
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3503]     0  3503    48775      119      34        0             0 gssproxy
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3544]     0  3544    31579      180      19        0             0 crond
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3559]     0  3559     6476       53      17        0             0 atd
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3582]     0  3582    27523       32      10        0             0 agetty
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3583]     0  3583    27523       33      10        0             0 agetty
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3785]     0  3785    25710      524      49        0             0 dhclient
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3846]     0  3846   143483     2796      97        0             0 tuned
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3854]     0  3854   232412    56050     382        0             0 rsyslogd
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3904]     0  3904   340102    19651     119        0             0 n9e-collector
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4177]     0  4177     5731       89      14        0             0 argusagent
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4179]     0  4179   308752     3450      60        0             0 /usr/local/clou
Nov  2 18:26:38 c2-kl-snamenode kernel: [12470]     0 12470    28216      256      58        0         -1000 sshd
Nov  2 18:26:38 c2-kl-snamenode kernel: [23550]   998 23550    29448      130      30        0             0 chronyd
Nov  2 18:26:38 c2-kl-snamenode kernel: [25161]     0 25161    56017      505     113        0             0 httpd
Nov  2 18:26:38 c2-kl-snamenode kernel: [17170]     0 17170    37325      897      29        0             0 python
Nov  2 18:26:38 c2-kl-snamenode kernel: [17174]     0 17174   717657    40285     223        0             0 python
Nov  2 18:26:38 c2-kl-snamenode kernel: [12170]  1016 12170   731784    66076     344        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [12326]  1009 12326   147520     2550      72        0             0 python2.7
Nov  2 18:26:38 c2-kl-snamenode kernel: [14099]  1007 14099  1354046    56626     295        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [14496]  1016 14496   714714    33040     173        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [  310]  1020   310    47777     1440      44        0             0 python
Nov  2 18:26:38 c2-kl-snamenode kernel: [  311]  1020   311  1904045    34660     179        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [21603]     0 21603   109262      189      28        0             0 AliSecGuard
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 5807]  1018  5807  2897189   254199    1008        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [11866]     0 11866     2687       55      11        0             0 jsvc
Nov  2 18:26:38 c2-kl-snamenode kernel: [11880]  1018 11880   697340    38125     159        0             0 jsvc
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 2797]  1018  2797   769655   102332     361        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [22546]  1009 22546    28845       86      13        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [22560]  1009 22560  1020455   500837    1235        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [22616]  1009 22616  1652841   221421     684        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [20278]  1010 20278  1039497   287501     657        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 6637]  1010  6637  1055965   118005     318        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [14342]     0 14342   201779     1185      13        0             0 aliyun-service
Nov  2 18:26:38 c2-kl-snamenode kernel: [14445]     0 14445     4451      123      13        0             0 assist_daemon
Nov  2 18:26:38 c2-kl-snamenode kernel: [17671]     0 17671    10482      394      20        0             0 AliYunDunUpdate
Nov  2 18:26:38 c2-kl-snamenode kernel: [23521]  1004 23521    28845       85      14        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [23535]  1004 23535  1431755   103152     391        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 1182]  1020  1182    28845       85      13        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 1196]  1020  1196   854319   146975     527        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [26591]     0 26591   127511      342      33        0             0 AliNet
Nov  2 18:26:38 c2-kl-snamenode kernel: [26746]     0 26746   578461     2993      79        0             0 AliHips
Nov  2 18:26:38 c2-kl-snamenode kernel: [27031]  1018 27031   754075    83566     375        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [  744]     0   744    44485    12650      88        0             0 AliYunDun
Nov  2 18:26:38 c2-kl-snamenode kernel: [26067]  1016 26067  1593210   610784    1391        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [32663]     0 32663   144907      557      42        0             0 AliDetect
Nov  2 18:26:38 c2-kl-snamenode kernel: [24142]  1004 24142    28845       87      13        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [24156]  1004 24156  1926525  1023090    2199        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [25057]    48 25057    56017      466     111        0             0 httpd
Nov  2 18:26:38 c2-kl-snamenode kernel: [25058]    48 25058    56017      466     111        0             0 httpd
Nov  2 18:26:38 c2-kl-snamenode kernel: [25059]    48 25059    56017      466     111        0             0 httpd
Nov  2 18:26:38 c2-kl-snamenode kernel: [25060]    48 25060    56017      466     111        0             0 httpd
Nov  2 18:26:38 c2-kl-snamenode kernel: [25061]    48 25061    56017      466     111        0             0 httpd
Nov  2 18:26:38 c2-kl-snamenode kernel: [13131]     0 13131   118125    13948     130        0             0 titanagent
Nov  2 18:26:38 c2-kl-snamenode kernel: [13133]     0 13133    21926      122      13        0             0 titan_monitor
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4393]     0  4393    57508     4449      67        0             0 python
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4564]     0  4564     3437       54      12        0             0 ambari-sudo.sh
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4565]     0  4565    28844       71      14        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4569]     0  4569    22553      145      49        0             0 su
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4573]     0  4573    28811       53      14        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4575]     0  4575    28844       88      13        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4598]     0  4598    52172     2653      56        0             0 python
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4607]  1014  4607    29343       67      12        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4657]     0  4657    45598      250      46        0             0 crond
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4658]     0  4658    45598      250      46        0             0 crond
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4674]  1014  4674    18754      186      40        0             0 curl
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4705]     0  4705     3437       54      12        0             0 ambari-sudo.sh
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4706]     0  4706     3437       54      12        0             0 ambari-sudo.sh
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4713]     0  4713    22048      135      49        0             0 su
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4714]     0  4714    22048      136      47        0             0 su
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4734]     0  4734     2365       23       8        0             0 sh
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4735]     0  4735     2365       22       9        0             0 sh
Nov  2 18:26:38 c2-kl-snamenode kernel: Out of memory: Kill process 24156 (java) score 252 or sacrifice child
Nov  2 18:26:38 c2-kl-snamenode kernel: Killed process 24156 (java) total-vm:7706100kB, anon-rss:4092360kB, file-rss:0kB, shmem-rss:0kB
Nov  2 18:26:39 c2-kl-snamenode systemd: Started Session 125873 of user root.
Nov  2 18:26:39 c2-kl-snamenode systemd: Started Session 125872 of user root.

  View HDP component timeline_ Start time of reader process

        From the above logs and alarms, we can see that HDP cluster timeline_ The reader component is kill ed by the system, but Ambari is configured with automatic restart, which will restart after hanging up.

4 basic concepts of OOM

Linux has a feature: OOM Killer, a protection mechanism, which is used to avoid serious problems when memory is insufficient and kill some irrelevant processes first, that is, when memory is seriously insufficient, the system will select a process and kill it to release memory and alleviate memory shortage. However, this protection is limited, Can not completely protect the operation of the process.

     In many cases, it is often seen that when there is still remaining memory, the oom killer still kills the process. The phenomenon is that the following information is in the / var/log/messages log file:

    Out of Memory: Killed process [PID] [process name].

     The problem is low memory exhaustion because the kernel uses low memory to track all memory allocations.

     When low memory is exhausted, no matter how much high memory is left, oom killer will kill the process to keep the system running normally.

     Under the 32-bit CPU, the addressing range is limited. The Linux kernel defines the following three areas:

### You can see from the log above
# DMA: 0x00000000 -  0x00999999 (0 - 16 MB) 

# LowMem: 0x01000000 - 0x037999999 (16 - 896 MB) - size: 880MB

# Highmem: 0x038000000 - < hardware specific > 

Data belonging to the process, such as Stacks, Heaps, etc. Can be further decomposed into

  • active_anon
  • inactive_anon
[root@hdp101 ~]# ps -ef | grep 10412
root      10412      1  0 06:41 ?        00:02:58 /usr/java/jdk1.8.0_162/bin/java -server -XX:NewRatio=3 -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 -Djava.io.tmpdir=/var/lib/smartsense/hst-server/tmp -Dlog.file.name=hst-server.log -Xms1024m -Xmx2048m -cp /etc/hst/conf:/usr/hdp/share/hst/hst-common/lib/* com.hortonworks.support.tools.server.SupportToolServer
root      12725  12592  0 20:25 pts/0    00:00:00 grep --color=auto 10412
[root@hdp101 ~]#
[root@hdp101 ~]#
[root@hdp101 ~]#
### When OOM is enabled, each process has OOM parameters that can be set
[root@hdp101 ~]# ll  /proc/10412/oom*
-rw-r--r-- 1 root root 0 Nov  6 20:26 /proc/10412/oom_adj
-r--r--r-- 1 root root 0 Nov  6 20:26 /proc/10412/oom_score
-rw-r--r-- 1 root root 0 Nov  6 20:26 /proc/10412/oom_score_adj
[root@hdp101 ~]#

In view of the shortage of server resources, we have to expand memory.

Tags: Linux Operation & Maintenance oom

Posted on Sat, 06 Nov 2021 11:17:01 -0400 by CaseyC1