Record an MR error: Container is running beyond physical memory limits



Background

Recently our project team launched a new requirement that involved initializing the full history of a table in Hive. For historical reasons, the ETL output files for this table (fixed-length compressed files) covering data before 2015 all carry the same ETL date, which is also the table's partition date. Because the data spans many years and the files are large and split across multiple dates, the data had to be re-partitioned in Hive according to its real business date. Since Spark jobs were already running on the production cluster, we decided to use MapReduce to parse the data and re-partition it into the formal table (the cluster runs CDH 5.14.0).


Error log

(Because the development/test environment is physically isolated from production, the original logs could not be retrieved; the logs below are equivalent ones taken from the Internet.)

......
18/05/15 09:36:59 INFO mapreduce.Job: Running job: job_1525923400969_0023
18/05/15 09:37:11 INFO mapreduce.Job: Job job_1525923400969_0023 running in uber mode : false
18/05/15 09:37:11 INFO mapreduce.Job:  map 0% reduce 0%
18/05/15 09:38:15 INFO mapreduce.Job: Task Id : attempt_1525923400969_0023_m_000000_0, Status : FAILED
Container [pid=27842,containerID=container_1525923400969_0023_01_000002] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 2.8 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1525923400969_0023_01_000002 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 28027 27842 27842 27842 (java) 10464 706 2861776896 266585 /usr/java/jdk1.8.0_111/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Djava.net.preferIPv4Stack=true -Xmx820m -Djava.io.tmpdir=/home/yarn/nm/usercache/root/appcache/application_1525923400969_0023/container_1525923400969_0023_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/yarn/container-logs/application_1525923400969_0023/container_1525923400969_0023_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 10.3.1.9 59062 attempt_1525923400969_0023_m_000000_0 2 
    |- 27842 27839 27842 27842 (bash) 3 3 115859456 361 /bin/bash -c /usr/java/jdk1.8.0_111/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN  -Djava.net.preferIPv4Stack=true -Xmx820m -Djava.io.tmpdir=/home/yarn/nm/usercache/root/appcache/application_1525923400969_0023/container_1525923400969_0023_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/yarn/container-logs/application_1525923400969_0023/container_1525923400969_0023_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 10.3.1.9 59062 attempt_1525923400969_0023_m_000000_0 2 1>/yarn/container-logs/application_1525923400969_0023/container_1525923400969_0023_01_000002/stdout 2>/yarn/container-logs/application_1525923400969_0023/container_1525923400969_0023_01_000002/stderr  

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

18/05/15 09:39:21 INFO mapreduce.Job: Task Id : attempt_1525923400969_0023_m_000000_1, Status : FAILED
Container [pid=32561,containerID=container_1525923400969_0023_01_000003] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 2.8 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1525923400969_0023_01_000003 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 32561 32559 32561 32561 (bash) 3 2 115859456 361 /bin/bash -c /usr/java/jdk1.8.0_111/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN  -Djava.net.preferIPv4Stack=true -Xmx820m -Djava.io.tmpdir=/home/yarn/nm/usercache/root/appcache/application_1525923400969_0023/container_1525923400969_0023_01_000003/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/yarn/container-logs/application_1525923400969_0023/container_1525923400969_0023_01_000003 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 10.3.1.9 59062 attempt_1525923400969_0023_m_000000_1 3 1>/yarn/container-logs/application_1525923400969_0023/container_1525923400969_0023_01_000003/stdout 2>/yarn/container-logs/application_1525923400969_0023/container_1525923400969_0023_01_000003/stderr  
    |- 32746 32561 32561 32561 (java) 10706 764 2856566784 270760 /usr/java/jdk1.8.0_111/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Djava.net.preferIPv4Stack=true -Xmx820m -Djava.io.tmpdir=/home/yarn/nm/usercache/root/appcache/application_1525923400969_0023/container_1525923400969_0023_01_000003/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/yarn/container-logs/application_1525923400969_0023/container_1525923400969_0023_01_000003 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 10.3.1.9 59062 attempt_1525923400969_0023_m_000000_1 3 

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

18/05/15 09:40:30 INFO mapreduce.Job: Task Id : attempt_1525923400969_0023_m_000000_2, Status : FAILED
Container [pid=89668,containerID=container_1525923400969_0023_01_000004] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 2.8 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1525923400969_0023_01_000004 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 89966 89668 89668 89668 (java) 10823 859 2856423424 267248 /usr/java/jdk1.8.0_111/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Djava.net.preferIPv4Stack=true -Xmx820m -Djava.io.tmpdir=/home/yarn/nm/usercache/root/appcache/application_1525923400969_0023/container_1525923400969_0023_01_000004/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/yarn/container-logs/application_1525923400969_0023/container_1525923400969_0023_01_000004 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 10.3.1.9 59062 attempt_1525923400969_0023_m_000000_2 4 
    |- 89668 89666 89668 89668 (bash) 3 3 115859456 361 /bin/bash -c /usr/java/jdk1.8.0_111/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN  -Djava.net.preferIPv4Stack=true -Xmx820m -Djava.io.tmpdir=/home/yarn/nm/usercache/root/appcache/application_1525923400969_0023/container_1525923400969_0023_01_000004/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/yarn/container-logs/application_1525923400969_0023/container_1525923400969_0023_01_000004 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 10.3.1.9 59062 attempt_1525923400969_0023_m_000000_2 4 1>/yarn/container-logs/application_1525923400969_0023/container_1525923400969_0023_01_000004/stdout 2>/yarn/container-logs/application_1525923400969_0023/container_1525923400969_0023_01_000004/stderr  

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

Observed behavior: the YARN job history monitoring shows that the MR task ran for a long time before the error was reported, causing the job to fail.


Error analysis

The key lines in the log:

Container [pid=27842,containerID=container_1525923400969_0023_01_000002] is running beyond physical memory limits. 
Current usage: 1.0 GB of 1 GB physical memory used; 2.8 GB of 2.1 GB virtual memory used. Killing container.

The log clearly indicates that the container exceeded its memory limits.

Container is running beyond physical memory

This error is caused by insufficient physical memory. In MapReduce, each map and reduce task has its own maximum memory allocation; when a task needs more memory than this limit, this error is reported.

running beyond virtual memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 2.8 GB of 2.1 GB virtual memory used. Killing container.

This part of the error comes from the way YARN computes the virtual memory limit. In the example above, the program requested 1 GB of physical memory; YARN multiplies this by a ratio (2.1 by default) to obtain the allowed virtual memory, here 2.1 GB. When the virtual memory actually used by the program exceeds this computed limit, the error above is reported.

Note what the four memory values in the message represent:

1.0 GB: physical memory actually used by the task
1 GB: the default value of mapreduce.map.memory.mb
2.8 GB: virtual memory actually used by the task
2.1 GB: mapreduce.map.memory.mb × yarn.nodemanager.vmem-pmem-ratio

Here yarn.nodemanager.vmem-pmem-ratio is the ratio of virtual memory to physical memory, set in yarn-site.xml; the default is 2.1.
Clearly the container used 2.8 GB of virtual memory but was only allowed 2.1 GB, so YARN killed the container.

The error above occurred in a map task, but the same error can of course occur in a reduce task; in that case the virtual memory limit is mapreduce.reduce.memory.mb × yarn.nodemanager.vmem-pmem-ratio.

Note:

Physical memory: the machine's RAM.
Virtual memory: logical memory carved out of disk space; the disk space used as virtual memory is called swap space (a strategy to compensate for insufficient physical memory).
When physical memory runs short, Linux falls back on the swap partition: the kernel writes temporarily unused memory pages out to swap, freeing physical memory for other purposes; when the original content is needed again, it is read back from swap into physical memory.


Solutions

1. If cluster resources allow, increase the number of map or reduce tasks so the load is spread more evenly
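One way to get more map tasks, for example, is to lower the maximum input split size so that each mapper processes less data. A minimal sketch (the 128 MB value is illustrative, not from the original job):

```xml
<!-- mapred-site.xml, or set per job: smaller splits mean more map tasks,
     so each mapper holds less data in memory. Value is illustrative. -->
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>134217728</value> <!-- 128 MB per split -->
</property>
```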

2. Disable the virtual memory check (not recommended)
Set yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml or in the program:

<!-- Physical memory check -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
  <description>Whether physical memory limits will be enforced for containers.</description>
</property>
<!-- Virtual memory check -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
  <description>Whether virtual memory limits will be enforced for containers.</description>
</property>

Besides the virtual memory limit, the physical memory limit may also be exceeded; similarly, the physical memory check can be disabled by setting yarn.nodemanager.pmem-check-enabled to false.
Personally, I don't think this approach is a good one: if the program has a problem such as a memory leak, disabling these checks may bring down the whole cluster.

3. Increase mapreduce.map.memory.mb or mapreduce.reduce.memory.mb (recommended)

In my opinion, this is the method that should be tried first. Because the virtual memory limit is computed from this value, raising it solves the virtual memory problem, and most of the time it also solves insufficient physical memory.
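A sketch of what this looks like in configuration. The 4096 MB values are illustrative; note also that when raising container memory it is common to raise the task JVM heap (mapreduce.*.java.opts) to roughly 80% of the container size, which is the same ratio as the -Xmx820m / 1 GB pairing visible in the logs above:

```xml
<!-- mapred-site.xml or per-job configuration; values are illustrative -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3276m</value> <!-- ~80% of the container size -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>
```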

4. Appropriately increase the size of yarn.nodemanager.vmem-pmem-ratio

This raises the virtual memory allowance corresponding to a given amount of physical memory, but the ratio should not be set to an extreme value.
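For example (the value 3.0 is illustrative; 2.1 is the default):

```xml
<!-- yarn-site.xml: allowed virtual memory = container physical memory x this ratio -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>3.0</value>
</property>
```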

Beyond tuning these parameters, we should also consider whether the program itself can be optimized to create fewer objects, and whether the data itself is skewed; problems of this kind should be solved at the program level first.

Final approach:
This job was a one-off initialization of a large amount of historical data, and the regular batch jobs on the cluster had never hit this problem, so we decided to set the parameters on the command line, adding the following before the re-partition step:

set mapreduce.map.memory.mb=10240;    -- map container memory
set mapreduce.reduce.memory.mb=10240; -- reduce container memory

Summary

The main causes of this incident were the following:
1. The production environment is completely isolated from development and test, i.e. no production data is accessible during development and testing;
2. The actual data distribution was unclear; only a little information was available from the relevant business departments;
3. The test environment had no comparably large dataset for testing, and the two memory parameters above had already been set to 8 GB, which is not small;
4. Other offline and real-time jobs were running in production at the same time (this needs to be considered in advance).

The data in this table in production was a complete black box: roughly 800 million rows, with no way to know in advance how the data was distributed across business dates. The job ran as one sequence: convert the files to a compressed format and load them into an HDFS directory -> parse the fixed-length HDFS files into a Hive temporary table -> re-partition by business date into the formal Hive table (we did not run it step by step at first). The memory problem occurred in the third stage.

In addition, the O&M colleagues responsible for our project team are, frankly, rigid most of the time and make life hard for us developers. We have asked them several times to help query the service logs, but they never do; after all, O&M happens to be the dominant department in the company.

Anyway, in the end it came down to my own lack of foresight: I knew the data was a black box and still pushed ahead. Those of us in this line of work do not always stick to the principle of keeping things simple and solid.

The above is just some venting; please don't take it too seriously!

Tags: Big Data hive mr

Posted on Sat, 04 Sep 2021 23:59:43 -0400 by mkoga