Memory overflow caused by Spark reading Snappy compressed files on HDFS

A set of files on HDFS grows every day and is currently stored with Snappy compression. Then one day the job that reads them suddenly started failing with OOM.

1. Reasons:

Because Snappy is not splittable, each file has to be read by a single task. After the file is read and decompressed, the data expands several times. When the files are large and parallelism is high (many such tasks decompressing at once in the same executor), this causes a lot of full GC and eventually OOM.
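
To make the "cannot split" point concrete, here is a small check, assuming the Hadoop client libraries are on the classpath and Snappy is in the default codec list (the file name is just an example): Hadoop's file input formats only split a compressed file when its codec implements SplittableCompressionCodec, which SnappyCodec does not, so every .snappy file becomes exactly one split and therefore one task.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittableCheck {
    public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        // The codec is resolved from the file extension
        CompressionCodec codec = factory.getCodec(new Path("part-00000.snappy"));
        // Prints false for Snappy (not splittable); true for e.g. a .bz2 file
        System.out.println(codec instanceof SplittableCompressionCodec);
    }
}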

Because this pipeline was built by predecessors, it is not easy to change. To get the job running again quickly, we simply changed the final repartition before writing to HDFS from 500 to 1000, increasing the number of files and reducing the amount of data per file (sketched below). That is not a long-term solution, though; the plan is to switch the compression codec to LZO, which becomes splittable once an index is built, but that has not been implemented yet and still needs the index and some planning. Moreover, Snappy is used for all the data that needs compression here, which is a headache.
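
A minimal sketch of that writer-side change, assuming the data is written as Snappy-compressed text from an RDD (the class and method names here are placeholders, not the actual job code): raise the repartition before the save from 500 to 1000 so each output file carries roughly half as much data.

import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.spark.api.java.JavaRDD;

public class WriteSmallerSnappyFiles {
    static void write(JavaRDD<String> records, String outputPath) {
        records
            // was repartition(500); more output files means less data per file,
            // so the single task that later reads each file has less to decompress
            .repartition(1000)
            .saveAsTextFile(outputPath, SnappyCodec.class);
    }
}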

Let me record the process of finding and fixing the problem. It is a long and painful one.

Files on HDFS

You can see that each file is almost 700MB, 500 files in total.
When Spark reads these files and runs a series of computations, it starts throwing OOM errors. Watching GC with jstat -gc <pid> 1000, you can see the job get stuck at a certain stage, keep doing full GCs, and then the OOMs begin.

1. First guess: data skew, because a stage/task always gets stuck during execution

Sampling the data showed that there is no skew and basically no duplicate keys:

rdd.sample(false, 0.1).countByKey().forEach((k, v) -> System.out.println(k + "---" + v));

2. Next guess: not enough memory, or badly tuned parameters. Start adjusting parameters

**Directions:** 1. add more memory; 2. use off-heap memory; 3. adjust JVM parameters; 4. adjust the ratio of storage (cache) to execution memory; 5. add cores and raise parallelism so each task handles less data; 6. increase buffers and adjust the number of partitions in the code; 7. rework the code to shrink the data earlier; and so on (a rough sketch of these knobs follows).
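
For reference, a hedged sketch of the kinds of SparkConf knobs behind directions 1 to 6, with illustrative values rather than the ones that ultimately mattered here:

import org.apache.spark.SparkConf;

public class TuningKnobs {
    static SparkConf sketch() {
        return new SparkConf()
                .set("spark.executor.memory", "20g")                    // 1. more heap per executor
                .set("spark.memory.offHeap.enabled", "true")            // 2. off-heap memory
                .set("spark.memory.offHeap.size", "5g")
                .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC") // 3. JVM flags
                .set("spark.memory.fraction", "0.6")                    // 4. unified memory share
                .set("spark.memory.storageFraction", "0.4")             //    storage vs. execution split
                .set("spark.executor.cores", "8")                       // 5. more cores / higher parallelism
                .set("spark.default.parallelism", "320")
                .set("spark.shuffle.file.buffer", "128k");              // 6. bigger shuffle buffers
    }
}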


Because there is a lot of code and a large number of stages, I could not tell from the Web UI which operator in the problematic stage was at fault. I kept assuming the error was thrown at the reduceByKey, which sent me in the wrong direction, so every adjustment I made was the wrong one.
The only gain was a better understanding of the tuning parameters.

1. At the beginning I thought execution memory was insufficient, so I set spark.memory.fraction to 0.8 and spark.memory.storageFraction to 0.2; no matter how much execution memory I gave it, it was useless.

--conf spark.memory.fraction=0.6 
--conf spark.memory.storageFraction=0.4 

2. Adjusting the code and filtering data earlier did not help either.

3. One thing did happen along the way, though I still did not suspect Snappy: I changed the number of partitions requested when reading the files and found that no matter what I set, I always got exactly 500. At the time I just wondered why. So stupid.
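
What I was seeing, roughly (the path and context names are placeholders): the minPartitions hint to textFile has no effect on non-splittable Snappy files, so the partition count stays pinned at the number of files.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionCountCheck {
    static void check(JavaSparkContext sc) {
        // Ask for 2000 splits...
        JavaRDD<String> lines = sc.textFile("hdfs:///data/snappy-dir", 2000);
        // ...but still get 500: one partition per non-splittable .snappy file
        System.out.println(lines.getNumPartitions());
    }
}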

4. Later, looking at task execution in the Web UI, I saw that the stuck tasks were in fact still reading data: the Input Size column kept growing, very slowly, until it reached about 700MB (the sizes shown here are from after my later change). And with many cores, i.e. high parallelism, I had 320 tasks running at once, each needing to read a 700MB file and decompress it, to roughly 3GB after Snappy decompression. Checking a task's error log, there was a line I cannot quote exactly, something like reading xxxx file 3G, spilling from memory to disk. Seeing a single task read that much, it suddenly clicked: Snappy files cannot be split.

So each file is read and decompressed in its entirety by a single task, which finally OOMs.

Parameter adjustment:

# -- Split across lines for readability
spark-submit --master spark://11.172.54.167:7077 \
--class $main --deploy-mode client --driver-memory 16g \
--executor-memory 20g \
--executor-cores 8 \
--total-executor-cores 320 \
--conf spark.memory.fraction=0.6 \
--conf spark.memory.storageFraction=0.4 \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=5g \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:-TieredCompilation -XX:G1HeapRegionSize=16m -XX:InitiatingHeapOccupancyPercent=55 -XX:SoftRefLRUPolicyMSPerMB=0 -XX:-UseCompressedClassPointers -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=256m -XX:ReservedCodeCacheSize=512m -XX:+UseCodeCacheFlushing -XX:ParallelGCThreads=20 -XX:ConcGCThreads=20 -Xms15g -Xmn4g -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
--jars $jars xxxx.jar $date1 $max $date2 >> log/$log_file


# Parameters set in the code
conf.set("spark.driver.maxResultSize", "8g");                                 // allow large results collected to the driver
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");  // use Kryo serialization
conf.registerKryoClasses(new Class[]{ImmutableBytesWritable.class, HyperLogLog.class, HashSet.class, RegisterSet.class, IllegalArgumentException.class, FileCommitProtocol.TaskCommitMessage.class});
//conf.set("spark.kryo.registrationRequired", "true"); // if enabled, any class the job uses but is not registered above throws an error
conf.set("spark.kryoserializer.buffer.mb", "10");     // Kryo buffer, in MB
conf.set("spark.shuffle.file.buffer", "128");         // shuffle write buffer, in KB (default 32)
conf.set("spark.reducer.maxSizeInFlight", "144");     // reduce-side fetch size, in MB (default 48)
conf.set("spark.shuffle.io.maxRetries", "50");        // retries for failed shuffle fetches
conf.set("spark.shuffle.io.retryWait", "5s");         // wait between retries
