Spark: Lost executor on YARN

When the job runs, the following errors appear in the log:

15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 8 on myhost1.com: remote Rpc client disassociated
15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 6 on myhost2.com: remote Rpc client disassociated

The reason is that the YARN resources allocated to the job are not enough. Increase the resources requested for the job:

./spark-submit --class com.xyz.MySpark --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" --driver-java-options -XX:MaxPermSize=512m --driver-memory 3g --master yarn-client --executor-memory 2G --executor-cores 8 --num-executors 12  /home/myuser/myspark-1.0.jar
Change it to:

./spark-submit --class com.xyz.MySpark --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=1024M" --driver-java-options -XX:MaxPermSize=1024m --driver-memory 4g --master yarn-client --executor-memory 2G --executor-cores 8 --num-executors 15  /home/myuser/myspark-1.0.jar
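As a rough sanity check that the cluster can actually grant such a request (the overhead figure below is an assumption: in Spark 1.x it defaults to roughly max(384 MB, 10% of executor memory) unless spark.yarn.executor.memoryOverhead is set explicitly), the second command asks YARN for about:

15 executors x (2 GB heap + ~0.4 GB overhead) ≈ 36 GB of container memory

If the YARN queue cannot grant that much, containers get killed and the scheduler reports the executors as lost.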



1. Lost executor on YARN during ALS iterations

debasish83 Q:

During the 4th ALS iteration, I am noticing that one of the executors gets disconnected:

14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding 
SendingConnectionManagerId not found 

14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5 
disconnected, so removing it 

14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost executor 5 
on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated 

14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 12) 
Any idea if this is a bug related to Akka on YARN?

I am using master.

P.S. ALS: Alternating Least Squares.

Xiangrui Meng A:
We know that the container got killed by YARN because it used much more memory than it requested. But we haven't figured out the root cause yet.

debasish83 Q: 
I can reproduce it with YARN 1.0 or 1.1, so this should be a problem with the YARN version.
For me at least, it is possible to use standalone mode for now.
Sandy Ryza A:
The fix is to raise spark.yarn.executor.memoryOverhead until this goes away. This controls the buffer between the JVM heap size and the amount of memory requested from YARN (JVMs can take up memory beyond their heap size). You should also make sure that, in the YARN NodeManager configuration, yarn.nodemanager.vmem-check-enabled is set to false.
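A minimal sketch of applying this advice from the spark-submit command line (the 1024 MB value is just a starting point; keep raising it while executors are still being killed). The vmem check itself lives in the NodeManagers' yarn-site.xml (yarn.nodemanager.vmem-check-enabled), so changing it requires restarting the NodeManagers:

./spark-submit --master yarn-client \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --executor-memory 16G --num-executors 40 \
  <your application jar and arguments>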

debasish83 Q: 
I put spark.yarn.executor.memoryOverhead 1024 in spark-defaults.conf, but I don't see the property under Environment in the web UI. Does it need to be set in spark-env.sh instead?
Sandy Ryza A: 
The current approach is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to scale this automatically based on the amount of memory requested, but for now it is still just a heuristic.
debasish83 Q: 
If I use 40 executors with 16 GB of memory each to compute billions of ratings on a 100M (100 million) x 10M (10 million) matrix, what is a typical spark.yarn.executor.memoryOverhead?
Sandy Ryza A: 
I would expect 2 GB to be plenty, let alone 16 GB (unless ALS is using a bunch of off-heap memory?). You mentioned earlier that the property does not show up under Environment in the web UI. Are you sure it is taking effect? If it is set, it will appear there.
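For reference, the property can be set either in spark-defaults.conf (whitespace-separated, in the conf directory used by the spark-submit that launches the job) or passed directly on the command line; once it is actually picked up, it appears under Environment in the web UI:

# in spark-defaults.conf
spark.yarn.executor.memoryOverhead   1024

# or equivalently, on the spark-submit command line
--conf spark.yarn.executor.memoryOverhead=1024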

2. Getting error in Spark: Executor lost

Q: 
One master and two slaves, each with 32 GB of RAM, reading a CSV file of 18 million records (the first row is the header):

./spark-submit --master yarn --deploy-mode client --executor-memory 10g <path/to/.py file>

rdd = sc.textFile("<path/to/file>")
h = rdd.first()
header_rdd = rdd.map(lambda l: h in l)
data_rdd = rdd.subtract(header_rdd)
data_rdd.first()

Error message:

15/10/12 13:52:03 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
 ApplicationMaster has disassociated: 192.168.1.114:51058
15/10/12 13:52:03 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
 ApplicationMaster has disassociated: 192.168.1.114:51058
15/10/12 13:52:03 WARN remote.ReliableDeliverySupervisor: 
 Association with remote system [akka.tcp://sparkYarnAM@192.168.1.114:51058] has failed,
 address is now gated for [5000] ms. Reason: [Disassociated]
15/10/12 13:52:03 ERROR cluster.YarnScheduler: Lost executor 1 on hslave2:
 remote Rpc client disassociated
15/10/12 13:52:03 INFO scheduler.TaskSetManager: Re-queueing tasks for 1 from TaskSet 3.0
15/10/12 13:52:03 WARN remote.ReliableDeliverySupervisor:
 Association with remote system [akka.tcp://sparkExecutor@hslave2:58555] has failed, 
 address is now gated for [5000] ms. Reason: [Disassociated]
15/10/12 13:52:03 WARN scheduler.TaskSetManager:
 Lost task 6.6 in stage 3.0 (TID 208, hslave2): ExecutorLostFailure (executor 1 lost)

The error occurred when rdd.subtract() ran. I then changed the code, removing rdd.subtract() and replacing it with rdd.filter():

rdd = sc.textFile("<path/to/file>")
h = rdd.first()
data_rdd = rdd.filter(lambda l: h not in l)

The same error occurs.

A: 
This is not a Spark bug; it is most likely related to your Java, YARN, and Spark configuration. You can increase the Java memory, Akka's frame size and timeout settings, and so on.

spark-defaults.conf:
spark.master                       yarn-cluster
spark.yarn.historyServer.address   <your cluster url>
spark.eventLog.enabled             true
spark.eventLog.dir                 hdfs://<your history directory>
spark.driver.extraJavaOptions      -Xmx20480m -XX:MaxPermSize=2048m -XX:ReservedCodeCacheSize=2048m
spark.checkpointDir                hdfs://<your checkpoint directory>
yarn.log-aggregation-enable        true
spark.shuffle.service.enabled      true
spark.shuffle.service.port         7337
spark.shuffle.consolidateFiles     true
spark.sql.parquet.binaryAsString   true
spark.speculation                  false
spark.yarn.maxAppAttempts          1
spark.akka.askTimeout              1000
spark.akka.timeout                 1000
spark.akka.frameSize               1000
spark.rdd.compress true
spark.storage.memoryFraction 1
spark.core.connection.ack.wait.timeout 600
spark.driver.maxResultSize         0
spark.task.maxFailures             20
spark.shuffle.io.maxRetries        20

You may also want to control how many partitions your Spark program uses, for example by repartitioning the RDD (in PySpark, partitionBy with a custom partitioner only applies to key-value RDDs), so your code might look like this:

num_partitions = <your number of partitions>

rdd = sc.textFile("<path/to/file>").repartition(num_partitions)
h = rdd.first()
header_rdd = rdd.filter(lambda l: h in l)
data_rdd = rdd.subtract(header_rdd)
data_rdd.first()

Finally, you may need to tune your spark-submit command, adding parameters for the number of executors, executor memory, and driver memory:

./spark-submit \
--master yarn \
--deploy-mode client \
--num-executors 100 \
--driver-memory 20G \
--executor-memory 10g \
<path/to/.py file>

3. Optimization of Spark job runtime parameters

Generally, many Spark job problems are caused by insufficient system resources. Judging from the monitoring logs, these problems are caused by high memory usage, so we try to solve them by tuning configuration parameters.
1. --conf spark.akka.frameSize=100

This parameter controls the maximum size of communication messages in Spark (such as task outputs). The default is 10 MB; when processing big data, a task's output may exceed this limit, so a higher value needs to be set based on the actual data.
2. --conf spark.shuffle.manager=SORT

Spark's default shuffle uses hash mode. In hash mode, each shuffle generates M*R files (M: number of map tasks, R: number of reduce tasks). When the numbers of maps and reduces are large, this produces a huge number of files and a lot of memory overhead. To reduce system resource usage, sort mode can be used instead, which generates only M files, at the cost of a longer running time.
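For example, with M = 1,000 map tasks and R = 1,000 reduce tasks, hash-based shuffle writes on the order of M x R = 1,000,000 intermediate files, while sort-based shuffle writes roughly M = 1,000 sorted data files (plus their index files).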
3. --conf spark.yarn.executor.memoryOverhead=4096

This sets the executor's off-heap memory allowance. If the program uses a large amount of off-heap memory, this configuration should be increased.
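Putting the three settings together, a spark-submit invocation might look like the sketch below; the class name and jar path are taken from the example at the top of this post, and the memory and executor counts are placeholders that must be sized to your own cluster:

./spark-submit --class com.xyz.MySpark \
  --master yarn-client \
  --conf spark.akka.frameSize=100 \
  --conf spark.shuffle.manager=SORT \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --driver-memory 4g --executor-memory 2G --num-executors 15 \
  /home/myuser/myspark-1.0.jar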
