Analysis of Hadoop YARN ResourceManager crash caused by data limit of ZooKeeper node

We ran into this problem again. It happens infrequently, but when it does, it crashes the ResourceManager service, causes an excessive number of watches to be registered in ZK, and triggers other problems. Leaving it unresolved had always been a nagging concern, so building on the previous two rounds of analysis and on reading the latest community code (Hadoop 3.2.1), we patched YARN in the production environment to finally solve it. Each encounter with a difficult problem brings a different understanding, so below I record how this one was analyzed and solved. The first two rounds of analysis were recorded in earlier write-ups.

Environment

  • Hadoop version: Apache Hadoop 2.6.3
  • ZooKeeper version: ZooKeeper 3.4.10
  • Two ResourceManager nodes: active RM01 and standby RM02

Cause of the problem

This problem is hard to reproduce, and the first two investigations did not find its cause. After applying the patch, the log showed that it is mainly caused by certain abnormal tasks. The log is as follows:

2020-04-28 10:05:54 INFO  org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:768 - Application update attemptState data size for /rmstore/ZKRMStateRoot/RMAppRoot/application_1587969707206_16259/appattempt_1587969707206_16259_000001 is 20266528. Exceed the maximum allowed 3145728 size. ApplicationAttemptState info: ApplicationAttemptState{attemptId=appattempt_1587969707206_16259_000001, diagnostics='User class threw exception: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 2.0 failed 4 times, most recent failure: Lost task 15.3 in stage 2.0 (TID 4224, java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.immutable.Set hset;
/* 009 */   private boolean hasNull;
/* 010 */   private UnsafeRow result;
/* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
/* 012 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;

When a task fails, YARN saves the task's diagnostic information. When there is a lot of it, the task state data that YARN writes to ZK exceeds ZK's limit. The log shows that the state data of the failed Spark task is 20266528 bytes, about 19MB, far more than the 3MB we set. On the YARN monitoring interface, the task shows 200000 lines of exception information.
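As a quick sanity check of the numbers above, the arithmetic can be written out as follows. This is only an illustrative sketch: the class and method names are hypothetical, and the per-line average assumes that almost all of the payload is diagnostics text, which the log suggests but does not prove.

```java
// Back-of-the-envelope check of the sizes quoted from the log.
// ZnodeSizeCheck and its methods are hypothetical helpers, not Hadoop code.
final class ZnodeSizeCheck {
    // 3MB limit, matching the 3145728 value printed in the log
    static final int LIMIT_BYTES = 3 * 1024 * 1024;

    // Convert a byte count to mebibytes
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    // True when the serialized state is larger than the allowed znode size
    static boolean exceedsLimit(long payloadBytes, long limitBytes) {
        return payloadBytes > limitBytes;
    }

    // Rough average line size, assuming the payload is mostly diagnostics text
    static double avgBytesPerLine(long payloadBytes, long lines) {
        return (double) payloadBytes / lines;
    }

    public static void main(String[] args) {
        long attemptStateBytes = 20_266_528L; // size reported in the log
        System.out.println("payload MiB: " + toMiB(attemptStateBytes));
        System.out.println("limit MiB: " + toMiB(LIMIT_BYTES));
        System.out.println("exceeds limit: " + exceedsLimit(attemptStateBytes, LIMIT_BYTES));
        System.out.println("approx bytes per diagnostics line: "
                + avgBytesPerLine(attemptStateBytes, 200_000L));
    }
}
```

The payload is more than six times the configured limit, and about 100 bytes per line is consistent with a long Spark codegen dump being stored as diagnostics.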


Thanks to the experience gained from the first two investigations and from reading the source code, solving the problem was much easier this time. The final fix last August was to set the jute.maxbuffer parameter of both the ZK server and the YARN client to 3MB, that is, to raise the maximum amount of data each ZNode can hold to 3MB. However, that approach has the following obvious disadvantages:

  • It increases the amount of data stored in ZK, which puts pressure on the ZK JVM's memory. In extreme cases ZK runs out of memory (OOM), and the efficiency of ZK reads and writes, data synchronization, and persistence also suffers
  • jute.maxbuffer is a hard-coded style of configuration. For it to take effect, both the ZK service and its client, the YARN RM service, must be restarted. The operation and maintenance cost for ZK and for the services that rely on it is relatively high. Because the ZK cluster used by our production YARN also manages the metadata of HBase and of stream computing tasks, a restart has a large impact

Although changing jute.maxbuffer does solve the problem, it affects the ZK service and the services that depend on it, and its operation and maintenance cost is relatively high. So, after tracking the community issue and reading the Hadoop 3.2.1 source code, we instead added the yarn.resourcemanager.zk-max-znode-size.bytes setting to yarn-site.xml to keep the data that YARN writes to ZK within ZK's limit. This setting was added in Hadoop 2.9.0. With it, the ZK server configuration does not need to change; we only modify the YARN configuration and restart YARN to limit the amount of data that YARN writes to ZK, which improves the availability of the ZK service. After the patch is applied, task state data that exceeds the limit is simply discarded, and a log message is printed so that the problem can be traced later. The main code of ZKRMStateStore after the patch is as follows (the rest is omitted for brevity):

public class ZKRMStateStore extends RMStateStore {
    private int zknodeLimit; // Maximum amount of data that a single ZNode may hold

    public synchronized void initInternal(Configuration conf) throws Exception {
        // The rest is omitted
        // Get the value of yarn.resourcemanager.zk-max-znode-size.bytes in yarn-site.xml
        zknodeLimit = conf.getInt(YarnConfiguration.RM_ZK_ZNODE_SIZE_LIMIT_BYTES,
                YarnConfiguration.DEFAULT_RM_ZK_ZNODE_SIZE_LIMIT_BYTES);
    }

    public synchronized void updateApplicationAttemptStateInternal(
            ApplicationAttemptId appAttemptId,
            ApplicationAttemptStateData attemptStateDataPB)
            throws Exception {
        String appIdStr = appAttemptId.getApplicationId().toString();
        String appAttemptIdStr = appAttemptId.toString();
        String appDirPath = getNodePath(rmAppRoot, appIdStr);
        String nodeUpdatePath = getNodePath(appDirPath, appAttemptIdStr);
        if (LOG.isDebugEnabled()) {
            LOG.debug("Storing final state info for attempt: " + appAttemptIdStr
                    + " at: " + nodeUpdatePath);
        }
        byte[] attemptStateData = attemptStateDataPB.getProto().toByteArray();

        ApplicationAttemptState attemptState = getApplicationAttemptState(appAttemptId, attemptStateDataPB);
        // Check whether the attempt data to be written exceeds zknodeLimit.
        // If it does not, update the attempt data in ZK.
        // Otherwise only log at INFO level and skip the update.
        if (attemptStateData.length <= zknodeLimit) {
            if (existsWithRetries(nodeUpdatePath, true) != null) {
                setDataWithRetries(nodeUpdatePath, attemptStateData, -1);
            } else {
                createWithRetries(nodeUpdatePath, attemptStateData, zkAcl,
                        CreateMode.PERSISTENT);
                LOG.debug(appAttemptId + " znode didn't exist. Created a new znode to"
                        + " update the application attempt state.");
            }
            LOG.info("Application update attemptState data size for " + nodeUpdatePath + " is "
                    + attemptStateData.length + ". The maximum allowed " + zknodeLimit + " size. ApplicationAttemptState info: " + attemptState.toString() + ". AppAttemptTokens length:" + attemptStateDataPB.getAppAttemptTokens().array().length
                    + ". See yarn.resourcemanager.zk-max-znode-size.bytes.");
        } else {
            LOG.info("Application update attemptState data size for " + nodeUpdatePath + " is "
                    + attemptStateData.length + ". Exceed the maximum allowed " + zknodeLimit + " size. ApplicationAttemptState info: " + attemptState.toString() + ". AppAttemptTokens length:" + attemptStateDataPB.getAppAttemptTokens().array().length
                    + ". See yarn.resourcemanager.zk-max-znode-size.bytes.");
        }
    }
    // Omit other codes
}
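The essence of the patched method is a size guard in front of the write. The following self-contained sketch isolates that idea. It is not the Hadoop code: SizeGuardedStore is a hypothetical class, a HashMap stands in for ZK, and the 8-byte limit is chosen only to keep the example small.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a state store; a HashMap plays the role of ZK.
final class SizeGuardedStore {
    private final Map<String, byte[]> znodes = new HashMap<>();
    private final int znodeLimit;   // maximum payload size per node, in bytes
    private int skippedWrites = 0;  // number of updates dropped for being too large

    SizeGuardedStore(int znodeLimit) {
        this.znodeLimit = znodeLimit;
    }

    // Returns true if the data was written, false if it was skipped for being too large.
    boolean update(String path, byte[] data) {
        if (data.length <= znodeLimit) {
            znodes.put(path, data);
            return true;
        }
        // Oversized payload: log it and keep whatever was stored before.
        skippedWrites++;
        System.err.println("Data size for " + path + " is " + data.length
                + ". Exceeds the maximum allowed " + znodeLimit + " size. Skipping update.");
        return false;
    }

    byte[] get(String path) {
        return znodes.get(path);
    }

    int skippedWrites() {
        return skippedWrites;
    }

    public static void main(String[] args) {
        SizeGuardedStore store = new SizeGuardedStore(8);
        System.out.println(store.update("/app/attempt_1", new byte[8])); // within the limit
        System.out.println(store.update("/app/attempt_1", new byte[9])); // over the limit
        System.out.println(store.skippedWrites());
    }
}
```

The trade-off of this design is that an oversized update is lost: the node keeps its previous contents (or is never created), which is acceptable here only because the oversized part is diagnostics text rather than data needed for recovery.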

Problem summary

1. YARN uses ZK for failure state recovery. Will this change affect normal task execution and state recovery?

No. After running the patch online for a period and checking the data monitored by zkdoctor, we found that the state data YARN stores in ZK for normal tasks generally does not exceed 512K. Only the exception information of certain abnormal tasks is particularly large, and this exception information is the root cause of the data written by YARN to ZK exceeding the limit.

YARN defines the shared state storage system as the abstract class RMStateStore, which stores the state information the ResourceManager needs to recover from failure. This information consists of basic data types only; there are no particularly complex types such as byte arrays. The ResourceManager does not save the resources allocated to each ApplicationMaster or the resource usage of each NodeManager, because these can be reconstructed through the corresponding heartbeat reporting mechanism. Therefore, the HA implementation of ResourceManager is very lightweight. The main classes related to task state are as follows:

(1) Application state information, ApplicationState:

  /**
   * State of an application
   * Task status information class
   */
  public static class ApplicationState {
    final ApplicationSubmissionContext context; // Task description content
    final long submitTime; // Task submission time
    final long startTime; // Task start time
    final String user; // User who submitted the task
    Map<ApplicationAttemptId, ApplicationAttemptState> attempts =
                  new HashMap<ApplicationAttemptId, ApplicationAttemptState>(); // Task attempt information
    // fields set when application completes.
    RMAppState state; // Task running status
    String diagnostics; // Task diagnostic (exception) information
    long finishTime; // Task completion time
    // Omit other codes
  }

(2) Attempt state information for each application, ApplicationAttemptState:

  /**
   * State of an application attempt
   * Task attempt status information class
   */
  public static class ApplicationAttemptState {
    final ApplicationAttemptId attemptId; // Task attempt ID
    final Container masterContainer; // Information of the container
    final Credentials appAttemptCredentials; // Security token
    long startTime = 0; // Start time
    long finishTime = 0; // End time
    // fields set when attempt completes
    RMAppAttemptState state; // Running state
    String finalTrackingUrl = "N/A"; // Address for viewing the task's run log
    String diagnostics; // Task diagnostic (exception) information
    int exitStatus = ContainerExitStatus.INVALID; // Task exit status
    FinalApplicationStatus amUnregisteredFinalStatus; // Task final status
    long memorySeconds; // Total memory resources consumed by the task
    long vcoreSeconds; // Total CPU resources consumed by the task
    // Omit other codes
  }

(3) Security token related information, RMDTSecretManagerState:

  /**
   * Security token information
   */
  public static class RMDTSecretManagerState {
    // DTIdentifier -> renewDate
    Map<RMDelegationTokenIdentifier, Long> delegationTokenState =
        new HashMap<RMDelegationTokenIdentifier, Long>(); // Delegation token state
    Set<DelegationKey> masterKeyState =
        new HashSet<DelegationKey>(); // Master key state
    int dtSequenceNumber = 0; // Sequence number
    // Omit other codes
  }

2. Why does YARN register so many watches in ZK when it fails?

YARN fails over when an exception occurs. After failing over to the standby node, the standby node calls the loadState method of RMStateStore to recover task state data. loadState calls the loadRMAppState method of ZKRMStateStore to read the task state data saved in ZK. Each call to ZK's getData method registers a watch on the task state node or the task attempt state node in order to monitor changes in task state. Because both kinds of node are persistent, they are not deleted when the ZK client connection fails, and because one task node has many attempt nodes (a one-to-many relationship), a large number of watches result. Here is the code for loading task state:

    private synchronized void loadRMAppState(RMState rmState) throws Exception {
        // Register a watch that fires when the /rmstore/ZKRMStateRoot/RMAppRoot node is deleted
        // or its children are created or deleted
        List<String> childNodes = getChildrenWithRetries(rmAppRoot, true);
        for (String childNodeName : childNodes) {
            String childNodePath = getNodePath(rmAppRoot, childNodeName);
            // Get the task node data and register a watch, which fires when the task node is deleted or its data is updated
            byte[] childData = getDataWithRetries(childNodePath, true);
            if (childNodeName.startsWith(ApplicationId.appIdStrPrefix)) {
                // application
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Loading application from znode: " + childNodeName);
                }
                ApplicationId appId = ConverterUtils.toApplicationId(childNodeName);
                ApplicationStateDataPBImpl appStateData =
                        new ApplicationStateDataPBImpl(
                                ApplicationStateDataProto.parseFrom(childData));
                // Get task data
                ApplicationState appState =
                        new ApplicationState(appStateData.getSubmitTime(),
                                appStateData.getStartTime(),
                                appStateData.getApplicationSubmissionContext(),
                                appStateData.getUser(),
                                appStateData.getState(),
                                appStateData.getDiagnostics(), appStateData.getFinishTime());
                if (!appId.equals(appState.context.getApplicationId())) {
                    throw new YarnRuntimeException("The child node name is different " +
                            "from the application id");
                }
                rmState.appState.put(appId, appState);
                // Get task attempt data
                loadApplicationAttemptState(appState, appId);
            } else {
                LOG.info("Unknown child node with name: " + childNodeName);
            }
        }
    }

Our production environment is configured to keep the state of 20000 tasks in ZK. When the problem occurred, monitoring showed that YARN had registered more than 100000 watches with ZK. ZK stores watch information in a HashMap (the key is the ZNode path and the value is the set of watches registered on that ZNode), so a large number of watches turns this HashMap into a large object in the JVM. The object stays on the ZK server and cannot be collected; only when YARN deletes or updates task state data does the number of entries in the HashMap shrink. This is a slow process, during which ZK is likely to respond slowly or even run out of memory because of JVM GC pressure. Last year, a YARN problem registered so many watches that ZK ran out of memory, which in turn caused exceptions in the HBase services that depend on ZK. Therefore, in addition to patching, we will migrate YARN to a dedicated ZK cluster that serves only YARN, to improve the availability of our basic big data services.
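The bookkeeping described above can be pictured with a small, self-contained sketch. It is not ZooKeeper's implementation: WatchTableSketch and its methods are hypothetical, sessions are represented by plain strings, and the figure of two attempts per application is an assumption for illustration. Only the path-to-watch-set HashMap shape and the 20000 stored applications come from the text.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model of a server-side watch table: znode path -> watchers on that path.
final class WatchTableSketch {
    private final Map<String, Set<String>> watchTable = new HashMap<>();

    // Register a watch for the given session on the given path
    void addWatch(String path, String sessionId) {
        watchTable.computeIfAbsent(path, p -> new HashSet<>()).add(sessionId);
    }

    // A data update or delete fires the watches on the path and removes them (one-shot)
    Set<String> triggerWatch(String path) {
        Set<String> fired = watchTable.remove(path);
        return fired == null ? new HashSet<>() : fired;
    }

    int watchedPaths() {
        return watchTable.size();
    }

    // Model a recovery pass that reads every application node and every attempt node
    // with the watch flag set (attemptsPerApp is an assumed value).
    static WatchTableSketch simulateRecovery(int apps, int attemptsPerApp) {
        WatchTableSketch table = new WatchTableSketch();
        String root = "/rmstore/ZKRMStateRoot/RMAppRoot";
        table.addWatch(root, "rm");
        for (int a = 0; a < apps; a++) {
            String appPath = root + "/application_" + a;
            table.addWatch(appPath, "rm");
            for (int i = 1; i <= attemptsPerApp; i++) {
                table.addWatch(appPath + "/appattempt_" + a + "_" + i, "rm");
            }
        }
        return table;
    }

    public static void main(String[] args) {
        WatchTableSketch table = simulateRecovery(20_000, 2);
        System.out.println("watched paths after recovery: " + table.watchedPaths());
    }
}
```

Under these assumptions a single recovery pass leaves 60001 watched paths in the table, and each entry disappears only when its node is next updated or deleted, which illustrates why the table drains slowly.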

According to our monitoring and statistics, under normal circumstances YARN registers few watches with ZK, basically only those on the state nodes of running tasks, so it does not put much pressure on ZK.

3. Why does a failed write of task state from YARN to ZK trigger a YARN failover?

Every method through which ZKRMStateStore interacts with ZK calls the runWithRetries method of the ZKRMStateStore.ZKAction class. Normally no retry is needed. If an exception occurs, the retry logic is triggered; by default, up to 1000 retries are performed. Once the retries are exhausted, the exception is thrown to the caller. Any of these ZK-facing methods may therefore throw an exception.
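The retry-then-throw behaviour can be sketched in a few lines. This is a simplified illustration, not the Hadoop implementation: RetrySketch and its parameters are hypothetical, and it retries on any exception, whereas a real client would normally retry only on errors that are known to be transient.

```java
import java.util.concurrent.Callable;

// Generic bounded retry loop: run the action, retry on failure, rethrow when exhausted.
final class RetrySketch {
    static <T> T runWithRetries(Callable<T> action, int maxAttempts, long retryIntervalMs)
            throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return action.call();
            } catch (Exception e) {
                attempt++;
                if (attempt >= maxAttempts) {
                    // Attempts exhausted: hand the last exception to the caller
                    throw e;
                }
                // Wait before the next attempt
                Thread.sleep(retryIntervalMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated store operation that fails twice and then succeeds
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated connection loss");
            }
            return "stored";
        }, 1000, 0L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A write that can never succeed, such as one that is over the size limit on every attempt, uses up all attempts and then surfaces as an exception to the caller.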

The exception is handled by the notifyStoreOperationFailed method of RMStateStore. The method is very simple and mainly makes the following decisions:

  • If HA is enabled for YARN, a failover is triggered
  • If HA is not enabled, check whether YARN's fail-fast feature is enabled; if it is, trigger an RMFatalEventType.STATE_STORE_OP_FAILED event to exit the process
  • If neither of the above conditions is met, a warning is logged

The code of this method is as follows:

  /**
   * This method informs the RM that a store operation has failed. The parameter is the exception that caused the failure
   * This method is called to notify the ResourceManager that the store
   * operation has failed.
   * @param failureCause the exception due to which the operation failed
   */
  protected void notifyStoreOperationFailed(Exception failureCause) {
    LOG.error("State store operation failed ", failureCause);
    // If HA is enabled, perform a failover
    if (HAUtil.isHAEnabled(getConfig())) {
      LOG.warn("State-store fenced ! Transitioning RM to standby");
      Thread standByTransitionThread =
          new Thread(new StandByTransitionThread());
      standByTransitionThread.setName("StandByTransitionThread Handler");
      standByTransitionThread.start();
    } else if (YarnConfiguration.shouldRMFailFast(getConfig())) { // If HA is not enabled, check whether fail-fast is enabled
      LOG.fatal("Fail RM now due to state-store error!");
      rmDispatcher.getEventHandler().handle(
          new RMFatalEvent(RMFatalEventType.STATE_STORE_OP_FAILED,
              failureCause));
    } else { // Otherwise, log a warning and skip the state-store error
      LOG.warn("Skip the state-store error.");
    }
  }

Tags: Big Data Hadoop Apache Spark SQL

Posted on Sun, 10 May 2020 10:38:53 -0400 by FireWhizzle