Elastic Job failover exception

Background

The company uses Elastic Job (version 2.1.5) as its distributed task scheduling tool. One job runs on two machines, A and B, with a sharding total count of 4: A owns shards 0 and 1, and B owns shards 2 and 3. The job fires at 23:30:00 every night and processes the data recorded on day T during the night of day T, so that it is ready for use on day T+1.
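
For reference, a job with this topology would be wired up roughly as follows. This is a minimal sketch assuming the standard Elastic-Job Lite 2.x API; the ZooKeeper address, namespace, and job class name are placeholders, not our production values.

import com.dangdang.ddframe.job.config.JobCoreConfiguration;
import com.dangdang.ddframe.job.config.simple.SimpleJobConfiguration;
import com.dangdang.ddframe.job.lite.api.JobScheduler;
import com.dangdang.ddframe.job.lite.config.LiteJobConfiguration;
import com.dangdang.ddframe.job.reg.base.CoordinatorRegistryCenter;
import com.dangdang.ddframe.job.reg.zookeeper.ZookeeperConfiguration;
import com.dangdang.ddframe.job.reg.zookeeper.ZookeeperRegistryCenter;

public class JobBootstrap {
    
    public static void main(final String[] args) {
        // Placeholder ZK address and namespace
        CoordinatorRegistryCenter regCenter = new ZookeeperRegistryCenter(new ZookeeperConfiguration("zk-host:2181", "namespace"));
        regCenter.init();
        // 4 shards in total, fired at 23:30:00 every night, failover enabled
        JobCoreConfiguration coreConfig = JobCoreConfiguration.newBuilder("jobname", "0 30 23 * * ?", 4)
                .failover(true)
                .build();
        // "com.example.MyJob" stands in for the real SimpleJob implementation
        SimpleJobConfiguration simpleJobConfig = new SimpleJobConfiguration(coreConfig, "com.example.MyJob");
        new JobScheduler(regCenter, LiteJobConfiguration.newBuilder(simpleJobConfig).build()).init();
    }
}
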
Then one day, at 22:24, machine A suddenly started executing shard 2, and after finishing shard 2 it went on to execute shard 3. Note that shards 2 and 3 belong to machine B. Two things were strange: the job ran outside its scheduled time, and it ran shards that are not its own, one after the other. Below are the task ids recorded in the log.

// Abnormal task id: the sharding item is 2, but the ip belongs to machine A
"taskId":"jobname@-@2@-@READY@-@A machine ip@-@4927"

// Normal task id: the sharding items are 2,3 and the ip belongs to machine B
"taskId":"jobname@-@2,3@-@READY@-@B machine ip@-@12384"

After analysis we concluded that the job had almost certainly gone through the failover logic. But why would a failover happen at all? The job was neither at its scheduled execution time nor in the middle of a run, so it should not have been possible.
Thinking back, at 22:23 a developer had taken a heap dump of machine B, one minute before machine A started executing. Maybe that was the problem: did the dump briefly disconnect B from the ZooKeeper registry, so that it was treated as a crashed server? With that question in mind we took another dump on B and quickly confirmed the guess. In the registry (see the node listing attached at the end of this post), shard 2 was marked as running, the instance actually executing it was machine A, and the failover node identified it clearly as a failover.

However, according to the official documentation, failover means that the crash of a server while the job is running does not trigger resharding; resharding only happens the next time the job starts. With failover enabled, other job servers that are idle during this execution can detect the crash and grab the unfinished orphan sharding items for execution.
So a crash should not cause resharding, and we had confirmed that our job was not executing at the time, which made the behaviour even stranger. With these questions we dug into the Elastic Job source code.

Failover

The entry point of the failover logic is FailoverListenerManager#JobCrashedJobListener; the actual work is done by FailoverService. When a server goes offline, a Type.NODE_REMOVED event is fired in the registry and Elastic Job reacts to it. The relevant steps are commented in the code below.

public final class FailoverListenerManager extends AbstractListenerManager {
    
    // ...
    private final FailoverService failoverService;
    private final ShardingService shardingService;
    
    class JobCrashedJobListener extends AbstractJobListener {
        
        @Override
        protected void dataChanged(final String path, final Type eventType, final String data) {
            // 1. Failover enabled, 2. Registry event - node removed, i.e. a server is offline, 3. instance path, i.e. jobName/instances path
            if (isFailoverEnabled() && Type.NODE_REMOVED == eventType && instanceNode.isInstancePath(path)) {
                // path looks like jobName/instances/ip@-@pid
                // so the extracted jobInstanceId has the form ip@-@pid
                String jobInstanceId = path.substring(instanceNode.getInstanceFullPath().length() + 1);
                // If the jobInstanceId is the same as the current machine, skip directly
                if (jobInstanceId.equals(JobRegistry.getInstance().getJobInstance(jobName).getJobInstanceId())) {
                    return;
                }
                // Get the sharding items that had already been failed over to the crashed instance, i.e. items whose jobName/sharding/{item}/failover node stores this jobInstanceId
                List<Integer> failoverItems = failoverService.getFailoverItems(jobInstanceId);
                if (!failoverItems.isEmpty()) {
                    // If there are failover shards for jobInstanceId
                    for (int each : failoverItems) {
                        // Store shards in the directory leader/failover/items
                        failoverService.setCrashedFailoverFlag(each);
                        failoverService.failoverIfNecessary();
                    }
                } else {
                    // The crashed instance has no failed-over items: take the sharding items assigned to it and record them under leader/failover/items/{item} so they can be failed over
                    // At first glance this would mean that any server going offline triggers failover, but that is not the case:
                    // shardingService.getShardingItems(jobInstanceId) checks whether the server is still available, and if it is not, it returns an empty list
                    // However, a server made only briefly unavailable by a heap dump can recover in the meantime, and that is exactly where our abnormal execution came from
                    for (int each : shardingService.getShardingItems(jobInstanceId)) {
                         // Store shards in the directory leader/failover/items
                        failoverService.setCrashedFailoverFlag(each);
                        failoverService.failoverIfNecessary();
                    }
                }
            }
        }
    }
    
    // ...
}
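
setCrashedFailoverFlag itself is not shown above; paraphrased from memory of the 2.x source (so treat the details as approximate), it does little more than record the crashed item under leader/failover/items:

    // Inside FailoverService (paraphrased): record a crashed sharding item so that
    // needFailover() will start returning true for it
    public void setCrashedFailoverFlag(final int item) {
        // Skip items that already have a sharding/{item}/failover node assigned
        if (!isFailoverAssigned(item)) {
            // Creates leader/failover/items/{item}
            jobNodeStorage.createJobNodeIfNeeded(FailoverNode.getItemsNode(item));
        }
    }
    
    private boolean isFailoverAssigned(final Integer item) {
        return jobNodeStorage.isJobNodeExisted(FailoverNode.getExecutionFailoverNode(item));
    }
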
public final class FailoverService {
    
    /**
     * If a failover is required, a job failover is performed
     */
    public void failoverIfNecessary() {
        if (needFailover()) {
            jobNodeStorage.executeInLeader(FailoverNode.LATCH, new FailoverLeaderExecutionCallback());
        }
    }
    
    // Checks that there are child nodes under leader/failover/items and that the job is not currently running on this instance
    // FailoverService.setCrashedFailoverFlag(item) creates a node under leader/failover/items, so once it has been called, needFailover() returns true (as long as the job is not running locally)
    private boolean needFailover() {
        return jobNodeStorage.isJobNodeExisted(FailoverNode.ITEMS_ROOT) && !jobNodeStorage.getJobNodeChildrenKeys(FailoverNode.ITEMS_ROOT).isEmpty()
                && !JobRegistry.getInstance().isJobRunning(jobName);
    }
    
    /**
     * Gets the failover sharding items of the given job instance.
     * 
     * @param jobInstanceId job instance id
     * @return failover sharding items of the job instance
     */
    public List<Integer> getFailoverItems(final String jobInstanceId) {
        // All sharding item nodes under jobName/sharding
        List<String> items = jobNodeStorage.getJobNodeChildrenKeys(ShardingNode.ROOT);
        List<Integer> result = new ArrayList<>(items.size());
        for (String each : items) {
            int item = Integer.parseInt(each);
            // The node jobName/sharding/{item}/failover
            String node = FailoverNode.getExecutionFailoverNode(item);
            // If that node exists and stores the crashed jobInstanceId, add the item to the failover set
            if (jobNodeStorage.isJobNodeExisted(node) && jobInstanceId.equals(jobNodeStorage.getJobNodeDataDirectly(node))) {
                result.add(item);
            }
        }
        Collections.sort(result);
        return result;
    }
    
    class FailoverLeaderExecutionCallback implements LeaderExecutionCallback {
        
        @Override
        public void execute() {
            // Skip if the job has been shut down on this instance or failover is no longer needed
            if (JobRegistry.getInstance().isShutdown(jobName) || !needFailover()) {
                return;
            }
            // Take the first crashed sharding item recorded under leader/failover/items
            int crashedItem = Integer.parseInt(jobNodeStorage.getJobNodeChildrenKeys(FailoverNode.ITEMS_ROOT).get(0));
            log.debug("Failover job '{}' begin, crashed item '{}'", jobName, crashedItem);
            // Create an ephemeral node jobName/sharding/{item}/failover holding this instance id, marking that this instance has taken the item over
            jobNodeStorage.fillEphemeralJobNode(FailoverNode.getExecutionFailoverNode(crashedItem), JobRegistry.getInstance().getJobInstance(jobName).getJobInstanceId());
            // Remove the crashed item record from leader/failover/items
            jobNodeStorage.removeJobNodeIfExisted(FailoverNode.getItemsNode(crashedItem));
            // TODO should not use triggerJob, but use executor to schedule uniformly
            // Trigger the job immediately to execute the failed-over item
            JobScheduleController jobScheduleController = JobRegistry.getInstance().getJobScheduleController(jobName);
            if (null != jobScheduleController) {
                jobScheduleController.triggerJob();
            }
        }
    }
}
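
This also explains why machine A ended up running B's shards instead of its own 0 and 1: when triggerJob() fires, the job facade prefers local failover items over the instance's normally assigned ones. A paraphrased sketch of that lookup (class and method names recalled from the 2.x source, so details may differ slightly):

    // Inside LiteJobFacade (paraphrased): when failover is enabled and this instance
    // currently holds failover items, execute those instead of its own sharding items
    public ShardingContexts getShardingContexts() {
        boolean isFailover = configService.load(true).isFailover();
        if (isFailover) {
            // Items whose sharding/{item}/failover node stores this instance's id
            List<Integer> failoverShardingItems = failoverService.getLocalFailoverItems();
            if (!failoverShardingItems.isEmpty()) {
                return executionContextService.getJobShardingContext(failoverShardingItems);
            }
        }
        // Otherwise fall back to the sharding items assigned to this instance
        shardingService.shardingIfNecessary();
        return executionContextService.getJobShardingContext(shardingService.getLocalShardingItems());
    }
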
public final class ShardingService {
    
    /**
     * Sets the flag that resharding is needed
     */
    public void setReshardingFlag() {
        jobNodeStorage.createJobNodeIfNeeded(ShardingNode.NECESSARY);
    }
   
    
    /**
     * Gets the sharding items assigned to the given job instance.
     *
     * @param jobInstanceId job instance id
     * @return sharding items of the job instance
     */
    public List<Integer> getShardingItems(final String jobInstanceId) {
        JobInstance jobInstance = new JobInstance(jobInstanceId);
        // The server counts as available only if its ip appears under both jobName/servers and jobName/instances; otherwise return an empty list
        if (!serverService.isAvailableServer(jobInstance.getIp())) {
            return Collections.emptyList();
        }
        List<Integer> result = new LinkedList<>();
        // Total sharding count from the job configuration
        int shardingTotalCount = configService.load(true).getTypeConfig().getCoreConfig().getShardingTotalCount();
        for (int i = 0; i < shardingTotalCount; i++) {
            // Collect the items whose sharding/{item}/instance node stores this jobInstanceId
            if (jobInstance.getJobInstanceId().equals(jobNodeStorage.getJobNodeData(ShardingNode.getInstanceNode(i)))) {
                result.add(i);
            }
        }
        return result;
    }
    
}
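
The availability check that the whole incident hinges on lives in ServerService. Paraphrased from memory of the 2.x source (details approximate), a server counts as available when it is enabled and still has at least one instance registered under jobName/instances, which is exactly the ephemeral node that disappears and reappears around a heap dump pause:

    // Inside ServerService (paraphrased): a server is "available" when it is not disabled
    // and at least one of its instances is registered under jobName/instances
    public boolean isAvailableServer(final String ip) {
        return isEnableServer(ip) && hasOnlineInstances(ip);
    }
    
    private boolean hasOnlineInstances(final String ip) {
        // instances/{ip}@-@{pid} nodes are ephemeral and come back as soon as the zk session recovers
        for (String each : jobNodeStorage.getJobNodeChildrenKeys(InstanceNode.ROOT)) {
            if (each.startsWith(ip)) {
                return true;
            }
        }
        return false;
    }

Once B's instance node reappeared after the dump pause, getShardingItems(jobInstanceId) started returning B's items 2 and 3 again, and A failed them over.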

Conclusion

The heap dump made server B unavailable for a short time and disconnected it from the registry, which triggered a node-removed event in ZooKeeper. Server A received the event and started the failover logic. By the time server A queried the sharding items assigned to server B, B had already recovered, so A obtained B's two shards, 2 and 3, and executed the failover logic for each of them in turn. That is why, after the dump on B, A started executing B's two shards.

class JobCrashedJobListener extends AbstractJobListener {
        
        @Override
        protected void dataChanged(final String path, final Type eventType, final String data) {
            // 1. Failover enabled, 2. Registry event - node removed, i.e. a server is offline, 3. instance path, i.e. jobName/instances path
            if (isFailoverEnabled() && Type.NODE_REMOVED == eventType && instanceNode.isInstancePath(path)) {
                ...
                List<Integer> failoverItems = failoverService.getFailoverItems(jobInstanceId);
                if (!failoverItems.isEmpty()) {
                    ... 
                } else {
                    // The crashed instance has no failed-over items: take the sharding items assigned to it and record them under leader/failover/items/{item} so they can be failed over
                    // At first glance this would mean that any server going offline triggers failover, but that is not the case:
                    // shardingService.getShardingItems(jobInstanceId) checks whether the server is still available, and if it is not, it returns an empty list
                    // However, a server made only briefly unavailable by a heap dump can recover in the meantime, and that is exactly where our abnormal execution came from
                    for (int each : shardingService.getShardingItems(jobInstanceId)) {
                        failoverService.setCrashedFailoverFlag(each);
                        failoverService.failoverIfNecessary();
                    }
                }
            }
        }
    }

Appendix: the job's nodes in the ZooKeeper registry

  • namespace/jobname

    • leader

      • failover

        • items

          • 2: a sharding item to be failed over; its presence is also the flag that a failover is needed
          • 3: a sharding item to be failed over; its presence is also the flag that a failover is needed
      • sharding

        • necessary: flag indicating that resharding is needed
      • election

        • host
        • latch
    • servers

      • 172.16.101.112

        • processSuccessCount
        • hostName
        • processFailureCount
        • status
        • disabled
        • sharding
      • 172.16.101.52

        • processSuccessCount
        • hostName
        • processFailureCount
        • status
        • disabled
        • sharding
    • config

      • cron
      • shardingItemParameters
      • failover
      • processCountIntervalSeconds
      • monitorExecution
      • shardingTotalCount
      • jobParameter
      • fetchDataCount
      • concurrentDataProcessThreadCount
    • instances

      • [172.16.101.112@-@9644, 172.16.101.52@-@10138]
    • sharding

      • 0

        • running: flag indicating the item is currently executing
        • instance: the job instance the item is assigned to
        • failover: the instance that has taken the item over via failover
      • 1
      • 2
      • 3
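
The sharding/{item}/failover nodes in the listing above can also be read directly with Apache Curator (which Elastic Job uses internally); a minimal sketch with a placeholder ZooKeeper address and the namespace and job name from above:

import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class FailoverNodeCheck {
    
    public static void main(final String[] args) throws Exception {
        // Placeholder connection string; the Curator namespace is the Elastic Job namespace
        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString("zk-host:2181")
                .retryPolicy(new ExponentialBackoffRetry(1000, 3))
                .namespace("namespace")
                .build();
        client.start();
        // The node stores the jobInstanceId (ip@-@pid) of the instance that took the shard over
        byte[] data = client.getData().forPath("/jobname/sharding/2/failover");
        System.out.println("shard 2 failed over to: " + new String(data, StandardCharsets.UTF_8));
        client.close();
    }
}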

Important classes involved

JobRegistry: job registry; a singleton per JVM; records the mapping between jobs and the registry center, job status, and job instances
SchedulerFacade: job scheduling facade class; one per job
JobNodeStorage: job node storage and access

ShardingNode: builds the names of sharding-related zk nodes
JobNodePath: builds full job node paths, e.g.

/jobname/sharding

Registry Center
RegistryCenter
CoordinatorRegistryCenter
ZookeeperRegistryCenter

Event listeners
AbstractListenerManager
ShutdownListenerManager

Tags: Java sharding JVM
