Elastic-Job Source Interpretation

Preface

The job scheduling tool currently used by our company is Elastic-Job, version 2.1.5. In March, a production incident was caused by a failover configuration. While troubleshooting it, I read through the source code, and I am taking this opportunity to gain a better understanding of Elastic-Job.

Overall architecture


Note: Pictures from https://github.com/elasticjob/elastic-job-lite

Summary

Elastic-Job is a distributed scheduling solution consisting of two independent subprojects, Elastic-Job-Lite and Elastic-Job-Cloud.
Elastic-Job-Lite is positioned as a lightweight, non-centralized solution that provides coordinated services for distributed tasks in the form of jar packages.
Elastic-Job-Cloud uses a Mesos + Docker solution to provide additional services such as resource governance, application distribution, and process isolation. (It is not discussed in this article.)

Elastic-Job core components: Quartz and ZooKeeper.

  • Quartz schedules the tasks on each machine, i.e. it determines when the sharded tasks on each machine execute
  • ZooKeeper serves as the distributed coordination center

Features

Elastic-Job-Lite provides:

  • Distributed scheduling coordination
  • Elastic scaling
  • Failover
  • Re-triggering of missed job executions
  • Job sharding consistency: each shard has exactly one execution instance in the distributed environment
  • Self-diagnosis and repair of problems caused by distributed instability
  • Parallel scheduling
  • Job lifecycle operations
  • Rich job types
  • Spring integration and namespace support
  • An operations and maintenance platform

Source Code Interpretation

Task Initialization

JobScheduler

public class JobScheduler {
    
    /**
     * Two fixed key values in the JobDetail#JobDataMap of Quartz
     */
    public static final String ELASTIC_JOB_DATA_MAP_KEY = "elasticJob";
    private static final String JOB_FACADE_DATA_MAP_KEY = "jobFacade";
 
     ...
    
    private JobScheduler(final CoordinatorRegistryCenter regCenter, final LiteJobConfiguration liteJobConfig, final JobEventBus jobEventBus, final ElasticJobListener... elasticJobListeners) {
        // Add Job Instance
        JobRegistry.getInstance().addJobInstance(liteJobConfig.getJobName(), new JobInstance());
        // Job Configuration
        this.liteJobConfig = liteJobConfig;
        // Registration Center
        this.regCenter = regCenter;
        List<ElasticJobListener> elasticJobListenerList = Arrays.asList(elasticJobListeners);
        // Provides pre- and post-execution extension points for distributed jobs, where users can run custom logic before the job starts and after it finishes
        setGuaranteeServiceForElasticJobListeners(regCenter, elasticJobListenerList);
        schedulerFacade = new SchedulerFacade(regCenter, liteJobConfig.getJobName(), elasticJobListenerList);
        // Job facade; each job has a jobFacade that encapsulates job start, failover, misfire re-execution, job events, and so on
        jobFacade = new LiteJobFacade(regCenter, liteJobConfig.getJobName(), Arrays.asList(elasticJobListeners), jobEventBus);
    }
    
    private void setGuaranteeServiceForElasticJobListeners(final CoordinatorRegistryCenter regCenter, final List<ElasticJobListener> elasticJobListeners) {
        GuaranteeService guaranteeService = new GuaranteeService(regCenter, liteJobConfig.getJobName());
        for (ElasticJobListener each : elasticJobListeners) {
            if (each instanceof AbstractDistributeOnceElasticJobListener) {
                ((AbstractDistributeOnceElasticJobListener) each).setGuaranteeService(guaranteeService);
            }
        }
    }
    
    /**
     * Initialize job.
     */
    public void init() {
        // Update Job Configuration to ZK
        LiteJobConfiguration liteJobConfigFromRegCenter = schedulerFacade.updateJobConfiguration(liteJobConfig);
        // Set the number of slices
        JobRegistry.getInstance().setCurrentShardingTotalCount(liteJobConfigFromRegCenter.getJobName(), liteJobConfigFromRegCenter.getTypeConfig().getCoreConfig().getShardingTotalCount());
        // Create Job Scheduling Controller
        JobScheduleController jobScheduleController = new JobScheduleController(
                createScheduler(), createJobDetail(liteJobConfigFromRegCenter.getTypeConfig().getJobClass()), liteJobConfigFromRegCenter.getJobName());
        // Locally associate job with jobScheduleController and the registry, while the zk registry creates a node named jobName
        JobRegistry.getInstance().registerJob(liteJobConfigFromRegCenter.getJobName(), jobScheduleController, regCenter);
        // Register Job Startup Information
        schedulerFacade.registerStartUpInfo(!liteJobConfigFromRegCenter.isDisabled());
        // Schedule Tasks
        jobScheduleController.scheduleJob(liteJobConfigFromRegCenter.getTypeConfig().getCoreConfig().getCron());
    }

    /**
     * Configure Quartz
     * This is where Elastic-Job and Quartz meet. The actual Quartz scheduling call is scheduler.scheduleJob(jobDetail, createTrigger(cron)).
     * When the trigger fires, Quartz invokes org.quartz.Job#execute(org.quartz.JobExecutionContext); Elastic-Job's implementation of that interface is LiteJob
     *
     */
    private JobDetail createJobDetail(final String jobClass) {
        // JobDetail is Quartz's interface describing a schedulable task
        JobDetail result = JobBuilder.newJob(LiteJob.class).withIdentity(liteJobConfig.getJobName()).build();
        // Note that the jobFacade stored here is the same object that LiteJob receives
        result.getJobDataMap().put(JOB_FACADE_DATA_MAP_KEY, jobFacade);
        Optional<ElasticJob> elasticJobInstance = createElasticJobInstance();
        if (elasticJobInstance.isPresent()) {
            result.getJobDataMap().put(ELASTIC_JOB_DATA_MAP_KEY, elasticJobInstance.get());
        } else if (!jobClass.equals(ScriptJob.class.getCanonicalName())) {
            try {
                // Note that the elasticJob stored here is the same object that LiteJob receives; it is used to determine the job type at dispatch time
                result.getJobDataMap().put(ELASTIC_JOB_DATA_MAP_KEY, Class.forName(jobClass).newInstance());
            } catch (final ReflectiveOperationException ex) {
                throw new JobConfigurationException("Elastic-Job: Job class '%s' can not initialize.", jobClass);
            }
        }
        return result;
    }
    
    ...

Task Trigger

Task triggering within a job instance is handled by Quartz, which fires at the time set by the cron expression. Quartz then invokes the Job implementation's LiteJob#execute method.

LiteJob

public final class LiteJob implements Job {


    /**
     * Note the two static keys defined in JobScheduler, whose values were stored in Quartz's JobDetail#JobDataMap during initialization.
     * When Quartz instantiates a LiteJob, it retrieves these two values from the JobDetail's JobDataMap and injects them via setters; see org.quartz.simpl.PropertySettingJobFactory#newJob(org.quartz.spi.TriggerFiredBundle, org.quartz.Scheduler)
     * public static final String ELASTIC_JOB_DATA_MAP_KEY = "elasticJob";
     * private static final String JOB_FACADE_DATA_MAP_KEY = "jobFacade";
     */
    @Setter
    private ElasticJob elasticJob;
    
    @Setter
    private JobFacade jobFacade;
    
    @Override
    public void execute(final JobExecutionContext context) throws JobExecutionException {
        // Entry to Quartz task scheduling
        JobExecutorFactory.getJobExecutor(elasticJob, jobFacade).execute();
    }
}
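The setter-based injection mentioned in the comment can be sketched in isolation. The following is a simplified stand-in for what Quartz's PropertySettingJobFactory does, not Quartz's actual code; the Target class mirrors LiteJob's @Setter fields and is purely illustrative:

```java
import java.lang.reflect.Method;
import java.util.Map;

// Simplified sketch of Quartz's PropertySettingJobFactory behavior: for each
// key in the JobDataMap, look for a matching setter on the freshly created
// Job instance and inject the value. Target is an illustrative stand-in.
public class SetterInjector {

    public static void inject(Object job, Map<String, ?> dataMap) {
        for (Map.Entry<String, ?> entry : dataMap.entrySet()) {
            // "elasticJob" -> "setElasticJob"
            String setter = "set" + Character.toUpperCase(entry.getKey().charAt(0))
                    + entry.getKey().substring(1);
            for (Method method : job.getClass().getMethods()) {
                if (method.getName().equals(setter) && method.getParameterCount() == 1) {
                    try {
                        method.invoke(job, entry.getValue());
                    } catch (ReflectiveOperationException ex) {
                        throw new IllegalStateException(ex);
                    }
                    break;
                }
            }
        }
    }

    // Illustrative target mirroring LiteJob's @Setter fields.
    public static class Target {
        private Object elasticJob;
        public void setElasticJob(Object elasticJob) { this.elasticJob = elasticJob; }
        public Object getElasticJob() { return elasticJob; }
    }
}
```

This is why LiteJob only needs the two @Setter fields: Quartz matches the JobDataMap keys "elasticJob" and "jobFacade" against the setters by name.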

AbstractElasticJobExecutor

public abstract class AbstractElasticJobExecutor {

    ... code omitted
    
    /**
     * Execute the job.
     */
    public final void execute() {
        try {
            jobFacade.checkJobExecutionEnvironment();
        } catch (final JobExecutionEnvironmentException cause) {
            jobExceptionHandler.handleException(jobName, cause);
        }
        ShardingContexts shardingContexts = jobFacade.getShardingContexts();
        if (shardingContexts.isAllowSendJobEvent()) {
            jobFacade.postJobStatusTraceEvent(shardingContexts.getTaskId(), JobStatusTraceEvent.State.TASK_STAGING, String.format("Job '%s' execute begin.", jobName));
        }
        // If the job is still running, mark all shards assigned to the current job instance as misfired
        if (jobFacade.misfireIfRunning(shardingContexts.getShardingItemParameters().keySet())) {
            if (shardingContexts.isAllowSendJobEvent()) {
                jobFacade.postJobStatusTraceEvent(shardingContexts.getTaskId(), JobStatusTraceEvent.State.TASK_FINISHED, String.format(
                        "Previous job '%s' - shardingItems '%s' is still running, misfired job will start after previous job completed.", jobName, 
                        shardingContexts.getShardingItemParameters().keySet()));
            }
            return;
        }
        try {
            // Job Pre-Extension
            jobFacade.beforeJobExecuted(shardingContexts);
            //CHECKSTYLE:OFF
        } catch (final Throwable cause) {
            //CHECKSTYLE:ON
            jobExceptionHandler.handleException(jobName, cause);
        }
        // 1. Execute the job
        execute(shardingContexts, JobExecutionEvent.ExecutionSource.NORMAL_TRIGGER);
        // 2. Re-execute misfired shards
        while (jobFacade.isExecuteMisfired(shardingContexts.getShardingItemParameters().keySet())) {
            jobFacade.clearMisfire(shardingContexts.getShardingItemParameters().keySet());
            execute(shardingContexts, JobExecutionEvent.ExecutionSource.MISFIRE);
        }
        // 3. Failover
        jobFacade.failoverIfNecessary();
        try {
            // Job Post Extension
            jobFacade.afterJobExecuted(shardingContexts);
            //CHECKSTYLE:OFF
        } catch (final Throwable cause) {
            //CHECKSTYLE:ON
            jobExceptionHandler.handleException(jobName, cause);
        }
    }
    
    private void execute(final ShardingContexts shardingContexts, final JobExecutionEvent.ExecutionSource executionSource) {
        if (shardingContexts.getShardingItemParameters().isEmpty()) {
            if (shardingContexts.isAllowSendJobEvent()) {
                jobFacade.postJobStatusTraceEvent(shardingContexts.getTaskId(), JobStatusTraceEvent.State.TASK_FINISHED, String.format("Sharding item for job '%s' is empty.", jobName));
            }
            return;
        }
        // Record Task Start
        jobFacade.registerJobBegin(shardingContexts);
        String taskId = shardingContexts.getTaskId();
        if (shardingContexts.isAllowSendJobEvent()) {
            jobFacade.postJobStatusTraceEvent(taskId, JobStatusTraceEvent.State.TASK_RUNNING, "");
        }
        try {
            // Execute Job
            process(shardingContexts, executionSource);
        } finally {
            // TODO: consider adding a job-failure status, and decide how to handle failure of the whole job loop
            jobFacade.registerJobCompleted(shardingContexts);
            if (itemErrorMessages.isEmpty()) {
                if (shardingContexts.isAllowSendJobEvent()) {
                    jobFacade.postJobStatusTraceEvent(taskId, JobStatusTraceEvent.State.TASK_FINISHED, "");
                }
            } else {
                if (shardingContexts.isAllowSendJobEvent()) {
                    jobFacade.postJobStatusTraceEvent(taskId, JobStatusTraceEvent.State.TASK_ERROR, itemErrorMessages.toString());
                }
            }
        }
    }
    
    private void process(final ShardingContexts shardingContexts, final JobExecutionEvent.ExecutionSource executionSource) {
        Collection<Integer> items = shardingContexts.getShardingItemParameters().keySet();
        // Single shard: the current job instance has only one shard assigned
        if (1 == items.size()) {
            int item = shardingContexts.getShardingItemParameters().keySet().iterator().next();
            JobExecutionEvent jobExecutionEvent =  new JobExecutionEvent(shardingContexts.getTaskId(), jobName, executionSource, item);
            process(shardingContexts, item, jobExecutionEvent);
            return;
        }
        final CountDownLatch latch = new CountDownLatch(items.size());
        // Multiple shards: one thread per shard
        for (final int each : items) {
            final JobExecutionEvent jobExecutionEvent = new JobExecutionEvent(shardingContexts.getTaskId(), jobName, executionSource, each);
            if (executorService.isShutdown()) {
                return;
            }
            executorService.submit(new Runnable() {
                
                @Override
                public void run() {
                    try {
                        process(shardingContexts, each, jobExecutionEvent);
                    } finally {
                        latch.countDown();
                    }
                }
            });
        }
        try {
            // Wait until every shard has finished
            latch.await();
        } catch (final InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }

    private void process(final ShardingContexts shardingContexts, final int item, final JobExecutionEvent startEvent) {
        if (shardingContexts.isAllowSendJobEvent()) {
            jobFacade.postJobExecutionEvent(startEvent);
        }
        log.trace("Job '{}' executing, item is: '{}'.", jobName, item);
        JobExecutionEvent completeEvent;
        try {
            // Single-shard processing, which ultimately invokes the user-defined logic
            process(new ShardingContext(shardingContexts, item));
            completeEvent = startEvent.executionSuccess();
            log.trace("Job '{}' executed, item is: '{}'.", jobName, item);
            if (shardingContexts.isAllowSendJobEvent()) {
                jobFacade.postJobExecutionEvent(completeEvent);
            }
            // CHECKSTYLE:OFF
        } catch (final Throwable cause) {
            // CHECKSTYLE:ON
            completeEvent = startEvent.executionFailure(cause);
            jobFacade.postJobExecutionEvent(completeEvent);
            itemErrorMessages.put(item, ExceptionUtil.transform(cause));
            jobExceptionHandler.handleException(jobName, cause);
        }
    }
    
    protected abstract void process(ShardingContext shardingContext);
}
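The multi-shard branch of process(...) above, one thread per shard synchronized with a CountDownLatch, can be reproduced as a self-contained sketch. The per-item work is replaced with a stand-in that just records the item; class and method names are illustrative:

```java
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Self-contained sketch of the fan-out in process(...): submit one task per
// sharding item and block on a CountDownLatch until every item is done.
public class ShardFanOut {

    public static Set<Integer> runAll(Collection<Integer> items) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        Set<Integer> processed = ConcurrentHashMap.newKeySet();
        CountDownLatch latch = new CountDownLatch(items.size());
        for (int each : items) {
            executor.submit(() -> {
                try {
                    processed.add(each); // stand-in for process(shardingContexts, each, event)
                } finally {
                    latch.countDown();   // always counted down, even if processing fails
                }
            });
        }
        try {
            latch.await();               // wait for all shards, as the original does
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        executor.shutdown();
        return processed;
    }
}
```

The countDown() in the finally block mirrors the original code: a failing shard must still release the latch, otherwise the trigger thread would block forever.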

SimpleJobExecutor

Take SimpleJobExecutor as an example. AbstractElasticJobExecutor has two other implementations, ScriptJobExecutor and DataflowJobExecutor, which follow the same pattern.

public final class SimpleJobExecutor extends AbstractElasticJobExecutor {
    
    private final SimpleJob simpleJob;
    
    public SimpleJobExecutor(final SimpleJob simpleJob, final JobFacade jobFacade) {
        super(jobFacade);
        this.simpleJob = simpleJob;
    }
    
    @Override
    protected void process(final ShardingContext shardingContext) {
        // User-defined job logic
        simpleJob.execute(shardingContext);
    }
}
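For context, a user job reached through SimpleJobExecutor#process might look like the sketch below. ShardingContextView is a simplified stand-in declared here so the snippet compiles without elastic-job on the classpath; the real types are SimpleJob and ShardingContext:

```java
// Sketch of user-defined job logic invoked by SimpleJobExecutor#process.
// ShardingContextView is an illustrative stand-in for ShardingContext.
public class MySimpleJob {

    public interface ShardingContextView {
        int getShardingItem();        // the shard assigned to this invocation
        int getShardingTotalCount();  // total number of shards for the job
    }

    // Mirrors SimpleJob#execute(ShardingContext): each instance handles only
    // its assigned shard, e.g. WHERE id % shardingTotalCount = shardingItem.
    public static String execute(ShardingContextView context) {
        return "processing shard " + context.getShardingItem()
                + " of " + context.getShardingTotalCount();
    }
}
```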

Sharding Strategy

/**
 * Sharding strategy based on an average-allocation algorithm.
 * 
 * <p>
 * If the shards cannot be divided evenly, the leftover shards are appended in turn to the servers with lower sequence numbers.
 * For example: 
 * 1. If there are three servers, divided into nine pieces, each server will be divided into 1=[0,1,2], 2=[3,4,5], and 3=[6,7,8].
 * 2. If there are three servers, divided into eight pieces, each server will be divided into 1=[0,1,6], 2=[2,3,7], and 3=[4,5].
 * 3. If there are three servers, divided into 10 pieces, each server will be divided into 1=[0,1,2,9], 2=[3,4,5], and 3=[6,7,8].
 * </p>
 * 
 * @author zhangliang
 */
public final class AverageAllocationJobShardingStrategy implements JobShardingStrategy {
    
    @Override
    public Map<JobInstance, List<Integer>> sharding(final List<JobInstance> jobInstances, final String jobName, final int shardingTotalCount) {
        if (jobInstances.isEmpty()) {
            return Collections.emptyMap();
        }
        // Evenly divisible part: distributed equally across machines
        Map<JobInstance, List<Integer>> result = shardingAliquot(jobInstances, shardingTotalCount);
        // Remainder: assigned one per machine starting from the first, until exhausted
        addAliquant(jobInstances, shardingTotalCount, result);
        return result;
    }
    
    private Map<JobInstance, List<Integer>> shardingAliquot(final List<JobInstance> shardingUnits, final int shardingTotalCount) {
        Map<JobInstance, List<Integer>> result = new LinkedHashMap<>(shardingTotalCount, 1);
        int itemCountPerSharding = shardingTotalCount / shardingUnits.size();
        int count = 0;
        for (JobInstance each : shardingUnits) {
            List<Integer> shardingItems = new ArrayList<>(itemCountPerSharding + 1);
            for (int i = count * itemCountPerSharding; i < (count + 1) * itemCountPerSharding; i++) {
                shardingItems.add(i);
            }
            result.put(each, shardingItems);
            count++;
        }
        return result;
    }
    
    private void addAliquant(final List<JobInstance> shardingUnits, final int shardingTotalCount, final Map<JobInstance, List<Integer>> shardingResults) {
        int aliquant = shardingTotalCount % shardingUnits.size();
        int count = 0;
        // Starting from the first machine, the leftover shards are added one per machine until they run out
        for (Map.Entry<JobInstance, List<Integer>> entry : shardingResults.entrySet()) {
            if (count < aliquant) {
                entry.getValue().add(shardingTotalCount / shardingUnits.size() * shardingUnits.size() + count);
            }
            count++;
        }
    }
}
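The strategy can be re-implemented as a standalone method to check the Javadoc examples above. This sketch uses plain String instance names instead of JobInstance objects for brevity:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone re-implementation of the average-allocation algorithm above,
// using String instance names instead of JobInstance for brevity.
public class AverageSharding {

    public static Map<String, List<Integer>> sharding(List<String> instances, int shardingTotalCount) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        if (instances.isEmpty()) {
            return result;
        }
        int itemCountPerSharding = shardingTotalCount / instances.size();
        int count = 0;
        // Evenly divisible part: each instance gets a contiguous run of items.
        for (String each : instances) {
            List<Integer> items = new ArrayList<>();
            for (int i = count * itemCountPerSharding; i < (count + 1) * itemCountPerSharding; i++) {
                items.add(i);
            }
            result.put(each, items);
            count++;
        }
        // Remainder: leftover items appended to the lower-numbered instances in turn.
        int aliquant = shardingTotalCount % instances.size();
        count = 0;
        for (Map.Entry<String, List<Integer>> entry : result.entrySet()) {
            if (count < aliquant) {
                entry.getValue().add(itemCountPerSharding * instances.size() + count);
            }
            count++;
        }
        return result;
    }
}
```

For three instances and eight shards this yields [0,1,6], [2,3,7], [4,5], matching example 2 in the Javadoc.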

Distributed

Elastic-Job uses ZooKeeper (zk) as its registry center. Job instance information, sharding information, configuration, and job running status are all stored as zk nodes. Elastic-Job coordinates distributed tasks through zk node-change events: node additions, updates, and removals are propagated in real time to every job instance in the cluster. Elastic-Job provides a number of listeners to handle these events, built on the base class AbstractJobListener, which implements Curator's TreeCacheListener.

TreeCacheListener

The node-change listener interface provided by the Curator client for ZooKeeper

/**
 * Listener for {@link TreeCache} changes
 */
public interface TreeCacheListener
{
    /**
     * Listen for zk event changes
     * Called when a change has occurred
     *
     * @param client the client
     * @param event  describes the change
     * @throws Exception errors
     */
    public void childEvent(CuratorFramework client, TreeCacheEvent event) throws Exception;
}

AbstractJobListener

The job listener base class provided by Elastic-Job

public abstract class AbstractJobListener implements TreeCacheListener {
    
    @Override
    public final void childEvent(final CuratorFramework client, final TreeCacheEvent event) throws Exception {
        ChildData childData = event.getData();
        if (null == childData) {
            return;
        }
        String path = childData.getPath();
        if (path.isEmpty()) {
            return;
        }
        dataChanged(path, event.getType(), null == childData.getData() ? "" : new String(childData.getData(), Charsets.UTF_8));
    }
    
    // Abstract methods, subclass listeners implemented on demand
    protected abstract void dataChanged(final String path, final Type eventType, final String data);
}

JobCrashedJobListener

Failover listener

class JobCrashedJobListener extends AbstractJobListener {
        
        @Override
        protected void dataChanged(final String path, final Type eventType, final String data) {
            // Conditions: 1) failover is enabled; 2) the event is a node removal, i.e. a server went offline; 3) the path is an instance path, i.e. under jobName/instances
            if (isFailoverEnabled() && Type.NODE_REMOVED == eventType && instanceNode.isInstancePath(path)) {
                // The path looks like jobName/instances/ip-@-@pid,
                // so jobInstanceId has the form ip-@-@pid
                String jobInstanceId = path.substring(instanceNode.getInstanceFullPath().length() + 1);
                // If jobInstanceId refers to the current instance itself, skip it
                if (jobInstanceId.equals(JobRegistry.getInstance().getJobInstance(jobName).getJobInstanceId())) {
                    return;
                }
                // Get the shards this instance had taken over via failover: the zk node jobName/sharding/<item>/failover records the id of the instance that took over the shard
                List<Integer> failoverItems = failoverService.getFailoverItems(jobInstanceId);
                if (!failoverItems.isEmpty()) {
                    // The crashed instance held shards it had itself taken over via failover
                    for (int each : failoverItems) {
                        // Record the shard under leader/failover/items
                        failoverService.setCrashedFailoverFlag(each);
                        failoverService.failoverIfNecessary();
                    }
                } else {
                    // If jobInstanceId has no failover shards, take its normally assigned shards and record them under leader/failover/items/<item> so they can be failed over
                    // At first glance this suggests the failover logic always runs when a server goes offline, but that is not the case:
                    // shardingService.getShardingItems(jobInstanceId) checks whether the server is still available and returns an empty shard collection if it is not
                    // However, when a server is only temporarily unavailable (for example during a memory dump), this check can go wrong, which is where our job started behaving abnormally
                    for (int each : shardingService.getShardingItems(jobInstanceId)) {
                        failoverService.setCrashedFailoverFlag(each);
                        failoverService.failoverIfNecessary();
                    }
                }
            }
        }
    }
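The jobInstanceId extraction from the removed node's path can be illustrated with a small self-contained helper. Names here are illustrative; the real code builds the prefix from instanceNode.getInstanceFullPath():

```java
// Illustrative sketch of the substring logic in JobCrashedJobListener: strip
// the jobName/instances prefix from the removed node's path to recover the
// jobInstanceId, which has the form ip-@-@pid.
public class InstancePathParser {

    public static String jobInstanceId(String jobName, String path) {
        String instanceFullPath = "/" + jobName + "/instances";
        if (!path.startsWith(instanceFullPath + "/")) {
            throw new IllegalArgumentException("not an instance path: " + path);
        }
        // skip the prefix plus the separating slash
        return path.substring(instanceFullPath.length() + 1);
    }
}
```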

FailoverSettingsChangedJobListener

Failover configuration change listener. When failover is switched off from the console, it cleans up the failover information; nothing needs to be done locally when failover is switched on.

class FailoverSettingsChangedJobListener extends AbstractJobListener {
        
        @Override
        protected void dataChanged(final String path, final Type eventType, final String data) {
            if (configNode.isConfigPath(path) && Type.NODE_UPDATED == eventType && !LiteJobConfigurationGsonFactory.fromJson(data).isFailover()) {
                failoverService.removeFailoverInfo();
            }
        }
    }

Other

  • JobRegistry: job registry, one per JVM; records the mapping between jobs and the registry center, job status, and job instances
  • SchedulerFacade: job scheduling facade class, one per job
  • JobNodeStorage: job node access
  • ShardingNode: zk node name construction rules
  • JobNodePath: job node path construction

Summary

This article walks through the Elastic-Job source code from the perspectives of task initialization, task triggering, sharding strategy, and distributed coordination. On the one hand it serves as my own notes; on the other, I hope it helps other developers quickly read and understand how Elastic-Job works.


Posted on Tue, 05 May 2020 22:57:05 -0400 by BETA