Azkaban task scheduling tool


Azkaban brief introduction

Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows.



  • Compatible with any version of Hadoop
  • Easy to use Web UI
  • Simple Web and http workflow upload
  • Project workspace
  • Scheduling workflow
  • Modular and pluggable
  • Authentication and authorization
  • Track user actions
  • Email alerts about failures and successes
  • SLA alert and auto kill
  • Retry failed jobs

The list above comes from the official introduction; see the official website for details.

Why a workflow scheduling system

1. A complete data analysis system is usually composed of a large number of task units: shell scripts, Java programs, MapReduce jobs, Hive scripts, and so on
2. There are time-ordering and dependency relationships between these task units
3. To organize such a complex execution plan well, a workflow scheduling system is needed to drive execution

For example, suppose a business system generates 20 GB of raw data every day that must be processed daily. The processing steps are as follows:
1. Synchronize the raw data to HDFS;
2. Transform the raw data with the MapReduce framework and store the results in several Hive tables, organized as partitioned tables;
3. JOIN the data of these Hive tables to produce one large detail-level Hive table;
4. Run various statistical analyses on the detail table to produce report data;
5. Synchronize the resulting report data back to the business system for business use.
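The steps above map naturally onto an Azkaban flow: each step becomes a `.job` file, chained with `dependencies`. A minimal sketch (the script and SQL file names below are hypothetical placeholders):

```shell
# Each step of the pipeline is a .job file; `dependencies` wires the order.
mkdir -p pipeline && cd pipeline

cat > ingest.job <<'EOF'
type=command
command=sh ingest_to_hdfs.sh
EOF

cat > etl.job <<'EOF'
type=command
dependencies=ingest
command=sh run_mapreduce_etl.sh
EOF

cat > join.job <<'EOF'
type=command
dependencies=etl
command=hive -f join_detail_tables.sql
EOF

cat > report.job <<'EOF'
type=command
dependencies=join
command=hive -f build_reports.sql
EOF
# Zip the .job files together and upload the archive through the Azkaban web UI.
```

Azkaban infers the flow shape from the dependency graph, so no separate flow definition file is needed.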

Implementation of workflow scheduling

  • Simple task scheduling: define jobs directly with Linux crontab;
  • Complex task scheduling: develop a scheduling platform in-house, or use a ready-made open-source scheduler such as Oozie, Azkaban, or Airflow.

In the Hadoop ecosystem, common workflow schedulers include Oozie, Azkaban, Cascading, and Hamake.

Comparison of characteristics of various scheduling tools

The following table compares the key features of the four Hadoop workflow schedulers above. Although they address largely the same scenarios, they differ significantly in design philosophy, target users, and application scenarios, which can serve as a reference for technology selection:

| Feature | Hamake | Oozie | Azkaban | Cascading |
|---|---|---|---|---|
| Workflow description language | XML | XML (xPDL based) | text file with key/value pairs | Java API |
| Dependency mechanism | data-driven | explicit | explicit | explicit |
| Requires a web container | No | Yes | Yes | No |
| Progress tracking | console/log messages | web page | web page | Java API |
| Hadoop job scheduling support | no | yes | yes | yes |
| Operation mode | command line utility | daemon | daemon | API |
| Pig support | yes | yes | yes | yes |
| Event notification | no | no | no | yes |
| Installation required | no | yes | yes | no |
| Supported Hadoop versions | 0.18+ | 0.20+ | currently unknown | 0.18+ |
| Retry support | no | workflow node level | yes | yes |
| Run any command | yes | yes | yes | yes |
| Amazon EMR support | yes | no | currently unknown | yes |

Comparison between Azkaban and Oozie

These are the two most popular schedulers on the market. Overall, Oozie is a heavyweight task scheduling system compared with Azkaban: its functionality is more comprehensive, but it is more complex to configure and use. If you can live without some of those functions, the lightweight scheduler Azkaban is a good candidate.

  • Functionality
    Both can schedule workflows made of MapReduce, Pig, Java, and shell script tasks
    Both can execute workflow tasks on a schedule

  • Workflow definition
    Azkaban uses the Properties file to define workflows
    Oozie uses XML files to define workflows

  • Workflow parameter passing
    Azkaban supports direct parameter transfer, such as ${input}
    Oozie supports parameters and EL expressions, such as ${fs:dirSize(myInputDir)}
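To illustrate Azkaban's style of parameter passing: a value defined in a `.properties` file packaged in the project zip can be referenced from any job via `${}` (the file and parameter names below are hypothetical):

```properties
## flow.properties (shared parameters, packaged in the same project zip)
input=/data/raw/2021-09-03

## print-input.job (references the shared parameter via ${})
type=command
command=echo "processing ${input}"
```

Oozie's EL expressions go further, allowing computed values such as the directory size in the `${fs:dirSize(myInputDir)}` example above.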

  • Timed execution
    Azkaban's scheduled tasks are time-based
    Oozie's scheduled tasks are based on both time and input data availability

  • Permission management
    Azkaban has strict permission control, such as read/write/execute permissions on workflows
    Oozie has no strict permission control

  • Workflow execution
    Azkaban has two operation modes: solo server mode (the executor server and web server are deployed on the same node) and multi-server mode (the executor server and web server can be deployed on different nodes)
    Oozie runs as a workflow server and supports multi-user and multi workflow

  • Workflow management
    Azkaban supports operating workflows from the browser and via Ajax
    Oozie supports operating workflows via the command line, HTTP REST, the Java API, and the browser

Azkaban installation and deployment


Using version: Azkaban 3.47.0

# Extract the Azkaban source package and build it with Gradle
tar -zxvf azkaban-3.47.0.tar.gz -C ../servers/
cd /export/servers/azkaban-3.47.0/
# git and gcc-c++ are required by the build
yum -y install git
yum -y install gcc-c++
# Build and assemble the distributions, skipping tests
./gradlew build installDist -x test

A successful build produces the following artifacts:

  • azkaban-exec-server

  • azkaban-web-server

  • azkaban-solo-server

  • execute-as-user.c

  • Database script file

Azkaban solo server mode

Azkaban's solo server mode starts the service on a single node and only needs the azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz package. All data is stored in H2, Azkaban's default embedded database.

cd /export/softwares
tar -zxvf azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz -C ../servers/
cd /export/servers/azkaban-solo-server-0.1.0-SNAPSHOT/conf

# Change time zone

# Start:
cd /export/servers/azkaban-solo-server-0.1.0-SNAPSHOT
bin/start-solo.sh    # (be sure to start from the installation root, not from inside bin)

Visit: http://ip:port/ (the solo server listens on port 8081 by default)

Fixing jobs stuck in the running state

If submitted jobs hang in the running state, the executor's free-memory check may be failing on low-memory machines; it can be disabled in the job-type plugin configuration:

cd /export/servers/azkaban-solo-server-0.1.0-SNAPSHOT/plugins/jobtypes/
# add the following line to commonprivate.properties:
# memCheck.enabled=false

Restart azkaban

Azkaban two-server mode

  • Azkaban Web services installation package

  • Azkaban execution service installation package

  • Compiled sql script

  • C program file script
    execute-as-user.c program

  • Install the MySQL database

After installing MySQL, log in and execute the following:

mysql -uroot -p

# Execute the following
CREATE DATABASE azkaban;
CREATE USER 'azkaban'@'%' IDENTIFIED BY 'azkaban';
GRANT ALL PRIVILEGES ON azkaban.* TO 'azkaban'@'%' IDENTIFIED BY 'azkaban' WITH GRANT OPTION;
USE azkaban;
source /export/softwares/create-all-sql-0.1.0-SNAPSHOT.sql;
  • Unzip the azkaban installation package
tar -zxvf azkaban-web-server-0.1.0-SNAPSHOT.tar.gz -C ../servers/
tar -zxvf azkaban-exec-server-0.1.0-SNAPSHOT.tar.gz -C ../servers/
  • Install SSL security authentication
cd /export/servers/azkaban-web-server-3.47.0
keytool -keystore keystore -alias jetty -genkey -keyalg RSA
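The keystore generated above then has to be referenced from the web server's azkaban.properties. A minimal sketch, assuming the default key names; the port and passwords are hypothetical and must match the answers given to keytool:

```properties
# Hypothetical values: adjust port and passwords to your keytool answers.
jetty.use.ssl=true
jetty.ssl.port=8443
jetty.keystore=keystore
jetty.password=azkaban
jetty.keypassword=azkaban
jetty.truststore=keystore
jetty.trustpassword=azkaban
```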

web server installation

cp -r /export/servers/azkaban-solo-server-0.1.0-SNAPSHOT/conf/ /export/servers/azkaban-web-server-3.47.0/
  • Modify the configuration file of Azkaban web server
cd /export/servers/azkaban-web-server-3.47.0/conf
#### Configure the following

# Azkaban Personalization Settings
azkaban.label=My Azkaban
# Azkaban UserManager class
# Loader for projects


# Velocity dev mode
# Azkaban Jetty server properties.


# Azkaban Executor settings
# mail settings
# User facing web server configurations used to construct the user facing server URLs. They are useful when there is a reverse proxy between Azkaban web servers and users.
# enduser -> myazkabanhost:443 -> proxy -> localhost:8081
# when this parameters set then these parameters are used to generate email links. 
# if these parameters are not set then jetty.hostname, and jetty.port(if ssl configured jetty.ssl.port) are used.
# azkaban.webserver.external_ssl_port=443
# azkaban.webserver.external_port=8081

# JMX stats
# Azkaban plugin settings
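The concrete values for the sections above depend on the local environment, so only the database section is sketched here, assuming the MySQL user created earlier (host and credentials are placeholders):

```properties
# Hypothetical values: host and credentials must match the MySQL setup above.
database.type=mysql
mysql.port=3306
mysql.host=localhost
mysql.database=azkaban
mysql.user=azkaban
mysql.password=azkaban
mysql.numconnections=100
```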
  • Add the log4j configuration file
cd /export/servers/azkaban-web-server-3.47.0/conf

#### Configure the following
log4j.rootLogger=INFO, Console
log4j.appender.Console=org.apache.log4j.ConsoleAppender
log4j.appender.Console.layout=org.apache.log4j.PatternLayout
log4j.appender.Console.layout.ConversionPattern=%d{yyyy/MM/dd HH:mm:ss.SSS Z} %p [%c{1}] %m%n

executor server installation

cp -r /export/servers/azkaban-web-server-3.47.0/conf/ /export/servers/azkaban-exec-server-3.47.0/
  • Add the job-type plugin
mkdir -p /export/servers/azkaban-exec-server-3.47.0/plugins/jobtypes
cp /export/softwares/execute-as-user.c /export/servers/azkaban-exec-server-3.47.0/plugins/jobtypes/

yum -y install gcc-c++
cd /export/servers/azkaban-exec-server-3.47.0/plugins/jobtypes
# Compile the helper that lets Azkaban run jobs as the submitting user
gcc execute-as-user.c -o execute-as-user
# The binary must be root-owned with the setuid/setgid bits set (mode 6050)
chown root execute-as-user
chmod 6050 execute-as-user
  • Add the plugin configuration file
cd /export/servers/azkaban-exec-server-3.47.0/plugins/jobtypes

#### Configure the following
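The configuration itself is elided above; a plausible commonprivate.properties fragment, assuming the execute-as-user binary compiled earlier lives in this same jobtypes directory, would be:

```properties
# Hypothetical fragment: enable running jobs as the submitting user and
# point Azkaban at the directory containing the execute-as-user binary.
execute.as.user=true
azkaban.native.lib=/export/servers/azkaban-exec-server-3.47.0/plugins/jobtypes
```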

Start web server


Start exec server


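The exact start commands are omitted above. Assuming the standard 3.x installDist layout (script names vary slightly between Azkaban versions), both servers are started from their installation roots, executor first:

```shell
# Hypothetical 3.x script names; start the executor before the web server.
cd /export/servers/azkaban-exec-server-3.47.0
bin/start-exec.sh

cd /export/servers/azkaban-web-server-3.47.0
bin/start-web.sh
```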


Tags: Big Data Azkaban

Posted on Fri, 03 Sep 2021 02:44:56 -0400 by lucy