Azkaban task scheduling tool

A brief introduction to Azkaban

Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows.

Features:

  • Compatible with any version of Hadoop
  • Easy to use Web UI
  • Simple Web and http workflow upload
  • Project workspace
  • Scheduling workflow
  • Modular and pluggable
  • Authentication and authorization
  • Track user actions
  • Email alerts about failures and successes
  • SLA alert and auto kill
  • Retrying failed jobs

The above is taken from the official introduction; for details, see the official website.

Why workflow scheduling system

1. A complete data analysis system is usually composed of a large number of task units: shell scripts, Java programs, MapReduce programs, Hive scripts, and so on.
2. The task units have time-ordering and dependency relationships between them.
3. To organize such a complex execution plan well, a workflow scheduling system is needed to schedule the execution.

For example, we may have a requirement that a business system generates 20 GB of raw data every day, and we need to process it every day. The processing steps are as follows:
1. Synchronize the raw data to HDFS through Hadoop;
2. Transform the raw data with the MapReduce computing framework, storing the generated data in multiple Hive tables as partitioned tables;
3. JOIN the data of multiple tables in Hive to get one large detailed-data Hive table;
4. Carry out various statistical analyses on the detailed data to obtain the result report information;
5. Synchronize the result data obtained from the statistical analysis to the business system for business invocation.
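In Azkaban, the five steps above could be wired together as a chain of dependent .job files. This is only a sketch; every file, script, and job name in it is illustrative rather than taken from the original text:

```properties
# ingest.job -- step 1: synchronize the raw data to HDFS
type=command
command=sh ingest_to_hdfs.sh

# etl.job -- step 2: MapReduce transformation into partitioned Hive tables
type=command
dependencies=ingest
command=sh run_mr_etl.sh

# join.job -- step 3: JOIN the Hive tables into one detailed-data table
type=command
dependencies=etl
command=hive -f join_detail.sql

# report.job -- step 4: statistical analysis producing the report data
type=command
dependencies=join
command=hive -f build_report.sql

# export.job -- step 5: synchronize the results back to the business system
type=command
dependencies=report
command=sh export_results.sh
```

Each snippet above lives in its own .job file; zipped together and uploaded as a project, Azkaban derives the execution order from the `dependencies` lines.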

Implementation of workflow scheduling

  • Simple task scheduling: use Linux crontab directly;
  • Complex task scheduling: develop a scheduling platform or use a ready-made open-source scheduling system, such as Oozie, Azkaban, Airflow, etc.
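For the simple case, a single crontab entry (added with `crontab -e`) is enough; the script path and schedule below are illustrative:

```
# run the daily processing script at 01:30 every day, appending output to a log
30 1 * * * /opt/scripts/daily_etl.sh >> /var/log/daily_etl.log 2>&1
```

This works until tasks start depending on each other, at which point a scheduler that understands dependencies becomes necessary.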

In the Hadoop field, common workflow schedulers include Oozie, Azkaban, Cascading, Hamake, etc.

Comparison of characteristics of various scheduling tools

The following table compares the key features of the above four Hadoop workflow schedulers. Although these schedulers address essentially the same demand scenarios, there are still significant differences in design concept, target users, and application scenarios, which can serve as a reference when making a technical selection:

| Characteristic | Hamake | Oozie | Azkaban | Cascading |
| --- | --- | --- | --- | --- |
| Workflow description language | XML | XML (xPDL-based) | text file with key/value pairs | Java API |
| Dependency mechanism | data-driven | explicit | explicit | explicit |
| Requires a web container | No | Yes | Yes | No |
| Progress tracking | console/log messages | web page | web page | Java API |
| Hadoop job scheduling support | no | yes | yes | yes |
| Run mode | command line utility | daemon | daemon | API |
| Pig support | yes | yes | yes | yes |
| Event notification | no | no | no | yes |
| Installation required | no | yes | yes | no |
| Supported Hadoop versions | 0.18+ | 0.20+ | currently unknown | 0.18+ |
| Retry support | no | workflow node level | yes | yes |
| Run any command | yes | yes | yes | yes |
| Amazon EMR support | yes | no | currently unknown | yes |

Comparison between Azkaban and Oozie

These are the two most popular schedulers on the market. Overall, compared with Azkaban, Oozie is a heavyweight task scheduling system with comprehensive functionality, but it is more complex to configure and use. If you can live without some of those functions, the lightweight scheduler Azkaban is a good candidate.

  • Function
    Both can schedule MapReduce, Pig, Java, and shell-script workflow tasks
    Both can execute workflow tasks on a schedule

  • Workflow definition
    Azkaban uses the Properties file to define workflows
    Oozie uses XML files to define workflows

  • Workflow parameter passing
    Azkaban supports direct parameter transfer, such as ${input}
    Oozie supports parameters and EL expressions, such as ${fs:dirSize(myInputDir)}

  • Timed execution
    Azkaban's scheduled tasks are triggered by time only
    Oozie's scheduled tasks can be triggered both by time and by input-data availability

  • Permission management
    Azkaban has strict permission control, such as reading/writing/executing a workflow
    Oozie has no strict permission control

  • Workflow execution
    Azkaban has two operation modes: solo server mode (the executor server and web server are deployed on the same node) and multi-server mode (the executor server and web server can be deployed on different nodes)
    Oozie runs as a workflow server and supports multi-user and multi workflow

  • Workflow management
    Azkaban supports operating workflows through the browser and Ajax
    Oozie supports operating workflows through the command line, HTTP REST, Java API, and browser
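To make the Properties-based workflow definition and the ${input}-style parameter passing mentioned above concrete, here is a minimal two-job sketch; the job names, script name, and parameter name are illustrative:

```properties
# upstream.job
type=command
command=echo "producing data"

# downstream.job -- runs only after upstream succeeds
type=command
dependencies=upstream
# ${input} is substituted from flow/project parameters at run time
command=sh process.sh ${input}
```

The ${input} value can be supplied when the flow is executed, so the same workflow definition can be reused against different inputs.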

Azkaban installation deployment

compile

Using version: Azkaban 3.47.0

tar -zxvf azkaban-3.47.0.tar.gz -C ../servers/
cd /export/servers/azkaban-3.47.0/
# git and gcc-c++ are required by the Gradle build
yum -y install git
yum -y install gcc-c++
# build and install the distributions, skipping tests
./gradlew build installDist -x test

After the compilation succeeds, the build produces the following artifacts.

Compiled files

  • azkaban-exec-server
    /export/servers/azkaban-3.47.0/azkaban-exec-server/build/distributions

  • azkaban-web-server
    /export/servers/azkaban-3.47.0/azkaban-web-server/build/distributions

  • azkaban-solo-server
    /export/servers/azkaban-3.47.0/azkaban-solo-server/build/distributions

  • execute-as-user.c
    /export/servers/azkaban-3.47.0/az-exec-util/src/main/c

  • Database script file
    /export/servers/azkaban-3.47.0/azkaban-db/build/install/azkaban-db

Azkaban solo server mode

Azkaban's solo server starts the service in single-node mode. It only needs the azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz installation package to start. All data is saved in H2, Azkaban's default database.

cd /export/softwares
tar -zxvf azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz -C ../servers/
cd /export/servers/azkaban-solo-server-0.1.0-SNAPSHOT/conf
vim azkaban.properties

# Change time zone
default.timezone.id=Asia/Shanghai

# Start (must be launched from the installation root directory):
cd /export/servers/azkaban-solo-server-0.1.0-SNAPSHOT
bin/start-solo.sh

Visit: https://ip:port/

Fixing jobs that stay stuck in the Running state

cd /export/servers/azkaban-solo-server-0.1.0-SNAPSHOT/plugins/jobtypes/
vim commonprivate.properties

# add the following line to disable the executor memory check,
# which can keep jobs stuck in the Running state on low-memory machines
memCheck.enabled=false

Restart Azkaban

Azkaban two-server mode

  • Azkaban Web services installation package
    azkaban-web-server-0.1.0-SNAPSHOT.tar.gz

  • Azkaban execution service installation package
    azkaban-exec-server-0.1.0-SNAPSHOT.tar.gz

  • Compiled sql script
    create-all-sql-0.1.0-SNAPSHOT.sql

  • C program source file
    execute-as-user.c

  • Install the MySQL database
    (omitted here)

To be executed after MySQL is installed:

mysql -uroot -p

# execute the following statements
CREATE DATABASE azkaban;
CREATE USER 'azkaban'@'%' IDENTIFIED BY 'azkaban';
GRANT ALL PRIVILEGES ON azkaban.* TO 'azkaban'@'%' IDENTIFIED BY 'azkaban' WITH GRANT OPTION;

source /export/softwares/create-all-sql-0.1.0-SNAPSHOT.sql;
  • Unzip the azkaban installation packages

tar -zxvf azkaban-web-server-0.1.0-SNAPSHOT.tar.gz -C ../servers/
tar -zxvf azkaban-exec-server-0.1.0-SNAPSHOT.tar.gz -C ../servers/

  • Install SSL security authentication

cd /export/servers/azkaban-web-server-3.47.0
keytool -keystore keystore -alias jetty -genkey -keyalg RSA

web server installation

cp -r /export/servers/azkaban-solo-server-0.1.0-SNAPSHOT/conf/ /export/servers/azkaban-web-server-3.47.0/

  • Modify the configuration file of the Azkaban web server

cd /export/servers/azkaban-web-server-3.47.0/conf
vim azkaban.properties

#### Configure the following

# Azkaban Personalization Settings
azkaban.name=MyAzkaban
azkaban.label=My Azkaban
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
web.resource.dir=web/
default.timezone.id=Asia/Shanghai
# Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
user.manager.xml.file=conf/azkaban-users.xml
# Loader for projects
executor.global.properties=conf/global.properties
azkaban.project.dir=projects
#database.type=h2
#h2.path=./h2
#h2.create.tables=true 

database.type=mysql
mysql.port=3306
mysql.host=localhost
mysql.database=azkaban
mysql.user=azkaban
mysql.password=azkaban
mysql.numconnections=100 

# Velocity dev mode
velocity.dev.mode=false
# Azkaban Jetty server properties.
jetty.use.ssl=true 
jetty.maxThreads=25
jetty.port=8081

jetty.keystore=/export/servers/azkaban-web-server-3.47.0/keystore
jetty.password=azkaban
jetty.keypassword=azkaban
jetty.truststore=/export/servers/azkaban-web-server-3.47.0/keystore
jetty.trustpassword=azkaban 

# Azkaban Executor settings
executor.port=12321
# mail settings
mail.sender=
mail.host=
# User facing web server configurations used to construct the user facing server URLs. They are useful when there is a reverse proxy between Azkaban web servers and users.
# enduser -> myazkabanhost:443 -> proxy -> localhost:8081
# when this parameters set then these parameters are used to generate email links. 
# if these parameters are not set then jetty.hostname, and jetty.port(if ssl configured jetty.ssl.port) are used.
# azkaban.webserver.external_hostname=myazkabanhost.com
# azkaban.webserver.external_ssl_port=443
# azkaban.webserver.external_port=8081
job.failure.email=
job.success.email=

# JMX stats
jetty.connector.stats=true
executor.connector.stats=true
# Azkaban plugin settings
azkaban.jobtype.plugin.dir=plugins/jobtypes
  • Add a log4j.properties configuration file
cd /export/servers/azkaban-web-server-3.47.0/conf
vim log4j.properties

#### Configure the following
log4j.rootLogger=INFO, Console
log4j.appender.Console=org.apache.log4j.ConsoleAppender
log4j.appender.Console.layout=org.apache.log4j.PatternLayout
log4j.appender.Console.layout.ConversionPattern=%d{yyyy/MM/dd HH:mm:ss.SSS Z} %p [%c{1}] %m%n
log4j.category.velocity=INFO

executor server installation

cp -r /export/servers/azkaban-web-server-3.47.0/conf/ /export/servers/azkaban-exec-server-3.47.0/

  • Add the execute-as-user plug-in

mkdir -p /export/servers/azkaban-exec-server-3.47.0/plugins/jobtypes
cp /export/softwares/execute-as-user.c /export/servers/azkaban-exec-server-3.47.0/plugins/jobtypes/

yum -y install gcc-c++
cd /export/servers/azkaban-exec-server-3.47.0/plugins/jobtypes
# compile the wrapper and give it setuid permissions
gcc execute-as-user.c -o execute-as-user
chown root execute-as-user
chmod 6050 execute-as-user

  • Add the configuration file

cd /export/servers/azkaban-exec-server-3.47.0/plugins/jobtypes
vim commonprivate.properties

#### Configure the following
execute.as.user=false
memCheck.enabled=false
azkaban.native.lib=/export/servers/azkaban-exec-server-3.47.0/plugins/jobtypes

Start web server

bin/start-web.sh

Start exec server

bin/start-exec.sh

visit

https://ip:port/
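Once both servers are up, the deployment can be verified with a minimal command-type job; the directory and file names below are illustrative:

```shell
# create a minimal test project directory and a trivial job definition
mkdir -p /tmp/azkaban-test
cat > /tmp/azkaban-test/foo.job <<'EOF'
# foo.job -- a trivial command-type job
type=command
command=echo "Hello Azkaban"
EOF
```

Zip foo.job into an archive (for example `zip foo.zip foo.job`), create a project in the web UI, upload the zip, and run the flow; the execution log should contain the echoed message.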

Tags: Big Data Azkaban

Posted on Fri, 03 Sep 2021 02:44:56 -0400 by lucy