Data Source Management | Data Synchronization and Source Code Analysis with DataX

Source code for this article is available on GitHub and GitEE; the addresses are listed at the end of the article.

1. Introduction to DataX Tools

1. Design Ideas

DataX is an offline synchronization tool for heterogeneous data sources. It aims to provide stable and efficient data synchronization between heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP. To solve the synchronization problem, DataX turns a complex mesh of point-to-point synchronization links into a star-shaped data link, with DataX acting as the intermediate transport carrier that connects the data sources. To bring in a new data source, you only need to connect it to DataX, after which it can synchronize with every data source that is already connected.

Note: Heterogeneous data sources are different database systems that store data for different kinds of business.

2. Component Structure

DataX, as an offline data synchronization framework, is built on a Framework + plugin architecture. Reading from and writing to data sources are abstracted into Reader and Writer plugins, which plug into the overall synchronization framework.

  • Reader

Reader is the data collection module. It reads data from the source and sends it to the Framework.

  • Writer

Writer is the data writing module responsible for continuously fetching data from the Framework and writing it to the destination.

  • Framework

Framework connects the Reader and the Writer, serves as the data transmission channel between them, and handles core technical issues such as buffering, flow control, concurrency, and data conversion.
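The Framework + plugin idea can be illustrated with a minimal Python sketch. This is not DataX code: the registries `READERS` and `WRITERS`, the plugin names `listreader` and `listwriter`, and the function `run_job` are all hypothetical names invented for this example. The point is only that the framework looks up a reader and a writer by name and connects them, so a new data source needs one new plugin rather than one link per existing source.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]

# Hypothetical plugin registries (illustration only, not how DataX loads plugins).
READERS: Dict[str, Callable[[dict], Iterator[Record]]] = {}
WRITERS: Dict[str, Callable[[dict, Iterable[Record]], int]] = {}


def list_reader(parameter: dict) -> Iterator[Record]:
    """Toy Reader: yields a copy of each row given in its parameter."""
    for row in parameter["rows"]:
        yield dict(row)


def list_writer(parameter: dict, records: Iterable[Record]) -> int:
    """Toy Writer: appends every record to the target list and returns the count."""
    written = 0
    for record in records:
        parameter["target"].append(record)
        written += 1
    return written


READERS["listreader"] = list_reader
WRITERS["listwriter"] = list_writer


def run_job(content: dict) -> int:
    """Toy Framework: resolves both plugins by name and connects them."""
    reader = READERS[content["reader"]["name"]]
    writer = WRITERS[content["writer"]["name"]]
    records = reader(content["reader"]["parameter"])
    return writer(content["writer"]["parameter"], records)


target: List[Record] = []
job_content = {
    "reader": {"name": "listreader", "parameter": {"rows": [{"id": 1}, {"id": 2}]}},
    "writer": {"name": "listwriter", "parameter": {"target": target}},
}
written_count = run_job(job_content)
print(written_count, target)
```

The `job_content` dictionary deliberately uses the same reader/writer/name/parameter layout as a DataX job file, so the lookup by plugin name is easy to relate to the real configuration.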

3. Architecture Design

  • Job

DataX calls a single data synchronization job a Job. When DataX receives a Job, it starts a process to carry out the entire synchronization. The Job module is the central management node of a single job. It is responsible for data cleaning, subtask splitting (converting a single job into multiple subtasks), TaskGroup management, and other functions.

  • Split

When a DataX Job starts, it is split into multiple small Tasks (subtasks) according to the splitting strategy of the source side, so that they can run concurrently. A Task is the smallest unit of a DataX Job, and each Task synchronizes a portion of the data.

  • Scheduler

After the Tasks have been split, the Job calls the Scheduler module to regroup them into TaskGroups based on the configured concurrency.

  • TaskGroup

Each TaskGroup runs all of its assigned Tasks with a fixed concurrency; the default concurrency of a single TaskGroup is 5. Each Task is started by its TaskGroup, and once started, the Task always runs a Reader -> Channel -> Writer thread chain to complete its share of the synchronization. While the DataX job is running, the Job monitors the TaskGroups and waits for them to finish. The Job exits successfully once all TaskGroups have completed; otherwise it exits abnormally with a non-zero process exit code.
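The relationship between Tasks, the configured concurrency, and TaskGroups can be sketched in a few lines of Python. This is a simplified illustration, not DataX's scheduling code: the function `group_tasks`, the rule "number of TaskGroups = configured channels divided by 5, rounded up", and the round-robin assignment are assumptions made for this example.

```python
import math
from typing import Dict, List

TASKS_PER_GROUP = 5  # default concurrency of one TaskGroup, as stated above


def group_tasks(task_ids: List[int], channels: int) -> Dict[int, List[int]]:
    """Distribute tasks over TaskGroups round-robin (simplified assumption)."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    # Assumed rule: enough groups so that each one runs at most 5 tasks at a time.
    group_count = math.ceil(channels / TASKS_PER_GROUP)
    groups: Dict[int, List[int]] = {g: [] for g in range(group_count)}
    for position, task_id in enumerate(task_ids):
        groups[position % group_count].append(task_id)
    return groups


# 100 Tasks with a configured concurrency of 20 channels.
groups = group_tasks(list(range(100)), channels=20)
for group_id, tasks in groups.items():
    print(group_id, len(tasks))
```

Under these assumptions, 100 Tasks and 20 channels give 4 TaskGroups with 25 Tasks each, and every TaskGroup works through its 25 Tasks 5 at a time.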

2. Environmental Installation

Python 2.6+ and JDK 1.8+ are recommended (the JDK installation steps are left to the reader).

1. Python package download

# yum -y install wget
# wget https://www.python.org/ftp/python/2.7.15/Python-2.7.15.tgz
# tar -zxvf Python-2.7.15.tgz

2. Install Python

# yum install gcc openssl-devel bzip2-devel
[root@ctvm01 Python-2.7.15]# ./configure --enable-optimizations
# make altinstall
# python2.7 -V

3. DataX Installation

# pwd
/opt/module
# ll
datax
# cd /opt/module/datax/bin
-- Test whether the environment is correct
# python datax.py /opt/module/datax/job/job.json

3. Synchronization Tasks

1. Synchronization table creation

-- PostgreSQL
CREATE TABLE sync_user (
	id INT NOT NULL,
	user_name VARCHAR (32) NOT NULL,
	user_age int4 NOT NULL,
	CONSTRAINT "sync_user_pkey" PRIMARY KEY ("id")
);
CREATE TABLE data_user (
	id INT NOT NULL,
	user_name VARCHAR (32) NOT NULL,
	user_age int4 NOT NULL,
	CONSTRAINT "data_user_pkey" PRIMARY KEY ("id")
);

2. Create the job script

[root@ctvm01 job]# pwd
/opt/module/datax/job
[root@ctvm01 job]# vim postgresql_job.json

3. Script Content

{
    "job": {
        "setting": {
            "speed": {
                "channel": "3"
            }
        },
        "content": [
            {
                "reader": {
                    "name": "postgresqlreader",
                    "parameter": {
                        "username": "root01",
                        "password": "123456",
                        "column": ["id","user_name","user_age"], 
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:postgresql://192.168.72.131:5432/db_01"], 
                                "table": ["data_user"]
                            }
                        ]
                    }
                }, 
                "writer": {
                    "name": "postgresqlwriter", 
                    "parameter": {
                        "username": "root01",
                        "password": "123456",
                        "column": ["id","user_name","user_age"], 
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:postgresql://192.168.72.131:5432/db_01", 
                                "table": ["sync_user"]
                            }
                        ], 
                        "postSql": [], 
                        "preSql": []
                    }
                }
            }
        ]
    }
}
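Before running a job file, it can help to sanity-check it with a small script. The following Python sketch is not part of DataX: `check_job` is a hypothetical helper, and the three rules it checks (equal column counts, the reader's `jdbcUrl` as a list, the writer's `jdbcUrl` as a string) are taken from the shape of the script above rather than from DataX's own validation. The embedded JSON is a trimmed copy of that script without credentials.

```python
import json
from typing import List

JOB_JSON = """
{
  "job": {
    "setting": {"speed": {"channel": "3"}},
    "content": [{
      "reader": {
        "name": "postgresqlreader",
        "parameter": {
          "column": ["id", "user_name", "user_age"],
          "connection": [{"jdbcUrl": ["jdbc:postgresql://192.168.72.131:5432/db_01"],
                          "table": ["data_user"]}]
        }
      },
      "writer": {
        "name": "postgresqlwriter",
        "parameter": {
          "column": ["id", "user_name", "user_age"],
          "connection": [{"jdbcUrl": "jdbc:postgresql://192.168.72.131:5432/db_01",
                          "table": ["sync_user"]}]
        }
      }
    }]
  }
}
"""


def check_job(job: dict) -> List[str]:
    """Return a list of problems found in the job description (empty if none)."""
    problems: List[str] = []
    for index, item in enumerate(job["job"]["content"]):
        reader = item["reader"]["parameter"]
        writer = item["writer"]["parameter"]
        if len(reader["column"]) != len(writer["column"]):
            problems.append(f"content[{index}]: reader and writer column counts differ")
        for conn in reader["connection"]:
            if not isinstance(conn["jdbcUrl"], list):
                problems.append(f"content[{index}]: reader jdbcUrl should be a list")
        for conn in writer["connection"]:
            if not isinstance(conn["jdbcUrl"], str):
                problems.append(f"content[{index}]: writer jdbcUrl should be a string")
    return problems


job = json.loads(JOB_JSON)  # raises an error if the JSON is malformed
problems = check_job(job)
print(problems)
```

Parsing with `json.loads` also catches plain syntax errors such as a missing comma or bracket before DataX is started.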

4. Executing scripts

# python /opt/module/datax/bin/datax.py /opt/module/datax/job/postgresql_job.json

5. Execution Log

2020-04-23 18:25:33.404 [job-0] INFO  JobContainer - 
Task start time: 2020-04-23 18:25:22
 Task end time: 2020-04-23 18:25:33
 Total Task Time: 10s
 Average task flow: 1B/s
 Record writing speed: 0rec/s
 Total number of read-out records: 2
 Total number of read and write failures: 0

4. Source Code Flow Analysis

Note: Only the core flow of the source code is shown here. The full source code can be downloaded from the Git repositories.

1. Read Data

Core entry: PostgresqlReader

Start Reading Task

public static class Task extends Reader.Task {
    @Override
    public void startRead(RecordSender recordSender) {
        int fetchSize = this.readerSliceConfig.getInt(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE);
        this.commonRdbmsReaderSlave.startRead(this.readerSliceConfig, recordSender,
                super.getTaskPluginCollector(), fetchSize);
    }
}

After the read task starts, perform the read data operation.

Core class: CommonRdbmsReader

public void startRead(Configuration readerSliceConfig,
                      RecordSender recordSender,
                      TaskPluginCollector taskPluginCollector, int fetchSize) {
    ResultSet rs = null;
    try {
        // Execute the query and fetch the result set
        rs = DBUtil.query(conn, querySql, fetchSize);
        queryPerfRecord.end();
        ResultSetMetaData metaData = rs.getMetaData();
        columnNumber = metaData.getColumnCount();
        PerfRecord allResultPerfRecord = new PerfRecord(taskGroupId, taskId, PerfRecord.PHASE.RESULT_NEXT_ALL);
        allResultPerfRecord.start();
        long rsNextUsedTime = 0;
        long lastTime = System.nanoTime();
        // Send each row to the exchange buffer
        while (rs.next()) {
            rsNextUsedTime += (System.nanoTime() - lastTime);
            this.transportOneRecord(recordSender, rs,metaData, columnNumber, mandatoryEncoding, taskPluginCollector);
            lastTime = System.nanoTime();
        }
        allResultPerfRecord.end(rsNextUsedTime);
    } catch (Exception e) {
        throw RdbmsException.asQueryException(this.dataBaseType, e, querySql, table, username);
    } finally {
        DBUtil.closeDBResources(null, conn);
    }
}

2. Data transmission

Core interface: RecordSender

public interface RecordSender {
	public Record createRecord();
	public void sendToWriter(Record record);
	public void flush();
	public void terminate();
	public void shutdown();
}

Core interface: RecordReceiver

public interface RecordReceiver {
	public Record getFromReader();
	public void shutdown();
}

Core Class: BufferedRecordExchanger

class BufferedRecordExchanger implements RecordSender, RecordReceiver
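The role of a class that is both sender and receiver can be illustrated with a bounded queue shared by a reader thread and a writer thread. The Python sketch below is only an analogy and does not reproduce BufferedRecordExchanger: the class `ToyExchanger`, its capacity, and the use of `None` as the end-of-data signal are assumptions made for this example.

```python
import queue
import threading
from typing import List, Optional


class ToyExchanger:
    """Bounded queue acting as both sender and receiver (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        self._queue = queue.Queue(maxsize=capacity)

    def send_to_writer(self, record: dict) -> None:
        self._queue.put(record)  # blocks while the buffer is full

    def terminate(self) -> None:
        self._queue.put(None)  # assumed end-of-data signal

    def get_from_reader(self) -> Optional[dict]:
        return self._queue.get()  # blocks while the buffer is empty


def reader_task(exchanger: ToyExchanger, rows: List[dict]) -> None:
    """Push every row into the exchanger, then signal the end of the data."""
    for row in rows:
        exchanger.send_to_writer(row)
    exchanger.terminate()


def writer_task(exchanger: ToyExchanger, sink: List[dict]) -> None:
    """Pull records until the end-of-data signal arrives."""
    while True:
        record = exchanger.get_from_reader()
        if record is None:
            break
        sink.append(record)


rows = [{"id": i} for i in range(10)]
sink: List[dict] = []
exchanger = ToyExchanger(capacity=3)
threads = [
    threading.Thread(target=reader_task, args=(exchanger, rows)),
    threading.Thread(target=writer_task, args=(exchanger, sink)),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(len(sink))
```

Because the queue is bounded, a fast reader is slowed down to the pace of the writer, which is one simple form of the buffering and flow control mentioned in the Framework description.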

3. Writing data

Core entry: PostgresqlWriter

Start Write Task

public static class Task extends Writer.Task {
	public void startWrite(RecordReceiver recordReceiver) {
		this.commonRdbmsWriterSlave.startWrite(recordReceiver, this.writerSliceConfig, super.getTaskPluginCollector());
	}
}

After the write data task starts, perform the data write operation.

Core class: CommonRdbmsWriter

public void startWriteWithConnection(RecordReceiver recordReceiver,
                                     Connection connection) {
    // Write SQL statements for databases
    calcWriteRecordSql();
    List<Record> writeBuffer = new ArrayList<>(this.batchSize);
    int bufferBytes = 0;
    try {
        Record record;
        while ((record = recordReceiver.getFromReader()) != null) {
            writeBuffer.add(record);
            bufferBytes += record.getMemorySize();
            if (writeBuffer.size() >= batchSize || bufferBytes >= batchByteSize) {
                doBatchInsert(connection, writeBuffer);
                writeBuffer.clear();
                bufferBytes = 0;
            }
        }
        if (!writeBuffer.isEmpty()) {
            doBatchInsert(connection, writeBuffer);
            writeBuffer.clear();
            bufferBytes = 0;
        }
    } catch (Exception e) {
        throw DataXException.asDataXException(
                DBUtilErrorCode.WRITE_DATA_ERROR, e);
    } finally {
        writeBuffer.clear();
        bufferBytes = 0;
        DBUtil.closeDBResources(null, null, connection);
    }
}
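The batching logic of the write loop above (flush when either the record count or the byte size of the buffer reaches its limit, then flush the remainder at the end) can be restated as a small Python sketch. It is a simplified illustration, not a port of CommonRdbmsWriter: `write_in_batches` and its `flush` callback are hypothetical names, records are plain strings, and their UTF-8 length stands in for `record.getMemorySize()`.

```python
from typing import Callable, Iterable, List


def write_in_batches(records: Iterable[str],
                     flush: Callable[[List[str]], None],
                     batch_size: int,
                     batch_byte_size: int) -> int:
    """Buffer records and flush by count or by size; return the number of flushes."""
    buffer: List[str] = []
    buffer_bytes = 0
    flush_count = 0
    for record in records:
        buffer.append(record)
        buffer_bytes += len(record.encode("utf-8"))
        if len(buffer) >= batch_size or buffer_bytes >= batch_byte_size:
            flush(list(buffer))  # hand over a copy, then reset the buffer
            buffer.clear()
            buffer_bytes = 0
            flush_count += 1
    if buffer:
        flush(list(buffer))  # write whatever is left after the last record
        flush_count += 1
    return flush_count


batches: List[List[str]] = []
flushes = write_in_batches(["a", "b", "c", "d", "e"], batches.append,
                           batch_size=2, batch_byte_size=1024)
print(flushes, batches)
```

With a batch size of 2, the five records are written as two full batches followed by one final batch of a single record, which corresponds to the second `doBatchInsert` call after the loop in the Java code.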

5. Source code address

GitHub·address
https://github.com/cicadasmile/data-manage-parent
GitEE·address
https://gitee.com/cicadasmile/data-manage-parent

Posted on Tue, 05 May 2020 20:13:36 -0400 by prcollin