Source code for this article: GitHub || GitEE (addresses at the end of the article).
1. Introduction to the DataX Tool
1. Design Ideas
DataX is an offline synchronization tool for heterogeneous data sources. It is dedicated to achieving stable and efficient data synchronization between heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and more. To solve the synchronization problem among heterogeneous data sources, DataX turns a complex mesh of point-to-point synchronization links into a star-shaped data link, with DataX acting as the intermediate transport carrier that connects the various data sources. When a new data source needs to be integrated, it only has to be connected to DataX to synchronize seamlessly with every existing data source.
Note: heterogeneous data sources refers to using different database systems to store data in order to support different types of business.
2. Component Structure
As an offline data synchronization framework, DataX is built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader and Writer plugins, which are integrated into the overall synchronization framework.
- Reader
Reader is the data collection module; it reads data from the source data store and sends it to the Framework.
- Writer
Writer is the data writing module responsible for continuously fetching data from the Framework and writing it to the destination.
- Framework
Framework connects Reader and Writer, acting as the data transmission channel between them, and handles core technical issues such as buffering, flow control, concurrency, and data conversion. A minimal sketch of this plugin idea follows this list.
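To make the plugin abstraction more concrete, the sketch below outlines, in plain Java, what a Reader/Writer pair exchanging records through a framework-owned channel might look like. It is only a conceptual illustration: the names (SimpleReader, SimpleWriter, RecordChannel) are hypothetical and do not reproduce the real DataX plugin API, which is shown later in the source-code analysis.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Conceptual sketch of the Framework + plugin idea: a Reader pushes records
// into a channel owned by the framework and a Writer drains them.
// Names here are hypothetical and are NOT the real DataX plugin interfaces.
public class PluginSketch {

    // The "Framework" side: a bounded channel between Reader and Writer.
    static class RecordChannel {
        private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        void push(String record) throws InterruptedException { queue.put(record); }
        String pull() throws InterruptedException { return queue.take(); }
    }

    // A Reader plugin reads from some source and hands records to the framework.
    static class SimpleReader implements Runnable {
        private final RecordChannel channel;
        private final List<String> source;
        SimpleReader(RecordChannel channel, List<String> source) {
            this.channel = channel;
            this.source = source;
        }
        @Override
        public void run() {
            try {
                for (String row : source) {
                    channel.push(row);          // send to the Framework
                }
                channel.push("EOF");            // end-of-data marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // A Writer plugin keeps fetching records from the framework and writes them out.
    static class SimpleWriter implements Runnable {
        private final RecordChannel channel;
        SimpleWriter(RecordChannel channel) { this.channel = channel; }
        @Override
        public void run() {
            try {
                String record;
                while (!"EOF".equals(record = channel.pull())) {
                    System.out.println("write -> " + record);  // write to the destination
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        RecordChannel channel = new RecordChannel();
        Thread reader = new Thread(new SimpleReader(channel, List.of("row-1", "row-2", "row-3")));
        Thread writer = new Thread(new SimpleWriter(channel));
        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}
```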
3. Architecture Design
- Job
A single data synchronization run in DataX is called a Job. When DataX receives a Job, it starts one process to carry out the entire synchronization. The Job module is the central management node of a single job; it is responsible for data cleaning, subtask splitting (converting one job into multiple subtasks), TaskGroup management, and other functions.
- Split
After a DataX Job starts, it is split into several small Tasks (subtasks) according to the source-side slicing strategy so that they can run concurrently. A Task is the smallest unit of a DataX Job, and each Task is responsible for synchronizing a portion of the data.
- Scheduler
After the Job has been split into Tasks, it calls the Scheduler module, which regroups the split Tasks into TaskGroups based on the configured concurrency.
- TaskGroup
Each TaskGroup is responsible for running its assigned Tasks with a certain degree of concurrency; the default concurrency for a single TaskGroup is 5. Each Task is started by its TaskGroup, and once started it runs a fixed Reader->Channel->Writer thread chain to complete the synchronization. While the DataX job is running, the Job monitors and waits for all TaskGroup tasks to finish; it exits successfully once they have all completed, otherwise it exits abnormally with a non-zero process exit value. A small scheduling sketch follows this list.
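The sketch below illustrates the splitting-and-grouping idea in plain Java: a job is cut into tasks and the tasks are distributed over task groups given a total channel (concurrency) setting. It is a simplified illustration of the principle only; the numbers and helper names are hypothetical, and this is not DataX's actual scheduling code.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of Job -> Task splitting and Task -> TaskGroup grouping.
// Not DataX's real scheduler; numbers and names are for demonstration only.
public class SchedulingSketch {

    // Assume the default of 5 concurrent channels per TaskGroup, as described above.
    static final int CHANNELS_PER_TASK_GROUP = 5;

    public static void main(String[] args) {
        int totalChannels = 20;   // e.g. the job's "speed.channel" setting
        int taskCount = 100;      // tasks produced by the source-side split

        // Number of task groups needed to provide the requested concurrency.
        int taskGroupCount = (int) Math.ceil((double) totalChannels / CHANNELS_PER_TASK_GROUP);

        // Round-robin the tasks over the task groups.
        List<List<Integer>> taskGroups = new ArrayList<>();
        for (int i = 0; i < taskGroupCount; i++) {
            taskGroups.add(new ArrayList<>());
        }
        for (int taskId = 0; taskId < taskCount; taskId++) {
            taskGroups.get(taskId % taskGroupCount).add(taskId);
        }

        for (int i = 0; i < taskGroupCount; i++) {
            System.out.printf("TaskGroup-%d runs %d tasks with %d concurrent channels%n",
                    i, taskGroups.get(i).size(), CHANNELS_PER_TASK_GROUP);
        }
    }
}
```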
2. Environment Installation
Python 2.6+ and JDK 1.8+ are recommended (their installation is not covered here).
1. Python package download
```bash
# yum -y install wget
# wget https://www.python.org/ftp/python/2.7.15/Python-2.7.15.tgz
# tar -zxvf Python-2.7.15.tgz
```
2. Install Python
```bash
# yum install gcc openssl-devel bzip2-devel
[root@ctvm01 Python-2.7.15]# ./configure --enable-optimizations
# make altinstall
# python -V
```
3. DataX Installation
```bash
# pwd
/opt/module
# ll datax
# cd /opt/module/datax/bin
-- Test whether the environment is set up correctly
# python datax.py /opt/module/datax/job/job.json
```
3. Synchronization Tasks
1. Synchronization table creation
```sql
-- PostgreSQL
CREATE TABLE sync_user (
    id INT NOT NULL,
    user_name VARCHAR(32) NOT NULL,
    user_age INT4 NOT NULL,
    CONSTRAINT "sync_user_pkey" PRIMARY KEY ("id")
);
CREATE TABLE data_user (
    id INT NOT NULL,
    user_name VARCHAR(32) NOT NULL,
    user_age INT4 NOT NULL,
    CONSTRAINT "data_user_pkey" PRIMARY KEY ("id")
);
```
2. Scripting tasks
```bash
[root@ctvm01 job]# pwd
/opt/module/datax/job
[root@ctvm01 job]# vim postgresql_job.json
```
3. Script Content
{ "job": { "setting": { "speed": { "channel": "3" } }, "content": [ { "reader": { "name": "postgresqlreader", "parameter": { "username": "root01", "password": "123456", "column": ["id","user_name","user_age"], "connection": [ { "jdbcUrl": ["jdbc:postgresql://192.168.72.131:5432/db_01"], "table": ["data_user"] } ] } }, "writer": { "name": "postgresqlwriter", "parameter": { "username": "root01", "password": "123456", "column": ["id","user_name","user_age"], "connection": [ { "jdbcUrl": "jdbc:postgresql://192.168.72.131:5432/db_01", "table": ["sync_user"] } ], "postSql": [], "preSql": [] } } } ] } }
4. Executing scripts
```bash
# /opt/module/datax/bin/datax.py /opt/module/datax/job/postgresql_job.json
```
5. Execution Log
```
2020-04-23 18:25:33.404 [job-0] INFO JobContainer -
Task start time                      : 2020-04-23 18:25:22
Task end time                        : 2020-04-23 18:25:33
Total task time                      : 10s
Average task flow                    : 1B/s
Record writing speed                 : 0rec/s
Total number of records read         : 2
Total number of read/write failures  : 0
```
4. Source Code Flow Analysis
Note: only the core flow of the source code is shown here. For the full source code, download it from Git yourself.
1. Read Data
Core entry: PostgresqlReader
Start Reading Task
```java
public static class Task extends Reader.Task {
    @Override
    public void startRead(RecordSender recordSender) {
        int fetchSize = this.readerSliceConfig.getInt(
                com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE);
        this.commonRdbmsReaderSlave.startRead(this.readerSliceConfig, recordSender,
                super.getTaskPluginCollector(), fetchSize);
    }
}
```
After the read task starts, the actual data read is performed.
Core class: CommonRdbmsReader
```java
public void startRead(Configuration readerSliceConfig, RecordSender recordSender,
                      TaskPluginCollector taskPluginCollector, int fetchSize) {
    ResultSet rs = null;
    try {
        // Fetch the data
        rs = DBUtil.query(conn, querySql, fetchSize);
        queryPerfRecord.end();

        ResultSetMetaData metaData = rs.getMetaData();
        columnNumber = metaData.getColumnCount();

        PerfRecord allResultPerfRecord = new PerfRecord(taskGroupId, taskId,
                PerfRecord.PHASE.RESULT_NEXT_ALL);
        allResultPerfRecord.start();

        long rsNextUsedTime = 0;
        long lastTime = System.nanoTime();
        // Hand each row over to the exchange area
        while (rs.next()) {
            rsNextUsedTime += (System.nanoTime() - lastTime);
            this.transportOneRecord(recordSender, rs, metaData, columnNumber,
                    mandatoryEncoding, taskPluginCollector);
            lastTime = System.nanoTime();
        }
        allResultPerfRecord.end(rsNextUsedTime);
    } catch (Exception e) {
        throw RdbmsException.asQueryException(this.dataBaseType, e, querySql, table, username);
    } finally {
        DBUtil.closeDBResources(null, conn);
    }
}
```
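For context, transportOneRecord converts the current JDBC row into a DataX Record and hands it to the RecordSender. The snippet below is a simplified sketch of that idea for a few common column types; it is not the real CommonRdbmsReader code, and a plain List<Object> stands in for the Record/Column classes.

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Types;
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of "transport one record": turn the current ResultSet row
// into an in-memory record, column by column, based on the JDBC type.
// This is NOT the real transportOneRecord implementation; DataX builds a
// Record of typed Column objects and pushes it through the RecordSender.
public class RowToRecordSketch {

    static List<Object> buildRecord(ResultSet rs, ResultSetMetaData meta) throws SQLException {
        int columnCount = meta.getColumnCount();
        List<Object> record = new ArrayList<>(columnCount);
        for (int i = 1; i <= columnCount; i++) {
            switch (meta.getColumnType(i)) {
                case Types.INTEGER:
                case Types.BIGINT:
                case Types.SMALLINT:
                    record.add(rs.getLong(i));      // numeric column
                    break;
                case Types.VARCHAR:
                case Types.CHAR:
                    record.add(rs.getString(i));    // string column
                    break;
                case Types.TIMESTAMP:
                case Types.DATE:
                    record.add(rs.getTimestamp(i)); // date/time column
                    break;
                default:
                    record.add(rs.getObject(i));    // fall back to a generic object
            }
        }
        return record;
    }
}
```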
2. Data transmission
Core interface: RecordSender
```java
public interface RecordSender {

    public Record createRecord();

    public void sendToWriter(Record record);

    public void flush();

    public void terminate();

    public void shutdown();
}
```
Core interface: RecordReceiver
```java
public interface RecordReceiver {

    public Record getFromReader();

    public void shutdown();
}
```
Core Class: BufferedRecordExchanger
```java
class BufferedRecordExchanger implements RecordSender, RecordReceiver
```
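BufferedRecordExchanger sits between Reader and Writer and plays both roles at once. The following is a much-simplified sketch of that pattern: a single bounded buffer that the reader side fills and the writer side drains. It is an illustration under stated assumptions, not the actual DataX class, which also batches records and enforces byte limits; the Sender/Receiver mini-interfaces and String records are stand-ins.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Much-simplified model of an exchanger that is both the sender (reader side)
// and the receiver (writer side). A String stands in for a DataX Record.
public class ExchangerSketch {

    interface Sender { void sendToWriter(String record) throws InterruptedException; }
    interface Receiver { String getFromReader() throws InterruptedException; }

    static class SimpleExchanger implements Sender, Receiver {
        private static final String TERMINATOR = "__EOF__";
        private final BlockingQueue<String> channel = new ArrayBlockingQueue<>(2048);

        @Override
        public void sendToWriter(String record) throws InterruptedException {
            channel.put(record);              // blocks when the buffer is full (flow control)
        }

        @Override
        public String getFromReader() throws InterruptedException {
            String record = channel.take();   // blocks until the reader produces data
            return TERMINATOR.equals(record) ? null : record;
        }

        // Signal the writer side that no more records will arrive.
        public void terminate() throws InterruptedException {
            channel.put(TERMINATOR);
        }
    }

    public static void main(String[] args) throws Exception {
        SimpleExchanger exchanger = new SimpleExchanger();

        Thread reader = new Thread(() -> {
            try {
                for (int i = 1; i <= 3; i++) {
                    exchanger.sendToWriter("record-" + i);
                }
                exchanger.terminate();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread writer = new Thread(() -> {
            try {
                String record;
                while ((record = exchanger.getFromReader()) != null) {
                    System.out.println("writer got: " + record);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}
```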
3. Writing data
Core entry: PostgresqlWriter
Start Write Task
```java
public static class Task extends Writer.Task {

    public void startWrite(RecordReceiver recordReceiver) {
        this.commonRdbmsWriterSlave.startWrite(recordReceiver, this.writerSliceConfig,
                super.getTaskPluginCollector());
    }
}
```
After the write task starts, the actual data write is performed.
Core class: CommonRdbmsWriter
```java
public void startWriteWithConnection(RecordReceiver recordReceiver, Connection connection) {
    // Build the SQL statement used for writing to the database
    calcWriteRecordSql();

    List<Record> writeBuffer = new ArrayList<>(this.batchSize);
    int bufferBytes = 0;
    try {
        Record record;
        while ((record = recordReceiver.getFromReader()) != null) {
            writeBuffer.add(record);
            bufferBytes += record.getMemorySize();
            if (writeBuffer.size() >= batchSize || bufferBytes >= batchByteSize) {
                doBatchInsert(connection, writeBuffer);
                writeBuffer.clear();
                bufferBytes = 0;
            }
        }
        if (!writeBuffer.isEmpty()) {
            doBatchInsert(connection, writeBuffer);
            writeBuffer.clear();
            bufferBytes = 0;
        }
    } catch (Exception e) {
        throw DataXException.asDataXException(DBUtilErrorCode.WRITE_DATA_ERROR, e);
    } finally {
        writeBuffer.clear();
        bufferBytes = 0;
        DBUtil.closeDBResources(null, null, connection);
    }
}
```
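doBatchInsert is where the buffered records finally hit the database. The sketch below shows the general JDBC batch-insert pattern it relies on (PreparedStatement batching with a manual commit); it is an illustrative assumption rather than the actual CommonRdbmsWriter code, and plain String arrays stand in for DataX Record objects.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Generic JDBC batch-insert pattern, similar in spirit to doBatchInsert.
// String[] rows stand in for DataX Record objects to keep the sketch self-contained.
public class BatchInsertSketch {

    static void batchInsert(Connection connection, List<String[]> rows) throws SQLException {
        String sql = "INSERT INTO sync_user (id, user_name, user_age) VALUES (?, ?, ?)";
        boolean previousAutoCommit = connection.getAutoCommit();
        connection.setAutoCommit(false);                  // commit the whole batch at once
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            for (String[] row : rows) {
                ps.setInt(1, Integer.parseInt(row[0]));   // id
                ps.setString(2, row[1]);                  // user_name
                ps.setInt(3, Integer.parseInt(row[2]));   // user_age
                ps.addBatch();
            }
            ps.executeBatch();
            connection.commit();
        } catch (SQLException e) {
            connection.rollback();                        // undo the partial batch on failure
            throw e;
        } finally {
            connection.setAutoCommit(previousAutoCommit);
        }
    }
}
```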
5. Source Code Address
GitHub address: https://github.com/cicadasmile/data-manage-parent
GitEE address: https://gitee.com/cicadasmile/data-manage-parent