Hbase Foundation: Three Synchronization Schemes and Practice of MySQL Data Insertion for Billion Level Data from MySQL to Hbase

This article is about Learning Guide for Big Data Specialists from Zero (Full Upgrade) Supplement to the Base section.

This section is titled: Three synchronization schemes and practices for billion-level data from MySQL to Hbase. First, you need to know how to quickly insert MySQL.

Then MySQL data insertion will start with the following methods:

  • load data infile

  • Python Single Insert

  • Python Multithreaded Insert

Of course, it can also be implemented in other languages!!!

Let's talk about data insertion step by step!

Before inserting data, we need to know our data, so let's look at the data field description:

Data to ASCII Text representation, comma delimited, carriage return line break(0x0D 0x0A)End. Data items and order: vehicle identification, trigger events, operational status, GPS Time, GPS Longitude, GPS latitude,,GPS Speed, GPS Direction, GPS state
 Vehicle ID: 6 characters
 Trigger event: 0=Empty car, 1=Variable Carrier, 2=Fortification, 3=Withdrawal, 4=Other
 Operational status: 0=Empty car, 1=Carrier, 2=Parking, 3=Outage, 4=Other
GPS Time: Format yyyymmddhhnnss,Beijing Time
GPS Longitude: Format ddd.ddddddd,In degrees.
GPS Latitude: Format dd.ddddddd,In degrees.
GPS Speed: Format ddd,Value 000-255 Internal integer in kilometers/Hours are units.
GPS Azimuth: Format ddd,Value 000-360 Internal integer in degrees.
GPS Status: 0=Invalid, 1=effective
 End string:carriage return+Line Break

Examples of data:


All you need to do is pair the above data fields with the data, one line at a time.

Start by writing the Create Database and Table command:

create database loaddb;
CREATE TABLE loadTable(id int primary key not null auto_increment,
carflag VARCHAR(6),touchevent CHAR(1),opstatus CHAR(1),gpstime DATETIME,
gpslongitude DECIMAL(10,7),gpslatitude DECIMAL(9,7),gpsspeed TINYINT,
gpsorientation SMALLINT,gpsstatus CHAR(1))engine=MyISAM;

Note: The MyISAM engine was selected above because the load command ensures efficient data insertion!

3.1 load data infile

load data infile is very fast in importing large data scenarios! Specific instructions will be detailed later in the comparison, here is the syntax, as follows:

load data local infile "/home/light/mysql/gps1.txt" into table loadTable fields terminated by ',' lines terminated by "\n"  (carflag, touchevent, opstatus,gpstime,gpslongitude,gpslatitude,gpsspeed,gpsorientation,gpsstatus);

When you use this command, you use it on the clinet side of MySQL. Just hit this command when you log in! You will see several key points in the data field description: comma delimiter, carriage return line break, which corresponds to the above code:

fields terminated by ',' lines terminated by "\n"
Note: Change your own dataset path!

3.2 Python Batch Insert

Python Single insert uses pymysql The following is part of the code, see the full code:

Bulk Submit Source

with open('/home/light/mysql/gps1.txt', 'r') as fp:
    for line in fp:
        count += 1
        if count and count%70000==0:
            # Perform multiline insertion, executemany(sql statement, data (requires a tuple type))
            self.cur.executemany(sql, data_list)
            # Submit data, must submit, otherwise data will not be saved
            data_list = []
            print("Submitted:" + str(count) + "Bar data")
if data_list:
    # Perform multiline insertion, executemany(sql statement, data (requires a tuple type))
    self.cur.executemany(sql, data_list)
    # Submit data, must submit, otherwise data will not be saved
    print("Submitted:" + str(count) + "Bar data")
self.cur.close()  # Close Cursor
self.conn.close()  # Close pymysql connection

One of the key points above needs to be clarified:

(1) Use executemany instead of execute, this submission will be faster! (2) Use batch insert instead of single insert submission, which will improve efficiency!

3.3 Python Multithreaded Insertion

The original data is a gps1.txt file, which is too large to be easily inserted with multiple threads. So first use file slicing method, then insert with multiple threads.

For file slicing, click here: File slicing source code.

Use multithreaded source in Python

def multicore(self):
    file_list = [1,2324,4648,6972,9298]
    m1 = mp.Process(target=self.run, args=(file_list[0],file_list[1],'m1',))
    m2 = mp.Process(target=self.run, args=(file_list[1]+1,file_list[2],'m2',))
    m3 = mp.Process(target=self.run, args=(file_list[2]+1,file_list[3],'m3',))
    m4 = mp.Process(target=self.run, args=(file_list[3]+1,file_list[4],'m4',))

The specific insertion idea is to use four threads to read the data of each section separately, and then insert the data in batch! If you don't know this partner, please leave a message.

3.4 Comparison of MySQL data import methods

The load command differs from a normal insert

Same pointDifference
Both are inserted by reading the local txt file and reading at the same delimitation.Program insertion is essentially an indirect execution of insert statements. load data is designed to load large amounts of tabular data in a single operation.

Efficiency comparison


Both take the following time:

First: load data (the result of inserting Innodb engine tables is captured here, it will be faster than it is now when using MyISAM!)

enter image description here

It took 1 hour and 11 minutes.

Second: program insertion (only bulk inserts are intercepted here!)

Time consumed: 27322.45/36=7.58h

The comparison above shows that load data is very efficient because it uses the load data infile method and the second is the traditional insert method.

The main reason is that MySQL has different internal processing mechanisms for load and insert.

Load handles this by turning off the index before the load is executed and re-creating the index when the load is fully executed.

Insert updates the database and index every insert.

In addition, the difference between load and insert is that load eliminates sql statement parsing and sql Engine processing, but instead generates file data blocks directly, so it is much faster than Insert.

Tags: HBase Algorithm Machine Learning

Posted on Sun, 19 Sep 2021 19:52:55 -0400 by ben14