pg_bulkload: fast data loading for PostgreSQL

pg_bulkload

GitHub repository: https://github.com/ossc-db/pg_bulkload

Introduction

PostgreSQL ships with a convenient data loading tool, the copy command. copy is built into the database and supports loading data from files into tables as well as unloading data from tables into files. pg_bulkload is a high-speed data loading tool for PostgreSQL, and its biggest advantage over copy is speed: it skips the shared buffer and the WAL buffer and writes data files directly. pg_bulkload's direct mode is built on this idea. It also includes a data recovery function: if a direct-mode import fails, the database needs to be recovered.
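Because direct mode bypasses normal WAL-based crash recovery, the recovery function is run from the command line with the server stopped. A minimal sketch based on the recovery options in the help output further below; the data directory path here is only an example:

# run recovery after an interrupted direct-mode load, before restarting PostgreSQL
pg_bulkload -r -D /home/thunisoft/abdata/7.0/abase1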

Architecture diagram

pg_bulkload consists of two main modules: a reader and a writer. The reader reads the input file and parses it into tuples; the writer writes the parsed tuples to the output target. The original version of pg_bulkload was simple and only loaded data; version 3.1 added data filtering.
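To make the reader/writer split concrete: instead of passing everything with -o as in the examples below, the same keys can be collected in a control file and handed to pg_bulkload as its positional argument (see the Usage line in the help output). A minimal sketch with a hypothetical file name; the keys match the load log shown later, where INPUT/TYPE/DELIMITER configure the reader and OUTPUT/WRITER/TRUNCATE configure the writer:

cat > /home/thunisoft/test.ctl <<EOF
INPUT = /home/thunisoft/test.csv
TYPE = CSV
DELIMITER = ,
OUTPUT = public.test
WRITER = DIRECT
TRUNCATE = YES
EOF

pg_bulkload -d db_sqlfx -p 6543 -l /home/thunisoft/test.log /home/thunisoft/test.ctl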

install

unzip pg_bulkload-master.zip
cd pg_bulkload-master
make
make install

After the installation, create the extension in the database

create extension pg_bulkload;

pg_bulkload parameter

[thunisoft@sdbserver1 pg_bulkload-master]$ pg_bulkload --help
pg_bulkload is a bulk data loading tool for PostgreSQL

Usage:
Dataload: pg_bulkload [dataload options] control_file_path
Recovery: pg_bulkload -r [-D DATADIR]

Dataload options:
-i, --input=INPUT INPUT path or function
-O, --output=OUTPUT OUTPUT path or table
-l, --logfile=LOGFILE LOGFILE path
-P, --parse-badfile=* PARSE_BADFILE path
-u, --duplicate-badfile=* DUPLICATE_BADFILE path
-o, --option="key=val" additional option

Recovery options:
-r, --recovery execute recovery
-D, --pgdata=DATADIR database directory

Connection options:
-d, --dbname=DBNAME database to connect
-h, --host=HOSTNAME database server host or socket directory
-p, --port=PORT database server port
-U, --username=USERNAME user name to connect as
-w, --no-password never prompt for password
-W, --password force password prompt

Generic options:
-e, --echo echo queries
-E, --elevel=LEVEL set output message level
--help show this help, then exit
--version output version information, then exit

Read the website for details. http://github.com/ossc-db/pg_bulkload
Report bugs to http://github.com/ossc-db/pg_bulkload/issues.

The keys accepted by -o are not spelled out in the help output; you can see the effective parameter configuration in the test.log file written by the import below.
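For reference, the keys that -o accepts are the same ones echoed in that log. A short summary of the most common ones (the one-line descriptions are my own glosses; the values are examples):

TYPE=CSV               input format (CSV, BINARY, FIXED, FUNCTION)
DELIMITER=,            field delimiter for CSV input
SKIP=0                 number of leading input lines to skip
LIMIT=INFINITE         maximum number of rows to load
PARSE_ERRORS=0         number of parse errors to tolerate
DUPLICATE_ERRORS=0     number of unique-violation rows to tolerate
ON_DUPLICATE_KEEP=NEW  which row wins on a duplicate key (NEW or OLD)
CHECK_CONSTRAINTS=NO   whether CHECK constraints are evaluated during the load
WRITER=DIRECT          load mode: DIRECT, BUFFERED or PARALLEL
MULTI_PROCESS=YES      run reading and writing in separate processes
TRUNCATE=YES           truncate the target table before loading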

pg_bulkload usage

db_sqlfx=# create table test(id int,info text ,crt_time timestamp(0));
CREATE TABLE
db_sqlfx=# insert into test select generate_series(1,5000000),md5(random()::text),clock_timestamp();
INSERT 0 5000000
db_sqlfx=# create index i_test_id on test(id);
CREATE INDEX
db_sqlfx=# copy test to '/home/thunisoft/test.csv' with(format 'csv');
COPY 5000000

--Import directly, without dropping the index first
[thunisoft@sdbserver1 pg_bulkload-master]$ pg_bulkload -i /home/thunisoft/test.csv -O test -l /home/thunisoft/test.log -p 6543 -d db_sqlfx -o "TYPE=CSV" -o "WRITER=PARALLEL"  -o "TRUNCATE=YES"
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
	0 Rows skipped.
	5000000 Rows successfully loaded.
	0 Rows not loaded due to parse errors.
	0 Rows not loaded due to duplicate errors.
	0 Rows replaced with new rows.


Note that when exporting with copy you need to specify with (format 'csv'); otherwise an error will be reported when the file is loaded.
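If the export also includes a CSV header line, the SKIP option (it appears in the log below) can skip it during the load. A small sketch with a hypothetical file name:

psql -d db_sqlfx -p 6543 -c "copy test to '/home/thunisoft/test_header.csv' with (format 'csv', header)"
pg_bulkload -i /home/thunisoft/test_header.csv -O test -l /home/thunisoft/test.log -d db_sqlfx -p 6543 -o "TYPE=CSV" -o "SKIP=1"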

pg_bulkload elapsed time

21.03 sec

[thunisoft@sdbserver1 pg_bulkload-master]$ cat /home/thunisoft/test.log

pg_bulkload 3.1.18 on 2021-09-22 18:13:26.168808+08

INPUT = /home/thunisoft/test.csv
PARSE_BADFILE = /home/thunisoft/abdata/7.0/abase1/pg_bulkload/20210922181326_db_sqlfx_public_test.prs.csv
LOGFILE = /home/thunisoft/test.log
LIMIT = INFINITE
PARSE_ERRORS = 0
CHECK_CONSTRAINTS = NO
TYPE = CSV
SKIP = 0
DELIMITER = ,
QUOTE = """
ESCAPE = """
NULL = 
OUTPUT = public.test
MULTI_PROCESS = YES
VERBOSE = NO
WRITER = DIRECT
DUPLICATE_BADFILE = /home/thunisoft/abdata/7.0/abase1/pg_bulkload/20210922181326_db_sqlfx_public_test.dup.csv
DUPLICATE_ERRORS = 0
ON_DUPLICATE_KEEP = NEW
TRUNCATE = YES

  0 Rows skipped.
  5000000 Rows successfully loaded.
  0 Rows not loaded due to parse errors.
  0 Rows not loaded due to duplicate errors.
  0 Rows replaced with new rows.

Run began on 2021-09-22 18:13:26.168808+08
Run ended on 2021-09-22 18:13:47.19649+08

CPU 0.66s/8.85u sec elapsed 21.03 sec

copy import elapsed time

--Export data
db_sqlfx=#  copy test to '/home/thunisoft/test.sql';
COPY 5000000

--Clean up data
db_sqlfx=# truncate table test;
TRUNCATE TABLE
Time: 95.247 ms

--Import
db_sqlfx=# copy test from '/home/thunisoft/test.csv';
COPY 5000000
Time: 38811.062 ms (00:38.811)

With the index in place, pg_bulkload (about 21 s) is clearly faster than copy (about 39 s).

How do pg_bulkload and copy compare when no index exists?

--Delete index
drop index i_test_id;
--copy import: 24 s
db_sqlfx=# copy test from '/home/thunisoft/test.csv';
COPY 5000000
Time: 24227.334 ms (00:24.227)
--pg_bulkload import: 10 s
--Create index 10 s
db_sqlfx=# CREATE INDEX i_test_id ON public.test USING btree (id);
CREATE INDEX
Time: 10552.171 ms (00:10.552)

unlogged table

--Keep the index, clear the data, then mark the table and its index as unlogged directly in pg_class
update pg_class set relpersistence='u' where relname='test';  
update pg_class set relpersistence='u' where relname='i_test_id'; 

--pg_bulkload 17s
Run began on 2021-09-23 16:22:52.192337+08
Run ended on 2021-09-23 16:23:09.726125+08
CPU 0.63s/7.43u sec elapsed 17.53 sec

--copy
db_sqlfx=# copy test from '/home/thunisoft/test.csv';
COPY 5000000
Time: 27020.258 ms (00:27.020)

copy improves a lot once WAL logging is skipped, while pg_bulkload gains little because it already bypasses shared_buffers and the WAL.
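Updating pg_class directly, as above, is only a quick hack for this test. On PostgreSQL 9.5 and later the supported way to switch a table (and its indexes) between logged and unlogged is ALTER TABLE, roughly:

alter table test set unlogged;   -- rewrites the table and marks it and its indexes unlogged
-- ... load the data ...
alter table test set logged;     -- switches back; this rewrites the table and writes WAL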

A batch script is attached:

-bash-4.1$ cat load.sh
#!/bin/sh

# $1 - data file name (gzip-compressed CSV named <table>.csv.gz)

file=$1

if [ ! -f "$file" ]
then
    echo "File does not exist"
    exit 1
fi

echo "-----------------------------------------------------------------"

# derive the table name from the file name (everything before the first dot)
tbname=$( echo "$file" | cut -d . -f1 )
echo "Table name is: $tbname"

# decompress and stream the data into pg_bulkload via stdin
zcat "$file" | pg_bulkload -i stdin -O public.$tbname -l $tbname.log -o "TYPE=CSV" -o "WRITER=PARALLEL" -d sgdw

echo "load complete"
echo "-----------------------------------------------------------------"
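Usage, assuming a gzip-compressed CSV named after the target table and a reachable sgdw database:

gzip -c test.csv > test.csv.gz   # example: compress the exported file
./load.sh test.csv.gz            # loads it into public.test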

summary

1. With an index, pg_bulkload is roughly twice as fast as copy; without an index it is still faster.

2. copy performs better on an unlogged table, but it still lags behind pg_bulkload.

3. pg_bulkload's advantage is clear: it bypasses the shared buffer and the WAL buffer and writes data files directly. Like an unlogged table, it writes no WAL, so in a streaming-replication setup the standby has to be rebuilt and the base backup has to be taken again after a direct-mode load (see the note after this list for an alternative).

4. For large tables, import performance is very good.
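Regarding point 3: if rebuilding standbys is not acceptable, pg_bulkload can also load through the regular buffer manager with WAL by using the buffered writer instead of direct mode; it is slower, but streaming replicas and base backups stay valid. A sketch reusing the earlier example:

pg_bulkload -i /home/thunisoft/test.csv -O test -l /home/thunisoft/test.log -d db_sqlfx -p 6543 -o "TYPE=CSV" -o "WRITER=BUFFERED" -o "TRUNCATE=YES"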
