pg_bulkload
GitHub repository: https://github.com/ossc-db/pg_bulkload
Introduction
PostgreSQL ships with a convenient data loading tool, the copy command, which loads data from files into tables and unloads data from tables back to files. pg_bulkload is a high-speed data loading tool for PostgreSQL. Compared with copy, its biggest advantage is speed: it skips the shared buffer and the WAL buffer and writes the data files directly, which is the idea behind its direct mode. It also provides a recovery function, so that if a load fails the database can be restored to a consistent state.
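As a minimal sketch of the recovery invocation (the -r and -D options appear in the help output further below; the data directory path here is only a placeholder):

# sketch: repair the cluster after a failed direct-mode load,
# typically run against the stopped data directory before restarting PostgreSQL
pg_bulkload -r -D /path/to/datadir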
Architecture diagram
pg_bulkload consists of two main modules: a reader and a writer. The reader reads the input file and parses it into tuples; the writer writes the parsed tuples to the output target. The original version of pg_bulkload was simple and only loaded data; version 3.1 added a data filtering feature (a sketch follows below).
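As an illustration only of the 3.1 filtering feature, a rough sketch, assuming (per the pg_bulkload documentation) that a filter is a SQL function whose arguments correspond to the parsed input columns and whose result record feeds the target table; the function name and columns here are hypothetical:

-- hypothetical filter: lower-case the second input column before it is written
CREATE FUNCTION my_filter(integer, text, timestamp) RETURNS record
    AS $$ SELECT $1, lower($2), $3 $$
    LANGUAGE SQL;

Such a function would then be referenced from the load, for example with -o "FILTER=my_filter".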
Installation
unzip pg_bulkload-master.zip
cd pg_bulkload-master
make
make install
After the installation, create the extension in the database
create extension pg_bulkload;
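A quick way to confirm that the extension is installed in the target database, for example:

select extname, extversion from pg_extension where extname = 'pg_bulkload';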
pg_bulkload parameters
[thunisoft@sdbserver1 pg_bulkload-master]$ pg_bulkload --help
pg_bulkload is a bulk data loading tool for PostgreSQL
Usage:
Dataload: pg_bulkload [dataload options] control_file_path
Recovery: pg_bulkload -r [-D DATADIR]
Dataload options:
-i, --input=INPUT INPUT path or function
-O, --output=OUTPUT OUTPUT path or table
-l, --logfile=LOGFILE LOGFILE path
-P, --parse-badfile=* PARSE_BADFILE path
-u, --duplicate-badfile=* DUPLICATE_BADFILE path
-o, --option="key=val" additional option
Recovery options:
-r, --recovery execute recovery
-D, --pgdata=DATADIR database directory
Connection options:
-d, --dbname=DBNAME database to connect
-h, --host=HOSTNAME database server host or socket directory
-p, --port=PORT database server port
-U, --username=USERNAME user name to connect as
-w, --no-password never prompt for password
-W, --password force password prompt
Generic options:
-e, --echo echo queries
-E, --elevel=LEVEL set output message level
--help show this help, then exit
--version output version information, then exit
Read the website for details. http://github.com/ossc-db/pg_bulkload
Report bugs to http://github.com/ossc-db/pg_bulkload/issues.
The values accepted by -o are not documented in the help output; the full set of parameters applied to a load can be seen in the test.log file produced by the import below.
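The same keys can also be collected in a control file that is passed as control_file_path (see the usage line in the help above). A sketch mirroring the -o options used in the example below, with a hypothetical file name test.ctl:

INPUT = /home/thunisoft/test.csv
OUTPUT = public.test
TYPE = CSV
WRITER = PARALLEL
TRUNCATE = YES
LOGFILE = /home/thunisoft/test.log

The load would then be started with: pg_bulkload test.ctl -d db_sqlfx -p 6543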
pg_bulkload usage
db_sqlfx=# create table test(id int,info text ,crt_time timestamp(0));
CREATE TABLE
db_sqlfx=# insert into test select generate_series(1,5000000),md5(random()::text),clock_timestamp();
INSERT 0 5000000
db_sqlfx=# create index i_test_id on test(id);
CREATE INDEX
db_sqlfx=# copy test to '/home/thunisoft/test.csv' with(format 'csv');
COPY 5000000

--Import directly, without dropping the index first
[thunisoft@sdbserver1 pg_bulkload-master]$ pg_bulkload -i /home/thunisoft/test.csv -O test -l /home/thunisoft/test.log -p 6543 -d db_sqlfx -o "TYPE=CSV" -o "WRITER=PARALLEL" -o "TRUNCATE=YES"
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
        0 Rows skipped.
        5000000 Rows successfully loaded.
        0 Rows not loaded due to parse errors.
        0 Rows not loaded due to duplicate errors.
        0 Rows replaced with new rows.
Note that the copy export must specify with(format 'csv'); otherwise pg_bulkload will report an error when loading the file.
pg_bulkload elapsed time
21.03 sec
[thunisoft@sdbserver1 pg_bulkload-master]$ cat /home/thunisoft/test.log

pg_bulkload 3.1.18 on 2021-09-22 18:13:26.168808+08

INPUT = /home/thunisoft/test.csv
PARSE_BADFILE = /home/thunisoft/abdata/7.0/abase1/pg_bulkload/20210922181326_db_sqlfx_public_test.prs.csv
LOGFILE = /home/thunisoft/test.log
LIMIT = INFINITE
PARSE_ERRORS = 0
CHECK_CONSTRAINTS = NO
TYPE = CSV
SKIP = 0
DELIMITER = ,
QUOTE = """
ESCAPE = """
NULL =
OUTPUT = public.test
MULTI_PROCESS = YES
VERBOSE = NO
WRITER = DIRECT
DUPLICATE_BADFILE = /home/thunisoft/abdata/7.0/abase1/pg_bulkload/20210922181326_db_sqlfx_public_test.dup.csv
DUPLICATE_ERRORS = 0
ON_DUPLICATE_KEEP = NEW
TRUNCATE = YES

        0 Rows skipped.
        5000000 Rows successfully loaded.
        0 Rows not loaded due to parse errors.
        0 Rows not loaded due to duplicate errors.
        0 Rows replaced with new rows.

Run began on 2021-09-22 18:13:26.168808+08
Run ended on 2021-09-22 18:13:47.19649+08

CPU 0.66s/8.85u sec elapsed 21.03 sec
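Note that although the command line asked for WRITER=PARALLEL, the log records WRITER = DIRECT with MULTI_PROCESS = YES: the parallel writer appears to be the direct writer plus a separate reader process. To load through the shared buffers (and WAL) instead, the buffered writer can presumably be selected, for example:

pg_bulkload -i /home/thunisoft/test.csv -O test -l /home/thunisoft/test.log -p 6543 -d db_sqlfx -o "TYPE=CSV" -o "WRITER=BUFFERED" -o "TRUNCATE=YES"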
copy import elapsed time
--Export data
db_sqlfx=# copy test to '/home/thunisoft/test.sql';
COPY 5000000

--Clean up data
db_sqlfx=# truncate table test;
TRUNCATE TABLE
Time: 95.247 ms

--Import
db_sqlfx=# copy test from '/home/thunisoft/test.csv';
COPY 5000000
Time: 38811.062 ms (00:38.811)
With the index in place, pg_bulkload (about 21 s) is clearly faster than copy (about 39 s).
How do pg_bulkload and copy compare when the table has no index?
--Delete the index
drop index i_test_id;

--copy import: 24 s
db_sqlfx=# copy test from '/home/thunisoft/test.csv';
COPY 5000000
Time: 24227.334 ms (00:24.227)

--pg_bulkload import: 10 s
pg_bulkload: 10 s

--Create the index: 10 s
db_sqlfx=# CREATE INDEX i_test_id ON public.test USING btree (id);
CREATE INDEX
Time: 10552.171 ms (00:10.552)
unlogged table
--Keep the index, truncate the data, then mark the table and its index as unlogged
update pg_class set relpersistence='u' where relname='test';
update pg_class set relpersistence='u' where relname='i_test_id';

--pg_bulkload: 17.53 s
Run began on 2021-09-23 16:22:52.192337+08
Run ended on 2021-09-23 16:23:09.726125+08
CPU 0.63s/7.43u sec elapsed 17.53 sec

--copy: 27 s
db_sqlfx=# copy test from '/home/thunisoft/test.csv';
COPY 5000000
Time: 27020.258 ms (00:27.020)
Without WAL logging, copy speeds up considerably, while the gain for pg_bulkload is small because its direct mode already bypasses shared_buffers and the WAL.
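Directly updating pg_class as above is fine for a quick test; on PostgreSQL 9.5 and later the supported way to switch a table (and its indexes) between logged and unlogged is:

db_sqlfx=# alter table test set unlogged;
db_sqlfx=# alter table test set logged;   -- switch back after the load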
A batch script is attached:
-bash-4.1$ cat load.sh
#!/bin/sh

# $1: data file name
file=$1

if [ ! -f $file ]
then
    echo "File does not exist"
    exit 1
fi

echo "-----------------------------------------------------------------"

tbname=$( echo $file | cut -d . -f1 )
echo "Table name is : "$tbname

zcat $file | pg_bulkload -i stdin -O public.$tbname -l $tbname.log -o "TYPE=CSV" -o "WRITER=PARALLEL" -d sgdw

echo "load complete"
echo "-----------------------------------------------------------------"
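Assuming the data file is a gzip-compressed CSV named after the target table (for example test.csv.gz loading into public.test), the script would be invoked like this:

-bash-4.1$ chmod +x load.sh
-bash-4.1$ ./load.sh test.csv.gz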
Summary
1. With an index, pg_bulkload is roughly twice as fast as copy; without an index it is still noticeably faster (about 10 s vs 24 s here).
2. On an unlogged table, copy performance improves, but it still lags behind pg_bulkload.
3. pg_bulkload's advantage is clear: it bypasses the shared buffer and the WAL buffer and writes the data files directly. Like an unlogged table, this mode writes no WAL, so in a streaming-replication setup the standby has to be rebuilt after a direct-mode load, and base backups have to be retaken as well.
4. For large tables, the import performance is very good.