1. Data generation
Tool download
git clone git@github.com:electrum/tpch-dbgen.git
Modify makefile.suite in the dbgen directory
################
## CHANGE NAME OF ANSI COMPILER HERE
################
CC = gcc
# Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
#                                  SQLSERVER, SYBASE, ORACLE
# Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS,
#                                  SGI, SUN, U2200, VMS, LINUX, WIN32
# Current values for WORKLOAD are: TPCH
DATABASE = POSTGRESQL
MACHINE = LINUX
WORKLOAD = TPCH
Modify tpcd.h in the dbgen directory
Add the following code:
#ifdef POSTGRESQL
#define GEN_QUERY_PLAN "EXPLAIN"
#define START_TRAN "BEGIN TRANSACTION"
#define END_TRAN "COMMIT;"
#define SET_OUTPUT ""
#define SET_ROWCOUNT "LIMIT %d\n"
#define SET_DBASE ""
#endif
Run the following command in the dbgen directory to build the dbgen and qgen binaries
make -f makefile.suite
Generate the .tbl data files in the dbgen directory
./dbgen -s 1 -f
-s 1 sets the scale factor to 1, generating roughly 1 GB of data
-f overwrites any previously generated files
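For example, to generate roughly 10 GB instead, raise the scale factor:
./dbgen -s 10 -f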
View generated data
ls | grep '.*.tbl'
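This should list the eight TPC-H table files (lineitem.tbl is by far the largest):
customer.tbl
lineitem.tbl
nation.tbl
orders.tbl
part.tbl
partsupp.tbl
region.tbl
supplier.tbl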
2. Import data into Postgres database
Create database
createdb -p 5432 -e -U postgres tpch
-p specifies the connection port (5432)
-e echoes the commands sent to the server
-U specifies the user creating the database (postgres)
tpch is the name of the new database
Create tables
The table structure is in the dss.ddl file in the dbgen directory
psql -d tpch < dss.ddl
View created tables
psql -d tpch -c '\dt+'
Import the table data
for i in `ls *.tbl`; do
  echo $i
  # dbgen ends every row with a trailing '|'; strip it so COPY does not
  # see an extra empty column (process only this file, not all files each pass)
  sed -i 's/|$//' "$i"
  name=`echo $i | cut -d'.' -f1`
  psql -d tpch -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING 'LATIN1';"
done
Verify the data import
psql -d tpch -c '\dt+'
You can see that the tables now contain data.
# View the number of rows in a table (at scale factor 1, lineitem holds 6,001,215 rows)
psql -d tpch -c "select count(*) from lineitem";
Add table constraints
The constraints are in the dss.ri file; change the TPCD database name in that file to your own database name. Alternatively, execute the following statements directly in the database.
-- For table REGION
ALTER TABLE REGION ADD PRIMARY KEY (R_REGIONKEY);
-- For table NATION
ALTER TABLE NATION ADD PRIMARY KEY (N_NATIONKEY);
ALTER TABLE NATION ADD FOREIGN KEY (N_REGIONKEY) REFERENCES REGION;
COMMIT WORK;
-- For table PART
ALTER TABLE PART ADD PRIMARY KEY (P_PARTKEY);
COMMIT WORK;
-- For table SUPPLIER
ALTER TABLE SUPPLIER ADD PRIMARY KEY (S_SUPPKEY);
ALTER TABLE SUPPLIER ADD FOREIGN KEY (S_NATIONKEY) REFERENCES NATION;
COMMIT WORK;
-- For table PARTSUPP
ALTER TABLE PARTSUPP ADD PRIMARY KEY (PS_PARTKEY, PS_SUPPKEY);
COMMIT WORK;
-- For table CUSTOMER
ALTER TABLE CUSTOMER ADD PRIMARY KEY (C_CUSTKEY);
ALTER TABLE CUSTOMER ADD FOREIGN KEY (C_NATIONKEY) REFERENCES NATION;
COMMIT WORK;
-- For table LINEITEM
ALTER TABLE LINEITEM ADD PRIMARY KEY (L_ORDERKEY, L_LINENUMBER);
COMMIT WORK;
-- For table ORDERS
ALTER TABLE ORDERS ADD PRIMARY KEY (O_ORDERKEY);
COMMIT WORK;
-- For table PARTSUPP
ALTER TABLE PARTSUPP ADD FOREIGN KEY (PS_SUPPKEY) REFERENCES SUPPLIER;
COMMIT WORK;
ALTER TABLE PARTSUPP ADD FOREIGN KEY (PS_PARTKEY) REFERENCES PART;
COMMIT WORK;
-- For table ORDERS
ALTER TABLE ORDERS ADD FOREIGN KEY (O_CUSTKEY) REFERENCES CUSTOMER;
COMMIT WORK;
-- For table LINEITEM
ALTER TABLE LINEITEM ADD FOREIGN KEY (L_ORDERKEY) REFERENCES ORDERS;
COMMIT WORK;
ALTER TABLE LINEITEM ADD FOREIGN KEY (L_PARTKEY, L_SUPPKEY) REFERENCES PARTSUPP;
COMMIT WORK;
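To check that the constraints are in place, display a table's definition; the output lists its primary-key index and foreign-key constraints:
psql -d tpch -c '\d lineitem'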
3. Generate query statements
Generate a single query statement
Copy qgen and dists.dss to the queries directory, then cd into queries and run:
./qgen -d 1 > d1.sql
-d uses the default substitution parameters, 1 selects query template 1, and d1.sql is the name of the generated SQL file
Generate SQL files in batch
for i in {1..22}; do
  name="d$i.sql"
  echo $name
  ./qgen -d $i > $name
done
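The queries directory should now contain d1.sql through d22.sql:
ls d*.sql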
Execute a query statement
psql -p 5432 -U postgres -d tpch -f ./d1.sql
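The metrics described below are all based on per-query execution times, so it is handy to time each query as it runs. A minimal sketch, which discards the query output and reports only the elapsed time:
for i in {1..22}; do
  echo "Query $i:"
  # Redirect result rows away; only the timing matters here
  time psql -p 5432 -U postgres -d tpch -f ./d$i.sql > /dev/null
done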
Query template description
Q01 is a single-table query with grouping, sorting, and aggregation. It reads 95% to 97% of the rows in the table.
Q02 is a multi-table query with sorting, aggregation, and subqueries. The statement does not syntactically limit how many tuples are returned, but the TPC-H standard stipulates that only the first 100 rows of the result are returned (usually depending on the application implementation).
Q03 is a three-table join with grouping, sorting, and aggregation. The statement does not syntactically limit how many tuples are returned, but the TPC-H standard stipulates that only the first 10 rows of the result are returned (usually depending on the application implementation).
Q04 is a single-table query with grouping, sorting, aggregation, and a subquery. The subquery is a correlated subquery.
Q05 is a multi-table join with grouping, sorting, aggregation, and a subquery.
Q06 is a single-table query with aggregation. The statement uses the BETWEEN-AND operator, which some databases can optimize.
Q07 is a multi-table query with grouping, sorting, aggregation, and a subquery. The subquery's parent query has no other query objects, so the subquery is relatively simple in form.
Q08 is a query with grouping, sorting, aggregation, and a subquery. The subquery's parent query has no other query objects; the subquery is relatively simple in form but is itself a multi-table join.
Q09 is a query with grouping, sorting, aggregation, and a subquery. The subquery's parent query has no other query objects; the subquery is relatively simple in form but is itself a multi-table join. The subquery uses the LIKE operator, which some query optimizers cannot optimize.
Q10 is a multi-table join with grouping, sorting, and aggregation. The statement does not syntactically limit how many tuples are returned, but the TPC-H standard stipulates that only the first 10 rows of the result are returned (usually depending on the application implementation).
Q11 is a multi-table join with grouping, sorting, aggregation, and a subquery. The subquery appears in the HAVING clause of the grouping operation.
Q12 is a two-table join with grouping, sorting, and aggregation.
Q13 is a query with grouping, sorting, aggregation, a subquery, and a left outer join.
Q14 is a query using conditional logic (CASE WHEN ... ELSE).
Q15 joins an ordinary table with a view, with sorting, aggregation, and an aggregate subquery.
Q16 is a two-table join with grouping, sorting, aggregation, deduplication (DISTINCT), and a NOT IN subquery.
Q17 is a two-table join with aggregation and an aggregate subquery.
Q18 is a three-table join with grouping, sorting, aggregation, and an IN subquery. The statement does not syntactically limit how many tuples are returned, but the TPC-H standard stipulates that only the first 100 rows of the result are returned (usually depending on the application implementation).
Q19 is a three-table join with grouping, sorting, aggregation, and an IN subquery.
Q20 is a two-table join with sorting, aggregation, an IN subquery, and an ordinary subquery.
Q21 is a four-table join with grouping, sorting, aggregation, an EXISTS subquery, and a NOT EXISTS subquery. The statement does not syntactically limit how many tuples are returned, but the TPC-H standard stipulates that only the first 100 rows of the result are returned (usually depending on the application implementation).
Q22 is a four-table join with grouping, sorting, aggregation, an EXISTS subquery, and a NOT EXISTS subquery.
TPC-H metrics
The basic data measured in the test are execution times: the time for each step of loading the data, the execution time of each query, and the execution time of each update operation. From these times, the data loading time, Power@Size, Throughput@Size, and QphH@Size are derived.
Data loading time
The data loading process is divided into timed and untimed operations. The time spent on timed operations must be measured and included in the data loading time. In general, the operations that must be timed include creating tables, inserting data, and building indexes.
Query and update time
In the Power test and the Throughput test, the time of every query and every update stream must be measured and recorded. Each query is timed from the first character of the submitted query to the last character of the query result. The update time is measured separately for RF1 and RF2, from the moment the operation is submitted to the moment it completes.
Power@Size
Power@Size is the result of the Power test, defined as the reciprocal of the geometric mean of the query times and refresh times; it measures how fast the DBMS computes the result of a single query stream. The formula is as follows:
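\[ \text{Power@Size} = \frac{3600 \times SF}{\left( \prod_{i=1}^{22} QI(i,0) \times \prod_{j=1}^{2} RI(j,0) \right)^{1/24}} \]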
where Size is the data scale, SF is the scale factor, QI(i,0) is the time of the i-th query in seconds, and RI(j,0) is the time of refresh function RFj in seconds.
Throughput@Size
Throughput@Size is the result of the Throughput test. It is defined as the reciprocal of the average execution time of all queries, scaled to queries per hour. The formula is as follows:
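\[ \text{Throughput@Size} = \frac{S \times 22 \times 3600}{T_s} \times SF \]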
where S is the number of query streams executed and Ts is the total time, in seconds, required to run the Throughput test for S streams.
QphH@Size
The queries-per-hour performance metric (QphH@Size) is the geometric mean of the two preceding metrics and reflects multiple aspects of the database's ability to process queries. The formula is as follows:
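\[ \text{QphH@Size} = \sqrt{\text{Power@Size} \times \text{Throughput@Size}} \]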