Test the query efficiency of mapping Hbase table to Hive table

1, Preparation:

1. Write a program to write 10 million pieces of data into the Hbase table;
2. Map the corresponding Hbase table to Hive table.
Execute the following command in Hive's shell

hive> CREATE EXTERNAL TABLE IF NOT EXISTS t_hbase_person_his10(id string, NAME String, salary string,START_DATE string,END_DATE string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:id,info:name,info:salary,info:start_date,info:end_date') TBLPROPERTIES ('hbase.table.name' ='t_hbase_person_his10');

3. Copy the same data to Hive table. This Hive data is actually in Hive. Through Sql like replication

create table t_person_his10 as select * from t_hbase_person_his10;

2, The query time is compared by Hive Jdbc, and the comparison results are as follows

Where t ﹣ Hbase ﹣ person ﹣ his10 table is the Hive table generated by Hbase Association, t ﹣ person ﹣ history10 table is the actual Hive table, and the source is selected * from t ﹣ Hbase ﹣ person ﹣ his10 table;

1. Check the current data and return 30 items by default

sql = "select * from t_hbase_person_his10 where end_date='9999-12-31' limit 30";// use statTime:353ms sql = "select * from t_person_history10 where end_date='9999-12-31' limit 30";//use statTime:119ms

2. Query the data of the specified date, and return 30 items by default

sql = "select * from t_hbase_person_his10 where start_date<='2017-09-18' and end_date>='2017-09-18' and salary>990000 limit 30";//use statTime:411ms sql = "select * from t_hbase_person_his10 where start_date<='2017-09-20' and end_date>='2017-09-20' and salary>990000 limit 30";//use statTime:908ms sql = "select * from t_person_history10 where start_date<='2017-09-18' and end_date>='2017-09-18' and salary>990000 limit 30";//use statTime:147ms sql = "select * from t_person_history10 where start_date<='2017-09-20' and end_date>='2017-09-20' and salary>990000 limit 30";//use statTime:266ms

3. order by is inefficient

sql = "select * from t_hbase_person_his10 where start_date<='2017-09-20' and end_date>='2017-09-20' and salary>990000 order by salary limit 30";// use statTime:95000ms sql = "select * from t_person_history10 where start_date<='2017-09-20' and end_date>='2017-09-20' and salary>990000 order by salary limit 30";//use statTime:35836ms

4,between and

sql = "select * from t_hbase_person_his10 where end_date='9999-12-31' and salary between 500000 and 600000 limit 30";//use statTime:338ms sql = "select * from t_person_history10 where end_date='9999-12-31' and salary between 500000 and 600000 limit 30";//use statTime:166ms

5. Trace the source of the specified user. Here, the user name is the only identification, which is extremely inefficient. (consider using rowkey as the only identification)

sql = "SELECT mobile,start_date FROM t_hbase_person_his10 where name='hehe98'";//use statTime:86701ms 13901173602,2017-09-04 13201382515,2017-09-07 15107963040,2017-09-11 sql = "SELECT mobile,start_date FROM t_hbase_person_his10 where rowkey='1298'";//use statTime:316ms sql = "SELECT mobile,start_date FROM t_person_history10 where name='hehe98'";//use statTime:6326ms 13901173602,2017-09-04 13201382515,2017-09-07 15107963040,2017-09-11 sql = "SELECT mobile,start_date FROM t_person_history10 where rowkey='1298'";//use statTime:6288ms

6,group by

sql = "select start_date,count(1) from t_hbase_person_his10 group by start_date";//use statTime:100330ms sql = "select start_date,count(1) from t_person_history10 group by start_date";//use statTime:25857ms

7. Fuzzy query

sql = "select * from t_hbase_person_his10 where name like '%hehe111%' limit 30";// use statTime:2738ms sql = "select * from t_hbase_person_his10 where name like '%hehe111%' and start_date>'2017-09-18' limit 10";// use statTime:2745ms sql = "select * from t_hbase_person_his10 where rowkey like '%10059%' and start_date>'2017-09-18' limit 10";// use statTime:665ms sql = "select * from t_person_history10 where name like '%hehe111%' and start_date>'2017-09-18' limit 10";// use statTime:257ms sql = "select * from t_person_history10 where rowkey like '%10059%' and start_date>'2017-09-18' limit 10";// use statTime:135ms sql = "select * from t_person_history10 where name like '%hehe111%' limit 30";// use statTime:225ms

8. Query specified rowkey

sql = "select * from t_hbase_person_his10 where rowkey='11123'";//use statTime:342ms sql = "select * from t_person_history10 where rowkey='11123'";//use statTime:8386ms

9. Query the Hive table by association

sql = "select th.mobile,th.start_date,tb.mobile from t_person_history10 th, t_hbase_person_his10 tb where th.name=tb.name limit 10";//use statTime:88614ms sql = "select th.mobile,th.start_date,tb.mobile from t_person_history10 th left outer join t_hbase_person_his10 tb on th.name=tb.name limit 10";//use statTime:88614ms

Combined with the above results, the query efficiency of mapping Hbase table to Hive table will be greatly reduced. However, if the amount of data is only 10 million, the impact of ordinary queries is not significant. For example, the efficiency of association query and aggregation troubleshooting is very low. I suggest that you do not associate a large amount of data table with Hive table to query, because the corresponding Hive table partition and other original functions will not work.

Test the query efficiency of mapping Hbase table to Hive table

1, Preparation:

2, The query time is compared by Hive Jdbc, and the comparison results are as follows

5 May 2020, 00:24 | Views: 7696

Add new comment

0 comments