1, Preparation:
- 1. Write a program to write 10 million pieces of data into the Hbase table;
- 2. Map the corresponding Hbase table to Hive table.
Execute the following command in Hive's shell
hive> CREATE EXTERNAL TABLE IF NOT EXISTS t_hbase_person_his10(id string, NAME String, salary string,START_DATE string,END_DATE string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:id,info:name,info:salary,info:start_date,info:end_date') TBLPROPERTIES ('hbase.table.name' ='t_hbase_person_his10');
- 1
- 2
- 3
- 4
- 5
- 3. Copy the same data to Hive table. This Hive data is actually in Hive. Through Sql like replication
create table t_person_his10 as select * from t_hbase_person_his10;
- 1
2, The query time is compared by Hive Jdbc, and the comparison results are as follows
Where t ﹣ Hbase ﹣ person ﹣ his10 table is the Hive table generated by Hbase Association, t ﹣ person ﹣ history10 table is the actual Hive table, and the source is selected * from t ﹣ Hbase ﹣ person ﹣ his10 table;
- 1. Check the current data and return 30 items by default
sql = "select * from t_hbase_person_his10 where end_date='9999-12-31' limit 30";// use statTime:353ms sql = "select * from t_person_history10 where end_date='9999-12-31' limit 30";//use statTime:119ms
- 1
- 2
- 3
- 2. Query the data of the specified date, and return 30 items by default
sql = "select * from t_hbase_person_his10 where start_date<='2017-09-18' and end_date>='2017-09-18' and salary>990000 limit 30";//use statTime:411ms sql = "select * from t_hbase_person_his10 where start_date<='2017-09-20' and end_date>='2017-09-20' and salary>990000 limit 30";//use statTime:908ms sql = "select * from t_person_history10 where start_date<='2017-09-18' and end_date>='2017-09-18' and salary>990000 limit 30";//use statTime:147ms sql = "select * from t_person_history10 where start_date<='2017-09-20' and end_date>='2017-09-20' and salary>990000 limit 30";//use statTime:266ms
- 1
- 2
- 3
- 4
- 5
- 6
- 3. order by is inefficient
sql = "select * from t_hbase_person_his10 where start_date<='2017-09-20' and end_date>='2017-09-20' and salary>990000 order by salary limit 30";// use statTime:95000ms sql = "select * from t_person_history10 where start_date<='2017-09-20' and end_date>='2017-09-20' and salary>990000 order by salary limit 30";//use statTime:35836ms
- 1
- 2
- 3
- 4,between and
sql = "select * from t_hbase_person_his10 where end_date='9999-12-31' and salary between 500000 and 600000 limit 30";//use statTime:338ms sql = "select * from t_person_history10 where end_date='9999-12-31' and salary between 500000 and 600000 limit 30";//use statTime:166ms
- 1
- 2
- 3
- 5. Trace the source of the specified user. Here, the user name is the only identification, which is extremely inefficient. (consider using rowkey as the only identification)
sql = "SELECT mobile,start_date FROM t_hbase_person_his10 where name='hehe98'";//use statTime:86701ms 13901173602,2017-09-04 13201382515,2017-09-07 15107963040,2017-09-11 sql = "SELECT mobile,start_date FROM t_hbase_person_his10 where rowkey='1298'";//use statTime:316ms sql = "SELECT mobile,start_date FROM t_person_history10 where name='hehe98'";//use statTime:6326ms 13901173602,2017-09-04 13201382515,2017-09-07 15107963040,2017-09-11 sql = "SELECT mobile,start_date FROM t_person_history10 where rowkey='1298'";//use statTime:6288ms
- 1
- 2
- 3
- 4
- 5
- 6,group by
sql = "select start_date,count(1) from t_hbase_person_his10 group by start_date";//use statTime:100330ms sql = "select start_date,count(1) from t_person_history10 group by start_date";//use statTime:25857ms
- 1
- 2
- 3
- 7. Fuzzy query
sql = "select * from t_hbase_person_his10 where name like '%hehe111%' limit 30";// use statTime:2738ms sql = "select * from t_hbase_person_his10 where name like '%hehe111%' and start_date>'2017-09-18' limit 10";// use statTime:2745ms sql = "select * from t_hbase_person_his10 where rowkey like '%10059%' and start_date>'2017-09-18' limit 10";// use statTime:665ms sql = "select * from t_person_history10 where name like '%hehe111%' and start_date>'2017-09-18' limit 10";// use statTime:257ms sql = "select * from t_person_history10 where rowkey like '%10059%' and start_date>'2017-09-18' limit 10";// use statTime:135ms sql = "select * from t_person_history10 where name like '%hehe111%' limit 30";// use statTime:225ms
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8. Query specified rowkey
sql = "select * from t_hbase_person_his10 where rowkey='11123'";//use statTime:342ms sql = "select * from t_person_history10 where rowkey='11123'";//use statTime:8386ms
- 1
- 2
- 3
- 9. Query the Hive table by association
sql = "select th.mobile,th.start_date,tb.mobile from t_person_history10 th, t_hbase_person_his10 tb where th.name=tb.name limit 10";//use statTime:88614ms sql = "select th.mobile,th.start_date,tb.mobile from t_person_history10 th left outer join t_hbase_person_his10 tb on th.name=tb.name limit 10";//use statTime:88614ms
- 1
- 2
- 3
Combined with the above results, the query efficiency of mapping Hbase table to Hive table will be greatly reduced. However, if the amount of data is only 10 million, the impact of ordinary queries is not significant. For example, the efficiency of association query and aggregation troubleshooting is very low. I suggest that you do not associate a large amount of data table with Hive table to query, because the corresponding Hive table partition and other original functions will not work.