1, Explain (view the execution plan)
1) Basic syntax
EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query
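All of the variants are invoked the same way. For reference, a quick sketch (the exact output of DEPENDENCY and AUTHORIZATION varies by Hive version): explain dependency lists the tables and partitions the query reads, and explain authorization lists the privileges the query requires.
hive (default)> explain select * from emp;
hive (default)> explain extended select * from emp;
hive (default)> explain dependency select * from emp;
hive (default)> explain authorization select * from emp;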
2) Case study
(1) View the execution plan of the following statement
No MR job is generated:
hive (default)> explain select * from emp;
Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp
          Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
            ListSink
An MR job is generated:
hive (default)> explain select deptno, avg(sal) avg_sal from emp group by deptno;
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: emp
            Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: sal (type: double), deptno (type: int)
              outputColumnNames: sal, deptno
              Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(sal), count(sal)
                keys: deptno (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: double), _col2 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0), count(VALUE._col1)
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), (_col1 / _col2) (type: double)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
(2) View detailed execution plan
hive (default)> explain extended select * from emp;
hive (default)> explain extended select deptno, avg(sal) avg_sal from emp group by deptno;
2, Fetch
Fetch means that for certain queries Hive can skip MapReduce entirely. For example, with SELECT * FROM employees; Hive can simply read the files in the storage directory of employees and write the results to the console.
In hive-default.xml.template, hive.fetch.task.conversion defaults to more (older versions of Hive defaulted to minimal). With the property set to more, global lookups (SELECT *), single-column lookups, LIMIT lookups, and the like do not go through MapReduce.
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Expects one of [none, minimal, more].
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and
    should not have any aggregations or distincts (which incurs RS),
    lateral views and joins.
    0. none : disable hive.fetch.task.conversion
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
  </description>
</property>
1) Case study
(1) Set hive.fetch.task.conversion to none, then run the following queries; every one of them launches a MapReduce job.
hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;
(2) Set hive.fetch.task.conversion to more, then run the same queries; none of the following query forms launches a MapReduce job.
hive (default)> set hive.fetch.task.conversion=more;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;
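As a quick cross-check with the Explain section above (a sketch): with more, the plan for select * from emp contains only a Fetch stage, exactly as in the first Explain example; with none, the same query plans a Map Reduce stage.
hive (default)> explain select * from emp;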
3, Local mode
Most Hadoop jobs need the full scalability Hadoop provides in order to process large data sets. Sometimes, however, Hive's input is very small, and the time spent launching and scheduling the tasks for a query can far exceed the time spent on the actual work. For most of these cases, Hive can handle all of the tasks on a single machine in local mode, which can shorten execution time significantly for small data sets.
You can set hive.exec.mode.local.auto to true to have Hive turn on this optimization automatically when appropriate.
-- Turn on local MR
set hive.exec.mode.local.auto=true;
-- Maximum input size for local MR; when the input is smaller than this value,
-- local MR is used. Default is 134217728, i.e. 128 MB
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- Maximum number of input files for local MR; when there are fewer input files
-- than this, local MR is used. Default is 4
set hive.exec.mode.local.auto.input.files.max=10;
1) Case study
(1) Turn off local mode (off by default) and execute query statements
hive (default)> select count(*) from emp group by deptno;
(2) Turn on local mode and execute query statements
hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> select count(*) from emp group by deptno;
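To verify which path was actually taken, the current value of the switch can be printed with set (the Hive CLI echoes name=value), and a local run typically announces an in-process/local Hadoop job in the console output. A sketch:
hive (default)> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=true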
4, Table optimization
1) Small table JOIN large table (MapJoin)
Put the table whose keys are relatively scattered and whose data volume is small on the left side of the join; or use MapJoin to load the small dimension table into memory first and complete the join on the map side.
Actual tests show that newer versions of Hive have optimized small table JOIN large table and large table JOIN small table alike; it no longer matters whether the small table is on the left or the right.
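For completeness: before automatic conversion existed, a map join had to be requested with an explicit hint. The hint syntax below still exists and can be tried against the test tables created later in this section, though note that many Hive versions ignore it unless hive.ignore.mapjoin.hint is set to false. A sketch:
select /*+ MAPJOIN(s) */ b.id, b.uid, b.keyword
from bigtable b
join smalltable s
on s.id = b.id;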
2) Case study
1) Requirement
Compare the efficiency of a large table JOIN a small table against a small table JOIN a large table.
2) Parameters for enabling MapJoin
(1) Enable automatic selection of MapJoin
set hive.auto.convert.join = true; -- default is true
(2) Size threshold for small vs. large tables (by default, tables under 25 MB are considered small)
set hive.mapjoin.smalltable.filesize = 25000000;
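Conversely, to time an ordinary shuffle (common) join as a baseline for the comparison below, the automatic conversion can be switched off first:
set hive.auto.convert.join = false; -- force a common join for benchmarking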
3) MapJoin working mechanism
In outline: MapJoin loads the small table into an in-memory hash table, distributes it to every mapper (via a local task and the distributed cache), and completes the join entirely on the map side, so no shuffle and no reducer are needed.
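One way to confirm that the conversion actually happened is to inspect the plan: when MapJoin kicks in, it contains a Map Join Operator (plus a local-work stage that builds the small table's hash table) instead of a shuffle for the join. A sketch, using the tables created below:
hive (default)> explain select b.id from bigtable b join smalltable s on s.id = b.id;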
4) Statements for creating the large table, the small table, and the JOIN result table
-- Create the large table
create table bigtable(id bigint, t bigint, uid string, keyword string,
url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
-- Create the small table
create table smalltable(id bigint, t bigint, uid string, keyword string,
url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
-- Create the table that will hold the JOIN result
create table jointable(id bigint, t bigint, uid string, keyword string,
url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
5) Import data into large and small tables respectively
hive (default)> load data local inpath '/opt/module/data/bigtable' into table bigtable;
hive (default)> load data local inpath '/opt/module/data/smalltable' into table smalltable;
6) Small table JOIN large table statement
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
join bigtable b
on b.id = s.id;
7) Large table JOIN small table statement
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
join smalltable s
on s.id = b.id;
2) Large table JOIN large table
1) Null key filtering
Sometimes a join times out because certain keys have far too much data; rows with the same key are all sent to the same reducer, which then runs out of memory. In such cases, analyze those abnormal keys carefully. Very often their rows are bad data, which should be filtered out in the SQL statement, for example rows whose key field is NULL. The operation is as follows:
2) Case study
(1) Configure the history server (in mapred-site.xml)
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hadoop102:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>hadoop102:19888</value>
</property>
Start history server
sbin/mr-jobhistory-daemon.sh start historyserver
View jobhistory
http://hadoop102:19888/jobhistory
(2) Create a table for the original data with null ids
-- Create the empty-id table
create table nullidtable(id bigint, t bigint, uid string, keyword string,
url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
(3) Load the null-id data into the corresponding table
hive (default)> load data local inpath '/opt/module/data/nullid' into table nullidtable;
(4) Test: do not filter null ids
hive (default)> insert overwrite table jointable select n.* from nullidtable n left join bigtable o on n.id = o.id;
(5) Test: filter out null ids
hive (default)> insert overwrite table jointable select n.* from (select * from nullidtable where id is not null) n left join bigtable o on n.id = o.id;
2) Null key conversion
Sometimes, although a key is null for many rows, the corresponding rows are not bad data and must be kept in the join result. In that case, assign a random value to the rows of table a whose key is null, so that the data is distributed randomly and evenly across the reducers. For example:
1) Case study:
Without randomly distributing null values:
(1) Set the number of reducers to 5
set mapreduce.job.reduces = 5;
(2) JOIN two tables
insert overwrite table jointable select n.* from nullidtable n left join bigtable b on n.id = b.id;
Results: as the figure below shows, there is data skew; some reducers consume far more resources than the others.
Randomly distribute null values:
(1) Set the number of reducers to 5
set mapreduce.job.reduces = 5;
(2) JOIN two tables
insert overwrite table jointable select n.* from nullidtable n full join bigtable o on nvl(n.id,rand()) = o.id;
Results: as the figure below shows, the data skew is eliminated and the load is balanced across the reducers, reducing per-reducer resource consumption. (rand() returns a fractional value that never equals a real integer id, so the salted rows are spread evenly without creating spurious matches.)
3) SMB (Sort Merge Bucket Join)
Both sides of the join are bucketed and sorted on the join key; matching buckets are then merge-joined pairwise on the map side, so neither table has to be small enough to fit in memory.
(1) Create a second large table
create table bigtable2(
    id bigint,
    t bigint,
    uid string,
    keyword string,
    url_rank int,
    click_num int,
    click_url string)
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/data/bigtable' into table bigtable2;
Test a direct JOIN of the two large tables
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable s
join bigtable2 b
on b.id = s.id;
(2) Create bucketed table 1. The number of buckets should not exceed the number of available CPU cores
create table bigtable_buck1(
    id bigint,
    t bigint,
    uid string,
    keyword string,
    url_rank int,
    click_num int,
    click_url string)
clustered by(id)
sorted by(id)
into 6 buckets
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/data/bigtable' into table bigtable_buck1;
(3) Create bucketed table 2. The number of buckets should not exceed the number of available CPU cores
create table bigtable_buck2(
    id bigint,
    t bigint,
    uid string,
    keyword string,
    url_rank int,
    click_num int,
    click_url string)
clustered by(id)
sorted by(id)
into 6 buckets
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/data/bigtable' into table bigtable_buck2;
(4) Set parameters
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
(5) Testing
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s
join bigtable_buck2 b
on b.id = s.id;
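As with MapJoin, the plan confirms whether the SMB path was taken: when it is, the map operator tree shows a Sorted Merge Bucket Map Join Operator rather than a shuffle join (the exact plan text varies by Hive version). A sketch:
hive (default)> explain select count(*) from bigtable_buck1 s join bigtable_buck2 b on b.id = s.id;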