Hive for big data: complete guide to enterprise-level tuning (Chapter 10)

1, Explain

1) Basic syntax

EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query
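
Besides EXTENDED, the DEPENDENCY and AUTHORIZATION forms are also available. A minimal sketch against the emp table used below:

hive (default)> explain dependency select * from emp;
hive (default)> explain authorization select * from emp;

DEPENDENCY prints the tables and partitions the query reads, and AUTHORIZATION prints the query's inputs, outputs, and the privileges required to run it.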

2) Case practice

(1) View the execution plan of the following statement

A query that does not generate an MR job:

hive (default)> explain select * from emp;

Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp
          Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
            ListSink

A query that generates an MR job:

hive (default)> explain select deptno, avg(sal) avg_sal from emp group by deptno;

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: emp
            Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: sal (type: double), deptno (type: int)
              outputColumnNames: sal, deptno
              Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(sal), count(sal)
                keys: deptno (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: double), _col2 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0), count(VALUE._col1)
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), (_col1 / _col2) (type: double)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

(2) View detailed execution plan

hive (default)> explain extended select * from emp;
hive (default)> explain extended select deptno, avg(sal) avg_sal from emp group by deptno;

2, Fetch task

Fetch means that Hive can answer some queries without running MapReduce at all. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory of the employees table and write the results to the console.

In hive-default.xml.template, hive.fetch.task.conversion defaults to more (older versions of Hive defaulted to minimal). With the property set to more, full-table scans, column projections, LIMIT queries, and so on do not run MapReduce.
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Expects one of [none, minimal, more].
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery
    and should not have any aggregations or distincts (which incurs RS),
    lateral views and joins.
    0. none : disable hive.fetch.task.conversion
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
  </description>
</property>
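
Note the restriction in the description above: even with more, anything that requires a shuffle (aggregations, distincts, joins) still launches a job. A quick sketch:

hive (default)> set hive.fetch.task.conversion=more;
//A plain SELECT ... LIMIT is served by a fetch task; no job is launched:
hive (default)> select * from emp limit 3;
//An aggregation incurs a shuffle, so it still runs as a MapReduce job:
hive (default)> select count(*) from emp;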

1) Case practice

(1) Set hive.fetch.task.conversion to none, then execute the queries below; every one of them runs as a MapReduce job.

hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;

(2) Set hive.fetch.task.conversion to more, then execute the same queries; none of them runs a MapReduce job.

hive (default)> set hive.fetch.task.conversion=more;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;

3, Local mode

Most Hadoop jobs need the full scalability that Hadoop provides in order to handle large data sets. Sometimes, however, Hive's input data is very small, and the overhead of launching the tasks for a query can far exceed the time of the actual work. For most such cases, Hive can handle all the tasks on a single machine in local mode, which significantly reduces execution time for small data sets.

You can set hive.exec.mode.local.auto to true to let Hive apply this optimization automatically when appropriate.

set hive.exec.mode.local.auto=true; //Turn on local mr
//Set the maximum input size for local mr; when the input is smaller than this value, local mr is used.
//Defaults to 134217728, i.e. 128 MB
set hive.exec.mode.local.auto.inputbytes.max=50000000;
//Set the maximum number of input files for local mr; when there are fewer input files than this, local mr is used.
//Defaults to 4
set hive.exec.mode.local.auto.input.files.max=10;

1) Case practice

(1) Leave local mode off (the default) and execute the query

hive (default)> select count(*) from emp group by deptno;

(2) Turn on local mode and execute query statements

hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> select count(*) from emp group by deptno;

4, Table optimization

1) Small table JOIN large table (MapJoin)

Put the table whose keys are relatively scattered and whose data volume is small on the left side of the join; alternatively, use MapJoin to load the small dimension table into memory and complete the join on the map side.

Actual testing shows that newer versions of Hive have optimized both small-table-JOIN-large-table and large-table-JOIN-small-table, so there is no longer any difference between putting the small table on the left or the right.
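
As a side note, the older MAPJOIN hint can also force a map-side join when automatic conversion is turned off; be aware that recent Hive releases ignore this hint unless hive.ignore.mapjoin.hint is set to false. A minimal sketch using the tables created below:

select /*+ MAPJOIN(s) */ b.id, b.uid, b.keyword
from bigtable b
join smalltable s
on b.id = s.id;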

2) Case practice

1) Requirement

Test the efficiency of large table JOIN small table versus small table JOIN large table

2) Enable MapJoin parameter setting

(1) Set automatic selection of Mapjoin

set hive.auto.convert.join = true; //Defaults to true

(2) Threshold for distinguishing large and small tables (tables smaller than 25 MB are treated as small by default)

set hive.mapjoin.smalltable.filesize = 25000000;

3) MapJoin working mechanism

A local task first reads the small table into an in-memory hash table and distributes it through the distributed cache; each map task of the join job then streams the large table past that hash table and completes the join on the map side, so no shuffle or reduce phase is needed.

4) Statements for creating large tables, small tables, and post JOIN tables

//Create the large table

create table bigtable(id bigint, t bigint, uid string, keyword string,
url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';

//Create the small table

create table smalltable(id bigint, t bigint, uid string, keyword string,
url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';

//Create the table that will hold the join result

create table jointable(id bigint, t bigint, uid string, keyword string,
url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';

5) Import data into large and small tables respectively

hive (default)> load data local inpath '/opt/module/data/bigtable' into table bigtable;
hive (default)> load data local inpath '/opt/module/data/smalltable' into table smalltable;

6) Small table JOIN large table statement

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
join bigtable b
on b.id = s.id;

7) Large table JOIN small table statement

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
join smalltable s
on s.id = b.id;

2) Large table Join large table

1) Empty KEY filtering

Sometimes the join timeout occurs because there are too many data corresponding to some keys, and the data corresponding to the same key will be sent to the same key
On the reducer, resulting in insufficient memory. At this time, we should carefully analyze the key s of these exceptions. In many cases,
The data corresponding to these keys is abnormal data, which needs to be filtered in SQL statements. For example, the field corresponding to key is
Empty, the operation is as follows

2) Case practice

(1) Configure the history server (in mapred-site.xml)

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hadoop102:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>hadoop102:19888</value>
</property>

Start the history server

sbin/mr-jobhistory-daemon.sh start historyserver

View jobhistory

http://hadoop102:19888/jobhistory

(2) Create a table to hold the original data whose ids are empty

// Create the empty-id table
create table nullidtable(id bigint, t bigint, uid string, keyword string,
url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';

(3) Load the empty-id data into the corresponding table (the original data was already loaded into bigtable above)

hive (default)> load data local inpath '/opt/module/data/nullid' into table nullidtable;

(4) Test without filtering the empty ids

hive (default)> insert overwrite table jointable select n.* from nullidtable n left join bigtable o on n.id = o.id;

(5) Test with the empty ids filtered out

hive (default)> insert overwrite table jointable select n.* from (select * from nullidtable where id is not null) n left join bigtable o on n.id = o.id;

2) Empty key conversion

Sometimes, although certain keys are empty, the corresponding rows are numerous and are not abnormal data, so they must be included in the join result. In that case we can assign a random value to the empty key field in table a, so that the rows are distributed randomly and evenly across the reducers. For example:

1) Case practice:

Without randomly distributing the null values:

(1) Set the number of reducers to 5

set mapreduce.job.reduces = 5;

(2) JOIN two tables

insert overwrite table jointable
select n.* from nullidtable n left join bigtable b on n.id = b.id;

Result: data skew occurs; some reducers consume far more resources than the others.


Randomly distributing the null values:

(1) Set the number of reducers to 5

set mapreduce.job.reduces = 5;

(2) JOIN two tables

insert overwrite table jointable
select n.* from nullidtable n full join bigtable o
on nvl(n.id, rand()) = o.id;

Result: the data skew is eliminated; the load is balanced and the reducers' resource consumption drops.
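
Note that nvl(n.id, rand()) compares a double against a bigint key. A common variant (a sketch, under the assumption that string comparison is acceptable here) tags the substitute with a prefix so it can never collide with a real id:

insert overwrite table jointable
select n.* from nullidtable n full join bigtable o
on nvl(cast(n.id as string), concat('hive_', rand())) = cast(o.id as string);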

3) SMB (Sort Merge Bucket Join)

When both tables are bucketed and sorted on the join key, matching pairs of buckets can be merge-joined directly on the map side, so neither table has to fit entirely in memory.

(1) Create a second large table

create table bigtable2(
 id bigint,
 t bigint,
 uid string,
 keyword string,
 url_rank int,
 click_num int,
 click_url string)
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable2;

Test a direct JOIN of the two large tables

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable s
join bigtable2 b
on b.id = s.id;

(2) Create bucketed table 1; the number of buckets should not exceed the number of available CPU cores

create table bigtable_buck1(
 id bigint,
 t bigint,
 uid string,
 keyword string,
 url_rank int,
 click_num int,
 click_url string)
clustered by(id) 
sorted by(id)
into 6 buckets
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable_buck1;

(3) Create bucketed table 2; the number of buckets should not exceed the number of available CPU cores

create table bigtable_buck2(
 id bigint,
 t bigint,
 uid string,
 keyword string,
 url_rank int,
 click_num int,
 click_url string)
clustered by(id)
sorted by(id) 
into 6 buckets
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable_buck2;

(4) Set parameters

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

(5) Testing

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s
join bigtable_buck2 b
on b.id = s.id;
