Fast Learning Hive: Enterprise-Level Tuning

Chapter 9: Enterprise-Level Optimization

9.1 Fetch

Fetch means that Hive can answer certain queries without running MapReduce at all. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory of the employees table and print the results to the console.
In hive-default.xml.template, the default value of hive.fetch.task.conversion is more (older Hive versions defaulted to minimal). With the property set to more, MapReduce is not used for full-table selects, single-column projections, LIMIT queries, and so on.

<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Expects one of [none, minimal, more].
    Some select queries can be converted to a single FETCH task, minimizing latency.
    Currently the query should be single-sourced, not have any subquery, and not have any aggregations or distincts (which incur RS), lateral views, or joins.
    0. none    : disable hive.fetch.task.conversion
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (supports TABLESAMPLE and virtual columns)
  </description>
</property>

Case practice:
1) Set hive.fetch.task.conversion to none and execute the query statements below; each of them will run a MapReduce job.

hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;

2) Set hive.fetch.task.conversion to more and execute the query statements below; none of them will run a MapReduce job.

hive (default)> set hive.fetch.task.conversion=more;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;

9.2 Local Mode

Most Hadoop jobs need the full scalability that Hadoop provides to handle large data sets. Sometimes, however, Hive's input volume is very small, and the overhead of launching a distributed job can exceed the actual execution time of the query. In such cases Hive can handle the whole task on a single machine in local mode, which significantly reduces execution time for small data sets. Users can set hive.exec.mode.local.auto to true to let Hive apply this optimization automatically when appropriate.

set hive.exec.mode.local.auto=true; -- turn on local MR
-- Maximum input size for local MR: local mode is used when the input is smaller than this value; the default is 134217728 (128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- Maximum number of input files for local MR: local mode is used when there are fewer input files than this value; the default is 4
set hive.exec.mode.local.auto.input.files.max=10;

Case practice:
1) Turn on local mode and execute query statement

hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> select * from emp cluster by deptno;
Time taken: 1.328 seconds, Fetched: 14 row(s)

2) Turn off local mode and execute query statement

hive (default)> set hive.exec.mode.local.auto=false;
hive (default)> select * from emp cluster by deptno;
Time taken: 20.09 seconds, Fetched: 14 row(s)

9.3 Table Optimization

9.3.1 Small Table JOIN Large Table

Placing the table with less data (and relatively scattered keys) on the left side of the join can effectively reduce the chance of out-of-memory errors; further, a map join can be used to load small dimension tables (fewer than about 1,000 records) into memory in advance and complete the join on the map side, skipping the reduce phase.
Actual tests show that recent versions of Hive have optimized both small table JOIN large table and large table JOIN small table, so there is no longer an obvious difference between putting the small table on the left or on the right.

Case practice

  1. Requirement
    Test the efficiency of large table JOIN small table versus small table JOIN large table.
  2. Statements for creating the large table, the small table, and the post-JOIN table
-- Create the large table
create table bigtable(id bigint, time bigint, uid string,
keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
-- Create the small table
create table smalltable(id bigint, time bigint, uid string,
keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
-- Create the table that will hold the JOIN result
create table jointable(id bigint, time bigint, uid string,
keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
  3. Import data into the large and small tables respectively
hive (default)> load data local inpath 
'/opt/module/data/bigtable' into table bigtable;
hive (default)>load data local inpath 
'/opt/module/data/smalltable' into table smalltable;
  4. Turn off the mapjoin function (it is on by default)
set hive.auto.convert.join = false;
  5. Execute the small table JOIN large table statement
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, 
b.click_url
from smalltable s
left join bigtable b
on b.id = s.id;
Time taken: 35.921 seconds
Time taken: 43.152 seconds
  6. Execute the large table JOIN small table statement
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num,
b.click_url
from bigtable b
left join smalltable s
on s.id = b.id;

Time taken: 34.196 seconds
Time taken: 40.807 seconds

9.3.2 Large Table JOIN Large Table

  1. NULL key filtering
    Sometimes a join times out because a few keys carry too much data; rows with the same key are all sent to the same reducer, which then runs out of memory. We should analyze such abnormal keys carefully: in many cases they correspond to abnormal data, which we need to filter out in the SQL statement. For example, filter out rows whose key field is NULL, as follows:

Case practice
(1) Configure the history server
Add to mapred-site.xml:

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hadoop102:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>hadoop102:19888</value>
</property>

Start the history server:
sbin/mr-jobhistory-daemon.sh start historyserver
View job history at:
http://hadoop102:19888/jobhistory

(2) Create the original table, the NULL-id table, and the table for the join result

-- Create the original table
create table ori(id bigint, time bigint, uid string, keyword
string, url_rank int, click_num int, click_url string) row
format delimited fields terminated by '\t';
-- Create the table containing NULL ids
create table nullidtable(id bigint, time bigint, uid string,
keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
-- Create the table that will hold the JOIN result
create table jointable(id bigint, time bigint, uid string,
keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';

(3) Load the original data and the NULL-id data into the corresponding tables

hive (default)> load data local inpath '/opt/module/datas/
SogouQ1.txt' into table ori;
hive (default)> load data local inpath '/opt/module/data/nullid' 
into table nullidtable;

(4) Test without filtering NULL ids

hive (default)> insert overwrite table jointable 
select n.* from nullidtable n left join ori o on n.id = o.id;
Time taken: 42.038 seconds
Time taken: 37.284 seconds
Time taken: 41.743 seconds

(5) Test with NULL ids filtered out

hive (default)> insert overwrite table jointable 
select n.* from (select * from nullidtable where id is not null ) 
n left join ori o on n.id = o.id;
Time taken: 31.725 seconds
Time taken: 28.876 seconds
Time taken: 33.29 seconds
  2. NULL key conversion
    Sometimes a NULL key occurs on many rows, yet the corresponding data is not abnormal and must be included in the join result. In that case we can assign a random value to the NULL key field in table a, so that the data is distributed randomly and evenly across the reducers. For example:

Case practice:
Without randomly distributing NULL values:
(1) Set the number of reducers to 5

set mapreduce.job.reduces = 5;

(2) JOIN two tables

insert overwrite table jointable
select n.* from nullidtable n left join ori b on n.id = b.id;

Result: as shown in Figure 6-13, data skew occurs, and the resource consumption of some reducers is far greater than that of the others.

Randomly distributing NULL values:
(1) Set the number of reducers to 5

set mapreduce.job.reduces = 5;

(2) JOIN two tables

insert overwrite table jointable
select n.* from nullidtable n full join ori o on 
case when n.id is null then concat('hive', rand()) else n.id end 
= o.id;

Result: as shown in Figure 6-14, the data skew is eliminated and the load is balanced evenly across the reducers.

9.3.3 MapJoin

If MapJoin is not specified or the MapJoin conditions are not met, the Hive parser converts the join into a Common Join, i.e. the join is completed in the Reduce phase, which is prone to data skew. A MapJoin instead loads the whole small table into memory on the map side and joins there, avoiding the reducers entirely.

  1. Turn on the MapJoin parameter settings
    (1) Enable automatic MapJoin selection (true by default):
set hive.auto.convert.join = true;

(2) Threshold between large and small tables (a table smaller than this, 25 MB by default, is treated as a small table):

set hive.mapjoin.smalltable.filesize=25000000;
  2. The working mechanism of MapJoin is shown in Figure 6-15

    Case practice:
    (1) Enable the MapJoin function (true by default):
set hive.auto.convert.join = true;

(2) Execute the small table JOIN large table statement

insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, 
b.click_num, b.click_url
from smalltable s
join bigtable b
on s.id = b.id;
Time taken: 24.594 seconds
Time taken: 26.212 seconds

(3) Execute the large table JOIN small table statement

insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, 
b.click_num, b.click_url
from bigtable b
join smalltable s
on s.id = b.id;
Time taken: 24.315 seconds
Time taken: 24.498 seconds

9.3.4 Group By

By default, all rows with the same key produced in the Map phase are distributed to one reducer, so the job skews when a single key carries too much data.

Not all aggregations need to be completed on the Reduce side: many can be partially aggregated on the Map side first, with the final result assembled on the Reduce side.

  1. Turn on the Map-side aggregation parameter settings
    (1) Whether to aggregate on the Map side (true by default):
hive.map.aggr = true

(2) Number of entries that aggregate on the Map side

hive.groupby.mapaggr.checkinterval = 100000

(3) Load balancing with data skew (default is false)

hive.groupby.skewindata = true

When this option is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly to the reducers, each of which performs a partial aggregation and emits its result; because rows with the same Group By key may land on different reducers, the load is balanced. The second job then distributes the pre-aggregated results to the reducers by Group By key (guaranteeing that identical keys reach the same reducer) and completes the final aggregation.
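
For illustration, a minimal sketch of applying these settings before a skewed GROUP BY (the emp table is reused from earlier sections; the settings are the three listed above):

set hive.map.aggr = true;                        -- partial aggregation on the map side
set hive.groupby.mapaggr.checkinterval = 100000; -- rows examined per map-side aggregation check
set hive.groupby.skewindata = true;              -- compile into two MR jobs to balance skewed keys
select deptno, count(*) from emp group by deptno;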

9.3.5 Count(Distinct) Deduplication

With a small data volume this does not matter, but with a large volume a COUNT DISTINCT is completed by a single Reduce task, and the amount of data that one reducer must process can make the whole job hard to finish. COUNT DISTINCT is therefore generally replaced by GROUP BY plus COUNT:

Case practice

  1. Create a large table
    hive (default)> create table bigtable(id bigint, time bigint,
    uid string, keyword string, url_rank int, click_num int,
    click_url string) row format delimited
    fields terminated by '\t';

  2. Load the data

hive (default)> load data local inpath 
'/opt/module/datas/bigtable' into table
bigtable;
  3. Set the number of reducers to 5
set mapreduce.job.reduces = 5;
  4. Execute the distinct id count query
hive (default)> select count(distinct id) from bigtable;
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.12 sec 
HDFS Read: 120741990 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 120 msec
OK
_c0
100001
Time taken: 23.607 seconds, Fetched: 1 row(s)
Time taken: 39.753 seconds, Fetched: 1 row(s)
  5. Use GROUP BY to deduplicate ids
hive (default)> select count(id) from (select id from bigtable
group by id) a;
Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 17.53 sec 
HDFS Read: 120752703 HDFS Write: 580 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 4.29 sec 
HDFS Read: 9409 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 21 seconds 820 msec
OK
_c0
100001
Time taken: 50.795 seconds, Fetched: 1 row(s)
Time taken: 63.108 seconds, Fetched: 1 row(s)

Although this takes one extra job to complete, it is definitely worth it when the data volume is large.

9.3.6 Cartesian Product

Avoid Cartesian products, i.e. joins without an ON condition or with an invalid ON condition: Hive can use only one reducer to complete a Cartesian product.
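
For illustration, a hedged sketch (emp is reused from earlier sections; the dept table and its dname column are hypothetical):

-- A Cartesian product: join with no ON condition, forced onto a single reducer. Avoid this.
select * from emp, dept;
-- An equi-join with a valid ON condition lets Hive parallelize normally.
select e.ename, d.dname from emp e join dept d on e.deptno = d.deptno;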

9.3.7 Row and Column Filtering

Column pruning: in SELECT, take only the columns you need; use partition filtering where possible, and avoid SELECT *.
Row pruning: with partition pruning, when an outer join is used and the filter condition on the secondary table is written in the WHERE clause, the whole tables are joined first and filtered only afterwards. For example:

Case practice:

  1. Test joining the two tables first, then filtering with the WHERE condition
hive (default)> select o.id from bigtable b
join ori o on o.id = b.id
where o.id <= 10;
Time taken: 34.406 seconds, Fetched: 100 row(s)
  2. Filter in a subquery first, then join the table
hive (default)> select b.id from bigtable b
join (select id from ori where id <= 10 ) o on b.id = o.id;
Time taken: 30.058 seconds, Fetched: 100 row(s)

9.3.8 Dynamic Partition Adjustment

In a relational database, when data is inserted into a partitioned table, the database automatically routes each row to the correct partition based on the value of the partition column. Hive provides a similar mechanism, dynamic partitioning, which requires the corresponding configuration.

  1. Enable dynamic partition parameter setting
    (1) Enable the dynamic partition function (true by default, enabled)
hive.exec.dynamic.partition=true

(2) Set nonstrict mode. (The dynamic partition mode is strict by default, meaning at least one partition must be specified as a static partition; nonstrict mode allows all partition fields to use dynamic partitioning.)

hive.exec.dynamic.partition.mode=nonstrict

(3) The maximum total number of dynamic partitions that can be created across all nodes executing MR.

hive.exec.max.dynamic.partitions=1000

(4) The maximum number of dynamic partitions that can be created on each node executing MR. This parameter must be set according to the actual data. For example, if the source data contains one year of data, the day field has 365 distinct values, so this parameter must be set greater than 365; with the default value of 100, an error is reported.

hive.exec.max.dynamic.partitions.pernode=100

(5) The maximum number of HDFS files that can be created in the whole MR job.

hive.exec.max.created.files=100000

(6) Whether to throw an exception when an empty partition is generated. Generally, it does not need to be set.

hive.error.on.empty.partition=false
  2. Case practice
    Requirement: insert the data in ori into the corresponding partitions of the target table ori_partitioned_target according to time (e.g. 20111230000008).
    (1) Create the partition table
create table ori_partitioned(id bigint, time bigint, uid 
string, keyword string,
url_rank int, click_num int, click_url string) 
partitioned by (p_time bigint) 
row format delimited fields terminated by '\t';

(2) Load data into partition table

hive (default)> load data local inpath '/home/atguigu/ds1' into 
table
ori_partitioned partition(p_time='20111230000010') ;
hive (default)> load data local inpath '/home/atguigu/ds2' into 
table ori_partitioned partition(p_time='20111230000011') ;

(3) Create target partition table

create table ori_partitioned_target(id bigint, time bigint, 
uid string,
keyword string, url_rank int, click_num int, click_url string) 
PARTITIONED BY (p_time STRING) row format delimited fields 
terminated by '\t';

(4) Set dynamic partition

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.exec.max.dynamic.partitions = 1000;
set hive.exec.max.dynamic.partitions.pernode = 100;
set hive.exec.max.created.files = 100000;
set hive.error.on.empty.partition = false;
(5) Execute the dynamic partition insert

hive (default)> insert overwrite table ori_partitioned_target
partition (p_time) 
select id, time, uid, keyword, url_rank, click_num, click_url, 
p_time from ori_partitioned;

(6) View the partition status of the target partition table

hive (default)> show partitions ori_partitioned_target;

9.4 MR Optimization

9.4.1 Set the Number of Maps Reasonably

1) Normally, a job spawns one or more map tasks based on the input directory. The main determinants are the total number of input files, the size of the input files, and the block size configured for the cluster.
2) Are more maps always better?
No. If a task has many small files (far smaller than the 128 MB block size), each small file is treated as its own block and handled by its own map task, and starting and initializing a map task takes far longer than its actual processing, which wastes resources badly. Moreover, the number of maps that can run simultaneously is limited.
3) Is it safe as long as each map processes a block of nearly 128 MB? Not necessarily. For example, a 127 MB file would normally be handled by one map, but if it has only one or two small fields yet tens of millions of records, and the map logic is complex, a single map task is certainly too slow.
Problems 2) and 3) call for opposite remedies: reducing the number of maps and increasing the number of maps, respectively.

9.4.2 Merge Small Files

Merge small files before map execution to reduce the number of maps: CombineHiveInputFormat (the system default input format) merges small files, while HiveInputFormat cannot.

set hive.input.format= 
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
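
Relatedly, Hive can also merge the small files a job produces as output, which keeps them from piling up as input for the next job. A hedged sketch of the relevant settings (the values shown are the usual defaults; verify them against your Hive version):

set hive.merge.mapfiles = true;               -- merge small files at the end of a map-only job (default true)
set hive.merge.mapredfiles = true;            -- merge small files at the end of a map-reduce job (default false)
set hive.merge.size.per.task = 256000000;     -- target size of the merged files (256 MB)
set hive.merge.smallfiles.avgsize = 16000000; -- start an extra merge job when the average output file is below 16 MB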

9.4.3 Increase the Number of Maps for Complex Files

When the input files are large, the task logic is complex, and map execution is very slow, consider increasing the number of maps so that each map processes less data and the task runs more efficiently. The split size is computed as computeSplitSize(Math.max(minSize, Math.min(maxSize, blockSize))), which by default equals blockSize = 128M; setting maxSize below the block size therefore increases the number of maps.

Case practice:

  1. Execute a query
hive (default)> select count(*) from emp;
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
  2. Set the maximum split size to 100 bytes
hive (default)> set 
mapreduce.input.fileinputformat.split.maxsize=100;
hive (default)> select count(*) from emp;
Hadoop job information for Stage-1: number of mappers: 6; number of 
reducers: 1

9.4.4 Set the Number of Reduces Reasonably

  1. Method 1: adjust the number of reduces
    (1) The default amount of data processed by each Reduce is 256 MB:
hive.exec.reducers.bytes.per.reducer=256000000

(2) Maximum number of reduces per task (1009 by default):

hive.exec.reducers.max=1009 

(3) Formula for the number of reducers: N = min(parameter 2, total input data size / parameter 1). See the worked example after this list.

  2. Method 2: modify the number of reduces per job in Hadoop's mapred-default.xml file, or set it per session: set mapreduce.job.reduces = 15;
  3. More reduces are not always better
    1) Starting and initializing too many reduces also consumes time and resources;
    2) Moreover, many reduces produce many output files, and if those small files become the input of the next task, it too suffers from too many small files. When choosing the number of reduces, consider two principles: use a number appropriate to the data volume, and keep the amount of data a single reduce task processes appropriate.
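
A worked example of the formula in method 1, assuming a hypothetical 1 GB (1073741824-byte) input and the default parameters:

N = min(1009, 1073741824 / 256000000) = min(1009, 4.19...) = 5

Hive rounds the quotient up, so five reducers are launched.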

9.5 Parallel Execution

Hive converts a query into one or more stages: a MapReduce stage, a sampling stage, a merge stage, a limit stage, or other stages Hive may need during execution. By default, Hive executes only one stage at a time. A particular job, however, may contain many stages that are not fully interdependent, meaning some of them can run in parallel; the more stages that can run in parallel, the faster the whole job may complete. Setting the parameter hive.exec.parallel to true enables concurrent execution. Note that in a shared cluster, more parallel stages in a job means higher cluster utilization.

set hive.exec.parallel=true; -- enable parallel execution of stages
set hive.exec.parallel.thread.number=16; -- maximum degree of parallelism allowed for a single SQL statement; default is 8

Of course, parallelism only brings an advantage when the system has relatively idle resources; without spare resources the parallel stages simply cannot start.
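
For illustration, a hedged sketch: the two branches of the UNION ALL below do not depend on each other, so with parallel execution enabled their stages may run concurrently (emp, deptno and ename are reused from earlier sections):

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;
-- Each branch is an independent aggregation stage that Hive may schedule in parallel.
select * from (
  select cast(deptno as string) as k, count(*) as cnt from emp group by deptno
  union all
  select ename as k, count(*) as cnt from emp group by ename
) t;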

9.6 Strict Mode

Hive provides a strict mode to prevent users from executing queries that may have unintended bad effects. The hive.mapred.mode property defaults to nonstrict mode; to enable strict mode, change its value to strict. Strict mode disables three types of queries, illustrated after the list below.

<property>
  <name>hive.mapred.mode</name>
  <value>strict</value>
  <description>
    The mode in which the Hive operations are being performed.
    In strict mode, some risky queries are not allowed to run. They include:
    Cartesian Product.
    No partition being picked up for a query.
    Comparing bigints and strings.
    Comparing bigints and doubles.
    Orderby without limit.
  </description>
</property>
  1. For partitioned tables, execution is not allowed unless the WHERE clause contains a filter on the partition field to limit the scope; in other words, users are not allowed to scan all partitions. The rationale is that partitioned tables usually hold very large, rapidly growing data sets, and a query without a partition restriction could consume an unacceptable amount of resources.
  2. Queries that use ORDER BY must include a LIMIT clause. Because ORDER BY distributes all result rows to a single Reducer to perform the sort, forcing the user to add a LIMIT prevents that Reducer from running for a very long time.
  3. Cartesian-product queries are restricted. Users familiar with relational databases may expect to use a WHERE clause instead of an ON clause in a JOIN, counting on the optimizer to convert the WHERE condition into an ON condition. Unfortunately, Hive does not perform this optimization, so on sufficiently large tables such a query can get out of control.
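
For illustration, hedged sketches of the three rejected query types (ori_partitioned is the partitioned table from 9.3.8; emp is as before, and the dept table is hypothetical):

set hive.mapred.mode=strict;
-- 1. Rejected: scanning a partitioned table without a partition filter.
select * from ori_partitioned;
-- Allowed: the WHERE clause restricts the partition field.
select * from ori_partitioned where p_time = 20111230000010;
-- 2. Rejected: ORDER BY without LIMIT.
select * from emp order by sal;
-- Allowed:
select * from emp order by sal limit 10;
-- 3. Rejected: a Cartesian product (no join condition).
select * from emp, dept;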

9.7 JVM Reuse

JVM reuse is a Hadoop tuning parameter with a large impact on Hive performance, especially in scenarios where small files are hard to avoid or where there are many tasks, most of them short-running. Hadoop's default configuration typically forks a new JVM to execute each map or Reduce task, and JVM start-up can cause considerable overhead, especially when a job contains hundreds or thousands of tasks. JVM reuse lets a JVM instance be reused up to N times within the same job. N can be configured in Hadoop's mapred-site.xml file; values between 10 and 20 are typical, but the exact number should be determined by testing against the specific business scenario.

<property>
 <name>mapreduce.job.jvm.numtasks</name>
 <value>10</value>
 <description>How many tasks to run per jvm. If set to -1, there is
 no limit. 
 </description>
</property>

The drawback of this feature is that JVM reuse keeps occupying the task slots it uses until the whole job completes. If a few reduce tasks in an unbalanced job take far longer to execute than the others, the reserved slots sit idle, unavailable to other jobs, until every task has finished.
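
Since mapreduce.job.jvm.numtasks is a per-job setting, it can presumably also be set from a Hive session for a single job rather than cluster-wide; a hedged sketch:

set mapreduce.job.jvm.numtasks=10; -- reuse each JVM for up to 10 tasks within this job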

9.8 Speculative Execution

In a distributed cluster, program bugs (including bugs in Hadoop itself), unbalanced load, or uneven resource distribution can make tasks of the same job run at very different speeds. Some tasks may run notably slower than the rest (for example, one task of a job is at only 50% progress when all the others have finished), dragging down the job's overall progress. To avoid this, Hadoop adopts the Speculative Execution mechanism: it identifies "lagging" tasks by certain rules and starts a backup task for each of them, so that the backup and the original process the same data simultaneously, and the result of whichever finishes first is taken as the final result.

To enable speculative execution, configure the following in Hadoop's mapred-site.xml file:

<property>
 <name>mapreduce.map.speculative</name>
 <value>true</value>
 <description>If true, then multiple instances of some map tasks 
 may be executed in parallel.</description>
</property>
<property>
 <name>mapreduce.reduce.speculative</name>
 <value>true</value>
 <description>If true, then multiple instances of some reduce tasks
 may be executed in parallel.</description>
</property>

However, Hive also provides its own configuration item to control reduce-side speculative execution:

<property>
  <name>hive.mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
  <description>Whether speculative execution for reducers should
  be turned on.</description>
</property>

It is hard to give specific advice for tuning these speculative-execution settings. If users are very sensitive to run-time deviations, they can turn these features off. And if a map or Reduce task must run for a long time because of a huge input volume, the waste caused by launching speculative backups can be considerable.
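
For illustration, a hedged sketch of turning speculative execution off from a Hive session before one such long-running job, using the three settings above:

set mapreduce.map.speculative=false;
set mapreduce.reduce.speculative=false;
set hive.mapred.reduce.tasks.speculative.execution=false;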

9.10 Explain

  1. Basic syntax
EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query
  2. Case practice
    (1) View the execution plan of the following statements
hive (default)> explain select * from emp;
hive (default)> explain select deptno, avg(sal) avg_sal from emp group 
by deptno;

(2) View detailed execution plan

hive (default)> explain extended select * from emp;
hive (default)> explain extended select deptno, avg(sal) avg_sal from 
emp group by deptno;
Posted on Mon, 17 Feb 2020 22:19:15 -0500 by lazytiger