Hive tuning manual

1 Fetch task conversion
Fetch task conversion means that some Hive queries can be answered without running a MapReduce job. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory of the employees table and print the query results to the console.
In the hive-default.xml.template file, hive.fetch.task.conversion defaults to more; older versions of Hive default to minimal. With the value more, Hive does not use MapReduce for full-table SELECTs, SELECTs on specific columns, filters, or LIMIT queries.

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more  : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
</property>

The property accepts one of three values:
none: every SQL statement goes through MapReduce.
minimal: SELECT *, filters on partition columns, and LIMIT queries skip MapReduce.
more: SELECT on any columns, filters, and LIMIT queries skip MapReduce.
2 Local mode
Most Hadoop jobs need the full scalability of the cluster to handle large data sets. Sometimes, though, the input to a Hive query is very small, and the overhead of launching the job on the cluster can far exceed the time the job itself takes to run. In most of these cases Hive can run all tasks on a single machine in local mode, which significantly reduces execution time for small data sets.
Set hive.exec.mode.local.auto to true to let Hive apply this optimization automatically when a query qualifies.

-- Enable automatic local mode
set hive.exec.mode.local.auto=true;
-- Maximum total input size for local mode. Local mode is used when the input is smaller than this value. Default: 134217728 (128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- Maximum number of input files for local mode. Local mode is used when the number of input files is smaller than this value. Default: 4
set hive.exec.mode.local.auto.input.files.max=10;
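
The two thresholds above can be sketched as a small Python check. This is only an illustration of the rule as worded here, not Hive's actual code: the function name qualifies_for_local_mode is hypothetical, the strict "less than" comparison follows the text above, and any further conditions Hive may apply are left out.

```python
def qualifies_for_local_mode(input_bytes, input_files,
                             max_bytes=134217728, max_files=4):
    """Return True when both thresholds described in the text are met.

    Hypothetical helper: the defaults mirror the documented defaults of
    hive.exec.mode.local.auto.inputbytes.max and
    hive.exec.mode.local.auto.input.files.max.
    """
    return input_bytes < max_bytes and input_files < max_files


# Session settings from the snippet above: 50,000,000 bytes and 10 files.
small_query = qualifies_for_local_mode(20_000_000, 3, 50_000_000, 10)
large_query = qualifies_for_local_mode(200_000_000, 3, 50_000_000, 10)
many_files = qualifies_for_local_mode(20_000_000, 25, 50_000_000, 10)

print(small_query, large_query, many_files)
```

Only the first query stays under both limits, so only it would be a candidate for local mode under this simplified rule.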

3 Set the number of map and reduce tasks appropriately
3.1 Increase the number of maps for complex files
When the input files are large, the task logic is complex, and the maps run very slowly, consider increasing the number of maps so that each map processes less data and the task finishes sooner.
To increase the number of maps:
The split size is computed as computeSplitSize = Math.max(minSize, Math.min(maxSize, blockSize)), which equals blockSize = 128 MB by default. Lowering maxSize below blockSize makes each split smaller and therefore increases the number of maps. In short, raise minSize to get larger splits (fewer maps) and lower maxSize to get smaller splits (more maps).
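
As a worked example of the split-size formula, the following Python sketch evaluates it for a 1 GiB input. The helper names are hypothetical, and the map count is only a rough estimate: it ignores the slack Hadoop allows on the last split and any combining of files by the input format. In a session, minSize and maxSize are usually controlled through mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize, which are not named in the text above.

```python
import math

MIB = 1024 * 1024


def compute_split_size(min_size, max_size, block_size):
    """max(minSize, min(maxSize, blockSize)), as in the formula above."""
    return max(min_size, min(max_size, block_size))


def estimate_map_count(file_bytes, split_size):
    """Rough estimate: one map per split, last split rounded up."""
    return math.ceil(file_bytes / split_size)


block = 128 * MIB
file_bytes = 1024 * MIB  # a single 1 GiB input file (assumed for illustration)

default_split = compute_split_size(1, 2**63 - 1, block)         # maxSize left very large
smaller_split = compute_split_size(1, 64 * MIB, block)          # maxSize below blockSize
larger_split = compute_split_size(256 * MIB, 2**63 - 1, block)  # minSize above blockSize

for name, split in [("default", default_split),
                    ("maxSize=64MiB", smaller_split),
                    ("minSize=256MiB", larger_split)]:
    print(name, split // MIB, "MiB ->", estimate_map_count(file_bytes, split), "maps")
```

Halving maxSize doubles the estimated number of maps, while doubling minSize halves it.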
3.2 Merge small files
1) Merge small files before the map phase to reduce the number of maps: CombineHiveInputFormat (the system default input format) can combine small files. HiveInputFormat cannot.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

2) Settings for merging small files at the end of a MapReduce job:

-- Merge small files at the end of a map-only job. Default: true
SET hive.merge.mapfiles = true;
-- Merge small files at the end of a map-reduce job. Default: false
SET hive.merge.mapredfiles = true;
-- Target size of the merged files. Default: 256 MB
SET hive.merge.size.per.task = 268435456;
-- When the average size of the output files is below this value, start a separate map-reduce job to merge them
SET hive.merge.smallfiles.avgsize = 16777216;
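
The trigger described for hive.merge.smallfiles.avgsize can be sketched in Python as follows. This is a simplified model under the assumption that only the average output file size matters; needs_merge_job is a hypothetical name, and Hive's real decision also depends on the merge flags above.

```python
def needs_merge_job(output_file_sizes, avg_size_threshold=16777216):
    """Return True when the average output file size is below the threshold.

    Hypothetical helper mirroring the description of
    hive.merge.smallfiles.avgsize; sizes are in bytes.
    """
    if not output_file_sizes:
        return False  # nothing was written, so there is nothing to merge
    average = sum(output_file_sizes) / len(output_file_sizes)
    return average < avg_size_threshold


many_small = [1_000_000] * 50   # fifty files of about 1 MB each
few_large = [200_000_000] * 4   # four files of about 200 MB each

print(needs_merge_job(many_small), needs_merge_job(few_large))
```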

3.3 Set the number of reduces appropriately
1) Adjust the number of reduces [method 1]
(1) The amount of data processed by each reduce. Default: 256 MB

hive.exec.reducers.bytes.per.reducer=256000000

(2) The maximum number of reduces per job. Default: 1009

hive.exec.reducers.max=1009

(3) Formula for the number of reducers

N = min(parameter 2, total input size / parameter 1)
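
A small Python sketch of this formula, using the two defaults listed above. Rounding the quotient up and never returning fewer than one reducer are assumptions added here so the result is a usable integer; the function name estimate_reducers is hypothetical.

```python
import math


def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=256000000,
                      max_reducers=1009):
    """N = min(parameter 2, total input size / parameter 1).

    Assumptions beyond the formula: the quotient is rounded up and the
    result is never below 1.
    """
    wanted = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


ten_gb = estimate_reducers(10_000_000_000)     # 39.0625 rounded up
one_tb = estimate_reducers(1_000_000_000_000)  # capped by the maximum
tiny = estimate_reducers(1_000)                # at least one reducer

print(ten_gb, one_tb, tiny)
```

With the defaults, 10 GB of input yields 40 reducers, while 1 TB hits the 1009 cap.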

2) Adjust the number of reduces [method 2]
The property mapreduce.job.reduces is defined in Hadoop's mapred-default.xml file. It can be overridden for the current session:

-- Set the number of reduces for each job
set mapreduce.job.reduces = 15;

3) More reduces is not always better
(1) Starting and initializing too many reduces costs time and resources.
(2) Each reduce writes its own output file, so many reduces produce many output files. If these files are small and serve as the input of the next job, that job faces the small-file problem in turn.
Keep both points in mind when setting the number of reduces: use enough reduces for the total data volume, and keep the amount of data handled by each reduce appropriate.
4 Parallel execution
Hive converts a query into one or more stages: MapReduce stages, sampling stages, merge stages, limit stages, or other stages Hive needs during execution. By default, Hive executes only one stage at a time. A job may consist of many stages, however, and not all of them depend on each other, so some can run in parallel and shorten the execution time of the whole job. The more stages that can run in parallel, the sooner the job may finish.
Set hive.exec.parallel to true to enable parallel execution. On a shared cluster, note that cluster utilization rises as more stages of a job run in parallel.

-- Enable parallel execution of stages
set hive.exec.parallel=true;
-- Maximum number of stages of one SQL statement that may run in parallel. Default: 8
set hive.exec.parallel.thread.number=16;

Parallel execution only pays off when the cluster has spare resources. Without free resources, the stages cannot actually run in parallel.
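
To illustrate why independent stages shorten a job, the following Python sketch groups stages into rounds that could run at the same time. It is a simplified model, not Hive's scheduler: the stage names, the dependency map, and the round-by-round grouping are all assumptions made for illustration, with max_parallel playing the role of hive.exec.parallel.thread.number.

```python
def schedule_rounds(dependencies, max_parallel=8):
    """Group stages into rounds whose members have no unmet dependencies.

    dependencies maps each stage name to the stages it depends on.
    Hypothetical model: a round holds at most max_parallel stages.
    """
    remaining = {stage: set(deps) for stage, deps in dependencies.items()}
    done = set()
    rounds = []
    while remaining:
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic stage dependencies")
        current = ready[:max_parallel]
        rounds.append(current)
        done.update(current)
        for stage in current:
            del remaining[stage]
    return rounds


# Two independent stages feeding a third one (made-up example).
stages = {
    "stage-1": [],
    "stage-2": [],
    "stage-3": ["stage-1", "stage-2"],
}

serial = schedule_rounds(stages, max_parallel=1)
parallel = schedule_rounds(stages, max_parallel=8)
print(len(serial), "rounds without parallelism:", serial)
print(len(parallel), "rounds with parallelism:", parallel)
```

With one stage at a time the example needs three rounds; with parallelism the two independent stages share a round and the job needs only two.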
5 JVM reuse [use with caution]
Enable uber mode to achieve JVM reuse. By default, each task starts its own JVM. If the tasks of a job process only a small amount of data, multiple tasks of the same job can run in a single JVM instead of starting one JVM per task.
To enable uber mode, add the following configuration to mapred-site.xml:

<!-- Enable uber mode -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>

<!-- Maximum number of map tasks in uber mode; can only be lowered -->
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<!-- Maximum number of reduce tasks in uber mode; can only be lowered -->
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>
<!-- Maximum input size in uber mode; defaults to dfs.blocksize and can only be lowered -->
<property>
  <name>mapreduce.job.ubertask.maxbytes</name>
  <value></value>
</property>
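
The three limits configured above can be read as a simple eligibility check, sketched here in Python. The function name qualifies_for_uber_mode is hypothetical, the 128 MB default for the byte limit assumes a dfs.blocksize of 128 MB, and the additional resource checks that Hadoop performs before running a job in uber mode are left out.

```python
def qualifies_for_uber_mode(num_maps, num_reduces, input_bytes,
                            max_maps=9, max_reduces=1,
                            max_bytes=134217728):
    """Return True when a job is small enough for the three uber limits.

    Hypothetical helper: the defaults mirror the values of
    mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces and
    an assumed 128 MB block size for mapreduce.job.ubertask.maxbytes.
    """
    return (num_maps <= max_maps
            and num_reduces <= max_reduces
            and input_bytes <= max_bytes)


small_job = qualifies_for_uber_mode(num_maps=2, num_reduces=1, input_bytes=10_000_000)
wide_job = qualifies_for_uber_mode(num_maps=20, num_reduces=1, input_bytes=10_000_000)
big_input = qualifies_for_uber_mode(num_maps=2, num_reduces=1, input_bytes=500_000_000)

print(small_job, wide_job, big_input)
```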

Tags: Big Data Hadoop hive

Posted on Wed, 22 Sep 2021 11:21:24 -0400 by mizkie