Hive's data storage format
Row storage and column storage

A Hive table's storage format is not limited to text; many other formats are supported.

The format is specified with the **STORED AS** clause when creating a table. **If it is not specified, the default is TEXTFILE (row storage)**.

The mainstream file formats in Hive:

- TEXTFILE: the plain-text, row-oriented format.
- ORC and Parquet: columnar storage formats. Both store data in binary form at the bottom layer, which gives high storage efficiency and fast queries.
Example

Store the same data in three different formats, then check HDFS to compare the storage space used by the underlying files.
```sql
-- 1. Create a table stored as TEXTFILE
create table log_text (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;  -- if you don't write STORED AS, the default is TEXTFILE

-- Load data
load data local inpath '/root/hivedata/log.data' into table log_text;

-- 2. Create a table stored as ORC
create table log_orc (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc;

-- load is a pure copy/move operation and does not convert the file format,
-- so use insert ... select to write ORC data:
insert into table log_orc select * from log_text;

-- 3. Create a table stored as PARQUET
create table log_parquet (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET;

-- Insert data into the table
insert into table log_parquet select * from log_text;
```
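To compare the resulting sizes, the table directories can be inspected from the Hive CLI (a sketch; the warehouse path below is the default and may differ in your deployment):

```sql
-- dfs runs an HDFS command from within the Hive session; -du -h prints sizes
dfs -du -h /user/hive/warehouse/log_text;
dfs -du -h /user/hive/warehouse/log_orc;
dfs -du -h /user/hive/warehouse/log_parquet;
```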
In real development, you can choose the file format that fits your needs and pair it with a compression algorithm to get an even better storage result.
```sql
-- No compression
create table log_orc_none (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");

-- Write data
insert into table log_orc_none select * from log_text;

-- SNAPPY compression
create table log_orc_snappy (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");

-- Write data
insert into table log_orc_snappy select * from log_text;

-- What happens when no compression format is specified?
-- ORC files use ZLIB compression by default, which is even smaller than SNAPPY:
STORED AS orc;                                          -- 2.78M (default ZLIB)
STORED AS orc tblproperties ("orc.compress"="NONE");    -- 7.69M (uncompressed)
STORED AS orc tblproperties ("orc.compress"="SNAPPY");  -- 3.78M (SNAPPY)
```
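To confirm which codec a table actually uses, its table properties can be inspected; `orc.compress` shows up under Table Parameters in the output:

```sql
describe formatted log_orc_snappy;
```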
Conclusion: **ORC + Snappy** compression is recommended in Hive.
General Hive tuning
- If a query can be answered without MapReduce, avoid launching MR at all and read and slice the data files directly (the fetch mechanism).
- When MR computation is required, small jobs can run on local resources directly instead of requesting resources from the YARN service (local mode).
- MR computation is slow; the computing framework can be switched to Spark or Tez.
1. Fetch mechanism
Purpose: when executing SQL, **avoid processing the query with a MapReduce program wherever possible** and operate on the data files directly instead.

Setting: hive.fetch.task.conversion=more.
```sql
-- In the following three cases the SQL does not launch an MR program
-- Full-table query
select * from student;
-- Column projection
select num, name from student;
-- limit query
select num, name from student limit 2;
```
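For reference, the parameter has three levels (as documented by Hive):

```sql
set hive.fetch.task.conversion=none;    -- always compile to MR, even for select *
set hive.fetch.task.conversion=minimal; -- select *, filters on partition columns, limit
set hive.fetch.task.conversion=more;    -- the default: plain selects, filters, limit
```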
2. MapReduce local mode
Purpose: if a MapReduce program must run, **try to execute it locally rather than submitting it to YARN**.

This is off by default, meaning that every MapReduce job is submitted to YARN for execution.
```
mapreduce.framework.name = local   -- local mode
mapreduce.framework.name = yarn    -- cluster mode
```
Hive provides a parameter that switches a MapReduce program to local mode automatically; if the conditions are not met, it falls back to yarn mode.
```sql
set hive.exec.mode.local.auto = true;

-- All three conditions must be met to switch to local mode automatically:
-- 1) The total input size of the job is below hive.exec.mode.local.auto.inputbytes.max
--    (128MB by default)
-- 2) The total number of map tasks is below hive.exec.mode.local.auto.tasks.max
--    (4 by default)
-- 3) The total number of reduce tasks required is 1 or 0
```
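Put together, a session-level setup might look like this (a sketch; the two thresholds simply restate the defaults above):

```sql
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128MB
set hive.exec.mode.local.auto.tasks.max=4;
```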
Switching Hive's execution engine

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

If Hive tuning still does not meet your needs or remains inefficient, try switching the execution engine to Spark or Tez.
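The engine is chosen with the `hive.execution.engine` property (Tez or Spark must already be installed and configured on the cluster):

```sql
set hive.execution.engine=mr;     -- the default, deprecated in Hive 2
set hive.execution.engine=tez;
set hive.execution.engine=spark;
```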
3. Join optimization
Try not to join more than two tables at once; split a multi-table join into several two-table joins:

```sql
from t2 join t1
from t2 join t3
```
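A related rule of thumb: when several joins use the same join key, Hive merges them into a single MapReduce job, so such a chain is already efficient. A sketch with illustrative tables `a`, `b`, `c`:

```sql
-- Both joins use b.id as the key, so Hive compiles this into one MR job:
select a.val, b.val, c.val
from a
join b on a.id = b.id
join c on c.id = b.id;
```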
At the bottom layer, this comes down to MapReduce join optimization.

MapReduce has two join strategies, named for the stage in which the join takes place:

- map-side join
- reduce-side join
Optimization 1: **Hive automatically tries to select a map-side join to improve join efficiency**, which skips the shuffle process entirely.
```sql
-- Enable map join
-- (1) Select mapjoin automatically
set hive.auto.convert.join = true;               -- default is true
-- (2) Size threshold separating small tables from large ones (bytes, ~25MB):
set hive.mapjoin.smalltable.filesize = 25000000;
-- Tables below 25MB are joined on the map side
```
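As a sketch of what this looks like in practice (the table names `orders` and `dim_city` are hypothetical): with the settings above, Hive broadcasts the small table to every map task and completes the join without a shuffle.

```sql
-- dim_city is assumed to be below the 25MB threshold, so Hive converts this
-- into a map-side join automatically; no reduce stage is needed:
select o.order_id, c.city_name
from orders o join dim_city c on o.city_id = c.city_id;
```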
Optimization 2: large table join large table
```sql
-- Background: in a large-table-to-large-table join, what should be done when the
-- join field contains null values?

-- Method 1: null-key filtering. If the rows with a null key are not important and
-- need not participate in the join, filter them out before joining:
SELECT a.* FROM (SELECT * FROM nullidtable WHERE id IS NOT NULL) a
JOIN ori b ON a.id = b.id;

-- Method 2: null-key conversion. Replacing null with one fixed string
--   CASE WHEN a.id IS NULL THEN 'xxx arbitrary string' ELSE a.id END
-- would pile all the null rows onto a single reducer; appending a random value
-- instead spreads them out (e.g. hive001, hive11, hive231) and avoids skew:
CASE WHEN a.id IS NULL THEN concat('hive', rand()) ELSE a.id END
```
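A complete form of Method 2, as a sketch (it assumes `id` is a string column in the `nullidtable`/`ori` tables used above):

```sql
SELECT a.*
FROM nullidtable a
LEFT JOIN ori b
  ON CASE WHEN a.id IS NULL THEN concat('hive', rand()) ELSE a.id END = b.id;
```

The randomized keys match nothing in `ori`, so the null rows are still kept by the left join while being spread evenly across reducers instead of landing on one.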
Optimization 3: the bucket table join, i.e. **bucket map join**, improves join efficiency.
Conditions:

1) set hive.optimize.bucketmapjoin = true;
2) The number of buckets in one table is an integer multiple of the number of buckets in the other table
3) The bucket column equals the join column
4) It must be applied in a map join scenario

Note: if the tables are not bucketed, the result is just an ordinary join.
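A minimal DDL sketch satisfying the conditions above (table names, columns, and bucket counts are illustrative):

```sql
create table big_tbl (id string, val string)
clustered by (id) into 8 buckets;   -- bucket column = join column

create table small_tbl (id string, val string)
clustered by (id) into 4 buckets;   -- 8 is an integer multiple of 4

set hive.optimize.bucketmapjoin = true;
-- With map join applicable, the join runs bucket-by-bucket on the map side:
select b.id, b.val, s.val
from big_tbl b join small_tbl s on b.id = s.id;
```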
4. Group by data skew optimization
```sql
-- (1) Aggregate on the map side (default true)
set hive.map.aggr = true;
-- (2) Number of rows examined for map-side aggregation
set hive.groupby.mapaggr.checkinterval = 100000;
-- (3) Load balancing when the data is skewed (default false)
set hive.groupby.skewindata = true;

-- Q: what is the underlying execution mechanism once load balancing is enabled?
--    (think of a group by on a heavily skewed key, e.g. a gender column: male/female)
-- Step 1: a first MapReduce job scatters the skewed keys randomly across the
--         reducers; each reducer performs a partial (local) aggregation.
-- Step 2: a second MapReduce job takes the partial results of step 1 and
--         performs the final aggregation.
```
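A sketch under these settings (the table `user_info` and its `gender` column are illustrative): the skewed aggregation below now runs as the two-job plan described above.

```sql
set hive.groupby.skewindata = true;
-- Job 1 scatters the (skewed) gender keys randomly and pre-aggregates;
-- Job 2 combines the partial counts into the final result:
select gender, count(*) as cnt
from user_info
group by gender;
```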
5. How to adjust the number of tasks (parallelism) in the underlying MapReduce in Hive
Number of map tasks

In MapReduce, the number of map tasks is determined by the **logical split (input split)** mechanism.

In Hive, however, there are more influencing factors: the logical split mechanism, whether the files are compressed, and whether the compression format supports splitting.

Therefore, in **Hive, to adjust the number of map tasks, go directly to HDFS and adjust the size and number of the underlying files**.
- If there are many small files, merge them; the ideal merged size is the block size.
- If there are many large files, adjust the block size (dfs.block.size in hdfs-site.xml).
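For reference, split-level knobs also exist on the Hive side (a sketch; CombineHiveInputFormat is already the default input format in current Hive versions and merges small files at the split level):

```sql
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapreduce.input.fileinputformat.split.maxsize=134217728;  -- cap each split at 128MB
```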
Number of reduce tasks

In MapReduce, it can be specified directly in code via job.setNumReduceTasks(N).

In Hive, the number of reduce tasks is controlled by the following settings:
```sql
-- (1) Amount of data processed by each reducer (256MB by default)
hive.exec.reducers.bytes.per.reducer=256000000
-- (2) Maximum number of reducers per job (1009 by default)
hive.exec.reducers.max=1009
-- (3) mapreduce.job.reduces
--     The default value is -1, which lets Hive decide based on the job.

-- If not set, Hive estimates the number of reduce tasks from the data volume and
-- the requirements of the SQL. Users can set it explicitly:
set mapreduce.job.reduces = N;
-- The user setting may not take effect: if it conflicts with the SQL semantics
-- (e.g. order by requires a single global reducer), Hive fixes the number at
-- compile time, logging:
--   Number of reduce tasks determined at compile time: 1
```
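A quick illustration with the `student` table from earlier: even with the reducer count set explicitly, a global sort compiles to a single reducer and Hive logs the line quoted above.

```sql
set mapreduce.job.reduces = 3;
-- order by needs one global reducer, so the setting is overridden at compile time:
select * from student order by num;
```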
6. Several other general tuning techniques
Execution plan: explain

The execution plan shows **how Hive plans to execute the SQL**.

Syntax: explain + SQL statement

Example:
```sql
explain select * from student;
```

```
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-0 is a root stage                          |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         TableScan                                  |
|           alias: student                           |
|           Statistics: Num rows: 1 Data size: 5260 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: num (type: int), name (type: string), sex (type: string), age (type: int), dept (type: string) |
|             outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
|             Statistics: Num rows: 1 Data size: 5260 Basic stats: COMPLETE Column stats: NONE |
|             ListSink                               |
|                                                    |
+----------------------------------------------------+
```
Parallel execution mechanism

If some of the underlying stages of a HiveSQL query can run in parallel, execution efficiency improves.

The premise is that the **stages have no dependencies on each other**; the drawback of parallelism is that the **instantaneous pressure on the server increases**.

Parameters:
```sql
-- Whether to execute jobs in parallel; applies to MapReduce jobs that can run in
-- parallel, e.g. moving files to the insert targets during a multi-insert
set hive.exec.parallel=true;
-- Maximum number of jobs that may run in parallel (8 by default)
set hive.exec.parallel.thread.number=16;
```
Hive's strict mode

Be careful not to confuse this with the strict mode of dynamic partitioning.

The strict mode here means that **Hive prohibits certain inefficient, risky queries whose cost users may not realize** and refuses to run them.

Setting:

```sql
set hive.mapred.mode = strict; -- the default is nonstrict mode
```

What it enforces:

1. For a partitioned table, queries without a partition filter in the where clause (i.e. without partition pruning) are forbidden.
2. An order by statement must be accompanied by a limit clause.
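A few sketches of what strict mode rejects (`t_partitioned` and its partition column `dt` are illustrative; `student` is the table from earlier):

```sql
set hive.mapred.mode = strict;

select * from t_partitioned;                  -- rejected: no partition filter in where
select * from t_partitioned where dt='2021';  -- accepted: partition pruning happens
select * from student order by num;           -- rejected: order by without limit
select * from student order by num limit 10;  -- accepted
```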
Speculative execution mechanism

This is a task-level mechanism in MapReduce.

Purpose:

A job may execute many tasks at the bottom layer; if some tasks lag behind, the overall job slows down and may ultimately fail.

The **speculative execution mechanism uses an algorithm to detect a lagging task and starts a backup task** for it.

The two tasks then process the same piece of data at the same time, and whichever finishes first provides the final result.

Speculative execution is enabled by default, but in enterprise production environments it is **recommended to turn it off**.
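The relevant switches, should you want to turn it off (the first two are Hadoop-level settings, the last is Hive's own):

```sql
set mapreduce.map.speculative=false;
set mapreduce.reduce.speculative=false;
set hive.mapred.reduce.tasks.speculative.execution=false;
```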