Hive's data storage format and optimization

Hive's data storage format

  • Column storage, row storage

  • Hive tables are not limited to plain text; several other storage formats are supported as well.

  • When creating a table, the format is specified with the STORED AS clause; if it is omitted, the default is TEXTFILE (row storage).

  • Several mainstream file formats in Hive.

    • textfile file format

    • ORC and Parquet columnar storage formats.

      Both are column-oriented and stored in binary form at the bottom layer, giving high storage efficiency and convenient querying.
    • example

      Store the same data in three different formats, then check the difference in underlying file sizes on HDFS.

      --1. Create a table stored in TEXTFILE format
      create table log_text (
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
      )
      STORED AS TEXTFILE;  --if STORED AS is omitted, TEXTFILE is the default
      --Load data
      load data local inpath '/root/hivedata/' into table log_text;
      --2. Create a table stored in ORC format
      create table log_orc(
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
      )
      STORED AS orc;
      --load is a pure copy/move operation and does not convert the file format,
      --so populate the ORC table with insert...select instead
      insert into table log_orc select * from log_text;
      --3. Create a table stored in PARQUET format
      create table log_parquet(
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
      )
      STORED AS parquet;
      --Insert data into the table
      insert into table log_parquet select * from log_text;
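      To actually compare the footprints of the three tables, find each table's HDFS location and check the space it uses. A sketch (the warehouse path shown is the default and may differ in your cluster):

      ```sql
      -- Find the table's storage location on HDFS
      DESCRIBE FORMATTED log_text;   -- look at the "Location" field
      -- Then, from a shell, check the space used by each table directory, e.g.:
      --   hadoop fs -du -h /user/hive/warehouse/log_text
      --   hadoop fs -du -h /user/hive/warehouse/log_orc
      --   hadoop fs -du -h /user/hive/warehouse/log_parquet
      ```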
  • In actual development, the file format can be chosen according to need and combined with different compression algorithms to get a better storage result.

    -- No compression
    create table log_orc_none(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
    )
    STORED AS orc tblproperties ("orc.compress"="NONE");
    -- Write data
    insert into table log_orc_none select * from log_text;
    -- Snappy compression
    create table log_orc_snappy(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
    )
    STORED AS orc tblproperties ("orc.compress"="SNAPPY");
    -- Write data
    insert into table log_orc_snappy select * from log_text;
    --What happens if no compression format is specified?
    --ORC files are compressed with ZLIB by default, which is even smaller than snappy
    STORED AS orc;   --2.78M
    --ORC storage with no compression
    STORED AS orc tblproperties ("orc.compress"="NONE");  --7.69M
    --ORC storage with snappy compression
    STORED AS orc tblproperties ("orc.compress"="SNAPPY"); --3.78M
  • Conclusion: ORC + snappy compression is recommended in Hive.

Hive Universal Tuning

  • If a query can be answered without MapReduce, avoid launching MR; read and slice the data files directly to return the results

  • When MR computation is needed, small jobs can run directly on local resources instead of requesting resources from the YARN service

  • MR computation is slow; the execution engine can be switched to Spark or Tez

1. Fetch mechanism

  • Function: when executing SQL, avoid processing with a MapReduce program wherever possible.

  • Try to operate the data file directly.

  • Setting: hive.fetch.task.conversion=more.

    --In the following three cases the SQL does not launch an MR program
    --Global lookup
    select * from student;
    --Field lookup
    select num,name from student;
    --limit lookup
    select num,name from student limit 2;
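    For reference, the parameter accepts three values; a sketch of their behavior (defaults vary across Hive versions):

    ```sql
    set hive.fetch.task.conversion=none;     -- always run MR, even for select *
    set hive.fetch.task.conversion=minimal;  -- fetch only for select *, partition-column filters, and limit
    set hive.fetch.task.conversion=more;     -- additionally fetch for column projection and filters
    ```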

2. MapReduce local mode

  • Function: if a MapReduce program must be executed, try to run it locally rather than submitting it to YARN.

  • It is off by default, which means any query going through MapReduce is submitted to YARN for execution. (local = local mode, yarn = cluster mode)
  • Hive provides a parameter that automatically switches a MapReduce program to local mode; if the conditions are not met, it runs in yarn mode.

    set hive.exec.mode.local.auto = true;
    --All three conditions must be met to switch to local mode automatically
    The total input size of the job is lower than: (128MB by default)  --input data below 128M
    The total number of map-tasks is less than: (4 by default)  --fewer than 4 maptasks
    The total number of reduce tasks required is 1 or 0.  --reducetask count is 0 or 1
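    A sketch of the corresponding session settings; the two threshold parameter names below are the standard Hive ones, shown with their default values:

    ```sql
    set hive.exec.mode.local.auto=true;                      -- enable auto local mode
    set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- input size limit, 128MB
    set hive.exec.mode.local.auto.input.files.max=4;         -- maptask limit, 4
    -- the reducetask condition (0 or 1) is checked automatically
    ```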
  • Switch Hive's execution engine

    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    If Hive tuning still fails to meet your needs or is inefficient, try the Spark or Tez execution engine
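    Switching the engine is a single session setting (Tez or Spark must of course be installed and configured on the cluster first); a sketch:

    ```sql
    set hive.execution.engine=mr;     -- default, deprecated since Hive 2
    set hive.execution.engine=tez;
    set hive.execution.engine=spark;
    ```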

3. join optimization

Try not to join more than two tables in a single statement; split a multi-table join into several two-table joins, e.g. join t2 with t1 first, then join that result with t3.

  • The bottom layer is the join optimization of MapReduce.

  • There are two join methods in MapReduce. It refers to the stage at which the join behavior occurs.

    • map side join

    • reduce side join

  • Optimization 1: Hive automatically tries to choose a map side join to improve join efficiency, since it omits the shuffle phase.

    Parameter settings to enable mapjoin:
    (1)Automatically select mapjoin
    set hive.auto.convert.join = true;  --default is true
    (2)Size threshold separating large and small tables:
    set hive.mapjoin.smalltable.filesize = 25000000;  --i.e. 25MB
    --tables below 25MB are joined on the map side
  • Optimization 2: large table join large table

    When a large table joins another large table and the join field contains null values, how should the nulls be handled?
    --Method 1: null key filtering, when the rows with null keys are not important
    --filter out the null keys before the join
    SELECT a.* FROM (SELECT * FROM nullidtable WHERE id IS NOT NULL) a JOIN ori b ON a.id = b.id;
    --Method 2: null key transformation, when the rows with null keys must be kept
    --replacing nulls with one constant (CASE WHEN a.id IS NULL THEN 'any string' ELSE a.id END)
    --sends them all to one reducer; use a random value instead to avoid skew:
    CASE WHEN a.id IS NULL THEN concat('hive', rand()) ELSE a.id END

  • Optimization 3: bucket table join (bucket mapjoin) improves join efficiency.

    1.1 conditions
        1) set hive.optimize.bucketmapjoin = true;
        2) The number of buckets in one table is an integer multiple of the number of buckets in the other
        3) bucket column == join column
        4) It must be a scenario where map join applies
    1.2 notes
        1) If the tables are not bucketed, it is just a normal join.
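    A minimal sketch of tables that satisfy the four conditions above (table and column names are illustrative):

    ```sql
    -- big table: 4 buckets on the join column
    create table orders_bucketed (id int, amount double)
    clustered by (id) into 4 buckets stored as orc;
    -- small table: 2 buckets; 4 is an integer multiple of 2
    create table users_bucketed (id int, name string)
    clustered by (id) into 2 buckets stored as orc;

    set hive.optimize.bucketmapjoin = true;
    -- join column = bucket column, so this can run as a bucket mapjoin
    select o.id, o.amount, u.name
    from orders_bucketed o join users_bucketed u on o.id = u.id;
    ```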

4. group by data skew optimization

(1) Whether to aggregate on the map side; the default is true
set hive.map.aggr = true;
(2) Number of entries aggregated on the map side
set hive.groupby.mapaggr.checkinterval = 100000;
(3) Load balancing when there is data skew (false by default)
set hive.groupby.skewindata = true;
--Q: what is the underlying execution mechanism after load balancing is enabled in hive?
--(think of grouping by a heavily skewed column, e.g. a gender column with only male/female values)
--step1: a first MapReduce job sends the skewed data randomly to the reduces to scatter it;
--       each reduce then performs a partial (local) aggregation
--step2: a second MapReduce job summarizes the partial results for the final aggregation
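The two-job mechanism can also be written by hand. A sketch, assuming a user_info table with a heavily skewed gender column (both names are illustrative):

```sql
select gender, sum(cnt) as total          -- job 2: final aggregation
from (
  select gender, rnd, count(*) as cnt     -- job 1: partial aggregation,
  from (                                  -- scattered by a random key
    select gender, cast(rand() * 10 as int) as rnd from user_info
  ) s
  group by gender, rnd
) t
group by gender;
```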

5. How to adjust the number of tasks (parallelism) in Hive's underlying MapReduce

  • Number of maptasks

    • In MapReduce, the number of maptasks is determined by the logical split mechanism.

    • But in Hive there are more influencing factors: the logical split mechanism, whether the file is compressed, and whether the compression codec supports splitting.

    • Therefore, to adjust the number of maptasks in Hive, go directly to HDFS and adjust the size and number of the underlying files.

      If there are many small files, merge them; the best merged size equals the block size
      (e.g. 130M, 65M, 65M → 128M, 64M)
      If there are many large files, adjust the block size
      hdfs-site.xml: dfs.block.size
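      On the Hive side, small output files can also be merged automatically, and existing ORC tables can be compacted in place; a sketch (parameter defaults vary by version):

      ```sql
      set hive.merge.mapfiles = true;            -- merge small files after map-only jobs
      set hive.merge.mapredfiles = true;         -- merge small files after MR jobs
      set hive.merge.size.per.task = 134217728;  -- target merged size ~ block size (128MB)
      -- for an existing ORC table, merge its small files in place:
      alter table log_orc concatenate;
      ```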
  • Number of reducetasks

    • In MapReduce, the number can be specified directly in code via job.setNumReduceTasks(N)

    • In Hive, the number of reducetasks is controlled by the following conditions

      (1) The amount of data processed by each reduce, 256MB by default
      (2) The maximum number of reduce tasks per job, 1009 by default
      (3) The default setting is -1, in which case hive decides the number itself
      --If not set, hive estimates the number of reducetasks from the data volume and the sql
      --Users can set the number of reducetasks through a parameter:
        set mapreduce.job.reduces = N;
      --The user setting may not take effect: if it conflicts with the sql execution logic,
      --such as order by, hive fixes the reducetask count to a reasonable number at compile time:
      Number of reduce tasks determined at compile time: 1
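      The three controls correspond to these parameters (names and defaults are the standard Hive ones); a sketch:

      ```sql
      set hive.exec.reducers.bytes.per.reducer = 256000000;  -- (1) data per reducer, 256MB
      set hive.exec.reducers.max = 1009;                     -- (2) upper bound per job
      set mapreduce.job.reduces = -1;                        -- (3) -1 lets hive estimate
      set mapreduce.job.reduces = 3;                         -- or force a fixed number
      ```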

6. Several other general tuning

  • Execution plan explain

    • The execution plan shows how hive plans to execute the sql.

    • Syntax format: explain + sql statement

    • Example

      explain select * from student;
      |                      Explain                       |
      | STAGE DEPENDENCIES:                                |
      |   Stage-0 is a root stage                          |
      |                                                    |
      | STAGE PLANS:                                       |
      |   Stage: Stage-0                                   |
      |     Fetch Operator                                 |
      |       limit: -1                                    |
      |       Processor Tree:                              |
      |         TableScan                                  |
      |           alias: student                           |
      |           Statistics: Num rows: 1 Data size: 5260 Basic stats: COMPLETE Column stats: NONE |
      |           Select Operator                          |
      |             expressions: num (type: int), name (type: string), sex (type: string), age (type: int), dept (type: string) |
      |             outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
      |             Statistics: Num rows: 1 Data size: 5260 Basic stats: COMPLETE Column stats: NONE |
      |             ListSink                               |
      |                                                    |
  • Parallel execution mechanism

    • If some of the stages underlying a hivesql query can be executed in parallel, execution efficiency improves.

    • The premise is that there are no dependencies between the stages. The disadvantage of parallelism is that the instantaneous load on the server increases.

    • parameter

      set hive.exec.parallel=true; -- Whether to execute jobs in parallel. Applies to MapReduce jobs that can run in parallel, such as moving files to insert targets during multiple inserts
       set hive.exec.parallel.thread.number=16; -- The maximum number of jobs that can be executed in parallel. The default is 8.
  • Hive's strict mode

    • Be careful not to confuse it with the strict mode of dynamic partitioning.

    • Strict mode here means that hive forbids certain inefficient, potentially risky operations and refuses to run such queries.

    • set up

      set hive.mapred.mode = strict; -- the default is nonstrict mode
    • explanation

      1. For a partitioned table, queries whose where clause does not allow partition pruning are forbidden
      2. An order by statement must be followed by a limit clause
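      A sketch of the two rules, assuming an illustrative partitioned table t_part (partitioned by dt):

      ```sql
      set hive.mapred.mode = strict;
      -- rejected: no partition filter, so no partition pruning is possible
      -- select * from t_part;
      select * from t_part where dt = '2021-10-21';                       -- accepted
      -- rejected: order by without limit forces one reducer over all data
      -- select * from t_part where dt = '2021-10-21' order by id;
      select * from t_part where dt = '2021-10-21' order by id limit 10;  -- accepted
      ```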
  • Speculative execution mechanism

    • A task-level mechanism in MapReduce.

    • Function:

      • A job may execute multiple tasks at the bottom; if some tasks lag behind, they drag out the completion of the whole job.

      • The speculative execution mechanism finds the lagging task through an algorithm and starts a backup task for it.

      • The two tasks then process the same data at the same time; whichever finishes first provides the final result.

    • Speculative execution is enabled by default, but in enterprise production environments it is recommended to turn it off.
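    The switches involved (standard Hadoop/Hive parameter names); a sketch of turning it off:

    ```sql
    set mapreduce.map.speculative = false;
    set mapreduce.reduce.speculative = false;
    -- hive's own switch for speculative reduce tasks:
    set hive.mapred.reduce.tasks.speculative.execution = false;
    ```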


Posted on Thu, 21 Oct 2021 13:10:02 -0400 by catalinus