Hive compression and storage

1. Hadoop compression configuration

Reference: https://blog.csdn.net/qq_42735631/article/details/116903553

1.1 Enable Map output phase compression

  1. Enable compression for Hive intermediate transfer data
    set hive.exec.compress.intermediate=true;
    
  2. Enable map output compression in MapReduce
    set mapreduce.map.output.compress=true;
    
  3. Set the compression codec for map output data in MapReduce
    set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
    
  4. Execute a query statement and compare the job output with compression
     turned off versus turned on (a minimal session sketch follows this list).
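
A minimal session sketch of steps 1-3, assuming the emp table used in the test under 1.2 exists; the query itself is only illustrative, the point is that map output travels compressed during the shuffle:

    -- Enable Snappy compression of map output for this session
    set hive.exec.compress.intermediate=true;
    set mapreduce.map.output.compress=true;
    set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
    
    -- Any query that triggers a shuffle now sends compressed map output;
    -- emp is an assumed table (it also appears in the test under 1.2)
    select deptno, count(*) from emp group by deptno;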

1.2 Enable Reduce output compression

  1. Enable compression of Hive's final output data
    set hive.exec.compress.output=true;
    
  2. Enable MapReduce final output data compression
    set mapreduce.output.fileoutputformat.compress=true;
    
  3. Set the compression codec for the final MapReduce output
    set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
    
  4. Set the final MapReduce output compression to block compression
    set mapreduce.output.fileoutputformat.compress.type=BLOCK;
    
  5. Test whether the output is a compressed file
    insert overwrite local directory '/opt/module/hive-3.1.2/datas/distribute-result'
    select * from emp distribute by deptno sort by empno desc;
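
One way to verify the result, using the Hive CLI's shell escape; with Snappy enabled, the files written to the directory above should carry a .snappy suffix (names like 000000_0.snappy are typical, not guaranteed):

    -- List the local output directory from inside the Hive CLI
    ! ls /opt/module/hive-3.1.2/datas/distribute-result;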
    

2. File storage format

Hive supports data storage formats such as TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.

2.1 Column storage and row storage

  1. Characteristics of row storage:
    • When querying a whole row that meets a condition, column storage has to look up each column's value from its separately aggregated storage, while row storage finds one value and the remaining values sit right next to it. Row storage is therefore faster for whole-row queries.
  2. Characteristics of column storage:
    • Because the data of each column is aggregated and stored together, a query that needs only a few columns reads far less data. And since all values in a column share the same data type, better compression algorithms can be designed per column.
    • TEXTFILE and SEQUENCEFILE are row-based storage formats;
    • ORC and PARQUET are column-based.
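
As an illustration, a query that touches a single column benefits from columnar storage because only that column's data needs to be read (log_orc is the ORC table created in section 3):

    -- Reads only the url column from the columnar log_orc table;
    -- a row-oriented table such as log_text would scan every full row
    select count(url) from log_orc;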

2.2 TextFile format

TextFile is the default format. Data is stored uncompressed, so disk overhead and data-parsing overhead are high. It can be used in combination with Gzip or Bzip2, but with Gzip, Hive cannot split the data, so the data cannot be processed in parallel.
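
A small sketch, assuming a gzipped copy of the sample file exists at the path shown (the .gz path is hypothetical); Hive reads compressed text transparently, but the single Gzip file is handled by one mapper because Gzip is not splittable:

    -- Load a gzipped text file into a TEXTFILE table (log_text is created in
    -- section 3); Hive decompresses it on read but cannot split it, so
    -- queries over this file run in a single mapper
    load data local inpath '/opt/module/hive-3.1.2/datas/log.data.gz' into table log_text;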

2.3 Orc format

  1. Index Data
    • A lightweight index; by default, one index entry is created every 10,000 rows. The index records only the offset of each field of a row within Row Data.
  2. Row Data
    • Stores the actual data. A batch of rows is taken first, and those rows are then stored by column. Each column is encoded and split into multiple Streams for storage.
  3. Stripe Footer
    • Stores the type, length, and other information of each Stream.

Each file has a File Footer, which stores the number of rows in each Stripe and the data type information of each Column. At the very end of each file is a PostScript, which records the compression type of the whole file and the length of the File Footer. When reading a file, the reader first seeks to the end and reads the PostScript, parses the File Footer length from it, then reads the File Footer and parses the information of each Stripe from it, and finally reads each Stripe. In other words, the file is read from back to front.
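
Table-level storage details can be confirmed from within Hive, and the stripe and PostScript details of a concrete data file can be dumped with the standalone utility hive --orcfiledump <file>. A minimal HiveQL check (log_orc is the table created in section 3):

    -- Shows the ORC input/output formats and table properties such as orc.compress
    desc formatted log_orc;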

2.4 Parquet format

Parquet files are stored in binary form, so they cannot be read directly. Each file contains both the data and the file's metadata, which makes the Parquet format self-describing.

  1. Row Group:
    • Each row group contains a certain number of rows, and an HDFS file stores at least one row group; it is similar to ORC's Stripe concept.
  2. Column Chunk:
    • Within a row group, each column is saved in a column chunk, and all the columns in the row group are stored consecutively in the row group's file. The values in a column chunk all have the same type, and different column chunks may be compressed with different algorithms.
  3. Page:
    • Each column chunk is divided into multiple pages. A page is the smallest encoding unit; different pages within the same column chunk may use different encoding methods.

Generally, when writing Parquet data, the row group size is set according to the HDFS block size. Because the minimum unit of data processed by each Mapper task is one block, each row group can then be handled by a single Mapper task, which increases the parallelism of task execution.
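
A sketch of how that alignment might be configured per session; parquet.block.size is the Parquet-MR writer's row-group size in bytes, and the 128 MB value below is an assumption chosen to match a common HDFS block size:

    -- Size Parquet row groups to one 128 MB HDFS block so a single Mapper
    -- processes exactly one row group (value assumed; match your block size)
    set parquet.block.size=134217728;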

3. Storage format comparison experiment

  1. TextFile

    -- Create a table stored as TEXTFILE
    create table if not exists log_text (
        track_time string,
        url string,
        session_id string,
        referer string,
        id string,
        end_user_id string,
        city_id string
    )
    row format delimited fields terminated by '\t'
    stored as textfile;
    
    load data local inpath '/opt/module/hive-3.1.2/datas/log.data' into table log_text;
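
    To compare sizes after each load, a table's footprint on HDFS can be checked from the Hive CLI; the path below assumes the default warehouse location, and the same check applies to log_orc and log_parquet:

    -- Show the on-disk size of the table's data files (default path assumed)
    dfs -du -h /user/hive/warehouse/log_text;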
    

  2. ORC

    -- Create a table stored as ORC, with compression disabled
    create table if not exists log_orc (
        track_time string,
        url string,
        session_id string,
        referer string,
        id string,
        end_user_id string,
        city_id string
    )
    row format delimited fields terminated by '\t'
    stored as orc
    tblproperties ("orc.compress"="NONE");
    
    insert into table log_orc select * from log_text;
    

  3. Parquet

    -- Create a table stored as PARQUET
    create table if not exists log_parquet (
        track_time string,
        url string,
        session_id string,
        referer string,
        id string,
        end_user_id string,
        city_id string
    )
    row format delimited fields terminated by '\t'
    stored as parquet;
    
    insert into table log_parquet select * from log_text;
    

The experimental results show that, by compression ratio (smallest resulting files first): ORC > Parquet > TextFile.

4. Experimental comparison of storage and compression

  1. Create a ZLIB compressed ORC storage method

    -- Create a table stored as ORC with ZLIB compression
    create table if not exists log_orc_zlib (
        track_time string,
        url string,
        session_id string,
        referer string,
        id string,
        end_user_id string,
        city_id string
    )
    row format delimited fields terminated by '\t'
    stored as orc
    tblproperties ("orc.compress"="ZLIB");
    
    insert into table log_orc_zlib select * from log_text;
    

  2. Create a SNAPPY compressed ORC storage method

    -- Create a table stored as ORC with SNAPPY compression
    create table if not exists log_orc_snappy (
        track_time string,
        url string,
        session_id string,
        referer string,
        id string,
        end_user_id string,
        city_id string
    )
    row format delimited fields terminated by '\t'
    stored as orc
    tblproperties ("orc.compress"="SNAPPY");
    
    insert into table log_orc_snappy select * from log_text;
    

  3. Create a SNAPPY compressed Parquet storage method

    -- Create a table stored as PARQUET with SNAPPY compression
    create table if not exists log_parquet_snappy (
        track_time string,
        url string,
        session_id string,
        referer string,
        id string,
        end_user_id string,
        city_id string
    )
    row format delimited fields terminated by '\t'
    stored as parquet
    tblproperties ("parquet.compression"="SNAPPY");
    
    insert into table log_parquet_snappy select * from log_text;
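
    The resulting sizes can then be compared directly on HDFS from the Hive CLI (paths assume the default warehouse location):

    -- Compare on-disk sizes of the compressed tables (paths assumed)
    dfs -du -h /user/hive/warehouse/log_orc_zlib;
    dfs -du -h /user/hive/warehouse/log_orc_snappy;
    dfs -du -h /user/hive/warehouse/log_parquet_snappy;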
    
