Hive series - 5. Hive's data compression and data storage format

Copyright notice: This article is the original article of the blogger and follows the CC 4.0 BY-SA copyright agreement. For reprint, please attach the source link of the original text and this notice.


hive data compression

In practice, the data processed in Hive usually needs to be compressed. Compression saves disk space and reduces the network bandwidth consumed during MapReduce processing.

MR supported compression coding

| Compression format | Tool | Algorithm | File extension | Splittable |
|---|---|---|---|---|
| DEFLATE | none | DEFLATE | .deflate | no |
| Gzip | gzip | DEFLATE | .gz | no |
| bzip2 | bzip2 | bzip2 | .bz2 | yes |
| LZO | lzop | LZO | .lzo | no |
| LZ4 | none | LZ4 | .lz4 | no |
| Snappy | none | Snappy | .snappy | no |

In order to support a variety of compression / decompression algorithms, Hadoop introduces an encoder / decoder, as shown in the table below

| Compression format | Corresponding encoder/decoder |
|---|---|
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | com.hadoop.compression.lzo.LzopCodec |
| LZ4 | org.apache.hadoop.io.compress.Lz4Codec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |
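As a rough illustration, the extension-to-codec mapping above can be sketched as a small lookup table, mirroring what Hadoop's CompressionCodecFactory does when it infers a codec from a file suffix. The helper function below is illustrative only, not Hadoop's actual API:

```python
# Sketch of the extension -> codec lookup Hadoop performs when it
# opens an input file and must decide how to decompress it.
CODEC_BY_EXTENSION = {
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz":      "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2":     "org.apache.hadoop.io.compress.BZip2Codec",
    ".lzo":     "com.hadoop.compression.lzo.LzopCodec",
    ".lz4":     "org.apache.hadoop.io.compress.Lz4Codec",
    ".snappy":  "org.apache.hadoop.io.compress.SnappyCodec",
}

def codec_for(path):
    """Return the codec class name for a file path, or None if uncompressed."""
    for ext, codec in CODEC_BY_EXTENSION.items():
        if path.endswith(ext):
            return codec
    return None

print(codec_for("part-00000.gz"))   # org.apache.hadoop.io.compress.GzipCodec
print(codec_for("part-00000"))      # None
```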

Comparison of compression performance

| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
|---|---|---|---|---|
| gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s |
| bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s |
| LZO | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s |
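The exact figures depend on hardware and data, but the ratio-versus-speed trade-off is easy to reproduce locally with Python's standard library (gzip and bz2 here; Snappy and LZO are not in the stdlib). The sample data below is made up for illustration:

```python
import bz2
import gzip
import time

# Synthetic log-like data: repetitive tab-delimited text, roughly the
# shape of the files Hive stores. Real ratios and speeds will differ.
data = b"2021-09-18\thttp://example.com/page\tsession123\t10.0.0.1\n" * 50_000

for name, compress in [("gzip", gzip.compress), ("bzip2", bz2.compress)]:
    t0 = time.perf_counter()
    out = compress(data)
    dt = time.perf_counter() - t0
    print(f"{name}: ratio {len(data) / len(out):.1f}x in {dt * 1000:.0f} ms")
```

On most machines this shows the same pattern as the table: bzip2 squeezes harder but takes noticeably longer than gzip.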

http://google.github.io/snappy/
On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

Compression configuration parameters

To enable compression in Hadoop, set the corresponding parameters, either globally in mapred-site.xml or per session with `set`, as in the examples below.

Enable Map output phase compression

Enabling map output compression reduces the amount of data transferred between the map and reduce tasks of a job. The specific configuration is as follows.
Example:
1) Enable Hive intermediate (shuffle) data compression

hive(default)>set hive.exec.compress.intermediate=true;

2) Enable the map output compression function in mapreduce

hive (default)>set mapreduce.map.output.compress=true;

3) Set the compression method of map output data in mapreduce

hive (default)>set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4) Execute query statement

select count(1) from score;

Enable Reduce output compression

When Hive writes output to a table, that output can also be compressed. The property
hive.exec.compress.output controls this function. It defaults to false, so output is written as an uncompressed plain-text file. Set it to true in a query or execution script to enable output compression.

Example:

-- 1) Enable Hive's final output data compression
set hive.exec.compress.output=true;
-- 2) Enable MapReduce final output data compression
set mapreduce.output.fileoutputformat.compress=true;
-- 3) Set the MapReduce final output compression codec
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- 4) Use block-level compression for the final output
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
-- 5) Verify that the output is a compressed file
insert overwrite local directory '/export/data/exporthive/compress' select * from score distribute by sid sort by sscore desc;

hive data storage format

Hive supports the following storage formats: TEXTFILE (row-based), SEQUENCEFILE (row-based), ORC (columnar) and PARQUET (columnar).

Column storage and row storage


Characteristics of row storage: when querying a whole row that meets a condition, column storage must look up each column's value in its separate, aggregated block, whereas row storage finds one value and the rest of the row sits adjacent to it. For full-row queries, row storage is therefore faster.

Characteristics of column storage: because each field's data is stored together, the amount of data read can be greatly reduced when a query needs only a few fields. Since all values in a column share the same data type, compression algorithms can be designed to suit each column.

Compared with row storage, column storage has many excellent features in the analysis scenario:

1) Analysis queries typically read a large number of rows but only a few columns. With row storage, data is laid out row by row, with all columns of a row in the same block, so columns that do not participate in the computation must still be read during IO: read amplification is severe. With column storage, only the columns involved in the computation are read, which greatly reduces IO overhead and speeds up queries.

2) The data in the same column belong to the same type, and the compression effect is remarkable. Column storage often has a compression ratio of up to ten times or even higher, which saves a lot of storage space and reduces the storage cost.

3) Higher compression ratio means smaller data space and less time to read corresponding data from disk.

4) Free choice of compression algorithm. Different columns of data have different data types, and the applicable compression algorithms are different. You can select the most appropriate compression algorithm for different column types.

The storage formats of TEXTFILE and SEQUENCEFILE are based on row storage;
ORC and PARQUET are based on column storage.
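A small sketch of why the columnar layout compresses better: the same toy table serialized row-major and column-major, then compressed with zlib. The column names and data below are made up for illustration; grouping the low-cardinality column together lets the compressor exploit its redundancy much more cheaply:

```python
import random
import zlib

random.seed(7)

# Toy table: a low-cardinality "status" column and a high-entropy
# "session_id" column, 10,000 rows.
rows = [("200" if random.random() < 0.95 else "404",
         "%032x" % random.getrandbits(128))
        for _ in range(10_000)]

# Row storage: values of different columns interleaved line by line.
row_major = "\n".join("\t".join(r) for r in rows).encode()

# Column storage: each column's values stored contiguously.
col_major = ("\n".join(r[0] for r in rows) + "\n" +
             "\n".join(r[1] for r in rows)).encode()

print("row-major compressed:", len(zlib.compress(row_major)))
print("col-major compressed:", len(zlib.compress(col_major)))
```

The column-major bytes compress smaller: the almost-constant status column collapses to nearly nothing when its values are adjacent, while in the row layout each status value is surrounded by incompressible session ids. Real formats like ORC and Parquet go further with per-column encodings (run-length, dictionary, delta).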

Comparison experiment of mainstream file storage formats

The compression ratio of stored files and query speed are compared.
Compression ratio test of stored files:
1)TextFile
(1) Create a table and store data in TEXTFILE format

create table log_text (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;

(2) Loading data into a table

load data local inpath '/export/data/hivedatas/log.data' into table log_text ;

(3) View data size in table

hadoop fs  -du -h /user/hive/warehouse/myhive.db/log_text;  

18.1 M /user/hive/warehouse/log_text/log.data

2)ORC
(1) Create a table and store data in ORC format

create table log_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc ;

(2) Loading data into a table

insert into table log_orc select * from log_text ;

(3) View data size in table

hadoop fs  -du -h /user/hive/warehouse/myhive.db/log_orc;

2.8 M /user/hive/warehouse/log_orc/123456_0

3)Parquet

(1) Create a table and store data in parquet format

create table log_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET ;

(2) Loading data into a table

insert into table log_parquet select * from log_text ;

(3) View data size in table

hadoop fs  -du -h /user/hive/warehouse/myhive.db/log_parquet;

13.1 M /user/hive/warehouse/log_parquet/123456_0

Compression ratio summary of stored files:

ORC > Parquet > TextFile
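From the sizes measured above (18.1 M, 2.8 M and 13.1 M), the ratios work out to roughly 6.5x for ORC and 1.4x for Parquet relative to TextFile:

```python
# Sizes in MB as measured above with hadoop fs -du -h.
text_mb, orc_mb, parquet_mb = 18.1, 2.8, 13.1

print(f"ORC:     {text_mb / orc_mb:.1f}x smaller than TextFile")
print(f"Parquet: {text_mb / parquet_mb:.1f}x smaller than TextFile")
```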

Query speed test of stored files:

1)TextFile
hive (default)> select count(*) from log_text;
_c0
100000
Time taken: 21.54 seconds, Fetched: 1 row(s)
2)ORC
hive (default)> select count(*) from log_orc;
_c0
100000
Time taken: 20.867 seconds, Fetched: 1 row(s)
3)Parquet
hive (default)> select count(*) from log_parquet;
_c0
100000
Time taken: 22.922 seconds, Fetched: 1 row(s)

Summary of query speed of stored files:

ORC > TextFile > Parquet

Combination of storage and compression

Compression of ORC storage mode:

| Key | Default | Notes |
|---|---|---|
| orc.compress | ZLIB | high-level compression (one of NONE, ZLIB, SNAPPY) |
| orc.compress.size | 262,144 | number of bytes in each compression chunk |
| orc.stripe.size | 67,108,864 | number of bytes in each stripe |
| orc.row.index.stride | 10,000 | number of rows between index entries (must be >= 1000) |
| orc.create.index | true | whether to create row indexes |
| orc.bloom.filter.columns | "" | comma-separated list of column names for which bloom filters should be created |
| orc.bloom.filter.fpp | 0.05 | false positive probability for bloom filters (must be > 0.0 and < 1.0) |

1) Create an uncompressed ORC storage method

(1) Create table statement

create table log_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");

(2) Insert data

insert into table log_orc_none select * from log_text ;

(3) View Post insert data

hadoop fs -du -h /user/hive/warehouse/myhive.db/log_orc_none;

7.7 M /user/hive/warehouse/log_orc_none/123456_0

2) Create a SNAPPY compressed ORC storage method

(1) Create table statement

create table log_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");

(2) Insert data

insert into table log_orc_snappy select * from log_text ;

(3) View Post insert data

hadoop fs  -du -h /user/hive/warehouse/myhive.db/log_orc_snappy ;

3.8 M /user/hive/warehouse/log_orc_snappy/123456_0

3) The ORC table created earlier with the default settings comes to the following size after loading the same data:

2.8 M /user/hive/warehouse/log_orc/123456_0

This is smaller than the Snappy-compressed version because ORC uses ZLIB compression by default, and ZLIB achieves a higher compression ratio than Snappy (at the cost of compression speed).

4) Storage mode and compression summary:

In real project development, Hive tables generally use ORC or PARQUET as the storage format, with Snappy usually chosen as the compression codec for its speed.

end

Next: Hive tuning

Tags: Big Data Hadoop hive

Posted on Sat, 18 Sep 2021 13:19:17 -0400 by darknessmdk