Copyright notice: This article is the blogger's original work, licensed under CC 4.0 BY-SA. When reprinting, please attach a link to the original and this notice.
Hive data compression
In practical work, the data processed in Hive generally needs to be compressed. Compression saves network bandwidth during MapReduce processing.
Compression formats supported by MR
| Compression format | Tool | Algorithm | File extension | Splittable |
| --- | --- | --- | --- | --- |
| DEFAULT | none | DEFLATE | .deflate | no |
| Gzip | gzip | DEFLATE | .gz | no |
| bzip2 | bzip2 | bzip2 | .bz2 | yes |
| LZO | lzop | LZO | .lzo | no |
| LZ4 | none | LZ4 | .lz4 | no |
| Snappy | none | Snappy | .snappy | no |

To support a variety of compression/decompression algorithms, Hadoop provides encoder/decoder classes (codecs), as shown in the table below.
| Compression format | Corresponding encoder/decoder |
| --- | --- |
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | com.hadoop.compression.lzo.LzopCodec |
| LZ4 | org.apache.hadoop.io.compress.Lz4Codec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |

Comparison of compression performance
| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
| --- | --- | --- | --- | --- |
| gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s |
| bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s |
| LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s |

http://google.github.io/snappy/
On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.
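The ratio/speed trade-off in the table above can be felt with a tiny sketch using Python's standard library (zlib implements DEFLATE, the same algorithm behind gzip; bz2 implements bzip2). The sample data is synthetic and only illustrative:

```python
import bz2
import time
import zlib

# Synthetic, repetitive data, the kind of thing log files are made of.
data = b"track_time,url,session_id,referer,ip\n" * 20000

for name, compress in (("zlib (DEFLATE)", zlib.compress),
                       ("bz2 (bzip2)", bz2.compress)):
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    # Higher ratio = smaller output; lower time = faster codec.
    print(f"{name}: ratio {len(data) / len(packed):.1f}x "
          f"in {elapsed * 1000:.1f} ms")
```

On typical hardware bzip2 compresses tighter but noticeably slower, which is exactly the trade-off the table shows at a larger scale.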
Compression configuration parameters
To enable compression in Hadoop, configure the relevant parameters in the mapred-site.xml file (they can also be set per session from the Hive CLI, as in the examples below):
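As a minimal sketch (the exact set of properties to persist depends on the cluster), the mapred-site.xml entries corresponding to the session-level `set` commands used later in this article would look like:

```xml
<!-- Sketch of mapred-site.xml compression entries; the same property
     names are set per-session with "set ..." in the examples below. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```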
Enable Map output phase compression
Turning on map output compression reduces the amount of data transferred between the map and reduce tasks in a job. The specific configuration is as follows:
Case practice:
1) Enable hive intermediate transmission data compression function
hive(default)>set hive.exec.compress.intermediate=true;
2) Enable the map output compression function in mapreduce
hive (default)>set mapreduce.map.output.compress=true;
3) Set the compression method of map output data in mapreduce
hive (default)>set mapreduce.map.output.compress.codec= org.apache.hadoop.io.compress.SnappyCodec;
4) Execute query statement
select count(1) from score;
Enable Reduce output compression
When Hive writes output to a table, that output can also be compressed. The property hive.exec.compress.output controls this behavior. It is usually left at its default value of false in the settings file, so the default output is an uncompressed plain-text file; set it to true in the query session or execution script to compress the final output.
Case practice:
-- 1) Enable Hive final output data compression
set hive.exec.compress.output=true;
-- 2) Enable MapReduce final output data compression
set mapreduce.output.fileoutputformat.compress=true;
-- 3) Set the MapReduce final output compression codec
set mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
-- 4) Set the MapReduce final output to block compression
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
-- 5) Test whether the output is a compressed file
insert overwrite local directory '/export/data/exporthive/compress'
select * from score distribute by sid sort by sscore desc;
Hive data storage formats
Hive supports the following storage formats: TEXTFILE (row storage), SEQUENCEFILE (row storage), ORC (column storage), and PARQUET (column storage).
Column storage and row storage
Characteristics of row storage: when a query needs a whole row that meets some condition, column storage must look up each field's value in its separate column block and reassemble the row, while row storage finds one value and the rest of the row sits adjacent to it on disk. Whole-row queries are therefore faster with row storage.
Characteristics of column storage: because each field's values are stored together, a query that touches only a few fields reads far less data; and since all values in a column share one data type, compression algorithms can be tailored to that type.
Compared with row storage, column storage has many excellent features in the analysis scenario:
1) Analysis scenarios often read a large number of rows but only a few columns. In row storage, data is laid out row by row with all columns in the same block, so even columns that do not participate in the computation must be read during IO, which severely amplifies reads. In column storage, only the columns involved in the computation are read, which greatly reduces IO overhead and speeds up the query.
2) The data in the same column belong to the same type, and the compression effect is remarkable. Column storage often has a compression ratio of up to ten times or even higher, which saves a lot of storage space and reduces the storage cost.
3) Higher compression ratio means smaller data space and less time to read corresponding data from disk.
4) Free choice of compression algorithm. Different columns of data have different data types, and the applicable compression algorithms are different. You can select the most appropriate compression algorithm for different column types.
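Points 1) and 2) above can be sketched with a toy example (this is not Hive's real ORC/Parquet code; the record fields are hypothetical): the same records are serialized row-wise and column-wise, then both layouts are DEFLATE-compressed.

```python
import zlib

# Hypothetical records: a repetitive date, a unique user id,
# and a low-cardinality page path.
rows = [(f"2024-01-{i % 30 + 1:02d}", f"user{i}", f"/page{i % 5}")
        for i in range(5000)]

# Row layout: all fields of a record stored together.
row_bytes = "\n".join(",".join(r) for r in rows).encode()

# Column layout: all values of one field stored together.
col_bytes = b"|".join(",".join(col).encode() for col in zip(*rows))

# The column layout groups same-typed, repetitive values,
# so the generic compressor does much better on it.
print("row layout compressed:", len(zlib.compress(row_bytes)))
print("col layout compressed:", len(zlib.compress(col_bytes)))
```

Real column formats go further than this sketch: they pick per-column encodings (run-length, dictionary, delta) before applying a general-purpose codec.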
The storage formats of TEXTFILE and SEQUENCEFILE are based on row storage;
ORC and PARQUET are based on column storage.
Comparison experiment of mainstream file storage formats
The compression ratio of stored files and query speed are compared.
Compression ratio test of stored files:
1)TextFile
(1) Create a table and store data in TEXTFILE format
create table log_text (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
(2) Loading data into a table
load data local inpath '/export/data/hivedatas/log.data' into table log_text;
(3) View data size in table
hadoop fs -du -h /user/hive/warehouse/myhive.db/log_text;
18.1 M /user/hive/warehouse/log_text/log.data
2)ORC
(1) Create a table and store data in ORC format
create table log_orc (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc;
(2) Loading data into a table
insert into table log_orc select * from log_text;
(3) View data size in table
hadoop fs -du -h /user/hive/warehouse/myhive.db/log_orc;
2.8 M /user/hive/warehouse/log_orc/000000_0
3)Parquet
(1) Create a table and store data in parquet format
create table log_parquet (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET;
(2) Loading data into a table
insert into table log_parquet select * from log_text;
(3) View data size in table
hadoop fs -du -h /user/hive/warehouse/myhive.db/log_parquet;
13.1 M /user/hive/warehouse/log_parquet/000000_0
Compression ratio summary of stored files:
ORC > Parquet > TextFile
Query speed test of stored files:
1) TextFile
hive (default)> select count(*) from log_text;
_c0
100000
Time taken: 21.54 seconds, Fetched: 1 row(s)
2) ORC
hive (default)> select count(*) from log_orc;
_c0
100000
Time taken: 20.867 seconds, Fetched: 1 row(s)
3) Parquet
hive (default)> select count(*) from log_parquet;
_c0
100000
Time taken: 22.922 seconds, Fetched: 1 row(s)
Summary of query speed of stored files:
ORC > TextFile > Parquet
Combination of storage and compression
Compression of ORC storage mode:
| Key | Default | Notes |
| --- | --- | --- |
| orc.compress | ZLIB | high level compression (one of NONE, ZLIB, SNAPPY) |
| orc.compress.size | 262,144 | number of bytes in each compression chunk |
| orc.stripe.size | 67,108,864 | number of bytes in each stripe |
| orc.row.index.stride | 10,000 | number of rows between index entries (must be >= 1000) |
| orc.create.index | true | whether to create row indexes |
| orc.bloom.filter.columns | "" | comma separated list of column names for which bloom filter should be created |
| orc.bloom.filter.fpp | 0.05 | false positive probability for bloom filter (must be >0.0 and <1.0) |

1) Create an uncompressed ORC table
(1) Create table statement
create table log_orc_none (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");
(2) Insert data
insert into table log_orc_none select * from log_text;
(3) View Post insert data
hadoop fs -du -h /user/hive/warehouse/myhive.db/log_orc_none;
7.7 M /user/hive/warehouse/log_orc_none/000000_0
2) Create a SNAPPY-compressed ORC table
(1) Create table statement
create table log_orc_snappy (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
(2) Insert data
insert into table log_orc_snappy select * from log_text;
(3) View Post insert data
hadoop fs -du -h /user/hive/warehouse/myhive.db/log_orc_snappy;
3.8 M /user/hive/warehouse/log_orc_snappy/000000_0
3) The ORC table created earlier with the default storage settings came to:
2.8 M /user/hive/warehouse/log_orc/000000_0
This is smaller than the Snappy-compressed table because ORC uses ZLIB compression by default, and ZLIB compresses more tightly than Snappy.
4) Storage mode and compression summary:
In actual project development, the storage format of Hive tables is usually ORC or Parquet, with Snappy as the usual choice of compression codec.
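As a closing sketch of that recommendation (the table names here are hypothetical, and the `parquet.compression` property is the Parquet counterpart of the `orc.compress` property used above):

```sql
-- Sketch only: ORC with Snappy, the combination recommended above.
create table log_orc_prod (
    track_time string,
    url string,
    city_id string
)
stored as orc
tblproperties ("orc.compress" = "SNAPPY");

-- Parquet uses a different property name for its codec.
create table log_parquet_prod (
    track_time string,
    url string,
    city_id string
)
stored as parquet
tblproperties ("parquet.compression" = "SNAPPY");
```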
Next: Hive tuning