Hadoop data compression

1. General

1) Advantages and disadvantages of compression

  • Advantage of compression: reduces disk I/O and disk storage space.
  • Disadvantage of compression: increases CPU overhead.

2) Compression principle

(1) For compute-intensive jobs, use compression sparingly.

(2) For I/O-intensive jobs, use compression more aggressively.

2. Compression codecs supported by MapReduce

1) Comparison and introduction of compression algorithms
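
A minimal sketch, assuming a standard Hadoop 2.x/3.x client is on the classpath: it uses CompressionCodecFactory to print the codec classes available to a job, i.e. those discovered on the classpath plus any listed in io.compression.codecs.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ListCodecs {
	public static void main(String[] args) {
		Configuration conf = new Configuration();
		// Codec classes discovered on the classpath plus any listed in io.compression.codecs
		List<Class<? extends CompressionCodec>> codecs =
				CompressionCodecFactory.getCodecClasses(conf);
		for (Class<? extends CompressionCodec> codec : codecs) {
			System.out.println(codec.getName());
		}
	}
}

On a stock installation this typically includes DefaultCodec, GzipCodec, and BZip2Codec from org.apache.hadoop.io.compress; LZO and Snappy availability depends on how the cluster was built and which native libraries are installed.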

2) Comparison of compression performance


http://google.github.io/snappy/

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.
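
A minimal sketch of using Hadoop's SnappyCodec directly, assuming a Hadoop build with Snappy support on the classpath: it compresses a repetitive in-memory buffer and prints the size before and after.

import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SnappyDemo {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		// Instantiate the codec the same way the MapReduce framework does
		CompressionCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);

		// A highly repetitive 8 MB buffer, so the effect of compression is easy to see
		byte[] input = new byte[8 * 1024 * 1024];
		Arrays.fill(input, (byte) 'a');

		ByteArrayOutputStream compressed = new ByteArrayOutputStream();
		CompressionOutputStream out = codec.createOutputStream(compressed);
		out.write(input);
		out.finish();
		out.close();

		System.out.println("original:   " + input.length + " bytes");
		System.out.println("compressed: " + compressed.size() + " bytes");
	}
}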

3. Selection of compression method

When selecting a compression method, consider three key factors: compression/decompression speed, compression ratio (the size of the data after compression), and whether the compressed file can still be split.

3.1 Gzip compression

  • Advantages: high compression ratio.
  • Disadvantages: splitting is not supported; average compression/decompression speed.

3.2 Bzip2 compression

  • Advantages: high compression ratio; splitting is supported.
  • Disadvantages: slow compression/decompression speed.

3.3 Lzo compression

  • Advantages: fast compression/decompression speed; splitting is supported.
  • Disadvantages: average compression ratio; an additional index must be built to support splitting.

3.4 Snappy compression

  • Advantages: fast compression and decompression speed.
  • Disadvantages: splitting is not supported; average compression ratio.

3.5 Selection of compression position

Compression can be enabled at any stage of a MapReduce job: on the input files (Hadoop picks the codec from the file extension), on the intermediate map output, and on the final reduce output.

4. Compression parameter configuration

1) To support a variety of compression/decompression algorithms, Hadoop provides coder/decoder (codec) classes.

2) To enable compression in Hadoop, you configure a small set of parameters, summarised in the sketch below.
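
A minimal sketch of the commonly used compression parameters, assuming the standard Hadoop 2.x/3.x property names (defaults and exact behaviour may differ between versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;

public class CompressionConfSketch {
	public static Configuration build() {
		Configuration conf = new Configuration();

		// Input stage: Hadoop chooses the decompression codec from the file
		// extension, using the codecs listed in io.compression.codecs.

		// Map output stage: compress the intermediate data written to disk
		// and shuffled to the reducers.
		conf.setBoolean("mapreduce.map.output.compress", true);
		conf.setClass("mapreduce.map.output.compress.codec",
				DefaultCodec.class, CompressionCodec.class);

		// Reduce output stage: compress the final job output.
		conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
		conf.setClass("mapreduce.output.fileoutputformat.compress.codec",
				BZip2Codec.class, CompressionCodec.class);

		return conf;
	}
}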

5. Practical compression cases

5.1 Map output is compressed

Even if your MapReduce input and output files are uncompressed, you can still compress the intermediate output of the map tasks, because that data is written to local disk and transferred over the network to the reduce nodes. Compressing it can improve performance considerably, and it only takes two properties. The driver code below shows how to set them.

1) Codecs shipped with Hadoop that can be used here include BZip2Codec and DefaultCodec

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		Configuration conf = new Configuration();
		// Enable map output compression
		conf.setBoolean("mapreduce.map.output.compress", true);
		// Set the map-side output compression codec
		conf.setClass("mapreduce.map.output.compress.codec",
				BZip2Codec.class, CompressionCodec.class);
		
		Job job = Job.getInstance(conf);
		job.setJarByClass(WordCountDriver.class);
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		boolean result = job.waitForCompletion(true);
		System.exit(result ? 0 : 1);
	} 
}

2) Mapper remains unchanged

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
	Text k = new Text();
	IntWritable v = new IntWritable(1);

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// 1. Get one line of input
		String line = value.toString();
		// 2. Split the line into words
		String[] words = line.split(" ");
		// 3. Write out each word with a count of 1
		for (String word : words) {
			k.set(word);
			context.write(k, v);
		}
	}
}

3) Reducer remains unchanged

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

	IntWritable v = new IntWritable();

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {
		int sum = 0;
		// 1. Sum the counts for this key
		for (IntWritable value : values) {
			sum += value.get();
		}
		v.set(sum);
		// 2. Write the result
		context.write(key, v);
	} 
}

5.2 Reduce output is compressed

This case builds on the WordCount program above.

1) Modify the driver

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.Lz4Codec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		job.setJarByClass(WordCountDriver.class);
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		// Set the output compression on the reduce side
		FileOutputFormat.setCompressOutput(job, true);
		// Set the compression codec
		FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
		// FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
		// FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
 
		boolean result = job.waitForCompletion(true);
		System.exit(result ? 0 : 1);
	} 
}

2) Mapper and Reducer remain unchanged. After the job finishes, the output files carry the codec's default extension (for example, part-r-00000.bz2 when BZip2Codec is used).

Keep going, and thanks for reading!
