1. MapReduce definition
MapReduce is a programming framework for distributed computing and the core framework for developing Hadoop-based data analysis applications. Its core function is to combine the business logic code written by the user with its built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.
2. Advantages and disadvantages
2.1 Advantages
- MapReduce is easy to program
By simply implementing a few interfaces, you can complete a distributed program, which can then be distributed to a large number of cheap PC machines. Writing such a distributed program feels exactly the same as writing a simple serial program. This characteristic is what makes MapReduce programming so popular.
- Good scalability
When your computing resources are no longer sufficient, you can simply add machines to expand computing power.
- High fault tolerance
MapReduce was designed from the start to run on cheap PC machines, which requires it to have high fault tolerance. For example, if one machine goes down, its computing tasks can be transferred to another node so that the job does not fail. This process requires no manual intervention; it is handled entirely inside Hadoop.
- Suitable for offline processing of massive data at the PB level and above
It can make thousands of servers in a cluster work concurrently to provide data processing capacity.
2.2 Disadvantages
- 1) Not good at real-time computing
MapReduce cannot return results in milliseconds or seconds like MySQL.
- 2) Not good at streaming computing
The input data of streaming computing is dynamic, whereas the input data set of a MapReduce job is static and cannot change while the job runs; the design of MapReduce requires the data source to be static.
- 3) Not good at DAG (directed acyclic graph) computation
In a DAG workload, multiple applications have dependencies: the input of one application is the output of the previous one. MapReduce can handle this, but the output of each MapReduce job is written to disk, which causes a large amount of disk IO and very poor performance, as sketched in the example below.
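As a minimal sketch of this cost (the class name, package, and use of Hadoop's default identity Mapper and Reducer are illustrative assumptions, not part of the original example), chaining two jobs forces the intermediate result through the file system:

package com.song.mapreduce.chain;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class TwoStageDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();

        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // stage 1 output is materialized on disk
        Path output = new Path(args[2]);

        // Stage 1: no Mapper/Reducer is set, so Hadoop's default identity classes are used
        Job stage1 = Job.getInstance(conf, "stage1");
        stage1.setJarByClass(TwoStageDriver.class);
        FileInputFormat.setInputPaths(stage1, input);
        FileOutputFormat.setOutputPath(stage1, intermediate);
        if (!stage1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Stage 2 must read the intermediate result back from disk; this extra
        // write/read round trip is the disk IO cost described above
        Job stage2 = Job.getInstance(conf, "stage2");
        stage2.setJarByClass(TwoStageDriver.class);
        FileInputFormat.setInputPaths(stage2, intermediate);
        FileOutputFormat.setOutputPath(stage2, output);
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}

Each stage writes its full output before the next one starts, so a pipeline of N chained jobs pays this disk round trip N-1 times.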
3. MapReduce core idea
(1) A distributed computing program often needs to be divided into at least two stages: a Map stage, responsible for splitting up the work, and a Reduce stage, responsible for aggregating the results.
(2) The MapTask concurrent instances in the first stage run completely in parallel and are independent of each other.
(3) The ReduceTask concurrent instances in the second stage are also independent of each other, but their data depends on the output of all the MapTask concurrent instances from the previous stage.
(4) The MapReduce programming model can contain only one Map stage and one Reduce stage. If the user's business logic is very complex, multiple MapReduce programs have to be run serially, which lowers efficiency.
4. MapReduce processes
A complete MapReduce program has three types of instance processes during distributed operation:
(1) MrAppMaster: responsible for process scheduling and state coordination of the whole program.
(2) MapTask: responsible for the entire data processing process in the Map phase.
(3) ReduceTask: responsible for the entire data processing process in the Reduce phase.
5. Common data serialization types
The Hadoop Writable types that correspond to the common Java types are:
- boolean -> BooleanWritable
- byte -> ByteWritable
- int -> IntWritable
- float -> FloatWritable
- long -> LongWritable
- double -> DoubleWritable
- String -> Text
- map -> MapWritable
- array -> ArrayWritable
- null -> NullWritable
6. MapReduce programming specification
The program written by the user is divided into three parts: Mapper, Reducer and Driver.
6.1 Mapper stage
- The user-defined Mapper must extend Hadoop's Mapper parent class.
- The input to the Mapper is a key-value pair (the key and value types can be customized).
- The business logic is written in the map() method.
- The output of the Mapper is also a key-value pair (the key and value types can be customized).
- The map() method is called once by the MapTask process for each input key-value pair.
6.2 Reducer stage
- The user-defined Reducer must extend Hadoop's Reducer parent class.
- The input type of the Reducer corresponds to the output type of the Mapper and is also a key-value pair.
- The business logic is written in the reduce() method.
- The reduce() method is called once by the ReduceTask process for each group of key-value pairs that share the same key.
6.3 Driver stage
- The Driver is equivalent to a YARN cluster client: it assembles a Job object that encapsulates the running parameters of the MapReduce program and submits the whole program to the cluster.
7. Demo: implementing WordCount
7.1 Requirements
Count the total number of occurrences of each word in a given text file.
7.2 Input data
atguigu atguigu ss ss cls cls jiao banzhang xue hadoop
7.3 Expected output data
atguigu 2
banzhang 1
cls 2
hadoop 1
jiao 1
ss 2
xue 1
7.4 Requirement analysis
Following the MapReduce programming specification, write a Mapper, a Reducer, and a Driver: the Mapper reads one line at a time, splits it into words, and outputs each word as a <word, 1> pair; the Reducer sums the 1s for each word and outputs <word, total count>; the Driver assembles the Job (jar, Mapper and Reducer classes, output key-value types, input and output paths) and submits it.
7.5 Environment preparation
(1) Create a Maven project.
(2) Add the following dependencies to pom.xml.
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.30</version>
    </dependency>
</dependencies>
(3) In the project's src/main/resources directory, create a new file named "log4j.properties" and fill it with the following content.
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
7.6 Programming
WordCountMapper
package com.song.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Output key/value objects are reused across calls to avoid creating new objects for every record
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line of input
        String line = value.toString();

        // 2. Split the line into words
        String[] words = line.split(" ");

        // 3. Emit each word with a count of 1
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
WordCountReducer
package com.song.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    int sum;
    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 1. Sum the counts for this word
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }

        // 2. Emit the word and its total count
        v.set(sum);
        context.write(key, v);
    }
}
WordCountDriver
package com.song.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get the configuration and create the Job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. Set the jar that contains this Driver class
        job.setJarByClass(WordCountDriver.class);

        // 3. Associate the Mapper and Reducer classes with the job
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // 4. Set the key/value types of the Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5. Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7. Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
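To run the job, pass the input path and output path as the two program arguments; note that the output directory must not already exist, or the job will fail. For a cluster run, package the project with Maven and submit it with a command along the lines of hadoop jar wc.jar com.song.mapreduce.wordcount.WordCountDriver /user/song/input /user/song/output, where the jar name and the HDFS paths are illustrative assumptions, not part of the original example.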