MapReduce detailed explanation and code implementation

1. MapReduce definition

MapReduce is a programming framework for distributed computation and the core framework on which users build Hadoop-based data analysis applications. Its core function is to combine the business logic written by the user with its own default components into a complete distributed program that runs concurrently on a Hadoop cluster.

2. Advantages and disadvantages

2.1 Advantages

  • MapReduce is easy to program

By implementing a few interfaces, you get a complete distributed program that can be deployed across a large number of inexpensive commodity machines. In other words, writing a distributed program feels exactly the same as writing a simple serial program. This is what made MapReduce programming so popular.

  • Good scalability

When computing resources are insufficient, you can expand computing capacity simply by adding machines.

  • High fault tolerance

MapReduce was designed to run on inexpensive commodity machines, which requires high fault tolerance. For example, if one machine goes down, the computing tasks running on it are moved to another node, so the job does not fail. This process requires no manual intervention; Hadoop handles it entirely internally.

  • Suitable for offline processing of massive data, at the PB scale and above

Clusters of thousands of servers can work concurrently to provide the required data processing capacity.
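
The fault-tolerance behaviour described above (a failed task is rerun on another node without manual intervention) can be pictured with a small plain-Java sketch. This is only an illustration of the idea, not Hadoop's actual scheduling code: the class `RetryOnAnotherNode`, the method `runWithFailover`, and the node names are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration: rerun a task on the next node when one node fails.
final class RetryOnAnotherNode {

    // Tries the nodes in order and returns the result from the first node
    // that completes the task. Fails only if every node fails.
    static <R> R runWithFailover(List<String> nodes, Function<String, R> task) {
        RuntimeException lastFailure = null;
        for (String node : nodes) {
            try {
                return task.apply(node);
            } catch (RuntimeException e) {
                lastFailure = e; // this node failed; move on to the next one
            }
        }
        throw new IllegalStateException("task failed on every node", lastFailure);
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node1", "node2", "node3");
        String result = runWithFailover(nodes, node -> {
            if (node.equals("node1")) {
                throw new RuntimeException(node + " is down"); // simulated crash
            }
            return "done on " + node;
        });
        System.out.println(result); // prints "done on node2"
    }
}
```

The caller never sees the failure of node1; it only sees the final result, which is the property the paragraph above describes.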

2.2 Disadvantages

  • Not good at real-time computing

MapReduce cannot return results within milliseconds or seconds the way MySQL can.

  • Not good at stream computing

The input to a stream computation arrives dynamically, whereas the input data set of a MapReduce job is static and cannot change while the job runs. MapReduce's design requires the data source to be static.

  • Not good at DAG (directed acyclic graph) computing

Here several applications depend on one another, with the output of one serving as the input of the next. MapReduce can handle this, but every job writes its output to disk, so a chain of jobs incurs heavy disk I/O and performs very poorly.
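
To make the disk round trip between dependent jobs concrete, here is a minimal plain-Java sketch. It does not use Hadoop; the class `ChainedJobs`, its methods, and the temporary file are assumptions made for illustration. The first "job" counts words and writes its result to a file, and the second "job" can only start by reading that file back.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of two dependent jobs linked only through a file.
final class ChainedJobs {

    // Job 1: count words and write "word<TAB>count" lines to disk.
    static void countWords(List<String> lines, Path output) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        List<String> outLines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            outLines.add(entry.getKey() + "\t" + entry.getValue());
        }
        Files.write(output, outLines);
    }

    // Job 2: read job 1's output from disk and group the words by their count.
    static Map<Integer, List<String>> groupByCount(Path input) throws IOException {
        Map<Integer, List<String>> byCount = new TreeMap<>();
        for (String line : Files.readAllLines(input)) {
            String[] parts = line.split("\t");
            byCount.computeIfAbsent(Integer.parseInt(parts[1]), c -> new ArrayList<>())
                   .add(parts[0]);
        }
        return byCount;
    }

    // Runs both jobs in sequence; the intermediate result exists only on disk.
    static Map<Integer, List<String>> runChain(List<String> lines) {
        try {
            Path intermediate = Files.createTempFile("job1-output", ".txt");
            try {
                countWords(lines, intermediate);
                return groupByCount(intermediate);
            } finally {
                Files.deleteIfExists(intermediate);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runChain(Arrays.asList("atguigu atguigu", "ss ss", "hadoop")));
        // prints {1=[hadoop], 2=[atguigu, ss]}
    }
}
```

Every extra stage in such a chain adds one more write and one more read of the full intermediate data set, which is where the performance cost comes from.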

3. MapReduce core idea

(1) A distributed computing program usually needs at least two stages: a map stage (responsible for splitting up the work) and a reduce stage (responsible for aggregating the results).
(2) The concurrent MapTask instances in the first stage run fully in parallel and independently of one another.
(3) The concurrent ReduceTask instances in the second stage are also independent of one another, but their input depends on the output of all MapTask instances from the previous stage.
(4) The MapReduce programming model allows only one map stage and one reduce stage per job. If the business logic is very complex, the only option is to run several MapReduce programs one after another, which is inefficient.
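
The two stages above can be simulated on a single machine in plain Java. The sketch below is not Hadoop code and makes some simplifying assumptions: the class `LocalWordCount` and its methods are hypothetical, a parallel stream stands in for the independent MapTask instances, and a sorted in-memory map stands in for the grouping step that hands each key's values to a ReduceTask.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-machine simulation of the map and reduce stages.
final class LocalWordCount {

    // Map stage: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return out;
    }

    // Reduce stage: sum all the counts collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Each line is mapped independently, so the calls can run in parallel.
        List<Map.Entry<String, Integer>> mapped = lines.parallelStream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.toList());

        // Grouping step: the reduce stage needs ALL map output for a key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Each key is reduced independently of the others.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("atguigu atguigu", "ss ss", "cls cls")));
        // prints {atguigu=2, cls=2, ss=2}
    }
}
```

Note that the grouping step cannot start producing complete groups until every map call has finished, which mirrors point (3): reduce instances are independent of each other but depend on all map output.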

4. MapReduce process

When running in distributed mode, a complete MapReduce program has three kinds of instance processes:
(1) MRAppMaster: responsible for process scheduling and state coordination of the whole program.
(2) MapTask: responsible for the entire data processing flow of the map stage.
(3) ReduceTask: responsible for the entire data processing flow of the reduce stage.

5. Common data serialization types
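
Hadoop keys and values are not plain Java types; they are wrapper types such as IntWritable (for int), LongWritable (for long) and Text (for String), which appear in the WordCount code later in this post. These types implement Hadoop's Writable interface, whose two methods are `write(DataOutput)` and `readFields(DataInput)`. The sketch below mirrors that contract in plain Java so it runs without Hadoop on the classpath; `SimpleWritable` and `WordCountRecord` are hypothetical stand-ins, not Hadoop classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in that mirrors the two methods of Hadoop's Writable.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// A (word, count) record that knows how to serialize itself field by field.
final class WordCountRecord implements SimpleWritable {
    String word = "";
    int count;

    WordCountRecord() { }

    WordCountRecord(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    // Fields must be read back in exactly the order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
    }

    static byte[] toBytes(SimpleWritable writable) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writable.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static WordCountRecord fromBytes(byte[] bytes) {
        WordCountRecord record = new WordCountRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            record.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new WordCountRecord("hadoop", 3));
        WordCountRecord copy = fromBytes(bytes);
        System.out.println(copy.word + " " + copy.count + " (" + bytes.length + " bytes)");
        // prints "hadoop 3 (12 bytes)"
    }
}
```

The record carries only its field values (a 2-byte length, 6 bytes of text and a 4-byte int here), which is the kind of compact format that suits moving many small records between nodes.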

6. MapReduce programming specification

The program written by the user is divided into three parts: Mapper, Reducer and Driver.

6.1 Mapper stage

6.2 Reducer stage

6.3 Driver stage

7. Demo: implementing WordCount

7.1 Requirements

Count the total number of occurrences of each word in a given text file.

7.2 Input data

atguigu atguigu
ss ss
cls cls
jiao
banzhang
xue
hadoop

7.3 Expected output data

atguigu	2
banzhang	1
cls	2
hadoop	1
jiao	1
ss	2
xue	1

7.4 Requirements analysis

Following the MapReduce programming specification, write the Mapper, Reducer, and Driver separately.

7.5 Environment preparation

(1) Create a Maven project
(2) Add the dependency coordinates to the pom


(3) In the project's src/main/resources directory, create a new file named "" and fill it with the following content.

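log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n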

7.6 Writing the program



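(1) Write the Mapper class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused for every record to avoid allocating new objects per word
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1 Get one line of input
        String line = value.toString();
        // 2 Split the line into words
        String[] words = line.split(" ");

        // 3 Emit (word, 1) for each word
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
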
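(2) Write the Reducer class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    int sum;
    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 1 Sum the counts for this word
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }

        // 2 Emit (word, total)
        v.set(sum);
        context.write(key, v);
    }
}
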
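(3) Write the Driver class

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1 Get the configuration and create the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Associate the jar that contains this Driver
        job.setJarByClass(WordCountDriver.class);

        // 3 Associate the Mapper and Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // 4 Set the key/value types of the Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5 Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}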

Tags: Java Big Data Hadoop mapreduce

Posted on Thu, 18 Nov 2021 10:48:22 -0500 by Meltdown