Hadoop YARN learning tutorial

MapReduce overview

MapReduce definition

MapReduce is a programming framework for distributed computing programs and the core framework with which users develop Hadoop-based data analysis applications.

Its core function is to combine the user's business logic code with the framework's built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

MapReduce advantages and disadvantages

Advantages

  • Easy programming

  • Good scalability

    When computing resources run short, you can simply add machines to expand computing capacity.

  • High fault tolerance

    MapReduce was designed to run on cheap commodity machines, which requires high fault tolerance. For example, if one machine goes down, its computing tasks can be transferred to another node so that the job does not fail. This process needs no manual intervention; it is handled entirely inside Hadoop.

  • Suitable for offline processing of massive data at the PB level and above

    It can coordinate the concurrent work of clusters of thousands of servers to provide massive data processing capacity.

Disadvantages

  • Not good at real-time computing

  • Not good at streaming computation

  • Not good at DAG (directed acyclic graph) computation

    When multiple applications have dependencies and the input of one application is the output of the previous one, MapReduce can still be used, but the output of each MapReduce job is written to disk, which causes heavy disk IO and very poor performance.

MapReduce core idea

(1) A distributed computing program often needs to be divided into at least two stages.
(2) The concurrent MapTask instances of the first stage run fully in parallel and are independent of each other.
(3) The concurrent ReduceTask instances of the second stage are also independent of each other, but their input depends on the output of all the MapTask instances of the previous stage.
(4) The MapReduce programming model can contain only one Map phase and one Reduce phase. If the user's business logic is more complex, multiple MapReduce programs must be run serially, as shown in the sketch below.
Summary: analyzing the data flow of WordCount is a good way to understand the MapReduce core idea in depth.
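
To make point (4) concrete, here is a minimal sketch of running two MapReduce jobs serially in one driver: the output directory of the first job becomes the input directory of the second job, and this intermediate write to disk is exactly the disk IO mentioned under the DAG disadvantage above. Step1Mapper, Step1Reducer, Step2Mapper and Step2Reducer are hypothetical placeholders for your own classes, and the output key/value class settings are omitted for brevity.

package com.pihao.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStepDriver {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();

        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // written to disk by job 1, read back by job 2
        Path output = new Path(args[2]);

        // First job
        Job job1 = Job.getInstance(configuration);
        job1.setJarByClass(TwoStepDriver.class);
        job1.setMapperClass(Step1Mapper.class);
        job1.setReducerClass(Step1Reducer.class);
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) {
            System.exit(1);                      // stop here if the first job fails
        }

        // Second job starts only after the first one has finished
        Job job2 = Job.getInstance(configuration);
        job2.setJarByClass(TwoStepDriver.class);
        job2.setMapperClass(Step2Mapper.class);
        job2.setReducerClass(Step2Reducer.class);
        FileInputFormat.setInputPaths(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}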

MapReduce process

When a complete MapReduce program runs in distributed mode, there are three types of instance processes:

(1) MrAppMaster: responsible for process scheduling and state coordination of the whole program.

(2) MapTask: responsible for the entire data processing process in the Map phase.

(3) ReduceTask: responsible for the entire data processing process in the Reduce phase.

Hadoop common data serialization types
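
For reference, the Hadoop Writable types commonly used in place of the corresponding Java types are:

  • boolean -> BooleanWritable
  • byte -> ByteWritable
  • int -> IntWritable
  • long -> LongWritable
  • float -> FloatWritable
  • double -> DoubleWritable
  • String -> Text
  • map -> MapWritable
  • array -> ArrayWritable
  • null -> NullWritable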

MapReduce programming specification

The program written by the user is divided into three parts: Mapper, Reducer and Driver.

1. Mapper stage

2. Reducer stage

3. Driver stage

The Driver is equivalent to a client of the YARN cluster; it is used to submit the whole program to the YARN cluster. The submitted Job object encapsulates the running parameters of the MapReduce program.

WordCount case practice

Write Mapper

package com.pihao.mr;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Take the WordCount case as an example:
 * A custom Mapper class must extend the Mapper provided by Hadoop and specify the input and output data types according to the business.
 *  Input data types:
 *  KEYIN    the byte offset of the line being read in the file (LongWritable)
 *  VALUEIN  the content of the line being read (Text)
 *
 *  Output data types:
 *  KEYOUT   the type of the output key, a single word (Text)
 *  VALUEOUT the type of the output value, the count 1 attached to each word (IntWritable)
 */
public class WordCountMapper extends Mapper<LongWritable, Text,Text, IntWritable> {

    private Text outKey = new Text();
    private IntWritable outValue = new IntWritable(1);

    /**
     * Core business processing method of the Map phase; map is called once for each input line
     * @param key     byte offset of the current line
     * @param value   content of the current line
     * @param context MapReduce context used to write the output
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get the current input line
        String line = value.toString();
        String[] datas = line.split(" ");
        for (String data : datas) {
            // For each word, set the output key and write (word, 1)
            outKey.set(data);
            context.write(outKey,outValue);

        }
    }
}
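
As a quick check of the Mapper above: given the input line "hello world hello", map is called once with that line and emits (hello, 1), (world, 1), (hello, 1); the framework then groups these pairs by key before they reach the Reducer.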

Write Reducer

package com.pihao.mr;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Take the WordCount case as an example:
 * A custom Reducer class must extend the Reducer provided by Hadoop and specify the input and output data types according to the business.
 *  Input data types:
 *  KEYIN    the type of the Map output key (Text)
 *  VALUEIN  the type of the Map output value (IntWritable)
 *
 *  Output data types:
 *  KEYOUT   the type of the output key, a single word (Text)
 *  VALUEOUT the type of the output value, the total number of occurrences of the word (IntWritable)
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text,IntWritable> {
    private Text outKey = new Text();
    private IntWritable outValue = new IntWritable();

    /**
     * Core business processing method of the Reduce phase; reduce is called once for each group of values sharing the same key
     * @param key     the word
     * @param values  all counts collected for this word
     * @param context MapReduce context used to write the output
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int total = 0;

        // Sum up all the counts for this key
        for (IntWritable value : values) {
            total += value.get();
        }
        // Set the output key and value, then write the result
        outKey.set(key);
        outValue.set(total);
        context.write(outKey,outValue);
    }
}
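
Continuing the same example: for the key "hello" the reduce method receives the values (1, 1), sums them to 2, and writes (hello, 2) as the final result.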

Write Driver

package com.pihao.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Driver class of the MR program: it configures and submits the MR job
 */
public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration configuration = new Configuration();
        // Declare a Job object
        Job job = Job.getInstance(configuration);
        //Specifies the driver class of the current job
        job.setJarByClass(WordCountDriver.class);
        //Specify Mapper and Reducer for the current Job
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        //Specify the data types of key and value of the output data at the Map end
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        //Specify the data type of key and value of the final output result
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //Specify the directory of input data and the directory of output data
        FileInputFormat.setInputPaths(job,new Path("D:/test data/wcinput/hello.txt"));
        FileOutputFormat.setOutputPath(job,new Path("D:/test data/wcoutput"));

        // FileInputFormat.setInputPaths(job,new Path(args[0]));   // use the command-line args instead when the jar is run on the Linux cluster
        // FileOutputFormat.setOutputPath(job,new Path(args[1]));
        // Submit the job; true means the job progress is printed to the console while waiting
        job.waitForCompletion(true);

    }
}
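
One pitfall worth noting with this Driver: FileOutputFormat will not run the job if the output directory already exists. A minimal sketch of guarding against this, placed in main before FileOutputFormat.setOutputPath and assuming an extra import of org.apache.hadoop.fs.FileSystem:

        // Delete the output directory if it already exists, because
        // FileOutputFormat refuses to write into an existing directory
        Path outputPath = new Path("D:/test data/wcoutput");
        FileSystem fs = FileSystem.get(configuration);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);   // true = delete recursively
        }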

Execution process

Local test

The local test is simple: the job can be run directly in IDEA.

Cluster test

Package the maven project just created into a jar: wc.jar, upload it to the Linux server, start the Hadoop cluster, and run the command below. Note that the main class must be given with its full package path: com.pihao.mr.WordCountDriver

[atguigu@hadoop102 software]$ hadoop jar  wc.jar com.pihao.mr.WordCountDriver /wcinput /wcoutput

You can monitor the executed tasks on the YARN web UI at hadoop103:8088.

Hadoop serialization

MapReduce framework principle

YARN resource scheduler

Common errors and solutions
