Hadoop cluster big data solution: advanced MapReduce practice (custom partition, sort, and group)

Preparation

The purpose of this article is to explain:
   1. user-defined sorting;
   2. user-defined partitioning;
   3. user-defined grouping.

Requirements

   1) Sample data (datetime and temperature, separated by a Tab):

1949-05-01 14:21:01	38℃
1949-06-03 13:45:01	40℃
1950-01-23 14:21:01	38℃
1950-10-23 08:21:01	12℃
1951-12-18 14:21:01	40℃
1950-10-24 13:21:01	42℃
1950-10-26 14:21:01	45℃
1951-08-01 14:21:01	40℃
1951-08-02 14:21:01	48℃
1953-07-01 14:21:01	45℃
1953-08-06 14:21:01	48℃
1954-06-02 14:21:01	36℃
1952-08-02 14:21:01	45℃
1955-06-02 14:21:01	42℃
1952-04-02 14:21:01	43℃
1953-05-02 14:21:01	34℃
1949-09-02 14:21:01	29℃
1953-10-02 14:21:01	47℃
1952-11-02 14:21:01	45℃
1953-04-02 14:21:01	40℃
1954-05-02 14:21:01	45℃
1955-07-02 14:21:01	28℃
1954-05-09 14:21:01	50℃
1955-09-02 14:21:01	49℃
1953-09-02 14:21:01	32℃

   2) Goals:

1. For each year from 1949 to 1955, find the datetime at which the highest temperature occurred;
2. For each year from 1949 to 1955, find the top three highest-temperature days.

   3) Approach:

  1. Sort by year in ascending order, and by temperature in descending order within each year;
  2. Group by year, so that each year's records go to their corresponding reduce task;
  3. Take the top 1 and top 3 lines of each reduce output.

  Core idea: the map output key is an encapsulated object ordered by (year ascending, temperature descending), so a custom key data type is needed.

Hands-on

   1) Custom key type

   The code is as follows:

package temperture;

import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

//A user-defined type such as MyKeyYT cannot be used as a Hadoop key directly; it must implement the org.apache.hadoop.io.WritableComparable interface, parameterized with the type itself
//Implementing WritableComparable requires implementing readFields, write and compareTo
public class MyKeyYT implements WritableComparable<MyKeyYT>
{
  private int year;
  private int hot;

  public void setYear(int year) {
      this.year = year;
  }

  public int getYear() {
      return year;
  }

  public void setHot(int hot) {
      this.hot = hot;
  }

  public int getHot() {
      return hot;
  }

  //Hadoop moves data between tasks as a binary stream (via its RPC/serialization layer); readFields deserializes that stream back into an object
  public void readFields(DataInput dataInput) throws IOException
  {
      this.year=dataInput.readInt();
      this.hot=dataInput.readInt();

  }

  //Serialization: write the object's year and hot fields to the binary stream
  public void write(DataOutput dataOutput) throws IOException
  {
      dataOutput.writeInt(year);
      dataOutput.writeInt(hot);
  }

  //Compare the incoming MyKeyYT o with the current object; this defines the default key ordering
  public int compareTo(MyKeyYT o)
  {
      int myresult = Integer.compare(year,o.getYear());
      if(myresult!=0)
          return myresult;
      return Integer.compare(hot,o.getHot());
  }

  //Override toString
  @Override
  public String toString() {
      return year+"\t"+hot;
  }

  //Override hashCode so that equal keys produce the same hash (the default HashPartitioner relies on this)
  @Override
  public int hashCode() {
      return Integer.hashCode(year + hot);
  }
  }
}
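
To make the serialization contract concrete, here is a minimal, hypothetical round-trip check (my own illustration, not part of the original project): it writes a MyKeyYT with write(), reads it back with readFields(), and calls compareTo() the way the shuffle would.

package temperture;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

//Hypothetical helper class, only for local verification of MyKeyYT
public class MyKeyYTDemo {
    public static void main(String[] args) throws Exception {
        MyKeyYT k1 = new MyKeyYT();
        k1.setYear(1949);
        k1.setHot(38);

        //Serialize with write(), just as Hadoop does over its binary stream
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        k1.write(new DataOutputStream(bytes));

        //Deserialize into a fresh object with readFields()
        MyKeyYT k2 = new MyKeyYT();
        k2.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(k2); //prints: 1949	38

        //compareTo() alone sorts both year and temperature ascending;
        //the descending temperature order comes later from MySortTemp
        MyKeyYT k3 = new MyKeyYT();
        k3.setYear(1949);
        k3.setHot(40);
        System.out.println(k1.compareTo(k3)); //negative: k1 sorts before k3
    }
}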

   2) Rewrite the partitioner

   By default the HashPartitioner distributes keys to reduce tasks by the key's hash. Here we partition by year instead, so that every record of the same year reaches the same reduce task. The code is as follows:

package temperture;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

//Extend Partitioner and override the getPartition method
public class MyPartition  extends Partitioner<MyKeyYT, Text> {

   //Override getPartition: myKeyYT is the map output key, text is the map output value, i is the number of reduce tasks
   @Override
   public int getPartition(MyKeyYT myKeyYT, Text text, int i)
   {
       //Partition by year; multiplying by 200 merely scales the year before taking the modulus
       return (myKeyYT.getYear()*200)%i;
   }
}
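
As a sanity check on the partition formula, the following throwaway snippet (my own illustration, not from the original article) prints the partition each year lands in when there are 7 reduce tasks; the seven consecutive years 1949-1955 happen to map to seven distinct partitions, which is why every year ends up in its own output file.

package temperture;

//Hypothetical throwaway check of the (year * 200) % numReduceTasks formula
public class PartitionCheck {
    public static void main(String[] args) {
        int numReduceTasks = 7;
        for (int year = 1949; year <= 1955; year++) {
            //Same arithmetic as MyPartition.getPartition
            System.out.println(year + " -> partition " + (year * 200) % numReduceTasks);
        }
        //1949 -> 5, 1950 -> 2, 1951 -> 6, 1952 -> 3, 1953 -> 0, 1954 -> 4, 1955 -> 1
    }
}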

   3) Rewrite the sort comparator

   The default sort is lexicographic (dictionary order), which obviously cannot produce a numeric ordering of year ascending and temperature descending, so a custom sort comparator is needed. The code is as follows:

package temperture;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

//Custom sort comparator
public class MySortTemp extends WritableComparator
{
    //Constructor: register the key class with the parent WritableComparator
    public  MySortTemp()
    {
        //true tells WritableComparator to create MyKeyYT instances for comparing the map output keys
        super(MyKeyYT.class,true);
    }

    //Override the most important method, compare: year ascending, then temperature descending
    public int compare(WritableComparable a, WritableComparable b)
    {
        MyKeyYT myKeyYT1=(MyKeyYT) a;
        MyKeyYT myKeyYT2=(MyKeyYT) b;
        int myresult = Integer.compare(myKeyYT1.getYear(),myKeyYT2.getYear()); //Year ascending
        if(myresult!=0)
            return myresult;
        return -Integer.compare(myKeyYT1.getHot(),myKeyYT2.getHot()); //Negate to sort temperature descending
    }
}
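
A quick, hypothetical spot check of the comparator (not in the original article): same-year keys compare by temperature descending, different-year keys by year ascending.

package temperture;

//Hypothetical spot check of MySortTemp
public class SortCheck {
    static MyKeyYT key(int year, int hot) {
        MyKeyYT k = new MyKeyYT();
        k.setYear(year);
        k.setHot(hot);
        return k;
    }

    public static void main(String[] args) {
        MySortTemp sorter = new MySortTemp();
        //Same year: the higher temperature sorts first (negative result)
        System.out.println(sorter.compare(key(1949, 40), key(1949, 38)));
        //Different years: the earlier year sorts first (negative result)
        System.out.println(sorter.compare(key(1949, 40), key(1950, 45)));
    }
}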

   4) Rewrite the grouping comparator

   Before reduce, records whose keys are exactly equal are placed in the same group by default. Our key is (year ascending, temperature descending), which obviously does not satisfy the requirement that each year forms one group for the reduce calculation, so the grouping comparator must be rewritten. The code is as follows:



package temperture;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

//Grouping works just like sorting: keys that compare as equal end up in the same reduce group
public class MyGroup  extends WritableComparator {
    //Constructor: register the key class with the parent WritableComparator
   public MyGroup()
   {
       super(MyKeyYT.class,true);
   }

    //Same pattern as the sort comparator, but only the year is compared, so all records of one year fall into one group
    public int compare(WritableComparable a, WritableComparable b)
    {
        MyKeyYT myKeyYT1=(MyKeyYT) a;
        MyKeyYT myKeyYT2=(MyKeyYT) b;
        return Integer.compare(myKeyYT1.getYear(),myKeyYT2.getYear()); //Determine whether the two keys belong to the same group (same year)

    }
}
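
And the same kind of hypothetical check for the grouping comparator: two keys from the same year compare as equal, so their values reach a single reduce() call, while keys from different years do not.

package temperture;

//Hypothetical spot check of MyGroup
public class GroupCheck {
    static MyKeyYT key(int year, int hot) {
        MyKeyYT k = new MyKeyYT();
        k.setYear(year);
        k.setHot(hot);
        return k;
    }

    public static void main(String[] args) {
        MyGroup group = new MyGroup();
        //Same year, different temperatures: 0 means "same group"
        System.out.println(group.compare(key(1949, 40), key(1949, 29)));
        //Different years: non-zero means "different groups"
        System.out.println(group.compare(key(1949, 40), key(1950, 45)));
    }
}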

   5) The driver (main) class of the program

   The key configuration lines are:

            myjob.setMapOutputKeyClass(MyKeyYT.class);//Specify the key type of the map output
            myjob.setMapOutputValueClass(Text.class);//Specify the value type of the map output
            myjob.setNumReduceTasks(7);//One reduce task per year, 7 years in total
            myjob.setPartitionerClass(MyPartition.class); //Use the custom partitioner
            myjob.setSortComparatorClass(MySortTemp.class); //Use the custom sort comparator
            myjob.setGroupingComparatorClass(MyGroup.class); //Use the custom grouping comparator

 

The complete main class is as follows:

package temperture;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class MyTemperatureRunJob  {
    public static SimpleDateFormat sdf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    static class MyTemperatureMapper extends Mapper<LongWritable, Text,MyKeyYT,Text>
    {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String []ss =line.split("\t");

            if(ss.length==2)
            {
                try
                {
                    Date mydate=sdf.parse(ss[0]); //Parse the datetime from the first field

                    //Extract the year
                    Calendar myCalendar=Calendar.getInstance();
                    myCalendar.setTime(mydate);
                    int year =myCalendar.get(Calendar.YEAR);

                    String myhot = ss[1].substring(0,ss[1].indexOf("℃"));

                    //Create a custom MyKeyYT object
                    MyKeyYT myKeyYT=new MyKeyYT();
                    myKeyYT.setYear(year);
                    myKeyYT.setHot(Integer.parseInt(myhot));

                    context.write(myKeyYT,value);

                } catch (Exception e) {
                    e.printStackTrace();
                }

            }
        }
    }

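    //The reducer simply emits every record it receives; because of the sort and grouping comparators, each year's group arrives ordered by temperature descending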
    static class MyTemperatureReducer extends Reducer<MyKeyYT,Text,MyKeyYT,Text>
    {
        @Override
        protected void reduce(MyKeyYT key, Iterable<Text> values, Context context) throws IOException, InterruptedException
        {
            for(Text v:values)
            {
                context.write(key,v);
            }
        }
    }

    public static void main(String[] args)
    {
        //Create the configuration used to submit the Job
        Configuration conf =new Configuration();

        //Configuring mapreduce.job.tracker here
        //only needs to stay consistent with the property in mapred-site.xml;
        //it can also be omitted entirely: when running on the cluster it is picked up automatically.
        // conf.set("mapreduce.job.tracker","dw-cluster-master:9001");

        try
        {
            //MapReduce creates the output folder automatically,
            //but if the specified output folder already exists the job fails with an error.
            //This block adds fault tolerance so the program can be rerun.
            Path outputPath= new Path(args[2]);
            FileSystem fileSystem =FileSystem.get(conf);
            if(fileSystem.exists(outputPath)){
                fileSystem.delete(outputPath,true);
                System.out.println("outputPath is exist,but has deleted!");
            }

            Job myjob= Job.getInstance(conf);
            myjob.setJarByClass(MyTemperatureRunJob.class);//Locate the jar by the MyTemperatureRunJob class and run it
            myjob.setMapperClass(MyTemperatureMapper.class);//Specify the Mapper class
            myjob.setReducerClass(MyTemperatureReducer.class);//Specify the Reducer class
            myjob.setMapOutputKeyClass(MyKeyYT.class);//Specify the key type of the map output
            myjob.setMapOutputValueClass(Text.class);//Specify the value type of the map output

            myjob.setNumReduceTasks(7);//One reduce task per year, 7 years in total
            myjob.setPartitionerClass(MyPartition.class); //Use the custom partitioner
            myjob.setSortComparatorClass(MySortTemp.class); //Use the custom sort comparator
            myjob.setGroupingComparatorClass(MyGroup.class); //Use the custom grouping comparator


            //Why args[1]? Because args[0] is taken by the class name on the hadoop jar command line
            FileInputFormat.addInputPath(myjob,new Path(args[1]));//Input path of the whole Job: the second argument following the jar on the command line
            //FileInputFormat.addInputPath(myjob,new Path("/tmp/wcinput/wordcount.xt"));
            //Output path of the whole Job: args[2] is the third argument following the jar on the command line
            FileOutputFormat.setOutputPath(myjob,new Path(args[2]));
            //FileOutputFormat.setOutputPath(myjob,new Path("/tmp/wcoutput"));
            System.exit(myjob.waitForCompletion(true)?0:1);//Wait for the Job to complete, and exit if it completes correctly
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }



    }

}

Deployment and invocation

Please refer to the packaging and deployment section of "Hadoop cluster big data solution: using a Maven IDE to implement the MapReduce program practice (5)".

   Upload the jar package to the cluster and invoke it with a command such as:

 hadoop jar hadoop_mr_temperature.jar MyTemperatureRunJob /tmp/input/data1.txt /tmp/output/

Overall execution process:

[liuxiaowei@dw-cluster-master temperature]$ hadoop jar hadoop_mr_temperature.jar MyTemperatureRunJob /tmp/input/data1.txt /tmp/output/
outputPath is exist,but has deleted!
20/02/03 15:40:20 INFO client.RMProxy: Connecting to ResourceManager at dw-cluster-master/10.216.10.141:8032
20/02/03 15:40:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
20/02/03 15:40:22 INFO input.FileInputFormat: Total input files to process : 1
20/02/03 15:40:22 INFO mapreduce.JobSubmitter: number of splits:1
20/02/03 15:40:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1578394893972_0056
20/02/03 15:40:22 INFO impl.YarnClientImpl: Submitted application application_1578394893972_0056
20/02/03 15:40:22 INFO mapreduce.Job: The url to track the job: http://dw-cluster-master:8088/proxy/application_1578394893972_0056/
20/02/03 15:40:22 INFO mapreduce.Job: Running job: job_1578394893972_0056
20/02/03 15:40:28 INFO mapreduce.Job: Job job_1578394893972_0056 running in uber mode : false
20/02/03 15:40:28 INFO mapreduce.Job:  map 0% reduce 0%
20/02/03 15:40:33 INFO mapreduce.Job:  map 100% reduce 0%
20/02/03 15:40:38 INFO mapreduce.Job:  map 100% reduce 29%
20/02/03 15:40:39 INFO mapreduce.Job:  map 100% reduce 100%
20/02/03 15:40:40 INFO mapreduce.Job: Job job_1578394893972_0056 completed successfully
20/02/03 15:40:41 INFO mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=942
                FILE: Number of bytes written=1291003
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=761
                HDFS: Number of bytes written=850
                HDFS: Number of read operations=24
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=14
        Job Counters
                Killed reduce tasks=1
                Launched map tasks=1
                Launched reduce tasks=7
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3331
                Total time spent by all reduces in occupied slots (ms)=18830
                Total time spent by all map tasks (ms)=3331
                Total time spent by all reduce tasks (ms)=18830
                Total vcore-milliseconds taken by all map tasks=3331
                Total vcore-milliseconds taken by all reduce tasks=18830
                Total megabyte-milliseconds taken by all map tasks=3410944
                Total megabyte-milliseconds taken by all reduce tasks=19281920
        Map-Reduce Framework
                Map input records=25
                Map output records=25
                Map output bytes=850
                Map output materialized bytes=942
                Input split bytes=111
                Combine input records=0
                Combine output records=0
                Reduce input groups=7
                Reduce shuffle bytes=942
                Reduce input records=25
                Reduce output records=25
                Spilled Records=50
                Shuffled Maps =7
                Failed Shuffles=0
                Merged Map outputs=7
                GC time elapsed (ms)=481
                CPU time spent (ms)=6580
                Physical memory (bytes) snapshot=2787332096
                Virtual memory (bytes) snapshot=51036344320
                Total committed heap usage (bytes)=2913992704
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=650
        File Output Format Counters
                Bytes Written=850

The output is shown in Figure 1: there are 7 years, one reduce task per year, and therefore 7 output files.

Figure 1 Overall output results

   The highest temperature of each year is simply the first line of its file, and the top three are the first three lines of its file. The full output is as follows:

[liuxiaowei@dw-cluster-master temperature]$ hadoop fs -cat /tmp/output/part-*
1953    48      1953-08-06 14:21:01     48℃
1953    47      1953-10-02 14:21:01     47℃
1953    45      1953-07-01 14:21:01     45℃
1953    40      1953-04-02 14:21:01     40℃
1953    34      1953-05-02 14:21:01     34℃
1953    32      1953-09-02 14:21:01     32℃
1955    49      1955-09-02 14:21:01     49℃
1955    42      1955-06-02 14:21:01     42℃
1955    28      1955-07-02 14:21:01     28℃
1950    45      1950-10-26 14:21:01     45℃
1950    42      1950-10-24 13:21:01     42℃
1950    38      1950-01-23 14:21:01     38℃
1950    12      1950-10-23 08:21:01     12℃
1952    45      1952-08-02 14:21:01     45℃
1952    45      1952-11-02 14:21:01     45℃
1952    43      1952-04-02 14:21:01     43℃
1954    50      1954-05-09 14:21:01     50℃
1954    45      1954-05-02 14:21:01     45℃
1954    36      1954-06-02 14:21:01     36℃
1949    40      1949-06-03 13:45:01     40℃
1949    38      1949-05-01 14:21:01     38℃
1949    29      1949-09-02 14:21:01     29℃
1951    48      1951-08-02 14:21:01     48℃
1951    40      1951-08-01 14:21:01     40℃
1951    40      1951-12-18 14:21:01     40℃

Taking the top three looks like this; emitting only the top record (or top N) directly from the reducer is left as an exercise (a sketch follows the listing below).

[liuxiaowei@dw-cluster-master temperature]$ hadoop fs -cat /tmp/output/part-r-00000 | head  -n3
1953    48      1953-08-06 14:21:01     48℃
1953    47      1953-10-02 14:21:01     47℃
1953    45      1953-07-01 14:21:01     45℃
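
As noted above, the article leaves "take only the top record per year" to the reader. Here is a minimal sketch of one way to do it (my own variant with a hypothetical class name, not the author's code): because each reduce group already receives its values sorted by temperature descending, the reducer can simply stop after the first N values. It is meant to replace MyTemperatureReducer as a nested class inside MyTemperatureRunJob and be wired in with myjob.setReducerClass(MyTopNReducer.class).

    //Hypothetical replacement for MyTemperatureReducer: keep only the first N records per year
    static class MyTopNReducer extends Reducer<MyKeyYT,Text,MyKeyYT,Text>
    {
        private static final int N = 1; //1 = yearly maximum; use 3 for the top three days

        @Override
        protected void reduce(MyKeyYT key, Iterable<Text> values, Context context) throws IOException, InterruptedException
        {
            int emitted = 0;
            for(Text v:values)
            {
                if(emitted++ >= N)
                    break; //the remaining values are lower temperatures of the same year
                context.write(key,v);
            }
        }
    }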

GitHub project

Link: hadoop_mr_temperature

