Hadoop cluster big data solution: advanced MapReduce program practice (custom partition, sort and group)

Preparation
Requirements
Hands-on practice
Deployment and invocation
GitHub project

Preparation

   The purpose of this article is to explain:
   1. user-defined sorting;
   2. user-defined partitioning;
   3. user-defined grouping.

Requirements

   1) sample data (datetime and temperature, separated by a Tab):

1949-05-01 14:21:01 38℃
1949-06-03 13:45:01 40℃
1950-01-23 14:21:01 38℃
1950-10-23 08:21:01 12℃
1951-12-18 14:21:01 40℃
1950-10-24 13:21:01 42℃
1950-10-26 14:21:01 45℃
1951-08-01 14:21:01 40℃
1951-08-02 14:21:01 48℃
1953-07-01 14:21:01 45℃
1953-08-06 14:21:01 48℃
1954-06-02 14:21:01 36℃
1952-08-02 14:21:01 45℃
1955-06-02 14:21:01 42℃
1952-04-02 14:21:01 43℃
1953-05-02 14:21:01 34℃
1949-09-02 14:21:01 29℃
1953-10-02 14:21:01 47℃
1952-11-02 14:21:01 45℃
1953-04-02 14:21:01 40℃
1954-05-02 14:21:01 45℃
1955-07-02 14:21:01 28℃
1954-05-09 14:21:01 50℃
1955-09-02 14:21:01 49℃
1953-09-02 14:21:01 32℃

   2) requirements:

1. For each year from 1949 to 1955, find the time at which the highest temperature occurred;
2. For each year from 1949 to 1955, find the top three records by temperature.

   3) ideas:

1. Sort in ascending order of year and, within each year, in descending order of temperature;
2. Group by year, so that each year is handled by its own reduce task;
3. Take the top 1 and the top 3 from each reduce task's output.
Core point: the mapper output key must be a custom object that encapsulates (year ascending, temperature descending), so a user-defined data type is needed.
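To make the target ordering concrete before diving into the Hadoop classes, here is a minimal stand-alone sketch of my own (not part of the job itself; the pairs are taken from the sample data above) that sorts (year, temperature) pairs in ascending order of year and descending order of temperature:

import java.util.Arrays;

public class KeyOrderSketch {
    public static void main(String[] args) {
        // (year, temperature) pairs taken from the sample data above
        int[][] keys = { {1950, 38}, {1949, 40}, {1950, 45}, {1949, 38}, {1950, 12} };

        Arrays.sort(keys, (a, b) -> {
            if (a[0] != b[0]) return Integer.compare(a[0], b[0]); // year ascending
            return Integer.compare(b[1], a[1]);                   // temperature descending
        });

        for (int[] k : keys) {
            System.out.println(k[0] + "\t" + k[1]);
        }
        // Prints: 1949 40, 1949 38, 1950 45, 1950 38, 1950 12
    }
}

This is exactly the ordering that the custom key, sort comparator and grouping comparator below reproduce inside the MapReduce shuffle.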

Hands-on practice

   1) custom key type:

   The map output key must carry both the year and the temperature, so a custom type MyKeyYT is defined. It implements WritableComparable so that Hadoop can serialize and compare it. The code is as follows:

package temperture;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// The user-defined type MyKeyYT cannot be used by Hadoop directly; it must implement the
// WritableComparable interface from org.apache.hadoop.io, parameterised with itself.
// Implementing WritableComparable requires overriding readFields, write and compareTo.
public class MyKeyYT implements WritableComparable<MyKeyYT> {

    private int year;
    private int hot;

    public void setYear(int year) { this.year = year; }
    public int getYear() { return year; }
    public void setHot(int hot) { this.hot = hot; }
    public int getHot() { return hot; }

    // Hadoop ships data as a binary stream over RPC; readFields deserializes that stream back into an object
    public void readFields(DataInput dataInput) throws IOException {
        this.year = dataInput.readInt();
        this.hot = dataInput.readInt();
    }

    // Serialization: write year and hot into the binary stream
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(year);
        dataOutput.writeInt(hot);
    }

    // Compare the incoming MyKeyYT with this object to determine key ordering and equality
    public int compareTo(MyKeyYT o) {
        int myresult = Integer.compare(year, o.getYear());
        if (myresult != 0)
            return myresult;
        return Integer.compare(hot, o.getHot());
    }

    // Override toString: this is what ends up in the output files
    @Override
    public String toString() {
        return year + "\t" + hot;
    }

    // Override hashCode; any implementation derived from the fields is acceptable here
    @Override
    public int hashCode() {
        return Integer.hashCode(year + hot);
    }
}
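As a quick local sanity check (my addition, not part of the original project), the key type can be exercised outside Hadoop: write() serializes the fields, readFields() restores them, and compareTo() orders keys by year first and then temperature:

package temperture;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class MyKeyYTSmokeTest {
    public static void main(String[] args) throws Exception {
        MyKeyYT k1 = new MyKeyYT();
        k1.setYear(1949);
        k1.setHot(40);

        // Round-trip through the binary form Hadoop would send over the wire
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        k1.write(new DataOutputStream(bytes));
        MyKeyYT k2 = new MyKeyYT();
        k2.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(k2); // 1949	40

        // compareTo: year first, then temperature (both ascending)
        MyKeyYT k3 = new MyKeyYT();
        k3.setYear(1950);
        k3.setHot(12);
        System.out.println(k1.compareTo(k3)); // negative: 1949 sorts before 1950
    }
}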

   2) override the partitioner:

   Every record of a given year must go to the same reduce task, so the records are partitioned by year. The code is as follows:

package temperture;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Extend Partitioner and override the getPartition function
public class MyPartition extends Partitioner<MyKeyYT, Text> {

    // myKeyYT is the map output key, text is the map output value, i is the number of reduce tasks
    @Override
    public int getPartition(MyKeyYT myKeyYT, Text text, int i) {
        // Partition by year; *200 merely scales the number before taking the modulus
        return (myKeyYT.getYear() * 200) % i;
    }
}
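A quick check (again my addition, not from the original article) shows why (year * 200) % numReduceTasks works for this data set: with 7 reduce tasks, the years 1949 to 1955 fall into 7 distinct partitions:

public class PartitionCheck {
    public static void main(String[] args) {
        int numReduceTasks = 7; // matches myjob.setNumReduceTasks(7) used later in the job
        for (int year = 1949; year <= 1955; year++) {
            System.out.println(year + " -> partition " + ((year * 200) % numReduceTasks));
        }
        // 1949->5, 1950->2, 1951->6, 1952->3, 1953->0, 1954->4, 1955->1: all distinct
    }
}

Note that the years only spread evenly because there are exactly 7 reducers; with a different reducer count two years could collide in the same partition, so the scaling factor and the setNumReduceTasks value have to be checked together.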

   3) override the sort comparator:

   The default sorting is lexicographic, which obviously cannot deliver numeric sorting in ascending order of year and descending order of temperature, so a custom sort comparator is needed. The code is as follows:

package temperture;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Custom sort comparator for the map output key
public class MySortTemp extends WritableComparator {

    // The constructor registers the key class being compared
    public MySortTemp() {
        super(MyKeyYT.class, true);
    }

    // The essential compare method
    public int compare(WritableComparable a, WritableComparable b) {
        MyKeyYT myKeyYT1 = (MyKeyYT) a;
        MyKeyYT myKeyYT2 = (MyKeyYT) b;
        int myresult = Integer.compare(myKeyYT1.getYear(), myKeyYT2.getYear()); // year ascending
        if (myresult != 0)
            return myresult;
        return -Integer.compare(myKeyYT1.getHot(), myKeyYT2.getHot()); // negated, so temperature descending
    }
}

   4) override the grouping comparator:

   Before reduce, records with identical keys are grouped together by default. Our key is (year ascending, temperature descending), so identical keys would not put a whole year into one group, which does not meet the requirement that each year forms one group for the reduce calculation. The grouping comparator therefore has to be overridden. The code is as follows:

package temperture;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Grouping works like sorting: records that compare as equal end up in the same reduce() call
public class MyGroup extends WritableComparator {

    public MyGroup() {
        super(MyKeyYT.class, true);
    }

    // Same shape as the sort comparator, but only the year is compared, so each year forms one group
    public int compare(WritableComparable a, WritableComparable b) {
        MyKeyYT myKeyYT1 = (MyKeyYT) a;
        MyKeyYT myKeyYT2 = (MyKeyYT) b;
        return Integer.compare(myKeyYT1.getYear(), myKeyYT2.getYear()); // decides whether two records share a group
    }
}

   5) the main class that runs the job:

   The key settings that wire in the custom classes are:

myjob.setMapOutputKeyClass(MyKeyYT.class);       // Map output key type
myjob.setMapOutputValueClass(Text.class);        // Map output value type
myjob.setNumReduceTasks(7);                      // 7 reduce tasks, one per year (1949-1955)
myjob.setPartitionerClass(MyPartition.class);    // custom partitioner
myjob.setSortComparatorClass(MySortTemp.class);  // custom sort comparator
myjob.setGroupingComparatorClass(MyGroup.class); // custom grouping comparator

The overall main function class is as follows:

package temperture;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class MyTemperatureRunJob {

    public static SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    static class MyTemperatureMapper extends Mapper<LongWritable, Text, MyKeyYT, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] ss = line.split("\t");
            if (ss.length == 2) {
                try {
                    // Parse the datetime (first field) and extract the year
                    Date mydate = sdf.parse(ss[0]);
                    Calendar myCalendar = Calendar.getInstance();
                    myCalendar.setTime(mydate);
                    int year = myCalendar.get(Calendar.YEAR);
                    // Strip the ℃ suffix from the temperature (second field)
                    String myhot = ss[1].substring(0, ss[1].indexOf("℃"));
                    // Build the custom composite key
                    MyKeyYT myKeyYT = new MyKeyYT();
                    myKeyYT.setYear(year);
                    myKeyYT.setHot(Integer.parseInt(myhot));
                    context.write(myKeyYT, value);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }

    static class MyTemperatureReducer extends Reducer<MyKeyYT, Text, MyKeyYT, Text> {
        @Override
        protected void reduce(MyKeyYT key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(key, v);
            }
        }
    }

    public static void main(String[] args) {
        // Get the configuration used to submit the Job
        Configuration conf = new Configuration();
        // mapreduce.job.tracker only needs to match the property in mapred-site.xml;
        // when running on the cluster it is picked up automatically, so the line can be omitted.
        // conf.set("mapreduce.job.tracker","dw-cluster-master:9001");
        try {
            // MapReduce creates the output folder automatically, but fails if it already exists;
            // deleting it first here makes the program rerunnable.
            Path outputPath = new Path(args[2]);
            FileSystem fileSystem = FileSystem.get(conf);
            if (fileSystem.exists(outputPath)) {
                fileSystem.delete(outputPath, true);
                System.out.println("outputPath is exist,but has deleted!");
            }
            Job myjob = Job.getInstance(conf);
            myjob.setJarByClass(MyTemperatureRunJob.class);    // locate the Jar via this class and run it
            myjob.setMapperClass(MyTemperatureMapper.class);   // Map class
            myjob.setReducerClass(MyTemperatureReducer.class); // Reduce class
            myjob.setMapOutputKeyClass(MyKeyYT.class);       // Map output key type
            myjob.setMapOutputValueClass(Text.class);        // Map output value type
            myjob.setNumReduceTasks(7);                      // 7 reduce tasks, one per year
            myjob.setPartitionerClass(MyPartition.class);    // custom partitioner
            myjob.setSortComparatorClass(MySortTemp.class);  // custom sort comparator
            myjob.setGroupingComparatorClass(MyGroup.class); // custom grouping comparator
            // args[1] is the Job's input path: args[0] is taken by the class name given on the command line when the Jar is invoked
            FileInputFormat.addInputPath(myjob, new Path(args[1]));
            // FileInputFormat.addInputPath(myjob,new Path("/tmp/wcinput/wordcount.xt"));
            // args[2] is the Job's output path, the third parameter after the Jar on the command line
            FileOutputFormat.setOutputPath(myjob, new Path(args[2]));
            // FileOutputFormat.setOutputPath(myjob,new Path("/tmp/wcoutput"));
            System.exit(myjob.waitForCompletion(true) ? 0 : 1); // wait for the Job and exit with its status
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Deployment and invocation

For packaging and deployment, please refer to the earlier article in this series, "Hadoop cluster big data solution: using Maven and an IDE to implement a MapReduce program (5)".

   Upload the jar package to the cluster and invoke it with the following command:

hadoop jar hadoop_mr_temperature.jar MyTemperatureRunJob /tmp/input/data1.txt /tmp/output/

Overall execution process:

[liuxiaowei@dw-cluster-master temperature]$ hadoop jar hadoop_mr_temperature.jar MyTemperatureRunJob /tmp/input/data1.txt /tmp/output/
outputPath is exist,but has deleted!
20/02/03 15:40:20 INFO client.RMProxy: Connecting to ResourceManager at dw-cluster-master/10.216.10.141:8032
20/02/03 15:40:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
20/02/03 15:40:22 INFO input.FileInputFormat: Total input files to process : 1
20/02/03 15:40:22 INFO mapreduce.JobSubmitter: number of splits:1
20/02/03 15:40:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1578394893972_0056
20/02/03 15:40:22 INFO impl.YarnClientImpl: Submitted application application_1578394893972_0056
20/02/03 15:40:22 INFO mapreduce.Job: The url to track the job: http://dw-cluster-master:8088/proxy/application_1578394893972_0056/
20/02/03 15:40:22 INFO mapreduce.Job: Running job: job_1578394893972_0056
20/02/03 15:40:28 INFO mapreduce.Job: Job job_1578394893972_0056 running in uber mode : false
20/02/03 15:40:28 INFO mapreduce.Job: map 0% reduce 0%
20/02/03 15:40:33 INFO mapreduce.Job: map 100% reduce 0%
20/02/03 15:40:38 INFO mapreduce.Job: map 100% reduce 29%
20/02/03 15:40:39 INFO mapreduce.Job: map 100% reduce 100%
20/02/03 15:40:40 INFO mapreduce.Job: Job job_1578394893972_0056 completed successfully
20/02/03 15:40:41 INFO mapreduce.Job: Counters: 50
	File System Counters
		FILE: Number of bytes read=942
		FILE: Number of bytes written=1291003
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=761
		HDFS: Number of bytes written=850
		HDFS: Number of read operations=24
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=14
	Job Counters
		Killed reduce tasks=1
		Launched map tasks=1
		Launched reduce tasks=7
		Rack-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=3331
		Total time spent by all reduces in occupied slots (ms)=18830
		Total time spent by all map tasks (ms)=3331
		Total time spent by all reduce tasks (ms)=18830
		Total vcore-milliseconds taken by all map tasks=3331
		Total vcore-milliseconds taken by all reduce tasks=18830
		Total megabyte-milliseconds taken by all map tasks=3410944
		Total megabyte-milliseconds taken by all reduce tasks=19281920
	Map-Reduce Framework
		Map input records=25
		Map output records=25
		Map output bytes=850
		Map output materialized bytes=942
		Input split bytes=111
		Combine input records=0
		Combine output records=0
		Reduce input groups=7
		Reduce shuffle bytes=942
		Reduce input records=25
		Reduce output records=25
		Spilled Records=50
		Shuffled Maps =7
		Failed Shuffles=0
		Merged Map outputs=7
		GC time elapsed (ms)=481
		CPU time spent (ms)=6580
		Physical memory (bytes) snapshot=2787332096
		Virtual memory (bytes) snapshot=51036344320
		Total committed heap usage (bytes)=2913992704
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=650
	File Output Format Counters
		Bytes Written=850
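One WARN in the log above is worth a note: "Hadoop command-line option parsing not performed" appears because the driver does not implement the Tool interface. A hedged skeleton of how the driver could be adapted is shown below (my sketch, not the original code; the class name MyTemperatureTool is hypothetical, and the job setup inside run() would be the same as in MyTemperatureRunJob.main()):

package temperture;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical Tool-based driver skeleton
public class MyTemperatureTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // already populated with any -D options from the command line
        // ... build and submit the Job exactly as in MyTemperatureRunJob, using this conf ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options (-D, -files, ...) before handing the remaining args to run()
        System.exit(ToolRunner.run(new Configuration(), new MyTemperatureTool(), args));
    }
}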

The output is shown in Figure 1: there are 7 years, one reduce task per year, and hence 7 output files:

Figure 1 overall output results

   The highest temperature of each year is simply the first line of each file, and the top 3 are the first three lines of each file. The full output is as follows:

[liuxiaowei@dw-cluster-master temperature]$ hadoop fs -cat /tmp/output/part-*
1953	48	1953-08-06 14:21:01 48℃
1953	47	1953-10-02 14:21:01 47℃
1953	45	1953-07-01 14:21:01 45℃
1953	40	1953-04-02 14:21:01 40℃
1953	34	1953-05-02 14:21:01 34℃
1953	32	1953-09-02 14:21:01 32℃
1955	49	1955-09-02 14:21:01 49℃
1955	42	1955-06-02 14:21:01 42℃
1955	28	1955-07-02 14:21:01 28℃
1950	45	1950-10-26 14:21:01 45℃
1950	42	1950-10-24 13:21:01 42℃
1950	38	1950-01-23 14:21:01 38℃
1950	12	1950-10-23 08:21:01 12℃
1952	45	1952-08-02 14:21:01 45℃
1952	45	1952-11-02 14:21:01 45℃
1952	43	1952-04-02 14:21:01 43℃
1954	50	1954-05-09 14:21:01 50℃
1954	45	1954-05-02 14:21:01 45℃
1954	36	1954-06-02 14:21:01 36℃
1949	40	1949-06-03 13:45:01 40℃
1949	38	1949-05-01 14:21:01 38℃
1949	29	1949-09-02 14:21:01 29℃
1951	48	1951-08-02 14:21:01 48℃
1951	40	1951-08-01 14:21:01 40℃
1951	40	1951-12-18 14:21:01 40℃

Taking the first three records of a file gives the top 3, as shown below; taking only the first record (the yearly maximum) is left for you to write yourself, and a possible sketch follows the listing.

[liuxiaowei@dw-cluster-master temperature]$ hadoop fs -cat /tmp/output/part-r-00000 | head -n3
1953	48	1953-08-06 14:21:01 48℃
1953	47	1953-10-02 14:21:01 47℃
1953	45	1953-07-01 14:21:01 45℃
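One way to finish that exercise (a sketch of my own, not the original code) is a reducer that emits only the first N records of each year group. Because the grouping comparator puts a whole year into one reduce() call and the sort comparator delivers the values in descending order of temperature, the first N values are the N hottest records of that year; N = 1 gives the single highest temperature:

package temperture;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

// Hypothetical variant of MyTemperatureReducer: keeps only the first N records of each year group
public class MyTopNTemperatureReducer extends Reducer<MyKeyYT, Text, MyKeyYT, Text> {

    private static final int N = 3; // set to 1 for only the highest temperature per year

    @Override
    protected void reduce(MyKeyYT key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int emitted = 0;
        for (Text v : values) {
            if (emitted++ >= N) {
                break; // the remaining values of this year are cooler, skip them
            }
            // Hadoop refreshes 'key' for every value during iteration, so it matches the current record
            context.write(key, v);
        }
    }
}

Wiring it in would only require replacing myjob.setReducerClass(MyTemperatureReducer.class) with myjob.setReducerClass(MyTopNTemperatureReducer.class) in the main class.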

GitHub project

Project link: hadoop_mr_temperature
