Hadoop case demo: phone traffic statistics - an extension of the wordcount case

The code example is developed in the IDEA environment

#Write a simple wordcount input text
vim wordcount.txt
#Create the input directory on HDFS
hdfs dfs -mkdir -p /user/root/input
#Send it to the virtual machine and run
hadoop jar ......

Hadoop serialization case

  1. Requirement
    ·For each mobile phone number, count the upstream traffic, the downstream traffic, and the total traffic consumed
    ·Input data: phone_data.txt
  2. Requirement analysis
    ·A typical extension of the wordcount case
    ·However, it is hard to implement with Hadoop's native data types
    ·So a custom serializable bean is used
  3. Conditions the custom bean must meet
    ·It must implement the Writable interface
    ·During deserialization, the framework creates the object by reflection through the no-argument constructor, so an empty-argument constructor is required:
    public FlowBean(){ super();}
  4. Override the serialization method;
    ·Override the deserialization method
  5. The deserialization order must be exactly the same as the serialization order
    ·Think of it as a first-in, first-out queue;
  6. To display the result in the output file, override toString(); fields can be separated by "\t" to facilitate later processing;
  7. If the custom bean needs to be transferred as a key, it must also implement the Comparable interface, because the shuffle process of the MapReduce framework requires that keys can be sorted (see the sketch after the FlowBean class below).
//Custom Writable bean
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

class FlowBean implements Writable{
	private long upFlow;
	private long downFlow;
	private long sumFlow;
//An empty-argument constructor is required so that the framework can create the object by reflection
	public FlowBean(){
		super();
	}
	public FlowBean(long upFlow,long downFlow){
		this.upFlow=upFlow;
		this.downFlow=downFlow;
		this.sumFlow=upFlow+downFlow;
	}
	public void set(long upFlow,long downFlow){
		this.upFlow=upFlow;
		this.downFlow=downFlow;
		this.sumFlow=upFlow+downFlow;
	}
	public long getUpFlow(){ return upFlow; }
	public long getDownFlow(){ return downFlow; }
//For text output, toString() must be overridden
	public String toString(){
		return this.upFlow+","+downFlow+","+sumFlow;
	}
//Serialization and deserialization must read and write the fields in exactly the same order
	public void write(DataOutput out) throws IOException{
		out.writeLong(this.upFlow);
		out.writeLong(this.downFlow);
		out.writeLong(this.sumFlow);
	}
	public void readFields(DataInput in) throws IOException{
		this.upFlow = in.readLong();
		this.downFlow = in.readLong();
		this.sumFlow = in.readLong();
	}
//Any remaining getters and setters can be generated automatically by the IDE
}
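Point 7 of the list above says that a bean transferred as a key must be sortable. As a minimal sketch of that idea, the bean can implement WritableComparable instead of plain Writable; the class name below and the choice of sorting by total flow in descending order are illustrative assumptions, not part of the original case.

//Sketch: a key bean implements WritableComparable so the shuffle can sort it
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

class ComparableFlowBean implements WritableComparable<ComparableFlowBean>{
	private long upFlow;
	private long downFlow;
	private long sumFlow;
	public ComparableFlowBean(){ super(); }
	//write() and readFields() keep the same field order, just like FlowBean above
	public void write(DataOutput out) throws IOException{
		out.writeLong(upFlow);
		out.writeLong(downFlow);
		out.writeLong(sumFlow);
	}
	public void readFields(DataInput in) throws IOException{
		upFlow = in.readLong();
		downFlow = in.readLong();
		sumFlow = in.readLong();
	}
	//compareTo() is what the shuffle uses to order keys
	public int compareTo(ComparableFlowBean o){
		return Long.compare(o.sumFlow, this.sumFlow); //descending by total flow (assumed)
	}
}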
//Override the wordcount Mapper
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowMapper extends Mapper<LongWritable,Text,Text,FlowBean>{
	//Convert the input line to a string, then cut it into fields
	@Override
	protected void map(LongWritable key,Text value,Context context) throws IOException,InterruptedException {
		String line = value.toString();
		String[] fields = line.split("\t"); //assuming the input is tab-separated
		Text k = new Text();
		//fields[1] is the phone number; the last three fields are upstream flow, downstream flow and status
		k.set(fields[1]);
		FlowBean v = new FlowBean();
		v.set(Long.parseLong(fields[fields.length - 3]),Long.parseLong(fields[fields.length - 2]));
		context.write(k,v);
	}
}
//Override the Reducer
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowsReducer extends Reducer<Text,FlowBean,Text,FlowBean>{
	@Override
	public void reduce(Text key,Iterable<FlowBean> values,Context context) throws IOException,InterruptedException{
		long total_upFlow = 0;
		long total_downFlow = 0;
		//Sum the upstream and downstream flow of every record for this phone number
		for(FlowBean bean:values){
			total_upFlow += bean.getUpFlow();
			total_downFlow += bean.getDownFlow();
		}
		FlowBean v = new FlowBean();
		v.set(total_upFlow,total_downFlow);
		context.write(key,v);
	}
}
//Override Driver
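Since the summary notes that the Driver barely changes from the wordcount version, here is a minimal sketch of what such a driver could look like; the class name FlowDriver and the command-line input/output arguments are illustrative assumptions, not taken from the original post.

//A minimal Driver sketch (class name and argument handling are assumptions)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowDriver {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		//Register the job classes
		job.setJarByClass(FlowDriver.class);
		job.setMapperClass(FlowMapper.class);
		job.setReducerClass(FlowsReducer.class);
		//Map output types
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(FlowBean.class);
		//Final output types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(FlowBean.class);
		//Input and output paths come from the command line
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}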

summary
·This case only adds the custom FlowBean class on top of wordcount
·The overall Driver changes very little
·So it is essentially an extension of the wordcount case

theory

MapReduce slicing

1. Slicing and the MapTask parallelism determination mechanism

  1. Problem introduction
    ·The MapTask parallelism determines the concurrency of the Map phase, and therefore the processing speed of the whole job
    ·How many tasks should process the Map input in parallel? This directly affects the concurrent processing capacity of the cluster
  2. MapTask parallelism determination mechanism
    ·Data block: a Block is the physical division of data on HDFS
    ·Data slice (split): a slice is only a logical division of the input; the data is not physically re-cut on disk for storage
  3. Blocks are cut according to the BlockSize; slicing follows these rules:
  • The Map-phase parallelism of a Job is determined by the number of slices computed when the client submits the Job
  • Each split is assigned to one MapTask instance for processing
  • By default, the slice size equals the BlockSize
  • Slicing is done file by file; the data set as a whole is not considered

2. FileInputFormat slicing process

  1. Find the path where the data is stored
  2. Traverse each file in the input (slice planning) directory
  3. For the first file:
    ·Get the size of the file
    ·Compute the slice size
    ·Default slice size = BlockSize
    ·Start slicing. Before each cut, check whether the remaining part is greater than 1.1 times the slice size; if it is not, no further cut is made and the remainder becomes the last slice (a common exam point; see the sketch below)
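As a rough illustration of the rules above, the sketch below mimics the slice-size formula and the 1.1x check (Hadoop's SPLIT_SLOP constant). It is a simplified stand-in written for this note, not the actual FileInputFormat source.

//Simplified illustration of the FileInputFormat slicing rule (not the real Hadoop code)
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
	private static final double SPLIT_SLOP = 1.1; //only cut another slice if the remainder exceeds 1.1x the slice size

	//Slice size = max(minSize, min(maxSize, blockSize)); with default settings this equals the BlockSize
	static long computeSplitSize(long blockSize, long minSize, long maxSize) {
		return Math.max(minSize, Math.min(maxSize, blockSize));
	}

	//Each file is planned on its own; the data set as a whole is not considered
	static List<Long> planSplits(long fileLength, long splitSize) {
		List<Long> splitLengths = new ArrayList<>();
		long bytesRemaining = fileLength;
		while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
			splitLengths.add(splitSize);
			bytesRemaining -= splitSize;
		}
		if (bytesRemaining != 0) {
			splitLengths.add(bytesRemaining); //the last slice may be up to 1.1x the slice size
		}
		return splitLengths;
	}

	public static void main(String[] args) {
		long blockSize = 128L * 1024 * 1024;
		long splitSize = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
		//A 129 MB file yields a single slice, because 129/128 is not greater than 1.1
		System.out.println(planSplits(129L * 1024 * 1024, splitSize));
	}
}

With a 128 MB block size, for example, a 129 MB file therefore produces one 129 MB slice rather than two, because 129/128 ≈ 1.008 does not exceed 1.1, which is exactly the exam point mentioned above.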
