1, Background
In real database applications, we often need to read data from multiple tables. In that case we can use the SQL JOIN statement to query data from two or more tables at once.
When processing data with the MapReduce framework, we likewise need to read multiple data sets and join them. In that case, however, the logic has to be written in Java code and implemented according to the MapReduce programming model.
Because of MapReduce's distributed design, implementing a join raises a particular question: at which stage should the data sets be associated, the mapper stage or the reducer stage, and what is the difference between the two?
MapReduce joins therefore fall into two categories: map side join and reduce side join.
2, reduce side join
1. General
Reduce side join, as the name suggests, performs the join in the reduce phase. It is also the most straightforward join to think of and implement, because the shuffle process groups related records together, which makes the subsequent join convenient.
Basically, the steps of reduce side join are as follows:
- Mappers read the different data sets separately;
- In the mapper output, the join field is usually used as the output key;
- Records from different data sets that share the same key end up in the same group after the shuffle;
- In the reducer, the grouped records are associated, merged and summarized according to the business requirements, and the result is output (a minimal sketch of this pattern follows the list).
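As a minimal sketch of this pattern (generic code, not the case study in section 5; the file names, field layout and the "#" tag convention here are assumptions), the mapper side might look like this:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class TaggedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String fileName;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Remember which data set this input split belongs to
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fileName.contains("a.txt")) {
            // Data set A: join field in the first column, tag the record with A#
            context.write(new Text(fields[0]), new Text("A#" + value));
        } else {
            // Data set B: join field in the second column, tag the record with B#
            context.write(new Text(fields[1]), new Text("B#" + value));
        }
    }
}

The reducer then receives all A# and B# records for one key in a single group, splits them by tag and combines them; section 5 shows this end to end for the order case.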
2. Disadvantages
The biggest problem with reduce side join is that the entire join happens in the reduce phase, while the reduce parallelism in MapReduce is usually very low (1 by default), so all the data is squeezed into the reduce phase and puts it under heavy pressure. The reduce parallelism can be raised, but then the final result ends up scattered across multiple output files.
On the way from the mappers to the reducers the data also has to go through the shuffle stage, which is cumbersome and becomes very expensive when the data sets are large.
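As mentioned above, the reduce parallelism can be raised in the driver; a one-line example (the value 4 is arbitrary):

// Default is 1; each reduce task writes its own part-r-xxxxx output file
job.setNumReduceTasks(4);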
3, MapReduce distributed cache
DistributedCache is a mechanism provided by the Hadoop framework. Before a job runs, it distributes the files specified by the job to the machines that will execute its tasks, and it provides mechanisms to manage those cached files.
DistributedCache can cache the files required by the application (including text, archive files, jar files, etc.).
The MapReduce framework copies the necessary files to the worker nodes before any task of the job executes. This is efficient because each job's files are copied only once, and the files are cached on the worker nodes that do not yet have them.
1. Usage
1. Add cache files
You can use the MapReduce API to add the files that need to be cached.
// Add archive to distributed cache
job.addCacheArchive(URI uri);
// Add normal files to distributed cache
job.addCacheFile(URI uri);
Note: the files to be distributed must be placed on HDFS in advance. The default path prefix is hdfs://.
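For example, in the driver's main method (which, like the drivers later in this post, is assumed to declare throws Exception; the path below is a placeholder):

// Requires java.net.URI; the file must already exist on HDFS,
// and without a scheme the hdfs:// prefix is assumed
job.addCacheFile(new URI("/data/cache/myfile.txt"));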
2. Read the cache file in the program
In the setup method of the Mapper or Reducer class, use an input stream to read the files from the distributed cache.
protected void setup(Context context) throws IOException, InterruptedException {
    FileReader reader = new FileReader("myfile");
    BufferedReader br = new BufferedReader(reader);
    ......
}
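Since Hadoop 2, the URIs of the cached files can also be listed through the task context. A sketch of a Mapper setup method that reads the first cached file by its local symlink name (the per-line parsing is left out):

// Requires java.net.URI, org.apache.hadoop.fs.Path, java.io.BufferedReader, java.io.FileReader
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    URI[] cacheFiles = context.getCacheFiles();   // URIs registered via job.addCacheFile(...)
    if (cacheFiles != null && cacheFiles.length > 0) {
        // Each cached file is symlinked into the task working directory under its base name
        String localName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader br = new BufferedReader(new FileReader(localName))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Parse each line of the cached file as needed
            }
        }
    }
}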
4, map side join
1. General
The essence of map side join is to perform the join in the map phase; the program has no reduce phase at all, which avoids the complexity of the shuffle. The key to the implementation is MapReduce's distributed cache.
Map side join shows its real advantage in scenarios where a large data set has to be joined with a small one.
The general idea of map side join is as follows:
- First, analyze the data sets involved in the join and put the small data set into the distributed cache;
- When the job runs, the MapReduce framework automatically distributes the cached data to the machines running the map tasks;
- The program runs only mappers: during initialization each mapper reads the small data set from the distributed cache, joins it with the large data set it reads as input, and outputs the final result;
- There is no shuffle or reducer in the whole join process.
2. Advantages
The biggest advantage of map side join is that it eliminates the data transfer cost of the shuffle. In addition, the number of mappers adjusts automatically to the amount of input data, so the job takes full advantage of distributed computing.
5, MapReduce join case: order item processing
1. Requirements
There are two structured data files: itheima_goods (commodity table) and itheima_order_goods (order detail table). Their fields are listed below.
The requirement is to use MapReduce to work out, for each order, the names of the products it contains. For example, goodsId 107860 corresponds to the AMAZFIT black silicone wristband.
Data structure:
- itheima_goods
  Fields: goodsId (commodity id), goodsSn (commodity number), goodsName (commodity name)
- itheima_order_goods
  Fields: order id, goodsId, payPrice
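For reference, the sample records quoted in the code comments below look like this (fields separated by a vertical bar):

itheima_goods:       100101|155083444927602|6 Sichuan jelly oranges, about 180g / piece
itheima_order_goods: 1|107860|7191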
2. Reduce Side implementation
1. Analysis
Use the mapper to process both the order data and the commodity data, and use goodsId as the output key. Goods and order records with the same goodsId end up in the same group of the same reducer, where the order and commodity information can be associated and merged. In the MapReduce program, the file name of the split currently being processed can be obtained through the context; based on that file name the mapper decides whether it is handling order data or commodity data and builds its output accordingly.
After the join output is produced, a second MapReduce job can use the framework's sorting to gather together all the commodity information that belongs to the same order.
- Mapper class
package com.uuicon.sentiment_upload.join;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    Text outKey = new Text();
    Text outValue = new Text();
    StringBuffer sb = new StringBuffer();
    String fileName = null;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        // Remember which file this input split comes from
        FileSplit split = (FileSplit) context.getInputSplit();
        fileName = split.getPath().getName();
        System.out.println("Current file----" + fileName);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        sb.setLength(0);
        String[] fields = value.toString().split("\\|");
        if (fileName.contains("itheima_goods.txt")) {
            // 100101 | 155083444927602 | 6 Sichuan jelly oranges, about 180g / piece
            outKey.set(fields[0]);
            sb.append(fields[1] + "\t" + fields[2]);
            outValue.set(sb.insert(0, "goods#").toString());
            context.write(outKey, outValue);
        } else {
            // 1|107860|7191
            outKey.set(fields[1]);
            sb.append(fields[0]).append("\t").append(fields[1]).append("\t").append(fields[2]);
            outValue.set(sb.insert(0, "orders#").toString());
            context.write(outKey, outValue);
        }
    }
}
- Reducer class
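The reducer listing in the original post is a verbatim copy of the mapper above, so the actual ReduceJoinReducer code is missing here. The sketch below is a reconstruction based on the analysis above and on the driver and sort job that follow: it separates the goods# and orders# tagged values and, for every order record, appends the commodity information, producing lines in the orderId, goodsId, payPrice, goodsSn, goodsName layout that ReduceJoinSort expects. Treat it as an assumption, not the original listing.

package com.uuicon.sentiment_upload.join;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> {
    Text outKey = new Text();
    Text outValue = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Split the grouped values back into commodity info and order records
        String goods = "";
        List<String> orders = new ArrayList<>();
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("goods#")) {
                // goodsSn \t goodsName
                goods = v.substring("goods#".length());
            } else if (v.startsWith("orders#")) {
                // orderId \t goodsId \t payPrice
                orders.add(v.substring("orders#".length()));
            }
        }
        // Emit one line per order record with the commodity info appended
        for (String order : orders) {
            outKey.set(order);
            outValue.set(goods);
            context.write(outKey, outValue);
        }
    }
}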
- Driver class
package com.uuicon.sentiment_upload.join;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceJoinDriver {
    public static void main(String[] args) throws Exception {
        // Configuration object
        Configuration conf = new Configuration();
        // Create the job instance
        Job job = Job.getInstance(conf, ReduceJoinDriver.class.getSimpleName());
        // Set the job driver class
        job.setJarByClass(ReduceJoinDriver.class);
        // Set the job mapper and reducer classes
        job.setMapperClass(ReduceJoinMapper.class);
        job.setReducerClass(ReduceJoinReducer.class);
        // Set the key/value types of the mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // Set the key/value types of the reducer output, i.e. the final program output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Configure the job input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Configure the job output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Delete the output path if it already exists
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(new Path(args[1]))) {
            fs.delete(new Path(args[1]), true);
        }
        boolean resultFlag = job.waitForCompletion(true);
        System.exit(resultFlag ? 0 : 1);
    }
}
- Result sorting
package com.uuicon.sentiment_upload.join;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class ReduceJoinSort {

    public static class ReduceJoinSortMapper extends Mapper<LongWritable, Text, Text, Text> {
        Text outKey = new Text();
        Text outValue = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // 2278  100101  38  155083444927602  6 Sichuan jelly oranges, about 180g / piece
            String[] fields = value.toString().split("\t");
            outKey.set(fields[0]);
            outValue.set(fields[0] + "\t" + fields[1] + "\t" + fields[3] + "\t" + fields[4] + "\t" + fields[2]);
            context.write(outKey, outValue);
        }
    }

    public static class ReduceJoinSortReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Configuration object
        Configuration conf = new Configuration();
        // Create the job instance
        Job job = Job.getInstance(conf, ReduceJoinSort.class.getSimpleName());
        // Set the job driver class
        job.setJarByClass(ReduceJoinSort.class);
        // Set the job mapper and reducer classes
        job.setMapperClass(ReduceJoinSortMapper.class);
        job.setReducerClass(ReduceJoinSortReducer.class);
        // Set the key/value types of the mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // Set the key/value types of the reducer output, i.e. the final program output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Configure the job input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Configure the job output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Delete the output path if it already exists
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(new Path(args[1]))) {
            fs.delete(new Path(args[1]), true);
        }
        boolean resultFlag = job.waitForCompletion(true);
        System.exit(resultFlag ? 0 : 1);
    }
}
Operation results
- Result of reduce join
- Results after reordering
3. Map Side implementation
1. Analysis
Map side join means that the specific data set is loaded inside the map task. In this case the commodity data is put into the distributed cache, and the mapper reads the order data and joins it against the cached commodity data.
Usually, for convenience, the distributed cache file is read in the mapper's setup initialization method and loaded into memory, so that the subsequent map calls can use it when processing data.
Because the data association is already completed in the mapper stage, the program does not need a reduce phase. Set the number of reduce tasks to 0 in the job, so that the mapper output is the final program output.
- Mapper class
package com.uuicon.sentiment_upload.cache;

import org.apache.commons.collections.map.HashedMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;

public class ReduceCacheMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    Map<String, String> goodsMap = new HashedMap();
    Text outKey = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load cache file
        BufferedReader br = new BufferedReader(new FileReader("itheima_goods.txt"));
        String line = null;
        while ((line = br.readLine()) != null) {
            String[] fields = line.split("\\|");
            goodsMap.put(fields[0], fields[1] + "\t" + fields[2]);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 56982|100917|1192
        String[] fields = value.toString().split("\\|");
        outKey.set(value.toString() + "\t" + goodsMap.get(fields[1]));
        context.write(outKey, NullWritable.get());
    }
}
- Program main class
package com.uuicon.sentiment_upload.cache;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.net.URI;

public class ReduceCacheDriver {
    public static void main(String[] args) throws Exception {
        // Configuration object
        Configuration conf = new Configuration();
        // Create the job instance
        Job job = Job.getInstance(conf, ReduceCacheDriver.class.getSimpleName());
        // Set the job driver class
        job.setJarByClass(ReduceCacheDriver.class);
        // Set the job mapper class
        job.setMapperClass(ReduceCacheMapper.class);
        // Set the key/value types of the mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        // Set the key/value types of the final program output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Map-only job: no reduce phase
        job.setNumReduceTasks(0);
        // Add the commodity file on HDFS to the distributed cache
        job.addCacheFile(new URI("/data/cache/itheima_goods.txt"));
        // Configure the job input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Configure the job output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Delete the output path if it already exists
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(new Path(args[1]))) {
            fs.delete(new Path(args[1]), true);
        }
        boolean resultFlag = job.waitForCompletion(true);
        System.exit(resultFlag ? 0 : 1);
    }
}
Submit and run
- In the project's pom.xml, specify the fully qualified name of the program's main class;
<plugins>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.4</version>
        <configuration>
            <archive>
                <manifest>
                    <addClasspath>true</addClasspath>
                    <classpathPrefix>lib/</classpathPrefix>
                    <mainClass>com.uuicon.sentiment_upload.cache.ReduceCacheDriver</mainClass>
                </manifest>
            </archive>
        </configuration>
    </plugin>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
            <source>1.8</source>
            <target>1.8</target>
            <encoding>UTF-8</encoding>
        </configuration>
    </plugin>
</plugins>
- Execute mvn package command to generate jar package;
- Upload the jar package to the hadoop cluster (on any node);
- Execute the command (on any node): hadoop jar xxxx.jar, passing the input and output paths as arguments; make sure the YARN cluster has been started beforehand.
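For example (the jar name and HDFS paths below are placeholders; the two arguments become args[0] and args[1] in the driver):

hadoop jar mapreduce-join.jar /data/order_input /data/join_output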