2021SC@SDUSC
A brief introduction to the research content
Last week, we completed the analysis of the core code in org.apache.hadoop.mapreduce.Counters. This week, we will continue the analysis from org.apache.hadoop.mapreduce.ID.
org.apache.hadoop.mapreduce.ID source code analysis
package org.apache.hadoop.mapreduce;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.io.WritableComparable;

/**
 * A general identifier, which internally stores the id
 * as an integer. This is the super class of {@link JobID},
 * {@link TaskID} and {@link TaskAttemptID}.
 *
 * @see JobID
 * @see TaskID
 * @see TaskAttemptID
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class ID implements WritableComparable<ID> {
  protected static final char SEPARATOR = '_';
  protected int id;

  /** constructs an ID object from the given int */
  public ID(int id) {
    this.id = id;
  }

  protected ID() {
  }

  /** returns the int which represents the identifier */
  public int getId() {
    return id;
  }

  @Override
  public String toString() {
    return String.valueOf(id);
  }

  @Override
  public int hashCode() {
    return id;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o)
      return true;
    if (o == null)
      return false;
    if (o.getClass() == this.getClass()) {
      ID that = (ID) o;
      return this.id == that.id;
    }
    else
      return false;
  }

  /** Compare IDs by associated numbers */
  public int compareTo(ID that) {
    return this.id - that.id;
  }

  public void readFields(DataInput in) throws IOException {
    this.id = in.readInt();
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(id);
  }
}
The ID class is very simple: it is a general identifier that internally stores the id as an integer, and it serves as the superclass of JobID, TaskID and TaskAttemptID. It provides constructors, a getId() accessor, equals(), hashCode() and compareTo() for comparing IDs, and readFields()/write() for Hadoop serialization.
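To make the contract concrete, here is a small sketch. Since ID is abstract, the DemoID subclass and the IdDemo wrapper below are hypothetical classes written only for this illustration; they are not part of Hadoop.

import org.apache.hadoop.mapreduce.ID;

public class IdDemo {
  // Trivial concrete subclass: ID itself is abstract but adds no abstract methods.
  static class DemoID extends ID {
    DemoID(int id) { super(id); }
  }

  public static void main(String[] args) {
    DemoID a = new DemoID(3);
    DemoID b = new DemoID(3);
    DemoID c = new DemoID(7);

    System.out.println(a.equals(b));    // true: same class and same int id
    System.out.println(a.compareTo(c)); // negative: compares the underlying ints (3 - 7)
    System.out.println(a);              // "3": toString() is just the int
  }
}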
As subclasses of ID, JobID, TaskID and TaskAttemptID are very similar, so we take JobID as an example for analysis.
org.apache.hadoop.mapreduce.JobID source code analysis
Let's first look at the source code of JobID:
@InterfaceAudience.Public
@InterfaceStability.Stable
public class JobID extends org.apache.hadoop.mapred.ID
                   implements Comparable<ID> {
  public static final String JOB = "job";

  // Jobid regex for various tools and framework components
  public static final String JOBID_REGEX =
      JOB + SEPARATOR + "[0-9]+" + SEPARATOR + "[0-9]+";

  private final Text jtIdentifier;

  protected static final NumberFormat idFormat = NumberFormat.getInstance();
  static {
    idFormat.setGroupingUsed(false);
    idFormat.setMinimumIntegerDigits(4);
  }

  /**
   * Constructs a JobID object
   * @param jtIdentifier jobTracker identifier
   * @param id job number
   */
  public JobID(String jtIdentifier, int id) {
    super(id);
    this.jtIdentifier = new Text(jtIdentifier);
  }

  public JobID() {
    jtIdentifier = new Text();
  }

  public String getJtIdentifier() {
    return jtIdentifier.toString();
  }

  @Override
  public boolean equals(Object o) {
    if (!super.equals(o))
      return false;

    JobID that = (JobID)o;
    return this.jtIdentifier.equals(that.jtIdentifier);
  }

  /** Compare JobIds by first jtIdentifiers, then by job numbers */
  @Override
  public int compareTo(ID o) {
    JobID that = (JobID)o;
    int jtComp = this.jtIdentifier.compareTo(that.jtIdentifier);
    if (jtComp == 0) {
      return this.id - that.id;
    }
    else
      return jtComp;
  }

  /**
   * Add the stuff after the "job" prefix to the given builder. This is useful,
   * because the sub-ids use this substring at the start of their string.
   * @param builder the builder to append to
   * @return the builder that was passed in
   */
  public StringBuilder appendTo(StringBuilder builder) {
    builder.append(SEPARATOR);
    builder.append(jtIdentifier);
    builder.append(SEPARATOR);
    builder.append(idFormat.format(id));
    return builder;
  }

  @Override
  public int hashCode() {
    return jtIdentifier.hashCode() + id;
  }

  @Override
  public String toString() {
    return appendTo(new StringBuilder(JOB)).toString();
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    super.readFields(in);
    this.jtIdentifier.readFields(in);
  }

  @Override
  public void write(DataOutput out) throws IOException {
    super.write(out);
    jtIdentifier.write(out);
  }

  /** Construct a JobId object from given string
   * @return constructed JobId object or null if the given String is null
   * @throws IllegalArgumentException if the given string is malformed
   */
  public static JobID forName(String str) throws IllegalArgumentException {
    if (str == null)
      return null;
    try {
      String[] parts = str.split("_");
      if (parts.length == 3) {
        if (parts[0].equals(JOB)) {
          return new org.apache.hadoop.mapred.JobID(parts[1],
              Integer.parseInt(parts[2]));
        }
      }
    } catch (Exception ex) {
      // fall below
    }
    throw new IllegalArgumentException("JobId string : " + str
        + " is not properly formed");
  }
}
This class is described in the official documentation as follows:
public class JobID extends ID implements Comparable<ID>
JobID represents the immutable and unique identifier of a job. A JobID consists of two parts. The first part is the jobtracker identifier, so that the mapping from JobID to jobtracker is well defined; for a cluster setup this string is the jobtracker start time, and for the local setting it is "local" plus a random number. The second part of the JobID is the job number.
The official documentation gives an example: job_200707121733_0003 is the third job running on the jobtracker that was started at 200707121733.
Applications should never construct or parse JobID strings by hand; instead, they should use the appropriate constructor or the forName(String) method, as in the sketch below.
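Here is a small usage sketch based on the example id above; the JobIdDemo wrapper class is only for illustration, while JobID.forName(), the constructor, getJtIdentifier() and getId() are the real API.

import org.apache.hadoop.mapreduce.JobID;

public class JobIdDemo {
  public static void main(String[] args) {
    // Parse the example id from the documentation back into a JobID object.
    JobID parsed = JobID.forName("job_200707121733_0003");
    System.out.println(parsed.getJtIdentifier()); // 200707121733
    System.out.println(parsed.getId());           // 3
    System.out.println(parsed);                   // job_200707121733_0003

    // Or build one directly with the constructor.
    JobID built = new JobID("200707121733", 3);
    System.out.println(built.compareTo(parsed));  // 0: same jtIdentifier, same job number
  }
}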
The basic methods in JobID, such as compareTo(), equals(), readFields() and write(), are all common methods. Let's focus on appendTo().
appendTo() method
public StringBuilder appendTo(StringBuilder builder) {
  builder.append(SEPARATOR);
  builder.append(jtIdentifier);
  builder.append(SEPARATOR);
  builder.append(idFormat.format(id));
  return builder;
}
The builder parameter passed into this method is the StringBuilder to append to; after appending, the same builder is returned.
appendTo(StringBuilder) adds the content that follows the "job" prefix to the given builder. This is useful because the sub-ids (such as TaskID and TaskAttemptID) reuse this substring at the start of their own string representations, as the sketch below shows.
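A short illustration (the AppendToDemo wrapper class is hypothetical): toString() is simply appendTo() applied to a builder that already contains the "job" prefix, and a sub-id reuses the same "_<jtIdentifier>_<jobNumber>" suffix.

import org.apache.hadoop.mapreduce.JobID;

public class AppendToDemo {
  public static void main(String[] args) {
    JobID job = new JobID("200707121733", 3);

    // toString() is implemented as appendTo(new StringBuilder(JOB)).toString():
    StringBuilder sb = new StringBuilder("job");
    System.out.println(job.appendTo(sb)); // job_200707121733_0003
    System.out.println(job.toString());   // job_200707121733_0003

    // A sub-id such as a TaskID (e.g. task_200707121733_0003_m_000005)
    // starts with the same "_200707121733_0003" substring, which is why
    // appendTo() exists as a reusable building block.
  }
}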
Next, we continue to analyze the source code of other classes.
org.apache.hadoop.mapreduce.InputFormat source code analysis
package org.apache.hadoop.mapreduce;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class InputFormat<K, V> {

  /**
   * Logically split the set of input files for the job.
   *
   * <p>Each {@link InputSplit} is then assigned to an individual {@link Mapper}
   * for processing.</p>
   *
   * <p><i>Note</i>: The split is a <i>logical</i> split of the inputs and the
   * input files are not physically split into chunks. For e.g. a split could
   * be <i><input-file-path, start, offset></i> tuple. The InputFormat
   * also creates the {@link RecordReader} to read the {@link InputSplit}.
   *
   * @param context job configuration.
   * @return an array of {@link InputSplit}s for the job.
   */
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  /**
   * Create a record reader for a given split. The framework will call
   * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
   * the split is used.
   * @param split the split to be read
   * @param context the information about the task
   * @return a new record reader
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract RecordReader<K,V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
InputFormat is an abstract class used to obtain the input data, split it, and turn it into <K, V> key-value pairs.
InputFormat describes the input specification of a MapReduce job.
The MapReduce framework relies on the InputFormat of the job to:
1. Validate the input specification of the job.
2. Split the input files into logical InputSplits, each of which is then assigned to an individual Mapper.
3. Provide the RecordReader used to glean input records from the logical InputSplit for processing by the Mapper.
The default behavior of file-based InputFormats (typically subclasses of FileInputFormat) is to split the input into logical InputSplits based on the total size, in bytes, of the input files. However, the filesystem block size of the input files is treated as an upper bound on the input splits, while a lower bound on the split size can be set through mapreduce.input.fileinputformat.split.minsize, for example as shown below.
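A quick illustration of setting this lower bound; the 128 MB value and the SplitSizeDemo class are arbitrary examples, while Configuration.setLong() and FileInputFormat.setMinInputSplitSize() are the real API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Option 1: set the property directly.
    conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024);

    Job job = Job.getInstance(conf, "split-size-demo");
    // Option 2: use the FileInputFormat helper, which sets the same property.
    FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
  }
}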
Clearly, logical splits based only on input size are insufficient for many applications, since record boundaries must be respected. In such cases the application must also implement a RecordReader, which takes on the responsibility of respecting record boundaries and presents a record-oriented view of the logical InputSplit to the individual task.
There are two abstract methods in this class:
(1) getSplits: responsible for turning the input data (e.g. files on HDFS) into InputSplits, that is, logically slicing the original data according to the configured split size. An InputSplit records only the metadata of a slice, such as its starting offset, length and the list of nodes that host it.
(2) createRecordReader: takes an InputSplit and returns the reader that parses each record into a <K, V> key-value pair. A minimal skeleton combining both methods is sketched below.
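To see where the two abstract methods fit, here is a minimal, hypothetical InputFormat skeleton. SimpleInputFormat is not a real Hadoop class; it simply delegates to TextInputFormat and LineRecordReader (real classes from org.apache.hadoop.mapreduce.lib.input) to keep the sketch short.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SimpleInputFormat extends InputFormat<LongWritable, Text> {

  @Override
  public List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException {
    // Logically slice the input; delegate to TextInputFormat's
    // FileInputFormat-based splitting to keep the sketch short.
    return new TextInputFormat().getSplits(context);
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    // The framework calls initialize(split, context) on the reader before use;
    // LineRecordReader then yields <byte offset, line text> pairs.
    return new LineRecordReader();
  }
}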
That concludes our look at InputFormat. In next week's blog, we will analyze several important subclasses of InputFormat, such as FileInputFormat.
Summary
This week we first analyzed the ID class and gained a preliminary understanding of its role, then moved on to its subclasses, taking JobID as an example to explore the source code and its key methods. We also read the source code of InputFormat and analyzed its responsibilities, laying the foundation for analyzing its subclasses later.