Spark source code analysis (based on YARN cluster mode): RDDs and dependencies

We know that RDD is a particularly important concept in Spark; it is fair to say that almost all of Spark's logic is built on RDDs. In this article we take a brief look at RDDs. The definition of RDD in Spark is as follows:

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
    def this(@transient oneParent: RDD[_]) =
    this(oneParent.context, List(new OneToOneDependency(oneParent)))
}

Each RDD contains the following five attributes (sketched in code right after this list):

  • A list of partitions for this RDD
  • A compute function for each partition
  • A list of dependencies on other RDDs
  • A partitioner (optional, only for key-value RDDs)
  • Preferred locations for computing each partition (optional, e.g. HDFS block locations)
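To make these concrete, here is a minimal sketch (the class name is made up for illustration; the member names mirror the real RDD API) of how the five attributes map onto members of the RDD class:

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Minimal sketch of the five RDD attributes as they appear in the RDD API.
abstract class FivePropertiesSketch[T] extends Serializable {
  // 1. the list of partitions of this RDD
  protected def getPartitions: Array[Partition]
  // 2. the function that computes one partition
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // 3. dependencies on parent RDDs
  protected def getDependencies: Seq[Dependency[_]]
  // 4. optional partitioner (only meaningful for key-value RDDs)
  val partitioner: Option[Partitioner] = None
  // 5. optional preferred locations (e.g. HDFS block locations) for a partition
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}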

For a better understanding, let's use the most common implementation for reading data from HDFS: HadoopRDD.

Let's first look at how HadoopRDD obtains its partition information:

override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    SparkHadoopUtil.get.addCredentials(jobConf)
    try {
      // Ask the InputFormat for the input splits; typically one split per HDFS block.
      val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
      val inputSplits = if (ignoreEmptySplits) {
        allInputSplits.filter(_.getLength > 0)
      } else {
        allInputSplits
      }
      // Wrap each split in a HadoopPartition: one split becomes one partition.
      val array = new Array[Partition](inputSplits.size)
      for (i <- 0 until inputSplits.size) {
        array(i) = new HadoopPartition(id, i, inputSplits(i))
      }
      array
    } catch {
      case e: InvalidInputException if ignoreMissingFiles =>
        Array.empty[Partition]
    }
  }

As you can see, HadoopRDD obtains the underlying file information of the data it reads, i.e. the input splits (typically one split per HDFS block), and each split becomes a partition. The partition information of a HadoopRDD is encapsulated in HadoopPartition:

trait Partition extends Serializable {
  def index: Int
  override def hashCode(): Int = index
  override def equals(other: Any): Boolean = super.equals(other)
}
private[spark] class HadoopPartition(rddId: Int, override val index: Int, s: InputSplit)
  extends Partition {
  val inputSplit = new SerializableWritable[InputSplit](s)
  override def hashCode(): Int = 31 * (31 + rddId) + index
  override def equals(other: Any): Boolean = super.equals(other)
  def getPipeEnvVars(): Map[String, String] = {
    val envVars: Map[String, String] = if (inputSplit.value.isInstanceOf[FileSplit]) {
      val is: FileSplit = inputSplit.value.asInstanceOf[FileSplit]
      // map_input_file is deprecated in favor of mapreduce_map_input_file but set both
      // since it's not removed yet
      Map("map_input_file" -> is.getPath().toString(),
        "mapreduce_map_input_file" -> is.getPath().toString())
    } else {
      Map()
    }
    envVars
  }
}

It mainly contains the following information:

  • The id of the RDD
  • The index (serial number) of the partition
  • The InputSplit (file block) the partition corresponds to

As for dependencies, since no shuffle is involved here, a HadoopRDD has no dependencies: the dependency list passed in at construction time is empty (Nil).
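As a quick, hypothetical check (the path, app name and local master are purely illustrative; sc.hadoopFile builds a HadoopRDD directly, whereas sc.textFile wraps it in a MapPartitionsRDD):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hadooprdd-demo").setMaster("local[*]"))
// hadoopFile uses the old mapred InputFormat API, the same one HadoopRDD is built on.
val hadoopRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/events.log")
println(hadoopRdd.getNumPartitions)  // typically one partition per InputSplit / HDFS block
println(hadoopRdd.dependencies)      // List() -- a HadoopRDD has no parent dependencies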
If we perform a computation on the RDD, for example repartition, the method is implemented as follows:

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

You can see that a CoalescedRDD is returned here, which wraps a ShuffledRDD, which in turn wraps a MapPartitionsRDD. Let's take a look at the MapPartitionsRDD first:

  private[spark] def mapPartitionsWithIndexInternal[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false,
      isOrderSensitive: Boolean = false): RDD[U] = withScope {
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => f(index, iter),
      preservesPartitioning = preservesPartitioning,
      isOrderSensitive = isOrderSensitive)
  }
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false,
    isFromBarrier: Boolean = false,
    isOrderSensitive: Boolean = false)
  extends RDD[U](prev) {
....
}

Note that the parent RDD is passed in via RDD[U](prev), which invokes this auxiliary constructor:

  def this(@transient oneParent: RDD[_]) =
    this(oneParent.context, List(new OneToOneDependency(oneParent)))

The dependency is therefore a OneToOneDependency. Here the parent of the MapPartitionsRDD is the current RDD, i.e. the HadoopRDD we analyzed above.
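For reference, OneToOneDependency itself is tiny; it simply maps each child partition to the parent partition with the same index (abridged from the Spark source, Dependency.scala):

class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}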

Let's look at the five properties above for this MapPartitionsRDD.

Partition

Its partition acquisition function:

override def getPartitions: Array[Partition] = firstParent[T].partitions
protected[spark] def firstParent[U: ClassTag]: RDD[U] = {
    dependencies.head.rdd.asInstanceOf[RDD[U]]
  }

You can see that the partitions of the first dependency in the current dependency list are returned here, i.e. the partition information of the HadoopRDD.
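A hypothetical check, continuing the hadoopRdd sketch from earlier: a simple map produces a MapPartitionsRDD whose single dependency is a OneToOneDependency on the HadoopRDD, so both expose the same partitions.

val mapped = hadoopRdd.map { case (_, text) => text.toString }
println(mapped.dependencies.head.getClass.getSimpleName)        // OneToOneDependency
println(mapped.getNumPartitions == hadoopRdd.getNumPartitions)  // true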

Calculation function

The compute function of MapPartitionsRDD is the function f passed in at construction time:

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))

// ------------------------------------------
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

You can see that during repartition, for each partition of the parent RDD a pseudo-random starting position is derived by hashing the partition index; the position is then incremented once per record, and each record is returned together with its new partition key, so the data ends up scattered across the new partitions.
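Here is a tiny standalone illustration (plain Scala, not Spark code; the partition counts are chosen arbitrarily) of how distributePartition keys the records of one input partition:

import scala.util.Random
import scala.util.hashing

val numPartitions = 3
val index = 0  // index of the parent partition being processed
// start at a pseudo-random offset derived from the partition index...
var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
// ...then hand out keys round-robin, one per record
val keyed = (1 to 5).map { record =>
  position += 1
  (position, record)
}
// The HashPartitioner later maps a non-negative Int key k to k % numPartitions,
// so the records of this input partition are spread evenly over the new partitions.
println(keyed.map { case (k, record) => (k % numPartitions, record) })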
In addition, firstParent[T].iterator(split, context) is used to read the data of the parent RDD. We can take a look at this implementation:

  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }

We can see that this brings us back to reading RDD data during task execution. Here the parent is actually the HadoopRDD: the data is read block by block, each block is processed, and the records of each block are scattered into different partitions according to the hash partitioner.
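As a side note on the storageLevel branch above: caching is what switches iterator() from computeOrReadCheckpoint to getOrCompute. A hypothetical example, continuing the earlier hadoopRdd sketch:

import org.apache.spark.storage.StorageLevel

val cached = hadoopRdd.map(_._2.toString).persist(StorageLevel.MEMORY_AND_DISK)
cached.count()  // first action: partitions are computed and stored by the block manager
cached.count()  // later actions: iterator() takes the getOrCompute path and reads the cached blocks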

Now look at the ShuffledRDD; its parent RDD is the MapPartitionsRDD above:

class ShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient var prev: RDD[_ <: Product2[K, V]],
    part: Partitioner)
  extends RDD[(K, C)](prev.context, Nil) {
...
}

The partition list is returned as follows:

  override def getPartitions: Array[Partition] = {
    Array.tabulate[Partition](part.numPartitions)(i => new ShuffledRDDPartition(i))
  }

Each partition here is a ShuffledRDDPartition, which carries only a partition index; the number of partitions comes from the HashPartitioner passed in, i.e. the target number of partitions of the repartition.
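For reference, ShuffledRDDPartition really does carry nothing but the index (abridged from the Spark source):

private[spark] class ShuffledRDDPartition(val idx: Int) extends Partition {
  override val index: Int = idx
}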
Its dependencies are returned as follows:

override def getDependencies: Seq[Dependency[_]] = {
    val serializer = userSpecifiedSerializer.getOrElse {
      val serializerManager = SparkEnv.get.serializerManager
      if (mapSideCombine) {
        serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])
      } else {
        serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])
      }
    }
    List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
  }

As you can see, a ShuffleDependency is returned here. When we studied stage division earlier, a new Stage was created whenever a ShuffleDependency was encountered, so combined with that analysis a stage boundary appears at this point. If there is subsequent computation, a ShuffleMapTask will be generated to write the data on the map side.
When the data is read, the ShuffledRDD.compute method is called:

  override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
    SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
      .read()
      .asInstanceOf[Iterator[(K, C)]]
  }

This matches what we saw when analyzing task execution: each executor fetches the data its own tasks should process. Here, the map side reads the corresponding block files of the HadoopRDD, emits each record together with its new partition key, then groups the records that belong to the same (post-repartition) partition and writes them to local temporary files.
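To make the routing concrete: the HashPartitioner created in coalesce(shuffle = true) decides, from each record's key, which reduce-side partition the pair is written to and later fetched by. A small illustration (the values are arbitrary):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(8)
// For a non-negative Int key this is simply key % numPartitions.
println(partitioner.getPartition(13))  // 13 % 8 = 5, so a record keyed 13 lands in partition 5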

Finally, let's look at the outermost CoalescedRDD:

private[spark] class CoalescedRDD[T: ClassTag](
    @transient var prev: RDD[T],
    maxPartitions: Int,
    partitionCoalescer: Option[PartitionCoalescer] = None)
  extends RDD[T](prev.context, Nil) { 
...
}

From the previous analysis we know that if there is no further processing, a ResultStage with corresponding ResultTasks will be generated to execute the function supplied by the user.
Here is the dependency of CoalescedRDD:

  override def getDependencies: Seq[Dependency[_]] = {
    Seq(new NarrowDependency(prev) {
      def getParents(id: Int): Seq[Int] =
        partitions(id).asInstanceOf[CoalescedRDDPartition].parentsIndices
    })
  }

As you can see, a NarrowDependency is returned. If there is further processing, new RDDs will be built on top of this one, and a new Stage is only created when a ShuffleDependency is encountered.
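For reference, NarrowDependency is the abstract base class for all non-shuffle dependencies: each child partition depends on a small, fixed set of parent partitions (abridged from the Spark source, Dependency.scala):

abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  // Get the parent partitions for a given child partition.
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}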
The calculation function compute here is implemented as follows:

  override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
    partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
      firstParent[T].iterator(parentPartition, context)
    }
  }

You can see that the iterator of the parent RDD (the ShuffledRDD) is invoked here for each parent partition to read the shuffled data. For how that read itself is implemented, see the earlier source-code analysis.
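Putting the pieces together, here is a hypothetical end-to-end sketch (continuing the earlier sc; the exact toDebugString output differs between Spark versions), showing the lineage walked through above:

val data = sc.textFile("hdfs:///data/events.log")  // a MapPartitionsRDD over a HadoopRDD
val repartitioned = data.repartition(8)
println(repartitioned.getNumPartitions)  // 8
// toDebugString prints the lineage: a CoalescedRDD over a ShuffledRDD over a
// MapPartitionsRDD over the HadoopRDD (plus thin MapPartitionsRDD wrappers such as
// the one added by the final .values call).
println(repartitioned.toDebugString)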
