Storage and retrieval of shuffle results between upstream and downstream stages in Spark

When a job is split into stages by the DAGScheduler, it is organized into ShuffleMapStages according to its internal shuffle dependencies. The final ResultStage, when submitted, iterates through its parent stages, adds itself to the DAGScheduler's waiting set, and is only executed as tasks after all of its parent stages have completed.

private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}

As the code above shows, submitting a child stage recursively submits its parent stages first. The topmost stage, which has no missing parents, is split into tasks by the submitMissingTasks() method, while each downstream stage is placed in the waiting set and only enters the execution plan once all of its parents have finished.
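The recursion and the waiting set can be sketched in plain Scala. This is a simplified model, not Spark's actual classes: Stage, StageScheduler, and the finished set here are hypothetical stand-ins for DAGScheduler's bookkeeping.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a stage: an id plus its parent stages.
case class Stage(id: Int, parents: List[Stage])

object StageScheduler {
  val waitingStages = mutable.Set[Stage]()
  val submitted = mutable.ListBuffer[Int]()

  // Mirrors the shape of DAGScheduler.submitStage: recurse into
  // unfinished parents, park the child in the waiting set until
  // they complete.
  def submitStage(stage: Stage, finished: Set[Int]): Unit = {
    val missing = stage.parents.filterNot(p => finished.contains(p.id))
    if (missing.isEmpty) {
      submitted += stage.id            // would call submitMissingTasks here
    } else {
      missing.foreach(p => submitStage(p, finished))
      waitingStages += stage           // parked until parents finish
    }
  }
}
```

Submitting a child with an unfinished parent submits the parent and parks the child; resubmitting the child after the parent finishes lets it through.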



In submitMissingTasks(), the stage that currently needs to run is converted into tasks; a ShuffleMapStage becomes a set of ShuffleMapTasks, which are executed by the Executor.

execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)

When the Executor finishes a ShuffleMapTask, the task's output has already been written to the BlockManager; its addresses in the BlockManager are collected into a map status, which is serialized as the task result and sent back to inform the driver.
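The key point is that the task result is not the shuffle data itself but its location. The following minimal sketch models that idea; BlockManagerId, MapStatus, and runShuffleMapTask are simplified analogues of Spark's classes, not the real implementations.

```scala
// Hypothetical model of what a ShuffleMapTask hands back: not the
// data, but where the blocks were written in the local BlockManager.
case class BlockManagerId(executorId: String, host: String, port: Int)

// Analogue of Spark's MapStatus: the output's location plus the size
// of the block written for each reduce partition.
case class MapStatus(location: BlockManagerId, sizesByReducer: Array[Long])

def runShuffleMapTask(partitionId: Int, numReducers: Int,
                      local: BlockManagerId): MapStatus = {
  // ... the real task writes one block per reduce partition here ...
  val sizes = Array.fill(numReducers)(0L) // sizes recorded while writing
  MapStatus(local, sizes)                 // serialized and sent to the driver
}
```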


On the driver side, handling a task result, and a ShuffleMapTask result in particular, involves two main steps: registering the location of the shuffle output in the BlockManager, and starting the downstream child stages.

The first step is carried out by the DAGScheduler's MapOutputTracker member.

mapOutputTracker.registerMapOutput(
  shuffleStage.shuffleDep.shuffleId, smt.partitionId, status)

When the driver processes a ShuffleMapTask result, it stores the task's shuffleId, partition number, and map status into the driver-side MapOutputTracker through its registerMapOutput() method.

MapOutputTracker maintains a map keyed by shuffleId to hold the shuffle output statuses, so that the location of a map output on a BlockManager can be looked up quickly whenever it is needed.
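A minimal sketch of that registry, in plain Scala: MapOutputRegistry and MapLocation are hypothetical names, and the structure (one slot per map partition under each shuffleId) is a simplification of Spark's actual MapStatus bookkeeping.

```scala
import scala.collection.mutable

// Hypothetical location of one map output on some BlockManager.
case class MapLocation(host: String, port: Int)

// Sketch of the driver-side registry: shuffleId -> one slot per map
// partition, None until that task reports in.
class MapOutputRegistry(numMaps: Map[Int, Int]) {
  private val statuses = mutable.Map[Int, Array[Option[MapLocation]]]()

  def registerMapOutput(shuffleId: Int, mapId: Int, loc: MapLocation): Unit = {
    val arr = statuses.getOrElseUpdate(shuffleId,
      Array.fill(numMaps(shuffleId))(Option.empty[MapLocation]))
    arr(mapId) = Some(loc)
  }

  // Every map output registered => the child stage can be started.
  def isComplete(shuffleId: Int): Boolean =
    statuses.get(shuffleId).exists(_.forall(_.isDefined))
}
```

The completeness check is what gates the second step below: only when every partition of a shuffle has registered its location is the downstream stage runnable.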


In the second step, after the result has been processed, the submitWaitingChildStages() method is called to try to execute the child stages.

private def submitWaitingChildStages(parent: Stage) {
  logTrace(s"Checking if any dependencies of $parent are now runnable")
  logTrace("running: " + runningStages)
  logTrace("waiting: " + waitingStages)
  logTrace("failed: " + failedStages)
  val childStages = waitingStages.filter(_.parents.contains(parent)).toArray
  waitingStages --= childStages
  for (stage <- childStages.sortBy(_.firstJobId)) {
    submitStage(stage)
  }
}

All child stages of the finished stage are taken from the waiting set and submitted through the submitStage() method shown at the beginning of this article; submitStage() performs its checks again, so a child stage only actually runs once all of its parents have been executed.



How does the child stage retrieve these shuffle results?

Take ShuffledRowRDD's compute() method as an example.

override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = {
  val shuffledRowPartition = split.asInstanceOf[ShuffledRowRDDPartition]
  // The range of pre-shuffle partitions that we are fetching at here is
  // [startPreShufflePartitionIndex, endPreShufflePartitionIndex - 1].
  val reader =
    SparkEnv.get.shuffleManager.getReader(
      dependency.shuffleHandle,
      shuffledRowPartition.startPreShufflePartitionIndex,
      shuffledRowPartition.endPreShufflePartitionIndex,
      context)
  reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(_._2)
}

It constructs a reader directly and uses it to read the shuffle results.

The reader actually constructed is a BlockStoreShuffleReader, and as the name implies, the shuffle results it reads are stored in the BlockManager.

val wrappedStreams = new ShuffleBlockFetcherIterator(
  context,
  blockManager.shuffleClient,
  blockManager,
  mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
  serializerManager.wrapStream,
  // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
  SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024,
  SparkEnv.get.conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue),
  SparkEnv.get.conf.getBoolean("spark.shuffle.detectCorrupt", true))

In its read() method, the data stream is constructed, and the locations to fetch from are obtained through MapOutputTracker's getMapSizesByExecutorId() method.

When this method is called on the executor side, it first tries to get the statuses from its local cache via getStatuses(); if they are not there, it constructs a network request to pull the storage locations for the given shuffleId from the MapOutputTracker on the driver side. Once the locations are obtained, the shuffle data itself can be fetched, and the data stream is constructed so that execution continues with it as the child stage's input.
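This cache-then-ask-the-driver pattern can be sketched in a few lines of plain Scala. WorkerSideTracker is a hypothetical name, and the askDriver callback stands in for the RPC to the driver's tracker; it is a model of the behavior, not Spark's implementation.

```scala
import scala.collection.mutable

// Sketch of the executor-side lookup: consult the local cache first,
// and only ask the driver (modeled as a callback) on a miss.
class WorkerSideTracker(askDriver: Int => Array[String]) {
  private val cache = mutable.Map[Int, Array[String]]()
  var remoteFetches = 0  // for illustration: counts driver round-trips

  def getStatuses(shuffleId: Int): Array[String] =
    cache.getOrElseUpdate(shuffleId, {
      remoteFetches += 1
      askDriver(shuffleId)  // network request to the driver's tracker
    })
}
```

Repeated lookups for the same shuffleId hit the local cache, so only the first one costs a round-trip to the driver.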


Tags: Spark network

Posted on Fri, 24 Jan 2020 21:03:30 -0500 by douceur