Principle analysis of Spark RDD operators: count, sample, distinct, orderBy, limit and others


When we write Spark code to process data, most of the work is calling the Spark API to transform data and then collecting the final results. These API functions are called operators.

1. Overview of RDD operators

Spark RDD operators can be divided into the following three categories:

  • Non-shuffle transform operators, represented by map, filter and flatMap. These operators do not trigger the RDD computation process; they convert one RDD into another, and there is a narrow dependency between the two RDDs.
  • Shuffle transform operators, represented by groupByKey, reduceByKey and repartition. These also convert one RDD into another, but they trigger the shuffle process, and there is a wide dependency between the two RDDs.
  • Action operators, represented by count, take, collect and saveAsTextFile. When one of these functions is called, it actually invokes the SparkContext.runJob() method, which triggers the real computation of the RDD. This is the direct embodiment of Spark's lazy evaluation: a job is computed only when its result is actually demanded.
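The laziness described above can be illustrated with plain Scala iterators (an analogy only, not Spark code): "transform"-style calls merely build a new iterator, and nothing is evaluated until a terminal, "action"-like call forces the traversal.

```scala
// Analogy only: Scala iterators are lazy in the same spirit as RDD transforms.
var evaluated = 0
// Like map on an RDD: this builds a new iterator but computes nothing yet.
val doubled = Iterator(1, 2, 3, 4).map { x => evaluated += 1; x * 2 }
assert(evaluated == 0)  // no element has been touched so far
val total = doubled.sum // the terminal call (the "action") triggers the traversal
assert(evaluated == 4 && total == 20)
```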

2. Implementation principle of RDD operators

1. Principle of the map, filter, flatMap and mapPartitions operators

These methods are defined in the RDD.scala class. They are four simple transform methods without a shuffle process; each converts the current RDD into a MapPartitionsRDD, with a narrow dependency between the two RDDs, as the following figure shows:

  Taking the map function as an example, its implementation code is as follows:

def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

As the code shows, the map method takes a parameter f, a function that converts a value of type T into another type U. But instead of invoking the function immediately, the map method creates a new MapPartitionsRDD object, building a new function (context, pid, iter) => iter.map(cleanF) and passing it in as a constructor parameter of the MapPartitionsRDD class. During iterative RDD computation, the compute method of MapPartitionsRDD uses this function to traverse an Iterator[T] and obtain a new Iterator[U]:

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false,
    isOrderSensitive: Boolean = false)
  extends RDD[U](prev) {

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))
}

MapPartitionsRDD overrides the compute method of its parent class RDD. It is invoked during task computation: when Spark submits a job, these RDDs are packaged into tasks (ShuffleMapTask, or ResultTask for the final stage). The task's runTask method calls RDD.iterator(), which in turn calls RDD.compute().
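The chaining of compute calls can be sketched with a miniature model (MiniRDD, SourceRDD and MapPartRDD are made-up names, not Spark's API): each "RDD" holds a function from its parent's partition iterator to its own, and nothing runs until a consumer drains the final iterator.

```scala
// Miniature model of narrow-dependency chaining (hypothetical classes).
abstract class MiniRDD[T] { def compute(split: Int): Iterator[T] }

class SourceRDD[T](data: Seq[Seq[T]]) extends MiniRDD[T] {
  def compute(split: Int): Iterator[T] = data(split).iterator
}

// Like MapPartitionsRDD: compute just applies f to the parent's iterator.
class MapPartRDD[U, T](prev: MiniRDD[T], f: Iterator[T] => Iterator[U]) extends MiniRDD[U] {
  def compute(split: Int): Iterator[U] = f(prev.compute(split))
}

val base   = new SourceRDD(Seq(Seq(1, 2), Seq(3)))
val mapped = new MapPartRDD[Int, Int](base, _.map(_ * 10))
assert(mapped.compute(0).toList == List(10, 20))
```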

2. Principle of combineByKey, reduceByKey and groupByKey

First, these methods are defined in the PairRDDFunctions class, not in the RDD class. Second, only RDDs of key-value pairs of type [K, V] can call them. How does that work? The answer is implicit conversion: Spark implicitly converts such an RDD to a PairRDDFunctions instance and then calls the methods of that class.
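The pattern can be reproduced in miniature (Box and PairBoxFunctions are hypothetical names, not Spark classes): an implicit class adds extra methods that only compile when the element type is a pair (K, V).

```scala
// Hypothetical miniature of the PairRDDFunctions pattern.
class Box[T](val items: Seq[T])

// Only a Box[(K, V)] can be converted, so only pair "collections" get these methods.
implicit class PairBoxFunctions[K, V](self: Box[(K, V)]) {
  def reduceByKeyLocal(func: (V, V) => V): Map[K, V] =
    self.items.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }
}

val pairs = new Box(Seq("a" -> 1, "a" -> 2, "b" -> 3))
// The compiler implicitly converts Box[(String, Int)] to PairBoxFunctions:
assert(pairs.reduceByKeyLocal(_ + _) == Map("a" -> 3, "b" -> 3))
```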

Let's start with the combineByKey method. Its method definition is as follows:

def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
      partitioner, mapSideCombine, serializer)(null)
  }

Functionally, this method converts an RDD of type [K, V] into a new RDD[(K, C)]. There is a shuffle in this process: records with the same key k are written to the same reduce partition and aggregated, and C is the result of aggregating all the values V that share a key. The first three parameters are functions and are discussed below. The partitioner parameter specifies the partitioning rule for keys. mapSideCombine indicates whether map-side aggregation is enabled: if it is true, the createCombiner and mergeValue functions perform partial aggregation on the map side and the mergeCombiners function performs the final aggregation on the reduce side; otherwise, all three functions aggregate on the reduce side.
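The roles of the three functions can be simulated locally with ordinary Scala collections (a sketch of the semantics only, not Spark's implementation): each "map-side partition" builds partial combiners with createCombiner and mergeValue, then the "reduce side" merges partial results with mergeCombiners.

```scala
// Local simulation of combineByKey's three functions (no real shuffle).
def combineLocal[K, V, C](records: Seq[(K, V)],
                          createCombiner: V => C,
                          mergeValue: (C, V) => C): Map[K, C] =
  records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
    acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
  }

def mergeMaps[K, C](a: Map[K, C], b: Map[K, C], mergeCombiners: (C, C) => C): Map[K, C] =
  b.foldLeft(a) { case (acc, (k, c)) =>
    acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
  }

// Two "map-side partitions" aggregate partially, then the "reduce side" merges:
val p0 = combineLocal(Seq("a" -> 1, "a" -> 2), (v: Int) => v, (c: Int, v: Int) => c + v)
val p1 = combineLocal(Seq("a" -> 3, "b" -> 4), (v: Int) => v, (c: Int, v: Int) => c + v)
assert(mergeMaps(p0, p1, (x: Int, y: Int) => x + y) == Map("a" -> 6, "b" -> 4))
```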

Take a look at the core code of the combineByKeyWithClassTag method it calls:

def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }

This method first creates an instance of the Aggregator class. Then it checks whether the partitioner of the current RDD equals the partitioner passed in (they are equal when the partitioner type and the number of partitions are both the same). A precondition for this equality is that the RDD has already been shuffled; otherwise the RDD's partitioner is None. If they are equal, the mapPartitions method of the RDD is called to generate a MapPartitionsRDD (no shuffle is produced, because the data is already partitioned correctly); otherwise a ShuffledRDD is generated.

So when does the shuffle actually happen? The answer is during real task computation: when the DAGScheduler class performs stage segmentation, it determines that the ShuffledRDD and the prevRDD it depends on form a wide dependency (the ShuffledRDD class overrides the getDependencies method), so the job is divided into two stages at this point, as shown in the following figure:

  When the ShuffleMapTask of Stage 0 runs, the result obtained by the rdd.iterator() method is written to a local disk file by calling the ShuffleWriter.write() method; this process is the shuffle write. When the task of Stage 1 runs, it calls the compute() method of ShuffledRDD, which pulls the data it is responsible for from the outputs of all the map tasks of Stage 0 according to the partitioning rule; this process is the shuffle read:

override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
    SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
      .read()
      .asInstanceOf[Iterator[(K, C)]]
  }

reduceByKey and groupByKey both ultimately call the combineByKeyWithClassTag method. The difference lies in the parameters: reduceByKey passes mapSideCombine = true and requires a function func: (V, V) => V describing how to aggregate values with the same key, while groupByKey passes mapSideCombine = false, so it performs no aggregation on the map side. The groupByKey method uses the data structure CompactBuffer to hold the values for each key; it is similar to ArrayBuffer, an append-only data structure backed by an array.
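The way the two operators instantiate the combiner functions can be shown with a local simulation (ordinary collections standing in for the shuffle):

```scala
// Local simulation of how reduceByKey and groupByKey aggregate per key.
val data = Seq("a" -> 1, "b" -> 2, "a" -> 3)

// reduceByKey(_ + _): createCombiner = identity, mergeValue = mergeCombiners = _ + _
val reduced = data.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ + _) }
assert(reduced == Map("a" -> 4, "b" -> 2))

// groupByKey: the combiner wraps values in a buffer and appends; with
// mapSideCombine = false, every value travels through the shuffle unaggregated.
val grouped = data.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
assert(grouped == Map("a" -> Seq(1, 3), "b" -> Seq(2)))
```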

3. coalesce, repartition principle

The coalesce function is used to reduce or increase the number of RDD partitions to modify the parallelism of tasks. Its implementation source code is as follows:

def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

The function has three parameters: the first is the new number of partitions, the second indicates whether to shuffle, and the third is an optional custom PartitionCoalescer.

When the shuffle parameter is false, no shuffle is produced. In this case the numPartitions value must be less than the current number of partitions of the RDD; that is, you can only reduce partitions, not increase them. How does CoalescedRDD merge partitions without a shuffle? I suggest you read the CoalescedRDD class, focusing on the getPartitions and compute methods. When a Spark job has too many small map tasks, this operator can reduce the number of map partitions without producing a shuffle. Note, however, that if the RDD's data is unevenly distributed across partitions, the default PartitionCoalescer of CoalescedRDD can aggravate the skew. For example, suppose an RDD has four partitions and the data is concentrated in the first and second. Calling this operator to shrink to two partitions gives the mapping [0, 1] => [0], [2, 3] => [1], so the first partition of the new RDD has even more data to process. Therefore, if the data itself is unevenly distributed, you should implement a PartitionCoalescer class and pass it in as a parameter, or set shuffle = true.
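The even-grouping rule behind the example above can be sketched as follows (a hypothetical simplification: parent partition i maps to new partition i * target / parentCount with integer division, which is similar in spirit to, but not identical with, the default coalescer):

```scala
// Hypothetical even-grouping rule for merging parent partitions.
def coalesceMapping(parentCount: Int, target: Int): Map[Int, Seq[Int]] =
  (0 until parentCount).groupBy(i => i * target / parentCount)
    .map { case (newPart, parents) => newPart -> parents.toSeq }

// 4 parent partitions shrunk to 2: [0, 1] => 0 and [2, 3] => 1, so any skew
// concentrated in partitions 0 and 1 all lands on new partition 0.
assert(coalesceMapping(4, 2) == Map(0 -> Seq(0, 1), 1 -> Seq(2, 3)))
```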

When the shuffle parameter is true, a shuffle is produced, and three RDD transformations take place: current RDD -> MapPartitionsRDD (generated by the mapPartitionsWithIndexInternal call in the code) -> ShuffledRDD -> CoalescedRDD. Why is there a map step? Look at the distributePartition function: it exists to spread the data evenly during the shuffle.
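A simplified stand-in for distributePartition makes the round-robin assignment visible (here Random(index) replaces Spark's Random(hashing.byteswap32(index)); the idea is the same):

```scala
import scala.util.Random

// Simplified sketch of distributePartition's round-robin key assignment.
def distribute[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
  var position = new Random(index).nextInt(numPartitions) // pseudo-random start per input partition
  items.map { t =>
    position += 1 // consecutive keys, which a HashPartitioner later mods by numPartitions
    (position, t)
  }
}

val out = distribute(0, Iterator("a", "b", "c"), 2).toList
assert(out.map(_._2) == List("a", "b", "c"))                     // values are preserved
assert(out(1)._1 == out(0)._1 + 1 && out(2)._1 == out(1)._1 + 1) // keys spread evenly
```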

The repartition function simply calls the coalesce function with shuffle = true:

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

4. Principle of count operator

The count operator belongs to the action operator, which will trigger the task submission process. The implementation source code of the count method is as follows:

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

It can be seen that this method will call the SparkContext.runJob() method, which has several overloaded methods:

class SparkContext(config: SparkConf) extends Logging {
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val cleanedFunc = clean(func)
    runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
  }

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
    results
  }

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  }
}


Let's first look at the last overloaded method, which needs to pass in four parameters:

  1. The first parameter rdd is the RDD to be submitted, which is easy to understand.
  2. The second parameter is a function, func: (TaskContext, Iterator[T]) => U. It is applied in each partition of the task's final output: applying it to the partition's output data set Iterator[T] yields a result U. The func passed in by the count method is Utils.getIteratorSize _, which simply returns the size of the iterator. func is invoked in the ResultTask.runTask method.
  3. The third parameter is the sequence of partition indices. The first overload above passes 0 until rdd.partitions.length, meaning that by default all partitions participate in the computation.
  4. The fourth parameter is also a function, resultHandler: (Int, U) => Unit, and it is a little more involved. Consider when it is called. Whenever a ResultTask finishes, the statusUpdate method of the TaskSchedulerImpl class is called. If the task succeeded rather than failed, it calls the enqueueSuccessfulTask method of the TaskResultGetter class to fetch the output of each partition's ResultTask (the result U produced by func, described above); a thread pool is used to pull the data asynchronously to the driver. Once fetched, it calls back into the DAGScheduler; after the event is propagated layer by layer, the handleTaskCompletion method of the DAGScheduler class is finally invoked, which in turn calls the taskSucceeded method of the JobWaiter class:
 override def taskSucceeded(index: Int, result: Any): Unit = {
    // resultHandler call must be synchronized in case resultHandler itself is not thread safe.
    synchronized {
      resultHandler(index, result.asInstanceOf[T])
    }
    if (finishedTasks.incrementAndGet() == totalTasks) {
      jobPromise.success(())
    }
  }

From this call site we can see that the first argument passed to the resultHandler function is the task (partition) index, and the second is the task's output. Now let's see how the third runJob overload constructs the resultHandler argument:

val results = new Array[U](partitions.size)
runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)

Obviously, the function passed in receives each task's output and stores it in the results array, which is returned once the inner runJob call completes. With that, the analysis is done, and you should now understand the principle of the count operator.
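Putting it all together, count can be sketched locally as "sum of per-partition iterator sizes" (a local simulation mirroring sc.runJob(this, Utils.getIteratorSize _).sum and the resultHandler that writes each partition's result into the results array by index):

```scala
// Local sketch: count = sum of per-partition iterator sizes.
def getIteratorSize[T](it: Iterator[T]): Long = {
  var size = 0L
  while (it.hasNext) { it.next(); size += 1 }
  size
}

val partitions = Seq(Seq(1, 2, 3), Seq(4, 5), Seq.empty[Int])
val results = new Array[Long](partitions.size)
// Each "task" computes its partition's size; the handler stores it by index.
partitions.zipWithIndex.foreach { case (part, index) =>
  results(index) = getIteratorSize(part.iterator)
}
assert(results.toList == List(3L, 2L, 0L))
assert(results.sum == 5L) // this sum is what count() returns
```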

Tags: Scala Big Data Spark

Posted on Fri, 24 Sep 2021 06:57:34 -0400 by danj