Map-related RDD operators

This first part covers several RDD operators related to map.

map

map is essentially a mapping over a data structure: it transforms each element of one structure into an element of another.

A simple Spark program that multiplies each number in a list by 2:

import org.apache.spark.{SparkConf, SparkContext}

object MapOperator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)
    val rdd = context.makeRDD(List(2, 3, 4, 5))

    // map is a transformation; collect is the action that triggers the job
    val mapRdd = rdd.map(_ * 2)
    mapRdd.collect().foreach(println)
    context.stop()
  }
}

Now let's open the source code of map.

First, map is a transformation: it returns a new RDD but does not perform any computation by itself.

Observe the input parameter of map:

f: T => U

This is a function defined on the driver side; in our program it is the lambda _ * 2.

And the return value of map is also an RDD:

RDD[U]

Then we call the withScope function.

Written out in full, the function body would look like this:

  def map[U: ClassTag](f: T => U): RDD[U] ={
    withScope {
      val cleanF = sc.clean(f)
      new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))
    }
  } 

Because the function body consists of the single withScope expression, the outer {} can be omitted, which is how the actual source is written.
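
For reference, the definition in the Spark source (RDD.scala) reads roughly like this, with the outer braces omitted:

  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))
  }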

Take a look at what withScope does:

  /**
   * Execute a block of code in a scope such that all new RDDs created in this body will
   * be part of the same scope. For more detail, see {{org.apache.spark.rdd.RDDOperationScope}}.
   *
   * Note: Return statements are NOT allowed in the given body.
   */
  private[spark] def withScope[U](body: => U): U = RDDOperationScope.withScope[U](sc)(body)

It puts all new RDDs created in the body under the same scope. Why? Because RDDs need to be tracked for DAG visualization.

Pay attention to its parameter: body: => U. This is a by-name parameter (a control abstraction), which lets us pass in a code block.

The last call is RDDOperationScope.withScope[U](sc)(body).

We only supply the body code block; sc is filled in by the RDD's own withScope. The two parameter lists here are currying.

The body we pass in is executed in the middle, while the surrounding setup and cleanup are shared code. This is the loan pattern.
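
As a minimal sketch of this pattern (the names withTiming and block are made up for illustration), a curried function with a by-name parameter can wrap any code block with shared logic:

object LoanPatternDemo {
  // Curried: the first parameter list takes a label, the second takes the code block.
  // `block: => T` is a by-name parameter, so the block only runs where we call it.
  def withTiming[T](label: String)(block: => T): T = {
    val start = System.nanoTime()      // shared setup
    try {
      block                            // the caller's code runs "in the middle"
    } finally {
      val ms = (System.nanoTime() - start) / 1e6
      println(s"$label took $ms ms")   // shared cleanup
    }
  }

  def main(args: Array[String]): Unit = {
    val result = withTiming("sum") {
      (1 to 1000000).sum
    }
    println(result)
  }
}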

Return to the main logic of map:

It first cleans our function with sc.clean and gets the cleaned function back.

The clean step exists because our function may be a closure, and a closure may capture external variables; if those captured references are not serializable, shipping the function to executors would fail, so Spark checks this up front.
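
As a hedged illustration (the class and field names here are made up), a closure that reads a field also captures its enclosing instance, which trips this check if the instance is not serializable:

import org.apache.spark.rdd.RDD

class Holder {                      // note: not Serializable
  val factor = 2

  def double(rdd: RDD[Int]): RDD[Int] = {
    // `_ * factor` really reads this.factor, so the closure captures `this`;
    // the serializability check then fails with "Task not serializable"
    rdd.map(_ * factor)
  }

  def doubleSafely(rdd: RDD[Int]): RDD[Int] = {
    val localFactor = factor        // copy into a local val: only the Int is captured
    rdd.map(_ * localFactor)
  }
}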

The cleaned function is then wrapped into a new MapPartitionsRDD, in a place that looks odd at first glance.

The MapPartitionsRDD primary constructor takes two main parameters: the parent RDD, and a function of three inputs that returns an iterator. The three inputs are the TaskContext, the partition index, and the partition's iterator.
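
Its definition in the Spark source looks roughly like this (some optional parameters omitted):

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev)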

The RDD passed in here is this; in our program, this is the RDD produced by makeRDD.

Whichever RDD calls map is this.

Our function is then passed as an argument to the map method of the iterator.

According to the constructor parameters of MapPartitionsRDD, iter.map(cleanF) must return an iterator.

So how is it finally calculated?

Finally, the compute method of MapPartitionsRDD is called to apply f.
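
That compute method reads roughly as follows in the Spark source; it pulls the parent partition's iterator and feeds it to the three-argument function f built above:

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))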

So when we call the map function like this:

    val f: (Int => Int) = _ * 2
    val mapRdd = rdd.map(f)

it will eventually be evaluated as

firstParent[T].iterator(split, context).map(f)
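
Note that the map at the end is just the map of a plain Scala Iterator, which is lazy. A minimal sketch outside of Spark:

object IteratorMapDemo {
  def main(args: Array[String]): Unit = {
    val it = Iterator(2, 3, 4, 5)
    // Iterator.map only wraps the iterator; elements are transformed
    // one by one as the result is consumed
    val mapped = it.map(_ * 2)
    mapped.foreach(println)   // prints 4, 6, 8, 10
  }
}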

That's all for now.

mapPartitions

Let's look at the following example:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MapOperator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)
    val rdd: RDD[Int] = context.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)

    // the println runs once per element
    val mapRdd = rdd.map(
      x => {
        println("====================")
        x * 2
      }
    )

    mapRdd.collect().foreach(println)
    context.stop()
  }
}

result:

====================
====================
====================
====================
====================
====================
====================
====================
====================
2
4
6
8
10
12
14
16
18

With 9 numbers in 3 partitions, the mapped function is applied 9 times, once per element.

What if you use mapPartitions?

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MapPartition {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)
    val rdd: RDD[Int] = context.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)

    // the println runs once per partition; the function receives the whole iterator
    val mapPartitionRdd = rdd.mapPartitions(
      iterator => {
        println("====================")
        iterator.map(_ * 2)
      }
    )

    mapPartitionRdd.collect().foreach(println)
    context.stop()
  }
}

result:

====================
====================
====================
2
4
6
8
10
12
14
16
18

Here the computation happens per partition, so the function is applied only three times, once per partition.

We enter the source code of mapPartitions:

It is similar to map, except that it has an extra Boolean parameter, preservesPartitioning.

The default is false; if it is true, the parent RDD's partitioner is retained.
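
For reference, its definition in the Spark source is roughly:

  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (_: TaskContext, _: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }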

mapPartitionsWithIndex

Actually, even inside mapPartitions we can find out which partition the data belongs to:

import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MapPartition {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)
    val rdd: RDD[Int] = context.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)

    // read the current partition id from the TaskContext inside the function
    val mapPartitionRdd = rdd.mapPartitions(
      iterator => {
        val partitionId: Int = TaskContext.getPartitionId()
        iterator.map(x => {
          (partitionId, x * 2)
        })
      }
    )

    mapPartitionRdd.collect().foreach(println)
    context.stop()
  }
}

But using TaskContext can be cumbersome.

This is why mapPartitionsWithIndex exists.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionWithIndex {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)
    val rdd: RDD[Int] = context.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)

    // the function receives the partition index together with the iterator
    val mapRDDParWithIndex = rdd.mapPartitionsWithIndex(
      (index, it) => {
        it.map(num => (index, num))
      }
    )
    mapRDDParWithIndex.collect().foreach(println)
    context.stop()
  }
}

This way we get the partition index directly.

So where does the partition index come from?

Inside MapPartitionsRDD.compute, split.index is the partition index, and it is passed straight into our function.
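
The definition of mapPartitionsWithIndex in the Spark source is roughly the following; the index argument of the three-parameter function, which compute fills with split.index, is forwarded to our function:

  def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (_: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
      preservesPartitioning)
  }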
