The xxxByKey family of RDD operators

Spark has a whole family of xxxByKey operators. Let's take a look at them.

groupByKey

Explanation

Suppose we want to group a list of strings:

import java.util.concurrent.TimeUnit

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GroupByKeyOperator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)

    val rdd: RDD[String] = context.makeRDD(List(
      "spark", "scala", "hive", "flink",
      "kafka", "kafka", "hbase", "flume",
      "sqoop", "hadoop", "kafka", "spark",
      "flink", "kafka", "kafka", "hbase"
    ), 4)

    val mapRDD: RDD[(String, Int)] = rdd.map((_, 1))

    val groupByRDD: RDD[(String, Iterable[Int])] = mapRDD.groupByKey()

    groupByRDD.saveAsTextFile("groupby_out")

    // keep the application alive so we can inspect the DAG in the Spark UI
    TimeUnit.MINUTES.sleep(50)

    context.stop()
  }
}

What happens when we run this? To be able to inspect the DAG diagram in the Spark UI, I made the program sleep before stopping the context.

groupByKey involves a shuffle operation; we will explain shuffle in detail later.

For now, we only need to know that a ShuffledRDD appears in Stage 1:

Enter the source code of groupByKey:

  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }

The first step is to get the default partitioner.

Assume we have not supplied a partitioner of our own, so a HashPartitioner will be used.

Further assume that the value of

spark.default.parallelism

is not set either.

In that case, the partitioner takes the largest number of partitions among the input RDDs and builds a HashPartitioner from that value. Why can there be several RDDs? Because operators such as join take more than one RDD.
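The decision logic looks roughly like the sketch below. This is only a simplified paraphrase of defaultPartitioner, not the exact source: it ignores the branch where an upstream RDD already carries a usable partitioner, and newer Spark versions add further checks.

def defaultPartitionerSketch(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = Seq(rdd) ++ others
  if (rdd.context.getConf.contains("spark.default.parallelism")) {
    // spark.default.parallelism is set: use it as the number of partitions
    new HashPartitioner(rdd.context.defaultParallelism)
  } else {
    // otherwise: use the largest partition count among the input RDDs
    new HashPartitioner(rdds.map(_.partitions.length).max)
  }
}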

The so-called partitioning is just the key's hashCode modulo the number of partitions, adjusted so that the result is non-negative.
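A minimal sketch of that calculation (the real HashPartitioner also sends null keys to partition 0 and delegates to a utility method for the non-negative modulo):

def getPartitionSketch(key: Any, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  rawMod + (if (rawMod < 0) numPartitions else 0) // make the result non-negative
}

getPartitionSketch("hive", 4) // 0: "hive" goes to partition 0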

After obtaining the partitioner, we enter the overload groupByKey(partitioner):

It defines three functions here:

val createCombiner = (v: V) => CompactBuffer(v)

First, put the first value of each group into a CompactBuffer.

val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v

Then append the remaining values of the group to that CompactBuffer.

val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2

Finally, merge multiple CompactBuffers (coming from different map tasks) into one.

Entering combineByKeyWithClassTag, we see:

It wraps the three functions in an Aggregator and hands the Aggregator to a ShuffledRDD.
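Paraphrasing the relevant part of combineByKeyWithClassTag (closure cleaning, serializer setup, and the branch that skips the shuffle when the partitioner already matches are omitted, and details vary by Spark version):

// the three functions are bundled into an Aggregator ...
val aggregator = new Aggregator[K, V, C](createCombiner, mergeValue, mergeCombiners)

// ... which is handed to the ShuffledRDD that performs the actual shuffle
new ShuffledRDD[K, V, C](self, partitioner)
  .setAggregator(aggregator)
  .setMapSideCombine(mapSideCombine)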

Let's take a look at the results of the program:

Why are there four output partitions? Because there is only one upstream RDD here, and we explicitly created it with four partitions.

Why do hive and flink land in partition 0? That is the result of HashPartitioner's getPartition calculation.

Why is every group of 1s wrapped in a CompactBuffer? Because groupByKey uses CompactBuffer to hold the grouped values.

About this CompactBuffer: we cannot use it in our own code because it is package-private to Spark. Its documentation notes that it behaves much like ArrayBuffer, so we will substitute ArrayBuffer for CompactBuffer.

Implementation

Now that we have seen how it all works, let's imitate groupByKey ourselves.

In the end, what we need to produce is a ShuffledRDD, so we simply supply whatever its constructor asks for.

The primary constructor needs the previous RDD and a partitioner, which is easy enough.

So what are its three type parameters?

Look at how Spark itself uses them.

The three type parameters are the same as the Aggregator's three type parameters.

So we start by constructing an Aggregator, which means preparing its three functions first.

Those three functions can be lifted straight from groupByKey:

From the types of those three functions we know that, in our implementation, V is Int and C is ArrayBuffer[Int] (ArrayBuffer standing in for CompactBuffer).

So what is K?

ShuffledRDD makes it clear that K is the key type, which for us is String.

So the final code is:

import org.apache.spark.rdd.{RDD, ShuffledRDD}
import org.apache.spark.{Aggregator, HashPartitioner, SparkConf, SparkContext}

import scala.collection.mutable.ArrayBuffer

object GroupByKeyOperator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)

    val rdd: RDD[String] = context.makeRDD(List(
      "spark", "scala", "hive", "flink",
      "kafka", "kafka", "hbase", "flume",
      "sqoop", "hadoop", "kafka", "spark",
      "flink", "kafka", "kafka", "hbase"
    ), 4)

    val mapRDD: RDD[(String, Int)] = rdd.map((_, 1))

    // the same three functions as groupByKey, with ArrayBuffer standing in for CompactBuffer
    val createCombiner = (v: Int) => ArrayBuffer[Int](v)
    val mergeValue = (buf: ArrayBuffer[Int], v: Int) => buf += v
    val mergeCombiners = (c1: ArrayBuffer[Int], c2: ArrayBuffer[Int]) => c1 ++= c2

    val aggregator: Aggregator[String, Int, ArrayBuffer[Int]] =
      new Aggregator[String, Int, ArrayBuffer[Int]](createCombiner, mergeValue, mergeCombiners)

    val shuffledRDD: ShuffledRDD[String, Int, ArrayBuffer[Int]] =
      new ShuffledRDD[String, Int, ArrayBuffer[Int]](mapRDD, new HashPartitioner(rdd.partitions.length))

    shuffledRDD.setMapSideCombine(false)
    shuffledRDD.setAggregator(aggregator)

    shuffledRDD.saveAsTextFile("shuffledRDD_out")

    context.stop()
  }
}

As for why we call

shuffledRDD.setMapSideCombine(false)

the reasoning is that a map-side combine is pointless for grouping: it would not reduce the amount of data shuffled, yet every map-side record would be inserted into a large hash table, putting a lot of pressure on the old generation of the JVM heap.

The last thing to note about groupByKey is that an implicit conversion is involved.

groupByKey is called on an RDD, but RDD itself has no such method; it is defined in PairRDDFunctions.

Therefore, the RDD must first be wrapped into a PairRDDFunctions:
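The conversion lives in the RDD companion object and looks roughly like this (paraphrased; signature details may differ slightly between versions):

implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
  // any RDD of pairs is silently wrapped, which is where groupByKey, reduceByKey, etc. come from
  new PairRDDFunctions(rdd)
}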

groupBy

Having talked about groupByKey, let's also talk about groupBy along the way.

Take a simple example:

import org.apache.spark.{SparkConf, SparkContext}

object GroupByOperator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)

    val rdd = context.makeRDD(List("Hello", "Spark", "Hi", "Scala"))

    // group by the first character of each string
    val f = (str: String) => str.charAt(0)

    val groupByRdd = rdd.groupBy(f)

    groupByRdd.collect().foreach(println)
    context.stop()
  }
}

Result:

(S,CompactBuffer(Spark, Scala))
(H,CompactBuffer(Hello, Hi))

In other words, groupBy is more flexible: groupByKey can only group by the first element of a two-element tuple, while groupBy groups by whatever the supplied function returns.

Clicking into the source code, we see:
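The relevant lines are roughly the following (paraphrased from RDD.groupBy):

def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
    : RDD[(K, Iterable[T])] = withScope {
  val cleanF = sc.clean(f)
  // pair every element with its computed key, then fall back on groupByKey
  this.map(t => (cleanF(t), t)).groupByKey(p)
}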

It turns each element into a two-element tuple and calls groupByKey, so our test code is essentially:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GroupByOperator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)

    val rdd = context.makeRDD(List("Hello", "Spark", "Hi", "Scala"))

    val f = (str: String) => str.charAt(0)

    //    val groupByRdd = rdd.groupBy(f)

    // equivalent to groupBy(f): pair each element with its key, then groupByKey
    val groupByRdd: RDD[(Char, Iterable[String])] = rdd.map(t => (f(t), t)).groupByKey()

    groupByRdd.collect().foreach(println)
    context.stop()
  }
}

reduceByKey

We just looked at the PairRDDFunctions class:

It contains a lot of xxxByKey operations. Let's take another look at reduceByKey.

For example:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeyOperator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)
    val dataRDD = context.makeRDD(List(("a", 1), ("b", 2), ("c", 3), ("a", 4), ("c", 5)))

    // aggregation function applied both within and across partitions
    val func = (v: Int, w: Int) => v + w

    val reduceRDD: RDD[(String, Int)] = dataRDD.reduceByKey(func)

    reduceRDD.collect().foreach(println)

    context.stop()
  }
}

Result:

(a,5)
(b,2)
(c,8)

So compared with groupByKey, reduceByKey simply takes an additional aggregation function.

The underlying call is also combineByKeyWithClassTag.

The first of the three functions it passes in simply takes the first value as-is, as a placeholder (it does nothing to the value).

The func we wrote is used both for the local (map-side) aggregation and for the global (reduce-side) aggregation.

At the bottom, of course, is again a ShuffledRDD.
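For reference, the body is roughly the following (paraphrased from PairRDDFunctions; exact wording varies by version):

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // createCombiner just passes the first value through; func serves as both
  // mergeValue (local aggregation) and mergeCombiners (global aggregation)
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}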

distinct

distinct removes duplicate elements.

For example:

import org.apache.spark.{SparkConf, SparkContext}

object DistinctOperator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)
    val rdd = context.makeRDD(List(1, 2, 3, 4, 1, 2, 3, 4, 5))

    rdd.distinct().collect().foreach(println)
    context.stop()
  }
}

Why talk about distinct under reduceByKey?

Let's click into the source code of distinct:
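The classic implementation is roughly the following (paraphrased; recent Spark versions add an optimized path when the RDD already has a suitable partitioner):

def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  // pair each element with null, keep one value per key, then drop the null
  map(x => (x, null)).reduceByKey((x, _) => x, numPartitions).map(_._1)
}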

Rewriting our program along the same lines gives roughly:

import org.apache.spark.{SparkConf, SparkContext}

object DistinctOperator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)
    val rdd = context.makeRDD(List(1, 2, 3, 4, 1, 2, 3, 4, 5))

    //    rdd.distinct().collect().foreach(println)

    // what distinct does internally: pair with null, keep one value per key, drop the null
    rdd.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1).foreach(println)
    context.stop()
  }
}

What it reuses here is reduceByKey.

aggregateByKey

The use of aggregateByKey is more flexible.

First, it takes an initial (zero) value, just like foldByKey, and then an intra-partition function and an inter-partition function.

We need to know where the initial value is used:

You can see that the initial value only participates in the intra-partition function.
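A paraphrase of aggregateByKey makes this visible (the real code serializes zeroValue so that every key gets a fresh copy; that detail is omitted here):

def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)
    (seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  val createZero = () => zeroValue
  combineByKeyWithClassTag[U](
    (v: V) => seqOp(createZero(), v), // the zero value only ever meets seqOp, inside a partition
    seqOp, combOp, partitioner)
}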

For instance:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object AggregateByKeyOperator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")
    val context = new SparkContext(conf)
    val rdd: RDD[(String, Int)] = context.makeRDD(List(
      ("a", 1), ("a", 2), ("c", 3), ("b", 4), ("c", 5), ("c", 6)), 2)

    // intra-partition: take the maximum of the accumulator and the current value
    val seqOp = (x: Int, y: Int) => math.max(x, y)

    // inter-partition: add the per-partition results of the same key
    val combOp = (x: Int, y: Int) => x + y

    val aggRDD: RDD[(String, Int)] = rdd.aggregateByKey(10)(seqOp, combOp)

    aggRDD.saveAsTextFile("agg-out")

    context.stop()
  }
}

Let's look at the running results first:

How does aggregateByKey arrive at this?

The rdd we built has two partitions, so the first partition is:

("a", 1), ("a", 2), ("c", 3)

The second partition is:

("b", 4), ("c", 5), ("c", 6)

Within a partition, the values of the same key are compared and the maximum is kept, and the initial value also takes part in this comparison. So in the first partition, for key a we effectively compare

("a", 1), ("a", 2), ("a", 10)

and the result is

("a", 10)

For key c we compare

("c", 3), ("c", 10)

and the result is

("c", 10)

As for the second partition, the following will remain:

("b", 10), ("c", 10)

The global aggregation then adds up the values of the same key across partitions.

So we end up with:

("a", 10), ("c", 10+10), ("b", 10)

combineByKey

combineByKey is a little different: the first function passed in can change the value's data structure.

Here is a requirement:

compute the average of the values for each key.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CombineByKeyOperator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")

    val context = new SparkContext(conf)
    val rdd = context.makeRDD(List(
      ("a", 1), ("a", 2), ("b", 3),
      ("b", 4), ("b", 5), ("a", 6)
    ), 2)

    // the first value of a key becomes (value, count = 1)
    val createCombiner = (x: Int) => (x, 1)

    // fold another value of the same key into the (sum, count) pair
    val mergeValue = (t: (Int, Int), v: Int) => (t._1 + v, t._2 + 1)

    // merge (sum, count) pairs coming from different partitions
    val mergeCombiners = (t1: (Int, Int), t2: (Int, Int)) => (t1._1 + t2._1, t1._2 + t2._2)

    // get the average value of the same key
    val aggRdd: RDD[(String, (Int, Int))] = rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)

    val result: RDD[(String, Int)] = aggRdd.mapValues {
      case (sum, count) => sum / count // integer division; use sum.toDouble for a fractional average
    }
    result.collect().foreach(println)

    context.stop()
  }
}

Let me explain how the three functions we pass in do their work.

Because the data we created is split into two partitions,

("a", 1), ("a", 2), ("b", 3)

form one partition and

("b", 4), ("b", 5), ("a", 6)

form the other.

Look at the first element,

("a", 1)

Has the key a appeared before? Obviously not, so createCombiner is applied.

("a", 1) then becomes a ==> (1, 1), meaning the value now associated with key a is the pair (1, 1).

Now look at the second element, ("a", 2).

Has a appeared before? Yes, it has, so the second function, mergeValue, is applied. What does that do?

((1, 1), 2) => (1 + 2, 1 + 1)

In the resulting (3, 2), 3 is the accumulated sum of the values and 2 is the number of times a has occurred.

Now consider key a in the second partition. Since there is only one a there, the final result is:

a ==> (6, 1)

That completes the intra-partition computation.

Aggregation across partitions requires the mergeCombiners function; that is, (3, 2) and (6, 1) are passed to it (we are still following the key a).

So finally:

((3, 2), (6, 1)) => (3 + 6, 2 + 1)

In this way we obtain the sum of the values of all tuples with key a together with the number of occurrences, and the final division gives the average.

With that, the working principle of combineByKey is clear.
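As a final sanity check, the same mechanics can be replayed on plain Scala collections using the three functions defined above (repeated here so the snippet stands alone; this is only an illustration, not Spark's actual execution path):

val createCombiner = (x: Int) => (x, 1)
val mergeValue = (t: (Int, Int), v: Int) => (t._1 + v, t._2 + 1)
val mergeCombiners = (t1: (Int, Int), t2: (Int, Int)) => (t1._1 + t2._1, t1._2 + t2._2)

val parts = Seq(
  Seq(("a", 1), ("a", 2), ("b", 3)),
  Seq(("b", 4), ("b", 5), ("a", 6)))

// within each partition: createCombiner the first time a key is seen, mergeValue afterwards
val combined = parts.map { part =>
  part.foldLeft(Map.empty[String, (Int, Int)]) { case (acc, (k, v)) =>
    acc.get(k) match {
      case None    => acc + (k -> createCombiner(v))
      case Some(c) => acc + (k -> mergeValue(c, v))
    }
  }
}
// combined == Seq(Map("a" -> (3, 2), "b" -> (3, 1)), Map("b" -> (9, 2), "a" -> (6, 1)))

// across partitions: mergeCombiners, then sum / count gives the average
val averages = combined.flatten.groupBy(_._1).map { case (k, cs) =>
  val (sum, count) = cs.map(_._2).reduce(mergeCombiners)
  k -> sum / count
}
// averages == Map("a" -> 3, "b" -> 4)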
