An action operator is an operator that triggers a job. Triggering a job means the data is actually computed.
collect
collect gathers data from the executor side back to the driver side.
For example, a simple word count program:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CollectAction {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local").setAppName("WordCount")
    val sc: SparkContext = new SparkContext(conf)
    val words: RDD[String] = sc.parallelize(
      List("Spark", "Spark", "Flume", "Spark", "Flume", "Hive", "Hive", "Hive", "Spark"), 2)
    val oneWord: RDD[(String, Int)] = words.map((_, 1))
    val reduceByKeyRDD: RDD[(String, Int)] = oneWord.reduceByKey(_ + _)
    // collect brings the result back to the driver as an Array
    reduceByKeyRDD.collect().foreach(println)
    sc.stop()
  }
}
Looking at the source of collect, it simply calls sc.runJob on the RDD; so an action operator is essentially an operator that ends up executing runJob.
collect returns the result in the form of an Array.
Now that we have seen this, let's call runJob ourselves:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CollectAction {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local").setAppName("WordCount")
    val sc: SparkContext = new SparkContext(conf)
    val words: RDD[String] = sc.parallelize(
      List("Spark", "Spark", "Flume", "Spark", "Flume", "Hive", "Hive", "Hive", "Spark"), 2)
    val oneWord: RDD[(String, Int)] = words.map((_, 1))
    val reduceByKeyRDD: RDD[(String, Int)] = oneWord.reduceByKey(_ + _)
    // Turn each partition into an Array on the executor side...
    val func = (iter: Iterator[(String, Int)]) => iter.toArray
    val collected: Array[Array[(String, Int)]] = sc.runJob(reduceByKeyRDD, func)
    // ...then concatenate the per-partition arrays on the driver
    val res: Array[(String, Int)] = Array.concat(collected: _*)
    res.foreach(println)
    sc.stop()
  }
}
This gives the same result.
reduce
reduce aggregates all the elements of the RDD with a binary function, applied iteratively:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReduceAction {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("ReduceAction")
    val sc: SparkContext = new SparkContext(conf)
    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)
    val reduceResult: Int = rdd.reduce(_ + _)
    println(reduceResult) // 10
    sc.stop()
  }
}
The question is, how does it work?
Looking at the source, reduce first aggregates inside each partition (local aggregation), and then aggregates the per-partition results (global aggregation).
The same function we pass in is applied in both the local and the global aggregation.
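As a rough sketch of that two-phase shape (not the actual Spark implementation, which also has to handle empty partitions), we can reproduce reduce ourselves with runJob: reduce each partition locally on the executors, then reduce the per-partition results on the driver:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ManualReduce {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("ManualReduce")
    val sc: SparkContext = new SparkContext(conf)
    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)

    // Local aggregation: reduce each partition to a single value on the executors
    // (a real implementation would also have to deal with empty partitions).
    val partial: Array[Int] = sc.runJob(rdd, (it: Iterator[Int]) => it.reduceLeft(_ + _))

    // Global aggregation: reduce the per-partition results on the driver.
    val result: Int = partial.reduceLeft(_ + _)
    println(result) // 10, same as rdd.reduce(_ + _)

    sc.stop()
  }
}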
aggregate
import org.apache.spark.{SparkConf, SparkContext}

object AggAction {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    // zeroValue = 10, intra-partition function: _ + _, inter-partition function: _ * _
    println(rdd.aggregate(10)(_ + _, _ * _)) // 2210
    sc.stop()
  }
}
Like aggregateByKey, aggregate takes an initial value and lets you define separate intra-partition and inter-partition aggregation functions.
But there is one difference from aggregateByKey: the initial value of aggregateByKey is applied only to the intra-partition aggregation, not to the inter-partition aggregation.
The zeroValue of aggregate, however, is used in both phases.
We can also see this from the result of the program above: the intra-partition sums are 10 + 1 + 2 = 13 and 10 + 3 + 4 = 17, and the inter-partition step multiplies 10 * 13 * 17 = 2210.
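To make the difference in how the initial value is applied more concrete, here is a small side-by-side sketch (using _ + _ for both phases so the arithmetic is easy to follow; the positional split of the data into two partitions is assumed to be the same as above):

import org.apache.spark.{SparkConf, SparkContext}

object ZeroValueComparison {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("ZeroValueComparison")
    val sc = new SparkContext(conf)

    // aggregate: zeroValue (10) is used inside each partition AND again when merging.
    // Partition 0 = (1, 2) -> 10 + 1 + 2 = 13
    // Partition 1 = (3, 4) -> 10 + 3 + 4 = 17
    // Merge:                  10 + 13 + 17 = 40
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    println(rdd.aggregate(10)(_ + _, _ + _)) // 40

    // aggregateByKey: zeroValue (10) is used only inside each partition.
    // Partition 0 = ("a",1), ("a",2) -> 10 + 1 + 2 = 13
    // Partition 1 = ("a",3), ("a",4) -> 10 + 3 + 4 = 17
    // Merge:                            13 + 17 = 30 (no extra 10)
    val pairRdd = sc.makeRDD(List(("a", 1), ("a", 2), ("a", 3), ("a", 4)), 2)
    pairRdd.aggregateByKey(10)(_ + _, _ + _).collect().foreach(println) // (a,30)

    sc.stop()
  }
}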
foreach
Here we deploy Spark in standalone mode with a single master.
Printing inside foreach produces no output in the driver console. That is because foreach is not executed on the driver side; it runs inside the executors.
The printed results can be found in the standard output of the two executors.
So the function passed to foreach is executed in a distributed fashion on the executors.
In the past, we often used rdd.collect().foreach(println), which first collects the data to the driver side and then prints it there.
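For contrast, a minimal sketch of the two ways of printing (meant to be submitted to the standalone cluster with spark-submit; the object name is just for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ForeachAction {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("ForeachAction")
    val sc: SparkContext = new SparkContext(conf)
    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)

    // Runs on the executors: the output goes to each executor's stdout,
    // not to the driver console.
    rdd.foreach(println)

    // Collects the data to the driver first, then prints on the driver console.
    rdd.collect().foreach(println)

    sc.stop()
  }
}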
count
count seems very simple: it just counts the elements:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CountAction {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local").setAppName("CountAction")
    val sc: SparkContext = new SparkContext(conf)
    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)
    println(rdd.count()) // 4
    sc.stop()
  }
}
In the source code, count counts the elements of each partition and then sums the per-partition counts.
If we implement it ourselves, it looks like this:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CountAction {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local").setAppName("CountAction")
    val sc: SparkContext = new SparkContext(conf)
    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)
    // Count the elements of each partition on the executors...
    val longs: Array[Long] = sc.runJob(rdd, (it: Iterator[Int]) => {
      var count = 0L
      while (it.hasNext) {
        count += 1L
        it.next()
      }
      count
    })
    // ...these are the per-partition counts; summing them gives the same value as count()
    println(longs.toBuffer)
    sc.stop()
  }
}
takeOrdered
You can use takeOrdered or top to get the first few elements in order; top is implemented in terms of takeOrdered.
So let's look directly at takeOrdered:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TopNAction {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local").setAppName("WordCount")
    val sc: SparkContext = new SparkContext(conf)
    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 2)
    // Take the two largest elements (descending order)
    val res: Array[Int] = rdd.takeOrdered(2)(Ordering.Int.reverse)
    println(res.toBuffer) // ArrayBuffer(10, 9)
    sc.stop()
  }
}
Within each partition, the elements are added to a BoundedPriorityQueue, a bounded priority queue that keeps only a fixed number of elements and maintains them in order, much like a TreeSet.
Across partitions, the per-partition queues are merged and the merged result is then sorted as a whole.
The question here is: why not simply sort the entire dataset and then take the first few elements?
Because with a large dataset that would overflow memory. So what does Spark do instead? Say we want the top two numbers in descending order. Spark lets each partition produce its own top two: in our case partition 0 yields 4 and 5, and partition 1 yields 9 and 10. These results are then combined into one bounded priority queue (queue1 ++= queue2); since queue1 is bounded to two elements, only 9 and 10 survive, and finally 9 and 10 are sorted in descending order.
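A rough sketch of that per-partition-then-merge shape (not the actual BoundedPriorityQueue-based code; this version simply materializes and sorts each partition for clarity):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ManualTopN {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local").setAppName("ManualTopN")
    val sc: SparkContext = new SparkContext(conf)
    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 2)
    val n = 2

    // Each partition keeps only its own top n elements, so only a small,
    // bounded amount of data per partition ever leaves the executors.
    val perPartitionTop: RDD[Int] =
      rdd.mapPartitions(it => it.toList.sorted(Ordering.Int.reverse).take(n).iterator)

    // Merge the small per-partition results on the driver and keep the global top n.
    val res: Array[Int] = perPartitionTop.collect().sorted(Ordering.Int.reverse).take(n)
    println(res.toBuffer) // ArrayBuffer(10, 9)

    sc.stop()
  }
}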
min
min works the same way as max; under the hood, both use reduce.
It's actually quite elegant to compute the maximum and minimum with reduce; it's a bit like reduce in Scala collections:
val arr = Array(20, 12, 6, 15, 2, 9)
val res: Int = arr.reduceLeft(_ min _)
The only difference is that in Spark the reduce happens in two phases: intra-partition first, then inter-partition.
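As a small illustration (assuming an Int RDD), rdd.min() and rdd.max() give the same results as a reduce that keeps the smaller or larger of each pair:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MinAction {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local").setAppName("MinAction")
    val sc: SparkContext = new SparkContext(conf)
    val rdd: RDD[Int] = sc.makeRDD(List(20, 12, 6, 15, 2, 9), 2)

    // Built-in min / max actions.
    println(rdd.min()) // 2
    println(rdd.max()) // 20

    // The same results expressed directly as a reduce.
    println(rdd.reduce((a, b) => math.min(a, b))) // 2
    println(rdd.reduce((a, b) => math.max(a, b))) // 20

    sc.stop()
  }
}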