Notes on Common Spark RDD Operators

Hello everyone! Here are the Spark operator notes I took during the epidemic holiday. I spent the whole afternoon sorting them out to share with you. It wasn't easy to write up, so if it helps you, remember to leave a like!


1. Spark action operators

1. reduce(f: (T, T) => T): aggregates all elements in the RDD through the function f, first aggregating the data within each partition, then aggregating across partitions.

val list1: RDD[Int] = sc.makeRDD(1 to 10)
val reduceRDD: Int = list1.reduce(_+_)
println(reduceRDD)  //55

2.collect(): returns all elements of the dataset to the driver as an array.

val list1: RDD[Int] = sc.parallelize(List(1,2,3,4,5))
list1.collect().foreach(println)
Return value: Array(1,2,3,4,5)

3.count(): returns the number of elements in RDD.

val list1: RDD[Int] = sc.parallelize(List(1,2,3,4,5))
val countRDD: Long = list1.count()
println(countRDD)     // 5

4.first(): returns the first element in RDD.

val list1: RDD[Int] = sc.parallelize(List(1,2,3,4,5))
val firstRDD: Int = list1.first()
println(firstRDD)     // 1

5.take(n: Int): returns an array of the first n elements of RDD.

val list1: RDD[Int] = sc.makeRDD(List(7,2,5,6,4,3))
val takeRDD: Array[Int] = list1.take(3)
takeRDD.foreach(println)   //7 2 5

6.takeOrdered(n: Int): returns an array of the first n elements after the RDD is sorted.

val list1: RDD[Int] = sc.makeRDD(List(7,2,5,6,4,3))
val takeOrderedRDD: Array[Int] = list1.takeOrdered(3)
takeOrderedRDD.foreach(println)   //2 3 4

7. aggregate(zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): computes within each partition first, then combines the per-partition results.
Note: with aggregateByKey, the initial value is applied within each partition but NOT when combining across partitions!
With aggregate, the initial value is applied both within each partition AND when combining across partitions!

val list1: RDD[Int] = sc.makeRDD(1 to 10,2)
val aggregateRDD: Int = list1.aggregate(0)(_+_,_+_)
println(aggregateRDD)   //55
val aggregateRDD1: Int = list1.aggregate(10)(_+_,_+_)
println(aggregateRDD1)    // 85: the initial value 10 is added once per partition (2 partitions) and once more when combining across partitions

8.fold(): a simplified version of aggregate. Use fold() when the function used within partitions and the function used across partitions are the same.

val list1: RDD[Int] = sc.makeRDD(1 to 10,2)
val foldRDD: Int = list1.fold(0)(_+_)
println(foldRDD)   //55

9.saveAsTextFile(path)
saveAsSequenceFile(path)
saveAsObjectFile(path)


These operators save the dataset's elements to HDFS (or another supported file system) as a plain text file, a Hadoop SequenceFile, or serialized objects, respectively; the only difference between them is the output format.

val list1: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("b", 2), ("c", 3)))
list1.saveAsTextFile("test1")
list1.saveAsSequenceFile("test2")
list1.saveAsObjectFile("test3")

foreach(): I usually use foreach() in Scala to traverse collections, and Spark RDDs have their own foreach().
Difference: foreach() on a Scala collection runs in the Driver, while foreach() on an RDD is shipped to and executed on the Executors.
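To make the difference concrete, here is a minimal sketch, assuming the same local SparkContext `sc` used in the other snippets:

```scala
// Sketch of where each foreach runs; assumes an existing SparkContext `sc`.
val rdd = sc.makeRDD(1 to 5)

// RDD.foreach: the function is serialized and runs on the Executors.
// On a real cluster the println output lands in the executor logs,
// not on the driver console, and in no guaranteed order.
rdd.foreach(println)

// collect() first brings all elements back to the Driver; the foreach
// that follows is plain Scala-collection foreach, so it prints on the
// driver console in order.
rdd.collect().foreach(println)
```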

2. Spark single-value operators

1. map(f: A => B): returns a new RDD; each element is transformed according to the logic of the supplied function.

val test: RDD[Int] = sc.makeRDD(1 to 10)
val mapRDD: RDD[Int] = test.map(_*2)      //map was called 10 times
mapRDD.collect().foreach(println)
Return value: Array(2,4,6,8,10,12,14,16,18,20)

2. mapPartitions(f: Iterator[T] => Iterator[U]): similar to map, but processes each partition's data independently.

mapPartitions is called once per partition. With N elements and M partitions, the function passed to map is called N times, while the function passed to mapPartitions is called only M times, each call handling a whole partition at once.

val test: RDD[Int] = sc.makeRDD(1 to 10,2)
//map was called 10 times, mapPartitions was called 2 times
val mapPartitionsRDD: RDD[Int] = test.mapPartitions(_.map(_*2)) 
mapPartitionsRDD.collect().foreach(println)
Return value: Array(2,4,6,8,10,12,14,16,18,20)

The difference between map() and mapPartition():

  • map(): processes one element at a time.
  • mapPartitions(): processes one whole partition at a time. The drawback is that the partition's data can only be released after the entire partition has been processed, which may cause an OOM (out of memory) error. When memory is ample, mapPartitions() is recommended for better efficiency.
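A typical reason to prefer mapPartitions is per-partition setup cost. The sketch below assumes a hypothetical openConnection() helper (not a real API) standing in for any expensive resource such as a database connection:

```scala
// Hedged sketch: openConnection()/conn.write() are hypothetical stand-ins
// for an expensive per-partition resource such as a database connection.
val test = sc.makeRDD(1 to 10, 2)
val doubled = test.mapPartitions { iter =>
  val conn = openConnection()   // runs once per partition (2 times here)
  iter.map { x =>
    conn.write(x)               // runs once per element (10 times in total)
    x * 2
  }
}
```

Note that the iterator returned by mapPartitions is lazy: if the resource must be closed, close it only after the iterator is fully consumed (or materialize the partition first).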

3. mapPartitionsWithIndex(f: (Int, Iterator[T]) => Iterator[U]): like mapPartitions, but the function also receives the partition index.

val test: RDD[Int] = sc.makeRDD(1 to 8,2)
val mapPartitionsIndex: RDD[(Int, String)] = test.mapPartitionsWithIndex {
  case (num, datas) => {
    datas.map(elem => (elem, "partition " + num))
  }
}
mapPartitionsIndex.collect().foreach(println)
Return value: Array((1,partition 0),(2,partition 0),(3,partition 0),(4,partition 0),(5,partition 1),(6,partition 1),(7,partition 1),(8,partition 1))

4. flatMap(f: T => TraversableOnce[U]): flattening map. Similar to map, but the supplied function must return a collection rather than a single element; the returned collections are flattened into the result.

val test2: RDD[List[Int]] = sc.makeRDD(Array(List(1,2,3),List(4,5,6)))  
val flatmapRDD: RDD[Int] = test2.flatMap(datas => datas)   //Receive a collection, return a collection
flatmapRDD.collect().foreach(println)
Return value: Array(1,2,3,4,5,6)

5.glom(): form an array for each partition, and form a new RDD type.

val test: RDD[Int] = sc.makeRDD(1 to 10,2)
val glomRDD: RDD[Array[Int]] = test.glom()
//Take the maximum value of each partition    
glomRDD.collect().foreach( x => {
   println(x.max)
})
Return value: Array(5,10)

6. groupBy(f: T => K): groups the data by the return value of the supplied function. Each group is a (K, V) tuple where K is the grouping key (the function's return value) and V is an Iterable of the elements in that group.

val list: RDD[Int] = sc.makeRDD(List(1,2,3,4))
val groupbyRDD: RDD[(Int, Iterable[Int])] = list.groupBy( x => x%2)  //Group by integral multiple of 2
groupbyRDD.collect().foreach(println)
Return value: Array((0,CompactBuffer(2, 4)),(1,CompactBuffer(1, 3)))

7. filter(f: T => Boolean): filters the data, keeping elements for which the function returns true and discarding the rest.

val list: RDD[Int] = sc.makeRDD(List(1,2,3,4))
val filterRDD: RDD[Int] = list.filter( x => x%2==0)   //Filter by integral multiple of 2
filterRDD.collect().foreach(println)
Return value: Array(2,4)

8. sample(withReplacement, fraction, seed): randomly samples the data.
withReplacement: whether sampled elements are put back, true or false.
fraction: the expected fraction of elements to sample, between 0 and 1.
seed: the seed for the random number generator.


val list: RDD[Int] = sc.makeRDD(1 to 10)
val sampleRDD: RDD[Int] = list.sample(false,0.5,1)    
sampleRDD.collect().foreach(println)
Return value: Array(1,5,6,7,8,10)

9.distinct([numPartitions]: Int): deduplicates the data and distributes the result across numPartitions partitions (default: the current number of partitions).

val list: RDD[Int] = sc.makeRDD(List(1,2,2,3,4,3,3,2,9,10),3)    //3 partitions specified
val distinctRDD: RDD[Int] = list.distinct(2)
distinctRDD.collect().foreach(println)
Return value: Array(4,10,2,1,3,9)

10.coalesce(numPartitions: Int, shuffle: Boolean = false): reduces the number of partitions. Useful after filtering a large dataset, so that the smaller result executes more efficiently.

numPartitions: the target number of partitions
shuffle: whether to shuffle the data when repartitioning; defaults to false. Setting it to true redistributes the data across the new partitions.

val list: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6),3)
val coalesceRDD: RDD[Int] = list.coalesce(2)
println(coalesceRDD.partitions.size)   //Return value is 2

11.repartition(numPartitions: Int): repartitions the data into numPartitions partitions, reshuffling all of it over the network.

val list: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6),3)
val repartitionRDD: RDD[Int] = list.repartition(2)
println(repartitionRDD.partitions.size)    //Return value is 2

The difference between coalesce and repartition:

  • coalesce: repartitions, letting you choose whether to perform a shuffle via the parameter shuffle: Boolean = false/true.
  • repartition: actually calls coalesce with shuffle = true by default. The source is as follows:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

12. sortBy(f: (T) => K, ascending: Boolean = true): sorts the data according to the return value of the supplied function.

f: a higher-order function providing the sort key
ascending: true (the default) for ascending order, false for descending

val list: RDD[Int] = sc.makeRDD(List(2,3,5,1,6,7))
val sortByRDD: RDD[Int] = list.sortBy( x => x)
sortByRDD.collect().foreach(println)
Return value: Array(1,2,3,5,6,7)

3. Spark double-value operators

1.union(): returns a new RDD after the union of the source RDD and the parameter RDD.

val list1: RDD[Int] = sc.makeRDD(List(1,2,3,4,5))
val list2: RDD[Int] = sc.makeRDD(List(6,7,8,9,10))
val unionRDD: RDD[Int] = list1.union(list2)
unionRDD.collect().foreach(println)
Return value: Array(1,2,3,4,5,6,7,8,9,10)

2.subtract(): takes the difference set: removes from the source RDD the elements that also appear in the parameter RDD, keeping the rest.

val list1: RDD[Int] = sc.makeRDD(List(1,2,3,4,5))
val list2: RDD[Int] = sc.makeRDD(List(4,5,6,7,8))
val subtractRDD: RDD[Int] = list1.subtract(list2)
subtractRDD.collect().foreach(println)  
Return value: Array(1,2,3)	   // 4 and 5 appear in both RDDs, so only 1, 2, 3 remain

3.intersection(): takes the intersection: keeps only the elements that appear in both RDDs.

val list1: RDD[Int] = sc.makeRDD(List(1,2,3,4,5))
val list2: RDD[Int] = sc.makeRDD(List(4,5,6,7,8))
val intersectionRDD: RDD[Int] = list1.intersection(list2)
intersectionRDD.collect().foreach(println)
Return value: Array(4,5)

4.zip(): zips two RDDs together into (K, V) pairs. The two RDDs must have the same number of partitions and the same number of elements in each partition, otherwise an error is thrown.

val list1: RDD[Int] = sc.parallelize(Array(1,2,3))
val list2: RDD[String] = sc.parallelize(Array("a","b","c"))
val zipRDD: RDD[(Int, String)] = list1.zip(list2)
zipRDD.collect().foreach(println)
Return value: Array((1,a),(2,b),(3,c))

4. Spark key-value operators

1.partitionBy(partitioner: Partitioner): repartitions a pair RDD by key. The default rule is hash partitioning (the key's hashCode modulo the number of partitions). You can pass a new HashPartitioner(numPartitions: Int) directly; see the source in PairRDDFunctions.scala for details. You can also write a custom partitioner.

val list1: RDD[(String, Int)] = sc.makeRDD(List(("aaa",1),("bbb",2),("bbb",3)))
val partitionByRDD: RDD[(String, Int)] = list1.partitionBy(new HashPartitioner(2))  //2 divisions
println(partitionByRDD.partitions.size)    //2 partitions, with indices 0 and 1
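Beyond HashPartitioner, you can subclass Partitioner yourself. A minimal sketch follows; the routing rule (keys whose string form starts with "a" go to partition 0) is made up for illustration:

```scala
import org.apache.spark.Partitioner

// Custom partitioner sketch: keys starting with "a" go to partition 0,
// everything else to partition 1.
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key.toString.startsWith("a")) 0 else 1
}

val pairs = sc.makeRDD(List(("aaa", 1), ("bbb", 2), ("abc", 3)))
val customRDD = pairs.partitionBy(new FirstLetterPartitioner)
println(customRDD.partitions.size)   // 2
```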

2.groupByKey(): groups by key, putting all values with the same key into one collection.

val list1: RDD[String] = sc.makeRDD(List("A","A","B","A","C","B","C"))
val groupByKeyRDD: RDD[(String, Iterable[Int])] = list1.map( x => (x,1)).groupByKey()
groupByKeyRDD.collect().foreach(println)
Return value: Array((A,CompactBuffer(1, 1, 1)),(B,CompactBuffer(1, 1)),(C,CompactBuffer(1, 1)))

3.reduceByKey(func, [numTasks]): aggregates the values of the same key using func. The number of reduce tasks can be set through the second, optional parameter.

  • func: the aggregation function applied to values sharing a key
  • [numTasks]: the optional number of reduce tasks
val list1: RDD[String] = sc.makeRDD(List("A","A","B","A","C","B","C"))
val mapRDD: RDD[(String, Int)] = list1.map(x => (x,1))
val reduceByKey: RDD[(String, Int)] = mapRDD.reduceByKey(_+_)
reduceByKey.collect().foreach(println)
Return value: Array((A,3),(B,2),(C,2))

4.countByKey(): returns the number of elements for each key, as a Map on the driver.

val list1: RDD[(Int, Int)] = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
val countByKeyRDD: collection.Map[Int, Long] = list1.countByKey()
countByKeyRDD.foreach(println)   
Return value: Array((3,2),(1,3),(2,1))

5. aggregateByKey(zeroValue: U, [partitioner: Partitioner])(seqOp: (U, V) => U, combOp: (U, U) => U): a curried function that computes within each partition first, then combines across partitions.

  • zeroValue: the initial value;
  • seqOp: the function used to fold the values within each partition, starting from the initial value;
  • combOp: the function used to merge the per-partition results;
    Note: with aggregateByKey, the initial value is applied within each partition but NOT when combining across partitions!
    With aggregate, the initial value is applied both within each partition AND when combining across partitions!

val list1: RDD[(String, Int)] = sc.makeRDD(List(("a",3),("a",2),("c",4),("b",3),("c",6),("c",8)),2)
val aggregateByKeyRDD: RDD[(String, Int)] = list1.aggregateByKey(0)(math.max(_,_),_+_)   //Take the maximum value of different keys in each partition, and add the maximum values of keys in each partition
aggregateByKeyRDD.collect().foreach(println)    
Return value: Array((b,3),(a,3),(c,12))

6.foldByKey(): a simplified version of aggregateByKey, used when seqOp and combOp are the same.

val list1: RDD[(String, Int)] = sc.makeRDD(List(("a",1),("a",1),("c",1),("b",1),("c",1),("c",1)),2)
val foldByKeyRDD: RDD[(String, Int)] = list1.foldByKey(0)(_+_)
foldByKeyRDD.collect().foreach(println) 
Return value: Array((b,1),(a,2),(c,3))

7.sortByKey([ascending], [numTasks]): returns a (K, V) RDD sorted by key.

  • [ascending]: true for ascending (the default), false for descending
  • [numTasks]: the number of partitions
val list1: RDD[(String, Int)] = sc.makeRDD(List(("A" -> 1),("B" -> 2),("C" -> 1),("A" -> 2)))
val sortByKeyRDD: RDD[(String, Int)] = list1.sortByKey()
sortByKeyRDD.collect().foreach(println)
Return value: Array((A,1),(A,2),(B,2),(C,1))

8.mapValues(): applies a function to each value, leaving the keys unchanged.

val list1: RDD[(Int, String)] = sc.parallelize(Array((1,"a"),(1,"d"),(2,"b"),(3,"c")))
val mapValuesRDD: RDD[(Int, String)] = list1.mapValues(_+"*")   //Put each value + "*"
mapValuesRDD.collect().foreach(println)
Return value: Array((1,a*),(1,d*),(2,b*),(3,c*))

9. join(otherDataset, [numTasks]): joins two RDDs on matching keys, returning (K, (V, W)) pairs.

  • otherDataset: the other RDD to join with
  • [numTasks]: the number of partitions
val list1: RDD[(String, Int)] = sc.makeRDD(List(("A",1),("B",2),("C",3)))
val list2: RDD[(String, String)] = sc.makeRDD(List(("A","a"),("B","b"),("C","c")))
val joinRDD: RDD[(String, (Int, String))] = list1.join(list2)
joinRDD.collect().foreach(println)     
Return value: Array((A,(1,a)),(B,(2,b)),(C,(3,c)))

10. cogroup(otherDataset, [numTasks]): similar to join, but returns, for each key, the full collections of values from both RDDs as (K, (Iterable[V], Iterable[W])).

val list1: RDD[(String, Int)] = sc.makeRDD(List(("A",1),("B",2),("C",3)))
val list2: RDD[(String, String)] = sc.makeRDD(List(("A","a"),("B","b"),("C","c")))
val cogroupRDD: RDD[(String, (Iterable[Int], Iterable[String]))] = list1.cogroup(list2)
cogroupRDD.collect().foreach(println)     
Return value: Array((A,(CompactBuffer(1),CompactBuffer(a))),(B,(CompactBuffer(2),CompactBuffer(b))),(C,(CompactBuffer(3),CompactBuffer(c))))
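One behavior worth noting, sketched below: when a key exists in only one of the two RDDs, cogroup still emits it, paired with an empty Iterable on the missing side, whereas join would drop it.

```scala
// cogroup keeps keys that appear in only one RDD; join would drop "D".
val left:  RDD[(String, Int)]    = sc.makeRDD(List(("A", 1), ("D", 4)))
val right: RDD[(String, String)] = sc.makeRDD(List(("A", "a")))
left.cogroup(right).collect().foreach(println)
// output (order may vary):
// (A,(CompactBuffer(1),CompactBuffer(a)))
// (D,(CompactBuffer(4),CompactBuffer()))
```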

Tags: Spark Scala

Posted on Mon, 18 May 2020 04:02:17 -0400 by Mattyspatty