[Spark] [RDD] Summary of notes from initial RDD learning

RDD

Author: Meng Lang Lan Tian (cute wolf blue sky)

[Bilibili] cute wolf blue sky

[Blog] https://mllt.cc

[Blog Park (cnblogs)] Meng Lang Lan Tian

[WeChat official account] mllt9920

[Learning and communication QQ group] 238948804


Start the Spark cluster

Start the Spark shell

Spark 2.0 unifies SparkContext and HiveContext into SparkSession.

The spark object (a SparkSession) created by the spark shell can also be used as the program entry point.
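A minimal sketch of using the session as the entry point (assuming a running spark shell, where both spark and sc are predefined):

spark.version                                         // Spark version string
val sc2 = spark.sparkContext                          // the underlying SparkContext
val r = spark.sparkContext.parallelize(List(1,2,3))   // create an RDD through the session's context
r.collect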

Spark programming with Scala

Characteristics

  • It is an immutable, partitioned collection of objects distributed across the cluster nodes;

  • It is created through parallel transformations (such as map, filter, join, etc.);
  • It is rebuilt automatically on failure;
  • Its storage level (memory, disk, etc.) can be controlled for reuse (see the sketch after this list);
  • It must be serializable. When memory is insufficient, it can automatically degrade to disk storage; performance drops considerably in that case, but it is still no worse than MapReduce;
  • A lost data partition can be recomputed from its lineage alone, without requiring an explicit checkpoint;
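A minimal sketch of controlling the storage level, assuming the standard StorageLevel constants:

import org.apache.spark.storage.StorageLevel
val nums = sc.parallelize(1 to 100)
nums.persist(StorageLevel.MEMORY_AND_DISK)  // keep in memory, spill to disk when memory is insufficient
nums.count                                  // the first action materializes and caches the data
nums.unpersist()                            // release the cached partitions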

Creating RDDs

Create RDD from memory

Start spark shell

val list = List(1,2,3)
var rdd = sc.parallelize(list)  // distribute the local collection as an RDD
rdd.partitions.size             // number of partitions of the new RDD

Create RDD from external storage

1. Create local file

cd /home
mkdir data
touch a.txt
  1. The file does not have to be created in your home directory
  2. You can use vim to add some content to a.txt

2. Start spark shell

3. Read from local file system

val localrdd = sc.textFile("file:///home/username/data/a.txt")

A path prefixed with file:// indicates that the file is read from the local file system.

localrdd.collect  // returns all elements in the RDD

Note: in fully distributed spark shell mode, the file must exist at the same path on every node before it can be read; otherwise a "file does not exist" error is reported.

Create RDD from HDFS

1. Create a directory (named with your name or student number) under the HDFS root directory

hdfs dfs -mkdir /zwj25
hdfs dfs -ls /

Visit http://[IP]:50070 (the HDFS web UI) to check

2. Upload local files to HDFS

hdfs dfs -put file.txt /zwj25

3. Start the spark shell

var hdfsrdd=sc.textFile("/zwj25/file.txt")  // a path without a scheme is read from the default file system (HDFS here)
hdfsrdd.collect
hdfsrdd.partitions
hdfsrdd.partitions.size

sc.defaultMinPartitions = min(sc.defaultParallelism, 2)

Number of RDD partitions = max(number of HDFS blocks in the file, sc.defaultMinPartitions)
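The optional second argument of textFile sets the minimum number of partitions; a minimal sketch reusing the example file above:

val hdfsrdd2 = sc.textFile("/zwj25/file.txt", 4)  // request at least 4 partitions
hdfsrdd2.partitions.size                          // typically >= 4, the requested minimum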

Create from other RDDs
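A new RDD can also be derived from an existing RDD simply by applying a transformation; a minimal sketch:

val base = sc.parallelize(1 to 5)
val derived = base.map(x => x * 10)  // derived is a new RDD created from base
derived.collect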

Operators

map(func)

Type: Transformation operator

map applies a user-defined function f to every element of the original RDD and produces a new RDD of the results; map does not change the number of partitions of the RDD.
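A minimal sketch:

val data = sc.parallelize(List(1,2,3,4))
data.map(x => x * 2).collect  // Array(2, 4, 6, 8)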

filter filtering

filter(func)

Transformation type operator

Keeps the elements for which the func function returns true, forming a new RDD.

eg: keep only the elements of the data RDD that are greater than 2 (i.e. filter out those less than or equal to 2)

val data =sc.parallelize(List(1,2,3,4))
val result = data.filter(x=>x>2)
result.collect

flatMap(func) splits words

Type: Transformation operator
flatMap: map and flatten each element in the collection

val data = sc.parallelize(List("I am Meng Lang Lan Tian","my wechat is mllt9920"))
data.map(x=>x.split(" ")).collect      // an array of word arrays: each line becomes an Array of words
data.flatMap(x=>x.split(" ")).collect  // flattened: a single array containing all the words

sortBy sort

sortBy(f:(T) => K, ascending, numPartitions)

Type: Transformation operator
Function: sorts a standard RDD

sortBy() accepts three parameters:
f: (T) => K: takes each element of the RDD to be sorted and returns the value that the element should be sorted by.
ascending: determines whether the elements of the RDD are sorted in ascending or descending order. The default is true (ascending); false sorts in descending order.

numPartitions: this parameter determines the number of RDD partitions after sorting. By default, the number of partitions after sorting is equal to the number before sorting.

eg: sort in descending order according to the second value of each element, and store the obtained results in RDD "data2"

val data1 = sc.parallelize(List((1,3),(2,4),(5,7),(6,8)))
val data2 = data1.sortBy(x=>x._2,false,1)
val data3 = data1.sortBy(x=>x._1,false,1)

distinct deduplication

distinct([numPartitions])

Type: Transformation operator
Function: deduplication. Only one copy of each duplicated element in the RDD is kept.

eg:

val data1 = sc.parallelize(List(1,2,3,3,3,4,4))
data1.collect
data1.distinct.collect  // duplicates removed
data1.collect           // the original RDD is unchanged

union merge

union(otherDataset)

Function: merges two RDDs; the element types of the two RDDs must be the same

eg: merge rdd1 and rdd2

val rdd1 = sc.parallelize(List(1,2,3))
val rdd2 = sc.parallelize(List(4,5,6))
rdd1.union(rdd2).collect

Note: the element types of the two RDDs in a union must be the same

intersection

intersection(otherDataset)

Function: finds the elements common to two RDDs, i.e. their intersection

eg: find the same element in c_rdd1 and c_rdd2

val c_rdd1 = sc.parallelize(List(('a',1),('b',2),('a',1),('c',1)))
val c_rdd2 = sc.parallelize(List(('a',1),('b',1),('d',1),('e',1)))
c_rdd1.intersection(c_rdd2).collect

subtract difference set

subtract(otherDataset)

Function: returns the difference set between two RDDs (elements of the first RDD that do not appear in the second)

eg: find the difference set between rdd1 and rdd2

val rdd1 = sc.parallelize(Array("A","B","C","D"))
val rdd2 = sc.parallelize(Array("C","D","E","F"))
val subtractRDD = rdd1.subtract(rdd2)
subtractRDD.collect

cartesian

cartesian(otherDataset)

Name: Cartesian product

Function: pairs every element of one RDD with every element of the other (the Cartesian product)

eg:

val rdd01 = sc.makeRDD(List(1,3,5,3))
val rdd02 = sc.makeRDD(List(2,4,5,1))
rdd01.cartesian(rdd02).collect

take(num)

Type: Action operator
Returns the first num elements of the RDD

val data = sc.parallelize(List(1,2,3,4))
data.take(2)

Key value pair RDD

mapValues

val rdd = sc.parallelize(List("a","b","c","d"))
//Create key value pairs through map
var rddp = rdd.map(x=>(x,1))
rddp.collect
rddp.keys.collect
rddp.values.collect
//Add one to all values through mapValues
rddp.mapValues(x=>x+1).collect

val rdd1 = sc.parallelize(List("I am a student","Hello word","Just Play"))
val rdd2 = rdd1.map(x=>(x,992))
rdd2.collect
rdd2.keys.collect
rdd2.values
rdd2.values.collect

val rdd3 = sc.parallelize(List("I am a student","Hello word","Just Play"))
val rdd4 = rdd3.map(x=>x.split(" "))    // each line becomes an array of words
rdd4.collect
val p1 = rdd3.map(x=>(x.split(" "),x))  // key: the array of words, value: the original line
p1.collect

join inner join by key

val rdd = sc.parallelize(List("a","b","c","d"))
//Create key value pairs through map
var rddp = rdd.map(x=>(x,1))
//Add one to all values through mapValues
var rdd1 = rddp.mapValues(x=>x+1)
//Similarly, rdd2 is obtained
val rdd2 = sc.parallelize(List("a","b","c","d","e")).map(x=>(x,1))
rdd1.collect
rdd2.collect
//join rdd1 and rdd2 together
rdd1.join(rdd2).collect
rdd2.join(rdd1).collect

leftOuterJoin and rightOuterJoin and fullOuterJoin

rightOuterJoin right outer join: every key of the second (right) RDD appears in the result

leftOuterJoin left outer join: every key of the first (left) RDD appears in the result

fullOuterJoin full outer join: keys from either RDD appear in the result

//rdd1 and rdd2 continue the above
rdd1.collect
rdd2.collect
//Right outer connection
rdd1.rightOuterJoin(rdd2).collect
//Left outer connection
rdd1.leftOuterJoin(rdd2).collect
//Total external connection
rdd1.fullOuterJoin(rdd2).collect

zip

Function: combines two RDDs element-wise into key-value pairs

  • The number of partitions of the two RDDs must be the same (check with rdd.partitions.size)
  • The number of elements of the two RDDs must be the same
val rdd1  = sc.parallelize(1 to 3)
val rdd2 = sc.parallelize(List("a","b","c"))
rdd1.collect
rdd2.collect
rdd2.zip(rdd1).collect
rdd1.zip(rdd2).collect
rdd1.partitions.size
rdd2.partitions.size
val rdd3  = sc.parallelize(1 to 3,3)//3 is the number of partitions
val rdd4 = sc.parallelize(List("a","b","c"),3)//3 is the number of partitions
rdd3.partitions.size
rdd4.partitions.size

combineByKey

Merges the values of the same key; the type of the merged result can differ from the type of the original values

Goal: collect the values of each key into a List (see the sketch below)
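A minimal sketch of that goal, using combineByKey on a small key-value RDD:

val kv = sc.parallelize(List(("A",1),("A",2),("B",3),("B",4),("B",5)))
val asList = kv.combineByKey(
  (v: Int) => List(v),                         // createCombiner: the first value of a key starts a List
  (c: List[Int], v: Int) => v :: c,            // mergeValue: add another value of the same key
  (c1: List[Int], c2: List[Int]) => c1 ::: c2  // mergeCombiners: merge Lists built on different partitions
)
asList.collect  // e.g. Array((A,List(2, 1)), (B,List(5, 4, 3))); order within each List is not guaranteed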

groupByKey([numPartitions])

Groups by key. When called on an RDD of (K, V) pairs, it returns a new RDD of (K, Iterable[V]) pairs.

val rdd1 = sc.parallelize(List("A","B","C","C","C","D","D")).map(x=>(x,1))
rdd1.groupByKey().collect
rdd1.groupByKey().collect()

reduceByKey(func, [numPartitions])

Groups the RDD by key and aggregates the values of each key (identical keys are merged into a single entry whose value is the aggregation of all their values)

  • When called on an RDD of (K, V) key-value pairs, it returns a new RDD of (K, V) key-value pairs
  • The values of each key in the new RDD are aggregated using the given reduce function func, which must be of type (V, V) => V
  • It can be used to count the number of occurrences of each key
val rdd1 = sc.parallelize(List("A","B","C","C","C","D","D")).map(x=>(x,1))
rdd1.reduceByKey((x,y)=>x+y).collect
rdd1.reduceByKey((x,y)=>x+y).collect()

File reading and storage

Format         Structured   Description
text file      no           Plain text file, one record per line
SequenceFile   yes          A common Hadoop file format for key-value pair data

saveAsTextFile(path: String)

Save RDD to HDFS

val rdd1 = sc.parallelize(List(1,2,3,4))
rdd1.saveAsTextFile("/302Spark/savetext")
//Visit http://[IP]:50070 to check the output

saveAsSequenceFile and sc.sequenceFile

saveAsSequenceFile(Path:String)

Saves the RDD as a serialized SequenceFile; only key-value pair RDDs are supported

sequenceFile[K, V](path: String, keyClass: Class[K], valueClass: Class[V], minPartitions: Int)

Read serialization file:

//In order for Spark to support Hadoop data types, the corresponding packages need to be imported
import org.apache.hadoop.io._
sc.sequenceFile(path: String, keyClass: Class[K], valueClass: Class[V])

example

//Serialized file storage
val rdd = sc.parallelize(List(("panda",3),("dog",6),("cat",3)))
rdd.saveAsSequenceFile("/hadoop-zwj25/testSeq")
rdd.partitions.size

//View serialization file
hdfs dfs -ls /hadoop-zwj25
hdfs dfs -ls /hadoop-zwj25/testSeq
hdfs dfs -cat /hadoop-zwj25/testSeq/part-00000

//Introducing hadoop data types
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable
//Serialized file read
//Text in the first classOf[Text] is the type of the key
//IntWritable in the second classOf[IntWritable] is the type of the value
val output = sc.sequenceFile("/hadoop-zwj25/testSeq",classOf[Text],classOf[IntWritable])
output.map{case(x,y)=>(x.toString,y.get())}.collect
rdd.collect
val rddtest = sc.parallelize(List(1,2,3))
rddtest.map{case 1=>"One";case 2=>"Two";case _=>"other"}.collect
rddtest.map{case x=>(x,"a")}.collect

repartition() repartition

repartition(numPartitions: Int)

  • You can increase or decrease the level of parallelism in this RDD. Internally, it redistributes data using shuffle.
  • If you want to reduce the number of partitions in this RDD, consider using coalesce to avoid shuffling.
coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)

Query partition: partitions.size

rdd.repartition(numPartitions:Int).partitions.size

To reduce the number of partitions, consider using coalesce to avoid a shuffle (see the sketch after the code block below).

val rdd1 = sc.parallelize(List(1,2,3,4))
rdd1.saveAsTextFile("/302Spark/savetext")
//Visit http://[IP]:50070 to check the output
//---------------------------------------
rdd1.partitions.size
rdd1.repartition(1).partitions.size
rdd1.repartition(1).saveAsTextFile("/302Spark/savetext1")
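A minimal sketch of coalesce reducing the partition count without a shuffle (the numbers are only illustrative):

val wide = sc.parallelize(1 to 100, 8)  // start with 8 partitions
wide.partitions.size                    // 8
val narrow = wide.coalesce(2)           // merge down to 2 partitions without a shuffle
narrow.partitions.size                  // 2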

practice

Practice01

subject

Find the student IDs whose test score is 100; the final result must be collected into an RDD.

source material

Please paste the content in the following code block into the text document result_bigdata.txt

1001	Big data foundation	90
1002	Big data foundation	94
1003	Big data foundation	100
1004	Big data foundation	99
1005	Big data foundation	90
1006	Big data foundation	94
1007	Big data foundation	100
1008	Big data foundation	93
1009	Big data foundation	89
1010	Big data foundation	78
1011	Big data foundation	91
1012	Big data foundation	84

code

//Create RDD from local file
val rdd_bigdata = sc.textFile("file:///home/username/result_bigdata.txt")
//Just output a test
rdd_bigdata.take(2)
//View all results
rdd_bigdata.collect
//The following method needs to be converted to Int type
val bigdata_100=rdd_bigdata.map(x=>x.split("\t")).map(x=>(x(0),x(1),x(2).toInt)).filter(x=>x._3==100).map(x=>x._1)
bigdata_100.collect
//The following method does not need to be converted to Int type
val bigdata_100=rdd_bigdata.map(x=>x.split("\t")).filter(x=>x(2)=="100").map(x=>x(0))
bigdata_100.collect

Practice02

subject

Output the total score of each student: add up the scores with the same student ID from the two score sheets.

source material

Please paste the contents of the following code block into the text document score.txt

math John 90
math Betty 88
math Mike 95
math Lily 92
chinese John 78
chinese Betty 80
chinese Mike 88
chinese Lily 85
english John 92
english Betty 84
english Mike 90
english Lily 85

Please paste the contents of the following code block into the text document result_math.txt

1001	applied mathematics	96
1002	applied mathematics	94
1003	applied mathematics	100
1004	applied mathematics	100
1005	applied mathematics	94
1006	applied mathematics	80
1007	applied mathematics	90
1008	applied mathematics	94
1009	applied mathematics	84
1010	applied mathematics	86
1011	applied mathematics	79
1012	applied mathematics	91

Please paste the contents of the following code block into the text document result_bigdata.txt

1001	Big data foundation	90
1002	Big data foundation	94
1003	Big data foundation	100
1004	Big data foundation	99
1005	Big data foundation	90
1006	Big data foundation	94
1007	Big data foundation	100
1008	Big data foundation	93
1009	Big data foundation	89
1010	Big data foundation	78
1011	Big data foundation	91
1012	Big data foundation	84

code

//Create RDD from local file
val rdd_bigdata = sc.textFile("file:///home/hadoop-zwj25/result_bigdata.txt")
val rdd_math = sc.textFile("file:///home/hadoop-zwj25/result_math.txt")
//Returns all elements in the RDD
rdd_bigdata.collect
rdd_math.collect
//Merge the two RDDs
val rddall = rdd_math.union(rdd_bigdata)
rddall.collect
rddall.map(x=>(x.split("\t"))).map(x=>(x(0),x(2).toInt)).reduceByKey((x,y)=>x+y).collect

Practice03

subject

1. Output the average score of each student: add the scores with the same ID from the two score sheets and compute the average.

2. Combine the total score and the average score of each student.

Complete both the average-score task and the merge task.

source material

Same as Practice02: result_math.txt and result_bigdata.txt.
code

//1. Create RDD and convert
val math = sc.textFile("file:///home/hadoop-zwj25/result_math.txt")
val math_map = math.map(x=>x.split("\t")).map(x=>(x(0),x(2).toInt))
math_map.collect
val bigdata=sc.textFile("file:///home/hadoop-zwj25/result_bigdata.txt")
val bigdata_map = bigdata.map(x=>x.split("\t")).map(x=>(x(0),x(2).toInt))
bigdata_map.collect
//Merge the two key value pairs RDD, and use take to randomly test one
math_map.union(bigdata_map).take(3)
//Add a score count for the score, starting from 1, and the format after completion is (ID, (score, count))
math_map.union(bigdata_map).mapValues(x=>(x,1)).take(3)
//Note: the type produced by the reduce function must match the converted value type (score, count)
//reduceByKey((x,y)=>((x._1+y._1),(x._2+y._2))) is explained as follows:
//x and y are the (score, count) pairs of two records that share the same key
//(x._1+y._1) is the total score (the scores of the two subjects added together)
//(x._2+y._2) is the total number of subjects (the two count values added together)
math_map.union(bigdata_map).mapValues(x=>(x,1)).reduceByKey((x,y)=>((x._1+y._1),(x._2+y._2))).take(3)
//Group and aggregate the merged RDD keys
//The average score is the total score / the number of subjects (integer division here)
//x._1 is the ID; x._2 is (total score, subject count)
//So the average score is x._2._1 / x._2._2
val pj = math_map.union(bigdata_map).mapValues(x=>(x,1)).reduceByKey((x,y)=>((x._1+y._1),(x._2+y._2))).map(x=>(x._1,(x._2._1/x._2._2)))

//Query results
pj.collect
//Output total score
val zf = math_map.union(bigdata_map).reduceByKey((x,y)=>x+y)
zf.collect

pj.count
zf.count
//Combine the total and average scores of each student (join by ID)
zf.join(pj).count
zf.join(pj).collect
      

Tags: Spark

Posted on Sat, 30 Oct 2021 18:55:43 -0400