[Spark] [RDD] Summary of notes for getting started with RDD

RDD

Author: cute wolf blue sky

[Bili Bili] cute wolf blue sky

[blog] https://mllt.cc

[blog park] Menglang blue sky blog Park

WeChat official account mllt9920

[learning and communication QQ group] 238948804


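[!] Start the Spark cluster

[!] Start spark-shell

Spark 2.0 integrates SQLContext and HiveContext into SparkSession, which also gives access to the SparkContext

In spark-shell the predefined variable spark (a SparkSession) can therefore be used as the program entry point, alongside sc (the SparkContext)

Spark programming in these notes is done with Scala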
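As a minimal sketch of the two entry points, the snippet below builds a SparkSession and takes the SparkContext from it. The local master `local[2]` and the app name are assumptions for a standalone run; inside spark-shell both `spark` and `sc` already exist and `getOrCreate()` simply reuses the running session.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local master with 2 threads and a made-up app name, for a standalone run.
// Inside spark-shell, `spark` and `sc` are predefined and getOrCreate() reuses that session.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("rdd-notes-entry")
  .getOrCreate()

// The SparkContext used by the RDD API is reachable from the session.
val sc = spark.sparkContext

// Create a small RDD through the context and run one action on it.
val rdd = sc.parallelize(Seq(1, 2, 3))
val elementCount = rdd.count()
println(s"Spark ${spark.version}, elements: $elementCount")
```

The rest of these notes only needs `sc`, so in spark-shell nothing has to be created by hand.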

Characteristics

An RDD is an immutable, partitioned collection of objects distributed across the cluster nodes;

  • It is created through parallel transformations (such as map, filter, join, etc.);
  • It is rebuilt automatically after a failure;
  • Its storage level (memory, disk, etc.) can be controlled for reuse;
  • It must be serializable; when memory is insufficient, it can fall back to disk storage automatically. Performance then drops considerably, but it will not be worse than MapReduce;
  • A lost data partition can be recomputed from its lineage alone, without a specific checkpoint;
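The storage level and lineage points above can be tried out with a small sketch like the following. The data, the variable names, and the local master are assumptions chosen for illustration; `MEMORY_AND_DISK` is the storage level that spills to disk when memory runs short.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Assumption: local run; in spark-shell the existing session and context are reused.
val spark = SparkSession.builder().master("local[2]").appName("rdd-notes-persist").getOrCreate()
val sc = spark.sparkContext

val base = sc.parallelize(1 to 10, 2)
// Two chained transformations; nothing is computed yet.
val evens = base.map(_ * 2).filter(_ % 4 == 0)

// Keep the result in memory and spill to disk if memory is insufficient.
evens.persist(StorageLevel.MEMORY_AND_DISK)

val firstCount = evens.count()   // computes the RDD and caches its partitions
val secondCount = evens.count()  // reuses the cached partitions

// The lineage Spark would replay to rebuild a lost partition.
val lineage = evens.toDebugString
println(lineage)
```

Calling `evens.unpersist()` releases the cached partitions once they are no longer needed.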

Creation

Create RDD from memory

Start spark-shell

    val list = List(1,2,3)
    val rdd = sc.parallelize(list)
    rdd.partitions.size  // number of partitions of the RDD

Create RDD from external storage

1. Create a local file
    cd /home/username
    mkdir data
    touch data/a.txt
  1. You don't have to create it in your home directory
  2. You can use vim to add some content to a.txt
2. Start spark-shell
3. Read from the local file system
    val localrdd = sc.textFile("file:///home/username/data/a.txt")

A path prefixed with file:// is read from the local file system

    localrdd.collect  // Returns all elements in the RDD

Note: when spark-shell runs against a fully distributed cluster, the file must exist at the same path on every node before it can be read. Otherwise a "file does not exist" error is reported

Create RDD from HDFS

1. Create a directory (name plus student number) under the HDFS root directory
    hdfs dfs -mkdir /zwj25
    hdfs dfs -ls /

Visit http://[IP]:50070 to check the result in the HDFS web UI

2. Upload a local file to HDFS
    hdfs dfs -put file.txt /zwj25

3. Enter spark-shell
    val hdfsrdd = sc.textFile("/zwj25/file.txt")
    hdfsrdd.collect
    hdfsrdd.partitions
    hdfsrdd.partitions.size

sc.defaultMinPartitions = min(sc.defaultParallelism, 2)

Number of RDD partitions = max(number of blocks in the HDFS file, sc.defaultMinPartitions)
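The two rules above can be restated as a small worked example. `expectedPartitions` is a hypothetical helper written only for this note, not a Spark API; it just evaluates the formula for a few assumed inputs.

```scala
// Hypothetical helper that restates the rule quoted above; it is not part of Spark.
def expectedPartitions(hdfsBlocks: Int, defaultParallelism: Int): Int = {
  // sc.defaultMinPartitions = min(sc.defaultParallelism, 2)
  val defaultMinPartitions = Math.min(defaultParallelism, 2)
  // partitions = max(number of HDFS blocks, sc.defaultMinPartitions)
  Math.max(hdfsBlocks, defaultMinPartitions)
}

// A small file stored in 1 block, read with a default parallelism of 8.
val smallFile = expectedPartitions(hdfsBlocks = 1, defaultParallelism = 8)
// A larger file stored in 5 blocks.
val largeFile = expectedPartitions(hdfsBlocks = 5, defaultParallelism = 8)
// A single-core setup where the default parallelism is 1.
val singleCore = expectedPartitions(hdfsBlocks = 1, defaultParallelism = 1)

println(s"$smallFile $largeFile $singleCore")
```

Under these assumptions a one-block file still yields 2 partitions, which matches what `hdfsrdd.partitions.size` typically shows for a small test file.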

Create from other RDDs
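The notes give no example for this case, so here is a minimal sketch, assuming that "creating from other RDDs" means applying a transformation to an existing RDD. The names and data are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local run; in spark-shell the existing session and context are reused.
val spark = SparkSession.builder().master("local[2]").appName("rdd-notes-derive").getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(List(1, 2, 3, 4, 5, 6))

// Each transformation returns a new RDD; the source RDD is left unchanged.
val evens = numbers.filter(_ % 2 == 0)
val evenSquares = evens.map(n => n * n)

val original = numbers.collect()
val derived = evenSquares.collect()
```

Because RDDs are immutable, `numbers` still holds all six elements after the derived RDDs are computed.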

Operators

map(func)

Type: Transformation operator

map: converts each element of the original RDD into a new element through the user-defined function func, producing a new RDD. The map operation does not change the number of partitions of the RDD
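A short sketch of map, using made-up data with an explicit partition count so that the "same number of partitions" claim can be checked directly:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local run; in spark-shell the existing session and context are reused.
val spark = SparkSession.builder().master("local[2]").appName("rdd-notes-map").getOrCreate()
val sc = spark.sparkContext

// Four elements spread over 2 partitions.
val data = sc.parallelize(List(1, 2, 3, 4), 2)

// map applies the function to every element and returns a new RDD.
val squared = data.map(x => x * x)

val values = squared.collect()
val partitionsBefore = data.partitions.size
val partitionsAfter = squared.partitions.size
```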

filter: filtering

filter(func)

Type: Transformation operator

Keeps the elements for which func returns true and forms a new RDD from them

eg: filter out the elements of the RDD data that are less than or equal to 2

    val data = sc.parallelize(List(1,2,3,4))
    val result = data.filter(x => x > 2)
    result.collect

flatMap(func): splitting words

Type: Transformation operator
flatMap: applies map to each element of the collection and then flattens the results

    val data = sc.parallelize(List("I am Meng Lang Lan Tian","my wechat is mllt9920"))
    data.map(x => x.split(" ")).collect      // one Array of words per line
    data.flatMap(x => x.split(" ")).collect  // a single flat Array of words

sortBy: sorting

sortBy(f: (T) => K, ascending, numPartitions)

Type: Transformation operator
Function: sort a standard RDD

sortBy() accepts three parameters:
f: (T) => K: the left side is each element of the RDD being sorted; the right side returns the value within the element to sort by.
ascending: determines whether the elements of the RDD are sorted in ascending or descending order. The default is true (ascending); false sorts in descending order.

numPartitions: determines the number of partitions of the sorted RDD. By default it equals the number of partitions before sorting.

eg: sort in descending order by the second value of each element and store the result in the RDD data2

    val data1 = sc.parallelize(List((1,3),(2,4),(5,7),(6,8)))
    val data2 = data1.sortBy(x => x._2, false, 1)  // descending by the second value
    val data3 = data1.sortBy(x => x._1, false, 1)  // descending by the first value

distinct: deduplication

distinct([numPartitions])

Type: Transformation operator
Function: deduplication. Only one copy of each duplicated element in the RDD is kept

eg:

    val data1 = sc.parallelize(List(1,2,3,3,3,4,4))
    data1.collect
    data1.distinct.collect
    data1.collect  // data1 itself is unchanged

union: merging

union(otherDataset)

Function: merges two RDDs; the element types of the two RDDs must be the same

eg: merge rdd1 and rdd2

    val rdd1 = sc.parallelize(List(1,2,3))
    val rdd2 = sc.parallelize(List(4,5,6))
    rdd1.union(rdd2).collect

Note: the element types of the two RDDs passed to union must be the same

intersection

intersection(otherDataset)

Function: finds the elements common to two RDDs, that is, their intersection

eg: find the elements that appear in both c_rdd1 and c_rdd2

    val c_rdd1 = sc.parallelize(List(('a',1),('b',2),('a',1),('c',1)))
    val c_rdd2 = sc.parallelize(List(('a',1),('b',1),('d',1),('e',1)))
    c_rdd1.intersection(c_rdd2).collect  // only (a,1) is in both

subtract: difference set

subtract(otherDataset)

Function: gets the difference set of two RDDs, that is, the elements of the first RDD that are not in the second

eg: find the difference set between rdd1 and rdd2

    val rdd1 = sc.parallelize(Array("A","B","C","D"))
    val rdd2 = sc.parallelize(Array("C","D","E","F"))
    val subtractRDD = rdd1.subtract(rdd2)
    subtractRDD.collect  // elements of rdd1 that are not in rdd2

cartesian

cartesian(otherDataset)

Name: Cartesian product

Function: pairs every element of the first RDD with every element of the second

eg:

    val rdd01 = sc.makeRDD(List(1,3,5,3))
    val rdd02 = sc.makeRDD(List(2,4,5,1))
    rdd01.cartesian(rdd02).collect

take(num)

Returns the first num records of the RDD

    val data = sc.parallelize(List(1,2,3,4))
    data.take(2)

Key-value pair RDD

mapValues

    val rdd = sc.parallelize(List("a","b","c","d"))
    // Create key-value pairs through map
    val rddp = rdd.map(x => (x,1))
    rddp.collect
    rddp.keys.collect
    rddp.values.collect
    // Add one to every value through mapValues
    rddp.mapValues(x => x+1).collect

    val rdd1 = sc.parallelize(List("I am a student","Hello word","Just Play"))
    val rdd2 = rdd1.map(x => (x,992))
    rdd2.collect
    rdd2.keys.collect
    rdd2.values
    rdd2.values.collect

    val rdd3 = sc.parallelize(List("I am a student","Hello word","Just Play"))
    val rdd4 = rdd3.map(x => x.split(" "))
    rdd4.collect
    // Use the first word of each line as the key and the whole line as the value
    val p1 = rdd3.map(x => (x.split(" ")(0), x))
    p1.collect

join: inner join by key

    val rdd = sc.parallelize(List("a","b","c","d"))
    // Create key-value pairs through map
    val rddp = rdd.map(x => (x,1))
    // Add one to every value through mapValues
    val rdd1 = rddp.mapValues(x => x+1)
    // rdd2 is obtained in the same way
    val rdd2 = sc.parallelize(List("a","b","c","d","e")).map(x => (x,1))
    rdd1.collect
    rdd2.collect
    // Join rdd1 and rdd2; only keys present in both RDDs are kept
    rdd1.join(rdd2).collect
    rdd2.join(rdd1).collect

leftOuterJoin, rightOuterJoin and fullOuterJoin

rightOuterJoin: right outer join. Every key of the second RDD appears in the result

leftOuterJoin: left outer join. Every key of the first RDD appears in the result

fullOuterJoin: full outer join. The keys of both RDDs appear in the result

    // rdd1 and rdd2 are the ones defined in the join example above
    rdd1.collect
    rdd2.collect
    // Right outer join
    rdd1.rightOuterJoin(rdd2).collect
    // Left outer join
    rdd1.leftOuterJoin(rdd2).collect
    // Full outer join
    rdd1.fullOuterJoin(rdd2).collect
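To make the shape of the outer join results easier to see, here is a small sketch with made-up data in which each RDD has one key the other lacks. The side that may be missing is wrapped in an Option, so an absent match shows up as None.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local run; in spark-shell the existing session and context are reused.
val spark = SparkSession.builder().master("local[2]").appName("rdd-notes-outer-join").getOrCreate()
val sc = spark.sparkContext

// "a" exists only on the left, "c" only on the right, "b" on both sides.
val left = sc.parallelize(List(("a", 1), ("b", 2)))
val right = sc.parallelize(List(("b", 20), ("c", 30)))

// Keeps every left key; the right value is an Option.
val leftResult = left.leftOuterJoin(right).collect().toMap
// Keeps every right key; the left value is an Option.
val rightResult = left.rightOuterJoin(right).collect().toMap
// Keeps the keys of both sides; both values are Options.
val fullResult = left.fullOuterJoin(right).collect().toMap
```

The results are turned into Maps here only because the order of the collected pairs is not guaranteed.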

zip

Function: combines two RDDs into one RDD of key-value pairs

  • The two RDDs must have the same number of partitions (query it with rdd.partitions.size)
  • The two RDDs must have the same number of elements (in each partition)
    val rdd1 = sc.parallelize(1 to 3)
    val rdd2 = sc.parallelize(List("a","b","c"))
    rdd1.collect
    rdd2.collect
    rdd2.zip(rdd1).collect
    rdd1.zip(rdd2).collect
    rdd1.partitions.size
    rdd2.partitions.size
    val rdd3 = sc.parallelize(1 to 3, 3)              // 3 is the number of partitions
    val rdd4 = sc.parallelize(List("a","b","c"), 3)   // 3 is the number of partitions
    rdd3.partitions.size
    rdd4.partitions.size

combineByKey

Merges the values that share the same key. The type of the merged value can differ from the type of the original values

Goal: convert the values of each key into a List
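The notes state the goal but show no code, so the following is one possible sketch of it with combineByKey. The data and names are assumptions; the three functions are the ones combineByKey expects (create a combiner, merge a value into it, merge two combiners).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local run; in spark-shell the existing session and context are reused.
val spark = SparkSession.builder().master("local[2]").appName("rdd-notes-combine").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(List(("A", 1), ("B", 2), ("A", 3), ("B", 4), ("A", 5)), 2)

val asLists = pairs.combineByKey(
  (v: Int) => List(v),                      // first value seen for a key in a partition
  (acc: List[Int], v: Int) => v :: acc,     // another value for the same key in that partition
  (a: List[Int], b: List[Int]) => a ::: b   // merge the lists built in different partitions
)

// The order inside each list depends on partitioning, so sort before comparing.
val result = asLists.mapValues(_.sorted).collect().toMap
```

Here the input values are Int while the merged values are List[Int], which is the "types can differ" point made above.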

groupByKey([numPartitions])

Groups by key. When called on an RDD of (K, V) pairs, it returns a new RDD of (K, Iterable[V]) pairs.

    val rdd1 = sc.parallelize(List("A","B","C","C","C","D","D")).map(x => (x,1))
    rdd1.groupByKey().collect
    rdd1.groupByKey().collect()  // same call, written with parentheses

reduceByKey(func, [numPartitions])

Groups the pairs of the RDD by key and aggregates their values (pairs with the same key are reduced to a single pair; with values of 1 and addition, each repeated key adds 1)

  • When called on an RDD of key-value pairs of type (K, V), it returns a new RDD of key-value pairs of type (K, V)
  • The values of each key in the new RDD are aggregated with the given reduce function func, which must be of type (V, V) => V
  • It can be used to count the number of occurrences of each key
    val rdd1 = sc.parallelize(List("A","B","C","C","C","D","D")).map(x => (x,1))
    rdd1.reduceByKey((x,y) => x+y).collect
    rdd1.reduceByKey((x,y) => x+y).collect()  // same call, written with parentheses

File reading and storage

Format name     Structured    Description
Text file       No            Ordinary text file, one record per line
SequenceFile    Yes           Common Hadoop file format for key-value pair data
rdd.partitions.size

saveAsTextFile(path: String)

Save RDD to HDFS

    val rdd1 = sc.parallelize(List(1,2,3,4))
    rdd1.saveAsTextFile("/302Spark/savetext")
    // Open IP:50070 in a browser to check the output
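For readers without an HDFS cluster at hand, the round trip can be sketched against the local file system. The temporary directory, the file:// URI and the local master are assumptions for illustration; the target directory must not exist before saveAsTextFile is called.

```scala
import java.io.File
import java.net.URI
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: local run; in spark-shell the existing session and context are reused.
val spark = SparkSession.builder().master("local[2]").appName("rdd-notes-textfile").getOrCreate()
val sc = spark.sparkContext

// A not-yet-existing output directory inside a fresh temp directory, as a file:// URI.
val outUri = Files.createTempDirectory("rdd-notes").resolve("savetext").toUri.toString

// Write the RDD with 2 partitions; each partition becomes one part-xxxxx file.
sc.parallelize(List(1, 2, 3, 4), 2).saveAsTextFile(outUri)
val partFiles = new File(new URI(outUri)).list().count(_.startsWith("part-"))

// Read the directory back; every line comes back as a String.
val restored = sc.textFile(outUri).map(_.toInt).collect().sorted
```

Passing the directory, rather than a single part file, to textFile reads all part files at once.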

saveAsSequenceFile and sc.sequenceFile

saveAsSequenceFile(path: String)

Saves the RDD as a SequenceFile; only key-value pair RDDs are supported

sequenceFile[K, V](path: String, keyClass: Class[K], valueClass: Class[V], minPartitions: Int)

Read a SequenceFile:

    // For Spark to work with Hadoop data types, import the package
    import org.apache.hadoop.io._

sc.sequenceFile(path: String, keyClass: Class[K], valueClass: Class[V])

Example

    // Save as a SequenceFile
    val rdd = sc.parallelize(List(("panda",3),("dog",6),("cat",3)))
    rdd.saveAsSequenceFile("/hadoop-zwj25/testSeq")
    rdd.partitions.size

    # View the SequenceFile (run in a terminal, not in spark-shell)
    hdfs dfs -ls /hadoop-zwj25
    hdfs dfs -ls /hadoop-zwj25/testSeq
    hdfs dfs -cat /hadoop-zwj25/testSeq/part-00000

    // Import the Hadoop data types
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.io.IntWritable
    // Read the SequenceFile
    // Text in the first classOf[Text] is the type of the key
    // IntWritable in the second classOf[IntWritable] is the type of the value
    val output = sc.sequenceFile("/hadoop-zwj25/testSeq", classOf[Text], classOf[IntWritable])
    // Convert the Hadoop Writable objects to Scala types before collecting
    output.map { case (k, v) => (k.toString, v.get) }.collect
    rdd.collect
    val rddtest = sc.parallelize(List(1,2,3))
    // map needs a function argument; the one used here is only an assumed placeholder
    rddtest.map { case x => x }.collect

repartition(): repartitioning

repartition(numPartitions: Int)

  • You can increase or decrease the level of parallelism in this RDD. Internally, it redistributes data using shuffle.
  • If you want to reduce the number of partitions in this RDD, consider using coalesce to avoid shuffling.
coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)

Query the number of partitions: partitions.size

    rdd.repartition(numPartitions).partitions.size

To reduce the number of partitions, consider using coalesce to avoid a shuffle

    val rdd1 = sc.parallelize(List(1,2,3,4))
    rdd1.saveAsTextFile("/302Spark/savetext")
    // Open IP:50070 in a browser to check the output
    //---------------------------------------
    rdd1.partitions.size
    rdd1.repartition(1).partitions.size
    rdd1.repartition(1).saveAsTextFile("/302Spark/savetext1")
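The difference between the two operators can be checked with a short sketch. The data and partition counts are made up; the last line shows that coalesce with its default shuffle = false cannot raise the partition count.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local run; in spark-shell the existing session and context are reused.
val spark = SparkSession.builder().master("local[2]").appName("rdd-notes-repartition").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 100, 4)

// coalesce merges existing partitions without a shuffle, so it suits reductions.
val fewer = rdd.coalesce(2).partitions.size
// repartition always shuffles and can both raise and lower the partition count.
val more = rdd.repartition(8).partitions.size
// Without a shuffle, asking coalesce for more partitions leaves the count as it is.
val unchanged = rdd.coalesce(8).partitions.size

// The data itself is not affected by repartitioning.
val total = rdd.repartition(8).count()
```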

Practice

Practice01

Task

Find the IDs of the students who scored 100 on the test; the final results must be collected into one RDD.

Source data

Paste the following content into the text document result_bigdata.txt (the three fields on each line are separated by tab characters)

1001	Big data foundation	90
1002	Big data foundation	94
1003	Big data foundation	100
1004	Big data foundation	99
1005	Big data foundation	90
1006	Big data foundation	94
1007	Big data foundation	100
1008	Big data foundation	93
1009	Big data foundation	89
1010	Big data foundation	78
1011	Big data foundation	91
1012	Big data foundation	84

Code
    // Create RDD from local file
    val rdd_bigdata = sc.textFile("file:///home/username/result_bigdata.txt")
    // Output a couple of records as a test
    rdd_bigdata.take(2)
    // View all records
    rdd_bigdata.collect
    // Method 1: convert the score to Int before comparing
    val bigdata_100 = rdd_bigdata.map(x => x.split("\t")).map(x => (x(0),x(1),x(2).toInt)).filter(x => x._3 == 100).map(x => x._1)
    bigdata_100.collect
    // Method 2: compare the score as a String, no conversion to Int needed
    // (redefining the same val name works in spark-shell)
    val bigdata_100 = rdd_bigdata.map(x => x.split("\t")).filter(x => x(2) == "100").map(x => x(0))
    bigdata_100.collect

Practice02

Task

Output the total score of each student. The scores with the same student ID in the two score sheets must be added together.

Source data

Paste the following content into the text document score.txt

math John 90
math Betty 88
math Mike 95
math Lily 92
chinese John 78
chinese Betty 80
chinese Mike 88
chinese Lily 85
english John 92
english Betty 84
english Mike 90
english Lily 85

Paste the following content into the text document result_math.txt (fields are separated by tab characters)

1001	applied mathematics	96
1002	applied mathematics	94
1003	applied mathematics	100
1004	applied mathematics	100
1005	applied mathematics	94
1006	applied mathematics	80
1007	applied mathematics	90
1008	applied mathematics	94
1009	applied mathematics	84
1010	applied mathematics	86
1011	applied mathematics	79
1012	applied mathematics	91

Paste the following content into the text document result_bigdata.txt (fields are separated by tab characters)

1001	Big data foundation	90
1002	Big data foundation	94
1003	Big data foundation	100
1004	Big data foundation	99
1005	Big data foundation	90
1006	Big data foundation	94
1007	Big data foundation	100
1008	Big data foundation	93
1009	Big data foundation	89
1010	Big data foundation	78
1011	Big data foundation	91
1012	Big data foundation	84

Code
    // Create RDDs from local files
    val rdd_bigdata = sc.textFile("file:///home/hadoop-zwj25/result_bigdata.txt")
    val rdd_math = sc.textFile("file:///home/hadoop-zwj25/result_math.txt")
    // Return all elements in each RDD
    rdd_bigdata.collect
    rdd_math.collect
    // Merge the two RDDs
    val rddall = rdd_math.union(rdd_bigdata)
    rddall.collect
    // Build (ID, score) pairs and add up the scores of each ID
    rddall.map(x => x.split("\t")).map(x => (x(0),x(2).toInt)).reduceByKey((x,y) => x+y).collect

Practice03

Task

1. Output the average score of each student. Add up the scores with the same ID in the two score sheets and compute the average.

2. Combine the total score and the average score of each student

Complete the average score and merge tasks

Source data: the same result_math.txt and result_bigdata.txt as in Practice02

Code
    // 1. Create the RDDs and convert them to (ID, score) pairs
    val math = sc.textFile("file:///home/hadoop-zwj25/result_math.txt")
    val math_map = math.map(x => x.split("\t")).map(x => (x(0),x(2).toInt))
    math_map.collect
    val bigdata = sc.textFile("file:///home/hadoop-zwj25/result_bigdata.txt")
    val bigdata_map = bigdata.map(x => x.split("\t")).map(x => (x(0),x(2).toInt))
    bigdata_map.collect
    // Merge the two key-value pair RDDs and use take to spot-check a few records
    math_map.union(bigdata_map).take(3)
    // Add a count of 1 to each score; the resulting format is (ID, (score, count))
    math_map.union(bigdata_map).mapValues(x => (x,1)).take(3)
    // Note that the result type of the reduce function must match the value type
    // reduceByKey((x,y) => ((x._1+y._1),(x._2+y._2))) is explained as follows
    // x and y are two (score, count) values that share the same ID
    // (x._1+y._1) is the total score (the scores of the two subjects added together)
    // (x._2+y._2) is the number of subjects (the sum of the two counts)
    math_map.union(bigdata_map).mapValues(x => (x,1)).reduceByKey((x,y) => ((x._1+y._1),(x._2+y._2))).take(3)
    // Compute the average from the aggregated RDD
    // average score = total score / number of subjects
    // x._1 is the ID; x._2 is (total score, count)
    // so the average is (x._2._1/x._2._2); both values are Int, so the division truncates
    val pj = math_map.union(bigdata_map).mapValues(x => (x,1)).reduceByKey((x,y) => ((x._1+y._1),(x._2+y._2))).map(x => (x._1,(x._2._1/x._2._2)))
    // Query the results
    pj.collect
    // Total score
    val zf = math_map.union(bigdata_map).reduceByKey((x,y) => x+y)
    zf.collect
    pj.count
    zf.count
    // Combine the total and average scores of each student: (ID, (total, average))
    zf.join(pj).count
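As a variation on the code above, the sketch below computes the total and a fractional average in one pass, without the final join. It uses two records from each score sheet as inline data instead of reading the files, and the variable names are made up; converting to Double before dividing is what avoids the truncation of integer division.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local run; in spark-shell the existing session and context are reused.
val spark = SparkSession.builder().master("local[2]").appName("rdd-notes-average").getOrCreate()
val sc = spark.sparkContext

// (ID, score) pairs for students 1001 and 1004, taken from the two score sheets.
val mathScores = sc.parallelize(List(("1001", 96), ("1004", 100)))
val bigdataScores = sc.parallelize(List(("1001", 90), ("1004", 99)))

// (ID, (total score, number of subjects))
val sumAndCount = mathScores.union(bigdataScores)
  .mapValues(score => (score, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))

// (ID, (total score, average score)); toDouble keeps the fractional part.
val totalAndAverage = sumAndCount.mapValues { case (sum, n) => (sum, sum.toDouble / n) }

val result = totalAndAverage.collect().toMap
```

For student 1004 the integer version yields 99, while this version yields 99.5.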

30 October 2021, 18:55 | Views: 7663
