Author: cute wolf blue sky
[Bilibili] cute wolf blue sky
[Blog] https://mllt.cc
[Blog Park] Menglang blue sky's blog
[WeChat official account] mllt9920
[Learning and communication QQ group] 238948804
[!] Start the Spark cluster
[!] Start the spark-shell
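The exact commands depend on your deployment; a minimal sketch for a standalone installation, assuming SPARK_HOME points to the Spark directory and the master runs on a host named master (both are assumptions, not from the original):
$SPARK_HOME/sbin/start-all.sh                                # start the standalone master and workers
$SPARK_HOME/bin/spark-shell --master spark://master:7077     # open an interactive shell against the cluster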
Spark 2.0 merges SparkContext and HiveContext into SparkSession.
In the spark-shell, the spark object (a SparkSession) can also be used as the program entry point.
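A minimal sketch in the spark-shell (the spark and sc objects are created by the shell itself):
spark                 // a SparkSession, the unified entry point since Spark 2.0
spark.sparkContext    // the underlying SparkContext (the same object the shell exposes as sc)
val rdd = spark.sparkContext.parallelize(List(1, 2, 3))
rdd.count             // Long = 3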
Spark programming with Scala
RDD characteristics
- An RDD is an immutable, partitioned collection of objects distributed across the cluster nodes;
- It is created through parallel transformations such as map, filter and join;
- It is rebuilt automatically on failure;
- Its storage level (memory, disk, etc.) can be controlled for reuse (see the persist sketch after this list);
- It must be serializable; when memory is insufficient it can automatically degrade to disk storage. Keeping the RDD on disk greatly reduces performance, but it is still no worse than MapReduce;
- A lost data partition can be recomputed from its lineage alone, without requiring an explicit checkpoint.
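A minimal sketch of controlling the storage level mentioned above, using the standard persist API (the RDD contents are made up for illustration):
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)
rdd.persist(StorageLevel.MEMORY_AND_DISK)  // keep in memory, spill to disk when memory is insufficient
rdd.count        // the first action materializes and caches the partitions
rdd.count        // later actions reuse the cached data instead of recomputing the lineage
rdd.unpersist()  // release the storage when the RDD is no longer needed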
Creating RDDs
Create RDD from memory
Start the spark-shell
val list = List(1,2,3)
var rdd = sc.parallelize(list)
rdd.partitions.size
Create RDD from external storage
1. Create a local file
cd /home
mkdir data
touch a.txt
- You don't have to create it in your home directory
- You can use vim to add something to a.txt
val localrdd = sc.textFile("file:///home/username/data/a.txt")
A path prefixed with file:// is read from the local file system.
localrdd.collect  // returns all elements in the RDD
Note: in fully distributed spark-shell mode, the file must exist at the same path on every node before it can be read; otherwise a "file does not exist" error is reported.
Create RDD from HDFS
1. Create a directory (named with your name/student number) under the HDFS root directory
hdfs dfs -mkdir /zwj25
hdfs dfs -ls /
Visit http://[IP]:50070
2. Upload the local file to HDFS
hdfs dfs -put file.txt /zwj25
3. Enter the spark-shell
4. Read the file from HDFS
var hdfsrdd = sc.textFile("/zwj25/file.txt")
hdfsrdd.collect
hdfsrdd.partitions
hdfsrdd.partitions.size
sc.defaultMinPartitions = min(sc.defaultParallelism, 2)
Number of RDD partitions = max(number of HDFS blocks of the file, sc.defaultMinPartitions)
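A worked example with assumed values: if sc.defaultParallelism is 8, then sc.defaultMinPartitions = min(8, 2) = 2; a file stored in 3 HDFS blocks therefore gets max(3, 2) = 3 partitions, while a file occupying a single block still gets max(1, 2) = 2 partitions.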
Create RDD from another RDD
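No example is given here in the original; a minimal sketch: any transformation derives a new RDD from an existing one, leaving the original unchanged.
val rdd1 = sc.parallelize(List(1, 2, 3, 4))
val rdd2 = rdd1.filter(x => x % 2 == 0)   // rdd2 is a new RDD derived from rdd1
rdd2.collect                               // Array(2, 4)
rdd1.collect                               // rdd1 itself is unchanged: Array(1, 2, 3, 4)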
Operators
map(func)
Type: Transformation operator
map: each element of the source RDD is passed through the user-defined function func to produce an element of the new RDD. The map operation does not change the number of partitions of the RDD.
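eg: a minimal sketch (values chosen for illustration): multiply every element by 10.
val data = sc.parallelize(List(1, 2, 3, 4))
val result = data.map(x => x * 10)
result.collect                                   // Array(10, 20, 30, 40)
data.partitions.size == result.partitions.size   // true: map keeps the partition count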
filter filtering
filter(func)
Type: Transformation operator
Keeps the elements for which func returns true, forming a new RDD
eg: filter out the elements of RDD data that are less than or equal to 2
val data = sc.parallelize(List(1,2,3,4))
val result = data.filter(x => x > 2)
result.collect
flatMap(func) splits words
Type: Transformation operator
flatMap: map and flatten each element in the collection
val data = sc.parallelize(List("I am Meng Lang Lan Tian","my wechat is mllt9920")) data.map(x=>x.split(" ")).collect data.flatMap(x=>x.split(" ")).collect
sortBy sort
sortBy(f:(T) => K, ascending, numPartitions)
Type: Transformation operator
Function: sort a standard RDD
sortBy() accepts three parameters:
f: (T) => K: takes each element of the RDD to be sorted and returns the value by which that element is sorted.
Ascending: determines whether the elements in the RDD are sorted in ascending or descending order. The default is true, that is, ascending order, and false is sorted in descending order.
numPartitions: this parameter determines the number of RDD partitions after sorting. By default, the number of partitions after sorting is equal to the number before sorting.
eg: sort in descending order according to the second value of each element, and store the obtained results in RDD "data2"
val data1 = sc.parallelize(List((1,3),(2,4),(5,7),(6,8)))
val data2 = data1.sortBy(x => x._2, false, 1)
val data3 = data1.sortBy(x => x._1, false, 1)
distinct deduplication
distinct([numPartitions]))
Type: Transformation operator
Function: deduplication. Only one copy of each duplicated element in the RDD is kept
eg:
val data1 = sc.parallelize(List(1,2,3,3,3,4,4))
data1.collect
data1.distinct.collect
data1.collect
union merge
union(otherDataset)
Function: merges two RDDs; the element types of the two RDDs must be the same
eg: merge rdd1 and rdd2
val rdd1 = sc.parallelize(List(1,2,3))
val rdd2 = sc.parallelize(List(4,5,6))
rdd1.union(rdd2).collect
Note: the element types of the two RDDs being unioned must be the same
intersection
intersection(otherDataset)
Function: finds the elements common to two RDDs, i.e. their intersection
eg: find the same element in c_rdd1 and c_rdd2
val c_rdd1 = sc.parallelize(List(('a',1),('b',2),('a',1),('c',1)))
val c_rdd2 = sc.parallelize(List(('a',1),('b',1),('d',1),('e',1)))
c_rdd1.intersection(c_rdd2).collect
subtract difference set
subtract (otherDataset)
Function: computes the difference of two RDDs (the elements of the first RDD that do not appear in the second)
eg: find the difference set between rdd1 and rdd2
val rdd1 = sc.parallelize(Array("A","B","C","D"))
val rdd2 = sc.parallelize(Array("C","D","E","F"))
val subtractRDD = rdd1.subtract(rdd2)
subtractRDD.collect
cartesian
cartesian(otherDataset)
Name: Cartesian product
Function: pairs every element of one RDD with every element of the other
eg:
val rdd01 = sc.makeRDD(List(1,3,5,3))
val rdd02 = sc.makeRDD(List(2,4,5,1))
rdd01.cartesian(rdd02).collect
take(num)
Returns the first num elements of the RDD
val data = sc.parallelize(List(1,2,3,4))
data.take(2)
Key-value pair RDDs
mapValues
val rdd = sc.parallelize(List("a","b","c","d")) //Create key value pairs through map var rddp = rdd.map(x=>(x,1)) rddp.collect rddp.keys.collect rddp.values.collect //Add one to all values through mapValues rddp.mapValues(x=>x+1).collect
val rdd1 = sc.parallelize(List("I am a student","Hello word","Just Play")) val rdd2 = rdd1.map(x=>(x,992)) rdd2.collect rdd2.keys.collect rdd2.values rdd2.values.collect
val rdd3 = sc.parallelize(List("I am a student","Hello word","Just Play"))
val rdd4 = rdd3.map(x => x.split(" "))
rdd4.collect
// Pair the split words of each line with the original line
val p1 = rdd3.map(x => (x.split(" "), x))
p1.collect
join inner join by key
val rdd = sc.parallelize(List("a","b","c","d")) //Create key value pairs through map var rddp = rdd.map(x=>(x,1)) //Add one to all values through mapValues var rdd1 = rddp.mapValues(x=>x+1) //Similarly, rdd2 is obtained val rdd2 = sc.parallelize(List("a","b","c","d","e")).map(x=>(x,1)) rdd1.collect rdd2.collect //join rdd1 and rdd2 together rdd1.join(rdd2).collect rdd2.join(rdd1).collect
leftOuterJoin and rightOuterJoin and fullOuterJoin
rightOuterJoin right outer join: every key of the second RDD appears in the result
leftOuterJoin left outer join: every key of the first RDD appears in the result
fullOuterJoin full outer join: keys from both RDDs appear in the result
// rdd1 and rdd2 continue from the example above
rdd1.collect
rdd2.collect
// Right outer join
rdd1.rightOuterJoin(rdd2).collect
// Left outer join
rdd1.leftOuterJoin(rdd2).collect
// Full outer join
rdd1.fullOuterJoin(rdd2).collect
zip
Function: zips two RDDs together as key-value pairs
- The two RDDs must have the same number of partitions (check with rdd.partitions.size)
- The two RDDs must have the same number of elements
val rdd1 = sc.parallelize(1 to 3)
val rdd2 = sc.parallelize(List("a","b","c"))
rdd1.collect
rdd2.collect
rdd2.zip(rdd1).collect
rdd1.zip(rdd2).collect
rdd1.partitions.size
rdd2.partitions.size
val rdd3 = sc.parallelize(1 to 3, 3)             // 3 is the number of partitions
val rdd4 = sc.parallelize(List("a","b","c"), 3)  // 3 is the number of partitions
rdd3.partitions.size
rdd4.partitions.size
combineByKey
Merges the values of the same key; the type of the merged result may differ from the type of the original values
Goal: collect the values of each key into a List (see the sketch below)
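The original states the goal but gives no code; a minimal sketch with made-up pairs, using combineByKey's three functions (create a combiner, merge a value into it, merge two combiners) to collect each key's values into a List:
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)))
val asLists = pairs.combineByKey(
  (v: Int) => List(v),                                  // createCombiner: the first value of a key starts a List
  (acc: List[Int], v: Int) => v :: acc,                 // mergeValue: add another value within a partition
  (acc1: List[Int], acc2: List[Int]) => acc1 ::: acc2   // mergeCombiners: concatenate Lists across partitions
)
asLists.collect   // e.g. Array((a,List(5, 3, 1)), (b,List(4, 2))); element order is not guaranteed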
groupByKey([numPartitions])
Groups by key. When called on an RDD of (K, V) pairs, it returns a new RDD of (K, Iterable[V]) pairs.
val rdd1 = sc.parallelize(List("A","B","C","C","C","D","D")).map(x=>(x,1)) rdd1.groupByKey().collect rdd1.groupByKey().collect()
reduceByKey(func, [numPartitions])
Groups an RDD of key-value pairs by key and aggregates the values of each key (identical keys are collapsed into one entry whose values are combined, e.g. counts added up)
- When called on an RDD composed of key value pairs of type (K, V), a new RDD composed of key value pairs of type (K, V) is returned
- The value of each key in the new RDD is aggregated with the given reduce function func, which must be of type (V, V) => V
- It can be used to count the number of occurrences of each key
val rdd1 = sc.parallelize(List("A","B","C","C","C","D","D")).map(x=>(x,1)) rdd1.reduceByKey((x,y)=>x+y).collect rdd1.reduceByKey((x,y)=>x+y).collect()
File reading and storage
Format         Structured   Description
text file      no           Plain text file, one record per line
SequenceFile   yes          A common Hadoop file format for key-value pair data
saveAsTextFile(path: String)
Save RDD to HDFS
val rdd1 = sc.parallelize(List(1,2,3,4))
rdd1.saveAsTextFile("/302Spark/savetext")
// Visit IP:50070 to check the result
saveAsSequenceFile and sc.sequenceFile
saveAsSequenceFile(path: String)
Sequence (serialized) files; only key-value pair RDDs are supported
sequenceFile[K, V](path: String, keyClass: Class[K], valueClass: Class[V], minPartitions: Int)
Reading a sequence file:
// To let Spark handle the Hadoop data types, import the package
import org.apache.hadoop.io._
// then read the file with sc.sequenceFile(path, keyClass, valueClass), as in the example below
example
// Save a key-value RDD as a sequence file
val rdd = sc.parallelize(List(("panda",3),("dog",6),("cat",3)))
rdd.saveAsSequenceFile("/hadoop-zwj25/testSeq")
rdd.partitions.size
// View the sequence file on HDFS
hdfs dfs -ls /hadoop-zwj25
hdfs dfs -ls /hadoop-zwj25/testSeq
hdfs dfs -cat /hadoop-zwj25/testSeq/part-00000
// Import the Hadoop data types
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable
// Read the sequence file back
// The first classOf[Text] is the type of the key
// The second classOf[IntWritable] is the type of the value
val output = sc.sequenceFile("/hadoop-zwj25/testSeq", classOf[Text], classOf[IntWritable])
// Convert the Hadoop writables back to plain values before collecting
output.map{ case (k, v) => (k.toString, v.get) }.collect
rdd.collect
val rddtest = sc.parallelize(List(1,2,3))
rddtest.collect
repartition() repartition
repartition(numPartitions: Int)
- You can increase or decrease the level of parallelism in this RDD. Internally, it redistributes data using shuffle.
- If you want to reduce the number of partitions in this RDD, consider using coalesce to avoid shuffling.
coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
Query the number of partitions: partitions.size
rdd.repartition(numPartitions: Int).partitions.size
To reduce the number of partitions, consider using coalesce to avoid a shuffle (see the sketch after the example below)
val rdd1 = sc.parallelize(List(1,2,3,4))
rdd1.saveAsTextFile("/302Spark/savetext")
// Visit IP:50070 to check the result
// ---------------------------------------
rdd1.partitions.size
rdd1.repartition(1).partitions.size
rdd1.repartition(1).saveAsTextFile("/302Spark/savetext1")
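A minimal sketch of coalesce (part of the standard RDD API): with the default shuffle = false it only merges existing partitions, so it can lower the partition count without a shuffle but cannot raise it.
val rdd = sc.parallelize(1 to 100, 4)
rdd.partitions.size                              // 4
rdd.coalesce(2).partitions.size                  // 2: partitions merged, no shuffle
rdd.coalesce(8).partitions.size                  // still 4: cannot grow without a shuffle
rdd.coalesce(8, shuffle = true).partitions.size  // 8: allowing a shuffle lets it increase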
Practice
Practice01
Task: find the student IDs whose test score is 100, and collect the result into an RDD.
Source data: paste the content of the following block into the text file result_bigdata.txt (fields separated by tabs, since the code splits on \t)
1001 Big data foundation 90
1002 Big data foundation 94
1003 Big data foundation 100
1004 Big data foundation 99
1005 Big data foundation 90
1006 Big data foundation 94
1007 Big data foundation 100
1008 Big data foundation 93
1009 Big data foundation 89
1010 Big data foundation 78
1011 Big data foundation 91
1012 Big data foundation 84
Code:
// Create an RDD from the local file
val rdd_bigdata = sc.textFile("file:///home/username/result_bigdata.txt")
// Output a couple of records as a test
rdd_bigdata.take(2)
// View all records
rdd_bigdata.collect
// This approach converts the score to Int
val bigdata_100 = rdd_bigdata.map(x => x.split("\t")).map(x => (x(0), x(1), x(2).toInt)).filter(x => x._3 == 100).map(x => x._1)
bigdata_100.collect
// This approach does not need the Int conversion
val bigdata_100 = rdd_bigdata.map(x => x.split("\t")).filter(x => x(2) == "100").map(x => x(0))
bigdata_100.collect
Practice02
Task: output each student's total score by adding the scores with the same student ID from the two score sheets.
Source data: paste the content of the following block into the text file score.txt
math John 90
math Betty 88
math Mike 95
math Lily 92
chinese John 78
chinese Betty 80
chinese Mike 88
chinese Lily 85
english John 92
english Betty 84
english Mike 90
english Lily 85
Please paste the contents of the following code block into the text document result_math.txt
1001 applied mathematics 96
1002 applied mathematics 94
1003 applied mathematics 100
1004 applied mathematics 100
1005 applied mathematics 94
1006 applied mathematics 80
1007 applied mathematics 90
1008 applied mathematics 94
1009 applied mathematics 84
1010 applied mathematics 86
1011 applied mathematics 79
1012 applied mathematics 91
Please paste the contents of the following code block into the text document result_bigdata.txt
1001 Big data foundation 90
1002 Big data foundation 94
1003 Big data foundation 100
1004 Big data foundation 99
1005 Big data foundation 90
1006 Big data foundation 94
1007 Big data foundation 100
1008 Big data foundation 93
1009 Big data foundation 89
1010 Big data foundation 78
1011 Big data foundation 91
1012 Big data foundation 84
Code:
// Create RDDs from the local files
val rdd_bigdata = sc.textFile("file:///home/hadoop-zwj25/result_bigdata.txt")
val rdd_math = sc.textFile("file:///home/hadoop-zwj25/result_math.txt")
// Return all elements of each RDD
rdd_bigdata.collect
rdd_math.collect
// Merge the two RDDs
val rddall = rdd_math.union(rdd_bigdata)
rddall.collect
rddall.map(x => x.split("\t")).map(x => (x(0), x(2).toInt)).reduceByKey((x, y) => x + y).collect
Practice03
Task: 1. Output each student's average score: add the scores with the same ID in the two score sheets and divide by the number of subjects.
2. Combine the total score and average score of each student
Complete both the average-score and merge tasks.
Source data: the two result files from Practice02 (result_bigdata.txt and result_math.txt)
Code:
// 1. Create the RDDs and transform them
val math = sc.textFile("file:///home/hadoop-zwj25/result_math.txt")
val math_map = math.map(x => x.split("\t")).map(x => (x(0), x(2).toInt))
math_map.collect
val bigdata = sc.textFile("file:///home/hadoop-zwj25/result_bigdata.txt")
val bigdata_map = bigdata.map(x => x.split("\t")).map(x => (x(0), x(2).toInt))
bigdata_map.collect
// Merge the two key-value RDDs and take a few records as a test
math_map.union(bigdata_map).take(3)
// Attach a count to every score, starting from 1; the result has the form (ID, (score, count))
math_map.union(bigdata_map).mapValues(x => (x, 1)).take(3)
// Note that the converted type must stay consistent with the original value type
// reduceByKey((x, y) => ((x._1 + y._1), (x._2 + y._2))) works as follows:
//   x and y are (score, count) pairs for the same key
//   (x._1 + y._1) is the total score (the scores of the subjects added together)
//   (x._2 + y._2) is the total number of subjects (the two counts added together)
math_map.union(bigdata_map).mapValues(x => (x, 1)).reduceByKey((x, y) => ((x._1 + y._1), (x._2 + y._2))).take(3)
// Group and aggregate the merged RDD by key, then compute the average
// average score = total score / number of subjects
// x._1 is the ID; x._2 is (total score, count)
// so the average is x._2._1 / x._2._2
val pj = math_map.union(bigdata_map).mapValues(x => (x, 1)).reduceByKey((x, y) => ((x._1 + y._1), (x._2 + y._2))).map(x => (x._1, (x._2._1 / x._2._2)))
// Check the result
pj.collect
// Output the total score
val zf = math_map.union(bigdata_map).reduceByKey((x, y) => x + y)
zf.collect
pj.count
zf.count
// Combine each student's total score and average score
zf.join(pj).count