Spark's core data abstraction is the resilient distributed dataset (RDD): a dataset distributed across multiple nodes, all or part of which can be cached in memory and reused across multiple computations.
The main features of an RDD are:
- An RDD is immutable; transformations do not modify it in place but produce a new RDD.
- An RDD is partitioned: it consists of many partitions, and each partition is processed by one Task (partitions are explained in detail in Section 3.4).
- An operation on an RDD is applied to each of its partitions.
- An RDD has a set of functions for computing its partitions, called operators (operators are explained in detail in Section 3.3).
- RDDs have dependencies on one another, which enables pipelining and avoids materializing intermediate data.
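The immutability point can be illustrated without a cluster. The sketch below uses plain Scala collections rather than Spark (the `RddStyleDemo` name is my own) and mimics the transformation style used throughout this section: each step returns a new collection and the source data is never modified, just as an RDD transformation always yields a new RDD:

```scala
object RddStyleDemo {
  def main(args: Array[String]): Unit = {
    // Three toy rating lines in the UserID::MovieID::Rating format used below
    val ratings = Seq("1::1193::5", "1::661::3", "2::1193::4")

    // Each transformation returns a new collection; `ratings` itself is untouched,
    // the same way a Spark transformation yields a new RDD from its parent.
    val counts = ratings
      .map(_.split("::")(1))                  // extract the movie ID
      .groupBy(identity)                      // analogous to reduceByKey on (id, 1)
      .map { case (id, xs) => (id, xs.size) }

    println(counts.toSeq.sortBy(-_._2))       // movie 1193 appears twice
  }
}
```

In Spark the same chain would run per partition in parallel, and the dependency between the steps is what lets the scheduler pipeline them.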
1. Test data description
The well-known MovieLens movie recommendation dataset, from the official site https://grouplens.org/datasets/movielens/
The small dataset is used here for testing; once the logic works, you can try the larger datasets.
There are three files in the dataset:
- movies.dat
- ratings.dat
- users.dat
ratings.dat
This is the movie rating file; its fields are:

```
UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
```

- Each user has at least 20 ratings
- Rating: an integer from 1 to 5
movies.dat
This is the movie file; its fields are:

```
MovieID::Title::Genres
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
```

The genres are:
Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western
users.dat
This is the user file; its fields are:

```
UserID::Gender::Age::Occupation::Zip-code
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
```

Age is coded as a range:

```
 1: "Under 18"
18: "18-24"
25: "25-34"
35: "35-44"
45: "45-49"
50: "50-55"
56: "56+"
```

Occupation is an enumerated code with the following meanings:

```
 0: "other" or not specified
 1: "academic/educator"
 2: "artist"
 3: "clerical/admin"
 4: "college/grad student"
 5: "customer service"
 6: "doctor/health care"
 7: "executive/managerial"
 8: "farmer"
 9: "homemaker"
10: "K-12 student"
11: "lawyer"
12: "programmer"
13: "retired"
14: "sales/marketing"
15: "scientist"
16: "self-employed"
17: "technician/engineer"
18: "tradesman/craftsman"
19: "unemployed"
20: "writer"
```
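Since Age and Occupation are stored as codes, decoding them makes printed results easier to read. A minimal sketch with plain Scala maps (the `Codes` object and its field names are my own, not part of the dataset or of the exercises below):

```scala
object Codes {
  // Age codes, from the dataset description above
  val ageRanges: Map[Int, String] = Map(
    1 -> "Under 18", 18 -> "18-24", 25 -> "25-34", 35 -> "35-44",
    45 -> "45-49", 50 -> "50-55", 56 -> "56+"
  )

  // A few of the occupation codes, for illustration
  val occupations: Map[Int, String] = Map(
    0 -> "other", 1 -> "academic/educator", 12 -> "programmer", 20 -> "writer"
  )

  def main(args: Array[String]): Unit = {
    // Decode the Age field (index 2) of a users.dat line
    val fields = "1::F::1::10::48067".split("::")
    println(ageRanges(fields(2).toInt))  // prints "Under 18"
  }
}
```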
2. RDD usage exercises
1. Top 10 most-rated movies
- First split each line and map the movie ID to a count of 1
- Aggregate by movie ID to count how often each movie was rated
- Swap the movie ID and its count so the count sits in the key position
- Sort by count
- Print the first ten entries
```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}

object MovieLen {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val dataPath = "/home/ffzs/data/ml-1m"
    val conf = new SparkConf()
    conf.setAppName("movieLen")
    conf.setMaster("local[*]")
    val sc = new SparkContext(conf)
    val ratingsRdd = sc.textFile(f"$dataPath/ratings.dat")

    // Top 10 movies by number of ratings
    ratingsRdd.map(_.split("::"))    // split each line into fields
      .map(_(1) -> 1)                // key each movie ID with a count of 1
      .reduceByKey(_ + _)            // aggregate by movie ID to count ratings
      .map(it => (it._2, it._1))     // swap so the count becomes the key
      .sortByKey(false)              // sort by count, descending
      .take(10)                      // take the top 10
      .foreach(println)
  }
}
```
Result output:
2. Top 10 movies by average rating
- Aggregate each movie's total score and number of ratings by movie ID
- Compute each movie's average rating
- Sort and output the top 10
```scala
println("Top 10 movies by average rating")
ratingsRdd.map(_.split("::"))
  .map(it => (it(1), (it(2).toDouble, 1)))            // convert the score to Double for division
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))  // total score and rating count per movie
  .map(it => (it._2._1 / it._2._2, it._1))            // average = total score / rating count
  .sortByKey(false)                                   // sort by average rating, descending
  .take(10)                                           // take the top 10
  .foreach(println)
```
Output results:
3. Top 10 best-rated movies among men
- Join the ratings with the users' gender by user ID
- Filter the rows by gender
- Apply the average-rating logic above to pick the top 10
```scala
val gender = "M"
val genderMap: Map[String, String] = Map("M" -> "male", "F" -> "female")
val usersRdd = sc.textFile(f"$dataPath/users.dat")
println(f"Top 10 movies among ${genderMap(gender)} users:")
ratingsRdd.map(_.split("::"))
  .map(x => (x(0), (x(0), x(1), x(2))))
  .join(                                    // join ratings with user gender by user ID
    usersRdd.map(_.split("::"))
      .map(x => x(0) -> x(1))
  )
  .filter(_._2._2.equals(gender))           // keep only the chosen gender
  .map(it => (it._2._1._2, (it._2._1._3.toDouble, 1)))  // (movie ID, (score, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))    // total score and rating count
  .map(it => (it._2._1 / it._2._2, it._1))  // average rating as the sort key
  .sortByKey(ascending = false)
  .map(it => it._2 -> it._1)
  .take(10)
  .foreach(println)
```
The output result is:
4. Secondary sort of the ratings by score and time
- Build a secondary-sort key class from the score and the timestamp, with both sorted in descending order
- Sort the keys with this class
- Output each line of data
```scala
class SecondSortKey(val first: Int, val second: Int) extends Ordered[SecondSortKey] with Serializable {
  override def compare(that: SecondSortKey): Int = {
    if (this.first != that.first) {
      this.first - that.first
    } else {
      this.second - that.second
    }
  }
}
```
```scala
ratingsRdd.map(line => {
    val row = line.split("::")
    (new SecondSortKey(row(2).toInt, row(3).toInt), line)  // key = (score, timestamp)
  })
  .sortByKey(false)  // the natural order is ascending, so false gives descending
  .map(_._2)
  .take(10)
  .foreach(println)
```
5. Top 10 movie genres
- Split each movie's genre field with flatMap
- Count the occurrences of each genre
- Sort and output the first 10
```scala
println("Top 10 movie genres")
val movieRdd = sc.textFile(f"$dataPath/movies.dat")
movieRdd.map(_.split("::")(2))     // take the genre field
  .flatMap(_.split("\\|"))         // split multi-genre entries
  .map((_, 1))
  .reduceByKey(_ + _)              // count each genre
  .map(it => (it._2, it._1))       // swap so the count becomes the key
  .sortByKey(ascending = false)    // sort by count, descending
  .map(it => it._2 -> it._1)
  .take(10)
  .foreach(println)
```
Output:
6. Daily new users
- First convert the timestamp to a date
- Group by user ID and take the smallest date in each group, i.e. the date the user first appeared
- Aggregate by date
- Sort and output the result
```scala
import java.text.SimpleDateFormat

println("Top 10 days by new users")
val sdf = new SimpleDateFormat("yyyy-MM-dd")
ratingsRdd.map(_.split("::"))
  .map(it => (it(0), it(3).toLong * 1000))  // timestamps are in seconds; Date wants milliseconds
  .map(it => (it._1, sdf.format(it._2)))    // format as yyyy-MM-dd
  .groupByKey()                             // group by user ID
  .map(it => (it._2.min, 1))                // each user's earliest date is the day they first appeared
  .reduceByKey(_ + _)                       // count new users per day
  .map(it => it._2 -> it._1)
  .sortByKey(ascending = false)             // sort by count, descending
  .map(it => it._2 -> it._1)
  .take(10)
  .foreach(println)
```
The output result is: