Spark advanced: RDD usage

Spark's core data abstraction is the resilient distributed dataset (RDD): a dataset distributed across multiple nodes, all or part of which can be cached in memory and reused across multiple computations.

The main features of RDD are as follows:

  • An RDD is immutable, but it can be transformed into a new RDD.
  • An RDD is partitioned: it consists of many partitions, and each partition is processed by one Task (partitions are explained in detail in Section 3.4).
  • Operating on an RDD is equivalent to operating on each of its partitions.
  • An RDD has a series of functions for computing its partitions, called operators (operators are explained in detail in Section 3.3).
  • There are dependencies between RDDs, which enable pipelining and avoid materializing intermediate data.
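The features above can be seen in a minimal local-mode sketch (the object name `RddFeaturesDemo` and the `local[2]` master setting are assumptions for illustration, not part of the original exercise):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddFeaturesDemo {
  def main(args: Array[String]): Unit = {
    // Local mode with 2 threads, just for experimenting.
    val conf = new SparkConf().setMaster("local[2]").setAppName("RddFeaturesDemo")
    val sc = new SparkContext(conf)

    // An RDD is partitioned: here it is split into 4 partitions,
    // and each partition is processed by one task.
    val nums = sc.parallelize(1 to 10, numSlices = 4)

    // An RDD is immutable: map does not change nums, it produces a new RDD
    // that depends on nums, so the two stages can be pipelined.
    val squares = nums.map(n => n * n)

    println(nums.getNumPartitions)  // 4
    println(squares.sum())          // 385.0

    sc.stop()
  }
}
```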

1, Test data description

The well-known MovieLens movie recommendation dataset, available from the MovieLens official website.

Here the small dataset is used for testing; once the logic works, you can try the larger datasets.

There are three files in the dataset:

  • movies.dat
  • ratings.dat
  • users.dat


ratings.dat is the ratings file, with the following fields:

 UserID::MovieID::Rating::Timestamp
  • Each user has at least 20 ratings
  • Ratings are on a 1-5 scale


movies.dat is the movies file, with the following fields:

 MovieID::Title::Genres
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
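Each movies.dat line splits on "::", and the genre field splits further on "|". A small sketch of that parsing (the object name `ParseMovieLine` is an assumption):

```scala
object ParseMovieLine {
  def main(args: Array[String]): Unit = {
    val line = "1::Toy Story (1995)::Animation|Children's|Comedy"

    // Split the line into its three fields.
    val Array(movieId, title, genres) = line.split("::")

    // The genre field is itself a |-separated list; | must be escaped
    // because String.split takes a regex.
    val genreList = genres.split("\\|").toList

    println(movieId)    // 1
    println(title)      // Toy Story (1995)
    println(genreList)  // List(Animation, Children's, Comedy)
  }
}
```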

The film types are as follows:

Action
Adventure
Animation
Children's
Comedy
Crime
Documentary
Drama
Fantasy
Film-Noir
Horror
Musical
Mystery
Romance
Sci-Fi
Thriller
War
Western


users.dat is the users file, with the following fields:

 UserID::Gender::Age::Occupation::Zip-code

Age denotes a range:

 1:  "Under 18"
18:  "18-24"
25:  "25-34"
35:  "35-44"
45:  "45-49"
50:  "50-55"
56:  "56+"

Occupation is an enumerated number, with the following meanings:

 0:  "other" or not specified
 1:  "academic/educator"
 2:  "artist"
 3:  "clerical/admin"
 4:  "college/grad student"
 5:  "customer service"
 6:  "doctor/health care"
 7:  "executive/managerial"
 8:  "farmer"
 9:  "homemaker"
10:  "K-12 student"
11:  "lawyer"
12:  "programmer"
13:  "retired"
14:  "sales/marketing"
15:  "scientist"
16:  "self-employed"
17:  "technician/engineer"
18:  "tradesman/craftsman"
19:  "unemployed"
20:  "writer"
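The two tables above can be turned into lookup maps when decoding a users.dat line. The helper `describeUser` below is a hypothetical sketch, not part of the dataset's tooling; the sample line is the format shown above:

```scala
object DecodeUser {
  // Age and occupation codes from the tables above.
  val ageLabel: Map[Int, String] = Map(
    1 -> "Under 18", 18 -> "18-24", 25 -> "25-34", 35 -> "35-44",
    45 -> "45-49", 50 -> "50-55", 56 -> "56+")

  val occupationLabel: Map[Int, String] = Map(
    0 -> "other", 1 -> "academic/educator", 2 -> "artist",
    3 -> "clerical/admin", 4 -> "college/grad student", 5 -> "customer service",
    6 -> "doctor/health care", 7 -> "executive/managerial", 8 -> "farmer",
    9 -> "homemaker", 10 -> "K-12 student", 11 -> "lawyer",
    12 -> "programmer", 13 -> "retired", 14 -> "sales/marketing",
    15 -> "scientist", 16 -> "self-employed", 17 -> "technician/engineer",
    18 -> "tradesman/craftsman", 19 -> "unemployed", 20 -> "writer")

  // Hypothetical helper: decode one users.dat line into a readable string.
  def describeUser(line: String): String = {
    val Array(id, gender, age, occ, zip) = line.split("::")
    s"user $id: $gender, ${ageLabel(age.toInt)}, ${occupationLabel(occ.toInt)}, zip $zip"
  }

  def main(args: Array[String]): Unit = {
    println(describeUser("1::F::1::10::48067"))  // user 1: F, Under 18, K-12 student, zip 48067
  }
}
```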

2, RDD usage exercises

1. Get the top 10 most-rated movies

  • First split each line of the data to get the movie ID of every rating
  • Aggregate by movie ID to count occurrences
  • Swap the movie ID and the count so the count becomes the key
  • Sort by the number of occurrences
  • Print the first ten
import org.apache.spark.{SparkConf, SparkContext}
import java.text.SimpleDateFormat

object MovieLen {
  def main(args: Array[String]): Unit = {
    val dataPath = "/home/ffzs/data/ml-1m"

    // Run locally; adjust the master setting for a cluster.
    val conf = new SparkConf().setMaster("local[*]").setAppName("MovieLen")

    val sc = new SparkContext(conf)
    val ratingsRdd = sc.textFile(f"$dataPath/ratings.dat")
      .map(_.split("::"))  // split each line into fields

    // Top 10 movies by number of ratings
    println("Top 10 most-rated movies")
    ratingsRdd
      .map(_(1) -> 1)             // (movie ID, 1)
      .reduceByKey(_ + _)         // count how many times each movie was rated
      .map(it => (it._2, it._1))  // swap so the count becomes the key
      .sortByKey(false)           // sort by count, descending
      .take(10)                   // take the first 10
      .foreach(println)

Result output:

2. Get the word-of-mouth top 10

  • Aggregate each movie's total score and total number of ratings by movie ID
  • Compute the average score for each movie
  • Sort and output the top 10
    println("Top 10 movies by word of mouth (average rating)")
    ratingsRdd
      .map(it => (it(1), (it(2).toDouble, 1)))  // rating as Double so division works
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))  // total score and total count per movie
      .map(it => (it._2._1 / it._2._2, it._1))  // average = total score / total count
      .sortByKey(false)  // sort by average rating, descending
      .take(10)          // take the top 10
      .foreach(println)

Output results:

3. Top 10 movies with the best reputation among men

  • Join the ratings with user gender
  • Filter the joined data by gender
  • Select the top 10 using the same word-of-mouth logic as above
    val gender = "M"
    val genderMap: Map[String, String] = Map("M" -> "male", "F" -> "female")
    println(f"Top 10 movies among ${genderMap(gender)} users:")
    val usersRdd = sc.textFile(f"$dataPath/users.dat")
      .map(_.split("::"))
    ratingsRdd
      .map(x => (x(0), (x(0), x(1), x(2))))   // keyed by user ID
      .join(usersRdd.map(x => x(0) -> x(1)))  // join ratings with user gender on user ID
      .filter(_._2._2.equals(gender))         // keep only the chosen gender
      .map(it => (it._2._1._2, (it._2._1._3.toDouble, 1)))  // (movie ID, (score, 1))
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
      .map(it => (it._2._1 / it._2._2, it._1))  // average score as the key
      .sortByKey(ascending = false)
      .map(it => it._2 -> it._1)                // swap back to (movie ID, average)
      .take(10)
      .foreach(println)

The output result is:

4. Secondary sort of ratings by score and time

  • Build a secondary-sort key class from score and timestamp, both sorted in descending order
  • Sort the keys using this class
  • Output each line of data
class SecondSortKey(val first: Int, val second: Int) extends Ordered[SecondSortKey] with Serializable {
  override def compare(that: SecondSortKey): Int =
    if (this.first != that.first) this.first - that.first else this.second - that.second
}

    sc.textFile(f"$dataPath/ratings.dat")
      .map(line => {
        val row = line.split("::")
        (new SecondSortKey(row(2).toInt, row(3).toInt), line)
      })
      .sortByKey(ascending = false)
      .map(_._2)  // keep only the original line
      .take(10).foreach(println)

5. Film genre top 10

  • Split each movie's genre field with flatMap
  • Count the occurrences of each genre
  • Sort and output the first 10
    println("Film genre top 10")
    sc.textFile(f"$dataPath/movies.dat")
      .flatMap(_.split("::")(2).split("\\|"))  // split out each genre of each movie
      .map((_, 1))
      .reduceByKey(_ + _)         // count each genre
      .map(it => (it._2, it._1))  // count becomes the key
      .sortByKey(ascending = false)
      .map(it => it._2 -> it._1)
      .take(10)
      .foreach(println)


6. Daily new users

  • First convert the timestamp to a date
  • Group by user ID and take the smallest date per user: the date the user first appeared
  • Aggregate by that date to count new users
  • Sort and output the result
    println("Daily new users top 10")
    val sdf = new SimpleDateFormat("yyyy-MM-dd")
    ratingsRdd
      .map(it => (it(0), it(3).toLong * 1000))  // (user ID, timestamp in ms)
      .map(it => (it._1, sdf.format(it._2)))    // timestamp -> date string
      .groupByKey()                             // group all dates per user
      .map(it => (it._2.min, 1))                // earliest date = the day the user appeared
      .reduceByKey(_ + _)                       // new users per day
      .map(it => it._2 -> it._1)                // count becomes the key
      .sortByKey(ascending = false)
      .map(it => it._2 -> it._1)
      .take(10)
      .foreach(println)

The output result is:

Tags: Scala Spark

Posted on Wed, 06 Oct 2021 15:08:26 -0400 by jazz_snob