Spark Streaming: stateful computation with updateStateByKey

Contents

1. Theoretical basis

2. Code test: wordCount

1. Code

2. Test data

3. Results display

1. Theoretical basis

1. In stream processing there is often a need for stateful computation: the current result depends not only on the data received in the current batch, but also on the results of previous batches, which must be merged in. Because of Spark Streaming's mini-batch mechanism, the previous state has to be stored (as an RDD) and retrieved for merging when the next batch is computed; this is exactly what the updateStateByKey method is for.

2. The updateStateByKey operation lets us maintain a piece of state for each key and continuously update it:

(1) First, define a state, which can be of any data type.

(2) Second, define a state update function that specifies how to compute the new state from the previous state and the values newly received for the key.

(3) For each batch, Spark applies the state update function to every key that already has state, whether or not new data arrived for that key in the batch. If the update function returns None, the state for that key is removed. For keys appearing for the first time, the update function is executed as well. A minimal sketch of such an update function is shown right below.
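As a minimal sketch of points (1)-(3) above (the names updateCount and pairs are illustrative, not taken from the original example), an update function receives the values that arrived for a key in the current batch together with the key's previous state and returns the new state; returning None removes the key's state:

// newValues: all values that arrived for this key in the current micro-batch
// prevState: the state carried over from earlier batches; None if the key is new
def updateCount(newValues: Seq[Int], prevState: Option[Int]): Option[Int] = {
  val updated = prevState.getOrElse(0) + newValues.sum
  if (updated == 0) None   // returning None deletes this key's state
  else Some(updated)       // the value returned here becomes the state for the next batch
}

// Usage on a DStream of (word, 1) pairs:
// val counts = pairs.updateStateByKey(updateCount)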

3. To use the updateStateByKey operation, the checkpoint mechanism must be enabled.

With checkpointing, the state for each key lives not only in memory but also in the checkpoint directory. If a key's state needs to be kept for a long time, Spark Streaming requires checkpointing so that data lost from memory can be recovered from the checkpoint. A minimal setup sketch follows.
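A minimal setup sketch, assuming the checkpoint directory path and object name are arbitrary choices; StreamingContext.getOrCreate rebuilds the streaming context from the checkpoint after a restart, or creates a fresh one if no checkpoint exists:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  val checkpointDir = "checkpoint"   // use a reliable path (e.g. HDFS) in production

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CheckpointSketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)    // must be set before calling updateStateByKey
    // ... build the stateful DStream pipeline here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover the context from the checkpoint if present, otherwise create a new one
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}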

2. Code test: wordCount

1. Code

package main.scala.com.cn.sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Spark Streaming example: monitor a directory in real time and keep a running word count.
  */
object SparkStreamingSimpleExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("SparkStreamingSimpleExample")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second batch interval

    // Checkpointing must be enabled for updateStateByKey; state is persisted here
    ssc.checkpoint("checkpoint")
    ssc.sparkContext.setLogLevel("WARN")
    // Monitor the directory; each new file is read as comma-separated words
    val dStream = ssc.textFileStream("C:\\Users\\ddd\\Desktop\\aa")
    val valueMap = dStream.flatMap(d => d.split(",")).map(s => (s, 1))
    //val countResult = valueMap.reduceByKey(_ + _)
    //val countResult = valueMap.updateStateByKey { (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0)) }
    // Merge each batch's counts into the running state kept across batches
    val countResult = valueMap.updateStateByKey(updateFunction)
    countResult.print()
    ssc.start()
    ssc.awaitTermination()
  }

  // newValues: counts that arrived for this key in the current batch;
  // state: the running total carried over from previous batches
  def updateFunction(newValues: Seq[Int], state: Option[Int]): Option[Int] = {
    val previousCount = state.getOrElse(0)
    Some(newValues.sum + previousCount)
  }
}

2. Test data

First, copy t1.txt into the monitored directory; its content is:

spark,hadoop

Then copy t2.txt into the same directory; its content is:

livy,spark

3. Results display

-------------------------------------------
Time: 1581822110000 ms
-------------------------------------------
(spark,1)
(hadoop,1)

-------------------------------------------
Time: 1581822115000 ms
-------------------------------------------
(livy,1)
(spark,2)
(hadoop,1)

Because the state is carried across batches, the second batch merges the new occurrence of spark with its previous count to produce (spark,2), while hadoop keeps its count of 1 even though no new data arrived for it.

 
