Spark advanced: Spark Streaming usage

Spark Streaming is an extension of the Spark Core API (Spark RDD) that supports scalable, high-throughput, fault-tolerant processing of real-time data streams. Data can be ingested from sources such as Kafka, Flume, Kinesis, or TCP sockets, and processed with complex algorithms expressed through high-level functions such as map(), reduce(), join(), and window().

Spark Streaming provides a high-level abstraction called DStream (discretized stream). Internally, the input data stream is split into batches, and each batch is actually an RDD; a DStream is therefore composed of multiple RDDs and is equivalent to a sequence of RDDs.
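
To make the batch-as-RDD relationship concrete, each micro-batch can be accessed as an ordinary RDD through foreachRDD(). A minimal sketch (the object name is illustrative; the socket source, port, and 10-second batch interval match the example used later):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsRdds {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamAsRdds")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each batch of the DStream is exposed as an ordinary RDD,
    // so the full RDD API is available inside foreachRDD()
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time holds ${rdd.count()} records in ${rdd.getNumPartitions} partitions")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}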

Similar to RDDs, DStreams support many of the operators available on ordinary RDDs. Using these operators, you can transform the data in an input DStream to create a new DStream. There are three main kinds of operations on a DStream: stateless operations, stateful operations, and window operations.
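
Of these, a window operation aggregates the last several batches rather than a single one. A minimal sketch of reduceByKeyAndWindow(), assuming the same socket source and 10-second batch interval as the examples below (the 30-second window length is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))

    // Count words over the last 30 seconds, recomputed every 10 seconds
    val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}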

A stateful operation accumulates data across the current batch and historical batches; that is, processing the current batch requires data or intermediate results from previous batches. Using the updateStateByKey() operator, you can maintain state per key and continuously update the previous state with new values.
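
The update function passed to updateStateByKey() receives all new values for a key in the current batch plus that key's previous state, and returns the new state; returning None removes the key from the state. A minimal sketch of such a function (the zero-total check is only there to illustrate dropping a key; the word-count examples below simply keep every key):

// (new values for a key in this batch, previous state) => new state
// Returning None removes the key from the state; here keys whose
// running total is zero are dropped, all others keep their sum
val runningSum: (Seq[Int], Option[Int]) => Option[Int] = (newValues, runningCount) => {
  val total = newValues.sum + runningCount.getOrElse(0)
  if (total == 0) None else Some(total)
}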

Calling the persist() method on a DStream persists each RDD of the DStream in memory. This is useful when the data in a DStream is computed multiple times (for example, multiple operations on the same data). For window-based operations (such as reduceByWindow() and reduceByKeyAndWindow()) and state-based operations (such as updateStateByKey()), persist() is enabled by default.
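
A minimal sketch of caching a DStream explicitly because two different computations reuse it (the object name and the pair of actions, a total count and a per-word count, are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PersistExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("PersistExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    words.persist()  // Cache each batch RDD, since the two computations below reuse it

    words.count().print()                         // Total number of words per batch
    words.map((_, 1)).reduceByKey(_ + _).print()  // Count per word in each batch

    ssc.start()
    ssc.awaitTermination()
  }
}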

Spark Streaming applications must run around the clock, so failures unrelated to the application logic (such as system failures or JVM crashes) should not bring them down. Spark Streaming therefore needs to checkpoint enough information to a fault-tolerant storage system to be able to recover from failures.
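
Beyond calling ssc.checkpoint() as in the examples below, recovering from a driver failure also requires rebuilding the StreamingContext itself from the checkpoint directory with StreamingContext.getOrCreate(). A minimal sketch, assuming the same HDFS checkpoint path and socket source used later:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  val checkpointDir = "hdfs://localhost:9000/spark-ck"  // Fault-tolerant checkpoint directory

  // Builds a fresh context and defines the streaming job;
  // only called when no checkpoint exists yet
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CheckpointedApp")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover the context from the checkpoint if present, otherwise create it
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}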

1. Receive and accumulate socket data

  • Create a StreamingContext
  • Set a checkpoint
  • Connect to a port to receive data
  • Perform stateful accumulation
/**
 * @author: ffzs
 * @Date: 2021/10/8 5:55 PM
 */
import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.DStream

object SparkStream {
  // (new values for a key in this batch, previous state) => new state
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] = (values: Seq[Int], state: Option[Int]) => {
    val currentCount = values.sum           // Count of the word in the current batch
    val previousCount = state.getOrElse(0)  // Count accumulated in previous batches (0 if none)
    Some(currentCount + previousCount)      // New accumulated total
  }
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("WordCount")
    val ssc = new StreamingContext(conf, Seconds(10))  // Batch interval of 10 seconds
    ssc.checkpoint("hdfs://localhost:9000/spark-ck")  // Set the checkpoint directory

    val lines = ssc.socketTextStream("localhost", 9999)  // DStream that receives data from the socket
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val result: DStream[(String, Int)] = pairs.updateStateByKey(updateFunc)  // Perform stateful accumulation
    result.print()  // Print the first 10 elements of each batch

    ssc.start()   // Start the streaming computation
    ssc.awaitTermination()   // Wait for the computation to terminate
  }
}

Open a terminal and listen on port 9999 with nc:

(base) [~/softwares/kafka_2.12-2.8.1]$ nc -lk 9999
1 2 3 4 5 6
hello world

2. Word count on a Kafka source

Add dependencies first:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
  <version>3.1.2</version>
</dependency>
  • Create a StreamingContext
  • Configure Kafka
  • Connect directly via KafkaUtils.createDirectStream
  • Perform stateful accumulation on the resulting DStream
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.kafka.common.serialization.StringDeserializer

/**
 * @author: ffzs
 * @Date: 2021/10/9 7:21 PM
 */
object StreamKafka {
  // (new values for a key in this batch, previous state) => new state
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] = (values: Seq[Int], state: Option[Int]) => {
    val currentCount = values.sum           // Count of the word in the current batch
    val previousCount = state.getOrElse(0)  // Count accumulated in previous batches (0 if none)
    Some(currentCount + previousCount)      // New accumulated total
  }
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("StreamKafka")

    val ssc = new StreamingContext(conf, Seconds(10))  // Batch interval of 10 seconds
    ssc.checkpoint("hdfs://localhost:9000/spark-ck")  // Set the checkpoint directory

    val kafkaTopics = Array("topictest")  // Multiple Kafka topics can be subscribed to here

    // Kafka configuration
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "ffzs-ub:9092",
      "key.deserializer" -> classOf[StringDeserializer],   // Classes used to deserialize keys and values
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "1",  // Consumer group ID; consumers with the same ID belong to the same group
      "enable.auto.commit" -> (false: java.lang.Boolean)  // Disable Kafka's automatic offset commit (true by default) so offsets are managed via Spark
    )

    // Create DStream
    val inputStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String,String](
      ssc,
      LocationStrategies.PreferConsistent,
      Subscribe[String, String](kafkaTopics,kafkaParams)
    )

    // Parse and fetch key and value
    val linesDStream = inputStream.map(record => (record.key(), record.value()))
    val word = linesDStream.map(_._2)
      .flatMap(_.split(" "))
      .map(it => (it, 1))

    val result = word.updateStateByKey(updateFunc)
    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Open a Kafka producer in a terminal:

(base) [~/softwares/kafka_2.12-2.8.1]$ kafka-console-producer.sh --broker-list ffzs-ub:9092 --topic topictest
>a b b b b bbc c
>hello world
>ffzs is a good man
