Kafka offset management

1. Definition

Each partition in Kafka is an ordered, immutable sequence of messages that is continuously appended to. Each message in a partition is assigned a sequential id number, called the offset, which uniquely identifies the message within that partition.

The offset records the position of the next message to be delivered to the Consumer.

The semantics of streaming systems are often captured in terms of how many times each record can be processed by the system. There are three types of guarantees that a system can provide under all possible operating conditions (despite failures, etc.)

  1. At most once: Each record will be either processed once or not processed at all.
  2. At least once: Each record will be processed one or more times. This is stronger than at-most-once as it ensures that no data will be lost. But there may be duplicates.
  3. Exactly once: Each record will be processed exactly once - no data will be lost and no data will be processed multiple times. This is obviously the strongest guarantee of the three.

2. Kafka offset management with Spark Streaming

First of all, it is recommended to store offsets in ZooKeeper. ZooKeeper is more lightweight than HBase and is itself highly available (HA), so the offsets are kept safely.

There are two common operations in offset management (see the sketch after this list):

  • Save offsets
  • Get offsets
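
A minimal sketch of these two operations against ZooKeeper, using Apache Curator (this is not the code used later in the article; the ZooKeeper address and the znode layout /offsets/<topic>/<partition> are assumptions):

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

object ZkOffsetStore {
  // Hypothetical ZooKeeper quorum; adjust to your environment
  private val client = CuratorFrameworkFactory.newClient(
    "hadoop000:2181", new ExponentialBackoffRetry(1000, 3))
  client.start()

  private def path(topic: String, partition: Int) = s"/offsets/$topic/$partition"

  // Save offsets: one znode per partition, the offset stored as a string
  def saveOffset(topic: String, partition: Int, offset: Long): Unit = {
    val p = path(topic, partition)
    if (client.checkExists().forPath(p) == null)
      client.create().creatingParentsIfNeeded().forPath(p, offset.toString.getBytes)
    else
      client.setData().forPath(p, offset.toString.getBytes)
  }

  // Get offsets: None when nothing has been committed for this partition yet
  def getOffset(topic: String, partition: Int): Option[Long] = {
    val p = path(topic, partition)
    if (client.checkExists().forPath(p) == null) None
    else Some(new String(client.getData.forPath(p)).toLong)
  }
}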

3. Environment preparation

Start a Kafka console producer for testing, using the topic tp_kafka:

./kafka-console-producer.sh --broker-list hadoop000:9092 --topic tp_kafka

Launch a Kafka consumer:

./kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic tp_kafka

Produce data from IDEA:

package com.taipark.spark;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

import java.util.Properties;
import java.util.UUID;

public class KafkaApp {

    public static void main(String[] args) {
        String topic = "tp_kafka";

        Properties props = new Properties();
        // Configuration for the old (Scala) producer API
        props.put("serializer.class","kafka.serializer.StringEncoder");
        props.put("metadata.broker.list","hadoop000:9092");
        props.put("request.required.acks","1");
        props.put("partitioner.class","kafka.producer.DefaultPartitioner");
        Producer<String,String> producer = new Producer<>(new ProducerConfig(props));

        // Send 100 messages keyed by the loop index
        for(int index = 0; index < 100; index++){
            KeyedMessage<String, String> message = new KeyedMessage<>(topic, index + "", "taipark" + UUID.randomUUID());
            producer.send(message);
        }
        System.out.println("Data production completed");

    }
}

4. The first offset management method: smallest

Connect Spark Streaming to Kafka and count the records:

package com.taipark.spark.offset

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Offset01App {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("Offset01App")
    val ssc = new StreamingContext(sparkConf,Seconds(10))

    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "hadoop000:9092",
      // Start from the earliest available offset when no offset is stored (pre-0.10.1.X value)
      "auto.offset.reset" -> "smallest"
    )
    val topics = "tp_kafka".split(",").toSet
    // Direct stream: no receiver; Spark Streaming tracks the Kafka offsets itself
    val messages = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder](ssc,kafkaParams,topics)

    messages.foreachRDD(rdd=>{
      if(!rdd.isEmpty()){
        println("Taipark" + rdd.count())
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }

}

Produce another 100 records to Kafka -> Spark Streaming receives them:

But now, if Spark Streaming is stopped and restarted:

You will find that after the restart it counts from the very beginning again, because auto.offset.reset is set to smallest in the code (this value applies before Kafka 0.10.1.X).

5. The second offset management method: checkpoint

Create an /offset directory in HDFS:

hadoop fs -mkdir /offset

Use Checkpoint:

package com.taipark.spark.offset

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

object Offset01App {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("Offset01App")

    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "hadoop000:9092",
      "auto.offset.reset" -> "smallest"
    )
    val topics = "tp_kafka".split(",").toSet
    val checkpointDirectory = "hdfs://hadoop000:8020/offset/"
    def functionToCreateContext():StreamingContext = {
      val ssc = new StreamingContext(sparkConf,Seconds(10))
      val messages = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder](ssc,kafkaParams,topics)
      // Set the checkpoint directory and checkpoint the Kafka stream every 10 seconds
      ssc.checkpoint(checkpointDirectory)
      messages.checkpoint(Duration(10*1000))

      messages.foreachRDD(rdd=>{
        if(!rdd.isEmpty()){
          println("Taipark" + rdd.count())
        }
      })

      ssc
    }
    // Rebuild the context from the checkpoint if one exists, otherwise create a new one
    val ssc = StreamingContext.getOrCreate(checkpointDirectory,functionToCreateContext _)

    ssc.start()
    ssc.awaitTermination()
  }

}

Note: to change the HDFS user used by IDEA, add the following to VM options in the settings:

-DHADOOP_USER_NAME=hadoop

Start the application first:

We find that it consumes the previous 100 records. Now stop the application, produce another 100 records, and start it again:

Only the 100 records produced between the last shutdown and this startup are read, instead of all previous records as with smallest.

However, checkpoints have a problem: if you manage offsets this way, any change to the business logic code invalidates the checkpoint, because the context is restored from serialized checkpoint data via getOrCreate().

6. The third offset management method: manual offset management

The approach:

  1. Create the StreamingContext
  2. Get data from Kafka <== get offsets
  3. Process according to the business logic
  4. Write the processing result to external storage ==> save offsets
  5. Start the application and wait for the threads to terminate

package com.taipark.spark.offset

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Offset01App {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("Offset01App")
    val ssc = new StreamingContext(sparkConf,Seconds(10))


    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "hadoop000:9092",
      "auto.offset.reset" -> "smallest"
    )
    val topics = "tp_kafka".split(",").toSet
    //Get offset from somewhere
    val fromOffsets = Map[TopicAndPartition,Long]()

    val messages = if(fromOffsets.size == 0){  //Consumption from scratch
      KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder](ssc,kafkaParams,topics)
    }else{  //Consumption from specified offset

      val messageHandler = (mm:MessageAndMetadata[String,String]) => (mm.key,mm.message())
      KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder,(String,String)](ssc,kafkaParams,fromOffsets,messageHandler)
    }

    messages.foreachRDD(rdd=>{
      if(!rdd.isEmpty()){
        //Business logic
        println("Taipark" + rdd.count())

        //Save offset commit to somewhere
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        offsetRanges.foreach(x =>{
          //Submit the following information to external storage
          println(s"${x.topic} ${x.partition} ${x.fromOffset} ${x.untilOffset}")
        })
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }

}
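
The two placeholder comments above ("Get offset from somewhere" and "Save offset commit to somewhere") are where external storage comes in. A minimal sketch of how they could be wired up, assuming the hypothetical ZkOffsetStore helper sketched in section 2 and assuming tp_kafka has 3 partitions; this is a fragment meant to be dropped into Offset01App, not standalone code:

import kafka.common.TopicAndPartition
import org.apache.spark.streaming.kafka.HasOffsetRanges

// Get offsets <== read what was saved last time (in practice the partition ids
// would be discovered from Kafka rather than hard-coded)
val fromOffsets: Map[TopicAndPartition, Long] =
  (0 until 3).flatMap { p =>
    ZkOffsetStore.getOffset("tp_kafka", p).map(off => TopicAndPartition("tp_kafka", p) -> off)
  }.toMap

// Save offsets ==> after the business logic, inside foreachRDD (replacing the println above)
messages.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    println("Taipark" + rdd.count())   // business logic
    rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { x =>
      ZkOffsetStore.saveOffset(x.topic, x.partition, x.untilOffset)
    }
  }
}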

The order in which the offsets and the data are saved matters:

  • Saving the offsets before saving the data may result in data loss (at-most-once)
  • Saving the data first and then the offsets may cause records to be processed again (at-least-once)

Solution 1: make the processing idempotent

In programming, an idempotent operation is one that has the same effect whether it is executed once or many times.
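
For example, write each result with a deterministic primary key, so that reprocessing the same record overwrites the same row instead of inserting a duplicate. A minimal JDBC sketch of the idea; the MySQL URL, credentials and the word_count(word PRIMARY KEY, cnt) table are assumptions:

import java.sql.DriverManager

object IdempotentSink {
  // Re-running this for the same word leaves the table in the same state (idempotent)
  def upsert(word: String, cnt: Long): Unit = {
    // Hypothetical MySQL database; table: word_count(word VARCHAR PRIMARY KEY, cnt BIGINT)
    val conn = DriverManager.getConnection("jdbc:mysql://hadoop000:3306/test", "root", "root")
    try {
      val ps = conn.prepareStatement(
        "INSERT INTO word_count(word, cnt) VALUES (?, ?) ON DUPLICATE KEY UPDATE cnt = ?")
      ps.setString(1, word)
      ps.setLong(2, cnt)
      ps.setLong(3, cnt)
      ps.executeUpdate()
    } finally {
      conn.close()
    }
  }
}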

Solution 2: transaction

1. A database transaction can contain one or more database operations, but these operations form a logical whole.

2. The operations that form this logical whole are either all executed successfully or not executed at all.

3. Either all of the operations that make up the transaction affect the database, or none of them do. In other words, whether or not the transaction succeeds, the database always remains in a consistent state.

4. This holds even in the presence of database failures and concurrent transactions.

Save the result of the business logic and the offset in a single transaction, so that together they take effect exactly once.
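
A minimal JDBC sketch of this idea; the MySQL connection, the batch_result table and the kafka_offsets table (with a unique key on topic and partition_id) are assumptions:

import java.sql.DriverManager

object TransactionalSink {
  // Write the batch result and the new offset in one transaction:
  // either both take effect or neither does.
  def saveResultAndOffset(result: Long, topic: String, partition: Int, untilOffset: Long): Unit = {
    // Hypothetical MySQL database; tables: batch_result(cnt) and
    // kafka_offsets(topic, partition_id, until_offset) with a unique key on (topic, partition_id)
    val conn = DriverManager.getConnection("jdbc:mysql://hadoop000:3306/test", "root", "root")
    try {
      conn.setAutoCommit(false)  // start a transaction

      val insertResult = conn.prepareStatement("INSERT INTO batch_result(cnt) VALUES (?)")
      insertResult.setLong(1, result)
      insertResult.executeUpdate()

      val upsertOffset = conn.prepareStatement(
        "INSERT INTO kafka_offsets(topic, partition_id, until_offset) VALUES (?, ?, ?) " +
          "ON DUPLICATE KEY UPDATE until_offset = ?")
      upsertOffset.setString(1, topic)
      upsertOffset.setInt(2, partition)
      upsertOffset.setLong(3, untilOffset)
      upsertOffset.setLong(4, untilOffset)
      upsertOffset.executeUpdate()

      conn.commit()    // both writes become visible together
    } catch {
      case e: Exception =>
        conn.rollback() // neither write takes effect
        throw e
    } finally {
      conn.close()
    }
  }
}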

7. auto.offset.reset after Kafka 0.10.1.X:

earliest: when a committed offset exists for a partition, consume from that offset; when there is no committed offset, consume from the beginning of the partition.
latest: when a committed offset exists for a partition, consume from that offset; when there is no committed offset, consume only data newly produced to the partition.
none: when a committed offset exists for every partition of the topic, consume from those offsets; if any partition has no committed offset, throw an exception.
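
For reference, a minimal sketch of setting this value on the new consumer API available from Kafka 0.10.1.X onwards; the group id and the choice of "earliest" are assumptions:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object NewConsumerExample {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "hadoop000:9092")
    props.put("group.id", "tp_kafka_group")          // hypothetical consumer group
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("enable.auto.commit", "false")         // commit offsets manually
    props.put("auto.offset.reset", "earliest")       // or "latest" / "none", as described above

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("tp_kafka"))
    // consumer.poll(...) would then fetch records starting from the position chosen above
  }
}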

 


Tags: kafka Spark Apache Database

Posted on Thu, 12 Mar 2020 08:17:51 -0400 by codrgii