Flink with Kafka Source & Kafka Sink

FlinkKafkaConnector

This connector provides access to the event streams of the Apache Kafka service.

Flink provides a dedicated Kafka connector for reading data from and writing data to Kafka topics. The Flink Kafka Consumer integrates with Flink's checkpointing mechanism to provide exactly-once processing semantics. To achieve this, Flink does not rely solely on Kafka's consumer-group offset tracking; it also tracks and checkpoints these offsets internally.
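
Because the offsets live in Flink's checkpoints, the consumer's start position can be chosen independently of the offsets committed to the group (when restoring from a checkpoint, the checkpointed offsets always take precedence). A minimal sketch, assuming a placeholder broker address and consumer group:

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker address
props.setProperty("group.id", "demo-group");              // placeholder consumer group

FlinkKafkaConsumer<String> consumer =
        new FlinkKafkaConsumer<>("demo", new SimpleStringSchema(), props);

// Start from the offsets committed for the group (the default behaviour) ...
consumer.setStartFromGroupOffsets();
// ... or ignore the committed offsets and read the topic from the beginning:
// consumer.setStartFromEarliest();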
Development process

Next, let's walk through an example configuration in which Flink reads from a Kafka topic and writes the records to a new topic.

1. Add the Kafka dependencies

  Add the following dependencies to pom.xml

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.11.1</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-core</artifactId>
            <version>1.11.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>1.11.1</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>1.11.1</version>
            <scope>provided</scope>
        </dependency>

        <!-- Universal Kafka connector; provides the FlinkKafkaConsumer and
             FlinkKafkaProducer classes used in the code below -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_2.11</artifactId>
            <version>1.11.1</version>
        </dependency>

        <!-- Optional: not used in this example -->
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongo-java-driver</artifactId>
            <version>3.4.1</version>
        </dependency>

        <!-- Optional: not used in this example -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.7</version>
        </dependency>

2. Read from Kafka and write back to Kafka

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class KafkaSourceToSinkJob {

    public static final String BOOTSTRAP_SERVERS = "192.168.121.101:9092";

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 1.0 configure the Kafka consumer
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", BOOTSTRAP_SERVERS);
        props.setProperty("group.id", "test111");
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("session.timeout.ms", "30000");
        // Flink deserializes records through SimpleStringSchema; the key/value
        // deserializer settings below are overridden by the connector
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // 1.1 use Kafka as the source
        env.enableCheckpointing(5000); // checkpoint every 5000 msecs
        DataStream<String> stream = env
                .addSource(new FlinkKafkaConsumer<>("demo", new SimpleStringSchema(), props));

        // 2.0 configure the Kafka producer
        FlinkKafkaProducer<String> myProducer = new FlinkKafkaProducer<>(
                BOOTSTRAP_SERVERS,           // broker list
                "demo_flink",                // target topic
                new SimpleStringSchema());   // serialization schema
        myProducer.setWriteTimestampToKafka(true);

        // 2.1 use Kafka as the sink
        stream.addSink(myProducer);

        // 2.2 print the records read from Kafka for debugging
        stream.print();

        env.execute("Kafka Source");
    }
}

Fault tolerance guarantee

In a real production environment, we need to ensure the high availability of the system. That is, either every component of the system must be free of failures, or the system must provide fault-tolerance mechanisms to recover from them.
Kafka consumer fault tolerance

With Flink checkpointing enabled, the Flink Kafka Consumer periodically records the Kafka offsets, together with the state of the other operators, in a consistent checkpoint while consuming a topic. If the job fails, Flink restores the streaming program to the state of the latest checkpoint and re-reads the Kafka records starting from the offsets stored in that checkpoint.

To use a fault-tolerant Kafka consumer, you need to enable checkpointing of the topology in the execution environment:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // checkpoint every 5000 msecs

If checkpointing is not enabled, the Kafka consumer will periodically commit offsets to Zookeeper (for the 0.8 consumer) or to Kafka's own offset storage.
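
With checkpointing enabled, the offsets written back to Kafka are only used for monitoring (for example, consumer lag); recovery always uses the offsets stored in the checkpoint. A minimal sketch of how this commit behaviour can be controlled, where consumer refers to the FlinkKafkaConsumer built in the earlier sketch:

// Commit the checkpointed offsets back to Kafka when a checkpoint completes,
// so that external tools can track consumer lag; pass false to disable committing.
consumer.setCommitOffsetsOnCheckpoints(true); // true is the default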

Fault tolerance of Kafka producers
Kafka 0.8

Before 0.9, Kafka did not provide any mechanism that could guarantee at-least-once or exactly-once semantics.

Kafka 0.9 and 0.10

With Flink checkpointing enabled, FlinkKafkaProducer09 and FlinkKafkaProducer010 can provide an at-least-once guarantee.

In addition to enabling Flink checkpointing, you should also configure the setter methods setLogFailuresOnly(boolean) and setFlushOnCheckpoint(boolean) appropriately.

    setLogFailuresOnly(boolean): set to false by default. Enabling it makes the producer only log failures instead of catching and rethrowing them, which in essence counts a record as successful even if it was never written to the target Kafka topic. This must be left disabled if at-least-once is required.
    setFlushOnCheckpoint(boolean): set to true by default. When enabled, Flink's checkpoint waits for Kafka to acknowledge all records in flight at checkpoint time before completing the checkpoint. This ensures that all records emitted before the checkpoint have been written to Kafka, and it must be enabled if at-least-once is required.

In short, by default the Kafka producer has at-least-once semantics for versions 0.9 and 0.10, with setLogFailuresOnly set to false and setFlushOnCheckpoint set to true, as sketched below.
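
A minimal sketch of setting these two flags explicitly, assuming the flink-connector-kafka-0.10 dependency is on the classpath and reusing BOOTSTRAP_SERVERS, the target topic, and stream from the example above:

FlinkKafkaProducer010<String> producer010 = new FlinkKafkaProducer010<>(
        BOOTSTRAP_SERVERS,           // broker list
        "demo_flink",                // target topic
        new SimpleStringSchema());   // serialization schema

// The defaults, shown explicitly; both are required for the at-least-once guarantee
producer010.setLogFailuresOnly(false);   // fail the job on write errors instead of only logging them
producer010.setFlushOnCheckpoint(true);  // flush in-flight records before a checkpoint completes

stream.addSink(producer010);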

Kafka 0.11 and newer

With Flink checkpointing enabled, FlinkKafkaProducer011 (or FlinkKafkaProducer for Kafka >= 1.0.0) can provide an exactly-once guarantee.

In addition to enabling Flink checkpointing, you can choose among three different modes of operation by passing the appropriate Semantic parameter to FlinkKafkaProducer011 (or FlinkKafkaProducer for Kafka >= 1.0.0), as sketched after the list below:

    Semantic.NONE: Flink makes no guarantees; produced messages may be lost or duplicated
    Semantic.AT_LEAST_ONCE (the default): similar to setFlushOnCheckpoint(true) in FlinkKafkaProducer010; messages are guaranteed not to be lost, but they may be sent repeatedly
    Semantic.EXACTLY_ONCE: uses Kafka transactions to provide exactly-once semantics. Whenever you write to Kafka with transactions, remember to set the appropriate isolation.level (read_committed or read_uncommitted, the latter being the default) in every application that reads records from Kafka.
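
A minimal sketch of an exactly-once sink with the universal FlinkKafkaProducer, reusing BOOTSTRAP_SERVERS and stream from the example above; the schema class name and the transaction timeout value are illustrative assumptions:

// A simple KafkaSerializationSchema that writes each String as the record value
public static class StringSerializationSchema implements KafkaSerializationSchema<String> {
    private final String topic;

    public StringSerializationSchema(String topic) {
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
        return new ProducerRecord<>(topic, element.getBytes(StandardCharsets.UTF_8));
    }
}

// Inside main(): build the transactional producer
Properties producerProps = new Properties();
producerProps.setProperty("bootstrap.servers", BOOTSTRAP_SERVERS);
// Keep the transaction timeout at or below the broker's transaction.max.timeout.ms
// (15 minutes by default), otherwise the broker rejects the producer's transactions
producerProps.setProperty("transaction.timeout.ms", "900000");

FlinkKafkaProducer<String> exactlyOnceProducer = new FlinkKafkaProducer<>(
        "demo_flink",                                   // target topic
        new StringSerializationSchema("demo_flink"),    // serialization schema
        producerProps,                                  // producer configuration
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE);      // transactional writes

stream.addSink(exactlyOnceProducer);

Downstream applications that read demo_flink should set isolation.level to read_committed so that they do not see records from aborted transactions.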
