Kafka Streams (KStream) vs Apache Flink

This article is a free English translation of the original DZone article.

Tencent Cloud Stream Compute Oceanus is a powerful tool for real-time big-data analytics and is compatible with Apache Flink applications. New users can purchase a Stream Compute Oceanus (Flink) cluster for 1 yuan; readers are welcome to try it out.

Summary

The two most popular and fastest-growing stream processing frameworks are Flink (since 2015) and Kafka's Streams API (part of Kafka since v0.10, released in 2016). Both are Apache open-source projects and have quickly displaced Spark Streaming, the traditional leader in this field.

In this article, I will show the main differences between these two stream processing frameworks through code examples. Existing articles on this topic, such as [1], [2] and [3], rarely go beyond the high-level differences and offer little in the way of code examples.

In this article, I will solve a simple problem in both frameworks, providing the code for each and comparing the two. Before starting with the code, here is a summary of my observations from when I started learning KStream.

Example 1

Here are the steps in this example:

  1. Read a stream of numbers from the Kafka topic. The numbers are produced as strings surrounded by "[" and "]". All records are produced with the same key.
  2. Define a 5-second tumbling window.
  3. Reduce operation (append the numbers as they arrive).
  4. Print to the console.

Kafka Stream code

static String TOPIC_IN = "Topic-IN";

// `props` holds the usual Kafka Streams configuration
// (application.id, bootstrap.servers, default serdes) -- see the sketch below.
final StreamsBuilder builder = new StreamsBuilder();
builder
.stream(TOPIC_IN, Consumed.with(Serdes.String(), Serdes.String()))  // read <String, String> records
.groupByKey()                                                       // group records by key
.windowedBy(TimeWindows.of(Duration.ofSeconds(5)))                  // 5-second tumbling window
.reduce((value1, value2) -> value1 + value2)                        // concatenate values as they arrive
.toStream()
.print(Printed.toSysOut());                                         // print windowed results to the console

Topology topology = builder.build();
System.out.println(topology.describe());                            // dump the topology
final KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();
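
The snippet above references a props object that is not shown. A minimal sketch of what it might contain, using placeholder values for the application id and broker address:

// Hypothetical Kafka Streams configuration for the snippet above.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kstream-vs-flink-demo");   // placeholder application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");       // placeholder broker address
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());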

Flink code

static String TOPIC_IN = "Topic-IN";

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// `props` holds the Kafka consumer configuration (bootstrap.servers, group.id, ...) -- see the sketch below.
// MySchema is a custom KafkaDeserializationSchema that reads both key and value (sketched later in this article).
FlinkKafkaConsumer<KafkaRecord> kafkaConsumer = new FlinkKafkaConsumer<>(TOPIC_IN, new MySchema(), props);
kafkaConsumer.setStartFromLatest();

DataStream<KafkaRecord> stream = env.addSource(kafkaConsumer);
stream
.timeWindowAll(Time.seconds(5))                  // 5-second tumbling window over ALL records (no key required)
.reduce(new ReduceFunction<KafkaRecord>()
{
  @Override
  public KafkaRecord reduce(KafkaRecord record1, KafkaRecord record2) throws Exception
  {
    // build a fresh record instead of mutating shared state
    KafkaRecord result = new KafkaRecord();
    result.key = record1.key;
    result.value = record1.value + record2.value;  // concatenate values as they arrive
    return result;
  }
})
.print();                                        // print windowed results to the console

System.out.println(env.getExecutionPlan());      // dump the execution plan as JSON
env.execute();
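
As in the KStream example, the consumer props are not shown. A minimal sketch, assuming a local broker and an arbitrary consumer group id:

// Hypothetical Kafka consumer configuration for the FlinkKafkaConsumer above.
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");   // placeholder broker address
props.setProperty("group.id", "flink-demo-consumer");       // placeholder consumer group id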

Differences observed after running the two

  1. In Kafka Streams, windowedBy() cannot be used without groupByKey(); Flink provides timeWindowAll(), which can process all records in the stream without a key.
  2. Kafka Streams reads records together with their keys by default, but Flink needs a custom implementation of KafkaDeserializationSchema<T> to read both key and value. If you are not interested in the key, you can use new SimpleStringSchema() as the second parameter of the FlinkKafkaConsumer<> constructor. My implementation of MySchema can be found on GitHub; a minimal sketch of what it might look like follows this list.
  3. Both frameworks can print the pipeline topology, which helps with optimizing the code. Besides the JSON dump, Flink also provides a web application for viewing the topology visually: https://flink.apache.org/visualizer/.
  4. In Kafka Streams, I can only print the results to the console after calling toStream(), while Flink can print them directly.
  5. Finally, Kafka Streams took more than 15 seconds to print the results to the console, while Flink printed them instantly. This seemed a little strange to me, because it adds extra latency for developers.
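
As mentioned in point 2, my actual MySchema implementation is on GitHub. The following is only a minimal sketch of what a KafkaRecord POJO and a KafkaDeserializationSchema<KafkaRecord> could look like, assuming String keys and values and an unbounded stream:

// Illustrative POJO holding the key, value and timestamp of a Kafka record.
public class KafkaRecord
{
  public String key;
  public String value;
  public long timestamp;
}

// Sketch of a deserialization schema that keeps both key and value.
public class MySchema implements KafkaDeserializationSchema<KafkaRecord>
{
  @Override
  public boolean isEndOfStream(KafkaRecord nextElement) { return false; }   // unbounded stream

  @Override
  public KafkaRecord deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception
  {
    KafkaRecord result = new KafkaRecord();
    result.key = record.key() == null ? null : new String(record.key());
    result.value = record.value() == null ? null : new String(record.value());
    result.timestamp = record.timestamp();
    return result;
  }

  @Override
  public TypeInformation<KafkaRecord> getProducedType() { return TypeInformation.of(KafkaRecord.class); }
}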

Example 2

Here are the steps in this example:

  1. Read a stream of numbers from the Kafka topic. The numbers are produced as strings surrounded by "[" and "]". All records are produced with the same key.
  2. Define a 5-second tumbling window.
  3. Define a grace period of 500 milliseconds to allow for late arrivals.
  4. Reduce operation (append the numbers as they arrive).
  5. Send the results to another Kafka topic.

Kafka Stream code

static String TOPIC_IN = "Topic-IN";
static String TOPIC_OUT = "Topic-OUT";

// props: same Kafka Streams configuration as in Example 1.
final StreamsBuilder builder = new StreamsBuilder();
builder
.stream(TOPIC_IN, Consumed.with(Serdes.String(), Serdes.String()))
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofSeconds(5)).grace(Duration.ofMillis(500)))  // 5-second tumbling window with a 500 ms grace period for late records
.reduce((value1, value2) -> value1 + value2)                                      // concatenate values as they arrive
.toStream()
.to(TOPIC_OUT);                                                                   // write the windowed results to the output topic (see the note below)

Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();
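
One detail worth noting: after the windowed reduce, the key type of the stream is Windowed<String>, so writing it with the default String serde may fail at runtime depending on your Streams version. A minimal sketch of one way to handle this, unwrapping the window key before writing (the map step is my addition for illustration, not part of the original pipeline):

// Hypothetical variant that unwraps the Windowed<String> key before writing to the output topic.
builder
.stream(TOPIC_IN, Consumed.with(Serdes.String(), Serdes.String()))
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofSeconds(5)).grace(Duration.ofMillis(500)))
.reduce((value1, value2) -> value1 + value2)
.toStream()
.map((windowedKey, value) -> KeyValue.pair(windowedKey.key(), value))   // Windowed<String> -> plain String key
.to(TOPIC_OUT, Produced.with(Serdes.String(), Serdes.String()));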

Flink code

static String TOPIC_IN = "Topic-IN";
static String TOPIC_OUT = "Topic-OUT";

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);   // use event time for windowing

// props: Kafka consumer configuration as in Example 1; MySchema reads key, value and timestamp.
FlinkKafkaConsumer<KafkaRecord> kafkaConsumer = new FlinkKafkaConsumer<>(TOPIC_IN, new MySchema(), props);
// extract the event timestamp from each record (assumes timestamps arrive in ascending order)
kafkaConsumer.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<KafkaRecord>()
{
  @Override
  public long extractAscendingTimestamp(KafkaRecord record)
  {
    return record.timestamp;
  }
});

// define the Kafka producer using the Flink API
KafkaSerializationSchema<String> serializationSchema = (value, timestamp) -> new ProducerRecord<byte[], byte[]>(TOPIC_OUT, value.getBytes());
FlinkKafkaProducer<String> kafkaProducer =
                new FlinkKafkaProducer<String>(TOPIC_OUT,
                                               serializationSchema,
                                               prodProps,              // producer configuration -- see the sketch below
                                               Semantic.EXACTLY_ONCE);

DataStream<KafkaRecord> stream = env.addSource(kafkaConsumer);
stream
.keyBy(record -> record.key)                    // key the stream by the record key
.timeWindow(Time.seconds(5))                    // 5-second tumbling event-time window
.allowedLateness(Time.milliseconds(500))        // accept records up to 500 ms late
.reduce(new ReduceFunction<KafkaRecord>()       // reduce over KafkaRecord elements
{
  @Override
  public KafkaRecord reduce(KafkaRecord record1, KafkaRecord record2) throws Exception
  {
    KafkaRecord result = new KafkaRecord();
    result.key = record1.key;
    result.value = record1.value + record2.value;  // concatenate values as they arrive
    return result;
  }
})
.map(record -> record.value)                    // the producer is typed to String, so map the result to its value
.addSink(kafkaProducer);
env.execute();
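
The producer configuration prodProps is not shown above. A minimal sketch, assuming a local broker; since Semantic.EXACTLY_ONCE makes Flink use Kafka transactions, the producer's transaction.timeout.ms should not exceed the broker's transaction.max.timeout.ms (15 minutes by default):

// Hypothetical producer configuration for the FlinkKafkaProducer above.
Properties prodProps = new Properties();
prodProps.setProperty("bootstrap.servers", "localhost:9092");   // placeholder broker address
prodProps.setProperty("transaction.timeout.ms", "60000");       // keep below the broker's transaction.max.timeout.ms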

Differences observed after running the two

  1. Thanks to the native integration between Kafka Streams and Kafka, it is very easy to define this pipeline in KStream, whereas in Flink it is relatively complex.
  2. In Flink, I had to define both a Consumer and a Producer, which adds extra code.
  3. KStream automatically uses the timestamps present in the records (the time they were inserted into Kafka), while Flink needs the developer to provide this information. I think Flink's Kafka connector could be improved in the future so that developers write less code.
  4. Handling late arrivals is easier in KStream than in Flink, but note that Flink also offers a side-output stream for late arrivals, which Kafka Streams does not provide (see the sketch after this list).
  5. Finally, after running both, I observed that Kafka Streams takes a few extra seconds to write to the output topic, while Flink sends the data to the output topic immediately when the time-window result is computed.
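
To illustrate point 4, here is a minimal sketch of how Flink's side output for late records could be wired into the pipeline above (the OutputTag name is arbitrary):

// Hypothetical sketch: route records arriving after the allowed lateness to a side output.
final OutputTag<KafkaRecord> lateTag = new OutputTag<KafkaRecord>("late-records"){};

SingleOutputStreamOperator<KafkaRecord> windowed = stream
.keyBy(record -> record.key)
.timeWindow(Time.seconds(5))
.allowedLateness(Time.milliseconds(500))
.sideOutputLateData(lateTag)                    // records later than window end + lateness go here
.reduce(new ReduceFunction<KafkaRecord>()
{
  @Override
  public KafkaRecord reduce(KafkaRecord record1, KafkaRecord record2) throws Exception
  {
    KafkaRecord result = new KafkaRecord();
    result.key = record1.key;
    result.value = record1.value + record2.value;
    return result;
  }
});

windowed.getSideOutput(lateTag).print();        // handle the late records separately, e.g. print or sink them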

Conclusion

  • If your project is tightly coupled with Kafka on both the source and sink sides, the KStream API is a better choice. However, you need to manage and operate the elasticity of the KStream application yourself.
  • Flink is a complete streaming computing system that supports HA, fault tolerance, self-monitoring, and multiple deployment modes.
  • Because of its built-in support for many third-party sources and sinks, Flink is more useful for such projects, and it can easily be customized to support custom data sources.
  • Compared with Kafka Streams, Flink has a richer API and supports batch processing, complex event processing (CEP), FlinkML, and Gelly (for graph processing).

Tags: kafka flink

Posted on Sat, 27 Nov 2021 23:56:32 -0500 by jf3000