Flink Time Semantics and Watermarks


1. Time Semantics in Flink

In Flink streaming, different concepts of time are involved, as shown in the following figure:

  • Event Time: the time at which an event was created. It is usually described by a timestamp embedded in the event; for example, in collected log data each record carries its own generation time, and Flink accesses the event timestamp through a timestamp assigner.
  • Ingestion Time: the time at which data enters Flink.
  • Processing Time: the local system time of each operator that performs a time-based operation. It is machine-dependent and is the default time attribute.

Which time semantic is more important?


For example, a log enters Flink at 2017-11-12 10:00:00.123 and reaches the window operator at 2017-11-12 10:00:01.234. The content of the log is as follows:

2017-11-02 18:37:15.624 INFO Fail over to rm2

For the business, which time is most meaningful when counting the number of failure logs in one minute? EventTime, because we want to count by the time the logs were generated.

  • Different application scenarios call for different time semantics
  • We usually care most about the event time, EventTime

2. Introduction of EventTime

In Flink streaming, the vast majority of business logic uses EventTime; ProcessingTime or IngestionTime is typically used only when EventTime is unavailable.

If you want to use EventTime, you need to introduce it as the time characteristic, as follows:

We can set the time characteristic of the stream by calling the setStreamTimeCharacteristic method on the execution environment directly in the code.

In addition, a timestamp must be extracted from the data itself.

val env = StreamExecutionEnvironment.getExecutionEnvironment

// From the moment of invocation, append the time characteristic to every stream created by env
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

3. Watermark (Water Level)

3.1 Basic concepts

We know that between an event being generated, entering the source, and flowing to an operator, there is a process that takes time. Although in most cases the data flowing to the operator arrives in the order the events were generated, network and distribution issues can cause out-of-order data. Out-of-order means that the sequence of events received by Flink is not strictly ordered by the events' Event Time.


This creates a problem: once data is out of order, if the execution of windows is decided purely by eventTime, we cannot be sure that all the data is in place, yet we cannot wait indefinitely. There must be a mechanism that guarantees windows are triggered after a specific time. This special mechanism is the Watermark.

  • Watermark is a mechanism for measuring the progress of Event Time.
  • Watermark is used to handle out-of-order events; correct handling of out-of-order events is usually achieved by the Watermark mechanism combined with windows.
  • A Watermark in the data stream asserts that all data with timestamps less than the Watermark has already arrived, so the execution of windows is triggered by Watermarks.
  • Watermark can be understood as a delayed trigger mechanism. We set a Watermark delay time t; each time, the system checks the maximum event time maxEventTime among the data that has arrived and asserts that all data with an eventTime smaller than maxEventTime - t has arrived. If a window's end time equals maxEventTime - t, that window is triggered.


When Flink receives data, it generates Watermarks according to a rule: the Watermark equals the maxEventTime of all data that has arrived so far minus the delay. In other words, Watermarks are carried along with the data, and once a Watermark carried by the data is later than the end time of a currently untriggered window, the execution of that window is triggered. Since Watermarks are carried by data, if no new data arrives during the run, windows that have not yet been triggered will never be triggered.

In the figure above, we set the maximum allowed lateness to 2 s, so the event with timestamp 7 s carries a Watermark of 5 s, and the event with timestamp 12 s carries a Watermark of 10 s. If window 1 is 1 s ~ 5 s and window 2 is 6 s ~ 10 s, then the Watermark of the event with timestamp 7 s triggers window 1 exactly when that event arrives, and the Watermark of the event with timestamp 12 s triggers window 2 exactly when that event arrives.

The Watermark is the "closing time" that triggers the previous window: once the closing time fires, all data currently inside the window is included in the computation.

As long as the water level is not reached, the window will not be triggered no matter how far real (processing) time advances.
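The trigger rule described above (watermark = maxEventTime - t; a window fires once the watermark reaches its end time) can be sketched as follows. This is an illustrative model, not Flink's internal implementation; `WatermarkTracker` is a hypothetical name:

```scala
// Illustrative model of the watermark trigger rule (not Flink API).
// The watermark is the maximum event time seen so far minus the delay t;
// a window whose end time is <= the watermark is triggered.
case class WatermarkTracker(delay: Long) {
  private var maxEventTime: Long = Long.MinValue

  /** Record an event and return the resulting watermark. */
  def onEvent(eventTime: Long): Long = {
    maxEventTime = math.max(maxEventTime, eventTime)
    maxEventTime - delay
  }

  /** A window fires once the watermark has reached its end time. */
  def windowFires(windowEnd: Long): Boolean =
    maxEventTime - delay >= windowEnd
}
```

With a delay of 2 s, the event with timestamp 7 s yields a watermark of 5 s, which fires the 1 s ~ 5 s window, matching the 2 s / 7 s example above.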

3.2 Features of watermark

  • A watermark is a special data record
  • Watermarks must be monotonically increasing, to ensure that the task's event-time clock advances forward rather than backwards
  • Watermarks are related to the timestamps of the data

3.3 Transfer of watermarks
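In Flink, a task broadcasts its watermark to all downstream partitions, and a task with several input partitions tracks the latest watermark per input and uses the minimum of them as its event-time clock. A minimal sketch of that rule (class and method names are illustrative, not Flink API):

```scala
// Sketch of how a multi-input task advances its event-time clock:
// it keeps the latest watermark per input channel and emits the minimum,
// but only when that minimum actually advances.
class WatermarkMerger(numChannels: Int) {
  private val channelWatermarks = Array.fill(numChannels)(Long.MinValue)
  private var emitted = Long.MinValue

  /** Returns the new task watermark if it advanced, else None. */
  def onWatermark(channel: Int, wm: Long): Option[Long] = {
    channelWatermarks(channel) = math.max(channelWatermarks(channel), wm)
    val min = channelWatermarks.min
    if (min > emitted) { emitted = min; Some(min) } else None
  }
}
```

Note that the task's watermark only moves once the slowest input catches up, which is why one idle input partition can hold back event time for the whole task.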

3.4 Introduction of Watermark

Introducing watermarks is simple; the most common way to handle out-of-order data is as follows:

dataStream.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.milliseconds(1000)) {
    override def extractTimestamp(element: SensorReading): Long = {
      element.timestamp * 1000
    }
  }
)

Using Event Time requires the data source to provide timestamps; otherwise, the program has no way of knowing what an event's event time is (if the data in the source has no timestamp, only Processing Time can be used).

In the example above, we created a somewhat complex class that actually implements a timestamp-assigning interface. Flink exposes the TimestampAssigner interface for us to implement, allowing us to customize how timestamps are extracted from event data.

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Append time attributes to each stream created by env from the moment of invocation
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val readings: DataStream[SensorReading] = env
.addSource(new SensorSource) 
.assignTimestampsAndWatermarks(new MyAssigner())

MyAssigner can be one of two types:

  • AssignerWithPeriodicWatermarks
  • AssignerWithPunctuatedWatermarks

Both interfaces inherit from TimestampAssigner.

Assigner with periodic watermarks

Periodic watermarks: the system periodically inserts watermarks into the stream (a watermark is itself a special event!). The default period is 200 milliseconds, and it can be changed with the ExecutionConfig.setAutoWatermarkInterval() method.

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// Generate a watermark every 5 seconds
env.getConfig.setAutoWatermarkInterval(5000)

Logic for generating watermarks: every 5 seconds, Flink calls the getCurrentWatermark() method of AssignerWithPeriodicWatermarks.

If the method returns a timestamp larger than the previous watermark's, a new watermark is inserted into the stream. This check ensures that the water level increases monotonically. If the returned timestamp is less than or equal to the previous watermark's timestamp, no new watermark is generated.

For example, customize a periodic timestamp extraction:

class PeriodicAssigner extends AssignerWithPeriodicWatermarks[SensorReading] {
  val bound: Long = 60 * 1000     // delay of 1 minute
  var maxTs: Long = Long.MinValue // maximum timestamp observed so far

  override def getCurrentWatermark: Watermark = {
    new Watermark(maxTs - bound)
  }

  override def extractTimestamp(r: SensorReading, previousTS: Long): Long = {
    maxTs = maxTs.max(r.timestamp)
    r.timestamp
  }
}

A simple special case: if we know in advance that the timestamps of the data stream are monotonically increasing, i.e. there is no out-of-order data, we can use assignAscendingTimestamps, which generates watermarks directly from the timestamps of the data.

val stream: DataStream[SensorReading] = ...
val withTimestampsAndWatermarks = stream
	.assignAscendingTimestamps(e => e.timestamp)

>> result: E(1), W(1), E(2), W(2), ...

For disordered data streams, if we can roughly estimate the maximum latency of events in the data stream, we can use the following code:

val stream: DataStream[SensorReading] = ...
val withTimestampsAndWatermarks = stream.assignTimestampsAndWatermarks(
	new SensorTimeAssigner
)
class SensorTimeAssigner
    extends BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(5)) {
  // extract the timestamp
  override def extractTimestamp(r: SensorReading): Long = r.timestamp
}

>> result: E(10), W(0), E(8), E(7), E(11), W(1), ...

Assigner with punctuated watermarks

Punctuated watermarks are generated intermittently. Unlike periodic generation, this method is not bound to a fixed interval; it can inspect and act on each record as needed. Let's go straight to the code and insert watermarks into the sensor data stream only for sensor_1:

class PunctuatedAssigner extends AssignerWithPunctuatedWatermarks[SensorReading] {
  val bound: Long = 60 * 1000

  override def checkAndGetNextWatermark(r: SensorReading, extractedTS: Long): Watermark = {
    if (r.id == "sensor_1") {
      new Watermark(extractedTS - bound)
    } else {
      null
    }
  }

  override def extractTimestamp(r: SensorReading, previousTS: Long): Long = {
    r.timestamp
  }
}

4. Using EventTime in Windows

4.1 Tumbling Window (TumblingEventTimeWindows)

def main(args: Array[String]): Unit = {
  // environment
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  env.setParallelism(1)

  val dstream: DataStream[String] = env.socketTextStream("localhost", 7777)

  val textWithTsDstream: DataStream[(String, Long, Int)] = dstream.map { text =>
    val arr: Array[String] = text.split(" ")
    (arr(0), arr(1).toLong, 1)
  }

  val textWithEventTimeDstream: DataStream[(String, Long, Int)] =
    textWithTsDstream.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[(String, Long, Int)](Time.milliseconds(1000)) {
        override def extractTimestamp(element: (String, Long, Int)): Long = element._2
      })

  val textKeyStream: KeyedStream[(String, Long, Int), Tuple] =
    textWithEventTimeDstream.keyBy(0)
  textKeyStream.print("textkey:")

  val windowStream: WindowedStream[(String, Long, Int), Tuple, TimeWindow] =
    textKeyStream.window(TumblingEventTimeWindows.of(Time.seconds(2)))

  val groupDstream: DataStream[mutable.HashSet[Long]] =
    windowStream.fold(new mutable.HashSet[Long]()) { case (set, (key, ts, count)) =>
      set += ts
    }

  groupDstream.print("window::::").setParallelism(1)
  env.execute()
}

The result is computed according to Event Time windows, regardless of the system time (including the speed at which input arrives).
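For a tumbling event-time window, which window an element falls into is determined purely by its timestamp. A simplified sketch of the window-start computation, assuming zero offset and non-negative timestamps (Flink's TimeWindow additionally handles offsets and negative timestamps):

```scala
// Start of the tumbling window (size in ms, zero offset) containing
// `timestamp`: the largest multiple of `size` not exceeding it.
def windowStart(timestamp: Long, size: Long): Long =
  timestamp - (timestamp % size)
```

With the 2-second window above, timestamps 0 ~ 1999 ms fall into the window starting at 0, and 2000 ~ 3999 ms into the window starting at 2000.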

4.2 Sliding Window (SlidingEventTimeWindows)

def main(args: Array[String]): Unit = {
  // environment
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  env.setParallelism(1)

  val dstream: DataStream[String] = env.socketTextStream("localhost", 7777)

  val textWithTsDstream: DataStream[(String, Long, Int)] = dstream.map { text =>
    val arr: Array[String] = text.split(" ")
    (arr(0), arr(1).toLong, 1)
  }

  val textWithEventTimeDstream: DataStream[(String, Long, Int)] =
    textWithTsDstream.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[(String, Long, Int)](Time.milliseconds(1000)) {
        override def extractTimestamp(element: (String, Long, Int)): Long = element._2
      })

  val textKeyStream: KeyedStream[(String, Long, Int), Tuple] =
    textWithEventTimeDstream.keyBy(0)
  textKeyStream.print("textkey:")

  val windowStream: WindowedStream[(String, Long, Int), Tuple, TimeWindow] =
    textKeyStream.window(SlidingEventTimeWindows.of(Time.seconds(2), Time.milliseconds(500)))

  val groupDstream: DataStream[mutable.HashSet[Long]] =
    windowStream.fold(new mutable.HashSet[Long]()) { case (set, (key, ts, count)) =>
      set += ts
    }

  groupDstream.print("window::::").setParallelism(1)

  env.execute()
}

4.3 Session Window (EventTimeSessionWindows)

Execution is triggered when the gap between the EventTimes of two adjacent records exceeds the specified interval. If a Watermark is used, meeting the trigger condition alone is not enough: the window fires only once the delayed water level is reached.

def main(args: Array[String]): Unit = {
  // environment
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  env.setParallelism(1)

  val dstream: DataStream[String] = env.socketTextStream("localhost", 7777)

  val textWithTsDstream: DataStream[(String, Long, Int)] = dstream.map { text =>
    val arr: Array[String] = text.split(" ")
    (arr(0), arr(1).toLong, 1)
  }

  val textWithEventTimeDstream: DataStream[(String, Long, Int)] =
    textWithTsDstream.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[(String, Long, Int)](Time.milliseconds(1000)) {
        override def extractTimestamp(element: (String, Long, Int)): Long = element._2
      })

  val textKeyStream: KeyedStream[(String, Long, Int), Tuple] =
    textWithEventTimeDstream.keyBy(0)
  textKeyStream.print("textkey:")

  val windowStream: WindowedStream[(String, Long, Int), Tuple, TimeWindow] =
    textKeyStream.window(EventTimeSessionWindows.withGap(Time.milliseconds(500)))

  windowStream
    .reduce((text1, text2) => (text1._1, 0L, text1._3 + text2._3))
    .map(_._3)
    .print("windows:::")
    .setParallelism(1)

  env.execute()
}
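The gap rule above can be sketched outside Flink: given sorted timestamps, a new session starts whenever the gap to the previous event reaches the configured interval. `sessions` is a hypothetical helper, not Flink API:

```scala
// Group timestamps (ms) into sessions: an event joins the current session
// if it arrives strictly less than `gap` ms after the session's last event;
// otherwise it opens a new session.
def sessions(timestamps: Seq[Long], gap: Long): Seq[Seq[Long]] =
  timestamps.sorted.foldLeft(Vector.empty[Vector[Long]]) { (acc, t) =>
    acc.lastOption match {
      case Some(last) if t - last.last < gap => acc.init :+ (last :+ t)
      case _                                 => acc :+ Vector(t)
    }
  }
```

With the 500 ms gap used above, events at 0, 100 and 800 ms form two sessions: (0, 100) and (800).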
Posted on Thu, 14 May 2020 12:25:44 -0400 by Trevors