Design differences of WaterMark placement under Flink multi-parallelism

The question
Does where the WaterMark is assigned affect whether windows open and close normally?

Next, I simulated two scenarios (source parallelism is 1 and map parallelism is 2):
1. Assign the watermark after the source, then open the window after the map
2. Assign the watermark after the map, then open the window

ps: Both code samples use a monotonically increasing watermark and a 5-second window; only the position where the watermark is assigned differs.

            The watermark is derived from the testWM object's ts field * 1000 (seconds converted to milliseconds)
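The timestamp extraction used in both samples can be checked in isolation. The class name below is hypothetical; the logic mirrors the extractTimestamp bodies in the code that follows (third CSV field, seconds converted to milliseconds):

```java
public class TsExtractSketch {
    // Mirrors the extractTimestamp logic in the samples below:
    // take the third comma-separated field and convert seconds to millis.
    static long extractTimestamp(String element) {
        String[] split = element.split(",");
        return Long.parseLong(split[2]) * 1000;
    }

    public static void main(String[] args) {
        System.out.println(extractTimestamp("1,1,9")); // 9000
    }
}
```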

Code 1: add WaterMark after Source
import lombok.AllArgsConstructor;
import lombok.Data;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class WMTest {

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> source = env.socketTextStream("localhost", 9999);
    // TODO 2021-12-01: assign the watermark right after the source
    SingleOutputStreamOperator<String> sourceWithWM = source.assignTimestampsAndWatermarks(WatermarkStrategy
            .<String>forMonotonousTimestamps()
            .withTimestampAssigner(new SerializableTimestampAssigner<String>() {
                @Override
                public long extractTimestamp(String element, long recordTimestamp) {
                    String[] split = element.split(",");
                    return Long.parseLong(split[2]) * 1000;
                }
            }));
    // TODO 2021-12-01: set map parallelism to 2
    SingleOutputStreamOperator<testWM> mapDS = sourceWithWM.map(r -> {
        String[] split = r.split(",");
        return new testWM(Integer.parseInt(split[0]), Integer.parseInt(split[1]),Long.parseLong(split[2]));
    }).setParallelism(2);
    SingleOutputStreamOperator<String> resultDS = mapDS.keyBy(r -> r.getId())
            .window(TumblingEventTimeWindows.of(Time.seconds(5)))
            .process(new ProcessWindowFunction<testWM, String, Integer, TimeWindow>() {
                @Override
                public void process(Integer integer, Context context, Iterable<testWM> elements, Collector<String> out) throws Exception {
                    out.collect("I closed the window");
                }
            });
    resultDS.print();
    env.execute();

}

}
@Data
@AllArgsConstructor
class testWM{

private int id;
private int num;
private long ts;

}
Code 2: set WaterMark after Map
import lombok.AllArgsConstructor;
import lombok.Data;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class WMTest {

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> source = env.socketTextStream("localhost", 9999);
    // TODO 2021-12-01: set map parallelism to 2
    SingleOutputStreamOperator<testWM> mapDS = source.map(r -> {
        String[] split = r.split(",");
        return new testWM(Integer.parseInt(split[0]), Integer.parseInt(split[1]),Long.parseLong(split[2]));
    }).setParallelism(2);
    // TODO 2021-12-01: assign the watermark after the map
    SingleOutputStreamOperator<testWM> mapWithWM = mapDS.assignTimestampsAndWatermarks(WatermarkStrategy
            .<testWM>forMonotonousTimestamps()
            .withTimestampAssigner(new SerializableTimestampAssigner<testWM>() {
                @Override
                public long extractTimestamp(testWM element, long recordTimestamp) {
                    return element.getTs() * 1000;
                }
            }));
    SingleOutputStreamOperator<String> resultDS = mapWithWM.keyBy(r -> r.getId())
            .window(TumblingEventTimeWindows.of(Time.seconds(5)))
            .process(new ProcessWindowFunction<testWM, String, Integer, TimeWindow>() {
                @Override
                public void process(Integer integer, Context context, Iterable<testWM> elements, Collector<String> out) throws Exception {
                    out.collect("I closed the window");
                }
            });
    resultDS.print();
    env.execute();

}

}
@Data
@AllArgsConstructor
class testWM{

private int id;
private int num;
private long ts;

}
Operation results:
For Code 1 (watermark assigned after the source), the result is as follows:

When record 1,1,1 arrives, window [0,5) opens. When record 1,1,9 arrives, the watermark rises to 9000 (ignoring the watermark-minus-1 adjustment), so the window closes. No problem here.
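Why 1,1,1 lands in [0,5) and 1,1,9 lands in the next window follows from how tumbling event-time windows are assigned: with a 5-second size and no offset, the window start is the timestamp rounded down to a multiple of the window size. A minimal sketch (the class name is hypothetical):

```java
public class WindowAssignSketch {
    // Tumbling event-time window assignment with no offset:
    // start = timestamp - (timestamp mod windowSize), window is [start, start + size)
    static long windowStart(long tsMillis, long sizeMillis) {
        return tsMillis - (tsMillis % sizeMillis);
    }

    public static void main(String[] args) {
        System.out.println(windowStart(1000L, 5000L)); // 0    -> window [0, 5000)
        System.out.println(windowStart(9000L, 5000L)); // 5000 -> window [5000, 10000)
    }
}
```

So a watermark of 9000 is past the end of [0,5000), which is why the window can fire.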

For Code 2 (watermark assigned after the map), the result is as follows:

Clearly, when the first 1,1,9 arrives, the [0,5) window does not close. Why does it only close after the second 1,1,9 arrives?

I drew the following figures to illustrate the two situations.

Figures 1 and 2 depict the scenario of assigning the watermark after the source. Generally speaking, this is what we want in production.

The WaterMark is broadcast to downstream subtasks, and when an operator receives watermarks from multiple parallel inputs at the same time, the smallest one prevails.
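This minimum rule can be expressed in a few lines. The helper below is hypothetical, not Flink API; it only mirrors the rule that an operator's event-time clock advances to the minimum watermark across its input channels:

```java
public class MinWatermarkSketch {
    // An operator's effective watermark is the minimum over all input channels.
    static long combine(long... channelWatermarks) {
        long min = Long.MAX_VALUE;
        for (long wm : channelWatermarks) {
            min = Math.min(min, wm);
        }
        return min;
    }

    public static void main(String[] args) {
        // One upstream subtask has advanced to 9000 ms, the other only to 1000 ms:
        System.out.println(combine(9000L, 1000L)); // 1000
    }
}
```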

As a result, in Figure 3 (watermark assigned after the map), I had to send two records that each qualify to close the [0,5) window before it actually closed, because records from the parallelism-1 source reach the two map subtasks in turn by round-robin polling.
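The scenario in Code 2 can be simulated outside of Flink. The sketch below is an assumption-laden model (hypothetical class, round-robin distribution to two map subtasks, each subtask emitting watermark = ts * 1000); it reproduces why the first 1,1,9 is not enough:

```java
public class RoundRobinWMSim {
    // Returns the window operator's watermark after each record, assuming
    // round-robin distribution to 2 map subtasks and per-subtask wm = ts * 1000.
    static long[] run(long[] inputTs) {
        long[] subtaskWM = {Long.MIN_VALUE, Long.MIN_VALUE};
        long[] result = new long[inputTs.length];
        for (int i = 0; i < inputTs.length; i++) {
            int subtask = i % 2;                                  // round-robin
            subtaskWM[subtask] = Math.max(subtaskWM[subtask], inputTs[i] * 1000);
            result[i] = Math.min(subtaskWM[0], subtaskWM[1]);     // min across inputs
        }
        return result;
    }

    public static void main(String[] args) {
        // ts fields of the records 1,1,1 then 1,1,9 then 1,1,9
        long[] wmAfterEach = run(new long[]{1L, 9L, 9L});
        // After the first 1,1,9 the min is still only 1000, so [0,5) stays open;
        // only the second 1,1,9 pushes BOTH subtasks past 5000.
        System.out.println(wmAfterEach[1]); // 1000
        System.out.println(wmAfterEach[2]); // 9000
    }
}
```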

Expansion:
KafkaSource includes a useful optimization here. In production, we generally set the source parallelism equal to the number of topic partitions.

If the parallelism is set higher than the number of topic partitions, some subtasks will never consume data, which keeps their WaterMark stuck at Long.MIN_VALUE.

When such a WaterMark is broadcast downstream, none of the windows in the normally consuming subtasks can close, because the downstream WaterMark takes the minimum across all parallel inputs.

However, once this state has persisted for a period of time, the idle subtasks are automatically excluded when the downstream watermark is computed. This is KafkaSource's optimization.
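Flink also exposes this idleness handling explicitly on the WatermarkStrategy via withIdleness: a subtask that produces no records for the given duration is marked idle and no longer holds back the downstream minimum. A configuration sketch (the 1-minute timeout is an arbitrary example value):

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class IdlenessSketch {
    // Mark a subtask idle after 1 minute without records, so its stale
    // watermark is ignored when the downstream minimum is computed.
    static WatermarkStrategy<testWM> strategy() {
        return WatermarkStrategy
                .<testWM>forMonotonousTimestamps()
                .withIdleness(Duration.ofMinutes(1));
    }
}
```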


Posted on Fri, 03 Dec 2021 06:38:21 -0500 by $0.05$