Big data: Flink performance optimization

1 History Server performance optimization

Flink's HistoryServer is mainly used to store and view the history of completed jobs. See the official documentation for details:
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/advanced/historyserver.html

# Directory to upload completed jobs to. Add this directory to the list of
# monitored directories of the HistoryServer as well (see below).
jobmanager.archive.fs.dir: hdfs://node01:8020/completed-jobs/

# The address under which the web-based HistoryServer listens.
historyserver.web.address: 0.0.0.0

# The port under which the web-based HistoryServer listens.
historyserver.web.port: 8082

# Comma separated list of directories to monitor for completed jobs.
historyserver.archive.fs.dir: hdfs://node01:8020/completed-jobs/

# Interval in milliseconds for refreshing the monitored directories.
historyserver.archive.fs.refresh-interval: 10000

⚫ Parameter definitions
○ jobmanager.archive.fs.dir: the directory to which logs of completed Flink jobs are archived
○ historyserver.archive.fs.dir: the HDFS directory monitored by the Flink HistoryServer process
○ historyserver.web.address: the host on which the Flink HistoryServer process listens
○ historyserver.web.port: the port occupied by the Flink HistoryServer process
○ historyserver.archive.fs.refresh-interval: the interval (in milliseconds) at which the monitored directories are refreshed
⚫ The HistoryServer listens on port 8082 by default and is started/stopped with:
○ bin/historyserver.sh (start|start-foreground|stop)

2 Serialization

⚫ First, consider Java's native serialization:
Advantages: it is simple and general; an object only needs to implement the Serializable interface.
Disadvantages: efficiency is relatively low, and if the user does not specify a serialVersionUID, data serialized earlier will likely fail to deserialize after the job is recompiled. (This is also a pain point of Spark Streaming checkpoints: in practice, jobs often fail to recover from a checkpoint after the code has been modified.) For distributed computing, data transmission efficiency is very important: a good serialization framework, with low serialization time and low memory consumption, can greatly improve computational efficiency and job stability. A sketch illustrating both points follows this list.
⚫ Flink and Spark take different approaches to data serialization.
Spark uses Java native serialization for all data by default, and users can also configure Kryo. Compared with Java native serialization, Kryo performs better in both serialization speed and the memory footprint of the serialized result (the Spark documentation claims Kryo generally uses up to 10x less memory than Java native serialization). The Spark documentation states that the only reason Kryo is not the default serialization framework is that it requires users to register the classes to be serialized, and it recommends enabling Kryo via configuration. Flink, by contrast, implements its own set of efficient serializers.
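To make both points concrete, here is a minimal sketch (MyEvent is a hypothetical class, not from the original article): pinning serialVersionUID protects Java-serialized data across recompiles, and registering the type with Kryo via Flink's ExecutionConfig keeps the Kryo fallback path compact.

import java.io.Serializable;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SerializationSetup {

    // Hypothetical event type, used only for illustration.
    public static class MyEvent implements Serializable {
        // Pinning serialVersionUID keeps the class compatible with previously
        // serialized data; without it the JVM derives an ID from the class
        // structure, so recompiling after a code change can break
        // deserialization (the checkpoint-recovery pain point above).
        private static final long serialVersionUID = 1L;

        public String userName;
        public long changesCount;
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Register the type with Kryo so that, if Flink's own serializers cannot
        // handle it and it falls back to Kryo, a compact integer tag is written
        // instead of the full class name.
        env.getConfig().registerKryoType(MyEvent.class);
    }
}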

3 Object reuse

For example, the following code:

stream
    .apply(new WindowFunction<WikipediaEditEvent, Tuple2<String, Long>, String, TimeWindow>() {
        @Override
        public void apply(String userName, TimeWindow timeWindow, Iterable<WikipediaEditEvent> iterable, Collector<Tuple2<String, Long>> collector) throws Exception {
            long changesCount = ...
            // A new Tuple2 instance is created on every invocation
            collector.collect(new Tuple2<>(userName, changesCount));
        }
    });

As the comment notes, every time the apply function executes, a new Tuple2 instance is created, which increases pressure on the garbage collector. One way to solve this problem is to reuse the same instance:

stream
    .apply(new WindowFunction<WikipediaEditEvent, Tuple2<String, Long>, String, TimeWindow>() {
        // Create an instance that we will reuse on every call
        private Tuple2<String, Long> result = new Tuple2<>();

        @Override
        public void apply(String userName, TimeWindow timeWindow, Iterable<WikipediaEditEvent> iterable, Collector<Tuple2<String, Long>> collector) throws Exception {
            long changesCount = ...
            // Set fields on the existing object instead of creating a new one
            result.f0 = userName;
            // Auto-boxing!! A new Long value may be created here
            result.f1 = changesCount;
            // Reuse the same Tuple2 object
            collector.collect(result);
        }
    });

This approach still indirectly creates an instance of the Long class through auto-boxing. To avoid that, Flink provides a set of mutable value classes: IntValue, LongValue, StringValue, FloatValue, etc. Here is how to use them:

stream
    .apply(new WindowFunction<WikipediaEditEvent, Tuple2<String, LongValue>, String, TimeWindow>() {
        // Create a mutable count instance
        private LongValue count = new LongValue();
        // Assign the mutable count to the tuple
        private Tuple2<String, LongValue> result = new Tuple2<>("", count);

        @Override
        // Notice that the output type is now Tuple2<String, LongValue>
        public void apply(String userName, TimeWindow timeWindow, Iterable<WikipediaEditEvent> iterable, Collector<Tuple2<String, LongValue>> collector) throws Exception {
            long changesCount = ...

            // Set fields on the existing object instead of creating a new one
            result.f0 = userName;
            // Update the mutable count value
            count.setValue(changesCount);

            // Reuse the same Tuple2 and the same LongValue instance
            collector.collect(result);
        }
    });
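In the same spirit, Flink's ExecutionConfig offers an object-reuse mode in which the runtime itself recycles record instances between chained operators. A minimal sketch, assuming the job's functions never hold on to input records across invocations (otherwise the flag is unsafe):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ObjectReuseSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Let the runtime reuse record instances when passing data between
        // operators, reducing allocation and GC pressure. Enable this only if
        // user functions neither cache nor mutate input records after emitting
        // their results downstream.
        env.getConfig().enableObjectReuse();
    }
}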

4 Data skew

If grouping operations such as keyBy are used in a Flink program, data skew can easily occur. Data skew slows down the overall computation, and some subtasks may receive little or no data, leaving their allocated resources unused.
⚫ Operations with windows
○ the data is distributed unevenly across the keyed windows, so a window with too much data is processed slowly
○ the resulting back pressure causes the Source to process data more and more slowly
○ which in turn makes the processing of all windows slower and slower
⚫ Operations without windows
○ some subtasks have little or no data to process, so their allocated resources are not utilized at all
⚫ In the WebUI:
Open the Subtasks view of each operator in the WebUI to see how records are distributed across the parallel subtasks. When the distribution is very uneven, some subtasks process data slowly while others sit nearly idle.
⚫ Optimization methods (see the sketch after this list):
○ spread the keys evenly (hashing, salting, etc.)
○ use a custom partitioner
○ use rebalance
Note: rebalance is only for skewed data. Do not use rebalance when the data is not skewed, otherwise the shuffle it introduces will generate significant network overhead.
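As a sketch of the key-salting idea under illustrative assumptions (the SALT_BUCKETS constant, the one-minute processing-time windows, and the class/method names here are not from the original article): the hot key is spread over several sub-keys, pre-aggregated per window, then the salt is stripped and the partial sums are merged per original key.

import java.util.concurrent.ThreadLocalRandom;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class KeySaltingSketch {

    // Number of sub-keys each hot key is split into (illustrative value).
    private static final int SALT_BUCKETS = 8;

    public static DataStream<Tuple2<String, Long>> saltedSum(DataStream<Tuple2<String, Long>> events) {
        // Stage 1: append a random salt to the key so that records of a hot key
        // are spread over SALT_BUCKETS parallel subtasks, then pre-aggregate.
        DataStream<Tuple2<String, Long>> partial = events
                .map(e -> Tuple2.of(e.f0 + "#" + ThreadLocalRandom.current().nextInt(SALT_BUCKETS), e.f1))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1);

        // Stage 2: strip the salt and merge the partial sums per original key.
        // Note: with processing-time windows the two stages' boundaries are not
        // exactly aligned; a production job would typically use event time here.
        return partial
                .map(t -> Tuple2.of(t.f0.substring(0, t.f0.lastIndexOf('#')), t.f1))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1);
    }
}

Custom partitioning (DataStream.partitionCustom) and rebalance() pursue the same goal of spreading records more evenly across subtasks, but they lose the key grouping, so they suit operations that do not need keyed state.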
