Big Data Flink: Transformation

1. Official website API list

https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/

Overall, operations on streaming data can be divided into four categories.
The first category operates on individual records, for example filtering out records that do not meet a condition (the Filter operation) or transforming each record (the Map operation).
The second category operates on multiple records. For example, to count the total trading volume of orders within an hour, the trading volume of all order records in that hour must be added up. To support this kind of operation, the required records have to be grouped together through a Window before they are processed.
The third category operates on multiple streams and turns them into a single stream. For example, several streams can be combined with operations such as Union, Join, or Connect. The merging logic of these operations differs, but they all produce a new, unified stream, on which cross-stream operations can then be performed.
Finally, DataStream also supports the operation symmetric to merging: splitting one stream into several streams according to certain rules (the Split operation). Each resulting stream is a subset of the original stream, so the different streams can be processed differently.

2 Basic operations (some details omitted)

2.1 map

⚫ API
map: applies a function to each element in the collection and returns the transformed result
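
A minimal sketch (not from the original text), assuming an existing DataStream<String> named words and the usual Flink imports: map converts each element one-to-one.

// hypothetical example: upper-case every word; one input element produces exactly one output element
DataStream<String> upper = words.map(String::toUpperCase);
upper.print();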

2.2 flatMap

⚫ API
flatMap: transforms each element in the collection into zero, one, or more elements and returns the flattened result
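
A minimal sketch, assuming a hypothetical DataStream<String> named lines and the usual Flink imports (plus org.apache.flink.api.common.typeinfo.Types): each line is split into words, so one input element can produce several output elements.

// hypothetical example: one line in, many words out
DataStream<String> words = lines.flatMap((String line, Collector<String> out) -> {
    for (String w : line.split(" ")) {
        out.collect(w);
    }
}).returns(Types.STRING); // the explicit type hint is needed for lambdas because of Java type erasure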

2.3 keyBy

keyBy: groups the data in the stream by the specified key, as already demonstrated in the earlier introductory example.
Note:
In stream processing there is no groupBy; use keyBy instead.
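
A minimal sketch, assuming a hypothetical DataStream<Tuple2<String, Integer>> named wordAndOne: the stream is keyed by the first tuple field, just as in the full demo further below.

// hypothetical example: group (word, 1) tuples by the word field
KeyedStream<Tuple2<String, Integer>, String> keyed = wordAndOne.keyBy(t -> t.f0);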

2.4 filter

⚫ API
filter: evaluates the given condition against each element in the collection and keeps only the elements for which the condition returns true
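
A minimal sketch, assuming a hypothetical DataStream<String> named words: only the elements for which the predicate returns true are kept.

// hypothetical example: drop the sensitive word "heihei", keep everything else
DataStream<String> cleaned = words.filter(word -> !word.equals("heihei"));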

2.5 sum

⚫ API
sum: sums the elements in the collection according to the specified field
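
A minimal sketch, reusing the hypothetical keyed stream from the keyBy sketch above: sum aggregates the tuple field at position 1 for each key.

// hypothetical example: word count by summing the second tuple field (index 1)
DataStream<Tuple2<String, Integer>> counts = keyed.sum(1);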

2.6 reduce

⚫ API
reduce: aggregates the elements in the collection with a user-defined reduce function
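
A minimal sketch, again using the hypothetical keyed stream from above: reduce expresses the same word count with a user-defined aggregation function.

// hypothetical example: merge two tuples with the same key by adding their counts
DataStream<Tuple2<String, Integer>> reduced = keyed.reduce(
        (a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));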

2.7 code demonstration

⚫ Requirements:
Count the words in the streaming data and exclude the sensitive word heihei
⚫ Code demonstration

package cn.oldlu.transformation;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * Author oldlu
 * Desc
 */
public class TransformationDemo01 {
    public static void main(String[] args) throws Exception {
        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
        //2.source
        DataStream<String> linesDS = env.socketTextStream("node1", 9999);

        //3. Data processing - transformation
        DataStream<String> wordsDS = linesDS.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                //value is one line of input data
                String[] words = value.split(" ");
                for (String word : words) {
                    out.collect(word);//Collect and return the cut words one by one
                }
            }
        });
        DataStream<String> filtedDS = wordsDS.filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String value) throws Exception {
                return !value.equals("heihei");
            }
        });
        DataStream<Tuple2<String, Integer>> wordAndOnesDS = filtedDS.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                //value is each incoming word
                return Tuple2.of(value, 1);
            }
        });
        //KeyedStream<Tuple2<String, Integer>, Tuple> groupedDS = wordAndOnesDS.keyBy(0);
        KeyedStream<Tuple2<String, Integer>, String> groupedDS = wordAndOnesDS.keyBy(t -> t.f0);

        DataStream<Tuple2<String, Integer>> result1 = groupedDS.sum(1);
        DataStream<Tuple2<String, Integer>> result2 = groupedDS.reduce(new ReduceFunction<Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
                return Tuple2.of(value1.f0, value1.f1 + value2.f1);
            }
        });

        //4. Output result - sink
        result1.print("result1");
        result2.print("result2");

        //5. Trigger execute
        env.execute();
    }
}

3 Merging and splitting streams

3.1 union and connect

⚫ API
union:
The union operator merges multiple data streams of the same type into a new data stream of the same type, i.e., several DataStream[T] are merged into one new DataStream[T]. The data is merged in first-in-first-out order and is not deduplicated.
connect:
connect provides functionality similar to union and is also used to join two data streams. The differences are: connect can only join two data streams, while union can join more than two; the two streams joined by connect may have different element types, while the streams joined by union must have the same element type.
After two DataStreams are connected they become a ConnectedStreams. A ConnectedStreams applies a separate processing method to each of the two streams, and state can be shared between the two streams.
⚫ Requirements:
Union two streams of String type, and connect one stream of String type with one stream of Long type
⚫ Code implementation:

package cn.oldlu.transformation;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.datastream.ConnectedStreams;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoMapFunction;

/**
 * Author oldlu
 * Desc
 */
public class TransformationDemo02 {
    public static void main(String[] args) throws Exception {
        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);

        //2.Source
        DataStream<String> ds1 = env.fromElements("hadoop", "spark", "flink");
        DataStream<String> ds2 = env.fromElements("hadoop", "spark", "flink");
        DataStream<Long> ds3 = env.fromElements(1L, 2L, 3L);

        //3.Transformation
        DataStream<String> result1 = ds1.union(ds2);//union merges without deduplication: https://blog.csdn.net/valada/article/details/104367378
        ConnectedStreams<String, Long> tempResult = ds1.connect(ds3);
        //interface CoMapFunction<IN1, IN2, OUT>
        DataStream<String> result2 = tempResult.map(new CoMapFunction<String, Long, String>() {
            @Override
            public String map1(String value) throws Exception {
                return "String->String:" + value;
            }

            @Override
            public String map2(Long value) throws Exception {
                return "Long->String:" + value.toString();
            }
        });

        //4.Sink
        result1.print();
        result2.print();

        //5.execute
        env.execute();
    }
}

3.2 split, select and Side Outputs

⚫ API
split divides one stream into multiple streams, and select retrieves the corresponding data after the split
Note: the split API has been deprecated and removed, so side outputs are used instead
Side Outputs: use the process method to process the data in the stream and, depending on the processing result, collect the records into different OutputTags
⚫ Requirements:
Split the data in the stream into odd and even numbers, then retrieve each of the split streams
⚫ Code implementation:

package cn.oldlu.transformation;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

/**
 * Author oldlu
 * Desc
 */
public class TransformationDemo03 {
    public static void main(String[] args) throws Exception {
        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);

        //2.Source
        DataStreamSource<Integer> ds = env.fromElements(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

        //3.Transformation
        /*SplitStream<Integer> splitResult = ds.split(new OutputSelector<Integer>() {
            @Override
            public Iterable<String> select(Integer value) {
                //value is the incoming number
                if (value % 2 == 0) {
                    //even numbers
                    ArrayList<String> list = new ArrayList<>();
                    list.add("Even ");
                    return list;
                } else {
                    //Odd number
                    ArrayList<String> list = new ArrayList<>();
                    list.add("Odd ");
                    return list;
                }
            }
        });
        DataStream<Integer> evenResult = splitResult.select("Even ");
        DataStream<Integer> oddResult = splitResult.select("Odd ");*/

        //Define two side-output tags (OutputTag)
        OutputTag<Integer> tag_even = new OutputTag<Integer>("even numbers", TypeInformation.of(Integer.class));
        OutputTag<Integer> tag_odd = new OutputTag<Integer>("Odd number"){};
        //Process the data in ds
        SingleOutputStreamOperator<Integer> tagResult = ds.process(new ProcessFunction<Integer, Integer>() {
            @Override
            public void processElement(Integer value, Context ctx, Collector<Integer> out) throws Exception {
                if (value % 2 == 0) {
                    //even numbers
                    ctx.output(tag_even, value);
                } else {
                    //Odd number
                    ctx.output(tag_odd, value);
                }
            }
        });

        //Retrieve the side-output data by tag
        DataStream<Integer> evenResult = tagResult.getSideOutput(tag_even);
        DataStream<Integer> oddResult = tagResult.getSideOutput(tag_odd);

        //4.Sink
        evenResult.print("even numbers");
        oddResult.print("Odd number");

        //5.execute
        env.execute();
    }
}

4 Partitioning

4.1 rebalance partitioning

⚫ API
Similar to repartition in Spark, but more powerful: it can be used directly to relieve data skew.
Data skew also occurs in Flink. For example, suppose roughly 1 billion records have to be processed; during processing the situation shown in the figure may occur, where the data piles up unevenly. With data skew, the other three machines have to wait for machine 1 to finish before the overall job is complete.
Therefore, in practice a better solution to this situation is rebalance, which internally uses round-robin to spread the data evenly.
⚫ Code demonstration:

package cn.oldlu.transformation;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/**
 * Author oldlu
 * Desc
 */
public class TransformationDemo04 {
    public static void main(String[] args) throws Exception {
        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC).setParallelism(3);

        //2.source
        DataStream<Long> longDS = env.fromSequence(0, 100);

        //3.Transformation
        //The filter below may leave the data unevenly distributed across the subtasks, so data skew may occur
        DataStream<Long> filterDS = longDS.filter(new FilterFunction<Long>() {
            @Override
            public boolean filter(Long num) throws Exception {
                return num > 10;
            }
        });

        //Next, use map to convert each record to (subtask index, 1)
        //Rich functions expose more APIs than the plain MapFunction, e.g. getRuntimeContext()
        DataStream<Tuple2<Integer, Integer>> result1 = filterDS
                .map(new RichMapFunction<Long, Tuple2<Integer, Integer>>() {
                    @Override
                    public Tuple2<Integer, Integer> map(Long value) throws Exception {
                        //Get partition number / subtask number
                        int id = getRuntimeContext().getIndexOfThisSubtask();
                        return Tuple2.of(id, 1);
                    }
                }).keyBy(t -> t.f0).sum(1);

        DataStream<Tuple2<Integer, Integer>> result2 = filterDS.rebalance()
                .map(new RichMapFunction<Long, Tuple2<Integer, Integer>>() {
                    @Override
                    public Tuple2<Integer, Integer> map(Long value) throws Exception {
                        //Get partition number / subtask number
                        int id = getRuntimeContext().getIndexOfThisSubtask();
                        return Tuple2.of(id, 1);
                    }
                }).keyBy(t -> t.f0).sum(1);

        //4.sink
        //result1.print();// Data skew is possible
        result2.print();//rebalance() redistributes the data evenly (round-robin) before processing, which resolves the data skew

        //5.execute
        env.execute();
    }
}

4.2 Other partitioning operators

⚫ API
The redistribution operators used in the code below are:
global: sends all records to the first instance of the downstream operator
broadcast: sends every record to all instances of the downstream operator
forward: sends each record to the downstream instance with the same index (upstream and downstream parallelism must match)
shuffle: distributes records to the downstream instances randomly
rebalance: distributes records to the downstream instances in a round-robin fashion
rescale: distributes records round-robin, but only within local groups of upstream and downstream instances
partitionCustom: distributes records according to a user-defined Partitioner

Explanation:
rescale partitioning: based on the parallelism of the upstream and downstream operators, records are output round-robin to the downstream operator instances, but only within local groups.
Example:
If the upstream parallelism is 2 and the downstream parallelism is 4, one upstream instance distributes its records round-robin to two of the downstream instances, and the other upstream instance distributes its records round-robin to the other two downstream instances. If the upstream parallelism is 4 and the downstream parallelism is 2, two of the upstream instances send their records to one downstream instance, and the other two upstream instances send their records to the other downstream instance.
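
A minimal sketch of the 2-to-4 case just described (a hypothetical setup, not from the original text), assuming an existing StreamExecutionEnvironment named env: the source runs with parallelism 2 and the downstream map with parallelism 4, so each source instance feeds two map instances round-robin.

// hypothetical example: upstream parallelism 2, downstream parallelism 4
DataStream<Long> src = env.fromSequence(0, 100).setParallelism(2);
src.rescale()
   .map(new MapFunction<Long, Long>() {
       @Override
       public Long map(Long value) {
           return value;
       }
   })
   .setParallelism(4)
   .print();
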
⚫ Requirements:
Apply the various partitioning operators to the elements in the stream and print the output
⚫ Code implementation:

package cn.oldlu.transformation;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * Author oldlu
 * Desc
 */
public class TransformationDemo05 {
    public static void main(String[] args) throws Exception {
        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);

        //2.Source
        DataStream<String> linesDS = env.readTextFile("data/input/words.txt");
        SingleOutputStreamOperator<Tuple2<String, Integer>> tupleDS = linesDS.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
                String[] words = value.split(" ");
                for (String word : words) {
                    out.collect(Tuple2.of(word, 1));
                }
            }
        });

        //3.Transformation
        DataStream<Tuple2<String, Integer>> result1 = tupleDS.global();
        DataStream<Tuple2<String, Integer>> result2 = tupleDS.broadcast();
        DataStream<Tuple2<String, Integer>> result3 = tupleDS.forward();
        DataStream<Tuple2<String, Integer>> result4 = tupleDS.shuffle();
        DataStream<Tuple2<String, Integer>> result5 = tupleDS.rebalance();
        DataStream<Tuple2<String, Integer>> result6 = tupleDS.rescale();
        DataStream<Tuple2<String, Integer>> result7 = tupleDS.partitionCustom(new Partitioner<String>() {
            @Override
            public int partition(String key, int numPartitions) {
                return key.equals("hello") ? 0 : 1;
            }
        }, t -> t.f0);

        //4.sink
        //result1.print();
        //result2.print();
        //result3.print();
        //result4.print();
        //result5.print();
        //result6.print();
        result7.print();

        //5.execute
        env.execute();
    }
}
