
5. Flink streaming API

  • Implicit conversions in Scala programming

  • import org.apache.flink.streaming.api.scala._
    import org.apache.flink.table.api._
    import org.apache.flink.table.api.scala._
    

5.1. Environment

  • getExecutionEnvironment
    • Creates an execution environment that represents the context in which the current program executes.
    • If the program is invoked standalone, this method returns a local execution environment; if it is submitted to a cluster, it returns the cluster execution environment.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  • createLocalEnvironment
    • Returns a local execution environment; the parallelism can be specified when calling.
LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(1);
  • createLocalEnvironmentWithWebUI(new Configuration())
    • Returns a local execution environment with a WebUI; a Configuration must be passed in.
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
  • createRemoteEnvironment
    • Returns a cluster execution environment and submits the Jar to a remote server. When calling, you need to specify the hostname and port of the JobManager and the Jar package to run in the cluster.
StreamExecutionEnvironment env =
StreamExecutionEnvironment.createRemoteEnvironment("jobmanage-hostname", 6123,"YOURPATH//WordCount.jar");

5.2. Source

  • File based
    • readTextFile(path) - reads the data of a text file
    • readFile(fileInputFormat, path) - reads the file using a custom input format
//Start reading file
DataStreamSource<String> dataStream1 = environment.readTextFile("data/flink.txt", "UTF-8");

//readFile()
DataStreamSource<String> dataStream2 = environment.readFile(new YjxxtFileInputFormat(), "data/flink.txt");
class YjxxtFileInputFormat extends FileInputFormat<String> {
    private BufferedReader reader;
    private String nextLine; // buffer one line ahead so reachedEnd() is accurate
    @Override
    public void open(FileInputSplit split) throws IOException {
        super.open(split);
        reader = new BufferedReader(new InputStreamReader(stream));
        nextLine = reader.readLine();
    }
    @Override
    public boolean reachedEnd() {
        return nextLine == null; // end of split: no more buffered lines
    }
    @Override
    public String nextRecord(String reuse) throws IOException {
        String current = nextLine;
        nextLine = reader.readLine(); // advance the look-ahead buffer
        return current;
    }
}
  • socket based
    • socketTextStream - reads data from a socket port
  • Set based
    • fromCollection(Collection) - reads data from a collection to form a data stream. The element types in the collection must be consistent
    • fromElements(T ...) - reads data from a sequence of elements to form a data stream. The element types must be consistent.
    • generateSequence(from, to) - creates a data stream containing the sequence of numbers from from to to.
//Create a Collection
List<String> list = Arrays.asList("aa bb", "bb cc", "cc dd", "dd ee", "ee ff", " ff aa");
DataStreamSource<String> dataStream = environment.fromCollection(list);

//Based on Element
DataStreamSource<String> dataStream2 = environment.fromElements("aa bb", "bb cc", "cc dd", "dd ee", "ee ff", " ff aa");

//generateSequence(from, to)
DataStreamSource<Long> dataStream3 = environment.generateSequence(1, 100);
  • Custom source
    • addSource - customize a data source, such as FlinkKafkaConsumer, to read data from Kafka.
		<!-- Flink Kafka Connector dependency -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.11_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
//Connect data source
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
properties.setProperty("group.id", "yjx_kafka_flink");
properties.setProperty("auto.offset.reset", "earliest");
properties.setProperty("enable.auto.commit", "true");
DataStreamSource<String> dataStream = 
	environment.addSource(new FlinkKafkaConsumer011<String>("yjxflink", new SimpleStringSchema(), properties));

5.3. Transform (Operators)

  1. Basic transformation operator
    • Basic transformation operators process each event in the stream individually: each input element produces one output element. Single-value conversion, data splitting, and data filtering are typical examples of basic transformations.
  2. Keyed flow conversion operator
    • A basic requirement of many stream processing programs is to be able to group data, and the grouped data share the same attribute. The DataStream API provides an abstraction called KeyedStream, which logically partitions the DataStream. The partitioned data has the same Key value, and the partitioned streams are not related to each other.
    • The state transition operation for KeyedStream can read data or write data to the state corresponding to the current event Key. This indicates that all events with the same Key can access the same state, that is, these events can be processed together.
  3. Multi stream conversion operator
    • Many applications need to ingest multiple streams and merge them. They may also need to divide a stream into multiple streams and then apply different business logic for each stream. Next, we will discuss the operators provided in the DataStream API that can process multiple input streams or send multiple output streams.
  4. Distributed transformation operator
    • Partition operations correspond to the "data exchange strategies" discussed earlier. They define how events are assigned to tasks. When writing programs with the DataStream API, the system automatically chooses a data partitioning strategy and routes data to the right place according to the operator's semantics and the configured parallelism. Sometimes, however, we need to control or customize partitioning at the application level: for example, when we know the data is skewed and want to load-balance the stream evenly across the downstream tasks, when the business logic requires that all parallel tasks of an operator receive the same data, or when we need a fully custom partitioning strategy.

5.3.1. Basic conversion operator

  1. map
    • map() is a one-to-one transformation: for each input element, exactly one output element is produced.
  2. flatmap
    • A flattening operation, one-to-many: each input element can produce zero, one, or more output elements;
  3. filter
    • Evaluates a boolean condition for each element, filtering out those that fail it and forwarding the rest.
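Since real Flink examples need a running Flink environment, here is a minimal plain-Java sketch of the same three semantics using java.util.stream (no Flink dependency):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BasicOps {
    // map: exactly one output element per input element
    static List<String> toUpper(List<String> in) {
        return in.stream().map(String::toUpperCase).collect(Collectors.toList());
    }

    // flatMap: each input element may produce zero or more output elements
    static List<String> splitWords(List<String> in) {
        return in.stream()
                 .flatMap(line -> Arrays.stream(line.split(" ")))
                 .collect(Collectors.toList());
    }

    // filter: keep only elements for which the boolean condition holds
    static List<String> keepStartingWith(List<String> in, String prefix) {
        return in.stream().filter(w -> w.startsWith(prefix)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("aa bb", "bb cc");
        System.out.println(toUpper(lines));                           // [AA BB, BB CC]
        System.out.println(splitWords(lines));                        // [aa, bb, bb, cc]
        System.out.println(keepStartingWith(splitWords(lines), "b")); // [bb, bb]
    }
}
```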

5.3.2. Keyed conversion operator

  1. keyBy
    • Based on different keys, events in the stream are assigned to different partitions.
  2. Rolling aggregation
    • sum(), min(), max(), minBy(), maxBy()
  3. reduce() aggregation
    • Works on a data stream grouped by key: each incoming element is combined with the last reduced value for that key, producing a new element of the same type.
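A minimal plain-Java sketch of the keyBy + reduce semantics described above (no Flink dependency; a per-key HashMap stands in for Flink's keyed state):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class KeyedReduce {
    // Simulates keyBy(...).reduce(fn): each (key, value) event is combined with
    // the previously reduced value stored for that key, and the updated
    // aggregate is emitted downstream for every input event.
    static List<Integer> apply(List<Map.Entry<String, Integer>> events,
                               BinaryOperator<Integer> fn) {
        Map<String, Integer> state = new HashMap<>(); // per-key state of a KeyedStream
        List<Integer> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : events) {
            out.add(state.merge(e.getKey(), e.getValue(), fn));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> events = Arrays.asList(
                new SimpleEntry<>("a", 1), new SimpleEntry<>("a", 2),
                new SimpleEntry<>("b", 5), new SimpleEntry<>("a", 3));
        System.out.println(apply(events, Integer::sum)); // [1, 3, 5, 6]
    }
}
```

Note that events with the same key ("a") share state, while key "b" is aggregated independently, just as in a KeyedStream.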

5.3.6. Multi stream conversion operator

  • When streams are merged, events are forwarded in FIFO order.

  • union

    • Merges two or more DataStreams into one output DataStream with the same type as the input streams.
    • No deduplication; the merge order is first in, first out
    • Merged streams must be of the same type
  • connect

    • Two data streams can be merged
    • Connect can only connect two data streams
    • The data types of streams can be inconsistent
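A minimal plain-Java sketch of the union semantics, treating the inputs as already-buffered lists (no Flink dependency): same element type, arrival order preserved within each input, no deduplication:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnionSketch {
    // union(): merge streams of the SAME type; events pass through
    // in FIFO order per input, and duplicates are kept.
    static <T> List<T> union(List<T> a, List<T> b) {
        List<T> out = new ArrayList<>(a); // events of a, in original order
        out.addAll(b);                    // then events of b; no deduplication
        return out;
    }

    public static void main(String[] args) {
        // the duplicate "bb" is preserved
        System.out.println(union(Arrays.asList("aa", "bb"), Arrays.asList("bb", "cc")));
    }
}
```

In a running job the two inputs would interleave by arrival time rather than concatenate; connect differs in that it keeps the two (possibly differently typed) streams separate until a CoMap/CoFlatMap is applied.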

5.3.9. Distributed transformation operator

  1. shuffle()

    • Random data exchange
  2. rebalance()

    • A round-robin load balancing algorithm distributes the input stream evenly across all subsequent parallel tasks
  3. rescale()

    • A lightweight round-robin load balancing strategy.
    • Each task establishes channels only with a subset of the parallel tasks of the downstream operator.
  4. broadcast()

    • All data is copied and sent to all parallel tasks of downstream operators
  5. global()

    • All stream data is sent to the first parallel task of the downstream operator.
  6. The partitionCustom() method is used to customize the partition policy.
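The rebalance() and broadcast() strategies above can be sketched in plain Java (no Flink dependency; the output lists stand in for the downstream parallel tasks):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Partitioners {
    // rebalance(): round-robin distribution over all downstream parallel tasks
    static List<List<Integer>> rebalance(List<Integer> events, int parallelism) {
        List<List<Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) tasks.add(new ArrayList<>());
        int next = 0;
        for (Integer e : events) {
            tasks.get(next).add(e);          // send to the next task in the cycle
            next = (next + 1) % parallelism;
        }
        return tasks;
    }

    // broadcast(): every event is copied to every downstream task
    static List<List<Integer>> broadcast(List<Integer> events, int parallelism) {
        List<List<Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) tasks.add(new ArrayList<>(events));
        return tasks;
    }

    public static void main(String[] args) {
        System.out.println(rebalance(Arrays.asList(1, 2, 3, 4, 5), 2)); // [[1, 3, 5], [2, 4]]
        System.out.println(broadcast(Arrays.asList(1, 2), 3));          // [[1, 2], [1, 2], [1, 2]]
    }
}
```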

5.4. Supported data types

  • Primitive types (Java's eight basic types, which fall into four categories)
  • Tuples (tuples) for Java and Scala
  • Scala case classes
  • POJO type - must be a public class
  • Some special types
DataStream<Long> numbers = env.fromElements(1L, 2L, 3L, 4L);
numbers.map(n -> n + 1);

//PoJo
DataStream<Person> persons = env.fromElements(
new Person("Alex", 42),
new Person("Wendy", 23)
);
//key definition method
persons.keyBy("age");

5.5. UDF function

  • Flink exposes interfaces for all UDF functions (implemented as interfaces or abstract classes).
    • MapFunction
    • FilterFunction
    • FlatMapFunction
    • ProcessFunction for custom advanced logic

5.6. Rich function

  • Before a function processes data, it may need to do some initialization;

  • it may need information about the function's execution context while processing data;

  • and it may need to do some cleanup after processing.

  • All transformation functions provided by the DataStream API have a "rich" version:

    1. RichMapFunction
    2. RichFlatMapFunction
    3. RichFilterFunction
  • Additional methods can be implemented when using rich functions:

    1. The open() method is the initialization method of a rich function
    2. The close() method is the last method called in the life cycle
    3. The getRuntimeContext() method provides the function's RuntimeContext, e.g. the parallelism and the subtask index
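The lifecycle above can be sketched in plain Java with a hypothetical RichFunction-like interface (not Flink's actual API): open() runs once before any element, map() once per element, close() once afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RichLifecycle {
    // Hypothetical interface mimicking a RichMapFunction's lifecycle.
    interface SimpleRichMap<I, O> {
        default void open() {}   // one-time initialization before processing
        O map(I value);          // per-element processing
        default void close() {}  // one-time cleanup after processing
    }

    // Drives the lifecycle: open, map every element, close.
    static <I, O> List<O> run(List<I> input, SimpleRichMap<I, O> fn) {
        fn.open();
        List<O> out = new ArrayList<>();
        for (I v : input) out.add(fn.map(v));
        fn.close();
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(1, 2, 3), v -> v * 2)); // [2, 4, 6]
    }
}
```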

5.7. Sink

  • writeAsText() // outputs the results to a text file
  • writeAsCsv(...) // outputs the results to a CSV file
  • print() // prints the results to the console
  • writeUsingOutputFormat() // custom output format.
  • writeToSocket(...) // outputs the results to a socket on a given host and port.

Tags: Scala Big Data flink

Posted on Mon, 06 Sep 2021 20:29:53 -0400 by phpdragon