1. pom.xml dependency
```xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
```
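The flink-connector-kafka dependency is only needed when Kafka is used as a source or sink. As a quick illustration (not one of the examples below; the broker address, consumer group and topic are placeholders), a Kafka source could be added like this:

```scala
package apiTest

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object KafkaSourceTest {

  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "192.168.xxx.xxx:9092")  // placeholder broker address
    props.setProperty("group.id", "my_consumer_group")              // placeholder consumer group

    // Read the placeholder topic as a stream of Strings
    val kafka_input = senv.addSource(
      new FlinkKafkaConsumer[String]("my_topic", new SimpleStringSchema(), props)
    )
    kafka_input.print("kafka_input")

    senv.execute("KafkaSourceTest")
  }
}
```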
2. Use DataStream to implement word count
2.1 word count without window
apiTest\WordSourceFunction.scala
Function: a data source that continuously emits random words
```scala
package apiTest

import org.apache.flink.streaming.api.functions.source.SourceFunction

import scala.util.Random

class WordSourceFunction extends SourceFunction[String] {

  private var is_running = true
  private val words = Array("hello", "world", "flink", "stream", "batch", "table", "sql")

  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    while (is_running) {
      val index = Random.nextInt(words.size)
      sourceContext.collect(words(index))
      // Emit one word per second
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    is_running = false
  }
}
```
apiTest\StreamWordCount.scala
```scala
package apiTest

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}

object StreamWordCount {

  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val ds: DataStream[(String, Int)] = senv.addSource(new WordSourceFunction())
      .map(word => (word, 1))
      .keyBy(_._1)
      .sum(1)
    ds.print()

    // Transformations are lazy; call execute() to run the job
    senv.execute()
  }
}
```
Without a window, the count for each word keeps increasing. The execution results are as follows:
```
8> (stream,1)
1> (table,1)
4> (sql,1)
3> (hello,1)
4> (sql,2)
1> (table,2)
7> (flink,1)
4> (sql,3)
7> (flink,2)
4> (sql,4)
7> (flink,3)
7> (flink,4)
......omitted......
```
2.2 word count with window
apiTest\WindowStreamWordCount.scala
```scala
package apiTest

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object WindowStreamWordCount {

  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val ds: DataStream[(String, Int)] = senv.addSource(new WordSourceFunction())
      .map(word => (word, 1))
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      .sum(1)
    ds.print()

    // Transformations are lazy; call execute() to run the job
    senv.execute()
  }
}
```
Results of the first window. It contains only 7 words in total, because the job starts partway through the first 10-second window; this can be addressed with timestamps and watermarks (todo; a sketch follows after the window results below):
```
8> (stream,2)
7> (batch,3)
3> (hello,1)
5> (world,1)
```
Results of the second window
```
5> (world,2)
7> (flink,2)
8> (stream,4)
4> (sql,1)
3> (hello,1)
```
Results of the third window
```
8> (stream,1)
5> (world,2)
4> (sql,1)
3> (hello,1)
7> (batch,3)
7> (flink,2)
```
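Regarding the todo above, here is a minimal sketch of assigning timestamps and watermarks so that the window is evaluated on event time. It assumes the Flink 1.11+ WatermarkStrategy API; WordWithTimeSourceFunction and EventTimeWindowWordCount are made up for this sketch, and the event time is simply taken from System.currentTimeMillis():

```scala
package apiTest

import java.time.Duration

import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

import scala.util.Random

// Hypothetical variant of WordSourceFunction that also emits an event timestamp
class WordWithTimeSourceFunction extends SourceFunction[(String, Long)] {
  private var is_running = true
  private val words = Array("hello", "world", "flink", "stream", "batch", "table", "sql")

  override def run(ctx: SourceFunction.SourceContext[(String, Long)]): Unit = {
    while (is_running) {
      ctx.collect((words(Random.nextInt(words.length)), System.currentTimeMillis()))
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    is_running = false
  }
}

object EventTimeWindowWordCount {

  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val source: DataStream[(String, Long)] = senv.addSource(new WordWithTimeSourceFunction())

    val withTimestamps = source.assignTimestampsAndWatermarks(
      WatermarkStrategy
        // Allow events to arrive up to 1 second out of order
        .forBoundedOutOfOrderness[(String, Long)](Duration.ofSeconds(1))
        .withTimestampAssigner(new SerializableTimestampAssigner[(String, Long)] {
          override def extractTimestamp(element: (String, Long), recordTimestamp: Long): Long = element._2
        })
    )

    withTimestamps
      .map(e => (e._1, 1))
      .keyBy(_._1)
      // The window is now evaluated on event time instead of processing time
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))
      .sum(1)
      .print()

    senv.execute("EventTimeWindowWordCount")
  }
}
```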
3. DataStream API data sources
The following data sources are supported:
- Built-in and connector data sources, such as senv.readTextFile(filePath) and senv.addSource(FlinkKafkaConsumer[OUT])
- senv.addSource(SourceFunction[OUT]): the parallelism is 1
- senv.addSource(ParallelSourceFunction[OUT]): the parallelism can be n; ParallelSourceFunction[OUT] extends SourceFunction[OUT]
- senv.addSource(RichParallelSourceFunction[OUT]): the parallelism can be n, and the RuntimeContext can be accessed; RichParallelSourceFunction[OUT] implements ParallelSourceFunction[OUT] (see the sketch after this list)
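A minimal sketch of a RichParallelSourceFunction, as referenced in the list above (the class name and the emitted records are made up for this example):

```scala
package apiTest

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

// Hypothetical parallel source: every subtask emits its own numbered records
class ParallelNumberSource extends RichParallelSourceFunction[String] {

  private var is_running = true
  private var subtask_index = 0

  override def open(parameters: Configuration): Unit = {
    // The RuntimeContext is only available in the Rich* variants
    subtask_index = getRuntimeContext.getIndexOfThisSubtask
  }

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    var i = 0
    while (is_running) {
      ctx.collect(s"subtask-${subtask_index} record-${i}")
      i += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    is_running = false
  }
}
```

It could then be used as senv.addSource(new ParallelNumberSource()).setParallelism(2), so two subtasks emit data concurrently, whereas a plain SourceFunction source is restricted to parallelism 1.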
The following describes the built-in data sources:
3.1 file based
File reading is carried out by two kinds of subtasks:
- File monitoring task: runs with parallelism 1; it monitors the path, splits the file, and hands the splits to the data reading tasks
- Data reading tasks: read the assigned splits in parallel
readTextFile.txt file content:
hello world
Example code:
```scala
package datastreamApi

import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

object DatasourceTest {

  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val text_filepath = "src/main/resources/readTextFile.txt"

    val text_input = senv.readTextFile(text_filepath, "UTF-8")
    text_input.print("text_input")
    /*
    text_input:8> hello
    text_input:3> world
    */

    val file_input = senv.readFile(
      new TextInputFormat(new Path(text_filepath)),
      text_filepath,
      // FileProcessingMode.PROCESS_ONCE,       // read the file once, then exit
      FileProcessingMode.PROCESS_CONTINUOUSLY,  // rescan the file every 5 seconds; if it changed, read all the contents again
      5000L
    )
    file_input.print("file_input")
    /*
    file_input:5> hello
    file_input:8> world
    */

    senv.execute()
  }
}
```
3.2 socket based
- Install ncat
```
[root@bigdata005 ~]# yum install -y nc
```
- Start nc
```
[root@bigdata005 ~]# nc -lk 9998
hello
world
```
- Start the Flink job that reads from the socket
```scala
package datastreamApi

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object DatasourceTest {

  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val socket_input = senv.socketTextStream(
      "192.168.xxx.xxx",
      9998,
      '\n',
      0L  // maximum number of retries when the connection fails
    )
    socket_input.print("socket_input")

    senv.execute("DatasourceTest")
  }
}
```
Execution results:
```
socket_input:3> hello
socket_input:4> world
```
3.3 collection based
```scala
package datastreamApi

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.util.NumberSequenceIterator

import scala.collection.mutable.ArrayBuffer

object DatasourceTest {

  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val input1 = senv.fromElements(("key1", 10), ("key2", 20))
    input1.print("input1")
    /*
    input1:7> (key2,20)
    input1:6> (key1,10)
    */

    val datas = ArrayBuffer(("key1", 10), ("key2", 20))
    val input2 = senv.fromCollection(datas)
    input2.print("input2")
    /*
    input2:6> (key2,20)
    input2:5> (key1,10)
    */

    // The parameter is a SplittableIterator[T]; this example generates the sequence 0, 1, 2, 3
    val input3 = senv.fromParallelCollection(new NumberSequenceIterator(0L, 3L))
    input3.print("input3")
    /*
    input3:3> 2
    input3:2> 1
    input3:1> 0
    input3:4> 3
    */

    // Generates the sequence 0, 1, 2, 3
    val input4 = senv.fromSequence(0L, 3L)
    input4.print("input4")
    /*
    input4:3> 0
    input4:8> 3
    input4:7> 2
    input4:5> 1
    */

    senv.execute("DatasourceTest")
  }
}
```
4. DataStream API data sinks
- The DataStream.write*() methods do not participate in checkpointing, so their processing semantics cannot reach exactly once; flink-connector-filesystem can be used to achieve exactly once (see the sketch after the example below)
```scala
package datastreamApi

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.java.io.TextOutputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

object DatasourceTest {

  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val input = senv.fromElements("hello", "world")

    input.writeUsingOutputFormat(
      new TextOutputFormat[String](
        new Path("src/main/resources/textOutputDir"),
        "UTF-8"
      )
    )

    input.writeToSocket(
      "192.168.xxx.xxx",
      9998,
      new SimpleStringSchema()
    )
    /*
    [root@bigdata005 ~]# nc -lk 9998
    helloworld
    */

    input.print("print")
    /*
    print:2> world
    print:1> hello
    */

    input.printToErr("printToErr")
    /*
    Printed to stderr, shown in red in the console
    printToErr:7> world
    printToErr:6> hello
    */

    senv.execute("DatasourceTest")
  }
}
```
The directory structure of textOutputDir is shown in the following figure:
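As noted at the top of this section, write*() alone cannot give exactly once. A minimal sketch of a checkpoint-aware file sink using StreamingFileSink (assuming the version of Flink in use provides it; ExactlyOnceFileSinkTest, the checkpoint interval and the output directory are made up here):

```scala
package datastreamApi

import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

object ExactlyOnceFileSinkTest {

  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    // StreamingFileSink only reaches exactly once when checkpointing is enabled
    senv.enableCheckpointing(10000L)

    val input = senv.fromElements("hello", "world")

    // Row format: one UTF-8 encoded line per record, written under the given base path
    val fileSink: StreamingFileSink[String] = StreamingFileSink
      .forRowFormat(
        new Path("src/main/resources/streamingFileSinkDir"),
        new SimpleStringEncoder[String]("UTF-8"))
      .build()

    input.addSink(fileSink)
    senv.execute("ExactlyOnceFileSinkTest")
  }
}
```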
5. Side Outputs
Usage background: as shown below, when we split the input by applying filter twice, the input is traversed twice, which wastes resources; Side Outputs avoid this by splitting the stream in a single pass.
```scala
val senv = StreamExecutionEnvironment.getExecutionEnvironment

val input = senv.fromElements(1, 2, 3, 4, 5, 6)
val output1 = input.filter(_ < 4)
val output2 = input.filter(_ >= 4)
```
- Side output can be used in the following functions:
- ProcessFunction: DataStream.process(ProcessFunction<I, O>)
- KeyedProcessFunction
- CoProcessFunction
- KeyedCoProcessFunction
- ProcessWindowFunction
- ProcessAllWindowFunction
The following describes Side Outputs using ProcessFunction:
```scala
package datastreamApi

import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala.{OutputTag, StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.util.Collector

object SideOutputTest {

  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val input = senv.fromElements(1, 2, 3, 4, 5, 6)

    val side_output_tag1 = new OutputTag[String]("my_side_output_tag1")
    val side_output_tag2 = new OutputTag[Int]("my_side_output_tag2")

    val output = input.process(new ProcessFunction[Int, Int] {
      override def processElement(value: Int, ctx: ProcessFunction[Int, Int]#Context, out: Collector[Int]): Unit = {
        // out.collect(value + 1)    // the main output can still be used; do something here if needed

        // The stream is split in a single pass over the input
        if (value < 4) {
          ctx.output(side_output_tag1, s"side_output1>>>>>>${value}")
        } else {
          ctx.output(side_output_tag2, value)
        }
      }
    })

    val side_output1 = output.getSideOutput(side_output_tag1)
    val side_output2 = output.getSideOutput(side_output_tag2)
    side_output1.print("side_output1")
    side_output2.print("side_output2")

    senv.execute("SideOutputTest")
  }
}
```
Execution results:
```
side_output1:4> side_output1>>>>>>2
side_output1:5> side_output1>>>>>>3
side_output1:3> side_output1>>>>>>1
side_output2:8> 6
side_output2:6> 4
side_output2:7> 5
```
6. Two-phase commit (2PC)
2PC is a distributed consistency protocol. The participating roles are the coordinator (similar to a master) and the participants (similar to slaves).
6.1 The 2PC commit process
- Phase 1: voting phase
  - The coordinator sends a prepare request to all participants
  - Each participant performs the prepare operation and writes a rollback log
  - Each participant tells the coordinator whether its prepare succeeded (yes) or failed (no)
- Phase 2: commit phase
  - If every prepare succeeded (yes):
    1. The coordinator sends a commit request to all participants
    2. Each participant performs the commit operation
    3. Each participant sends its commit result to the coordinator
  - If any prepare failed (no):
    1. The coordinator sends a rollback request to all participants
    2. Each participant performs the rollback operation
    3. Each participant sends its rollback result to the coordinator
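The decision flow above can be summarized with a small sketch (plain Scala for illustration only, not Flink code):

```scala
// Toy model of 2PC: the coordinator collects votes, then commits or rolls back everywhere
trait Participant {
  def prepare(): Boolean   // phase 1: do the prepare work, write the rollback log, vote yes/no
  def commit(): Unit       // phase 2 when every participant voted yes
  def rollback(): Unit     // phase 2 when any participant voted no
}

object Coordinator {
  def twoPhaseCommit(participants: Seq[Participant]): Boolean = {
    // Phase 1 (voting): send prepare to every participant and collect the votes
    val votes = participants.map(_.prepare())

    if (votes.forall(identity)) {
      // Phase 2, all yes: send commit to every participant
      participants.foreach(_.commit())
      true
    } else {
      // Phase 2, at least one no: send rollback to every participant
      participants.foreach(_.rollback())
      false
    }
  }
}
```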
6.2 Disadvantages of 2PC
- The coordinator is a single point of failure
- A slow participant blocks all the other participants
- During the commit phase, network failures can leave some participants committed and others not, producing inconsistent state
6.3 Flink's 2PC
Flink provides the abstract class TwoPhaseCommitSinkFunction. A sink that needs to guarantee exactly once must extend this class and implement its abstract methods, including the following four:
```java
// Phase 1
protected abstract TXN beginTransaction() throws Exception;
protected abstract void preCommit(TXN transaction) throws Exception;
// Phase 2, success (yes)
protected abstract void commit(TXN transaction);
// Phase 2, failure (no)
protected abstract void abort(TXN transaction);
```
FlinkKafkaProducer extends TwoPhaseCommitSinkFunction. The delivery guarantees of Flink's sources and sinks are as follows:
Source | Guarantees |
---|---|
Apache Kafka | exactly once |
Files | exactly once |
Sockets | at most once |

Sink | Guarantees |
---|---|
Elasticsearch | at least once |
Kafka producer | at least once / exactly once |
File sinks | exactly once |
Socket sinks | at least once |
Standard output | at least once |
Redis sink | at least once |
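For reference, a minimal sketch of configuring FlinkKafkaProducer for the exactly-once guarantee listed above (the broker address, topic and class names are placeholders; checkpointing must be enabled, and the producer's transaction.timeout.ms must not exceed the broker's transaction.max.timeout.ms):

```scala
package datastreamApi

import java.nio.charset.StandardCharsets
import java.util.Properties

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaProducer, KafkaSerializationSchema}
import org.apache.kafka.clients.producer.ProducerRecord

// Serializes each String record for the given (placeholder) topic
class StringKafkaSerializationSchema(topic: String) extends KafkaSerializationSchema[String] {
  override def serialize(element: String, timestamp: java.lang.Long): ProducerRecord[Array[Byte], Array[Byte]] =
    new ProducerRecord[Array[Byte], Array[Byte]](topic, element.getBytes(StandardCharsets.UTF_8))
}

object ExactlyOnceKafkaSinkTest {

  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    // Exactly once requires checkpointing: the Kafka transaction is committed when a checkpoint completes
    senv.enableCheckpointing(10000L)

    val input = senv.fromElements("hello", "world")

    val producerProps = new Properties()
    producerProps.setProperty("bootstrap.servers", "192.168.xxx.xxx:9092")
    // Must be <= the broker's transaction.max.timeout.ms
    producerProps.setProperty("transaction.timeout.ms", "600000")

    val kafkaSink = new FlinkKafkaProducer[String](
      "my_topic",
      new StringKafkaSerializationSchema("my_topic"),
      producerProps,
      FlinkKafkaProducer.Semantic.EXACTLY_ONCE  // uses Kafka transactions via 2PC
    )

    input.addSink(kafkaSink)
    senv.execute("ExactlyOnceKafkaSinkTest")
  }
}
```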