Flink DataStream: word count, data sources and sinks, Side Outputs, and two-phase commit (2PC)

1. pom.xml dependencies

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

2. Use DataStream to implement word count

2.1 Word count without a window

apiTest\WordSourceFunction.scala

Purpose: a data source that continuously emits a random word every second

package apiTest

import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.util.Random


class WordSourceFunction extends SourceFunction[String] {

  @volatile private var is_running = true  // cancel() may be called from another thread
  private val words=Array("hello", "world", "flink", "stream", "batch", "table", "sql")

  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {

    while(is_running) {
      val index = Random.nextInt(words.size)
      sourceContext.collect(words(index))

      // emit one word per second
      Thread.sleep(1000)
    }

  }


  override def cancel(): Unit = {
    is_running = false
  }

}

apiTest\StreamWordCount.scala

package apiTest
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}



object StreamWordCount {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val ds:DataStream[(String,Int)] = senv.addSource(new WordSourceFunction())
      .map(word => (word,1))
      .keyBy(_._1)
      .sum(1)

    ds.print()

    // Transformations are lazy; execute() actually launches the job
    senv.execute()

  }

}

The count for each word keeps increasing. Sample output:

8> (stream,1)
1> (table,1)
4> (sql,1)
3> (hello,1)
4> (sql,2)
1> (table,2)
7> (flink,1)
4> (sql,3)
7> (flink,2)
4> (sql,4)
7> (flink,3)
7> (flink,4)
......Omitted part......

2.2 Word count with a window

apiTest\WindowStreamWordCount.scala

package apiTest
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object WindowStreamWordCount {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val ds:DataStream[(String,Int)] = senv.addSource(new WordSourceFunction())
      .map(word => (word,1))
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      .sum(1)

    ds.print()

    // Transformations are lazy; execute() actually launches the job
    senv.execute()

  }

}

The first window contains only 7 words because the job starts partway through the 10-second window; this can be addressed with timestamps and watermarks (TODO; a rough event-time sketch follows the window results below). Results of the first window:

8> (stream,2)
7> (batch,3)
3> (hello,1)
5> (world,1)

Results of the second window

5> (world,2)
7> (flink,2)
8> (stream,4)
4> (sql,1)
3> (hello,1)

Results of the third window

8> (stream,1)
5> (world,2)
4> (sql,1)
3> (hello,1)
7> (batch,3)
7> (flink,2)
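
As mentioned above, the partial window can be handled by switching to event time. The following is only a rough sketch, not part of the original code: it reuses WordSourceFunction, attaches a timestamp to each word in a map, assigns timestamps and watermarks, and applies an event-time window. It assumes Flink 1.12+, where the WatermarkStrategy API is available.

package apiTest

import java.time.Duration

import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object EventTimeWindowWordCount {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    // Watermarks lag 2 seconds behind the largest timestamp seen so far
    val watermarkStrategy = WatermarkStrategy
      .forBoundedOutOfOrderness[(String, Long)](Duration.ofSeconds(2))
      .withTimestampAssigner(new SerializableTimestampAssigner[(String, Long)] {
        override def extractTimestamp(element: (String, Long), recordTimestamp: Long): Long =
          element._2    // use the timestamp carried in the record
      })

    senv
      .addSource(new WordSourceFunction())
      .map(word => (word, System.currentTimeMillis()))    // attach a timestamp to every word
      .assignTimestampsAndWatermarks(watermarkStrategy)
      .map(e => (e._1, 1))
      .keyBy(_._1)
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))
      .sum(1)
      .print()

    senv.execute()
  }

}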

3. DataStream API data sources

The following data sources are supported:

  1. Built-in data sources, such as senv.readTextFile(filePath) and senv.addSource(FlinkKafkaConsumer[OUT])
  2. senv.addSource(SourceFunction[OUT]), which runs with parallelism 1
  3. senv.addSource(ParallelSourceFunction[OUT]), which runs with parallelism n; ParallelSourceFunction[OUT] extends SourceFunction[OUT]
  4. senv.addSource(RichParallelSourceFunction[OUT]), which runs with parallelism n and can access the RuntimeContext; RichParallelSourceFunction[OUT] implements ParallelSourceFunction[OUT] (a short sketch follows this list)
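
A minimal sketch of item 4 (the class name ParallelWordSource is made up for this example): each of the n parallel subtasks runs its own copy of run(), and the RuntimeContext exposes the subtask index.

package apiTest

import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

class ParallelWordSource extends RichParallelSourceFunction[String] {

  @volatile private var is_running = true

  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    // getRuntimeContext is available because the function is "Rich"
    val subtask_index = getRuntimeContext.getIndexOfThisSubtask
    while (is_running) {
      sourceContext.collect(s"word-from-subtask-${subtask_index}")
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    is_running = false
  }

}

It could be used, for example, with senv.addSource(new ParallelWordSource()).setParallelism(2).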

The following describes the built-in data sources:

3.1 File-based

File reading is carried out by two subtasks:

  1. A file-monitoring subtask: it runs with parallelism 1, monitors the file (or directory), splits it, and hands the splits to the reading subtask
  2. A data-reading subtask: it reads the splits in parallel

readTextFile.txt file content:

hello
world

Example code:

package datastreamApi

import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

object DatasourceTest {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val text_filepath = "src/main/resources/readTextFile.txt"

    val text_input = senv.readTextFile(text_filepath, "UTF-8")
    text_input.print("text_input")
    /*
    text_input:8> hello
    text_input:3> world
     */

    val file_input = senv.readFile(
      new TextInputFormat(new Path(text_filepath)),
      text_filepath,
      // FileProcessingMode.PROCESS_ONCE,       // read the file once and then stop
      FileProcessingMode.PROCESS_CONTINUOUSLY,  // rescan the file every 5 seconds; if it changed, re-read the whole content
      5000L
    )
    file_input.print("file_input")
    /*
    file_input:5> hello
    file_input:8> world
     */

    senv.execute()
  }

}

3.2 Socket-based

  1. Install ncat
[root@bigdata005 ~]#
[root@bigdata005 ~]# yum install -y nc
[root@bigdata005 ~]#
  2. Start nc, listening on port 9998
[root@bigdata005 ~]#
[root@bigdata005 ~]# nc -lk 9998
hello
world

  3. Start the Flink job that reads from the socket
package datastreamApi

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object DatasourceTest {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val socket_input = senv.socketTextStream(
      "192.168.xxx.xxx",
      9998,
      '\n',
      0L        // how long (in seconds) to keep retrying after the connection drops; 0 = terminate immediately, negative = retry forever
    )
    socket_input.print("socket_input")

    senv.execute("DatasourceTest")

  }

}

Execution results:

socket_input:3> hello
socket_input:4> world

3.3 Collection-based

package datastreamApi

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.util.NumberSequenceIterator

import scala.collection.mutable.ArrayBuffer

object DatasourceTest {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val input1 = senv.fromElements(("key1", 10), ("key2", 20))
    input1.print("input1")
    /*
    input1:7> (key2,20)
    input1:6> (key1,10)
     */

    val datas = ArrayBuffer(("key1", 10), ("key2", 20))
    val input2 = senv.fromCollection(datas)
    input2.print("input2")
    /*
    input2:6> (key2,20)
    input2:5> (key1,10)
     */

    // The parameter is a SplittableIterator[T]; this example generates the sequence 0, 1, 2, 3
    val input3 = senv.fromParallelCollection(new NumberSequenceIterator(0L, 3L))
    input3.print("input3")
    /*
    input3:3> 2
    input3:2> 1
    input3:1> 0
    input3:4> 3
     */

    // Generate sequence of 0, 1, 2, 3
    val input4 = senv.fromSequence(0L, 3L)
    input4.print("input4")
    /*
    input4:3> 0
    input4:8> 3
    input4:7> 2
    input4:5> 1
     */

    senv.execute("DatasourceTest")

  }

}

4. DataStream API data sinks

  • The DataStream.write* methods do not take part in checkpointing, so they cannot provide exactly-once semantics; for exactly-once file output, use the flink-connector-filesystem connector (a StreamingFileSink sketch follows the example below)
package datastreamApi

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.java.io.TextOutputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

object DatasourceTest {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val input = senv.fromElements("hello", "world")

    input.writeUsingOutputFormat(
      new TextOutputFormat[String](
        new Path("src/main/resources/textOutputDir"),
        "UTF-8"
      )
    )

    input.writeToSocket(
      "192.168.xxx.xxx",
      9998,
      new SimpleStringSchema()
    )
    /*
    [root@bigdata005 ~]#
    [root@bigdata005 ~]# nc -lk 9998
    helloworld
    
     */

    input.print("print")
    /*
    print:2> world
    print:1> hello
     */
    
    input.printToErr("printToErr")
    /* The printed font color is red
    printToErr:7> world
    printToErr:6> hello
     */

    senv.execute("DatasourceTest")
  }

}

The directory structure of textOutputDir is shown in the following figure:
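
As noted above, the write* sinks are not exactly-once. The following is a rough sketch of a checkpoint-aware file sink using StreamingFileSink; the output directory name is made up, and it assumes Flink 1.12+, where StreamingFileSink ships with flink-streaming. Part files are only finalized when checkpointing is enabled.

package datastreamApi

import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

object ExactlyOnceFileSinkTest {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    // In-progress part files are rolled and committed when checkpoints complete
    senv.enableCheckpointing(10 * 1000L)

    val input = senv.fromElements("hello", "world")

    val fileSink = StreamingFileSink
      .forRowFormat(
        new Path("src/main/resources/streamingFileSinkOutputDir"),
        new SimpleStringEncoder[String]("UTF-8")
      )
      .build()

    input.addSink(fileSink)

    senv.execute("ExactlyOnceFileSinkTest")
  }

}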

5. Side Outputs

Usage background: as shown below, splitting the input by calling filter twice traverses the input twice, which wastes resources; Side Outputs avoid this extra pass

val senv = StreamExecutionEnvironment.getExecutionEnvironment
val input = senv.fromElements(1, 2, 3, 4, 5, 6)
val output1 = input.filter(_ < 4)
val output2 = input.filter(_ >= 4)
  • Side outputs can be emitted in the following functions:
    1. ProcessFunction: DataStream.process(ProcessFunction<I, O>)
    2. KeyedProcessFunction
    3. CoProcessFunction
    4. KeyedCoProcessFunction
    5. ProcessWindowFunction
    6. ProcessAllWindowFunction

The following describes Side Outputs using ProcessFunction:

package datastreamApi

import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala.{OutputTag, StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.util.Collector

object SideOutputTest {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val input = senv.fromElements(1, 2, 3, 4, 5, 6)

    val side_output_tag1 = new OutputTag[String]("my_side_output_tag1")
    val side_output_tag2 = new OutputTag[Int]("my_side_output_tag2")
    val output = input.process(new ProcessFunction[Int, Int] {

      override def processElement(value: Int, ctx: ProcessFunction[Int, Int]#Context, out: Collector[Int]): Unit = {

        // out.collect(value + 1)   // do something


        // The stream is split in a single pass: each element is routed to exactly one side output
        if (value < 4) {
          ctx.output(side_output_tag1, s"side_output1>>>>>>${value}")
        } else {
          ctx.output(side_output_tag2, value)
        }

      }
    })

    val side_output1 = output.getSideOutput(side_output_tag1)
    val side_output2 = output.getSideOutput(side_output_tag2)
    side_output1.print("side_output1")
    side_output2.print("side_output2")


    senv.execute("SideOutputTest")


  }

}

Execution results:

side_output1:4> side_output1>>>>>>2
side_output1:5> side_output1>>>>>>3
side_output1:3> side_output1>>>>>>1
side_output2:8> 6
side_output2:6> 4
side_output2:7> 5

6. Two-phase commit (2PC)

2PC is a distributed consistency protocol. The participating roles are the coordinator (similar to a master) and the participants (similar to slaves).

6.1 2PC commit process

  1. Phase 1: voting (prepare) phase
    1. The coordinator sends a prepare request to all participants
    2. Each participant performs the prepare operation and writes a rollback log
    3. Each participant tells the coordinator whether its prepare succeeded (yes) or failed (no)
  2. Phase 2: commit phase (a toy sketch of the coordinator's decision logic follows this list)
    1. If every participant voted yes:
      1. The coordinator sends a commit request to all participants
      2. Each participant performs the commit operation
      3. Each participant sends its commit result to the coordinator
    2. If any participant voted no:
      1. The coordinator sends a rollback request to all participants
      2. Each participant performs the rollback operation
      3. Each participant sends its rollback result to the coordinator
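
A toy sketch of the coordinator's decision logic described above (plain Scala, not Flink code; the Participant trait and Coordinator object are made up for illustration): commit only if every participant votes yes, otherwise roll back.

trait Participant {
  def prepare(): Boolean   // write the rollback log, then vote yes (true) or no (false)
  def commit(): Unit
  def rollback(): Unit
}

object Coordinator {

  def runTwoPhaseCommit(participants: Seq[Participant]): Boolean = {
    // Phase 1: voting
    val allPrepared = participants.forall(_.prepare())

    // Phase 2: commit or rollback, depending on the votes
    if (allPrepared) participants.foreach(_.commit())
    else participants.foreach(_.rollback())

    allPrepared
  }

}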


6.2 Disadvantages of 2PC

  1. The coordinator is a single point of failure
  2. A slow participant blocks the other participants
  3. During the commit phase, network failures can leave some participants committed and others not, producing an inconsistent state

6.3 Flink's 2PC

Flink provides the abstract class TwoPhaseCommitSinkFunction. Any sink that needs to guarantee exactly-once semantics extends this class and implements the four abstract methods below (plus the per-record invoke(transaction, value, context) method; a rough sketch follows the signatures):

protected abstract TXN beginTransaction() throws Exception;

// Phase I
protected abstract void preCommit(TXN transaction) throws Exception;

// Phase II success (yes)
protected abstract void commit(TXN transaction);

// Phase II failure (no)
protected abstract void abort(TXN transaction);
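
A rough sketch of a custom sink extending TwoPhaseCommitSinkFunction, assuming Flink 1.13+; the class name FileTwoPhaseCommitSink and the file layout are made up for illustration, and a real sink would also have to make commit idempotent, because it can be re-invoked during failure recovery. The transaction handle here is simply the path of a temporary file:

package datastreamApi

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}

import org.apache.flink.api.common.typeutils.base.{StringSerializer, VoidSerializer}
import org.apache.flink.streaming.api.functions.sink.{SinkFunction, TwoPhaseCommitSinkFunction}

class FileTwoPhaseCommitSink(targetDir: String)
  extends TwoPhaseCommitSinkFunction[String, String, Void](
    StringSerializer.INSTANCE, VoidSerializer.INSTANCE) {

  // Begin a transaction: create a temp file and use its path as the transaction handle
  override def beginTransaction(): String =
    Files.createTempFile("flink-2pc-", ".tmp").toString

  // Write every record of the current checkpoint period into the transaction's temp file
  override def invoke(transaction: String, value: String, context: SinkFunction.Context): Unit =
    Files.write(
      Paths.get(transaction),
      (value + "\n").getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.APPEND)

  // Phase 1: flush pending data so it can survive a failure (nothing extra to do in this toy example)
  override def preCommit(transaction: String): Unit = ()

  // Phase 2, success: move the temp file into the target directory
  override def commit(transaction: String): Unit = {
    val src = Paths.get(transaction)
    Files.move(src, Paths.get(targetDir).resolve(src.getFileName))
  }

  // Phase 2, failure: discard the temp file
  override def abort(transaction: String): Unit =
    Files.deleteIfExists(Paths.get(transaction))

}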

FlinkKafkaProducer extends TwoPhaseCommitSinkFunction. The delivery guarantees of Flink's built-in sources and sinks are listed below (a sketch of the exactly-once Kafka producer configuration follows the table):

Source            Guarantees
Apache Kafka      exactly once
Files             exactly once
Sockets           at most once

Sink              Guarantees
Elasticsearch     at least once
Kafka producer    at least once / exactly once
File sinks        exactly once
Socket sinks      at least once
Standard output   at least once
Redis sink        at least once
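
A rough sketch of enabling the exactly-once mode of FlinkKafkaProducer; the topic name, broker address, and timeout value are placeholders. Exactly-once output also requires checkpointing to be enabled, and the transaction timeout must not exceed the broker's transaction.max.timeout.ms:

package datastreamApi

import java.nio.charset.StandardCharsets
import java.util.Properties

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaProducer, KafkaSerializationSchema}
import org.apache.kafka.clients.producer.ProducerRecord

object KafkaExactlyOnceSinkTest {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    // Kafka transactions are committed when checkpoints complete
    senv.enableCheckpointing(10 * 1000L)

    val props = new Properties()
    props.setProperty("bootstrap.servers", "192.168.xxx.xxx:9092")
    props.setProperty("transaction.timeout.ms", "600000")

    val producer = new FlinkKafkaProducer[String](
      "output_topic",                                    // default topic
      new KafkaSerializationSchema[String] {
        override def serialize(element: String, timestamp: java.lang.Long): ProducerRecord[Array[Byte], Array[Byte]] =
          new ProducerRecord[Array[Byte], Array[Byte]]("output_topic", element.getBytes(StandardCharsets.UTF_8))
      },
      props,
      FlinkKafkaProducer.Semantic.EXACTLY_ONCE           // use the two-phase-commit path
    )

    senv.fromElements("hello", "world").addSink(producer)

    senv.execute("KafkaExactlyOnceSinkTest")
  }

}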
