Flink state management


Detailed explanation of Flink state management: deep parsing of Keyed State and Operator List State  <= Good article, recommended reading

  • Operator State
  • Keyed State
  • State Backends

State overview

  • All data maintained by a task and used to compute a result belongs to the state of that task
  • The task's state can be thought of as a local variable that the task's business logic can access
  • Flink manages the state, including state consistency, failure handling, and efficient storage and access, so that developers can focus on application logic
  • In Flink, state is always associated with a specific operator
  • For the Flink runtime to know about an operator's state, the operator must register its state in advance

In general, there are two types of state:

  • Operator State
    • The scope of operator state is limited to the operator's tasks (that is, it cannot be accessed across tasks)
  • Keyed State
    • Maintained and accessed according to the keys defined in the input data stream

Operator State

  • The scope of operator state is limited to operator tasks: all data processed by the same parallel task can access the same state.

  • State is shared within the same task (it cannot cross slots).

  • Operator state cannot be accessed by another task, whether of the same or a different operator.

Operator state data structures

  • List state

    • Represents the state as a list of elements
  • Union list state

    • Also represents the state as a list of elements. It differs from the regular list state in how it is restored in the event of a failure, or when the application is started from a savepoint
  • Broadcast state

    • If an operator has multiple tasks and each task should hold the same state, this special case is best served by broadcast state (see the sketch below)
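
Broadcast state is not exercised by the test code below, so here is a minimal, hypothetical sketch: a made-up rule stream on port 7778 is broadcast to all parallel tasks of an operator, and every task keeps an identical copy of the rules. The class name, the second socket, and the "threshold" rule key are all invented for illustration.

package apitest.state;

import apitest.beans.SensorReading;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class StateTestX_BroadcastState {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Sensor readings, same format as the other examples
    DataStream<SensorReading> dataStream = env.socketTextStream("localhost", 7777)
        .map(line -> {
          String[] fields = line.split(",");
          return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });

    // A second (made-up) socket stream carrying rule updates
    DataStream<String> ruleStream = env.socketTextStream("localhost", 7778);

    // The descriptor identifies the broadcast state shared by all parallel tasks
    MapStateDescriptor<String, String> ruleDescriptor =
        new MapStateDescriptor<>("rules", String.class, String.class);
    BroadcastStream<String> ruleBroadcast = ruleStream.broadcast(ruleDescriptor);

    dataStream.connect(ruleBroadcast)
        .process(new BroadcastProcessFunction<SensorReading, String, String>() {
          @Override
          public void processElement(SensorReading value, ReadOnlyContext ctx,
                                     Collector<String> out) throws Exception {
            // Data elements get read-only access to the broadcast state
            String rule = ctx.getBroadcastState(ruleDescriptor).get("threshold");
            out.collect(value.getId() + " -> current rule: " + rule);
          }

          @Override
          public void processBroadcastElement(String rule, Context ctx,
                                              Collector<String> out) throws Exception {
            // Every task receives every rule, so all task copies stay identical
            ctx.getBroadcastState(ruleDescriptor).put("threshold", rule);
          }
        })
        .print();

    env.execute();
  }
}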

Test code

In practice, operator state is rarely used; keyed state is far more common.

package apitest.state;

import apitest.beans.SensorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.Collections;
import java.util.List;

/**
 * @author : Ashiamd email: ashiamd@foxmail.com
 * @date : 2021/2/2 4:05 AM
 */
public class StateTest1_OperatorState {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);

    // socket text stream
    DataStream<String> inputStream = env.socketTextStream("localhost", 7777);

    // Convert to SensorReading type
    DataStream<SensorReading> dataStream = inputStream.map(line -> {
      String[] fields = line.split(",");
      return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
    });

    // Define a stateful map operation to count the number of data in the current partition
    SingleOutputStreamOperator<Integer> resultStream = dataStream.map(new MyCountMapper());

    resultStream.print();

    env.execute();
  }

  // Custom MapFunction
  public static class MyCountMapper implements MapFunction<SensorReading, Integer>, ListCheckpointed<Integer> {
    // Define a local variable as the operator state
    private Integer count = 0;

    @Override
    public Integer map(SensorReading value) throws Exception {
      count++;
      return count;
    }

    @Override
    public List<Integer> snapshotState(long checkpointId, long timestamp) throws Exception {
      return Collections.singletonList(count);
    }

    @Override
    public void restoreState(List<Integer> state) throws Exception {
      for (Integer num : state) {
        count += num;
      }
    }
  }
}

Input (entered after opening a local socket with nc -lk 7777)

sensor_1,1547718199,35.8
sensor_1,1547718199,35.8
sensor_1,1547718199,35.8
sensor_1,1547718199,35.8
sensor_1,1547718199,35.8
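
Output (with parallelism 1, the counter is incremented once per record, so the expected result is):

1
2
3
4
5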

Keyed State


  • Keyed state is maintained and accessed according to the keys defined in the input data stream.

  • Flink maintains one state instance per key, and partitions all records with the same key to the same operator task, which maintains and processes the state for that key.

  • When a task processes a record, it automatically scopes state access to the key of the current record.

Keyed state data structures

  • Value state

    • Represents the state as a single value
  • List state

    • Represents the state as a list of elements
  • Map state

    • Represents the state as a set of key-value pairs
  • Reducing state & aggregating state

    • Represent the state as a single value obtained by aggregating all elements added to it

Test code

Note: keyed state is generally declared in the operator's open() method, because the runtime context is only available at runtime

  • java test code

    package apitest.state;
    
    import apitest.beans.SensorReading;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.common.state.*;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    
    /**
     * @author : Ashiamd email: ashiamd@foxmail.com
     * @date : 2021/2/2 5:41 PM
     */
    public class StateTest2_KeyedState {
    
      public static void main(String[] args) throws Exception {
        // Create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set parallelism = 1
        env.setParallelism(1);
        // Read data from local socket
        DataStream<String> inputStream = env.socketTextStream("localhost", 7777);
    
        // Convert to SensorReading type
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
          String[] fields = line.split(",");
          return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });
    
        // Use the custom map method, which uses our custom Keyed State
        DataStream<Integer> resultStream = dataStream
          .keyBy(SensorReading::getId)
          .map(new MyMapper());
    
        resultStream.print("result");
        env.execute();
      }
    
      // Custom rich map function to test keyed state
      public static class MyMapper extends RichMapFunction<SensorReading,Integer>{
    
        //        Exception in thread "main" java.lang.IllegalStateException: The runtime context has not been initialized.
        //        ValueState<Integer> valueState = getRuntimeContext().getState(new ValueStateDescriptor<Integer>("my-int", Integer.class));
    
        private ValueState<Integer> valueState;
    
    
        // Declaration of other types of status
        private ListState<String> myListState;
        private MapState<String, Double> myMapState;
        private ReducingState<SensorReading> myReducingState;
    
        @Override
        public void open(Configuration parameters) throws Exception {
          valueState = getRuntimeContext().getState(new ValueStateDescriptor<Integer>("my-int", Integer.class));
    
          myListState = getRuntimeContext().getListState(new ListStateDescriptor<String>("my-list", String.class));
          myMapState = getRuntimeContext().getMapState(new MapStateDescriptor<String, Double>("my-map", String.class, Double.class));
          // The descriptor also needs a ReduceFunction - see the sketch after this code block
          //            myReducingState = getRuntimeContext().getReducingState(new ReducingStateDescriptor<SensorReading>())
    
        }
    
        // Here we simply count the number of records per sensor
        @Override
        public Integer map(SensorReading value) throws Exception {
          // Other status API calls
          // list state
          for(String str: myListState.get()){
            System.out.println(str);
          }
          myListState.add("hello");
          // map state
          myMapState.get("1");
          myMapState.put("2", 12.3);
          myMapState.remove("2");
          // reducing state
          //            myReducingState.add(value);
    
          myMapState.clear();
    
    
          Integer count = valueState.value();
          // The first read returns null, so a null check is needed
          count = count==null?0:count;
          ++count;
          valueState.update(count);
          return count;
        }
      }
    }
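
The reducing state above was left commented out because its descriptor also needs a ReduceFunction. A minimal sketch of what the declaration in open() could look like; the max-temperature merge logic is just an illustration, not part of the original example:

    myReducingState = getRuntimeContext().getReducingState(
        new ReducingStateDescriptor<SensorReading>(
            "my-reducing",
            // Keep, per key, the reading with the highest temperature seen so far
            (r1, r2) -> r1.getTemperature() >= r2.getTemperature() ? r1 : r2,
            SensorReading.class));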

Scenario test

Scenario: a temperature alarm. If the difference between two consecutive temperature readings from the same sensor exceeds 10 degrees, an alarm is raised. Keyed State + flatMap are used here

  • java code

    package apitest.state;
    
    import apitest.beans.SensorReading;
    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;
    
    /**
     * @author : Ashiamd email: ashiamd@foxmail.com
     * @date : 2021/2/2 6:37 PM
     */
    public class StateTest3_KeyedStateApplicationCase {
    
      public static void main(String[] args) throws Exception {
        // Create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set parallelism = 1
        env.setParallelism(1);
        // Get data from socket
        DataStream<String> inputStream = env.socketTextStream("localhost", 7777);
        // Convert to SensorReading type
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
          String[] fields = line.split(",");
          return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });
    
        SingleOutputStreamOperator<Tuple3<String, Double, Double>> resultStream = dataStream.keyBy(SensorReading::getId).flatMap(new MyFlatMapper(10.0));
    
        resultStream.print();
    
        env.execute();
      }
    
      // If the difference between two consecutive readings of a sensor exceeds the threshold (10.0 here), emit an alarm
      public static class MyFlatMapper extends RichFlatMapFunction<SensorReading, Tuple3<String, Double, Double>> {
    
        // Temperature difference threshold for alarm
        private final Double threshold;
    
        // Record the last temperature
        ValueState<Double> lastTemperature;
    
        public MyFlatMapper(Double threshold) {
          this.threshold = threshold;
        }
    
        @Override
        public void open(Configuration parameters) throws Exception {
          // Get keyedState from runtime context
          lastTemperature = getRuntimeContext().getState(new ValueStateDescriptor<Double>("last-temp", Double.class));
        }
    
        @Override
        public void close() throws Exception {
          // Release resources manually
          lastTemperature.clear();
        }
    
        @Override
        public void flatMap(SensorReading value, Collector<Tuple3<String, Double, Double>> out) throws Exception {
          Double lastTemp = lastTemperature.value();
          Double curTemp = value.getTemperature();
    
          // If there is a previous temperature, check whether the difference exceeds the threshold and emit an alarm if so
          if (lastTemp != null) {
            if (Math.abs(curTemp - lastTemp) >= threshold) {
              out.collect(new Tuple3<>(value.getId(), lastTemp, curTemp));
            }
          }
    
          // Update saved "last temperature"
          lastTemperature.update(curTemp);
        }
      }
    }
  • Start socket

    nc -lk 7777
  • Enter data and view results

    • input

      sensor_1,1547718199,35.8
      sensor_1,1547718199,32.4
      sensor_1,1547718199,42.4
      sensor_10,1547718205,52.6   
      sensor_10,1547718205,22.5
      sensor_7,1547718202,6.7
      sensor_7,1547718202,9.9
      sensor_1,1547718207,36.3
      sensor_7,1547718202,19.9
      sensor_7,1547718202,30
    • output

      Note that (sensor_7,9.9,19.9) is missing from the output. This is caused by double floating-point precision: 19.9 - 9.9 evaluates to slightly less than 10.0, so the >= check does not fire (see the check below). It can be ignored here

      (sensor_1,32.4,42.4)
      (sensor_10,52.6,22.5)
      (sensor_7,19.9,30.0)
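
The precision claim is easy to verify: in Java, 19.9 - 9.9 evaluates to 9.999999999999998, which is less than 10.0, so no alarm is emitted for that pair.

public class DoublePrecisionDemo {
  public static void main(String[] args) {
    double diff = 19.9 - 9.9;
    System.out.println(diff);                   // 9.999999999999998
    System.out.println(Math.abs(diff) >= 10.0); // false -> no alarm for this pair
  }
}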

State backend


Summary

  • Every time a record arrives, the stateful operator task reads and updates its state.

  • Since efficient state access is essential for processing records with low latency, each parallel task keeps its state locally to ensure fast access.

  • The storage, access, and maintenance of state is handled by a pluggable component called the state backend.

  • The state backend is mainly responsible for two things: local state management, and writing checkpoint state to remote storage.

Selecting a state backend

  • MemoryStateBackend (the default)
    • A memory-level state backend that manages keyed state as objects in memory, stores them on the TaskManager's JVM heap, and stores checkpoints in the JobManager's memory
    • Features: fast, low latency, but unstable
  • FsStateBackend
    • Saves checkpoints to a remote persistent file system; local state is kept on the TaskManager's JVM heap, as with MemoryStateBackend
    • Offers memory-level local access speed with better fault-tolerance guarantees
  • RocksDBStateBackend
    • All state is serialized and stored in a local RocksDB instance

Configuration file

flink-conf.yaml

#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================

# The backend that will be used to store operator state checkpoints if
# checkpointing is enabled.
#
# Supported backends are 'jobmanager', 'filesystem', 'rocksdb', or the
# <class-name-of-factory>.
#
# state.backend: filesystem
The commented-out state.backend line above shows how to select the filesystem state backend (FsStateBackend).


# Directory for checkpoints filesystem, when using any of the default bundled
# state backends.
#
# state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints

# Default target directory for savepoints, optional.
#
# state.savepoints.dir: hdfs://namenode-host:port/flink-savepoints

# Flag to enable/disable incremental checkpoints for backends that
# support incremental checkpoints (like the RocksDB state backend). 
#
# state.backend.incremental: false

# The failover strategy, i.e., how the job computation recovers from task failures.
# Only restart tasks that may have been affected by the task failure, which typically includes
# downstream tasks and potentially upstream tasks if their produced data is no longer available for consumption.

jobmanager.execution.failover-strategy: region

The region value above means that when one of several parallel tasks fails, only the failover region it belongs to (which may contain multiple subtasks) is restarted, rather than the entire Flink job

Sample code

  • Using RocksDBStateBackend requires adding the following dependency to pom.xml

    <!-- RocksDBStateBackend -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-statebackend-rocksdb_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>
  • java code

    package apitest.state;
    
    import apitest.beans.SensorReading;
    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.runtime.state.memory.MemoryStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    
    /**
     * @author : Ashiamd email: ashiamd@foxmail.com
     * @date : 2021/2/2 11:35 PM
     */
    public class StateTest4_FaultTolerance {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(1);
    
            // 1. State backend configuration (each call below overrides the previous
            // one, so only the last setStateBackend call actually takes effect)
            env.setStateBackend(new MemoryStateBackend());
            // "checkpointDataUri" is a placeholder; use a real URI such as an hdfs:// or file:// path
            env.setStateBackend(new FsStateBackend("checkpointDataUri"));
            // This one requires the extra RocksDB dependency shown above
            env.setStateBackend(new RocksDBStateBackend("checkpointDataUri"));
    
            // socket text stream
            DataStream<String> inputStream = env.socketTextStream("localhost", 7777);
    
            // Convert to SensorReading type
            DataStream<SensorReading> dataStream = inputStream.map(line -> {
                String[] fields = line.split(",");
                return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
            });
    
            dataStream.print();
            env.execute();
        }
    }
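
The sample imports CheckpointingMode, RestartStrategies, and Time without using them; they belong to checkpoint and restart-strategy settings, which can be configured in code as well as in flink-conf.yaml. A minimal sketch using those imports, placed inside main after creating env (the interval and retry values are arbitrary):

    // Enable checkpointing every 10 s with exactly-once semantics
    env.enableCheckpointing(10000L);
    env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

    // On failure, restart the job at most 3 times, waiting 10 s between attempts
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));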
