Big data Flink asynchronous IO

1 Introduction

1.1 Why asynchronous I/O is needed

https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/asyncio.html
Async I/O is a very popular feature that Alibaba contributed to the community; it was introduced in Flink 1.2. Its main purpose is to keep network latency from becoming the system bottleneck when a job interacts with external systems. Stream computing systems often need to query external systems, and the usual practice is synchronous: send the query for user a to the database, wait for the result, and only then send the query for user b.
As shown in the figure below:
⚫ Left figure: the usual implementation (for example, inside a MapFunction) sends the query for user a to the database and waits for the result before it can send the query for user b. This is synchronous access. The long brown bars in the figure mark the waiting time; the network wait severely limits throughput and increases latency.
⚫ Right figure: asynchronous access handles multiple requests and replies concurrently. Requests for users a, b, c, d and so on can be sent to the database back to back, and whichever reply returns first is processed first, so successive requests never block waiting on one another. This is the principle behind Async I/O.
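The difference between the two access modes can be sketched in plain Java, without Flink. This is a minimal illustration under stated assumptions: `SyncVsAsyncSketch`, `query`, and the 100 ms sleep are hypothetical stand-ins for a remote database call and are not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a job that queries a remote database; not Flink code.
class SyncVsAsyncSketch {

    // Simulates one network round trip of about 100 ms (assumed latency).
    static String query(String user) {
        try {
            TimeUnit.MILLISECONDS.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "result-" + user;
    }

    // Synchronous access: each request waits for the previous reply.
    static List<String> querySync(List<String> users) {
        List<String> results = new ArrayList<>();
        for (String user : users) {
            results.add(query(user));
        }
        return results;
    }

    // Asynchronous access: all requests are in flight at the same time.
    static List<String> queryAsync(List<String> users) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, users.size()));
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (String user : users) {
                futures.add(CompletableFuture.supplyAsync(() -> query(user), pool));
            }
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> future : futures) {
                results.add(future.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("a", "b", "c", "d");

        long start = System.nanoTime();
        System.out.println(querySync(users));
        long syncMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        start = System.nanoTime();
        System.out.println(queryAsync(users));
        long asyncMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        System.out.println("sync: " + syncMillis + " ms, async: " + asyncMillis + " ms");
    }
}
```

With four users, the synchronous loop should take roughly four round trips, while the asynchronous version should finish in about one, since the waits overlap.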

1.2 Prerequisites for using Async I/O

⚫ The database (or key/value store) provides a client that supports asynchronous requests (for example, Vert.x in Java).
⚫ If no asynchronous client is available, a synchronous client can be run in a thread pool so that it behaves like an asynchronous one.
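The second prerequisite can be sketched as follows. This is a minimal illustration, not a definitive implementation: `BlockingCategoryClient` and `ThreadPoolAsyncClient` are hypothetical names, and the blocking call returns a constant instead of talking to a real database.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical blocking client; stands in for something like a JDBC lookup.
class BlockingCategoryClient {
    String findName(int id) {
        // A real client would block on network I/O here.
        return "category-" + id;
    }
}

// Wraps the blocking client so that callers receive a future instead of blocking.
class ThreadPoolAsyncClient implements AutoCloseable {
    private final BlockingCategoryClient delegate = new BlockingCategoryClient();
    private final ExecutorService pool;

    ThreadPoolAsyncClient(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // The blocking call runs on a pool thread; the caller returns immediately.
    CompletableFuture<String> findNameAsync(int id) {
        return CompletableFuture.supplyAsync(() -> delegate.findName(id), pool);
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}

class ThreadPoolAsyncClientDemo {
    public static void main(String[] args) {
        try (ThreadPoolAsyncClient client = new ThreadPoolAsyncClient(4)) {
            client.findNameAsync(1)
                  .thenAccept(name -> System.out.println("got " + name))
                  .join();
        }
    }
}
```

The pool size bounds how many blocking calls can run at once, so it should be chosen with the database's connection limit in mind.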

1.3 Async I/O API

The Async I/O API lets users access external storage from a data stream with an asynchronous client. The API takes care of the integration with the data stream and of tedious details such as message ordering, event time, and consistency (fault tolerance), so users only need to focus on business logic.
If the target database has an asynchronous client, an asynchronous stream transformation can be implemented in three steps:
⚫ Implement an AsyncFunction that dispatches requests: it sends the asynchronous request to the database and registers a callback.
⚫ In the callback, take the result of the operation and pass it to the ResultFuture.
⚫ Apply the asynchronous I/O operation to a DataStream.
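The message ordering mentioned above shows up in the demo code as the choice between `AsyncDataStream.unorderedWait` and `AsyncDataStream.orderedWait`. The idea behind the two modes can be modeled in plain Java. This sketch is not Flink code: `ResultOrderSketch` and its methods are hypothetical, and replies are delivered by hand so that the arrival order is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Plain-Java model of the two emission orders; this is not Flink code.
class ResultOrderSketch {

    // Creates n requests whose replies have not arrived yet.
    static List<CompletableFuture<String>> newRequests(int n) {
        List<CompletableFuture<String>> pending = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            pending.add(new CompletableFuture<String>());
        }
        return pending;
    }

    // Simulates replies arriving in the given order of request indexes.
    static void deliverReplies(List<CompletableFuture<String>> pending, int[] arrivalOrder) {
        for (int index : arrivalOrder) {
            pending.get(index).complete("reply-" + index);
        }
    }

    // Unordered: every reply is emitted the moment its callback fires.
    static List<String> emitUnordered(int n, int[] arrivalOrder) {
        List<CompletableFuture<String>> pending = newRequests(n);
        List<String> emitted = new ArrayList<>();
        for (CompletableFuture<String> request : pending) {
            request.thenAccept(emitted::add);
        }
        deliverReplies(pending, arrivalOrder);
        return emitted;
    }

    // Ordered: replies are emitted in request order, so an early reply
    // is held back until all requests before it have been answered.
    static List<String> emitOrdered(int n, int[] arrivalOrder) {
        List<CompletableFuture<String>> pending = newRequests(n);
        deliverReplies(pending, arrivalOrder);
        List<String> emitted = new ArrayList<>();
        for (CompletableFuture<String> request : pending) {
            emitted.add(request.join());
        }
        return emitted;
    }

    public static void main(String[] args) {
        int[] arrivalOrder = {2, 0, 1};
        // prints "unordered: [reply-2, reply-0, reply-1]"
        System.out.println("unordered: " + emitUnordered(3, arrivalOrder));
        // prints "ordered:   [reply-0, reply-1, reply-2]"
        System.out.println("ordered:   " + emitOrdered(3, arrivalOrder));
    }
}
```

Unordered emission gives the lowest latency because no reply waits for another; ordered emission preserves the stream order at the cost of buffering early replies.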

2 Case demonstration

https://blog.csdn.net/weixin_41608066/article/details/105957940
⚫ Requirements:
Reading data from MySQL using asynchronous IO
⚫ Data preparation:

DROP TABLE IF EXISTS `t_category`;
CREATE TABLE `t_category` (
  `id` int(11) NOT NULL,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- ----------------------------
-- Records of t_category
-- ----------------------------
INSERT INTO `t_category` VALUES ('1', 'mobile phone');
INSERT INTO `t_category` VALUES ('2', 'computer');
INSERT INTO `t_category` VALUES ('3', 'clothing');
INSERT INTO `t_category` VALUES ('4', 'Cosmetics');
INSERT INTO `t_category` VALUES ('5', 'food');

⚫ Code demonstration

package cn.oldlu.extend;

import io.vertx.core.AsyncResult;
import io.vertx.core.Handler;
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.core.json.JsonObject;
import io.vertx.ext.jdbc.JDBCClient;
import io.vertx.ext.sql.SQLClient;
import io.vertx.ext.sql.SQLConnection;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.sql.*;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Prerequisites for using asynchronous I/O
 * 1. The database (or key/value store) provides a client that supports asynchronous requests.
 * 2. If no asynchronous client is available, a synchronous client can be run in a thread pool so that it behaves like an asynchronous one.
 */
public class ASyncIODemo {
    public static void main(String[] args) throws Exception {
        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //2.Source
        //DataStreamSource[1,2,3,4,5]
        DataStreamSource<CategoryInfo> categoryDS = env.addSource(new RichSourceFunction<CategoryInfo>() {
            private Boolean flag = true;
            @Override
            public void run(SourceContext<CategoryInfo> ctx) throws Exception {
                Integer[] ids = {1, 2, 3, 4, 5};
                for (Integer id : ids) {
                    ctx.collect(new CategoryInfo(id, null));
                }
            }
            @Override
            public void cancel() {
                this.flag = false;
            }
        });
        //3.Transformation


        //Method 1: asynchronous I/O with the asynchronous client provided by Vert.x
        //unorderedWait: results are emitted as soon as they arrive, regardless of input order
        SingleOutputStreamOperator<CategoryInfo> result1 = AsyncDataStream
                .unorderedWait(categoryDS, new ASyncIOFunction1(), 1000, TimeUnit.SECONDS, 10);

        //Method 2: asynchronous I/O simulated with a synchronous MySQL client and a thread pool
        //unorderedWait: results are emitted as soon as they arrive, regardless of input order
        SingleOutputStreamOperator<CategoryInfo> result2 = AsyncDataStream
                .unorderedWait(categoryDS, new ASyncIOFunction2(), 1000, TimeUnit.SECONDS, 10);

        //4.Sink
        result1.print("Method 1: asynchronous I/O with the Vert.x asynchronous client \n");
        result2.print("Method 2: synchronous MySQL client + thread pool simulating asynchronous I/O \n");

        //5.execute
        env.execute();
    }
}

@Data
@NoArgsConstructor
@AllArgsConstructor
class CategoryInfo {
    private Integer id;
    private String name;
}

class MysqlSyncClient {
    private static transient Connection connection;
    private static final String JDBC_DRIVER = "com.mysql.jdbc.Driver";
    private static final String URL = "jdbc:mysql://localhost:3306/bigdata";
    private static final String USER = "root";
    private static final String PASSWORD = "root";

    static {
        init();
    }

    private static void init() {
        try {
            Class.forName(JDBC_DRIVER);
        } catch (ClassNotFoundException e) {
            System.out.println("Driver not found!" + e.getMessage());
        }
        try {
            connection = DriverManager.getConnection(URL, USER, PASSWORD);
        } catch (SQLException e) {
            System.out.println("init connection failed!" + e.getMessage());
        }
    }

    public void close() {
        try {
            if (connection != null) {
                connection.close();
            }
        } catch (SQLException e) {
            System.out.println("close connection failed!" + e.getMessage());
        }
    }

    public CategoryInfo query(CategoryInfo category) {
        String sql = "select id,name from t_category where id = ?";
        //try-with-resources closes the statement and the result set after every query
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setInt(1, category.getId());
            try (ResultSet rs = statement.executeQuery()) {
                if (rs.next()) {
                    category.setName(rs.getString("name"));
                }
            }
        } catch (SQLException e) {
            System.out.println("query failed!" + e.getMessage());
        }
        return category;
    }
}

/**
 * Method 1: asynchronous I/O with the asynchronous client provided by Vert.x
 */
class ASyncIOFunction1 extends RichAsyncFunction<CategoryInfo, CategoryInfo> {
    private transient SQLClient mySQLClient;

    @Override
    public void open(Configuration parameters) throws Exception {
        JsonObject mySQLClientConfig = new JsonObject();
        mySQLClientConfig
                .put("driver_class", "com.mysql.jdbc.Driver")
                .put("url", "jdbc:mysql://localhost:3306/bigdata")
                .put("user", "root")
                .put("password", "root")
                .put("max_pool_size", 20);

        VertxOptions options = new VertxOptions();
        options.setEventLoopPoolSize(10);
        options.setWorkerPoolSize(20);
        Vertx vertx = Vertx.vertx(options);
        //Obtain the asynchronous request client according to the above configuration parameters
        mySQLClient = JDBCClient.createNonShared(vertx, mySQLClientConfig);
    }

    //Sending asynchronous requests using asynchronous clients
    @Override
    public void asyncInvoke(CategoryInfo input, ResultFuture<CategoryInfo> resultFuture) throws Exception {
        mySQLClient.getConnection(new Handler<AsyncResult<SQLConnection>>() {
            @Override
            public void handle(AsyncResult<SQLConnection> sqlConnectionAsyncResult) {
                if (sqlConnectionAsyncResult.failed()) {
                    //Report the failure right away instead of leaving the request pending until the timeout
                    resultFuture.completeExceptionally(sqlConnectionAsyncResult.cause());
                    return;
                }
                SQLConnection connection = sqlConnectionAsyncResult.result();
                connection.query("select id,name from t_category where id = " + input.getId(), new Handler<AsyncResult<io.vertx.ext.sql.ResultSet>>() {
                    @Override
                    public void handle(AsyncResult<io.vertx.ext.sql.ResultSet> resultSetAsyncResult) {
                        //Return the connection to the pool, otherwise the pool runs dry
                        connection.close();
                        if (resultSetAsyncResult.failed()) {
                            resultFuture.completeExceptionally(resultSetAsyncResult.cause());
                            return;
                        }
                        //A ResultFuture may only be completed once, so collect all rows first
                        List<CategoryInfo> results = new java.util.ArrayList<>();
                        for (JsonObject jsonObject : resultSetAsyncResult.result().getRows()) {
                            results.add(new CategoryInfo(jsonObject.getInteger("id"), jsonObject.getString("name")));
                        }
                        resultFuture.complete(results);
                    }
                });
            }
        });
    }
    @Override
    public void close() throws Exception {
        mySQLClient.close();
    }

    @Override
    public void timeout(CategoryInfo input, ResultFuture<CategoryInfo> resultFuture) throws Exception {
        System.out.println("async call time out!");
        input.setName("unknown");
        resultFuture.complete(Collections.singleton(input));
    }
}

/**
 * Method 2: asynchronous I/O simulated with a synchronous client and a thread pool
 */
class ASyncIOFunction2 extends RichAsyncFunction<CategoryInfo, CategoryInfo> {
    private transient MysqlSyncClient client;
    private ExecutorService executorService;//Thread pool

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        client = new MysqlSyncClient();
        executorService = new ThreadPoolExecutor(10, 10, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
    }

    //Send request asynchronously
    @Override
    public void asyncInvoke(CategoryInfo input, ResultFuture<CategoryInfo> resultFuture) throws Exception {
        executorService.execute(new Runnable() {
            @Override
            public void run() {
                resultFuture.complete(Collections.singletonList((CategoryInfo) client.query(input)));
            }
        });
    }


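    @Override
    public void close() throws Exception {
        //Release the thread pool and the JDBC connection
        if (executorService != null) {
            executorService.shutdown();
        }
        if (client != null) {
            client.close();
        }
    }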

    @Override
    public void timeout(CategoryInfo input, ResultFuture<CategoryInfo> resultFuture) throws Exception {
        System.out.println("async call time out!");
        input.setName("unknown");
        resultFuture.complete(Collections.singleton(input));
    }
}
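Both `timeout` overrides in the listing above complete the `ResultFuture` with a fallback record whose name is "unknown". The same pattern can be modeled in plain Java. This is a sketch under stated assumptions, not Flink code: `TimeoutFallbackSketch`, `withFallback`, and the 50 ms delay are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Plain-Java model of a timeout fallback; this is not Flink code.
class TimeoutFallbackSketch {

    // Arms a timer that completes the reply with a fallback value.
    // complete() has no effect if the real reply has already arrived.
    static CompletableFuture<String> withFallback(CompletableFuture<String> reply,
                                                  String fallback,
                                                  long timeoutMillis,
                                                  ScheduledExecutorService timer) {
        timer.schedule(() -> { reply.complete(fallback); }, timeoutMillis, TimeUnit.MILLISECONDS);
        return reply;
    }

    // Runs both cases and returns {lostResult, fastResult}.
    static String[] demo() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            // Case 1: the reply never arrives, so the fallback wins.
            CompletableFuture<String> lost =
                    withFallback(new CompletableFuture<String>(), "unknown", 50, timer);
            // Case 2: the reply arrives before the timer fires and is kept.
            CompletableFuture<String> fast =
                    withFallback(new CompletableFuture<String>(), "unknown", 50, timer);
            fast.complete("computer");
            return new String[] {lost.join(), fast.join()};
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String[] results = demo();
        System.out.println("lost reply -> " + results[0]); // prints "lost reply -> unknown"
        System.out.println("fast reply -> " + results[1]); // prints "fast reply -> computer"
    }
}
```

Completing with a fallback keeps the stream moving; a job that should fail on a lost reply would complete the future exceptionally instead.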

⚫ Reading Redis data through asynchronous IO

package cn.oldlu.extend;

import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.redis.RedisClient;
import io.vertx.redis.RedisOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
/**
Accessing redis using asynchronous IO
hset AsyncReadRedis beijing 1
hset AsyncReadRedis shanghai 2
hset AsyncReadRedis guangzhou 3
hset AsyncReadRedis shenzhen 4
hset AsyncReadRedis hangzhou 5
hset AsyncReadRedis wuhan 6
hset AsyncReadRedis chengdu 7
hset AsyncReadRedis tianjin 8
hset AsyncReadRedis chongqing 9
 
city.txt
1,beijing
2,shanghai
3,guangzhou
4,shenzhen
5,hangzhou
6,wuhan
7,chengdu
8,tianjin
9,chongqing
 */
public class AsyncIODemo_Redis {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStreamSource<String> lines = env.readTextFile("data/input/city.txt");

        SingleOutputStreamOperator<String> result1 = AsyncDataStream.orderedWait(lines, new AsyncRedis(), 10, TimeUnit.SECONDS, 1);
        SingleOutputStreamOperator<String> result2 = AsyncDataStream.orderedWait(lines, new AsyncRedisByVertx(), 10, TimeUnit.SECONDS, 1);

        result1.print().setParallelism(1);
        result2.print().setParallelism(1);

        env.execute();
    }
}
/**
 * Read redis data asynchronously
 */
class AsyncRedis extends RichAsyncFunction<String, String> {
    //Define the connection pool object of redis
    private JedisPoolConfig config = null;

    private static String ADDR = "localhost";
    private static int PORT = 6379;
    //Connection and socket timeout passed to JedisPool, in milliseconds
    private static int TIMEOUT = 10000;
    //Define the connection pool instance of redis
    private JedisPool jedisPool = null;
    //Defines the core object of the connection pool
    private Jedis jedis = null;
    //Initialize redis connection
    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        //Define connection pool object property configuration
        config = new JedisPoolConfig();
        //Initialize connection pool object
        jedisPool = new JedisPool(config, ADDR, PORT, TIMEOUT);
        //Instantiate the connection object (get an available connection)
        jedis = jedisPool.getResource();
    }

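    @Override
    public void close() throws Exception {
        super.close();
        if (jedis != null && jedis.isConnected()) {
            jedis.close();
        }
        //Also release the connection pool itself
        if (jedisPool != null) {
            jedisPool.close();
        }
    }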

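    //Call redis asynchronously
    @Override
    public void asyncInvoke(String input, ResultFuture<String> resultFuture) throws Exception {
        System.out.println("input:"+input);
        //Run the blocking Jedis call on another thread and return the result
        CompletableFuture.supplyAsync(new Supplier<String>() {
            @Override
            public String get() {
                String[] arrayData = input.split(",");
                String name = arrayData[1];
                //Jedis instances are not thread-safe, so borrow one from the pool per request
                try (Jedis resource = jedisPool.getResource()) {
                    String value = resource.hget("AsyncReadRedis", name);
                    System.out.println("output:"+value);
                    return value;
                }
            }
        }).whenComplete((String dbResult, Throwable error) -> {
            //Callback when the request finishes: complete the ResultFuture exactly once
            if (error != null) {
                resultFuture.completeExceptionally(error);
            } else if (dbResult == null) {
                //No value for this field: emit nothing for this record
                resultFuture.complete(Collections.emptyList());
            } else {
                resultFuture.complete(Collections.singleton(dbResult));
            }
        });
    }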

    //Called when the request times out. Typically it logs the timeout. If this method is not overridden, an exception is thrown on timeout
    @Override
    public void timeout(String input, ResultFuture<String> resultFuture) throws Exception {
        System.out.println("redis connect timeout!");
        //Complete the future so the timed-out record does not stay pending; nothing is emitted for it
        resultFuture.complete(Collections.emptyList());
    }
}
/**
 * Uses the high-performance asynchronous Vert.x client, which provides connection-pool-like behavior and is more efficient than a plain connection pool
 * 1) It can be used directly from Java
 * 2) If used from Scala, Scala 2.12+ is required
 */
class AsyncRedisByVertx extends RichAsyncFunction<String,String> {
    //Member variables marked with the transient keyword do not participate in the serialization process
    private transient RedisClient redisClient;
    //Redis server IP
    private static String ADDR = "localhost";
    //Redis port number
    private static int PORT = 6379;
    //Access password (not used in this example)
    private static String AUTH = "XXXXXX";
    private static final Logger logger = LoggerFactory.getLogger(AsyncRedisByVertx.class);
    //Initialize connection
    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);

        RedisOptions config = new RedisOptions();
        config.setHost(ADDR);
        config.setPort(PORT);

        VertxOptions vo = new VertxOptions();
        vo.setEventLoopPoolSize(10);
        vo.setWorkerPoolSize(20);

        Vertx vertx = Vertx.vertx(vo);

        redisClient = RedisClient.create(vertx, config);
    }

    //Data asynchronous call
    @Override
    public void asyncInvoke(String input, ResultFuture<String> resultFuture) throws Exception {
        System.out.println("input:"+input);
        String[] split = input.split(",");
        String name = split[1];
        // Initiate an asynchronous request
        redisClient.hget("AsyncReadRedis", name, res->{
            if(res.failed()){
                //complete(null) is not allowed; log the failure and emit nothing for this record
                logger.warn("redis request failed", res.cause());
                resultFuture.complete(Collections.emptyList());
                return;
            }
            String result = res.result();
            if(result == null){
                //No value for this field: emit nothing for this record
                resultFuture.complete(Collections.emptyList());
            } else {
                // Callback when the request is completed: pass the result to the collector
                resultFuture.complete(Collections.singleton(result));
            }
        });
    }
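    @Override
    public void timeout(String input, ResultFuture<String> resultFuture) throws Exception {
        //Complete the future so the timed-out record does not stay pending; nothing is emitted for it
        resultFuture.complete(Collections.emptyList());
    }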
    @Override
    public void close() throws Exception {
        super.close();
        if (redisClient != null) {
            redisClient.close(null);
        }
    }
}

Tags: Java Database flink

Posted on Mon, 13 Sep 2021 21:10:56 -0400 by jimbo_head