Dimension table JOIN: an unavoidable business scenario
In Flink stream processing, it is often necessary to interact with external systems and use dimension tables to enrich the fields of fact tables.
For example, in an e-commerce scenario, the skuId of a product is used to look up attributes of that product, such as its industry and manufacturer; in a logistics scenario, given a package id, you need to look up its industry attributes, shipment information, receipt information, and so on.
By default, in Flink's MapFunction a single parallel subtask can only interact with external storage synchronously: it sends a request, blocks on I/O, waits for the response, and only then sends the next request. This kind of synchronous interaction spends most of its time waiting on the network. To improve throughput we can increase the parallelism of the MapFunction, but higher parallelism means more resources, so it is not a very good solution.
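To make the problem concrete, here is a minimal sketch of that synchronous pattern. It assumes a MySQL dimension table queried over JDBC; the Order/EnrichedOrder types and the table and column names are illustrative only, not taken from the original code.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Each record blocks on a full database round trip before the next one can be processed.
public class SyncDimensionLookup extends RichMapFunction<Order, EnrichedOrder> {

    private transient Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pwd");
    }

    @Override
    public EnrichedOrder map(Order order) throws Exception {
        // Synchronous call: this subtask waits here until the database responds.
        try (PreparedStatement ps = connection.prepareStatement(
                "select name from user where userid = ?")) {
            ps.setString(1, order.getUserId());
            try (ResultSet rs = ps.executeQuery()) {
                String name = rs.next() ? rs.getString("name") : null;
                return new EnrichedOrder(order, name);
            }
        }
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}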
Async I/O: asynchronous non-blocking requests
Flink introduced Async I/O in version 1.2. In asynchronous mode, I/O operations no longer block: a single parallel subtask can send many requests in a row, and whichever response returns first is processed first, so there is no blocking wait between consecutive requests. This greatly improves stream processing efficiency.
Async I/O is a widely acclaimed feature that Alibaba contributed to the community. It addresses the situation where network latency when interacting with external systems becomes the bottleneck of the system.
The long brown bars in the figure show the waiting time; it is clear that network waiting severely limits both throughput and latency. To solve the problem of synchronous access, the asynchronous mode handles multiple requests and their responses concurrently. That is, you can continuously send requests for users a, b, c, and so on to the database, and whichever response returns first is processed first, so there is no blocking wait between consecutive requests, as shown on the right side of the figure. This is exactly how Async I/O works.
For the detailed principle, refer to the first link at the end of the article, a write-up shared by Jark Wu (云邪) from Alibaba.
A simple example is as follows:
public class AsyncIOFunctionTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1);

        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "localhost:9092");

        DataStreamSource<String> ds = env.addSource(new FlinkKafkaConsumer010<String>("order", new SimpleStringSchema(), p));
        ds.print();

        SingleOutputStreamOperator<Order> order = ds
                .map(new MapFunction<String, Order>() {
                    @Override
                    public Order map(String value) throws Exception {
                        return new Gson().fromJson(value, Order.class);
                    }
                })
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Order>() {
                    @Override
                    public long extractAscendingTimestamp(Order element) {
                        try {
                            return element.getOrderTime();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                        return 0;
                    }
                })
                .keyBy(new KeySelector<Order, String>() {
                    @Override
                    public String getKey(Order value) throws Exception {
                        return value.getUserId();
                    }
                })
                .window(TumblingEventTimeWindows.of(Time.minutes(10)))
                .maxBy("orderTime");

        SingleOutputStreamOperator<Tuple7<String, String, Integer, String, String, Double, Long>> operator = AsyncDataStream
                .unorderedWait(order, new RichAsyncFunction<Order, Tuple7<String, String, Integer, String, String, Double, Long>>() {

                    private Connection connection;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        Class.forName("com.mysql.jdbc.Driver");
                        connection = DriverManager.getConnection("url", "user", "pwd");
                        connection.setAutoCommit(false);
                    }

                    @Override
                    public void asyncInvoke(Order input, ResultFuture<Tuple7<String, String, Integer, String, String, Double, Long>> resultFuture) throws Exception {
                        List<Tuple7<String, String, Integer, String, String, Double, Long>> list = new ArrayList<>();
                        // Query the database in the asyncInvoke method
                        String userId = input.getUserId();
                        Statement statement = connection.createStatement();
                        ResultSet resultSet = statement.executeQuery("select name,age,sex from user where userid=" + userId);
                        if (resultSet != null && resultSet.next()) {
                            String name = resultSet.getString("name");
                            int age = resultSet.getInt("age");
                            String sex = resultSet.getString("sex");
                            Tuple7<String, String, Integer, String, String, Double, Long> res =
                                    Tuple7.of(userId, name, age, sex, input.getOrderId(), input.getPrice(), input.getOrderTime());
                            list.add(res);
                        }
                        // Collect the data
                        resultFuture.complete(list);
                    }

                    @Override
                    public void close() throws Exception {
                        super.close();
                        if (connection != null) {
                            connection.close();
                        }
                    }
                }, 5000, TimeUnit.MILLISECONDS, 100);

        operator.print();
        env.execute("AsyncIOFunctionTest");
    }
}
In the code above, the original order stream comes from Kafka and is joined against a dimension table to fetch the user information for each order. As the example shows, we create a connection in open(), close it in the close() method, and query the database and return the result directly in the asyncInvoke() method of RichAsyncFunction. This completes a simple asynchronous request.
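The Order class itself is not shown in the article; a minimal POJO consistent with the getters used above might look like the following (the field types are assumptions):

// Hypothetical POJO matching the getters used in the example above; field types are assumed.
public class Order {
    private String userId;
    private String orderId;
    private Double price;
    private Long orderTime;

    public Order() {
        // no-arg constructor required by Flink's POJO rules and by Gson
    }

    public String getUserId() { return userId; }
    public void setUserId(String userId) { this.userId = userId; }

    public String getOrderId() { return orderId; }
    public void setOrderId(String orderId) { this.orderId = orderId; }

    public Double getPrice() { return price; }
    public void setPrice(Double price) { this.price = price; }

    public Long getOrderTime() { return orderTime; }
    public void setOrderTime(Long orderTime) { this.orderTime = orderTime; }
}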
The principle and basic usage of Async I/O
Simply put, the API Flink exposes for Async I/O is the abstract class RichAsyncFunction. It provides the methods open (initialization), asyncInvoke (the asynchronous call per record) and close (cleanup); the most important part is implementing asyncInvoke.
Let's start with a template for using Async I/O:
// This example implements the asynchronous request and callback with Futures that have the
// interface of Java 8's futures (which is the same one followed by Flink's Future)

/**
 * An implementation of the 'AsyncFunction' that sends requests and sets the callback.
 */
class AsyncDatabaseRequest extends RichAsyncFunction<String, Tuple2<String, String>> {

    /** The database specific client that can issue concurrent requests with callbacks */
    private transient DatabaseClient client;

    @Override
    public void open(Configuration parameters) throws Exception {
        client = new DatabaseClient(host, port, credentials);
    }

    @Override
    public void close() throws Exception {
        client.close();
    }

    @Override
    public void asyncInvoke(String key, final ResultFuture<Tuple2<String, String>> resultFuture) throws Exception {

        // issue the asynchronous request, receive a future for result
        final Future<String> result = client.query(key);

        // set the callback to be executed once the request by the client is complete
        // the callback simply forwards the result to the result future
        CompletableFuture.supplyAsync(new Supplier<String>() {
            @Override
            public String get() {
                try {
                    return result.get();
                } catch (InterruptedException | ExecutionException e) {
                    // Normally handled explicitly.
                    return null;
                }
            }
        }).thenAccept((String dbResult) -> {
            resultFuture.complete(Collections.singleton(new Tuple2<>(key, dbResult)));
        });
    }
}

// create the original stream
DataStream<String> stream = ...;

// apply the async I/O transformation
DataStream<Tuple2<String, String>> resultStream =
        AsyncDataStream.unorderedWait(stream, new AsyncDatabaseRequest(), 1000, TimeUnit.MILLISECONDS, 100);
Suppose we need to query another database asynchronously. Implementing an asynchronous I/O operation against it takes three steps:
1. Implement an AsyncFunction that dispatches the requests.
2. Obtain the result in a callback and submit it to the ResultFuture (called AsyncCollector in older Flink versions).
3. Apply the asynchronous I/O operation to a DataStream.
There are two important parameters: Timeout defines how long an asynchronous request may run before it is discarded; it prevents dead or failed requests from piling up. Capacity defines how many asynchronous requests may be in flight at the same time. Even though asynchronous I/O brings much better throughput, the operator can still become a bottleneck in the streaming application; once the number of concurrent requests exceeds the Capacity, backpressure is triggered.
Several points need to be noted:
- Using Async I/O requires an external storage system whose client supports asynchronous requests.
- Using Async I/O means extending RichAsyncFunction (the abstract class implementing the AsyncFunction<IN, OUT> interface) and overriding or implementing its three methods: open (establish the connection), close (release the connection) and asyncInvoke (the asynchronous call).
- Async I/O is best combined with a cache, which reduces the number of requests sent to external storage and improves efficiency.
- Async I/O provides a Timeout parameter to control the maximum wait time for a request. By default, when an asynchronous I/O request times out, an exception is thrown and the job restarts or stops. If you want to handle timeouts yourself, override the AsyncFunction timeout method (see the sketch after this list).
- Async I/O provides a Capacity parameter to control the number of concurrent requests. Once the Capacity is exhausted, the backpressure mechanism is triggered to throttle the intake of upstream data.
- Async I/O offers two output modes: unordered and ordered.

Unordered: with the AsyncDataStream.unorderedWait(...) API, the output order of each parallel subtask may differ from its input order. Ordered: with the AsyncDataStream.orderedWait(...) API, the output order of each parallel subtask matches its input order; to guarantee the order, results have to be sorted in an output buffer, which is less efficient.
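A short sketch tying the last two points together, reusing the AsyncDatabaseRequest template from above (the timeout, capacity and fallback value are illustrative):

// Handle timeouts explicitly instead of failing the job: override timeout()
class AsyncDatabaseRequestWithTimeout extends AsyncDatabaseRequest {
    @Override
    public void timeout(String key, ResultFuture<Tuple2<String, String>> resultFuture) throws Exception {
        // Emit a fallback result (here: no value) instead of throwing an exception
        resultFuture.complete(Collections.singleton(new Tuple2<>(key, null)));
    }
}

// Unordered mode: results are emitted as soon as they arrive (best throughput and latency)
DataStream<Tuple2<String, String>> unordered = AsyncDataStream.unorderedWait(
        stream, new AsyncDatabaseRequestWithTimeout(), 1000, TimeUnit.MILLISECONDS, 100);

// Ordered mode: results are buffered so that the output order matches the input order
DataStream<Tuple2<String, String>> ordered = AsyncDataStream.orderedWait(
        stream, new AsyncDatabaseRequestWithTimeout(), 1000, TimeUnit.MILLISECONDS, 100);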
Flink 1.9 optimization
Thanks to the newly merged Blink code, implementing dimension table joins in Flink 1.9 is very simple. To use this feature you need to bring in the Blink planner.
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner-blink_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
Then we just need to provide a custom implementation of the LookupableTableSource interface and implement its methods. The interface looks like this:
public interface LookupableTableSource<T> extends TableSource<T> {

    TableFunction<T> getLookupFunction(String[] lookupKeys);

    AsyncTableFunction<T> getAsyncLookupFunction(String[] lookupKeys);

    boolean isAsyncEnabled();
}
The three methods are as follows:
- The isAsyncEnabled method indicates whether this table supports asynchronous access to the external data source. When it returns true, the asynchronous function is used after the table is registered with the TableEnvironment; when it returns false, the data source is accessed synchronously.
- The getLookupFunction method returns a function that accesses the external data system synchronously: for every lookup by key, processing has to wait until the data comes back, which hurts the throughput of the system (a sketch of such a function follows the asynchronous example below).
- The getAsyncLookupFunction method returns a function that accesses the external data system asynchronously and fetches the data, which can greatly improve system throughput.
Setting the synchronous function aside, getAsyncLookupFunction returns a function that accesses the external data source asynchronously. Note that the asynchronous function is only used when the isAsyncEnabled method of the LookupableTableSource returns true. Accessing an external system asynchronously generally requires an asynchronous client for that system; if none exists, a thread pool can be used to make the access asynchronous. For example:
public class MyAsyncLookupFunction extends AsyncTableFunction<Row> {

    private transient RedisAsyncCommands<String, String> async;

    @Override
    public void open(FunctionContext context) throws Exception {
        RedisClient redisClient = RedisClient.create("redis://127.0.0.1:6379");
        StatefulRedisConnection<String, String> connection = redisClient.connect();
        async = connection.async();
    }

    public void eval(CompletableFuture<Collection<Row>> future, Object... params) {
        // the lookup key passed in by the join (see the full version below for the key layout)
        String key = params[0].toString();
        RedisFuture<String> redisFuture = async.get(key);
        redisFuture.thenAccept(new Consumer<String>() {
            @Override
            public void accept(String value) {
                future.complete(Collections.singletonList(Row.of(key, value)));
            }
        });
    }
}
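For comparison, the synchronous path (getLookupFunction, described in the list above) returns a plain TableFunction that blocks on every lookup. A rough sketch, assuming the blocking Jedis client and a plain string key (all names are illustrative):

import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.TableFunction;
import org.apache.flink.types.Row;
import redis.clients.jedis.Jedis;

public class MyLookupFunction extends TableFunction<Row> {

    private transient Jedis jedis;

    @Override
    public void open(FunctionContext context) throws Exception {
        jedis = new Jedis("127.0.0.1", 6379);
    }

    // Called once per probe-side row; blocks until Redis answers.
    public void eval(Object... params) {
        String key = params[0].toString();
        String value = jedis.get(key); // synchronous round trip
        if (value != null) {
            collect(Row.of(key, value));
        }
    }

    @Override
    public void close() throws Exception {
        if (jedis != null) {
            jedis.close();
        }
    }
}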
A complete example is as follows:
Main method:
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;
import org.junit.Test;

import java.util.Properties;

public class LookUpAsyncTest {

    @Test
    public void test() throws Exception {
        LookUpAsyncTest.main(new String[]{});
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //env.setParallelism(1);
        EnvironmentSettings settings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings);

        final ParameterTool params = ParameterTool.fromArgs(args);
        String fileName = params.get("f");
        DataStream<String> source = env.readTextFile("hdfs://172.16.44.28:8020" + fileName, "UTF-8");

        // Field types for id, user_click, time (the array contents were lost in the original; STRING/STRING/LONG is an assumption)
        TypeInformation[] types = new TypeInformation[]{Types.STRING, Types.STRING, Types.LONG};
        String[] fields = new String[]{"id", "user_click", "time"};
        RowTypeInfo typeInformation = new RowTypeInfo(types, fields);

        DataStream<Row> stream = source.map(new MapFunction<String, Row>() {
            private static final long serialVersionUID = 2349572543469673349L;

            @Override
            public Row map(String s) {
                String[] split = s.split(",");
                Row row = new Row(split.length);
                for (int i = 0; i < split.length; i++) {
                    Object value = split[i];
                    if (types[i].equals(Types.STRING)) {
                        value = split[i];
                    }
                    if (types[i].equals(Types.LONG)) {
                        value = Long.valueOf(split[i]);
                    }
                    row.setField(i, value);
                }
                return row;
            }
        }).returns(typeInformation);

        tableEnv.registerDataStream("user_click_name", stream,
                String.join(",", typeInformation.getFieldNames()) + ",proctime.proctime");

        RedisAsyncLookupTableSource tableSource = RedisAsyncLookupTableSource.Builder.newBuilder()
                .withFieldNames(new String[]{"id", "name"})
                // Field types for id, name (the array contents were lost in the original; STRING/STRING is an assumption)
                .withFieldTypes(new TypeInformation[]{Types.STRING, Types.STRING})
                .build();
        tableEnv.registerTableSource("info", tableSource);

        String sql = "select t1.id,t1.user_click,t2.name" +
                " from user_click_name as t1" +
                " join info FOR SYSTEM_TIME AS OF t1.proctime as t2" +
                " on t1.id = t2.id";

        Table table = tableEnv.sqlQuery(sql);

        DataStream<Row> result = tableEnv.toAppendStream(table, Row.class);

        DataStream<String> printStream = result.map(new MapFunction<Row, String>() {
            @Override
            public String map(Row value) throws Exception {
                return value.toString();
            }
        });

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "127.0.0.1:9094");
        FlinkKafkaProducer011<String> kafkaProducer = new FlinkKafkaProducer011<>(
                "user_click_name",
                new SimpleStringSchema(),
                properties);
        printStream.addSink(kafkaProducer);

        tableEnv.execute(Thread.currentThread().getStackTrace()[1].getClassName());
    }
}
RedisAsyncLookupTableSource:
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.TableSchema;
import org.apache.flink.table.functions.AsyncTableFunction;
import org.apache.flink.table.functions.TableFunction;
import org.apache.flink.table.sources.LookupableTableSource;
import org.apache.flink.table.sources.StreamTableSource;
import org.apache.flink.table.types.DataType;
import org.apache.flink.table.types.utils.TypeConversions;
import org.apache.flink.types.Row;

public class RedisAsyncLookupTableSource implements StreamTableSource<Row>, LookupableTableSource<Row> {

    private final String[] fieldNames;
    private final TypeInformation[] fieldTypes;

    public RedisAsyncLookupTableSource(String[] fieldNames, TypeInformation[] fieldTypes) {
        this.fieldNames = fieldNames;
        this.fieldTypes = fieldTypes;
    }

    // Synchronous lookup
    @Override
    public TableFunction<Row> getLookupFunction(String[] strings) {
        return null;
    }

    // Asynchronous lookup
    @Override
    public AsyncTableFunction<Row> getAsyncLookupFunction(String[] strings) {
        return MyAsyncLookupFunction.Builder.getBuilder()
                .withFieldNames(fieldNames)
                .withFieldTypes(fieldTypes)
                .build();
    }

    // Enable asynchronous access
    @Override
    public boolean isAsyncEnabled() {
        return true;
    }

    @Override
    public DataType getProducedDataType() {
        return TypeConversions.fromLegacyInfoToDataType(new RowTypeInfo(fieldTypes, fieldNames));
    }

    @Override
    public TableSchema getTableSchema() {
        return TableSchema.builder()
                .fields(fieldNames, TypeConversions.fromLegacyInfoToDataType(fieldTypes))
                .build();
    }

    @Override
    public DataStream<Row> getDataStream(StreamExecutionEnvironment environment) {
        throw new UnsupportedOperationException("do not support getDataStream");
    }

    public static final class Builder {
        private String[] fieldNames;
        private TypeInformation[] fieldTypes;

        private Builder() {
        }

        public static Builder newBuilder() {
            return new Builder();
        }

        public Builder withFieldNames(String[] fieldNames) {
            this.fieldNames = fieldNames;
            return this;
        }

        public Builder withFieldTypes(TypeInformation[] fieldTypes) {
            this.fieldTypes = fieldTypes;
            return this;
        }

        public RedisAsyncLookupTableSource build() {
            return new RedisAsyncLookupTableSource(fieldNames, fieldTypes);
        }
    }
}
MyAsyncLookupFunction:
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisFuture;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.async.RedisAsyncCommands;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.table.functions.AsyncTableFunction;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.types.Row;

import java.util.Collection;
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class MyAsyncLookupFunction extends AsyncTableFunction<Row> {

    private final String[] fieldNames;
    private final TypeInformation[] fieldTypes;

    private transient RedisAsyncCommands<String, String> async;

    public MyAsyncLookupFunction(String[] fieldNames, TypeInformation[] fieldTypes) {
        this.fieldNames = fieldNames;
        this.fieldTypes = fieldTypes;
    }

    @Override
    public void open(FunctionContext context) {
        // Set up the asynchronous Redis connection
        RedisClient redisClient = RedisClient.create("redis://127.0.0.1:6379");
        StatefulRedisConnection<String, String> connection = redisClient.connect();
        async = connection.async();
    }

    // Called for every stream element that needs to be joined
    public void eval(CompletableFuture<Collection<Row>> future, Object... params) {
        // Table name, primary key name, primary key value, column name
        String[] info = {"userInfo", "userId", params[0].toString(), "userName"};
        String key = String.join(":", info);
        RedisFuture<String> redisFuture = async.get(key);

        redisFuture.thenAccept(new Consumer<String>() {
            @Override
            public void accept(String value) {
                future.complete(Collections.singletonList(Row.of(key, value)));
                //todo
                // BinaryRow row = new BinaryRow(2);
            }
        });
    }

    @Override
    public TypeInformation<Row> getResultType() {
        return new RowTypeInfo(fieldTypes, fieldNames);
    }

    public static final class Builder {
        private String[] fieldNames;
        private TypeInformation[] fieldTypes;

        private Builder() {
        }

        public static Builder getBuilder() {
            return new Builder();
        }

        public Builder withFieldNames(String[] fieldNames) {
            this.fieldNames = fieldNames;
            return this;
        }

        public Builder withFieldTypes(TypeInformation[] fieldTypes) {
            this.fieldTypes = fieldTypes;
            return this;
        }

        public MyAsyncLookupFunction build() {
            return new MyAsyncLookupFunction(fieldNames, fieldTypes);
        }
    }
}
Several points need to be paid attention to:
1. The external data source must have an asynchronous client. If the client is thread-safe (it can be shared by multiple tasks), it can be declared without the transient keyword and initialized once. Otherwise, declare it transient, do not initialize it eagerly, and create one instance per task in the open method.
2. The eval method has an extra CompletableFuture parameter. When the asynchronous access completes, its complete method must be called to hand over the result, as in the example above:
redisFuture.thenAccept(new Consumer<String>() {
    @Override
    public void accept(String value) {
        future.complete(Collections.singletonList(Row.of(key, value)));
    }
});
3. Although the community now provides asynchronous dimension table joins, joining against an external system that holds a large amount of data can still become the bottleneck of the job, so we usually add a cache to both the synchronous and asynchronous lookup functions (see the sketch below). Considering concurrency, ease of use, real-time updates and multi-versioning, HBase is the best choice for an external dimension table.
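As a rough sketch of that idea, the asynchronous lookup can consult a local cache before going to Redis. This assumes Guava's Cache is on the classpath; the key layout follows MyAsyncLookupFunction above, and the cache size and TTL are arbitrary example values:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.async.RedisAsyncCommands;
import org.apache.flink.table.functions.AsyncTableFunction;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.types.Row;

import java.util.Collection;
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class CachedAsyncLookupFunction extends AsyncTableFunction<Row> {

    private transient RedisAsyncCommands<String, String> async;
    private transient Cache<String, Row> cache;

    @Override
    public void open(FunctionContext context) throws Exception {
        async = RedisClient.create("redis://127.0.0.1:6379").connect().async();
        // Cache size and TTL are illustrative values
        cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(10, TimeUnit.MINUTES)
                .build();
    }

    // getResultType and the builder are omitted for brevity; see MyAsyncLookupFunction above.
    public void eval(CompletableFuture<Collection<Row>> future, Object... params) {
        String key = "userInfo:userId:" + params[0].toString() + ":userName";
        Row cached = cache.getIfPresent(key);
        if (cached != null) {
            // Cache hit: answer without touching Redis
            future.complete(Collections.singletonList(cached));
            return;
        }
        // Cache miss: asynchronous Redis lookup, then populate the cache
        async.get(key).thenAccept(value -> {
            Row row = Row.of(key, value);
            cache.put(key, row);
            future.complete(Collections.singletonList(row));
        });
    }
}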
Reference articles:
http://wuchong.me/blog/2017/05/17/flink-internals-async-io/#
https://www.jianshu.com/p/d8f99d94b761
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65870673
https://www.jianshu.com/p/7ce84f978ae0
Statement: unless otherwise noted, all articles on this account are original. Readers of the official account have the right to read them first; they may not be reproduced without the author's permission, otherwise the author will pursue liability for infringement.
Follow my official account and reply [JAVAPDF] in the background to get 200 pages of interview questions! 50,000 people are following The Way of Big Data Becoming God — won't you come and take a look?
Welcome to follow The Way of Big Data Becoming God.