1, Functions
Flink Table API and SQL come with a set of built-in functions; if they do not meet your needs, you can implement user-defined functions (UDFs) to fill the gap.
1.1 System Built-in Functions
Flink Table API and SQL provide users with a set of built-in functions for data transformation. Many of the functions supported by standard SQL have already been implemented in both the Table API and SQL, while others are still under rapid development and expansion.
The following are some typical examples; the complete list of built-in functions can be found on the official website.
⚫ Comparison functions
SQL:
value1 = value2
value1 > value2
Table API:
ANY1 === ANY2
ANY1 > ANY2
⚫ Logical functions
SQL:
boolean1 OR boolean2
boolean IS FALSE
NOT boolean
Table API:
BOOLEAN1 || BOOLEAN2
BOOLEAN.isFalse
!BOOLEAN
⚫ Arithmetic functions
SQL:
numeric1 + numeric2
POWER(numeric1, numeric2)
Table API:
NUMERIC1 + NUMERIC2
NUMERIC1.power(NUMERIC2)
⚫ String functions
SQL:
string1 || string2
UPPER(string)
CHAR_LENGTH(string)
Table API:
STRING1 + STRING2
STRING.upperCase()
STRING.charLength()
⚫ Temporal functions
SQL:
DATE string
TIMESTAMP string
CURRENT_TIME
INTERVAL string range
Table API:
STRING.toDate
STRING.toTimestamp
currentTime()
NUMERIC.days
NUMERIC.minutes
⚫ Aggregate functions
SQL:
COUNT(*)
SUM([ ALL | DISTINCT ] expression)
RANK()
ROW_NUMBER()
Table API:
FIELD.count
FIELD.sum0
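To see how these combine, here is a minimal sketch that uses a few of the functions above in both SQL and the expression-string Table API. It assumes the running example of this chapter: a StreamTableEnvironment tableEnv in which a table sensor(id, ts, temperature) has been registered, with sensorTable as the corresponding Table object.

// Built-in functions in SQL (assumes a registered table "sensor")
Table sqlDemo = tableEnv.sqlQuery(
        "select upper(id), char_length(id), power(temperature, 2) " +
        "from sensor where temperature > 30.0");

// The same built-in functions via Table API expression strings
Table apiDemo = sensorTable
        .filter("temperature > 30.0")
        .select("id.upperCase(), id.charLength(), temperature.power(2)");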
1.2 User-Defined Functions (UDFs)
User-defined functions (UDFs) are an important feature because they significantly extend the expressive power of queries. Requirements that the system's built-in functions cannot cover can be defined and implemented as UDFs.
1.2.1 Registering User-Defined Functions (UDFs)
In most cases, a user-defined function must be registered before it can be used in a query. Only the Scala Table API can use functions without registering them.
A function is registered in the TableEnvironment by calling its registerFunction() method. When a user-defined function is registered, it is inserted into the function catalog of the TableEnvironment, so that the Table API or SQL parser can recognize and interpret it correctly.
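As a minimal sketch of this pattern (using the pre-1.10 registerFunction() API that the rest of this chapter relies on; HashCode is the scalar function defined in the next subsection):

// Register a UDF instance under a name that Table API expressions and SQL can use
tableEnv.registerFunction("hashCode", new HashCode(23));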
1.2.2 Scalar Functions
User-defined scalar functions map zero, one, or more scalar values to a new scalar value. To define a scalar function, you must extend the base class ScalarFunction in org.apache.flink.table.functions and implement one or more evaluation (eval) methods. The behavior of a scalar function is determined by its evaluation methods, which must be declared public and named eval (declared directly, without override). The parameter types and return type of the evaluation methods determine the parameter types and return type of the ScalarFunction.
In the following code, we define our own HashCode function, register it in TableEnvironment, and call it in the query.
// Customize a scalar function
public static class HashCode extends ScalarFunction {
    private int factor = 13;

    public HashCode(int factor) {
        this.factor = factor;
    }

    public int eval(String s) {
        return s.hashCode() * factor;
    }
}
In the main function, we calculate the hash value of the sensor id (the usual boilerplate first: create the stream and table environments, read the source, and build the table):
public static void main(String[] args) throws Exception {
    // 1. Create environment
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

    // 2. Read the file to get a DataStream
    String filePath = "..\\sensor.txt";
    DataStream<String> inputStream = env.readTextFile(filePath);

    // 3. Convert to a Java Bean
    DataStream<SensorReading> dataStream = inputStream.map(line -> {
        String[] fields = line.split(",");
        return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
    });

    // 4. Convert the DataStream to a Table
    Table sensorTable = tableEnv.fromDataStream(dataStream, "id, timestamp as ts, temperature");

    // 5. Register the user-defined hash function and apply it to the id
    HashCode hashCode = new HashCode(23);
    tableEnv.registerFunction("hashCode", hashCode);
    Table resultTable = sensorTable.select("id, ts, hashCode(id)");

    // sql
    tableEnv.createTemporaryView("sensor", sensorTable);
    Table resultSqlTable = tableEnv.sqlQuery("select id, ts, hashCode(id) from sensor");

    tableEnv.toAppendStream(resultTable, Row.class).print("result");
    tableEnv.toRetractStream(resultSqlTable, Row.class).print("sql");

    env.execute("scalar function test");
}
1.2.3 Table Functions
Similar to user-defined scalar functions, user-defined table functions take zero, one, or more scalar values as input parameters; unlike scalar functions, however, they can return any number of rows as output instead of a single value.
To define a table function, you must extend the base class TableFunction in org.apache.flink.table.functions and implement one or more evaluation methods. The behavior of a table function is determined by its evaluation methods, which must be public and named eval. The parameter types of the evaluation methods determine all valid parameters of the table function.
The type of the returned table is determined by the generic type of TableFunction. The evaluation methods emit output rows using the protected collect(T) method.
In the Table API, a table function is used with .joinLateral or .leftOuterJoinLateral. The joinLateral operator joins each row of the outer table with all the rows produced by the table function (whose parameters are the operator's expression). The leftOuterJoinLateral operator performs a left outer join: it also joins each row of the outer table with all the rows produced by the table function, but outer rows for which the table function returns an empty table are retained as well.
In SQL, you use LATERAL TABLE(<TableFunction>), or a LEFT JOIN with an ON TRUE join condition.
In the code below, we define a table function, register it in the table environment, and call it in a query. The custom TableFunction:
// Custom TableFunction
public static class Split extends TableFunction<Tuple2<String, Integer>> {
    private String separator = ",";

    public Split(String separator) {
        this.separator = separator;
    }

    // Similar to flatMap: no return value, output is emitted via collect()
    public void eval(String str) {
        for (String s : str.split(separator)) {
            collect(new Tuple2<String, Integer>(s, s.length()));
        }
    }
}
Next, we call it in code, first with the Table API:
Split split = new Split("_");
tableEnv.registerFunction("split", split);
Table resultTable = sensorTable
    .joinLateral("split(id) as (word, length)")
    .select("id, ts, word, length");
Then the SQL method:
tableEnv.createTemporaryView("sensor", sensorTable);
Table resultSqlTable = tableEnv.sqlQuery(
    "select id, ts, word, length " +
    "from sensor, lateral table( split(id) ) as splitId(word, length)");
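Both snippets use the inner (cross) lateral join. The left outer variants, which keep a sensor row even when the table function emits nothing for it, would look roughly like this, a sketch reusing the same registered split function:

// Table API: leftOuterJoinLateral retains outer rows with no matches
Table leftOuterResult = sensorTable
    .leftOuterJoinLateral("split(id) as (word, length)")
    .select("id, ts, word, length");

// SQL: a LEFT JOIN against LATERAL TABLE needs an ON TRUE condition
Table leftOuterSqlResult = tableEnv.sqlQuery(
    "select id, ts, word, length " +
    "from sensor left join lateral table( split(id) ) as splitId(word, length) on true");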
1.2.4 Aggregate Functions
User-defined aggregate functions (UDAGGs) aggregate the data in a table into a single scalar value. A user-defined aggregate function is implemented by extending the AggregateFunction abstract class.
As an example of aggregation, suppose we have a table containing data about various drinks. The table consists of three columns (id, name, and price) and five rows. Now we want to find the highest price of all the drinks in the table, that is, perform a max() aggregation; the result is a single value.
The AggregateFunction works as follows.
⚫ First, it needs an accumulator, the data structure that holds the intermediate result (state) of the aggregation. An empty accumulator can be created by calling the createAccumulator() method of AggregateFunction.
⚫ The function's accumulate() method is then called for each input row to update the accumulator.
⚫ After all rows have been processed, the function's getValue() method is called to compute and return the final result.
AggregateFunction requires the following methods to be implemented:
• createAccumulator()
• accumulate()
• getValue()
In addition to the methods above, there are some optional methods. Some of them allow the system to execute queries more efficiently, while others are required in certain scenarios. For example, if the aggregate function is applied in the context of a session group window, the merge() method is required.
• retract()
• merge()
• resetAccumulator()
Next, we write a custom AggregateFunction to calculate the average temperature of each sensor.
// Define the accumulator for the AggregateFunction
public static class AvgTempAcc {
    double sum = 0.0;
    int count = 0;
}

// Custom aggregate function: computes the average temperature of each sensor,
// keeping (sum, count) as its state
public static class AvgTemp extends AggregateFunction<Double, AvgTempAcc> {
    @Override
    public Double getValue(AvgTempAcc accumulator) {
        return accumulator.sum / accumulator.count;
    }

    @Override
    public AvgTempAcc createAccumulator() {
        return new AvgTempAcc();
    }

    // accumulate() updates the state for each input row
    public void accumulate(AvgTempAcc accumulator, Double temp) {
        accumulator.sum += temp;
        accumulator.count += 1;
    }
}
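If this function were applied over session group windows, the merge() method mentioned above would also be required; retract() is needed when the function consumes retracting (updating) input. A minimal sketch of both, assuming the AvgTempAcc accumulator defined above, would be added inside AvgTemp:

// merge(): fold a group of accumulators into acc (required for session windows)
public void merge(AvgTempAcc acc, Iterable<AvgTempAcc> it) {
    for (AvgTempAcc other : it) {
        acc.sum += other.sum;
        acc.count += other.count;
    }
}

// retract(): withdraw a previously accumulated value (required on retracting input)
public void retract(AvgTempAcc acc, Double temp) {
    acc.sum -= temp;
    acc.count -= 1;
}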
Next, you can call it in code.
// Create an aggregate function instance
AvgTemp avgTemp = new AvgTemp();

// Call of Table API
tableEnv.registerFunction("avgTemp", avgTemp);
Table resultTable = sensorTable
    .groupBy("id")
    .aggregate("avgTemp(temperature) as avgTemp")
    .select("id, avgTemp");

// sql
tableEnv.createTemporaryView("sensor", sensorTable);
Table resultSqlTable = tableEnv.sqlQuery(
    "select id, avgTemp(temperature) " +
    "from sensor group by id");

tableEnv.toRetractStream(resultTable, Row.class).print("result");
tableEnv.toRetractStream(resultSqlTable, Row.class).print("sql");
1.2.5 Table Aggregate Functions
User-defined table aggregate functions (UDTAGGs) aggregate the data in a table into a result table with multiple rows and columns. This is very similar to AggregateFunction, except that the aggregation result is now a table rather than a single scalar value.
For example, suppose we now want to find the top two highest prices of all the drinks in the table, that is, perform a top2() table aggregation. We need to examine all five rows, and the result is a table with the top two values after sorting.
The user-defined table aggregation function is implemented by inheriting the TableAggregateFunction abstract class.
TableAggregateFunction works as follows.
⚫ First, it also needs an accumulator, the data structure that holds the intermediate results of the aggregation. An empty accumulator can be created by calling the createAccumulator() method of TableAggregateFunction.
⚫ The function's accumulate() method is then called for each input row to update the accumulator.
⚫ After all rows have been processed, the function's emitValue() method is called to compute and return the final results.
TableAggregateFunction requires the following methods to be implemented:
• createAccumulator()
• accumulate()
In addition to the methods above, there are some optional methods:
• retract()
• merge()
• resetAccumulator()
• emitValue()
• emitUpdateWithRetract()
Next, we write a custom TableAggregateFunction to extract the two highest temperature values of each sensor.
// Define the accumulator first
public static class Top2TempAcc {
    // Use negative infinity as the initial value: Double.MIN_VALUE is the
    // smallest *positive* double and would break for negative temperatures
    double highestTemp = Double.NEGATIVE_INFINITY;
    double secondHighestTemp = Double.NEGATIVE_INFINITY;
}

// Custom table aggregate function
public static class Top2Temp extends TableAggregateFunction<Tuple2<Double, Integer>, Top2TempAcc> {
    @Override
    public Top2TempAcc createAccumulator() {
        return new Top2TempAcc();
    }

    // accumulate() updates the aggregate state for each input row
    public void accumulate(Top2TempAcc acc, Double temp) {
        if (temp > acc.highestTemp) {
            acc.secondHighestTemp = acc.highestTemp;
            acc.highestTemp = temp;
        } else if (temp > acc.secondHighestTemp) {
            acc.secondHighestTemp = temp;
        }
    }

    // emitValue() outputs the results once all rows have been processed
    public void emitValue(Top2TempAcc acc, Collector<Tuple2<Double, Integer>> out) {
        out.collect(new Tuple2<>(acc.highestTemp, 1));
        out.collect(new Tuple2<>(acc.secondHighestTemp, 2));
    }
}
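emitValue() always re-emits the full result. The optional emitUpdateWithRetract() listed above can instead emit only the rows that changed and retract the stale ones. The sketch below only illustrates the shape of the method; it assumes hypothetical oldHighestTemp/oldSecondHighestTemp fields added to Top2TempAcc to remember what was last emitted (RetractableCollector is the inner interface TableAggregateFunction.RetractableCollector):

// Sketch only: incremental emission with retraction
public void emitUpdateWithRetract(Top2TempAcc acc,
                                  RetractableCollector<Tuple2<Double, Integer>> out) {
    if (acc.highestTemp != acc.oldHighestTemp) { // oldHighestTemp is hypothetical
        if (acc.oldHighestTemp != Double.NEGATIVE_INFINITY) {
            out.retract(new Tuple2<>(acc.oldHighestTemp, 1)); // withdraw the stale row
        }
        out.collect(new Tuple2<>(acc.highestTemp, 1));
        acc.oldHighestTemp = acc.highestTemp;
    }
    // ...analogous handling for the second-highest value
}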
Next, you can call it in code.
// Create a table aggregate function instance
Top2Temp top2Temp = new Top2Temp();
tableEnv.registerFunction("top2Temp", top2Temp);

Table resultTable = sensorTable
    .groupBy("id")
    .flatAggregate("top2Temp(temperature) as (temp, rank)")
    .select("id, temp, rank");

tableEnv.toRetractStream(resultTable, Row.class).print("result");
2, Examples
2.1 Scalar Function
Code:
package org.flink.tableapi.udf;

import org.flink.beans.SensorReading;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;
import org.apache.flink.types.Row;

/**
 * @author Just a
 * @date 2021-09-30
 */
public class UdfTest1_ScalarFunction {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // 1. Read data
        DataStreamSource<String> inputStream = env.readTextFile("C:\\Users\\Administrator\\IdeaProjects\\FlinkStudy\\src\\main\\resources\\sensor.txt");

        // 2. Convert to POJO
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });

        // 3. Convert the stream to a table
        Table sensorTable = tableEnv.fromDataStream(dataStream, "id, timestamp as ts, temperature as temp");

        // 4. Use a custom scalar function to compute the hash value of id
        // 4.1 Table API
        HashCode hashCode = new HashCode(23);
        // The UDF needs to be registered in the environment
        tableEnv.registerFunction("hashCode", hashCode);
        Table resultTable = sensorTable.select("id, ts, hashCode(id)");

        // 4.2 SQL
        tableEnv.registerTable("sensor", sensorTable);
        Table resultSqlTable = tableEnv.sqlQuery("select id, ts, hashCode(id) from sensor");

        // Print output
        tableEnv.toAppendStream(resultTable, Row.class).print("result");
        tableEnv.toAppendStream(resultSqlTable, Row.class).print("sql");

        env.execute();
    }

    // Custom ScalarFunction implementation
    public static class HashCode extends ScalarFunction {
        private int factor = 13;

        public HashCode(int factor) {
            this.factor = factor;
        }

        public int eval(String str) {
            return str.hashCode() * factor;
        }
    }
}
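For reference, the map() above expects sensor.txt to contain comma-separated lines of id, timestamp, and temperature; a hypothetical sample:

sensor_1,1547718199,35.8
sensor_6,1547718201,15.4
sensor_7,1547718202,6.7
sensor_10,1547718205,38.1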
Test record:
2.2 Table Function
Code:
package org.flink.tableapi.udf;

import org.flink.beans.SensorReading;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.functions.TableFunction;
import org.apache.flink.types.Row;

/**
 * @author Just a
 * @date 2021-09-30
 */
public class UdfTest2_TableFunction {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // 1. Read data
        DataStreamSource<String> inputStream = env.readTextFile("C:\\Users\\Administrator\\IdeaProjects\\FlinkStudy\\src\\main\\resources\\sensor.txt");

        // 2. Convert to POJO
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });

        // 3. Convert the stream to a table
        Table sensorTable = tableEnv.fromDataStream(dataStream, "id, timestamp as ts, temperature as temp");

        // 4. Use a custom table function to split the id and output (word, length)
        // 4.1 Table API
        Split split = new Split("_");
        // The UDF needs to be registered in the environment
        tableEnv.registerFunction("split", split);
        Table resultTable = sensorTable
            .joinLateral("split(id) as (word, length)")
            .select("id, ts, word, length");

        // 4.2 SQL
        tableEnv.registerTable("sensor", sensorTable);
        Table resultSqlTable = tableEnv.sqlQuery("select id, ts, word, length " +
            " from sensor, lateral table(split(id)) as splitid(word, length)");

        // Print output
        tableEnv.toAppendStream(resultTable, Row.class).print("result");
        tableEnv.toAppendStream(resultSqlTable, Row.class).print("sql");

        env.execute();
    }

    // Custom TableFunction implementation
    public static class Split extends TableFunction<Tuple2<String, Integer>> {
        // The separator attribute
        private String separator = ",";

        public Split(String separator) {
            this.separator = separator;
        }

        // You must implement an eval() method with no return value
        public void eval(String str) {
            for (String s : str.split(separator)) {
                collect(new Tuple2<>(s, s.length()));
            }
        }
    }
}
Test record:
2.3 Aggregate Function
Code:
package org.flink.tableapi.udf;

import org.flink.beans.SensorReading;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.functions.AggregateFunction;
import org.apache.flink.types.Row;

/**
 * @author Just a
 * @date 2021-09-30
 */
public class UdfTest3_AggregateFunction {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // 1. Read data
        DataStreamSource<String> inputStream = env.readTextFile("C:\\Users\\Administrator\\IdeaProjects\\FlinkStudy\\src\\main\\resources\\sensor.txt");

        // 2. Convert to POJO
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });

        // 3. Convert the stream to a table
        Table sensorTable = tableEnv.fromDataStream(dataStream, "id, timestamp as ts, temperature as temp");

        // 4. Use a custom aggregate function to find the average temperature of each sensor
        // 4.1 Table API
        AvgTemp avgTemp = new AvgTemp();
        // The UDF needs to be registered in the environment
        tableEnv.registerFunction("avgTemp", avgTemp);
        Table resultTable = sensorTable
            .groupBy("id")
            .aggregate("avgTemp(temp) as avgtemp")
            .select("id, avgtemp");

        // 4.2 SQL
        tableEnv.registerTable("sensor", sensorTable);
        Table resultSqlTable = tableEnv.sqlQuery("select id, avgTemp(temp) " +
            " from sensor group by id");

        // Print output
        tableEnv.toRetractStream(resultTable, Row.class).print("result");
        tableEnv.toRetractStream(resultSqlTable, Row.class).print("sql");

        env.execute();
    }

    // Custom AggregateFunction implementation
    public static class AvgTemp extends AggregateFunction<Double, Tuple2<Double, Integer>> {
        @Override
        public Double getValue(Tuple2<Double, Integer> accumulator) {
            return accumulator.f0 / accumulator.f1;
        }

        @Override
        public Tuple2<Double, Integer> createAccumulator() {
            return new Tuple2<>(0.0, 0);
        }

        // You must implement an accumulate() method that updates the state for each input row
        public void accumulate(Tuple2<Double, Integer> accumulator, Double temp) {
            accumulator.f0 += temp;
            accumulator.f1 += 1;
        }
    }
}
Test record:
References:
- https://www.bilibili.com/video/BV1qy4y1q728
- https://ashiamd.github.io/docsify-notes/#/study/BigData/Flink/%E5%B0%9A%E7%A1%85%E8%B0%B7Flink%E5%85%A5%E9%97%A8%E5%88%B0%E5%AE%9E%E6%88%98-%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0?id=_11-table-api%e5%92%8cflink-sql