Flink Basics Series 33: Functions in the Table API and Flink SQL

1, Functions

Flink's Table API and SQL come with the built-in functions supported in SQL; if they do not meet your needs, you can implement user-defined functions (UDFs) to fill the gap.

1.1 system built-in functions

The Flink Table API and SQL provide users with a set of built-in functions for data transformation. Many of the functions supported by standard SQL have already been implemented in both the Table API and SQL, while others are still under active development and expansion.

The following are some typical examples; all built-in functions are documented on the official website.
⚫ Comparison functions
SQL:
value1 = value2
value1 > value2

Table API:
ANY1 === ANY2
ANY1 > ANY2

⚫ Logical functions
SQL:
boolean1 OR boolean2
boolean IS FALSE
NOT boolean

Table API:
BOOLEAN1 || BOOLEAN2
BOOLEAN.isFalse
!BOOLEAN

⚫ Arithmetic functions
SQL:
numeric1 + numeric2
POWER(numeric1, numeric2)

Table API:
NUMERIC1 + NUMERIC2
NUMERIC1.power(NUMERIC2)

⚫ String functions
SQL:
string1 || string2
UPPER(string)
CHAR_LENGTH(string)

Table API:
STRING1 + STRING2
STRING.upperCase()
STRING.charLength()

⚫ Time functions
SQL:
DATE string
TIMESTAMP string
CURRENT_TIME
INTERVAL string range

Table API:
STRING.toDate
STRING.toTimestamp
currentTime()
NUMERIC.days
NUMERIC.minutes

⚫ Aggregate functions
SQL:
COUNT(*)
SUM([ ALL | DISTINCT ] expression)
RANK()
ROW_NUMBER()

Table API:
FIELD.count
FIELD.sum0
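
As a quick illustration, several of these built-in functions can be combined in one query. The sketch below assumes the registered view "sensor" and the Table object sensorTable that are built in the examples later in this article:

// A minimal sketch, assuming the "sensor" view and sensorTable from the examples below
Table builtinSql = tableEnv.sqlQuery(
        "select upper(id), char_length(id), temperature " +
        "from sensor where temperature > 30");

// The equivalent using Table API expression syntax
Table builtinApi = sensorTable
        .filter("temperature > 30")
        .select("id.upperCase(), id.charLength(), temperature");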

1.2 UDF

User-defined functions (UDFs) are an important feature because they significantly extend the expressive power of queries. For requirements that the built-in functions cannot meet, we can use a UDF to define and implement the needed logic.

1.2.1 register user-defined function UDF

In most cases, a user-defined function must be registered before it can be used in a query. Only for Scala's Table API is registration unnecessary.

A function is registered in the TableEnvironment by calling its registerFunction() method. When a user-defined function is registered, it is inserted into the TableEnvironment's function catalog so that the Table API or SQL parser can recognize and interpret it correctly.
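
In code, registration is a single call. A minimal sketch, where MyFunction is a hypothetical placeholder for any of the concrete UDF classes defined in the following sections:

// MyFunction is a hypothetical placeholder for a concrete UDF class
MyFunction myFunc = new MyFunction();
tableEnv.registerFunction("myFunc", myFunc);

// After registration, the function is visible to both APIs, e.g.
// Table API: table.select("myFunc(id)")
// SQL:       tableEnv.sqlQuery("select myFunc(id) from sensor")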

1.2.2 Scalar Functions

A user-defined scalar function maps zero, one, or more scalar values to a new scalar value. To define a scalar function, you must extend the base class ScalarFunction in org.apache.flink.table.functions and implement one or more evaluation methods. The behavior of a scalar function is determined by its evaluation methods, which must be declared public and named eval (declared directly, without override). The parameter types and return type of the evaluation method determine the parameter types and return type of the scalar function.

In the following code, we define our own HashCode function, register it in the TableEnvironment, and call it in a query.

// Custom scalar function
public static class HashCode extends ScalarFunction {

    private int factor = 13;

    public HashCode(int factor) {
        this.factor = factor;
    }

    public int eval(String s) {
        return s.hashCode() * factor;
    }
}

In the main method, we compute the hash value of the sensor id. The surrounding steps (creating the stream and table environments, reading the source, and building the table) are the same as before:

public static void main(String[] args) throws Exception {
    // 1. Create the stream and table environments
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);

    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

    // 2. Read the file to get a DataStream
    String filePath = "..\\sensor.txt";
    DataStream<String> inputStream = env.readTextFile(filePath);

    // 3. Convert to a Java bean
    DataStream<SensorReading> dataStream = inputStream.map(line -> {
        String[] fields = line.split(",");
        return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
    });

    // 4. Convert the DataStream to a Table
    Table sensorTable = tableEnv.fromDataStream(dataStream, "id, timestamp as ts, temperature");

    // 5. Call the user-defined hash function on the id
    HashCode hashCode = new HashCode(23);
    tableEnv.registerFunction("hashCode", hashCode);
    Table resultTable = sensorTable.select("id, ts, hashCode(id)");

    // SQL
    tableEnv.createTemporaryView("sensor", sensorTable);
    Table resultSqlTable = tableEnv.sqlQuery("select id, ts, hashCode(id) from sensor");

    tableEnv.toAppendStream(resultTable, Row.class).print("result");
    tableEnv.toRetractStream(resultSqlTable, Row.class).print("sql");

    env.execute("scalar function test");
}
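
For this to run, the sensor.txt file is assumed to contain CSV lines of id, timestamp, and temperature, for example:

sensor_1,1547718199,35.8
sensor_6,1547718201,15.4
sensor_7,1547718202,6.7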

1.2.3 Table Functions

Similar to a user-defined scalar function, a user-defined table function takes zero, one, or more scalar values as input parameters; unlike a scalar function, however, it can return any number of rows as output instead of a single value.

To define a table function, you must extend the base class TableFunction in org.apache.flink.table.functions and implement one or more evaluation methods. The behavior of a table function is determined by its evaluation methods, which must be public and named eval. The parameter types of the evaluation methods determine all valid parameters of the table function.

The row type of the returned table is determined by the generic type parameter of TableFunction. The evaluation method emits output rows using the protected collect(T) method.

In the Table API, a table function is used with .joinLateral or .leftOuterJoinLateral. The joinLateral operator joins each row of the outer table with all the rows produced by the table function (whose parameters are passed as the operator's expression). The leftOuterJoinLateral operator performs a left outer join: it also joins each row of the outer table with all rows produced by the table function, but additionally keeps the outer rows for which the table function returns an empty table.

In SQL, you use LATERAL TABLE(<TableFunction>), or a left join with an ON TRUE condition.

In the code below, we define a table function, register it in the table environment, and call it in a query. The custom TableFunction:

// Custom TableFunction
public static class Split extends TableFunction<Tuple2<String, Integer>> {

    private String separator = ",";

    public Split(String separator) {
        this.separator = separator;
    }

    // Works like flatMap: no return value; rows are emitted via collect()
    public void eval(String str) {
        for (String s : str.split(separator)) {
            collect(new Tuple2<String, Integer>(s, s.length()));
        }
    }
}

Next, we call it in code. First, the Table API approach:

Split split = new Split("_");
tableEnv.registerFunction("split", split);
Table resultTable = sensorTable
    .joinLateral("split(id) as (word, length)")
    .select("id, ts, word, length");

Then the SQL approach:

tableEnv.createTemporaryView("sensor", sensorTable);

Table resultSqlTable = tableEnv.sqlQuery("select id, ts, word, length " +
"from sensor, lateral table( split(id) ) as splitId(word, length)");

1.2.4 Aggregate Functions

A user-defined aggregate function (UDAGG) aggregates the data in a table into a single scalar value. User-defined aggregate functions are implemented by extending the AggregateFunction abstract class.

For example, suppose there is a table containing data about various drinks. The table consists of three columns (id, name, and price) and five rows of data. Now we need to find the highest price of all the drinks in the table, that is, perform a max() aggregation; the result will be a single value.
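
With the built-in max() function this aggregation is a one-liner; "drinks" here is the hypothetical table from the description above:

// The "drinks" table is the hypothetical example described above
Table maxPrice = tableEnv.sqlQuery("select max(price) from drinks");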

The AggregateFunction works as follows.
⚫ First, it needs an accumulator, the data structure that holds the intermediate state of the aggregation. An empty accumulator is created by calling the createAccumulator() method of the AggregateFunction.
⚫ The function's accumulate() method is then called for each input row to update the accumulator.
⚫ After all rows have been processed, the function's getValue() method is called to compute and return the final result.

The AggregateFunction requires the following methods to be implemented:
• createAccumulator()
• accumulate()
• getValue()

In addition to the methods above, there are some optional methods. Some of them let the system execute queries more efficiently, while others are required in certain scenarios. For example, if the aggregate function is applied to a session group window, the merge() method is required; see the sketch after this list.
• retract()
• merge()
• resetAccumulator()
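
For the average-temperature accumulator defined just below, retract() and merge() could be sketched as follows (an illustration of the method contracts, not part of the original example):

// Sketch: optional methods for the AvgTempAcc accumulator defined below

// Remove a previously accumulated value from the state
public void retract(AvgTempAcc acc, Double temp) {
    acc.sum -= temp;
    acc.count -= 1;
}

// Merge several partial accumulators into one (needed e.g. for session windows)
public void merge(AvgTempAcc acc, Iterable<AvgTempAcc> it) {
    for (AvgTempAcc other : it) {
        acc.sum += other.sum;
        acc.count += other.count;
    }
}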

Next, we write a custom AggregateFunction to calculate the average temperature of each sensor.

// Define the accumulator for the AggregateFunction
public static class AvgTempAcc {
    double sum = 0.0;
    int count = 0;
}

// Custom aggregate function: computes the average temperature of each sensor,
// keeping its state (temperature sum, count) in the accumulator
public static class AvgTemp extends AggregateFunction<Double, AvgTempAcc> {

    @Override
    public Double getValue(AvgTempAcc accumulator) {
        return accumulator.sum / accumulator.count;
    }

    @Override
    public AvgTempAcc createAccumulator() {
        return new AvgTempAcc();
    }

    // The actual processing logic: update the accumulator with each input row
    public void accumulate(AvgTempAcc accumulator, Double temp) {
        accumulator.sum += temp;
        accumulator.count += 1;
    }
}

Next, you can call it in code.

// Create an aggregate function instance
AvgTemp avgTemp = new AvgTemp();

// Table API call
tableEnv.registerFunction("avgTemp", avgTemp);
Table resultTable = sensorTable
    .groupBy("id")
    .aggregate("avgTemp(temperature) as avgTemp")
    .select("id, avgTemp");

// SQL
tableEnv.createTemporaryView("sensor", sensorTable);
Table resultSqlTable = tableEnv.sqlQuery("select id, avgTemp(temperature) " +
        "from sensor group by id");

tableEnv.toRetractStream(resultTable, Row.class).print("result");
tableEnv.toRetractStream(resultSqlTable, Row.class).print("sql");

1.2.5 Table Aggregate Functions

A user-defined table aggregate function (UDTAGG) aggregates the data in a table into a result table with multiple rows and columns. This is very similar to an AggregateFunction, except that the aggregation result is now a table rather than a single scalar value.

For example, suppose we now need to find the two highest prices among all the drinks in the table, that is, perform a top2() table aggregation. We need to examine each of the five rows, and the result will be a table containing the top two values after sorting.

A user-defined table aggregate function is implemented by extending the TableAggregateFunction abstract class.

TableAggregateFunction works as follows.
⚫ First, it also needs an accumulator, the data structure that holds the intermediate results of the aggregation. An empty accumulator is created by calling the createAccumulator() method of the TableAggregateFunction.

⚫ The function's accumulate() method is then called for each input row to update the accumulator.

⚫ After all rows have been processed, the function's emitValue() method is called to compute and return the final result.

The TableAggregateFunction requires the following methods to be implemented:
• createAccumulator()
• accumulate()

In addition to the methods above, there are some optional methods.
• retract()
• merge()
• resetAccumulator()
• emitValue()
• emitUpdateWithRetract()

Next, we write a custom TableAggregateFunction to extract the two highest temperature values of each sensor.

// Define an accumulator first
public static class Top2TempAcc {

    // Initialize with -Double.MAX_VALUE (not Double.MIN_VALUE, which is the
    // smallest positive double) so that negative temperatures are handled correctly
    double highestTemp = -Double.MAX_VALUE;

    double secondHighestTemp = -Double.MAX_VALUE;
}

// Custom table aggregate function
public static class Top2Temp extends TableAggregateFunction<Tuple2<Double, Integer>, Top2TempAcc> {

    @Override
    public Top2TempAcc createAccumulator() {
        return new Top2TempAcc();
    }

    // The aggregation logic: track the two highest temperatures seen so far
    public void accumulate(Top2TempAcc acc, Double temp) {
        if (temp > acc.highestTemp) {
            acc.secondHighestTemp = acc.highestTemp;
            acc.highestTemp = temp;
        } else if (temp > acc.secondHighestTemp) {
            acc.secondHighestTemp = temp;
        }
    }

    // Emit the result rows; called after all input data has been processed
    public void emitValue(Top2TempAcc acc, Collector<Tuple2<Double, Integer>> out) {
        out.collect(new Tuple2<>(acc.highestTemp, 1));
        out.collect(new Tuple2<>(acc.secondHighestTemp, 2));
    }
}

Next, you can call it in code.

// Create a table aggregate function instance
Top2Temp top2Temp = new Top2Temp();
tableEnv.registerFunction("top2Temp", top2Temp);
Table resultTable = sensorTable
    .groupBy("id")
    .flatAggregate("top2Temp(temperature) as (temp, rank)")
    .select("id, temp, rank");

tableEnv.toRetractStream(resultTable, Row.class).print("result");

2, Cases

2.1 Scalar Function

Code:

package org.flink.tableapi.udf;

import org.flink.beans.SensorReading;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;
import org.apache.flink.types.Row;

/**
 * @author Just a
 * @date   2021-09-30
 */

public class UdfTest1_ScalarFunction {
    public static void main(String[] args) throws Exception{
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // 1. Read data
        DataStreamSource<String> inputStream = env.readTextFile("C:\\Users\\Administrator\\IdeaProjects\\FlinkStudy\\src\\main\\resources\\sensor.txt");

        // 2. Convert to POJO
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });

        // 3. Convert stream to table
        Table sensorTable = tableEnv.fromDataStream(dataStream, "id, timestamp as ts, temperature as temp");

        // 4. Customize the scalar function to calculate the hash value of id
        // 4.1 table API
        HashCode hashCode = new HashCode(23);
        // UDF needs to be registered in the environment
        tableEnv.registerFunction("hashCode", hashCode);
        Table resultTable = sensorTable.select("id, ts, hashCode(id)");

        // 4.2 SQL
        tableEnv.registerTable("sensor", sensorTable);
        Table resultSqlTable = tableEnv.sqlQuery("select id, ts, hashCode(id) from sensor");

        // Printout
        tableEnv.toAppendStream(resultTable, Row.class).print("result");
        tableEnv.toAppendStream(resultSqlTable, Row.class).print("sql");

        env.execute();
    }

    // Implement a custom ScalarFunction
    public static class HashCode extends ScalarFunction{
        private int factor = 13;

        public HashCode(int factor) {
            this.factor = factor;
        }

        public int eval(String str){
            return str.hashCode() * factor;
        }
    }
}

Test record:

2.2 Table Function

Code:

package org.flink.tableapi.udf;


import org.flink.beans.SensorReading;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.functions.TableFunction;
import org.apache.flink.types.Row;

/**
 * @author Just a
 * @date   2021-09-30
 */

public class UdfTest2_TableFunction {
    public static void main(String[] args) throws Exception{
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // 1. Read data
        DataStreamSource<String> inputStream = env.readTextFile("C:\\Users\\Administrator\\IdeaProjects\\FlinkStudy\\src\\main\\resources\\sensor.txt");

        // 2. Convert to POJO
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });

        // 3. Convert stream to table
        Table sensorTable = tableEnv.fromDataStream(dataStream, "id, timestamp as ts, temperature as temp");

        // 4. Customize the table function to split the id and output (word, length)
        // 4.1 table API
        Split split = new Split("_");

        // UDF needs to be registered in the environment
        tableEnv.registerFunction("split", split);
        Table resultTable = sensorTable
                .joinLateral("split(id) as (word, length)")
                .select("id, ts, word, length");

        // 4.2 SQL
        tableEnv.registerTable("sensor", sensorTable);
        Table resultSqlTable = tableEnv.sqlQuery("select id, ts, word, length " +
                " from sensor, lateral table(split(id)) as splitid(word, length)");

        // Printout
        tableEnv.toAppendStream(resultTable, Row.class).print("result");
        tableEnv.toAppendStream(resultSqlTable, Row.class).print("sql");

        env.execute();
    }

    // Implement custom TableFunction
    public static class Split extends TableFunction<Tuple2<String, Integer>>{
        // Defining attributes, delimiters
        private String separator = ",";

        public Split(String separator) {
            this.separator = separator;
        }

        // You must implement an eval method with no return value
        public void eval(String str) {
            for (String s : str.split(separator)) {
                collect(new Tuple2<>(s, s.length()));
            }
        }
    }
}

Test record:

2.3 Aggregate Function

Code:

package org.flink.tableapi.udf;


import org.flink.beans.SensorReading;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.functions.AggregateFunction;
import org.apache.flink.types.Row;


/**
 * @author Just a
 * @date   2021-09-30
 */

public class UdfTest3_AggregateFunction {
    public static void main(String[] args) throws Exception{
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // 1. Read data
        DataStreamSource<String> inputStream = env.readTextFile("C:\\Users\\Administrator\\IdeaProjects\\FlinkStudy\\src\\main\\resources\\sensor.txt");

        // 2. Convert to POJO
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });

        // 3. Convert stream to table
        Table sensorTable = tableEnv.fromDataStream(dataStream, "id, timestamp as ts, temperature as temp");

        // 4. Customize the aggregation function to find the average temperature value of the current sensor
        // 4.1 table API
        AvgTemp avgTemp = new AvgTemp();

        // UDF needs to be registered in the environment
        tableEnv.registerFunction("avgTemp", avgTemp);
        Table resultTable = sensorTable
                .groupBy("id")
                .aggregate( "avgTemp(temp) as avgtemp" )
                .select("id, avgtemp");

        // 4.2 SQL
        tableEnv.registerTable("sensor", sensorTable);
        Table resultSqlTable = tableEnv.sqlQuery("select id, avgTemp(temp) " +
                " from sensor group by id");

        // Printout
        tableEnv.toRetractStream(resultTable, Row.class).print("result");
        tableEnv.toRetractStream(resultSqlTable, Row.class).print("sql");

        env.execute();
    }

    // Implement custom AggregateFunction
    public static class AvgTemp extends AggregateFunction<Double, Tuple2<Double, Integer>>{
        @Override
        public Double getValue(Tuple2<Double, Integer> accumulator) {
            return accumulator.f0 / accumulator.f1;
        }

        @Override
        public Tuple2<Double, Integer> createAccumulator() {
            return new Tuple2<>(0.0, 0);
        }

        // You must implement an accumulate() method that updates the state for each input row
        public void accumulate( Tuple2<Double, Integer> accumulator, Double temp ){
            accumulator.f0 += temp;
            accumulator.f1 += 1;
        }
    }
}

Test record:

