Flink practical tutorial: Advanced 1-dimensional table Association

Introduction to flow computing Oceanus

Stream computing Oceanus is a powerful tool for real-time analysis of big data product ecosystem. It is an enterprise level real-time big data analysis platform based on Apache Flink with the characteristics of one-stop development, seamless connection, sub second delay, low cost, security and stability. Stream computing Oceanus aims to maximize the value of enterprise data and accelerate the construction process of real-time digitization of enterprises.

This article will introduce in detail how to extract MySQL data and HBase data for dimension table Association (flow dimension join), and store them in elastic search after simple aggregation analysis.

Pre preparation

Create flow computing Oceanus cluster

On the flow calculation Oceanus product activity page 1 yuan to buy Oceanus cluster.

Enter Oceanus console [1], click cluster management on the left, and click Create cluster on the top left. For details, please refer to Oceanus official document to create an exclusive cluster [2].

Create MySQL instance

Enter MySQL console [3] and click new. For details, please refer to the official document to create a MySQL instance [4].

Data preparation:

Enter the instance database and create oceanus_advanced1_student_grade table and insert data manually.

-- Create table statement CREATE TABLE `oceanus_advanced1_student_grade` ( `name` varchar(50) NOT NULL DEFAULT '', `grade` int(3) DEFAULT NULL, PRIMARY KEY (`name`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 -- Data insertion INSERT INTO `oceanus_advanced1_student_grade` (`name`, `grade`) VALUES ('Oceanus-1', 85); INSERT INTO `oceanus_advanced1_student_grade` (`name`, `grade`) VALUES ('Oceanus-2', 95)

Create EMR cluster

Log in to the elastic MapReduce console [5], select [cluster list] > [new cluster] to start creating a new cluster. For details, refer to creating an EMR cluster [6]. When creating a new cluster, you need to choose to install HBase components.

Data preparation:

Log in to the EMR cluster node, enter the HBase instance database through the HBase Shell command, create a new table, and insert data manually.

# Enter HBase command root@yourhostname~# hbase shell

-- Create table statement create 'oceanus_advanced1_student_info','StuInfo' -- Data insertion put 'oceanus_advanced1_student_info','Oceanus-1','StuInfo:Class','01' put 'oceanus_advanced1_student_info','Oceanus-1','StuInfo:Age','17' put 'oceanus_advanced1_student_info','Oceanus-2','StuInfo:Class','01' put 'oceanus_advanced1_student_info','Oceanus-2','StuInfo:Age','20' put 'oceanus_advanced1_student_info','Oceanus-3','StuInfo:Class','01' put 'oceanus_advanced1_student_info','Oceanus-3','StuInfo:Age','18'

Create Elasticsearch cluster

Enter Elasticsearch console [7], click [new] on the upper left to create an Elasticsearch instance. For specific operations, please visit create Elasticsearch cluster [8]

Flow calculation Oceanus job

1. Create Source

CREATE TABLE `mysql_cdc_source_table` ( `name` STRING, `grade` STRING, `proc_time` AS PROCTIME(), -- here proc_time Fields match the following flow dimensions join When used. PRIMARY KEY (`name`) NOT ENFORCED -- If the database table to be synchronized has a primary key defined, You also need to define it here ) WITH ( 'connector' = 'mysql-cdc', -- Fixed value 'mysql-cdc' 'hostname' = '10.0.0.158', -- Database IP 'port' = '3306', -- Access port of the database 'username' = 'root', -- User name for database access (required) SHOW DATABASES,REPLICATION SLAVE,REPLICATION CLIENT,SELECT and RELOAD Permissions) 'password' = 'Tencent123$', -- Password for database access 'database-name' = 'testdb', -- Databases that need to be synchronized 'table-name' = 'oceanus_advanced1_student_grade' -- Name of data table to be synchronized );

2. Create HBase dimension table

CREATE TABLE hbase_table ( rowkey STRING, StuInfo ROW <Class STRING,Age STRING>, PRIMARY KEY (rowkey) NOT ENFORCED ) WITH ( 'connector' = 'hbase-1.4', -- Flink 1.13 support hbase-2.2 'table-name' = 'oceanus_advanced1_student_info', -- HBase Table name 'zookeeper.quorum' = '10.0.0.118:2181,10.0.0.119:2181,10.0.0.3:2181' -- HBase of zookeeper address );

3. Create Sink

CREATE TABLE elasticsearch6_sink_table ( `class` STRING, `amount` BIGINT, PRIMARY KEY(`class`) NOT ENFORCED ) WITH ( 'connector' = 'elasticsearch-6', -- Output to Elasticsearch 6 'username' = 'elastic', -- Optional user name 'password' = 'Tencent123$', -- Optional password 'hosts' = 'http://10.0.0.97:9200 '-- connection address of elasticsearch 'index' = 'oceanus_advanced1', -- Elasticsearch of Index name 'document-type' = '_doc', -- Elasticsearch of Document type 'sink.bulk-flush.max-actions' = '1', -- Every data is refreshed 'format' = 'json' -- Output data format, currently only supported 'json' );

4. Write business SQL

INSERT INTO elasticsearch6_sink_table SELECT b.StuInfo.Class AS class, COUNT(a.name) AS amount FROM mysql_cdc_source_table AS a JOIN hbase_table FOR SYSTEM_TIME AS OF a.proc_time AS b -- You must join here for SYSTEM_TIME as of Statement, otherwise it can still be executed JOIN，However, the database will only be read in full once, and the results may not meet expectations. ON a.name = b.rowkey WHERE CAST(a.grade AS INT) >= 90 AND CAST(b.StuInfo.Age AS INT) >= 18 GROUP BY b.StuInfo.Class

summary

Currently, the built-in Connector of the stream computing Oceanus platform supports MySQL, PostgreSQL, Hive, HBase, Redis and Oracle dimension tables, which can be used without user development. Please refer to the built-in dimension table reference list [9] for details on how to use the dimension table. This example counts the number of people in each class who are 18 years old or older and have a score of 90 points or older, which has no actual business purpose.

When creating a Sink table to ES, Flink will treat the primary key class field defined in the CREATE TABLE statement as_ id generates the primary key and updates the previous document (Upsert stream) accordingly. If there is no definition of primary key, Flink will randomly generate a string_ id generates the primary key, which is written for the Append stream.