Flink CDC Series - Build Streaming ETL on MySQL and Postgres

This tutorial will show you how to quickly build streaming ETL for MySQL and Postgres based on Flink CDC.

Flink-CDC project address:
https://github.com/ververica/flink-cdc-connectors

The demo in this tutorial is based on a Docker environment and is done entirely in the Flink SQL CLI: it involves only SQL, with no Java/Scala code and no IDE installation required.

Suppose we are running an e-commerce business, where product and order data is stored in MySQL and the logistics information for orders is stored in Postgres.

To make analysis easier, we want to join each order with its corresponding product and logistics information to form a wide table, and write it to Elasticsearch in real time.

The following sections describe how to use the Flink MySQL/Postgres CDC connectors to fulfill this requirement. The overall architecture of the system is shown below:

1. Preparation phase

Prepare a Linux or MacOS computer with Docker installed.

1.1 Components required for this tutorial

This tutorial uses docker-compose to prepare the required components.

Create a docker-compose.yml file using the following:

version: '2.1'
services:
  postgres:
    image: debezium/example-postgres:1.1
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
  mysql:
    image: debezium/example-mysql:1.1
    ports:
      - "3306:3306"
    environment:
      - MYSQL_ROOT_PASSWORD=123456
      - MYSQL_USER=mysqluser
      - MYSQL_PASSWORD=mysqlpw
  elasticsearch:
    image: elastic/elasticsearch:7.6.0
    environment:
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.type=single-node
    ports:
      - "9200:9200"
      - "9300:9300"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
  kibana:
    image: elastic/kibana:7.6.0
    ports:
      - "5601:5601"

The containers defined in this Docker Compose file are:

  • MySQL: stores the products table and the orders table, which will be joined with the shipments table in the Postgres database to produce the enriched order table enriched_orders;
  • Postgres: stores the logistics table shipments;
  • Elasticsearch: the final enriched order table enriched_orders is written to Elasticsearch;
  • Kibana: used to visualize the data in Elasticsearch.

Execute the following command in the directory where docker-compose.yml is located to start the components required for this tutorial:

docker-compose up -d

This command starts all containers defined in the Docker Compose configuration in detached mode. You can run docker ps to check whether the containers started properly, or visit http://localhost:5601/ to check whether Kibana is running.

Note: Container-related commands used next in this tutorial also need to be executed in the directory where docker-compose.yml resides.

1.2 Download Flink and required dependent packages

  1. Download Flink 1.13.2 [1] and extract it to the directory flink-1.13.2
  2. Download the required dependency jars (the Flink SQL connector jars for MySQL CDC, Postgres CDC, and Elasticsearch 7 that are used in the DDL in Section 3) and place them in the directory flink-1.13.2/lib/

[1] https://downloads.apache.org/flink/flink-1.13.2/flink-1.13.2-bin-scala_2.11.tgz

1.3 Preparing data

1.3.1 Preparing data in a MySQL database

  1. Enter MySQL container:

    docker-compose exec mysql mysql -uroot -p123456
  2. Create a database and the tables products and orders, then insert data:

    -- MySQL
    CREATE DATABASE mydb;
    USE mydb;
    CREATE TABLE products (
      id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(255) NOT NULL,
      description VARCHAR(512)
    );
    ALTER TABLE products AUTO_INCREMENT = 101;
    
    INSERT INTO products
    VALUES (default,"scooter","Small 2-wheel scooter"),
           (default,"car battery","12V car battery"),
           (default,"12-pack drill bits","12-pack of drill bits with sizes ranging from #40 to #3"),
           (default,"hammer","12oz carpenter's hammer"),
           (default,"hammer","14oz carpenter's hammer"),
           (default,"hammer","16oz carpenter's hammer"),
           (default,"rocks","box of assorted rocks"),
           (default,"jacket","water resistent black wind breaker"),
           (default,"spare tire","24 inch spare tire");
    
    CREATE TABLE orders (
      order_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
      order_date DATETIME NOT NULL,
      customer_name VARCHAR(255) NOT NULL,
      price DECIMAL(10, 5) NOT NULL,
      product_id INTEGER NOT NULL,
      order_status BOOLEAN NOT NULL -- Whether order has been placed
    ) AUTO_INCREMENT = 10001;
    
    INSERT INTO orders
    VALUES (default, '2020-07-30 10:08:22', 'Jark', 50.50, 102, false),
           (default, '2020-07-30 10:11:09', 'Sally', 15.00, 105, false),
           (default, '2020-07-30 12:00:30', 'Edward', 25.25, 106, false);
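
The MySQL CDC connector reads change events from the MySQL binlog, so the source instance must have the binlog enabled in ROW format. The debezium/example-mysql image used here is already configured this way; as an optional sanity check (a small addition to the original steps), you can run the following in the same MySQL session:

-- MySQL (optional check)
SHOW VARIABLES LIKE 'log_bin';        -- expected: ON
SHOW VARIABLES LIKE 'binlog_format';  -- expected: ROW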

1.3.2 Preparing data in the Postgres database

  1. Enter the Postgres container:

    docker-compose exec postgres psql -h localhost -U postgres
  2. Create table shipments and insert data:

    -- PG
    CREATE TABLE shipments (
       shipment_id SERIAL NOT NULL PRIMARY KEY,
       order_id SERIAL NOT NULL,
       origin VARCHAR(255) NOT NULL,
       destination VARCHAR(255) NOT NULL,
       is_arrived BOOLEAN NOT NULL
     );
     ALTER SEQUENCE public.shipments_shipment_id_seq RESTART WITH 1001;
     -- REPLICA IDENTITY FULL makes UPDATE/DELETE events carry the complete previous row image,
     -- which the Postgres CDC source needs to produce correct change records
     ALTER TABLE public.shipments REPLICA IDENTITY FULL;
     INSERT INTO shipments
     VALUES (default,10001,'Beijing','Shanghai',false),
            (default,10002,'Hangzhou','Shanghai',false),
            (default,10003,'Shanghai','Hangzhou',false);
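
The Postgres CDC connector relies on logical decoding, which requires wal_level to be set to logical. The debezium/example-postgres image is already configured accordingly; as an optional sanity check (a small addition to the original steps), you can run the following in the same psql session:

-- PG (optional check)
SHOW wal_level;  -- expected: logical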

2. Start Flink Cluster and Flink SQL CLI

  1. Use the following command to change to the Flink directory:

    cd flink-1.13.2
  2. Start the Flink cluster with the following command:

    ./bin/start-cluster.sh

    If startup succeeds, you can access the Flink Web UI at http://localhost:8081/, as shown below:

  3. Start Flink SQL CLI with the following command

    ./bin/sql-client.sh

    After successful startup, you can see the following page:

3. Creating tables using Flink DDL in Flink SQL CLI

First, enable checkpointing with an interval of 3 seconds.

-- Flink SQL
Flink SQL> SET execution.checkpointing.interval = 3s;

Then, for the tables products, orders, and shipments in the databases, create corresponding tables in the Flink SQL CLI to synchronize the data from these underlying database tables.

-- Flink SQL
Flink SQL> CREATE TABLE products (
    id INT,
    name STRING,
    description STRING,
    PRIMARY KEY (id) NOT ENFORCED
  ) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'localhost',
    'port' = '3306',
    'username' = 'root',
    'password' = '123456',
    'database-name' = 'mydb',
    'table-name' = 'products'
  );

Flink SQL> CREATE TABLE orders (
    order_id INT,
    order_date TIMESTAMP(0),
    customer_name STRING,
    price DECIMAL(10, 5),
    product_id INT,
    order_status BOOLEAN,
    PRIMARY KEY (order_id) NOT ENFORCED
  ) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'localhost',
    'port' = '3306',
    'username' = 'root',
    'password' = '123456',
    'database-name' = 'mydb',
    'table-name' = 'orders'
  );

Flink SQL> CREATE TABLE shipments (
    shipment_id INT,
    order_id INT,
    origin STRING,
    destination STRING,
    is_arrived BOOLEAN,
    PRIMARY KEY (shipment_id) NOT ENFORCED
  ) WITH (
    'connector' = 'postgres-cdc',
    'hostname' = 'localhost',
    'port' = '5432',
    'username' = 'postgres',
    'password' = 'postgres',
    'database-name' = 'postgres',
    'schema-name' = 'public',
    'table-name' = 'shipments'
  );
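
Before creating the sink table, you can optionally verify that the CDC source tables work by querying one of them directly in the SQL CLI; the rows inserted earlier should appear in the result view (a minimal sanity check, not part of the original tutorial steps; exit the result view to return to the prompt):

-- Flink SQL
Flink SQL> SELECT * FROM products;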

Finally, create the enriched_orders table, which writes the joined order data to Elasticsearch.

-- Flink SQL
Flink SQL> CREATE TABLE enriched_orders (
    order_id INT,
    order_date TIMESTAMP(0),
    customer_name STRING,
    price DECIMAL(10, 5),
    product_id INT,
    order_status BOOLEAN,
    product_name STRING,
    product_description STRING,
    shipment_id INT,
    origin STRING,
    destination STRING,
    is_arrived BOOLEAN,
    PRIMARY KEY (order_id) NOT ENFORCED
  ) WITH (
    'connector' = 'elasticsearch-7',
    'hosts' = 'http://localhost:9200',
    'index' = 'enriched_orders'
  );

4. Associate order data and write it into Elasticsearch

Use Flink SQL to join the orders table with the products table and the shipments table, and write the enriched order information into Elasticsearch.

-- Flink SQL
Flink SQL> INSERT INTO enriched_orders
  SELECT o.*, p.name, p.description, s.shipment_id, s.origin, s.destination, s.is_arrived
  FROM orders AS o
  LEFT JOIN products AS p ON o.product_id = p.id
  LEFT JOIN shipments AS s ON o.order_id = s.order_id;

After the job is submitted successfully, you can visit http://localhost:8081/#/job/running to see the running Flink streaming job on the Flink Web UI, as shown below:

You can now see order data containing both product and logistics information in Kibana.

First, visit http://localhost:5601/app/kibana#/management/kibana/index_pattern and create the index pattern enriched_orders.

Then visit http://localhost:5601/app/kibana#/discover to see the written data.

Next, modify the data in the tables in the MySQL and Postgres databases, and the order data displayed in Kibana will be updated in real time.

  1. Insert a record into the orders table of MySQL:

    -- MySQL
    INSERT INTO orders
    VALUES (default, '2020-07-30 15:22:00', 'Jark', 29.71, 104, false);
  2. Insert a record into the shipments table of Postgres:

    -- PG
    INSERT INTO shipments
    VALUES (default,10004,'Shanghai','Beijing',false);
  3. Update the status of the order in the orders table of MySQL:

    -- MySQL
    UPDATE orders SET order_status = true WHERE order_id = 10004;
  4. Update the shipment status in the shipments table of Postgres:

    -- PG
    UPDATE shipments SET is_arrived = true WHERE shipment_id = 1004;
  5. Delete a record from the orders table in MySQL:

    -- MySQL
    DELETE FROM orders WHERE order_id = 10004;

    Refresh Kibana after each step, and you will see the order data displayed in Kibana update in real time, as shown below:

5. Environment Cleanup

At the end of this tutorial, stop all containers by executing the following command in the directory where the docker-compose.yml file resides:

docker-compose down

Stop the Flink cluster by executing the following command in Flink's directory, flink-1.13.2:

./bin/stop-cluster.sh

6. Summary

In this article, we show you how to quickly build Streaming ETL using Flink CDC in a simple business scenario. Hopefully, this article can help readers get started with Flink CDC quickly, and also hope that Flink CDC can meet your business needs.

For more Flink CDC related technical questions, you are welcome to join the community DingTalk group.

