Flink real-time data warehouse


Common sense of e-commerce

Since this project is based on e-commerce data, here is a brief introduction to some common e-commerce concepts.

SKU and SPU

SKU: a specific sellable item, e.g. a silver, 128 GB iPhone X on the China Unicom network
SPU: the abstract product, e.g. iPhone X
tm_id: the brand id, e.g. Apple, which covers iPhone, headsets, Mac, etc.

What is the difference between an order table and an order details table?

The order status in the order table changes over time; the order details table does not change, because it carries no order status.
The order table records user_id, order id, order number, total_amount, order_status, payment method and so on.
The order details table records user_id, the item's sku_id and the concrete commodity information (commodity name sku_name, price order_price, quantity sku_num).

Real time data warehouse architecture

The overall architecture is not much different from an offline warehouse architecture; the differences can be seen from the diagrams. Compared with an offline project, the intermediate transformations are done with Flink, mostly Flink SQL, and most of the intermediate results (tables) live in Kafka.

Parse the table used

Sensitive id fields in the tables have been desensitized.

1. Primary classification table

DROP TABLE IF EXISTS `base_category1`;
CREATE TABLE `base_category1` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'number',
  `name` varchar(10) NOT NULL COMMENT 'Classification name',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=19 DEFAULT CHARSET=utf8 COMMENT='Primary classification table';


This table holds the broad, top-level product categories.

2. Secondary classification table

DROP TABLE IF EXISTS `base_category2`;
CREATE TABLE `base_category2` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'number',
  `name` varchar(200) NOT NULL COMMENT 'Secondary classification name',
  `category1_id` bigint(20) DEFAULT NULL COMMENT 'Primary classification number',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=126 DEFAULT CHARSET=utf8 COMMENT='Secondary classification table';


This is the breakdown under the primary classification; the third field (category1_id) references the id of the primary classification table.

3. Three level classification table

DROP TABLE IF EXISTS `base_category3`;
CREATE TABLE `base_category3` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'number',
  `name` varchar(200) NOT NULL COMMENT 'Three level classification name',
  `category2_id` bigint(20) DEFAULT NULL COMMENT 'Secondary classification number',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1150 DEFAULT CHARSET=utf8 COMMENT='Three level classification table';


The tertiary classification table is more specific; the third field (category2_id) references the id of the secondary classification table.

4. Province table

DROP TABLE IF EXISTS `base_province`;
CREATE TABLE `base_province` (
  `id` int(20) DEFAULT NULL COMMENT 'id',
  `name` varchar(20) DEFAULT NULL COMMENT 'Province name',
  `region_id` int(20) DEFAULT NULL COMMENT 'Large area id',
  `area_code` varchar(20) DEFAULT NULL COMMENT 'Administrative location code'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


Each row is a province together with the id of the region it belongs to.

5. Regional table

DROP TABLE IF EXISTS `base_region`;
CREATE TABLE `base_region` (
  `id` int(20) NOT NULL COMMENT 'Large area id',
  `region_name` varchar(20) DEFAULT NULL COMMENT 'Region name',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


Region number and region name

6. Brand table

DROP TABLE IF EXISTS `base_trademark`;
CREATE TABLE `base_trademark` (
  `tm_id` varchar(20) DEFAULT NULL COMMENT 'brand id',
  `tm_name` varchar(20) DEFAULT NULL COMMENT 'Brand name'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


The brand names corresponding to each brand id.

7. Time dimension table

DROP TABLE IF EXISTS `date_info`;
CREATE TABLE `date_info` (
  `date_id` int(11) NOT NULL,
  `week_id` int(11) DEFAULT NULL,
  `week_day` int(11) DEFAULT NULL,
  `day` int(11) DEFAULT NULL,
  `month` int(11) DEFAULT NULL,
  `quarter` int(11) DEFAULT NULL,
  `year` int(11) DEFAULT NULL,
  `is_workday` int(11) DEFAULT NULL,
  `holiday_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`date_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

8. Holiday date table

DROP TABLE IF EXISTS `holiday_info`;
CREATE TABLE `holiday_info` (
  `holiday_id` int(11) NOT NULL,
  `holiday_name` varchar(20) DEFAULT NULL,
  PRIMARY KEY (`holiday_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


The holiday names corresponding to each holiday id.

9. Holiday date range table

DROP TABLE IF EXISTS `holiday_year`;
CREATE TABLE `holiday_year` (
  `year_id` int(11) DEFAULT NULL,
  `holiday_id` int(11) DEFAULT NULL,
  `start_date_id` int(11) DEFAULT NULL,
  `end_date_id` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

10. Order details

DROP TABLE IF EXISTS `order_detail`;
CREATE TABLE `order_detail` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'number',
  `order_id` bigint(20) DEFAULT NULL COMMENT 'Order No',
  `sku_id` bigint(20) DEFAULT NULL COMMENT 'sku_id',
  `sku_name` varchar(200) DEFAULT NULL COMMENT 'sku name (redundant)',
  `img_url` varchar(200) DEFAULT NULL COMMENT 'Picture name (redundant)',
  `order_price` decimal(10,2) DEFAULT NULL COMMENT 'Purchase price (sku price when the order was placed)',
  `sku_num` varchar(200) DEFAULT NULL COMMENT 'Number of purchases',
  `create_time` datetime DEFAULT NULL COMMENT 'Creation time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1054 DEFAULT CHARSET=utf8 COMMENT='Order details';


The order details table contains the line-item attributes of an order and does not change once written.

11. Order form

DROP TABLE IF EXISTS `order_info`;
CREATE TABLE `order_info` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'number',
  `consignee` varchar(100) DEFAULT NULL COMMENT 'Consignee',
  `consignee_tel` varchar(20) DEFAULT NULL COMMENT 'Consignee phone',
  `total_amount` decimal(10,2) DEFAULT NULL COMMENT 'Total amount',
  `order_status` varchar(20) DEFAULT NULL COMMENT 'Order status',
  `user_id` bigint(20) DEFAULT NULL COMMENT 'User id',
  `payment_way` varchar(20) DEFAULT NULL COMMENT 'Payment method',
  `delivery_address` varchar(1000) DEFAULT NULL COMMENT 'Shipping address',
  `order_comment` varchar(200) DEFAULT NULL COMMENT 'Order remarks',
  `out_trade_no` varchar(50) DEFAULT NULL COMMENT 'Order transaction number (for third party payment)',
  `trade_body` varchar(200) DEFAULT NULL COMMENT 'Order description (for third party payment)',
  `create_time` datetime DEFAULT NULL COMMENT 'Creation time',
  `operate_time` datetime DEFAULT NULL COMMENT 'Operation time',
  `expire_time` datetime DEFAULT NULL COMMENT 'Expiration time',
  `tracking_no` varchar(100) DEFAULT NULL COMMENT 'Logistics tracking number',
  `parent_order_id` bigint(20) DEFAULT NULL COMMENT 'Parent order number',
  `img_url` varchar(200) DEFAULT NULL COMMENT 'Picture path',
  `province_id` int(20) DEFAULT NULL COMMENT 'Province id',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=371 DEFAULT CHARSET=utf8 COMMENT='Order table';


12. Order status table

DROP TABLE IF EXISTS `order_status_log`;
CREATE TABLE `order_status_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `order_id` int(11) DEFAULT NULL,
  `order_status` int(11) DEFAULT NULL,
  `operate_time` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=43 DEFAULT CHARSET=utf8;


13. Payment flow sheet

DROP TABLE IF EXISTS `payment_info`;
CREATE TABLE `payment_info` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'number',
  `out_trade_no` varchar(20) DEFAULT NULL COMMENT 'External business No',
  `order_id` varchar(20) DEFAULT NULL COMMENT 'Order No',
  `user_id` varchar(20) DEFAULT NULL COMMENT 'User number',
  `alipay_trade_no` varchar(20) DEFAULT NULL COMMENT 'Alipay transaction flow code',
  `total_amount` decimal(16,2) DEFAULT NULL COMMENT 'Payment amount',
  `subject` varchar(20) DEFAULT NULL COMMENT 'Transaction content',
  `payment_type` varchar(20) DEFAULT NULL COMMENT 'Payment method',
  `payment_time` varchar(20) DEFAULT NULL COMMENT 'Payment time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=21 DEFAULT CHARSET=utf8 COMMENT='Payment flow sheet';


14. Inventory unit table

DROP TABLE IF EXISTS `sku_info`;
CREATE TABLE `sku_info` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'skuid(itemID)',
  `spu_id` bigint(20) DEFAULT NULL COMMENT 'spuid',
  `price` decimal(10,0) DEFAULT NULL COMMENT 'Price',
  `sku_name` varchar(200) DEFAULT NULL COMMENT 'sku name',
  `sku_desc` varchar(2000) DEFAULT NULL COMMENT 'Product specification description',
  `weight` decimal(10,2) DEFAULT NULL COMMENT 'weight',
  `tm_id` bigint(20) DEFAULT NULL COMMENT 'brand(redundancy)',
  `category3_id` bigint(20) DEFAULT NULL COMMENT 'Three level classification id(redundancy)',
  `sku_default_img` varchar(200) DEFAULT NULL COMMENT 'Default display picture(redundancy)',
  `create_time` datetime DEFAULT NULL COMMENT 'Creation time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=utf8 COMMENT='Inventory unit table';


Inventory data

15. User table

DROP TABLE IF EXISTS `user_info`;
CREATE TABLE `user_info` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'number',
  `login_name` varchar(200) DEFAULT NULL COMMENT 'User name',
  `nick_name` varchar(200) DEFAULT NULL COMMENT 'User nickname',
  `passwd` varchar(200) DEFAULT NULL COMMENT 'User password',
  `name` varchar(200) DEFAULT NULL COMMENT 'User name',
  `phone_num` varchar(200) DEFAULT NULL COMMENT 'cell-phone number',
  `email` varchar(200) DEFAULT NULL COMMENT 'mailbox',
  `head_img` varchar(200) DEFAULT NULL COMMENT 'head portrait',
  `user_level` varchar(200) DEFAULT NULL COMMENT 'User level',
  `birthday` date DEFAULT NULL COMMENT 'User birthday',
  `gender` varchar(1) DEFAULT NULL COMMENT 'Gender M male,F female',
  `create_time` datetime DEFAULT NULL COMMENT 'Creation time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=153 DEFAULT CHARSET=utf8 COMMENT='User table';


Stored user data

Complete data source and data collection

The data source and data collection have been completed here

Complete ODS layer

These are the prerequisite steps:
Start Hadoop
Start ZooKeeper
Start Kafka
Start canal: cd /usr/local/soft/canal/bin && ./startup.sh
Start the Flink SQL client: sql-client.sh embedded
Switch catalog:
use catalog myhive;

Create one database in Flink for each warehouse layer; the tables live in those Flink databases. The ODS tables read from Kafka, and the data in Kafka comes from canal monitoring MySQL.

At this point the data source, the data collection and the ODS layer are done. The ODS layer tables are built in Flink SQL and read their data through Kafka, so all of the data has landed in the tables; when rows in the original MySQL tables are modified, the data here changes accordingly. An offline warehouse imports data with tools such as DataX and can only process T+1 data, whereas the data in a real-time warehouse changes dynamically.
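For reference, an ODS table such as gma_ods.ods_mysql_kafka_base_province could be declared in Flink SQL roughly as below. This is only a sketch, assuming canal writes the binlog of base_province to a Kafka topic in canal-json format; the topic name and broker address are placeholders, not the project's real values.

-- Sketch only: topic and broker address are assumed
CREATE TABLE gma_ods.ods_mysql_kafka_base_province (
    id        BIGINT,
    name      STRING,
    region_id BIGINT,
    area_code STRING
) WITH (
    'connector' = 'kafka',
    'topic' = 'example_base_province',               -- assumed topic written by canal
    'properties.bootstrap.servers' = 'master:9092',  -- assumed broker address
    'properties.group.id' = 'ods_base_province',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'canal-json'                          -- canal emits binlog changes as canal-json
);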

With that, this part of the setup is complete.

Complete DIM layer

Here the DIM layer data is stored in MySQL, and the source business data is also in MySQL. That is a compromise; normally the two would not share the same database, but we only want to simulate a result here.

Analysis table structure

-- Create the region dimension table in flink
CREATE TABLE gma_dim.dim_region_info (
    id BIGINT,
    name STRING,
    area_code STRING,
    region_name STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://master:3306/gma_dim?useUnicode=true&characterEncoding=utf-8',
    'table-name' = 'dim_region_info',
    'username' = 'root',
    'password' = '123456'
);

-- Create the receiving table in mysql
CREATE TABLE `dim_region_info` (
  `id` bigint(20) NOT NULL COMMENT 'id',
  `name` varchar(20) DEFAULT NULL COMMENT 'Province name',
  `area_code` varchar(20) DEFAULT NULL COMMENT 'Administrative location code',
  `region_name` varchar(20) DEFAULT NULL COMMENT 'Region name',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Create the mysql-cdc table
-- cdc first does a full scan, then incrementally monitors the binlog to read data
-- a mysql-cdc table can only be used to read data
CREATE TABLE gma_dim.dim_region_info_cdc (
    id BIGINT,
    name STRING,
    area_code STRING,
    region_name STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'master',
    'port' = '3306',
    'username' = 'root',
    'password' = '123456',
    'database-name' = 'gma_dim',
    'table-name' = 'dim_region_info'
);

1. First, create the table in Flink that will eventually be written to MySQL.
2. Create the receiving table in MySQL (a separate gma_dim database is created in MySQL to hold the DIM data).
3. Then create the mysql-cdc table so that the binlog can be monitored incrementally.
4. Import the data into the gma_dim database in MySQL (at first we ran the import directly in SQL, but that way we cannot checkpoint, so we move it into code).
The job here runs 7x24 and there is no resource scheduling.
There is a problem, though: a job that runs continuously will inevitably fail at some point, and we cannot afford to rerun everything from scratch each time. So we need checkpoints, so that even if the job fails there is a point to restart from; since we cannot enable checkpointing from the SQL client, we implement it in code.

Here, the data source and data acquisition, ODS layer and DIM layer have been completed

Use code to import data and checkpoint

package com.shujia.gma0

import org.apache.flink.configuration.Configuration
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.runtime.state.StateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.table.catalog.hive.HiveCatalog

object DimRegionInfo {
  def main(args: Array[String]): Unit = {
    // flinksql environment
    val bsEnv: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    /**
     * Enable checkpointing
     */
    // Start a checkpoint every 10000 ms
    bsEnv.enableCheckpointing(10000)
    // Advanced options:
    // Set the mode to exactly-once (this is the default)
    bsEnv.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // Make sure at least 500 ms pass between two checkpoints
    bsEnv.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
    // A checkpoint must complete within one minute or it is discarded
    bsEnv.getCheckpointConfig.setCheckpointTimeout(60000)
    // Only one checkpoint may be in progress at a time
    bsEnv.getCheckpointConfig.setMaxConcurrentCheckpoints(1)

    val config: CheckpointConfig = bsEnv.getCheckpointConfig
    // Keep the latest checkpoint files after the job fails or is cancelled
    config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)

    // Set the state backend and the location where state is saved
    val stateBackend: StateBackend = new RocksDBStateBackend("hdfs://master:9000/flink/checkpoint", true)
    bsEnv.setStateBackend(stateBackend)

    val bsSettings: EnvironmentSettings = EnvironmentSettings
      .newInstance()
      .useBlinkPlanner()  // use the blink planner
      .inStreamingMode()  // use the streaming mode
      .build()

    val configuration: Configuration = new Configuration()
    configuration.setString("table.exec.sink.not-null-enforcer", "drop")
    configuration.setString("table.dynamic-table-options.enabled", "true")

    // Environment for creating tables
    val bsTableEnv: StreamTableEnvironment = StreamTableEnvironment.create(bsEnv, bsSettings)
    bsTableEnv.getConfig.addConfiguration(configuration)

    /**
     * Register the hive metastore as a catalog,
     * so the tables defined there can be read directly
     */
    val name = "myhive"
    val defaultDatabase = "gma_dim"
    val hiveConfDir = "/usr/local/soft/hive-1.2.1/conf"
    val hive = new HiveCatalog(name, defaultDatabase, hiveConfDir)

    // Register the catalog
    bsTableEnv.registerCatalog("myhive", hive)
    // Switch catalog
    bsTableEnv.useCatalog("myhive")

    bsTableEnv.executeSql(
      """
        |insert into gma_dim.dim_region_info
        |select a.id as id,name,area_code,region_name from
        |gma_ods.ods_mysql_kafka_base_province /*+ OPTIONS('scan.startup.mode'='earliest-offset') */ as a
        |join
        |gma_ods.ods_mysql_kafka_base_region /*+ OPTIONS('scan.startup.mode'='earliest-offset') */ as b
        |on a.region_id = b.id
      """.stripMargin)
  }
}

Encapsulate the same part of the code

1. Put common code together

package com.shujia.common

import org.apache.flink.configuration.Configuration
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.runtime.state.StateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.table.catalog.hive.HiveCatalog

abstract class FlinkTool {

  var bsEnv: StreamExecutionEnvironment = _
  var bsTableEnv: StreamTableEnvironment = _

  def main(args: Array[String]): Unit = {
    // flinksql environment
    bsEnv = StreamExecutionEnvironment.getExecutionEnvironment

    /**
     * Enable checkpointing
     */
    // Start a checkpoint every 10000 ms
    bsEnv.enableCheckpointing(10000)
    // Advanced options:
    // Set the mode to exactly-once (this is the default)
    bsEnv.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // Make sure at least 500 ms pass between two checkpoints
    bsEnv.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
    // A checkpoint must complete within one minute or it is discarded
    bsEnv.getCheckpointConfig.setCheckpointTimeout(60000)
    // Only one checkpoint may be in progress at a time
    bsEnv.getCheckpointConfig.setMaxConcurrentCheckpoints(1)

    val config: CheckpointConfig = bsEnv.getCheckpointConfig
    // Keep the latest checkpoint files after the job fails or is cancelled
    config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)

    // Set the state backend and the location where state is saved
    val stateBackend: StateBackend = new RocksDBStateBackend("hdfs://master:9000/flink/checkpoint", true)
    bsEnv.setStateBackend(stateBackend)

    val bsSettings: EnvironmentSettings = EnvironmentSettings
      .newInstance()
      .useBlinkPlanner()  // use the blink planner
      .inStreamingMode()  // use the streaming mode
      .build()

    val configuration: Configuration = new Configuration()
    configuration.setString("table.exec.sink.not-null-enforcer", "drop")
    configuration.setString("table.dynamic-table-options.enabled", "true")

    // Environment for creating tables
    bsTableEnv = StreamTableEnvironment.create(bsEnv, bsSettings)
    bsTableEnv.getConfig.addConfiguration(configuration)

    /**
     * Register the hive metastore as a catalog,
     * so the tables defined there can be read directly
     */
    val name = "myhive"
    val defaultDatabase = "gma_dim"
    val hiveConfDir = "/usr/local/soft/hive-1.2.1/conf"
    val hive = new HiveCatalog(name, defaultDatabase, hiveConfDir)

    // Register the catalog
    bsTableEnv.registerCatalog("myhive", hive)
    // Switch catalog
    bsTableEnv.useCatalog("myhive")

    this.run(args)
  }

  def run(args: Array[String])
}

When writing a concrete job, only the SQL needs to be written:

package com.shujia.gma0

import com.shujia.common.FlinkTool

object DimRegionInfo extends FlinkTool {
  override def run(args: Array[String]): Unit = {
    bsTableEnv.executeSql(
      """
        |insert into gma_dim.dim_region_info
        |select a.id as id,name,area_code,region_name from
        |gma_ods.ods_mysql_kafka_base_province /*+ OPTIONS('scan.startup.mode'='earliest-offset') */ as a
        |join
        |gma_ods.ods_mysql_kafka_base_region /*+ OPTIONS('scan.startup.mode'='earliest-offset') */ as b
        |on a.region_id = b.id
      """.stripMargin)
  }
}

2. Go one step further: leave the SQL out of the code entirely, put it in a file, and have the program run the SQL file.
The code lives in one place and the SQL to execute in another, and it turns out this can be parsed and run normally.

The code that executes the sql file is pulled out on its own:

package com.shujia.common

import scala.io.Source

object DimRegionInfo extends FlinkTool {
  override def run(args: Array[String]): Unit = {
    if (args.length == 0) {
      println("Please specify the sql file path")
      return
    }
    val sqlFilePath = args(0)
    val sql: String = Source.fromFile(sqlFilePath)
      .getLines()
      .toList
      .mkString("\n")

    println("*" * 50 + "Executing sql" + "*" * 50)
    println(sql)
    println("*" * 100)

    bsTableEnv.executeSql(sql)
  }
}


When running the sql here:

You can run just this program and pass the sql file path as an argument; if no argument is passed, the program prints a prompt asking for the sql file path (that message is the one we wrote ourselves above).
Once the sql file path is supplied, the job runs stably.
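The sql file that gets passed in is simply the plain insert statement from the job above, e.g. a file such as dim_region_info.sql (the file name is our own choice, not prescribed by the project) containing:

insert into gma_dim.dim_region_info
select a.id as id,name,area_code,region_name from
gma_ods.ods_mysql_kafka_base_province /*+ OPTIONS('scan.startup.mode'='earliest-offset') */ as a
join
gma_ods.ods_mysql_kafka_base_region /*+ OPTIONS('scan.startup.mode'='earliest-offset') */ as b
on a.region_id = b.id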

Even if the job fails partway through, we can restore it from the retained checkpoint, identified by its id.


Just supply the HDFS path of the checkpoint here.

3. Finally, move the constants into a configuration file, write a Config object, and expose methods for reading the values.

package com.shujia.common

import java.io.InputStream
import java.util.Properties

object Config {
  // Read the configuration file from the classpath
  val inputStream: InputStream = this.getClass
    .getClassLoader
    .getResourceAsStream("config.properties")

  private val properties = new Properties()
  properties.load(inputStream)

  def getString(key: String): String = {
    properties.getProperty(key)
  }

  def getLong(key: String): Long = {
    properties.getProperty(key).toLong
  }
}


The hard-coded values in the original code (for example the checkpoint path and the hive configuration directory) can then be read through Config instead of being written inline.

Once these files are done, the DIM layer is complete.

Complete DWD layer

The data of this layer still goes into Kafka, and it has to stay in a changelog format, otherwise the data could not be updated. So here we both read from Kafka and write back to Kafka.

Create the tables in Flink SQL
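As an illustration, a DWD table kept in Kafka in changelog form could be declared with the upsert-kafka connector roughly like this. This is a sketch only: it assumes a Flink version that ships upsert-kafka, and the table name, columns, topic and broker address are assumptions, not the project's actual DDL.

-- Sketch only: names, topic and broker address are assumed
CREATE TABLE gma_dwd.dwd_order_info (
    id           BIGINT,
    order_status STRING,
    user_id      BIGINT,
    total_amount DECIMAL(10,2),
    province_id  INT,
    create_time  TIMESTAMP(3),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',                    -- keeps the data in changelog (upsert) form
    'topic' = 'example_dwd_order_info',              -- assumed topic name
    'properties.bootstrap.servers' = 'master:9092',  -- assumed broker address
    'key.format' = 'json',
    'value.format' = 'json'
);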

Submit the insert job to the cluster
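The submitted job is then essentially an INSERT INTO ... SELECT from the ODS tables, for example (again a sketch; the ODS order table name only follows the naming pattern used earlier and is assumed):

-- Sketch only: the source table name is assumed
insert into gma_dwd.dwd_order_info
select id, order_status, user_id, total_amount, province_id, create_time
from gma_ods.ods_mysql_kafka_order_info /*+ OPTIONS('scan.startup.mode'='earliest-offset') */;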

The ODS layer data is being consumed continuously here.
With that, the DWD layer is complete.

Since this is a real-time project, it does not matter much whether a DWS layer is built or not, so we skip the DWS layer and go straight to the ADS layer.

Complete ADS layer

The data of the ADS layer finally lands in MySQL, so tables need to be created both in the Flink catalog and in a MySQL database so that the data can end up in MySQL.
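A sketch of what the Flink-side sink table could look like, modelled on the dim_region_info JDBC table shown earlier (the gma_ads database, table and column names are assumptions, and a matching table also has to be created in MySQL):

-- Sketch only: database, table and column names are assumed
CREATE TABLE gma_ads.ads_province_order_amount (
    province_id  INT,
    order_count  BIGINT,
    order_amount DECIMAL(16,2),
    PRIMARY KEY (province_id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://master:3306/gma_ads?useUnicode=true&characterEncoding=utf-8',
    'table-name' = 'ads_province_order_amount',
    'username' = 'root',
    'password' = '123456'
);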

Import data into MySQL

Execute sql to calculate indicators
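For example, a per-province order count and order amount indicator could be computed with a query along these lines (a hypothetical sketch that assumes the DWD and ADS tables sketched above):

-- Sketch only: assumes the hypothetical gma_dwd.dwd_order_info and gma_ads.ads_province_order_amount tables
insert into gma_ads.ads_province_order_amount
select
    province_id,
    count(distinct id)                       as order_count,
    cast(sum(total_amount) as DECIMAL(16,2)) as order_amount
from gma_dwd.dwd_order_info
group by province_id;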

All of the data flowing through the pipeline can be monitored in Flink.

Thank you for reading. I am shuaihe, a senior majoring in big data. I wish you happiness.
