Real-Time Label Development - Build a real-time user portrait from scratch

Data Access

Data can be brought into the system by writing it to Kafka in real time, either directly or through real-time capture tools such as Oracle's OGG or MySQL's binlog.


Oracle GoldenGate (OGG) provides real-time capture, transformation, and delivery of transactional data across heterogeneous environments.

Through OGG, changes in an Oracle database can be written to Kafka in real time.

Low impact on production systems: it reads transaction logs in real time and replicates large volumes of data with low resource consumption.

Replication at transaction granularity ensures transactional consistency: only committed data is synchronized.

High performance:

  • Intelligent transaction reorganization and operation consolidation
  • Access through the database's native interface
  • Parallel processing architecture


MySQL's binary log (binlog) is arguably MySQL's most important log. It records all DDL and DML statements (excluding pure query statements such as SELECT and SHOW) as events, along with the time each statement took to execute. The binary log is transaction-safe. Its main purposes are replication and recovery.
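Why an ordered event log is sufficient for replication and recovery can be shown with a toy replay. This is an illustrative sketch only (hypothetical event tuples, not MySQL's actual binlog format): replaying the same ordered log against an empty copy reproduces the table state.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration: a replica that replays the same ordered change log
// converges to exactly the same state as the primary.
public class BinlogReplaySketch {

    // Apply an ordered log of {operation, key, value} events to an empty table.
    static Map<String, String> apply(List<String[]> log) {
        Map<String, String> table = new HashMap<>();
        for (String[] e : log) {
            switch (e[0]) {
                case "INSERT":
                case "UPDATE": table.put(e[1], e[2]); break;
                case "DELETE": table.remove(e[1]); break;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<String[]> binlog = Arrays.asList(
                new String[]{"INSERT", "u1", "alice"},
                new String[]{"INSERT", "u2", "bob"},
                new String[]{"UPDATE", "u1", "alice2"},
                new String[]{"DELETE", "u2", null});

        Map<String, String> primary = apply(binlog);
        Map<String, String> replica = apply(binlog); // replica replays the same log
        System.out.println(primary.equals(replica)); // true
        System.out.println(primary);                 // {u1=alice2}
    }
}
```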

Through either of these channels, the source data can be synchronized into Kafka, the entry point of our real-time system.

Flink Access to Kafka Data

The Apache Kafka connector provides easy access to Kafka data.

Dependencies

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_2.11</artifactId>
  <version>1.9.0</version>
</dependency>
Build FlinkKafkaConsumer

Must have:

1. The topic name

2. A DeserializationSchema / KafkaDeserializationSchema for deserializing the Kafka data

3. Configuration properties, including "bootstrap.servers" and "group.id" (Kafka 0.8 additionally requires "zookeeper.connect")

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181")
properties.setProperty("group.id", "test")
stream = env
    .addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties))
    .print()
Timestamp and Watermark

In many cases, a record's timestamp (explicit or implicit) is embedded in the record itself. In addition, users may want to emit watermarks periodically or irregularly.

We can define timestamp extractors / watermark emitters and pass them to the consumer via assignTimestampsAndWatermarks.
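Flink ships ready-made extractors for this, such as BoundedOutOfOrdernessTimestampExtractor. The idea behind them can be sketched framework-free (class and method names below are illustrative, not Flink's API): the watermark trails the largest timestamp seen so far by a fixed bound and never moves backwards, even when late events arrive.

```java
import java.util.Arrays;

// Framework-free sketch of the bounded-out-of-orderness watermark idea:
// watermark = max timestamp seen so far - allowed lateness bound.
public class WatermarkSketch {

    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeen = Long.MIN_VALUE;

    WatermarkSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Called once per record; returns the watermark after seeing this event.
    long onEvent(long eventTimestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
        return maxTimestampSeen - maxOutOfOrdernessMs;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(10);
        for (long t : Arrays.asList(100L, 95L, 120L, 110L)) {
            // the late events (95, 110) do not pull the watermark backwards
            System.out.println("event=" + t + " watermark=" + wm.onEvent(t));
        }
    }
}
```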

The consumer's start position can also be configured:

val env = StreamExecutionEnvironment.getExecutionEnvironment()
val myConsumer = new FlinkKafkaConsumer[String](...)
myConsumer.setStartFromEarliest()      // start from the earliest record possible
myConsumer.setStartFromLatest()        // start from the latest record
myConsumer.setStartFromTimestamp(...)  // start from specified epoch timestamp (milliseconds)
myConsumer.setStartFromGroupOffsets()  // the default behaviour

// start from specified offsets
// val specificStartOffsets = new java.util.HashMap[KafkaTopicPartition, java.lang.Long]()
// specificStartOffsets.put(new KafkaTopicPartition("myTopic", 0), 23L)
// myConsumer.setStartFromSpecificOffsets(specificStartOffsets)

val stream = env.addSource(myConsumer)

When Flink checkpointing is enabled, the Flink Kafka Consumer consumes records from a topic and periodically checkpoints all of its Kafka offsets, together with the state of its other operations, in a consistent manner. If the job fails, Flink restores the streaming program to the state of the latest checkpoint and re-consumes the records from Kafka, starting from the offsets stored in that checkpoint.

If checkpointing is disabled, the Flink Kafka Consumer relies on the automatic periodic offset-committing capability of the internally used Kafka client.

If checkpointing is enabled, the Flink Kafka Consumer commits the offsets stored in the checkpoint state once the checkpoint completes.

val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.enableCheckpointing(5000) // checkpoint every 5000 msecs
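The interplay between checkpoints and offsets described above can be sketched framework-free (illustrative names, not Flink internals): offsets are snapshotted as part of each checkpoint, and after a failure the job resumes from the offsets stored in the last completed checkpoint rather than from wherever consumption had reached.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: offsets travel with the checkpoint, so restore
// rewinds consumption to the last consistent snapshot.
public class OffsetCheckpointSketch {

    private final Map<Integer, Long> nextOffsets = new HashMap<>(); // partition -> next offset to read
    private Map<Integer, Long> lastCheckpoint = new HashMap<>();

    void consume(int partition, long offset) { nextOffsets.put(partition, offset + 1); }

    void checkpointComplete() { lastCheckpoint = new HashMap<>(nextOffsets); }

    void restoreFromCheckpoint() {
        nextOffsets.clear();
        nextOffsets.putAll(lastCheckpoint);
    }

    long nextOffset(int partition) { return nextOffsets.getOrDefault(partition, 0L); }

    public static void main(String[] args) {
        OffsetCheckpointSketch job = new OffsetCheckpointSketch();
        job.consume(0, 41);            // read the record at offset 41
        job.checkpointComplete();      // checkpoint stores "next = 42"
        job.consume(0, 42);            // progress beyond the checkpoint...
        job.restoreFromCheckpoint();   // ...then the job fails and restarts
        System.out.println(job.nextOffset(0)); // 42: the uncheckpointed record is re-read
    }
}
```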

Complete code for consuming Kafka from Flink:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class KafkaConsumer {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("group.id", "test");

        // Build the FlinkKafkaConsumer
        FlinkKafkaConsumer<String> myConsumer =
                new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), properties);
        // Specify the start offset
        myConsumer.setStartFromEarliest();

        DataStream<String> stream = env.addSource(myConsumer);
        stream.print();

        env.execute("Flink Streaming Java API Skeleton");
    }
}

At this point, data is flowing into our system in real time and can be processed in Flink. So how are labels computed? Label computation relies heavily on the capabilities of the data warehouse: with a good data warehouse, computing labels becomes easy.

Data Warehouse Basics

A data warehouse is a subject-oriented, integrated, stable, time-variant collection of data used to support management decisions.

(1) Subject-oriented. Data in business databases is organized around processing tasks, and each business system is separate from the others. Data in a data warehouse is organized by subject.

(2) Integrated. The data stored in the data warehouse is extracted from business databases, but it is not a simple copy of the original data: it has been extracted, cleaned, and transformed (ETL). Business databases record a running account of each business transaction; such data is not suitable for analytical processing. A series of computations is performed before data enters the warehouse, and data unnecessary for analysis is discarded.

(3) Stable. Operational database systems generally store only short-term data, so their content is volatile, recording the system's momentary changes. Most data in a data warehouse represents data at some point in the past; it is mainly used for query and analysis and is not modified as frequently as a business database. A data warehouse is generally built for reading.

OLTP (Online Transaction Processing) is the main application of traditional relational databases, used primarily for day-to-day operational and transactional systems:

1. Data volumes are relatively small
2. Real-time requirements are high and transactions must be supported
3. Data is generally stored in relational databases (Oracle or MySQL)

OLAP (Online Analytical Processing) is the main application of data warehouses, supporting complex analysis and queries with a focus on decision support:

1. Real-time requirements are not very high; ETL is generally T+1
2. Data volumes are large
3. Mainly used for analysis and decision-making

The star model is the most common data warehouse design. It consists of a fact table and a set of dimension tables, each of which has a dimension primary key. The core of the model is the fact table, which connects the different dimension tables: objects in each dimension table are related to objects in the other dimension tables through the fact table, which establishes the relationships between them.

  • Dimension table: stores dimension information, including dimension attributes and hierarchies
  • Fact table: records business facts and the corresponding measure statistics; compared with the dimension tables, the fact table has a very large number of records
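A toy illustration of the star model described above, with hypothetical data (all table contents and names below are invented for illustration): a large fact table of orders references a small user dimension by surrogate key, and analysis "joins" the fact rows back to the dimension to aggregate a measure.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy star schema: fact rows reference dimension rows by key; analysis joins
// them and aggregates a measure per dimension attribute.
public class StarSchemaSketch {

    // "Join" each fact row to the user dimension and sum the amount per user.
    static Map<String, Integer> amountByUser(Map<Integer, String> dimUser, List<int[]> factOrders) {
        Map<String, Integer> out = new TreeMap<>();
        for (int[] row : factOrders) {
            // row = {user_id, city_id, amount}
            out.merge(dimUser.get(row[0]), row[2], Integer::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> dimUser = new HashMap<>(); // user_id -> user_name
        dimUser.put(1, "alice");
        dimUser.put(2, "bob");

        // fact table rows: {user_id, city_id, amount}
        List<int[]> factOrders = Arrays.asList(
                new int[]{1, 101, 30}, new int[]{2, 101, 20}, new int[]{1, 101, 50});

        System.out.println(amountByUser(dimUser, factOrders)); // {alice=80, bob=20}
    }
}
```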

The snowflake model is an extension of the star model, in which each dimension table can be joined outward to multiple more detailed category tables. In addition to the role dimension tables play in the star model, dimensions are further normalized into detail tables, refining the granularity of the data. For example, a location dimension table contains the attribute set {location_id, street, city, province, country}. This pattern relates the city_id in the location dimension table to the city_id of a separate city dimension table, with records such as {101, 'No. 10 Liberation Avenue', 'Wuhan', 'Hubei Province', 'China'} and {255, 'No. 85 Liberation Avenue', 'Wuhan', 'Hubei Province', 'China'}. The star model is the most basic pattern: one star model has multiple dimension tables but only one fact table. By describing a complex dimension with multiple tables on top of the star pattern, building a multilayer dimension structure, we obtain the snowflake model.

A layered warehouse design brings several benefits:

  • Clear data structure: each layer has its own scope, making tables easier to locate and understand
  • Dirty data cleaning: shields downstream layers from anomalies in the raw data
  • Shielding business impact: there is no need to re-ingest data every time the business changes
  • Data lineage tracking: the business-facing table we ultimately expose comes from many sources; if a source table has a problem, we want to quickly and accurately locate it and understand the scope of the damage
  • Reduced duplicate development: standardizing the layering and developing common intermediate-layer data avoids a great deal of repeated computation
  • Simplifying complex problems: a complex task is broken into multiple steps, each layer handling a single step, which is simple and easy to understand; it also makes data accuracy easier to maintain, since when something goes wrong you need not repair all the data, only the problematic step

The data warehouse directly ingests OLTP or log data. A user portrait is simply a further modeling of the data warehouse from the user's perspective, so the daily scheduling of portrait-label data depends on the execution of the upstream data warehouse tasks.

Once you understand the data warehouse, you can compute the labels. After developing the tag logic, write the data to Hive and Druid to complete offline and real-time tag development.

Flink Hive Druid

Flink Hive

Flink has supported Hive integration since 1.9, but in 1.9 it was a beta feature not recommended for production environments. The latest version, Flink 1.10, marks the completion of the Blink integration and brings production-grade Hive integration. Hive, as the absolute core of the data warehouse ecosystem, carries the vast majority of offline data ETL computation and data management; we look forward to Flink's ever more complete support for it.

HiveCatalog connects to an instance of the Hive Metastore to provide metadata persistence. To interact with Hive from Flink, users need to configure a HiveCatalog and access Hive's metadata through it.

Add Dependency

To integrate with Hive, you need to add extra dependency jars to Flink's lib directory so the integration works in Table API programs or in SQL Client. Alternatively, you can put these dependencies in a folder and add them to the classpath with the Table API program's -C option or SQL Client's -l option. This article uses the first method, copying the jars directly into the $FLINK_HOME/lib directory. The Hive version used here is 2.3.4 (for other Hive versions, choose the corresponding jar dependencies per the official website); a total of three jars are required:

  • flink-connector-hive_2.11-1.10.0.jar
  • flink-shaded-hadoop-2-uber-2.7.5-8.0.jar
  • hive-exec-2.3.4.jar

Add Maven Dependencies

<!-- Flink Dependency -->


<!-- Hive Dependency -->
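The dependency listings themselves were lost above. Based on the three jar names given earlier, a plausible sketch of the Maven coordinates is the following (group IDs and versions are inferred from the jar file names; treat them as assumptions and verify against the official website):

```xml
<!-- Flink Dependency -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-hive_2.11</artifactId>
    <version>1.10.0</version>
    <scope>provided</scope>
</dependency>

<!-- Hive Dependency -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>2.3.4</version>
    <scope>provided</scope>
</dependency>
```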

Example Code

package com.flink.sql.hiveintegration;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class FlinkHiveIntegration {

    public static void main(String[] args) throws Exception {

        EnvironmentSettings settings = EnvironmentSettings
                .newInstance()
                .useBlinkPlanner() // use the Blink planner
                .inBatchMode()     // batch mode; streaming mode by default
                .build();

        // Alternatively, use streaming mode:
        // EnvironmentSettings settings = EnvironmentSettings
        //         .newInstance()
        //         .useBlinkPlanner()
        //         .inStreamingMode()
        //         .build();

        TableEnvironment tableEnv = TableEnvironment.create(settings);

        String name = "myhive";                // catalog name, a unique identifier
        String defaultDatabase = "qfbap_ods";  // default database name
        String hiveConfDir = "/opt/modules/apache-hive-2.3.4-bin/conf";  // path to hive-site.xml
        String version = "2.3.4";              // Hive version number

        HiveCatalog hive = new HiveCatalog(name, defaultDatabase, hiveConfDir, version);
        tableEnv.registerCatalog("myhive", hive);

        // Creating a database is supported; creating Hive tables is not yet supported
        String createDbSql = "CREATE DATABASE IF NOT EXISTS myhive.test123";
        tableEnv.sqlUpdate(createDbSql);
    }
}

Flink Druid

You can write Flink's processed data back to Kafka and ingest it into Druid from there, or you can write the data directly to Druid. Below is example code for the direct approach.

Dependencies

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <parent>
        ...
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <description>Flink Druid Connection</description>
    ...
</project>

Sample Code

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.ResponseEntity;
import org.springframework.web.client.RestTemplate;

@SpringBootApplication
public class FlinkDruidApp {

    private static String url = "http://localhost:8200/v1/post/wikipedia";

    private static RestTemplate template;

    private static HttpHeaders headers;

    FlinkDruidApp() {
        template = new RestTemplate();
        headers = new HttpHeaders();
    }

    public static void main(String[] args) throws Exception {
        SpringApplication.run(FlinkDruidApp.class, args);

        // Create the Flink execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Define the data source
        DataSet<String> data = env.readTextFile("/wikiticker-2015-09-12-sampled.json");

        // Transformation on the data: post every record to Druid
        data.map(x -> httpsPost(x).toString()).print();
    }

    // HTTP POST method to post data to Druid
    private static ResponseEntity<String> httpsPost(String json) {
        HttpEntity<String> requestEntity = new HttpEntity<>(json, headers);
        ResponseEntity<String> response =
                template.exchange(url, HttpMethod.POST, requestEntity, String.class);
        return response;
    }

    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }
}

Label development is cumbersome and requires continual development and optimization. But how do you deliver good labels so that they produce real value? In the next chapter, we will introduce the customization of user portraits. To be continued~


User Portrait: Methodology and Engineering Solutions

For more blogs and news on real-time data analysis, follow "real-time streaming computing".


Posted on Wed, 10 Jun 2020 22:48:11 -0400 by swizenfeld