Didi x StarRocks: high speed multidimensional analysis creates greater business value

As a leading enterprise in the field of life services, Didi group is comprehensively testing and launching StarRocks. After more than a year of data system construction, we have gradually migrated some multidimensional data analysis requirements requiring real-time interactive query and ad hoc query from ClickHouse to StarRocks. StarRocks has also given us good experience in stability and real-time performance, Next, taking the funnel analysis realized by StarRocks as an example, this paper introduces the practice of StarRocks in the analysis and application of orange core optimization operation data.

Author: Wang Peng
Senior Data Architecture Development Engineer of didi orange heart preferred data architecture department, responsible for the development and construction of orange heart preferred big data basic services and data applications "

Demand introduction

At present, the funnel analysis Kanban on our data portal is scattered, and each Kanban can only support funnel analysis of one scene, which is not conducive to users' unified viewing or horizontal comparison. The Kanban cannot support flexible analysis functions such as optional funnel steps and drill down disassembly. Therefore, we need a funnel analysis tool that can cover more comprehensive traffic data, support flexible screening dimensions and flexible selection of funnel, provide a variety of analysis perspectives, and locate the lost people and transformed people, so as to narrow the scope of problems, accurately find the optimization points of operation strategy and product design, and realize fine operation.

Technology selection

Traffic logs and behavior logs in e-commerce scenarios are generally much larger than those in traditional scenarios. Therefore, funnel analysis in this context brings us two technical challenges:

Large amount of daily data: tens of millions of data are added every day. It supports flexible selection of dimensions. How to quickly conduct multidimensional analysis on the amount of 100 million data
High timeliness requirements for data analysis: how to quickly and accurately remove the duplicate based on the amount of 100 million data and obtain the number of qualified users

StarRocks and ClickHouse are widely used in orange heart, and we have accumulated rich experience, but StarRocks is better than ClickHouse in ease of use and maintainability. The following table is a simple comparison of their functions during use:

After continuous comparison and pressure test, we finally decided to use StarRocks to store the data requiring funnel analysis, because StarRocks has obvious advantages over ClickHouse in SQL monitoring and operation and maintenance, and we can create various materialized views based on funnel analysis details to meet different query scenarios and improve the speed of multidimensional data analysis.

system architecture

The responsibilities of each layer of the system are described as follows:

1. Data source: mainly the embedded point logs of the web and client, which are continuously uploaded to our data access layer

2. Data access layer:

(1) Data access bus: provide access interfaces for multiple data sources, receive and verify data, shield complex data formats for the application layer, verify and simply clean and convert the buried point log, and push the log data to Kafka cluster

(2) Kafka cluster: the intermediate layer between data access bus and data computing cluster. After the corresponding interface of the data access bus receives and verifies the data, it uniformly pushes the data to the Kafka cluster. Kafka cluster decouples the data access bus and data computing cluster, uses Kafka's own ability to realize flow control and release the pressure on downstream computing clusters and storage systems caused by excessive log data during peak hours

3. Data computing and storage tier:

(1) Data computing cluster: after the data is stored in Kafka cluster, Flink or Spark is used to perform real-time and offline ETL for the data according to different business needs, and the data is saved to StarRocks data warehouse in batches

(2) StarRocks data warehouse: Spark+Flink stores data in StarRocks through streaming data processing. We can create detail tables, aggregation tables, update tables and materialized views in StarRocks according to different business scenarios to meet the diverse data use requirements of business parties

4. Data service layer: the internal unified index definition model and index calculation logic provide a unified offline query interface and real-time query interface for each application party

5. Funnel analysis system: it supports flexible creation and editing of funnel, viewing of funnel data and export of funnel detail data

6. Data center: around the production and use scenarios of big data, provide general basic services such as metadata management, data map and job scheduling to improve the efficiency of data production and use

detailed design

At present, the bitmap type based on StarRocks can only accept integer values as input, because the user of our original table_ The ID is mixed with alphanumeric characters and cannot be directly converted into an integer. Therefore, in order to support bitmap calculation, you need to convert the current user_id is converted to a globally unique digital ID. Based on Spark+Hive, we build a global dictionary that maps the original user ID and the encoded integer user id one by one. The global dictionary itself is a Hive table. The Hive table has two columns, one is the original value and the other is the encoded Int value. The following is the construction process of the global dictionary:
1. Remove the dictionary columns of the original table and regenerate the temporary table:

Temporary table definition:

create table 'temp_table'{ 'user_id' string COMMENT 'Users after original table de duplication ID' }

Dictionary column de regeneration temporary table:

insert overwrite table temp_table select user_id from fact_log_user_hive_table group by user_id

2. left join the temporary table and the global dictionary. The suspended dictionary item is a new value. Encode the new value and insert it into the global dictionary:

Global dictionary table definition:

create table 'global_dict_by_userid_hive_table'{ 'user_id' string COMMENT 'Original user ID', 'new_user_id' int COMMENT 'For original users ID Encoded integer user ID' }

Associate the temporary table with the global dictionary table. If there is no match, it is a new user. You need to assign a new global ID and append it to the global dictionary table. The global ID is generated by using the current maximum user ID in the history table plus the row number of the new user:

--4 to update Hive Dictionary table insert overwrite global_dict_by_userid_hive_table select user_id, new_user_id from global_dict_by_userid_hive_table --3 Union with historical field data union all select t1.user_id, --2 Generate global ID: Use the current maximum user in the global dictionary table ID Add the line number of the new user (row_number() over(order by t1.user_id) + t2.max_id) as new_user_id --1 Get the new set of de duplication values from ( select user_id from temp_table where user_id is not null ) t1 left join ( select user_id, new_user_id, (max(new_user_id) over()) as max_id from global_dict_by_userid_hive_table ) t2 on t1.user_id = t2.user_id where t2.newuser_id is null

3. left join the original table and the updated global dictionary table, and insert the ID of the new user and the encoded integer user ID into the original table:

insert overwrite fact_log_user_hive_table select a.user_id, b.new_user_id from fact_log_user_hive_table a left join global_dict_by_userid_hive_table b on a.user_id=b.user_id

4. Create Spark offline synchronization task to complete data synchronization from Hive original table to StarRocks schedule: StarRocks table fact_log_user_doris_table definition (Hive table fact_log_user_hive_table is consistent with the table structure):

CREATE TABLE `fact_log_user_doris_table` ( `new_user_id` bigint(20) NULL COMMENT "Integer user id", `user_id` varchar(65533) NULL COMMENT "user id", `event_source` varchar(65533) NULL COMMENT "End (1: Mall applet 2: head applet 3: independent APP 4: Main end)", `is_new` varchar(65533) NULL COMMENT "New user", `identity` varchar(65533) NULL COMMENT "User identity(Head or ordinary user)", `biz_channel_name` varchar(65533) NULL COMMENT "Channel name of the first landing page of the day", `pro_id` varchar(65533) NULL COMMENT "province ID", `pro_name` varchar(65533) NULL COMMENT "Province name", `city_id` varchar(65533) NULL COMMENT "city ID", `city_name` varchar(65533) NULL COMMENT "City name", `dt` date NULL COMMENT "partition", `period_type` varchar(65533) NULL DEFAULT "daily" COMMENT "" ) ENGINE=OLAP DUPLICATE KEY(`index_id`, `user_id`, `biz_channel_name`, `pro_id`, `city_id`) PARTITION BY RANGE(`dt`)( PARTITION p20210731 VALUES [('2021-07-31'), ('2021-08-01')), PARTITION p20210801 VALUES [('2021-08-01'), ('2021-08-02')), PARTITION p20210802 VALUES [('2021-08-02'), ('2021-08-03')), PARTITION p20210803 VALUES [('2021-08-03'), ('2021-08-04')), PARTITION p20210804 VALUES [('2021-08-04'), ('2021-08-05')), PARTITION p20210805 VALUES [('2021-08-05'), ('2021-08-06')), PARTITION p20210806 VALUES [('2021-08-06'), ('2021-08-07')), PARTITION p20210807 VALUES [('2021-08-07'), ('2021-08-08')), PARTITION p20210808 VALUES [('2021-08-08'), ('2021-08-09'))) DISTRIBUTED BY HASH(`index_id`, `user_id`) BUCKETS 10 PROPERTIES ( "replication_num" = "3", "dynamic_partition.enable" = "true", "dynamic_partition.time_unit" = "DAY", "dynamic_partition.time_zone" = "Asia/Shanghai", "dynamic_partition.start" = "-2147483648", "dynamic_partition.end" = "1", "dynamic_partition.prefix" = "p", "dynamic_partition.replication_num" = "-1", "dynamic_partition.buckets" = "3", "in_memory" = "false", "storage_format" = "DEFAULT" );

Here, we use StarRocks' detail model to create a table to meet the user's use scenario of querying funnel detail data. On the detail table, we create corresponding materialized views according to different multi-dimensional funnel analysis and query requirements to meet the user's use scenario of selecting different dimensions to view the exact weight removal quantity of each step of the funnel model.

5. Create bitmap_union materialized view improves query speed and enables accurate de duplication of count(distinct):

Because users want to view the transformation of some urban users on the funnel model.

Queries are generally:

select city_id, count(distinct new_user_id) as countDistinctByID from fact_log_user_doris_table where `dt` >= '2021-08-01' AND `dt` <= '2021-08-07' AND `city_id` in (11, 12, 13) group by city_id

For this scenario of calculating the exact number of users according to the city, we can use the fact in the detail table_ log_ user_ doris_ Create a table with bitmap_ The materialized view of union can achieve a precise de duplication effect in advance. During query, StarRocks will automatically route the original query to the materialized view table to improve query performance. For this case, the user is grouped according to the city_ The materialized view of ID for accurate de duplication is as follows:

create materialized view city_user_count as select city_id, bitmap_union(to_bitmap(new_user_id)) from fact_log_user_doris_table group by city_id;

In StarRocks, count(distinct) aggregates the result and bitmap_union_ The results of count aggregation are completely consistent. And bitmap_union_count equals bitmap_ The result of union is count, so if count(distinct) is involved in the query, create a bitmap_ The materialized view of union aggregation can speed up the query. Because new_user_id itself is an INT type, so in StarRocks, you need to pass the field through the function to_bitmap can be converted to bitmap type before bitmap_union aggregation.

In this way, we can update the global dictionary and replace the Value column in the original table by running the Spark offline synchronization task every morning. At the same time, we configure the baseline and data quality alarm for the Spark task to ensure the normal operation of the task and the accuracy of the data, Ensure that the next day's operation and marketing students can see the impact of previous operation activities on user conversion rate, so that they can timely adjust operation strategies and ensure the effect of daily operation activities.

Final effect and benefit

Through the joint efforts of product and R & D students, we optimized the precise de duplication function from the three dimensions of the number of cities to be queried, time span and data volume. Under the 100 million level data volume, the precise de duplication query of 150 City ID S takes less than 3 seconds. The following is the final effect of funnel analysis:

Future planning

1. Improve the development of StarRocks' internal tool chain, integrate with didi big data scheduling platform and data development platform, and realize one click access to StarRocks for MySQL, ES, Hive and other data tables.
2. StarRocks integrates flow and batch construction. Because StarRocks provides rich data models, we can build a data calculation and storage model integrating flow and batch based on update model, detail model and materialized view. At present, the scheme is in the implementation stage, and will be extended to data products in all directions of orange heart after improvement.
3. Based on the capability of StarRocks On ElasticSearch, realize unified OLAP query of heterogeneous data sources, enable business requirements in different scenarios and accelerate data value output.

In the future, we will continue to pay attention to StarRocks and continuously upgrade and iterate internally. StarRocks is expected to provide richer functions and more open ecology. StarRocks will also serve as an important component of the OLAP platform to realize unified storage, analysis and management of the OLAP layer.