Stage 5, Module 1: E-commerce Offline Data Warehouse Project (Part 1)
Introduction

Notes from hands-on work on the data warehouse project.
Assignment Question 1
This assignment asks for the number of members active for the last seven consecutive days. Since the small_startlog dataset I'm using only covers three days, the walkthrough below computes three-day retention with the same approach.
Each day's log contains roughly 9,000 to 10,000 records.
So let's take a look at the steps:
- Data acquisition
A custom interceptor classifies the data and extracts the log time:
```java
public Event intercept(Event event) {
    Map<String, String> headersMap = event.getHeaders();
    // 1. Get the body of the event
    String eventBody = new String(event.getBody(), Charsets.UTF_8);
    // 2. Split the body to locate the json string
    String[] bodyArr = eventBody.split("\\s+");
    try {
        String jsonStr = bodyArr[6];
        // 3. Parse the json string to get the timestamp
        JSONObject jsonObject = JSON.parseObject(jsonStr);
        String timestampStr = jsonObject.getJSONObject("app_active").getString("time");
        // 4. Convert the timestamp to a "yyyy-MM-dd" string
        long timestamp = Long.parseLong(timestampStr);
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        Instant instant = Instant.ofEpochMilli(timestamp);
        LocalDateTime localDateTime = LocalDateTime.ofInstant(instant, ZoneId.systemDefault());
        String date = formatter.format(localDateTime);
        // 5. Put the converted date string in the header
        headersMap.put("logtime", date);
        event.setHeaders(headersMap);
    } catch (Exception e) {
        headersMap.put("logtime", "Unknown");
        event.setHeaders(headersMap);
    }
    return event;
}
```
Configure the Flume agent's conf file:
```properties
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# taildir source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /data/lagoudw/conf/startlog_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /data/lagoudw/logs/start/.*log
a1.sources.r1.headers.f1.logtype = start
a1.sources.r1.filegroups.f2 = /data/lagoudw/logs/event/.*log
a1.sources.r1.headers.f2.logtype = event

# custom interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = cn.lagou.dw.interceptor.LogTypeInterceptor$Builder

# memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 500000
a1.channels.c1.transactionCapacity = 20000

# hdfs sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/data/logs/%{logtype}/dt=%{logtime}/
a1.sinks.k1.hdfs.filePrefix = startlog.
a1.sinks.k1.hdfs.fileType = DataStream

# Roll files by size only (~130 MB, about one HDFS block)
a1.sinks.k1.hdfs.rollSize = 130000000
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.minBlockReplicas = 1

# Number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 10000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
Finally, start Flume to collect the data:
```bash
flume-ng agent --conf /opt/apps/flume-1.9/conf \
  --conf-file /data/lagoudw/conf/flume-log2hdfs4.conf \
  -name a1 -Dflume.root.logger=INFO,console
```
- Create the ODS-layer table
Create the ods_start_log table as an external table over /user/data/logs/start:
```sql
use ODS;
create external table ods.ods_start_log(
  `str` string
)
comment 'User Startup Log Information'
partitioned by (`dt` string)
location '/user/data/logs/start';
```
Add partitions to this table:
```sql
alter table ods.ods_start_log add partition(dt='2020-07-21');
alter table ods.ods_start_log add partition(dt='2020-07-22');
alter table ods.ods_start_log add partition(dt='2020-07-23');
```
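Before moving on, it's worth sanity-checking the load; a couple of quick queries (my addition, not part of the pipeline):

```sql
-- Confirm the three partitions are registered
show partitions ods.ods_start_log;

-- Spot-check one day's data; the count should be on the order of 9k-10k rows
select count(1) from ods.ods_start_log where dt = '2020-07-21';
```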
- Create the DWD-layer table and clean the data
sql=" with tmp as( select split(str, ' ')[7] line from ods.ods_start_log where dt='$do_date' ) insert overwrite table dwd.dwd_start_log partition(dt='$do_date') select get_json_object(line, '$.attr.device_id'), get_json_object(line, '$.attr.area'), get_json_object(line, '$.attr.uid'), get_json_object(line, '$.attr.app_v'), get_json_object(line, '$.attr.event_type'), get_json_object(line, '$.attr.os_type'), get_json_object(line, '$.attr.channel'), get_json_object(line, '$.attr.language'), get_json_object(line, '$.attr.brand'), get_json_object(line, '$.app_active.json.entry'), get_json_object(line, '$.app_active.json.action'), get_json_object(line, '$.app_active.json.error_code') from tmp; " hive -e "$sql"
dwd_start_log holds the detailed daily member startup records that feed the subsequent DWS-layer calculations.
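The notes never show the dwd_start_log DDL, so here is a minimal sketch, assuming column names and order matching the insert above (the real course table may differ):

```sql
-- Hypothetical DDL for dwd.dwd_start_log; columns follow the insert statement above
drop table if exists dwd.dwd_start_log;
create table dwd.dwd_start_log (
  `device_id` string,
  `area` string,
  `uid` string,
  `app_v` string,
  `event_type` string,
  `os_type` string,
  `channel` string,
  `language` string,
  `brand` string,
  `entry` string,
  `action` string,
  `error_code` string
)
partitioned by (`dt` string)
stored as parquet;
```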
- Calculate active members
From dwd_start_log, we can calculate daily active members at the DWS layer:
```sql
insert overwrite table dws.dws_member_start_day
partition(dt='$do_date')
select device_id,
       concat_ws('|', collect_set(uid)),
       concat_ws('|', collect_set(app_v)),
       concat_ws('|', collect_set(os_type)),
       concat_ws('|', collect_set(language)),
       concat_ws('|', collect_set(channel)),
       concat_ws('|', collect_set(area)),
       concat_ws('|', collect_set(brand))
from dwd.dwd_start_log
where dt='$do_date'
group by device_id;
```
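Since dws_member_start_day keeps one row per device_id per day, the daily active member count is simply a row count. A quick check of my own, not from the original notes:

```sql
-- Daily active members = distinct devices that started the app that day
select dt, count(*) as active_cnt
from dws.dws_member_start_day
where dt between '2020-07-21' and '2020-07-23'
group by dt;
```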
- Calculate daily new members
At the DWS layer, derive the daily new members from the daily active members:
```sql
insert into table dws.dws_member_add_day
select t1.device_id,
       t1.uid,
       t1.app_v,
       t1.os_type,
       t1.language,
       t1.channel,
       t1.area,
       t1.brand,
       '$do_date'
from dws.dws_member_start_day t1
left join dws.dws_member_add_day t2 on t1.device_id = t2.device_id
where t1.dt = '$do_date'
  and t2.device_id is null;
-- If a device_id in t1 has no match in t2, the user was active today but was
-- never recorded as added before, so they are new that day; insert them into
-- the new-member table.

-- Now we can count the number of new members added per day:
insert overwrite table ads.ads_new_member_cnt
partition(dt='$do_date')
select count(1)
from dws.dws_member_add_day
where dt = '$do_date';
```
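Reading the counts back (the cnt column name is assumed from the output shown next):

```sql
select cnt, dt
from ads.ads_new_member_cnt;
```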
Running this query, we get the following table:
| cnt (new member count) | dt (registration date) |
| --- | --- |
| 9972 | 2020-07-21 |
| 5027 | 2020-07-22 |
| 4000 | 2020-07-23 |

- Calculate members retained for 3 consecutive days
We need the number of users who logged in on three consecutive days: 7.21, 7.22, and 7.23. In other words, users who registered on 7.21 and then logged in on both 7.22 and 7.23. We can find the following two groups of users and intersect them:
- 7.21 Registration and 7.22 Login
- 7.21 Registration and 7.23 Login
```sql
-- 6.1 Create the table holding details of members who logged in for n consecutive days
drop table if exists dws.dws_member_retention_nday;
create table dws.dws_member_retention_nday
(
  `device_id` string,
  `uid` string,
  `app_v` string,
  `os_type` string,
  `language` string,
  `channel` string,
  `area` string,
  `brand` string,
  `add_date` string comment 'Date the member was added',
  `retention_date` int comment 'Number of consecutive retention days'
) comment 'Daily member retention details'
partitioned by (`dt` string)
stored as parquet;

-- 6.2 Insert data (only users who logged in for three consecutive days)
insert overwrite table dws.dws_member_retention_nday
partition(dt='2020-07-23')
(
  -- Registered on 7.21 and logged in on 7.23 ...
  select t2.device_id,
         t2.uid,
         t2.app_v,
         t2.os_type,
         t2.language,
         t2.channel,
         t2.area,
         t2.brand,
         t2.dt add_date,
         3 retention_nday
  from dws.dws_member_start_day t1
  join dws.dws_member_add_day t2 on t1.device_id = t2.device_id
  where t2.dt = date_add('2020-07-23', -2)
    and t1.dt = '2020-07-23'
    and t2.device_id in
    (
      -- ... intersected with: registered on 7.21 and logged in on 7.22
      select t2.device_id
      from dws.dws_member_start_day t1
      join dws.dws_member_add_day t2 on t1.device_id = t2.device_id
      where t2.dt = date_add('2020-07-23', -2)
        and t1.dt = '2020-07-22'
    )
);

-- Next, move to the ads layer for the summary calculation
-- 6.3 Create the table counting members who logged in for n consecutive days
drop table if exists ads.ads_member_retention_nday_count;
create table ads.ads_member_retention_nday_count
(
  `add_date` string comment 'Date added',
  `retention_nday` int comment 'Consecutive active days up to the current date',
  `retention_count` bigint comment 'Number of retained members'
) comment 'Member retention counts'
partitioned by (dt string)
row format delimited fields terminated by ',';

-- 6.4 Insert data from dws_member_retention_nday
insert overwrite table ads.ads_member_retention_nday_count
partition(dt='2020-07-23')
select add_date,
       retention_date,
       count(*) retention_count
from dws.dws_member_retention_nday
where dt = '2020-07-23'
group by add_date, retention_date;

-- 6.5 Create the n-consecutive-day retention rate table
drop table if exists ads.ads_member_retention_nday_rate;
create table ads.ads_member_retention_nday_rate
(
  `add_date` string comment 'Date added',
  `retention_nday` int comment 'Consecutive active days up to the current date',
  `retention_count` bigint comment 'Number of retained members',
  `new_mid_count` bigint comment 'Members added on that day',
  `retention_ratio` decimal(10,2) comment 'Retention rate (%)'
) comment 'Member retention rate'
partitioned by (dt string)
row format delimited fields terminated by ',';

-- 6.6 Insert data
insert overwrite table ads.ads_member_retention_nday_rate
partition(dt='2020-07-23')
(
  select t1.add_date,
         t1.retention_nday,
         t1.retention_count,
         t2.cnt,
         t1.retention_count / t2.cnt * 100
  from ads.ads_member_retention_nday_count t1
  join ads.ads_new_member_cnt t2 on t1.add_date = t2.dt
  where t1.dt = '2020-07-23'
);
```
From ads_member_retention_nday_rate we get the number of members retained for three consecutive days: 1973.
Since 9972 members registered on 7.21, the retention rate should be around 20% (1973 / 9972 ≈ 19.8%). View the ads_member_retention_nday_rate table:
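A simple way to view it (column list taken from the DDL above):

```sql
select add_date,
       retention_nday,
       retention_count,
       new_mid_count,
       retention_ratio
from ads.ads_member_retention_nday_rate
where dt = '2020-07-23';
```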
The result checks out, and that completes the first question.
Assignment Question 2
Where can I optimize?
Looking at the raw data, the header portion of each log line carries no useful information at all. We can keep only the JSON body from the second half of the line, which reduces the storage and parsing pressure on the ODS table.
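For instance, if the interceptor were extended to replace the event body with just the JSON string (an assumed change; the notes stop before implementing it), the ODS str column would already be pure JSON and the DWD step could drop the split entirely. A sketch under that assumption:

```sql
-- Assumes the interceptor already stripped the header, so str is pure JSON
insert overwrite table dwd.dwd_start_log
partition(dt='$do_date')
select get_json_object(str, '$.attr.device_id'),
       get_json_object(str, '$.attr.area'),
       get_json_object(str, '$.attr.uid'),
       get_json_object(str, '$.attr.app_v'),
       get_json_object(str, '$.attr.event_type'),
       get_json_object(str, '$.attr.os_type'),
       get_json_object(str, '$.attr.channel'),
       get_json_object(str, '$.attr.language'),
       get_json_object(str, '$.attr.brand'),
       get_json_object(str, '$.app_active.json.entry'),
       get_json_object(str, '$.app_active.json.action'),
       get_json_object(str, '$.app_active.json.error_code')
from ods.ods_start_log
where dt='$do_date';
```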