stage-5 mode-1: E-commerce offline warehouse project


Notes on the hands-on work for the offline data warehouse project

Job Question 1

This assignment asks for the number of members active over the last seven consecutive days. Since I'm using the small_startlog dataset, which covers only three days of logs, the retention window below is three days instead:

The log for one day contains about 9,000 to 10,000 records.

So let's take a look at the steps:

  1. Data acquisition

    A custom Flume interceptor classifies each event and extracts the log date:
        public Event intercept(Event event) {
            Map<String, String> headersMap = event.getHeaders();
            // 1. Get the body of the event
            String eventBody = new String(event.getBody(), Charsets.UTF_8);
            // 2. Split the body to get the json string
            String[] bodyArr = eventBody.split("\\s+");
            try {
                String jsonStr = bodyArr[6];
                // 3. Parse the json string to get the timestamp
                JSONObject jsonObject = JSON.parseObject(jsonStr);
                String timestampStr = jsonObject.getJSONObject("app_active").getString("time");
                // 4. Convert the timestamp to a "yyyy-MM-dd" string
                long timestamp = Long.parseLong(timestampStr);
                DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
                Instant instant = Instant.ofEpochMilli(timestamp);
                LocalDateTime localDateTime = LocalDateTime.ofInstant(instant, ZoneId.systemDefault());
                String date = formatter.format(localDateTime);
                // 5. Place the converted string in the header
                headersMap.put("logtime", date);
            } catch (Exception e) {
                headersMap.put("logtime", "Unknown");
            }
            return event;
        }
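The timestamp-to-date conversion above is easy to get wrong (the pattern must be yyyy-MM-dd, with four y's). A minimal stdlib-only sketch for checking that logic outside Flume, with a made-up timestamp:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class LogTimeCheck {
    // Convert an epoch-millisecond timestamp to "yyyy-MM-dd", as the interceptor does
    static String toLogDate(long timestampMillis, ZoneId zone) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        LocalDateTime ldt = LocalDateTime.ofInstant(Instant.ofEpochMilli(timestampMillis), zone);
        return formatter.format(ldt);
    }

    public static void main(String[] args) {
        // 1595332800000 ms = 2020-07-21 12:00:00 UTC
        System.out.println(toLogDate(1595332800000L, ZoneId.of("UTC")));  // prints 2020-07-21
    }
}
```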

    Configure the conf file used by flume:

    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # taildir source
    a1.sources.r1.type = TAILDIR
    a1.sources.r1.positionFile = /data/lagoudw/conf/startlog_position.json
    a1.sources.r1.filegroups = f1 f2
    a1.sources.r1.filegroups.f1 = /data/lagoudw/logs/start/.*log
    a1.sources.r1.headers.f1.logtype = start
    a1.sources.r1.filegroups.f2 = /data/lagoudw/logs/event/.*log
    a1.sources.r1.headers.f2.logtype = event
    # custom interceptor
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = cn.lagou.dw.interceptor.LogTypeInterceptor$Builder
    # memorychannel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 500000
    a1.channels.c1.transactionCapacity = 20000
    # hdfs sink
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /user/data/logs/%{logtype}/dt=%{logtime}/
    a1.sinks.k1.hdfs.filePrefix = startlog.
    a1.sinks.k1.hdfs.fileType = DataStream
    # File rolling (roll at ~128 MB, close to the HDFS block size)
    a1.sinks.k1.hdfs.rollSize = 130000000
    a1.sinks.k1.hdfs.rollCount = 0
    a1.sinks.k1.hdfs.rollInterval = 0
    a1.sinks.k1.hdfs.idleTimeout = 0
    a1.sinks.k1.hdfs.minBlockReplicas = 1
    # Number of events flushed to hdfs per batch
    a1.sinks.k1.hdfs.batchSize = 10000
    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

    Finally, flume is used for data collection:

    flume-ng agent --conf /opt/apps/flume-1.9/conf --conf-file /data/lagoudw/conf/flume-log2hdfs4.conf --name a1 -Dflume.root.logger=INFO,console
  2. Create ods layer table

    Create the ods start log table, pointing it at the data under /user/data/logs/start:

    use ODS;
    create external table ods.ods_start_log(
    `str` string)
    comment 'User Startup Log Information'
    partitioned by (`dt` string)
    location '/user/data/logs/start';

    Add partitions to this table:

    alter table ods.ods_start_log add partition(dt='2020-07-21');
    alter table ods.ods_start_log add partition(dt='2020-07-22');
    alter table ods.ods_start_log add partition(dt='2020-07-23');
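Adding partitions one day at a time gets tedious as the date range grows. A small shell sketch (assuming GNU date) that generates the ALTER statements for a range, which could then be piped into hive -e or a script file:

```shell
#!/bin/bash
# Generate an ADD PARTITION statement for every day in a range (GNU date assumed)
start=2020-07-21
end=2020-07-23
d="$start"
while [ "$(date -d "$d" +%s)" -le "$(date -d "$end" +%s)" ]; do
    echo "alter table ods.ods_start_log add partition(dt='$d');"
    d=$(date -d "$d + 1 day" +%F)
done
```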
  3. Create dwd layer tables and clean data
    The cleaning step runs from a shell script: the SQL is built into $sql (assuming $do_date is set earlier in the script) and executed with hive -e:

    sql="
    with tmp as(
    select split(str, ' ')[7] line
    from ods.ods_start_log
    where dt='$do_date'
    )
    insert overwrite table dwd.dwd_start_log
    select get_json_object(line, '$.attr.device_id'),
    get_json_object(line, '$.attr.area'),
    get_json_object(line, '$.attr.uid'),
    get_json_object(line, '$.attr.app_v'),
    get_json_object(line, '$.attr.event_type'),
    get_json_object(line, '$.attr.os_type'),
    get_json_object(line, '$.attr.channel'),
    get_json_object(line, '$.attr.language'),
    get_json_object(line, '$.attr.brand'),
    get_json_object(line, '$.app_active.json.entry'),
    get_json_object(line, '$.app_active.json.action'),
    get_json_object(line, '$.app_active.json.error_code')
    from tmp;"
    hive -e "$sql"

    dwd_start_log holds the daily member startup details, in preparation for the subsequent dws layer.

  4. Calculate active members

    Via dwd_start_log, we can calculate daily active members at the dws layer:

    insert overwrite table dws.dws_member_start_day
    select device_id,
    concat_ws('|', collect_set(uid)),
    concat_ws('|', collect_set(app_v)),
    concat_ws('|', collect_set(os_type)),
    concat_ws('|', collect_set(language)),
    concat_ws('|', collect_set(channel)),
    concat_ws('|', collect_set(area)),
    concat_ws('|', collect_set(brand))
    from dwd.dwd_start_log
    where dt='$do_date'
    group by device_id;
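What collect_set plus concat_ws does here is deduplicate the values seen for each device_id during the day and join them with '|'. A rough stdlib-only sketch of the same idea in Java, with invented sample rows (note that Hive's collect_set does not guarantee element order, unlike the insertion-ordered set used below):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CollectSetDemo {
    // Group (device_id, value) rows by device_id and join the distinct values
    // with '|', mimicking concat_ws('|', collect_set(col)) in Hive
    static Map<String, String> collectSet(List<String[]> rows) {
        Map<String, LinkedHashSet<String>> grouped = new TreeMap<>();
        for (String[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new LinkedHashSet<>()).add(row[1]);
        }
        Map<String, String> out = new TreeMap<>();
        grouped.forEach((k, v) -> out.put(k, String.join("|", v)));
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[]{"d1", "1.1.0"},
            new String[]{"d1", "1.1.0"},   // duplicate, dropped by the set
            new String[]{"d1", "1.2.0"},
            new String[]{"d2", "1.1.0"});
        System.out.println(collectSet(rows));  // prints {d1=1.1.0|1.2.0, d2=1.1.0}
    }
}
```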
  5. Calculate Daily New Members

    Derive the dws-layer daily new members from the daily active members:

    insert into table dws.dws_member_add_day
    select t1.device_id,
    t1.uid,
    t1.app_v,
    t1.os_type,
    t1.language,
    t1.channel,
    t1.area,
    t1.brand,
    '$do_date'
    from dws.dws_member_start_day t1 left join
    dws.dws_member_add_day t2
    on t1.device_id=t2.device_id
    where t1.dt='$do_date'
    and t2.device_id is null;
    -- If a device_id from t1 does not appear in t2, that user is new on this day, so we add it to the new-member table
    -- Now we can calculate the number of new members added per day:
    insert overwrite table ads.ads_new_member_cnt
    partition (dt='$do_date')
    select count(1)
    from dws.dws_member_add_day
    where dt = '$do_date';

    Now that we have the number of new members, the result table looks like this:

    cnt (new member count) | dt (registration date)
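The LEFT JOIN ... WHERE t2.device_id IS NULL pattern used for new members is an anti-join, i.e. a set difference: today's active devices minus the devices already known. Sketched in Java with made-up ids:

```java
import java.util.Set;
import java.util.TreeSet;

public class NewMemberDemo {
    // New members = active today but not yet in the cumulative member table,
    // like the LEFT JOIN ... WHERE t2.device_id IS NULL anti-join
    static Set<String> newMembers(Set<String> activeToday, Set<String> knownMembers) {
        Set<String> result = new TreeSet<>(activeToday);
        result.removeAll(knownMembers);
        return result;
    }

    public static void main(String[] args) {
        Set<String> active = Set.of("d1", "d2", "d3");
        Set<String> known = Set.of("d1");
        System.out.println(newMembers(active, known));  // prints [d2, d3]
    }
}
```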
  6. Calculate Retained Members for 3 consecutive days

    We need the number of users who logged in for three consecutive days, i.e. on 7.21, 7.22 and 7.23. Such a user registered on 7.21 and then logged in on both 7.22 and 7.23, so we can find the following two groups of users and intersect them:

    1. registered on 7.21 and logged in on 7.22
    2. registered on 7.21 and logged in on 7.23
    -- 6.1 Create the detail table for members logged in for n consecutive days
    drop table if exists dws.dws_member_retention_nday;
    create table dws.dws_member_retention_nday(
    `device_id` string,
    `uid` string,
    `app_v` string,
    `os_type` string,
    `language` string,
    `channel` string,
    `area` string,
    `brand` string,
    `add_date` string comment 'Member add date',
    `retention_date` int comment 'Consecutive retention days'
    )COMMENT 'Daily Membership Retention Details'
    PARTITIONED BY (`dt` string)
    stored as parquet;
    -- 6.2 Insert data (only users who logged in for three consecutive days)
    insert overwrite table dws.dws_member_retention_nday
    partition (dt='2020-07-23')
    select  t2.device_id,
            t2.uid,
            t2.app_v,
            t2.os_type,
            t2.language,
            t2.channel,
            t2.area,
            t2.brand,
            t2.dt add_date,
            3 retention_date
            from dws.dws_member_start_day t1 join dws.dws_member_add_day
            t2 on t1.device_id=t2.device_id
            where t2.dt=date_add('2020-07-23', -2)
            and t1.dt='2020-07-23'
            and t2.device_id in (
                select  t2.device_id
                from dws.dws_member_start_day t1 join dws.dws_member_add_day
                t2 on t1.device_id=t2.device_id
                where t2.dt=date_add('2020-07-23', -2)
                and t1.dt='2020-07-22'
            );
    -- Next, move to the ads layer for the summary calculation
    -- 6.3 Create the count table for members logged in for n consecutive days
    drop table if exists ads.ads_member_retention_nday_count;
    create table ads.ads_member_retention_nday_count(
    `add_date` string comment 'Add Date',
    `retention_nday` int comment 'Consecutive active days up to the current date',
    `retention_count` bigint comment 'Retention count'
    ) COMMENT 'Membership retention count'
    partitioned by(dt string)
    row format delimited fields terminated by ',';
    -- 6.4 Insert data from dws_member_retention_nday
    insert overwrite table ads.ads_member_retention_nday_count
    partition (dt='2020-07-23')
    select add_date, retention_date,
    	   count(*) retention_count
    from dws.dws_member_retention_nday
    where dt='2020-07-23'
    group by add_date, retention_date;
    -- 6.5 Retention rate for n consecutive days
    drop table if exists ads.ads_member_retention_nday_rate;
    create table ads.ads_member_retention_nday_rate(
    `add_date` string comment 'Add Date',
    `retention_nday` int comment 'Consecutive active days up to the current date',
    `retention_count` bigint comment 'Retention count',
    `new_mid_count` bigint comment 'Members added on the day',
    `retention_ratio` decimal(10,2) comment 'Retention rate'
     ) COMMENT 'Membership retention rate'
    partitioned by(dt string)
    row format delimited fields terminated by ',';
    -- 6.6 Insert data
    insert overwrite table ads.ads_member_retention_nday_rate
    partition (dt='2020-07-23')
    select  t1.add_date,
            t1.retention_nday,
            t1.retention_count,
            t2.cnt,
            t1.retention_count/t2.cnt
    from ads.ads_member_retention_nday_count t1 join
    ads.ads_new_member_cnt t2 on t1.add_date=t2.dt
    where t1.dt='2020-07-23';
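The nested query in step 6.2 boils down to intersecting three sets: devices registered on 7.21, devices active on 7.22, and devices active on 7.23. A Java sketch with invented ids:

```java
import java.util.Set;
import java.util.TreeSet;

public class RetentionDemo {
    // Retained for 3 consecutive days = registered on day 1
    // and active on both day 2 and day 3
    static Set<String> retained3Days(Set<String> registeredDay1,
                                     Set<String> activeDay2,
                                     Set<String> activeDay3) {
        Set<String> result = new TreeSet<>(registeredDay1);
        result.retainAll(activeDay2);
        result.retainAll(activeDay3);
        return result;
    }

    public static void main(String[] args) {
        Set<String> reg = Set.of("d1", "d2", "d3");
        Set<String> day2 = Set.of("d1", "d3", "d4");
        Set<String> day3 = Set.of("d1", "d2", "d4");
        System.out.println(retained3Days(reg, day2, day3));  // prints [d1]
    }
}
```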

    Through table ads_member_retention_nday_rate we get the number of members retained for three consecutive days: 1973.

The number of members who logged in on 7.21 is 9972, so the retention rate should be around 20% (1973 / 9972 ≈ 19.8%). Viewing table ads_member_retention_nday_rate:

The result is correct, the first question is finished.

Job Question 2

Where can I optimize?

Looking at the raw data, I found that part of each line, the header, is not useful at all. We can keep only the body in the second half of the line to reduce the pressure on the ods table.
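One way to sketch this: in the interceptor, after locating the JSON field, set the event body to just that field so only the JSON reaches HDFS. The trimming logic, stdlib-only, with a made-up log line (the field index 6 mirrors the split used in the interceptor; adjust it to the real log layout):

```java
public class BodyTrimDemo {
    // Keep only the JSON payload of a raw log line, dropping the header fields,
    // so the ods table stores less data (index 6 matches the interceptor's split)
    static String extractJsonBody(String rawLine) {
        String[] parts = rawLine.split("\\s+");
        return parts.length > 6 ? parts[6] : rawLine;  // fall back to the raw line
    }

    public static void main(String[] args) {
        // hypothetical raw line: header fields followed by the json payload
        String raw = "2020-07-21 12:00:00.000 [main] INFO app - {\"app_active\":{\"time\":\"1595332800000\"}}";
        System.out.println(extractJsonBody(raw));
    }
}
```

In the real interceptor this would end with something like event.setBody(jsonStr.getBytes(Charsets.UTF_8)) before returning the event.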

Tags: Big Data

Posted on Sat, 04 Dec 2021 12:34:43 -0500 by elentz