Five common time crossing and continuous login problems in HIVE

This article is intended to record learning experience and facilitate communication. The blogger's level is limited. If ...

This article is intended to record learning experience and facilitate communication. The blogger's level is limited. If there are any errors in the article, you are welcome to point out. The blogger will be grateful. (Note: please indicate the source for reprint and quotation)

Continuous questions:

The following data are the carbon emission reduction received by users in ant forest. Find out the users who have reduced carbon emission by more than 100 for 3 consecutive days or more.

id dt lowcarbon 1001 2021-12-12 123 1002 2021-12-12 45 1001 2021-12-13 43 1001 2021-12-13 45 1001 2021-12-13 23 1002 2021-12-14 45 1001 2021-12-14 230 1002 2021-12-15 45 1001 2021-12-15 23 ... ...

Idea ·: 1 group and aggregate according to user id and dt, eliminate users with emission less than 100, 2 sort users, 3 subtract the difference of ranking by dt, 4 group and count the number of users with the same difference according to the difference, and reserve users with more than 3

select id, flag, count(*) ct from ( select id, dt, lowcarbon, date_sub(dt,user_order) flag from ( select id, dt, lowcarbon, rank() over(partition by id orderby dt) user_order from ( select id, dt, sum(lowcarbon) lowcarbon from test1 group by id,dt having lowcarvon>100 )t1 )t2 )t3 group by id,flag having ct>-3

Function: date_sub() function: Function direct
Continuously log in to multiple solution articles directly: Hive sql active sql queries for 3 consecutive days in seven days

Grouping problem
The following is the access time data of e-commerce company users

id ts(second) 1001 17523641234 1001 17523641256 1002 17523641278 1001 17523641334 1002 17523641434 1001 17523641534 1001 17523641544 1002 17523641634 1001 17523641638 1001 17523641654

If the time interval of a user's continuous access records is less than 60 seconds, they are divided into the same group. The results are as follows:

id ts(second) group 1001 17523641234 1 1001 17523641256 1 1001 17523641334 2 1001 17523641534 3 1001 17523641544 3 1001 17523641638 4 1001 17523641654 4 1002 17523641278 1 1002 17523641434 2 1002 17523641634 3

Idea: 1. Use the lag function to move the previous row of data down. 2. Subtract the last time from the current time to calculate the difference. 2. Count the number of users whose difference is greater than 60, that is, the number of user logins

select id, ts, lag(ts,1,0) over(partition by id order by ts) lagts from test2 // Subtract the last time from this time to calculate the difference select id, ts, ts-lagts tsdiff from ( select id, ts, lag(ts,1,0) over(partition by id order by ts) lagts from test2 )t1 // Count the number of users whose time difference is greater than 60, that is, the number of user login times select id, ts, sum(if(tsdiff>60,1,0)) over(partition by id order by ts) group_id from( select id, ts, ts-lagts tsdiff from ( select id, ts, lag(ts,1,0) over(partition by id order by ts) lagts from test2 )t1 )t2

Interval continuity problem
According to the daily login data of users recorded by a game company, the maximum continuous login days of each user can be calculated, with an interval of one day. Explanation: if a user logs in to the game on 1, 3, 5 and 6, it is regarded as logging in for 6 consecutive days.

id dt 1001 2021-12-12 1002 2021-12-12 1001 2021-12-13 1001 2021-12-14 1001 2021-12-16 1002 2021-12-16 1001 2021-12-19 1002 2021-12-17 1001 2021-12-20

Idea: 1 move the up data down 2 calculate the difference between the current login time and the last login time datadiff function 3 group according to the user and sort according to the time to calculate the total number of data in the current line greater than 2 4 group according to the user and time to calculate the maximum time minus the minimum time and add 1 5 to get the maximum value of consecutive login days

select id, max(days)+1 from( select id, flag, datadiff(max(dt),main(dt)) days from( select id, dt, sum(if(flag>2,1,0)) over(partition bu id order by dt) flag from( select id, dt, datediff(dt,lagdt) flag from( select id, dt, lag(dt,1,'1970-01-01') over(partition by id orser by dt) lagdt from test3 )t1 )t2 )t3 group by id,flag )t4 group by id

Discount date cross problem
The following is the platform commodity promotion data: the fields are brand, discount start date and discount end date. Calculate the total discount sales days of each brand. Note the cross dates, such as vivo brand. The first activity time is 2021-06-05 to 2021-06-15, and the second activity time is 2021-06-09 to 2021-06-21. The 9th to 15th are the repetition days, which are counted only once, That is, the total discount days of vivo are 17 days from 2021-06-05 to 2021-06-21.

brand stt edt oppo 2021-06-05 2021-06-09 oppo 2021-06-11 2021-06-21 vivo 2021-06-05 2021-06-15 vivo 2021-06-09 2021-06-21 redmi 2021-06-05 2021-06-21 redmi 2021-06-09 2021-06-15 redmi 2021-06-17 2021-06-26 huawei 2021-06-05 2021-06-26 huawei 2021-06-09 2021-06-15 huawei 2021-06-17 2021-06-21

Idea: 1. Get the maximum end time of the previous data of the current line and put it on the current line. 2. Compare the start time with the data moved down. If the start time is large, there is no need to operate. On the contrary, the data moved down needs to be added to replace the start time of the current line. The data in the first line does not need to be replaced. 3. Calculate the difference between the start time and end time. 4. Group and calculate each real data according to the brand Sum of plus 1

select id, sum(days>0,days+1,0) days from ( select id, datediff(edt,stt) days from( select id, if(maxEdt is null,stt,if(stt>maxEdt,stt,data_add(maxEdt,1))) stt, edt from ( select id, stt, edt, max(edt) over(partition by id order by stt rows betwwen UNBOUNDED PRECEDING and 1PRECEDING) maxEdt from test4 )t1 )t2 )t3 group by id

Simultaneous online problems
The following is the opening and closing time of the anchor of a live broadcast platform. According to the data, the number of anchors online at the peak of the platform is calculated.

id stt edt 1001 2021-06-14 12:12:12 2021-06-14 18:12:12 1003 2021-06-14 13:12:12 2021-06-14 16:12:12 1004 2021-06-14 13:15:12 2021-06-14 20:12:12 1002 2021-06-14 15:12:12 2021-06-14 16:12:12 1005 2021-06-14 15:18:12 2021-06-14 20:12:12 1001 2021-06-14 20:12:12 2021-06-14 23:12:12 1006 2021-06-14 21:12:12 2021-06-14 23:15:12 1007 2021-06-14 22:12:12 2021-06-14 23:10:12 ... ...

Idea: 1 classify the data, add positive 1 after the start data, add - 1 after the end time, 2 calculate the cumulative number of online people according to the time sequence, 3 find out the maximum number of online people at the same time

select max(sum_p) from( select id, dt, sum(p) over(order by dt) sum_p from ( select id, stt dt, 1 p from test5 union select id, edt dt, -1 p from test5 )t1 )t2

The result after the first step is

select id,stt dt,1 p from test5 union select id,edt dt,-1 p from test5;t1
1001 2021-06-14 12:12:12 1 1001 2021-06-14 18:12:12 -1 1001 2021-06-14 20:12:12 1 1001 2021-06-14 23:12:12 -1 1002 2021-06-14 15:12:12 1 1002 2021-06-14 16:12:12 -1 1003 2021-06-14 13:12:12 1 1003 2021-06-14 16:12:12 -1 1004 2021-06-14 13:15:12 1 1004 2021-06-14 20:12:12 -1 1005 2021-06-14 15:18:12 1 1005 2021-06-14 20:12:12 -1 1006 2021-06-14 21:12:12 1 1006 2021-06-14 23:15:12 -1 1007 2021-06-14 22:12:12 1 1007 2021-06-14 23:10:12 -1

27 September 2021, 03:55 | Views: 6528

Add new comment

For adding a comment, please log in
or create account

0 comments