Five common time crossing and continuous login problems in HIVE

This article is intended to record learning experience and facilitate communication. The blogger's level is limited. If there are any errors in the article, you are welcome to point out. The blogger will be grateful. (Note: please indicate the source for reprint and quotation)

Continuous questions:

The following data are the carbon emission reduction received by users in ant forest. Find out the users who have reduced carbon emission by more than 100 for 3 consecutive days or more.

id		dt				lowcarbon
1001	2021-12-12		123
1002	2021-12-12		45
1001	2021-12-13		43
1001	2021-12-13		45
1001	2021-12-13		23
1002	2021-12-14		45
1001	2021-12-14		230
1002	2021-12-15		45
1001	2021-12-15		23
... ...

Idea ·: 1 group and aggregate according to user id and dt, eliminate users with emission less than 100, 2 sort users, 3 subtract the difference of ranking by dt, 4 group and count the number of users with the same difference according to the difference, and reserve users with more than 3

select
 	id,
 	flag,
 	count(*) ct
from  (
	select
	  id,
	  dt,
	  lowcarbon,
	  date_sub(dt,user_order)  flag
	from (
		select
		  id,
		  dt,
		  lowcarbon,
		  rank() over(partition by id orderby dt) user_order
		from (
			select
			  id,
			  dt,
			  sum(lowcarbon) lowcarbon
			from test1
			group by id,dt
			having lowcarvon>100
		)t1
	)t2
)t3
group by id,flag
having ct>-3

Function: date_sub() function: Function direct
Continuously log in to multiple solution articles directly: Hive sql active sql queries for 3 consecutive days in seven days

Grouping problem
The following is the access time data of e-commerce company users

id		ts(second)
1001	17523641234
1001	17523641256
1002	17523641278
1001	17523641334
1002	17523641434
1001	17523641534
1001	17523641544
1002	17523641634
1001	17523641638
1001	17523641654

If the time interval of a user's continuous access records is less than 60 seconds, they are divided into the same group. The results are as follows:

id		ts(second)			group
1001	17523641234		1
1001	17523641256		1
1001	17523641334		2
1001	17523641534		3
1001	17523641544		3
1001	17523641638		4
1001	17523641654		4
1002	17523641278		1
1002	17523641434		2
1002	17523641634		3

Idea: 1. Use the lag function to move the previous row of data down. 2. Subtract the last time from the current time to calculate the difference. 2. Count the number of users whose difference is greater than 60, that is, the number of user logins

select
	id,
	ts,
	lag(ts,1,0) over(partition by id order by ts) lagts
from test2

// Subtract the last time from this time to calculate the difference

select
	id,
	ts,
	ts-lagts tsdiff
from (
	select
		id,
		ts,
		lag(ts,1,0) over(partition by id order by ts) lagts
	from test2
)t1
//  Count the number of users whose time difference is greater than 60, that is, the number of user login times
select
	id,
	ts,
	sum(if(tsdiff>60,1,0)) over(partition by id order by ts) group_id
from(
	select
		id,
		ts,
		ts-lagts tsdiff
	from (
		select
			id,
			ts,
			lag(ts,1,0) over(partition by id order by ts) lagts
		from test2
	)t1
)t2

Interval continuity problem
According to the daily login data of users recorded by a game company, the maximum continuous login days of each user can be calculated, with an interval of one day. Explanation: if a user logs in to the game on 1, 3, 5 and 6, it is regarded as logging in for 6 consecutive days.

id		dt
1001	2021-12-12
1002	2021-12-12
1001	2021-12-13
1001	2021-12-14
1001	2021-12-16
1002	2021-12-16
1001	2021-12-19
1002	2021-12-17
1001	2021-12-20

Idea: 1 move the up data down 2 calculate the difference between the current login time and the last login time datadiff function 3 group according to the user and sort according to the time to calculate the total number of data in the current line greater than 2 4 group according to the user and time to calculate the maximum time minus the minimum time and add 1 5 to get the maximum value of consecutive login days

select 
	id,
	max(days)+1

from(
	select
		id,
		flag,
		datadiff(max(dt),main(dt)) days

	from(
		select
			id,
			dt,
			sum(if(flag>2,1,0)) over(partition bu id order by dt) flag

		from(
			select
				id,
				dt,
				datediff(dt,lagdt) flag

			from(
				select
					id,
					dt,
					lag(dt,1,'1970-01-01') over(partition by id orser by dt) lagdt

				from test3
			)t1
		)t2
	)t3
	group by id,flag

)t4
group by id

Discount date cross problem
The following is the platform commodity promotion data: the fields are brand, discount start date and discount end date. Calculate the total discount sales days of each brand. Note the cross dates, such as vivo brand. The first activity time is 2021-06-05 to 2021-06-15, and the second activity time is 2021-06-09 to 2021-06-21. The 9th to 15th are the repetition days, which are counted only once, That is, the total discount days of vivo are 17 days from 2021-06-05 to 2021-06-21.

brand	stt			edt
oppo	2021-06-05	2021-06-09
oppo	2021-06-11	2021-06-21
vivo	2021-06-05	2021-06-15
vivo	2021-06-09	2021-06-21
redmi	2021-06-05	2021-06-21
redmi	2021-06-09	2021-06-15
redmi	2021-06-17	2021-06-26
huawei	2021-06-05	2021-06-26
huawei	2021-06-09	2021-06-15
huawei	2021-06-17	2021-06-21

Idea: 1. Get the maximum end time of the previous data of the current line and put it on the current line. 2. Compare the start time with the data moved down. If the start time is large, there is no need to operate. On the contrary, the data moved down needs to be added to replace the start time of the current line. The data in the first line does not need to be replaced. 3. Calculate the difference between the start time and end time. 4. Group and calculate each real data according to the brand Sum of plus 1

select
	id,
	sum(days>0,days+1,0) days
from
(
	select
		id,
		datediff(edt,stt) days
	from(
		select
			id,
			if(maxEdt is null,stt,if(stt>maxEdt,stt,data_add(maxEdt,1))) stt,
			edt
		from
		(
			select
				id,
				stt,
				edt,
				max(edt) over(partition by id order by stt rows betwwen UNBOUNDED PRECEDING and 1PRECEDING) maxEdt
			from test4
		)t1
	)t2
)t3
group by id

Simultaneous online problems
The following is the opening and closing time of the anchor of a live broadcast platform. According to the data, the number of anchors online at the peak of the platform is calculated.

id		stt						edt
1001	2021-06-14 12:12:12		2021-06-14 18:12:12
1003	2021-06-14 13:12:12		2021-06-14 16:12:12
1004	2021-06-14 13:15:12		2021-06-14 20:12:12
1002	2021-06-14 15:12:12		2021-06-14 16:12:12
1005	2021-06-14 15:18:12		2021-06-14 20:12:12
1001	2021-06-14 20:12:12		2021-06-14 23:12:12
1006	2021-06-14 21:12:12		2021-06-14 23:15:12
1007	2021-06-14 22:12:12		2021-06-14 23:10:12
... ...

Idea: 1 classify the data, add positive 1 after the start data, add - 1 after the end time, 2 calculate the cumulative number of online people according to the time sequence, 3 find out the maximum number of online people at the same time

select
	max(sum_p)
from(
	select
		id,
		dt,
		sum(p) over(order by dt) sum_p

	from (
		select
			id,
			stt dt,
			1 p
		from test5
		union 
		select
			id,
			edt dt,
			-1 p
		from test5
	)t1
)t2

The result after the first step is

select id,stt dt,1 p from test5
union
select id,edt dt,-1 p from test5;t1
1001	2021-06-14 12:12:12	1
1001	2021-06-14 18:12:12	-1
1001	2021-06-14 20:12:12	1
1001	2021-06-14 23:12:12	-1
1002	2021-06-14 15:12:12	1
1002	2021-06-14 16:12:12	-1
1003	2021-06-14 13:12:12	1
1003	2021-06-14 16:12:12	-1
1004	2021-06-14 13:15:12	1
1004	2021-06-14 20:12:12	-1
1005	2021-06-14 15:18:12	1
1005	2021-06-14 20:12:12	-1
1006	2021-06-14 21:12:12	1
1006	2021-06-14 23:15:12	-1
1007	2021-06-14 22:12:12	1
1007	2021-06-14 23:10:12	-1

Tags: hive SQL

Posted on Mon, 27 Sep 2021 03:55:42 -0400 by mandukar