Hive calculates the maximum number of consecutive login days

catalogue

Brother Qiang said that he found the wealth password and recently developed a stock algorithm, one of which involves the calculation method of the number of consecutive trading days of stocks. We all know that the stock market is not open on weekends. There is a fault here, which requires a little skill. I asked if it was a time series. He said I was bullshit, and I knew I was bullshit. When I asked him the way, he didn't tell me. All these years of brotherhood are nothing. Do you really think I'm not as smart as him, hum!

It's better to rely on people than myself. I decided to study the calculation method of calculating the maximum continuous days in Hive overnight.

1, Background

In the business requirements of website platform, the demand for "maximum login days" of users is relatively common.

raw data:

u0001 2019-10-10
u0001 2019-10-11
u0001 2019-10-12
u0001 2019-10-14
u0001 2019-10-15
u0001 2019-10-17
u0001 2019-10-18
u0001 2019-10-19
u0001 2019-10-20
u0002 2019-10-20

Note: the data is a simplified version, and the two columns are user_id,log_in_date. In reality, it is necessary to get the above data by de duplication and conversion from the collected data.

Let's create a table and import the data into Hive:

create table test.user_log_1 (user_id string, log_in_date string) row format delimited fields terminated by ' ';

load data local inpath '/var/lib/hadoop-hdfs/data/user_log.txt' into table test.user_log_1 ;

Check the data:

hive> select * from test.user_log_1 ;
OK
u0001	2019-10-10
u0001	2019-10-11
u0001	2019-10-12
u0001	2019-10-14
u0001	2019-10-15
u0001	2019-10-17
u0001	2019-10-18
u0001	2019-10-19
u0001	2019-10-20
u0002	2019-10-20
Time taken: 0.076 seconds, Fetched: 10 row(s)

2, Algorithm

The core is to sort by access time. The login time column subtracts the sorted serial number to get a date value, which can be grouped and counted.

1. Step 1: sort

By user_id and log by date_ in_ Date sort:

select user_id, log_in_date, row_number() over(partition by user_id order by log_in_date) as rank from test.user_log_1;

result:

u0001	2019-10-10	1
u0001	2019-10-11	2
u0001	2019-10-12	3
u0001	2019-10-14	4
u0001	2019-10-15	5
u0001	2019-10-17	6
u0001	2019-10-18	7
u0001	2019-10-19	8
u0001	2019-10-20	9

u0002	2019-10-20	1

It can be seen here that the maximum number of consecutive login days of u0001 is 4 days, which can be verified after using the following calculation method.

2. Step 2: make the date difference between the second column and the third column

It can be seen that if the date is small, the line number is also small; If you make a difference between the date and the line number, the difference of continuous login should be the same.

select user_id, date_sub(log_in_date, rank) dts from (select user_id, log_in_date, row_number() over(partition by user_id order by log_in_date) as rank from test.user_log_1)m;

result:

u0001	2019-10-09
u0001	2019-10-09
u0001	2019-10-09
u0001	2019-10-10
u0001	2019-10-10
u0001	2019-10-11
u0001	2019-10-11
u0001	2019-10-11
u0001	2019-10-11
u0002	2019-10-19

Obviously, the maximum number of consecutive logins is 4.

3. Step 3: group and sum according to the second column

select user_id, dts, count(1) num from (select user_id, date_sub(log_in_date, rank) dts from (select user_id, log_in_date, row_number() over(partition by user_id order by log_in_date) as rank from test.user_log_1)m)m2 group by user_id, dts;

result:

u0001	2019-10-09	3
u0001	2019-10-10	2
u0001	2019-10-11	4
u0002	2019-10-19	1

4. Step 4: find the maximum number of times

The sequence of consecutive login days of each user has been calculated, and then take the maximum login days of each user:

select user_id, max(num) from (select user_id, dts, count(1) num from (select user_id, date_sub(log_in_date, rank) dts from (select user_id, log_in_date, row_number() over(partition by user_id order by log_in_date) as rank from test.user_log_1)m)m2 group by user_id, dts)m3 group by user_id;

The result is consistent with our expectation. The maximum login days of user u0001 is 4.

u0001	4
u0002	1

3, Extension (maximum trading days of stocks)

We know that the stock market, such as our A shares, does not open at the weekend. If a stock rises the limit last Friday and then rises the limit this Monday, it can be regarded as two consecutive days. It is impossible to use the above method. Try using the lead function:

select user_id, log_in_date, lead(log_in_date) over(partition by user_id order by log_in_date) end_date from test.user_log_1;

result

u0001	2019-10-10	2019-10-11
u0001	2019-10-11	2019-10-12
u0001	2019-10-12	2019-10-14
u0001	2019-10-14	2019-10-15
u0001	2019-10-15	2019-10-17
u0001	2019-10-17	2019-10-18
u0001	2019-10-18	2019-10-19
u0001	2019-10-19	2019-10-20
u0001	2019-10-20	NULL
u0002	2019-10-20	NULL

Ha ha, do you have a train of thought.

Idea: there are three columns in the above results. The first column is uid. Through the lead function, the next two columns are dates, so the dates in both columns are taken from Monday to Friday, that is, there are only working days in the data, and there is no weekend data. You can filter the data in advance to meet the requirements. Since it needs to be continuous, then:

  1. If the date in the third column minus the date in the second column is equal to 1, it is obviously continuous;
  2. If the date in the third column subtracts the date in the second column, the difference is equal to 3, but the date in the third column is Monday, it is also continuous;

Combining the above two conditions, you can calculate the maximum consecutive trading days of the stock. Have you learned to waste it.

Guess you like it

Snapshot explanation of HDFS

Detailed explanation of Hadoop data migration usage

Hbase repair tool Hbck

Hierarchical theory of warehouse modeling

Understand Hive's data storage and compression

Big data components focus on these aspects

Tags: hive

Posted on Wed, 03 Nov 2021 20:55:17 -0400 by seanrock