Blog home page: WeChat search "import_bigdata", an original hard-core creator in the big data field (_import_bigdata_, CSDN blog)
Comments, favorites, and messages are all welcome!
This article was originally created by Wang Zhiwu and first published in the CSDN blog!
This article was first published on the CSDN blog; reproduction without official or personal permission is prohibited.
This article is a supplement to the Hive part of Learning Guide for Big Data Specialists from Zero (Fully Upgraded).
Interview Topic One
Scenario: a log table records every change in a merchant's rate status. The requirement is to extract, in chronological order, only the rows where the status actually changed.
1. The data are as follows:
```sql
create table datafrog_merchant (
    f_merchant_id varchar(20),
    f_rate        varchar(20),
    f_date        date
);

insert into datafrog_merchant values
(100,0.1,'2016-03-02'), (100,0.1,'2016-02-02'),
(100,0.2,'2016-03-05'), (100,0.2,'2016-03-06'),
(100,0.3,'2016-03-07'), (100,0.1,'2016-03-09'),
(100,0.1,'2016-03-10'), (100,0.1,'2016-03-10'),
(200,0.1,'2016-03-10'), (200,0.1,'2016-02-02'),
(200,0.2,'2016-03-05'), (200,0.2,'2016-03-06'),
(200,0.3,'2016-03-07'), (200,0.1,'2016-03-09'),
(200,0.1,'2016-03-10'), (200,0.1,'2016-03-10');
```
Let's see how the data looks:
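If you do not have a MySQL instance at hand, a quick way to preview the sample data is an in-memory SQLite database; this is a sketch, not the author's setup (SQLite has no DATE type, so dates are stored as ISO strings, which sort correctly):

```python
import sqlite3

# In-memory SQLite database as a lightweight stand-in for MySQL,
# just to load and preview the sample data from the article.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE datafrog_merchant (
        f_merchant_id TEXT,
        f_rate        TEXT,
        f_date        TEXT  -- ISO date strings sort chronologically
    )
""")
rows = [
    (100, '0.1', '2016-03-02'), (100, '0.1', '2016-02-02'),
    (100, '0.2', '2016-03-05'), (100, '0.2', '2016-03-06'),
    (100, '0.3', '2016-03-07'), (100, '0.1', '2016-03-09'),
    (100, '0.1', '2016-03-10'), (100, '0.1', '2016-03-10'),
    (200, '0.1', '2016-03-10'), (200, '0.1', '2016-02-02'),
    (200, '0.2', '2016-03-05'), (200, '0.2', '2016-03-06'),
    (200, '0.3', '2016-03-07'), (200, '0.1', '2016-03-09'),
    (200, '0.1', '2016-03-10'), (200, '0.1', '2016-03-10'),
]
conn.executemany("INSERT INTO datafrog_merchant VALUES (?,?,?)", rows)

# Print the table in merchant/date order to see what we are working with.
for row in conn.execute(
        "SELECT * FROM datafrog_merchant ORDER BY f_merchant_id, f_date"):
    print(row)
```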
2. Achieve the desired result
Solution idea: the LAG (or LEAD) window function returns a column's value from the previous (or next) row relative to the current row. Comparing the current rate with the lagged rate shows whether the status changed, which decides whether the row should be extracted.
```sql
select t1.f_merchant_id, t1.f_rate, t1.f_date
from (
    select f_merchant_id, f_rate, f_date,
           lag(f_rate, 1, -999) over (
               partition by f_merchant_id order by f_date
           ) as f_rate2
    from datafrog_merchant
) t1
where t1.f_rate <> t1.f_rate2;
```
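Note that LAG requires MySQL 8.0 or later. As a quick sanity check, the same query runs unchanged on SQLite 3.25+, which also supports window functions; the harness below is a sketch, not the author's original environment:

```python
import sqlite3

# Run the LAG-based change-detection query against the sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datafrog_merchant (f_merchant_id TEXT, f_rate TEXT, f_date TEXT)"
)
rows = [
    ('100', '0.1', '2016-03-02'), ('100', '0.1', '2016-02-02'),
    ('100', '0.2', '2016-03-05'), ('100', '0.2', '2016-03-06'),
    ('100', '0.3', '2016-03-07'), ('100', '0.1', '2016-03-09'),
    ('100', '0.1', '2016-03-10'), ('100', '0.1', '2016-03-10'),
    ('200', '0.1', '2016-03-10'), ('200', '0.1', '2016-02-02'),
    ('200', '0.2', '2016-03-05'), ('200', '0.2', '2016-03-06'),
    ('200', '0.3', '2016-03-07'), ('200', '0.1', '2016-03-09'),
    ('200', '0.1', '2016-03-10'), ('200', '0.1', '2016-03-10'),
]
conn.executemany("INSERT INTO datafrog_merchant VALUES (?,?,?)", rows)

query = """
SELECT t1.f_merchant_id, t1.f_rate, t1.f_date
FROM (
    SELECT f_merchant_id, f_rate, f_date,
           LAG(f_rate, 1, '-999') OVER (
               PARTITION BY f_merchant_id ORDER BY f_date
           ) AS f_rate2
    FROM datafrog_merchant
) t1
WHERE t1.f_rate <> t1.f_rate2
"""
changed = conn.execute(query).fetchall()
for row in changed:
    print(row)  # only rows where the rate differs from the previous row
```

Each merchant's first row also counts as a "change" because the LAG default (-999) never matches a real rate, so every status run contributes exactly its first row.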
Interview Topic Two
1. The task: for each user, take the two scenes (changjing) with the earliest entry times (inttime) and concatenate them with the user id in the form userid-scene1-scene2.
2. Build the table and insert the data:
```sql
create table datafrog_test1 (
    userid    varchar(20),
    changjing varchar(20),
    inttime   varchar(20)
);

insert into datafrog_test1 values
(1,1001,1400), (2,1002,1401), (1,1002,1402), (1,1001,1402),
(2,1003,1403), (2,1004,1404), (3,1003,1400), (4,1004,1402),
(4,1003,1403), (4,1001,1403), (4,1002,1404), (5,1002,1402),
(5,1002,1403), (5,1001,1404), (5,1003,1405);
```
3. MySQL solution idea: rank with user variables, then concatenate with CONCAT and GROUP_CONCAT.
```sql
select concat(t.userid, '-', group_concat(t.changjing separator '-')) as result
from (
    select userid, changjing, inttime,
           if(@tmp = userid, @rank := @rank + 1, @rank := 1) as new_rank,
           @tmp := userid as tmp
    from (
        select userid, changjing, min(inttime) as inttime
        from datafrog_test1
        group by userid, changjing
    ) temp
    order by userid, inttime
) t
where t.new_rank <= 2
group by t.userid;
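As a cross-check of the two steps in that query (earliest time per user/scene, then the first two scenes per user), here is a plain-Python sketch of the same logic. One assumption to flag: when two scenes share the same inttime, this code keeps first-appearance order, whereas SQL leaves the order of ties unspecified.

```python
from collections import defaultdict

records = [
    (1, 1001, 1400), (2, 1002, 1401), (1, 1002, 1402), (1, 1001, 1402),
    (2, 1003, 1403), (2, 1004, 1404), (3, 1003, 1400), (4, 1004, 1402),
    (4, 1003, 1403), (4, 1001, 1403), (4, 1002, 1404), (5, 1002, 1402),
    (5, 1002, 1403), (5, 1001, 1404), (5, 1003, 1405),
]

# Step 1: earliest entry time per (user, scene), like the inner GROUP BY.
first_seen = {}
for userid, scene, t in records:
    key = (userid, scene)
    if key not in first_seen or t < first_seen[key]:
        first_seen[key] = t

# Step 2: per user, order scenes by that time and keep the first two
# (sorted() is stable, so ties keep first-appearance order).
scenes_by_user = defaultdict(list)
for (userid, scene), t in sorted(first_seen.items(),
                                 key=lambda kv: (kv[0][0], kv[1])):
    scenes_by_user[userid].append(scene)

results = {u: "-".join(str(x) for x in [u] + s[:2])
           for u, s in scenes_by_user.items()}
for u in sorted(results):
    print(results[u])
```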
4. Output results:
5. Some readers may see an error here, usually caused by the ONLY_FULL_GROUP_BY setting; the fix is to drop it from the session's sql_mode before running the query above, e.g. `SET SESSION sql_mode = (SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));`. MySQL is fine for querying, but it is a bit weak for analysis work such as ranking or medians, so pandas, SQL Server, Oracle, Hive, or Spark are usually used for analysis on the job. This is not to say MySQL is not useful: it is handy, easy to use, and a good foundation.
6. Now let's look at the Hive solution.
```sql
with tmp as (
    select userid, changjing, order_num, changjing1
    from (
        select userid, changjing,
               row_number() over (partition by userid order by inttime asc) as order_num,
               lag(changjing, 1, 'datafrog') over (partition by userid order by inttime asc) as changjing1
        from datafrog_test1
    ) as a
    where changjing != changjing1
),
tmp2 as (
    select userid, changjing, order_num, changjing1,
           row_number() over (partition by userid order by order_num) as changjing_num
    from tmp
)
select concat(userid, '-', concat_ws('-', collect_set(changjing)))
from tmp2
where changjing_num < 3
group by userid;
```
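If Hive is not available, the structure of this query (WITH, ROW_NUMBER, LAG) also runs on SQLite 3.25+ with two substitutions, flagged here as adaptations rather than the author's code: GROUP_CONCAT stands in for Hive's concat_ws(collect_set(...)) (note that collect_set deduplicates and does not guarantee element order), and || replaces concat.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datafrog_test1 (userid INT, changjing INT, inttime INT)")
records = [
    (1, 1001, 1400), (2, 1002, 1401), (1, 1002, 1402), (1, 1001, 1402),
    (2, 1003, 1403), (2, 1004, 1404), (3, 1003, 1400), (4, 1004, 1402),
    (4, 1003, 1403), (4, 1001, 1403), (4, 1002, 1404), (5, 1002, 1402),
    (5, 1002, 1403), (5, 1001, 1404), (5, 1003, 1405),
]
conn.executemany("INSERT INTO datafrog_test1 VALUES (?,?,?)", records)

query = """
WITH tmp AS (
    -- keep only rows whose scene differs from the previous scene in time order
    SELECT userid, changjing, order_num
    FROM (
        SELECT userid, changjing,
               ROW_NUMBER() OVER (PARTITION BY userid ORDER BY inttime) AS order_num,
               LAG(changjing, 1, 'datafrog') OVER (
                   PARTITION BY userid ORDER BY inttime
               ) AS changjing1
        FROM datafrog_test1
    ) a
    WHERE changjing != changjing1
),
tmp2 AS (
    -- re-rank the surviving rows so we can take the first two per user
    SELECT userid, changjing,
           ROW_NUMBER() OVER (PARTITION BY userid ORDER BY order_num) AS changjing_num
    FROM tmp
)
SELECT userid || '-' || GROUP_CONCAT(changjing, '-')
FROM tmp2
WHERE changjing_num < 3
GROUP BY userid
"""
for (line,) in conn.execute(query):
    print(line)
```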
This question mainly tests the usage of WITH ... AS, ROW_NUMBER() OVER (), and LAG() in Hive SQL.