Welcome to the blog home page: WeChat Search: import_bigdata, original hard core creator of big data domain _import_bigdata_CSDN blog
Welcome to comment, collection, message, message exchange!
This article was originally created by Wang Zhiwu and first published in the CSDN blog!
The first CSDN forum of this article is forbidden to be reproduced without official and personal permission.

This article is about Learning Guide for Big Data Specialists from Zero (Full Upgrade) Part of HIV supplement.

Interview Topic One

Scenario: A log table records all the information about the changing status of a merchant's rate. Now there is a requirement to fetch the rows of data whose status has changed in chronological order.

1. The data are as follows:

create table datafrog_merchant
(f_merchant_id varchar(20),
f_rate varchar(20),
f_date date

insert into datafrog_merchant values

Let's see how the data looks:

2. Achieve the desired effect
Solution ideas: Lag function or Lead function can get the field contents of the previous or next row to this row, which can compare whether the field has changed, and then determine whether the state has changed and whether the data row needs to be extracted.

    lag(f_rate,1,-999) over(partition by f_merchant_id order by f_date) as f_rate2
) t1
    t1.f_rate <> t1.f_rate2

Interview Topic Two

1. The title is as follows

2. Start building tables and inserting data below

create table datafrog_test1
(userid varchar(20),
changjing varchar(20),
inttime varchar(20)

insert into datafrog_test1 values

3.mysql solution ideas: sorting and concat connection

select concat(t.userid,'-',group_concat(t.changjing separator'-')) as result
 select userid,changjing,inttime,
     if(@tmp=userid,@rank:=@rank+1,@rank:=1) as new_rank,
   @tmp:=userid as tmp
 from (select userid,changjing, min(inttime) inttime from datafrog_test1 group by userid,changjing)temp
 order by userid,inttime
where t.new_rank<=2
group by t.userid;

4. Output results:

5. Note:
There may be errors in everybody's code, mainly from ONLY_FULL_GROUP_BY. The solution is to run this set when you run the above codeSql_mode='. In fact, mysql is a good query, but when it is used for analysis, it is a bit weak, such as sorting, median and so on. pandas, sqlserver, oracle, hive, spark are usually used to do analysis in the work. This is not to say that mysql is not useful, it is very useful, easy to use, and other basis.

6. Let's take a look at the hive solution

with tmp as  (
            (SELECT  userid ,
                    row_number() over(partition by userid order by inttime asc) as order_num,
                    lag(changjing,1,'datafrog') OVER(partition by userid order by inttime asc) AS changjing1
                FROM datafrog_test1) as a 
    where changjing!=changjing1)
, tmp2 as (
select userid,changjing,order_num,changjing1,
row_number() over(partition by userid order by order_num ) as changjing_num
from tmp
select concat( userid,'-',concat_ws('-', collect_set(changjing)) )
from tmp2 where changjing_num <3
group by userid

The usages of with as, row_number() over(), lag() in hive sql are mainly investigated here.


Tags: MySQL hive

Posted on Fri, 10 Sep 2021 12:40:56 -0400 by void