Time-Domain Operations in SparkSql [Window Functions]

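This post summarizes some common SQL operations over the time dimension, including consecutive consumption, longest check-in streak, and cumulative spending. The same calculations map directly onto other business scenarios; in games, for example, consecutive login days, check-in streak length, and maximum consecutive check-in days. The methods are shared, so here they are implemented in sparksql. For hivesql some of the code may need slight changes, for example a having clause on a column alias has to be rewritten as an outer query with where; those changes are not repeated here.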

Constructing test data

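To make splitting easy, the fields are joined with @: the first is the date, the second the user, the third whether the user consumed (1 = yes, 0 = no), and the fourth the amount spent.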

20190531@156@1@20
20190601@156@1@20
20190602@156@1@10
20190603@156@0@0
20190604@156@0@0
20190605@156@1@10
20190606@156@1@10
20190607@156@1@10
20190608@156@0@0
20190609@156@1@20
20190610@156@1@20
20190531@187@0@0
20190601@187@1@10
20190602@187@1@20
20190603@187@1@30
20190604@187@1@40
20190605@187@0@0
20190606@187@1@10
20190607@187@0@0
20190608@187@1@20
20190609@187@1@20
20190610@187@1@10
20190609@173@0@0
20190610@173@1@10

Map the data to a table with the following structure

create table tmp_time_exp 
(
    dt string,  
    passenger_phone string,
    is_call string comment 'whether the user consumed',
    cost bigint comment 'amount spent'
)
row format DELIMITED fields terminated by '@'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/hdfslocation'

Query the table to check that it loaded as expected

tmp_time_exp.dt	tmp_time_exp.passenger_phone	tmp_time_exp.is_call	tmp_time_exp.cost
20190531	156	1	20
20190601	156	1	20
20190602	156	1	10
20190603	156	0	0
20190604	156	0	0
20190605	156	1	10
20190606	156	1	10
20190607	156	1	10
20190608	156	0	0
20190609	156	1	20
20190610	156	1	20
20190531	187	0	0
20190601	187	1	10
20190602	187	1	20
20190603	187	1	30
20190604	187	1	40
20190605	187	0	0
20190606	187	1	10
20190607	187	0	0
20190608	187	1	20
20190609	187	1	20
20190610	187	1	10
20190609	173	0	0
20190610	173	1	10

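Common problems

1. Users who consumed on n consecutive days

Example: find users who consumed on three consecutive days, along with the start and end dates of each such run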

select
    passenger_phone,
    is_call,
    cost,
    unix_timestamp(lag(dt,2,0) over(partition by passenger_phone order by dt),'yyyyMMdd') as start_dt,
    dt as end_dt,
    datediff(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),from_unixtime(unix_timestamp(lag(dt,2,0) over(partition by passenger_phone order by dt),'yyyyMMdd'),'yyyy-MM-dd')) as last3day  
from
    tmp_time_exp
where
    is_call != 0 
having  
    last3day = 2 

Output

passenger_phone	is_call	cost	start_dt	end_dt	last3day
156	1	10	1559232000	20190602	2
156	1	10	1559664000	20190607	2
187	1	30	1559318400	20190603	2
187	1	40	1559404800	20190604	2
187	1	10	1559923200	20190610	2

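1. datediff requires its arguments in the standard yyyy-MM-dd date format, so dt has to be converted first. 2. Either lag or lead works here: partition by user, sort the consumption dates, shift each date by a fixed offset, and take the difference. As above, each date is compared with the date two rows earlier; because non-consumption days were filtered out and each user has at most one row per day, a difference of exactly 2 means the three days are consecutive consumption days. Note that start_dt is emitted as a Unix timestamp in seconds, not as a yyyyMMdd string.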
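The lag-by-two check can be traced by hand in plain Python. This is only an illustrative sketch, not the Spark execution: the function names parse and three_day_runs are made up for this example, and the input is user 156's consumption days from the sample data.

```python
from datetime import date, datetime


def parse(dt: str) -> date:
    """Convert a yyyyMMdd string to a date, like the unix_timestamp/from_unixtime pair."""
    return datetime.strptime(dt, "%Y%m%d").date()


def three_day_runs(consumed_days):
    """Return (start, end) pairs where the day two rows back is exactly 2 days earlier."""
    days = sorted(consumed_days)              # order by dt
    runs = []
    for i in range(2, len(days)):             # lag(dt, 2): the first two rows have no lag value
        start, end = days[i - 2], days[i]
        if (parse(end) - parse(start)).days == 2:   # datediff(end, start) = 2
            runs.append((start, end))
    return runs


# Consumption days of user 156 (rows with is_call = 1 only).
user_156 = ["20190531", "20190601", "20190602", "20190605",
            "20190606", "20190607", "20190609", "20190610"]
print(three_day_runs(user_156))
```

For user 156 this yields the runs ending on 20190602 and 20190607, the same two rows the query returns for that user.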

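2. Periods of consecutive consumption, their duration, and the total amount spent in each

Example: for user 156 the consecutive consumption periods are 5.31-6.2, 6.5-6.7 and 6.9-6.10, and the amounts are 50, 30 and 40 respectively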

select
    passenger_phone,
    min(dt) as start_day,
    max(dt) as end_day,
    count(1) as last_days,
    sum(cost) as cost_sum
from
(
    select
        *,
        row_number() over(partition by passenger_phone order by dt) as ranker
    from
        tmp_time_exp
    where
        is_call != 0
)a
group by
    passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)

Output

passenger_phone	start_day	end_day	last_days	cost_sum
156	20190531	20190602	3	50
156	20190605	20190607	3	30
156	20190609	20190610	2	40
173	20190610	20190610	1	10
187	20190601	20190604	4	100
187	20190606	20190606	1	10
187	20190608	20190610	3	50

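This approach is taken from a blog post whose link I can no longer find. It is clever: number each user's consumption dates in date order, then subtract that row number from the date itself and group by the result. Within a run of consecutive dates the date and the row number both grow by 1 per row, so the difference stays constant; rows sharing the same difference therefore form one consecutive run, and the number of such rows is the length of the run in days.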
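To see why the date-minus-row-number key is constant inside a run, the grouping can be reproduced in plain Python. This is a sketch under stated assumptions: the function name streaks is hypothetical, and the input is user 156's consumption rows (dt, cost) from the sample data.

```python
from datetime import datetime, timedelta
from itertools import groupby


def streaks(rows):
    """rows: (dt, cost) for one user's consumption days.

    Returns a list of (start_day, end_day, last_days, cost_sum) tuples.
    """
    keyed = []
    for rank, (dt, cost) in enumerate(sorted(rows), start=1):   # row_number() over(order by dt)
        day = datetime.strptime(dt, "%Y%m%d").date()
        anchor = day - timedelta(days=rank)                     # date_sub(dt, ranker)
        keyed.append((anchor, dt, cost))
    result = []
    # Rows are in date order, so equal anchors are adjacent and groupby finds each run.
    for _, grp in groupby(keyed, key=lambda r: r[0]):
        grp = list(grp)
        result.append((grp[0][1], grp[-1][1], len(grp), sum(r[2] for r in grp)))
    return result


user_156 = [("20190531", 20), ("20190601", 20), ("20190602", 10),
            ("20190605", 10), ("20190606", 10), ("20190607", 10),
            ("20190609", 20), ("20190610", 20)]
for run in streaks(user_156):
    print(run)
```

The three tuples printed correspond to the three rows for user 156 in the output table above.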

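3. Consecutive consumption days up to and including 6.10, reset on any gap (check-in streak)

Example: user 156 consumed on 6.10 and, going backwards, on 6.9 as well, but not on 6.8, so the current streak is 2 days. This is commonly used for check-in features: if today's check-in is missed, the streak count starts over

Method 1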
select
    *
from
(
    select
        passenger_phone,
        min(dt) as start_time,
        max(dt) as end_time,
        count(1) as day_cnt
    from
    (
        select
            *,
            row_number() over(partition by passenger_phone order by dt) as ranker
        from
            tmp_time_exp
        where
            is_call = 1
    )aa
    group by
        passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)
)bb
where
    end_time = '20190610'

Take the query from problem 2 and restrict the end date to today (6.10)

Method 2
with end_dt as
(
    select
        passenger_phone,
        max(dt) as end_dt
    from
        tmp_time_exp
    where
        dt between '20190531' and '20190610'
        and is_call = 0  -- first find each user's latest non-consumption date
    group by
        passenger_phone
)
select
    aa.dt,
    aa.passenger_phone,
    datediff(from_unixtime(unix_timestamp(aa.dt,'yyyyMMdd'),'yyyy-MM-dd'),from_unixtime(unix_timestamp(bb.end_dt,'yyyyMMdd'),'yyyy-MM-dd')) as day_cnt
from
(
    select
        dt,
        passenger_phone
    from
        tmp_time_exp
    where
        dt = '20190610'  -- users with a record on the reference day (6.10)
)aa
join
    end_dt as bb
on
    aa.passenger_phone = bb.passenger_phone

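First find each user's latest non-consumption date: walking backwards from 6.10, the streak ends at the first non-consumption date encountered, so the number of days between that date and 6.10 is the length of the unbroken streak up to 6.10. Two caveats: a user with no non-consumption row at all is dropped by the inner join, and a user who did not consume on 6.10 gets day_cnt = 0.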
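The same idea can be checked in plain Python. A minimal sketch, assuming one row per user per day: the names to_date and current_streak are hypothetical, and returning None stands in for the user being dropped by the inner join.

```python
from datetime import datetime


def to_date(dt):
    """Convert a yyyyMMdd string to a date."""
    return datetime.strptime(dt, "%Y%m%d").date()


def current_streak(rows, today):
    """rows: (dt, is_call) for one user. Days from the latest non-consumption day to today."""
    gaps = [dt for dt, is_call in rows if is_call == 0 and dt <= today]
    if not gaps:
        return None                                    # no row to join against in the SQL
    return (to_date(today) - to_date(max(gaps))).days  # datediff(today, max non-consumption dt)


# Last rows of user 187 and all rows of user 173 from the sample data.
user_187_tail = [("20190605", 0), ("20190606", 1), ("20190607", 0),
                 ("20190608", 1), ("20190609", 1), ("20190610", 1)]
user_173 = [("20190609", 0), ("20190610", 1)]
print(current_streak(user_187_tail, "20190610"))
print(current_streak(user_173, "20190610"))
```

The two printed values are the day_cnt values of users 187 and 173 in the result.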

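Both methods give the same day_cnt (Method 2 returns only dt, passenger_phone and day_cnt); the output of Method 1 is

passenger_phone	start_time	end_time	day_cnt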
156	20190609	20190610	2
173	20190610	20190610	1
187	20190608	20190610	3

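4. Longest run of consecutive consumption days

Example: for user 156 the consecutive consumption periods are 5.31-6.2, 6.5-6.7 and 6.9-6.10, with lengths 3, 3 and 2 and amounts 50, 30 and 40. This is really a follow-up to problem 2.

Method 1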
select
    passenger_phone,
    start_day,
    end_day,
    last_days,
    rank() over(partition by passenger_phone order by last_days desc) as appose_rank, -- ties share rank 1, so tied longest streaks are all kept
    row_number() over(partition by passenger_phone order by last_days desc) as last_ranker  -- no ties: exactly one row per user gets 1
from
(
    select
        passenger_phone,
        min(dt) as start_day,
        max(dt) as end_day,
        count(1) as last_days
    from
    (
        select
            *,
            row_number() over(partition by passenger_phone order by dt) as ranker
        from
            tmp_time_exp
        where
            is_call != 0
    )a
    group by
        passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)
)aa
having
    -- last_ranker = 1
    appose_rank = 1


Reuse the solution to problem 2 and add one more layer on top of its result to pick out the longest consumption streak

Method 2
select
    cc.*,
    length(dd) as max_length,
    row_number() over(partition by passenger_phone order by length(dd) desc) as ranker
from
(
    select
        passenger_phone,
        concat_ws('',collect_list(is_call)) as call_list
    from
    (
        select
            dt,
            passenger_phone,
            is_call
        from
            tmp_time_exp
        order by
            passenger_phone desc, dt desc
    )aa
    group by
        passenger_phone
)cc
lateral view explode(split(call_list,'0')) asTable as dd
having
    ranker = 1

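A somewhat tricky alternative that an interviewer pointed out to me during an interview. Concatenate each user's is_call flags into a single string in date order, split it on '0', and take the longest remaining piece; its length is the longest streak. It solves the same problem, but returning the dates as well is more involved: the dates have to be concatenated into the string up front and unpacked again afterwards. Note also that the aggregation collecting the flags does not formally guarantee that the row order of the inner order by is preserved.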
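The split-on-'0' trick is easy to verify in plain Python. This is an illustrative sketch: the function name longest_streak is hypothetical, and the flag lists are the is_call columns of users 156 and 187 in date order.

```python
def longest_streak(flags):
    """flags: is_call values (0 or 1) in date order. Returns the longest run of 1s."""
    call_list = "".join(str(f) for f in flags)   # concatenate the flags into one string
    pieces = call_list.split("0")                # every '0' ends a streak
    return max(len(piece) for piece in pieces)   # longest piece = longest streak


user_156 = [1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1]
user_187 = [0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1]
print(longest_streak(user_156), longest_streak(user_187))
```

The string only works as a streak detector if the flags are in date order and every day has a row, which is why the ordering of the collected flags matters.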

Both methods agree on the longest streak length; the output of Method 1 is

passenger_phone	start_day	end_day	last_days	appose_rank	last_ranker
156	20190531	20190602	3	1	1
156	20190605	20190607	3	1	2
173	20190610	20190610	1	1	1
187	20190601	20190604	4	1	1

5. Peak consumption date

Example: the date on which the most users consumed

Method 1
select
    dt,
    passenger_phone,
    is_call_cnt,
    rank() over(order by is_call_cnt desc) as call_ord_ranker
from
(
    select
        *,
        sum(is_call) over(partition by dt) as is_call_cnt
    from
        tmp_time_exp
)aa
having
    call_ord_ranker = 1
Method 2
select
    *,
    first_value(dt) over(order by is_call_cnt desc) as max_dt
from
(
    select
        *,
        sum(is_call) over(partition by dt) as is_call_cnt
    from
        tmp_time_exp
)aa
having
    max_dt = dt

Result (columns as returned by Method 2)

dt	passenger_phone	is_call	cost	is_call_cnt	max_dt
20190610	187	1	10	3.0	20190610
20190610	173	1	10	3.0	20190610
20190610	156	1	20	3.0	20190610
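Both methods boil down to counting consumers per date and keeping the date with the highest count. A plain-Python sketch of that logic follows; the function name peak_dates is hypothetical, and the rows are the 6.9 and 6.10 records of the sample data, the only two dates on which all three users have a row.

```python
from collections import Counter


def peak_dates(rows):
    """rows: (dt, passenger_phone, is_call). Returns (dates with the most consumers, count)."""
    per_day = Counter()
    for dt, _, is_call in rows:
        per_day[dt] += is_call                  # sum(is_call) over(partition by dt)
    top = max(per_day.values())
    # Like rank() = 1, every date tied for the top count is kept.
    return sorted(d for d, n in per_day.items() if n == top), top


rows = [("20190609", "156", 1), ("20190609", "187", 1), ("20190609", "173", 0),
        ("20190610", "156", 1), ("20190610", "187", 1), ("20190610", "173", 1)]
print(peak_dates(rows))
```

On a tie, this sketch and the rank() version return every tied date, whereas first_value returns the rows of only one of them.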

6. Date on which cumulative spending reaches x

Example: for user 156, cumulative spending first reaches 50 on 6.2 and first reaches 100 on 6.9

select
    passenger_phone,
    max(min_gt50_dt) as min_gt50_dt,
    max(min_gt100_dt) as min_gt100_dt
from
(
    select
        *,
        min(dt) over(partition by passenger_phone,if(cost_until_today >= 50,1,0)) as min_gt50_dt,
        min(dt) over(partition by passenger_phone,if(cost_until_today >= 100,1,0)) as min_gt100_dt
    from
    (
        select
            dt,
            passenger_phone,
            cost,
            sum(cost) over(partition by passenger_phone order by dt) as cost_until_today
        from
            tmp_time_exp
    )aa
)bb
group by 
    passenger_phone

Result

passenger_phone	min_gt50_dt	min_gt100_dt
156	20190602	20190609
173	20190609	20190609
187	20190603	20190604

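The core is sum() over(partition by ... order by dt), which sums the partition's rows up to and including dt, i.e. a running total. For explicit control over the window frame boundaries, see problem 7 below. One caveat: a user who never reaches the threshold still gets a date, because the partition of rows below the threshold also has a min(dt); user 173 never accumulates 50, yet 20190609 is returned. Aggregating min(if(cost_until_today >= 50, dt, null)) per user instead would return null in that case.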
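The running total and the "first date at or above the threshold" lookup can be sketched in plain Python. This is not the author's query: the function name first_date_reaching is hypothetical, and unlike the SQL above it returns None for a user who never reaches the threshold.

```python
from itertools import accumulate


def first_date_reaching(rows, threshold):
    """rows: (dt, cost) for one user. First dt whose running total is >= threshold."""
    rows = sorted(rows)                                 # order by dt
    running = accumulate(cost for _, cost in rows)      # sum(cost) over(order by dt)
    for (dt, _), total in zip(rows, running):
        if total >= threshold:
            return dt
    return None                                         # threshold never reached


user_156 = [("20190531", 20), ("20190601", 20), ("20190602", 10), ("20190603", 0),
            ("20190604", 0), ("20190605", 10), ("20190606", 10), ("20190607", 10),
            ("20190608", 0), ("20190609", 20), ("20190610", 20)]
user_173 = [("20190609", 0), ("20190610", 10)]
print(first_date_reaching(user_156, 50), first_date_reaching(user_156, 100))
print(first_date_reaching(user_173, 50))
```

For user 156 the two dates match the example (6.2 and 6.9); for user 173 the sketch reports that 50 is never reached.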

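7. Maximum consumption within a time interval

Example: suppose the requirement is to find the largest daily amount within three days before and after 6.5. Range-style maximum lookups like this are almost always done with window functions, along the lines of max() over(partition by ... order by dt rows between 3 preceding and 3 following). For each dt this frame covers the three rows before and the three rows after, seven rows in total including the current one, and returns the maximum within it; because the sample data has exactly one row per user per day, seven rows correspond to seven days. Replacing the aggregate with sum gives the total over the same frame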

select
    dt,
    passenger_phone,
    cost,
    max(cost) over(partition by passenger_phone order by dt rows between unbounded preceding and current row) as until_cur_max,
    max(cost) over(partition by passenger_phone order by dt) as until_cur_max2,  -- same effect as above
    max(cost) over(partition by passenger_phone order by dt rows between 3 preceding and 3 following) as before3later3_max,
    sum(cost) over(partition by passenger_phone order by dt rows between 3 preceding and 3 following) as before3later3_sum
from
    tmp_time_exp

Result

dt	passenger_phone	cost	until_cur_max	until_cur_max2	before3later3_max	before3later3_sum
20190531	156	20	20	20	20	50
20190601	156	20	20	20	20	50
20190602	156	10	20	20	20	60
20190603	156	0	20	20	20	70
20190604	156	0	20	20	20	60
20190605	156	10	20	20	10	40
20190606	156	10	20	20	20	50
20190607	156	10	20	20	20	70
20190608	156	0	20	20	20	70
20190609	156	20	20	20	20	60
20190610	156	20	20	20	20	50
20190609	173	0	0	0	10	10
20190610	173	10	10	10	10	10
20190531	187	0	0	0	30	60
20190601	187	10	10	10	40	100
20190602	187	20	20	20	40	100
20190603	187	30	30	30	40	110
20190604	187	40	40	40	40	110
20190605	187	0	40	40	40	120
20190606	187	10	40	40	40	120
20190607	187	0	40	40	40	100
20190608	187	20	40	40	20	60
20190609	187	20	40	40	20	60
20190610	187	10	40	40	20	50
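The rows between 3 preceding and 3 following frame can be reproduced with list slicing, which makes the clipping at the partition edges explicit. A minimal sketch: the function name window_stats is hypothetical, and the input is user 156's cost column in date order.

```python
def window_stats(costs, before=3, after=3):
    """For each position, return (max, sum) over the frame [i - before, i + after]."""
    out = []
    for i in range(len(costs)):
        # The frame is clipped at the edges of the partition, as in the SQL window.
        frame = costs[max(0, i - before): i + after + 1]
        out.append((max(frame), sum(frame)))
    return out


user_156_costs = [20, 20, 10, 0, 0, 10, 10, 10, 0, 20, 20]   # 20190531 .. 20190610
stats = window_stats(user_156_costs)
print(stats[0], stats[5], stats[10])
```

Index 5 is 20190605; its frame covers 6.2 through 6.8, giving a maximum of 10 and a sum of 40, as in the table. The first and last rows only see four rows each because the frame is cut off at the partition boundary.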