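This article summarizes some SQL operations over the time domain, including consecutive consumption, longest check-in streak, and cumulative consumption. Mapped to other business scenarios the calculations are much the same: in gaming, for example, consecutive login time, consecutive check-in duration, and maximum consecutive check-in days are all common cases. The methods are shared. Here they are implemented in Spark SQL; for Hive SQL some of the code may need slight changes, such as wrapping the query in an outer layer and turning having into where, which I won't go into here.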
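Constructing test data
To make the fields easy to split, I joined them with @. The first field is the date, the second is the user, the third is whether the user consumed, and the fourth is the amount spent.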
20190531@156@1@20
20190601@156@1@20
20190602@156@1@10
20190603@156@0@0
20190604@156@0@0
20190605@156@1@10
20190606@156@1@10
20190607@156@1@10
20190608@156@0@0
20190609@156@1@20
20190610@156@1@20
20190531@187@0@0
20190601@187@1@10
20190602@187@1@20
20190603@187@1@30
20190604@187@1@40
20190605@187@0@0
20190606@187@1@10
20190607@187@0@0
20190608@187@1@20
20190609@187@1@20
20190610@187@1@10
20190609@173@0@0
20190610@173@1@10
Map the data to a table with the following structure
create table tmp_time_exp
(
dt string,
passenger_phone string,
is_call string comment 'whether the user consumed',
cost bigint comment 'amount spent'
)
row format DELIMITED fields terminated by '@'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/hdfslocation'
Query the table to check that the data matches
tmp_time_exp.dt tmp_time_exp.passenger_phone tmp_time_exp.is_call tmp_time_exp.cost
20190531 156 1 20
20190601 156 1 20
20190602 156 1 10
20190603 156 0 0
20190604 156 0 0
20190605 156 1 10
20190606 156 1 10
20190607 156 1 10
20190608 156 0 0
20190609 156 1 20
20190610 156 1 20
20190531 187 0 0
20190601 187 1 10
20190602 187 1 20
20190603 187 1 30
20190604 187 1 40
20190605 187 0 0
20190606 187 1 10
20190607 187 0 0
20190608 187 1 20
20190609 187 1 20
20190610 187 1 10
20190609 173 0 0
20190610 173 1 10
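Common problems
1. Find users who consumed on n consecutive days
Example: find the users who consumed on three consecutive days, together with the start and end time of each such run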
select
passenger_phone,
is_call,
cost,
unix_timestamp(lag(dt,2,0) over(partition by passenger_phone order by dt),'yyyyMMdd') as start_dt,
dt as end_dt,
datediff(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),from_unixtime(unix_timestamp(lag(dt,2,0) over(partition by passenger_phone order by dt),'yyyyMMdd'),'yyyy-MM-dd')) as last3day
from
tmp_time_exp
where
is_call != 0
having
last3day = 2
Output
passenger_phone is_call cost start_dt end_dt last3day
156 1 10 1559232000 20190602 2
156 1 10 1559664000 20190607 2
187 1 30 1559318400 20190603 2
187 1 40 1559404800 20190604 2
187 1 10 1559923200 20190610 2
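1. When using datediff, note that the arguments must be in the standard date format (yyyy-MM-dd), so dt has to be converted first.
2. Either lag or lead can do this. First partition by user, then sort by consumption date, then shift the consumption date by a fixed number of rows and take the difference. As above, only consumption days are kept and the date is shifted by two rows; if the difference is 2, then (given one row per user per day) the user must have consumed on all three of those days. In the output, start_dt is a Unix timestamp in seconds because the query wraps lag in unix_timestamp, while end_dt is the raw yyyyMMdd string.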
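As a cross-check of the lag logic, here is a minimal plain-Python sketch, not the Spark SQL itself. The helper names `parse` and `n_day_windows` are made up for illustration, and the input is assumed to be one user's consumption days (is_call = 1) with at most one row per day.

```python
from datetime import datetime

def parse(dt):
    """Convert a yyyyMMdd string to a date, like the unix_timestamp/from_unixtime step."""
    return datetime.strptime(dt, "%Y%m%d").date()

def n_day_windows(consumed_days, n=3):
    """Return (start, end) pairs where the day n-1 rows back is exactly n-1 days earlier."""
    days = sorted(consumed_days)
    windows = []
    for i in range(n - 1, len(days)):
        start, end = days[i - (n - 1)], days[i]        # start plays the role of lag(dt, n-1)
        if (parse(end) - parse(start)).days == n - 1:  # datediff(end, start) = n-1
            windows.append((start, end))
    return windows

# Consumption days of user 156 from the test data
days_156 = ["20190531", "20190601", "20190602", "20190605",
            "20190606", "20190607", "20190609", "20190610"]
windows_156 = n_day_windows(days_156)
print(windows_156)  # [('20190531', '20190602'), ('20190605', '20190607')]
```

The two windows correspond to the two rows for user 156 in the output above, with the start shown as a date string instead of a timestamp.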
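2. Each user's consecutive consumption periods, their duration, and the total amount spent in each period
Example: for user 156 the consecutive consumption periods are 5.31-6.2, 6.5-6.7, and 6.9-6.10, with amounts 50, 30, and 40 respectively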
select
passenger_phone,
min(dt) as start_day,
max(dt) as end_day,
count(1) as last_days,
sum(cost) as cost_sum
from
(
select
*,
row_number() over(partition by passenger_phone order by dt) as ranker
from
tmp_time_exp
where
is_call != 0
)a
group by
passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)
Output
passenger_phone start_day end_day last_days cost_sum
156 20190531 20190602 3 50
156 20190605 20190607 3 30
156 20190609 20190610 2 40
173 20190610 20190610 1 10
187 20190601 20190604 4 100
187 20190606 20190606 1 10
187 20190608 20190610 3 50
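The approach above is borrowed from a blog post whose link I can no longer find. It is a neat trick: subtract each row's rank in date order from its own date and group by the result. Rows with the same result are consecutive dates, because over a consecutive run the date and the rank both increase by one per row, and the number of rows sharing a result is the length of the run in days.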
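The date-minus-rank grouping can be traced in plain Python. This is only an illustrative sketch under the assumption of one row per user per day; the function name `streaks` and the tuple layout are hypothetical, not part of the original query.

```python
from datetime import datetime, timedelta
from itertools import groupby

def streaks(rows):
    """rows: (dt, cost) pairs for one user's consumption days.
    Returns (start_day, end_day, last_days, cost_sum) per consecutive run."""
    keyed = []
    for rank, (dt, cost) in enumerate(sorted(rows), start=1):
        # Same idea as date_sub(dt, ranker): constant within a consecutive run
        anchor = datetime.strptime(dt, "%Y%m%d").date() - timedelta(days=rank)
        keyed.append((anchor, dt, cost))
    result = []
    # Rows are in date order, so equal anchors are adjacent and groupby is enough
    for _, grp in groupby(keyed, key=lambda r: r[0]):
        grp = list(grp)
        result.append((grp[0][1], grp[-1][1], len(grp), sum(c for _, _, c in grp)))
    return result

# Consumption rows (is_call = 1) of user 156 from the test data
rows_156 = [("20190531", 20), ("20190601", 20), ("20190602", 10),
            ("20190605", 10), ("20190606", 10), ("20190607", 10),
            ("20190609", 20), ("20190610", 20)]
streaks_156 = streaks(rows_156)
for s in streaks_156:
    print(s)
```

For user 156 the anchors are 5.30 for the first three rows, 6.1 for the next three, and 6.2 for the last two, which gives the three runs in the output above.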
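3. Consecutive consumption days up to and including 6.10, where a break resets the count (check-in streak)
Example: user 156 consumed on 6.10; going backwards, 6.9 was also a consumption day, but 6.8 was not, so the current streak is 2 days. This is often used for check-in style features: if the user misses a day, the accumulated check-in count starts over.
Method 1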
select
*
from
(
select
passenger_phone,
min(dt) as start_time,
max(dt) as end_time,
count(1) as day_cnt
from
(
select
*,
row_number() over(partition by passenger_phone order by dt) as ranker
from
tmp_time_exp
where
is_call = 1
)aa
group by
passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)
)bb
where
end_time = '20190610'
Take the query from problem 2 and restrict the end date to today (6.10) to get the answer.
Method 2
with end_dt as
(
select
passenger_phone,
max(dt) as end_dt
from
tmp_time_exp
where
dt between '20190531' and '20190610'
and is_call = 0 -- first find the latest non-consumption date
group by
passenger_phone
)
select
aa.dt,
aa.passenger_phone,
datediff(from_unixtime(unix_timestamp(aa.dt,'yyyyMMdd'),'yyyy-MM-dd'),from_unixtime(unix_timestamp(bb.end_dt,'yyyyMMdd'),'yyyy-MM-dd')) as day_cnt
from
(
select
dt,
passenger_phone
from
tmp_time_exp
where
dt = '20190610' -- users active on the most recent day (here 6.10)
)aa
join
end_dt as bb
on
aa.passenger_phone = bb.passenger_phone
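First get each user's latest non-consumption date: counting back from 6.10, we can stop at the first non-consumption date we meet, and the distance between the two dates is the length of the unbroken streak up to 6.10. Because of the inner join, a user with no non-consumption day in the range does not appear in the result.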
Both methods give the same day counts (shown here with Method 1's columns)
passenger_phone start_time end_time day_cnt
156 20190609 20190610 2
173 20190610 20190610 1
187 20190608 20190610 3
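Method 2 of problem 3 can be sanity-checked with a small Python sketch. The function name `current_streak` and the dict layout are assumptions for illustration, and the second user at the bottom is made-up data to show the inner-join caveat.

```python
from datetime import datetime

def current_streak(records, today):
    """records: dict of dt -> is_call (0 or 1) for one user, one entry per day.
    Returns days between today and the latest non-consumption day,
    or None when there is no such day (the SQL inner join drops that user)."""
    gaps = [dt for dt, flag in records.items() if flag == 0 and dt <= today]
    if not gaps:
        return None
    fmt = "%Y%m%d"
    last_gap = max(gaps)  # yyyyMMdd strings sort in date order
    return (datetime.strptime(today, fmt) - datetime.strptime(last_gap, fmt)).days

# User 187 from the test data
records_187 = {
    "20190531": 0, "20190601": 1, "20190602": 1, "20190603": 1,
    "20190604": 1, "20190605": 0, "20190606": 1, "20190607": 0,
    "20190608": 1, "20190609": 1, "20190610": 1,
}
streak_187 = current_streak(records_187, "20190610")
print(streak_187)  # 3

# Hypothetical user who consumed every day: no gap date to join against
no_gap = current_streak({"20190609": 1, "20190610": 1}, "20190610")
print(no_gap)  # None
```

If such users matter, one option is to fall back to the user's first recorded date, which is a design choice outside the original query.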
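4. Longest consecutive consumption streak
Example: for user 156 the consecutive consumption periods are 5.31-6.2, 6.5-6.7, and 6.9-6.10, with lengths of 3, 3, and 2 days and amounts of 50, 30, and 40 respectively. This is really a follow-up to problem 2.
Method 1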
select
passenger_phone,
start_day,
end_day,
last_days,
rank() over(partition by passenger_phone order by last_days desc) as appose_rank, -- keeps every streak tied for first place
row_number() over(partition by passenger_phone order by last_days desc) as last_ranker -- breaks ties, only one row ranks first
from
(
select
passenger_phone,
min(dt) as start_day,
max(dt) as end_day,
count(1) as last_days
from
(
select
*,
row_number() over(partition by passenger_phone order by dt) as ranker
from
tmp_time_exp
where
is_call != 0
)a
group by
passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)
)aa
having
-- last_ranker = 1
appose_rank = 1
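This reuses the solution to problem 2 and adds one more layer on top of its result to pick out the longest consecutive consumption streak.
Method 2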
select
cc.*,
length(dd) as max_length,
row_number() over(partition by passenger_phone order by length(dd) desc) as ranker
from
(
select
passenger_phone,
concat_ws('',collect_list(is_call)) as call_list -- collect_list is the built-in aggregate; there is no collect()
from
(
select
dt,
passenger_phone,
is_call
from
tmp_time_exp
order by
passenger_phone desc, dt desc
)aa
group by
passenger_phone
)cc
lateral view explode(split(call_list,'0')) asTable as dd
having
ranker = 1
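This is a somewhat tricky approach that an interviewer hinted at during an interview, and it solves the same problem: concatenate each user's is_call flags in date order into one string, split it on '0', and the longest remaining piece of 1s is the longest streak. If the dates are needed as well it gets a bit more involved: part of the date data has to be concatenated in up front and unpacked again afterwards.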
The longest streak lengths agree between the two methods (shown here with Method 1's output)
passenger_phone start_day end_day last_days appose_rank last_ranker
156 20190531 20190602 3 1 1
156 20190605 20190607 3 1 2
173 20190610 20190610 1 1 1
187 20190601 20190604 4 1 1
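The string-splitting idea behind Method 2 of problem 4 fits in a few lines of Python. This is a sketch of the idea only; `longest_run` is a hypothetical helper, and the flags are assumed to be already in date order, which the SQL version tries to arrange with its ordered subquery.

```python
def longest_run(flags):
    """flags: one user's is_call values (0 or 1) in date order.
    Join them into a string, split on '0', and take the longest piece of 1s."""
    call_list = "".join(str(f) for f in flags)             # like concat_ws('', collect_list(is_call))
    return max(len(run) for run in call_list.split("0"))   # like explode(split(call_list, '0'))

# is_call flags from the test data, 5.31 through 6.10
flags_156 = [1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1]
flags_187 = [0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1]

longest_156 = longest_run(flags_156)
longest_187 = longest_run(flags_187)
print(longest_156, longest_187)  # 3 4
```

Only the length survives the split, which is why carrying the dates along requires extra packing and unpacking.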
5. Peak consumption date
Example: the date on which the most users consumed
Method 1
select
dt,
passenger_phone,
is_call_cnt,
rank() over(order by is_call_cnt desc) as call_ord_ranker
from
(
select
*,
sum(is_call) over(partition by dt) as is_call_cnt
from
tmp_time_exp
)aa
having
call_ord_ranker = 1
Method 2
select
*,
first_value(dt) over(order by is_call_cnt desc) as max_dt
from
(
select
*,
sum(is_call) over(partition by dt) as is_call_cnt
from
tmp_time_exp
)aa
having
max_dt = dt
Result (Method 2's columns)
dt passenger_phone is_call cost is_call_cnt max_dt
20190610 187 1 10 3.0 20190610
20190610 173 1 10 3.0 20190610
20190610 156 1 20 3.0 20190610
6. Date on which cumulative consumption reaches x yuan
Example: user 156 first reaches 50 yuan on 6.2 and first reaches 100 yuan on 6.9
select
passenger_phone,
max(min_gt50_dt) as min_gt50_dt,
max(min_gt100_dt) as min_gt100_dt
from
(
select
*,
min(dt) over(partition by passenger_phone,if(cost_until_today >= 50,1,0)) as min_gt50_dt,
min(dt) over(partition by passenger_phone,if(cost_until_today >= 100,1,0)) as min_gt100_dt
from
(
select
dt,
passenger_phone,
cost,
sum(cost) over(partition by passenger_phone order by dt) as cost_until_today
from
tmp_time_exp
)aa
)bb
group by
passenger_phone
Result
passenger_phone min_gt50_dt min_gt100_dt
156 20190602 20190609
173 20190609 20190609
187 20190603 20190604
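The core of this is the sum() over(partition by ... order by dt) clause, which gives the total within the partition up to dt, that is, a running total. Note that a user who never reaches a threshold still gets a date back: user 173 spends only 10 in total, yet the result shows 20190609, the user's earliest date, because the max is then taken over the below-threshold partition alone. For how to control the frame boundaries, see problem 7 below.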
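The running-total logic, together with the never-reached case, can be sketched in Python as follows. This is an illustration rather than a translation of the query: `first_day_reaching` is a hypothetical helper, and it returns None when the threshold is never reached, which is one possible way to handle users like 173.

```python
from itertools import accumulate

def first_day_reaching(rows, threshold):
    """rows: (dt, cost) pairs for one user. Returns the first dt whose running
    total is at least threshold, or None if the user never gets there."""
    rows = sorted(rows)
    running = accumulate(cost for _, cost in rows)  # like sum(cost) over(... order by dt)
    for (dt, _), total in zip(rows, running):
        if total >= threshold:
            return dt
    return None

# All rows of users 156 and 173 from the test data
rows_156 = [("20190531", 20), ("20190601", 20), ("20190602", 10),
            ("20190603", 0), ("20190604", 0), ("20190605", 10),
            ("20190606", 10), ("20190607", 10), ("20190608", 0),
            ("20190609", 20), ("20190610", 20)]
rows_173 = [("20190609", 0), ("20190610", 10)]

gt50_156 = first_day_reaching(rows_156, 50)
gt100_156 = first_day_reaching(rows_156, 100)
gt50_173 = first_day_reaching(rows_173, 50)
print(gt50_156, gt100_156, gt50_173)  # 20190602 20190609 None
```

In SQL a comparable effect could be obtained with a conditional aggregate such as min(if(cost_until_today >= 50, dt, null)) grouped by user, which yields NULL for users below the threshold.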
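7. Find the maximum consumption within a time range
Example: suppose the request is to find, within the three days before and after 6.5, the day with the highest consumption amount. Range-style maxima like this are almost always done with a window function, along the lines of max() over(partition by ... order by dt rows between 3 preceding and 3 following). For the row at dt this covers the three rows before and the three rows after, seven rows in total including the current one (seven days when there is one row per day), and returns the maximum within that frame. Replacing the window aggregate with sum gives the total over the same frame.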
select
dt,
passenger_phone,
cost,
max(cost) over(partition by passenger_phone order by dt rows between unbounded preceding and current row) as until_cur_max,
max(cost) over(partition by passenger_phone order by dt) as until_cur_max2, -- same effect as the line above
max(cost) over(partition by passenger_phone order by dt rows between 3 preceding and 3 following) as before3later3_max,
sum(cost) over(partition by passenger_phone order by dt rows between 3 preceding and 3 following) as before3later3_sum
from
tmp_time_exp
Result
dt passenger_phone cost until_cur_max until_cur_max2 before3later3_max before3later3_sum
20190531 156 20 20 20 20 50
20190601 156 20 20 20 20 50
20190602 156 10 20 20 20 60
20190603 156 0 20 20 20 70
20190604 156 0 20 20 20 60
20190605 156 10 20 20 10 40
20190606 156 10 20 20 20 50
20190607 156 10 20 20 20 70
20190608 156 0 20 20 20 70
20190609 156 20 20 20 20 60
20190610 156 20 20 20 20 50
20190609 173 0 0 0 10 10
20190610 173 10 10 10 10 10
20190531 187 0 0 0 30 60
20190601 187 10 10 10 40 100
20190602 187 20 20 20 40 100
20190603 187 30 30 30 40 110
20190604 187 40 40 40 40 110
20190605 187 0 40 40 40 120
20190606 187 10 40 40 40 120
20190607 187 0 40 40 40 100
20190608 187 20 40 40 20 60
20190609 187 20 40 40 20 60
20190610 187 10 40 40 20 50
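The rows between 3 preceding and 3 following frame from problem 7 can be reproduced in Python to see how it is clipped at the partition edges. This is a minimal sketch; `window_stats` is a hypothetical helper and the input is assumed to be one user's cost values in date order, one per day.

```python
def window_stats(costs, before=3, after=3):
    """For each position, return (max, sum) over the frame of `before` rows
    back through `after` rows ahead, clipped at the ends of the partition."""
    out = []
    for i in range(len(costs)):
        frame = costs[max(0, i - before): i + after + 1]
        out.append((max(frame), sum(frame)))
    return out

# cost of user 156 from 5.31 through 6.10
costs_156 = [20, 20, 10, 0, 0, 10, 10, 10, 0, 20, 20]
stats_156 = window_stats(costs_156)

print(stats_156[0])  # 5.31: frame holds only 4 rows -> (20, 50)
print(stats_156[5])  # 6.5: full 7-row frame 6.2-6.8 -> (10, 40)
```

The pair for 6.5 matches before3later3_max and before3later3_sum in the result table for user 156.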