This problem generalizes to many similar ones: consecutive months of membership top-ups, consecutive days with at least one sale, consecutive ride-hailing trips, consecutive overdue payments, and so on.
Input data
Fields: user ID, login date
user01,2018-02-28
user01,2018-03-01
user01,2018-03-02
user01,2018-03-04
user01,2018-03-05
user01,2018-03-06
user01,2018-03-07
user02,2018-03-01
user02,2018-03-02
user02,2018-03-03
user02,2018-03-06
Output columns
+----------+--------+-------------+-------------+--+
| user_id  | times  | start_date  |  end_date   |
+----------+--------+-------------+-------------+--+
Approach 1
Sort each user's login dates, then subtract the date on row n-2 from the date on row n; if the difference is exactly 2 days, those three rows are three consecutive login days.
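Approach 1 can be written directly with the lag window function. This is only a sketch against the wedw_dw.t_login_info table defined below, and it flags streaks rather than producing the start_date/end_date columns the problem asks for:

```sql
-- Sketch of approach 1: compare each login date with the one two rows
-- earlier in the same user's date-ordered window.
select
  user_id,
  login_date                 -- the third day of a 3-day streak
from
(
  select
    user_id,
    login_date,
    lag(login_date, 2) over(partition by user_id order by login_date) as two_rows_back
  from wedw_dw.t_login_info
) t
where datediff(login_date, two_rows_back) = 2;
```

Because this only marks the third day of each streak, deriving the streak's start and end dates still takes extra work, which is one reason the full solution below uses approach 2.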
Approach 2
Open a window per user, sort within the window, and subtract each row's number from its login date: row_number() over(...).
Create the table
create table wedw_dw.t_login_info(
user_id string COMMENT 'user ID'
,login_date date COMMENT 'login date'
)
row format delimited fields terminated by ',';
Load the data
hdfs dfs -put /test/login.txt /data/hive/test/wedw/dw/t_login_info/
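The hdfs put above writes the file straight into the table's warehouse directory. Equivalently (a sketch assuming the same local file and table), Hive's load data statement can do the load from inside a Hive session:

```sql
-- Copy the local file into the table's HDFS directory via Hive.
load data local inpath '/test/login.txt' into table wedw_dw.t_login_info;
```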
Verify the data
select * from wedw_dw.t_login_info;
+----------+-------------+--+
| user_id | login_date |
+----------+-------------+--+
| user01 | 2018-02-28 |
| user01 | 2018-03-01 |
| user01 | 2018-03-02 |
| user01 | 2018-03-04 |
| user01 | 2018-03-05 |
| user01 | 2018-03-06 |
| user01 | 2018-03-07 |
| user02 | 2018-03-01 |
| user02 | 2018-03-02 |
| user02 | 2018-03-03 |
| user02 | 2018-03-06 |
+----------+-------------+--+
Full solution (approach 2)
select
t2.user_id as user_id,
count(1) as times,
min(t2.login_date) as start_date,
max(t2.login_date) as end_date
from
(
select
t1.user_id,
t1.login_date,
date_sub(t1.login_date,rn) as date_diff
from
(
select
user_id,
login_date,
row_number() over(partition by user_id order by login_date asc) as rn
from
wedw_dw.t_login_info
) t1
) t2
group by
t2.user_id, t2.date_diff
having times >= 3;
Result
+----------+--------+-------------+-------------+--+
| user_id | times | start_date | end_date |
+----------+--------+-------------+-------------+--+
| user01 | 3 | 2018-02-28 | 2018-03-02 |
| user01 | 4 | 2018-03-04 | 2018-03-07 |
| user02 | 3 | 2018-03-01 | 2018-03-03 |
+----------+--------+-------------+-------------+--+
Step by step
- First, partition the rows by user id and number them in login-date order
select
user_id
,login_date
,row_number() over(partition by user_id order by login_date asc) as rn
from
wedw_dw.t_login_info
+----------+-------------+-----+--+
| user_id | login_date | rn |
+----------+-------------+-----+--+
| user01 | 2018-02-28 | 1 |
| user01 | 2018-03-01 | 2 |
| user01 | 2018-03-02 | 3 |
| user01 | 2018-03-04 | 4 |
| user01 | 2018-03-05 | 5 |
| user01 | 2018-03-06 | 6 |
| user01 | 2018-03-07 | 7 |
| user02 | 2018-03-01 | 1 |
| user02 | 2018-03-02 | 2 |
| user02 | 2018-03-03 | 3 |
| user02 | 2018-03-06 | 4 |
+----------+-------------+-----+--+
- Subtract the row number rn from the login date; rows whose resulting dates are equal must belong to one run of consecutive days
select
t1.user_id
,t1.login_date
,date_sub(t1.login_date,rn) as date_diff
from
(
select
user_id
,login_date
,row_number() over(partition by user_id order by login_date asc) as rn
from
wedw_dw.t_login_info
) t1
;
+----------+-------------+-------------+--+
| user_id | login_date | date_diff |
+----------+-------------+-------------+--+
| user01 | 2018-02-28 | 2018-02-27 |
| user01 | 2018-03-01 | 2018-02-27 |
| user01 | 2018-03-02 | 2018-02-27 |
| user01 | 2018-03-04 | 2018-02-28 |
| user01 | 2018-03-05 | 2018-02-28 |
| user01 | 2018-03-06 | 2018-02-28 |
| user01 | 2018-03-07 | 2018-02-28 |
| user02 | 2018-03-01 | 2018-02-28 |
| user02 | 2018-03-02 | 2018-02-28 |
| user02 | 2018-03-03 | 2018-02-28 |
| user02 | 2018-03-06 | 2018-03-02 |
+----------+-------------+-------------+--+
- Group by user_id and the date difference date_diff: within each group, the minimum login date is the streak's start_date, the maximum is its end_date, and count(1) is the number of consecutive login days, times
select
t2.user_id as user_id
,count(1) as times
,min(t2.login_date) as start_date
,max(t2.login_date) as end_date
from
(
select
t1.user_id
,t1.login_date
,date_sub(t1.login_date,rn) as date_diff
from
(
select
user_id
,login_date
,row_number() over(partition by user_id order by login_date asc) as rn
from
wedw_dw.t_login_info
) t1
) t2
group by
t2.user_id
,t2.date_diff
having times >= 3
;
+----------+--------+-------------+-------------+--+
| user_id | times | start_date | end_date |
+----------+--------+-------------+-------------+--+
| user01 | 3 | 2018-02-28 | 2018-03-02 |
| user01 | 4 | 2018-03-04 | 2018-03-07 |
| user02 | 3 | 2018-03-01 | 2018-03-03 |
+----------+--------+-------------+-------------+--+
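One caveat the sample data does not exercise: if a user can log in several times on the same day, the duplicate dates break the row_number trick (rn keeps increasing while the date does not). A common fix, sketched here against the same table, is to deduplicate before numbering:

```sql
select
  user_id,
  login_date,
  row_number() over(partition by user_id order by login_date asc) as rn
from
(
  -- collapse multiple logins on the same day into one row
  select distinct user_id, login_date
  from wedw_dw.t_login_info
) d;
```

With this inner query substituted for the raw table, the rest of the solution is unchanged.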