使用hql-統計連續登陸的三天及以上的用戶

@

這個問題可以擴展到很多相似的問題:連續幾個月充值會員、連續天數有商品賣出、連續打車、連續逾期……

數據提供

 用戶ID、登入日期
 user01,2018-02-28
 user01,2018-03-01
 user01,2018-03-02
 user01,2018-03-04
 user01,2018-03-05
 user01,2018-03-06
 user01,2018-03-07
 user02,2018-03-01
 user02,2018-03-02
 user02,2018-03-03
 user02,2018-03-06

輸出字段

+---------+--------+-------------+-------------+--+
|   uid   | times  | start_date  |  end_date   |
+---------+--------+-------------+-------------+--+

解法一

先對每個用戶的登錄日期排序,然後拿第n行的日期,減第n-2行的日期,如果等於2,就說明連續三天登錄了。

解法二

開窗,窗囗內部排序然後做差

rownumber() oover

建表

create table wedw_dw.t_login_info(
 user_id string  COMMENT '用戶ID'
,login_date date COMMENT '登錄日期'
)
row format delimited fields terminated by ',';

導數據

hdfs dfs -put /test/login.txt /data/hive/test/wedw/dw/t_login_info/

驗證數據

select * from wedw_dw.t_login_info;
+----------+-------------+--+
| user_id  | login_date  |
+----------+-------------+--+
| user01   | 2018-02-28  |
| user01   | 2018-03-01  |
| user01   | 2018-03-02  |
| user01   | 2018-03-04  |
| user01   | 2018-03-05  |
| user01   | 2018-03-06  |
| user01   | 2018-03-07  |
| user02   | 2018-03-01  |
| user02   | 2018-03-02  |
| user02   | 2018-03-03  |
| user02   | 2018-03-06  |
+----------+-------------+--+

解決方案-使用解法二

select
 t2.user_id         as user_id,
 count(1)           as times,
 min(t2.login_date) as start_date,
 max(t2.login_date) as end_date
from
(
    select
     t1.user_id,
     t1.login_date,
     date_sub(t1.login_date,rn) as date_diff
    from
    (
        select
         user_id,
         login_date,
         row_number() over(partition by user_id order by login_date asc) as rn 
        from
        wedw_dw.t_login_info
    ) t1
) t2
group by 
 t2.user_id, t2.date_diff
having times >= 3;

結果

+----------+--------+-------------+-------------+--+
| user_id  | times  | start_date  |  end_date   |
+----------+--------+-------------+-------------+--+
| user01   | 3      | 2018-02-28   | 2018-03-02  |
| user01   | 4      | 2018-03-04  | 2018-03-07   |
| user02   | 3      | 2018-03-01   | 2018-03-03  |
+----------+--------+-------------+-------------+--+

思路

  1. 先把數據按照用戶id分組,根據登錄日期排序
select
	user_id
	,login_date
	,row_number() over(partition by user_id order by login_date asc) as rn 
	from
	wedw_dw.t_login_info

+----------+-------------+-----+--+
| user_id  | login_date  | rn  |
+----------+-------------+-----+--+
| user01   | 2018-02-28  | 1   |
| user01   | 2018-03-01  | 2   |
| user01   | 2018-03-02  | 3   |
| user01   | 2018-03-04  | 4   |
| user01   | 2018-03-05  | 5   |
| user01   | 2018-03-06  | 6   |
| user01   | 2018-03-07  | 7   |
| user02   | 2018-03-01  | 1   |
| user02   | 2018-03-02  | 2   |
| user02   | 2018-03-03  | 3   |
| user02   | 2018-03-06  | 4   |
+----------+-------------+-----+--+
  1. 用登錄日期減去排序數字rn,得到的差值日期如果是相等的,則說明這兩天肯定是連續的
select
     t1.user_id
    ,t1.login_date
    ,date_sub(t1.login_date,rn) as date_diff
    from
    (
        select
         user_id
        ,login_date
        ,row_number() over(partition by user_id order by login_date asc) as rn 
        from
        wedw_dw.t_login_info
    ) t1
    ;


+----------+-------------+-------------+--+
| user_id  | login_date  |  date_diff  |
+----------+-------------+-------------+--+
| user01   | 2018-02-28  | 2018-02-27  |
| user01   | 2018-03-01  | 2018-02-27  |
| user01   | 2018-03-02  | 2018-02-27  |
| user01   | 2018-03-04  | 2018-02-28  |
| user01   | 2018-03-05  | 2018-02-28  |
| user01   | 2018-03-06  | 2018-02-28  |
| user01   | 2018-03-07  | 2018-02-28  |
| user02   | 2018-03-01  | 2018-02-28  |
| user02   | 2018-03-02  | 2018-02-28  |
| user02   | 2018-03-03  | 2018-02-28  |
| user02   | 2018-03-06  | 2018-03-02  |
+----------+-------------+-------------+--+
  1. 根據user_id和日期差date_diff 分組,最小登錄日期即爲此次連續登錄的開始日期start_date,最大登錄日期即爲結束日期end_date,登錄次數即爲分組後的count(1)
select
 t2.user_id         as user_id
,count(1)           as times
,min(t2.login_date) as start_date
,max(t2.login_date) as end_date
from
(
    select
     t1.user_id
    ,t1.login_date
    ,date_sub(t1.login_date,rn) as date_diff
    from
    (
        select
         user_id
        ,login_date
        ,row_number() over(partition by user_id order by login_date asc) as rn 
        from
        wedw_dw.t_login_info
    ) t1
) t2
group by 
 t2.user_id
,t2.date_diff
having times >= 3
;

+----------+--------+-------------+-------------+--+
| user_id  | times  | start_date  |  end_date   |
+----------+--------+-------------+-------------+--+
| user01   | 3      | 2018-02-28   | 2018-03-02  |
| user01    | 4      | 2018-03-04  | 2018-03-07  |
| user02   | 3      | 2018-03-01   | 2018-03-03  |
+----------+--------+-------------+-------------+--+
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章