Hive Window Functions (V1.0)

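I recommend reading the original author's article, which is clearly organized and easy to read. I reposted it here for my own future reference.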

https://www.jianshu.com/p/12eaf61cf6e1

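I: Preface

According to the official documentation, window functions were introduced as an enhancement to Hive SQL. Offline data analysis logic keeps getting more complex, and many scenarios now call for them. What follows is a summary of Hive window functions, with examples.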

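II: Understanding the WINDOW clause (flexible control over which rows of the partition fall into the window)

PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no limit (used together with PRECEDING or FOLLOWING)
UNBOUNDED PRECEDING: the first row of the partition (the start)
UNBOUNDED FOLLOWING: the last row of the partition (the end)
For example:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (from the start to the current row)
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING (from 2 rows before to 1 row after the current row)
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW (from 2 rows before to the current row)
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING (from the current row to the end)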
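
To make the frame boundaries concrete, here is a small plain-Python sketch of which rows a ROWS frame covers. It is not how Hive implements frames; the helper name `frame_indices` and the convention of using `None` for UNBOUNDED and `0` for CURRENT ROW are assumptions made for illustration only.

```python
from typing import List, Optional


def frame_indices(i: int, n_rows: int,
                  preceding: Optional[int],
                  following: Optional[int]) -> List[int]:
    """Row indices covered by a ROWS frame for the row at index i.

    preceding / following are offsets relative to the current row.
    None stands for UNBOUNDED, 0 stands for CURRENT ROW (illustrative convention).
    """
    start = 0 if preceding is None else max(0, i - preceding)
    end = n_rows - 1 if following is None else min(n_rows - 1, i + following)
    return list(range(start, end + 1))


# A partition of 7 rows, with the current row at index 3 (the 4th row).
running = frame_indices(3, 7, None, 0)   # UNBOUNDED PRECEDING .. CURRENT ROW
sliding = frame_indices(3, 7, 2, 1)      # 2 PRECEDING .. 1 FOLLOWING
trailing = frame_indices(3, 7, 2, 0)     # 2 PRECEDING .. CURRENT ROW
to_end = frame_indices(3, 7, 0, None)    # CURRENT ROW .. UNBOUNDED FOLLOWING

print(running, sliding, trailing, to_end)
```

Near the edges of the partition the frame is simply cut short: for the first row, "2 PRECEDING AND 1 FOLLOWING" only covers the first two rows.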
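The official documentation lists the window functions that do not support the WINDOW clause: the ranking functions (RANK, NTILE, DENSE_RANK, CUME_DIST, PERCENT_RANK) and the LEAD and LAG functions.
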

III: Preparing the demo data

 

insert overwrite table dw_tmp.window_function_temp
select 
split(detail,',')[0] as uname
,split(detail,',')[1] as create_time
,split(detail,',')[2] as pv
from
(
    -- records are separated by '#', fields by ','
    -- each piece is its own literal so that no line break or padding leaks into the pv field
    select
    concat('測試用戶,2019-10-02,7'
    ,'#測試用戶,2019-10-05,4'
    ,'#測試用戶,2019-10-07,5'
    ,'#測試用戶,2019-10-03,6'
    ,'#測試用戶,2019-10-04,3'
    ,'#測試用戶,2019-10-01,3'
    ,'#測試用戶,2019-10-06,4') as ct_str
) t
lateral view explode(split(ct_str,'#')) t2 as detail;

 

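IV: Windowing functions

1.LEAD(col,n,DEFAULT) returns the value of col from the n-th row after the current row within the window. The first argument is the column name; the second is the offset n (optional, defaults to 1, must not be negative); the third is the default value, returned when there is no n-th row after the current row (NULL if not specified).

2.LAG(col,n,DEFAULT) returns the value of col from the n-th row before the current row within the window. The first argument is the column name; the second is the offset n (optional, defaults to 1, must not be negative); the third is the default value, returned when there is no n-th row before the current row (NULL if not specified).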

 

select 
uname
,create_time
,pv
,lead(pv,1,-9999) over (partition by uname order by create_time) as lead_1_pv
,lag(pv,1,-9999) over (partition by uname order by create_time) as lag_1_pv
from dw_tmp.window_function_temp;
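
As a rough illustration of what this query computes, the sketch below replays LEAD and LAG in plain Python over the demo rows. The helper names `lead` and `lag` and the tuple layout are assumptions for illustration; it only mimics the semantics for a single partition and is not Hive code.

```python
# Demo rows as (create_time, pv); there is a single user, hence a single partition.
rows = [
    ("2019-10-02", 7), ("2019-10-05", 4), ("2019-10-07", 5), ("2019-10-03", 6),
    ("2019-10-04", 3), ("2019-10-01", 3), ("2019-10-06", 4),
]


def lead(values, n=1, default=None):
    """Value n rows after each row, or `default` when that row does not exist."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]


def lag(values, n=1, default=None):
    """Value n rows before each row, or `default` when that row does not exist."""
    return [values[i - n] if i - n >= 0 else default
            for i in range(len(values))]


# ORDER BY create_time: ISO dates sort correctly as strings.
ordered = sorted(rows)
pv = [p for _, p in ordered]

lead_1_pv = lead(pv, 1, -9999)
lag_1_pv = lag(pv, 1, -9999)

for (day, value), nxt, prev in zip(ordered, lead_1_pv, lag_1_pv):
    print(day, value, nxt, prev)
```

The last row has nothing after it and the first row has nothing before it, so those two cells fall back to the default of -9999.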

 

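3.FIRST_VALUE returns the first value in the partition after sorting, up to the current row. It takes at most two arguments: the first is the column whose first value you want; the second (optional) must be a boolean and defaults to false. If it is set to true, NULL values are skipped.

4.LAST_VALUE returns the last value in the partition after sorting, up to the current row. It takes at most two arguments: the first is the column whose last value you want; the second (optional) must be a boolean and defaults to false. If it is set to true, NULL values are skipped.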

 

select 
uname
,create_time
,pv
,first_value(pv) over (partition by uname order by create_time rows between unbounded preceding and current row) as first_value_pv
,last_value(pv) over (partition by uname order by create_time rows between unbounded preceding and current row) as last_value_pv
from dw_tmp.window_function_temp;
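
The sketch below imitates FIRST_VALUE and LAST_VALUE in plain Python for the frame "from the start to the current row", including the optional skip-NULL flag. The function names and the use of `None` for NULL are illustrative assumptions, not Hive internals.

```python
def first_value(values, skip_nulls=False):
    """First value of the frame UNBOUNDED PRECEDING .. CURRENT ROW, per row."""
    out = []
    for i in range(len(values)):
        frame = values[: i + 1]
        if skip_nulls:
            frame = [v for v in frame if v is not None]
        out.append(frame[0] if frame else None)
    return out


def last_value(values, skip_nulls=False):
    """Last value of the frame UNBOUNDED PRECEDING .. CURRENT ROW, per row."""
    out = []
    for i in range(len(values)):
        frame = values[: i + 1]
        if skip_nulls:
            frame = [v for v in frame if v is not None]
        out.append(frame[-1] if frame else None)
    return out


# pv of the demo data, ordered by create_time.
pv = [3, 7, 6, 3, 4, 4, 5]
first_value_pv = first_value(pv)
last_value_pv = last_value(pv)
print(first_value_pv)
print(last_value_pv)

# A made-up column with NULLs, to show the effect of the second argument.
with_nulls = [None, 7, None, 3]
print(first_value(with_nulls, skip_nulls=True))
print(last_value(with_nulls, skip_nulls=True))
```

With this frame, LAST_VALUE is just the current row's own value, because the frame ends at the current row. To get the last value of the whole partition, the frame would have to extend to UNBOUNDED FOLLOWING.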

 

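Now let's put the default form and the explicit window clause side by side and compare the results, even though FIRST_VALUE and LAST_VALUE are not often combined with a window clause.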

select 
uname
,create_time
,pv
,first_value(pv) over (partition by uname order by create_time) as first_value_pv
,first_value(pv) over (partition by uname order by create_time rows between unbounded preceding and current row) as window_first_value_pv
,last_value(pv) over (partition by uname order by create_time) as last_value_pv
,last_value(pv) over (partition by uname order by create_time rows between unbounded preceding and current row) as window_last_value_pv
from dw_tmp.window_function_temp;

 

V: Aggregate functions

1.COUNT
2.SUM
3.MIN
4.MAX
5.AVG
These five aggregate functions are currently supported as window functions. The commonly used SUM serves as the example.

select 
uname
,create_time
,pv
,SUM(pv) over (partition by uname order by create_time) as sum_pv_1 -- default frame
,SUM(pv) over (partition by uname order by create_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_pv_2 -- from the start to the current row
,SUM(pv) over (partition by uname) as sum_pv_3 -- all rows in the partition
,SUM(pv) over (partition by uname order by create_time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum_pv_4 -- from the start to the end
,SUM(pv) over (partition by uname order by create_time ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) as sum_pv_5 -- from 2 rows before to 1 row after
from dw_tmp.window_function_temp;

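The results show that with ORDER BY and no window clause, the default frame runs from the start to the current row (strictly, it is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which gives the same result as the ROWS form here because create_time has no duplicates). Without ORDER BY, every row in the partition takes part in the aggregation. There are other variations as well, which readers can try for themselves.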
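
The five SUM columns can be reproduced by hand. The following plain-Python sketch is an illustration of the frame arithmetic only, with `window_sum` as a made-up helper and `None` standing for UNBOUNDED. It computes each column for the demo data ordered by create_time.

```python
def window_sum(values, preceding, following):
    """SUM over a ROWS frame for every row; None means UNBOUNDED, 0 means CURRENT ROW."""
    n = len(values)
    out = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n - 1 if following is None else min(n - 1, i + following)
        out.append(sum(values[start:end + 1]))
    return out


# pv of the demo data, ordered by create_time (2019-10-01 .. 2019-10-07).
pv = [3, 7, 6, 3, 4, 4, 5]

sum_pv_2 = window_sum(pv, None, 0)      # UNBOUNDED PRECEDING .. CURRENT ROW
sum_pv_1 = sum_pv_2                     # default frame; same here since dates are unique
sum_pv_4 = window_sum(pv, None, None)   # UNBOUNDED PRECEDING .. UNBOUNDED FOLLOWING
sum_pv_3 = sum_pv_4                     # no ORDER BY: the whole partition
sum_pv_5 = window_sum(pv, 2, 1)         # 2 PRECEDING .. 1 FOLLOWING

print(sum_pv_2)
print(sum_pv_4)
print(sum_pv_5)
```

The first frame yields a running total, the second the partition total repeated on every row, and the third a sliding sum over at most four rows.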

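VI: Analytics functions

1.ROW_NUMBER
Numbers the rows of a partition sequentially, starting at 1, following the sort order. The values of row_number() never repeat; rows with equal sort values still get distinct numbers, in whatever order the rows happen to be read, which should not be relied on. It is typically used to pick the top-ranked record in each group, for example the first refer in a session.
2.RANK
Ranks each row within its partition; rows with equal values share a rank, and gaps are left in the ranking after them.
3.DENSE_RANK
Ranks each row within its partition; rows with equal values share a rank, and no gaps are left in the ranking.
4.CUME_DIST
CUME_DIST = (number of rows with a value less than or equal to the current value) / (total number of rows in the partition)
5.PERCENT_RANK
PERCENT_RANK = (RANK of the current row - 1) / (total number of rows in the partition - 1)
6.NTILE
NTILE(n) splits the ordered rows of a partition into n buckets and returns the bucket number of the current row. If the rows do not divide evenly, the extra rows go to the leading buckets. NTILE does not support ROWS BETWEEN.
These are the analytic window functions. They are used less often than the two categories above, but they are still worth mastering.

Let's first demonstrate the three functions in items 1-3.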

select 
uname
,create_time
,pv
,ROW_NUMBER() over (partition by uname order by pv) as row_number_pv_1
,RANK() over (partition by uname order by pv) as row_number_pv_2
,DENSE_RANK() over (partition by uname order by pv) as row_number_pv_3
from dw_tmp.window_function_temp;
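
To see how the three rankings differ on ties, here is a small plain-Python sketch over the demo pv values. It is only a hand calculation of the definitions above; the variable names mirror the functions, not the query aliases.

```python
# pv of the demo data, ORDER BY pv (ascending).
pv_sorted = sorted([7, 4, 5, 6, 3, 3, 4])
n = len(pv_sorted)

# ROW_NUMBER: 1..n, never repeats, even for tied values.
row_number = list(range(1, n + 1))

# RANK: position of the first row carrying the same value, so ties leave gaps.
rank = [pv_sorted.index(v) + 1 for v in pv_sorted]

# DENSE_RANK: position among the distinct values, so ties leave no gaps.
distinct = sorted(set(pv_sorted))
dense_rank = [distinct.index(v) + 1 for v in pv_sorted]

for row in zip(pv_sorted, row_number, rank, dense_rank):
    print(row)
```

The two rows with pv = 3 both get rank 1; RANK then jumps to 3 for the next value, while DENSE_RANK continues with 2.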

 

Items 4-5:

select 
uname
,create_time
,pv
,CUME_DIST() over (partition by uname order by pv) as CUME_DIST_pv_
,PERCENT_RANK() over (partition by uname order by pv) as PERCENT_RANK_pv_
from dw_tmp.window_function_temp;
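
The two formulas can be checked by hand on the demo data. The sketch below is a plain-Python illustration of the definitions; returning 0.0 for a single-row partition is an assumption made here to avoid dividing by zero.

```python
# pv of the demo data, ORDER BY pv (ascending).
pv_sorted = sorted([7, 4, 5, 6, 3, 3, 4])
n = len(pv_sorted)

# CUME_DIST: rows with a value <= the current value, divided by the total row count.
cume_dist = [sum(1 for x in pv_sorted if x <= v) / n for v in pv_sorted]

# PERCENT_RANK: (RANK - 1) / (total rows - 1); RANK - 1 is the index of the first tied row.
percent_rank = [pv_sorted.index(v) / (n - 1) if n > 1 else 0.0 for v in pv_sorted]

for v, c, p in zip(pv_sorted, cume_dist, percent_rank):
    print(v, round(c, 4), round(p, 4))
```

Tied rows share the same result in both columns; CUME_DIST always ends at 1, and PERCENT_RANK always starts at 0.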

Item 6: NTILE

select 
uname
,create_time
,pv
,NTILE(2) over (partition by uname order by pv) as NTILE_pv_1
,NTILE(3) over (partition by uname order by pv) as NTILE_pv_2
,NTILE(4) over (partition by uname order by pv) as NTILE_pv_3
from dw_tmp.window_function_temp;
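
The bucket assignment can be sketched in plain Python as follows. This assumes the usual SQL convention that, when the rows do not divide evenly, the first `n_rows % buckets` buckets each take one extra row; the helper `ntile` is illustrative and not Hive's code.

```python
def ntile(n_rows, buckets):
    """Bucket number (1-based) for each of n_rows ordered rows."""
    base, extra = divmod(n_rows, buckets)
    out = []
    for b in range(1, buckets + 1):
        # Assumed convention: the first `extra` buckets hold one more row each.
        size = base + (1 if b <= extra else 0)
        out.extend([b] * size)
    return out


# The demo partition has 7 rows.
ntile_pv_1 = ntile(7, 2)
ntile_pv_2 = ntile(7, 3)
ntile_pv_3 = ntile(7, 4)

print(ntile_pv_1)
print(ntile_pv_2)
print(ntile_pv_3)
```

With 7 rows, NTILE(2) gives buckets of 4 and 3 rows, NTILE(3) gives 3, 2 and 2, and NTILE(4) gives 2, 2, 2 and 1.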

 
