Hive window functions: ntile, lag, lead, first_value, last_value


Other window functions are covered in:
Window functions (sum, avg, max, min)
Window functions (row_number, rank, dense_rank)


1. Sample data

id,crtime,pv
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7

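2. ntile(n)

ntile(n) splits the ordered rows of each partition into n buckets of as equal a size as possible and returns each row's bucket number. If the rows cannot be divided evenly, the leftover rows are handed out one per bucket, starting from the first bucket.
With 1 leftover row, the first bucket gets it.
With 2 leftover rows, the first and second buckets each get one.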

2.1 Example

select id,crtime,pv,
ntile(2) over(partition by id order by crtime) n2, -- 2 buckets
ntile(3) over(partition by id order by crtime) n3, -- 3 buckets
ntile(4) over(partition by id order by crtime) n4, -- 4 buckets
ntile(5) over(partition by id order by crtime) n5  -- 5 buckets
from nt;
->
id      crtime          pv      n2      n3      n4      n5
cookie1 2015-04-10      1       1       1       1       1
cookie1 2015-04-11      5       1       1       1       1
cookie1 2015-04-12      7       1       1       2       2
cookie1 2015-04-13      3       1       2       2       2
cookie1 2015-04-14      2       2       2       3       3
cookie1 2015-04-15      4       2       3       3       4
cookie1 2015-04-16      4       2       3       4       5
cookie2 2015-04-10      2       1       1       1       1
cookie2 2015-04-11      3       1       1       1       1
cookie2 2015-04-12      5       1       1       2       2
cookie2 2015-04-13      6       1       2       2       2
cookie2 2015-04-14      3       2       2       3       3
cookie2 2015-04-15      9       2       3       3       4
cookie2 2015-04-16      7       2       3       4       5

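cookie1 has 7 rows. Split into 2 buckets, 7 / 2 leaves a remainder of 1, which goes to bucket 1, so there are four 1s and three 2s.
Split into 3 buckets, 7 / 3 leaves a remainder of 1, which goes to bucket 1, so there are three 1s, two 2s, and two 3s.
Split into 4 buckets, 7 / 4 leaves a remainder of 3, spread over buckets 1, 2, and 3, so there are two 1s, two 2s, two 3s, and one 4.
Split into 5 buckets, 7 / 5 leaves a remainder of 2, spread over buckets 1 and 2, so there are two 1s, two 2s, one 3, one 4, and one 5.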
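
The allocation rule can be checked with a few lines of Python. This is only an illustrative sketch of the arithmetic described here, not Hive's implementation, and the helper name ntile_labels is made up for the example.

```python
def ntile_labels(row_count, n):
    """Return the bucket number (1..n) for each of row_count ordered rows.

    Hypothetical helper: it mirrors the rule described in the text,
    it is not code taken from Hive.
    """
    base, extra = divmod(row_count, n)  # rows per bucket, leftover rows
    labels = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets each receive one leftover row.
        size = base + (1 if bucket <= extra else 0)
        labels.extend([bucket] * size)
    return labels


# cookie1 has 7 rows; compare with columns n2..n5 of the query result.
for n in (2, 3, 4, 5):
    print(n, ntile_labels(7, n))
```

For 7 rows and n = 3 this gives 1, 1, 1, 2, 2, 3, 3, the same sequence as the n3 column for cookie1.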

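Requirement: for each cookie, what is the total pv over the first third of its days?
Approach: use ntile(3) to split each cookie's days into three buckets, then sum pv over the rows whose ntile value is 1.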

select t.id,sum(t.pv) spv from
(select id,crtime,pv,ntile(3) over(partition by id order by crtime) nt3 from nt) t 
where t.nt3 = 1
group by t.id;
->
id      spv
cookie1 13
cookie2 10

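3. lag, lead, first_value, last_value

These functions are often used on time series. lag and lead do not accept a window frame clause (rows between ...); first_value and last_value do, and when order by is given without a frame, their frame runs from the start of the partition to the current row.
lag(col,n,default): returns the value of col from the row n rows above the current row within the window.

  • col: the column; n: how many rows up to look, 1 if omitted; default: the value returned when there is no row n rows up, null if omitted.

lead(col,n,default): returns the value of col from the row n rows below the current row within the window.

  • col: the column; n: how many rows down to look, 1 if omitted; default: the value returned when there is no row n rows down, null if omitted.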

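first_value(col): the first value in the partition, after ordering, among the rows up to the current row.
last_value(col): the last value in the partition, after ordering, among the rows up to the current row.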
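
To make the offset and default arguments of lag and lead concrete, here is a minimal Python sketch of their behaviour on one already-sorted partition. The two functions are hypothetical stand-ins written for this post, not Hive code, and they assume n is a non-negative integer.

```python
def lag(values, n=1, default=None):
    """Value n rows before each row, or `default` when no such row exists."""
    return [values[i - n] if i - n >= 0 else default
            for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value n rows after each row, or `default` when no such row exists."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]


# One partition, already sorted by the order by column.
crtime = ["2015-04-10", "2015-04-11", "2015-04-12", "2015-04-13"]

print(lag(crtime, 1, "a"))   # ['a', '2015-04-10', '2015-04-11', '2015-04-12']
print(lead(crtime, 2, "b"))  # ['2015-04-12', '2015-04-13', 'b', 'b']
```

The default only shows up at the edges of the partition: the first n rows for lag and the last n rows for lead.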

3.1 Example

select *,
lag(crtime,1,'a') over(partition by id order by crtime) lagc,
lead(crtime,2,'b') over(partition by id order by crtime) leadc,
first_value(pv) over(partition by id order by crtime) fpv,
last_value(pv) over(partition by id order by crtime) lpv 
from nt;
->
id      crtime          pv      lagc            leadc           fpv     lpv
cookie1 2015-04-10      1       a               2015-04-12      1       1
cookie1 2015-04-11      5       2015-04-10      2015-04-13      1       5
cookie1 2015-04-12      7       2015-04-11      2015-04-14      1       7
cookie1 2015-04-13      3       2015-04-12      2015-04-15      1       3
cookie1 2015-04-14      2       2015-04-13      2015-04-16      1       2
cookie1 2015-04-15      4       2015-04-14      b               1       4
cookie1 2015-04-16      4       2015-04-15      b               1       4
cookie2 2015-04-10      2       a               2015-04-12      2       2
cookie2 2015-04-11      3       2015-04-10      2015-04-13      2       3
cookie2 2015-04-12      5       2015-04-11      2015-04-14      2       5
cookie2 2015-04-13      6       2015-04-12      2015-04-15      2       6
cookie2 2015-04-14      3       2015-04-13      2015-04-16      2       3
cookie2 2015-04-15      9       2015-04-14      b               2       9
cookie2 2015-04-16      7       2015-04-15      b               2       7

3.1.1 Question 1: how do you get the last pv value of each group?

select *,
first_value(pv) over(partition by id order by crtime desc) newpv 
from nt;
->
id      crtime          pv      newpv
cookie1 2015-04-16      4       4
cookie1 2015-04-15      4       4
cookie1 2015-04-14      2       4
cookie1 2015-04-13      3       4
cookie1 2015-04-12      7       4
cookie1 2015-04-11      5       4
cookie1 2015-04-10      1       4
cookie2 2015-04-16      7       7
cookie2 2015-04-15      9       7
cookie2 2015-04-14      3       7
cookie2 2015-04-13      6       7
cookie2 2015-04-12      5       7
cookie2 2015-04-11      3       7
cookie2 2015-04-10      2       7
The rows now come out in descending crtime order. To list them in ascending order, add order by id,crtime to the query:

select *,
first_value(pv) over(partition by id order by crtime desc) newpv 
from nt
order by id,crtime;
->
id      crtime          pv      newpv
cookie1 2015-04-10      1       4
cookie1 2015-04-11      5       4
cookie1 2015-04-12      7       4
cookie1 2015-04-13      3       4
cookie1 2015-04-14      2       4
cookie1 2015-04-15      4       4
cookie1 2015-04-16      4       4
cookie2 2015-04-10      2       7
cookie2 2015-04-11      3       7
cookie2 2015-04-12      5       7
cookie2 2015-04-13      6       7
cookie2 2015-04-14      3       7
cookie2 2015-04-15      9       7
cookie2 2015-04-16      7       7
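
Another way to get the last value while keeping ascending order is to give last_value an explicit frame that covers the whole partition. The sketch below tries this with Python's built-in sqlite3 module as a stand-in for Hive, using only cookie1's rows. It assumes a SQLite build of 3.25 or later (the first with window functions); the frame clause itself is standard SQL that Hive also accepts for last_value. The column names lpv_default and lpv_full are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table nt (id text, crtime text, pv integer)")
conn.executemany(
    "insert into nt values (?, ?, ?)",
    [
        ("cookie1", "2015-04-10", 1),
        ("cookie1", "2015-04-11", 5),
        ("cookie1", "2015-04-12", 7),
        ("cookie1", "2015-04-13", 3),
        ("cookie1", "2015-04-14", 2),
        ("cookie1", "2015-04-15", 4),
        ("cookie1", "2015-04-16", 4),
    ],
)

query = """
select crtime, pv,
       -- default frame: partition start up to the current row
       last_value(pv) over (partition by id order by crtime) as lpv_default,
       -- explicit frame: the whole partition
       last_value(pv) over (partition by id order by crtime
                            rows between unbounded preceding
                                     and unbounded following) as lpv_full
from nt
order by id, crtime
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

With the default frame, lpv_default simply repeats the current row's pv, as the lpv column did in 3.1. With the explicit frame, lpv_full is 4 on every row, matching newpv for cookie1.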

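3.1.2 Question 2: what happens without order by?

Without order by, the rows within a partition come back in no guaranteed order: crtime is neither ascending nor descending, so lag and lead follow whatever order the engine happens to produce. first_value and last_value then look at the whole partition.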

select *,
lag(pv) over(partition by id) lagc,  -- defaults to the previous row; null when there is no previous row
lead(pv) over(partition by id) leadc -- defaults to the next row; null when there is no next row
from nt;
->
id      crtime          pv      lagc    leadc
cookie1 2015-04-10      1       NULL    4
cookie1 2015-04-16      4       1       4
cookie1 2015-04-15      4       4       2
cookie1 2015-04-14      2       4       3
cookie1 2015-04-13      3       2       7
cookie1 2015-04-12      7       3       5
cookie1 2015-04-11      5       7       NULL
cookie2 2015-04-16      7       NULL    9
cookie2 2015-04-15      9       7       3
cookie2 2015-04-14      3       9       6
cookie2 2015-04-13      6       3       5
cookie2 2015-04-12      5       6       3
cookie2 2015-04-11      3       5       2
cookie2 2015-04-10      2       3       NULL

select *,
first_value(pv) over(partition by id) fpv, -- first value of the group
last_value(pv) over(partition by id) lpv   -- last value of the group
from nt;
->
id      crtime          pv      fpv     lpv
cookie1 2015-04-10      1       1       5
cookie1 2015-04-16      4       1       5
cookie1 2015-04-15      4       1       5
cookie1 2015-04-14      2       1       5
cookie1 2015-04-13      3       1       5
cookie1 2015-04-12      7       1       5
cookie1 2015-04-11      5       1       5
cookie2 2015-04-16      7       7       2
cookie2 2015-04-15      9       7       2
cookie2 2015-04-14      3       7       2
cookie2 2015-04-13      6       7       2
cookie2 2015-04-12      5       7       2
cookie2 2015-04-11      3       7       2
cookie2 2015-04-10      2       7       2