Common Statistical and Analytic Functions in Hive

Preface:
       Hive provides many statistical and analytic functions that come up constantly in real-world analysis. Below is a collection of the most commonly used ones, each with a worked example.
Reference: http://lxw1234.com/archives/2015/07/367.htm

1. Basic Functions

The window clause: rows between

  • preceding: rows before the current row
  • following: rows after the current row
  • current row: the current row
  • unbounded: no bound (the start or end of the partition)
  • unbounded preceding: from the start of the partition
  • unbounded following: to the end of the partition

Note

  • If rows between is omitted, the frame defaults to "from the start of the partition to the current row"
  • If order by is omitted, all values in the partition are aggregated together
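As a quick sanity check, the frame arithmetic above can be sketched in pure Python (an illustration of the semantics, not Hive itself; `frame_sum` is a hypothetical helper, and `pv` holds the cookie1 sample values used throughout this section):

```python
# Pure-Python illustration (not Hive) of ROWS BETWEEN frame arithmetic.
pv = [1, 5, 7, 3, 2, 4, 4]  # cookie1's pv, ordered by dt

def frame_sum(values, preceding, following):
    """sum over 'rows between <preceding> preceding and <following> following';
    None means unbounded on that side."""
    out = []
    for i in range(len(values)):
        lo = 0 if preceding is None else max(0, i - preceding)
        hi = len(values) if following is None else min(len(values), i + following + 1)
        out.append(sum(values[lo:hi]))
    return out

print(frame_sum(pv, None, 0))  # unbounded preceding .. current row
print(frame_sum(pv, 3, 0))     # 3 preceding .. current row
print(frame_sum(pv, 3, 1))     # 3 preceding .. 1 following
```

The three calls reproduce the pv2, pv4 and pv5 columns of the sum example below.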

Create the table and load the data

create table tb_pv (
cookieid string,
dt string,
pv int
) row format delimited 
fields terminated by ',';

-- data
'''
cookie1,2019-04-10,1
cookie1,2019-04-11,5
cookie1,2019-04-12,7
cookie1,2019-04-13,3
cookie1,2019-04-14,2
cookie1,2019-04-15,4
cookie1,2019-04-16,4
'''

load data local inpath '/data/cookie.txt' overwrite into table test_db.tb_pv;

1.1 sum

  • SQL
select 
cookieid,
dt,
pv,
sum(pv) over(partition by cookieid order by dt) as pv1,
sum(pv) over(partition by cookieid order by dt rows between unbounded preceding and current row) as pv2,
sum(pv) over(partition by cookieid) as pv3,
sum(pv) over(partition by cookieid order by dt rows between 3 preceding and current row) as pv4,
sum(pv) over(partition by cookieid order by dt rows between 3 preceding and 1 following) as pv5,
sum(pv) over(partition by cookieid order by dt rows between current row and unbounded following) as pv6
from tb_pv order by dt;

''' query results
cookieid	dt			pv	pv1	pv2	pv3	pv4	pv5	pv6
cookie1		2019-04-10	1	1	1	26	1	6	26
cookie1		2019-04-11	5	6	6	26	6	13	25
cookie1		2019-04-12	7	13	13	26	13	16	20
cookie1		2019-04-13	3	16	16	26	16	18	13
cookie1		2019-04-14	2	18	18	26	17	21	10
cookie1		2019-04-15	4	22	22	26	16	20	8
cookie1		2019-04-16	4	26	26	26	13	13	4
'''
  • Explanation
pv1: running total of pv from the start of the partition to the current row, e.g.
		Apr 11 = Apr 10 + Apr 11 = 1 + 5 = 6
		Apr 12 = Apr 10 + Apr 11 + Apr 12 = 1 + 5 + 7 = 13
		and so on

pv2: same as pv1

pv3: sum of all pv in the partition (cookie1), i.e.
		 pv3 = 1 + 5 + 7 + 3 + 2 + 4 + 4 = 26

pv4: current row plus the 3 preceding rows, e.g.
		Apr 11 = Apr 10 + Apr 11 = 1 + 5 = 6
		Apr 12 = Apr 12 + Apr 10 + Apr 11 = 7 + 1 + 5 = 13
		Apr 13 = Apr 13 + Apr 10 + Apr 11 + Apr 12 = 3 + 1 + 5 + 7 = 16
		Apr 14 = Apr 14 + Apr 11 + Apr 12 + Apr 13 = 2 + 5 + 7 + 3 = 17

pv5: current row plus the 3 preceding rows and 1 following row, e.g.
		Apr 10 = Apr 10 + (no preceding rows) + Apr 11 = 1 + 0 + 5 = 6
		Apr 13 = Apr 13 + Apr 10 + Apr 11 + Apr 12 + Apr 14 = 3 + 1 + 5 + 7 + 2 = 18

pv6: current row plus all following rows
		Apr 10 = Apr 10 + Apr 11 + Apr 12 + Apr 13 + Apr 14 + Apr 15 + Apr 16 = 1 + 5 + 7 + 3 + 2 + 4 + 4 = 26
		Apr 13 = Apr 13 + Apr 14 + Apr 15 + Apr 16 = 3 + 2 + 4 + 4 = 13

1.2 avg

Same usage as sum in 1.1.

select 
cookieid,
dt,
pv,
avg(pv) over(partition by cookieid order by dt) as pv1,
avg(pv) over(partition by cookieid order by dt rows between unbounded preceding and current row) as pv2,
avg(pv) over(partition by cookieid) as pv3,
avg(pv) over(partition by cookieid order by dt rows between 3 preceding and current row) as pv4,
avg(pv) over(partition by cookieid order by dt rows between 3 preceding and 1 following) as pv5,
avg(pv) over(partition by cookieid order by dt rows between current row and unbounded following) as pv6
from tb_pv order by dt;

The averages come back with many decimal places, so use cast to keep two decimal places.

Syntax: cast(column_name as decimal(10,2))

select 
cookieid,
dt,
pv,
cast(avg(pv) over(partition by cookieid order by dt) as decimal(10,2)) as pv1,
cast(avg(pv) over(partition by cookieid order by dt rows between unbounded preceding and current row) as decimal(10,2)) as pv2,
cast(avg(pv) over(partition by cookieid) as decimal(10,2)) as pv3,
cast(avg(pv) over(partition by cookieid order by dt rows between 3 preceding and current row) as decimal(10,2)) as pv4,
cast(avg(pv) over(partition by cookieid order by dt rows between 3 preceding and 1 following) as decimal(10,2)) as pv5,
cast(avg(pv) over(partition by cookieid order by dt rows between current row and unbounded following) as decimal(10,2)) as pv6
from tb_pv order by dt;

Query results

cookieid	dt			pv		pv1		pv2		pv3		pv4		pv5		pv6
cookie1		2019-04-10	1		1.00	1.00	3.71	1.00	3.00	3.71
cookie1		2019-04-11	5		3.00	3.00	3.71	3.00	4.33	4.17
cookie1		2019-04-12	7		4.33	4.33	3.71	4.33	4.00	4.00
cookie1		2019-04-13	3		4.00	4.00	3.71	4.00	3.60	3.25
cookie1		2019-04-14	2		3.60	3.60	3.71	4.25	4.20	3.33
cookie1		2019-04-15	4		3.67	3.67	3.71	4.00	4.00	4.00
cookie1		2019-04-16	4		3.71	3.71	3.71	3.25	3.25	4.00

1.3 min

Same usage as sum in 1.1.

select 
cookieid,
dt,
pv,
min(pv) over(partition by cookieid order by dt) as pv1,
min(pv) over(partition by cookieid order by dt rows between unbounded preceding and current row) as pv2,
min(pv) over(partition by cookieid) as pv3,
min(pv) over(partition by cookieid order by dt rows between 3 preceding and current row) as pv4,
min(pv) over(partition by cookieid order by dt rows between 3 preceding and 1 following) as pv5,
min(pv) over(partition by cookieid order by dt rows between current row and unbounded following) as pv6
from tb_pv order by dt;

Query results

cookieid	dt			pv	pv1	pv2	pv3	pv4	pv5	pv6
cookie1		2019-04-10	1	1	1	1	1	1	1
cookie1		2019-04-11	5	1	1	1	1	1	2
cookie1		2019-04-12	7	1	1	1	1	1	2
cookie1		2019-04-13	3	1	1	1	1	1	2
cookie1		2019-04-14	2	1	1	1	2	2	2
cookie1		2019-04-15	4	1	1	1	2	2	4
cookie1		2019-04-16	4	1	1	1	2	2	4

1.4 max

Same usage as sum in 1.1.

select 
cookieid,
dt,
pv,
max(pv) over(partition by cookieid order by dt) as pv1,
max(pv) over(partition by cookieid order by dt rows between unbounded preceding and current row) as pv2,
max(pv) over(partition by cookieid) as pv3,
max(pv) over(partition by cookieid order by dt rows between 3 preceding and current row) as pv4,
max(pv) over(partition by cookieid order by dt rows between 3 preceding and 1 following) as pv5,
max(pv) over(partition by cookieid order by dt rows between current row and unbounded following) as pv6
from tb_pv order by dt;

Query results

cookieid	dt			pv	pv1	pv2	pv3	pv4	pv5	pv6
cookie1		2019-04-10	1	1	1	7	1	5	7
cookie1		2019-04-11	5	5	5	7	5	7	7
cookie1		2019-04-12	7	7	7	7	7	7	7
cookie1		2019-04-13	3	7	7	7	7	7	4
cookie1		2019-04-14	2	7	7	7	7	7	4
cookie1		2019-04-15	4	7	7	7	7	7	4
cookie1		2019-04-16	4	7	7	7	4	4	4

2. Sequence Functions

Sequence functions: ntile, row_number, rank, dense_rank

Note: sequence functions do not support the window (rows between) clause.

Data

cookie1,2019-04-10,1
cookie1,2019-04-11,5
cookie1,2019-04-12,7
cookie1,2019-04-13,3
cookie1,2019-04-14,2
cookie1,2019-04-15,4
cookie1,2019-04-16,4
cookie2,2019-04-10,2
cookie2,2019-04-11,3
cookie2,2019-04-12,5
cookie2,2019-04-13,6
cookie2,2019-04-14,3
cookie2,2019-04-15,9
cookie2,2019-04-16,7

2.1 ntile

  • ntile(n): splits the ordered rows of a partition into n buckets and returns the bucket number of the current row

    ntile does not support rows between; e.g. ntile(2) over(partition by cookieid order by createtime rows between 3 preceding and current row) is invalid

  • If the rows do not divide evenly, the extra rows are assigned to the earliest buckets

  • Commonly used to pick a percentage slice of the records, e.g. the top 30%

select 
cookieid,
dt,
pv,
ntile(2) over(partition by cookieid order by dt) as rn1,
ntile(3) over(partition by cookieid order by dt) as rn2,
ntile(4) over(order by dt) as rn3
from tb_pv 
order by cookieid, dt;

Query results

cookieid	dt			pv		rn1		rn2		rn3
cookie1		2019-04-10	1		1		1		1
cookie1		2019-04-11	5		1		1		1
cookie1		2019-04-12	7		1		1		2
cookie1		2019-04-13	3		1		2		2
cookie1		2019-04-14	2		2		2		3
cookie1		2019-04-15	4		2		3		4
cookie1		2019-04-16	4		2		3		4
cookie2		2019-04-10	2		1		1		1
cookie2		2019-04-11	3		1		1		1
cookie2		2019-04-12	5		1		1		2
cookie2		2019-04-13	6		1		2		2
cookie2		2019-04-14	3		2		2		3
cookie2		2019-04-15	9		2		3		3
cookie2		2019-04-16	7		2		3		4
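How the bucket numbers above are assigned can be sketched in pure Python (not Hive; `ntile` is a hypothetical helper that mimics the bucket-size rule, with extra rows going to the earliest buckets):

```python
# Pure-Python sketch (not Hive) of NTILE bucket assignment.
def ntile(n, row_count):
    """Return the bucket number (1..n) for each of row_count ordered rows."""
    size, extra = divmod(row_count, n)
    buckets = []
    for b in range(1, n + 1):
        buckets.extend([b] * (size + (1 if b <= extra else 0)))
    return buckets

print(ntile(3, 7))   # matches rn2 for cookie1's 7 rows
print(ntile(4, 14))  # bucket sizes 4,4,3,3 -- matches rn3 over both cookies
```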
  • Example 1: for each cookie, find the top third of days by pv
select
cookieid,
dt,
pv,
ntile(3) over(partition by cookieid order by pv desc) as rn
from tb_pv;

Results (the rows with rn = 1 are the top third of days by pv)

cookieid	dt			pv    	rn
cookie1		2019-04-12	7		1
cookie1		2019-04-11	5		1
cookie1		2019-04-16	4		1
cookie1		2019-04-15	4		2
cookie1		2019-04-13	3		2
cookie1		2019-04-14	2		3
cookie1		2019-04-10	1		3
cookie2		2019-04-15	9		1
cookie2		2019-04-16	7		1
cookie2		2019-04-13	6		1
cookie2		2019-04-12	5		2
cookie2		2019-04-11	3		2
cookie2		2019-04-14	3		3
cookie2		2019-04-10	2		3
  • Example 2: average spend of the top 30% of users by spend, and of the bottom 70%

Create the table and load the data

create table `tb_user_price`(
`user_id` string,
`price` double
) row format delimited 
fields terminated by ',';

1001,100
1002,200
1003,10
1004,60
1005,20
1006,40
1007,1000
1008,220
1009,110
1010,190
1011,20
1012,80
1013,2000
1014,900
1015,26

load data local inpath '/Users/harvey/data/tb_user_price.txt' overwrite into table test_db.tb_user_price;

Approach: rank the rows by price, split them into 10 buckets with ntile(10) into a temporary table, then compute the average of the top 30% (buckets 1–3) and of the bottom 70% from that table.

-- Step 1: rank by price, split into 10 buckets, save to a temporary table
drop table if exists tb_user_price_ntile_temp;
create table tb_user_price_ntile_temp as select user_id, price, ntile(10) over(order by price desc) as rn from tb_user_price;

-- query result
user_id	price	rn
1013	2000.0	1
1007	1000.0	1
1014	900.0	2
1008	220.0	2
1002	200.0	3
1010	190.0	3
1009	110.0	4
1001	100.0	4
1012	80.0	5
1004	60.0	5
1006	40.0	6
1015	26.0	7
1011	20.0	8
1005	20.0	9
1003	10.0	10

-- Step 2: average the top 30% and the bottom 70% from the temporary table
select
a.rn as rn,
case when a.rn = 1 then 'avg_price_first_30%' when a.rn = 2 then 'avg_price_last_70%' end as avg_price_name,
avg( price ) as avg_price 
from
( select user_id, price, case when rn in ( 1, 2, 3 ) then 1 else 2 end as rn from tb_user_price_ntile_temp ) a 
group by a.rn;

-- results
rn	avg_price_name			avg_price
1	avg_price_first_30%		751.6666666666666
2	avg_price_last_70%		51.77777777777778
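The two averages can be checked in pure Python (not Hive): ntile(10) over 15 rows puts 2 rows in each of buckets 1–5 and 1 row in buckets 6–10, so buckets 1–3 are exactly the top 6 rows.

```python
# Pure-Python check (not Hive) of the 30% / 70% averages computed above.
prices = sorted([100, 200, 10, 60, 20, 40, 1000, 220, 110, 190,
                 20, 80, 2000, 900, 26], reverse=True)
top = prices[:6]    # ntile(10) buckets 1-3
rest = prices[6:]   # buckets 4-10
print(sum(top) / len(top))    # avg_price_first_30%
print(sum(rest) / len(rest))  # avg_price_last_70%
```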

2.2 row_number

  1. Partitions the data, e.g. by date
  2. Rows within each partition can be ordered by some column, e.g. partition by score and order by imei within the partition
  3. Syntax: row_number() over (partition by xxx order by xxx) rank, where rank is just an alias and can be any name
  4. Take the first row of each partition with rank = 1
  • Example 1: rank the pv values within each partition
select cookieid, dt, pv, row_number() over(partition by cookieid order by pv desc) as rn from tb_pv;

Query results

cookieid	dt			pv	 	rn
cookie1		2019-04-12	7		1
cookie1		2019-04-11	5		2
cookie1		2019-04-16	4		3
cookie1		2019-04-15	4		4
cookie1		2019-04-13	3		5
cookie1		2019-04-14	2		6
cookie1		2019-04-10	1		7
cookie2		2019-04-15	9		1
cookie2		2019-04-16	7		2
cookie2		2019-04-13	6		3
cookie2		2019-04-12	5		4
cookie2		2019-04-11	3		5
cookie2		2019-04-14	3		6
cookie2		2019-04-10	2		7
  • Example 2: building on Example 1, keep only the top-ranked row of each partition
select * from (select cookieid, dt, pv, row_number() over(partition by cookieid order by pv desc) as rn from tb_pv) a where a.rn = 1;

Query results

a.cookieid	a.dt		a.pv	a.rn
cookie1		2019-04-12	7		1
cookie2		2019-04-15	9		1

2.3 rank and dense_rank

  • rank() ranks rows within a partition; ties get the same rank, and the next rank skips by the number of tied rows
  • dense_rank() ranks rows within a partition; ties get the same rank, and the next rank does not skip
  • row_number() numbers rows within a partition sequentially 1, 2, 3 … n, regardless of ties
select 
cookieid, 
dt, 
pv, 
rank() over(partition by cookieid order by pv desc) as rn1,
dense_rank() over(partition by cookieid order by pv desc) as rn2,
row_number() over(partition by cookieid order by pv desc) as rn3
from tb_pv;

Query results

cookieid	dt			pv		rn1		rn2		rn3
cookie1		2019-04-12	7		1		1		1
cookie1		2019-04-11	5		2		2		2
cookie1		2019-04-16	4		3		3		3
cookie1		2019-04-15	4		3		3		4
cookie1		2019-04-13	3		5		4		5
cookie1		2019-04-14	2		6		5		6
cookie1		2019-04-10	1		7		6		7
cookie2		2019-04-15	9		1		1		1
cookie2		2019-04-16	7		2		2		2
cookie2		2019-04-13	6		3		3		3
cookie2		2019-04-12	5		4		4		4
cookie2		2019-04-11	3		5		5		5
cookie2		2019-04-14	3		5		5		6
cookie2		2019-04-10	2		7		6		7
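The difference between the three functions can be sketched in pure Python (not Hive; `ranks` is a hypothetical helper applied to values already sorted in the window's order):

```python
# Pure-Python sketch (not Hive) of rank vs dense_rank vs row_number.
def ranks(values):
    """values must already be sorted in the window's order."""
    rank_out, dense_out, row_out = [], [], []
    r = d = 0
    for i, v in enumerate(values, start=1):
        if i == 1 or v != values[i - 2]:  # new value: rank jumps to the row number
            r, d = i, d + 1
        rank_out.append(r)     # ties share a rank, then the rank skips
        dense_out.append(d)    # ties share a rank, no skipping
        row_out.append(i)      # plain sequence, ignores ties
    return rank_out, dense_out, row_out

pv = [7, 5, 4, 4, 3, 2, 1]  # cookie1's pv ordered desc
print(ranks(pv))  # matches the rn1 / rn2 / rn3 columns for cookie1 above
```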

2.4 cume_dist and percent_rank

PS: these two sequence functions are rarely used.

Create the table and load the data

-- data
1201,Gopal,45000,TD
1202,Manisha,45000,HRD
1203,Masthanvali,40000,AD
1204,Kiran,40000,HRD
1205,Kranthi,30000,TD

-- create the table
create table `emps`(
  `eid` int,
  `name` string,
  `salary` string,
  `dept` string
 ) row format delimited 
fields terminated by ',';

-- load the data
load data local inpath '/Users/harvey/data/emps.txt' overwrite into table emps;

-- query
emps.eid	emps.name		emps.salary	emps.dept
1201		Gopal			45000			TD
1202		Manisha			45000			HRD
1203		Masthanvali		40000			AD
1204		Kiran			40000			HRD
1205		Kranthi			30000			TD

2.4.1 cume_dist

  • (number of rows with value <= the current value) / (total rows in the partition)

Example: for each employee, the fraction of people whose salary is less than or equal to theirs

select 
  eid, 
  name, 
  salary, 
  dept, 
  cume_dist() over(order by salary) as rn1,
  cume_dist() over(partition by dept order by salary) as rn2
from emps;

Query results
eid		name			salary	dept	rn1		rn2
1203	Masthanvali		40000	AD		0.6		1.0
1204	Kiran			40000	HRD		0.6		0.5
1202	Manisha			45000	HRD		1.0		1.0
1205	Kranthi			30000	TD		0.2		0.5
1201	Gopal			45000	TD		1.0		1.0

Explanation
rn1: no partition, so all rows form one group of 5
	   row 1: 3 rows have salary <= 40000, so 3 / 5 = 0.6
	   row 3: 5 rows have salary <= 45000, so 5 / 5 = 1.0

rn2: partitioned by dept; dept TD has 2 rows
	   row 4: 1 row has salary <= 30000, so 1 / 2 = 0.5

2.4.2 percent_rank

  • (rank of the current row within the partition - 1) / (rows in the partition - 1)
select
  eid,
  name,
  salary,
  dept,
  percent_rank() over(order by salary) as rn1,						-- one group, 5 rows total
  rank() over(order by salary) as rn11,								-- rank within that group
  percent_rank() over(partition by dept order by salary) as rn2,	-- partitioned by dept
  rank() over(partition by dept order by salary) as rn22			-- rank within each dept
from emps;

Query results
eid		name			salary	dept	rn1		rn11	rn2		rn22
1203	Masthanvali		40000	AD		0.25	2		0.0		1
1204	Kiran			40000	HRD		0.25	2		0.0		1
1202	Manisha			45000	HRD		0.75	4		1.0		2
1205	Kranthi			30000	TD		0.0		1		0.0		1
1201	Gopal			45000	TD		0.75	4		1.0		2

Explanation
rn1: no partition, so all rows form one group of 5
	   row 1: (2 - 1) / (5 - 1) = 1 / 4 = 0.25
	   row 4: (1 - 1) / (5 - 1) = 0 / 4 = 0.0

rn2: partitioned by dept, ordered by salary
	   row 2: HRD has 2 rows, (1 - 1) / (2 - 1) = 0 / 1 = 0.0
	   row 5: TD has 2 rows, (2 - 1) / (2 - 1) = 1 / 1 = 1.0
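Both formulas can be sketched in pure Python (not Hive; `cume_dist` and `percent_rank` are hypothetical helpers applied to one partition's salaries):

```python
# Pure-Python sketches (not Hive) of the two formulas explained above.
def cume_dist(values):
    """rows with value <= current value, divided by total rows."""
    n = len(values)
    return [sum(1 for x in values if x <= v) / n for v in values]

def percent_rank(values):
    """(rank - 1) / (total rows - 1); rank - 1 equals rows strictly below."""
    n = len(values)
    return [sum(1 for x in values if x < v) / (n - 1) for v in values]

salaries = [30000, 40000, 40000, 45000, 45000]  # all five employees
print(cume_dist(salaries))     # matches the rn1 column above
print(percent_rank(salaries))  # matches the percent_rank rn1 column above
```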

3. Analytic Functions (Window Functions)

lag, lead, first_value, last_value
PS: these functions do not support the window (rows between) clause.

Create the table and load the data

-- data
user1,2019-04-10 10:00:02,https://www.baidu.com/
user1,2019-04-10 10:00:00,https://www.baidu.com/
user1,2019-04-10 10:03:04,http://www.google.cn/
user1,2019-04-10 10:50:05,https://yq.aliyun.com/topic/list/
user1,2019-04-10 11:00:00,https://www.json.cn/
user1,2019-04-10 10:10:00,https://music.163.com/
user1,2019-04-10 10:50:01,https://www.guoguo-app.com/
user2,2019-04-10 10:00:02,http://www.google.cn/
user2,2019-04-10 10:00:00,https://www.baidu.com/
user2,2019-04-10 10:03:04,https://www.runoob.com/
user2,2019-04-10 10:50:05,http://fanyi.youdao.com/
user2,2019-04-10 11:00:00,https://www.csdn.net/
user2,2019-04-10 10:10:00,https://www.json.cn/
user2,2019-04-10 10:50:01,https://yq.aliyun.com/topic/list/

-- create the table
create table tb_page_access (
  userid string,
  create_time string,
  url string
) row format delimited 
fields terminated by ',';

-- load the data
load data local inpath '/Users/harvey/data/tb_page_access.txt' overwrite into table tb_page_access;

3.1 lag

  • Syntax

    LAG (scalar_expression [,offset] [,default]) OVER ([query_partition_clause] order_by_clause)

  • lag(col, n, DEFAULT): returns the value of col from the n-th row above the current row within the window

    The first argument is the column name
    The second is the offset n (optional, default 1)
    The third is the default value (used when the n-th row above does not exist; NULL if not given)

select
  userid,
  create_time,
  url,
  row_number() over(partition by userid order by create_time) as rn,
  lag(create_time, 1, '1970-01-01 00:00:00') over(partition by userid order by create_time) as last_1_time,
  lag(create_time, 2) over(partition by userid order by create_time) as last_2_time
from tb_page_access;

Query results (screenshot omitted)
Explanation:

rn: rank after partitioning by userid and ordering by create_time
last_1_time
	row 1: there is no previous row, so the default 1970-01-01 00:00:00
	row 2: the previous row's value, 2019-04-10 10:00:00
	and so on
last_2_time
	row 1: no row two above, and no default given, so NULL
	row 2: same as row 1
	row 3: two rows above is row 1, so its create_time, 2019-04-10 10:00:00
	and so on

3.2 lead

  • Syntax

    LEAD (scalar_expression [,offset] [,default]) OVER ([query_partition_clause] order_by_clause)

  • lead(col, n, DEFAULT): returns the value of col from the n-th row below the current row within the window
    The first argument is the column name
    The second is the offset n (optional, default 1)
    The third is the default value (used when the n-th row below does not exist; NULL if not given)

select
  userid,
  create_time,
  url,
  row_number() over(partition by userid order by create_time) as rn,
  lead(create_time, 1, '1970-01-01 00:00:00') over(partition by userid order by create_time) as next_1_time,
  lead(create_time, 2) over(partition by userid order by create_time) as next_2_time
from tb_page_access;

Query results (screenshot omitted)
Explanation:

rn: rank after partitioning by userid and ordering by create_time
next_1_time
	row 1: the next row's create_time, 2019-04-10 10:00:02
	and so on
	last row: there is no next row, so the default 1970-01-01 00:00:00
next_2_time
	row 1: two rows below, i.e. row 3's create_time, 2019-04-10 10:03:04
	and so on
	last row: no row two below and no default given, so NULL
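The offset-and-default behaviour of both functions can be sketched in pure Python (not Hive; `lag` and `lead` are hypothetical helpers over one partition's ordered values):

```python
# Pure-Python sketch (not Hive) of LAG and LEAD over an ordered window.
def lag(values, n=1, default=None):
    """Value n rows back, or default when that row does not exist."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]

def lead(values, n=1, default=None):
    """Value n rows ahead, or default when that row does not exist."""
    return [values[i + n] if i + n < len(values) else default for i in range(len(values))]

times = ["10:00:00", "10:00:02", "10:03:04"]
print(lag(times, 1, "1970-01-01 00:00:00"))
print(lead(times, 2))  # no default given, so trailing rows become None (NULL)
```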

3.3 first_value

  • The first value in the partition, after ordering, up to the current row
select
  userid,
  create_time,
  url,
  row_number() over(partition by userid order by create_time) as rn,
  first_value(url) over(partition by userid order by create_time) as first_url
from tb_page_access;

Query results (screenshot omitted)
PS: "up to the current row" means: after partitioning and ordering, the first value between the first row and the current row

3.4 last_value

  • The last value in the partition, after ordering, up to the current row
select
  userid,
  create_time,
  url,
  row_number() over(partition by userid order by create_time) as rn,
  last_value(url) over(partition by userid order by create_time) as last_url
from tb_page_access;

Query results (screenshot omitted)
PS: "up to the current row" means: after partitioning and ordering, the last value between the first row and the current row
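With ORDER BY present, the default frame runs from the start of the partition to the current row, so first_value is constant and last_value is just the current row. A pure-Python sketch (not Hive; hypothetical helpers over one partition's ordered values):

```python
# Pure-Python sketch (not Hive) of first_value/last_value under the default
# frame (from the start of the partition up to the current row).
def first_value(values):
    return [values[0]] * len(values)  # the first row of the frame never changes

def last_value(values):
    return list(values)  # the last row of the frame is always the current row

urls = ["a.example", "b.example", "c.example"]
print(first_value(urls))
print(last_value(urls))
```

If you want the partition-wide last value instead, a common trick is to reverse the sort order and take first_value, e.g. first_value(url) over(partition by userid order by create_time desc).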

4. Analytic Functions (OLAP)

grouping sets, grouping__id, cube, rollup

These functions are typically used in OLAP scenarios, where the metrics are not simply additive and must be computed while rolling up and drilling down across dimensions, e.g. UV by hour, by day, and by month.

Cross-dimension analysis is the common case: with GROUP BY alone you would write one query per dimension (or level) and stitch the result sets together with UNION, which quickly becomes long and clumsy. Hive provides these functions to avoid that kind of verbose HQL. Note that the exact GROUPING__ID values depend on the Hive version (the semantics were changed in Hive 2.3.0 to follow the SQL standard), so your output may differ from the tables below.

Create the table and load the data

''' data
2019-03,2019-03-10,127.0.0.1
2019-03,2019-03-10,192.168.1.1
2019-03,2019-03-12,61.135.169.121
2019-04,2019-04-12,203.208.41.47
2019-04,2019-04-13,39.96.252.213
2019-04,2019-04-13,121.40.179.176
2019-04,2019-04-16,139.196.135.171
2019-03,2019-03-10,119.167.188.226
2019-03,2019-03-10,59.111.181.38:
2019-04,2019-04-12,192.168.1.1
2019-04,2019-04-13,203.208.41.47
2019-04,2019-04-15,121.40.179.176
2019-04,2019-04-15,39.96.252.213
2019-04,2019-04-16,59.111.181.38
'''

-- create the table
create table tb_uv (
  month string,
  day string,
  ip string
) row format delimited 
fields terminated by ',';

-- load the data
load data local inpath '/Users/harvey/data/tb_uv.txt' overwrite into table tb_uv;

4.1 grouping sets

Within a single GROUP BY query, aggregates by several different dimension combinations at once; equivalent to UNION ALL over the per-dimension GROUP BY result sets.

select 
month,
day,
count(distinct ip) as uv,
grouping__id 
from tb_uv 
group by month, day 
grouping sets (month, day) 
order by grouping__id;

-- equivalent to
select month, NULL, count(distinct ip) as uv, 1 as grouping__id from tb_uv group by month
union all
select NULL, day, count(distinct ip) as uv, 2 as grouping__id from tb_uv group by day;

''' query results
month			day						uv	grouping__id
2019-04			NULL					6		1
2019-03			NULL					5		1
NULL			2019-04-16				2		2
NULL			2019-04-15				2		2
NULL			2019-04-13				3		2
NULL			2019-04-12				2		2
NULL			2019-03-12				1		2
NULL			2019-03-10				4		2
'''
select 
month,
day,
count(distinct ip) as uv,
grouping__id 
from tb_uv 
group by month, day 
grouping sets (month, day, (month, day)) 
order by grouping__id asc;

-- equivalent to
select month, null, count(distinct ip) as uv, 1 as grouping__id from tb_uv group by month 
union all 
select null, day, count(distinct ip) as uv, 2 as grouping__id from tb_uv group by day
union all 
select month, day, count(distinct ip) as uv, 3 as grouping__id from tb_uv group by month,day order by grouping__id asc;

''' query results
month			day						uv	grouping__id
2019-04			NULL					6		1
2019-03			NULL					5		1
NULL			2019-04-16				2		2
NULL			2019-04-15				2		2
NULL			2019-04-13				3		2
NULL			2019-04-12				2		2
NULL			2019-03-12				1		2
NULL			2019-03-10				4		2
2019-04			2019-04-16				2		3
2019-04			2019-04-15				2		3
2019-04			2019-04-13				3		3
2019-04			2019-04-12				2		3
2019-03			2019-03-12				1		3
2019-03			2019-03-10				4		3
''' 
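The UNION ALL expansion can be checked with a small pure-Python sketch (not Hive; `uv` is a hypothetical helper that counts distinct IPs per grouping set over the sample rows above):

```python
# Pure-Python check (not Hive) of the grouping-set aggregates above.
from collections import defaultdict

rows = [  # the tb_uv sample as (month, day, ip)
    ("2019-03", "2019-03-10", "127.0.0.1"),
    ("2019-03", "2019-03-10", "192.168.1.1"),
    ("2019-03", "2019-03-12", "61.135.169.121"),
    ("2019-04", "2019-04-12", "203.208.41.47"),
    ("2019-04", "2019-04-13", "39.96.252.213"),
    ("2019-04", "2019-04-13", "121.40.179.176"),
    ("2019-04", "2019-04-16", "139.196.135.171"),
    ("2019-03", "2019-03-10", "119.167.188.226"),
    ("2019-03", "2019-03-10", "59.111.181.38:"),  # trailing ':' kept as in the source file
    ("2019-04", "2019-04-12", "192.168.1.1"),
    ("2019-04", "2019-04-13", "203.208.41.47"),
    ("2019-04", "2019-04-15", "121.40.179.176"),
    ("2019-04", "2019-04-15", "39.96.252.213"),
    ("2019-04", "2019-04-16", "59.111.181.38"),
]

def uv(keys):
    """count(distinct ip) grouped by the given column indexes (0=month, 1=day);
    an empty tuple gives the grand total that CUBE/ROLLUP add."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[k] for k in keys)].add(row[2])
    return {key: len(ips) for key, ips in groups.items()}

print(uv((0,)))  # GROUP BY month -- matches the month rows above
print(uv(()))    # grand total
```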

4.2 cube

Aggregates over all combinations of the GROUP BY dimensions.

select 
month,
day,
count(distinct ip) as uv,
grouping__id 
from tb_uv 
group by month, day 
with cube 
order by grouping__id asc;

-- equivalent to
select null,null,count(distinct ip) as uv,0 as grouping__id from tb_uv
union all 
select month,null,count(distinct ip) as uv,1 as grouping__id from tb_uv group by month 
union all 
select null,day,count(distinct ip) as uv,2 as grouping__id from tb_uv group by day
union all 
select month,day,count(distinct ip) as uv,3 as grouping__id from tb_uv group by month,day order by grouping__id asc;

''' query results
month		day					uv  grouping__id
2019-04		2019-04-16			2		0
2019-04		2019-04-15			2		0
2019-04		2019-04-13			3		0
2019-04		2019-04-12			2		0
2019-03		2019-03-12			1		0
2019-03		2019-03-10			4		0
2019-03		NULL				5		1
2019-04		NULL				6		1
NULL		2019-04-15			2		2
NULL		2019-04-13			3		2
NULL		2019-04-12			2		2
NULL		2019-03-12			1		2
NULL		2019-03-10			4		2
NULL		2019-04-16			2		2
NULL		NULL				10		3
'''

4.3 rollup

rollup computes multi-level aggregates hierarchically, dropping dimensions from right to left; e.g. GROUP BY month, day WITH ROLLUP produces the levels (month, day), (month), and ().

select month, day, count(distinct ip) as uv, grouping__id
from tb_uv
group by month, day
with rollup
order by grouping__id asc;

-- equivalent to
select month, day, count(distinct ip) as uv, grouping__id
from tb_uv
group by month, day
grouping sets((month, day), month, ())
order by grouping__id asc;

-- equivalent to
select month, day, count(distinct ip) as uv, 0 as grouping__id from tb_uv group by month, day 
union all
select month, NULL, count(distinct ip) as uv, 1 as grouping__id from tb_uv group by month
union all
select NULL, NULL, count(distinct ip) as uv, 3 as grouping__id from tb_uv
order by grouping__id asc;

''' query results
month		day					uv  grouping__id
2019-03		2019-03-10			4		0
2019-03		2019-03-12			1		0
2019-04		2019-04-12			2		0
2019-04		2019-04-13			3		0
2019-04		2019-04-15			2		0
2019-04		2019-04-16			2		0
2019-03		NULL				5		1
2019-04		NULL				6		1
NULL		NULL				10		3
'''

PS: these functions take some effort to internalize; practice them against real business scenarios to build intuition.
