Preface:
Hive ships with a rich set of analytic (windowing) functions that come up constantly in statistical work. Below is a collection of the common ones, each with a worked example.
Reference: http://lxw1234.com/archives/2015/07/367.htm
1. Basic functions
The window clause: rows between
- preceding: rows before the current row
- following: rows after the current row
- current row: the current row
- unbounded: the partition boundary
- unbounded preceding: from the start of the partition
- unbounded following: to the end of the partition
Notes
- If rows between is not specified, the frame defaults to partition start through the current row
- If order by is not specified, all rows in the partition are aggregated together
Create the table and load the data
create table tb_pv (
cookieid string,
dt string,
pv int
) row format delimited
fields terminated by ',';
-- data
'''
cookie1,2019-04-10,1
cookie1,2019-04-11,5
cookie1,2019-04-12,7
cookie1,2019-04-13,3
cookie1,2019-04-14,2
cookie1,2019-04-15,4
cookie1,2019-04-16,4
'''
load data local inpath '/data/cookie.txt' overwrite into table test_db.tb_pv;
1.1 sum
- sql
select
cookieid,
dt,
pv,
sum(pv) over(partition by cookieid order by dt) as pv1,
sum(pv) over(partition by cookieid order by dt rows between unbounded preceding and current row) as pv2,
sum(pv) over(partition by cookieid) as pv3,
sum(pv) over(partition by cookieid order by dt rows between 3 preceding and current row) as pv4,
sum(pv) over(partition by cookieid order by dt rows between 3 preceding and 1 following) as pv5,
sum(pv) over(partition by cookieid order by dt rows between current row and unbounded following) as pv6
from tb_pv order by dt;
''' query result
cookieid dt pv pv1 pv2 pv3 pv4 pv5 pv6
cookie1 2019-04-10 1 1 1 26 1 6 26
cookie1 2019-04-11 5 6 6 26 6 13 25
cookie1 2019-04-12 7 13 13 26 13 16 20
cookie1 2019-04-13 3 16 16 26 16 18 13
cookie1 2019-04-14 2 18 18 26 17 21 10
cookie1 2019-04-15 4 22 22 26 16 20 8
cookie1 2019-04-16 4 26 26 26 13 13 4
'''
- Notes
pv1: running total from the partition start to the current row, e.g.
Apr 11 = Apr 10 + Apr 11 = 1 + 5 = 6
Apr 12 = Apr 10 + Apr 11 + Apr 12 = 1 + 5 + 7 = 13
and so on
pv2: same as pv1 (the explicit frame matches the default)
pv3: total of all pv in the partition (cookie1), i.e.
pv3 = 1 + 5 + 7 + 3 + 2 + 4 + 4 = 26
pv4: the current row plus the previous 3 rows, e.g.
Apr 11 = Apr 10 + Apr 11 = 1 + 5 = 6
Apr 12 = Apr 12 + Apr 10 + Apr 11 = 7 + 1 + 5 = 13
Apr 13 = Apr 13 + Apr 10 + Apr 11 + Apr 12 = 3 + 1 + 5 + 7 = 16
Apr 14 = Apr 14 + Apr 11 + Apr 12 + Apr 13 = 2 + 5 + 7 + 3 = 17
pv5: the previous 3 rows, the current row, and the next 1 row, e.g.
Apr 10 = Apr 10 + (no preceding rows) + Apr 11 = 1 + 0 + 5 = 6
Apr 13 = Apr 13 + Apr 10 + Apr 11 + Apr 12 + Apr 14 = 3 + 1 + 5 + 7 + 2 = 18
pv6: the current row plus all following rows, e.g.
Apr 10 = Apr 10 + Apr 11 + Apr 12 + Apr 13 + Apr 14 + Apr 15 + Apr 16 = 1 + 5 + 7 + 3 + 2 + 4 + 4 = 26
Apr 13 = Apr 13 + Apr 14 + Apr 15 + Apr 16 = 3 + 2 + 4 + 4 = 13
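The frame arithmetic above is easy to sanity-check outside Hive: SQLite (3.25+, bundled with recent Python) implements the same standard rows between syntax, so a minimal sketch over the sample data reproduces pv1, pv4, and pv6:

```python
import sqlite3

# the tb_pv sample data from above
rows = [("cookie1", "2019-04-10", 1), ("cookie1", "2019-04-11", 5),
        ("cookie1", "2019-04-12", 7), ("cookie1", "2019-04-13", 3),
        ("cookie1", "2019-04-14", 2), ("cookie1", "2019-04-15", 4),
        ("cookie1", "2019-04-16", 4)]

conn = sqlite3.connect(":memory:")
conn.execute("create table tb_pv (cookieid text, dt text, pv int)")
conn.executemany("insert into tb_pv values (?, ?, ?)", rows)

result = conn.execute("""
    select dt, pv,
           sum(pv) over (partition by cookieid order by dt) as pv1,
           sum(pv) over (partition by cookieid order by dt
                         rows between 3 preceding and current row) as pv4,
           sum(pv) over (partition by cookieid order by dt
                         rows between current row and unbounded following) as pv6
    from tb_pv order by dt
""").fetchall()

for row in result:
    print(row)
# pv1 runs 1, 6, 13, 16, 18, 22, 26 and pv6 runs 26, 25, 20, 13, 10, 8, 4,
# matching the Hive output above
```

This is a sketch for verification only, not the Hive statement itself; the frame keywords behave identically in both engines.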
1.2 avg
Same usage as sum in 1.1
select
cookieid,
dt,
pv,
avg(pv) over(partition by cookieid order by dt) as pv1,
avg(pv) over(partition by cookieid order by dt rows between unbounded preceding and current row) as pv2,
avg(pv) over(partition by cookieid) as pv3,
avg(pv) over(partition by cookieid order by dt rows between 3 preceding and current row) as pv4,
avg(pv) over(partition by cookieid order by dt rows between 3 preceding and 1 following) as pv5,
avg(pv) over(partition by cookieid order by dt rows between current row and unbounded following) as pv6
from tb_pv order by dt;
The averages come back with many decimal places; use cast to keep two:
syntax: cast(column_name as decimal(10,2))
select
cookieid,
dt,
pv,
cast(avg(pv) over(partition by cookieid order by dt) as decimal(10,2)) as pv1,
cast(avg(pv) over(partition by cookieid order by dt rows between unbounded preceding and current row) as decimal(10,2)) as pv2,
cast(avg(pv) over(partition by cookieid) as decimal(10,2)) as pv3,
cast(avg(pv) over(partition by cookieid order by dt rows between 3 preceding and current row) as decimal(10,2)) as pv4,
cast(avg(pv) over(partition by cookieid order by dt rows between 3 preceding and 1 following) as decimal(10,2)) as pv5,
cast(avg(pv) over(partition by cookieid order by dt rows between current row and unbounded following) as decimal(10,2)) as pv6
from tb_pv order by dt;
Query result
cookieid dt pv pv1 pv2 pv3 pv4 pv5 pv6
cookie1 2019-04-10 1 1.00 1.00 3.71 1.00 3.00 3.71
cookie1 2019-04-11 5 3.00 3.00 3.71 3.00 4.33 4.17
cookie1 2019-04-12 7 4.33 4.33 3.71 4.33 4.00 4.00
cookie1 2019-04-13 3 4.00 4.00 3.71 4.00 3.60 3.25
cookie1 2019-04-14 2 3.60 3.60 3.71 4.25 4.20 3.33
cookie1 2019-04-15 4 3.67 3.67 3.71 4.00 4.00 4.00
cookie1 2019-04-16 4 3.71 3.71 3.71 3.25 3.25 4.00
1.3 min
Same usage as sum in 1.1
select
cookieid,
dt,
pv,
min(pv) over(partition by cookieid order by dt) as pv1,
min(pv) over(partition by cookieid order by dt rows between unbounded preceding and current row) as pv2,
min(pv) over(partition by cookieid) as pv3,
min(pv) over(partition by cookieid order by dt rows between 3 preceding and current row) as pv4,
min(pv) over(partition by cookieid order by dt rows between 3 preceding and 1 following) as pv5,
min(pv) over(partition by cookieid order by dt rows between current row and unbounded following) as pv6
from tb_pv order by dt;
Query result
cookieid dt pv pv1 pv2 pv3 pv4 pv5 pv6
cookie1 2019-04-10 1 1 1 1 1 1 1
cookie1 2019-04-11 5 1 1 1 1 1 2
cookie1 2019-04-12 7 1 1 1 1 1 2
cookie1 2019-04-13 3 1 1 1 1 1 2
cookie1 2019-04-14 2 1 1 1 2 2 2
cookie1 2019-04-15 4 1 1 1 2 2 4
cookie1 2019-04-16 4 1 1 1 2 2 4
1.4 max
Same usage as sum in 1.1
select
cookieid,
dt,
pv,
max(pv) over(partition by cookieid order by dt) as pv1,
max(pv) over(partition by cookieid order by dt rows between unbounded preceding and current row) as pv2,
max(pv) over(partition by cookieid) as pv3,
max(pv) over(partition by cookieid order by dt rows between 3 preceding and current row) as pv4,
max(pv) over(partition by cookieid order by dt rows between 3 preceding and 1 following) as pv5,
max(pv) over(partition by cookieid order by dt rows between current row and unbounded following) as pv6
from tb_pv order by dt;
Query result
cookieid dt pv pv1 pv2 pv3 pv4 pv5 pv6
cookie1 2019-04-10 1 1 1 7 1 5 7
cookie1 2019-04-11 5 5 5 7 5 7 7
cookie1 2019-04-12 7 7 7 7 7 7 7
cookie1 2019-04-13 3 7 7 7 7 7 4
cookie1 2019-04-14 2 7 7 7 7 7 4
cookie1 2019-04-15 4 7 7 7 7 7 4
cookie1 2019-04-16 4 7 7 7 4 4 4
2. Sequence functions
Sequence functions: ntile, row_number, rank, dense_rank
Note: sequence functions do not support the window (rows between) clause
Data
cookie1,2019-04-10,1
cookie1,2019-04-11,5
cookie1,2019-04-12,7
cookie1,2019-04-13,3
cookie1,2019-04-14,2
cookie1,2019-04-15,4
cookie1,2019-04-16,4
cookie2,2019-04-10,2
cookie2,2019-04-11,3
cookie2,2019-04-12,5
cookie2,2019-04-13,6
cookie2,2019-04-14,3
cookie2,2019-04-15,9
cookie2,2019-04-16,7
2.1 ntile
- ntile(n): splits the ordered rows of each partition into n buckets and returns the bucket number of the current row
- ntile does not support rows between; e.g. ntile(2) over(partition by cookieid order by createtime rows between 3 preceding and current row) is invalid
- If the rows do not divide evenly, the earlier buckets receive the extra rows
- Commonly used to pick out the top n% of records (top 30%, etc.)
select
cookieid,
dt,
pv,
ntile(2) over(partition by cookieid order by dt) as rn1,
ntile(3) over(partition by cookieid order by dt) as rn2,
ntile(4) over(order by dt) as rn3
from tb_pv
order by cookieid, dt;
Query result
cookieid dt pv rn1 rn2 rn3
cookie1 2019-04-10 1 1 1 1
cookie1 2019-04-11 5 1 1 1
cookie1 2019-04-12 7 1 1 2
cookie1 2019-04-13 3 1 2 2
cookie1 2019-04-14 2 2 2 3
cookie1 2019-04-15 4 2 3 4
cookie1 2019-04-16 4 2 3 4
cookie2 2019-04-10 2 1 1 1
cookie2 2019-04-11 3 1 1 1
cookie2 2019-04-12 5 1 1 2
cookie2 2019-04-13 6 1 2 2
cookie2 2019-04-14 3 2 2 3
cookie2 2019-04-15 9 2 3 3
cookie2 2019-04-16 7 2 3 4
- Example 1: for each cookie, find the top third of days by pv
select
cookieid,
dt,
pv,
ntile(3) over(partition by cookieid order by pv desc) as rn
from tb_pv;
Result (the rows with rn = 1 are the top third of days by pv)
cookieid dt pv rn
cookie1 2019-04-12 7 1
cookie1 2019-04-11 5 1
cookie1 2019-04-16 4 1
cookie1 2019-04-15 4 2
cookie1 2019-04-13 3 2
cookie1 2019-04-14 2 3
cookie1 2019-04-10 1 3
cookie2 2019-04-15 9 1
cookie2 2019-04-16 7 1
cookie2 2019-04-13 6 1
cookie2 2019-04-12 5 2
cookie2 2019-04-11 3 2
cookie2 2019-04-14 3 3
cookie2 2019-04-10 2 3
- Example 2: average spend of the top 30% of users by spend, and of the remaining 70%
Create the table and load the data
create table `tb_user_price`(
`user_id` string,
`price` double
) row format delimited
fields terminated by ',';
1001,100
1002,200
1003,10
1004,60
1005,20
1006,40
1007,1000
1008,220
1009,110
1010,190
1011,20
1012,80
1013,2000
1014,900
1015,26
load data local inpath '/Users/harvey/data/tb_user_price.txt' overwrite into table test_db.tb_user_price;
Approach: rank the users by price into 10 buckets and save to a temp table, then average the top 30% and the bottom 70% from it
-- step 1: rank by price into 10 buckets and save to a temp table
drop table if exists tb_user_price_ntile_temp;
create table tb_user_price_ntile_temp as select user_id, price, ntile(10) over(order by price desc) as rn from tb_user_price;
-- query result
user_id price rn
1013 2000.0 1
1007 1000.0 1
1014 900.0 2
1008 220.0 2
1002 200.0 3
1010 190.0 3
1009 110.0 4
1001 100.0 4
1012 80.0 5
1004 60.0 5
1006 40.0 6
1015 26.0 7
1011 20.0 8
1005 20.0 9
1003 10.0 10
-- step 2: average the top 30% and the bottom 70% from the temp table
select
a.rn as rn,
case when a.rn = 1 then 'avg_price_first_30%' when a.rn = 2 then 'avg_price_last_70%' end as avg_price_name,
avg( price ) as avg_price
from
( select user_id, price, case when rn in ( 1, 2, 3 ) then 1 else 2 end as rn from tb_user_price_ntile_temp ) a
group by a.rn;
-- result
rn avg_price_name avg_price
1 avg_price_first_30% 751.6666666666666
2 avg_price_last_70% 51.77777777777778
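SQLite's ntile follows the same bucketing rule (the earlier buckets absorb the leftover rows), so the two-step query above can be condensed and checked in one pass. This is a sketch for verification, not the Hive statement itself:

```python
import sqlite3

# the tb_user_price sample data from above
prices = [(1001, 100), (1002, 200), (1003, 10), (1004, 60), (1005, 20),
          (1006, 40), (1007, 1000), (1008, 220), (1009, 110), (1010, 190),
          (1011, 20), (1012, 80), (1013, 2000), (1014, 900), (1015, 26)]

conn = sqlite3.connect(":memory:")
conn.execute("create table tb_user_price (user_id int, price real)")
conn.executemany("insert into tb_user_price values (?, ?)", prices)

# buckets 1-3 of ntile(10) ordered by price desc = top 30% of spenders
result = conn.execute("""
    select case when rn <= 3 then 'first_30%' else 'last_70%' end as grp,
           avg(price) as avg_price
    from (select price, ntile(10) over (order by price desc) as rn
          from tb_user_price)
    group by grp order by grp
""").fetchall()
print(result)
# the averages round to 751.67 and 51.78, the same figures
# as the two-step Hive version above
```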
2.2 row_number
- Numbers the rows within each partition, e.g. partitioned by date
- Rows can be ordered within the partition by some column, e.g. partition by date and order by pv
- Syntax:
row_number() over (partition by xxx order by xxx) rank
where rank is the column alias (any name will do); to take the first row per group, filter on rank = 1
- Example 1: rank rows by pv within each cookie
select cookieid, dt, pv, row_number() over(partition by cookieid order by pv desc) as rn from tb_pv;
Query result
cookieid dt pv rn
cookie1 2019-04-12 7 1
cookie1 2019-04-11 5 2
cookie1 2019-04-16 4 3
cookie1 2019-04-15 4 4
cookie1 2019-04-13 3 5
cookie1 2019-04-14 2 6
cookie1 2019-04-10 1 7
cookie2 2019-04-15 9 1
cookie2 2019-04-16 7 2
cookie2 2019-04-13 6 3
cookie2 2019-04-12 5 4
cookie2 2019-04-11 3 5
cookie2 2019-04-14 3 6
cookie2 2019-04-10 2 7
- Example 2: building on Example 1, keep only the top-ranked row per cookie
select * from (select cookieid, dt, pv, row_number() over(partition by cookieid order by pv desc) as rn from tb_pv) a where a.rn = 1;
Query result
a.cookieid a.dt a.pv a.rn
cookie1 2019-04-12 7 1
cookie2 2019-04-15 9 1
2.3 rank and dense_rank
- rank(): ranks rows within the partition; ties get the same rank, and the following ranks are skipped by the number of ties
- dense_rank(): ranks rows within the partition; ties get the same rank, with no gaps afterwards
- row_number(): numbers the rows 1, 2, 3, ... n regardless of ties
select
cookieid,
dt,
pv,
rank() over(partition by cookieid order by pv desc) as rn1,
dense_rank() over(partition by cookieid order by pv desc) as rn2,
row_number() over(partition by cookieid order by pv desc) as rn3
from tb_pv;
Query result
cookieid dt pv rn1 rn2 rn3
cookie1 2019-04-12 7 1 1 1
cookie1 2019-04-11 5 2 2 2
cookie1 2019-04-16 4 3 3 3
cookie1 2019-04-15 4 3 3 4
cookie1 2019-04-13 3 5 4 5
cookie1 2019-04-14 2 6 5 6
cookie1 2019-04-10 1 7 6 7
cookie2 2019-04-15 9 1 1 1
cookie2 2019-04-16 7 2 2 2
cookie2 2019-04-13 6 3 3 3
cookie2 2019-04-12 5 4 4 4
cookie2 2019-04-11 3 5 5 5
cookie2 2019-04-14 3 5 5 6
cookie2 2019-04-10 2 7 6 7
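The three ranking behaviours can be seen side by side with a minimal sketch (again using SQLite, whose ranking functions follow the same standard definitions):

```python
import sqlite3

# cookie1's pv values from the sample data above
rows = [("cookie1", 7), ("cookie1", 5), ("cookie1", 4), ("cookie1", 4),
        ("cookie1", 3), ("cookie1", 2), ("cookie1", 1)]

conn = sqlite3.connect(":memory:")
conn.execute("create table tb_pv (cookieid text, pv int)")
conn.executemany("insert into tb_pv values (?, ?)", rows)

result = conn.execute("""
    select pv,
           rank()       over (partition by cookieid order by pv desc) as rn1,
           dense_rank() over (partition by cookieid order by pv desc) as rn2,
           row_number() over (partition by cookieid order by pv desc) as rn3
    from tb_pv order by pv desc
""").fetchall()
for row in result:
    print(row)
# the tied pv=4 rows both get rank 3; rank then jumps to 5,
# dense_rank continues at 4, row_number never repeats
```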
2.4 cume_dist and percent_rank
PS: these two sequence functions are less commonly used
Create the table and load the data
-- data
1201,Gopal,45000,TD
1202,Manisha,45000,HRD
1203,Masthanvali,40000,AD
1204,Kiran,40000,HRD
1205,Kranthi,30000,TD
-- create table
create table `emps`(
`eid` int,
`name` string,
`salary` string,
`dept` string
) row format delimited
fields terminated by ',';
-- load data
load data local inpath '/Users/harvey/data/emps.txt' overwrite into table emps;
-- query result
emps.eid emps.name emps.salary emps.dept
1201 Gopal 45000 TD
1202 Manisha 45000 HRD
1203 Masthanvali 40000 AD
1204 Kiran 40000 HRD
1205 Kranthi 30000 TD
2.4.1 cume_dist
- (number of rows with a value <= the current value) / (total rows in the partition)
Example: for each employee, the fraction of people earning no more than their salary
select
eid,
name,
salary,
dept,
cume_dist() over(order by salary) as rn1,
cume_dist() over(partition by dept order by salary) as rn2
from emps;
Query result
eid name salary dept rn1 rn2
1203 Masthanvali 40000 AD 0.6 1.0
1204 Kiran 40000 HRD 0.6 0.5
1202 Manisha 45000 HRD 1.0 1.0
1205 Kranthi 30000 TD 0.2 0.5
1201 Gopal 45000 TD 1.0 1.0
Notes
rn1: no partition, so all rows form one group of 5
row 1: 3 rows have salary <= 40000, so 3 / 5 = 0.6
row 3: 5 rows have salary <= 45000, so 5 / 5 = 1.0
rn2: partitioned by dept; dept TD has 2 rows
row 4: 1 row has salary <= 30000, so 1 / 2 = 0.5
2.4.2 percent_rank
- (rank of the current row within the partition - 1) / (rows in the partition - 1)
select
eid,
name,
salary,
dept,
percent_rank() over(order by salary) as rn1, -- no partition, 5 rows total
rank() over(order by salary) as rn11, -- rank within that group
percent_rank() over(partition by dept order by salary) as rn2, -- partitioned by dept
rank() over(partition by dept order by salary) as rn22 -- rank within the dept partition
from emps;
Query result
eid name salary dept rn1 rn11 rn2 rn22
1203 Masthanvali 40000 AD 0.25 2 0.0 1
1204 Kiran 40000 HRD 0.25 2 0.0 1
1202 Manisha 45000 HRD 0.75 4 1.0 2
1205 Kranthi 30000 TD 0.0 1 0.0 1
1201 Gopal 45000 TD 0.75 4 1.0 2
Notes
rn1: no partition, so all rows form one group of 5
row 1: (2 - 1) / (5 - 1) = 1 / 4 = 0.25
row 4: (1 - 1) / (5 - 1) = 0 / 4 = 0.0
rn2: partitioned by dept, ordered by salary
row 2: HRD has 2 rows, (1 - 1) / (2 - 1) = 0 / 1 = 0.0
row 5: TD has 2 rows, (2 - 1) / (2 - 1) = 1 / 1 = 1.0
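Both formulas can be checked with a quick sketch (SQLite implements cume_dist and percent_rank with the same definitions; salary is stored as an integer here for simplicity):

```python
import sqlite3

# the emps sample data from above
emps = [(1201, "Gopal", 45000, "TD"), (1202, "Manisha", 45000, "HRD"),
        (1203, "Masthanvali", 40000, "AD"), (1204, "Kiran", 40000, "HRD"),
        (1205, "Kranthi", 30000, "TD")]

conn = sqlite3.connect(":memory:")
conn.execute("create table emps (eid int, name text, salary int, dept text)")
conn.executemany("insert into emps values (?, ?, ?, ?)", emps)

result = conn.execute("""
    select name, salary,
           cume_dist()    over (order by salary) as rn1,
           percent_rank() over (order by salary) as rn2
    from emps order by salary, name
""").fetchall()
for row in result:
    print(row)
# Kranthi (30000): cume_dist 1/5 = 0.2, percent_rank (1-1)/(5-1) = 0.0
# the 40000 rows: cume_dist 3/5 = 0.6, percent_rank (2-1)/(5-1) = 0.25
```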
3. Analytic functions (window functions)
lag, lead, first_value, last_value
PS: lag and lead do not support the window (rows between) clause; first_value and last_value use the default frame (partition start through the current row) unless one is given
Create the table and load the data
-- data
user1,2019-04-10 10:00:02,https://www.baidu.com/
user1,2019-04-10 10:00:00,https://www.baidu.com/
user1,2019-04-10 10:03:04,http://www.google.cn/
user1,2019-04-10 10:50:05,https://yq.aliyun.com/topic/list/
user1,2019-04-10 11:00:00,https://www.json.cn/
user1,2019-04-10 10:10:00,https://music.163.com/
user1,2019-04-10 10:50:01,https://www.guoguo-app.com/
user2,2019-04-10 10:00:02,http://www.google.cn/
user2,2019-04-10 10:00:00,https://www.baidu.com/
user2,2019-04-10 10:03:04,https://www.runoob.com/
user2,2019-04-10 10:50:05,http://fanyi.youdao.com/
user2,2019-04-10 11:00:00,https://www.csdn.net/
user2,2019-04-10 10:10:00,https://www.json.cn/
user2,2019-04-10 10:50:01,https://yq.aliyun.com/topic/list/
-- create table
create table tb_page_access (
userid string,
create_time string,
url string
) row format delimited
fields terminated by ',';
-- load data
load data local inpath '/Users/harvey/data/tb_page_access.txt' overwrite into table tb_page_access;
3.1 lag
- Syntax
LAG (scalar_expression [,offset] [,default]) OVER ([query_partition_clause] order_by_clause)
- lag(col, n, DEFAULT): returns the value of col from the row n rows before the current row in the window
first argument: the column name
second argument: how many rows back (optional, defaults to 1)
third argument: the default value (used when the row n back does not exist; NULL if not given)
select
userid,
create_time,
url,
row_number() over(partition by userid order by create_time) as rn,
lag(create_time, 1, '1970-01-01 00:00:00') over(partition by userid order by create_time) as last_1_time,
lag(create_time, 2) over(partition by userid order by create_time) as last_2_time
from tb_page_access;
Query result
Notes:
rn: row number after partitioning by userid and ordering by create_time
last_1_time
row 1: no previous row, so the default 1970-01-01 00:00:00
row 2: the previous row's value, 2019-04-10 10:00:00
and so on
last_2_time
row 1: no row two back and no default given, so NULL
row 2: same as row 1
row 3: two rows back is row 1's create_time, i.e. 2019-04-10 10:00:00
and so on
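The behaviour described above can be reproduced with a small sketch on a subset of user1's rows (SQLite's lag takes the same three-argument form):

```python
import sqlite3

# a subset of user1's visits from tb_page_access above
visits = [("user1", "2019-04-10 10:00:02"), ("user1", "2019-04-10 10:00:00"),
          ("user1", "2019-04-10 10:03:04"), ("user1", "2019-04-10 10:10:00")]

conn = sqlite3.connect(":memory:")
conn.execute("create table tb_page_access (userid text, create_time text)")
conn.executemany("insert into tb_page_access values (?, ?)", visits)

result = conn.execute("""
    select create_time,
           lag(create_time, 1, '1970-01-01 00:00:00')
               over (partition by userid order by create_time) as last_1_time,
           lag(create_time, 2)
               over (partition by userid order by create_time) as last_2_time
    from tb_page_access order by create_time
""").fetchall()
for row in result:
    print(row)
# row 1: last_1_time falls back to the default, last_2_time is NULL (None)
# row 3: last_1_time = '2019-04-10 10:00:02', last_2_time = '2019-04-10 10:00:00'
```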
3.2 lead
- Syntax
LEAD (scalar_expression [,offset] [,default]) OVER ([query_partition_clause] order_by_clause)
- lead(col, n, DEFAULT): returns the value of col from the row n rows after the current row in the window
first argument: the column name
second argument: how many rows ahead (optional, defaults to 1)
third argument: the default value (used when the row n ahead does not exist; NULL if not given)
select
userid,
create_time,
url,
row_number() over(partition by userid order by create_time) as rn,
lead(create_time, 1, '1970-01-01 00:00:00') over(partition by userid order by create_time) as next_1_time,
lead(create_time, 2) over(partition by userid order by create_time) as next_2_time
from tb_page_access;
Query result
Notes:
rn: row number after partitioning by userid and ordering by create_time
next_1_time
row 1: one row ahead, i.e. row 2's create_time, 2019-04-10 10:00:02
and so on
last row: no next row, so the default 1970-01-01 00:00:00
next_2_time
row 1: two rows ahead, i.e. row 3's create_time, 2019-04-10 10:03:04
and so on
last row: no row two ahead and no default given, so NULL
3.3 first_value
- After ordering within the partition, returns the first value up to the current row
select
userid,
create_time,
url,
row_number() over(partition by userid order by create_time) as rn,
first_value(url) over(partition by userid order by create_time) as first_url
from tb_page_access;
Query result
PS: "up to the current row" means: after partitioning and ordering, the first value between the first row and the current row
3.4 last_value
- After ordering within the partition, returns the last value up to the current row
select
userid,
create_time,
url,
row_number() over(partition by userid order by create_time) as rn,
last_value(url) over(partition by userid order by create_time) as last_url
from tb_page_access;
Query result
PS: "up to the current row" means: after partitioning and ordering, the last value between the first row and the current row, which is simply the current row's value under the default frame
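The "up to the current row" behaviour of last_value is a direct consequence of the default frame; widening the frame explicitly yields the last value of the whole partition instead. A sketch on three of user1's rows (SQLite, same standard frame semantics):

```python
import sqlite3

# three of user1's visits from the sample data above
visits = [("user1", "2019-04-10 10:00:00", "https://www.baidu.com/"),
          ("user1", "2019-04-10 10:03:04", "http://www.google.cn/"),
          ("user1", "2019-04-10 11:00:00", "https://www.json.cn/")]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table tb_page_access (userid text, create_time text, url text)")
conn.executemany("insert into tb_page_access values (?, ?, ?)", visits)

result = conn.execute("""
    select create_time,
           -- default frame: partition start through the current row,
           -- so this is always the current row's url
           last_value(url) over (partition by userid
                                 order by create_time) as up_to_now,
           -- explicit full frame: the true last url of the partition
           last_value(url) over (partition by userid order by create_time
                                 rows between unbounded preceding
                                 and unbounded following) as whole_group
    from tb_page_access order by create_time
""").fetchall()
for row in result:
    print(row)
# up_to_now changes row by row; whole_group is 'https://www.json.cn/' everywhere
```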
4. Analytic functions (OLAP)
grouping sets, grouping__id, cube, rollup
These functions are typically used in OLAP scenarios, where metrics are not simply additive and must be reported across different dimensions, rolling up and drilling down, e.g. UV by hour, by day, and by month
Cross-dimension analysis with plain GROUP BY means writing one GROUP BY query per dimension (or per level of the hierarchy) and stitching the result sets together with UNION, which makes the HQL long and clumsy. Hive provides these functions to avoid that boilerplate
Create the table and load the data
''' data
2019-03,2019-03-10,127.0.0.1
2019-03,2019-03-10,192.168.1.1
2019-03,2019-03-12,61.135.169.121
2019-04,2019-04-12,203.208.41.47
2019-04,2019-04-13,39.96.252.213
2019-04,2019-04-13,121.40.179.176
2019-04,2019-04-16,139.196.135.171
2019-03,2019-03-10,119.167.188.226
2019-03,2019-03-10,59.111.181.38:
2019-04,2019-04-12,192.168.1.1
2019-04,2019-04-13,203.208.41.47
2019-04,2019-04-15,121.40.179.176
2019-04,2019-04-15,39.96.252.213
2019-04,2019-04-16,59.111.181.38
'''
-- create table
create table tb_uv (
month string,
day string,
ip string
) row format delimited
fields terminated by ',';
-- load data
load data local inpath '/Users/harvey/data/tb_uv.txt' overwrite into table tb_uv;
4.1 grouping sets
Aggregates by several dimension combinations within a single GROUP BY query; equivalent to a UNION ALL of the individual GROUP BY result sets
select
month,
day,
count(distinct ip) as uv,
grouping__id
from tb_uv
group by month, day
grouping sets (month, day)
order by grouping__id;
-- the sql is equivalent to
select month, NULL, count(distinct ip) as uv, 1 as grouping__id from tb_uv group by month
union all
select NULL, day, count(distinct ip) as uv, 2 as grouping__id from tb_uv group by day;
''' query result
month day uv grouping__id
2019-04 NULL 6 1
2019-03 NULL 5 1
NULL 2019-04-16 2 2
NULL 2019-04-15 2 2
NULL 2019-04-13 3 2
NULL 2019-04-12 2 2
NULL 2019-03-12 1 2
NULL 2019-03-10 4 2
'''
select
month,
day,
count(distinct ip) as uv,
grouping__id
from tb_uv
group by month, day
grouping sets (month, day, (month, day))
order by grouping__id asc;
-- equivalent to
select month, null, count(distinct ip) as uv, 1 as grouping__id from tb_uv group by month
union all
select null, day, count(distinct ip) as uv, 2 as grouping__id from tb_uv group by day
union all
select month, day, count(distinct ip) as uv, 3 as grouping__id from tb_uv group by month,day order by grouping__id asc;
''' query result
month day uv grouping__id
2019-04 NULL 6 1
2019-03 NULL 5 1
NULL 2019-04-16 2 2
NULL 2019-04-15 2 2
NULL 2019-04-13 3 2
NULL 2019-04-12 2 2
NULL 2019-03-12 1 2
NULL 2019-03-10 4 2
2019-04 2019-04-16 2 3
2019-04 2019-04-15 2 3
2019-04 2019-04-13 3 3
2019-04 2019-04-12 2 3
2019-03 2019-03-12 1 3
2019-03 2019-03-10 4 3
'''
4.2 cube
Aggregates over all combinations of the GROUP BY dimensions
select
month,
day,
count(distinct ip) as uv,
grouping__id
from tb_uv
group by month, day
with cube
order by grouping__id asc;
-- equivalent to
select null,null,count(distinct ip) as uv,0 as grouping__id from tb_uv
union all
select month,null,count(distinct ip) as uv,1 as grouping__id from tb_uv group by month
union all
select null,day,count(distinct ip) as uv,2 as grouping__id from tb_uv group by day
union all
select month,day,count(distinct ip) as uv,3 as grouping__id from tb_uv group by month,day order by grouping__id asc;
''' query result
month day uv grouping__id
2019-04 2019-04-16 2 0
2019-04 2019-04-15 2 0
2019-04 2019-04-13 3 0
2019-04 2019-04-12 2 0
2019-03 2019-03-12 1 0
2019-03 2019-03-10 4 0
2019-03 NULL 5 1
2019-04 NULL 6 1
NULL 2019-04-15 2 2
NULL 2019-04-13 3 2
NULL 2019-04-12 2 2
NULL 2019-03-12 1 2
NULL 2019-03-10 4 2
NULL 2019-04-16 2 2
NULL NULL 10 3
'''
4.3 rollup
rollup aggregates over a hierarchy of the GROUP BY dimensions, dropping dimensions from right to left level by level
select month, day, count(distinct ip) as uv, grouping__id
from tb_uv
group by month, day
with rollup
order by grouping__id asc;
-- equivalent to
select month, day, count(distinct ip) as uv, grouping__id
from tb_uv
group by month, day
grouping sets((month, day), month, ())
order by grouping__id asc;
-- equivalent to
select month, day, count(distinct ip) as uv, 0 as grouping__id from tb_uv group by month, day
union all
select month, NULL, count(distinct ip) as uv, 1 as grouping__id from tb_uv group by month
union all
select NULL, NULL, count(distinct ip) as uv, 3 as grouping__id from tb_uv
order by grouping__id asc;
''' query result
month day uv grouping__id
2019-03 2019-03-10 4 0
2019-03 2019-03-12 1 0
2019-04 2019-04-12 2 0
2019-04 2019-04-13 3 0
2019-04 2019-04-15 2 0
2019-04 2019-04-16 2 0
2019-03 NULL 5 1
2019-04 NULL 6 1
NULL NULL 10 3
'''
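The equivalence "grouping sets = UNION ALL of individual GROUP BYs" can also be stated in plain code. This pure-Python sketch computes the same UV numbers from the sample data, one group-by per dimension set, counting distinct ips per group (the ip strings are reproduced verbatim from the data above):

```python
from collections import defaultdict

# the tb_uv sample data above
data = [("2019-03", "2019-03-10", "127.0.0.1"),
        ("2019-03", "2019-03-10", "192.168.1.1"),
        ("2019-03", "2019-03-12", "61.135.169.121"),
        ("2019-04", "2019-04-12", "203.208.41.47"),
        ("2019-04", "2019-04-13", "39.96.252.213"),
        ("2019-04", "2019-04-13", "121.40.179.176"),
        ("2019-04", "2019-04-16", "139.196.135.171"),
        ("2019-03", "2019-03-10", "119.167.188.226"),
        ("2019-03", "2019-03-10", "59.111.181.38:"),
        ("2019-04", "2019-04-12", "192.168.1.1"),
        ("2019-04", "2019-04-13", "203.208.41.47"),
        ("2019-04", "2019-04-15", "121.40.179.176"),
        ("2019-04", "2019-04-15", "39.96.252.213"),
        ("2019-04", "2019-04-16", "59.111.181.38")]

def uv(rows, keys):
    """Group by the named dimensions and count distinct ips per group.
    Dimensions not in `keys` come out as None, like the NULLs above."""
    groups = defaultdict(set)
    for month, day, ip in rows:
        k = (month if "month" in keys else None,
             day if "day" in keys else None)
        groups[k].add(ip)
    return {k: len(ips) for k, ips in groups.items()}

# grouping sets (month, day) == union of the two single-dimension group-bys
by_month = uv(data, {"month"})   # the grouping__id = 1 rows
by_day = uv(data, {"day"})       # the grouping__id = 2 rows
print(by_month)
# {('2019-03', None): 5, ('2019-04', None): 6}
```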
PS: these functions take some effort to internalize; the best way to learn them is to apply them to real business scenarios.