Current versions of Hive support multi-distinct, which is convenient, but with this feature the anti-skew switch (set hive.groupby.skewindata=true) cannot be enabled: that parameter only guards against data skew, via an extra job, in the single-distinct case. Convenient as it is, multi-distinct can also cost performance. Log analysis routinely needs deduplicated counts such as PV, UV, unique IPs, and unique sessions. The SQL below, which computes the share of each browser, can take 20 to 30 minutes to run (depending on the cluster and the log volume): browser_core has only about 10 distinct values, so the reducers are under heavy pressure. The optimized version below improves runtime by 50%-70%.
1. Original SQL
Select
browser_core,
count(1) as pv,
count(distinct uniq_id) as uv,
count(distinct client_ip) as ip_cnt,
count(distinct session_id) as session_cnt,
count(distinct apay_aid) as apay_aid_cnt,
count(distinct apay_uid) as apay_uid_cnt
From dw_log
where dt=20120101
and page_type='page'
and agent is not null
and agent <> '-'
group by browser_core;
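For reference, the multi-distinct aggregation the SQL above performs can be sketched in Python. The sample rows, and the restriction to three of the distinct columns, are illustrative only:

```python
from collections import defaultdict

# Toy log rows: (browser_core, uniq_id, client_ip, session_id)
rows = [
    ("webkit", "u1", "1.1.1.1", "s1"),
    ("webkit", "u1", "1.1.1.1", "s2"),
    ("webkit", "u2", "1.1.1.2", "s3"),
    ("gecko",  "u3", "1.1.1.3", "s4"),
]

# group by browser_core: count(1) plus count(distinct ...) per column
stats = defaultdict(lambda: {"pv": 0, "uv": set(), "ip": set(), "sess": set()})
for bc, uid, ip, sess in rows:
    s = stats[bc]
    s["pv"] += 1        # count(1)
    s["uv"].add(uid)    # count(distinct uniq_id)
    s["ip"].add(ip)     # count(distinct client_ip)
    s["sess"].add(sess) # count(distinct session_id)

result = {bc: (s["pv"], len(s["uv"]), len(s["ip"]), len(s["sess"]))
          for bc, s in stats.items()}
```

In Hive, all of these sets must be materialized per browser_core value on a handful of reducers, which is exactly where the pressure comes from.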
2. Improved SQL:
Step (1): first do a preliminary dedup aggregation
Create table tmp_browser_core_ds_1 as
Select
browser_core,
uniq_id,
client_ip,
session_id,
apay_aid,
apay_uid,
count(1) as pv
from dw_log
where dt=20120101
and page_type='page'
and agent is not null
and agent <> '-'
group by browser_core, uniq_id, client_ip, session_id, apay_aid, apay_uid;
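Step (1) is a plain group-by over all dimension columns, which a minimal Python sketch (toy data, subset of columns) captures as:

```python
from collections import Counter

# Toy log rows: (browser_core, uniq_id, client_ip, session_id)
rows = [
    ("webkit", "u1", "1.1.1.1", "s1"),
    ("webkit", "u1", "1.1.1.1", "s1"),  # exact duplicate collapses, pv accumulates
    ("webkit", "u2", "1.1.1.2", "s2"),
]

# step (1): group by (browser_core, uniq_id, client_ip, session_id), pv = count(1)
pre = Counter(rows)
tmp_1 = [key + (pv,) for key, pv in pre.items()]
```

The output is much smaller than the raw log whenever the same combination of values repeats, while still carrying enough information (the pv column) to recover the total row count later.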
Step (2): the key step, effectively trading space for time. Use union all to expand the data once per distinct column: with 8 distincts, the data grows 8-fold. rownumber=1 then achieves dedup indirectly. If the overall pv did not need to be computed here, a plain group by would achieve the same effect. The union all below runs as a single job, so the extra jobs do not drag things down (within limits, Hadoop does not fear large data volumes; it fears many jobs and data skew).
set mapred.reduce.tasks=300;
Create table tmp_browser_core_ds_2 as
select
type,
browser_core,
type_value,
pv,
rownumber(type,type_value,browser_core) as rn
from (
select
type,
browser_core,
type_value,
pv
from (
select
'client_ip' as type, browser_core, client_ip as type_value, pv
from tmp_browser_core_ds_1
union all
select
'uniq_id' as type, browser_core, uniq_id as type_value, pv
from tmp_browser_core_ds_1
union all
select
'session_id' as type, browser_core, session_id as type_value, pv
from tmp_browser_core_ds_1
union all
select
'apay_aid' as type, browser_core, apay_aid as type_value, pv
from tmp_browser_core_ds_1
union all
select
'apay_uid' as type, browser_core, apay_uid as type_value, pv
from tmp_browser_core_ds_1
) t
distribute by type,type_value,browser_core
sort by type,type_value,browser_core
) t1;
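Step (2)'s expansion-plus-numbering can be mimicked in Python. Here `rownumber` is assumed to be a custom UDF that assigns 1, 2, 3, … within each run of identical (type, type_value, browser_core) keys arriving at a reducer, which the distribute by / sort by clauses guarantee are contiguous; the sample data and column subset are illustrative:

```python
from itertools import groupby

# Rows of tmp_browser_core_ds_1: (browser_core, uniq_id, client_ip, pv)
tmp_1 = [
    ("webkit", "u1", "1.1.1.1", 2),
    ("webkit", "u2", "1.1.1.1", 1),
]

# Expand: one record per distinct-counted column (space for time)
expanded = []
for bc, uid, ip, pv in tmp_1:
    expanded.append(("uniq_id",   bc, uid, pv))
    expanded.append(("client_ip", bc, ip,  pv))

# distribute by + sort by (type, type_value, browser_core), then number
# each record within its key run: rn == 1 marks the first occurrence
expanded.sort(key=lambda r: (r[0], r[2], r[1]))
tmp_2 = []
for _, grp in groupby(expanded, key=lambda r: (r[0], r[2], r[1])):
    for rn, rec in enumerate(grp, start=1):
        tmp_2.append(rec + (rn,))
```

Counting the rows with rn == 1 for a given type is exactly counting the distinct values of that column.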
Step (3): get the final result. Not a single distinct is left; everything is a plain sum that can be partially aggregated on the map side, so it runs fast.
select
browser_core,
sum(case when type='uniq_id' then pv else cast(0 as bigint) end) as pv,
sum(case when type='client_ip' and rn=1 then 1 else 0 end) as ip_cnt,
sum(case when type='uniq_id' and rn=1 then 1 else 0 end) as uv,
sum(case when type='session_id' and rn=1 then 1 else 0 end) as session_cnt,
sum(case when type='apay_aid' and rn=1 then 1 else 0 end) as apay_aid_cnt,
sum(case when type='apay_uid' and rn=1 then 1 else 0 end) as apay_uid_cnt
from tmp_browser_core_ds_2
group by browser_core;
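The conditional sums of step (3) can be sketched in Python (toy rows, subset of columns). Total pv is summed over exactly one branch (uniq_id here) so each original row is counted once, while each rn=1 row contributes one distinct value to its column's count:

```python
from collections import defaultdict

# Rows of tmp_browser_core_ds_2: (type, browser_core, type_value, pv, rn)
tmp_2 = [
    ("uniq_id",   "webkit", "u1",      2, 1),
    ("uniq_id",   "webkit", "u2",      1, 1),
    ("client_ip", "webkit", "1.1.1.1", 2, 1),
    ("client_ip", "webkit", "1.1.1.1", 1, 2),
]

out = defaultdict(lambda: {"pv": 0, "uv": 0, "ip_cnt": 0})
for typ, bc, _, pv, rn in tmp_2:
    agg = out[bc]
    if typ == "uniq_id":
        agg["pv"] += pv        # total pv, summed over exactly one branch
        if rn == 1:
            agg["uv"] += 1     # first occurrence => one distinct uniq_id
    elif typ == "client_ip" and rn == 1:
        agg["ip_cnt"] += 1     # first occurrence => one distinct client_ip
```

Since no case involves a distinct, Hive can do map-side partial aggregation and the final job is light.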
Although the improved SQL runs 3 jobs in total, 2 more than the original, the overall runtime stays under 10 minutes. The basic idea is to spread the multi-distinct across multiple jobs; it also shows a nice use of rownumber, fully exploiting Hadoop's partition and sort strengths. If a query's multi-distinct is already fast, there is no need for all this trouble.