Current versions of Hive support multi-distinct, which is convenient, but with this feature the anti-skew switch (set hive.groupby.skewindata=true) cannot be enabled: that parameter only guards against data skew, via an extra job, in the single-distinct case. Convenient as it is, multi-distinct can also cost performance. Log analysis routinely needs deduplicated counts such as PV, UV, unique IPs, and unique sessions. The SQL below, which computes the share of each browser, can take 20 to 30 minutes to run (depending on the cluster and the log volume): browser_core has only about 10 distinct values, so the reducers are under heavy pressure. The optimized version below improves runtime by 50%-70%.
1. Original SQL
Select
browser_core,
count(1) as pv,
count(distinct uniq_id) as uv,
count(distinct client_ip) as ip_cnt,
count(distinct session_id) as session_cnt,
count(distinct apay_aid) as apay_aid_cnt,
count(distinct apay_uid) as apay_uid_cnt
From dw_log
where dt=20120101
and page_type='page'
and agent is not null
and agent <> '-'
group by browser_core;
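For reference, the multi-distinct aggregation the SQL above performs can be sketched in Python. The sample rows, and the restriction to three of the distinct columns, are illustrative only:

```python
from collections import defaultdict

# Toy log rows: (browser_core, uniq_id, client_ip, session_id)
rows = [
    ("webkit", "u1", "1.1.1.1", "s1"),
    ("webkit", "u1", "1.1.1.1", "s2"),
    ("webkit", "u2", "1.1.1.2", "s3"),
    ("gecko",  "u3", "1.1.1.3", "s4"),
]

# group by browser_core: count(1) plus count(distinct ...) per column
stats = defaultdict(lambda: {"pv": 0, "uv": set(), "ip": set(), "sess": set()})
for bc, uid, ip, sess in rows:
    s = stats[bc]
    s["pv"] += 1        # count(1)
    s["uv"].add(uid)    # count(distinct uniq_id)
    s["ip"].add(ip)     # count(distinct client_ip)
    s["sess"].add(sess) # count(distinct session_id)

result = {bc: (s["pv"], len(s["uv"]), len(s["ip"]), len(s["sess"]))
          for bc, s in stats.items()}
```

In Hive, all of these sets must be materialized per browser_core value on a handful of reducers, which is exactly where the pressure comes from.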
2. Improved SQL:
Step (1): first do a preliminary dedup aggregation
Create table tmp_browser_core_ds_1 as
Select
browser_core,
uniq_id,
client_ip,
session_id,
apay_aid,
apay_uid,
count(1) as pv
from dw_log
where dt=20120101
and page_type='page'
and agent is not null
and agent <> '-'
group by browser_core, uniq_id, client_ip, session_id, apay_aid, apay_uid;
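Step (1) is a plain group-by over all dimension columns, which a minimal Python sketch (toy data, subset of columns) captures as:

```python
from collections import Counter

# Toy log rows: (browser_core, uniq_id, client_ip, session_id)
rows = [
    ("webkit", "u1", "1.1.1.1", "s1"),
    ("webkit", "u1", "1.1.1.1", "s1"),  # exact duplicate collapses, pv accumulates
    ("webkit", "u2", "1.1.1.2", "s2"),
]

# step (1): group by (browser_core, uniq_id, client_ip, session_id), pv = count(1)
pre = Counter(rows)
tmp_1 = [key + (pv,) for key, pv in pre.items()]
```

The output is much smaller than the raw log whenever the same combination of values repeats, while still carrying enough information (the pv column) to recover the total row count later.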
Step (2): the key step, effectively trading space for time. Use union all to expand the data once per distinct column: with 8 distincts, the data grows 8-fold. rownumber=1 then achieves dedup indirectly. If the overall pv did not need to be computed here, a plain group by would achieve the same effect. The union all below runs as a single job, so the extra jobs do not drag things down (within limits, Hadoop does not fear large data volumes; it fears many jobs and data skew).
set mapred.reduce.tasks=300;
Create table tmp_browser_core_ds_2 as
select
type,
browser_core,
type_value,
pv,
rownumber(type,type_value,browser_core) as rn
from (
select
type,
browser_core,
type_value,
pv
from (
select
'client_ip' as type, browser_core, client_ip as type_value, pv
from tmp_browser_core_ds_1
union all
select
'uniq_id' as type, browser_core, uniq_id as type_value, pv
from tmp_browser_core_ds_1
union all
select
'session_id' as type, browser_core, session_id as type_value, pv
from tmp_browser_core_ds_1
union all
select
'apay_aid' as type, browser_core, apay_aid as type_value, pv
from tmp_browser_core_ds_1
union all
select
'apay_uid' as type, browser_core, apay_uid as type_value, pv
from tmp_browser_core_ds_1
) t
distribute by type,type_value,browser_core
sort by type,type_value,browser_core
) t1;
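Step (2)'s expansion-plus-numbering can be mimicked in Python. Here `rownumber` is assumed to be a custom UDF that assigns 1, 2, 3, … within each run of identical (type, type_value, browser_core) keys arriving at a reducer, which the distribute by / sort by clauses guarantee are contiguous; the sample data and column subset are illustrative:

```python
from itertools import groupby

# Rows of tmp_browser_core_ds_1: (browser_core, uniq_id, client_ip, pv)
tmp_1 = [
    ("webkit", "u1", "1.1.1.1", 2),
    ("webkit", "u2", "1.1.1.1", 1),
]

# Expand: one record per distinct-counted column (space for time)
expanded = []
for bc, uid, ip, pv in tmp_1:
    expanded.append(("uniq_id",   bc, uid, pv))
    expanded.append(("client_ip", bc, ip,  pv))

# distribute by + sort by (type, type_value, browser_core), then number
# each record within its key run: rn == 1 marks the first occurrence
expanded.sort(key=lambda r: (r[0], r[2], r[1]))
tmp_2 = []
for _, grp in groupby(expanded, key=lambda r: (r[0], r[2], r[1])):
    for rn, rec in enumerate(grp, start=1):
        tmp_2.append(rec + (rn,))
```

Counting the rows with rn == 1 for a given type is exactly counting the distinct values of that column.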
Step (3): get the final result. Not a single distinct is left; everything is a plain sum that can be partially aggregated on the map side, so it runs fast.
select
browser_core,
sum(case when type='uniq_id' then pv else cast(0 as bigint) end) as pv,
sum(case when type='client_ip' and rn=1 then 1 else 0 end) as ip_cnt,
sum(case when type='uniq_id' and rn=1 then 1 else 0 end) as uv,
sum(case when type='session_id' and rn=1 then 1 else 0 end) as session_cnt,
sum(case when type='apay_aid' and rn=1 then 1 else 0 end) as apay_aid_cnt,
sum(case when type='apay_uid' and rn=1 then 1 else 0 end) as apay_uid_cnt
from tmp_browser_core_ds_2
group by browser_core;
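The conditional sums of step (3) can be sketched in Python (toy rows, subset of columns). Total pv is summed over exactly one branch (uniq_id here) so each original row is counted once, while each rn=1 row contributes one distinct value to its column's count:

```python
from collections import defaultdict

# Rows of tmp_browser_core_ds_2: (type, browser_core, type_value, pv, rn)
tmp_2 = [
    ("uniq_id",   "webkit", "u1",      2, 1),
    ("uniq_id",   "webkit", "u2",      1, 1),
    ("client_ip", "webkit", "1.1.1.1", 2, 1),
    ("client_ip", "webkit", "1.1.1.1", 1, 2),
]

out = defaultdict(lambda: {"pv": 0, "uv": 0, "ip_cnt": 0})
for typ, bc, _, pv, rn in tmp_2:
    agg = out[bc]
    if typ == "uniq_id":
        agg["pv"] += pv        # total pv, summed over exactly one branch
        if rn == 1:
            agg["uv"] += 1     # first occurrence => one distinct uniq_id
    elif typ == "client_ip" and rn == 1:
        agg["ip_cnt"] += 1     # first occurrence => one distinct client_ip
```

Since no case involves a distinct, Hive can do map-side partial aggregation and the final job is light.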
Although the improved SQL runs 3 jobs in total, 2 more than the original, the overall runtime stays under 10 minutes. The basic idea is to spread the multi-distinct across multiple jobs; it also shows a nice use of rownumber, fully exploiting Hadoop's partition and sort strengths. If a query's multi-distinct is already fast, there is no need for all this trouble.