Background
When several distinct operations appear in the same SELECT, the data is distributed (shuffled) multiple times, which easily causes data skew on the reduce side.
Optimizations
1. If an exact value is not required, use Spark SQL's approx_count_distinct function (cardinality estimation via HyperLogLog).
2. Rewrite the SQL.
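Before turning to point 2, the idea behind point 1 can be sketched. The snippet below is illustrative only (the function name `kmv_count_distinct` is made up here): it implements the KMV (k minimum values) estimator, a simpler relative of the HyperLogLog sketch that approx_count_distinct actually uses, but the trade-off is the same, bounded error in exchange for avoiding an exact distinct.

```python
import hashlib

def kmv_count_distinct(values, k=1024):
    """Approximate distinct count via the KMV (k minimum values) estimator,
    a simpler relative of the HyperLogLog behind approx_count_distinct."""
    # Hash every value to a pseudo-uniform number in [0, 1); md5 keeps
    # the result deterministic across runs and machines.
    hashes = set()
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16) / 2**128
        hashes.add(h)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        # Fewer than k distinct hashes seen: the count is (almost surely) exact.
        return len(smallest)
    # With n uniform hashes, the k-th smallest lands near k / n,
    # so n is estimated as (k - 1) / h_k.
    return int((k - 1) / smallest[-1])
```

For clarity this version keeps every hash in memory; a production sketch retains only the k smallest, which is what makes the memory footprint constant regardless of cardinality.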
The base data is prepared as follows. We need to compute the visit UV (distinct users) per channel over several cumulative date windows.
presto:bi> desc tmp.multi_distinct_test;
 Column  |  Type   | Extra | Comment
---------+---------+-------+--------------
 user_id | bigint  |       | user ID
 channel | varchar |       | channel name
 day     | varchar |       | visit date
presto:bi> select * from tmp.multi_distinct_test;
user_id | channel | day
---------+---------+------------
1 | A | 2020-01-01 -- identical to the next row
1 | A | 2020-01-01 -- 👆
1 | A | 2020-01-02
1 | B | 2020-01-03
1 | C | 2020-01-01
2 | A | 2020-01-02
2 | D | 2020-01-03
3 | A | 2020-01-01
3 | B | 2020-01-02
4 | B | 2020-01-03
4 | C | 2020-01-01
4 | A | 2020-01-02
5 | B | 2020-01-03
5 | C | 2020-01-01
6 | A | 2020-01-02
7 | A | 2020-01-03
(16 rows)
The most straightforward version:
select
    channel,
    count(distinct if(day in ('2020-01-01'), user_id, null)) as uv_d1,
    count(distinct if(day in ('2020-01-01','2020-01-02'), user_id, null)) as uv_d1_2,
    count(distinct if(day in ('2020-01-01','2020-01-02','2020-01-03'), user_id, null)) as uv_d1_3
from tmp.multi_distinct_test
group by channel;
The rewritten version:
select
    channel,
    sum(c1),
    sum(c2),
    sum(c3)
from (
    -- first aggregate over the column that used to be distinct-counted (user_id)
    select
        channel, user_id,
        max(if(day in ('2020-01-01'), 1, 0)) as c1,
        max(if(day in ('2020-01-01','2020-01-02'), 1, 0)) as c2,
        max(if(day in ('2020-01-01','2020-01-02','2020-01-03'), 1, 0)) as c3
    from (
        -- deduplicate
        select distinct channel, user_id, day
        from tmp.multi_distinct_test
    ) t1
    group by channel, user_id
) t2
group by channel;
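The two queries can be checked for equivalence on the sample data above. A small sketch using Python's built-in sqlite3 (with `case when` standing in for `if()`, which SQLite lacks, and single-quoted string literals):

```python
import sqlite3

# The 16 sample rows: (user_id, channel, day).
rows = [
    (1, 'A', '2020-01-01'), (1, 'A', '2020-01-01'), (1, 'A', '2020-01-02'),
    (1, 'B', '2020-01-03'), (1, 'C', '2020-01-01'), (2, 'A', '2020-01-02'),
    (2, 'D', '2020-01-03'), (3, 'A', '2020-01-01'), (3, 'B', '2020-01-02'),
    (4, 'B', '2020-01-03'), (4, 'C', '2020-01-01'), (4, 'A', '2020-01-02'),
    (5, 'B', '2020-01-03'), (5, 'C', '2020-01-01'), (6, 'A', '2020-01-02'),
    (7, 'A', '2020-01-03'),
]
con = sqlite3.connect(':memory:')
con.execute('create table multi_distinct_test (user_id int, channel text, day text)')
con.executemany('insert into multi_distinct_test values (?, ?, ?)', rows)

# The direct multi-distinct version.
direct = con.execute("""
    select channel,
           count(distinct case when day in ('2020-01-01') then user_id end),
           count(distinct case when day in ('2020-01-01','2020-01-02') then user_id end),
           count(distinct case when day in ('2020-01-01','2020-01-02','2020-01-03') then user_id end)
    from multi_distinct_test
    group by channel order by channel
""").fetchall()

# The rewritten dedup -> max -> sum version.
rewritten = con.execute("""
    select channel, sum(c1), sum(c2), sum(c3)
    from (
        select channel, user_id,
               max(case when day in ('2020-01-01') then 1 else 0 end) as c1,
               max(case when day in ('2020-01-01','2020-01-02') then 1 else 0 end) as c2,
               max(case when day in ('2020-01-01','2020-01-02','2020-01-03') then 1 else 0 end) as c3
        from (select distinct channel, user_id, day from multi_distinct_test) t1
        group by channel, user_id
    ) t2
    group by channel order by channel
""").fetchall()

assert direct == rewritten
# both return: [('A', 2, 5, 6), ('B', 0, 1, 4), ('C', 3, 3, 3), ('D', 0, 0, 1)]
```

Note that SQLite runs single-node, so this only verifies the results match; the shuffle-skew benefit of the rewrite shows up only on a distributed engine.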
Key points:
1. First deduplicate the base data.
2. Then group by the column that was distinct-counted (user_id in the example), reducing each date condition to a 0/1 flag with max(); summing those flags per channel counts each user at most once.
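The two key points above can also be traced step by step in plain Python on the same sample rows (a sketch; the three window sets mirror the `day in (...)` conditions):

```python
from collections import defaultdict

# The 16 sample rows: (user_id, channel, day).
rows = [
    (1, 'A', '2020-01-01'), (1, 'A', '2020-01-01'), (1, 'A', '2020-01-02'),
    (1, 'B', '2020-01-03'), (1, 'C', '2020-01-01'), (2, 'A', '2020-01-02'),
    (2, 'D', '2020-01-03'), (3, 'A', '2020-01-01'), (3, 'B', '2020-01-02'),
    (4, 'B', '2020-01-03'), (4, 'C', '2020-01-01'), (4, 'A', '2020-01-02'),
    (5, 'B', '2020-01-03'), (5, 'C', '2020-01-01'), (6, 'A', '2020-01-02'),
    (7, 'A', '2020-01-03'),
]
# The three cumulative date windows from the day in (...) conditions.
windows = [
    {'2020-01-01'},
    {'2020-01-01', '2020-01-02'},
    {'2020-01-01', '2020-01-02', '2020-01-03'},
]

# Key point 1: deduplicate (channel, user_id, day).
deduped = set(rows)

# Key point 2a: group by (channel, user_id) and collapse days into 0/1
# flags -- the max(if(...)) step of the inner query.
flags = defaultdict(lambda: [0, 0, 0])
for user_id, channel, day in deduped:
    for i, window in enumerate(windows):
        if day in window:
            flags[(channel, user_id)][i] = 1

# Key point 2b: summing the flags per channel counts each user at most
# once per window -- the outer sum(c1), sum(c2), sum(c3).
uv = defaultdict(lambda: [0, 0, 0])
for (channel, _user), user_flags in flags.items():
    for i, flag in enumerate(user_flags):
        uv[channel][i] += flag

print(dict(sorted(uv.items())))
# {'A': [2, 5, 6], 'B': [0, 1, 4], 'C': [3, 3, 3], 'D': [0, 0, 1]}
```

Because each (channel, user_id) pair contributes at most a 1 to each window, the final sums equal the exact distinct counts, which is precisely why the max-then-sum rewrite is safe.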