Background
When several distinct operations appear in the same SELECT, the data is distributed (shuffled) multiple times, which easily causes data skew on the reduce side.
Optimizations
1. If an exact value is not required, use Spark SQL's approx_count_distinct function (cardinality estimation via HyperLogLog).
2. Rewrite the SQL.
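Before turning to point 2, the idea behind point 1 can be sketched. The snippet below is illustrative only (the function name `kmv_count_distinct` is made up here): it implements the KMV (k minimum values) estimator, a simpler relative of the HyperLogLog sketch that approx_count_distinct actually uses, but the trade-off is the same, bounded error in exchange for avoiding an exact distinct.

```python
import hashlib

def kmv_count_distinct(values, k=1024):
    """Approximate distinct count via the KMV (k minimum values) estimator,
    a simpler relative of the HyperLogLog behind approx_count_distinct."""
    # Hash every value to a pseudo-uniform number in [0, 1); md5 keeps
    # the result deterministic across runs and machines.
    hashes = set()
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16) / 2**128
        hashes.add(h)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        # Fewer than k distinct hashes seen: the count is (almost surely) exact.
        return len(smallest)
    # With n uniform hashes, the k-th smallest lands near k / n,
    # so n is estimated as (k - 1) / h_k.
    return int((k - 1) / smallest[-1])
```

For clarity this version keeps every hash in memory; a production sketch retains only the k smallest, which is what makes the memory footprint constant regardless of cardinality.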
The base data is prepared as follows. We need to compute the visit UV (distinct users) per channel over several cumulative date windows.
presto:bi> desc tmp.multi_distinct_test;
 Column  |  Type   | Extra | Comment
---------+---------+-------+--------------
 user_id | bigint  |       | user ID
 channel | varchar |       | channel name
 day     | varchar |       | visit date
presto:bi> select * from tmp.multi_distinct_test;
user_id | channel | day
---------+---------+------------
1 | A | 2020-01-01 -- identical to the next row
1 | A | 2020-01-01 -- 👆
1 | A | 2020-01-02
1 | B | 2020-01-03
1 | C | 2020-01-01
2 | A | 2020-01-02
2 | D | 2020-01-03
3 | A | 2020-01-01
3 | B | 2020-01-02
4 | B | 2020-01-03
4 | C | 2020-01-01
4 | A | 2020-01-02
5 | B | 2020-01-03
5 | C | 2020-01-01
6 | A | 2020-01-02
7 | A | 2020-01-03
(16 rows)
The most straightforward version:
select
    channel,
    count(distinct if(day in ('2020-01-01'), user_id, null)) as uv_d1,
    count(distinct if(day in ('2020-01-01','2020-01-02'), user_id, null)) as uv_d1_2,
    count(distinct if(day in ('2020-01-01','2020-01-02','2020-01-03'), user_id, null)) as uv_d1_3
from tmp.multi_distinct_test
group by channel;
The rewritten version:
select
    channel,
    sum(c1),
    sum(c2),
    sum(c3)
from (
    -- first aggregate over the column that used to be distinct-counted (user_id)
    select
        channel, user_id,
        max(if(day in ('2020-01-01'), 1, 0)) as c1,
        max(if(day in ('2020-01-01','2020-01-02'), 1, 0)) as c2,
        max(if(day in ('2020-01-01','2020-01-02','2020-01-03'), 1, 0)) as c3
    from (
        -- deduplicate
        select distinct channel, user_id, day
        from tmp.multi_distinct_test
    ) t1
    group by channel, user_id
) t2
group by channel;
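The two queries can be checked for equivalence on the sample data above. A small sketch using Python's built-in sqlite3 (with `case when` standing in for `if()`, which SQLite lacks, and single-quoted string literals):

```python
import sqlite3

# The 16 sample rows: (user_id, channel, day).
rows = [
    (1, 'A', '2020-01-01'), (1, 'A', '2020-01-01'), (1, 'A', '2020-01-02'),
    (1, 'B', '2020-01-03'), (1, 'C', '2020-01-01'), (2, 'A', '2020-01-02'),
    (2, 'D', '2020-01-03'), (3, 'A', '2020-01-01'), (3, 'B', '2020-01-02'),
    (4, 'B', '2020-01-03'), (4, 'C', '2020-01-01'), (4, 'A', '2020-01-02'),
    (5, 'B', '2020-01-03'), (5, 'C', '2020-01-01'), (6, 'A', '2020-01-02'),
    (7, 'A', '2020-01-03'),
]
con = sqlite3.connect(':memory:')
con.execute('create table multi_distinct_test (user_id int, channel text, day text)')
con.executemany('insert into multi_distinct_test values (?, ?, ?)', rows)

# The direct multi-distinct version.
direct = con.execute("""
    select channel,
           count(distinct case when day in ('2020-01-01') then user_id end),
           count(distinct case when day in ('2020-01-01','2020-01-02') then user_id end),
           count(distinct case when day in ('2020-01-01','2020-01-02','2020-01-03') then user_id end)
    from multi_distinct_test
    group by channel order by channel
""").fetchall()

# The rewritten dedup -> max -> sum version.
rewritten = con.execute("""
    select channel, sum(c1), sum(c2), sum(c3)
    from (
        select channel, user_id,
               max(case when day in ('2020-01-01') then 1 else 0 end) as c1,
               max(case when day in ('2020-01-01','2020-01-02') then 1 else 0 end) as c2,
               max(case when day in ('2020-01-01','2020-01-02','2020-01-03') then 1 else 0 end) as c3
        from (select distinct channel, user_id, day from multi_distinct_test) t1
        group by channel, user_id
    ) t2
    group by channel order by channel
""").fetchall()

assert direct == rewritten
# both return: [('A', 2, 5, 6), ('B', 0, 1, 4), ('C', 3, 3, 3), ('D', 0, 0, 1)]
```

Note that SQLite runs single-node, so this only verifies the results match; the shuffle-skew benefit of the rewrite shows up only on a distributed engine.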
Key points:
1. First deduplicate the base data.
2. Then group by the column that was distinct-counted (user_id in the example), reducing each date condition to a 0/1 flag with max(); summing those flags per channel counts each user at most once.
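The two key points above can also be traced step by step in plain Python on the same sample rows (a sketch; the three window sets mirror the `day in (...)` conditions):

```python
from collections import defaultdict

# The 16 sample rows: (user_id, channel, day).
rows = [
    (1, 'A', '2020-01-01'), (1, 'A', '2020-01-01'), (1, 'A', '2020-01-02'),
    (1, 'B', '2020-01-03'), (1, 'C', '2020-01-01'), (2, 'A', '2020-01-02'),
    (2, 'D', '2020-01-03'), (3, 'A', '2020-01-01'), (3, 'B', '2020-01-02'),
    (4, 'B', '2020-01-03'), (4, 'C', '2020-01-01'), (4, 'A', '2020-01-02'),
    (5, 'B', '2020-01-03'), (5, 'C', '2020-01-01'), (6, 'A', '2020-01-02'),
    (7, 'A', '2020-01-03'),
]
# The three cumulative date windows from the day in (...) conditions.
windows = [
    {'2020-01-01'},
    {'2020-01-01', '2020-01-02'},
    {'2020-01-01', '2020-01-02', '2020-01-03'},
]

# Key point 1: deduplicate (channel, user_id, day).
deduped = set(rows)

# Key point 2a: group by (channel, user_id) and collapse days into 0/1
# flags -- the max(if(...)) step of the inner query.
flags = defaultdict(lambda: [0, 0, 0])
for user_id, channel, day in deduped:
    for i, window in enumerate(windows):
        if day in window:
            flags[(channel, user_id)][i] = 1

# Key point 2b: summing the flags per channel counts each user at most
# once per window -- the outer sum(c1), sum(c2), sum(c3).
uv = defaultdict(lambda: [0, 0, 0])
for (channel, _user), user_flags in flags.items():
    for i, flag in enumerate(user_flags):
        uv[channel][i] += flag

print(dict(sorted(uv.items())))
# {'A': [2, 5, 6], 'B': [0, 1, 4], 'C': [3, 3, 3], 'D': [0, 0, 1]}
```

Because each (channel, user_id) pair contributes at most a 1 to each window, the final sums equal the exact distinct counts, which is precisely why the max-then-sum rewrite is safe.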