Hive: deduplicated counts across multiple dimension combinations with grouping sets, without distinct

Original

2020-02-24 06:12

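In Hive, you sometimes need deduplicated counts across several dimension combinations, for example the number of visiting users by carrier, phone brand, and network type. How can you avoid distinct (which is slow) and still keep grouping__id consistent with the values produced by the original dimension combinations?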

select * from temp.temp_active_user_info t limit 10;
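The post does not show the table definition, so here is a minimal sketch of what it might look like, for trying the queries in a scratch environment. The column types and the sample rows are assumptions made for illustration; only the table name and the four column names come from the queries in this post.

```sql
-- Hypothetical definition: column types are assumed, not taken from the post.
-- Assumes the database temp already exists.
create table if not exists temp.temp_active_user_info (
  user_id       string,   -- visiting user, the column to deduplicate on
  phone_brand   string,   -- phone brand dimension
  network_type  string,   -- network type dimension
  provider_name string    -- carrier dimension
);

-- Made-up sample rows. Use a scratch environment only:
-- if the table already exists, these rows are appended to it.
insert into table temp.temp_active_user_info values
  ('u1', 'Apple',  '4G',   'China Mobile'),
  ('u1', 'Apple',  'WiFi', 'China Mobile'),
  ('u2', 'Apple',  '4G',   'China Unicom'),
  ('u3', 'Huawei', '4G',   'China Mobile');
```

User u1 appears twice on purpose, so that a plain count(1) and a deduplicated count give different numbers.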

Method one: count with grouping sets and distinct

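select cast(grouping__id as bigint)&7 as group_id,   -- &7 keeps the low three bits, matching method two
       nvl(phone_brand, '剔重彙總') phone_brand,      -- '剔重彙總' is a label meaning "deduplicated total"
       nvl(network_type, '剔重彙總') network_type, 
       nvl(provider_name, '剔重彙總') provider_name, 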
       count(distinct user_id) user_num
  from temp.temp_active_user_info t
 group by phone_brand,     --1
          network_type,    --2
          provider_name    --4
grouping sets (
 (phone_brand),      --1
 (network_type, provider_name)   --6
);

Result: (screenshot of the query output, not preserved)

Method two: add user_id to every grouping set, then count with an outer group by

select group_id, phone_brand, network_type, provider_name, count(1) user_num
  from 
  (
    select cast(grouping__id as bigint)&7 as group_id, -- grouping__id must be cast to a numeric type first
           nvl(phone_brand, '剔重彙總') phone_brand,    -- '剔重彙總' is a label meaning "deduplicated total"
           nvl(network_type, '剔重彙總') network_type, 
           nvl(provider_name, '剔重彙總') provider_name, 
           user_id
      from temp.temp_active_user_info t
     group by phone_brand,     --1
              network_type,    --2
              provider_name,   --4
              user_id          --8
    grouping sets (
     (phone_brand, user_id),      --9&7=1
     (network_type, provider_name, user_id)   --14&7=6
    )
  ) t
 group by group_id, phone_brand, network_type, provider_name
;

Result: (screenshot of the query output, not preserved)

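Notes:

Cast grouping__id to a numeric type first.
Do not put spaces before or after &.
The number after & is the bit value of the deduplication column minus 1. For example, in the SQL above user_id has bit value 8, so the mask after & is 7.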
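The mask arithmetic in these notes can be checked directly. The sketch below first evaluates the two literals from the comments in method two, then lists the raw and masked grouping__id values that a cluster actually produces. The second query matters because, as far as I know, the numbering of grouping__id changed in Hive 2.3.0; the bit values in this post (1, 2, 4, 8) follow the older numbering, so on a newer release the raw values may differ and the mask would need adjusting. The aliases raw_id and group_cnt are names chosen for this example.

```sql
-- The two masks used in method two: 9&7 is 1, 14&7 is 6
select 9&7 as phone_brand_set, 14&7 as network_provider_set;

-- Raw versus masked grouping__id for the grouping sets of method two.
-- With the numbering assumed in this post, raw_id should be 9 and 14.
select raw_id, group_id, count(1) as group_cnt
  from
  (
    select grouping__id as raw_id,
           cast(grouping__id as bigint)&7 as group_id
      from temp.temp_active_user_info t
     group by phone_brand,
              network_type,
              provider_name,
              user_id
    grouping sets (
     (phone_brand, user_id),
     (network_type, provider_name, user_id)
    )
  ) g
 group by raw_id, group_id;
```

If raw_id does not come back as 9 and 14, the bit values in the comments do not apply to that Hive version, and the &7 mask should be reworked before the results are relied on.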

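In practice, both methods return the same results with the same grouping__id values, but method two avoids distinct. On very large data sets, method two runs noticeably faster and uses fewer resources than method one.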
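One way to confirm that the two methods agree on a given table is to join their outputs and list any rows that differ. This is only a sketch: the CTE names m1 and m2 and the output aliases are introduced here for illustration, and the queries inside them are copies of method one and method two from above. An empty result means the counts match for every group.

```sql
with m1 as (  -- method one: grouping sets with count(distinct)
  select cast(grouping__id as bigint)&7 as group_id,
         nvl(phone_brand, '剔重彙總') phone_brand,   -- label meaning "deduplicated total"
         nvl(network_type, '剔重彙總') network_type,
         nvl(provider_name, '剔重彙總') provider_name,
         count(distinct user_id) user_num
    from temp.temp_active_user_info t
   group by phone_brand, network_type, provider_name
  grouping sets ((phone_brand), (network_type, provider_name))
),
m2 as (       -- method two: user_id inside every grouping set, then count rows
  select group_id, phone_brand, network_type, provider_name, count(1) user_num
    from
    (
      select cast(grouping__id as bigint)&7 as group_id,
             nvl(phone_brand, '剔重彙總') phone_brand,
             nvl(network_type, '剔重彙總') network_type,
             nvl(provider_name, '剔重彙總') provider_name,
             user_id
        from temp.temp_active_user_info t
       group by phone_brand, network_type, provider_name, user_id
      grouping sets ((phone_brand, user_id), (network_type, provider_name, user_id))
    ) t
   group by group_id, phone_brand, network_type, provider_name
)
-- Rows present in only one result, or with different counts
select coalesce(m1.group_id, m2.group_id)           as group_id,
       coalesce(m1.phone_brand, m2.phone_brand)     as phone_brand,
       coalesce(m1.network_type, m2.network_type)   as network_type,
       coalesce(m1.provider_name, m2.provider_name) as provider_name,
       m1.user_num as distinct_user_num,
       m2.user_num as group_by_user_num
  from m1
  full outer join m2
    on m1.group_id = m2.group_id
   and m1.phone_brand = m2.phone_brand
   and m1.network_type = m2.network_type
   and m1.provider_name = m2.provider_name
 where m1.user_num is null
    or m2.user_num is null
    or m1.user_num <> m2.user_num;
```

One difference to watch for: count(distinct user_id) ignores rows where user_id is NULL, while the group by form counts a NULL user_id as one user. If the table can contain NULL user_id values, filter them out in the inner query of method two.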
