Comparing DISTINCT and GROUP BY deduplication performance in Hive

Preface

  • OS: CentOS 7
  • Hadoop: 2.7.7
  • Hive: 2.3.0
  • Goal: for a column with varying numbers of distinct values, record the execution time of deduplication queries that use DISTINCT and GROUP BY on that column, and compare the two approaches' performance in each scenario
  • Experiment table (timings filled in under Results):

Table   Rows     Distinct values in column   DISTINCT   GROUP BY
tab_1   100000   3
tab_2   100000   10000

Experiment

1) Create the test tables

drop table if exists tab_1;
create table tab_1(
    id int,
    value int
)
row format delimited
fields terminated by '\t';

drop table if exists tab_2;
create table tab_2 like tab_1;

2) Load the test datasets

Test dataset tab_1.txt:

1	1
2	1
3	1
4	3
5	1
...
99997	3
99998	2
99999	3
100000	2

Test dataset tab_2.txt:

1	3715
2	7211
3	4909
4	2913
5	9839
...
99997	2884
99998	698
99999	4839
100000	2101
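The elided rows above can be reproduced with a small script. This is a sketch only: the exact values in the original files are unknown, so it assumes uniformly random values, producing files of the same shape (100,000 tab-separated rows with 3 and roughly 10,000 distinct values respectively):

```python
import os
import random

def generate(path, num_rows, num_distinct, seed=42):
    """Write num_rows tab-separated (id, value) rows, with value drawn
    uniformly from 1..num_distinct (a stand-in for the original data)."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for i in range(1, num_rows + 1):
            f.write(f"{i}\t{rng.randint(1, num_distinct)}\n")

# Paths match the load statements below.
os.makedirs("/tmp/hive/data/tbl", exist_ok=True)
generate("/tmp/hive/data/tbl/tab_1.txt", 100000, 3)      # 3 distinct values
generate("/tmp/hive/data/tbl/tab_2.txt", 100000, 10000)  # ~10000 distinct values
```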

Load each dataset into its table:

load data local inpath '/tmp/hive/data/tbl/tab_1.txt' overwrite into table tab_1;
load data local inpath '/tmp/hive/data/tbl/tab_2.txt' overwrite into table tab_2;

3) Run the queries and record the execution times

Disable automatic local mode:

hive> set hive.exec.mode.local.auto=false;

Set the reducer count manually:

hive> set mapreduce.job.reduces=3;

Run the test queries and record the execution times:

select distinct(value) from tab_1; -- 31.335s
select value from tab_1 group by value; -- 31.587s
select distinct(value) from tab_2; -- 32.376s
select value from tab_2 group by value; -- 33.834s

4) Execution plan comparison

  • explain select distinct(value) from tab_1;
0: jdbc:hive2://hadoop101:10000/default (test)> explain select distinct(value) from tab_1;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: tab_1                           |
|             Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: value (type: int)       |
|               outputColumnNames: value             |
|               Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 keys: value (type: int)            |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: int) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: int) |
|                   Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           keys: KEY._col0 (type: int)              |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

  • explain select value from tab_1 group by value;
0: jdbc:hive2://hadoop101:10000/default (test)> explain select value from tab_1 group by value;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: tab_1                           |
|             Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: value (type: int)       |
|               outputColumnNames: value             |
|               Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 keys: value (type: int)            |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: int) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: int) |
|                   Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           keys: KEY._col0 (type: int)              |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+
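Rather than comparing the two plans by eye, the EXPLAIN outputs can be diffed mechanically. A minimal sketch (the plan strings here are abbreviated stand-ins; in practice, paste or read the full outputs from files):

```python
import difflib

def diff_plans(plan_a, plan_b):
    """Return the unified diff of two EXPLAIN outputs (empty list if identical)."""
    return list(difflib.unified_diff(plan_a.splitlines(), plan_b.splitlines(),
                                     fromfile="distinct", tofile="group_by",
                                     lineterm=""))

# Abbreviated stand-ins for the two plans shown above.
plan_distinct = "Group By Operator\n  keys: value (type: int)\n  mode: hash"
plan_group_by = "Group By Operator\n  keys: value (type: int)\n  mode: hash"
d = diff_plans(plan_distinct, plan_group_by)
print("plans identical" if not d else "\n".join(d))
```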

5) Results

Table   Rows     Distinct values in column   DISTINCT   GROUP BY
tab_1   100000   3                           31.335s    31.587s
tab_2   100000   10000                       32.376s    33.834s

Conclusion:

In Hive 2.3.0, deduplication with DISTINCT and with GROUP BY produces the same execution plan and roughly the same execution time, so there is essentially no performance difference between the two.

If there are any shortcomings in the procedure or conclusion above, corrections are welcome; this conclusion is for reference only.

PS: Take care when using aggregate functions in Hive. For an aggregation over the whole table (no GROUP BY), Hive generally runs the final aggregation in a single reducer, even if the reducer count is set manually, so on a large dataset that lone reducer can become a bottleneck or fail outright. In such cases, split the work into steps with a subquery: in the subquery, group by some column and aggregate within each group, so that multiple reducers can run in parallel; then, in the outer query, aggregate the subquery's results into the final answer.
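As an illustration (this query is not part of the experiment above), an overall `select count(distinct value) from tab_2;` can be rewritten as `select count(1) from (select value from tab_2 group by value) t;` so that deduplication runs across many reducers before the final count. A Python sketch of why the rewrite parallelizes, with hash partitioning standing in for Hive's shuffle:

```python
import random
from collections import defaultdict

def count_distinct_two_stage(rows, num_reducers=3):
    # Stage 1 (the subquery's GROUP BY): rows are shuffled to reducers
    # by key, so each reducer deduplicates only its own partition.
    partitions = defaultdict(set)
    for v in rows:
        partitions[hash(v) % num_reducers].add(v)
    # Stage 2 (the outer count): partitions hold disjoint keys, so the
    # final aggregation just sums the per-reducer distinct counts.
    return sum(len(s) for s in partitions.values())

rows = [random.randrange(10000) for _ in range(100000)]
assert count_distinct_two_stage(rows) == len(set(rows))
```

Because each distinct key lands in exactly one partition, the cheap final sum is exact; only the small per-reducer results reach the last stage, instead of every row.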


End~
