Controlling the Number of Reducers in Hive

Original

2020-06-16 02:54

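1. How Hive determines the number of reducers on its own:

The number of reducers strongly affects how efficiently a job runs. When it is not specified, Hive estimates one from the following two settings:
hive.exec.reducers.bytes.per.reducer (amount of data processed by each reduce task; default 1000^3 = 1 GB in Hive releases before 0.14, 256 MB in later releases)
hive.exec.reducers.max (maximum number of reducers per job; default 999 in Hive releases before 0.14, 1009 in later releases)
The formula for the number of reducers is simple: N = min(parameter 2, total input size / parameter 1), with the division rounded up
That is, if the total reduce input (the map output) does not exceed 1 GB, there will be only one reduce task.
Example: select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
/group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04 has a total size of just over 9 GB, so this statement runs with 10 reducers.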
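The formula above can be checked with a few lines of Python. This is a minimal sketch, not Hive's actual code: the function name `estimate_reducers`, the rounding up of the division, the floor of one reducer, and the 9.6 GB input size (standing in for the "just over 9 GB" partition) are assumptions made for illustration.

```python
import math

def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=1_000_000_000,
                      max_reducers=999):
    """N = min(max_reducers, total input / bytes per reducer), at least 1."""
    wanted = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))

partition_bytes = 9_600_000_000  # hypothetical size of the pt=2012-07-04 partition

print(estimate_reducers(partition_bytes))                                 # 10
print(estimate_reducers(partition_bytes, bytes_per_reducer=500_000_000))  # 20
print(estimate_reducers(800_000_000))                                     # 1
```

Halving the bytes-per-reducer threshold doubles the estimate, and any input below the threshold collapses to a single reducer.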

2. Adjusting the number of reducers, method 1:
Change the value of hive.exec.reducers.bytes.per.reducer:
set hive.exec.reducers.bytes.per.reducer=500000000; -- 500 MB
select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
This time there are 20 reducers.

3. Adjusting the number of reducers, method 2:
set mapred.reduce.tasks = 15;
select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
This time there are 15 reducers.

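4. More reducers is not always better:
As with map tasks, starting and initializing reducers costs time and resources.
In addition, each reducer writes one output file. If a job produces many small files and those files become the input of the next job, that job in turn suffers from the too-many-small-files problem.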
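Because each reducer writes one output file, the average output file size follows directly from the reducer count. The sketch below is only illustrative arithmetic: the function name and the 2 GB output size are assumptions, not values from this article.

```python
def average_output_file_mb(total_output_bytes, num_reducers):
    """Average size in MB of the files a job writes, one file per reducer."""
    return total_output_bytes / num_reducers / 1_000_000

total_output = 2_000_000_000  # hypothetical: the job writes 2 GB in total

for reducers in (10, 200, 999):
    size_mb = average_output_file_mb(total_output, reducers)
    print(f"{reducers} reducers -> about {size_mb:.1f} MB per file")
```

With the same output volume, raising the reducer count from 10 to 999 shrinks the files from hundreds of megabytes to a couple of megabytes each, which is the small-file situation described above.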

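5. When there is only one reducer:
Often you will find that a job runs with a single reduce task no matter how large the data is and regardless of whether you set any parameter to adjust the number of reducers.
Besides the case where the data volume is smaller than the value of hive.exec.reducers.bytes.per.reducer, a single reduce task has the following causes:
a) An aggregation without group by, for example writing select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; as select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
This is very common; please rewrite such queries wherever possible.
b) Order by is used
c) There is a Cartesian product
In these cases I currently have no good solution other than finding a workaround or avoiding them. These operations are all global, so Hadoop has no choice but to complete them with a single reducer.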
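Why a global operation ends up on one reducer can be seen from how rows are routed by their shuffle key. The following is a simplified model, not Hadoop's partitioner: `reducer_for` and the CRC32-based hash are hypothetical stand-ins chosen so that the example is deterministic.

```python
import zlib

def reducer_for(key, num_reducers):
    """Route a row to a reducer by hashing its shuffle key (simplified model)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

rows = [{"pt": "2012-07-0%d" % day} for day in range(1, 8)]

# group by pt: the shuffle key is the value of pt, so rows can spread out.
grouped_targets = {reducer_for(row["pt"], 4) for row in rows}

# count(1) without group by: every row shares the same (empty) key,
# so every row is routed to the same reducer.
global_targets = {reducer_for("", 4) for _ in rows}

print(sorted(grouped_targets))
print(len(global_targets))  # 1
```

However many reducers are configured, a query whose rows all share one key can keep only one of them busy.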

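Likewise, keep these two principles in mind when setting the number of reducers: use a suitable number of reducers for a large data volume, and have each reduce task process a suitable amount of data.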

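Hive is a tool that parses strings conforming to SQL syntax and generates MapReduce jobs that can run on Hadoop.

When using Hive, design your SQL around the characteristics of distributed computing. Hive differs from a traditional relational database, so set aside the habits formed while developing against relational databases.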

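Basic principles:

1: Filter data as early as possible to reduce the data volume at every stage. For partitioned tables, add a partition filter, and select only the columns you actually need.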

select ... from A
join B
on A.key = B.key
where A.userid > 10
and B.userid < 10
and A.dt = '20120417'
and B.dt = '20120417';

should be rewritten as:

select .... from (select .... from A
where dt = '20120417'
and userid > 10
) a
join (select .... from B
where dt = '20120417'
and userid < 10
) b
on a.key = b.key;

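2: Keep operations as atomic as possible, and avoid packing complex logic into a single SQL statement

Intermediate tables can be used to implement complex logic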

drop table if exists tmp_table_1;
create table if not exists tmp_table_1 as
select ......;

drop table if exists tmp_table_2;
create table if not exists tmp_table_2 as
select ......;

drop table if exists result_table;
create table if not exists result_table as
select ......;

drop table if exists tmp_table_1;
drop table if exists tmp_table_2;

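3: Try to keep the number of jobs launched by a single SQL statement below 5

4: Use mapjoin with caution. In general only tables with fewer than 2,000 rows and smaller than 1 MB should be used (the limits can be raised somewhat after the cluster is expanded). Take care to put the small table on the left side of the join (at present many jobs in TCL put the small table on the right side). Otherwise the join causes heavy disk and memory consumption.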

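5: Before writing SQL, understand the characteristics of the data itself. If the query contains join or group operations, check whether data skew may occur

If data skew occurs, handle it as follows: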

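set hive.exec.reducers.max=200;
set mapred.reduce.tasks=200; -- increase the number of reducers
set hive.groupby.mapaggr.checkinterval=100000; -- number of rows after which map-side aggregation checks the size of its grouping keys; set according to the data volume
set hive.groupby.skewindata=true; -- set to true if the skew occurs in a group by
set hive.skewjoin.key=100000; -- a join key with more rows than this value is treated as skewed and split out; set according to the data volume
set hive.optimize.skewjoin=true; -- set to true if the skew occurs in a join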
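With hive.groupby.skewindata=true, the aggregation is commonly described as running in two jobs: the first spreads rows randomly across reducers and pre-aggregates them, and the second merges the partial results by key. The Python sketch below imitates that idea on an in-memory list. The function name, the sample data, and the fixed random seed are assumptions for illustration, and the code does not reflect how Hive implements the feature internally.

```python
import random
from collections import Counter

def two_stage_count(keys, num_reducers, seed=0):
    """Count rows per key in two stages so that one hot key is spread out."""
    rng = random.Random(seed)
    # Stage 1: send each row to a random reducer and count per key locally.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1
    # Stage 2: merge the partial counts by key.
    totals = Counter()
    for partial in partials:
        totals.update(partial)
    stage1_load = [sum(partial.values()) for partial in partials]
    return totals, stage1_load

# Hypothetical skewed data: one key accounts for most of the rows.
keys = ["hot"] * 9000 + ["a", "b", "c"] * 100

totals, stage1_load = two_stage_count(keys, num_reducers=4)
print(totals["hot"])  # 9000
print(stage1_load)    # rows handled by each first-stage reducer
```

The hot key's rows are shared among all first-stage reducers, and the second stage only has to merge a handful of partial counts per key.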

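6: If a union all has more than 2 branches, or each branch handles a large amount of data, split it into multiple insert into statements. In actual tests this improved execution time by as much as 50%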

insert overwrite table tablename partition (dt= ....)
select ..... from (
select ... from A
union all
select ... from B
union all
select ... from C
) R
where ...;

can be rewritten as:

insert into table tablename partition (dt= ....)
select .... from A
WHERE ...;

insert into table tablename partition (dt= ....)
select .... from B
WHERE ...;

insert into table tablename partition (dt= ....)
select .... from C
WHERE ...;
