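Hive SQL Tuning Series: Hive Strict Mode and How to Use It Sensibly

Overview

On the same cluster environment, Hive tuning comes in two forms: parameter tuning and SQL tuning.

This post covers Hive strict mode.

A couple of days ago I was optimizing a SQL script inherited from a predecessor and found the strict-mode parameters used like this, which is seriously wrong.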

set hive.strict.checks.cartesian.product=false;
set hive.mapred.mode=nonstrict;

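I also found that parameters were applied indiscriminately: whatever the size of the SQL, a whole pile of parameters was simply pasted in, like this.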

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;
set hive.merge.mapfiles = true; 
set hive.merge.mapredfiles = true; 
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize = 256000000;
set mapred.max.split.size=1024000000;
set mapred.min.split.size.per.node=1024000000;
set mapred.min.split.size.per.rack=1024000000; 
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.join.emit.interval = 2000;
set hive.mapjoin.size.key = 20000;
set hive.mapjoin.cache.numrows = 20000;
set hive.exec.reducers.bytes.per.reducer=2000000000;
set hive.exec.reducers.max=999;
set hive.map.aggr=true;
set hive.groupby.mapaggr.checkinterval=100000;
set hive.auto.convert.join = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.exec.dynamic.partition = true;
set hive.cli.print.header=true;
set hive.resultset.use.unique.column.names=false;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx4096m;
set mapred.max.split.size=1024000000;
set mapred.min.split.size.per.node=1024000000;
set mapred.min.split.size.per.rack=1024000000;

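This is optimization of a sort, but it has no clear target, and to some extent it actually consumes more computing resources.

So I plan to start a series of posts, the Hive SQL tuning series, on how to use parameters sensibly for SQL optimization and which parameters to use in which situations.

This post starts with how to use the strict-mode parameters.

The main text follows.

1. Strict mode

Hive's strict mode is a set of safety restrictions that stop users from submitting harmful SQL that consumes large amounts of resources and can bring down the runtime environment.

Most of us have at some point submitted SQL that ran for a very long time and left the cluster short of resources, so the motivation should be easy to understand.

The earlier post Hive動態分區詳解 (a detailed look at Hive dynamic partitions) also touched on this.

1.1 Setting the parameter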

-- strict enables strict mode; nonstrict disables it
set hive.mapred.mode=strict;

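1.2 Viewing the parameter

Use Hive's set command to view a specific parameter.

-- Check the Hive mode in the CLI; the result below means strict mode has not been enabled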
hive> set hive.mapred.mode;
hive.mapred.mode is undefined
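
The individual check flags described in section 1.3 can be inspected in the same way. A small sketch: the values printed depend on the Hive version and the cluster configuration, so no output is shown here.

```sql
-- Print the current value of each strict-check flag.
-- A flag that this Hive version does not know is reported as undefined.
set hive.strict.checks.no.partition.filter;
set hive.strict.checks.orderby.no.limit;
set hive.strict.checks.cartesian.product;
```

Checking the effective values first makes it clear which settings a script actually needs to override, instead of pasting a full block of parameters.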

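1.3 What strict mode restricts, and the corresponding parameters

With strict mode enabled, Hive blocks the following three kinds of queries:

a. A query on a partitioned table whose where clause does not filter on a partition column;

b. An order by query without a limit clause;

c. A Cartesian-product join, that is, a join with neither an on condition nor a where condition.

Each of these three cases also has its own parameter, so they can be controlled individually.

Queries on partitioned tables must specify a partition

-- enable the restriction (default: false)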
set hive.strict.checks.no.partition.filter=true;

order by must specify a limit

-- enable the restriction (default: false)
set hive.strict.checks.orderby.no.limit=true;

Restricting Cartesian products

-- enable the restriction (default: false)
set hive.strict.checks.cartesian.product=true;

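2. Hands-on

2.1 Queries on partitioned tables must specify a partition

Why a partition must be specified: if the table has a large number of partitions and nothing restricts the scan, a query can read far more data than expected.

-- test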
create table `lubian` (
`id` string comment 'id',
`name` string comment 'name'
)
comment 'lubian' 
PARTITIONED BY (ymd string)
row format delimited fields terminated by '\t' lines terminated by '\n' 
stored as orc;

set hive.strict.checks.no.partition.filter=true;
select * from lubian limit 111;

Result

FAILED: SemanticException [Error 10056]:
    Queries against partitioned tables without a partition filter are disabled for safety reasons.
    If you know what you are doing, please set hive.strict.checks.no.partition.filter to false
    and make sure that hive.mapred.mode is not set to 'strict' to proceed.
    Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.
    No partition predicate for Alias "lubian" Table "lubian"

-- with a filter on the partition column, the query is allowed
select * from lubian where ymd='11' limit 111;
Time taken: 0.77 seconds
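
When it is unclear which partition values exist, they can be listed first and then used in the filter. A minimal sketch against the `lubian` table above; the date values are hypothetical and only illustrate the shape of the predicate.

```sql
-- List the partitions that exist, to pick a valid filter value.
show partitions lubian;

-- Filter on the partition column ymd. The bounds below are made-up
-- example values; a bounded range limits the scan to those partitions.
select id, name
from lubian
where ymd >= '20240101' and ymd <= '20240107'
limit 111;
```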

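2.2 order by must specify a limit

Why order by must specify a limit: order by is a global sort, so all the data is handled by a single reduce task. Requiring a limit keeps that one reducer from running so long that it blocks the job.

-- test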
set hive.strict.checks.orderby.no.limit=true;
select * from lubian order by name;

Result

FAILED: SemanticException 1:36
    Order by-s without limit are disabled for safety reasons.
    If you know what you are doing, please set hive.strict.checks.orderby.no.limit to false
    and make sure that hive.mapred.mode is not set to 'strict' to proceed.
    Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features..
    Error encountered near token 'name'
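
Two ways of staying within this check can be sketched against the `lubian` table from section 2.1. The limit of 100 and the partition value are arbitrary illustrations. When a global order is not actually needed, `distribute by` together with `sort by` sorts within each reducer rather than sending all rows through one.

```sql
set hive.strict.checks.orderby.no.limit=true;

-- Global sort, but bounded: the limit satisfies the strict check.
select * from lubian where ymd = '11' order by name limit 100;

-- No global order needed: rows with the same name go to the same reducer,
-- and each reducer sorts only its own share of the data.
select * from lubian where ymd = '11' distribute by name sort by name;
```

The second form does not produce one totally ordered result, so it only fits cases where ordering within each group is enough.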

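2.3 Restricting Cartesian products

Why Cartesian products are restricted: a Cartesian product can make the data volume explode. For example, joining two tables of 1,000 rows each produces 1,000,000 rows, which is quadratic (n squared) growth. In addition, when a Cartesian product is triggered, the join runs in a single reduce task.

-- test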
set hive.strict.checks.cartesian.product=true;
select t1.*,t2.* from lubian as t1
inner join lubian as t2;

Result

FAILED: SemanticException Cartesian products are disabled for safety reasons.
    If you know what you are doing, please set hive.strict.checks.cartesian.product to false
    and make sure that hive.mapred.mode is not set to 'strict' to proceed.
    Note that you may get errors or incorrect results
    if you make a mistake while using some of the unsafe features.
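
For contrast, the same self-join with a join condition is sketched below. Joining on `id` is an assumption made for illustration, since the original test does not say which key was intended.

```sql
set hive.strict.checks.cartesian.product=true;

-- Self-join with an explicit join key: each row is paired only with
-- rows that share its id, instead of with every row in the table.
select t1.id, t1.name, t2.name as other_name
from lubian as t1
inner join lubian as t2
  on t1.id = t2.id;
```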

3. Combining the parameters

3.1 Parameters

The Hive strict-mode parameters are set as follows:

set hive.mapred.mode=strict;
set hive.strict.checks.no.partition.filter=true;
set hive.strict.checks.orderby.no.limit=true;
set hive.strict.checks.cartesian.product=true;

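Of the parameters above, set hive.mapred.mode=strict; on its own turns on strict checking for all three cases. Alternatively, the parameter for each restriction can be used to turn on just that check.

3.2 Examples of combined use

The parameters can also be combined, but the following combination is problematic: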

-- disable the Cartesian-product check
set hive.strict.checks.cartesian.product=false;
-- disable strict mode
set hive.mapred.mode=nonstrict;

It should instead be one of two things. Either strict mode is off overall, but one of the checks is still wanted, as follows:

set hive.mapred.mode=nonstrict;
set hive.strict.checks.cartesian.product=true;

Or strict mode is on overall, but one of the checks should be skipped:

set hive.mapred.mode=strict;
set hive.strict.checks.cartesian.product=false;
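
How hive.mapred.mode interacts with the individual hive.strict.checks.* flags can differ between Hive versions, so it is worth confirming that a combination really behaves as intended before relying on it. A minimal check, reusing the `lubian` table from section 2; it assumes that explain compiles the statement without running the join.

```sql
set hive.mapred.mode=nonstrict;
set hive.strict.checks.cartesian.product=true;

-- Print the values that are now in effect for this session.
set hive.mapred.mode;
set hive.strict.checks.cartesian.product;

-- Probe query: if the Cartesian-product check is active, this should be
-- rejected with the SemanticException shown in section 2.3.
-- If a plan is printed instead, the check is not in effect.
explain
select t1.*, t2.* from lubian as t1 inner join lubian as t2;
```

The same probe works for the other combination by swapping the two set values.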

That is all for this post.
