Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is as follows:
EXPLAIN [EXTENDED] query
hive> explain SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME invites) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar)) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (> (. (TOK_TABLE_OR_COL a) foo) 0)) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) bar))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a
          TableScan
            alias: a
            Filter Operator
              predicate:
                  expr: (foo > 0)
                  type: boolean
              Filter Operator
                predicate:
                    expr: (foo > 0)
                    type: boolean
                Select Operator
                  expressions:
                        expr: bar
                        type: string
                  outputColumnNames: bar
                  Group By Operator
                    aggregations:
                          expr: count()
                    bucketGroup: false
                    keys:
                          expr: bar
                          type: string
                    mode: hash
                    outputColumnNames: _col0, _col1
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: string
                      sort order: +
                      Map-reduce partition columns:
                            expr: _col0
                            type: string
                      tag: -1
                      value expressions:
                            expr: _col1
                            type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

Time taken: 0.133 seconds
hive> explain insert overwrite TABLE lpx SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) ;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME pokes) t1) (TOK_TABREF (TOK_TABNAME invites) t2) (= (. (TOK_TABLE_OR_COL t1) bar) (. (TOK_TABLE_OR_COL t2) bar)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME lpx))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) bar)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) foo)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) foo)))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t1
          TableScan
            alias: t1
            Reduce Output Operator
              key expressions:
                    expr: bar
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: bar
                    type: string
              tag: 0
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
        t2
          TableScan
            alias: t2
            Reduce Output Operator
              key expressions:
                    expr: bar
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: bar
                    type: string
              tag: 1
              value expressions:
                    expr: foo
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col5
          Select Operator
            expressions:
                  expr: _col1
                  type: string
                  expr: _col0
                  type: int
                  expr: _col5
                  type: int
            outputColumnNames: _col0, _col1, _col2
            File Output Operator
              compressed: false
              GlobalTableId: 1
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: default.lpx

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.lpx

  Stage: Stage-2
    Stats-Aggr Operator
Note:
ABSTRACT SYNTAX TREE is the abstract syntax tree of the query.
From the header:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0
we can read off the structure of the plan: the whole task is executed in three stages.
The first stage is Stage-1;
the second stage is Stage-0, which cannot run until Stage-1 has produced its result;
the third stage is Stage-2, which in turn depends on the result of Stage-0.
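The STAGE DEPENDENCIES block can be read mechanically. A small Python sketch (the helper names are mine, not part of Hive) parses it into a dependency map and derives a valid execution order:

```python
import re

def parse_stage_dependencies(text):
    """Parse a STAGE DEPENDENCIES block into {stage: [prerequisite stages]}."""
    deps = {}
    for line in text.splitlines():
        m = re.match(r"\s*(Stage-\d+) is a root stage", line)
        if m:
            deps[m.group(1)] = []          # root stages have no prerequisites
            continue
        m = re.match(r"\s*(Stage-\d+) depends on stages: (.+)", line)
        if m:
            deps[m.group(1)] = [s.strip() for s in m.group(2).split(",")]
    return deps

def execution_order(deps):
    """Simple topological sort: run a stage once all its prerequisites ran."""
    order, done = [], set()
    while len(done) < len(deps):
        for stage, prereqs in deps.items():
            if stage not in done and all(p in done for p in prereqs):
                order.append(stage)
                done.add(stage)
    return order

explain_header = """\
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0
"""
print(execution_order(parse_stage_dependencies(explain_header)))
# ['Stage-1', 'Stage-0', 'Stage-2']
```

This mirrors what Hive's driver does when it schedules the stage DAG: only stages whose prerequisites have completed are launched.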
Below, Stage-1 and Stage-0 are explained separately. The SQL can be split into two steps:
(1) SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar);
(2) INSERT OVERWRITE TABLE lpx.
Stage: Stage-1 corresponds to one complete MapReduce job, made up of a Map Operator Tree and a Reduce Operator Tree: the Map Operator Tree describes the map task, and the Reduce Operator Tree describes the reduce task.
The Map Operator Tree shows two parallel scans, t1 and t2, effectively SELECT t1.bar, t1.foo FROM t1 and SELECT t2.foo FROM t2; each map task also produces the input for the reduce phase (the Reduce Output Operator).
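The tagged map outputs (tag: 0 for pokes, tag: 1 for invites) and the reduce-side join can be modeled in a few lines of Python. This is a toy illustration of the shuffle-join idea, not Hive internals; the sample rows are made up:

```python
from collections import defaultdict

# Toy model of the common (shuffle) join in Stage-1: each map side emits
# (join key, tag, value columns); the reduce side groups rows by key and
# pairs tag-0 rows (pokes t1) with tag-1 rows (invites t2), mirroring
# "Inner Join 0 to 1" in the plan.
pokes   = [("b1", 1), ("b2", 2)]      # (bar, foo) rows from t1
invites = [("b1", 10), ("b3", 30)]    # (bar, foo) rows from t2

shuffle = defaultdict(lambda: {0: [], 1: []})
for bar, foo in pokes:                # tag: 0, value expressions: foo, bar
    shuffle[bar][0].append((foo, bar))
for bar, foo in invites:              # tag: 1, value expressions: foo
    shuffle[bar][1].append((foo,))

result = []
for bar, sides in shuffle.items():    # inner join: both sides must match
    for foo1, bar1 in sides[0]:
        for (foo2,) in sides[1]:
            result.append((bar1, foo1, foo2))   # t1.bar, t1.foo, t2.foo

print(result)
# [('b1', 1, 10)]
```

Only the key "b1" appears on both sides, so only one joined row survives, just as an inner join would produce.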
The Reduce Operator Tree shows that the map outputs are joined on the join condition and that, through the predefined output format, the result is written to HDFS in the storage format of default.lpx. When the lpx table was created, no storage format was specified, so it defaults to Text, read and written with TextInputFormat and TextOutputFormat:
table:
    input format: org.apache.hadoop.mapred.TextInputFormat
    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    name: default.lpx
The input format is org.apache.hadoop.mapred.TextInputFormat because the temporary output files produced by the earlier map phase are saved in TextOutputFormat format, so the reduce side naturally reads them back with TextInputFormat. These details are controlled by Hadoop's MapReduce machinery; Hive only needs to specify the formats.
The serde is the org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe class. The values it holds are _col0, _col1, _col2, which are exactly the t1.bar, t1.foo, t2.foo we asked for; concretely, each output row is _col0 + lpx's column delimiter + _col1 + lpx's column delimiter + _col2. From the output format, org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, we know that class handles the output.
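That row layout can be sketched in Python. This assumes lpx kept Hive's default row format, where the field delimiter is the control character \x01 (^A); the helper name is mine:

```python
# Sketch of how LazySimpleSerDe lays out one text row: column values
# joined by the table's field delimiter. \x01 is Hive's default field
# delimiter for tables that, like lpx here, do not override it.
FIELD_DELIM = "\x01"

def serialize_row(cols):
    """Join column values (_col0, _col1, ...) into one delimited line."""
    return FIELD_DELIM.join(str(c) for c in cols)

# Hypothetical joined row: t1.bar, t1.foo, t2.foo
row = serialize_row(["b1", 1, 10])
assert row == "b1\x011\x0110"
```

Reading the file back, TextInputFormat splits on newlines and LazySimpleSerDe lazily splits each line on the same delimiter to recover the columns.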
Stage-0 corresponds to the second step above. The temporary files produced by Stage-1 (say, tmp) are moved into the lpx table by Stage-0. The Move Operator indicates that this is not a MapReduce job: Hive only needs to invoke MoveTask, and before the move it checks that the input files match lpx's storage format.
ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain