Hive SQL Execution Plan

Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is as follows:

EXPLAIN [EXTENDED] query
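
The optional EXTENDED keyword adds extra physical detail to each operator in the plan, such as the file paths involved; the exact output depends on the Hive version. A minimal usage sketch against the same invites table used below (output omitted):

hive> EXPLAIN EXTENDED SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
-- prints the same plan with additional physical information for each operator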

 

hive> explain SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME invites) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar)) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (> (. (TOK_TABLE_OR_COL a) foo) 0)) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) bar))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a
          TableScan
            alias: a
            Filter Operator
              predicate:
                  expr: (foo > 0)
                  type: boolean
              Filter Operator
                predicate:
                    expr: (foo > 0)
                    type: boolean
                Select Operator
                  expressions:
                        expr: bar
                        type: string
                  outputColumnNames: bar
                  Group By Operator
                    aggregations:
                          expr: count()
                    bucketGroup: false
                    keys:
                          expr: bar
                          type: string
                    mode: hash
                    outputColumnNames: _col0, _col1
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: string
                      sort order: +
                      Map-reduce partition columns:
                            expr: _col0
                            type: string
                      tag: -1
                      value expressions:
                            expr: _col1
                            type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1


Time taken: 0.133 seconds

hive> explain insert overwrite TABLE lpx SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) ;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME pokes) t1) (TOK_TABREF (TOK_TABNAME invites) t2) (= (. (TOK_TABLE_OR_COL t1) bar) (. (TOK_TABLE_OR_COL t2) bar)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME lpx))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) bar)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) foo)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) foo)))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t1
          TableScan
            alias: t1
            Reduce Output Operator
              key expressions:
                    expr: bar
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: bar
                    type: string
              tag: 0
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
        t2
          TableScan
            alias: t2
            Reduce Output Operator
              key expressions:
                    expr: bar
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: bar
                    type: string
              tag: 1
              value expressions:
                    expr: foo
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col5
          Select Operator
            expressions:
                  expr: _col1
                  type: string
                  expr: _col0
                  type: int
                  expr: _col5
                  type: int
            outputColumnNames: _col0, _col1, _col2
            File Output Operator
              compressed: false
              GlobalTableId: 1
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: default.lpx

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.lpx

  Stage: Stage-2
    Stats-Aggr Operator

Note:
ABSTRACT SYNTAX TREE is the abstract syntax tree (AST) that Hive builds from the query.

From the header:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0
This shows the job structure of the plan: the whole query is executed in three stages.
The first consists of Stage-1;
the second consists of Stage-0, whose processing must wait for the result of Stage-1;
the third consists of Stage-2, whose processing must wait for the result of Stage-0.
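
Stage-2 (the Stats-Aggr Operator) is the statistics-gathering step that Hive appends to an INSERT when automatic stats collection is enabled. A hedged sketch, assuming the hive.stats.autogather setting is available in this Hive version:

hive> SET hive.stats.autogather=false;   -- turn off automatic table statistics collection
hive> EXPLAIN INSERT OVERWRITE TABLE lpx SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar);
-- with stats gathering disabled, the Stats-Aggr stage (Stage-2) should no longer appear in the plan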

The following explains Stage-1 and Stage-0 in turn. The SQL statement can be broken into two steps:
(1) SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar);
(2) INSERT OVERWRITE TABLE lpx;
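
To reproduce this walkthrough, the three tables only need the columns the plan refers to. A minimal DDL sketch, assuming the column types shown in the plan (foo: int, bar: string); the column names of lpx are hypothetical:

-- Assumed schemas, inferred from the column names and types in the plan above.
CREATE TABLE pokes   (foo INT, bar STRING);
CREATE TABLE invites (foo INT, bar STRING);
-- lpx is created without STORED AS / ROW FORMAT, so it defaults to a plain text table.
CREATE TABLE lpx     (bar STRING, foo1 INT, foo2 INT);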
    Stage-1 corresponds to one complete MapReduce job, made up of a Map Operator Tree and a Reduce Operator Tree: the Map Operator Tree describes the map task and the Reduce Operator Tree describes the reduce task.
    The Map Operator Tree shows two parallel operations over t1 and t2, which effectively perform SELECT t1.bar, t1.foo FROM t1; and SELECT t2.foo FROM t2; respectively, and each map task emits its output as input for the reduce phase through a Reduce Output Operator.
    The Reduce Operator Tree shows that the map outputs are joined on the join condition and the result is written to HDFS, through the predefined output format, in the storage format of default.lpx. When the lpx table was created no storage format was specified, so it defaults to Text, and input and output are read and written with TextInputFormat and TextOutputFormat:
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: default.lpx
The input format value is org.apache.hadoop.mapred.TextInputFormat because the temporary output files produced by the map phase are saved with TextOutputFormat, so the reduce side naturally reads them back with TextInputFormat. These details are handled by Hadoop MapReduce itself; Hive only needs to specify the formats.
    The serde value is the org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe class; the values it serializes are _col0, _col1, _col2, i.e. the t1.bar, t1.foo, t2.foo we want to query, and each output row is concretely _col0 + lpx's column delimiter + _col1 + the column delimiter + _col2. From outputformat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat we know which class handles the output.
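
A quick way to confirm these defaults is to inspect the table metadata. A minimal sketch; the exact layout of the output varies by Hive version:

hive> DESCRIBE FORMATTED lpx;
-- the Storage Information section should list:
--   SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
--   InputFormat:   org.apache.hadoop.mapred.TextInputFormat
--   OutputFormat:  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat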
    Stage-0 corresponds to the second step mentioned above. The temporary files produced by Stage-1 (a directory such as tmp) are handled by Stage-0 and end up in table lpx. The Move Operator indicates that this is not a MapReduce job: it only invokes MoveTask, which checks that the input files match the storage format of table lpx before moving them into place.
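
Once the MoveTask has run, the result can be checked directly from the Hive CLI. A minimal sketch, assuming the default warehouse location /user/hive/warehouse:

hive> SELECT * FROM lpx LIMIT 5;               -- rows written by the INSERT OVERWRITE
hive> dfs -ls /user/hive/warehouse/lpx/;       -- assumed default warehouse path; lists the files moved in by Stage-0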

ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain


