An article from the Meituan-Dianping tech blog describes how Hive compiles SQL into MapReduce jobs:
1. Antlr defines the SQL grammar and performs lexical and syntax analysis, converting the SQL statement into an abstract syntax tree (AST Tree).
2. Traverse the AST Tree and abstract out QueryBlocks, the basic units of a query.
3. Traverse the QueryBlocks and translate them into an operator tree (OperatorTree).
4. The logical optimizer transforms the OperatorTree, merging unnecessary ReduceSinkOperators to reduce the amount of shuffled data.
5. Traverse the OperatorTree and translate it into MapReduce tasks.
6. The physical optimizer transforms the MapReduce tasks and generates the final execution plan.
Reference:
https://tech.meituan.com/2014/02/12/hive-sql-to-mapreduce.html
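The staged pipeline above can be illustrated with a toy sketch. This is not Hive's actual code: the data structures and function names below are simplified stand-ins covering stages 1–3 (parse, QueryBlock extraction, operator-tree translation) for a trivial query.

```python
# Toy sketch of Hive's compile pipeline: SQL -> "AST" -> QueryBlock -> OperatorTree.
# Illustrative stand-in only; Hive's real compiler uses Antlr and far richer structures.

def parse(sql):
    # Stage 1: lexical/syntax analysis (Antlr in real Hive).
    # A flat token list stands in for the AST here.
    return sql.replace(",", " ").split()

def to_query_block(tokens):
    # Stage 2: walk the "AST" and collect the basic query unit (QueryBlock).
    qb = {"select": [], "from": None, "limit": None}
    i = 0
    while i < len(tokens):
        t = tokens[i].upper()
        if t == "SELECT":
            i += 1
            while tokens[i].upper() != "FROM":
                qb["select"].append(tokens[i])
                i += 1
        elif t == "FROM":
            qb["from"] = tokens[i + 1]
            i += 2
        elif t == "LIMIT":
            qb["limit"] = int(tokens[i + 1])
            i += 2
        else:
            i += 1
    return qb

def to_operator_tree(qb):
    # Stage 3: translate the QueryBlock into a chain of operators.
    ops = ["TableScanOperator", "SelectOperator"]
    if qb["limit"] is not None:
        ops.append("LimitOperator")
    ops.append("FileSinkOperator")
    return ops

qb = to_query_block(parse("select * from xx.xx limit 1"))
print(qb["from"], to_operator_tree(qb))
```

Note that the operator chain for this query contains no ReduceSinkOperator at all, which is exactly why such a query does not need a shuffle, and (as discussed next) need not become an MR job in the first place.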
However, not every SQL statement needs to be converted into an MR job. For example:
```sql
select * from xx.xx limit 1
```
For this query, Hive can simply read the files directly and stream the rows to the console.
The hive-default.xml configuration file has two relevant parameters: hive.fetch.task.conversion and hive.fetch.task.conversion.threshold.

When hive.fetch.task.conversion is set to more, simple queries such as full-row selects (select *), single-column selects, and LIMIT queries skip MapReduce entirely.

hive.fetch.task.conversion.threshold sets the maximum input size for which the fetch task applies; the default is 1073741824 bytes = 1 GB.
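Both parameters can also be changed per session with Hive's standard `set` command, instead of editing the configuration file:

```sql
-- Enable fetch-task conversion for simple SELECT/FILTER/LIMIT queries
set hive.fetch.task.conversion=more;
-- Only fetch directly when total input is under the threshold (value in bytes; 1073741824 = 1 GB)
set hive.fetch.task.conversion.threshold=1073741824;

-- This now reads the table files directly instead of launching an MR job
select * from xx.xx limit 1;
```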
```xml
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Expects one of [none, minimal, more].
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and
    should not have any aggregations or distincts (which incurs RS), lateral
    views and joins.
    0. none : disable hive.fetch.task.conversion
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
  </description>
</property>
<property>
  <name>hive.fetch.task.conversion.threshold</name>
  <value>1073741824</value>
  <description>
    Input threshold for applying hive.fetch.task.conversion. If target table is
    native, input length is calculated by summation of file lengths. If it's not
    native, storage handler for the table can optionally implement
    org.apache.hadoop.hive.ql.metadata.InputEstimator interface.
  </description>
</property>
```
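The rules in the description above can be condensed into a toy decision function. This is a sketch of the documented behavior only; `fetch_eligible` and its signature are invented for illustration and do not correspond to Hive's actual optimizer code.

```python
# Toy model of the fetch-task conversion rules from hive-default.xml.
# Illustrative only -- not Hive's real SimpleFetchOptimizer logic.

def fetch_eligible(mode, query, input_bytes, threshold=1073741824):
    # Queries with aggregations, joins, or subqueries always need MR.
    complex_ops = ("group by", "distinct", "join", "lateral view")
    if mode == "none" or any(op in query.lower() for op in complex_ops):
        return False
    # Input larger than the threshold (default 1 GB) also forces MR.
    if input_bytes > threshold:
        return False
    if mode == "minimal":
        # minimal: SELECT STAR, partition-column filters, LIMIT only.
        return query.lower().startswith("select *")
    return mode == "more"  # more: any simple SELECT/FILTER/LIMIT query
```

For instance, under `more` a plain `select col from t limit 10` over a small table is fetched directly, while the same query over inputs above the threshold, or any aggregating query, still launches MapReduce.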
Reference:
Hive快速入門系列(14) | Hive性能調優 [一]Fetch抓取與本地模式