Hive SQL low-level syntax parsing: formatting

Original article
http://whatua.com/2018/12/02/hive-sql-format-with-antlr/

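Good SQL formatting tools available today

Below are some of the better SQL formatting tools I found online, each with its own strengths and weaknesses. For Hive SQL, however, there is currently no good tool that can be used as-is.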

1. ApexSQL Refactor SQL formatter

https://www.apexsql.com/sql-tools-refactor.aspx

2. SQL Pretty Printer

Instant SQL Formatter (free online version) http://www.dpriver.com/pp/sqlformat.htm

SQL Pretty Printer (paid desktop application) http://www.dpriver.com/products/sqlpp/desktop_index.php

3. druid

https://github.com/alibaba/druid (open source and free)

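SQL formatting with druid

druid (Alibaba) implements SQL syntax analysis (very similar to a parser generated by ANTLR) and already supports most commonly used SQL syntax. Reading its code confirms what Terence Parr (the author of ANTLR) said: writing lexers and parsers by hand is tedious and error-prone, and the result is hard to maintain.

druid implements a format method internally and supports many SQL dialects. Formatting is not druid's main focus, and using it only for that is overkill, but it would be great if it simply worked. At present druid's Hive syntax support is incomplete and some syntax is not supported yet (for example, defining an ES external table). Adding the missing syntax yourself has a real barrier to entry: compared with ANTLR it is more complicated, and the skill you pick up is less transferable. Hands-on investigation turned up two problems in druid's SQL formatting. There is also a bigger problem for formatting, namely comments (druid pays no real attention to them and only keeps two listeners):

1) The lexer matches a trailing \n here. If the comment is on the last line and has no newline character, it reports an error (recorded in the original notes as "intifiy error").

To locate it, search the code for the comment: // single line comment結束符錯誤 ("wrong single line comment terminator")

2) The SQL format features parameter is not passed down internally, so configuring it has no effect. The corresponding logic has not been implemented yet.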

public static SQLStatementParser createSQLStatementParser(String sql, String dbType, SQLParserFeature... features) {

if (JdbcUtils.ORACLE.equals(dbType) || JdbcUtils.ALI_ORACLE.equals(dbType)) {
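For the first problem above (a line comment on the last line with no trailing newline), the usual fix is to treat end of input as a valid terminator. The following is a minimal, self-contained sketch of that idea. It is not druid's code; the class name LineCommentScanner and the method scanLineComment are hypothetical names used only for illustration.

```java
final class LineCommentScanner {
    /**
     * Scans a "--" line comment that starts at offset start.
     * Returns the offset where the comment ends: the index of the
     * terminating '\n' or '\r', or sql.length() if the input ends first.
     */
    static int scanLineComment(String sql, int start) {
        if (!sql.startsWith("--", start)) {
            throw new IllegalArgumentException("no line comment at offset " + start);
        }
        int pos = start + 2;
        while (pos < sql.length()) {
            char c = sql.charAt(pos);
            if (c == '\n' || c == '\r') {
                break; // normal terminator
            }
            pos++;
        }
        // Reaching end of input is not an error: the comment simply ends there.
        return pos;
    }

    public static void main(String[] args) {
        String sql = "select 1 -- trailing comment";
        int start = sql.indexOf("--");
        int end = scanLineComment(sql, start);
        System.out.println(sql.substring(start, end));
    }
}
```

Because the loop condition checks the input length before looking at a character, a comment that runs to the end of the SQL text is returned whole instead of raising an error.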

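These problems aside, druid's implementation is still a valuable reference; performance is, after all, druid's core concern. The concrete execution flow of druid's Hive SQL formatting:

1) Class inheritance and composition relationships

<-- SQLExprParser <-- HiveExprParser { HiveLexer (instantiated) }

SQLParser { Lexer } | (instantiated)

<-- SQLStatementParser { SQLExprParser } <-- HiveStatementParser

2) How the classes are constructed

SQLParser { Lexer } <-- SQLStatementParser { SQLExprParser: HiveExprParser { Lexer: HiveLexer (instantiated) } , Lexer: the HiveLexer of SQLExprParser } <-- HiveStatementParser

As an aside: druid provides an online SQL parsing facility that is very valuable in practice for access control, injection detection, database and table sharding, and so on (access control on public clouds, proxy middleware inside a company or group, and similar scenarios are all good fits).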

How the SQL parser in the Hive source is built

Hive's SQL parser is built with ANTLR; the version currently used is 3.5.2. ANTLR 3 generates LL(*) parsers.

Papers:

1. LL(*): The Foundation of the ANTLR Parser Generator https://www.antlr.org/papers/LL-star-PLDI11.pdf

2. Adaptive LL(*) Parsing: The Power of Dynamic Analysis https://www.antlr.org/papers/allstar-techreport.pdf

ll parser wiki: https://en.wikipedia.org/wiki/LL_parser

The concept and implementation of LL(*) http://pfmiles.github.io/blog/concept-and-implementation-of-ll-star/

Comparison of parsing tools:

https://en.wikipedia.org/wiki/Comparison_of_parser_generators

https://stackoverflow.com/questions/41427905/how-many-ways-are-there-to-build-a-parser

hive-master/ql/pom.xml => org.antlr

Building it yourself

Download ANTLR: https://www.antlr3.org/download/antlr-3.5.2-complete.jar

1) cp hive-master/ql/src/java/org/apache/hadoop/hive/ql/parse/*.g ./

whomm@bogon > ~/Desktop/hiveparse > mkdir output

whomm@bogon > ~/Desktop/hiveparse > which antlr3

antlr3: aliased to java -jar /usr/local/lib/antlr-3.5.2-complete.jar

whomm@bogon > ~/Desktop/hiveparse > ll

total 328

-rwxr-xr-x@ 1 whomm staff 11K 10 29 11:02 FromClauseParser.g

-rwxr-xr-x@ 1 whomm staff 1.9K 10 29 11:02 HintParser.g

-rwxr-xr-x@ 1 whomm staff 11K 10 29 11:02 HiveLexer.g

-rwxr-xr-x@ 1 whomm staff 91K 10 29 11:02 HiveParser.g

-rwxr-xr-x@ 1 whomm staff 23K 10 29 11:02 IdentifiersParser.g

-rwxr-xr-x@ 1 whomm staff 11K 10 29 11:02 ResourcePlanParser.g

-rwxr-xr-x@ 1 whomm staff 5.7K 10 29 11:02 SelectClauseParser.g

drwxr-xr-x 2 whomm staff 68B 11 2 10:18 output

whomm@bogon > ~/Desktop/hiveparse > antlr3 HiveParser.g -o output

The parsing flow in the Hive source

1. hive-master/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseDriver.java
2. hive-master/ql/src/java/org/apache/hadoop/hive/ql/Driver.java
  1. Driver class, compile function: tree = ParseUtils.parse(command, ctx); -> ParseDriver class: pd.parse(command, ctx, viewFullyQualifiedName)
  2. The result is an ASTNode tree;
3. Turn on Hive's debug mode and submit a statement with a syntax error; the call stack is printed directly: NoViableAltException(26@[])

at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1028)

at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:201)

at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)

at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:418)

at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:312)

at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1201)

at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1296)

at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1127)

at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1115)

at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:220)

at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:172)

at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:383)

at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:775)

at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:693)

at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:628)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:497)

at org.apache.hadoop.util.RunJar.run(RunJar.java:221)

at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

FAILED: ParseException line 1:0 cannot recognize input near 'com' 'sdfsdf' '<EOF>'

Generating your own ANTLR parser from Hive's SQL grammar

antlr3 HiveLexer.g -o out # the Hive master branch currently depends on ANTLR 3.5.2

antlr3 HiveParser.g -o out

Adding ANTLR to Maven and compiling

<build>
  <sourceDirectory>${basedir}/src/java</sourceDirectory>
  <testSourceDirectory>${basedir}/src/test</testSourceDirectory>
  <pluginManagement>
    <plugins>
      <plugin>
        <groupId>org.antlr</groupId>
        <artifactId>antlr3-maven-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>antlr</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <sourceDirectory>${basedir}/src/main/java/whomm/hsqlformat/hive/parse</sourceDirectory>
          <includes>
            <include>**/HiveLexer.g</include>
            <include>**/HiveParser.g</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
  </pluginManagement>
</build>

Command: mvn org.antlr:antlr3-maven-plugin:antlr

Maven configuration references:

1. ANTLR 3.x Creating and Executing a Grammar in Eclipse https://vimeo.com/8015802

2. https://alexecollins.com/antlr4-and-maven-tutorial/

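Formatting (ANTLR's black magic)

ANTLR really is a good product; it genuinely makes a complicated job simple.

1. Logic and grammar definition are separated, and the grammar definitions are simple and clear.

2. It can build a syntax tree directly and can generate code for many target languages.

The Hive grammar in hive-master consists of several files (hive-master/ql/src/java/org/apache/hadoop/hive/ql/parse):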

HiveLexer.g

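LINE_COMMENT : '--' (~('\n'|'\r'))* { $channel=HIDDEN; } ;

Line comments are sent to the hidden channel by default, so the parser ignores them entirely.

If special characters need to be accepted as variables, this is the file to modify.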
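The effect of the hidden channel can be illustrated without ANTLR. The sketch below is a hand-written toy lexer, not the generated HiveLexer: it tags every token with a channel and records its line and column, and the parser-side view simply drops anything that is not on the default channel. The class name ChannelDemo, the channel constants, and the whitespace-only tokenization are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ChannelDemo {
    static final int DEFAULT = 0;
    static final int HIDDEN = 1; // arbitrary value chosen for this sketch

    static final class Tok {
        final String text;
        final int channel;
        final int line; // 1-based
        final int col;  // 0-based

        Tok(String text, int channel, int line, int col) {
            this.text = text;
            this.channel = channel;
            this.line = line;
            this.col = col;
        }
    }

    /** Splits on whitespace; a token starting with "--" runs to end of line on HIDDEN. */
    static List<Tok> lex(String sql) {
        List<Tok> out = new ArrayList<>();
        int line = 1, col = 0, i = 0;
        while (i < sql.length()) {
            char c = sql.charAt(i);
            if (c == '\n') { line++; col = 0; i++; continue; }
            if (Character.isWhitespace(c)) { col++; i++; continue; }
            int start = i;
            boolean comment = sql.startsWith("--", i);
            while (i < sql.length()) {
                char d = sql.charAt(i);
                if (d == '\n' || d == '\r') break;
                if (!comment && Character.isWhitespace(d)) break;
                i++;
            }
            out.add(new Tok(sql.substring(start, i), comment ? HIDDEN : DEFAULT, line, col));
            col += i - start;
        }
        return out;
    }

    /** What a parser would see: only the tokens on the default channel. */
    static List<String> parserView(List<Tok> toks) {
        List<String> out = new ArrayList<>();
        for (Tok t : toks) {
            if (t.channel == DEFAULT) out.add(t.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> toks = lex("select id -- pick id\nfrom t");
        System.out.println(parserView(toks));
    }
}
```

The comment token is still present in the full token list together with its position, which is what makes it possible to reattach comments after parsing; it is just invisible to the parser.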

FromClauseParser.g

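Parses the FROM clause

IdentifiersParser.g

Identifier definitions: function names, system functions, keywords, and so on

It contains a nonReserved rule, which is important: non-reserved keywords can be used as identifiers.

For example, in select a as date from mytable, date causes an error unless it is escaped; but if " | KW_DATE " is added to that rule, date can be used directly as an identifier.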

ResourcePlanParser.g

Resource plans

SelectClauseParser.g

Parses the SELECT clause

HiveParser.g

Imports SelectClauseParser, FromClauseParser, IdentifiersParser, and ResourcePlanParser, and implements all of the Hive syntax parsing

statement is the entry rule

HintParser.g

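Parses Hive's hint syntax

Adding grammar rules that recognize line comments to the parser is rather cumbersome; after all, the goal of parsing is the actual statement. An alternative is to take the line and charPositionInLine of each syntax tree node, together with the line and charPositionInLine of the comment tokens produced by the lexer, and decide where each comment belongs from their relative order. At first glance this does not work, because:

In the tree produced by parsing, the tokens of nodes close to keywords carry no line or charPositionInLine. This is logical: some AST nodes are logical nodes that do not correspond to a specific keyword, or are generated from n keywords (there is a related discussion on Stack Overflow: https://stackoverflow.com/questions/9954882/antlr-preserve-line-number-and-position-in-tree-grammar).

The idea is a good one, but it also runs into some problems:

The following two examples arise in query statements. CREATE statements in DDL have few logical nodes, so the problem does not occur there.

1) A logical node is visited first

For example, the union all node.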

select a.id from a union all select b.id from b union all select c.id from c

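This produces

TOK_QUERY
  TOK_UNIONALL
    TOK_QUERY (select a.id from a)
    TOK_QUERY (select b.id from b)
    TOK_QUERY (select c.id from c)

With a tree like this, a depth-first traversal reaches each node in turn; suppose the current node is one that has to be output.

Compare the positions of the current node and the comment: if the current node's line and column are greater than or equal to the comment's, output the comment.

The union all node is visited before its children, so a comment inside the first select is output too early.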
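The premature output can be reproduced with a small simulation. The sketch below does not use the Hive AST: Node is a hypothetical stand-in that carries a text and a source position, the tree is hand-built for a two-branch union all, and the positions are assumed to be known for every node. It applies exactly the rule described above during a pre-order walk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CommentPlacementDemo {
    /** A tree node with the source position of the token it was built from. */
    static final class Node {
        final String text;
        final int line;
        final int col;
        final List<Node> children;

        Node(String text, int line, int col, Node... children) {
            this.text = text;
            this.line = line;
            this.col = col;
            this.children = Arrays.asList(children);
        }
    }

    /** True if the node starts at or after the given (line, col) position. */
    static boolean atOrAfter(Node n, int line, int col) {
        return n.line > line || (n.line == line && n.col >= col);
    }

    /** Pre-order walk that emits the pending comment before the first node at or after it. */
    static void visit(Node n, Node comment, boolean[] emitted, List<String> out) {
        if (!emitted[0] && atOrAfter(n, comment.line, comment.col)) {
            out.add(comment.text);
            emitted[0] = true;
        }
        out.add(n.text);
        for (Node child : n.children) {
            visit(child, comment, emitted, out);
        }
    }

    static List<String> naiveOrder() {
        // select a.id -- ids from a   (line 1)
        // from a                      (line 2)
        // union all                   (line 3)
        // select b.id from b          (line 4)
        Node q1 = new Node("select", 1, 0, new Node("a.id", 1, 7), new Node("from a", 2, 0));
        Node q2 = new Node("select", 4, 0, new Node("b.id", 4, 7), new Node("from b", 4, 12));
        Node root = new Node("union all", 3, 0, q1, q2);
        Node comment = new Node("-- ids from a", 1, 12);

        List<String> out = new ArrayList<>();
        visit(root, comment, new boolean[] {false}, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(naiveOrder());
    }
}
```

In source order the comment belongs between a.id and from a. Because the root sits on line 3, it already satisfies the position test when it is visited first, so the comment is emitted ahead of everything, including the select it was attached to.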

2) Newly added virtual nodes

For query-type SQL statements, parsing into a syntax tree adds virtual subquery nodes to the AST.

For example:

select id from ( select a.id from a union all select b.id from b union all select c.id from c) tmp

TOK_QUERY

TOK_FROM

TOK_SUBQUERY

TOK_QUERY

TOK_FROM

TOK_SUBQUERY

TOK_UNIONALL

TOK_QUERY

_u1

TOK_INSERT

…

tmp

TOK_INSERT

…

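_u1 is the logical node added during parsing. This newly added subquery also causes later comments to be output too early.

A more precise method is therefore needed. For now I have built a simple tool: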

https://github.com/whomm/hsqlformat

References

Eclipse
1. Eclipse Maven configuration https://www.cnblogs.com/tangshengwei/p/6341462.html
Hive
1. Hive SQL parsing and its applications https://www.jianshu.com/p/7cd2afacc9bb
2. The Hive SQL compilation process https://tech.meituan.com/hive_sql_to_mapreduce.html
3. The Hive SQL parsing process in detail https://www.cnblogs.com/yaojingang/p/5446310.html
4. Hive source code analysis: how the Driver class runs https://yq.aliyun.com/articles/26327
5. Hive Wiki: https://cwiki.apache.org/confluence/display/Hive/Home
6. The HiveSQL compilation process: http://www.slideshare.net/recruitcojp/internal-hive
7. Join Optimization in Hive: Join Strategies in Hive from the 2011 Hadoop Summit (Liyin Tang, Namit Jain)
8. Hive Design Docs: https://cwiki.apache.org/confluence/display/Hive/DesignDocs
9. The Hive SQL parsing process https://github.com/alan2lin/hive_ql_parser
ANTLR (ANother Tool for Language Recognition)
1. Antlr: http://www.antlr.org/
2. Wikipedia introduction to ANTLR: http://en.wikipedia.org/wiki/ANTLR
3. Developing domain-specific languages with ANTLR https://www.ibm.com/developerworks/cn/java/j-lo-antlr/index.html
4. The Definitive ANTLR Reference https://doc.lagout.org/programmation/Pragmatic%20Programmers/The%20Definitive%20ANTLR%20Reference.pdf
5. The Definitive ANTLR Reference, Chinese edition (PDF download: http://www.safuli.com/blog/articles/7985.html)
6. ANTLR 2.7.5 documentation in Chinese http://www.blogjava.net/huanzhugege/archive/2008/06/30/211762.html
7. Basics: the lexer (often called a scanner) breaks the input character stream into symbols from a vocabulary and passes them to the parser, which applies grammatical structure to that symbol stream. Because ANTLR uses the same recognition mechanism for lexing, parsing, and tree parsing, the lexers it generates are more powerful than DFA-based lexers such as those generated by DLG and lex.
druid
1. https://github.com/alibaba/druid/wiki/SQL-Parser
2. https://github.com/alibaba/druid/wiki/SQL_Format
