轉載自：http://matt33.com/2019/03/07/apache-calcite-process-flow/

關於 Apache Calcite 的簡單介紹可以參考 Apache Calcite：Hadoop 中新型大數據查詢引擎這篇文章，Calcite 一開始設計的目標就是 one size fits all，它希望能爲不同計算存儲引擎提供統一的 SQL 查詢引擎，當然 Calcite 並不僅僅是一個簡單的 SQL 查詢引擎，在論文 Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources 的摘要（摘要見下面）部分，關於 Calcite 的核心點有簡單的介紹，Calcite 的架構有三個特點：flexible, embeddable, and extensible，就是靈活性、組件可插拔、可擴展，它的 SQL Parser 層、Optimizer 層等都可以單獨使用，這也是 Calcite 受總多開源框架歡迎的原因之一。

Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite’s architecture consists of

a modular and extensible query optimizer with hundreds of built-in optimization rules,

a query processor capable of processing a variety of query languages,

an adapter architecture designed for extensibility,

and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial).
This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in bigdata frameworks. It is an active project that continues to introduce support for the new types of data sources, query languages, and approaches to query processing and optimization.

Calcite 概念

在介紹 Calcite 架構之前，先來看下與 Calcite 相關的基礎性內容。

關係代數的基本知識

關係代數是關係型數據庫操作的理論基礎，關係代數支持並、差、笛卡爾積、投影和選擇等基本運算。關係代數也是 Calcite 的核心，任何一個查詢都可以表示成由關係運算符組成的樹。在 Calcite 中，它會先將 SQL 轉換成關係表達式（relational expression），然後通過規則匹配（rules match）進行相應的優化，優化會有一個成本（cost）模型爲參考。

這裏先看下關係代數相關內容，這對於理解 Calcite 很有幫助，特別是 Calcite Optimizer 這塊的內容，關係代數的基礎可以參考這篇文章 SQL 形式化語言——關係代數，簡單總結如下：

名稱	英文	符號	說明
選擇	select	σ	類似於 SQL 中的 where
投影	project	Π	類似於 SQL 中的 select
並	union	∪	類似於 SQL 中的 union
集合差	set-difference	-	SQL中沒有對應的操作符
笛卡兒積	Cartesian-product	×	類似於 SQL 中不帶 on 條件的 inner join
重命名	rename	ρ	類似於 SQL 中的 as
集合交	intersection	∩	SQL中沒有對應的操作符
自然連接	natural join	⋈	類似於 SQL 中的 inner join
賦值	assignment	←

查詢優化

查詢優化主要是圍繞着 等價交換 的原則做相應的轉換，這部分可以參考【《數據庫系統概念（中文第六版）》第13章——查詢優化】，關於查詢優化理論知識，這裏就不再詳述，列出一些個人不錯不錯的博客，大家可以參考一下：

Calcite 中的一些概念

Calcite 拋出的概念非常多，筆者最開始在看代碼時就被這些概念繞得雲裏霧裏，這時候先從代碼的細節裏跳出來，先把這些概念理清楚、歸歸類後再去看代碼，思路就清晰很多，因此，在介紹 Calcite 整體實現前，先把這些概念梳理一下，需要對這些概念有個基本的理解，相關的概念如下圖所示：

calcite 基本概念

整理如下表所示：

類型	描述	特點
RelOptRule	transforms an expression into another。對 expression 做等價轉換	根據傳遞給它的 RelOptRuleOperand 來對目標 RelNode 樹進行規則匹配，匹配成功後，會再次調用 `matches()` 方法（默認返回真）進行進一步檢查。如果 `mathes()` 結果爲真，則調用 `onMatch()` 進行轉換。
ConverterRule	Abstract base class for a rule which converts from one calling convention to another without changing semantics.	它是 RelOptRule 的子類，專門用來做數據源之間的轉換（Calling convention），ConverterRule 一般會調用對應的 Converter 來完成工作，比如說：JdbcToSparkConverterRule 調用 JdbcToSparkConverter 來完成對 JDBC Table 到 Spark RDD 的轉換。
RelNode	relational expression，RelNode 會標識其 input RelNode 信息，這樣就構成了一棵 RelNode 樹	代表了對數據的一個處理操作，常見的操作有 Sort、Join、Project、Filter、Scan 等。它蘊含的是對整個 Relation 的操作，而不是對具體數據的處理邏輯。
Converter	A relational expression implements the interface `Converter` to indicate that it converts a physical attribute, or RelTrait of a relational expression from one value to another.	用來把一種 RelTrait 轉換爲另一種 RelTrait 的 RelNode。如 JdbcToSparkConverter 可以把 JDBC 裏的 table 轉換爲 Spark RDD。如果需要在一個 RelNode 中處理來源於異構系統的邏輯表，Calcite 要求先用 Converter 把異構系統的邏輯錶轉換爲同一種 Convention。
RexNode	Row-level expression	行表達式（標量表達式），蘊含的是對一行數據的處理邏輯。每個行表達式都有數據的類型。這是因爲在 Valdiation 的過程中，編譯器會推導出表達式的結果類型。常見的行表達式包括字面量 RexLiteral，變量 RexVariable，函數或操作符調用 RexCall 等。 RexNode 通過 RexBuilder 進行構建。
RelTrait	RelTrait represents the manifestation of a relational expression trait within a trait definition.	用來定義邏輯表的物理相關屬性（physical property），三種主要的 trait 類型是：Convention、RelCollation、RelDistribution；
Convention	Calling convention used to repressent a single data source, inputs must be in the same convention	繼承自 RelTrait，類型很少，代表一個單一的數據源，一個 relational expression 必須在同一個 convention 中；
RelTraitDef		主要有三種：ConventionTraitDef：用來代表數據源 RelCollationTraitDef：用來定義參與排序的字段；RelDistributionTraitDef：用來定義數據在物理存儲上的分佈方式（比如：single、hash、range、random 等）；
RelOptCluster	An environment for related relational expressions during the optimization of a query.	palnner 運行時的環境，保存上下文信息；
RelOptPlanner	A RelOptPlanner is a query optimizer: it transforms a relational expression into a semantically equivalent relational expression, according to a given set of rules and a cost model.	也就是優化器，Calcite 支持RBO（Rule-Based Optimizer）和 CBO（Cost-Based Optimizer）。Calcite 的 RBO （HepPlanner）稱爲啓發式優化器（heuristic implementation ），它簡單地按 AST 樹結構匹配所有已知規則，直到沒有規則能夠匹配爲止；Calcite 的 CBO 稱爲火山式優化器（VolcanoPlanner）成本優化器也會匹配並應用規則，當整棵樹的成本降低趨於穩定後，優化完成，成本優化器依賴於比較準確的成本估算。RelOptCost 和 Statistic 與成本估算相關；
RelOptCost	defines an interface for optimizer cost in terms of number of rows processed, CPU cost, and I/O cost.	優化器成本模型會依賴；

Calcite 架構

關於 Calcite 的架構，可以參考下圖（圖片來自前面那篇論文），它與傳統數據庫管理系統有一些相似之處，相比而言，它將數據存儲、數據處理算法和元數據存儲這些部分忽略掉了，這樣設計帶來的好處是：對於涉及多種數據源和多種計算引擎的應用而言，Calcite 因爲可以兼容多種存儲和計算引擎，使得 Calcite 可以提供統一查詢服務，Calcite 將會是這些應用的最佳選擇。

Calcite Architecture，圖片來自論文

在 Calcite 架構中，最核心地方就是 Optimizer，也就是優化器，一個 Optimization Engine 包含三個組成部分：

rules：也就是匹配規則，Calcite 內置上百種 Rules 來優化 relational expression，當然也支持自定義 rules；
metadata providers：主要是向優化器提供信息，這些信息會有助於指導優化器向着目標（減少整體 cost）進行優化，信息可以包括行數、table 哪一列是唯一列等，也包括計算 RelNode 樹中執行 subexpression cost 的函數；
planner engines：它的主要目標是進行觸發 rules 來達到指定目標，比如像 cost-based optimizer（CBO）的目標是減少cost（Cost 包括處理的數據行數、CPU cost、IO cost 等）。

Calcite 處理流程

Sql 的執行過程一般可以分爲下圖中的四個階段，Calcite 同樣也是這樣：

Sql 執行過程

但這裏爲了講述方便，把 SQL 的執行分爲下面五個階段（跟上面比比又獨立出了一個階段）：

解析 SQL，把 SQL 轉換成爲 AST （抽象語法樹），在 Calcite 中用 SqlNode 來表示；
語法檢查，根據數據庫的元數據信息進行語法驗證，驗證之後還是用 SqlNode 表示 AST 語法樹；
語義分析，根據 SqlNode 及元信息構建 RelNode 樹，也就是最初版本的邏輯計劃（Logical Plan）；
邏輯計劃優化，優化器的核心，根據前面生成的邏輯計劃按照相應的規則（Rule）進行優化；
物理執行，生成物理計劃，物理執行計劃執行。

這裏我們只關注前四步的內容，會配合源碼實現以及一個示例來講解。

示例

示例 SQL 如下：

select u.id as user_id, u.name as user_name, j.company as user_company, u.age as user_age 
from users u join jobs j on u.name=j.name
where u.age > 30 and j.id>10
order by user_id

這裏有兩張表，其表各個字段及類型定義如下：

SchemaPlus rootSchema = Frameworks.createRootSchema(true);
rootSchema.add("USERS", new AbstractTable() { //note: add a table
    @Override
    public RelDataType getRowType(final RelDataTypeFactory typeFactory) {
        RelDataTypeFactory.Builder builder = typeFactory.builder();

        builder.add("ID", new BasicSqlType(new RelDataTypeSystemImpl() {}, SqlTypeName.INTEGER));
        builder.add("NAME", new BasicSqlType(new RelDataTypeSystemImpl() {}, SqlTypeName.CHAR));
        builder.add("AGE", new BasicSqlType(new RelDataTypeSystemImpl() {}, SqlTypeName.INTEGER));
        return builder.build();
    }
});

rootSchema.add("JOBS", new AbstractTable() {
    @Override
    public RelDataType getRowType(final RelDataTypeFactory typeFactory) {
        RelDataTypeFactory.Builder builder = typeFactory.builder();

        builder.add("ID", new BasicSqlType(new RelDataTypeSystemImpl() {}, SqlTypeName.INTEGER));
        builder.add("NAME", new BasicSqlType(new RelDataTypeSystemImpl() {}, SqlTypeName.CHAR));
        builder.add("COMPANY", new BasicSqlType(new RelDataTypeSystemImpl() {}, SqlTypeName.CHAR));
        return builder.build();
    }
});

Step1: SQL 解析階段（SQL–>SqlNode）

使用 Calcite 進行 Sql 解析的代碼如下：

1 2	SqlParser parser = SqlParser.create(sql, SqlParser.Config.DEFAULT); SqlNode sqlNode = parser.parseStmt();

Calcite 使用 JavaCC 做 SQL 解析，JavaCC 根據 Calcite 中定義的 Parser.jj 文件，生成一系列的 java 代碼，生成的 Java 代碼會把 SQL 轉換成 AST 的數據結構（這裏是 SqlNode 類型）。

與 Javacc 相似的工具還有 ANTLR，JavaCC 中的 jj 文件也跟 ANTLR 中的 G4文件類似，Apache Spark 中使用這個工具做類似的事情。

Javacc

關於 Javacc 內容可以參考下面這幾篇文章，這裏就不再詳細展開，可以通過下面文章的例子把 JavaCC 的語法瞭解一下，這樣我們也可以自己設計一個 DSL（Doomain Specific Language）。

回到 Calcite，Javacc 這裏要實現一個 SQL Parser，它的功能有以下兩個，這裏都是需要在 jj 文件中定義的。

設計詞法和語義，定義 SQL 中具體的元素；
實現詞法分析器（Lexer）和語法分析器（Parser），完成對 SQL 的解析，完成相應的轉換。

SQL Parser 流程

當 SqlParser 調用 parseStmt() 方法後，其相應的邏輯如下：

// org.apache.calcite.sql.parser.SqlParser
public SqlNode parseStmt() throws SqlParseException {
  return parseQuery();
}

public SqlNode parseQuery() throws SqlParseException {
  try {
    return parser.parseSqlStmtEof(); //note: 解析sql語句
  } catch (Throwable ex) {
    if (ex instanceof CalciteContextException) {
      final String originalSql = parser.getOriginalSql();
      if (originalSql != null) {
        ((CalciteContextException) ex).setOriginalStatement(originalSql);
      }
    }
    throw parser.normalizeException(ex);
  }
}

其中 SqlParser 中 parser 指的是 SqlParserImpl 類（SqlParser.Config.DEFAULT 指定的），它就是由 JJ 文件生成的解析類，其處理流程如下，具體解析邏輯還是要看 JJ 文件中的定義。

//org.apache.calcite.sql.parser.impl.SqlParserImpl
public SqlNode parseSqlStmtEof() throws Exception
{
  return SqlStmtEof();
}

/**
 * Parses an SQL statement followed by the end-of-file symbol.
 * note:解析SQL語句(後面有文件結束符號)
 */
final public SqlNode SqlStmtEof() throws ParseException {
  SqlNode stmt;
  stmt = SqlStmt();
  jj_consume_token(0);
      {if (true) return stmt;}
  throw new Error("Missing return statement in function");
}

 //note: 解析 SQL statement
final public SqlNode SqlStmt() throws ParseException {
  SqlNode stmt;
  if (jj_2_34(2)) {
    stmt = SqlSetOption(Span.of(), null);
  } else if (jj_2_35(2)) {
    stmt = SqlAlter();
  } else if (jj_2_36(2)) {
    stmt = OrderedQueryOrExpr(ExprContext.ACCEPT_QUERY);
  } else if (jj_2_37(2)) {
    stmt = SqlExplain();
  } else if (jj_2_38(2)) {
    stmt = SqlDescribe();
  } else if (jj_2_39(2)) {
    stmt = SqlInsert();
  } else if (jj_2_40(2)) {
    stmt = SqlDelete();
  } else if (jj_2_41(2)) {
    stmt = SqlUpdate();
  } else if (jj_2_42(2)) {
    stmt = SqlMerge();
  } else if (jj_2_43(2)) {
    stmt = SqlProcedureCall();
  } else {
    jj_consume_token(-1);
    throw new ParseException();
  }
      {if (true) return stmt;}
  throw new Error("Missing return statement in function");
}

示例中 SQL 經過前面的解析之後，會生成一個 SqlNode，這個 SqlNode 是一個 SqlOrder 類型，DEBUG 後的 SqlOrder 對象如下圖所示。

SqlNode 結果

Step2: SqlNode 驗證（SqlNode–>SqlNode）

經過上面的第一步，會生成一個 SqlNode 對象，它是一個未經驗證的抽象語法樹，下面就進入了一個語法檢查階段，語法檢查前需要知道元數據信息，這個檢查會包括表名、字段名、函數名、數據類型的檢查。進行語法檢查的實現如下：

//note: 二、sql validate（會先通過Catalog讀取獲取相應的metadata和namespace）
//note: get metadata and namespace
SqlTypeFactoryImpl factory = new SqlTypeFactoryImpl(RelDataTypeSystem.DEFAULT);
CalciteCatalogReader calciteCatalogReader = new CalciteCatalogReader(
    CalciteSchema.from(rootScheme),
    CalciteSchema.from(rootScheme).path(null),
    factory,
    new CalciteConnectionConfigImpl(new Properties()));

//note: 校驗（包括對錶名，字段名，函數名，字段類型的校驗。）
SqlValidator validator = SqlValidatorUtil.newValidator(SqlStdOperatorTable.instance(), calciteCatalogReader, factory,
    conformance(frameworkConfig));
SqlNode validateSqlNode = validator.validate(sqlNode);

我們知道 Calcite 本身是不管理和存儲元數據的，在檢查之前，需要先把元信息註冊到 Calcite 中，一般的操作方法是實現 SchemaFactory，由它去創建相應的 Schema，在 Schema 中可以註冊相應的元數據信息（如：通過 getTableMap() 方法註冊表信息），如下所示：

//org.apache.calcite.schema.impl.AbstractSchema
/**
 * Returns a map of tables in this schema by name.
 *
 * <p>The implementations of {@link #getTableNames()}
 * and {@link #getTable(String)} depend on this map.
 * The default implementation of this method returns the empty map.
 * Override this method to change their behavior.</p>
 *
 * @return Map of tables in this schema by name
 */
protected Map<String, Table> getTableMap() {
  return ImmutableMap.of();
}

//org.apache.calcite.adapter.csvorg.apache.calcite.adapter.csv.CsvSchemasvSchema
//note: 創建表
@Override protected Map<String, Table> getTableMap() {
  if (tableMap == null) {
    tableMap = createTableMap();
  }
  return tableMap;
}

CsvSchemasvSchema 中的 getTableMap() 方法通過 createTableMap() 來註冊相應的表信息。

結合前面的例子再來分析，在前面定義了 CalciteCatalogReader 實例，該實例就是用來讀取 Schema 中的元數據信息的。真正檢查的邏輯是在 SqlValidatorImpl 類中實現的，這個 check 的邏輯比較複雜，在看代碼時通過兩種手段來看：

DEBUG 的方式，可以看到其方法調用的過程；
測試程序中故意構造一些 Case，觀察其異常棧。

比如，在示例中 SQL 中，如果把一個字段名寫錯，寫成 ids，其報錯信息如下：

org.apache.calcite.runtime.CalciteContextException: From line 1, column 156 to line 1, column 158: Column 'IDS' not found in table 'J'
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.calcite.runtime.Resources$ExInstWithCause.ex(Resources.java:463)
    at org.apache.calcite.sql.SqlUtil.newContextException(SqlUtil.java:787)
    at org.apache.calcite.sql.SqlUtil.newContextException(SqlUtil.java:772)
    at org.apache.calcite.sql.validate.SqlValidatorImpl.newValidationError(SqlValidatorImpl.java:4788)
    at org.apache.calcite.sql.validate.DelegatingScope.fullyQualify(DelegatingScope.java:439)
    at org.apache.calcite.sql.validate.SqlValidatorImpl$Expander.visit(SqlValidatorImpl.java:5683)
    at org.apache.calcite.sql.validate.SqlValidatorImpl$Expander.visit(SqlValidatorImpl.java:5665)
    at org.apache.calcite.sql.SqlIdentifier.accept(SqlIdentifier.java:334)
    at org.apache.calcite.sql.util.SqlShuttle$CallCopyingArgHandler.visitChild(SqlShuttle.java:134)
    at org.apache.calcite.sql.util.SqlShuttle$CallCopyingArgHandler.visitChild(SqlShuttle.java:101)
    at org.apache.calcite.sql.SqlOperator.acceptCall(SqlOperator.java:865)
    at org.apache.calcite.sql.validate.SqlValidatorImpl$Expander.visitScoped(SqlValidatorImpl.java:5701)
    at org.apache.calcite.sql.validate.SqlScopedShuttle.visit(SqlScopedShuttle.java:50)
    at org.apache.calcite.sql.validate.SqlScopedShuttle.visit(SqlScopedShuttle.java:33)
    at org.apache.calcite.sql.SqlCall.accept(SqlCall.java:138)
    at org.apache.calcite.sql.util.SqlShuttle$CallCopyingArgHandler.visitChild(SqlShuttle.java:134)
    at org.apache.calcite.sql.util.SqlShuttle$CallCopyingArgHandler.visitChild(SqlShuttle.java:101)
    at org.apache.calcite.sql.SqlOperator.acceptCall(SqlOperator.java:865)
    at org.apache.calcite.sql.validate.SqlValidatorImpl$Expander.visitScoped(SqlValidatorImpl.java:5701)
    at org.apache.calcite.sql.validate.SqlScopedShuttle.visit(SqlScopedShuttle.java:50)
    at org.apache.calcite.sql.validate.SqlScopedShuttle.visit(SqlScopedShuttle.java:33)
    at org.apache.calcite.sql.SqlCall.accept(SqlCall.java:138)
    at org.apache.calcite.sql.validate.SqlValidatorImpl.expand(SqlValidatorImpl.java:5272)
    at org.apache.calcite.sql.validate.SqlValidatorImpl.validateWhereClause(SqlValidatorImpl.java:3977)
    at org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect(SqlValidatorImpl.java:3305)
    at org.apache.calcite.sql.validate.SelectNamespace.validateImpl(SelectNamespace.java:60)
    at org.apache.calcite.sql.validate.AbstractNamespace.validate(AbstractNamespace.java:84)
    at org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace(SqlValidatorImpl.java:977)
    at org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery(SqlValidatorImpl.java:953)
    at org.apache.calcite.sql.SqlSelect.validate(SqlSelect.java:216)
    at org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression(SqlValidatorImpl.java:928)
    at org.apache.calcite.sql.validate.SqlValidatorImpl.validate(SqlValidatorImpl.java:632)
    at com.matt.test.calcite.test.SqlTest3.sqlToRelNode(SqlTest3.java:200)
    at com.matt.test.calcite.test.SqlTest3.main(SqlTest3.java:117)
Caused by: org.apache.calcite.sql.validate.SqlValidatorException: Column 'IDS' not found in table 'J'
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.calcite.runtime.Resources$ExInstWithCause.ex(Resources.java:463)
    at org.apache.calcite.runtime.Resources$ExInst.ex(Resources.java:572)
    ... 33 more
java.lang.NullPointerException
    at org.apache.calcite.plan.hep.HepPlanner.addRelToGraph(HepPlanner.java:806)
    at org.apache.calcite.plan.hep.HepPlanner.setRoot(HepPlanner.java:152)
    at com.matt.test.calcite.test.SqlTest3.main(SqlTest3.java:124)

SqlValidatorImpl 檢查過程

語法檢查驗證是通過 SqlValidatorImpl 的 validate() 方法進行操作的，其實現如下：

org.apache.calcite.sql.validate.SqlValidatorImpl
//note: 做相應的語法樹校驗
public SqlNode validate(SqlNode topNode) {
  //note: root 對應的 Scope
  SqlValidatorScope scope = new EmptyScope(this);
  scope = new CatalogScope(scope, ImmutableList.of("CATALOG"));
  //note: 1.rewrite expression
  //note: 2.做相應的語法檢查
  final SqlNode topNode2 = validateScopedExpression(topNode, scope); //note: 驗證
  final RelDataType type = getValidatedNodeType(topNode2);
  Util.discard(type);
  return topNode2;
}

主要的實現是在 validateScopedExpression() 方法中，其實現如下

private SqlNode validateScopedExpression(
    SqlNode topNode,
    SqlValidatorScope scope) {
  //note: 1. rewrite expression，將其標準化，便於後面的邏輯計劃優化
  SqlNode outermostNode = performUnconditionalRewrites(topNode, false);
  cursorSet.add(outermostNode);
  top = outermostNode;
  TRACER.trace("After unconditional rewrite: {}", outermostNode);
  //note: 2. Registers a query in a parent scope.
  //note: register scopes and namespaces implied a relational expression
  if (outermostNode.isA(SqlKind.TOP_LEVEL)) {
    registerQuery(scope, null, outermostNode, outermostNode, null, false);
  }
  //note: 3. catalog 驗證，調用 SqlNode 的 validate 方法，
  outermostNode.validate(this, scope);
  if (!outermostNode.isA(SqlKind.TOP_LEVEL)) {
    // force type derivation so that we can provide it to the
    // caller later without needing the scope
    deriveType(scope, outermostNode);
  }
  TRACER.trace("After validation: {}", outermostNode);
  return outermostNode;
}

它的處理邏輯主要分爲三步：

rewrite expression，將其標準化，便於後面的邏輯計劃優化；
註冊這個 relational expression 的 scopes 和 namespaces（這兩個對象代表了其元信息）；
進行相應的驗證，這裏會依賴第二步註冊的 scopes 和 namespaces 信息。

Rewrite

關於 Rewrite 這一步，一直困惑比較，因爲根據 After unconditional rewrite: 這條日誌的結果看，其實前後 SqlNode 並沒有太大變化，看 performUnconditionalRewrites() 這部分代碼時，看得不是很明白，不過還是注意到了 SqlOrderBy 的註釋（註釋如下），它的意思是 SqlOrderBy 通過 performUnconditionalRewrites() 方法已經被 SqlSelect 對象中的 ORDER_OPERAND 取代了。

/**
 * Parse tree node that represents an {@code ORDER BY} on a query other than a
 * {@code SELECT} (e.g. {@code VALUES} or {@code UNION}).
 *
 * <p>It is a purely syntactic operator, and is eliminated by
 * {@link org.apache.calcite.sql.validate.SqlValidatorImpl#performUnconditionalRewrites}
 * and replaced with the ORDER_OPERAND of SqlSelect.</p>
 */
public class SqlOrderBy extends SqlCall {

注意到 SqlOrderBy 的原因是因爲在 performUnconditionalRewrites() 方法前面都是遞歸對每個對象進行處理，在後面進行真正的 ransform 時，主要在圍繞着 ORDER_BY 這個類型做處理，而且從代碼中可以看出，將其類型從 SqlOrderBy 轉換成了 SqlSelect，BUDEG 前面的示例，發現 outermostNode 與 topNode 的類型確實發生了變化，如下圖所示。

Rewrite 前後的對比

這個方法有個好的地方就是，在不改變原有 SQL Parser 的邏輯的情況下，可以在這個方法裏做一些改動，當然如果 SQL Parser 的結果如果直接可用當然是最好的，就不需要再進行一次 Rewrite 了。

registerQuery

這裏的功能主要就是將[元數據]轉換成 SqlValidator 內部的對象進行表示，也就是 SqlValidatorScope 和 SqlValidatorNamespace 兩種類型的對象：

SqlValidatorNamespace：a description of a data source used in a query，它代表了 SQL 查詢的數據源，它是一個邏輯上數據源，可以是一張表，也可以是一個子查詢；
SqlValidatorScope：describes the tables and columns accessible at a particular point in the query，代表了在某一個程序運行點，當前可見的字段名和表名。

這個理解起來並不是那麼容易，在 SelectScope 類中有一個示例講述，這個示例對這兩個概念的理解很有幫助。

/**
 * <h3>Scopes</h3>
 *
 * <p>In the query</p>
 *
 * <blockquote>
 * <pre>
 * SELECT expr1
 * FROM t1,
 *     t2,
 *     (SELECT expr2 FROM t3) AS q3
 * WHERE c1 IN (SELECT expr3 FROM t4)
 * ORDER BY expr4</pre>
 * </blockquote>
 *
 * <p>The scopes available at various points of the query are as follows:</p>
 *
 * <ul>
 * <li>expr1 can see t1, t2, q3</li>
 * <li>expr2 can see t3</li>
 * <li>expr3 can see t4, t1, t2</li>
 * <li>expr4 can see t1, t2, q3, plus (depending upon the dialect) any aliases
 * defined in the SELECT clause</li>
 * </ul>
 *
 * <h3>Namespaces</h3>
 *
 * <p>In the above query, there are 4 namespaces:</p>
 *
 * <ul>
 * <li>t1</li>
 * <li>t2</li>
 * <li>(SELECT expr2 FROM t3) AS q3</li>
 * <li>(SELECT expr3 FROM t4)</li>
 */

validate 驗證

接着回到最複雜的一步，就是 outermostNode 實例調用 validate(this, scope) 方法進行驗證的部分，對於我們這個示例，這裏最後調用的是 SqlSelect 的 validate() 方法，如下所示：

1
2
3

public void validate(SqlValidator validator, SqlValidatorScope scope) {
  validator.validateQuery(this, scope, validator.getUnknownType());
}

它調用的是 SqlValidatorImpl 的 validateQuery() 方法

public void validateQuery(SqlNode node, SqlValidatorScope scope,
    RelDataType targetRowType) {
  final SqlValidatorNamespace ns = getNamespace(node, scope);
  if (node.getKind() == SqlKind.TABLESAMPLE) {
    List<SqlNode> operands = ((SqlCall) node).getOperandList();
    SqlSampleSpec sampleSpec = SqlLiteral.sampleValue(operands.get(1));
    if (sampleSpec instanceof SqlSampleSpec.SqlTableSampleSpec) {
      validateFeature(RESOURCE.sQLFeature_T613(), node.getParserPosition());
    } else if (sampleSpec
        instanceof SqlSampleSpec.SqlSubstitutionSampleSpec) {
      validateFeature(RESOURCE.sQLFeatureExt_T613_Substitution(),
          node.getParserPosition());
    }
  }

  validateNamespace(ns, targetRowType);//note: 檢查
  switch (node.getKind()) {
  case EXTEND:
    // Until we have a dedicated namespace for EXTEND
    deriveType(scope, node);
  }
  if (node == top) {
    validateModality(node);
  }
  validateAccess(
      node,
      ns.getTable(),
      SqlAccessEnum.SELECT);
}

/**
 * Validates a namespace.
 *
 * @param namespace Namespace
 * @param targetRowType Desired row type, must not be null, may be the data
 *                      type 'unknown'.
 */
protected void validateNamespace(final SqlValidatorNamespace namespace,
    RelDataType targetRowType) {
  namespace.validate(targetRowType);//note: 驗證
  if (namespace.getNode() != null) {
    setValidatedNodeType(namespace.getNode(), namespace.getType());
  }
}

這部分的調用邏輯非常複雜，主要的語法驗證是 SqlValidatorScope 部分（它裏面有相應的表名、字段名等信息），而 namespace 表示需要進行驗證的數據源，最開始的這個 SqlNode 有一個 root namespace，上面的 validateNamespace() 方法會首先調用其 namespace 的 validate() 方法進行驗證，以前面的示例爲例，這裏是 SelectNamespace，其實現如下：

//org.apache.calcite.sql.validate.AbstractNamespace
public final void validate(RelDataType targetRowType) {
  switch (status) {
  case UNVALIDATED: //note: 還沒開始 check
    try {
      status = SqlValidatorImpl.Status.IN_PROGRESS; //note: 更新當前 namespace 的狀態
      Preconditions.checkArgument(rowType == null,
          "Namespace.rowType must be null before validate has been called");
      RelDataType type = validateImpl(targetRowType); //note: 檢查驗證
      Preconditions.checkArgument(type != null,
          "validateImpl() returned null");
      setType(type);
    } finally {
      status = SqlValidatorImpl.Status.VALID;
    }
    break;
  case IN_PROGRESS: //note: 已經開始 check 了，死循環了
    throw new AssertionError("Cycle detected during type-checking");
  case VALID://note: 檢查結束
    break;
  default:
    throw Util.unexpected(status);
  }
}

//org.apache.calcite.sql.validate.SelectNamespace
//note: 檢查，還是調用 SqlValidatorImpl 的方法
public RelDataType validateImpl(RelDataType targetRowType) {
  validator.validateSelect(select, targetRowType);
  return rowType;
}

最後驗證方法的實現是 SqlValidatorImpl 的 validateSelect() 方法（對本示例而言），其調用過程如下圖所示：

驗證部分的處理流程

Step3: 語義分析（SqlNode–>RelNode/RexNode）

經過第二步之後，這裏的 SqlNode 就是經過語法校驗的 SqlNode 樹，接下來這一步就是將 SqlNode 轉換成 RelNode/RexNode，也就是生成相應的邏輯計劃（Logical Plan），示例的代碼實現如下：

// create the rexBuilder
final RexBuilder rexBuilder =  new RexBuilder(factory);
// init the planner
// 這裏也可以註冊 VolcanoPlanner，這一步 planner 並沒有使用
HepProgramBuilder builder = new HepProgramBuilder();
RelOptPlanner planner = new HepPlanner(builder.build());

//note: init cluster: An environment for related relational expressions during the optimization of a query.
final RelOptCluster cluster = RelOptCluster.create(planner, rexBuilder);
//note: init SqlToRelConverter
final SqlToRelConverter.Config config = SqlToRelConverter.configBuilder()
    .withConfig(frameworkConfig.getSqlToRelConverterConfig())
    .withTrimUnusedFields(false)
    .withConvertTableAccess(false)
    .build(); //note: config
// 創建 SqlToRelConverter 實例，cluster、calciteCatalogReader、validator 都傳進去了，SqlToRelConverter 會緩存這些對象
final SqlToRelConverter sqlToRelConverter = new SqlToRelConverter(new DogView(), validator, calciteCatalogReader, cluster, StandardConvertletTable.INSTANCE, config);
// convert to RelNode
RelRoot root = sqlToRelConverter.convertQuery(validateSqlNode, false, true);

root = root.withRel(sqlToRelConverter.flattenTypes(root.rel, true));
final RelBuilder relBuilder = config.getRelBuilderFactory().create(cluster, null);
root = root.withRel(RelDecorrelator.decorrelateQuery(root.rel, relBuilder));

RelNode relNode = root.rel;

//DogView 的實現
private static class DogView implements RelOptTable.ViewExpander {
    public DogView() {
    }

    @Override
    public RelRoot expandView(RelDataType rowType, String queryString, List<String> schemaPath,
                              List<String> viewPath) {
        return null;
    }
}

爲了方便分析，這裏也把上面的過程分爲以下幾步：

初始化 RexBuilder；
初始化 RelOptPlanner;
初始化 RelOptCluster；
初始化 SqlToRelConverter；
進行轉換；

第1、2、4步在上述代碼已經有相應的註釋，這裏不再介紹，下面從第三步開始講述。

初始化 RelOptCluster

RelOptCluster 初始化的代碼如下，這裏基本都走默認的參數配置。

org.apache.calcite.plan.RelOptCluster

/** Creates a cluster. */
public static RelOptCluster create(RelOptPlanner planner,
    RexBuilder rexBuilder) {
  return new RelOptCluster(planner, rexBuilder.getTypeFactory(),
      rexBuilder, new AtomicInteger(0), new HashMap<>());
}

/**
 * Creates a cluster.
 *
 * <p>For use only from {@link #create} and {@link RelOptQuery}.
 */
RelOptCluster(RelOptPlanner planner, RelDataTypeFactory typeFactory,
    RexBuilder rexBuilder, AtomicInteger nextCorrel,
    Map<String, RelNode> mapCorrelToRel) {
  this.nextCorrel = nextCorrel;
  this.mapCorrelToRel = mapCorrelToRel;
  this.planner = Objects.requireNonNull(planner);
  this.typeFactory = Objects.requireNonNull(typeFactory);
  this.rexBuilder = rexBuilder;
  this.originalExpression = rexBuilder.makeLiteral("?");

  // set up a default rel metadata provider,
  // giving the planner first crack at everything
  //note: 默認的 metadata provider
  setMetadataProvider(DefaultRelMetadataProvider.INSTANCE);
  //note: trait（對於 HepPlaner 和 VolcanoPlanner 不一樣)
  this.emptyTraitSet = planner.emptyTraitSet();
  assert emptyTraitSet.size() == planner.getRelTraitDefs().size();
}

SqlToRelConverter 轉換

SqlToRelConverter 中的 convertQuery() 將 SqlNode 轉換爲 RelRoot，其實現如下：

/**
 * Converts an unvalidated query's parse tree into a relational expression.
 * note：把一個 parser tree 轉換爲 relational expression
 * @param query           Query to convert
 * @param needsValidation Whether to validate the query before converting;
 *                        <code>false</code> if the query has already been
 *                        validated.
 * @param top             Whether the query is top-level, say if its result
 *                        will become a JDBC result set; <code>false</code> if
 *                        the query will be part of a view.
 */
public RelRoot convertQuery(
    SqlNode query,
    final boolean needsValidation,
    final boolean top) {
  if (needsValidation) { //note: 是否需要做相應的校驗（如果校驗過了，這裏就不需要了）
    query = validator.validate(query);
  }

  //note: 設置 MetadataProvider
  RelMetadataQuery.THREAD_PROVIDERS.set(
      JaninoRelMetadataProvider.of(cluster.getMetadataProvider()));
  //note: 得到 RelNode(relational expression)
  RelNode result = convertQueryRecursive(query, top, null).rel;
  if (top) {
    if (isStream(query)) {//note: 如果 stream 的話
      result = new LogicalDelta(cluster, result.getTraitSet(), result);
    }
  }
  RelCollation collation = RelCollations.EMPTY;
  if (!query.isA(SqlKind.DML)) { //note: 如果是 DML 語句
    if (isOrdered(query)) { //note: 如果需要做排序的話
      collation = requiredCollation(result);
    }
  }
  //note: 對轉換前後的 RelDataType 做驗證
  checkConvertedType(query, result);

  if (SQL2REL_LOGGER.isDebugEnabled()) {
    SQL2REL_LOGGER.debug(
        RelOptUtil.dumpPlan("Plan after converting SqlNode to RelNode",
            result, SqlExplainFormat.TEXT,
            SqlExplainLevel.EXPPLAN_ATTRIBUTES));
  }

  final RelDataType validatedRowType = validator.getValidatedNodeType(query);
  return RelRoot.of(result, validatedRowType, query.getKind())
      .withCollation(collation);
}

真正的實現是在 convertQueryRecursive() 方法中完成的，如下：

/**
 * Recursively converts a query to a relational expression.
 * note：遞歸地講一個 query 轉換爲 relational expression
 *
 * @param query         Query
 * @param top           Whether this query is the top-level query of the
 *                      statement
 * @param targetRowType Target row type, or null
 * @return Relational expression
 */
protected RelRoot convertQueryRecursive(SqlNode query, boolean top,
    RelDataType targetRowType) {
  final SqlKind kind = query.getKind();
  switch (kind) {
  case SELECT:
    return RelRoot.of(convertSelect((SqlSelect) query, top), kind);
  case INSERT:
    return RelRoot.of(convertInsert((SqlInsert) query), kind);
  case DELETE:
    return RelRoot.of(convertDelete((SqlDelete) query), kind);
  case UPDATE:
    return RelRoot.of(convertUpdate((SqlUpdate) query), kind);
  case MERGE:
    return RelRoot.of(convertMerge((SqlMerge) query), kind);
  case UNION:
  case INTERSECT:
  case EXCEPT:
    return RelRoot.of(convertSetOp((SqlCall) query), kind);
  case WITH:
    return convertWith((SqlWith) query, top);
  case VALUES:
    return RelRoot.of(convertValues((SqlCall) query, targetRowType), kind);
  default:
    throw new AssertionError("not a query: " + query);
  }
}

依然以前面的示例爲例，因爲是 SqlSelect 類型，這裏會調用下面的方法做相應的轉換：

/**
 * Converts a SELECT statement's parse tree into a relational expression.
 * note：將一個 Select parse tree 轉換成一個關係表達式
 */
public RelNode convertSelect(SqlSelect select, boolean top) {
  final SqlValidatorScope selectScope = validator.getWhereScope(select);
  final Blackboard bb = createBlackboard(selectScope, null, top);
  convertSelectImpl(bb, select);//note: 做相應的轉換
  return bb.root;
}

在 convertSelectImpl() 方法中會依次對 SqlSelect 的各個部分做相應轉換，其實現如下：

/**
 * Implementation of {@link #convertSelect(SqlSelect, boolean)};
 * derived class may override.
 */
protected void convertSelectImpl(
    final Blackboard bb,
    SqlSelect select) {
  //note: convertFrom
  convertFrom(
      bb,
      select.getFrom());
  //note: convertWhere
  convertWhere(
      bb,
      select.getWhere());

  final List<SqlNode> orderExprList = new ArrayList<>();
  final List<RelFieldCollation> collationList = new ArrayList<>();
  //note: 有 order by 操作時
  gatherOrderExprs(
      bb,
      select,
      select.getOrderList(),
      orderExprList,
      collationList);
  final RelCollation collation =
      cluster.traitSet().canonize(RelCollations.of(collationList));

  if (validator.isAggregate(select)) {
    //note: 當有聚合操作時，也就是含有 group by、having 或者 Select 和 order by 中含有聚合函數
    convertAgg(
        bb,
        select,
        orderExprList);
  } else { //note: 對 select list 部分的處理
    convertSelectList(
        bb,
        select,
        orderExprList);
  }

  if (select.isDistinct()) { //note: select 後面含有 DISTINCT 關鍵字時（去重）
    distinctify(bb, true);
  }
  //note: Converts a query's ORDER BY clause, if any.
  convertOrder(
      select, bb, collation, orderExprList, select.getOffset(),
      select.getFetch());
  bb.setRoot(bb.root, true);
}

這裏以示例中的 From 部分爲例介紹 SqlNode 到 RelNode 的邏輯，按照示例 DEUBG 後的結果如下圖所示，因爲 form 部分是一個 join 操作，會進入 join 相關的處理中。

convertFrom 之 Join 的情況

這部分方法調用過程是：

convertQuery -->
convertQueryRecursive -->
convertSelect -->
convertSelectImpl -->
convertFrom & convertWhere & convertSelectList

到這裏 SqlNode 到 RelNode 過程就完成了，生成的邏輯計劃如下：

LogicalSort(sort0=[$0], dir0=[ASC])
  LogicalProject(USER_ID=[$0], USER_NAME=[$1], USER_COMPANY=[$5], USER_AGE=[$2])
    LogicalFilter(condition=[AND(>($2, 30), >($3, 10))])
      LogicalJoin(condition=[=($1, $4)], joinType=[inner])
        LogicalTableScan(table=[[USERS]])
        LogicalTableScan(table=[[JOBS]])

到這裏前三步就算全部完成了。

Step4: 優化階段（RelNode–>RelNode）

終於來來到了第四階段，也就是 Calcite 的核心所在，優化器進行優化的地方，前面 sql 中有一個明顯可以優化的地方就是過濾條件的下壓（push down），在進行 join 操作前，先進行 filter 操作，這樣的話就不需要在 join 時進行全量 join，減少參與 join 的數據量。

關於filter 操作下壓，在 Calcite 中已經有相應的 Rule 實現，就是 FilterJoinRule.FilterIntoJoinRule.FILTER_ON_JOIN，這裏使用 HepPlanner 作爲示例的 planer，並註冊 FilterIntoJoinRule 規則進行相應的優化，其代碼實現如下：

HepProgramBuilder builder = new HepProgramBuilder();
builder.addRuleInstance(FilterJoinRule.FilterIntoJoinRule.FILTER_ON_JOIN); //note: 添加 rule
HepPlanner hepPlanner = new HepPlanner(builder.build());
hepPlanner.setRoot(relNode);
relNode = hepPlanner.findBestExp();

在 Calcite 中，提供了兩種 planner：HepPlanner 和 VolcanoPlanner，關於這塊內容可以參考【Drill/Calcite查詢優化系列】這幾篇文章（講述得非常詳細，贊），這裏先簡單介紹一下 HepPlanner 和 VolcanoPlanner，後面會關於這兩個 planner 的代碼實現做深入的講述。

HepPlanner

特點（來自 Apache Calcite介紹）：

HepPlanner is a heuristic optimizer similar to Spark’s optimizer，與 spark 的優化器相似，HepPlanner 是一個 heuristic 優化器；
Applies all matching rules until none can be applied：將會匹配所有的 rules 直到一個 rule 被滿足；
Heuristic optimization is faster than cost- based optimization：它比 CBO 更快；
Risk of infinite recursion if rules make opposing changes to the plan：如果沒有每次都不匹配規則，可能會有無限遞歸風險；

VolcanoPlanner

特點（來自 Apache Calcite介紹）：

VolcanoPlanner is a cost-based optimizer：VolcanoPlanner是一個CBO優化器；
Applies matching rules iteratively, selecting the plan with the cheapest cost on each iteration：迭代地應用 rules，直到找到cost最小的plan；
Costs are provided by relational expressions；
Not all possible plans can be computed：不會計算所有可能的計劃；
Stops optimization when the cost does not significantly improve through a determinable number of iterations：根據已知的情況，如果下面的迭代不能帶來提升時，這些計劃將會停止優化；

示例運行結果

經過 HepPlanner 優化後的邏輯計劃爲：

LogicalSort(sort0=[$0], dir0=[ASC])
  LogicalProject(USER_ID=[$0], USER_NAME=[$1], USER_COMPANY=[$5], USER_AGE=[$2])
    LogicalJoin(condition=[=($1, $4)], joinType=[inner])
      LogicalFilter(condition=[>($2, 30)])
        EnumerableTableScan(table=[[USERS]])
      LogicalFilter(condition=[>($0, 10)])
        EnumerableTableScan(table=[[JOBS]])

可以看到優化的結果是符合我們預期的，HepPlanner 和 VolcanoPlanner 詳細流程比較複雜，後面會有單獨的文章進行講述。

總結

Calcite 本身的架構比較好理解，但是具體到代碼層面就不是那麼好理解了，它拋出了很多的概念，如果不把這些概念搞明白，代碼基本看得也是雲裏霧裏，特別是之前沒有接觸過這塊內容的同學（我最開始看 Calcite 代碼時是真的頭大），入門的門檻確實高一些，但是當這些流程梳理清楚之後，其實再回頭看，也沒有多少東西，在生產中用的時候主要也是針對具體的業務場景擴展相應的 SQL 語法、進行具體的規則優化。

Calcite 架構設計得比較好，其中各個組件都可以單獨使用，Rule（規則）擴展性很強，用戶可以根據業務場景自定義相應的優化規則，它支持標準的 SQL，支持不同的存儲和計算引擎，目前在業界應用也比較廣泛，這也證明其牛叉之處。

本文只是個人理解的總結，由於本人也是剛接觸這塊，理解有偏差的地方，歡迎指正~

Apache Calcite 處理流程詳解

Calcite 概念

關係代數的基本知識

查詢優化

Calcite 中的一些概念

Calcite 架構

Calcite 處理流程

示例

Step1: SQL 解析階段（SQL–>SqlNode）

Javacc

SQL Parser 流程

Step2: SqlNode 驗證（SqlNode–>SqlNode）

SqlValidatorImpl 檢查過程

Step3: 語義分析（SqlNode–>RelNode/RexNode）

初始化 RelOptCluster

SqlToRelConverter 轉換

Step4: 優化階段（RelNode–>RelNode）

HepPlanner

VolcanoPlanner

示例運行結果

總結

HTML頁面關於高分屏的設置

北歐瑞典挪威芬蘭瑞士TikTok海外網紅與YouTube博主的合作模式

歐洲英國德國法國TikTok與YouTube海外網紅達人的完美合作策略

druid數據源 xml配置

flink實戰-定時器實現已完成訂單自動五星好評

flink實戰教程-使用set實時計算當天網站uv

聊聊AWK命令的那些事

放棄fastjson，擁抱Jackson

flink實戰教程-flink streaming sql 初體驗

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結