A First Look at Calcite: A Usage Example

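Calcite (https://calcite.apache.org/) is an Apache project (it started out in the Apache Incubator). It is a framework for building JDBC or ODBC access to data: by writing custom adapters, you can query almost any kind of data through SQL. Until now, our use of SQL has mostly been limited to relational databases such as MySQL and Oracle, or to Hive for querying data on HDFS. But what if we want a SQL interface to a data structure in memory (provided it has a relational shape), to the contents of files (CSV files or other regularly structured plain files, although Hive can also read these), to HBase and other NoSQL stores, or even across data sources (for example, joining data in Hive with data in MySQL)? These cases cover most of the places where we keep data day to day. The problem Calcite solves is this: once you find a way to describe such data as a relational model, you can query it with SQL.
Assume we use Calcite only for queries, since the data above is generally written through other channels and all we need is to read it through SQL. Calcite parses the SQL statement, generates the physical execution plan and optimizes the query plan. The user must supply Calcite with the metadata (which databases, i.e. schemas, exist; which tables each schema contains; which columns each table has; the type of each column) and with the data (the rows of each table). Beyond that, the user can also override the execution plan Calcite provides. These are only Calcite's basic capabilities; more advanced ones include streaming queries and lattices (built on materialized views). At present Calcite is used as a local framework embedded in an application rather than as a standalone service.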

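Apache Calcite has the following technical features:

Support for standard SQL;
Independence from programming language and data source, so it can work with different front ends and back ends;
Relational algebra, customizable logical planning rules, and a query engine optimized with a cost-based model;
Management of materialized views (creating, dropping, persisting and automatically recognizing them);
Lattice and Tile mechanisms built on materialized views, intended for OLAP analysis;
Support for queries over streaming data.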

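For an introductory article on Calcite, see: http://www.infoq.com/cn/articles/new-big-data-hadoop-query-engine-apache-calcite

The rest of this article takes a hands-on approach to querying different data sources with Calcite. The storage in our experiment is a data structure in memory. We start with a map: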

    // Maps a database (schema) name to its in-memory Database object.
    public static final Map<String, Database> MAP = new HashMap<String, Database>();

    public static class Database {
        public List<Table> tables = new LinkedList<Table>();
    }

    public static class Table {
        public String tableName;
        public List<Column> columns = new LinkedList<Column>();
        // Row data: one inner list per row, one string per column.
        public List<List<String>> data = new LinkedList<List<String>>();
    }

    public static class Column {
        public String name;
        public String type;
    }

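MAP stores the mapping from a database name to our in-memory Database structure. Each Database holds several Table objects, and each Table has a list of Columns and a two-dimensional data list. A Column defines a field name and type. For testing we create a Database named school containing two Tables, Class and Student. The Class table is initialized as follows: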

        Table cl = new Table();
        cl.tableName = "Class";
        Column name = new Column();
        name.name = "name";
        name.type = "varchar";
        cl.columns.add(name);

        Column id = new Column();
        id.name = "id";
        id.type = "integer";
        cl.columns.add(id);

        Column teacher = new Column();
        teacher.name = "teacher";
        teacher.type = "varchar";
        cl.columns.add(teacher);

The Student table is initialized as follows:

        Table student = new Table();
        student.tableName = "Student";
        Column name = new Column();
        name.name = "name";
        name.type = "varchar";
        student.columns.add(name);

        Column id = new Column();
        id.name = "id";
        id.type = "varchar";
        student.columns.add(id);

        // classId refers to the id column of the Class table.
        Column classId = new Column();
        classId.name = "classId";
        classId.type = "integer";
        student.columns.add(classId);

        Column birth = new Column();
        birth.name = "birthday";
        birth.type = "date";
        student.columns.add(birth);

        Column home = new Column();
        home.name = "home";
        home.type = "varchar";
        student.columns.add(home);
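
Each table keeps its rows in the data list as strings, one inner list per row. The article does not show how rows are added, so the following is only a minimal sketch of one way to do it. The helpers addColumn and addRow and the sample rows are hypothetical (they are not part of the original source), and the Table and Column classes are repeated here so the sketch compiles on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SchoolDataSketch {
    // Minimal copies of the article's Column and Table classes.
    public static class Column {
        public String name;
        public String type;
    }

    public static class Table {
        public String tableName;
        public List<Column> columns = new ArrayList<>();
        public List<List<String>> data = new ArrayList<>();
    }

    // Hypothetical helper: appends one column definition to a table.
    static void addColumn(Table table, String name, String type) {
        Column column = new Column();
        column.name = name;
        column.type = type;
        table.columns.add(column);
    }

    // Hypothetical helper: appends one row and rejects rows whose width
    // does not match the number of declared columns.
    static void addRow(Table table, String... values) {
        if (values.length != table.columns.size()) {
            throw new IllegalArgumentException(
                "expected " + table.columns.size() + " values, got " + values.length);
        }
        table.data.add(Arrays.asList(values));
    }

    static Table buildClassTable() {
        Table cl = new Table();
        cl.tableName = "Class";
        addColumn(cl, "name", "varchar");
        addColumn(cl, "id", "integer");
        addColumn(cl, "teacher", "varchar");
        // Made-up sample rows, for illustration only.
        addRow(cl, "class-one", "1", "teacher-a");
        addRow(cl, "class-two", "2", "teacher-b");
        return cl;
    }

    public static void main(String[] args) {
        Table cl = buildClassTable();
        System.out.println(cl.tableName + ": " + cl.columns.size()
            + " columns, " + cl.data.size() + " rows");
    }
}
```

Checking the row width at insert time is a design choice of this sketch; it catches a mismatched row early instead of at query time.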

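Next we insert some rows into each of the two tables, stored in the data member. With that, the data is initialized; you can imagine it living in a CSV file or in Redis instead. The next step is to adapt it to Calcite. To open a JDBC connection, Calcite needs a JSON model. Its content can be passed in through a configuration variable or read from a configuration file. The format is as follows: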

{
  version: '1.0',
  defaultSchema: 'school',
  schemas: [
    {
      name: 'school',
      type: 'custom',
      factory: 'org.apache.kylin.calcite.test.MemorySchemaFactory',
      operand: {
        param1: 'hello',
        param2: 'world'
      }
    }
  ]
}

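This is a fairly simple Calcite JSON model file; the full structure is described at https://calcite.apache.org/docs/model.html. The file is used to create the connection, so what is configured here is information for the connection. defaultSchema is like the database you supply when connecting to MySQL: tables in that schema can be referenced without a schema prefix. schemas defines a list of schemas (the equivalent of databases), and each schema specifies a name and a type (Map Schema, Custom Schema or JDBC Schema).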
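
The model does not have to live in a file. Calcite also accepts the JSON text itself as the value of the model connection property when it is prefixed with inline:. The sketch below shows that variant under stated assumptions: the class and method names (InlineModelSketch, inlineModelProperties, connect) are made up for illustration, and connect only works with the Calcite JDBC driver and the factory class on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

public class InlineModelSketch {
    // The same model as the JSON file, held in a string instead of on disk.
    static final String MODEL_JSON =
        "{"
        + " version: '1.0',"
        + " defaultSchema: 'school',"
        + " schemas: [ {"
        + "   name: 'school',"
        + "   type: 'custom',"
        + "   factory: 'org.apache.kylin.calcite.test.MemorySchemaFactory',"
        + "   operand: { param1: 'hello', param2: 'world' }"
        + " } ]"
        + "}";

    // Builds connection properties whose "model" value carries the JSON inline.
    static Properties inlineModelProperties(String modelJson) {
        Properties info = new Properties();
        info.setProperty("model", "inline:" + modelJson);
        return info;
    }

    // Opens the connection. Requires the Calcite JDBC driver at run time.
    static Connection connect() throws SQLException {
        return DriverManager.getConnection("jdbc:calcite:", inlineModelProperties(MODEL_JSON));
    }

    public static void main(String[] args) {
        Properties info = inlineModelProperties(MODEL_JSON);
        // Only inspects the property; no connection is opened here.
        System.out.println(info.getProperty("model").startsWith("inline:"));
    }
}
```

Keeping the model in code can be convenient in tests, since there is no file path to manage.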

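The three schema types (Map, Custom and JDBC) are used differently, and the type determines which other parameters follow. With a Map Schema you list the schema's tables and functions in the JSON file itself (see the official documentation for exactly what is required). In other words the schema is defined up front (which tables exist and the structure of each), so it is rarely used, because in most cases which tables a schema contains can only be determined from runtime variables. With a Custom Schema you only specify a factory and an optional operand parameter (a map). The schema is created by the given factory class, which must implement the org.apache.calcite.schema.SchemaFactory interface, and the factory can decide which tables the schema contains based on the schema's name and the operand values. With a JDBC Schema you configure a JDBC connection directly in the model file, and every operation sent to Calcite is actually carried out by that database. This is also uncommon (you might as well connect to that database over JDBC directly).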

Here we use the most common type, Custom Schema, so we need to define a factory (MemorySchemaFactory). It implements the org.apache.calcite.schema.SchemaFactory interface and must implement the create function, as follows:

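import java.util.Map;

import org.apache.calcite.schema.Schema;
import org.apache.calcite.schema.SchemaFactory;
import org.apache.calcite.schema.SchemaPlus;

public class MemorySchemaFactory implements SchemaFactory {
    @Override
    public Schema create(SchemaPlus parentSchema, String name, Map<String, Object> operand) {
        // Print the arguments to show what Calcite passes in from the model file.
        System.out.println("param1 : " + operand.get("param1"));
        System.out.println("param2 : " + operand.get("param2"));

        System.out.println("Get database " + name);
        return new MemorySchema(name);
    }
}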

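For testing we print some of the arguments. The test shows that the name argument is the schema's name from the JSON file, and operand is the operand defined for that schema in the file. create must return a Schema object, so we define a MemorySchema class, which has to implement the org.apache.calcite.schema.Schema interface. MemorySchema extends org.apache.calcite.schema.impl.AbstractSchema, which implements Schema and supplies default implementations. In most cases the methods we need to override are the following: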

public boolean contentsHaveChangedSince(long lastCheck, long now): checks whether the cache has expired. Calcite caches a schema's metadata by default, so this function is where a cache validity check can be implemented.

protected Map<String, Table> getTableMap(): returns the schema's metadata as a mapping from table name to table object.

protected Multimap<String, Function> getFunctionMultimap(): returns the UDFs that the schema supports.

In MemorySchema we implement only getTableMap:

    @Override
    public Map<String, Table> getTableMap() {
        Map<String, Table> tables = new HashMap<String, Table>();
        // Look up the in-memory database registered under this schema's name.
        Database database = MemoryData.MAP.get(this.dbName);
        if (database == null) {
            return tables;
        }
        for (MemoryData.Table table : database.tables) {
            tables.put(table.tableName, new MemoryTable(table));
        }

        return tables;
    }

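As you can see, we simply look up the Database object for the schema name in the in-memory MAP, and then expose the Tables in that Database as the tables of the Schema, each wrapped in a MemoryTable.

According to the documentation, there are generally three kinds of Table we can implement: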

a simple implementation of Table, using the ScannableTable interface, that enumerates all rows directly;

a more advanced implementation that implements FilterableTable, and can filter out rows according to simple predicates;

advanced implementation of Table, using TranslatableTable, that translates to relational operators using planner rules.

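With ScannableTable, we only need to implement Enumerable<Object[]> scan(DataContext root). It returns an Enumerable through which all rows of the table can be read one by one, which means every query scans the whole table. With FilterableTable, we implement Enumerable<Object[]> scan(DataContext root, List<RexNode> filters). The extra filters list holds the filter conditions that apply to this table, so we can return only the rows that pass them and shrink the data set that upper operators have to process. With TranslatableTable, we implement RelNode toRel(RelOptTable.ToRelContext context, RelOptTable relOptTable), which lets us define the physical plan for scanning the table ourselves, based on the context. It no longer returns an Enumerable because the first two table types use the default plan and are converted to the EnumerableTableAccessRel operator, whereas TranslatableTable lets us plug in custom operators and apply additional rules. Kylin uses this type of table to implement its queries.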
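
To make the difference between the first two table types concrete, here is a deliberately simplified sketch that uses only the JDK. It is not Calcite's API: a java.util.function.Predicate stands in for the List<RexNode> filters that Calcite passes, plain lists stand in for Enumerable, and the class name, method names and sample rows are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class FilteredScanSketch {
    // ScannableTable style: hand back every row; any filtering happens in the caller.
    static List<Object[]> scan(List<Object[]> rows) {
        return new ArrayList<>(rows);
    }

    // FilterableTable style: the condition is applied inside the scan,
    // so fewer rows reach the operators above it.
    static List<Object[]> scan(List<Object[]> rows, Predicate<Object[]> filter) {
        List<Object[]> result = new ArrayList<>();
        for (Object[] row : rows) {
            if (filter.test(row)) {
                result.add(row);
            }
        }
        return result;
    }

    // Made-up rows in the column order of the Class table: name, id, teacher.
    static List<Object[]> sampleClassRows() {
        return Arrays.asList(
            new Object[] {"class-one", 1, "teacher-a"},
            new Object[] {"class-two", 2, "teacher-b"},
            new Object[] {"class-three", 1, "teacher-c"});
    }

    public static void main(String[] args) {
        List<Object[]> rows = sampleClassRows();
        System.out.println("full scan rows: " + scan(rows).size());
        System.out.println("filtered scan rows: "
            + scan(rows, row -> Integer.valueOf(1).equals(row[1])).size());
    }
}
```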

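To keep things simple, we use only ScannableTable here and do a full table scan every time. Besides the method above, a table must also implement the methods of Calcite's base Table interface. AbstractTable provides default implementations for most of them, so we only need to implement getRowType, which returns the table's metadata, and scan, which returns its data.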

    @Override
    public RelDataType getRowType(RelDataTypeFactory typeFactory) {
        // Build the row type once and cache it.
        if (dataType == null) {
            RelDataTypeFactory.FieldInfoBuilder fieldInfo = typeFactory.builder();
            for (MemoryData.Column column : this.sourceTable.columns) {
                RelDataType sqlType = typeFactory.createSqlType(
                        MemoryData.SQLTYPE_MAPPING.get(column.type));
                sqlType = SqlTypeUtil.addCharsetAndCollation(sqlType, typeFactory);
                fieldInfo.add(column.name, sqlType);
            }
            this.dataType = typeFactory.createStructType(fieldInfo);
        }
        return this.dataType;
    }

    @Override
    public Enumerable<Object[]> scan(DataContext root) {
        // identityList is a helper that is not shown in this article.
        final int[] fields = identityList(this.dataType.getFieldCount());
        return new AbstractEnumerable<Object[]>() {
            public Enumerator<Object[]> enumerator() {
                return new MemoryEnumerator<Object[]>(fields, sourceTable.data);
            }
        };
    }

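The table's metadata (field names and types) is derived from the type of each Column in the initialized Table; MemoryData.SQLTYPE_MAPPING maps our custom type names to Calcite types. The scan function returns an Enumerable whose enumerator walks the rows: moveNext reports whether there is another row to read, and current returns the current row. Other methods can be implemented as needed; they are not covered here.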
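
MemoryEnumerator itself is not listed in this article. As a rough idea of what such an enumerator does, here is a minimal JDK-only sketch. The RowEnumerator interface is a local stand-in declaring the same four methods as org.apache.calcite.linq4j.Enumerator (current, moveNext, reset, close), and ListRowEnumerator is a hypothetical name. A real implementation would likely also need to convert each string to a Java value matching the column's declared type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

public class RowEnumeratorSketch {
    // Local stand-in mirroring the methods of Calcite's Enumerator interface.
    interface RowEnumerator<T> {
        T current();
        boolean moveNext();
        void reset();
        void close();
    }

    // Walks a List<List<String>> (the shape of Table.data) row by row.
    static class ListRowEnumerator implements RowEnumerator<Object[]> {
        private final List<List<String>> rows;
        private int index = -1; // starts positioned before the first row

        ListRowEnumerator(List<List<String>> rows) {
            this.rows = rows;
        }

        @Override
        public Object[] current() {
            if (index < 0 || index >= rows.size()) {
                throw new NoSuchElementException("no current row");
            }
            return rows.get(index).toArray();
        }

        @Override
        public boolean moveNext() {
            if (index + 1 < rows.size()) {
                index++;
                return true;
            }
            index = rows.size(); // stay past the end once exhausted
            return false;
        }

        @Override
        public void reset() {
            index = -1;
        }

        @Override
        public void close() {
            // Nothing to release for in-memory data.
        }
    }

    public static void main(String[] args) {
        List<List<String>> data = Arrays.asList(
            Arrays.asList("class-one", "1", "teacher-a"),
            Arrays.asList("class-two", "2", "teacher-b"));
        ListRowEnumerator rows = new ListRowEnumerator(data);
        while (rows.moveNext()) {
            System.out.println(Arrays.toString(rows.current()));
        }
        rows.close();
    }
}
```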

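That is the whole implementation. A query goes through the following steps. Calcite parses the SQL and converts it into a logical plan. Along the way it initializes each Schema from the schema definitions in the current connection, then calls getTableMap on the schema named in the query to obtain the metadata, and uses it to check that the table and column names in the query are valid and that the SQL is well-formed. Calcite then uses its built-in default implementation to generate the physical plan. The plan is a tree whose leaf nodes are table scans (much as the FROM clause is evaluated first when SQL executes). For each table the default operator EnumerableTableAccessRel is used, which in turn calls the scan method of the concrete ScannableTable to fetch the table's rows. After that, upper-level operations such as JOIN, FILTER, GROUP BY, SORT, LIMIT and even subqueries run on top of the scanned rows.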
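
To show what "upper-level operations on top of full scans" means in practice, the sketch below evaluates by hand a query of the same shape as the one tested later: students joined to classes on classId = id, grouped by home, with a count per group. It is not how Calcite generates or runs plans; it only mimics the order of the steps with plain Java collections. The class name, method names and sample rows are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ManualPlanSketch {
    // Column positions, following the column order of the article's tables.
    static final int STUDENT_CLASS_ID = 2; // Student: name, id, classId, birthday, home
    static final int STUDENT_HOME = 4;
    static final int CLASS_ID = 1;         // Class: name, id, teacher

    static Map<String, Integer> countStudentsByHome(List<Object[]> students,
                                                    List<Object[]> classes) {
        // Leaf step: full scan of Class, counting rows per id (build side of a hash join).
        Map<Object, Integer> classRowsById = new HashMap<>();
        for (Object[] c : classes) {
            classRowsById.merge(c[CLASS_ID], 1, Integer::sum);
        }
        // Leaf step: full scan of Student. JOIN and GROUP BY run on the scanned rows.
        Map<String, Integer> countByHome = new LinkedHashMap<>();
        for (Object[] s : students) {
            int matches = classRowsById.getOrDefault(s[STUDENT_CLASS_ID], 0);
            if (matches > 0) { // inner join: students without a matching class are dropped
                countByHome.merge((String) s[STUDENT_HOME], matches, Integer::sum);
            }
        }
        return countByHome;
    }

    // Made-up sample rows, for illustration only.
    static List<Object[]> sampleStudents() {
        return Arrays.asList(
            new Object[] {"s1", "1", 1, "2000-01-01", "anhui"},
            new Object[] {"s2", "2", 1, "2000-02-01", "anhui"},
            new Object[] {"s3", "3", 2, "2000-03-01", "beijing"},
            new Object[] {"s4", "4", 9, "2000-04-01", "hebei"});
    }

    static List<Object[]> sampleClasses() {
        return Arrays.asList(
            new Object[] {"class-one", 1, "teacher-a"},
            new Object[] {"class-two", 2, "teacher-b"});
    }

    public static void main(String[] args) {
        System.out.println(countStudentsByHome(sampleStudents(), sampleClasses()));
    }
}
```

Because both inputs are read in full before any row is discarded, this also shows why a ScannableTable can become expensive on large tables.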

Test code:

    public static void main(String[] args) {
        try {
            // Register the Calcite JDBC driver.
            Class.forName("org.apache.calcite.jdbc.Driver");
        } catch (ClassNotFoundException e1) {
            e1.printStackTrace();
        }

        Properties info = new Properties();
        try {
            Connection connection =
                DriverManager.getConnection("jdbc:calcite:model=E:\\file\\to\\model\\file\\School.json", info);
            // List every table visible through this connection.
            ResultSet result = connection.getMetaData().getTables(null, null, null, null);
            while (result.next()) {
                System.out.println("Catalog : " + result.getString(1) + ",Database : " + result.getString(2) + ",Table : " + result.getString(3));
            }
            result.close();

            // Run a JOIN with GROUP BY; identifiers are quoted to preserve their case.
            Statement st = connection.createStatement();
            result = st.executeQuery("select \"home\", 1 , count(1) from \"Student\" as S INNER JOIN \"Class\" as C on S.\"classId\" = C.\"id\" group by \"home\"");
            while (result.next()) {
                System.out.println(result.getString(1) + "\t" + result.getString(2) + "\t" + result.getString(3));
            }
            result.close();
            connection.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

Output:

param1 : hello
param2 : world
Get database school
Catalog : null,Database : metadata,Table : COLUMNS
Catalog : null,Database : metadata,Table : TABLES
Catalog : null,Database : school,Table : Class
Catalog : null,Database : school,Table : Student
sichuan       1      1
zhejiang      1      1
henan  1      1
jiangsu       1      1
hebei  1      1
beijing       1      1
anhui  1      2

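The first three lines are printed when Calcite creates the Schema. The next four lines are the tables in the current connection: the first two are system tables and the last two are the tables we defined. The JOIN query then runs and returns the correct result. Note that Calcite handles identifiers much like Oracle: unquoted table and column names are converted to upper case during parsing, yet name matching in Calcite is case-sensitive. So unless you define all table and column names in upper case, or wrap them in double quotes in the query (which stops the parser from upper-casing them), you are likely to hit table-not-found or column-not-found errors (http://stackoverflow.com/questions/31118348/table-not-found-with-apache-calcite).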
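
The identifier behaviour described above can be illustrated with a small simulation. This is not Calcite's parser; it only reproduces the rule as stated (unquoted names are upper-cased, quoted names are kept as written, and lookup is case-sensitive). The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class IdentifierCasingSketch {
    // Unquoted identifiers are upper-cased; double-quoted ones keep their case
    // and lose only the surrounding quotes.
    static String normalize(String identifier) {
        if (identifier.length() >= 2
                && identifier.startsWith("\"") && identifier.endsWith("\"")) {
            return identifier.substring(1, identifier.length() - 1);
        }
        return identifier.toUpperCase(Locale.ROOT);
    }

    // Case-sensitive lookup of the normalized name, as in a plain HashMap of tables.
    static boolean resolves(Map<String, ?> tables, String identifierInSql) {
        return tables.containsKey(normalize(identifierInSql));
    }

    public static void main(String[] args) {
        Map<String, Object> tables = new HashMap<>();
        tables.put("Student", new Object()); // registered in mixed case, as in this article

        // Unquoted: becomes STUDENT, which is not a key in the map.
        System.out.println(resolves(tables, "Student"));
        // Quoted: stays Student, which matches the registered name.
        System.out.println(resolves(tables, "\"Student\""));
    }
}
```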

Source code: https://github.com/terry-chelsea/bigdata. I will keep adding code that I use while learning and developing in the big data ecosystem. Feedback and discussion are welcome.

教練_我要踢球
