When working with Spark, Scala's syntax is complex, and more and more people prefer SQL, which turns complicated problems into simple ones and avoids writing large amounts of intricate logic. The natural question was whether a Hive-like tool could be built on top of Spark, so that SQL could be used for offline computation scenarios, and Spark SQL was born from exactly this idea. In this article we go deep into the source code to understand Spark SQL's core components and how they work.
Readers familiar with Spark know that calling the sql() function of SQLContext invokes the Spark SQL engine to process the statement. What exactly happens in between? We first lay out the engine's basic execution flow and then dig into each stage.
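As a concrete starting point, here is a minimal sketch of that entry point (Spark 1.3-era API; the file path and table name are purely illustrative):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("sql-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
//register a temporary table so the query below can resolve the name "people"
sqlContext.jsonFile("people.json").registerTempTable("people")
//this call hands the statement to the engine described below
val result = sqlContext.sql("SELECT name FROM people WHERE age > 20")
result.collect().foreach(println)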
Let's start with the source of SQLContext; some relevant excerpts follow:
//the data dictionary: registers tables and caches them for fast lookup
@transient
protected[sql] lazy val catalog: Catalog = new SimpleCatalog(true)
@transient
protected[sql] lazy val functionRegistry: FunctionRegistry = new SimpleFunctionRegistry(true)
//analyzes logical plans that have not yet been resolved
@transient
protected[sql] lazy val analyzer: Analyzer =
new Analyzer(catalog, functionRegistry, caseSensitive = true) {
override val extendedResolutionRules =
ExtractPythonUdfs ::
sources.PreInsertCastAndRename ::
Nil
override val extendedCheckRules = Seq(
sources.PreWriteCheck(catalog)
)
}
//the query optimizer, which optimizes the logical plan
@transient
protected[sql] lazy val optimizer: Optimizer = DefaultOptimizer
//parses DDL statements, e.g. table creation
@transient
protected[sql] val ddlParser = new DDLParser(sqlParser.apply(_))
//performs SQL parsing
@transient
protected[sql] val sqlParser = {
val fallback = new catalyst.SqlParser
new SparkSQLParser(fallback(_))
}
......
@transient
protected[sql] val planner = new SparkPlanner
@transient
protected[sql] lazy val emptyResult = sparkContext.parallelize(Seq.empty[Row], 1)
/**
* Prepares a planned SparkPlan for execution by inserting shuffle operations as needed.
*/
@transient
protected[sql] val prepareForExecution = new RuleExecutor[SparkPlan] {
val batches =
Batch("Add exchange", Once, AddExchange(self)) :: Nil
}
As we can see, SQLContext is made up of the following components:
- Catalog: the data dictionary; it registers tables and caches them for later lookup.
- DDLParser: parses DDL statements such as table creation.
- SparkSQLParser: a proxy in front of SqlParser that handles a number of SQL keywords itself.
- SqlParser: parses select statements.
- Analyzer: analyzes logical plans that have not yet been resolved.
- Optimizer: optimizes logical plans that have already been analyzed.
- SparkPlanner: converts a logical plan into a physical plan.
- prepareForExecution: turns a physical plan into an executable physical plan.
The overall execution flow is as follows (a sketch for inspecting each stage comes right after the list):
- SqlParser parses the SQL text into an Unresolved LogicalPlan.
- The Analyzer binds it against the Catalog data dictionary, producing a Resolved LogicalPlan.
- The Optimizer optimizes the Resolved LogicalPlan, producing an Optimized LogicalPlan.
- SparkPlanner converts the LogicalPlan into a PhysicalPlan.
- prepareForExecution turns the PhysicalPlan into an executable physical plan.
- Calling execute() runs the executable physical plan and produces a SchemaRDD.
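Each stage can also be observed directly. A hedged sketch, assuming the QueryExecution field names of the Spark 1.3-era sources:
//inspect every stage of the pipeline through DataFrame.queryExecution
val df = sqlContext.sql("SELECT name FROM people WHERE age > 20")
val qe = df.queryExecution
println(qe.logical)       //Unresolved LogicalPlan produced by the parser
println(qe.analyzed)      //Resolved LogicalPlan after the Analyzer
println(qe.optimizedPlan) //plan after the Optimizer
println(qe.sparkPlan)     //PhysicalPlan chosen by SparkPlanner
println(qe.executedPlan)  //plan after prepareForExecution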
The complete flow is illustrated in the figure below:
With the basic execution flow in mind, let's look at what each component does and how it is built.
Catalog
It is a trait whose main methods are:
trait Catalog {
//whether identifiers are case sensitive
def caseSensitive: Boolean
//whether a table exists
def tableExists(tableIdentifier: Seq[String]): Boolean
//look up a relation by table name
def lookupRelation(
tableIdentifier: Seq[String],
alias: Option[String] = None): LogicalPlan
/**
* Returns tuples of (tableName, isTemporary) for all tables in the given database.
* isTemporary is a Boolean value indicates if a table is a temporary or not.
*/
def getTables(databaseName: Option[String]): Seq[(String, Boolean)]
def refreshTable(databaseName: String, tableName: String): Unit
//register a table
def registerTable(tableIdentifier: Seq[String], plan: LogicalPlan): Unit
//unregister a table
def unregisterTable(tableIdentifier: Seq[String]): Unit
//unregister all tables
def unregisterAllTables(): Unit
protected def processTableIdentifier(tableIdentifier: Seq[String]): Seq[String] = {
if (!caseSensitive) {
tableIdentifier.map(_.toLowerCase)
} else {
tableIdentifier
}
}
protected def getDbTableName(tableIdent: Seq[String]): String = {
val size = tableIdent.size
if (size <= 2) {
tableIdent.mkString(".")
} else {
tableIdent.slice(size - 2, size).mkString(".")
}
}
protected def getDBTable(tableIdent: Seq[String]) : (Option[String], String) = {
(tableIdent.lift(tableIdent.size - 2), tableIdent.last)
}
}
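The identifier helpers at the bottom of the trait are easy to try outside Spark. A standalone re-implementation for experimentation (a sketch, not the real Catalog):
//normalize the identifier unless lookups are case sensitive
def processTableIdentifier(id: Seq[String], caseSensitive: Boolean): Seq[String] =
  if (!caseSensitive) id.map(_.toLowerCase) else id

//keep at most the last two parts, i.e. "database.table"
def getDbTableName(tableIdent: Seq[String]): String = {
  val size = tableIdent.size
  if (size <= 2) tableIdent.mkString(".")
  else tableIdent.slice(size - 2, size).mkString(".")
}
//getDbTableName(Seq("db", "t"))        => "db.t"
//getDbTableName(Seq("cat", "db", "t")) => "db.t"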
Its usual implementation is the SimpleCatalog class:
protected[sql] lazy val catalog: Catalog = new SimpleCatalog(true)
This class implements the methods of the trait above, including table registration. As the source shows, registering a table actually means putting the table name and its logical plan into a HashMap that serves as the cache.
class SimpleCatalog(val caseSensitive: Boolean) extends Catalog {
//the cache of registered tables
val tables = new mutable.HashMap[String, LogicalPlan]()
override def registerTable(
tableIdentifier: Seq[String],
plan: LogicalPlan): Unit = {
val tableIdent = processTableIdentifier(tableIdentifier)
//registering a table really means putting its name and plan into the cache
tables += ((getDbTableName(tableIdent), plan))
}
override def unregisterTable(tableIdentifier: Seq[String]): Unit = {
val tableIdent = processTableIdentifier(tableIdentifier)
tables -= getDbTableName(tableIdent)
}
override def unregisterAllTables(): Unit = {
tables.clear()
}
override def tableExists(tableIdentifier: Seq[String]): Boolean = {
val tableIdent = processTableIdentifier(tableIdentifier)
tables.get(getDbTableName(tableIdent)) match {
case Some(_) => true
case None => false
}
}
......
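From the user's side, this machinery is driven by temp-table registration. A hedged sketch of how a registration reaches the Catalog (simplified from the Spark 1.3-era sources; the path and table name are illustrative):
val df = sqlContext.jsonFile("people.json")
df.registerTempTable("people")
//internally this boils down to roughly:
//  catalog.registerTable(Seq("people"), df.logicalPlan)
//so that a later query can resolve the name via lookupRelation:
sqlContext.sql("SELECT name FROM people").show()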
As mentioned above, the entry point is the call to the sql function, at which point the parsers are invoked to process the SQL statement. The classes involved in parsing at this stage are:
- DDLParser: the temporary-table parser, used mainly to parse DDL statements that create temporary tables.
- SqlParser: the SQL statement parser, handling select, insert, and similar statements.
- SparkSQLParser: a proxy in front of SqlParser that parses keywords such as AS, CACHE, and SET.
Their class inheritance diagram looks like this:
Returning to the sql call on SQLContext, its source is:
def sql(sqlText: String): DataFrame = {
if (conf.dialect == "sql") {
DataFrame(this, parseSql(sqlText))
} else {
sys.error(s"Unsupported SQL dialect: ${conf.dialect}")
}
}
This in turn calls the parseSql function:
protected[sql] def parseSql(sql: String): LogicalPlan = {
ddlParser(sql, false).getOrElse(sqlParser(sql))
}
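The Option-based fallback is worth a note: DDLParser.apply returns Some(plan) only when the statement parses as DDL, so getOrElse hands everything else to sqlParser. A tiny standalone sketch of the same pattern, with hypothetical stand-ins for the two parsers:
//hypothetical stand-ins for ddlParser and sqlParser
def tryDdl(sql: String): Option[String] =
  if (sql.trim.toUpperCase.startsWith("CREATE")) Some("ddl-plan") else None

def parse(sql: String): String = tryDdl(sql).getOrElse("query-plan")
//parse("CREATE TEMPORARY TABLE t ...") => "ddl-plan"
//parse("SELECT * FROM t")              => "query-plan"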
DDLParser has no constructor matching this call; from Scala's syntax we know it dispatches to the apply method, whose source is:
/**
* A parser for foreign DDL commands.
*/
private[sql] class DDLParser(
parseQuery: String => LogicalPlan)
extends AbstractSparkSQLParser with DataTypeParser with Logging {
def apply(input: String, exceptionOnError: Boolean): Option[LogicalPlan] = {
try {
//delegate to the parent class's single-argument apply
Some(apply(input))
} catch {
case ddlException: DDLException => throw ddlException
case _ if !exceptionOnError => None
case x: Throwable => throw x
}
}
......
As the inheritance diagram above shows, DDLParser extends AbstractSparkSQLParser, which also defines an apply method, so the call proceeds to the parent's version:
private[sql] abstract class AbstractSparkSQLParser
extends StandardTokenParsers with PackratParsers {
def apply(input: String): LogicalPlan = {
// Initialize the Keywords.
lexical.initialize(reservedWords)
//phrase(start) is a curried call: if input matches the start production, Success is returned
phrase(start)(new lexical.Scanner(input)) match {
case Success(plan, _) => plan
case failureOrError => sys.error(failureOrError.toString)
}
}
.......
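The phrase(start)(new lexical.Scanner(input)) idiom comes straight from Scala's parser-combinator library and can be reproduced in a few lines. A minimal sketch outside Spark, assuming scala-parser-combinators is on the classpath (all names here are made up for the demo):
import scala.util.parsing.combinator.syntactical.StandardTokenParsers

object TinyParser extends StandardTokenParsers {
  //register the keywords with the lexer, as lexical.initialize(reservedWords) does above
  lexical.reserved += ("SHOW", "TABLES")
  def start: Parser[String] = "SHOW" ~ "TABLES" ^^^ "show-tables-command"
  def parse(input: String): String =
    phrase(start)(new lexical.Scanner(input)) match {
      case Success(result, _) => result
      case failureOrError     => sys.error(failureOrError.toString)
    }
}
//TinyParser.parse("SHOW TABLES") => "show-tables-command"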
The start production is a fairly involved combinator expression; interested readers can consult the parser-combinator documentation. One excerpt of its source looks like this:
protected lazy val start: Parser[LogicalPlan] =
( (select | ("(" ~> select <~ ")")) *
( UNION ~ ALL ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Union(q1, q2) }
| INTERSECT ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Intersect(q1, q2) }
| EXCEPT ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Except(q1, q2)}
| UNION ~ DISTINCT.? ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Distinct(Union(q1, q2)) }
)
| insert
)
DDLParser
This component is mainly used to create temporary tables. Internally it defines a set of Keyword values, which are collected via reflection and stored in the lexer's HashSet of reserved words:
// Keyword is a convention with AbstractSparkSQLParser, which will scan all of the `Keyword`
// properties via reflection the class in runtime for constructing the SqlLexical object
protected val CREATE = Keyword("CREATE")
protected val TEMPORARY = Keyword("TEMPORARY")
protected val TABLE = Keyword("TABLE")
protected val IF = Keyword("IF")
protected val NOT = Keyword("NOT")
protected val EXISTS = Keyword("EXISTS")
protected val USING = Keyword("USING")
protected val OPTIONS = Keyword("OPTIONS")
protected val DESCRIBE = Keyword("DESCRIBE")
protected val EXTENDED = Keyword("EXTENDED")
protected val AS = Keyword("AS")
protected val COMMENT = Keyword("COMMENT")
protected val REFRESH = Keyword("REFRESH")
//create a temporary table, describe a table, or refresh a table
protected lazy val ddl: Parser[LogicalPlan] = createTable | describeTable | refreshTable
protected def start: Parser[LogicalPlan] = ddl
As we can see, the ddl production is createTable, describeTable, or refreshTable. Here is a quick look at the createTable source:
protected lazy val createTable: Parser[LogicalPlan] =
// TODO: Support database.table.
//create temporary table using options
(CREATE ~> TEMPORARY.? <~ TABLE) ~ (IF ~> NOT <~ EXISTS).? ~ ident ~
tableCols.? ~ (USING ~> className) ~ (OPTIONS ~> options).? ~ (AS ~> restInput).? ^^ {
case temp ~ allowExisting ~ tableName ~ columns ~ provider ~ opts ~ query =>
if (temp.isDefined && allowExisting.isDefined) {
throw new DDLException(
"a CREATE TEMPORARY TABLE statement does not allow IF NOT EXISTS clause.")
}
.......
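For reference, here is a statement this production accepts, following the Spark 1.3 data sources API (the table name and path are illustrative):
sqlContext.sql("""
  CREATE TEMPORARY TABLE jsonTable
  USING org.apache.spark.sql.json
  OPTIONS (path 'examples/src/main/resources/people.json')
""")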
SqlParser
This component is mainly used to parse SQL statements. Its keyword declarations begin like this:
protected val ABS = Keyword("ABS")
protected val ALL = Keyword("ALL")
protected val AND = Keyword("AND")
protected val APPROXIMATE = Keyword("APPROXIMATE")
protected val AS = Keyword("AS")
protected val ASC = Keyword("ASC")
protected val AVG = Keyword("AVG")
protected val BETWEEN = Keyword("BETWEEN")
protected val BY = Keyword("BY")
protected val CASE = Keyword("CASE")
protected val CAST = Keyword("CAST")
protected val COALESCE = Keyword("COALESCE")
protected val COUNT = Keyword("COUNT")
protected val DESC = Keyword("DESC")
protected val DISTINCT = Keyword("DISTINCT")
protected val ELSE = Keyword("ELSE")
protected val END = Keyword("END")
protected val EXCEPT = Keyword("EXCEPT")
protected val FALSE = Keyword("FALSE")
protected val FIRST = Keyword("FIRST")
protected val FROM = Keyword("FROM")
protected val FULL = Keyword("FULL")
protected val GROUP = Keyword("GROUP")
protected val HAVING = Keyword("HAVING")
protected val IF = Keyword("IF")
protected val IN = Keyword("IN")
protected val INNER = Keyword("INNER")
protected val INSERT = Keyword("INSERT")
protected val INTERSECT = Keyword("INTERSECT")
protected val INTO = Keyword("INTO")
protected val IS = Keyword("IS")
protected val JOIN = Keyword("JOIN")
protected val LAST = Keyword("LAST")
protected val LEFT = Keyword("LEFT")
protected val LIKE = Keyword("LIKE")
protected val LIMIT = Keyword("LIMIT")
protected val LOWER = Keyword("LOWER")
protected val MAX = Keyword("MAX")
protected val MIN = Keyword("MIN")
protected val NOT = Keyword("NOT")
protected val NULL = Keyword("NULL")
protected val ON = Keyword("ON")
protected val OR = Keyword("OR")
protected val ORDER = Keyword("ORDER")
protected val SORT = Keyword("SORT")
protected val OUTER = Keyword("OUTER")
protected val OVERWRITE = Keyword("OVERWRITE")
protected val REGEXP = Keyword("REGEXP")
protected val RIGHT = Keyword("RIGHT")
protected val RLIKE = Keyword("RLIKE")
protected val SELECT = Keyword("SELECT")
protected val SEMI = Keyword("SEMI")
protected val SQRT = Keyword("SQRT")
protected val SUBSTR = Keyword("SUBSTR")
.......
Inside the start production of SqlParser a series of parsing rules is defined, in rather dense combinator syntax. Let's walk through a short excerpt:
protected lazy val start: Parser[LogicalPlan] =
( (select | ("(" ~> select <~ ")")) *
// ~ sequences two parsers: in A ~ B, A must match to the left of B
//UNION ALL is rewritten into a Union node; the ^^^ operator replaces the matched tokens with the given value
( UNION ~ ALL ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Union(q1, q2) }
//INTERSECT is rewritten into an Intersect node
| INTERSECT ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Intersect(q1, q2) }
| EXCEPT ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Except(q1, q2)}
| UNION ~ DISTINCT.? ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Distinct(Union(q1, q2)) }
)
| insert
)
protected lazy val insert: Parser[LogicalPlan] =
//~> keeps only the right-hand side: in A ~> B, only B's result is kept
INSERT ~> (OVERWRITE ^^^ true | INTO ^^^ false) ~ (TABLE ~> relation) ~ select ^^ {
case o ~ r ~ s => InsertIntoTable(r, Map.empty[String, Option[String]], s, o)
}
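If these operators look opaque, the following self-contained sketch reproduces them with Scala's parser combinators, including the p * sep repetition that start uses to chain select blocks (all names are made up for the demo):
import scala.util.parsing.combinator.JavaTokenParsers

object MiniParser extends JavaTokenParsers {
  //^^^ replaces the matched text with a constant value, here a combining function
  def op: Parser[(Int, Int) => Int] =
    ( "plus"  ^^^ { (a: Int, b: Int) => a + b }
    | "times" ^^^ { (a: Int, b: Int) => a * b }
    )
  //^^ transforms the parse result with a function
  def num: Parser[Int] = wholeNumber ^^ (_.toInt)
  //p * sep folds repeated p's left to right with the functions sep yields,
  //exactly how start chains select blocks with UNION/INTERSECT/EXCEPT
  def expr: Parser[Int] = num * op
}
//MiniParser.parseAll(MiniParser.expr, "2 plus 3 times 4") => Success(20, ...)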
SparkSQLParser
SparkSQLParser acts as a proxy in front of SqlParser, handling keywords such as AS and CACHE itself:
protected val AS = Keyword("AS")
protected val CACHE = Keyword("CACHE")
protected val CLEAR = Keyword("CLEAR")
protected val IN = Keyword("IN")
protected val LAZY = Keyword("LAZY")
protected val SET = Keyword("SET")
protected val SHOW = Keyword("SHOW")
protected val TABLE = Keyword("TABLE")
protected val TABLES = Keyword("TABLES")
protected val UNCACHE = Keyword("UNCACHE")
override protected lazy val start: Parser[LogicalPlan] = cache | uncache | set | show | others
As the source shows, its start production matches cache, set, show, and similar alternatives, whose expressions read much like those discussed above.
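In practice these productions correspond to statements like the following, which SparkSQLParser handles itself instead of delegating to SqlParser (the table name is illustrative):
sqlContext.sql("CACHE TABLE people")
sqlContext.sql("SET spark.sql.shuffle.partitions=10")
sqlContext.sql("SHOW TABLES")
sqlContext.sql("UNCACHE TABLE people")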
Once SqlParser has finished and the SQL has been parsed, the Analyzer binds the result against the Catalog data dictionary and then analyzes the execution plan. That part will be covered in detail in a follow-up article; stay tuned.