When working with Spark, Scala's syntax is complex, and more and more people prefer SQL, which turns complicated problems into simple ones and avoids writing large amounts of intricate logic. The natural question was whether a Hive-like tool could be built on top of Spark, so that SQL could be used for offline computation scenarios, and Spark SQL was born from exactly this idea. In this article we go deep into the source code to understand Spark SQL's core components and how they work.
Readers familiar with Spark know that calling the sql() function of SQLContext invokes the Spark SQL engine to process the statement. What exactly happens in between? We first lay out the engine's basic execution flow and then dig into each stage.
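As a concrete starting point, here is a minimal sketch of that entry point (Spark 1.3-era API; the file path and table name are purely illustrative):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("sql-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
//register a temporary table so the query below can resolve the name "people"
sqlContext.jsonFile("people.json").registerTempTable("people")
//this call hands the statement to the engine described below
val result = sqlContext.sql("SELECT name FROM people WHERE age > 20")
result.collect().foreach(println)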
Let's start with the source of SQLContext; some relevant excerpts follow:
//the data dictionary: registers tables and caches them for fast lookup
@transient
protected[sql] lazy val catalog: Catalog = new SimpleCatalog(true)
@transient
protected[sql] lazy val functionRegistry: FunctionRegistry = new SimpleFunctionRegistry(true)
//analyzes logical plans that have not yet been resolved
@transient
protected[sql] lazy val analyzer: Analyzer =
new Analyzer(catalog, functionRegistry, caseSensitive = true) {
override val extendedResolutionRules =
ExtractPythonUdfs ::
sources.PreInsertCastAndRename ::
Nil
override val extendedCheckRules = Seq(
sources.PreWriteCheck(catalog)
)
}
//the query optimizer, which optimizes the logical plan
@transient
protected[sql] lazy val optimizer: Optimizer = DefaultOptimizer
//parses DDL statements, e.g. table creation
@transient
protected[sql] val ddlParser = new DDLParser(sqlParser.apply(_))
//performs SQL parsing
@transient
protected[sql] val sqlParser = {
val fallback = new catalyst.SqlParser
new SparkSQLParser(fallback(_))
}
......
@transient
protected[sql] val planner = new SparkPlanner
@transient
protected[sql] lazy val emptyResult = sparkContext.parallelize(Seq.empty[Row], 1)
/**
* Prepares a planned SparkPlan for execution by inserting shuffle operations as needed.
*/
@transient
protected[sql] val prepareForExecution = new RuleExecutor[SparkPlan] {
val batches =
Batch("Add exchange", Once, AddExchange(self)) :: Nil
}
As we can see, SQLContext is made up of the following components:
- Catalog: the data dictionary; it registers tables and caches them for later lookup.
- DDLParser: parses DDL statements such as table creation.
- SparkSQLParser: a proxy in front of SqlParser that handles a number of SQL keywords itself.
- SqlParser: parses select statements.
- Analyzer: analyzes logical plans that have not yet been resolved.
- Optimizer: optimizes logical plans that have already been analyzed.
- SparkPlanner: converts a logical plan into a physical plan.
- prepareForExecution: turns a physical plan into an executable physical plan.
The overall execution flow is as follows (a sketch for inspecting each stage comes right after the list):
- SqlParser parses the SQL text into an Unresolved LogicalPlan.
- The Analyzer binds it against the Catalog data dictionary, producing a Resolved LogicalPlan.
- The Optimizer optimizes the Resolved LogicalPlan, producing an Optimized LogicalPlan.
- SparkPlanner converts the LogicalPlan into a PhysicalPlan.
- prepareForExecution turns the PhysicalPlan into an executable physical plan.
- Calling execute() runs the executable physical plan and produces a SchemaRDD.
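Each stage can also be observed directly. A hedged sketch, assuming the QueryExecution field names of the Spark 1.3-era sources:
//inspect every stage of the pipeline through DataFrame.queryExecution
val df = sqlContext.sql("SELECT name FROM people WHERE age > 20")
val qe = df.queryExecution
println(qe.logical)       //Unresolved LogicalPlan produced by the parser
println(qe.analyzed)      //Resolved LogicalPlan after the Analyzer
println(qe.optimizedPlan) //plan after the Optimizer
println(qe.sparkPlan)     //PhysicalPlan chosen by SparkPlanner
println(qe.executedPlan)  //plan after prepareForExecution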
The complete flow is illustrated in the figure below:
With the basic execution flow in mind, let's look at what each component does and how it is built.
Catalog
It is a trait whose main methods are:
trait Catalog {
//whether identifiers are case sensitive
def caseSensitive: Boolean
//whether a table exists
def tableExists(tableIdentifier: Seq[String]): Boolean
//look up a relation by table name
def lookupRelation(
tableIdentifier: Seq[String],
alias: Option[String] = None): LogicalPlan
/**
* Returns tuples of (tableName, isTemporary) for all tables in the given database.
* isTemporary is a Boolean value indicates if a table is a temporary or not.
*/
def getTables(databaseName: Option[String]): Seq[(String, Boolean)]
def refreshTable(databaseName: String, tableName: String): Unit
//register a table
def registerTable(tableIdentifier: Seq[String], plan: LogicalPlan): Unit
//unregister a table
def unregisterTable(tableIdentifier: Seq[String]): Unit
//unregister all tables
def unregisterAllTables(): Unit
protected def processTableIdentifier(tableIdentifier: Seq[String]): Seq[String] = {
if (!caseSensitive) {
tableIdentifier.map(_.toLowerCase)
} else {
tableIdentifier
}
}
protected def getDbTableName(tableIdent: Seq[String]): String = {
val size = tableIdent.size
if (size <= 2) {
tableIdent.mkString(".")
} else {
tableIdent.slice(size - 2, size).mkString(".")
}
}
protected def getDBTable(tableIdent: Seq[String]) : (Option[String], String) = {
(tableIdent.lift(tableIdent.size - 2), tableIdent.last)
}
}
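The identifier helpers at the bottom of the trait are easy to try outside Spark. A standalone re-implementation for experimentation (a sketch, not the real Catalog):
//normalize the identifier unless lookups are case sensitive
def processTableIdentifier(id: Seq[String], caseSensitive: Boolean): Seq[String] =
  if (!caseSensitive) id.map(_.toLowerCase) else id

//keep at most the last two parts, i.e. "database.table"
def getDbTableName(tableIdent: Seq[String]): String = {
  val size = tableIdent.size
  if (size <= 2) tableIdent.mkString(".")
  else tableIdent.slice(size - 2, size).mkString(".")
}
//getDbTableName(Seq("db", "t"))        => "db.t"
//getDbTableName(Seq("cat", "db", "t")) => "db.t"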
Its usual implementation is the SimpleCatalog class:
protected[sql] lazy val catalog: Catalog = new SimpleCatalog(true)
This class implements the methods of the trait above, including table registration. As the source shows, registering a table actually means putting the table name and its logical plan into a HashMap that serves as the cache.
class SimpleCatalog(val caseSensitive: Boolean) extends Catalog {
//the cache of registered tables
val tables = new mutable.HashMap[String, LogicalPlan]()
override def registerTable(
tableIdentifier: Seq[String],
plan: LogicalPlan): Unit = {
val tableIdent = processTableIdentifier(tableIdentifier)
//registering a table really means putting its name and plan into the cache
tables += ((getDbTableName(tableIdent), plan))
}
override def unregisterTable(tableIdentifier: Seq[String]): Unit = {
val tableIdent = processTableIdentifier(tableIdentifier)
tables -= getDbTableName(tableIdent)
}
override def unregisterAllTables(): Unit = {
tables.clear()
}
override def tableExists(tableIdentifier: Seq[String]): Boolean = {
val tableIdent = processTableIdentifier(tableIdentifier)
tables.get(getDbTableName(tableIdent)) match {
case Some(_) => true
case None => false
}
}
......
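From the user's side, this machinery is driven by temp-table registration. A hedged sketch of how a registration reaches the Catalog (simplified from the Spark 1.3-era sources; the path and table name are illustrative):
val df = sqlContext.jsonFile("people.json")
df.registerTempTable("people")
//internally this boils down to roughly:
//  catalog.registerTable(Seq("people"), df.logicalPlan)
//so that a later query can resolve the name via lookupRelation:
sqlContext.sql("SELECT name FROM people").show()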
As mentioned above, the entry point is the call to the sql function, at which point the parsers are invoked to process the SQL statement. The classes involved in parsing at this stage are:
- DDLParser: the temporary-table parser, used mainly to parse DDL statements that create temporary tables.
- SqlParser: the SQL statement parser, handling select, insert, and similar statements.
- SparkSQLParser: a proxy in front of SqlParser that parses keywords such as AS, CACHE, and SET.
Their class inheritance diagram looks like this:
Returning to the sql call on SQLContext, its source is:
def sql(sqlText: String): DataFrame = {
if (conf.dialect == "sql") {
DataFrame(this, parseSql(sqlText))
} else {
sys.error(s"Unsupported SQL dialect: ${conf.dialect}")
}
}
This in turn calls the parseSql function:
protected[sql] def parseSql(sql: String): LogicalPlan = {
ddlParser(sql, false).getOrElse(sqlParser(sql))
}
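The Option-based fallback is worth a note: DDLParser.apply returns Some(plan) only when the statement parses as DDL, so getOrElse hands everything else to sqlParser. A tiny standalone sketch of the same pattern, with hypothetical stand-ins for the two parsers:
//hypothetical stand-ins for ddlParser and sqlParser
def tryDdl(sql: String): Option[String] =
  if (sql.trim.toUpperCase.startsWith("CREATE")) Some("ddl-plan") else None

def parse(sql: String): String = tryDdl(sql).getOrElse("query-plan")
//parse("CREATE TEMPORARY TABLE t ...") => "ddl-plan"
//parse("SELECT * FROM t")              => "query-plan"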
DDLParser has no constructor matching this call; from Scala's syntax we know it dispatches to the apply method, whose source is:
/**
* A parser for foreign DDL commands.
*/
private[sql] class DDLParser(
parseQuery: String => LogicalPlan)
extends AbstractSparkSQLParser with DataTypeParser with Logging {
def apply(input: String, exceptionOnError: Boolean): Option[LogicalPlan] = {
try {
//delegate to the parent class's single-argument apply
Some(apply(input))
} catch {
case ddlException: DDLException => throw ddlException
case _ if !exceptionOnError => None
case x: Throwable => throw x
}
}
......
As the inheritance diagram above shows, DDLParser extends AbstractSparkSQLParser, which also defines an apply method, so the call proceeds to the parent's version:
private[sql] abstract class AbstractSparkSQLParser
extends StandardTokenParsers with PackratParsers {
def apply(input: String): LogicalPlan = {
// Initialize the Keywords.
lexical.initialize(reservedWords)
//phrase(start) is a curried call: if input matches the start production, Success is returned
phrase(start)(new lexical.Scanner(input)) match {
case Success(plan, _) => plan
case failureOrError => sys.error(failureOrError.toString)
}
}
.......
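The phrase(start)(new lexical.Scanner(input)) idiom comes straight from Scala's parser-combinator library and can be reproduced in a few lines. A minimal sketch outside Spark, assuming scala-parser-combinators is on the classpath (all names here are made up for the demo):
import scala.util.parsing.combinator.syntactical.StandardTokenParsers

object TinyParser extends StandardTokenParsers {
  //register the keywords with the lexer, as lexical.initialize(reservedWords) does above
  lexical.reserved += ("SHOW", "TABLES")
  def start: Parser[String] = "SHOW" ~ "TABLES" ^^^ "show-tables-command"
  def parse(input: String): String =
    phrase(start)(new lexical.Scanner(input)) match {
      case Success(result, _) => result
      case failureOrError     => sys.error(failureOrError.toString)
    }
}
//TinyParser.parse("SHOW TABLES") => "show-tables-command"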
The start production is a fairly involved combinator expression; interested readers can consult the parser-combinator documentation. One excerpt of its source looks like this:
protected lazy val start: Parser[LogicalPlan] =
( (select | ("(" ~> select <~ ")")) *
( UNION ~ ALL ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Union(q1, q2) }
| INTERSECT ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Intersect(q1, q2) }
| EXCEPT ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Except(q1, q2)}
| UNION ~ DISTINCT.? ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Distinct(Union(q1, q2)) }
)
| insert
)
DDLParser
This component is mainly used to create temporary tables. Internally it defines a set of Keyword values, which are collected via reflection and stored in the lexer's HashSet of reserved words:
// Keyword is a convention with AbstractSparkSQLParser, which will scan all of the `Keyword`
// properties via reflection the class in runtime for constructing the SqlLexical object
protected val CREATE = Keyword("CREATE")
protected val TEMPORARY = Keyword("TEMPORARY")
protected val TABLE = Keyword("TABLE")
protected val IF = Keyword("IF")
protected val NOT = Keyword("NOT")
protected val EXISTS = Keyword("EXISTS")
protected val USING = Keyword("USING")
protected val OPTIONS = Keyword("OPTIONS")
protected val DESCRIBE = Keyword("DESCRIBE")
protected val EXTENDED = Keyword("EXTENDED")
protected val AS = Keyword("AS")
protected val COMMENT = Keyword("COMMENT")
protected val REFRESH = Keyword("REFRESH")
//create a temporary table, describe a table, or refresh a table
protected lazy val ddl: Parser[LogicalPlan] = createTable | describeTable | refreshTable
protected def start: Parser[LogicalPlan] = ddl
As we can see, the ddl production is createTable, describeTable, or refreshTable. Here is a quick look at the createTable source:
protected lazy val createTable: Parser[LogicalPlan] =
// TODO: Support database.table.
//create temporary table using options
(CREATE ~> TEMPORARY.? <~ TABLE) ~ (IF ~> NOT <~ EXISTS).? ~ ident ~
tableCols.? ~ (USING ~> className) ~ (OPTIONS ~> options).? ~ (AS ~> restInput).? ^^ {
case temp ~ allowExisting ~ tableName ~ columns ~ provider ~ opts ~ query =>
if (temp.isDefined && allowExisting.isDefined) {
throw new DDLException(
"a CREATE TEMPORARY TABLE statement does not allow IF NOT EXISTS clause.")
}
.......
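For reference, here is a statement this production accepts, following the Spark 1.3 data sources API (the table name and path are illustrative):
sqlContext.sql("""
  CREATE TEMPORARY TABLE jsonTable
  USING org.apache.spark.sql.json
  OPTIONS (path 'examples/src/main/resources/people.json')
""")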
SqlParser
This component is mainly used to parse SQL statements. Its keyword declarations begin like this:
protected val ABS = Keyword("ABS")
protected val ALL = Keyword("ALL")
protected val AND = Keyword("AND")
protected val APPROXIMATE = Keyword("APPROXIMATE")
protected val AS = Keyword("AS")
protected val ASC = Keyword("ASC")
protected val AVG = Keyword("AVG")
protected val BETWEEN = Keyword("BETWEEN")
protected val BY = Keyword("BY")
protected val CASE = Keyword("CASE")
protected val CAST = Keyword("CAST")
protected val COALESCE = Keyword("COALESCE")
protected val COUNT = Keyword("COUNT")
protected val DESC = Keyword("DESC")
protected val DISTINCT = Keyword("DISTINCT")
protected val ELSE = Keyword("ELSE")
protected val END = Keyword("END")
protected val EXCEPT = Keyword("EXCEPT")
protected val FALSE = Keyword("FALSE")
protected val FIRST = Keyword("FIRST")
protected val FROM = Keyword("FROM")
protected val FULL = Keyword("FULL")
protected val GROUP = Keyword("GROUP")
protected val HAVING = Keyword("HAVING")
protected val IF = Keyword("IF")
protected val IN = Keyword("IN")
protected val INNER = Keyword("INNER")
protected val INSERT = Keyword("INSERT")
protected val INTERSECT = Keyword("INTERSECT")
protected val INTO = Keyword("INTO")
protected val IS = Keyword("IS")
protected val JOIN = Keyword("JOIN")
protected val LAST = Keyword("LAST")
protected val LEFT = Keyword("LEFT")
protected val LIKE = Keyword("LIKE")
protected val LIMIT = Keyword("LIMIT")
protected val LOWER = Keyword("LOWER")
protected val MAX = Keyword("MAX")
protected val MIN = Keyword("MIN")
protected val NOT = Keyword("NOT")
protected val NULL = Keyword("NULL")
protected val ON = Keyword("ON")
protected val OR = Keyword("OR")
protected val ORDER = Keyword("ORDER")
protected val SORT = Keyword("SORT")
protected val OUTER = Keyword("OUTER")
protected val OVERWRITE = Keyword("OVERWRITE")
protected val REGEXP = Keyword("REGEXP")
protected val RIGHT = Keyword("RIGHT")
protected val RLIKE = Keyword("RLIKE")
protected val SELECT = Keyword("SELECT")
protected val SEMI = Keyword("SEMI")
protected val SQRT = Keyword("SQRT")
protected val SUBSTR = Keyword("SUBSTR")
.......
Inside the start production of SqlParser a series of parsing rules is defined, in rather dense combinator syntax. Let's walk through a short excerpt:
protected lazy val start: Parser[LogicalPlan] =
( (select | ("(" ~> select <~ ")")) *
// ~ sequences two parsers: in A ~ B, A must match to the left of B
//UNION ALL is rewritten into a Union node; the ^^^ operator replaces the matched tokens with the given value
( UNION ~ ALL ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Union(q1, q2) }
//INTERSECT is rewritten into an Intersect node
| INTERSECT ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Intersect(q1, q2) }
| EXCEPT ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Except(q1, q2)}
| UNION ~ DISTINCT.? ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Distinct(Union(q1, q2)) }
)
| insert
)
protected lazy val insert: Parser[LogicalPlan] =
//~> keeps only the right-hand side: in A ~> B, only B's result is kept
INSERT ~> (OVERWRITE ^^^ true | INTO ^^^ false) ~ (TABLE ~> relation) ~ select ^^ {
case o ~ r ~ s => InsertIntoTable(r, Map.empty[String, Option[String]], s, o)
}
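If these operators look opaque, the following self-contained sketch reproduces them with Scala's parser combinators, including the p * sep repetition that start uses to chain select blocks (all names are made up for the demo):
import scala.util.parsing.combinator.JavaTokenParsers

object MiniParser extends JavaTokenParsers {
  //^^^ replaces the matched text with a constant value, here a combining function
  def op: Parser[(Int, Int) => Int] =
    ( "plus"  ^^^ { (a: Int, b: Int) => a + b }
    | "times" ^^^ { (a: Int, b: Int) => a * b }
    )
  //^^ transforms the parse result with a function
  def num: Parser[Int] = wholeNumber ^^ (_.toInt)
  //p * sep folds repeated p's left to right with the functions sep yields,
  //exactly how start chains select blocks with UNION/INTERSECT/EXCEPT
  def expr: Parser[Int] = num * op
}
//MiniParser.parseAll(MiniParser.expr, "2 plus 3 times 4") => Success(20, ...)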
SparkSQLParser
SparkSQLParser acts as a proxy in front of SqlParser, handling keywords such as AS and CACHE itself:
protected val AS = Keyword("AS")
protected val CACHE = Keyword("CACHE")
protected val CLEAR = Keyword("CLEAR")
protected val IN = Keyword("IN")
protected val LAZY = Keyword("LAZY")
protected val SET = Keyword("SET")
protected val SHOW = Keyword("SHOW")
protected val TABLE = Keyword("TABLE")
protected val TABLES = Keyword("TABLES")
protected val UNCACHE = Keyword("UNCACHE")
override protected lazy val start: Parser[LogicalPlan] = cache | uncache | set | show | others
As the source shows, its start production matches cache, set, show, and similar alternatives, whose expressions read much like those discussed above.
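In practice these productions correspond to statements like the following, which SparkSQLParser handles itself instead of delegating to SqlParser (the table name is illustrative):
sqlContext.sql("CACHE TABLE people")
sqlContext.sql("SET spark.sql.shuffle.partitions=10")
sqlContext.sql("SHOW TABLES")
sqlContext.sql("UNCACHE TABLE people")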
Once SqlParser has finished and the SQL has been parsed, the Analyzer binds the result against the Catalog data dictionary and then analyzes the execution plan. That part will be covered in detail in a follow-up article; stay tuned.