Spark typically starts executing a SQL statement like this:
val spark_sess = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.sql.shuffle.partitions", "600")
  .getOrCreate()
val df = spark_sess.read.json("examples/src/main/resources/people.json")
df.createOrReplaceTempView("people")
val sqlDF = spark_sess.sql("select * from people")
sqlDF.show()
Let's trace what happens to a SQL statement inside sql().
Note that SQLContext.sql() simply delegates to SparkSession.sql():
def sql(sqlText: String): DataFrame = sparkSession.sql(sqlText)
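For comparison, both entry points produce the same kind of DataFrame; a tiny sketch reusing spark_sess and the people view from the opening example:

// Sketch: both calls funnel into the same SparkSession.sql() path.
val viaSession = spark_sess.sql("select * from people")
val viaContext = spark_sess.sqlContext.sql("select * from people")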
The sql() function parses the text into a plan and returns a DataFrame:
def sql(sqlText: String): DataFrame = {
  Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
}
sessionState is a lazily initialized SessionState:
lazy val sessionState: SessionState = {
  parentSessionState
    .map(_.clone(this))
    .getOrElse {
      val state = SparkSession.instantiateSessionState(
        SparkSession.sessionStateClassName(sparkContext.conf),
        self)
      initialSessionOptions.foreach { case (k, v) => state.conf.setConfString(k, v) }
      state
    }
}
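As a quick way to see that this state really is session-scoped, here is a minimal sketch (the config key is just the one from the opening example): newSession() shares the SparkContext and SharedState but gets its own SessionState, so SQL conf changes do not leak across sessions.

// Sketch: session-specific state is isolated between sessions.
val s2 = spark_sess.newSession() // same SparkContext, fresh SessionState
spark_sess.conf.set("spark.sql.shuffle.partitions", "100")
println(spark_sess.conf.get("spark.sql.shuffle.partitions")) // 100
println(s2.conf.get("spark.sql.shuffle.partitions"))         // still 600, from the Spark conf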
The SessionState class is defined in SessionState.scala under org.apache.spark.sql.internal.
Its doc comment describes it as "A class that holds all session-specific state in a given [[SparkSession]]":
private[sql] class SessionState(
    sharedState: SharedState,
    val conf: SQLConf,
    val experimentalMethods: ExperimentalMethods,
    val functionRegistry: FunctionRegistry,
    val udfRegistration: UDFRegistration,
    catalogBuilder: () => SessionCatalog,
    val sqlParser: ParserInterface,
    analyzerBuilder: () => Analyzer,
    optimizerBuilder: () => Optimizer,
    val planner: SparkPlanner,
    val streamingQueryManager: StreamingQueryManager,
    val listenerManager: ExecutionListenerManager,
    resourceLoaderBuilder: () => SessionResourceLoader,
    createQueryExecution: LogicalPlan => QueryExecution,
    createClone: (SparkSession, SessionState) => SessionState) {
SparkSession.sessionStateClassName chooses the builder class used to construct the SessionState:
for the in-memory catalog it returns org.apache.spark.sql.internal.SessionStateBuilder,
and for the hive catalog it returns org.apache.spark.sql.hive.HiveSessionStateBuilder:
private def sessionStateClassName(conf: SparkConf): String = {
  // spark.sql.catalogImplementation is either "hive" or "in-memory"; the default is "in-memory"
  conf.get(CATALOG_IMPLEMENTATION) match {
    case "hive" => HIVE_SESSION_STATE_BUILDER_CLASS_NAME
    case "in-memory" => classOf[SessionStateBuilder].getCanonicalName
  }
}
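Which branch you hit is decided when the session is built: enableHiveSupport() is essentially sugar for setting spark.sql.catalogImplementation to hive (a sketch; it needs the Hive classes on the classpath):

// Sketch: select the Hive catalog, so sessionStateClassName returns
// HiveSessionStateBuilder instead of SessionStateBuilder.
val hiveSpark = SparkSession
  .builder()
  .appName("hive-backed session")
  .enableHiveSupport() // sets spark.sql.catalogImplementation=hive
  .getOrCreate()
println(hiveSpark.conf.get("spark.sql.catalogImplementation")) // hive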
sqlParser.parsePlan():
sqlParser is defined in SparkSqlParser.scala under org.apache.spark.sql.execution:
class SparkSqlParser(conf: SQLConf) extends AbstractSqlParser {
AbstractSqlParser lives in the catalyst project, in ParseDriver.scala under org.apache.spark.sql.catalyst.parser:
abstract class AbstractSqlParser extends ParserInterface
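Besides parsePlan(), ParserInterface exposes finer-grained entry points, all implemented by AbstractSqlParser; a small sketch (assuming access to sessionState as in the excerpts above):

// Sketch: other ParserInterface entry points, same parse() machinery underneath.
val p = spark_sess.sessionState.sqlParser
val expr = p.parseExpression("age > 21 AND name = 'Andy'") // a catalyst Expression
val tbl = p.parseTableIdentifier("db.people")              // TableIdentifier(people, Some(db))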
AbstractSqlParser's parsePlan() takes the sqlText and returns a LogicalPlan. Internally, the SQL text is first parsed into an AST (abstract syntax tree), which is then converted into the logical plan:
// ParseDriver.scala (AbstractSqlParser)
/** Creates LogicalPlan for a given SQL string. */
override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
  astBuilder.visitSingleStatement(parser.singleStatement()) match {
    case plan: LogicalPlan => plan
    case _ =>
      val position = Origin(None, None)
      throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
  }
}
The call parse(sqlText) { parser => ... } shows that parse() is a curried method: its first parameter list takes the SQL text (command: String) and its second takes a callback (toResult: SqlBaseParser => T). SparkSqlParser overrides parse() to run variable substitution before delegating:
protected override def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
  // substitutor is a VariableSubstitution that expands variable references in the SQL text
  super.parse(substitutor.substitute(command))(toResult)
}
This calls AbstractSqlParser's parse(). That method uses the ANTLR4 API to turn the SQL text into an AST and then invokes toResult(parser), where toResult is exactly the callback passed in from parsePlan() (see https://www.cnblogs.com/johnny666888/p/12345142.html).
Inside that callback, parser.singleStatement() drives SqlBaseParser to build the AST, and the resulting parse tree is handed to AstBuilder.visitSingleStatement(), which wraps it into an unresolved LogicalPlan.
protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
  logDebug(s"Parsing command: $command")
  // lexical analysis: tokenize the SQL text
  val lexer = new SqlBaseLexer(new UpperCaseCharStream(CharStreams.fromString(command)))
  lexer.removeErrorListeners()
  lexer.addErrorListener(ParseErrorListener)
  lexer.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced
  val tokenStream = new CommonTokenStream(lexer)
  // SqlBaseParser is generated from org/apache/spark/sql/catalyst/parser/SqlBase.g4, the grammar file.
  // It defines ruleNames ("singleStatement", "singleDataType", "createTableHeader", "insertInto",
  // "partitionSpecLocation", ...), which are the names of the generated parse functions,
  // and a VocabularyImpl ("SELECT", "FROM", "ADD", "AS", "ALL", "ANY", "DISTINCT", "WHERE",
  // "GROUP", "BY", ...) used for syntactic analysis.
  val parser = new SqlBaseParser(tokenStream)
  parser.addParseListener(PostProcessor)
  parser.removeErrorListeners()
  parser.addErrorListener(ParseErrorListener)
  parser.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced
  try {
    try {
      // first, try parsing with potentially faster SLL mode
      parser.getInterpreter.setPredictionMode(PredictionMode.SLL)
      // hand the parser to the caller's callback, e.g. parsePlan's AST visitor
      toResult(parser)
    }
    catch {
      case e: ParseCancellationException =>
        // if we fail, parse with LL mode
        tokenStream.seek(0) // rewind input stream
        parser.reset()
        // Try Again.
        parser.getInterpreter.setPredictionMode(PredictionMode.LL)
        toResult(parser)
    }
  }
  catch {
    case e: ParseException if e.command.isDefined =>
      throw e
    case e: ParseException =>
      throw e.withCommand(command)
    case e: AnalysisException =>
      val position = Origin(e.line, e.startPosition)
      throw new ParseException(Option(command), e.message, position, position)
  }
}
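You can drive this stage by hand from the session and look at what comes out: parsePlan() alone yields only an unresolved LogicalPlan, with no analysis or execution (a sketch; the printed plan is abbreviated, and the leading quotes mark unresolved nodes):

// Sketch: call the parser directly and inspect the unresolved plan.
val plan = spark_sess.sessionState.sqlParser.parsePlan(
  "select name from people where age > 21")
println(plan)
// 'Project ['name]
// +- 'Filter ('age > 21)
//    +- 'UnresolvedRelation `people`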
Even though ANTLR's conversion of SQL into an AST is a black box at this point, Spark SQL now has the input for everything that follows.
The overall flow: starting from the input APIs (SQL, Dataset, DataFrame), the query passes through an unresolved logical plan, an analyzed logical plan, an optimized logical plan, and candidate physical plans; cost-based optimization then picks one physical plan to execute. From the unresolved logical plan onward, the query is represented as a tree (initially the AST), so every later stage is an equivalence-preserving transformation of that tree.
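The quickest way to watch all of these stages for a concrete query is explain(true), which prints the parsed, analyzed, and optimized logical plans followed by the physical plan:

// Prints == Parsed Logical Plan ==, == Analyzed Logical Plan ==,
// == Optimized Logical Plan ==, and == Physical Plan == sections.
sqlDF.explain(true)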
Dataset.ofRows():
When Dataset.ofRows() creates a DataFrame, it wraps the LogicalPlan in a QueryExecution and runs the analyzer to resolve and check the plan; if that succeeds, it builds a Dataset[Row] from the analyzed plan's schema and returns it as the DataFrame.
Note that the DataFrame the user gets back merely wraps the LogicalPlan and contains no materialized data. Only when an action such as show() or select(...).show() runs, or an Action is executed on an RDD derived from the DataFrame, does Spark carry out the remaining steps of the SQL statement: optimization, physical-plan selection, and finally execution of the physical plan to produce the actual results.
// object Dataset, in Dataset.scala
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
  // wrap the plan in a QueryExecution
  val qe = sparkSession.sessionState.executePlan(logicalPlan)
  // assertAnalyzed forces the lazy analyzed value:
  // lazy val analyzed: LogicalPlan = {
  //   SparkSession.setActiveSession(sparkSession)
  //   sparkSession.sessionState.analyzer.executeAndCheck(logical)
  // }
  qe.assertAnalyzed()
  new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
}
// executePlan as defined in SessionState:
def executePlan(plan: LogicalPlan): QueryExecution = createQueryExecution(plan)
// createQueryExecution as provided by BaseSessionStateBuilder:
protected def createQueryExecution: LogicalPlan => QueryExecution = { plan =>
  new QueryExecution(session, plan)
}