Spark SQL (11): Source Code of the SQL Statement Execution Flow

Spark typically begins executing a SQL statement like this:

  val spark_sess = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.sql.shuffle.partitions", "600")
      .getOrCreate()
  val df = spark_sess.read.json("examples/src/main/resources/people.json")
  df.createOrReplaceTempView("people")
  val sqlDF = spark_sess.sql("select * from people")
  sqlDF.show()

Let's walk through what happens to a SQL statement inside sql().

In fact, SQLContext.sql() simply delegates to SparkSession.sql():

def sql(sqlText: String): DataFrame = sparkSession.sql(sqlText)

The sql() function:

It calls the parser's parsePlan() on the SQL text and wraps the result into a DataFrame.

  def sql(sqlText: String): DataFrame = {
    Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
  }

sessionState is a lazily initialized SessionState:

  lazy val sessionState: SessionState = {
    parentSessionState
      .map(_.clone(this))
      .getOrElse {
        val state = SparkSession.instantiateSessionState(
          SparkSession.sessionStateClassName(sparkContext.conf),
          self)
        initialSessionOptions.foreach { case (k, v) => state.conf.setConfString(k, v) }
        state
      }
  }
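As an aside, the parentSessionState branch above is what enables session isolation and cloning. Below is a minimal sketch (reusing the spark_sess from the opening example) showing that each SparkSession carries its own SessionState and therefore its own SQL configuration:

  // Hedged sketch: newSession() shares the SparkContext and SharedState with
  // spark_sess but builds a fresh SessionState, so SQL configs are isolated.
  val another = spark_sess.newSession()
  another.conf.set("spark.sql.shuffle.partitions", "100")
  // The original session is unaffected and still sees "600".
  println(spark_sess.conf.get("spark.sql.shuffle.partitions"))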

The SessionState class is defined in SessionState.scala under org.apache.spark.sql.internal.
Its doc comment describes it as "A class that holds all session-specific state in a given [[SparkSession]]".

private[sql] class SessionState(
    sharedState: SharedState,
    val conf: SQLConf,
    val experimentalMethods: ExperimentalMethods,
    val functionRegistry: FunctionRegistry,
    val udfRegistration: UDFRegistration,
    catalogBuilder: () => SessionCatalog,
    val sqlParser: ParserInterface,
    analyzerBuilder: () => Analyzer,
    optimizerBuilder: () => Optimizer,
    val planner: SparkPlanner,
    val streamingQueryManager: StreamingQueryManager,
    val listenerManager: ExecutionListenerManager,
    resourceLoaderBuilder: () => SessionResourceLoader,
    createQueryExecution: LogicalPlan => QueryExecution,
    createClone: (SparkSession, SessionState) => SessionState) {
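For orientation, here is a minimal sketch of a few of the session-scoped components this class exposes (sessionState is an unstable, mostly internal API; this is only for exploration):

  // Hedged sketch: peek at the session-scoped components held by SessionState.
  val state = spark_sess.sessionState
  println(state.sqlParser.getClass.getName)  // the ParserInterface used by sql()
  println(state.analyzer.getClass.getName)   // resolves unresolved logical plans
  println(state.optimizer.getClass.getName)  // the Catalyst optimizer
  println(state.catalog.getClass.getName)    // SessionCatalog: tables, views, functions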

SparkSession.sessionStateClassName:
The SessionState is built through a builder.
With the in-memory catalog it returns org.apache.spark.sql.internal.SessionStateBuilder;
with the hive catalog it returns org.apache.spark.sql.hive.HiveSessionStateBuilder.

  private def sessionStateClassName(conf: SparkConf): String = {
    // spark.sql.catalogImplementation has two modes, hive and in-memory; the default is in-memory
    conf.get(CATALOG_IMPLEMENTATION) match {
      case "hive" => HIVE_SESSION_STATE_BUILDER_CLASS_NAME
      case "in-memory" => classOf[SessionStateBuilder].getCanonicalName
    }
  }
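For illustration, here is a minimal sketch of how the hive branch gets selected: calling enableHiveSupport() on the builder sets spark.sql.catalogImplementation to hive, so sessionStateClassName resolves to HiveSessionStateBuilder (this assumes the Hive classes are on the classpath and that this is the first session created in the JVM):

  // Hedged sketch: enableHiveSupport() switches the catalog implementation
  // from the default "in-memory" to "hive".
  val hive_sess = SparkSession
      .builder()
      .appName("Spark SQL with Hive catalog")
      .enableHiveSupport()
      .getOrCreate()
  // Expected to print "hive" when Hive support is available.
  println(hive_sess.sparkContext.getConf.get("spark.sql.catalogImplementation"))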

sqlParser.parsePlan()

sqlParser is defined in SparkSqlParser.scala under org.apache.spark.sql.execution:
class SparkSqlParser(conf: SQLConf) extends AbstractSqlParser {

AbstractSqlParser is defined in the catalyst project, in ParseDriver.scala under org.apache.spark.sql.catalyst.parser:
abstract class AbstractSqlParser extends ParserInterface

AbstractSqlParser's parsePlan() takes the incoming sqlText and returns a logical plan (LogicalPlan).

It first builds an AST (abstract syntax tree) from the SQL text and then turns it into the logical plan:

  // SparkSqlParser.scala
  /** Creates LogicalPlan for a given SQL string. */
  override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
    astBuilder.visitSingleStatement(parser.singleStatement()) match {
      case plan: LogicalPlan => plan
      case _ =>
        val position = Origin(None, None)
        throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
    }
  }
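To make this concrete, here is a minimal sketch of calling the session's parser directly and printing the unresolved plan it produces (it reuses spark_sess and the people temp view from the opening example; sessionState is an unstable, mostly internal API):

  // Hedged sketch: parsePlan only does lexing/parsing; the result is an
  // unresolved LogicalPlan, not yet analyzed or optimized.
  val unresolved = spark_sess.sessionState.sqlParser
      .parsePlan("SELECT name FROM people WHERE age > 21")
  // Typically prints something like:
  //   'Project ['name]
  //   +- 'Filter ('age > 21)
  //      +- 'UnresolvedRelation `people`
  println(unresolved.treeString)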

Here the LogicalPlan comes from calling parse(sqlText) { ... }. parse() is a curried method: its first parameter list takes the SQL text (command: String) and its second takes a callback (toResult: SqlBaseParser => T) whose result it returns. SparkSqlParser overrides it as follows:

    protected override def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
      // substitutor is a variable substitutor that replaces variable references in the SQL text before parsing
      super.parse(substitutor.substitute(command))(toResult)
    }

This in turn calls AbstractSqlParser's parse():

In this method the ANTLR4 API is used to convert the SQL text into an AST, and then toResult(parser) is invoked; this toResult is exactly the callback passed in by parsePlan.
(Reference: https://www.cnblogs.com/johnny666888/p/12345142.html)

parser.singleStatement() drives SqlBaseParser to build the AST for a single statement; the result is handed to AstBuilder.visitSingleStatement(),
which wraps it into an unresolved LogicalPlan.

  protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
    logDebug(s"Parsing command: $command")
    // the lexer performs lexical analysis (tokenization)
    val lexer = new SqlBaseLexer(new UpperCaseCharStream(CharStreams.fromString(command)))
    lexer.removeErrorListeners()
    lexer.addErrorListener(ParseErrorListener)
    lexer.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced

    val tokenStream = new CommonTokenStream(lexer)
    // SqlBaseParser is generated from \org\apache\spark\sql\catalyst\parser\SqlBase.g4; SqlBase.g4 is a grammar file that
    //   defines ruleNames: "singleStatement", "singleDataType", "createTableHeader", "insertInto", "partitionSpecLocation", ...
    //       (ruleNames are the names of the generated parser-rule methods)
    //   and defines a VocabularyImpl: "SELECT", "FROM", "ADD", "AS", "ALL", "ANY", "DISTINCT", "WHERE", "GROUP", "BY", ...
    // It performs the syntactic analysis.
    val parser = new SqlBaseParser(tokenStream)
    parser.addParseListener(PostProcessor)
    parser.removeErrorListeners()
    parser.addErrorListener(ParseErrorListener)
    parser.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced

    try {
      try {
        // first, try parsing with potentially faster SLL mode
        parser.getInterpreter.setPredictionMode(PredictionMode.SLL)
        // invoke the caller's callback on the parser (for parsePlan this runs visitSingleStatement)
        toResult(parser)
      }
      catch {
        case e: ParseCancellationException =>
          // if we fail, parse with LL mode
          tokenStream.seek(0) // rewind input stream
          parser.reset()

          // Try Again.
          parser.getInterpreter.setPredictionMode(PredictionMode.LL)
          toResult(parser)
      }
    }
    catch {
      case e: ParseException if e.command.isDefined =>
        throw e
      case e: ParseException =>
        throw e.withCommand(command)
      case e: AnalysisException =>
        val position = Origin(e.line, e.startPosition)
        throw new ParseException(Option(command), e.message, position, position)
    }
  }
}

Although using ANTLR to parse the SQL into an AST may still look like a black box at this point, for Spark SQL the important thing is that the input to all subsequent stages has now been obtained.

The overall execution flow is shown in the figure below: starting from the provided input APIs (SQL, Dataset, DataFrame), the query goes through the unresolved logical plan, the analyzed logical plan, the optimized logical plan, and candidate physical plans; then, based on cost-based optimization, one physical plan is chosen and executed. From the unresolved logical plan onwards, the SQL query is represented as an abstract syntax tree (AST), so each subsequent step is an equivalence-preserving transformation of that tree.

[Figure: Spark SQL execution flow, from SQL/Dataset/DataFrame input through unresolved, analyzed, and optimized logical plans to the selected physical plan]
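These stages can be observed directly: Dataset.explain(true) prints the parsed, analyzed, optimized and physical plans of a query. A minimal sketch, reusing the sqlDF from the opening example:

  // Hedged sketch: explain(extended = true) prints the four stages from the
  // figure above for this query.
  sqlDF.explain(true)
  // == Parsed Logical Plan ==      (unresolved)
  // == Analyzed Logical Plan ==    (resolved against the catalog)
  // == Optimized Logical Plan ==   (after Catalyst optimizer rules)
  // == Physical Plan ==            (the selected SparkPlan)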

Dataset.ofRows():

When Dataset.ofRows() creates a DataFrame, it wraps the LogicalPlan in a QueryExecution, runs the analyzer and its checks over the plan,
and, if that succeeds, builds a Dataset[Row] from the analyzed plan's schema and returns it as the DataFrame.

As you can see, when the user gets back the DataFrame wrapping this logical plan, no data has actually been computed yet.
Only when an operation such as show() or select(...).show() is executed, or an action is run on an RDD derived from the DataFrame,
do the remaining steps of the SQL statement happen: optimization, physical plan selection, and finally execution of the physical plan, which produces the DataFrame's real data.

  // object Dataset, in Dataset.scala
  def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
    // returns a QueryExecution wrapping the logical plan
    val qe = sparkSession.sessionState.executePlan(logicalPlan)
    // assertAnalyzed forces the lazy analyzed val:
    //  lazy val analyzed: LogicalPlan = {
    //    SparkSession.setActiveSession(sparkSession)
    //    sparkSession.sessionState.analyzer.executeAndCheck(logical)
    //  }
    qe.assertAnalyzed()
    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
  }
  
  // definition of executePlan in class SessionState:
  def executePlan(plan: LogicalPlan): QueryExecution = createQueryExecution(plan)
  
  // in class BaseSessionStateBuilder:
  protected def createQueryExecution: LogicalPlan => QueryExecution = { plan =>
    new QueryExecution(session, plan)
  }
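A minimal sketch of this laziness (again with the sqlDF from the opening example; queryExecution is a developer API): the QueryExecution created by ofRows exposes the later stages as lazy vals, and nothing is actually executed until an action such as show() runs:

  // Hedged sketch: these plans are lazy vals on QueryExecution and are only
  // computed on first access; show() finally runs the physical plan.
  val qe = sqlDF.queryExecution
  println(qe.analyzed)       // resolved logical plan (already forced by assertAnalyzed)
  println(qe.optimizedPlan)  // after the Catalyst optimizer
  println(qe.sparkPlan)      // the selected physical plan
  sqlDF.show()               // triggers actual execution on the cluster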