spark.version = 2.4.4
A bird's-eye view of the SparkSQL architecture.
SparkSQL
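SparkSQL is the Spark component for processing structured data. Structured data can come from an external structured data source, or it can be obtained by attaching a schema to an existing RDD.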
DataFrame API
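Put simply, SparkSQL is mainly responsible for SQL parsing: it turns a SQL statement into DataFrame or RDD jobs.
As the figure below shows, Spark's DataFrame API is itself built on top of Spark RDDs.
A DataFrame is a distributed dataset built on RDDs, similar to a two-dimensional table in a traditional database, and it supports relational operations with optimized execution. Unlike an RDD, a DataFrame carries schema metadata: every column of the table it represents has a name and a type. Because Spark cannot see the internal structure of the elements in an RDD, a job over RDDs can only receive simple, generic optimizations at the stage-scheduling level. A DataFrame exposes the structure of its data, so Spark can use that information for targeted optimizations and improve runtime efficiency. In short, SparkSQL can optimize task execution automatically.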
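The idea of attaching a schema to an existing RDD can be sketched as follows. This is a minimal illustration, assuming a local SparkSession with Spark on the classpath; the object name SchemaOnRddSketch, the column names, and the sample rows are made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper object, used only for this illustration.
object SchemaOnRddSketch {
  def build(spark: SparkSession): DataFrame = {
    // A plain RDD[Row]: Spark knows nothing about the structure of each element.
    val rowRdd = spark.sparkContext.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
    // The schema supplies a name and a type for every column.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = false)))
    spark.createDataFrame(rowRdd, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("schema-on-rdd").getOrCreate()
    val df = build(spark)
    df.printSchema()
    // With column names and types known, Catalyst can plan the filter and projection.
    df.filter("age > 26").select("name").explain(true)
    spark.stop()
  }
}
```

The extended explain output prints the parsed, analyzed, optimized, and physical plans, which is one way to observe the optimization that the schema makes possible.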
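Differences between DataFrame, Dataset, and RDD
An RDD is an immutable distributed collection of objects and the core data abstraction in Spark. Each RDD is split into multiple partitions; each partition is a slice of the dataset, and the partitions are processed on different nodes of the cluster. RDDs provide a highly restricted shared-memory model: an RDD is read-only, and a new RDD can only be created from a dataset in stable physical storage or by applying transformations to an existing RDD.
A DataFrame is the data abstraction of Spark SQL: conceptually a special RDD that holds Row objects. It is a structured dataset in which every record consists of several named fields (similar to a table in a traditional database), and it can use this structural information to store data more efficiently. SparkSQL also provides a convenient API for DataFrames, and a DataFrame can be registered as a table and queried with SQL. A DataFrame can be created from an external data source, from a query result, or from an existing RDD.
Dataset is an API available since Spark 1.6 and the newest data abstraction in Spark SQL. It combines the strengths of RDDs (strong typing and the ability to use lambda functions) with Spark SQL's optimized execution engine. A Dataset can be created from JVM objects and, like a DataFrame, can be manipulated through the API or with SQL.
Relationship between the three: RDD + Schema = DataFrame = Dataset[Row]
Note: RDD is the core of Spark, DataFrame/Dataset are the core of Spark SQL, and RDDs do not support SQL operations.
Next, a simple example and the relevant source code show how the three convert into one another.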
val df:DataFrame = spark.sql(sqlText)
df.printSchema()
val rdd = df.rdd.map(...)
import spark.implicits._
val a = rdd.toDF()
// spark.sql() first returns a DataFrame
def sql(sqlText: String): DataFrame = {
Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
}
// Converting a DataFrame to an RDD
lazy val rdd: RDD[T] = {
val objectType = exprEnc.deserializer.dataType
rddQueryExecution.toRdd.mapPartitions { rows =>
rows.map(_.get(0, objectType).asInstanceOf[T])
}
}
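// Converting an RDD to a DataFrame: rdd.toDF() first wraps the RDD in a Dataset (via spark.implicits._) and then calls Dataset.toDF()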
def toDF(): DataFrame = new Dataset[Row](sparkSession, queryExecution, RowEncoder(schema))
package object sql {
/**
* Converts a logical plan into zero or more SparkPlans. This API is exposed for experimenting
* with the query planner and is not designed to be stable across spark releases. Developers
* writing libraries should instead consider using the stable APIs provided in
* [[org.apache.spark.sql.sources]]
*/
@DeveloperApi
@InterfaceStability.Unstable
type Strategy = SparkStrategy
type DataFrame = Dataset[Row]
}
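The conversions above can be tried end to end with a small program. The following is a sketch under assumptions: Spark is on the classpath, a local master is used, and the Person case class and ConversionSketch object are hypothetical names introduced only for this example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type; it must be top-level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object ConversionSketch {
  def roundTrip(spark: SparkSession): (DataFrame, Dataset[Person], RDD[Row]) = {
    import spark.implicits._ // brings toDF and the Person encoder into scope
    val rdd: RDD[Person] =
      spark.sparkContext.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
    val df: DataFrame = rdd.toDF()          // RDD -> DataFrame, schema derived from the case class
    val ds: Dataset[Person] = df.as[Person] // DataFrame -> strongly typed Dataset
    val rows: RDD[Row] = df.rdd             // DataFrame -> RDD[Row]
    (df, ds, rows)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("conversions").getOrCreate()
    val (df, ds, rows) = roundTrip(spark)
    df.printSchema()
    ds.show()
    println(rows.count())
    spark.stop()
  }
}
```

Because DataFrame is only a type alias for Dataset[Row], the step from df to ds changes the element type and its encoder, not the underlying logical plan.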
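SparkSQL components
SparkSQL consists of four parts: Core, Catalyst, Hive, and Hive-ThriftServer.
Specifically:
Catalyst: the optimizer in SparkSQL. It handles the entire processing of a query, including parsing, binding, optimization, and physical planning;
Hive: Hive compatibility, supporting the processing of Hive data;
Core: handles data input/output, reading data from different sources (JSON, RDD, Parquet, CSV) and returning query results as DataFrames;
Hive-ThriftServer: provides the CLI and JDBC/ODBC interfaces.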
SparkSQL Catalyst Optimizer
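SparkSQL includes an extensible optimizer, Catalyst, designed around Scala's functional programming constructs. It is one of the core pieces of SparkSQL.
Catalyst supports both rule-based and cost-based optimization.
On top of this foundation, it builds libraries specific to relational query processing and several sets of rules that handle the different phases of query execution: analysis, logical optimization, physical planning, and so on.
At its core, Catalyst contains a general library for representing trees and applying rules to manipulate them.
Before explaining how SparkSQL runs a query, the following sections briefly introduce the two key concepts, Tree and Rule, and outline the overall Catalyst workflow.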
Tree
The main data type in Catalyst is a tree composed of node objects; it is the data structure Catalyst uses to represent execution plans.
TreeNode
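New node types are defined in Scala as subclasses of the TreeNode class,
and the concrete tree operations are implemented by TreeNode. LogicalPlans, Expressions, and Physical Operators can all be represented as trees.
A tree offers some Scala-collection-style operations as well as tree traversal.
SparkSQL builds a tree from the parsed query. The tree is kept in memory for its whole lifetime and is never persisted to disk as a file in any format. Whether it is the logical plan produced by the Analyzer or the one produced by the Optimizer, the tree is modified by replacing existing nodes: the node objects are immutable, but they can be manipulated with functional transformations.
Every node has a node type and zero or more children. TreeNode declares a children: Seq[BaseType] method that returns the node's child nodes.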
abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
// scalastyle:on
self: BaseType =>
val origin: Origin = CurrentOrigin.get
/**
* Returns a Seq of the children of this node.
* Children should not change. Immutability required for containsChild optimization
*/
def children: Seq[BaseType]
lazy val containsChild: Set[TreeNode[_]] = children.toSet
......
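Tree traversal relies on the parent-child relationships between nodes: transformDown and transformUp apply a Rule to a given segment of the tree and rewrite the nodes that match. transform delegates to transformDown (pre-order traversal) by default.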
/**
* Returns a copy of this node where `rule` has been recursively applied to the tree.
* When `rule` does not apply to a given node it is left unchanged.
* Users should not expect a specific directionality. If a specific directionality is needed,
* transformDown or transformUp should be used.
*
* @param rule the function use to transform this nodes children
*/
def transform(rule: PartialFunction[BaseType, BaseType]): BaseType = {
transformDown(rule)
}
/**
* Returns a copy of this node where `rule` has been recursively applied to it and all of its
* children (pre-order). When `rule` does not apply to a given node it is left unchanged.
*
* @param rule the function used to transform this nodes children
*/
def transformDown(rule: PartialFunction[BaseType, BaseType]): BaseType = {
val afterRule = CurrentOrigin.withOrigin(origin) {
rule.applyOrElse(this, identity[BaseType])
}
// Check if unchanged and then possibly return old copy to avoid gc churn.
if (this fastEquals afterRule) {
mapChildren(_.transformDown(rule))
} else {
afterRule.mapChildren(_.transformDown(rule))
}
}
/**
* Returns a copy of this node where `rule` has been recursively applied first to all of its
* children and then itself (post-order). When `rule` does not apply to a given node, it is left
* unchanged.
*
* @param rule the function use to transform this nodes children
*/
def transformUp(rule: PartialFunction[BaseType, BaseType]): BaseType = {
val afterRuleOnChildren = mapChildren(_.transformUp(rule))
if (this fastEquals afterRuleOnChildren) {
CurrentOrigin.withOrigin(origin) {
rule.applyOrElse(this, identity[BaseType])
}
} else {
CurrentOrigin.withOrigin(origin) {
rule.applyOrElse(afterRuleOnChildren, identity[BaseType])
}
}
}
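The difference between the two traversal orders is easier to see on a toy tree. The sketch below is not Catalyst's TreeNode; Expr, Lit, Attr, Add, and withChildren are hypothetical names that mimic just enough of the structure to show how a partial-function rule rewrites an immutable tree.

```scala
// Toy immutable expression tree (an illustration, not Catalyst's TreeNode).
sealed trait Expr {
  def children: Seq[Expr]
  // Rebuild this node with new children; Catalyst does the equivalent via makeCopy.
  def withChildren(newChildren: Seq[Expr]): Expr

  // Pre-order: apply the rule to this node first, then recurse into the result's children.
  def transformDown(rule: PartialFunction[Expr, Expr]): Expr = {
    val afterRule = rule.applyOrElse(this, identity[Expr])
    afterRule.withChildren(afterRule.children.map(_.transformDown(rule)))
  }

  // Post-order: rewrite the children first, then apply the rule to the rebuilt node.
  def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
    val afterChildren = withChildren(children.map(_.transformUp(rule)))
    rule.applyOrElse(afterChildren, identity[Expr])
  }
}

case class Lit(value: Int) extends Expr {
  def children: Seq[Expr] = Nil
  def withChildren(newChildren: Seq[Expr]): Expr = this
}

case class Attr(name: String) extends Expr {
  def children: Seq[Expr] = Nil
  def withChildren(newChildren: Seq[Expr]): Expr = this
}

case class Add(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
  def withChildren(newChildren: Seq[Expr]): Expr = Add(newChildren(0), newChildren(1))
}

object TreeSketch {
  // Rule: fold the addition of two literals into a single literal.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Lit(a), Lit(b)) => Lit(a + b)
  }

  // x + (1 + (2 + 3))
  val tree: Expr = Add(Attr("x"), Add(Lit(1), Add(Lit(2), Lit(3))))

  val bottomUp: Expr = tree.transformUp(foldConstants)  // Add(Attr(x), Lit(6))
  val topDown: Expr = tree.transformDown(foldConstants) // Add(Attr(x), Add(Lit(1), Lit(5)))
}
```

A single bottom-up pass folds the whole constant subtree, while a single top-down pass leaves one addition unfolded because the outer node is visited before its child has been simplified. This is one reason rules are often run repeatedly until the tree stops changing.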
Next, look at the inheritance hierarchy of the TreeNode class, shown below:
TreeNode has two important subclass hierarchies: QueryPlan and Expression.
QueryPlan
abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanType] {
self: PlanType =>
/**
* The active config object within the current scope.
* See [[SQLConf.get]] for more information.
*/
def conf: SQLConf = SQLConf.get
def output: Seq[Attribute]
/**
* Returns the set of attributes that are output by this node.
*/
def outputSet: AttributeSet = AttributeSet(output)
/**
* All Attributes that appear in expressions from this operator. Note that this set does not
* include attributes that are implicitly referenced by being passed through to the output tuple.
*/
def references: AttributeSet = AttributeSet(expressions.flatMap(_.references))
/**
* The set of all attributes that are input to this operator by its children.
*/
def inputSet: AttributeSet =
AttributeSet(children.flatMap(_.asInstanceOf[QueryPlan[PlanType]].output))
/**
* The set of all attributes that are produced by this node.
*/
def producedAttributes: AttributeSet = AttributeSet.empty
/**
* Attributes that are referenced by expressions but not provided by this node's children.
* Subclasses should override this method if they produce attributes internally as it is used by
* assertions designed to prevent the construction of invalid plans.
*/
def missingInput: AttributeSet = references -- inputSet -- producedAttributes
/**
* Runs [[transformExpressionsDown]] with `rule` on all expressions present
* in this query operator.
* Users should not expect a specific directionality. If a specific directionality is needed,
* transformExpressionsDown or transformExpressionsUp should be used.
*
* @param rule the rule to be applied to every expression in this operator.
*/
def transformExpressions(rule: PartialFunction[Expression, Expression]): this.type = {
transformExpressionsDown(rule)
}
/**
* Runs [[transformDown]] with `rule` on all expressions present in this query operator.
*
* @param rule the rule to be applied to every expression in this operator.
*/
def transformExpressionsDown(rule: PartialFunction[Expression, Expression]): this.type = {
mapExpressions(_.transformDown(rule))
}
/**
* Runs [[transformUp]] with `rule` on all expressions present in this query operator.
*
* @param rule the rule to be applied to every expression in this operator.
* @return
*/
def transformExpressionsUp(rule: PartialFunction[Expression, Expression]): this.type = {
mapExpressions(_.transformUp(rule))
}
/**
* Apply a map function to each expression present in this query operator, and return a new
* query operator based on the mapped expressions.
*/
def mapExpressions(f: Expression => Expression): this.type = {
var changed = false
@inline def transformExpression(e: Expression): Expression = {
val newE = CurrentOrigin.withOrigin(e.origin) {
f(e)
}
if (newE.fastEquals(e)) {
e
} else {
changed = true
newE
}
}
def recursiveTransform(arg: Any): AnyRef = arg match {
case e: Expression => transformExpression(e)
case Some(value) => Some(recursiveTransform(value))
case m: Map[_, _] => m
case d: DataType => d // Avoid unpacking Structs
case stream: Stream[_] => stream.map(recursiveTransform).force
case seq: Traversable[_] => seq.map(recursiveTransform)
case other: AnyRef => other
case null => null
}
val newArgs = mapProductIterator(recursiveTransform)
if (changed) makeCopy(newArgs).asInstanceOf[this.type] else this
}
/**
* Returns the result of running [[transformExpressions]] on this node
* and all its children.
*/
def transformAllExpressions(rule: PartialFunction[Expression, Expression]): this.type = {
transform {
case q: QueryPlan[_] => q.transformExpressions(rule).asInstanceOf[PlanType]
}.asInstanceOf[this.type]
}
/** Returns all of the expressions present in this query plan operator. */
final def expressions: Seq[Expression] = {
// Recursively find all expressions from a traversable.
def seqToExpressions(seq: Traversable[Any]): Traversable[Expression] = seq.flatMap {
case e: Expression => e :: Nil
case s: Traversable[_] => seqToExpressions(s)
case other => Nil
}
productIterator.flatMap {
case e: Expression => e :: Nil
case s: Some[_] => seqToExpressions(s.toSeq)
case seq: Traversable[_] => seqToExpressions(seq)
case other => Nil
}.toSeq
}
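...... // remaining code omitted
QueryPlan in turn has two important subclasses: LogicalPlan (logical execution plan) and SparkPlan (physical execution plan).
LogicalPlan (logical execution plan)
LogicalPlan provides the method resolve(nameParts: Seq[String], resolver: Resolver): Option[NamedExpression], which resolves a name to the corresponding NamedExpression during analysis.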
abstract class LogicalPlan
extends QueryPlan[LogicalPlan]
with AnalysisHelper
with LogicalPlanStats
with QueryPlanConstraints
with Logging {
/**
* Optionally resolves the given strings to a [[NamedExpression]] based on the output of this
* LogicalPlan. The attribute is expressed as string in the following form:
* `[scope].AttributeName.[nested].[fields]...`.
*/
def resolve(
nameParts: Seq[String],
resolver: Resolver): Option[NamedExpression] =
outputAttributes.resolve(nameParts, resolver)
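LogicalPlan also has many concrete subclasses, organized around three base node types, UnaryNode, BinaryNode, and LeafNode:
UnaryNode: a unary node, with exactly one child;
BinaryNode: a binary node, with a left and a right child;
LeafNode: a leaf node, with no children.
Different kinds of node support different operations. For example, UnaryNode covers operations such as limit and filter; BinaryNode covers operations such as join and union; LeafNode mainly covers user-command-style operations such as set and command.
SparkPlan (physical execution plan)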
abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable {
/**
* A handle to the SQL Context that was used to create this plan. Since many operators need
* access to the sqlContext for RDD operations or configuration this field is automatically
* populated by the query planning infrastructure.
*/
@transient final val sqlContext = SparkSession.getActiveSession.map(_.sqlContext).orNull
Expression
/**
* An expression in Catalyst.
*
* If an expression wants to be exposed in the function registry (so users can call it with
* "name(arguments...)", the concrete implementation must be a case class whose constructor
* arguments are all Expressions types. See [[Substring]] for an example.
*
* There are a few important traits:
*
* - [[Nondeterministic]]: an expression that is not deterministic.
* - [[Unevaluable]]: an expression that is not supposed to be evaluated.
* - [[CodegenFallback]]: an expression that does not have code gen implemented and falls back to
* interpreted mode.
*
* - [[LeafExpression]]: an expression that has no child.
* - [[UnaryExpression]]: an expression that has one child.
* - [[BinaryExpression]]: an expression that has two children.
* - [[TernaryExpression]]: an expression that has three children.
* - [[BinaryOperator]]: a special case of [[BinaryExpression]] that requires two children to have
* the same output data type.
*
*/
abstract class Expression extends TreeNode[Expression] {
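Expression is the expression hierarchy: nodes that can be computed or processed directly without the execution engine, including Cast, Projection, arithmetic operations, and logical operators.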
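To illustrate that an expression can be evaluated directly, the sketch below builds a small Catalyst expression tree by hand and calls eval on it, without a SparkSession or any job. It assumes the spark-catalyst module is on the classpath; these classes are internal API whose constructor signatures can differ between Spark versions, and the object name ExpressionEvalSketch is made up for the example.

```scala
import org.apache.spark.sql.catalyst.expressions.{Add, Expression, Literal, Multiply}

// Hypothetical wrapper object, used only for this illustration.
object ExpressionEvalSketch {
  // (1 + 2) * 3, built directly as a Catalyst expression tree.
  val expr: Expression = Multiply(Add(Literal(1), Literal(2)), Literal(3))

  def main(args: Array[String]): Unit = {
    // The tree contains only literals, so no input row is required for evaluation.
    println(expr.eval())
    // Each node is itself a TreeNode, so the tree can be inspected like a plan.
    println(expr.treeString)
  }
}
```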
Rule
First, the definition of Rule:
abstract class Rule[TreeType <: TreeNode[_]] extends Logging {
/** Name for this rule, automatically inferred based on class name. */
val ruleName: String = {
val className = getClass.getName
if (className endsWith "$") className.dropRight(1) else className
}
def apply(plan: TreeType): TreeType
}
Rule is an abstract class; subclasses override the apply method to define their processing logic.
RuleExecutor
Rules are applied through RuleExecutor. Anything that needs to match rules against an execution-plan tree and rewrite its nodes extends the RuleExecutor abstract class.
abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging {
/**
* An execution strategy for rules that indicates the maximum number of executions. If the
* execution reaches fix point (i.e. converge) before maxIterations, it will stop.
*/
abstract class Strategy { def maxIterations: Int }
/** A strategy that only runs once. */
case object Once extends Strategy { val maxIterations = 1 }
/** A strategy that runs until fix point or maxIterations times, whichever comes first. */
case class FixedPoint(maxIterations: Int) extends Strategy
/** A batch of rules. */
protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)
/** Defines a sequence of rule batches, to be overridden by the implementation. */
protected def batches: Seq[Batch]
/**
* Defines a check function that checks for structural integrity of the plan after the execution
* of each rule. For example, we can check whether a plan is still resolved after each rule in
* `Optimizer`, so we can catch rules that return invalid plans. The check function returns
* `false` if the given plan doesn't pass the structural integrity check.
*/
protected def isPlanIntegral(plan: TreeType): Boolean = true
/**
* Executes the batches of rules defined by the subclass. The batches are executed serially
* using the defined execution strategy. Within each batch, rules are also executed serially.
*/
def execute(plan: TreeType): TreeType = {
var curPlan = plan
val queryExecutionMetrics = RuleExecutor.queryExecutionMeter
batches.foreach { batch =>
val batchStartPlan = curPlan
var iteration = 1
var lastPlan = curPlan
var continue = true
// Run until fix point (or the max number of iterations as specified in the strategy.
while (continue) {
curPlan = batch.rules.foldLeft(curPlan) {
case (plan, rule) =>
val startTime = System.nanoTime()
val result = rule(plan)
val runTime = System.nanoTime() - startTime
if (!result.fastEquals(plan)) {
queryExecutionMetrics.incNumEffectiveExecution(rule.ruleName)
queryExecutionMetrics.incTimeEffectiveExecutionBy(rule.ruleName, runTime)
logTrace(
s"""
|=== Applying Rule ${rule.ruleName} ===
|${sideBySide(plan.treeString, result.treeString).mkString("\n")}
""".stripMargin)
}
queryExecutionMetrics.incExecutionTimeBy(rule.ruleName, runTime)
queryExecutionMetrics.incNumExecution(rule.ruleName)
// Run the structural integrity checker against the plan after each rule.
if (!isPlanIntegral(result)) {
val message = s"After applying rule ${rule.ruleName} in batch ${batch.name}, " +
"the structural integrity of the plan is broken."
throw new TreeNodeException(result, message, null)
}
result
}
iteration += 1
if (iteration > batch.strategy.maxIterations) {
// Only log if this is a rule that is supposed to run more than once.
if (iteration != 2) {
val message = s"Max iterations (${iteration - 1}) reached for batch ${batch.name}"
if (Utils.isTesting) {
throw new TreeNodeException(curPlan, message, null)
} else {
logWarning(message)
}
}
continue = false
}
if (curPlan.fastEquals(lastPlan)) {
logTrace(
s"Fixed point reached for batch ${batch.name} after ${iteration - 1} iterations.")
continue = false
}
lastPlan = curPlan
}
if (!batchStartPlan.fastEquals(curPlan)) {
logDebug(
s"""
|=== Result of Batch ${batch.name} ===
|${sideBySide(batchStartPlan.treeString, curPlan.treeString).mkString("\n")}
""".stripMargin)
} else {
logTrace(s"Batch ${batch.name} has no effect.")
}
}
curPlan
}
}
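The RuleExecutor hierarchy also has two important concrete subclasses: Analyzer and Optimizer.
Both classes define Batches together with the Once and FixedPoint strategies. Each Batch represents a set of rules, which makes it easy to transform a tree in a modular way. Once and FixedPoint are the accompanying strategies: apply the batch to the tree once, or apply it iteratively. (When a tree is processed iteratively, iteration stops once FixedPoint's maxIterations is reached or the tree is unchanged between two consecutive passes.)
RuleExecutor has a batches: Seq[Batch] member that defines the executor's processing logic; the actual logic is implemented by the concrete Rule subclasses.
RuleExecutor's execute method walks the batches in order, and the rules inside each batch in order, iteratively applying them to the plan passed in.
In the Analyzer phase, the unresolved logical plan tree produced by the parser (SqlParser) is processed by a variety of rules. The Analyzer uses several batches that it defines itself, such as MultiInstanceRelations, Resolution, CheckAnalysis, and AnalysisOperators (the exact batch names vary across Spark versions).
Each Batch consists of different Rules, and each Rule has its own processing function. Note that different rules run a different number of times (Once or FixedPoint).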
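The interplay of Batch, Once, and FixedPoint can be sketched with a toy executor. This is a simplified model rather than Spark's RuleExecutor: the plan is just a list of operator names, and Plan, CombineFilters, and the sample plan are hypothetical names chosen for the illustration.

```scala
// Toy model of Batch / Once / FixedPoint (an illustration, not Spark's RuleExecutor).
object RuleExecutorSketch {
  type Plan = List[String]

  trait Rule {
    def name: String
    def apply(plan: Plan): Plan
  }

  sealed trait Strategy { def maxIterations: Int }
  case object Once extends Strategy { val maxIterations = 1 }
  final case class FixedPoint(maxIterations: Int) extends Strategy
  final case class Batch(name: String, strategy: Strategy, rules: Rule*)

  // Batches run serially; inside a batch, rules run serially and the whole batch is
  // repeated until the plan stops changing or maxIterations is reached.
  def execute(batches: Seq[Batch], plan: Plan): Plan =
    batches.foldLeft(plan) { (start, batch) =>
      var cur = start
      var iteration = 0
      var continue = true
      while (continue) {
        val next = batch.rules.foldLeft(cur)((p, rule) => rule(p))
        iteration += 1
        continue = next != cur && iteration < batch.strategy.maxIterations
        cur = next
      }
      cur
    }

  // Hypothetical rule: merge the first pair of adjacent "Filter" operators it finds.
  object CombineFilters extends Rule {
    val name = "CombineFilters"
    def apply(plan: Plan): Plan = {
      val i = plan.zip(plan.drop(1)).indexWhere { case (a, b) => a == "Filter" && b == "Filter" }
      if (i < 0) plan else plan.take(i) ++ plan.drop(i + 1)
    }
  }

  val plan: Plan = List("Project", "Filter", "Filter", "Filter", "Scan")

  // One application merges only one pair of filters.
  val once: Plan = execute(Seq(Batch("merge", Once, CombineFilters)), plan)
  // Iterating to a fixed point merges all of them.
  val fixed: Plan = execute(Seq(Batch("merge", FixedPoint(10), CombineFilters)), plan)
}
```

With Once the plan still contains two adjacent filters, whereas FixedPoint(10) stops on the third pass, when the rule no longer changes anything.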
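Catalyst workflow overview
1. The SQL statement is lexed and parsed into an unresolved logical plan (Unresolved LogicalPlan), which contains Unresolved Relation, Unresolved Function, and Unresolved Attribute nodes. Later steps apply different Rules to this logical plan.
2. The Analyzer uses Analysis Rules together with metadata (such as the SessionCatalog or the Hive Metastore) to fill in the attributes of the unresolved logical plan and turn it into a resolved logical plan. Concretely, a SimpleAnalyzer is instantiated first; it then iterates over the predefined batches and runs the rules in each batch through the execute method of its parent class RuleExecutor. Each rule processes the unresolved logical plan: some finish in a single pass, others need several iterations, which continue until the FixedPoint iteration limit is reached or the tree is unchanged between two consecutive passes.
3. The Optimizer uses Optimization Rules to apply optimizations such as merging, column pruning, and filter pushdown to the resolved logical plan, producing an optimized logical plan.
4. The Planner uses Planning Strategies to transform the optimized logical plan into executable physical plans. Based on past performance statistics, a cost model (CostModel) selects the best physical plan, which yields the executable physical plan tree, the SparkPlan.
5. Before the physical plan is actually executed, the preparations rules are applied, and finally SparkPlan's execute is called to compute the RDD.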
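The five steps above can be observed on a real query through the QueryExecution object attached to every Dataset. The sketch below assumes a local SparkSession; the people view, its rows, and the PlanStagesSketch object are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution

// Hypothetical helper object, used only for this illustration.
object PlanStagesSketch {
  def stages(spark: SparkSession): QueryExecution = {
    import spark.implicits._
    Seq(("alice", 30), ("bob", 25)).toDF("name", "age").createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 26").queryExecution
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("plan-stages").getOrCreate()
    val qe = stages(spark)
    println(qe.logical)       // step 1: unresolved logical plan from the parser
    println(qe.analyzed)      // step 2: resolved logical plan from the Analyzer
    println(qe.optimizedPlan) // step 3: optimized logical plan from the Optimizer
    println(qe.sparkPlan)     // step 4: physical plan selected by the planner
    println(qe.executedPlan)  // step 5: physical plan after the preparation rules
    println(qe.toRdd.count()) // executing the plan produces an RDD of internal rows
    spark.stop()
  }
}
```

Each of these members is computed lazily, so printing them does not run the query; only the final count triggers execution.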
References
Spark SQL: Relational Data Processing in Spark