SparkSQL Internals Explained [Physical Plan]

I. Overview

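Physical planning is the process of mapping the logical operator tree produced by Spark SQL onto a physical operator tree, translating the information in the logical plan into the RDDs, Transformations, and Actions of the Spark Core model. Once the physical plan has been generated, a SQL statement has become an executable Spark job.

The base definition lives in org.apache.spark.sql.catalyst.plans.QueryPlan, which physical plans build on (SparkPlan extends QueryPlan[SparkPlan]). From the definition, a physical plan is a tree whose nodes consist mainly of child nodes, the expressions that appear in the node, and the subqueries that appear in it: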

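abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanType] {
	// Subtrees: the subqueries are exposed as inner children
	override protected def innerChildren: Seq[QueryPlan[_]] = subqueries
	// Expressions that appear in this node (body elided in this excerpt)
	final def expressions: Seq[Expression] = ???
	// Subqueries that appear in this node (body elided in this excerpt)
	def subqueries: Seq[PlanType] = ???
}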

A sample physical plan looks like this:

 == Physical Plan ==
   *(5) SortMergeJoin [x#3L], [y#9L], Inner
   :- *(2) Sort [x#3L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(x#3L, 200)
   :     +- *(1) Project [(id#0L % 2) AS x#3L]
   :        +- *(1) Filter isnotnull((id#0L % 2))
   :           +- *(1) Range (0, 5, step=1, splits=8)
   +- *(4) Sort [y#9L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(y#9L, 200)
         +- *(3) Project [(id#6L % 2) AS y#9L]
            +- *(3) Filter isnotnull((id#6L % 2))
               +- *(3) Range (0, 5, step=1, splits=8)

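The output above is the treeString representation of a physical plan. Tree nodes such as (5) SortMergeJoin and (3) Range are the physical operators of the query. The asterisk marks an operator that takes part in whole-stage code generation, and the number after it, such as (5) or (3), is the ID of its codegen stage. IDs are assigned in depth-first post-order over the plan, so operators that share a number (for example Project, Filter, and Range under (1)) are fused into the same stage, while Exchange carries no marker because it is not code-generated. The tree itself is printed by generateTreeString in org.apache.spark.sql.catalyst.trees.TreeNode. A simplified view of the numbering order: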

== Physical Plan ==
*(5) SortMergeJoin
   :- *(2) Sort 
   :     +- *(1) Project 
   :        +- *(1) Filter
   :           +- *(1) Range 
   +- *(4) Sort 
         +- *(3) Project 
            +- *(3) Filter
               +- *(3) Range
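
The indentation convention of the tree output can be reproduced on a toy tree. The following is a minimal sketch, not Spark's generateTreeString: the Node class and the render function are hypothetical names introduced for illustration. The last child of a node is introduced with "+- ", earlier siblings with ":- ", and an ancestor that still has siblings pending contributes a ":  " column.

```scala
object TreeStringSketch {
  // Hypothetical toy node: a label plus child nodes.
  final case class Node(label: String, children: List[Node] = Nil)

  // Render the tree depth-first, one node per line.
  def render(root: Node): String = {
    val sb = new StringBuilder

    // lastFlags records, from the root's child down to this node,
    // whether each node on the path is the last child of its parent.
    def go(n: Node, lastFlags: List[Boolean]): Unit = {
      if (lastFlags.nonEmpty) {
        lastFlags.init.foreach(isLast => sb.append(if (isLast) "   " else ":  "))
        sb.append(if (lastFlags.last) "+- " else ":- ")
      }
      sb.append(n.label).append('\n')
      n.children.zipWithIndex.foreach { case (child, i) =>
        go(child, lastFlags :+ (i == n.children.size - 1))
      }
    }

    go(root, Nil)
    sb.toString
  }
}
```

Rendering a join node with two Sort children, each over a Range leaf, yields the same ":- " and "+- " layout as the plan above.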

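The abstract class that converts a logical plan into a physical plan is QueryPlanner. It defines the framework of the conversion: first apply every strategy to obtain a series of candidate physical plans, then replace the placeholders left in each candidate, bottom-up, with the physical plans of the corresponding child logical plans, and finally prune the resulting plans.

QueryPlanner source code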

abstract class QueryPlanner[PhysicalPlan <: TreeNode[PhysicalPlan]] {
  def strategies: Seq[GenericStrategy[PhysicalPlan]]

  def plan(plan: LogicalPlan): Iterator[PhysicalPlan] = {
    val candidates = strategies.iterator.flatMap(_(plan))
    val plans = candidates.flatMap { candidate =>
      val placeholders = collectPlaceholders(candidate)

      if (placeholders.isEmpty) {
        Iterator(candidate)
      } else {
        placeholders.iterator.foldLeft(Iterator(candidate)) {
          case (candidatesWithPlaceholders, (placeholder, logicalPlan)) =>
            val childPlans = this.plan(logicalPlan)

            candidatesWithPlaceholders.flatMap { candidateWithPlaceholders =>
              childPlans.map { childPlan =>
                candidateWithPlaceholders.transformUp {
                  case p if p == placeholder => childPlan
                }
              }
            }
        }
      }
    }

    val pruned = prunePlans(plans)
    assert(pruned.hasNext, s"No plan for $plan")
    pruned
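  }

  // Collects the placeholders left in a candidate, with the logical plans they stand for.
  protected def collectPlaceholders(plan: PhysicalPlan): Seq[(PhysicalPlan, LogicalPlan)]

  // Prunes bad plans to prevent combinatorial explosion.
  protected def prunePlans(plans: Iterator[PhysicalPlan]): Iterator[PhysicalPlan]
}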

Note the planner's final step, prunePlans. As of Spark 2.4.4 this method is only a placeholder: it returns all candidate physical plans unchanged, without any pruning. See SparkPlanner in org.apache.spark.sql.execution:

override protected def prunePlans(plans: Iterator[SparkPlan]): Iterator[SparkPlan] = {
    // TODO: We will need to prune bad plans when we improve plan space exploration
    //       to prevent combinatorial explosion.
    plans
  }
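
The candidate-then-placeholder flow can be illustrated on a toy model. This is a minimal sketch under stated assumptions, not Spark's planner: all classes below (Scan, Filter, ScanExec, FilterExec, PlanLater) are hypothetical stand-ins, and for brevity each placeholder is replaced with the first plan found for its child instead of the full cross product that QueryPlanner builds.

```scala
object PlannerSketch {
  // Toy logical plan (hypothetical, not Spark's classes).
  sealed trait Logical
  final case class Scan(table: String) extends Logical
  final case class Filter(cond: String, child: Logical) extends Logical

  // Toy physical plan; PlanLater is the placeholder for a child planned later.
  sealed trait Physical
  final case class ScanExec(table: String) extends Physical
  final case class FilterExec(cond: String, child: Physical) extends Physical
  final case class PlanLater(logical: Logical) extends Physical

  // A strategy returns zero or more candidates for a logical node.
  type Strategy = Logical => Seq[Physical]

  val scanStrategy: Strategy = {
    case Scan(t) => Seq(ScanExec(t))
    case _       => Nil
  }
  val filterStrategy: Strategy = {
    case Filter(c, child) => Seq(FilterExec(c, PlanLater(child)))
    case _                => Nil
  }
  val strategies: Seq[Strategy] = Seq(scanStrategy, filterStrategy)

  // Replace each placeholder with the first plan of its logical child.
  def resolve(p: Physical): Physical = p match {
    case PlanLater(l)         => plan(l).next()
    case FilterExec(c, child) => FilterExec(c, resolve(child))
    case other                => other
  }

  // Apply all strategies, then fill in the placeholders of each candidate.
  def plan(logical: Logical): Iterator[Physical] = {
    val candidates = strategies.iterator.flatMap(s => s(logical))
    val plans = candidates.map(resolve)
    require(plans.hasNext, s"No plan for $logical")
    plans
  }
}
```

Planning Filter("x > 1", Scan("t")) first yields FilterExec("x > 1", PlanLater(Scan("t"))), and resolving the placeholder gives FilterExec("x > 1", ScanExec("t")).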

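II. Physical Plan Generation Strategies

SparkPlanner and SparkStrategy are the core concepts here; physical plans are generated by applying a combination of strategies. Every concrete SparkStrategy implements an apply method that converts the incoming LogicalPlan into a Seq[SparkPlan]. Looking at the subclasses of SparkStrategy, the batch strategies cover basic operators, aggregation, window functions, joins, in-memory scans, and special limits; the streaming strategies cover stateful aggregation, streaming deduplication, streaming joins, and so on.

Batch processing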

BasicOperators
Aggregation
Window
JoinSelection
InMemoryScans
SpecialLimits

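Each of these strategies overrides apply with its own predefined matching logic. The logical plan is passed to apply, which returns a Seq[SparkPlan], that is, the candidate physical plans.

The following uses JoinSelection as an example of what a physical planning strategy does: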

def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {

      // --- BroadcastHashJoin --------------------------------------------------------------------

      // If a broadcast hint is given, broadcast as hinted and run a BroadcastHashJoin
      case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
        if canBroadcastByHints(joinType, left, right) =>
        val buildSide = broadcastSideByHints(joinType, left, right)
        Seq(joins.BroadcastHashJoinExec(
          leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))

      // Without a broadcast hint, try ShuffledHashJoin, SortMergeJoin, etc.,
      // based on configuration and data size (cases elided in this excerpt)
      // --- ShuffledHashJoin ---------------------------------------------------------------------
      // --- SortMergeJoin ------------------------------------------------------------
      // --- Without joining keys ------------------------------------------------------------

      // Worst case: a brute-force nested loop join, which is very slow
      case logical.Join(left, right, joinType, condition) =>
        // (the choice of buildSide is elided in this excerpt)
        joins.BroadcastNestedLoopJoinExec(
          planLater(left), planLater(right), buildSide, joinType, condition) :: Nil

      // --- Cases where this strategy does not apply ---------------------------------------------
      case _ => Nil
    }

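As this shows, a concrete planning strategy such as JoinSelection encapsulates the information its decision depends on, for example whether one side can be broadcast and how the data sizes of the left and right sides compare. With this information, when a logical plan is passed to JoinSelection's apply method, the optimization rules predefined by JoinSelection take effect as the physical plan is generated.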
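
The decision order can be sketched as a plain function. This is an illustrative simplification, not JoinSelection itself: Relation, the JoinExec cases, and select are hypothetical names, ShuffledHashJoin is left out, and the 10 MB default threshold is an assumption standing in for the configured broadcast threshold.

```scala
object JoinSelectionSketch {
  // Hypothetical toy input; Spark derives similar facts from hints and plan statistics.
  final case class Relation(name: String, sizeInBytes: Long, broadcastHint: Boolean = false)

  sealed trait JoinExec
  final case class BroadcastHashJoin(buildSide: String) extends JoinExec
  case object SortMergeJoin extends JoinExec
  case object BroadcastNestedLoopJoin extends JoinExec

  // Pick a join implementation: hints first, then size, then the general fallback.
  def select(
      left: Relation,
      right: Relation,
      hasEquiKeys: Boolean,
      broadcastThreshold: Long = 10L * 1024 * 1024): JoinExec = {
    val hinted = Seq(left, right).filter(_.broadcastHint)
    val small = Seq(left, right).filter(_.sizeInBytes <= broadcastThreshold)
    if (!hasEquiKeys) BroadcastNestedLoopJoin // no join keys: nested loop
    else if (hinted.nonEmpty) BroadcastHashJoin(hinted.minBy(_.sizeInBytes).name)
    else if (small.nonEmpty) BroadcastHashJoin(small.minBy(_.sizeInBytes).name)
    else SortMergeJoin
  }
}
```

With two 1 GiB relations and an equi-join key, the sketch falls through to SortMergeJoin; shrinking one side to 1 KiB, or hinting it, switches the result to BroadcastHashJoin with that side as the build side.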

Streaming

StatefulAggregationStrategy
StreamingDeduplicationStrategy
StreamingJoinStrategy
StreamingRelationStrategy
FlatMapGroupsWithStateStrategy

There are currently four patterns, the extractor objects that strategies match logical plans against:

PhysicalOperation
ExtractEquiJoinKeys
ExtractFiltersAndInnerJoins
PhysicalAggregation
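
A pattern of this kind is an ordinary Scala extractor: an object with an unapply method that a strategy can use in a case clause, as JoinSelection does with ExtractEquiJoinKeys. The following is a minimal sketch on a toy expression model; Expr, Join, and EquiJoinKeys are hypothetical names, and the logic is far simpler than Spark's.

```scala
object ExtractorSketch {
  // Toy expressions (hypothetical, not Catalyst's).
  sealed trait Expr
  final case class Col(name: String) extends Expr
  final case class EqualTo(l: Expr, r: Expr) extends Expr
  final case class GreaterThan(l: Expr, r: Expr) extends Expr
  final case class And(l: Expr, r: Expr) extends Expr

  // Toy join: the column names of each side plus a join condition.
  final case class Join(leftCols: Set[String], rightCols: Set[String], condition: Expr)

  // Split a conjunction into its individual predicates.
  def conjuncts(e: Expr): List[Expr] = e match {
    case And(l, r) => conjuncts(l) ++ conjuncts(r)
    case other     => List(other)
  }

  object EquiJoinKeys {
    // Yields (leftKeys, rightKeys, otherPredicates) when at least one equi-key exists.
    def unapply(j: Join): Option[(List[String], List[String], List[Expr])] = {
      val (keys, rest) = conjuncts(j.condition).partition {
        case EqualTo(Col(a), Col(b)) =>
          (j.leftCols(a) && j.rightCols(b)) || (j.leftCols(b) && j.rightCols(a))
        case _ => false
      }
      // Normalize each key pair so the left-side column comes first.
      val pairs = keys.collect {
        case EqualTo(Col(a), Col(b)) => if (j.leftCols(a)) (a, b) else (b, a)
      }
      if (pairs.isEmpty) None else Some((pairs.map(_._1), pairs.map(_._2), rest))
    }
  }

  // A strategy-like caller that pattern matches with the extractor.
  def describe(j: Join): String = j match {
    case EquiJoinKeys(l, r, _) =>
      val leftKeys = l.mkString(",")
      val rightKeys = r.mkString(",")
      s"equi-join on $leftKeys = $rightKeys"
    case _ => "no equi-join keys"
  }
}
```

Keeping the matching logic in an extractor lets several strategies share it while each case clause stays short.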

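III. Partitioning and Ordering

1. Partitioning

Data in a distributed system is partitioned. How Spark, as a general-purpose compute engine, models data partitioning and uses partitioning information to make computation more efficient is worth studying. Let us look at the design of Spark's partitioning abstractions.

Partitioning and Distribution are two closely related abstractions. Partitioning describes how the output of an operator is split. It has two main properties: the number of partitions, and whether it satisfies a given distribution.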

trait Partitioning {
  val numPartitions: Int

  def satisfies(required: Distribution): Boolean = required match {
    case UnspecifiedDistribution => true
    case AllTuples => numPartitions == 1
    case _ => false
  }
}

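Implementations of Partitioning include:

HashPartitioning (hash partitioning)
RoundRobinPartitioning (round-robin partitioning)
RangePartitioning (range partitioning)
BroadcastPartitioning (broadcast partitioning)
UnknownPartitioning (unknown partitioning)

Specifically, range partitioning computes a sort key for each row from the ordering expressions, so every partition covers a range with a min and a max. Rows with the same sort key land in the same partition, and neighboring values land in the same or an adjacent partition, which preserves some continuity in the data. Hash partitioning computes a hash value for each row, so rows with the same hash value land in the same partition. It preserves little continuity, but a well-designed hash function can often mitigate data skew to some degree. Broadcast partitioning means the data is broadcast to every node. Round-robin partitioning spreads rows evenly across the partitions. Unknown partitioning means nothing is known about the layout, and it generally serves as the default partitioning value in pattern matching.

HashPartitioning source code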

case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
  extends Expression with Partitioning with Unevaluable {

  override def children: Seq[Expression] = expressions
  override def nullable: Boolean = false
  override def dataType: DataType = IntegerType

  override def satisfies(required: Distribution): Boolean = {
    super.satisfies(required) || {
      required match {
        case h: HashClusteredDistribution =>
          expressions.length == h.expressions.length && expressions.zip(h.expressions).forall {
            case (l, r) => l.semanticEquals(r)
          }
        case ClusteredDistribution(requiredClustering, requiredNumPartitions) =>
          expressions.forall(x => requiredClustering.exists(_.semanticEquals(x))) &&
            (requiredNumPartitions.isEmpty || requiredNumPartitions.get == numPartitions)
        case _ => false
      }
    }
  }

  def partitionIdExpression: Expression = Pmod(new Murmur3Hash(expressions), Literal(numPartitions))
}
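
The partitionIdExpression above reduces to "hash the key expressions, then take a non-negative modulo". The sketch below shows that arithmetic only. MurmurHash3 from the Scala standard library is used as a stand-in, which is an assumption: it will not reproduce the hash values Spark computes. The names pmod and partitionId are illustrative.

```scala
object HashPartitionSketch {
  import scala.util.hashing.MurmurHash3

  // Non-negative modulo: the result is always in [0, n), even for negative hashes.
  def pmod(a: Int, n: Int): Int = {
    val r = a % n
    if (r < 0) r + n else r
  }

  // Partition id for a row key: equal keys always map to the same partition.
  def partitionId(key: Seq[Any], numPartitions: Int): Int =
    pmod(MurmurHash3.seqHash(key), numPartitions)
}
```

The positive modulo matters because a hash value can be negative, and a plain % would then produce a negative partition id.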

RangePartitioning source code

case class RangePartitioning(ordering: Seq[SortOrder], numPartitions: Int)
  extends Expression with Partitioning with Unevaluable {

  override def children: Seq[SortOrder] = ordering
  override def nullable: Boolean = false
  override def dataType: DataType = IntegerType

  override def satisfies(required: Distribution): Boolean = {
    super.satisfies(required) || {
      required match {
        case OrderedDistribution(requiredOrdering) =>
          val minSize = Seq(requiredOrdering.size, ordering.size).min
          requiredOrdering.take(minSize) == ordering.take(minSize)
        case ClusteredDistribution(requiredClustering, requiredNumPartitions) =>
          ordering.map(_.child).forall(x => requiredClustering.exists(_.semanticEquals(x))) &&
            (requiredNumPartitions.isEmpty || requiredNumPartitions.get == numPartitions)
        case _ => false
      }
    }
  }
}
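
Two ideas from this class can be tried in isolation: assigning a key to a range, and the prefix comparison used in the OrderedDistribution case. This is a minimal sketch with hypothetical helper names; the bounds are given by hand here, and plain strings stand in for SortOrder expressions.

```scala
object RangePartitionSketch {
  // bounds holds the inclusive upper bound of every partition except the last,
  // in ascending order, so n bounds describe n + 1 partitions.
  def partitionId(key: Long, bounds: Vector[Long]): Int = {
    val idx = bounds.indexWhere(key <= _)
    if (idx >= 0) idx else bounds.length
  }

  // Mirrors the prefix check in satisfies: compare the two orderings
  // over the length of the shorter one.
  def satisfiesOrdering(required: Seq[String], actual: Seq[String]): Boolean = {
    val minSize = math.min(required.size, actual.size)
    required.take(minSize) == actual.take(minSize)
  }
}
```

With bounds Vector(10, 20), keys 5 and 10 fall into partition 0, key 11 into partition 1, and key 25 into partition 2, so neighboring keys stay in the same or an adjacent partition.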

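2. Distribution

A Distribution specifies how tuples that share common expressions will be distributed when a query is executed. It currently refers to two physical properties: inter-node distribution and intra-partition distribution. Inter-node distribution describes how tuples are spread across the physical machines of the cluster; knowing it enables optimizations such as preferring local aggregation and avoiding unnecessary global aggregation. Intra-partition distribution describes how tuples are laid out within a single partition.

Common Distributions include:

AllTuples (all tuples in one partition)
ClusteredDistribution (clustered distribution)
HashClusteredDistribution (hash-clustered distribution)
OrderedDistribution (ordered distribution)
BroadcastDistribution (broadcast distribution)
UnspecifiedDistribution (unspecified distribution)

Partitioning and Distribution thus come in pairs. The relation between them is called satisfies: a partitioning p satisfies a distribution d.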
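
The satisfies relation can be modeled in a few lines. The sketch below is a reduced toy version under stated assumptions: column names replace expressions, only three distributions and two partitionings are modeled, and needsShuffle is a hypothetical helper showing how the answer could decide whether data must be repartitioned, as with the Exchange nodes in the sample plan.

```scala
object SatisfiesSketch {
  sealed trait Distribution
  case object Unspecified extends Distribution
  case object AllTuples extends Distribution
  final case class Clustered(keys: Set[String]) extends Distribution

  sealed trait Partitioning {
    def numPartitions: Int
    // Default rule, as in the Partitioning trait shown earlier.
    def satisfies(required: Distribution): Boolean = required match {
      case Unspecified => true
      case AllTuples   => numPartitions == 1
      case _           => false
    }
  }

  final case class HashPart(keys: Seq[String], numPartitions: Int) extends Partitioning {
    // Hashing on a subset of the clustering keys still co-locates equal key tuples.
    override def satisfies(required: Distribution): Boolean =
      super.satisfies(required) || (required match {
        case Clustered(req) => keys.forall(req.contains)
        case _              => false
      })
  }

  final case class Unknown(numPartitions: Int) extends Partitioning

  // Hypothetical helper: repartition only when the requirement is not already met.
  def needsShuffle(child: Partitioning, required: Distribution): Boolean =
    !child.satisfies(required)
}
```

For example, HashPart(Seq("x"), 200) satisfies Clustered(Set("x", "y")), so no shuffle is needed, whereas Unknown(8) does not and would require one.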

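3. RBO and CBO

Database query optimization is a complex field: the basic principles are concise, but applying them to concrete scenarios gets intricate. Put simply, SQL optimization strategies fall into two broad categories: rule-based optimization (RBO) and cost-based optimization (CBO).

Common RBO rules include predicate pushdown, constant folding, and column pruning. They are not covered in detail here; interested readers can refer to this blog post.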
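
As an illustration of what a rule looks like, constant folding can be written as a bottom-up rewrite of an expression tree. This is a minimal sketch on a toy expression type; Expr, Lit, Col, Add, Mul, and fold are hypothetical names, not Catalyst's classes.

```scala
object ConstantFoldingSketch {
  sealed trait Expr
  final case class Lit(value: Int) extends Expr
  final case class Col(name: String) extends Expr
  final case class Add(l: Expr, r: Expr) extends Expr
  final case class Mul(l: Expr, r: Expr) extends Expr

  // Fold children first, then collapse a node whose operands are both literals.
  def fold(e: Expr): Expr = e match {
    case Add(l, r) =>
      (fold(l), fold(r)) match {
        case (Lit(a), Lit(b)) => Lit(a + b)
        case (fl, fr)         => Add(fl, fr)
      }
    case Mul(l, r) =>
      (fold(l), fold(r)) match {
        case (Lit(a), Lit(b)) => Lit(a * b)
        case (fl, fr)         => Mul(fl, fr)
      }
    case leaf => leaf
  }
}
```

Folding x + 2 * (3 + 4) gives x + 14: the literal subtree is evaluated once at planning time, and the column reference is left alone.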

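CBO requires adding some basic capabilities to Spark, such as statistics collection, a cost function, and cardinality estimation for operators. See the CBO issue contributed by Huawei for details.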
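
To make the role of statistics concrete, the sketch below uses a textbook cardinality estimate for an inner equi-join, rows(A) * rows(B) / max(ndv(A), ndv(B)), to pick the join partner with the smallest estimated output. It is an assumption-laden toy, not Spark's cost model: Stats, joinRows, and cheapest are hypothetical names, and distinct key counts are assumed to be positive.

```scala
object CboSketch {
  // Per-table statistics: row count and number of distinct join-key values.
  final case class Stats(rowCount: BigInt, distinctKeys: BigInt)

  // Textbook estimate of the output size of an inner equi-join.
  def joinRows(a: Stats, b: Stats): BigInt =
    a.rowCount * b.rowCount / a.distinctKeys.max(b.distinctKeys)

  // Choose the candidate whose join with `left` has the smallest estimate.
  def cheapest(left: Stats, candidates: Map[String, Stats]): String =
    candidates.minBy { case (_, stats) => joinRows(left, stats) }._1
}
```

With left = Stats(1000, 100), joining Stats(500, 50) is estimated at 5000 rows and joining Stats(10, 10) at 100 rows, so the second table is chosen first.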
