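A Broadcast Join Pitfall in Spark SQL 2.2.x (Hint Not Honored)

Problem description

In Spark SQL on Spark 2.2.0, using a hint to name the table to broadcast does not make Spark broadcast that table.

Preparation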

hive> select * from test.tmp_demo_small;
OK
tmp_demo_small.pas_phone	tmp_demo_small.age
156	20
157	22
158	15

hive> analyze table test.tmp_demo_small compute statistics;
Table test.tmp_demo_small stats: [numFiles=1, numRows=3, totalSize=21, rawDataSize=18]



hive> select * from test.tmp_demo_big;
OK
tmp_demo_big.pas_phone	tmp_demo_big.ord_id	tmp_demo_big.dt
156	aa1	20191111
156	aa2	20191112
157	bb1	20191111
157	bb2	20191112
157	bb3	20191113
157	bb4	20191114
158	cc1	20191111
158	cc2	20191112
158	cc3	20191113

hive> analyze table test.tmp_demo_big compute statistics;
Table test.tmp_demo_big stats: [numFiles=1, numRows=9, totalSize=153, rawDataSize=144]

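How Spark SQL parses a query is covered in "Apache Spark源碼走讀之11 – sql的解析與執行" and is not the focus of this post. The parsed syntax tree is still useful here, because it shows clearly which table is the left side and which is the right; without it, readers may wonder what BuildRight refers to.

Verification

Conclusion first: when a small table is joined with another small table (both satisfy the default broadcast condition; spark.sql.autoBroadcastJoinThreshold defaults to 10 MB), the right table is chosen as the broadcast side whether or not a hint names a broadcast table. In other words, the hint has no effect in this case.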
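As a quick sanity check on the premise, both demo tables are far below the default threshold. The following is a minimal sketch in plain Python, not Spark code: the helper name `fits_auto_broadcast` is hypothetical, and the sizes are the `totalSize` values reported by `analyze table` above, assuming Spark's own size estimate is of the same order.

```python
# Default value of spark.sql.autoBroadcastJoinThreshold: 10 MB, in bytes.
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024


def fits_auto_broadcast(size_in_bytes, threshold=AUTO_BROADCAST_JOIN_THRESHOLD):
    """Hypothetical helper mirroring the size test: 0 <= size <= threshold."""
    return 0 <= size_in_bytes <= threshold


# totalSize values taken from the `analyze table` output (assumed to be
# representative of what the planner sees).
table_sizes = {"test.tmp_demo_small": 21, "test.tmp_demo_big": 153}
eligible = {name: fits_auto_broadcast(size) for name, size in table_sizes.items()}
print(eligible)
```

Both tables qualify, which is exactly the "small joins small" situation where the matching order of the planner decides the build side.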

The explanatory notes are placed inline as comments in the queries, plans and code.

Default join with automatic broadcast

select
    big.pas_phone,
    big.ord_id,
    small.age,
    sum(1) over(partition by big.pas_phone) as ord_cnt
from
    test.tmp_demo_small as small  -- small table, 3 rows
join
    test.tmp_demo_big as big  -- big table, 9 rows
on
    small.pas_phone = big.pas_phone
where
    small.age > 21

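Execution plan (read each stage from the bottom up; the indentation mirrors the tree structure)

== Parsed Logical Plan == -- abstract syntax tree, produced by the ANTLR parser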
Project [pas_phone#39, ord_id#40, age#38, ord_cnt#35L]
+- Project [pas_phone#39, ord_id#40, age#38, ord_cnt#35L, ord_cnt#35L]
   +- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#39, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#35L], [pas_phone#39]
      +- Project [pas_phone#39, ord_id#40, age#38]  -- at this stage only the selected attributes are known, not which table they belong to or their data types
         +- Filter (age#38 > 21)
            +- Join Inner, (pas_phone#37 = pas_phone#39)
               :- SubqueryAlias small
               :  +- SubqueryAlias tmp_demo_small
               :     +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#37, age#38]
               +- SubqueryAlias big
                  +- SubqueryAlias tmp_demo_big
                     +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#39, ord_id#40, dt#41]

== Analyzed Logical Plan ==  -- logical plan after analysis
pas_phone: int, ord_id: string, age: int, ord_cnt: bigint  -- data types resolved
Project [pas_phone#39, ord_id#40, age#38, ord_cnt#35L]
+- Project [pas_phone#39, ord_id#40, age#38, ord_cnt#35L, ord_cnt#35L]
   +- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#39, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#35L], [pas_phone#39]
      +- Project [pas_phone#39, ord_id#40, age#38]
         +- Filter (age#38 > 21)
            +- Join Inner, (pas_phone#37 = pas_phone#39)
               :- SubqueryAlias small
               :  +- SubqueryAlias tmp_demo_small
               :     +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#37, age#38]
               +- SubqueryAlias big
                  +- SubqueryAlias tmp_demo_big
                     +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#39, ord_id#40, dt#41]

== Optimized Logical Plan ==  -- logical optimization
Window [sum(1) windowspecdefinition(pas_phone#39, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#35L], [pas_phone#39]
+- Project [pas_phone#39, ord_id#40, age#38]
   +- Join Inner, (pas_phone#37 = pas_phone#39)
      :- Filter ((isnotnull(age#38) && (age#38 > 21)) && isnotnull(pas_phone#37))  -- predicate pushdown
      :  +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#37, age#38]
      +- Project [pas_phone#39, ord_id#40]
         +- Filter isnotnull(pas_phone#39)
            +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#39, ord_id#40, dt#41]

== Physical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#39, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#35L], [pas_phone#39]
+- *Sort [pas_phone#39 ASC NULLS FIRST], false, 0
   +- Exchange(coordinator id: 449256327) hashpartitioning(pas_phone#39, 1000), coordinator[target post-shuffle partition size: 67108864]
      +- *Project [pas_phone#39, ord_id#40, age#38]
         +- *BroadcastHashJoin [pas_phone#37], [pas_phone#39], Inner, BuildRight -- BuildRight means the right table is broadcast
            :- *Filter ((isnotnull(age#38) && (age#38 > 21)) && isnotnull(pas_phone#37))
            :  +- HiveTableScan [pas_phone#37, age#38], HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#37, age#38]
            +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
               +- *Filter isnotnull(pas_phone#39)
                  +- HiveTableScan [pas_phone#39, ord_id#40], HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#39, ord_id#40, dt#41]
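
When comparing several plans, it can help to pull the build side out of the explain text instead of scanning it by eye. A minimal sketch in plain Python, assuming the plan is available as a string; the function name `build_sides` is made up for illustration and the sample text is an excerpt of the physical plan above.

```python
import re


def build_sides(physical_plan: str):
    """Return the build side of every BroadcastHashJoin line in the plan text."""
    # `.` does not match newlines here, so each match stays on a single plan line.
    pattern = re.compile(r"BroadcastHashJoin\b.*?\b(BuildLeft|BuildRight)\b")
    return [match.group(1) for match in pattern.finditer(physical_plan)]


# Excerpt of the physical plan shown above.
plan = """\
+- *Project [pas_phone#39, ord_id#40, age#38]
   +- *BroadcastHashJoin [pas_phone#37], [pas_phone#39], Inner, BuildRight
      :- *Filter ((isnotnull(age#38) && (age#38 > 21)) && isnotnull(pas_phone#37))
      +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
"""
sides = build_sides(plan)
print(sides)
```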

Using a hint to specify the broadcast table

select
    /*+ BROADCAST(small) */
    big.pas_phone,
    big.ord_id,
    small.age,
    sum(1) over(partition by big.pas_phone) as ord_cnt
from
    test.tmp_demo_small as small  -- small table, 3 rows
join
    test.tmp_demo_big as big  -- big table, 9 rows
on
    small.pas_phone = big.pas_phone
where
    small.age > 21

Execution plan

== Parsed Logical Plan ==
Project [pas_phone#61, ord_id#62, age#60, ord_cnt#57L]
+- Project [pas_phone#61, ord_id#62, age#60, ord_cnt#57L, ord_cnt#57L]
   +- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#61, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#57L], [pas_phone#61]
      +- Project [pas_phone#61, ord_id#62, age#60]
         +- Filter (age#60 > 21)
            +- Join Inner, (pas_phone#59 = pas_phone#61)
               :- ResolvedHint isBroadcastable=true
               :  +- SubqueryAlias small
               :     +- SubqueryAlias tmp_demo_small
               :        +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#59, age#60]
               +- SubqueryAlias big
                  +- SubqueryAlias tmp_demo_big
                     +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#61, ord_id#62, dt#63]

== Analyzed Logical Plan ==
pas_phone: int, ord_id: string, age: int, ord_cnt: bigint
Project [pas_phone#61, ord_id#62, age#60, ord_cnt#57L]
+- Project [pas_phone#61, ord_id#62, age#60, ord_cnt#57L, ord_cnt#57L]
   +- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#61, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#57L], [pas_phone#61]
      +- Project [pas_phone#61, ord_id#62, age#60]
         +- Filter (age#60 > 21)
            +- Join Inner, (pas_phone#59 = pas_phone#61)
               :- ResolvedHint isBroadcastable=true
               :  +- SubqueryAlias small
               :     +- SubqueryAlias tmp_demo_small
               :        +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#59, age#60]
               +- SubqueryAlias big
                  +- SubqueryAlias tmp_demo_big
                     +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#61, ord_id#62, dt#63]

== Optimized Logical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#61, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#57L], [pas_phone#61]
+- Project [pas_phone#61, ord_id#62, age#60]
   +- Join Inner, (pas_phone#59 = pas_phone#61)
      :- ResolvedHint isBroadcastable=true  -- the hint is still present after logical optimization, so it has been recognized
      :  +- Filter ((isnotnull(age#60) && (age#60 > 21)) && isnotnull(pas_phone#59))
      :     +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#59, age#60]
      +- Project [pas_phone#61, ord_id#62]
         +- Filter isnotnull(pas_phone#61)
            +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#61, ord_id#62, dt#63]

== Physical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#61, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#57L], [pas_phone#61]
+- *Sort [pas_phone#61 ASC NULLS FIRST], false, 0
   +- Exchange(coordinator id: 1477200907) hashpartitioning(pas_phone#61, 1000), coordinator[target post-shuffle partition size: 67108864]
      +- *Project [pas_phone#61, ord_id#62, age#60]
         +- *BroadcastHashJoin [pas_phone#59], [pas_phone#61], Inner, BuildRight -- BuildRight: the right table is still broadcast
            :- *Filter ((isnotnull(age#60) && (age#60 > 21)) && isnotnull(pas_phone#59))
            :  +- HiveTableScan [pas_phone#59, age#60], HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#59, age#60]
            +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
               +- *Filter isnotnull(pas_phone#61)
                  +- HiveTableScan [pas_phone#61, ord_id#62], HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#61, ord_id#62, dt#63]

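Everything looks normal at first, and pushing the filter conditions down during logical optimization is consistent with rule-based optimization (RBO). The problem appears when the physical plan is generated: in theory the planner should compare the two children and broadcast the smaller one. Why does it not? The cause most likely lies in how the physical planner selects the join, so let's look at the Spark 2.2.0 source, starting from apply.

Location: spark-2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala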


object JoinSelection extends Strategy with PredicateHelper {

  /**
   * Matches a plan whose output should be small enough to be used in broadcast join.
   */
  
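  // 3. canBroadcast(right): `right` is a LogicalPlan, i.e. the logical plan of that subtree, carrying the table's metadata and any resolved hint. It returns true if either condition holds: the plan carries a broadcast hint, or its estimated size after filtering (here, for the right table) is at least 0 and no larger than the threshold (10 MB by default).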
  
  private def canBroadcast(plan: LogicalPlan): Boolean = {
    plan.stats(conf).hints.isBroadcastable.getOrElse(false) ||
      (plan.stats(conf).sizeInBytes >= 0 &&
        plan.stats(conf).sizeInBytes <= conf.autoBroadcastJoinThreshold)
  }

  ...  // some code omitted

  // 2. canBuildRight(joinType) returns true for this (inner) join
  private def canBuildRight(joinType: JoinType): Boolean = joinType match {
    case _: InnerLike | LeftOuter | LeftSemi | LeftAnti => true
    case j: ExistenceJoin => true
    case _ => false
  }

  private def canBuildLeft(joinType: JoinType): Boolean = joinType match {
    case _: InnerLike | RightOuter => true
    case _ => false
  }

  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {

    // --- BroadcastHashJoin --------------------------------------------------------------------
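    // 1. Broadcast condition: first check (2) canBuildRight(joinType), then (3) canBroadcast(right). If both are true, a broadcast join is planned with the right table as the build side, regardless of which table the hint names.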

    case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
      if canBuildRight(joinType) && canBroadcast(right) =>
      Seq(joins.BroadcastHashJoinExec(
        leftKeys, rightKeys, joinType, BuildRight, condition, planLater(left), planLater(right)))

    case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
      if canBuildLeft(joinType) && canBroadcast(left) =>
      Seq(joins.BroadcastHashJoinExec(
        leftKeys, rightKeys, joinType, BuildLeft, condition, planLater(left), planLater(right)))

    // --- ShuffledHashJoin ---------------------------------------------------------------------

    case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
       if !conf.preferSortMergeJoin && canBuildRight(joinType) && canBuildLocalHashMap(right)
         && muchSmaller(right, left) ||
         !RowOrdering.isOrderable(leftKeys) =>
      ...

    // --- SortMergeJoin ------------------------------------------------------------

    case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
      if RowOrdering.isOrderable(leftKeys) =>
      ...
    // --- Without joining keys ------------------------------------------------------------
    ...
    case _ => Nil
  }
}
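
The matching order in the excerpt above can be modeled in a few lines. This is a simplified sketch in plain Python rather than Spark code: the `Side` class and the function names are hypothetical, only the inner equi-join case is covered, and the byte sizes are illustrative values taken from the table statistics.

```python
from dataclasses import dataclass

# Default spark.sql.autoBroadcastJoinThreshold (10 MB), in bytes.
THRESHOLD = 10 * 1024 * 1024


@dataclass
class Side:
    """Hypothetical stand-in for one join child: estimated size plus hint flag."""
    size_in_bytes: int
    hinted: bool = False


def can_broadcast_22(side, threshold=THRESHOLD):
    # Mirrors canBroadcast in 2.2.0: a hint OR a small enough size is sufficient.
    return side.hinted or 0 <= side.size_in_bytes <= threshold


def pick_build_side_22(left, right, threshold=THRESHOLD):
    """Inner equi-join only, so both canBuildLeft and canBuildRight hold."""
    if can_broadcast_22(right, threshold):  # first case: the right side is tried first
        return "BuildRight"
    if can_broadcast_22(left, threshold):   # second case: only if the right side failed
        return "BuildLeft"
    return None  # falls through to the other join strategies


small = Side(size_in_bytes=21, hinted=True)  # /*+ BROADCAST(small) */
big = Side(size_in_bytes=153)
hinted_but_ignored = pick_build_side_22(left=small, right=big)
right_too_large = pick_build_side_22(left=small, right=Side(size_in_bytes=1 << 30))
print(hinted_but_ignored, right_too_large)
```

The first call reproduces the observed plan: the right side qualifies by size, so the hint on the left side is never consulted. The left side is only reached once the right side fails both the hint and the size test.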

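This explains why the hint has no effect in Spark 2.2.0. When the join strategy is selected, broadcast join is considered first, and the pattern match tries the right side first. So as long as the right table is small enough to satisfy the broadcast rule, the right table is broadcast regardless of whether a hint exists or which side it names. Only when the right table is too large and carries no hint of its own does matching fall through to the second case, which checks whether the left table qualifies and broadcasts it if so. Now let's run the same query on 2.4.3: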

select
    /*+ BROADCAST(small) */
    big.pas_phone,
    big.ord_id,
    small.age,
    sum(1) over(partition by big.pas_phone) as ord_cnt
from
    test.tmp_demo_small as small  -- small table, 3 rows
join
    test.tmp_demo_big as big  -- big table, 9 rows
on
    small.pas_phone = big.pas_phone
where
    small.age > 21

Execution plan

== Parsed Logical Plan ==
Project [pas_phone#4, ord_id#5, age#3, ord_cnt#0L]
+- Project [pas_phone#4, ord_id#5, age#3, ord_cnt#0L, ord_cnt#0L]
   +- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#4, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#0L], [pas_phone#4]
      +- Project [pas_phone#4, ord_id#5, age#3]
         +- Filter (age#3 > 21)
            +- Join Inner, (pas_phone#2 = pas_phone#4)
               :- ResolvedHint (broadcast)
               :  +- SubqueryAlias `small`
               :     +- SubqueryAlias `test`.`tmp_demo_small`
               :        +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#2, age#3]
               +- SubqueryAlias `big`
                  +- SubqueryAlias `test`.`tmp_demo_big`
                     +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#4, ord_id#5, dt#6]

== Analyzed Logical Plan ==
pas_phone: int, ord_id: string, age: int, ord_cnt: bigint
Project [pas_phone#4, ord_id#5, age#3, ord_cnt#0L]
+- Project [pas_phone#4, ord_id#5, age#3, ord_cnt#0L, ord_cnt#0L]
   +- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#4, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#0L], [pas_phone#4]
      +- Project [pas_phone#4, ord_id#5, age#3]
         +- Filter (age#3 > 21)
            +- Join Inner, (pas_phone#2 = pas_phone#4)
               :- ResolvedHint (broadcast)
               :  +- SubqueryAlias `small`
               :     +- SubqueryAlias `test`.`tmp_demo_small`
               :        +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#2, age#3]
               +- SubqueryAlias `big`
                  +- SubqueryAlias `test`.`tmp_demo_big`
                     +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#4, ord_id#5, dt#6]

== Optimized Logical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#4, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#0L], [pas_phone#4]
+- Project [pas_phone#4, ord_id#5, age#3]
   +- Join Inner, (pas_phone#2 = pas_phone#4)
      :- ResolvedHint (broadcast) -- the hint is resolved and marks the table to broadcast
      :  +- Filter ((isnotnull(age#3) && (age#3 > 21)) && isnotnull(pas_phone#2))
      :     +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#2, age#3]
      +- Project [pas_phone#4, ord_id#5]
         +- Filter isnotnull(pas_phone#4)
            +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#4, ord_id#5, dt#6]

== Physical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#4, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#0L], [pas_phone#4]
+- *(3) Sort [pas_phone#4 ASC NULLS FIRST], false, 0
   +- Exchange(coordinator id: 632554218) hashpartitioning(pas_phone#4, 1000), coordinator[target post-shuffle partition size: 67108864]
      +- *(2) Project [pas_phone#4, ord_id#5, age#3]
         +- *(2) BroadcastHashJoin [pas_phone#2], [pas_phone#4], Inner, BuildLeft  -- BuildLeft: the hint takes effect
            :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
            :  +- *(1) Filter ((isnotnull(age#3) && (age#3 > 21)) && isnotnull(pas_phone#2))
            :     +- Scan hive test.tmp_demo_small [pas_phone#2, age#3], HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#2, age#3]
            +- *(2) Filter isnotnull(pas_phone#4)
               +- Scan hive test.tmp_demo_big [pas_phone#4, ord_id#5], HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#4, ord_id#5, dt#6]

Default join without specifying a broadcast table

select
    big.pas_phone,
    big.ord_id,
    small.age,
    sum(1) over(partition by big.pas_phone) as ord_cnt
from
    test.tmp_demo_small as small  -- small table, 3 rows
join
    test.tmp_demo_big as big  -- big table, 9 rows
on
    small.pas_phone = big.pas_phone
where
    small.age > 21

Execution plan

== Parsed Logical Plan ==
Project [pas_phone#15, ord_id#16, age#14, ord_cnt#11L]
+- Project [pas_phone#15, ord_id#16, age#14, ord_cnt#11L, ord_cnt#11L]
   +- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#15, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#11L], [pas_phone#15]
      +- Project [pas_phone#15, ord_id#16, age#14]
         +- Filter (age#14 > 21)
            +- Join Inner, (pas_phone#13 = pas_phone#15)
               :- SubqueryAlias `small`
               :  +- SubqueryAlias `test`.`tmp_demo_small`
               :     +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#13, age#14]
               +- SubqueryAlias `big`
                  +- SubqueryAlias `test`.`tmp_demo_big`
                     +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#15, ord_id#16, dt#17]

== Analyzed Logical Plan ==
pas_phone: int, ord_id: string, age: int, ord_cnt: bigint
Project [pas_phone#15, ord_id#16, age#14, ord_cnt#11L]
+- Project [pas_phone#15, ord_id#16, age#14, ord_cnt#11L, ord_cnt#11L]
   +- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#15, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#11L], [pas_phone#15]
      +- Project [pas_phone#15, ord_id#16, age#14]
         +- Filter (age#14 > 21)
            +- Join Inner, (pas_phone#13 = pas_phone#15)
               :- SubqueryAlias `small`
               :  +- SubqueryAlias `test`.`tmp_demo_small`
               :     +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#13, age#14]
               +- SubqueryAlias `big`
                  +- SubqueryAlias `test`.`tmp_demo_big`
                     +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#15, ord_id#16, dt#17]

== Optimized Logical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#15, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#11L], [pas_phone#15]
+- Project [pas_phone#15, ord_id#16, age#14]
   +- Join Inner, (pas_phone#13 = pas_phone#15)
      :- Filter ((isnotnull(age#14) && (age#14 > 21)) && isnotnull(pas_phone#13))
      :  +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#13, age#14]
      +- Project [pas_phone#15, ord_id#16]
         +- Filter isnotnull(pas_phone#15)
            +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#15, ord_id#16, dt#17]

== Physical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#15, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#11L], [pas_phone#15]
+- *(3) Sort [pas_phone#15 ASC NULLS FIRST], false, 0
   +- Exchange(coordinator id: 1731877543) hashpartitioning(pas_phone#15, 1000), coordinator[target post-shuffle partition size: 67108864]
      +- *(2) Project [pas_phone#15, ord_id#16, age#14]
         +- *(2) BroadcastHashJoin [pas_phone#13], [pas_phone#15], Inner, BuildLeft -- the left (smaller) table is broadcast
            :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
            :  +- *(1) Filter ((isnotnull(age#14) && (age#14 > 21)) && isnotnull(pas_phone#13))
            :     +- Scan hive test.tmp_demo_small [pas_phone#13, age#14], HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#13, age#14]
            +- *(2) Filter isnotnull(pas_phone#15)
               +- Scan hive test.tmp_demo_big [pas_phone#15, ord_id#16], HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#15, ord_id#16, dt#17]

This is interesting. Let's look at the 2.4.3 source.

Location: spark-2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala


 object JoinSelection extends Strategy with PredicateHelper {

    /**
     * Matches a plan whose output should be small enough to be used in broadcast join.
     */
    private def canBroadcast(plan: LogicalPlan): Boolean = {
      plan.stats.sizeInBytes >= 0 && plan.stats.sizeInBytes <= conf.autoBroadcastJoinThreshold
    }

    /**
     * Matches a plan whose single partition should be small enough to build a hash table.
     *
     * Note: this assume that the number of partition is fixed, requires additional work if it's
     * dynamic.
     */
    private def canBuildLocalHashMap(plan: LogicalPlan): Boolean = {
      plan.stats.sizeInBytes < conf.autoBroadcastJoinThreshold * conf.numShufflePartitions
    }

    /**
     * Returns whether plan a is much smaller (3X) than plan b.
     *
     * The cost to build hash map is higher than sorting, we should only build hash map on a table
     * that is much smaller than other one. Since we does not have the statistic for number of rows,
     * use the size of bytes here as estimation.
     */
    private def muchSmaller(a: LogicalPlan, b: LogicalPlan): Boolean = {
      a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes
    }

    private def canBuildRight(joinType: JoinType): Boolean = joinType match {
      case _: InnerLike | LeftOuter | LeftSemi | LeftAnti | _: ExistenceJoin => true
      case _ => false
    }

    private def canBuildLeft(joinType: JoinType): Boolean = joinType match {
      case _: InnerLike | RightOuter => true
      case _ => false
    }

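    // 3. Picks the build side: if both sides are eligible, compare their sizes and build the smaller one; otherwise build the only eligible side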
    private def broadcastSide(
        canBuildLeft: Boolean,
        canBuildRight: Boolean,
        left: LogicalPlan,
        right: LogicalPlan): BuildSide = {

      def smallerSide =
        if (right.stats.sizeInBytes <= left.stats.sizeInBytes) BuildRight else BuildLeft

      if (canBuildRight && canBuildLeft) {
        // Broadcast smaller side base on its estimated physical size
        // if both sides have broadcast hint
        smallerSide
      } else if (canBuildRight) {
        BuildRight
      } else if (canBuildLeft) {
        BuildLeft
      } else {
        // for the last default broadcast nested loop join
        smallerSide
      }
    }

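   // 1. canBroadcastByHints(joinType, left, right): true if at least one side both can be the build side for this join type (canBuildLeft / canBuildRight) and carries a broadcast hint; between them the two checks cover essentially every join type that supports broadcasting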
   private def canBroadcastByHints(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
      : Boolean = {
      val buildLeft = canBuildLeft(joinType) && left.stats.hints.broadcast
      val buildRight = canBuildRight(joinType) && right.stats.hints.broadcast
      buildLeft || buildRight
    }

   // 2. broadcastSideByHints(joinType, left, right) then calls broadcastSide (3), which compares the sizes of the two tables when both sides are hinted
   private def broadcastSideByHints(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
      : BuildSide = {
      val buildLeft = canBuildLeft(joinType) && left.stats.hints.broadcast
      val buildRight = canBuildRight(joinType) && right.stats.hints.broadcast
      broadcastSide(buildLeft, buildRight, left, right)
    }

    private def canBroadcastBySizes(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
      : Boolean = {
      val buildLeft = canBuildLeft(joinType) && canBroadcast(left)
      val buildRight = canBuildRight(joinType) && canBroadcast(right)
      buildLeft || buildRight
    }

    private def broadcastSideBySizes(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
      : BuildSide = {
      val buildLeft = canBuildLeft(joinType) && canBroadcast(left)
      val buildRight = canBuildRight(joinType) && canBroadcast(right)
      broadcastSide(buildLeft, buildRight, left, right)
    }

    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {

      // Two cases are distinguished: with a hint and without a hint
      // --- BroadcastHashJoin --------------------------------------------------------------------
			
      // broadcast hints were specified
      
     // With a hint: canBroadcastByHints(joinType, left, right) (1) being true only means a hint exists on a side the join type allows; broadcastSideByHints(joinType, left, right) (2) then decides which table to broadcast
      case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
        if canBroadcastByHints(joinType, left, right) =>
        val buildSide = broadcastSideByHints(joinType, left, right)
        Seq(joins.BroadcastHashJoinExec(
          leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))

      
      // broadcast hints were not specified, so need to infer it from size and configuration.
      // Without a hint: the sizes of the two tables decide which one is broadcast (subject to the threshold and join type checks)
      case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
        if canBroadcastBySizes(joinType, left, right) =>
        val buildSide = broadcastSideBySizes(joinType, left, right)
        Seq(joins.BroadcastHashJoinExec(
          leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))

 
      // --- ShuffledHashJoin ---------------------------------------------------------------------

      case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
         if !conf.preferSortMergeJoin && canBuildRight(joinType) && canBuildLocalHashMap(right)
           && muchSmaller(right, left) ||
           !RowOrdering.isOrderable(leftKeys) =>
       ...
      // --- SortMergeJoin ------------------------------------------------------------

      case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
        if RowOrdering.isOrderable(leftKeys) =>
        ...
      // --- Without joining keys ----------------------------------------------------------
      ...
    }
  }
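
For comparison with the 2.2.0 behavior, the 2.4 selection logic in the excerpt above can be modeled the same way. Again this is a simplified sketch in plain Python and not Spark code: `Side`, `broadcast_side` and `pick_build_side_24` are hypothetical names, only the inner equi-join case is covered, and the byte sizes are illustrative.

```python
from dataclasses import dataclass

# Default spark.sql.autoBroadcastJoinThreshold (10 MB), in bytes.
THRESHOLD = 10 * 1024 * 1024


@dataclass
class Side:
    """Hypothetical stand-in for one join child: estimated size plus hint flag."""
    size_in_bytes: int
    hinted: bool = False


def broadcast_side(can_left, can_right, left, right):
    # Mirrors broadcastSide: ties on size go to the right side.
    smaller = "BuildRight" if right.size_in_bytes <= left.size_in_bytes else "BuildLeft"
    if can_left and can_right:
        return smaller
    if can_right:
        return "BuildRight"
    if can_left:
        return "BuildLeft"
    return smaller


def pick_build_side_24(left, right, threshold=THRESHOLD):
    """Inner equi-join only, so either side may be the build side."""
    # Case 1: at least one side carries a broadcast hint; size is not checked.
    if left.hinted or right.hinted:
        return broadcast_side(left.hinted, right.hinted, left, right)
    # Case 2: no hints, so only sides within the threshold are eligible.
    left_ok = 0 <= left.size_in_bytes <= threshold
    right_ok = 0 <= right.size_in_bytes <= threshold
    if left_ok or right_ok:
        return broadcast_side(left_ok, right_ok, left, right)
    return None  # falls through to the other join strategies


with_hint = pick_build_side_24(Side(21, hinted=True), Side(153))
without_hint = pick_build_side_24(Side(21), Side(153))
oversized_hint = pick_build_side_24(Side(1 << 30, hinted=True), Side(153))
print(with_hint, without_hint, oversized_hint)
```

The last call illustrates the OOM risk: a hinted side is chosen even when it is far above the threshold, because the hint branch never looks at the size limit.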

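To sum up

In Spark 2.2.x, when a small table joins a small table (both satisfy the broadcast condition), a hint naming the broadcast table is ignored and the right table is broadcast; without a hint, the right table is broadcast as well.
In Spark 2.4.3, a hint can name the broadcast table for an inner join (even when that table exceeds the broadcast threshold, so watch out for OOM); without a hint, the smaller table is broadcast provided it is within the threshold.
Broadcast hash join is not supported for full outer join. For right outer join only the left table can be broadcast; for left outer join, left semi join, left anti join and the internal join type ExistenceJoin only the right table can be broadcast; for inner join either side can be broadcast, so a hint can choose.
For the trigger conditions of the other join strategies, see: SparkSQL-有必要坐下來聊聊Join and Spark SQL 之 Join 實現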
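
The join type restrictions summarized above can be written down as a small lookup. A minimal sketch in plain Python, assuming join types are represented by the strings below (a modeling choice for illustration, not Spark's API); Inner and Cross together stand for the InnerLike case in the source.

```python
# Join types for which each side may serve as the broadcast (build) side,
# following canBuildLeft / canBuildRight in the source excerpts.
CAN_BUILD_LEFT = {"Inner", "Cross", "RightOuter"}
CAN_BUILD_RIGHT = {"Inner", "Cross", "LeftOuter", "LeftSemi", "LeftAnti", "ExistenceJoin"}


def allowed_build_sides(join_type):
    """Return the sides that broadcast hash join may build for a join type."""
    sides = []
    if join_type in CAN_BUILD_LEFT:
        sides.append("BuildLeft")
    if join_type in CAN_BUILD_RIGHT:
        sides.append("BuildRight")
    return sides


for join_type in ["Inner", "LeftOuter", "RightOuter", "LeftSemi", "FullOuter"]:
    print(join_type, allowed_build_sides(join_type))
```

An empty list for FullOuter means broadcast hash join is not an option at all, so a broadcast hint cannot produce one for that join type.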

Finally, two hand-drawn diagrams contrast how 2.2 and 2.4 decide on a broadcast join

spark 2.2.0

spark 2.4.2

by the way

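All of this started with a HAVING problem: the query ran fine locally, but once packaged and submitted to the cluster with spark-submit it failed with a puzzling error.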

select
    big.pas_phone,
    big.ord_id,
    small.age,
    sum(1) over(partition by big.pas_phone) as ord_cnt
from
    test.tmp_demo_small as small  -- small table, 3 rows
join
    test.tmp_demo_big as big  -- big table, 9 rows
on
    small.pas_phone = big.pas_phone
where
    small.age > 21
having
    ord_cnt > 2
   
   
Error in query: grouping expressions sequence is empty, and 'big.`pas_phone`' is not an aggregate function. Wrap '()' in windowing function(s) or wrap 'big.`pas_phone`' in first() (or first_value) if you don't care which value you get.;;
'Project [pas_phone#26, ord_id#27, age#25, ord_cnt#22L]
+- 'Project [pas_phone#26, ord_id#27, age#25, ord_cnt#22L, ord_cnt#22L]
   +- 'Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#26, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#22L], [pas_phone#26]
      +- 'Filter ('ord_cnt > 2)
         +- Aggregate [pas_phone#26, ord_id#27, age#25]
            +- Filter (age#25 > 21)
               +- Join Inner, (pas_phone#24 = pas_phone#26)
                  :- SubqueryAlias `small`
                  :  +- SubqueryAlias `test`.`tmp_demo_small`
                  :     +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#24, age#25]
                  +- SubqueryAlias `big`
                     +- SubqueryAlias `test`.`tmp_demo_big`
                        +- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#26, ord_id#27, dt#28]

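The cause was eventually tracked down. When I ran the query locally I used yarn-client mode, so the driver was my own server, which runs Spark 2.2.2, and everything worked. The driver is responsible for building the DAG, splitting it into tasks and so on, all of which happens after the SQL has been turned into RDDs; but the step before that, parsing and planning the SQL (SQL -> RDD), is also done by the driver. Submitting to the cluster used yarn-cluster mode, where the driver runs on one of the cluster machines. Awkwardly, the company had upgraded the cluster to 2.4.3, so the environment that parsed the SQL no longer matched my local one. I then checked the release notes of the newer Spark version: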

In Spark version 2.3 and earlier, HAVING without GROUP BY is treated as WHERE. This means, SELECT 1 FROM range(10) HAVING true is executed as SELECT 1 FROM range(10) WHERE true and returns 10 rows. This violates SQL standard, and has been fixed in Spark 2.4. Since Spark 2.4, HAVING without GROUP BY is treated as a global aggregate, which means SELECT 1 FROM range(10) HAVING true will return only one row. To restore the previous behavior, set spark.sql.legacy.parser.havingWithoutGroupByAsWhere to true.

Cool, problem solved and root cause found. If you would rather keep the pre-2.4 behavior than rewrite the whole query, put set spark.sql.legacy.parser.havingWithoutGroupByAsWhere=true; in front of it:

set spark.sql.legacy.parser.havingWithoutGroupByAsWhere=true;
select
    big.pas_phone,
    big.ord_id,
    small.age,
    sum(1) over(partition by big.pas_phone) as ord_cnt
from
    test.tmp_demo_small as small  -- small table, 3 rows
join
    test.tmp_demo_big as big  -- big table, 9 rows
on
    small.pas_phone = big.pas_phone
where
    small.age > 21
having
    ord_cnt > 2
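
The behavior change quoted from the release notes can be illustrated with a toy model of `SELECT 1 FROM range(10) HAVING true`. This is plain Python, not Spark; the function name and the `legacy_as_where` flag are invented for the sketch, with the flag standing in for spark.sql.legacy.parser.havingWithoutGroupByAsWhere.

```python
def select_one_having(rows, having, legacy_as_where):
    """Model `SELECT 1 FROM <rows> HAVING <having>` for a constant predicate."""
    if legacy_as_where:
        # Spark 2.3 and earlier: HAVING without GROUP BY acts as a per-row WHERE.
        return [1 for _ in rows if having]
    # Spark 2.4 and later: a global aggregate, i.e. one group over all rows,
    # which is kept only if the HAVING predicate holds.
    return [1] if having else []


legacy_rows = select_one_having(range(10), True, legacy_as_where=True)
new_rows = select_one_having(range(10), True, legacy_as_where=False)
print(len(legacy_rows), len(new_rows))
```

The row counts (10 versus 1) match the example in the quoted note, which is why a query written against the old semantics changes meaning, or fails analysis, on 2.4.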

Appendix

  /**
   * Select the proper physical plan for join based on joining keys and size of logical plan.
   *
   * At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where at least some of the
   * predicates can be evaluated by matching join keys. If found, join implementations are chosen
   * with the following precedence:
   *
   * - Broadcast hash join (BHJ):
   *     BHJ is not supported for full outer join. For right outer join, we only can broadcast the
   *     left side. For left outer, left semi, left anti and the internal join type ExistenceJoin,
   *     we only can broadcast the right side. For inner like join, we can broadcast both sides.
   *     Normally, BHJ can perform faster than the other join algorithms when the broadcast side is
   *     small. However, broadcasting tables is a network-intensive operation. It could cause OOM
   *     or perform worse than the other join algorithms, especially when the build/broadcast side
   *     is big.
   *
   *     For the supported cases, users can specify the broadcast hint (e.g. the user applied the
   *     [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame) and session-based
   *     [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold to adjust whether BHJ is used and
   *     which join side is broadcast.
   *
   *     1) Broadcast the join side with the broadcast hint, even if the size is larger than
   *     [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]]. If both sides have the hint (only when the type
   *     is inner like join), the side with a smaller estimated physical size will be broadcast.
   *     2) Respect the [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold and broadcast the side
   *     whose estimated physical size is smaller than the threshold. If both sides are below the
   *     threshold, broadcast the smaller side. If neither is smaller, BHJ is not used.
   *
   * - Shuffle hash join: if the average size of a single partition is small enough to build a hash
   *     table.
   *
   * - Sort merge: if the matching join keys are sortable.
   *
   * If there is no joining keys, Join implementations are chosen with the following precedence:
   * - BroadcastNestedLoopJoin (BNLJ):
   *     BNLJ supports all the join types but the impl is OPTIMIZED for the following scenarios:
   *     For right outer join, the left side is broadcast. For left outer, left semi, left anti
   *     and the internal join type ExistenceJoin, the right side is broadcast. For inner like
   *     joins, either side is broadcast.
   *
   *     Like BHJ, users still can specify the broadcast hint and session-based
   *     [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold to impact which side is broadcast.
   *
   *     1) Broadcast the join side with the broadcast hint, even if the size is larger than
   *     [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]]. If both sides have the hint (i.e., just for
   *     inner-like join), the side with a smaller estimated physical size will be broadcast.
   *     2) Respect the [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold and broadcast the side
   *     whose estimated physical size is smaller than the threshold. If both sides are below the
   *     threshold, broadcast the smaller side. If neither is smaller, BNLJ is not used.
   *
   * - CartesianProduct: for inner like join, CartesianProduct is the fallback option.
   *
   * - BroadcastNestedLoopJoin (BNLJ):
   *     For the other join types, BNLJ is the fallback option. Here, we just pick the broadcast
   *     side with the broadcast hint. If neither side has a hint, we broadcast the side with
   *     the smaller estimated physical size.
   */

Author: 哈士奇說喵
