Problem description
- In Spark SQL on Spark 2.2.0, a hint names the table to broadcast, yet the named table is not the one that gets broadcast.
Setup
hive> select * from test.tmp_demo_small;
OK
tmp_demo_small.pas_phone tmp_demo_small.age
156 20
157 22
158 15
hive> analyze table test.tmp_demo_small compute statistics;
Table test.tmp_demo_small stats: [numFiles=1, numRows=3, totalSize=21, rawDataSize=18]
hive> select * from test.tmp_demo_big;
OK
tmp_demo_big.pas_phone tmp_demo_big.ord_id tmp_demo_big.dt
156 aa1 20191111
156 aa2 20191112
157 bb1 20191111
157 bb2 20191112
157 bb3 20191113
157 bb4 20191114
158 cc1 20191111
158 cc2 20191112
158 cc3 20191113
hive> analyze table test.tmp_demo_big compute statistics;
Table test.tmp_demo_big stats: [numFiles=1, numRows=9, totalSize=153, rawDataSize=144]
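Both tables are tiny compared with the default value of spark.sql.autoBroadcastJoinThreshold (10 MB). The sketch below only illustrates that comparison. It assumes the `totalSize` values printed by `analyze table` can stand in for the planner's size estimate, which is a simplification, and the object and method names are made up for illustration.

```scala
object BroadcastThresholdCheck {
  // Default of spark.sql.autoBroadcastJoinThreshold: 10 MB, expressed in bytes.
  val DefaultThresholdBytes: Long = 10L * 1024 * 1024

  // A size qualifies when it is non-negative and does not exceed the threshold.
  def fitsThreshold(sizeInBytes: Long, threshold: Long = DefaultThresholdBytes): Boolean =
    sizeInBytes >= 0 && sizeInBytes <= threshold

  def main(args: Array[String]): Unit = {
    // totalSize values taken from the ANALYZE TABLE output above (assumed proxy for the estimate).
    val sizes = Seq("test.tmp_demo_small" -> 21L, "test.tmp_demo_big" -> 153L)
    sizes.foreach { case (table, size) =>
      println(s"$table: $size bytes, fits threshold = ${fitsThreshold(size)}")
    }
  }
}
```

With these numbers both tables qualify for automatic broadcast, which is exactly the "small table joins small table" situation examined in this post.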
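For how Spark SQL parses a query, see "Apache Spark源碼走讀之11 – sql的解析與執行". That is not the focus of this post, but the parsed syntax tree is useful here: it shows clearly which table is on the left and which is on the right, so BuildRight in the plans below does not come as a mystery.
Verification
Conclusion first: on Spark 2.2.0, when a small table joins a small table (both satisfy the default broadcast condition, spark.sql.autoBroadcastJoinThreshold, 10 MB by default), the right table is matched first whether or not a broadcast target is specified. In other words, the hint has no effect in this situation.
Notes and comments are placed inline in the code.
- Join with the default behavior (automatic broadcast)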
select
big.pas_phone,
big.ord_id,
small.age,
sum(1) over(partition by big.pas_phone) as ord_cnt
from
test.tmp_demo_small as small -- small table, 3 rows
join
test.tmp_demo_big as big -- large table, 9 rows
on
small.pas_phone = big.pas_phone
where
small.age > 21
- Inspect the execution plan (read each stage from bottom to top, as a tree)
== Parsed Logical Plan == -- abstract syntax tree, produced by the ANTLR parser
Project [pas_phone#39, ord_id#40, age#38, ord_cnt#35L]
+- Project [pas_phone#39, ord_id#40, age#38, ord_cnt#35L, ord_cnt#35L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#39, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#35L], [pas_phone#39]
+- Project [pas_phone#39, ord_id#40, age#38] -- only the selected attributes are known here, not which table they belong to, nor their data types
+- Filter (age#38 > 21)
+- Join Inner, (pas_phone#37 = pas_phone#39)
:- SubqueryAlias small
: +- SubqueryAlias tmp_demo_small
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#37, age#38]
+- SubqueryAlias big
+- SubqueryAlias tmp_demo_big
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#39, ord_id#40, dt#41]
== Analyzed Logical Plan == -- resolved logical plan
pas_phone: int, ord_id: string, age: int, ord_cnt: bigint -- resolved data types
Project [pas_phone#39, ord_id#40, age#38, ord_cnt#35L]
+- Project [pas_phone#39, ord_id#40, age#38, ord_cnt#35L, ord_cnt#35L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#39, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#35L], [pas_phone#39]
+- Project [pas_phone#39, ord_id#40, age#38]
+- Filter (age#38 > 21)
+- Join Inner, (pas_phone#37 = pas_phone#39)
:- SubqueryAlias small
: +- SubqueryAlias tmp_demo_small
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#37, age#38]
+- SubqueryAlias big
+- SubqueryAlias tmp_demo_big
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#39, ord_id#40, dt#41]
== Optimized Logical Plan == -- logical optimization
Window [sum(1) windowspecdefinition(pas_phone#39, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#35L], [pas_phone#39]
+- Project [pas_phone#39, ord_id#40, age#38]
+- Join Inner, (pas_phone#37 = pas_phone#39)
:- Filter ((isnotnull(age#38) && (age#38 > 21)) && isnotnull(pas_phone#37)) -- predicate pushdown
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#37, age#38]
+- Project [pas_phone#39, ord_id#40]
+- Filter isnotnull(pas_phone#39)
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#39, ord_id#40, dt#41]
== Physical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#39, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#35L], [pas_phone#39]
+- *Sort [pas_phone#39 ASC NULLS FIRST], false, 0
+- Exchange(coordinator id: 449256327) hashpartitioning(pas_phone#39, 1000), coordinator[target post-shuffle partition size: 67108864]
+- *Project [pas_phone#39, ord_id#40, age#38]
+- *BroadcastHashJoin [pas_phone#37], [pas_phone#39], Inner, BuildRight -- BuildRight means the right table is broadcast
:- *Filter ((isnotnull(age#38) && (age#38 > 21)) && isnotnull(pas_phone#37))
: +- HiveTableScan [pas_phone#37, age#38], HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#37, age#38]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- *Filter isnotnull(pas_phone#39)
+- HiveTableScan [pas_phone#39, ord_id#40], HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#39, ord_id#40, dt#41]
- Use a hint to specify the table to broadcast
select
/*+ BROADCAST(small) */
big.pas_phone,
big.ord_id,
small.age,
sum(1) over(partition by big.pas_phone) as ord_cnt
from
test.tmp_demo_small as small -- small table, 3 rows
join
test.tmp_demo_big as big -- large table, 9 rows
on
small.pas_phone = big.pas_phone
where
small.age > 21
- Execution plan
== Parsed Logical Plan ==
Project [pas_phone#61, ord_id#62, age#60, ord_cnt#57L]
+- Project [pas_phone#61, ord_id#62, age#60, ord_cnt#57L, ord_cnt#57L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#61, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#57L], [pas_phone#61]
+- Project [pas_phone#61, ord_id#62, age#60]
+- Filter (age#60 > 21)
+- Join Inner, (pas_phone#59 = pas_phone#61)
:- ResolvedHint isBroadcastable=true
: +- SubqueryAlias small
: +- SubqueryAlias tmp_demo_small
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#59, age#60]
+- SubqueryAlias big
+- SubqueryAlias tmp_demo_big
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#61, ord_id#62, dt#63]
== Analyzed Logical Plan ==
pas_phone: int, ord_id: string, age: int, ord_cnt: bigint
Project [pas_phone#61, ord_id#62, age#60, ord_cnt#57L]
+- Project [pas_phone#61, ord_id#62, age#60, ord_cnt#57L, ord_cnt#57L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#61, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#57L], [pas_phone#61]
+- Project [pas_phone#61, ord_id#62, age#60]
+- Filter (age#60 > 21)
+- Join Inner, (pas_phone#59 = pas_phone#61)
:- ResolvedHint isBroadcastable=true
: +- SubqueryAlias small
: +- SubqueryAlias tmp_demo_small
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#59, age#60]
+- SubqueryAlias big
+- SubqueryAlias tmp_demo_big
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#61, ord_id#62, dt#63]
== Optimized Logical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#61, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#57L], [pas_phone#61]
+- Project [pas_phone#61, ord_id#62, age#60]
+- Join Inner, (pas_phone#59 = pas_phone#61)
:- ResolvedHint isBroadcastable=true -- the hint is still in effect after logical optimization
: +- Filter ((isnotnull(age#60) && (age#60 > 21)) && isnotnull(pas_phone#59))
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#59, age#60]
+- Project [pas_phone#61, ord_id#62]
+- Filter isnotnull(pas_phone#61)
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#61, ord_id#62, dt#63]
== Physical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#61, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ord_cnt#57L], [pas_phone#61]
+- *Sort [pas_phone#61 ASC NULLS FIRST], false, 0
+- Exchange(coordinator id: 1477200907) hashpartitioning(pas_phone#61, 1000), coordinator[target post-shuffle partition size: 67108864]
+- *Project [pas_phone#61, ord_id#62, age#60]
+- *BroadcastHashJoin [pas_phone#59], [pas_phone#61], Inner, BuildRight -- BuildRight: the right table is still the one broadcast
:- *Filter ((isnotnull(age#60) && (age#60 > 21)) && isnotnull(pas_phone#59))
: +- HiveTableScan [pas_phone#59, age#60], HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#59, age#60]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- *Filter isnotnull(pas_phone#61)
+- HiveTableScan [pas_phone#61, ord_id#62], HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#61, ord_id#62, dt#63]
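Everything looks normal along the way, and pushing the filter conditions down during logical optimization is consistent with rule-based optimization (RBO). The problem appears only when the physical plan is generated: in theory the planner should compare the two child plans and broadcast the smaller one. Why does that not happen? The problem most likely lies in how the join strategy is selected during physical planning, so let's go to the Spark 2.2.0 source and start reading from apply.
Location: spark-2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala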
object JoinSelection extends Strategy with PredicateHelper {
/**
* Matches a plan whose output should be small enough to be used in broadcast join.
*/
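// 3. canBroadcast(right): `right` is a LogicalPlan, i.e. the logical subtree for that table,
// which carries its statistics and metadata as well as any resolved hint.
// It returns true when either condition holds: a broadcast hint is present, or the
// estimated size of the subtree (here the right table, after filtering) is >= 0 and
// no larger than the threshold (10 MB by default).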
private def canBroadcast(plan: LogicalPlan): Boolean = {
plan.stats(conf).hints.isBroadcastable.getOrElse(false) ||
(plan.stats(conf).sizeInBytes >= 0 &&
plan.stats(conf).sizeInBytes <= conf.autoBroadcastJoinThreshold)
}
... (some code omitted)
// 2. canBuildRight(joinType) is evaluated and returns true for this inner join
private def canBuildRight(joinType: JoinType): Boolean = joinType match {
case _: InnerLike | LeftOuter | LeftSemi | LeftAnti => true
case j: ExistenceJoin => true
case _ => false
}
private def canBuildLeft(joinType: JoinType): Boolean = joinType match {
case _: InnerLike | RightOuter => true
case _ => false
}
def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
// --- BroadcastHashJoin --------------------------------------------------------------------
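// 1. Broadcast conditions: first check (2) canBuildRight(joinType), then (3) canBroadcast(right). When both are true, a broadcast join is planned with the right table as the build side, regardless of which table the hint names.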
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if canBuildRight(joinType) && canBroadcast(right) =>
Seq(joins.BroadcastHashJoinExec(
leftKeys, rightKeys, joinType, BuildRight, condition, planLater(left), planLater(right)))
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if canBuildLeft(joinType) && canBroadcast(left) =>
Seq(joins.BroadcastHashJoinExec(
leftKeys, rightKeys, joinType, BuildLeft, condition, planLater(left), planLater(right)))
// --- ShuffledHashJoin ---------------------------------------------------------------------
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if !conf.preferSortMergeJoin && canBuildRight(joinType) && canBuildLocalHashMap(right)
&& muchSmaller(right, left) ||
!RowOrdering.isOrderable(leftKeys) =>
...
// --- SortMergeJoin ------------------------------------------------------------
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if RowOrdering.isOrderable(leftKeys) =>
...
// --- Without joining keys ------------------------------------------------------------
...
case _ => Nil
}
}
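To make the matching order in the excerpt above concrete, here is a minimal plain-Scala model of it, restricted to an inner join (so canBuildRight and canBuildLeft are both true). It is a sketch and not Spark code: `Side`, `chooseBuildSide` and the fixed 10 MB `threshold` are hypothetical stand-ins for the LogicalPlan statistics and the configuration value.

```scala
object JoinSelection22Sketch {
  sealed trait BuildSide
  case object BuildLeft extends BuildSide
  case object BuildRight extends BuildSide

  // Hypothetical stand-in for the statistics of one side of the join.
  final case class Side(sizeInBytes: Long, hinted: Boolean = false)

  // Assumed threshold: the 10 MB default, in bytes.
  val threshold: Long = 10L * 1024 * 1024

  // Same shape as canBroadcast in the 2.2.0 excerpt: a hint OR a size within the threshold.
  def canBroadcast(s: Side): Boolean =
    s.hinted || (s.sizeInBytes >= 0 && s.sizeInBytes <= threshold)

  // Inner join only. The right side is tested first, mirroring the order of the
  // two BroadcastHashJoin case clauses; the left side is reached only if the right fails.
  def chooseBuildSide(left: Side, right: Side): Option[BuildSide] =
    if (canBroadcast(right)) Some(BuildRight)
    else if (canBroadcast(left)) Some(BuildLeft)
    else None

  def main(args: Array[String]): Unit = {
    // Hinted small left table, unhinted but still small right table.
    println(chooseBuildSide(Side(21, hinted = true), Side(153)))
    // Same hint, but the right table is now above the threshold.
    println(chooseBuildSide(Side(21, hinted = true), Side(threshold + 1)))
  }
}
```

In this model the first call yields BuildRight even though the left side carries the hint, and only the second call, where the right side is too large, yields BuildLeft.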
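This explains why the hint has no effect in Spark 2.2.0. When choosing the join strategy, broadcast join is considered first, and the pattern match tries the right side first. So as long as the right table is small enough to satisfy the broadcast rule, the right table is broadcast, whether or not a hint exists and whichever side the hint names. Only when the right table is too large and carries no hint of its own does matching fall through to the second case, which checks whether the left table qualifies and, if so, broadcasts it. Now let's run the same query on 2.4.3 and see what happens.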
select
/*+ BROADCAST(small) */
big.pas_phone,
big.ord_id,
small.age,
sum(1) over(partition by big.pas_phone) as ord_cnt
from
test.tmp_demo_small as small -- small table, 3 rows
join
test.tmp_demo_big as big -- large table, 9 rows
on
small.pas_phone = big.pas_phone
where
small.age > 21
- Execution plan
== Parsed Logical Plan ==
Project [pas_phone#4, ord_id#5, age#3, ord_cnt#0L]
+- Project [pas_phone#4, ord_id#5, age#3, ord_cnt#0L, ord_cnt#0L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#4, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#0L], [pas_phone#4]
+- Project [pas_phone#4, ord_id#5, age#3]
+- Filter (age#3 > 21)
+- Join Inner, (pas_phone#2 = pas_phone#4)
:- ResolvedHint (broadcast)
: +- SubqueryAlias `small`
: +- SubqueryAlias `test`.`tmp_demo_small`
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#2, age#3]
+- SubqueryAlias `big`
+- SubqueryAlias `test`.`tmp_demo_big`
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#4, ord_id#5, dt#6]
== Analyzed Logical Plan ==
pas_phone: int, ord_id: string, age: int, ord_cnt: bigint
Project [pas_phone#4, ord_id#5, age#3, ord_cnt#0L]
+- Project [pas_phone#4, ord_id#5, age#3, ord_cnt#0L, ord_cnt#0L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#4, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#0L], [pas_phone#4]
+- Project [pas_phone#4, ord_id#5, age#3]
+- Filter (age#3 > 21)
+- Join Inner, (pas_phone#2 = pas_phone#4)
:- ResolvedHint (broadcast)
: +- SubqueryAlias `small`
: +- SubqueryAlias `test`.`tmp_demo_small`
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#2, age#3]
+- SubqueryAlias `big`
+- SubqueryAlias `test`.`tmp_demo_big`
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#4, ord_id#5, dt#6]
== Optimized Logical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#4, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#0L], [pas_phone#4]
+- Project [pas_phone#4, ord_id#5, age#3]
+- Join Inner, (pas_phone#2 = pas_phone#4)
:- ResolvedHint (broadcast) -- the hint is resolved and marks the table to broadcast
: +- Filter ((isnotnull(age#3) && (age#3 > 21)) && isnotnull(pas_phone#2))
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#2, age#3]
+- Project [pas_phone#4, ord_id#5]
+- Filter isnotnull(pas_phone#4)
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#4, ord_id#5, dt#6]
== Physical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#4, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#0L], [pas_phone#4]
+- *(3) Sort [pas_phone#4 ASC NULLS FIRST], false, 0
+- Exchange(coordinator id: 632554218) hashpartitioning(pas_phone#4, 1000), coordinator[target post-shuffle partition size: 67108864]
+- *(2) Project [pas_phone#4, ord_id#5, age#3]
+- *(2) BroadcastHashJoin [pas_phone#2], [pas_phone#4], Inner, BuildLeft -- BuildLeft: the hint takes effect
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
: +- *(1) Filter ((isnotnull(age#3) && (age#3 > 21)) && isnotnull(pas_phone#2))
: +- Scan hive test.tmp_demo_small [pas_phone#2, age#3], HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#2, age#3]
+- *(2) Filter isnotnull(pas_phone#4)
+- Scan hive test.tmp_demo_big [pas_phone#4, ord_id#5], HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#4, ord_id#5, dt#6]
- No broadcast table specified, default join
select
big.pas_phone,
big.ord_id,
small.age,
sum(1) over(partition by big.pas_phone) as ord_cnt
from
test.tmp_demo_small as small -- small table, 3 rows
join
test.tmp_demo_big as big -- large table, 9 rows
on
small.pas_phone = big.pas_phone
where
small.age > 21
- Execution plan
== Parsed Logical Plan ==
Project [pas_phone#15, ord_id#16, age#14, ord_cnt#11L]
+- Project [pas_phone#15, ord_id#16, age#14, ord_cnt#11L, ord_cnt#11L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#15, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#11L], [pas_phone#15]
+- Project [pas_phone#15, ord_id#16, age#14]
+- Filter (age#14 > 21)
+- Join Inner, (pas_phone#13 = pas_phone#15)
:- SubqueryAlias `small`
: +- SubqueryAlias `test`.`tmp_demo_small`
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#13, age#14]
+- SubqueryAlias `big`
+- SubqueryAlias `test`.`tmp_demo_big`
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#15, ord_id#16, dt#17]
== Analyzed Logical Plan ==
pas_phone: int, ord_id: string, age: int, ord_cnt: bigint
Project [pas_phone#15, ord_id#16, age#14, ord_cnt#11L]
+- Project [pas_phone#15, ord_id#16, age#14, ord_cnt#11L, ord_cnt#11L]
+- Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#15, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#11L], [pas_phone#15]
+- Project [pas_phone#15, ord_id#16, age#14]
+- Filter (age#14 > 21)
+- Join Inner, (pas_phone#13 = pas_phone#15)
:- SubqueryAlias `small`
: +- SubqueryAlias `test`.`tmp_demo_small`
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#13, age#14]
+- SubqueryAlias `big`
+- SubqueryAlias `test`.`tmp_demo_big`
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#15, ord_id#16, dt#17]
== Optimized Logical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#15, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#11L], [pas_phone#15]
+- Project [pas_phone#15, ord_id#16, age#14]
+- Join Inner, (pas_phone#13 = pas_phone#15)
:- Filter ((isnotnull(age#14) && (age#14 > 21)) && isnotnull(pas_phone#13))
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#13, age#14]
+- Project [pas_phone#15, ord_id#16]
+- Filter isnotnull(pas_phone#15)
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#15, ord_id#16, dt#17]
== Physical Plan ==
Window [sum(1) windowspecdefinition(pas_phone#15, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#11L], [pas_phone#15]
+- *(3) Sort [pas_phone#15 ASC NULLS FIRST], false, 0
+- Exchange(coordinator id: 1731877543) hashpartitioning(pas_phone#15, 1000), coordinator[target post-shuffle partition size: 67108864]
+- *(2) Project [pas_phone#15, ord_id#16, age#14]
+- *(2) BroadcastHashJoin [pas_phone#13], [pas_phone#15], Inner, BuildLeft -- the left table is broadcast
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
: +- *(1) Filter ((isnotnull(age#14) && (age#14 > 21)) && isnotnull(pas_phone#13))
: +- Scan hive test.tmp_demo_small [pas_phone#13, age#14], HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#13, age#14]
+- *(2) Filter isnotnull(pas_phone#15)
+- Scan hive test.tmp_demo_big [pas_phone#15, ord_id#16], HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#15, ord_id#16, dt#17]
This is getting interesting. Let's look at the 2.4.3 source.
Location: spark-2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
object JoinSelection extends Strategy with PredicateHelper {
/**
* Matches a plan whose output should be small enough to be used in broadcast join.
*/
private def canBroadcast(plan: LogicalPlan): Boolean = {
plan.stats.sizeInBytes >= 0 && plan.stats.sizeInBytes <= conf.autoBroadcastJoinThreshold
}
/**
* Matches a plan whose single partition should be small enough to build a hash table.
*
* Note: this assume that the number of partition is fixed, requires additional work if it's
* dynamic.
*/
private def canBuildLocalHashMap(plan: LogicalPlan): Boolean = {
plan.stats.sizeInBytes < conf.autoBroadcastJoinThreshold * conf.numShufflePartitions
}
/**
* Returns whether plan a is much smaller (3X) than plan b.
*
* The cost to build hash map is higher than sorting, we should only build hash map on a table
* that is much smaller than other one. Since we does not have the statistic for number of rows,
* use the size of bytes here as estimation.
*/
private def muchSmaller(a: LogicalPlan, b: LogicalPlan): Boolean = {
a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes
}
private def canBuildRight(joinType: JoinType): Boolean = joinType match {
case _: InnerLike | LeftOuter | LeftSemi | LeftAnti | _: ExistenceJoin => true
case _ => false
}
private def canBuildLeft(joinType: JoinType): Boolean = joinType match {
case _: InnerLike | RightOuter => true
case _ => false
}
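// 3. Picks the side that is allowed to be built; the sizes of the left and right tables are compared only when both sides (or neither) qualify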
private def broadcastSide(
canBuildLeft: Boolean,
canBuildRight: Boolean,
left: LogicalPlan,
right: LogicalPlan): BuildSide = {
def smallerSide =
if (right.stats.sizeInBytes <= left.stats.sizeInBytes) BuildRight else BuildLeft
if (canBuildRight && canBuildLeft) {
// Broadcast smaller side base on its estimated physical size
// if both sides have broadcast hint
smallerSide
} else if (canBuildRight) {
BuildRight
} else if (canBuildLeft) {
BuildLeft
} else {
// for the last default broadcast nested loop join
smallerSide
}
}
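// 1. canBroadcastByHints(joinType, left, right) combines canBuildLeft(joinType) and canBuildRight(joinType) with the broadcast hint on the matching side; it is enough for one side to be true. The join types are basically all covered; the main purpose is to detect a broadcast hint on the left or right subtree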
private def canBroadcastByHints(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
: Boolean = {
val buildLeft = canBuildLeft(joinType) && left.stats.hints.broadcast
val buildRight = canBuildRight(joinType) && right.stats.hints.broadcast
buildLeft || buildRight
}
// 2. broadcastSideByHints(joinType, left, right) then calls broadcastSide (3), which takes the hinted side, or the smaller table when both sides are hinted
private def broadcastSideByHints(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
: BuildSide = {
val buildLeft = canBuildLeft(joinType) && left.stats.hints.broadcast
val buildRight = canBuildRight(joinType) && right.stats.hints.broadcast
broadcastSide(buildLeft, buildRight, left, right)
}
private def canBroadcastBySizes(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
: Boolean = {
val buildLeft = canBuildLeft(joinType) && canBroadcast(left)
val buildRight = canBuildRight(joinType) && canBroadcast(right)
buildLeft || buildRight
}
private def broadcastSideBySizes(joinType: JoinType, left: LogicalPlan, right: LogicalPlan)
: BuildSide = {
val buildLeft = canBuildLeft(joinType) && canBroadcast(left)
val buildRight = canBuildRight(joinType) && canBroadcast(right)
broadcastSide(buildLeft, buildRight, left, right)
}
def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
// Two cases are distinguished: a hint was specified, or no hint was specified
// --- BroadcastHashJoin --------------------------------------------------------------------
// broadcast hints were specified
// With a hint: canBroadcastByHints(joinType, left, right) (1) being true only means that a hint exists and the join type allows building on that side; broadcastSideByHints(joinType, left, right) (2) is then called to decide which table to broadcast
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if canBroadcastByHints(joinType, left, right) =>
val buildSide = broadcastSideByHints(joinType, left, right)
Seq(joins.BroadcastHashJoinExec(
leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))
// broadcast hints were not specified, so need to infer it from size and configuration.
// Without a hint: the sizes of the two tables decide which one is broadcast (subject to the preconditions above)
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if canBroadcastBySizes(joinType, left, right) =>
val buildSide = broadcastSideBySizes(joinType, left, right)
Seq(joins.BroadcastHashJoinExec(
leftKeys, rightKeys, joinType, buildSide, condition, planLater(left), planLater(right)))
// --- ShuffledHashJoin ---------------------------------------------------------------------
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if !conf.preferSortMergeJoin && canBuildRight(joinType) && canBuildLocalHashMap(right)
&& muchSmaller(right, left) ||
!RowOrdering.isOrderable(leftKeys) =>
...
// --- SortMergeJoin ------------------------------------------------------------
case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
if RowOrdering.isOrderable(leftKeys) =>
...
// --- Without joining keys ----------------------------------------------------------
...
}
}
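The two-stage decision in the 2.4 excerpt above (hints first, sizes second) can be modeled in the same spirit. This is again a plain-Scala sketch for an inner join only, not Spark code: `Side`, `pick`, `chooseBuildSide` and the fixed 10 MB `threshold` are hypothetical names and values chosen for illustration.

```scala
object JoinSelection24Sketch {
  sealed trait BuildSide
  case object BuildLeft extends BuildSide
  case object BuildRight extends BuildSide

  // Hypothetical stand-in for the statistics of one side of the join.
  final case class Side(sizeInBytes: Long, hinted: Boolean = false)

  // Assumed threshold: the 10 MB default, in bytes.
  val threshold: Long = 10L * 1024 * 1024

  private def fits(s: Side): Boolean =
    s.sizeInBytes >= 0 && s.sizeInBytes <= threshold

  // Same branching as broadcastSide: when both sides qualify, take the smaller one.
  private def pick(buildLeft: Boolean, buildRight: Boolean, left: Side, right: Side): BuildSide = {
    def smaller: BuildSide =
      if (right.sizeInBytes <= left.sizeInBytes) BuildRight else BuildLeft
    if (buildLeft && buildRight) smaller
    else if (buildRight) BuildRight
    else if (buildLeft) BuildLeft
    else smaller
  }

  // Inner join only: hints are consulted first; sizes are used when no hint is present.
  def chooseBuildSide(left: Side, right: Side): Option[BuildSide] =
    if (left.hinted || right.hinted)
      Some(pick(left.hinted, right.hinted, left, right))
    else if (fits(left) || fits(right))
      Some(pick(fits(left), fits(right), left, right))
    else None

  def main(args: Array[String]): Unit = {
    println(chooseBuildSide(Side(21, hinted = true), Side(153))) // hint on the left side
    println(chooseBuildSide(Side(21), Side(153)))                // no hint, sizes decide
  }
}
```

Both calls give BuildLeft in this model, matching the two 2.4.3 physical plans shown earlier: the hinted side wins when a hint exists, and otherwise the smaller side is chosen.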
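To sum up
- On Spark 2.2.2, when a small table joins a small table (both satisfy the broadcast condition), a hint naming the broadcast table has no effect and the right table is broadcast; without a hint, the right table is broadcast as well.
- On Spark 2.4.3, the broadcast table can be specified for an inner join, even when it exceeds the broadcast threshold (watch out for the OOM risk); without a hint, the smaller table is broadcast, provided it is within the broadcast threshold.
- Broadcast hash join does not support full outer join. For right outer join only the left table can be broadcast; for left outer join, left semi join, left anti join and the internal join type ExistenceJoin only the right table can be broadcast; for inner join the broadcast side can be specified.
- For the trigger conditions of the other join strategies, see: SparkSQL-有必要坐下來聊聊Join, and Spark SQL 之 Join 實現
Finally, two hand-drawn sketches that contrast how 2.2 and 2.4 decide the broadcast join side
[Figure: Spark 2.2.0]
[Figure: Spark 2.4.2]
By the way
I originally ran into a problem with HAVING: the query ran fine locally, but after packaging it and submitting it to the cluster with spark-submit it failed with a puzzling error.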
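The join-type rules in the list above can be written down as a small lookup. The sketch below is a simplified plain-Scala model, not Spark's own type hierarchy: the `JoinType` objects are hypothetical, `Inner` stands for the whole inner-like family, and ExistenceJoin (which behaves like the left-side-preserving joins) is left out for brevity.

```scala
object BuildSideByJoinType {
  // Simplified, hypothetical join types for illustration only.
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object FullOuter extends JoinType
  case object LeftSemi extends JoinType
  case object LeftAnti extends JoinType

  // The right side may be broadcast for inner and left-preserving joins.
  def canBuildRight(j: JoinType): Boolean = j match {
    case Inner | LeftOuter | LeftSemi | LeftAnti => true
    case _ => false
  }

  // The left side may be broadcast for inner and right outer joins.
  def canBuildLeft(j: JoinType): Boolean = j match {
    case Inner | RightOuter => true
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    val all: Seq[JoinType] = Seq(Inner, LeftOuter, RightOuter, FullOuter, LeftSemi, LeftAnti)
    all.foreach { j =>
      println(s"$j: buildLeft=${canBuildLeft(j)}, buildRight=${canBuildRight(j)}")
    }
  }
}
```

Only the inner join returns true for both sides, which is why it is the only case where a hint can actually choose between left and right; full outer join returns false for both, so broadcast hash join is not available for it.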
select
big.pas_phone,
big.ord_id,
small.age,
sum(1) over(partition by big.pas_phone) as ord_cnt
from
test.tmp_demo_small as small -- small table, 3 rows
join
test.tmp_demo_big as big -- large table, 9 rows
on
small.pas_phone = big.pas_phone
where
small.age > 21
having
ord_cnt > 2
Error in query: grouping expressions sequence is empty, and 'big.`pas_phone`' is not an aggregate function. Wrap '()' in windowing function(s) or wrap 'big.`pas_phone`' in first() (or first_value) if you don't care which value you get.;;
'Project [pas_phone#26, ord_id#27, age#25, ord_cnt#22L]
+- 'Project [pas_phone#26, ord_id#27, age#25, ord_cnt#22L, ord_cnt#22L]
+- 'Window [sum(cast(1 as bigint)) windowspecdefinition(pas_phone#26, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS ord_cnt#22L], [pas_phone#26]
+- 'Filter ('ord_cnt > 2)
+- Aggregate [pas_phone#26, ord_id#27, age#25]
+- Filter (age#25 > 21)
+- Join Inner, (pas_phone#24 = pas_phone#26)
:- SubqueryAlias `small`
: +- SubqueryAlias `test`.`tmp_demo_small`
: +- HiveTableRelation `test`.`tmp_demo_small`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#24, age#25]
+- SubqueryAlias `big`
+- SubqueryAlias `test`.`tmp_demo_big`
+- HiveTableRelation `test`.`tmp_demo_big`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [pas_phone#26, ord_id#27, dt#28]
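The cause turned out to be this. When I ran the query locally I was using yarn-client mode, so the driver was my own server, which has Spark 2.2.2, and the query ran without problems. The driver builds the DAG, splits the job into tasks and so on, but all of that happens after the SQL has been turned into RDDs; the step before it, parsing the SQL (SQL -> RDD), is also done by the driver. The cluster submission used yarn-cluster mode, where the driver runs on one of the cluster machines. Awkwardly, the company had upgraded the cluster to 2.4.3, so the environment parsing the SQL no longer matched my local one. I then looked up the release notes of the newer Spark version: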
In Spark version 2.3 and earlier, HAVING without GROUP BY is treated as WHERE. This means, SELECT 1 FROM range(10) HAVING true is executed as SELECT 1 FROM range(10) WHERE true and returns 10 rows. This violates SQL standard, and has been fixed in Spark 2.4. Since Spark 2.4, HAVING without GROUP BY is treated as a global aggregate, which means SELECT 1 FROM range(10) HAVING true will return only one row. To restore the previous behavior, set spark.sql.legacy.parser.havingWithoutGroupByAsWhere to true.
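The difference described in the note can be mimicked with plain Scala collections for the `SELECT 1 FROM range(10) HAVING true` example. This is only a model of the two semantics, not Spark code, and the function names are hypothetical.

```scala
object HavingWithoutGroupBy {
  // The ten rows produced by range(10).
  val input: Seq[Int] = 0 until 10

  // Spark 2.3 and earlier: HAVING is treated as WHERE, so each input row is
  // tested individually and every row that passes yields the literal 1.
  def havingAsWhere(rows: Seq[Int], predicate: Boolean): Seq[Int] =
    rows.filter(_ => predicate).map(_ => 1)

  // Spark 2.4 and later: HAVING implies a global aggregate, so the whole input
  // collapses into a single group; the predicate then keeps or drops that one row.
  // The input rows themselves no longer influence the number of output rows.
  def havingAsGlobalAggregate(rows: Seq[Int], predicate: Boolean): Seq[Int] =
    if (predicate) Seq(1) else Seq.empty

  def main(args: Array[String]): Unit = {
    println(havingAsWhere(input, predicate = true).size)
    println(havingAsGlobalAggregate(input, predicate = true).size)
  }
}
```

The first function returns ten rows and the second returns one. Under the global-aggregate reading, a plain column such as big.pas_phone in the select list is neither grouped nor aggregated, which is consistent with the "grouping expressions sequence is empty" error shown above.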
Cool, problem solved and the cause found. If you want the 2.2 behavior back without rewriting the whole query, put set spark.sql.legacy.parser.havingWithoutGroupByAsWhere=true; in front of it.
Fix
set spark.sql.legacy.parser.havingWithoutGroupByAsWhere=true;
select
big.pas_phone,
big.ord_id,
small.age,
sum(1) over(partition by big.pas_phone) as ord_cnt
from
test.tmp_demo_small as small -- small table, 3 rows
join
test.tmp_demo_big as big -- large table, 9 rows
on
small.pas_phone = big.pas_phone
where
small.age > 21
having
ord_cnt > 2
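The query can also be expressed without relying on the legacy flag: compute the window count first and filter on it afterwards (in SQL this would typically be a subquery with an outer WHERE). The sketch below emulates that order of operations with plain Scala collections on the sample rows from the setup section. It is an illustration rather than Spark code, and the case classes and `run` are hypothetical names.

```scala
object WindowThenFilter {
  final case class Small(pasPhone: Int, age: Int)
  final case class Big(pasPhone: Int, ordId: String, dt: String)
  final case class Row(pasPhone: Int, ordId: String, age: Int, ordCnt: Long)

  // Sample data copied from the two Hive tables in the setup section.
  val small: Seq[Small] = Seq(Small(156, 20), Small(157, 22), Small(158, 15))
  val big: Seq[Big] = Seq(
    Big(156, "aa1", "20191111"), Big(156, "aa2", "20191112"),
    Big(157, "bb1", "20191111"), Big(157, "bb2", "20191112"),
    Big(157, "bb3", "20191113"), Big(157, "bb4", "20191114"),
    Big(158, "cc1", "20191111"), Big(158, "cc2", "20191112"),
    Big(158, "cc3", "20191113"))

  def run(minCnt: Long): Seq[Row] = {
    // Inner join on pas_phone, with the WHERE small.age > 21 filter applied.
    val joined = for {
      s <- small if s.age > 21
      b <- big if b.pasPhone == s.pasPhone
    } yield (b.pasPhone, b.ordId, s.age)

    // sum(1) over (partition by pas_phone): the row count per phone, attached to every row.
    val cntByPhone: Map[Int, Long] =
      joined.groupBy(_._1).map { case (phone, rows) => phone -> rows.size.toLong }

    joined
      .map { case (phone, ordId, age) => Row(phone, ordId, age, cntByPhone(phone)) }
      .filter(_.ordCnt > minCnt) // the outer filter that HAVING was standing in for
  }

  def main(args: Array[String]): Unit =
    run(2).foreach(println)
}
```

On the sample data only phone 157 passes the age filter, so `run(2)` returns its four orders (bb1 to bb4), each with an ord_cnt of 4.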
Appendix
/**
* Select the proper physical plan for join based on joining keys and size of logical plan.
*
* At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where at least some of the
* predicates can be evaluated by matching join keys. If found, join implementations are chosen
* with the following precedence:
*
* - Broadcast hash join (BHJ):
* BHJ is not supported for full outer join. For right outer join, we only can broadcast the
* left side. For left outer, left semi, left anti and the internal join type ExistenceJoin,
* we only can broadcast the right side. For inner like join, we can broadcast both sides.
* Normally, BHJ can perform faster than the other join algorithms when the broadcast side is
* small. However, broadcasting tables is a network-intensive operation. It could cause OOM
* or perform worse than the other join algorithms, especially when the build/broadcast side
* is big.
*
* For the supported cases, users can specify the broadcast hint (e.g. the user applied the
* [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame) and session-based
* [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold to adjust whether BHJ is used and
* which join side is broadcast.
*
* 1) Broadcast the join side with the broadcast hint, even if the size is larger than
* [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]]. If both sides have the hint (only when the type
* is inner like join), the side with a smaller estimated physical size will be broadcast.
* 2) Respect the [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold and broadcast the side
* whose estimated physical size is smaller than the threshold. If both sides are below the
* threshold, broadcast the smaller side. If neither is smaller, BHJ is not used.
*
* - Shuffle hash join: if the average size of a single partition is small enough to build a hash
* table.
*
* - Sort merge: if the matching join keys are sortable.
*
* If there is no joining keys, Join implementations are chosen with the following precedence:
* - BroadcastNestedLoopJoin (BNLJ):
* BNLJ supports all the join types but the impl is OPTIMIZED for the following scenarios:
* For right outer join, the left side is broadcast. For left outer, left semi, left anti
* and the internal join type ExistenceJoin, the right side is broadcast. For inner like
* joins, either side is broadcast.
*
* Like BHJ, users still can specify the broadcast hint and session-based
* [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold to impact which side is broadcast.
*
* 1) Broadcast the join side with the broadcast hint, even if the size is larger than
* [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]]. If both sides have the hint (i.e., just for
* inner-like join), the side with a smaller estimated physical size will be broadcast.
* 2) Respect the [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold and broadcast the side
* whose estimated physical size is smaller than the threshold. If both sides are below the
* threshold, broadcast the smaller side. If neither is smaller, BNLJ is not used.
*
* - CartesianProduct: for inner like join, CartesianProduct is the fallback option.
*
* - BroadcastNestedLoopJoin (BNLJ):
* For the other join types, BNLJ is the fallback option. Here, we just pick the broadcast
* side with the broadcast hint. If neither side has a hint, we broadcast the side with
* the smaller estimated physical size.
*/