Spark: JOIN Types and How a Join Strategy Is Chosen

Reference article:

 

1. Spark join types (3+) and the basis for join selection

https://blog.csdn.net/rlnLo2pNEfx9c/article/details/106066081

 

 

Internally, Spark implements joins in roughly three ways:

1. BroadcastHashJoin

2. ShuffledHashJoin

3. SortMergeJoin

 

 

1. BroadcastHashJoin

    If you browse the source, you will find that up through Spark 1.6, BroadcastHashJoin was implemented with an ordinary Java HashMap. If you are curious, search the Spark 1.6 source for BroadcastHashJoin and HashedRelation and read through it.

    Concretely, the driver looks at the table statistics; when it finds that one table is small enough to qualify for broadcast, it collects that table to the driver, builds a HashedRelation from it, and then broadcasts it.

    This is essentially the same thing as broadcasting a HashMap yourself when working with Spark Streaming.

    One point worth stressing: the maximum row count and maximum byte size limits inside this path are not the user-configured auto-broadcast threshold; they are limits of the internal storage structure.
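As a rough analogy (not Spark's internal code), the same idea can be sketched with plain RDDs and a broadcast variable; `sc` below stands for an existing SparkContext and the tiny datasets are made up for illustration:

// A minimal sketch of the broadcast-hash-join idea with plain RDDs.
// Spark itself builds a HashedRelation on the driver and broadcasts that instead.
val small = sc.parallelize(Seq((1, "a"), (2, "b")))          // small side: (key, value)
val big   = sc.parallelize(Seq((1, 10), (2, 20), (3, 30)))   // big side:   (key, value)

// Collect the small side to the driver, build a hash map, then broadcast it.
val smallMap = sc.broadcast(small.collectAsMap())

// Map-side join: every partition of the big side probes the broadcast map locally,
// so the big side is never shuffled.
val joined = big.mapPartitions { iter =>
  iter.flatMap { case (k, v) => smallMap.value.get(k).map(s => (k, (v, s))) }
}
joined.collect().foreach(println)   // (1,(10,a)), (2,(20,b))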

 

 

2. ShuffledHashJoin

    BroadcastHashJoin suits a large-table/small-table join, where the entire small table is broadcast. In many cases, though, neither table is small enough to broadcast or to hold in memory as a whole, yet after being split into partitions each piece can fit in memory and be built into a HashedRelation. That is divide and conquer: repartition both tables on the join condition with the same hash partitioner and the same number of partitions, so that matching join keys land in the same partition, and then run a local join within each partition.
 

In other words, ShuffledHashJoin boils down to three key points (see the sketch after the list):

  1. One of the two tables is relatively small, so that after the split a local join is feasible.
  2. Both tables are partitioned on the join key with the same partitioner and the same number of partitions; this constrains each join key to a single partition, so the join never depends on other partitions.
  3. A HashedRelation is built from each partition of the smaller table, after which the local hash join can be executed; see the ShuffledHashJoinExec code.

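Under those assumptions, the two shuffle-then-local-join steps can be sketched with plain RDD operations. This is only an illustration of the divide-and-conquer idea, not the code of ShuffledHashJoinExec, and it again assumes an existing SparkContext `sc`:

import org.apache.spark.HashPartitioner

// Shuffle both sides with the SAME partitioner and partition count,
// so equal join keys land in the same partition index on both sides.
val part      = new HashPartitioner(10)
val smallSide = sc.parallelize(1 to 100).map(i => (i % 20, s"s$i")).partitionBy(part)
val bigSide   = sc.parallelize(1 to 10000).map(i => (i % 20, i)).partitionBy(part)

// Per-partition local hash join: build a hash table from the small side's
// partition, then stream the big side's partition against it.
val joined = bigSide.zipPartitions(smallSide) { (bigIter, smallIter) =>
  val hashed = smallIter.toSeq.groupBy(_._1)
  bigIter.flatMap { case (k, v) => hashed.getOrElse(k, Nil).map { case (_, s) => (k, (v, s)) } }
}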

 

 

3. SortMergeJoin

   Both cases above rely on a small table fitting into memory, or a medium table fitting after divide-and-conquer partitioning, so that a local hash join can be built. Keeping table data in memory is a luxury, which is why joins so often end in OOM. "Small" and "medium" are always relative to the memory available; with unlimited memory anything goes.

   So what about joining a large table with another large table? That is where SortMergeJoin comes in.

The basic steps of SortMergeJoin are as follows (a merge-step sketch follows the list):

  1. Repartition both tables with the same partitioner and the same number of partitions, so that identical keys from both tables land in the same partition.
  2. Within each partition, sort each table's data by the join key.
  3. Join the two sorted partition streams. The join itself is simple: walk both sorted sequences; when the join keys match, merge and emit the rows, otherwise advance the side with the smaller key.
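A minimal sketch of step 3, the merge over two key-sorted sequences (an inner equi-join on an Int key; illustrative only, not the SortMergeJoinExec code):

// Merge-join two sequences that are already sorted by key.
def mergeJoin[A, B](left: Seq[(Int, A)], right: Seq[(Int, B)]): Seq[(Int, (A, B))] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[(Int, (A, B))]
  var i = 0
  var j = 0
  while (i < left.length && j < right.length) {
    val lk = left(i)._1
    val rk = right(j)._1
    if (lk == rk) {
      // Emit every pairing of rows that share this key, then advance both sides past it.
      val li = left.indexWhere(_._1 != lk, i) match { case -1 => left.length; case n => n }
      val rj = right.indexWhere(_._1 != rk, j) match { case -1 => right.length; case n => n }
      for (x <- i until li; y <- j until rj) out += ((lk, (left(x)._2, right(y)._2)))
      i = li; j = rj
    } else if (lk < rk) i += 1   // advance the side with the smaller key
    else j += 1
  }
  out.toSeq
}

// Example: both inputs are sorted by key.
mergeJoin(Seq((1, "a"), (2, "b"), (4, "c")), Seq((2, 20), (3, 30), (4, 40)))
// => Seq((2, ("b", 20)), (4, ("c", 40)))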


 

 

4. How Spark chooses a join strategy

 

1) Spark 3.0+, hint-based

 

If the user supplies join hints in Spark SQL, Spark will first try the join strategy indicated by the hint.

 

BroadcastHashJoin

The hint is written as follows:

-- BROADCAST, BROADCASTJOIN and MAPJOIN all express the broadcast hint

SELECT /*+ BROADCAST(r) */  *  FROM records r JOIN src s ON r.key = s.key
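The same hint can also be given through the DataFrame/Dataset API; `records` and `src` below stand for DataFrames holding the two tables from the SQL above:

import org.apache.spark.sql.functions.broadcast

// Either form asks Spark to broadcast `src` and use BroadcastHashJoin.
val joined1 = records.join(broadcast(src), records("key") === src("key"))
val joined2 = records.join(src.hint("broadcast"), records("key") === src("key"))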

 

ShuffledHashJoin

The SQL hint is written as follows:

-- SHUFFLE_HASH expresses the ShuffledHashJoin hint

SELECT /*+ SHUFFLE_HASH(r) */  *  FROM records r JOIN src s ON r.key = s.key

 

SortMergeJoin

The SQL hint is written as follows:

-- SHUFFLE_MERGE, MERGE and MERGEJOIN all express the SortMergeJoin hint

SELECT /*+ MERGEJOIN(r) */  * FROM records r JOIN src s ON r.key = s.key

 

 

2) Without hints

 

The default decision rules are as follows.

Step1

1. First check: if the join-table statistics show that one table's size is non-negative and no larger than the user-configured auto-broadcast threshold, that table is broadcast.

plan.stats.sizeInBytes >= 0 && plan.stats.sizeInBytes <= conf.autoBroadcastJoinThreshold

The threshold is the parameter spark.sql.autoBroadcastJoinThreshold.

If both tables qualify for broadcast, the smaller one is chosen.
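For reference, the threshold can be adjusted at runtime; the 10 MB value below is only an example, and setting it to -1 disables automatic broadcast:

// Raise the auto-broadcast threshold to 10 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)
// spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // disable auto broadcast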

 

Step2

2. If broadcast does not apply, Spark next checks whether ShuffledHashJoin can be used. First, the following parameter must be set to false (its default is true):

spark.sql.join.preferSortMergeJoin=false

 

Two further conditions follow from the table statistics:

i. The table's size in bytes is less than the broadcast threshold multiplied by the total number of shuffle partitions:

plan.stats.sizeInBytes < conf.autoBroadcastJoinThreshold * conf.numShufflePartitions

ii. And that table's size multiplied by 3 is no larger than the other table's size:

a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes

 

A table that satisfies these conditions is suitable, after the divide-and-conquer split, for building the per-partition local hash table.
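Putting Step 2's checks together, a simplified and hypothetical rendition of the build-side test looks roughly like this (the real logic lives in the JoinSelection object discussed below):

// A simplified, hypothetical rendition of the build-side test for ShuffledHashJoin.
// sizeA / sizeB are the statistics-based sizes in bytes of the candidate build side and the other side.
def canUseShuffledHashJoin(sizeA: BigInt, sizeB: BigInt,
                           autoBroadcastThreshold: Long,
                           numShufflePartitions: Int,
                           preferSortMergeJoin: Boolean): Boolean = {
  val canBuildLocalHashMap = sizeA < BigInt(autoBroadcastThreshold) * numShufflePartitions
  val muchSmallerThanOther = sizeA * 3 <= sizeB
  !preferSortMergeJoin && canBuildLocalHashMap && muchSmallerThanOther
}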

 

Step3

3. If neither broadcast nor ShuffledHashJoin applies, Spark checks whether SortMergeJoin can be used. The condition is simple: the join keys must be orderable.

def createSortMergeJoin() = {
  if (RowOrdering.isOrderable(leftKeys)) {
    Some(Seq(
      joins.SortMergeJoinExec(
        leftKeys, rightKeys, joinType, condition,
        planLater(left), planLater(right))))
  } else {
    None
  }
}

This code lives in the JoinSelection object inside the SparkStrategies class. The hint-driven dispatch in the same place chains the candidate strategies in priority order:

createBroadcastHashJoin(hintToBroadcastLeft(hint), hintToBroadcastRight(hint))
  .orElse {
    if (hintToSortMergeJoin(hint)) createSortMergeJoin()
    else None
  }
  .orElse(createShuffleHashJoin(hintToShuffleHashLeft(hint), hintToShuffleHashRight(hint)))
  .orElse {
    if (hintToShuffleReplicateNL(hint)) createCartesianProduct()
    else None
  }
  .getOrElse(createJoinWithoutHint())

 

5. Support for equi-joins and non-equi-joins in Spark's join strategies

   Of course, all three joins above are equi-joins. Earlier Spark versions supported only equi-joins, yet non-equi joins do show up in real business logic. Spark currently has two implementations that handle non-equi joins; because of how they work, they can easily run out of memory:

 

Broadcast nested loop join and shuffle-and-replicate nested loop join.
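As an illustration, a purely range-based predicate like the hypothetical one below has no equality condition, so Spark can only execute it with one of those two nested loop strategies; the tables `events`(ts) and `windows`(id, start, end) are made up:

// With no equality predicate, the planner falls back to BroadcastNestedLoopJoin
// (or a shuffle-and-replicate / cartesian nested loop join).
val rangeJoined = spark.sql(
  """SELECT w.id, e.ts
    |FROM events e JOIN windows w
    |  ON e.ts >= w.start AND e.ts < w.end""".stripMargin)
rangeJoined.explain()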

 

6. Test code (based on Spark 2.2.0)

 

  We wrote a small program to test how the join strategy is chosen.

 

package com.spark.test.offline.spark_sql

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

/**
  * Created by szh on 2020/6/7.
  */
object SparkSQLStrategy {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf
    sparkConf
      .setAppName("Union data test")
      .setMaster("local[1]")
      .set("spark.sql.autoBroadcastJoinThreshold", "1048576")
      .set("spark.sql.shuffle.partitions", "10")
      .set("spark.sql.join.preferSortMergeJoin", "false")
    val spark = SparkSession.builder()
      .config(sparkConf)
      .getOrCreate()

    val sparkContext = spark.sparkContext
    sparkContext.setLogLevel("WARN")


    val arrayA = Array(
      (1, "mm")
      , (2, "cs")
      , (3, "cc")
      , (4, "px")
      , (5, "kk")
    )

    val rddA = sparkContext
      .parallelize(arrayA)

    val rddADF = spark.createDataFrame(rddA).toDF("uid", "name")
    rddADF.createOrReplaceTempView("userA")

    spark.sql("CACHE TABLE userA")

    //--------------------------
    //--------------------------

    val arrayB = new ArrayBuffer[(Int, String)]()
    val nameArr = Array[String]("sun", "zhen", "hua", "kk", "cc")

    //1000000
    for (i <- 1 to 1000000) {
      val id = i
      val name = nameArr(Random.nextInt(5))

      arrayB.+=((id, name))
    }

    val rddB = sparkContext.parallelize(arrayB)
    val rddBDF = spark.createDataFrame(rddB).toDF("uid", "name")
    rddBDF.createOrReplaceTempView("userB")



    val arrListA = new ArrayBuffer[(Int, Int)]
    for (i <- 1 to 40) {
      val id = i
      val salary = Random.nextInt(100)

      arrListA.+=((id, salary))
    }

    spark
      .createDataFrame(arrListA).toDF("uid", "salary")
      .createOrReplaceTempView("listA")




    val arrList = new ArrayBuffer[(Int, Int)]
    for (i <- 1 to 4000000) {
      val id = i
      val salary = Random.nextInt(100)

      arrList.+=((id, salary))
    }

    spark
      .createDataFrame(arrList).toDF("uid", "salary")
      .createOrReplaceTempView("listB")




    val resultBigDF = spark
      .sql("SELECT userB.uid, name, salary FROM userB LEFT JOIN listA ON userB.uid = listA.uid")
    resultBigDF.show()
    resultBigDF.explain(true)



    val resultSmallDF = spark
      .sql("SELECT userA.uid, name, salary FROM userA LEFT JOIN listA ON userA.uid = listA.uid")
    resultSmallDF.show()
    resultSmallDF.explain(true)


    val resultBigDF2 = spark
      .sql("SELECT userB.uid, name, salary FROM userB LEFT JOIN listb ON userB.uid = listB.uid")
    resultBigDF2.show()
    resultBigDF2.explain(true)





    Thread
    .sleep(60 * 10 * 1000)

    sparkContext.stop()
  }

}

Job breakdown

 

Output

+---+----+------+
|uid|name|salary|
+---+----+------+
|  1| sun|    62|
|  2|  kk|    76|
|  3| sun|    64|
|  4|  kk|    33|
|  5|zhen|    20|
|  6| hua|    17|
|  7|  kk|     4|
|  8|  cc|    62|
|  9| sun|    97|
| 10| sun|    87|
| 11| hua|    71|
| 12|  kk|    42|
| 13| hua|    76|
| 14| sun|    93|
| 15|zhen|     7|
| 16|  kk|    59|
| 17| hua|    98|
| 18| sun|    88|
| 19|  cc|    49|
| 20|  cc|    62|
+---+----+------+
only showing top 20 rows

== Parsed Logical Plan ==
'Project ['userB.uid, 'name, 'salary]
+- 'Join LeftOuter, ('userB.uid = 'listA.uid)
   :- 'UnresolvedRelation `userB`
   +- 'UnresolvedRelation `listA`

== Analyzed Logical Plan ==
uid: int, name: string, salary: int
Project [uid#58, name#59, salary#70]
+- Join LeftOuter, (uid#58 = uid#69)
   :- SubqueryAlias userb
   :  +- Project [_1#53 AS uid#58, _2#54 AS name#59]
   :     +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true) AS _2#54]
   :        +- ExternalRDD [obj#52]
   +- SubqueryAlias lista
      +- Project [_1#64 AS uid#69, _2#65 AS salary#70]
         +- LocalRelation [_1#64, _2#65]

== Optimized Logical Plan ==
Project [uid#58, name#59, salary#70]
+- Join LeftOuter, (uid#58 = uid#69)
   :- Project [_1#53 AS uid#58, _2#54 AS name#59]
   :  +- SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#54]
   :     +- ExternalRDD [obj#52]
   +- LocalRelation [uid#69, salary#70]

== Physical Plan ==
*Project [uid#58, name#59, salary#70]
+- *BroadcastHashJoin [uid#58], [uid#69], LeftOuter, BuildRight
   :- *Project [_1#53 AS uid#58, _2#54 AS name#59]
   :  +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#54]
   :     +- Scan ExternalRDDScan[obj#52]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- LocalTableScan [uid#69, salary#70]
+---+----+------+
|uid|name|salary|
+---+----+------+
|  1|  mm|    62|
|  2|  cs|    76|
|  3|  cc|    64|
|  4|  px|    33|
|  5|  kk|    20|
+---+----+------+

== Parsed Logical Plan ==
'Project ['userA.uid, 'name, 'salary]
+- 'Join LeftOuter, ('userA.uid = 'listA.uid)
   :- 'UnresolvedRelation `userA`
   +- 'UnresolvedRelation `listA`

== Analyzed Logical Plan ==
uid: int, name: string, salary: int
Project [uid#8, name#9, salary#70]
+- Join LeftOuter, (uid#8 = uid#69)
   :- SubqueryAlias usera
   :  +- Project [_1#3 AS uid#8, _2#4 AS name#9]
   :     +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true) AS _2#4]
   :        +- ExternalRDD [obj#2]
   +- SubqueryAlias lista
      +- Project [_1#64 AS uid#69, _2#65 AS salary#70]
         +- LocalRelation [_1#64, _2#65]

== Optimized Logical Plan ==
Project [uid#8, name#9, salary#70]
+- Join LeftOuter, (uid#8 = uid#69)
   :- InMemoryRelation [uid#8, name#9], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `userA`
   :     +- *Project [_1#3 AS uid#8, _2#4 AS name#9]
   :        +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#4]
   :           +- Scan ExternalRDDScan[obj#2]
   +- LocalRelation [uid#69, salary#70]

== Physical Plan ==
*Project [uid#8, name#9, salary#70]
+- *BroadcastHashJoin [uid#8], [uid#69], LeftOuter, BuildRight
   :- InMemoryTableScan [uid#8, name#9]
   :     +- InMemoryRelation [uid#8, name#9], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `userA`
   :           +- *Project [_1#3 AS uid#8, _2#4 AS name#9]
   :              +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#4]
   :                 +- Scan ExternalRDDScan[obj#2]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- LocalTableScan [uid#69, salary#70]
20/06/08 00:50:40 WARN TaskSetManager: Stage 4 contains a task of very large size (160161 KB). The maximum recommended task size is 100 KB.
20/06/08 00:50:43 WARN TaskSetManager: Stage 5 contains a task of very large size (20512 KB). The maximum recommended task size is 100 KB.
+---+----+------+
|uid|name|salary|
+---+----+------+
| 22|zhen|    40|
| 32|zhen|    81|
| 60|  cc|    73|
| 90|  cc|    12|
| 92|zhen|    90|
| 95|  cc|    95|
|108|  cc|    49|
|123| hua|    44|
|128| sun|    50|
|144|zhen|    63|
|148|  cc|     2|
|153|  cc|    64|
|155|zhen|    88|
|167|  cc|    94|
|168| sun|    18|
|205|  kk|     6|
|209| hua|    78|
|229|  cc|    22|
|247| sun|    53|
|288|  cc|    94|
+---+----+------+
only showing top 20 rows

== Parsed Logical Plan ==
'Project ['userB.uid, 'name, 'salary]
+- 'Join LeftOuter, ('userB.uid = 'listB.uid)
   :- 'UnresolvedRelation `userB`
   +- 'UnresolvedRelation `listb`

== Analyzed Logical Plan ==
uid: int, name: string, salary: int
Project [uid#58, name#59, salary#81]
+- Join LeftOuter, (uid#58 = uid#80)
   :- SubqueryAlias userb
   :  +- Project [_1#53 AS uid#58, _2#54 AS name#59]
   :     +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true) AS _2#54]
   :        +- ExternalRDD [obj#52]
   +- SubqueryAlias listb
      +- Project [_1#75 AS uid#80, _2#76 AS salary#81]
         +- LocalRelation [_1#75, _2#76]

== Optimized Logical Plan ==
Project [uid#58, name#59, salary#81]
+- Join LeftOuter, (uid#58 = uid#80)
   :- Project [_1#53 AS uid#58, _2#54 AS name#59]
   :  +- SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#54]
   :     +- ExternalRDD [obj#52]
   +- LocalRelation [uid#80, salary#81]

== Physical Plan ==
*Project [uid#58, name#59, salary#81]
+- SortMergeJoin [uid#58], [uid#80], LeftOuter
   :- *Sort [uid#58 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(uid#58, 10)
   :     +- *Project [_1#53 AS uid#58, _2#54 AS name#59]
   :        +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#54]
   :           +- Scan ExternalRDDScan[obj#52]
   +- *Sort [uid#80 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(uid#80, 10)
         +- LocalTableScan [uid#80, salary#81]

 

 

Per-query analysis

Here userA and listA are the small tables, while userB and listB are the large tables.

 

Stage 1

val resultBigDF = spark
  .sql("SELECT userB.uid, name, salary FROM userB LEFT JOIN listA ON userB.uid = listA.uid")
resultBigDF.show()
resultBigDF.explain(true)

 

You can see that userB LEFT JOIN listA uses BroadcastHashJoin: the small table listA is broadcast and the large table userB is streamed against it.

== Parsed Logical Plan ==
'Project ['userB.uid, 'name, 'salary]
+- 'Join LeftOuter, ('userB.uid = 'listA.uid)
   :- 'UnresolvedRelation `userB`
   +- 'UnresolvedRelation `listA`

== Analyzed Logical Plan ==
uid: int, name: string, salary: int
Project [uid#58, name#59, salary#70]
+- Join LeftOuter, (uid#58 = uid#69)
   :- SubqueryAlias userb
   :  +- Project [_1#53 AS uid#58, _2#54 AS name#59]
   :     +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true) AS _2#54]
   :        +- ExternalRDD [obj#52]
   +- SubqueryAlias lista
      +- Project [_1#64 AS uid#69, _2#65 AS salary#70]
         +- LocalRelation [_1#64, _2#65]

== Optimized Logical Plan ==
Project [uid#58, name#59, salary#70]
+- Join LeftOuter, (uid#58 = uid#69)
   :- Project [_1#53 AS uid#58, _2#54 AS name#59]
   :  +- SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#54]
   :     +- ExternalRDD [obj#52]
   +- LocalRelation [uid#69, salary#70]

== Physical Plan ==
*Project [uid#58, name#59, salary#70]
+- *BroadcastHashJoin [uid#58], [uid#69], LeftOuter, BuildRight
   :- *Project [_1#53 AS uid#58, _2#54 AS name#59]
   :  +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#54]
   :     +- Scan ExternalRDDScan[obj#52]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- LocalTableScan [uid#69, salary#70]

 

 

 

Stage 2

val resultSmallDF = spark
  .sql("SELECT userA.uid, name, salary FROM userA LEFT JOIN listA ON userA.uid = listA.uid")
resultSmallDF.show()
resultSmallDF.explain(true)

You can see that userA LEFT JOIN listA also uses BroadcastHashJoin.

== Parsed Logical Plan ==
'Project ['userA.uid, 'name, 'salary]
+- 'Join LeftOuter, ('userA.uid = 'listA.uid)
   :- 'UnresolvedRelation `userA`
   +- 'UnresolvedRelation `listA`

== Analyzed Logical Plan ==
uid: int, name: string, salary: int
Project [uid#8, name#9, salary#70]
+- Join LeftOuter, (uid#8 = uid#69)
   :- SubqueryAlias usera
   :  +- Project [_1#3 AS uid#8, _2#4 AS name#9]
   :     +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true) AS _2#4]
   :        +- ExternalRDD [obj#2]
   +- SubqueryAlias lista
      +- Project [_1#64 AS uid#69, _2#65 AS salary#70]
         +- LocalRelation [_1#64, _2#65]

== Optimized Logical Plan ==
Project [uid#8, name#9, salary#70]
+- Join LeftOuter, (uid#8 = uid#69)
   :- InMemoryRelation [uid#8, name#9], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `userA`
   :     +- *Project [_1#3 AS uid#8, _2#4 AS name#9]
   :        +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#4]
   :           +- Scan ExternalRDDScan[obj#2]
   +- LocalRelation [uid#69, salary#70]

== Physical Plan ==
*Project [uid#8, name#9, salary#70]
+- *BroadcastHashJoin [uid#8], [uid#69], LeftOuter, BuildRight
   :- InMemoryTableScan [uid#8, name#9]
   :     +- InMemoryRelation [uid#8, name#9], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `userA`
   :           +- *Project [_1#3 AS uid#8, _2#4 AS name#9]
   :              +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#4]
   :                 +- Scan ExternalRDDScan[obj#2]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- LocalTableScan [uid#69, salary#70]

 

 

Stage 3

val resultBigDF2 = spark
  .sql("SELECT userB.uid, name, salary FROM userB LEFT JOIN listb ON userB.uid = listB.uid")
resultBigDF2.show()
resultBigDF2.explain(true)

userB LEFT JOIN listB, a join between two large tables, uses SortMergeJoin.

== Parsed Logical Plan ==
'Project ['userB.uid, 'name, 'salary]
+- 'Join LeftOuter, ('userB.uid = 'listB.uid)
   :- 'UnresolvedRelation `userB`
   +- 'UnresolvedRelation `listb`

== Analyzed Logical Plan ==
uid: int, name: string, salary: int
Project [uid#58, name#59, salary#81]
+- Join LeftOuter, (uid#58 = uid#80)
   :- SubqueryAlias userb
   :  +- Project [_1#53 AS uid#58, _2#54 AS name#59]
   :     +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true) AS _2#54]
   :        +- ExternalRDD [obj#52]
   +- SubqueryAlias listb
      +- Project [_1#75 AS uid#80, _2#76 AS salary#81]
         +- LocalRelation [_1#75, _2#76]

== Optimized Logical Plan ==
Project [uid#58, name#59, salary#81]
+- Join LeftOuter, (uid#58 = uid#80)
   :- Project [_1#53 AS uid#58, _2#54 AS name#59]
   :  +- SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#54]
   :     +- ExternalRDD [obj#52]
   +- LocalRelation [uid#80, salary#81]

== Physical Plan ==
*Project [uid#58, name#59, salary#81]
+- SortMergeJoin [uid#58], [uid#80], LeftOuter
   :- *Sort [uid#58 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(uid#58, 10)
   :     +- *Project [_1#53 AS uid#58, _2#54 AS name#59]
   :        +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#54]
   :           +- Scan ExternalRDDScan[obj#52]
   +- *Sort [uid#80 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(uid#80, 10)
         +- LocalTableScan [uid#80, salary#81]

 
