[LeftOuterJoin Operations in Spark]

    When writing SQL, most of us are familiar with using LEFT OUTER JOIN to relate the data of two tables and retrieve the results we want. When working with data in Spark, LeftOuterJoin is likewise used frequently to relate two datasets. So which kinds of datasets in Spark actually provide a LeftOuterJoin operation?

    The LeftOuterJoin examples in this article use two datasets: a user table and a user transaction (order) table. We use LeftOuterJoin to count, for each product, how many distinct user locations placed orders for it. The test data is small, as shown below:

     (1) User data (userId, location)

    u1,UT
    u2,GA
    u3,CA
    u4,CA
    u5,GA

     (2) User transaction data (transactionId, productId, userId, quantity, price)

t1,p3,u1,1,300
t2,p1,u2,1,100
t3,p1,u1,1,100
t4,p2,u2,1,10
t5,p4,u4,1,9
t6,p1,u1,1,100
t7,p4,u1,1,9
t8,p4,u5,2,40

1. LeftOuterJoin on RDDs

    1.1 Definition of the RDD leftOuterJoin method

    In the Spark source code, the leftOuterJoin method (together with its rightOuterJoin counterpart) is defined as follows:

/**
   * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
   * resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
   * pair (k, (v, None)) if no elements in `other` have key k. Hash-partitions the output
   * using the existing partitioner/parallelism level.
   */
  def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))] = self.withScope {
    leftOuterJoin(other, defaultPartitioner(self, other))
  }

  /**
   * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
   * resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
   * pair (k, (v, None)) if no elements in `other` have key k. Hash-partitions the output
   * into `numPartitions` partitions.
   */
  def leftOuterJoin[W](
      other: RDD[(K, W)],
      numPartitions: Int): RDD[(K, (V, Option[W]))] = self.withScope {
    leftOuterJoin(other, new HashPartitioner(numPartitions))
  }

  /**
   * Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the
   * resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the
   * pair (k, (None, w)) if no elements in `this` have key k. Hash-partitions the resulting
   * RDD using the existing partitioner/parallelism level.
   */
  def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))] = self.withScope {
    rightOuterJoin(other, defaultPartitioner(self, other))
  }
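
To make the Option[W] in the return type concrete, here is a minimal, hypothetical sketch (the keys and values are illustrative, and an existing SparkContext `sc` is assumed): a key that appears only on the left side comes back paired with None, while matched keys come back with Some(...).

val left = sc.parallelize(Seq(("u1", "t1"), ("u6", "t9")))   // (userId, transactionId)
val right = sc.parallelize(Seq(("u1", "UT")))                // (userId, location)

// Every key of `left` is kept; unmatched keys get None on the right side.
val joined = left.leftOuterJoin(right)
// joined contains: ("u1", ("t1", Some("UT"))) and ("u6", ("t9", None))

// A common follow-up is to replace None with a default value.
val withDefaults = joined.mapValues { case (txn, loc) => (txn, loc.getOrElse("unknown")) }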

1.2 Implementation with the RDD leftOuterJoin

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Left outer join with the Spark RDD API
  **/
object RDDLeftOuterJoin {
    def main(args: Array[String]): Unit = {
        if (args.length < 3) {
            println("Usage: RDDLeftOuterJoin <users-data-path> <transactions-data-path> <output-path>")
            sys.exit(1)
        }
        // User data file
        val usersFile: String = args(0)
        // Transaction data file
        val transactionsFile: String = args(1)
        // Output path
        val output: String = args(2)

        val sparkConf: SparkConf = new SparkConf().setMaster("local").setAppName("RDDLeftOuterJoinExample")
        // Create the SparkContext
        val sparkContext: SparkContext = SparkSession.builder().config(sparkConf).getOrCreate().sparkContext
        // Load the user data as an RDD
        val usersRaw: RDD[String] = sparkContext.textFile(usersFile)
        // Load the transaction data as an RDD
        val transactionsRaw: RDD[String] = sparkContext.textFile(transactionsFile)

        // (userId, location)
        val rddUsers: RDD[(String, String)] = usersRaw.map(line => {
            val tokens = line.split(",")
            (tokens(0), tokens(1))
        })

        // (userId, (productId, quantity, price))
        val rddTransactions: RDD[(String, (String, String, String))] = transactionsRaw.map(line => {
            val tokens = line.split(",")
            (tokens(2), (tokens(1), tokens(3), tokens(4)))
        })

        // Left outer join: every transaction is kept, with Some(location) or None for its user
        val rddJoined: RDD[(String, ((String, String, String), Option[String]))] = rddTransactions.leftOuterJoin(rddUsers)
        rddJoined.foreach(println)
        // (productId, location), falling back to "unknown" when the user has no match
        val rddProductLocations: RDD[(String, String)] = rddJoined.values.map(f => (f._1._1, f._2.getOrElse("unknown")))

        val rddProductByLocations: RDD[(String, Iterable[String])] = rddProductLocations.groupByKey()
        // Convert to a Set to drop duplicate locations
        val productWithUniqueLocations: RDD[(String, Set[String])] = rddProductByLocations.mapValues(_.toSet)
        // Count locations per product, returning tuples of (product, location count)
        val rddProductCount: RDD[(String, Int)] = productWithUniqueLocations.map(f => (f._1, f._2.size))
        // Save the result to the output path
        rddProductCount.saveAsTextFile(output)
    }
}
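
One note on the implementation above: groupByKey shuffles every (productId, location) pair before duplicates are removed. As a hedged alternative sketch, assuming the same rddProductLocations RDD as in the code above, aggregateByKey folds the locations into per-key sets during the shuffle and yields the same counts:

val rddProductCountAlt: RDD[(String, Int)] = rddProductLocations
    .aggregateByKey(Set.empty[String])(
        (locations, location) => locations + location, // add a location within a partition
        (a, b) => a ++ b)                               // merge partial sets across partitions
    .map { case (productId, locations) => (productId, locations.size) }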

1.3 Output

(p1,2)
(p2,1)
(p3,1)
(p4,3)

2. LeftOuterJoin on DataFrames

2.1 Definition of the DataFrame join method

    In Spark, a DataFrame left outer join is expressed through the join method, whose source code is defined as follows:

/**
   * Join with another `DataFrame`, using the given join expression. The following performs
   * a full outer join between `df1` and `df2`.
   *
   * {{{
   *   // Scala:
   *   import org.apache.spark.sql.functions._
   *   df1.join(df2, $"df1Key" === $"df2Key", "outer")
   *
   *   // Java:
   *   import static org.apache.spark.sql.functions.*;
   *   df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
   * }}}
   *
   * @param right Right side of the join.
   * @param joinExprs Join expression.
   * @param joinType Type of join to perform. Default `inner`. Must be one of:
   *                 `inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`,
   *                 `right`, `right_outer`, `left_semi`, `left_anti`.
   *
   * @group untypedrel
   * @since 2.0.0
   */
  def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame = {
    // Note that in this function, we introduce a hack in the case of self-join to automatically
    // resolve ambiguous join conditions into ones that might make sense [SPARK-6231].
    // Consider this case: df.join(df, df("key") === df("key"))
    // Since df("key") === df("key") is a trivially true condition, this actually becomes a
    // cartesian join. However, most likely users expect to perform a self join using "key".
    // With that assumption, this hack turns the trivially true condition into equality on join
    // keys that are resolved to both sides.

    // Trigger analysis so in the case of self-join, the analyzer will clone the plan.
    // After the cloning, left and right side will have distinct expression ids.
    val plan = withPlan(
      Join(logicalPlan, right.logicalPlan, JoinType(joinType), Some(joinExprs.expr)))
      .queryExecution.analyzed.asInstanceOf[Join]

    // If auto self join alias is disabled, return the plan.
    if (!sparkSession.sessionState.conf.dataFrameSelfJoinAutoResolveAmbiguity) {
      return withPlan(plan)
    }

    // If left/right have no output set intersection, return the plan.
    val lanalyzed = withPlan(this.logicalPlan).queryExecution.analyzed
    val ranalyzed = withPlan(right.logicalPlan).queryExecution.analyzed
    if (lanalyzed.outputSet.intersect(ranalyzed.outputSet).isEmpty) {
      return withPlan(plan)
    }

    // Otherwise, find the trivially true predicates and automatically resolves them to both sides.
    // By the time we get here, since we have already run analysis, all attributes should've been
    // resolved and become AttributeReference.
    val cond = plan.condition.map { _.transform {
      case catalyst.expressions.EqualTo(a: AttributeReference, b: AttributeReference)
          if a.sameRef(b) =>
        catalyst.expressions.EqualTo(
          withPlan(plan.left).resolve(a.name),
          withPlan(plan.right).resolve(b.name))
    }}

    withPlan {
      plan.copy(condition = cond)
    }
  }
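
Besides the join-expression form defined above, Dataset also exposes an overload that joins on shared column names, join(right, usingColumns, joinType), which keeps only one copy of the join column in the result. A minimal sketch, assuming the dfTransactions and dfUsers DataFrames built in section 2.2 below:

// Joining on the shared column name leaves a single userId column in the output.
val joinedUsing: DataFrame = dfTransactions.join(dfUsers, Seq("userId"), "left_outer")

// The join-expression form used in section 2.2 is equivalent, except that it
// keeps the userId column from both sides.
val joinedExpr: DataFrame = dfTransactions
    .join(dfUsers, dfTransactions("userId") === dfUsers("userId"), "left_outer")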

2.2 Implementation with the DataFrame join

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Left outer join with the Spark DataFrame API
  **/
object DataFrameLeftOuterJoin {
    def main(args: Array[String]): Unit = {
        if (args.length < 3) {
            println("Usage: DataFrameLeftOuterJoin <users-data-path> <transactions-data-path> <output-path>")
            sys.exit(1)
        }
        // User data file
        val usersFile: String = args(0)
        // Transaction data file
        val transactionsFile: String = args(1)
        // Output path
        val output: String = args(2)

        val sparkConf = new SparkConf()
            .setMaster("local[1]")
            .setAppName("DataFrameLeftOuterJoinExample")
        val sparkSession: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

        val sparkContext: SparkContext = sparkSession.sparkContext

        // Schema for the user data
        val userSchema = StructType(Seq(
            StructField("userId", StringType, false),
            StructField("location", StringType, false)))

        // Schema for the transaction data
        val transactionSchema = StructType(Seq(
            StructField("transactionId", StringType, false),
            StructField("productId", StringType, false),
            StructField("userId", StringType, false),
            StructField("quantity", IntegerType, false),
            StructField("price", DoubleType, false)))

        // Load the user data
        val usersRaw: RDD[String] = sparkContext.textFile(usersFile)
        // Convert to RDD[org.apache.spark.sql.Row]
        val userRDDRows: RDD[Row] = usersRaw.map(line => {
            val tokens = line.split(",")
            Row(tokens(0), tokens(1))
        })
        // Create a DataFrame from the RDD
        val dfUsers: DataFrame = sparkSession.createDataFrame(userRDDRows, userSchema)
        dfUsers.printSchema()
        // Load the transaction data
        val transactionsRaw = sparkContext.textFile(transactionsFile)
        // Convert to RDD[org.apache.spark.sql.Row]
        val transactionsRDDRows = transactionsRaw.map(line => {
            val tokens = line.split(",")
            Row(tokens(0), tokens(1), tokens(2), tokens(3).toInt, tokens(4).toDouble)
        })
        // Create a DataFrame from the RDD
        val dfTransactions = sparkSession.createDataFrame(transactionsRDDRows, transactionSchema)
        dfTransactions.printSchema()
        // Left outer join: every transaction is kept, matched with its user where possible
        val dfLeftJoin: DataFrame = dfTransactions.join(dfUsers, dfTransactions("userId") === dfUsers("userId"), "left")
        dfLeftJoin.printSchema()
        dfLeftJoin.show()
        // Select the user, product, and location for each transaction
        val dfProductLocation: DataFrame = dfLeftJoin.select(dfUsers.col("userId"), dfLeftJoin.col("productId"), dfLeftJoin.col("location"))
        dfProductLocation.show()
        val dfProductLocationDistinct: Dataset[Row] = dfProductLocation.distinct
        dfProductLocationDistinct.show()
        val dfProductsCount: DataFrame = dfProductLocationDistinct.groupBy("productId").count()
        dfProductsCount.show()
        // Repartition to a single partition so the result lands in one file; fine for small data
        //dfProductsCount.repartition(1).write.save(output + "/df")
        dfProductsCount.rdd.repartition(1).saveAsTextFile(output + "/df_output")
    }
}
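
As a variation on the steps above, the select + distinct + groupBy + count chain can be collapsed into a single aggregation with countDistinct. A sketch, assuming the same dfLeftJoin DataFrame as in the code above (note that countDistinct ignores NULL locations, which makes no difference on this sample data because every transaction finds a matching user):

import org.apache.spark.sql.functions.countDistinct

// Count the distinct locations per product directly on the joined DataFrame.
val dfProductsCountAlt: DataFrame = dfLeftJoin
    .groupBy("productId")
    .agg(countDistinct("location").alias("locCount"))
dfProductsCountAlt.show()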

2.3 Output

[p2,1]
[p1,2]
[p3,1]
[p4,3]

3. LeftOuterJoin in Spark SQL

    A left outer join in Spark SQL is no different from a LEFT OUTER JOIN in ordinary SQL; you simply write the SQL statement inside Spark. Building on the DataFrames from the previous section, the implementation looks like this:

        // Register a temporary view for the user data
        dfUsers.createOrReplaceTempView("users")
        // Register a temporary view for the transaction data
        dfTransactions.createOrReplaceTempView("transactions")
        val sql =
            """
              |SELECT productId, count(distinct location) locCount FROM transactions
              | LEFT OUTER JOIN users
              |     ON transactions.userId = users.userId
              |GROUP BY productId
            """.stripMargin
        val dfSqlResult = sparkSession.sql(sql)
        dfSqlResult.show()
        // Repartition to a single partition so the result lands in one file; fine for small data
        //dfSqlResult.repartition(1).write.save(output + "/sql")
        dfSqlResult.rdd.repartition(1).saveAsTextFile(output + "/sql_output")

Output:

[p2,1]
[p1,2]
[p3,1]
[p4,3]

The sections above have covered the LeftOuterJoin operation on Spark RDDs, on DataFrames, and in Spark SQL. Spark also provides inner joins and right outer joins; they are used in essentially the same way as LeftOuterJoin, differing only in the method called or in the join-type parameter passed in, as sketched below.
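
For reference, a minimal sketch of the corresponding calls, assuming the RDDs and DataFrames defined in the earlier sections:

// RDD API: inner join and right outer join mirror leftOuterJoin.
val rddInner: RDD[(String, ((String, String, String), String))] = rddTransactions.join(rddUsers)
val rddRight: RDD[(String, (Option[(String, String, String)], String))] = rddTransactions.rightOuterJoin(rddUsers)

// DataFrame API: only the joinType argument changes.
val dfInner = dfTransactions.join(dfUsers, dfTransactions("userId") === dfUsers("userId"), "inner")
val dfRight = dfTransactions.join(dfUsers, dfTransactions("userId") === dfUsers("userId"), "right_outer")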

