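When writing SQL, most of us are familiar with using LEFT OUTER JOIN to relate the data in two tables and retrieve the result we want. Spark data processing also uses left outer joins frequently to relate two datasets. So which left outer join methods does Spark offer for its datasets?
This article works with two inputs, a user table and a user transaction table, and uses a left outer join to count, for each product, the number of distinct user locations it was ordered from. The test data set is small and is shown below:
(1) User data (userId,location)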
u1,UT
u2,GA
u3,CA
u4,CA
u5,GA
(2) User transaction data (transactionId,productId,userId,quantity,price)
t1,p3,u1,1,300
t2,p1,u2,1,100
t3,p1,u1,1,100
t4,p2,u2,1,10
t5,p4,u4,1,9
t6,p1,u1,1,100
t7,p4,u1,1,9
t8,p4,u5,2,40
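As a cross-check on the sample data, the expected answer can be worked out without Spark at all. The following is a minimal sketch using plain Scala collections; the object and value names are illustrative and not part of the article's code, and the "unknown" default is an assumption about how a missing user should be treated.

```scala
object ExpectedResultSketch {
  // Sample data copied from the two listings above.
  val users: Seq[String] = Seq("u1,UT", "u2,GA", "u3,CA", "u4,CA", "u5,GA")
  val transactions: Seq[String] = Seq(
    "t1,p3,u1,1,300", "t2,p1,u2,1,100", "t3,p1,u1,1,100", "t4,p2,u2,1,10",
    "t5,p4,u4,1,9", "t6,p1,u1,1,100", "t7,p4,u1,1,9", "t8,p4,u5,2,40")

  // userId -> location lookup table.
  val locationByUser: Map[String, String] =
    users.map(_.split(",")).map(t => t(0) -> t(1)).toMap

  // (productId, location) for every transaction. A user missing from the
  // lookup gets "unknown", which plays the role of the unmatched side of a
  // left outer join.
  val productLocations: Seq[(String, String)] =
    transactions.map(_.split(",")).map(t => (t(1), locationByUser.getOrElse(t(2), "unknown")))

  // productId -> number of distinct locations.
  val locationCounts: Map[String, Int] =
    productLocations.groupBy(_._1).map { case (p, pairs) => p -> pairs.map(_._2).toSet.size }

  def main(args: Array[String]): Unit =
    locationCounts.toSeq.sorted.foreach(println)
}
```

Every Spark variant in the following sections should agree with these counts on the sample data.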
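1. LeftOuterJoin on RDDs
1.1 Definition of the RDD leftOuterJoin method
In Spark, the leftOuterJoin methods are defined in the source code as follows (the excerpt also includes the rightOuterJoin counterpart):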
/**
 * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
 * resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
 * pair (k, (v, None)) if no elements in `other` have key k. Hash-partitions the output
 * using the existing partitioner/parallelism level.
 */
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))] = self.withScope {
  leftOuterJoin(other, defaultPartitioner(self, other))
}

/**
 * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
 * resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
 * pair (k, (v, None)) if no elements in `other` have key k. Hash-partitions the output
 * into `numPartitions` partitions.
 */
def leftOuterJoin[W](
    other: RDD[(K, W)],
    numPartitions: Int): RDD[(K, (V, Option[W]))] = self.withScope {
  leftOuterJoin(other, new HashPartitioner(numPartitions))
}

/**
 * Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the
 * resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the
 * pair (k, (None, w)) if no elements in `this` have key k. Hash-partitions the resulting
 * RDD using the existing partitioner/parallelism level.
 */
def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))] = self.withScope {
  rightOuterJoin(other, defaultPartitioner(self, other))
}
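The result shape described in these doc comments, (k, (v, Some(w))) for every match and (k, (v, None)) when there is none, can be illustrated on in-memory sequences. The helper below is a hypothetical emulation written for this article, not Spark's implementation, and it ignores partitioning entirely.

```scala
object LeftOuterJoinSemantics {
  // Hypothetical helper (not part of Spark): emulates the documented result
  // shape of leftOuterJoin on in-memory sequences of key/value pairs.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
    // Group the right-hand values by key so each left element can look them up.
    val rightByKey: Map[K, Seq[W]] =
      right.groupBy(_._1).map { case (k, kws) => k -> kws.map(_._2) }
    left.flatMap { case (k, v) =>
      rightByKey.get(k) match {
        case Some(ws) => ws.map(w => (k, (v, Some(w)))) // one pair per match
        case None     => Seq((k, (v, None)))            // no match: keep the left element
      }
    }
  }

  // Key "a" matches twice on the right, key "b" does not match at all.
  val joined: Seq[(String, (Int, Option[String]))] =
    leftOuterJoin(Seq("a" -> 1, "b" -> 2), Seq("a" -> "x", "a" -> "y"))

  def main(args: Array[String]): Unit = joined.foreach(println)
}
```

Note that a left element with several matches appears once per match, while an unmatched left element is still kept exactly once.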
1.2 Implementing LeftOuterJoin with RDDs
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

/**
 * Left outer join on Spark RDDs
 **/
object RDDLeftOuterJoin {
  def main(args: Array[String]): Unit = {
    if (args.length < 3) {
      println("Usage: RDDLeftOuterJoin <users-data-path> <transactions-data-path> <output-path>")
      sys.exit(1)
    }
    // User data file
    val usersFile: String = args(0)
    // Transaction data file
    val transactionsFile: String = args(1)
    // Output path
    val output: String = args(2)
    val sparkConf: SparkConf = new SparkConf().setMaster("local").setAppName("RDDLeftOuterJoinExample")
    // Create the SparkContext
    val sparkContext: SparkContext = SparkSession.builder().config(sparkConf).getOrCreate().sparkContext
    // Read the user data into an RDD
    val usersRaw: RDD[String] = sparkContext.textFile(usersFile)
    // Read the transaction data into an RDD
    val transactionsRaw: RDD[String] = sparkContext.textFile(transactionsFile)
    // (userId, location)
    val rddUsers: RDD[(String, String)] = usersRaw.map(line => {
      val tokens = line.split(",")
      (tokens(0), tokens(1))
    })
    // (userId, (productId, quantity, price))
    val rddTransactions: RDD[(String, (String, String, String))] = transactionsRaw.map(line => {
      val tokens = line.split(",")
      (tokens(2), (tokens(1), tokens(3), tokens(4)))
    })
    // Transactions are the left side, so every transaction is kept even if its user is missing
    val rddJoined: RDD[(String, ((String, String, String), Option[String]))] = rddTransactions.leftOuterJoin(rddUsers)
    rddJoined.foreach(println)
    // (productId, location), with "unknown" for transactions whose user was not found
    val rddProductLocations: RDD[(String, String)] = rddJoined.values.map(f => (f._1._1, f._2.getOrElse("unknown")))
    val rddProductByLocations: RDD[(String, Iterable[String])] = rddProductLocations.groupByKey()
    // Convert to a Set to remove duplicate locations
    val productWithUniqueLocations: RDD[(String, Set[String])] = rddProductByLocations.mapValues(_.toSet)
    // Count the distinct locations per product, returning (product, location count) tuples
    val rddProductCount: RDD[(String, Int)] = productWithUniqueLocations.map(f => (f._1, f._2.size))
    // Save the result to the output path
    rddProductCount.saveAsTextFile(output)
  }
}
1.3 Output
(p1,2)
(p2,1)
(p3,1)
(p4,3)
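2. LeftOuterJoin on DataFrames
2.1 Definition of the DataFrame join method
In Spark, a DataFrame left outer join goes through the join method, whose source is defined as follows. Note that the example in its doc comment performs a full outer join; passing `left` or `left_outer` as joinType gives a left outer join: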
/**
 * Join with another `DataFrame`, using the given join expression. The following performs
 * a full outer join between `df1` and `df2`.
 *
 * {{{
 *   // Scala:
 *   import org.apache.spark.sql.functions._
 *   df1.join(df2, $"df1Key" === $"df2Key", "outer")
 *
 *   // Java:
 *   import static org.apache.spark.sql.functions.*;
 *   df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
 * }}}
 *
 * @param right Right side of the join.
 * @param joinExprs Join expression.
 * @param joinType Type of join to perform. Default `inner`. Must be one of:
 *                 `inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`,
 *                 `right`, `right_outer`, `left_semi`, `left_anti`.
 *
 * @group untypedrel
 * @since 2.0.0
 */
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame = {
  // Note that in this function, we introduce a hack in the case of self-join to automatically
  // resolve ambiguous join conditions into ones that might make sense [SPARK-6231].
  // Consider this case: df.join(df, df("key") === df("key"))
  // Since df("key") === df("key") is a trivially true condition, this actually becomes a
  // cartesian join. However, most likely users expect to perform a self join using "key".
  // With that assumption, this hack turns the trivially true condition into equality on join
  // keys that are resolved to both sides.

  // Trigger analysis so in the case of self-join, the analyzer will clone the plan.
  // After the cloning, left and right side will have distinct expression ids.
  val plan = withPlan(
    Join(logicalPlan, right.logicalPlan, JoinType(joinType), Some(joinExprs.expr)))
    .queryExecution.analyzed.asInstanceOf[Join]

  // If auto self join alias is disabled, return the plan.
  if (!sparkSession.sessionState.conf.dataFrameSelfJoinAutoResolveAmbiguity) {
    return withPlan(plan)
  }

  // If left/right have no output set intersection, return the plan.
  val lanalyzed = withPlan(this.logicalPlan).queryExecution.analyzed
  val ranalyzed = withPlan(right.logicalPlan).queryExecution.analyzed
  if (lanalyzed.outputSet.intersect(ranalyzed.outputSet).isEmpty) {
    return withPlan(plan)
  }

  // Otherwise, find the trivially true predicates and automatically resolves them to both sides.
  // By the time we get here, since we have already run analysis, all attributes should've been
  // resolved and become AttributeReference.
  val cond = plan.condition.map { _.transform {
    case catalyst.expressions.EqualTo(a: AttributeReference, b: AttributeReference)
        if a.sameRef(b) =>
      catalyst.expressions.EqualTo(
        withPlan(plan.left).resolve(a.name),
        withPlan(plan.right).resolve(b.name))
  }}

  withPlan {
    plan.copy(condition = cond)
  }
}
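The joinType parameter accepts several spellings for the same kind of join. The sketch below only groups the strings listed in the doc comment above into families; the function and its labels are hypothetical names chosen for illustration and are not Spark classes or Spark's own parsing code.

```scala
object JoinTypeNames {
  // Hypothetical grouping of the joinType strings listed in the doc comment.
  // The returned labels are illustrative, not Spark identifiers.
  def family(joinType: String): String = joinType match {
    case "inner"                         => "inner"
    case "cross"                         => "cross"
    case "outer" | "full" | "full_outer" => "full outer"
    case "left" | "left_outer"           => "left outer"
    case "right" | "right_outer"         => "right outer"
    case "left_semi"                     => "left semi"
    case "left_anti"                     => "left anti"
    case other =>
      throw new IllegalArgumentException(s"Unsupported join type: $other")
  }

  def main(args: Array[String]): Unit =
    Seq("left", "left_outer", "outer", "full_outer").foreach(t => println(s"$t -> ${family(t)}"))
}
```

In other words, `"left"` and `"left_outer"` request the same join, so either string can be passed when a left outer join is wanted.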
2.2 Implementing LeftOuterJoin with DataFrames
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

/**
 * Left outer join on Spark DataFrames
 **/
object DataFrameLeftOuterJoin {
  def main(args: Array[String]): Unit = {
    if (args.length < 3) {
      println("Usage: DataFrameLeftOuterJoin <users-data-path> <transactions-data-path> <output-path>")
      sys.exit(1)
    }
    // User data file
    val usersFile: String = args(0)
    // Transaction data file
    val transactionsFile: String = args(1)
    // Output path
    val output: String = args(2)
    val sparkConf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("DataFrameLeftOuterJoinExample")
    val sparkSession: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    val sparkContext: SparkContext = sparkSession.sparkContext
    // Schema of the user data
    val userSchema = StructType(Seq(
      StructField("userId", StringType, false),
      StructField("location", StringType, false)))
    // Schema of the transaction data
    val transactionSchema = StructType(Seq(
      StructField("transactionId", StringType, false),
      StructField("productId", StringType, false),
      StructField("userId", StringType, false),
      StructField("quantity", IntegerType, false),
      StructField("price", DoubleType, false)))
    // Load the user data
    val usersRaw: RDD[String] = sparkContext.textFile(usersFile)
    // Convert to RDD[org.apache.spark.sql.Row]
    val userRDDRows: RDD[Row] = usersRaw.map(line => {
      val tokens = line.split(",")
      Row(tokens(0), tokens(1))
    })
    // Create a DataFrame from the RDD
    val dfUsers: DataFrame = sparkSession.createDataFrame(userRDDRows, userSchema)
    dfUsers.printSchema()
    // Load the transaction data
    val transactionsRaw = sparkContext.textFile(transactionsFile)
    // Convert to RDD[org.apache.spark.sql.Row]
    val transactionsRDDRows = transactionsRaw.map(line => {
      val tokens = line.split(",")
      Row(tokens(0), tokens(1), tokens(2), tokens(3).toInt, tokens(4).toDouble)
    })
    // Create a DataFrame from the RDD
    val dfTransactions = sparkSession.createDataFrame(transactionsRDDRows, transactionSchema)
    dfTransactions.printSchema()
    // DataFrame left outer join: transactions (left side) joined with user information
    val dfLeftJoin: DataFrame = dfTransactions.join(dfUsers, dfTransactions("userId") === dfUsers("userId"), "left")
    dfLeftJoin.printSchema()
    dfLeftJoin.show()
    // Keep only product and location, so that distinct deduplicates locations per product
    // (keeping userId as well would count distinct user/location pairs instead)
    val dfProductLocation: DataFrame = dfLeftJoin.select(dfLeftJoin.col("productId"), dfLeftJoin.col("location"))
    dfProductLocation.show()
    val dfProductLocationDistinct: Dataset[Row] = dfProductLocation.distinct
    dfProductLocationDistinct.show()
    val dfProductsCount: DataFrame = dfProductLocationDistinct.groupBy("productId").count()
    dfProductsCount.show()
    // Repartition so that the output lands in a single file; acceptable for small data sets
    //dfProductsCount.repartition(1).write.save(output + "/df")
    dfProductsCount.rdd.repartition(1).saveAsTextFile(output + "/df_output")
  }
}
2.3 Output
[p2,1]
[p1,2]
[p3,1]
[p4,3]
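The deduplication step is sensitive to which columns are present when `distinct` runs. The sketch below uses plain Scala collections rather than DataFrames, and it adds one hypothetical purchase (u3, who lives in CA, buying p4) that is not in the sample data, to show how the two choices diverge once two users in the same location buy the same product.

```scala
object DistinctColumnsSketch {
  // (userId, productId, location) rows as they would look after the join.
  // The last row is hypothetical and is NOT part of the article's sample data.
  val joinedRows: Seq[(String, String, String)] = Seq(
    ("u4", "p4", "CA"), ("u1", "p4", "UT"), ("u5", "p4", "GA"),
    ("u3", "p4", "CA"))

  // Deduplicate on all three columns, then count rows per product:
  // this counts distinct (user, location) pairs.
  val countWithUserId: Map[String, Int] =
    joinedRows.distinct.groupBy(_._2).map { case (p, rows) => p -> rows.size }

  // Deduplicate on (productId, location) only, then count rows per product:
  // this counts distinct locations.
  val countWithoutUserId: Map[String, Int] =
    joinedRows.map(r => (r._2, r._3)).distinct.groupBy(_._1).map { case (p, rows) => p -> rows.size }

  def main(args: Array[String]): Unit = {
    println(s"with userId:    $countWithUserId")
    println(s"without userId: $countWithoutUserId")
  }
}
```

On the article's own sample data both variants give the same numbers, because no product was bought by two different users from the same location.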
3. LeftOuterJoin in Spark SQL
A left outer join in Spark SQL is used exactly as in ordinary SQL: you simply write the SQL statement inside Spark. The code below continues from the DataFrame example (it reuses dfUsers, dfTransactions, sparkSession and output):
// Register the user DataFrame as a temporary view for Spark SQL
dfUsers.createOrReplaceTempView("users")
// Register the transaction DataFrame as a temporary view for Spark SQL
dfTransactions.createOrReplaceTempView("transactions")
val sql =
  """
    |SELECT productId, count(distinct location) locCount FROM transactions
    | LEFT OUTER JOIN users
    |   ON transactions.userId = users.userId
    |GROUP BY productId
  """.stripMargin
val dfSqlResult = sparkSession.sql(sql)
dfSqlResult.show()
// Repartition so that the output lands in a single file; acceptable for small data sets
//dfSqlResult.repartition(1).write.save(output + "/sql")
dfSqlResult.rdd.repartition(1).saveAsTextFile(output + "/sql_output")
Output:
[p2,1]
[p1,2]
[p3,1]
[p4,3]
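The three approaches agree on the sample data because every transaction has a matching user. They can differ when a user is missing: the RDD version substitutes "unknown" and counts it as a location, whereas SQL's COUNT(DISTINCT ...) does not count NULL values. The sketch below models this with plain Scala `Option` values; product p5 and its unmatched buyer are hypothetical and not part of the sample data.

```scala
object UnmatchedUserSketch {
  // (productId, location) after a left outer join. None stands for a
  // transaction whose user is missing from the users table.
  // Product p5 and its unmatched buyer are hypothetical.
  val joined: Seq[(String, Option[String])] = Seq(
    ("p2", Some("GA")), ("p5", None), ("p5", Some("CA")))

  // RDD-style: replace a missing location with "unknown" and count it.
  val withUnknown: Map[String, Int] =
    joined.groupBy(_._1).map { case (p, rows) =>
      p -> rows.map(_._2.getOrElse("unknown")).toSet.size
    }

  // SQL-style COUNT(DISTINCT location): missing (NULL) values are skipped.
  val ignoringNull: Map[String, Int] =
    joined.groupBy(_._1).map { case (p, rows) =>
      p -> rows.flatMap(_._2).toSet.size
    }

  def main(args: Array[String]): Unit = {
    println(s"unknown counted: $withUnknown")
    println(s"NULL skipped:    $ignoringNull")
  }
}
```

If the two results need to match, one option is to apply the same default in the SQL query, for example with `coalesce(location, 'unknown')`.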
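This article has walked through left outer joins in Spark using RDDs, DataFrames and Spark SQL. Spark also supports inner joins and right outer joins on the same datasets; they are used in essentially the same way as the left outer join, and only the method called or the arguments passed differ.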
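To make the difference between the join variants concrete, the following sketch contrasts their result shapes on two small in-memory maps. It is an emulation with plain Scala collections and unique keys per side, not Spark code, and the keys u9 and u3 are chosen only so that each side has one unmatched entry.

```scala
object JoinVariantsSketch {
  val left: Map[String, String] = Map("u1" -> "p3", "u9" -> "p1")  // userId -> productId
  val right: Map[String, String] = Map("u1" -> "UT", "u3" -> "CA") // userId -> location

  // Inner join: only keys present on both sides, both values plain.
  val inner: Map[String, (String, String)] =
    left.collect { case (k, v) if right.contains(k) => (k, (v, right(k))) }

  // Left outer join: every left key is kept, the right value is optional.
  val leftOuter: Map[String, (String, Option[String])] =
    left.map { case (k, v) => (k, (v, right.get(k))) }

  // Right outer join: every right key is kept, the left value is optional.
  val rightOuter: Map[String, (Option[String], String)] =
    right.map { case (k, w) => (k, (left.get(k), w)) }

  def main(args: Array[String]): Unit = {
    println(s"inner:      $inner")
    println(s"leftOuter:  $leftOuter")
    println(s"rightOuter: $rightOuter")
  }
}
```

The position of the `Option` in the value tuple mirrors the RDD signatures quoted in section 1.1: it sits on the right for leftOuterJoin and on the left for rightOuterJoin.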