1. The join operator
In Spark, there are two ways to join two DataFrames: the join operator, and SQL statements against temporary views. With the join operator, a typical join looks like this:
words_df.join(words_df, Seq("word")).show()
words_df.join(words_df, Seq("word"), joinType = "left").show()
Result: both joins run as expected.
The problem appears when there is no real join condition, or when the desired result actually is a Cartesian product:
words_df.join(words_df).show()
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LogicalRDD [word#2, numbers#3L], false
and
LogicalRDD [word#82, numbers#83L], false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
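The error message itself points to the first workaround: since Spark 2.1 the DataFrame API has a dedicated crossJoin operator, which produces a Cartesian product without any configuration change. A minimal self-contained sketch (the sample data standing in for words_df is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming Spark 2.1+ is on the classpath.
val spark = SparkSession.builder
  .appName("CrossJoinExample")
  .master("local[2]")
  .getOrCreate()

import spark.implicits._
// Hypothetical sample data standing in for the words_df used above.
val words_df = Seq(("spark", 1L), ("flink", 2L)).toDF("word", "numbers")

// Explicit Cartesian product via the dedicated crossJoin operator;
// no spark.sql.crossJoin.enabled setting is needed for this call.
words_df.crossJoin(words_df).show() // 2 x 2 = 4 rows

spark.stop()
```

This is usually preferable to enabling implicit Cartesian products globally, because the intent to cross-join is explicit at the call site.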
2. SQL statements
The same situation arises when using SQL statements:
words_df.createOrReplaceTempView("words_df")   // temp view, the usual approach in Spark 2.x
words_df.createOrReplaceTempView("words_df_2") // temp view, the usual approach in Spark 2.x
spark.sql("select a.*, b.* from words_df a, words_df_2 b").show(false)
Exception: the same AnalysisException as in the join-operator case ("Detected implicit cartesian product for INNER join ... Join condition is missing or trivial").
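The second workaround suggested by the message is the explicit CROSS JOIN SQL syntax, which likewise needs no configuration change. A sketch against the same temp views registered above:

```scala
// Explicit CROSS JOIN syntax; Spark no longer treats this as an
// *implicit* Cartesian product, so no configuration change is needed.
spark.sql("select a.*, b.* from words_df a cross join words_df_2 b").show(false)
```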
3. Cause analysis and solution
From the error messages above, both usages fail in the same way: Spark detected an implicit Cartesian product (a CROSS JOIN) between the two DataFrames and refuses to run it unless support is enabled via the configuration spark.sql.crossJoin.enabled=true. (In Spark 3.0 and later this configuration defaults to true, so the error only appears out of the box on Spark 2.x.) The code:
val spark = SparkSession
  .builder
  .appName(s"${this.getClass.getSimpleName}")
  .config("spark.sql.crossJoin.enabled", "true")
  .master("local[2]")
  .getOrCreate()
Result: with this configuration enabled, both the join operator and the SQL query above run and return the full Cartesian product.