Spark SQL's aggregate functions include `first` and `last`, which by their names suggest "get the first and last record's value within each group". In practice this only behaves as expected in local mode; in a production (distributed) environment the result is not guaranteed. The source code comment explains why:
/**
* Returns the first value of `child` for a group of rows. If the first value of `child`
* is `null`, it returns `null` (respecting nulls). Even if [[First]] is used on an already
* sorted column, if we do partial aggregation and final aggregation (when mergeExpression
* is used) its result will not be deterministic (unless the input table is sorted and has
* a single partition, and we use a single reducer to do the aggregation.).
*/
So when are `first` and `last` actually deterministic? Only when the input table is already sorted, lives in a single partition, and a single reducer performs the aggregation — conditions that rarely hold in practice.
In other words, with multiple partitions you cannot use `first`/`last` to reliably get the first and last record of a group.
Solution: use a Window function.
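A quick way to see the problem is to force the data into several partitions and run `first`/`last` repeatedly. This is a sketch assuming a local SparkSession; the point is that the result may differ between runs, so no expected output is shown:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{first, last}

val spark = SparkSession.builder().master("local[4]").appName("FirstLastNonDeterministic").getOrCreate()
import spark.implicits._

// Repartitioning scatters the rows of each group across partitions,
// so the row order seen by the partial aggregators is not stable.
val df = Seq(("a", 10), ("a", 12), ("a", 11)).toDF("name", "value")
  .repartition(4)

// first/last return whichever row each partition's aggregator happened to
// see first/last — the value can change between runs or cluster layouts.
df.groupBy("name").agg(first("value"), last("value")).show()
```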
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder().master("local").appName("Demo").getOrCreate()
import spark.implicits._
val df = Seq(("a", 10, 12345), ("a", 12, 34567), ("a", 11, 23456), ("b", 10, 55555), ("b", 8, 12348)).toDF("name", "value", "event_time")
// Define windows: one ascending and one descending by event_time within each name group
val asc = Window.partitionBy("name").orderBy($"event_time")
val desc = Window.partitionBy("name").orderBy($"event_time".desc)
// Assign a row_number over each window, then keep the row numbered 1
val firstValue = df.withColumn("rn", row_number().over(asc)).where($"rn" === 1).drop("rn")
val lastValue = df.withColumn("rn", row_number().over(desc)).where($"rn" === 1).drop("rn")
// Join the first/last rows back together with the group counts
df.groupBy("name")
  .count().as("t1")
  .join(firstValue.as("fv"), "name")
  .join(lastValue.as("lv"), "name")
  .select($"t1.name", $"fv.value".as("first_value"), $"lv.value".as("last_value"), $"t1.count")
  .show()
Output:
+----+-----------+----------+-----+
|name|first_value|last_value|count|
+----+-----------+----------+-----+
| b| 8| 10| 2|
| a| 10| 12| 3|
+----+-----------+----------+-----+
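As an aside, if you are on Spark 3.0 or later, the `min_by`/`max_by` SQL aggregates can achieve the same result in a single aggregation, with no window or join. This sketch assumes the same `df` as above and calls the functions via `expr()`, since the Scala `functions.min_by`/`max_by` helpers were only added in later 3.x releases:

```scala
import org.apache.spark.sql.functions.{count, expr}

// min_by(value, event_time) returns the value from the row with the smallest
// event_time in each group; max_by does the same for the largest event_time.
df.groupBy("name")
  .agg(
    expr("min_by(value, event_time)").as("first_value"),
    expr("max_by(value, event_time)").as("last_value"),
    count("*").as("count"))
  .show()
```

This avoids the two extra shuffles introduced by the window-plus-join approach, at the cost of requiring a newer Spark version.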