Spark SQL's aggregate functions include `first` and `last`, which by their names suggest "get the first and last record's value within each group". In practice this only behaves as expected in local mode; in a production (distributed) environment the result is not guaranteed. The source code comment explains why:
/**
* Returns the first value of `child` for a group of rows. If the first value of `child`
* is `null`, it returns `null` (respecting nulls). Even if [[First]] is used on an already
* sorted column, if we do partial aggregation and final aggregation (when mergeExpression
* is used) its result will not be deterministic (unless the input table is sorted and has
* a single partition, and we use a single reducer to do the aggregation.).
*/
So when are `first` and `last` actually deterministic? Only when the input table is already sorted, lives in a single partition, and a single reducer performs the aggregation — conditions that rarely hold in practice.
In other words, with multiple partitions you cannot use `first`/`last` to reliably get the first and last record of a group.
Solution: use a Window function.
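A quick way to see the problem is to force the data into several partitions and run `first`/`last` repeatedly. This is a sketch assuming a local SparkSession; the point is that the result may differ between runs, so no expected output is shown:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{first, last}

val spark = SparkSession.builder().master("local[4]").appName("FirstLastNonDeterministic").getOrCreate()
import spark.implicits._

// Repartitioning scatters the rows of each group across partitions,
// so the row order seen by the partial aggregators is not stable.
val df = Seq(("a", 10), ("a", 12), ("a", 11)).toDF("name", "value")
  .repartition(4)

// first/last return whichever row each partition's aggregator happened to
// see first/last — the value can change between runs or cluster layouts.
df.groupBy("name").agg(first("value"), last("value")).show()
```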
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder().master("local").appName("Demo").getOrCreate()
import spark.implicits._
val df = Seq(("a", 10, 12345), ("a", 12, 34567), ("a", 11, 23456), ("b", 10, 55555), ("b", 8, 12348)).toDF("name", "value", "event_time")
// Define windows: one ascending and one descending by event_time within each name group
val asc = Window.partitionBy("name").orderBy($"event_time")
val desc = Window.partitionBy("name").orderBy($"event_time".desc)
// Assign a row_number over each window, then keep the row numbered 1
val firstValue = df.withColumn("rn", row_number().over(asc)).where($"rn" === 1).drop("rn")
val lastValue = df.withColumn("rn", row_number().over(desc)).where($"rn" === 1).drop("rn")
// Join the first/last rows back together with the group counts
df.groupBy("name")
  .count().as("t1")
  .join(firstValue.as("fv"), "name")
  .join(lastValue.as("lv"), "name")
  .select($"t1.name", $"fv.value".as("first_value"), $"lv.value".as("last_value"), $"t1.count")
  .show()
Output:
+----+-----------+----------+-----+
|name|first_value|last_value|count|
+----+-----------+----------+-----+
| b| 8| 10| 2|
| a| 10| 12| 3|
+----+-----------+----------+-----+
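As an aside, if you are on Spark 3.0 or later, the `min_by`/`max_by` SQL aggregates can achieve the same result in a single aggregation, with no window or join. This sketch assumes the same `df` as above and calls the functions via `expr()`, since the Scala `functions.min_by`/`max_by` helpers were only added in later 3.x releases:

```scala
import org.apache.spark.sql.functions.{count, expr}

// min_by(value, event_time) returns the value from the row with the smallest
// event_time in each group; max_by does the same for the largest event_time.
df.groupBy("name")
  .agg(
    expr("min_by(value, event_time)").as("first_value"),
    expr("max_by(value, event_time)").as("last_value"),
    count("*").as("count"))
  .show()
```

This avoids the two extra shuffles introduced by the window-plus-join approach, at the cost of requiring a newer Spark version.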