Reference: Introducing Window Functions in Spark SQL
Window Functions
At its core, a window function calculates a return value for every input row of a table based on a group of rows, called the Frame. Every input row can have a unique frame associated with it. This characteristic of window functions makes them more powerful than other functions and allows users to express various data processing tasks that are hard (if not impossible) to be expressed without window functions in a concise way.
My understanding: a window function first partitions rows by the values of one or more columns, then applies a function over a specified range of rows (the frame) within each partition. This makes it very powerful!
Key pieces
- Partitioning
partitionBy
- Ordering
orderBy
- Frame selection
rangeBetween / rowsBetween
demo
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

tup = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]
df = spark.createDataFrame(tup, ["id", "category"])
df.show()

# Per category, sum "id" over a value-based frame anchored at the current row
window = Window.partitionBy("category").orderBy(df.id.desc()).rangeBetween(Window.currentRow, 1)
df.withColumn("sum", F.sum("id").over(window)).show()
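To make the demo's frame logic concrete without a Spark cluster, here is a plain-Python re-computation of what `rangeBetween(Window.currentRow, 1)` should produce. It assumes Spark's documented range-frame behavior with a descending `orderBy`: the "following" offset moves in the sort direction, so the frame covers rows whose id lies in [current id - 1, current id].

```python
from collections import defaultdict

rows = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]

# partitionBy("category")
parts = defaultdict(list)
for id_, cat in rows:
    parts[cat].append(id_)

result = {}
for cat, ids in parts.items():
    ids.sort(reverse=True)  # orderBy(df.id.desc())
    sums = []
    for cur in ids:
        # Frame: rows whose id is within 1 of the current id, in sort direction
        frame = [v for v in ids if cur - 1 <= v <= cur]
        sums.append(sum(frame))
    result[cat] = list(zip(ids, sums))

print(result)
```

For category "b" (ids 3, 2, 1 after the descending sort), the frames are [3, 2], [2, 1], and [1], giving sums 5, 3, and 1.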
Frame selection
The reference point is the current row.
- Selecting by row count
rowsBetween(x, y)
Window.unboundedPreceding means all rows before the current row
Window.currentRow means the current row
Window.unboundedFollowing means all rows after the current row
rowsBetween(-1, 1)
means the function is applied to the frame from the previous row through the next row
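A minimal plain-Python sketch of the rowsBetween(-1, 1) semantics described above: for each row, the frame is the previous row, the current row, and the next row, clipped at the partition edges. The function name `rows_between_sum` is illustrative, not part of any API.

```python
def rows_between_sum(values, start=-1, end=1):
    """Sum over a positional frame [i + start, i + end] for each index i."""
    out = []
    for i in range(len(values)):
        lo = max(0, i + start)            # clip at the start of the partition
        hi = min(len(values), i + end + 1)  # clip at the end of the partition
        out.append(sum(values[lo:hi]))
    return out

print(rows_between_sum([1, 2, 3, 4]))
```

The first row's frame is only [1, 2] (there is no previous row), so its sum is 3; the second row's frame is [1, 2, 3], giving 6, and so on.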
- Selecting by value range: rangeBetween(x, y)
The reference point is the current row's value.
rangeBetween(-20, 50)
For example, if the current value is 18,
the selected value range is [-2, 68]
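The value-range rule above can be sketched in plain Python: the frame contains every row whose ordering value falls in [current + x, current + y], regardless of row position. The helper name `range_between_frame` is hypothetical, for illustration only.

```python
def range_between_frame(values, cur, start=-20, end=50):
    """Rows whose ordering value lies in [cur + start, cur + end]."""
    return [v for v in values if cur + start <= v <= cur + end]

vals = [-5, 10, 18, 40, 70, 100]
# Current value 18 with rangeBetween(-20, 50) -> value window [-2, 68]
print(range_between_frame(vals, 18))
```

Only 10, 18, and 40 fall inside [-2, 68]; -5, 70, and 100 are excluded even though some of them are positionally adjacent.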
Main functions
API | Purpose |
---|---|
rank | Rank within the partition; tied values share a rank and leave gaps after ties |
dense_rank | Rank within the partition; tied values share a rank, with no gaps |
row_number | Sequential row number within the partition, starting at 1 |
min | Minimum value over the frame |
max | Maximum value over the frame |
sum | Sum over the frame |
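The difference between rank, dense_rank, and row_number is easiest to see on one sorted partition with ties. This plain-Python sketch mirrors the Spark window functions of the same names (the `rankings` helper itself is illustrative, not a Spark API):

```python
def rankings(values):
    """Return (value, rank, dense_rank, row_number) for each value, sorted desc."""
    values = sorted(values, reverse=True)
    rank, dense, rows = [], [], []
    for i, v in enumerate(values):
        rows.append(i + 1)                 # row_number: always increments
        if i > 0 and v == values[i - 1]:
            rank.append(rank[-1])          # rank: ties share, gaps follow
            dense.append(dense[-1])        # dense_rank: ties share, no gaps
        else:
            rank.append(i + 1)
            dense.append(dense[-1] + 1 if dense else 1)
    return list(zip(values, rank, dense, rows))

print(rankings([100, 90, 90, 80]))
```

With the tie at 90, rank jumps from 2 to 4 for the value 80, dense_rank goes 1, 2, 2, 3 with no gap, and row_number simply counts 1 through 4.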