Reference: Introducing Window Functions in Spark SQL
Window Functions
At its core, a window function calculates a return value for every input row of a table based on a group of rows, called the Frame. Every input row can have a unique frame associated with it. This characteristic of window functions makes them more powerful than other functions and allows users to express various data processing tasks that are hard (if not impossible) to be expressed without window functions in a concise way.
My understanding: a window function first partitions rows by the values of one or more columns, then applies a function over a specified range of rows (the frame) within each partition. This makes it very powerful!
Key pieces
- Partitioning
partitionBy
- Ordering
orderBy
- Frame selection
rangeBetween / rowsBetween
demo
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

tup = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]
df = spark.createDataFrame(tup, ["id", "category"])
df.show()

# Per category, sum "id" over a value-based frame anchored at the current row
window = Window.partitionBy("category").orderBy(df.id.desc()).rangeBetween(Window.currentRow, 1)
df.withColumn("sum", F.sum("id").over(window)).show()
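To make the demo's frame logic concrete without a Spark cluster, here is a plain-Python re-computation of what `rangeBetween(Window.currentRow, 1)` should produce. It assumes Spark's documented range-frame behavior with a descending `orderBy`: the "following" offset moves in the sort direction, so the frame covers rows whose id lies in [current id - 1, current id].

```python
from collections import defaultdict

rows = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]

# partitionBy("category")
parts = defaultdict(list)
for id_, cat in rows:
    parts[cat].append(id_)

result = {}
for cat, ids in parts.items():
    ids.sort(reverse=True)  # orderBy(df.id.desc())
    sums = []
    for cur in ids:
        # Frame: rows whose id is within 1 of the current id, in sort direction
        frame = [v for v in ids if cur - 1 <= v <= cur]
        sums.append(sum(frame))
    result[cat] = list(zip(ids, sums))

print(result)
```

For category "b" (ids 3, 2, 1 after the descending sort), the frames are [3, 2], [2, 1], and [1], giving sums 5, 3, and 1.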
Frame selection
The reference point is the current row.
- Selecting by row count
rowsBetween(x, y)
Window.unboundedPreceding means all rows before the current row
Window.currentRow means the current row
Window.unboundedFollowing means all rows after the current row
rowsBetween(-1, 1)
means the function is applied to the frame from the previous row through the next row
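A minimal plain-Python sketch of the rowsBetween(-1, 1) semantics described above: for each row, the frame is the previous row, the current row, and the next row, clipped at the partition edges. The function name `rows_between_sum` is illustrative, not part of any API.

```python
def rows_between_sum(values, start=-1, end=1):
    """Sum over a positional frame [i + start, i + end] for each index i."""
    out = []
    for i in range(len(values)):
        lo = max(0, i + start)            # clip at the start of the partition
        hi = min(len(values), i + end + 1)  # clip at the end of the partition
        out.append(sum(values[lo:hi]))
    return out

print(rows_between_sum([1, 2, 3, 4]))
```

The first row's frame is only [1, 2] (there is no previous row), so its sum is 3; the second row's frame is [1, 2, 3], giving 6, and so on.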
- Selecting by value range: rangeBetween(x, y)
The reference point is the current row's value.
rangeBetween(-20, 50)
For example, if the current value is 18,
the selected value range is [-2, 68]
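The value-range rule above can be sketched in plain Python: the frame contains every row whose ordering value falls in [current + x, current + y], regardless of row position. The helper name `range_between_frame` is hypothetical, for illustration only.

```python
def range_between_frame(values, cur, start=-20, end=50):
    """Rows whose ordering value lies in [cur + start, cur + end]."""
    return [v for v in values if cur + start <= v <= cur + end]

vals = [-5, 10, 18, 40, 70, 100]
# Current value 18 with rangeBetween(-20, 50) -> value window [-2, 68]
print(range_between_frame(vals, 18))
```

Only 10, 18, and 40 fall inside [-2, 68]; -5, 70, and 100 are excluded even though some of them are positionally adjacent.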
Main functions
API | Purpose |
---|---|
rank | Rank within the partition; tied values share a rank and leave gaps after ties |
dense_rank | Rank within the partition; tied values share a rank, with no gaps |
row_number | Sequential row number within the partition, starting at 1 |
min | Minimum value over the frame |
max | Maximum value over the frame |
sum | Sum over the frame |
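The difference between rank, dense_rank, and row_number is easiest to see on one sorted partition with ties. This plain-Python sketch mirrors the Spark window functions of the same names (the `rankings` helper itself is illustrative, not a Spark API):

```python
def rankings(values):
    """Return (value, rank, dense_rank, row_number) for each value, sorted desc."""
    values = sorted(values, reverse=True)
    rank, dense, rows = [], [], []
    for i, v in enumerate(values):
        rows.append(i + 1)                 # row_number: always increments
        if i > 0 and v == values[i - 1]:
            rank.append(rank[-1])          # rank: ties share, gaps follow
            dense.append(dense[-1])        # dense_rank: ties share, no gaps
        else:
            rank.append(i + 1)
            dense.append(dense[-1] + 1 if dense else 1)
    return list(zip(values, rank, dense, rows))

print(rankings([100, 90, 90, 80]))
```

With the tie at 90, rank jumps from 2 to 4 for the value 80, dense_rank goes 1, 2, 2, 3 with no gap, and row_number simply counts 1 through 4.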