Useful Functions in Spark SQL

Almost any internet company working with big data installs and uses Spark, and Spark SQL is a real asset for people who are not yet familiar with the Spark API; for those who already know MySQL it is like adding wings to a tiger. Without further ado, let's look at some Spark SQL functions that are rarely used but very useful.

 lit: Creates a [[Column]] of a literal value. For example, df.select(lit("2020-02-19").as("now")) directly creates a constant column named now.
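A minimal runnable sketch of lit, assuming a local SparkSession; the sample data and column names are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local[*]").appName("lit-demo").getOrCreate()
import spark.implicits._

val df = Seq(("Alice", 90), ("Bob", 75)).toDF("name", "score")

// lit wraps a constant into a Column; here every row gets the same "now" value.
df.select($"name", lit("2020-02-19").as("now")).show()
```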

 typedLit: The difference between this function and [[lit]] is that this function can handle parameterized Scala types, e.g. List, Seq and Map. In other words, you can pass a collection as a column.
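A sketch of typedLit under the same assumptions (local SparkSession, made-up data), showing collections embedded as array and map columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.typedLit

val spark = SparkSession.builder().master("local[*]").appName("typedlit-demo").getOrCreate()
import spark.implicits._

val df = Seq(1).toDF("id")

// typedLit can embed parameterized Scala types as columns, which lit cannot.
df.select(
  typedLit(Seq(1, 2, 3)).as("seq_col"),           // becomes an array column
  typedLit(Map("a" -> 1, "b" -> 2)).as("map_col") // becomes a map column
).show(false)
```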

Sort functions:
  two of them, desc and asc.
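A quick sketch of both sort functions in one orderBy call (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{asc, desc}

val spark = SparkSession.builder().master("local[*]").appName("sort-demo").getOrCreate()
import spark.implicits._

val df = Seq(("math", 90), ("math", 75), ("english", 82)).toDF("subject", "score")

// Sort ascending by subject, then descending by score within each subject.
df.orderBy(asc("subject"), desc("score")).show()
```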

Aggregate functions: aggregation needs little introduction; it is the most common use case, e.g. computing every student's total score, or the head count of each department.
  
 approx_count_distinct: Aggregate function: returns the approximate number of distinct items in a group, i.e. an approximate distinct count within each group.
SPARK SQL AGGREGATE FUNCTIONS	FUNCTION DESCRIPTION
approx_count_distinct(e: Column)	Returns the approximate count of distinct items in a group.
approx_count_distinct(e: Column, rsd: Double)	Returns the approximate count of distinct items in a group, with rsd as the maximum allowed relative standard deviation.
avg(e: Column)	Returns the average of values in the input column.
collect_list(e: Column)	Returns all values from an input column, with duplicates.
collect_set(e: Column)	Returns all values from an input column with duplicate values eliminated.
corr(column1: Column, column2: Column)	Returns the Pearson correlation coefficient for two columns.
count(e: Column)	Returns the number of elements in a column.
countDistinct(expr: Column, exprs: Column*)	Returns the number of distinct elements in the columns.
covar_pop(column1: Column, column2: Column)	Returns the population covariance for two columns.
covar_samp(column1: Column, column2: Column)	Returns the sample covariance for two columns.
first(e: Column, ignoreNulls: Boolean)	Returns the first element in a column. When ignoreNulls is set to true, it returns the first non-null element.
first(e: Column): Column	Returns the first element in a column.
grouping(e: Column)	Indicates whether a specified column in a GROUP BY list is aggregated or not; returns 1 for aggregated or 0 for not aggregated in the result set.
kurtosis(e: Column)	Returns the kurtosis of the values in a group.
last(e: Column, ignoreNulls: Boolean)	Returns the last element in a column. When ignoreNulls is set to true, it returns the last non-null element.
last(e: Column)	Returns the last element in a column.
max(e: Column)	Returns the maximum value in a column.
mean(e: Column)	Alias for avg. Returns the average of the values in a column.
min(e: Column)	Returns the minimum value in a column.
skewness(e: Column)	Returns the skewness of the values in a group.
stddev(e: Column)	Alias for `stddev_samp`.
stddev_samp(e: Column)	Returns the sample standard deviation of values in a column.
stddev_pop(e: Column)	Returns the population standard deviation of the values in a column.
sum(e: Column)	Returns the sum of all values in a column.
sumDistinct(e: Column)	Returns the sum of all distinct values in a column.
variance(e: Column)	Alias for `var_samp`.
var_samp(e: Column)	Returns the unbiased variance of the values in a column.
var_pop(e: Column)	Returns the population variance of the values in a column.
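A sketch combining several of the aggregate functions above in one groupBy, matching the "every student's scores" example from the text (local SparkSession and sample data are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{approx_count_distinct, avg, collect_set, countDistinct, sum}

val spark = SparkSession.builder().master("local[*]").appName("agg-demo").getOrCreate()
import spark.implicits._

val scores = Seq(
  ("Alice", "math", 90), ("Alice", "english", 80),
  ("Bob", "math", 75), ("Bob", "math", 75)
).toDF("student", "subject", "score")

// Per-student aggregations: total, average, exact and approximate distinct counts,
// and the de-duplicated set of subjects.
scores.groupBy("student").agg(
  sum("score").as("total"),
  avg("score").as("avg"),
  countDistinct("subject").as("subjects"),
  approx_count_distinct("subject").as("approx_subjects"),
  collect_set("subject").as("subject_set")
).show(false)
```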

Next, let's look at window functions, which are not used as often but are very useful. What is a window function?

a window function calculates a return value for every input row of a table based on a group of rows, called the Frame. Every input row can have a unique frame associated with it. This characteristic of window functions makes them more powerful than other functions and allows users to express various data processing tasks that are hard (if not impossible) to be expressed without window functions in a concise way.

Spark window functions operate on a group of rows (a frame within a partition) and return a single value for every input row. Spark SQL supports three kinds of window functions: ranking functions, analytic functions, and aggregate functions.

WINDOW FUNCTIONS USAGE & SYNTAX	WINDOW FUNCTION DESCRIPTION
row_number(): Column	Returns a sequential number starting from 1 within a window partition.
rank(): Column	Returns the rank of rows within a window partition, with gaps.
percent_rank(): Column	Returns the percentile rank of rows within a window partition.
dense_rank(): Column	Returns the rank of rows within a window partition without any gaps, whereas rank() returns ranks with gaps.
ntile(n: Int): Column	Returns the ntile id in a window partition.
cume_dist(): Column	Returns the cumulative distribution of values within a window partition.
lag(e: Column, offset: Int): Column
lag(columnName: String, offset: Int): Column
lag(columnName: String, offset: Int, defaultValue: Any): Column	Returns the value that is `offset` rows before the current row, and null if there are fewer than `offset` rows before the current row.
lead(e: Column, offset: Int): Column
lead(columnName: String, offset: Int): Column
lead(columnName: String, offset: Int, defaultValue: Any): Column	Returns the value that is `offset` rows after the current row, and null if there are fewer than `offset` rows after the current row.
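The table above can be sketched in use as follows: each row gets values computed over its own frame, here the rows of its department ordered by salary descending. The SparkSession setup and sample data are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, lag, rank, row_number}

val spark = SparkSession.builder().master("local[*]").appName("window-demo").getOrCreate()
import spark.implicits._

val salaries = Seq(
  ("sales", "Alice", 5000), ("sales", "Bob", 4000),
  ("hr", "Carol", 4500), ("hr", "Dave", 4500)
).toDF("dept", "name", "salary")

// The window spec: partition by department, order by salary descending.
val w = Window.partitionBy("dept").orderBy(desc("salary"))

salaries.select(
  $"dept", $"name", $"salary",
  row_number().over(w).as("row_number"),      // 1, 2, ... even for ties
  rank().over(w).as("rank"),                  // ties share a rank, with gaps
  lag($"salary", 1).over(w).as("prev_salary") // previous row's salary, null for the first row
).show()
```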

 
