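Spark DataFrame SQL functions: a summary
Spark DataFrame ships with more than 200 built-in functions covering aggregation, collections, date and time, strings, math, sorting, windowing, UDFs, and more. It is a well-stocked toolbox, and using it flexibly can save a lot of effort.
Import the SQL functions before using them: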
import org.apache.spark.sql.functions._
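As a quick illustration of the toolbox, the sketch below applies a few built-in functions (upper, to_date, year, greatest) to a small made-up DataFrame. The object name, the column names, and the sample rows are assumptions for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, greatest, to_date, upper, year}

object BuiltinFunctionsExample {
  // Hypothetical sample data: (name, birthday, math, eng)
  def enriched(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq(
      ("amy", "2010-07-22", 51, 84),
      ("bob", "2011-01-05", 89, 85)
    ).toDF("name", "birthday", "math", "eng")

    df.select(
      upper(col("name")).as("name_upper"),               // string function
      year(to_date(col("birthday"))).as("birth_year"),   // date functions
      greatest(col("math"), col("eng")).as("best_score") // row-wise maximum
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BuiltinFunctionsExample")
      .master("local[1]")
      .getOrCreate()
    enriched(spark).show(false)
    spark.stop()
  }
}
```

Each function returns a Column, so they compose freely inside select, withColumn, or filter.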
User-defined functions (UDF)
If the built-in toolbox falls short and you need to build your own function, you can do so with udf.
// Plain Scala function that backs the UDF: maps an age string to an age-bracket code
val ageField = (age: String) => {
  val ageInt = age.toInt // throws NumberFormatException if age is not a valid integer
  ageInt match {
    case a if a <= 12 => "1"
    case a if a >= 13 && a <= 17 => "2"
    case a if a >= 18 && a <= 24 => "3"
    case a if a >= 25 && a <= 30 => "4"
    case a if a >= 31 && a <= 35 => "5"
    case a if a >= 36 && a <= 40 => "6"
    case a if a >= 41 && a <= 50 => "7"
    case a if a >= 51 && a <= 60 => "8"
    case _ => "9" // 61 and above
  }
}
Once the function has been wrapped with udf, it can be used in the same way as a built-in function. For example, to add an age-bracket column:
val getAgeField = udf(ageField)
val result = df.withColumn("agefield", getAgeField($"age"))
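The snippet above assumes that a DataFrame df with a string column age already exists. A self-contained sketch of the same idea follows. The object name, the helper ageBucket, the sample rows, and the "0" code for unparsable input are all assumptions added for illustration, not part of the original function.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import scala.util.Try

object AgeBucketExample {
  // Hypothetical helper: returns bracket codes "1".."9", or "0" when the
  // input is null or not an integer (an assumption, not in the original).
  def ageBucket(age: String): String =
    Try(age.trim.toInt).toOption match {
      case None               => "0"
      case Some(a) if a <= 12 => "1"
      case Some(a) if a <= 17 => "2"
      case Some(a) if a <= 24 => "3"
      case Some(a) if a <= 30 => "4"
      case Some(a) if a <= 35 => "5"
      case Some(a) if a <= 40 => "6"
      case Some(a) if a <= 50 => "7"
      case Some(a) if a <= 60 => "8"
      case Some(_)            => "9"
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AgeBucketExample")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows; the last one shows the fallback for bad input
    val people = Seq(("a", "10"), ("b", "27"), ("c", "n/a")).toDF("name", "age")

    val ageBucketUdf = udf(ageBucket _) // wrap the Scala function as a UDF
    people.withColumn("agefield", ageBucketUdf(col("age"))).show(false)
    spark.stop()
  }
}
```

Keeping the logic in a plain Scala method makes it easy to unit test without starting Spark. Where a built-in expression can express the same rule, it is usually worth preferring, since Spark cannot optimize the inside of a UDF.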
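Window functions
Window functions work the same way as in Hive. Using them on a Spark DataFrame takes two main steps.
First, define the window specification: the partitioning (partitionBy), then the ordering (orderBy), and finally the frame (rowsBetween, rangeBetween).
Second, apply a function over that window. The two kinds used here are aggregate functions (max, min, avg, ...) and ranking functions (rank, dense_rank, percent_rank, ntile, row_number); Spark also provides analytic functions such as lag and lead.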
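The frame part of step one can be sketched with rowsBetween: a running total of math scores within each sex group, from the first row of the partition up to the current row. The object name, column name running_math, and the reduced sample data are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

object RunningTotalExample {
  def runningTotals(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical reduced data set: (sex, math)
    val scores = Seq(("男", 51), ("男", 89), ("男", 40), ("女", 11), ("女", 42))
      .toDF("sex", "math")

    // Partition by sex, order by math descending, and restrict the frame
    // to "all rows from the start of the partition up to the current row"
    val runningWindow = Window
      .partitionBy("sex")
      .orderBy(col("math").desc)
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    scores.withColumn("running_math", sum("math").over(runningWindow))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RunningTotalExample")
      .master("local[1]")
      .getOrCreate()
    runningTotals(spark).show(false)
    spark.stop()
  }
}
```

rowsBetween counts physical rows, whereas rangeBetween compares values of the ordering column, so the two can differ when the ordering column contains ties.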
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Test")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    // Sample data: one row per student
    val df = spark.createDataset(Seq(
      ("2010-07-22", "豬聰明", 1, "男", 51, 84, 73, "四川"),
      ("2010-07-23", "豬堅強", 2, "男", 89, 85, 88, "廣東"),
      ("2010-07-24", "豬勇敢", 1, "男", 40, 86, 78, "廣東"),
      ("2010-07-22", "豬能幹", 2, "男", 81, 85, 56, "湖北"),
      ("2010-07-23", "豬豪傑", 3, "男", 77, 82, 93, "四川"),
      ("2010-07-21", "豬可愛", 3, "女", 11, 82, 70, "湖北"),
      ("2010-07-22", "豬溫柔", 4, "女", 42, 86, 34, "湖北"),
      ("2010-07-23", "豬美麗", 4, "女", 54, 84, 92, "湖北"),
      ("2010-07-25", "豬優雅", 4, "女", 40, 91, 68, "四川"),
      ("2010-07-26", "豬大方", 4, "女", 23, 97, 68, "廣東"))
    ).toDF("birthday", "name", "class", "sex", "math", "eng", "tech", "address")

    // Average math score within each sex group
    val windowBySex = Window.partitionBy("sex")
    df.withColumn("sex_avg", avg("math").over(windowBySex)).show(false)
  }
}
Output:
+----------+----+-----+---+----+---+----+-------+-------+
|birthday |name|class|sex|math|eng|tech|address|sex_avg|
+----------+----+-----+---+----+---+----+-------+-------+
|2010-07-22|豬聰明 |1 |男 |51 |84 |73 |四川 |67.6 |
|2010-07-23|豬堅強 |2 |男 |89 |85 |88 |廣東 |67.6 |
|2010-07-24|豬勇敢 |1 |男 |40 |86 |78 |廣東 |67.6 |
|2010-07-22|豬能幹 |2 |男 |81 |85 |56 |湖北 |67.6 |
|2010-07-23|豬豪傑 |3 |男 |77 |82 |93 |四川 |67.6 |
|2010-07-21|豬可愛 |3 |女 |11 |82 |70 |湖北 |34.0 |
|2010-07-22|豬溫柔 |4 |女 |42 |86 |34 |湖北 |34.0 |
|2010-07-23|豬美麗 |4 |女 |54 |84 |92 |湖北 |34.0 |
|2010-07-25|豬優雅 |4 |女 |40 |91 |68 |四川 |34.0 |
|2010-07-26|豬大方 |4 |女 |23 |97 |68 |廣東 |34.0 |
+----------+----+-----+---+----+---+----+-------+-------+
Highest score within each region group
// Highest English score within each address group
val windowByAddress = Window.partitionBy("address")
df.withColumn("max_eng_addr", max("eng").over(windowByAddress)).show(false)
Output:
+----------+----+-----+---+----+---+----+-------+------------+
|birthday |name|class|sex|math|eng|tech|address|max_eng_addr|
+----------+----+-----+---+----+---+----+-------+------------+
|2010-07-23|豬堅強 |2 |男 |89 |85 |88 |廣東 |97 |
|2010-07-24|豬勇敢 |1 |男 |40 |86 |78 |廣東 |97 |
|2010-07-26|豬大方 |4 |女 |23 |97 |68 |廣東 |97 |
|2010-07-22|豬能幹 |2 |男 |81 |85 |56 |湖北 |86 |
|2010-07-21|豬可愛 |3 |女 |11 |82 |70 |湖北 |86 |
|2010-07-22|豬溫柔 |4 |女 |42 |86 |34 |湖北 |86 |
|2010-07-23|豬美麗 |4 |女 |54 |84 |92 |湖北 |86 |
|2010-07-22|豬聰明 |1 |男 |51 |84 |73 |四川 |91 |
|2010-07-23|豬豪傑 |3 |男 |77 |82 |93 |四川 |91 |
|2010-07-25|豬優雅 |4 |女 |40 |91 |68 |四川 |91 |
+----------+----+-----+---+----+---+----+-------+------------+
Rank math scores within each class
// Rank math scores within each class (ranking functions require the window to be ordered with orderBy)
val windowByClass = Window.partitionBy("class").orderBy($"math".desc)
df.withColumn("rank_math_class", rank().over(windowByClass)).show(false)
Output:
+----------+----+-----+---+----+---+----+-------+---------------+
|birthday |name|class|sex|math|eng|tech|address|rank_math_class|
+----------+----+-----+---+----+---+----+-------+---------------+
|2010-07-22|豬聰明 |1 |男 |51 |84 |73 |四川 |1 |
|2010-07-24|豬勇敢 |1 |男 |40 |86 |78 |廣東 |2 |
|2010-07-23|豬豪傑 |3 |男 |77 |82 |93 |四川 |1 |
|2010-07-21|豬可愛 |3 |女 |11 |82 |70 |湖北 |2 |
|2010-07-23|豬美麗 |4 |女 |54 |84 |92 |湖北 |1 |
|2010-07-22|豬溫柔 |4 |女 |42 |86 |34 |湖北 |2 |
|2010-07-25|豬優雅 |4 |女 |40 |91 |68 |四川 |3 |
|2010-07-26|豬大方 |4 |女 |23 |97 |68 |廣東 |4 |
|2010-07-23|豬堅強 |2 |男 |89 |85 |88 |廣東 |1 |
|2010-07-22|豬能幹 |2 |男 |81 |85 |56 |湖北 |2 |
+----------+----+-----+---+----+---+----+-------+---------------+
Rank math scores within each sex group, and add columns that bucket and number the rows of each group
// Rank math scores within each sex group,
// then split each group into buckets and number its rows
val windowBySex1 = Window.partitionBy("sex").orderBy($"math".desc)
df.withColumn("dense_rank_math_class", dense_rank().over(windowBySex1))
  .withColumn("ntile_3", ntile(3).over(windowBySex1))
  .withColumn("math_no", row_number().over(windowBySex1))
  .withColumn("math_no_persent", percent_rank().over(windowBySex1))
  .show(false)
Output:
+----------+----+-----+---+----+---+----+-------+---------------------+-------+-------+---------------+
|birthday |name|class|sex|math|eng|tech|address|dense_rank_math_class|ntile_3|math_no|math_no_persent|
+----------+----+-----+---+----+---+----+-------+---------------------+-------+-------+---------------+
|2010-07-23|豬堅強 |2 |男 |89 |85 |88 |廣東 |1 |1 |1 |0.0 |
|2010-07-22|豬能幹 |2 |男 |81 |85 |56 |湖北 |2 |1 |2 |0.25 |
|2010-07-23|豬豪傑 |3 |男 |77 |82 |93 |四川 |3 |2 |3 |0.5 |
|2010-07-22|豬聰明 |1 |男 |51 |84 |73 |四川 |4 |2 |4 |0.75 |
|2010-07-24|豬勇敢 |1 |男 |40 |86 |78 |廣東 |5 |3 |5 |1.0 |
|2010-07-23|豬美麗 |4 |女 |54 |84 |92 |湖北 |1 |1 |1 |0.0 |
|2010-07-22|豬溫柔 |4 |女 |42 |86 |34 |湖北 |2 |1 |2 |0.25 |
|2010-07-25|豬優雅 |4 |女 |40 |91 |68 |四川 |3 |2 |3 |0.5 |
|2010-07-26|豬大方 |4 |女 |23 |97 |68 |廣東 |4 |2 |4 |0.75 |
|2010-07-21|豬可愛 |3 |女 |11 |82 |70 |湖北 |5 |3 |5 |1.0 |
+----------+----+-----+---+----+---+----+-------+---------------------+-------+-------+---------------+
Take the top math scorer in each sex group
val windowBySex2 = Window.partitionBy("sex").orderBy($"math".desc)
df.withColumn("number_math_sex", row_number().over(windowBySex2))
  .filter($"number_math_sex" <= 1)
  .show(false)
// selectExpr gives a similar result in SQL syntax: the highest math score per sex group.
// With "order by math desc" the running max already equals the group max on every row.
df.selectExpr("sex", "max(math) over (partition by sex order by math desc) as max_math")
  .distinct()
  .show(false)
Output:
+----------+----+-----+---+----+---+----+-------+---------------+
|birthday |name|class|sex|math|eng|tech|address|number_math_sex|
+----------+----+-----+---+----+---+----+-------+---------------+
|2010-07-23|豬堅強 |2 |男 |89 |85 |88 |廣東 |1 |
|2010-07-23|豬美麗 |4 |女 |54 |84 |92 |湖北 |1 |
+----------+----+-----+---+----+---+----+-------+---------------+
+---+--------+
|sex|max_math|
+---+--------+
|男 |89 |
|女 |54 |
+---+--------+
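The row_number approach above returns only the top scorer of each group. To retrieve both the top and the bottom scorer, one option is to number the rows in both directions and keep the rows that are first in either ordering. This is a minimal sketch: the object name, the column names rank_desc and rank_asc, and the reduced sample data are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object TopAndBottomExample {
  def topAndBottom(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical reduced data set: (name, sex, math)
    val scores = Seq(
      ("豬堅強", "男", 89), ("豬聰明", "男", 51), ("豬勇敢", "男", 40),
      ("豬美麗", "女", 54), ("豬溫柔", "女", 42), ("豬可愛", "女", 11)
    ).toDF("name", "sex", "math")

    val bySexDesc = Window.partitionBy("sex").orderBy(col("math").desc)
    val bySexAsc  = Window.partitionBy("sex").orderBy(col("math").asc)

    scores
      .withColumn("rank_desc", row_number().over(bySexDesc)) // 1 = highest score
      .withColumn("rank_asc", row_number().over(bySexAsc))   // 1 = lowest score
      .filter(col("rank_desc") === 1 || col("rank_asc") === 1)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TopAndBottomExample")
      .master("local[1]")
      .getOrCreate()
    topAndBottom(spark).show(false)
    spark.stop()
  }
}
```

Because row_number breaks ties arbitrarily, rank or dense_rank may be a better fit when several students share the highest or lowest score and all of them should be kept.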
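Structured data types
Structured (nested) data types can be accessed by dot path on a DataFrame, or handled through the typed Dataset API, as in this example: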
import org.apache.spark.sql.SparkSession

object Test {
  // Case classes describing the nested schema
  case class Score(math: Int, eng: Int, tech: Int)
  case class Info(name: String, birthday: String, classNo: Int, sex: String, score: Score, interest: Interest)
  case class Interest(list: Seq[String])
  case class AddressInfo(name: String, address: String)
  // Shape of a row after joining Info with AddressInfo on name
  case class Student(name: String, address: String, birthday: String, classNo: Int, sex: String, score: Score, interest: Interest)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Test")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    val df = spark.createDataFrame(Seq(
      Info("豬聰明", "2010-07-22", 1, "男", Score(51, 84, 73), Interest(Seq("彈琴", "跳舞", "畫畫"))),
      Info("豬堅強", "2010-07-23", 2, "男", Score(89, 85, 88), Interest(Seq("跳舞", "畫畫"))),
      Info("豬勇敢", "2010-07-24", 1, "男", Score(40, 86, 78), Interest(Seq("彈琴", "跳舞"))),
      Info("豬能幹", "2010-07-22", 2, "男", Score(81, 85, 56), Interest(Seq("彈琴"))),
      Info("豬豪傑", "2010-07-23", 3, "男", Score(77, 82, 93), Interest(Seq("畫畫"))),
      Info("豬可愛", "2010-07-21", 3, "女", Score(11, 82, 70), Interest(Seq("畫畫"))),
      Info("豬溫柔", "2010-07-22", 4, "女", Score(42, 86, 34), Interest(Seq("彈琴"))),
      Info("豬美麗", "2010-07-23", 4, "女", Score(54, 84, 92), Interest(Seq("跳舞"))),
      Info("豬優雅", "2010-07-25", 4, "女", Score(40, 91, 68), Interest(Seq("彈琴", "跳舞", "畫畫"))),
      Info("豬大方", "2010-07-26", 4, "女", Score(23, 97, 68), Interest(Seq("跳舞", "畫畫")))
    ))

    val address = spark.createDataFrame(
      Seq(
        AddressInfo("豬聰明", "四川"),
        AddressInfo("豬堅強", "廣東"),
        AddressInfo("豬勇敢", "廣東"),
        AddressInfo("豬能幹", "湖北"),
        AddressInfo("豬豪傑", "四川"),
        AddressInfo("豬可愛", "湖北"),
        AddressInfo("豬溫柔", "湖北"),
        AddressInfo("豬美麗", "湖北"),
        AddressInfo("豬優雅", "四川"),
        AddressInfo("豬大方", "廣東")
      )
    )

    // Access a nested field by its dot path
    df.join(address, Seq("name"), "inner")
      .select("name", "score.math", "address")
      .show(false)

    // Convert to typed Datasets so that fields can be accessed as properties
    val addressDS = address.as[AddressInfo]
    val ds = df.as[Info]
    ds.join(addressDS, Seq("name"), "inner")
      .as[Student] // map the joined rows onto the Student case class
      .map(x => (x.name, x.address))
      .show(false)
  }
}
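Dot paths also work inside column expressions, which helps with nested arrays such as interest.list. The sketch below flattens that array with explode so that each hobby gets its own row. It is an illustration only: the object name, the simplified Pupil case class, and the hobby column name are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object NestedArrayExample {
  // Simplified, hypothetical versions of the nested case classes
  case class Interest(list: Seq[String])
  case class Pupil(name: String, interest: Interest)

  def hobbies(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val ds = Seq(
      Pupil("豬聰明", Interest(Seq("彈琴", "跳舞"))),
      Pupil("豬能幹", Interest(Seq("彈琴")))
    ).toDS()

    // "interest.list" addresses the array inside the struct;
    // explode emits one output row per array element
    ds.select(col("name"), explode(col("interest.list")).as("hobby"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NestedArrayExample")
      .master("local[1]")
      .getOrCreate()
    hobbies(spark).show(false)
    spark.stop()
  }
}
```

The case classes are declared at object level rather than inside main so that Spark can derive encoders for them.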