A Summary of Spark DataFrame Built-in SQL Functions

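Spark DataFrame ships with more than 200 built-in functions, covering aggregation, collections, date and time, strings, math, sorting, windowing, UDFs, and more. It is a well-stocked toolbox, and using it flexibly saves a lot of effort.
Before using these functions, import them: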

import org.apache.spark.sql.functions._
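As a quick illustration of what this import brings into scope, the sketch below applies one function from a few of the categories listed above. The object name BuiltinFunctionsSketch, the sample rows, and the column names are made up for this example; only the functions themselves (upper, length, round, year, to_date) come from org.apache.spark.sql.functions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object BuiltinFunctionsSketch {

  // Adds a few derived columns to a toy DataFrame, one or two per function category.
  def enrich(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical sample data: (name, birthday, score)
    val people = Seq(("alice", "2010-07-22", 84.256), ("bob", "2010-07-23", 91.5))
      .toDF("name", "birthday", "score")

    people
      .withColumn("name_upper", upper($"name"))             // string function
      .withColumn("name_len", length($"name"))              // string function
      .withColumn("score_1dp", round($"score", 1))          // math function
      .withColumn("birth_year", year(to_date($"birthday"))) // date functions
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BuiltinFunctionsSketch")
      .master("local")
      .getOrCreate()
    enrich(spark).show(false)
    spark.stop()
  }
}
```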

User-Defined Functions (UDFs)

If the toolbox falls short and you need to build your own wheel, you can do so with udf.

  // Plain Scala function behind the UDF: maps an age (as a string) to an age-bracket code.
  // Cases are tried in order, so each guard only needs the upper bound.
  val ageField = (age: String) => {
    age.toInt match {
      case a if a <= 12 => "1"
      case a if a <= 17 => "2"
      case a if a <= 24 => "3"
      case a if a <= 30 => "4"
      case a if a <= 35 => "5"
      case a if a <= 40 => "6"
      case a if a <= 50 => "7"
      case a if a <= 60 => "8"
      case _ => "9" // 61 and above
    }
  }

Once the function has been wrapped with udf, it can be used in the same way as a built-in function. For example, to add an age-bracket column:

// Wrap the Scala function as a UDF
val getAgeField = udf(ageField)
// Assumes df has a string column named "age"; $"age" needs import spark.implicits._
val result = df.withColumn("agefield", getAgeField($"age"))
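Bracket boundaries are easy to get wrong by one, so it can help to exercise the plain function before wrapping it in udf. Below is a minimal sketch that needs no Spark at all, assuming the same cut-offs as the UDF body above; the names AgeBucketCheck and bucket are invented for this illustration.

```scala
object AgeBucketCheck {

  // Same cut-offs as the UDF body above; returns the bracket code as a string.
  // Like the original, this throws NumberFormatException on non-numeric input.
  def bucket(age: String): String = age.toInt match {
    case a if a <= 12 => "1"
    case a if a <= 17 => "2"
    case a if a <= 24 => "3"
    case a if a <= 30 => "4"
    case a if a <= 35 => "5"
    case a if a <= 40 => "6"
    case a if a <= 50 => "7"
    case a if a <= 60 => "8"
    case _ => "9"
  }

  def main(args: Array[String]): Unit = {
    // Probe both sides of several boundaries.
    Seq("12", "13", "17", "18", "30", "31", "60", "61").foreach { a =>
      println(s"$a -> ${bucket(a)}")
    }
  }
}
```

If the age column may contain blanks or non-numeric text, the function would need explicit handling for those values before being used as a UDF.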

Window Functions

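Window functions work the same way as in Hive. Using them with a Spark DataFrame takes two main steps.
First, define the window. Specify the partitioning (partitionBy), then the ordering (orderBy), and finally the frame size (rowsBetween, rangeBetween).
Second, apply a function over the window. The examples below use two kinds of functions: aggregate functions (max, min, avg, ...) and ranking functions (rank, dense_rank, percent_rank, ntile, row_number). Spark also provides analytic functions such as lag and lead, which are not covered here.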
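None of the examples in this post set an explicit frame, so here is a minimal sketch of the two steps with rowsBetween: a running sum from the start of each partition up to the current row. The object name WindowFrameSketch, the toy rows, and the column names grp, day, and amount are assumptions made for the illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

object WindowFrameSketch {

  def runningSum(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical toy data: (group, day, amount)
    val sales = Seq(("a", 1, 10), ("a", 2, 20), ("a", 3, 30), ("b", 1, 5), ("b", 2, 15))
      .toDF("grp", "day", "amount")

    // Step 1: define the window (partitioning, ordering, frame).
    val fromStartToCurrent = Window
      .partitionBy("grp")
      .orderBy("day")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    // Step 2: apply an aggregate function over that window.
    sales.withColumn("running_sum", sum("amount").over(fromStartToCurrent))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WindowFrameSketch")
      .master("local")
      .getOrCreate()
    runningSum(spark).show(false)
    spark.stop()
  }
}
```

For group a the running sums are 10, 30, 60, and for group b they are 5, 20. Swapping the frame for rowsBetween(-1, 1) would instead sum each row with its immediate neighbours.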

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Test {

  def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("Test")
          .master("local")
          .getOrCreate()

        import spark.implicits._
        val df = spark.createDataset(Seq(
          ("2010-07-22", "豬聰明", 1, "男", 51, 84, 73, "四川"),
          ("2010-07-23", "豬堅強", 2, "男", 89, 85, 88, "廣東"),
          ("2010-07-24", "豬勇敢", 1, "男", 40, 86, 78, "廣東"),
          ("2010-07-22", "豬能幹", 2, "男", 81, 85, 56, "湖北"),
          ("2010-07-23", "豬豪傑", 3, "男", 77, 82, 93, "四川"),
          ("2010-07-21", "豬可愛", 3, "女", 11, 82, 70, "湖北"),
          ("2010-07-22", "豬溫柔", 4, "女", 42, 86, 34, "湖北"),
          ("2010-07-23", "豬美麗", 4, "女", 54, 84, 92, "湖北"),
          ("2010-07-25", "豬優雅", 4, "女", 40, 91, 68, "四川"),
          ("2010-07-26", "豬大方", 4, "女", 23, 97, 68, "廣東"))
        ).toDF("birthday", "name", "class", "sex", "math", "eng", "tech", "address")

    // Average math score within each sex group
    val windowBySex = Window.partitionBy("sex")
    df.withColumn("sex_avg", avg("math").over(windowBySex)).show(false)

  }

}

Output

+----------+----+-----+---+----+---+----+-------+-------+
|birthday  |name|class|sex|math|eng|tech|address|sex_avg|
+----------+----+-----+---+----+---+----+-------+-------+
|2010-07-22|豬聰明 |1    |男  |51  |84 |73  |四川     |67.6   |
|2010-07-23|豬堅強 |2    |男  |89  |85 |88  |廣東     |67.6   |
|2010-07-24|豬勇敢 |1    |男  |40  |86 |78  |廣東     |67.6   |
|2010-07-22|豬能幹 |2    |男  |81  |85 |56  |湖北     |67.6   |
|2010-07-23|豬豪傑 |3    |男  |77  |82 |93  |四川     |67.6   |
|2010-07-21|豬可愛 |3    |女  |11  |82 |70  |湖北     |34.0   |
|2010-07-22|豬溫柔 |4    |女  |42  |86 |34  |湖北     |34.0   |
|2010-07-23|豬美麗 |4    |女  |54  |84 |92  |湖北     |34.0   |
|2010-07-25|豬優雅 |4    |女  |40  |91 |68  |四川     |34.0   |
|2010-07-26|豬大方 |4    |女  |23  |97 |68  |廣東     |34.0   |
+----------+----+-----+---+----+---+----+-------+-------+

Finding the highest score in each region

    // Highest English score within each region (address)
    val windowByAddress = Window.partitionBy("address")
    df.withColumn("max_eng_addr", max("eng").over(windowByAddress)).show(false)

Output

+----------+----+-----+---+----+---+----+-------+------------+
|birthday  |name|class|sex|math|eng|tech|address|max_eng_addr|
+----------+----+-----+---+----+---+----+-------+------------+
|2010-07-23|豬堅強 |2    |男  |89  |85 |88  |廣東     |97          |
|2010-07-24|豬勇敢 |1    |男  |40  |86 |78  |廣東     |97          |
|2010-07-26|豬大方 |4    |女  |23  |97 |68  |廣東     |97          |
|2010-07-22|豬能幹 |2    |男  |81  |85 |56  |湖北     |86          |
|2010-07-21|豬可愛 |3    |女  |11  |82 |70  |湖北     |86          |
|2010-07-22|豬溫柔 |4    |女  |42  |86 |34  |湖北     |86          |
|2010-07-23|豬美麗 |4    |女  |54  |84 |92  |湖北     |86          |
|2010-07-22|豬聰明 |1    |男  |51  |84 |73  |四川     |91          |
|2010-07-23|豬豪傑 |3    |男  |77  |82 |93  |四川     |91          |
|2010-07-25|豬優雅 |4    |女  |40  |91 |68  |四川     |91          |
+----------+----+-----+---+----+---+----+-------+------------+

Ranking math scores within each class

    // Rank math scores within each class (ranking functions need an orderBy on the window)
    val windowByClass = Window.partitionBy("class").orderBy($"math".desc)
    df.withColumn("rank_math_class", rank().over(windowByClass)).show(false)

Output

+----------+----+-----+---+----+---+----+-------+---------------+
|birthday  |name|class|sex|math|eng|tech|address|rank_math_class|
+----------+----+-----+---+----+---+----+-------+---------------+
|2010-07-22|豬聰明 |1    |男  |51  |84 |73  |四川     |1              |
|2010-07-24|豬勇敢 |1    |男  |40  |86 |78  |廣東     |2              |
|2010-07-23|豬豪傑 |3    |男  |77  |82 |93  |四川     |1              |
|2010-07-21|豬可愛 |3    |女  |11  |82 |70  |湖北     |2              |
|2010-07-23|豬美麗 |4    |女  |54  |84 |92  |湖北     |1              |
|2010-07-22|豬溫柔 |4    |女  |42  |86 |34  |湖北     |2              |
|2010-07-25|豬優雅 |4    |女  |40  |91 |68  |四川     |3              |
|2010-07-26|豬大方 |4    |女  |23  |97 |68  |廣東     |4              |
|2010-07-23|豬堅強 |2    |男  |89  |85 |88  |廣東     |1              |
|2010-07-22|豬能幹 |2    |男  |81  |85 |56  |湖北     |2              |
+----------+----+-----+---+----+---+----+-------+---------------+
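The sample data has no tied math scores inside any class, so rank, dense_rank, and row_number would all produce the same numbers here. The difference only shows up with ties. The plain-Scala sketch below mimics the three definitions for one partition that is already sorted; it illustrates the semantics and is not how Spark computes them. The names RankSemantics and rankAll are invented for this example.

```scala
object RankSemantics {

  // Input: the scores of ONE partition, already sorted in the window's order
  // (descending here). Output: (score, row_number, rank, dense_rank) per row.
  def rankAll(sortedDesc: Seq[Int]): Seq[(Int, Int, Int, Int)] = {
    val distinctScores = sortedDesc.distinct
    sortedDesc.zipWithIndex.map { case (score, idx) =>
      val rowNumber = idx + 1                           // always unique, ties broken by position
      val rank = sortedDesc.indexOf(score) + 1          // ties share the first tied position; gaps follow
      val denseRank = distinctScores.indexOf(score) + 1 // ties share a rank; no gaps
      (score, rowNumber, rank, denseRank)
    }
  }

  def main(args: Array[String]): Unit =
    rankAll(Seq(90, 85, 85, 70)).foreach(println)
}
```

For the scores 90, 85, 85, 70 this gives row numbers 1, 2, 3, 4, ranks 1, 2, 2, 4, and dense ranks 1, 2, 2, 3.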

Ranking math scores within each sex group, plus extra columns that bucket and number the rows of each group

    // Rank math scores within each sex group,
    // then split each group into buckets (ntile) and number its rows (row_number)
    val windowBySex1 = Window.partitionBy("sex").orderBy($"math".desc)
    df.withColumn("dense_rank_math_class", dense_rank().over(windowBySex1))
      .withColumn("ntile_3", ntile(3).over(windowBySex1))
      .withColumn("math_no", row_number().over(windowBySex1))
      .withColumn("math_no_persent", percent_rank().over(windowBySex1))
      .show(false)

Output

+----------+----+-----+---+----+---+----+-------+---------------------+-------+-------+---------------+
|birthday  |name|class|sex|math|eng|tech|address|dense_rank_math_class|ntile_3|math_no|math_no_persent|
+----------+----+-----+---+----+---+----+-------+---------------------+-------+-------+---------------+
|2010-07-23|豬堅強 |2    |男  |89  |85 |88  |廣東     |1                    |1      |1      |0.0            |
|2010-07-22|豬能幹 |2    |男  |81  |85 |56  |湖北     |2                    |1      |2      |0.25           |
|2010-07-23|豬豪傑 |3    |男  |77  |82 |93  |四川     |3                    |2      |3      |0.5            |
|2010-07-22|豬聰明 |1    |男  |51  |84 |73  |四川     |4                    |2      |4      |0.75           |
|2010-07-24|豬勇敢 |1    |男  |40  |86 |78  |廣東     |5                    |3      |5      |1.0            |
|2010-07-23|豬美麗 |4    |女  |54  |84 |92  |湖北     |1                    |1      |1      |0.0            |
|2010-07-22|豬溫柔 |4    |女  |42  |86 |34  |湖北     |2                    |1      |2      |0.25           |
|2010-07-25|豬優雅 |4    |女  |40  |91 |68  |四川     |3                    |2      |3      |0.5            |
|2010-07-26|豬大方 |4    |女  |23  |97 |68  |廣東     |4                    |2      |4      |0.75           |
|2010-07-21|豬可愛 |3    |女  |11  |82 |70  |湖北     |5                    |3      |5      |1.0            |
+----------+----+-----+---+----+---+----+-------+---------------------+-------+-------+---------------+
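The ntile_3 and math_no_persent columns above follow simple arithmetic. As a worked check, the plain-Scala sketch below reproduces it: percent_rank is (rank - 1) / (rows in partition - 1), and ntile(n) spreads rows over n buckets, with the earlier buckets taking the remainder rows. The names BucketMath, percentRank, and ntile are made up for this example and are not Spark APIs.

```scala
object BucketMath {

  // percent_rank = (rank - 1) / (partitionSize - 1); 0.0 for a single-row partition.
  def percentRank(rank: Int, partitionSize: Int): Double =
    if (partitionSize <= 1) 0.0 else (rank - 1).toDouble / (partitionSize - 1)

  // Bucket number that ntile(buckets) assigns to the row at 1-based position rowNumber.
  def ntile(rowNumber: Int, partitionSize: Int, buckets: Int): Int = {
    require(buckets >= 1 && rowNumber >= 1 && rowNumber <= partitionSize)
    val base = partitionSize / buckets  // minimum rows per bucket
    val extra = partitionSize % buckets // the first `extra` buckets hold base + 1 rows
    val rowsInBigBuckets = extra * (base + 1)
    if (rowNumber <= rowsInBigBuckets) (rowNumber - 1) / (base + 1) + 1
    else extra + (rowNumber - rowsInBigBuckets - 1) / base + 1
  }

  def main(args: Array[String]): Unit = {
    // Five rows per sex group, three buckets, as in the example above.
    (1 to 5).foreach { rn =>
      println(s"row $rn: ntile=${ntile(rn, 5, 3)} percent_rank=${percentRank(rn, 5)}")
    }
  }
}
```

With five rows and three buckets the bucket sizes are 2, 2, 1, which matches the 1, 1, 2, 2, 3 pattern in the ntile_3 column.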

Taking the top math score within each sex group

    val windowBySex2 = Window.partitionBy("sex").orderBy($"math".desc)
    df.withColumn("number_math_sex", row_number().over(windowBySex2))
      .filter($"number_math_sex" <= 1)
      .show(false)

    // The same idea with selectExpr: the highest math score within each sex group
    df.selectExpr("sex","max(math) over (partition by sex order by math desc )  as max_math")
      .distinct()
      .show(false)

Output

+----------+----+-----+---+----+---+----+-------+---------------+
|birthday  |name|class|sex|math|eng|tech|address|number_math_sex|
+----------+----+-----+---+----+---+----+-------+---------------+
|2010-07-23|豬堅強 |2    |男  |89  |85 |88  |廣東     |1              |
|2010-07-23|豬美麗 |4    |女  |54  |84 |92  |湖北     |1              |
+----------+----+-----+---+----+---+----+-------+---------------+

+---+--------+
|sex|max_math|
+---+--------+
|男  |89      |
|女  |54      |
+---+--------+
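The code above keeps only the top-scoring row of each group. To pick up the lowest-scoring row as well, one option is to add a second row number over the ascending ordering and keep rows where either number is 1. This is a sketch under that assumption rather than the only approach; the object and column names (FirstAndLastSketch, rn_desc, rn_asc) are hypothetical, and the rows are a subset of the sample data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object FirstAndLastSketch {

  // Keeps the highest and the lowest math score of each sex group.
  def firstAndLast(df: DataFrame): DataFrame = {
    val byMathDesc = Window.partitionBy("sex").orderBy(col("math").desc)
    val byMathAsc = Window.partitionBy("sex").orderBy(col("math").asc)

    df.withColumn("rn_desc", row_number().over(byMathDesc)) // 1 = highest score
      .withColumn("rn_asc", row_number().over(byMathAsc))   // 1 = lowest score
      .filter(col("rn_desc") === 1 || col("rn_asc") === 1)
  }

  // A subset of the sample rows: (name, sex, math)
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("豬堅強", "男", 89), ("豬聰明", "男", 51), ("豬勇敢", "男", 40),
      ("豬美麗", "女", 54), ("豬溫柔", "女", 42), ("豬可愛", "女", 11)
    ).toDF("name", "sex", "math")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FirstAndLastSketch")
      .master("local")
      .getOrCreate()
    firstAndLast(sample(spark)).show(false)
    spark.stop()
  }
}
```

Because row_number breaks ties arbitrarily, swapping it for rank would keep every row tied for first or last place instead of exactly one of each.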

Structured Data Types

To access fields of structured (nested) data types, you can use the Dataset API, as in the following example.

import org.apache.spark.sql.SparkSession

object Test {

  case class Score(math: Int, eng: Int, tech: Int)

  case class Info(name: String, birthday: String, classNo: Int, sec: String, score: Score, interest: Interest)

  case class Interest(list: Seq[String])

  case class AddressInfo(name: String, address: String)

  case class Student(name: String, address: String, birthday: String, classNo: Int, sec: String, score: Score, interest: Interest)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Test")
      .master("local")
      .getOrCreate()

    import spark.implicits._
    val df = spark.createDataFrame(Seq(
      Info("豬聰明", "2010-07-22", 1, "男", Score(51, 84, 73), Interest(Seq("彈琴", "跳舞", "畫畫"))),
      Info("豬堅強", "2010-07-23", 2, "男", Score(89, 85, 88), Interest(Seq("跳舞", "畫畫"))),
      Info("豬勇敢", "2010-07-24", 1, "男", Score(40, 86, 78), Interest(Seq("彈琴", "跳舞"))),
      Info("豬能幹", "2010-07-22", 2, "男", Score(81, 85, 56), Interest(Seq("彈琴"))),
      Info("豬豪傑", "2010-07-23", 3, "男", Score(77, 82, 93), Interest(Seq("畫畫"))),
      Info("豬可愛", "2010-07-21", 3, "女", Score(11, 82, 70), Interest(Seq("畫畫"))),
      Info("豬溫柔", "2010-07-22", 4, "女", Score(42, 86, 34), Interest(Seq("彈琴"))),
      Info("豬美麗", "2010-07-23", 4, "女", Score(54, 84, 92), Interest(Seq("跳舞"))),
      Info("豬優雅", "2010-07-25", 4, "女", Score(40, 91, 68), Interest(Seq("彈琴", "跳舞", "畫畫"))),
      Info("豬大方", "2010-07-26", 4, "女", Score(23, 97, 68), Interest(Seq("跳舞", "畫畫")))
    ))

    val address = spark.createDataFrame(
      Seq(
        AddressInfo("豬聰明", "四川"),
        AddressInfo("豬堅強", "廣東"),
        AddressInfo("豬勇敢", "廣東"),
        AddressInfo("豬能幹", "湖北"),
        AddressInfo("豬豪傑", "四川"),
        AddressInfo("豬可愛", "湖北"),
        AddressInfo("豬溫柔", "湖北"),
        AddressInfo("豬美麗", "湖北"),
        AddressInfo("豬優雅", "四川"),
        AddressInfo("豬大方", "廣東")
      )
    )

    // Access a nested field of a struct column with dot notation
    df.join(address, Seq("name"), "inner")
      .select("name", "score.math", "address")
      .show(false)

    // Convert to Datasets so that fields can be accessed as typed properties
    val addressDS = address.as[AddressInfo]
    val ds = df.as[Info]
    ds.join(addressDS, Seq("name"), "inner")
      .as[Student] // convert the joined rows straight into a typed Dataset[Student]
      .map(x => (x.name, x.address))
      .show(false)
  }

}
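The example above reads score.math but never touches the interest list. As a small complement, the sketch below flattens both the struct and the array column: dot notation for the struct field and explode for the array. The object name NestedAccessSketch and the trimmed-down case classes are assumptions for this illustration; the two rows are taken from the sample data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.explode

object NestedAccessSketch {

  // Trimmed-down versions of the case classes used above.
  case class Score(math: Int, eng: Int, tech: Int)
  case class Interest(list: Seq[String])
  case class Info(name: String, score: Score, interest: Interest)

  // One output row per (student, interest) pair, with the math score pulled out of the struct.
  def flatten(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq(
      Info("豬聰明", Score(51, 84, 73), Interest(Seq("彈琴", "跳舞", "畫畫"))),
      Info("豬能幹", Score(81, 85, 56), Interest(Seq("彈琴")))
    ).toDF()

    df.select(
      $"name",
      $"score.math".as("math"),                // struct field via dot notation
      explode($"interest.list").as("interest") // one row per array element
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NestedAccessSketch")
      .master("local")
      .getOrCreate()
    flatten(spark).show(false)
    spark.stop()
  }
}
```

The first student has three interests and the second has one, so the result has four rows. To expand every field of the struct at once, select("score.*") can be used instead of naming each field.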
