Spark DataFrame Beginner's Study Notes

1. Adding a Configuration File

1.1 Configuration file location

Put the file under the resources directory; a scala.properties file is used as the example here.
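For reference, a minimal scala.properties could look like the sketch below. The key names are the ones used in the snippets later in these notes; every value is just a placeholder.

# Run locally ("true") or on the cluster
is_load_local_data=true
# Local data file used in section 3.1
dpi.data.path=/tmp/dpi_data.txt
# Hive tables and date range used in section 3.2
tableName1=placeholder_dpi_table
tableName2=placeholder_user_table
beginDay=20190101
endDay=20190131
# MySQL connection used in section 3.3
mysql.url=localhost:3306/test
mysql.driver=com.mysql.jdbc.Driver
mysql.user=root
mysql.password=secret
mysql.dbtable=users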

1.2 Code for loading the configuration

import java.util.Properties

object LoadProperties {

  // Load scala.properties from the classpath (i.e. the resources directory)
  def get_instance: Properties = {
    val properties = new Properties()
    val in = Thread.currentThread().getContextClassLoader.
      getResourceAsStream("scala.properties")
    properties.load(in)
    properties
  }

}
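Usage is then a one-liner; the key name below (dpi.data.path) is the one used later in these notes:

val pros = LoadProperties.get_instance
// getProperty returns null when the key is missing; a default can be supplied
val dpiPath = pros.getProperty("dpi.data.path", "/tmp/dpi_data.txt")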

2. Initializing Spark

// Load the configuration file
val pros: Properties = LoadProperties.get_instance
// Decide whether we are running locally
val isLocal: Boolean = pros.getProperty("is_load_local_data") == "true"
val conf: SparkConf =
  if (isLocal)
    // SparkConf for a local run
    new SparkConf().setAppName(getClass.getSimpleName).setMaster("local[2]")
  else
    // SparkConf for a cluster run
    new SparkConf().setAppName(getClass.getSimpleName)

// Create the SparkContext
val sc = new SparkContext(conf)
// Create the HiveContext
val spark = new HiveContext(sc)
spark.setConf("spark.sql.shuffle.partitions", "3000")
import spark.implicits._

Note:

Be especially careful here: the Spark context initialization must go inside the main method, not in the object's field (static) initialization area. Spark runs distributed, and static variables (including the Spark contexts and any custom objects created there) are loaded on one node but are not re-initialized when the class is used on other nodes, which can cause class initialization to fail on those nodes.
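A minimal sketch of the recommended structure (the object name SparkJob is illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SparkJob {

  def main(args: Array[String]): Unit = {
    // Everything Spark-related is created inside main, on the driver,
    // not in the object body (the "static" initialization area)
    val pros = LoadProperties.get_instance
    val conf = new SparkConf().setAppName(getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val spark = new HiveContext(sc)
    // ... job logic ...
    sc.stop()
  }
}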

  • SparkConf: Spark's configuration class. Settings are stored as key-value pairs in a ConcurrentHashMap instance named settings, and it manages every Spark configuration option (see the small example after this list).
  • SparkContext: the main entry point of Spark, the heart of a Spark application.
  • HiveContext: Spark's engine for reading Hive tables (or local files).
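For instance, every option is just a string key-value pair set on the SparkConf; the keys below are standard Spark options, the values are placeholders:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("demo")
  .set("spark.executor.memory", "4g")            // stored as a key-value pair in settings
  .set("spark.sql.shuffle.partitions", "3000")
println(conf.get("spark.executor.memory"))        // prints "4g"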

3. Reading Data

3.1 Loading from a local file

// Note: the case class must be defined outside the method (e.g. at the object level)
// so that toDF() can find its TypeTag
case class Data(phone: String, app_code: String, count: Double, master_flag: String, gender: String)

val dpi_path: String = pros.getProperty("dpi.data.path")   // configurable via the properties file
val data = sc.textFile(dpi_path)
  .map(_.split("\t"))
  .map(x => Data(x(0), x(1), x(2).toDouble, x(3), x(4)))
  .toDF()

3.2 Reading from Hive

val tableName1 = pros.getProperty("tableName1")
val tableName2 = pros.getProperty("tableName2")
val beginDay = pros.getProperty("beginDay")
val endDay = pros.getProperty("endDay")
val dpiSql: String =
  s" select a.phone phone, a.app_code app_code, a.count count, b.master_flag master_flag " +
  s" from " +
  s" (select phone phone, appcode app_code, sum(usetimes) count" +
  s"  from $tableName1 where day >= $beginDay and day <= $endDay" +
  s"  group by phone, appcode) as a" +
  s" inner join " +
  s" (select phone, master_flag from $tableName2 where year > 2018) as b " +
  s" on a.phone = b.phone"
val dpiData = spark.sql(dpiSql)

3.3 Reading from a relational database (e.g. MySQL)

def main(args: Array[String]): Unit = {
  val pros = LoadProperties.get_instance
  val conf: SparkConf = new SparkConf().setAppName(getClass.getSimpleName).setMaster("local[2]")
  // Create the SparkContext
  val sc = new SparkContext(conf)
  // Create the HiveContext
  val spark = new HiveContext(sc)
  val jdbcDF = spark.read.format("jdbc")
    .option("url", "jdbc:mysql://" + pros.getProperty("mysql.url"))
    .option("driver", pros.getProperty("mysql.driver"))
    .option("user", pros.getProperty("mysql.user"))
    .option("password", pros.getProperty("mysql.password"))
    .option("dbtable", pros.getProperty("mysql.dbtable"))
    .load()
  jdbcDF.show()
}

By default this JDBC read loads the entire MySQL table. To apply a filter, wrap a SQL query as a derived (intermediate) table and pass it as the dbtable option. Modify jdbcDF as follows:

val tableName = pros.getProperty("mysql.dbtable")
val table = s"(select id, username, email from $tableName where id < 3) as t1"
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://" + pros.getProperty("mysql.url"))
  .option("driver", pros.getProperty("mysql.driver"))
  .option("user", pros.getProperty("mysql.user"))
  .option("password", pros.getProperty("mysql.password"))
  .option("dbtable", table)
  .load()

4. Salting to Handle Data Skew

4.1 Adding a random salt to spread data across nodes

  • Environment: 40 nodes
  • Gender values: male, female

Problem: without salting, the data for the two gender values ends up concentrated on only two nodes, which can overload and crash those nodes.

Solution: salt the key so the data is spread across different nodes.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rand, row_number, substring}
import scala.util.Random

// Salt the sparse key by appending a random integer in [0, 200)
def keyWithRandom(string: String): String = {
  string + Random.nextInt(200)
}
val w = Window.partitionBy("gender").orderBy(rand())
// Register the UDF
val udfKeyAddRandomValue = spark.udf.register("keyWithRandom", keyWithRandom _)
var tmpData = featureData.withColumn("gender", udfKeyAddRandomValue($"gender"))
  .withColumn("rk", row_number().over(w))
  .filter($"rk" <= 5000)
  .withColumn("gender", substring($"gender", 0, 1))
  .cache()

What the code does:

  • udfKeyAddRandomValue: appends a random integer to the value of the gender column, turning each gender value into many distinct salted keys.
  • row_number().over(w): over accumulates over the sorted rows of each gender partition and row_number produces a fresh sequential number, so each partition (one per salted gender value) gets its own numbering after sorting.
  • w: the window over the gender column; orderBy(rand()) sorts the rows randomly, so the order differs on every run. [SQL]
  • filter: keeps only rows whose rk (the row number within a gender partition) is at most 5000, preventing any single node from being overloaded.

Diagram of how row_number().over(w) works (the original figure is omitted; the sketch below illustrates the same behaviour).
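A tiny self-contained example with toy data invented for illustration (it relies on the import spark.implicits._ from section 2 for toDF):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Toy data: two gender values, three rows
val demo = Seq(("male", "a"), ("male", "b"), ("female", "c")).toDF("gender", "phone")
val wDemo = Window.partitionBy("gender").orderBy("phone")
demo.withColumn("rk", row_number().over(wDemo)).show()
// Expected row numbers (display order of rows may vary):
//   female, c -> rk 1
//   male,   a -> rk 1
//   male,   b -> rk 2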

5. Processing Data in Batches

// Bucket phone numbers into 20 batches by hashing
// (math.abs keeps the bucket id non-negative, since hashCode can be negative)
spark.udf.register("hash_phone", (phone: String) => math.abs(phone.hashCode) % 20)
val hash = 10   // process bucket 10 in this run
val tableName1 = pros.getProperty("tableName1")
val tableName2 = pros.getProperty("tableName2")
val beginDay = pros.getProperty("beginDay")
val endDay = pros.getProperty("endDay")
val dpiSql: String =
  s" select a.phone phone, a.app_code app_code, a.count count, b.master_flag master_flag " +
  s" from " +
  s" (select phone phone, appcode app_code, sum(usetimes) count" +
  s"  from $tableName1" +
  s"  where hash_phone(phoneno) = $hash" +
  s"  group by phone, appcode) as a" +
  s" inner join " +
  s" (select phone, master_flag from $tableName2 where year > 2018) as b " +
  s" on a.phone = b.phone"
val dpiData = spark.sql(dpiSql)
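To cover all of the data, the same query can be run once per bucket by looping hash over the 20 bucket ids. A rough sketch, where processBatch is a hypothetical placeholder for the downstream logic:

for (hash <- 0 until 20) {
  val batchSql =
    s" select phone, appcode app_code, sum(usetimes) count" +
    s" from $tableName1" +
    s" where hash_phone(phoneno) = $hash" +
    s" group by phone, appcode"
  val batchData = spark.sql(batchSql)
  processBatch(batchData)   // hypothetical: write out or further process this batch
}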