Spark DataFrame Beginner Study Notes
1. Adding a Configuration File
1.1 Configuration file location
Place the file under the `resources` directory; these notes use `scala.properties` as the example.
1.2 Code to load the configuration
import java.util.Properties

object LoadProperties {
  def get_instance: Properties = {
    val properties = new Properties()
    val in = Thread.currentThread().getContextClassLoader
      .getResourceAsStream("scala.properties")
    properties.load(in)
    properties
  }
}
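As a quick sanity check, the lookup behavior of `Properties` can be exercised without touching the classpath by loading from an in-memory reader. The key names below mirror the ones used later in these notes; the values are illustrative:

```scala
import java.io.StringReader
import java.util.Properties

// Simulate the contents of scala.properties with an in-memory reader,
// so the lookup semantics can be checked without a classpath resource.
val props = new Properties()
props.load(new StringReader("is_load_local_data=true\ndpi.data.path=/tmp/dpi.txt"))

// getProperty returns the value as a String, or null when the key is absent
val isLocal = props.getProperty("is_load_local_data") == "true"
```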
2. Initializing Spark
// Load the configuration
val pros: Properties = LoadProperties.get_instance
// Decide whether this is a local run
val isLocal: Boolean = pros.get("is_load_local_data").toString == "true"
val conf: SparkConf =
  if (isLocal)
    // SparkConf for local runs
    new SparkConf().setAppName(getClass.getSimpleName).setMaster("local[2]")
  else
    // SparkConf for cluster runs
    new SparkConf().setAppName(getClass.getSimpleName)
// Create the SparkContext
val sc = new SparkContext(conf)
// Create the HiveContext
val spark = new HiveContext(sc)
spark.setConf("spark.sql.shuffle.partitions", "3000")
import spark.implicits._
Note:
Initialize the Spark contexts inside the main method, not in the static-variable (object field) initialization area. Spark runs in a distributed fashion: static variables (including Spark contexts and custom objects) loaded on one node are not re-initialized when the code runs on other nodes, so static fields and methods will not be reloaded there and class initialization will fail.
- SparkConf: Spark's configuration class. Settings are stored as key-value pairs in a ConcurrentHashMap instance named `settings`; it manages all Spark configuration options.
- SparkContext: the main entry point to Spark, its heart and soul.
- HiveContext: the engine Spark uses to access Hive tables (or local files).
3. Reading Data
3.1 Importing a local file
case class Data(phone: String, app_code: String, count: Double, master_flag: String, gender: String)

val dpi_path: String = pros.getProperty("dpi.data.path") // configurable via the properties file
val data = sc.textFile(dpi_path)
  .map(_.split("\t"))
  .map(x => Data(x(0), x(1), x(2).toDouble, x(3), x(4)))
  .toDF()
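The per-line parsing in the map stages can be verified in isolation on a plain string before running the full job; the sample record below is fabricated:

```scala
// Same shape and parsing steps as the textFile pipeline,
// applied to one hand-written tab-separated line.
case class Data(phone: String, app_code: String, count: Double, master_flag: String, gender: String)

val line = "13800000000\tAPP01\t3\t1\tM" // fabricated sample record
val fields = line.split("\t")
val row = Data(fields(0), fields(1), fields(2).toDouble, fields(3), fields(4))
```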
3.2 Reading from Hive
val tableName1 = pros.getProperty("tableName1")
val tableName2 = pros.getProperty("tableName2")
val beginDay = pros.getProperty("beginDay")
val endDay = pros.getProperty("endDay")
val dpiSql: String =
  s"select a.phone phone, a.app_code app_code, a.count count, b.master_flag master_flag " +
  s"from " +
  s"(select phone phone, appcode app_code, sum(usetimes) count " +
  s"from $tableName1 where day >= $beginDay and day <= $endDay " +
  s"group by phone, appcode) as a " +
  s"inner join " +
  s"(select phone, master_flag from $tableName2 where year > 2018) as b " +
  s"on a.phone = b.phone"
val dpiData = spark.sql(dpiSql)
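Queries assembled from many `+`-joined fragments are fragile: one missing `+` splits the expression in two and silently truncates the SQL. A triple-quoted interpolated string keeps the whole statement in one expression; the table names and dates below are placeholders for the configured values:

```scala
// Placeholder values standing in for the configured table names and dates.
val tableName1 = "dpi_usage"
val tableName2 = "user_master"
val beginDay = "20190101"
val endDay = "20190131"

// One triple-quoted string: there is no '+' chain to break by accident.
val dpiSql =
  s"""select a.phone phone, a.app_code app_code, a.count count, b.master_flag master_flag
     |from (select phone phone, appcode app_code, sum(usetimes) count
     |      from $tableName1 where day >= $beginDay and day <= $endDay
     |      group by phone, appcode) as a
     |inner join (select phone, master_flag from $tableName2 where year > 2018) as b
     |on a.phone = b.phone""".stripMargin
```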
3.3 Reading from a relational database (e.g. MySQL)
def main(args: Array[String]): Unit = {
  val pros = LoadProperties.get_instance
  val conf: SparkConf = new SparkConf().setAppName(getClass.getSimpleName).setMaster("local[2]")
  // Create the SparkContext
  val sc = new SparkContext(conf)
  // Create the HiveContext
  val spark = new HiveContext(sc)
  val jdbcDF = spark.read.format("jdbc")
    .option("url", "jdbc:mysql://" + pros.getProperty("mysql.url"))
    .option("driver", pros.getProperty("mysql.driver"))
    .option("user", pros.getProperty("mysql.user"))
    .option("password", pros.getProperty("mysql.password"))
    .option("dbtable", pros.getProperty("mysql.dbtable"))
    .load()
  jdbcDF.show()
}
By default this reads the entire MySQL table. To apply a filter, wrap a SQL query as an intermediate (derived) table and pass it as dbtable. Modify jdbcDF as follows:
val tableName = pros.getProperty("mysql.dbtable")
val table = s"(select id, username, email from $tableName where id < 3) as t1"
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://" + pros.getProperty("mysql.url"))
  .option("driver", pros.getProperty("mysql.driver"))
  .option("user", pros.getProperty("mysql.user"))
  .option("password", pros.getProperty("mysql.password"))
  .option("dbtable", table)
  .load()
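The derived-table string can be factored into a small helper so the predicate stays readable. The helper name and its arguments here are illustrative, not part of the original notes:

```scala
// Wrap a filter into a derived table; the JDBC reader then treats the
// subquery result as if it were the whole table.
def filteredTable(tableName: String, predicate: String, alias: String): String =
  s"(select id, username, email from $tableName where $predicate) as $alias"

val table = filteredTable("users", "id < 3", "t1")
```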
4. Salting to Handle Data Skew
4.1 Appending a random number to spread data across nodes
- Environment: 40 nodes
- Gender values: male, female
Problem: without salting, the data skews onto just two nodes (one per gender value), which can bring those nodes down.
Solution: salt the key so the data spreads across many nodes.
// Salt the sparse feature key with a random suffix
def keyWithRandom(string: String): String = {
  string + Random.nextInt(200)
}
val w = Window.partitionBy("gender").orderBy(rand())
// Register the UDF
val udfKeyAddRandomValue = spark.udf.register("keyWithRandom", keyWithRandom _)
var tmpData = featureData.withColumn("gender", udfKeyAddRandomValue($"gender"))
  .withColumn("rk", row_number().over(w))
  .filter($"rk" <= 5000)
  .withColumn("gender", substring($"gender", 0, 1))
  .cache()
What the code does:
- udfKeyAddRandomValue: appends a random number to each value in the gender column
- row_number().over(w): numbers the rows within each window partition; since the (salted) gender value defines the partition, each partition's rows receive a fresh sequence number after ordering
- w: the window specification, partitioned by gender and ordered by rand() (a random order that differs on every run) [SQL]
- filter: keeps only rows with rk <= 5000, so no single partition (and hence node) is overloaded
Illustration of how `row_number().over(w)` works (figure omitted from these notes).
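The effect of the salting step itself can be sketched without Spark: appending a random suffix in [0, 200) turns one hot key into up to 200 distinct keys, and the original value is still recoverable by stripping the digits. The fixed seed below is only to make the sketch deterministic:

```scala
import scala.util.Random

// Same idea as keyWithRandom above, but with a seeded generator
// so the sketch is reproducible.
val rnd = new Random(42)
def keyWithRandom(key: String): String = key + rnd.nextInt(200)

// 10,000 rows that all share the key "M" spread over many salted keys.
val salted = Seq.fill(10000)(keyWithRandom("M"))
val distinctKeys = salted.distinct.size

// Dropping the numeric suffix recovers the original key, which is what
// substring($"gender", 0, 1) achieves on the DataFrame side.
val recovered = salted.head.takeWhile(!_.isDigit)
```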
5. Processing Data in Batches
// Register a UDF that assigns each phone number to one of 20 buckets
spark.udf.register("hash_phone", (phone: String) => phone.hashCode % 20)
// Process only bucket 10 in this run
val hash = 10
val tableName1 = pros.getProperty("tableName1")
val tableName2 = pros.getProperty("tableName2")
val beginDay = pros.getProperty("beginDay")
val endDay = pros.getProperty("endDay")
val dpiSql: String =
  s"select a.phone phone, a.app_code app_code, a.count count, b.master_flag master_flag " +
  s"from " +
  s"(select phone phone, appcode app_code, sum(usetimes) count " +
  s"from $tableName1 " +
  s"where hash_phone(phoneno) = $hash " +
  s"group by phone, appcode) as a " +
  s"inner join " +
  s"(select phone, master_flag from $tableName2 where year > 2018) as b " +
  s"on a.phone = b.phone"
val dpiData = spark.sql(dpiSql)
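One caveat worth noting: `hashCode` can be negative on the JVM, so `phone.hashCode % 20` can also be negative, and a filter of `hash_phone(phoneno) = $hash` with hash in 0..19 would silently skip those rows. Below is a sketch of a safer bucketing function using `math.floorMod`; this is a suggested hardening, not part of the original notes:

```scala
// math.floorMod keeps the bucket in [0, 20) even for negative hash codes,
// so every row lands in exactly one of the 20 batches.
def hashPhone(phone: String): Int = math.floorMod(phone.hashCode, 20)

val phones = Seq("13800000000", "13912345678", "15055554444")
val buckets = phones.map(hashPhone)
```

To process every batch, loop `hash` over `0 until 20` and run the query once per bucket.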