1. Conversion principles
An RDD cares only about the data itself; a DataFrame adds structure (schema) information; a Dataset adds both structure and type information.
① Converting an RDD to a DataFrame requires adding structure information, so you call the toDF method (optionally supplying column names).
② Converting an RDD to a Dataset requires adding both structure and type information, so you first map the data to a specific type and then call toDS.
③ Converting a DataFrame to a Dataset only requires adding type information, since the structure is already there, so you call as[T].
④ Because a DataFrame already contains the data, converting it to an RDD is just a call to .rdd.
⑤ Likewise, because a Dataset already contains the data, converting it to an RDD is just a call to .rdd.
⑥ Because a Dataset already contains the structure information, converting it to a DataFrame is just a call to toDF.
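The six conversions above can be sketched in one place. This is a minimal illustration, assuming a Spark 2.x SparkSession; the case class `User` and the column names are made up for the example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Illustrative type carrying the "type information" a Dataset needs.
case class User(name: String, age: Int)

object ConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ConversionSketch").master("local[*]").getOrCreate()
    import spark.implicits._ // required for toDF / toDS / as[T]

    val rdd: RDD[(String, Int)] =
      spark.sparkContext.parallelize(Seq(("zhangsan", 30), ("lisi", 40)))

    val df: DataFrame = rdd.toDF("name", "age")        // ① RDD -> DataFrame: add structure
    val ds: Dataset[User] =
      rdd.map { case (n, a) => User(n, a) }.toDS()     // ② RDD -> Dataset: add structure + type
    val ds2: Dataset[User] = df.as[User]               // ③ DataFrame -> Dataset: add type
    val rowRdd = df.rdd                                // ④ DataFrame -> RDD[Row]
    val userRdd = ds.rdd                               // ⑤ Dataset -> RDD[User]
    val dfAgain: DataFrame = ds.toDF()                 // ⑥ Dataset -> DataFrame: keep structure, drop type

    spark.stop()
  }
}
```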
2. Example: RDD -> Dataset -> RDD conversion
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

case class AbtestActionLogsModel(
    atype: String, domain: String, platform: String, version: String,
    action_big_id: String, action_small_id: String, action_id: String,
    txid: String, memo: String, location_id: String, msg: String,
    uid: String, url: String, uniqueid: String, projectid: String,
    planid: String, logType: String,
    var memoParam: String, var description: String, var memoValue: String)
  extends Serializable

object AbtestActionMemo {
  def main(args: Array[String]): Unit = {
    val idate = args(0)
    println("date is:" + idate)
    val conf = new SparkConf().setAppName("AbtestActionMemo")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val actionLogs: RDD[String] =
      sc.textFile("abtest_action_logs/dt=" + idate + "/type=t_app_action/*")
    // Parse each log line into an AbtestActionLogsModel (lineToActionClass is not shown here)
    val actionLogsRdd: RDD[AbtestActionLogsModel] = actionLogs.map(lineToActionClass(_))
    // RDD -> Dataset: import the implicits, then call toDS
    import sqlContext.implicits._
    val actionLogsDataSet = actionLogsRdd.toDS()
    actionLogsDataSet.createTempView("actionLogs")
    // Find the (projectid, planid) pairs whose distinct-uid count exceeds 5000
    val validProjectFrame: DataFrame = sqlContext.sql(
      "select projectid,planid,count(distinct(uid)) uidCount from actionLogs " +
        "where txid in ('1002','2002') " +
        "group by projectid,planid " +
        "having uidCount>5000 " +
        "order by uidCount desc")
    // DataFrame -> RDD: collect the valid projectids into a Set
    val validProjectsSet: Set[String] = validProjectFrame.rdd.map(row => {
      row.getAs[String]("projectid")
    }).collect().toSet
    validProjectsSet.foreach(println(_))
  }
}
3. What the three have in common
(1) RDD, DataFrame, and Dataset are all distributed, resilient datasets on the Spark platform, built to make processing very large data convenient.
(2) All three are lazily evaluated: creating them or applying transformations such as map does not execute anything immediately; computation only starts when an action such as foreach is encountered.
(3) All three share many common operations, such as filter, sorting, and so on.
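Point (2) is easy to observe directly. The sketch below (assuming a local Spark setup; the println messages are illustrative) shows that the side effect inside map does not run when the transformation is declared, only when the foreach action triggers the job.

```scala
import org.apache.spark.sql.SparkSession

object LazinessSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LazinessSketch").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 3)

    // Transformation: only records the lineage; the println inside does not run yet.
    val doubled = rdd.map { x => println(s"computing $x"); x * 2 }
    println("no 'computing' line has been printed so far")

    // Action: now the whole chain actually executes on the executors.
    doubled.foreach(x => println(s"result $x"))

    spark.stop()
  }
}
```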
4. How the three differ
RDD advantages: ① compile-time type safety; ② an object-oriented programming style; ③ data is manipulated directly through the class's fields and methods. Disadvantages: communication and I/O both require serialization and deserialization, which is costly, and the frequent creation and destruction of objects inevitably increases GC pressure.
DataFrame introduces a schema and off-heap storage, so it avoids frequent GC and reduces memory overhead; its disadvantage is the loss of compile-time type safety.
Dataset combines the advantages of both while avoiding their drawbacks.
A DataFrame can also be written as Dataset[Row]: every row has type Row, so you must use getAs to retrieve each column's value, as in the snippet from the example above:
val validProjectsSet: Set[String] = validProjectFrame.rdd.map(row => {
row.getAs[String]("projectid")
}).collect().toSet
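The contrast between Row-based and typed access can be shown side by side. This is a sketch, assuming a local SparkSession; the `Project` case class is made up for the illustration. With a DataFrame the column is looked up by name at runtime via Row.getAs; after as[T] the same data is reached through fields the compiler checks.

```scala
import org.apache.spark.sql.SparkSession

case class Project(projectid: String, planid: String)

object RowVsTyped {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RowVsTyped").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(Project("p1", "a"), Project("p2", "b")).toDF()

    // DataFrame = Dataset[Row]: untyped, column looked up by name at runtime.
    // A typo in "projectid" would only fail when the job runs.
    val idsFromRows: Set[String] =
      df.rdd.map(_.getAs[String]("projectid")).collect().toSet

    // Dataset[Project]: typed, field access verified at compile time.
    val idsTyped: Set[String] =
      df.as[Project].map(_.projectid).collect().toSet

    assert(idsFromRows == idsTyped)

    spark.stop()
  }
}
```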