1. The five core properties of an RDD
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for
an HDFS file)
1) An RDD is composed of a list of partitions
protected def getPartitions: Array[Partition]
2) Operations on an RDD are really operations on its underlying partitions
def compute(split: Partition, context: TaskContext): Iterator[T]
3) A list of dependencies on other RDDs
protected def getDependencies: Seq[Dependency[_]] = deps
4) Optionally, a Partitioner for key-value RDDs, which controls the partitioning strategy and the number of partitions.
This is similar to the Partitioner interface in MapReduce, which controls which reducer each key is sent to.
/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None
5) Each partition has a list of preferred locations to compute it on
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
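The five properties above map onto five members of the RDD class. As a rough illustration only, they can be modeled with a toy trait in plain Scala (this is not Spark's actual API; all names below are made up, and partitions are reduced to plain integer ids):

```scala
// A toy model of the five RDD properties (plain Scala, not Spark)
trait ToyRDD[T] {
  def getPartitions: Array[Int]                          // 1) a list of partitions (ids)
  def compute(split: Int): Iterator[T]                   // 2) a function computing each split
  def dependencies: Seq[ToyRDD[_]] = Nil                 // 3) dependencies on parent RDDs
  def partitioner: Option[Int => Int] = None             // 4) optional partitioner
  def preferredLocations(split: Int): Seq[String] = Nil  // 5) optional preferred locations
}

// A toy analogue of ParallelCollectionRDD: slice a collection round-robin
class ToyParallelRDD(data: Seq[Int], numSlices: Int) extends ToyRDD[Int] {
  def getPartitions: Array[Int] = (0 until numSlices).toArray
  def compute(split: Int): Iterator[Int] =
    data.zipWithIndex.collect { case (v, i) if i % numSlices == split => v }.iterator
}
```

Everything an RDD can do (compute, lineage, locality) flows from these five members; concrete RDDs only need to override the first two.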
2. Ways to create an RDD
1) From a collection, with parallelize
val rdd = sc.parallelize(List(1,2,3,4,5,6))
2) By transforming another RDD
val rdd2 = rdd.map((_,100))
3) From an external dataset
val hadoopRDD = sc.textFile("./data/graph/g.txt")
3. The RDD each operator corresponds to
4. Operator classification
Common Transformations: {
1. map
2. flatMap
3. mapPartitions
4. filter
5. distinct
6. groupByKey
7. reduceByKey
8. join
}
Common Actions: {
1. foreach
2. saveAsTextFile
3. saveAsObjectFile
4. collect
5. collectAsMap
6. count
7. top
8. reduce
9. fold
10. aggregate
}
[Note]: countByKey is an action operator
def countByKey(): Map[K, Long] = self.withScope {
self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}
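What this definition computes can be reproduced on a plain Scala collection (a non-Spark sketch of the same mapValues-then-reduceByKey pipeline, using groupMapReduce in place of the shuffle):

```scala
object CountByKeyDemo {
  val pairs = List(("a", 10), ("b", 20), ("a", 30))

  // mapValues(_ => 1L) then reduceByKey(_ + _), expressed on a plain List:
  // replace every value with 1L, then sum the ones per key
  val counts: Map[String, Long] =
    pairs.map { case (k, _) => (k, 1L) }.groupMapReduce(_._1)(_._2)(_ + _)
}
```

Because the result is collected to a single Map on the driver, countByKey (like collect) should only be used when the number of distinct keys is small.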
5. Viewing an RDD's dependency chain with toDebugString
ps.toDebugString
(1) MapPartitionsRDD[7] at sortBy at T3.scala:16 []
| ShuffledRDD[6] at sortBy at T3.scala:16 []
+-(1) MapPartitionsRDD[5] at sortBy at T3.scala:16 []
| MapPartitionsRDD[4] at sortBy at T3.scala:15 []
| ShuffledRDD[3] at sortBy at T3.scala:15 []
+-(1) MapPartitionsRDD[2] at sortBy at T3.scala:15 []
| MapPartitionsRDD[1] at map at T3.scala:15 []
| ParallelCollectionRDD[0] at parallelize at T3.scala:9 []
6. Wide / narrow dependencies
How to tell them apart:
Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD
Wide dependency: triggers a shuffle, and marks a stage boundary
In production, prefer narrow dependencies whenever possible; a shuffle is expensive, consuming both network I/O and disk I/O.
Narrow dependencies: NarrowDependency => {
OneToOneDependency
RangeDependency
PruneDependency
}
Wide dependency: ShuffleDependency
7. Difference between reduceByKey and groupByKey
reduceByKey performs a combine (partial aggregation) on the map side, while groupByKey does not, so reduceByKey outperforms groupByKey.
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
combineByKeyWithClassTag[CompactBuffer[V]](
//.......
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
}
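The effect of map-side combine can be simulated with plain Scala collections (a sketch only — this is not Spark code; the partitions and record counts below are illustrative). Counting how many records would cross the "network" shows why pre-aggregation wins:

```scala
object CombineDemo {
  // Two simulated partitions of (word, 1) pairs
  val partitions = List(
    List(("a", 1), ("b", 1), ("a", 1)),
    List(("a", 1), ("b", 1), ("b", 1))
  )

  // reduceByKey-style: combine locally inside each partition first,
  // then merge the per-partition results across partitions
  def reduceStyle(): (Map[String, Int], Int) = {
    val locallyCombined = partitions.map(_.groupMapReduce(_._1)(_._2)(_ + _))
    val shuffledRecords = locallyCombined.map(_.size).sum  // records crossing partitions
    val merged = locallyCombined.flatten.groupMapReduce(_._1)(_._2)(_ + _)
    (merged, shuffledRecords)
  }

  // groupByKey-style: every raw record is shuffled; aggregation happens only after
  def groupStyle(): (Map[String, Int], Int) = {
    val all = partitions.flatten
    (all.groupMapReduce(_._1)(_._2)(_ + _), all.size)
  }
}
```

Both styles produce the same totals, but the reduce style ships fewer records; on real data with many duplicate keys per partition, the gap is much larger.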
8. cache and persist
cache() simply calls persist() under the hood
def cache(): this.type = persist()
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
MEMORY_ONLY: cache in memory
MEMORY_ONLY_SER: cache in memory in serialized form
Caching to disk is not recommended: if some partitions of an RDD are lost, recomputing them is usually faster than reading them back from disk.
9. repartition vs coalesce
repartition always shuffles, so it can increase the number of partitions
coalesce disables shuffle by default, so it can only decrease, not increase, the number of partitions
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
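Why shuffle-free coalesce can only shrink the partition count can be sketched with plain collections: whole partitions are merged together but never split apart, since splitting would require redistributing individual records, i.e. a shuffle. This is a toy model, not Spark's actual coalescing algorithm:

```scala
object CoalesceDemo {
  val partitions = Vector(Vector(1, 2), Vector(3), Vector(4, 5), Vector(6))

  // coalesce without shuffle: merge whole partitions round-robin,
  // never split one, so the partition count can only shrink
  def coalesce(n: Int): Vector[Vector[Int]] = {
    val target = math.min(n, partitions.size)  // can't grow without a shuffle
    partitions.zipWithIndex
      .groupMap { case (_, i) => i % target } { case (p, _) => p }
      .toVector.sortBy(_._1)
      .map { case (_, ps) => ps.flatten }
  }
}
```

Asking for more partitions than exist is a no-op here, mirroring why `coalesce(n)` with `shuffle = false` cannot increase parallelism in Spark.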
10. Sorting
1) case class (Product extends Ordered)
import org.apache.spark.{SparkConf, SparkContext}
object T3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setMaster("local")
conf.setAppName("sortBy")
val sc = new SparkContext(conf)
val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("蘿蔔", "2.5", "2567"), ("青椒", "3.0", "675"), ("豬肉", "30", "250")))
products.collect().foreach(println)
val ps = products.map(p => Product(p._1, p._2.toDouble, p._3.toInt))
.sortBy(x => x)
ps.collect().foreach(println)
sc.stop()
}
}
case class Product(name: String, price: Double, amount: Int) extends Ordered[Product] {
override def compare(that: Product): Int = {
this.amount - that.amount
}
}
//Product(豬肉,30.0,250)
//Product(青椒,3.0,675)
//Product(白菜,2.0,1000)
//Product(蘿蔔,2.5,2567)
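Since the comparison logic lives entirely in the case class, it can be verified without Spark by sorting a plain List with the same Ordered implementation (sample values below are made up):

```scala
// Same pattern as above: the case class carries its own ordering by amount
case class Product(name: String, price: Double, amount: Int) extends Ordered[Product] {
  override def compare(that: Product): Int = this.amount - that.amount
}

object OrderedDemo {
  // List#sorted picks up the Ordered-based implicit Ordering automatically
  val sorted = List(Product("a", 2.0, 1000), Product("b", 30.0, 250)).sorted
}
```

Note that `this.amount - that.amount` can overflow for extreme Int values; `Integer.compare(this.amount, that.amount)` is the safer form.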
2) Implicit conversion (Product => Ordered)
import org.apache.spark.{SparkConf, SparkContext}
object T3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setMaster("local")
conf.setAppName("sortBy")
val sc = new SparkContext(conf)
val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("蘿蔔", "2.5", "2567"), ("青椒", "3.0", "675"), ("豬肉", "30", "250")))
products.collect().foreach(println)
val ps = products.map(p => Product(p._1, p._2.toDouble, p._3.toInt))
.sortBy(x => x)
ps.collect().foreach(println)
sc.stop()
}
implicit def Product2Ordered(product: Product): Ordered[Product] = new Ordered[Product] {
override def compare(that: Product): Int = {
product.amount - that.amount
}
}
case class Product(name: String, price: Double, amount: Int)
}
//Product(豬肉,30.0,250)
//Product(青椒,3.0,675)
//Product(白菜,2.0,1000)
//Product(蘿蔔,2.5,2567)
3) Ordering.on
import org.apache.spark.{SparkConf, SparkContext}
object T3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setMaster("local")
conf.setAppName("sortBy")
val sc = new SparkContext(conf)
val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("蘿蔔", "2.5", "2567"), ("土豆", "2.5", "3567"),("青椒", "3.0", "675"), ("豬肉", "30", "250")))
products.collect().foreach(println)
implicit val ord = Ordering[(Double, Int)].on[(String, Double, Int)](x => (-x._2, -x._3))
val ps = products.map(p => (p._1, p._2.toDouble, p._3.toInt)).sortBy(x => x)
ps.collect().foreach(println)
sc.stop()
}
}
//(豬肉,30.0,250)
//(青椒,3.0,675)
//(土豆,2.5,3567)
//(蘿蔔,2.5,2567)
//(白菜,2.0,1000)
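The Ordering.on combinator itself is plain Scala, so the descending two-field sort key (price first, then amount) can be checked without Spark by sorting a plain List:

```scala
object OrderingOnDemo {
  // Map each tuple to a sort key of negated fields: ascending order on
  // (-price, -amount) is descending order on (price, amount)
  implicit val ord: Ordering[(String, Double, Int)] =
    Ordering[(Double, Int)].on[(String, Double, Int)](x => (-x._2, -x._3))

  val sorted = List(("白菜", 2.0, 1000), ("蘿蔔", 2.5, 2567), ("土豆", 2.5, 3567)).sorted
}
```

The local implicit takes precedence over the default tuple Ordering, which is exactly how the implicit `ord` reaches `sortBy` in the Spark version above.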
11. Writing output to different files by key
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}
object T3 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setMaster("local")
conf.setAppName("sortBy")
val sc = new SparkContext(conf)
val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("蘿蔔", "2.5", "2567"), ("土豆", "2.5", "3567"), ("青椒", "3.0", "675"), ("豬肉", "30", "250")))
products.collect().foreach(println)
products.map(x => (x._1, x._2)).saveAsHadoopFile("out/hadoop/file", classOf[String], classOf[String], classOf[RZMultipleTextOutputFormat])
sc.stop()
}
class RZMultipleTextOutputFormat extends MultipleTextOutputFormat[Any,Any]{
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = {
s"${key}/${name}"
}
override def generateActualKey(key: Any, value: Any): Any = {
null
}
}
}
If the generateActualKey override is removed, every output line will also be prefixed with its key.