Spark Core Knowledge Points

1. The Five Key Properties of an RDD

  • A list of partitions
  • A function for computing each split
  • A list of dependencies on other RDDs
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for
    an HDFS file)
1). An RDD is composed of a list of partitions
protected def getPartitions: Array[Partition]
2). Operations on an RDD are really operations on its underlying partitions
def compute(split: Partition, context: TaskContext): Iterator[T]
3). A list of dependencies on other RDDs
protected def getDependencies: Seq[Dependency[_]] = deps
4). Optional: a Partitioner for key-value RDDs, which controls the partitioning strategy and the number of partitions

Similar to the Partitioner interface in MapReduce, which controls which reducer a key is sent to.

 /** Optionally overridden by subclasses to specify how they are partitioned. */
  @transient val partitioner: Option[Partitioner] = None
5). Each split has a list of preferred locations to compute it on
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
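
These five properties map directly onto the methods an RDD subclass overrides. Below is a minimal sketch of a custom RDD (the class name RangeRDD and the fixed two-partition layout are made up for illustration); it implements only the two mandatory members, getPartitions and compute, and inherits the defaults for the rest (no parents, no partitioner, no preferred locations):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Sketch of a custom RDD: two partitions, no parent RDDs (deps = Nil),
// default partitioner (None) and preferred locations (Nil).
class RangeRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {

  // Property 1: a list of partitions
  override protected def getPartitions: Array[Partition] =
    Array(new Partition { override val index = 0 },
          new Partition { override val index = 1 })

  // Property 2: a function for computing each split
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    (split.index * 10 until split.index * 10 + 10).iterator
}

// new RangeRDD(sc).collect() would yield the numbers 0 to 19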

2. Ways to Create an RDD

1). From a collection, with parallelize
val rdd = sc.parallelize(List(1,2,3,4,5,6))
2). By transforming another RDD
val rdd2 = rdd.map((_,100))
3). From an external dataset
val hadoopRDD = sc.textFile("./data/graph/g.txt")

3. The RDD Returned by Each Operator

[Figure: mapping from common operators to the concrete RDD classes they return]
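
The mapping can also be checked directly in spark-shell; a rough sketch (the class names are what each RDD's toString reports and may vary across Spark versions):

val base = sc.parallelize(1 to 10)            // ParallelCollectionRDD
println(base.map(_ + 1))                      // MapPartitionsRDD[...] at map ...
println(base.map((_, 1)).reduceByKey(_ + _))  // ShuffledRDD[...] at reduceByKey ...
println(sc.textFile("./data/graph/g.txt"))    // MapPartitionsRDD over a HadoopRDD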

4. Operator Categories

Common transformations: {
    1. map
    2. flatMap
    3. mapPartitions
    4. filter
    5. distinct
    6. groupByKey
    7. reduceByKey
    8. join
}

Common actions: {
    1. foreach
    2. saveAsTextFile
    3. saveAsObjectFile
    4. collect
    5. collectAsMap
    6. count
    7. top
    8. reduce
    9. fold
    10. aggregate
}
Note: countByKey is an action operator.

  def countByKey(): Map[K, Long] = self.withScope {
    self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
  }
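
A small end-to-end sketch that exercises a few of the operators above (the input lines are made up):

val words = sc.parallelize(List("spark hadoop spark", "hive spark"))
  .flatMap(_.split(" "))   // transformation
  .map((_, 1))             // transformation

// collect is the action that triggers the job
words.reduceByKey(_ + _).collect().foreach(println)  // (spark,3) (hadoop,1) (hive,1)

// countByKey is itself an action and returns a local Map
println(words.countByKey())                          // Map(spark -> 3, hadoop -> 1, hive -> 1)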

5. Viewing an RDD's Lineage with toDebugString

ps.toDebugString

(1) MapPartitionsRDD[7] at sortBy at T3.scala:16 []
 |  ShuffledRDD[6] at sortBy at T3.scala:16 []
 +-(1) MapPartitionsRDD[5] at sortBy at T3.scala:16 []
    |  MapPartitionsRDD[4] at sortBy at T3.scala:15 []
    |  ShuffledRDD[3] at sortBy at T3.scala:15 []
    +-(1) MapPartitionsRDD[2] at sortBy at T3.scala:15 []
       |  MapPartitionsRDD[1] at map at T3.scala:15 []
       |  ParallelCollectionRDD[0] at parallelize at T3.scala:9 []
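
The lineage above came from a pipeline roughly like the following sketch (the exact code is not shown here); each sortBy expands into a keyBy, a shuffle, and a values step, which is why it shows up as MapPartitionsRDD -> ShuffledRDD -> MapPartitionsRDD:

val ps = sc.parallelize(List(("b", 2), ("a", 3), ("c", 1)))  // ParallelCollectionRDD[0]
  .map(identity)                                             // MapPartitionsRDD[1]
  .sortBy(_._2)                                              // MapPartitionsRDD[2], ShuffledRDD[3], MapPartitionsRDD[4]
  .sortBy(_._1)                                              // MapPartitionsRDD[5], ShuffledRDD[6], MapPartitionsRDD[7]

println(ps.toDebugString)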

6. Wide / Narrow Dependencies

How wide and narrow dependencies are distinguished:
Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.
Wide dependency: introduces a shuffle and marks a stage boundary.
In most production scenarios, prefer narrow dependencies whenever possible; a shuffle is expensive, consuming both network I/O and disk I/O.
Narrow dependencies: NarrowDependency => {
OneToOneDependency
RangeDependency
PruneDependency
}
Wide dependency: ShuffleDependency
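
The dependency type can be inspected on any RDD through its public dependencies method; a quick sketch:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// map: each child partition reads exactly one parent partition -> narrow
println(pairs.map(identity).dependencies)       // List(org.apache.spark.OneToOneDependency@...)

// reduceByKey: records are re-partitioned by key -> wide (shuffle)
println(pairs.reduceByKey(_ + _).dependencies)  // List(org.apache.spark.ShuffleDependency@...)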


7. Difference Between reduceByKey and groupByKey

reduceByKey performs a combine (partial aggregation) on the map side, while groupByKey does not, so reduceByKey outperforms groupByKey.

  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    // mapSideCombine defaults to true, so partial aggregation happens on the map side
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // ... (createCombiner / mergeValue / mergeCombiners definitions elided)
    // mapSideCombine is explicitly disabled
    combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  }
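
For example, a word count can be written either way; the reduceByKey version ships partial sums across the network, while the groupByKey version shuffles every single (word, 1) pair (a sketch):

val words = sc.parallelize(Seq("a", "b", "a", "c", "a")).map((_, 1))

// reduceByKey: combines locally on each map-side partition before the shuffle
val byReduce = words.reduceByKey(_ + _)

// groupByKey: shuffles all pairs, values are only summed afterwards
val byGroup  = words.groupByKey().mapValues(_.sum)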

8. cache and persist

cache() calls persist() under the hood.

 def cache(): this.type = persist()
 def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

MEMORY_ONLY: cache deserialized objects in memory
MEMORY_ONLY_SER: serialize first, then cache in memory
Caching on disk is not recommended: when some partitions of an RDD are lost, recomputing them is usually faster than reading them back from disk.
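
A minimal usage sketch (reusing the sample file path from section 2):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("./data/graph/g.txt")
rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized: more compact, but costs extra CPU
rdd.count()                                // the first action materializes the cache
rdd.count()                                // now served from memory
rdd.unpersist()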


9. repartition vs. coalesce

repartition always shuffles, so it can increase the number of partitions.
coalesce has shuffle disabled by default, so it can only reduce the number of partitions, not increase it.

   def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
  
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
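
A quick behavioral check (a sketch using getNumPartitions to inspect the result):

val rdd = sc.parallelize(1 to 100, 8)

println(rdd.coalesce(4).getNumPartitions)      // 4  -- merged without a shuffle
println(rdd.coalesce(16).getNumPartitions)     // 8  -- cannot grow without a shuffle
println(rdd.repartition(16).getNumPartitions)  // 16 -- repartition forces shuffle = true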

10. Sorting

1). case class (Product extends Ordered)
import org.apache.spark.{SparkConf, SparkContext}

object T3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("sortBy")
    val sc = new SparkContext(conf)
    val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("蘿蔔", "2.5", "2567"), ("青椒", "3.0", "675"), ("豬肉", "30", "250")))
    products.collect().foreach(println)
    val ps = products.map(p => Product(p._1, p._2.toDouble, p._3.toInt))
      .sortBy(x => x)
    ps.collect().foreach(println)

    sc.stop()
  }

}

case class Product(name: String, price: Double, amount: Int) extends Ordered[Product] {
  override def compare(that: Product): Int = {
    this.amount - that.amount
  }
}


//Product(豬肉,30.0,250)
//Product(青椒,3.0,675)
//Product(白菜,2.0,1000)
//Product(蘿蔔,2.5,2567)
2). Implicit conversion (Product => Ordered)
import org.apache.spark.{SparkConf, SparkContext}

object T3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("sortBy")
    val sc = new SparkContext(conf)
    val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("蘿蔔", "2.5", "2567"), ("青椒", "3.0", "675"), ("豬肉", "30", "250")))
    products.collect().foreach(println)
    val ps = products.map(p => Product(p._1, p._2.toDouble, p._3.toInt))
      .sortBy(x => x)
    ps.collect().foreach(println)

    sc.stop()
  }

  implicit def Product2Ordered(product: Product): Ordered[Product] = new Ordered[Product] {
    override def compare(that: Product): Int = {
      product.amount - that.amount
    }
  }
  case class Product(name: String, price: Double, amount: Int)
}

//Product(豬肉,30.0,250)
//Product(青椒,3.0,675)
//Product(白菜,2.0,1000)
//Product(蘿蔔,2.5,2567)
3). Ordering.on
import org.apache.spark.{SparkConf, SparkContext}

object T3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("sortBy")
    val sc = new SparkContext(conf)
    val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("蘿蔔", "2.5", "2567"), ("土豆", "2.5", "3567"),("青椒", "3.0", "675"), ("豬肉", "30", "250")))
    products.collect().foreach(println)
    implicit val ord: Ordering[(String, Double, Int)] =
      Ordering[(Double, Int)].on[(String, Double, Int)](x => (-x._2, -x._3))
    val ps = products.map(p => (p._1, p._2.toDouble, p._3.toInt)).sortBy(x => x)
    ps.collect().foreach(println)

    sc.stop()
  }
}

//(豬肉,30.0,250)
//(青椒,3.0,675)
//(土豆,2.5,3567)
//(蘿蔔,2.5,2567)
//(白菜,2.0,1000)

11. Writing Output to Different Files by Key

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

object T3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("sortBy")
    val sc = new SparkContext(conf)
    val products = sc.parallelize(List(("白菜", "2.0", "1000"), ("蘿蔔", "2.5", "2567"), ("土豆", "2.5", "3567"), ("青椒", "3.0", "675"), ("豬肉", "30", "250")))
    products.collect().foreach(println)

    products.map(x => (x._1, x._2)).saveAsHadoopFile("out/hadoop/file", classOf[String], classOf[String], classOf[RZMultipleTextOutputFormat])
    sc.stop()
  }

  class RZMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
    // Route each record to a file under a subdirectory named after its key
    override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = {
      s"${key}/${name}"
    }

    // Return null so the key itself is not written into the output lines
    override def generateActualKey(key: Any, value: Any): Any = {
      null
    }
  }
}

[Figure: output directory layout, with one subdirectory per key]

If the generateActualKey override is removed, each output line will also contain the key.

[Figure: output lines with the key included]

