RDD Partitioning and Repartitioning

RDD Partitions

An RDD is divided into many partitions that are distributed across the nodes of the cluster, and the number of partitions determines the granularity of parallel computation on that RDD. A partition is a logical concept: the old and new partitions before and after a transformation may physically occupy the same block of memory or storage, an optimization that prevents the functional immutability of RDDs from causing unbounded growth in memory demand. You can call the partitions method to obtain the partitions of an RDD (and hence their number), and you can also set the number of partitions explicitly. If no value is specified, a default is used: for an RDD created from a collection, the default is the number of CPU cores allocated to the application; for an RDD created from an HDFS file, the default is the number of blocks in that file.

scala> val part=sc.textFile("file:/hadoop/spark/README.md")
part: org.apache.spark.rdd.RDD[String] = file:/hadoop/spark/README.md MapPartitionsRDD[5] at textFile at <console>:24
scala> part.partitions.size
res2: Int = 2

scala> val part=sc.textFile("file:/hadoop/spark/README.md",4)
part: org.apache.spark.rdd.RDD[String] = file:/hadoop/spark/README.md MapPartitionsRDD[7] at textFile at <console>:24
scala> part.partitions.size
res3: Int = 4
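
As a quick check of the default behaviour described above, the following sketch (assuming a spark-shell session; the core count shown in the comments is illustrative) creates an RDD from a collection without specifying a partition count:

sc.defaultParallelism                    // number of cores allocated to this application, e.g. 2
val nums = sc.parallelize(1 to 100)      // no partition count given
nums.partitions.size                     // equals sc.defaultParallelism
val nums8 = sc.parallelize(1 to 100, 8)  // explicit partition count
nums8.partitions.size                    // 8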

Computing on RDD Partitions (Iterator)

In Spark, RDD computation is performed per partition, and the compute functions are composed over iterators, so intermediate results do not have to be materialized at each step. Partition-level computation is usually done with operations such as mapPartitions, whose input function is applied to each partition as a whole, i.e. the entire contents of a partition are processed together:

def mapPartitions [U:ClassTag](f:Iterator[T]=>Iterator[U],preservesPartitioning:Boolean=false):RDD[U]

In the following example, the function iterfunc pairs each element of a partition with the element that follows it. Because the last element of a partition has no successor within that partition, the pairs (3,4) and (6,7) do not appear in the result.

val a=sc.parallelize(1 to 9,3)
# View the contents of each partition
scala> a.mapPartitionsWithIndex{(partid,iter)=>{
     | var part_map=scala.collection.mutable.Map[String,List[Int]]()
     | var part_name="part_"+partid
     | part_map(part_name)=List[Int]()
     | while(iter.hasNext){
     | part_map(part_name):+=iter.next()}
     | part_map.iterator}}.collect
res9: Array[(String, List[Int])] = Array((part_0,List(1, 2, 3)), (part_1,List(4, 5, 6)), (part_2,List(7, 8, 9)))

scala> def iterfunc [T](iter:Iterator[T]):Iterator[(T,T)]={
     | var res=List[(T,T)]()
     | var pre=iter.next
     | while(iter.hasNext){
     | val cur=iter.next
     | res::=(pre,cur)
     | pre=cur}
     | res.iterator}
iterfunc: [T](iter: Iterator[T])Iterator[(T, T)]
scala> a.mapPartitions(iterfunc).collect
res10: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
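
A common practical reason to prefer mapPartitions over map is that per-partition setup work is done once per partition rather than once per element. The sketch below is not from the original article; the date strings and the SimpleDateFormat setup are assumed purely for illustration:

import java.text.SimpleDateFormat

val lines = sc.parallelize(Seq("2024-01-01", "2024-02-15", "2024-03-30"), 2)
val parsed = lines.mapPartitions { iter =>
  // Built once per partition instead of once per element
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  iter.map(s => fmt.parse(s).getTime)   // iterator-to-iterator, nothing materialized
}
parsed.collect()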

RDD Partitioner

How an RDD is partitioned matters most for shuffle-type operations, because it determines the dependency type between the parent RDDs and the child RDD of such an operation. Take join as an example: if the inputs are co-partitioned, the two parent RDDs and the child RDD share a consistent partition layout, i.e. the same key is guaranteed to map to the same partition, which yields a narrow dependency; without co-partitioning, the result is a wide dependency. Co-partitioning here means using a specified partitioner so that a consistent partition layout is produced before and after the operation.
Spark provides two partitioners by default: the hash partitioner (HashPartitioner) and the range partitioner (RangePartitioner). A partitioner only exists for RDDs of (K,V) pairs; for non-(K,V) RDDs the partitioner is None.
In the following program, a MapPartitionsRDD is first constructed whose partitioner is None; groupByKey is then applied to produce the group_rdd variable, and for this groupByKey a new HashPartitioner object is created.

scala> var part=sc.textFile("file:/hadoop/spark/README.md")
part: org.apache.spark.rdd.RDD[String] = /hadoop/spark/README.md MapPartitionsRDD[12] at textFile at <console>:24
scala> part.partitioner
res11: Option[org.apache.spark.Partitioner] = None
scala> val group_rdd=part.map(x=>(x,x)).groupByKey(new org.apache.spark.HashPartitioner(4))
group_rdd: org.apache.spark.rdd.RDD[(String, Iterable[String])] = ShuffledRDD[16] at groupByKey at <console>:26
scala> group_rdd.partitioner
res14: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@4)
# View the contents of each partition
scala> part.mapPartitionsWithIndex{(partid,iter)=>{
     |  var part_map=scala.collection.mutable.Map[String,List[String]]()
     |  var part_name="part_"+partid
     |  part_map(part_name)=List[String]()
     | while(iter.hasNext){
     | part_map(part_name):+=iter.next()}
     |  part_map.iterator}}.collect
res19: Array[(String, List[String])] = Array((part_0,List(# Apache Spark, "", Spark is a fast and general cluster computing system for Big Data. It provides, high-level APIs in Scala, Java, Python, and R, and an optimized engine that, supports general computation graphs for data analysis. It also supports a, rich set of higher-level tools including Spark SQL for SQL and DataFrames,, MLlib for machine learning, GraphX for graph processing,, and Spark Streaming for stream processing., "", <http://spark.apache.org/>, "", "", ## Online Documentation, "", You can find the latest Spark documentation, including a programming, guide, on the [project web page](http://spark.apache.org/documentation.html)., This README file only contains basic setup instructions., "", ## Building Spark, "", Spark ...
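
The RangePartitioner mentioned above is used in the same way; the minimal sketch below is not from the original article, and the sample key/value pairs are made up for demonstration:

import org.apache.spark.RangePartitioner

val pairs = sc.parallelize(Seq((1,"a"),(7,"b"),(3,"c"),(9,"d"),(5,"e")), 2)
// RangePartitioner samples the keys and assigns contiguous key ranges to partitions
val ranged = pairs.partitionBy(new RangePartitioner(2, pairs))
ranged.partitioner            // Some(org.apache.spark.RangePartitioner@...)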

Partitioning Operations

coalesce(numPartitions:Int,shuffle:Boolean=false):RDD[T]
repartition(numPartitions:Int):RDD[T]

Both coalesce and repartition repartition an RDD. For coalesce, the first parameter is the target number of partitions and the second indicates whether to shuffle, defaulting to false; repartition is simply coalesce with the shuffle parameter set to true. If the target number of partitions is larger than the current number, shuffle must be set to true, otherwise the partition count does not change.
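
A minimal sketch of the behaviour described above (the partition counts in the comments assume the RDD starts with 4 partitions):

val rdd = sc.parallelize(1 to 16, 4)
rdd.partitions.size                          // 4
val fewer = rdd.coalesce(2)                  // shrink without a shuffle
fewer.partitions.size                        // 2
val noGrow = rdd.coalesce(8)                 // shuffle defaults to false, so no effect
noGrow.partitions.size                       // still 4
val grown = rdd.coalesce(8, shuffle = true)  // growing requires a shuffle
grown.partitions.size                        // 8
val repart = rdd.repartition(8)              // same as coalesce(8, shuffle = true)
repart.partitions.size                       // 8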

glom():RDD[Array[T]]

The glom operation turns all the elements of type T in each partition of an RDD into an array Array[T], producing an RDD[Array[T]].
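
A short sketch of glom (the result in the comment assumes 1 to 9 split evenly over 3 partitions):

val nums = sc.parallelize(1 to 9, 3)
nums.glom().collect()
// Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))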

mapPartitions is similar to map, except that the function it applies receives the iterator of each partition of the RDD instead of individual elements. mapPartitionsWithIndex works like mapPartitions, but its input function takes an additional parameter: the partition index.

scala> var rdd1=sc.makeRDD(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at makeRDD at <console>:24
# Use mapPartitions to sum the elements of each partition
scala> var rdd3=rdd1.mapPartitions{x=>{
     | var result=List[Int]()
     | var i=0
     | while(x.hasNext){
     | i+=x.next()}
     | result.::(i).iterator}}
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[24] at mapPartitions at <console>:26
scala> rdd3.collect
res20: Array[Int] = Array(3, 12)
scala> var rdd2=rdd1.mapPartitionsWithIndex{
     | (x,iter)=>{
     |  var result=List[String]()
     | var i=0
     | while(iter.hasNext){
     | i+=iter.next()}
     | result.::(x+"|"+i).iterator}}
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at mapPartitionsWithIndex at <console>:26
scala> rdd2.collect
res21: Array[String] = Array(0|3, 1|12)

partitionBy(partitioner:Partitioner):RDD[(K,V)]

The partitionBy operation uses the given partitioner to build a new ShuffledRDD, repartitioning the original RDD.

scala> var rdd1=sc.makeRDD(Array((1,"A"),(2,"B"),(3,"C"),(4,"D")),2)
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[28] at makeRDD at <console>:24
scala> var rdd2=rdd1.partitionBy(new org.apache.spark.HashPartitioner(2))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[29] at partitionBy at <console>:26
# View the elements in each partition
scala> rdd2.mapPartitionsWithIndex{
     | (partIdx,iter)=>{
     | var part_map=scala.collection.mutable.Map[String,List[(Int,String)]]()
     | while(iter.hasNext){
     | var part_name="part_"+partIdx
     | var elem=iter.next()
     | if(part_map.contains(part_name)){
     | var elems=part_map(part_name)
     |  elems::=elem
     | part_map(part_name)=elems
     | }else{
     | part_map(part_name)=List[(Int,String)]{elem}
     | }}
     | part_map.iterator}}.collect
res23: Array[(String, List[(Int, String)])] = Array((part_0,List((4,D), (2,B))), (part_1,List((3,C), (1,A))))
