spark - Advanced Spark Programming

- Accumulator

val blankLines = new LongAccumulator   // org.apache.spark.util.LongAccumulator
sc.register(blankLines, "blank lines") // or simply: val blankLines = sc.longAccumulator("blank lines")

Use accumulator updates inside transformations only for debugging: because of speculative execution and failed-task retries, a transformation may run more than once, so its updates can be applied multiple times and the count is not guaranteed to be accurate. Updates made inside actions are applied exactly once, so there the accumulator is accurate.
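A minimal sketch of the blank-line counter, runnable in spark-shell (where `sc` is predefined); `input.txt` is a hypothetical input file:

```scala
val blankLines = sc.longAccumulator("blank lines")

val words = sc.textFile("input.txt").flatMap { line =>
  if (line.trim.isEmpty) blankLines.add(1) // update inside a transformation: debugging only
  line.split(" ")
}

words.count()                              // an action forces the updates to be applied
println(s"Blank lines: ${blankLines.value}")
```

If the `flatMap` stage is ever recomputed (speculation, retry, or re-evaluation of an uncached RDD), `blankLines` is incremented again, which is exactly why transformation-side updates are debugging-only.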

- Broadcast (read only)

val signPrefixes = sc.broadcast(loadCallSignTable())

A broadcast value is sent to each worker node only once, instead of once per task.

Treat broadcast variables as immutable: if you mutate the value on a worker, the change is visible only on that local worker node, while the other worker nodes keep the original copy.

Choose an efficient serializer (e.g. Kryo, via spark.serializer) for large broadcast variables.
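A small sketch of the read-only lookup pattern, assuming a live `sc`; the prefix table and sample call signs are made-up stand-ins for `loadCallSignTable()`:

```scala
// Hypothetical prefix-to-country table, built once on the driver.
val callSignTable: Map[String, String] = Map("K" -> "USA", "VE" -> "Canada")
val signPrefixes = sc.broadcast(callSignTable) // shipped to each worker node once

val callSigns = sc.parallelize(Seq("KK6JKQ", "VE3UED"))
val countries = callSigns.map { sign =>
  // read-only lookup against the local copy on the executor
  signPrefixes.value.collectFirst {
    case (prefix, country) if sign.startsWith(prefix) => country
  }.getOrElse("unknown")
}
```

Without the broadcast, the table would be captured in the task closure and re-serialized for every task.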


- per-partition basis 

function | f receives | f returns | signature of f
mapPartitions() | Iterator of the elements in that partition | Iterator of our return elements | f: (Iterator[T]) → Iterator[U]
mapPartitionsWithIndex() | partition number (Int) and an Iterator of the elements in that partition | Iterator of our return elements | f: (Int, Iterator[T]) → Iterator[U]
foreachPartition() | Iterator of the elements | Nothing | f: (Iterator[T]) → Unit
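A sketch of mapPartitions(), assuming a live `sc`: computing a global average by emitting one (sum, count) pair per partition instead of one per element.

```scala
val nums = sc.parallelize(1 to 100, 4)

val (sum, count) = nums.mapPartitions { iter =>
  var s = 0L; var c = 0L
  iter.foreach { x => s += x; c += 1 }
  Iterator((s, c))                      // a single pair per partition
}.reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }

val avg = sum.toDouble / count          // 50.5
```

The same pattern is used to amortize expensive per-partition setup (database connections, parsers) across all elements of a partition.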

- pipe to an external program: each RDD element is written to the program's standard input as one line, and each line the program writes to standard output becomes one element of the result RDD
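A minimal sketch, assuming a live `sc` and that the piped command (here the standard `tr` utility) exists on every worker node:

```scala
val words = sc.parallelize(Seq("spark", "accumulator", "broadcast"))

// Each element goes to tr's stdin as one line; each output line becomes
// one element of the result RDD.
val upper = words.pipe(Seq("tr", "a-z", "A-Z"))
upper.collect()   // Array(SPARK, ACCUMULATOR, BROADCAST)
```

For your own scripts, ship them with sc.addFile() first so every worker has a local copy.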

- StatCounter on a numeric RDD: rdd.stats() computes count, mean, stdev, max, and min in a single pass over the data
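A sketch, assuming a live `sc`: compute the statistics once, then reuse them to filter outliers.

```scala
val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 1000.0))

val stats = nums.stats()   // StatCounter: count, mean, stdev, max, min in one pass
val reasonable = nums.filter(x => math.abs(x - stats.mean) < 3 * stats.stdev)
```

Calling mean(), stdev(), max() etc. separately would each trigger a full pass; stats() does them all at once.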

- partitions after transformation 

  • general

filter(), map(), flatMap(), distinct(): same as the parent RDD
rdd.union(otherRDD): rdd.partitions.size + otherRDD.partitions.size
rdd.intersection(otherRDD): max(rdd.partitions.size, otherRDD.partitions.size)
rdd.subtract(otherRDD): rdd.partitions.size
rdd.cartesian(otherRDD): rdd.partitions.size * otherRDD.partitions.size
  • pair
reduceByKey(), foldByKey(), combineByKey(), groupByKey(): same as the parent RDD
sortByKey(): same as above
mapValues(), flatMapValues(): same as above
cogroup(), join(), leftOuterJoin(), rightOuterJoin(): sort all parent RDDs by partition count in descending order and, starting from the RDD with the most partitions, look for an existing partitioner; if one is found, that partitioner determines the partition count. If no parent has a partitioner, the count is taken from spark.default.parallelism; if that is not set either, it falls back to the maximum partition count among all parents.
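The general rules above can be checked directly in spark-shell with partitions.size; the counts in the comments follow from the rules:

```scala
val a = sc.parallelize(1 to 100, 4)
val b = sc.parallelize(1 to 100, 6)

a.union(b).partitions.size        // 4 + 6 = 10
a.intersection(b).partitions.size // max(4, 6) = 6
a.subtract(b).partitions.size     // 4, same as a
a.cartesian(b).partitions.size    // 4 * 6 = 24
```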
