- Accumulator
val blankLines = new LongAccumulator
sc.register(blankLines)
only update an accumulator inside a transformation for debugging purposes: speculative execution and task retries can re-run the transformation, so the update may be applied more than once and the value is not guaranteed to be accurate. Inside an action, Spark applies each task's update exactly once, so the value is accurate.
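A minimal runnable sketch of the accumulator pattern above, assuming a local Spark context; the sample data and the `blankLines` name are illustrative. The update happens inside `filter()` (a transformation), which is exactly the debugging-only case the note warns about:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.LongAccumulator

val sc = new SparkContext(
  new SparkConf().setAppName("accumulator-demo").setMaster("local[2]"))

// register the accumulator with the SparkContext, as in the notes
val blankLines = new LongAccumulator
sc.register(blankLines, "blankLines")

val lines = sc.parallelize(Seq("hello", "", "world", ""))
val nonBlank = lines.filter { line =>
  if (line.isEmpty) blankLines.add(1) // transformation-side update: debugging only
  line.nonEmpty
}

val kept = nonBlank.count() // the action actually runs the job
println(s"kept=$kept blank=${blankLines.value}")
sc.stop()
```

In local mode without retries this prints the exact count; on a cluster with speculation enabled, the transformation-side count could overshoot.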
- Broadcast (read only)
val signPrefixes = sc.broadcast(loadCallSignTable())
the broadcast value is sent to each worker node only once.
treat a broadcast variable as immutable: if you mutate it, the change happens only on the local worker node and the other worker nodes are not affected.
choose the right serializer for the broadcast variable (e.g. Kryo for large values)
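A small sketch of the broadcast pattern, assuming a local context; the inline `Map` stands in for the `loadCallSignTable()` result in the notes, and the sample call signs are made up. Every task reads `signPrefixes.value` instead of closing over a large table:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("broadcast-demo").setMaster("local[2]"))

// hypothetical prefix table standing in for loadCallSignTable()
val signPrefixes = sc.broadcast(Map("K" -> "USA", "VE" -> "Canada"))

val calls = sc.parallelize(Seq("K2AMH", "VE2CUA"))
val countries = calls.map { sign =>
  // look up the longest-known prefix; read-only access to the broadcast value
  val prefix = signPrefixes.value.keys.find(p => sign.startsWith(p)).getOrElse("?")
  signPrefixes.value.getOrElse(prefix, "unknown")
}.collect()

println(countries.mkString(", "))
sc.stop()
```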
- per-partition basis
mapPartitions() | Iterator of the elements in that partition | Iterator of our return elements | f: (Iterator[T]) → Iterator[U]
mapPartitionsWithIndex() | Integer of partition number, and Iterator of the elements in that partition | Iterator of our return elements | f: (Int, Iterator[T]) → Iterator[U]
foreachPartition() | Iterator of the elements in that partition | Nothing | f: (Iterator[T]) → Unit
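A sketch of `mapPartitions()`, assuming a local context: compute a per-partition `(sum, count)` pair in one pass, then reduce to a global average. This avoids emitting one tuple per element, which is the usual reason to work per-partition:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("mapPartitions-demo").setMaster("local[2]"))

val nums = sc.parallelize(1 to 100, 4)

// one (sum, count) pair per partition instead of one per element
val (sum, count) = nums.mapPartitions { iter =>
  var s = 0L; var c = 0L
  iter.foreach { n => s += n; c += 1 }
  Iterator((s, c))
}.reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }

val avg = sum.toDouble / count
println(s"avg=$avg")
sc.stop()
```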
- pipe to an external program (the external program reads its input from standard input and writes its output to standard output)
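A sketch of `pipe()`, assuming a local context and a Unix environment where `tr` is on the PATH: each RDD element is written to the program's standard input as one line, and each line the program prints becomes an element of the result RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("pipe-demo").setMaster("local[2]"))

val words = sc.parallelize(Seq("spark", "pipe"))

// each partition's elements are streamed through `tr` line by line
val upper = words.pipe(Seq("tr", "a-z", "A-Z")).collect()

println(upper.mkString(", "))
sc.stop()
```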
- StatCounter on numeric RDD
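A sketch of `StatCounter`, assuming a local context: calling `stats()` on a numeric RDD computes count, sum, mean, min, max, and stdev in a single pass.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.StatCounter

val sc = new SparkContext(
  new SparkConf().setAppName("stats-demo").setMaster("local[2]"))

val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))

// one pass over the RDD gathers all the summary statistics
val stats: StatCounter = nums.stats()
println(s"count=${stats.count} mean=${stats.mean} stdev=${stats.stdev}")
sc.stop()
```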
- partitions after transformation
- general
filter(), map(), flatMap(), distinct() | same as the parent RDD |
rdd.union(otherRDD) | rdd.partitions.size + otherRDD.partitions.size |
rdd.intersection(otherRDD) | max(rdd.partitions.size, otherRDD.partitions.size) |
rdd.subtract(otherRDD) | rdd.partitions.size |
rdd.cartesian(otherRDD) | rdd.partitions.size * otherRDD.partitions.size |
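The general rules above can be checked directly, assuming a local context where `spark.default.parallelism` is not explicitly set (otherwise `intersection()` would use that value instead):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("partitions-demo").setMaster("local[2]"))

val a = sc.parallelize(1 to 10, 2)  // 2 partitions
val b = sc.parallelize(5 to 15, 3)  // 3 partitions

val unionParts     = a.union(b).partitions.size        // 2 + 3 = 5
val intersectParts = a.intersection(b).partitions.size // max(2, 3) = 3
val subtractParts  = a.subtract(b).partitions.size     // 2 (from `a`)
val cartesianParts = a.cartesian(b).partitions.size    // 2 * 3 = 6

println(s"$unionParts $intersectParts $subtractParts $cartesianParts")
sc.stop()
```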
- pair
reduceByKey(), foldByKey(), combineByKey(), groupByKey() | same as the parent RDD |
sortByKey() | same as above |
mapValues(), flatMapValues() | same as above |
cogroup(), join(), leftOuterJoin(), rightOuterJoin() | sort all parent RDDs by partition count in descending order and, starting from the RDD with the most partitions, look for an existing partitioner; if one is found, that partitioner determines the partition count; if no parent RDD has a partitioner, the count comes from spark.default.parallelism; if that is not set either, the partition count is the maximum partition count among all parent RDDs |
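The join rule above can be sketched as follows, assuming a local context: `x` is explicitly partitioned with a `HashPartitioner(4)`, `y` has no partitioner, so the join picks up `x`'s partitioner and its partition count.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("join-partitioner-demo").setMaster("local[2]"))

// x carries an explicit partitioner; y does not
val x = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(new HashPartitioner(4))
val y = sc.parallelize(Seq(("a", 3), ("c", 4)), 2)

val joined = x.join(y)
val joinedParts = joined.partitions.size        // 4, taken from x's HashPartitioner
val hasPartitioner = joined.partitioner.isDefined

println(s"parts=$joinedParts partitioner=$hasPartitioner")
sc.stop()
```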