spark - Advanced Spark Programming

- Accumulator

val blankLines = new LongAccumulator   // org.apache.spark.util.LongAccumulator
sc.register(blankLines, "blank lines") // or simply: val blankLines = sc.longAccumulator("blank lines")

Use accumulator updates inside transformations only for debugging: because of speculative execution and failed-task retries, a transformation may run more than once, so its updates can be applied multiple times and the count is not guaranteed to be accurate. Updates made inside actions are applied exactly once, so there the accumulator is accurate.
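A minimal sketch of the blank-line counter, runnable in spark-shell (where `sc` is predefined); `input.txt` is a hypothetical input file:

```scala
val blankLines = sc.longAccumulator("blank lines")

val words = sc.textFile("input.txt").flatMap { line =>
  if (line.trim.isEmpty) blankLines.add(1) // update inside a transformation: debugging only
  line.split(" ")
}

words.count()                              // an action forces the updates to be applied
println(s"Blank lines: ${blankLines.value}")
```

If the `flatMap` stage is ever recomputed (speculation, retry, or re-evaluation of an uncached RDD), `blankLines` is incremented again, which is exactly why transformation-side updates are debugging-only.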

- Broadcast (read only)

val signPrefixes = sc.broadcast(loadCallSignTable())

A broadcast value is sent to each worker node only once, instead of once per task.

Treat broadcast variables as immutable: if you mutate the value on a worker, the change is visible only on that local worker node, while the other worker nodes keep the original copy.

Choose an efficient serializer (e.g. Kryo, via spark.serializer) for large broadcast variables.
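A small sketch of the read-only lookup pattern, assuming a live `sc`; the prefix table and sample call signs are made-up stand-ins for `loadCallSignTable()`:

```scala
// Hypothetical prefix-to-country table, built once on the driver.
val callSignTable: Map[String, String] = Map("K" -> "USA", "VE" -> "Canada")
val signPrefixes = sc.broadcast(callSignTable) // shipped to each worker node once

val callSigns = sc.parallelize(Seq("KK6JKQ", "VE3UED"))
val countries = callSigns.map { sign =>
  // read-only lookup against the local copy on the executor
  signPrefixes.value.collectFirst {
    case (prefix, country) if sign.startsWith(prefix) => country
  }.getOrElse("unknown")
}
```

Without the broadcast, the table would be captured in the task closure and re-serialized for every task.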


- per-partition basis 

function | f receives | f returns | signature of f
mapPartitions() | Iterator of the elements in that partition | Iterator of our return elements | f: (Iterator[T]) → Iterator[U]
mapPartitionsWithIndex() | partition number (Int) and an Iterator of the elements in that partition | Iterator of our return elements | f: (Int, Iterator[T]) → Iterator[U]
foreachPartition() | Iterator of the elements | Nothing | f: (Iterator[T]) → Unit
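A sketch of mapPartitions(), assuming a live `sc`: computing a global average by emitting one (sum, count) pair per partition instead of one per element.

```scala
val nums = sc.parallelize(1 to 100, 4)

val (sum, count) = nums.mapPartitions { iter =>
  var s = 0L; var c = 0L
  iter.foreach { x => s += x; c += 1 }
  Iterator((s, c))                      // a single pair per partition
}.reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }

val avg = sum.toDouble / count          // 50.5
```

The same pattern is used to amortize expensive per-partition setup (database connections, parsers) across all elements of a partition.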

- pipe to an external program: each RDD element is written to the program's standard input as one line, and each line the program writes to standard output becomes one element of the result RDD
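A minimal sketch, assuming a live `sc` and that the piped command (here the standard `tr` utility) exists on every worker node:

```scala
val words = sc.parallelize(Seq("spark", "accumulator", "broadcast"))

// Each element goes to tr's stdin as one line; each output line becomes
// one element of the result RDD.
val upper = words.pipe(Seq("tr", "a-z", "A-Z"))
upper.collect()   // Array(SPARK, ACCUMULATOR, BROADCAST)
```

For your own scripts, ship them with sc.addFile() first so every worker has a local copy.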

- StatCounter on a numeric RDD: rdd.stats() computes count, mean, stdev, max, and min in a single pass over the data
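A sketch, assuming a live `sc`: compute the statistics once, then reuse them to filter outliers.

```scala
val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 1000.0))

val stats = nums.stats()   // StatCounter: count, mean, stdev, max, min in one pass
val reasonable = nums.filter(x => math.abs(x - stats.mean) < 3 * stats.stdev)
```

Calling mean(), stdev(), max() etc. separately would each trigger a full pass; stats() does them all at once.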

- partitions after transformation 

  • general

filter(), map(), flatMap(), distinct(): same as the parent RDD
rdd.union(otherRDD): rdd.partitions.size + otherRDD.partitions.size
rdd.intersection(otherRDD): max(rdd.partitions.size, otherRDD.partitions.size)
rdd.subtract(otherRDD): rdd.partitions.size
rdd.cartesian(otherRDD): rdd.partitions.size * otherRDD.partitions.size
  • pair
reduceByKey(), foldByKey(), combineByKey(), groupByKey(): same as the parent RDD
sortByKey(): same as above
mapValues(), flatMapValues(): same as above
cogroup(), join(), leftOuterJoin(), rightOuterJoin(): sort all parent RDDs by partition count in descending order and, starting from the RDD with the most partitions, look for an existing partitioner; if one is found, that partitioner determines the partition count. If no parent has a partitioner, the count is taken from spark.default.parallelism; if that is not set either, it falls back to the maximum partition count among all parents.
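The general rules above can be checked directly in spark-shell with partitions.size; the counts in the comments follow from the rules:

```scala
val a = sc.parallelize(1 to 100, 4)
val b = sc.parallelize(1 to 100, 6)

a.union(b).partitions.size        // 4 + 6 = 10
a.intersection(b).partitions.size // max(4, 6) = 6
a.subtract(b).partitions.size     // 4, same as a
a.cartesian(b).partitions.size    // 4 * 6 = 24
```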
