What is an RDD?

A Resilient Distributed Dataset (RDD) is an abstraction over an immutable, partitioned collection of data.

RDD is characterized by five main properties:

A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
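Most of these properties can be inspected on a live RDD. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the application name and the sample data are illustrative only.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}

// Assumption: a local two-thread master, used only for experimenting.
val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-properties")
val sc = SparkContext.getOrCreate(conf)

val base = sc.parallelize(1 to 10, numSlices = 4)
val summed = base.map(x => (x % 3, x)).reduceByKey(_ + _)

// 1. A list of partitions.
val numParts = base.partitions.length
// 2. The compute function is invoked by Spark through iterator(); it is not called here.
// 3. Dependencies on parent RDDs: reduceByKey introduces a shuffle dependency.
val shuffled = summed.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]
// 4. An optional partitioner: set after reduceByKey, absent on the parallelized collection.
val hasPartitioner = summed.partitioner.isDefined
val basePartitioner = base.partitioner
// 5. Optional preferred locations for a partition.
val locations = base.preferredLocations(base.partitions(0))

println(s"partitions=$numParts shuffle=$shuffled partitioner=${summed.partitioner} locations=$locations")
```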

Methods of the org.apache.spark.rdd.RDD class

RDD is an abstract class, declared as follows:

abstract class RDD[T] extends Serializable with Logging

The RDD class has somewhat more than 80 public methods (counting overloads with different parameters); they are all listed below.

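Note that the RDD class itself defines no methods of the form xxxByKey. Those methods are defined in PairRDDFunctions; through an implicit conversion, an RDD of key-value pairs (that is, RDD[(K, V)]) can call the methods defined in PairRDDFunctions.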
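The implicit conversion can be seen at work in a small word count. This is a sketch under the assumption of a local Spark setup; the sample words are made up, and the second call spells out the conversion (RDD.rddToPairRDDFunctions) that the compiler otherwise inserts.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumption: local mode, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("pair-rdd-functions"))

val pairs: RDD[(String, Int)] =
  sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

// reduceByKey is not a member of RDD; the compiler applies the implicit conversion.
val counts = pairs.reduceByKey(_ + _).collect().toMap

// The same call with the conversion written out explicitly.
val explicitCounts = RDD.rddToPairRDDFunctions(pairs).reduceByKey(_ + _).collect().toMap

println(counts)
```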

Transformation operations

filter(f: (T) ⇒ Boolean): RDD[T]
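Filters the data, keeping only the elements for which f returns true.
map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
Transforms each element of the RDD into a new element by applying the function passed to map.
Input and output partitions correspond one to one: the result has exactly as many partitions as the input.
flatMap[U](f: (T) ⇒ TraversableOnce[U])(implicit arg0: ClassTag[U]): RDD[U]
Applies f to each element as map does, then flattens the collections that f returns so that their elements become the elements of the resulting RDD.
One thing to watch out for when using flatMap:
flatMap treats a String as a sequence of characters, so a function that returns a String yields an RDD[Char] rather than an RDD[String].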
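The difference between map and flatMap, and the String pitfall, can be sketched as follows. The example assumes a local Spark setup and made-up input lines; toSeq is called explicitly here to make the character sequence visible, whereas a function returning a plain String reaches the same result through Scala's implicit String-to-sequence conversion.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("map-vs-flatmap"))

val lines = sc.parallelize(Seq("hello world", "hi"), 2)

// map keeps one output element per input element: an array of words per line.
val mapped = lines.map(_.split(" ")).collect()

// flatMap flattens those arrays into a single RDD of words.
val words = lines.flatMap(_.split(" ")).collect()

// A String flattened as a collection is a sequence of Char.
val chars = lines.flatMap(s => s.toSeq).collect()

println(words.mkString(","))
```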
mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
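Similar to map, except that the mapping function receives an iterator over each partition of the RDD instead of each individual element. When the mapping needs to create an expensive auxiliary object repeatedly, mapPartitions is much more efficient than map.
For example, when writing all the data of an RDD to a database over JDBC, map may end up creating a connection for every element, which is very costly; with mapPartitions only one connection per partition is needed.
The preservesPartitioning parameter indicates whether to keep the partitioner of the parent RDD.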
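The once-per-partition behaviour can be observed without a real database. In this sketch (local Spark assumed) a LongAccumulator named setups stands in for "opening a connection"; both the accumulator name and the doubling function are illustrative, not part of any JDBC workflow.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("map-partitions"))

val data = sc.parallelize(1 to 100, 4)

// Hypothetical counter standing in for an expensive setup step such as opening a connection.
val setups = sc.longAccumulator("setups")

val doubled = data.mapPartitions { iter =>
  setups.add(1)      // executed once per partition, not once per element
  iter.map(_ * 2)    // the per-element work reuses whatever was set up above
}

val out = doubled.collect()
println(s"setups=${setups.value}, elements=${out.length}")
```

Because the returned iterator is lazy, a real resource must not be closed before the iterator has been consumed.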
mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
Same as mapPartitions, except that f takes two arguments, the first of which is the index of the partition.
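A common use of the partition index is to see how elements are distributed. A minimal sketch, assuming a local Spark setup and a small illustrative range:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("map-partitions-with-index"))

val rdd = sc.parallelize(1 to 6, 3)

// Tag every element with the index of the partition it lives in.
val tagged = rdd
  .mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x)))
  .collect()

tagged.foreach(println)
```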
keyBy[K](f: (T) ⇒ K): RDD[(K, T)]
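Generates a key for each element by applying f, producing (key, element) pairs.
sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
Sorts the elements by the key computed with the given function.
zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]
Pairs each element with the corresponding element of another RDD, producing (T, U) tuples. Both RDDs must have the same number of partitions and the same number of elements in each partition.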
zipPartitions[B, V](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) ⇒ Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]
zipWithIndex(): RDD[(T, Long)]
zipWithUniqueId(): RDD[(T, Long)]
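The zip family can be compared side by side. The sketch below assumes a local Spark setup and two small illustrative RDDs with identical partitioning; zipWithIndex assigns consecutive indices, while zipWithUniqueId gives the items of partition k the ids k, n+k, 2n+k, ... for n partitions, so its ids are unique but not consecutive.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("zip-family"))

val letters = sc.parallelize(Seq("a", "b", "c", "d"), 2)
val numbers = sc.parallelize(Seq(1, 2, 3, 4), 2)

// Element-wise pairing; both sides have 2 partitions with 2 elements each.
val zipped = letters.zip(numbers).collect()

// Consecutive indices in RDD order.
val indexed = letters.zipWithIndex().collect()

// Unique ids derived from the partition index, without running an extra job.
val uniqueIds = letters.zipWithUniqueId().collect()

println(uniqueIds.mkString(","))
```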

Aggregation

aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U
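aggregate combines the elements of an RDD: seqOp first folds the elements of type T within each partition into a value of type U, then combOp merges the per-partition results of type U into a single U. Note in particular that both seqOp and combOp start from zeroValue, whose type is U.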
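The role of zeroValue is easiest to see with a non-neutral value. The following sketch assumes a local Spark setup and a four-element RDD in two partitions; the data is illustrative. The first call computes a sum and a count in one pass; the second shows that a zeroValue of 10 is added once per partition and once more when the partition results are combined.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("aggregate"))

val nums = sc.parallelize(1 to 4, 2)   // partitions: (1, 2) and (3, 4)

// U = (sum, count); seqOp folds elements into U, combOp merges two U values.
val (sum, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2))

// Non-neutral zeroValue: (10 + 1 + 2) and (10 + 3 + 4) per partition, then 10 + 13 + 17.
val withTen = nums.aggregate(10)(_ + _, _ + _)

println(s"sum=$sum count=$count withTen=$withTen")
```

For that reason zeroValue is normally chosen as the neutral element of the operation, such as 0 for addition.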
treeAggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U, depth: Int = 2)(implicit arg0: ClassTag[U]): U
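Aggregates in a multi-level tree pattern; depth is the suggested depth of the tree.
reduce(f: (T, T) ⇒ T): T
Combines the elements of the RDD pairwise with the binary function f and returns the result. f should be commutative and associative.
treeReduce(f: (T, T) ⇒ T, depth: Int = 2): T
Reduces in a multi-level tree pattern.
fold(zeroValue: T)(op: (T, T) ⇒ T): T
fold is a simplified aggregate in which the same function op serves as both seqOp and combOp.
count(): Long
Returns the number of elements in the RDD.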
countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]
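Approximate version of count: returns a possibly incomplete result within timeout milliseconds.
countApproxDistinct(relativeSD: Double = 0.05): Long
countApproxDistinct(p: Int, sp: Int): Long
Approximate number of distinct elements.
countByValue()(implicit ord: Ordering[T] = null): Map[T, Long]
Counts the number of occurrences of each distinct value.
countByValueApprox(timeout: Long, confidence: Double = 0.95)(implicit ord: Ordering[T] = null):
Approximate version of countByValue.
distinct(): RDD[T]
distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
Returns a new RDD with duplicate elements removed.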
groupBy[K](f: (T) ⇒ K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]
groupBy[K](f: (T) ⇒ K, numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]
groupBy[K](f: (T) ⇒ K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null): RDD[(K, Iterable[T])]
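Generates a key for each element with the given function and groups the elements by key.
Note: this operation is relatively expensive. When the goal of grouping is to aggregate each group (a sum or an average, say), prefer PairRDDFunctions.reduceByKey or PairRDDFunctions.aggregateByKey,
because reduceByKey first aggregates within each partition and only then exchanges data (shuffle).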
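The two approaches give the same answer but move different amounts of data. A minimal sketch, assuming a local Spark setup; the parity key and the range are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("groupby-vs-reducebykey"))

val nums = sc.parallelize(1 to 10, 4)

// groupBy shuffles every element, then sums each group afterwards.
val viaGroupBy = nums.groupBy(x => x % 2).mapValues(_.sum).collect().toMap

// reduceByKey sums inside each partition first and shuffles only the partial sums.
val viaReduceByKey = nums.map(x => (x % 2, x)).reduceByKey(_ + _).collect().toMap

println(viaReduceByKey)
```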
glom(): RDD[Array[T]]
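Turns the elements of type T in each partition into one Array[T], so that every partition of the result holds a single array element.
max()(implicit ord: Ordering[T]): T
The largest element.
min()(implicit ord: Ordering[T]): T
The smallest element.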

Iterating over elements

foreach(f: (T) ⇒ Unit): Unit
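foreach traverses the RDD, applying f to every element.
Note, however, that f runs on the executors, not on the driver.
For example, rdd.foreach(println) prints to the stdout of the executors; on a cluster the output does not appear on the driver.
foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit
foreachPartition is similar to foreach, except that f is applied once to each partition.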

Retrieving elements

collect(): Array[T]
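collect returns all elements of the RDD as an array on the driver, so it should only be used when the result is small enough to fit in driver memory.
first(): T
first returns the first element of the RDD, without sorting.
take(num: Int): Array[T]
take returns the first num elements of the RDD (indices 0 to num-1), without sorting.
top(num: Int)(implicit ord: Ordering[T]): Array[T]
top returns the first num elements of the RDD according to the default (descending) order or a specified ordering.
takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
takeOrdered is similar to top, except that it returns the elements in the opposite order (ascending by default).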
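The retrieval methods differ mainly in whether and how they order the data. A short sketch, assuming a local Spark setup and a small illustrative RDD:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("take-top-takeordered"))

val nums = sc.parallelize(Seq(5, 1, 9, 3, 7), 2)

val firstTwo = nums.take(2)            // RDD order, no sorting
val largestTwo = nums.top(2)           // descending by the implicit Ordering
val smallestTwo = nums.takeOrdered(2)  // ascending by the implicit Ordering

// Passing a reversed Ordering to top gives the same result as takeOrdered.
val topReversed = nums.top(2)(Ordering[Int].reverse)

println(s"${largestTwo.mkString(",")} / ${smallestTwo.mkString(",")}")
```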
takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T]
Returns a random sample of num elements as an array.

Set operations


++(other: RDD[T]): RDD[T]
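Union with another RDD; equivalent to union.
intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
intersection(other: RDD[T], numPartitions: Int): RDD[T]
intersection(other: RDD[T]): RDD[T]
Returns the intersection of the two RDDs. The result contains no duplicate elements.
subtract(other: RDD[T], p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
subtract(other: RDD[T], numPartitions: Int): RDD[T]
subtract(other: RDD[T]): RDD[T]
Returns the difference: the elements of this RDD that are not in other.
union(other: RDD[T]): RDD[T]
Merges with another RDD. Like SQL UNION ALL, it does not remove duplicates.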
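How duplicates are treated is the main thing to keep in mind with these operations. A minimal sketch, assuming a local Spark setup and two small illustrative RDDs; results are sorted on the driver only to make them easy to compare.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("set-operations"))

val a = sc.parallelize(Seq(1, 2, 2, 3), 2)
val b = sc.parallelize(Seq(2, 3, 4), 2)

val unionAll = a.union(b).collect().sorted            // duplicates kept
val plusPlus = (a ++ b).collect().sorted              // same as union
val unionDistinct = a.union(b).distinct().collect().sorted
val common = a.intersection(b).collect().sorted       // duplicates removed
val onlyInA = a.subtract(b).collect().sorted

println(unionAll.mkString(","))
```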

Miscellaneous

persist(): RDD.this.type
persist(newLevel: StorageLevel): RDD.this.type
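Caches the data. A storage level can be given; a new level can only be assigned if none has been set yet (local checkpointing is the exception).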
unpersist(blocking: Boolean = true): RDD.this.type
Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
cache(): RDD.this.type
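Caches the data at the MEMORY_ONLY storage level.
cartesian[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]
Computes the Cartesian product of the two RDDs.
checkpoint(): Unit
Marks this RDD for checkpointing. The checkpoint directory must have been set with SparkContext.setCheckpointDir, and this method must be called before any job has been executed on the RDD.
coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]
Merges partitions. With shuffle = false it can only reduce the number of partitions; with shuffle = true the elements are redistributed with a HashPartitioner, and the number of partitions can also be increased.
The first parameter is the target number of partitions; the second is whether to shuffle, false by default.
repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
Changes the number of partitions, always with a shuffle. To reduce the number of partitions, coalesce can be used instead to avoid the shuffle.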
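The effect of the shuffle flag can be checked through getNumPartitions. The sketch below assumes a local Spark setup; the partition counts are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for illustration only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("coalesce-vs-repartition"))

val rdd = sc.parallelize(1 to 100, 8)

// Reducing without a shuffle: parent partitions are merged.
val fewer = rdd.coalesce(2).getNumPartitions

// Asking for more partitions without a shuffle leaves the count unchanged.
val notMore = rdd.coalesce(16).getNumPartitions

// With a shuffle, the count can grow; repartition is coalesce with shuffle = true.
val moreShuffled = rdd.coalesce(16, shuffle = true).getNumPartitions
val repartitioned = rdd.repartition(16).getNumPartitions

println(s"$fewer $notMore $moreShuffled $repartitioned")
```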
toDebugString: String
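Returns a description of this RDD and its recursive dependencies (its lineage), for debugging.
getCheckpointFile: Option[String]
Returns the name of the directory to which this RDD was checkpointed, if any.
localCheckpoint(): RDD.this.type
Marks this RDD for local checkpointing.
isEmpty(): Boolean
Whether the RDD contains zero elements.
iterator(split: Partition, context: TaskContext): Iterator[T]
Returns an iterator over a partition. Not meant to be called directly by users; it is intended for subclasses of RDD.
toLocalIterator: Iterator[T]
Returns an iterator on the driver over all elements of the RDD.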
pipe(command: String): RDD[String]
pipe(command: String, env: Map[String, String]): RDD[String]
pipe(command: Seq[String], env: Map[String, String] = Map(), printPipeContext: ((String) ⇒ Unit) ⇒ Unit = null, printRDDElement: (T, (String) ⇒ Unit) ⇒ Unit = null, separateWorkingDir: Boolean = false, bufferSize: Int = 8192, encoding: String = Codec.defaultCharsetCodec.name): RDD[String]
Pipes the RDD through an external process, for example by feeding the elements to a shell script on its standard input.
preferredLocations(split: Partition): Seq[String]
Get the preferred locations of a partition, taking into account whether the RDD is checkpointed.
randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
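Randomly splits the elements into several RDDs according to the given weights.
sample(withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong): RDD[T]
Returns a sampled subset of the RDD.
setName(_name: String): RDD.this.type
Sets the name of the RDD.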

Saving

saveAsObjectFile(path: String): Unit
Saves the RDD as a SequenceFile of serialized objects.
saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit
saveAsTextFile(path: String): Unit
Saves the RDD as a text file.

Member values

context: SparkContext
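The SparkContext that created this RDD.
sparkContext: SparkContext
The SparkContext that created this RDD.
dependencies: Seq[Dependency[_]]
The list of dependencies of this RDD.
getNumPartitions: Int
The number of partitions of this RDD.
getStorageLevel: StorageLevel
The storage level of this RDD; StorageLevel.NONE if none has been set.
id: Int
The unique ID of this RDD.
isCheckpointed: Boolean
Whether the RDD has been checkpointed and materialized, either reliably or locally.
name: String
The name of the RDD.
partitioner: Option[Partitioner]
The partitioner, if any.
partitions: Array[Partition]
The partitions of this RDD.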
