Spark -- The Two Kinds of RDD Operators: Transformations and Actions

Transformation

map(func)

Returns a new RDD formed by applying a function to each element of this RDD.

/**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U]

For example, multiply each element of the RDD by 2:

scala> sc.parallelize(1 to 5).map(_*2).collect()
res0: Array[Int] = Array(2, 4, 6, 8, 10)

filter(func)

Returns a new RDD containing only the elements for which the function returns true.

/**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T]

For example, keep only the elements greater than 2:

scala> sc.parallelize(1 to 5).filter(_>2).collect()
res2: Array[Int] = Array(3, 4, 5)

flatMap(func)

Similar to map, but each input item (element) can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

/**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

For example, the first snippet splits each String element into an array of words, while the second simply flattens each Range element into its individual numbers:

scala> sc.parallelize(Array("redis redis spark","yarn hadoop spark")).flatMap(_.split(" ")).collect()
res17: Array[String] = Array(redis, redis, spark, yarn, hadoop, spark)

scala> sc.parallelize(Array(1 to 5, 5 to 10, 11 to 15)).flatMap(x=>x.map(y=>y)).collect()
res18: Array[Int] = Array(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

mapPartitions(func)

Similar to map, but runs separately on each partition of the RDD, so when used on an RDD of type T, func must be of type Iterator[T] => Iterator[U].
In other words, map applies its function to each individual element, whereas mapPartitions applies its function to each partition, processing the contents of a partition as a whole.

/**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U]

For example, the elements of the RDD are sequences; compute the sum of all the sequence elements in each partition:

val rdd = sc.parallelize(Array(1 to 5, 5 to 10, 11 to 15), 3)
val mapParRDD = rdd.mapPartitionsWithIndex((index, iter) => {
  var num = 0
  while (iter.hasNext) {
    val seq = iter.next()
    seq.foreach(x => num = x + num)   // accumulate the sum of this sequence
    println(s"$index-----$seq")       // printed on the executor that owns the partition
  }
  Iterator(num)                       // emit one sum per partition
})
mapParRDD.collect().foreach(println)

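The same per-partition sum can also be written directly with mapPartitions. A minimal sketch, assuming the rdd defined above:

val partitionSums = rdd.mapPartitions(iter => Iterator(iter.map(_.sum).sum))
partitionSums.collect().foreach(println)   // 15, 45, 65 for the 3-partition layout above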

mapPartitionsWithIndex(func)

Similar to mapPartitions, but also provides an Int parameter carrying the partition index, so the function is of type (Int, Iterator[T]) => Iterator[U].

/**
   * Return a new RDD by applying a function to each partition of this RDD, while tracking the index
   * of the original partition.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U]

For an example, see the mapPartitions example above.

sample(withReplacement, fraction, seed)

Samples a fraction of the data, with or without replacement, using a given random number generator seed.

/**
   * Return a sampled subset of this RDD.
   *
   * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
   * @param fraction expected size of the sample as a fraction of this RDD's size
   *  without replacement: probability that each element is chosen; fraction must be [0, 1]
   *  with replacement: expected number of times each element is chosen; fraction must be greater
   *  than or equal to 0
   * @param seed seed for the random number generator
   *
   * @note This is NOT guaranteed to provide exactly the fraction of the count
   * of the given [[RDD]].
   */
  def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T]

For example:

scala> var sampleRDD = sc.parallelize(1 to 10)
sampleRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[49] at parallelize at <console>:24
scala> sampleRDD.sample(false,0.1).collect
res47: Array[Int] = Array(3, 5)
scala> sampleRDD.sample(true,0.2).collect
res66: Array[Int] = Array(1, 4, 4, 7, 9)

union(otherDataset)

Returns the union of the two RDDs as a new RDD.

/**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def union(other: RDD[T]): RDD[T]

For example, merging two RDDs:

scala> var rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[74] at parallelize at <console>:24
scala> var rdd2 = sc.parallelize(3 to 7)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[75] at parallelize at <console>:24
scala> rdd1.union(rdd2).collect().foreach(println)
1
2
3
4
5
3
4
5
6
7

intersection(otherDataset)

Returns the intersection of the two RDDs.

/**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: RDD[T]): RDD[T]

For example:

scala> rdd1.intersection(rdd2).collect().foreach(println)
4
3
5

distinct([numPartitions])

Removes duplicate elements from the RDD.

/**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): RDD[T]

For example:

scala> sc.parallelize(Array(1,1,2,2,2,3,5)).distinct().collect()
res73: Array[Int] = Array(2, 1, 3, 5)

groupByKey([numPartitions])

When called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable<V>) pairs.
Note: if you are grouping in order to perform an aggregation over each key (such as a sum or average), reduceByKey or aggregateByKey will give much better performance.
Note: by default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.

/**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
   * within each group is not guaranteed, and may even differ each time the resulting RDD is
   * evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupByKey(): RDD[(K, Iterable[V])]

For example; note that the elements of arr are tuples of type (String, Iterable[Int]):

val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
val arr = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1)).groupByKey().collect()
arr.foreach(x=>{
  println(s"key=${x._1}, iterable=${x._2}")
})

reduceByKey(func, [numPartitions])

Aggregates the values for each key using the given reduce function. Like a "combiner" in MapReduce, the merging is also performed locally on each mapper before the results are sent to a reducer.

/**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   */
  def reduceByKey(func: (V, V) => V): RDD[(K, V)]

For example, word count:

scala> val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
strArray: Array[String] = Array(redis redis spark yarn yarn, yarn hadoop hadoop)

scala> sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1)).reduceByKey(_+_).collect().foreach(println)
(yarn,3)
(spark,1)
(hadoop,2)
(redis,2)

aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])

Aggregates the values of each key. This method can return a result type different from the value type.
First argument, zeroValue: the initial value for each key.
Second argument, seqOp: the function applied within each partition to merge a value into the accumulator for its key.
Third argument, combOp: the function that merges the per-partition results produced by seqOp for each key.

/**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
   * instead of creating a new U.
   */
  def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)]

For example. Note that yarn ends up as 23 below. Why? Because both partition 0 and partition 1 contain yarn, so the initial value of 10 is added twice:

scala> val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
strArray: Array[String] = Array(redis redis spark yarn yarn, yarn hadoop hadoop)

scala> var mapRDD = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1))
mapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[117] at map at <console>:26

scala> mapRDD.mapPartitionsWithIndex((index,ite)=>{
     |   ite.map(x=>(index,x))
     | }).collect().foreach(println)
(0,(redis,1))
(0,(redis,1))
(0,(spark,1))
(0,(yarn,1))
(0,(yarn,1))
(1,(yarn,1))
(1,(hadoop,1))
(1,(hadoop,1))

scala> mapRDD.aggregateByKey(10)((u,v)=>u+v,_+_).collect().foreach(println)
(yarn,23)
(spark,11)
(hadoop,12)
(redis,12)

Computing an average can be done with groupByKey or with reduceByKey.

// Method 1: group with groupByKey, then compute the average with map; easy to understand
var numRDD = sc.parallelize(1 to 5).map(x=>("num",x)).groupByKey().map(x=>{
  var ct=0
  var num=0
  x._2.foreach(a=>{
    num=a+num
    ct+=1
  })
  (x._1,num/ct)
})
numRDD.collect()

// Method 2: first use map to pair each value with a count of 1, then sum values and counts with reduceByKey, and finally compute the average
sc.parallelize(1 to 5).map(x=>("num",x)).map(x=>(x._1,(x._2,1))).reduceByKey((a,b)=>{
  (a._1+b._1,a._2+b._2)
}).map(x=>(x._1,x._2._1/x._2._2)).collect()

sortByKey([ascending], [numPartitions])

Sorts the RDD by key, returning an RDD in which the data within each partition is sorted; collecting everything back to the driver program yields a globally ordered result.
Pass ascending=false to sort in descending order.

/**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)]

For example:

scala> val path = "/user/root/input/words.txt"
path: String = /user/root/input/words.txt
scala> val fileRDD = sc.textFile(path)
fileRDD: org.apache.spark.rdd.RDD[String] = /user/root/input/words.txt MapPartitionsRDD[1] at textFile at <console>:26
scala> fileRDD.flatMap(x=>x.split(" ")).map((_,1)).sortByKey(false).collect().foreach(println)
(spark.2,1)
(spark.2,1)
(spark.1,1)
(redis.4,1)
(redis.4,1)
(redis.3,1)
(redis.2,1)
(redis.1,1)
(redis.1,1)
(flume.4,1)
(flume.3,1)
(flume.3,1)

join(otherDataset, [numPartitions])

Joins two RDDs by key, analogous to a join between two database tables.
There are also leftOuterJoin, rightOuterJoin, and fullOuterJoin.

/**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Performs a hash join across the cluster.
   */
  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

For example:

val path = "/user/root/input/words.txt"
val fileRDD = sc.textFile(path)
val wcRDD1 = fileRDD.flatMap(x=>x.split(" ")).map((_,1))

val strArray = Array("redis.1 redis.2 spark.2 yarn yarn flume.4","yarn hadoop hadoop")
val wcRDD2 = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1))

wcRDD1.collect().foreach(println)
wcRDD2.collect().foreach(println)
wcRDD1.join(wcRDD2).collect().foreach(println)
wcRDD1.leftOuterJoin(wcRDD2).collect().foreach(println)
wcRDD1.rightOuterJoin(wcRDD2).collect().foreach(println)
wcRDD1.fullOuterJoin(wcRDD2).collect().foreach(println)


cogroup(otherDataset, [numPartitions])

When called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable[V], Iterable[W])). This operation is also known as groupWith.

/**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]

For example, building on the join example above:

wcRDD1.cogroup(wcRDD2).collect().foreach(println)


cartesian(otherDataset)

Computes the Cartesian product of two RDDs: RDD[T] cartesian RDD[U] returns RDD[(T, U)].

/**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

For example:

val numRDD1 = sc.parallelize(1 to 2)
val numRDD2 = sc.parallelize(3 to 5)
numRDD1.cartesian(numRDD2).collect().foreach(println)

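The result contains the 2 × 3 = 6 pairs (1,3), (1,4), (1,5), (2,3), (2,4), (2,5); the order returned by collect() depends on how the partitions are combined.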

pipe(command, [envVars])

Calls an external program for each partition of the RDD. With pipe(), the elements of the RDD are written as strings to the external process's standard input, the process performs whatever work is needed, and the strings it writes to standard output become the elements of the resulting RDD. This makes it possible to cooperate with shell scripts, Python, or other languages to complete a computation.

/**
   * Return an RDD created by piping elements to a forked external process.
   */
  def pipe(command: String, env: Map[String, String]): RDD[String]

For example, borrowing the classic Hadoop Streaming reducer, we call a Python reduce.py script to perform the reduce step. Because the RDD's partitions are distributed across Executor processes on different machines, the script must exist on every machine. We also pass in environment variables:

#!/usr/bin/env python
# Hadoop Streaming-style reducer: assumes the input lines arrive grouped by word.
import sys
import os

# A dict storing <word, count> is not needed at all; it would waste a lot of memory
word2count = {}
# Temporary variables hold the current key (word) and its running count
word = ""
count = 0
started = 0
# Read the data line by line from standard input
for line in sys.stdin:
    # Strip leading/trailing whitespace
    line = line.strip()
    # Split into the word and its count
    newword, newcount = line.split()
    if word != newword:
        if started == 1:
            # Emit the count of the previous word
            print "{}\t{}".format(word, count)
        word = newword
        count = int(newcount)
        started = 1
    else:
        count = count + int(newcount)
    # word2count[word] = word2count.get(word, 0) + int(count)

print "{}\t{}".format(word, count)
print "{}\t{}".format(os.getenv("red"), 1)
print "{}\t{}".format(os.getenv("azure"), 1)
# Do not emit the results this way: it would lose the ordering of the map output
# for word, count in word2count.items():
#     print "{}\t{}".format(word, count)

val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
val rdd = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map(x=>x+"\t"+1)
val colors = Map("red" -> "#FF0000", "azure" -> "#F0FFFF")
rdd.pipe("/tmp/reduce.py",colors).collect().foreach(println)


coalesce(numPartitions)

Reduces the number of partitions in the RDD to the given numPartitions. This is typically useful after filtering down a large dataset, where running coalesce afterwards makes the remaining work more efficient.
It results in a narrow dependency: if you go from 1000 partitions down to 100, there is no shuffle; instead each of the 100 new partitions claims 10 of the current partitions. If a larger number of partitions is requested, the current number of partitions is kept.
However, if you do a drastic coalesce, e.g. to numPartitions = 1, the computation may end up running on fewer nodes than you would like. To avoid this you can pass shuffle = true; this adds a shuffle step, but it means the current upstream partitions are executed in parallel.
Note: with shuffle = true you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, some of which may be abnormally large; calling coalesce(1000, shuffle = true) redistributes the data into 1000 partitions using a hash partitioner. The optional partition coalescer, if provided, must be serializable.

/**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @note With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T]

For example:

scala> rdd.partitions.length
res15: Int = 2

scala> rdd.coalesce(1).partitions.length
res16: Int = 1

scala> rdd.coalesce(3).partitions.length
res17: Int = 2

scala> rdd.coalesce(3,true).partitions.length
res18: Int = 3

repartition(numPartitions)

Randomly reshuffles the data in the RDD to create either more or fewer partitions and balances the data across them. This always shuffles all data over the network, i.e. it always incurs a shuffle. If you only need to decrease the number of partitions, consider coalesce above.

/**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   *
   * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

For example:

scala> rdd.repartition(1).partitions.length
res20: Int = 1

repartitionAndSortWithinPartitions(partitioner)

Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts the records by key, so order is guaranteed within each partition.
If you want to repartition and then sort within each partition, calling repartitionAndSortWithinPartitions(partitioner) is more efficient than repartitioning and then sorting, because the sorting is pushed down into the shuffle machinery.

/**
   * Repartition the RDD according to the given partitioner and, within each resulting partition,
   * sort records by their keys.
   *
   * This is more efficient than calling `repartition` and then sorting within each partition
   * because it can push the sorting down into the shuffle machinery.
   */
  def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)]

For example:

val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
val rdd = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1))
val newRDD = rdd.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(3))

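To see the effect, one can print each partition's contents; a minimal sketch, assuming the newRDD built above:

newRDD.mapPartitionsWithIndex((index, iter) =>
  iter.map { case (k, v) => s"partition $index: ($k, $v)" }
).collect().foreach(println)

Each key lands in the partition chosen by the HashPartitioner, and within every partition the keys come out in sorted order.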

Action

reduce(func)

Aggregates the elements of the RDD using the function func, which takes two arguments and returns one. The function should be commutative and associative so that it can be computed correctly in parallel.

/**
   * Reduces the elements of this RDD using the specified commutative and
   * associative binary operator.
   */
  def reduce(f: (T, T) => T): T

For example, summation:

scala> sc.parallelize(1 to 10).reduce(_+_)
res34: Int = 55

collect()

Returns all elements of the RDD to the driver program as an array. This is best used only after filtering out a large amount of data, and only when the result is known to be small, because the whole result set is loaded into the driver program's memory.

/**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T]

For example:

scala> sc.parallelize(1 to 10).collect()
res35: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

count()

Returns the number of elements in the RDD.

/**
   * Return the number of elements in the RDD.
   */
  def count(): Long

For example, counting the number of lines in a file:

scala> sc.textFile("/user/root/input/words.txt").count()
res36: Long = 3

scala> sc.textFile("/user/root/input/words.txt").collect().foreach(println)
redis.1 redis.1 redis.2 redis.3 redis.4 redis.4
spark.1 spark.2 spark.2
flume.3 flume.3 flume.4

first()

Returns the first element of the RDD; equivalent to take(1).

/**
   * Return the first element in this RDD.
   */
  def first(): T

take(n)

Returns the first n elements of the RDD. It scans one partition at a time until enough elements have been collected.

/**
   * Take the first num elements of the RDD. It works by first scanning one partition, and use the
   * results from that partition to estimate the number of additional partitions needed to satisfy
   * the limit.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @note Due to complications in the internal implementation, this method will raise
   * an exception if called on an RDD of `Nothing` or `Null`.
   */
  def take(num: Int): Array[T]

For example:

scala> val numRDD = sc.parallelize(1 to 10,3)
numRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24
scala> numRDD.take(4)
res39: Array[Int] = Array(1, 2, 3, 4)

scala> numRDD.take(1)
res40: Array[Int] = Array(1)

scala> numRDD.first()
res41: Int = 1

takeSample(withReplacement, num, [seed])

Returns an array with a random sample of num elements of the dataset, with or without replacement, optionally using a pre-specified random number generator seed.

/**
   * Return a fixed-size sampled subset of this RDD in an array
   *
   * @param withReplacement whether sampling is done with replacement
   * @param num size of the returned sample
   * @param seed seed for the random number generator
   * @return sample of specified size in an array
   *
   * @note this method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def takeSample(
      withReplacement: Boolean,
      num: Int,
      seed: Long = Utils.random.nextLong): Array[T]

For example:

scala> numRDD.takeSample(true,6)
res42: Array[Int] = Array(2, 9, 9, 10, 2, 7)

scala> numRDD.takeSample(false,6)
res43: Array[Int] = Array(5, 9, 6, 10, 3, 8)

takeOrdered(n, [ordering])

Returns the first n elements of the RDD using either their natural ordering or a custom comparator.

/**
   * Returns the first k (smallest) elements from this RDD as defined by the specified
   * implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].
   * For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)
   *   // returns Array(2)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
   *   // returns Array(2, 3)
   * }}}
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @param num k, the number of elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]

For example:

scala> numRDD.takeOrdered(3)
res45: Array[Int] = Array(1, 2, 3)

top(n, [ordering])

Similar to takeOrdered; in fact it is implemented on top of it as takeOrdered(num)(ord.reverse).

/**
   * Returns the top k (largest) elements from this RDD as defined by the specified
   * implicit Ordering[T] and maintains the ordering. This does the opposite of
   * [[takeOrdered]]. For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
   *   // returns Array(12)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
   *   // returns Array(6, 5)
   * }}}
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @param num k, the number of top elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def top(num: Int)(implicit ord: Ordering[T]): Array[T]

For example:

scala> numRDD.top(2)
res46: Array[Int] = Array(10, 9)

saveAsTextFile(path)

Writes the elements of the RDD as a text file (or set of text files) to a given directory on the local filesystem, HDFS, or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.

/**
   * Save this RDD as a text file, using string representations of elements.
   */
  def saveAsTextFile(path: String): Unit

For example:

val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
val rdd = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1)).reduceByKey(_+_)
rdd.saveAsTextFile("/tmp/rdd.5")


saveAsSequenceFile(path)

Only available on key-value RDDs. Writes the RDD out as a Hadoop SequenceFile.

/**
   * Output the RDD as a Hadoop SequenceFile using the Writable types we infer from the RDD's key
   * and value types. If the key or value are Writable, then we use their classes directly;
   * otherwise we map primitive types such as Int and Double to IntWritable, DoubleWritable, etc,
   * byte arrays to BytesWritable, and Strings to Text. The `path` can be on any Hadoop-supported
   * file system.
   */
  def saveAsSequenceFile(
      path: String,
      codec: Option[Class[_ <: CompressionCodec]] = None): Unit

For example, save the RDD as a Hadoop SequenceFile with saveAsSequenceFile, then read it back with sequenceFile:

rdd.saveAsSequenceFile("/tmp/rdd.6")
scala> sc.sequenceFile[String,Int]("/tmp/rdd.6")
res59: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[77] at sequenceFile at <console>:26
scala> res59.collect()
res60: Array[(String, Int)] = Array((yarn,3), (spark,1), (hadoop,2), (redis,2))


saveAsObjectFile(path)

Writes the elements of the RDD using Java serialization; the data can be loaded back with SparkContext.objectFile().

/**
   * Save this RDD as a SequenceFile of serialized objects.
   */
  def saveAsObjectFile(path: String): Unit

For example, write the RDD out as serialized objects and load it back with SparkContext.objectFile():

scala> import scala.Tuple2
scala> rdd.saveAsObjectFile("/tmp/rdd.7")
scala> sc.objectFile[Tuple2[String,Int]]("/tmp/rdd.7").collect()
res66: Array[(String, Int)] = Array((yarn,3), (spark,1), (hadoop,2), (redis,2))

countByKey()

Only available on RDDs of key-value pairs. Counts the number of elements for each key and returns the result to the driver program as a local Map.
If the result would be very large, consider rdd.mapValues(_ => 1L).reduceByKey(_ + _) instead, which returns an RDD; countByKey itself is implemented exactly this way, as the source below shows.

/**
   * Count the number of elements for each key, collecting the results to a local Map.
   *
   * @note This method should only be used if the resulting map is expected to be small, as
   * the whole thing is loaded into the driver's memory.
   * To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
   * returns an RDD[T, Long] instead of a map.
   */
  def countByKey(): Map[K, Long] = self.withScope {
    self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
  }

For example:

scala> val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
strArray: Array[String] = Array(redis redis spark yarn yarn, yarn hadoop hadoop)

scala> val rdd = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[88] at map at <console>:27

scala> rdd.countByKey()
res68: scala.collection.Map[String,Long] = Map(yarn -> 3, spark -> 1, hadoop -> 2, redis -> 2)
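When there are many distinct keys, the counts can be kept distributed instead of being collected into a local Map. A minimal sketch, reusing the rdd from the example above:

// Keep the per-key counts as an RDD rather than a local Map on the driver
val countsRDD = rdd.mapValues(_ => 1L).reduceByKey(_ + _)   // RDD[(String, Long)]
countsRDD.collect().foreach(println)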

foreach(func)

Runs the function func on each element of the RDD. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators outside of foreach() may result in undefined behavior. See Understanding closures for more details.
foreach and foreachPartition are almost identical in their implementation.

/**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

For example. Note that nothing is printed below: this is not because the RDD is empty, but because the output is printed on the executor nodes:

scala> rdd.foreach(println)

scala>

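Since foreach is typically paired with an Accumulator, here is a minimal sketch of that pattern, assuming an existing SparkContext named sc and Spark 2.x's longAccumulator API:

val sumAcc = sc.longAccumulator("sum")
sc.parallelize(1 to 10).foreach(x => sumAcc.add(x))   // add() runs on the executors
println(sumAcc.value)                                 // 55, read back on the driver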

Asynchronous versions of the Action operators

The Spark RDD API also exposes asynchronous versions of some actions, such as foreachAsync for foreach, which immediately returns a FutureAction to the caller instead of blocking until the action completes. This can be used to manage or wait on the asynchronous execution of the action.
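
A minimal sketch, assuming an existing SparkContext named sc; foreachAsync comes from AsyncRDDActions, which is available on any RDD through an implicit conversion:

import scala.concurrent.Await
import scala.concurrent.duration.Duration

val future = sc.parallelize(1 to 5).foreachAsync(x => println(x))   // returns a FutureAction immediately
// ... other work can run here while the Spark job executes ...
Await.result(future, Duration.Inf)   // block only when the job really must have finished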
