Common Spark Transformation Operators (Part 3)

`Initializing the data`

// Assumes `sc` is an existing SparkContext (for example, the one provided by spark-shell).
import org.apache.spark.rdd.RDD

println("======================= Original data ===========================")
val data1: RDD[Int] = sc.parallelize(1 to 10, 3)
println(s"Original data: ${data1.collect.toBuffer}")
val data2: RDD[Int] = sc.parallelize(5 to 15, 2)
println(s"Original data: ${data2.collect.toBuffer}")
val data3: RDD[Int] = sc.parallelize(List(1, 2, 3, 4, 5, 5, 4, 3, 2, 1))
println(s"Original data: ${data3.collect.toBuffer}")


`distinct`

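Removes duplicate elements. An RDD may contain repeated elements, and distinct returns a new RDD without them. Because it shuffles the data, it does not preserve the original element order and is an expensive operation.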

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
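// First implementation: takes numPartitions, the number of partitions of the result.
// reduceByKey hash-partitions the elements, so an Int element lands in the partition given by
// its remainder modulo numPartitions. collect reads the partitions in order, so elements with
// remainder 0 come first, then remainder 1, and so on. The order inside a partition is not guaranteed.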
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
// Second implementation: calls the first one with the default argument, the current number of partitions of this RDD
def distinct(): RDD[T] = withScope {
  distinct(partitions.length)
}

Scala version

println("======================= distinct-1 ===========================")
// Without numPartitions, the result keeps the number of partitions of the source RDD
val value1: RDD[Int] = data3.distinct()
println(s"Data after distinct: ${value1.collect.toBuffer}")

println("======================= distinct-2 ===========================")
// Elements are hash-partitioned into numPartitions partitions: elements whose remainder modulo
// numPartitions is 0 come first in the collected output, then remainder 1, and so on.
// The order inside each partition is not guaranteed.
val value2: RDD[Int] = data3.distinct(2)
println(s"Data after distinct: ${value2.collect.toBuffer}")

// Returned result
// (4, 2, 1, 3, 5)
// 4, 2 ==> remainder 0
// 1, 3, 5 ==> remainder 1
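The grouping by remainder seen in the result above can be reproduced without Spark. The following is a minimal sketch in plain Scala collections, not Spark code: `DistinctSketch`, `partitionOf`, and `distinctByPartition` are hypothetical names, and the sketch assumes Int keys, whose `hashCode` is the value itself. It models only which partition each distinct element ends up in, not the order inside a partition.

```scala
object DistinctSketch {
  // Mirrors hash partitioning for Int keys: the non-negative remainder of hashCode.
  def partitionOf(key: Int, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Model of map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1):
  // every distinct key lands in exactly one partition, chosen by its hash.
  def distinctByPartition(data: Seq[Int], numPartitions: Int): Vector[Set[Int]] =
    Vector.tabulate(numPartitions) { p =>
      data.filter(x => partitionOf(x, numPartitions) == p).toSet
    }

  def main(args: Array[String]): Unit = {
    val data3 = List(1, 2, 3, 4, 5, 5, 4, 3, 2, 1)
    distinctByPartition(data3, 2).zipWithIndex.foreach { case (elems, p) =>
      println(s"partition $p: $elems")
    }
  }
}
```

With two partitions, the even elements fall into partition 0 and the odd ones into partition 1, which is why the collected output lists 4 and 2 before 1, 3, and 5.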


`union`

Merges two RDDs without removing duplicates.

/**
 * Return the union of this RDD and another one. Any identical elements will appear multiple
 * times (use `.distinct()` to eliminate them).
 */
// Returns the union of this RDD and another one: no deduplication, the two RDDs are concatenated in order
def union(other: RDD[T]): RDD[T] = withScope {
  sc.union(this, other)
}

Scala version

println("======================= union ===========================")
val value: RDD[Int] = data1.union(data2)
println(s"Data after union: ${value.collect.toBuffer}")
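As a rough illustration of "concatenated in order, no deduplication", the sketch below models an RDD as an ordered list of partitions in plain Scala. `UnionSketch` and the `Partitions` alias are hypothetical, and the partition layouts in `main` are illustrative rather than taken from a Spark run. It assumes neither side has a partitioner, which is the case for RDDs created with `parallelize`.

```scala
object UnionSketch {
  // An RDD modelled as an ordered list of partitions (an assumption of this sketch).
  type Partitions = Vector[Vector[Int]]

  // Union without a shared partitioner: append the partition lists, touch no element.
  def union(left: Partitions, right: Partitions): Partitions = left ++ right

  def main(args: Array[String]): Unit = {
    val data1: Partitions = Vector(Vector(1, 2, 3), Vector(4, 5, 6), Vector(7, 8, 9, 10))
    val data2: Partitions = Vector(Vector(5, 6, 7, 8, 9), Vector(10, 11, 12, 13, 14, 15))
    val merged = union(data1, data2)
    println(s"number of partitions: ${merged.size}")
    println(s"elements: ${merged.flatten}")
  }
}
```

In this model the partition counts add up and the elements shared by both inputs appear twice, which is why a `distinct()` call is needed afterwards when set semantics are wanted.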


`intersection`

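Computes the intersection of two RDDs and removes duplicates. The result order is not guaranteed, and the operation is expensive because it shuffles the data.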

/**
 * Return the intersection of this RDD and another one. The output will not contain any duplicate
 * elements, even if the input RDDs did.
 *
 * @note This method performs a shuffle internally.
 */
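// First implementation: one parameter; returns the intersection of this RDD and another one, without duplicate elements
// The result is again grouped by partition, with no guaranteed order inside a partition. cogroup picks the default
// partitioner, so when neither RDD has a partitioner and spark.default.parallelism is unset, the partition count
// is the larger of the two RDDs' partition counts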
def intersection(other: RDD[T]): RDD[T] = withScope {
  this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
      .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
      .keys
}

/**
 * Return the intersection of this RDD and another one. The output will not contain any duplicate
 * elements, even if the input RDDs did.
 *
 * @note This method performs a shuffle internally.
 *
 * @param partitioner Partitioner to use for the resulting RDD
 */
// Second implementation: two parameters, the other RDD and a partitioner
def intersection(
    other: RDD[T],
    partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  this.map(v => (v, null)).cogroup(other.map(v => (v, null)), partitioner)
      .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
      .keys
}

/**
 * Return the intersection of this RDD and another one. The output will not contain any duplicate
 * elements, even if the input RDDs did.  Performs a hash partition across the cluster
 *
 * @note This method performs a shuffle internally.
 *
 * @param numPartitions How many partitions to use in the resulting RDD
 */
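// Third implementation: two parameters, the second being numPartitions. It calls the second implementation with
// HashPartitioner(numPartitions), so the result follows the same ordering rule as distinct: grouped by partition,
// unordered inside each partition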
def intersection(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
  intersection(other, new HashPartitioner(numPartitions))
}

Scala version

import org.apache.spark.HashPartitioner

println("======================= intersection-1 ===========================")
val value1: RDD[Int] = data1.intersection(data2)
println(s"Number of partitions: ${value1.getNumPartitions}")
println(s"Data after intersection: ${value1.collect.toBuffer}")

println("======================= intersection-2 ===========================")
val value2: RDD[Int] = data1.intersection(data2, new HashPartitioner(4))
println(s"Number of partitions: ${value2.getNumPartitions}")
println(s"Data after intersection: ${value2.collect.toBuffer}")

println("======================= intersection-3 ===========================")
val value3: RDD[Int] = data1.intersection(data2, 5)
println(s"Number of partitions: ${value3.getNumPartitions}")
println(s"Data after intersection: ${value3.collect.toBuffer}")
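The cogroup, filter, keys pipeline from the source listing can be followed step by step on ordinary collections. This is a minimal sketch in plain Scala, not Spark code: `IntersectionSketch` and its `cogroup` are hypothetical stand-ins, and `()` replaces the `null` dummy value that the Spark source pairs with each element.

```scala
object IntersectionSketch {
  // Model of cogroup: for every key seen on either side, collect that key's values per side.
  def cogroup[K, V](left: Seq[(K, V)], right: Seq[(K, V)]): Map[K, (Seq[V], Seq[V])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val fromLeft = left.filter(_._1 == k).map(_._2)
      val fromRight = right.filter(_._1 == k).map(_._2)
      k -> (fromLeft, fromRight)
    }.toMap
  }

  // Keep only the keys that have at least one value on both sides.
  def intersection[T](a: Seq[T], b: Seq[T]): Set[T] =
    cogroup(a.map(v => (v, ())), b.map(v => (v, ())))
      .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
      .keySet

  def main(args: Array[String]): Unit = {
    println(intersection(1 to 10, 5 to 15))
    println(intersection(Seq(1, 1, 2), Seq(1, 1, 3)))
  }
}
```

Because each element becomes a key, duplicates collapse into a single group, which is why the output contains no repeated elements even when both inputs do.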


`subtract`

RDD1.subtract(RDD2) returns the elements that appear in RDD1 but not in RDD2.

/**
 * Return an RDD with the elements from `this` that are not in `other`.
 *
 * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
 * RDD will be <= us.
 */
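// First implementation: one parameter; calls the third implementation
// The result is again grouped by partition, with no guaranteed order inside a partition. It reuses this RDD's
// partitioner, or a HashPartitioner with this RDD's partition count, so the result has as many partitions as this RDD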
def subtract(other: RDD[T]): RDD[T] = withScope {
  subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
}

/**
 * Return an RDD with the elements from `this` that are not in `other`.
 */
// Second implementation: calls the third implementation with HashPartitioner(numPartitions); the result follows the same ordering rule as distinct
def subtract(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
  subtract(other, new HashPartitioner(numPartitions))
}

/**
 * Return an RDD with the elements from `this` that are not in `other`.
 */
// Third implementation: two parameters, the second being a partitioner
def subtract(
    other: RDD[T],
    p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  if (partitioner == Some(p)) {
    // Our partitioner knows how to handle T (which, since we have a partitioner, is
    // really (K, V)) so make a new Partitioner that will de-tuple our fake tuples
    val p2 = new Partitioner() {
      override def numPartitions: Int = p.numPartitions
      override def getPartition(k: Any): Int = p.getPartition(k.asInstanceOf[(Any, _)]._1)
    }
    // Unfortunately, since we're making a new p2, we'll get ShuffleDependencies
    // anyway, and when calling .keys, will not have a partitioner set, even though
    // the SubtractedRDD will, thanks to p2's de-tupled partitioning, already be
    // partitioned by the right/real keys (e.g. p).
    this.map(x => (x, null)).subtractByKey(other.map((_, null)), p2).keys
  } else {
    this.map(x => (x, null)).subtractByKey(other.map((_, null)), p).keys
  }
}

Scala version

import org.apache.spark.HashPartitioner

println("======================= subtract-1 ===========================")
val value1: RDD[Int] = data1.subtract(data2)
println(s"Number of partitions: ${value1.getNumPartitions}")
println(s"Data after subtract: ${value1.collect.toBuffer}")

println("======================= subtract-2 ===========================")
val value2: RDD[Int] = data1.subtract(data2, new HashPartitioner(4))
println(s"Number of partitions: ${value2.getNumPartitions}")
println(s"Data after subtract: ${value2.collect.toBuffer}")

println("======================= subtract-3 ===========================")
val value3: RDD[Int] = data1.subtract(data2, 5)
println(s"Number of partitions: ${value3.getNumPartitions}")
println(s"Data after subtract: ${value3.collect.toBuffer}")
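One detail that is easy to miss is that subtract, unlike intersection, does not deduplicate: every occurrence of a surviving element in the left RDD is kept. The sketch below models that behaviour in plain Scala. `SubtractSketch` is a hypothetical name, and the model preserves the input order for readability, whereas the order of a real Spark result is not guaranteed after the shuffle.

```scala
object SubtractSketch {
  // Keep every left element whose value never occurs on the right; duplicates on the left survive.
  def subtract[T](left: Seq[T], right: Seq[T]): Seq[T] = {
    val exclude = right.toSet
    left.filterNot(exclude)
  }

  def main(args: Array[String]): Unit = {
    println(subtract(1 to 10, 5 to 15))
    println(subtract(Seq(1, 1, 2), Seq(2)))
  }
}
```

The size of the right-hand side affects only which elements are excluded, which matches the source comment that the result can never be larger than the left RDD.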


`cartesian`

Returns the Cartesian product of two RDDs. This operation is very expensive.

/**
 * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
 * elements (a, b) where a is in `this` and b is in `other`.
 */
// The number of partitions is the product of the two RDDs' partition counts
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
  new CartesianRDD(sc, this, other)
}

Scala version

println("======================= cartesian ===========================")
val value1: RDD[(Int, Int)] = data1.cartesian(data2)
println(s"Number of partitions: ${value1.getNumPartitions}")
println(s"Data after cartesian: ${value1.collect.toBuffer}")
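To see why both the partition count and the element count multiply, the following plain-Scala sketch pairs every partition of the left side with every partition of the right side. `CartesianSketch` and the `Partitions` alias are hypothetical, and the partition layouts in `main` are illustrative rather than taken from a Spark run.

```scala
object CartesianSketch {
  // An RDD modelled as an ordered list of partitions (an assumption of this sketch).
  type Partitions[T] = Vector[Vector[T]]

  // One output partition per (left partition, right partition) pair,
  // holding every (a, b) combination drawn from that pair.
  def cartesian[A, B](left: Partitions[A], right: Partitions[B]): Partitions[(A, B)] =
    for (l <- left; r <- right) yield (for (a <- l; b <- r) yield (a, b))

  def main(args: Array[String]): Unit = {
    val data1: Partitions[Int] = Vector(Vector(1, 2, 3), Vector(4, 5, 6), Vector(7, 8, 9, 10))
    val data2: Partitions[Int] = Vector(Vector(5, 6, 7, 8, 9), Vector(10, 11, 12, 13, 14, 15))
    val product = cartesian(data1, data2)
    println(s"number of partitions: ${product.size}")
    println(s"number of pairs: ${product.flatten.size}")
  }
}
```

With 3 and 2 input partitions holding 10 and 11 elements, the model yields 6 partitions and 110 pairs, which shows how quickly the cost grows with input size.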


`sample`

Sampling: draws a subset of the elements of the RDD.

/**
 * Return a sampled subset of this RDD.
 *
 * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
 * @param fraction expected size of the sample as a fraction of this RDD's size
 *  without replacement: probability that each element is chosen; fraction must be [0, 1]
 *  with replacement: expected number of times each element is chosen; fraction must be greater
 *  than or equal to 0
 * @param seed seed for the random number generator
 *
 * @note This is NOT guaranteed to provide exactly the fraction of the count
 * of the given [[RDD]].
 */
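// Returns a sampled subset of this RDD
// withReplacement: whether to sample with replacement
// fraction: if withReplacement is false, the probability that each element is chosen, in [0, 1]
// fraction: if withReplacement is true, the expected number of times each element is chosen, greater than or equal to 0
// seed: seed for the random number generator; usually left at its default unless a reproducible sample is needed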
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0,
    s"Fraction must be nonnegative, but got ${fraction}")

  withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
}

Scala version

println("======================= sample-1 ===========================")
val value1: RDD[Int] = data1.sample(withReplacement = false, 0.5)
println(s"Number of partitions: ${value1.getNumPartitions}")
println(s"Result of sample: ${value1.collect.toBuffer}")

println("======================= sample-2 ===========================")
val data4: RDD[Int] = data1.repartition(2)
val value2: RDD[Int] = data4.sample(withReplacement = false, 0.5)
println(s"Number of partitions: ${value2.getNumPartitions}")
println(s"Result of sample: ${value2.collect.toBuffer}")
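The two meanings of `fraction` can be made concrete with a small model. The sketch below is plain Scala and is not how Spark's BernoulliSampler and PoissonSampler are implemented: `SampleSketch` and its methods are hypothetical, and it uses `scala.util.Random` with Knuth's multiplication method for the Poisson draw. It only illustrates the idea that without replacement each element is kept with probability `fraction`, while with replacement each element is emitted a Poisson-distributed number of times whose mean is `fraction`.

```scala
import scala.util.Random

object SampleSketch {
  // Without replacement: keep each element independently with probability `fraction`.
  def bernoulli[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  // Draw from a Poisson distribution with Knuth's multiplication method (fine for small means).
  private def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var count = 0
    var product = rng.nextDouble()
    while (product > limit) {
      count += 1
      product *= rng.nextDouble()
    }
    count
  }

  // With replacement: emit each element a Poisson(fraction)-distributed number of times.
  def withReplacement[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    data.flatMap(x => Seq.fill(poisson(fraction, rng))(x))
  }

  def main(args: Array[String]): Unit = {
    println(bernoulli(1 to 10, 0.5, seed = 42L))
    println(withReplacement(1 to 10, 1.5, seed = 42L))
  }
}
```

Since every element is decided independently, the sample size varies from run to run and only approximates `fraction` times the input size, which is the point of the `@note` in the source listing. Fixing the seed makes the draw repeatable.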

