sample(withReplacement, fraction, seed)
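Returns a random sample containing approximately the fraction given by fraction of the data, using the specified random seed. withReplacement specifies whether a sampled element is put back: true samples with replacement (an element can be picked more than once), false samples without replacement. seed sets the seed of the random number generator.
Example: randomly sample about 50% of the data from an RDD with replacement, using seed 3. The seed is only the initial state of the random number generator (it does not mean the sample starts from one of the values 1, 2, 3); the same seed on the same RDD with the same partitioning reproduces the same sample. sample is mainly used to inspect the distribution of a large dataset.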
Source code:
/**
 * Return a sampled subset of this RDD.
 *
 * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
 * @param fraction expected size of the sample as a fraction of this RDD's size
 *   without replacement: probability that each element is chosen; fraction must be [0, 1]
 *   with replacement: expected number of times each element is chosen; fraction must be >= 0
 * @param seed seed for the random number generator
 */
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = withScope {
  require(fraction >= 0.0, "Negative fraction value: " + fraction)
  if (withReplacement) {
    new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
  } else {
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
  }
}
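The source above picks a Poisson sampler for sampling with replacement and a Bernoulli sampler for sampling without replacement. The following is a minimal plain-Scala sketch of those two ideas, not Spark's actual implementation: the names SamplingSketch, bernoulliSample, poissonSample and nextPoisson are hypothetical, it uses scala.util.Random instead of Spark's own generator, and it ignores partitions, so its output will not match what Spark returns for the same seed.

```scala
import scala.util.Random

// Hypothetical helper object; a sketch of the two sampling strategies only.
object SamplingSketch {

  // Without replacement: keep each element independently with probability `fraction`.
  def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0 && fraction <= 1.0, s"fraction must be in [0, 1]: $fraction")
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  // Draw one Poisson(mean) value with Knuth's multiplication method.
  // Adequate for the small means used here; an assumption of this sketch.
  def nextPoisson(rng: Random, mean: Double): Int = {
    val limit = math.exp(-mean)
    var count = 0
    var product = rng.nextDouble()
    while (product > limit) {
      count += 1
      product *= rng.nextDouble()
    }
    count
  }

  // With replacement: emit each element k times, where k is drawn from Poisson(fraction).
  def poissonSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0, s"Negative fraction value: $fraction")
    val rng = new Random(seed)
    data.flatMap(x => Seq.fill(nextPoisson(rng, fraction))(x))
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 10
    // The sample sizes vary with the seed; only their expected values are fixed.
    println(bernoulliSample(data, 0.2, seed = 3L))
    println(poissonSample(data, 0.4, seed = 2L))
  }
}
```

In both functions the result size is a random quantity whose expected value is data.size × fraction, and calling either function twice with the same seed returns the same sample.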
Example code:
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
scala> rdd.collect()
res11: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
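scala> val sample1 = rdd.sample(true, 0.4, 2).collect
sample1: Array[Int] = Array(1, 2, 2, 7, 7, 8, 9)
Why were 7 elements sampled rather than 4? With replacement, fraction is the expected number of times each element is chosen, so 10 × 0.4 = 4 is only the expected sample size. The actual size is random and changes with the seed, so 7 is a legitimate result.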
scala> val sample2 = rdd.sample(false, 0.2, 3).collect
sample2: Array[Int] = Array(1, 9)
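To put a number on how unusual a 7-element result is for sample(true, 0.4, 2) on 10 elements, the sketch below assumes each element is chosen a Poisson(fraction) number of times, independently of the others, so the total sample size follows a Poisson distribution with mean 10 × 0.4 = 4. The object SampleSizeOdds and the function poissonPmf are hypothetical names used only for this illustration.

```scala
// Hypothetical helper; computes Poisson probabilities for the total sample size.
object SampleSizeOdds {

  // P(X = k) for X ~ Poisson(mean): e^(-mean) * mean^k / k!
  // Computed in log space to avoid overflowing the factorial.
  def poissonPmf(mean: Double, k: Int): Double = {
    require(mean > 0.0 && k >= 0, "mean must be positive and k non-negative")
    val logFactorial = (1 to k).map(i => math.log(i.toDouble)).sum
    math.exp(-mean + k * math.log(mean) - logFactorial)
  }

  def main(args: Array[String]): Unit = {
    val elementCount = 10
    val fraction = 0.4
    val mean = elementCount * fraction // expected sample size: 4.0
    (0 to 10).foreach { k =>
      println(f"P(size = $k%2d) = ${poissonPmf(mean, k)}%.4f")
    }
  }
}
```

Under that assumption a sample of exactly 4 elements has a probability of roughly 20%, and a sample of 7 elements roughly 6%, so the transcript's result is uncommon but not surprising. When an exact sample size is required, RDD.takeSample(withReplacement, num, seed) returns exactly num elements as a local array.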