1. pyspark version
2.3.0
2. Official documentation
takeSample(withReplacement, num, seed=None)
Return a fixed-size sampled subset of this RDD.
Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
>>> rdd = sc.parallelize(range(0, 10))
>>> len(rdd.takeSample(True, 20, 1))
20
>>> len(rdd.takeSample(False, 5, 2))
5
>>> len(rdd.takeSample(False, 15, 3))
10
3. My code
from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local").setAppName("takeSample")
sc = SparkContext(conf=conf)

lines = sc.parallelize([1, 2, 3, 4, 5])
# withReplacement=True: sampled elements may repeat
# withReplacement=False: each element appears at most once
rdd = lines.takeSample(True, 3)
print('rdd= ', rdd)

# rdd = lines.takeSample(False, 3)
# rdd=  [5, 2, 3]

Output:
rdd=  [2, 1, 2]

Note that takeSample returns a plain Python list to the driver, not a new RDD.
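The with/without-replacement behavior seen above can be sketched in plain Python. This is only a minimal analogy using the standard `random` module (the variable names are illustrative, and Spark's actual sampling implementation is different), but it shows why `True` can yield duplicates while `False` caps the result at the dataset size:

```python
import random

data = [1, 2, 3, 4, 5]

# withReplacement=True analogue: each draw is independent,
# so the same element can appear more than once
random.seed(1)
with_repl = [random.choice(data) for _ in range(3)]

# withReplacement=False analogue: each element is drawn at most once;
# asking for more than the dataset holds is capped at its size
k = 15
without_repl = random.sample(data, min(k, len(data)))

print(len(with_repl))     # 3
print(len(without_repl))  # 5, capped at the dataset size
```

This mirrors the official doctest where `takeSample(False, 15, 3)` on a 10-element RDD returns only 10 elements.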
4. notebook