[spark] Merging RDDs

Merging two Spark RDDs into one. There are two common approaches: union, which concatenates the elements of both RDDs, and zip, which pairs them up element by element.

scala> val rdd1 = sc.parallelize(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd1.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) 

scala> val rdd2 = sc.parallelize(101 to 110)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> rdd2.collect
res1: Array[Int] = Array(101, 102, 103, 104, 105, 106, 107, 108, 109, 110)

scala> val rdd3=rdd1.union(rdd2)
rdd3: org.apache.spark.rdd.RDD[Int] = UnionRDD[2] at union at <console>:27

scala> rdd3.collect
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110)
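Note that union simply concatenates the two RDDs (partition by partition, with no shuffle) and does not remove duplicates. If deduplication is needed, chain a distinct call. A minimal sketch, continuing the same spark-shell session (rdd6 is a hypothetical name not in the original transcript):

```scala
// union keeps duplicates: unioning rdd1 with itself yields 20 elements.
val rdd6 = rdd1.union(rdd1)
rdd6.count            // 20
rdd6.distinct.count   // 10 -- distinct removes the duplicates
```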

scala> val rdd4=rdd1.zip(rdd2)
rdd4: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedPartitionsRDD2[3] at zip at <console>:27

scala> rdd4.collect
res4: Array[(Int, Int)] = Array((1,101), (2,102), (3,103), (4,104), (5,105), (6,106), (7,107), (8,108), (9,109), (10,110))

scala> val rdd5=rdd4.zip(rdd1)
rdd5: org.apache.spark.rdd.RDD[((Int, Int), Int)] = ZippedPartitionsRDD2[4] at zip at <console>:27

scala> rdd5.collect
res5: Array[((Int, Int), Int)] = Array(((1,101),1), ((2,102),2), ((3,103),3), ((4,104),4), ((5,105),5), ((6,106),6), ((7,107),7), ((8,108),8), ((9,109),9), ((10,110),10))
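Unlike union, zip requires both RDDs to have the same number of partitions and the same number of elements per partition; otherwise the job fails at runtime with a SparkException. Also note that zipping a pair RDD such as rdd4 produces nested tuples like ((1,101),1). If a flat structure is preferred, a map with pattern matching can flatten it; a sketch continuing the session above (flat is a hypothetical name):

```scala
// Flatten the nested ((Int, Int), Int) tuples from rdd5 into triples.
val flat = rdd5.map { case ((a, b), c) => (a, b, c) }
flat.collect   // Array((1,101,1), (2,102,2), ..., (10,110,10))
```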

