Spark Study Notes: Transformation Operations (Part 4)

  • Basic transformation operations

  • Key-value transformation operations

 

Key-Value Transformation Operations

  • cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]

  • cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]

  • cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]

  • cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

  • cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

  • cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

  • cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

  • cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

  • cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

cogroup behaves like a full outer join in SQL: it returns records from both RDDs grouped by key, with an empty collection on whichever side has no match for that key.

scala> var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[26] at makeRDD at <console>:24

scala> var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at makeRDD at <console>:24

scala> var rdd3 = sc.makeRDD(Array(("A", "A"), ("E", "E")), 2)
rdd3: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[28] at makeRDD at <console>:24

scala> rdd1.cogroup(rdd2).collect
res26: Array[(String, (Iterable[String], Iterable[String]))] = Array((B,(CompactBuffer(2),CompactBuffer())), (D,(CompactBuffer(),CompactBuffer(d))), (A,(CompactBuffer(1),CompactBuffer(a))), (C,(CompactBuffer(3),CompactBuffer(c))))

scala> rdd1.cogroup(rdd2, rdd3).collect
res27: Array[(String, (Iterable[String], Iterable[String], Iterable[String]))] = Array((B,(CompactBuffer(2),CompactBuffer(),CompactBuffer())), (D,(CompactBuffer(),CompactBuffer(d),CompactBuffer())), (A,(CompactBuffer(1),CompactBuffer(a),CompactBuffer(A))), (C,(CompactBuffer(3),CompactBuffer(c),CompactBuffer())), (E,(CompactBuffer(),CompactBuffer(),CompactBuffer(E))))
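
The numPartitions and partitioner overloads additionally control how the result is partitioned. A minimal sketch, assuming a HashPartitioner with an arbitrary count of 4:

import org.apache.spark.HashPartitioner

// cogroup with an explicit partitioner; the result RDD adopts it directly
val grouped = rdd1.cogroup(rdd2, new HashPartitioner(4))
grouped.partitions.length   // 4
grouped.partitioner         // Some(<the HashPartitioner passed in>)
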
  • join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

  • join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]

  • join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

  • fullOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]

  • fullOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Option[V], Option[W]))]

  • fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], Option[W]))]

  • leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]

  • leftOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, Option[W]))]

  • leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))]

  • rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]

  • rightOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Option[V], W))]

  • rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], W))]

The join, fullOuterJoin, leftOuterJoin, and rightOuterJoin operations join two RDD[(K, V)]s on equal keys, corresponding to an inner join, full outer join, left outer join, and right outer join respectively. All four are implemented internally via cogroup, as the sketch after the examples below illustrates.

scala> var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[26] at makeRDD at <console>:24

scala> var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at makeRDD at <console>:24

scala> rdd1.join(rdd2).collect
res28: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))

scala> rdd1.leftOuterJoin(rdd2).collect
res29: Array[(String, (String, Option[String]))] = Array((B,(2,None)), (A,(1,Some(a))), (C,(3,Some(c))))

scala> rdd1.rightOuterJoin(rdd2).collect
res30: Array[(String, (Option[String], String))] = Array((D,(None,d)), (A,(Some(1),a)), (C,(Some(3),c)))

scala> rdd1.fullOuterJoin(rdd2)
res31: org.apache.spark.rdd.RDD[(String, (Option[String], Option[String]))] = MapPartitionsRDD[46] at fullOuterJoin at <console>:28

scala> rdd1.fullOuterJoin(rdd2).collect
res32: Array[(String, (Option[String], Option[String]))] = Array((B,(Some(2),None)), (D,(None,Some(d))), (A,(Some(1),Some(a))), (C,(Some(3),Some(c))))
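
To make the point about cogroup concrete, here is a simplified sketch of how leftOuterJoin can be expressed with cogroup and flatMapValues; it mirrors the idea behind Spark's implementation rather than reproducing its exact code:

// For each key, pair every left value with every matching right value,
// or with None when the right side has no values for that key.
val left = rdd1.cogroup(rdd2).flatMapValues { case (vs, ws) =>
  if (ws.isEmpty) vs.map(v => (v, None: Option[String]))
  else for (v <- vs; w <- ws) yield (v, Some(w))
}
left.collect  // the same pairs as rdd1.leftOuterJoin(rdd2) above, ordering aside
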
  • subtractByKey[W](other: RDD[(K, W)]): RDD[(K, V)]

  • subtractByKey[W](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)]

  • subtractByKey[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, V)]

subtractByKey is analogous to subtract, except that it operates on key-value RDD[(K, V)]s: it keeps only the pairs whose keys do not appear in the other RDD.

scala> var rdd1 = sc.makeRDD(Array(("A", "1"), ("B", "2"), ("C", "3")), 2)
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[26] at makeRDD at <console>:24

scala> var rdd2 = sc.makeRDD(Array(("A", "a"), ("C", "c"), ("D", "d")), 2)
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at makeRDD at <console>:24

scala> rdd1.subtractByKey(rdd2).collect
res33: Array[(String, String)] = Array((B,2))

scala> rdd2.subtractByKey(rdd1).collect
res34: Array[(String, String)] = Array((D,d))
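
Note that subtractByKey compares keys only: every pair on the left whose key appears in the other RDD is removed regardless of its value, and duplicate keys are removed together. A minimal sketch using a hypothetical rddDup:

// Both ("A", "x") and ("A", "y") are dropped because key "A" exists in rdd2;
// only the pair keyed "B" survives.
val rddDup = sc.makeRDD(Array(("A", "x"), ("A", "y"), ("B", "2")), 2)
rddDup.subtractByKey(rdd2).collect  // Array((B,2))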

 

 
