Background:
Count the number of times each pair of adjacent words occurs.
val s="A;B;C;D;B;D;C;B;D;A;E;D;C;A;B"
s: String = A;B;C;D;B;D;C;B;D;A;E;D;C;A;B
val data=sc.parallelize(Seq(s))
data.collect()
res0: Array[String] = Array(A;B;C;D;B;D;C;B;D;A;E;D;C;A;B)
Up to this point we have an Array[String] whose single element is the whole input string.
val mapTemp=data.map(_.split(";"))
scala> mapTemp.collect
res4: Array[Array[String]] = Array(Array(A, B, C, D, B, D, C, B, D, A, E, D, C, A, B))
A map operation emits exactly one output element for every input element, so the structure before and after is the same: the one-element RDD of String becomes a one-element RDD of Array[String].
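The one-to-one behaviour of map can be seen with plain Scala collections, no Spark required (a local sketch, not part of the original session):

```scala
// map emits exactly one output element per input element, so a
// one-element list stays a one-element list; only the element
// type changes, from String to Array[String].
val local = List("A;B;C;D;B;D;C;B;D;A;E;D;C;A;B")
val mapped = local.map(_.split(";"))

assert(mapped.length == 1)       // still a single element
assert(mapped.head.length == 15) // the split produced 15 words
```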
val mapRs=data.map(_.split(";")).map(x=>{for(i<-0 until x.length-1) yield (x(i)+","+x(i+1),1)})
mapRs.collect
res1: Array[scala.collection.immutable.IndexedSeq[(String, Int)]] = Array(Vector((A,B,1), (B,C,1), (C,D,1), (D,B,1), (B,D,1), (D,C,1), (C,B,1), (B,D,1), (D,A,1), (A,E,1), (E,D,1), (D,C,1), (C,A,1), (A,B,1)))
val flatMapRs=data.map(_.split(";")).flatMap(x=>{for(i<-0 until x.length-1) yield (x(i)+","+x(i+1),1)})
flatMapRs.collect
res3: Array[(String, Int)] = Array((A,B,1), (B,C,1), (C,D,1), (D,B,1), (B,D,1), (D,C,1), (C,B,1), (B,D,1), (D,A,1), (A,E,1), (E,D,1), (D,C,1), (C,A,1), (A,B,1))
flatMap, by contrast, flattens the collection produced for each element and emits its members individually as top-level elements of the result — here each Vector of pairs is traversed and its tuples laid out one by one.
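The flattening step can likewise be sketched locally. As an aside, `sliding(2)` is an idiomatic alternative to the index-based for/yield for producing adjacent pairs (a local sketch with plain collections, not from the original session):

```scala
val words = "A;B;C;D;B;D;C;B;D;A;E;D;C;A;B".split(";").toList

// flatMap concatenates the per-element collections into one flat list
val pairs = List(words).flatMap(x =>
  for (i <- 0 until x.length - 1) yield (x(i) + "," + x(i + 1), 1))

// sliding(2) yields the same adjacent pairs without index arithmetic
val pairs2 = words.sliding(2).map { case List(a, b) => (a + "," + b, 1) }.toList

assert(pairs == pairs2)
assert(pairs.length == 14)       // 15 words -> 14 adjacent pairs
assert(pairs.head == ("A,B", 1))
```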
data.map(_.split(";")).flatMap(x=>{for(i<-0 until x.length-1) yield (x(i)+","+x(i+1),1)}).reduceByKey(_+_).foreach(println)
(A,E,1)
(C,D,1)
(B,D,2)
(D,B,1)
(C,A,1)
(C,B,1)
(E,D,1)
(D,A,1)
(B,C,1)
(D,C,2)
(A,B,2)
Notes on the reduceByKey operator:
- reduceByKey works only on RDDs whose elements are key-value pairs (i.e. RDDs of tuples). It is a transformation, which means it is lazily evaluated: nothing is computed until an action is invoked. We pass an associative function as a parameter (here _+_), which is applied to the source RDD to create a new RDD of (key, reduced value) pairs. It is a wide operation, because data shuffling may happen across partitions.
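The per-key aggregation that reduceByKey performs can be mimicked locally with groupBy plus a reduce, which makes the role of the (_+_) function concrete (a local sketch; on a real cluster reduceByKey also pre-combines values within each partition before the shuffle):

```scala
// Build the same adjacent-pair list as the Spark job above
val pairs = "A;B;C;D;B;D;C;B;D;A;E;D;C;A;B".split(";")
  .sliding(2).map { case Array(a, b) => (a + "," + b, 1) }.toList

// Group by key, then reduce each group's values with _+_
val counts = pairs.groupBy(_._1).map { case (k, vs) =>
  (k, vs.map(_._2).reduce(_ + _))
}

// Matches the Spark output shown above
assert(counts("A,B") == 2)
assert(counts("B,D") == 2)
assert(counts("D,C") == 2)
assert(counts("A,E") == 1)
```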