RDD
Core concept: Resilient Distributed Dataset
Unlike Map/Reduce, which always works on key-value pairs, a Spark RDD can hold data of any type, much like a table in a database. RDDs are immutable: a transformation returns a brand-new RDD and leaves the original unchanged.
RDDs support two kinds of operations:
- Transformations
map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, coalesce
- Actions
reduce, collect, count, first, take, countByKey, foreach
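The split matters because transformations are lazy and actions trigger the actual work. A minimal sketch in spark-shell (assuming the usual `sc` SparkContext it provides):

```scala
// transformations only record the lineage; no job runs yet
val doubled = sc.parallelize(List(1, 2, 3, 4)).map(_ * 2)
// the action submits a job and pulls the results back to the driver
doubled.collect
//Array[Int] = Array(2, 4, 6, 8)
```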
Classification
Two basic kinds of RDD
- Parallelized collections (Parallelized Collections)
Built by distributing an existing Scala collection:
sc.parallelize(List(1,2,3,4,5,6)).sum()
- Hadoop datasets (Hadoop Datasets)
Any storage system supported by Hadoop works, e.g. local files, HDFS, Cassandra, HBase, Amazon S3, etc., in text format, SequenceFiles, or any other Hadoop InputFormat.
Common methods
- textFile reads a text file into an RDD of lines
- sequenceFile turns a Hadoop SequenceFile into an RDD
- hadoopRDD turns any Hadoop input into an RDD; each HDFS block corresponds to one RDD partition
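A minimal textFile sketch, reusing the word.data HDFS path that also appears in the cache example; substitute whatever HDFS or local file you actually have:

```scala
// each element of the resulting RDD is one line of the file
val lines = sc.textFile("hdfs://localhost:9000/data/word.data")
lines.count   // number of lines in the file
lines.first   // the first line
```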
RDD operations in practice
- Transformations: map / filter / flatMap
val num = sc.parallelize(List(1,2,3,4,5,6,7,8))
num.map(_*2).collect
//res10: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16)
num.filter(_%2==0).collect
//res11: Array[Int] = Array(2, 4, 6, 8)
val num2 = sc.parallelize(List(List(1,2), List(3,4), List(5,6), List(7,8)))
num2.flatMap(x=>x.map(_+1)).collect
//Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9)
- Transformations: union / intersection / distinct
val a = sc.parallelize(List(1,2,2,3,3,4,5))
val b = sc.parallelize(List(3,4,5,6))
a.union(b).collect
//res19: Array[Int] = Array(1, 2, 2, 3, 3, 4, 5, 3, 4, 5, 6)
a.intersection(b).collect
//res21: Array[Int] = Array(3, 4, 5)
a.distinct.collect
//res17: Array[Int] = Array(1, 2, 3, 4, 5)
- Transformations: key-value pairs
val g = sc.parallelize(List(("D",1), ("D",2), ("B",2), ("B",3), ("C",1), ("C",2)))
g.reduceByKey(_+_).collect
//res49: Array[(String, Int)] = Array((B,5), (C,3), (D,3))
g.sortByKey().collect
//res32: Array[(String, Int)] = Array((B,2), (B,3), (C,1), (C,2), (D,1), (D,2))
g.groupByKey().collect
//res50: Array[(String, Iterable[Int])] = Array((B,CompactBuffer(2, 3)), (C,CompactBuffer(1, 2)), (D,CompactBuffer(1, 2)))
val h = sc.parallelize(List(("C",3), ("C",4), ("C",5), ("D",1), ("D",2), ("E",1)))
g.join(h).collect
//res33: Array[(String, (Int, Int))] = Array((C,(1,3)), (C,(1,4)), (C,(1,5)), (C,(2,3)), (C,(2,4)), (C,(2,5)), (D,(1,1)), (D,(1,2)), (D,(2,1)), (D,(2,2)))
//join pairs every value of g with every value of h that shares the same key (a per-key Cartesian product)
g.cogroup(h).collect
//res34: Array[(String, (Iterable[Int], Iterable[Int]))] = Array((B,(CompactBuffer(2, 3),CompactBuffer())), (C,(CompactBuffer(1, 2),CompactBuffer(3, 4, 5))), (D,(CompactBuffer(1, 2),CompactBuffer(1, 2))), (E,(CompactBuffer(),CompactBuffer(1))))
//cogroup first groups each RDD by key on its own, then combines the two groups per key
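aggregateByKey appeared in the transformation list above but was not demonstrated. A sketch on a hypothetical dataset: the zero value and the two functions (a within-partition seqOp, here max, and a cross-partition combOp, here +) are chosen purely for illustration, and the intermediate maxima depend on how the data is partitioned:

```scala
val p = sc.parallelize(List(("A",1), ("A",4), ("A",3), ("B",2)), 2)
// seqOp takes the max per key inside each partition,
// combOp then sums the per-partition maxima
p.aggregateByKey(0)(math.max, _ + _).collect
//with ("A",1),("A",4) in one partition and ("A",3),("B",2) in the other:
//Array((A,7), (B,2)) (key order may vary)
```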
- Actions
Transformations are lazy; a job is only actually created and executed when an action is called.
val c = sc.parallelize(List(1,2,3,4,5,6,7))
c.count
//res51: Long = 7
c.sum
//res52: Double = 28.0
c.reduce(_+_)
//res53: Int = 28
c.foreach(println)
/*
2
4
6
3
5
7
1
*/
//output order is not deterministic: foreach runs in parallel on the executors
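first, take and countByKey from the action list above can be sketched on the same data; countByKey needs a key-value RDD, so a small hypothetical one is built inline:

```scala
c.first
//Int = 1
c.take(3)
//Array[Int] = Array(1, 2, 3)
sc.parallelize(List(("A",1), ("A",2), ("B",1))).countByKey
//counts occurrences per key: Map(A -> 2, B -> 1)
```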
val a = sc.parallelize(List("A","B","C","D"))
a.repartition(1).saveAsTextFile("hdfs://localhost:9000/output/a")
//see also saveAsObjectFile, saveAsSequenceFile, etc.
- cache
One more very important method: Spark prefers to keep working data in memory, and hot data can be explicitly cached there. The cache is itself distributed across the cluster, so the cached dataset can in practice be very large. (Strictly speaking cache is lazy rather than an action: the data is only materialized the first time an action touches it.)
val c = sc.textFile("hdfs://localhost:9000/data/word.data").cache()
c.count
//res0: Long = 5  first access fills the cache, so it is slower
c.count
//res1: Long = 5  second access reads straight from the cache, much faster
At this point the Storage tab of the Spark web UI shows that the data has been cached.
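cache is shorthand for persist with the default MEMORY_ONLY storage level; a sketch of choosing a level explicitly and then releasing the cache, on the same hypothetical file as above:

```scala
import org.apache.spark.storage.StorageLevel
// spill partitions to disk when they do not fit in memory
val d = sc.textFile("hdfs://localhost:9000/data/word.data").persist(StorageLevel.MEMORY_AND_DISK)
d.count        // the first action materializes the cache
d.unpersist()  // drop the cached blocks when no longer needed
```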
Original content; please credit the source when republishing.