1. Operator classification
Broadly speaking, Spark operators fall into two categories:
- Transformation: lazily evaluated. A transformation that produces one RDD from another is not executed immediately; the computation is only triggered when an Action is invoked.
- Action: triggers submission of a Spark job and writes the resulting data out of the Spark system.
More specifically, Spark operators can be grouped into the following three categories:
- Transformation operators on Value-type data.
- Transformation operators on Key-Value data.
- Action operators.
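To make the lazy-evaluation point above concrete, here is a minimal Scala sketch (a hypothetical snippet, assuming an existing SparkContext named sc): the map call only records the transformation; nothing runs until the count action is invoked.

// Transformation: builds lineage only, nothing is executed yet
val doubled = sc.parallelize(1 to 5).map { x =>
  println("map is running")   // printed only once an action triggers the job
  x * 2
}
// Action: triggers the actual computation and returns a result to the driver
println(doubled.count())   // 5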
1.1 Transformation operators on Value-type data

| Type | Operators |
| --- | --- |
| One-to-one between input and output partitions | map, flatMap, mapPartitions, glom |
| Many-to-one between input and output partitions | union, cartesian |
| Many-to-many between input and output partitions | groupBy |
| Output partitions are a subset of the input partitions | filter, distinct, subtract, sample, takeSample |
| Cache | cache, persist |
1.2 Transformation operators on Key-Value data

| Type | Operators |
| --- | --- |
| One-to-one between input and output partitions | mapValues |
| Aggregation on a single RDD | combineByKey, reduceByKey, partitionBy |
| Aggregation across two RDDs | cogroup |
| Joins | join, leftOuterJoin, rightOuterJoin |
1.3 Action operators

| Type | Operators |
| --- | --- |
| No output | foreach |
| HDFS | saveAsTextFile, saveAsObjectFile |
| Scala collections and data types | collect, collectAsMap, reduceByKeyLocally, lookup, count, top, reduce, fold, aggregate |
2. Transformation

2.1 map
2.1.1 Overview
Syntax (Scala):
def map[U: ClassTag](f: T => U): RDD[U]
Description:
Applies the user-defined function f to every element of the source RDD, mapping each element to a new one and producing a new RDD of the results.
2.1.2 Java example
/**
 * The map operator
 * <p>
 * map and foreach in this example:
 * 1. map iterates over every element of the source RDD;
 * 2. the call function is executed for each element and its result is returned.
 * </p>
 */
private static void map() {
SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
List<String> datas = Arrays.asList(
"{'id':1,'name':'xl1','pwd':'xl123','sex':2}",
"{'id':2,'name':'xl2','pwd':'xl123','sex':1}",
"{'id':3,'name':'xl3','pwd':'xl123','sex':2}");
JavaRDD<String> datasRDD = sc.parallelize(datas);
JavaRDD<User> mapRDD = datasRDD.map(
new Function<String, User>() {
public User call(String v) throws Exception {
Gson gson = new Gson();
return gson.fromJson(v, User.class);
}
});
mapRDD.foreach(new VoidFunction<User>() {
public void call(User user) throws Exception {
System.out.println("id: " + user.id
+ " name: " + user.name
+ " pwd: " + user.pwd
+ " sex:" + user.sex);
}
});
sc.close();
}
id: 1 name: xl1 pwd: xl123 sex:2
id: 2 name: xl2 pwd: xl123 sex:1
id: 3 name: xl3 pwd: xl123 sex:2
2.1.3 Scala example
private def map() {
val conf = new SparkConf().setAppName(ScalaOperatorDemo.getClass.getSimpleName).setMaster("local")
val sc = new SparkContext(conf)
val datas: Array[String] = Array(
"{'id':1,'name':'xl1','pwd':'xl123','sex':2}",
"{'id':2,'name':'xl2','pwd':'xl123','sex':1}",
"{'id':3,'name':'xl3','pwd':'xl123','sex':2}")
sc.parallelize(datas)
.map(v => {
new Gson().fromJson(v, classOf[User])
})
.foreach(user => {
println("id: " + user.id
+ " name: " + user.name
+ " pwd: " + user.pwd
+ " sex:" + user.sex)
})
}
2.2 filter
2.2.1 Overview
Syntax (Scala):
def filter(f: T => Boolean): RDD[T]
Description:
Filters the elements: f is applied to every element, elements for which it returns true are kept in the resulting RDD, and those for which it returns false are discarded.
2.2.2 Java example
static void filter() {
SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
List<Integer> datas = Arrays.asList(1, 2, 3, 7, 4, 5, 8);
JavaRDD<Integer> rddData = sc.parallelize(datas);
JavaRDD<Integer> filterRDD = rddData.filter(
new Function<Integer, Boolean>() {
public Boolean call(Integer v) throws Exception {
return v >= 3;
}
}
);
filterRDD.foreach(
new VoidFunction<Integer>() {
@Override
public void call(Integer integer) throws Exception {
System.out.println(integer);
}
}
);
sc.close();
}
3
7
4
5
8
2.2.3 Scala example
def filter(): Unit = {
val conf = new SparkConf().setAppName(ScalaOperatorDemo.getClass.getSimpleName).setMaster("local")
val sc = new SparkContext(conf)
val datas = Array(1, 2, 3, 7, 4, 5, 8)
sc.parallelize(datas)
.filter(v => v >= 3)
.foreach(println)
}
2.3 flatMap
2.3.1 Overview
Syntax (Scala):
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
Description:
Similar to map, but each input element of the RDD may produce zero or more output elements.
2.3.2 Java example
static void flatMap() {
SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
List<String> data = Arrays.asList(
"aa,bb,cc",
"cxf,spring,struts2",
"java,C++,javaScript");
JavaRDD<String> rddData = sc.parallelize(data);
JavaRDD<String> flatMapData = rddData.flatMap(
v -> Arrays.asList(v.split(",")).iterator()
// new FlatMapFunction<String, String>() {
//     @Override
//     public Iterator<String> call(String t) throws Exception {
//         List<String> list = Arrays.asList(t.split(","));
//         return list.iterator();
//     }
// }
);
flatMapData.foreach(v -> System.out.println(v));
sc.close();
}
// Result
aa
bb
cc
cxf
spring
struts2
java
C++
javaScript
2.3.3 Scala example
sc.parallelize(datas)
.flatMap(line => line.split(","))
.foreach(println)
2.4 mapPartitions
2.4.1 Overview
Syntax (Scala):
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
Description:
Similar to map, but whereas the func of map is applied to each individual element of the RDD, the func of mapPartitions is applied to an entire partition at once, so its type is Iterator<T> => Iterator<U>, where T is the element type of the input RDD. preservesPartitioning indicates whether the input function preserves the partitioner; it defaults to false.
2.4.2 Java example
static void mapPartitions() {
SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
List<String> names = Arrays.asList("張三1", "李四1", "王五1", "張三2", "李四2",
"王五2", "張三3", "李四3", "王五3", "張三4");
JavaRDD<String> namesRDD = sc.parallelize(names, 3);
JavaRDD<String> mapPartitionsRDD = namesRDD.mapPartitions(
new FlatMapFunction<Iterator<String>, String>() {
int count = 0;
@Override
public Iterator<String> call(Iterator<String> stringIterator) throws Exception {
List<String> list = new ArrayList<String>();
while (stringIterator.hasNext()) {
list.add("分區索引:" + count++ + "\t" + stringIterator.next());
}
return list.iterator();
}
}
);
List<String> result = mapPartitionsRDD.collect();
result.forEach(System.out::println);
sc.close();
}
分區索引:0 張三1
分區索引:1 李四1
分區索引:2 王五1
分區索引:0 張三2
分區索引:1 李四2
分區索引:2 王五2
分區索引:0 張三3
分區索引:1 李四3
分區索引:2 王五3
分區索引:3 張三4
2.4.3 Scala example
sc.parallelize(datas, 3)
.mapPartitions(
n => {
val result = ArrayBuffer[String]()
while (n.hasNext) {
result.append(n.next())
}
result.iterator
}
)
.foreach(println)
2.5 mapPartitionsWithIndex
2.5.1 Overview
Syntax (Scala):
def mapPartitionsWithIndex[U: ClassTag](
f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
Description:
Similar to mapPartitions, but the function additionally receives an integer identifying the partition, so the type of func is (Int, Iterator<T>) => Iterator<U>.
2.5.2 Java example
private static void mapPartitionsWithIndex() {
SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
List<String> names = Arrays.asList("張三1", "李四1", "王五1", "張三2", "李四2",
"王五2", "張三3", "李四3", "王五3", "張三4");
JavaRDD<String> namesRDD = sc.parallelize(names, 3);
JavaRDD<String> mapPartitionsWithIndexRDD = namesRDD.mapPartitionsWithIndex(
new Function2<Integer, Iterator<String>, Iterator<String>>() {
private static final long serialVersionUID = 1L;
public Iterator<String> call(Integer v1, Iterator<String> v2) throws Exception {
List<String> list = new ArrayList<String>();
while (v2.hasNext()) {
list.add("分區索引:" + v1 + "\t" + v2.next());
}
return list.iterator();
}
},
true);
List<String> result = mapPartitionsWithIndexRDD.collect();
result.forEach(System.out::println);
sc.close();
}
分區索引:0 張三1
分區索引:0 李四1
分區索引:0 王五1
分區索引:1 張三2
分區索引:1 李四2
分區索引:1 王五2
分區索引:2 張三3
分區索引:2 李四3
分區索引:2 王五3
分區索引:2 張三4
2.5.3 Scala example
sc.parallelize(datas, 3)
.mapPartitionsWithIndex(
(m, n) => {
val result = ArrayBuffer[String]()
while (n.hasNext) {
result.append("分區索引:" + m + "\t" + n.next())
}
result.iterator
}
)
.foreach(println)
2.6 sample
2.6.1 Overview
Syntax (Scala):
def sample(
withReplacement: Boolean,
fraction: Double,
seed: Long = Utils.random.nextLong): RDD[T]
Description:
Samples the RDD. When withReplacement is true, sampling is done with replacement (the same element may be drawn more than once); false means without replacement. fraction is the sampling fraction, and seed is the random seed, e.g. the current timestamp.
2.6.2 Java example
static void sample() {
SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
List<Integer> datas = Arrays.asList(1, 2, 3, 7, 4, 5, 8);
JavaRDD<Integer> dataRDD = sc.parallelize(datas);
JavaRDD<Integer> sampleRDD = dataRDD.sample(false, 0.5, System.currentTimeMillis());
sampleRDD.foreach(v -> System.out.println(v));
sc.close();
}
// Result
7
4
5
2.6.3 Scala example
sc.parallelize(datas)
.sample(withReplacement = false, 0.5, System.currentTimeMillis)
.foreach(println)
2.7 union
2.7.1 Overview
Syntax (Scala):
def union(other: RDD[T]): RDD[T]
Description:
Merges two RDDs without deduplication; the two RDDs must have the same element type.
2.7.2 Java example
static void union() {
SparkConf conf = new SparkConf().setAppName(JavaOperatorDemo.class.getSimpleName())
.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
List<String> datas1 = Arrays.asList("張三", "李四");
List<String> datas2 = Arrays.asList("tom", "gim");
JavaRDD<String> data1RDD = sc.parallelize(datas1);
JavaRDD<String> data2RDD = sc.parallelize(datas2);
JavaRDD<String> unionRDD = data1RDD
.union(data2RDD);
unionRDD.foreach(v -> System.out.println(v));
sc.close();
}
// Result
張三
李四
tom
gim
2.7.3 Scala example
// sc.parallelize(datas1)
// .union(sc.parallelize(datas2))
// .foreach(println)
// or
(sc.parallelize(datas1) ++ sc.parallelize(datas2))
.foreach(println)
2.8 intersection
2.8.1 Overview
Syntax (Scala):
def intersection(other: RDD[T]): RDD[T]
Description:
Returns the intersection of two RDDs.
2.8.2 Java example
static void intersection(JavaSparkContext sc) {
List<String> datas1 = Arrays.asList("張三", "李四", "tom");
List<String> datas2 = Arrays.asList("tom", "gim");
sc.parallelize(datas1)
.intersection(sc.parallelize(datas2))
.foreach(v -> System.out.println(v));
}
// Result
tom
2.8.3 Scala example
sc.parallelize(datas1)
.intersection(sc.parallelize(datas2))
.foreach(println)
2.9 distinct
2.9.1 Overview
Syntax (Scala):
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
def distinct(): RDD[T]
Description:
Deduplicates the source RDD, returning an RDD whose elements contain no duplicates.
2.9.2 Java example
static void distinct(JavaSparkContext sc) {
List<String> datas = Arrays.asList("張三", "李四", "tom", "張三");
sc.parallelize(datas)
.distinct()
.foreach(v -> System.out.println(v));
}
// Result
張三
tom
李四
2.9.3 Scala example
sc.parallelize(datas)
.distinct()
.foreach(println)
2.10 groupByKey
2.10.1 Overview
Syntax (Scala):
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]
def groupBy[K](
f: T => K,
numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]
def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
: RDD[(K, Iterable[T])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(): RDD[(K, Iterable[V])]
Description:
Performs a GROUP BY-style aggregation (similar to an RDBMS) on an RDD of <key, value> pairs: values sharing the same key are grouped together, and the resulting RDD has the structure (key, Iterable<value>).
2.10.2 Java example
static void groupBy(JavaSparkContext sc) {
List<Integer> datas = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
sc.parallelize(datas)
.groupBy(new Function<Integer, Object>() {
@Override
public Object call(Integer v1) throws Exception {
return (v1 % 2 == 0) ? "偶數" : "奇數";
}
})
.collect()
.forEach(System.out::println);
List<String> datas2 = Arrays.asList("dog", "tiger", "lion", "cat", "spider", "eagle");
sc.parallelize(datas2)
.keyBy(v1 -> v1.length())
.groupByKey()
.collect()
.forEach(System.out::println);
}
// Result
(奇數,[1, 3, 5, 7, 9])
(偶數,[2, 4, 6, 8])
(4,[lion])
(6,[spider])
(3,[dog, cat])
(5,[tiger, eagle])
2.10.3 Scala example
def groupBy(sc: SparkContext): Unit = {
sc.parallelize(1 to 9, 3)
.groupBy(x => {
if (x % 2 == 0) "偶數"
else "奇數"
})
.collect()
.foreach(println)
val datas2 = Array("dog", "tiger", "lion", "cat", "spider", "eagle")
sc.parallelize(datas2)
.keyBy(_.length)
.groupByKey()
.collect()
.foreach(println)
}
2.11 reduceByKey
2.11.1 Overview
Syntax (Scala):
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
Description:
Aggregates an RDD of <key, value> pairs: the values that share the same key are reduced by calling func, whose type must be (V, V) => V.
2.11.2 Java example
static void reduceByKey(JavaSparkContext sc) {
JavaRDD<String> lines = sc.textFile("file:///Users/zhangws/opt/spark-2.0.1-bin-hadoop2.6/README.md");
JavaRDD<String> wordsRDD = lines.flatMap(new FlatMapFunction<String, String>() {
private static final long serialVersionUID = 1L;
public Iterator<String> call(String line) throws Exception {
List<String> words = Arrays.asList(line.split(" "));
return words.iterator();
}
});
JavaPairRDD<String, Integer> wordsCount = wordsRDD.mapToPair(new PairFunction<String, String, Integer>() {
private static final long serialVersionUID = 1L;
public Tuple2<String, Integer> call(String word) throws Exception {
return new Tuple2<String, Integer>(word, 1);
}
});
JavaPairRDD<String, Integer> resultRDD = wordsCount.reduceByKey(new Function2<Integer, Integer, Integer>() {
private static final long serialVersionUID = 1L;
public Integer call(Integer v1, Integer v2) throws Exception {
return v1 + v2;
}
});
resultRDD.foreach(new VoidFunction<Tuple2<String, Integer>>() {
private static final long serialVersionUID = 1L;
public void call(Tuple2<String, Integer> t) throws Exception {
System.out.println(t._1 + "\t" + t._2());
}
});
sc.close();
}
package 1
For 3
Programs 1
(remaining output omitted)
2.11.3 Scala example
val textFile = sc.textFile("file:///home/zkpk/spark-2.0.1/README.md")
val words = textFile.flatMap(line => line.split(" "))
val wordPairs = words.map(word => (word, 1))
val wordCounts = wordPairs.reduceByKey((a, b) => a + b)
println("wordCounts: ")
wordCounts.collect().foreach(println)
2.12 aggregateByKey
2.12.1 Overview
Syntax (Java):
<U> JavaPairRDD<K,U> aggregateByKey(U zeroValue,
Partitioner partitioner,
Function2<U,V,U> seqFunc,
Function2<U,U,U> combFunc)
<U> JavaPairRDD<K,U> aggregateByKey(U zeroValue,
int numPartitions,
Function2<U,V,U> seqFunc,
Function2<U,U,U> combFunc)
<U> JavaPairRDD<K,U> aggregateByKey(U zeroValue,
Function2<U,V,U> seqFunc,
Function2<U,U,U> combFunc)
Description:
aggregateByKey aggregates the values of a pair RDD that share the same key, using a neutral initial value during the aggregation. As with aggregate, the return value type of aggregateByKey does not have to match the value type of the RDD. Because aggregateByKey aggregates values per key, it still returns a pair RDD whose result is each key together with its aggregated value, whereas aggregate returns a single, non-RDD result.
Parameters:
- zeroValue: used the first time a key is encountered in a partition to create an instance of the return type U, into which that key's first value is folded via seqOp.
- seqOp: folds each value encountered for a key, while iterating over a partition, into the U instance created from zeroValue.
- combOp: merges two U values aggregated for the same key in different partitions.
2.12.2 Java example
static void aggregateByKey(JavaSparkContext sc) {
List<Tuple2<Integer, Integer>> datas = new ArrayList<>();
datas.add(new Tuple2<>(1, 3));
datas.add(new Tuple2<>(1, 2));
datas.add(new Tuple2<>(1, 4));
datas.add(new Tuple2<>(2, 3));
sc.parallelizePairs(datas, 2)
.aggregateByKey(
0,
new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer v1, Integer v2) throws Exception {
System.out.println("seq: " + v1 + "\t" + v2);
return Math.max(v1, v2);
}
},
new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer v1, Integer v2) throws Exception {
System.out.println("comb: " + v1 + "\t" + v2);
return v1 + v2;
}
})
.collect()
.forEach(System.out::println);
}
2.12.3 Scala example
def aggregateByKey(sc: SparkContext): Unit = {
// Merges values within the same partition; a has the type of zeroValue, b has the type of the original values
def seq(a:Int, b:Int): Int = {
println("seq: " + a + "\t" + b)
math.max(a, b)
}
// Merges values across partitions; both a and b have the type of zeroValue
def comb(a:Int, b:Int): Int = {
println("comb: " + a + "\t" + b)
a + b
}
// The data is split into two partitions
// Partition 1: (1,3) (1,2)
// Partition 2: (1,4) (2,3)
// zeroValue is the neutral value; it defines the result value type and takes part in the computation
// seqOp merges values within one partition
// Partition 1, values with the same key are merged:
// seq: 0 3    (1,3) merged with the neutral value gives 3
// seq: 3 2    (1,2) merged next, still 3
// Partition 2, values with the same key are merged:
// seq: 0 4    (1,4) merged with the neutral value gives 4
// seq: 0 3    (2,3) merged with the neutral value gives 3
// combOp merges values across partitions
// The two partition results are then combined:
// key 1 appears in both partitions and is merged to (1,7)
// key 2 appears in only one partition, so no merging is needed: (2,3)
sc.parallelize(List((1, 3), (1, 2), (1, 4), (2, 3)), 2)
.aggregateByKey(0)(seq, comb)
.collect()
.foreach(println)
}
// Result
(2,3)
(1,7)
2.13 sortByKey
2.13.1 Overview
Syntax (Java):
JavaRDD<T> sortBy(Function<T,S> f,
boolean ascending,
int numPartitions)
JavaPairRDD<K,V> sortByKey()
JavaPairRDD<K,V> sortByKey(boolean ascending)
JavaPairRDD<K,V> sortByKey(boolean ascending,
int numPartitions)
JavaPairRDD<K,V> sortByKey(java.util.Comparator<K> comp)
JavaPairRDD<K,V> sortByKey(java.util.Comparator<K> comp,
boolean ascending)
JavaPairRDD<K,V> sortByKey(java.util.Comparator<K> comp,
boolean ascending,
int numPartitions)
Description:
Sorts an RDD of <key, value> pairs by key in ascending or descending order.
Parameters:
- comp: the comparison used when sorting.
- ascending: true for ascending order, false for descending.
2.13.2 Java example
static void sortByKey(JavaSparkContext sc) {
List<Integer> datas = Arrays.asList(60, 70, 80, 55, 45, 75);
// sc.parallelize(datas)
// .sortBy(new Function<Integer, Object>() {
// @Override
// public Object call(Integer v1) throws Exception {
// return v1;
// }
// }, true, 1)
// .foreach(v -> System.out.println(v));
sc.parallelize(datas)
.sortBy((Integer v1) -> v1, false, 1)
.foreach(v -> System.out.println(v));
List<Tuple2<Integer, Integer>> datas2 = new ArrayList<>();
datas2.add(new Tuple2<>(3, 3));
datas2.add(new Tuple2<>(2, 2));
datas2.add(new Tuple2<>(1, 4));
datas2.add(new Tuple2<>(2, 3));
sc.parallelizePairs(datas2)
.sortByKey(false)
.foreach(v -> System.out.println(v));
}
// Result
80
75
70
60
55
45
(3,3)
(2,2)
(2,3)
(1,4)
2.13.3 Scala example
def sortByKey(sc: SparkContext) : Unit = {
sc.parallelize(Array(60, 70, 80, 55, 45, 75))
.sortBy(v => v, false)
.foreach(println)
sc.parallelize(List((3, 3), (2, 2), (1, 4), (2, 3)))
.sortByKey(true)
.foreach(println)
}
2.14 join
2.14.1 Overview
Syntax (Java):
JavaPairRDD<K,scala.Tuple2<V,W>> join(JavaPairRDD<K,W> other)
JavaPairRDD<K,scala.Tuple2<V,W>> join(
JavaPairRDD<K,W> other,
int numPartitions)
JavaPairRDD<K,scala.Tuple2<V,W>> join(
JavaPairRDD<K,W> other,
Partitioner partitioner)
Description:
Joins an RDD of <K, V> pairs with an RDD of <K, W> pairs, returning (K, (V, W)).
The outer-join variants are leftOuterJoin, rightOuterJoin and fullOuterJoin; a short sketch follows the Scala example below.
2.14.2 Java example
static void join(JavaSparkContext sc) {
List<Tuple2<Integer, String>> products = new ArrayList<>();
products.add(new Tuple2<>(1, "蘋果"));
products.add(new Tuple2<>(2, "梨"));
products.add(new Tuple2<>(3, "香蕉"));
products.add(new Tuple2<>(4, "石榴"));
List<Tuple2<Integer, Integer>> counts = new ArrayList<>();
counts.add(new Tuple2<>(1, 7));
counts.add(new Tuple2<>(2, 3));
counts.add(new Tuple2<>(3, 8));
counts.add(new Tuple2<>(4, 3));
counts.add(new Tuple2<>(5, 9));
JavaPairRDD<Integer, String> productsRDD = sc.parallelizePairs(products);
JavaPairRDD<Integer, Integer> countsRDD = sc.parallelizePairs(counts);
productsRDD.join(countsRDD)
.foreach(v -> System.out.println(v));
}
(4,(石榴,3))
(1,(蘋果,7))
(3,(香蕉,8))
(2,(梨,3))
2.14.3 Scala example
sc.parallelize(List((1, "蘋果"), (2, "梨"), (3, "香蕉"), (4, "石榴")))
.join(sc.parallelize(List((1, 7), (2, 3), (3, 8), (4, 3), (5, 9))))
.foreach(println)
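The outer-join variants mentioned above keep keys that have no match on the other side. Here is a minimal Scala sketch reusing the products/counts data from the examples above (sc is assumed to be an existing SparkContext; note that key 5 has a count but no product):

// leftOuterJoin keeps every key of the left RDD; a missing right value becomes None
val products = sc.parallelize(List((1, "蘋果"), (2, "梨"), (3, "香蕉"), (4, "石榴")))
val counts = sc.parallelize(List((1, 7), (2, 3), (3, 8), (4, 3), (5, 9)))
counts.leftOuterJoin(products)
  .foreach(println)
// prints, among others, (5,(9,None)) and (1,(7,Some(蘋果)))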
2.15 cogroup
2.15.1 Overview
Syntax (Java):
JavaPairRDD<K,scala.Tuple2<Iterable<V>,Iterable<W>>> cogroup(JavaPairRDD<K,W> other,
Partitioner partitioner)
JavaPairRDD<K,scala.Tuple3<Iterable<V>,Iterable<W1>,Iterable<W2>>> cogroup(JavaPairRDD<K,W1> other1,
JavaPairRDD<K,W2> other2,
Partitioner partitioner)
JavaPairRDD<K,scala.Tuple4<Iterable<V>,Iterable<W1>,Iterable<W2>,Iterable<W3>>> cogroup(JavaPairRDD<K,W1> other1,
JavaPairRDD<K,W2> other2,
JavaPairRDD<K,W3> other3,
Partitioner partitioner)
JavaPairRDD<K,scala.Tuple2<Iterable<V>,Iterable<W>>> cogroup(JavaPairRDD<K,W> other)
JavaPairRDD<K,scala.Tuple3<Iterable<V>,Iterable<W1>,Iterable<W2>>> cogroup(JavaPairRDD<K,W1> other1,
JavaPairRDD<K,W2> other2)
JavaPairRDD<K,scala.Tuple4<Iterable<V>,Iterable<W1>,Iterable<W2>,Iterable<W3>>> cogroup(JavaPairRDD<K,W1> other1,
JavaPairRDD<K,W2> other2,
JavaPairRDD<K,W3> other3)
JavaPairRDD<K,scala.Tuple2<Iterable<V>,Iterable<W>>> cogroup(JavaPairRDD<K,W> other,
int numPartitions)
JavaPairRDD<K,scala.Tuple3<Iterable<V>,Iterable<W1>,Iterable<W2>>> cogroup(JavaPairRDD<K,W1> other1,
JavaPairRDD<K,W2> other2,
int numPartitions)
JavaPairRDD<K,scala.Tuple4<Iterable<V>,Iterable<W1>,Iterable<W2>,Iterable<W3>>> cogroup(JavaPairRDD<K,W1> other1,
JavaPairRDD<K,W2> other2,
JavaPairRDD<K,W3> other3,
int numPartitions)
Description:
cogroup: for multiple key-value RDDs, the elements sharing the same key are grouped into one collection per RDD. Unlike reduceByKey, the elements with the same key across the RDDs are grouped together rather than reduced.
2.15.2 Java example
static void cogroup(JavaSparkContext sc) {
List<Tuple2<Integer, String>> datas1 = new ArrayList<>();
datas1.add(new Tuple2<>(1, "蘋果"));
datas1.add(new Tuple2<>(2, "梨"));
datas1.add(new Tuple2<>(3, "香蕉"));
datas1.add(new Tuple2<>(4, "石榴"));
List<Tuple2<Integer, Integer>> datas2 = new ArrayList<>();
datas2.add(new Tuple2<>(1, 7));
datas2.add(new Tuple2<>(2, 3));
datas2.add(new Tuple2<>(3, 8));
datas2.add(new Tuple2<>(4, 3));
List<Tuple2<Integer, String>> datas3 = new ArrayList<>();
datas3.add(new Tuple2<>(1, "7"));
datas3.add(new Tuple2<>(2, "3"));
datas3.add(new Tuple2<>(3, "8"));
datas3.add(new Tuple2<>(4, "3"));
datas3.add(new Tuple2<>(4, "4"));
datas3.add(new Tuple2<>(4, "5"));
datas3.add(new Tuple2<>(4, "6"));
sc.parallelizePairs(datas1)
.cogroup(sc.parallelizePairs(datas2),
sc.parallelizePairs(datas3))
.foreach(v -> System.out.println(v));
}
(4,([石榴],[3],[3, 4, 5, 6]))
(1,([蘋果],[7],[7]))
(3,([香蕉],[8],[8]))
(2,([梨],[3],[3]))
2.15.3 Scala example
def cogroup(sc: SparkContext): Unit = {
val datas1 = List((1, "蘋果"),
(2, "梨"),
(3, "香蕉"),
(4, "石榴"))
val datas2 = List((1, 7),
(2, 3),
(3, 8),
(4, 3))
val datas3 = List((1, "7"),
(2, "3"),
(3, "8"),
(4, "3"),
(4, "4"),
(4, "5"),
(4, "6"))
sc.parallelize(datas1)
.cogroup(sc.parallelize(datas2),
sc.parallelize(datas3))
.foreach(println)
}
// Result
(4,(CompactBuffer(石榴),CompactBuffer(3),CompactBuffer(3, 4, 5, 6)))
(1,(CompactBuffer(蘋果),CompactBuffer(7),CompactBuffer(7)))
(3,(CompactBuffer(香蕉),CompactBuffer(8),CompactBuffer(8)))
(2,(CompactBuffer(梨),CompactBuffer(3),CompactBuffer(3)))
2.16 cartesian
2.16.1 Overview
Syntax (Java):
static <U> JavaPairRDD<T,U> cartesian(JavaRDDLike<U,?> other)
Description:
Computes the Cartesian product of two RDDs.
2.16.2 Java example
static void cartesian(JavaSparkContext sc) {
List<String> names = Arrays.asList("張三", "李四", "王五");
List<Integer> scores = Arrays.asList(60, 70, 80);
JavaRDD<String> namesRDD = sc.parallelize(names);
JavaRDD<Integer> scoreRDD = sc.parallelize(scores);
JavaPairRDD<String, Integer> cartesianRDD = namesRDD.cartesian(scoreRDD);
cartesianRDD.foreach(new VoidFunction<Tuple2<String, Integer>>() {
private static final long serialVersionUID = 1L;
public void call(Tuple2<String, Integer> t) throws Exception {
System.out.println(t._1 + "\t" + t._2());
}
});
}
張三 60
張三 70
張三 80
李四 60
李四 70
李四 80
王五 60
王五 70
王五 80
2.16.3 Scala example
namesRDD.cartesian(scoreRDD)
.foreach(println)
2.17 pipe
2.17.1 Overview
Syntax (Java):
JavaRDD<String> pipe(String command)
JavaRDD<String> pipe(java.util.List<String> command)
JavaRDD<String> pipe(java.util.List<String> command,
java.util.Map<String,String> env)
JavaRDD<String> pipe(java.util.List<String> command,
java.util.Map<String,String> env,
boolean separateWorkingDir,
int bufferSize)
static JavaRDD<String> pipe(java.util.List<String> command,
java.util.Map<String,String> env,
boolean separateWorkingDir,
int bufferSize,
String encoding)
Description:
Pipes the elements of the RDD through an external command and creates a new RDD from the command's output.
2.17.2 Java example
static void pipe(JavaSparkContext sc) {
List<String> datas = Arrays.asList("hi", "hello", "how", "are", "you");
sc.parallelize(datas)
.pipe("/Users/zhangws/echo.sh")
.collect()
.forEach(System.out::println);
}
2.17.3 Scala example
Contents of echo.sh:
#!/bin/bash
echo "Running shell script"
RESULT=""
while read LINE; do
RESULT=${RESULT}" "${LINE}
done
echo ${RESULT} > /Users/zhangws/out123.txt
Test code
def pipe(sc: SparkContext): Unit = {
val data = List("hi", "hello", "how", "are", "you")
sc.makeRDD(data)
.pipe("/Users/zhangws/echo.sh")
.collect()
.foreach(println)
}
Result
# out123.txt
hi hello how are you
# Console output
Running shell script
2.18 coalesce
2.18.1 Overview
Syntax (Java):
JavaRDD<T> coalesce(int numPartitions)
JavaRDD<T> coalesce(int numPartitions,
boolean shuffle)
JavaPairRDD<K,V> coalesce(int numPartitions)
JavaPairRDD<K,V> coalesce(int numPartitions,
boolean shuffle)
Description:
Repartitions the RDD (using a HashPartitioner) so that it ends up with numPartitions partitions. If shuffle is set to true, a shuffle is performed; see the short sketch after the Scala example below.
2.18.2 Java example
static void coalesce(JavaSparkContext sc) {
List<String> datas = Arrays.asList("hi", "hello", "how", "are", "you");
JavaRDD<String> datasRDD = sc.parallelize(datas, 4);
System.out.println("RDD的分區數: " + datasRDD.partitions().size());
JavaRDD<String> datasRDD2 = datasRDD.coalesce(2);
System.out.println("RDD的分區數: " + datasRDD2.partitions().size());
}
// Result
RDD的分區數: 4
RDD的分區數: 2
2.18.3 Scala example
def coalesce(sc: SparkContext): Unit = {
val datas = List("hi", "hello", "how", "are", "you")
val datasRDD = sc.parallelize(datas, 4)
println("RDD的分區數: " + datasRDD.partitions.length)
val datasRDD2 = datasRDD.coalesce(2)
println("RDD的分區數: " + datasRDD2.partitions.length)
}
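One detail worth noting about the shuffle flag: without a shuffle, coalesce can only reduce the number of partitions. A minimal Scala sketch (sc is an assumed SparkContext; the data and partition counts are illustrative):

val rdd = sc.parallelize(1 to 100, 2)
// Requesting more partitions than currently exist has no effect unless shuffle = true
println(rdd.coalesce(8).partitions.length)                  // still 2
println(rdd.coalesce(8, shuffle = true).partitions.length)  // 8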
2.19 repartition
2.19.1 Overview
Syntax (Java):
JavaRDD<T> repartition(int numPartitions)
JavaPairRDD<K,V> repartition(int numPartitions)
Description:
This function is simply coalesce with its second argument (shuffle) set to true.
Example omitted; a minimal sketch follows.
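A minimal Scala sketch of the equivalence (sc is an assumed SparkContext; data and partition counts are illustrative):

// repartition(n) is just coalesce(n, shuffle = true), so it can also increase the partition count
val rdd = sc.parallelize(1 to 100, 2)
println(rdd.repartition(6).partitions.length)               // 6
println(rdd.coalesce(6, shuffle = true).partitions.length)  // 6, same thing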
2.20 repartitionAndSortWithinPartitions
2.20.1 Overview
Syntax (Java):
JavaPairRDD<K,V> repartitionAndSortWithinPartitions(Partitioner partitioner)
JavaPairRDD<K,V> repartitionAndSortWithinPartitions(Partitioner partitioner,
java.util.Comparator<K> comp)
Description:
Repartitions the RDD according to the given Partitioner and sorts the records by key within each partition using comp.
2.20.2 Java example
static void repartitionAndSortWithinPartitions(JavaSparkContext sc) {
List<String> datas = new ArrayList<>();
Random random = new Random(1);
for (int i = 0; i < 10; i++) {
for (int j = 0; j < 100; j++) {
datas.add(String.format("product%02d,url%03d", random.nextInt(10), random.nextInt(100)));
}
}
JavaRDD<String> datasRDD = sc.parallelize(datas);
JavaPairRDD<String, String> pairRDD = datasRDD.mapToPair((String v) -> {
String[] values = v.split(",");
return new Tuple2<>(values[0], values[1]);
});
JavaPairRDD<String, String> partSortRDD = pairRDD.repartitionAndSortWithinPartitions(
new Partitioner() {
@Override
public int numPartitions() {
return 10;
}
@Override
public int getPartition(Object key) {
return Integer.valueOf(((String) key).substring(7));
}
}
);
partSortRDD.collect()
.forEach(System.out::println);
}
// Result
(product00,url099)
(product00,url006)
(product00,url088)
(middle of the output omitted)
(product09,url004)
(product09,url021)
(product09,url036)
2.20.3 Scala example
def repartitionAndSortWithinPartitions(sc: SparkContext): Unit = {
def partitionFunc(key:String): Int = {
key.substring(7).toInt
}
val datas = new Array[String](1000)
val random = new Random(1)
for (i <- 0 until 10; j <- 0 until 100) {
val index: Int = i * 100 + j
datas(index) = "product" + random.nextInt(10) + ",url" + random.nextInt(100)
}
val datasRDD = sc.parallelize(datas)
val pairRDD = datasRDD.map(line => (line, 1))
.reduceByKey((a, b) => a + b)
// .foreach(println)
pairRDD.repartitionAndSortWithinPartitions(new Partitioner() {
override def numPartitions: Int = 10
override def getPartition(key: Any): Int = {
val str = String.valueOf(key)
str.substring(7, str.indexOf(',')).toInt
}
}).foreach(println)
}
3. Action
3.1 reduce
3.1.1 Overview
Syntax (Java):
static T reduce(Function2<T,T,T> f)
Description:
Reduces the elements of the RDD with func, which takes two arguments and merges them into one; the overall result of reduce is a single value. Note that func is executed concurrently, so it should be commutative and associative.
3.1.2 Scala example
def reduce(sc: SparkContext): Unit = {
println(sc.parallelize(1 to 10)
.reduce((x, y) => x + y))
}
// Result
55
3.2 collect
3.2.1 Overview
Syntax (Java):
static java.util.List<T> collect()
Description:
Reads the whole RDD back into the driver program as an Array (a List in Java), so the RDD should generally not be too large.
Example omitted; a minimal sketch follows.
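A minimal Scala sketch of collect (sc is an assumed SparkContext):

// collect materialises the whole RDD on the driver, so only use it on small RDDs
val squares: Array[Int] = sc.parallelize(1 to 5).map(x => x * x).collect()
println(squares.mkString(", "))   // 1, 4, 9, 16, 25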
3.3 count
3.3.1 Overview
Syntax (Java):
static long count()
Description:
Returns the number of elements in the RDD.
3.3.2 Scala example
def count(sc: SparkContext): Unit = {
println(sc.parallelize(1 to 10)
.count)
}
// Result
10
3.4 first
3.4.1 Overview
Syntax (Java):
static T first()
Description:
Returns the first element of the RDD, equivalent to take(1).
3.4.2 Scala example
def first(sc: SparkContext): Unit = {
println(sc.parallelize(1 to 10)
.first())
}
// Result
1
3.5 take
3.5.1 Overview
Syntax (Java):
static java.util.List<T> take(int num)
Description:
Returns the first num elements of the RDD.
3.5.2 Scala example
def take(sc: SparkContext): Unit = {
sc.parallelize(1 to 10)
.take(2).foreach(println)
}
// Result
1
2
3.6 takeSample
3.6.1 Overview
Syntax (Java):
static java.util.List<T> takeSample(boolean withReplacement,
int num,
long seed)
Description:
Used like sample, except that the second argument is a number of elements instead of a fraction, and the result is not an RDD but a collected list on the driver.
3.6.2 Scala example
def takeSample(sc: SparkContext): Unit = {
sc.parallelize(1 to 10)
.takeSample(withReplacement = false, 3, 1)
.foreach(println)
}
// Result
1
8
10
3.7 takeOrdered
3.7.1 Overview
Syntax (Java):
java.util.List<T> takeOrdered(int num)
java.util.List<T> takeOrdered(int num,
java.util.Comparator<T> comp)
Description:
Returns the first num elements of the RDD according to the default (ascending) order or the specified ordering.
3.7.2 Scala example
def takeOrdered(sc: SparkContext): Unit = {
sc.parallelize(Array(5,6,2,1,7,8))
.takeOrdered(3)(new Ordering[Int](){
override def compare(x: Int, y: Int): Int = y.compareTo(x)
})
.foreach(println)
}
// Result
8
7
6
3.8 saveAsTextFile
3.8.1 Overview
Syntax (Java):
void saveAsTextFile(String path)
void saveAsTextFile(String path,
Class<? extends org.apache.hadoop.io.compress.CompressionCodec> codec)
Description:
Converts the RDD to text and saves it under path, possibly as several files (one per partition). path can be a local path or an HDFS address; each element is converted by calling its toString method.
3.8.2 Scala example
def saveAsTextFile(sc: SparkContext): Unit = {
sc.parallelize(Array(5,6,2,1,7,8))
.saveAsTextFile("/Users/zhangws/Documents/test")
}
// Result
Under /Users/zhangws/Documents/test:
_SUCCESS
part-00000
// Contents of part-00000
5
6
2
1
7
8
3.9 saveAsSequenceFile
3.9.1 Overview
Syntax (Scala):
def saveAsSequenceFile(path: String, codec: Option[Class[_ <: CompressionCodec]] = None): Unit
Description:
Like saveAsTextFile, but saves in SequenceFile format; the element types must implement the Writable interface or be implicitly convertible to a Writable type (e.g. basic Scala types such as Int and String).
Example omitted; a minimal sketch follows.
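A minimal Scala sketch (sc is an assumed SparkContext and the output path is illustrative; saveAsSequenceFile is available on pair RDDs whose key and value types are convertible to Writable):

// (String, Int) pairs are implicitly convertible to (Text, IntWritable)
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)))
pairs.saveAsSequenceFile("/tmp/seqfile-demo")
// Read the file back to verify
sc.sequenceFile[String, Int]("/tmp/seqfile-demo").collect().foreach(println)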
3.10 saveAsObjectFile
3.10.1 Overview
Syntax (Java):
static void saveAsObjectFile(String path)
Description:
Serializes the elements of the RDD as objects and stores them in a file; on HDFS it is saved as a SequenceFile by default.
Example omitted; a minimal sketch follows.
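A minimal Scala sketch of saveAsObjectFile together with its reading counterpart objectFile (sc is an assumed SparkContext; the path is illustrative):

// Serializes the elements and writes them out; on HDFS this becomes a SequenceFile
sc.parallelize(1 to 10).saveAsObjectFile("/tmp/objfile-demo")
// objectFile reads it back, with the element type given explicitly
val restored = sc.objectFile[Int]("/tmp/objfile-demo")
println(restored.count())   // 10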
3.11 countByKey
3.11.1 Overview
Syntax (Java):
java.util.Map<K,Long> countByKey()
Description:
Only applicable to RDDs of (K, V) pairs; counts the elements per key and returns a map of (K, Long).
3.11.2 Scala example
def countByKey(sc: SparkContext): Unit = {
println(sc.parallelize(Array(("A", 1), ("B", 6), ("A", 2), ("C", 1), ("A", 7), ("A", 8)))
.countByKey())
}
// Result
Map(B -> 1, A -> 4, C -> 1)
3.12 foreach
3.12.1 Overview
Syntax (Java):
static void foreach(VoidFunction<T> f)
Description:
Applies func to every element of the RDD; there is no return value. It is commonly used to update accumulators or to write data to an external storage system. Pay attention to the scope of any variables used inside func, as shown in the sketch below.
3.12.2 Java example
foreach(System.out::println);
foreach(v -> System.out.println(v));
3.12.3 Scala example
foreach(println)
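The scope caveat matters because foreach runs on the executors: incrementing an ordinary driver-side variable inside it will not be visible back on the driver. A minimal Scala sketch using an accumulator instead (sc is an assumed SparkContext; Spark 2.x accumulator API, matching the version used in the examples above):

// An ordinary local var would still read 0 on the driver; use an accumulator to count
val evenCount = sc.longAccumulator("evenCount")
sc.parallelize(1 to 10).foreach { x =>
  if (x % 2 == 0) evenCount.add(1)
}
println(evenCount.value)   // 5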
4. References
The difference between the parallelize and makeRDD functions in Spark
Spark transformation operators
Classification of Spark operators
Spark functions explained: aggregateByKey
Performance comparison of pyspark and Spark pipe, with a sample program
The coalesce() and repartition() methods of Spark RDD
How Spark solves the common Top N problem
[Spark Java API] Action (4): sortBy, takeOrdered, takeSample