Getting Started with Spark

This article is a translation of the official programming guide at http://spark.apache.org/docs/latest/programming-guide.html.

When reposting, please credit: ylf13@元子

1. Overview

Every Spark application has a driver program that runs the user-defined main function and executes various parallel operations on a cluster. The main abstraction Spark provides is the RDD (resilient distributed dataset), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs can be created in many ways, for example from a file in the Hadoop file system or from an existing Scala collection (a data structure provided by the Scala language). Users can also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. (Translator's note: this is a big win for iterative computation; many machine learning algorithms repeatedly reuse intermediate results, and a chain of Hadoop MR jobs would keep writing them to HDFS, where the heavy disk I/O seriously hurts performance.) Finally, RDDs are fault tolerant and recover automatically from failures.

 

The second abstraction Spark provides is shared variables, which can be shared across parallel operations. By default, when Spark runs a function in parallel on different nodes, it ships a copy of each variable used by the function to every node, and each task works on its own local copy. Sometimes, however, a variable needs to be shared across tasks, or between tasks and the driver program. (Translator's note: the same problem exists in Hadoop MR, where tasks cannot share a static variable; you have to work around it with files on HDFS, a Redis cache, or similar mechanisms.) Spark supports two types of shared variables:

(1) Broadcast variables: cached in memory on every node.

(2) Accumulators: variables that only support an "add" operation.

 

2. Linking with Spark

The examples below use Java.

To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at:

groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.3.0

In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS. Some common HDFS version tags are listed on the third party distributions page.

groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>

Finally, you need to import some Spark classes into your program. Add the following lines:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;

 


Of course, if you are not using Maven, you can simply configure external JARs in Eclipse and add the jars under the spark/lib directory to your project.


3. Initializing Spark

First you need to create a JavaSparkContext object, which tells Spark how to access the cluster. Before creating the SparkContext, you build a SparkConf configuration object that contains information about your application.

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);

JavaSparkContext sc = new JavaSparkContext(conf);

 

appName: the name of your application, shown in the cluster UI.

master: the URL of a Spark, Mesos, or YARN cluster, or the special string "local" to run in local mode.
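
Putting these pieces together, here is a minimal sketch of an application skeleton; the SimpleApp class name, the appName string, and the local[2] master are just placeholder choices for running locally:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleApp {
  public static void main(String[] args) {
    // "local[2]" runs Spark locally with two worker threads; on a cluster
    // you would pass the cluster's master URL instead.
    SparkConf conf = new SparkConf().setAppName("SimpleApp").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // ... create RDDs and run transformations/actions here ...

    sc.stop(); // release the context's resources when the application is done
  }
}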

 

4. Using the Shell

Spark ships with an interactive shell. In the shell a SparkContext is already defined for you, in the variable sc. If you want the SparkContext to connect to a particular master, pass the option --master URL.

To add new jars to the classpath, use --jars.

 

./bin/spark-shell --master local[4] --jars code.jar

 

5. Introduction to RDDs

There are two ways to create RDDs:

(1) parallelizing an existing collection in your driver program, or

(2) referencing a dataset in an external storage system (HDFS, HBase, or another data source).

 

5.1 Parallelized Collections

Parallelized collections are created by calling JavaSparkContext's parallelize() method on an existing collection. The elements of the collection are copied to the other nodes so that they can be operated on in parallel.

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);

JavaRDD<Integer> distData = sc.parallelize(data);

 

Once created, we can operate on distData in parallel:

distData.reduce((a, b) -> a + b);

Note that the inline function above uses Java 8 lambda syntax. Older versions of Java do not have this feature, so there you implement the interfaces in the org.apache.spark.api.java.function package instead.
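
For example, a small sketch of the same reduce written without lambdas, using an anonymous class that implements Function2 from that package:

import org.apache.spark.api.java.function.Function2;

// The same sum as above, written with an anonymous class instead of a lambda.
int sum = distData.reduce(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) {
    return a + b;
  }
});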

 

Another important parameter for parallelized collections is the number of partitions to cut the dataset into; how Spark splits the collection affects computing efficiency. Spark runs one task for each partition, and typically 2-4 partitions per CPU work well, so choose the partition count accordingly:

sc.parallelize(data, 10); // specify the number of partitions

 

5.2 External Datasets

Spark can create RDDs from many external data sources: the local file system, HDFS, HBase, Amazon S3, and so on. Supported file types include text files, SequenceFiles, and any other Hadoop InputFormat.

For example, to read a local text file as a collection of lines:

JavaRDD<String> distFile = sc.textFile("data.txt");

 

textFile also accepts a directory, or a path with wildcards:

sc.textFile("/home/ylf/examples/")

sc.textFile("/home/ylf/examples/*.txt")

 

6. RDD Operations

RDDs support two types of operations: transformations and actions.

(1) Transformations: create a new dataset from an existing one.

(2) Actions: run a computation on the dataset and return a value to the driver program.

For example, map is a transformation: it passes each element of the dataset through a function and returns a new RDD of the results.

reduce, on the other hand, is an action that aggregates all the elements of the RDD. One exception to note: reduceByKey is not an action; it returns an RDD.

 

All transformations are lazy: when you call a transformation, nothing is executed right away. The transformations are only computed when an action needs a result, which lets Spark run more efficiently.

 

Another useful facility is in-memory persistence, via persist() or cache().

 

Here is a simple example:

JavaRDD<String> lines = sc.textFile("data.txt");

JavaRDD<Integer> lineLengths = lines.map(line -> line.length());

int totalLength = lineLengths.reduce((a, b) -> a + b);

 

In this example, lineLengths is not computed immediately; because transformations are lazy, the whole computation only starts when reduce is called.

If we need to use lineLengths again later, we can call

lineLengths.persist()

before the reduce, which would cause lineLengths to be saved in memory after the first time it is computed.

 

7. Passing Functions to Spark

Spark relies on functions passed from the driver program to run its computations. In Java, there are two ways to provide such a function:

(1) implement one of the interfaces in the org.apache.spark.api.java.function package, or

(2) use the lambda expressions built into Java 8.

 

We have already seen lambda examples above, so here is the traditional approach:

// map function

import org.apache.spark.api.java.function.Function;

class GetLineLength implements Function<String, Integer> {
  public Integer call(String s) {
    return s.length();
  }
}

 

// reduce function

import org.apache.spark.api.java.function.Function2;

class Sum implements Function2<Integer, Integer, Integer> {
  public Integer call(Integer a, Integer b) {
    return a + b;
  }
}

 

// driver function

JavaRDD<String> file = sc.textFile("xxx");

JavaRDD<Integer> lineLengths = file.map(new GetLineLength());

int total = lineLengths.reduce(new Sum());

 

 

8. Working with Key-Value Pairs

When counting word frequencies, we need key-value pairs to hold the count for each word; hence the need for a pair type. Spark RDDs support many element types, but only key-value RDDs get a few special operations. In Java, a key-value pair is represented by the scala.Tuple2 class; simply call new Tuple2(a, b) to create one.

 

The resulting RDD type changes as well, to JavaPairRDD<T1, T2>:

JavaRDD<String> file = sc.textFile("xxx");

JavaPairRDD<String, Integer> pairs = file.mapToPair(line -> new Tuple2<>(line, 1));

JavaPairRDD<String, Integer> counters = pairs.reduceByKey((a, b) -> a + b);

counters.sortByKey();

counters.collect() brings the results back to the driver as a Java list.

 

If you want to use a custom object as the key, remember to override hashCode() and equals(); for example, see the sketch below.
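
As an illustration, here is a minimal sketch of such a key class; WordKey is purely hypothetical, not part of the Spark API:

import java.util.Objects;

// Hypothetical custom key type. hashCode() and equals() must be consistent
// so that reduceByKey/groupByKey can shuffle and merge records correctly,
// and the class must be serializable so Spark can ship it between nodes.
public class WordKey implements java.io.Serializable {
  private final String word;

  public WordKey(String word) {
    this.word = word;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) return true;
    if (!(other instanceof WordKey)) return false;
    return Objects.equals(word, ((WordKey) other).word);
  }

  @Override
  public int hashCode() {
    return Objects.hash(word);
  }
}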

 

9. Common Transformations and Actions

Transformations

The following table lists some of the transformations supported by Spark; a short code sketch follows the table.

Transformation: Meaning

map(func): Return a new distributed dataset formed by passing each element of the source through the function func.

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.

flatMap(func): Similar to map, but each input item can be mapped to zero or more output items; the returned sequences are flattened (merged) into a single dataset.

mapPartitions(func): Similar to map, but runs on each partition (block) of the RDD rather than on each element, so func must be of type Iterator<T> => Iterator<U>.

mapPartitionsWithIndex(func): Similar to mapPartitions, but func is also given the partition index, so its type becomes (Int, Iterator<T>) => Iterator<U>.

sample(withReplacement, fraction, seed): Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.

union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset): Return a new RDD that contains the intersection of elements in the source dataset and the argument.

distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.

groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: if you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: by default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

cogroup(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

cartesian(otherDataset): When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

pipe(command, [envVars]): Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

coalesce(numPartitions): Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition(numPartitions): Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

repartitionAndSortWithinPartitions(partitioner): Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
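
As a quick illustration, here is a minimal sketch that chains a few of these transformations (flatMap, filter, distinct). It assumes sc is an existing JavaSparkContext and uses the 1.x Java API, where flatMap returns an Iterable; data.txt is just a placeholder path:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// Split each line into words, keep only non-empty words, and deduplicate.
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
JavaRDD<String> nonEmpty = words.filter(word -> !word.isEmpty());
JavaRDD<String> uniqueWords = nonEmpty.distinct();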

 

Actions

The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python) and the pair RDD functions doc (Scala, Java) for details. A short code sketch follows the table.

Action: Meaning

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count(): Return the number of elements in the dataset.

first(): Return the first element of the dataset (similar to take(1)).

take(n): Return an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural order or a custom comparator.

saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path) (Java and Scala): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).

saveAsObjectFile(path) (Java and Scala): Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

countByKey(): Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.

foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an accumulator variable (see below) or interacting with external storage systems.
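
Continuing the earlier sketch, a few of these actions applied to the uniqueWords RDD from the transformation example; the output path is a placeholder:

import java.util.List;

long howMany = uniqueWords.count();                 // number of distinct words
List<String> firstTen = uniqueWords.take(10);       // first 10 elements, returned to the driver
uniqueWords.saveAsTextFile("output/unique-words");  // write the results out as text files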

 

10. RDD Persistence

Another strength of Spark is in-memory persistence of datasets. When you persist an RDD, each node in the cluster caches the partitions it computes so they can be reused by later actions, which often makes subsequent runs much faster (frequently 10x faster than Hadoop). Caching is also a key tool for fast iterative computation.

 

Persisting an RDD is simple: just call persist() or cache() on it. After the first action computes the RDD, the result is kept for later reuse. Spark's cache is also fault tolerant: if the data on some node is lost, the cluster recomputes it from the transformations that originally created it.

 

There is more than one kind of cache: data can be persisted not only in memory but also on disk, and it can be stored serialized, among other options. You choose the storage level when calling persist(); cache() defaults to in-memory caching.

Storage Level: Meaning

MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

DISK_ONLY: Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the levels above, but replicate each partition on two cluster nodes.

OFF_HEAP (experimental): Store RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, as the RDDs reside in Tachyon, the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory in Tachyon is discardable. Thus, Tachyon does not attempt to reconstruct a block that it evicts from memory.

 

 

Of course, memory is not unlimited. Spark automatically monitors cache usage on each node and drops old partitions in LRU (least-recently-used) fashion; you can also remove an RDD from the cache manually with rdd.unpersist().
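
As a small sketch of picking a non-default storage level from Java and later dropping the cache, applied to the lineLengths RDD from the earlier example (this assumes the convenience constants in org.apache.spark.api.java.StorageLevels):

import org.apache.spark.api.java.StorageLevels;

// Keep the data serialized in memory and spill to disk if it does not fit.
lineLengths.persist(StorageLevels.MEMORY_AND_DISK_SER);

// ... run several actions that reuse lineLengths ...

// Manually evict the cached partitions once they are no longer needed.
lineLengths.unpersist();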

 

11. Shared Variables

When executing tasks, each node works on its own copies of the variables it needs; nothing is kept in sync, because synchronization would hurt performance. Some scenarios, however, genuinely need shared state, so Spark makes a compromise and provides exactly two types of shared variables: broadcast variables and accumulators.

11.1 Broadcast Variables

Broadcast variables let the program keep a read-only variable cached on every machine, rather than shipping a copy of it with every task.

Broadcast variables are created from an ordinary variable v with SparkContext.broadcast(v). The value can be read with value(), but it cannot be modified: it is read-only.

Broadcast<int[]> broadcastVar = sc.broadcast(new int[]{1,2,3});

broadcastVar.value(); // returns [1, 2, 3]

 

11.2 Accumulators

Accumulators are variables that can be "added to" in a distributed computation:

Accumulator<Integer> accum = sc.accumulator(0);

sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> accum.add(x));

accum.value();

Even though the data is processed in a distributed environment, the final value of accum is still consistent; only the order in which the elements are added may vary, for example 1+2+3+4 or 2+4+3+1, and so on.

 

Here we accumulate the built-in Integer type, but we can also accumulate our own types by providing a custom AccumulatorParam.

An AccumulatorParam has two methods: zero (which provides a "zero value" for the type) and addInPlace (which adds two values together).

import org.apache.spark.AccumulatorParam;

class VectorAccumulatorParam implements AccumulatorParam<Vector> {
  public Vector zero(Vector initialValue) {
    return Vector.zeros(initialValue.size());
  }
  public Vector addInPlace(Vector v1, Vector v2) {
    v1.addInPlace(v2); // this is where you define how two of your objects are added
    return v1;
  }
}

 

With this in place, we can accumulate our custom type:

Accumulator<Vector> vecAccum = sc.accumulator(new Vector(...), new VectorAccumulatorParam());

The first argument is the initial value, and the second is the addition logic we just defined.

 

Appendix: how to run a packaged jar

You can simply look at the run-example script in the bin directory of your Spark installation:

$cd $SPARK_HOME/bin
$vi run-example

You will see that the command it ultimately runs is

./spark-submit --master xxxx --class $Main-class $Jar args
So all you need to do is specify the master, the jar, and the class containing the main function.
