Spark Series: Spark Study Notes

Spark

  1. Read the official documentation
    Spark Quick Start
    Spark Programming Guide
    Spark SQL, DataFrames and Datasets Guide
    Cluster Mode Overview
    Spark Standalone Mode

Key concept: resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel

two types of shared variables: broadcast variables & accumulators
A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
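
A minimal sketch of both kinds of shared variables in spark-shell (where sc is predefined; the lookup map and counter below are illustrative, and sc.accumulator is the Spark 1.x API):

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // read-only, cached on every node
val badKeys = sc.accumulator(0)                      // add-only counter

sc.parallelize(Seq("a", "b", "x")).foreach { k =>
  if (!lookup.value.contains(k)) badKeys += 1        // tasks can only add to it
}
println(badKeys.value)                               // the driver reads the total: 1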

  1. Learn from examples: Databricks Spark Reference Applications

Spark Streaming
When data is streamed into Spark, there are two common use cases covered:

Windowed Calculations means that you only care about data received in the last N amount of time. When monitoring your web servers, perhaps you only care about what has happened in the last hour.
Spark Streaming conveniently splits the input data into the desired time windows for easy processing, using the window function of the streaming library.
The foreachRDD function allows you to access the RDDs created at each time interval.
Cumulative Calculations means that you want to keep cumulative statistics, while streaming in new data to refresh those statistics. In that case, you need to maintain the state for those statistics.
The Spark Streaming library has some convenient functions for maintaining state to support this use case, updateStateByKey.
Reusing code from Batching covers how to organize business logic code from the batch examples so that code can be reused in Spark Streaming.
The Spark Streaming library has transform functions which allow you to apply arbitrary RDD-to-RDD functions, and thus to reuse code from the batch mode of Spark.
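
A minimal sketch of the windowed and cumulative patterns described above, assuming a stream of words read from a netcat source on localhost:9999 (names such as ssc are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("checkpoint")                          // required for stateful operations

val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))

// Windowed calculation: counts over the last 60 seconds, recomputed every 10 seconds
val windowedCounts = pairs.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))

// Cumulative calculation: running totals maintained with updateStateByKey
val runningCounts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))
}

windowedCounts.print()
runningCounts.print()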


rsync: a Linux file-synchronization tool


snippets

Spark Application Example

Prerequisites

  • Install the JDK
  • Install Spark
  1. Install the JDK
    Note: do not install it under the default path "c:\Program Files"; a folder name containing spaces can cause problems.

  2. Install Spark
    Download a pre-built package from the Spark website; the file name looks like spark-1.2.0-bin-hadoop2.4.tgz
    Unpack it
    Verify that the installation works

c:
cd c:\dev\spark-1.2.0-bin-hadoop2.4
bin\spark-shell

You can type the following commands to check whether the Spark shell is working properly.

sc.version

sc.appName

When you are done, exit with

:quit
  1. Spark Word Count Example

First, let's run the popular Word Count example using the Spark API. If you have not already started the Spark Scala shell, open one first. The commands for this example are shown below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
 
val txtFile = "README.md"
val txtData = sc.textFile(txtFile)
txtData.cache()

We can call the cache function to keep the RDD produced in the previous step in memory, so that Spark does not have to recompute it for every subsequent query. Note that cache() is a lazy operation: Spark does not store the data in memory right away when we call cache; it only does so when an action is invoked on the RDD.

Now we can call the count function to see how many lines of data the text file contains.

txtData.count()

Then we can run the following commands to perform the word count. The count is shown after each word in the output.

val wcData = txtData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

wcData.collect().foreach(println)

Streaming Data Analysis

Streaming data is essentially a continuous series of data records, typically produced by sources such as sensors, server traffic, and online searches. Common examples of streaming data include user activity on websites, monitoring data, server logs, and other event data.
Stream-processing applications are useful for live dashboards, real-time online recommendations, and instant fraud detection.
If we are building an application that collects, processes, and analyzes streaming data in real time, we need to approach the design from a different perspective than for a batch-processing application.
Three different stream-processing frameworks are listed below:

Apache Samza
Storm
Spark Streaming

Resilient Distributed Datasets (RDDs)
DStream (short for Discretized Stream)

Spark Streaming works by dividing the data stream into batches at a predefined interval (N seconds), called micro-batches, and treating each batch of data as a Resilient Distributed Dataset (RDD). We can then process these RDDs with operations such as map, reduce, reduceByKey, join, and window. The results of these RDD operations are returned in batches as well. Usually we save the results to a data store for later analysis, to generate reports and dashboards, or to send event-based alerts.
Choosing the time interval for Spark Streaming is important and should be based on your use case and data-processing requirements. If the value of N is too low, the micro-batches will not contain enough data to produce meaningful results in the analysis stage.
Compared with Spark Streaming, other stream-processing frameworks process the data stream per event rather than per micro-batch. With the micro-batch approach, we can use the other Spark libraries (such as Core, MLlib, and so on) together with the Spark Streaming API in the same application.
Streaming data can come from many different sources. Some of them are listed below:

Kafka
Flume
Twitter
ZeroMQ
Amazon’s Kinesis
TCP sockets

To write a Spark Streaming program, we need to know two components: DStream and StreamingContext.

DStream

DStream (short for Discretized Stream) is the basic abstraction in Spark Streaming and represents a continuous stream of data. DStreams can be created either from data sources such as Kafka, Flume, and Kinesis, or by applying operations to other DStreams. Internally, a DStream is represented as a sequence of RDD objects.
Similar to the transformations and actions on RDDs, DStreams support the following operations:

map
flatMap
filter
count
reduce
countByValue
reduceByKey
join
updateStateByKey

StreamingContext

Similar to SparkContext in Spark, StreamingContext is the main entry point for all streaming functionality.
StreamingContext has built-in methods for receiving streaming data into a Spark Streaming program.
Using the context, we can create a DStream that represents streaming data from a TCP source, specified by a hostname and a port number. For example, if we use a tool like netcat to test a Spark Streaming program, we would receive the data stream on port 9999 of the machine running netcat (for example, localhost).
When the code is executed, at startup Spark Streaming only sets up the computation it will perform; no real-time processing has started yet. After all the transformations have been set up, to start the processing we finally call the start() method to begin the computation and the awaitTermination() method to wait for the computation to terminate.
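
Putting those pieces together, a minimal sketch of the netcat example (run nc -lk 9999 in another terminal; the application name and 5-second batch interval are arbitrary choices):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(5))      // 5-second batch interval

val lines = ssc.socketTextStream("localhost", 9999)   // DStream from the TCP source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()              // only now does the real-time processing begin
ssc.awaitTermination()   // block until the computation is stopped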

Steps in a Spark Streaming Program

Before we discuss the sample application, let's look at the steps that are specific to Spark Streaming programming:

The Spark Streaming context is used for processing the real-time data stream, so the first step is to initialize the StreamingContext object with two parameters: the SparkContext and the slice (batch) interval. The slice interval sets the update window in which we process the data arriving in the stream. Once the context is initialized, no new computations can be defined or added to an existing context, and only one StreamingContext object can be active at a time.
After the Spark Streaming context is defined, we specify the input data sources by creating input DStreams. In our sample application, the input data source is a log message generator that uses Apache Kafka, a distributed messaging system. The log generator program creates random log messages to simulate the runtime environment of a web server, where log messages are continuously generated as various web applications serve user traffic.
Define the computations on the DStreams using Spark Streaming transformations such as map and reduce.
After the streaming computation logic is defined, we can start receiving and processing the data using the start method of the StreamingContext object created earlier.
Finally, we wait for the stream processing to be stopped using the awaitTermination method of the StreamingContext object.


Anaconda Python distribution
Binary Classification

References
Big Data Processing with Apache Spark - Part 1: Introduction
Big Data Processing with Apache Spark - Part 2: Spark SQL
Big Data Processing with Apache Spark - Part 3: Spark Streaming
Spark Programming Guide


Apache ZooKeeper
Apache Storm
Apache Kafka

Difference between map and flatMap

  • In Spark, the map function applies the specified operation to each input element and returns one object for each input;

  • flatMap, on the other hand, combines two operations, i.e. "map first, then flatten" (see the example below):

    • Step 1: the same as map: apply the specified operation to each input element and return one object for each input

    • Step 2: finally, flatten all the returned collections into a single collection
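
For example, on a small RDD of sentences in spark-shell:

val lines = sc.parallelize(Seq("hello world", "hi"))

lines.map(_.split(" ")).collect()
// Array(Array(hello, world), Array(hi))   -- one object returned per input element

lines.flatMap(_.split(" ")).collect()
// Array(hello, world, hi)                 -- mapped first, then flattened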

Master URLs

Spark Configuration

Precedence

Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.

Properties set directly on the SparkConf > flags passed to spark-submit or spark-shell > options in the spark-defaults.conf file
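
A minimal sketch of this precedence (the property name and values are only examples): a value set programmatically on SparkConf wins over the same key passed with --conf to spark-submit or set in spark-defaults.conf:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("PrecedenceDemo")
  .set("spark.executor.memory", "2g")            // highest precedence

val sc = new SparkContext(conf)
println(sc.getConf.get("spark.executor.memory")) // "2g", even if a flag or default said otherwise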

Launching Applications with spark-submit

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster (use --deploy-mode client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

RDD

Understanding RDD, the Core of Spark
TutorialsPoint: Apache Spark - RDD

Building Spark

API: persist vs cache

1) The RDD cache() method simply calls persist() with the MEMORY_ONLY storage level;

2) With the persist() method you can manually set the StorageLevel to whatever level your job requires;

3) Neither cache nor persist is an action;
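
For example (a minimal sketch in spark-shell, where sc is predefined):

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("README.md")
data.cache()                                   // same as persist(StorageLevel.MEMORY_ONLY)
data.unpersist()                               // drop the level before choosing another one
data.persist(StorageLevel.MEMORY_AND_DISK_SER) // spill to disk, store serialized
data.count()                                   // only this action actually materializes the cache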

transformation vs action

  1. A transformation produces a new RDD; there are many ways to do this, for example generating an RDD from a data source or deriving a new RDD from an existing RDD

  2. An action produces a value or a result (and triggers the actual computation, e.g. materializing an RDD that was marked for caching)

All transformations follow a lazy strategy: submitting a transformation by itself does not execute any computation; the computation is triggered only when an action is submitted.
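
For example, nothing is computed when the transformation chain below is defined; the work happens only when count() is called:

val evens = sc.parallelize(1 to 1000000).map(_ * 2).filter(_ % 4 == 0)  // no computation yet
evens.count()                                                            // the action triggers the whole chain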

Transformations
The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

Transformation Meaning
map(func) Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func) Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func) Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator => Iterator when running on an RDD of type T.
mapPartitionsWithIndex(func) Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator) => Iterator when running on an RDD of type T.
sample(withReplacement, fraction, seed) Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset) Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset) Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numTasks]) Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKey(func, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral “zero” value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numTasks]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable, Iterable)) tuples. This operation is also called groupWith.
cartesian(otherDataset) When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars]) Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process’s stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions) Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions) Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner) Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.

Actions
The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

Action Meaning
reduce(func) Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect() Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count() Return the number of elements in the dataset.
first() Return the first element of the dataset (similar to take(1)).
take(n) Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed]) Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering]) Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path) Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path)
(Java and Scala) Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop’s Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path)
(Java and Scala) Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey() Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func) Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.

Spark Regression

Collaborative Filtering

GraphX

At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.
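
A minimal property-graph sketch in spark-shell (the vertices, edges, and their properties are made up for illustration):

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))          // (VertexId, property)
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))   // directed edges with properties
val graph    = Graph(vertices, edges)

graph.inDegrees.collect()   // e.g. Array((2,1), (3,1))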


Scala Pattern Matching


Spark tip: mapPartitions

mapPartitions is similar to map: map operates on each individual element of the RDD, while mapPartitions (and foreachPartition) operates on the iterator of each partition of the RDD. If the processing needs to create expensive extra objects repeatedly (for example, when writing the RDD's data to a database via JDBC, map would need one connection per element whereas mapPartitions creates one connection per partition), mapPartitions is far more efficient than map.
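
A sketch of the JDBC pattern described above, one connection per partition instead of one per element (rdd, createConnection, and insert are placeholders, not Spark APIs; rdd is any RDD of records):

rdd.foreachPartition { records =>
  val conn = createConnection()            // hypothetical: open one JDBC connection per partition
  try {
    records.foreach(r => insert(conn, r))  // hypothetical: write each record over the shared connection
  } finally {
    conn.close()
  }
}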


Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).


API

repartition(numPartitions) Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
This function repartitions the RDD (the shuffle distributes records with a hash partitioner).
Internally, repartition(numPartitions) is equivalent to coalesce(numPartitions, shuffle = true); coalesce itself takes the target number of partitions as its first argument and a shuffle flag as its second, which defaults to false.
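
For example (a minimal sketch; the partition counts are arbitrary):

val filtered   = sc.textFile("README.md").filter(_.nonEmpty)
val fewer      = filtered.coalesce(2)        // narrow dependency, no shuffle by default
val rebalanced = filtered.repartition(8)     // same as coalesce(8, shuffle = true), full shuffle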

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral “zero” value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
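
For example, computing a per-key average with aggregateByKey, where the aggregated type (sum, count) differs from the input value type Int (the data is made up):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5)))

val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),        // seqOp: fold one value into the (sum, count) accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2))      // combOp: merge partial (sum, count) pairs across partitions

sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }.collect()
// Array((a,2.0), (b,5.0))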
