

  1. Read the official documentation
    Spark Quick Start
    Spark Programming Guide
    Spark SQL, DataFrames and Datasets Guide
    Cluster Mode Overview
    Spark Standalone Mode

Key concept: the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.

two types of shared variables: broadcast variables & accumulators
A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
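
As a rough illustration of the accumulator half of this description, the sketch below counts blank lines from inside tasks and reads the total on the driver. It assumes Spark 2.x or later on the classpath (where `longAccumulator` is available); the master URL, app name, and sample data are placeholders chosen for the example. In spark-shell, skip the first three statements and use the provided `sc`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master and app name are placeholders for this sketch.
val conf = new SparkConf().setMaster("local[2]").setAppName("accumulator-sketch")
val sc = new SparkContext(conf)

// A named accumulator: tasks may only add to it, the driver reads the total.
val blankLines = sc.longAccumulator("blankLines")

// Hypothetical sample data containing two empty lines.
val lines = sc.parallelize(Seq("a b", "", "c", ""))

// foreach is an action, so the add calls below actually execute.
lines.foreach(line => if (line.isEmpty) blankLines.add(1))

val blankCount: Long = blankLines.value
println(s"blank lines: $blankCount")

sc.stop()
```

Updating the accumulator inside an action (here `foreach`) rather than inside a transformation avoids double counting if a task is re-executed.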

  2. Learn from examples: Databricks Spark Reference Applications

Spark Streaming
When data is streamed into Spark, there are two common use cases covered:

  • Windowed Calculations means that you only care about data received in the last N amount of time. When monitoring your web servers, perhaps you only care about what has happened in the last hour.

    • Spark Streaming conveniently splits the input data into the desired time windows for easy processing, using the window function of the streaming library.

    • The foreachRDD function allows you to access the RDDs created each time interval.

  • Cumulative Calculations means that you want to keep cumulative statistics, while streaming in new data to refresh those statistics. In that case, you need to maintain the state for those statistics.

    • The Spark Streaming library has a convenient function for maintaining state to support this use case: updateStateByKey.

  • Reusing code from Batching covers how to organize business logic code from the batch examples so that code can be reused in Spark Streaming.

    • The Spark Streaming library has transform functions which allow you to apply arbitrary RDD-to-RDD functions, and thus to reuse code from the batch mode of Spark.
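
To make the cumulative case more concrete, here is a minimal sketch of an update function with the shape that updateStateByKey expects: the new values for a key in the current batch plus the previous state, returning the new state. The function name `updateCount` and the sample batches are assumptions made for illustration; the fold only simulates successive micro-batches locally and needs no cluster.

```scala
// (new values for one key in this batch, previous state) => new state
def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
  Some(running.getOrElse(0) + newValues.sum)

// Hypothetical micro-batches for a single key; the second batch is empty.
val batches: Seq[Seq[Int]] = Seq(Seq(1, 1), Seq.empty, Seq(1, 1, 1))

// Feed the batches through the update function one after another,
// carrying the state forward the way the streaming library would.
val finalState: Option[Int] =
  batches.foldLeft(Option.empty[Int])((state, batch) => updateCount(batch, state))

println(finalState)
```

In a real streaming job the same function would be passed to updateStateByKey on a DStream of (key, Int) pairs, and a checkpoint directory has to be configured for stateful operations.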

rsync: a Linux file synchronization tool


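Spark application example

- Install the JDK
- Install Spark

  1. Install the JDK
    Note: do not install into the default path "c:\Program Files"; the space in the folder name causes problems.

  2. Install Spark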

cd c:\dev\spark-1.2.0-bin-hadoop2.4

You can enter the following commands to check that the Spark Shell is working properly.




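  1. Spark Word Count example

First, let's run the popular Word Count example using the Spark API. If the Spark Scala Shell is not running yet, open a Scala Shell window first. The commands for this example are shown below: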

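import org.apache.spark.SparkContext
// Implicit conversions for pair-RDD operations such as reduceByKey (required before Spark 1.3).
import org.apache.spark.SparkContext._

// Path to the input text file (left empty in these notes).
val txtFile = ""
// Load the file as an RDD of lines; sc is the SparkContext provided by the shell.
val txtData = sc.textFile(txtFile)

// Split each line into words, pair each word with 1, then sum the counts per word.
val wcData = txtData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)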
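
To sanity-check what this pipeline computes without a cluster or an input file, the following sketch runs the same steps on an ordinary Scala collection. The sample lines are made up, and `groupBy` followed by a sum is only a local stand-in for `reduceByKey(_ + _)`, not how Spark implements it.

```scala
// Hypothetical in-memory input standing in for the lines of a text file.
val lines = Seq("to be or", "not to be")

// Same shape as the RDD pipeline: split into words, pair each word with 1.
val pairs = lines.flatMap(l => l.split(" ")).map(word => (word, 1))

// Local stand-in for reduceByKey(_ + _): group by word, then sum the 1s.
val counts: Map[String, Int] =
  pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

counts.foreach(println)
```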




Apache Samza

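Resilient Distributed Datasets (RDDs)
DStream (short for Discretized Stream)

Spark Streaming works by dividing the data stream into batches (called micro-batches) at a predefined interval (N seconds) and then treating each batch as a Resilient Distributed Dataset (RDD). We can then process these RDDs with operations such as map, reduce, reduceByKey, join, and window. The results of these RDD operations are returned in batches. Typically we save the results to a data store for later analysis and for generating reports and dashboards, or send event-based alerts.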

Amazon’s Kinesis
TCP sockets



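DStream (short for Discretized Stream) is the basic abstraction in Spark Streaming and represents a continuous stream of data. A DStream can be created either from input sources such as Kafka, Flume, and Kinesis, or by applying operations to other DStreams. Internally, a DStream is represented as a sequence of RDD objects.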






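Once the Spark Streaming context is defined, we specify the input data sources by creating input DStreams. In our sample application, the input source is a log message generator that uses Apache Kafka, a distributed data store and messaging system. The log generator creates random log messages to simulate the runtime environment of a web server, where log messages are produced continuously as traffic from users of various web applications.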

Anaconda Python distribution
Binary Classification


Big Data Processing with Apache Spark, Part 1: Introduction
Big Data Processing with Apache Spark, Part 2: Spark SQL
Big Data Processing with Apache Spark, Part 3: Spark Streaming
Spark Programming Guide

Apache ZooKeeper
Apache Storm
Apache Kafka

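Difference between map and flatMap

  • In Spark, the map function applies the given operation to each input item and returns exactly one object per input item;

  • flatMap combines two operations, namely "map first, then flatten":

    • Step 1: same as map: apply the given operation to each input item and return one object (a collection) per input item

    • Step 2: finally, flatten all of those returned collections into a single collection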
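
The difference is easy to see on a plain Scala List, whose map and flatMap follow the same "map, then flatten" contract described above. This is a local sketch with made-up input, not Spark code; on an RDD the same calls return RDDs instead of Lists.

```scala
// Hypothetical input: two lines of text.
val lines = List("hello world", "hi")

// map: exactly one output per input, so the result is a list of lists.
val mapped: List[List[String]] = lines.map(_.split(" ").toList)

// flatMap: map each line to its words, then flatten into one list of words.
val flat: List[String] = lines.flatMap(_.split(" ").toList)

println(mapped)
println(flat)
```

Flattening the map result by hand (`mapped.flatten`) gives the same list as `flatMap`, which is exactly the two-step description above.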

Master URLs

Spark Configuration


Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.

Properties set directly on the SparkConf > flags passed to spark-submit or spark-shell > options in the spark-defaults.conf file
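
The precedence rule can be modelled as a right-biased merge of three property maps. The sketch below only illustrates the ordering; it is not how Spark itself loads configuration, and all property values are hypothetical.

```scala
// Hypothetical values from the three sources, lowest precedence first.
val fromDefaultsFile = Map(
  "spark.master" -> "local[2]",
  "spark.executor.memory" -> "1g",
  "spark.app.name" -> "from-file")
val fromSubmitFlags = Map(
  "spark.master" -> "yarn",
  "spark.executor.memory" -> "4g")
val fromSparkConf = Map(
  "spark.executor.memory" -> "8g")

// ++ keeps the right-hand value on key collisions, so later sources win:
// spark-defaults.conf < spark-submit flags < SparkConf.
val effective: Map[String, String] =
  fromDefaultsFile ++ fromSubmitFlags ++ fromSparkConf

effective.foreach { case (k, v) => println(s"$k=$v") }
```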

Launching Applications with spark-submit

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes (as shown).
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark:// \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark:// \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar

# Run on a YARN cluster (use --deploy-mode client for client mode)
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark:// \
  examples/src/main/python/

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos:// \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar


Tutorialspoint: Apache Spark - RDD

Building Spark

API: persist vs cache




transformation vs action

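  1. A transformation produces a new RDD. There are many ways to get one, for example creating a new RDD from a data source or deriving a new RDD from an existing RDD. Transformations are lazy: nothing is computed until an action needs the result.

  2. An action runs the computation and returns a value or a result to the driver program (or writes it to storage). An action does not by itself cache the RDD in memory; that only happens if persist or cache was called on the RDD.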
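
The laziness of transformations can be observed with a small experiment. In the sketch below an accumulator counts how often the function passed to `map` is called: the count stays at zero until an action (`reduce`) runs. It assumes Spark 2.x or later on the classpath; the master URL, app name, and data are placeholders, and in spark-shell the provided `sc` should be used instead of creating a new context. Accumulator updates made inside transformations may be applied more than once if tasks are retried, so this is for demonstration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master and app name are placeholders for this sketch.
val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-sketch")
val sc = new SparkContext(conf)

// Counts how many times the map function is actually invoked.
val calls = sc.longAccumulator("calls")

// Transformation: only records the lineage, nothing is computed yet.
val doubled = sc.parallelize(1 to 4).map { x => calls.add(1); x * 2 }
val callsBeforeAction: Long = calls.value

// Action: triggers the computation and returns a value to the driver.
val total: Int = doubled.reduce(_ + _)
val callsAfterAction: Long = calls.value

println(s"before=$callsBeforeAction after=$callsAfterAction total=$total")

sc.stop()
```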


The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

Transformation | Meaning
--- | ---
map(func) | Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func) | Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func) | Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func) | Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed) | Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset) | Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset) | Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numTasks]) | Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKey(func, [numTasks]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numTasks]) | When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numTasks]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numTasks]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset) | When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars]) | Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions) | Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions) | Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner) | Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
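
Of the transformations listed, aggregateByKey is probably the least obvious, because the result type U differs from the value type V. The sketch below computes a per-key average by aggregating (sum, count) pairs. It assumes Spark on the classpath; the master URL, app name, and sample scores are placeholders, and in spark-shell the provided `sc` should be used instead of creating a new context.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master and app name are placeholders for this sketch.
val conf = new SparkConf().setMaster("local[2]").setAppName("aggregate-sketch")
val sc = new SparkContext(conf)

// Hypothetical (key, score) pairs: V is Int.
val scores = sc.parallelize(Seq(("a", 1), ("b", 4), ("a", 3)))

// U is (sum, count). zeroValue is the neutral element (0, 0).
val sumCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into a partition-local accumulator
  (x, y) => (x._1 + y._1, x._2 + y._2))   // combOp: merge accumulators from different partitions

// Turn (sum, count) into an average and bring the small result to the driver.
val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }.collectAsMap()

averages.foreach(println)

sc.stop()
```

Unlike groupByKey followed by a local average, this combines values on each partition before the shuffle, which is the reason the table recommends reduceByKey or aggregateByKey for aggregations.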

The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

Action | Meaning
--- | ---
reduce(func) | Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count() | Return the number of elements in the dataset.
first() | Return the first element of the dataset (similar to take(1)).
take(n) | Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed]) | Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering]) | Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala) | Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path) (Java and Scala) | Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey() | Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.

Spark Regression

Collaborative Filtering


At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.

Scala Pattern Matching



Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).
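
As a sketch of the advice above, the example below broadcasts a small lookup table once and then reads it through `.value` inside a function that runs on the cluster. It assumes Spark on the classpath; the master URL, app name, lookup table, and codes are placeholders, and in spark-shell the provided `sc` should be used instead of creating a new context.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master and app name are placeholders for this sketch.
val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-sketch")
val sc = new SparkContext(conf)

// Hypothetical read-only lookup table, shipped to each executor once.
val countryNames = sc.broadcast(Map("us" -> "United States", "tw" -> "Taiwan"))

// Hypothetical input codes; "xx" has no entry in the table.
val codes = sc.parallelize(Seq("tw", "us", "xx"))

// Inside the task, read the table via .value instead of capturing the Map itself.
val resolved: Array[String] =
  codes.map(code => countryNames.value.getOrElse(code, "unknown")).collect()

resolved.foreach(println)

sc.stop()
```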


