Chapter 7: Spark Streaming Core Concepts and Programming

7-1 Course Outline

Core Concepts

Transformation

Output Operations

Hands-On Examples

7-2 Core Concept: StreamingContext

Reference documentation:

http://spark.apache.org/docs/latest/streaming-programming-guide.html#initializing-streamingcontext

 

Initializing StreamingContext

To initialize a Spark Streaming program, a StreamingContext object has to be created which is the main entry point of all Spark Streaming functionality.

StreamingContext object can be created from a SparkConf object.

import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))

The appName parameter is a name for your application to show on the cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special “local[*]” string to run in local mode. In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass “local[*]” to run Spark Streaming in-process (detects the number of cores in the local system). Note that this internally creates a SparkContext (starting point of all Spark functionality) which can be accessed as ssc.sparkContext.

 

The batch interval must be set based on the latency requirements of your application and available cluster resources. See the Performance Tuning section for more details.

In other words, the batch interval can be set according to the latency requirements of your application and the resources available on the cluster.

A StreamingContext object can also be created from an existing SparkContext object.

import org.apache.spark.streaming._

val sc = ... // existing SparkContext
val ssc = new StreamingContext(sc, Seconds(1))

After a context is defined, you have to do the following.

1. Define the input sources by creating input DStreams.

2. Define the streaming computations by applying transformation and output operations to DStreams.

3. Start receiving data and processing it using streamingContext.start().

4. Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().

5. The processing can be manually stopped using streamingContext.stop().

Once the StreamingContext has been defined, these are the steps you carry out.
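A minimal skeleton of these steps, assuming a socket source on localhost port 9999 (hypothetical host and port), might look like this:

import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setAppName("LifecycleSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)   // 1. define the input source
lines.flatMap(_.split(" ")).print()                   // 2. define transformations and an output operation
ssc.start()                                           // 3. start receiving and processing data
ssc.awaitTermination()                                // 4. wait for termination (5. stop manually with ssc.stop())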

Points to remember:

  • Once a context has been started, no new streaming computations can be set up or added to it.
  • Once a context has been stopped, it cannot be restarted.
  • Only one StreamingContext can be active in a JVM at the same time.
  • stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
  • A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created. The last two points are sketched in the snippet below.
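A hedged sketch of those last two points, assuming ssc is an existing, started StreamingContext:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = ssc.sparkContext          // keep a handle on the underlying SparkContext
ssc.stop(false)                    // stopSparkContext = false: stop only the streaming side

// the still-running SparkContext can back a new StreamingContext
val ssc2 = new StreamingContext(sc, Seconds(10))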

 

7-3 Core Concept: DStream

Reference documentation:

http://spark.apache.org/docs/latest/streaming-programming-guide.html#initializing-streamingcontext

Discretized Streams (DStreams)

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset (see the Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval.

 

Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream.

Operators applied to a DStream, such as map/flatMap, are translated under the hood into the same operation on every RDD inside the DStream, because a DStream is made up of the RDDs of successive batches.

 

These underlying RDD transformations are computed by the Spark engine. The DStream operations hide most of these details and provide the developer with a higher-level API for convenience. These operations are discussed in detail in later sections.
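To make this concrete, here is a minimal sketch (assuming lines is a DStream[String], e.g. from ssc.socketTextStream) showing that a DStream-level flatMap and a transform over each underlying batch RDD express the same computation:

// DStream-level operator
val words1 = lines.flatMap(_.split(" "))

// equivalent view: apply an RDD-to-RDD function to every batch RDD of the DStream
val words2 = lines.transform(rdd => rdd.flatMap(_.split(" ")))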

 

7-4 Core Concepts: Input DStreams and Receivers

Reference documentation:

http://spark.apache.org/docs/latest/streaming-programming-guide.html#initializing-streamingcontext

Input DStreams are DStreams representing the stream of input data received from streaming sources. In the quick example, lines was an input DStream as it represented the stream of data received from the netcat server. Every input DStream (except file stream, discussed later in this section) is associated with a Receiver (Scala doc / Java doc) object which receives the data from a source and stores it in Spark’s memory for processing.

 

Spark Streaming provides two categories of built-in streaming sources.

  • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
  • Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies as discussed in the linking section.

We are going to discuss some of the sources present in each category later in this section.

Note that, if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams (discussed further in the Performance Tuning section). This will create multiple receivers which will simultaneously receive multiple data streams. But note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application. Therefore, it is important to remember that a Spark Streaming application needs to be allocated enough cores (or threads, if running locally) to process the received data, as well as to run the receiver(s).
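As a hedged illustration of this point (hypothetical ports), two socket receivers can be created and merged with union; the local master is sized so that at least one thread is left over for processing:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 2 receivers + at least 1 processing thread => local[3] or more
val conf = new SparkConf().setMaster("local[3]").setAppName("MultiReceiverSketch")
val ssc = new StreamingContext(conf, Seconds(5))

// each socketTextStream creates its own receiver, which occupies one core/thread
val stream1 = ssc.socketTextStream("localhost", 9997)   // hypothetical port
val stream2 = ssc.socketTextStream("localhost", 9998)   // hypothetical port

stream1.union(stream2).count().print()

ssc.start()
ssc.awaitTermination()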

Points to remember

  • When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
  • Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.

 

 

7-5 Core Concepts: Transformations and Output Operations

Reference documentation:

http://spark.apache.org/docs/latest/streaming-programming-guide.html#initializing-streamingcontext

Similar to that of RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDD’s. Some of the common ones are as follows.

  • map(func): Return a new DStream by passing each element of the source DStream through a function func.
  • flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
  • filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
  • repartition(numPartitions): Changes the level of parallelism in this DStream by creating more or fewer partitions.
  • union(otherStream): Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
  • count(): Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
  • reduce(func): Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel.
  • countByValue(): When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
  • reduceByKey(func, [numTasks]): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: by default, this uses Spark's default number of parallel tasks (2 for local mode; in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
  • join(otherStream, [numTasks]): When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
  • cogroup(otherStream, [numTasks]): When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
  • transform(func): Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
  • updateStateByKey(func): Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.

A few of these transformations are worth discussing in more detail.
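Among them, updateStateByKey deserves a quick illustration. A minimal sketch, assuming pairs is a DStream[(String, Int)] of per-batch word counts and a hypothetical checkpoint path (stateful transformations require checkpointing):

// required for stateful transformations; the path is hypothetical
ssc.checkpoint("/tmp/streaming-checkpoint")

// merge this batch's counts for a key into the running total
def updateFunc(newValues: Seq[Int], running: Option[Int]): Option[Int] =
  Some(newValues.sum + running.getOrElse(0))

// a DStream of running counts per word across all batches
val runningCounts = pairs.updateStateByKey[Int](updateFunc _)
runningCounts.print()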

 

Output Operations on DStreams

Output operations allow a DStream’s data to be pushed out to external systems like a database or a file system. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs). The output operations currently defined are print(), saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles, and foreachRDD.
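The most general of these is foreachRDD(func), which hands every batch RDD to a user-supplied function, typically to write the data out. A minimal sketch, assuming result is a DStream[(String, Int)] as in the examples below, with a hypothetical saveRecord helper standing in for a real database client:

result.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // in a real job, open one connection per partition here (connection handling omitted)
    partition.foreach { record =>
      saveRecord(record)   // hypothetical helper that writes one record to an external store
    }
  }
}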

 

7-6 Hands-On: Processing Socket Data with Spark Streaming

Code location:

https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/sparktrain/src/main/scala/com/imooc/spark/NetworkWordCount.scala

Source code:

package com.imooc.spark

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark Streaming processing socket data.
 *
 * Test with: nc (e.g. nc -lk 6789), then type words into the terminal.
 */
object NetworkWordCount {

  def main(args: Array[String]): Unit = {

    // local[2]: the socket receiver occupies one thread, so at least one more is needed for processing
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    /**
     * Creating a StreamingContext requires two parameters: a SparkConf and the batch interval
     */
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 6789)

    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

 

 

7-7 Hands-On: Processing File System Data with Spark Streaming

 

Code location:

https://gitee.com/sag888/big_data/blob/master/Spark%20Streaming%E5%AE%9E%E6%97%B6%E6%B5%81%E5%A4%84%E7%90%86%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98/project/l2118i/sparktrain/src/main/scala/com/imooc/spark/FileWordCount.scala

Source code:

package com.imooc.spark

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Use Spark Streaming to process data from a file system (local/HDFS).
 */
object FileWordCount {

  def main(args: Array[String]): Unit = {

    // a file stream does not use a receiver, so a single local thread ("local") is enough here
    val sparkConf = new SparkConf().setMaster("local").setAppName("FileWordCount")

    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // only files moved into this directory after the job starts are picked up
    val lines = ssc.textFileStream("file:///Users/rocky/data/imooc/ss/")

    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

 

 

 
