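1 Overview

Spark Streaming is an extension of the Spark core API for scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources, such as Kafka, Flume, Kinesis, or TCP sockets, and processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and so on. In fact, you can also apply Spark's machine learning and graph processing algorithms to data streams.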

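Overall, it helps to think in three stages: input, computation, and output, as shown in the figure below:

1. Input: data can come from Kafka, Flume, HDFS, and so on.
2. Computation: operators such as map, reduce, and join run on the Spark engine (much as with RDDs, and more convenient to use).
3. Output: results can be written to HDFS, databases, HBase, and so on.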

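2 How the data is processed

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which the Spark engine then processes to produce the final stream of results, also in batches.

As the figure also shows, the input data is split into multiple batches for processing. Strictly speaking, Spark Streaming is not a true real-time framework, because it processes data batch by batch.

Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.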
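
The batching idea can be tried in plain Scala without Spark. What follows is only a toy sketch under stated assumptions: `Event`, `toBatches`, and the sample timestamps are hypothetical names and values chosen for illustration, and a `Seq` stands in for an RDD. Unlike Spark Streaming, which generates an RDD for every interval even when no data arrived, this sketch simply skips empty intervals.

```scala
object MicroBatchSketch {
  // Hypothetical event type: a payload plus its arrival time in milliseconds.
  final case class Event(timestampMs: Long, value: String)

  // Group events into consecutive batches of batchMs milliseconds.
  // Each result pair is (batch start time, values that arrived in that batch).
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timestampMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (index, batch) =>
        (index * batchMs, batch.sortBy(_.timestampMs).map(_.value))
      }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100L, "a"), Event(4900L, "b"), Event(5000L, "c"), Event(12000L, "d"))
    // With a 5-second interval, "a" and "b" share a batch; "c" and "d" each get their own.
    toBatches(events, 5000L).foreach { case (start, values) =>
      println(s"batch starting at $start ms: $values")
    }
  }
}
```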

3 A word count demo for a closer look

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCountApp {

  def main(args: Array[String]): Unit = {
    // Create the SparkConf; local[2] leaves one thread for the receiver and one for processing
    val conf = new SparkConf().setAppName("SocketWordCountApp").setMaster("local[2]")
    // Build a StreamingContext with a 5-second batch interval; under the hood this creates a SparkContext
    val ssc = new StreamingContext(conf, Seconds(5))
    // socketTextStream creates a DStream; it returns a ReceiverInputDStream[String] (analyzed from the source code later)
    val DStream = ssc.socketTextStream("192.168.137.130", 9998)
    // Word count (compare it with the RDD version of word count)
    DStream.flatMap(x => x.split(",")).map(x => (x, 1)).reduceByKey(_ + _).print()

    // Start the computation
    ssc.start()
    // Wait for the computation to terminate
    ssc.awaitTermination()
  }
}

We then use the nc command to listen on port 9998 and type in some data:

[hadoop@hadoop ~]$ nc -lp 9998  
a,a,a,a,b,b,b

Check the result:

(b,3)
(a,4)
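
For a single batch, the same chain of operators can be tried on an ordinary Scala collection, which makes it easy to see why the line above produces these counts. This is a minimal sketch, not Spark code: `BatchWordCountSketch` and `wordCount` are hypothetical names, and `groupBy` followed by a sum plays the role that `reduceByKey(_ + _)` plays on a DStream.

```scala
object BatchWordCountSketch {
  // Count comma-separated words in the lines of one batch.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(","))            // split each line into words
      .map(word => (word, 1))           // pair every word with a count of 1
      .groupBy(_._1)                    // collect the pairs per word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the counts

  def main(args: Array[String]): Unit =
    wordCount(Seq("a,a,a,a,b,b,b")).foreach(println)
}
```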

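4 Initializing StreamingContext

To initialize a Spark Streaming program, you must create a StreamingContext object, which is the main entry point for all Spark Streaming functionality. A StreamingContext can also be created from an existing SparkContext object.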

val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))

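After a context is defined, you have to do the following:
1. Define the input sources by creating input DStreams.
2. Define the streaming computations by applying transformations and output operations to DStreams.
3. Start receiving and processing data with StreamingContext.start().
4. Wait for the processing to stop (manually or due to an error) with StreamingContext.awaitTermination().
5. Optionally, stop the processing manually with StreamingContext.stop() (in practice a streaming job is usually left running).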

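Notes

a. Once a StreamingContext has been started, no new streaming computations can be set up or added to it.
b. Once a StreamingContext has been stopped, it cannot be restarted.
c. Only one StreamingContext can be active in a JVM at the same time.
d. By default, stop() on a StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional stopSparkContext parameter of stop() to false.
e. A SparkContext can be reused to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next one is created.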
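
Notes d and e can be tried out with a short program. The sketch below makes a few assumptions of its own: the object name `ReuseSparkContextSketch` is hypothetical, the input comes from `queueStream` (pre-built RDDs) so that no socket is needed, and the 3-second timeout is an arbitrary choice that should let a few one-second batches run.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object ReuseSparkContextSketch {

  // Returns true if the SparkContext is still alive after the first StreamingContext stops.
  def run(): Boolean = {
    val conf = new SparkConf().setAppName("ReuseSparkContextSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // First StreamingContext, created from the existing SparkContext.
    val ssc1 = new StreamingContext(sc, Seconds(1))
    // queueStream feeds pre-built RDDs into the stream, so no receiver is involved.
    val queue = mutable.Queue[RDD[String]](sc.parallelize(Seq("a,a,b")))
    ssc1.queueStream(queue).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).print()

    ssc1.start()
    ssc1.awaitTerminationOrTimeout(3000L) // let a few batches run
    ssc1.stop(stopSparkContext = false)   // note d: keep the SparkContext alive

    val stillAlive = !sc.isStopped

    // Note e: the same SparkContext can back a new StreamingContext,
    // because the previous one has been stopped.
    val ssc2 = new StreamingContext(sc, Seconds(2))
    ssc2.stop(stopSparkContext = true)    // this time the shared SparkContext stops too

    stillAlive
  }

  def main(args: Array[String]): Unit =
    println(s"SparkContext alive after first stop: ${run()}")
}
```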

5 Discretized Streams (DStreams)

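A Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark's abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain time interval, as shown in the figure below.

Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the word count example above, the flatMap operation is applied to each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in the figure below.

These underlying RDD transformations are computed by the Spark engine. The DStream operations hide most of these details and give developers a higher-level API for convenience. These operations are discussed in detail in later sections.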
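
One way to make the "operation on every RDD" idea concrete is to write the same step twice: once with the DStream operator and once spelled out per RDD through `transform`. The sketch below is an illustration under assumptions: `PerRddSketch` and its method names are hypothetical, and `queueStream` is used as a stand-in input so that the example needs no socket.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
import scala.collection.mutable

object PerRddSketch {

  // DStream-level operator: Spark applies it to the RDD of every batch.
  def wordsViaDStream(lines: DStream[String]): DStream[String] =
    lines.flatMap(_.split(","))

  // The same computation written explicitly against each batch's RDD.
  def wordsViaTransform(lines: DStream[String]): DStream[String] =
    lines.transform((rdd: RDD[String]) => rdd.flatMap(_.split(",")))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PerRddSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val queue = mutable.Queue[RDD[String]](ssc.sparkContext.parallelize(Seq("a,a,b", "b,c")))
    val lines = ssc.queueStream(queue)

    // Both outputs should list the same words for each batch.
    wordsViaDStream(lines).print()
    wordsViaTransform(lines).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(3000L)
    ssc.stop()
  }
}
```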

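6 Spark Streaming architecture

Spark has several run modes. The walkthrough below uses a diagram of Spark on YARN (cluster mode), so to follow the runtime architecture of Spark Streaming you need to know how a job is submitted on YARN (see the referenced blog post) and how a Spark application runs (see the referenced blog post). Here is a diagram I found online; let's study it step by step: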

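A detailed breakdown of the diagram follows.
Notation: 1, 2, 3, ... mark the Spark on YARN startup flow; (1), (2), (3), ... mark the Spark Streaming execution process.
1. The Spark client submits the job to the ResourceManager (RM).
2. The ResourceManager allocates the first Container for the job and contacts the corresponding NodeManager to create the Spark ApplicationMaster (each SparkContext has one ApplicationMaster).
3. The NodeManager starts the Spark AppMaster.
4. The Spark AppMaster registers with the ResourceManager's AsM, after which users can check the job status in the UI.
5. The ResourceManager tells the NodeManagers to allocate Containers (each container corresponds to one executor).
6. The NodeManagers prepare the resources and assign them to the executors.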

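Spark Streaming execution process
In Spark on YARN cluster mode, the Driver runs inside a container on a NodeManager (NM). It runs the application's main() function and creates the SparkContext, and a StreamingContext is then created on top of the SparkContext (the diagram does not show this, so it is noted here).
(1) The SparkContext registers with the resource manager and requests resources to run Executors.
(2) An Executor starts a Receiver to receive data (Data Received), which is processed in batches.
(3) After receiving data, the Receiver reports to the StreamingContext (which calls the SparkContext underneath). The data is stored in multiple replicas, two by default (the source code walkthrough later shows why).
(4) The Spark ApplicationMaster interacts with the executors (containers) to assign tasks.
(5) Each executor runs multiple tasks.

Finally, the results are saved to HDFS.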

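Spark Streaming data processing

First, Spark Streaming slices the live input data stream into chunks by a time interval Δt (for example, 1 second). It treats each chunk as an RDD and processes it with RDD operations. Each chunk gives rise to a Spark job, and the final results are likewise returned chunk by chunk. This matches what was described above.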

7 Source code walkthrough

Earlier we said that creating a StreamingContext creates a SparkContext under the hood. The source code confirms it:
new StreamingContext(conf,Seconds(5))

/**
   * Create a StreamingContext by providing the configuration necessary for a new SparkContext.
   * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
   * @param batchDuration the time interval at which streaming data will be divided into batches
   */
  def this(conf: SparkConf, batchDuration: Duration) = {
    this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
  }


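Calling new StreamingContext(conf, Seconds(5)) invokes the constructor above:
1. conf: the familiar org.apache.spark.SparkConf object that specifies the Spark parameters.
2. batchDuration: the time interval at which the streaming data is divided into batches.

Next, look at StreamingContext.createNewSparkContext(conf):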

 private[streaming] def createNewSparkContext(conf: SparkConf): SparkContext = {
    new SparkContext(conf)
  }

As you can see, it simply creates a SparkContext for us.

socketTextStream creates a DStream, and its return type is ReceiverInputDStream[String]. Here is the source:

ssc.socketTextStream("192.168.137.130",9998)

/**
   * Creates an input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
   * lines.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   * @see [[socketStream]]
   */
  def socketTextStream(
      hostname: String,
      port: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
  }

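1. It connects to hostname:port over a TCP socket and receives bytes, which are interpreted as UTF-8 encoded, `\n`-delimited lines.
2. The default storage level is StorageLevel.MEMORY_AND_DISK_SER_2. Because it is a default parameter, we did not pass it and simply used the default. (This matches the two replicas mentioned earlier.)
(One more question: do you remember the cache levels in Spark?)
3. It returns a ReceiverInputDStream.
4. It calls the socketStream method.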
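
To see what the default level means, the flags of a StorageLevel can be printed directly. This is a small sketch with a hypothetical helper, `describe`, that only reads public properties of `StorageLevel`; for comparison it also prints `MEMORY_ONLY`, the level that `rdd.cache()` uses. The `_2` suffix corresponds to a replication factor of 2.

```scala
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {

  // Summarize the flags that make up a storage level.
  def describe(level: StorageLevel): String =
    s"memory=${level.useMemory}, disk=${level.useDisk}, " +
      s"serialized=${!level.deserialized}, replicas=${level.replication}"

  def main(args: Array[String]): Unit = {
    // Default for socketTextStream: memory and disk, serialized, two replicas.
    println(describe(StorageLevel.MEMORY_AND_DISK_SER_2))
    // Level used by rdd.cache(): memory only, deserialized, one replica.
    println(describe(StorageLevel.MEMORY_ONLY))
  }
}
```

A different level can be passed as the third argument, for example `ssc.socketTextStream("192.168.137.130", 9998, StorageLevel.MEMORY_AND_DISK_SER)` if replication is not wanted.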

Next, look at the socketStream method:

/**
   * Creates an input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the receive bytes it interpreted as object using the given
   * converter.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param converter     Function to convert the byte stream to objects
   * @param storageLevel  Storage level to use for storing the received objects
   * @tparam T            Type of the objects received (after converting bytes to objects)
   */
  def socketStream[T: ClassTag](
      hostname: String,
      port: Int,
      converter: (InputStream) => Iterator[T],
      storageLevel: StorageLevel
    ): ReceiverInputDStream[T] = {
    new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
  }

The parameters should be clear by now, so let's see what SocketInputDStream[T] actually is.

class SocketInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](_ssc) {

SocketInputDStream extends ReceiverInputDStream, so we are getting closer.

/**
 * Abstract class for defining any [[org.apache.spark.streaming.dstream.InputDStream]]
 * that has to start a receiver on worker nodes to receive external data.
 * Specific implementations of ReceiverInputDStream must
 * define [[getReceiver]] function that gets the receiver object of type
 * [[org.apache.spark.streaming.receiver.Receiver]] that will be sent
 * to the workers to receive data.
 * @param _ssc Streaming context that will execute this input stream
 * @tparam T Class type of the object of this stream
 */
abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)
  extends InputDStream[T](_ssc) {

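1. ReceiverInputDStream is an abstract class that extends InputDStream. It is for input streams that must start a receiver on worker nodes to receive external data.
2. Concrete implementations of ReceiverInputDStream must define the getReceiver function, which returns the receiver object of type org.apache.spark.streaming.receiver.Receiver.
3. The Receiver is sent to the workers, where it receives the data.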
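
To get a feel for what such a Receiver looks like, here is a minimal sketch of a custom line-based socket receiver. It is not Spark's own SocketReceiver; the class name `LineReceiverSketch`, the thread name, and the reconnect messages are assumptions made for illustration. It relies only on the public Receiver API: `onStart`, `onStop`, `store`, `isStopped`, and `restart`.

```scala
import java.io.{BufferedReader, IOException, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class LineReceiverSketch(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  // Called on the worker when the receiver starts; must not block.
  override def onStart(): Unit = {
    new Thread("line-receiver-sketch") {
      setDaemon(true)
      override def run(): Unit = receive()
    }.start()
  }

  // Nothing to clean up here: the receiving loop checks isStopped() and exits.
  override def onStop(): Unit = ()

  // Read lines from the socket and hand each one to Spark with store().
  private def receive(): Unit = {
    var socket: Socket = null
    try {
      socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line)
        line = reader.readLine()
      }
      restart("Connection closed, trying to reconnect")
    } catch {
      case e: IOException => restart(s"Error receiving from $host:$port", e)
    } finally {
      if (socket != null) socket.close()
    }
  }
}
```

Assuming an existing StreamingContext named `ssc`, such a receiver would be plugged in with `ssc.receiverStream(new LineReceiverSketch("192.168.137.130", 9998))`.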

So what is InputDStream[T](_ssc)?

/**
 * This is the abstract base class for all input streams. This class provides methods
 * start() and stop() which are called by Spark Streaming system to start and stop
 * receiving data, respectively.
 * Input streams that can generate RDDs from new data by running a service/thread only on
 * the driver node (that is, without running a receiver on worker nodes), can be
 * implemented by directly inheriting this InputDStream. For example,
 * FileInputDStream, a subclass of InputDStream, monitors a HDFS directory from the driver for
 * new files and generates RDDs with the new files. For implementing input streams
 * that requires running a receiver on the worker nodes, use
 * [[org.apache.spark.streaming.dstream.ReceiverInputDStream]] as the parent class.
 *
 * @param _ssc Streaming context that will execute this input stream
 */
abstract class InputDStream[T: ClassTag](_ssc: StreamingContext)
  extends DStream[T](_ssc) {

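1. This is the abstract base class for all input streams. It provides the methods start() and stop(), which the Spark Streaming system calls to start and stop receiving data.
2. Input streams that can generate RDDs from new data by running a service or thread only on the driver node (that is, without running a receiver on worker nodes) can be implemented by inheriting directly from InputDStream. For example, FileInputDStream, a subclass of InputDStream, monitors an HDFS directory from the driver for new files and generates RDDs from them.
3. Input streams that need to run a receiver on the worker nodes should use org.apache.spark.streaming.dstream.ReceiverInputDStream as the parent class.

Here we can see that InputDStream extends DStream. At last we have reached DStream.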

/**
 * A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous
 * sequence of RDDs (of the same type) representing a continuous stream of data (see
 * org.apache.spark.rdd.RDD in the Spark core documentation for more details on RDDs).
 * DStreams can either be created from live data (such as, data from TCP sockets, Kafka, Flume,
 * etc.) using a [[org.apache.spark.streaming.StreamingContext]] or it can be generated by
 * transforming existing DStreams using operations such as `map`,
 * `window` and `reduceByKeyAndWindow`. While a Spark Streaming program is running, each DStream
 * periodically generates a RDD, either from live data or by transforming the RDD generated by a
 * parent DStream.
 *
 * This class contains the basic operations available on all DStreams, such as `map`, `filter` and
 * `window`. In addition, [[org.apache.spark.streaming.dstream.PairDStreamFunctions]] contains
 * operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and
 * `join`. These operations are automatically available on any DStream of pairs
 * (e.g., DStream[(Int, Int)] through implicit conversions.
 *
 * A DStream internally is characterized by a few basic properties:
 *  - A list of other DStreams that the DStream depends on
 *  - A time interval at which the DStream generates an RDD
 *  - A function that is used to generate an RDD after each time interval
 */
abstract class DStream[T: ClassTag] (
    @transient private[streaming] var ssc: StreamingContext
  ) extends Serializable with Logging {

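That is a long comment, but it boils down to the input, computation, output flow described earlier:
data can come in through TCP sockets, Kafka, Flume, and so on, and is turned into DStreams,
which are then computed on with operators such as `map`, `filter`, and `window`.

Internally, a DStream is characterized by a few basic properties:
 - A list of other DStreams that the DStream depends on (in my view, this mirrors the dependencies between RDDs)
 - The time interval at which the DStream generates an RDD (configurable)
 - A function that is used to generate an RDD after each time interval (so many RDDs are generated over time)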
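
These three properties can be mimicked in a few lines of plain Scala. The model below is a toy, not Spark's implementation: `MiniDStream`, `MiniInput`, and `MiniMapped` are hypothetical names, a `Seq` stands in for an RDD, and times are plain millisecond values.

```scala
// Toy model of the three properties; names are hypothetical, not Spark's API.
trait MiniDStream[T] {
  def dependencies: List[MiniDStream[_]]        // parent streams
  def slideMillis: Long                         // interval at which a batch is generated
  def compute(timeMillis: Long): Option[Seq[T]] // produces the batch for a given time
}

// An input stream: no parents, data comes from a source function.
final class MiniInput[T](val slideMillis: Long, source: Long => Seq[T]) extends MiniDStream[T] {
  val dependencies: List[MiniDStream[_]] = Nil
  def compute(timeMillis: Long): Option[Seq[T]] =
    if (timeMillis % slideMillis == 0) Some(source(timeMillis)) else None
}

// A transformed stream: depends on its parent and maps each of the parent's batches.
final class MiniMapped[T, U](parent: MiniDStream[T], f: T => U) extends MiniDStream[U] {
  val dependencies: List[MiniDStream[_]] = List(parent)
  def slideMillis: Long = parent.slideMillis
  def compute(timeMillis: Long): Option[Seq[U]] =
    parent.compute(timeMillis).map(_.map(f))
}

object MiniDStreamDemo {
  def main(args: Array[String]): Unit = {
    val input = new MiniInput[Int](5000L, _ => Seq(1, 2, 3))
    val doubled = new MiniMapped[Int, Int](input, _ * 2)
    println(doubled.compute(5000L)) // a batch exists at a multiple of the interval
    println(doubled.compute(5001L)) // no batch between intervals
  }
}
```

The mapped stream never touches the source directly; it asks its parent for the batch at the requested time, which is the same dependency-driven style the comment above describes.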

That completes the walk through the source from top to bottom. It was only a brief introduction, but hopefully it made things clearer.

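8 Input data streams

An input DStream represents the stream of input data received from a source. In the word count above, for example, ssc.socketTextStream("192.168.137.130", 9998) returns a DStream for the data it receives. Every input DStream (except those, such as file streams, that run only on the driver) is associated with a Receiver object, which receives the data from the source and stores it in Spark's memory for processing.

Spark Streaming provides two categories of built-in streaming sources.
Basic sources: sources directly available in the StreamingContext API. Examples: file systems and socket connections.
Advanced sources: sources such as Kafka, Flume, and Kinesis, available through extra utility classes. These require linking against extra dependencies, as discussed in the linking section. (They are covered later.)

Note that if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams (discussed further in the performance tuning section). This creates multiple receivers, which receive multiple data streams simultaneously. Keep in mind, however, that a Spark worker/executor is a long-running task, so each receiver occupies one of the cores allocated to the Spark Streaming application. It is therefore important that a Spark Streaming application is allocated enough cores (or threads, if running locally) to process the received data as well as to run the receivers.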
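
Creating several input DStreams and merging them might look like the sketch below. The object name `MultiReceiverSketch`, the helper `unionOfSockets`, and the second port (9999) are assumptions for illustration; each port needs its own listener for the receivers to connect to. With two receivers, the master is set to local[3] so that one thread remains for processing.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object MultiReceiverSketch {

  // One receiver-based input DStream per port, merged into a single DStream.
  def unionOfSockets(ssc: StreamingContext, host: String, ports: Seq[Int]): DStream[String] = {
    val streams: Seq[DStream[String]] = ports.map(port => ssc.socketTextStream(host, port))
    ssc.union(streams)
  }

  def main(args: Array[String]): Unit = {
    // Two receivers plus one thread for processing.
    val conf = new SparkConf().setAppName("MultiReceiverSketch").setMaster("local[3]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical ports; a listener must be running on each of them.
    val merged = unionOfSockets(ssc, "192.168.137.130", Seq(9998, 9999))
    merged.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```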

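A closer look at the local master URL
1. local: run Spark locally with one worker thread.
2. local[K]: run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
3. local[K,F]: run Spark locally with K worker threads and at most F task failures (see the spark.task.maxFailures setting).
4. local[*]: run Spark locally with as many worker threads as there are logical cores on your machine.

When running a Spark Streaming program locally, do not use "local" or "local[1]" as the master URL. Either of these means that only one thread is used to run tasks locally. If you use a receiver-based input DStream (for example, sockets, Kafka, Flume), the receiver occupies that single thread, which leaves no thread to process the received data. So when running locally, always use "local[n]" as the master URL, where n > the number of receivers to run. The same applies on a cluster: the number of cores allocated to the Spark Streaming application must be greater than the number of receivers. Otherwise the system will receive data but will not be able to process it.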
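
The rule "n must be greater than the number of receivers" is easy to check before a job is launched. The helper below is a plain Scala sketch with hypothetical names (`LocalThreadsSketch`, `localThreads`, `hasEnoughThreads`); it is not part of Spark and only understands the four local forms listed above. For any other master URL it makes no claim and returns true.

```scala
object LocalThreadsSketch {
  private val LocalN = """local\[(\d+)\]""".r
  private val LocalNF = """local\[(\d+)\s*,\s*(\d+)\]""".r

  // Number of worker threads implied by a local master URL, if it is one.
  def localThreads(master: String): Option[Int] = master match {
    case "local"       => Some(1)
    case "local[*]"    => Some(Runtime.getRuntime.availableProcessors())
    case LocalN(n)     => Some(n.toInt)
    case LocalNF(n, _) => Some(n.toInt)
    case _             => None // not a local master; cores are decided by the cluster
  }

  // True when at least one thread is left over after every receiver has taken one.
  def hasEnoughThreads(master: String, receivers: Int): Boolean =
    localThreads(master).forall(_ > receivers)

  def main(args: Array[String]): Unit = {
    Seq("local", "local[1]", "local[2]", "local[4,2]").foreach { master =>
      println(s"$master with 1 receiver: ${hasEnoughThreads(master, 1)}")
    }
  }
}
```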
Original: "Spark Streaming Introduction and Architecture: Basics", https://blog.csdn.net/yu0_zhang0/article/details/80569946
