Spark Streaming Source Code Analysis 1: StreamingContext

Original

2020-02-22 07:08

Let's start with the simplest possible example to get a rough idea of what a Spark Streaming program looks like:

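import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {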
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

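    // StreamingExamples is a helper object from the Spark examples package
    // (org.apache.spark.examples.streaming); it only adjusts log levels
    StreamingExamples.setStreamingLogLevels()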

    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // MEMORY_AND_DISK_SER keeps a single copy (no replication), which is fine only for local runs.
    // In a distributed deployment, use a replicated storage level for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

This section focuses on how a StreamingContext is constructed. Its primary constructor is:

class StreamingContext private[streaming] (
    sc_ : SparkContext,
    cp_ : Checkpoint,
    batchDur_ : Duration
  )

I. API:

1. Takes an existing SparkContext; cp_ is null

def this(sparkContext: SparkContext, batchDuration: Duration)

2. Creates a SparkContext from conf internally; cp_ is again null

def this(conf: SparkConf, batchDuration: Duration)

3. conf is assembled from the default settings plus the supplied parameters; cp_ is null

def this(
    master: String,
    appName: String,
    batchDuration: Duration,
    sparkHome: String = null,
    jars: Seq[String] = Nil,
    environment: Map[String, String] = Map())

4. Rebuilds the StreamingContext from the checkpoint information read from the path directory, so neither a SparkContext nor a Duration argument is needed

def this(path: String, hadoopConf: Configuration)

def this(path: String) // hadoopConf is built automatically from the default Hadoop configuration files

5. Constructs from an existing SparkContext and a checkpoint path

def this(path: String, sparkContext: SparkContext)

6. Note that the StreamingContext companion object also has a getOrCreate method: if no checkpoint can be read under checkpointPath, it calls creatingFunc to create a new StreamingContext

def getOrCreate(
    checkpointPath: String,
    creatingFunc: () => StreamingContext,
    hadoopConf: Configuration = new Configuration(),
    createOnError: Boolean = false
  ): StreamingContext
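
To make the constructor forms and getOrCreate easier to picture, here is a minimal, Spark-free sketch. Everything in it (ToyConf, ToyCheckpoint, ToyContext, readCheckpoint) is a hypothetical stand-in and not a Spark class. The real primary constructor takes a cp_ that may be null; the sketch uses an Option instead. The createOnError branch follows the behaviour as I understand it from the Spark docs: a read error is rethrown unless createOnError is true.

```scala
import scala.util.{Failure, Success, Try}

object ContextConstructionSketch {
  // Hypothetical stand-ins for SparkConf and Checkpoint.
  final case class ToyConf(master: String, appName: String)
  final case class ToyCheckpoint(conf: ToyConf, batchMs: Long)

  // Private primary constructor; all public forms delegate to it.
  class ToyContext private (
      val conf: ToyConf,
      val checkpoint: Option[ToyCheckpoint],
      val batchMs: Long) {

    // Like forms 1 and 2: a fresh context, so there is no checkpoint.
    def this(conf: ToyConf, batchMs: Long) = this(conf, None, batchMs)

    // Like form 3: assemble the configuration from parameters, then delegate.
    def this(master: String, appName: String, batchMs: Long) =
      this(ToyConf(master, appName), batchMs)

    // Like forms 4 and 5: configuration and batch interval come from the checkpoint.
    def this(cp: ToyCheckpoint) = this(cp.conf, Some(cp), cp.batchMs)

    def isRecovered: Boolean = checkpoint.isDefined
  }

  // readCheckpoint is an assumed helper: Success(Some(cp)) when a checkpoint
  // was read, Success(None) when nothing is there, Failure on a read error.
  def getOrCreate(
      readCheckpoint: () => Try[Option[ToyCheckpoint]],
      creatingFunc: () => ToyContext,
      createOnError: Boolean = false): ToyContext =
    readCheckpoint() match {
      case Success(Some(cp))            => new ToyContext(cp) // recover
      case Success(None)                => creatingFunc()     // nothing to recover
      case Failure(_) if createOnError  => creatingFunc()     // swallow the read error
      case Failure(e)                   => throw e            // default: surface it
    }
}
```

With the real API, the DStream pipeline should be set up inside creatingFunc, since that function is skipped entirely when the context is recovered from a checkpoint.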

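II. Main construction logic of StreamingContext (checkpointing is left out for now)

1. Construct a graph: DStreamGraph

Operations on a DStream fall into two classes: transformations, and output operations, which emit results. A streaming program that has input must also have output; without an output operation, all the work defined before it is meaningless. So how are the inputs and the outputs tied together? That is the job of DStreamGraph, which records both the input streams and the output streams.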
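
The bookkeeping role of the graph can be sketched without Spark. ToyStream and ToyGraph below are hypothetical stand-ins, and lineages is an illustrative helper that is not part of Spark. The only point is that work is discovered by walking back from the registered outputs, so a stream that no output depends on never contributes anything.

```scala
import scala.collection.mutable.ArrayBuffer

object DStreamGraphSketch {
  // A hypothetical stream node: a name plus the streams it is derived from.
  final case class ToyStream(name: String, parents: Seq[ToyStream] = Nil)

  class ToyGraph {
    private val inputStreams = ArrayBuffer[ToyStream]()
    private val outputStreams = ArrayBuffer[ToyStream]()

    def addInputStream(s: ToyStream): Unit = inputStreams += s
    def addOutputStream(s: ToyStream): Unit = outputStreams += s

    def inputNames: Seq[String] = inputStreams.map(_.name).toSeq

    // For each registered output, list the chain of streams it depends on,
    // from the root input down to the output itself.
    def lineages: Seq[Seq[String]] = outputStreams.toSeq.map(lineage)

    private def lineage(s: ToyStream): Seq[String] =
      s.parents.flatMap(lineage) :+ s.name
  }
}
```

For the word count example, registering `lines` as the input and the stream created by `print()` as the output would give a single lineage running from `lines` through `words` and `wordCounts` to the output.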

2. Construct a JobScheduler

Internally, JobScheduler constructs a jobGenerator, which generates jobs at the batch interval we configured
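
A Spark-free sketch of "one job per batch interval". As far as I can tell, the real generator is driven by a recurring timer; to keep the example deterministic, the hypothetical jobsUpTo below simply computes which batch times would have fired by a given instant. ToyJob and jobsUpTo are illustrative names, not Spark APIs.

```scala
object JobGeneratorSketch {
  // A hypothetical job, identified only by the batch time it belongs to.
  final case class ToyJob(batchTimeMs: Long)

  // One job per batch boundary in the interval (startMs, nowMs],
  // with boundaries spaced batchMs apart.
  def jobsUpTo(startMs: Long, batchMs: Long, nowMs: Long): Seq[ToyJob] = {
    require(batchMs > 0, "batch interval must be positive")
    (startMs + batchMs to nowMs by batchMs).map(ToyJob(_))
  }
}
```

With the 1 second interval from the example, 3.5 seconds after start three batches (at 1000, 2000 and 3000 ms) would have been generated; the partial fourth batch has not fired yet.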

3. Set the state to INITIALIZED

The next section covers the operation part of the example above.
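
Why track a state at all? A minimal sketch of the lifecycle, assuming the context moves from INITIALIZED to ACTIVE on start() and to STOPPED on stop(). The transition rules here (a repeated start is a no-op, a stopped context cannot be restarted) are assumptions for illustration, and ToyContext is a hypothetical stand-in, not the Spark class.

```scala
object ContextStateSketch {
  sealed trait State
  case object Initialized extends State
  case object Active extends State
  case object Stopped extends State

  class ToyContext {
    // Construction ends by setting the state, mirroring step 3 above.
    private var state: State = Initialized

    def getState: State = state

    def start(): Unit = state match {
      case Initialized => state = Active
      case Active      => () // assumed: starting twice does nothing
      case Stopped     =>
        throw new IllegalStateException("context has already been stopped")
    }

    def stop(): Unit = state = Stopped
  }
}
```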

yueqian_zhu
