1. What is Spark Streaming

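Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and processed with complex algorithms expressed through high-level functions like map, reduce, join, and window. Finally, the processed data can be pushed out to file systems (HDFS, for example) and databases.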

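Internally, it works as follows. Spark Streaming receives a live input data stream and divides it into batches, which the Spark engine then processes to produce the final stream of results, also in batches. (In other words, Spark Streaming still processes data in batches, which is where it differs from Flink.)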
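To make the batching idea concrete, here is a minimal plain-Scala sketch. It is only a model and does not use the Spark API: the `Event` type, the `toBatches` helper, and the sample timestamps are all hypothetical names introduced for illustration. It groups timestamped records into consecutive fixed-length batches, which is roughly what happens to incoming data before each batch is handed to the engine.

```scala
// Plain-Scala model of micro-batching; NOT the Spark API.
object MicroBatchSketch {
  // Hypothetical record type: a payload tagged with its arrival time in milliseconds.
  final case class Event(arrivalMs: Long, payload: String)

  // Group events into consecutive batches of `batchMs` milliseconds.
  // Returns (batchStartMs, payloads) pairs ordered by batch start time.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.arrivalMs / batchMs * batchMs) // start of the batch the event falls into
      .toSeq
      .sortBy(_._1)
      .map { case (start, es) => (start, es.sortBy(_.arrivalMs).map(_.payload)) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "a"), Event(2900, "b"), Event(3000, "c"), Event(7500, "d"))
    // With a 3-second batch interval, "a" and "b" share a batch; "c" and "d" each get their own.
    toBatches(events, 3000).foreach { case (start, payloads) =>
      println(s"batch starting at $start ms: ${payloads.mkString(", ")}")
    }
  }
}
```

One simplification to keep in mind: this model skips intervals in which nothing arrived, whereas a real streaming job still runs a (possibly empty) batch for every interval.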

2. DStream

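DStream (Discretized Stream) is the high-level abstraction provided by Spark Streaming; it represents a continuous stream of data.
A DStream can be created from an input source such as Kafka or Flume, or by applying high-level operations such as map, reduce, join, and window to other DStreams.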

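Internally, a DStream is a continuous series of RDDs. The RDD is the core abstraction of Spark Core: an immutable, distributed dataset.
Each RDD in a DStream contains the data from one time interval.
Any operator applied to a DStream is translated under the hood into operations on each of its RDDs.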

The underlying RDD transformations are still executed by the Spark Core engine. Spark Streaming wraps Spark Core, hides these details, and gives developers a convenient high-level API.
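The "operator on a DStream becomes an operation on each RDD" idea can be sketched without Spark at all. In the model below, `Batch` and `ModelStream` are hypothetical stand-ins (an "RDD" is just an immutable `Seq`, a "DStream" is a `Seq` of those), and `flatMapStream` is an illustrative helper, not a Spark method.

```scala
// Plain-Scala model of "a DStream is a series of RDDs"; NOT the Spark API.
object DStreamModelSketch {
  // Hypothetical stand-ins for illustration only.
  type Batch[A] = Seq[A]                 // plays the role of one RDD
  type ModelStream[A] = Seq[Batch[A]]    // plays the role of a DStream

  // A stream-level flatMap is simply the batch-level flatMap applied to every batch in turn.
  def flatMapStream[A, B](stream: ModelStream[A])(f: A => Seq[B]): ModelStream[B] =
    stream.map(batch => batch.flatMap(f))

  def main(args: Array[String]): Unit = {
    val lines: ModelStream[String] = Seq(Seq("hello spark", "hello"), Seq("streaming"))
    val words = flatMapStream(lines)(_.split(" ").toSeq)
    // Batch boundaries are preserved: words never move from one batch to another.
    words.zipWithIndex.foreach { case (batch, i) =>
      println(s"batch $i: ${batch.mkString(", ")}")
    }
  }
}
```

The point of the model is that the transformation never looks across batch boundaries, which is why stateful or windowed operators are needed when a result must span several batches.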

3. Receiving data from a port: WordCount

package SparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountStreaming {
  def main(args: Array[String]): Unit = {
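    //Configuration; local[2] because the receiver occupies one thread and processing needs another
    val conf  = new SparkConf().setMaster("local[2]").setAppName("WordCountStreaming")
    //Entry point for the streaming computation
    //The second argument is the batch interval: data is collected every 3 seconds
    val Scontext = new StreamingContext(conf,Seconds(3))
    //Receive data from the given host and port
    val socketDstream: ReceiverInputDStream[String] = Scontext.socketTextStream("192.168.159.200",9999)
    //Split each received line into words (flatten)
    val wordDstream: DStream[String] = socketDstream.flatMap(line => {
      line.split(" ")
    })
    //Reshape the data: turn each word into a (word, 1) tuple
    val tupDstream: DStream[(String, Int)] = wordDstream.map(word => {
      (word, 1)
    })
    //Aggregate the counts by key
    val reduced: DStream[(String, Int)] = tupDstream.reduceByKey(_+_)
    //Print the result
    reduced.print()
    //Do not call stop() out of habit: main must keep running, since it is tied to the receiver
    //Start the receiver
    Scontext.start()
    //The driver waits for the receiver to finish
    Scontext.awaitTermination()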
  }
}
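One consequence of the WordCount above is easy to miss: reduceByKey aggregates within a single batch only, so counts start from zero again every 3 seconds. The plain-Scala sketch below models that behaviour without Spark; `countWords` is a hypothetical helper that mirrors the flatMap, map, reduceByKey chain for the lines of one batch (it also drops empty tokens, which the listing above does not do).

```scala
// Plain-Scala model of the per-batch WordCount; NOT the Spark API.
object BatchWordCountSketch {
  // Mirrors flatMap -> map -> reduceByKey for the lines of ONE batch.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ").toSeq)
      .filter(_.nonEmpty) // assumption: skip empty tokens produced by repeated spaces
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    val batch1 = Seq("hello spark", "hello")
    val batch2 = Seq("hello")
    // Each batch is counted independently: "hello" is 2 in the first batch
    // and 1 in the second, never 3.
    println(countWords(batch1))
    println(countWords(batch2))
  }
}
```

Keeping a running total across batches requires a stateful operator, which is what the next section uses.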

4. Receiving data from Kafka: manually managing offsets and accumulating historical counts

package day12.demo03_kafka.sample02_directway

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

/**
 * @description Manually managed offsets; WordCount over Kafka history plus live data
 * @author YDAlex
 * @date 2019/11/15.
 * @version 1.0
 */
object demo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("MyWC")
      .master("local")
      .getOrCreate

    val sc: SparkContext = spark.sparkContext
    val ssc:StreamingContext = new StreamingContext(sc,Seconds(2))
    ssc.checkpoint("file:///G:\\IntelliJ IDEA 2019.2.1\\Workspace\\sparktest\\data\\ck")//Checkpoint directory (required by updateStateByKey); local here, usually on HDFS
    val topics = Array("test")
    val groupID = "group01"
    //Kafka settings as key-value pairs (note: this is the direct approach, not the receiver-based one)
    val KafkaParams = Map[String,Object](
      "bootstrap.servers" -> "hadoop:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> groupID,
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false:java.lang.Boolean)
    )
    val streaming = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, KafkaParams)
    )
    streaming
      .flatMap(_.value().split("\\s+"))
      .filter(_.nonEmpty)
      .map((_,1))
      //Add this batch's counts to the historical totals
      .updateStateByKey((nowBatch: Seq[Int], historyResult: Option[Int]) => Some(nowBatch.sum + historyResult.getOrElse(0)))
      .print(100)

    //wordcount.updateStateByKey(Myfun)
     // .print(100)
    streaming.foreachRDD { rdd =>
      //Get the offset ranges that correspond to this RDD
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      //Update the offsets
      // some time later, after outputs have completed (commits the offsets back to Kafka)
      streaming.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }


    ssc.start
    ssc.awaitTermination
  }
  //def Myfun = (nowBatch:Seq[Int],history:Option[Int]) => Some(nowBatch.sum + history.getOrElse(0))
}
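The update function passed to updateStateByKey above can be understood as a fold over batches. The plain-Scala sketch below models that without Spark: `update` has the same shape as the function in the listing, while `step` and the state `Map` are hypothetical helpers that stand in for what the framework does per key on each batch.

```scala
// Plain-Scala model of updateStateByKey semantics; NOT the Spark API.
object RunningCountSketch {
  // Same shape as the update function in the Kafka example:
  // (values for this key in the current batch, previous state) => new state.
  val update: (Seq[Int], Option[Int]) => Option[Int] =
    (nowBatch, history) => Some(nowBatch.sum + history.getOrElse(0))

  // Apply `update` once per key for a single batch of (word, 1) pairs.
  // Keys that already have state are updated even if the batch has no values for them.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      // A None result would drop the key from the state.
      update(grouped.getOrElse(k, Seq.empty), state.get(k)).map(v => k -> v)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq(("a", 1), ("b", 1), ("a", 1)),
      Seq(("a", 1))
    )
    // Fold the batches into a running total, printing the state after each batch.
    batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
      val next = step(state, batch)
      println(next)
      next
    }
  }
}
```

Because the state has to survive from one batch to the next, the real operator needs a checkpoint directory, which is why the listing calls ssc.checkpoint before building the stream.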

5. Window: time windows

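Window functions are suited to computing over data from a recent, fixed span of time, for example the total transaction amount over the past hour during Tmall's Double 11 sale.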

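import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SlideWindowFunDemo {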
  def main(args: Array[String]): Unit = {
    //SparkSession
    val spark: SparkSession = SparkSession.builder()
      .appName(SlideWindowFunDemo.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()

    val sc: SparkContext = spark.sparkContext


    val ssc: StreamingContext = new StreamingContext(sc, Seconds(3))

    //checkpoint
    //ssc.checkpoint("file:///C:\\Users\\Administrator\\IdeaProjects\\spark-streaming-study\\data\\ck")


    //DStream: compute on each batch and display the contents
    val ds: DStream[(String, Int)] = ssc.socketTextStream("NODE01", 7777)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .cache

    //Note:
    //Argument 2: window length
    //Argument 3: slide interval
    //Both arguments must be integer multiples of the batch interval.
    //Otherwise, for argument 2, this error is raised: Exception in thread "main" java.lang.Exception: The window duration of windowed DStream (4000 ms) must be a multiple of the slide duration of parent DStream (3000 ms)
    //Otherwise, for argument 3, this error is raised: Exception in thread "main" java.lang.Exception: The slide duration of windowed DStream (3000 ms) must be a multiple of the slide duration of parent DStream (2000 ms)
    ds.reduceByKeyAndWindow((nowValue: Int, nextValue: Int) => nowValue + nextValue,
      Seconds(6), Seconds(3))
      .print(100)

    //Start the Spark Streaming application
    ssc.start

    //Wait for termination (required)
    ssc.awaitTermination
  }
}
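With a 3-second batch interval, a 6-second window that slides every 3 seconds covers two batches and advances by one batch each time. The plain-Scala sketch below models that arithmetic without Spark; `windowedCounts` and its parameter names are hypothetical, and the `require` checks mirror the "must be a multiple of the batch interval" rule from the comments in the listing.

```scala
// Plain-Scala model of reduceByKeyAndWindow over whole batches; NOT the Spark API.
object WindowSketch {
  // batches: the words seen in each batch interval, in order.
  // Returns one word-count map per full window.
  def windowedCounts(
      batches: Seq[Seq[String]],
      batchSec: Int,
      windowSec: Int,
      slideSec: Int
  ): Seq[Map[String, Int]] = {
    require(windowSec % batchSec == 0, "window length must be a multiple of the batch interval")
    require(slideSec % batchSec == 0, "slide interval must be a multiple of the batch interval")
    val batchesPerWindow = windowSec / batchSec
    val batchesPerSlide = slideSec / batchSec
    batches
      .sliding(batchesPerWindow, batchesPerSlide)
      .map { window =>
        // Merge all batches in the window, then count each word.
        window.flatten.groupBy(identity).map { case (w, ws) => (w, ws.size) }
      }
      .toList
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a", "b"), Seq("a"), Seq("c"))
    // Window 6 s, slide 3 s, batch 3 s: windows are (batch 0 + batch 1) and (batch 1 + batch 2).
    windowedCounts(batches, batchSec = 3, windowSec = 6, slideSec = 3).foreach(println)
  }
}
```

This model only emits windows once enough batches exist to fill them; a running streaming job also emits results during start-up, when the window is not yet full.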
