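1. What is Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources, for example Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to file systems (such as HDFS) and databases.
Internally, it works as follows. Spark Streaming receives a live data stream and divides it into batches; the Spark engine then processes these batches and produces the final result stream, which is also organized in batches. (In other words, Spark Streaming still processes data in batches, and this is where it differs from Flink.)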
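The batching step described above can be pictured with a small plain-Scala sketch that needs no Spark dependency. It only illustrates the idea of slicing a stream by time; `Event` and `toBatches` are hypothetical names made up for this example, and real Spark Streaming also produces a (possibly empty) batch for every interval, which this sketch leaves out.

```scala
// Plain-Scala illustration of micro-batching; not Spark code.
// Event and toBatches are hypothetical names used only for this sketch.
object MicroBatchSketch {
  // A record tagged with its arrival time in milliseconds.
  final case class Event(timestampMs: Long, payload: String)

  // Group events into consecutive batches of `batchMs` milliseconds.
  // The key is the batch index: timestamps in [0, batchMs) map to 0, and so on.
  def toBatches(events: Seq[Event], batchMs: Long): Map[Long, Seq[String]] =
    events
      .groupBy(e => e.timestampMs / batchMs)
      .map { case (index, batch) =>
        index -> batch.sortBy(_.timestampMs).map(_.payload)
      }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(500L, "a"),
      Event(2900L, "b"),
      Event(3000L, "c"),
      Event(7200L, "d")
    )
    // With a 3-second interval, "a" and "b" share batch 0,
    // "c" falls in batch 1, and "d" falls in batch 2.
    val batches = toBatches(events, batchMs = 3000L)
    batches.toSeq.sortBy(_._1).foreach { case (index, payloads) =>
      println(s"batch $index: ${payloads.mkString(", ")}")
    }
  }
}
```

Each batch is then handed to the engine as an ordinary bounded dataset, which is why the processing model remains batch-oriented.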
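2. DStream
DStream: short for Discretized Stream, a high-level abstraction provided by Spark Streaming that represents a continuous stream of data.
A DStream can be created from an input source such as Kafka or Flume, or by applying higher-order functions such as map, reduce, join, and window to other DStreams.
Internally, a DStream is a continuous series of RDDs. The RDD is the core abstraction of Spark Core: an immutable, distributed dataset.
Each RDD in a DStream contains the data from one time interval.
Any operator applied to a DStream is translated under the hood into operations on each of its RDDs.
The underlying RDD transformations are still carried out by the Spark Core engine. Spark Streaming wraps Spark Core in a layer that hides these details and gives developers a convenient high-level API.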
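To make the "operator on a DStream equals the same operator on every RDD" point concrete, here is a toy model in plain Scala. `MiniDStream` is a hypothetical class invented for this sketch (it is not part of the Spark API), and ordinary `Seq` values stand in for RDDs.

```scala
// Toy model of a DStream as an ordered sequence of batches; not Spark code.
// MiniDStream is a hypothetical name, and each inner Seq stands in for one RDD.
object DStreamModelSketch {
  final case class MiniDStream[A](batches: Seq[Seq[A]]) {
    // A stream-level map is the same map applied to every batch.
    def map[B](f: A => B): MiniDStream[B] =
      MiniDStream(batches.map(batch => batch.map(f)))

    // Likewise for flatMap: each batch is flattened independently.
    def flatMap[B](f: A => Seq[B]): MiniDStream[B] =
      MiniDStream(batches.map(batch => batch.flatMap(f)))
  }

  def main(args: Array[String]): Unit = {
    val lines = MiniDStream(Seq(Seq("hello spark", "hello"), Seq("streaming")))
    // Split every line into words, then replace each word by its length.
    val lengths = lines.flatMap(_.split(" ").toSeq).map(_.length)
    lengths.batches.zipWithIndex.foreach { case (batch, i) =>
      println(s"batch $i: ${batch.mkString(", ")}")
    }
  }
}
```

The batch boundaries are preserved by every operator here, which mirrors how a transformed DStream still consists of one RDD per interval.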
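3. Receiving data from a socket: WordCount
package SparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountStreaming {
  def main(args: Array[String]): Unit = {
    // Configuration. local[2]: the socket receiver occupies one thread,
    // so at least one more thread is needed to process the data
    val conf = new SparkConf().setMaster("local[2]").setAppName("WordCountStreaming")
    // Streaming context: the environment object for real-time analysis.
    // The second argument is the batch interval used to collect data
    val ssc = new StreamingContext(conf, Seconds(3))
    // Collect lines from the given host and port
    val socketDstream: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.159.200", 9999)
    // Split each line into words and flatten the result
    val wordDstream: DStream[String] = socketDstream.flatMap(line => line.split(" "))
    // Turn each word into a (word, 1) tuple
    val tupDstream: DStream[(String, Int)] = wordDstream.map(word => (word, 1))
    // Aggregate the counts per word within each batch
    val reduced: DStream[(String, Int)] = tupDstream.reduceByKey(_ + _)
    // Print the result
    reduced.print()
    // Do not call stop() out of habit: main must keep running for as long as the receiver does
    // Start the receiver
    ssc.start()
    // The driver waits for the streaming computation to terminate
    ssc.awaitTermination()
  }
}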
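One consequence of the listing above is that `reduceByKey` counts words within a single batch only, so nothing carries over from one 3-second interval to the next. The following plain-Scala sketch reproduces the same flatMap, map, and reduce-by-key steps on ordinary collections to show this; `countWords` is a hypothetical helper written for illustration, not a Spark function.

```scala
// Plain-Scala stand-in for the per-batch word count; not Spark code.
// countWords is a hypothetical helper that mirrors flatMap -> map -> reduceByKey.
object BatchWordCountSketch {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))            // split lines into words
      .map(word => (word, 1))           // pair each word with 1
      .groupBy(_._1)                    // group the pairs by word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val batch1 = Seq("spark spark flink")
    val batch2 = Seq("spark")
    // Each batch is counted on its own: "spark" is 2 in the first batch
    // and 1 in the second, never 3.
    println(countWords(batch1))
    println(countWords(batch2))
  }
}
```

Keeping a running total across batches requires stateful operators, which the next section uses.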
4. Receiving data from Kafka: maintaining offsets manually and accumulating historical results
package day12.demo03_kafka.sample02_directway

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

/**
 * @description Manually maintained offsets; WordCount over Kafka history plus live data
 * @author YDAlex
 * @date 2019/11/15
 * @version 1.0
 */
object demo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("MyWC")
      .master("local")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(2))
    // Checkpoint directory (required by updateStateByKey); local here, usually on HDFS
    ssc.checkpoint("file:///G:\\IntelliJ IDEA 2019.2.1\\Workspace\\sparktest\\data\\ck")
    val topics = Array("test")
    val groupID = "group01"
    // Kafka configuration as key-value pairs (note: this is the direct approach, not receiver-based)
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "hadoop:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> groupID,
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val streaming = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    streaming
      .flatMap(_.value().split("\\s+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      // Add the current batch to the accumulated historical result
      .updateStateByKey((nowBatch: Seq[Int], historyResult: Option[Int]) => Some(nowBatch.sum + historyResult.getOrElse(0)))
      .print(100)
    // wordcount.updateStateByKey(Myfun)
    //   .print(100)
    streaming.foreachRDD { rdd =>
      // Get the offset ranges that correspond to this RDD
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // Commit the offsets back to Kafka once the outputs have completed
      streaming.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
    ssc.start()
    ssc.awaitTermination()
  }
  // def Myfun = (nowBatch: Seq[Int], history: Option[Int]) => Some(nowBatch.sum + history.getOrElse(0))
}
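The update function passed to `updateStateByKey` above takes the values seen for one key in the current batch plus that key's previous state, and returns the new state. The plain-Scala sketch below replays that logic over a few in-memory batches; `step` is a hypothetical helper, and the `Map` stands in for the state that Spark keeps between batches.

```scala
// Plain-Scala replay of the running word count; not Spark code.
// step is a hypothetical helper; the Map stands in for Spark's per-key state.
object RunningCountSketch {
  // Same shape as the function given to updateStateByKey in the listing above.
  val update: (Seq[Int], Option[Int]) => Option[Int] =
    (nowBatch, history) => Some(nowBatch.sum + history.getOrElse(0))

  // Apply one batch of (word, 1) pairs to the accumulated state.
  // Keys missing from the batch are still visited, with an empty Seq.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2) }
    (state.keySet ++ grouped.keySet).flatMap { key =>
      update(grouped.getOrElse(key, Seq.empty[Int]), state.get(key))
        .map(count => key -> count)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq("spark" -> 1, "spark" -> 1, "flink" -> 1),
      Seq("spark" -> 1),
      Seq("kafka" -> 1)
    )
    // After all three batches: spark = 3, flink = 1, kafka = 1.
    val finalState = batches.foldLeft(Map.empty[String, Int])(step)
    finalState.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word: $n") }
  }
}
```

Because the function returns an `Option`, returning `None` for a key would drop it from the state; the function in the listing always returns `Some`, so counts only grow.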
5. Time windows (window)
Window functions are suited to aggregating over a fixed span of recent history, for example computing the total transaction amount over the past hour during Tmall's Double 11 sale.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SlideWindowFunDemo {
  def main(args: Array[String]): Unit = {
    // SparkSession
    val spark: SparkSession = SparkSession.builder()
      .appName(SlideWindowFunDemo.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(3))
    // checkpoint
    // ssc.checkpoint("file:///C:\\Users\\Administrator\\IdeaProjects\\spark-streaming-study\\data\\ck")
    // Build the DStream of (word, 1) pairs
    val ds: DStream[(String, Int)] = ssc.socketTextStream("NODE01", 7777)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .cache()
    // Note:
    // Argument 2: window length
    // Argument 3: slide interval
    // Both must be integer multiples of the batch interval.
    // Otherwise, for argument 2: Exception in thread "main" java.lang.Exception: The window duration of windowed DStream (4000 ms) must be a multiple of the slide duration of parent DStream (3000 ms)
    // Otherwise, for argument 3: Exception in thread "main" java.lang.Exception: The slide duration of windowed DStream (3000 ms) must be a multiple of the slide duration of parent DStream (2000 ms)
    ds.reduceByKeyAndWindow((nowValue: Int, nextValue: Int) => nowValue + nextValue,
        Seconds(6), Seconds(3))
      .print(100)
    // Start the Spark Streaming application
    ssc.start()
    // Wait for termination (this call is required)
    ssc.awaitTermination()
  }
}
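With a 3-second batch interval, a 6-second window, and a 3-second slide, each window covers the two most recent batches and a new window is produced for every batch. The plain-Scala sketch below walks through that arithmetic on in-memory data. `validate` and `windowedCounts` are hypothetical helpers, and the sketch assumes that a window is emitted at every slide point and that the earliest windows contain only the batches available so far.

```scala
// Plain-Scala walk-through of sliding-window arithmetic; not Spark code.
// validate and windowedCounts are hypothetical helpers for this sketch.
object WindowSketch {
  // Mirrors the rule from the listing: both durations must be whole
  // multiples of the batch interval.
  def validate(batchMs: Long, windowMs: Long, slideMs: Long): Unit = {
    require(windowMs % batchMs == 0,
      s"window ($windowMs ms) must be a multiple of the batch interval ($batchMs ms)")
    require(slideMs % batchMs == 0,
      s"slide ($slideMs ms) must be a multiple of the batch interval ($batchMs ms)")
  }

  // Sum per-batch counts over a sliding window measured in batches.
  // Assumption: early windows hold only the batches seen so far.
  def windowedCounts(
      batches: Seq[Map[String, Int]],
      windowBatches: Int,
      slideBatches: Int): Seq[Map[String, Int]] =
    (slideBatches to batches.length by slideBatches).map { end =>
      val window = batches.slice(math.max(0, end - windowBatches), end)
      window.foldLeft(Map.empty[String, Int]) { (acc, counts) =>
        counts.foldLeft(acc) { case (m, (word, n)) =>
          m.updated(word, m.getOrElse(word, 0) + n)
        }
      }
    }

  def main(args: Array[String]): Unit = {
    val batchMs = 3000L
    val windowMs = 6000L
    val slideMs = 3000L
    validate(batchMs, windowMs, slideMs)
    val perBatch = Seq(Map("a" -> 1), Map("a" -> 2, "b" -> 1), Map("b" -> 3))
    // 6000 / 3000 = 2 batches per window, 3000 / 3000 = 1 batch per slide.
    val windows = windowedCounts(perBatch, (windowMs / batchMs).toInt, (slideMs / batchMs).toInt)
    windows.zipWithIndex.foreach { case (counts, i) => println(s"window $i: $counts") }
  }
}
```

Calling `validate(3000L, 4000L, 3000L)` fails, which corresponds to the first exception quoted in the listing's comments.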