Chapter 6: Introduction to Spark Streaming

6-1 Course Outline

Overview

History

Use cases

Getting started via word count

Integration with the Spark ecosystem

How it works

6-2 Spark Streaming Overview

Official docs: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.


Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.


Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

This guide shows you how to start writing Spark Streaming programs with DStreams. You can write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2), all of which are presented in this guide. You will find tabs throughout this guide that let you choose between code snippets of different languages.

Note: There are a few APIs that are either different or not available in Python. Throughout this guide, you will find the tag Python API highlighting these differences.

My own working definition of Spark Streaming:

It takes data streams from different sources, processes them with Spark Streaming, and writes the results out to external systems such as file systems.
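That definition maps directly onto the API. A minimal sketch in Scala (the localhost:9999 source, the 5-second batch interval, and the output path are illustrative assumptions, not from the course):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SourceToSink {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SourceToSink")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Source: a TCP socket here, but it could equally be Kafka, Flume, or Kinesis.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Processing: word count over each 5-second batch.
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // Sink: each batch is written out under the given prefix plus the batch timestamp.
    counts.saveAsTextFiles("output/wordcounts")

    ssc.start()
    ssc.awaitTermination()
  }
}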

Key characteristics:

Low latency

Efficient recovery from failures

Scales to hundreds or thousands of nodes

Combines batch processing, machine learning, graph computation, and other Spark sub-frameworks with Spark Streaming (see the sketch after the one-stack note below)

Does Spark Streaming need to be installed separately? No: it is not a standalone framework but a built-in component of Spark, so a Spark installation is all you need.

One stack to rule them all: a single, unified stack.
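To make the one-stack idea concrete, here is a hedged sketch that reuses Spark SQL inside a streaming job, following the DataFrame-on-streams pattern from the official guide (host, port, and batch interval are again illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OneStack {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("OneStack")
    val ssc = new StreamingContext(conf, Seconds(5))
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Same stack, different sub-framework: run SQL over each micro-batch.
    words.foreachRDD { rdd =>
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._
      rdd.toDF("word").createOrReplaceTempView("words")
      spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}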

6-3 Spark Streaming Use Cases

6-5 Spark Streaming History

6-6 Getting Started with Spark Streaming via Word Count

GitHub:

https://github.com/apache/spark

Source code of the example:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala

1. Running with spark-submit

./spark-submit --master local[2] \
--class org.apache.spark.examples.streaming.NetworkWordCount \
examples/jars/spark-examples_*.jar hadoop000 9999

(The last line points at the examples jar shipped under examples/jars; adjust the path and version to your installation.)
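NetworkWordCount expects a text stream on the given host and port, so start a data source there first (on hadoop000 in this run), for example with netcat:

nc -lk 9999

Whatever you type into the netcat session is then counted batch by batch.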

2. Running in spark-shell

Copy the snippet below, paste it into a spark-shell session, and drive it the same way as the spark-submit run.
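The snippet is essentially NetworkWordCount from the link above, adapted to reuse the SparkContext (sc) that spark-shell already provides; hostname and port mirror the spark-submit run:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// spark-shell already exposes a SparkContext as sc, so build on it
// with a 1-second batch interval.
val ssc = new StreamingContext(sc, Seconds(1))

// Read lines from the netcat server on hadoop000:9999.
val lines = ssc.socketTextStream("hadoop000", 9999, StorageLevel.MEMORY_AND_DISK_SER)

// Split into words and count them within each batch.
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()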

 

6-7 How Spark Streaming Works (Coarse-Grained)

The coarse-grained view: Spark Streaming receives the live input data stream, slices it into small batches at a configured time interval, and hands each small batch to the Spark engine for processing.
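Each of those small batches surfaces in the API as one RDD per batch interval, which you can observe directly with foreachRDD; a minimal sketch (the local source and 10-second interval are assumptions for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchInspector {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchInspector")
    // Slice the incoming stream into 10-second batches.
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // One RDD per batch: the Spark engine runs an ordinary job on each.
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}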

 

6-8 How Spark Streaming Works (Fine-Grained)

The fine-grained view: when the application starts, the driver launches receivers on executors; a receiver splits the incoming stream into blocks, stores them in executor memory (replicated for fault tolerance), and reports the block metadata back to the StreamingContext on the driver; at every batch interval the driver then submits ordinary Spark jobs that process those blocks.
