Structured Streaming (Part 8-1)

In the previous chapters, we learned how to use the Structured APIs to process very large but bounded volumes of data. In practice, however, data often arrives continuously and needs to be processed in real time. In this chapter, we discuss how the same Structured APIs can be applied to streams of data.

The Evolution of the Apache Spark Stream Processing Engine

Stream processing is defined as the continuous processing of endless streams of data. With the advent of the big data era, stream processing systems transitioned from single-node processing engines to multi-node, distributed processing engines. Traditional distributed stream processing is implemented with a record-at-a-time processing model, as shown in Figure 8-1.

Figure 8-1. Traditional record-at-a-time processing model
(image: https://static001.geekbang.org/infoq/e0/e062cd82a1014624297843f9bb632ff2.png)

The processing pipeline is composed of a directed graph of nodes, as shown in Figure 8-1. Each node continuously receives one record at a time, processes it, and forwards the generated records to the next node in the graph. This processing model can achieve very low latencies: an input record can be processed by the pipeline and a result generated within milliseconds. However, this model is not very efficient at recovering from node failures and straggler nodes (i.e., nodes that are slower than the others): it can either recover from a failure very quickly with a lot of extra failover resources, or use minimal extra resources but recover slowly.

The Advent of Micro-Batch Stream Processing

This traditional approach was challenged when Apache Spark introduced Spark Streaming (also known as DStreams). It introduced the idea of micro-batch stream processing, where the streaming computation is modeled as a continuous series of small map/reduce-style batch processing jobs (hence the name "micro-batch"), as illustrated in Figure 8-2.

Figure 8-2. Micro-batch stream processing model
(image: https://static001.geekbang.org/infoq/a1/a189e4099e306b3dfbac009970c81953.png)

As shown here, Spark Streaming divides the data from the input stream into micro-batches of, say, one second each. Each batch is processed in the Spark cluster in a distributed manner, as a series of small, deterministic tasks that generate output in micro-batches. Breaking the streaming computation into these small tasks gives two main advantages over the traditional continuous-operator model:

- Spark's agile task scheduling can recover from failures and straggler executors very quickly and efficiently by rescheduling one or more copies of the tasks on other executors.
- The determinism of the tasks ensures that the output data is the same no matter how many times a task is re-executed. This crucial characteristic enables Spark Streaming to provide end-to-end exactly-once processing guarantees; that is, the generated output is the result of every input record being processed exactly once.

This efficient fault tolerance does come at the cost of latency: the micro-batch model cannot achieve millisecond-level latencies; it usually achieves latencies of a few seconds (as low as half a second in some cases). However, we have observed that for the vast majority of stream processing use cases, the benefits of micro-batch processing outweigh the drawback of second-scale latencies. This is because most streaming pipelines have at least one of the following characteristics:

- The pipeline does not need latencies lower than a few seconds. For example, when the streaming output is only read by hourly jobs, it is not useful to generate output with subsecond latencies.
- There are larger delays in other parts of the pipeline. For example, if the writes from sensors into Apache Kafka (a system for ingesting data streams) are batched to achieve higher throughput, then no amount of optimization in the downstream processing system can make the end-to-end latency lower than the batching delay.

Furthermore, the DStream API was built on top of Spark's batch RDD API. Therefore, DStreams have the same functional semantics and fault-tolerance model as RDDs. Spark Streaming thus demonstrated that a single, unified processing engine can provide consistent APIs and semantics for batch, interactive, and streaming workloads. This fundamental paradigm shift in stream processing propelled Spark Streaming to become one of the most widely used open source stream processing engines.

Lessons Learned from Spark Streaming (DStreams)

Despite all these advantages, the DStream API was not without flaws. Here are a few key areas that were identified as needing improvement:
Lack of a single API for batch and stream processing

Even though DStreams and RDDs have consistent APIs (i.e., the same operations and the same semantics), developers still had to explicitly rewrite their code to use different classes when converting their batch jobs to streaming jobs.

Lack of separation between logical and physical plans

Spark Streaming executes DStream operations in exactly the order in which the developer specifies them. Since the developer effectively specifies the exact physical plan, there is no scope for automatic optimization, and developers have to hand-optimize their code to get the best performance.

Lack of native support for event-time windows

DStreams define window operations based only on the time when each record is received by Spark Streaming (known as processing time). However, many use cases need to compute windowed aggregates based on the time when the records were generated (known as event time), rather than when they were received or processed. The lack of native support for event-time windows made it hard for developers to build such pipelines with Spark Streaming.

These shortcomings shaped the design philosophy of Structured Streaming, which we discuss next.

The Philosophy of Structured Streaming

Based on these lessons from DStreams, Structured Streaming was designed from the very beginning around one core philosophy: for developers, writing stream processing pipelines should be as easy as writing batch pipelines. In short, the guiding principles of Structured Streaming are:

A single, unified programming model and interface for batch and stream processing

This unified model offers a simple API interface for both batch and streaming workloads. You can use familiar SQL or batch-like DataFrame queries on your stream (just as you learned in the previous chapter), leaving the underlying complexities of fault tolerance, optimization, and late data to the engine itself. In the coming sections, we will examine some of the queries you might write.

A broader definition of stream processing

Big data processing applications have grown complex enough that the line between real-time processing and batch processing has blurred significantly. The aim of Structured Streaming is to broaden its applicability from traditional stream processing to a much larger class of applications: any application that processes data periodically (e.g., every few hours) or continuously (like traditional streaming applications) should be expressible with Structured Streaming.

Next, we discuss the programming model used by Structured Streaming.

The Programming Model of Structured Streaming

A "table" is a well-known concept that developers are familiar with when building batch applications. Structured Streaming extends this concept to streaming applications by treating a data stream as an unbounded, continuously appended table, as illustrated in Figure 8-3.

Figure 8-3. A data stream as an unbounded table
(image: https://static001.geekbang.org/infoq/82/82e923a11235b1498805f4687c075079.png)

Every new record received in the data stream is like a new row being appended to the unbounded input table. Structured Streaming does not actually retain all of the input, but the output it produces is equivalent to storing all of the input up to time T in a static, bounded table and running a batch job on that table.

As shown in Figure 8-4, the developer then defines a query on this conceptual input table, as if it were a static table, to compute the result table that will be written to an output sink. Structured Streaming automatically converts this batch-like query into a streaming execution plan. This is called incrementalization: Structured Streaming figures out what state needs to be maintained to update the result each time a record arrives. Finally, the developer specifies a trigger policy to control when to update the results. Each time a trigger fires, Structured Streaming checks for new data (i.e., new rows in the input table) and incrementally updates the result.

Figure 8-4. The Structured Streaming processing model
(image: https://static001.geekbang.org/infoq/16/161bdaa42acd2b2b43b404c23cddc1e9.png)

The last part of the model is the output mode. Each time the result table is updated, the developer will want to write the updates to an external system, such as a filesystem (e.g., HDFS, Amazon S3) or a database (e.g., MySQL, Cassandra). We usually want to write the output incrementally. For this purpose, Structured Streaming provides three output modes:

Append mode
Only the new rows appended to the result table since the last trigger are written to external storage. This is applicable only in queries where existing rows in the result table cannot change (e.g., a map on an input stream).

Update mode
As the name implies, only the rows of the result table that were updated since the last trigger are changed in the external storage. This mode works for output sinks that can be updated in place, such as a MySQL table.

Complete mode
The entire updated result table is written to external storage.

Unless complete mode is specified, Structured Streaming does not fully materialize the result table. Just enough information (known as "state") is maintained to ensure that the changes to the result table can be computed and the updates can be output.

Treating a data stream as a table not only makes it easier to conceptualize the logical computations on the data, but also makes it easier to express them in code. Since Spark's DataFrame is a programmatic representation of a table, you can use the DataFrame API to express your computations on streaming data. All you need to do is define an input DataFrame (i.e., the input table) from a streaming data source, and then apply operations on it in the same way as you would on a DataFrame defined on a batch data source.

In the next section, you will see how easy it is to write Structured Streaming queries using DataFrames.

The Fundamentals of a Structured Streaming Query

In this section, we cover some of the high-level concepts you need to understand to develop Structured Streaming queries. We first walk through the key steps to define and start a streaming query, then discuss how to monitor active queries and manage their life cycle.

Five Steps to Define a Streaming Query

As discussed in the previous section, Structured Streaming uses the same DataFrame API as batch queries to express the data processing logic. However, there are a few key differences you need to know about to define a Structured Streaming query. In this section, we explore the steps involved by building a simple query that reads a stream of text data over a socket and counts the words.

Step 1: Define input sources

As with batch queries, the first step is to define a DataFrame from a streaming source. However, whereas reading batch data sources requires spark.read to create a DataFrameReader, with streaming sources we need spark.readStream to create a DataStreamReader. DataStreamReader has most of the same methods as DataFrameReader, so you can use it in a similar way. Here is an example of creating a DataFrame from a text data stream received over a socket connection:

# In Python
spark = SparkSession...
lines = (spark
  .readStream.format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load())

// In Scala
val spark = SparkSession...
val lines = spark
  .readStream.format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

This code generates the lines DataFrame as an unbounded table of newline-separated text data read from localhost:9999. Note that, just as with batch sources and spark.read, this does not immediately start reading any streaming data; it only sets up the configurations necessary for reading the data once the streaming query is explicitly started.

Besides sockets, Apache Spark natively supports reading data streams from Apache Kafka and from all the file-based formats that DataFrameReader supports (Parquet, ORC, JSON, etc.). The details of these sources and their supported options are covered later in this chapter. Furthermore, a streaming query can define multiple input sources, both streaming and batch, which can be combined using DataFrame operations such as unions and joins (also discussed later in this chapter).
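To make the file-based option concrete, here is a minimal sketch of reading a directory of JSON files as a stream. The schema, the input path, and the maxFilesPerTrigger setting are illustrative assumptions; note that file-based streaming sources require an explicit schema, since the query is defined before any data arrives:

# In Python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical schema for the incoming JSON records
jsonSchema = StructType([
  StructField("key", StringType(), True),
  StructField("value", StringType(), True),
  StructField("timestamp", TimestampType(), True)])

inputDF = (spark
  .readStream
  .format("json")
  .schema(jsonSchema)               # required for file-based streaming sources
  .option("maxFilesPerTrigger", 1)  # optional: cap files read per micro-batch
  .load("/path/to/input/dir"))      # hypothetical input directory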
Step 2: Transform data

Now we can apply the usual DataFrame operations, such as splitting the lines into individual words and then counting them, as shown in the following code:

# In Python
from pyspark.sql.functions import *
words = lines.select(split(col("value"), "\\s").alias("word"))
counts = words.groupBy("word").count()

// In Scala
import org.apache.spark.sql.functions._
val words = lines.select(split(col("value"), "\\s").as("word"))
val counts = words.groupBy("word").count()

counts is a streaming DataFrame (that is, a DataFrame on unbounded, streaming data) that represents the running word counts that will be computed continuously once the streaming query is started and the streaming input data is being processed.

Note that these operations to transform the streaming DataFrame would work in exactly the same way if lines were a batch DataFrame. In general, most DataFrame operations that can be applied on a batch DataFrame can also be applied on a streaming DataFrame. To understand which operations are supported in Structured Streaming, you have to recognize the two broad classes of data transformations:

Stateless transformations
Operations like select(), filter(), and map() do not need any information from previous rows to process the next row; each row can be processed by itself. The lack of previous "state" in these operations makes them stateless. Stateless operations can be applied to both batch and streaming DataFrames.

Stateful transformations
In contrast, an aggregation operation like count() requires maintaining state to combine data across multiple rows. More specifically, any DataFrame operations involving grouping, joins, or aggregations are stateful transformations. While many of these operations are supported in Structured Streaming, a few combinations of them are not supported because it is computationally hard or infeasible to compute them in an incremental manner.

The stateful operations supported by Structured Streaming, and how to manage their runtime state, are discussed later in this chapter.
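To make the distinction concrete, here is a small sketch of a purely stateless pipeline on the same lines stream: it filters and projects each row independently, so no state is maintained across rows. The use of explode() to flatten the split words and the five-character threshold are illustrative choices, not part of the word count example above:

# In Python
from pyspark.sql.functions import col, explode, length, split

# Stateless: each input row is handled on its own, with no grouping,
# joins, or aggregations involved.
shortWords = (lines
  .select(explode(split(col("value"), "\\s")).alias("word"))
  .filter(length(col("word")) <= 5))

Because no previously output row can ever change in such a query, a stateless pipeline like this also supports append mode, which is discussed further in the next step.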
Step 3: Define the output sink and output mode

After transforming the data, we can define how to write the processed output data with DataFrame.writeStream (instead of DataFrame.write, which is used for batch data). This creates a DataStreamWriter, which, like DataFrameWriter, has additional methods to specify the following:

- The output details (where and how to write the output)
- The processing details (how to process the data and how to recover from failures)

Let's start with the output details (we will focus on the processing details in the next step). For example, the following snippet shows how to write the final counts to the console:

# In Python
writer = counts.writeStream.format("console").outputMode("complete")

// In Scala
val writer = counts.writeStream.format("console").outputMode("complete")

Here we have specified "console" as the output streaming sink and "complete" as the output mode. The output mode of a streaming query specifies which part of the updated output to write out after new input data has been processed. In this example, as a chunk of new input data is processed and the word counts are updated, we can choose to print to the console either the counts of all the words seen so far (i.e., complete mode) or only those counts that were updated in the last chunk of input data. This is decided by the specified output mode, which can be one of the following (as we already saw in "The Programming Model of Structured Streaming"):

Append mode
This is the default mode, where only the new rows added to the result table/DataFrame (e.g., the counts table) since the last trigger are output to the sink. Semantically, this mode guarantees that any row that is output will never be changed or updated by the query in the future. Hence, append mode is supported only by queries that never modify previously output data (e.g., stateless queries). Our word count query, by contrast, needs to update previously generated counts; therefore, it does not support append mode.

Complete mode
In this mode, all the rows of the result table/DataFrame are output at the end of every trigger. This is supported by queries where the result table is likely to be much smaller than the input data and can therefore feasibly be retained in memory. For example, our word count query supports complete mode because the counts data is likely to be far smaller than the input data.

Update mode
In this mode, only the rows of the result table/DataFrame that were updated since the last trigger are output at the end of every trigger. This is in contrast to append mode, as the output rows may be modified by the query and output again in the future. Most queries support update mode.

Complete details on the output modes supported by different queries can be found in the latest Structured Streaming Programming Guide.

Besides writing the output to the console, Structured Streaming natively supports streaming writes to files and Apache Kafka. In addition, you can write to arbitrary locations using the foreachBatch() and foreach() API methods. In fact, you can use foreachBatch() to write streaming output using existing batch data sources (but you will lose the exactly-once guarantee). The details of these sinks and their supported options are discussed later in this chapter.
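As a rough sketch of the foreachBatch() approach, the following hypothetical example writes each output micro-batch of our word counts to a MySQL table by reusing the batch JDBC writer. The connection URL and table name are placeholders, and, as noted above, the exactly-once guarantee is lost unless the sink-side writes are made idempotent:

# In Python
# foreachBatch() hands us each output micro-batch as a regular (batch)
# DataFrame, together with a unique, monotonically increasing batch id.
def writeCountsToMySQL(batchDF, batchId):
    (batchDF.write
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/streaming")  # placeholder
      .option("dbtable", "word_counts")                        # placeholder
      .mode("overwrite")
      .save())

mysqlWriter = (counts.writeStream
  .foreachBatch(writeCountsToMySQL)
  .outputMode("complete"))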
Step 4: Specify processing details

The final step before starting the query is to specify the details of how to process the data. Continuing with our word count example, we specify the processing details as follows:

# In Python
checkpointDir = "..."
writer2 = (writer
  .trigger(processingTime="1 second")
  .option("checkpointLocation", checkpointDir))

// In Scala
import org.apache.spark.sql.streaming._
val checkpointDir = "..."
val writer2 = writer
  .trigger(Trigger.ProcessingTime("1 second"))
  .option("checkpointLocation", checkpointDir)

Here we have specified two types of details using the DataStreamWriter created with DataFrame.writeStream:

Triggering details
This indicates when to trigger the discovery and processing of newly available streaming data. There are four options:

Default
When the trigger is not explicitly specified, the streaming query executes data in micro-batches, where the next micro-batch is triggered as soon as the previous micro-batch has completed.

Processing time with trigger interval
You can explicitly specify the ProcessingTime trigger with an interval, and the query will trigger micro-batches at that fixed interval.

Once
In this mode, the streaming query executes exactly one micro-batch: it processes all the new data available in a single batch and then stops itself. This is useful when you want to control the triggering and processing from an external scheduler that will restart the query using any custom schedule (e.g., to control cost by running a query only once per day); see the sketch at the end of this step.

Continuous
This is an experimental mode (as of Spark 3.0) where the streaming query processes data continuously instead of in micro-batches. While only a small subset of DataFrame operations allow this mode to be used, it can provide much lower latency (as low as milliseconds) than the micro-batch trigger modes. Refer to the latest Structured Streaming Programming Guide for the most up-to-date information.

Checkpoint location
This is a directory in any HDFS-compatible filesystem where a streaming query saves its progress information, that is, what data has been successfully processed. Upon failure, this metadata is used to restart the failed query exactly where it left off. Therefore, setting this option is necessary for failure recovery with exactly-once guarantees.
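As a small usage sketch of the once trigger mentioned above (in the Python API this is expressed with the once flag; the sink and options simply reuse those of our word count example):

# In Python
# Process all currently available data in a single micro-batch, then stop.
# An external scheduler (e.g., a daily cron job) can rerun this code to
# restart the query; the checkpoint ensures it resumes where it left off.
onceWriter = (counts.writeStream
  .format("console")
  .outputMode("complete")
  .trigger(once=True)
  .option("checkpointLocation", checkpointDir))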
Step 5: Start the query

Once everything has been specified, the final step is to start the query, which you can do with the following:

# In Python
streamingQuery = writer2.start()

// In Scala
val streamingQuery = writer2.start()

The returned object of type StreamingQuery represents an active query and can be used to manage the query, which we will cover later in this chapter.

Note that start() is a nonblocking method, so it returns as soon as the query has started in the background. If you want the main thread to block until the streaming query has terminated, you can use streamingQuery.awaitTermination(). If the query fails in the background with an error, awaitTermination() will also fail with that same exception.

You can wait up to a timeout duration using awaitTermination(timeoutMillis), and you can explicitly stop the query with streamingQuery.stop().

Putting It All Together

To summarize, here is the complete code for reading a stream of text data over a socket, counting the words, and printing the counts to the console:

# In Python
from pyspark.sql.functions import *
spark = SparkSession...
lines = (spark
  .readStream.format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load())

words = lines.select(split(col("value"), "\\s").alias("word"))
counts = words.groupBy("word").count()
checkpointDir = "..."
streamingQuery = (counts
  .writeStream
  .format("console")
  .outputMode("complete")
  .trigger(processingTime="1 second")
  .option("checkpointLocation", checkpointDir)
  .start())
streamingQuery.awaitTermination()

// In Scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming._
val spark = SparkSession...
val lines = spark
  .readStream.format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.select(split(col("value"), "\\s").as("word"))
val counts = words.groupBy("word").count()

val checkpointDir = "..."
val streamingQuery = counts.writeStream
  .format("console")
  .outputMode("complete")
  .trigger(Trigger.ProcessingTime("1 second"))
  .option("checkpointLocation", checkpointDir)
  .start()
streamingQuery.awaitTermination()
Once the query has started, a background thread continuously reads new data from the streaming source, processes it, and writes it to the streaming sink. Next, let's take a quick peek under the hood at how this is executed.

Under the Hood of an Active Streaming Query

Once the query starts, the following sequence of steps takes place in the engine, as depicted in Figure 8-5. The DataFrame operations are converted into a logical plan, which is an abstract representation of the computation that Spark SQL uses to plan a query:

1. Spark SQL analyzes and optimizes this logical plan to ensure that it can be executed incrementally and efficiently on streaming data.
2. Spark SQL starts a background thread that continuously executes the following loop:
   a. Based on the configured trigger interval, the thread checks the streaming sources for the availability of new data.
   b. If available, the new data is executed by running a micro-batch. From the optimized logical plan, an optimized Spark execution plan is generated that reads the new data from the source, incrementally computes the updated result, and writes the output to the sink according to the configured output mode.
   c. For every micro-batch, the exact range of data processed (e.g., the set of files or the range of Apache Kafka offsets) and any associated state are saved in the configured checkpoint location, so that the query can deterministically reprocess the exact same range of data if needed.
3. This loop continues until the query is terminated, which can occur for one of the following reasons:
   a. A failure has occurred in the query (either a processing error or a failure in the cluster).
   b. The query is explicitly stopped using streamingQuery.stop() (see the sketch following Figure 8-5).
   c. If the trigger is set to Once, the query stops on its own after executing a single micro-batch containing all the available data.

Figure 8-5. Incremental execution of streaming queries
(image: https://static001.geekbang.org/infoq/ec/ec9a9715f03deb0745683b761eb88a2d.png)
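Relatedly, since start() returns immediately, a query's termination can be handled explicitly from the main thread. Here is a minimal sketch, assuming the streamingQuery from the combined example above; note that in the Python API the awaitTermination() timeout is given in seconds, unlike the millisecond-based Scala API, and the one-hour value is arbitrary:

# In Python
# Block for up to one hour; with a timeout, awaitTermination() returns True
# if the query terminated within that time and False otherwise.
terminated = streamingQuery.awaitTermination(3600)
if not terminated:
    streamingQuery.stop()  # explicitly request termination (reason b above)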
A key point you should remember about Structured Streaming is that, underneath it, Spark SQL is being used to execute the data. As such, the full power of Spark SQL's hyperoptimized execution engine is utilized to maximize the stream processing throughput, providing key performance advantages.

Next, we discuss how to restart a streaming query after termination, and the life cycle of a streaming query.

Recovering from Failures with Exactly-Once Guarantees

To restart a terminated query in a completely new process, you have to create a new SparkSession, redefine all the DataFrames, and start the streaming query on the final result using the same checkpoint location as the one used when the query was started the first time. For our word count example, you can simply re-execute the entire code snippet shown earlier, from the definition of spark in the first line to the final start().

The checkpoint location must be the same across restarts because this directory contains the unique identity of a streaming query and determines the life cycle of the query. If the checkpoint directory is deleted, or if the same query is started with a different checkpoint directory, it is like starting a new query from scratch. Specifically, the checkpoint contains record-level information (e.g., Apache Kafka offsets) to track the data range of the last incomplete micro-batch. A restarted query will use this information to start processing records precisely after the last successfully completed micro-batch. If the previous query had terminated before completing a micro-batch, the restarted query will reprocess the same range of data before processing new data. Combined with Spark's deterministic task execution, the regenerated output will be the same as it was expected to be before the restart.

Structured Streaming can ensure end-to-end exactly-once guarantees (that is, the output is as if each input record was processed exactly once) when the following conditions are satisfied:

Replayable streaming sources
The data range of the last incomplete micro-batch can be re-read from the source.

Deterministic computations
All data transformations deterministically produce the same result when given the same input data.

Idempotent streaming sink
The sink can identify re-executed micro-batches and ignore duplicate writes that may be caused by restarts.

Note that our word count example does not provide exactly-once guarantees, because the socket source is not replayable and the console sink is not idempotent.

As a final note on restarting queries, it is possible to make minor modifications to a query between restarts. Here are a few ways you can modify the query:

DataFrame transformations
You can make minor modifications to the transformations between restarts. For example, in our streaming word count example, if you want to ignore lines that have corrupted byte sequences that can crash the query, you can add a filter in the transformation:

# In Python
# isCorruptedUdf = udf to detect corruption in string

filteredLines = lines.filter("isCorruptedUdf(value) = false")
words = filteredLines.select(split(col("value"), "\\s").alias("word"))

// In Scala
// val isCorruptedUdf = udf to detect corruption in string

val filteredLines = lines.filter("isCorruptedUdf(value) = false")
val words = filteredLines.select(split(col("value"), "\\s").as("word"))

Upon restarting with this modified words DataFrame, the restarted query will apply the filter on all data processed since the restart (including the last incomplete micro-batch), preventing the query from failing again.

Source and sink options
Whether a readStream or writeStream option can be changed between restarts depends on the semantics of the specific source or sink. For example, you should not change the host and port options for the socket source if data is going to be sent to that host and port. But you can add an option to the console sink to print up to one hundred changed counts after every trigger:

writeStream.format("console").option("numRows", "100")...

Processing details
As discussed earlier, the checkpoint location must not be changed between restarts. However, other details like the trigger interval can be changed without breaking the fault-tolerance guarantees.

For more information on the narrow set of changes that are allowed between restarts, see the latest Structured Streaming Programming Guide.
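Putting the restart mechanics together, here is a minimal sketch, assuming the filtered words DataFrame above and the same checkpointDir as the original run; the modified counts are simply recomputed, and the query resumes from where the checkpoint left off:

# In Python
# Restart the (modified) query in a new process: rebuild the DataFrames and
# reuse the SAME checkpoint location so the query resumes where it left off.
counts = words.groupBy("word").count()
restartedQuery = (counts.writeStream
  .format("console")
  .outputMode("complete")
  .trigger(processingTime="1 second")
  .option("checkpointLocation", checkpointDir)  # must match the original run
  .start())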
Monitoring an Active Query

An important part of running a streaming pipeline in production is tracking its health. Structured Streaming provides several ways to track the status and processing metrics of an active query.

Querying the current status using StreamingQuery

You can query the current health of an active query using its StreamingQuery instance. Here are two ways:

Get current metrics using StreamingQuery
When a query processes some data in a micro-batch, we consider it to have made progress. lastProgress() returns information on the last completed micro-batch. For example, printing the returned object (StreamingQueryProgress in Scala/Java, or a dictionary in Python) will produce something like this:

// In Scala/Python
{
  "id" : "ce011fdc-8762-4dcb-84eb-a77333e28109",
  "runId" : "88e2ff94-ede0-45a8-b687-6316fbef529a",
  "name" : "MyQuery",
  "timestamp" : "2016-12-14T18:45:24.873Z",
  "numInputRows" : 10,
  "inputRowsPerSecond" : 120.0,
  "processedRowsPerSecond" : 200.0,
  "durationMs" : {
    "triggerExecution" : 3,
    "getOffset" : 2
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[topic-0]]",
    "startOffset" : {
      "topic-0" : {
        "2" : 0,
        "1" : 1,
        "0" : 1
      }
    },
    "endOffset" : {
      "topic-0" : {
        "2" : 0,
        "1" : 134,
        "0" : 534
      }
    },
    "numInputRows" : 10,
    "inputRowsPerSecond" : 120.0,
    "processedRowsPerSecond" : 200.0
  } ],
  "sink" : {
    "description" : "MemorySink"
  }
}

Some of the noteworthy columns are:

id
A unique identifier tied to the checkpoint location. It stays the same throughout the lifetime of a query (i.e., across restarts).

runId
A unique identifier for the current (re)started instance of the query. It changes with every restart.

numInputRows
The number of input rows processed in the last micro-batch.

inputRowsPerSecond
The current rate at which input rows are being generated at the source (averaged over the last micro-batch duration).

processedRowsPerSecond
The current rate at which rows are being processed and written out by the sink (averaged over the last micro-batch duration). If this rate is consistently lower than the input rate, the query is unable to process data as fast as it is being generated by the source. This is a key indicator of the health of the query.

sources and sink
Source- and sink-specific details of the data processed in the last batch.
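As a small monitoring sketch built on these fields (note that the Python API exposes lastProgress, status, and isActive as properties returning plain values rather than as methods, and the 10-second polling interval is arbitrary):

# In Python
import time

# Periodically poll the active query and flag it when the processing rate
# falls behind the input rate (the key health indicator described above).
while streamingQuery.isActive:
    progress = streamingQuery.lastProgress  # dict for the last micro-batch
    if progress and progress["processedRowsPerSecond"] < progress["inputRowsPerSecond"]:
        print("Query is falling behind at", progress["timestamp"])
    time.sleep(10)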
Get the current status with StreamingQuery.status()
This provides information on what the background query thread is currently doing. For example, printing the returned object produces something like this:

// In Scala/Python
{
  "message" : "Waiting for data to arrive",
  "isDataAvailable" : false,
  "isTriggerActive" : false
}

Publishing metrics using Dropwizard Metrics
Spark supports reporting metrics via a popular library called Dropwizard Metrics. This library allows metrics to be published to many popular monitoring frameworks (Ganglia, Graphite, etc.). These metrics are not enabled by default for Structured Streaming queries because of the high volume of data they report. To enable them, besides configuring Dropwizard Metrics for Spark, you have to explicitly set the SparkSession configuration spark.sql.streaming.metricsEnabled to true before starting your query.
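Enabling them is a one-line configuration change, sketched below. Note that configuring the actual Dropwizard metrics sinks (e.g., via Spark's metrics.properties file) is a separate step not shown here:

// In Scala
// Must be set on the SparkSession before the streaming query is started.
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")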
}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    println(\"Query terminated: \" + event.id)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  override def onQueryProgress(event: QueryProgressEvent): Unit = {","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    println(\"Query made progress: \" + event.progress)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2.在開始查詢之前,將你的監聽器添加到SparkSession中:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.streams.addListener(myListener)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"添加監聽器後,在SparkSession上運行的流查詢的所有事件將開始調用監聽器的方法。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們在下一篇文章中會講解流數據源和接收器、數據轉換、有狀態的流聚合等知識點。","attrs":{}}]}]}