Database Internals Talk (21): An Introduction to Stream Processing Systems

{"type":"doc","content":[
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although this is a blog about database internals, stream processing systems have become one of the mainstream kinds of data system. They offer SQL-like interfaces, and there is now a trend toward unifying stream and batch processing (personally, I think that remains to be seen: the data arrives differently and the applications served are different, so it seems hard to have it both ways with a single system). In this installment, let's talk about stream processing systems. The material comes from a paper that Facebook (now Meta) published at "},{"type":"link","attrs":{"href":"https:\/\/sigmod.org\/","title":"xxx","type":null},"content":[{"type":"text","text":"SIGMOD"}]},{"type":"text","text":" in 2016, titled "},{"type":"link","attrs":{"href":"https:\/\/blog.acolyer.org\/2016\/07\/11\/realtime-data-processing-at-facebook\/","title":"xxx","type":null},"content":[{"type":"text","text":"Realtime Data Processing at Facebook"}]},{"type":"text","text":"."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, why do we need stream processing systems? Because some applications need low latency: real-time analytics such as performance and error metrics, or recommendation systems, which want certain features collected in real time to produce the best recommendations."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The paper starts by discussing five important qualities of a stream processing (or real-time data processing) system:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Ease of use: how do programmers declare the stream processing logic? Is SQL (or a SQL-like language) enough, or must the system support general-purpose logic, for example letting programmers implement the processing in C++ or Java and hand it to the system to run (as with map-reduce)? How long does the whole life cycle take, from declaration through testing to release?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Performance: this generally means latency and throughput requirements. Is latency measured in milliseconds, seconds, or minutes? How much throughput is needed?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Fault tolerance: what kinds of failure can the system recover from? What service-level guarantee does it offer for data processing: at least once, at most once, or exactly once? If a task crashes, how is its in-memory state recovered?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) Scalability: can data processing be sharded or resharded to raise throughput? Can the system scale up and down dynamically (elasticity)?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5) Correctness: does the system offer ACID guarantees like a database? Can data be lost (this overlaps with fault tolerance above)?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Facebook's design decisions for its streaming systems rest on one premise: a few seconds of latency with hundreds of GB\/s of throughput. Under that premise, the different processing stages can be connected through a persistent message bus (Scribe, similar to Kafka) that carries the data. Separating data transport from data processing lets the whole system handle the qualities listed above better."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"An Overview of Facebook's (Meta's) Stream Processing Systems"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Facebook's stream processing ecosystem offers three different systems. We go through them in turn, using the data flow diagram below."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/c0\/e0\/c0b1312cf38e952yy4c93260a61151e0.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data originates on mobile clients or (web) servers and is first logged to Scribe (the persistent message bus mentioned above). The streaming systems Puma, Stylus, and Swift read from Scribe, process the data, and write back to Scribe. In this way the three systems, combined with Scribe, can form complex data processing DAGs. Finally, the processed data flows through Scribe into three kinds of data store: Laser, Scuba, and Hive."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Scribe"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scribe is a highly scalable "},{"type":"link","attrs":{"href":"https:\/\/www.cnblogs.com\/svenzhang9527\/p\/7354684.html","title":"xxx","type":null},"content":[{"type":"text","text":"message bus"}]},{"type":"text","text":" built on a persistent store (a file system), similar to the open-source Kafka. Data is organized into categories (Kafka's term is topic), and each category can be sharded into multiple buckets to raise throughput. The bucket is the basic unit of work for the stream processing systems. Scribe stores data in HDFS, with retention typically of up to a few days."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Puma"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Puma offers a SQL-like syntax and supports user-defined functions (UDFs) written in Java. Its strength is very fast development (thanks to the SQL-like syntax): the whole life cycle can be completed within hours. Puma handles simple SQL-like aggregations very efficiently. The paper gives a simple example that computes the top K events over a 5-minute sliding window. The simplified Puma code is shown below; even if you have never seen Puma syntax, it should not be hard to follow."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/22\/1b\/22efc757b0bef09a61c7ef8d05f09b1b.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another strength of Puma is that for simple filtering logic, such as selecting only the relevant records, it delivers latency on the order of seconds (the processed data can be written straight to another Scribe category). Unlike a traditional database, Puma is optimized for long-running apps rather than ad-hoc analytics, so it can use code generation to produce optimized processing code."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Swift"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(An aside: before reading this paper I had never heard of this system, and frankly, after reading its introduction I am still in a fog.) Swift offers only a very simple API: read from a Scribe stream, with a checkpoint after every N strings or bytes, and repeat. If the app crashes while working past a checkpoint, it can restart from the latest checkpoint. Swift is typically used for very low-throughput, stateless processing."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Stylus"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Stylus is a general-purpose stream processing system written in C++. Its API resembles that of other stream processing systems such as Storm, Samza, and MillWheel, and it supports both stateless and stateful processing. Because it is implemented in C++, Stylus not only supports arbitrary operations (including calls to external systems to fetch information) but also performs very well."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's also take a quick look at the data stores. These systems ingest data from Scribe but do not export back to Scribe; instead they serve data through their own APIs."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Laser"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Laser is a high-throughput, low-latency key-value store. It ingests data from Scribe, after which other applications, including Stylus, Swift, and Puma, can access that data."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Scuba"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scuba can be viewed as a high-performance in-memory database that supports only single-table queries. It ingests data with very low latency, and the data can then be queried through a SQL-like language (restricted to one table per query) or through the UI; queries also finish within milliseconds. Scuba is therefore widely used for performance, monitoring, and debugging metrics."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Hive data warehouse"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I will skip the Hive data warehouse; everyone knows it."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With all the systems introduced, let's tie them together with a simple example from the paper: finding the hottest event topics in an event stream (by sorting event counts from high to low). The input stream carries basic information about each event, such as the event timestamp, event type, dimension_id (used to look up the related dimension information), and event text. The output is the top K events for each topic."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/bf\/32\/bfd61a5f3f9b40f88f9ef59737110f32.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Filterer: filters out events that do not meet the criteria and re-shards the event stream by dimension_id as it writes to the downstream Scribe category (so that downstream processing can run in parallel by dimension_id)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Joiner: looks up the dimension information for each dimension_id and calls a classification service to obtain the event's topic. Because the upstream Scribe stream is sharded by dimension_id, the Joiner can cache the dimension information to save network bandwidth (stateful processing). The processed records are sent to the downstream Scribe as (event, topic) pairs."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Scorer: computes a score from the event counts per topic within a sliding window. Because the score depends on both the long-term trend and the current count, the Scorer must keep the long-term trend as state. Its output goes downstream, sharded by topic."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) Ranker: finally, the Ranker computes the top K events for each topic in the current sliding window."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The paper notes that all of this logic can be implemented in Stylus. The Filterer and Ranker, however, can be built faster in Puma."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Design Decisions"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now we come to the heart of the paper. It presents design decisions along five dimensions and discusses how they affect the five qualities introduced at the start (ease of use, performance, fault tolerance, scalability, correctness)."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Language paradigm"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The language paradigm affects ease of use and performance. The paper describes three categories. Declarative languages, like SQL, are the easiest to understand and pick up; their drawback is limited expressiveness. Functional models express the application as a composition of functions (operators); they are harder to pick up than SQL but give more control. Procedural languages expose a C++ or Java interface directly; they give the most control and go a long way toward guaranteeing performance, at the cost of a longer development cycle. Each has its pros and cons. Inside Facebook, Puma is declarative and Stylus is procedural."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data transport"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Complex stream processing logic is usually expressed as a DAG. How data travels from one node to the next affects the fault tolerance, performance, and scalability of the whole pipeline, and to some extent its ease of use (especially when debugging)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The paper again describes three categories:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Direct message transfer: data is passed directly, for example via RPC or an in-memory message queue. The advantage is very low latency."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Broker-based: an intermediate broker decouples upstream from downstream. The broker adds performance overhead but improves scalability and makes it easy to scale out."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Persistent-storage-based broker: as in Scribe or Kafka. This approach is the heaviest, but it brings all the benefits of a message bus: decoupling, scaling, publish-subscribe fan-out, durable storage, and so on. Facebook uses this third category to improve fault tolerance, scalability, and ease of use."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Processing semantics"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Processing semantics determine correctness and fault tolerance. The paper describes three kinds of activity a processing node may perform:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Update internal state: read an event, process it (for example by querying an external system), then update the in-memory state;"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Generate output events: after processing an event, emit an output event downstream;"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Save state to an external system such as a database: this can include saving offsets and checkpoints for failure recovery. A stateless node can only generate output events; a stateful node may do all three."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As for the correctness of event processing: for at-least-once semantics, a node should save the in-memory state first and then the offset; for at-most-once, it should save the offset first and then the in-memory state; for exactly-once, the two must be updated atomically, for example in a transaction. Among the systems described, Puma chose at-least-once, while Scuba chose at-most-once. Scuba already samples its data, and its queries are best-effort for the sake of speed, so losing a small amount of data is acceptable."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"State-saving mechanism"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How should a stateful processing node save its state? The paper describes the following options:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Replication to other nodes;"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) A local database or file store;"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) A remote database or file store;"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) Relying on the upstream node's storage;"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5) Global snapshot storage."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Among the systems described, Stylus offers state storage in both a local database and a remote database. Local storage saves bandwidth and makes recovery from a process crash fast. Remote storage can survive hardware-level machine failure (a new node must be provisioned and the state loaded onto it)."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Reprocessing\/backfill mechanism"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In some scenarios we need to reprocess old data: a new piece of stream processing logic has to be tested against a stretch of past data, or a new metric has to be computed by rerunning the data. Processing old data requires one of the following mechanisms: 1) keeping the stream's data long enough, for example by configuring longer retention in Scribe; 2) letting the stream processing system process data from the data warehouse (batch processing). At Facebook, Scribe retention is usually short, typically a few days, so the streaming systems need to connect to the data warehouse for this, by introducing a tailer. The backfill mechanism affects ease of use, scalability, and correctness."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"To sum up: in this installment we walked through Facebook's internal stream processing ecosystem and discussed design decisions along five dimensions, along with their effect on the five key qualities of a stream processing system (the figure below shows which qualities each design decision affects, for reference). Thanks for reading!"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/94\/68\/9467bd6e8e19c1a37dab16a542ab3868.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}
]}
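The checkpoint ordering rule from the processing semantics section (state first for at-least-once, offset first for at-most-once, both atomically for exactly-once) is easier to see with a crash in the middle. Below is a minimal sketch in Python, not code from Puma or Stylus: `Checkpoint`, `run`, and `run_with_crash` are hypothetical names, the "durable store" is just an object in memory, and the single tuple assignment only stands in for a real transaction.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical durable store; state and offset are saved separately."""
    state: int = 0    # running sum of the events processed so far
    offset: int = 0   # index of the next event to read


def run(events, ckpt, semantics, crash_at=None):
    """Process events starting at ckpt.offset, checkpointing after each one.

    crash_at is the index of the event during whose checkpoint the process
    dies (after the first save). Returns False if the run crashed.
    """
    state, offset = ckpt.state, ckpt.offset   # recover from the checkpoint
    while offset < len(events):
        index = offset
        state += events[index]                # update in-memory state
        offset = index + 1
        crashed = index == crash_at
        if semantics == "at_least_once":      # save state, then offset
            ckpt.state = state
            if crashed:
                return False
            ckpt.offset = offset
        elif semantics == "at_most_once":     # save offset, then state
            ckpt.offset = offset
            if crashed:
                return False
            ckpt.state = state
        elif semantics == "exactly_once":     # save both in one step
            ckpt.state, ckpt.offset = state, offset  # stand-in for a transaction
            if crashed:
                return False
        else:
            raise ValueError(semantics)
    return True


def run_with_crash(semantics):
    """Crash once while checkpointing event index 1, restart, return the sum."""
    events = [1, 2, 3, 4]                     # the true sum is 10
    ckpt = Checkpoint()
    run(events, ckpt, semantics, crash_at=1)  # dies mid-checkpoint
    run(events, ckpt, semantics)              # restart from the checkpoint
    return ckpt.state


print(run_with_crash("at_least_once"))  # 12: event 2 is counted twice
print(run_with_crash("at_most_once"))   # 8: event 2 is lost
print(run_with_crash("exactly_once"))   # 10
```

With state saved first, the restart re-reads an event whose effect is already in the saved state, so nothing is lost but something is counted twice. With the offset saved first, the restart skips an event whose effect never reached the saved state. That is the trade-off behind choosing at-least-once for counters that must not drop data and at-most-once where small losses are acceptable.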