Kafka Series Part 6: How Are Messages Stored and Read on the Server Side? Do You Really Know?

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"前言"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"經過前 5 篇文章的介紹,估麼着小夥伴們已經對消息生產和消費的流程應該有一個比較清晰的認識了。當然小夥伴們肯定也比較好奇,Kafka 能夠處理千萬級消息,那它的消息是如何在 Partition 上存儲的呢?今天這篇文章就來爲大家揭祕消息是如何存儲的。本文主要從消息的"},{"type":"text","marks":[{"type":"strong"}],"text":"邏輯存儲"},{"type":"text","text":"和"},{"type":"text","marks":[{"type":"strong"}],"text":"物理存儲"},{"type":"text","text":"兩個角度來介紹其實現原理。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"文章概覽"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Partition、Replica、Log 和 LogSegment 的關係。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"寫入消息流程分析。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"消費消息及副本同步流程分析。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Partition、Replica、Log 和 LogSegment 的關係"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設有一個 Kafka 集羣,Broker 個數爲 3,Topic 個數爲 1,Partition 個數爲 3,Replica 個數爲 2。Partition 的物理分佈如下圖所示。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/90/90389a17f2938659d6d4af75d722d4ff.png","alt":null,"title":"Partition分佈圖","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"boxShadow"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從上圖可以看出,該 Topic 由三個 Partition 構成,並且每個 Partition 由主從兩個副本構成。每個 Partition 的主從副本分佈在不同的 Broker 上,通過這點也可以看出,當某個 Broker 宕機時,可以將分佈在其他 Broker 上的從副本設置爲主副本,因爲只有主副本對外提供讀寫請求,當然在最新的 2.x 版本中從副本也可以對外讀請求了。將主從副本分佈在不同的 Broker 上從而提高系統的可用性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Partition 的實際物理存儲是以 Log 文件的形式展示的,而每個 Log 文件又以多個 LogSegment 組成。Kafka 爲什麼要這麼設計呢?其實原因比較簡單,隨着消息的不斷寫入,Log 文件肯定是越來越大,Kafka 爲了方便管理,將一個大文件切割成一個一個的 LogSegment 來進行管理;每個 LogSegment 由數據文件和索引文件構成,數據文件是用來存儲實際的消息內容,而索引文件是爲了加快消息內容的讀取。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可能又有朋友會問,Kafka 本身消費是以 Partition 維度順序消費消息的,磁盤在順序讀的時候效率很高完全沒有必要使用索引啊。其實 Kafka 爲了滿足一些特殊業務需求,比如要隨機消費 Partition 中的消息,此時可以先通過索引文件快速定位到消息的實際存儲位置,然後進行處理。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總結一下 Partition、Replica、Log 和 LogSegment 之間的關係。消息是以 Partition 維度進行管理的,爲了提高系統的可用性,每個 Partition 都可以設置相應的 Replica 副本數,一般在創建 Topic 的時候同時指定 Replica 的個數;Partition 和 Replica 的實際物理存儲形式是通過 Log 文件展現的,爲了防止消息不斷寫入,導致 Log 文件大小持續增長,所以將 Log 切割成一個一個的 LogSegment 
文件。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"注意:"},{"type":"text","text":" 在同一時刻,每個主 Partition 中有且只有一個 LogSegment 被標識爲可寫入狀態,當一個 LogSegment 文件大小超過一定大小後(比如當文件大小超過 1G,這個就類似於 HDFS 存儲的數據文件,HDFS 中數據文件達到 128M 的時候就會被分出一個新的文件來存儲數據),就會新創建一個 LogSegment 來繼續接收新寫入的消息。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"寫入消息流程分析"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d6/d606ad965df30f247150685adb7e3900.png","alt":null,"title":"消息寫入及落盤流程","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"boxShadow"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"流程解析"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在第 3 篇文章講過,生產者客戶端對於每個 Partition 一次會發送一批消息到服務端,服務端收到一批消息後寫入相應的 Partition 上。上圖流程主要分爲如下幾步:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"客戶端消息收集器收集屬於同一個分區的消息,並對每條消息設置一個偏移量,且每一批消息總是從 0 開始單調遞增。比如第一次發送 3 條消息,則對三條消息依次編號 [0,1,2],第二次發送 4 條消息,則消息依次編號爲 [0,1,2,3]。注意此處設置的消息偏移量是相對偏移量。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"客戶端將消息發送給服務端,服務端拿到下一條消息的絕對偏移量,將傳到服務端的這批消息的相對偏移量修改成絕對偏移量。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"將修改後的消息以追加的方式追加到當前活躍的 LogSegment 後面,然後更新絕對偏移量。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"將消息集寫入到文件通道。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"文件通道將消息集 flush 到磁盤,完成消息的寫入操作。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"瞭解以上過程後,我們在來看看消息的具體構成情況。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2b/2bdd6361c2d70ec481abe2892cea8b04.png","alt":null,"title":"消息構成細節圖","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"boxShadow"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一條消息由如下三部分構成:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"OffSet:偏移量,消息在客戶端發送前將相對偏移量存儲到該位置,當消息存儲到 LogSegment 
With that flow in mind, let's look at what a single stored message is made of.

![Message layout detail](https://static001.geekbang.org/infoq/2b/2bdd6361c2d70ec481abe2892cea8b04.png)

A message consists of three parts:

- **OffSet: the offset. Before the message is sent, the client stores the relative offset here; before the message is written into a LogSegment, it is rewritten to the absolute offset and then written to disk.**
- **Size: the size of this Message's content.**
- **Message: the message body itself, which in turn is made up of 7 fields: crc is a checksum of the message, magic is the message format version, Attribute carries the message attributes, key-length and value-length are the lengths of the key and the value, and key and value hold the corresponding contents.**

#### How message offsets and positions are calculated

As the flow above shows, every message is assigned an absolute offset before it is physically written to disk. Within a single partition, absolute offsets start at 0 and increase monotonically; offsets in different partitions are completely independent of each other. Let's now look at how the position of each message in the file is determined.

There are two ways to locate a given message: one is to read every message sequentially from the start, which is expensive, because we don't actually care about the message contents, only where each message sits; the other is to read just the Size field of each message and compute where the next message begins. For example, if the first message's content is "abc", it occupies 8 (OffSet field) + 4 (Size field) + 3 (content length) = 15 bytes, so the second message, "defg", starts at position 15; the message after it starts at 15 + 8 + 4 + 4 = 31, and so on.
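A small sketch of the second approach, assuming the field widths used in the example above (an 8-byte OffSet field and a 4-byte Size field): it walks the payload lengths and prints where each message starts in the data file.

```java
// Walk a sequence of message payload lengths and print each message's
// starting byte position in the segment's data file.
// Layout assumed from the example above: 8-byte offset + 4-byte size + payload.
public class PositionCalc {
    static final int OFFSET_FIELD_BYTES = 8;
    static final int SIZE_FIELD_BYTES = 4;

    public static void main(String[] args) {
        int[] payloadLengths = {3, 4};   // "abc", "defg"
        long position = 0;
        for (int len : payloadLengths) {
            System.out.println("message starts at byte " + position);
            position += OFFSET_FIELD_BYTES + SIZE_FIELD_BYTES + len;
        }
        System.out.println("next message would start at byte " + position); // prints 0, 15, then 31
    }
}
```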
## Consume flow and replica synchronization flow

Unlike the write path, the read path covers two cases: a consumer fetching messages, and a follower (backup) replica fetching from the leader to stay in sync. Before walking through the flow, it helps to pin down a few variables, otherwise the description can get confusing.

- **BaseOffSet**: the base offset. Each Partition consists of N LogSegments, and every LogSegment has a base offset. Conceptually this is an array in which each element is the base offset of one LogSegment, e.g. [0, 200, 400, 600, ...].
- **StartOffSet**: the start offset. Specified by the consumer in its fetch request: the position from which to start consuming.
- **MaxLength**: the fetch size. Specified by the consumer in its fetch request: the maximum amount of message data to pull in one request. It can be set via `max.partition.fetch.bytes` and defaults to 1 MB.
- **MaxOffSet**: the maximum offset, i.e. the highest position a consumer is allowed to read, commonly known as the "high watermark". It is determined by the broker and prevents consumers from reading messages that have not yet been fully committed. It is not used when a follower replica fetches from the leader.
- **MaxPosition**: the maximum position within the LogSegment being read. Once the segment containing the start offset is determined, a read of up to MaxLength bytes must not go beyond MaxPosition. MaxPosition is an actual physical position in the file, not an offset.

Suppose a consumer starts consuming from position 000000621; the relationship between these variables is shown below.

![Relationship between read positions](https://static001.geekbang.org/infoq/cf/cfc7109f3feee411295f4fff53dfd6a1.png)

#### The fetch flow for consumers and follower replicas

1. **The client determines the fetch position, i.e. the value of StartOffSet, and locates the corresponding LogSegment on the leader replica.**
2. **A LogSegment consists of an index file and a data file. Because the index entries are sorted in ascending order, the broker first finds, in the index file, the closest entry whose offset is less than or equal to StartOffSet.**
3. **That index entry points to a position in the data file. Since the data file is also ordered by offset, the broker scans forward from that position until it reaches the message whose offset equals StartOffSet; that is where consumption or replication begins.**
4. **Starting from StartOffSet, up to MaxLength bytes of data are read and returned to the consumer or the follower replica, to be consumed or replicated.**

Assuming the fetch starts at position 00000313, the fetch flow looks like this (a code sketch of the lookup follows the figure):

![Message fetch flow](https://static001.geekbang.org/infoq/1f/1f57c3ed946ddf792568d620a74e097a.png)
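The following is a minimal sketch of steps 1 through 3 of the fetch flow: choose the LogSegment whose base offset is the largest one not greater than StartOffSet, use its sparse index to find a nearby physical position, and scan forward from there. The `LogReader` class and the `TreeMap`-based index are assumptions for this sketch, not Kafka's actual index implementation.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative read path: base offsets -> a sparse index mapping offset -> file position.
class LogReader {
    private final TreeMap<Long, TreeMap<Long, Long>> segments = new TreeMap<>();

    // Return the physical position in the data file from which to scan for startOffset.
    long startScanPosition(long startOffset) {
        // Step 1: the segment with the largest base offset <= StartOffSet holds the message.
        Map.Entry<Long, TreeMap<Long, Long>> segment = segments.floorEntry(startOffset);
        if (segment == null) {
            throw new IllegalArgumentException("offset is below the log start offset");
        }
        // Step 2: in that segment's sparse index, take the largest indexed offset <= StartOffSet.
        Map.Entry<Long, Long> indexEntry = segment.getValue().floorEntry(startOffset);
        long filePosition = (indexEntry == null) ? 0L : indexEntry.getValue();
        // Step 3: from filePosition the data file is scanned forward, reading each message's
        // offset and Size field, until the message whose offset equals StartOffSet is found.
        return filePosition;
    }
}
```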
## Summary

This article analyzed the write and read paths of messages from both the logical and the physical storage angle. Logically, messages are managed in batches per Partition; a Partition maps to a Log object, a Log object manages multiple LogSegments, and multiple Partitions together make up a Topic. Physically, the messages are stored in a series of LogSegments, each of which consists of an index file and a data file. In the next article we will **look at common operations in real production environments and data ingestion approaches** — stay tuned.

Search for **【z小趙】** on WeChat to unlock more articles in this series.

More articles in this series:

[Kafka Series Part 5: Understanding the tricks behind the consumer](https://xie.infoq.cn/article/e185bc26ad2805c1663aee3bc)
[Kafka Series Part 4: What the network quietly does for you when a message is sent](https://xie.infoq.cn/article/85ea30a57605e6e9945bbfd0e)
[Important — Kafka Series Part 3: How a message gets stored on the Broker](https://xie.infoq.cn/article/5627b0a079b69d8491c229e26)
[Kafka Series Part 2: Installation and testing](https://xie.infoq.cn/article/712daa4dfd35d8d8b81e33654)