A Technical Exploration of Scaling Kafka Clusters Beyond One Million Partitions

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本篇文章主要從元數據,controller 邏輯等方面介紹瞭如何解決支撐百萬 partition 的問題,運營大規模集羣其實還涉及到磁盤故障、冷讀、數據均衡等數據方面的問題,監控和報警服務同樣非常的重要。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"前言"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於小業務量的業務,往往多個業務共享 kafka 集羣,隨着業務規模的增長需要不停的增加 topic 或者是在原 topic 的基礎上擴容 partition 數,另外一些後來大體量的業務在試水階段也可能不會部署獨立的集羣,當業務規模爆發時,需要迅速擴容擴容集羣節點。在不犧牲穩定性的前提下單集羣規模有限,常常會碰到業務體量變大後無法在原集羣上直接進行擴容,只能讓業務創建新的集羣來支撐新增的業務量,這時用戶面臨系統變更的成本,有時由於業務關聯的原因,集羣分開後涉及到業務部署方案的改變,很難短時間解決。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了快速支持業務擴容,就需要我們在不需要業務方做任何改動的前提下對集羣進行擴容,大規模的集羣,往往意味着更多的 partition 數,更多的 broker 節點,下面會描述當集羣規模增長後主要面臨哪些方面的挑戰。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"挑戰"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. ZK 節點數"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka 的 topic 在 broker 上是以 partition 爲最小單位存放和進行復制的, 因此集羣需要維護每個 partition 的 Leader 信息,單個 partition 的多個副本都存放在哪些 broker 節點上,處於複製同步狀態的副本都有哪些。爲了存放這些元數據,kafka 集羣會爲每一個 partition 在 zk 集羣上創建一個節點,partition 的數量直接決定了 zk 上的節點數。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設集羣上有 1 萬個 topic,每個 topic 包含 100 個 partition,則 ZK 上節點數約爲 200 多萬個,快照大小約爲 300MB,ZK 節點數據變更,會把數據會寫在事務日誌中進行持久化存儲,當事務日誌達到一定的條目會全量寫入數據到持久化快照文件中,partition 節點數擴大意味着快照文件也大,全量寫入快照與事務日誌的寫入會相互影響,從而影響客戶端的響應速度,同時 zk 節點重啓加載快照的時間也會變長。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. Partition 複製"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka 的 partition 複製由獨立的複製線程負責,多個 partition 會共用複製線程,當單個 broker 上的 partition 增大以後,單個複製線程負責的 partition 數也會增多,每個 partition 對應一個日誌文件,當大量的 partition 同時有寫入時,磁盤上文件的寫入也會更分散,寫入性能變差,可能出現複製跟不上,導致 ISR 頻繁波動,調整複製線程的數量可以減少單個線程負責的 partition 數量,但是也加劇了磁盤的爭用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. 
3. Controller switchover time

Because of network or machine failures, a running cluster may go through controller switchovers. When the controller switches, it must recover from ZK the broker membership, each topic's partition replica assignments, and which node currently leads each partition, and then push the complete partition metadata to every broker node.

In tests on virtual machines, recovering the metadata of one million partitions from ZK onto the broker took about 37 s, and the serialized metadata for one million partitions was roughly 80 MB (the size depends on the replication factor, topic name lengths, and so on). After the other brokers receive the metadata, they deserialize it and apply it to their in-memory state; the acknowledgement round took about 40 s (the timing depends on the network environment).

The controller drives leader switchovers and the distribution of metadata to the other brokers in the cluster, so a longer controller recovery time raises the risk of cluster unavailability. If partition leaders need to move during a controller switchover, clients may be unable to learn the new leaders for quite a while, interrupting service.

4. Broker restart recovery time

Routine maintenance may require restarting brokers. To avoid impact on users, a broker notifies the controller to switch leaders away before it stops, and leader switchover likewise happens when a broker fails. Each leader switchover must update the partition state node in ZK and propagate a metadata update to the other brokers. With more partitions, the leader switchover time for a single broker grows.

From these factors we can see that a growing partition count directly lengthens controller failure recovery; more partitions per broker hurt disk performance and replication stability; and broker restarts take longer to switch leaders. We could of course cap the number of partitions per broker within the existing architecture to sidestep the per-broker effects, but that means more broker nodes in the cluster, a controller responsible for more brokers, and no reduction in the number of partitions the controller manages. So if we want a large number of partitions to share one cluster, the core problem is to either raise the processing capacity of a single controller or increase the number of controllers.

Solutions

1. Single ZK cluster

To raise the processing capacity of a single controller, the following optimizations can be applied:

- Fetch ZK nodes in parallel

When the controller pulls metadata from ZK, it already sends requests and awaits responses asynchronously rather than serially, but single-threaded processing still took about 37 s. We can pull metadata with multiple threads in parallel, each thread covering a subset of the partitions, and shrink the metadata fetch time.

In a simple simulation on virtual machines, fetching one million node entries took about 28 s single-threaded; spread over 5 threads, with each thread fetching 200,000 partitions, the total time dropped to about 14 s (these timings depend on the VM itself; on the same VM, a single thread fetching 200,000 partitions took only about 6 s). So during controller recovery, fetching partitions in parallel clearly shortens the recovery time.
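As a sketch of that optimization, the snippet below shards the per-partition state znodes across a small thread pool using the kazoo ZooKeeper client. The ZK hosts, topic names, and pool size are placeholders; the znode path layout follows classic ZK-based Kafka.

```python
# Sketch: parallel fetch of partition state znodes with kazoo.
# Hosts, topic names, and worker count are illustrative placeholders.
from concurrent.futures import ThreadPoolExecutor
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Classic ZK-based Kafka keeps per-partition state under
# /brokers/topics/<topic>/partitions/<id>/state.
paths = [
    f"/brokers/topics/topic-{t}/partitions/{p}/state"
    for t in range(10_000)
    for p in range(100)
]

def fetch_state(path):
    data, _stat = zk.get(path)  # blocking read of one znode
    return path, data

# Each worker thread owns a slice of the one million znodes, mirroring
# the multi-threaded controller recovery described above.
with ThreadPoolExecutor(max_workers=5) as pool:
    states = dict(pool.map(fetch_state, paths))
```

kazoo also offers get_async for keeping many reads in flight on one connection, which composes with the thread-pool approach.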
- Change how metadata is synchronized

As noted above, the metadata for one million partitions is roughly 80 MB. If we cap the partition count per broker, we need more broker nodes, and on a controller switchover the controller must sync metadata to this large set of brokers in parallel, hitting the controller node with a traffic spike; pushing 80 MB of metadata also takes a fairly long time. We therefore need to change how the cluster synchronizes metadata. For example, just as consumer offsets are stored, metadata could be kept in an internal topic: the controller would publish the data it writes to ZK as messages on that internal metadata topic, and each broker would consume the topic and apply the updates to its in-memory metadata. A scheme like this still allows a full metadata sync on controller switchover, but it requires a fairly large change to the current Kafka architecture (there are of course further options, such as not using ZK to manage metadata at all, but those are outside the scope of this article).
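To make the shape of that idea concrete, here is a rough sketch using the kafka-python client. The topic name __metadata_changelog, the record format, and the serialization are all illustrative assumptions, not Kafka internals; notably, Kafka's later KRaft mode adopts a similar design with its internal __cluster_metadata topic.

```python
# Rough sketch of "metadata in an internal topic". The topic name and
# record format are hypothetical; this is not how Kafka itself works.
import json
from kafka import KafkaProducer, KafkaConsumer

METADATA_TOPIC = "__metadata_changelog"  # hypothetical compacted topic

# Controller side: publish each partition-state change it writes to ZK.
producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send(METADATA_TOPIC, key="topic-a/0",
              value={"leader": 3, "isr": [3, 5, 7]})
producer.flush()

# Broker side: tail the changelog and refresh the in-memory view.
metadata_cache = {}
consumer = KafkaConsumer(METADATA_TOPIC,
                         bootstrap_servers="broker1:9092",
                         auto_offset_reset="earliest")
for record in consumer:
    metadata_cache[record.key.decode()] = json.loads(record.value)
```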
Is there another way to support very large partition counts while changing the Kafka architecture as little as possible? We know that when a Kafka client talks to brokers, it first pulls topic metadata from the configured addresses and then, based on that metadata, connects to the leader of each partition to produce and consume. By controlling the metadata, we control which machines the client connects to, and from the client's point of view those machines do not have to belong to the same cluster; the client only needs the state information of the partitions. So we can spread different topics across different clusters and then assemble the topic information from those clusters into a single response to the client, which lets the client connect to several clusters at once; from the client's perspective it is one big cluster. This way no single physical cluster has to support an enormous scale; we reach larger scale by combining multiple physical clusters, and expansion requires no downtime or business changes from users. The rest of this section describes how to implement this scheme.

2. Assembling small clusters into a logical cluster

When we build a logical cluster, several core problems must be solved:

1. When a client pulls metadata, how do we assemble the metadata of several small physical clusters into one response?
2. When metadata changes on different clusters, how are the changes propagated promptly?
3. How are the topics that store consumer offsets and transaction state distributed across the clusters?

These are addressed one by one below.

[Image: https://static001.geekbang.org/infoq/07/07aa0a4d9108be1882e4ac07013d1ad5.png]

- Metadata service

For metadata assembly, we can pick one physical cluster in the logical cluster as the primary cluster and make the others extension clusters. The primary cluster serves metadata, consumer offsets, and transactions to the outside (it can also serve message production and consumption), while extension clusters only serve production and consumption of business messages. We know that when a partition's leader switches, the cluster's controller must push the new metadata to the brokers in its cluster; but when the logical cluster is composed of several mutually independent physical clusters, a controller cannot see the broker nodes of the other clusters.

We can make a small change to the metadata endpoint of the primary cluster: when a client pulls metadata, the primary cluster fans out to the other clusters to pull their metadata, merges the results on the primary cluster, and returns them to the client.

Pulling metadata by fan-out adds some overhead, but under normal conditions it is not on the produce/consume path, so the impact on clients is small. Assembling the metadata at client pull time avoids the problem of propagating metadata updates across physical clusters and also guarantees freshness.
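Below is a minimal sketch of the fan-out-and-merge step, with canned data standing in for real Metadata responses; the cluster names and response shape are assumptions, and a real implementation would sit inside the primary brokers' metadata request handler.

```python
# Canned data standing in for Metadata responses from each physical
# cluster: topic -> {partition: leader broker endpoint}.
CLUSTER_METADATA = {
    "primary:9092": {"orders": {0: "primary-b1:9092", 1: "primary-b2:9092"}},
    "ext-1:9092":   {"clicks": {0: "ext1-b1:9092"}},
}

def merged_metadata(requested_topics):
    """Union the per-cluster views. Each topic lives in exactly one
    physical cluster, so a plain union never conflicts, and every
    leader points at a broker in the cluster that owns the topic."""
    view = {}
    for cluster, topics in CLUSTER_METADATA.items():
        for topic, partitions in topics.items():
            if topic in requested_topics:
                view[topic] = partitions
    return view

# From the client's perspective this looks like one big cluster:
print(merged_metadata({"orders", "clicks"}))
```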
- Consumer groups and transaction coordination

When the members of a consumer group coordinate which partitions each one fetches, the server returns a coordinator node derived from the partitioning of the topic that stores consumer offsets. Within a logical cluster we must therefore fix which cluster hosts the offsets topic, to avoid situations where contacting nodes in different physical clusters returns different coordinators, or where offsets fetched from different clusters disagree. We can have the brokers of the primary cluster provide consumer-group and transaction coordination, with consumer offsets stored only on the primary cluster.
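The coordinator lookup is what makes this pinning work: Kafka maps a group id to one partition of __consumer_offsets and designates that partition's leader as the group's coordinator, so keeping __consumer_offsets on the primary cluster keeps every coordinator there. A sketch of the lookup, with Java's String.hashCode reimplemented for fidelity (50 is the default offsets.topic.num.partitions):

```python
def java_string_hashcode(s: str) -> int:
    """Java's String.hashCode with 32-bit overflow semantics."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def coordinator_partition(group_id: str, num_offsets_partitions: int = 50) -> int:
    # Kafka: Utils.abs(groupId.hashCode) % offsets.topic.num.partitions,
    # where Utils.abs masks the hash to 31 bits.
    return (java_string_hashcode(group_id) & 0x7FFFFFFF) % num_offsets_partitions

# The leader of this __consumer_offsets partition is the coordinator:
print(coordinator_partition("payment-consumers"))
```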
With the modifications above we can support a much larger business scale, and users only need to know the address of the primary cluster.

Beyond these core problems, building a logical cluster also requires attention to topic placement. Since Tencent Cloud's CKafka already forwards topic-creation requests from brokers to a control module, it is straightforward to decide how topics are distributed across the physical clusters, and to prevent same-named topics from appearing in different physical clusters of one logical cluster.

- Splitting a single physical cluster

[Image: https://static001.geekbang.org/infoq/a4/a4e8a5258dde689912b7448fad297644.png]

The preceding sections described assembling multiple physical clusters into one logical cluster. Sometimes the opposite problem appears: a single physical cluster has to keep expanding partitions on existing topics for various reasons, and if several topics need to expand at the same time, the physical cluster may grow too large, so the existing cluster must be split into two physical clusters.

Splitting a cluster involves splitting the ZK ensemble and dividing the broker nodes into groups. First split the cluster's brokers into two groups, each connecting to different ZK nodes: for instance, we can add observer nodes to the original ZK ensemble, make the newly added brokers one group and the original brokers the other, and configure the new brokers with only the observer addresses. Before splitting the ZK ensemble, Kafka's built-in reassignment tool makes it easy to migrate topics onto their designated broker group, so that all partitions of a topic live on brokers of the same group. Afterwards, remove the observer nodes from the existing ZK ensemble and form a new ZK ensemble from the observers and additional ZK nodes, completing the split of the Kafka cluster.

Conclusion

Both raising controller performance and assembling multiple physical clusters into one logical cluster can increase the number of partitions a single cluster supports. Comparatively, combining physical clusters changes the existing Kafka architecture less, gives better guarantees on failure recovery time, and yields a more stable service.

Of course, when a business uses the Kafka service, if the business can keep cluster sizes with a moderate partition count, splitting by business line and connecting to different clusters is also a very good practice.

---

Cover image: Unsplash
Author: 丁俊
Original: https://mp.weixin.qq.com/s/LRM8GWFQbxQnKoq6HgCcwQ
Original title: Kafka集羣突破百萬partition 的技術探索
Source: 騰訊雲中間件 WeChat official account [ID: gh_6ea1bc2dd5fd]
Reprint notice: copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, please cite the source.