聊聊 Kafka: Kafka 爲啥這麼快?

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、前言","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們都知道 Kafka 是基於磁盤進行存儲的,但 Kafka 官方又稱其具有高性能、高吞吐、低延時的特點,其吞吐量動輒幾十上百萬。小夥伴們是不是有點困惑了,一般認爲在磁盤上讀寫數據是會降低性能的,因爲尋址會比較消耗時間。那 Kafka 又是怎麼做到其吞吐量動輒幾十上百萬的呢?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka 高性能,是多方面協同的結果,包括宏觀架構、分佈式 partition 存儲、ISR 數據同步、以及“無所不用其極”的高效利用磁盤、操作系統特性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"別急,下面老周從數據的寫入與讀取兩個維度來帶大家一探究竟。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、順序寫入","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"磁盤讀寫有兩種方式:順序讀寫或者隨機讀寫。在順序讀寫的情況下,磁盤的順序讀寫速度和內存持平。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲磁盤是機械結構,每次讀寫都會尋址->寫入,其中尋址是一個“機械動作”。爲了提高讀寫磁盤的速度,Kafka 就是使用順序 I/O。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/59/59ff0e2725731f3d30bb90b1685b1abf.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka 利用了一種分段式的、只追加 (Append-Only) 的日誌,基本上把自身的讀寫操作限制爲順序 I/O,也就使得它在各種存儲介質上能有很快的速度。一直以來,有一種廣泛的誤解認爲磁盤很慢。實際上,存儲介質 (特別是旋轉式的機械硬盤) 的性能很大程度依賴於訪問模式。在一個 7200 轉/分鐘的 SATA 機械硬盤上,隨機 I/O 的性能比順序 I/O 低了大概 3 到 4 個數量級。此外,一般來說現代的操作系統都會提供預讀和延遲寫技術:以大數據塊的倍數預先載入數據,以及合併多個小的邏輯寫操作成一個大的物理寫操作。正因爲如此,順序 I/O 和隨機 I/O 之間的性能差距在 flash 和其他固態非易失性存儲介質中仍然很明顯,儘管它遠沒有旋轉式的存儲介質那麼明顯。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏給出著名學術期刊 ACM Queue 上的性能對比圖: ","attrs":{}},{"type":"link","attrs":{"href":"https://queue.acm.org/detail.cf","title":"","type":null},"content":[{"type":"text","text":"https://queue.acm.org/detail.cf","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/06/06da3ee6a1ae893e1ad2d0222c6e3de3.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖就展示了 Kafka 是如何寫入數據的, 每一個 Partition 其實都是一個文件 ,收到消息後 Kafka 會把數據插入到文件末尾(虛框部分):","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/52/52c3550bb8ca227be565e29fc97955a3.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種方法採用了只讀設計 ,所以 Kafka 是不會修改、刪除數據的,它會把所有的數據都保留下來,每個消費者(Consumer)對每個 Topic 都有一個 offset 用來表示讀取到了第幾條數據 。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/26/2607db7b18b2c40ca7f2d2bbd92c2ded.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"磁盤的順序讀寫是磁盤使用模式中最有規律的,並且操作系統也對這種模式做了大量優化,Kafka 就是使用了磁盤順序讀寫來提升的性能。Kafka 的 message 是不斷追加到本地磁盤文件末尾的,而不是隨機的寫入,這使得 Kafka 寫入吞吐量得到了顯著提升。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、頁緩存","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"即便是順序寫入硬盤,硬盤的訪問速度還是不可能追上內存。所以 Kafka 的數據並不是實時的寫入硬盤 ,它充分利用了現代操作系統分頁存儲來利用內存提高 I/O 效率。具體來說,就是把磁盤中的數據緩存到內存中,把對磁盤的訪問變爲對內存的訪問。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka 接收來自 socket buffer 的網絡數據,應用進程不需要中間處理、直接進行持久化時。可以使用mmap 內存文件映射。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3.1 Memory Mapped Files","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"簡稱 ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"mmap","attrs":{}}],"attrs":{}},{"type":"text","text":",簡單描述其作用就是:","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"將磁盤文件映射到內存,用戶通過修改內存就能修改磁盤文件","attrs":{}}],"attrs":{}},{"type":"text","text":"。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"它的工作原理是直接利用操作系統的 Page 來實現磁盤文件到物理內存的直接映射。完成映射之後你對物理內存的操作會被同步到硬盤上(操作系統在適當的時候)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/24/24cbcf2a170b4fb6d9a5becc170e08ac.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過 mmap,進程像讀寫硬盤一樣讀寫內存(當然是虛擬機內存)。使用這種方式可以獲取很大的 I/O 提升,省去了用戶空間到內核空間複製的開銷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"mmap 也有一個很明顯的缺陷:不可靠,寫到 mmap 中的數據並沒有被真正的寫到硬盤,操作系統會在程序主動調用 flush 的時候才把數據真正的寫到硬盤。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka 提供了一個參數 producer.type 來控制是不是主動 flush:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果 Kafka 寫入到 mmap 之後就立即 flush,然後再返回 Producer 叫同步(sync);","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"寫入 mmap 之後立即返回 Producer 不調用 flush 叫異步(async)。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3.2 Java NIO 對文件映射的支持","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Java NIO,提供了一個 MappedByteBuffer 類可以用來實現內存映射。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MappedByteBuffer 只能通過調用 FileChannel 的 map() 取得,再沒有其他方式。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FileChannel.map() 是抽象方法,具體實現是在 FileChannelImpl.map() 可自行查看 JDK 源碼,其 map0() 方法就是調用了 Linux 內核的 mmap 的 API。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/57/57c4c7e9b87cf0deb00706b81860efc2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/31/31b86066b703c7e2d6662c5380b62bf6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0b/0b07926e039c15ebd99e0bb54b625f6d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3.3 使用 MappedByteBuffer 類注意事項","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"mmap 的文件映射,在 full gc 時纔會進行釋放。當 close 時,需要手動清除內存映射文件,可以反射調用 sun.misc.Cleaner 方法。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當一個進程準備讀取磁盤上的文件內容時:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"操作系統會先查看待讀取的數據所在的頁(page)是否在頁緩存(pagecache)中,如果存在(命中) 則直接返回數據,從而避免了對物理磁盤的 I/O 操作;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果沒有命中,則操作系統會向磁盤發起讀取請求並將讀取的數據頁存入頁緩存,之後再將數據返回給進程。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果一個進程需要將數據寫入磁盤:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"操作系統也會檢測數據對應的頁是否在頁緩存中,如果不存在,則會先在頁緩存中添加相應的頁,最後將數據寫入對應的頁。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"被修改過後的頁也就變成了髒頁,操作系統會在合適的時間把髒頁中的數據寫入磁盤,以保持數據的一致性。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對一個進程而言,它會在進程內部緩存處理所需的數據,然而這些數據有可能還緩存在操作系統的頁緩存中,因此同一份數據有可能被緩存了兩次。並且,除非使用 Direct I/O 的方式, 否則頁緩存很難被禁止。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當使用頁緩存的時候,即使 Kafka 服務重啓, 頁緩存還是會保持有效,然而進程內的緩存卻需要重建。這樣也極大地簡化了代碼邏輯,因爲維護頁緩存和文件之間的一致性交由操作系統來負責,這樣會比進程內維護更加安全有效。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka 中大量使用了頁緩存,這是 Kafka 實現高吞吐的重要因素之一。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"消息先被寫入頁緩存,由操作系統負責刷盤任務。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"四、零拷貝","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"導致應用程序效率低下的一個典型根源是緩衝區之間的字節數據拷貝。Kafka 使用由 Producer、Broker 和 Consumer 多方共享的二進制消息格式,因此數據塊即便是處於壓縮狀態也可以在不被修改的情況下在端到端之間流動。雖然消除通信各方之間的結構化差異是非常重要的一步,但它本身並不能避免數據的拷貝。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka 通過利用 Java 的 NIO 框架,尤其是 ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"java.nio.channels.FileChannel","attrs":{}}],"attrs":{}},{"type":"text","text":" 裏的 ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"transferTo","attrs":{}}],"attrs":{}},{"type":"text","text":" 這個方法,解決了前面提到的在 Linux 等類 UNIX 系統上的數據拷貝問題。此方法能夠在不借助作爲傳輸中介的應用程序的情況下,將字節數據從源通道直接傳輸到接收通道。要了解 NIO 的帶來的改進,請考慮傳統方式下作爲兩個單獨的操作:源通道中的數據被讀入字節緩衝區,接着寫入接收通道:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"File.read(fileDesc, buf, len);\nSocket.send(socket, buf, len);\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過圖表來說明,這個過程可以被描述如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4b/4b9f669ab61d5e3b3d70a42040a4581d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"儘管上面的過程看起來已經足夠簡單,但是在內部仍需要 4 次用戶態和內核態的上下文切換來完成拷貝操作,而且需要拷貝 4 次數據才能完成這個操作。下面的示意圖概述了每一個步驟中的上下文切換。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f5/f5b87dba849066e96922f7caea5624a8.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"讓我們來更詳細地看一下細節:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"初始的 read() 調用導致了一次用戶態到內核態的上下文切換。DMA (Direct Memory Access 直接內存訪問) 引擎讀取文件,並將其內容複製到內核地址空間中的緩衝區中。這個緩衝區和上面的代碼片段裏使用的並非同一個。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在從 read() 返回之前,內核緩衝區的數據會被拷貝到用戶態的緩衝區。此時,我們的程序可以讀取文件的內容。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來的 send() 方法會切換回內核態,拷貝用戶態的緩衝區數據到內核地址空間 —— 這一次是拷貝到一個關聯着目標套接字的不同緩衝區。在後臺,DMA 引擎會接手這一操作,異步地把數據從內核緩衝區拷貝到協議堆棧,由網卡進行網絡傳輸。 send() 方法在返回之前不等待此操作。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"send() 調用返回,切換回用戶態。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"儘管模式切換的效率很低,而且需要進行額外的拷貝,但在許多情況下,中間內核緩衝區的性能實際上可以進一步提高。比如它可以作爲一個預讀緩存,異步預載入數據塊,從而可以在應用程序前端運行請求。但是,當請求的數據量極大地超過內核緩衝區大小時,內核緩衝區就會成爲性能瓶頸。它不會直接拷貝數據,而是迫使系統在用戶態和內核態之間搖擺,直到所有數據都被傳輸完成。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比之下,零拷貝方式能在單個操作中處理完成。前面示例中的代碼片段現在能重寫爲一行程序:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"fileDesc.transferTo(offset, len, socket);\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"零拷貝方式可以用下圖來說明:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a9/a989cb6813f4bddf58c51af4b5f17cd7.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在這種模式下,上下文的切換次數被縮減至一次。具體來說, transferTo() 方法指示數據塊設備通過 DMA 引擎將數據讀入讀緩衝區,然後這個緩衝區的數據拷貝到另一個內核緩衝區中,分階段寫入套接字。最後,DMA 將套接字緩衝區的數據拷貝到 NIC 緩衝區中。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7e/7e24be3d841b9819f2705901a794726d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最終結果,我們已經把拷貝的次數從 4 降到了 3,而且其中只有一次拷貝佔用了 CPU 資源。我們也已經把上下文切換的次數從 4 降到了 2。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"把磁盤文件讀取 OS 內核緩衝區後的 fileChannel,直接轉給 socketChannel 發送;底層就是 sendfile。消費者從 broker 讀取數據,就是由此實現。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"具體來看,Kafka 的數據傳輸通過 TransportLayer 來完成,其子類 PlaintextTransportLayer 通過 Java NIO 的 FileChannel 的 transferTo 和 transferFrom 方法實現零拷貝。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/07/07f048c16d1d90a333bd278e12d70a73.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"注:transferTo 和 transferFrom 並不保證一定能使用零拷貝,需要操作系統支持。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這是一個巨大的提升,不過還沒有實現完全 \"零拷貝\"。我們可以通過利用 Linux 內核 2.4 或更高版本以及支持 gather 操作的網卡來做進一步的優化從而實現真正的 \"零拷貝\"。下面的示意圖可以說明:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/73/738c1b89fef6ad64f535a7f5ba756888.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"調用 transferTo() 方法會致使設備通過 DMA 引擎將數據讀入內核讀緩衝區,就像前面的例子那樣。然而,通過 gather 操作,讀緩衝區和套接字緩衝區之間的數據拷貝將不復存在。相反地,NIC 被賦予一個指向讀緩衝區的指針,連同偏移量和長度,所有數據都將通過 DMA 抽取乾淨並拷貝到 NIC 緩衝區。在這個過程中,在緩衝區間拷貝數據將無需佔用任何 CPU 資源。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"傳統的方式和零拷貝方式在 MB 字節到 GB 字節的文件大小範圍內的性能對比顯示,零拷貝方式相較於傳統方式的性能提升幅度在 2 到 3 倍。但更令人驚歎的是,Kafka 僅僅是在一個純 JVM 虛擬機下、沒有使用本地庫或 JNI 代碼,就實現了這一點。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"五、Broker 性能","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"5.1 日誌記錄批處理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"順序 I/O 在大多數的存儲介質上都非常快,幾乎可以和網絡 I/O 的峯值性能相媲美。在實踐中,這意味着一個設計良好的日誌結構的持久層將可以緊隨網絡流量的速度。事實上,Kafka 的瓶頸通常是網絡而非磁盤。因此,除了由操作系統提供的底層批處理能力之外,Kafka 的 Clients 和 Brokers 會把多條讀寫的日誌記錄合併成一個批次,然後才通過網絡發送出去。日誌記錄的批處理通過使用更大的包以及提高帶寬效率來攤薄網絡往返的開銷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"5.2 批量壓縮","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當啓用壓縮功能時,批處理的影響尤爲明顯,因爲壓縮效率通常會隨着數據量大小的增加而變得更高。特別是當使用 JSON 等基於文本的數據格式時,壓縮效果會非常顯著,壓縮比通常能達到 5 到 7 倍。此外,日誌記錄批處理在很大程度上是作爲 Client 側的操作完成的,此舉把負載轉移到 Client 上,不僅對網絡帶寬效率、而且對 Brokers 的磁盤 I/O 利用率也有很大的提升。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"5.3 非強制刷新緩衝寫操作","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一個助力 Kafka 高性能、同時也是一個值得更進一步去探究的底層原因:Kafka 在確認寫成功 ACK 之前的磁盤寫操作不會真正調用 fsync 命令;通常只需要確保日誌記錄被寫入到 I/O Buffer 裏就可以給 Client 回覆 ACK 信號。這是一個鮮爲人知卻至關重要的事實:事實上,這正是讓 Kafka 能表現得如同一個內存型消息隊列的原因 —— 因爲 Kafka 是一個基於磁盤的內存型消息隊列 (受緩衝區/頁面緩存大小的限制)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一方面,這種形式的寫入是不安全的,因爲副本的寫失敗可能會導致數據丟失,即使日誌記錄似乎已經被確認成功。換句話說,與關係型數據庫不同,確認一個寫操作成功並不等同於持久化成功。真正使得 Kafka 具備持久化能力的是運行多個同步的副本的設計;即便有一個副本寫失敗了,其他的副本(假設有多個)仍然可以保持可用狀態,前提是寫失敗是不相關的(例如,多個副本由於一個共同的上游故障而同時寫失敗)。因此,不使用 fsync 的 I/O 非阻塞方法和冗餘同步副本的結合,使得 Kafka 同時具備了高吞吐量、持久性和可用性。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"六、流數據並行","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"日誌結構 I/O 的效率是影響性能的一個關鍵因素,主要影響寫操作;Kafka 在對 Topic 結構和 Consumer 羣組的並行處理是其讀性能的基礎。這種組合產生了非常高的端到端消息傳遞總體吞吐量。併發性根深蒂固地存在於 Kafka 的分區方案和 Consumer Groups 的操作中,這是 Kafka 中一種有效的負載均衡機制 —— 把數據分區 (Partition) 近似均勻地分配給組內的各個 Consumer 實例。將此與更傳統的 MQ 進行比較:在 RabbitMQ 的等效設置中,多個併發的 Consumers 可能以輪詢的方式從隊列讀取數據,然而這樣做,就會失去消息消費的順序性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分區機制也使得 Kafka Brokers 可以水平擴展。每個分區都有一個專門的 Leader;因此,任何重要的主題 Topic (具有多個分區) 都可以利用整個 Broker 集羣進行寫操作,這是 Kafka 和消息隊列之間的另一個區別;後者利用集羣來獲得可用性,而 Kafka 將真正地在 Brokers 之間負載均衡,以獲得可用性、持久性和吞吐量。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"生產者在發佈日誌記錄之時指定分區,假設你正在發佈消息到一個有多個分區的 Topic 上。(也可能有單一分區的 Topic, 這種情況下將不成問題。) 這可以通過直接指定分區索引來完成,或者間接通過日誌記錄的鍵值來完成,該鍵值能被確定性地哈希到一個一致的 (即每次都相同) 分區索引。擁有相同哈希值的日誌記錄將會被存儲到同一個分區中。假設一個 Topic 有多個分區,那些不同哈希值的日誌記錄將很可能最後被存儲到不同的分區裏。但是,由於哈希碰撞的緣故,不同哈希值的日誌記錄也可能最後被存儲到相同的分區裏。這是哈希的本質,如果你理解哈希表的原理,那應該是顯而易見的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"日誌記錄的實際處理是由一個在 (可選的) Consumer Group 中的 Consumer 操作完成。Kafka 確保一個分區最多隻能分配給它的 Consumer Group 中的一個 Consumer 。(我們說 \"最多\" 是因爲考慮到一種全部 Consumer 都離線的情況。) 當第一個 Consumer Group 裏的 Consumer 訂閱了 Topic,它將消費這個 Topic 下的所有分區的數據。當第二個 Consumer 緊隨其後加入訂閱時,它將大致獲得這個 Topic 的一半分區,減輕第一個 Consumer 先前負荷的一半。這使得你能夠並行處理事件流,並根據需要增加 Consumer (理想情況下,使用自動伸縮機制),前提是你已經對事件流進行了合理的分區。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"日誌記錄吞吐量的控制一般通過以下兩種方式來達成:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Topic 的分區方案。應該對 Topics 進行分區,以最大限度地增加獨立子事件流的數量。換句話說,日誌記錄的順序應該只保留在絕對必要的地方。如果任意兩個日誌記錄在某種意義上沒有合理的關聯,那它們就不應該被綁定到同一個分區。這暗示你要使用不同的鍵值,因爲 Kafka 將使用日誌記錄的鍵值作爲一個散列源來派生其一致的分區映射。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一個組裏的 Consumers 數量。你可以增加 Consumer Group 裏的 Consumer 數量來均衡入站的日誌記錄的負載,這個數量的上限是 Topic 的分區數量。(如果你願意的話,你當然可以增加更多的 Consumers ,不過分區計數將會設置一個上限來確保每一個活躍的 Consumer 至少被指派到一個分區,多出來的 Consumers 將會一直保持在一個空閒的狀態。) 請注意, Consumer 可以是進程或線程。依據 Consumer 執行的工作負載類型,你可以在線程池中使用多個獨立的 Consumer 線程或進程記錄。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你之前一直想知道 Kafka 是否很快、它是如何擁有其現如今公認的高性能標籤,或者它是否可以滿足你的使用場景,那麼相信你現在應該有了所需的答案。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了讓事情足夠清楚,必須說明 Kafka 並不是最快的 (也就是說,具有最大吞吐量能力的) 消息傳遞中間件,還有其他具有更大吞吐量的平臺 —— 有些是基於軟件的 —— 有些是在硬件中實現的。Apache Pulsar 是一項極具前景的技術,它具備可擴展性,在提供相同的消息順序性和持久性保證的同時,還能實現更好的吞吐量-延遲效果。使用 Kafka 的根本原因是,它作爲一個完整的生態系統仍然是無與倫比的。它展示了卓越的性能,同時提供了一個豐富和成熟而且還在不斷進化的環境,儘管 Kafka 的規模已經相當龐大了,但仍以一種令人羨慕的速度在成長。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka 的設計者和維護者們在創造一個以性能導向爲核心的解決方案這方面做得非常出色。它的大多數設計/理念元素都是早期就構思完成、幾乎沒有什麼是事後纔想到的,也沒有什麼是附加的。從把工作負載分攤到 Client 到 Broker 上的日誌結構持久性,批處理、壓縮、零拷貝 I/O 和流數據級並行 —— Kafka 向幾乎所有其他面向消息的中間件 (商業的或開源的) 發起了挑戰。而且最令人歎爲觀止的是,它做到這些事情的同時竟然沒有犧牲掉持久性、日誌記錄順序性和至少交付一次的語義等特性。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"七、總結","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"7.1 mmap 和 sendfile","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Linux 內核提供、實現零拷貝的 API。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"mmap 將磁盤文件映射到內存,支持讀和寫,對內存的操作會反映在磁盤文件上。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sendfile 是將讀到內核空間的數據,轉到 socket buffer,進行網絡發送。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RocketMQ 在消費消息時,使用了 mmap;Kafka 使用了 sendfile。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"7.2 Kafka 爲啥這麼快?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Partition 順序讀寫,充分利用磁盤特性,這是基礎。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Producer 生產的數據持久化到 Broker,採用 mmap 文件映射,實現順序的快速寫入。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Customer 從 Broker 讀取數據,採用 sendfile,將磁盤文件讀到 OS 內核緩衝區後,直接轉到 socket buffer 進行網絡發送。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Broker 性能優化:日誌記錄批處理、批量壓縮、非強制刷新緩衝寫操作等。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"流數據並行","attrs":{}}]}]}],"attrs":{}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章