Kafka Documentation (16)----0.10.1-Document-Part (8)-Design-Kafka design principles


4. DESIGN





4.1 Motivation





We designed Kafka to be able to act as a unified platform for handling all the real-time data feeds a large company might have. To do this we had to think through a fairly broad set of use cases.

It would have to have high-throughput to support high volume event streams such as real-time log aggregation.

It would need to deal gracefully with large data backlogs to be able to support periodic data loads from offline systems.

It also meant the system would have to handle low-latency delivery to handle more traditional messaging use-cases.

We wanted to support partitioned, distributed, real-time processing of these feeds to create new, derived feeds. This motivated our partitioning and consumer model.

Finally in cases where the stream is fed into other data systems for serving, we knew the system would have to be able to guarantee fault-tolerance in the presence of machine failures.

Supporting these uses led us to a design with a number of unique elements, more akin to a database log than a traditional messaging system. We will outline some elements of the design in the following sections.





4.2 Persistence



Don't fear the filesystem!



Kafka relies heavily on the filesystem for storing and caching messages. There is a general perception that "disks are slow" which makes people skeptical that a persistent structure can offer competitive performance. In fact disks are both much slower and much faster than people expect depending on how they are used; and a properly designed disk structure can often be as fast as the network.





The key fact about disk performance is that the throughput of hard drives has been diverging from the latency of a disk seek for the last decade. As a result the performance of linear writes on a JBOD configuration of six 7200rpm SATA drives in a RAID-5 array is about 600MB/sec but the performance of random writes is only about 100KB/sec--a difference of over 6000X. These linear reads and writes are the most predictable of all usage patterns, and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes. A further discussion of this issue can be found in this ACM Queue article; they actually find that sequential disk access can in some cases be faster than random memory access!




To compensate for this performance divergence, modern operating systems have become increasingly aggressive in their use of main memory for disk caching. A modern OS will happily divert all free memory to disk caching with little performance penalty when the memory is reclaimed. All disk reads and writes will go through this unified cache. This feature cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, this data will likely be duplicated in OS pagecache, effectively storing everything twice.

Furthermore, we are building on top of the JVM, and anyone who has spent any time with Java memory usage knows two things:

  1. The memory overhead of objects is very high, often doubling the size of the data stored (or worse).
  2. Java garbage collection becomes increasingly fiddly and slow as the in-heap data increases.



As a result of these factors using the filesystem and relying on pagecache is superior to maintaining an in-memory cache or other structure—we at least double the available cache by having automatic access to all free memory, and likely double again by storing a compact byte structure rather than individual objects. Doing so will result in a cache of up to 28-30GB on a 32GB machine without GC penalties. Furthermore, this cache will stay warm even if the service is restarted, whereas the in-process cache will need to be rebuilt in memory (which for a 10GB cache may take 10 minutes) or else it will need to start with a completely cold cache (which likely means terrible initial performance). This also greatly simplifies the code as all logic for maintaining coherency between the cache and filesystem is now in the OS, which tends to do so more efficiently and more correctly than one-off in-process attempts. If your disk usage favors linear reads then read-ahead is effectively pre-populating this cache with useful data on each disk read.




This suggests a design which is very simple: rather than maintain as much as possible in-memory and flush it all out to the filesystem in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel's pagecache.
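
As a rough illustration of this write path (a minimal Java sketch, not Kafka's actual code), appends go straight to the filesystem and land in the pagecache; any fsync is a matter of policy rather than part of the hot path:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class PagecacheLog {
        private final FileChannel channel;

        public PagecacheLog(String path) throws IOException {
            // Each write lands in the kernel's pagecache immediately; the OS
            // flushes it to disk asynchronously (write-behind).
            channel = FileChannel.open(Paths.get(path),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        public void append(byte[] record) throws IOException {
            channel.write(ByteBuffer.wrap(record)); // no fsync per write
        }

        public void flushByPolicy() throws IOException {
            channel.force(false); // optional durability point, driven by policy
        }
    }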

This style of pagecache-centric design is described in an article on the design of Varnish here (along with a healthy dose of arrogance).




Constant Time Suffices



The persistent data structures used in messaging systems are often per-consumer queues with an associated BTree or other general-purpose random access data structures to maintain metadata about messages. BTrees are the most versatile data structure available, and make it possible to support a wide variety of transactional and non-transactional semantics in the messaging system. They do come with a fairly high cost, though: BTree operations are O(log N). Normally O(log N) is considered essentially equivalent to constant time, but this is not true for disk operations. Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time so parallelism is limited. Hence even a handful of disk seeks leads to very high overhead. Since storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures is often superlinear as data increases with fixed cache--i.e. doubling your data makes things much worse than twice as slow.




Intuitively a persistent queue could be built on simple reads and appends to files as is commonly the case with logging solutions. This structure has the advantage that all operations are O(1) and reads do not block writes or each other. This has obvious performance advantages since the performance is completely decoupled from the data size—one server can now take full advantage of a number of cheap, low-rotational speed 1+TB SATA drives. Though they have poor seek performance, these drives have acceptable performance for large reads and writes and come at 1/3 the price and 3x the capacity.
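
A sketch of why such a log gives O(1) reads (illustrative Java; the segment file name and offset are made up): a reader's position is just a byte offset, and a read is a single positional read with no index traversal:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class OffsetRead {
        public static void main(String[] args) throws IOException {
            try (FileChannel log = FileChannel.open(Paths.get("segment.log"),
                    StandardOpenOption.READ)) {
                long position = 4096;               // byte offset the reader remembered
                ByteBuffer chunk = ByteBuffer.allocate(64 * 1024);
                int n = log.read(chunk, position);  // positional read; no tree lookup
                System.out.println("read " + n + " bytes at offset " + position);
            }
        }
    }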




Having access to virtually unlimited disk space without any performance penalty means that we can provide some features not usually found in a messaging system. For example, in Kafka, instead of attempting to delete messages as soon as they are consumed, we can retain messages for a relatively long period (say a week). This leads to a great deal of flexibility for consumers, as we will describe.




4.3 Efficiency




We have put significant effort into efficiency. One of our primary use cases is handling web activity data, which is very high volume: each page view may generate dozens of writes. Furthermore, we assume each message published is read by at least one consumer (often many), hence we strive to make consumption as cheap as possible.




We have also found, from experience building and running a number of similar systems, that efficiency is a key to effective multi-tenant operations. If the downstream infrastructure service can easily become a bottleneck due to a small bump in usage by the application, such small changes will often create problems. By being very fast we help ensure that the application will tip-over under load before the infrastructure. This is particularly important when trying to run a centralized service that supports dozens or hundreds of applications on a centralized cluster as changes in usage patterns are a near-daily occurrence.




We discussed disk efficiency in the previous section. Once poor disk access patterns have been eliminated, there are two common causes of inefficiency in this type of system: too many small I/O operations, and excessive byte copying.

The small I/O problem happens both between the client and the server and in the server's own persistent operations.




To avoid this, our protocol is built around a "message set" abstraction that naturally groups messages together. This allows network requests to group messages together and amortize the overhead of the network roundtrip rather than sending a single message at a time. The server in turn appends chunks of messages to its log in one go, and the consumer fetches large linear chunks at a time.




This simple optimization produces orders of magnitude speed up. Batching leads to larger network packets, larger sequential disk operations, contiguous memory blocks, and so on, all of which allows Kafka to turn a bursty stream of random message writes into linear writes that flow to the consumers.

The other inefficiency is in byte copying. At low message rates this is not an issue, but under load the impact is significant. To avoid this we employ a standardized binary message format that is shared by the producer, the broker, and the consumer (so data chunks can be transferred without modification between them).




The message log maintained by the broker is itself just a directory of files, each populated by a sequence of message sets that have been written to disk in the same format used by the producer and consumer. Maintaining this common format allows optimization of the most important operation: network transfer of persistent log chunks. Modern unix operating systems offer a highly optimized code path for transferring data out of pagecache to a socket; in Linux this is done with the sendfile system call.




To understand the impact of sendfile, it is important to understand the common data path for transfer of data from file to socket:

  1. The operating system reads data from the disk into pagecache in kernel space
  2. The application reads the data from kernel space into a user-space buffer
  3. The application writes the data back into kernel space into a socket buffer
  4. The operating system copies the data from the socket buffer to the NIC buffer where it is sent over the network




This is clearly inefficient, there are four copies and two system calls. Using sendfile, this re-copying is avoided by allowing the OS to send the data from pagecache to the network directly. So in this optimized path, only the final copy to the NIC buffer is needed.
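
In Java this optimized path is exposed as FileChannel.transferTo, which delegates to sendfile where the OS supports it. A minimal sketch (host, port, and file name are illustrative, not from the Kafka codebase):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class ZeroCopySend {
        public static void main(String[] args) throws IOException {
            try (FileChannel log = FileChannel.open(Paths.get("segment.log"),
                    StandardOpenOption.READ);
                 SocketChannel socket = SocketChannel.open(
                         new InetSocketAddress("consumer-host", 9000))) {
                long pos = 0, size = log.size();
                while (pos < size) {
                    // pagecache -> NIC, without copying through user space
                    pos += log.transferTo(pos, size - pos, socket);
                }
            }
        }
    }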

We expect a common use case to be multiple consumers on a topic. Using the zero-copy optimization above, data is copied into pagecache exactly once and reused on each consumption instead of being stored in memory and copied out to kernel space every time it is read. This allows messages to be consumed at a rate that approaches the limit of the network connection.

This combination of pagecache and sendfile means that on a Kafka cluster where the consumers are mostly caught up you will see no read activity on the disks whatsoever as they will be serving data entirely from cache.

For more background on the sendfile and zero-copy support in Java, see this article.




End-to-end Batch Compression



In some cases the bottleneck is actually not CPU or disk but network bandwidth. This is particularly true for a data pipeline that needs to send messages between data centers over a wide-area network. Of course, the user can always compress its messages one at a time without any support needed from Kafka, but this can lead to very poor compression ratios as much of the redundancy is due to repetition between messages of the same type (e.g. field names in JSON or user agents in web logs or common string values). Efficient compression requires compressing multiple messages together rather than compressing each message individually.




Kafka supports this by allowing recursive message sets. A batch of messages can be clumped together compressed and sent to the server in this form. This batch of messages will be written in compressed form and will remain compressed in the log and will only be decompressed by the consumer.

Kafka supports GZIP, Snappy and LZ4 compression protocols. More details on compression can be found here.
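
Enabling this is purely a producer-side setting; a sketch using the Java client (the broker address and topic name are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CompressedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("compression.type", "lz4"); // or "gzip", "snappy"

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Whole batches are compressed together and stay compressed
                // in the broker's log until the consumer decompresses them.
                producer.send(new ProducerRecord<>("web-logs", "key", "value"));
            }
        }
    }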




4.4 The Producer




Load balancing



The producer sends data directly to the broker that is the leader for the partition without any intervening routing tier. To help the producer do this all Kafka nodes can answer a request for metadata about which servers are alive and where the leaders for the partitions of a topic are at any given time to allow the producer to appropriately direct its requests.



The client controls which partition it publishes messages to. This can be done at random, implementing a kind of random load balancing, or it can be done by some semantic partitioning function. We expose the interface for semantic partitioning by allowing the user to specify a key to partition by and using this to hash to a partition (there is also an option to override the partition function if need be). For example if the key chosen was a user id then all data for a given user would be sent to the same partition. This in turn will allow consumers to make locality assumptions about their consumption. This style of partitioning is explicitly designed to allow locality-sensitive processing in consumers.
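
For example, keying records by user id with the Java client routes every event for that user to the same partition (topic and broker address are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedSend {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Records with the same key hash to the same partition, so all
                // of user 123's events stay ordered on one partition.
                producer.send(new ProducerRecord<>("user-events", "123", "page_view"));
            }
        }
    }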


Asynchronous send



Batching is one of the big drivers of efficiency, and to enable batching the Kafka producer will attempt to accumulate data in memory and to send out larger batches in a single request. The batching can be configured to accumulate no more than a fixed number of messages and to wait no longer than some fixed latency bound (say 64k or 10 ms). This allows the accumulation of more bytes to send, and few larger I/O operations on the servers. This buffering is configurable and gives a mechanism to trade off a small amount of additional latency for better throughput.

Details on configuration and the api for the producer can be found elsewhere in the documentation.
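
A hedged sketch of those two knobs with the Java producer (the values mirror the "64k or 10 ms" example above; the broker address is a placeholder):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AsyncBatchingProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("batch.size", 65536); // accumulate up to ~64KB per partition
            props.put("linger.ms", 10);     // but wait no more than 10 ms

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // send() is asynchronous; the callback fires once the batch
                // containing this record has been acknowledged.
                producer.send(new ProducerRecord<>("events", "key", "value"),
                        (metadata, exception) -> {
                            if (exception != null) exception.printStackTrace();
                        });
            }
        }
    }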




4.5 The Consumer



The Kafka consumer works by issuing "fetch" requests to the brokers leading the partitions it wants to consume. The consumer specifies its offset in the log with each request and receives back a chunk of log beginning from that position. The consumer thus has significant control over this position and can rewind it to re-consume data if need be.
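
With the Java consumer that position is explicit: assign a partition, seek to any offset, and fetch from there (broker address, topic, and offset are illustrative):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class RewindingConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder
            props.put("group.id", "example-group");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("events", 0);
                consumer.assign(Collections.singletonList(tp));
                consumer.seek(tp, 42L); // rewind: the next fetch starts at offset 42
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> r : records)
                    System.out.println(r.offset() + ": " + r.value());
            }
        }
    }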




Push vs. pull



An initial question we considered is whether consumers should pull data from brokers or brokers should push data to the consumer. In this respect Kafka follows a more traditional design, shared by most messaging systems, where data is pushed to the broker from the producer and pulled from the broker by the consumer. Some logging-centric systems, such as Scribe and Apache Flume, follow a very different push-based path where data is pushed downstream. There are pros and cons to both approaches. However, a push-based system has difficulty dealing with diverse consumers as the broker controls the rate at which data is transferred. The goal is generally for the consumer to be able to consume at the maximum possible rate; unfortunately, in a push system this means the consumer tends to be overwhelmed when its rate of consumption falls below the rate of production (a denial of service attack, in essence). A pull-based system has the nicer property that the consumer simply falls behind and catches up when it can. This can be mitigated with some kind of backoff protocol by which the consumer can indicate it is overwhelmed, but getting the rate of transfer to fully utilize (but never over-utilize) the consumer is trickier than it seems. Previous attempts at building systems in this fashion led us to go with a more traditional pull model.




Another advantage of a pull-based system is that it lends itself to aggressive batching of data sent to the consumer. A push-based system must choose to either send a request immediately or accumulate more data and then send it later without knowledge of whether the downstream consumer will be able to immediately process it. If tuned for low latency, this will result in sending a single message at a time only for the transfer to end up being buffered anyway, which is wasteful. A pull-based design fixes this as the consumer always pulls all available messages after its current position in the log (or up to some configurable max size). So one gets optimal batching without introducing unnecessary latency.




The deficiency of a naive pull-based system is that if the broker has no data the consumer may end up polling in a tight loop, effectively busy-waiting for data to arrive. To avoid this we have parameters in our pull request that allow the consumer request to block in a "long poll" waiting until data arrives (and optionally waiting until a given number of bytes is available to ensure large transfer sizes).
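
On the Java consumer these correspond to two fetch settings; a sketch of just the relevant properties (the values are illustrative):

    import java.util.Properties;

    public class LongPollTuning {
        static Properties fetchProps() {
            Properties props = new Properties();
            // Let the broker hold the fetch until data arrives...
            props.put("fetch.max.wait.ms", 500);  // ...or 500 ms pass,
            props.put("fetch.min.bytes", 65536);  // and ~64KB is ready to send
            return props;
        }
    }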

You could imagine other possible designs which would be only pull, end-to-end. The producer would write to a local log, and brokers would pull from that with consumers pulling from them. A similar type of "store-and-forward" producer is often proposed. This is intriguing but we felt not very suitable for our target use cases which have thousands of producers. Our experience running persistent data systems at scale led us to feel that involving thousands of disks in the system across many applications would not actually make things more reliable and would be a nightmare to operate. And in practice we have found that we can run a pipeline with strong SLAs at large scale without a need for producer persistence.




Consumer Position




Keeping track of what has been consumed is, surprisingly, one of the key performance points of a messaging system.

Most messaging systems keep metadata about what messages have been consumed on the broker. That is, as a message is handed out to a consumer, the broker either records that fact locally immediately or it may wait for acknowledgement from the consumer. This is a fairly intuitive choice, and indeed for a single machine server it is not clear where else this state could go. Since the data structures used for storage in many messaging systems scale poorly, this is also a pragmatic choice--since the broker knows what is consumed it can immediately delete it, keeping the data size small.




What is perhaps not obvious is that getting the broker and consumer to come into agreement about what has been consumed is not a trivial problem. If the broker records a message as consumed immediately every time it is handed out over the network, then if the consumer fails to process the message (say because it crashes or the request times out or whatever) that message will be lost. To solve this problem, many messaging systems add an acknowledgement feature which means that messages are only marked as sent not consumed when they are sent; the broker waits for a specific acknowledgement from the consumer to record the message as consumed. This strategy fixes the problem of losing messages, but creates new problems. First of all, if the consumer processes the message but fails before it can send an acknowledgement then the message will be consumed twice. The second problem is around performance, now the broker must keep multiple states about every single message (first to lock it so it is not given out a second time, and then to mark it as permanently consumed so that it can be removed). Tricky problems must be dealt with, like what to do with messages that are sent but never acknowledged.




Kafka handles this differently. Our topic is divided into a set of totally ordered partitions, each of which is consumed by exactly one consumer within each subscribing consumer group at any given time. This means that the position of a consumer in each partition is just a single integer, the offset of the next message to consume. This makes the state about what has been consumed very small, just one number for each partition. This state can be periodically checkpointed. This makes the equivalent of message acknowledgements very cheap.

There is a side benefit of this decision. A consumer can deliberately rewind back to an old offset and re-consume data. This violates the common contract of a queue, but turns out to be an essential feature for many consumers. For example, if the consumer code has a bug and is discovered after some messages are consumed, the consumer can re-consume those messages once the bug is fixed.
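
A sketch of that cheap checkpoint with the Java consumer: the consumed state is just "next offset" per partition, committed after processing (setup and processing logic elided):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class Checkpointing {
        static void pollLoop(KafkaConsumer<String, String> consumer) {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> r : records) {
                    // process r ...
                }
                // Acknowledgement is one integer per partition: the offset of
                // the next message to consume.
                consumer.commitSync();
            }
        }
    }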




Offline Data Load



Scalable persistence allows for the possibility of consumers that only periodically consume such as batch data loads that periodically bulk-load data into an offline system such as Hadoop or a relational data warehouse.

In the case of Hadoop we parallelize the data load by splitting the load over individual map tasks, one for each node/topic/partition combination, allowing full parallelism in the loading. Hadoop provides the task management, and tasks which fail can restart without danger of duplicate data—they simply restart from their original position.




4.6 Message Delivery Semantics



Now that we understand a little about how producers and consumers work, let's discuss the semantic guarantees Kafka provides between producer and consumer. Clearly there are multiple possible message delivery guarantees that could be provided:

  • At most once—Messages may be lost but are never redelivered.
  • At least once—Messages are never lost but may be redelivered.
  • Exactly once—this is what people actually want, each message is delivered once and only once.



It's worth noting that this breaks down into two problems: the durability guarantees for publishing a message and the guarantees when consuming a message.

Many systems claim to provide "exactly once" delivery semantics, but it is important to read the fine print, most of these claims are misleading (i.e. they don't translate to the case where consumers or producers can fail, cases where there are multiple consumer processes, or cases where data written to disk can be lost).




Kafka's semantics are straight-forward. When publishing a message we have a notion of the message being "committed" to the log. Once a published message is committed it will not be lost as long as one broker that replicates the partition to which this message was written remains "alive". The definition of alive as well as a description of which types of failures we attempt to handle will be described in more detail in the next section. For now let's assume a perfect, lossless broker and try to understand the guarantees to the producer and consumer. If a producer attempts to publish a message and experiences a network error it cannot be sure if this error happened before or after the message was committed. This is similar to the semantics of inserting into a database table with an autogenerated key.




These are not the strongest possible semantics for publishers. Although we cannot be sure of what happened in the case of a network error, it is possible to allow the producer to generate a sort of "primary key" that makes retrying the produce request idempotent. This feature is not trivial for a replicated system because of course it must work even (or especially) in the case of a server failure. With this feature it would suffice for the producer to retry until it receives acknowledgement of a successfully committed message at which point we would guarantee the message had been published exactly once. We hope to add this in a future Kafka version.

Not all use cases require such strong guarantees. For uses which are latency sensitive we allow the producer to specify the durability level it desires. If the producer specifies that it wants to wait on the message being committed this can take on the order of 10 ms. However the producer can also specify that it wants to perform the send completely asynchronously or that it wants to wait only until the leader (but not necessarily the followers) have the message.




Now let's describe the semantics from the point-of-view of the consumer. All replicas have the exact same log with the same offsets. The consumer controls its position in this log. If the consumer never crashed it could just store this position in memory, but if the consumer fails and we want this topic partition to be taken over by another process the new process will need to choose an appropriate position from which to start processing. Let's say the consumer reads some messages -- it has several options for processing the messages and updating its position.



  1. It can read the messages, then save its position in the log, and finally process the messages. In this case there is a possibility that the consumer process crashes after saving its position but before saving the output of its message processing. In this case the process that took over processing would start at the saved position even though a few messages prior to that position had not been processed. This corresponds to "at-most-once" semantics as in the case of a consumer failure messages may not be processed.




  2. It can read the messages, process the messages, and finally save its position. In this case there is a possibility that the consumer process crashes after processing messages but before saving its position. In this case when the new process takes over the first few messages it receives will already have been processed. This corresponds to the "at-least-once" semantics in the case of consumer failure. In many cases messages have a primary key and so the updates are idempotent (receiving the same message twice just overwrites a record with another copy of itself).




  3. So what about exactly once semantics (i.e. the thing you actually want)? The limitation here is not actually a feature of the messaging system but rather the need to co-ordinate the consumer's position with what is actually stored as output. The classic way of achieving this would be to introduce a two-phase commit between the storage for the consumer position and the storage of the consumer's output. But this can be handled more simply and generally by simply letting the consumer store its offset in the same place as its output. This is better because many of the output systems a consumer might want to write to will not support a two-phase commit. As an example of this, our Hadoop ETL that populates data in HDFS stores its offsets in HDFS with the data it reads so that it is guaranteed that either data and offsets are both updated or neither is. We follow similar patterns for many other data systems which require these stronger semantics and for which the messages do not have a primary key to allow for deduplication.




So effectively Kafka guarantees at-least-once delivery by default and allows the user to implement at most once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.
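
The difference between the two weaker guarantees is just the order of "commit" and "process"; a sketch with the Java consumer (setup and processing logic elided):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DeliverySemantics {
        // At-most-once: commit first; a crash before processing loses messages.
        static void atMostOnce(KafkaConsumer<String, String> c) {
            ConsumerRecords<String, String> records = c.poll(1000);
            c.commitSync();
            for (ConsumerRecord<String, String> r : records) { /* process */ }
        }

        // At-least-once: process first; a crash before the commit replays them.
        static void atLeastOnce(KafkaConsumer<String, String> c) {
            ConsumerRecords<String, String> records = c.poll(1000);
            for (ConsumerRecord<String, String> r : records) { /* process */ }
            c.commitSync();
        }
    }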




4.7 Replication



Kafka replicates the log for each topic's partitions across a configurable number of servers (you can set this replication factor on a topic-by-topic basis). This allows automatic failover to these replicas when a server in the cluster fails so messages remain available in the presence of failures.

Other messaging systems provide some replication-related features, but, in our (totally biased) opinion, this appears to be a tacked-on thing, not heavily used, and with large downsides: slaves are inactive, throughput is heavily impacted, it requires fiddly manual configuration, etc. Kafka is meant to be used with replication by default—in fact we implement un-replicated topics as replicated topics where the replication factor is one.




The unit of replication is the topic partition. Under non-failure conditions, each partition in Kafka has a single leader and zero or more followers. The total number of replicas including the leader constitute the replication factor. All reads and writes go to the leader of the partition. Typically, there are many more partitions than brokers and the leaders are evenly distributed among brokers. The logs on the followers are identical to the leader's log—all have the same offsets and messages in the same order (though, of course, at any given time the leader may have a few as-yet unreplicated messages at the end of its log).

Followers consume messages from the leader just as a normal Kafka consumer would and apply them to their own log. Having the followers pull from the leader has the nice property of allowing the follower to naturally batch together log entries they are applying to their log.




As with most distributed systems automatically handling failures requires having a precise definition of what it means for a node to be "alive". For Kafka, node liveness has two conditions:

  1. A node must be able to maintain its session with ZooKeeper (via ZooKeeper's heartbeat mechanism)
  2. If it is a slave it must replicate the writes happening on the leader and not fall "too far" behind




We refer to nodes satisfying these two conditions as being "in sync" to avoid the vagueness of "alive" or "failed". The leader keeps track of the set of "in sync" nodes. If a follower dies, gets stuck, or falls behind, the leader will remove it from the list of in sync replicas. The determination of stuck and lagging replicas is controlled by the replica.lag.time.max.ms configuration.

In distributed systems terminology we only attempt to handle a "fail/recover" model of failures where nodes suddenly cease working and then later recover (perhaps without knowing that they have died). Kafka does not handle so-called "Byzantine" failures in which nodes produce arbitrary or malicious responses (perhaps due to bugs or foul play).




A message is considered "committed" when all in sync replicas for that partition have applied it to their log. Only committed messages are ever given out to the consumer. This means that the consumer need not worry about potentially seeing a message that could be lost if the leader fails. Producers, on the other hand, have the option of either waiting for the message to be committed or not, depending on their preference for tradeoff between latency and durability. This preference is controlled by the acks setting that the producer uses.

The guarantee that Kafka offers is that a committed message will not be lost, as long as there is at least one in sync replica alive, at all times.

Kafka will remain available in the presence of node failures after a short fail-over period, but may not remain available in the presence of network partitions.




Replicated Logs: Quorums, ISRs, and State Machines (Oh my!)




At its heart a Kafka partition is a replicated log. The replicated log is one of the most basic primitives in distributed data systems, and there are many approaches for implementing one. A replicated log can be used by other systems as a primitive for implementing other distributed systems in the state-machine style.

A replicated log models the process of coming into consensus on the order of a series of values (generally numbering the log entries 0, 1, 2, ...). There are many ways to implement this, but the simplest and fastest is with a leader who chooses the ordering of values provided to it. As long as the leader remains alive, all followers need to only copy the values and ordering the leader chooses.




Of course if leaders didn't fail we wouldn't need followers! When the leader does die we need to choose a new leader from among the followers. But followers themselves may fall behind or crash so we must ensure we choose an up-to-date follower. The fundamental guarantee a log replication algorithm must provide is that if we tell the client a message is committed, and the leader fails, the new leader we elect must also have that message. This yields a tradeoff: if the leader waits for more followers to acknowledge a message before declaring it committed then there will be more potentially electable leaders.

If you choose the number of acknowledgements required and the number of logs that must be compared to elect a leader such that there is guaranteed to be an overlap, then this is called a Quorum.




A common approach to this tradeoff is to use a majority vote for both the commit decision and the leader election. This is not what Kafka does, but let's explore it anyway to understand the tradeoffs. Let's say we have 2f+1 replicas. If f+1 replicas must receive a message prior to a commit being declared by the leader, and if we elect a new leader by electing the follower with the most complete log from at least f+1 replicas, then, with no more than f failures, the leader is guaranteed to have all committed messages. This is because among any f+1 replicas, there must be at least one replica that contains all committed messages. That replica's log will be the most complete and therefore will be selected as the new leader. There are many remaining details that each algorithm must handle (such as precisely defining what makes a log more complete, ensuring log consistency during leader failure or changing the set of servers in the replica set) but we will ignore these for now.




This majority vote approach has a very nice property: the latency is dependent on only the fastest servers. That is, if the replication factor is three, the latency is determined by the faster slave not the slower one.

There are a rich variety of algorithms in this family including ZooKeeper's Zab, Raft, and Viewstamped Replication. The most similar academic publication we are aware of to Kafka's actual implementation is PacificA from Microsoft.




The downside of majority vote is that it doesn't take many failures to leave you with no electable leaders. To tolerate one failure requires three copies of the data, and to tolerate two failures requires five copies of the data. In our experience having only enough redundancy to tolerate a single failure is not enough for a practical system, but doing every write five times, with 5x the disk space requirements and 1/5th the throughput, is not very practical for large volume data problems. This is likely why quorum algorithms more commonly appear for shared cluster configuration such as ZooKeeper but are less common for primary data storage. For example in HDFS the namenode's high-availability feature is built on a majority-vote-based journal, but this more expensive approach is not used for the data itself.




Kafka takes a slightly different approach to choosing its quorum set. Instead of majority vote, Kafka dynamically maintains a set of in-sync replicas (ISR) that are caught-up to the leader. Only members of this set are eligible for election as leader. A write to a Kafka partition is not considered committed until all in-sync replicas have received the write. This ISR set is persisted to ZooKeeper whenever it changes. Because of this, any replica in the ISR is eligible to be elected leader. This is an important factor for Kafka's usage model where there are many partitions and ensuring leadership balance is important. With this ISR model and f+1 replicas, a Kafka topic can tolerate f failures without losing committed messages.




For most use cases we hope to handle, we think this tradeoff is a reasonable one. In practice, to tolerate f failures, both the majority vote and the ISR approach will wait for the same number of replicas to acknowledge before committing a message (e.g. to survive one failure a majority quorum needs three replicas and one acknowledgement and the ISR approach requires two replicas and one acknowledgement). The ability to commit without the slowest servers is an advantage of the majority vote approach. However, we think it is ameliorated by allowing the client to choose whether they block on the message commit or not, and the additional throughput and disk space due to the lower required replication factor is worth it.




Another important design distinction is that Kafka does not require that crashed nodes recover with all their data intact. It is not uncommon for replication algorithms in this space to depend on the existence of "stable storage" that cannot be lost in any failure-recovery scenario without potential consistency violations. There are two primary problems with this assumption. First, disk errors are the most common problem we observe in real operation of persistent data systems and they often do not leave data intact. Secondly, even if this were not a problem, we do not want to require the use of fsync on every write for our consistency guarantees as this can reduce performance by two to three orders of magnitude. Our protocol for allowing a replica to rejoin the ISR ensures that before rejoining, it must fully re-sync again even if it lost unflushed data in its crash.




Unclean leader election: What if they all die?





Note that Kafka's guarantee with respect to data loss is predicated on at least one replica remaining in sync. If all the nodes replicating a partition die, this guarantee no longer holds.

However a practical system needs to do something reasonable when all the replicas die. If you are unlucky enough to have this occur, it is important to consider what will happen. There are two behaviors that could be implemented:

  1. Wait for a replica in the ISR to come back to life and choose this replica as the leader (hopefully it still has all its data).
  2. Choose the first replica (not necessarily in the ISR) that comes back to life as the leader.



This is a simple tradeoff between availability and consistency. If we wait for replicas in the ISR, then we will remain unavailable as long as those replicas are down. If such replicas were destroyed or their data was lost, then we are permanently down. If, on the other hand, a non-in-sync replica comes back to life and we allow it to become leader, then its log becomes the source of truth even though it is not guaranteed to have every committed message. By default, Kafka chooses the second strategy and favors choosing a potentially inconsistent replica when all replicas in the ISR are dead. This behavior can be disabled using configuration property unclean.leader.election.enable, to support use cases where downtime is preferable to inconsistency.




This dilemma is not specific to Kafka. It exists in any quorum-based scheme. For example in a majority voting scheme, if a majority of servers suffer a permanent failure, then you must either choose to lose 100% of your data or violate consistency by taking what remains on an existing server as your new source of truth.




Availability and Durability Guarantees



When writing to Kafka, producers can choose whether they wait for the message to be acknowledged by 0,1 or all (-1) replicas. Note that "acknowledgement by all replicas" does not guarantee that the full set of assigned replicas have received the message. By default, when acks=all, acknowledgement happens as soon as all the current in-sync replicas have received the message. For example, if a topic is configured with only two replicas and one fails (i.e., only one in sync replica remains), then writes that specify acks=all will succeed. However, these writes could be lost if the remaining replica also fails. Although this ensures maximum availability of the partition, this behavior may be undesirable to some users who prefer durability over availability. Therefore, we provide two topic-level configurations that can be used to prefer message durability over availability:

  1. Disable unclean leader election - if all replicas become unavailable, then the partition will remain unavailable until the most recent leader becomes available again. This effectively prefers unavailability over the risk of message loss. See the previous section on Unclean Leader Election for clarification.
  2. Specify a minimum ISR size - the partition will only accept writes if the size of the ISR is above a certain minimum, in order to prevent the loss of messages that were written to just a single replica, which subsequently becomes unavailable. This setting only takes effect if the producer uses acks=all and guarantees that the message will be acknowledged by at least this many in-sync replicas. This setting offers a trade-off between consistency and availability. A higher setting for minimum ISR size guarantees better consistency since the message is guaranteed to be written to more replicas which reduces the probability that it will be lost. However, it reduces availability since the partition will be unavailable for writes if the number of in-sync replicas drops below the minimum threshold.
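
A sketch of a durability-first producer configuration in Java (the broker address is a placeholder; the topic-level settings are shown as comments because they are applied when the topic is created or altered, not in client code):

    import java.util.Properties;

    public class DurabilityFirstConfig {
        static Properties producerProps() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder
            props.put("acks", "all"); // wait for all in-sync replicas
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            return props;
        }
        // Topic-level settings (applied at topic creation/alteration):
        //   min.insync.replicas=2                 -- refuse writes if ISR < 2
        //   unclean.leader.election.enable=false  -- prefer downtime to data loss
    }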




Replica Management



The above discussion on replicated logs really covers only a single log, i.e. one topic partition. However a Kafka cluster will manage hundreds or thousands of these partitions. We attempt to balance partitions within a cluster in a round-robin fashion to avoid clustering all partitions for high-volume topics on a small number of nodes. Likewise we try to balance leadership so that each node is the leader for a proportional share of its partitions.

It is also important to optimize the leadership election process as that is the critical window of unavailability. A naive implementation of leader election would end up running an election per partition for all partitions a node hosted when that node failed. Instead, we elect one of the brokers as the "controller". This controller detects failures at the broker level and is responsible for changing the leader of all affected partitions in a failed broker. The result is that we are able to batch together many of the required leadership change notifications which makes the election process far cheaper and faster for a large number of partitions. If the controller fails, one of the surviving brokers will become the new controller.




4.8 Log Compaction



Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition. It addresses use cases and scenarios such as restoring state after application crashes or system failure, or reloading caches after application restarts during operational maintenance. Let's dive into these use cases in more detail and then describe how compaction works.

So far we have described only the simpler approach to data retention where old log data is discarded after a fixed period of time or when the log reaches some predetermined size. This works well for temporal event data such as logging where each record stands alone. However an important class of data streams are the log of changes to keyed, mutable data (for example, the changes to a database table).


Let's discuss a concrete example of such a stream. Say we have a topic containing user email addresses; every time a user updates their email address we send a message to this topic using their user id as the primary key. Now say we send the following messages over some time period for a user with id 123, each message corresponding to a change in email address (messages for other ids are omitted):


  123 => [email protected]
          .
          .
          .
  123 => [email protected]
          .
          .
          .
  123 => [email protected]
Log compaction gives us a more granular retention mechanism so that we are guaranteed to retain at least the last update for each primary key (e.g. [email protected]). By doing this we guarantee that the log contains a full snapshot of the final value for every key not just keys that changed recently. This means downstream consumers can restore their own state off this topic without us having to retain a complete log of all changes.

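As an illustration, such a keyed stream could be produced as follows; this is a sketch only, with the topic name, broker address, and serializer choices as placeholders:

  import java.util.Properties;
  import org.apache.kafka.clients.producer.*;

  Properties props = new Properties();
  props.put("bootstrap.servers", "broker1:9092");   // placeholder
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

  Producer<String, String> producer = new KafkaProducer<>(props);
  // Every update is keyed by user id; after compaction the log retains
  // at least the latest value for key "123" ([email protected]).
  producer.send(new ProducerRecord<>("user-emails", "123", "[email protected]"));
  producer.send(new ProducerRecord<>("user-emails", "123", "[email protected]"));
  producer.send(new ProducerRecord<>("user-emails", "123", "[email protected]"));
  producer.close();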


Let's start by looking at a few use cases where this is useful, then we'll see how it can be used.

  1. Database change subscription. It is often necessary to have a data set in multiple data systems, and often one of these systems is a database of some kind (either a RDBMS or perhaps a new-fangled key-value store). For example you might have a database, a cache, a search cluster, and a Hadoop cluster. Each change to the database will need to be reflected in the cache, the search cluster, and eventually in Hadoop. In the case that one is only handling the real-time updates you only need recent log. But if you want to be able to reload the cache or restore a failed search node you may need a complete data set.
  2. Event sourcing. This is a style of application design which co-locates query processing with application design and uses a log of changes as the primary store for the application.
  3. Journaling for high-availability. A process that does local computation can be made fault-tolerant by logging out changes that it makes to its local state so another process can reload these changes and carry on if it should fail. A concrete example of this is handling counts, aggregations, and other "group by"-like processing in a stream query system. Samza, a real-time stream-processing framework, uses this feature for exactly this purpose.



In each of these cases one needs primarily to handle the real-time feed of changes, but occasionally, when a machine crashes or data needs to be re-loaded or re-processed, one needs to do a full load. Log compaction allows feeding both of these use cases off the same backing topic. This style of usage of a log is described in more detail in this blog post.

The general idea is quite simple. If we had infinite log retention, and we logged each change in the above cases, then we would have captured the state of the system at each time from when it first began. Using this complete log, we could restore to any point in time by replaying the first N records in the log. This hypothetical complete log is not very practical for systems that update a single record many times as the log will grow without bound even for a stable dataset. The simple log retention mechanism which throws away old updates will bound space but the log is no longer a way to restore the current state—now restoring from the beginning of the log no longer recreates the current state as old updates may not be captured at all.




Log compaction is a mechanism to give finer-grained per-record retention, rather than the coarser-grained time-based retention. The idea is to selectively remove records where we have a more recent update with the same primary key. This way the log is guaranteed to have at least the last state for each key.

This retention policy can be set per-topic, so a single cluster can have some topics where retention is enforced by size or time and other topics where retention is enforced by compaction.
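
For example, the two retention styles can coexist in one cluster via topic-level configs; the following sketch uses the 0.10.1-era kafka-topics.sh tool, with topic names, counts, and the retention value as placeholders:

  # compaction-based retention for this topic
  bin/kafka-topics.sh --create --zookeeper localhost:2181 \
      --topic user-emails --partitions 8 --replication-factor 3 \
      --config cleanup.policy=compact

  # time-based retention for another topic (example: 7 days)
  bin/kafka-topics.sh --create --zookeeper localhost:2181 \
      --topic page-views --partitions 8 --replication-factor 3 \
      --config retention.ms=604800000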

This functionality is inspired by one of LinkedIn's oldest and most successful pieces of infrastructure—a database changelog caching service called Databus. Unlike most log-structured storage systems Kafka is built for subscription and organizes data for fast linear reads and writes. Unlike Databus, Kafka acts as a source-of-truth store so it is useful even in situations where the upstream data source would not otherwise be replayable.



Log Compaction Basics


Here is a high-level picture that shows the logical structure of a Kafka log with the offset for each message.



The head of the log is identical to a traditional Kafka log. It has dense, sequential offsets and retains all messages. Log compaction adds an option for handling the tail of the log. The picture above shows a log with a compacted tail. Note that the messages in the tail of the log retain the original offset assigned when they were first written—that never changes. Note also that all offsets remain valid positions in the log, even if the message with that offset has been compacted away; in this case this position is indistinguishable from the next highest offset that does appear in the log. For example, in the picture above the offsets 36, 37, and 38 are all equivalent positions and a read beginning at any of these offsets would return a message set beginning with 38.


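To make the offset-equivalence point concrete, here is a consumer sketch that seeks to a compacted-away offset; the topic, partition, and broker address are placeholders:

  import java.util.*;
  import org.apache.kafka.clients.consumer.*;
  import org.apache.kafka.common.TopicPartition;

  Properties props = new Properties();
  props.put("bootstrap.servers", "broker1:9092");   // placeholder
  props.put("group.id", "example-group");
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

  KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
  TopicPartition tp = new TopicPartition("my-compacted-topic", 0);
  consumer.assign(Collections.singletonList(tp));
  consumer.seek(tp, 36);   // offsets 36 and 37 were compacted away
  // The first record returned has offset 38, the next offset that
  // still appears in the log.
  ConsumerRecords<String, String> records = consumer.poll(1000);
  consumer.close();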


Compaction also allows for deletes. A message with a key and a null payload will be treated as a delete from the log. This delete marker will cause any prior message with that key to be removed (as would any new message with that key), but delete markers are special in that they will themselves be cleaned out of the log after a period of time to free up space. The point in time at which deletes are no longer retained is marked as the "delete retention point" in the above diagram.
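
As a short sketch (reusing a producer configured as in the earlier examples; topic and key are placeholders), a delete marker is simply a record whose value is null:

  // A null payload for key "123" is treated as a delete: prior messages
  // with that key are removed by compaction, and the marker itself is
  // cleaned out after the delete retention period.
  producer.send(new ProducerRecord<String, String>("user-emails", "123", null));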

The compaction is done in the background by periodically recopying log segments. Cleaning does not block reads and can be throttled to use no more than a configurable amount of I/O throughput to avoid impacting producers and consumers. The actual process of compacting a log segment looks something like this:







What guarantees does log compaction provide?




Log compaction guarantees the following:
  1. Any consumer that stays caught-up to within the head of the log will see every message that is written; these messages will have sequential offsets. The topic's min.compaction.lag.ms can be used to guarantee the minimum length of time that must pass after a message is written before it could be compacted. I.e. it provides a lower bound on how long each message will remain in the (uncompacted) head.
  2. Ordering of messages is always maintained. Compaction will never re-order messages, just remove some.
  3. The offset for a message never changes. It is the permanent identifier for a position in the log.
  4. Any consumer progressing from the start of the log will see at least the final state of all records in the order they were written. Additionally, all delete markers for deleted records will be seen, provided the consumer reaches the head of the log in a time period less than the topic's delete.retention.ms setting (the default is 24 hours). This is important as delete marker removal happens concurrently with read, and thus it is important that we do not remove any delete marker prior to the consumer seeing it. (Both timing bounds are topic-level configs; see the sketch below.)
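
Both bounds above are set per topic; as a sketch, with example values only:

  # lower bound on how long a message stays in the uncompacted head (10 minutes)
  min.compaction.lag.ms=600000
  # how long delete markers are retained for readers (24 hours, the default)
  delete.retention.ms=86400000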



Log Compaction Details


Log compaction is handled by the log cleaner, a pool of background threads that recopy log segment files, removing records whose key appears in the head of the log. Each compactor thread works as follows:
  1. It chooses the log that has the highest ratio of log head to log tail
  2. It creates a succinct summary of the last offset for each key in the head of the log
  3. It recopies the log from beginning to end removing keys which have a later occurrence in the log. New, clean segments are swapped into the log immediately so the additional disk space required is just one additional log segment (not a full copy of the log).
  4. The summary of the log head is essentially just a space-compact hash table. It uses exactly 24 bytes per entry. As a result with 8GB of cleaner buffer one cleaner iteration can clean around 366GB of log head (assuming 1k messages); a worked version of this arithmetic follows below.
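
As a rough sketch of that arithmetic (the 1k message size is the assumption stated above):

  // 8GB of cleaner buffer at exactly 24 bytes per hash-table entry:
  long entries = 8L * 1024 * 1024 * 1024 / 24;   // ~358 million keys
  // each entry covers one ~1k message in the head of the log:
  long cleanableBytes = entries * 1024;          // ~366GB cleaned per iteration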



Configuring The Log Cleaner


The log cleaner is enabled by default. This will start the pool of cleaner threads. To enable log cleaning on a particular topic you can add the log-specific property



  log.cleanup.policy=compact
This can be done either at topic creation time or using the alter topic command.

The log cleaner can be configured to retain a minimum amount of the uncompacted "head" of the log. This is enabled by setting the compaction time lag.



  log.cleaner.min.compaction.lag.ms
This can be used to prevent messages newer than a minimum message age from being subject to compaction. If not set, all log segments are eligible for compaction except for the last segment, i.e. the one currently being written to. The active segment will not be compacted even if all of its messages are older than the minimum compaction time lag.

Further cleaner configurations are described here.





4.9 Quotas


Starting in 0.9, the Kafka cluster has the ability to enforce quotas on produce and fetch requests. Quotas are basically byte-rate thresholds defined per group of clients sharing a quota.



Why are quotas necessary?


It is possible for producers and consumers to produce/consume very high volumes of data and thus monopolize broker resources, cause network saturation and generally DOS other clients and the brokers themselves. Having quotas protects against these issues and is all the more important in large multi-tenant clusters where a small set of badly behaved clients can degrade user experience for the well behaved ones. In fact, when running Kafka as a service this even makes it possible to enforce API limits according to an agreed upon contract.



Client groups


The identity of Kafka clients is the user principal which represents an authenticated user in a secure cluster. In a cluster that supports unauthenticated clients, user principal is a grouping of unauthenticated users chosen by the broker using a configurable PrincipalBuilder. Client-id is a logical grouping of clients with a meaningful name chosen by the client application. The tuple (user, client-id) defines a secure logical group of clients that share both user principal and client-id.

Quotas can be applied to (user, client-id), user or client-id groups. For a given connection, the most specific quota matching the connection is applied. All connections of a quota group share the quota configured for the group. For example, if (user="test-user", client-id="test-client") has a produce quota of 10MB/sec, this is shared across all producer instances of user "test-user" with the client-id "test-client".





Quota Configuration


Quota configuration may be defined for (user, client-id), user and client-id groups. It is possible to override the default quota at any of the quota levels that needs a higher (or even lower) quota. The mechanism is similar to the per-topic log config overrides. User and (user, client-id) quota overrides are written to ZooKeeper under /config/users and client-id quota overrides are written under /config/clients. These overrides are read by all brokers and are effective immediately. This lets us change quotas without having to do a rolling restart of the entire cluster. See here for details. Default quotas for each group may also be updated dynamically using the same mechanism.


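As a sketch of writing one of these overrides with the kafka-configs.sh tool (entity names and byte rates are placeholders):

  bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
      --add-config 'producer_byte_rate=10485760,consumer_byte_rate=20971520' \
      --entity-type users --entity-name test-user \
      --entity-type clients --entity-name test-client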


The order of precedence for quota configuration is:

  1. /config/users/<user>/clients/<client-id>
  2. /config/users/<user>/clients/<default>
  3. /config/users/<user>
  4. /config/users/<default>/clients/<client-id>
  5. /config/users/<default>/clients/<default>
  6. /config/users/<default>
  7. /config/clients/<client-id>
  8. /config/clients/<default>
Broker properties (quota.producer.default, quota.consumer.default) can also be used to set defaults for client-id groups. These properties are being deprecated and will be removed in a later release. Default quotas for client-id can be set in Zookeeper similar to the other quota overrides and defaults.

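For reference, a sketch of those deprecated broker-level defaults in server.properties (values are examples, in bytes/sec per client-id group):

  quota.producer.default=10485760
  quota.consumer.default=20971520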



Enforcement



By default, each unique client group receives a fixed quota in bytes/sec as configured by the cluster. This quota is defined on a per-broker basis. Each client can publish/fetch a maximum of X bytes/sec per broker before it gets throttled. We decided that defining these quotas per broker is much better than having a fixed cluster wide bandwidth per client because that would require a mechanism to share client quota usage among all the brokers. This can be harder to get right than the quota implementation itself!

How does a broker react when it detects a quota violation? In our solution, the broker does not return an error rather it attempts to slow down a client exceeding its quota. It computes the amount of delay needed to bring a guilty client under its quota and delays the response for that time. This approach keeps the quota violation transparent to clients (outside of client-side metrics). This also keeps them from having to implement any special backoff and retry behavior which can get tricky. In fact, bad client behavior (retry without backoff) can exacerbate the very problem quotas are trying to solve.

Client byte rate is measured over multiple small windows (e.g. 30 windows of 1 second each) in order to detect and correct quota violations quickly. Typically, having large measurement windows (e.g. 10 windows of 30 seconds each) leads to large bursts of traffic followed by long delays, which is not great in terms of user experience.
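
The window behavior is controlled by broker settings; assuming the 0.10.1 property names quota.window.num and quota.window.size.seconds, a sketch matching the small-window example above would be:

  # 30 one-second samples per quota measurement window
  quota.window.num=30
  quota.window.size.seconds=1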

