bLSM: A General Purpose Log Structured Merge Tree

  1. bLSM introduces a new merge scheduler that bounds write latency while sustaining steady write throughput, and additionally uses Bloom filters to improve performance.

Data management workloads are increasingly write-intensive and subject to strict latency SLAs. This presents a dilemma:
Update in place systems have unmatched latency but poor write throughput. In contrast, existing log structured techniques improve write throughput but sacrifice read performance and exhibit unacceptable latency spikes.
We begin by presenting a new performance metric: read fanout, and argue that, with read and write amplification, it better characterizes real-world indexes than approaches such as asymptotic analysis and price/performance.
We then present bLSM, a Log Structured Merge (LSM) tree with the advantages of B-Trees and log structured approaches: (1) Unlike existing log structured trees, bLSM has near-optimal read and scan performance, and (2) its new "spring and gear" merge scheduler bounds write latency without impacting throughput or allowing merges to block writes for extended periods of time. It does this by ensuring merges at each level of the tree make steady progress without resorting to techniques that degrade read performance.
We use Bloom filters to improve index performance, and find a number of subtleties arise. First, we ensure reads can stop after finding one version of a record. Otherwise, frequently written items would incur multiple B-Tree lookups.
Second, many applications check for existing values at insert. Avoiding the seek performed by the check is crucial.


1. INTRODUCTION


Modern web services rely upon two types of storage for small objects. The first, update-in-place, optimizes for random reads and worst case write latencies. Such stores are used by interactive, user facing portions of applications.
The second type is used for analytical workloads and emphasizes write throughput and sequential reads over latency or random access. This forces applications to be broken into "fast-path" processing, and asynchronous analytical tasks.
This impacts end-users (e.g., it may take hours for machine learning models to react to users' behavior), and forces operators to manage redundant storage infrastructures.
Such limitations are increasingly unacceptable. Cloud computing, mobile devices and social networking write data at unprecedented rates, and demand that updates be synchronously exposed to devices, users and other services. 
Unlike traditional write-heavy workloads, these applications have stringent latency SLAs.
These trends have two immediate implications at Yahoo!.
First, in 2010, typical low latency workloads were 80-90% reads. Today the ratio is approaching 50%, and the shift from reads to writes is expected to accelerate in 2012. These trends are driven by applications that ingest event logs (such as user clicks and mobile device sensor readings), and later mine the data by issuing long scans, or targeted point queries.
Second, the need for performant index probes in write optimized systems is increasing. bLSM is designed to be used as backing storage for PNUTS, our geographically-distributed key-value storage system [10], and Walnut, our next-generation elastic cloud storage system [9].


Historically, limitations of log structured indexes have presented a trade off. Update-in-place storage provided superior read performance and predictability for writes; log-structured trees traded this for improved write throughput.
This is reflected in the infrastructures of companies such as Yahoo!, Facebook [6] and Google, each of which employs InnoDB and either HBase [1] or BigTable [8] in production.
This paper argues that, with appropriate tuning and our merge scheduler improvements, LSM-trees are ready to supplant B-Trees in essentially all interesting application scenarios. The two workloads we describe above, interactive and analytical, are prime examples: they cover most applications, and, as importantly, are the workloads that most frequently push the performance envelope of existing systems.
Inevitably, switching from B-Trees to LSM-Trees entails a number of tradeoffs, and B-Trees still outperform log-structured approaches in a number of corner cases. These are summarized in Table 1.


Concretely, we target "workhorse" data management systems that are provisioned to minimize the price/performance of storing and processing large sets of small records. 
Serving applications written atop such systems are dominated by point queries, updates and occasional scans, while analytical processing consists of bulk writes and scans. Unlike B-Trees and existing log structured systems, our approach is appropriate for both classes of applications.
Section 2 provides an overview of log structured index variants. We explain why partitioned, three level trees with Bloom filters (Figure 1) are particularly compelling.
Section 3 discusses subtleties that arise with the base approach, and presents algorithmic tweaks that improve read, scan and insert performance.
Section 4 describes our design in detail and presents the missing piece: with careful scheduling of background merge tasks we are able to automatically bound write latencies without sacrificing throughput.
Beyond recommending and justifying a combination of LSM-Tree related optimizations, the primary contribution of this paper is a new class of merge schedulers called level schedulers. We distinguish level schedulers from existing partition schedulers and present a level scheduler we call the spring and gear scheduler.
In Section 5 we confirm our LSM-Tree design matches or outperforms B-Trees on a range of workloads. We compare against LevelDB, a state-of-the-art LSM-Tree variant that has been highly optimized for desktop workloads and makes different tradeoffs than our approach. It is a multi-level tree that does not make use of Bloom filters and uses a partition scheduler to schedule merges. These differences allow us to isolate and experimentally validate the effects of each of the major decisions that went into our design.


2. BACKGROUND


Storage systems can be characterized in terms of whether they use random I/O to update data, or write sequentially and then use additional sequential I/O to asynchronously maintain the resulting structure. Over time, the ratio of the cost of hard disk random I/O to sequential I/O has increased, decreasing the relative cost of the additional sequential I/O and widening the range of workloads that can benefit from log-structured writes.
As object sizes increase, update-in-place techniques begin to outperform log structured techniques. Increasing the relative cost of random I/O increases the object size that determines the "cross over" point where update-in-place techniques outperform log structured ones. These trends make log structured techniques more attractive over time.
Our discussion focuses on three classes of disk layouts: update-in-place B-Trees, ordered log structured stores, and unordered log structured stores. B-Tree read performance is essentially optimal. For most applications they perform at most one seek per read. Unfragmented B-Trees perform one seek per scan. However, they use random writes to achieve these properties. The goal of our work is to improve upon B-Tree writes without sacrificing read or scan performance.
Ordered log structured indexes buffer updates in RAM, sort them, and then write sorted runs to disk. Over time, the runs are merged, bounding overheads incurred by reads and scans. The cost of the merges depends on the indexed data's size and the amount of RAM used to buffer writes.
Unordered log structured indexes write data to disk immediately, eliminating the need for a separate log. The cost of compacting these stores is a function of the amount of free space reserved on the underlying device, and is independent of the amount of memory used as cache. Unordered stores typically have higher sustained write throughput than ordered stores (order of magnitude differences are not uncommon [22, 27, 28, 32]). These benefits come at a price: unordered stores do not provide efficient scan operations.
Scans are required by a wide range of applications (and exported by PNUTS and Walnut), and are essential for efficient relational query processing. Since we target such use cases, we are unable to make use of unordered techniques.
However, such techniques complement ours, and a number of implementations are available (Section 6). We now turn to a discussion of the tradeoffs made by ordered approaches.


2.1 Terminology


A common theme of this work is the tradeoff between asymptotic and constant factor performance improvements.
We use the following concepts to reason about such tradeoffs. Read amplification [7] and write amplification [22] characterize the cost of reads and writes versus optimal schemes.
We measure read amplification in terms of seeks, since at least one random read is required to access an uncached piece of data, and the seek cost generally dwarfs the transfer cost.
In contrast, writes can be performed using sequential I/O, so we express write amplification in terms of bandwidth.
By convention, our computations assume worst-case access patterns and optimal caching policies.

Write amplification includes both the synchronous cost of the write, and the cost of deferred merges or compactions.
Given a desired read amplification, we can compute the read fanout (our term) of an index. The read fanout is the ratio of the data size to the amount of RAM used by the index.
To simplify our calculations, we linearly approximate read fanout by only counting the cost of storing the bottom-most layer of the index pages in RAM. Main memory has grown to the point where read amplifications of one (or even zero) are common (Appendix A). Here, we focus on read fanouts with a read amplification of one.
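To make the approximation concrete, here is a small back-of-the-envelope sketch; the data volume, page size, and per-leaf index entry size are hypothetical and only illustrate how the linear approximation is computed.

```python
# Back-of-the-envelope read fanout calculation (hypothetical numbers).
# Linear approximation: only the bottom-most layer of index pages
# (one key/pointer entry per leaf page) is kept in RAM.

data_size_bytes = 1000 * 2**30   # 1 TB of indexed data (assumed)
page_size_bytes = 4096           # leaf page size (assumed)
entry_size_bytes = 32            # key + pointer kept in RAM per leaf page (assumed)

num_leaf_pages = data_size_bytes // page_size_bytes
index_ram_bytes = num_leaf_pages * entry_size_bytes

# Read fanout = data size / RAM used by the index.
read_fanout = data_size_bytes / index_ram_bytes
print(f"index RAM: {index_ram_bytes / 2**30:.1f} GB, read fanout: {read_fanout:.0f}")
```

Under these assumed numbers the index needs roughly 8 GB of RAM for 1 TB of data, giving a read fanout of about 128.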


2.2 B-Trees


Assuming that keys fit in memory, B-Trees provide optimal random reads; they perform a single disk seek each time an uncached piece of data is requested. In order to perform an update, B-Trees read the old version of the page, modify it, and asynchronously write the modification to disk. 
This works well if data fits in memory, but performs two disk seeks when the data to be modified resides on disk. Update-in-place hashtables behave similarly, except that they give up the ability to perform scans in order to make more efficient use of RAM.
Update-in-place techniques’ effective write amplifications depend on the underlying disk. Modern hard disks transfer 100-200MB/sec, and have mean access times over 5ms. One-thousand-byte key-value pairs are fairly common; it takes the disk 10us to write such a tuple sequentially.
Performing two seeks takes a total of 10ms, giving us a write amplification of approximately 1000.
Appendix A applies these calculations and a variant of the five minute rule [15] to estimate the memory required for a read amplification of one with various disk technologies.
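As a rough illustration of this arithmetic, the following sketch recomputes the write amplification from the disk parameters quoted above; the numbers are the ballpark figures from the text, not measurements.

```python
# Rough write-amplification estimate for update-in-place writes,
# using the disk parameters quoted above (assumed, not measured).

seek_time_s = 0.005              # mean access time of ~5 ms
transfer_bytes_per_s = 100e6     # ~100 MB/s sequential transfer
tuple_bytes = 1000               # a 1000-byte key-value pair

# Writing the tuple sequentially takes ~10 us.
sequential_write_s = tuple_bytes / transfer_bytes_per_s

# An update of uncached data reads the old page and later writes it back:
# two seeks, ~10 ms in total.
update_in_place_s = 2 * seek_time_s

print(f"write amplification ~ {update_in_place_s / sequential_write_s:.0f}")  # ~1000
```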


2.3 LSM-Trees


This section begins by describing the base LSM-Tree algorithm, which has unacceptably high read amplification, cannot take advantage of write locality, and can stall application writes for arbitrary periods of time. However, LSM-Tree write amplification is much lower than that of a B-Tree, and the disadvantages are avoidable.
Write skew can be addressed with tree partitioning, which is compatible with the other optimizations. However, we find that long write pauses are not adequately addressed by existing proposals or by state-of-the-art implementations.


2.3.1 Base algorithm


LSM-Trees consist of a number of append-only B-Trees and a smaller update-in-place tree that fits in memory. 
We call the in-memory tree C0. Repeated scans and merges of these trees are used to spill memory to disk, and to bound the number of trees consulted by each read.
The trees are stored in key order on disk and the in-memory tree supports efficient ordered scans. Therefore, each merge can be performed in a single pass. In the version of the algorithm we implement (Figure 1), the number of trees is constant. The on-disk trees are ordered by freshness;
the newest data is in C1. Newly merged trees replace the higher-numbered of the two input trees. Tree merges are always performed between Ci and Ci+1.
The merge threads attempt to ensure that the trees increase in size exponentially, as this minimizes the amortized cost of garbage collection. This can be proven in many different ways, and stems from a number of observations [25]:
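To make the structure concrete, here is a minimal, illustrative sketch of a three-component LSM-Tree with merges restricted to adjacent levels; the capacities, the ratio R, and the in-memory sorted runs are simplifying assumptions, not the paper's implementation.

```python
# Minimal illustrative sketch of a three-component LSM-Tree: C0 in memory,
# C1 and C2 on disk, merges only between adjacent components, newer data
# shadowing older data. Capacities and names are assumptions for clarity.

C0_CAPACITY = 1024          # entries buffered in memory before spilling (assumed)
R = 10                      # target size ratio between adjacent levels (assumed)

C0 = {}                     # in-memory update-in-place component
C1 = []                     # on-disk sorted run (newer)
C2 = []                     # on-disk sorted run (older, largest)

def merge(newer, older):
    """Single-pass merge of two sorted runs; entries from 'newer' win."""
    combined = dict(older)
    combined.update(dict(newer))
    return sorted(combined.items())

def write(key, value):
    global C1, C2
    C0[key] = value
    if len(C0) >= C0_CAPACITY:              # C0 full: merge it into C1
        C1 = merge(sorted(C0.items()), C1)
        C0.clear()
        if len(C1) >= R * C0_CAPACITY:      # C1 full: merge it into C2
            C2 = merge(C1, C2)
            C1 = []
```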


2.3.2 Leveraging write skew


Although we describe a special case solution in Section 4.2, partitioning is the best way to allow LSM-Trees to leverage write skew [16]. Breaking the LSM-Tree into smaller trees and merging the trees according to their update rates concentrates merge activity on frequently updated key ranges.
It also addresses one source of write pauses. If the distribution of the keys of incoming writes varies significantly from the existing distribution, then large ranges of the larger tree component may be disjoint from the smaller tree. Without partitioning, merge threads needlessly copy the disjoint data, wasting I/O bandwidth and stalling merges of smaller trees. 
During these stalls, the application cannot make forward progress.
Partitioning can intersperse merges that will quickly consume data from the small tree with merges that will slowly consume the data, “spreading out” the pause over time. Sections 4.1 and 5 argue and show experimentally that, although necessary in some scenarios, such techniques are inadequate protection against long merge pauses. This conclusion is in line with the reported behavior of systems with partition-based merge schedulers [1, 12, 19] and with previous work [16], which finds that partitioned and baseline LSM-Tree merges have similar latency properties.


3. ALGORITHMIC IMPROVEMENTS


We now present bLSM, our new LSM-Tree variant, which addresses the LSM-Tree limitations we describe above.

The first limitation of LSM-Trees, excessive read amplification, is only partially addressed by Bloom filters, and is closely related to two other issues: exhaustive lookups which needlessly retrieve multiple versions of a record, and seeks during insert. 
The second issue, write pauses, requires scheduling infrastructure that is missing from current implementations.


3.1 Reducing read amplification


Fractional cascading and Bloom filters both reduce read amplification. Fractional cascading reduces asymptotic costs; Bloom filters instead improve performance by a constant factor.
The Bloom filter approach protects the C1...CN tree components with Bloom filters. The amount of memory it requires is a function of the number of items to be inserted, not the items’ sizes. 
Allocating 10 bits per item leads to a 1% false positive rate, and is a reasonable tradeoff in practice.
Such Bloom filters reduce the read amplification of LSM-Tree point lookups from N to 1 + N/100.
Unfortunately, Bloom Filters do not improve scan performance. Appendix A runs through a “typical” application scenario; Bloom filters would increase memory utilization by about 5% in that setting.
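For reference, the sizing rule quoted above follows from the standard Bloom filter analysis; the sketch below uses the usual approximation p ≈ 2^(-k) with k = (m/n) ln 2 hash functions, which is textbook material rather than anything specific to bLSM.

```python
import math

# Textbook Bloom filter sizing, not bLSM-specific: with m/n bits per key and
# the optimal k = (m/n) * ln(2) hash functions, the false positive rate is
# approximately 2 ** -k.

def false_positive_rate(bits_per_key):
    k = bits_per_key * math.log(2)          # optimal number of hash functions
    return 2 ** -k

print(false_positive_rate(10))              # ~0.008, i.e. roughly 1%

# With roughly a 1% false positive rate per on-disk component, a point lookup
# over N components costs about 1 + N/100 seeks instead of N.
```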
Unlike Bloom filters, fractional cascading [18] reduces the asymptotic complexity of write-optimized LSM-Trees. Instead of varying R, these trees hold R constant and add additional levels as needed, leading to a logarithmic number of levels and logarithmic write amplification. Lookups and scans access a logarithmic (instead of constant) number of tree components. 
Such techniques are used in systems that must maintain a large number of materialized views, such as the TokuDB MySQL storage engine [18].
Fractional cascading includes pointers in tree component leaf pages that point into the leaves of the next largest tree.
Since R is constant, the cost of traversing one of these pointers is also constant. This eliminates the logarithmic factor associated with performing multiple B-Tree traversals.
The problem with this scheme is that the cascade steps of the search examine pages that likely reside on disk. In effect, it eliminates a logarithmic in-memory overhead by increasing read amplification by a logarithmic factor.
Figure 2 provides an overview of the lookup process, and plots read amplification vs. fanout for fractional cascading and for three-level LSM-Trees with Bloom filters. No setting of R allows fractional cascading to provide reads competitive with Bloom filters—reducing read amplification to 1 requires an R large enough to ensure that there is a single on-disk tree component. Doing so leads to O(n) write amplifications. Given this, we opt for Bloom filters.


3.1.1 Limiting read amplification for frequently updated data


On their own, Bloom filters cannot ensure that read amplifications are close to 1, since copies of a record (or its deltas) may exist in multiple trees. To get maximum read performance, applications should avoid writing deltas, and instead write base records for each update.
Our reads begin with the lowest numbered tree component, continue with larger components in order and stop at the first base record. Our reads are able to terminate early because they distinguish between base records and deltas, and because updates to the same tuple are placed in tree levels consistent with their ordering. 
This guarantees that reads encounter the most recent version first, and has no negative impact on write throughput. Other systems nondeterministically assign reads to on-disk components, and use timestamps to infer write ordering. This breaks early termination, and can lead to update anomalies [1].
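A sketch of the early-terminating lookup just described; the component interface and the base-record flag are illustrative assumptions about how record versions might be tagged.

```python
# Illustrative early-terminating lookup: components are searched newest-first
# and the search stops at the first base record, applying any deltas seen so
# far. The component interface and record tagging are assumptions.

def lookup(key, components, apply_deltas):
    """components: iterable ordered newest (C0) to oldest (C2)."""
    deltas = []
    for component in components:
        record = component.get(key)
        if record is None:
            continue                      # Bloom filter or tree miss
        if record["is_base"]:
            return apply_deltas(record, deltas)   # stop: newest base record found
        deltas.append(record)             # a delta: keep searching older trees
    return apply_deltas(None, deltas)     # no base record found at any level
```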


3.1.2 Zero-seek “insert if not exists”


One might think that maintaining a Bloom filter on the largest tree component is a waste; this Bloom filter is by far the largest in the system, and (since C2 is the last tree to be searched) it only accelerates lookups of non-existent data.
It turns out that such lookups are extremely common; they are performed by operations such as “insert if not exists.”
In Section 5.2, we present performance results for bulk loads of bLSM, InnoDB and LevelDB. Of the three, only bLSM could efficiently load and check our modest 50GB unordered data set for duplicates. “Insert if not exists” is a widely used primitive; lack of efficient support for it renders high-throughput writes useless in many environments.
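A sketch of how the Bloom filters enable a zero-seek "insert if not exists"; the component interface below is hypothetical and only illustrates the control flow.

```python
# Illustrative zero-seek "insert if not exists": if C0 does not contain the
# key and every on-disk component's Bloom filter reports it absent, the
# insert proceeds without any disk seek. The interfaces are assumptions.

def insert_if_not_exists(key, value, c0, disk_components):
    if key in c0:
        return False                                   # already present in memory
    for component in disk_components:                  # C1, then C2
        if component.bloom_might_contain(key):         # rare false positives
            if component.lookup(key) is not None:      # one seek only on a filter hit
                return False
    c0[key] = value                                    # common case: zero seeks
    return True
```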


3.2 Dealing with write pauses


Regardless of optimizations that improve read amplification and leverage write skew, index implementations that impose long, sporadic write outages on end users are not particularly practical. Despite the lack of good solutions, LSM-Trees are regularly put into production.
We describe workarounds that are used in practice here.
At the index level, the most obvious solution (other than unplanned downtime) is to introduce extra C1 components whenever C0 is full and the C1 merge has not yet completed [13]. Bloom filters reduce the impact of extra trees, but this approach still severely impacts scan performance.
Systems such as HBase allow administrators to temporarily disable compaction, effectively implementing this policy [1].
As we mentioned above, applications that do not require performant scans would be better off with an unordered log structured index.
Passing the problem off to the end user increases operations costs, and can lead to unbounded space amplification.
However, merges can be run during off-peak periods, increasing throughput during peak hours. Similarly, applications that index data according to insertion time end up writing data in “almost sorted” order, and are easily handled by existing merge strategies, providing a stop-gap solution until more general purpose systems become available.
Another solution takes partitioning to its logical extreme, creating partitions so small that even worst case merges introduce short pauses. This is the technique taken by Partitioned Exponential Files [16] and LevelDB [12]. Our experimental results show that this, on its own, is inadequate. In particular, with uniform inserts and a “fair” partition scheduler, each partition would simultaneously evolve into the same bad state described in Figure 4. 
At best, this would lead to a throughput collapse (instead of a complete cessation of application writes).
Obviously, we find each of these approaches to be unacceptable. Section 4.1 presents our approach to merge scheduling. After presenting a simple merge scheduler, we describe an optimization and extensions designed to allow our techniques to coexist with partitioning.


3.3 Two-seek scans


Scan operations do not benefit from Bloom filters and must examine each tree component. This, and the importance of delta-based updates led us to bound the number of on-disk tree components. bLSM currently has three components and performs three seeks. Indeed, Section 5.6 presents the sole experiment in which InnoDB outperforms bLSM: a scan-heavy workload.
We can further improve short-scan performance in conjunction with partitioning. One of the three on-disk components only exists to support the ongoing merge. In a system that made use of partitioning, only a small fraction of the tree would be subject to merging at any given time. The remainder of the tree would require two seeks per scan.


4. BLSM


As the previous sections explained, scans and reads are crucial to real-world application performance. As we decided which LSM-Tree variants and optimizations to use in our design, our first priority was to outperform B-Trees in practice. 
Our second concern was to do so while providing asymptotically optimal LSM-Tree write throughput. 
The previous section outlined the changes we made to the base LSM-Tree algorithm in order to meet these requirements.
Figure 1 presents the architecture of our system. We use a three-level LSM-Tree and protect the two on-disk levels with Bloom filters. We have not yet implemented partitioning, and instead focused on providing predictable, low latency writes. We avoid long write pauses by introducing a new class of merge scheduler that we call a level scheduler (Figure 4). 
Level schedulers are designed to complement existing partition schedulers (Figure 3).

4.1 Gear scheduler
In this section we describe a gear scheduler that ensures merge processes complete at the same time (Figure 5). 
This scheduler is a subcomponent of the spring and gear scheduler (Section 4.3). We begin with the gear scheduler because it is conceptually simpler, and to make it clear how to generalize spring and gear to multiple levels of tree components.
As Section 3.2 explained, we are unwilling to let extra tree components accumulate, as doing so compromises scan performance. Instead, once a tree component is full, we must block upstream writes while downstream merges complete.
Downstream merges can take indefinitely long, and we are unwilling to give up write availability, so our only option is to synchronize merge completions with the processes that fill each tree component. Each process computes two progress indicators: inprogress and outprogress. 
In Figure 5, numbers represent the outprogress of C0 and the inprogress of C1. Letters represent the outprogress of C1 and the inprogress of C2.
We use a clock analogy to reason about the merge processes in our system. Clock gears ensure each hand moves at a consistent rate (so the minute and hour hand reach 12 at the same time, for example). Our merge processes ensure that trees fill at the same time as more space becomes available downstream. As in a clock, multiple short upstream merges (sweeps of the minute hand) may occur per downstream merge (sweep of the hour hand). 
Unlike with a clock, the only important synchronization point is the hand-off from smaller component to larger (the meeting of the hands at 12).
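The following sketch illustrates that synchronization invariant in simplified form; the progress bookkeeping and the pacing loop are illustrative assumptions and omit the size accounting and backpressure a real scheduler would need.

```python
import time

# Illustrative sketch of the gear scheduler's invariant. Each merge process
# reports inprogress (fraction of its input consumed) and outprogress
# (fraction of its output component filled); an upstream process is paced so
# it never runs ahead of the downstream merge that must free space for it.
# This is an assumption-laden simplification, not the paper's implementation.

class MergeProcess:
    def __init__(self, input_size, output_target_size):
        self.input_size = input_size
        self.output_target_size = output_target_size
        self.bytes_consumed = 0     # input bytes merged so far
        self.bytes_emitted = 0      # output bytes produced so far

    def inprogress(self):
        return self.bytes_consumed / self.input_size

    def outprogress(self):
        return self.bytes_emitted / self.output_target_size

def pace(upstream, downstream, pause_s=0.001):
    """Stall the upstream merge while its output runs ahead of the downstream
    merge's consumption, keeping the 'gears' in lockstep."""
    while upstream.outprogress() > downstream.inprogress():
        time.sleep(pause_s)
```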

