Bigtable: A Distributed Storage System for Structured Data: Part 7, Performance Evaluation

7 Performance Evaluation
We set up a Bigtable cluster with N tablet servers to measure the performance and scalability of Bigtable as N is varied. 
The tablet servers were configured to use 1 GB of memory and to write to a GFS cell consisting of 1786 machines with two 400 GB IDE hard drives each.
N client machines generated the Bigtable load used for these tests. 
(We used the same number of clients as tablet servers to ensure that clients were never a bottleneck.)
Each machine had two dual-core Opteron 2 GHz chips, enough physical memory to hold the working set of all running processes, and a single gigabit Ethernet link.
The machines were arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth available at the root. 
All of the machines were in the same hosting facility and therefore the round-trip time between any pair of machines was less than a millisecond.
The tablet servers and master, test clients, and GFS servers all ran on the same set of machines. 

Every machine ran a GFS server. 
Some of the machines also ran either a tablet server, or a client process, or processes from other jobs that were using the pool at the same time as these experiments.
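For a rough sense of scale, the raw disk capacity of the GFS cell can be read off from the figures above; this is a back-of-envelope calculation, not a number reported in the paper.

    # Back-of-envelope raw capacity of the GFS cell described above (illustrative only).
    machines, disks_per_machine, disk_gb = 1786, 2, 400
    print(machines * disks_per_machine * disk_gb / 1000)   # ~1429 TB (~1.4 PB) of raw disk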

R is the distinct number of Bigtable row keys involved in the test. 
R was chosen so that each benchmark read or wrote approximately 1 GB of data per tablet server.
The sequential write benchmark used row keys with names 0 to R − 1. 
This space of row keys was partitioned into 10N equal-sized ranges. 
These ranges were assigned to the N clients by a central scheduler that assigned the next available range to a client as soon as the client finished processing the previous range assigned to it. 
This dynamic assignment helped mitigate the effects of performance variations caused by other processes running on the client machines. 
We wrote a single string under each row key. 
Each string was generated randomly and was therefore uncompressible. 
In addition, strings under different row keys were distinct, so no cross-row compression was possible. 
The random write benchmark was similar except that the row key was hashed modulo R immediately before writing so that the write load was spread roughly uniformly across the entire row space for the entire duration of the benchmark.

The sequential read benchmark generated row keys in exactly the same way as the sequential write benchmark, but instead of writing under the row key, it read the string stored under the row key (which was written by an earlier invocation of the sequential write benchmark). 
Similarly, the random read benchmark shadowed the operation of the random write benchmark.
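The following Python sketch illustrates the row-key scheme just described: sequential keys 0 to R − 1 split into 10N equal-sized ranges for dynamic assignment, and random-write keys obtained by hashing modulo R. The hash choice, key formatting, and helper names are illustrative assumptions, not the authors' actual benchmark harness.

    import hashlib

    def key_ranges(R, N):
        # Split row keys 0..R-1 into 10*N roughly equal ranges for the scheduler to hand out.
        num_ranges = 10 * N
        size = -(-R // num_ranges)            # ceiling division so every key is covered
        return [(i * size, min((i + 1) * size, R)) for i in range(num_ranges)]

    def sequential_keys(start, end):
        # Sequential benchmarks visit row keys start..end-1 in order.
        for k in range(start, end):
            yield str(k)

    def random_keys(start, end, R):
        # Random benchmarks hash each key modulo R just before use, spreading the
        # load roughly uniformly over the whole row space.
        for k in range(start, end):
            h = int.from_bytes(hashlib.md5(str(k).encode()).digest()[:8], "big")
            yield str(h % R)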

The scan benchmark is similar to the sequential read benchmark, but uses support provided by the Bigtable API for scanning over all values in a row range. 

Using a scan reduces the number of RPCs executed by the benchmark since a single RPC fetches a large sequence of values from a tablet server.
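A self-contained toy (an in-memory stand-in for a tablet server, not Bigtable code) makes this RPC amortization concrete; the row count and the batch size returned per scan RPC are arbitrary assumptions.

    class TabletServerStub:
        def __init__(self, rows):
            self.rows = rows                  # row_key -> value
            self.rpc_count = 0

        def read_row(self, key):
            self.rpc_count += 1               # one RPC per individual row read
            return self.rows[key]

        def scan(self, keys, batch_size=1000):
            # A single RPC returns a whole batch of consecutive values.
            for i in range(0, len(keys), batch_size):
                self.rpc_count += 1
                yield [self.rows[k] for k in keys[i:i + batch_size]]

    rows = {"%09d" % i: b"x" * 1000 for i in range(10_000)}
    keys = sorted(rows)
    server = TabletServerStub(rows)

    for key in keys:                          # per-row reads: one RPC per value
        server.read_row(key)
    read_rpcs = server.rpc_count

    server.rpc_count = 0
    for batch in server.scan(keys):           # scan: the RPC cost is amortized over each batch
        pass
    print(read_rpcs, server.rpc_count)        # 10000 vs 10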
The random reads (mem) benchmark is similar to the random read benchmark, but the locality group that contains the benchmark data is marked as in-memory, and therefore the reads are satisfied from the tablet server’s memory instead of requiring a GFS read. 
For just this benchmark, we reduced the amount of data per tablet server from 1 GB to 100 MB so that it would fit comfortably in the memory available to the tablet server.


Figure 6 shows two views on the performance of our benchmarks when reading and writing 1000-byte values to Bigtable. 

The table shows the number of operations per second per tablet server; the graph shows the aggregate number of operations per second.

Single tablet-server performance
Let us first consider performance with just one tablet server. 
Random reads are slower than all other operations by an order of magnitude or more. 
Each random read involves the transfer of a 64 KB SSTable block over the network from GFS to a tablet server, out of which only a single 1000-byte value is used. 
The tablet server executes approximately 1200 reads per second, which translates into approximately 75 MB/s of data read from GFS. 
This bandwidth is enough to saturate the tablet server CPUs because of overheads in our networking stack, SSTable parsing, and Bigtable code, and is also almost enough to saturate the network links used in our system. 
Most Bigtable applications with this type of an access pattern reduce the block size to a smaller value, typically 8KB.
Random reads from memory are much faster since each 1000-byte read is satisfied from the tablet server’s local memory without fetching a large 64 KB block from GFS.
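A quick back-of-envelope check of the figures quoted above; the block size and read rate come from the text, while the unit conversions are ours.

    block = 64 * 1024                # SSTable block size, bytes
    reads_per_sec = 1200             # random reads per second per tablet server

    gfs_bytes = reads_per_sec * block
    print(gfs_bytes / 2**20)         # 75.0 MB/s read from GFS, matching the figure above
    print(1000 / block)              # ~0.015: only ~1.5% of each fetched block is used
    print(gfs_bytes * 8 / 1e9)       # ~0.63 Gbps, a large share of a single gigabit link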

Random and sequential writes perform better than random reads since each tablet server appends all incoming writes to a single commit log and uses group commit to stream these writes efficiently to GFS. 
There is no significant difference between the performance of random writes and sequential writes; in both cases, all writes to the tablet server are recorded in the same commit log.
Sequential reads perform better than random reads since every 64 KB SSTable block that is fetched from GFS is stored into our block cache, where it is used to serve the next 64 read requests.
Scans are even faster since the tablet server can return a large number of values in response to a single client RPC, and therefore RPC overhead is amortized over a large number of values.
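The block-cache effect has a simple arithmetic explanation: a 64 KB SSTable block holds roughly 65 of the 1000-byte benchmark values, so one GFS fetch serves about 64 subsequent sequential reads from cache. A minimal check:

    block = 64 * 1024
    value = 1000                     # benchmark value size, bytes
    print(block // value)            # 65 values per block, so ~64 follow-up sequential
                                     # reads are served from the block cache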

Scaling
Aggregate throughput increases dramatically, by over a factor of a hundred, as we increase the number of tablet servers in the system from 1 to 500. 

For example, the performance of random reads from memory increases by almost a factor of 300 as the number of tablet servers increases by a factor of 500. 
This behavior occurs because the bottleneck on performance for this benchmark is the individual tablet server CPU.
However, performance does not increase linearly. 
For most benchmarks, there is a significant drop in per-server throughput when going from 1 to 50 tablet servers. 
This drop is caused by imbalance in load in multiple server configurations, often due to other processes contending for CPU and network. 
Our load balancing algorithm attempts to deal with this imbalance, but cannot do a perfect job for two main reasons: rebalancing is throttled to reduce the number of tablet movements (a tablet is unavailable for a short time, typically less than one second, when it is moved), and the load generated by our benchmarks shifts around as the benchmark progresses.
The random read benchmark shows the worst scaling (an increase in aggregate throughput by only a factor of 100 for a 500-fold increase in the number of servers). 
This behavior occurs because (as explained above) we transfer one large 64KB block over the network for every 1000-byte read. 
This transfer saturates various shared 1 Gigabit links in our network and as a result, the per-server throughput drops significantly as we increase the number of machines.
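The per-server efficiency implied by these scaling numbers can be read off directly from the factors quoted above.

    servers = 500
    mem_gain = 300                   # random reads (mem): aggregate throughput factor
    random_gain = 100                # random reads: aggregate throughput factor

    print(mem_gain / servers)        # 0.6: each server retains ~60% of its single-server rate
    print(random_gain / servers)     # 0.2: only ~20% per server for random reads, the worst case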
