HBase2.0官方文檔翻譯-RegionServer Sizing Rules of Thumb

36. On the number of column families

HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small. When many column families exist the flushing and compaction interaction can make for a bunch of needless i/o (To be addressed by changing flushing and compaction to work on a per column family basis). For more information on compactions, see Compaction.

HBase目前還不能很好地處理超過兩三個列族的情況,所以請儘量保持較少的列族數量。目前,flush和compact是以region爲單位進行的,所以如果其中一個列族因爲數據量大而觸發flush,相鄰的列族即使數據很少也會一起被flush。當存在許多列族時,flush和compact的相互作用會造成大量不必要的i/o(有待通過把flush和compact改爲以列族爲單位來解決)。關於compact的更多信息,請查看Compaction章節。

Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e. you query one column family or the other but usually not both at the one time.

可能的話,嘗試在模式中只使用一個列族。只有當數據的訪問通常以列族爲範圍時,纔考慮引入第二個或第三個列族;也就是說,你通常只查詢其中一個列族,而不會同時查詢多個。

36.1. 列族基數(Cardinality of ColumnFamilies)

Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA’s data will likely be spread across many, many regions (and RegionServers). This makes mass scans for ColumnFamilyA less efficient.

如果表包含多個列族,需要注意基數問題(比如,行數)。如果ColumnFamilyA包含100萬行而ColumnFamilyB包含10億行,那麼ColumnFamilyA的數據會被分散到很多很多region(以及RegionServer)中。這會使對ColumnFamilyA的大規模scan比較低效。

37. Rowkey Design

37.1. 熱點(Hotspotting)

Rows in HBase are sorted lexicographically by row key. This design optimizes for scans, allowing you to store related rows, or rows that will be read together, near each other. However, poorly designed row keys are a common source of hotspotting. Hotspotting occurs when a large amount of client traffic is directed at one node, or only a few nodes, of a cluster. This traffic may represent reads, writes, or other operations. The traffic overwhelms the single machine responsible for hosting that region, causing performance degradation and potentially leading to region unavailability. This can also have adverse effects on other regions hosted by the same region server as that host is unable to service the requested load. It is important to design data access patterns such that the cluster is fully and evenly utilized.

HBase中的行按照rowkey的字典序排序存儲。這種設計爲scan做了優化,允許你把相關聯的、或者會被一起讀取的行存放在相鄰的位置。然而,糟糕的行鍵設計是熱點的常見來源。當大量客戶端流量被導向集羣中的一個或少數幾個節點時,就會出現熱點。這些流量可能是讀取、寫入或其它操作。流量會壓垮負責託管該region的單臺機器,導致性能下降,甚至可能造成region不可用。由於該主機無法再承擔所請求的負載,這也會對同一region server上託管的其它region帶來負面影響。設計好數據訪問模式,使集羣得到充分且均勻的利用,是非常重要的。

To prevent hotspotting on writes, design your row keys such that rows that truly do need to be in the same region are, but in the bigger picture, data is being written to multiple regions across the cluster, rather than one at a time. Some common techniques for avoiding hotspotting are described below, along with some of their advantages and drawbacks.

要避免寫熱點,在設計rowkey時應做到:確實需要相鄰的行才放在同一個region裏;而從整體上看,數據應被寫入到集羣中的多個region,而不是一次只寫一個。下面介紹一些常見的避免熱點的技術手段,以及它們各自的優缺點。

加鹽(Salting)

Salting in this sense has nothing to do with cryptography, but refers to adding random data to the start of a row key. In this case, salting refers to adding a randomly-assigned prefix to the row key to cause it to sort differently than it otherwise would. The number of possible prefixes correspond to the number of regions you want to spread the data across. Salting can be helpful if you have a few "hot" row key patterns which come up over and over amongst other more evenly-distributed rows. Consider the following example, which shows that salting can spread write load across multiple RegionServers, and illustrates some of the negative implications for reads.

這裏的加鹽與密碼學無關,而是指在rowkey的開頭添加隨機數據。具體來說,加鹽是指給rowkey增加一個隨機分配的前綴,使其排序不同於原本的順序。可能的前綴數量應與你希望將數據分散到的region數量一致。如果在其它分佈較爲均勻的行之中,存在少數反覆出現的"熱"rowkey模式,加鹽就會很有用。考慮下面這個例子,它展示了加鹽能夠將寫入壓力分散到多個RegionServer,同時也說明了對讀取帶來的一些負面影響。

Example 11. Salting Example

Suppose you have the following list of row keys, and your table is split such that there is one region for each letter of the alphabet. Prefix 'a' is one region, prefix 'b' is another. In this table, all rows starting with 'f' are in the same region. This example focuses on rows with keys like the following:

假設你有下面這個rowkey列表,並且表按照每個首字母對應一個region的方式進行了拆分。前綴a爲一個region,前綴b爲另一個region。在這個表中,所有以f開頭的行都位於同一個region。這個例子主要關注具有如下鍵的行:

foo0001
foo0002
foo0003
foo0004

Now, imagine that you would like to spread these across four different regions. You decide to use four different salts: a, b, c, and d. In this scenario, each of these letter prefixes will be on a different region. After applying the salts, you have the following rowkeys instead. Since you can now write to four separate regions, you theoretically have four times the throughput when writing that you would have if all the writes were going to the same region.

現在,想象一下你需要將它們分散到四個不同的region。你決定使用四種不同的鹽:a、b、c和d。在這個場景裏,每個字母前綴都會位於不同的region。使用這些鹽之後,你得到的是下面這些rowkey。由於現在可以寫入到四個獨立的region,理論上與全部寫入同一個region相比,寫入吞吐可以達到四倍。

a-foo0003
b-foo0001
c-foo0004
d-foo0002

Then, if you add another row, it will randomly be assigned one of the four possible salt values and end up near one of the existing rows.

然後,如果你增加其它行,它會被隨機的分配到四種鹽值之一,並且放在現有的行附近。

a-foo0003
b-foo0001
c-foo0003
c-foo0004
d-foo0002

Since this assignment will be random, you will need to do more work if you want to retrieve the rows in lexicographic order. In this way, salting attempts to increase throughput on writes, but has a cost during reads.

由於分配是隨機的,如果你想按字典序取回這些行,就需要做更多的工作。通過這種方式,加鹽嘗試提升寫入的吞吐能力,但增加了讀取時的代價。
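
下面是一段示意代碼(並非官方實現,前綴集合a、b、c、d以及類名均爲假設),演示寫入時如何隨機選擇鹽值前綴,以及讀取時爲什麼需要額外的工作:

import java.util.concurrent.ThreadLocalRandom;

// 示意代碼:隨機加鹽。前綴的數量與希望把數據分散到的region數量一致(這裏假設爲4)。
public class SaltedKeyExample {
  private static final String[] SALTS = {"a", "b", "c", "d"};

  // 寫入時:隨機選擇一個鹽值作爲rowkey前綴
  public static String saltedKey(String originalKey) {
    int idx = ThreadLocalRandom.current().nextInt(SALTS.length);
    return SALTS[idx] + "-" + originalKey;
  }

  public static void main(String[] args) {
    System.out.println(saltedKey("foo0003")); // 可能輸出 "c-foo0003",每次運行結果可能不同
    // 讀取時:由於前綴是隨機分配的,要按字典序取回原始順序的行,
    // 需要對每個鹽值各執行一次scan,再在客戶端合併結果,這就是正文所說的讀取代價。
  }
}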

哈希(Hashing)

Instead of a random assignment, you could use a one-way hash that would cause a given row to always be "salted" with the same prefix, in a way that would spread the load across the RegionServers, but allow for predictability during reads. Using a deterministic hash allows the client to reconstruct the complete rowkey and use a Get operation to retrieve that row as normal.

你可以不使用隨機分配,而是用單向哈希使給定的行總是得到相同的前綴,這樣既可以將壓力分散到各個RegionServer,又能在讀取時預知前綴。使用確定性的哈希,客戶端能夠重新構造出完整的rowkey,然後用普通的Get操作去獲取該行。

Example 12. Hashing Example

Given the same situation in the salting example above, you could instead apply a one-way hash that would cause the row with key foo0003 to always, and predictably, receive the a prefix. Then, to retrieve that row, you would already know the key. You could also optimize things so that certain pairs of keys were always in the same region, for instance.

在上面加鹽的例子裏,你可以換用一個單向哈希,使foo0003這一行總是能可預測地得到前綴a。這樣,在獲取該行時,你已經知道完整的key。你還可以做一些優化,例如讓特定的幾個key總是位於同一個region。
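
下面是一段示意代碼(並非官方實現,桶數4和類名均爲假設),演示如何用單向哈希得到確定性的前綴:同一個key總是得到同一個前綴,因此客戶端可以重新構造完整rowkey並直接用Get讀取。

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HashedPrefixExample {
  private static final int NUM_BUCKETS = 4; // 假設希望把數據分散到4個region

  // 用MD5摘要的第一個字節對桶數取模,得到確定性的前綴
  public static String prefixedKey(String originalKey) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5")
        .digest(originalKey.getBytes(StandardCharsets.UTF_8));
    int bucket = (digest[0] & 0xFF) % NUM_BUCKETS;
    return bucket + "-" + originalKey; // 例如 "2-foo0003",且每次計算結果相同
  }
}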

反轉鍵(Reversing the Key)

A third common trick for preventing hotspotting is to reverse a fixed-width or numeric row key so that the part that changes the most often (the least significant digit) is first. This effectively randomizes row keys, but sacrifices row ordering properties.

第三種常見的避免熱點的方法,是將固定長度或數字類型的rowkey反轉,使變化最頻繁的部分(最低位)排在最前面。這有效地使rowkey隨機化,但會犧牲行的有序性。
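
下面是一段示意代碼(並非官方實現,類名爲假設),演示如何反轉定長rowkey的字節,讓變化最頻繁的低位排到前面:

public class ReversedKeyExample {
  // 反轉定長rowkey的字節順序
  public static byte[] reverse(byte[] key) {
    byte[] out = new byte[key.length];
    for (int i = 0; i < key.length; i++) {
      out[i] = key[key.length - 1 - i];
    }
    return out;
  }
}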

See https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables, and article on Salted Tables from the Phoenix project, and the discussion in the comments of HBASE-11682 for more information about avoiding hotspotting.

查看https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables,以及Phoenix項目中關於加鹽表(Salted Tables)的文章,還有HBASE-11682評論中的討論,以瞭解更多關於避免熱點的信息。

37.2. 遞增rowkey/時序數據(Monotonically Increasing Row Keys/Timeseries Data)

In the HBase chapter of Tom White’s book Hadoop: The Definitive Guide (O’Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table’s regions (and thus, a single node), then moving onto the next region, etc.

With monotonically increasing row-keys (i.e., using a timestamp), this will happen. See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores: monotonically increasing values are bad.

The pile-up on a single region brought on by monotonically increasing keys can be mitigated by randomizing the input records to not be in sorted order, but in general it’s best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.

在Tom White的《Hadoop: The Definitive Guide》(O'Reilly)一書的HBase章節中,有一條優化說明,提醒注意這樣一種現象:導入過程與所有客戶端步調一致地集中壓在表的某一個region(也即單個節點)上,然後再轉移到下一個region,如此往復。

使用單調遞增的rowkey時(例如,使用時間戳),就會發生這種情況。可以參考IKai Lan的漫畫,瞭解爲什麼在類BigTable的數據存儲中單調遞增的rowkey會有問題:monotonically increasing values are bad。

可以通過將輸入記錄隨機化使其不再有序,來緩解單調遞增key帶來的單region壓力,但通常來說,最好避免使用時間戳或序列(比如1, 2, 3)作爲rowkey。

If you do need to upload time series data into HBase, you should study OpenTSDB as a successful example. It has a page describing the schema it uses in HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the lead position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.

如果你確實需要將時序數據存入HBase,應該把OpenTSDB作爲一個成功案例來研究。它有一個頁面描述了其在HBase中使用的模式。OpenTSDB的key格式實際上是[metric_type][event_timestamp],乍一看與前面"避免使用時間戳作爲key"的建議相矛盾。不過,區別在於時間戳並不處於key的前導位,並且該設計假設會有幾十、幾百甚至更多不同的指標類型。因此,即使輸入數據是混雜了多種指標類型的連續數據流,Put也會分佈到表中不同region的不同位置上。

See schema.casestudies for some rowkey design examples.

更多關於rowkey設計的示例可查看schema.casestudies。

37.3. 儘可能最小化row和column大小(Try to minimize row and column sizes)

In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it’ll be accompanied by its row, column name, and timestamp - always. If your rows and column names are large, especially compared to the size of the cell value, then you may run up against some interesting scenarios. One such is the case described by Marc Limotte at the tail of HBASE-3551 (recommended!). Therein, the indices that are kept on HBase storefiles (StoreFile (HFile)) to facilitate random access may end up occupying large chunks of the HBase allotted RAM because the cell value coordinates are large. Mark in the above cited comment suggests upping the block size so entries in the store file index happen at a larger interval or modify the table schema so it makes for smaller rows and column names. Compression will also make for larger indices. See the thread a question storefileIndexSize up on the user mailing list.

在HBase中,value總是帶着它的座標一起存儲;cell的value在系統中流轉時,始終伴隨着它的row、column名稱和時間戳。如果你的row和column名稱很大,尤其是與cell的value相比很大時,你可能會碰到一些有意思的情況。Marc Limotte在HBASE-3551末尾描述了這樣一個案例(推薦閱讀)。其中,HBase storefile(StoreFile (HFile))上用來加速隨機訪問的索引,可能因爲cell的座標過大,最終佔用HBase分配內存中的很大一部分。在上面引用的評論中,Mark建議增大block大小,使store file索引中的條目以更大的間隔出現,或者修改表設計,使用更小的row和column名稱。壓縮也會使索引變大。可以查看用戶郵件列表中的這個主題:a question storefileIndexSize。

Most of the time small inefficiencies don’t matter all that much. Unfortunately, this is a case where they do. Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated several billion times in your data.

多數時候,細微的低效並不那麼重要。不幸的是,這裏正是它們產生影響的場景。無論爲列族、屬性和rowkey選擇怎樣的模式,它們都可能在你的數據中重複數十億次。

See keyvalue for more information on HBase stores data internally to see why this is important.

查看keyvalue章節,瞭解關於HBase內部數據存儲的更多信息,來理解爲什麼這很重要。

37.3.1. 列族(Column Families)

Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).

See KeyValue for more information on HBase stores data internally to see why this is important.

嘗試讓列族名稱儘可能短,最好只用一個字符(比如用"d"表示data/default)。
查看keyvalue章節,瞭解HBase內部如何存儲數據,以理解爲什麼這很重要。

37.3.2. 屬性(Attributes)

Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via") to store in HBase.

See keyvalue for more information on HBase stores data internally to see why this is important.

雖然冗長的屬性名稱(比如"myVeryImportantAttribute")更容易閱讀,但存入HBase時更推薦較短的屬性名稱(比如"via")。
查看keyvalue章節,瞭解HBase內部如何存儲數據,以理解爲什麼這很重要。

37.3.3. 行鍵長度(Rowkey Length)

Keep them as short as is reasonable such that they can still be useful for required data access (e.g. Get vs. Scan). A short key that is useless for data access is not better than a longer key with better get/scan properties. Expect tradeoffs when designing rowkeys.

在保證數據訪問(比如Get與Scan)仍然可用的前提下,讓行鍵儘量簡短。一個簡短但對數據訪問毫無用處的鍵,並不比一個更長但具備更好get/scan特性的鍵更好。設計行鍵時需要有所權衡。

37.3.4. 字節模式(Byte Patterns)

A long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes. If you stored this number as a String — presuming a byte per character — you need nearly 3x the bytes.

long類型佔8個字節。用這8個字節你可以存儲最大爲18,446,744,073,709,551,615的無符號數。如果把這個數字存成字符串(假設每個字符佔一個字節),你需要接近三倍的字節數。

Not convinced? Below is some sample code that you can run on your own.

不相信嗎?下面是一些示例代碼,你可以自己運行看看。

// long
//
long l = 1234567890L;
byte[] lb = Bytes.toBytes(l);
System.out.println("long bytes length: " + lb.length);   // returns 8

String s = String.valueOf(l);
byte[] sb = Bytes.toBytes(s);
System.out.println("long as string length: " + sb.length);    // returns 10

// hash
//
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] digest = md.digest(Bytes.toBytes(s));
System.out.println("md5 digest bytes length: " + digest.length);    // returns 16

String sDigest = new String(digest);
byte[] sbDigest = Bytes.toBytes(sDigest);
System.out.println("md5 digest as string length: " + sbDigest.length);    // returns 26

Unfortunately, using a binary representation of a type will make your data harder to read outside of your code. For example, this is what you will see in the shell when you increment a value:

不幸的是,使用類型的二進制表示,會讓你的數據在代碼之外更難讀懂。例如,當你對一個值執行incr時,你會在shell中看到如下內容:

hbase(main):001:0> incr 't', 'r', 'f:q', 1
COUNTER VALUE = 1

hbase(main):002:0> get 't', 'r'
COLUMN                                        CELL
 f:q                                          timestamp=1369163040570, value=\x00\x00\x00\x00\x00\x00\x00\x01
1 row(s) in 0.0310 seconds

The shell makes a best effort to print a string, and in this case it decided to just print the hex. The same will happen to your row keys inside the region names. It can be okay if you know what’s being stored, but it might also be unreadable if arbitrary data can be put in the same cells. This is the main trade-off.

shell會儘力打印出字符串,而在這個例子裏它決定只打印十六進制。同樣的情況也會發生在region名稱中的行鍵上。如果你知道存儲的是什麼,這可以接受;但如果同樣的cell裏可能放入任意數據,就可能失去可讀性。這是主要的權衡點。

37.4. 反轉時間戳(Reverse Timestamps)

反向scan接口(Reverse Scan API)

HBASE-4811 implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available in HBase 0.98 and later. See Scan.setReversed() for more information.

HBASE-4811實現了一個可以反向scan表或其中一個範圍的接口,減少你因爲正向或反向掃描而進行模式優化的需要。該功能在HBase 0.98或更高版本中可用。更多信息可查看Scan.setReversed()。

A common problem in database processing is quickly finding the most recent version of a value. A technique using reverse timestamps as a part of the key can help greatly with a special case of this problem. Also found in the HBase chapter of Tom White’s book Hadoop: The Definitive Guide (O’Reilly), the technique involves appending (Long.MAX_VALUE - timestamp) to the end of any key, e.g. [key][reverse_timestamp].

The most recent value for [key] in a table can be found by performing a Scan for [key] and obtaining the first record. Since HBase keys are in sorted order, this key sorts before any older row-keys for [key] and thus is first.

This technique would be used instead of using Number of Versions where the intent is to hold onto all versions "forever" (or a very long time) and at the same time quickly obtain access to any other version by using the same Scan technique.

數據庫處理中有一個常見的問題:快速找到某個值的最新版本。把反轉時間戳作爲key的一部分,在這類問題的特定場景中會有很大幫助。這個技巧同樣可以在Tom White的《Hadoop: The Definitive Guide》(O'Reilly)一書的HBase章節中找到,即在任意key的末尾追加(Long.MAX_VALUE - timestamp),例如[key][reverse_timestamp]。

表中[key]的最新值,可以通過對[key]執行一次Scan並取第一條記錄得到。由於HBase中的key是有序的,這個key會排在該[key]所有更老的行鍵之前,因此是第一條。

當你的意圖是"永久"(或很長時間)保留所有版本,同時又希望能用同樣的Scan方式快速訪問任意其它版本時,可以用這個技巧來替代多版本(Number of Versions)方式。
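
下面是一段示意代碼(並非官方實現,類名爲假設),演示如何構造[key][reverse_timestamp]形式的rowkey:

import org.apache.hadoop.hbase.util.Bytes;

public class ReverseTimestampKeyExample {
  // 在key末尾追加 (Long.MAX_VALUE - timestamp),使同一個key的最新版本排在最前,
  // 對該key執行Scan時,取到的第一條記錄即最新值。
  public static byte[] rowkey(byte[] key, long timestamp) {
    long reverseTs = Long.MAX_VALUE - timestamp;
    return Bytes.add(key, Bytes.toBytes(reverseTs));
  }
}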

37.5. 行鍵和列族(Rowkeys and ColumnFamilies)

Rowkeys are scoped to ColumnFamilies. Thus, the same rowkey could exist in each ColumnFamily that exists in a table without collision.

行鍵的作用域是列族。因此,相同的行鍵可以存在於表的每個列族中而不會衝突。

37.6. 行鍵的不變性(Immutability of Rowkeys)

Rowkeys cannot be changed. The only way they can be "changed" in a table is if the row is deleted and then re-inserted. This is a fairly common question on the HBase dist-list so it pays to get the rowkeys right the first time (and/or before you’ve inserted a lot of data).

行鍵是不可變的。唯一能在表中"改變"行鍵的方法是先刪除該行再重新插入。這是HBase郵件列表中相當常見的問題,因此值得在一開始(或者說在插入大量數據之前)就把行鍵設計正確。

37.7. 行鍵和region分片的關係(Relationship Between RowKeys and Region Splits)

If you pre-split your table, it is critical to understand how your rowkey will be distributed across the region boundaries. As an example of why this is important, consider the example of using displayable hex characters as the lead position of the key (e.g., "0000000000000000" to "ffffffffffffffff"). Running those key ranges through Bytes.split (which is the split strategy used when creating regions in Admin.createTable(byte[] startKey, byte[] endKey, numRegions)) for 10 regions will generate the following splits…

如果你預拆分你的表,理解你的行鍵在region邊界如何分佈非常重要。考慮這個使用可見十六進制字符作爲先導位的行鍵(比如,"0000000000000000" to "ffffffffffffffff")的例子,用來說明爲什麼這很重要。運行Bytes.split(使用Admin.createTable(byte[] startKey, byte[] endKey, numRegions )創建region時使用的分片策略)將該範圍的行鍵分爲10個region,會得到下面這些分片。

(note: the lead byte is listed to the right as a comment.) Given that the first split is a '0' and the last split is an 'f', everything is great, right? Not so fast.

(注:首字節作爲註釋在右邊列出)。假設第一個分片是'0',最後一個分片是'f',一切都挺好,是嗎?先別急。

The problem is that all the data is going to pile up in the first 2 regions and the last region thus creating a "lumpy" (and possibly "hot") region problem. To understand why, refer to an ASCII Table. '0' is byte 48, and 'f' is byte 102, but there is a huge gap in byte values (bytes 58 to 96) that will never appear in this keyspace because the only values are [0-9] and [a-f]. Thus, the middle regions will never be used. To make pre-splitting work with this example keyspace, a custom definition of splits (i.e., and not relying on the built-in split method) is required.

問題在於,所有數據會堆積在前2個region和最後1個region,造成region分佈"不均勻"(並可能出現熱點)的問題。要理解原因,可以參考ASCII表。'0'對應的字節值是48,'f'對應的字節值是102,但由於鍵空間中只會出現[0-9]和[a-f],中間有一大段字節值(58到96)永遠不會出現,因此中間的那些region永遠不會被用到。要讓預拆分在這個鍵空間上可行,需要自定義分片(也就是說,不能依賴內置的分片方法)。

Lesson #1: Pre-splitting tables is generally a best practice, but you need to pre-split them in such a way that all the regions are accessible in the keyspace. While this example demonstrated the problem with a hex-key keyspace, the same problem can happen with any keyspace. Know your data.

經驗1:預拆分表通常是一個最佳實踐,但你需要以一種所有region都會被訪問的方式去拆分。這個例子只是演示了使用十六進制行鍵時的問題,使用其它任意行鍵都可能有類似問題。理解你的數據。

Lesson #2: While generally not advisable, using hex-keys (and more generally, displayable data) can still work with pre-split tables as long as all the created regions are accessible in the keyspace.

經驗2:雖然通常不建議這樣做,但只要創建出的所有region在鍵空間中都能被訪問到,使用十六進制行鍵(更一般地說,可顯示的數據)仍然可以配合預拆分表使用。

To conclude this example, the following is an example of how appropriate splits can be pre-created for hex-keys:.

作爲總結,下面是一個如何爲十六進制行鍵進行適當的預拆分的示例:

public static boolean createTable(Admin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
  try {
    admin.createTable( table, splits );
    return true;
  } catch (TableExistsException e) {
    logger.info("table " + table.getNameAsString() + " already exists");
    // the table already exists...
    return false;
  }
}

public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions-1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  lowestKey = lowestKey.add(regionIncrement);
  for(int i=0; i < numRegions-1;i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    byte[] b = String.format("%016x", key).getBytes();
    splits[i] = b;
  }
  return splits;
}
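
下面是一個示意用法(並非官方示例,表名"pre_split_table"和列族"f"均爲假設),演示如何把上面兩個方法結合起來,爲十六進制鍵空間預建10個region:

import java.io.IOException;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;

public static boolean createHexSplitTable(Admin admin) throws IOException {
  // 爲 "0000000000000000" 到 "ffffffffffffffff" 的鍵空間計算9個分割點,得到10個region
  byte[][] splits = getHexSplits("0000000000000000", "ffffffffffffffff", 10);
  HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("pre_split_table"));
  desc.addFamily(new HColumnDescriptor("f")); // 假設只有一個列族 "f"
  return createTable(admin, desc, splits);
}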

38. Number of Versions

38.1. 最大版本數(Maximum Number of Versions)

The maximum number of row versions to store is configured per column family via HColumnDescriptor. The default for max versions is 1. This is an important parameter because as described in Data Model section HBase does not overwrite row values, but rather stores different values per row by time (and qualifier). Excess versions are removed during major compactions. The number of max versions may need to be increased or decreased depending on application needs.

行的最大保存版本數通過HColumnDescriptor爲每個列族單獨配置,默認值爲1。這是個很重要的參數,正如數據模型章節所述,HBase不會覆蓋行的值,而是按時間(和限定符)爲每行保存不同的值。多餘的版本會在major合併時刪除。最大版本數可根據應用需要增大或減小。

It is not recommended setting the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are very dear to you because this will greatly increase StoreFile size.

不建議將最大版本數設置得過大(比如,幾百或更多),除非那些舊數據對你來說非常寶貴,因爲這會大幅增加StoreFile的大小。

38.2. 最小版本數(Minimum Number of Versions)

Like maximum number of row versions, the minimum number of row versions to keep is configured per column family via HColumnDescriptor. The default for min versions is 0, which means the feature is disabled. The minimum number of row versions parameter is used together with the time-to-live parameter and can be combined with the number of row versions parameter to allow configurations such as "keep the last T minutes worth of data, at most N versions, but keep at least M versions around" (where M is the value for minimum number of row versions, M<N).

與最大版本數一樣,最小版本數也通過HColumnDescriptor爲每個列族配置。默認最小版本數爲0,意味着該功能未啓用。最小版本數可以與存活時間(TTL)以及最大版本數參數結合使用,來實現類似"保留最近T分鐘內的數據,最多N個版本,但至少保留M個版本"的配置(其中M是最小版本數的值,M<N)。

39. Supported Datatypes

HBase supports a "bytes-in/bytes-out" interface via Put and Result, so anything that can be converted to an array of bytes can be stored as a value. Input could be strings, numbers, complex objects, or even images as long as they can be rendered as bytes.

HBase通過Put和Result支持"字節入/字節出"的接口,所以任何能被轉換爲字節數組的東西都可以作爲值存儲。輸入可以是字符串、數字、複雜對象,甚至是圖片,只要它們能被表示爲字節即可。

There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.

實際上值的大小存在一些現實的限制(比如,在HBase中存儲10-50MB的對象可能就要求過高了);可以在郵件列表中搜索與此話題相關的討論。HBase中的所有行都遵循數據模型,其中包括版本控制。在設計時需要把這一點以及列族的塊大小都考慮進去。

39.1. Counters

One supported datatype that deserves special mention are "counters" (i.e., the ability to do atomic increments of numbers). See Increment in Table.

特別值得一提的一種數據類型是"計數器"(counter),即對數字做原子遞增的能力。參見Table中的Increment。

Synchronization on counters are done on the RegionServer, not in the client.

對計數器的同步是在RegionServer完成,而不是客戶端。
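
下面是一段示意代碼(並非官方示例,表名't'、行鍵'r'、列族'f'、列'q'均爲假設),演示如何通過Table.incrementColumnValue對計數器做原子遞增,遞增動作在RegionServer端完成:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("t"))) {
      // 對 f:q 原子加1,返回遞增後的值
      long newValue = table.incrementColumnValue(
          Bytes.toBytes("r"), Bytes.toBytes("f"), Bytes.toBytes("q"), 1L);
      System.out.println("counter = " + newValue);
    }
  }
}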

40. Joins

If you have multiple tables, don’t forget to factor in the potential for Joins into the schema design.

如果你有多個表,別忘了在模式設計時把潛在的連接(Join)需求考慮進去。

41. Time To Live (TTL)

ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. This applies to all versions of a row - even the current one. The TTL time encoded in the HBase for the row is specified in UTC.

列族可以設置以秒爲單位的存活時間(TTL),一旦到達過期時間,HBase會自動刪除這些行。這會作用於行的所有版本,包括當前版本。HBase中爲該行編碼的TTL時間以UTC指定。

Store files which contains only expired rows are deleted on minor compaction. Setting hbase.store.delete.expired.storefile to false disables this feature. Setting minimum number of versions to other than 0 also disables this.

See HColumnDescriptor for more information.

只包含已過期行的Store file會在minor合併時被刪除。將hbase.store.delete.expired.storefile設置爲false可以禁用此功能。將最小版本數設置爲非0值也會禁用它。
更多信息可查看HColumnDescriptor。

Recent versions of HBase also support setting time to live on a per cell basis. See HBASE-10560 for more information. Cell TTLs are submitted as an attribute on mutation requests (Appends, Increments, Puts, etc.) using Mutation#setTTL. If the TTL attribute is set, it will be applied to all cells updated on the server by the operation. There are two notable differences between cell TTL handling and ColumnFamily TTLs:

Cell TTLs are expressed in units of milliseconds instead of seconds.

A cell TTL cannot extend the effective lifetime of a cell beyond a ColumnFamily level TTL setting.

較新版本的HBase還支持基於單個cell設置存活時間。更多信息查看HBASE-10560。cell的TTL通過Mutation#setTTL方法,作爲mutation請求(Append、Increment、Put等)的一個屬性提交。如果設置了TTL屬性,它會應用到該操作在服務端更新的所有cell上。cell TTL的處理和列族TTL有兩個明顯的不同:

cell的存活時間單位是毫秒而不是秒。

cell的TTL不能將cell的有效壽命延長到超出列族級別TTL設置的範圍。
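
下面是一段示意代碼(並非官方示例,表名、列族、列名均爲假設),演示如何通過Mutation#setTTL爲單次Put寫入的cell設置TTL(單位是毫秒,且不能超過列族級別的TTL):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellTtlExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("t"))) {
      Put put = new Put(Bytes.toBytes("r1"));
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
      put.setTTL(5 * 60 * 1000L); // 本次Put寫入的所有cell在5分鐘後過期(毫秒)
      table.put(put);
    }
  }
}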

42. Keeping Deleted Cells

By default, delete markers extend back to the beginning of time. Therefore, Get or Scan operations will not see a deleted cell (row or column), even when the Get or Scan operation indicates a time range before the delete marker was placed.

默認情況下,刪除標記的作用會一直回溯到最早的時間。因此,即使Get或Scan操作指定了早於刪除標記放置時間的時間範圍,也不會看到已刪除的cell(行或列)。

ColumnFamilies can optionally keep deleted cells. In this case, deleted cells can still be retrieved, as long as these operations specify a time range that ends before the timestamp of any delete that would affect the cells. This allows for point-in-time queries even in the presence of deletes.

列族可以選擇保留已刪除的cell。這種情況下,只要操作指定的時間範圍在會影響這些cell的任何刪除操作的時間戳之前結束,這些已刪除的cell仍然可以被獲取到。這使得即便存在刪除,也可以進行"某一時間點"的查詢。

Deleted cells are still subject to TTL and there will never be more than "maximum number of versions" deleted cells. A new "raw" scan options returns all deleted rows and the delete markers.

已刪除的cell依然受存活時間和最大版本數的約束。一個新的"raw"scan選項可返回所有已刪除的行和刪除標記。

通過shell修改KEEP_DELETED_CELLS的值

hbase> alter 't1', NAME => 'f1', KEEP_DELETED_CELLS => true

通過api修改KEEP_DELETED_CELLS的值

...
HColumnDescriptor.setKeepDeletedCells(true);
...

Let us illustrate the basic effect of setting the KEEP_DELETED_CELLS attribute on a table. First, without:

舉例說明一下給表設置KEEP_DELETED_CELLS屬性後的基本影響。首先,在未設置時:

create 'test', {NAME=>'e', VERSIONS=>2147483647}
put 'test', 'r1', 'e:c1', 'value', 10
put 'test', 'r1', 'e:c1', 'value', 12
put 'test', 'r1', 'e:c1', 'value', 14
delete 'test', 'r1', 'e:c1',  11

hbase(main):017:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                              COLUMN+CELL
 r1                                              column=e:c1, timestamp=14, value=value
 r1                                              column=e:c1, timestamp=12, value=value
 r1                                              column=e:c1, timestamp=11, type=DeleteColumn
 r1                                              column=e:c1, timestamp=10, value=value
1 row(s) in 0.0120 seconds

hbase(main):018:0> flush 'test'
0 row(s) in 0.0350 seconds

hbase(main):019:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                              COLUMN+CELL
 r1                                              column=e:c1, timestamp=14, value=value
 r1                                              column=e:c1, timestamp=12, value=value
 r1                                              column=e:c1, timestamp=11, type=DeleteColumn
1 row(s) in 0.0120 seconds

hbase(main):020:0> major_compact 'test'
0 row(s) in 0.0260 seconds

hbase(main):021:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                              COLUMN+CELL
 r1                                              column=e:c1, timestamp=14, value=value
 r1                                              column=e:c1, timestamp=12, value=value
1 row(s) in 0.0120 seconds

Notice how delete cells are let go.

注意被刪除的cell是如何消失的。

Now let’s run the same test only with KEEP_DELETED_CELLS set on the table (you can do table or per-column-family):

現在只給表增加KEEP_DELETED_CELLS設置(可以在表上或者列族上),並重新運行同樣的測試:

hbase(main):005:0> create 'test', {NAME=>'e', VERSIONS=>2147483647, KEEP_DELETED_CELLS => true}
0 row(s) in 0.2160 seconds

=> Hbase::Table - test
hbase(main):006:0> put 'test', 'r1', 'e:c1', 'value', 10
0 row(s) in 0.1070 seconds

hbase(main):007:0> put 'test', 'r1', 'e:c1', 'value', 12
0 row(s) in 0.0140 seconds

hbase(main):008:0> put 'test', 'r1', 'e:c1', 'value', 14
0 row(s) in 0.0160 seconds

hbase(main):009:0> delete 'test', 'r1', 'e:c1',  11
0 row(s) in 0.0290 seconds

hbase(main):010:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                                                                          COLUMN+CELL
 r1                                                                                          column=e:c1, timestamp=14, value=value
 r1                                                                                          column=e:c1, timestamp=12, value=value
 r1                                                                                          column=e:c1, timestamp=11, type=DeleteColumn
 r1                                                                                          column=e:c1, timestamp=10, value=value
1 row(s) in 0.0550 seconds

hbase(main):011:0> flush 'test'
0 row(s) in 0.2780 seconds

hbase(main):012:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                                                                          COLUMN+CELL
 r1                                                                                          column=e:c1, timestamp=14, value=value
 r1                                                                                          column=e:c1, timestamp=12, value=value
 r1                                                                                          column=e:c1, timestamp=11, type=DeleteColumn
 r1                                                                                          column=e:c1, timestamp=10, value=value
1 row(s) in 0.0620 seconds

hbase(main):013:0> major_compact 'test'
0 row(s) in 0.0530 seconds

hbase(main):014:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                                                                          COLUMN+CELL
 r1                                                                                          column=e:c1, timestamp=14, value=value
 r1                                                                                          column=e:c1, timestamp=12, value=value
 r1                                                                                          column=e:c1, timestamp=11, type=DeleteColumn
 r1                                                                                          column=e:c1, timestamp=10, value=value
1 row(s) in 0.0650 seconds

KEEP_DELETED_CELLS is to avoid removing Cells from HBase when the only reason to remove them is the delete marker. So with KEEP_DELETED_CELLS enabled deleted cells would get removed if either you write more versions than the configured max, or you have a TTL and Cells are in excess of the configured timeout, etc.

KEEP_DELETED_CELLS的作用是:當移除cell的唯一原因只是刪除標記時,避免這些cell被從HBase中移除。因此,啓用KEEP_DELETED_CELLS後,如果寫入的版本數超過了配置的最大版本數,或者cell超出了配置的TTL等,被刪除的cell仍然會被真正清除。

43. Secondary Indexes and Alternate Query Paths

This section could also be titled "what if my table rowkey looks like this but I also want to query my table like that." A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain time ranges. Thus, selecting by user is easy because it is in the lead position of the key, but time is not.

這個章節也可以叫做"如果我的表的行鍵長這樣,但我又想用那樣的方式查詢我的表"。郵件列表中常見的一個例子是:行鍵的格式是"用戶-時間戳",但又有按特定時間範圍統計跨用戶活動的報表需求。此時,按用戶查詢很容易,因爲用戶位於行鍵的前導位,但按時間查詢就不容易了。

There is no single answer on the best way to handle this because it depends on…​

對於如何以最好的方式去解決該問題並沒有單一的答案,因爲這取決於...

  • Number of users
  • Data size and data arrival rate
  • Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges)
  • Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others)
  • 用戶數量
  • 數據大小和到達速率
  • 報表需求的靈活性(比如,完全臨時的日期選擇 vs 預先配置的範圍)
  • 查詢所需的執行速度(比如,對於一個ad-hoc報表,90秒可能是合理的,但是對於其它情況就太久了)

and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution. Common techniques are in sub-sections below. This is a comprehensive, but not exhaustive, list of approaches.

而且解決方案也受集羣大小和能夠投入的處理能力的影響。後面的子章節列出了常用的技術手段。這是一份比較全面但並不詳盡的方法列表。

It should not be a surprise that secondary indexes require additional cluster space and processing. This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update. RDBMS products are more advanced in this regard to handle alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off.

毫無疑問,二級索引需要額外的集羣空間和處理開銷。這正是關係型數據庫中發生的事情,因爲創建額外的索引既需要空間,也需要處理週期去更新。在開箱即用的索引管理方面,關係型數據庫產品更爲先進。然而,HBase在更大數據量時具備更好的擴展性,這是一個功能上的權衡。

Pay attention to Apache HBase Performance Tuning when implementing any of these approaches.

在實現這些方法中的任何一種時,請注意參考Apache HBase性能調優(Performance Tuning)章節。

Additionally, see the David Butler response in this dist-list thread HBase, mail # user - Stargate+hbase

此外,可查看David Butler在郵件列表主題HBase, mail # user - Stargate+hbase中的回覆。

43.1. 過濾器查詢(Filter Query)

Depending on the case, it may be appropriate to use Client Request Filters. In this case, no secondary index is created. However, don’t try a full-scan on a large table like this from an application (i.e., single-threaded client).

根據具體情況,使用客戶端請求過濾器(Client Request Filters)可能是合適的。這種情況下不會創建二級索引。但是,不要從應用程序(比如單線程客戶端)中對一個大表做這樣的全表掃描。

43.2. 週期性更新二級索引(Periodic-Update Secondary Index)

A secondary index could be created in another table which is periodically updated via a MapReduce job. The job could be executed intra-day, but depending on load-strategy it could still potentially be out of sync with the main data table.

See mapreduce.example.readwrite for more information.

可以在另一張表中創建二級索引,並通過MapReduce作業週期性更新。該作業可以在當天之內執行,但取決於加載策略,它仍可能與主數據表不同步。

更多信息查看mapreduce.example.readwrite。

43.3. 雙寫二級索引(Dual-Write Secondary Index)

Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table). If this approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see secondary.indexes.periodic).

另一個策略是在寫入數據到集羣的時候構建二級索引(比如,寫入數據表,然後寫入索引表)。如果是對已存在的表採用該方法,則需要先執行一個MapReduce作業來進行初始化(查看secondary.indexes.periodic)。

43.4. 彙總表(Summary Tables)

Where time-ranges are very wide (e.g., year-long report) and where the data is voluminous, summary tables are a common approach. These would be generated with MapReduce jobs into another table.

See mapreduce.example.summary for more information.

在時間範圍很寬(比如,整年的報表)且數據量很大時,彙總表是一種常用的方法。可以用MapReduce作業把彙總結果生成到另一張表中。

更多信息查看mapreduce.example.summary。

43.5. 協處理器二級索引(Coprocessor Secondary Index)

Coprocessors act like RDBMS triggers. These were added in 0.92. For more information, see coprocessors

協處理器類似關係型數據庫中的觸發器。在0.92版本中加入。更多信息,查看coprocessors。

44. Constraints

HBase currently supports 'constraints' in traditional (SQL) database parlance. The advised usage for Constraints is in enforcing business rules for attributes in the table (e.g. make sure values are in the range 1-10). Constraints could also be used to enforce referential integrity, but this is strongly discouraged as it will dramatically decrease the write throughput of the tables where integrity checking is enabled. Extensive documentation on using Constraints can be found at Constraint since version 0.94.

HBase目前支持傳統(SQL)數據庫所說的"約束"(constraints)。約束的建議用法是對錶中屬性強制執行業務規則(比如,確保值在1-10之間)。約束也可以用來強制參照完整性,但強烈不建議這樣做,因爲這會顯著降低啓用完整性檢查的表的寫吞吐。從0.94版本開始,關於使用約束的詳細文檔可查看Constraint。

45. Schema Design Case Studies

The following will describe some typical data ingestion use-cases with HBase, and how the rowkey design and construction can be approached. Note: this is just an illustration of potential approaches, not an exhaustive list. Know your data, and know your processing requirements.

下面會描述一些典型的HBase數據攝取用例,以及相應的行鍵設計與構造方法。注意:這裏只是對可行方法的說明,並非詳盡的列表。要理解你的數據,以及你的處理需求。

It is highly recommended that you read the rest of the HBase and Schema Design first, before reading these case studies.

強烈推薦你在閱讀這些學習案例之前,先讀一讀HBase and Schema Design的剩餘內容。

The following case studies are described:

  • Log Data / Timeseries Data
  • Log Data / Timeseries on Steroids
  • Customer/Order
  • Tall/Wide/Middle Schema Design
  • List Data

以下描述的是這些案例:

  • 日誌數據 / 時序數據
  • 日誌數據 / 聚合時序數據
  • 客戶/訂單
  • 高/寬/中等 模式設計
  • 列表數據

45.1. 案例學習-日誌和時序數據(Case Study - Log Data and Timeseries Data)

Assume that the following data elements are being collected.

  • Hostname
  • Timestamp
  • Log event
  • Value/message

假設收集到的是以下數據元素

  • 主機名
  • 時間戳
  • 日誌事件
  • 值/消息

We can store them in an HBase table called LOG_DATA, but what will the rowkey be? From these attributes the rowkey will be some combination of hostname, timestamp, and log-event - but what specifically?

我們可以將它們存儲在一個叫做LOG_DATA的表中,但是行鍵是什麼呢?由這些屬性可知,應該是主機、時間戳和日誌事件的一些組合,但具體是什麼?

45.1.1. 時間戳位於前導位(Timestamp In The Rowkey Lead Position)

The rowkey [timestamp][hostname][log-event] suffers from the monotonically increasing rowkey problem described in Monotonically Increasing Row Keys/Timeseries Data.

由[timestamp][hostname][log-event]組成的行鍵,會遇到Monotonically Increasing Row Keys/Timeseries Data中描述的單調遞增行鍵問題。

There is another pattern frequently mentioned in the dist-lists about "bucketing" timestamps, by performing a mod operation on the timestamp. If time-oriented scans are important, this could be a useful approach. Attention must be paid to the number of buckets, because this will require the same number of scans to return results.

郵件列表中還經常提到另一種模式:對時間戳取模,將其"分桶"。如果基於時間的scan比較重要,這會是一個有用的方法。必須注意桶的數量,因爲查詢時需要同樣數量的scan才能返回結果。

long bucket = timestamp % numBuckets;  
to construct:  
[bucket][timestamp][hostname][log-event]

As stated above, to select data for a particular timerange, a Scan will need to be performed for each bucket. 100 buckets, for example, will provide a wide distribution in the keyspace but it will require 100 Scans to obtain data for a single timestamp, so there are trade-offs.

如上所述,要查詢某個特定時間範圍的數據,需要對每個桶各執行一次Scan。例如100個桶能讓鍵空間分佈得很開,但獲取單個時間戳範圍的數據就需要100次Scan,因此需要做權衡。
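
下面是一段示意代碼(並非官方實現,類名爲假設,並假設桶數不超過256、可用1個字節表示),演示如何按正文所述構造[bucket][timestamp][hostname][log-event]形式的rowkey:

import org.apache.hadoop.hbase.util.Bytes;

public class BucketedTimeseriesKeyExample {
  public static byte[] rowkey(long timestamp, String hostname, String logEvent, int numBuckets) {
    byte bucket = (byte) (timestamp % numBuckets); // 分桶:對時間戳取模
    return Bytes.add(
        Bytes.add(new byte[] { bucket }, Bytes.toBytes(timestamp)),
        Bytes.add(Bytes.toBytes(hostname), Bytes.toBytes(logEvent)));
  }
  // 查詢某個時間範圍時,需要對每個bucket各構造一個Scan(共numBuckets個),再在客戶端合併結果。
}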

45.1.2. 主機名位於前導位(Host In The Rowkey Lead Position)

The rowkey [hostname][log-event][timestamp] is a candidate if there is a large-ish number of hosts to spread the writes and reads across the keyspace. This approach would be useful if scanning by hostname was a priority.

如果有相當多的主機,能夠把讀寫分散到整個鍵空間,那麼[hostname][log-event][timestamp]也是一個候選方案。如果按主機名掃描是優先需求,這個方法會很有用。

45.1.3. 時間戳,或反轉時間戳(Timestamp, or Reverse Timestamp?)

If the most important access path is to pull most recent events, then storing the timestamps as reverse-timestamps (e.g., timestamp = Long.MAX_VALUE – timestamp) will create the property of being able to do a Scan on hostname to obtain the most recently captured events.

如果最重要的訪問方式是得到最新的事件,那麼以反轉時間戳的方式存儲的話(e.g., timestamp = Long.MAX_VALUE – timestamp),將產生這樣的特性:在對hostname進行scan時可以獲取最近得到的事件。

Neither approach is wrong, it just depends on what is most appropriate for the situation.

方法無所謂對錯,只取決於對具體情況是否最爲適合。

Reverse Scan API
HBASE-4811 implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available in HBase 0.98 and later. See Scan.setReversed() for more information.

反轉scan接口
HBASE-4811實現了一個接口,用來反向掃描一個表或其中一個範圍,以減少爲能夠反向掃描而所需的設計優化。在HBase 0.98及其後版本可用。See Scan.setReversed() for more information。

45.1.4. 變長 或 定長行鍵(Variable Length or Fixed Length Rowkeys?)

It is critical to remember that rowkeys are stamped on every column in HBase. If the hostname is a and the event type is e1 then the resulting rowkey would be quite small. However, what if the ingested hostname is myserver1.mycompany.com and the event type is com.package1.subpackage2.subsubpackage3.ImportantService?

務必要記得,在HBase中行鍵會重複存在於每個列。如果主機名和事件類型分別是a和e1,行鍵就會非常小,但如果主機名和事件類型是myserver1.mycompany.com和com.package1.subpackage2.subsubpackage3.ImportantService呢?

It might make sense to use some substitution in the rowkey. There are at least two approaches: hashed and numeric. In the Hostname In The Rowkey Lead Position example, it might look like this:

對行鍵進行一些替換也許是有意義的。至少有2種方法:哈希和數字。在主機名作爲行鍵前導位的例子中,看起來是這樣:

Composite Rowkey With Hashes:

  • [MD5 hash of hostname] = 16 bytes
  • [MD5 hash of event-type] = 16 bytes
  • [timestamp] = 8 bytes

使用哈希的組合行鍵

  • [MD5 hash of hostname] = 16 bytes
  • [MD5 hash of event-type] = 16 bytes
  • [timestamp] = 8 bytes

Composite Rowkey With Numeric Substitution:

For this approach another lookup table would be needed in addition to LOG_DATA, called LOG_TYPES. The rowkey of LOG_TYPES would be:

  • [type]
  • [bytes] variable length bytes for raw hostname or event-type.

A column for this rowkey could be a long with an assigned number, which could be obtained by using an HBase counter

So the resulting composite rowkey would be:

  • [substituted long for hostname] = 8 bytes
  • [substituted long for event type] = 8 bytes
  • [timestamp] = 8 bytes

使用數字的組合行鍵

這個方法在LOG_DATA之外,還需要另一張叫做LOG_TYPE的查找表。LOG_TYPE表的行鍵是:

  • [type]
  • [bytes] 存放原始主機名或事件類型的變長字節

該行鍵對應的列可以存放一個分配好的數字(long),這個數字可以通過HBase計數器獲取。

因此最終的組合行鍵是這樣:

  • [substituted long for hostname] = 8 bytes
  • [substituted long for event type] = 8 bytes
  • [timestamp] = 8 bytes

In either the Hash or Numeric substitution approach, the raw values for hostname and event-type can be stored as columns.

無論是用哈希還是數字的替換方法,主機名和事件類型的原始值都可以作爲列進行存儲。
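
下面是一段示意代碼(並非官方實現,類名爲假設),演示如何構造定長的哈希組合行鍵[MD5(hostname)][MD5(event-type)][timestamp],總長16+16+8=40字節:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class HashedCompositeKeyExample {
  public static byte[] rowkey(String hostname, String eventType, long timestamp) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] hostHash = md5.digest(hostname.getBytes(StandardCharsets.UTF_8));    // 16字節
    byte[] eventHash = md5.digest(eventType.getBytes(StandardCharsets.UTF_8));  // 16字節
    return Bytes.add(hostHash, eventHash, Bytes.toBytes(timestamp));            // 再加8字節時間戳
  }
}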

45.2. 案例學習 - 日誌數據和聚合時序數據(Case Study - Log Data and Timeseries Data on Steroids)

This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for certain time-periods. For a detailed explanation, see: http://opentsdb.net/schema.html, and Lessons Learned from OpenTSDB from HBaseCon2012.

這實際上就是OpenTSDB採用的方法。它把數據進行重寫並按照一定的時間週期將行打包成列。對其細節的解釋, see: http://opentsdb.net/schema.html, and Lessons Learned from OpenTSDB from HBaseCon2012.

But this is how the general concept works: data is ingested, for example, in this manner…​

[hostname][log-event][timestamp1]
[hostname][log-event][timestamp2]
[hostname][log-event][timestamp3]
with separate rowkeys for each detailed event, but is re-written like this…

[hostname][log-event][timerange]
and each of the above events are converted into columns stored with a time-offset relative to the beginning timerange (e.g., every 5 minutes). This is obviously a very advanced processing technique, but HBase makes this possible.

不過這裏展示了大概的工作原理:比如,數據以下面的方式被獲取:

[hostname][log-event][timestamp1]  
[hostname][log-event][timestamp2]  
[hostname][log-event][timestamp3]

每一個明細事件作爲一個行鍵,但會被重寫成這樣:

[hostname][log-event][timerange]

並且以上的每個事件,都會轉換爲一個列,存儲着相對於起始時間範圍的一個時間偏移(比如,每5分鐘)。這顯然是一個非常先進的處理技術,但是HBase使之成爲可能。

45.3. 案例學習 - 客戶/訂單(Case Study - Customer/Order)

Assume that HBase is used to store customer and order information. There are two core record-types being ingested: a Customer record type, and Order record type.

The Customer record type would include all the things that you’d typically expect:

假設使用HBase存儲客戶和訂單信息。會獲取到兩種主要的記錄類型:客戶記錄,和訂單記錄。

客戶記錄會包含如下內容:

  • Customer number
  • Customer name
  • Address (e.g., city, state, zip)
  • Phone numbers, etc.

訂單記錄會包含如下內容:

  • Customer number
  • Order number
  • Sales date
  • A series of nested objects for shipping locations and line-items (see Order Object Design for details)

Assuming that the combination of customer number and sales order uniquely identify an order, these two attributes will compose the rowkey, and specifically a composite key such as:

假設客戶號和訂單號的組合唯一標識一個訂單,對於訂單表,將會由這2個屬性組成行鍵,如下:

[customer number][order number]

for an ORDER table.

However, there are more design decisions to make: are the raw values the best choices for rowkeys?

當然,還有更多設計決策需要去做:原始值對行鍵來說是不是最好的選擇?

The same design questions in the Log Data use-case confront us here. What is the keyspace of the customer number, and what is the format (e.g., numeric? alphanumeric?) As it is advantageous to use fixed-length keys in HBase, as well as keys that can support a reasonable spread in the keyspace, similar options appear:

日誌數據用例中遇到的設計問題,這裏同樣存在。客戶號的鍵空間是怎樣的?格式如何(比如,純數字?字母數字混合?)?由於在HBase中使用定長、且能在鍵空間中合理分佈的行鍵是有利的,會出現類似這樣的選項:

Composite Rowkey With Hashes:

  • [MD5 of customer number] = 16 bytes
  • [MD5 of order number] = 16 bytes

Composite Numeric/Hash Combo Rowkey:

  • [substituted long for customer number] = 8 bytes
  • [MD5 of order number] = 16 bytes

哈希方式組合行鍵:

  • [MD5 of customer number] = 16 bytes
  • [MD5 of order number] = 16 bytes

混合數字和哈希的方式組合行鍵:

  • [substituted long for customer number] = 8 bytes
  • [MD5 of order number] = 16 bytes

45.3.1. 單個表?多個表?(Single Table? Multiple Tables?)

A traditional design approach would have separate tables for CUSTOMER and SALES. Another option is to pack multiple record types into a single table (e.g., CUSTOMER++).

傳統的設計方法是爲CUSTOMER和SALES分別建表。另一個選項是把多種記錄類型放進同一張表(比如,CUSTOMER++)。

Customer Record Type Rowkey:

[customer-id]

[type] = type indicating `1' for customer record type

Order Record Type Rowkey:

[customer-id]

[type] = type indicating `2' for order record type

[order]

The advantage of this particular CUSTOMER++ approach is that it organizes many different record-types by customer-id (e.g., a single scan could get you everything about that customer). The disadvantage is that it’s not as easy to scan for a particular record-type.

這種獨特的CUSTOMER++方法的優勢是將多種不同的記錄類型通過客戶id進行組織(比如,單個scan就可以獲取該客戶的所有數據)。劣勢是對於特定的記錄類型進行掃描不太容易。

45.3.2. 訂單對象設計(Order Object Design)

Now we need to address how to model the Order object. Assume that the class structure is as follows:

Order
(an Order can have multiple ShippingLocations)

LineItem
(a ShippingLocation can have multiple LineItems)

there are multiple options on storing this data.

現在我們需要解決如何對訂單(Order)對象建模。假設類結構如下:

訂單(Order)
(一個訂單可以包含多個物流地址)

明細項(LineItem)
(一個物流地址可以包含多個明細項)

對這類數據的存儲有多種選擇。

完全規範化(Completely Normalized)

With this approach, there would be separate tables for ORDER, SHIPPING_LOCATION, and LINE_ITEM.

在這個方法中,將會分爲ORDER, SHIPPING_LOCATION, and LINE_ITEM等獨立的表。

The ORDER table’s rowkey was described above: schema.casestudies.custorder

The SHIPPING_LOCATION’s composite rowkey would be something like this:

[order-rowkey]

shipping location number

The LINE_ITEM table’s composite rowkey would be something like this:

[order-rowkey]

shipping location number

line item number

ORDER表的行鍵如上所述:schema.casestudies.custorder

SHIPPING_LOCATION表的組合行鍵是:

[order-rowkey]

[shipping location number](e.g., 1st location, 2nd, etc.)

LINE_ITEM表的組合行鍵是:

[order-rowkey]

[shipping location number](e.g., 1st location, 2nd, etc.)

[line item number](e.g., 1st lineitem, 2nd, etc.)

Such a normalized model is likely to be the approach with an RDBMS, but that’s not your only option with HBase. The cons of such an approach is that to retrieve information about any Order, you will need:

Get on the ORDER table for the Order

Scan on the SHIPPING_LOCATION table for that order to get the ShippingLocation instances

Scan on the LINE_ITEM for each ShippingLocation

granted, this is what an RDBMS would do under the covers anyway, but since there are no joins in HBase you’re just more aware of this fact.

RDBMS中常會採用這樣的一個標準模型,但在HBase中卻非唯一選擇。這種方法的缺點是,要檢索任意訂單的信息,你需要:

從ORDER表中獲取訂單信息

掃描SHIPPING_LOCATION表獲取該訂單的物流地址信息

掃描LINE_ITEM表獲取每個物流地址的物品項

當然,這也是RDBMS在底層實際所做的事情,但由於HBase不支持join,你只是更清楚地意識到了這一點。

帶有記錄類型的單個表(Single Table With Record Types)

With this approach, there would exist a single table ORDER that would contain

在這個方法中,將會存在單個表ORDER,包含

The Order rowkey was described above: schema.casestudies.custorder

[order-rowkey]

[ORDER record type]

The ShippingLocation composite rowkey would be something like this:

[order-rowkey]

[SHIPPING record type]

shipping location number

The LineItem composite rowkey would be something like this:

[order-rowkey]

[LINE record type]

shipping location number

line item number

ORDER表的行鍵如上所述:schema.casestudies.custorder

[order-rowkey]

[ORDER record type]

ShippingLocation的組合行鍵是:

[order-rowkey]

[SHIPPING record type]

[shipping location number](e.g., 1st location, 2nd, etc.)

LineItem的組合行鍵是:

[order-rowkey]

[LINE record type]

[shipping location number](e.g., 1st location, 2nd, etc.)

[line item number](e.g., 1st lineitem, 2nd, etc.)

非規範化(Denormalized)

A variant of the Single Table With Record Types approach is to denormalize and flatten some of the object hierarchy, such as collapsing the ShippingLocation attributes onto each LineItem instance.

帶記錄類型的單個表方法的一個變體,是進行反規範化,將部分對象層次扁平化,比如把ShippingLocation的屬性合併到每個LineItem實例上。

LineItem的組合行鍵是:

[order-rowkey]

[LINE record type]

[line item number](e.g., 1st lineitem, 2nd, etc., care must be taken that there are unique across the entire order)

LineItem表的列是:

itemNumber

quantity

price

shipToLine1 (denormalized from ShippingLocation)

shipToLine2 (denormalized from ShippingLocation)

shipToCity (denormalized from ShippingLocation)

shipToState (denormalized from ShippingLocation)

shipToZip (denormalized from ShippingLocation)

The pros of this approach include a less complex object hierarchy, but one of the cons is that updating gets more complicated in case any of this information changes.

這個方法的優點包括對象層次結構沒那麼複雜,但缺點之一是:一旦這些信息中的任何一項發生變化,更新就會變得更複雜。

Object BLOB

With this approach, the entire Order object graph is treated, in one way or another, as a BLOB. For example, the ORDER table’s rowkey was described above: schema.casestudies.custorder, and a single column called "order" would contain an object that could be deserialized that contained a container Order, ShippingLocations, and LineItems.

在這個方法中,整個訂單對象圖以某種方式被當作一個BLOB來處理。例如,ORDER表的行鍵如上所述:schema.casestudies.custorder,然後用一個名爲"order"的列保存一個可以被反序列化的對象,其中包含Order、ShippingLocations和LineItems。

There are many options here: JSON, XML, Java Serialization, Avro, Hadoop Writables, etc. All of them are variants of the same approach: encode the object graph to a byte-array. Care should be taken with this approach to ensure backward compatibility in case the object model changes such that older persisted structures can still be read back out of HBase.

這裏有很多選項:JSON、XML、Java序列化、Avro、Hadoop Writables等等。它們都是同一方法的不同變體:把對象圖編碼成字節數組。採用這種方法時需要注意向後兼容:在對象模型變化之後,舊的持久化結構仍然要能從HBase中讀出來。

Pros are being able to manage complex object graphs with minimal I/O (e.g., a single HBase Get per Order in this example), but the cons include the aforementioned warning about backward compatibility of serialization, language dependencies of serialization (e.g., Java Serialization only works with Java clients), the fact that you have to deserialize the entire object to get any piece of information inside the BLOB, and the difficulty in getting frameworks like Hive to work with custom objects like this.

優點是能以最小的I/O管理複雜的對象圖(比如,本例中每個訂單只需一次HBase Get),但缺點包括前面提到的序列化向後兼容問題、序列化的語言依賴(比如Java序列化只能由Java客戶端使用)、要獲取BLOB中的任何一小塊信息都必須反序列化整個對象,以及像Hive這類框架難以處理這種自定義對象。

45.4. 案例學習 - "高/寬/中等"模式設計之爭(Case Study - "Tall/Wide/Middle" Schema Design Smackdown)

This section will describe additional schema design questions that appear on the dist-list, specifically about tall and wide tables. These are general guidelines and not laws - each application must consider its own needs.

這個章節將描述郵件列表中出現的另外一些模式設計問題,特別是關於高表和寬表的問題。這些是一般性的指導原則而非定律,每個應用都必須考慮自身的需求。

45.4.1. 行 vs 版本(Rows vs. Versions)

A common question is whether one should prefer rows or HBase’s built-in-versioning. The context is typically where there are "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 1 max versions). The rows-approach would require storing a timestamp in some portion of the rowkey so that they would not overwrite with each successive update.

Preference: Rows (generally speaking).

一個常見的問題是:應該優先使用行,還是HBase內置的多版本。典型的場景是需要保留一行的"很多"個版本(比如,明顯超過HBase默認的最大1個版本)。採用行的方式,需要在行鍵的某個部分存儲時間戳,這樣後續的更新就不會互相覆蓋。

優先:行(通常來說)

45.4.2. 行 vs 列(Rows vs. Columns)

Another common question is whether one should prefer rows or columns. The context is typically in extreme cases of wide tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 columns apiece.

Preference: Rows (generally speaking). To be clear, this guideline is in the context is in extremely wide cases, not in the standard use-case where one needs to store a few dozen or hundred columns. But there is also a middle path between these two options, and that is "Rows as Columns."

另一個常見的問題是使用行還是列。典型的情況是較爲極端的寬表,比如一行含有一百萬列,或者一百萬行各自含一個列。

優先:行(通常來說)。需要澄清的是,該準則針對的是極端寬表的情況,而不是需要存儲幾十或幾百個列的常規用例。在這兩個選項之間還有一個中間選擇,即"行作爲列"。

45.4.3. 行作爲列(Rows as Columns)

The middle path between Rows vs. Columns is packing data that would be a separate row into columns, for certain rows. OpenTSDB is the best example of this case where a single row represents a defined time-range, and then discrete events are treated as columns. This approach is often more complex, and may require the additional complexity of re-writing your data, but has the advantage of being I/O efficient. For an overview of this approach, see schema.casestudies.log-steroids.

行與列之間的中間選擇,是把本來會作爲獨立行存儲的數據,針對某些行打包成列。OpenTSDB是這種情況的最佳例子:單個行代表一個既定的時間範圍,離散的事件則作爲列存儲。這種方法通常更復雜,並且可能需要重寫數據的額外複雜度,但在I/O效率上有優勢。該方法的概述可查看schema.casestudies.log-steroids。

45.5. 案例學習 - 列表數據(Case Study - List Data)

The following is an exchange from the user dist-list regarding a fairly common question: how to handle per-user list data in Apache HBase.

以下是用戶郵件列表中關於一個相當常見的問題的交流:如何在Apache HBase中處理每用戶(per-user)的列表數據。

QUESTION

We’re looking at how to store a large amount of (per-user) list data in HBase, and we were trying to figure out what kind of access pattern made the most sense. One option is store the majority of the data in a key, so we could have something like:

我們在研究如何在HBase中存儲大量(每用戶的)列表數據,並試圖弄清楚哪種訪問模式最合理。一個選項是把大部分數據存儲在行鍵裏,看起來是這樣:

<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
<FixedWidthUserName><FixedWidthValueId3>:"" (no value)

The other option we had was to do this entirely using:

另一個選項是完全使用:

<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...

where each row would contain multiple values. So in one case reading the first thirty values would be:

每行會包含多個值。因此讀取前三十個值的話前者可以這樣:

scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}

And in the second case it would be

而後者是這樣:

get 'FixedWidthUserName\x00\x00\x00\x00'

The general usage pattern would be to read only the first 30 values of these lists, with infrequent access reading deeper into the lists. Some users would have <= 30 total values in these lists, and some users would have millions (i.e. power-law distribution)

一般的使用模式是隻讀取列表的前30個值,偶爾纔會更深入地讀取列表。有些用戶的列表總共不超過30個值,而有些用戶則有上百萬個值(也就是冪律分佈)。

The single-value format seems like it would take up more space on HBase, but would offer some improved retrieval / pagination flexibility. Would there be any significant performance advantages to be able to paginate via gets vs paginating with scans?

單個值的格式在HBase中看起來會佔用更多空間,但能夠提供更優的檢索/分頁靈活性。通過gets分頁是否比scans分頁有明顯的性能優勢?

My initial understanding was that doing a scan should be faster if our paging size is unknown (and caching is set appropriately), but that gets should be faster if we’ll always need the same page size. I’ve ended up hearing different people tell me opposite things about performance. I assume the page sizes would be relatively consistent, so for most use cases we could guarantee that we only wanted one page of data in the fixed-page-length case. I would also assume that we would have infrequent updates, but may have inserts into the middle of these lists (meaning we’d need to update all subsequent rows).

Thanks for help / suggestions / follow-up questions.

我最初的理解是,如果分頁大小未知(且緩存設置得當),執行scan應該更快;但如果總是使用同樣的分頁大小,gets應該更快。我聽到不同的人對性能給出了相反的說法。我假設分頁大小會相對一致,因此對於大多數用例,我們可以保證在固定頁長的情況下只需要一頁數據。我還假設更新不頻繁,但可能會往列表中間插入數據(這意味着我們需要更新所有後續的行)。

ANSWER

If I understand you correctly, you’re ultimately trying to store triples in the form "user, valueid, value", right? E.g., something like:

如果我理解得沒錯,你本質上是想存儲"user, valueid, value"形式的三元組,對嗎?類似這樣:

"user123, firstname, Paul",
"user234, lastname, Smith"

(But the usernames are fixed width, and the valueids are fixed width).

(不過usernames爲定長,並且valueids也是定長)。

And, your access pattern is along the lines of: "for user X, list the next 30 values, starting with valueid Y". Is that right? And these values should be returned sorted by valueid?

The tl;dr version is that you should probably go with one row per user+value, and not build a complicated intra-row pagination scheme on your own unless you’re really sure it is needed.

並且,你的訪問模式大致是:"對於用戶X,列出從valueid Y開始的接下來30個值"。是這樣嗎?並且這些值需要按valueid排序返回?

tl;dr版本是,你或許應該每個user+value作爲一行,而不是去親自構建一個行內分頁模式,除非你確定這是需要的。

Your two options mirror a common question people have when designing HBase schemas: should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means "the value". This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done. What you’re giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn’t sound like you need that. Doing it this way is generally recommended (see here https://hbase.apache.org/book.html#schema.smackdown).

你的兩個選項反映了人們在設計HBase模式時的一個常見問題:應該用"高"表還是"寬"表?你的第一個模式是高表:每一行代表一個用戶的一個值,因此表中每個用戶會有很多行;行鍵是user + valueid,並且(大概)只需要一個表示"值"的列限定符。如果你想按行鍵的排序順序掃描,這種方式很不錯(因此纔有我上面的問題:這些id是否按正確的順序排序)。你可以從任意的user+valueid開始一個scan,讀取接下來的30行,就可以了。你所放棄的是針對單個用戶所有行的事務保證能力,但聽起來你並不需要這個。通常推薦這種做法(看這裏:https://hbase.apache.org/book.html#schema.smackdown)。
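As a rough Java-client illustration of the "tall" layout described above (the table name "userValues", column family "d", qualifier "v" and the fixed-width key values are assumptions for this sketch, not part of the original exchange):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TallSchemaScan {
  public static void main(String[] args) throws IOException {
    byte[] user = Bytes.toBytes("user0000001");           // fixed-width user name
    byte[] fromValueId = Bytes.toBytes("val0000000Y");    // fixed-width valueid to start from
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("userValues"))) {
      Scan scan = new Scan()
          .withStartRow(Bytes.add(user, fromValueId))     // row key = user + valueid
          .setLimit(30);                                  // "the next 30 values"
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          byte[] value = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("v"));
          // in practice, stop early once the row key no longer starts with this user's prefix
        }
      }
    }
  }
}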

Your second option is "wide": you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to just store ALL values for one user in a single row. I’m guessing you jumped to the "paginated" version because you’re assuming that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you’re not trying to do too much in a single request, or do things like scanning over and returning all of the cells in the row, it shouldn’t be fundamentally worse. The client has methods that allow you to get specific slices of columns.

你的第二個選項是寬表:你在一行中存儲一批值,使用不同的列限定符(限定符就是valueid)。最簡單的做法是把一個用戶的所有值都存在同一行裏。我猜你直接跳到了"分頁"版本,是因爲你假定在一行中存儲上百萬列會對性能不利,但這未必是真的;只要你沒有試圖在單個請求中做過多的事情,或是掃描並返回該行的所有cell,本質上並不會更差。客戶端提供了一些方法,允許你只獲取列的特定片段。
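For the "wide" layout, one way to read only a slice of a very wide row is a Get with a ColumnPaginationFilter; a minimal sketch, again with placeholder names ("userValues", family "d"):

import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowSlice {
  public static void main(String[] args) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("userValues"))) {
      Get get = new Get(Bytes.toBytes("user0000001"));    // the whole list lives in one row
      get.addFamily(Bytes.toBytes("d"));
      get.setFilter(new ColumnPaginationFilter(30, 0));   // return only the first 30 qualifiers
      Result result = table.get(get);
      if (result.listCells() != null) {
        for (Cell cell : result.listCells()) {
          byte[] valueId = CellUtil.cloneQualifier(cell); // qualifier plays the role of valueid
        }
      }
    }
  }
}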

Note that neither case fundamentally uses more disk space than the other; you’re just "shifting" part of the identifying information for a value either to the left (into the row key, in option one) or to the right (into the column qualifiers in option 2). Under the covers, every key/value still stores the whole row key, and column family name. (If this is a bit confusing, take an hour and watch Lars George’s excellent video about understanding HBase schema design: http://www.youtube.com/watch?v=_HLoH_PgrLk).

注意,從根本上說,兩種方式佔用的磁盤空間並沒有誰多誰少;你只是把值的一部分標識信息向左移(放入行鍵,選項一)或向右移(放入列限定符,選項二)。在底層,每個key/value仍然會存儲完整的行鍵和列族名。(如果這有些令人困惑,可以花一個小時看看Lars George關於理解HBase模式設計的優秀視頻:http://www.youtube.com/watch?v=_HLoH_PgrLk)

A manually paginated version has lots more complexities, as you note, like having to keep track of how many things are in each page, re-shuffling if new values are inserted, etc. That seems significantly more complex. It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know that would be to try it out. If you don’t have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user+value). Start simple and iterate!

手工分頁的版本要複雜得多,正如你所提到的,比如需要跟蹤每頁有多少內容、有新數據插入時需要重新調整等等。這看起來明顯更爲複雜。在極端高吞吐的場景下,它也許會有些微的速度優勢(或劣勢!),而唯一能真正確認這一點的方法就是實際測試。如果你沒有時間把兩種方式都實現出來做對比,我的建議是從最簡單的選項開始(每個user+value作爲一行)。從簡單開始,然後迭代!

46. Operational and Performance Configuration Options

46.1. 優化HBase 服務端RPC處理(Tune HBase Server RPC Handling)

  • Set hbase.regionserver.handler.count (in hbase-site.xml) to cores x spindles for concurrency.
  • Optionally, split the call queues into separate read and write queues for differentiated service. The parameter hbase.ipc.server.callqueue.handler.factor specifies the number of call queues:

    • 0 means a single shared queue
    • 1 means one queue for each handler.
    • A value between 0 and 1 allocates the number of queues proportionally to the number of handlers. For instance, a value of .5 shares one queue between each two handlers.
  • Use hbase.ipc.server.callqueue.read.ratio (hbase.ipc.server.callqueue.read.share in 0.98) to split the call queues into read and write queues:

    • 0.5 means there will be the same number of read and write queues
    • < 0.5 for more read than write
    • > 0.5 for more write than read
  • Set hbase.ipc.server.callqueue.scan.ratio (HBase 1.0+) to split read call queues into small-read and long-read queues:

    • 0.5 means that there will be the same number of short-read and long-read queues
    • < 0.5 for more short-read
    • > 0.5 for more long-read
  • 將hbase.regionserver.handler.count(在hbase-site.xml中)設置爲核數×磁盤數(cores x spindles)以提升併發能力(計算示例見本列表之後)。
  • 可選地,將請求隊列拆分爲獨立的讀、寫隊列以提供差異化服務。hbase.ipc.server.callqueue.handler.factor參數決定請求隊列的數量:

    • 0 代表共用1個隊列。
    • 1 代表每個handler對應1個隊列。
    • 0-1中間的值,代表根據handler的數量,按比例分配隊列。比如,0.5意味着2個handler共用1個隊列。
  • 使用hbase.ipc.server.callqueue.read.ratio(0.98版本中爲hbase.ipc.server.callqueue.read.share)將請求隊列拆分爲讀和寫隊列:

    • 0.5 代表讀隊列和寫隊列數量一樣
    • < 0.5 代表讀隊列更多
    • > 0.5 代表寫隊列更多
  • 配置hbase.ipc.server.callqueue.scan.ratio (HBase 1.0+) 將讀隊列拆分爲short-read和long-read隊列:

    • 0.5 代表short-read和long-read隊列數量一樣
    • < 0.5 代表short-read隊列更多
    • > 0.5 代表long-read隊列更多
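A back-of-the-envelope example of how these settings combine, using values assumed purely for illustration (HBase's internal rounding may differ slightly; this is arithmetic only, the real parameters belong in hbase-site.xml):

public class CallQueueMath {
  public static void main(String[] args) {
    int handlerCount = 30;        // hbase.regionserver.handler.count
    double queueFactor = 0.5;     // hbase.ipc.server.callqueue.handler.factor
    int callQueues = (int) (handlerCount * queueFactor);   // 15 queues: one queue per two handlers
    double readRatio = 0.5;       // hbase.ipc.server.callqueue.read.ratio
    int readQueues = (int) (callQueues * readRatio);       // ~7 read queues, the rest handle writes
    double scanRatio = 0.5;       // hbase.ipc.server.callqueue.scan.ratio
    int longReadQueues = (int) (readQueues * scanRatio);   // ~3 of the read queues serve long scans
    System.out.printf("%d call queues: %d read (%d long-read) / %d write%n",
        callQueues, readQueues, longReadQueues, callQueues - readQueues);
  }
}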

46.2. 對RPC禁用Nagle(Disable Nagle for RPC)

Disable Nagle’s algorithm. Delayed ACKs can add up to ~200ms to RPC round trip time. Set the following parameters:

  • In Hadoop’s core-site.xml:

    • ipc.server.tcpnodelay = true
    • ipc.client.tcpnodelay = true
  • In HBase’s hbase-site.xml:

    • hbase.ipc.client.tcpnodelay = true
    • hbase.ipc.server.tcpnodelay = true

禁用Nagle算法。延遲的ACK會使RPC往返時間最多增加約200ms。配置以下參數:

  • 在Hadoop的core-site.xml中:

    • ipc.server.tcpnodelay = true
    • ipc.client.tcpnodelay = true
  • 在HBase的hbase-site.xml中:

    • hbase.ipc.client.tcpnodelay = true
    • hbase.ipc.server.tcpnodelay = true

46.3. 限制服務端錯誤影響(Limit Server Failure Impact)

Detect regionserver failure as fast as reasonable. Set the following parameters:

  • In hbase-site.xml, set zookeeper.session.timeout to 30 seconds or less to bound failure detection (20-30 seconds is a good start).

    • Notice: the sessionTimeout of zookeeper is limited to between 2 times and 20 times the tickTime (the basic time unit in milliseconds used by ZooKeeper; the default value is 2000 ms. It is used for heartbeats, and the minimum session timeout will be twice the tickTime).
  • Detect and avoid unhealthy or failed HDFS DataNodes: in hdfs-site.xml and hbase-site.xml, set the following parameters:

    • dfs.namenode.avoid.read.stale.datanode = true
    • dfs.namenode.avoid.write.stale.datanode = true

在合理範圍內儘快發現regionserver的錯誤. 配置以下參數:

  • 在hbase-site.xml中,將zookeeper.session.timeout設置爲30秒或更短,以限制故障檢測時間(20-30秒是個不錯的起點)。

    • 注意:zookeeper的會話超時時間被限制在tickTime的2倍到20倍之間(tickTime是ZooKeeper使用的基本時間單位,單位爲毫秒,默認值是2000ms,用於發送心跳,最小會話超時時間爲tickTime的2倍)。
  • 檢測並避免不健康或故障的HDFS DataNode:在hdfs-site.xml和hbase-site.xml中配置以下參數:

    • dfs.namenode.avoid.read.stale.datanode = true
    • dfs.namenode.avoid.write.stale.datanode = true

46.4. 服務端低延遲優化(Optimize on the Server Side for Low Latency)

Skip the network for local blocks when the RegionServer goes to read from HDFS by exploiting HDFS’s Short-Circuit Local Reads facility. Note how setup must be done both at the datanode and on the dfsclient ends of the connection (i.e. at the RegionServer), and how both ends need to have loaded the hadoop native .so library. After configuring your hadoop setting dfs.client.read.shortcircuit to true and configuring the dfs.domain.socket.path path for the datanode and dfsclient to share and restarting, next configure the regionserver/dfsclient side.

當RegionServer從HDFS讀取數據時,利用HDFS的短路讀(Short-Circuit Local Reads)特性,對本地塊可以跳過網絡。注意,需要在連接的datanode端和dfsclient端(即RegionServer端)同時進行配置,並且兩端都需要加載Hadoop的本地.so庫。將hadoop的dfs.client.read.shortcircuit設置爲true,並配置datanode和dfsclient共享的dfs.domain.socket.path路徑,重啓之後,接下來再配置regionserver/dfsclient端。

  • In hbase-site.xml, set the following parameters:

    • dfs.client.read.shortcircuit = true
    • dfs.client.read.shortcircuit.skip.checksum = true so we don’t double checksum (HBase does its own checksumming to save on i/os. See hbase.regionserver.checksum.verify for more on this.)
    • dfs.domain.socket.path to match what was set for the datanodes.
    • dfs.client.read.shortcircuit.buffer.size = 131072 Important to avoid OOME — hbase has a default it uses if unset, see hbase.dfs.client.read.shortcircuit.buffer.size; its default is 131072.
  • Ensure data locality. In hbase-site.xml, set hbase.hstore.min.locality.to.skip.major.compact = 0.7 (Meaning that 0.7 <= n <= 1)
  • Make sure DataNodes have enough handlers for block transfers. In hdfs-site.xml, set the following parameters:

    • dfs.datanode.max.xcievers >= 8192
    • dfs.datanode.handler.count = number of spindles

Check the RegionServer logs after restart. You should only see complaints if there is a misconfiguration. Otherwise, shortcircuit read operates quietly in the background. It does not provide metrics, so there is no direct visibility into how effective it is, but read latencies should show a marked improvement, especially with good data locality, lots of random reads, and a dataset larger than the available cache.

重啓之後檢查RegionServer的日誌。如果配置錯誤會看到異常日誌。否則,短路讀並不會有顯式的輸出。它並未提供監控指標,所以效果如何不好看出,但是讀取延遲應該會有顯著提升,尤其是如果數據有較好的本地性,大量的隨機讀取,且數據集遠大於可用緩存。

Other advanced configurations that you might play with, especially if shortcircuit functionality is complaining in the logs, include dfs.client.read.shortcircuit.streams.cache.size and dfs.client.socketcache.capacity. Documentation is sparse on these options. You’ll have to read source code.

其它你可能需要嘗試的高級配置,尤其是當日志中出現短路讀相關的報錯時,包括dfs.client.read.shortcircuit.streams.cache.size和dfs.client.socketcache.capacity。這些選項的文檔很少,你需要閱讀源碼。

For more on short-circuit reads, see Colin’s old blog on rollout, How Improved Short-Circuit Local Reads Bring Better Performance and Security to Hadoop. The HDFS-347 issue also makes for an interesting read showing the HDFS community at its best (caveat a few comments).

更多關於短路讀的信息,可以查看Colin當年介紹其上線的博客:How Improved Short-Circuit Local Reads Bring Better Performance and Security to Hadoop。HDFS-347這個issue也值得一讀,它展示了HDFS社區最好的一面(個別評論除外)。

46.5. JVM調優(JVM Tuning)

46.5.1. 調優JVM GC以降低收集延遲(Tune JVM GC for low collection latencies)

Use the CMS collector: -XX:+UseConcMarkSweepGC

Keep eden space as small as possible to minimize average collection time. Example:

-XX:CMSInitiatingOccupancyFraction=70
Optimize for low collection latency rather than throughput: -Xmn512m

Collect eden in parallel: -XX:+UseParNewGC

Avoid collection under pressure: -XX:+UseCMSInitiatingOccupancyOnly

Limit per request scanner result sizing so everything fits into survivor space but doesn’t tenure. In hbase-site.xml, set hbase.client.scanner.max.result.size to 1/8th of eden space (with -Xmn512m this is ~51MB )

Set max.result.size x handler.count less than survivor space

使用CMS收集器:-XX:+UseConcMarkSweepGC

使eden區儘可能的小,來最小化平均收集時間。例如:
-XX:CMSInitiatingOccupancyFraction=70

爲低延遲而不是吞吐進行優化:-Xmn512m

eden區使用並行收集:-XX:+UseParNewGC

避免在壓力大時收集:
-XX:+UseCMSInitiatingOccupancyOnly

限制單個請求的scanner結果大小,使所有結果都能放進survivor區而不會晉升到老年代。在hbase-site.xml中,將hbase.client.scanner.max.result.size配置爲eden區的八分之一(使用-Xmn512m時約爲51MB)。

使max.result.size x handler.count小於survivor區。
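The ~51 MB figure above follows from the default HotSpot young-generation layout; a small arithmetic sketch, assuming the default -XX:SurvivorRatio=8 (eden : survivor : survivor = 8 : 1 : 1 inside -Xmn):

public class EdenSizingMath {
  public static void main(String[] args) {
    long youngGenBytes = 512L * 1024 * 1024;      // -Xmn512m
    long edenBytes     = youngGenBytes * 8 / 10;  // ~410 MB of eden with -XX:SurvivorRatio=8
    long survivorBytes = youngGenBytes / 10;      // ~51 MB per survivor space
    long maxResultSize = edenBytes / 8;           // ~51 MB -> hbase.client.scanner.max.result.size
    System.out.printf("eden=%dMB survivor=%dMB max.result.size=%dMB%n",
        edenBytes >> 20, survivorBytes >> 20, maxResultSize >> 20);
  }
}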

46.5.2. 操作系統級調優(OS-Level Tuning)

Turn transparent huge pages (THP) off:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Set vm.swappiness = 0

Set vm.min_free_kbytes to at least 1GB (8GB on larger memory systems)

Disable NUMA zone reclaim with vm.zone_reclaim_mode = 0

47. Special Cases

47.1. 對於那些希望快速失敗而非等待的應用(For applications where failing quickly is better than waiting)

In hbase-site.xml on the client side, set the following parameters:

Set hbase.client.pause = 1000

Set hbase.client.retries.number = 3

If you want to ride over splits and region moves, increase hbase.client.retries.number substantially (>= 20)

Set the RecoverableZookeeper retry count: zookeeper.recovery.retry = 1 (no retry)

In hbase-site.xml on the server side, set the Zookeeper session timeout for detecting server failures: zookeeper.session.timeout <= 30 seconds (20-30 is good).
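The pause, retry and RecoverableZookeeper settings above are all client-side, so besides the client's hbase-site.xml they can also be set programmatically on the connection's Configuration; a minimal sketch using the values quoted above:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class FailFastClient {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();   // also picks up the client hbase-site.xml
    conf.setLong("hbase.client.pause", 1000);
    conf.setInt("hbase.client.retries.number", 3);      // raise to >= 20 to ride over splits/moves
    conf.setInt("zookeeper.recovery.retry", 1);         // RecoverableZookeeper retry count
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      // tables obtained from this connection fail fast instead of retrying for minutes
    }
  }
}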

47.2. 對於那些能夠容忍稍微過時信息的應用(For applications that can tolerate slightly out of date information)

HBase timeline consistency (HBASE-10070) With read replicas enabled, read-only copies of regions (replicas) are distributed over the cluster. One RegionServer services the default or primary replica, which is the only replica that can service writes. Other RegionServers serve the secondary replicas, follow the primary RegionServer, and only see committed updates. The secondary replicas are read-only, but can serve reads immediately while the primary is failing over, cutting read availability blips from seconds to milliseconds. Phoenix supports timeline consistency as of 4.4.0. Tips:

  • Deploy HBase 1.0.0 or later.
  • Enable timeline consistent replicas on the server side.
  • Use one of the following methods to set timeline consistency:

    • Use ALTER SESSION SET CONSISTENCY = 'TIMELINE'
    • Set the connection property Consistency to timeline in the JDBC connect string

HBase時間線一致性(HBASE-10070):在啓用讀副本的情況下,region的只讀副本分佈在集羣中。一個RegionServer提供默認副本(即主副本)的服務,寫操作只能由主副本處理。其它RegionServer提供從副本服務,跟隨主RegionServer,並且只能看到已提交的更新。從副本是隻讀的,但在主副本進行故障轉移期間能夠立即提供讀服務,把讀不可用的時間從秒級縮短到毫秒級。Phoenix從4.4.0開始支持時間線一致性。提示(列表之後附有一個最小的HBase客戶端示例):

  • 部署HBase 1.0.0或更高版本。
  • 在服務端啓用時間線一致的副本。
  • 使用下述方法之一來設置時間線一致性:

    • Use ALTER SESSION SET CONSISTENCY = 'TIMELINE'
    • Set the connection property Consistency to timeline in the JDBC connect string
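The Phoenix statements above enable timeline consistency at the SQL layer; at the plain HBase Java API level a timeline-consistent read looks roughly like the sketch below (the table and row names are placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineRead {
  public static void main(String[] args) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("t1"))) {
      Get get = new Get(Bytes.toBytes("row1"));
      get.setConsistency(Consistency.TIMELINE);   // any replica may answer, not only the primary
      Result result = table.get(get);
      if (result.isStale()) {
        // answered by a secondary replica; the data may slightly lag the primary
      }
    }
  }
}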