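At work we got into a discussion about whether HBase's Bloom filter has any effect on scans. Before that discussion I had never really thought about the question; I simply assumed a Bloom filter could prune the HFiles a scan does not need. A quick Google search showed that this is not the case!
First, I studied the following two articles:
How HBase understands and uses Bloom filters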
http://zjushch.iteye.com/blog/1530143
An explanation of the HBase Bloom filter by one of HBase's main code committers
http://blog.csdn.net/macyang/article/details/6182629
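With a rough understanding of Bloom filters in hand, I then found two optimizations in HBase for queries against tables that have a Bloom filter:
1. A get enables the Bloom filter to help skip StoreFiles that will not be used.
When a scan is initialized (a get is wrapped as a scan), shouldSeek is checked for each StoreFile. If it returns false, that StoreFile does not contain what we are looking for and is skipped: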
if (memOnly == false
    && ((StoreFileScanner) kvs).shouldSeek(scan, columns)) {
  scanners.add(kvs);
}
The shouldSeek method: for a scan that is not a get it returns true right away, meaning the StoreFile cannot be skipped; otherwise it checks according to the Bloom filter type:
if (!scan.isGetScan()) {
  return true;
}
byte[] row = scan.getStartRow();
switch (this.bloomFilterType) {
  case ROW:
    return passesBloomFilter(row, 0, row.length, null, 0, 0);
  case ROWCOL:
    if (columns != null && columns.size() == 1) {
      byte[] column = columns.first();
      return passesBloomFilter(row, 0, row.length, column, 0,
          column.length);
    }
    // For multi-column queries the Bloom filter is checked from the
    // seekExact operation.
    return true;
  default:
    return true;
}
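To make the decision above easier to follow, here is a minimal, self-contained sketch of the same branching. It is not HBase code: the class name ShouldSeekSketch, the BloomType enum, and the use of an exact Set<String> as a stand-in for the Bloom filter (an idealized filter with no false positives) are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class ShouldSeekSketch {
    // Hypothetical mirror of the three cases in the excerpt above.
    enum BloomType { NONE, ROW, ROWCOL }

    // bloomKeys stands in for the store file's Bloom filter. A real Bloom
    // filter may also answer "maybe present" for keys that are absent.
    static boolean shouldSeek(boolean isGet, BloomType type, String row,
                              SortedSet<String> columns, Set<String> bloomKeys) {
        if (!isGet) {
            return true; // a range scan never skips the file at this stage
        }
        switch (type) {
            case ROW:
                return bloomKeys.contains(row);
            case ROWCOL:
                if (columns != null && columns.size() == 1) {
                    return bloomKeys.contains(row + "/" + columns.first());
                }
                return true; // multi-column query: not decided here
            default:
                return true; // no Bloom filter: always seek
        }
    }

    public static void main(String[] args) {
        Set<String> rowKeys = new HashSet<>(Arrays.asList("row2"));
        Set<String> rowColKeys = new HashSet<>(Arrays.asList("row2/a"));
        SortedSet<String> colB = new TreeSet<>(Arrays.asList("b"));

        System.out.println("get row1, ROW filter: "
            + shouldSeek(true, BloomType.ROW, "row1", null, rowKeys));
        System.out.println("get row2/b, ROWCOL filter: "
            + shouldSeek(true, BloomType.ROWCOL, "row2", colB, rowColKeys));
        System.out.println("scan from row1, ROW filter: "
            + shouldSeek(false, BloomType.ROW, "row1", null, rowKeys));
    }
}
```

In this sketch only the two get calls can return false; the range scan always returns true, which matches the early return in the excerpt.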
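2. A scan that specifies a qualifier, on a table configured with a ROWCOL Bloom filter, skips StoreFiles that will not be used.
For a scan or get that specifies a qualifier, the check happens in seekExactly: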
// Seek all scanners to the start of the Row (or if the exact matching row
// key does not exist, then to the start of the next matching Row).
if (matcher.isExactColumnQuery()) {
  for (KeyValueScanner scanner : scanners)
    scanner.seekExactly(matcher.getStartKey(), false);
} else {
  for (KeyValueScanner scanner : scanners)
    scanner.seek(matcher.getStartKey());
}
If the Bloom filter misses, a fake KeyValue that sorts last for that row/column is created, meaning this StoreFile does not need to be actually scanned for it:
public boolean seekExactly(KeyValue kv, boolean forward)
    throws IOException {
  if (reader.getBloomFilterType() != StoreFile.BloomType.ROWCOL ||
      kv.getRowLength() == 0 || kv.getQualifierLength() == 0) {
    return forward ? reseek(kv) : seek(kv);
  }
  boolean isInBloom = reader.passesBloomFilter(kv.getBuffer(),
      kv.getRowOffset(), kv.getRowLength(), kv.getBuffer(),
      kv.getQualifierOffset(), kv.getQualifierLength());
  if (isInBloom) {
    // This row/column might be in this store file. Do a normal seek.
    return forward ? reseek(kv) : seek(kv);
  }
  // Create a fake key/value, so that this scanner only bubbles up to the top
  // of the KeyValueHeap in StoreScanner after we scanned this row/column in
  // all other store files. The query matcher will then just skip this fake
  // key/value and the store scanner will progress to the next column.
  cur = kv.createLastOnRowCol();
  return true;
}
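Why does only ROWCOL let a scan skip a StoreFile here? The reason is simple. A scan covers a range, so a miss in a ROW Bloom filter only says that this particular row key is not in the StoreFile; the next row key in the range may still be there. A ROWCOL Bloom filter is different: a miss means the requested row/qualifier pair is not in this StoreFile, so the scan does not need to actually read this StoreFile for it.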
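The argument above can be illustrated with a small toy model. This is only a sketch under stated assumptions: RangeScanBloomSketch and its methods are hypothetical names, the store file is modeled as a sorted map, and the two filters are exact sets (idealized Bloom filters with no false positives) so that the outcome is deterministic.

```java
import java.util.HashSet;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

public class RangeScanBloomSketch {
    // Hypothetical store file: row key -> qualifiers stored for that row.
    final NavigableMap<String, Set<String>> rows = new TreeMap<>();
    // Idealized filters; a real Bloom filter can also report false positives.
    final Set<String> rowBloom = new HashSet<>();
    final Set<String> rowColBloom = new HashSet<>();

    void put(String row, String qualifier) {
        rows.computeIfAbsent(row, r -> new HashSet<>()).add(qualifier);
        rowBloom.add(row);
        rowColBloom.add(row + "/" + qualifier);
    }

    boolean rowBloomHit(String row) {
        return rowBloom.contains(row);
    }

    boolean rowColBloomHit(String row, String qualifier) {
        return rowColBloom.contains(row + "/" + qualifier);
    }

    // True if the file holds any row in [start, stop).
    boolean hasRowInRange(String start, String stop) {
        return !rows.subMap(start, true, stop, false).isEmpty();
    }

    public static void main(String[] args) {
        RangeScanBloomSketch file = new RangeScanBloomSketch();
        file.put("row2", "a");

        // Range scan [row1, row3): the start row misses the ROW filter,
        // yet the file still holds a row inside the range.
        System.out.println("ROW filter hit for row1: " + file.rowBloomHit("row1"));
        System.out.println("file has rows in [row1, row3): "
            + file.hasRowInRange("row1", "row3"));

        // Exact row/qualifier lookups against the ROWCOL filter.
        System.out.println("ROWCOL hit for row2/a: " + file.rowColBloomHit("row2", "a"));
        System.out.println("ROWCOL hit for row2/b: " + file.rowColBloomHit("row2", "b"));
    }
}
```

The ROW filter miss on row1 says nothing about row2, so the file cannot be dropped for the range. The ROWCOL miss on row2/b, by contrast, is enough to know the file holds nothing for that row/qualifier pair.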
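Conclusions:
1. The Bloom filter takes effect for any kind of get (by row key, or by row + column); the key point is that the kind of get must match the Bloom filter type.
2. A scan based on row alone cannot be optimized.
3. With a ROWCOL Bloom filter, a scan that specifies the qualifier can skip StoreFiles that do not contain that qualifier, which is a decent optimization. Specifying the qualifier also reduces the amount of data transferred, so specify the qualifier in scans whenever possible.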