A discussion of HBase scans and Bloom filters

Original

MrTitan

2020-02-25 10:29

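At work, a discussion came up about whether HBase's Bloom filter can take effect on scans. Before that discussion I had never really thought about it; I simply assumed that a Bloom filter could of course prune unneeded HFiles for a scan. A bit of Googling showed that this is not the case!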

First, I studied the following two articles:

Understanding and using Bloom filters (BF) in HBase

http://zjushch.iteye.com/blog/1530143

An explanation of HBase Bloom filters by one of HBase's main code committers

http://blog.csdn.net/macyang/article/details/6182629

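With a rough understanding of Bloom filters, I then found two optimizations in HBase for queries against tables that have a Bloom filter:

1. A get enables the Bloom filter to help skip StoreFiles that will not be used

When a scan is initialized (a get is wrapped as a scan), shouldSeek is checked for each StoreFile. If it returns false, the StoreFile does not contain the requested data and is skipped.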

    if (memOnly == false
        && ((StoreFileScanner) kvs).shouldSeek(scan, columns)) {
      scanners.add(kvs);
    }

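The shouldSeek method: for a scan that is not a get it returns true immediately, meaning the StoreFile cannot be skipped. Otherwise it checks according to the Bloom filter type.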

    if (!scan.isGetScan()) {
      return true;
    }

    byte[] row = scan.getStartRow();
    switch (this.bloomFilterType) {
      case ROW:
        return passesBloomFilter(row, 0, row.length, null, 0, 0);

      case ROWCOL:
        if (columns != null && columns.size() == 1) {
          byte[] column = columns.first();
          return passesBloomFilter(row, 0, row.length, column, 0,
              column.length);
        }

        // For multi-column queries the Bloom filter is checked from the
        // seekExact operation.
        return true;

      default:
        return true;
    }
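
To make the effect of this check concrete, here is a minimal sketch in plain Java. It is not HBase code: ToyBloom, ToyStoreFile and select are hypothetical names invented for illustration, and the hashing scheme is an assumption chosen for brevity. The sketch only mirrors the logic above: a non-get scan keeps every file, while a get keeps only the files whose ROW Bloom filter may contain the row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public class BloomSkipSketch {
    /** Toy Bloom filter (hypothetical, not HBase's implementation). */
    static final class ToyBloom {
        private final BitSet bits;
        private final int size;
        private final int probes;

        ToyBloom(int size, int probes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.probes = probes;
        }

        // Derive the i-th probe position from two hash values.
        private int index(String key, int i) {
            int h1 = key.hashCode();
            int h2 = (h1 >>> 16) | 1; // odd step so successive probes differ
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(String key) {
            for (int i = 0; i < probes; i++) {
                bits.set(index(key, i));
            }
        }

        /** False means "definitely absent"; true means "possibly present". */
        boolean mightContain(String key) {
            for (int i = 0; i < probes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Toy store file that carries a ROW-style Bloom filter over its row keys. */
    static final class ToyStoreFile {
        final String name;
        final ToyBloom rowBloom = new ToyBloom(1024, 3);

        ToyStoreFile(String name, String... rows) {
            this.name = name;
            for (String r : rows) {
                rowBloom.add(r);
            }
        }

        // Same shape as the excerpt: only a get may consult the filter.
        boolean shouldSeek(boolean isGet, String startRow) {
            if (!isGet) {
                return true;
            }
            return rowBloom.mightContain(startRow);
        }
    }

    /** Returns the names of the files that still have to be read. */
    static List<String> select(List<ToyStoreFile> files, boolean isGet, String startRow) {
        List<String> kept = new ArrayList<>();
        for (ToyStoreFile f : files) {
            if (f.shouldSeek(isGet, startRow)) {
                kept.add(f.name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ToyStoreFile> files = Arrays.asList(
            new ToyStoreFile("hfile-1", "row-a", "row-b"),
            new ToyStoreFile("hfile-2", "row-x", "row-y"));
        System.out.println("scan from row-a keeps: " + select(files, false, "row-a"));
        System.out.println("get of row-a keeps:   " + select(files, true, "row-a"));
    }
}
```

A Bloom filter never gives a false negative, so the file that really holds the row is always kept. A false positive can occasionally keep an extra file, which costs a wasted read but never a wrong result.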

2. A scan that specifies qualifiers, on a table configured with ROWCOL, skips StoreFiles that will not be used.

Scans or gets that specify a qualifier are checked through seekExactly:

 // Seek all scanners to the start of the Row (or if the exact matching row
    // key does not exist, then to the start of the next matching Row).
    if (matcher.isExactColumnQuery()) {
      for (KeyValueScanner scanner : scanners)
        scanner.seekExactly(matcher.getStartKey(), false);
    } else {
      for (KeyValueScanner scanner : scanners)
        scanner.seek(matcher.getStartKey());
    }

If the Bloom filter misses, a very large fake KeyValue is created, indicating that this StoreFile does not actually need to be scanned for that row/column.

public boolean seekExactly(KeyValue kv, boolean forward)
      throws IOException {
    if (reader.getBloomFilterType() != StoreFile.BloomType.ROWCOL ||
        kv.getRowLength() == 0 || kv.getQualifierLength() == 0) {
      return forward ? reseek(kv) : seek(kv);
    }

    boolean isInBloom = reader.passesBloomFilter(kv.getBuffer(),
        kv.getRowOffset(), kv.getRowLength(), kv.getBuffer(),
        kv.getQualifierOffset(), kv.getQualifierLength());
    if (isInBloom) {
      // This row/column might be in this store file. Do a normal seek.
      return forward ? reseek(kv) : seek(kv);
    }

    // Create a fake key/value, so that this scanner only bubbles up to the top
    // of the KeyValueHeap in StoreScanner after we scanned this row/column in
    // all other store files. The query matcher will then just skip this fake
    // key/value and the store scanner will progress to the next column.
    cur = kv.createLastOnRowCol();
    return true;
  }
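
The fake key trick can be sketched with an ordinary priority queue. This is a simplified model under stated assumptions, not the HBase KeyValueHeap: ToyKey, compare and mergeRealKeys are hypothetical names, keys are ordered only by row and qualifier, and a boolean flag stands in for the fact that the key returned by createLastOnRowCol sorts after every real cell of the same row/column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class FakeKeySketch {
    /** Toy key: the current head of one store file scanner. */
    static final class ToyKey {
        final String file;
        final String row;
        final String qualifier;
        final boolean fake; // true when the Bloom filter missed and no real seek happened

        ToyKey(String file, String row, String qualifier, boolean fake) {
            this.file = file;
            this.row = row;
            this.qualifier = qualifier;
            this.fake = fake;
        }
    }

    /** Order by row, then qualifier; a fake key sorts after real keys of the same row/column. */
    static int compare(ToyKey a, ToyKey b) {
        int c = a.row.compareTo(b.row);
        if (c != 0) {
            return c;
        }
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) {
            return c;
        }
        return Boolean.compare(a.fake, b.fake);
    }

    /** Polls the heap in key order and drops fake keys, as a query matcher would. */
    static List<String> mergeRealKeys(List<ToyKey> heads) {
        Comparator<ToyKey> order = FakeKeySketch::compare;
        PriorityQueue<ToyKey> heap = new PriorityQueue<>(order);
        heap.addAll(heads);
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            ToyKey k = heap.poll();
            if (k.fake) {
                continue; // nothing was read from this file for this row/column
            }
            out.add(k.file + ":" + k.row + "/" + k.qualifier);
        }
        return out;
    }

    public static void main(String[] args) {
        List<ToyKey> heads = Arrays.asList(
            new ToyKey("hfile-B", "r1", "q1", true),   // Bloom miss: fake key
            new ToyKey("hfile-C", "r2", "q1", false),  // real seek landed on the next row
            new ToyKey("hfile-A", "r1", "q1", false)); // real seek found the cell
        System.out.println(mergeRealKeys(heads)); // prints [hfile-A:r1/q1, hfile-C:r2/q1]
    }
}
```

The point of the ordering is that hfile-B only reaches the top of the heap after the real r1/q1 cell from hfile-A has been returned, so the file is never touched for that row/column.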

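Why is it that only ROWCOL can exclude a StoreFile here? The reason is simple. A scan covers a range: a miss in a ROW Bloom filter only shows that this particular rowkey is not in the StoreFile, but the next rowkey may well be. A ROWCOL Bloom filter is different: a miss shows that the requested qualifier is not in this StoreFile for the row being sought, so the scan does not need to read the file for it!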
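
The ROW case can be illustrated with a small toy model. Again this is only a sketch under assumptions: ToyStoreFile, filterSaysMaybe and hasRowInRange are hypothetical names, and an exact HashSet stands in for the Bloom filter (so it has no false positives, which a real filter would have). It shows that a filter miss on the start row says nothing about the rest of the range.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class RowBloomRangeSketch {
    /** Toy store file: sorted row keys plus a membership filter over them. */
    static final class ToyStoreFile {
        final TreeSet<String> rows = new TreeSet<>();
        // Exact stand-in for a ROW Bloom filter (assumption: no false positives).
        final Set<String> rowFilter = new HashSet<>();

        ToyStoreFile(String... rowKeys) {
            rows.addAll(Arrays.asList(rowKeys));
            rowFilter.addAll(rows);
        }

        /** What a membership filter can answer: is this exact row possibly here? */
        boolean filterSaysMaybe(String row) {
            return rowFilter.contains(row);
        }

        /** What a range scan needs to know: is any row in [start, stop) here? */
        boolean hasRowInRange(String start, String stop) {
            return !rows.subSet(start, stop).isEmpty();
        }
    }

    public static void main(String[] args) {
        ToyStoreFile file = new ToyStoreFile("row-05", "row-09");

        // The filter correctly reports that the start row itself is absent...
        System.out.println("filter for row-03: " + file.filterSaysMaybe("row-03"));
        // ...yet the file still holds a row inside the scanned range.
        System.out.println("rows in [row-03, row-07): " + file.hasRowInRange("row-03", "row-07"));
    }
}
```

A membership filter can only answer questions about exact keys, so skipping the file on the basis of the start row alone would lose row-05 from the scan result.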

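Conclusions:

1. The Bloom filter takes effect for any kind of get (by rowkey, or by row + column); the key point is that the type of the get must match the type of the Bloom filter.

2. A row-based scan cannot be optimized this way.

3. A scan with row + col + qualifier can skip StoreFiles that do not contain the qualifier, which is a decent optimization. Specifying qualifiers also reduces traffic, so scans should specify qualifiers whenever possible.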
