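Introduction

In Lucene, Bloom filters are stored mainly in .blm files and are used to check whether a given value exists, for example whether a document ID already exists at write time. Note that a Bloom filter can only tell you that a value is definitely absent; it can never tell you that a value is definitely present.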

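Implementation

In Lucene, the concrete Bloom filter implementation lives mainly in FuzzySet. Its entry point is DefaultBloomFilterFactory, whose getSetForField function returns a Bloom filter for a field. It passes two arguments on to FuzzySet.createSetBasedOnQuality:

maxNumUniqueValues: the maximum number of distinct values that may be stored
desiredMaxSaturation: the target saturation (the fraction of bits that are set), discussed below; the default is 0.1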


public FuzzySet getSetForField(SegmentWriteState state,FieldInfo info) {
    //Assume all of the docs have a unique term (e.g. a primary key) and we hope to maintain a set with 10% of bits set
    return FuzzySet.createSetBasedOnQuality(state.segmentInfo.maxDoc(), 0.10f);
  }

In FuzzySet.java, the capacity of the set is first estimated from the number of documents in the segment and the desired saturation.

 public static FuzzySet createSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation)
  {
      int setSize=getNearestSetSize(maxNumUniqueValues,desiredMaxSaturation);
      return new FuzzySet(new FixedBitSet(setSize+1),setSize, hashFunctionForVersion(VERSION_CURRENT));
  }
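
How getNearestSetSize picks a capacity is not shown here. As a rough sketch only, and assuming the set size must be a mask of the form 2^n - 1 (so that hash & bloomSize behaves like a modulo), one could pick the smallest such size whose saturation budget covers the expected number of values. The class and method names below are hypothetical, and Lucene's own estimate may differ in detail:

```java
// Hypothetical sketch, not Lucene's actual getNearestSetSize.
class SetSizeSketch {
    /**
     * Returns the smallest size of the form 2^n - 1 for which
     * size * desiredMaxSaturation is at least maxNumUniqueValues,
     * or -1 if no int-sized mask is large enough.
     */
    static int nearestSetSize(int maxNumUniqueValues, float desiredMaxSaturation) {
        for (int n = 1; n <= 31; n++) {
            long size = (1L << n) - 1;            // 1, 3, 7, ..., 2^31 - 1
            double budget = size * (double) desiredMaxSaturation;
            if (budget >= maxNumUniqueValues) {
                return (int) size;
            }
        }
        return -1;
    }
}
```

Under this simplification, nearestSetSize(100, 0.1f) gives 1023: a 511-bit set would only allow about 51 set bits at 10% saturation.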

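FixedBitSet is a very important data structure in Lucene. One of its uses is storing document numbers, with a single bit representing (storing) one document number. The class is especially well suited to contiguous, non-repeating int values: in the best case, 8 bytes can describe 64 int values. Its structure is shown below; let us first look at how FixedBitSet works.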

  private final long[] bits; // Array of longs holding the bits 
  private final int numBits; // The number of bits in use
  private final int numWords; // The exact number of longs needed to hold numBits (<= bits.length)

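bits: the array that stores the bits
numBits: determines how many bits are needed to store the int values. If numBits is 300, the backing array is actually allocated as a whole number of 64-bit words. The first multiple of 64 above 300 is 320 (64 * 5), so the array physically holds bits [0, 319], although the assertions in get and set only accept indices below numBits.
numWords: the number of long values needed to hold numBits bits (at most bits.length)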
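
The rounding from 300 to 320 can be reproduced with a small helper. This is a sketch with a hypothetical class name; the formula is the usual way of rounding a bit count up to whole 64-bit words:

```java
class WordCountSketch {
    // Number of 64-bit words needed to hold numBits bits (rounds up).
    static int bits2words(int numBits) {
        return ((numBits - 1) >> 6) + 1;
    }

    // Total number of bits physically available in the backing long[].
    static int capacityInBits(int numBits) {
        return bits2words(numBits) * 64;
    }
}
```

With numBits = 300 this yields 5 words, that is 320 bits of backing storage.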

FixedBitSet gives us two basic operations: reading (get) and writing (set).

 public boolean get(int index) {
    assert index >= 0 && index < numBits: "index=" + index + ", numBits=" + numBits;
    int i = index >> 6;               // div 64
    // signed shift will keep a negative index and force an
    // array-index-out-of-bounds-exception, removing the need for an explicit check.
    long bitmask = 1L << index;
    return (bits[i] & bitmask) != 0;
  }

  public void set(int index) {
    assert index >= 0 && index < numBits: "index=" + index + ", numBits=" + numBits;
    int wordNum = index >> 6;      // div 64
    long bitmask = 1L << index;
    bits[wordNum] |= bitmask;
  }
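
One detail that is easy to miss in get and set: 1L << index does not overflow when index is 64 or larger, because Java uses only the low 6 bits of the shift count for a long, so the shift is effectively by index % 64. The following minimal stand-in (a hypothetical TinyBitSet, not Lucene's class) shows the two operations working together:

```java
class TinyBitSet {
    private final long[] bits;

    TinyBitSet(int numBits) {
        // Round up to whole 64-bit words.
        this.bits = new long[((numBits - 1) >> 6) + 1];
    }

    void set(int index) {
        // index >> 6 selects the word; the shift count wraps at 64,
        // so 1L << index selects the bit inside that word.
        bits[index >> 6] |= 1L << index;
    }

    boolean get(int index) {
        return (bits[index >> 6] & (1L << index)) != 0;
    }
}
```

Setting bit 70 touches word 1 at bit offset 6, so bit 6 (word 0) and bit 134 (word 2) stay clear even though they share the same offset.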

FixedBitSet also provides intersection and union operations between different bit sets.

Next, let us look at the insert and lookup operations in FuzzySet.

public void addValue(BytesRef value) throws IOException {    
      int hash = hashFunction.hash(value);
      if (hash < 0) {
        hash = hash * -1;
      }
      // Bitmasking using bloomSize is effectively a modulo operation.
      // Modulo (for positive numbers this is the same as taking the remainder)
      int bloomPos = hash & bloomSize;
      filter.set(bloomPos);
  }

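An int hash code is generated with MurmurHash2.
The hash is masked with bloomSize (the set size estimated from the expected number of documents), which acts as a modulo operation.
filter.set is called to record the bit at that position.
The overall data flow is: BytesRef value -> MurmurHash2 -> int hash -> hash & bloomSize -> filter.set(bloomPos).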
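
The steps above cover insertion. A lookup follows the same path but reads the bit instead of setting it, which is why the answer can only be "maybe present" or "definitely absent". The sketch below is a hypothetical stand-in, not Lucene's FuzzySet: it uses java.util.BitSet in place of FixedBitSet and Arrays.hashCode in place of MurmurHash2, and it models the two possible answers as MAYBE and NO:

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical stand-in for illustration; not Lucene's FuzzySet.
class FuzzySetSketch {
    enum ContainsResult { MAYBE, NO }

    private final int bloomSize;  // mask of the form 2^n - 1
    private final BitSet filter;

    FuzzySetSketch(int bloomSize) {
        this.bloomSize = bloomSize;
        this.filter = new BitSet(bloomSize + 1);
    }

    // Assumption: Arrays.hashCode stands in for MurmurHash2.
    private int position(byte[] value) {
        int hash = Arrays.hashCode(value);
        // Masking with a non-negative mask always yields 0..bloomSize,
        // even when hash is negative.
        return hash & bloomSize;
    }

    void addValue(byte[] value) {
        filter.set(position(value));
    }

    ContainsResult contains(byte[] value) {
        // A clear bit proves the value was never added; a set bit may
        // belong to another value that hashed to the same position.
        return filter.get(position(value)) ? ContainsResult.MAYBE : ContainsResult.NO;
    }
}
```

A caller would treat NO as final and fall back to a real term lookup only on MAYBE.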

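Known problems

Lifecycle

At present, ES maintains a Bloom filter mainly on the uid field, where it is used to decide whether a document exists:

On write: check whether the document already exists
On query: check whether the document with the given id exists

Creation: when an IndexReader is built, the bit array is loaded from the .blm file into memory
Release: after IndexReader.close, the in-memory bit array becomes eligible for garbage collection
Merge: because each segment keeps its own independent bit information, a merge reads the bit information of the segments and merges it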

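Impact on GC

Experiments show that 100 million documents typically occupy about 120 MB of memory, which affects GC in the following ways:

Small files cause memory fragmentation
Large files trigger GC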
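
As a back-of-the-envelope check on the memory figure, the filter size can be estimated from the sizing inputs used by getSetForField (one unique value per document and a 10% saturation target). The helper below is hypothetical; it additionally assumes the bit count is rounded up to a power of two and it ignores object overhead:

```java
class BloomMemorySketch {
    // Estimated bytes for a filter holding uniqueValues entries at the
    // given saturation, with the bit count rounded up to a power of two.
    static long estimateBytes(long uniqueValues, double saturation) {
        long bitsNeeded = (long) Math.ceil(uniqueValues / saturation);
        // Smallest power of two that is >= bitsNeeded (for bitsNeeded >= 2).
        long bits = Long.highestOneBit(bitsNeeded - 1) << 1;
        return bits / 8;
    }
}
```

For 100 million documents at 0.1 this gives 2^30 bits, that is 134,217,728 bytes (128 MiB), which is in the same range as the measured value.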
