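A Bloom Filter for a Whitelist Service

A Bloom filter is a data structure for testing whether an element belongs to a set. A HashMap can of course do the same job. The difference is that a HashMap stores the key itself: for the key "IKnow7", a Java HashMap holds the 6 characters of "IKnow7", which is 12 bytes when stored as UTF-16 chars, plus the size of the reference (although equal interned strings are stored only once in Java). A Bloom filter, by contrast, represents the element with just a few bits. In speed the two are close, since both boil down to hashing plus random access. Because it saves so much space, a Bloom filter suits membership tests against very large data sets.

Some Bloom filter concepts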

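A Bloom filter is not perfect, though. When an element is added, it is hashed to k positions in the underlying bitmap, and the filter sets each of those bits to 1. Because of how hash functions behave and because the bitmap has finite length, different elements may overlap on some bits. To test whether an element is present, the filter checks whether all k of its bits are 1; if they are, the element is reported as present. Here lies the problem: those 1s were not necessarily set by this element and may have been set by others, so some elements that were never added are reported as present. This kind of error is called a false positive: an element that was added always passes, an element that was not added may also pass, but an added element is never reported as absent.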
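The add and lookup steps just described can be sketched in a few lines. This is only a minimal illustration: the class name SimpleBloomFilter and the way the k positions are derived from String.hashCode() are assumptions made for brevity, not the MD5-based scheme of the full implementation in this post.

```java
import java.util.BitSet;

// Illustrative sketch only; class and method names are hypothetical.
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the bitmap
    private final int k; // number of bit positions per element

    SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    // Assumption for brevity: both hashes come from String.hashCode().
    private int position(String element, int i) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1; // non-zero step between successive positions
        return Math.floorMod(h1 + i * h2, m);
    }

    // Sets the k bits that belong to the element.
    void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(position(element, i));
        }
    }

    // true means "possibly present"; false means "definitely absent".
    boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) {
                return false;
            }
        }
        return true;
    }
}
```

An element that was added always passes mightContain, because add set exactly the bits that the lookup checks. An element that was never added can still pass if other elements happen to have set all k of its bits.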
                    
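Intuitively, the false-positive probability depends on the choice of hash function, the size of the bitmap, the number of elements currently stored, and k (the number of bits each element maps to). In general, the more uniform the hash values, the larger the bitmap, and the fewer the elements, the lower the false-positive probability.
Here is a theoretical analysis of Bloom filter design and false-positive rates written by an expert; it is worth a read: http://www.cnblogs.com/allensun/archive/2011/02/16/1956532.html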
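To make the dependence on these quantities concrete, the two standard approximations can be computed directly: the false-positive rate (1 - e^(-k * n / m))^k and the best number of positions k = (m / n) * ln 2, where m is the bitmap size in bits and n is the number of stored elements. The helper class BloomMath below is hypothetical and only a sketch of these formulas.

```java
// Illustrative helper; the class and method names are hypothetical.
final class BloomMath {
    private BloomMath() {}

    // k that minimizes false positives for m bits and n elements: (m / n) * ln 2.
    // Clamped to at least 1 so that a filter always checks some bit.
    static int optimalK(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2.0)));
    }

    // Approximate false-positive probability: (1 - e^(-k * n / m))^k.
    static double falsePositiveRate(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }
}
```

For example, with m = 10,000,000 bits and n = 1,000,000 elements (10 bits per element), optimalK returns 7 and the estimated false-positive rate is about 0.8%.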

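Bloom filters in web applications

Web applications often use a whitelist to filter requests, which avoids pointless database lookups and blunts malicious attacks. For a whitelist that holds a huge amount of data and can tolerate some false positives, a Bloom filter is an excellent choice.

Using a Bloom filter to implement a whitelist that supports incremental updates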

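A whitelist usually needs updating, either by a full update or by an incremental one. A full update is straightforward: create a new Bloom filter and put all current data into it. An incremental update typically supplies the data added and deleted over some period, so the whitelist has to merge it in, adding what should be added and deleting what should be deleted.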
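However, a plain Bloom filter does not support deleting elements, because a given bit may be shared by several elements. One impractical idea is to keep a reference count for every position of the Bloom filter and decrement the counts of an element's positions each time it is deleted.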
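For reference, the per-position counter idea is usually called a counting Bloom filter. The sketch below uses hypothetical names and simplified hashing based on String.hashCode(); it mainly shows where the extra cost comes from, since every position becomes an int counter instead of a single bit.

```java
// Illustrative sketch; names are hypothetical and hashing is simplified.
final class CountingBloomFilter {
    private final int[] counters; // one counter per position instead of one bit
    private final int k;

    CountingBloomFilter(int m, int k) {
        this.counters = new int[m];
        this.k = k;
    }

    private int position(String element, int i) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1; // non-zero step between successive positions
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    void add(String element) {
        for (int i = 0; i < k; i++) {
            counters[position(element, i)]++;
        }
    }

    // Only safe for elements that really were added earlier; removing
    // anything else silently corrupts the counts of other elements.
    void remove(String element) {
        for (int i = 0; i < k; i++) {
            int p = position(element, i);
            if (counters[p] > 0) {
                counters[p]--;
            }
        }
    }

    boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (counters[position(element, i)] == 0) {
                return false;
            }
        }
        return true;
    }
}
```

With int counters the structure takes 32 times the memory of a bitmap of the same length, which gives up much of the space saving that motivated the Bloom filter in the first place.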
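A workable approach is to keep a separate map of deleted elements. When testing membership, first look in this deleteMap: if the element is there, return false immediately; if not, fall through to the Bloom filter. When adding an element that is in the deleteMap, remove it from the deleteMap and then add it to the Bloom filter. In practice a whitelist usually has few elements to delete, so this is efficient enough. There is one problem: as the deleteMap grows, the Bloom filter's false-positive rate ends up higher than necessary, because the bits that were set by the deleted elements are never reset to 0. The remedy is to refresh the Bloom filter with a full sync once the deleteMap reaches a size threshold.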
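The delete-map scheme described above, including the full-sync threshold, can be sketched as follows. Everything here is an assumption for illustration: the class name IncrementalWhitelist, the String element type, the simplified hashing, and the policy of applying removals before additions within one delta.

```java
import java.util.BitSet;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch; names, hashing and the threshold policy are assumptions.
final class IncrementalWhitelist {
    private final int m;          // bitmap size in bits
    private final int k;          // bit positions per element
    private final int maxDeleted; // delete-set size that calls for a full sync
    private BitSet bits;
    private final Set<String> deleted = new HashSet<>();

    IncrementalWhitelist(int m, int k, int maxDeleted) {
        this.m = m;
        this.k = k;
        this.maxDeleted = maxDeleted;
        this.bits = new BitSet(m);
    }

    private int position(String s, int i) {
        int h1 = s.hashCode();
        int h2 = (h1 >>> 16) | 1; // non-zero step between successive positions
        return Math.floorMod(h1 + i * h2, m);
    }

    private void add(String s) {
        deleted.remove(s); // a re-added element is no longer deleted
        for (int i = 0; i < k; i++) {
            bits.set(position(s, i));
        }
    }

    // Incremental update: record removals first, then apply additions.
    void applyDelta(Collection<String> added, Collection<String> removed) {
        deleted.addAll(removed);
        for (String s : added) {
            add(s);
        }
    }

    // Full update: rebuild the bitmap from scratch and forget the delete set.
    void fullSync(Collection<String> all) {
        bits = new BitSet(m);
        deleted.clear();
        for (String s : all) {
            add(s);
        }
    }

    // True once enough deletions have piled up to justify a full sync.
    boolean needsFullSync() {
        return deleted.size() >= maxDeleted;
    }

    // Deleted elements are rejected before the bitmap is consulted.
    boolean allows(String s) {
        if (deleted.contains(s)) {
            return false;
        }
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(s, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would apply each incremental batch with applyDelta, check needsFullSync afterwards, and request the complete data set for fullSync when it returns true.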

Implementation of the whitelist Bloom filter

A component like this is highly reusable and easy to integrate into existing code. Here it is in full:
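import java.io.Serializable;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Collection;
import java.util.HashMap;

public class BloomFilter<E> implements Serializable {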
    
    private static final long serialVersionUID = 3507830443935243576L;
    private long timestamp; // used by the timestamp-based update mechanism
    private HashMap<E, Boolean> deleteMap; // holds elements that have been deleted
    private BitSet bitset; // bitmap storage
    private int bitSetSize;
    // expected (maximum) number of elements to be added
    private int expectedNumberOfFilterElements;
    // number of elements actually added to the Bloom filter
    private int numberOfAddedElements;
    private int k; // number of bits each element maps to
     // encoding used for storing hash values as strings
    static Charset charset = Charset.forName("UTF-8"); 
     // MD5 gives good enough accuracy in most circumstances. 
     // Change to SHA1 if it's needed
    static String hashName = "MD5"; 
    static final MessageDigest digestFunction;

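    static { // One digest instance is shared by all filters; createHash synchronizes on it.
        try {
            digestFunction = java.security.MessageDigest.getInstance(hashName);
        } catch (NoSuchAlgorithmException e) {
            // Fail fast instead of leaving digestFunction null and hitting
            // a NullPointerException on the first hash.
            throw new IllegalStateException("Hash algorithm not available: " + hashName, e);
        }
    }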

    /**
     * Constructs an empty Bloom filter.
     *
     * @param bitSetSize defines how many bits should be used for the filter.
     * @param expectedNumberOfFilterElements defines the maximum 
     *           number of elements the filter is  expected to contain.
     */
    public BloomFilter(int bitSetSize, int expectedNumberOfFilterElements) {
        this.expectedNumberOfFilterElements = expectedNumberOfFilterElements;
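        // Divide as double: integer division would truncate m / n before
        // multiplying by ln 2. Keep k at 1 or more so contains() checks a bit.
        this.k = Math.max(1, (int) Math.round(
               ((double) bitSetSize / expectedNumberOfFilterElements) * Math.log(2.0)));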
        bitset = new BitSet(bitSetSize);
        deleteMap = new HashMap<E, Boolean>();
        this.bitSetSize = bitSetSize;
        numberOfAddedElements = 0;
    }

    /**
     * Generates a digest based on the contents of a String.
     *
     * @param val specifies the input data.
     * @param charset specifies the encoding of the input data.
     * @return digest as long.
     */
    public static long createHash(String val, Charset charset) {
        // getBytes(Charset) cannot throw UnsupportedEncodingException,
        // so no -1 error value is needed.
        return createHash(val.getBytes(charset));
    }

    /**
     * Generates a digest based on the contents of a String.
     *
     * @param val specifies the input data. The encoding is expected to be UTF-8.
     * @return digest as long.
     */
    public static long createHash(String val) {
        return createHash(val, charset);
    }

    /**
     * Generates a digest based on the contents of an array of bytes.
     *
     * @param data specifies input data.
     * @return digest as long.
     */
    public static long createHash(byte[] data) {
        long h = 0;
        byte[] res;

        synchronized (digestFunction) {
            res = digestFunction.digest(data);
        }

        for (int i = 0; i < 4; i++) {
            h <<= 8;
            h |= ((int) res[i]) & 0xFF;
        }
        return h;
    }

    /**
     * Compares the contents of two instances to see if they are equal.
     *
     * @param obj is the object to compare to.
     * @return True if the contents of the objects are equal.
     */
    @SuppressWarnings("unchecked")
    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final BloomFilter<E> other = (BloomFilter<E>) obj;        
        if (this.expectedNumberOfFilterElements != 
               other.expectedNumberOfFilterElements) {
            return false;
        }
        if (this.k != other.k) {
            return false;
        }
        if (this.bitSetSize != other.bitSetSize) {
            return false;
        }
        if (this.bitset != other.bitset && 
               (this.bitset == null || !this.bitset.equals(other.bitset))) {
            return false;
        }
        return true;
    }

    /**
     * Calculates a hash code for this class.
     * @return hash code representing the contents of an instance of this class.
     */
    @Override
    public int hashCode() {
        int hash = 7;
        hash = 61 * hash + (this.bitset != null ? this.bitset.hashCode() : 0);
        hash = 61 * hash + this.expectedNumberOfFilterElements;
        hash = 61 * hash + this.bitSetSize;
        hash = 61 * hash + this.k;
        return hash;
    }


    /**
     * Calculates the expected probability of false positives based on
     * the number of expected filter elements and the size of the Bloom filter.
     * <br /><br />
     * The value returned by this method is the <i>expected</i> rate of false
     * positives, assuming the number of inserted elements equals the number of
     * expected elements. If the number of elements in the Bloom filter is less
     * than the expected value, the true probability of false positives will be lower.
     *
     * @return expected probability of false positives.
     */
    public double expectedFalsePositiveProbability() {
        return getFalsePositiveProbability(expectedNumberOfFilterElements);
    }

    /**
     * Calculate the probability of a false positive given the specified
     * number of inserted elements.
     *
     * @param numberOfElements number of inserted elements.
     * @return probability of a false positive.
     */
    public double getFalsePositiveProbability(double numberOfElements) {
        // (1 - e^(-k * n / m)) ^ k
        return Math.pow((1 - Math.exp(-k * (double) numberOfElements
                        / (double) bitSetSize)), k);

    }

    /**
     * Get the current probability of a false positive. The probability is calculated from
     * the size of the Bloom filter and the current number of elements added to it.
     *
     * @return probability of false positives.
     */
    public double getFalsePositiveProbability() {
        return getFalsePositiveProbability(numberOfAddedElements);
    }


    /**
     * Returns the value chosen for K.<br />
     * <br />
     * K is the optimal number of hash functions based on the size
     * of the Bloom filter and the expected number of inserted elements.
     *
     * @return optimal k.
     */
    public int getK() {
        return k;
    }

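    /**
     * Sets all bits to false in the Bloom filter and forgets all
     * recorded deletions, so the filter can be refilled for a full update.
     */
    public void clear() {
        bitset.clear();
        deleteMap.clear();
        numberOfAddedElements = 0;
    }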

    /**
     * Adds an object to the Bloom filter. The output from the object's
     * toString() method is used as input to the hash functions.
     *
     * @param element is an element to register in the Bloom filter.
     */
    public void add(E element) {
        deleteMap.remove(element);
        long hash;
        String valString = element.toString();
        for (int x = 0; x < k; x++) {
            hash = createHash(valString + Integer.toString(x));
            hash = hash % (long) bitSetSize;
            bitset.set(Math.abs((int) hash), true);
        }
        numberOfAddedElements++;
    }

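    /**
     * Marks all elements of a Collection as removed from the Bloom filter.
     * @param c Collection of elements.
     */
    public void removeAll(Collection<? extends E> c) {
        for (E element : c)
            remove(element);
    }

    /**
     * Marks an element as removed by recording it in the delete map.
     * The bits it set in the bitmap are left unchanged.
     *
     * @param element element to remove.
     */
    public void remove(E element) {
        deleteMap.put(element, Boolean.TRUE);
    }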
    
    
    public int getDeleteMapSize(){
        return deleteMap.size();
    }

    /**
     * Adds all elements from a Collection to the Bloom filter.
     * @param c Collection of elements.
     */
    public void addAll(Collection<? extends E> c) {
        for (E element : c) {
            if (element != null)
                add(element);
        }
    }

    /**
     * Returns true if the element could have been inserted into the Bloom filter.
     * Use getFalsePositiveProbability() to calculate the probability of this
     * being correct.
     *
     * @param element element to check.
     * @return true if the element could have been inserted into the Bloom filter.
     */
    public boolean contains(E element) {
        Boolean contains = deleteMap.get(element);
        if (contains != null && contains)
            return false;
        long hash;
        String valString = element.toString();
        for (int x = 0; x < k; x++) {
            hash = createHash(valString + Integer.toString(x));
            hash = hash % (long) bitSetSize;
            if (!bitset.get(Math.abs((int) hash)))
                return false;
        }
        return true;
    }

    /**
     * Returns true if all the elements of a Collection could have been inserted
     * into the Bloom filter. Use getFalsePositiveProbability() to calculate the
     * probability of this being correct.
     * @param c elements to check.
     * @return true if all the elements in c could have been inserted into the Bloom filter.
     */
    public boolean containsAll(Collection<? extends E> c) {
        for (E element : c)
            if (!contains(element))
                return false;
        return true;
    }

    /**
     * Read a single bit from the Bloom filter.
     * @param bit the bit to read.
     * @return true if the bit is set, false if it is not.
     */
    public boolean getBit(int bit) {
        return bitset.get(bit);
    }

    /**
     * Set a single bit in the Bloom filter.
     * @param bit is the bit to set.
     * @param value If true, the bit is set. If false, the bit is cleared.
     */
    public void setBit(int bit, boolean value) {
        bitset.set(bit, value);
    }

    /**
     * Return the bit set used to store the Bloom filter.
     * @return bit set representing the Bloom filter.
     */
    public BitSet getBitSet() {
        return bitset;
    }

    /**
     * Returns the number of bits in the Bloom filter. Use count() to retrieve
     * the number of inserted elements.
     *
     * @return the size of the bitset used by the Bloom filter.
     */
    public int size() {
        return this.bitSetSize;
    }

    /**
     * Returns the number of elements added to the Bloom filter after it
     * was constructed or after clear() was called.
     *
     * @return number of elements added to the Bloom filter.
     */
    public int count() {
        return this.numberOfAddedElements;
    }

    /**
     * Returns the expected number of elements to be inserted into the filter.
     * This value is the same value as the one passed to the constructor.
     *
     * @return expected number of elements.
     */
    public int getExpectedNumberOfElements() {
        return expectedNumberOfFilterElements;
    }

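    /**
     * Returns the timestamp of the last update.
     * @return the update timestamp.
     */
    public long getTimestamp() {
        return timestamp;
    }

    /**
     * Sets the timestamp of the last update.
     * @param timestamp the update timestamp.
     */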
    public void setTimestamp(long timestamp) {
        this.timestamp = timestamp;
    }

    @Override
    public String toString() {
        return "BloomFilter [timestamp=" + timestamp + ", bitSetSize=" + bitSetSize
                + ", expectedNumberOfFilterElements=" 
                + expectedNumberOfFilterElements + ", numberOfAddedElements="
                + numberOfAddedElements + ", k=" 
                + k +",deleteMapSize=" +getDeleteMapSize()+"]";
    }
}


