BloomFilter, Count-Min Sketch算法

1.bloom filter

布隆過濾器（英語：Bloom Filter）是1970年由布隆提出的。它實際上是一個很長的二進制向量和一系列隨機映射函數。布隆過濾器可以用於檢索一個元素是否在一個集合中。它的優點是空間效率和查詢時間都遠遠超過一般的算法，缺點是有一定的誤識別率和刪除困難。

2.兩種錯誤率FP/FN

FP = A false positive error, or in short a false positive, commonly called a “false alarm”, is a result that indicates a given condition exists, when it does not.

FN = A false negative error, or in short a false negative, is a test result that indicates that a condition does not hold, while in fact it does.

在bloom filter中，FP是集合裏沒有某元素，查找結果是有該元素。FN是集合中有某元素，查找結果是沒有該元素。

在bloom filter中，FP 會隨着 BF 中插入元素的數量而增加——極限情況就是所有 bit 都爲 1，這時任何元素都會被認爲在集合裏。而FN則爲0，如果某元素在集合中，則一定能找到該元素。

3.FP的推導

假設哈希函數以相等的概率選擇位數組中的位置。如果 m 是位數組中的比特數，則在插入元素期間某一特定比特位不被某個哈希函數設置爲 1 的概率是：

$1- \frac{1}{m}$
假設有k個哈希函數，則通過 k 個哈希函數都未將該位設置爲 1 的概率是
$(1- \frac{1}{m}) ^ k$

那麼，如果我們插入了 n 個元素，某個位仍然爲 0 的概率就是：

$(1- \frac{1}{m}) ^ {nk}$

因此這一位的值爲 1 的概率就是：

$1 - (1- \frac{1}{m}) ^ {nk}$

那麼，BF 的誤判率是怎麼得出的？前面提到，我們主要關注 FP，即集合裏沒有某元素，查找結果是有該元素。

現在我們要判斷一個元素是否在集合中，假設這個元素本不在集合中，理論上來講，經過 k 個哈希函數計算後得到的位數組的 k 個位置的值都應該是 0，如果發生了誤判，即這 k 個位置的值都爲 1，這個概率如下：

$(1 - (1- \frac{1}{m}) ^ {kn}) ^ k = (1- e^{-\frac{kn}{m}})^k$

4.stream lib中的bloom filter

    public void test2() {
        Filter filter = new BloomFilter(1000, 0.01);
        filter.add("abc");
        filter.add("def");
        filter.add("g");
        boolean result = filter.isPresent("123");
        System.out.println("result is: " + result);
    }

Filter filter = new BloomFilter(1000, 0.01); 1000是在構建BitSet時跟bit數組位數有關的參數，0.01是錯誤率。

5.Count-Min Sketch算法

CountMinSketch 是一種“速寫”算法，能夠使用較小的空間勾勒出數據集內各類事件的頻次。比如，我們可以統計出當前最熱門的推特內容，或是計算網站訪問量最大的頁面。當然，這一算法同樣會犧牲一定的準確性。

CountMinSketch算法的流程：
1.選定d個hash函數，開一個 dxm 的二維整數數組作爲哈希表
2.於每個元素，分別使用d個hash函數計算相應的哈希值，並對m取餘，然後在對應的位置上增1，二維數組中的每個整數稱爲sketch
3.要查詢某個元素的頻率時，只需要取出d個sketch, 返回最小的那一個（其實d個sketch都是該元素的近似頻率，返回任意一個都可以，該算法選擇最小的那個）

6.Count-Mean-Min Sketch

Count-Min Sketch算法對於低頻的元素，結果不太準確，主要是因爲hash衝突比較嚴重，產生了噪音，例如當m=20時，有1000個數hash到這個20桶，平均每個桶會收到50個數，這50個數的頻率重疊在一塊了。Count-Mean-Min Sketch 算法做了如下改進：
1.來了一個查詢，按照 Count-Min Sketch的正常流程，取出它的d個sketch
2.對於每個hash函數，估算出一個噪音，噪音等於該行所有整數(除了被查詢的這個元素)的平均值
3.用該行的sketch 減去該行的噪音，作爲真正的sketch
4.返回d個sketch的中位數

class CountMeanMinSketch {
    // initialization and addition procedures as in CountMinSketch
    // n is total number of added elements
    long estimateFrequency(value) {
        long e[] = new long[d]
        for(i = 0; i < d; i++) {
            sketchCounter = estimators[i][ hash(value, i) ]
            noiseEstimation = (n - sketchCounter) / (m - 1)
            e[i] = sketchCounter – noiseEstimator
        }
        return median(e)
    }
}

Count-Mean-Min Sketch算法能夠顯著的改善在長尾數據上的精確度。

參考文獻
1.https://zh.wikipedia.org/wiki/%E5%B8%83%E9%9A%86%E8%BF%87%E6%BB%A4%E5%99%A8
2.http://shzhangji.com/cnblogs/2017/08/27/an-introduction-to-stream-lib-the-stream-processing-utilities/
3.https://soulmachine.gitbooks.io/system-design/content/cn/bigdata/frequency-estimation.html

BloomFilter, Count-Min Sketch算法

1.bloom filter

2.兩種錯誤率FP/FN

3.FP的推導

4.stream lib中的bloom filter

5.Count-Min Sketch算法

6.Count-Mean-Min Sketch

小白都能理解的FTRL

樹算法系列之四:XGBoost

Redis常用數據結構

樹算法系列之一:CART迴歸樹

HashMap簡單小結

Mac下配置sublime實現LaTeX

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結