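Bloom Filter

Data structures and algorithms notes: index of the 戀上數據結構 notes

Motivating the Bloom filter (testing whether an element exists)

Question: if you frequently need to check whether an element exists, how would you do it?

The obvious answer is a hash table (HashSet, HashMap), looking the element up as a key.
Time complexity: O(1), but space utilization is poor and a relatively large amount of memory is needed.

Suppose you are writing a web crawler that has to crawl 1 billion websites. To avoid crawling the same site twice, how do you tell whether a site has already been crawled?

Clearly, HashSet and HashMap are not a great choice here.

Is there a data structure with low time complexity that also uses little memory?

Yes: the Bloom filter.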

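Introducing the Bloom filter (a probabilistic data structure)

Proposed by Bloom in 1970.

It is a space-efficient probabilistic data structure that can tell you one of two things: an element definitely does not exist, or it possibly exists.

Pros and cons:

Advantage: space efficiency and query time are far better than those of general-purpose approaches
Drawbacks: a certain false-positive rate, and elements are hard to delete

In essence, a Bloom filter is a very long bit vector plus a set of random mapping functions (hash functions).

Common applications:

Web page blacklist systems
Spam filtering systems
URL deduplication in crawlers
Mitigating cache penetration (backend development)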

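How the Bloom filter works (bits + hash functions)

Suppose a Bloom filter consists of 20 bits and 3 hash functions. Each hash function maps an element to one index position.

The Bloom filter has two basic operations: add and query.

Add an element: set the bit at every index produced by the hash functions to 1
Query whether an element exists:
If the bit at any hash function's index is not 1, the element does not exist (100% accurate)
If the bits at all hash functions' indices are 1, the element is reported as existing (with a certain false-positive rate)

Both add and query take O(k) time, where k is the number of hash functions
Space complexity is O(m), where m is the number of bits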
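
The 20-bit, 3-hash setup above can be sketched in a few lines. This is only a toy under stated assumptions: the class name TinyBloom, the use of java.util.BitSet, and the way the three indices are derived from hashCode() are all illustrative choices, not part of the original note.

```java
import java.util.BitSet;

// Toy Bloom filter: 20 bits, 3 hash functions (hypothetical class, for illustration only).
class TinyBloom {
    private static final int M = 20; // number of bits
    private static final int K = 3;  // number of hash functions
    private final BitSet bits = new BitSet(M);

    // The i-th "hash function", derived from hashCode() by double hashing (an assumption).
    private static int index(Object value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1;              // forced odd so the step is never 0
        return Math.floorMod(h1 + i * h2, M);  // always in [0, M)
    }

    // Add: set every index produced by the hash functions to 1.
    void add(Object value) {
        for (int i = 0; i < K; i++) {
            bits.set(index(value, i));
        }
    }

    // Query: false means definitely absent, true means possibly present.
    boolean mightContain(Object value) {
        for (int i = 0; i < K; i++) {
            if (!bits.get(index(value, i))) return false;
        }
        return true;
    }
}
```

With only 20 bits, false positives show up after very few insertions, which makes the "possibly present" answer easy to observe.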

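False-positive rate of a Bloom filter (formulas)

The false-positive rate p depends on 3 factors: the number of bits m, the number of hash functions k, and the data size n.

Formula for the false-positive rate p (the standard approximation):

$$p \approx \left(1 - e^{-kn/m}\right)^{k}$$

Given the false-positive rate p and the data size n, solve for the number of bits m and the number of hash functions k:

Number of bits m:

$$m = -\frac{n \ln p}{(\ln 2)^{2}}$$

Number of hash functions k:

$$k = \frac{m}{n} \ln 2 = -\log_{2} p$$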
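
As a worked example of these relationships, the sketch below computes m and k for n = 100 million and p = 1%, then plugs them back into the standard approximation for p. The class and method names are hypothetical, and truncating to whole numbers is an assumption chosen to match how the note's constructor casts its results.

```java
// Hypothetical helper class; the names are illustrative only.
class BloomSizing {
    // m = -n ln p / (ln 2)^2, truncated to a whole number of bits
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) (-(n * Math.log(p)) / (ln2 * ln2));
    }

    // k = (m / n) ln 2, truncated to a whole number of hash functions
    static int optimalHashes(long n, long m) {
        return (int) (m * Math.log(2) / n);
    }

    // p ~= (1 - e^(-kn/m))^k
    static double falsePositiveRate(long n, long m, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 100_000_000L;
        long m = optimalBits(n, 0.01);
        int k = optimalHashes(n, m);
        System.out.println(m + " bits, " + k + " hash functions");
        System.out.println("theoretical p with these values: " + falsePositiveRate(n, m, k));
    }
}
```

Because k is truncated from about 6.64 to 6, the theoretical rate comes out slightly above the 1% target rather than exactly on it.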

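Implementing the Bloom filter

Guava: Google Core Libraries For Java (Google's core libraries include a Java implementation)

https://mvnrepository.com/artifact/com.google.guava/guava

The Bloom filter has two basic operations: adding an element and querying whether an element exists. The class built in this note exposes them as follows (Guava's own BloomFilter names the query method mightContain):

/**
 * Add an element
 * @return true if at least one bit changed
 */
boolean put(T value);

/**
 * Query whether an element exists
 * @return false means it definitely does not exist, true means it may exist
 */
boolean contains(T value);

Constructing the Bloom filter

From the formulas above, a Bloom filter must have 2 member variables:

bitSize: the length of the bit vector (the total number of bits)
hashSize: the number of hash functions

It also needs a container to store the bits:

bits: a long[] is used here, because one long holds 64 bits (an int[] or similar array would also work)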

package com.mj;

public class BloomFilter<T> {
	/**
	 * Length of the bit vector (total number of bits)
	 */
	private int bitSize;
	/**
	 * The bit vector
	 */
	private long[] bits;
	/**
	 * Number of hash functions
	 */
	private int hashSize;
	
	/**
	 * Constructs the Bloom filter
	 * @param n data size
	 * @param p false-positive rate, in the range (0, 1)
	 */
	public BloomFilter(int n, double p) {
		if (n <= 0 || p <= 0 || p >= 1) { // reject invalid input
			throw new IllegalArgumentException("wrong n or p");
		}
		
		// Derive the parameters from the formulas
		double ln2 = Math.log(2);
		// Length of the bit vector (the result must fit in an int)
		bitSize = (int) (- (n * Math.log(p)) / (ln2 * ln2));
		hashSize = (int) (bitSize * ln2 / n);
		// Length of the bits array
		bits = new long[(bitSize + Long.SIZE - 1) / Long.SIZE]; // same as the page-count formula
		// (64 + 64 - 1) / 64 = 127 / 64 = 1
		// (128 + 64 - 1) / 64 = 2
		// (130 + 64 - 1) / 64 = 3
		
		// The pagination problem:
		// each page shows 100 records, pageSize = 100
		// there are 999999 records in total, n = 999999
		// number of pages: pageCount = (n + pageSize - 1) / pageSize
	}
	
}
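
The page-count trick used to size the long[] is plain ceiling division. A minimal sketch, with hypothetical helper names, that also shows how a bit index splits into an array slot and an offset inside that slot:

```java
// Hypothetical helpers illustrating how bits map onto a long[].
class BitWords {
    // Number of 64-bit longs needed to hold bitCount bits (ceiling division).
    // Assumes bitCount + 63 does not overflow an int.
    static int wordsFor(int bitCount) {
        return (bitCount + Long.SIZE - 1) / Long.SIZE;
    }

    // Which long in the array holds the bit at position index.
    static int wordIndex(int index) {
        return index / Long.SIZE;
    }

    // Position of that bit inside its long, in the range 0..63.
    static int bitOffset(int index) {
        return index % Long.SIZE;
    }

    public static void main(String[] args) {
        System.out.println(wordsFor(64));   // 1
        System.out.println(wordsFor(128));  // 2
        System.out.println(wordsFor(130));  // 3
        System.out.println(wordIndex(130) + ", " + bitOffset(130)); // 2, 2
    }
}
```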

Quick test: assume 100 million elements and a required false-positive rate of 1%.
This gives 6 hash functions and 958505837 bits.

public static void main(String[] args) {
	BloomFilter<Integer> bf = new BloomFilter<>(100_000_000, 0.01);
	// number of hash functions: 6
	// number of bits: 958505837
}

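Bloom filter - adding an element

Setting the bit at a given position to 1

For example, to set bit 2 of 100000 to 1, compute 100000 | 000100, i.e. 100000 | (1 << 2):

	100000
| 	000100   ==  (1 << 2)
	------------------
	100100

So setting bit index of value to 1 is value | (1 << index). Because each slot here is a 64-bit long, the mask has to be built from the long literal 1L; shifting the int 1 only uses the low 5 bits of the shift distance, so offsets 32 to 63 would land on the wrong bit.

/**
 * Set the bit at position index to 1
 * @return true if the bit was 0 before, i.e. it changed
 */
private boolean set(int index){
	// the long that holds this bit
	long value = bits[index / Long.SIZE];
	// 1L, not 1: the mask must be a 64-bit value
	long bitValue = 1L << (index % Long.SIZE);
	bits[index / Long.SIZE] = value | bitValue;
	return (value & bitValue) == 0;
}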
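
One pitfall when packing bits into a long[] is the width of the shift. In Java, shifting an int uses only the low 5 bits of the shift distance, while shifting a long uses the low 6 bits, so the mask for offsets 32 to 63 is only correct when it starts from a long. The class and method names below are hypothetical:

```java
// Hypothetical demo of int versus long shift width.
class ShiftWidthDemo {
    // Shift performed on an int, then widened to long.
    static long intMask(int offset) {
        return 1 << offset;
    }

    // Shift performed on a long.
    static long longMask(int offset) {
        return 1L << offset;
    }

    public static void main(String[] args) {
        System.out.println(intMask(40));   // 256, because 40 & 31 == 8
        System.out.println(longMask(40));  // 1099511627776, i.e. 2^40
        System.out.println(intMask(31));   // -2147483648, sign-extended when widened
        System.out.println(longMask(31));  // 2147483648
    }
}
```

An int mask therefore both aliases high offsets onto low ones and, at offset 31, sets the whole upper half of the long when it is widened.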

With that in place, adding an element to the Bloom filter can be implemented:

/**
 * Add an element
 * @return true if at least one bit changed
 */
public boolean put(T value) {
	nullCheck(value);
	
	// derive 2 integers from value
	int hash1 = value.hashCode();
	int hash2 = hash1 >>> 16;

	boolean result = false;
	for (int i = 1; i <= hashSize; i++) {
		int combinedHash = hash1 + (i * hash2);
		if (combinedHash < 0) {
			combinedHash = ~combinedHash;
		}	
		
		// map the combined hash to a bit index
		int index = combinedHash % bitSize;
		// set the bit at index to 1
		if (set(index)) result = true;
		//   101010101010010101
		// | 000000010000000000	   1L << index
		//   101010111010010101
	}
	return result;
}
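
The index generation inside put can be looked at in isolation. The sketch below repeats the same arithmetic (hash1 from hashCode(), hash2 from its upper 16 bits, then hash1 + i * hash2) in a standalone helper; the class name, method name, and the sample URL are hypothetical.

```java
import java.util.Arrays;

// Hypothetical standalone version of the index generation used by put.
class DoubleHashing {
    // Returns the k bit indices for value in a vector of m bits.
    static int[] indices(Object value, int k, int m) {
        int hash1 = value.hashCode();
        int hash2 = hash1 >>> 16;
        int[] result = new int[k];
        for (int i = 1; i <= k; i++) {
            int combined = hash1 + i * hash2;
            if (combined < 0) {
                combined = ~combined; // flip to a non-negative value
            }
            result[i - 1] = combined % m;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(indices("https://example.com", 3, 20)));
        System.out.println(Arrays.toString(indices(5, 3, 20))); // [5, 5, 5]
    }
}
```

The second call shows a limitation of this scheme: for an Integer key below 65536, hashCode() is the value itself and hash2 is 0, so all k indices coincide and the key effectively gets a single hash function.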

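Bloom filter - testing whether an element exists

Reading the bit at a given position

For example, to check whether bit 2 of 10101111 is 1, compute 10101111 & 00000100, i.e. 10101111 & (1 << 2). The result is 0 only when the bit at that position is 0; otherwise it is non-zero:

	10101111
& 	00000100	== 	(1 << 2)
	--------------
	00000100 != 0, so the bit at index is 1

So reading bit index of value is value & (1 << index). As with setting a bit, the mask for a 64-bit long must be built from 1L.

/**
 * Read the bit at position index
 * @param index position of the bit in the bit vector
 * @return true means 1, false means 0
 */
private boolean get(int index) {
	// the long that holds this bit
	long value = bits[index / Long.SIZE];
	return (value & (1L << (index % Long.SIZE))) != 0;
}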
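
The bit test can be checked against the worked example in the text with a small standalone helper. The names here are hypothetical, and the helper takes the long directly instead of reading it from the bits array:

```java
// Hypothetical standalone bit test on a single long.
class BitTest {
    // True if the bit at position index (0 = lowest bit) of value is 1.
    static boolean isSet(long value, int index) {
        return (value & (1L << index)) != 0;
    }

    public static void main(String[] args) {
        long value = 0b10101111L;
        System.out.println(isSet(value, 2)); // true:  10101111 & 00000100 != 0
        System.out.println(isSet(value, 4)); // false: 10101111 & 00010000 == 0
    }
}
```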

With that in place, testing whether an element exists can be implemented:

/**
 * Test whether an element exists
 * @return false means it definitely does not exist, true means it may exist
 */
public boolean contains(T value) {
	nullCheck(value);
	// derive 2 integers from value
	int hash1 = value.hashCode();
	int hash2 = hash1 >>> 16;
	
	for (int i = 1; i <= hashSize; i++) {
		int combinedHash = hash1 + (i * hash2);
		if (combinedHash < 0) {
			combinedHash = ~combinedHash;
		}	
		// map the combined hash to a bit index
		int index = combinedHash % bitSize;
		// if the bit at index is 0, the element is definitely absent
		if (!get(index)) return false;
		//   101010111010010101
		// & 000000010000000000	   1L << index
		//   000000010000000000 != 0
	}
	return true;
}

Bloom filter - complete code

package com.mj;

public class BloomFilter<T> {
	/**
	 * Length of the bit vector (total number of bits)
	 */
	private int bitSize;
	/**
	 * The bit vector
	 */
	private long[] bits;
	/**
	 * Number of hash functions
	 */
	private int hashSize;
	
	/**
	 * Constructs the Bloom filter
	 * @param n data size
	 * @param p false-positive rate, in the range (0, 1)
	 */
	public BloomFilter(int n, double p) {
		if (n <= 0 || p <= 0 || p >= 1) { // reject invalid input
			throw new IllegalArgumentException("wrong n or p");
		}
		
		// Derive the parameters from the formulas
		double ln2 = Math.log(2);
		// Length of the bit vector (the result must fit in an int)
		bitSize = (int) (- (n * Math.log(p)) / (ln2 * ln2));
		hashSize = (int) (bitSize * ln2 / n);
		// Length of the bits array
		bits = new long[(bitSize + Long.SIZE - 1) / Long.SIZE]; // same as the page-count formula
		// (64 + 64 - 1) / 64 = 127 / 64 = 1
		// (128 + 64 - 1) / 64 = 2
		// (130 + 64 - 1) / 64 = 3
		
		// The pagination problem:
		// each page shows 100 records, pageSize = 100
		// there are 999999 records in total, n = 999999
		// number of pages: pageCount = (n + pageSize - 1) / pageSize
	}
	
	/**
	 * Add an element
	 * @return true if at least one bit changed
	 */
	public boolean put(T value) {
		nullCheck(value);
		
		// derive 2 integers from value
		int hash1 = value.hashCode();
		int hash2 = hash1 >>> 16;

		boolean result = false;
		for (int i = 1; i <= hashSize; i++) {
			int combinedHash = hash1 + (i * hash2);
			if (combinedHash < 0) {
				combinedHash = ~combinedHash;
			}	
			// map the combined hash to a bit index
			int index = combinedHash % bitSize;
			// set the bit at index to 1
			if (set(index)) result = true;
			//   101010101010010101
			// | 000000010000000000	   1L << index
			//   101010111010010101
		}
		return result;
	}
	
	/**
	 * Test whether an element exists
	 * @return false means it definitely does not exist, true means it may exist
	 */
	public boolean contains(T value) {
		nullCheck(value);
		// derive 2 integers from value
		int hash1 = value.hashCode();
		int hash2 = hash1 >>> 16;
		
		for (int i = 1; i <= hashSize; i++) {
			int combinedHash = hash1 + (i * hash2);
			if (combinedHash < 0) {
				combinedHash = ~combinedHash;
			}	
			// map the combined hash to a bit index
			int index = combinedHash % bitSize;
			// if the bit at index is 0, the element is definitely absent
			if (!get(index)) return false;
			//   101010111010010101
			// & 000000010000000000	   1L << index
			//   000000010000000000 != 0
		}
		return true;
	}
	
	/**
	 * Set the bit at position index to 1
	 * @return true if the bit was 0 before, i.e. it changed
	 */
	private boolean set(int index){
		// the long that holds this bit
		long value = bits[index / Long.SIZE];
		// 1L, not 1: the mask must be a 64-bit value
		long bitValue = 1L << (index % Long.SIZE);
		bits[index / Long.SIZE] = value | bitValue;
		return (value & bitValue) == 0;
		/*
		 *    100000
		 *  | 000100   1 << 2
		 *  ---------
		 *    100100
		 */
	}
	
	/**
	 * Read the bit at position index
	 * @param index position of the bit in the bit vector
	 * @return true means 1, false means 0
	 */
	private boolean get(int index) {
		// the long that holds this bit
		long value = bits[index / Long.SIZE];
		return (value & (1L << (index % Long.SIZE))) != 0;
		/*
		 *   10101111
		 * & 00000100
		 * -----------
		 *   00000100 != 0, so the bit at index is 1
		 */
	}
	
	private void nullCheck(T value) {
		if (value == null) {
			throw new IllegalArgumentException("Value must not be null.");
		}
	}
	
}

Test:

public static void main(String[] args) {
	BloomFilter<Integer> bf = new BloomFilter<>(1_000_000, 0.01);
	
	for (int i = 1; i <= 1_000_000; i++) {
		bf.put(i);
	}

//		for (int i = 1; i <= 1_000_000; i++) {
//			System.out.println(bf.contains(i));
//		}
	
	// count false positives among 1,000,000 values that were never added
	int count = 0;
	for (int i = 1_000_001; i <= 2_000_000; i++) {
		if (bf.contains(i)){
			count++;
		}
	}
	System.out.println(count); // number of false positives (the original post reported 20 for its run)
}
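
For a measurement that does not depend on the BloomFilter class, the same experiment can be sketched in a self-contained form. Everything here is an assumption made for illustration: the class name, the use of java.util.BitSet, the sizes (n = 10,000, with m and k taken from the formulas for p = 1%), and the bit mixer, which is the MurmurHash3 finalizer rather than the hashCode()-based scheme used in the note.

```java
import java.util.BitSet;

// Self-contained sketch (hypothetical class): measures an observed false-positive rate.
class FalsePositiveCheck {
    static final int N = 10_000;  // elements inserted
    static final int M = 95_850;  // bits: -n ln p / (ln 2)^2 for p = 0.01, truncated
    static final int K = 6;       // hash functions: (m / n) ln 2, truncated

    // MurmurHash3 32-bit finalizer, used to spread sequential int keys.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // i-th index for key: double hashing over two mixed values.
    static int index(int key, int i) {
        int h1 = mix(key);
        int h2 = mix(key ^ 0x9e3779b9) | 1; // odd, so the step is never 0
        return Math.floorMod(h1 + i * h2, M);
    }

    static void add(BitSet bits, int key) {
        for (int i = 0; i < K; i++) bits.set(index(key, i));
    }

    static boolean mightContain(BitSet bits, int key) {
        for (int i = 0; i < K; i++) {
            if (!bits.get(index(key, i))) return false;
        }
        return true;
    }

    // Inserts 1..N, then probes N+1..2N, none of which were inserted.
    static double measure() {
        BitSet bits = new BitSet(M);
        for (int key = 1; key <= N; key++) add(bits, key);
        int falsePositives = 0;
        for (int key = N + 1; key <= 2 * N; key++) {
            if (mightContain(bits, key)) falsePositives++;
        }
        return (double) falsePositives / N;
    }

    public static void main(String[] args) {
        System.out.println("observed false-positive rate: " + measure());
    }
}
```

The observed rate depends heavily on how well the hash functions spread the keys, so results from this sketch and from the class in the note should not be expected to match.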

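The 1-billion-website crawler problem

Back to the opening question: suppose you are writing a web crawler that has to crawl 1 billion websites. To avoid crawling the same site twice, how do you tell whether a site has already been crawled?

The overall shape of the code is:

// array of URLs
String[] urls = {};
BloomFilter<String> bf = new BloomFilter<>(1_000_000_000, 0.01);


for (String url : urls) {
	if (bf.contains(url)) continue;
	// crawl this url
	// ......
	
	// after crawling the url, put it into the BloomFilter
	bf.put(url);
}

Note that for n = 1 billion and p = 1% the formula asks for about 9.59 billion bits, which exceeds the int range used by bitSize in the class above. At this scale bitSize and the bit index would have to be long, or the URLs would have to be split across several filters.

By the Bloom filter's principle: the indices produced by the hash functions locate the corresponding bits; if they are all 1 the element is reported as present (false positives are possible), otherwise it is absent (100% accurate).

With bf.contains(url), a URL that has already been crawled is in the filter and always returns true, so nothing is crawled twice. However, a URL that has not been crawled can still make bf.contains(url) return true because of hash collisions, so some URLs may be skipped.

In short, this version never crawls a URL twice, but it may miss some URLs.

The following variant also works, with the same guarantee: no repeated crawling, but some URLs may be missed.

String[] urls = {};
BloomFilter<String> bf = new BloomFilter<>(1_000_000_000, 0.01);

for (String url : urls) {
	if (!bf.put(url)) continue;
	// crawl this url
	// ......
}

When bf.put(url) meets an element that is already in the filter, it always returns false, so nothing is crawled twice. However, for a URL that has not been crawled, hash collisions can still make bf.put(url) return false, so some URLs may be missed.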
