Using Guava, Part 2: Hashing

2020-02-21 18:50

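Why does Guava provide hashing?

Java's built-in hashCode is limited to 32 bits, and the algorithm is tightly coupled to the data it hashes, so one algorithm cannot easily be swapped for another. The built-in hashCode implementations are fast, but they tend to collide heavily.
In a simple hash table this can be worked around by rehashing, but the Guava authors consider hashCode inadequate for other uses.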
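
To make the collision problem concrete, here is a minimal sketch using only the JDK. The class name HashCodeCollisionDemo and the helper collides are illustrative names, not part of Guava or the JDK. It relies on the documented String.hashCode formula, under which "Aa" and "BB" hash to the same value, and so do any concatenations built from those two blocks.

```java
// Sketch: distinct strings that share the same 32-bit hashCode.
class HashCodeCollisionDemo {

    /** Returns true when two different strings have the same hashCode. */
    static boolean collides(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // 'A' * 31 + 'a' = 65 * 31 + 97 = 2112
        // 'B' * 31 + 'B' = 66 * 31 + 66 = 2112
        System.out.println("Aa".hashCode());
        System.out.println("BB".hashCode());
        System.out.println(collides("Aa", "BB"));
        // Concatenating colliding blocks yields more colliding strings.
        System.out.println(collides("AaAa", "BBBB"));
    }
}
```

Because such collisions are easy to construct, hashCode is a poor fit for uses like fingerprinting or Bloom filters, which is where a configurable HashFunction helps.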

HashFunction

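A HashFunction is a pure, stateless function that maps an arbitrary block of data to a fixed number of bits. Equal inputs always produce equal outputs, and unequal inputs produce unequal outputs as often as possible.
Example: common hash algorithms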

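		// Imports: com.google.common.base.Charsets, com.google.common.hash.Hashing
		String input = "hello, world";
		// Use an explicit charset instead of the platform default
		byte[] bytes = input.getBytes(Charsets.UTF_8);
		// MD5 (deprecated in recent Guava versions; shown for legacy interoperability)
		System.out.println(Hashing.md5().hashBytes(bytes).toString());
		System.out.println(Hashing.md5().hashString(input, Charsets.UTF_8));
		// SHA-256
		System.out.println(Hashing.sha256().hashBytes(bytes).toString());
		// SHA-512
		System.out.println(Hashing.sha512().hashBytes(bytes).toString());
		// CRC32
		System.out.println(Hashing.crc32().hashBytes(bytes).toString());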

Hasher

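You can ask a HashFunction for a stateful Hasher, which offers a fluent syntax for adding data to the hash and then retrieving the hash value. A Hasher accepts any primitive input, byte arrays, slices of byte arrays, character sequences, character sequences in a given charset, and so on, or any other object for which an appropriate Funnel is supplied.
Example: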


// The Funnel tells the Hasher which fields of a Person feed into the hash.
Funnel<Person> personFunnel = new Funnel<Person>() {
    @Override
    public void funnel(Person person, PrimitiveSink into) {
        into
            .putInt(person.id)
            .putString(person.firstName, Charsets.UTF_8)
            .putString(person.lastName, Charsets.UTF_8)
            .putInt(person.birthYear);
    }
};

// person is assumed to be an existing Person instance.
HashFunction hf = Hashing.md5();
HashCode hc = hf.newHasher()
    .putLong(11L)
    .putString("www", Charsets.UTF_8)
    .putObject(person, personFunnel)
    .hash();

Funnel

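A Funnel describes how an object is hashed: it decomposes the object into primitive values and feeds them to a PrimitiveSink.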

Funnel<Person> personFunnel = new Funnel<Person>() {
    @Override
    public void funnel(Person person, PrimitiveSink into) {
        into
            .putInt(person.id)
            .putString(person.firstName, Charsets.UTF_8)
            .putString(person.lastName, Charsets.UTF_8)
            .putInt(person.birthYear);
    }
};

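BloomFilter

Purpose: a Bloom filter is a probabilistic data structure that lets you test whether an object is definitely not in the filter, or has probably been added to it.
To use Guava's BloomFilter, you need to implement a Funnel that decomposes your object into primitive types.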

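How large should a Bloom filter be?

A larger filter produces fewer false positives, and a smaller one produces more.
The false positive rate can be estimated as p ≈ (1 - e^(-kn/m))^k, where n is the number of inserted elements, m is the number of bits in the filter, and k is the number of hash functions. So first settle on the number of elements n you expect to insert, then try different values of k and m.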
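
The effect of the filter size can be explored numerically. The sketch below assumes the standard approximation p ≈ (1 - e^(-kn/m))^k for a filter with m bits, k hash functions, and n inserted elements. The class name BloomFalsePositiveEstimate and the sample values of n, k, and m are illustrative choices, not taken from Guava.

```java
// Sketch: estimate the false positive rate for several filter sizes.
class BloomFalsePositiveEstimate {

    /** Approximate false positive rate (1 - e^(-k*n/m))^k. */
    static double falsePositiveRate(long n, long m, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 5000; // expected number of insertions (assumed)
        int k = 5;     // number of hash functions (assumed)
        // Doubling m repeatedly shows how quickly the estimate drops.
        for (long m : new long[] {10_000, 20_000, 40_000, 80_000}) {
            System.out.printf("m=%d bits -> p ~ %.4f%n", m, falsePositiveRate(n, m, k));
        }
    }
}
```

With n and k held fixed, the estimate shrinks as m grows, which matches the rule of thumb that a larger filter yields fewer false positives.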

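How many hash functions should a Bloom filter use?

The more hash functions you use, the slower the filter becomes and the faster it fills up. With too few, you get too many false positives.
So first fix n (and a candidate m), then compute the optimal number of hash functions from k = (m/n) ln(2).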
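
As a rough check that k = (m/n) ln(2) really is the sweet spot, the sketch below compares the estimated false positive rate for several values of k at a fixed n and m. It assumes the standard approximation (1 - e^(-kn/m))^k. The class name OptimalHashCount and the sample sizes are hypothetical.

```java
// Sketch: compare the formula's k against neighbouring values of k.
class OptimalHashCount {

    /** Optimal number of hash functions, k = (m / n) * ln(2), at least 1. */
    static int optimalK(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** Approximate false positive rate (1 - e^(-k*n/m))^k. */
    static double falsePositiveRate(long n, long m, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 5000;   // expected insertions (assumed)
        long m = 40_000; // filter size in bits (assumed)
        int best = optimalK(m, n);
        for (int k = 1; k <= 10; k++) {
            String marker = (k == best) ? "  <- k from the formula" : "";
            System.out.printf("k=%2d  p ~ %.5f%s%n", k, falsePositiveRate(n, m, k), marker);
        }
    }
}
```

Both smaller and larger values of k give a higher estimated rate than the value picked by the formula, which reflects the trade-off described above: too few functions leave too little evidence per element, while too many fill the bit array too quickly.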

How should the hash functions be chosen?

Some examples of what existing implementations use:
Chromium uses HashMix
python-bloomfilter uses cryptographic hashes
Plan9 uses a simple hash as proposed in Mitzenmacher 2005
Sdroege Bloom filter uses fnv1a
Squid uses MD5

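Time and space efficiency of Bloom filters

Adding an element and testing whether an element is in the filter both take O(k) time, which depends only on the number of hash functions.
The space required depends on the error rate you can tolerate and on the number of elements you might insert. If the number of potential values is small and bounded, a plain bit vector performs better. If you cannot estimate how many elements will be inserted, a hash table or a scalable Bloom filter is the better choice.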
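
To show why both operations cost O(k), here is a minimal Bloom filter sketch for strings built on java.util.BitSet. It is not Guava's implementation. The class SimpleBloomFilter, its FNV-1a-style helper, and the double-hashing scheme (deriving the k indexes from two base hashes) are assumptions chosen for brevity.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch for strings. Not Guava's implementation.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash functions

    SimpleBloomFilter(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Second base hash: an FNV-1a-style hash over the string's UTF-16 chars.
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 16777619;
        }
        return h;
    }

    // i-th bit index, derived from two base hashes (double hashing).
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, m);
    }

    // Sets k bits: O(k) regardless of how many items are stored.
    void put(String item) {
        int h1 = item.hashCode();
        int h2 = fnv1a(item);
        for (int i = 0; i < k; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    // Checks k bits: false means definitely absent, true means possibly present.
    boolean mightContain(String item) {
        int h1 = item.hashCode();
        int h2 = fnv1a(item);
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        System.out.println(filter.mightContain("alice")); // false: the filter is empty
        filter.put("alice");
        System.out.println(filter.mightContain("alice")); // true: all k bits were just set
    }
}
```

The memory footprint is fixed at m bits no matter how many items are added, so the only thing that degrades as the filter fills is the false positive rate.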

How to use a Bloom filter with Guava

Example code

	@Test
	public void testBloomFilter () {

		// Define the Funnel that decomposes the object
		Funnel<Person> personFunnel = new Funnel<Person>() {
			@Override
			public void funnel(Person person, PrimitiveSink into) {
				into
						.putInt(person.id)
						.putString(person.firstName, Charsets.UTF_8)
						.putString(person.lastName, Charsets.UTF_8)
						.putInt(person.birthYear);
			}
		};

		Person person = Person.builder()
				.id(1)
				.birthYear(1995)
				.firstName("Zhou")
				.lastName("Evan")
				.build();
		// expectedInsertions = 5000; this overload uses Guava's default false positive probability (3%)
		BloomFilter<Person> bloomFilter = BloomFilter.create(personFunnel, 5000);
		// Test whether person is in the Bloom filter:
		// false means it is definitely absent; true means it may be present
		boolean b = bloomFilter.mightContain(person);
		System.out.println(b); // false
		bloomFilter.put(person);
		boolean b1 = bloomFilter.mightContain(person);
		System.out.println(b1); // true
	}

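P.S.
We never specified the size of the Bloom filter or the number of hash functions, so how does Guava choose them?
Internally, two functions derive the filter size and the number of hash functions from the expected insertions and the false positive probability. Their implementations follow the formulas discussed above.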

	// Compute the size of the Bloom filter in bits
	long numBits = optimalNumOfBits(expectedInsertions, fpp);
	// Compute the number of hash functions
	int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);

	static long optimalNumOfBits(long n, double p) {
		if (p == 0) {
			p = Double.MIN_VALUE;
		}
		return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
	}

	static int optimalNumOfHashFunctions(long n, long m) {
		// (m / n) * log(2), but avoid truncation due to division!
		return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
	}
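
To see what these two functions yield for the example above, the sketch below copies them into a standalone class and evaluates them for 5000 expected insertions and a false positive probability of 0.03, the default documented for BloomFilter.create(funnel, expectedInsertions). The wrapper class BloomSizing is a hypothetical name, and the bit array Guava actually allocates may be rounded up internally.

```java
// Sketch: evaluate the sizing functions for n = 5000 and fpp = 0.03.
class BloomSizing {

    /** m = -n * ln(p) / (ln 2)^2, as in the snippet quoted in the post. */
    static long optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;
        }
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** k = (m / n) * ln 2, rounded, and at least 1. */
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long expectedInsertions = 5000;
        double fpp = 0.03;
        long numBits = optimalNumOfBits(expectedInsertions, fpp);
        int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
        System.out.println("numBits = " + numBits);
        System.out.println("numHashFunctions = " + numHashFunctions);
        System.out.println("approx. bytes = " + (numBits + 7) / 8);
    }
}
```

Under these inputs the formulas work out to 36,492 bits (about 4.5 KB) and 5 hash functions, which is far less memory than storing 5000 Person objects in a set.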

Spirits、
