Hash Variants: Bloom Filters

Original

2020-02-20 17:01

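In the previous article we implemented the basic bitmap operations, and we saw that a bitmap marks whether a given value is present. The Bloom filter discussed today is another variant of the same idea: it lets us check whether a string is present in a set of data. The difference from a bitmap is that we "insert" data into the Bloom filter (much as we would into a hash table): two hash functions produce two hash addresses for the string, and both addresses are set to 1. That marks the string as inserted (the string itself is never actually stored), so the filter now reports it as present.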

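Because we use two hash functions, every string has two hash addresses.
Once several strings have been inserted, one or both addresses of a string may coincide with addresses that belong to other strings. Deleting a string would mean clearing both of its addresses to 0, but if either address is shared, clearing it also affects the other string, which would then wrongly be reported as absent. For this reason the Bloom filter implemented here does not support deleting elements.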
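To make the argument above concrete, here is a small sketch. The bit positions 3, 7 and 12 are made up for illustration (they are not real outputs of SDBMHash or BKDRHash), and the 16-bit filter and the pair_* helper names are hypothetical.

```c
#include <stdint.h>

/* An item is represented only by its two (hypothetical) hash addresses.
   Both positions must be below 16 because the toy filter has 16 bits. */
typedef struct {
    unsigned first;
    unsigned second;
} BitPair;

/* Set both bits of an item. */
uint16_t pair_insert(uint16_t bits, BitPair p)
{
    return (uint16_t)(bits | (1u << p.first) | (1u << p.second));
}

/* Naive deletion: clear both bits, ignoring that they may be shared. */
uint16_t pair_naive_delete(uint16_t bits, BitPair p)
{
    return (uint16_t)(bits & ~((1u << p.first) | (1u << p.second)));
}

/* An item counts as present only if both of its bits are 1. */
int pair_maybe_contains(uint16_t bits, BitPair p)
{
    return ((bits >> p.first) & 1u) && ((bits >> p.second) & 1u);
}

/* Returns 1 if deleting item a makes item b look absent. */
int deletion_breaks_lookup(void)
{
    const BitPair a = {3, 7};   /* made-up addresses of string A */
    const BitPair b = {7, 12};  /* made-up addresses of string B; bit 7 is shared */
    uint16_t bits = 0;

    bits = pair_insert(bits, a);
    bits = pair_insert(bits, b);
    bits = pair_naive_delete(bits, a);  /* also clears bit 7, which B needs */
    return !pair_maybe_contains(bits, b);
}
```

Here deletion_breaks_lookup() returns 1: after A is removed, B is no longer found even though it was never deleted.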

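Advantages:

High space efficiency and fast queries. The filter occupies a fixed number of bits no matter how long the inserted strings are, and an insert or a query only evaluates a fixed number of hash functions, so apart from hashing the string itself its cost is O(1) and does not grow with the number of elements already inserted.
The hash functions are independent of one another, which makes them easy to evaluate in parallel in hardware.
A Bloom filter does not need to store the elements themselves, only bits that record their state, which is an advantage where confidentiality requirements are very strict.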

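Disadvantages:

The false positive rate rises as more elements are stored: a query may report a string as present although it was never inserted, because both of its bits were set by other strings. If the number of elements is small, on the other hand, an ordinary hash table already solves the problem.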
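How quickly the false positive rate grows can be estimated with the usual textbook formula (1 - (1 - 1/m)^(k*n))^k, where m is the number of bits, n the number of inserted strings and k the number of hash functions. The sketch below evaluates it; it assumes the hash functions spread strings uniformly and independently over the bits, which real hash functions only approximate, and the function name is hypothetical.

```c
#include <stddef.h>

/*
 * Estimated false positive rate under the idealized assumption of
 * uniform, independent hash functions.
 *   m_bits:   number of bits in the filter
 *   n_items:  number of inserted strings
 *   k_hashes: number of hash functions
 */
double bloom_false_positive_estimate(size_t m_bits, size_t n_items, size_t k_hashes)
{
    double zero_prob = 1.0;  /* chance that one given bit is still 0 */
    double one_prob;         /* chance that one given bit is 1 */
    double rate = 1.0;
    size_t i;

    if (m_bits == 0)
    {
        /* A filter without bits cannot rule anything out. */
        return 1.0;
    }
    /* Each of the n*k bit-setting operations misses a given bit
       with probability 1 - 1/m. */
    for (i = 0; i < n_items * k_hashes; i++)
    {
        zero_prob *= 1.0 - 1.0 / (double)m_bits;
    }
    one_prob = 1.0 - zero_prob;
    /* A false positive needs all k probed bits to be 1. */
    for (i = 0; i < k_hashes; i++)
    {
        rate *= one_prob;
    }
    return rate;
}
```

With the sizes used in this article (1024 bits, 2 hash functions), the estimate is roughly 0.0001 for 5 strings but about 0.39 for 500 strings, which is why the bit array has to be sized for the expected number of elements.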

Code implementation:

//Contents of bloom_filter.h:


#pragma once
#include<stddef.h>
#include"bit_map.h"

//Type of the hash functions used by the Bloom filter:
//each one converts a string into an index.
//The return type is size_t so that it matches SDBMHash and BKDRHash.
typedef size_t (*BloomHash)(const char*);
#define BloomHashCount 2
#define BitmapMaxCapacity 1024

typedef struct BloomFilter
{
    Bitmap bm;
    BloomHash bloom_hash[BloomHashCount];
}BloomFilter;

//Initialize
void BloomFilterInit(BloomFilter *bf);
//Destroy
void BloomFilterDestroy(BloomFilter *bf);
//Insert a string
void BloomFilterInsert(BloomFilter *bf,const char *str);
//Check whether a string is present
int BloomFilterIsExist(BloomFilter *bf,const char *str);

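//Contents of bloom_filter.c:
#include<stdio.h>
#include"bit_map.h"
#include"bloom_filter.h"
#include"bloom_hash.h"
//Initialize
void BloomFilterInit(BloomFilter *bf)
{
    if(bf == NULL)
    {
        //Invalid input
        return;
    }
    //Initialize the bitmap inside the Bloom filter.
    //Its capacity matches the modulus used when computing hash addresses.
    BitmapInit(&bf->bm,BitmapMaxCapacity);
    //Set up the two hash functions
    bf->bloom_hash[0] = SDBMHash;
    bf->bloom_hash[1] = BKDRHash;
    return;
}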
//Destroy
void BloomFilterDestroy(BloomFilter *bf)
{
    if(bf == NULL)
    {
        //Invalid input
        return;
    }
    //Destroy the bitmap
    BitmapDestroy(&bf->bm);
    //Reset the two hash function pointers to NULL
    bf->bloom_hash[0] = NULL;
    bf->bloom_hash[1] = NULL;
    return;
}
//Insert a string
void BloomFilterInsert(BloomFilter *bf,const char *str)
{
    if(bf == NULL || str == NULL)
    {
        //Invalid input
        return;
    }
    size_t i = 0;
    for(;i < BloomHashCount;i++)
    {
        //Each pass of the loop computes one hash address with one hash function
        size_t hash = bf->bloom_hash[i](str)%BitmapMaxCapacity;
        //Set the bit at that hash address to 1
        BitmapSet(&bf->bm,hash);
    }
    return;
}
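//Check whether a string is present
int BloomFilterIsExist(BloomFilter *bf,const char *str)
{
    if(bf == NULL || str == NULL)
    {
        //Invalid input
        return 0;
    }
    size_t i = 0;
    for(;i < BloomHashCount;i++)
    {
        //Each pass of the loop computes one hash address with one hash function
        size_t hash = bf->bloom_hash[i](str)%BitmapMaxCapacity;
        //Check whether the bit at that hash address is 1
        int ret = BitmapTest(&bf->bm,hash);
        //As soon as one of the addresses holds 0,
        //the string is definitely not present
        if(ret == 0)
        {
            return 0;
        }
    }
    //Reaching this point means both hash addresses hold 1,
    //so the string is reported as present (this may be a false positive)
    return 1;
}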
//A quick test
void Test()
{
    BloomFilter bf;
    //Initialize
    BloomFilterInit(&bf);
    //Insert 5 strings
    BloomFilterInsert(&bf,"hello world");
    BloomFilterInsert(&bf,"hello today");
    BloomFilterInsert(&bf,"hi everybody");
    BloomFilterInsert(&bf,"how are you?");
    BloomFilterInsert(&bf,"I am fine!");
    //Check whether particular strings are present
    int ret1 = BloomFilterIsExist(&bf,"how are you?");
    printf("\nexpected ret1 = 1,actual ret1 = %d\n",ret1);
    //This string was never inserted, so 0 is expected unless a false positive occurs
    int ret2 = BloomFilterIsExist(&bf,"where are you?");
    printf("expected ret2 = 0,actual ret2 = %d\n\n",ret2);
    //Release the bitmap
    BloomFilterDestroy(&bf);
}

The Bloom filter implementation relies on the bitmap code written earlier; for details, see the previous article: Bitmap.
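bit_map.h itself is not reproduced in this article. For readers who want to compile the Bloom filter code on its own, the following is a minimal stand-in that provides only the four calls used above (BitmapInit, BitmapDestroy, BitmapSet, BitmapTest). The struct layout, the size_t parameters and the bounds checks are assumptions made for this sketch, not the code from the previous article.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: one bit per index, packed into 64-bit words. */
typedef struct Bitmap
{
    uint64_t *data;
    size_t capacity;  /* number of addressable bits */
} Bitmap;

void BitmapInit(Bitmap *bm, size_t capacity)
{
    if (bm == NULL)
    {
        return;
    }
    /* calloc zero-fills the words, so every bit starts as 0 */
    bm->data = calloc(capacity / 64 + 1, sizeof(uint64_t));
    bm->capacity = (bm->data != NULL) ? capacity : 0;
}

void BitmapDestroy(Bitmap *bm)
{
    if (bm == NULL)
    {
        return;
    }
    free(bm->data);
    bm->data = NULL;
    bm->capacity = 0;
}

/* Set the bit at index to 1; out-of-range indexes are ignored. */
void BitmapSet(Bitmap *bm, size_t index)
{
    if (bm == NULL || index >= bm->capacity)
    {
        return;
    }
    bm->data[index / 64] |= (uint64_t)1 << (index % 64);
}

/* Return 1 if the bit at index is 1, otherwise 0. */
int BitmapTest(const Bitmap *bm, size_t index)
{
    if (bm == NULL || index >= bm->capacity)
    {
        return 0;
    }
    return (int)((bm->data[index / 64] >> (index % 64)) & 1u);
}
```

Packing the bits into 64-bit words keeps a 1024-bit filter at a little over 128 bytes of storage.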
The hash functions used in the Bloom filter implementation are listed below; they can also be found online.

//Contents of bloom_hash.h:
#pragma once

#include<stddef.h>

size_t SDBMHash(const char *str);
size_t BKDRHash(const char *str);


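//Contents of bloom_hash.c:
#include"bloom_hash.h"

size_t SDBMHash(const char *str)
{
    size_t hash = 0;
    size_t ch;
    //Read each byte as unsigned char so that bytes >= 0x80 are not sign-extended
    while((ch = (size_t)(unsigned char)*str++) != 0)
    {
        hash = 65599*hash+ch;
    }
    return hash;
}
size_t BKDRHash(const char *str)
{
    size_t hash = 0;
    size_t ch;
    //Read each byte as unsigned char so that bytes >= 0x80 are not sign-extended
    while((ch = (size_t)(unsigned char)*str++) != 0)
    {
        hash = hash*131+ch;
    }
    return hash;
}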

小心眼兒貓
