Trie (Word Tree): Implementing Sensitive-Word Filtering and Word-Frequency Counting


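A few days ago I read a well-written article on a sensitive-word filtering algorithm, so I followed its approach and implemented it myself. The algorithm turned out to be quite efficient.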

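1. Introduction to the Trie

The approach is built on a trie, so let's first look at what that is. A trie is also called a word tree or a dictionary tree. Its defining feature is that strings share their common prefixes, which saves space.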

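For example, the strings "abc" and "abd" form a trie with a single path root → a → b, where node b has two children, c and d.

The root node stores no data, and each complete path from the root down a branch represents one complete string. Since abc and abd share the prefix ab, they can share the nodes a and b. If we then insert abf, node b simply gains a third child, f.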
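To make the prefix sharing concrete, here is a minimal sketch (C++14 or later) that builds such a tree and counts its nodes. The names `TrieNode`, `insert_word`, and `count_nodes` are made up for this illustration and are not part of the implementation in section 3; the sketch also assumes words contain only the lowercase letters a to z.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical minimal node for this sketch: one child slot per lowercase letter.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> children;
    bool is_terminal = false;  // true if a complete word ends at this node
};

// Inserts a word made of 'a'..'z'; any other input is ignored to keep the sketch short.
void insert_word(TrieNode& root, const std::string& word) {
    for (char ch : word) {
        if (ch < 'a' || ch > 'z') return;
    }
    TrieNode* cur = &root;
    for (char ch : word) {
        std::unique_ptr<TrieNode>& slot = cur->children[ch - 'a'];
        if (!slot) slot = std::make_unique<TrieNode>();  // create only the missing nodes
        cur = slot.get();
    }
    if (cur != &root) cur->is_terminal = true;  // an empty word marks nothing
}

// Counts every node in the tree, including the root.
std::size_t count_nodes(const TrieNode& node) {
    std::size_t total = 1;
    for (const auto& child : node.children) {
        if (child) total += count_nodes(*child);
    }
    return total;
}
```

With "abc" and "abd" inserted, `count_nodes(root)` is 5 (root, a, b, c, d); inserting "abf" brings it to 6, because only the node f is new.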

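With this structure, the possible applications are fairly obvious: word-frequency counting, word lookup, and sensitive-word filtering of the kind used in games.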

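2. Implementation Approach

Now for the details of the implementation.

2.1 Word-frequency counting and word lookup

Both follow the same idea, which is the find_word_exists function in the code below; for word-frequency counting, just add a running counter.

The key is in building the trie: child nodes have to be added as needed, and each node must record whether a complete word ends there. After that, inserting a word is just a matter of inserting its characters one by one.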
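As a rough illustration of "adding a counter", the sketch below (C++14 or later) keeps a per-node `word_count` that is incremented only at the node where a word ends. This is my own assumption about how the counter could look: the `count` field in the section 3 listing is incremented on every node a word passes through, so it behaves like a prefix count instead. `FreqNode`, `add_word`, and `frequency` are hypothetical names.

```cpp
#include <array>
#include <memory>
#include <string>

// Hypothetical node for this sketch: word_count doubles as the "complete word" flag.
struct FreqNode {
    std::array<std::unique_ptr<FreqNode>, 26> children;
    int word_count = 0;  // how many times a word ending exactly here was added
};

// Adds one occurrence of a word. Returns false and changes nothing if the
// word is empty or contains anything other than 'a'..'z'.
bool add_word(FreqNode& root, const std::string& word) {
    if (word.empty()) return false;
    for (char ch : word) {
        if (ch < 'a' || ch > 'z') return false;
    }
    FreqNode* cur = &root;
    for (char ch : word) {
        std::unique_ptr<FreqNode>& slot = cur->children[ch - 'a'];
        if (!slot) slot = std::make_unique<FreqNode>();
        cur = slot.get();
    }
    ++cur->word_count;  // count the word only at its final node
    return true;
}

// Returns how many times the exact word was added (0 if never).
int frequency(const FreqNode& root, const std::string& word) {
    const FreqNode* cur = &root;
    for (char ch : word) {
        if (ch < 'a' || ch > 'z') return 0;
        cur = cur->children[ch - 'a'].get();
        if (cur == nullptr) return 0;
    }
    return cur->word_count;
}
```

After adding "back" twice and "backen" once, `frequency(root, "back")` is 2 and `frequency(root, "backen")` is 1, while a bare prefix such as "bac" gives 0.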

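2.2 Sensitive-word filtering

This one is a little more involved. It is the sensitive_word_filter function in the code below.

It takes three pointers to do the traversal: two on the text being checked and one on the trie.

1. To start, pointer p1 points to the root, and pointers p2 and p3 both point to the first character of the string.

2. Starting from the string's first character, a, check whether any sensitive word has a as its prefix. This only requires checking whether p1 has a child node a. In this example it does not, so move p2 and p3 one position to the right.

3. Next, start from the character b and check whether any sensitive word has b as its prefix. p1 does have a child b, so point p1 at node b and move p2 one position to the right; p3 stays where it is.

4. Check whether p1 has a child matching the character c that p2 now points to. It does, so point p1 at node c and move p2 one position to the right; p3 again stays where it is.

5. Check whether p1 has a child matching the character d that p2 now points to. It does not, which means no sensitive word matches the text starting at b. Move both p2 and p3 to the character c, and reset p1 to the root.

6. As in the earlier steps, check whether any sensitive word has c as its prefix. There is none here, so move p2 and p3 to the character d.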
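The steps above can be sketched compactly as follows (C++14 or later), using indices into a `std::string` for p2 and p3 and a node pointer for p1. This is only an illustration under my own assumptions: the names `FilterNode`, `add_sensitive`, and `mask_sensitive` are hypothetical, input is limited to a to z, and the sensitive words "bcf" and "de" in the usage note are invented so that the walk-through behaves as described in steps 1 to 6.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical node type for this sketch.
struct FilterNode {
    std::array<std::unique_ptr<FilterNode>, 26> children;
    bool is_terminal = false;  // true if a sensitive word ends here
};

// Adds a sensitive word made of 'a'..'z'; anything else is ignored.
void add_sensitive(FilterNode& root, const std::string& word) {
    if (word.empty()) return;
    for (char ch : word) {
        if (ch < 'a' || ch > 'z') return;
    }
    FilterNode* cur = &root;
    for (char ch : word) {
        std::unique_ptr<FilterNode>& slot = cur->children[ch - 'a'];
        if (!slot) slot = std::make_unique<FilterNode>();
        cur = slot.get();
    }
    cur->is_terminal = true;
}

// Returns a copy of text with every matched sensitive word replaced by '*'.
std::string mask_sensitive(const FilterNode& root, std::string text) {
    const FilterNode* p1 = &root;  // current position in the trie
    std::size_t p2 = 0;            // character currently being examined
    std::size_t p3 = 0;            // start of the current candidate match
    while (p2 < text.size()) {
        const char ch = text[p2];
        const FilterNode* next =
            (ch >= 'a' && ch <= 'z') ? p1->children[ch - 'a'].get() : nullptr;
        if (next == nullptr) {
            // Nothing matches from p3: restart one character later (steps 2, 5, 6).
            ++p3;
            p2 = p3;
            p1 = &root;
        } else if (next->is_terminal) {
            // text[p3..p2] is a sensitive word: mask it and restart after it.
            for (std::size_t i = p3; i <= p2; ++i) text[i] = '*';
            ++p2;
            p3 = p2;
            p1 = &root;
        } else {
            // Partial match: advance p1 and p2, leave p3 in place (steps 3, 4).
            p1 = next;
            ++p2;
        }
    }
    return text;
}
```

With the assumed words "bcf" and "de", `mask_sensitive(root, "abcdef")` returns "abc**f": the attempt starting at b fails at d, the scan restarts at c, and the match starting at d succeeds.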

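That should be enough to get the idea; the remaining steps work the same way. Now let's put it into practice.

3. Code

Word-frequency counting, word lookup, and sensitive-word filtering are all implemented here, as follows: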

#include <stdio.h>

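struct trie_node
{
    static const int letter_count = 26; // only lowercase 'a'..'z' are supported
    int count;        // number of inserted words whose path passes through this node
    bool is_terminal; // true if a complete word ends at this node
    char letter;      // character stored at this node
    trie_node* childs[letter_count]; // child nodes, indexed by (letter - 'a')

    // Initializers are listed in declaration order to avoid reorder warnings.
    trie_node(): count(1), is_terminal(false), letter(0)
    {
        for(int i = 0; i < letter_count; ++i)
        {
            childs[i] = NULL;
        }
    }
};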

class trie
{
private:
    trie_node* _root_node;
public:
    trie(): _root_node(NULL)
    {
    }
    ~trie()
    {
        delete_trie(_root_node);
    }

    trie_node* create()
    {
        trie_node* node = new trie_node();
        return node;
    }

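    void insert(const char* str)
    {
        if(NULL == str || 0 == *str)
        {
            return; // nothing to insert
        }
        // Reject words containing anything other than 'a'..'z' before touching the tree.
        for(const char* p = str; *p != 0; ++p)
        {
            if(*p < 'a' || *p > 'z')
            {
                return;
            }
        }
        if(NULL == _root_node)
        {
            _root_node = create(); // create the root lazily on first insert
        }
        trie_node* next_node = _root_node;

        while(*str != 0)
        {
            int index = *str - 'a';
            if(NULL == next_node->childs[index])
            {
                next_node->childs[index] = create();
            }
            else
            {
                next_node->childs[index]->count++;
            }
            next_node = next_node->childs[index];
            next_node->letter = *str;
            str++;
        }

        next_node->is_terminal = true; // a complete word ends here
    }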

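    bool find_word_exists(const char* str)
    {
        if(NULL == _root_node || NULL == str)
        {
            printf("condition is null\n");
            return false;
        }

        trie_node* cur_node = _root_node;

        // A while loop (not do-while) so an empty string never indexes childs[].
        while(*str != 0)
        {
            int index = *str - 'a';
            if(index < 0 || index >= trie_node::letter_count)
            {
                return false; // only 'a'..'z' are stored in the tree
            }
            cur_node = cur_node->childs[index];
            if(NULL == cur_node)
            {
                return false;
            }
            str++;
        }

        return cur_node->is_terminal; /* found only if a complete word ends at this node */
    }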

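    void sensitive_word_filter(char* str)
    {
        if(NULL == _root_node || NULL == str)
        {
            printf("condition is null\n");
            return ;
        }

        char* pre = str; // start of the current candidate match
        char* cur = str; // character currently being checked
        trie_node* cur_node = _root_node;

        // A while loop (not do-while) so an empty string never indexes childs[].
        while(*cur != 0)
        {
            int index = *cur - 'a';
            // Characters outside 'a'..'z' can never match, so treat them as a mismatch.
            bool in_range = (index >= 0 && index < trie_node::letter_count);
            if(in_range && NULL != cur_node->childs[index])
            {
                // Matching stops at the first complete word, so with both "back"
                // and "backen" in the tree only "back" is masked.
                if(cur_node->childs[index]->is_terminal) /* sensitive word found */
                {
                    while(pre != cur) /* mask the sensitive word */
                    {
                        *pre = '*';
                        pre++;
                    }
                    *pre = '*';

                    // Move past the word and restart from the root of the trie.
                    cur++;
                    pre = cur;
                    cur_node = _root_node;
                    continue;
                }
                cur_node = cur_node->childs[index];
                cur++;
            }
            else
            {
                /* Mismatch: advance the start pointer by one and restart from the root. */
                pre++;
                cur = pre;
                cur_node = _root_node;
            }
        }

        return;
    }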

    void delete_trie(trie_node* node)
    {
        if(NULL == node)
        {
            return ;
        }
        for (int i = 0; i < trie_node::letter_count; i++)
        {
            if(NULL != node->childs[i])
            {
                delete_trie(node->childs[i]);
            }
        }
        delete node;
    }
};



int main(int argc, char** argv)
{
    if(argc < 2)
    {
        printf("Usage: ./a.out word\n");
        return -1;
    }

    char* word = NULL;
    if(NULL != argv[1])
    {
        word = argv[1];
    }
    else
    {
        return -2;
    }

    trie trie_tree = trie();
    trie_tree.insert("apps");
    trie_tree.insert("apply");
    trie_tree.insert("append");
    trie_tree.insert("back");
    trie_tree.insert("backen");
    trie_tree.insert("basic");

    /* 1. Word lookup (word-frequency counting uses the same traversal) */
    bool is_find = trie_tree.find_word_exists(word);
    if(is_find)
    {
        printf("find word\n");

    }
    else
    {
        printf("not find\n");
    }

    /* 2. Sensitive-word filtering */
    trie_tree.sensitive_word_filter(word);
    printf("word = %s\n", word);

    return 0;
}

./a.out apps
Output:

find word
word = ****

./a.out backhahaha
Output:

not find
word = ****hahaha

Reference for the underlying idea: https://blog.csdn.net/m0_37907797/article/details/103272967
