Trie Word-Frequency Counting Example

Original post

LHJ884

2020-02-23 08:49

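Introduction to the Trie

A trie, also called a prefix tree or dictionary tree, is a commonly used data structure. Typical uses include word-frequency counting, fast string lookup, longest-prefix matching, and variants of these problems.

Its structure is shown in the figure below:

The root of a trie is an empty node that stores no data. Each node holds an array of child pointers. The array usually has 26 slots, one per lowercase English letter; it needs 52 slots if upper and lower case are distinguished, and 62 if the digits 0-9 are included as well.
You can picture it as a tree with a very large branching factor, so it consumes a good deal of memory. Its height, however, never exceeds the length of the longest stored string, so a lookup takes at most one step per character of the key. It is a classic trade of space for time.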
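
The slot arithmetic behind the 26/52/62 array sizes can be sketched as follows. This is only an illustration: child_index and kAlphabetSize are hypothetical names that do not appear in the implementation later in this post, which handles lowercase letters only.

```cpp
// Hypothetical helper (not part of the post's code): maps a character to a
// slot in a 62-entry child array, or returns -1 if it is unsupported.
// Layout assumed here: 'a'-'z' -> 0..25, 'A'-'Z' -> 26..51, '0'-'9' -> 52..61.
const int kAlphabetSize = 62;

int child_index(char c) {
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c >= 'A' && c <= 'Z') return 26 + (c - 'A');
    if (c >= '0' && c <= '9') return 52 + (c - '0');
    return -1;  // punctuation, whitespace, non-ASCII bytes
}
```

Returning -1 for anything outside the alphabet lets the caller skip or reject a character instead of indexing past the end of the child array.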

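Word-Frequency Count of the English Bible

The full English Bible as a TXT file is 4 MB in size. There are many ways to run a word-frequency count or similar analysis on it.
I think the following approaches would all work:

A Python dictionary
sed & awk text processing on Linux
C++ STL map
A trie

The first three are quick and simple to implement, but writing your own trie is good data-structure practice and lets you see the efficiency gain first-hand, so why not.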
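
For comparison, the STL map approach from the list might look roughly like the sketch below. The function name count_words is made up for this illustration, and it assumes words are separated by whitespace and need no further cleanup.

```cpp
#include <map>
#include <sstream>
#include <string>

// Counts whitespace-separated tokens in `text`.
// Illustrative baseline only; not the implementation used in this post.
std::map<std::string, int> count_words(const std::string &text) {
    std::map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        ++counts[word];  // operator[] creates a missing entry with value 0
    }
    return counts;
}
```

Each update costs a logarithmic number of string comparisons in the number of distinct words, whereas a trie walks one node per character of the word.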

My implementation follows. If you spot any mistakes, corrections are welcome!

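1) Custom header file

WordHash records each distinct word together with its number of occurrences.
The TrieTree class is not well encapsulated: out of laziness, many attributes such as the line count and the total word count are left public.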

#ifndef _WORD_COUNT_H
#define _WORD_COUNT_H

#include<ctype.h>
#include<stdio.h>
#include<string.h>
#include<string>
#include<fstream>
#include<sstream>
#include<vector>
#include<iterator>
#include<algorithm>
#include<iostream>

using std::string;
using std::vector;

typedef struct tag {
    char word[50];  // a single word (at most 49 characters plus '\0')
    int show_times; // number of occurrences
}WordHash;

const int child_num = 26;

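// Trie node
typedef struct Trie {
    int count;                          // times the word ending here was inserted
    struct Trie *next_char[child_num];  // one child pointer per letter 'a'..'z'
    bool is_word;                       // true if a word ends at this node

    // Node constructor: count must start at 0 because insert() increments it
    Trie(): count(0), is_word(false) {
        memset(next_char,0,sizeof(next_char));
    }
}TrieNode;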

class TrieTree {
 public:
    TrieTree();
    void insert(const char *word);
    bool search(const char *word);
    void deleteTrieTree(TrieNode *root);
    inline void setZero_wordindex(){ word_index = 0; }

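    int word_index;              // next free slot in words_count_table
    WordHash *words_count_table; // word-frequency table
    int lines_count;             // number of lines read
    int all_words_count;         // total number of words
    int distinct_words_count;    // number of distinct words

 private:
    TrieNode *root; // root node of the trie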
};

// Word-frequency statistics for a text file
class WordStatics {
 public:
    void open_file(string filename);
    void write_file();

    void set_open_filename(string input_path);
    string& get_open_filename();

    void getResult();
    void getTopX(int x);

 private:
    vector<string> words;  // every word read from the text
    TrieTree dictionary_tree; // the trie

    vector<WordHash> result_table; // resulting word-frequency table
    string open_filename; // path of the text to process
    string write_filename; // file for the word-frequency results
};



#endif

The .cpp file with the member function definitions
1) Trie constructor

#include<iostream>
#include "word_count.h"

using namespace std;


// Trie constructor
TrieTree::TrieTree() {
    root = new TrieNode();
    word_index = 0;
    lines_count = 0;
    all_words_count = 0;
    distinct_words_count = 0;
    // Word-frequency table: one (word, count) entry per distinct word, 30000 at most
    words_count_table = new WordHash[30000];
}

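2) Read the words of the text and insert them into the trie one by one to build it.
(This only handles text made up entirely of lowercase letters; I did some simple preprocessing on the Bible file beforehand.)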

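// Build the trie: insert one word (lowercase 'a'..'z' only)
void TrieTree::insert(const char *word) {
    // Skip words that would not fit in WordHash::word or that contain
    // anything other than 'a'..'z'; they would index out of bounds.
    size_t len = strlen(word);
    if ( len == 0 || len >= sizeof(words_count_table[0].word) ) return;
    for ( size_t i = 0; i < len; i++ ) {
        if ( word[i] < 'a' || word[i] > 'z' ) return;
    }

    TrieNode *location = root; // cursor that walks down the trie
    const char *pword = word;  // remembers the start of the word

    // Walk one level per letter, creating missing nodes on the way
    while( *word ) {
        int idx = *word - 'a';
        if ( location->next_char[idx] == NULL ) {
            location->next_char[idx] = new TrieNode();
        }
        location = location->next_char[idx];
        word++;
    }
    location->count++;
    location->is_word = true; // reached the end of the word
    // First occurrence: record the word, unless the 30000-entry table is full
    if ( location->count == 1 && word_index < 30000 ) {
        strcpy(this->words_count_table[word_index++].word,pword);
        distinct_words_count++;
    }
}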
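
To see the counting logic of insert in isolation, here is a stripped-down sketch. MiniNode, mini_insert, mini_count and mini_free are illustrative names that are not part of the code above, and the input is assumed to be lowercase a-z only.

```cpp
#include <cstddef>

// Minimal stand-in for the trie node (illustrative, not the post's TrieNode).
struct MiniNode {
    int count;           // times a word ending at this node was inserted
    MiniNode *next[26];  // children for 'a'..'z'
    MiniNode() : count(0) {
        for (int i = 0; i < 26; i++) next[i] = NULL;
    }
};

// Inserts `word` (assumed lowercase a-z) and returns its updated count.
int mini_insert(MiniNode *root, const char *word) {
    MiniNode *cur = root;
    for (; *word; ++word) {
        int idx = *word - 'a';
        if (cur->next[idx] == NULL) cur->next[idx] = new MiniNode();
        cur = cur->next[idx];
    }
    return ++cur->count;
}

// Returns how often `word` was inserted; 0 for absent words and bare prefixes.
int mini_count(const MiniNode *root, const char *word) {
    const MiniNode *cur = root;
    for (; *word && cur; ++word) cur = cur->next[*word - 'a'];
    return cur ? cur->count : 0;
}

// Frees a node and everything below it.
void mini_free(MiniNode *node) {
    if (node == NULL) return;
    for (int i = 0; i < 26; i++) mini_free(node->next[i]);
    delete node;
}
```

Inserting "the" and "then" shares the first three nodes. The prefix "th" has a node too, but its count stays 0, which is how a stored word is told apart from a mere prefix.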

3) Look up a word in the trie to get its number of occurrences

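// Look up a word in the trie
bool TrieTree::search(const char *word) {
    TrieNode *location = root;

    // Continue while letters remain and the cursor is still inside the trie
    while ( *word && location ) {
        if ( *word < 'a' || *word > 'z' ) return false; // outside the alphabet
        location = location->next_char[ *word - 'a' ];
        word++;
    }

    // Check for NULL before touching location->count
    if ( location == NULL || !location->is_word ) return false;

    // Word found: record its frequency in the word-frequency table
    this->words_count_table[word_index++].show_times = location->count;
    return true;
}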

4) Delete the trie

// Delete the trie by recursively deleting every node
void TrieTree::deleteTrieTree(TrieNode *root) {
    if ( root == NULL ) return; // nothing to delete
    int i;
    for( i=0;i<child_num;i++ ) {
        if ( root->next_char[i] != NULL ) {
            deleteTrieTree(root->next_char[i]);
        }
    }
    delete root;
}
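
An alternative to manual recursive deletion is to let each node own its children through std::unique_ptr, so the whole tree is released when the root goes out of scope. The sketch below assumes C++11 or later; OwnedNode and owned_insert are illustrative names that do not appear in the code above.

```cpp
#include <memory>

// Illustrative node type: every child is owned by its parent, so destroying
// a node destroys its whole subtree without an explicit delete function.
struct OwnedNode {
    int count = 0;                        // times a word ending here was inserted
    std::unique_ptr<OwnedNode> next[26];  // children for 'a'..'z'
};

// Inserts `word` (assumed lowercase a-z) and returns its updated count.
int owned_insert(OwnedNode &root, const char *word) {
    OwnedNode *cur = &root;
    for (; *word; ++word) {
        std::unique_ptr<OwnedNode> &slot = cur->next[*word - 'a'];
        if (!slot) slot.reset(new OwnedNode());
        cur = slot.get();
    }
    return ++cur->count;
}
```

The implicit destruction is still recursive, but its depth is bounded by the longest stored word, the same bound as the tree height.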

5) Member function definitions of the WordStatics class

void WordStatics::set_open_filename(string input_path) {
    this->open_filename = input_path;
}

string& WordStatics::get_open_filename() {
    return this->open_filename;
}

void WordStatics::open_file(string filename) {
    set_open_filename(filename);
    cout<<"Counting word frequencies, please wait..."<<endl;

    ifstream fin(get_open_filename().c_str());
    if ( !fin ) {
        cerr<<"Cannot open file: "<<get_open_filename()<<endl;
        return;
    }

    // Read the file line by line and collect every word into the vector.
    // Testing getline() itself avoids the extra pass that a
    // while(!eof()) loop makes after the last line.
    string line,word;
    while ( getline(fin,line) ) {
        dictionary_tree.lines_count++;

        istringstream is(line);
        while ( is >> word ) {
            dictionary_tree.all_words_count++;
            words.push_back(word);
        }
    }

    // Build the trie from the words that start with a letter
    vector<string>::iterator it;
    for ( it=words.begin();it != words.end();it++ ) {
        if ( isalpha( (unsigned char)(*it)[0] ) ) {
            dictionary_tree.insert( it->c_str() );
        }
    }

}
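
The preprocessing applied to the Bible file is not shown in this post. One possible way to reduce each token to lowercase letters before insertion is sketched below; normalize_token is a hypothetical helper, and dropping every non-letter character is an assumption about what the preprocessing did.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: keeps only the letters of `token`, lowercased,
// so that "God," becomes "god". Returns an empty string if no letters remain.
std::string normalize_token(const std::string &token) {
    std::string out;
    for (std::string::size_type i = 0; i < token.size(); i++) {
        // Cast first: passing a negative char to isalpha/tolower is undefined
        unsigned char c = static_cast<unsigned char>(token[i]);
        if (std::isalpha(c)) out += static_cast<char>(std::tolower(c));
    }
    return out;
}
```

A caller would normalize each token as it is read and skip the ones that come back empty, such as verse numbers.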

void WordStatics::getResult() {
    cout<<"Total lines: "<<dictionary_tree.lines_count<<endl;
    // NOTE: the -1 is kept from the original code; its purpose is not stated
    cout<<"Total words: "<<dictionary_tree.all_words_count-1<<endl;
    cout<<"Distinct words: "<<dictionary_tree.distinct_words_count<<endl;

    // Query the trie for the number of occurrences of each distinct word
    dictionary_tree.setZero_wordindex();
    for(int i=0;i<dictionary_tree.distinct_words_count;i++) {
        dictionary_tree.search(dictionary_tree.words_count_table[i].word);
        result_table.push_back(dictionary_tree.words_count_table[i]);
    }
}

6) Sort the results and print the N most frequent words, where N is given by the user

bool compare(const WordHash& lhs,const WordHash& rhs) {
    return lhs.show_times > rhs.show_times ;
}

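void WordStatics::getTopX(int x) {
    if ( x < 0 ) x = 0;
    sort(result_table.begin(),result_table.end(),compare);
    cout<<"Top "<<x<<" most frequent words in the text:"<<endl;
    // Stop early if the text has fewer than x distinct words
    for( size_t i = 0; i < result_table.size() && i < (size_t)x; i++) {
        cout<<result_table[i].word<<": "<<result_table[i].show_times<<endl;
    }
}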
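
When only the first N entries are needed, std::partial_sort can stand in for a full sort. The sketch below swaps in that technique on a simplified (word, count) pair instead of WordHash; WordCount, more_frequent and top_n are illustrative names, and the alphabetical tie-break is an addition that the compare function above does not have.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;  // (word, occurrences)

// Higher counts first; equal counts fall back to alphabetical order
// so the result is deterministic.
bool more_frequent(const WordCount &a, const WordCount &b) {
    if (a.second != b.second) return a.second > b.second;
    return a.first < b.first;
}

// Returns the n most frequent entries, or all of them if fewer exist.
// `table` is taken by value so the caller's vector is left untouched.
std::vector<WordCount> top_n(std::vector<WordCount> table, std::size_t n) {
    if (n > table.size()) n = table.size();
    std::partial_sort(table.begin(), table.begin() + n, table.end(),
                      more_frequent);
    table.resize(n);
    return table;
}
```

Only the first n positions end up ordered, which saves work when n is small compared with the number of distinct words.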

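Run results:

For reference only; this is a record of my own learning.
There is still plenty here that could be improved, and I will keep working on my programming skills!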
