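1. Information Gain (Mutual Information)

The final Bayes classifier uses only a subset of the features, and those features are selected by information gain, so a brief introduction is given here.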

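(1) Information gain in the "Watermelon Book"¹

Section 4.2 of the Watermelon Book introduces information gain as the criterion for choosing the splitting attribute of a decision-tree node. It is defined as follows.
First, the class entropy of the original set D, where $p_{k}$ is the proportion of samples in D that belong to class k and $|\mathcal{Y}|$ is the number of classes:

$Ent(D)=-\sum_{k=1}^{|\mathcal{Y}|}p_{k}\log_{2}(p_{k})$
Then, after attribute a splits D into V subsets $D^{1},\dots,D^{V}$, the information gain is defined as:
$Gain(D, a)=Ent(D) - \sum_{v=1}^{V}\frac{|D^{v}|}{|D|}Ent(D^{v})$
The attribute with the largest information gain is then chosen as the splitting feature of the tree.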
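As a small illustration of the two definitions above, here is a minimal Python sketch. The function names and the toy data are made up for this example and are not part of the classifier later in the post.

```python
import math
from collections import Counter


def entropy(labels):
    """Ent(D): Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Gain(D, a): Ent(D) minus the size-weighted entropy of each subset D^v."""
    total = len(labels)
    # Group the class labels by the value the attribute takes on each sample.
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(subset) / total * entropy(subset)
                    for subset in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four samples whose attribute separates the classes perfectly.
toy_labels = ['good', 'good', 'bad', 'bad']
toy_attribute = ['green', 'green', 'pale', 'pale']
print(entropy(toy_labels))
print(information_gain(toy_attribute, toy_labels))
```

With two equally frequent classes the entropy is 1 bit, and an attribute that separates them perfectly recovers all of it, so the gain is also 1 bit. An attribute that is independent of the class would give a gain of 0.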

(2) Mutual information in PRML²

Section 1.6.1 of PRML defines mutual information as:
$I[x, y] \equiv KL(p(x, y) \| p(x)p(y)) = - \iint p(x,y)\ln\left(\frac{p(x)p(y)}{p(x,y)}\right) dx\,dy$
which simplifies to:
$I[x,y] = H[x] - H[x|y] = H[y] - H[y|x]$

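(3) They are in fact the same thing

Proof: first rewrite the integral in 1.(2) as a sum and replace ln with $\log_{2}$ (the two differ only by the constant factor $1/\ln 2$). Using $p(x,y)=p(x)p(y|x)$, this gives
$-\sum_{x,y}p(x,y)\log_{2}\left(\frac{p(x)p(y)}{p(x,y)}\right)$
$= -\sum_{y}\sum_{x}p(x,y)\log_{2}(p(y)) - \left(-\sum_{x}p(x)\sum_{y}p(y|x)\log_{2}(p(y|x))\right)$
$=-\sum_{y}p(y)\log_{2}(p(y))-\left(-\sum_{x}p(x)\sum_{y}p(y|x)\log_{2}(p(y|x))\right)$

Look closely: this is exactly the expression in 1.(1). Take y to be the class label and x to be the attribute a. The first term is then $Ent(D)$. In the second term, $p(x=v)$ corresponds to $\frac{|D^{v}|}{|D|}$ and $-\sum_{y}p(y|x=v)\log_{2}(p(y|x=v))$ corresponds to $Ent(D^{v})$.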
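The equality can also be checked numerically. The sketch below computes the information gain and the mutual information of the same toy sample in two independent ways. The helper names and the data are assumptions made for this illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def gain(xs, ys):
    """Gain(D, a), with xs as the attribute values and ys as the class labels."""
    total = len(ys)
    result = entropy(ys)
    for value, n in Counter(xs).items():
        subset = [y for x, y in zip(xs, ys) if x == value]
        result -= n / total * entropy(subset)
    return result


def mutual_information(xs, ys):
    """I[x, y] in bits, computed from the empirical joint distribution of (x, y)."""
    total = len(ys)
    n_x, n_y, n_xy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) / (p(x) p(y)) simplifies to n_xy * total / (n_x * n_y).
    return sum(n / total * math.log2(n * total / (n_x[x] * n_y[y]))
               for (x, y), n in n_xy.items())


# Hypothetical toy data: attribute values and class labels of six samples.
xs = ['a', 'a', 'a', 'b', 'b', 'b']
ys = [1, 1, 0, 0, 0, 1]
print(gain(xs, ys), mutual_information(xs, ys))
```

Both numbers agree up to floating-point rounding, which is what the derivation predicts.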

2. Naive Bayes News Classification³

(1) Constants and helper functions

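import math
import random
import collections

# Maps the numeric label code used in the data file to a category name
label_dict = {0: 'finance', 1: 'health', 2: 'education', 3: 'military', 4: 'technology',
              5: 'travel', 6: 'parenting', 7: 'automotive', 8: 'sports', 9: 'culture',
              10: 'entertainment'}

def code_2_label(code):
    return label_dict.get(code)

def default_doc_dict():
    """
    Build a zero vector with one entry per category.
    :return: a list of zeros whose length equals the number of document categories;
    used as the default value of dictionaries whose values are lists of that length
    """
    return [0] * len(label_dict)

def shuffle(in_file):
    """
    Shuffle the lines of the input file and split them 3:2 into a training set and a test set.
    :param in_file: input file, one document per line
    :return: (train_text, test_text)
    """
    # Read inside a with-block so the file handle is closed
    with open(in_file, encoding='utf-8') as f:
        text_lines = [line.strip() for line in f]
    print('Preparing training and test data, please wait...')
    random.shuffle(text_lines)
    total_lines = len(text_lines)
    split_at = int(3 * total_lines / 5)
    train_text = text_lines[:split_at]
    test_text = text_lines[split_at:]
    print('Training and test data ready, moving on...')
    return train_text, test_text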
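
The same 3:2 split can be tried on in-memory lines without touching a file. The sketch below is only an illustration: `split_lines`, its `seed` parameter and the toy lines are hypothetical and not part of the original code. The seed makes the split repeatable, which helps when comparing accuracy between runs.

```python
import random


def split_lines(lines, seed=None):
    """Shuffle a copy of the lines and split it 3:2 into (train, test)."""
    rng = random.Random(seed)      # private generator, leaves the global state alone
    shuffled = list(lines)
    rng.shuffle(shuffled)
    cut = int(3 * len(shuffled) / 5)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical lines in the "<label> <word> <word> ..." layout used by the later functions.
toy_lines = ['%d w%d' % (i % 3, i) for i in range(10)]
train, test = split_lines(toy_lines, seed=0)
print(len(train), len(test))
```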

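(2) Feature extraction

Features (here, words) are ranked by the information gain (mutual information) from section 1, and the 100 highest-scoring words of each category are kept.
First, a helper function for computing mutual information is defined. What it computes is a single term of the sum:

$I[x,y]=\sum_{x,y}\left(-p(x,y)\log_{2}\left(\frac{p(x)p(y)}{p(x,y)}\right)\right)$

(Note the minus sign: it has been moved inside the sum.)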

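def mutual_info(N, Nij, N_i, N_j):
    """
    Compute one term of the mutual information, using log base 2. To keep the
    logarithm defined when Nij is 0, the count inside the log is smoothed with +1.
    :param N: total number of samples
    :param Nij: number of samples with x = i and y = j
    :param N_i: number of samples with x = i
    :param N_j: number of samples with y = j
    :return: the smoothed contribution of cell (i, j) to the mutual information
    """
    return Nij * 1.0 / N * math.log(N * (Nij + 1) * 1.0 / (N_i * N_j)) / math.log(2)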

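This is not very readable because the expression has already been simplified. The derivation is:
$-p(x=i,y=j)\log_{2}\left(\frac{p(x=i)p(y=j)}{p(x=i,y=j)}\right)$
$=-\frac{N_{i,j}}{N}\log_{2}\left(\frac{\frac{N_{i}}{N}\frac{N_{j}}{N}}{\frac{N_{i,j}}{N}}\right)$
$=\frac{N_{i,j}}{N}\log_{2}\left(\frac{N N_{i,j}}{N_{i}N_{j}}\right)$
$=\frac{N_{i,j}}{N}\ln\left(\frac{N N_{i,j}}{N_{i}N_{j}}\right)/\ln(2)$
Adding the smoothing term gives the expression used in the function above (ignore the 1.0, which only converts integers to floats):
$\frac{N_{i,j}}{N}\ln\left(\frac{N(N_{i,j}+1)}{N_{i}N_{j}}\right)/\ln(2)$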
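
To see what the +1 does in practice, the sketch below compares the smoothed expression with the unsmoothed term from the derivation. `exact_term` and the counts are hypothetical and exist only for this comparison. `mutual_info` repeats the expression from the helper so that the snippet runs on its own.

```python
import math


def mutual_info(N, Nij, N_i, N_j):
    # Same expression as the helper above, with +1 smoothing inside the log.
    return Nij * 1.0 / N * math.log(N * (Nij + 1) * 1.0 / (N_i * N_j)) / math.log(2)


def exact_term(N, Nij, N_i, N_j):
    # Unsmoothed term; a zero count contributes 0 by the convention 0 * log 0 = 0.
    if Nij == 0:
        return 0.0
    return Nij / N * math.log2(N * Nij / (N_i * N_j))


# Hypothetical counts: 1000 tokens, the word occurs 50 times, the category
# holds 200 tokens, and 40 occurrences of the word fall inside the category.
print(exact_term(1000, 40, 50, 200))
print(mutual_info(1000, 40, 50, 200))
# With a zero joint count the smoothed version evaluates to 0 instead of failing on log(0).
print(mutual_info(1000, 0, 50, 200))
```

The smoothed value is slightly larger than the exact one, and the difference shrinks as the joint count grows. Because the factor in front of the logarithm is left unsmoothed, a cell with a zero count still contributes nothing to the sum.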

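def count_for_cates(train_text, feature_file):
    """
    Scan the training lines, count how often each word occurs in each category
    and how many words each category contains, then write the result to the
    feature file (only the 100 words with the highest mutual information per category).
    :param train_text: training lines, each of the form "<label> <word> <word> ..."
    :param feature_file: path of the feature file to write
    :return:
    """
    # Number of words contained in each category
    doc_count = [0] * len(label_dict)
    # Dictionary keyed by word; the value is the vector of that word's
    # occurrence counts per category (all zeros for an unseen word)
    word_count = collections.defaultdict(default_doc_dict)
    # Scan the lines and count
    for line in train_text:
        label, text = line.strip().split(' ', 1)
        words = text.split(' ')
        int_label = int(label)
        for word in words:
            # Empty strings survive stop-word filtering, so drop them here
            if word != '':
                word_count[word][int_label] += 1
                doc_count[int_label] += 1
    # Compute mutual information
    print('Computing mutual information and extracting feature words, please wait...')
    # mi_dict[word][i] is the mutual information, measured in word counts, between
    # "the token is this word" and "the token belongs to category i"
    mi_dict = collections.defaultdict(default_doc_dict)
    # Total number of words
    N = sum(doc_count)
    # (word, [... occurrences of the word in each category ...])
    for k, vs in word_count.items():
        for i in range(len(vs)):
            # N11: tokens that are word k and occur in category i
            N11 = vs[i]
            # N10: tokens that are word k and occur outside category i
            N10 = sum(vs) - N11
            # N01: tokens that are not word k and occur in category i
            N01 = doc_count[i] - N11
            # N00: tokens that are not word k and occur outside category i
            N00 = N - N11 - N10 - N01
            # Let D be the number of words in the category, A the total number of
            # occurrences of the word and N the total number of words. The sum is
            #   mutual_info(N, DA, A, D) + mutual_info(N, ~DA, A, ~D)
            #   + mutual_info(N, D~A, D, ~A) + mutual_info(N, ~D~A, ~D, ~A)
            # The parentheses are required: without them only the first term
            # would be assigned to mi.
            mi = (mutual_info(N, N11, N10 + N11, N01 + N11)
                  + mutual_info(N, N10, N10 + N11, N00 + N10)
                  + mutual_info(N, N01, N01 + N11, N01 + N00)
                  + mutual_info(N, N00, N00 + N10, N00 + N01))
            mi_dict[k][i] = mi
    # Words used as parameters of the Bayes model
    f_words = set()
    # Add the 100 most informative words of each category to f_words
    for i in range(len(doc_count)):
        sorted_dict = sorted(mi_dict.items(),
                             key=lambda x: x[1][i], reverse=True)
        # Slicing avoids an IndexError when the vocabulary has fewer than 100 words
        for word, _ in sorted_dict[:100]:
            f_words.add(word)
    with open(feature_file, 'w', encoding='utf-8') as out:
        # Write the number of words in each category
        out.write(str(doc_count) + '\n')
        # Write the feature words
        for f_word in f_words:
            out.write(f_word + "\n")
        print("Feature words written...")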

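def load_feature_words(feature_file):
    """
    Load the feature words from the feature file.
    :param feature_file: path of the feature file written by count_for_cates
    :return: (number of words in each category, set of feature words)
    """
    with open(feature_file, encoding='utf-8') as f:
        # First line: number of words in each category, written as str(list).
        # Parse it by hand rather than with eval, which would execute arbitrary code.
        first_line = f.readline().strip().strip('[]')
        doc_words_count = [int(x) for x in first_line.split(',')]
        features = set()
        # Remaining lines: one feature word per line
        for line in f:
            features.add(line.strip())
        return doc_words_count, features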
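
The four counts N11, N10, N01 and N00 used in count_for_cates form a 2x2 contingency table for one word and one category. The sketch below builds that table on a tiny corpus. The corpus, the two-category setup and the `contingency` helper are assumptions made for this illustration.

```python
import collections

# Hypothetical tokenized lines in the "<label> <word> <word> ..." layout
# (only labels 0 and 1, to keep the example short).
toy_lines = [
    '0 stock market stock',
    '0 market bank',
    '1 match goal goal',
    '1 goal market',
]
num_classes = 2
class_totals = [0] * num_classes
word_count = collections.defaultdict(lambda: [0] * num_classes)
for line in toy_lines:
    label, text = line.split(' ', 1)
    for word in text.split(' '):
        word_count[word][int(label)] += 1
        class_totals[int(label)] += 1

N = sum(class_totals)  # total number of tokens


def contingency(word, cls):
    """Return (N11, N10, N01, N00) for one word and one category."""
    n11 = word_count[word][cls]         # the word, inside the category
    n10 = sum(word_count[word]) - n11   # the word, outside the category
    n01 = class_totals[cls] - n11       # other words, inside the category
    n00 = N - n11 - n10 - n01           # other words, outside the category
    return n11, n10, n01, n00


print(contingency('goal', 1))    # (3, 0, 2, 5)
print(contingency('market', 0))  # (2, 1, 3, 4)
```

A word such as "goal" that occurs in only one category has N10 equal to 0, which is the kind of cell the +1 smoothing in mutual_info is meant to handle.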

(3) Training the model

def train_bayes(feature_file, text, model_file):
    """
    Train the Bayes model, i.e. count how often each feature word occurs in each category.
    :param feature_file: feature file
    :param text: raw training samples
    :param model_file: model file
    :return:
    """
    print('Training naive Bayes...')
    doc_words_count, features = load_feature_words(feature_file)
    feature_word_count = collections.defaultdict(default_doc_dict)
    # Total number of feature-word occurrences in each category
    feature_doc_words_count = [0] * len(doc_words_count)
    for line in text:
        # Named "content" so the loop does not rebind the "text" parameter
        label, content = line.strip().split(' ', 1)
        int_label = int(label)
        words = content.split(' ')
        for word in words:
            if word in features:
                feature_doc_words_count[int_label] += 1
                feature_word_count[word][int_label] += 1
    print('Training finished, writing the model...')
    # The with-block makes sure the model file is flushed and closed
    with open(model_file, 'w', encoding='utf-8') as out_model:
        for k, v in feature_word_count.items():
            # Add-one (Laplace) smoothed estimate of p(word | category)
            scores = [(v[i] + 1) * 1.0 / (feature_doc_words_count[i] + len(feature_word_count))
                      for i in range(len(v))]
            out_model.write(k + '\t' + str(scores) + '\n')

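def load_model(model_file):
    """
    Load the trained Bayes model from the model file.
    :param model_file: path of the model file written by train_bayes
    :return: dictionary mapping each feature word to its list of per-category scores
    """
    print('Loading the model...')
    with open(model_file, encoding='utf-8') as f:
        scores = {}
        for line in f:
            word, counts = line.split('\t', 1)
            # The scores were written as str(list); parse them without eval
            scores[word] = [float(x) for x in counts.strip().strip('[]').split(',')]
        return scores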
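
The scores stored in the model file are add-one (Laplace) smoothed estimates of p(word | category). The sketch below shows the same formula for a single category. `laplace_scores` and the toy counts are hypothetical names and values chosen for this example.

```python
def laplace_scores(counts):
    """counts: {word: occurrences within one category}. Returns smoothed p(word | category)."""
    total = sum(counts.values())   # all feature-word occurrences in the category
    vocab = len(counts)            # number of distinct feature words
    return {word: (c + 1) / (total + vocab) for word, c in counts.items()}


# Hypothetical counts of three feature words inside one category.
toy_counts = {'stock': 3, 'goal': 0, 'market': 1}
probs = laplace_scores(toy_counts)
print(probs)
print(sum(probs.values()))
```

A word that never occurs in the category still gets a small non-zero probability, so its logarithm stays finite at prediction time, and the smoothed probabilities of one category still sum to 1.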

(4) Prediction

def predict(feature_file, model_file, test_text):
    """
    Predict the category of each document; every line of test_text is one document.
    This is the naive Bayes decision rule:
    p(c|x) is proportional to p(c)p(x1|c)....p(xn|c)
    :param feature_file: feature file
    :param model_file: model file
    :param test_text: labelled test lines
    :return:
    """
    doc_words_count, features = load_feature_words(feature_file)
    # log p(c), estimated from the share of words that fall into each category
    doc_scores = [math.log(count * 1.0 / sum(doc_words_count)) for count in doc_words_count]
    scores = load_model(model_file)
    r_count = 0
    doc_count = 0
    print("Evaluating the model on the test data...")
    for line in test_text:
        label, text = line.strip().split(' ', 1)
        int_label = int(label)
        words = text.split(' ')
        # Start from the log prior and add the log-likelihood of every feature word
        pre_values = list(doc_scores)
        for word in words:
            if word in features:
                for i in range(len(pre_values)):
                    pre_values[i] += math.log(scores[word][i])
        m = max(pre_values)
        p_index = pre_values.index(m)
        if p_index == int_label:
            r_count += 1
        doc_count += 1
    print("Documents tested: %d, correctly classified: %d, naive Bayes accuracy: %f" %
          (doc_count, r_count, r_count * 1.0 / doc_count))
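
predict adds logarithms instead of multiplying probabilities. The sketch below illustrates why, using made-up likelihood values: the product of many small probabilities underflows to 0.0 for both categories, while the sums of logs remain comparable.

```python
import math

# Hypothetical per-word likelihoods p(word | category) for a 100-word document.
probs_class_a = [1e-4] * 100
probs_class_b = [2e-4] * 100


def product(values):
    result = 1.0
    for v in values:
        result *= v
    return result


def log_sum(values):
    return sum(math.log(v) for v in values)


# Direct products underflow to 0.0, so the two categories cannot be told apart.
print(product(probs_class_a), product(probs_class_b))

# Sums of logs stay finite; the larger sum marks the more likely category.
log_scores = [log_sum(probs_class_a), log_sum(probs_class_b)]
best = log_scores.index(max(log_scores))
print(log_scores, best)
```

Because the logarithm is monotonic, taking the maximum over log scores selects the same category as taking the maximum over the original products would, if those could be represented.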

(5) Test

if __name__ == '__main__':
    out_in_file = 'd:/nlps/result.txt'
    out_feature_file = 'd:/nlps/feature.txt'
    out_model_file = 'd:/nlps/model.txt'
    train_text, test_text = shuffle(out_in_file)
    count_for_cates(train_text, out_feature_file)
    train_bayes(out_feature_file, train_text, out_model_file)
    predict(out_feature_file, out_model_file, test_text)

(6) Test results

¹ Zhou Zhihua (周志華), "Machine Learning" (《機器學習》, the "Watermelon Book"), Section 4.2
² Bishop, "Pattern Recognition and Machine Learning", Section 1.6.1
³ Han Xiaoyang (寒小陽)
