[Machine Learning] Naive Bayes: Basic Introduction + Code Implementation

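1. Basic concepts

Naive Bayes computes the posterior probability from the prior probability and the likelihood. It is generally used for classification tasks.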

$P(c_i|w) = \frac{P(w|c_i)P(c_i)}{P(w)}$

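Prior probability: $P(c_i)$

Likelihood: $P(w|c_i)$

Posterior probability: $P(c_i|w)$

By the conditional independence assumption, where $w_j$ is the j-th word of w: $P(w|c_i) = \prod_{j} P(w_j|c_i)$

Objective: find the class that maximizes the posterior probability: $c = \arg\max_{i} P(c_i|w)$

Training: estimate the conditional probability of each word given each class, and the prior probability of each class.

Testing: using the estimated conditional probabilities and priors, compute the posterior probability of each class and pick the class with the largest value.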
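
The testing step can be sketched in a few lines of Python. This is only an illustration: the `predict` helper, the class names, and the probabilities are all made up for the example. Since $P(w)$ is the same for every class, it is dropped from the comparison, and logs are summed instead of multiplying probabilities.

```python
import math

def predict(tokens, log_prior, log_likelihood):
    """Return the class with the largest unnormalised log posterior.

    log_prior[c]         = log P(c)
    log_likelihood[c][w] = log P(w | c)
    """
    scores = {}
    for c in log_prior:
        # log P(c) + sum_j log P(w_j | c); P(w) is omitted because it is class-independent
        scores[c] = log_prior[c] + sum(log_likelihood[c][w] for w in tokens)
    return max(scores, key=scores.get)

# Hypothetical two-class model over a two-word vocabulary
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_likelihood = {
    "pos": {"good": math.log(0.8), "bad": math.log(0.2)},
    "neg": {"good": math.log(0.3), "bad": math.log(0.7)},
}

print(predict(["good", "good", "bad"], log_prior, log_likelihood))
```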

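2. Advantages and disadvantages

Advantages: simple and easy to implement.

Disadvantages: the conditional independence assumption rarely holds in practice, which can hurt classification accuracy. The model also places restrictions on the format of the input data.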

3. Worked example (movie review classification)

Training set of movie reviews
Sample	Label
Good movie.	1
I like it.	1
So amazing.	1
Great movie.	1
Bad movie.	0
I hate it.	0
So boring.	0
Boring movie.	0

Determine the class of the review “I hate bad movie.” (1 denotes a good review, 0 a bad one.)

Let w = “I hate bad movie.”

$c = \arg\max_{k} P(c=k|w),k\in \{1,0\}$

$P(c=1|w) = \frac{P(w|c=1)P(c=1)}{P(w)} = \frac{P(c=1)}{P(w)}\prod_{i=1}^{4}P(w_i|c=1)$

$P(c=0|w) = \frac{P(w|c=0)P(c=0)}{P(w)} = \frac{P(c=0)}{P(w)}\prod_{i=1}^{4}P(w_i|c=0)$

Each class contains 9 word tokens in the training set, and the vocabulary has 11 distinct words (good, movie, I, like, it, so, amazing, great, bad, hate, boring). Without smoothing, “hate” and “bad” would have probability 0 under c=1, so add-one (Laplace) smoothing is applied: 1 is added to every count and the vocabulary size 11 to every denominator.

$P(I|c=1) = \frac{count(I|c=1)+1}{9+11}= \frac{2}{20}$

$P(hate|c=1) =\frac{count(hate|c=1)+1}{9+11} = \frac{1}{20}$

$P(bad|c=1) =\frac{count(bad|c=1)+1}{9+11} = \frac{1}{20}$

$P(movie|c=1) =\frac{count(movie|c=1)+1}{9+11} = \frac{3}{20}$

$P(c=1) = \frac{1}{2}$

Therefore:

$P(c=1|w) = \frac{1}{P(w)}*\frac{2}{20} * \frac{1}{20} *\frac{1}{20}*\frac{3}{20}*\frac{1}{2} = \frac{3}{160000}*\frac{1}{P(w)}$

$P(I|c=0) = \frac{count(I|c=0)+1}{9+11}= \frac{2}{20}$

$P(hate|c=0) =\frac{count(hate|c=0)+1}{9+11}= \frac{2}{20}$

$P(bad|c=0) =\frac{count(bad|c=0)+1}{9+11}= \frac{2}{20}$

$P(movie|c=0) =\frac{count(movie|c=0)+1}{9+11} = \frac{3}{20}$

$P(c=0) = \frac{1}{2}$

Therefore:

$P(c=0|w) =\frac{1}{P(w)} * \frac{2}{20} * \frac{2}{20} *\frac{2}{20}*\frac{3}{20}*\frac{1}{2} = \frac{3}{40000}*\frac{1}{P(w)}$

Since $P(c=0|w) > P(c=1|w)$, the review is classified as a bad review.
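
A minimal sketch of this worked example with exact fractions. It assumes add-one smoothing applied to every word, with the vocabulary size (11 distinct words here) added to the denominator, and a hypothetical `tokenize` helper that lowercases the text and strips the period. The common factor $1/P(w)$ is left out because it does not affect which class wins.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("Good movie.", 1), ("I like it.", 1), ("So amazing.", 1), ("Great movie.", 1),
    ("Bad movie.", 0), ("I hate it.", 0), ("So boring.", 0), ("Boring movie.", 0),
]

def tokenize(text):
    # Hypothetical helper: lowercase, drop the period, split on whitespace
    return text.lower().replace(".", "").split()

counts = {0: Counter(), 1: Counter()}   # per-class word counts
docs_per_class = Counter()              # per-class document counts
for text, label in train:
    counts[label].update(tokenize(text))
    docs_per_class[label] += 1

vocab = set(counts[0]) | set(counts[1])

def score(text, label):
    """Unnormalised posterior P(c) * prod_i P(w_i | c) with add-one smoothing."""
    total = sum(counts[label].values())
    result = Fraction(docs_per_class[label], len(train))
    for word in tokenize(text):
        result *= Fraction(counts[label][word] + 1, total + len(vocab))
    return result

scores = {label: score("I hate bad movie.", label) for label in (0, 1)}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```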

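Summary:

1. Common naive Bayes models: the multinomial model, the Bernoulli model, and the Gaussian model

Multinomial model: based on the number of times each word occurs

Prior probability: P(c) = number of documents in class c / total number of documents

Conditional probability: P(wi|c) = (number of occurrences of word wi in class c + 1) / (total number of words in all documents of class c + vocabulary size)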

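Bernoulli model: based on whether each word occurs (binary features)

Prior probability: P(c) = number of documents in class c / total number of documents

Per-word occurrence probability: P(i|c) = number of documents in class c that contain word wi / number of documents in class c

Conditional probability: $P(w_i|c) = P(i|c)\,w_i + (1-P(i|c))(1-w_i)$, where $w_i \in \{0,1\}$ indicates whether word i occurs in the document

So the multinomial model deals with unseen words only through smoothing and treats them all alike, whereas the Bernoulli model handles absent words explicitly: every vocabulary word that does not occur contributes a factor 1-P(i|c), which counts against the class.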

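Mixed model: a model that takes word frequency into account but caps how many times a word is counted; it computes a weight from the imbalance between classes.

For text classification, the multinomial or the Bernoulli model is the usual choice; when the sentences are short, the Bernoulli model may work better.

Gaussian model: for continuous features assumed to follow a Gaussian distribution, such as a person's height or the length of an object.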

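2. For words that never occur in a class, such as ‘hate’ in the positive examples, smoothing keeps the conditional probability from becoming 0, for example add-1 (Laplace) smoothing.

3. The product of the conditional probabilities above is very small. To prevent underflow, take the natural logarithm, which turns the product into a sum.

4. The example above is only for reference. When the training set is very large (for example, the Rotten Tomatoes review database), we can remove stop words such as the, a, I ... and keep only adjectives such as good, bad ...

5. Bag-of-words model: considers only how many times each word occurs, not word order. It corresponds to the multinomial model above. Set-of-words model: considers only whether each word occurs; an occurrence is recorded as 1 and not counted further. The bag-of-words model usually performs better than the set-of-words model, since it keeps the frequency information.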
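
The Bernoulli conditional probability from point 1 and the log trick from point 3 can be combined in a short sketch. The function name `bernoulli_log_likelihood`, the three-word vocabulary, and the probabilities below are assumptions made up for illustration. Note that the loop runs over the whole vocabulary, so absent words also contribute.

```python
import math

def bernoulli_log_likelihood(doc_words, vocab, p_word_given_c):
    """log P(doc | c) under the Bernoulli model.

    Each vocabulary word contributes log P(i|c) if it occurs in the document
    and log(1 - P(i|c)) if it does not.
    """
    present = set(doc_words)
    total = 0.0
    for word in vocab:
        p = p_word_given_c[word]
        total += math.log(p if word in present else 1.0 - p)
    return total

# Hypothetical class-conditional occurrence probabilities P(i|c)
vocab = ["good", "bad", "movie"]
p_word_given_c = {"good": 0.5, "bad": 0.25, "movie": 0.5}

# "bad" is absent, so it contributes log(1 - 0.25)
log_lik = bernoulli_log_likelihood(["good", "movie"], vocab, p_word_given_c)
print(log_lik)
```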

4. Implementation (spam classification)

Reference: Machine Learning in Action

Source code and data: https://github.com/JieruZhang/MachineLearninginAction_src

Hand-written naive Bayes in Python:

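import re
import random
from numpy import array, log, ones, zeros

# Tokenization: split on runs of non-word characters
# (r'\W*' can match the empty string, which makes Python 3.7+ split between every character)
def textParse(s):
    tokens = re.split(r'\W+', s)
    # Lowercase, keeping only tokens longer than 2 characters
    return [tok.lower() for tok in tokens if len(tok) > 2]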

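# Build a vocabulary with no duplicate words by going through a set
def createVocab(fullText):
    return list(set(fullText))

# Convert a list of words into a vector: if a word is present, set the matching position
# of the vocabulary to 1 (set-of-words features as in the Bernoulli model: only presence
# is recorded, not the count)
def words2Vec(vocabs, words):
    vec = [0 for _ in range(len(vocabs))]
    for word in words:
        if word in vocabs:
            vec[vocabs.index(word)] = 1
    return vec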
        
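# Training
def trainNB(trainMat, trainClasses):
    numDocs = len(trainMat)
    numWords = len(trainMat[0])
    # Prior probability of spam (class 1)
    pSpam = sum(trainClasses)/float(numDocs)
    # Numerators: per-word counts for class 0 and class 1, started at 1
    # (add-one smoothing) so that no word ends up with log(0) = -inf
    p0num = ones(numWords)
    p1num = ones(numWords)
    # Denominators: total word count per class, started at the vocabulary size
    p0denom = float(numWords)
    p1denom = float(numWords)
    for i in range(numDocs):
        # Accumulate numerator and denominator for class 1, otherwise for class 0
        if trainClasses[i] == 1:
            p1num += trainMat[i]
            p1denom += sum(trainMat[i])
        else:
            p0num += trainMat[i]
            p0denom += sum(trainMat[i])
    # p1 and p0 are vectors; each position holds the probability of the corresponding word:
    # p1[i] = p(wi|c=1), p0[i] = p(wi|c=0)
    # Taking the natural log turns products into sums and prevents underflow
    p1 = log(p1num/p1denom)
    p0 = log(p0num/p0denom)
    return p0, p1, pSpam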

# Classification: p0vec and p1vec are already log probabilities, pSpam is not
def classifyNB(wordvec,p0vec,p1vec,pSpam):
    p1 = sum(wordvec*p1vec) + log(pSpam)
    p0 = sum(wordvec*p0vec) + log(1.0-pSpam)
    # Return the class with the larger log posterior
    if p1 > p0:
        return 1
    else:
        return 0

    
def spamTest():
    docs = []
    classes = []
    fullText = []
    # 25 positive examples and 25 negative examples in total
    for i in range(1,26):
        # Read each email as one long string and tokenize it with textParse
        # docs holds each tokenized email, fullText holds all words, classes holds the labels (spam is the positive class)
        words = textParse(open('email/spam/%d.txt'%i).read())
        docs.append(words)
        fullText.extend(words)
        classes.append(1)
        # Same for the negative examples
        words = textParse(open('email/ham/%d.txt'%i,encoding='gbk').read())
        docs.append(words)
        fullText.extend(words)
        classes.append(0)
    # Build the vocabulary
    vocabs = createVocab(fullText)
    # Of the 50 emails, pick 10 at random as the test set and keep the other 40 for training;
    # trainIndex and testIndex hold the indices of the selected emails
    trainIndex = [i for i in range(50)]
    testIndex = []
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainIndex)))
        testIndex.append(trainIndex[randIndex])
        del trainIndex[randIndex]
    # Convert each training email's word list into a vector and collect the vectors and labels
    trainMat = []
    trainClasses = []
    for index in trainIndex:
        trainMat.append(words2Vec(vocabs, docs[index]))
        trainClasses.append(classes[index])
    # Train the model to get the conditional probability vectors and the prior pSpam
    p0, p1, pSpam = trainNB(array(trainMat),array(trainClasses))
    # Evaluate on the test set
    errorCount = 0
    for index in testIndex:
        wordvec = words2Vec(vocabs, docs[index])
        if classifyNB(array(wordvec), p0, p1, pSpam) != classes[index]:
            errorCount += 1
    print('error rate is:', float(errorCount)/len(testIndex))
    
spamTest()
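
Why the word counts need smoothing before the log is taken can be seen in a small NumPy sketch. The counts and the document vector below are made up for illustration. A word with zero count gives log(0) = -inf, and multiplying that by a 0 entry of the document vector gives nan, which then poisons the whole score.

```python
import numpy as np

# Hypothetical per-word counts for one class; the second word never occurs in it
counts = np.array([3.0, 0.0, 1.0])
# Document vector: the document does not even contain the unseen word
doc = np.array([1, 0, 1])

with np.errstate(divide="ignore", invalid="ignore"):
    log_p_raw = np.log(counts / counts.sum())   # log(0) -> -inf
    score_raw = np.sum(doc * log_p_raw)         # 0 * -inf -> nan

# Add-one smoothing: +1 on every count, + vocabulary size on the denominator
log_p_smooth = np.log((counts + 1) / (counts.sum() + len(counts)))
score_smooth = np.sum(doc * log_p_smooth)

print(score_raw, score_smooth)
```

Any comparison with nan is False, so a classifier written as `if p1 > p0: return 1 else: return 0` would fall through to class 0 whenever this happens.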

Implementation with the sklearn package:

# Classification with the sklearn toolkit
from sklearn.naive_bayes import MultinomialNB

def spamNBsklearn():
    # Data preparation is the same as above
    docs = []
    classes = []
    fullText = []
    for i in range(1,26):
        words = textParse(open('email/spam/%d.txt'%i).read())
        docs.append(words)
        fullText.extend(words)
        classes.append(1)
        # Same for the negative examples
        words = textParse(open('email/ham/%d.txt'%i,encoding='gbk').read())
        docs.append(words)
        fullText.extend(words)
        classes.append(0)
    # Build the vocabulary
    vocabs = createVocab(fullText)
    # Of the 50 emails, pick 10 at random as the test set and keep the other 40 for training;
    # trainIndex and testIndex hold the indices of the selected emails
    trainIndex = [i for i in range(50)]
    testIndex = []
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainIndex)))
        testIndex.append(trainIndex[randIndex])
        del trainIndex[randIndex]
    # Convert each training email's word list into a vector and collect the vectors and labels
    trainMat = []
    trainClasses = []
    for index in trainIndex:
        trainMat.append(words2Vec(vocabs, docs[index]))
        trainClasses.append(classes[index])
    # Convert each test email's word list into a vector and collect the vectors and labels
    testMat = []
    testClasses = []
    for index in testIndex:
        testMat.append(words2Vec(vocabs, docs[index]))
        testClasses.append(classes[index])
    # Train with sklearn (multinomial model)
    clf = MultinomialNB()
    clf.fit(trainMat, trainClasses)
    # Test: clf.score returns the mean accuracy on the test samples
    score = clf.score(testMat, testClasses)
    print('error rate is:', 1-score)

spamNBsklearn()
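
sklearn can also build the count vectors itself. The sketch below is an illustration on the small movie review set from section 3 rather than on the email data. It assumes `CountVectorizer` with its default settings, which lowercases the text and drops one-letter tokens such as "I", so the vocabulary is not identical to a hand-built one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = [
    "Good movie.", "I like it.", "So amazing.", "Great movie.",
    "Bad movie.", "I hate it.", "So boring.", "Boring movie.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words counts; the default tokenizer keeps only tokens of 2+ characters
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# alpha=1.0 is add-one (Laplace) smoothing
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)

pred = clf.predict(vectorizer.transform(["I hate bad movie."]))[0]
print(pred)
```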
