Naive Bayes Algorithm

0. Naive Bayes

  • Naive: the features are assumed conditionally independent
  • Bayes: based on Bayes' theorem

1. Strengths and Weaknesses of Naive Bayes

  • Strengths: still effective when data is scarce; can handle multi-class problems
  • Weaknesses: sensitive to how the input data is prepared
  • Suitable data types: nominal (categorical) data

2. Algorithm Idea:

  •     For example, suppose we want to decide whether an email is spam. What we observe is the distribution of words in the email; if we also know how often certain words appear in spam, we can apply Bayes' theorem to get the answer.

  •   One assumption the naive Bayes classifier makes is that every feature is equally important.

  • Bayes' theorem:
    $$p(y=C_k \mid X) = \frac{p(X \mid y=C_k)\, p(y=C_k)}{p(X)}$$
  • When X is an m-dimensional vector, the conditional independence of the features (the "naive" part) and the law of total probability in the denominator expand this to
    $$p(y=C_k \mid X) = \frac{\prod_{i=1}^{m} p(x_i \mid y=C_k)\, p(y=C_k)}{\sum_{k}\prod_{i=1}^{m} p(x_i \mid y=C_k)\, p(y=C_k)}$$

That is, given the words that appear in a sentence, we compute the probability that it is abusive speech versus normal speech.
1. In this algorithm, p(y=1) can be estimated directly from the training samples (pAbusive in the code).
2. The repeated multiplication is replaced by repeated addition via logarithms, using the identity log(a·b) = log(a) + log(b), as sketched below.
3. To be continued…
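
As a concrete illustration of step 2, here is a minimal sketch of scoring one class in log space. The probability values below are made up for the example and are not taken from the code that follows:

import math

# hypothetical values, for illustration only
pClass1 = 0.5                                         # p(y=1)
pWordGivenClass1 = {'stupid': 0.15, 'garbage': 0.05}  # p(x_i | y=1)

# log p(y=1) + sum_i log p(x_i | y=1): addition replaces the product
# of many small probabilities, which would otherwise underflow
score1 = math.log(pClass1) + sum(math.log(pWordGivenClass1[w])
                                 for w in ['stupid', 'garbage'])
print(score1)  # compare with the analogous score for class 0; the larger wins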

3. Pseudocode

count the number of documents in each class
for every training document:
    for each class:
        if a token appears in the document: increment that token's count
        increment the total token count
for each class:
    for each token:
        divide the token's count by the total token count to get the conditional probability
return the conditional probabilities for each class

4. Algorithm Code

# -*- coding:utf-8 -*-

from numpy import *


# build the data set
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = abusive text, 0 = normal speech
    return postingList, classVec  # return the input data and the label vector

# build the vocabulary
def createWordList(dataSet):
    wordList = set([])
    for line in dataSet:
        wordList = wordList | set(line)  # | is set union here, merging in each document's words
    return list(wordList)  # list of unique words, no duplicates


def word2Vec(wordList, inputSet):  # bag-of-words: record occurrence counts, not just 0/1 (multinomial model); a set-of-words variant is sketched after this listing
    Vec = [0] * len(wordList)
    for word in inputSet:
        if word in wordList:
            Vec[wordList.index(word)] += 1
        else:
            print("%s is not in this word list !!!" % word)
    return Vec


def classifyNB(inputVec, p0Vec, p1Vec, pClass1):
    # log posterior (up to a shared constant): sum_i log p(x_i|y) + log p(y)
    p1 = sum(inputVec * p1Vec) + log(pClass1)
    p0 = sum(inputVec * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 'abuse !!!'
    else:
        return 'normal ^_^'


def trainNB(trainMatrix, trainLabel):  # document matrix and class labels
    numTrainMatrix = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainLabel) / float(numTrainMatrix)  # p(y=1), probability of an abusive sentence
    print("sum of trainLabel", sum(trainLabel))
    print("pAbusive", pAbusive)  # prior probability
    # Laplace smoothing: initialize counts to 1 and denominators to 2
    # so unseen words do not produce zero probabilities
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainMatrix):
        if trainLabel[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take logs so that log(a*b) = log(a) + log(b) lets classifyNB add instead
    # of multiply, avoiding underflow and precision problems
    p1Vect = log(p1Num / p1Denom)  # log conditional probabilities for the abusive class
    p0Vect = log(p0Num / p0Denom)  # log conditional probabilities for the normal class

    print("p1num", p1Num)
    print("p1Denom", p1Denom)
    print("p0num", p0Num)
    print("p0Denom", p0Denom)

    return p0Vect, p1Vect, pAbusive


def testNB():
    dataSet, labelSet = loadDataSet()
    # print("dataSet", dataSet)
    wordList = createWordList(dataSet)
    # print("wordList", wordList)
    trainMat = []
    for item in dataSet:
        trainMat.append(word2Vec(wordList, item))
    print("labelset", labelSet)
    print("trainMat", trainMat)
    print("shape of trainMat", shape(trainMat))
    p0v, p1v, pAb = trainNB(trainMat, labelSet)
    print("p0v", p0v)
    print("p1v", p1v)
    print("pab", pAb)
    # test 1
    testData = ['love', 'my', 'dalmation']
    thisDoc = array(word2Vec(wordList, testData))
    print(testData, "test result is --", classifyNB(thisDoc, p0v, p1v, pAb))
    # test 2
    testData = ['stupid', 'garbage']
    thisDoc = array(word2Vec(wordList, testData))
    print(testData, "test result is --", classifyNB(thisDoc, p0v, p1v, pAb))

if __name__ == '__main__':
    testNB()
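
As referenced in the word2Vec comment above, a set-of-words variant records only whether a word appears, not how many times (a Bernoulli-style model). A minimal sketch, reusing the wordList produced by createWordList:

def setOfWords2Vec(wordList, inputSet):  # set-of-words: 0/1 presence instead of counts
    Vec = [0] * len(wordList)
    for word in inputSet:
        if word in wordList:
            Vec[wordList.index(word)] = 1  # mark presence only
    return Vec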

5. Run Output

'''
labelset [0, 1, 0, 1, 0, 1]
trainMat [[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]
shape of trainMat (6, 32)
sum of trainLabel 3
pAbusive 0.5
p1num [ 1.  1.  1.  2.  2.  1.  1.  1.  2.  2.  1.  1.  1.  2.  2.  2.  2.  2.
  1.  3.  1.  2.  2.  1.  3.  1.  4.  1.  2.  1.  1.  1.]
p1Denom 21.0
p0num [ 2.  2.  2.  1.  1.  2.  2.  2.  1.  2.  2.  2.  2.  1.  1.  3.  1.  1.
  2.  1.  2.  2.  1.  2.  2.  2.  1.  2.  1.  2.  2.  4.]
p0Denom 26.0
p0v [-2.56494936 -2.56494936 -2.56494936 -3.25809654 -3.25809654 -2.56494936
 -2.56494936 -2.56494936 -3.25809654 -2.56494936 -2.56494936 -2.56494936
 -2.56494936 -3.25809654 -3.25809654 -2.15948425 -3.25809654 -3.25809654
 -2.56494936 -3.25809654 -2.56494936 -2.56494936 -3.25809654 -2.56494936
 -2.56494936 -2.56494936 -3.25809654 -2.56494936 -3.25809654 -2.56494936
 -2.56494936 -1.87180218]
p1v [-3.04452244 -3.04452244 -3.04452244 -2.35137526 -2.35137526 -3.04452244
 -3.04452244 -3.04452244 -2.35137526 -2.35137526 -3.04452244 -3.04452244
 -3.04452244 -2.35137526 -2.35137526 -2.35137526 -2.35137526 -2.35137526
 -3.04452244 -1.94591015 -3.04452244 -2.35137526 -2.35137526 -3.04452244
 -1.94591015 -3.04452244 -1.65822808 -3.04452244 -2.35137526 -3.04452244
 -3.04452244 -3.04452244]
pab 0.5
['love', 'my', 'dalmation'] test result is -- normal ^_^
['stupid', 'garbage'] test result is -- abuse !!!

'''
# The raw (un-logged) p1v and p0v probability vectors are shown below.
# The first word, 'cute', appears in class 0 but not in class 1.
# In class 1 the sixth value from the end stands out; that word is 'stupid', the single most indicative word for the abusive class (see the sketch after this listing).
'''
p0v [ 0.04166667  0.04166667  0.04166667  0.          0.          0.04166667
  0.04166667  0.04166667  0.          0.04166667  0.04166667  0.04166667
  0.04166667  0.          0.          0.08333333  0.          0.
  0.04166667  0.          0.04166667  0.04166667  0.          0.04166667
  0.04166667  0.04166667  0.          0.04166667  0.          0.04166667
  0.04166667  0.125     ]
p1v [ 0.          0.          0.          0.05263158  0.05263158  0.          0.
  0.          0.05263158  0.05263158  0.          0.          0.
  0.05263158  0.05263158  0.05263158  0.05263158  0.05263158  0.
  0.10526316  0.          0.05263158  0.05263158  0.          0.10526316
  0.          0.15789474  0.          0.05263158  0.          0.          0.        ]
'''
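
As a quick check of that observation, the most indicative token per class can be read off with argmax (this works on either the raw vectors or the logged ones, since log is monotonic). A minimal sketch, assuming wordList and the trained vectors from testNB:

import numpy as np

def mostIndicative(wordList, pVec):  # word with the highest class-conditional probability
    return wordList[int(np.argmax(pVec))]

# e.g. mostIndicative(wordList, p1v) should come back as 'stupid'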

6. Using the sklearn Library

# coding:utf-8

import sklearn.naive_bayes


'''
Example data
Gene fragment A  Gene fragment B  Hypertension y1  Gallstones y2  (1 = has the disease, 0 = healthy)
    1        1         1        0
    0        0         0        1
    0        1         0        0
    1        0         0        0
    1        1         0        1
    1        0         0        1
    0        1         1        1
    0        0         0        0
    1        0         1        0
    0        1         0        1
'''

x = [[1, 1], [0, 0], [0, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [1, 0], [0, 1]]
y1 = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]

# train
clf = sklearn.naive_bayes.GaussianNB().fit(x, y1)

p = [[1, 0]]
# predict
print(clf.predict(p))  # [0]

y2 = [0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
clf = sklearn.naive_bayes.GaussianNB().fit(x, y2)

p = [[1, 0]]
print(clf.predict(p))  # [0]
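
Since both features here are binary, sklearn.naive_bayes.BernoulliNB (or MultinomialNB) is arguably a closer fit than GaussianNB, which assumes continuous, normally distributed features. A variant of the example above that only swaps the estimator:

from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB().fit(x, y1)  # models each binary feature with a Bernoulli distribution
print(clf.predict([[1, 0]]))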
