I. Fundamentals
Conditional probability
Total probability formula:
Bayesian inference
Consider a further question: when using this algorithm, if we do not need the actual class probabilities, do we really have to compute the total probability P(B)? Note that we only need to compare P(A1|B) and P(A2|B) and pick the larger one. Since both posteriors share the same denominator, it suffices to compare the numerators, i.e. P(B|A1)P(A1) versus P(B|A2)P(A2). To reduce computation, the total probability formula can therefore be skipped in actual code.
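The argument above can be sketched in a few lines of Python; the priors and likelihoods below are made-up illustrative numbers, not values from any real problem:

```python
# Hypothetical numbers: two classes A1, A2 and one observation B.
p_A = {"A1": 0.6, "A2": 0.4}          # priors P(Ai)
p_B_given_A = {"A1": 0.2, "A2": 0.7}  # likelihoods P(B|Ai)

# Compare only the numerators P(B|Ai) * P(Ai); the shared
# denominator P(B) is never computed.
scores = {a: p_B_given_A[a] * p_A[a] for a in p_A}
best = max(scores, key=scores.get)
print(best)
```

Because P(B) is shared by every class, dropping it never changes which class wins; only the numerators matter for the argmax.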
Naive Bayes
The plain Bayes formula corresponds to the case of a single feature.
Naive Bayes handles multiple features and assumes the features are mutually independent.
Understanding naive Bayes through an example
Look at the expression P(sneezing|cold) * P(construction worker|cold) * P(cold). All three factors are estimated from the training set; if we ignore the denominator, this product is the (unnormalized) probability that the sample belongs to the class "cold".
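Under the independence assumption, the score in the expression above is just a product of three numbers. A minimal sketch, using made-up placeholder probabilities rather than values from any real training set:

```python
# Illustrative (made-up) estimates from a hypothetical training set.
p_sneeze_given_cold = 0.9   # P(sneezing | cold)
p_builder_given_cold = 0.2  # P(construction worker | cold)
p_cold = 0.5                # P(cold)

# Numerator of the posterior; the denominator P(sneezing, builder)
# is the same for every disease, so it can be ignored.
score_cold = p_sneeze_given_cold * p_builder_given_cold * p_cold
print(score_cold)  # roughly 0.09
```

The same product would be computed for every candidate disease, and the one with the largest score wins.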
II. Supplementary knowledge
Laplace smoothing
When some feature value never co-occurs with a given class in the training set, the corresponding conditional probability is estimated as 0, which drives the whole product for that instance to 0. To solve this, Laplace (add-one) smoothing is applied: add 1 to each numerator and add the number of categories to each denominator.
Suppose a text classifier has three classes C1, C2 and C3, and in the training set a word F1 is observed 0, 990 and 10 times in the three classes, i.e. P(F1|C1)=0, P(F1|C2)=0.99, P(F1|C3)=0.01. Applying Laplace smoothing to these three estimates yields:
1/1003 ≈ 0.001, 991/1003 ≈ 0.988, 11/1003 ≈ 0.011
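The smoothed estimates above can be reproduced directly; the counts follow the article's example (0, 990 and 10 occurrences across the three classes):

```python
counts = [0, 990, 10]  # observed counts of word F1 in classes C1, C2, C3
total = sum(counts)    # 1000 observations in total
k = len(counts)        # 3 categories

# Add-one smoothing: +1 to each numerator, +k to the denominator.
smoothed = [(c + 1) / (total + k) for c in counts]
print([round(p, 3) for p in smoothed])  # [0.001, 0.988, 0.011]
```

Note that the zero count now yields a small but nonzero probability, so no single unseen feature can zero out the whole product.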
Typical applications
Text classification
Spam filtering
Patient classification
Spelling correction
Naive Bayes model variants
Gaussian model: handles continuous-valued features
Multinomial model: the most common; features must be discrete counts
Bernoulli model: features must be discrete and boolean, i.e. true/false or 1/0
III. Implementing the naive Bayes algorithm
# encoding=utf-8
import pandas as pd
import numpy as np
import cv2
import time
# sklearn.cross_validation has been removed; use sklearn.model_selection instead
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Binarize an image
def binaryzation(img):
    cv_img = img.astype(np.uint8)  # convert to NumPy uint8
    cv2.threshold(cv_img, 50, 1, cv2.THRESH_BINARY_INV, cv_img)  # values greater than 50 become 0, all others become 1
    return cv_img
# Training: estimate the prior and conditional probabilities
def Train(trainset, train_labels):
    '''
    :param trainset: feature matrix
    :param train_labels: class labels
    :return: prior and conditional probabilities
    '''
    prior_probability = np.zeros(class_num)                          # prior probabilities
    conditional_probability = np.zeros((class_num, feature_len, 2))  # conditional probabilities, shape (10, 784, 2)
    # Count occurrences
    for i in range(len(train_labels)):   # iterate over all samples; len(train_labels) is the sample count
        img = binaryzation(trainset[i])  # binarize the image so every feature takes only the values 0 and 1
        label = train_labels[i]
        prior_probability[label] += 1    # note: only counts are stored here, not probabilities
        for j in range(feature_len):
            conditional_probability[label][j][img[j]] += 1
    # Rescale the conditional probabilities into [1, 10001]
    for i in range(class_num):
        for j in range(feature_len):
            # After binarization each pixel is either 0 or 1
            pix_0 = conditional_probability[i][j][0]
            pix_1 = conditional_probability[i][j][1]
            # Scale the conditional probabilities of pixel values 0 and 1 to [1, 10001]:
            # the products in calculate_probability then stay integral, and the "+ 1"
            # keeps every factor nonzero (a crude substitute for Laplace smoothing)
            probability_0 = (float(pix_0) / float(pix_0 + pix_1)) * 10000 + 1
            probability_1 = (float(pix_1) / float(pix_0 + pix_1)) * 10000 + 1
            conditional_probability[i][j][0] = probability_0
            conditional_probability[i][j][1] = probability_1
    return prior_probability, conditional_probability
# Compute the (unnormalized) posterior score of one class
def calculate_probability(img, label):
    probability = int(prior_probability[label])
    for j in range(feature_len):
        probability *= int(conditional_probability[label][j][img[j]])
    return probability
# Prediction
def Predict(testset, prior_probability, conditional_probability):
    predict = []
    # For each input x, output the class with the largest posterior probability
    for img in testset:
        img = binaryzation(img)  # binarize the image
        max_label = 0
        max_probability = calculate_probability(img, 0)
        for j in range(1, class_num):
            probability = calculate_probability(img, j)
            if max_probability < probability:
                max_label = j
                max_probability = probability
        predict.append(max_label)
    return np.array(predict)
class_num = 10     # the MNIST dataset has 10 labels: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
feature_len = 784  # each MNIST image has 28*28 = 784 features (pixels)
if __name__ == '__main__':
    print("Start read data")
    time_1 = time.time()
    raw_data = pd.read_csv('../data/train.csv', header=0)  # read the csv data
    data = raw_data.values
    features = data[:, 1:]
    labels = data[:, 0]
    # Hold out data to measure generalization: randomly take 33% as the test set, the rest for training
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)
    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))
    print('Start training')
    prior_probability, conditional_probability = Train(train_features, train_labels)
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))
    print('Start predicting')
    test_predict = Predict(test_features, prior_probability, conditional_probability)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))
    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
Because the scaling trick in the code above is hard to follow, here is an alternative that applies Laplace smoothing directly. It requires modifying the Train and calculate_probability functions (accuracy drops by about 0.0005):
# Training: estimate the prior and conditional probabilities, with explicit Laplace smoothing
def Train(trainset, train_labels):
    '''
    :param trainset: feature matrix
    :param train_labels: class labels
    :return: prior and conditional probabilities
    '''
    prior_probability = np.zeros(class_num)                         # prior probabilities
    conditional_probability = np.ones((class_num, feature_len, 2))  # shape (10, 784, 2); initializing with ones is the "+1" of Laplace smoothing
    # Count occurrences
    for i in range(len(train_labels)):   # iterate over all samples; len(train_labels) is the sample count
        img = binaryzation(trainset[i])  # binarize the image so every feature takes only the values 0 and 1
        label = train_labels[i]
        prior_probability[label] += 1    # class counts
        for j in range(feature_len):
            conditional_probability[label][j][img[j]] += 1
    # Turn the counts into conditional probabilities
    for i in range(class_num):
        conditional_probability[i] = conditional_probability[i] / (prior_probability[i] + 2)  # +2 because each binary feature has 2 possible values
    return prior_probability, conditional_probability
# Key change: the int() around the conditional probabilities below is removed,
# since they are now real probabilities in (0, 1)
def calculate_probability(img, label):
    probability = int(prior_probability[label])
    for j in range(feature_len):
        probability *= conditional_probability[label][j][img[j]]
    return probability
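A third option, avoiding both the integer-scaling trick and products of tiny probabilities, is to work in log space. This is a sketch, not part of the original code; the former globals are passed as explicit arguments so the function is self-contained:

```python
import numpy as np

def calculate_log_probability(img, label, prior_probability, conditional_probability):
    # Summing logarithms replaces the product of probabilities,
    # so very small values never underflow to zero.
    log_prob = np.log(prior_probability[label])
    for j in range(len(img)):
        log_prob += np.log(conditional_probability[label][j][img[j]])
    return log_prob
```

Since the logarithm is monotonic, the class with the largest log-score is the same class that maximizes the original product.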
Bernoulli model: handles discrete boolean features
import numpy as np
from sklearn.naive_bayes import BernoulliNB
# The features are:
# [Walks like a duck, Talks like a duck, Is small]
#
# Walks like a duck: 0 = False, 1 = True
# Talks like a duck: 0 = False, 1 = True
# Is small: 0 = False, 1 = True
# Data
X = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0]])
# label
y = np.array(['Duck', 'Not a Duck', 'Not a Duck'])
# This is the code we need for the Bernoulli model
clf = BernoulliNB()
# We train the model on our data
clf.fit(X, y)
print("What does our model think this should be?")
print("Answer: %s!" % clf.predict([[1, 1, 1]])[0])
Gaussian model: handles continuous-valued features
import numpy as np
from sklearn.naive_bayes import GaussianNB
# The features in X are:
# [Red %, Green %, Blue %]
# Some data:
X = np.array([[.5, 0, .5], [1, 1, 0], [0, 0, 0]])
# Classes: Purple, Yellow, or Black
y = np.array(['Purple', 'Yellow', 'Black'])
# model
clf = GaussianNB()
# We train the model on our data
clf.fit(X, y)
print("What color does our model think this should be?")
print("Answer: %s!" % clf.predict([[1, 0, 1]])[0])
Multinomial model: the most common; features must be discrete counts
import numpy as np
from sklearn.naive_bayes import MultinomialNB
# The features in X are:
# [Size, Weight, Color]
#
# Size: 0 = Small, 1 = Moderate, 2 = Large
# Weight: 0 = Light, 1 = Moderate, 2 = Heavy
# Color: 0 = Red, 1 = Blue, 2 = Brown
# Some data:
X = np.array([[1, 1, 0], [0, 0, 1], [2, 2, 2]])
# Classes: Apple, Blueberry, or Coconut
y = np.array(['Apple', 'Blueberry', 'Coconut'])
# This is the code we need for the Multinomial model
clf = MultinomialNB()
# We train the model on our data
clf.fit(X, y)
print("What fruit does our model think this should be?")
print("Answer: %s!" % clf.predict([[1, 2, 0]])[0])