Top 10 Data Mining Algorithms (4): Naive Bayes


I. Fundamentals

Conditional probability

The conditional probability of event A given event B is

P(A|B) = P(A ∩ B) / P(B)

Law of total probability:

If A1, A2, ..., An partition the sample space, then for any event B

P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) + ... + P(B|An)P(An)
Bayesian inference

Combining the two formulas above gives Bayes' theorem:

P(Ai|B) = P(B|Ai)P(Ai) / P(B)

Now think about one more question: when using this algorithm, if we do not need the actual class probabilities, is it necessary to compute the total probability P(B) at all? Remember that we only need to compare P(A1|B) and P(A2|B) and find the larger one. Since both share the same denominator, it suffices to compare the numerators, i.e. P(B|A1)P(A1) versus P(B|A2)P(A2). So, to reduce computation, the total probability formula can be skipped entirely in actual code.
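Here is a minimal sketch of this shortcut; the priors and likelihoods below are hypothetical numbers, purely for illustration.

# Compare the unnormalized posteriors P(B|Ai)*P(Ai) and skip the shared
# denominator P(B); the argmax over classes is unchanged.
priors = {'A1': 0.6, 'A2': 0.4}        # P(A1), P(A2)
likelihoods = {'A1': 0.2, 'A2': 0.5}   # P(B|A1), P(B|A2)

scores = {c: likelihoods[c] * priors[c] for c in priors}
best = max(scores, key=scores.get)
print(best)  # 'A2', since 0.5 * 0.4 = 0.20 > 0.2 * 0.6 = 0.12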

Naive Bayes

The ordinary Bayes formula corresponds to the case of a single feature. Naive Bayes handles multiple features B1, B2, ..., Bn and assumes the features are conditionally independent of one another given the class, so the likelihood factorizes as:

P(B1, B2, ..., Bn|A) = P(B1|A) * P(B2|A) * ... * P(Bn|A)

Understanding Naive Bayes through an example

The classic example: given a patient who sneezes and works as a construction worker, decide whether they have a cold, based on a small training table of symptoms, occupations, and diagnoses.

Look at the expression P(sneezes|cold) * P(construction worker|cold) * P(cold): all three factors are estimated directly from the training set, and if we ignore the denominator, this product is proportional to the probability that the sample belongs to the class "cold".
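A small sketch of this computation; the miniature training set below is hypothetical, just to make the counting concrete.

# Hypothetical toy training set: each row is (symptom, occupation, diagnosis).
data = [
    ('sneezes',  'nurse',   'cold'),
    ('sneezes',  'worker',  'cold'),
    ('headache', 'worker',  'cold'),
    ('sneezes',  'teacher', 'flu'),
    ('headache', 'nurse',   'flu'),
    ('headache', 'worker',  'flu'),
]

cold = [r for r in data if r[2] == 'cold']

p_cold = len(cold) / len(data)                                    # P(cold) = 1/2
p_sneeze_cold = sum(r[0] == 'sneezes' for r in cold) / len(cold)  # P(sneezes|cold) = 2/3
p_worker_cold = sum(r[1] == 'worker' for r in cold) / len(cold)   # P(worker|cold) = 2/3

# Unnormalized posterior for "cold" given a sneezing construction worker
score_cold = p_sneeze_cold * p_worker_cold * p_cold
print(score_cold)  # (2/3) * (2/3) * (1/2) = 2/9 ≈ 0.222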

II. Supplementary Knowledge

Laplace smoothing

If some feature value never occurs together with some class in the training set (the observed sample library), the corresponding conditional probability is estimated as 0, which in turn drives the whole product for that instance to 0. To solve this problem we apply Laplace smoothing (add-one smoothing): add 1 to each numerator and add the number of classes to each denominator.
Suppose a text classifier has 3 classes C1, C2, C3, and in the given training sample some word F1 has observed counts of 0, 990, and 10 in the respective classes, i.e. P(F1|C1) = 0, P(F1|C2) = 0.99, P(F1|C3) = 0.01. Applying Laplace smoothing to these three quantities gives:
1/1003 = 0.001, 991/1003 = 0.988, 11/1003 = 0.011
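A quick sketch that reproduces these numbers:

import numpy as np

counts = np.array([0, 990, 10])               # observed counts of F1 in C1, C2, C3
k = len(counts)                               # number of classes

raw = counts / counts.sum()                   # [0.   0.99 0.01] -- C1 gets probability zero
smoothed = (counts + 1) / (counts.sum() + k)  # add 1 to numerators, k to the denominator
print(smoothed.round(3))                      # [0.001 0.988 0.011]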

Practical application scenarios

Text classification
Spam filtering
Patient classification
Spell checking

Naive Bayes models

Gaussian model: for features that are continuous variables
Multinomial model: the most common; requires discrete features
Bernoulli model: requires discrete, boolean features, i.e. true/false or 1/0

III. Implementing Naive Bayes

# encoding=utf-8

import pandas as pd
import numpy as np
import cv2
import time

# sklearn.cross_validation has been deprecated
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Binarization
def binaryzation(img):
    cv_img = img.astype(np.uint8)  # convert to NumPy's uint8 type
    cv2.threshold(cv_img, 50, 1, cv2.THRESH_BINARY_INV, cv_img)  # values greater than 50 become 0, all others become 1
    return cv_img

# Training: compute the prior counts and the (scaled) conditional probabilities
def Train(trainset, train_labels):
    '''
    :param trainset: feature matrix
    :param train_labels: class labels
    :return:
    '''
    print(len(trainset))

    prior_probability = np.zeros(class_num)                           # prior (stored as raw class counts)
    conditional_probability = np.zeros((class_num, feature_len, 2))   # conditional probability, shape (10, 784, 2)

    # Count occurrences
    for i in range(len(train_labels)):      # iterate over all samples; len(train_labels) is the total sample count
        img = binaryzation(trainset[i])     # binarize the image so every feature takes only the values 0 and 1
        label = train_labels[i]

        prior_probability[label] += 1       # per-class counts; note these are counts, not yet probabilities

        for j in range(feature_len):
            conditional_probability[label][j][img[j]] += 1


    # Rescale the conditional probabilities into [1, 10001]
    for i in range(class_num):
        for j in range(feature_len):

            # After binarization each pixel takes only the values 0 and 1
            pix_0 = conditional_probability[i][j][0]
            pix_1 = conditional_probability[i][j][1]

            # Conditional probabilities for pixel values 0 and 1, scaled by
            # 10000 and shifted by +1: every factor becomes a value >= 1, so
            # the product of 784 int-cast factors in calculate_probability
            # never underflows and never collapses to 0 (the +1 also acts as
            # a crude form of smoothing)
            probability_0 = (float(pix_0)/float(pix_0+pix_1))*10000 + 1
            probability_1 = (float(pix_1)/float(pix_0+pix_1))*10000 + 1

            conditional_probability[i][j][0] = probability_0
            conditional_probability[i][j][1] = probability_1


    return prior_probability, conditional_probability

# Compute an unnormalized, scaled score for one image belonging to one class
def calculate_probability(img, label):
    probability = int(prior_probability[label])

    for j in range(feature_len):
        probability *= int(conditional_probability[label][j][img[j]])

    return probability

# Prediction
def Predict(testset, prior_probability, conditional_probability):
    predict = []

    # For every input x, output the class with the largest posterior score
    for img in testset:

        img = binaryzation(img)  # binarize the image

        max_label = 0
        max_probability = calculate_probability(img, 0)

        for j in range(1, class_num):
            probability = calculate_probability(img, j)

            if max_probability < probability:
                max_label = j
                max_probability = probability

        predict.append(max_label)

    return np.array(predict)


class_num = 10     # the MNIST dataset has 10 labels: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
feature_len = 784  # each MNIST image has 28*28 = 784 features (pixels)

if __name__ == '__main__':

    print("Start read data")
    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0)  # read the csv data
    data = raw_data.values

    features = data[:, 1:]
    labels = data[:, 0]

    # Hold out a random 33% of the data as the test set; the rest is used for training
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))


    print('Start training')
    prior_probability, conditional_probability = Train(train_features, train_labels)

    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))


    print('Start predicting')
    test_predict = Predict(test_features, prior_probability, conditional_probability)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))


    score = accuracy_score(test_labels, test_predict)
    print("The accruacy score is %f" % score)

Since I honestly cannot follow the [1, 10001] scaling trick in the code above, here is an alternative that uses standard Laplace smoothing instead; it requires modifying the Train and calculate_probability functions (accuracy drops by about 0.0005):

# Training: compute the prior counts and the conditional probabilities
def Train(trainset, train_labels):
    '''
    :param trainset: feature matrix
    :param train_labels: class labels
    :return:
    '''
    prior_probability = np.zeros(class_num)                          # prior (stored as raw class counts)
    conditional_probability = np.ones((class_num, feature_len, 2))   # conditional probability, shape (10, 784, 2); starting from 1 supplies the Laplace "+1" in each numerator

    # Count occurrences
    for i in range(len(train_labels)):      # iterate over all samples; len(train_labels) is the total sample count
        img = binaryzation(trainset[i])     # binarize the image so every feature takes only the values 0 and 1
        label = train_labels[i]
        prior_probability[label] += 1       # per-class counts

        for j in range(feature_len):
            conditional_probability[label][j][img[j]] += 1


    # Turn the counts into smoothed conditional probabilities:
    # (count + 1) / (class count + 2), where 2 is the number of possible pixel values
    for i in range(class_num):
        conditional_probability[i] = conditional_probability[i] / (prior_probability[i] + 2)


    return prior_probability, conditional_probability

# Key point: the int() casts on the conditional probabilities below are gone!
def calculate_probability(img, label):
    probability = int(prior_probability[label])

    for j in range(feature_len):
        probability *= conditional_probability[label][j][img[j]]

    return probability
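One caveat with this version: multiplying 784 probabilities, each below 1, risks floating-point underflow for some inputs. A common safeguard, sketched here as a drop-in variant under the same globals (prior_probability, conditional_probability, feature_len), is to sum logarithms instead of multiplying; the logarithm is monotone, so the argmax over classes is unchanged:

import numpy as np

def calculate_log_probability(img, label):
    # Laplace smoothing guarantees every conditional probability is > 0,
    # so taking the log is always safe
    log_probability = np.log(prior_probability[label])

    for j in range(feature_len):
        log_probability += np.log(conditional_probability[label][j][img[j]])

    return log_probability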

Bernoulli model: for discrete (boolean) features

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Features take the following form:
# [Walks like a duck, Talks like a duck, Is small]
#
# Walks like a duck: 0 = False, 1 = True
# Talks like a duck: 0 = False, 1 = True
# Is small: 0 = False, 1 = True

# Data
X = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0]])
# label
y = np.array(['Duck', 'Not a Duck', 'Not a Duck'])

# This is the code we need for the Bernoulli model
clf = BernoulliNB()
# We train the model on our data
clf.fit(X, y)


print("What does our model think this should be?")
print("Answer: %s!" % clf.predict([[1, 1, 1]])[0])

Gaussian model: for features that are continuous variables

import numpy as np
from sklearn.naive_bayes import GaussianNB

# The features X take the following form:
# [Red %, Green %, Blue %]

# Some data:
X = np.array([[.5, 0, .5], [1, 1, 0], [0, 0, 0]])
# Classes: Purple, Yellow, or Black
y = np.array(['Purple', 'Yellow', 'Black'])

# model
clf = GaussianNB()
# We train the model on our data
clf.fit(X, y)



print("What color does our model think this should be?")
print("Answer: %s!" % clf.predict([[1, 0, 1]])[0])

Multinomial model: the most common; requires discrete features

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# The features X take the following form:
# [Size, Weight, Color]
#
# Size: 0 = Small, 1 = Moderate, 2 = Large
# Weight: 0 = Light, 1 = Moderate, 2 = Heavy
# Color: 0 = Red, 1 = Blue, 2 = Brown



# Some data:
X = np.array([[1, 1, 0], [0, 0, 1], [2, 2, 2]])
# Classes: Apple, Blueberry, or Coconut
y = np.array(['Apple', 'Blueberry', 'Coconut'])


# This is the code we need for the Multinomial model
clf = MultinomialNB()
# We train the model on our data
clf.fit(X, y)



print("What fruit does our model think this should be?")
print("Answer: %s!" % clf.predict([[1, 2, 0]])[0])