[Hands-On Machine Learning, from Getting Started to Giving Up] Naive Bayes

Naive Bayes

The idea behind naive Bayes is very simple: using Bayes' theorem, the posterior probability is computed from the prior probability and the class-conditional likelihoods. The principle is covered in many places online, so it is not elaborated here.
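The posterior computation itself can be sketched in a few lines. This is a toy example with made-up priors and likelihoods (two binary features, two hypothetical classes "spam" and "ham"):

```python
# Toy naive Bayes posterior: P(y|x) ∝ P(y) * Π_i P(x_i|y),
# with made-up priors and per-feature likelihoods for two classes.
priors = {"spam": 0.3, "ham": 0.7}
likelihood = {            # P(x_i = 1 | y) for two observed features
    "spam": [0.8, 0.6],
    "ham":  [0.2, 0.3],
}
unnorm = {c: priors[c] * likelihood[c][0] * likelihood[c][1] for c in priors}
Z = sum(unnorm.values())  # normalizing constant
posterior = {c: v / Z for c, v in unnorm.items()}
print(posterior)          # "spam" dominates despite its smaller prior
```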
One thing worth mentioning here comes from the following paper:

The Optimality of Naive Bayes
Harry Zhang
Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada E3B 5A3 email: [email protected]

Although naive Bayes assumes the features are conditionally independent (which rarely holds in practice), it still achieves good results in real applications. The paper above gives a mathematical analysis of how local dependencies between features affect the overall classification; interested readers may want to take a look.

Another point worth noting: although naive Bayes (NB) is a good classifier, it is a poor tool for regression, and the probabilities it outputs are not very reliable. It may well assign a sample to a class with 60% probability when the true confidence is closer to 90%; either way, though, the classification itself is correct.
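If better-calibrated probabilities are needed, sklearn's CalibratedClassifierCV can be wrapped around the NB model. A minimal sketch on synthetic data (not the salary dataset used below):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_c, y_c = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_c, y_c, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)
# Isotonic calibration re-maps NB's raw scores to better-calibrated probabilities
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated = calibrated.fit(X_tr, y_tr)
print(raw.predict_proba(X_te)[:3])
print(calibrated.predict_proba(X_te)[:3])
```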

Below we use sklearn to train naive Bayes models.

import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB, ComplementNB
from sklearn import metrics
from sklearn.model_selection import train_test_split

The sklearn package provides four naive Bayes variants, corresponding to four different assumptions about the conditional distribution P(x_i | y):

  • GaussianNB — Gaussian-distributed features
  • BernoulliNB — binary (0-1) features
  • MultinomialNB — multinomially distributed (count) features
  • ComplementNB — a variant that compensates for the multinomial model's overly strong assumptions; suited to imbalanced samples
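As a quick sanity check of the four variants, here is a sketch on a tiny made-up non-negative matrix (MultinomialNB and ComplementNB require non-negative features):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB, ComplementNB

# Tiny made-up count matrix; rows = samples, columns = features
X_toy = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 2], [0, 2, 3]])
y_toy = np.array([0, 0, 1, 1])

for clf in (GaussianNB(), BernoulliNB(), MultinomialNB(), ComplementNB()):
    clf.fit(X_toy, y_toy)
    print(type(clf).__name__, clf.predict([[1, 0, 0], [0, 2, 2]]))
```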
X = pd.read_csv('american_salary_feture.csv')
y = pd.read_csv('american_salary_label.csv', header=None)
y = np.array(y).ravel()

Training without a prior

gnb = GaussianNB()
bnb = BernoulliNB()
mnb = MultinomialNB()
cnb = ComplementNB()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gnb = gnb.fit(X_train, y_train)
bnb = bnb.fit(X_train, y_train)
mnb = mnb.fit(X_train, y_train)
cnb = cnb.fit(X_train, y_train)
score_gnb = gnb.score(X_test, y_test)
score_bnb = bnb.score(X_test, y_test)
score_mnb = mnb.score(X_test, y_test)
score_cnb = cnb.score(X_test, y_test)
print("gnb:", score_gnb)
print("bnb:", score_bnb)
print("mnb:", score_mnb)
print("cnb:", score_cnb)
gnb: 0.7970765262252795
bnb: 0.7626827171109201
mnb: 0.7870040535560742
cnb: 0.7870040535560742
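Accuracy alone can hide how each class fares, especially with the roughly 3:1 imbalance in this dataset; the metrics module imported above provides per-class numbers. A sketch on synthetic imbalanced data standing in for the salary dataset:

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_s, y_s = make_classification(n_samples=4000, weights=[0.75, 0.25], random_state=0)
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(X_s, y_s, random_state=0)
gnb_demo = GaussianNB().fit(Xs_tr, ys_tr)
pred = gnb_demo.predict(Xs_te)
print(metrics.classification_report(ys_te, pred))  # per-class precision/recall/F1
print(metrics.confusion_matrix(ys_te, pred))
```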

Training with a prior

# First, look at the class distribution of the data itself
print("#0:", (y_train==0).sum())
print("#1:", (y_train==1).sum())
#0: 18509
#1: 5911
print("#0:", (y_test==0).sum())
print("#1:", (y_test==1).sum())
#0: 6211
#1: 1930

The class ratio in this dataset is roughly 3:1. Below we undersample the training set to 1:1 while leaving the test set at about 3:1.

X_train_1 = X_train[y_train==1]
X_train_0 = X_train[y_train==0]
y_train_1 = y_train[y_train==1]
y_train_0 = y_train[y_train==0]
# Keep only the first 5911 rows with label 0
X_train_0 = X_train_0[:5911]
y_train_0 = y_train_0[:5911]
X_train_new = pd.concat([X_train_0, X_train_1], axis=0)
y_train_new = np.concatenate((y_train_0, y_train_1), axis=0)
gnb = GaussianNB()
bnb = BernoulliNB()
mnb = MultinomialNB()
cnb = ComplementNB()
gnb = gnb.fit(X_train_new, y_train_new)
bnb = bnb.fit(X_train_new, y_train_new)
mnb = mnb.fit(X_train_new, y_train_new)
cnb = cnb.fit(X_train_new, y_train_new)
score_gnb = gnb.score(X_test, y_test)
score_bnb = bnb.score(X_test, y_test)
score_mnb = mnb.score(X_test, y_test)
score_cnb = cnb.score(X_test, y_test)
print("gnb:", score_gnb)
print("bnb:", score_bnb)
print("mnb:", score_mnb)
print("cnb:", score_cnb)
gnb: 0.7967080211276256
bnb: 0.7463456577815993
mnb: 0.7868812185235229
cnb: 0.7868812185235229

Next we tell the classifier that the class ratio is 3:1. GaussianNB takes this directly through its priors parameter (BernoulliNB, MultinomialNB, and ComplementNB use class_prior instead).

gnb = GaussianNB(priors=[0.75, 0.25])
gnb = gnb.fit(X_train_new, y_train_new)
score_gnb = gnb.score(X_test, y_test)
print("gnb:", score_gnb)
gnb: 0.7990418867461

With the prior supplied, accuracy improves somewhat. A class prior is especially helpful when training samples are scarce.
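The mechanism is visible directly in predict_proba: lowering a class's prior lowers every sample's posterior for that class. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X_d, y_d = make_classification(n_samples=2000, random_state=0)
uniform = GaussianNB(priors=[0.5, 0.5]).fit(X_d, y_d)
skewed = GaussianNB(priors=[0.9, 0.1]).fit(X_d, y_d)
# Shrinking the prior on class 1 shrinks every sample's posterior for class 1
p_uniform = uniform.predict_proba(X_d)[:, 1].mean()
p_skewed = skewed.predict_proba(X_d)[:, 1].mean()
print(p_uniform, p_skewed)
```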

Next we gradually shrink the training set to see what the prior contributes at each size. Here we give the model a 1:1 prior that does not match the roughly 3:1 data, and compare scores with and without that prior across training-set sizes.

j = 0
score_without_prior = np.zeros(99)
score_with_prior = np.zeros(99)
for i in range(50, 5000, 50):
    # Take the first i training samples (class ratio stays roughly 3:1)
    X_train_new = X_train[:i]
    y_train_new = y_train[:i]

    gnb = GaussianNB()
    gnb = gnb.fit(X_train_new, y_train_new)
    score_without_prior[j] = gnb.score(X_test, y_test)

    # Force a 1:1 prior that does not match the data
    gnb = GaussianNB(priors=[0.5, 0.5])
    gnb = gnb.fit(X_train_new, y_train_new)
    score_with_prior[j] = gnb.score(X_test, y_test)
    j = j + 1
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
data = pd.DataFrame({"score_without_prior": score_without_prior,
                     "score_with_prior": score_with_prior},
                    index=range(50, 5000, 50))
sns.lineplot(data=data)
plt.xlabel("number of training samples")
plt.ylabel("score")
plt.title("score vs. number of training samples")

(Figure: test score vs. number of training samples, with and without the fixed prior)
