Platform: Windows 10, Python 3.7, Jupyter
Data download: https://www.lanzous.com/iaghe8f
1. Importing the Data
import pandas as pd
# BernoulliNB: Bernoulli distribution, two outcomes with roughly equal probability, like the two faces of a coin
# MultinomialNB: multinomial distribution, several outcomes with roughly equal probability, like the six faces of a die
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
1.1 Reading the "SMSSpamCollection" data
# read the data and name the columns
data = pd.read_csv('../data/SMSSpamCollection', sep='\t', header=None, names=['target', 'message'])
data.shape  # (5572, 2)
data.head()
| | target | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
Interpretation: messages come in just two classes: ham marks a normal message and spam marks a junk message.
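Before modeling, it is also worth checking the class balance, since heavily skewed labels can make raw accuracy misleading. A minimal sketch of the check on a tiny synthetic frame (the real file path above is assumed unavailable here; on the real data the same call reports the ham/spam counts):

```python
import pandas as pd

# Hypothetical miniature of the SMS frame, just to illustrate the check.
toy = pd.DataFrame({'target': ['ham', 'ham', 'spam', 'ham'],
                    'message': ['hi', 'ok', 'win $$$', 'see you']})
counts = toy['target'].value_counts()
print(counts)
```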
1.2 Defining the features and target
X = data['message']  # the message text
y = data['target']  # the labels
X.unique().size  # number of distinct messages
5169
Interpretation: of the 5572 messages, 5169 are distinct. Deduplication is optional here; the small number of duplicates should not meaningfully change the trained model.
2. Counting Word Frequencies
2.1 Vectorizing the text
# count word frequencies by vectorizing the text
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()  # stop-word removal can be enabled with stop_words='english'
X_cv = cv.fit_transform(X)
X_cv
<5572x8713 sparse matrix of type '<class 'numpy.int64'>'
with 74169 stored elements in Compressed Sparse Row format>
Inspect the vectorized result:
print(X_cv)
(0, 8324) 1
(0, 1082) 1
(0, 3615) 1
(0, 7694) 1
(0, 2061) 1
(0, 1765) 1
: :
(5570, 1802) 1
(5570, 3489) 1
(5570, 2905) 1
(5570, 7099) 1
(5570, 1794) 1
(5570, 8120) 1
(5570, 2606) 1
: :
2.2 Inspecting the vocabulary
cv.vocabulary_
{'go': 3571,
'until': 8084,
'jurong': 4374,
'point': 5958,
'crazy': 2338,
'available': 1316,
'only': 5571,
'in': 4114,
'bugis': 1767,
'great': 3655,
'world': 8548,
...}
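Note that `vocabulary_` maps each token to its column index in the matrix, not to a frequency. A small sketch of looking tokens up by column (corpus invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

cv_small = CountVectorizer()
cv_small.fit(['go until jurong point'])
# vocabulary_ is {token: column_index}; invert it to go from column to token.
index_to_token = {idx: tok for tok, idx in cv_small.vocabulary_.items()}
print(cv_small.vocabulary_)
print(index_to_token[0])
```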
3. Modeling and Evaluation
3.1 Splitting into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_cv, y, test_size=0.2)
3.2 Model evaluation
3.2.1 Gaussian Naive Bayes (normal distribution)
%%time
gNB = GaussianNB()
gNB.fit(X_train.toarray(),y_train)
s = gNB.score(X_test.toarray(),y_test)
print(s)
0.8977578475336323
Wall time: 3.03 s
Interpretation: Gaussian Naive Bayes is not recommended for this task: its accuracy is the lowest of the group and it is slow, partly because it has no sparse-input support, so the matrix must be densified with .toarray() before fitting.
3.2.2 Multinomial Naive Bayes
%%time
mNB = MultinomialNB()
mNB.fit(X_train,y_train)
print(mNB.score(X_test,y_test))
0.9820627802690582
Wall time: 111 ms
Interpretation: this model is far more reliable, reaching about 98% accuracy in only 111 ms.
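Once trained, a model like this can score new messages, as long as the new text goes through the fitted vectorizer's `transform` (not `fit_transform`), so it is mapped onto the vocabulary learned at fit time. A self-contained sketch on toy messages (all names and data here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

msgs = ['win a free prize now', 'free entry to win cash',
        'see you at lunch', 'ok talk later tonight']
labels = ['spam', 'spam', 'ham', 'ham']
cv_toy = CountVectorizer()
clf_toy = MultinomialNB().fit(cv_toy.fit_transform(msgs), labels)
# transform (not fit_transform) reuses the vocabulary learned above.
pred = clf_toy.predict(cv_toy.transform(['free cash prize']))
print(pred)  # -> ['spam']
```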
3.2.3 Bernoulli Naive Bayes
bNB = BernoulliNB()
bNB.fit(X_train,y_train)
bNB.score(X_test,y_test)
0.979372197309417
3.2.4 KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
knn.score(X_test,y_test)
0.9237668161434978
3.2.5 Decision tree
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train,y_train)
clf.score(X_test,y_test)
0.9641255605381166
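The per-model cells above can be condensed into one comparison loop. A sketch on perfectly separable toy data (in practice the real `X_cv` split from above would be substituted, and the scores would match the individual cells rather than the trivial 1.0 here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Toy data: two duplicated, perfectly separable message templates.
msgs = ['free win cash now'] * 30 + ['see you at lunch'] * 30
labels = ['spam'] * 30 + ['ham'] * 30
X_toy = CountVectorizer().fit_transform(msgs)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, labels, test_size=0.2, random_state=0)

results = {}
for name, model in [('MultinomialNB', MultinomialNB()),
                    ('BernoulliNB', BernoulliNB()),
                    ('KNN', KNeighborsClassifier(n_neighbors=3)),
                    ('DecisionTree', DecisionTreeClassifier())]:
    model.fit(X_tr, y_tr)
    results[name] = model.score(X_te, y_te)
    print(name, results[name])
```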
4. A Second Example: Classifying Email Files
import numpy as np
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
# read the email files
email_text = []
for i in ['ham', 'spam']:
    for j in range(1, 26):
        file_path = '../data/email/%s/%d.txt' % (i, j)
        print(file_path)
        with open(file_path) as f:  # the with-block closes the file automatically
            email_text.append(f.read())
# build the labels y: after sorting, 25 'ham' followed by 25 'spam', matching the read order
y = ['ham', 'spam'] * 25
y.sort()
# vectorize the text
cv = CountVectorizer(stop_words='english')
X1 = cv.fit_transform(email_text)
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size=0.2)
mNB = MultinomialNB()  # multinomial Naive Bayes
mNB.fit(X_train, y_train)  # train the model
print(mNB.score(X_test, y_test))  # evaluate the model
The same pipeline also works with TF-IDF weighting:
tfidf = TfidfVectorizer(stop_words='english')
X2 = tfidf.fit_transform(email_text)  # note: vectorize the email texts, not the SMS data from earlier
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.2)
mNB = MultinomialNB()
mNB.fit(X_train, y_train)
print(mNB.score(X_test, y_test))
5. The 20 Newsgroups Dataset
sklearn ships with a newsgroups dataset, so we can load it directly for testing.
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
import sklearn.datasets as datasets  # built-in datasets
data = datasets.fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))  # load the news data
X = data.data
y = data.target
cv = CountVectorizer(stop_words='english')  # ngram_range=(1, 2) would add bigrams as well as unigrams
X2 = cv.fit_transform(X)
# split into training and test sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X2,y,test_size = 0.2)
# train the model
mNB = MultinomialNB()
mNB.fit(X_train,y_train)
# evaluate the model
print (mNB.score(X_test,y_test))
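A single train/test split gives a noisy accuracy estimate; k-fold cross-validation averages several splits. A sketch on toy data (in practice the real `X2`/`y` from above would be substituted, so the mean here is trivially perfect):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

msgs = ['free win cash now'] * 20 + ['meeting at noon ok'] * 20
labels = [1] * 20 + [0] * 20
X_toy = CountVectorizer().fit_transform(msgs)
# cv=5 trains and scores on five different 80/20 splits and returns all five scores.
scores = cross_val_score(MultinomialNB(), X_toy, labels, cv=5)
print(scores.mean())
```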