Naive Bayes: News Classification

The Naive Bayes classifier is built on Bayes' theorem combined with the assumption of conditional independence among features, which sets it apart from models based on linear assumptions, such as linear classifiers and support vector machine classifiers. Two of the most widely used classification models are the Decision Tree model and the Naive Bayesian Model (NBM).
Naive Bayes has a broad range of practical applications, particularly in text classification tasks such as news categorization and spam filtering.
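
Concretely, the "naive" conditional independence assumption lets the class posterior factor over the individual features, so a sample with features x_1, ..., x_n is assigned the class given by the standard decision rule:

\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)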

Below, the classic 20-category newsgroup texts are used as the experimental data:


Python source code:

# coding=utf-8
# load the 20 newsgroups data
from sklearn.datasets import fetch_20newsgroups
# -------------
from sklearn.model_selection import train_test_split
# -------------
from sklearn.feature_extraction.text import CountVectorizer
# -------------
from sklearn.naive_bayes import MultinomialNB
# -------------
from sklearn.metrics import classification_report


# ------------- download the data
news = fetch_20newsgroups(subset='all')
print(len(news.data))
print(news.data[0])
# ------------- split the data
# 75% training set, 25% testing set
X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)
# ------------- convert the texts to count vectors
vec = CountVectorizer()
X_train = vec.fit_transform(X_train)
# X_test = vec.fit_transform(X_test) would learn a new vocabulary and make the
# feature matrices incompatible (a dimension mismatch at prediction time), so
# the test texts are transformed with the vocabulary learned from the training
# set (vec.transform(X_test) is an equivalent shortcut)
vectorizer_test = CountVectorizer(vocabulary=vec.vocabulary_)
X_test = vectorizer_test.transform(X_test)
# ------------- training
# initialize the Naive Bayes model with its default configuration
mnb = MultinomialNB()
# train the model
mnb.fit(X_train, y_train)
# run on the test data
y_predict = mnb.predict(X_test)
# ------------- performance
print('The Accuracy is', mnb.score(X_test, y_test))
print(classification_report(y_test, y_predict, target_names=news.target_names))
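
Before the output, a brief aside on the vectorization step above: fitting a vectorizer on the test texts would learn a different vocabulary, which is why the script reuses vec.vocabulary_. A minimal sketch of the same idea with made-up toy sentences, using the fitted vectorizer's transform method, which is an equivalent shortcut:

from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["the pens beat the devils", "graphics cards and drivers"]  # toy examples
test_texts = ["the devils lost again"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)    # learns the vocabulary from the training texts
X_test = vec.transform(test_texts)          # reuses that vocabulary for the test texts
# both matrices share the same columns, one per vocabulary word
print(X_train.shape[1] == X_test.shape[1])  # True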
Result:

18846
From: Mamatha Devineni Ratnam <[email protected]>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!

The Accuracy is 0.839770797963
                           precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med       0.92      0.94      0.93       245
               sci.space       0.89      0.96      0.92       221
  soc.religion.christian       0.78      0.96      0.86       232
      talk.politics.guns       0.88      0.96      0.92       251
   talk.politics.mideast       0.90      0.98      0.94       231
      talk.politics.misc       0.79      0.89      0.84       188
      talk.religion.misc       0.93      0.44      0.60       158

             avg / total       0.86      0.84      0.82      4712


The dataset contains 18,846 news articles in total. These articles come with neither predefined features nor numerical measurements, so before they are handed to the Naive Bayes classifier for learning, the data must be preprocessed: the texts are converted into feature vectors, the Naive Bayes model then estimates its parameters from the training data, and finally these probability parameters are used to predict the categories of the test samples, which have been converted into feature vectors in the same way.
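
The whole vectorize-train-predict flow can also be written more compactly. A minimal sketch (not part of the original script) using sklearn's make_pipeline, which fits the vectorizer on the training texts and automatically reuses its vocabulary when transforming the test texts:

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

news = fetch_20newsgroups(subset='all')
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# vectorization and classification chained into a single estimator
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))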
Algorithm characteristics:
The Naive Bayes model is widely used for classifying massive amounts of Internet text. Thanks to its strong assumption of conditional independence among features, the number of parameters the model must estimate for prediction drops from an exponential to a linear scale, which greatly reduces memory consumption and computation time. However, this same strong assumption means the model cannot take relationships between features into account during training, so it performs less well on classification tasks whose features are strongly correlated.
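
As a rough illustration of that linear scale (a toy sketch, not from the original article): after fitting, MultinomialNB stores one log prior per class and one smoothed log probability per (class, vocabulary word) pair, so the number of stored parameters grows as (number of classes) x (vocabulary size) rather than exponentially in the number of features:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy count matrix: 4 documents, 6 vocabulary words, 2 classes
X = np.array([[2, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 0, 0, 2, 1, 1],
              [0, 1, 0, 1, 0, 2]])
y = np.array([0, 0, 1, 1])

mnb = MultinomialNB().fit(X, y)
print(mnb.class_log_prior_.shape)   # (2,)   one prior per class
print(mnb.feature_log_prob_.shape)  # (2, 6) one probability per class and word

In the news example above this amounts to 20 classes times the size of the learned vocabulary, which is far smaller than any model of the joint feature distribution would require.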
