Naive Bayes: News Classification

The Naive Bayes classifier is built on Bayes' theorem combined with the assumption of conditional independence among features, which sets it apart from models based on linear assumptions, such as linear classifiers and support vector machine classifiers. Two of the most widely used classification models are the Decision Tree model and the Naive Bayesian Model (NBM).
Naive Bayes has a broad range of practical applications, particularly in text classification tasks such as news categorization and spam filtering.
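
Concretely, the "naive" conditional independence assumption lets the class posterior factor over the individual features, so a sample with features x_1, ..., x_n is assigned the class given by the standard decision rule:

\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)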

Below, the classic 20-category newsgroup texts are used as the experimental data:


Python source code:

# coding=utf-8
# load the 20 newsgroups data
from sklearn.datasets import fetch_20newsgroups
# -------------
from sklearn.model_selection import train_test_split
# -------------
from sklearn.feature_extraction.text import CountVectorizer
# -------------
from sklearn.naive_bayes import MultinomialNB
# -------------
from sklearn.metrics import classification_report


# ------------- download the data
news = fetch_20newsgroups(subset='all')
print(len(news.data))
print(news.data[0])
# ------------- split the data
# 75% training set, 25% testing set
X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)
# ------------- convert the texts to count vectors
vec = CountVectorizer()
X_train = vec.fit_transform(X_train)
# X_test = vec.fit_transform(X_test) would learn a new vocabulary and make the
# feature matrices incompatible (a dimension mismatch at prediction time), so
# the test texts are transformed with the vocabulary learned from the training
# set (vec.transform(X_test) is an equivalent shortcut)
vectorizer_test = CountVectorizer(vocabulary=vec.vocabulary_)
X_test = vectorizer_test.transform(X_test)
# ------------- training
# initialize the Naive Bayes model with its default configuration
mnb = MultinomialNB()
# train the model
mnb.fit(X_train, y_train)
# run on the test data
y_predict = mnb.predict(X_test)
# ------------- performance
print('The Accuracy is', mnb.score(X_test, y_test))
print(classification_report(y_test, y_predict, target_names=news.target_names))
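
Before the output, a brief aside on the vectorization step above: fitting a vectorizer on the test texts would learn a different vocabulary, which is why the script reuses vec.vocabulary_. A minimal sketch of the same idea with made-up toy sentences, using the fitted vectorizer's transform method, which is an equivalent shortcut:

from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["the pens beat the devils", "graphics cards and drivers"]  # toy examples
test_texts = ["the devils lost again"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)    # learns the vocabulary from the training texts
X_test = vec.transform(test_texts)          # reuses that vocabulary for the test texts
# both matrices share the same columns, one per vocabulary word
print(X_train.shape[1] == X_test.shape[1])  # True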
Result:

18846
From: Mamatha Devineni Ratnam <[email protected]>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!

The Accuracy is 0.839770797963
                           precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med       0.92      0.94      0.93       245
               sci.space       0.89      0.96      0.92       221
  soc.religion.christian       0.78      0.96      0.86       232
      talk.politics.guns       0.88      0.96      0.92       251
   talk.politics.mideast       0.90      0.98      0.94       231
      talk.politics.misc       0.79      0.89      0.84       188
      talk.religion.misc       0.93      0.44      0.60       158

             avg / total       0.86      0.84      0.82      4712


The dataset contains 18,846 news articles in total. These articles come with neither predefined features nor numerical measurements, so before they are handed to the Naive Bayes classifier for learning, the data must be preprocessed: the texts are converted into feature vectors, the Naive Bayes model then estimates its parameters from the training data, and finally these probability parameters are used to predict the categories of the test samples, which have been converted into feature vectors in the same way.
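
The whole vectorize-train-predict flow can also be written more compactly. A minimal sketch (not part of the original script) using sklearn's make_pipeline, which fits the vectorizer on the training texts and automatically reuses its vocabulary when transforming the test texts:

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

news = fetch_20newsgroups(subset='all')
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# vectorization and classification chained into a single estimator
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))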
Algorithm characteristics:
The Naive Bayes model is widely used for classifying massive amounts of Internet text. Thanks to its strong assumption of conditional independence among features, the number of parameters the model must estimate for prediction drops from an exponential to a linear scale, which greatly reduces memory consumption and computation time. However, this same strong assumption means the model cannot take relationships between features into account during training, so it performs less well on classification tasks whose features are strongly correlated.
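
As a rough illustration of that linear scale (a toy sketch, not from the original article): after fitting, MultinomialNB stores one log prior per class and one smoothed log probability per (class, vocabulary word) pair, so the number of stored parameters grows as (number of classes) x (vocabulary size) rather than exponentially in the number of features:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy count matrix: 4 documents, 6 vocabulary words, 2 classes
X = np.array([[2, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 0, 0, 2, 1, 1],
              [0, 1, 0, 1, 0, 2]])
y = np.array([0, 0, 1, 1])

mnb = MultinomialNB().fit(X, y)
print(mnb.class_log_prior_.shape)   # (2,)   one prior per class
print(mnb.feature_log_prob_.shape)  # (2, 6) one probability per class and word

In the news example above this amounts to 20 classes times the size of the learned vocabulary, which is far smaller than any model of the joint feature distribution would require.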
