Scikit-learn: Machine Learning in Python, Bayesian Learning

Chapter 2: Naive Bayes

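Naive Bayes is a simple but powerful classifier, a probabilistic model based on Bayes' theorem. In essence, it uses the probability of each feature value to determine the probability that an instance belongs to a class, under the assumption that the features are independent of one another. One very successful application of Naive Bayes is natural language processing (NLP): NLP problems come with large amounts of labeled data (usually text files), which serve as the training set for the algorithm.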
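The independence assumption can be written down compactly. The notation here is an assumption of this sketch rather than the book's: y stands for a class and x_1, ..., x_n for the feature values of one instance.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The classifier picks the class with the largest product, so it only needs the per-class prior and the per-feature conditional probabilities estimated from the training set.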

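This section introduces text classification with Naive Bayes. The dataset is a set of text documents labeled with their categories; we train a Naive Bayes algorithm on it to predict the category of a new, unseen document. The dataset shipped with scikit-learn contains about 19,000 newsgroup posts on 20 different topics, ranging from politics and religion to sports and science.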

from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all')  # download the data and assign it to news

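Note that the data is stored as a list of text contents, not as a matrix. Also, the book uses Python 2 while I use Python 3, so the code differs slightly from the book's.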

print (type(news.data),type(news.target),type(news.target_names))

Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)

print (news.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

print(len(news.data))
print(len(news.target))

18846

print(news.data[0])

From: Mamatha Devineni Ratnam <[email protected]>

Subject: Pens fans reactions

Organization: Post Office, Carnegie Mellon, Pittsburgh, PA

Lines: 12

NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack

of any kind of posts about the recent Pens massacre of the Devils. Actually,

I am bit puzzled too and a bit relieved. However, I am going to put an end

to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they

are killing those Devils worse than I thought. Jagr just showed you why

he is much better than his regular season stats. He is also a lot

fo fun to watch in the playoffs. Bowman should let JAgr have a lot of

fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final

regular season game. PENS RULE!!!

print(news.target[0],news.target_names[news.target[0]])  # target is used as an index into target_names

10 rec.sport.hockey  # indices start at 0

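Preprocessing the data:

The machine learning algorithms in this book work only on numeric data, so the text data has to be converted into numeric data.

At this point there is only one feature, the text content, so we need functions that turn the text into a meaningful set of numeric features. Intuitively, we look at which words (more precisely, tokens, which include numbers and punctuation) occur in each text category, and then try to describe each category by the frequency distribution of those words. sklearn.feature_extraction.text provides utilities for building numeric feature vectors from text documents.

Before transforming the data, split it into a training set and a test set. With the instances in random order, the first 75% form the training set and the remaining 25% form the test set.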

SPLIT_PERC = 0.75  # fraction of the instances used for training
split_size = int(len(news.data) * SPLIT_PERC)
x_train = news.data[:split_size]
x_test = news.data[split_size:]
y_train = news.target[:split_size]
y_test = news.target[split_size:]

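There are three ways to turn text into numeric features: CountVectorizer, HashingVectorizer, and TfidfVectorizer. (They differ in how the numeric features are computed.)

CountVectorizer builds a dictionary (vocabulary) from the text, then converts each instance into a numeric feature vector in which each element is the number of times one particular word occurs in the text.

HashingVectorizer applies a hashing function that maps tokens to feature indices, then computes counts as CountVectorizer does.

TfidfVectorizer works much like CountVectorizer but uses a more advanced calculation, Term Frequency-Inverse Document Frequency (TF-IDF): a statistical measure of how important a word is in a document or corpus. It looks for words that are frequent in the current document compared with how often they occur in the whole document set; this normalizes the result and avoids over-weighting words that are frequent everywhere.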
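As a rough illustration of the difference between raw counts and TF-IDF weights, here is a plain-Python sketch on a two-document toy corpus. The corpus, the whitespace tokenizer, and the textbook formula tf * log(N / df) are assumptions made for this example; scikit-learn's vectorizers use their own tokenization and a smoothed, normalized variant of the formula.

```python
import math
from collections import Counter

# Toy corpus invented for this sketch (not taken from the newsgroups data).
docs = ["the pens beat the devils", "the shuttle reached orbit"]
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct token, in sorted order.
vocab = sorted({tok for toks in tokenized for tok in toks})

# Count vectors: one row per document, one column per vocabulary word.
count_vectors = []
for toks in tokenized:
    counts = Counter(toks)
    count_vectors.append([counts[w] for w in vocab])

# Document frequency: in how many documents each word occurs.
n_docs = len(docs)
doc_freq = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(len(vocab))]

# Textbook TF-IDF: term count times log(N / df).
tfidf_vectors = [
    [tf * math.log(n_docs / df) for tf, df in zip(vec, doc_freq)]
    for vec in count_vectors
]

print(vocab)
print(count_vectors)
print(tfidf_vectors)
```

The word "the" has the highest count in the first document, yet its TF-IDF weight is zero because it occurs in every document, while topic words such as "pens" keep a positive weight.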

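Training a Naive Bayes classifier:

Build a Naive Bayes classifier out of a feature vectorizer and the actual Bayes classifier: use MultinomialNB from the sklearn.naive_bayes module, and Pipeline from the sklearn.pipeline module to combine the vectorizer with the classifier. Here we combine MultinomialNB with each of the three text vectorizers mentioned above to build three different classifiers, then compare which one performs better with default parameters.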

from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer,HashingVectorizer,CountVectorizer

clf_1 = Pipeline([('vect',CountVectorizer()),('clf',MultinomialNB()),])
clf_2 = Pipeline([('vect',HashingVectorizer(alternate_sign=False)),('clf',MultinomialNB()),])  # non_negative=True in scikit-learn < 0.21
clf_3 = Pipeline([('vect',TfidfVectorizer()),('clf',MultinomialNB()),])

Define a function that cross-validates a classifier on the given x and y values:

from sklearn.model_selection import cross_val_score, KFold  # sklearn.cross_validation in scikit-learn < 0.20
import numpy as np
from scipy.stats import sem

def evaluate_cross_validation(clf, x, y, K):
    # create a K-fold cross-validation iterator
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # by default the score is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf, x, y, cv=cv)
    print(scores)
    print("Mean score:{0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))

Then run 5-fold cross-validation on each classifier:

clfs = [clf_1,clf_2,clf_3]
for clf in clfs:
    evaluate_cross_validation(clf,news.data,news.target,5)

The results:

[ 0.85782493 0.85725657 0.84664367 0.85911382 0.8458477 ]

Mean score:0.853 (+/-0.003)

[ 0.75543767 0.77659857 0.77049615 0.78508888 0.76200584]

Mean score:0.770 (+/-0.005)

[ 0.84482759 0.85990979 0.84558238 0.85990979 0.84213319]

Mean score:0.850 (+/-0.004)
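The figure after +/- is the standard error of the mean of the five fold scores, as computed by scipy.stats.sem with its default settings (sample standard deviation divided by the square root of the number of folds). A small sketch using only the standard library, with the first classifier's fold scores copied from the output above, reproduces the summary line:

```python
import math
import statistics

# Fold scores of the CountVectorizer pipeline, copied from the output above.
scores = [0.85782493, 0.85725657, 0.84664367, 0.85911382, 0.8458477]

mean = statistics.mean(scores)
# Standard error of the mean: sample standard deviation / sqrt(n).
std_err = statistics.stdev(scores) / math.sqrt(len(scores))

summary = "Mean score:{0:.3f} (+/-{1:.3f})".format(mean, std_err)
print(summary)
```

Because the standard errors here are only a few thousandths, the gap between the hashing-based pipeline and the other two is far larger than the fold-to-fold noise.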

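As can be seen, CountVectorizer and TfidfVectorizer give better results than HashingVectorizer. Continuing with TfidfVectorizer, we try to improve the results by splitting the documents into tokens with a different regular expression.

The default regular expression, ur"\b\w\w+\b", matches alphanumeric characters and underscores. (Allowing the hyphen and the dot as well may improve tokenization, so that tokens such as Wi-Fi and site.com are kept whole.)

The new regular expression, ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b":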

clf_4 = Pipeline([('vect',TfidfVectorizer(token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)),('clf',MultinomialNB()),])  # Python 3 does not support the ur prefix

evaluate_cross_validation(clf_4,news.data,news.target,5)

The results:

[ 0.86100796 0.8718493 0.86203237 0.87291059 0.8588485 ]

Mean score:0.865 (+/-0.003)

The score improves from 0.850 to 0.865.
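To see what the two token patterns actually extract, the sketch below applies them with Python's re module to a made-up sentence. The sentence is an assumption of this example, and it is lowercased by hand because TfidfVectorizer lowercases text by default before tokenizing.

```python
import re

# Patterns as given in the text, written as Python 3 raw strings.
default_pattern = r"\b\w\w+\b"
new_pattern = r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"

# Made-up sample text, already lowercased.
text = "wi-fi works at site.com"

default_tokens = re.findall(default_pattern, text)
new_tokens = re.findall(new_pattern, text)

print(default_tokens)  # ['wi', 'fi', 'works', 'at', 'site', 'com']
print(new_tokens)      # ['wi-fi', 'works', 'site.com']
```

The new pattern keeps hyphenated and dotted tokens whole. It also requires at least three characters with a letter in the middle or near the start, so short tokens such as "at" are dropped.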

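There is another parameter, stop_words, which lets us pass a list of words to leave out of the calculation, such as words that are too frequent or words that we expect a priori to carry no information about a particular topic.

Define a function that loads the stop words: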

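def get_stop_words():
    # read one stop word per line from stopwords_en.txt
    result = set()
    with open('stopwords_en.txt', 'r') as f:
        for line in f:
            result.add(line.strip())
    # return a list: recent scikit-learn versions reject a set for stop_words
    return sorted(result)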

Then build a new classifier:

clf_5 = Pipeline([('vect',TfidfVectorizer(stop_words= get_stop_words(),token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)),('clf',MultinomialNB()),])
evaluate_cross_validation(clf_5,news.data,news.target,5)

The results:

[ 0.88222812 0.89625895 0.88591138 0.89599363 0.88485009]

Mean score:0.889 (+/-0.003)

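The score improves from 0.865 to 0.889.

Now look at the parameters of MultinomialNB. The most important one is alpha, also called the smoothing parameter. Its default value is 1.0; let us set it to 0.1: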

clf_6 = Pipeline([('vect',TfidfVectorizer(stop_words= get_stop_words(),token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)),('clf',MultinomialNB(alpha=0.1)),])
evaluate_cross_validation(clf_6,news.data,news.target,5)

The results:

[ 0.91405836 0.91589281 0.91085168 0.91721942 0.91509684]

Mean score:0.915 (+/-0.001)

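The score improves from 0.889 to 0.915. A natural next step is to test how different alpha values affect the result and then choose the best one.

Model evaluation:

Define a function that trains the model on the whole training set and evaluates its accuracy on both the training set and the test set.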
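To get a feel for what alpha does, the sketch below applies additive smoothing, (count + alpha) / (total + alpha * vocabulary size), to made-up word counts for a single class. The counts and the helper name smoothed_probs are hypothetical and only illustrate the idea; they are not part of scikit-learn.

```python
import math

def smoothed_probs(counts, alpha):
    """Additive (Laplace/Lidstone) smoothing of word counts for one class."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {
        word: (count + alpha) / (total + alpha * vocab_size)
        for word, count in counts.items()
    }

# Hypothetical word counts observed in one class; "orbit" was never seen.
counts = {"goal": 3, "puck": 1, "orbit": 0}

probs_default = smoothed_probs(counts, alpha=1.0)
probs_small = smoothed_probs(counts, alpha=0.1)

print(probs_default)
print(probs_small)
```

With alpha=1.0 the unseen word receives a fairly large share of the probability mass; with alpha=0.1 the estimates stay closer to the observed frequencies while still avoiding zero probabilities.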

from sklearn import metrics

def train_and_evaluate(clf, x_train, x_test, y_train, y_test):
    clf.fit(x_train, y_train)
    print("Accuracy on training set:")
    print(clf.score(x_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(x_test, y_test))
    # compare the true labels with the model's predictions, not with themselves
    y_pred = clf.predict(x_test)
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))

train_and_evaluate(clf_6,x_train,x_test,y_train,y_test)

Results:

Accuracy on training set:

0.98776001132

Accuracy on testing set:

0.909592529711

As shown above, the results are decent: accuracy on the test set is also about 0.91.
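A classification report and a confusion matrix are only informative when the true labels are compared with the model's predictions. As a minimal plain-Python illustration, the sketch below builds a confusion matrix from toy labels invented for this example (they are not newsgroup results); the helper name toy_confusion_matrix is hypothetical.

```python
def toy_confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Invented labels: one instance of class 0 is misclassified as class 1.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]

cm = toy_confusion_matrix(y_true, y_pred, 3)
# Accuracy is the share of instances on the diagonal.
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)

print(cm)        # [[1, 1, 0], [0, 2, 0], [0, 0, 1]]
print(accuracy)  # 0.8
```

Off-diagonal entries show which classes get confused with each other, which is the information a single accuracy number hides.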
