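1. Implementing Naive Bayes with Sklearn

1.1 Importing the Data

We use a dataset from the UCI Machine Learning Repository, which hosts a large number of datasets for experimental research.

In the raw file the columns are not yet named, and there are 2 of them.

The first column takes two values: "ham", meaning the message is not spam, and "spam", meaning the message is spam.

The second column is the text content of the message being classified.

First, import the data: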

import pandas as pd
df = pd.read_csv('smsspamcollection/SMSSpamCollection',
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms_message'])
df.head(5)

	label	sms_message
0	ham	Go until jurong point, crazy.. Available only ...
1	ham	Ok lar... Joking wif u oni...
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...
3	ham	U dun say so early hor... U c already then say...
4	ham	Nah I don't think he goes to usf, he lives aro...
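
The file is tab-separated and has no header row, which is why sep='\t' and header=None are passed above. As a rough illustration of that layout using only the standard library (the two sample rows are invented for this sketch and are not taken from the dataset):

```python
import csv
import io

# Two made-up rows in the same layout as the file: label <TAB> message.
raw = "ham\tSee you at lunch, ok?\nspam\tWIN a free prize now\n"

# QUOTE_NONE keeps any quote characters inside a message as literal text.
reader = csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [{"label": label, "sms_message": text} for label, text in reader]

for row in rows:
    print(row["label"], "|", row["sms_message"])
```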

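1.2 Data Preprocessing

Now that we have a rough idea of the dataset's structure, we convert the labels to a binary variable, with 0 for "ham" (not spam) and 1 for "spam", which is more convenient for computation.

Scikit-learn's classifiers do accept string labels: they encode the classes internally and return predictions as the original strings, so string labels are not silently cast to floats.

If the labels stay as strings, the model can still make predictions, but computing metrics later (for example precision and recall) can run into problems, because those functions treat the label 1 as the positive class by default and otherwise need pos_label='spam' passed explicitly. To avoid that pitfall, it is simplest to convert the class values to integers before passing them to the model.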
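
The kind of problem string labels cause at evaluation time can be sketched as follows. The labels below are invented toy data, and the behaviour shown is what we would expect from recent scikit-learn versions, where the default pos_label=1 is rejected if it is not among the labels:

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels, invented for illustration.
y_true = ["ham", "spam", "spam", "ham"]
y_pred = ["ham", "spam", "ham", "ham"]

try:
    # The default pos_label=1 does not occur among the string labels.
    precision_score(y_true, y_pred)
    default_failed = False
except ValueError:
    default_failed = True

# Naming the positive class explicitly works with string labels.
precision = precision_score(y_true, y_pred, pos_label="spam")
recall = recall_score(y_true, y_pred, pos_label="spam")
print(default_failed, precision, recall)
```

Mapping the labels to 0 and 1 up front avoids having to remember pos_label in every metric call.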

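Instructions:

Use the map method to convert the values in the 'label' column to numeric values, as follows:
{'ham':0, 'spam':1}. This maps "ham" to 0 and "spam" to 1.
Also, to see how large the dataset is, use shape to print the number of rows and columns.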

df['label'] = df.label.map({'ham':0, 'spam':1})
print(df.shape)
df.head(5)

(5572, 2)

	label	sms_message
0	0	Go until jurong point, crazy.. Available only ...
1	0	Ok lar... Joking wif u oni...
2	1	Free entry in 2 a wkly comp to win FA Cup fina...
3	0	U dun say so early hor... U c already then say...
4	0	Nah I don't think he goes to usf, he lives aro...
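
One thing worth checking after a dictionary mapping is whether any label failed to map, since pandas fills unmapped keys with NaN. A small sketch on an invented toy frame (the mis-cased "Spam" is deliberate and does not come from the dataset):

```python
import pandas as pd

# Toy frame, invented for illustration; "Spam" is deliberately mis-cased.
toy = pd.DataFrame({"label": ["ham", "spam", "Spam", "ham"]})

mapped = toy["label"].map({"ham": 0, "spam": 1})

# Keys missing from the dict become NaN, so count them before training.
n_unmapped = int(mapped.isna().sum())

# Share of each class among the rows that did map (NaN is excluded).
class_share = mapped.value_counts(normalize=True).to_dict()
print(n_unmapped, class_share)
```

The same value_counts call on the real label column would also show how skewed the two classes are.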

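1.3 Splitting into Training and Test Sets

Instructions:
Use the train_test_split method in sklearn to split the dataset into a training set and a test set. Split the data into the following variables:

X_train is the training data for the 'sms_message' column.
y_train is the training data for the 'label' column.
X_test is the test data for the 'sms_message' column.
y_test is the test data for the 'label' column.
Print the number of rows in the training data and in the test data. No test_size is passed, so the default of 25% of the rows goes to the test set.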

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
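
The row counts above follow from the default test fraction of 0.25. The arithmetic, together with a hypothetical simple_split helper that shows the shuffle-and-cut idea, can be sketched with the standard library. The helper is our own illustration and will not reproduce the exact rows that train_test_split selects:

```python
import math
import random

n_samples = 5572
test_size = 0.25  # train_test_split's default when neither size is given

n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 4179 1393


def simple_split(items, test_size=0.25, seed=1):
    """Shuffle indices and cut once; a hypothetical helper, not sklearn's algorithm."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = len(items) - math.ceil(test_size * len(items))
    train = [items[i] for i in indices[:cut]]
    test = [items[i] for i in indices[cut:]]
    return train, test


train, test = simple_split(list(range(n_samples)))
print(len(train), len(test))
```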

1.4 Bag of Words

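Our dataset contains a large amount of text data (5,572 rows). Most machine learning algorithms require numerical input, whereas emails and messages are usually text.

This is where the Bag of Words (BoW) concept comes in. The term describes problems that involve a "bag of words", that is, a collection of text data. The basic idea of BoW is to take a piece of text and count how often each word occurs in it. Note that BoW treats every word on its own, and the order in which words occur does not matter.

With the process described here, we can convert a collection of documents into a matrix in which each document is a row, each word (token) is a column, and the (row, column) value is the number of times that word or token occurs in that document.

For example: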

X_train.head()

710     4mths half price Orange line rental & latest c...
3740                           Did you stitch his trouser
2711    Hope you enjoyed your new content. text stop t...
3155    Not heard from U4 a while. Call 4 rude chat pr...
3748    Ü neva tell me how i noe... I'm not at home in...
Name: sms_message, dtype: object

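Our goal is to convert this set of texts into a frequency distribution matrix in which the documents are numbered in rows, each word is a column name, and each value is the frequency of that word in that document.

Let's go through this in detail and see how the conversion works on a small set of documents.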
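
As a minimal sketch of that conversion on a small set of documents, using only the standard library: the four sentences are made up for illustration, and the tokenizer here is a simplification that keeps single-character words, unlike the sklearn default described later.

```python
import re
from collections import Counter

# A small made-up document set, used only to illustrate the transformation.
documents = [
    "Hello, how are you!",
    "Win money, win from home.",
    "Call me now.",
    "Hello, call hello you tomorrow?",
]

# Lowercase each document and keep runs of word characters as tokens.
tokenized = [re.findall(r"\w+", doc.lower()) for doc in documents]

# The vocabulary is the sorted set of all tokens; each word becomes a column.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one count per vocabulary word.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each printed row is one document, and each position in the row lines up with the word at the same position in the vocabulary.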

1.4.1 Bag of Words in Sklearn: CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the CountVectorizer class
count_vector = CountVectorizer(token_pattern='(?u)\\b\\w\\w+\\b', stop_words='english')

# Fit on the training set and transform it (never fit on the test set, which would leak test information)
training_data = count_vector.fit_transform(X_train)

# Transform the test set with the vocabulary learned from the training set; no fitting here
testing_data = count_vector.transform(X_test)

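1.4.1.1 count_vector = CountVectorizer(lowercase=True, token_pattern, stop_words)

For this step we use sklearn's CountVectorizer, which does the following:

It tokenizes the string (splits it into individual words) and assigns an integer ID to each token.
It counts the number of occurrences of each token.

Parameter settings:

lowercase=True (the default; pass the boolean, not the string 'True'): CountVectorizer converts all tokenized words to lowercase, so that words such as "He" and "he" are not treated as different.
token_pattern: CountVectorizer ignores all punctuation, so a word followed by punctuation (for example "hello!") is not treated differently from the same word without it ("hello"). The token_pattern parameter has the default regular expression (?u)\b\w\w+\b (written '(?u)\\b\\w\\w+\\b' in an ordinary Python string), which ignores all punctuation, treats it as a delimiter, and accepts strings of 2 or more word characters (letters, digits, or underscores) as individual tokens or words.
stop_words: Stop words are the most commonly used words in a language, such as "am", "an", "and", and "the". Setting this parameter to english makes CountVectorizer ignore every word (in the input text) that appears in scikit-learn's built-in English stop word list. This is helpful because stop words can skew our conclusions when we are looking for the words that indicate spam.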
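
The effect of lowercasing and of the default token pattern can be checked directly with Python's re module. This sketch applies only those two steps to a made-up sentence; stop-word filtering, which CountVectorizer applies after tokenizing, is not reproduced here:

```python
import re

# The default token_pattern, written as a raw string.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "Hello! How are u? I am fine, call 0800 now."  # made-up sentence

# Lowercasing first mirrors lowercase=True.
tokens = re.findall(TOKEN_PATTERN, text.lower())
print(tokens)
# ['hello', 'how', 'are', 'am', 'fine', 'call', '0800', 'now']
```

The single-character words "u" and "I" are dropped because the pattern requires at least two word characters, which matters for SMS text full of abbreviations.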

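1.4.1.2 count_vector.fit(data)

fit() learns the vocabulary from the document dataset and stores it in the CountVectorizer object.

1.4.1.3 count_vector.transform(data)

transform() returns a SciPy sparse matrix of integer counts, which can be converted to a NumPy array with toarray().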
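
To see why the vectorizer is fitted on the training data only, here is a small sketch with made-up documents. A word that appears only in the test document ("lunch" here) is not in the learned vocabulary and is simply ignored by transform():

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, used only for illustration.
train_docs = ["free prize call now", "call me tonight"]
test_docs = ["free lunch tonight"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_matrix = vectorizer.transform(test_docs)        # reuses it unchanged

# vocabulary_ maps each word to its column index; order the words by column.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(issparse(train_matrix))
print(vocabulary)
print(test_matrix.toarray())
```

The test row has the same number of columns as the training matrix, which is what lets a model trained on one score the other.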

1.4.1.4 Visualizing the Result:

get_feature_names_out() returns the feature names for this dataset, that is, the set of words that make up the vocabulary. It replaces get_feature_names(), which was removed in scikit-learn 1.2.

doc_array = count_vector.transform(X_train).toarray()
feature_names = count_vector.get_feature_names_out()
frequency_matrix = pd.DataFrame(doc_array, columns=feature_names)
frequency_matrix.head(5)

	01223585334	...
0	0	...
1	0	...
2	0	...
3	1	...
4	0	...

5 rows × 7204 columns
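
Converting the whole training matrix with toarray() stores every zero explicitly. A back-of-the-envelope estimate of the dense size, assuming 64-bit integer counts (which we believe is CountVectorizer's default dtype):

```python
n_rows, n_cols = 4179, 7204  # training rows and vocabulary size reported above
bytes_per_cell = 8           # assumption: 64-bit integer counts

dense_bytes = n_rows * n_cols * bytes_per_cell
print(f"{dense_bytes / 1e6:.1f} MB")
```

That is manageable here, but it is the reason the sparse matrix, not the dense array, is what gets passed to the model.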

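1.5 Implementing Naive Bayes with Scikit-learn

GaussianNB [1]: applies to arbitrary continuous data
BernoulliNB: assumes the input is binary data (mainly used for text classification)
MultinomialNB: assumes the input is count data (mainly used for text classification)

Because CountVectorizer produces word counts, MultinomialNB is the variant used here.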

from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
predictions = naive_bayes.predict(testing_data)
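
To give a feel for what a multinomial Naive Bayes model computes, here is a from-scratch sketch using only the standard library. It assumes Laplace smoothing with alpha=1.0 and works on made-up, pre-tokenized messages; the function names are our own, and this is an illustration of the idea, not scikit-learn's implementation:

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    vocabulary = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothed denominator: tokens seen in class c plus alpha per vocabulary word.
        total = sum(token_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total)
            for tok in vocabulary
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Made-up training messages, already tokenized; 1 = spam, 0 = ham.
train_docs = [["win", "free", "prize"], ["free", "entry", "win"],
              ["see", "you", "tonight"], ["call", "you", "later"]]
train_labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)
print(predict(["free", "prize", "tonight"], log_prior, log_likelihood))
print(predict(["see", "you", "later"], log_prior, log_likelihood))
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on long messages.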

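1.6 Evaluating the Model

Now that we have made predictions on the test set, the next goal is to evaluate how well the model performs. There are various metrics we can use, so first a quick summary of them.

Accuracy measures how often the classifier makes the correct prediction. It is the ratio of the number of correct predictions to the total number of predictions (the number of test data points).

Precision is the proportion of messages classified as spam that actually are spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all predicted positives (all messages classified as spam, whether or not that classification is correct). In other words, it is the ratio: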

[True Positives/(True Positives + False Positives)]

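Recall (sensitivity) is the proportion of messages that actually are spam that were classified as spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all messages that actually are spam. In other words, it is the ratio: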

[True Positives/(True Positives + False Negatives)]
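
Both ratios can be computed by hand from the four confusion counts. A minimal sketch on made-up labels, where confusion_counts is a hypothetical helper of our own:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


# Made-up labels: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were flagged
print(tp, fp, fn, tn, precision, recall)
```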

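For classification problems with a skewed class distribution (our dataset is one), accuracy on its own is not a very good metric. Suppose there are 100 messages, of which only 2 are spam and the other 98 are not. If we classify 90 messages as not spam (including the 2 spam messages, which are therefore false negatives) and 10 messages as spam (all 10 of them false positives), we still get a fairly high accuracy of 0.88, even though not a single spam message was caught. Precision and recall are very useful in such cases. The two can be combined into the F1 score, which is the harmonic mean of precision and recall. It ranges from 0 to 1, with 1 being the best possible F1 score.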
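
Working through the 100-message scenario makes the gap between the metrics concrete. This sketch takes the counts from the paragraph above and computes F1 as 2PR/(P+R), guarding against zero denominators:

```python
# Counts taken from the 100-message scenario described above.
tp, fp, fn = 0, 10, 2
tn = 100 - tp - fp - fn  # 88 ham messages correctly left alone

accuracy = (tp + tn) / (tp + fp + fn + tn)

# Guard the ratios: a denominator can be zero when nothing is flagged or present.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy, precision, recall, f1)  # 0.88 0.0 0.0 0.0
```

Accuracy looks respectable while precision, recall, and F1 all show that the classifier is useless for finding spam.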

We will use all four metrics to make sure our model performs well. All four range from 0 to 1, and a score close to 1 indicates that the model is doing well.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9877961234745154
Precision score:  0.9615384615384616
Recall score:  0.9459459459459459
F1 score:  0.9536784741144414
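
As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall using F1 = 2PR/(P+R). The sketch also prints the plain arithmetic mean for comparison:

```python
precision = 0.9615384615384616  # reported above
recall = 0.9459459459459459     # reported above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2

print(round(f1, 6), round(arithmetic_mean, 6))
```

The harmonic mean is never larger than the arithmetic mean, so F1 is pulled toward the weaker of the two scores.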

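2. Summary

Compared with other classification algorithms, one major advantage of Naive Bayes is its ability to handle a very large number of features. In our example there are thousands of distinct words, each treated as a feature. It also performs well in the presence of irrelevant features and is relatively unaffected by them. Another major advantage is its relative simplicity. Naive Bayes works well right out of the box and rarely needs its parameters tuned, except usually in cases where the distribution of the data is known.
It rarely overfits the data. Another important advantage is that training and prediction are very fast for the amount of data it can handle. All in all, Naive Bayes is a very practical algorithm!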

References

[1] ls秦. Python Machine Learning: Naive Bayes Algorithm (Naive Bayes) [EB/OL]. https://blog.csdn.net/qq_38328378/article/details/80771469, 2018-07-10.

