Python programming with Jupyter Notebook: the MNIST dataset and how to work with it


We have been studying artificial intelligence for a while now. As Chapter 1 noted, the most common supervised learning tasks are regression (predicting values) and classification (predicting classes).

Chapter 2 explored a regression task, predicting housing prices, using algorithms such as Linear Regression, Decision Trees and Random Forests (these algorithms are explained in more detail in later chapters).

In this chapter we turn our attention to classification systems.
So in this post I will walk you through working with the MNIST dataset!

I. What MNIST is

1. What is the MNIST dataset?

Dataset introduction: this chapter uses the MNIST dataset, a set of 70,000 small images of digits handwritten by American high school students and US Census Bureau employees. Each image is labeled with the digit it represents. The set is so widely used that it is often called the "Hello World" of machine learning: whenever someone comes up with a new classification algorithm, they want to see how it performs on MNIST, so anyone learning machine learning ends up tackling it sooner or later.

2. Loading and working with MNIST in Python

1) Import the required libraries

# Use a scikit-learn helper to fetch the MNIST dataset
from sklearn.datasets import fetch_openml
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)
# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
# Allow non-ASCII (CJK) characters in figure labels
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False

2) Define a helper that sorts the dataset by label

# Takes a while to run
# Sort the 60,000 training images and the 10,000 test images by label
def sort_by_target(mnist):
    reorder_train=np.array(sorted([(target,i) for i, target in enumerate(mnist.target[:60000])]))[:,1]
    reorder_test=np.array(sorted([(target,i) for i, target in enumerate(mnist.target[60000:])]))[:,1]
    mnist.data[:60000]=mnist.data[reorder_train]
    mnist.target[:60000]=mnist.target[reorder_train]
    mnist.data[60000:]=mnist.data[reorder_test+60000]
    mnist.target[60000:]=mnist.target[reorder_test+60000]

3) Fetch the dataset, apply the helper, and time how long this takes

import time
a=time.time()
mnist=fetch_openml('mnist_784',version=1,cache=True)
mnist.target=mnist.target.astype(np.int8)
sort_by_target(mnist)
b=time.time()
print(b-a)
32.70347619056702

4) Pull out the data and display a sample image

X,y=mnist["data"],mnist["target"]
# Display a single digit image
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = mpl.cm.binary,
               interpolation="nearest")
    plt.axis("off")
some_digit = X[36000]
plot_digit(some_digit)  # plot_digit reshapes the 784-pixel vector to 28x28 itself

[Figure: the digit at index 36000 rendered with plot_digit]
5) Define a helper that displays a grid of digits 0-9

# Nicer multi-image display
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    # number of images shown per row
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size, size) for instance in instances]
    # number of rows needed
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    # pad the last row with empty images so every row is full
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        # take one row of images at a time
        rimages = images[row * images_per_row:(row + 1) * images_per_row]
        # concatenate the images of this row side by side
        row_images.append(np.concatenate(rimages, axis=1))
    # stack the rows vertically
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap=mpl.cm.binary, **options)
    plt.axis("off")

6) Call the helper to display handwritten digits 0-9

plt.figure(figsize=(9,9))
example_images=np.r_[X[:12000:600],X[13000:30600:600],X[30600:60000:590]]
plot_digits(example_images,images_per_row=10)
plt.show()

[Figure: a grid of sample handwritten digits 0 through 9]

3. Next, we need to create a test set

1) Split off the test set

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

We also need to shuffle the training set, so that all cross-validation folds end up with a similar distribution. Besides, some machine learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. Shuffling the data guards against that.

2) Shuffle the training set

import numpy as np
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

II. Training a binary classifier

Let's simplify the problem for now and only try to identify a single digit, say the number 5. This "5-detector" is an example of a binary classifier: it distinguishes just two classes, 5 and not-5. First, create the target vectors for this classification task:

y_train_5=(y_train==5)
y_test_5=(y_test==5)

Now pick a classifier and train it. A good starting point is a Stochastic Gradient Descent (SGD) classifier, via sklearn's SGDClassifier class. Its advantage is that it handles very large datasets efficiently, partly because SGD processes training instances independently, one at a time (which also makes it well suited to online learning).

from sklearn.linear_model import SGDClassifier
sgd_clf=SGDClassifier(max_iter=5,tol=-np.infty,random_state=42)
sgd_clf.fit(X_train,y_train_5)
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=5,
              n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5,
              random_state=42, shuffle=True, tol=-inf, validation_fraction=0.1,
              verbose=0, warm_start=False)
sgd_clf.predict([some_digit])
array([False])

III. Performance measures

Evaluating a classifier is often significantly trickier than evaluating a regressor, so this chapter spends a fair amount of space on the topic and covers several performance measures.

1. Measuring accuracy with cross-validation

1) Comparing plain cross-validation with a manual stratified implementation

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
array([0.9492, 0.9598, 0.9689])
# StratifiedKFold performs stratified sampling, so every fold has a similar class distribution
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = (y_train_5[train_index])
    X_test_fold = X_train[test_index]
    y_test_fold = (y_train_5[test_index])

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))
0.9492
0.9598
0.9689

2) Both cross-validation variants reach roughly 95% accuracy. Before getting too excited, let's look at a dumb classifier that predicts "not 5" for every single image.

from sklearn.base import BaseEstimator
# A do-nothing model that always predicts "not 5"
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)
never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
array([0.90965, 0.91135, 0.90795])

Its accuracy is above 90% as well! That is simply because only about 10% of the images are 5s, so always guessing "not 5" is right about 90% of the time, which would put any great oracle to shame.
This demonstrates why accuracy is generally not the preferred performance measure for classifiers, especially on skewed datasets (i.e. when some classes are much more frequent than others).
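You can check the class imbalance directly (a quick sketch using the training labels defined above):

# fraction of training images that are 5s: roughly 0.09, so "never 5" is right about 91% of the time
print(y_train_5.mean())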

2. The confusion matrix

A much better way to evaluate a classifier is the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B. For example, to know how many times the classifier confused 5s with 3s, you would look in the 5th row and 3rd column of the confusion matrix.

To compute the confusion matrix you first need a set of predictions to compare with the actual targets. You could predict on the test set, but let's keep it untouched for now (the test set is best saved for the very end of the project, once the classifier is ready to launch). Instead, use the cross_val_predict() function:

Unlike cross_val_score, cross_val_predict returns the predictions themselves, and each prediction is made by a model that did not see that instance during training.

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)
array([[53598,   981],
       [ 1461,  3960]], dtype=int64)

The result reads as follows: in the first row (the "not 5" images, the negative class), 53,598 were correctly classified (true negatives) and 981 were wrongly classified as 5s (false positives); in the second row (the images of 5s, the positive class), 1,461 were wrongly classified as non-5s (false negatives) and 3,960 were correctly classified as 5s (true positives).

A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right):

y_train_perfect_predictions = y_train_5
confusion_matrix(y_train_5, y_train_perfect_predictions)
array([[54579,     0],
       [    0,  5421]], dtype=int64)

The confusion matrix gives a lot of information, but sometimes you may prefer a more concise metric. An interesting one is the accuracy of the positive predictions, called the precision of the classifier:
$Precision = \frac{TP}{TP+FP}$
where TP is the number of true positives and FP is the number of false positives.
A trivial way to get perfect precision is to make a single positive prediction and make sure it is correct (precision = 1/1 = 100%).

That is not very useful, because the classifier would ignore all but that one positive instance. So precision is typically used along with another metric named recall, also called sensitivity or true positive rate (TPR): the ratio of positive instances that are correctly detected by the classifier:
$Recall = \frac{TP}{TP+FN}$
where FN is the number of false negatives.
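As a sanity check, both metrics can be recomputed directly from the confusion matrix shown above (TP = 3960, FP = 981, FN = 1461):

TP, FP, FN = 3960, 981, 1461
precision = TP / (TP + FP)   # ≈ 0.8015
recall = TP / (TP + FN)      # ≈ 0.7305
print(precision, recall)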

3. Precision and recall

# Measure precision and recall with sklearn's metric functions
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)
0.8014571948998178
recall_score(y_train_5, y_train_pred)
0.7304925290536801

Our 5-detector no longer looks quite so shiny: when it claims an image is a 5, it is right only about 80% of the time, and it only detects about 73% of the 5s.

It is often convenient to combine precision and recall into a single metric called the F1 score:
$F_1 = \frac{2}{\frac{1}{Precision}+\frac{1}{Recall}} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} = \frac{TP}{TP+\frac{FN+FP}{2}}$
To compute the F1 score, just call f1_score():

from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)
0.7643312101910827

The F1 score favors classifiers that have similar precision and recall. That is not always what you want: in some contexts you mostly care about precision, and in others you really care about recall.

For example, if you train a classifier to detect videos that are safe for children, you would probably prefer one that rejects many good videos (low recall) but keeps only safe ones (high precision), over one with much higher recall that lets a few really bad videos show up in your product (in that case you might even add a human pipeline to check the classifier's selections).

Conversely, if you train a classifier to detect shoplifters from surveillance images, a precision of only 30% is probably fine as long as the recall is 99% (sure, the security guards will get some false alerts, but almost all shoplifters will get caught).

Unfortunately, you cannot have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall trade-off.

4. The precision/recall trade-off

1) For each instance, the classifier computes a score and compares it to a threshold: above the threshold it predicts the positive class, below it the negative class. Moving this threshold trades precision against recall.

y_scores = sgd_clf.decision_function([some_digit])
y_scores
array([-117421.59910995])
threshold = 0
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
array([False])
threshold = 200000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
array([False])
# Return decision scores rather than predictions
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
y_scores.shape
(60000,)
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.xlabel("Threshold", fontsize=16)
    plt.title("精度和召回率VS決策閾值", fontsize=16)
    plt.legend(loc="upper left", fontsize=16)
    plt.ylim([0, 1])

plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.xlim([-700000, 700000])
plt.show()

[Figure: precision and recall versus the decision threshold]

As the threshold rises, recall drops: some true 5s fall below the threshold and are now predicted negative. Precision rises: some instances that used to be false positives are now rejected.

You may wonder why the precision curve is bumpier than the recall curve. The reason is that precision can occasionally dip when you raise the threshold, for example from 4/5 to 3/4, even though it rises overall. Recall, on the other hand, can only decrease as the threshold goes up.

Now you can simply select the threshold that gives the best precision/recall trade-off for your task. Another way to find a good trade-off is to plot precision directly against recall.

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.title("精度VS召回率", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])

plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
plt.show()

[Figure: precision versus recall]

You can see precision start to fall sharply at around 80% recall. You will probably want to select a trade-off just before that drop, for example at around 60% recall; of course, the choice depends on your project.
Suppose we aim for 90% precision. Zooming in on the first plot, the threshold we need is roughly 70,000. To make predictions (on the training set for now), instead of calling the classifier's predict() method we can run this code:

y_train_pred_90 = (y_scores > 70000)
precision_score(y_train_5, y_train_pred_90)
0.8823529411764706
recall_score(y_train_5, y_train_pred_90)
0.6059767570558937
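Instead of reading the threshold off the plot, it can also be located programmatically (a sketch that reuses the precisions, thresholds and y_scores arrays computed by precision_recall_curve above):

# index of the first threshold whose precision reaches 90%
idx = np.argmax(precisions >= 0.90)
threshold_90_precision = thresholds[idx]
y_train_pred_90 = (y_scores >= threshold_90_precision)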

We now have a classifier with close to 90% precision. If someone says "we need 99% precision", your first question should be: "at what recall?"

5. The ROC curve

Another tool commonly used with binary classifiers is the receiver operating characteristic (ROC) curve. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, it plots the true positive rate (another name for recall) against the false positive rate (FPR). The FPR is the ratio of negative instances that are incorrectly classified as positive. It equals 1 minus the true negative rate (TNR), the ratio of negative instances that are correctly classified as negative, also called specificity. So the ROC curve plots sensitivity against (1 - specificity).

              predicted 1   predicted 0
actual 1           TP            FN
actual 0           FP            TN

$FPR = \frac{FP}{FP+TN}$
$Recall = \frac{TP}{TP+FN}$

# Use roc_curve() to compute the TPR and FPR for many threshold values
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
plt.show()

[Figure: ROC curve of the SGD classifier]
There is a trade-off here as well: the higher the recall (TPR), the more false positives (FPR) the classifier produces. The dotted line is the ROC curve of a purely random classifier; a good classifier stays as far away from it as possible, toward the top-left corner.

One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier has a ROC AUC of 1, while a purely random classifier has a ROC AUC of 0.5.

from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)
0.9615806367628459

The ROC curve looks a lot like the precision/recall (PR) curve, so you may wonder which one to use.

A rule of thumb: prefer the PR curve whenever the positive class is rare, or when you care more about the false positives than the false negatives; otherwise use the ROC curve.

For example, looking at the ROC curve (and its ROC AUC score) above, you might think the classifier is really good. But that is mostly because there are few positives (5s) compared to negatives (non-5s). In contrast, the PR curve makes it clear that the classifier still has room for improvement (the curve could be much closer to the top-right corner).

6. Training a Random Forest classifier and comparing ROC curves and ROC AUC scores

# Random Forests are covered in detail in Chapter 7
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(n_estimators=10, random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.title("SGD和RL的ROC曲線對比")
plt.legend(loc="lower right", fontsize=16)
plt.show()

[Figure: ROC curves of the SGD classifier and the Random Forest]

roc_auc_score(y_train_5, y_scores_forest)
0.991852822111278

Measure its precision and recall:

y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)
precision_score(y_train_5, y_train_pred_forest)
0.9830102374210412
recall_score(y_train_5, y_train_pred_forest)
0.8325032281866814

IV. Multiclass classification

1. Building the classifier

A binary classifier distinguishes between two classes, whereas a multiclass classifier (also called a multinomial classifier) can distinguish between more than two classes.

Some algorithms, such as Random Forests and naive Bayes classifiers, handle multiple classes directly. Others, such as Support Vector Machine classifiers and linear classifiers, are strictly binary. There are, however, several strategies for performing multiclass classification with a set of binary classifiers.

For example, you can train 10 binary classifiers, one per digit 0-9, and classify an image according to whichever classifier outputs the highest decision score. This is called the one-versus-all (OvA) strategy.

Another approach is to train a binary classifier for every pair of digits: one to distinguish 0 from 1, another 0 from 2, another 1 from 2, and so on. This is the one-versus-one (OvO) strategy. For N classes you need N*(N-1)/2 classifiers, so 45 for MNIST. The main advantage of OvO is that each classifier only needs to be trained on the part of the training set containing the two classes it must distinguish.

Some algorithms (such as Support Vector Machines) scale poorly with the size of the training set, so OvO is a good fit for them, since it is faster to train many classifiers on small training sets than a few classifiers on a huge one. For most binary classifiers, however, OvA is preferred.
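Scikit-Learn applies OvA automatically when you fit SGDClassifier on the 10-class labels (as in the next cell), but either strategy can also be forced explicitly. As a sketch, the OvA/OvR wrapper looks like this (the OvO variant is shown a bit further below):

from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SGDClassifier(max_iter=5, tol=-np.infty, random_state=42))
# ovr_clf.fit(X_train, y_train) would train 10 binary "this digit vs. the rest" classifiers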

# Train on the full 0-9 labels; under the hood sklearn trains 10 binary classifiers (OvA),
# collects their decision scores for the image, and picks the class with the highest score
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])
array([5], dtype=int8)

We can see that the SGD classifier actually produced 10 scores for the input, one per class, rather than a single one:

some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores
array([[ -46920.32059519, -473682.68085715, -422542.0018031 ,
          87383.0785554 , -278090.95869529,  244695.46537724,
        -898745.95707608, -188442.74420839, -842476.3954909 ,
        -638525.37800626]])
np.argmax(some_digit_scores)
5

When a classifier is trained, it stores the list of target classes in its classes_ attribute, sorted by value:

sgd_clf.classes_
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)

Forcing the OvO strategy:

from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(max_iter=5, tol=-np.infty, random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])
array([5], dtype=int8)
len(ovo_clf.estimators_)
45

Random Forests handle multiple classes directly, so no OvA or OvO strategy is needed:

forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])
array([5], dtype=int8)
forest_clf.predict_proba([some_digit])
array([[0.1, 0. , 0. , 0.1, 0. , 0.8, 0. , 0. , 0. , 0. ]])

Evaluating the classifier:

cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
array([0.84993001, 0.81769088, 0.84707706])

Every fold scores above 80%. A purely random classifier would get about 10% accuracy, so this is not terrible, but there is still plenty of room for improvement; for example, simply standardizing the inputs helps:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
array([0.91211758, 0.9099955 , 0.90643597])

2. Error analysis

If this were a real project, we would follow the machine learning project checklist from Chapter 2: explore data preparation options, try several models, shortlist the best ones, fine-tune their hyperparameters with GridSearchCV, automate as much as possible, and so on. Here, let's assume we already have a promising model and we want ways to improve it. One of them is to analyze the types of errors it makes.

First, look at the confusion matrix:

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx
array([[5749,    4,   22,   11,   11,   40,   36,   11,   36,    3],
       [   2, 6490,   43,   24,    6,   41,    8,   12,  107,    9],
       [  53,   42, 5330,   99,   87,   24,   89,   58,  159,   17],
       [  46,   41,  126, 5361,    1,  241,   34,   59,  129,   93],
       [  20,   30,   35,   10, 5369,    8,   48,   38,   76,  208],
       [  73,   45,   30,  194,   64, 4614,  106,   30,  170,   95],
       [  41,   30,   46,    2,   44,   91, 5611,    9,   43,    1],
       [  26,   18,   73,   30,   52,   11,    4, 5823,   14,  214],
       [  63,  159,   69,  168,   15,  172,   54,   26, 4997,  128],
       [  39,   39,   27,   90,  177,   40,    2,  230,   78, 5227]],
      dtype=int64)
def plot_confusion_matrix(matrix):
    """If you prefer color and a colorbar"""
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(111)
    cax = ax.matshow(matrix)
    fig.colorbar(cax)
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

[Figure: confusion matrix rendered with matshow]
The 5s look slightly darker, which could mean either that there are fewer images of 5s in the dataset or that the classifier does not perform as well on 5s. In fact, both are true.

Let's focus on the errors. First divide each value of the confusion matrix by the number of images in the corresponding class, so that we compare error rates rather than absolute error counts (which would unfairly penalize the more frequent classes):

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

Rows represent actual classes and columns represent predicted classes. Columns 8 and 9 are quite bright, which tells us that many digits get misclassified as 8s or 9s; rows 8 and 9 are also bright, meaning 8s and 9s are often confused with other digits. In addition, 3s are often misclassified as 5s and vice versa.

np.fill_diagonal(norm_conf_mx, 0)  # zero out the diagonal so only the errors remain
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

[Figure: normalized confusion matrix of error rates, diagonal zeroed out]

Analyzing the confusion matrix often gives insight into how to improve the classifier. Looking at the plot above, our effort is best spent on reducing the errors involving 8s and 9s, and on fixing the 3/5 confusion.

For example, we could try to gather more training data for these digits.

Or we could engineer new features that help the classifier, for instance an algorithm that counts the number of closed loops (an 8 has two, a 6 has one, a 5 has none); a sketch of this idea follows below.

Or we could preprocess the images to make some patterns, such as closed loops, stand out more.
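A rough illustration of the closed-loop feature mentioned above (a sketch, not part of the original notebook; the helper name count_closed_loops is made up here): count the connected background regions that are fully enclosed by ink.

from scipy import ndimage

def count_closed_loops(digit, ink_threshold=128):
    # binarize: True where the pixel contains ink
    ink = digit.reshape(28, 28) >= ink_threshold
    # label connected regions of the background (non-ink) pixels
    labels, n_regions = ndimage.label(~ink)
    # regions touching the image border are the "outside"; the remaining regions are enclosed loops
    border = set(labels[0, :]) | set(labels[-1, :]) | set(labels[:, 0]) | set(labels[:, -1])
    border.discard(0)  # label 0 marks ink pixels, not a background region
    return n_regions - len(border)

# a clean 8 should give 2, a 6 gives 1, a well-formed 5 gives 0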

Analyzing individual errors can also be a good way to gain insight into what the classifier is doing and why it fails, although it is usually harder and more time-consuming. For example, let's look at some 3s and 5s:

cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

plt.figure(figsize=(8,8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.show()

[Figure: 3s classified as 3s (top left), 3s classified as 5s (top right), 5s classified as 3s (bottom left), 5s classified as 5s (bottom right)]

Although some of these digits are genuinely ambiguous, most look easy enough to classify, yet the algorithm still gets them wrong. The reason is that the SGD model is a linear model: it assigns a per-class weight to each pixel, and when it sees a new image it simply sums the weighted pixel intensities to get a score for each class. Since 3s and 5s differ in only a handful of pixels, the model easily confuses them.

The main difference between 3s and 5s is the position of the small line that joins the top line to the bottom arc. If you draw a 3 with that junction shifted slightly to the left, the classifier may well read it as a 5, and vice versa. In other words, this classifier is quite sensitive to image shifting and rotation, so one way to reduce the 3/5 confusion is to preprocess the images to ensure they are well centered and not rotated. This should also help reduce other errors.
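A minimal sketch of the centering idea (assuming scipy is available; the helper name center_digit is made up here): shift each digit so that its center of mass lands in the middle of the 28x28 grid.

from scipy import ndimage

def center_digit(digit):
    image = digit.reshape(28, 28)
    cy, cx = ndimage.center_of_mass(image)                  # intensity-weighted centroid
    centered = ndimage.shift(image, [14 - cy, 14 - cx], cval=0)
    return centered.reshape(784)

# e.g. X_train_centered = np.apply_along_axis(center_digit, axis=1, arr=X_train)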

V. Multilabel classification

Until now each instance has always been assigned to exactly one class. In some cases you want the classifier to output multiple classes for each instance, for example to attach one label per face in a photo.

Say the classifier has been trained to recognize three faces, A, B and C; when shown a picture of A and C it should output [1, 0, 1]. A classification system that outputs multiple binary labels like this is called a multilabel classification system.

The example below uses a K-Nearest Neighbors classifier (note that not all classifiers support multilabel classification):

from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
knn_clf.predict([some_digit])
array([[False,  True]])

The result is correct: the digit 5 is indeed not large (it is below 7), and it is odd.

There are many ways to evaluate a multilabel classifier, and the right metric depends on your project. One approach is to measure the F1 score for each individual label (or any of the other binary classifier metrics discussed earlier) and then compute the average.

# Takes a long time to run
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3, n_jobs=-1)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
0.97709078477525

This assumes that all labels are equally important, which may not be the case on real data; you can pass average="weighted" instead to give each label a weight equal to its support (see the one-liner below).
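For example, reusing the predictions computed above:

f1_score(y_multilabel, y_train_knn_pred, average="weighted")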

VI. Multioutput classification

The last type of classification task we will discuss is multioutput-multiclass classification (or simply multioutput classification). It is a generalization of multilabel classification where each label can itself be multiclass (i.e. it can take more than two values).

To illustrate this, let's build a system that removes noise from images: it takes a noisy digit image as input and (hopefully) outputs a clean digit image, represented, like the other MNIST images, as an array of pixel intensities.

Note that this classifier's output is multilabel (one label per pixel) and each label can take multiple values (pixel intensities from 0 to 255), so it is an example of a multioutput classification system.

The line between classification and regression is sometimes blurry, as in this example: predicting pixel intensities arguably looks more like regression than classification. Moreover, multioutput systems are not limited to classification tasks; a system could even output multiple labels per instance, mixing class labels and value labels.

As usual, start by creating the training and test sets, using NumPy's randint() to add noise to the MNIST pixel intensities. The target images are the original, noise-free images.

noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test
some_index = 5500
plt.subplot(121); plot_digit(X_test_mod[some_index])
plt.subplot(122); plot_digit(y_test_mod[some_index])
plt.show()

[Figure: a noisy test digit (left) and the clean target digit (right)]

knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)

[Figure: the cleaned digit predicted by the KNN model]


Extension 1:

DummyClassifier (a baseline that makes random predictions based only on the class distribution)

from sklearn.dummy import DummyClassifier
dmy_clf = DummyClassifier()
y_probas_dmy = cross_val_predict(dmy_clf, X_train, y_train_5, cv=3, method="predict_proba")
y_scores_dmy = y_probas_dmy[:, 1]
fprr, tprr, thresholdsr = roc_curve(y_train_5, y_scores_dmy)
plot_roc_curve(fprr, tprr)

[Figure: ROC curve of the DummyClassifier, close to the diagonal]

Extension 2:

A K-Nearest Neighbors classifier

# Fairly slow to run
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_jobs=-1, weights='distance', n_neighbors=4)
knn_clf.fit(X_train, y_train)
y_knn_pred = knn_clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_knn_pred)
0.9714

Extension 3:

Shifting images

from scipy.ndimage.interpolation import shift
def shift_digit(digit_array, dx, dy, new=0):
    return shift(digit_array.reshape(28, 28), [dy, dx], cval=new).reshape(784)

plot_digit(shift_digit(some_digit, 5, 1, new=100))

[Figure: the sample digit shifted by a few pixels]

Extension 4:

Augment the training set with shifted copies of every image, then retrain

X_train_expanded = [X_train]
y_train_expanded = [y_train]
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    shifted_images = np.apply_along_axis(shift_digit, axis=1, arr=X_train, dx=dx, dy=dy)
    X_train_expanded.append(shifted_images)
    y_train_expanded.append(y_train)

X_train_expanded = np.concatenate(X_train_expanded)
y_train_expanded = np.concatenate(y_train_expanded)
X_train_expanded.shape, y_train_expanded.shape
((300000, 784), (300000,))
# Very slow: on the order of tens of minutes
knn_clf.fit(X_train_expanded, y_train_expanded)
y_knn_expanded_pred = knn_clf.predict(X_test)
accuracy_score(y_test, y_knn_expanded_pred)
0.9763
ambiguous_digit = X_test[2589]
knn_clf.predict_proba([ambiguous_digit])
array([[0.       , 0.       , 0.5053645, 0.       , 0.       , 0.       ,
        0.       , 0.4946355, 0.       , 0.       ]])
plot_digit(ambiguous_digit)

[Figure: an ambiguous digit that the classifier hesitates between a 2 and a 7 on]

VII. Exercises

1. Build a classifier for the MNIST dataset that reaches 97% accuracy on the test set

Hint: use KNeighborsClassifier and tune its weights and n_neighbors hyperparameters.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]

knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, param_grid, cv=5, verbose=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=-1,
             param_grid=[{'n_neighbors': [3, 4, 5],
                          'weights': ['uniform', 'distance']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=3)
grid_search.best_params_
{'n_neighbors': 4, 'weights': 'distance'}
grid_search.best_score_
0.97325
from sklearn.metrics import accuracy_score

y_pred = grid_search.predict(X_test)
accuracy_score(y_test, y_pred)
0.9714

2. Write a function that can shift an MNIST image by one pixel in any direction (up, down, left or right), then use it to expand the training set and retrain the model

from scipy.ndimage.interpolation import shift
from sklearn.neighbors import KNeighborsClassifier
def shift_image(image, dx, dy):
    image = image.reshape((28, 28))
    shifted_image = shift(image, [dy, dx], cval=0, mode="constant")
    return shifted_image.reshape([-1])
image = X_train[1000]
shifted_image_down = shift_image(image, 0, 5)
shifted_image_left = shift_image(image, -5, 0)

plt.figure(figsize=(12,3))
plt.subplot(131)
plt.title("Original", fontsize=14)
plt.imshow(image.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(132)
plt.title("Shifted down", fontsize=14)
plt.imshow(shifted_image_down.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(133)
plt.title("Shifted left", fontsize=14)
plt.imshow(shifted_image_left.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.show()

[Figure: the original digit, the digit shifted down, and the digit shifted left]

X_train_augmented = [image for image in X_train]
y_train_augmented = [label for label in y_train]

for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(X_train, y_train):
        X_train_augmented.append(shift_image(image, dx, dy))
        y_train_augmented.append(label)

X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)
shuffle_idx = np.random.permutation(len(X_train_augmented))
X_train_augmented = X_train_augmented[shuffle_idx]
y_train_augmented = y_train_augmented[shuffle_idx]
knn_clf = KNeighborsClassifier(n_neighbors=4,weights='distance',n_jobs=-1)
knn_clf.fit(X_train_augmented, y_train_augmented)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=4, p=2,
                     weights='distance')
y_pred = knn_clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
0.9763

3. Tackle the Titanic dataset

The goal is to predict whether or not a passenger survived. The dataset can be downloaded from Kaggle (the Titanic challenge).

# First, load the dataset
import os

TITANIC_PATH = os.path.join("datasets", "titanic")

import pandas as pd

def load_titanic_data(filename, titanic_path=TITANIC_PATH):
    csv_path = os.path.join(titanic_path, filename)
    return pd.read_csv(csv_path)
train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")
# Take a look at the data
train_data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

The attributes have the following meaning:

  • Survived: that’s the target, 0 means the passenger did not survive, while 1 means he/she survived.
  • Pclass: passenger class.
  • Name, Sex, Age: self-explanatory
  • SibSp: how many siblings & spouses of the passenger aboard the Titanic.
  • Parch: how many children & parents of the passenger aboard the Titanic.
  • Ticket: ticket id
  • Fare: price paid (in pounds)
  • Cabin: passenger’s cabin number
  • Embarked: where the passenger embarked the Titanic
# Check for missing values
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# Look at the numerical attributes
train_data.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
# Survival counts
train_data["Survived"].value_counts()
0    549
1    342
Name: Survived, dtype: int64
# Number of passengers in 1st, 2nd and 3rd class
train_data["Pclass"].value_counts()
3    491
1    216
2    184
Name: Pclass, dtype: int64
# Sex distribution
train_data["Sex"].value_counts()
male      577
female    314
Name: Sex, dtype: int64
# Port of embarkation: C=Cherbourg, Q=Queenstown, S=Southampton
train_data["Embarked"].value_counts()
S    644
C    168
Q     77
U      2
Name: Embarked, dtype: int64
# Imputer that fills missing values with the most frequent value of each column
from sklearn.base import BaseEstimator, TransformerMixin
    
class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                        index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.most_frequent_)
# Preprocessing pipelines
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs=["Age", "SibSp", "Parch", "Fare"]
cat_attribs=["Pclass", "Sex", "Embarked"]

num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
    ])

cat_pipeline = Pipeline([
        ("imputer", MostFrequentImputer()),
        ("cat_encoder", OneHotEncoder(sparse=False)),
    ])

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])
X_train = full_pipeline.fit_transform(train_data)
X_train
array([[22.,  1.,  0., ...,  0.,  1.,  0.],
       [38.,  1.,  0., ...,  0.,  0.,  0.],
       [26.,  0.,  0., ...,  0.,  1.,  0.],
       ...,
       [28.,  1.,  2., ...,  0.,  1.,  0.],
       [26.,  0.,  0., ...,  0.,  0.,  0.],
       [32.,  0.,  0., ...,  1.,  0.,  0.]])
y_train = train_data["Survived"]
# Try an SVM classifier (SVC uses an RBF kernel by default)
from sklearn.svm import SVC

svm_clf = SVC(gamma="auto")
svm_clf.fit(X_train, y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
X_test = full_pipeline.transform(test_data)
y_pred = svm_clf.predict(X_test)
# Evaluate with cross-validation
from sklearn.model_selection import cross_val_score

svm_scores = cross_val_score(svm_clf, X_train, y_train, cv=10)
svm_scores.mean()
0.7320304165247986
# Try a Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_scores = cross_val_score(forest_clf, X_train, y_train, cv=10)
forest_scores.mean()
0.8060137895812053
# Box plots of the cross-validation scores of both models
plt.figure(figsize=(8, 4))
plt.plot([1]*10, svm_scores, ".")
plt.plot([2]*10, forest_scores, ".")
plt.boxplot([svm_scores, forest_scores], labels=("SVM","Random Forest"))
plt.ylabel("Accuracy", fontsize=14)
plt.show()

[Figure: box plots of the SVM and Random Forest cross-validation accuracies]

A box plot shows, from top to bottom:

1. The maximum (or the maximum after removing outliers)

2. The value at the 75th percentile

3. The median

4. The value at the 25th percentile

5. The minimum

# Bucket ages into 15-year bins
train_data["AgeBucket"] = train_data["Age"] // 15 * 15
train_data[["AgeBucket", "Survived"]].groupby(['AgeBucket']).mean()
Survived
AgeBucket
0.0 0.576923
15.0 0.362745
30.0 0.423256
45.0 0.404494
60.0 0.240000
75.0 1.000000
# Sum siblings/spouses and parents/children into a single feature
train_data["RelativesOnboard"] = train_data["SibSp"] + train_data["Parch"]
train_data[["RelativesOnboard", "Survived"]].groupby(['RelativesOnboard']).mean()
Survived
RelativesOnboard
0 0.303538
1 0.552795
2 0.578431
3 0.724138
4 0.200000
5 0.136364
6 0.333333
7 0.000000
10 0.000000

4. Build a spam classifier

1) First, fetch the data

import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")

def fetch_spam_data(spam_url=SPAM_URL, spam_path=SPAM_PATH):
    if not os.path.isdir(spam_path):
        os.makedirs(spam_path)
    for filename, url in (("ham.tar.bz2", HAM_URL), ("spam.tar.bz2", SPAM_URL)):
        path = os.path.join(spam_path, filename)
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        tar_bz2_file = tarfile.open(path)
        tar_bz2_file.extractall(path=SPAM_PATH)
        tar_bz2_file.close()
# fetch_spam_data()
HAM_DIR = os.path.join(SPAM_PATH, "easy_ham")
SPAM_DIR = os.path.join(SPAM_PATH, "spam")
ham_filenames = [name for name in sorted(os.listdir(HAM_DIR)) if len(name) > 20]
spam_filenames = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 20]
len(ham_filenames),len(spam_filenames)
(2500, 500)
ham_filenames[0]
'00001.7c53336b37003a9286aba55d2945844c'

We use Python's email package to read and parse the email data.

import email
import email.policy

def load_email(is_spam, filename, spam_path=SPAM_PATH):
    directory = "spam" if is_spam else "easy_ham"
    with open(os.path.join(spam_path, directory, filename), "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]

Let's look at the content of one ham email and one spam email:

print(ham_emails[1].get_content().strip())
Martin A posted:
Tassos Papadopoulos, the Greek sculptor behind the plan, judged that the
 limestone of Mount Kerdylio, 70 miles east of Salonika and not far from the
 Mount Athos monastic community, was ideal for the patriotic sculpture. 
 
 As well as Alexander's granite features, 240 ft high and 170 ft wide, a
 museum, a restored amphitheatre and car park for admiring crowds are
planned
---------------------
So is this mountain limestone or granite?
If it's limestone, it'll weather pretty fast.

------------------------ Yahoo! Groups Sponsor ---------------------~-->
4 DVDs Free +s&p Join Now
http://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM
---------------------------------------------------------------------~->

To unsubscribe from this group, send an email to:
[email protected]

 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
print(spam_emails[6].get_content().strip())
Help wanted.  We are a 14 year old fortune 500 company, that is
growing at a tremendous rate.  We are looking for individuals who
want to work from home.

This is an opportunity to make an excellent income.  No experience
is required.  We will train you.

So if you are looking to be employed from home with a career that has
vast opportunities, then go:

http://www.basetel.com/wealthnow

We are looking for energetic and self motivated people.  If that is you
than click on the link and fill out the form, and one of our
employement specialist will contact you.

To be removed from our link simple go to:

http://www.basetel.com/remove.html


4139vOLW7-758DoDY1425FRhM1-764SMFc8513fCsLl40

Some emails have multiple parts, such as images and attachments; let's look at the structures that actually occur:

def get_email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    if isinstance(payload, list):
        return "multipart({})".format(", ".join([
            get_email_structure(sub_email)
            for sub_email in payload
        ]))
    else:
        return email.get_content_type()
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures
structures_counter(ham_emails).most_common()
[('text/plain', 2408),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/x-java-applet)', 1)]
structures_counter(spam_emails).most_common()
[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]

The ham emails are more often plain text, while the spam contains a lot of HTML. Also, quite a few ham emails are signed with PGP, while none of the spam is.

Let's take a look at the email headers:

for header, value in spam_emails[0].items():
    print(header,":",value)
Return-Path : <[email protected]>
Delivered-To : [email protected]
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32	for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
Received : from mail.webnote.net [193.120.211.219]	by localhost with POP3 (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)
Received : from dd_it7 ([210.97.77.167])	by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623	for <[email protected]>; Thu, 22 Aug 2002 13:09:41 +0100
From : [email protected]
Received : from r-smtp.korea.com - 203.122.2.197 by dd_it7  with Microsoft SMTPSVC(5.5.1775.675.6);	 Sat, 24 Aug 2002 09:42:10 +0900
To : [email protected]
Subject : Life Insurance - Why Pay More?
Date : Wed, 21 Aug 2002 20:31:57 -1600
MIME-Version : 1.0
Message-ID : <0103c1042001882DD_IT7@dd_it7>
Content-Type : text/html; charset="iso-8859-1"
Content-Transfer-Encoding : quoted-printable
# Look at the subject line
spam_emails[0]["Subject"]
'Life Insurance - Why Pay More?'
# Split into a training set and a test set
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(ham_emails + spam_emails)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert text/html to plain text: drop <head>, replace <a> tags with the word HYPERLINK, strip other tags, and unescape entities such as &gt; and &nbsp;
import re
from html import unescape

def html_to_plain_text(html):
    text = re.sub('<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub('<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub('<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)
# Pull out an HTML spam email to inspect
html_spam_emails = [email for email in X_train[y_train==1]
                    if get_email_structure(email) == "text/html"]
sample_html_spam = html_spam_emails[7]
print(sample_html_spam.get_content().strip()[:1000], "...")
<HTML><HEAD><TITLE></TITLE><META http-equiv="Content-Type" content="text/html; charset=windows-1252"><STYLE>A:link {TEX-DECORATION: none}A:active {TEXT-DECORATION: none}A:visited {TEXT-DECORATION: none}A:hover {COLOR: #0033ff; TEXT-DECORATION: underline}</STYLE><META content="MSHTML 6.00.2713.1100" name="GENERATOR"></HEAD>
<BODY text="#000000" vLink="#0033ff" link="#0033ff" bgColor="#CCCC99"><TABLE borderColor="#660000" cellSpacing="0" cellPadding="0" border="0" width="100%"><TR><TD bgColor="#CCCC99" valign="top" colspan="2" height="27">
<font size="6" face="Arial, Helvetica, sans-serif" color="#660000">
<b>OTC</b></font></TD></TR><TR><TD height="2" bgcolor="#6a694f">
<font size="5" face="Times New Roman, Times, serif" color="#FFFFFF">
<b>&nbsp;Newsletter</b></font></TD><TD height="2" bgcolor="#6a694f"><div align="right"><font color="#FFFFFF">
<b>Discover Tomorrow's Winners&nbsp;</b></font></div></TD></TR><TR><TD height="25" colspan="2" bgcolor="#CCCC99"><table width="100%" border="0"  ...
len(html_spam_emails)
150
# The same email with the HTML stripped
print(html_to_plain_text(sample_html_spam.get_content())[:1000], "...")
OTC
 Newsletter
Discover Tomorrow's Winners 
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Watch for analyst "Strong Buy Recommendations" and several advisory newsletters picking CBYI.  CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exchange. CBYI is trading around 25 cents and should skyrocket to $2.66 - $3.25 a share in the near future.
Put CBYI on your watch list, acquire a position TODAY.
REASONS TO INVEST IN CBYI
A profitable company and is on track to beat ALL earnings estimates!
One of the FASTEST growing distributors in environmental & safety equipment instruments.
Excellent management team, several EXCLUSIVE contracts.  IMPRESSIVE client list including the U.S. Air Force, Anheuser-Busch, Chevron Refining and Mitsubishi Heavy Industries, GE-Energy & Environmental Research.
RAPIDLY GROWING INDUSTRY
Industry revenues exceed $900 million, estimates indicate that there could be as much as $25 billi ...
# Convert any email to plain text, whatever its format
def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if not ctype in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except: # in case of encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content
        else:
            html = content
    if html:
        return html_to_plain_text(html)
for part in sample_html_spam.walk():
    print(part.get_content_type())
text/html
# Check the result
print(email_to_text(sample_html_spam)[:100], "...")
OTC
 Newsletter
Discover Tomorrow's Winners 
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Wat ...
# Natural Language Toolkit (NLTK)
import nltk
# Word stemming with the Porter stemmer
stemmer = nltk.PorterStemmer()
for word in ("Computations", "Computation", "Computing", "Computed", "Compute", "Compulsive"):
    print(word, "=>", stemmer.stem(word))
Computations => comput
Computation => comput
Computing => comput
Computed => comput
Compute => comput
Compulsive => compuls
# Extract URLs from text
import urlextract # may require an Internet connection to download root domain names

url_extractor = urlextract.URLExtract()
print(url_extractor.find_urls("Will it detect github.com and https://youtu.be/7Pq-S557XQU?t=3m32s"))
['github.com', 'https://youtu.be/7Pq-S557XQU?t=3m32s']
# Put all the preprocessing steps together into a single transformer
from sklearn.base import BaseEstimator, TransformerMixin

class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, lower_case=True, remove_punctuation=True,
                 replace_urls=True, replace_numbers=True, stemming=True):
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_transformed = []
        for email in X:
            text = email_to_text(email) or ""
            if self.lower_case:
                text = text.lower()
            if self.replace_urls and url_extractor is not None:
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    text = text.replace(url, " URL ")
            if self.replace_numbers:
                text = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', 'NUMBER', text)
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            word_counts = Counter(text.split())
            if self.stemming and stemmer is not None:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)
# Count word frequencies for a few sample emails
X_few = X_train[:3]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)
X_few_wordcounts
array([Counter({'chuck': 1, 'murcko': 1, 'wrote': 1, 'stuff': 1, 'yawn': 1, 'r': 1}),
       Counter({'the': 11, 'of': 9, 'and': 8, 'all': 3, 'christian': 3, 'to': 3, 'by': 3, 'jefferson': 2, 'i': 2, 'have': 2, 'superstit': 2, 'one': 2, 'on': 2, 'been': 2, 'ha': 2, 'half': 2, 'rogueri': 2, 'teach': 2, 'jesu': 2, 'some': 1, 'interest': 1, 'quot': 1, 'url': 1, 'thoma': 1, 'examin': 1, 'known': 1, 'word': 1, 'do': 1, 'not': 1, 'find': 1, 'in': 1, 'our': 1, 'particular': 1, 'redeem': 1, 'featur': 1, 'they': 1, 'are': 1, 'alik': 1, 'found': 1, 'fabl': 1, 'mytholog': 1, 'million': 1, 'innoc': 1, 'men': 1, 'women': 1, 'children': 1, 'sinc': 1, 'introduct': 1, 'burnt': 1, 'tortur': 1, 'fine': 1, 'imprison': 1, 'what': 1, 'effect': 1, 'thi': 1, 'coercion': 1, 'make': 1, 'world': 1, 'fool': 1, 'other': 1, 'hypocrit': 1, 'support': 1, 'error': 1, 'over': 1, 'earth': 1, 'six': 1, 'histor': 1, 'american': 1, 'john': 1, 'e': 1, 'remsburg': 1, 'letter': 1, 'william': 1, 'short': 1, 'again': 1, 'becom': 1, 'most': 1, 'pervert': 1, 'system': 1, 'that': 1, 'ever': 1, 'shone': 1, 'man': 1, 'absurd': 1, 'untruth': 1, 'were': 1, 'perpetr': 1, 'upon': 1, 'a': 1, 'larg': 1, 'band': 1, 'dupe': 1, 'import': 1, 'led': 1, 'paul': 1, 'first': 1, 'great': 1, 'corrupt': 1}),
       Counter({'url': 4, 's': 3, 'group': 3, 'to': 3, 'in': 2, 'forteana': 2, 'martin': 2, 'an': 2, 'and': 2, 'we': 2, 'is': 2, 'yahoo': 2, 'unsubscrib': 2, 'y': 1, 'adamson': 1, 'wrote': 1, 'for': 1, 'altern': 1, 'rather': 1, 'more': 1, 'factual': 1, 'base': 1, 'rundown': 1, 'on': 1, 'hamza': 1, 'career': 1, 'includ': 1, 'hi': 1, 'belief': 1, 'that': 1, 'all': 1, 'non': 1, 'muslim': 1, 'yemen': 1, 'should': 1, 'be': 1, 'murder': 1, 'outright': 1, 'know': 1, 'how': 1, 'unbias': 1, 'memri': 1, 'don': 1, 't': 1, 'html': 1, 'rob': 1, 'sponsor': 1, 'number': 1, 'dvd': 1, 'free': 1, 'p': 1, 'join': 1, 'now': 1, 'from': 1, 'thi': 1, 'send': 1, 'email': 1, 'egroup': 1, 'com': 1, 'your': 1, 'use': 1, 'of': 1, 'subject': 1})],
      dtype=object)

Now convert the word counts to vectors, keeping only the n most frequent words:

from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.most_common_ = most_common
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors
<3x11 sparse matrix of type '<class 'numpy.int32'>'
	with 20 stored elements in Compressed Sparse Row format>
# The resulting word-count vectors
X_few_vectors.toarray()
array([[ 6,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [99, 11,  9,  8,  3,  1,  3,  1,  3,  2,  3],
       [67,  0,  1,  2,  3,  4,  1,  2,  0,  1,  0]], dtype=int32)
# The vocabulary behind the vector columns
vocab_transformer.vocabulary_
{'the': 1,
 'of': 2,
 'and': 3,
 'to': 4,
 'url': 5,
 'all': 6,
 'in': 7,
 'christian': 8,
 'on': 9,
 'by': 10}
# Build the full preprocessing pipeline
from sklearn.pipeline import Pipeline

preprocess_pipeline = Pipeline([
    ("email_to_wordcount", EmailToWordCounterTransformer()),
    ("wordcount_to_vector", WordCounterToVectorTransformer()),
])

X_train_transformed = preprocess_pipeline.fit_transform(X_train)

LogisticRegression supports several solvers:

a) liblinear: based on the open-source liblinear library; optimizes the loss with coordinate descent.

b) lbfgs: a quasi-Newton method that uses the matrix of second derivatives of the loss (the Hessian) to iterate toward the optimum.

c) newton-cg: another member of the Newton family, also based on the Hessian of the loss.

d) sag: stochastic average gradient descent, a variant of gradient descent that uses only a subset of the samples to compute each gradient step, which suits datasets with many samples.
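For example, switching solvers is just a constructor argument (a sketch; lbfgs usually needs more iterations than the default to converge on input like this):

from sklearn.linear_model import LogisticRegression
log_clf_lbfgs = LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42)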

# Train a logistic regression classifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, X_train_transformed, y_train, cv=3, verbose=3)
score.mean()
0.9858333333333333
# Evaluate on the test set
from sklearn.metrics import precision_score, recall_score

X_test_transformed = preprocess_pipeline.transform(X_test)

log_clf = LogisticRegression(solver="liblinear", random_state=42)
log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))
Precision: 95.88%
Recall: 97.89%

That's all for this post. I hope it helps you get a better grip on working with the MNIST dataset. If anything is unclear, leave a comment and I will do my best to answer; I haven't fully mastered this material myself yet, so I can only help as far as I can!
Another day in Chen Yiyue's programming journey ^ _ ^
