Machine Learning Series (31): ROC Curves and the Confusion Matrix for Multiclass Problems

This post covers the ROC curve and the multiclass confusion matrix.

ROC Curve

ROC stands for Receiver Operating Characteristic curve. It describes the relationship between TPR and FPR.

TPR (True Positive Rate) is computed as:
$TPR=\frac{TP}{TP+FN}$

It is the fraction of samples whose true label is 1 that are correctly predicted as 1, so TPR is the same quantity as Recall. FPR (False Positive Rate) is computed as:
$FPR=\frac{FP}{TN+FP}$

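It is the fraction of samples whose true label is 0 that are wrongly predicted as 1. Unlike Precision and Recall, which pull against each other as described in the previous post, TPR and FPR move in the same direction: when the decision threshold is lowered so that TPR rises, FPR rises as well (or at least does not fall).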
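
To make the two rates concrete, here is a minimal sketch on made-up data. The labels, scores and the helper name `rates` are hypothetical and are not taken from the digits example. It counts TP, FN, FP and TN at two thresholds and shows both rates rising together when the threshold is lowered.

```python
import numpy as np

# Hypothetical toy data: true labels and classifier scores (not from the article).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 1])
scores = np.array([-3.0, -2.0, -1.0, 0.5, -0.5, 1.0, 2.0, 1.5, 3.0, 0.2])

def rates(y_true, scores, threshold):
    """Return (TPR, FPR) when every score >= threshold is predicted as 1."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (fp + tn)

tpr_high, fpr_high = rates(y_true, scores, threshold=1.0)
tpr_low, fpr_low = rates(y_true, scores, threshold=0.0)
print(tpr_high, fpr_high)  # TPR = 3/5, FPR = 1/5
print(tpr_low, fpr_low)    # TPR = 4/5, FPR = 2/5
```

Each threshold yields one (FPR, TPR) point, and the ROC curve is the set of such points over all thresholds.
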
Next we plot the ROC curve with our own module. Before that, add the following code to metrics.py in the play_Ml module:

# Classification metrics
import numpy as np
from math import sqrt
def _safe_divide(numerator, denominator):
    # NumPy integers do not raise on division by zero (they give nan plus a
    # warning), so test the denominator explicitly instead of using try/except.
    return numerator / denominator if denominator != 0 else 0.0
def TN(y_true, y_predict):
    assert len(y_true)==len(y_predict)
    return np.sum((y_true==0)&(y_predict==0))
def FP(y_true,y_predict):
    assert len(y_true)==len(y_predict)
    return np.sum((y_true==0)&(y_predict==1))
def FN(y_true,y_predict):
    assert len(y_true)==len(y_predict)
    return np.sum((y_true==1)&(y_predict==0))
def TP(y_true,y_predict):
    assert len(y_true)==len(y_predict)
    return np.sum((y_true==1)&(y_predict==1))
def confusion_matrix(y_true,y_predict):
    # Rows are true classes, columns are predicted classes
    return np.array([
        [TN(y_true,y_predict),FP(y_true,y_predict)],
        [FN(y_true,y_predict),TP(y_true,y_predict)]
    ])
def precision_score(y_true,y_predict):
    tp = TP(y_true,y_predict)
    fp = FP(y_true,y_predict)
    # Returns 0.0 when nothing was predicted positive
    return _safe_divide(tp, tp+fp)
def recall_score(y_true,y_predict):
    tp = TP(y_true,y_predict)
    fn = FN(y_true,y_predict)
    # Returns 0.0 when there are no positive samples
    return _safe_divide(tp, tp+fn)
def f1_score(precision,recall):
    # Returns 0.0 when precision and recall are both 0
    return _safe_divide(2*precision*recall, precision+recall)
def TPR(y_true,y_predict):
    tp = TP(y_true,y_predict)
    fn = FN(y_true,y_predict)
    # Same quantity as recall
    return _safe_divide(tp, tp+fn)
def FPR(y_true,y_predict):
    fp = FP(y_true,y_predict)
    tn = TN(y_true,y_predict)
    # Returns 0.0 when there are no negative samples
    return _safe_divide(fp, fp+tn)

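The dataset is the same as in the previous post, the handwritten digits dataset reduced to two classes, and the classifier is logistic regression: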

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()

y[digits.target==9]=1
y[digits.target!=9]=0

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=666)
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train,y_train)

decision_scores = log_reg.decision_function(X_test)

Plot the ROC curve:

from play_Ml.metrics import FPR,TPR
fprs = []
tprs = []

thresholds = np.arange(np.min(decision_scores),np.max(decision_scores),0.1)
for threshold in thresholds:
    y_predict = np.array(decision_scores >= threshold,dtype=int)
    fprs.append(FPR(y_test,y_predict))
    tprs.append(TPR(y_test,y_predict))
    
# Plot the ROC curve
plt.plot(fprs,tprs)
plt.show()

[Figure: the resulting ROC curve]

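You can also use the roc_curve function that sklearn already provides. It takes the true labels and decision_scores and returns three values, fprs, tprs and thresholds, so the curve can be plotted directly from fprs and tprs: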

from sklearn.metrics import roc_curve
fprs, tprs, thresholds = roc_curve(y_test, decision_scores)
plt.plot(fprs,tprs)
plt.show()

[Figure: the ROC curve produced by roc_curve]

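The result is the same. The area under the ROC curve reflects, to some extent, how good the model is: a larger area generally means a better model, and the maximum possible area is 1. sklearn provides a function that computes it: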

# Area under the ROC curve (AUC)
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,decision_scores)
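
The area can also be checked by hand. The sketch below uses made-up labels and scores, and `roc_points` is a hypothetical helper name. It builds the ROC points by sweeping the threshold over every distinct score, integrates them with the trapezoidal rule, and compares the result with the fraction of (positive, negative) pairs in which the positive sample scores higher. Those two numbers coincide when no scores are tied. It illustrates the idea only and is not a description of how sklearn computes roc_auc_score.

```python
import numpy as np

# Hypothetical toy data (not from the digits example).
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.65, 0.9])

def roc_points(y_true, scores):
    """Sweep the threshold from high to low; return arrays of FPR and TPR."""
    thresholds = np.sort(np.unique(scores))[::-1]
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fprs, tprs = [0.0], [0.0]  # start at (0, 0): nothing predicted positive
    for t in thresholds:
        y_pred = scores >= t
        tprs.append(np.sum(y_pred & (y_true == 1)) / n_pos)
        fprs.append(np.sum(y_pred & (y_true == 0)) / n_neg)
    return np.array(fprs), np.array(tprs)

fprs, tprs = roc_points(y_true, scores)

# Trapezoidal rule: sum of width * average height over consecutive points.
auc_trapezoid = np.sum(np.diff(fprs) * (tprs[1:] + tprs[:-1]) / 2)

# Pairwise view: fraction of (positive, negative) pairs ranked correctly.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
auc_pairs = np.mean(pos[:, None] > neg[None, :])

print(auc_trapezoid, auc_pairs)  # both are 13/16 = 0.8125
```

The pairwise view depends only on how positives rank against negatives, not on how many samples each class has.
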

[Output: the area under the ROC curve for the model in this example]

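A result of 0.98 in this example is quite good. However, the area under the ROC curve is not very sensitive to skewed (class-imbalanced) data, so it is generally not used on its own to judge a model. It is more useful when several models are compared: plotting their ROC curves on the same axes shows directly which one is better.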

Confusion Matrix for Multiclass Problems

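So far the confusion matrix has been discussed for binary classification. Now let us see what it looks like for a multiclass problem. We use the handwritten digits dataset again, but this time without reducing it to two classes, and look at the confusion matrix of the full 10-class problem: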

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.8, random_state=666)

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train,y_train)
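# Older scikit-learn releases handle the multiclass case one-vs-rest (OvR) by default;
# newer releases default to a multinomial formulation.
y_predict = log_reg.predict(X_test)
from sklearn.metrics import confusion_matrix
# confusion_matrix supports multiclass problems out of the box
cfm = confusion_matrix(y_test,y_predict)
cfm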
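
To see what each entry of a multiclass confusion matrix means, here is a minimal hand-rolled version on made-up labels. The 3-class data and the helper name `multiclass_confusion_matrix` are hypothetical. Entry [i, j] counts the samples whose true class is i and whose predicted class is j, the same row and column convention that sklearn's confusion_matrix uses.

```python
import numpy as np

def multiclass_confusion_matrix(y_true, y_pred, n_classes):
    """Count (true class, predicted class) pairs; rows = true, columns = predicted."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when the same (i, j) pair repeats.
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Hypothetical labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2])

cm = multiclass_confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
```

The diagonal holds the correct predictions, and every off-diagonal cell is one specific kind of mistake.
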

[Output: the 10×10 confusion matrix]

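sklearn's confusion_matrix supports multiclass problems directly. With the confusion matrix in hand, we can plot it to see where the classification errors occur and use that to improve the model. First, look at the proportion of each kind of error: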

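# Row sums, kept as a column vector so each row is divided by its own total
row_sums = np.sum(cfm, axis=1, keepdims=True)
err_matrix = cfm / row_sums
# Zero the diagonal (correct predictions) so only the errors remain
np.fill_diagonal(err_matrix, 0)
err_matrix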
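
One detail worth checking in this step is how the division broadcasts. The sketch below uses a made-up 3×3 confusion matrix (hypothetical numbers). Dividing by a 1-D vector of row sums scales the columns, whereas keeping the sums as a column vector with keepdims=True scales each row by its own total, which is what a per-class error rate needs.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted).
cfm = np.array([[8, 1, 1],
                [2, 16, 2],
                [0, 10, 30]])

row_sums_1d = cfm.sum(axis=1)                 # shape (3,)
row_sums_2d = cfm.sum(axis=1, keepdims=True)  # shape (3, 1)

by_column = cfm / row_sums_1d   # column j is divided by row_sums[j]
by_row = cfm / row_sums_2d      # row i is divided by row_sums[i]

err_matrix = by_row.copy()
np.fill_diagonal(err_matrix, 0)  # keep only the mistakes
print(err_matrix)
```

With row-wise normalization every row sums to 1 before the diagonal is cleared, so each remaining cell reads as "fraction of true class i predicted as class j".
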

[Output: the matrix of error proportions]

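Next we plot these errors. The more often a mistake occurs, the higher the corresponding pixel value, so in a grayscale image it appears brighter: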

plt.matshow(err_matrix,cmap=plt.cm.gray)
plt.show()

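The plot shows that most mistakes fall in two places: 1 misclassified as 9, and 8 misclassified as 1. Knowing that these two errors dominate, we can improve the model and tune its parameters accordingly so that it performs better.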
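
The brightest cells can also be located programmatically instead of by eye. This is a minimal sketch on a made-up error matrix. The numbers are hypothetical and are not the values from the digits experiment.

```python
import numpy as np

# Hypothetical error matrix with the diagonal already zeroed
# (rows = true class, columns = predicted class).
err_matrix = np.array([[0.00, 0.02, 0.01],
                       [0.05, 0.00, 0.10],
                       [0.03, 0.20, 0.00]])

# Sort all cells by error rate, largest first, and keep the top two.
flat_order = np.argsort(err_matrix, axis=None)[::-1][:2]
top_pairs = [tuple(int(k) for k in np.unravel_index(idx, err_matrix.shape))
             for idx in flat_order]

for true_cls, pred_cls in top_pairs:
    rate = err_matrix[true_cls, pred_cls]
    print(f"true {true_cls} -> predicted {pred_cls}: {rate:.2f}")
```
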
