Chinese documentation: http://sklearn.apachecn.org/cn/stable/modules/model_evaluation.html

English documentation: http://sklearn.apachecn.org/en/stable/modules/model_evaluation.html

Official documentation: http://scikit-learn.org/stable/

GitHub: https://github.com/apachecn/scikit-learn-doc-zh

Contributors: https://github.com/apachecn/scikit-learn-doc-zh#貢獻者

About us: http://www.apachecn.org/organization/209.html

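3.3. Model evaluation: quantifying the quality of predictions

There are 3 different APIs for evaluating the quality of a model's predictions:

Estimator score method: Estimators have a score method that provides a default evaluation criterion for the problem they are designed to solve. It is not discussed on this page, but in each estimator's documentation.
Scoring parameter: Model-evaluation tools that use cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy. This is discussed in the section "The scoring parameter: defining model evaluation rules".
Metric functions: The metrics module implements functions that assess prediction error for specific purposes. These metrics are detailed in the sections on classification metrics, multilabel ranking metrics, regression metrics and clustering metrics.

Finally, dummy estimators are useful for obtaining baseline values of these metrics for random predictions.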

3.3.1. The `scoring` parameter: defining model evaluation rules

Model selection and evaluation tools, such as model_selection.GridSearchCV and model_selection.cross_val_score, take a scoring parameter that controls which metric they apply to the estimators being evaluated.

3.3.1.1. Common cases: predefined values

For the most common use cases, you can designate a scorer object with the scoring parameter; the table below shows all possible values. All scorer objects follow the convention that higher return values are better than lower return values. Thus metrics that measure the distance between the model and the data, such as metrics.mean_squared_error, are available as neg_mean_squared_error, which returns the negated value of the metric.

Scoring	Function	Comment
Classification
‘accuracy’	`metrics.accuracy_score`
‘average_precision’	`metrics.average_precision_score`
‘f1’	`metrics.f1_score`	for binary targets
‘f1_micro’	`metrics.f1_score`	micro-averaged
‘f1_macro’	`metrics.f1_score`	macro-averaged
‘f1_weighted’	`metrics.f1_score`	weighted average
‘f1_samples’	`metrics.f1_score`	by multilabel sample
‘neg_log_loss’	`metrics.log_loss`	requires `predict_proba` support
‘precision’ etc.	`metrics.precision_score`	suffixes apply as with ‘f1’
‘recall’ etc.	`metrics.recall_score`	suffixes apply as with ‘f1’
‘roc_auc’	`metrics.roc_auc_score`
Clustering
‘adjusted_mutual_info_score’	`metrics.adjusted_mutual_info_score`
‘adjusted_rand_score’	`metrics.adjusted_rand_score`
‘completeness_score’	`metrics.completeness_score`
‘fowlkes_mallows_score’	`metrics.fowlkes_mallows_score`
‘homogeneity_score’	`metrics.homogeneity_score`
‘mutual_info_score’	`metrics.mutual_info_score`
‘normalized_mutual_info_score’	`metrics.normalized_mutual_info_score`
‘v_measure_score’	`metrics.v_measure_score`
Regression
‘explained_variance’	`metrics.explained_variance_score`
‘neg_mean_absolute_error’	`metrics.mean_absolute_error`
‘neg_mean_squared_error’	`metrics.mean_squared_error`
‘neg_mean_squared_log_error’	`metrics.mean_squared_log_error`
‘neg_median_absolute_error’	`metrics.median_absolute_error`
‘r2’	`metrics.r2_score`

Usage examples:

>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import cross_val_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf = svm.SVC(probability=True, random_state=0)
>>> cross_val_score(clf, X, y, scoring='neg_log_loss') 
array([-0.07..., -0.16..., -0.06...])
>>> model = svm.SVC()
>>> cross_val_score(model, X, y, scoring='wrong_choice')
Traceback (most recent call last):
ValueError: 'wrong_choice' is not a valid scoring value. Valid options are ['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'completeness_score', 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score', 'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'normalized_mutual_info_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', 'v_measure_score']


Note

The values listed by the ValueError exception correspond to the functions measuring prediction accuracy described in the following sections. The scorer objects for those functions are stored in the dictionary sklearn.metrics.SCORERS.
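The sign convention for loss-type scorers can be seen in a minimal sketch such as the one below. The synthetic data and the choice of LinearRegression are assumptions made purely for illustration; only the 'neg_mean_squared_error' scoring string comes from the table above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=60)

# The scorer returns the negated MSE, so every fold score is <= 0
# and a score closer to zero means a better model.
neg_mse = cross_val_score(LinearRegression(), X, y, cv=3,
                          scoring="neg_mean_squared_error")

# Flip the sign to recover the usual non-negative mean squared error.
mse = -neg_mse
print(mse)
```

Negating the result is only needed for reporting; model selection tools can consume the negated values directly because they always maximize the score.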

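3.3.1.2. Defining your scoring strategy from metric functions

The module sklearn.metrics also exposes a set of simple functions that measure a prediction error given ground truth and prediction:

Functions ending with _score return a value to maximize; the higher the better.
Functions ending with _error or _loss return a value to minimize; the lower the better. When converting them into a scorer object with make_scorer, set the greater_is_better parameter to False (True by default; see the parameter description below).

Metrics available for various machine learning tasks are detailed in the sections below.

Many metrics are not given names to be used as scoring values, sometimes because they require additional parameters, such as fbeta_score. In such cases, you need to generate an appropriate scoring object. The simplest way to generate a callable object for scoring is to use make_scorer. That function converts metrics into callables that can be used for model evaluation.

One typical use case is to wrap an existing metric function from the library with non-default values for its parameters, such as the beta parameter of the fbeta_score function: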

>>> from sklearn.metrics import fbeta_score, make_scorer
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)


The second use case is to build a completely custom scorer object from a simple Python function using make_scorer, which can take several parameters:

The Python function you want to use (my_custom_loss_func in the example below).
Whether the Python function returns a score (greater_is_better=True, the default) or a loss (greater_is_better=False). If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
For classification metrics only: whether the Python function you provided requires continuous decision certainties (needs_threshold=True). The default value is False.
Any additional parameters, such as beta or labels in f1_score.

Here is an example of building custom scorers and of using the greater_is_better parameter:

>>> import numpy as np
>>> def my_custom_loss_func(ground_truth, predictions):
...     diff = np.abs(ground_truth - predictions).max()
...     return np.log(1 + diff)
...
>>> # The `loss` scorer negates the return value of my_custom_loss_func,
>>> #  which is np.log(2), about 0.693, for the data defined below. Note that
>>> #  ground_truth is passed to the scorer as X and predictions as y.
>>> loss  = make_scorer(my_custom_loss_func, greater_is_better=False)
>>> score = make_scorer(my_custom_loss_func, greater_is_better=True)
>>> ground_truth = [[1], [1]]
>>> predictions  = [0, 1]
>>> from sklearn.dummy import DummyClassifier
>>> clf = DummyClassifier(strategy='most_frequent', random_state=0)
>>> clf = clf.fit(ground_truth, predictions)
>>> loss(clf,ground_truth, predictions) 
-0.69...
>>> score(clf,ground_truth, predictions) 
0.69...


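3.3.1.3. Implementing your own scoring object

You can generate even more flexible model scorers by constructing your own scoring object from scratch, without using the make_scorer factory. For a callable to be a scorer, it needs to meet the protocol specified by the following two rules:

It can be called with parameters (estimator, X, y), where estimator is the model that should be evaluated, X is validation data, and y is the ground truth target for X (in the supervised case) or None (in the unsupervised case).
It returns a floating point number that quantifies the estimator prediction quality on X, with reference to y. Again, by convention higher numbers are better, so if your scorer returns a loss, that value should be negated.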
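The two rules above can be sketched with a plain function. The name neg_max_abs_error, the synthetic data and the Ridge estimator are hypothetical choices for illustration; the only thing taken from the text is the (estimator, X, y) call signature and the "higher is better" convention.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def neg_max_abs_error(estimator, X, y):
    """Scorer protocol: called as (estimator, X, y), returns one float."""
    residuals = np.abs(np.asarray(y) - estimator.predict(X))
    # The largest absolute error is a loss, so negate it: higher is better.
    return -float(residuals.max())


# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(scale=0.1, size=60)

# A callable that follows the protocol can be passed directly as `scoring`.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=3,
                         scoring=neg_max_abs_error)
print(scores)
```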

3.3.1.4. Using multiple metric evaluation

Scikit-learn also permits evaluation of multiple metrics in GridSearchCV, RandomizedSearchCV and cross_validate.

There are two ways to specify multiple scoring metrics for the scoring parameter:

As an iterable of string metrics:
>>> scoring = ['accuracy', 'precision']

As a dict mapping the scorer name to the scoring function:

>>> from sklearn.metrics import accuracy_score
>>> from sklearn.metrics import make_scorer
>>> scoring = {'accuracy': make_scorer(accuracy_score),
...            'prec': 'precision'}


Note that the dict values can either be scorer functions or one of the predefined metric strings.

Currently only those scorer functions that return a single score can be passed inside the dict. Scorer functions that return multiple values are not permitted and will require a wrapper to return a single metric:

>>> from sklearn import datasets
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import confusion_matrix, make_scorer
>>> from sklearn.svm import LinearSVC
>>> # A sample toy binary classification dataset
>>> X, y = datasets.make_classification(n_classes=2, random_state=0)
>>> clf = LinearSVC(random_state=0)
>>> # confusion_matrix rows are true classes and columns are predicted
>>> # classes, so for labels {0, 1} it reads [[tn, fp], [fn, tp]]
>>> def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
>>> def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
>>> def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
>>> def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]
>>> scoring = {'tp' : make_scorer(tp), 'tn' : make_scorer(tn),
...            'fp' : make_scorer(fp), 'fn' : make_scorer(fn)}
>>> cv_results = cross_validate(clf, X, y, scoring=scoring)
>>> # Getting the test set true positive scores (one count per fold)
>>> tp_per_fold = cv_results['test_tp']
>>> # Getting the test set false negative scores (one count per fold)
>>> fn_per_fold = cv_results['test_fn']
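As a small sketch of the list-of-strings form described earlier in this section, each requested metric shows up in the result dict under a 'test_' prefixed key. The dataset size and the LogisticRegression estimator are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Toy binary classification data (hypothetical, for illustration only).
X, y = make_classification(n_samples=120, n_classes=2, random_state=0)

# Two predefined metric strings evaluated in a single cross-validation run.
results = cross_validate(LogisticRegression(), X, y, cv=3,
                         scoring=["accuracy", "precision"])

# Each metric is reported per fold under the key 'test_<metric name>'.
test_keys = sorted(k for k in results if k.startswith("test_"))
print(test_keys)
```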


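3.3.2. Classification metrics

The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance. Some metrics might require probability estimates of the positive class, confidence values, or binary decision values. Most implementations allow each sample to provide a weighted contribution to the overall score, through the sample_weight parameter.

Some of these are restricted to the binary classification case: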

`precision_recall_curve`(y_true, probas_pred)	Compute precision-recall pairs for different probability thresholds
`roc_curve`(y_true, y_score[, pos_label, …])	Compute Receiver operating characteristic (ROC)

Others also work in the multiclass case:

`cohen_kappa_score`(y1, y2[, labels, weights, …])	Cohen’s kappa: a statistic that measures inter-annotator agreement.
`confusion_matrix`(y_true, y_pred[, labels, …])	Compute confusion matrix to evaluate the accuracy of a classification
`hinge_loss`(y_true, pred_decision[, labels, …])	Average hinge loss (non-regularized)
`matthews_corrcoef`(y_true, y_pred[, …])	Compute the Matthews correlation coefficient (MCC)

Some also work in the multilabel case:

`accuracy_score`(y_true, y_pred[, normalize, …])	Accuracy classification score.
`classification_report`(y_true, y_pred[, …])	Build a text report showing the main classification metrics
`f1_score`(y_true, y_pred[, labels, …])	Compute the F1 score, also known as balanced F-score or F-measure
`fbeta_score`(y_true, y_pred, beta[, labels, …])	Compute the F-beta score
`hamming_loss`(y_true, y_pred[, labels, …])	Compute the average Hamming loss.
`jaccard_similarity_score`(y_true, y_pred[, …])	Jaccard similarity coefficient score
`log_loss`(y_true, y_pred[, eps, normalize, …])	Log loss, aka logistic loss or cross-entropy loss.
`precision_recall_fscore_support`(y_true, y_pred)	Compute precision, recall, F-measure and support for each class
`precision_score`(y_true, y_pred[, labels, …])	Compute the precision
`recall_score`(y_true, y_pred[, labels, …])	Compute the recall
`zero_one_loss`(y_true, y_pred[, normalize, …])	Zero-one classification loss.

Some are typically used for ranking:

`dcg_score`(y_true, y_score[, k])	Discounted cumulative gain (DCG) at rank K.
`ndcg_score`(y_true, y_score[, k])	Normalized discounted cumulative gain (NDCG) at rank K.

Some work with binary and multilabel (but not multiclass) problems:

`average_precision_score`(y_true, y_score[, …])	Compute average precision (AP) from prediction scores
`roc_auc_score`(y_true, y_score[, average, …])	Compute Area Under the Curve (AUC) from prediction scores

In the following sub-sections, we will describe each of those functions, preceded by some notes on common API and metric definition.

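3.3.2.1. From binary to multiclass and multilabel

Some metrics are essentially defined for binary classification tasks (e.g. f1_score, roc_auc_score). In these cases, by default only the positive label is evaluated, assuming by default that the positive class is labelled 1 (though this may be configurable through the pos_label parameter).

In extending a binary metric to multiclass or multilabel problems, the data is treated as a collection of binary problems, one for each class. There are then a number of ways to average binary metric calculations across the set of classes, each of which may be useful in some scenario. Where available, you should select among these using the average parameter.

"macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class.
"weighted" accounts for class imbalance by computing the average of binary metrics in which each class's score is weighted by its presence in the true data sample.
"micro" gives each sample-class pair an equal contribution to the overall metric (except as a result of sample_weight). Rather than summing the metric per class, this sums the dividends and divisors that make up the per-class metrics to calculate an overall quotient. Micro-averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.
"samples" applies only to multilabel problems. It does not calculate a per-class measure, instead calculating the metric over the true and predicted classes for each sample in the evaluation data, and returning their (sample_weight-weighted) average.
Selecting average=None will return an array with the score for each class.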
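The relationship between the averaging modes can be checked by hand on a small multiclass example. This is only a sketch: the label vectors are illustrative, and the "by hand" variables are names introduced here, not part of the library.

```python
import numpy as np
from sklearn.metrics import precision_score

y_true = np.array([0, 1, 2, 0, 1, 2])
y_pred = np.array([0, 2, 1, 0, 0, 1])

# average=None returns one precision value per class.
per_class = precision_score(y_true, y_pred, average=None)
support = np.bincount(y_true)  # number of true samples of each class

macro = precision_score(y_true, y_pred, average="macro")
weighted = precision_score(y_true, y_pred, average="weighted")
micro = precision_score(y_true, y_pred, average="micro")

# "macro": unweighted mean of the per-class values.
macro_by_hand = per_class.mean()
# "weighted": mean of the per-class values weighted by class support.
weighted_by_hand = np.average(per_class, weights=support)
# "micro": pooled tp / (tp + fp); with one label per sample this is
# simply the fraction of correct predictions.
micro_by_hand = np.mean(y_true == y_pred)

print(macro, weighted, micro)
```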

While multiclass data is provided to the metric, like binary targets, as an array of class labels, multilabel data is specified as an indicator matrix, in which cell [i, j] has value 1 if sample i has label j and value 0 otherwise.

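3.3.2.2. Accuracy score

The accuracy_score function computes the accuracy, either the fraction (default) or the count (normalize=False) of correct predictions.

In multilabel classification, the function returns the subset accuracy. If the entire set of predicted labels for a sample strictly matches the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.

If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the fraction of correct predictions over $n_\text{samples}$ is defined as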

$\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)$

where $1(x)$ is the indicator function.

>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2


In the multilabel case with binary label indicators:

>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5
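Subset accuracy in the multilabel case can be reproduced with a few NumPy operations. The sketch below reuses the indicator matrices from the example above; the variable names are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

Y_true = np.array([[0, 1], [1, 1]])
Y_pred = np.ones((2, 2), dtype=int)

# A sample counts as correct only if every one of its labels matches.
row_matches = np.all(Y_true == Y_pred, axis=1)
subset_accuracy = row_matches.mean()

print(subset_accuracy, accuracy_score(Y_true, Y_pred))
```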


Example:

See Test with permutations the significance of a classification score for an example of accuracy score usage using permutations of the dataset.

3.3.2.3. Cohen’s kappa

The function cohen_kappa_score computes Cohen’s kappa statistic. This measure is intended to compare labelings by different human annotators, not a classifier versus a ground truth.

The kappa score (see docstring) is a number between -1 and 1. Scores above .8 are generally considered good agreement; zero or lower means no agreement (practically random labels).

Kappa scores can be computed for binary or multiclass problems, but not for multilabel problems (except by manually computing a per-label score) and not for more than two annotators.

>>> from sklearn.metrics import cohen_kappa_score
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> cohen_kappa_score(y_true, y_pred)
0.4285714285714286
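The text above does not spell out the formula, so the following sketch assumes the standard definition $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance from the two annotators' label frequencies. It recomputes the value from the example above via the confusion matrix.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y1 = [2, 0, 2, 2, 0, 1]
y2 = [0, 0, 2, 2, 0, 2]

cm = confusion_matrix(y1, y2)
n = cm.sum()

# Observed agreement: fraction of samples on the diagonal.
p_observed = np.trace(cm) / n
# Chance agreement: product of the two annotators' marginal frequencies,
# summed over classes.
p_expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2

kappa_by_hand = (p_observed - p_expected) / (1 - p_expected)
print(kappa_by_hand)
```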


3.3.2.4. Confusion matrix

The confusion_matrix function evaluates classification accuracy by computing the confusion matrix.

By definition, entry $i, j$ in a confusion matrix is the number of observations actually in group $i$, but predicted to be in group $j$. Here is an example:

>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])


Here is a visual representation of such a confusion matrix (this figure comes from the Confusion matrix example):

[Figure: ../_images/sphx_glr_plot_confusion_matrix_0011.png]

For binary problems, we can get counts of true negatives, false positives, false negatives and true positives as follows:

>>> y_true = [0, 0, 0, 1, 1, 1, 1, 1]
>>> y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
>>> tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
>>> tn, fp, fn, tp
(2, 1, 2, 3)


Examples:

See Confusion matrix for an example of using a confusion matrix to evaluate classifier output quality.
See Recognizing hand-written digits for an example of using a confusion matrix to classify hand-written digits.
See Classification of text documents using sparse features for an example of using a confusion matrix to classify text documents.

3.3.2.5. Classification report

The classification_report function builds a text report showing the main classification metrics. Here is a small example with custom target_names and inferred labels:

>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 0]
>>> y_pred = [0, 0, 2, 1, 0]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
             precision    recall  f1-score   support

    class 0       0.67      1.00      0.80         2
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.50      0.67         2

avg / total       0.67      0.60      0.59         5


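Examples:

See Recognizing hand-written digits for an example of classification report usage for hand-written digits.
See Classification of text documents using sparse features for an example of classification report usage for text documents.
See Parameter estimation using grid search with cross-validation for an example of classification report usage for grid search with nested cross-validation.

3.3.2.6. Hamming loss

The hamming_loss computes the average Hamming loss or Hamming distance between two sets of samples.

If $\hat{y}_j$ is the predicted value for the $j$-th label of a given sample, $y_j$ is the corresponding true value, and $n_\text{labels}$ is the number of classes or labels, then the Hamming loss $L_{Hamming}$ between two samples is defined as: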

$L_{Hamming}(y, \hat{y}) = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} 1(\hat{y}_j \not= y_j)$

where $1(x)$ is the indicator function.

>>> from sklearn.metrics import hamming_loss
>>> y_pred = [1, 2, 3, 4]
>>> y_true = [2, 2, 3, 4]
>>> hamming_loss(y_true, y_pred)
0.25


In the multilabel case with binary label indicators:

>>> hamming_loss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2)))
0.75


Note

In multiclass classification, the Hamming loss corresponds to the Hamming distance between y_true and y_pred, which is similar to the zero-one loss function. However, while zero-one loss penalizes prediction sets that do not strictly match true sets, the Hamming loss penalizes individual labels. Thus the Hamming loss, upper bounded by the zero-one loss, is always between zero and one, inclusive; and predicting a proper subset or superset of the true labels will give a Hamming loss strictly between zero and one.
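The difference between the two losses in the multilabel case can be sketched with NumPy. The indicator matrices below are hypothetical; the point is only that the Hamming loss counts wrong labels while the zero-one loss counts samples with at least one wrong label, so the former never exceeds the latter.

```python
import numpy as np
from sklearn.metrics import hamming_loss, zero_one_loss

# Three samples, three labels each (hypothetical data).
Y_true = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 1]])
Y_pred = np.array([[0, 1, 0], [1, 1, 0], [1, 0, 1]])

# Hamming loss: fraction of individual labels that are wrong.
hamming_by_hand = np.mean(Y_true != Y_pred)
# Zero-one loss: fraction of samples with at least one wrong label.
zero_one_by_hand = np.mean(np.any(Y_true != Y_pred, axis=1))

print(hamming_by_hand, zero_one_by_hand)
```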

3.3.2.7. Jaccard similarity coefficient score

The jaccard_similarity_score function computes the average (default) or sum of Jaccard similarity coefficients, also called the Jaccard index, between pairs of label sets.

The Jaccard similarity coefficient of the $i$-th sample, with a ground truth label set $y_i$ and predicted label set $\hat{y}_i$, is defined as

$J(y_i, \hat{y}_i) = \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|}.$

In binary and multiclass classification, the Jaccard similarity coefficient score is equal to the classification accuracy.

>>> import numpy as np
>>> from sklearn.metrics import jaccard_similarity_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> jaccard_similarity_score(y_true, y_pred)
0.5
>>> jaccard_similarity_score(y_true, y_pred, normalize=False)
2


In the multilabel case with binary label indicators:

>>> jaccard_similarity_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.75
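The per-sample formula can be worked through directly in NumPy, without relying on any scikit-learn function. The sketch below uses the same indicator matrices as the example above and assumes that no sample has an empty union of true and predicted labels.

```python
import numpy as np

Y_true = np.array([[0, 1], [1, 1]], dtype=bool)
Y_pred = np.ones((2, 2), dtype=bool)

# |y_i ∩ ŷ_i| and |y_i ∪ ŷ_i| for every sample (row).
intersection = np.logical_and(Y_true, Y_pred).sum(axis=1)
union = np.logical_or(Y_true, Y_pred).sum(axis=1)

# Assumes union > 0 for every sample.
per_sample = intersection / union
mean_jaccard = per_sample.mean()

print(per_sample, mean_jaccard)
```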


3.3.2.8. Precision, recall and F-measures

Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples.

The F-measure ($F_\beta$ and $F_1$ measures) can be interpreted as a weighted harmonic mean of the precision and recall. An $F_\beta$ measure reaches its best value at 1 and its worst score at 0. With $\beta = 1$, $F_\beta$ and $F_1$ are equivalent, and the recall and the precision are equally important.

The precision_recall_curve computes a precision-recall curve from the ground truth label and a score given by the classifier by varying a decision threshold.

The average_precision_score function computes the average precision (AP) from prediction scores. This score corresponds to the area under the precision-recall curve. The value is between 0 and 1 and higher is better. With random predictions, the AP is the fraction of positive samples.

Several functions allow you to analyze the precision, recall and F-measures score:

`average_precision_score`(y_true, y_score[, …])	Compute average precision (AP) from prediction scores
`f1_score`(y_true, y_pred[, labels, …])	Compute the F1 score, also known as balanced F-score or F-measure
`fbeta_score`(y_true, y_pred, beta[, labels, …])	Compute the F-beta score
`precision_recall_curve`(y_true, probas_pred)	Compute precision-recall pairs for different probability thresholds
`precision_recall_fscore_support`(y_true, y_pred)	Compute precision, recall, F-measure and support for each class
`precision_score`(y_true, y_pred[, labels, …])	Compute the precision
`recall_score`(y_true, y_pred[, labels, …])	Compute the recall

Note that the precision_recall_curve function is restricted to the binary case. The average_precision_score function works only in binary classification and multilabel indicator format.

Examples:

See Classification of text documents using sparse features for an example of f1_score usage to classify text documents.
See Parameter estimation using grid search with cross-validation for an example of precision_score and recall_score usage to estimate parameters using grid search with nested cross-validation.
See Precision-Recall for an example of precision_recall_curve usage to evaluate classifier output quality.

3.3.2.8.1. Binary classification

In a binary classification task, the terms "positive" and "negative" refer to the classifier’s prediction, and the terms "true" and "false" refer to whether that prediction corresponds to the external judgment (sometimes known as the "observation"). Given these definitions, we can formulate the following table:

	Actual class (observation): positive	Actual class (observation): negative
Predicted class (expectation): positive	tp (true positive) Correct result	fp (false positive) Unexpected result
Predicted class (expectation): negative	fn (false negative) Missing result	tn (true negative) Correct absence of result

In this context, we can define the notions of precision, recall and F-measure:

$\text{precision} = \frac{tp}{tp + fp},$

$\text{recall} = \frac{tp}{tp + fn},$

$F_\beta = (1 + \beta^2) \frac{\text{precision} \times \text{recall}}{\beta^2 \text{precision} + \text{recall}}.$

Here are some small examples in binary classification:

>>> from sklearn import metrics
>>> y_pred = [0, 1, 0, 0]
>>> y_true = [0, 1, 0, 1]
>>> metrics.precision_score(y_true, y_pred)
1.0
>>> metrics.recall_score(y_true, y_pred)
0.5
>>> metrics.f1_score(y_true, y_pred)  
0.66...
>>> metrics.fbeta_score(y_true, y_pred, beta=0.5)  
0.83...
>>> metrics.fbeta_score(y_true, y_pred, beta=1)  
0.66...
>>> metrics.fbeta_score(y_true, y_pred, beta=2) 
0.55...
>>> metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5)  
(array([ 0.66...,  1.        ]), array([ 1. ,  0.5]), array([ 0.71...,  0.83...]), array([2, 2]...))
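The definitions of precision, recall and $F_\beta$ can be checked against the library by starting from the confusion-matrix counts. This is a sketch on the same toy labels as above; the helper f_beta is a name introduced here for illustration and assumes the denominator is non-zero.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, fbeta_score,
                             precision_score, recall_score)

y_true = [0, 1, 0, 1]
y_pred = [0, 1, 0, 0]

# For labels {0, 1}, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)


def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall (non-zero inputs)."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


print(precision, recall, f_beta(precision, recall, 1.0))
```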


>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> from sklearn.metrics import average_precision_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, threshold = precision_recall_curve(y_true, y_scores)
>>> precision  
array([ 0.66...,  0.5       ,  1.        ,  1.        ])
>>> recall
array([ 1. ,  0.5,  0.5,  0. ])
>>> threshold
array([ 0.35,  0.4 ,  0.8 ])
>>> average_precision_score(y_true, y_scores)  
0.83...
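One way to read "area under the precision-recall curve" is as a step-wise sum, in which each increase in recall is weighted by the precision reached at that threshold. The sketch below assumes a scikit-learn release whose average_precision_score uses this step-wise definition (rather than trapezoidal interpolation) and recomputes AP from the output of precision_recall_curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_scores)

# recall is returned in decreasing order, so np.diff(recall) is <= 0;
# the leading minus sign turns the sum into a positive area.
ap_by_hand = -np.sum(np.diff(recall) * precision[:-1])

print(ap_by_hand)
```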


3.3.2.8.2. Multiclass and multilabel classification

In multiclass and multilabel classification tasks, the notions of precision, recall, and F-measures can be applied to each label independently. There are a few ways to combine results across labels, specified by the average argument to the average_precision_score (multilabel only), f1_score, fbeta_score, precision_recall_fscore_support, precision_score and recall_score functions, as described above. Note that for "micro"-averaging in a multiclass setting with all labels included, precision, recall and $F$ are all identical, while "weighted" averaging may produce an F-score that is not between precision and recall.

To make this more explicit, consider the following notation:

$y$ the set of predicted $(sample, label)$ pairs
$\hat{y}$ the set of true $(sample, label)$ pairs
$L$ the set of labels
$S$ the set of samples
$y_s$ the subset of $y$ with sample $s$, i.e. $y_s := \left\{(s', l) \in y | s' = s\right\}$
$y_l$ the subset of $y$ with label $l$
similarly, $\hat{y}_s$ and $\hat{y}_l$ are subsets of $\hat{y}$
$P(A, B) := \frac{\left| A \cap B \right|}{\left|A\right|}$
$R(A, B) := \frac{\left| A \cap B \right|}{\left|B\right|}$ (Conventions vary on handling $B = \emptyset$; this implementation uses $R(A, B):=0$, and similar for $P$.)
$F_\beta(A, B) := \left(1 + \beta^2\right) \frac{P(A, B) \times R(A, B)}{\beta^2 P(A, B) + R(A, B)}$

The metrics are then defined as:

`average`	Precision	Recall	F_beta
`"micro"`	$P(y, \hat{y})$	$R(y, \hat{y})$	$F_\beta(y, \hat{y})$
`"samples"`	$\frac{1}{\left|S\right|} \sum_{s \in S} P(y_s, \hat{y}_s)$	$\frac{1}{\left|S\right|} \sum_{s \in S} R(y_s, \hat{y}_s)$	$\frac{1}{\left|S\right|} \sum_{s \in S} F_\beta(y_s, \hat{y}_s)$
`"macro"`	$\frac{1}{\left|L\right|} \sum_{l \in L} P(y_l, \hat{y}_l)$	$\frac{1}{\left|L\right|} \sum_{l \in L} R(y_l, \hat{y}_l)$	$\frac{1}{\left|L\right|} \sum_{l \in L} F_\beta(y_l, \hat{y}_l)$
`"weighted"`	$\frac{1}{\sum_{l \in L} \left|\hat{y}_l\right|} \sum_{l \in L} \left|\hat{y}_l\right| P(y_l, \hat{y}_l)$	$\frac{1}{\sum_{l \in L} \left|\hat{y}_l\right|} \sum_{l \in L} \left|\hat{y}_l\right| R(y_l, \hat{y}_l)$	$\frac{1}{\sum_{l \in L} \left|\hat{y}_l\right|} \sum_{l \in L} \left|\hat{y}_l\right| F_\beta(y_l, \hat{y}_l)$
`None`	$\langle P(y_l, \hat{y}_l) | l \in L \rangle$	$\langle R(y_l, \hat{y}_l) | l \in L \rangle$	$\langle F_\beta(y_l, \hat{y}_l) | l \in L \rangle$

>>> from sklearn import metrics
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> metrics.precision_score(y_true, y_pred, average='macro')  
0.22...
>>> metrics.recall_score(y_true, y_pred, average='micro')
... 
0.33...
>>> metrics.f1_score(y_true, y_pred, average='weighted')  
0.26...
>>> metrics.fbeta_score(y_true, y_pred, average='macro', beta=0.5)  
0.23...
>>> metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5, average=None)
... 
(array([ 0.66...,  0.        ,  0.        ]), array([ 1.,  0.,  0.]), array([ 0.71...,  0.        ,  0.        ]), array([2, 2, 2]...))


For multiclass classification with a “negative class”, it is possible to exclude some labels:

>>> metrics.recall_score(y_true, y_pred, labels=[1, 2], average='micro')
... # excluding 0, no labels were correctly recalled
0.0


Similarly, labels not present in the data sample may be accounted for in macro-averaging.

>>> metrics.precision_score(y_true, y_pred, labels=[0, 1, 2, 3], average='macro')
... 
0.166...


3.3.2.9. Hinge loss

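The hinge_loss function computes the average distance between the model and the data using hinge loss, a one-sided metric that considers only prediction errors. (Hinge loss is used in maximal margin classifiers such as support vector machines.)

If the labels are encoded with +1 and -1, $y$ is the true value, and $w$ is the predicted decision as output by decision_function, then the hinge loss is defined as: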

$L_\text{Hinge}(y, w) = \max\left\{1 - wy, 0\right\} = \left|1 - wy\right|_+$

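If there are more than two labels, hinge_loss uses a multiclass variant due to Crammer & Singer, which is described in their paper.

If $y_w$ is the predicted decision for the true label and $y_t$ is the maximum of the predicted decisions for all other labels, where predicted decisions are output by the decision function, then the multiclass hinge loss is defined by: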

$L_\text{Hinge}(y_w, y_t) = \max\left\{1 + y_t - y_w, 0\right\}$

Here is a small example demonstrating the use of the hinge_loss function with an svm classifier in a binary class problem:

>>> from sklearn import svm
>>> from sklearn.metrics import hinge_loss
>>> X = [[0], [1]]
>>> y = [-1, 1]
>>> est = svm.LinearSVC(random_state=0)
>>> est.fit(X, y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
     verbose=0)
>>> pred_decision = est.decision_function([[-2], [3], [0.5]])
>>> pred_decision  
array([-2.18...,  2.36...,  0.09...])
>>> hinge_loss([-1, 1, 1], pred_decision)  
0.3...
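The binary definition can be applied by hand to fixed decision values. In the sketch below the decision values are typed in as rounded constants close to the output shown above (an assumption, so the result does not depend on a fitted model), and labels are encoded as -1 and +1.

```python
import numpy as np
from sklearn.metrics import hinge_loss

y_true = np.array([-1, 1, 1])
# Rounded decision values, entered by hand for illustration.
pred_decision = np.array([-2.18, 2.36, 0.09])

# max(1 - w * y, 0) per sample: only margins below 1 are penalized.
per_sample = np.maximum(0.0, 1.0 - y_true * pred_decision)
hinge_by_hand = per_sample.mean()

print(per_sample, hinge_by_hand)
```

Only the third sample contributes, because its margin (0.09) is smaller than 1 even though it lies on the correct side of the decision boundary.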


Here is an example demonstrating the use of the hinge_loss function with an svm classifier in a multiclass problem:

>>> X = np.array([[0], [1], [2], [3]])
>>> Y = np.array([0, 1, 2, 3])
>>> labels = np.array([0, 1, 2, 3])
>>> est = svm.LinearSVC()
>>> est.fit(X, Y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
>>> pred_decision = est.decision_function([[-1], [2], [3]])
>>> y_true = [0, 2, 3]
>>> hinge_loss(y_true, pred_decision, labels=labels)  
0.56...


3.3.2.10. Log loss

Log loss, also called logistic regression loss or cross-entropy loss, is defined on probability estimates. It is commonly used in (multinomial) logistic regression and neural networks, as well as in some variants of expectation-maximization, and can be used to evaluate the probability outputs (predict_proba) of a classifier instead of its discrete predictions.

For binary classification with a true label $y \in \{0,1\}$ and a probability estimate $p = \operatorname{Pr}(y = 1)$, the log loss per sample is the negative log-likelihood of the classifier given the true label:

$L_{\log}(y, p) = -\log \operatorname{Pr}(y|p) = -(y \log (p) + (1 - y) \log (1 - p))$

This extends to the multiclass case as follows. Let the true labels for a set of samples be encoded as a 1-of-K binary indicator matrix $Y$, i.e., $y_{i,k} = 1$ if sample $i$ has label $k$ taken from a set of $K$ labels. Let $P$ be a matrix of probability estimates, with $p_{i,k} = \operatorname{Pr}(y_{i,k} = 1)$. Then the log loss of the whole set is

$L_{\log}(Y, P) = -\log \operatorname{Pr}(Y|P) = - \frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}$

To see how this generalizes the binary log loss given above, note that in the binary case, $p_{i,0} = 1 - p_{i,1}$ and $y_{i,0} = 1 - y_{i,1}$, so expanding the inner sum over $y_{i,k} \in \{0,1\}$ gives the binary log loss.

The log_loss function computes log loss given a list of ground-truth labels and a probability matrix, as returned by an estimator’s predict_proba method.

>>> from sklearn.metrics import log_loss
>>> y_true = [0, 0, 1, 1]
>>> y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
>>> log_loss(y_true, y_pred)    
0.1738...


The first [.9, .1] in y_pred denotes 90% probability that the first sample has label 0. The log loss is non-negative.
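Because the indicator matrix $Y$ has exactly one 1 per row, the double sum in the multiclass formula reduces to averaging the negative log of the probability assigned to each sample's true class. A minimal sketch on the data from the example above (variable names are introduced here):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1])
P = np.array([[.9, .1], [.8, .2], [.3, .7], [.01, .99]])

# Pick p_{i,k} for the true class k of each sample i.
p_true_class = P[np.arange(len(y_true)), y_true]

# Average negative log-likelihood over the samples.
log_loss_by_hand = -np.mean(np.log(p_true_class))

print(log_loss_by_hand)
```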

3.3.2.11. Matthews correlation coefficient

The matthews_corrcoef function computes the Matthew’s correlation coefficient (MCC) for binary classes. Quoting Wikipedia:

“The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient.”

In the binary (two-class) case, where $tp$, $tn$, $fp$ and $fn$ are respectively the number of true positives, true negatives, false positives and false negatives, the MCC is defined as

$MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}.$

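In the multiclass case, the Matthews correlation coefficient can be defined in terms of a confusion_matrix $C$ for $K$ classes. To simplify the definition, consider the following intermediate variables:

$t_k=\sum_{i}^{K} C_{ik}$ the number of times class $k$ truly occurred,
$p_k=\sum_{i}^{K} C_{ki}$ the number of times class $k$ was predicted,
$c=\sum_{k}^{K} C_{kk}$ the total number of samples correctly predicted,
$s=\sum_{i}^{K} \sum_{j}^{K} C_{ij}$ the total number of samples.

Then the multiclass MCC is defined as: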

$MCC = \frac{ c \times s - \sum_{k}^{K} p_k \times t_k}{\sqrt{ (s^2 - \sum_{k}^{K} p_k^2) \times (s^2 - \sum_{k}^{K} t_k^2)}}$

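When there are more than two labels, the value of the MCC will no longer range between -1 and +1. Instead, the minimum value will be somewhere between -1 and 0 depending on the number and distribution of ground truth labels. The maximum value is always +1.

Here is a small example illustrating the usage of the matthews_corrcoef function: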

>>> from sklearn.metrics import matthews_corrcoef
>>> y_true = [+1, +1, +1, -1]
>>> y_pred = [+1, -1, +1, +1]
>>> matthews_corrcoef(y_true, y_pred)  
-0.33...
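The multiclass formula can be transcribed almost literally from the intermediate variables. The helper mcc_from_confusion is a hypothetical name introduced for this sketch; it follows the index convention of the formulas above, and because the expression is symmetric in $t$ and $p$, the result does not depend on which axis of $C$ holds the true classes. Returning 0 for a zero denominator is an assumption of this sketch.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef


def mcc_from_confusion(C):
    """Multiclass MCC computed from a K x K confusion matrix."""
    C = np.asarray(C, dtype=float)
    t = C.sum(axis=0)  # t_k = sum_i C[i, k]
    p = C.sum(axis=1)  # p_k = sum_i C[k, i]
    c = np.trace(C)    # correctly predicted samples
    s = C.sum()        # total number of samples
    denom = np.sqrt((s**2 - np.sum(p**2)) * (s**2 - np.sum(t**2)))
    # Assumption of this sketch: report 0 when the denominator vanishes.
    return (c * s - np.sum(p * t)) / denom if denom > 0 else 0.0


# Binary example from above.
y_true_bin = [+1, +1, +1, -1]
y_pred_bin = [+1, -1, +1, +1]
mcc_bin = mcc_from_confusion(confusion_matrix(y_true_bin, y_pred_bin))

# A three-class example (hypothetical labels).
y_true_multi = [2, 0, 2, 2, 0, 1]
y_pred_multi = [0, 0, 2, 2, 0, 2]
mcc_multi = mcc_from_confusion(confusion_matrix(y_true_multi, y_pred_multi))

print(mcc_bin, mcc_multi)
```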


3.3.2.12. Receiver operating characteristic (ROC)

The function roc_curve computes the receiver operating characteristic curve, or ROC curve. Quoting Wikipedia:

“A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate.”

This function requires the true binary value and the target scores, which can either be probability estimates of the positive class, confidence values, or binary decisions. Here is a small example of how to use the roc_curve function:

>>> import numpy as np
>>> from sklearn.metrics import roc_curve
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
>>> fpr
array([ 0. ,  0.5,  0.5,  1. ])
>>> tpr
array([ 0.5,  0.5,  1. ,  1. ])
>>> thresholds
array([ 0.8 ,  0.4 ,  0.35,  0.1 ])


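[Figure: an example of such an ROC curve]

The roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted by AUC or AUROC. By computing the area under the ROC curve, the curve information is summarized in one number. For more information see the Wikipedia article on AUC.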

>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
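Two equivalent readings of the AUC can be sketched on the data above: the trapezoidal area under the (FPR, TPR) points, and the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The pairwise computation is written out by hand here; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Reading 1: trapezoidal area under the ROC curve points.
fpr, tpr, _ = roc_curve(y_true, y_scores)
auc_trapezoid = auc(fpr, tpr)

# Reading 2: probability that a random positive outranks a random negative.
pos = y_scores[y_true == 1]
neg = y_scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
auc_pairwise = np.mean((diff > 0) + 0.5 * (diff == 0))

print(auc_trapezoid, auc_pairwise)
```

Here three of the four positive/negative pairs are ordered correctly, which gives 0.75 under both readings.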


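In multi-label classification, the roc_auc_score function is extended by averaging over the labels as above.

Compared to metrics such as the subset accuracy, the Hamming loss, or the F1 score, ROC doesn’t require optimizing a threshold for each label. The roc_auc_score function can also be used in multi-class classification, if the predicted outputs have been binarized.

Examples:

See Receiver Operating Characteristic (ROC) for an example of using ROC to evaluate the quality of the output of a classifier.
See Receiver Operating Characteristic (ROC) with cross validation for an example of using ROC to evaluate classifier output quality, using cross-validation.
See Species distribution modeling for an example of using ROC to model species distribution.

3.3.2.13. Zero one loss

The zero_one_loss function computes the sum or the average of the 0-1 classification loss ($L_{0-1}$) over $n_{\text{samples}}$. By default, the function normalizes over the samples. To get the sum of the $L_{0-1}$, set normalize to False.

In multilabel classification, zero_one_loss scores a sample as one if there is any error in its predicted label set, and as zero if its predicted labels strictly match the true labels. By default, the function returns the fraction of imperfectly predicted subsets. To get the count of such subsets instead, set normalize to False.

If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the 0-1 loss $L_{0-1}$ is defined as: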

$L_{0-1}(y_i, \hat{y}_i) = 1(\hat{y}_i \not= y_i)$

其中 $1(x)$ 是 indicator function.

>>>
>>> from sklearn.metrics import zero_one_loss
>>> y_pred = [1, 2, 3, 4]
>>> y_true = [2, 2, 3, 4]
>>> zero_one_loss(y_true, y_pred)
0.25
>>> zero_one_loss(y_true, y_pred, normalize=False)
1


In the multilabel case with binary label indicators, where the first label set [0,1] has an error:

>>>
>>> zero_one_loss(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5

>>> zero_one_loss(np.array([[0, 1], [1, 1]]), np.ones((2, 2)),  normalize=False)
1
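
The definition is simple enough to write out directly. The helper below, zero_one_loss_sketch, is a hypothetical name used only for illustration; it is a plain-Python sketch of the formula, not the library function. For multilabel input each sample is a list of indicators, so the comparison only succeeds on an exact match.

```python
def zero_one_loss_sketch(y_true, y_pred, normalize=True):
    """Hypothetical helper: fraction (or count) of wrongly predicted samples."""
    # For multilabel input each sample is a list, so != flags any mismatch
    # in the label set as a single error.
    errors = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return errors / len(y_true) if normalize else errors


single = zero_one_loss_sketch([2, 2, 3, 4], [1, 2, 3, 4])
count = zero_one_loss_sketch([2, 2, 3, 4], [1, 2, 3, 4], normalize=False)
multilabel = zero_one_loss_sketch([[0, 1], [1, 1]], [[1, 1], [1, 1]])

print(single)      # 0.25
print(count)       # 1
print(multilabel)  # 0.5
```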


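Examples:

See Recursive feature elimination with cross-validation for an example of zero one loss usage to perform recursive feature elimination with cross-validation.

3.3.2.14. Brier score loss

The brier_score_loss function computes the Brier score for binary classes. Quoting Wikipedia:

“The Brier score is a proper score function that measures the accuracy of probabilistic predictions. It is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes.”

This function returns the mean squared difference between the actual outcome and the predicted probability of the possible outcome. The actual outcome has to be 1 or 0 (true or false), while the predicted probability of the actual outcome can be a value between 0 and 1.

The Brier score loss is also between 0 and 1, and the lower the score (the smaller the mean squared difference), the more accurate the prediction. It can be thought of as a measure of the “calibration” of a set of probabilistic predictions.

$BS = \frac{1}{N} \sum_{t=1}^{N}(f_t - o_t)^2$

where $N$ is the total number of predictions and $f_t$ is the predicted probability of the actual outcome $o_t$ .

Here is a small example of usage of this function: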

>>>
>>> import numpy as np
>>> from sklearn.metrics import brier_score_loss
>>> y_true = np.array([0, 1, 1, 0])
>>> y_true_categorical = np.array(["spam", "ham", "ham", "spam"])
>>> y_prob = np.array([0.1, 0.9, 0.8, 0.4])
>>> y_pred = np.array([0, 1, 1, 0])
>>> brier_score_loss(y_true, y_prob)
0.055
>>> brier_score_loss(y_true, 1-y_prob, pos_label=0)
0.055
>>> brier_score_loss(y_true_categorical, y_prob, pos_label="ham")
0.055
>>> brier_score_loss(y_true, y_prob > 0.5)
0.0
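
As a cross-check of the formula, the sketch below evaluates $BS$ directly in plain Python on the same data. It also thresholds the probabilities at 0.5, which mirrors the last call above: hard decisions that all match the outcomes give a loss of zero.

```python
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.4]

# BS = mean of (f_t - o_t)^2 over all predictions.
brier = sum((f - o) ** 2 for f, o in zip(y_prob, y_true)) / len(y_true)
print(round(brier, 3))  # 0.055

# Hard 0/1 decisions obtained by thresholding the probabilities at 0.5.
y_hard = [1 if f > 0.5 else 0 for f in y_prob]
brier_hard = sum((f - o) ** 2 for f, o in zip(y_hard, y_true)) / len(y_true)
print(brier_hard)  # 0.0
```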


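Examples:

See Probability calibration of classifiers for an example of Brier score loss usage to perform probability calibration of classifiers.

References:

G. Brier, Verification of forecasts expressed in terms of probability, Monthly Weather Review 78.1 (1950)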

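3.3.3. Multilabel ranking metrics

In multilabel learning, each sample can have any number of ground truth labels associated with it. The goal is to give high scores and better rank to the ground truth labels.

3.3.3.1. Coverage error

The coverage_error function computes the average number of labels that have to be included in the final prediction such that all true labels are predicted. This is useful if you want to know how many top-scored labels you have to predict on average without missing any true one. The best value of this metric is thus the average number of true labels.

Note

Our implementation's score is 1 greater than the one given in Tsoumakas et al., 2010. This extends it to handle the degenerate case in which an instance has 0 true labels.

Formally, given a binary indicator matrix of the ground truth labels $y \in \left\{0, 1\right\}^{n_\text{samples} \times n_\text{labels}}$ and the score associated with each label $\hat{f} \in \mathbb{R}^{n_\text{samples} \times n_\text{labels}}$ , the coverage is defined as

$coverage(y, \hat{f}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}} - 1} \max_{j:y_{ij} = 1} \text{rank}_{ij}$

with $\text{rank}_{ij} = \left|\left\{k: \hat{f}_{ik} \geq \hat{f}_{ij} \right\}\right|$ . Given the rank definition, ties in y_scores are broken by giving the maximal rank that would have been assigned to all tied values.

Here is a small example of usage of this function: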

>>>
>>> import numpy as np
>>> from sklearn.metrics import coverage_error
>>> y_true = np.array([[1, 0, 0], [0, 0, 1]])
>>> y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
>>> coverage_error(y_true, y_score)
2.5
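
The value 2.5 can be traced through the formula. The sketch below is a plain-Python reading of the definition; coverage_error_sketch is a hypothetical helper name, and letting a sample without any true label contribute 0 is an assumption made here for the degenerate case.

```python
def coverage_error_sketch(y_true, y_score):
    """Hypothetical helper: average of the worst rank among the true labels."""
    total = 0
    for truth, scores in zip(y_true, y_score):
        # rank_ij = number of labels scored at least as high as label j.
        ranks = [
            sum(1 for s in scores if s >= scores[j])
            for j, is_true in enumerate(truth)
            if is_true
        ]
        # Assumption: a sample with no true label contributes 0.
        total += max(ranks, default=0)
    return total / len(y_true)


y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1], [1, 0.2, 0.1]]
coverage = coverage_error_sketch(y_true, y_score)
print(coverage)  # 2.5
```

In the first sample the true label is ranked second, in the second sample it is ranked third, so the average is (2 + 3) / 2.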


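3.3.3.2. Label ranking average precision

The label_ranking_average_precision_score function implements label ranking average precision (LRAP). This metric is linked to the average_precision_score function, but is based on the notion of label ranking instead of precision and recall.

Label ranking average precision (LRAP) answers the following question, averaged over each ground truth label assigned to each sample: what fraction of the labels scored at least as high as this label are true labels? This metric yields better scores if you are able to give better rank to the labels associated with each sample. The obtained score is always strictly greater than 0, and the best value is 1. If there is exactly one relevant label per sample, label ranking average precision is equivalent to the mean reciprocal rank.

Formally, given a binary indicator matrix of the ground truth labels $y \in \left\{0, 1\right\}^{n_\text{samples} \times n_\text{labels}}$ and the score associated with each label $\hat{f} \in \mathbb{R}^{n_\text{samples} \times n_\text{labels}}$ , the average precision is defined as

$LRAP(y, \hat{f}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}} - 1} \frac{1}{|y_i|} \sum_{j:y_{ij} = 1} \frac{|\mathcal{L}_{ij}|}{\text{rank}_{ij}}$

with $\mathcal{L}_{ij} = \left\{k: y_{ik} = 1, \hat{f}_{ik} \geq \hat{f}_{ij} \right\}$ , $\text{rank}_{ij} = \left|\left\{k: \hat{f}_{ik} \geq \hat{f}_{ij} \right\}\right|$ , and $|\cdot|$ the $\ell_0$ norm or the cardinality of the set.

Here is a small example of usage of this function: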

>>>
>>> import numpy as np
>>> from sklearn.metrics import label_ranking_average_precision_score
>>> y_true = np.array([[1, 0, 0], [0, 0, 1]])
>>> y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
>>> label_ranking_average_precision_score(y_true, y_score) 
0.416...
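
The same number follows from evaluating the LRAP formula term by term. The sketch below does this in plain Python; lrap_sketch is a hypothetical helper name, and the code assumes every sample has at least one true label.

```python
def lrap_sketch(y_true, y_score):
    """Hypothetical helper: LRAP computed directly from its definition."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, is_true in enumerate(truth) if is_true]
        sample_score = 0.0
        for j in relevant:
            # rank_ij: all labels scored at least as high as label j.
            rank = sum(1 for s in scores if s >= scores[j])
            # |L_ij|: true labels scored at least as high as label j.
            true_at_or_above = sum(1 for k in relevant if scores[k] >= scores[j])
            sample_score += true_at_or_above / rank
        # Assumption: each sample has at least one true label.
        total += sample_score / len(relevant)
    return total / len(y_true)


y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1], [1, 0.2, 0.1]]
lrap = lrap_sketch(y_true, y_score)
print(round(lrap, 4))  # 0.4167
```

With one true label per sample this reduces to the mean reciprocal rank: (1/2 + 1/3) / 2.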


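3.3.3.3. Ranking loss

The label_ranking_loss function computes the ranking loss, which averages over the samples the number of label pairs that are incorrectly ordered, i.e. true labels have a lower score than false labels, weighted by the inverse of the number of ordered pairs of false and true labels. The lowest achievable ranking loss is zero.

Formally, given a binary indicator matrix of the ground truth labels $y \in \left\{0, 1\right\}^{n_\text{samples} \times n_\text{labels}}$ and the score associated with each label $\hat{f} \in \mathbb{R}^{n_\text{samples} \times n_\text{labels}}$ , the ranking loss is defined as

$\text{ranking\_loss}(y, \hat{f}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}} - 1} \frac{1}{|y_i|(n_\text{labels} - |y_i|)} \left|\left\{(k, l): \hat{f}_{ik} < \hat{f}_{il}, y_{ik} = 1, y_{il} = 0 \right\}\right|$

where $|\cdot|$ is the $\ell_0$ norm or the cardinality of the set.

Here is a small example of usage of this function: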

>>>
>>> import numpy as np
>>> from sklearn.metrics import label_ranking_loss
>>> y_true = np.array([[1, 0, 0], [0, 0, 1]])
>>> y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
>>> label_ranking_loss(y_true, y_score) 
0.75...
>>> # With the following prediction, we have perfect and minimal loss
>>> y_score = np.array([[1.0, 0.1, 0.2], [0.1, 0.2, 0.9]])
>>> label_ranking_loss(y_true, y_score)
0.0
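
To see where 0.75 and 0.0 come from, the sketch below counts the wrongly ordered (true, false) label pairs by hand. ranking_loss_sketch is a hypothetical helper name; the code assumes every sample has at least one true and at least one false label, so the denominator is never zero.

```python
def ranking_loss_sketch(y_true, y_score):
    """Hypothetical helper: ranking loss computed directly from its definition."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        true_idx = [k for k, t in enumerate(truth) if t == 1]
        false_idx = [l for l, t in enumerate(truth) if t == 0]
        # Pairs where a true label is scored below a false label.
        bad_pairs = sum(
            1 for k in true_idx for l in false_idx if scores[k] < scores[l]
        )
        # Assumption: both index lists are non-empty.
        total += bad_pairs / (len(true_idx) * len(false_idx))
    return total / len(y_true)


y_true = [[1, 0, 0], [0, 0, 1]]
loss = ranking_loss_sketch(y_true, [[0.75, 0.5, 1], [1, 0.2, 0.1]])
perfect = ranking_loss_sketch(y_true, [[1.0, 0.1, 0.2], [0.1, 0.2, 0.9]])
print(loss)     # 0.75
print(perfect)  # 0.0
```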


References:

Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Mining multi-label data. In Data mining and knowledge discovery handbook (pp. 667-685). Springer US.

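3.3.4. Regression metrics

The sklearn.metrics module implements several loss, score, and utility functions to measure regression performance. Some of those have been enhanced to handle the multioutput case: mean_squared_error, mean_absolute_error, explained_variance_score and r2_score.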

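These functions have a multioutput keyword argument which specifies the way the scores or losses for each individual target should be averaged. The default is 'uniform_average', which specifies a uniformly weighted mean over outputs. If an ndarray of shape (n_outputs,) is passed, then its entries are interpreted as weights and an according weighted average is returned. If multioutput is 'raw_values', then all unaltered individual scores or losses are returned in an array of shape (n_outputs,).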
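
The three aggregation modes can be illustrated without the library. The sketch below uses the mean absolute error as the per-output loss on a small two-output dataset and then combines the per-output values in plain Python; the variable names are illustrative only.

```python
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]

n_samples = len(y_true)
n_outputs = len(y_true[0])

# 'raw_values': one loss per output column (here: mean absolute error).
raw_values = [
    sum(abs(y_true[i][k] - y_pred[i][k]) for i in range(n_samples)) / n_samples
    for k in range(n_outputs)
]
print(raw_values)  # [0.5, 1.0]

# 'uniform_average': every output gets the same weight.
uniform_average = sum(raw_values) / n_outputs
print(uniform_average)  # 0.75

# Array of weights: weighted average of the per-output losses.
weights = [0.3, 0.7]
weighted_average = sum(w * v for w, v in zip(weights, raw_values)) / sum(weights)
print(round(weighted_average, 3))  # 0.85
```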

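The r2_score and explained_variance_score functions accept an additional value 'variance_weighted' for the multioutput parameter. This option weights each individual score by the variance of the corresponding target variable. This setting quantifies the globally captured unscaled variance. If the target variables are of different scale, then this score puts more importance on well explaining the higher variance variables. multioutput='variance_weighted' is the default value for r2_score for backward compatibility. This will be changed to uniform_average in the future.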

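3.3.4.1. Explained variance score

The explained_variance_score function computes the explained variance regression score.

If $\hat{y}$ is the estimated target output, $y$ the corresponding (correct) target output, and $Var$ is the variance, the square of the standard deviation, then the explained variance is estimated as follows:

$\texttt{explained\_{}variance}(y, \hat{y}) = 1 - \frac{Var\{ y - \hat{y}\}}{Var\{y\}}$

The best possible score is 1.0; lower values are worse.

Here is a small example of usage of the explained_variance_score function: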

>>>
>>> from sklearn.metrics import explained_variance_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> explained_variance_score(y_true, y_pred)  
0.957...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> explained_variance_score(y_true, y_pred, multioutput='raw_values')
array([ 0.967...,  1.        ])
>>> explained_variance_score(y_true, y_pred, multioutput=[0.3, 0.7])
0.990...
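
The single-output value can be reproduced from the formula with the standard library. The sketch below also shows a consequence of the definition: because only the variance of the residuals enters, adding a constant offset to every prediction leaves the score unchanged. The offset of 10 is an arbitrary value chosen for illustration.

```python
from statistics import pvariance

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

# explained_variance = 1 - Var(y - y_hat) / Var(y)
residuals = [t - p for t, p in zip(y_true, y_pred)]
ev = 1 - pvariance(residuals) / pvariance(y_true)
print(round(ev, 3))  # 0.957

# A constant bias shifts the residuals but does not change their variance.
shifted_pred = [p + 10 for p in y_pred]
shifted_residuals = [t - p for t, p in zip(y_true, shifted_pred)]
ev_shifted = 1 - pvariance(shifted_residuals) / pvariance(y_true)
print(round(ev_shifted, 3))  # 0.957
```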


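3.3.4.2. Mean absolute error

The mean_absolute_error function computes the mean absolute error, a risk metric corresponding to the expected value of the absolute error loss or $l1$ -norm loss.

If $\hat{y}_i$ is the predicted value of the $i$ -th sample, and $y_i$ is the corresponding true value, then the mean absolute error (MAE) estimated over $n_{\text{samples}}$ is defined as

$\text{MAE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} \left| y_i - \hat{y}_i \right|.$

Here is a small example of usage of the mean_absolute_error function: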

>>>
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_absolute_error(y_true, y_pred)
0.5
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> mean_absolute_error(y_true, y_pred)
0.75
>>> mean_absolute_error(y_true, y_pred, multioutput='raw_values')
array([ 0.5,  1. ])
>>> mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.849...


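3.3.4.3. Mean squared error

The mean_squared_error function computes the mean squared error, a risk metric corresponding to the expected value of the squared (quadratic) error or loss.

If $\hat{y}_i$ is the predicted value of the $i$ -th sample, and $y_i$ is the corresponding true value, then the mean squared error (MSE) estimated over $n_{\text{samples}}$ is defined as

$\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (y_i - \hat{y}_i)^2.$

Here is a small example of usage of the mean_squared_error function: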

>>>
>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> mean_squared_error(y_true, y_pred)  
0.7083...


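Examples:

See Gradient Boosting regression for an example of mean squared error usage to evaluate gradient boosting regression.

3.3.4.4. Mean squared logarithmic error

The mean_squared_log_error function computes a risk metric corresponding to the expected value of the squared logarithmic (quadratic) error or loss.

If $\hat{y}_i$ is the predicted value of the $i$ -th sample, and $y_i$ is the corresponding true value, then the mean squared logarithmic error (MSLE) estimated over $n_{\text{samples}}$ is defined as

$\text{MSLE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (\log_e (1 + y_i) - \log_e (1 + \hat{y}_i) )^2.$

where $\log_e (x)$ means the natural logarithm of $x$ . This metric is best to use when targets have exponential growth, such as population counts, average sales of a commodity over a span of years, etc. Note that this metric penalizes an under-predicted estimate more than an over-predicted estimate.

Here is a small example of usage of the mean_squared_log_error function: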

>>>
>>> from sklearn.metrics import mean_squared_log_error
>>> y_true = [3, 5, 2.5, 7]
>>> y_pred = [2.5, 5, 4, 8]
>>> mean_squared_log_error(y_true, y_pred)  
0.039...
>>> y_true = [[0.5, 1], [1, 2], [7, 6]]
>>> y_pred = [[0.5, 2], [1, 2.5], [8, 8]]
>>> mean_squared_log_error(y_true, y_pred)  
0.044...
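
The asymmetry between under- and over-prediction is easy to verify from the formula. The sketch below implements MSLE with math.log1p in plain Python (msle_sketch is a hypothetical helper name) and compares two predictions that miss a true value of 10 by the same absolute amount; the values 5 and 15 are made up for illustration.

```python
from math import log1p


def msle_sketch(y_true, y_pred):
    """Hypothetical helper: mean of squared differences of log(1 + value)."""
    return sum(
        (log1p(t) - log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)


msle = msle_sketch([3, 5, 2.5, 7], [2.5, 5, 4, 8])
print(round(msle, 4))  # 0.0397

# Same absolute error of 5, once below and once above the true value of 10.
under = msle_sketch([10], [5])
over = msle_sketch([10], [15])
print(under > over)  # True
```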


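3.3.4.5. Median absolute error

The median_absolute_error function is particularly interesting because it is robust to outliers. The loss is calculated by taking the median of all absolute differences between the target and the prediction.

If $\hat{y}_i$ is the predicted value of the $i$ -th sample, and $y_i$ is the corresponding true value, then the median absolute error (MedAE) estimated over $n_{\text{samples}}$ is defined as

$\text{MedAE}(y, \hat{y}) = \text{median}(\mid y_1 - \hat{y}_1 \mid, \ldots, \mid y_n - \hat{y}_n \mid).$

The median_absolute_error function does not support multioutput.

Here is a small example of usage of the median_absolute_error function: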

>>>
>>> from sklearn.metrics import median_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> median_absolute_error(y_true, y_pred)
0.5
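
The robustness to outliers can be seen by comparing the median with the mean of the absolute errors. The sketch below uses only the standard library; the extra error of 100.0 is a made-up outlier added for illustration.

```python
from statistics import mean, median

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
print(median(abs_errors))  # 0.5
print(mean(abs_errors))    # 0.5

# Add one grossly wrong prediction: the median stays put, the mean does not.
abs_errors_outlier = abs_errors + [100.0]
print(median(abs_errors_outlier))  # 0.5
print(mean(abs_errors_outlier))    # 20.4
```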


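3.3.4.6. R² score, the coefficient of determination

The r2_score function computes R², the coefficient of determination. It provides a measure of how well future samples are likely to be predicted by the model. The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R² score of 0.0.

If $\hat{y}_i$ is the predicted value of the $i$ -th sample, and $y_i$ is the corresponding true value, then the R² score estimated over $n_{\text{samples}}$ is defined as

$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n_{\text{samples}} - 1} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{n_\text{samples} - 1} (y_i - \bar{y})^2}$

where $\bar{y} = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}} - 1} y_i$ .

Here is a small example of usage of the r2_score function: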

>>>
>>> from sklearn.metrics import r2_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> r2_score(y_true, y_pred)  
0.948...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> r2_score(y_true, y_pred, multioutput='variance_weighted')
0.938...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> r2_score(y_true, y_pred, multioutput='uniform_average')
0.936...
>>> r2_score(y_true, y_pred, multioutput='raw_values')
array([ 0.965...,  0.908...])
>>> r2_score(y_true, y_pred, multioutput=[0.3, 0.7])
0.925...
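
The two reference points mentioned above (0.0 for the constant mean model, negative values for worse models) can be checked directly from the definition. The sketch below is plain Python; r2_sketch is a hypothetical helper name, and the "reversed" prediction is a deliberately bad model made up for illustration.

```python
def r2_sketch(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot for a single output."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3, -0.5, 2, 7]

r2 = r2_sketch(y_true, [2.5, 0.0, 2, 8])
print(round(r2, 4))  # 0.9486

# A constant model that always predicts the mean of y scores exactly 0.
y_mean = sum(y_true) / len(y_true)
r2_constant = r2_sketch(y_true, [y_mean] * len(y_true))
print(r2_constant)  # 0.0

# A model worse than the constant one gets a negative score.
r2_bad = r2_sketch(y_true, list(reversed(y_true)))
print(r2_bad < 0)  # True
```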


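Examples:

See Lasso and Elastic Net for Sparse Signals for an example of R² score usage to evaluate Lasso and Elastic Net on sparse signals.

3.3.5. Clustering metrics

The sklearn.metrics module implements several loss, score, and utility functions. For more information see the Clustering performance evaluation section for instance clustering, and Biclustering evaluation for biclustering.

3.3.6. Dummy estimators

When doing supervised learning, a simple sanity check consists of comparing one's estimator against simple rules of thumb. DummyClassifier implements several such simple strategies for classification: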

stratified generates random predictions by respecting the training set class distribution.
most_frequent always predicts the most frequent label in the training set.
prior always predicts the class that maximizes the class prior (like most_frequent) and predict_proba returns the class prior.
uniform generates predictions uniformly at random.
constant always predicts a constant label that is provided by the user. A major motivation of this strategy is F1-scoring, when the positive class is in the minority.

Note that with all these strategies, the predict method completely ignores the input data!

To illustrate DummyClassifier, first let's create an imbalanced dataset:

>>>
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> y[y != 1] = -1
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


Next, let's compare the accuracy of SVC and most_frequent:

>>>
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test) 
0.63...
>>> clf = DummyClassifier(strategy='most_frequent',random_state=0)
>>> clf.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=0, strategy='most_frequent')
>>> clf.score(X_test, y_test)  
0.57...


We see that SVC doesn't do much better than a dummy classifier. Now, let's change the kernel:

>>>
>>> clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)  
0.97...


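We see that the accuracy was boosted to almost 100%. A cross validation strategy is recommended for a better estimate of the accuracy, if it is not too CPU costly. For more information see the Cross-validation: evaluating estimator performance section. Moreover, if you want to optimize over the parameter space, it is highly recommended to use an appropriate methodology; see the Tuning the hyper-parameters of an estimator section for details.

More generally, when the accuracy of a classifier is too close to random, it probably means that something went wrong: features are not helpful, a hyperparameter is not correctly tuned, the classifier is suffering from class imbalance, etc…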
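
The most_frequent baseline is simple enough to compute without any estimator, which makes it a quick first check. The sketch below is plain Python; most_frequent_baseline is a hypothetical helper name and the toy labels are made up for illustration, they are not the iris split used above.

```python
from collections import Counter


def most_frequent_baseline(y_train, y_test):
    """Hypothetical helper: accuracy of always predicting the majority training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    # The features are never consulted, exactly as with a dummy classifier.
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)


# Toy labels (made up): class -1 dominates the training set.
y_train = [-1, -1, -1, 1, -1, 1, -1, -1]
y_test = [-1, 1, -1, -1, 1]

baseline = most_frequent_baseline(y_train, y_test)
print(baseline)  # 0.6
```

On imbalanced data, a model's accuracy is only informative relative to this number, since the baseline accuracy equals the share of the majority class in the test labels.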

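DummyRegressor also implements four simple rules of thumb for regression:

mean always predicts the mean of the training targets.
median always predicts the median of the training targets.
quantile always predicts a user provided quantile of the training targets.
constant always predicts a constant value that is provided by the user.

In all these strategies, the predict method completely ignores the input data.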
