Chapter 1. Classification -- 06. Evaluation Methods for Classifiers (translation)

So let’s talk about how to evaluate a classifier. Now just following the example,

we have our features, each observation being represented by a set of numbers,

and each observation is labelled, and then the machine learning algorithm comes along and gives a number to each observation, which is sort of what it thinks is going on,

and then the number says how far away from the decision boundary the observation is,

and also the sign of this function f is the predicted label.

Now let’s put those in another column, so this is y hat,

this is the sign of f; it just tells you which side of the decision boundary that point is on.

And if the classifier is right, then y hat agrees with y.
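To summarise the setup in symbols (a reconstruction from the spoken description, not something shown in the transcript itself): the classifier produces a score f(x_i) for each observation, and

$$\hat{y}_i = \operatorname{sign}\big(f(x_i)\big), \qquad \text{with the prediction correct whenever } \hat{y}_i = y_i.$$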

Okay so let’s just look at these two columns for a few minutes.

If the classifier is really good, these predicted labels often agree with the true labels.

And let’s just put a few more examples in there just for fun.

Now this is called a true positive, where the true label is plus one, and the predicted label is plus one.

In a true negative, they’re both minus one.

And then a false positive, or a type I error, is when you think that it’s positive,

but it’s actually not. And then this is a false negative, or a type II error, where you think it’s negative,

but it’s actually not.

And then below that is just another true negative, and below that is another false negative, and so on.

Okay so the errors come in these two flavours. Now, how do we judge the quality of a classifier?

And we construct a confusion matrix,

and the confusion matrix has the true positives over on the upper left,

and then the true negatives down on the lower right,

and then the false positives up here and the false negatives down here, so you can see that this classifier’s pretty good, because most of the points are either true positives or true negatives,

which is good, and there aren’t too many errors.
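If you want to build this kind of table yourself, here is a minimal Python sketch using scikit-learn’s confusion_matrix; the labels here are hypothetical, not the example from the lecture, and the +1/-1 coding is an assumption made to match the notation above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, coded as +1 / -1.
y_true = np.array([ 1,  1, -1, -1,  1, -1, -1,  1])
y_pred = np.array([ 1, -1, -1,  1,  1, -1, -1,  1])

# With labels=[1, -1], the positive class comes first, so TP sits in the
# upper left and TN in the lower right, matching the layout described above.
# Note scikit-learn's convention: rows are true labels, columns are predictions.
cm = confusion_matrix(y_true, y_pred, labels=[1, -1])
tp, fn = cm[0, 0], cm[0, 1]
fp, tn = cm[1, 0], cm[1, 1]
print(cm)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```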

Now if we don’t care about whether we have false positives or false negatives,

if they’re both equally bad, then we can look at them as classification error.

So this is just the fraction of points that are misclassified;

it’s also called the classification error, or the misclassification rate.

And then I can write it this way,

so this is the fraction of points for which the predicted label is not equal to the true label.

So it just counts up the false positives and false negatives and divides by n.
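Written out, the quantity being described is (a reconstruction, with 1[·] denoting the indicator function):

$$\text{misclassification error} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\big[\hat{y}_i \neq y_i\big] = \frac{\#FP + \#FN}{n}.$$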

The true positive rate is simply the number of true positives divided by the total number of positives.

So it’s this guy divided by the sum of these two here,

so it’s the number of points whose true label is one,

and whose predicted label is also one, and then divided by the number of points whose true label is one. It’s also called the sensitivity or the recall,
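In symbols (reconstructed from the spoken description):

$$\text{TPR} = \text{sensitivity} = \text{recall} = \frac{\#TP}{\#TP + \#FN} = \frac{\#\{i : y_i = 1 \text{ and } \hat{y}_i = 1\}}{\#\{i : y_i = 1\}}.$$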

and the true negative rate, or the specificity, is defined this way:

it’s actually just the number of true negatives divided by the total number of negatives,

okay, so it’s the number of points that are negative – they’re truly negative,

and they’re predicted to be negative, and then divided by the total number of negatives.

So again, it’s this guy, the true negatives, divided by the sum of these two guys.
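That is, as a formula:

$$\text{TNR} = \text{specificity} = \frac{\#TN}{\#TN + \#FP}.$$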

And then the false positive rate looks like this:

so it’s the number of false positives divided by the total number of negatives,
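That is:

$$\text{FPR} = \frac{\#FP}{\#FP + \#TN} = 1 - \text{TNR}.$$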

and then there are a few more metrics.

There’s the precision, which is the true positives divided by the total number of predicted positives,

so in other words it’s this one divided by the sum of these two.
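As a formula (again a reconstruction of the description):

$$\text{precision} = \frac{\#TP}{\#TP + \#FP}.$$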

So the reason I’m going through all these is that these metrics are provided by pretty much any piece of software that you’ll want to work with,

and they’re quantities of interest that you hear about fairly often. And here’s the F1 score – the F1 score is kind of neat: it’s a balance between precision and recall.

So it’s two times precision times recall divided by precision plus recall.
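That is (equivalently, the harmonic mean of precision and recall):

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$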

So if either the precision or the recall is bad, then the F1 score is bad.

And so precision again uses those two quantities, and recall uses these two.

But if you get a good F1 score, that generally means that your model is good.

F1 score, precision, and recall are all terms that are used very often in information retrieval, for things like evaluating search engines. So here’s just some more detail about that.

The precision at n for a search query is defined like this:

so of the top n pages retrieved by the search engine, how many were actually relevant to the query?

And the way we can write that is the number of true positives out of those top n, divided by n –

the number of pages retrieved. And then the recall at n for a search query is the following:

it says: if there are N relevant webpages in total, where N is the total number of positives,

what fraction of them did we get in the top n results from our query?

So that’s the number of true positives in the top n, divided by the total number of positives.
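Putting both definitions into formulas (a reconstruction of what the lecture describes):

$$\text{precision@}n = \frac{\#\{\text{relevant pages among the top } n\}}{n}, \qquad \text{recall@}n = \frac{\#\{\text{relevant pages among the top } n\}}{N},$$

where N is the total number of relevant pages.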

Which measure should you use?

Now, machine learners often use accuracy –

just plain accuracy, or the misclassification error, because it’s just one number that you can directly compare across algorithms.

You need a single measure of quality to compare algorithms.

Once you have two measures of quality, you can’t directly make a comparison because what if one algorithm’s better according to one quality measure but not another one?

Then you can’t compare them.

But this only works when errors for the positive class count equally to errors for the negative class, and it doesn’t work when the data are imbalanced – but anyway, that’s what people do. Now, doctors,

they often want to know how many of the positives they got right,

and they want to know how many of the negatives they got right,

so it makes sense that they want to look at both the true positive rate and the true negative rate.

And if you’re in information retrieval, then you probably want to use precision and recall and F1 score, which is a combination of the two.

So let’s say you’re judging the quality of a search engine like Bing, for instance.

You might care about precision; again, precision asks: of the webpages that the engine returned, how many were relevant?

That’s precision, and then recall is the fraction of the relevant webpages that the search engine returned,

and then you use the F1 score, so you have a single measure,

and then you can compare the quality of the different search engines in an easy way.
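To tie the metrics together, here is a minimal Python sketch computing them with scikit-learn on the same hypothetical +1/-1 labels used in the confusion-matrix sketch above; the data are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels, coded as +1 / -1.
y_true = np.array([ 1,  1, -1, -1,  1, -1, -1,  1])
y_pred = np.array([ 1, -1, -1,  1,  1, -1, -1,  1])

print("misclassification error:", 1 - accuracy_score(y_true, y_pred))
print("recall (TPR):", recall_score(y_true, y_pred, pos_label=1))
print("precision:", precision_score(y_true, y_pred, pos_label=1))
print("F1 score:", f1_score(y_true, y_pred, pos_label=1))
```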



