Chapter 1: Classification -- 02. Loss Functions for Classification

So now that you understand the basics of classification, let’s talk about loss functions,

because that determines the major difference between a lot of machine learning methods.

Okay so how do we measure classification error?

Well, one very, very simple way to do it is to use the fraction of times our predictions are wrong.

So just the fraction of times the sign of f(x) is not equal to the truth y, and I can write it like that, okay? The issue with this particular way of measuring classification error is that if you want to try to minimize it, you could run into a lot of trouble, because this quantity is computationally hard to minimize.
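(Just to make that concrete, here's a minimal sketch of that quantity, assuming the usual convention that the labels y live in {-1, +1} and f(x) is a real-valued score; the arrays are made-up examples.)

```python
import numpy as np

# Made-up labels y in {-1, +1} and real-valued scores f(x) for five points.
y = np.array([+1, -1, +1, +1, -1])
f_x = np.array([2.3, -0.7, -0.4, 1.1, 0.6])

# Misclassification error: the fraction of times sign(f(x)) disagrees with y.
misclassification_error = np.mean(np.sign(f_x) != y)
print(misclassification_error)  # two of the five points are wrong -> 0.4
```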

So let’s give you the geometric picture here: the decision boundary is this line right here,

and f being positive is here and f being negative is here, and the red points are all misclassified.

And now what I’m going to do is something you might not be expecting which is that I’m going to move all the correctly classified points across the decision boundary to one side,

and all the misclassified points to the other side.

There they go, we’ll move all the misclassified points across the decision boundary and then all the correctly classified ones.

Okay, so I’m glad I did that, now I have the ones we’ve got correct over here and the ones we got wrong over there. And now the labels on this plot are wrong because we changed it, so it’s actually something like that.

Okay so over here, on the right either f is positive and y is also positive, so y times f is positive,

or they’re both negative so the product is positive again.

And then over here on the left, we have cases where the sign of f is different from y,

so the product is negative. And then the ones I suffer a penalty for are all these guys over there.

Okay so let’s keep that image in mind over here, and then I’ll put the labels up.

Now this function tells us what kind of penalty we’re going to issue for being wrong.

Okay so right now, if y times f is positive, it means we got it right and we lose 0 points.

And if we get it wrong, the point is on the wrong side and so we lose one point.

Now I’m just going to write this function another way which is like this,

okay, so we lose one point if y times f is less than 0, and otherwise we lose no points. And this is the classic 0-1 loss.

It just tells you whether your classifier is right or wrong.
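(Here's one way to write that loss in code; it's just a sketch, expressed in terms of the product y times f as on the slide.)

```python
def zero_one_loss(y, f):
    """Classic 0-1 loss: lose one point when y and f disagree in sign.

    In terms of the product y * f: the loss is 1 exactly when y * f < 0,
    and 0 otherwise.
    """
    return 1.0 if y * f < 0 else 0.0

print(zero_one_loss(+1, 2.5))   # correct side of the boundary -> 0.0
print(zero_one_loss(-1, 0.3))   # wrong side of the boundary   -> 1.0
```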

And then this thing is called a loss function, and there are other loss functions too,

and this one’s nice, but it’s problematic because it’s not smooth,

and we have issues with things that are not smooth in machine learning.

So let’s try some more loss functions.

So while we’re doing this,

just keep in mind that these points over here are the ones that are very wrong,

because they’re on the wrong side of the decision boundary,

but they’re really far away from it too. And these points are wrong, but they’re not as bad;

they’re on the wrong side of the decision boundary but they’re pretty close to it.

And then we’ll say these points are sort of correct, and we’ll say these points are very correct.

And what we’d really like to have are loss functions that don’t penalize the very correct ones,

but the penalty gets worse and worse as you go to the left.

But maybe we can use some other loss function,

something that – you know, maybe we get a small penalty for being sort of correct and then a bigger penalty for being sort of wrong and then a huge penalty for being very wrong.

Something that looks like this would be ideal.

So again,

this is – the horizontal axis is y times f, and this red one is 1 if y disagrees with the sign of f, and then the other curves are different loss functions and they’re actually for different machine learning algorithms.

And again, just keep in mind that on the right

– these are points that are on the correct side of the decision boundary,

they don’t suffer much penalty and on the left,

these are points that are incorrectly classified and they suffer more penalty.

This is the loss function that AdaBoost uses.

AdaBoost is one of the machine learning methods that we’ll cover in the course.

And this is the loss function that support vector machines use;

it’s a line, and then it’s another flat line. And this is the loss function for logistic regression, and we’re going to cover all three of these.
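(To make those curves concrete, below is a small sketch of the three losses as functions of the margin z = y times f. These are the standard textbook forms (exponential loss for AdaBoost, hinge loss for support vector machines, logistic loss for logistic regression), written out here rather than read off the slide itself.)

```python
import numpy as np

def exponential_loss(z):
    # AdaBoost's loss as a function of the margin z = y * f.
    return np.exp(-z)

def hinge_loss(z):
    # Support vector machine loss: a sloped line for z < 1, then a flat line at 0.
    return np.maximum(0.0, 1.0 - z)

def logistic_loss(z):
    # Logistic regression loss: log(1 + exp(-z)).
    return np.log(1.0 + np.exp(-z))

# Big penalty for very negative margins, small (or zero) penalty for large positive ones.
margins = np.linspace(-2.0, 2.0, 5)
for name, loss in [("exponential", exponential_loss),
                   ("hinge", hinge_loss),
                   ("logistic", logistic_loss)]:
    print(name, np.round(loss(margins), 3))
```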

Now I’m going to write this idea about the loss functions in notation on the next slide.

Okay so start here:

the misclassification error is the fraction of times that the sign of f is not equal to the truth y – that’s this.

I can rewrite it this way, okay, as the fraction of times y times f is less than 0.

And then we’ll upper-bound this by these loss functions.
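(In symbols, assuming the usual convention y_i ∈ {-1, +1} and writing ℓ for a surrogate loss that lies above the 0-1 loss, the bound being described is roughly the following; this is a notational sketch, not the slide's exact formula.)

```latex
\[
\underbrace{\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left[\,y_i f(x_i) < 0\,\right]}_{\text{misclassification error}}
\;\le\;
\underbrace{\frac{1}{n}\sum_{i=1}^{n}\ell\!\left(y_i f(x_i)\right)}_{\text{average loss}}
\]
```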

Okay, so then what is a good way to try to reduce the misclassification error which is that guy?

Well you could just try to minimize the average loss.

So if you had a choice of functions f, you could try to choose f to minimize this thing,

which hopefully would also minimize this but in a computationally easier way.

So here’s your first try for a machine learning algorithm.

Just choose the function f to minimize the average loss. And this seems like a good idea, right?

Well it is, and that’s what most machine learning methods are based on,

and how to do this minimization over models to get the best one,

that involves some optimization techniques, which go on behind the scenes.
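(As a rough illustration of that “first try” algorithm, here is a toy sketch: restrict f to linear scores f(x) = w·x and run plain gradient descent on the average logistic loss. The data, step size, and iteration count are made up, and there is deliberately nothing here yet to keep the model simple.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data: 200 points with labels y in {-1, +1}.
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)

w = np.zeros(2)        # linear model: f(x) = w . x
step_size = 0.1

for _ in range(500):
    margins = y * (X @ w)                                   # y_i * f(x_i)
    # Gradient of the average logistic loss (1/n) sum log(1 + exp(-margin)).
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= step_size * grad

train_error = np.mean(np.sign(X @ w) != y)   # misclassification error on the training set
print(w, train_error)
```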

But there’s one more thing I didn’t quite tell you,

which is that we want to do more than have a low training error.

We want to predict well on data that we haven’t seen before.

We want to, you know, generalize to new points, and that’s why we need statistical learning theory, because this algorithm that I showed you – that’s not right, and you’ll see why.

It’s pretty good,

but it’s missing this key element that tells – that encourages the model to stay simple and not over-fit.

So I’ll talk about statistical learning theory shortly.

