Chapter 1. Classification -- 05. Maximum Likelihood Perspective (Translation)

Let’s talk in more depth about logistic regression. Putting that in the corner for now,

I wanted to give you another perspective on logistic regression,

which is the maximum likelihood perspective.

And you can skip this section if you’re not interested and nothing bad’s going to happen,

but it might be useful to some of you. Look at this function here;

it looks like – what does this function look like?

It looks like something is growing and then saturating.

Now this function is called the logistic function,

and it was one of the very early population models invented by Adolphe Quetelet and his pupil,

Pierre Francois Verhulst, somewhere in the mid-19th century,

and they were modelling growth of populations, and they were thinking that when a country gets full,

the population won’t grow as much and then the population will saturate, which is why it looks like that.

And it sounds kind of funny, but that’s what they were doing.

So see, this is when the country is just growing, and then here’s where it’s full and the population won’t grow anymore. Anyway, so how does this relate to logistic regression? Well, it does.

So what do you know about probabilities, right? They don’t go below 0, and they don’t

go above 1. You can take any number you like and send it through this function, and it’ll

give you a number between 0 and 1; it’ll give you a probability.

So this is the basic formula for that function, so when t is really big, then e to the t is much bigger than 1, 

and so this one basically gets ignored down here and you get 1.

And if t is really small (very negative), then the top goes to 0 and the bottom goes to 1, and you get 0.
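
For reference, this is presumably the formula being described, written out (the standard form of the logistic function, with t as the input):

```latex
\[
\sigma(t) = \frac{e^{t}}{1 + e^{t}} = \frac{1}{1 + e^{-t}},
\qquad
\sigma(t) \to 1 \ \text{as}\ t \to +\infty,
\qquad
\sigma(t) \to 0 \ \text{as}\ t \to -\infty.
\]
```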

Okay, so again, where does logistic regression come in?

You know, here is where it enters logistic regression.

Let’s model the probability that the outcome is –

the outcome y is 1 for a specific x and beta, just like this, okay?

So why would we do this? It looks like a complicated function;

where did I get this? So here’s the trick: the thing on the left is a probability,

so the thing on the right had better be a probability. And guess what, we know it is.

It’s just a logistic function, and a logistic function only produces probabilities.

Okay so now this model makes sense, that’s why I want to model a probability like this.

And now I’m just putting it in matrix notation,

just to make my life a little bit easier instead of having to write all these sums all over the place,

I can just write this matrix x times the vector beta.

Now I also can compute the probability that the label is minus 1,

given xi and beta using the model. So it’s just one minus the probability that it’s 1,

so it’s just 1 minus that guy, and you can simplify that and make it look like this.
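
Written out, the model being described is presumably this (here x_i β is shorthand for the dot product of the i-th feature vector with the coefficient vector):

```latex
\[
P(y_i = 1 \mid x_i, \beta) = \frac{e^{x_i \beta}}{1 + e^{x_i \beta}},
\qquad
P(y_i = -1 \mid x_i, \beta) = 1 - \frac{e^{x_i \beta}}{1 + e^{x_i \beta}} = \frac{1}{1 + e^{x_i \beta}}.
\]
```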

Now I’m going to need to calculate the likelihood of each of the observations,

which is the probability of observing the label y that I actually observed, given its x and the model beta.

And I’m actually almost there, because I’ve already done all of that.

So this is it, right, if y is minus 1, then you use this one. If y is plus 1,

then you use this one, and that’s – that’s this probability right here.

And then I can simplify this a little bit more, because remember y is minus 1,

so I can always put a minus y here because minus y is just 1,

and I can do this same thing with the other term here.

So first thing I want to just divide top and bottom by e to the x beta,

and I end up with something that looks like that. And then I can always multiply by 1 in disguise,

because remember y is positive 1,

so I can just write this as minus y x beta, because I’m just multiplying by 1 here, which is just the y.

Now the interesting thing is these two expressions should look rather similar to you; in fact,

they should look exactly the same because they are.

That’s very nice, because it means that the probability for y to equal whatever it does is written either this way or that way and they are the same. Okay, so I can just put it right there.
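
Spelled out, the two cases being compared are presumably these (using labels y_i in {−1, +1} as above):

```latex
\[
y_i = -1:\quad \frac{1}{1 + e^{x_i \beta}} = \frac{1}{1 + e^{-y_i x_i \beta}},
\qquad
y_i = +1:\quad \frac{e^{x_i \beta}}{1 + e^{x_i \beta}} = \frac{1}{1 + e^{-x_i \beta}} = \frac{1}{1 + e^{-y_i x_i \beta}},
\]
```

so in both cases the likelihood of a single observation is 1 / (1 + e^(−y_i x_i β)).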

Alright, just adding a little space there, and then I compute the likelihood for all the data,

I have to multiply all these probabilities together. So what I end up with is this,

so this is the full likelihood for the dataset, and it looks just like that.

Okay, and so this guy equals this, which equals this, which equals that.

So I can summarize there and start with a fresh page; there it is.

And now, I can take the negative log of both sides – that’s completely legal.

And then, when you take the log of a product, it becomes the sum of the logs, so there we are.

And then this fraction becomes – this is the log of this to the negative 1 power,

so the negative 1 comes out front and cancels with this minus sign, and I get this expression.

Now hopefully you have a good memory,

because this expression is exactly the same as the one I have up in the corner, okay?

So that’s cool: minimizing the negative log likelihood is the same as minimizing the logistic loss.
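
Written out, the chain of steps being described is presumably this:

```latex
\[
L(\beta) = \prod_{i=1}^{n} \frac{1}{1 + e^{-y_i x_i \beta}}
\quad\Longrightarrow\quad
-\log L(\beta) = \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i x_i \beta}\right),
\]
```

which is exactly the logistic loss objective from the earlier derivation.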

Now minimizing the negative log likelihood is like finding the coefficients under which your data is the most likely,

if you use the logistic model. So I can derive logistic regression in a different way, but why do I care?

Why do I need this other derivation when I have the first one? And the answer is really neat:

it’s because now you have this. Remember this? This is the logistic function,

but now it provides a probabilistic interpretation of the model.

Whatever score the model gives the observation, you can now turn it into the probability that y equals 1,

given x. You don’t just get a classification, so maybe I can show it geometrically another way.

Okay, so back to this picture over here. Now, this is the logistic function,

and over here you get a higher probability estimate and over here it’s very low.

And that interpretation is not something that you get from the loss-function derivation.
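
To make that concrete in code (a minimal sketch with made-up numbers, not the lecture’s Azure ML setup): whatever linear score f(x) = x·β the model assigns, pushing it through the logistic function turns it into an estimated P(y = 1 | x).

```python
import numpy as np

def score(x, beta):
    """Linear score f(x) = x . beta for one observation."""
    return float(np.dot(x, beta))

def prob_y_equals_1(x, beta):
    """Estimated P(y = 1 | x) under the logistic model: the score pushed through the logistic function."""
    return 1.0 / (1.0 + np.exp(-score(x, beta)))

# Made-up coefficients and points, just to show the behaviour:
# far on the positive side of the boundary -> probability near 1,
# far on the negative side -> near 0, right on the boundary -> 0.5.
beta = np.array([2.0, -1.0])
print(prob_y_equals_1(np.array([3.0, 1.0]), beta))   # score  5 -> about 0.993
print(prob_y_equals_1(np.array([-3.0, 1.0]), beta))  # score -7 -> about 0.001
print(prob_y_equals_1(np.array([0.5, 1.0]), beta))   # score  0 -> 0.5
```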

Okay, so just a summary here: for a logistic regression,

we split data randomly into training and test sets,

we estimate the coefficients and train the model by minimizing the objective,

and then we score the model, and evaluate the model.

And if we want to, now we have the probabilistic interpretation;

we can send – we can get that through the function f.

We can plug f into the logistic function to get an estimate of the probability that y equals 1 given x.

and again, this is just the basic version in Azure ML;

this is all the programming: I just – you know – literally moved the modules there, put the connectors on them, hit run, and that’s it.
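
For readers who want the same workflow in code rather than Azure ML modules, here is a rough sketch; scikit-learn and the synthetic dataset are stand-ins of mine, not anything from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for whatever dataset the lecture uses.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Split the data randomly into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Estimate the coefficients (train the model) by minimizing the objective.
model = LogisticRegression()
model.fit(X_train, y_train)

# Score the model on the test set and evaluate it.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# The probabilistic interpretation: the score sent through the logistic
# function, i.e. the estimated P(y = 1 | x) for each test point.
print("P(y = 1 | x) for the first five test points:", model.predict_proba(X_test)[:5, 1])
```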

Now this is just a preview of what happens when we put regularization on there;

we can actually improve performance by asking the logistic model to be simple,

and we can do that by adjusting this lovely lovely constant c,

and that determines how much regularization we’ll put into the model to keep it simple.

And again, we can work with linear models.

So for regularization, we’ll choose the sum of the squares of these coefficients,

and this is actually written in this nice neat way; this is actually called an L2 norm,

and that’s what we’re going to use to measure simplicity of models,

and that constant c is going to determine how much we care about the simplicity of the model versus its accuracy.

And here’s another kind of regularization where we take the sum of the absolute values of those coefficients, and this is written this way; it’s called the L1 norm.

Now in practice, these two kinds of regularization –

they have very different meanings and they change the coefficients in different ways,

but they are both very helpful for purposes of generalization.
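
Written out, the two regularized objectives being described are presumably along these lines (the exact placement and scaling of the constant c varies between write-ups and software packages, so treat this as a sketch rather than the lecture’s exact formula):

```latex
\begin{align*}
\text{L2 regularization:}\quad & \min_{\beta}\; \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i x_i \beta}\right) + c\,\|\beta\|_2^2,
& \|\beta\|_2^2 &= \textstyle\sum_{j} \beta_j^{2}, \\
\text{L1 regularization:}\quad & \min_{\beta}\; \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i x_i \beta}\right) + c\,\|\beta\|_1,
& \|\beta\|_1 &= \textstyle\sum_{j} |\beta_j|.
\end{align*}
```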

