Machine Learning - Regularized Linear Regression

This series of articles contains study notes for "Machine Learning" by Prof. Andrew Ng, Stanford University. This article covers week 3, Solving the Problem of Overfitting, Part II. It explains how to implement linear regression with regularization to address overfitting.


Regularized Linear Regression


1. Cost Function of Linear Regression with Regularization

For linear regression, we have previously worked out two learning algorithms: one based on gradient descent and one based on the normal equation. Here, we'll take those two algorithms and generalize them to the case of regularized linear regression.

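Original cost function of linear regression without regularization:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Cost function of linear regression with regularization:

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$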
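As a rough illustration, the regularized cost could be computed along the following lines. This is a minimal sketch in Python with NumPy, which the course itself does not use (it uses Octave). The function name `regularized_cost` and the tiny data set are made up for this example, and the sketch assumes the first column of X is the all-ones intercept column, so that theta[0] is left out of the penalty.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost J(theta).

    X is the m x (n+1) design matrix whose first column is all ones,
    y has length m, and lam is the regularization parameter lambda.
    """
    m = len(y)
    residuals = X @ theta - y                # h_theta(x^(i)) - y^(i) for every i
    penalty = lam * np.sum(theta[1:] ** 2)   # theta_0 is not penalized
    return (residuals @ residuals + penalty) / (2 * m)

# Hypothetical data lying exactly on y = 1 + 2x, so the fit term is zero.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = np.array([1.0, 2.0])

cost_unregularized = regularized_cost(theta, X, y, lam=0.0)
cost_regularized = regularized_cost(theta, X, y, lam=3.0)
```

With a perfect fit, the only contribution to the regularized cost is the penalty on theta[1], which shows how the extra term pushes against large parameter values even when the training error is zero.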


Let's take a look at the gradient descent algorithm. Previously, we used gradient descent on the original cost function, without the regularization term. For ordinary linear regression without regularization, we repeatedly update the parameters θ_j as follows, for j = 0, 1, 2, ..., n.

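Gradient descent for linear regression (without regularization term):

$$
\begin{aligned}
&\text{Repeat } \{ \\
&\qquad \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad (j = 0, 1, 2, \dots, n) \\
&\}
\end{aligned}
$$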

Let me write the update for θ_0 separately from the updates for θ_1, θ_2, θ_3, and so on up to θ_n. Nothing has changed yet; this is the same algorithm, with the θ_0 case split out.

Then we add the regularization term.
When we modify this algorithm for regularized linear regression, we end up treating θ_0 slightly differently, because θ_0 is not penalized. Concretely, to make this algorithm use the regularized objective, all we need to do is modify the update for θ_j (j = 1, ..., n): we add λ/m times θ_j to the term that is multiplied by α, which amounts to subtracting α times λ/m times θ_j in each step. If you implement this, you have gradient descent for minimizing the regularized cost function J(θ).

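Gradient descent for linear regression (with regularization term):

$$
\begin{aligned}
&\text{Repeat } \{ \\
&\qquad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)} \\
&\qquad \theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right] \qquad (j = 1, 2, \dots, n) \\
&\}
\end{aligned}
$$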
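The regularized update could be sketched in Python with NumPy roughly as follows. This is an illustration rather than the course's own code: the function name `gradient_descent_regularized`, the learning rate, the iteration count, and the four data points are all assumptions chosen for the example. The sketch assumes X already contains the all-ones column, and it updates all parameters simultaneously.

```python
import numpy as np

def gradient_descent_regularized(X, y, alpha, lam, num_iters):
    """Batch gradient descent on the regularized linear-regression cost."""
    m, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(num_iters):
        errors = X @ theta - y                 # h_theta(x^(i)) - y^(i)
        grad = (X.T @ errors) / m              # unregularized gradient, all j
        grad[1:] += (lam / m) * theta[1:]      # add (lambda/m) * theta_j for j >= 1 only
        theta = theta - alpha * grad           # simultaneous update of every theta_j
    return theta

# Hypothetical data lying exactly on y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta_plain = gradient_descent_regularized(X, y, alpha=0.1, lam=0.0, num_iters=5000)
theta_reg = gradient_descent_regularized(X, y, alpha=0.1, lam=5.0, num_iters=5000)
```

On this data the unregularized run recovers an intercept near 1 and a slope near 2, while the run with lam=5.0 ends with a smaller slope, since only the slope is penalized and the intercept is free to compensate.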

Analysis of the update effect for θ after adding the regularization term

Concretely, θ_j gets updated as θ_j minus α times a term that itself contains a piece depending on θ_j. If you group together all the terms that depend on θ_j, you can show that the update can be written equivalently as follows:

$$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

So you end up with α times λ/m multiplied into θ_j. The term 1 − α λ/m has a pretty interesting effect. It is usually a number a little bit less than 1, because α λ/m is positive, and if your learning rate is small and m is large, it is usually pretty small. So think of 1 − α λ/m as a number like 0.99, let's say. The effect of the update is then that θ_j gets replaced by θ_j times 0.99, which shrinks θ_j a little bit towards zero. More formally, this makes the squared norm of θ_j a little bit smaller. After that, the second term is exactly the same as the original gradient descent update that we had before we added regularization. So, hopefully this gradient descent update makes sense.
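A small numeric check can make the shrinkage factor concrete. The sketch below uses plain Python, and every value in it (alpha, lam, m, the current theta_j, and the stand-in gradient value) is hypothetical, picked so that the factor comes out to 0.99 as in the text. It also confirms that the two ways of writing the update agree.

```python
import math

# Hypothetical settings chosen so that alpha * lam / m = 0.01.
alpha, lam, m = 0.1, 10.0, 100
shrink = 1 - alpha * lam / m          # the factor multiplying theta_j

theta_j = 4.0
data_gradient = 0.5                   # stands in for (1/m) * sum((h - y) * x_j)

# Form 1: regularization term kept inside the bracket.
update_bracketed = theta_j - alpha * (data_gradient + (lam / m) * theta_j)

# Form 2: theta_j shrunk first, then the usual unregularized step.
update_grouped = theta_j * shrink - alpha * data_gradient

forms_agree = math.isclose(update_bracketed, update_grouped)
```

Both forms give the same new value of theta_j; the grouped form just makes the "shrink, then take the usual step" reading explicit.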

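If you calculate the partial derivative, you will find that the term in the square brackets is the partial derivative of the regularized cost function J(θ) with respect to θ_j:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \qquad (j = 1, 2, \dots, n)$$

And the first term, the one in the update for θ_0, is the partial derivative with respect to θ_0:

$$\frac{\partial}{\partial \theta_0} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$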


2. Normal equation 


Gradient descent was just one of our two algorithms for fitting a linear regression model. The second algorithm was the one based on the normal equation, where we created the design matrix X, in which each row corresponds to a separate training example, and an m-dimensional vector y containing the labels from the training set. So X is an m by (n+1) matrix and y is an m-dimensional vector:

$$X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$


In order to minimize the cost function J, we found that one way to do so is to set θ as follows:

$$\theta = (X^T X)^{-1} X^T y$$

This value of θ minimizes the cost function J(θ) when we are not using regularization.


If you are using regularization, this formula changes as follows:

$$\theta = \left( X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^T y$$

Inside the parentheses, you end up with a matrix whose upper-left entry is 0, with ones on the rest of the diagonal and zeros everywhere else.


E.g. n = 2:

$$\begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

As an example, if n = 2, this matrix is a three by three matrix. More generally, it is an (n+1) by (n+1) matrix.

It is possible to prove that, if you are using the new definition of J(θ) with the regularization objective, then this new formula for θ is the one that gives you the global minimum of J(θ).
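The regularized normal equation could be sketched in Python with NumPy as below. The function name `normal_equation_regularized` and the data are assumptions for illustration only. The sketch solves the linear system with `np.linalg.solve` instead of forming an explicit inverse, which is a common numerical choice and not something the lecture prescribes. As a sanity check, it also evaluates the gradient of the regularized cost at the solution, which should be approximately zero at the minimum.

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """Solve (X^T X + lam * M) theta = X^T y, where M is the identity
    matrix with its upper-left entry set to 0 (theta_0 is not penalized)."""
    num_params = X.shape[1]                  # n + 1
    M = np.eye(num_params)
    M[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)

# Hypothetical data lying exactly on y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
lam = 5.0

theta = normal_equation_regularized(X, y, lam)

# Gradient of the regularized cost at theta; it should vanish at the minimum.
m = len(y)
grad = (X.T @ (X @ theta - y)) / m
grad[1:] += (lam / m) * theta[1:]
```

For this data the solution works out to an intercept of 2.5 and a slope of 1.0, compared with the unregularized fit of intercept 1 and slope 2.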

3. Non-invertibility

So finally I want to just quickly describe the issue of non-invertibility.

If you have fewer examples than features, then the matrix X transpose X will be non-invertible, or singular; another term for this is that the matrix is degenerate. If you implement this in Octave anyway and use the pinv function to take the pseudo-inverse, it will kind of do the right thing, but it is not clear that it would give you a very good hypothesis, even though numerically the Octave pinv function will give you a result that kind of makes sense.

Suppose m ≤ n (m is the number of examples, n is the number of features). Then, in

$$\theta = (X^T X)^{-1} X^T y$$

the matrix X^T X is non-invertible, or singular.

Fortunately, regularization also takes care of this for us. Concretely, so long as the regularization parameter λ is strictly greater than 0, it is possible to prove that the matrix X transpose X plus λ times this funny matrix will not be singular, that is, it will be invertible:

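If λ > 0,

$$\theta = \left( X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^T y$$

The matrix inside the parentheses in the above equation will be invertible (non-singular).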
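The effect can be checked numerically with a small experiment. The sketch below is in Python with NumPy rather than Octave, and the sizes (m = 3 examples, n = 5 features), the random data, and the value of lam are all hypothetical. It compares the rank of X^T X with the rank of the regularized matrix, and also computes a pseudo-inverse solution with `np.linalg.pinv`, NumPy's counterpart to the Octave pinv function mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                                   # fewer examples than features
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # m x (n+1)
y = rng.normal(size=m)

gram = X.T @ X                                # (n+1) x (n+1), rank at most m
rank_plain = np.linalg.matrix_rank(gram)

# Pseudo-inverse still returns an answer even though gram is singular.
theta_pinv = np.linalg.pinv(gram) @ X.T @ y

# Regularized matrix: identity with the upper-left entry zeroed, scaled by lam.
lam = 1.0
M = np.eye(n + 1)
M[0, 0] = 0.0
regularized = gram + lam * M
rank_reg = np.linalg.matrix_rank(regularized)

# With lam > 0 the system can be solved directly.
theta_reg = np.linalg.solve(regularized, X.T @ y)
```

Here X^T X cannot have rank above m, so it falls short of the full n+1, while the regularized matrix has full rank and the direct solve succeeds.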
