Regression(4)-------Logistic Regression & Regularization

With the imperial seal in hand, the world is mine!

Logistic Regression

=========================

(1) Classification

(2) Hypothesis Representation

(3) Decision Boundary

(4) Cost Function

(5) Simplified Cost Function and Gradient Descent

(6) Parameter Optimization in Matlab

(7) Multiclass classification: One-vs-all

The problem of overfitting and how to solve it

=========================

(8) The problem of overfitting

(9) Cost Function

(10) Regularized Linear Regression (addresses overfitting)

(11) Regularized Logistic Regression (addresses overfitting)

(12) Summary

This chapter covers logistic regression and the use of regularization to address overfitting. Both are important and very widely used in machine learning. The two parts are covered in turn below.

Part 1: Logistic Regression

/*************(1)~(2) Classification / Hypothesis Representation***********/

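Suppose we want to predict from the tumor size whether a patient's tumor is malignant or benign.

Eight data points are given (plotted in a figure in the original post).

Suppose linear regression yields the hypothesis drawn as the pink line in that figure. We can then predict with a threshold of 0.5:

y=1, if h(x)>=0.5

y=0, if h(x)<0.5

That is, project the point where h(x)=0.5 down onto the tumor-size axis: points to its right are predicted y=1 and points to its left y=0. This classifies the data set well.

But what if the data set looks different, for example with one malignant sample far to the right?

In that case, suppose linear regression gives the blue line. The 0.5 threshold on that line no longer classifies the data well, because some samples violate

y=1, if h(x)>=0.5

y=0, if h(x)<0.5

So we introduce the logistic regression model:

$h_{\theta}(x)=g(\theta^{T}x),\quad g(z)=\frac{1}{1+e^{-z}}$

The function g(z) is called the sigmoid function or logistic function.

When z>=0, g(z)>=0.5; when z<0, g(z)<0.5.

h(x) is interpreted as the probability that y=1 given the data x and the parameters θ, so the probabilities of y=0 and y=1 sum to 1:

$P(y=0\mid x;\theta)+P(y=1\mid x;\theta)=1$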
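The properties of g(z) stated above can be checked numerically. The following is a minimal sketch in Matlab; the sample values in z are arbitrary and chosen only for illustration.

```matlab
% Sigmoid (logistic) function as an anonymous function
g = @(z) 1 ./ (1 + exp(-z));

z  = [-5 -1 0 1 5];   % arbitrary sample values of theta' * x
p1 = g(z);            % P(y=1 | x; theta)
p0 = 1 - p1;          % P(y=0 | x; theta)

% Rows: z, P(y=1), P(y=0). Each column of the last two rows sums to 1.
disp([z; p1; p0]);
```

For z>=0 the second row is at least 0.5, and for z<0 it is below 0.5, which is what makes 0.5 a natural prediction threshold.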

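/*****************************(3) Decision Boundary**************************/

The decision boundary is the boundary, determined by h(x), that separates the region where we predict y=1 from the region where we predict y=0.

Suppose the hypothesis has the form h(x)=g(θ0+θ1x1+θ2x2) with parameters θ=[-3,1,1]^T. Then

predict Y=1, if -3+x1+x2>=0

predict Y=0, if -3+x1+x2<0

The line x1+x2=3 separates the data set in the figure well.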
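The rule above can be sketched in a few lines of Matlab. The parameter vector θ=[-3,1,1]^T comes from the text; the query points in X are hypothetical and only serve as an illustration.

```matlab
g = @(z) 1 ./ (1 + exp(-z));

theta = [-3; 1; 1];                   % parameters from the example above

% Hypothetical query points, one (x1, x2) pair per row
X  = [1 1; 2 2; 3 3; 0.5 2; 4 0];
Xb = [ones(size(X,1),1) X];           % prepend the intercept term x0 = 1

h    = g(Xb * theta);                 % estimated P(y=1 | x; theta)
pred = h >= 0.5;                      % equivalent to Xb*theta >= 0

disp([X h pred]);
```

Thresholding h at 0.5 and thresholding θ^T x at 0 give the same predictions, which is why the boundary is the straight line x1+x2=3.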

Another Example:

answer:

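Besides linear boundaries there are also nonlinear decision boundaries, for example with a hypothesis of the form

$h_{\theta}(x)=g(\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+\theta_{3}x_{1}^{2}+\theta_{4}x_{2}^{2})$

With θ=[-1,0,0,1,1]^T we predict y=1 when x1^2+x2^2>=1, so the decision boundary is a circle of radius 1.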

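/********************(4)~(5) Simplified cost function and gradient descent <very important>*******************/

This part presents the simplified cost function and shows how to implement gradient descent for logistic regression.

Suppose y only takes the values 0 and 1. For a logistic regression model we have $h_{\theta}(x)=\frac{1}{1+e^{-\theta^{T}x}}$, and the cost function is defined as:

$Cost(h_{\theta}(x),y)=\begin{cases}-\log(h_{\theta}(x)) & \text{if } y=1\\-\log(1-h_{\theta}(x)) & \text{if } y=0\end{cases}$

Since y only takes the values 0 and 1, this can be written as a single expression:

$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)})+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]$

To verify, substitute y=0 and y=1 in turn: each term of J(θ) reduces to the Cost(hθ(x),y) above. What remains is to find the θ that minimizes J(θ).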
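The substitution argument can also be checked numerically. The sketch below assumes a few made-up predicted probabilities h and labels y; it computes the cost per sample both ways and compares them.

```matlab
% Hypothetical predicted probabilities and labels (illustration only)
h = [0.9; 0.2; 0.6; 0.05];
y = [1; 0; 1; 0];
m = numel(y);

% Piecewise definition: -log(h) if y=1, -log(1-h) if y=0
costPiecewise = zeros(m,1);
costPiecewise(y == 1) = -log(h(y == 1));
costPiecewise(y == 0) = -log(1 - h(y == 0));

% Single-expression form
costCombined = -y .* log(h) - (1 - y) .* log(1 - h);

J = mean(costCombined);               % J(theta) is the average over the m samples

disp([costPiecewise costCombined]);
```

The two columns agree, so averaging either one gives the same J(θ).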

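Chapter 1 showed how to apply gradient descent: inside a Repeat loop, all components of θ are updated simultaneously. The partial derivatives of J(θ) work out to:

$\frac{\partial}{\partial\theta_{j}}J(\theta)=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}$

Substituting this into the Repeat loop:

$\theta_{j}:=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\quad\text{(update all } \theta_{j} \text{ simultaneously)}$

Surprisingly, this has the same form as the update rule obtained in Chapter 1.

In other words, whether h(x) is linear or the logistic regression model, we get the same parameter update rule; only the definition of h(x) differs.

How can this be vectorized? Instead of updating each θj in a for loop, we update the whole vector θ with one matrix multiplication. With X the m×(n+1) matrix of training examples and y the vector of labels:

$\theta:=\theta-\frac{\alpha}{m}X^{T}(g(X\theta)-y)$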
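A minimal sketch of the vectorized update in Matlab follows. The data set, the learning rate alpha, and the iteration count numIters are assumptions made for illustration; they are not taken from the course material.

```matlab
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical 1-D data that is not perfectly separable
x = [1; 2; 3; 4; 5; 6; 7; 8];
y = [0; 0; 0; 1; 0; 1; 1; 1];
m = numel(y);
X = [ones(m,1) x];                    % add the intercept column

theta    = zeros(2,1);
alpha    = 0.1;                       % assumed learning rate
numIters = 2000;                      % assumed iteration count
J        = zeros(numIters,1);

for k = 1:numIters
    h     = g(X * theta);
    J(k)  = (1/m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));  % cost before this update
    theta = theta - (alpha/m) * (X' * (h - y));                 % update all theta_j at once
end

disp(theta');
```

Recording J(k) on every iteration makes it possible to plot J against the iteration number and confirm that it keeps decreasing for the chosen alpha.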

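The formula above updates the whole parameter vector θ. One more question: Chapter 2 explained how to judge whether the learning rate α is suitable. How do we judge it for logistic regression?

Q: Suppose you are running gradient descent to fit a logistic regression model with parameter θ∈R^(n+1). Which of the following is a reasonable way to make sure the learning rate α is set properly and that gradient descent is running correctly?

A: Plot J(θ), using the logistic regression cost function above, as a function of the number of iterations and make sure it decreases on every iteration.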

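/*************(6) Parameter Optimization in Matlab***********/

This part describes some optimizations for logistic regression so that the parameters can be fitted faster, and implements gradient-based parameter optimization in Matlab.

First, gradient descent is not the only option. The lecture slide lists three other methods on the left (conjugate gradient, BFGS, and L-BFGS) and their shared pros and cons on the right: there is no learning rate α to choose and they are usually faster, but they are more complex.

Matlab already implements such optimizers, so all we have to do is write the cost function and tell Matlab which options to use. For example, for 'GradObj' the documentation says: Use the GradObj option to specify that FUN also returns a second output argument G that is the partial derivatives of the function df/dX, at the point X.

Given the parameters θ, we need to supply the cost function, where:

jVal is the value of the cost function. For example, suppose we fit the two points (x1,x2,y)=(1,0,5) and (0,1,5) with the hypothesis hθ(x)=θ1x1+θ2x2;
then the cost function J(θ) is: jVal=(theta(1)-5)^2+(theta(2)-5)^2;

gradient(i) is the partial derivative of J(θ) with respect to θi. In this example gradient(1)=2*(theta(1)-5) and gradient(2)=2*(theta(2)-5). A plain gradient descent step would be θ(i) := θ(i) - α*gradient(i); fminunc uses the same gradient information in its own update. The code is shown below:

The function costFunction defines jVal=J(θ) and the gradient with respect to the two components of θ: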


function [jVal, gradient] = costFunction(theta)
%COSTFUNCTION Cost and gradient for the two-point example.
%   jVal     = J(theta) = (theta(1)-5)^2 + (theta(2)-5)^2
%   gradient = partial derivatives of J with respect to theta(1), theta(2)

jVal = (theta(1)-5)^2 + (theta(2)-5)^2;

gradient = zeros(2,1);
gradient(1) = 2 * (theta(1)-5);   % dJ/dtheta(1)
gradient(2) = 2 * (theta(2)-5);   % dJ/dtheta(2)

end

Write the function Gradient_descent to optimize the parameters:


function [optTheta, functionVal, exitFlag] = Gradient_descent()
%GRADIENT_DESCENT Minimize costFunction with fminunc, using the supplied gradient.

% 'GradObj','on': costFunction returns the gradient as its second output
options = optimset('GradObj', 'on', 'MaxIter', 100);

initialTheta = zeros(2,1)   % no semicolon, so the starting point is echoed

[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

end

Calling it from the Matlab command window gives the optimized parameters (θ1,θ2)=(5,5), i.e. hθ(x)=θ1x1+θ2x2=5*x1+5*x2


 [optTheta,functionVal,exitFlag] = Gradient_descent()

initialTheta =
     0
     0

Local minimum found.

Optimization completed because the size of the gradient is less than
the default value of the function tolerance.

<stopping criteria details>

optTheta =
     5
     5

functionVal =
     0

exitFlag =
     1

The result shows the optimized parameters optTheta=[5,5] and functionVal=0, the value of costFunction at the optimum.

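/*****************************(7) Multi-class Classification: One-vs-all**************************/

The one-vs-all method applies binary classification to a multi-class problem.

To classify into K classes, take one class as positive and merge the other K-1 classes as negative, then fit a classifier. Doing this once per class gives K classifiers hθ(x); each one estimates the probability that x belongs to its positive class, given θ and x.

For a new input vector x, the class whose classifier gives the largest hθ(x) is the predicted class.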
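The prediction step can be sketched as follows. The parameter matrix Theta is hypothetical (assumed to be already trained, one column per class, K=3) and the query points are made up for illustration.

```matlab
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical trained parameters: column k holds [theta0; theta1; theta2] of classifier k
Theta = [ 1  -4  -4;
         -2   2   0;
         -2   0   2];

X  = [0 0; 3 0; 0 3];                 % three query points (x1, x2)
Xb = [ones(size(X,1),1) X];           % prepend the intercept term

H = g(Xb * Theta);                    % H(i,k) = output of classifier k for point i
[~, predictedClass] = max(H, [], 2);  % pick the class with the largest output per row

disp(predictedClass');
```

Because g is monotonically increasing, taking the maximum of Xb*Theta would select the same classes.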

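Part 2: The problem of overfitting and how to solve it

/************(8) The problem of overfitting***********/

The Problem of overfitting:

Overfitting means the model fits the training data too closely, as in the rightmost plot of each figure. Both model types covered so far (logistic regression and linear regression) can overfit; one figure illustrates each case:

<Linear Regression>:

<logistic regression>:

How can overfitting be addressed? Two approaches:

1. Reduce the number of features (decide manually which features to keep, or let an algorithm select them).

2. Regularization (keep all the features, but make the parameters of some features very small).

Regularization is explained in detail below.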

For a linear regression model, the problem is to minimize

$MSE(f)=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-f(x_{i}))^2$

In matrix form:

$for\:problem\:\: Y=a^{T}X,\\ J(a)=\sum_{i=1}^{n}(a^{T}x_{i}-y_{i})^{2}\\ X = [x_{1},x_{2},...,x_{n}],\:\:Y = [y_{1},y_{2},...,y_{n}]\\$

i.e. the loss function can be written as

$J(a)=\|a^{T}X-Y\|^{2}$

Setting the gradient with respect to a to zero, we get:

$a=(XX^T)^{-1}XY^{T}$

After regularization, however, we have:

$a=(XX^T+\lambda I)^{-1}XY^{T}$

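/************(9) Cost Function***********/

The idea of regularization is this: give θ3 and θ4 very large coefficients in the cost function (for example, add 1000*θ3^2+1000*θ4^2). Minimizing the cost function then yields very small θ3 and θ4.

As a formula, a penalty on θ1~θn is added to the cost function:

$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}+\lambda\sum_{j=1}^{n}\theta_{j}^{2}\right]$

The choice of λ matters, as the following quiz answer shows:

A: If λ is very large, θ1~θn are all driven to ≈0, leaving hθ(x)≈θ0, which underfits the data.

Below, the regularization steps are given separately for linear regression and logistic regression.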

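/************(10) Regularized Linear Regression***********/

<Linear regression>:

First, let us see how to apply gradient descent to update the parameters under the cost function above.

θ0 has no penalty term, so its update is the same as before:

$\theta_{0}:=\theta_{0}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}$

For every other θj, the derivative of J(θ) gains an extra term (λ/m)*θj:

$\theta_{j}:=\theta_{j}-\alpha\left[\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}+\frac{\lambda}{m}\theta_{j}\right],\quad j=1,...,n$

If, instead of gradient descent, we solve for θ directly with matrix algebra (the normal equation), i.e. find the θ that minimizes J(θ) by setting every partial derivative of J(θ) with respect to θj to 0, we get:

$\theta=\left(X^{T}X+\lambda\,diag(0,1,...,1)\right)^{-1}X^{T}y$

Here X is the m×(n+1) matrix with one training example per row, and diag(0,1,...,1) is the (n+1)×(n+1) identity matrix with its top-left entry set to 0. For λ>0 the matrix in parentheses is guaranteed to be invertible.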
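The regularized normal equation can be sketched in Matlab as below. The data (noisy samples of a line fitted with a 5th-order polynomial) and the λ values are assumptions chosen for illustration.

```matlab
% Hypothetical data: 7 noisy samples of y = 2x, fitted with a 5th-order polynomial
x = linspace(-1, 1, 7)';
y = 2*x + 0.1*[1; -1; 1; -1; 1; -1; 1];
X = [ones(size(x)) x x.^2 x.^3 x.^4 x.^5];
n = size(X,2) - 1;

R = diag([0; ones(n,1)]);             % identity with 0 in the top-left: theta0 is not penalized
lambdas   = [0 1 10];
normTheta = zeros(size(lambdas));

for k = 1:numel(lambdas)
    % Backslash solves the linear system without forming an explicit inverse
    theta = (X'*X + lambdas(k)*R) \ (X'*y);
    normTheta(k) = norm(theta(2:end));   % size of the penalized parameters
end

disp(normTheta);
```

The norm of the penalized parameters θ1~θn does not grow as λ increases, which is the shrinking effect regularization is meant to have.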

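/************(11) Regularized Logistic Regression***********/

<Logistic regression>:

The cost function of logistic regression and its overfitting behavior were covered above.

As with linear regression, we add a penalty on θ to J(θ) to suppress overfitting:

$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)})+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}$

Applying gradient descent with the partial derivatives of J(θ) with respect to each θj gives

$\theta_{0}:=\theta_{0}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}$

$\theta_{j}:=\theta_{j}-\alpha\left[\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}+\frac{\lambda}{m}\theta_{j}\right],\quad j=1,...,n$

These updates have the same form as for linear regression; only hθ(x) differs, being g(θ^T x) here.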

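Q: When using regularized logistic regression, which of these is the best way to monitor whether gradient descent is working correctly?

A: Plot the regularized J(θ), including the penalty term, as a function of the number of iterations and make sure it decreases on every iteration.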

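As in the Matlab example above, we can define a cost function for regularized logistic regression that returns jVal and gradient:

jVal is the value of the cost function, whose last term is the penalty on θ. gradient holds the partial derivative with respect to each θj: θ0 is not in the penalty term, so its gradient is unchanged, while θ1~θn each gain an extra term (λ/m)*θj.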
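A minimal sketch of such a cost function follows, written with anonymous functions so that it runs as a single script. The names costFn and gradFn, the data, and the value of lambda are assumptions for illustration. The gradient is compared against a central finite difference as a sanity check.

```matlab
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical data: first column is the intercept term
X = [1 0.5 1.2; 1 -1.0 0.3; 1 1.5 -0.7; 1 0.2 0.9];
y = [1; 0; 1; 0];
m = numel(y);
lambda = 1;
theta  = [0.1; -0.2; 0.3];

% Cost: cross-entropy term plus penalty on theta(2:end) (theta0 is not penalized)
costFn = @(t) (1/m) * sum(-y .* log(g(X*t)) - (1 - y) .* log(1 - g(X*t))) ...
              + (lambda/(2*m)) * sum(t(2:end).^2);

% Gradient: unregularized gradient plus (lambda/m)*theta_j for j >= 1
gradFn = @(t) (1/m) * (X' * (g(X*t) - y)) + (lambda/m) * [0; t(2:end)];

jVal     = costFn(theta);
gradient = gradFn(theta);

% Central finite-difference approximation of the gradient
epsilon = 1e-6;
numGrad = zeros(size(theta));
for j = 1:numel(theta)
    e = zeros(size(theta));
    e(j) = epsilon;
    numGrad(j) = (costFn(theta + e) - costFn(theta - e)) / (2*epsilon);
end

disp([gradient numGrad]);
```

In a function file, jVal and gradient would be the two outputs, in the same way as costFunction in section (6).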

With this, regularization can address overfitting in both linear regression and logistic regression.

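Finally, a short summary:

Logistic regression:

In logistic regression, the logistic function is:

$h_{\theta}(x)=g(\theta^{T}x)=\frac{1}{1+e^{-\theta^{T}x}}$

Its advantage is that it squashes the output into the range 0~1. The loss function differs from that of linear regression and is defined as:

$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log h_{\theta}(x^{(i)})-(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]$

If Newton's method is used to solve for the parameters, the iteration is:

$\theta^{(t+1)}=\theta^{(t)}-H^{-1}\nabla_{\theta}J$

where the gradient and the Hessian matrix are:

$\nabla_{\theta}J=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}$

$H=\frac{1}{m}\sum_{i=1}^{m}\left[h_{\theta}(x^{(i)})(1-h_{\theta}(x^{(i)}))x^{(i)}(x^{(i)})^{T}\right]$

When coding, avoid for loops and use the vectorized forms of these formulas directly (see the program below).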

% Exercise 4 -- Logistic Regression

clear all; close all; clc

x = load('ex4x.dat'); 
y = load('ex4y.dat');

[m, n] = size(x);

% Add intercept term to x
x = [ones(m, 1), x]; 

% Plot the training data
% Use different markers for positives and negatives
figure
pos = find(y); neg = find(y == 0); % find returns the indices of the elements for which the condition is true
plot(x(pos, 2), x(pos,3), '+')
hold on
plot(x(neg, 2), x(neg, 3), 'o')
hold on
xlabel('Exam 1 score')
ylabel('Exam 2 score')


% Initialize fitting parameters
theta = zeros(n+1, 1);

% Define the sigmoid function
g = @(z) 1.0 ./ (1.0 + exp(-z)); % anonymous function; inline is deprecated in recent Matlab releases

% Newton's method
MAX_ITR = 7;
J = zeros(MAX_ITR, 1);

for i = 1:MAX_ITR
    % Calculate the hypothesis function
    z = x * theta;
    h = g(z); % apply the logistic function
    
    % Calculate gradient and hessian.
    % The formulas below are equivalent to the summation formulas
    % given in the lecture videos.
    grad = (1/m).*x' * (h-y); % vectorized gradient
    H = (1/m).*x' * diag(h) * diag(1-h) * x; % vectorized Hessian
    
    % Calculate J (for testing convergence)
    J(i) =(1/m)*sum(-y.*log(h) - (1-y).*log(1-h)); % vectorized loss
    
    theta = theta - H\grad; % Newton update: H\grad solves H*delta = grad
end
% Display theta
theta

% Calculate the probability that a student with
% Score 20 on exam 1 and score 80 on exam 2 
% will not be admitted
prob = 1 - g([1, 20, 80]*theta)

% Plot the decision boundary
% Plot Newton's method result
% Only need 2 points to define a line, so choose two endpoints
plot_x = [min(x(:,2))-2,  max(x(:,2))+2];
% Calculate the decision boundary line
plot_y = (-1./theta(3)).*(theta(2).*plot_x +theta(1));
plot(plot_x, plot_y)
legend('Admitted', 'Not admitted', 'Decision Boundary')
hold off

% Plot J
figure
plot(0:MAX_ITR-1, J, 'o--', 'MarkerFaceColor', 'r', 'MarkerSize', 8)
xlabel('Iteration'); ylabel('J')
% Display J
J

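Regularized linear regression:

The model here is a 5th-order polynomial:

$h_{\theta}(x)=\theta_{0}+\theta_{1}x+\theta_{2}x^{2}+\theta_{3}x^{3}+\theta_{4}x^{4}+\theta_{5}x^{5}$

The loss function with the regularization term is:

$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}+\lambda\sum_{j=1}^{n}\theta_{j}^{2}\right]$

The normal equation solution of the model is:

$\theta=\left(X^{T}X+\lambda\,diag(0,1,...,1)\right)^{-1}X^{T}y$

The program mainly tests how the three values lambda=0,1,10 affect the final result.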

clc,clear
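% Load the data
x = load('ex5Linx.dat');
y = load('ex5Liny.dat');

% Plot the raw data
plot(x,y,'o','MarkerEdgeColor','b','MarkerFaceColor','r')

% Build the training matrix from polynomial features of x
x = [ones(length(x),1) x x.^2 x.^3 x.^4 x.^5];
[m, n] = size(x);
n = n -1;

% Compute the parameters sida (theta) and plot the fitted curves
rm = diag([0;ones(n,1)]); % the matrix that multiplies lambda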
lamda = [0 1 10]';
colortype = {'g','b','r'};
sida = zeros(n+1,3);
xrange = linspace(min(x(:,2)),max(x(:,2)))';
hold on;
for i = 1:3
    sida(:,i) = (x'*x+lamda(i).*rm)\(x'*y); % solve the regularized normal equation for sida
    norm_sida = norm(sida(:,i)) % norm of the parameters for this lambda only
    yrange = [ones(size(xrange)) xrange xrange.^2 xrange.^3,...
        xrange.^4 xrange.^5]*sida(:,i);
    plot(xrange',yrange,char(colortype(i)))
    hold on
end
legend('training data', '\lambda=0', '\lambda=1','\lambda=10') % note the escape sequences used for lambda
hold off

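Regularized logistic regression:

In logistic regression, the hypothesis is:

$h_{\theta}(x)=g(\theta^{T}x)=\frac{1}{1+e^{-\theta^{T}x}}$

In this problem the two input features u and v are mapped into a 28-dimensional space, containing all monomials of u and v up to degree 6. The mapped vector x is:

$x=[1,\;u,\;v,\;u^{2},\;uv,\;v^{2},\;u^{3},\;...,\;uv^{5},\;v^{6}]^{T}$

The loss function of the system with the regularization term is:

$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)})+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}$

The corresponding Newton's method update is:

$\theta^{(t+1)}=\theta^{(t)}-H^{-1}\nabla_{\theta}J$

where:

$\nabla_{\theta}J=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}+\frac{\lambda}{m}[0,\theta_{1},...,\theta_{n}]^{T}$

$H=\frac{1}{m}\sum_{i=1}^{m}\left[h_{\theta}(x^{(i)})(1-h_{\theta}(x^{(i)}))x^{(i)}(x^{(i)})^{T}\right]+\frac{\lambda}{m}\,diag(0,1,...,1)$

Some general notes on these formulas: with the 28-dimensional feature mapping, x^(i) and the gradient are 28×1 vectors, H is a 28×28 matrix, y^(i) and hθ(x^(i)) are scalars, and the matrix following λ/m in the Hessian is the 28×28 identity matrix with its top-left entry set to 0.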

% Load the data
clc,clear,close all;
x = load('ex5Logx.dat');
y = load('ex5Logy.dat');

% Plot the data distribution
plot(x(find(y),1),x(find(y),2),'o','MarkerFaceColor','b')
hold on;
plot(x(find(y==0),1),x(find(y==0),2),'r+')
legend('y=1','y=0')

% Add polynomial features to x by 
% calling the feature mapping function
% provided in separate m-file
x = map_feature(x(:,1), x(:,2));

[m, n] = size(x);

% Initialize fitting parameters
theta = zeros(n, 1);

% Define the sigmoid function
g = @(z) 1.0 ./ (1.0 + exp(-z)); % anonymous function; inline is deprecated in recent Matlab releases

% setup for Newton's method
MAX_ITR = 15;
J = zeros(MAX_ITR, 1);

% Lambda is the regularization parameter
lambda = 1; % set this to 0, 1, or 10 and run three times to get the three results

% Newton's Method
for i = 1:MAX_ITR
    % Calculate the hypothesis function
    z = x * theta;
    h = g(z);
    
    % Calculate J (for testing convergence)
    J(i) =(1/m)*sum(-y.*log(h) - (1-y).*log(1-h))+ ...
    (lambda/(2*m))*norm(theta([2:end]))^2;
    
    % Calculate gradient and hessian.
    G = (lambda/m).*theta; G(1) = 0; % extra term for gradient
    L = (lambda/m).*eye(n); L(1) = 0;% extra term for Hessian
    grad = ((1/m).*x' * (h-y)) + G;
    H = ((1/m).*x' * diag(h) * diag(1-h) * x) + L;
    
    % Here is the actual update
    theta = theta - H\grad;
  
end
% Show J to determine if algorithm has converged
J
% display the norm of our parameters
norm_theta = norm(theta) 

% Plot the results 
% We will evaluate theta*x over a 
% grid of features and plot the contour 
% where theta*x equals zero

% Here is the grid range
u = linspace(-1, 1.5, 200);
v = linspace(-1, 1.5, 200);

z = zeros(length(u), length(v));
% Evaluate z = theta*x over the grid
for i = 1:length(u)
    for j = 1:length(v)
        z(i,j) = map_feature(u(i), v(j))*theta; % this is not the loss-vs-iteration curve but the linear term theta'*x after feature mapping
    end
end
z = z'; % important to transpose z before calling contour

% Plot z = 0
% Notice you need to specify the range [0, 0]
contour(u, v, z, [0, 0], 'LineWidth', 2) % draw the contour z = 0, where the predicted probability is exactly 0.5
legend('y = 1', 'y = 0', 'Decision boundary')
title(sprintf('\\lambda = %g', lambda), 'FontSize', 14)


hold off

% Uncomment to plot J
% figure
% plot(0:MAX_ITR-1, J, 'o--', 'MarkerFaceColor', 'r', 'MarkerSize', 8)
% xlabel('Iteration'); ylabel('J')
