Stanford Machine Learning Open Course --- Week 3-1. Logistic Regression

1 Classification

In a classification problem, what we try to predict is whether the outcome belongs to a particular class (for example, correct or incorrect).
Examples of classification problems include:

  • deciding whether an email is spam;
  • deciding whether a financial transaction is fraudulent; and so on.

We begin the discussion with the binary classification problem.
We call the two classes that the dependent variable may belong to:

  • the negative class, represented by 0
  • the positive class, represented by 1

Instead of our output vector y being a continuous range of values, it will only be 0 or 1.
y∈{0,1}
Where 0 is usually taken as the “negative class” and 1 as the “positive class”, but you are free to assign any representation to it.
We’re only doing two classes for now, called a “Binary Classification Problem.”
One method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. This method doesn’t work well because classification is not actually a linear function.


2 Hypothesis Representation

For the breast cancer classification problem, we can use linear regression to fit a straight line to the data.
A linear regression model can only predict continuous values, whereas for a classification problem we need to output 0 or 1. We could predict:

  • when hθ(x) ≥ 0.5, predict y = 1
  • when hθ(x) < 0.5, predict y = 0

[Figure: a straight line fit to the tumor data by linear regression]

For the data shown in the figure above, such a linear model seems to do the classification task well. But suppose we observe another malignant tumor of very large size and add it to the training set as an example; this gives us a new fitted line.

[Figure: the new regression line after adding the very large malignant tumor]

At that point, using 0.5 as the threshold to predict whether a tumor is benign or malignant is no longer appropriate. We can see that the linear regression model is not well suited to this kind of problem, because its predictions can fall outside the range [0, 1].

We introduce a new model, logistic regression, whose output always lies between 0 and 1. The hypothesis of the logistic regression model is:

$$y \in \{0, 1\}$$

$$h_\theta(x) = g(\theta^T x), \qquad z = \theta^T x, \qquad g(z) = \frac{1}{1 + e^{-z}}$$

where:
  • x denotes the feature vector
  • g denotes the logistic function; a commonly used logistic function is the sigmoid function, $g(z) = \dfrac{1}{1 + e^{-z}}$

The graph of this function is:

[Figure: the S-shaped curve of the sigmoid function]

Putting these together, we obtain the hypothesis of the logistic regression model:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Understanding the model:

For a given input x, hθ(x) computes, using the chosen parameters, the estimated probability that the output variable equals 1.

$$h_\theta(x) = P(y = 1 \mid x; \theta) = 1 - P(y = 0 \mid x; \theta)$$

For example, if for a given x the learned parameters give hθ(x) = 0.7, this means there is a 70% probability that y belongs to the positive class, and correspondingly the probability that y belongs to the negative class is 1 − 0.7 = 0.3.

Our hypothesis should satisfy:
$$y \in \{0, 1\}$$
Our new form uses the “Sigmoid Function,” also called the “Logistic Function”:
$$h_\theta(x) = g(\theta^T x), \qquad z = \theta^T x, \qquad g(z) = \frac{1}{1 + e^{-z}}$$

The function g(z), shown here, maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification.
We start with our old hypothesis (linear regression), except that we want to restrict its output to lie between 0 and 1. This is accomplished by plugging $\theta^T x$ into the Logistic Function.
hθ will give us the probability that our output is 1. For example, hθ(x)=0.7 gives us the probability of 70% that our output is 1.

$$h_\theta(x) = P(y = 1 \mid x; \theta) = 1 - P(y = 0 \mid x; \theta)$$
$$P(y = 0 \mid x; \theta) + P(y = 1 \mid x; \theta) = 1$$

Our probability that our prediction is 0 is just the opposite of our probability that it is 1 (e.g. if probability that it is 1 is 70%, then the probability that it is 0 is 30%).
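
To make the hypothesis concrete, here is a minimal Octave sketch of hθ(x) = g(θᵀx); the parameter values and the example x below are made up for illustration:

% Sigmoid function: maps any real number into the interval (0, 1)
sigmoid = @(z) 1 ./ (1 + exp(-z));

theta = [1; -0.5];            % illustrative parameters (theta0, theta1)
x     = [1; 4];               % one example, with the intercept term x0 = 1

h = sigmoid(theta' * x)       % theta' * x = 1 - 2 = -1, so h is about 0.27

Here h is read as the estimated probability that y = 1 for this x, and the probability that y = 0 is 1 - h.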


3 Decision Boundary

In logistic regression, we predict:

  • when hθ(x) ≥ 0.5, predict y = 1
  • when hθ(x) < 0.5, predict y = 0

From the sigmoid curve plotted above, we know that:

  • when z = 0, g(z) = 0.5
  • when z > 0, g(z) > 0.5
  • when z < 0, g(z) < 0.5

Since $z = \theta^T x$, this means:

$$\theta^T x \ge 0 \;\Rightarrow\; y = 1$$
$$\theta^T x < 0 \;\Rightarrow\; y = 0$$

Now suppose we have a model $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ with the parameter vector θ = [−3, 1, 1]. Then when −3 + x1 + x2 ≥ 0, that is when x1 + x2 ≥ 3, the model predicts y = 1.

We can plot the line x1 + x2 = 3; this line is our model's decision boundary, separating the region where the model predicts 1 from the region where it predicts 0.

[Figure: the decision boundary x1 + x2 = 3]
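
A small Octave sketch of the example above (the two test points are chosen arbitrarily):

sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = [-3; 1; 1];                     % theta0 = -3, theta1 = 1, theta2 = 1

xA = [1; 1; 1];                         % x1 + x2 = 2 < 3
xB = [1; 2; 2];                         % x1 + x2 = 4 >= 3

predA = sigmoid(theta' * xA) >= 0.5     % 0: below the boundary, predict y = 0
predB = sigmoid(theta' * xB) >= 0.5     % 1: on or above the boundary, predict y = 1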

Suppose instead our data are distributed as shown below; what kind of model would fit them?

[Figure: data that cannot be separated by a straight line]

Because a curve is needed to separate the y = 0 region from the y = 1 region, we need quadratic features, for example:

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$$

If the parameters are [−1, 0, 0, 1, 1], the decision boundary we obtain is exactly a circle of radius 1 centered at the origin. We can use very complex models to fit decision boundaries of very complex shapes.
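
A sketch of the same prediction rule with the quadratic features and the parameters [−1, 0, 0, 1, 1]; the feature mapping and test points are written out by hand for illustration:

sigmoid  = @(z) 1 ./ (1 + exp(-z));
theta    = [-1; 0; 0; 1; 1];                    % parameters for [1, x1, x2, x1^2, x2^2]
features = @(x1, x2) [1; x1; x2; x1^2; x2^2];   % quadratic feature mapping

sigmoid(theta' * features(0.5, 0.5)) >= 0.5     % 0: inside the unit circle, predict y = 0
sigmoid(theta' * features(1.5, 0))   >= 0.5     % 1: outside the unit circle, predict y = 1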

In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:

$$h_\theta(x) \ge 0.5 \;\Rightarrow\; y = 1$$
$$h_\theta(x) < 0.5 \;\Rightarrow\; y = 0$$

The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
$$g(z) \ge 0.5 \quad \text{when } z \ge 0$$

Remember:

$$z = 0,\; e^{0} = 1 \;\Rightarrow\; g(z) = 1/2$$
$$z \to \infty,\; e^{-\infty} \to 0 \;\Rightarrow\; g(z) = 1$$
$$z \to -\infty,\; e^{\infty} \to \infty \;\Rightarrow\; g(z) = 0$$

So if our input to g is $\theta^T x$, then that means:

$$h_\theta(x) = g(\theta^T x) \ge 0.5 \quad \text{when } \theta^T x \ge 0$$

From these statements we can now say:

$$\theta^T x \ge 0 \;\Rightarrow\; y = 1$$
$$\theta^T x < 0 \;\Rightarrow\; y = 0$$

The decision boundary is the line that separates the area where y=0 and where y=1. It is created by our hypothesis function.
Example:

$$\theta = \begin{bmatrix} 5 \\ -1 \\ 0 \end{bmatrix}$$
$$y = 1 \;\text{ if }\; 5 + (-1)x_1 + 0 \cdot x_2 \ge 0 \;\Rightarrow\; 5 - x_1 \ge 0 \;\Rightarrow\; -x_1 \ge -5 \;\Rightarrow\; x_1 \le 5$$

Our decision boundary then is a straight vertical line placed on the graph where x1=5, and everything to the left of that denotes y=1, while everything to the right denotes y=0.
Again, the input to the sigmoid function g(z) (e.g. $\theta^T x$) need not be linear, and could be a function that describes a circle (e.g. $z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2$) or any shape to fit our data.

4 Cost Function

For the linear regression model, the cost function we defined was the sum of the squared errors of the model. In theory we could use the same definition for logistic regression, but the problem is that when we plug $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ into a cost function defined this way, the resulting cost function is non-convex.
This means the cost function has many local minima, which will interfere with gradient descent's search for the global minimum.

[Figure: a non-convex cost function with many local minima]

We therefore redefine the cost function of logistic regression as:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$$
$$\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1$$
$$\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0$$

The relationship between hθ(x) and Cost(hθ(x), y) is shown below:

[Figure: Cost(hθ(x), y) as a function of hθ(x), for y = 1 and for y = 0]

The Cost(hθ(x), y) function constructed this way has the following properties: when the actual y = 1 and hθ(x) is also 1, the error is 0, while if y = 1 but hθ(x) is not 1, the error grows as hθ(x) gets smaller; when the actual y = 0 and hθ(x) is also 0, the cost is 0, while if y = 0 but hθ(x) is not 0, the error grows as hθ(x) gets larger.

We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function.
Instead, our cost function for logistic regression looks like:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$$
$$\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1$$
$$\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0$$

The more our hypothesis is off from y, the larger the cost function output. If our hypothesis is equal to y, then our cost is 0:

$$\mathrm{Cost}(h_\theta(x), y) = 0 \;\text{ if }\; h_\theta(x) = y$$
$$\mathrm{Cost}(h_\theta(x), y) \to \infty \;\text{ if }\; y = 0 \text{ and } h_\theta(x) \to 1$$
$$\mathrm{Cost}(h_\theta(x), y) \to \infty \;\text{ if }\; y = 1 \text{ and } h_\theta(x) \to 0$$

If our correct answer ‘y’ is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.
If our correct answer ‘y’ is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity.
Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.
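
As an illustration of this behaviour, here is a minimal Octave sketch of the piecewise cost for a single training example (saved as cost_one.m; the function name is just illustrative):

% Cost of one example, following the two cases above
function c = cost_one(h, y)      % h = h_theta(x), y in {0, 1}
  if y == 1
    c = -log(h);                 % cost when the true label is 1
  else
    c = -log(1 - h);             % cost when the true label is 0
  end
end

Calling it reproduces the behaviour described above: cost_one(0.99, 1) is close to 0, cost_one(0.01, 1) is large (about 4.6), and cost_one(0.01, 0) is again close to 0.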


5 Simplified Cost Function and Gradient Descent

We can compress the two conditional cases of Cost(hθ(x), y) into a single expression:

$$\mathrm{Cost}(h_\theta(x), y) = -y\,\log(h_\theta(x)) - (1 - y)\,\log(1 - h_\theta(x))$$

Notice that when y is equal to 1, then the second term ((1−y)log(1−hθ(x))) will be zero and will not affect the result. If y is equal to 0, then the first term (−ylog(hθ(x))) will be zero and will not affect the result.
We can fully write out our entire cost function as follows; substituting back into J(θ) gives:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

A vectorized implementation is:
$$J(\theta) = -\frac{1}{m}\left( \log\left(g(X\theta)\right)^{T} y + \log\left(1 - g(X\theta)\right)^{T} (1 - y) \right)$$
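
A minimal Octave sketch of this vectorized cost on a tiny made-up dataset (X, y, and theta below are arbitrary illustrative values):

g = @(z) 1 ./ (1 + exp(-z));     % sigmoid

X = [1 1; 1 2; 1 3; 1 4];        % first column is the intercept term
y = [0; 0; 1; 1];
theta = [-2.5; 1];

m = length(y);
h = g(X * theta);                                     % m x 1 vector of h_theta(x^(i))
J = (-1 / m) * (log(h)' * y + log(1 - h)' * (1 - y))  % the vectorized cost above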


5.1 Gradient Descent

Remember that the general form of gradient descent is:

$$\text{Repeat} \;\{\; \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \;\}$$

Having obtained this cost function, we can now use the gradient descent algorithm to find the parameters that minimize it.
We can work out the derivative part using calculus to get:

$$\text{Repeat} \;\{\; \theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \;\}$$

Note: although this gradient descent rule looks identical on the surface to the one for linear regression, here $h_\theta(x) = g(\theta^T x)$ differs from the linear regression hypothesis, so the two algorithms are actually not the same. Also, feature scaling is still very much worthwhile before running gradient descent.
Notice that this algorithm is identical to the one we used in linear regression. We still have to simultaneously update all values in theta.
A vectorized implementation is:

$$\theta := \theta - \frac{\alpha}{m} X^{T}\left( g(X\theta) - \vec{y} \right)$$
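
A minimal Octave sketch of this vectorized update, reusing the same illustrative data (the learning rate and iteration count are arbitrary choices):

g = @(z) 1 ./ (1 + exp(-z));

X = [1 1; 1 2; 1 3; 1 4];          % intercept term in the first column
y = [0; 0; 1; 1];
alpha = 0.1;                        % learning rate (assumed)
num_iters = 1000;

m = length(y);
theta = zeros(size(X, 2), 1);       % start from theta = 0

for iter = 1:num_iters
  theta = theta - (alpha / m) * X' * (g(X * theta) - y);   % simultaneous update of all theta_j
end
theta                               % learned parameters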



5.2 Partial Derivative of J(θ)

First we calculate the derivative of the sigmoid function (it will be useful when finding the partial derivative of J(θ)):

$$
\begin{aligned}
\sigma(x)' &= \left(\frac{1}{1+e^{-x}}\right)'
= \frac{-(1+e^{-x})'}{(1+e^{-x})^2}
= \frac{-1' - (e^{-x})'}{(1+e^{-x})^2}
= \frac{0 - (-x)'(e^{-x})}{(1+e^{-x})^2} \\
&= \frac{-(-1)(e^{-x})}{(1+e^{-x})^2}
= \frac{e^{-x}}{(1+e^{-x})^2}
= \left(\frac{1}{1+e^{-x}}\right)\left(\frac{e^{-x}}{1+e^{-x}}\right) \\
&= \sigma(x)\left(\frac{+1 - 1 + e^{-x}}{1+e^{-x}}\right)
= \sigma(x)\left(\frac{1+e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}}\right)
= \sigma(x)\left(1 - \sigma(x)\right)
\end{aligned}
$$
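
A quick numerical spot-check of this identity in Octave, comparing a centered finite difference against sigma(x)(1 - sigma(x)) at an arbitrary point:

sigma = @(x) 1 ./ (1 + exp(-x));

x = 0.7;                             % arbitrary test point
step = 1e-6;
numeric  = (sigma(x + step) - sigma(x - step)) / (2 * step);   % finite-difference slope
analytic = sigma(x) * (1 - sigma(x));                          % sigma(x)(1 - sigma(x))
abs(numeric - analytic)                                        % should be close to 0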

Now we are ready to work out the resulting partial derivative:

$$
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j} \, \frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] \\
&= -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} \frac{\partial}{\partial \theta_j}\log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \frac{\partial}{\partial \theta_j}\log\left(1 - h_\theta(x^{(i)})\right) \right] \\
&= -\frac{1}{m}\sum_{i=1}^m \left[ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})}{h_\theta(x^{(i)})} + \frac{\left(1 - y^{(i)}\right) \frac{\partial}{\partial \theta_j}\left(1 - h_\theta(x^{(i)})\right)}{1 - h_\theta(x^{(i)})} \right] \\
&= -\frac{1}{m}\sum_{i=1}^m \left[ \frac{y^{(i)} \sigma(\theta^T x^{(i)})\left(1 - \sigma(\theta^T x^{(i)})\right) \frac{\partial}{\partial \theta_j}\theta^T x^{(i)}}{h_\theta(x^{(i)})} - \frac{\left(1 - y^{(i)}\right) \sigma(\theta^T x^{(i)})\left(1 - \sigma(\theta^T x^{(i)})\right) \frac{\partial}{\partial \theta_j}\theta^T x^{(i)}}{1 - h_\theta(x^{(i)})} \right] \\
&= -\frac{1}{m}\sum_{i=1}^m \left[ \frac{y^{(i)} h_\theta(x^{(i)})\left(1 - h_\theta(x^{(i)})\right) x_j^{(i)}}{h_\theta(x^{(i)})} - \frac{\left(1 - y^{(i)}\right) h_\theta(x^{(i)})\left(1 - h_\theta(x^{(i)})\right) x_j^{(i)}}{1 - h_\theta(x^{(i)})} \right] \\
&= -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\left(1 - h_\theta(x^{(i)})\right) x_j^{(i)} - \left(1 - y^{(i)}\right) h_\theta(x^{(i)}) x_j^{(i)} \right] \\
&= -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\left(1 - h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right) h_\theta(x^{(i)}) \right] x_j^{(i)} \\
&= -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} - y^{(i)} h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)} h_\theta(x^{(i)}) \right] x_j^{(i)} \\
&= -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} - h_\theta(x^{(i)}) \right] x_j^{(i)} \\
&= \frac{1}{m}\sum_{i=1}^m \left[ h_\theta(x^{(i)}) - y^{(i)} \right] x_j^{(i)}
\end{aligned}
$$
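
The same finite-difference idea can be used to spot-check the final expression against a numerical gradient; this is an illustrative sketch (data and test point are made up), not part of the lecture:

g = @(z) 1 ./ (1 + exp(-z));

X = [1 1; 1 2; 1 3; 1 4];                             % illustrative data
y = [0; 0; 1; 1];
m = length(y);
J = @(t) (-1 / m) * (log(g(X * t))' * y + log(1 - g(X * t))' * (1 - y));

theta = [0.3; -0.2];                                  % arbitrary point at which to check
analytic = (1 / m) * X' * (g(X * theta) - y);         % (1/m) * sum (h_theta(x) - y) * x_j

step = 1e-6;
numeric = zeros(size(theta));
for j = 1:length(theta)
  e = zeros(size(theta)); e(j) = step;
  numeric(j) = (J(theta + e) - J(theta - e)) / (2 * step);
end
max(abs(numeric - analytic))                          % should be tiny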

6 Advanced Optimization

Some alternatives to gradient descent:
Besides gradient descent, there are other algorithms commonly used to minimize the cost function. They are more sophisticated and, in general, do not require manually picking a learning rate and are usually faster than gradient descent. They include:

  • Conjugate gradient
  • BFGS (Broyden–Fletcher–Goldfarb–Shanno)
  • L-BFGS (limited-memory BFGS)

“Conjugate gradient”, “BFGS”, and “L-BFGS” are more sophisticated, faster ways to optimize theta instead of using gradient descent. A. Ng suggests you do not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use them pre-written from libraries. Octave provides them.
We first need to provide a function that computes the following two equations:
$$J(\theta), \qquad \frac{\partial}{\partial \theta_j} J(\theta)$$

We can write a single function that returns both of these:

function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

fminunc is a minimization function available in both MATLAB and Octave. To use it we need to provide the cost function and the derivative with respect to each parameter. Below is example code that calls fminunc in Octave:

Then we can use octave’s “fminunc()” optimization algorithm along with the “optimset()” function that creates an object containing the options we want to send to “fminunc()”. (Note: the value for MaxIter should be an integer, not a character string - errata in the video at 7:30)

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

We give to the function “fminunc()” our cost function, our initial vector of theta values, and the “options” object that we created beforehand.
Note: If you use matlab, be aware that “fminunc()” is not available in the base installation - you also need to install the Optimization Toolbox http://www.mathworks.com/help/optim/ug/fminunc.html
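
Putting the pieces together, here is a self-contained sketch with a concrete costFunction; the data X and y are made up for illustration (the course exercises supply their own data and a regularized variant):

% costFunction.m -- cost and gradient for logistic regression
function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                         % h_theta(x) for all examples
  jVal = (-1 / m) * (y' * log(h) + (1 - y)' * log(1 - h));  % J(theta)
  gradient = (1 / m) * X' * (h - y);                        % partial derivatives
end

Because this version of costFunction also takes the data as arguments, we pass fminunc an anonymous function that fixes X and y:

X = [1 1; 1 2; 1 3; 1 4];          % intercept term in the first column
y = [0; 0; 1; 1];

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);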


7 Multiclass Classification: One-vs-all

Now the label can take more than two values. For example, suppose we want to classify the weather into four types: sunny, cloudy, rainy, or snowy.

[Figure: multiclass training data]

One way to solve this kind of problem is the one-vs-all approach. In the one-vs-all method we reduce the multiclass classification problem to binary classification problems. To achieve this, we mark one of the classes as the positive class (y = 1) and mark all of the other classes as the negative class; call this model $h_\theta^{(1)}(x)$.
Next, we similarly choose another class as the positive class (y = 2), again mark all of the other classes as the negative class, call this model $h_\theta^{(2)}(x)$, and so on.
Finally we obtain a series of models, written as:

$$h_\theta^{(i)}(x) = P(y = i \mid x; \theta), \qquad i \in \{0, 1, \dots, n\}$$

Now we will approach the classification of data into more than two categories. Instead of y = {0,1} we will expand our definition so that y = {0,1…n}.
In this case we divide our problem into n+1 binary classification problems; in each one, we predict the probability that ‘y’ is a member of one of our classes.

$$y \in \{0, 1, \dots, n\}$$
$$h_\theta^{(0)}(x) = P(y = 0 \mid x; \theta)$$
$$h_\theta^{(1)}(x) = P(y = 1 \mid x; \theta)$$
$$\cdots$$
$$h_\theta^{(n)}(x) = P(y = n \mid x; \theta)$$
$$\text{prediction} = \max_i\left( h_\theta^{(i)}(x) \right)$$

[Figure: one-vs-all — one binary classifier per class]

Finally, when we need to make a prediction, we run all of the classifiers and, for each input, choose the class whose classifier outputs the highest probability.
We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.
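
A minimal Octave sketch of the one-vs-all prediction step, assuming the binary classifiers have already been trained and their parameters stacked as the rows of a matrix all_theta (the names and numbers below are illustrative):

g = @(z) 1 ./ (1 + exp(-z));

% 3 classes, 2 features plus the intercept: row i holds the parameters of classifier i
all_theta = [ 1.0  -2.0   0.5;
             -0.5   1.5  -1.0;
             -1.0   0.5   1.5];

x = [1; 0.2; 0.8];                  % one example, intercept term first
probs = g(all_theta * x);           % h^(i)(x) for every class i
[~, prediction] = max(probs)        % class with the highest probability (indices 1..3 here)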

