Introduction to Machine Learning
Linear Regression with One Variable
Linear Algebra Review
1 Introduction
1.1 What is Machine Learning?
Machine learning: a computer program learns from experience E to perform task T, with its performance measured by P.
Machine learning is divided into supervised learning and unsupervised learning.
Example: playing checkers.
T = the task of playing checkers.
E = the experience of playing many games of checkers.
P = the probability that the program will win the next game.
1.2 Supervised learning
Supervised learning: the training set comes with labels, i.e., the correct output for each example is known in advance.
Supervised learning is divided into regression and classification.
Regression:
We are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function.
Classification:
We are instead trying to predict results in a discrete output, meaning that we are trying to map input variables into discrete categories (e.g., yes or no).
Example:
Regression - Given a picture of a person, we have to predict their age on the basis of the given picture.
Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
1.3 Unsupervised learning
Unsupervised learning: the training set comes without labels, i.e., the correct outputs are unknown.
Unsupervised learning is divided into clustering and non-clustering.
Octave: lets you focus on the algorithm itself, with very little implementation code and fast computation. If an approach proves workable, port it to a higher-level language such as Python, Java, or C++ for greater efficiency.
2 Linear Regression with One Variable
2.1 Model and Cost Function
2.1.1 Model Representation
Notation:
input variables: $x^{(i)}$
output variables: $y^{(i)}$
a training example: $(x^{(i)}, y^{(i)})$
number of training examples: $m$
a training set: $(x^{(i)}, y^{(i)});\ i = 1, \ldots, m$
the space of input values: $X$
(the set of values $x^{(i)}$ ranges over)
the space of output values: $Y$
Example:
input variable: living area (one)
output variable: price (one)
$X = Y = \mathbb{R}$
hypothesis: $h$
Our goal: given a training set, learn a function $h: X \to Y$ so that $h(x)$ is a "good" predictor for the corresponding value of $y$.
2.1.2 Cost Function
Hypothesis:
$h_\theta(x) = \theta_0 + \theta_1 x$
($h_\theta(x)$ is abbreviated as $h(x)$)
Parameters:
$\theta_0, \theta_1$
Cost Function:
Squared error function / mean squared error:
$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^{2}=\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i-y_i)^{2}$
(The factor $\frac{1}{2}$ is for convenience when differentiating: since $(x^{2})'=2x$, it cancels the 2.)
Goal:
$\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
Idea:
choose $\theta_0, \theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$
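To make the cost function concrete, here is a minimal Python sketch (NumPy assumed, with a made-up toy dataset) that evaluates $J(\theta_0,\theta_1)$:

import numpy as np

def cost(theta0, theta1, x, y):
    # J = (1 / 2m) * sum over i of (h_theta(x_i) - y_i)^2
    m = len(x)
    predictions = theta0 + theta1 * x      # h_theta applied to every example at once
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy training set (illustrative values only)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(cost(0.0, 1.0, x, y))   # 0.0: the line y = x fits these points exactly
print(cost(0.0, 0.5, x, y))   # ~0.583: a worse fit yields a larger cost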
2.1.3 Cost Function - Intuition 1
To visualize the cost function $J$ more easily, set $\theta_0 = 0$.
Simplified:
Hypothesis:
$h_\theta(x) = \theta_1 x$
(a straight line through the origin)
Parameters:
$\theta_1$
Cost Function:
$J(\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^{2}=\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i-y_i)^{2}$
Goal:
$\min_{\theta_1} J(\theta_1)$
For a particular training set, $J(\theta_1)$ plots as a bow-shaped (convex) curve.
2.1.4 Cost Function - Intuition 2
Hypothesis:
$h_\theta(x) = \theta_0 + \theta_1 x$
Parameters:
$\theta_0, \theta_1$
Cost Function:
$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^{2}=\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i-y_i)^{2}$
Goal:
$\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
For a particular training set, the 3-D surface plot of $J(\theta_0,\theta_1)$ is a bowl-shaped surface.
The corresponding contour plot of $J(\theta_0,\theta_1)$ shows level curves: every point on the same contour line has the same value of $J(\theta_0,\theta_1)$.
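Such a contour plot can be reproduced with a short matplotlib sketch (toy data assumed):

import numpy as np
import matplotlib.pyplot as plt

# Toy training set (illustrative values only)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.5, 3.5])
m = len(x)

# Evaluate J(theta0, theta1) over a grid of parameter values
theta0_vals = np.linspace(-2, 4, 100)
theta1_vals = np.linspace(-1, 3, 100)
T0, T1 = np.meshgrid(theta0_vals, theta1_vals)
J = np.zeros_like(T0)
for i in range(T0.shape[0]):
    for j in range(T0.shape[1]):
        pred = T0[i, j] + T1[i, j] * x
        J[i, j] = np.sum((pred - y) ** 2) / (2 * m)

plt.contour(T0, T1, J, levels=30)   # points on one contour share the same J
plt.xlabel('theta0')
plt.ylabel('theta1')
plt.show()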
2.2 Parameter Learning
2.2.1 Gradient Descent
Have some function $J(\theta_0,\theta_1)$
Want $\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
Outline:
Start with some $\theta_0, \theta_1$
Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0,\theta_1)$ until we hopefully end up at a minimum.
Local optima: depending on the starting point, gradient descent may converge to different local minima.
Gradient descent algorithm
repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1)$   (for $j = 0, 1$)
}
notation:
learning rate: $\alpha$ (controls the step size, i.e., how large an update is applied to $\theta_j$ on each iteration)
Correct: simultaneously update $\theta_0, \theta_1$
repeat until convergence {
    $temp0 := \theta_0 - \alpha \frac{\partial}{\partial\theta_0} J(\theta_0,\theta_1)$
    $temp1 := \theta_1 - \alpha \frac{\partial}{\partial\theta_1} J(\theta_0,\theta_1)$
    $\theta_0 := temp0$
    $\theta_1 := temp1$
}
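A small Python sketch of why the update must be simultaneous; the gradient formulas used here anticipate the derivation in Section 2.2.3:

import numpy as np

def gradient(theta0, theta1, x, y):
    # Partial derivatives of J with respect to theta0 and theta1
    m = len(x)
    err = (theta0 + theta1 * x) - y
    return np.sum(err) / m, np.sum(err * x) / m

# Correct: both gradients are computed from the OLD parameters, then assigned
def step_simultaneous(theta0, theta1, x, y, alpha):
    g0, g1 = gradient(theta0, theta1, x, y)
    return theta0 - alpha * g0, theta1 - alpha * g1

# Incorrect: theta1's gradient sees the already-updated theta0
def step_sequential(theta0, theta1, x, y, alpha):
    g0, _ = gradient(theta0, theta1, x, y)
    theta0 = theta0 - alpha * g0
    _, g1 = gradient(theta0, theta1, x, y)   # uses the new theta0 -- wrong
    return theta0, theta1 - alpha * g1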
2.2.2 Gradient Descent Intuition
On the graph of $J(\theta_1)$, the gradient descent term $\frac{d}{d\theta_1} J(\theta_1)$ is the slope of the tangent line at the current point.
If $\alpha$ is too small, gradient descent is slow; if $\alpha$ is too large, it can overshoot the minimum and may fail to converge, or even diverge.
When $\alpha$ is not too large, gradient descent can converge to a local minimum even with the learning rate $\alpha$ held fixed, because as we approach a local minimum the derivative $\frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1)$ shrinks, so gradient descent automatically takes smaller and smaller steps. There is no need to decrease $\alpha$ over time.
2.2.3 Gradient Descent for Linear Regression
Hypothesis:
$h_\theta(x) = \theta_0 + \theta_1 x$
Cost function:
$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^{2}=\frac{1}{2m}\sum_{i=1}^{m}(\theta_0+\theta_1 x_i-y_i)^{2}$
Gradient descent:
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1)$   (for $j = 0, 1$)
Substituting the cost function into the update rule and working out the partial derivatives:
$j=0:\quad \frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)$
$j=1:\quad \frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)\,x_i$
repeat until convergence {
    // requires a simultaneous update:
    $\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)\,x_i$
}
Batch: each step of gradient descent uses all the training examples (the sum $\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)$ runs over the entire training set).
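Putting the pieces together, a minimal batch gradient descent sketch for one-variable linear regression (the toy data and hyperparameters are illustrative assumptions):

import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, iterations=2000):
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        err = (theta0 + theta1 * x) - y          # uses ALL m examples: "batch"
        temp0 = theta0 - alpha * np.sum(err) / m
        temp1 = theta1 - alpha * np.sum(err * x) / m
        theta0, theta1 = temp0, temp1            # simultaneous update
    return theta0, theta1

# Toy data generated from y = 1 + 2x (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x
print(batch_gradient_descent(x, y))   # converges to approximately (1.0, 2.0)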
3 Linear Algebra Review
3.1 Matrices and Vectors
Matrix: (denoted by an uppercase letter)
A $4\times2$ matrix: # rows = 4, # columns = 2. ($\mathbb{R}^{4\times2}$)
$A_{ij}$: the entry in the $i^{th}$ row, $j^{th}$ column
Vector: an $n\times1$ matrix (denoted by a lowercase letter)
For example, $y=\begin{bmatrix}460\\235\\315\\178\end{bmatrix}$ is a 4-dimensional vector. ($\mathbb{R}^{4}$)
$y_i$ = the $i^{th}$ element
1-indexed (the default): $y=\begin{bmatrix}y_1\\y_2\\y_3\\y_4\end{bmatrix}$
0-indexed: $y=\begin{bmatrix}y_0\\y_1\\y_2\\y_3\end{bmatrix}$
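A short NumPy sketch of these definitions (the numeric values are illustrative; note that NumPy arrays are 0-indexed):

import numpy as np

A = np.array([[1402,  191],
              [1371,  821],
              [ 949, 1437],
              [ 147, 1448]])   # a 4x2 matrix
print(A.shape)    # (4, 2): 4 rows, 2 columns
print(A[0, 1])    # 191 -- NumPy is 0-indexed, so this is A_{12} in 1-indexed notation

y = np.array([460, 235, 315, 178])   # a 4-dimensional vector
print(y[0])       # 460, i.e. y_1 in the 1-indexed convention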
3.2 Addition and Scalar Multiplication
matrix addition: add the two matrices element-wise; they must have the same dimensions
scalar (real number) multiplication: multiply every entry of the matrix by the scalar
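For instance, a quick NumPy check of both operations:

import numpy as np

A = np.array([[1, 0], [2, 5]])
B = np.array([[4, 0.5], [2, 5]])
print(A + B)    # element-wise addition: [[5, 0.5], [4, 10]]
print(3 * A)    # scalar multiplication: [[3, 0], [6, 15]]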
3.3 Matrix Vector Multiplication
Conditions:
$h_\theta(x)=-40+0.25x$
$x=\begin{bmatrix}2104\\1416\\1534\\852\end{bmatrix}$
Result:
$prediction = DataMatrix \times parameters = \begin{bmatrix}1 & 2104\\1 & 1416\\1 & 1534\\1 & 852\end{bmatrix} \times \begin{bmatrix}-40\\0.25\end{bmatrix}$
Not only in Octave but also in higher-level languages, matrix operations take less code than for-loop computations and run more efficiently.
Loop version:
x = [2104, 1416, 1534, 852]
prediction = [0] * 4
for i in range(4):
    prediction[i] = -40 + 0.25 * x[i]
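The equivalent vectorized version (a NumPy sketch) replaces the loop with a single matrix-vector product:

import numpy as np

x = np.array([2104, 1416, 1534, 852])
data_matrix = np.column_stack([np.ones(4), x])   # prepend a column of 1s for theta0
parameters = np.array([-40, 0.25])
prediction = data_matrix @ parameters            # one matrix-vector multiplication
print(prediction)    # [486. 314. 343.5 173.]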
3.4 Matrix Matrix Multiplication
Conditions:
1. $h_\theta(x)=-40+0.25x$
2. $h_\theta(x)=200+0.1x$
3. $h_\theta(x)=-150+0.4x$
$x=\begin{bmatrix}2104\\1416\\1534\\852\end{bmatrix}$
Result:
$PredictionMatrix = DataMatrix \times ParametersMatrix = \begin{bmatrix}1 & 2104\\1 & 1416\\1 & 1534\\1 & 852\end{bmatrix} \times \begin{bmatrix}-40 & 200 & -150\\0.25 & 0.1 & 0.4\end{bmatrix} \approx \begin{bmatrix}486 & 410 & 692\\314 & 342 & 416\\344 & 353 & 464\\173 & 285 & 191\end{bmatrix}$
(entries rounded to integers)
The $j^{th}$ column of ParametersMatrix (i.e., the $j^{th}$ hypothesis) corresponds to the $j^{th}$ column of PredictionMatrix.
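The same computation as a NumPy sketch:

import numpy as np

data_matrix = np.array([[1, 2104],
                        [1, 1416],
                        [1, 1534],
                        [1,  852]])
# one (theta0, theta1) column per hypothesis
parameters_matrix = np.array([[-40,  200, -150],
                              [0.25, 0.1,  0.4]])
predictions = data_matrix @ parameters_matrix
print(predictions)   # column j holds hypothesis j's predictions for all four inputs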
3.5 Matrix Multiplication Properties
Matrix $\times$ Matrix:
not commutative, but associative
Identity Matrix: $I_{n\times n}$; for any compatible matrix $A$, $A \times I = I \times A = A$
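A quick NumPy check of these properties:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [0, 2]])
I = np.eye(2)

print(np.array_equal(A @ B, B @ A))              # False: not commutative
print(np.array_equal((A @ B) @ C, A @ (B @ C)))  # True: associative
print(np.array_equal(A @ I, A))                  # True: I is the identity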
3.6 Inverse and Transpose
Matrix inverse: if $A$ is an $n\times n$ (square) matrix and has an inverse, then $A A^{-1} = A^{-1} A = I$.
singular / degenerate: matrices that do not have an inverse.
Matrix transpose: $B = A^{T}$ means $B_{ij} = A_{ji}$ (rows become columns).
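A closing NumPy sketch for inverse and transpose (the example matrices are illustrative):

import numpy as np

A = np.array([[3.0, 4.0], [2.0, 16.0]])
A_inv = np.linalg.inv(A)
print(np.round(A @ A_inv))   # the identity matrix, up to floating-point error
print(A.T)                   # transpose: rows become columns

S = np.array([[1.0, 2.0], [2.0, 4.0]])   # singular: row 2 = 2 * row 1
# np.linalg.inv(S) would raise numpy.linalg.LinAlgError: Singular matrix
print(np.linalg.det(S))      # ~0: confirms S has no inverse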