Introduction to Machine Learning
Linear Regression with One Variable
Linear Algebra Review
1 Introduction
1.1 What is Machine Learning?
Machine learning: a computer program learns from experience E to perform task T, with its performance measured by P.
Machine learning is divided into supervised learning and unsupervised learning.
Example: playing checkers.
T = the task of playing checkers.
E = the experience of playing many games of checkers.
P = the probability that the program will win the next game.
1.2 Supervised learning
Supervised learning: the training set comes with labels, i.e., the correct output for each example is known in advance.
Supervised learning is divided into regression and classification.
Regression:
We are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function.
Classification:
We are instead trying to predict results in a discrete output, meaning that we are trying to map input variables into discrete categories (e.g., yes or no).
Example:
Regression - Given a picture of a person, we have to predict their age on the basis of the given picture.
Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
1.3 Unsupervised learning
Unsupervised learning: the training set comes without labels, i.e., the correct outputs are unknown.
Unsupervised learning is divided into clustering and non-clustering.
Octave: lets you focus on the algorithm itself, with very little implementation code and fast computation. If an approach proves workable, port it to a higher-level language such as Python, Java, or C++ for greater efficiency.
2 Linear Regression with One Variable
2.1 Model and Cost Function
2.1.1 Model Representation
Notation:
input variables: $x^{(i)}$
output variables: $y^{(i)}$
a training example: $(x^{(i)}, y^{(i)})$
number of training examples: $m$
a training set: $(x^{(i)}, y^{(i)});\ i = 1, \ldots, m$
the space of input values: $X$
(the set of values $x^{(i)}$ ranges over)
the space of output values: $Y$
Example:
input variable: living area (one)
output variable: price (one)
$X = Y = \mathbb{R}$
hypothesis: $h$
Our goal: given a training set, learn a function $h: X \to Y$ so that $h(x)$ is a "good" predictor for the corresponding value of $y$.
2.1.2 Cost Function
Hypothesis:
$h_\theta(x) = \theta_0 + \theta_1 x$
($h_\theta(x)$ is abbreviated as $h(x)$)
Parameters:
$\theta_0, \theta_1$
Cost Function:
Squared error function / mean squared error:
$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^{2}=\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i-y_i)^{2}$
(The factor $\frac{1}{2}$ is for convenience when differentiating: since $(x^{2})'=2x$, it cancels the 2.)
Goal:
$\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
Idea:
choose $\theta_0, \theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$
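To make the cost function concrete, here is a minimal Python sketch (NumPy assumed, with a made-up toy dataset) that evaluates $J(\theta_0,\theta_1)$:

import numpy as np

def cost(theta0, theta1, x, y):
    # J = (1 / 2m) * sum over i of (h_theta(x_i) - y_i)^2
    m = len(x)
    predictions = theta0 + theta1 * x      # h_theta applied to every example at once
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy training set (illustrative values only)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(cost(0.0, 1.0, x, y))   # 0.0: the line y = x fits these points exactly
print(cost(0.0, 0.5, x, y))   # ~0.583: a worse fit yields a larger cost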
2.1.3 Cost Function - Intuition 1
To visualize the cost function $J$ more easily, set $\theta_0 = 0$.
Simplified:
Hypothesis:
$h_\theta(x) = \theta_1 x$
(a straight line through the origin)
Parameters:
$\theta_1$
Cost Function:
$J(\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^{2}=\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i-y_i)^{2}$
Goal:
$\min_{\theta_1} J(\theta_1)$
For a particular training set, $J(\theta_1)$ plots as a bow-shaped (convex) curve.
2.1.4 Cost Function - Intuition 2
Hypothesis:
$h_\theta(x) = \theta_0 + \theta_1 x$
Parameters:
$\theta_0, \theta_1$
Cost Function:
$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^{2}=\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i-y_i)^{2}$
Goal:
$\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
For a particular training set, the 3-D surface plot of $J(\theta_0,\theta_1)$ is a bowl-shaped surface.
The corresponding contour plot of $J(\theta_0,\theta_1)$ shows level curves: every point on the same contour line has the same value of $J(\theta_0,\theta_1)$.
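Such a contour plot can be reproduced with a short matplotlib sketch (toy data assumed):

import numpy as np
import matplotlib.pyplot as plt

# Toy training set (illustrative values only)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.5, 3.5])
m = len(x)

# Evaluate J(theta0, theta1) over a grid of parameter values
theta0_vals = np.linspace(-2, 4, 100)
theta1_vals = np.linspace(-1, 3, 100)
T0, T1 = np.meshgrid(theta0_vals, theta1_vals)
J = np.zeros_like(T0)
for i in range(T0.shape[0]):
    for j in range(T0.shape[1]):
        pred = T0[i, j] + T1[i, j] * x
        J[i, j] = np.sum((pred - y) ** 2) / (2 * m)

plt.contour(T0, T1, J, levels=30)   # points on one contour share the same J
plt.xlabel('theta0')
plt.ylabel('theta1')
plt.show()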
2.2 Parameter Learning
2.2.1 Gradient Descent
Have some function $J(\theta_0,\theta_1)$
Want $\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
Outline:
Start with some $\theta_0, \theta_1$
Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0,\theta_1)$ until we hopefully end up at a minimum.
Local optima: depending on the starting point, gradient descent may converge to different local minima.
Gradient descent algorithm
repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1)$   (for $j = 0, 1$)
}
notation:
learning rate: $\alpha$ (controls the step size, i.e., how large an update is applied to $\theta_j$ on each iteration)
Correct: simultaneously update $\theta_0, \theta_1$
repeat until convergence {
    $temp0 := \theta_0 - \alpha \frac{\partial}{\partial\theta_0} J(\theta_0,\theta_1)$
    $temp1 := \theta_1 - \alpha \frac{\partial}{\partial\theta_1} J(\theta_0,\theta_1)$
    $\theta_0 := temp0$
    $\theta_1 := temp1$
}
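A small Python sketch of why the update must be simultaneous; the gradient formulas used here anticipate the derivation in Section 2.2.3:

import numpy as np

def gradient(theta0, theta1, x, y):
    # Partial derivatives of J with respect to theta0 and theta1
    m = len(x)
    err = (theta0 + theta1 * x) - y
    return np.sum(err) / m, np.sum(err * x) / m

# Correct: both gradients are computed from the OLD parameters, then assigned
def step_simultaneous(theta0, theta1, x, y, alpha):
    g0, g1 = gradient(theta0, theta1, x, y)
    return theta0 - alpha * g0, theta1 - alpha * g1

# Incorrect: theta1's gradient sees the already-updated theta0
def step_sequential(theta0, theta1, x, y, alpha):
    g0, _ = gradient(theta0, theta1, x, y)
    theta0 = theta0 - alpha * g0
    _, g1 = gradient(theta0, theta1, x, y)   # uses the new theta0 -- wrong
    return theta0, theta1 - alpha * g1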
2.2.2 Gradient Descent Intuition
On the graph of $J(\theta_1)$, the gradient descent term $\frac{d}{d\theta_1} J(\theta_1)$ is the slope of the tangent line at the current point.
If $\alpha$ is too small, gradient descent is slow; if $\alpha$ is too large, it can overshoot the minimum and may fail to converge, or even diverge.
When $\alpha$ is not too large, gradient descent can converge to a local minimum even with the learning rate $\alpha$ held fixed, because as we approach a local minimum the derivative $\frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1)$ shrinks, so gradient descent automatically takes smaller and smaller steps. There is no need to decrease $\alpha$ over time.
2.2.3 Gradient Descent for Linear Regression
Hypothesis:
$h_\theta(x) = \theta_0 + \theta_1 x$
Cost function:
$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^{2}=\frac{1}{2m}\sum_{i=1}^{m}(\theta_0+\theta_1 x_i-y_i)^{2}$
Gradient descent:
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1)$   (for $j = 0, 1$)
Substituting the cost function into the update rule and working out the partial derivatives:
$j=0:\quad \frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)$
$j=1:\quad \frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)\,x_i$
repeat until convergence {
    // requires a simultaneous update:
    $\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)\,x_i$
}
Batch: each step of gradient descent uses all the training examples (the sum $\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)$ runs over the entire training set).
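Putting the pieces together, a minimal batch gradient descent sketch for one-variable linear regression (the toy data and hyperparameters are illustrative assumptions):

import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, iterations=2000):
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        err = (theta0 + theta1 * x) - y          # uses ALL m examples: "batch"
        temp0 = theta0 - alpha * np.sum(err) / m
        temp1 = theta1 - alpha * np.sum(err * x) / m
        theta0, theta1 = temp0, temp1            # simultaneous update
    return theta0, theta1

# Toy data generated from y = 1 + 2x (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x
print(batch_gradient_descent(x, y))   # converges to approximately (1.0, 2.0)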
3 Linear Algebra Review
3.1 Matrices and Vectors
Matrix: (denoted by an uppercase letter)
A $4\times2$ matrix: # rows = 4, # columns = 2. ($\mathbb{R}^{4\times2}$)
$A_{ij}$: the entry in the $i^{th}$ row, $j^{th}$ column
Vector: an $n\times1$ matrix (denoted by a lowercase letter)
For example, $y=\begin{bmatrix}460\\235\\315\\178\end{bmatrix}$ is a 4-dimensional vector. ($\mathbb{R}^{4}$)
$y_i$ = the $i^{th}$ element
1-indexed (the default): $y=\begin{bmatrix}y_1\\y_2\\y_3\\y_4\end{bmatrix}$
0-indexed: $y=\begin{bmatrix}y_0\\y_1\\y_2\\y_3\end{bmatrix}$
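A short NumPy sketch of these definitions (the numeric values are illustrative; note that NumPy arrays are 0-indexed):

import numpy as np

A = np.array([[1402,  191],
              [1371,  821],
              [ 949, 1437],
              [ 147, 1448]])   # a 4x2 matrix
print(A.shape)    # (4, 2): 4 rows, 2 columns
print(A[0, 1])    # 191 -- NumPy is 0-indexed, so this is A_{12} in 1-indexed notation

y = np.array([460, 235, 315, 178])   # a 4-dimensional vector
print(y[0])       # 460, i.e. y_1 in the 1-indexed convention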
3.2 Addition and Scalar Multiplication
matrix addition: add the two matrices element-wise; they must have the same dimensions
scalar (real number) multiplication: multiply every entry of the matrix by the scalar
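For instance, a quick NumPy check of both operations:

import numpy as np

A = np.array([[1, 0], [2, 5]])
B = np.array([[4, 0.5], [2, 5]])
print(A + B)    # element-wise addition: [[5, 0.5], [4, 10]]
print(3 * A)    # scalar multiplication: [[3, 0], [6, 15]]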
3.3 Matrix Vector Multiplication
Conditions:
$h_\theta(x)=-40+0.25x$
$x=\begin{bmatrix}2104\\1416\\1534\\852\end{bmatrix}$
Result:
$prediction = DataMatrix \times parameters = \begin{bmatrix}1 & 2104\\1 & 1416\\1 & 1534\\1 & 852\end{bmatrix} \times \begin{bmatrix}-40\\0.25\end{bmatrix}$
Not only in Octave but also in higher-level languages, matrix operations take less code than for-loop computations and run more efficiently.
Loop version:
x = [2104, 1416, 1534, 852]
prediction = [0] * 4
for i in range(4):
    prediction[i] = -40 + 0.25 * x[i]
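The equivalent vectorized version (a NumPy sketch) replaces the loop with a single matrix-vector product:

import numpy as np

x = np.array([2104, 1416, 1534, 852])
data_matrix = np.column_stack([np.ones(4), x])   # prepend a column of 1s for theta0
parameters = np.array([-40, 0.25])
prediction = data_matrix @ parameters            # one matrix-vector multiplication
print(prediction)    # [486. 314. 343.5 173.]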
3.4 Matrix Matrix Multiplication
Conditions:
1. $h_\theta(x)=-40+0.25x$
2. $h_\theta(x)=200+0.1x$
3. $h_\theta(x)=-150+0.4x$
$x=\begin{bmatrix}2104\\1416\\1534\\852\end{bmatrix}$
Result:
$PredictionMatrix = DataMatrix \times ParametersMatrix = \begin{bmatrix}1 & 2104\\1 & 1416\\1 & 1534\\1 & 852\end{bmatrix} \times \begin{bmatrix}-40 & 200 & -150\\0.25 & 0.1 & 0.4\end{bmatrix} \approx \begin{bmatrix}486 & 410 & 692\\314 & 342 & 416\\344 & 353 & 464\\173 & 285 & 191\end{bmatrix}$
(entries rounded to integers)
The $j^{th}$ column of ParametersMatrix (i.e., the $j^{th}$ hypothesis) corresponds to the $j^{th}$ column of PredictionMatrix.
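The same computation as a NumPy sketch:

import numpy as np

data_matrix = np.array([[1, 2104],
                        [1, 1416],
                        [1, 1534],
                        [1,  852]])
# one (theta0, theta1) column per hypothesis
parameters_matrix = np.array([[-40,  200, -150],
                              [0.25, 0.1,  0.4]])
predictions = data_matrix @ parameters_matrix
print(predictions)   # column j holds hypothesis j's predictions for all four inputs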
3.5 Matrix Multiplication Properties
Matrix $\times$ Matrix:
not commutative, but associative
Identity Matrix: $I_{n\times n}$; for any compatible matrix $A$, $A \times I = I \times A = A$
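A quick NumPy check of these properties:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [0, 2]])
I = np.eye(2)

print(np.array_equal(A @ B, B @ A))              # False: not commutative
print(np.array_equal((A @ B) @ C, A @ (B @ C)))  # True: associative
print(np.array_equal(A @ I, A))                  # True: I is the identity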
3.6 Inverse and Transpose
Matrix inverse: if $A$ is an $n\times n$ (square) matrix and has an inverse, then $A A^{-1} = A^{-1} A = I$.
singular / degenerate: matrices that do not have an inverse.
Matrix transpose: $B = A^{T}$ means $B_{ij} = A_{ji}$ (rows become columns).
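A closing NumPy sketch for inverse and transpose (the example matrices are illustrative):

import numpy as np

A = np.array([[3.0, 4.0], [2.0, 16.0]])
A_inv = np.linalg.inv(A)
print(np.round(A @ A_inv))   # the identity matrix, up to floating-point error
print(A.T)                   # transpose: rows become columns

S = np.array([[1.0, 2.0], [2.0, 4.0]])   # singular: row 2 = 2 * row 1
# np.linalg.inv(S) would raise numpy.linalg.LinAlgError: Singular matrix
print(np.linalg.det(S))      # ~0: confirms S has no inverse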