Machine Learning Notes [work in progress]

Notes from the machine learning course taught by Prof. Hung-yi Lee (李宏毅) at National Taiwan University.

Learning Map

Scenarios: supervised, semi-supervised, unsupervised, transfer, reinforcement learning.

Tasks: regression, classification, structured learning (the output is complex and structured, not just a scalar or a class, e.g., image recognition, translation)

Methods: linear regression, deep learning, SVM, decision trees, etc.

Regression

  1. Define the function set (the model: linear, 2nd order, 3rd order, …)
  2. Train the model to get a function
  3. Gradient descent, partial derivative
  4. A more complex model has smaller training error but may have larger testing error: overfitting
  5. Train separate functions for different subsets of the data, split by some feature (e.g., species)
  6. To make the function smoother, apply regularization; a smoother function is more robust to variation in the input data (see the sketch after this list)
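
A minimal sketch of this recipe in NumPy; the polynomial degrees, the λ values, and the toy data are my own illustrative assumptions, not from the course:

```python
import numpy as np

# Toy 1-D regression data (a stand-in for whatever quantity you are predicting)
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, 20)
y = np.sin(x) + 0.1 * rng.standard_normal(20)

def poly_features(x, degree):
    """Step 1: the function set -- polynomials of a chosen degree."""
    return np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, ...

def fit_ridge(X, y, lam):
    """Steps 2-3 plus step 6: least squares with an L2 penalty (closed form).
    lam = 0 is plain linear regression; larger lam gives a smoother function.
    (For simplicity the bias term is penalized too.)"""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for degree in (1, 3, 9):                     # step 4: richer models fit training data better,
    for lam in (0.0, 1.0):                   # but may generalize worse (overfitting)
        X = poly_features(x, degree)
        w = fit_ridge(X, y, lam)
        print(f"degree={degree}, lambda={lam}: training MSE={np.mean((X @ w - y) ** 2):.3f}")
```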

Diagnose your model

  1. Know the difference between bias and variance
  2. Cannot even fit the training data -> underfitting -> high bias -> need a more complex model
  3. Can fit the training data but not the testing data -> high variance -> need a simpler model or simply more data; regularization (penalizing the weights) also helps (see the sketch after this list)
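
A small sketch of this diagnosis, comparing training and validation error as the model grows more complex; the data and polynomial degrees are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + 0.2 * rng.standard_normal(n)

x_train, y_train = make_data(15)
x_val, y_val = make_data(100)

def mse(w, x, y):
    return np.mean((np.polyval(w, x) - y) ** 2)

for degree in (1, 3, 12):
    w = np.polyfit(x_train, y_train, degree)          # least-squares polynomial fit
    print(f"degree={degree:2d}  train MSE={mse(w, x_train, y_train):.3f}"
          f"  val MSE={mse(w, x_val, y_val):.3f}")
    # high train error          -> high bias (underfitting): try a more complex model
    # low train, high val error -> high variance (overfitting): simpler model, more data,
    #                              or regularization
```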

Gradient Descent

How the step size is updated:
Vanilla GD: $w^{t+1}=w^{t}-\theta^t\Delta^t$, with a decaying learning rate $\theta^t=\theta_0/t$.
Adagrad: $w^{t+1}=w^{t}-\theta_0\Delta^t/\sqrt{\sum_{i=0}^{t}(\Delta^i)^2}$, where $\Delta^i$ is the gradient at step $i$.
Intuition behind Adagrad: it implicitly takes the second derivative into account.
Consider $y=ax^2+bx+c$. Starting from $x_0$, what is the best step to reach the optimum? The optimum $x^*$ satisfies $y'=0$, i.e. $x^*=-b/2a$, so the best step is $|x_0-x^*|=|x_0+b/2a|=|y'(x_0)|/y''(x_0)$: the first derivative divided by the second. The denominator $\sqrt{\sum_i(\Delta^i)^2}$ acts as an approximation of $y''$, in the spirit of the Taylor series.
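
A minimal sketch of the two update rules on a simple quadratic loss; the loss function, learning rates, and step counts are illustrative assumptions:

```python
import numpy as np

def grad(w):
    # Gradient of the toy loss L(w) = w0^2 + 10 * w1^2
    return np.array([2 * w[0], 20 * w[1]])

def vanilla_gd(w, eta0=0.2, steps=50):
    w = np.array(w, dtype=float)
    for t in range(1, steps + 1):
        w -= (eta0 / t) * grad(w)                  # decaying learning rate theta^t = theta_0 / t
    return w

def adagrad(w, eta0=0.5, steps=50, eps=1e-8):
    w = np.array(w, dtype=float)
    sq_sum = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        sq_sum += g ** 2                           # accumulate all past squared gradients
        w -= eta0 * g / (np.sqrt(sq_sum) + eps)    # per-dimension step, roughly |y'| / y''
    return w

print("vanilla GD:", vanilla_gd([3.0, 3.0]))       # both should approach the optimum [0, 0]
print("Adagrad   :", adagrad([3.0, 3.0]))
```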

Taylor Series
Given a starting point $(a,b)$, we want to find, within a small neighborhood around it, the point with the minimal loss to move to. We use the Taylor series to approximate the loss at nearby points; what remains is to find the parameters $\theta$ that minimize that approximation.

To check: Newton's method (which also takes the second-order term of the Taylor series into account).
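
A tiny sketch of the Newton step in one dimension; the example function is an arbitrary assumption:

```python
# Newton's method in 1-D: step = y'(x) / y''(x), i.e. jump to the minimum of the local
# second-order Taylor approximation. For a quadratic it lands on the optimum in one step.

def y_prime(x):
    return 2 * x - 4          # derivative of y(x) = x^2 - 4x + 1

def y_second(x):
    return 2.0                # second derivative (constant for a quadratic)

x = 10.0
for _ in range(3):
    x = x - y_prime(x) / y_second(x)   # Newton update
    print(x)                           # already at the optimum x* = 2 after the first step
```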

Stochastic GD
Update the parameters after EACH example instead of after the full batch; this often converges faster.
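
A minimal sketch of stochastic updates on linear regression; the data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5              # true weights [2, -1], true bias 0.5

w, b, eta = np.zeros(2), 0.0, 0.01
for epoch in range(5):
    for i in rng.permutation(len(X)):            # shuffle, then update on EACH example
        err = X[i] @ w + b - y[i]
        w -= eta * err * X[i]                    # gradient of 0.5 * squared error for one sample
        b -= eta * err
print(w, b)                                      # close to [2, -1] and 0.5
```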

Feature Scaling

$x_i^r$ is the $i$-th component of the $r$-th sample.
$x_i^r \leftarrow (x_i^r-m_i)/\sigma_i$, where $m_i$ and $\sigma_i$ are the mean and standard deviation of the $i$-th component, so that every component has zero mean and unit variance.
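
The same formula in NumPy; the toy matrix is an assumption (rows are samples $r$, columns are components $i$):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])        # rows: samples r, columns: components i

m = X.mean(axis=0)                  # m_i: mean of the i-th component
sigma = X.std(axis=0)               # sigma_i: standard deviation of the i-th component
X_scaled = (X - m) / sigma          # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))        # ~0 for every component
print(X_scaled.std(axis=0))         # 1 for every component
```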

Classification

Why not do classification as regression?
The definition of goodness used in regression is not suitable for classification: points of the same class that lie far from the boundary (and are already "very correct") still incur a large cost, which perturbs the classifier.
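
A small numeric sketch of this effect: fit least squares to labels +1/-1 in one dimension, then add a few far-away (already very correct) points from the +1 class and watch the decision boundary move; the data are made up for illustration:

```python
import numpy as np

def boundary_from_regression(x, y):
    """Fit y ~ w*x + b by least squares; the decision boundary is where w*x + b = 0."""
    A = np.column_stack([x, np.ones_like(x)])
    w, b = np.linalg.lstsq(A, y, rcond=None)[0]
    return -b / w

x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])          # class -1 vs class +1
print(boundary_from_regression(x, y))                     # ~0, a sensible separator

# Add +1-class points that are far away and already correctly classified:
x2 = np.append(x, [20.0, 25.0, 30.0])
y2 = np.append(y, [1.0, 1.0, 1.0])
print(boundary_from_regression(x2, y2))                   # ~1.2: boundary pushed into the +1 region
```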

Probability model (generative model)

From the training data, estimate the underlying probability distributions.
Parametric method: assume a distribution family and fit its parameters by maximum likelihood, i.e., maximize the probability that the observed samples were generated by those parameters. For example, a multivariate Gaussian.

Giving each class its own parameters makes it possible to overfit. One remedy is to let the classes share the same covariance matrix; the decision boundary then becomes linear.
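
A sketch of this approach with two Gaussian classes and a shared covariance; the synthetic data, class means, and sample counts are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0, 0], np.eye(2), 100)      # samples of class C1
X2 = rng.multivariate_normal([3, 3], np.eye(2), 50)       # samples of class C2

# Maximum-likelihood estimates: one mean per class, one shared covariance
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)
cov1 = (X1 - mu1).T @ (X1 - mu1) / n1
cov2 = (X2 - mu2).T @ (X2 - mu2) / n2
cov = (n1 * cov1 + n2 * cov2) / (n1 + n2)                 # shared covariance -> linear boundary
p1, p2 = n1 / (n1 + n2), n2 / (n1 + n2)                   # priors P(C1), P(C2)

def log_joint(x, mu, prior):
    """log P(x|C) + log P(C) with a Gaussian class-conditional and the shared covariance."""
    d = x - mu
    return (-0.5 * d @ np.linalg.solve(cov, d)
            - 0.5 * np.log(np.linalg.det(2 * np.pi * cov))
            + np.log(prior))

x = np.array([1.0, 1.5])
print("C1" if log_joint(x, mu1, p1) > log_joint(x, mu2, p2) else "C2")
```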

Naive Bayes classifier: assume the feature dimensions are independent (given the class)

Posterior probability and sigmoid:
$\sigma(z)=P(C_1|x) = \dfrac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)} = \dfrac{1}{1+\exp(-z)}$, where $z=\ln\dfrac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}$.
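
A quick numeric check of this identity, with made-up likelihood and prior values:

```python
import numpy as np

# Made-up values of P(x|C1), P(x|C2), P(C1), P(C2) for one particular x
px_c1, px_c2 = 0.05, 0.01
p_c1, p_c2 = 0.4, 0.6

posterior = px_c1 * p_c1 / (px_c1 * p_c1 + px_c2 * p_c2)   # Bayes' rule: P(C1|x)
z = np.log((px_c1 * p_c1) / (px_c2 * p_c2))
sigmoid = 1 / (1 + np.exp(-z))                             # sigma(z)

print(posterior, sigmoid)                                  # both ~0.769, as the identity says
```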
