Machine Learning Notes [work in progress]

Notes from the machine learning course taught by Prof. Hung-yi Lee (李宏毅) at National Taiwan University.

Learning Map

Scenarios: supervised, semi-supervised, unsupervised, transfer, reinforcement learning.

Tasks: regression, classification, structured learning (the output is complex and structured, not just a scalar or a class, e.g., image recognition, translation)

Methods: linear regression, deep learning, SVM, decision trees, etc.

Regression

  1. Define the function set (the model: linear, 2nd order, 3rd order, …)
  2. Train the model to get a function
  3. Gradient descent, partial derivative
  4. A more complex model has smaller training error but may have larger testing error: overfitting
  5. Train separate functions for different subsets of the data, split by some feature (e.g., species)
  6. To make the function smoother, apply regularization; a smoother function is more robust to variation in the input data (see the sketch after this list)
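
A minimal sketch of this recipe in NumPy; the polynomial degrees, the λ values, and the toy data are my own illustrative assumptions, not from the course:

```python
import numpy as np

# Toy 1-D regression data (a stand-in for whatever quantity you are predicting)
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, 20)
y = np.sin(x) + 0.1 * rng.standard_normal(20)

def poly_features(x, degree):
    """Step 1: the function set -- polynomials of a chosen degree."""
    return np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, ...

def fit_ridge(X, y, lam):
    """Steps 2-3 plus step 6: least squares with an L2 penalty (closed form).
    lam = 0 is plain linear regression; larger lam gives a smoother function.
    (For simplicity the bias term is penalized too.)"""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for degree in (1, 3, 9):                     # step 4: richer models fit training data better,
    for lam in (0.0, 1.0):                   # but may generalize worse (overfitting)
        X = poly_features(x, degree)
        w = fit_ridge(X, y, lam)
        print(f"degree={degree}, lambda={lam}: training MSE={np.mean((X @ w - y) ** 2):.3f}")
```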

Diagnose your model

  1. Know the difference between bias and variance
  2. Cannot even fit the training data -> underfitting -> high bias -> need a more complex model
  3. Can fit the training data but not the testing data -> high variance -> need a simpler model or simply more data; regularization (penalizing the weights) also helps (see the sketch after this list)
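
A small sketch of this diagnosis, comparing training and validation error as the model grows more complex; the data and polynomial degrees are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + 0.2 * rng.standard_normal(n)

x_train, y_train = make_data(15)
x_val, y_val = make_data(100)

def mse(w, x, y):
    return np.mean((np.polyval(w, x) - y) ** 2)

for degree in (1, 3, 12):
    w = np.polyfit(x_train, y_train, degree)          # least-squares polynomial fit
    print(f"degree={degree:2d}  train MSE={mse(w, x_train, y_train):.3f}"
          f"  val MSE={mse(w, x_val, y_val):.3f}")
    # high train error          -> high bias (underfitting): try a more complex model
    # low train, high val error -> high variance (overfitting): simpler model, more data,
    #                              or regularization
```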

Gradient Descent

How the step size is updated:
Vanilla GD: $w^{t+1}=w^{t}-\theta^t\Delta^t$, with a decaying learning rate $\theta^t=\theta_0/t$.
Adagrad: $w^{t+1}=w^{t}-\theta_0\Delta^t/\sqrt{\sum_{i=0}^{t}(\Delta^i)^2}$, where $\Delta^i$ is the gradient at step $i$.
Intuition behind Adagrad: it implicitly takes the second derivative into account.
Consider $y=ax^2+bx+c$. Starting from $x_0$, what is the best step to reach the optimum? The optimum $x^*$ satisfies $y'=0$, i.e. $x^*=-b/2a$, so the best step is $|x_0-x^*|=|x_0+b/2a|=|y'(x_0)|/y''(x_0)$: the first derivative divided by the second. The denominator $\sqrt{\sum_i(\Delta^i)^2}$ acts as an approximation of $y''$, in the spirit of the Taylor series.
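
A minimal sketch of the two update rules on a simple quadratic loss; the loss function, learning rates, and step counts are illustrative assumptions:

```python
import numpy as np

def grad(w):
    # Gradient of the toy loss L(w) = w0^2 + 10 * w1^2
    return np.array([2 * w[0], 20 * w[1]])

def vanilla_gd(w, eta0=0.2, steps=50):
    w = np.array(w, dtype=float)
    for t in range(1, steps + 1):
        w -= (eta0 / t) * grad(w)                  # decaying learning rate theta^t = theta_0 / t
    return w

def adagrad(w, eta0=0.5, steps=50, eps=1e-8):
    w = np.array(w, dtype=float)
    sq_sum = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        sq_sum += g ** 2                           # accumulate all past squared gradients
        w -= eta0 * g / (np.sqrt(sq_sum) + eps)    # per-dimension step, roughly |y'| / y''
    return w

print("vanilla GD:", vanilla_gd([3.0, 3.0]))       # both should approach the optimum [0, 0]
print("Adagrad   :", adagrad([3.0, 3.0]))
```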

Taylor Series
Given a starting point $(a,b)$, we want to find, within a small neighborhood around it, the point with the minimal loss to move to. We use the Taylor series to approximate the loss at nearby points; what remains is to find the parameters $\theta$ that minimize that approximation.

To check: Newton's method (which also takes the second-order term of the Taylor series into account).
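
A tiny sketch of the Newton step in one dimension; the example function is an arbitrary assumption:

```python
# Newton's method in 1-D: step = y'(x) / y''(x), i.e. jump to the minimum of the local
# second-order Taylor approximation. For a quadratic it lands on the optimum in one step.

def y_prime(x):
    return 2 * x - 4          # derivative of y(x) = x^2 - 4x + 1

def y_second(x):
    return 2.0                # second derivative (constant for a quadratic)

x = 10.0
for _ in range(3):
    x = x - y_prime(x) / y_second(x)   # Newton update
    print(x)                           # already at the optimum x* = 2 after the first step
```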

Stochastic GD
Update the parameters after EACH example instead of after the full batch; this often converges faster.
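
A minimal sketch of stochastic updates on linear regression; the data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5              # true weights [2, -1], true bias 0.5

w, b, eta = np.zeros(2), 0.0, 0.01
for epoch in range(5):
    for i in rng.permutation(len(X)):            # shuffle, then update on EACH example
        err = X[i] @ w + b - y[i]
        w -= eta * err * X[i]                    # gradient of 0.5 * squared error for one sample
        b -= eta * err
print(w, b)                                      # close to [2, -1] and 0.5
```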

Feature Scaling

$x_i^r$ is the $i$-th component of the $r$-th sample.
$x_i^r \leftarrow (x_i^r-m_i)/\sigma_i$, where $m_i$ and $\sigma_i$ are the mean and standard deviation of the $i$-th component, so that every component has zero mean and unit variance.
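
The same formula in NumPy; the toy matrix is an assumption (rows are samples $r$, columns are components $i$):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])        # rows: samples r, columns: components i

m = X.mean(axis=0)                  # m_i: mean of the i-th component
sigma = X.std(axis=0)               # sigma_i: standard deviation of the i-th component
X_scaled = (X - m) / sigma          # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))        # ~0 for every component
print(X_scaled.std(axis=0))         # 1 for every component
```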

Classification

Why not do classification as regression?
The definition of goodness used in regression is not suitable for classification: points of the same class that lie far from the boundary (and are already "very correct") still incur a large cost, which perturbs the classifier.
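
A small numeric sketch of this effect: fit least squares to labels +1/-1 in one dimension, then add a few far-away (already very correct) points from the +1 class and watch the decision boundary move; the data are made up for illustration:

```python
import numpy as np

def boundary_from_regression(x, y):
    """Fit y ~ w*x + b by least squares; the decision boundary is where w*x + b = 0."""
    A = np.column_stack([x, np.ones_like(x)])
    w, b = np.linalg.lstsq(A, y, rcond=None)[0]
    return -b / w

x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])          # class -1 vs class +1
print(boundary_from_regression(x, y))                     # ~0, a sensible separator

# Add +1-class points that are far away and already correctly classified:
x2 = np.append(x, [20.0, 25.0, 30.0])
y2 = np.append(y, [1.0, 1.0, 1.0])
print(boundary_from_regression(x2, y2))                   # ~1.2: boundary pushed into the +1 region
```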

Probability model (generative model)

From the training data, estimate the underlying probability distributions.
Parametric method: assume a distribution family and fit its parameters by maximum likelihood, i.e., maximize the probability that the observed samples were generated by those parameters. For example, a multivariate Gaussian.

Giving each class its own parameters makes it possible to overfit. One remedy is to let the classes share the same covariance matrix; the decision boundary then becomes linear.
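
A sketch of this approach with two Gaussian classes and a shared covariance; the synthetic data, class means, and sample counts are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0, 0], np.eye(2), 100)      # samples of class C1
X2 = rng.multivariate_normal([3, 3], np.eye(2), 50)       # samples of class C2

# Maximum-likelihood estimates: one mean per class, one shared covariance
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)
cov1 = (X1 - mu1).T @ (X1 - mu1) / n1
cov2 = (X2 - mu2).T @ (X2 - mu2) / n2
cov = (n1 * cov1 + n2 * cov2) / (n1 + n2)                 # shared covariance -> linear boundary
p1, p2 = n1 / (n1 + n2), n2 / (n1 + n2)                   # priors P(C1), P(C2)

def log_joint(x, mu, prior):
    """log P(x|C) + log P(C) with a Gaussian class-conditional and the shared covariance."""
    d = x - mu
    return (-0.5 * d @ np.linalg.solve(cov, d)
            - 0.5 * np.log(np.linalg.det(2 * np.pi * cov))
            + np.log(prior))

x = np.array([1.0, 1.5])
print("C1" if log_joint(x, mu1, p1) > log_joint(x, mu2, p2) else "C2")
```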

Naive Bayes classifier: assume the feature dimensions are independent (given the class)

Posterior probability and sigmoid:
$\sigma(z)=P(C_1|x) = \dfrac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)} = \dfrac{1}{1+\exp(-z)}$, where $z=\ln\dfrac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}$.
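
A quick numeric check of this identity, with made-up likelihood and prior values:

```python
import numpy as np

# Made-up values of P(x|C1), P(x|C2), P(C1), P(C2) for one particular x
px_c1, px_c2 = 0.05, 0.01
p_c1, p_c2 = 0.4, 0.6

posterior = px_c1 * p_c1 / (px_c1 * p_c1 + px_c2 * p_c2)   # Bayes' rule: P(C1|x)
z = np.log((px_c1 * p_c1) / (px_c2 * p_c2))
sigmoid = 1 / (1 + np.exp(-z))                             # sigma(z)

print(posterior, sigmoid)                                  # both ~0.769, as the identity says
```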
