Notes on NTU professor Hung-yi Lee's machine learning course
Learning Map
Scenarios: supervised, semi-supervised, unsupervised, transfer, reinforcement learning.
Tasks: regression, classification, structured learning (the result is complex and structured, not just a scalar or a class, e.g. image recognition, translation)
Methods: linear regression, deep learning, SVM, decision trees, etc.
Regression
- Define the function set (model, linear, 2nd order, 3rd order…)
- Train the model to get a function
- Gradient descent, partial derivative
- Complex model has smaller training error but larger test error: overfitting
- Train separate functions for different subsets of the data, split by some feature (e.g. by species in the course's example)
- To make the function smoother: Regularization. Be robust to input data variation.
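The recipe above can be sketched end to end: a linear model y = w*x + b trained by gradient descent on squared error, with an L2 penalty on w as the regularization step. The data, learning rate, and regularization strength below are all made up for illustration.

```python
# Minimal sketch of the regression recipe: linear model, squared-error loss
# plus an L2 penalty lam * w^2 that favors smoother (smaller-weight) functions.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]          # roughly y = 2x + 1 with noise (made up)

w, b = 0.0, 0.0
eta, lam = 0.01, 0.1               # learning rate, regularization strength

for _ in range(5000):
    # Partial derivatives of L = sum((y - (w*x + b))^2) + lam * w^2
    grad_w = sum(-2 * (y - (w * x + b)) * x for x, y in zip(xs, ys)) + 2 * lam * w
    grad_b = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys))
    w -= eta * grad_w
    b -= eta * grad_b

print(round(w, 2), round(b, 2))    # close to the underlying slope and intercept
```

Raising lam shrinks w further, trading a little training error for a smoother function that is more robust to input variation.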
Diagnose your model
- Know the difference between bias and variance
- Cannot even fit the training data->underfitting->high bias->need a more complex model
- Can fit the training data but not the testing data->overfitting->high variance->need a simpler model or just more data. Can also apply regularization (penalize large weights).
Gradient Descent
The update of the step size:
Vanilla GD: θ^{t+1} = θ^t − η^t g^t, with a decaying rate such as η^t = η / sqrt(t+1).
Adagrad: θ^{t+1} = θ^t − η / sqrt(Σ_{i=0}^{t} (g^i)^2) · g^t.
Intuition behind Adagrad: consideration of the 2nd derivative.
Consider f(x) = ax^2 + bx + c. Given a point x_0, what is the best step to get to the optimum x*? The optimum point has f'(x*) = 0, so the best step is |x_0 − x*| = |2a x_0 + b| / (2a) = |f'(x_0)| / f''(x_0). The denominator sqrt(Σ_i (g^i)^2) is an approximation of the second derivative f''(x_0), obtained from first derivatives alone, according to the Taylor series.
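A minimal sketch of the Adagrad update, on an arbitrary quadratic f(x) = (x − 3)^2; the objective and hyperparameters are illustrative only.

```python
import math

# Adagrad: divide the learning rate by the root of the sum of all past
# squared gradients, a rough stand-in for the second derivative.
def grad(x):
    return 2 * (x - 3)             # f'(x) for f(x) = (x - 3)^2

x, eta = 0.0, 1.0
sum_sq = 0.0
for _ in range(200):
    g = grad(x)
    sum_sq += g * g
    x -= eta / math.sqrt(sum_sq) * g

print(round(x, 3))                 # approaches the minimum at x = 3
```

Early steps are large while the accumulated gradient is small; as sum_sq grows, the effective step size shrinks automatically.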
Taylor Series
Given a starting point (a, b), we want to find, within a small neighborhood around it, the minimal point to move to. We use the Taylor series to approximate the loss at nearby points: L(θ) ≈ L(a, b) + ∂L/∂θ_1 · (θ_1 − a) + ∂L/∂θ_2 · (θ_2 − b). What we need to do is find the parameters θ that minimize this approximation within a small radius, which means stepping in the direction opposite the gradient; this is exactly gradient descent.
To check: Newton's method (which also keeps the 2nd-order term of the Taylor series).
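As a quick sketch of that Newton step, dividing the first derivative by the second instead of using a fixed learning rate, on an arbitrarily chosen quartic:

```python
# Newton's method in 1-D: x <- x - f'(x) / f''(x).
# The function f(x) = x^4 - 3x^2 + 2x is chosen arbitrarily for illustration.
def f_prime(x):
    return 4 * x**3 - 6 * x + 2

def f_double_prime(x):
    return 12 * x**2 - 6

x = 2.0                            # starting point
for _ in range(20):
    x -= f_prime(x) / f_double_prime(x)

print(round(x, 6))                 # -> 1.0, a local minimum where f'(x) = 0
```

Convergence near the optimum is quadratic, but each step needs the second derivative, which is expensive in high dimensions; Adagrad's accumulated-gradient denominator avoids that cost.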
Stochastic GD
Update the parameters after EACH example rather than after a full pass over the data. Faster (though noisier) convergence.
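A sketch of the per-example update on a linear model; the data and hyperparameters are made up for illustration.

```python
import random

# Stochastic gradient descent: step on one example's gradient at a time.
random.seed(0)
data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8)]   # made-up points

w, b, eta = 0.0, 0.0, 0.01
for _ in range(2000):              # epochs
    random.shuffle(data)
    for x, y in data:
        err = y - (w * x + b)
        w += eta * 2 * err * x     # gradient of this single example's loss
        b += eta * 2 * err

print(round(w, 2), round(b, 2))
```

With a fixed step size the parameters hover around the least-squares solution instead of settling exactly on it; decaying eta over time removes that residual noise.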
Feature Scaling
x_i^r is the i-th component of the r-th sample.
x_i^r ← (x_i^r − m_i) / σ_i, where m_i and σ_i are the mean and standard deviation of component i over all samples, to have 0 mean and 1 variance.
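The standardization above, sketched for a small made-up dataset with two components on very different scales:

```python
import math

# For each component i: subtract the mean m_i over samples, divide by the
# standard deviation sigma_i, giving zero mean and unit variance per feature.
samples = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]   # illustrative data

n, dim = len(samples), len(samples[0])
for i in range(dim):
    m = sum(s[i] for s in samples) / n
    sigma = math.sqrt(sum((s[i] - m) ** 2 for s in samples) / n)
    for s in samples:
        s[i] = (s[i] - m) / sigma

print(samples)
```

After scaling, the loss surface is closer to round, so one learning rate works for all parameters and gradient descent heads more directly toward the minimum.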
Classification
Why not do classification as regression?
The definition of goodness used in regression (squared error) is not suitable for classification: samples that lie far inside the correct region still incur a large cost and perturb the classifier's boundary.
Probability model (generative model)
From training data, guess the distribution of probability.
Parametric method: assume some distribution and fit its parameters by maximum likelihood, i.e. maximize the probability that the observed samples were generated by those parameters. For example, a multivariate Gaussian.
When each class has its own parameters (e.g. its own covariance matrix), the model can overfit. One remedy is to let the classes share the same covariance; the separator then becomes linear.
Naive Bayes Classifier: assumes the dimensions are independent given the class.
Posterior probability and sigmoid:
P(C_1 | x) = P(x | C_1) P(C_1) / (P(x | C_1) P(C_1) + P(x | C_2) P(C_2)) = σ(z) = 1 / (1 + e^{−z}), where z = ln( P(x | C_1) P(C_1) / (P(x | C_2) P(C_2)) ).
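This identity can be checked numerically. A sketch with one-dimensional Gaussian class-conditionals sharing a variance (all means, the variance, and the priors are illustrative):

```python
import math

# Check that P(C1|x) from Bayes' rule equals sigmoid(z) with
# z = ln(P(x|C1)P(C1) / (P(x|C2)P(C2))), for 1-D Gaussians.
def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

mu1, mu2, sigma = 0.0, 4.0, 1.5    # shared variance -> linear boundary
p1, p2 = 0.6, 0.4                  # class priors (illustrative)

x = 1.0
# Direct Bayes rule:
num = gaussian(x, mu1, sigma) * p1
posterior = num / (num + gaussian(x, mu2, sigma) * p2)

# Same value through the sigmoid form:
z = math.log((gaussian(x, mu1, sigma) * p1) / (gaussian(x, mu2, sigma) * p2))
print(round(posterior, 6), round(sigmoid(z), 6))   # the two agree
```

With a shared variance the quadratic terms in z cancel, leaving z linear in x, which is why the decision boundary is linear and why this view motivates logistic regression.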