- [ ] normalized initialization & intermediate normalization layers address vanishing/exploding gradients.
- [ ] stochastic gradient descent
- [ ] stationary points
- [ ] identity mapping
- [ ] vector quantization
Stochastic gradient descent
Stochastic gradient descent (often shortened to SGD), also known as incremental gradient descent, is a stochastic approximation of the gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions. In other words, SGD tries to find minima or maxima by iteration.
Minimize an objective function that has the form of a sum:

$Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w)$,

where each summand function $Q_i$ is typically associated with the $i$-th example in the training set.
When used to minimize the above function, a standard (or “batch”) gradient descent method would perform the following iterations:

$w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla Q_i(w)$,

where $\eta$ is the step size (called the learning rate in machine learning).
To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.
Iterative method
In stochastic (or “on-line”) gradient descent, the true gradient of $Q(w)$ is approximated by the gradient at a single example:

$w := w - \eta \nabla Q_i(w)$.

As the algorithm sweeps through the training set, it performs this update for each training example in turn.
In pseudocode, stochastic gradient descent can be presented as follows:
- Choose an initial vector of parameters $w$ and learning rate $\eta$.
- Repeat until an approximate minimum is obtained:
    - Randomly shuffle examples in the training set.
    - For $i = 1, 2, \ldots, n$, do:
        - $w := w - \eta \nabla Q_i(w)$.
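The pseudocode above can be sketched in Python (a minimal illustration; the quadratic objective, data, learning rate, and epoch count are all arbitrary choices for the sake of the example):

```python
import random

def sgd(grad_qi, w, data, lr=0.05, epochs=200, seed=0):
    """Stochastic gradient descent: one parameter update per training example."""
    random.seed(seed)
    data = list(data)
    for _ in range(epochs):
        random.shuffle(data)                  # randomly shuffle the training set
        for example in data:
            w = w - lr * grad_qi(w, example)  # w := w - eta * grad Q_i(w)
    return w

# Minimize Q(w) = (1/n) * sum_i (w - x_i)^2; the minimizer is the mean of the x_i.
grad = lambda w, x: 2.0 * (w - x)
w_star = sgd(grad, w=0.0, data=[1.0, 2.0, 3.0, 4.0])
```

With a constant learning rate the iterate keeps bouncing around in a small neighborhood of the minimum (here, the mean 2.5); a decreasing learning rate would let it settle.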
A compromise between computing the true gradient and the gradient at a single example is to compute the gradient against more than one training example (called a “mini-batch”) at each step. This can perform significantly better than true stochastic gradient descent because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step is averaged over more training examples.
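The mini-batch variant can be sketched with NumPy (illustrative only; the least-squares objective, data, batch size, and learning rate are all invented for this example):

```python
import numpy as np

def minibatch_sgd(X, y, w, lr=0.1, batch_size=2, epochs=200, seed=0):
    """Mini-batch SGD for least squares: Q_i(w) = (x_i . w - y_i)^2."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                  # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # One vectorized matrix product replaces a loop over examples.
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)
            w = w - lr * grad
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 2.0]])
y = X @ np.array([2.0, -1.0])                     # data generated with weights (2, -1)
w_hat = minibatch_sgd(X, y, w=np.zeros(2))
```

Because the gradient over a batch is a single matrix product, each step touches `batch_size` examples at the cost of roughly one vectorized operation.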
The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates $\eta$ decrease at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex, and otherwise converges almost surely to a local minimum.
Example
Let’s suppose we want to fit a straight line $\hat{y} = w_1 + w_2 x$ to a training set of points $(x_1, y_1), \ldots, (x_n, y_n)$ using least squares. The objective function to be minimized is:

$Q(w) = \sum_{i=1}^{n} Q_i(w) = \sum_{i=1}^{n} (w_1 + w_2 x_i - y_i)^2$.
The last line in the above pseudocode for this specific problem will become:

$\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} := \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} 2(w_1 + w_2 x_i - y_i) \\ 2 x_i (w_1 + w_2 x_i - y_i) \end{bmatrix}$.
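The line-fitting example can be written out directly in Python (the data points are made up for illustration and lie exactly on $y = 1 + 2x$):

```python
import random

def fit_line(points, lr=0.01, epochs=500, seed=0):
    """Fit y_hat = w1 + w2 * x by SGD, updating on one point at a time."""
    random.seed(seed)
    points = list(points)
    w1, w2 = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(points)
        for x, y in points:
            err = w1 + w2 * x - y       # prediction error on example i
            w1 -= lr * 2.0 * err        # dQ_i/dw1 = 2 * (w1 + w2*x - y)
            w2 -= lr * 2.0 * err * x    # dQ_i/dw2 = 2 * x * (w1 + w2*x - y)
    return w1, w2

# The points lie exactly on y = 1 + 2x, so SGD should recover (w1, w2) near (1, 2).
pts = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w1, w2 = fit_line(pts)
```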
Related Work
- [ ] What is a scale in Residual Representations?
- [ ] What is an identity mapping?
Deep Residual Learning for Image Recognition
Abstract
We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously.
1. Introduction
Recent evidence reveals that network depth is of crucial importance, and the leading results on the challenging ImageNet dataset all exploit “very deep” models, with depth of sixteen to thirty. Many other nontrivial visual recognition tasks have also greatly benefited from very deep models.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hampers convergence from the beginning.
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
In this paper, we address the degradation problem by introducing a deep residual learning framework. Formally, denoting the desired underlying mapping as $H(x)$, we let the stacked nonlinear layers fit another mapping $F(x) := H(x) - x$. The original mapping is then recast as $F(x) + x$.
- [x] What is an identity mapping?
A new paper by Kaiming He.
2. Related Work
Residual Learning
In image recognition, shallow representation for image retrieval and classification:
- VLAD is a representation that encodes the residual vectors with respect to a dictionary.
- Fisher Vector can be formulated as a probabilistic version of VLAD.
For vector quantization, encoding residual vectors is shown to be more effective than encoding original vectors.
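The residual-encoding idea can be illustrated with a toy nearest-neighbor quantizer (the two-codeword dictionary below is made up for illustration; real VLAD or product-quantization codebooks are learned from data):

```python
import numpy as np

# A tiny codebook (dictionary) of two codewords.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])

def encode(x):
    """Return (index of nearest codeword, residual vector)."""
    dists = np.linalg.norm(codebook - x, axis=1)
    k = int(np.argmin(dists))
    return k, x - codebook[k]           # residual w.r.t. the dictionary

def decode(k, residual):
    return codebook[k] + residual

x = np.array([9.5, 10.2])
k, r = encode(x)                        # residual is small: [-0.5, 0.2]
x_rec = decode(k, r)
```

The residual `r` has much smaller magnitude than the original vector `x`, which is why encoding residuals tends to be more effective than encoding the raw vectors.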
List some examples suggesting that a good reformulation or preconditioning can simplify the optimization.
Shortcut Connections
In our residual network, the reformulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned.
3. Deep Residual Learning
3.1 Residual Learning
- [ ] zero mapping?
This reformulation is motivated by the counterintuitive phenomena about the degradation problem, i.e. the deeper network has higher training error, and thus test error. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers.
In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping than to learn the function as a new one.
3.2 Identity Mapping by Shortcuts
A building block is shown in the figure, defined as:

$y = F(x, \{W_i\}) + x$.

Here $x$ and $y$ are the input and output vectors of the layers considered. The function $F(x, \{W_i\})$ represents the residual mapping to be learned. The dimensions of $x$ and $F$ must be equal; if not, a linear projection $W_s$ can be performed by the shortcut connection to match the dimensions.
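The shortcut formulation $y = F(x) + x$ can be sketched with NumPy (a minimal two-layer residual function with random placeholder weights, not the paper's trained architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W1 = 0.1 * rng.standard_normal((d, d))   # placeholder weights for F
W2 = 0.1 * rng.standard_normal((d, d))

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x):
    """y = relu(F(x, {W1, W2}) + x), with F(x) = W2 @ relu(W1 @ x)."""
    f = W2 @ relu(W1 @ x)                # residual function F(x)
    return relu(f + x)                   # identity shortcut adds x unchanged

x = np.ones(d)
y = residual_block(x)
```

If the solver drives $F$ toward zero, the block reduces to (a ReLU of) the identity mapping, which is exactly the preconditioning argument of Section 3.1.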
- [ ] How to understand channels?
3.3 Network Architectures
- Plain Network
- Residual Network
3.4 Implementation
- [ ] per-pixel mean subtracted
4. Experiments
4.1 ImageNet Classification
Experiments
Facebook has implemented the deep residual network using Torch; here is the GitHub project fb.resnet.torch.
To run an experiment:
`th main.lua -dataset cifar10 -depth 20 -save DIRECTORY_TO_SAVE`