Deep Learning: Optimization for Training Deep Models (Part 1)

How Learning Differs from Pure Optimization

Optimization algorithms used for training deep models differ from traditional optimization algorithms in several ways.

  • Machine learning usually acts indirectly. In most machine learning scenarios, we care about some performance measure $P$, which is defined with respect to the test set and may also be intractable. We therefore optimize $P$ only indirectly: we reduce a different cost function $J(\theta)$ in the hope that doing so will improve $P$. This is in contrast to pure optimization, where minimizing $J$ is a goal in and of itself.
  • Optimization algorithms for training deep models also typically include some specialization on the specific structure of machine learning objective functions. Typically, the cost function can be written as an average over the training set, such as: Eq.(1)
    $$J(\theta) = \mathbb{E}_{(x,y)\sim\hat{p}_{\text{data}}}\, L(f(x;\theta), y) \tag{1}$$

    where $L$ is the per-example loss function, $f(x;\theta)$ is the predicted output when the input is $x$, and $\hat{p}_{\text{data}}$ is the empirical distribution.
    We would usually prefer to minimize the corresponding objective function in which the expectation is taken over the data-generating distribution $p_{\text{data}}$ rather than just over the finite training set: Eq.(2)
    $$J^{*}(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\, L(f(x;\theta), y) \tag{2}$$

Empirical Risk Minimization

The goal of a machine learning algorithm is to reduce the expected generalization error given by Eq.(2). This quantity is known as the risk. We emphasize here that the expectation is taken over the true underlying distribution $p_{\text{data}}$.
If we knew the true distribution pdata(x,y) , risk minimization would be an optimization task solvable by an optimization algorithm.
However, when we do not know pdata(x,y) but only have a training set of samples, we have a machine learning problem.
The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This means replacing the true distribution $p(x,y)$ with the empirical distribution $\hat{p}(x,y)$ defined by the training set. We now minimize the empirical risk: Eq.(3)

$$\mathbb{E}_{(x,y)\sim\hat{p}_{\text{data}}}\,[L(f(x;\theta), y)] = \frac{1}{m}\sum_{i=1}^{m} L\!\left(f(x^{(i)};\theta),\, y^{(i)}\right) \tag{3}$$

where m is the number of training examples.
The training process based on minimizing this average training error is known as empirical risk minimization.
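
To make Eq.(3) concrete, here is a minimal numpy sketch that evaluates the empirical risk for a fixed parameter vector and compares it with an estimate of the true risk of Eq.(2) obtained from a large fresh sample. The synthetic data-generating distribution, the linear model, and the squared-error loss are illustrative assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n):
    """Draw (x, y) pairs from a synthetic data-generating distribution p_data."""
    x = rng.normal(size=(n, 3))
    true_w = np.array([1.0, -2.0, 0.5])
    y = x @ true_w + rng.normal(scale=0.1, size=n)
    return x, y

def per_example_loss(theta, x, y):
    """Squared-error loss L(f(x; theta), y) for a linear model f(x; theta) = x . theta."""
    return (x @ theta - y) ** 2

x_train, y_train = sample_data(100)   # the finite training set
theta = rng.normal(size=3)            # some fixed parameter setting

# Empirical risk, Eq.(3): the average per-example loss over the training set.
empirical_risk = per_example_loss(theta, x_train, y_train).mean()

# The true risk, Eq.(2), is an expectation under p_data; here we approximate it
# with a large fresh sample, since p_data is known only through sampling.
x_fresh, y_fresh = sample_data(100_000)
approx_true_risk = per_example_loss(theta, x_fresh, y_fresh).mean()

print(empirical_risk, approx_true_risk)
```
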
  • Rather than optimizing the risk directly, we optimize the empirical risk, and hope that the risk decreases significantly as well. A variety of theoretical results establish conditions under which the true risk can be expected to decrease by various amounts.
  • However, empirical risk minimization is prone to overfitting. Models with high capacity can simply memorize the training set. In many cases, empirical risk minimization is not really feasible.
  • The most effective modern optimization algorithms are based on gradient descent, but many useful loss functions, such as the 0-1 loss, have no useful derivatives (the derivative is either zero or undefined everywhere).

These two problems mean that, in the context of deep learning, we rarely use empirical risk minimization. Instead, we must use a slightly different approach, in which the quantity that we actually optimize is even more different from the quantity that we truly want to optimize.

Surrogate Loss Functions and Early Stopping

Sometimes, the loss function we actually care about (say classification error) is not one that can be optimized efficiently. For example, exactly minimizing expected 0-1 loss is typically intractable (exponential in the input dimension), even for a linear classifier (Marcotte and Savard, 1992).
In such situations, one typically optimizes a surrogate loss function instead, which acts as a proxy but has advantages. For example, the negative log-likelihood of the correct class is typically used as a surrogate for the 0-1 loss.
The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input, and if the model can do that well, then it can pick the classes that yield the least classification error in expectation.
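
As a concrete illustration, the following numpy sketch compares the 0-1 loss with the negative log-likelihood surrogate for a small softmax classifier. The toy logits and labels are made up purely for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def zero_one_loss(logits, labels):
    """0-1 loss: piecewise constant in the parameters, so its derivative is 0 or undefined."""
    return (logits.argmax(axis=1) != labels).mean()

def nll_surrogate(logits, labels):
    """Negative log-likelihood of the correct class: a smooth, differentiable surrogate."""
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Toy batch: 4 examples, 3 classes (illustrative numbers).
logits = np.array([[2.0, 0.1, -1.0],
                   [0.2, 1.5,  0.3],
                   [0.0, 0.0,  3.0],
                   [1.0, 0.9,  0.8]])
labels = np.array([0, 1, 2, 1])

print(zero_one_loss(logits, labels))  # 0.25: only the last example is misclassified
print(nll_surrogate(logits, labels))  # positive, and can keep decreasing even after 0-1 loss hits 0
```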

In some cases, a surrogate loss function actually allows us to learn more. For example, when training with the log-likelihood surrogate, the test set 0-1 loss often continues to decrease for a long time after the training set 0-1 loss has reached zero. This is because even when the expected 0-1 loss is zero, one can improve the robustness of the classifier by pushing the classes further apart from each other, obtaining a more confident and reliable classifier. In this way the surrogate extracts more information from the training data than would have been possible by simply minimizing the average 0-1 loss on the training set.

A very important difference between optimization in general and optimization as we use it for training algorithms is that training algorithms do not usually halt at a local minimum. Instead, training typically halts when an early stopping criterion is satisfied; this criterion is usually based on the true underlying loss function (such as the 0-1 loss measured on a validation set) and is designed to stop training when overfitting begins to occur. Training therefore often halts while the surrogate loss function still has large derivatives, which is very different from the pure optimization setting, where an optimization algorithm is considered to have converged when the gradient becomes very small.
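
Below is a minimal sketch of this kind of halting rule: gradient descent on a logistic (negative log-likelihood) surrogate that stops once the validation 0-1 error has not improved for a fixed number of steps, regardless of how large the surrogate's gradient still is. The synthetic task, learning rate, and patience value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification task (illustrative assumption).
w_true = np.array([2.0, -3.0])
x_train = rng.normal(size=(200, 2)); y_train = (x_train @ w_true + rng.normal(size=200) > 0).astype(float)
x_val   = rng.normal(size=(200, 2)); y_val   = (x_val   @ w_true + rng.normal(size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def val_error(w):
    """The loss we actually care about: 0-1 error on held-out data."""
    return ((sigmoid(x_val @ w) > 0.5) != y_val).mean()

w = np.zeros(2)
best_w, best_err, patience, bad_steps = w.copy(), val_error(w), 20, 0
for step in range(10_000):
    # Gradient step on the surrogate (logistic negative log-likelihood), not on the 0-1 loss.
    p = sigmoid(x_train @ w)
    grad = x_train.T @ (p - y_train) / len(y_train)
    w -= 0.1 * grad
    err = val_error(w)
    if err < best_err:
        best_w, best_err, bad_steps = w.copy(), err, 0
    else:
        bad_steps += 1
    if bad_steps >= patience:   # halt on the early stopping criterion,
        break                   # even if the surrogate's gradient is still large
print(step, best_err, np.linalg.norm(grad))
```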

Batch and Minibatch Algorithms

One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective function usually decomposes as a sum over the training examples. Optimization algorithms for machine learning typically compute each update to the parameters based on an expected value of the cost function estimated using only a subset of the terms of the full cost function.
For example, maximum likelihood estimation problems, when viewed in log space, decompose into a sum over each example:

$$\theta_{\text{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}\!\left(x^{(i)}, y^{(i)}; \theta\right)$$

Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution defined by the training set:
$$J(\theta) = \mathbb{E}_{(x,y)\sim\hat{p}_{\text{data}}} \log p_{\text{model}}(x, y; \theta)$$

Most of the properties of the objective function J used by most of our optimization algorithms are also expectations over the training set. For example, the most commonly used property is the gradient:
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{(x,y)\sim\hat{p}_{\text{data}}} \nabla_{\theta} \log p_{\text{model}}(x, y; \theta)$$

Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset. In practice, we can compute these expectations by randomly sampling a small number of examples from the dataset, then taking the average over only those examples.
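
For instance, a minibatch estimate of the gradient expectation above might look like the following numpy sketch, which compares the exact (full training set) gradient with an estimate from 256 randomly sampled examples. The linear model and squared-error loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set and linear model (illustrative assumptions).
x = rng.normal(size=(50_000, 10))
y = x @ rng.normal(size=10) + 0.1 * rng.normal(size=50_000)
theta = np.zeros(10)

def grad_squared_error(theta, x, y):
    """Gradient of the mean squared-error loss, i.e. an average of per-example gradients."""
    return 2.0 * x.T @ (x @ theta - y) / len(y)

# Exact gradient: an expectation over the full empirical training distribution.
full_grad = grad_squared_error(theta, x, y)

# Minibatch estimate: the same expectation approximated from a random subsample.
idx = rng.choice(len(y), size=256, replace=False)
minibatch_grad = grad_squared_error(theta, x[idx], y[idx])

# The estimate is close to the exact gradient at a small fraction of the cost.
print(np.linalg.norm(minibatch_grad - full_grad) / np.linalg.norm(full_grad))
```
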
  • Most optimization algorithms converge much faster (in terms of total computation, not in terms of number of updates) if they are allowed to rapidly compute approximate estimates of the gradient rather than slowly computing the exact gradient.
  • Another consideration motivating statistical estimation of the gradient from a small number of samples is redundancy in the training set. In the worst case, all m samples in the training set could be identical copies of each other. A sampling-based estimate of the gradient could compute the correct gradient with a single sample, using m times less computation than the naive approach. In practice, we are unlikely to truly encounter this worst-case situation, but we may find large numbers of examples that all make very similar contributions to the gradient.
  • Optimization algorithms that use the entire training set are called batch or deterministic gradient methods, because they process all of the training examples simultaneously in a large batch.
    Somewhat confusingly, the term “batch gradient descent” implies the use of the full training set, while using the word “batch” to describe a group of examples (as in “batch size”) usually refers to a minibatch.
  • Optimization algorithms that use only a single example at a time are sometimes called stochastic or sometimes online methods. (The term online is usually reserved for the case where the examples are drawn from a stream of continually created examples rather than from a fixed-size training set over which several passes are made.)
  • Most algorithms used for deep learning fall somewhere in between, using more than one but fewer than all of the training examples. These were traditionally called minibatch or minibatch stochastic methods, and it is now common simply to call them stochastic methods.

The canonical example of a stochastic method is stochastic gradient descent (SGD). Minibatch sizes are generally driven by the following factors:

  • Larger batches provide a more accurate estimate of the gradient, but with less than linear returns (see the sketch after this list).
  • Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
  • If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
  • Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
  • Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability due to the high variance in the estimate of the gradient. The total runtime can be very high due to the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
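
The less-than-linear returns from larger batches can be seen numerically: the standard error of the gradient estimate falls roughly like $1/\sqrt{n}$ for a batch of $n$ examples, so a 100-times larger batch buys only about a 10-times more accurate estimate. A minimal numpy sketch, reusing the same illustrative linear-model setup as above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-example gradient contributions for a fixed theta (illustrative linear model).
x = rng.normal(size=(100_000, 10))
y = x @ rng.normal(size=10) + 0.1 * rng.normal(size=100_000)
theta = np.zeros(10)
per_example_grads = 2.0 * x * (x @ theta - y)[:, None]   # one gradient per example
full_grad = per_example_grads.mean(axis=0)

for batch_size in (1, 16, 64, 256, 1024):
    errors = []
    for _ in range(200):   # repeat to estimate the typical error of the estimator
        idx = rng.choice(len(y), size=batch_size, replace=False)
        errors.append(np.linalg.norm(per_example_grads[idx].mean(axis=0) - full_grad))
    # The error falls roughly like 1/sqrt(batch_size): sub-linear returns from bigger batches.
    print(batch_size, np.mean(errors))
```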

An interesting motivation for minibatch stochastic gradient descent is that it follows the gradient of the true generalization error (Eq.(2)) so long as no examples are repeated. Most implementations of minibatch stochastic gradient descent shuffle the dataset once and then pass through it multiple times. On the first pass, each minibatch is used to compute an unbiased estimate of the true generalization error. On the second pass, the estimate becomes biased because it is formed by re-sampling values that have already been used, rather than obtaining new fair samples from the data-generating distribution.
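
A minimal sketch of this common implementation, shuffling once and then making several passes over the data in minibatches (the linear model, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data and linear model, as in the earlier sketches (illustrative assumptions).
x = rng.normal(size=(10_000, 10))
y = x @ rng.normal(size=10) + 0.1 * rng.normal(size=10_000)
theta = np.zeros(10)
batch_size, learning_rate = 128, 0.01

# Shuffle once up front, as most implementations do.
perm = rng.permutation(len(y))
x, y = x[perm], y[perm]

for epoch in range(5):                       # several passes over the fixed training set
    for start in range(0, len(y), batch_size):
        xb, yb = x[start:start + batch_size], y[start:start + batch_size]
        grad = 2.0 * xb.T @ (xb @ theta - yb) / len(yb)   # minibatch gradient estimate
        theta -= learning_rate * grad
    # Only on the first pass are the minibatch gradients unbiased estimates of the
    # gradient of the true generalization error; later passes reuse the same samples.
    print(epoch, ((x @ theta - y) ** 2).mean())
```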
