First, credit to an excellent article by a more experienced author:
"The Most Complete Summary and Comparison of Deep Learning Optimization Methods (SGD, Adagrad, Adadelta, Adam, Adamax, Nadam)"
In the assignment2 FullyConnectedNets exercise, optim.py implements the following update rules (cs231n_2018_lecture07):
- SGD
- SGD + Momentum
- RMSprop
- Adam
1. SGD
Formula:

$$w \leftarrow w - \eta \, dw$$

where $\eta$ is the learning rate and $dw$ is the gradient of the loss with respect to $w$.
Drawbacks: 1. the iterates oscillate back and forth along steep directions; 2. they can get stuck at local minima or saddle points (saddle points are very common in high-dimensional parameter spaces).
Code:

```python
import numpy as np  # used by all of the update rules below


def sgd(w, dw, config=None):
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)

    # Step in the direction of the negative gradient.
    w -= config['learning_rate'] * dw
    return w, config
```
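All four update rules share the same `(w, dw, config) -> (next_w, config)` interface, so a training loop can treat the optimizer as a plug-in. A minimal sketch of how such a function is called; the quadratic loss $0.5\,\lVert w\rVert^2$ and its gradient here are toy stand-ins, not part of the assignment:

```python
w = np.array([3.0, -2.0])
config = None                      # populated with defaults on the first call
for step in range(500):
    dw = w                         # gradient of 0.5 * ||w||^2 is w itself
    w, config = sgd(w, dw, config)
print(w)                           # approaches the minimum at the origin
```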
2. SGD + Momentum
Formula:

$$v \leftarrow \rho \, v - \eta \, dw, \qquad w \leftarrow w + v$$

where $\rho$ is the momentum coefficient (0.9 below) and $v$ is the velocity.
Pros and cons: the accumulated velocity damps oscillation and speeds up convergence along directions whose gradient stays consistent, but the built-up momentum can overshoot and carry the iterate past a minimum.
代碼:
def sgd_momentum(w, dw, config=None):
if config is None: config = {}
config.setdefault('learning_rate', 1e-2)
config.setdefault('momentum', 0.9)
v = config.get('velocity', np.zeros_like(w))
next_w = None
v = config['momentum'] * v - config['learning_rate'] * dw
next_w = w + v
config['velocity'] = v
return next_w, config
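Unrolling the recurrence (with $v_0 = 0$) shows why this damps oscillation: the velocity is an exponentially weighted sum of all past gradients, so components that keep flipping sign largely cancel, while components that point the same way accumulate:

$$v_t = -\eta \sum_{k=0}^{t-1} \rho^{k} \, dw_{t-k}$$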
3. RMSprop
Formula:

$$\text{cache} \leftarrow \gamma \, \text{cache} + (1 - \gamma) \, dw^2, \qquad w \leftarrow w - \frac{\eta \, dw}{\sqrt{\text{cache}} + \epsilon}$$

where $\gamma$ is the decay rate and $\epsilon$ is a small constant that avoids division by zero.
Advantage: the step size adapts per parameter; where dw has been small, cache stays small, so the effective step $\eta / (\sqrt{\text{cache}} + \epsilon)$ is larger, which speeds up progress along flat directions.
Drawback: it can be slow at the start, because the gradients (and therefore the initial steps) are small.
Code:
```python
def rmsprop(w, dw, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the
      squared gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(w))

    cache = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw ** 2
    next_w = w - config['learning_rate'] * dw / (np.sqrt(cache) + config['epsilon'])

    config['cache'] = cache
    return next_w, config
```
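A quick numeric check of the per-parameter scaling (toy numbers, not from the assignment): two gradient components that differ by four orders of magnitude end up taking roughly equal steps, because each is divided by the square root of its own cache entry.

```python
w = np.zeros(2)
dw = np.array([100.0, 0.01])      # steep direction vs. flat direction
next_w, config = rmsprop(w, dw)
print(next_w)                     # approximately [-0.1, -0.1]
```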
4. Adam (the most commonly used)
Formula:

$$m \leftarrow \beta_1 m + (1-\beta_1)\, dw, \qquad v \leftarrow \beta_2 v + (1-\beta_2)\, dw^2$$

$$\hat{m} = \frac{m}{1-\beta_1^{\,t}}, \qquad \hat{v} = \frac{v}{1-\beta_2^{\,t}}, \qquad w \leftarrow w - \frac{\eta \, \hat{m}}{\sqrt{\hat{v}} + \epsilon}$$
Pros and cons: as the iteration count grows and the gradients shrink near convergence, m becomes small and the effective step shrinks with it, so the iterates settle in toward the optimum.
"""
Uses the Adam update rule, which incorporates moving averages of both the
gradient and its square and a bias correction term.
config format:
- learning_rate: Scalar learning rate.
- beta1: Decay rate for moving average of first moment of gradient.
- beta2: Decay rate for moving average of second moment of gradient.
- epsilon: Small scalar used for smoothing to avoid dividing by zero.
- m: Moving average of gradient.
- v: Moving average of squared gradient.
- t: Iteration number.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-3)
config.setdefault('beta1', 0.9)
config.setdefault('beta2', 0.999)
config.setdefault('epsilon', 1e-8)
config.setdefault('m', np.zeros_like(w))
config.setdefault('v', np.zeros_like(w))
config.setdefault('t', 0)
next_w = None
config['t'] += 1
m = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
mt = m / (1 - config['beta1'] ** config['t'])
v = config['beta2'] * config['v'] + (1 - config['beta2']) * dw ** 2
vt = v / (1 - config['beta2'] ** config['t'])
next_w = w - config['learning_rate'] * mt / (np.sqrt(vt) + config['epsilon'])
config['m'] = m
config['v'] = v
return next_w, config
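Since all four functions share one interface, they are easy to compare side by side. A throwaway sketch on the same toy quadratic loss $0.5\,\lVert w\rVert^2$ as above (illustrative only: each rule runs with its default hyperparameters, and real rankings depend on the problem and on tuning):

```python
for update in (sgd, sgd_momentum, rmsprop, adam):
    w, config = np.array([3.0, -2.0]), None
    for _ in range(200):
        w, config = update(w, w, config)    # dw = w for this loss
    print(update.__name__, np.linalg.norm(w))  # remaining distance to optimum
```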