1. What is Warmup?
Warmup is a learning-rate warm-up technique mentioned in the ResNet paper. Training starts with a smaller learning rate for some number of steps (e.g. 15,000 steps, see Code 1) or epochs (e.g. 5 epochs, see Code 2), and then switches to the preset learning rate for the rest of training.
2. Why use Warmup?
When training starts, the model's weights are randomly initialized, so a large learning rate at this point can make the model unstable (oscillate). Warmup keeps the learning rate small for the first few epochs or steps, letting the model gradually stabilize under this small rate; once the model is relatively stable, training switches to the preset learning rate. This makes convergence faster and the final model better.
Example: when training a 110-layer ResNet on CIFAR-10, the ResNet paper first trains with a learning rate of 0.01 until the training error falls below 80% (roughly 400 steps), and then continues with a learning rate of 0.1.
3. Improving Warmup
The warmup described in Section 2 is constant warmup. Its drawback is that switching abruptly from a very small learning rate to a comparatively large one can cause a sudden jump in training error. In 2018, Facebook therefore proposed gradual warmup to address this: starting from the initial small learning rate, the rate is increased a little at every step until it reaches the preset (larger) learning rate, after which training proceeds at that preset rate.
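The difference between the two schemes can be sketched in a few lines. This is a minimal illustration (not code from the post; the function names and the 0.01 constant-warmup rate are my own choices):

```python
def constant_warmup_lr(step, warmup_steps, init_lr, small_lr=0.01):
    # constant warmup: a fixed small lr, then an abrupt jump to init_lr
    return small_lr if step < warmup_steps else init_lr

def gradual_warmup_lr(step, warmup_steps, init_lr):
    # gradual warmup: lr grows linearly from 0 up to init_lr
    if step < warmup_steps:
        return init_lr * step / warmup_steps
    return init_lr
```

With gradual warmup the learning rate at the last warmup step already equals the preset rate, so there is no discontinuity when the warmup phase ends.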
4. Summary
To summarize: warmup means training first with a small learning rate that is increased a little at every step, until it reaches the preset larger learning rate (at which point warmup is complete); training then continues at the preset rate (note: after warmup finishes, the learning rate is typically decayed). This helps the model converge faster and reach a better result.
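Putting the two phases together, the whole schedule can be written as a single function of the step index. This is a minimal sketch of my own (not code from the post); it uses cosine decay after warmup as one common choice, whereas the post's Code 1 uses a sin/exp-style decay:

```python
import math

def warmup_then_cosine(step, warmup_steps, total_steps, init_lr):
    # phase 1: linear warmup from 0 up to init_lr
    if step < warmup_steps:
        return init_lr * step / warmup_steps
    # phase 2: cosine decay from init_lr down to 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * init_lr * (1.0 + math.cos(math.pi * progress))
```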
Gradual warmup example, Code 1: 15,000 steps
"""
Implements gradual warmup, if train_steps < warmup_steps, the
learning rate will be `train_steps/warmup_steps * init_lr`.
Args:
warmup_steps:warmup步長閾值,即train_steps<warmup_steps,使用預熱學習率,否則使用預設值學習率
train_steps:訓練了的步長數
init_lr:預設置學習率
"""
import numpy as np
warmup_steps = 2500
init_lr = 0.1
# 模擬訓練15000步
max_steps = 15000
for train_steps in range(max_steps):
if warmup_steps and train_steps < warmup_steps:
warmup_percent_done = train_steps / warmup_steps
warmup_learning_rate = init_lr * warmup_percent_done #gradual warmup_lr
learning_rate = warmup_learning_rate
else:
#learning_rate = np.sin(learning_rate) #預熱學習率結束後,學習率呈sin衰減
learning_rate = learning_rate**1.0001 #預熱學習率結束後,學習率呈指數衰減(近似模擬指數衰減)
if (train_steps+1) % 100 == 0:
print("train_steps:%.3f--warmup_steps:%.3f--learning_rate:%.3f" % (
train_steps+1,warmup_steps,learning_rate))
The learning-rate curve produced by the code above (warmup, then sin or exp decay) is shown below:
[Figure: the learning rate warms up to the preset initial value, then decays]
Gradual warmup example, Code 2: 5 epochs
import tensorflow as tf
import numpy as np

callbacks = tf.keras.callbacks
backend = tf.keras.backend

class LearningRateScheduler(callbacks.Callback):
    def __init__(self,
                 schedule,
                 learning_rate=None,
                 warmup=False,
                 steps_per_epoch=None,
                 verbose=0):
        super(LearningRateScheduler, self).__init__()
        self.learning_rate = learning_rate
        self.schedule = schedule
        self.verbose = verbose
        self.warmup_epochs = 5 if warmup else 0
        self.warmup_steps = int(steps_per_epoch) * self.warmup_epochs if warmup else 0
        self.global_batch = 0
        if warmup and learning_rate is None:
            raise ValueError('learning_rate cannot be None if warmup is used.')
        if warmup and steps_per_epoch is None:
            raise ValueError('steps_per_epoch cannot be None if warmup is used.')

    def on_train_batch_begin(self, batch, logs=None):
        self.global_batch += 1
        if self.global_batch < self.warmup_steps:
            if not hasattr(self.model.optimizer, 'lr'):
                raise ValueError('Optimizer must have a "lr" attribute.')
            # linear warmup: scale the preset lr by the fraction of warmup done
            lr = self.learning_rate * self.global_batch / self.warmup_steps
            backend.set_value(self.model.optimizer.lr, lr)
            if self.verbose > 0:
                print('\nBatch %05d: LearningRateScheduler warming up learning '
                      'rate to %s.' % (self.global_batch, lr))

    def on_epoch_begin(self, epoch, logs=None):
        if not hasattr(self.model.optimizer, 'lr'):
            raise ValueError('Optimizer must have a "lr" attribute.')
        lr = float(backend.get_value(self.model.optimizer.lr))
        if epoch >= self.warmup_epochs:
            # warmup is over: hand control of the lr to the schedule function
            try:  # new API
                lr = self.schedule(epoch - self.warmup_epochs, lr)
            except TypeError:  # support for the old API, for backward compatibility
                lr = self.schedule(epoch - self.warmup_epochs)
            if not isinstance(lr, (float, np.float32, np.float64)):
                raise ValueError('The output of the "schedule" function '
                                 'should be float.')
            backend.set_value(self.model.optimizer.lr, lr)
            if self.verbose > 0:
                print('\nEpoch %05d: LearningRateScheduler reducing learning '
                      'rate to %s.' % (epoch + 1, lr))

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        logs['lr'] = backend.get_value(self.model.optimizer.lr)

def step_decay(lr=3e-4, max_epochs=100, warmup=False):
    """
    Step decay.
    :param lr: initial lr
    :param max_epochs: max epochs
    :param warmup: warm up or not
    :return: a schedule function mapping epoch -> current lr
    """
    drop = 0.1
    # reserve the 5 warmup epochs
    max_epochs = max_epochs - 5 if warmup else max_epochs

    def decay(epoch):
        lrate = lr * np.power(drop, np.floor((1 + epoch) / max_epochs))
        return lrate

    return decay
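As a sanity check, the schedule returned by step_decay can be exercised without TensorFlow. The sketch below duplicates its logic so it runs standalone; note that with these defaults the floor((1 + epoch) / max_epochs) term stays at 0 until the very last epoch, so the learning rate only drops once, at epoch max_epochs - 1:

```python
import numpy as np

def step_decay(lr=3e-4, max_epochs=100, warmup=False):
    # same logic as the step_decay above, duplicated to be self-contained
    drop = 0.1
    max_epochs = max_epochs - 5 if warmup else max_epochs
    def decay(epoch):
        return lr * np.power(drop, np.floor((1 + epoch) / max_epochs))
    return decay

decay = step_decay(lr=0.01, max_epochs=100)
# lr stays at 0.01 for epochs 0..98; at epoch 99, floor(100/100) == 1
# and the lr drops by a factor of `drop` to 0.001
```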
# Usage sketch: `args` here stands for parsed command-line arguments
# (e.g. an argparse.Namespace), shown with example values.
args.learning_rate = 0.01
args.num_epochs = 1000
args.lr_warmup = True
steps_per_epoch = 100  # set this to len(train_set) // batch_size
# note: step_decay already subtracts the 5 warmup epochs internally,
# so pass args.num_epochs unchanged rather than subtracting 5 again
lr_decay = step_decay(args.learning_rate, args.num_epochs, warmup=args.lr_warmup)
learning_rate_scheduler = LearningRateScheduler(lr_decay, args.learning_rate,
                                                args.lr_warmup, steps_per_epoch,
                                                verbose=1)
This article was compiled from:
2. https://github.com/luyanger1799/amazing-semantic-segmentation