Neural Network Learning Notes 45: A Summary of Common Learning Rate Decay Methods in Keras
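Updated on May 19, 2020
Added the cosine annealing decay schedule from the paper.
[Figure: learning rate curve of the updated cosine annealing schedule]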
Preface
The learning rate is a very important part of deep learning, so let's study it properly!
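Why adjust the learning rate
In deep learning, adjusting the learning rate is very important.
A large learning rate has the following advantages:
1. It speeds up learning.
2. It helps the model escape local optima.
But it has the following drawbacks:
1. It can keep training from converging.
2. Using only a large learning rate tends to leave the model imprecise.
A small learning rate has the following advantages:
1. It helps the model converge and refine its weights.
2. It improves model accuracy.
But it has the following drawbacks:
1. It cannot escape local optima.
2. It converges slowly.
Large and small learning rates have almost opposite effects, so training performance is maximized only by adjusting the learning rate appropriately during training.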
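To make the trade-off concrete, here is a small sketch that is not part of the original post. It runs plain gradient descent on the toy function f(x) = x^2; the helper `gradient_descent` and the three learning rates are illustrative assumptions. Because the function is convex, it only shows the divergence and slow-convergence effects, not the escape from local optima.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2 * x          # derivative of x**2
        x = x - lr * grad     # standard gradient descent update
    return x

# Too large, reasonable, and too small learning rates (illustrative values)
for lr in (1.1, 0.3, 0.01):
    print(f"lr={lr}: |x| after 20 steps = {abs(gradient_descent(lr)):.4g}")
```

With the largest rate the iterate moves away from the minimum at 0, with the middle rate it gets very close, and with the smallest rate it has barely moved after 20 steps.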
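Summary of decay methods
1. Stepwise decay
In Keras, stepwise decay is usually implemented with the ReduceLROnPlateau callback.
Stepwise decay means the learning rate suddenly drops to 1/2 or 1/10 of its previous value.
With ReduceLROnPlateau you choose a metric to monitor, such as the validation loss or the training loss. Once that metric stops decreasing, the learning rate is cut abruptly to, for example, 1/2 or 1/10 of its previous value.
The main parameters of ReduceLROnPlateau are:
1. factor: the ratio by which the learning rate is multiplied once the monitored metric stops decreasing.
2. patience: the number of epochs without improvement in the monitored metric before the learning rate is reduced.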
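# Import ReduceLROnPlateau
from keras.callbacks import ReduceLROnPlateau
# Define the callback: halve the learning rate after 2 epochs without improvement in val_loss
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, verbose=1)
# Use the callback; monitoring val_loss needs validation data, so a 10% split is assumed here
model.fit(X_train, Y_train, validation_split=0.1, callbacks=[reduce_lr])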
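How factor and patience interact can be sketched without Keras. The function below is a simplified illustration, not the library's implementation: `simulate_plateau` and the sample loss values are made up for this example, and the real callback also supports options such as `min_delta`, `cooldown` and `mode` that are ignored here.

```python
def simulate_plateau(losses, lr=1e-3, factor=0.5, patience=2, min_lr=0.0):
    """Return the learning rate in effect after each epoch of a simplified plateau rule."""
    best = float("inf")
    wait = 0                      # epochs since the last improvement
    history = []
    for loss in losses:
        if loss < best:           # improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:                     # no improvement this epoch
            wait += 1
            if wait >= patience:  # patience exhausted: shrink the learning rate
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical validation losses that stall twice
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.59, 0.60, 0.60]
print(simulate_plateau(val_losses))
```

In this run the rate is halved after the two stalled epochs at 0.61 and 0.62, and halved again after the two stalled epochs at the end.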
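2. Exponential decay
I did not find a Keras callback that implements exponential decay directly, so I wrote one using the Callback class.
Exponential decay means the learning rate keeps falling along an exponential function.
The formula is:
learning_rate = learning_rate_base * decay_rate ^ global_epoch
1. learning_rate is the current learning rate.
2. learning_rate_base is the base learning rate.
3. decay_rate is the decay coefficient.
4. global_epoch is the number of epochs completed so far.
[Figure: learning rate curve under exponential decay]
The implementation is shown below. It is built on Callback and is used in the same way as ReduceLROnPlateau: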
import numpy as np
import matplotlib.pyplot as plt
import keras
from keras import backend as K
from keras.layers import Flatten,Conv2D,Dropout,Input,Dense,MaxPooling2D
from keras.models import Model
def exponent(global_epoch,
             learning_rate_base,
             decay_rate,
             min_learn_rate=0):
    # Exponential decay, floored at min_learn_rate
    learning_rate = learning_rate_base * pow(decay_rate, global_epoch)
    learning_rate = max(learning_rate, min_learn_rate)
    return learning_rate

class ExponentDecayScheduler(keras.callbacks.Callback):
    """
    Subclass of Callback that schedules the learning rate
    """
    def __init__(self,
                 learning_rate_base,
                 decay_rate,
                 global_epoch_init=0,
                 min_learn_rate=0,
                 verbose=0):
        super(ExponentDecayScheduler, self).__init__()
        # Base learning rate
        self.learning_rate_base = learning_rate_base
        # Global epoch counter, starting from global_epoch_init
        self.global_epoch = global_epoch_init
        self.decay_rate = decay_rate
        # Verbosity
        self.verbose = verbose
        self.min_learn_rate = min_learn_rate
        # learning_rates records the learning rate after each update, for plotting
        self.learning_rates = []

    # Advance the epoch counter and record the learning rate used in this epoch
    def on_epoch_end(self, epoch, logs=None):
        self.global_epoch = self.global_epoch + 1
        lr = K.get_value(self.model.optimizer.lr)
        self.learning_rates.append(lr)

    # Update the learning rate
    def on_epoch_begin(self, epoch, logs=None):
        lr = exponent(global_epoch=self.global_epoch,
                      learning_rate_base=self.learning_rate_base,
                      decay_rate=self.decay_rate,
                      min_learn_rate=self.min_learn_rate)
        K.set_value(self.model.optimizer.lr, lr)
        if self.verbose > 0:
            print('\nEpoch %05d: setting learning '
                  'rate to %s.' % (self.global_epoch + 1, lr))

# Load the MNIST handwritten digit dataset
mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
x_train = np.expand_dims(x_train,-1)
x_test = np.expand_dims(x_test,-1)
#-----------------------------#
# Build the model
#-----------------------------#
inputs = Input([28,28,1])
x = Conv2D(32, kernel_size= 5,padding = 'same',activation="relu")(inputs)
x = MaxPooling2D(pool_size = 2, strides = 2, padding = 'same',)(x)
x = Conv2D(64, kernel_size= 5,padding = 'same',activation="relu")(x)
x = MaxPooling2D(pool_size = 2, strides = 2, padding = 'same',)(x)
x = Flatten()(x)
x = Dense(1024)(x)
x = Dense(256)(x)
out = Dense(10, activation='softmax')(x)
model = Model(inputs,out)
# Set the optimizer, the loss, and the accuracy metric
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training parameters
epochs = 10
init_epoch = 0
# Number of samples per batch
batch_size = 31
# Initial (maximum) learning rate
learning_rate_base = 1e-3
sample_count = len(x_train)
# Learning rate scheduler
exponent_lr = ExponentDecayScheduler(learning_rate_base=learning_rate_base,
                                     global_epoch_init=init_epoch,
                                     decay_rate=0.9,
                                     min_learn_rate=1e-6)
# Train with fit
model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
          verbose=1, callbacks=[exponent_lr])
# The scheduler records one learning rate per epoch, so the x-axis is in epochs
plt.plot(exponent_lr.learning_rates)
plt.xlabel('Epoch', fontsize=20)
plt.ylabel('lr', fontsize=20)
plt.axis([0, epochs, 0, learning_rate_base*1.1])
plt.xticks(np.arange(0, epochs, 1))
plt.grid()
plt.title('lr decay with exponent', fontsize=20)
plt.show()
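The values this schedule produces can be checked without training a model. The sketch below restates the formula as a standalone function (`exponential_lr` is a name chosen for this example) and prints the rate for each of the 10 epochs, using the same settings as the script above. The commented lines at the end show one way it could be passed to Keras's built-in `LearningRateScheduler` callback instead of a custom class; that hookup is an assumption about your setup and is not executed here.

```python
def exponential_lr(epoch, lr_base=1e-3, decay_rate=0.9, min_lr=1e-6):
    """Learning rate for a 0-based epoch index: lr_base * decay_rate**epoch, floored at min_lr."""
    return max(lr_base * decay_rate ** epoch, min_lr)

# Preview the schedule for 10 epochs
schedule = [exponential_lr(e) for e in range(10)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")

# Assumed hookup (requires Keras and a compiled model, so it is left commented out):
# from keras.callbacks import LearningRateScheduler
# scheduler = LearningRateScheduler(lambda epoch, lr: exponential_lr(epoch))
# model.fit(x_train, y_train, epochs=10, callbacks=[scheduler])
```

The first epoch uses the base rate of 1e-3 and each later epoch uses 0.9 times the previous one, until the floor of 1e-6 would be reached.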
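3. Cosine annealing decay
With cosine annealing decay, the learning rate first rises and then falls, following the idea behind annealing optimization. (If you are unfamiliar with annealing algorithms, you can look them up on Baidu.)
The rise is linear, and the fall follows the shape of a cosine curve.
[Figure: learning rate curve with linear warmup followed by cosine decay]
Cosine annealing decay has a few essential parameters:
1. learning_rate_base: the peak learning rate.
2. warmup_learning_rate: the learning rate at the very start.
3. warmup_steps: the number of steps taken to reach the peak.
The implementation is shown below. It is built on Callback and is used in the same way as ReduceLROnPlateau: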
import numpy as np
import matplotlib.pyplot as plt
import keras
from keras import backend as K
from keras.layers import Flatten,Conv2D,Dropout,Input,Dense,MaxPooling2D
from keras.models import Model
def cosine_decay_with_warmup(global_step,
                             learning_rate_base,
                             total_steps,
                             warmup_learning_rate=0.0,
                             warmup_steps=0,
                             hold_base_rate_steps=0,
                             min_learn_rate=0):
    """
    Arguments:
    global_step: the current step count (T_cur in the cosine annealing formula).
    learning_rate_base: the preset peak learning rate; once warmup has raised the rate to learning_rate_base, decay begins.
    total_steps: total number of training steps, equal to epoch * sample_count / batch_size (sample_count is the number of samples, epoch is the number of epochs).
    warmup_learning_rate: starting value of the linear warmup.
    warmup_steps: number of steps the warmup lasts.
    hold_base_rate_steps: optional; after warmup the learning rate is held constant until hold_base_rate_steps steps have passed, and only then starts to decay.
    min_learn_rate: lower bound on the returned learning rate.
    """
    if total_steps < warmup_steps:
        raise ValueError('total_steps must be larger or equal to '
                         'warmup_steps.')
    # Cosine annealing; the minimum of the cosine term is taken as 0, which simplifies the expression
    learning_rate = 0.5 * learning_rate_base * (1 + np.cos(np.pi *
        (global_step - warmup_steps - hold_base_rate_steps) / float(total_steps - warmup_steps - hold_base_rate_steps)))
    # If hold_base_rate_steps > 0, keep the learning rate constant for that many steps after warmup
    if hold_base_rate_steps > 0:
        learning_rate = np.where(global_step > warmup_steps + hold_base_rate_steps,
                                 learning_rate, learning_rate_base)
    if warmup_steps > 0:
        if learning_rate_base < warmup_learning_rate:
            raise ValueError('learning_rate_base must be larger or equal to '
                             'warmup_learning_rate.')
        # Linear warmup
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        warmup_rate = slope * global_step + warmup_learning_rate
        # Use the linear warmup_rate only while global_step is still in the warmup phase; otherwise use the cosine learning_rate
        learning_rate = np.where(global_step < warmup_steps, warmup_rate,
                                 learning_rate)
    learning_rate = max(learning_rate, min_learn_rate)
    return learning_rate

class WarmUpCosineDecayScheduler(keras.callbacks.Callback):
    """
    Subclass of Callback that schedules the learning rate
    """
    def __init__(self,
                 learning_rate_base,
                 total_steps,
                 global_step_init=0,
                 warmup_learning_rate=0.0,
                 warmup_steps=0,
                 hold_base_rate_steps=0,
                 min_learn_rate=0,
                 verbose=0):
        super(WarmUpCosineDecayScheduler, self).__init__()
        # Base learning rate
        self.learning_rate_base = learning_rate_base
        # Total number of steps over all epochs: epochs * sample_count / batch_size
        self.total_steps = total_steps
        # Global step counter, starting from global_step_init
        self.global_step = global_step_init
        # Learning rate at the start of warmup
        self.warmup_learning_rate = warmup_learning_rate
        # Number of warmup steps: warmup_epoch * sample_count / batch_size
        self.warmup_steps = warmup_steps
        self.hold_base_rate_steps = hold_base_rate_steps
        # Verbosity
        self.verbose = verbose
        self.min_learn_rate = min_learn_rate
        # learning_rates records the learning rate after each update, for plotting
        self.learning_rates = []

    # Advance global_step and record the current learning rate
    def on_batch_end(self, batch, logs=None):
        self.global_step = self.global_step + 1
        lr = K.get_value(self.model.optimizer.lr)
        self.learning_rates.append(lr)

    # Update the learning rate
    def on_batch_begin(self, batch, logs=None):
        lr = cosine_decay_with_warmup(global_step=self.global_step,
                                      learning_rate_base=self.learning_rate_base,
                                      total_steps=self.total_steps,
                                      warmup_learning_rate=self.warmup_learning_rate,
                                      warmup_steps=self.warmup_steps,
                                      hold_base_rate_steps=self.hold_base_rate_steps,
                                      min_learn_rate=self.min_learn_rate)
        K.set_value(self.model.optimizer.lr, lr)
        if self.verbose > 0:
            print('\nBatch %05d: setting learning '
                  'rate to %s.' % (self.global_step + 1, lr))

# Load the MNIST handwritten digit dataset
mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
x_train = np.expand_dims(x_train,-1)
x_test = np.expand_dims(x_test,-1)
#-----------------------------#
# Build the model
#-----------------------------#
inputs = Input([28,28,1])
x = Conv2D(32, kernel_size= 5,padding = 'same',activation="relu")(inputs)
x = MaxPooling2D(pool_size = 2, strides = 2, padding = 'same',)(x)
x = Conv2D(64, kernel_size= 5,padding = 'same',activation="relu")(x)
x = MaxPooling2D(pool_size = 2, strides = 2, padding = 'same',)(x)
x = Flatten()(x)
x = Dense(1024)(x)
x = Dense(256)(x)
out = Dense(10, activation='softmax')(x)
model = Model(inputs,out)
# Set the optimizer, the loss, and the accuracy metric
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training parameters
epochs = 10
# Number of warmup epochs
warmup_epoch = 3
# Number of samples per batch
batch_size = 16
# Maximum learning rate
learning_rate_base = 1e-3
sample_count = len(x_train)
# Total number of steps
total_steps = int(epochs * sample_count / batch_size)
# Number of warmup steps
warmup_steps = int(warmup_epoch * sample_count / batch_size)
# Learning rate scheduler
warm_up_lr = WarmUpCosineDecayScheduler(learning_rate_base=learning_rate_base,
                                        total_steps=total_steps,
                                        warmup_learning_rate=1e-5,
                                        warmup_steps=warmup_steps,
                                        hold_base_rate_steps=5,
                                        min_learn_rate=1e-6)
# Train with fit
model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
          verbose=1, callbacks=[warm_up_lr])
plt.plot(warm_up_lr.learning_rates)
plt.xlabel('Step', fontsize=20)
plt.ylabel('lr', fontsize=20)
plt.axis([0, total_steps, 0, learning_rate_base*1.1])
# The x-axis is in steps, so place one tick at the start of each epoch
plt.xticks(np.arange(0, total_steps, total_steps // epochs))
plt.grid()
plt.title('Cosine decay with warmup', fontsize=20)
plt.show()
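The shape of this schedule can also be previewed without training. The sketch below restates the three phases (linear warmup, optional hold, cosine decay) as a scalar function using only the standard library. `preview_cosine_with_warmup` and the round numbers 1000 and 300 are assumptions chosen for readability, and the `min_learn_rate` floor from the script above is left out.

```python
import math

def preview_cosine_with_warmup(step, lr_base, total_steps,
                               warmup_lr=0.0, warmup_steps=0, hold_steps=0):
    """Scalar restatement of the schedule: linear warmup, optional hold, cosine decay."""
    if step < warmup_steps:
        # Phase 1: rise linearly from warmup_lr to lr_base
        slope = (lr_base - warmup_lr) / warmup_steps
        return warmup_lr + slope * step
    if step <= warmup_steps + hold_steps:
        # Phase 2: stay at the peak
        return lr_base
    # Phase 3: half a cosine period from lr_base down to 0
    progress = (step - warmup_steps - hold_steps) / (total_steps - warmup_steps - hold_steps)
    return 0.5 * lr_base * (1 + math.cos(math.pi * progress))

total, warm = 1000, 300
for step in (0, 150, 300, 650, 1000):
    lr = preview_cosine_with_warmup(step, 1e-3, total, warmup_lr=1e-5, warmup_steps=warm)
    print(f"step {step:4d}: lr = {lr:.6f}")
```

The printed values start at the warmup rate, reach the peak at step 300, pass through half the peak midway through the decay, and end at zero.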
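4. Updated cosine annealing decay
In the paper, cosine annealing does not rise and fall only once, so I rewrote the code to make the learning rate rise and fall several times.
The implementation is shown below. It is built on Callback and is used in the same way as ReduceLROnPlateau: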
import numpy as np
import matplotlib.pyplot as plt
import keras
from keras import backend as K
from keras.layers import Flatten,Conv2D,Dropout,Input,Dense,MaxPooling2D
from keras.models import Model
def cosine_decay_with_warmup(global_step,
                             learning_rate_base,
                             total_steps,
                             warmup_learning_rate=0.0,
                             warmup_steps=0,
                             hold_base_rate_steps=0,
                             min_learn_rate=0):
    """
    Arguments:
    global_step: the current step count (T_cur in the cosine annealing formula).
    learning_rate_base: the preset peak learning rate; once warmup has raised the rate to learning_rate_base, decay begins.
    total_steps: total number of training steps, equal to epoch * sample_count / batch_size (sample_count is the number of samples, epoch is the number of epochs).
    warmup_learning_rate: starting value of the linear warmup.
    warmup_steps: number of steps the warmup lasts.
    hold_base_rate_steps: optional; after warmup the learning rate is held constant until hold_base_rate_steps steps have passed, and only then starts to decay.
    min_learn_rate: lower bound on the returned learning rate.
    """
    if total_steps < warmup_steps:
        raise ValueError('total_steps must be larger or equal to '
                         'warmup_steps.')
    # Cosine annealing; the minimum of the cosine term is taken as 0, which simplifies the expression
    learning_rate = 0.5 * learning_rate_base * (1 + np.cos(np.pi *
        (global_step - warmup_steps - hold_base_rate_steps) / float(total_steps - warmup_steps - hold_base_rate_steps)))
    # If hold_base_rate_steps > 0, keep the learning rate constant for that many steps after warmup
    if hold_base_rate_steps > 0:
        learning_rate = np.where(global_step > warmup_steps + hold_base_rate_steps,
                                 learning_rate, learning_rate_base)
    if warmup_steps > 0:
        if learning_rate_base < warmup_learning_rate:
            raise ValueError('learning_rate_base must be larger or equal to '
                             'warmup_learning_rate.')
        # Linear warmup
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        warmup_rate = slope * global_step + warmup_learning_rate
        # Use the linear warmup_rate only while global_step is still in the warmup phase; otherwise use the cosine learning_rate
        learning_rate = np.where(global_step < warmup_steps, warmup_rate,
                                 learning_rate)
    learning_rate = max(learning_rate, min_learn_rate)
    return learning_rate

class WarmUpCosineDecayScheduler(keras.callbacks.Callback):
    """
    Subclass of Callback that schedules the learning rate
    """
    def __init__(self,
                 learning_rate_base,
                 total_steps,
                 global_step_init=0,
                 warmup_learning_rate=0.0,
                 warmup_steps=0,
                 hold_base_rate_steps=0,
                 min_learn_rate=0,
                 # interval_epoch marks the low points between cosine cycles,
                 # as fractions of the total number of training steps
                 interval_epoch=(0.05, 0.15, 0.30, 0.50),
                 verbose=0):
        super(WarmUpCosineDecayScheduler, self).__init__()
        # Base learning rate
        self.learning_rate_base = learning_rate_base
        # Learning rate at the start of warmup
        self.warmup_learning_rate = warmup_learning_rate
        # Verbosity
        self.verbose = verbose
        self.min_learn_rate = min_learn_rate
        # learning_rates records the learning rate after each update, for plotting
        self.learning_rates = []
        self.interval_epoch = interval_epoch
        # Step counter that runs across the whole training run.
        # Per-cycle state (total_steps, warmup_steps, ...) is first set in
        # on_batch_begin at step 0, so global_step_init is expected to be 0.
        self.global_step_for_interval = global_step_init
        # Total number of warmup steps, shared out across the cycles
        self.warmup_steps_for_interval = warmup_steps
        # Total number of steps held at the peak, shared out across the cycles
        self.hold_steps_for_interval = hold_base_rate_steps
        # Total number of steps in the whole training run
        self.total_steps_for_interval = total_steps
        self.interval_index = 0
        # Length of each cycle (the gap between two low points), as a fraction of the run
        self.interval_reset = [self.interval_epoch[0]]
        for i in range(len(self.interval_epoch)-1):
            self.interval_reset.append(self.interval_epoch[i+1]-self.interval_epoch[i])
        self.interval_reset.append(1-self.interval_epoch[-1])

    # Advance both step counters and record the current learning rate
    def on_batch_end(self, batch, logs=None):
        self.global_step = self.global_step + 1
        self.global_step_for_interval = self.global_step_for_interval + 1
        lr = K.get_value(self.model.optimizer.lr)
        self.learning_rates.append(lr)

    # Update the learning rate
    def on_batch_begin(self, batch, logs=None):
        # At every low point, reset the per-cycle parameters and restart the cycle
        if self.global_step_for_interval in [0]+[int(i*self.total_steps_for_interval) for i in self.interval_epoch]:
            self.total_steps = self.total_steps_for_interval * self.interval_reset[self.interval_index]
            self.warmup_steps = self.warmup_steps_for_interval * self.interval_reset[self.interval_index]
            self.hold_base_rate_steps = self.hold_steps_for_interval * self.interval_reset[self.interval_index]
            self.global_step = 0
            self.interval_index += 1
        lr = cosine_decay_with_warmup(global_step=self.global_step,
                                      learning_rate_base=self.learning_rate_base,
                                      total_steps=self.total_steps,
                                      warmup_learning_rate=self.warmup_learning_rate,
                                      warmup_steps=self.warmup_steps,
                                      hold_base_rate_steps=self.hold_base_rate_steps,
                                      min_learn_rate=self.min_learn_rate)
        K.set_value(self.model.optimizer.lr, lr)
        if self.verbose > 0:
            print('\nBatch %05d: setting learning '
                  'rate to %s.' % (self.global_step + 1, lr))

# Load the MNIST handwritten digit dataset
mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
x_train = np.expand_dims(x_train,-1)
x_test = np.expand_dims(x_test,-1)
#-----------------------------#
# Build the model
#-----------------------------#
inputs = Input([28,28,1])
x = Conv2D(32, kernel_size= 5,padding = 'same',activation="relu")(inputs)
x = MaxPooling2D(pool_size = 2, strides = 2, padding = 'same',)(x)
x = Conv2D(64, kernel_size= 5,padding = 'same',activation="relu")(x)
x = MaxPooling2D(pool_size = 2, strides = 2, padding = 'same',)(x)
x = Flatten()(x)
x = Dense(1024)(x)
x = Dense(256)(x)
out = Dense(10, activation='softmax')(x)
model = Model(inputs,out)
# Set the optimizer, the loss, and the accuracy metric
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training parameters
epochs = 10
# Number of warmup epochs
warmup_epoch = 2
# Number of samples per batch
batch_size = 256
# Maximum learning rate
learning_rate_base = 1e-3
sample_count = len(x_train)
# Total number of steps
total_steps = int(epochs * sample_count / batch_size)
# Number of warmup steps
warmup_steps = int(warmup_epoch * sample_count / batch_size)
# Learning rate scheduler
warm_up_lr = WarmUpCosineDecayScheduler(learning_rate_base=learning_rate_base,
                                        total_steps=total_steps,
                                        warmup_learning_rate=1e-5,
                                        warmup_steps=warmup_steps,
                                        hold_base_rate_steps=5,
                                        min_learn_rate=1e-6)
# Train with fit
model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
          verbose=1, callbacks=[warm_up_lr])
plt.plot(warm_up_lr.learning_rates)
plt.xlabel('Step', fontsize=20)
plt.ylabel('lr', fontsize=20)
plt.axis([0, total_steps, 0, learning_rate_base*1.1])
plt.grid()
plt.title('Cosine decay with warmup', fontsize=20)
plt.show()
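To see where the restarts fall for a given interval_epoch, the bookkeeping from the scheduler's constructor can be reproduced on its own. The sketch below is only an illustration: `restart_plan` is a name made up for this example, and 60000 is the size of the MNIST training set used above.

```python
def restart_plan(total_steps, interval_epoch=(0.05, 0.15, 0.30, 0.50)):
    """Return (start_step, cycle_length_in_steps) for each cosine cycle."""
    # A new cycle starts at step 0 and at every low point
    starts = [0] + [int(f * total_steps) for f in interval_epoch]
    # Cycle lengths as fractions of the run, mirroring interval_reset in the scheduler
    fractions = [interval_epoch[0]]
    fractions += [b - a for a, b in zip(interval_epoch, interval_epoch[1:])]
    fractions.append(1 - interval_epoch[-1])
    return [(s, f * total_steps) for s, f in zip(starts, fractions)]

total_steps = int(10 * 60000 / 256)   # same arithmetic as the script above
for start, length in restart_plan(total_steps):
    print(f"cycle starts at step {start}, lasts about {length:.0f} steps")
```

With the default interval_epoch there are five cycles, and each one is longer than the one before it; the last cycle covers the second half of training.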