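This post covers three implementations: a vanilla one, a more efficient one, and one written as a network layer.
Although the dropout parameter is called a probability, here it usually refers to the probability of keeping a unit, not of dropping it (perhaps because that maps more directly onto the code). The convention is not fixed, though, so what matters is staying self-consistent.
Forward pass of a network with vanilla dropout:
As a control, the code also includes a predict variant that does not multiply by p. As the amount or dimensionality of the data grows, the overall outputs of the first two (training and scaled prediction) stay close, while the result without the multiplication by p is far off.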
""" Vanilla Dropout: Not recommended implementation (see notes below) """
import numpy as np

p = 0.5  # probability of keeping a unit active. higher = less dropout

def train_step(X):
    """ X contains the data """
    # forward pass for example 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p  # first dropout mask
    H1 *= U1  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p  # second dropout mask
    H2 *= U2  # drop!
    out = np.dot(W3, H2) + b3
    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)
    return out

def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p  # NOTE: scale the activations
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p  # NOTE: scale the activations
    out = np.dot(W3, H2) + b3
    return out

def predict_without_multiply_p(X):
    # for comparison: forward pass that skips the scaling by p
    H1 = np.maximum(0, np.dot(W1, X) + b1)  # activations are NOT scaled
    H2 = np.maximum(0, np.dot(W2, H1) + b2)  # activations are NOT scaled
    out = np.dot(W3, H2) + b3
    return out

# module-level weights (unseeded, so the numbers differ on every run)
W1 = np.random.randn(4, 3)  # W1 @ X has shape (4, N)
b1 = np.random.randn(4, 1)
W2 = np.random.randn(4, 4)  # W2 @ H1 has shape (4, N)
b2 = np.random.randn(4, 1)
W3 = np.random.randn(4, 4)  # W3 @ H2 has shape (4, N)
b3 = np.random.randn(4, 1)

if __name__ == '__main__':
    X = np.random.randn(3, 1000000)
    y = train_step(X)
    print('in training phase, average value:', y.shape, y.mean())
    predict_y = predict(X)
    print('in predicting phase, average value:', predict_y.shape, predict_y.mean())
    # In some runs the value without multiplying by p even comes out smaller.
    # The weights are random and differ on every run, which shows that the
    # per-layer magnitudes and the final output are not strictly proportional.
    predict_y = predict_without_multiply_p(X)
    print('in predicting phase(without multiply p), average value:', predict_y.mean())
in training phase, average value: (4, 1000000) 0.5332048355924359
in predicting phase, average value: (4, 1000000) 0.4632303379943579
in predicting phase(without multiply p), average value: 2.0510060393300087
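The reason multiplying by p at prediction time works is that each unit is kept with probability p, so the average masked activation during training is p times the unmasked one. A minimal sketch of that expectation, assuming NumPy; the names `masked_mean`, `h`, and `trials` are illustrative and not part of the code above:

```python
import numpy as np

def masked_mean(h, p, trials, rng):
    """Average h * mask over many independently drawn keep-masks."""
    total = np.zeros_like(h)
    for _ in range(trials):
        mask = rng.random(h.shape) < p  # keep each unit with probability p
        total += h * mask
    return total / trials

rng = np.random.default_rng(0)
p = 0.5
h = np.maximum(0, rng.standard_normal(8))  # stand-in for some ReLU activations
avg = masked_mean(h, p, trials=20000, rng=rng)

# The averaged masked activations approach p * h, so this gap is small.
print(np.max(np.abs(avg - p * h)))
```

This equality is exact only for the masked activations themselves. Once they pass through further ReLU layers the match is approximate, which is consistent with the training and prediction means above being close but not identical.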
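The network above is logically fine. The problem is runtime performance: the multiplication by p happens in predict, which adds overhead and slows down inference. Since the only goal is to keep the scale of train and predict consistent, the operation can simply be moved to training time.
Forward pass of a network with inverted dropout:
Notes: since this is a division, watch out for division by zero (although p = 0 is an unlikely setting). The definition of p also varies between courses and authors. Here p is the probability of keeping a unit, which arguably reads more naturally in code (multiply by the mask). As I recall, in Hung-yi Lee's course p is the drop probability, in which case the mask is built the other way around (> p) and the division is by 1 - p rather than p. Any naming is fine as long as it is self-consistent.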
import numpy as np
from dropout_vanilla import train_step as vanilla_train
from dropout_vanilla import predict as vanilla_predict

p = 0.5  # probability of keeping a unit active. higher = less dropout
# Here p is the keep probability. If p were the drop probability (as in
# Hung-yi Lee's course), the division would be by 1 - p, which is just a bit more roundabout.

def train_step(X):
    """ X contains the data """
    # forward pass for example 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # first dropout mask. Notice /p!
    H1 *= U1  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p  # second dropout mask. Notice /p!
    H2 *= U2  # drop!
    out = np.dot(W3, H2) + b3
    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)
    return out

def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1)  # no scaling necessary
    H2 = np.maximum(0, np.dot(W2, H1) + b2)  # no scaling necessary
    out = np.dot(W3, H2) + b3
    return out

if __name__ == '__main__':
    np.random.seed(1)
    W1 = np.random.randn(4, 3)  # W1 @ X has shape (4, N)
    b1 = np.random.randn(4, 1)
    W2 = np.random.randn(4, 4)  # W2 @ H1 has shape (4, N)
    b2 = np.random.randn(4, 1)
    W3 = np.random.randn(4, 4)  # W3 @ H2 has shape (4, N)
    b3 = np.random.randn(4, 1)
    X = np.random.randn(3, 1000000)
    y = train_step(X)
    print(y.shape)
    print('inverted train,average value:', y.mean())
    predict_y = predict(X)
    print('inverted predict,average value:', predict_y.mean())
    # I wanted to compare the expectations of the two versions, but the
    # activations are noisy (and the vanilla module uses its own weights),
    # so a direct comparison is not meaningful. As a rough illustration,
    # both keep train and predict at a similar order of magnitude.
    y = vanilla_train(X)
    predict_y = vanilla_predict(X)
    print('vanilla train,average value:', y.mean())
    print('vanilla predict,average value:', predict_y.mean())
inverted train,average value: -0.19223531434253763
inverted predict,average value: -0.24786749397781124
vanilla train,average value: 0.0815151269153607
vanilla predict,average value: 0.13525566404177755
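The note before the inverted dropout code mentions that p may mean either the keep probability or the drop probability. A small sketch of both conventions side by side, assuming NumPy; the helper names `keep_convention_mask` and `drop_convention_mask` are made up for illustration, and the range checks are one possible way to guard against the division by zero mentioned there:

```python
import numpy as np

def keep_convention_mask(shape, p_keep, rng):
    """Inverted-dropout mask where p_keep is the probability of keeping a unit."""
    if not 0.0 < p_keep <= 1.0:
        raise ValueError("p_keep must be in (0, 1] to avoid dividing by zero")
    return (rng.random(shape) < p_keep) / p_keep

def drop_convention_mask(shape, p_drop, rng):
    """Inverted-dropout mask where p_drop is the probability of dropping a unit."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1) to avoid dividing by zero")
    # comparison is reversed and the scale is 1 / (1 - p_drop)
    return (rng.random(shape) > p_drop) / (1.0 - p_drop)

rng = np.random.default_rng(0)
keep_mask = keep_convention_mask((1000, 1000), p_keep=0.8, rng=rng)
drop_mask = drop_convention_mask((1000, 1000), p_drop=0.2, rng=rng)

# Both masks average to roughly 1, so the expected activation scale is preserved.
print(keep_mask.mean(), drop_mask.mean())
# Both keep roughly 80% of the units.
print((keep_mask > 0).mean(), (drop_mask > 0).mean())
```

The two masks are not identical element by element because they are drawn independently, but they follow the same distribution, which is all that self-consistency requires.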
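Of course, everything above is a bare network. Next comes an implementation as a network layer.
Dropout layer implementation:
Compared with other layers (especially BN, see https://blog.csdn.net/huqinweI987/article/details/103229158), this one is very simple.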
import numpy as np

class Dropout:
    def __init__(self, dropout_ratio=0.5):  # here this is the probability of dropping a unit
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, is_train):
        if is_train:
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            return x * (1 - self.dropout_ratio)

    def backward(self, dout):
        return dout * self.mask
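This first piece of reference code is the vanilla kind of implementation, and its parameter is the drop probability. Next I rewrite it with a keep probability and then optimize it, so both versions can be compared.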
class Dropout:
    def __init__(self, keep_probability=0.5):  # here this is the probability of keeping a unit
        self.keep_probability = keep_probability
        self.mask = None

    def forward(self, x, is_train):
        if is_train:
            self.mask = np.random.rand(*x.shape) < self.keep_probability
            return x * self.mask
        else:
            return x * self.keep_probability

    def backward(self, dout):
        return dout * self.mask
Next comes the test-time performance optimization (inverted dropout):
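class Dropout:
    def __init__(self, keep_probability=0.5):  # here this is the probability of keeping a unit
        self.keep_probability = keep_probability
        self.mask = None

    def forward(self, x, is_train):
        if is_train:
            # fold the 1 / keep_probability scale into the mask so backward can reuse it
            self.mask = (np.random.rand(*x.shape) < self.keep_probability) / self.keep_probability
            return x * self.mask
        else:
            return x

    def backward(self, dout):
        # the gradient must carry the same scale as the forward pass
        return dout * self.mask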
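One detail worth checking in the inverted layer: because the forward pass multiplies by mask / keep_probability, the backward pass should apply that same factor to the incoming gradient. Below is a minimal numerical check under that assumption. The class name `InvertedDropoutSketch` and its `rng` argument are hypothetical, chosen so the example is self-contained and repeatable:

```python
import numpy as np

class InvertedDropoutSketch:
    """Self-contained inverted dropout layer used only for this check."""
    def __init__(self, keep_probability=0.5, rng=None):
        self.keep_probability = keep_probability
        self.rng = rng if rng is not None else np.random.default_rng()
        self.mask = None

    def forward(self, x, is_train=True):
        if is_train:
            # scaled mask: 0 for dropped units, 1 / keep_probability for kept ones
            self.mask = (self.rng.random(x.shape) < self.keep_probability) / self.keep_probability
            return x * self.mask
        self.mask = None
        return x

    def backward(self, dout):
        if self.mask is None:  # test mode: the layer is the identity
            return dout
        return dout * self.mask

layer = InvertedDropoutSketch(0.8, np.random.default_rng(0))
x = np.random.default_rng(1).standard_normal((3, 4))
y = layer.forward(x)
dx = layer.backward(np.ones_like(x))  # analytic gradient of y.sum() w.r.t. x

# Finite-difference estimate of the same gradient, reusing the stored mask.
eps = 1e-6
numeric = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    x_plus = x.copy()
    x_plus[idx] += eps
    numeric[idx] = ((x_plus * layer.mask).sum() - (x * layer.mask).sum()) / eps

gradients_match = np.allclose(dx, numeric, atol=1e-4)
print(gradients_match)
```

Storing the already scaled mask keeps forward and backward in sync with a single multiplication each.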
Then just drop this layer into the network structure.
# build the layers
activation_layer = {'sigmoid': Sigmoid, 'relu': Relu}
self.layers = OrderedDict()
for idx in range(1, self.hidden_layer_num+1):
    self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)],
                                              self.params['b' + str(idx)])
    if self.use_batchnorm:
        self.params['gamma' + str(idx)] = np.ones(hidden_size_list[idx-1])
        self.params['beta' + str(idx)] = np.zeros(hidden_size_list[idx-1])
        self.layers['BatchNorm' + str(idx)] = BatchNormalization(self.params['gamma' + str(idx)], self.params['beta' + str(idx)])
    self.layers['Activation_function' + str(idx)] = activation_layer[activation]()
    if self.use_dropout:
        self.layers['Dropout' + str(idx)] = Dropout(dropout_ration)
idx = self.hidden_layer_num + 1
self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)], self.params['b' + str(idx)])
self.last_layer = SoftmaxWithLoss()
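About dropping it into the network: the simplified forward pass of the network iterates over the layers in this form:
for layer in self.layers.values():
    x = layer.forward(x)
Passing an extra argument is awkward in this form, so give is_train a default value of True. Then you only need to pass False by hand at test time.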
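class Dropout:
    def __init__(self, keep_probability=0.5):  # here this is the probability of keeping a unit
        self.keep_probability = keep_probability
        self.mask = None

    def forward(self, x, is_train=True):
        if is_train:
            # fold the 1 / keep_probability scale into the mask so backward can reuse it
            self.mask = (np.random.rand(*x.shape) < self.keep_probability) / self.keep_probability
            return x * self.mask
        else:
            return x

    def backward(self, dout):
        # the gradient must carry the same scale as the forward pass
        return dout * self.mask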
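One way to pass False at test time without touching the other layers is to forward the flag only to Dropout layers. The following is a rough sketch of that idea, not the original network code: `forward_all` and the stand-in layer `Double` are hypothetical names, and the `Dropout` class is repeated here only so the example runs on its own.

```python
from collections import OrderedDict
import numpy as np

class Dropout:
    """Inverted dropout layer with is_train defaulting to True."""
    def __init__(self, keep_probability=0.5):
        self.keep_probability = keep_probability
        self.mask = None

    def forward(self, x, is_train=True):
        if is_train:
            self.mask = (np.random.rand(*x.shape) < self.keep_probability) / self.keep_probability
            return x * self.mask
        return x

    def backward(self, dout):
        return dout * self.mask

class Double:
    """Hypothetical stand-in for an ordinary layer whose forward takes only x."""
    def forward(self, x):
        return 2 * x

def forward_all(layers, x, is_train=True):
    """Run x through all layers, passing is_train only to Dropout layers."""
    for layer in layers.values():
        if isinstance(layer, Dropout):
            x = layer.forward(x, is_train)
        else:
            x = layer.forward(x)
    return x

layers = OrderedDict()
layers['Double1'] = Double()
layers['Dropout1'] = Dropout(0.8)

x = np.ones((2, 5))
test_out = forward_all(layers, x, is_train=False)  # dropout acts as the identity
train_out = forward_all(layers, x, is_train=True)  # entries are 0 or 2 / 0.8
print(test_out)
print(train_out)
```

With the default value in place, the plain `layer.forward(x)` loop still works for training, and only the evaluation path needs the explicit flag.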
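Result without dropout: after about 300 iterations the train set is essentially overfit and the test set has stagnated.
Result with dropout (drop probability 0.2, i.e. keep probability 0.8): still rising.
3000 iterations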