Implementing Dropout for a Handwritten Neural Network in Python

Three implementations are covered here: a vanilla version, a more efficient version, and a version packaged as a network layer.

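Although the dropout parameter is called a probability, it usually refers to the probability of keeping a unit, not of dropping it (perhaps because that reads more naturally in code?). The convention is not fixed, though, so just stay consistent within one implementation.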
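To make the two conventions concrete, here is a minimal sketch. The names `keep_prob`, `drop_prob`, and `rng` are my own and not taken from any library. It only shows that a keep-probability mask and a drop-probability mask retain the same fraction of units when keep_prob = 1 - drop_prob.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is repeatable

keep_prob = 0.8                 # "keep" convention
drop_prob = 1.0 - keep_prob     # the equivalent "drop" convention

r = rng.random(100000)          # uniform random numbers in [0, 1)

mask_keep = r < keep_prob       # keep convention: a unit survives if r < keep_prob
mask_drop = r >= drop_prob      # drop convention: a unit is dropped if r < drop_prob

# Both masks keep roughly 80% of the units; only the naming differs.
print(mask_keep.mean(), mask_drop.mean())
```

The two masks do not keep the same individual units, only the same proportion, which is all that matters for dropout.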

Sketch of the forward pass with vanilla dropout:

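As a control, the script also prints the result of a predict function that does not multiply by p. As the amount of data or its dimensionality grows, the final outputs of the first two (train, and predict with scaling) come out close, while the result without multiplying by p deviates a lot.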

""" Vanilla Dropout: Not recommended implementation (see notes below) """
import numpy as np


p = 0.5  # probability of keeping a unit active. higher = less dropout


def train_step(X):
    """ X contains the data """

    # forward pass for example 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p  # first dropout mask
    H1 *= U1  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p  # second dropout mask
    H2 *= U2  # drop!
    out = np.dot(W3, H2) + b3

    return out

    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)


def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p  # NOTE: scale the activations
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p  # NOTE: scale the activations
    out = np.dot(W3, H2) + b3
    return out

def predict_without_multiply_p(X):  # for comparison: what happens if the scaling is skipped
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1)  # NOTE: activations deliberately NOT scaled by p
    H2 = np.maximum(0, np.dot(W2, H1) + b2)  # NOTE: activations deliberately NOT scaled by p
    out = np.dot(W3, H2) + b3
    return out


W1 = np.random.randn(4,3)  # W1*X = 4,1 per sample
b1 = np.random.randn(4,1)
W2 = np.random.randn(4,4)  # W2*H1 = 4,1 per sample
b2 = np.random.randn(4,1)
W3 = np.random.randn(4,4)  # W3*H2 = 4,1 per sample
b3 = np.random.randn(4,1)
if __name__ == '__main__':

    X = np.random.randn(3,1000000)
    y = train_step(X)
    print('in training phase, average value:',y.shape,y.mean())
    predict_y = predict(X)
    print('in predicting phase, average value:',predict_y.shape,predict_y.mean())
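    # Odd: without multiplying by p the value sometimes comes out smaller?
    # It differs on every run because the weights are random; the magnitude
    # of each layer is not strictly proportional to the final output.
    predict_y = predict_without_multiply_p(X)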
    print('in predicting phase(without multiply p), average value:',predict_y.mean())

in training phase, average value: (4, 1000000) 0.5332048355924359
in predicting phase, average value: (4, 1000000) 0.4632303379943579
in predicting phase(without multiply p), average value: 2.0510060393300087

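The network above is logically fine. So what is the problem? Runtime performance: the multiplication by p happens in predict, which adds overhead at inference time and slows it down. Since the only goal is to keep the scale of train and predict the same, the operation can simply be moved to training time.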
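The scale argument can be checked numerically. The following is a rough sketch with made-up values (`h`, `n_trials`, and `rng` are illustrative names, not part of the code above). It averages many random masks over one fixed activation vector: the vanilla mask shrinks the expectation to p*h, which is why predict has to multiply by p, while dividing by p during training brings the expectation back to h.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # keep probability, as in the listings above
h = np.array([1.0, 2.0, 3.0, 4.0])        # a fixed, made-up activation vector
n_trials = 200000

# One random keep-mask per trial, shape (n_trials, 4)
masks = rng.random((n_trials, h.size)) < p

vanilla_mean = (masks * h).mean(axis=0)        # expectation is about p * h
inverted_mean = (masks * h / p).mean(axis=0)   # expectation is about h

print('vanilla :', vanilla_mean)
print('inverted:', inverted_mean)
```

This equality holds in expectation for a single layer. Once the masked activations pass through later nonlinear layers, train and predict outputs only match approximately, which is consistent with the "close but not equal" averages printed above.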

Sketch of the forward pass with inverted dropout:

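A few notes. Since this is a division, watch out for division by zero (although p=0 is an unlikely setting). The definition of p can differ between courses and authors. Here p is the keep probability (which seems closer to how the code reads, since the activation is multiplied by the mask). As I recall, in Hung-yi Lee's course p is the drop probability, in which case the mask condition is reversed (>p) and the division is by 1-p instead of p. As long as it is self-consistent, the naming does not matter.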
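For reference, the drop-probability variant described above might look like the sketch below. The function name `inverted_dropout_drop_convention` and its arguments are hypothetical, chosen only for this illustration. It guards against the division-by-zero case by rejecting drop_prob = 1.

```python
import numpy as np


def inverted_dropout_drop_convention(h, drop_prob, rng):
    """Inverted dropout where the parameter is the probability of DROPPING a unit."""
    if not 0.0 <= drop_prob < 1.0:
        # drop_prob == 1 would mean dividing by zero below
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    mask = rng.random(h.shape) >= drop_prob   # reversed comparison: keep when r >= drop_prob
    return h * mask / keep_prob               # divide by 1 - drop_prob, not by drop_prob


rng = np.random.default_rng(0)
h = np.ones((4, 100000))
out = inverted_dropout_drop_convention(h, drop_prob=0.2, rng=rng)
print(out.mean())   # stays close to the input mean of 1.0
```

Kept units are scaled up to 1/(1-0.2) = 1.25 and dropped units are 0, so the average stays near the original value.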

import numpy as np
from dropout_vanilla import train_step as vanilla_train
from dropout_vanilla import predict as vanilla_predict
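p = 0.5  # probability of keeping a unit active. higher = less dropout
# Here p is the keep probability. If p were the drop probability (as in Hung-yi Lee's course),
# the division would be by 1-p instead, which is just slightly more roundabout.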

def train_step(X):
    """ X contains the data """

    # forward pass for example 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # first dropout mask
    H1 *= U1  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p  # second dropout mask
    H2 *= U2  # drop!
    out = np.dot(W3, H2) + b3

    return out

    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)


def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1)  # no scaling needed: it was already done at train time
    H2 = np.maximum(0, np.dot(W2, H1) + b2)  # no scaling needed: it was already done at train time
    out = np.dot(W3, H2) + b3
    return out

if __name__ == '__main__':
    np.random.seed(1)

    W1 = np.random.randn(4,3)#W1*X=4,1
    b1 = np.random.randn(4,1)
    W2 = np.random.randn(4,4)#W2*H1=4,1
    b2 = np.random.randn(4,1)
    W3 = np.random.randn(4,4)#W3*H2=4,1
    b3 = np.random.randn(4,1)
    X = np.random.randn(3,1000000)
    y = train_step(X)
    print(y.shape)
    print('inverted train,average value:',y.mean())
    predict_y = predict(X)
    print('inverted predict,average value:',predict_y.mean())

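    # I wanted to compare the expectations of the two versions, but the activated
    # outputs are too noisy to compare directly. Note also that vanilla_train and
    # vanilla_predict use the weights defined inside dropout_vanilla, not the ones above.
    # As a rough illustration: each version keeps train and predict on a similar scale.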
    y = vanilla_train(X)
    predict_y = vanilla_predict(X)
    print('vanilla train,average value:',y.mean())
    print('vanilla predict,average value:',predict_y.mean())

inverted train,average value: -0.19223531434253763
inverted predict,average value: -0.24786749397781124
vanilla train,average value: 0.0815151269153607
vanilla predict,average value: 0.13525566404177755

Of course, everything above is a bare network. Next, let's implement dropout as a network layer.

Dropout layer implementation:

Compared with other layers (especially BN, see https://blog.csdn.net/huqinweI987/article/details/103229158), this one is very simple.

class Dropout:
    def __init__(self,dropout_ratio=0.5):  # this is the probability of dropping a unit
        self.dropout_ratio = dropout_ratio
        self.mask = None
    def forward(self,x,is_train):
        if is_train:
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            return x * (1 - self.dropout_ratio)
    def backward(self,dout):
        return dout * self.mask

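The reference code above is the vanilla style, and its parameter is the drop probability. Below I first rewrite it with a keep probability and then optimize it, so both versions are shown.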

class Dropout:
    def __init__(self,keep_probability=0.5):  # this is the probability of keeping a unit
        self.keep_probability = keep_probability
        self.mask = None
    def forward(self,x,is_train):
        if is_train:
            self.mask = np.random.rand(*x.shape) < self.keep_probability
            return x * self.mask
        else:
            return x * self.keep_probability
    def backward(self,dout):
        return dout * self.mask

Next, the test-time performance optimization (inverted dropout):

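class Dropout:
    def __init__(self,keep_probability=0.5):  # this is the probability of keeping a unit
        self.keep_probability = keep_probability
        self.mask = None
    def forward(self,x,is_train):
        if is_train:
            # inverted dropout: fold the 1/keep_probability scaling into the mask
            self.mask = (np.random.rand(*x.shape) < self.keep_probability) / self.keep_probability
            return x * self.mask
        else:
            return x
    def backward(self,dout):
        # the gradient must use the same scaled mask as the forward pass
        return dout * self.mask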
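As a quick sanity check of the layer's behavior, here is a small self-contained sketch. The class name `InvertedDropout` and the `seed` argument are my own additions for repeatability and are not part of the original layer. It checks three things: the training output keeps the input's average scale, the test output is the input itself, and the backward pass applies the same scaled mask as the forward pass.

```python
import numpy as np


class InvertedDropout:
    """Illustrative inverted-dropout layer (hypothetical name; seeded for repeatability)."""

    def __init__(self, keep_probability=0.5, seed=0):
        self.keep_probability = keep_probability
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def forward(self, x, is_train):
        if not is_train:
            return x                       # test time: identity, no extra work
        keep = self.rng.random(x.shape) < self.keep_probability
        self.mask = keep / self.keep_probability   # scaled mask: 0 or 1/keep_probability
        return x * self.mask

    def backward(self, dout):
        return dout * self.mask            # same scaled mask as in forward


layer = InvertedDropout(keep_probability=0.8)
x = np.ones((4, 50000))

y_train = layer.forward(x, is_train=True)
dx = layer.backward(np.ones_like(x))       # gradient of sum(y_train) with respect to x
y_test = layer.forward(x, is_train=False)

print('train mean:', y_train.mean())       # close to 1.0
print('test mean :', y_test.mean())
```

Because the input is all ones, the gradient `dx` equals the training output element by element, which makes the forward/backward consistency easy to see.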

Then just drop this layer into the network structure.

        # build the layers
        activation_layer = {'sigmoid': Sigmoid, 'relu': Relu}
        self.layers = OrderedDict()
        for idx in range(1, self.hidden_layer_num+1):
            self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)],
                                                      self.params['b' + str(idx)])
            if self.use_batchnorm:
                self.params['gamma' + str(idx)] = np.ones(hidden_size_list[idx-1])
                self.params['beta' + str(idx)] = np.zeros(hidden_size_list[idx-1])
                self.layers['BatchNorm' + str(idx)] = BatchNormalization(self.params['gamma' + str(idx)], self.params['beta' + str(idx)])
                
            self.layers['Activation_function' + str(idx)] = activation_layer[activation]()
            
            if self.use_dropout:
                self.layers['Dropout' + str(idx)] = Dropout(dropout_ration)

        idx = self.hidden_layer_num + 1
        self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)], self.params['b' + str(idx)])

        self.last_layer = SoftmaxWithLoss()

Speaking of plugging it into the network: the simplified forward pass iterates over the layers in the following form:

for layer in self.layers.values():
    x = layer.forward(x)

That loop makes it awkward to pass an extra argument, so give is_train a default value of True. Then only the test-time call needs to pass False explicitly.

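class Dropout:
    def __init__(self,keep_probability=0.5):  # this is the probability of keeping a unit
        self.keep_probability = keep_probability
        self.mask = None
    def forward(self,x,is_train=True):
        if is_train:
            # inverted dropout: fold the 1/keep_probability scaling into the mask
            self.mask = (np.random.rand(*x.shape) < self.keep_probability) / self.keep_probability
            return x * self.mask
        else:
            return x
    def backward(self,dout):
        # the gradient must use the same scaled mask as the forward pass
        return dout * self.mask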
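One possible way to pass False at test time is sketched below. Everything here is illustrative: `Scale` is a hypothetical stand-in for a layer without train/test modes, and checking the key prefix `'Dropout'` is an assumption based on the `'Dropout' + str(idx)` keys used when the layers were built above. It is not necessarily how the original network class does it.

```python
from collections import OrderedDict

import numpy as np


class Scale:
    """Hypothetical stand-in for a layer that behaves the same in train and test."""

    def __init__(self, factor):
        self.factor = factor

    def forward(self, x):
        return x * self.factor


class Dropout:
    """Inverted dropout layer with is_train defaulting to True."""

    def __init__(self, keep_probability=0.5):
        self.keep_probability = keep_probability
        self.mask = None

    def forward(self, x, is_train=True):
        if not is_train:
            return x
        keep = np.random.rand(*x.shape) < self.keep_probability
        self.mask = keep / self.keep_probability
        return x * self.mask


def predict(layers, x, is_train=False):
    # Only layers whose key starts with 'Dropout' receive the flag (assumed naming).
    for name, layer in layers.items():
        if name.startswith('Dropout'):
            x = layer.forward(x, is_train)
        else:
            x = layer.forward(x)
    return x


layers = OrderedDict([('Scale1', Scale(2.0)), ('Dropout1', Dropout(0.8))])
x = np.ones((2, 3))
out_test = predict(layers, x, is_train=False)   # dropout disabled: every entry is 2.0
out_train = predict(layers, x, is_train=True)   # entries are either 0.0 or 2.5
print(out_test)
print(out_train)
```

With the default of True, a plain training loop that calls `layer.forward(x)` on every layer still applies dropout, and only the evaluation path has to opt out.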

Result without dropout: after 300 iterations the training set is essentially overfit and the test set accuracy has stalled.

Result with dropout (drop probability 0.2, i.e. keep probability 0.8): accuracy is still rising.

After 3000 iterations:
