PyTorch: Weight Initialization

1. Vanishing and Exploding Gradients

Consider a three-layer fully connected network and look at how the gradient of the weights $W_2$ of the second hidden layer is computed.

By the chain rule, the gradient of $W_2$ is obtained as follows:

$$\mathrm{H}_{2}=\mathrm{H}_{1} * \mathrm{W}_{2}$$

$$\Delta \mathrm{W}_{2}=\frac{\partial \mathrm{Loss}}{\partial \mathrm{W}_{2}}=\frac{\partial \mathrm{Loss}}{\partial \mathrm{out}} * \frac{\partial \mathrm{out}}{\partial \mathrm{H}_{2}} * \frac{\partial \mathrm{H}_{2}}{\partial \mathrm{W}_{2}}=\frac{\partial \mathrm{Loss}}{\partial \mathrm{out}} * \frac{\partial \mathrm{out}}{\partial \mathrm{H}_{2}} * \mathrm{H}_{1}$$

In the formula above, $H_1$ is the output of the previous layer's neurons, so the gradient of $W_2$ depends on that output. If $H_1$ tends to zero, the gradient of $W_2$ also tends to zero, which causes vanishing gradients. If $H_1$ tends to infinity, the gradient of $W_2$ also tends to infinity, which causes exploding gradients.

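From this we can see that, to avoid vanishing or exploding gradients, the range of every layer's output values must be kept under tight control: the output of each layer can be neither too large nor too small.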
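The dependence of the gradient of $W_2$ on $H_1$ can be checked with autograd. The snippet below is a minimal sketch, not part of the original experiment: the tensor shapes, the `sum()` loss and the helper name `w2_grad_norm` are assumptions chosen only for illustration.

```python
import torch

torch.manual_seed(0)
base_h1 = torch.randn(4, 8)  # hypothetical output of the previous layer


def w2_grad_norm(scale):
    """Norm of dLoss/dW2 when the previous layer's output is scaled by `scale`."""
    h1 = base_h1 * scale
    w2 = torch.randn(8, 8, requires_grad=True)  # weights of the second layer
    loss = (h1 @ w2).sum()                      # hypothetical loss, illustration only
    loss.backward()
    return w2.grad.norm().item()


small = w2_grad_norm(1e-6)   # H1 close to zero
normal = w2_grad_norm(1.0)
large = w2_grad_norm(1e6)    # H1 very large
print(small, normal, large)
```

With this loss the gradient is a linear function of $H_1$, so scaling $H_1$ scales the gradient norm by the same factor.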

The code below observes the output of a fully connected network:

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data)

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

Running the code and inspecting output gives:

tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], grad_fn=<MmBackward>)

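Every value in the output is nan: the values have become so large in magnitude (positive or negative) that they exceed the range representable at the current precision.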
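How an out-of-range value ends up as nan can be reproduced in isolation. The following is a small sketch under the assumption that the tensors are float32 (PyTorch's default floating-point type); the variable names are illustrative.

```python
import torch

max_f32 = torch.finfo(torch.float32).max         # largest finite float32 value
x = torch.tensor([max_f32], dtype=torch.float32)

overflow = x * 10                                # exceeds the float32 range -> inf
diff = overflow - overflow                       # inf - inf is undefined -> nan
mixed = torch.tensor([float('inf'), float('-inf')]).sum()  # +inf plus -inf -> nan

print(max_f32, overflow, diff, mixed)
```

A matrix product sums many terms, so once a layer produces both +inf and -inf entries, the next layer's sums turn into nan.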

Now go back to forward() to see when the data turns into nan. The code uses the standard deviation to measure the scale of the data:

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):  # stop once the std is nan
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data)

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

Running the code above produces the following output:

layer:0, std:15.959932327270508
layer:1, std:256.6237487792969
layer:2, std:4107.24560546875
layer:3, std:65576.8125
layer:4, std:1045011.875
layer:5, std:17110408.0
layer:6, std:275461408.0
layer:7, std:4402537984.0
layer:8, std:71323615232.0
layer:9, std:1148104736768.0
layer:10, std:17911758454784.0
layer:11, std:283574846619648.0
layer:12, std:4480599809064960.0
layer:13, std:7.196814275405414e+16
layer:14, std:1.1507761512626258e+18
layer:15, std:1.853110740188555e+19
layer:16, std:2.9677725826641455e+20
layer:17, std:4.780376223769898e+21
layer:18, std:7.613223480799065e+22
layer:19, std:1.2092652108825478e+24
layer:20, std:1.923257075956356e+25
layer:21, std:3.134467063655912e+26
layer:22, std:5.014437766285408e+27
layer:23, std:8.066615144249704e+28
layer:24, std:1.2392661553516338e+30
layer:25, std:1.9455688099759845e+31
layer:26, std:3.0238180658999113e+32
layer:27, std:4.950357571077011e+33
layer:28, std:8.150925520353362e+34
layer:29, std:1.322983152787379e+36
layer:30, std:2.0786820453988485e+37
layer:31, std:nan
output is nan in 31 layers
tensor([[        inf, -2.6817e+38,         inf,  ...,         inf,
                 inf,         inf],
        [       -inf,        -inf,  1.4387e+38,  ..., -1.3409e+38,
         -1.9659e+38,        -inf],
        [-1.5873e+37,         inf,        -inf,  ...,         inf,
                -inf,  1.1484e+38],
        ...,
        [ 2.7754e+38, -1.6783e+38, -1.5531e+38,  ...,         inf,
         -9.9440e+37, -2.5132e+38],
        [-7.7184e+37,        -inf,         inf,  ..., -2.6505e+38,
                 inf,         inf],
        [        inf,         inf,        -inf,  ...,        -inf,
                 inf,  1.7432e+38]], grad_fn=<MmBackward>)

The results show that the standard deviation becomes nan at layer 31.

The variance derivation below explains why the standard deviation of the network's output keeps growing until it exceeds the representable range.
Before the derivation, review three basic formulas:
(1) The expectation of the product of two independent random variables X and Y: $E(X * Y)=E(X) * E(Y)$
(2) The variance formula: $D(X)=E\left(X^{2}\right)-[E(X)]^{2}$
(3) The variance of the sum of two independent random variables X and Y: $D(X+Y)=D(X)+D(Y)$
From these three formulas, the variance of the product of two independent random variables follows:

$$D(X * Y)=D(X) * D(Y)+D(X) *[E(Y)]^{2}+D(Y) *[E(X)]^{2}$$

Here X and Y are assumed to have mean 0 and standard deviation 1, that is, $E(X)=0$ and $E(Y)=0$, which gives the simplified formula:

$$D(X * Y)=D(X) * D(Y)$$
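The simplified product rule can be sanity-checked numerically. The snippet below is a rough Monte Carlo sketch; the sample count and the standard deviations 2 and 3 are arbitrary values picked for illustration.

```python
import torch

torch.manual_seed(0)
n_samples = 1_000_000

# Independent, zero-mean samples with hypothetical standard deviations 2 and 3.
x = torch.randn(n_samples) * 2.0
y = torch.randn(n_samples) * 3.0

var_product = (x * y).var().item()               # empirical D(X * Y)
var_expected = x.var().item() * y.var().item()   # D(X) * D(Y)

print(var_product, var_expected)
```

Both printed values should land close to 36, up to sampling noise.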

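Now look at the standard deviation of a network layer. Take the first neuron of the first hidden layer, denoted $H_{11}$, which is computed as:

$$\mathrm{H}_{11}=\sum_{i=1}^{n} X_{i} * W_{1 i}$$

Apply $D(X * Y)=D(X) * D(Y)$ from above to obtain the variance of $H_{11}$. Because X and W both have zero mean and standard deviation 1, the variance of $H_{11}$ is:

$$D\left(\mathrm{H}_{11}\right)=\sum_{i=1}^{n} D\left(X_{i}\right) * D\left(W_{1 i}\right)=n *(1 * 1)=n$$

Here n is the number of neurons, and the two 1s are the variance of $X_i$ and the variance of $W_{1i}$: the input X follows a zero-mean, unit-standard-deviation distribution and W is standard normal. The variance of $H_{11}$ is therefore n, and its standard deviation is:

$$\operatorname{std}\left(\mathrm{H}_{11}\right)=\sqrt{D\left(\mathrm{H}_{11}\right)}=\sqrt{n}$$

So the output of the first hidden layer has variance n while the input has variance 1: one forward pass through a layer multiplies the variance by n and the standard deviation by $\sqrt{n}$. In the same way, from the first hidden layer to the second the standard deviation becomes n. With n = 256, $\sqrt{n}=16$, which matches the values printed for the first three layers (about 15.96, 256.62 and 4107.25). As the signal propagates, every layer scales the output by another factor of $\sqrt{n}$, until it exceeds the representable range and becomes nan.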
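The $\sqrt{n}$ growth per layer also gives a rough estimate of where float32 runs out of range. The sketch below uses plain Python and assumes the idealized prediction std = $\sqrt{n}^{k+1}$ after layer k (zero-indexed). Individual activations can overflow slightly before the standard deviation itself does, so treat the result as an estimate only.

```python
import math

n = 256                      # neurons per layer, as in the experiment
float32_max = 3.4028235e38   # largest finite float32 value

# Predicted standard deviation after layer k (0-indexed): sqrt(n) ** (k + 1)
predicted = [math.sqrt(n) ** (k + 1) for k in range(40)]

# First layer whose predicted standard deviation no longer fits in float32
first_overflow = next(k for k, s in enumerate(predicted) if s > float32_max)

print(predicted[:3])
print(first_overflow)
```

The first three predictions (16, 256, 4096) are close to the measured values, and the estimated overflow layer agrees with the layer at which the experiment reports nan.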

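The formula shows that the output scale is determined by three factors: n, the number of neurons in each layer; the variance of X, the input; and the variance of W, the layer's weights. To keep the scale unchanged from layer to layer, the variance has to equal 1, because the variances of successive layers are multiplied together, and a product of 1s is still 1.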

To make the variance of every layer equal to 1, we need:

$$D\left(\mathrm{H}_{1}\right)=n * D(X) * D(W)=1$$

from which the variance of W follows:

$$D(W)=\frac{1}{n} \Rightarrow \operatorname{std}(W)=\sqrt{\frac{1}{n}}$$

This makes the output variance of every layer equal to 1.

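Back in the code, initialize the weights from a zero-mean distribution with standard deviation $\sqrt{\frac{1}{n}}$ and observe the standard deviation of each layer's output again: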

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):  # stop once the std is nan
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, std=np.sqrt(1/self.neural_num))

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

The output of the code is:

layer:0, std:0.9974957704544067
layer:1, std:1.0024365186691284
layer:2, std:1.002745509147644
layer:3, std:1.0006227493286133
layer:4, std:0.9966009855270386
layer:5, std:1.019859790802002
layer:6, std:1.026173710823059
layer:7, std:1.0250457525253296
layer:8, std:1.0378952026367188
layer:9, std:1.0441951751708984
layer:10, std:1.0181655883789062
layer:11, std:1.0074602365493774
layer:12, std:0.9948930144309998
layer:13, std:0.9987586140632629
layer:14, std:0.9981392025947571
layer:15, std:1.0045733451843262
layer:16, std:1.0055204629898071
layer:17, std:1.0122840404510498
layer:18, std:1.0076017379760742
layer:19, std:1.000280737876892
layer:20, std:0.9943006038665771
layer:21, std:1.012800931930542
layer:22, std:1.012657642364502
layer:23, std:1.018149971961975
layer:24, std:0.9776086211204529
layer:25, std:0.9592394828796387
layer:26, std:0.9317858815193176
layer:27, std:0.9534041881561279
layer:28, std:0.9811319708824158
layer:29, std:0.9953019022941589
layer:30, std:0.9773916006088257
layer:31, std:0.9655940532684326
layer:32, std:0.9270440936088562
layer:33, std:0.9329946637153625
layer:34, std:0.9311841726303101
layer:35, std:0.9354336261749268
layer:36, std:0.9492132067680359
layer:37, std:0.9679954648017883
layer:38, std:0.9849981665611267
layer:39, std:0.9982335567474365
layer:40, std:0.9616852402687073
layer:41, std:0.9439758658409119
layer:42, std:0.9631161093711853
layer:43, std:0.958673894405365
layer:44, std:0.9675614237785339
layer:45, std:0.9837557077407837
layer:46, std:0.9867278337478638
layer:47, std:0.9920817017555237
layer:48, std:0.9650403261184692
layer:49, std:0.9991624355316162
layer:50, std:0.9946174025535583
layer:51, std:0.9662044048309326
layer:52, std:0.9827387928962708
layer:53, std:0.9887880086898804
layer:54, std:0.9932605624198914
layer:55, std:1.0237400531768799
layer:56, std:0.9702046513557434
layer:57, std:1.0045380592346191
layer:58, std:0.9943899512290955
layer:59, std:0.9900636076927185
layer:60, std:0.99446702003479
layer:61, std:0.9768352508544922
layer:62, std:0.9797843098640442
layer:63, std:0.9951220750808716
layer:64, std:0.9980446696281433
layer:65, std:1.0086933374404907
layer:66, std:1.0276142358779907
layer:67, std:1.0429234504699707
layer:68, std:1.0197855234146118
layer:69, std:1.0319130420684814
layer:70, std:1.0540012121200562
layer:71, std:1.026781439781189
layer:72, std:1.0331352949142456
layer:73, std:1.0666675567626953
layer:74, std:1.0413838624954224
layer:75, std:1.0733673572540283
layer:76, std:1.0404183864593506
layer:77, std:1.0344083309173584
layer:78, std:1.0022705793380737
layer:79, std:0.99835205078125
layer:80, std:0.9732587337493896
layer:81, std:0.9777462482452393
layer:82, std:0.9753198623657227
layer:83, std:0.9938382506370544
layer:84, std:0.9472599029541016
layer:85, std:0.9511011242866516
layer:86, std:0.9737769961357117
layer:87, std:1.005651831626892
layer:88, std:1.0043526887893677
layer:89, std:0.9889539480209351
layer:90, std:1.0130352973937988
layer:91, std:1.0030947923660278
layer:92, std:0.9993206262588501
layer:93, std:1.0342745780944824
layer:94, std:1.031973123550415
layer:95, std:1.0413124561309814
layer:96, std:1.0817031860351562
layer:97, std:1.128799557685852
layer:98, std:1.1617802381515503
layer:99, std:1.2215303182601929
tensor([[-1.0696, -1.1373,  0.5047,  ..., -0.4766,  1.5904, -0.1076],
        [ 0.4572,  1.6211,  1.9659,  ..., -0.3558, -1.1235,  0.0979],
        [ 0.3908, -0.9998, -0.8680,  ..., -2.4161,  0.5035,  0.2814],
        ...,
        [ 0.1876,  0.7971, -0.5918,  ...,  0.5395, -0.8932,  0.1211],
        [-0.0102, -1.5027, -2.6860,  ...,  0.6954, -0.1858, -0.8027],
        [-0.5871, -1.3739, -2.9027,  ...,  1.6734,  0.5094, -0.9986]],
       grad_fn=<MmBackward>)

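The output shows that the standard deviation stays around 1. A suitable weight initialization therefore keeps the output scale of a multi-layer fully connected network within a certain range, neither too large nor too small. The example shows that the output variance of every layer should be kept at 1, but activation functions also have to be taken into account. The following looks at weight initialization in the presence of activation functions.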

Now add a tanh activation function in forward() and observe the network output:

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.tanh(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):  # stop once the std is nan
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, std=np.sqrt(1/self.neural_num))

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

Running the code gives the following output:

layer:0, std:0.6273701786994934
layer:1, std:0.48910173773765564
layer:2, std:0.4099564850330353
layer:3, std:0.35637012124061584
layer:4, std:0.32117360830307007
layer:5, std:0.2981105148792267
layer:6, std:0.27730831503868103
layer:7, std:0.2589356303215027
layer:8, std:0.2468511462211609
layer:9, std:0.23721906542778015
layer:10, std:0.22171513736248016
layer:11, std:0.21079954504966736
layer:12, std:0.19820132851600647
layer:13, std:0.19069305062294006
layer:14, std:0.18555502593517303
layer:15, std:0.17953835427761078
layer:16, std:0.17485804855823517
layer:17, std:0.1702701896429062
layer:18, std:0.16508983075618744
layer:19, std:0.1591130942106247
layer:20, std:0.15480302274227142
layer:21, std:0.15263864398002625
layer:22, std:0.148549422621727
layer:23, std:0.14617665112018585
layer:24, std:0.13876433670520782
layer:25, std:0.13316625356674194
layer:26, std:0.12660598754882812
layer:27, std:0.12537944316864014
layer:28, std:0.12535445392131805
layer:29, std:0.1258980631828308
layer:30, std:0.11994212120771408
layer:31, std:0.11700888723134995
layer:32, std:0.11137298494577408
layer:33, std:0.11154613643884659
layer:34, std:0.10991233587265015
layer:35, std:0.10996390879154205
layer:36, std:0.10969001054763794
layer:37, std:0.10975217074155807
layer:38, std:0.11063199490308762
layer:39, std:0.11021336913108826
layer:40, std:0.10465587675571442
layer:41, std:0.10141163319349289
layer:42, std:0.1026025339961052
layer:43, std:0.10079070925712585
layer:44, std:0.10096712410449982
layer:45, std:0.10117629915475845
layer:46, std:0.10145658254623413
layer:47, std:0.09987485408782959
layer:48, std:0.09677786380052567
layer:49, std:0.099615179002285
layer:50, std:0.09867013245820999
layer:51, std:0.09398546814918518
layer:52, std:0.09388342499732971
layer:53, std:0.09352942556142807
layer:54, std:0.09336657077074051
layer:55, std:0.094817616045475
layer:56, std:0.08856320381164551
layer:57, std:0.09024856984615326
layer:58, std:0.0886448472738266
layer:59, std:0.08766943961381912
layer:60, std:0.08726290613412857
layer:61, std:0.08623497188091278
layer:62, std:0.08549781143665314
layer:63, std:0.08555219322443008
layer:64, std:0.08536665141582489
layer:65, std:0.08462796360254288
layer:66, std:0.08521939814090729
layer:67, std:0.08562128990888596
layer:68, std:0.08368432521820068
layer:69, std:0.08476376533508301
layer:70, std:0.08536301553249359
layer:71, std:0.08237562328577042
layer:72, std:0.08133520931005478
layer:73, std:0.08416961133480072
layer:74, std:0.08226993680000305
layer:75, std:0.08379077166318893
layer:76, std:0.08003699779510498
layer:77, std:0.07888863980770111
layer:78, std:0.07618381083011627
layer:79, std:0.07458438724279404
layer:80, std:0.07207277417182922
layer:81, std:0.07079191505908966
layer:82, std:0.0712786540389061
layer:83, std:0.07165778428316116
layer:84, std:0.06893911212682724
layer:85, std:0.06902473419904709
layer:86, std:0.07030880451202393
layer:87, std:0.07283663004636765
layer:88, std:0.07280216366052628
layer:89, std:0.07130247354507446
layer:90, std:0.07225216180086136
layer:91, std:0.0712454691529274
layer:92, std:0.07088855654001236
layer:93, std:0.0730612725019455
layer:94, std:0.07276969403028488
layer:95, std:0.07259569317102432
layer:96, std:0.0758652538061142
layer:97, std:0.07769152522087097
layer:98, std:0.07842093706130981
layer:99, std:0.08206242322921753
tensor([[-0.1103, -0.0739,  0.1278,  ..., -0.0508,  0.1544, -0.0107],
        [ 0.0807,  0.1208,  0.0030,  ..., -0.0385, -0.1887, -0.0294],
        [ 0.0321, -0.0833, -0.1482,  ..., -0.1133,  0.0206,  0.0155],
        ...,
        [ 0.0108,  0.0560, -0.1099,  ...,  0.0459, -0.0961, -0.0124],
        [ 0.0398, -0.0874, -0.2312,  ...,  0.0294, -0.0562, -0.0556],
        [-0.0234, -0.0297, -0.1155,  ...,  0.1143,  0.0083, -0.0675]],
       grad_fn=<TanhBackward>)

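The results show that the standard deviation of the layer outputs shrinks as the forward pass proceeds, which leads to vanishing gradients. To handle weight initialization in the presence of activation functions, the Xavier method and the Kaiming method were proposed.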

2. Xavier and Kaiming Methods

2.1 Xavier Method

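The 2010 paper "Understanding the difficulty of training deep feedforward neural networks" discusses in detail how to initialize weights when activation functions are present. The paper follows the variance-consistency principle, which keeps the variance of each layer's output as close to 1 as possible, and its analysis targets saturating activation functions such as Sigmoid and Tanh.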

The derivation in the paper gives the following two equations:

$$n_{i} * D(W)=1$$

$$n_{i+1} * D(W)=1$$

Here $n_i$ is the number of input neurons and $n_{i+1}$ is the number of output neurons. The two equations come from considering forward propagation and backward propagation. Combining them under the variance-consistency principle gives the weight variance:

$$D(W)=\frac{2}{n_{i}+n_{i+1}}$$

Xavier initialization usually uses a uniform distribution. To derive its bounds, let the lower bound be -a and the upper bound be a:

$$W \sim U[-a, a]$$

$$D(W)=\frac{(-a-a)^{2}}{12}=\frac{(2 a)^{2}}{12}=\frac{a^{2}}{3}$$

Combining the formulas above gives:

$$\frac{2}{n_{i}+n_{i+1}}=\frac{a^{2}}{3} \Rightarrow a=\frac{\sqrt{6}}{\sqrt{n_{i}+n_{i+1}}}$$

$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{i}+n_{i+1}}}, \frac{\sqrt{6}}{\sqrt{n_{i}+n_{i+1}}}\right]$$

The code below applies Xavier initialization and observes the layer outputs:

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.tanh(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):  # stop once the std is nan
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                a = np.sqrt(6 / (self.neural_num + self.neural_num))  # Xavier bound: sqrt(6 / (n_i + n_{i+1}))
                tanh_gain = nn.init.calculate_gain('tanh')
                a *= tanh_gain
                nn.init.uniform_(m.weight.data, -a, a)

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

Running the code shows that the layer outputs stabilize around a fixed value. PyTorch also implements Xavier initialization:

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.tanh(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):  # stop once the std is nan
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                tanh_gain = nn.init.calculate_gain('tanh')
                nn.init.xavier_uniform_(m.weight.data, gain=tanh_gain)

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

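Running the code above gives results similar to the hand-written Xavier initialization. Xavier provides an effective initialization for saturating activation functions such as Sigmoid and Tanh, but it no longer works for the non-saturating activation function ReLU.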
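The agreement between the hand-written bound and the built-in initializer can be checked directly on a single weight matrix. This is a minimal sketch: the 256 x 256 shape mirrors the experiment, and the variable names are illustrative.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 256, 256
gain = nn.init.calculate_gain('tanh')

# Bound from the derivation, scaled by the tanh gain
a = gain * math.sqrt(6 / (fan_in + fan_out))

w = torch.empty(fan_out, fan_in)
nn.init.xavier_uniform_(w, gain=gain)   # built-in Xavier uniform initialization

max_abs = w.abs().max().item()          # should not exceed the bound a
empirical_std = w.std().item()          # std of U[-a, a] is a / sqrt(3)

print('manual bound a:', a)
print('max |w|:', max_abs)
print('std:', empirical_std, 'expected:', a / math.sqrt(3))
```

The sampled weights stay inside [-a, a], and their standard deviation is close to $a/\sqrt{3}$, as the uniform-distribution variance formula predicts.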

2.2 Kaiming Initialization

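Because Xavier does not work well with the non-saturating ReLU activation function, Kaiming initialization was proposed in 2015. Following the variance-consistency principle, Kaiming initialization keeps the scale of the data in a suitable range, usually a variance of 1, and it targets ReLU and its variants.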

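For ReLU, the derivation gives the weight variance:

$$D(W)=\frac{2}{n_{i}}$$

where $n_i$ is the number of input neurons. For ReLU variants, whose negative half-axis has a nonzero slope, the weight variance should be:

$$D(W)=\frac{2}{\left(1+a^{2}\right) * n_{i}}$$

where a is the slope of the negative half-axis. For ReLU itself the slope is 0, that is, a = 0. The standard deviation of the weights is therefore:

$$\operatorname{std}(W)=\sqrt{\frac{2}{\left(1+a^{2}\right) * n_{i}}}$$

The code below implements Kaiming initialization: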

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)  # Kaiming initialization targets ReLU

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):  # stop once the std is nan
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, std=np.sqrt(2 / self.neural_num))  # Kaiming: std = sqrt(2 / n_i)

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

PyTorch's init module also implements Kaiming initialization:

nn.init.kaiming_normal_(m.weight.data)
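The standard-deviation formula, including the negative-slope term, can be compared against what nn.init.kaiming_normal_ produces. The sketch below is illustrative only: the helper `kaiming_std`, the tensor shape and the slope value 0.1 are assumptions.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
n_in = 256
slope = 0.1   # hypothetical negative slope of a Leaky ReLU


def kaiming_std(n_in, a=0.0):
    """std(W) = sqrt(2 / ((1 + a^2) * n_i))"""
    return math.sqrt(2.0 / ((1.0 + a ** 2) * n_in))


w_relu = torch.empty(1024, n_in)
nn.init.kaiming_normal_(w_relu)             # defaults: a=0, mode='fan_in'

w_leaky = torch.empty(1024, n_in)
nn.init.kaiming_normal_(w_leaky, a=slope)   # default nonlinearity is 'leaky_relu'

std_relu = w_relu.std().item()
std_leaky = w_leaky.std().item()
print(kaiming_std(n_in), std_relu)
print(kaiming_std(n_in, slope), std_leaky)
```

With the default mode='fan_in', the fan-in of a 2-D weight is its second dimension, so $n_i$ = 256 here.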

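3. Common Initialization Methods

Poor initialization causes vanishing or exploding gradients, and the model ultimately cannot be trained properly. To avoid this, the scale of each layer's output values has to be controlled. The derivation shows that the output variance of every layer should stay as close to 1 as possible, following the variance-consistency principle, so that the layer outputs remain near 1. PyTorch provides ten weight initialization methods: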

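  1. Xavier uniform distribution;
  2. Xavier normal distribution;
  3. Kaiming uniform distribution;
  4. Kaiming normal distribution;
  5. Uniform distribution;
  6. Normal distribution;
  7. Constant initialization;
  8. Orthogonal matrix initialization;
  9. Identity matrix initialization;
  10. Sparse matrix initialization;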

Which initialization method to choose depends on the specific problem.
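The ten methods in the list map onto functions in torch.nn.init. The following sketch applies each one to the same tensor; the 64 x 64 shape and all numeric arguments (bounds, std, constant value, sparsity) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
w = torch.empty(64, 64)  # hypothetical square weight matrix

initializers = {
    'Xavier uniform':  lambda t: nn.init.xavier_uniform_(t),
    'Xavier normal':   lambda t: nn.init.xavier_normal_(t),
    'Kaiming uniform': lambda t: nn.init.kaiming_uniform_(t),
    'Kaiming normal':  lambda t: nn.init.kaiming_normal_(t),
    'uniform':         lambda t: nn.init.uniform_(t, -0.1, 0.1),
    'normal':          lambda t: nn.init.normal_(t, mean=0.0, std=0.02),
    'constant':        lambda t: nn.init.constant_(t, 0.5),
    'orthogonal':      lambda t: nn.init.orthogonal_(t),
    'identity':        lambda t: nn.init.eye_(t),
    'sparse':          lambda t: nn.init.sparse_(t, sparsity=0.9),
}

stds = {}
for name, init in initializers.items():
    init(w)                       # every function initializes the tensor in place
    stds[name] = w.std().item()
    print('{:16s} std={:.4f}'.format(name, stds[name]))
```

All of these functions modify the tensor in place, which is what the trailing underscore in their names indicates.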

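Now look at a special function, nn.init.calculate_gain:

nn.init.calculate_gain(nonlinearity, param=None)

Main function

  • Computes the variance scaling factor (gain) of an activation function;

Main parameters

  • nonlinearity: name of the activation function;
  • param: parameter of the activation function, such as negative_slope for Leaky ReLU;

The scaling factor is the scale of the input data divided by the scale of the output data after the activation function. The code that follows measures it as a ratio of standard deviations.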

The following code illustrates what this function does:

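import torch
import torch.nn as nn

x = torch.randn(10000)
out = torch.tanh(x)

gain = x.std() / out.std()  # empirical gain: input std divided by output std
print('gain:{}'.format(gain))

tanh_gain = nn.init.calculate_gain('tanh')  # value built into PyTorch
print('tanh_gain in PyTorch:', tanh_gain)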
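For comparison, calculate_gain can be queried for several activation functions at once. This is a short sketch; the Leaky ReLU slope of 0.2 is an arbitrary value chosen for illustration.

```python
import math
import torch.nn as nn

gains = {
    'linear':     nn.init.calculate_gain('linear'),
    'sigmoid':    nn.init.calculate_gain('sigmoid'),
    'tanh':       nn.init.calculate_gain('tanh'),
    'relu':       nn.init.calculate_gain('relu'),
    'leaky_relu': nn.init.calculate_gain('leaky_relu', 0.2),  # hypothetical slope
}

for name, g in gains.items():
    print('{:10s} gain={:.4f}'.format(name, g))
```

The values for relu and leaky_relu are $\sqrt{2}$ and $\sqrt{2/(1+a^2)}$, consistent with the Kaiming standard-deviation formula, and the tanh value is 5/3.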