Setting the L2 regularization coefficient too high causes vanishing gradients

 lamda = 3  # regularization penalty coefficient
 # weight gradient: the data term plus the L2 penalty term lamda * w, averaged over the batch
 w_grad = (np.dot(self.input.T, grad) + self.lamda * self.w) / self.batch_size

Here the regularization coefficient is set to 3.
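To see why such a large lamda starves the network of gradient, write out the SGD update this line implies, with learning rate η (a standard derivation, not code from the post; the division by batch_size just folds into the effective η and λ):

\[
w \leftarrow w - \eta\left(\frac{\partial L}{\partial w} + \lambda w\right) = (1 - \eta\lambda)\,w - \eta\,\frac{\partial L}{\partial w}
\]

Every step multiplies the weights by (1 − ηλ), so the larger lamda is, the faster the weights decay toward zero. And since the backpropagated gradient is scaled by the weight matrices of every downstream layer, small weights shrink the gradient multiplicatively with depth.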

With a network of four ReLU hidden layers, this directly causes the gradients to vanish:

1%| | 2/200 [00:17<28:09, 8.53s/it]loss: 0.23035251606429571
accuracy: 0.09375
mean gradient: -1.452481815897801e-11
2%|▏ | 3/200 [00:26<28:26, 8.66s/it]loss: 0.23077135760414888
accuracy: 0.1015625
mean gradient: 1.422842658558051e-14
2%|▏ | 4/200 [00:34<27:49, 8.52s/it]loss: 0.23046438461223917
accuracy: 0.10546875
mean gradient: -8.111952281250118e-18
2%|▎ | 5/200 [00:41<26:23, 8.12s/it]loss: 0.2301827048850293
accuracy: 0.12109375
mean gradient: -6.3688796773963155e-21
3%|▎ | 6/200 [00:49<25:47, 7.98s/it]loss: 0.23023365984639205
accuracy: 0.125
mean gradient: -1.2646968613522145e-23
4%|▎ | 7/200 [00:56<25:00, 7.77s/it]loss: 0.23074116618703105
accuracy: 0.08984375
mean gradient: 7.443049613238094e-26
4%|▍ | 8/200 [01:03<24:34, 7.68s/it]loss: 0.23025406010680918
accuracy: 0.11328125
mean gradient: 5.544761930793375e-29
4%|▍ | 9/200 [01:11<24:14, 7.62s/it]loss: 0.23057808569519062
accuracy: 0.08984375
mean gradient: -2.505663387779514e-30
5%|▌ | 10/200 [01:19<24:35, 7.76s/it]loss: 0.23014966000613057
accuracy: 0.10546875
mean gradient: -1.588181439704063e-31

The mean gradient keeps falling, from around 1e-5 early in training all the way down to 1e-31.

Changing lamda to 1 alleviates the problem:

0%| | 0/200 [00:00<?, ?it/s]loss: 0.2284026479024262
accuracy: 0.13671875
mean gradient: 4.6649560959205905e-05
0%| | 1/200 [00:07<24:21, 7.34s/it]loss: 0.1554314625472287
accuracy: 0.4609375
mean gradient: -0.0002773582179562886
1%| | 2/200 [00:14<23:59, 7.27s/it]loss: 0.18376994316806905
accuracy: 0.31640625
mean gradient: 8.423075286773206e-05
2%|▏ | 3/200 [00:21<23:30, 7.16s/it]loss: 0.12577617122257392
accuracy: 0.53515625
mean gradient: 0.00047661977909027993
2%|▏ | 4/200 [00:28<23:09, 7.09s/it]loss: 0.12035617394653744
accuracy: 0.515625
mean gradient: 1.5361318373022455e-05
2%|▎ | 5/200 [00:35<22:52, 7.04s/it]loss: 0.11590587695113908
accuracy: 0.5546875
mean gradient: 4.901066522064529e-05
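
The contrast is easy to reproduce outside the full training pipeline. Below is a minimal numpy sketch (my own toy reproduction, not the post's code): a 4-hidden-layer ReLU MLP trained on one fixed random batch with squared error, using the same L2-regularized weight-gradient formula as above. The layer sizes, learning rate, and step count are illustrative assumptions; it reports the mean absolute first-layer gradient, which ends up orders of magnitude smaller with lamda = 3 than with lamda = 1.

 import numpy as np

 def mean_first_layer_grad(lamda, steps=1500, lr=0.1, batch=64, seed=0):
     """Train a small 4-hidden-layer ReLU MLP on fixed random data and
     return the mean absolute gradient of the first weight matrix."""
     rng = np.random.default_rng(seed)
     sizes = [100, 64, 64, 64, 64, 10]          # 4 ReLU hidden layers
     Ws = [rng.normal(0, np.sqrt(2 / m), (m, n)) for m, n in zip(sizes, sizes[1:])]
     X = rng.normal(size=(batch, sizes[0]))     # one fixed toy batch
     Y = np.eye(sizes[-1])[rng.integers(0, sizes[-1], batch)]
     for _ in range(steps):
         # forward pass, keeping each layer's input for backprop
         acts, h = [X], X
         for W in Ws[:-1]:
             h = np.maximum(0.0, h @ W)         # ReLU
             acts.append(h)
         grad = h @ Ws[-1] - Y                  # squared-error output gradient
         # backward pass using the post's L2-regularized weight gradient
         w_grads = []
         for W, a in zip(reversed(Ws), reversed(acts)):
             w_grads.append((a.T @ grad + lamda * W) / batch)
             grad = (grad @ W.T) * (a > 0)      # ReLU mask (unused on last pass)
         for W, g in zip(Ws, reversed(w_grads)):
             W -= lr * g                        # in-place SGD update
     return np.abs(w_grads[-1]).mean()          # first layer's gradient

 for lam in (3, 1):
     print(f"lamda={lam}: mean |first-layer gradient| = {mean_first_layer_grad(lam):.3e}")

The exact magnitudes depend on the hyperparameters, but the lamda = 3 run drives the first-layer gradient far smaller, matching the logs above.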

The lamda term improves the model's generalization, but it must not be set too high, or it will cause vanishing gradients; nor should it be set too low, which leaves the weights unconstrained and can lead to exploding gradients.
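
Both failure modes come from the same depth effect (a rough back-of-the-envelope argument, not from the post). Backpropagating through D ReLU layers scales the gradient by a chain of weight matrices:

\[
\frac{\partial L}{\partial h_0} \;=\; \frac{\partial L}{\partial h_D}\, M_D W_D^{\top} \cdots M_1 W_1^{\top}
\]

where M_k is the ReLU mask of layer k. When lamda is too large, the (1 − ηλ) decay pushes every ‖W_k‖ toward zero and the product shrinks exponentially with depth; when lamda is too small, the weights are free to grow, and the same product can grow exponentially instead.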
