Two Normalization Techniques in Neural Networks: Batch Normalization and Weight Normalization

Original post

2020-07-04 03:13

Batch Normalization

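Principle

BN normalizes the data within each mini-batch. For a mini-batch $x_1, \ldots, x_m$ it computes the mini-batch mean and variance, normalizes each input, and then applies a learned scale $\gamma$ and shift $\beta$:
$\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m}x_i$
$\sigma^{2}_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_{\mathcal{B}})^2$
$\hat{x}_i = \frac{x_i-\mu_{\mathcal{B}}}{\sqrt{\sigma^{2}_{\mathcal{B}}+\epsilon}}$
$y_i = \gamma\hat{x}_i+\beta$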

In essence, BN is a mapping from $x_i$ to $y_i$ parameterized by $\gamma$ and $\beta$:
$BN_{\gamma,\beta}: x_{1 \ldots m} \rightarrow y_{1 \ldots m}$
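As a minimal NumPy sketch of this mapping, the training-time forward pass over a mini-batch of shape (m, features) could look like the following. The function name `batch_norm_forward` and the default `eps` are illustrative assumptions, not taken from the original post.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BN over a mini-batch x of shape (m, features)."""
    mu = x.mean(axis=0)                     # mini-batch mean, one per feature
    var = x.var(axis=0)                     # biased mini-batch variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    y = gamma * x_hat + beta                # learned scale and shift
    return y, x_hat, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))
y, x_hat, mu, var = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With $\gamma = 1$ and $\beta = 0$, each output feature has approximately zero mean and unit variance over the mini-batch.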
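Backpropagation through this mapping follows from the chain rule. Writing $\mu_{\mathcal{B}}$ and $\sigma^{2}_{\mathcal{B}}$ for the mini-batch mean and variance, $\hat{x}_i$ for the normalized input, and $L$ for the loss:
$\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i}\cdot\gamma$
$\frac{\partial L}{\partial \sigma^{2}_{\mathcal{B}}} = \sum_{i=1}^{m}\frac{\partial L}{\partial \hat{x}_i}\cdot(x_i-\mu_{\mathcal{B}})\cdot\left(-\frac{1}{2}\right)(\sigma^{2}_{\mathcal{B}}+\epsilon)^{-3/2}$
$\frac{\partial L}{\partial \mu_{\mathcal{B}}} = \left(\sum_{i=1}^{m}\frac{\partial L}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma^{2}_{\mathcal{B}}+\epsilon}}\right)+\frac{\partial L}{\partial \sigma^{2}_{\mathcal{B}}}\cdot\frac{\sum_{i=1}^{m}-2(x_i-\mu_{\mathcal{B}})}{m}$
$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i}\cdot\frac{1}{\sqrt{\sigma^{2}_{\mathcal{B}}+\epsilon}}+\frac{\partial L}{\partial \sigma^{2}_{\mathcal{B}}}\cdot\frac{2(x_i-\mu_{\mathcal{B}})}{m}+\frac{\partial L}{\partial \mu_{\mathcal{B}}}\cdot\frac{1}{m}$
$\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m}\frac{\partial L}{\partial y_i}\cdot\hat{x}_i$
$\frac{\partial L}{\partial \beta} = \sum_{i=1}^{m}\frac{\partial L}{\partial y_i}$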
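The backward pass can be sketched in NumPy and compared against finite differences. This is a sketch under assumptions: the helper names (`bn_forward`, `bn_backward`, `numerical_grad`) are hypothetical, and the gradient with respect to $x$ is written in the compact form obtained by substituting the chain-rule terms into one another.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Training-time BN; returns the output and the values needed for backprop."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mu) * inv_std
    return gamma * x_hat + beta, (x_hat, inv_std, gamma)

def bn_backward(dy, cache):
    """Gradients of the loss w.r.t. x, gamma, beta, given dy = dL/dy."""
    x_hat, inv_std, gamma = cache
    m = dy.shape[0]
    dbeta = dy.sum(axis=0)
    dgamma = (dy * x_hat).sum(axis=0)
    dx_hat = dy * gamma
    # Compact form: the mean and variance paths are folded into one expression
    dx = inv_std / m * (m * dx_hat
                        - dx_hat.sum(axis=0)
                        - x_hat * (dx_hat * x_hat).sum(axis=0))
    return dx, dgamma, dbeta

def numerical_grad(f, arr, h=1e-5):
    """Central-difference gradient of the scalar function f() w.r.t. arr (modified in place)."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        old = arr[idx]
        arr[idx] = old + h
        f_plus = f()
        arr[idx] = old - h
        f_minus = f()
        arr[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
dy = rng.normal(size=(6, 3))   # stand-in for the upstream gradient dL/dy

def loss():
    # L = sum(y * dy), so dL/dy is exactly dy
    return np.sum(bn_forward(x, gamma, beta)[0] * dy)

_, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(dy, cache)
dx_num = numerical_grad(loss, x)
dgamma_num = numerical_grad(loss, gamma)
dbeta_num = numerical_grad(loss, beta)
```

Comparing `dx`, `dgamma`, and `dbeta` with their numerical counterparts is a quick way to confirm that the analytic formulas were transcribed correctly.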

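Using BN at test time

At test time, inference usually runs on a single sample, so no mini-batch statistics are available. The mean and variance used by the BN layer therefore come from the training data: sample many mini-batches from the training set, each of size m, and compute
$E(x) \leftarrow E_{\mathcal{B}}(\mu_{\mathcal{B}})$
$Var(x) \leftarrow \frac{m}{m-1}E_{\mathcal{B}}(\sigma^{2}_{\mathcal{B}})$
The factor $\frac{m}{m-1}$ turns the biased mini-batch variance into an unbiased estimate.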
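Doing this as a separate step is cumbersome and computationally expensive. In practice the mean and variance are tracked during training and updated with a moving average, giving the moving mean $E_{moving}$ and the moving variance $Var_{moving}$:
$E_{moving}(x)=\alpha \cdot E_{moving}(x)+(1-\alpha)\cdot E_{sample}$
$Var_{moving}(x)=\alpha \cdot Var_{moving}(x)+(1-\alpha)\cdot Var_{sample}$
Here $E_{sample}$ is the mean and $Var_{sample}$ the variance of the current mini-batch. $\alpha$ is the momentum (forgetting factor), written as $\alpha$ to avoid confusion with the batch size $m$; a common default, for example in Keras, is 0.99.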
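The moving-average update can be sketched as follows. The function name `update_moving_stats` and the starting values (zero mean, unit variance) are assumptions made for illustration.

```python
import numpy as np

def update_moving_stats(moving_mean, moving_var, batch, momentum=0.99):
    """One moving-average update from a mini-batch of shape (m, features)."""
    sample_mean = batch.mean(axis=0)
    sample_var = batch.var(axis=0)
    moving_mean = momentum * moving_mean + (1 - momentum) * sample_mean
    moving_var = momentum * moving_var + (1 - momentum) * sample_var
    return moving_mean, moving_var

rng = np.random.default_rng(0)
moving_mean, moving_var = np.zeros(3), np.ones(3)   # assumed starting values
for _ in range(2000):
    batch = rng.normal(loc=5.0, scale=2.0, size=(32, 3))
    moving_mean, moving_var = update_moving_stats(moving_mean, moving_var, batch)
```

After enough steps the moving statistics settle near the population mean (5) and variance (4) of the data. A momentum close to 1 makes the estimate smoother but slower to adapt.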
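With the learned parameters $\gamma$ and $\beta$ and the statistics obtained above, the test-time output is a fixed affine function of $x$:
$y = \frac{\gamma}{\sqrt{Var(x)+\epsilon}}\cdot x+\left(\beta-\frac{\gamma \cdot E(x)}{\sqrt{Var(x)+\epsilon}}\right)$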
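Because the statistics are fixed at test time, the BN layer collapses into a per-feature scale and offset that can be precomputed once. The sketch below checks this; the names `bn_inference` and `fold_bn` and the numeric values are illustrative assumptions.

```python
import numpy as np

def bn_inference(x, gamma, beta, mean, var, eps=1e-5):
    """Normalize with fixed statistics, then scale and shift."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def fold_bn(gamma, beta, mean, var, eps=1e-5):
    """Precompute scale a and offset c so that BN(x) = a * x + c."""
    a = gamma / np.sqrt(var + eps)
    c = beta - a * mean
    return a, c

gamma = np.array([1.5, 0.5])
beta = np.array([0.1, -0.2])
moving_mean = np.array([2.0, -1.0])
moving_var = np.array([4.0, 0.25])
x = np.array([[1.0, 0.0], [3.0, -2.0]])

a, c = fold_bn(gamma, beta, moving_mean, moving_var)
y_folded = a * x + c
y_direct = bn_inference(x, gamma, beta, moving_mean, moving_var)
```

Both paths give the same output, which is why BN adds almost no cost at inference.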

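BN in CNNs

For convolutional layers, BN operates per feature map. Given an input of shape $m \times p \times q \times d$, the statistics of each feature map are computed over $m \cdot p \cdot q$ values, and there are $d$ pairs of parameters $\gamma$ and $\beta$.

Example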
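A minimal NumPy sketch of the convolutional case, assuming a channels-last layout (m, p, q, d); the function name `batch_norm_conv` is hypothetical:

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """BN for a tensor x of shape (m, p, q, d): one mean and variance per feature map."""
    mu = x.mean(axis=(0, 1, 2))     # shape (d,), each averaged over m*p*q values
    var = x.var(axis=(0, 1, 2))     # shape (d,)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta     # gamma and beta have shape (d,)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(4, 5, 5, 8))   # m=4, p=q=5, d=8
y = batch_norm_conv(x, gamma=np.ones(8), beta=np.zeros(8))
channel_mean = y.mean(axis=(0, 1, 2))
channel_var = y.var(axis=(0, 1, 2))
```

Sharing one set of statistics across all spatial positions of a feature map keeps the normalization consistent with the weight sharing of the convolution itself.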

[1]. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Weight Normalization

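Theory:

WN normalizes the weights, in clear contrast to BN, which normalizes the data. BN uses the local statistics of a mini-batch as a stand-in for the global statistics, which introduces noise. WN does not have this problem, so besides CNNs it can also be applied to noise-sensitive settings such as RNNs, generative models, and deep reinforcement learning.
Consider a single neuron:
$y =\phi(w \cdot x+b)$
For training with SGD, the weight vector $w$ is decoupled into a scalar $g$, which equals its Euclidean norm, and a direction vector $v$:
$w = \frac{g}{||v||}v$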
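A small NumPy sketch of the reparameterization (the helper name `weight_norm` is an assumption) shows that the length of $w$ is controlled only by $g$, while $v$ contributes only a direction:

```python
import numpy as np

def weight_norm(v, g):
    """Return w = g * v / ||v||, so that ||w|| equals g."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])                # ||v|| = 5
g = 2.0
w = weight_norm(v, g)                   # direction of v, length g
w_rescaled = weight_norm(10.0 * v, g)   # rescaling v leaves w unchanged
```

Because $w$ is invariant to the scale of $v$, the norm of the effective weight can only change through $g$.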
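During backpropagation, $g$ and $v$ are both parameters of the loss function $L$. Their gradients, as given in the Weight Normalization paper [1], are:
$\nabla_g L = \frac{\nabla_w L \cdot v}{||v||}$
$\nabla_v L = \frac{g}{||v||}\nabla_w L-\frac{g\nabla_g L}{||v||^2}v$

The gradient with respect to $v$ can also be written as:
$\nabla_v L = \frac{g}{||v||}M_w\nabla_w L, \quad M_w = I-\frac{ww^{T}}{||w||^2}$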

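This form of the gradient achieves two things:

It scales the weight gradient by $g/||v||$.
It projects the gradient $\nabla_w L$ away from the current weight vector $w$.
Both effects can speed up convergence. This is similar to optimizers such as Momentum or Adam [2]: WN does not strictly learn a separate learning rate for each parameter, but the effect is comparable.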
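Both properties can be checked numerically. The sketch below uses the gradient expressions from the Weight Normalization paper, restated in the docstrings; the function names and the random test values are assumptions for illustration.

```python
import numpy as np

def wn_grads(v, g, grad_w):
    """Gradients w.r.t. g and v, given grad_w = dL/dw and w = g * v / ||v||.

    grad_g = (grad_w . v) / ||v||
    grad_v = (g / ||v||) * grad_w - (g * grad_g / ||v||^2) * v
    """
    v_norm = np.linalg.norm(v)
    grad_g = grad_w @ v / v_norm
    grad_v = (g / v_norm) * grad_w - (g * grad_g / v_norm**2) * v
    return grad_g, grad_v

def wn_grad_v_projected(v, g, grad_w):
    """Equivalent form: grad_v = (g / ||v||) * M_w @ grad_w, with M_w = I - w w^T / ||w||^2."""
    v_norm = np.linalg.norm(v)
    w = g * v / v_norm
    M_w = np.eye(len(w)) - np.outer(w, w) / (w @ w)   # projects out the w direction
    return (g / v_norm) * (M_w @ grad_w)

rng = np.random.default_rng(0)
v = rng.normal(size=5)
g = 1.7
grad_w = rng.normal(size=5)   # stand-in for dL/dw from the rest of the network

grad_g, grad_v = wn_grads(v, g, grad_w)
grad_v_alt = wn_grad_v_projected(v, g, grad_w)
w = g * v / np.linalg.norm(v)
overlap = grad_v @ w          # component of grad_v along w
```

The two expressions agree, and `overlap` is zero up to rounding error: the update to $v$ is always orthogonal to the current weight vector.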

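Data-dependent initialization of the parameters

BN rescales its input according to the data at every step, which makes optimization very robust. WN lacks this property, so initialization becomes very important.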
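For each neuron, compute on the first mini-batch:
$t = \frac{v \cdot x}{||v||}, \quad y = \phi\left(\frac{t-\mu[t]}{\sigma[t]}\right)$
where $\mu[t]$ and $\sigma[t]$ are the mean and standard deviation of the pre-activation $t$ over that mini-batch. The parameters are then initialized as follows:
$v$ is sampled from a normal distribution with mean 0 and standard deviation 0.05;
$g$ and $b$ are initialized from the statistics of the first mini-batch: $g \leftarrow \frac{1}{\sigma[t]}$, $b \leftarrow -\frac{\mu[t]}{\sigma[t]}$.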
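The initialization can be sketched in NumPy for a dense layer. This is an illustrative sketch, not the TensorFlow code from the post: the function names, the `seed` parameter, and the test data are assumptions.

```python
import numpy as np

def wn_data_dependent_init(x, n_units, init_scale=1.0, seed=0):
    """Initialize v, g, b of a weight-normalized dense layer from a first batch x of shape (m, n_in)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(loc=0.0, scale=0.05, size=(x.shape[1], n_units))
    t = x @ (v / np.linalg.norm(v, axis=0))    # pre-activations with g = 1, b = 0
    mu, sigma = t.mean(axis=0), t.std(axis=0)
    g = init_scale / sigma                     # g <- 1 / sigma[t]
    b = -mu * init_scale / sigma               # b <- -mu[t] / sigma[t]
    return v, g, b

def wn_dense(x, v, g, b):
    """Pre-activation of a weight-normalized dense layer: x @ (g * v / ||v||) + b."""
    return x @ (g * v / np.linalg.norm(v, axis=0)) + b

rng = np.random.default_rng(1)
x = rng.normal(loc=4.0, scale=3.0, size=(64, 10))   # first mini-batch
v, g, b = wn_data_dependent_init(x, n_units=5)
t = wn_dense(x, v, g, b)
```

On the batch used for initialization, every unit's pre-activation then has zero mean and unit standard deviation, which mimics what BN would produce on that batch.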

Example

Fully connected layer

import tensorflow as tf  # TensorFlow 1.x graph-mode API (tf.get_variable)

def dense(x_, n_filters, init_scaler=1., init=False):
    # v: direction parameters drawn from N(0, 0.05^2); g: scale; b: bias
    V = tf.get_variable("v", shape=[int(x_.get_shape()[-1]), n_filters], dtype=tf.float32,
                        initializer=tf.random_normal_initializer(0, 0.05), trainable=True)
    g = tf.get_variable("g", shape=[n_filters], dtype=tf.float32, initializer=tf.constant_initializer(1.), trainable=True)
    b = tf.get_variable("b", shape=[n_filters], dtype=tf.float32, initializer=tf.constant_initializer(0.), trainable=True)

    # Multiply x by the (unnormalized) direction vectors
    x = tf.matmul(x_, V)
    # Scale factor g / ||v||: the L2 norm of each column of V needs reduce_sum, not reduce_mean
    scaler = g / tf.sqrt(tf.reduce_sum(tf.square(V), [0]))
    x = tf.reshape(scaler, [1, n_filters]) * x + tf.reshape(b, [1, n_filters])

    if init:
        # Mean and variance of the first batch
        init_mean, init_v = tf.nn.moments(x, [0])
        # Data-dependent scale factor
        init_scaler = init_scaler / tf.sqrt(init_v + 1e-10)
        # tf.control_dependencies orders the graph: first assign g*init_scaler to g and add
        # -init_mean*init_scaler to b (which initializes g and b), then run the ops below.
        with tf.control_dependencies([g.assign(g * init_scaler), b.assign_add(-init_mean * init_scaler)]):
            x = tf.matmul(x_, V)
            scaler = g / tf.sqrt(tf.reduce_sum(tf.square(V), [0]))
            x = tf.reshape(scaler, [1, n_filters]) * x + tf.reshape(b, [1, n_filters])

    return x

Convolutional layer

def conv2d(x_, n_filters, filter_size=[3,3], stride=[1,1], pad="SAME", init_scaler=1., init=False):
    V = tf.get_variable("v", shape=filter_size+[int(x_.get_shape()[-1]), n_filters], dtype=tf.float32,
                        initializer=tf.random_normal_initializer(0, 0.05), trainable=True)
    g = tf.get_variable("g", shape=[n_filters], dtype=tf.float32, initializer=tf.constant_initializer(1.),
                        trainable=True)
    b = tf.get_variable("b", shape=[n_filters], dtype=tf.float32, initializer=tf.constant_initializer(0.),
                        trainable=True)

    # Weight normalization: w = g * v / ||v||, with one norm per output filter
    w = tf.reshape(g, [1, 1, 1, n_filters]) * tf.nn.l2_normalize(V, [0, 1, 2])
    # Convolution; tf.nn.bias_add takes the bias as its second argument
    x = tf.nn.bias_add(tf.nn.conv2d(x_, w, [1]+stride+[1], pad), b)

    # Data-dependent initialization
    if init:
        # Mean and variance of the first batch, per output channel
        init_mean, init_v = tf.nn.moments(x, [0, 1, 2])
        # Data-dependent scale factor
        init_scaler = init_scaler / tf.sqrt(init_v + 1e-10)
        with tf.control_dependencies([g.assign(g * init_scaler), b.assign_add(-init_mean * init_scaler)]):
            # Weight normalization with the updated g
            w = tf.reshape(g, [1, 1, 1, n_filters]) * tf.nn.l2_normalize(V, [0, 1, 2])
            # Convolution with the updated b
            x = tf.nn.bias_add(tf.nn.conv2d(x_, w, [1]+stride+[1], pad), b)

    return x

References:
[1]. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
[2]. An overview of gradient descent optimization algorithms
[3]. openAI/WeightNormal
