Many Chinese-language articles explain Batch Normalization (BN), but none of them explains clearly how BN actually works in a convolutional network.
An intermediate feature map in a convolutional network has shape [N, H, W, C], where N is the mini-batch size, H and W are the height and width, and C is the number of channels.
For BN, the effective batch size is not N but N\*H\*W: within one mini-batch, normalization is actually performed over N\*H\*W C-dimensional vectors, yielding one mean and one variance per channel.
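This per-channel view can be checked with a minimal sketch (plain NumPy, not the actual TensorFlow kernel; the shapes are made up for illustration):

```python
import numpy as np

x = np.random.randn(8, 4, 4, 16)           # feature map of shape [N, H, W, C]

# BN statistics are taken over the N*H*W positions of each channel.
mean = x.mean(axis=(0, 1, 2))              # shape (16,): one value per channel
var = x.var(axis=(0, 1, 2))                # shape (16,)

# Equivalent view: flatten to N*H*W rows of C-dimensional vectors.
flat = x.reshape(-1, x.shape[-1])          # shape (8*4*4, 16) = (128, 16)
assert np.allclose(mean, flat.mean(axis=0))
assert np.allclose(var, flat.var(axis=0))

eps = 1e-5
x_hat = (x - mean) / np.sqrt(var + eps)    # broadcasts over [N, H, W, C]
```

So the "mini-batch" that BN sees per channel really has N\*H\*W samples, not N.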
The implementation in the BN paper and the actual implementation in TensorFlow are quite different.
In the BN paper, the mean and variance used at inference time are statistics accumulated over all training mini-batches:
$$\mu_{inference}=\dfrac{1}{M}\sum_{i=1}^{M}\mu_{i}$$
where $M$ is the number of training mini-batches and $\mu_{i}$ is the mean of the $i$-th mini-batch.
The variance is obtained the same way (the paper additionally applies the unbiased correction $\frac{m}{m-1}$, with $m$ the mini-batch size).
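As a toy illustration (plain NumPy; the batch values and the mini-batch size `m` are made up), the paper's inference statistics are just averages of the per-batch statistics collected during training:

```python
import numpy as np

m = 4                                      # assumed mini-batch size
# Made-up per-batch statistics saved during training (2 channels, 2 batches).
batch_means = [np.array([0.1, 0.3]), np.array([0.3, 0.5])]
batch_vars = [np.array([1.0, 2.0]), np.array([2.0, 4.0])]

# Inference mean: average of the per-batch means.
mu_inference = np.mean(batch_means, axis=0)                  # -> [0.2, 0.4]
# Inference variance: average of the per-batch variances, with the
# paper's unbiased correction m / (m - 1).
var_inference = m / (m - 1) * np.mean(batch_vars, axis=0)    # -> [2.0, 4.0]
```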
The paper's scheme has an obvious problem: the mean and variance used during training are not the same as the mean and variance used at test time.
So in the TensorFlow implementation described below, training does not simply use the raw mini-batch statistics: it maintains running (population) statistics, updated as running\_stat = decay \* running\_stat + (1 - decay) \* current mini-batch stat, and uses these running statistics for normalization instead.
This keeps the mean and variance used at training time and at test time consistent.
Reference link: https://www.quora.com/How-does-batch-normalization-behave-differently-at-training-time-and-test-time
I have read the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (http://arxiv.org/pdf/1502.03167v...).
I realized that Batch Normalization (BN) calculates the mean and variance differently at training time and at test time.
In training: the mean and variance are calculated from the mini-batch (see the picture below).
In test: the mean and variance are the population statistics, rather than mini-batch statistics.
The actual implementation differs slightly from the original paper. For implementation details, see the tutorial Implementing Batch Normalization in Tensorflow from R2RT.
In the training phase:
Step 1, the model calculates batch_mean and batch_var from the input batch:

```python
batch_mean, batch_var = tf.nn.moments(x, [0])
```
Step 2, pop_mean and pop_var are updated from batch_mean and batch_var with a decay factor; in this tutorial decay is 0.999:

```python
decay = 0.999
global_mean = tf.assign(pop_mean, pop_mean * decay + batch_mean * (1 - decay))
global_var = tf.assign(pop_var, pop_var * decay + batch_var * (1 - decay))
```
Step 3, we can normalize the inputs using global_mean and global_var:

```python
# Apply the initial batch normalizing transform
z1_hat = (x - global_mean) / tf.sqrt(global_var + epsilon)
```
Step 4, apply scale and shift:

```python
BN1 = scale1 * z1_hat + beta1
```
In the test phase:
The model does not update pop_mean and pop_var, so steps 1 and 2 of the training phase are skipped; it simply uses pop_mean and pop_var directly:

```python
z1_hat = (x - pop_mean) / tf.sqrt(pop_var + epsilon)
BN1 = scale1 * z1_hat + beta1
```
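The four training steps and the test-time path above can be sketched end to end in plain NumPy (a hedged re-implementation, not the R2RT/TensorFlow code; the names decay, epsilon, scale1, and beta1 follow the quoted answer, and the toy input data is made up):

```python
import numpy as np

decay, epsilon = 0.999, 1e-3
pop_mean = np.zeros(2)                    # running (population) statistics
pop_var = np.ones(2)
scale1, beta1 = np.ones(2), np.zeros(2)   # learned scale / shift parameters

def batch_norm(x, training):
    global pop_mean, pop_var
    if training:
        # Step 1: batch statistics (the NumPy analogue of tf.nn.moments).
        batch_mean, batch_var = x.mean(axis=0), x.var(axis=0)
        # Step 2: update the running statistics with the decay factor.
        pop_mean = pop_mean * decay + batch_mean * (1 - decay)
        pop_var = pop_var * decay + batch_var * (1 - decay)
    # Steps 3-4 (and the whole test phase): normalize with the running
    # statistics, as in the quoted answer, then scale and shift.
    z1_hat = (x - pop_mean) / np.sqrt(pop_var + epsilon)
    return scale1 * z1_hat + beta1

x = np.random.randn(64, 2) * 3 + 5        # toy mini-batch, mean ~5, std ~3
for _ in range(100):
    y_train = batch_norm(x, training=True)
y_test = batch_norm(x, training=False)

# Training and test use the same running statistics, so the last
# training output matches the test output exactly.
assert np.allclose(y_train, y_test)
```

Note that many other implementations normalize with batch_mean and batch_var during training and use the running statistics only at test time; the sketch above follows the variant described in the quoted answer, where both phases share the running statistics.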