Understanding Batch Normalization (BN)

There are many Chinese-language resources that explain BN, but none of them makes clear how BN actually works in a convolutional network.

The shape of an intermediate feature map in a convolutional network is [N, H, W, C]: N is the mini-batch size, H and W are the height and width, and C is the number of channels.

For BN, the effective mini-batch size is not N but N*H*W: within one mini-batch, normalization is actually computed over N*H*W C-dimensional vectors, yielding one mean and one variance per channel.
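As a quick sanity check, here is a NumPy sketch (the shapes and values are made up) showing that the per-channel statistics are taken over all N*H*W positions:

```python
import numpy as np

# Hypothetical conv feature map with shape [N, H, W, C].
np.random.seed(0)
x = np.random.randn(4, 8, 8, 16)  # N=4, H=8, W=8, C=16

# BN in a conv layer reduces over the (N, H, W) axes, producing
# one mean and one variance per channel.
mean = x.mean(axis=(0, 1, 2))   # shape (16,)
var = x.var(axis=(0, 1, 2))     # shape (16,)

# Equivalent view: flatten to (N*H*W, C); BN normalizes these
# N*H*W = 256 C-dimensional row vectors channel-wise.
flat = x.reshape(-1, 16)
assert np.allclose(mean, flat.mean(axis=0))

x_hat = (x - mean) / np.sqrt(var + 1e-5)
```

After this transform, each channel of `x_hat` has (approximately) zero mean and unit variance across the whole mini-batch.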

The implementation in the BN paper and the actual implementation in TensorFlow are not the same.

In the BN paper, the mean and variance used at inference time are statistics aggregated over all training mini-batches:

$$ \mu_{inference}=\dfrac{1}{M}\sum_{i=1}^{M} \mu_{i} $$

where $M$ is the number of training mini-batches and $\mu_{i}$ is the mean of the $i$-th mini-batch. The variance is handled the same way, except that the paper additionally applies the unbiased correction $\frac{m}{m-1}$, where $m$ is the mini-batch size.
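As a small check of this formula, here is a NumPy sketch (with made-up data: M mini-batches of m samples each, for a single scalar activation) of the paper's inference statistics, including the unbiased variance correction from the paper:

```python
import numpy as np

np.random.seed(1)
# Hypothetical training set split into M mini-batches of m samples each.
M, m = 10, 32
batches = np.random.randn(M, m)

# Per-batch statistics collected during training.
batch_means = batches.mean(axis=1)
batch_vars = batches.var(axis=1)

# Inference statistics per the paper: average the per-batch means,
# and apply the unbiased correction m/(m-1) to the averaged variance.
mu_inf = batch_means.mean()
var_inf = (m / (m - 1)) * batch_vars.mean()
```

With equal-sized batches, `mu_inf` equals the mean over the whole training set, which is what the paper's inference transform relies on.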

The paper's scheme has an obvious inconsistency: the mean and variance used during training (per mini-batch) are not the same as the mean and variance used at test time (aggregated over all mini-batches).

 

So in the TensorFlow implementation described below, training does not use the raw mini-batch mean and variance directly; instead it maintains running statistics, updated as running statistics = decay * running statistics + (1 - decay) * current mini-batch statistics, and uses those as the replacement.

This keeps the mean and variance used at training time consistent with those used at test time.
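A minimal NumPy simulation of this running-average update (decay = 0.999 as in the tutorial code below; the data stream is made up, with true mean 3.0 and unit variance) shows the running statistics converging toward the population values:

```python
import numpy as np

np.random.seed(2)
decay = 0.999  # same decay as in the tutorial below

# Running (population) statistics, initialized to 0 and 1.
pop_mean, pop_var = 0.0, 1.0

# Simulated mini-batches from a stream with mean 3.0 and variance 1.0.
for _ in range(5000):
    batch = np.random.randn(64) + 3.0
    pop_mean = pop_mean * decay + batch.mean() * (1 - decay)
    pop_var = pop_var * decay + batch.var() * (1 - decay)
```

After enough updates, `pop_mean` and `pop_var` track the population mean and variance, so the normalization seen at test time matches what the training statistics converged to.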

 

The details come from this reference: https://www.quora.com/How-does-batch-normalization-behave-differently-at-training-time-and-test-time

I have read the paper Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift http://arxiv.org/pdf/1502.03167v...

I realized that there is a difference in how Batch Normalization (BN) calculates the mean and variance at training time and at test time.

In training: the mean and variance are calculated from the current mini-batch.

In test: the mean and variance are calculated from the population, rather than mini-batch, statistics.

The actual implementation differs slightly from the original paper. For the implementation details, see the tutorial Implementing Batch Normalization in Tensorflow from R2RT.

In the training phase:

Step 1: the model calculates batch_mean and batch_var based on the input batch:

    batch_mean, batch_var = tf.nn.moments(x, [0])

Step 2: pop_mean and pop_var are updated from batch_mean and batch_var with a decay. In this tutorial the decay is 0.999:

    decay = 0.999
    global_mean = tf.assign(pop_mean,
                            pop_mean * decay + batch_mean * (1 - decay))
    global_var = tf.assign(pop_var,
                           pop_var * decay + batch_var * (1 - decay))

 

Step 3: we can normalize the inputs using global_mean and global_var:

    # Apply the initial batch normalizing transform
    z1_hat = (x - global_mean) / tf.sqrt(global_var + epsilon)

 

Step 4: apply the scale and shift:

    BN1 = scale1 * z1_hat + beta1

 

In the test phase:

The model does not update pop_mean and pop_var, skipping steps 1 and 2 of the training phase; it simply uses pop_mean and pop_var directly:

    z1_hat = (x - pop_mean) / tf.sqrt(pop_var + epsilon)
    BN1 = scale1 * z1_hat + beta1
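The four training steps and the test path above can be put together in a minimal NumPy sketch. The names scale1, beta1, epsilon, and decay follow the tutorial's; the class itself is an illustration under those assumptions, not the tutorial's actual TensorFlow code:

```python
import numpy as np

class BatchNorm:
    """Minimal BN over [batch, features] inputs, mirroring the steps above."""

    def __init__(self, num_features, decay=0.999, epsilon=1e-3):
        self.decay, self.epsilon = decay, epsilon
        self.scale1 = np.ones(num_features)    # learned gamma (fixed here)
        self.beta1 = np.zeros(num_features)    # learned beta (fixed here)
        self.pop_mean = np.zeros(num_features)
        self.pop_var = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            # Step 1: batch statistics over the batch axis.
            batch_mean, batch_var = x.mean(axis=0), x.var(axis=0)
            # Step 2: update population statistics with the decay.
            self.pop_mean = self.pop_mean * self.decay + batch_mean * (1 - self.decay)
            self.pop_var = self.pop_var * self.decay + batch_var * (1 - self.decay)
            # Step 3: normalize (the quoted answer uses the updated pop stats here).
            z1_hat = (x - self.pop_mean) / np.sqrt(self.pop_var + self.epsilon)
        else:
            # Test phase: use the stored population statistics directly.
            z1_hat = (x - self.pop_mean) / np.sqrt(self.pop_var + self.epsilon)
        # Step 4: scale and shift.
        return self.scale1 * z1_hat + self.beta1

np.random.seed(3)
bn = BatchNorm(5)
out_train = bn(np.random.randn(32, 5), training=True)
out_test = bn(np.random.randn(32, 5), training=False)
```

The only difference between the two paths is whether steps 1 and 2 run; the normalization and the scale-and-shift are identical, which is exactly the consistency property discussed above.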

     

 
