Fine-tuning ResNet/DenseNet/Inception networks with TensorFlow's slim module: solving the batch norm problem

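Fine-tuning a ResNet model with TF
Preface
I have run into many pitfalls with TensorFlow, especially with its slim module. The batch norm problem in particular troubled me for a long time. The symptoms were:

Training results are good, but with is_training set to False at test time the results are very poor; setting it to True brings the test results back to normal.
Training results are good, but test results are noticeably worse.
The common network implementations and pretrained models that TensorFlow officially provides are all built on slim, so users can fine-tune networks such as ResNet, Inception, and DenseNet directly. Even so, people often get disappointing results or run into all sorts of strange problems when doing so.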

This post summarizes the two problems above along with my solutions. If you have questions, feel free to leave a comment.

Solution
The TensorFlow slim repository provides code and a pretrained model for each network, which can be fine-tuned directly.

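Pitfall 1:
The problem: training results are good, but with is_training set to False at test time the results are very poor, and setting it to True brings them back to normal.
This is clearly a batch norm problem. Suppose we want to fine-tune ResNet-v1-101, with the network defined as follows: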

with slim.arg_scope(resnet_utils.resnet_arg_scope()):
    net, end_points = resnet_v1_101.resnet_v1_101(imgs_processed,
                                                  num_classes=1000,
                                                  is_training=is_training,
                                                  global_pool=True,
                                                  output_stride=None,
                                                  spatial_squeeze=True,
                                                  store_non_strided_activations=False,
                                                  reuse=None,
                                                  scope='resnet_v1_101')
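is_training should be True during training and False during testing. This argument controls how the network's batch norm layers behave. When it is True, each batch is normalized with its own mean and variance, and beta and gamma are trained and updated. When it is False, nothing is updated and the precomputed moving mean and moving variance are used instead. For more on batch norm, see my other blog post. So when is_training is set to True at test time, each test batch is normalized with its own statistics rather than the stored ones, and if the model fits the training set well, the test results will not be too bad.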
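The two modes can be illustrated with a small pure-Python sketch. This is not TensorFlow code: the function batch_norm and the sample values are hypothetical, gamma and beta are fixed to 1 and 0, and the stale statistics are assumed to be the usual initial values (mean 0, variance 1). When fine-tuning, the stale values would instead be whatever the checkpoint holds.

```python
import math

def batch_norm(batch, moving_mean, moving_var, is_training, eps=1e-5):
    """Normalize a list of floats like a batch norm layer with gamma=1, beta=0."""
    if is_training:
        # Training mode: use the statistics of the current batch.
        mean = sum(batch) / len(batch)
        var = sum((x - mean) ** 2 for x in batch) / len(batch)
    else:
        # Inference mode: use the stored moving statistics.
        mean, var = moving_mean, moving_var
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

# Activations whose mean is far from 0 (illustrative values).
batch = [48.0, 50.0, 52.0, 49.0, 51.0]

# Assumption: moving statistics that were never updated (initial 0 and 1).
stale_mean, stale_var = 0.0, 1.0

train_mode = batch_norm(batch, stale_mean, stale_var, is_training=True)
test_mode = batch_norm(batch, stale_mean, stale_var, is_training=False)

print([round(v, 2) for v in train_mode])  # centered around zero
print([round(v, 2) for v in test_mode])   # essentially unnormalized
```

In training mode the output is centered regardless of the stored statistics, while inference mode passes the activations through almost unchanged. Later layers then see inputs on a scale they never saw during training, which matches the symptom described above.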

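The root cause is that the moving mean and moving variance used at test time were never updated during training. The solution is to run the batch norm update ops whenever the train op runs, that is, make the following change in the code: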

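from tensorflow.python.ops import control_flow_ops

# Make the loss depend on the batch norm moving-average updates,
# so they run every time the loss (and therefore the train op) is evaluated.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
if update_ops:
    updates = tf.group(*update_ops)
    self.cross_entropy = control_flow_ops.with_dependencies([updates], self.cross_entropy)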
This ties the batch norm updates to the train op. Another approach also works:

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
train_op = slim.learning.create_train_op(cross_entropy,
                                         optimizer,
                                         global_step=step,
                                         variables_to_train=all_vars)
# ...
# Run the batch norm update ops explicitly alongside the train op.
sess.run([train_op, extra_update_ops, cross_entropy])
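Both have the same effect. Note, however, that when using slim it is best to build the train op with slim's own helper, because by default slim.learning.create_train_op also runs the ops in tf.GraphKeys.UPDATE_OPS. The code is: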

optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)
train_op = slim.learning.create_train_op(cross_entropy,
                                         optimizer,
                                         global_step=step,
                                         variables_to_train=all_vars)  # train only the selected variables
rather than:

train_op = tf.train.GradientDescentOptimizer(learning_rate=lr).minimize(cross_entropy)
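To see why a plain minimize call can leave the moving statistics untouched, here is a toy analog of an update-ops collection. It is not TensorFlow code: ToyBatchNorm, train, and the update_ops list are hypothetical stand-ins. The layer only registers its moving-average update, and the update happens only if the training loop runs the registered ops.

```python
class ToyBatchNorm:
    """Tracks a moving mean, but only updates it when its update op is run."""

    def __init__(self, update_ops, decay=0.9):
        self.decay = decay
        self.moving_mean = 0.0
        self.last_batch_mean = 0.0
        # Register the update instead of running it automatically.
        update_ops.append(self.update_moving_mean)

    def forward_train(self, batch):
        # Training-mode forward pass: center with the batch mean.
        self.last_batch_mean = sum(batch) / len(batch)
        return [x - self.last_batch_mean for x in batch]

    def update_moving_mean(self):
        self.moving_mean = (self.decay * self.moving_mean
                            + (1.0 - self.decay) * self.last_batch_mean)

def train(num_steps, run_update_ops):
    update_ops = []  # stand-in for the UPDATE_OPS collection
    layer = ToyBatchNorm(update_ops)
    batch = [9.0, 10.0, 11.0]
    for _ in range(num_steps):
        layer.forward_train(batch)  # stands in for one optimizer step
        if run_update_ops:
            for op in update_ops:
                op()
    return layer.moving_mean

without_updates = train(100, run_update_ops=False)
with_updates = train(100, run_update_ops=True)
print(without_updates, with_updates)
```

Without the update ops the moving mean stays at its initial value no matter how long training runs. With them it approaches the batch mean of 10.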
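If that solved your problem, congratulations. If you are fine-tuning on a small dataset, you may still hit problem two: training results are good, but test results are noticeably worse.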

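Pitfall 2:
The gap between good training results and worse test results comes from the batch norm decay parameter. First look at how slim defines the network's arg scope. The following code is at the end of resnet_utils.py: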

def resnet_arg_scope(weight_decay=0.0001,
                     batch_norm_decay=0.99,  # 0.997,
                     batch_norm_epsilon=1e-5,
                     batch_norm_scale=True,
                     activation_fn=tf.nn.relu,
                     use_batch_norm=True):
    batch_norm_params = {
        'decay': batch_norm_decay,
        'epsilon': batch_norm_epsilon,
        'scale': batch_norm_scale,
        'updates_collections': tf.GraphKeys.UPDATE_OPS,
        'fused': None,  # Use fused batch norm if possible.
    }

    with slim.arg_scope(
        [slim.conv2d],
        weights_regularizer=slim.l2_regularizer(weight_decay),
        weights_initializer=slim.variance_scaling_initializer(),
        activation_fn=activation_fn,
        normalizer_fn=tf.contrib.layers.batch_norm if use_batch_norm else None,
        normalizer_params=batch_norm_params):
        with slim.arg_scope([slim.batch_norm], **batch_norm_params):
            # The following implies padding='SAME' for pool1, which makes feature
            # alignment easier for dense prediction tasks. This is also used in
            # https://github.com/facebook/fb.resnet.torch. However the accompanying
            # code of 'Deep Residual Learning for Image Recognition' uses
            # padding='VALID' for pool1. You can switch to that choice by setting
            # slim.arg_scope([slim.max_pool2d], padding='VALID').
            with slim.arg_scope([slim.max_pool2d], padding='SAME') as arg_sc:
                return arg_sc
Note that here I use tf.contrib.layers.batch_norm rather than slim.batch_norm. The two are essentially the same, and you can of course use your own batch norm function.

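The most important parameter is 'decay': batch_norm_decay. The original code was trained on ImageNet with decay set to 0.997. The larger this value, the more slowly the moving mean and moving variance change, so more training steps are needed before they settle. When training on a small dataset you can choose a smaller value, such as 0.99 or 0.95.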
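The effect of decay on a short training run can be checked with a little arithmetic. The sketch below assumes the usual update rule moving = decay * moving + (1 - decay) * batch_value, a constant batch statistic of 1.0, and a starting value of 0. The function name and the step count of 500 are illustrative, not taken from the original code.

```python
def moving_average_after(steps, decay, target, initial=0.0):
    """Apply the exponential moving-average update `steps` times."""
    moving = initial
    for _ in range(steps):
        moving = decay * moving + (1.0 - decay) * target
    return moving

steps = 500  # assumed number of training steps on a small dataset
results = {}
for decay in (0.997, 0.99, 0.95):
    results[decay] = moving_average_after(steps, decay, target=1.0)
    print("decay=%.3f -> moving mean after %d steps: %.4f"
          % (decay, steps, results[decay]))
```

Under these assumptions, decay=0.997 leaves the moving mean still below 80% of the true value after 500 steps, while 0.99 and 0.95 have essentially converged. A short fine-tuning run with a large decay therefore tests with statistics that have not caught up with the new data.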

That covers both pitfalls. If you have questions, raise them in the comments.

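The code is on my GitHub. It is adapted from my earlier multi-GPU parallel code. The core is unchanged, and you need to write the accuracy computation yourself: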
https://github.com/LDOUBLEV/TF_resnet
---------------------
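Author: Double_V_
Source: CSDN
Original: https://blog.csdn.net/qq_25737169/article/details/79616671
Copyright notice: this is an original article by the blogger. Please include a link to the post when reposting.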
