ImageNet Image Classification Training Summary (Implemented with TensorFlow 2.0)

I recently came across a paper AWS published at the end of 2018 (Bag of Tricks for Image Classification with Convolutional Neural Networks), in which Mu Li and his colleagues summarize a number of tricks used in image classification that can improve accuracy. Following the tricks mentioned in the paper, I ran my own tests: based on TensorFlow 2.1, I built a Darknet53 model (also the backbone of the famous YOLOv3) and trained it as an ImageNet classifier.

Building the Network Model

First, building the Darknet53 network. The exact structure can be found in the darknet53.cfg file in the cfg directory of https://github.com/pjreddie/darknet. The code is as follows:

import tensorflow as tf
from tensorflow.keras import Model

l = tf.keras.layers
category_num = 80                       # reserved for the later YOLOv3 head
vector_size = 3*(1+4+category_num)      # not used in the classification training

def _conv(inputs, filters, kernel_size, strides, padding, bias=False, normalize=True, activation='relu'):
    output = inputs
    padding_str = 'same'
    if padding>0:
        output = l.ZeroPadding2D(padding=padding, data_format='channels_first')(output)
        padding_str = 'valid'
    output = l.Conv2D(filters, kernel_size, strides, padding_str, \
                  'channels_first', use_bias=bias, \
                  kernel_initializer='he_normal', \
                  kernel_regularizer=tf.keras.regularizers.l2(l=5e-4))(output)
    if normalize:
        output = l.BatchNormalization(axis=1)(output)
    if activation=='relu':
        output = l.ReLU()(output)
    if activation=='relu6':
        output = l.ReLU(max_value=6)(output)
    if activation=='leaky_relu':
        output = l.LeakyReLU(alpha=0.1)(output)
    return output

def _residual(inputs, out_channels, activation='relu', name=None):
    # 1x1 bottleneck conv followed by a 3x3 conv, plus a shortcut connection
    output1 = _conv(inputs, out_channels//2, 1, 1, 0, False, True, activation)
    output2 = _conv(output1, out_channels, 3, 1, 1, False, True, activation)
    output = l.Add(name=name)([inputs, output2])
    return output

def darknet53_base():
    image = tf.keras.Input(shape=(3,None,None))
    net = _conv(image, 32, 3, 1, 1, False, True, 'leaky_relu')     #32*H*W
    net = _conv(net, 64, 3, 2, 1, False, True, 'leaky_relu')       #64*H/2*W/2
    net = _residual(net, 64, 'leaky_relu')                         #64*H/2*W/2
    net = _conv(net, 128, 3, 2, 1, False, True, 'leaky_relu')      #128*H/4*W/4
    net = _residual(net, 128, 'leaky_relu')                        #128*H/4*W/4
    net = _residual(net, 128, 'leaky_relu')                        #128*H/4*W/4
    net = _conv(net, 256, 3, 2, 1, False, True, 'leaky_relu')      #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    route1 = l.Activation('linear', dtype='float32', name='route1')(net)  # cast to float32 under mixed precision; kept for YOLOv3
    net = _conv(net, 512, 3, 2, 1, False, True, 'leaky_relu')   #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    route2 = l.Activation('linear', dtype='float32', name='route2')(net)
    net = _conv(net, 1024, 3, 2, 1, False, True, 'leaky_relu')     #1024*H/32*W/32
    net = _residual(net, 1024, 'leaky_relu')                       #1024*H/32*W/32
    net = _residual(net, 1024, 'leaky_relu')                       #1024*H/32*W/32
    net = _residual(net, 1024, 'leaky_relu')                       #1024*H/32*W/32
    net = _residual(net, 1024, 'leaky_relu')                       #1024*H/32*W/32
    route3 = l.Activation('linear', dtype='float32', name='route3')(net)
    net = tf.reduce_mean(net, axis=[2,3], keepdims=True)           # global average pooling
    net = _conv(net, 1000, 1, 1, 0, True, False, 'linear')         # 1x1 conv as the 1000-way classification layer
    net = l.Flatten(data_format='channels_first', name='logits')(net)
    net = l.Activation('linear', dtype='float32', name='output')(net)
    model = tf.keras.Model(inputs=image, outputs=[net, route1, route2, route3])
    return model

In the code above, the Darknet53 model has four outputs. The three outputs route1, route2 and route3 are reserved for building the YOLOv3 network later and are not used in image classification for now.
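As a quick sanity check (my own minimal sketch, not part of the original script), we can push a dummy batch through the model and confirm the output shapes: 1000 logits plus the three route feature maps at strides 8, 16 and 32:

import numpy as np

model = darknet53_base()
# Note: channels_first convolutions are generally only supported on GPU in
# TensorFlow, so run this on a GPU.
dummy = np.zeros((1, 3, 224, 224), dtype=np.float32)   # an NCHW dummy batch
logits, route1, route2, route3 = model.predict(dummy)
print(logits.shape)   # (1, 1000)
print(route1.shape)   # (1, 256, 28, 28)   stride 8
print(route2.shape)   # (1, 512, 14, 14)   stride 16
print(route3.shape)   # (1, 1024, 7, 7)    stride 32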

Image Preprocessing

The paper describes the following preprocessing steps:

  1. Randomly sample an image and decode it into 32-bit floating point values in [0, 255].
  2. Randomly crop a rectangle whose aspect ratio is sampled from [3/4, 4/3] and whose area is a random fraction in [8%, 100%] of the image area, then resize the crop to 224*224.
  3. Randomly flip the image horizontally.
  4. Randomly adjust the hue, saturation and brightness of the image, with coefficients sampled uniformly from [0.6, 1.4].
  5. Add PCA noise to the image, with a coefficient sampled from a Gaussian distribution N(0, 0.1).
  6. Normalize the RGB channels by subtracting 123.68, 116.779, 103.939 and dividing by 58.393, 57.12, 57.375 respectively.

I followed the steps above, except that after the flip in step 3 I added a random rotation step, modeled on how Darknet does it, with the rotation angle sampled from [-7, 7] degrees. As for step 5, the paper gives no details, so I implemented it by referring to the mxnet code (a standalone sketch of this step follows below). For the validation set, step 2 is replaced by resizing the shorter side of the image to 256 while keeping the aspect ratio, then center-cropping a 224*224 rectangle; steps 3, 4 and 5 are skipped and step 6 is applied as-is. The TensorFlow 2.1 code for building the training and validation datasets is part of the full training script below. The ImageNet data I use was first converted to TFRecord format; see my earlier post https://blog.csdn.net/gzroy/article/details/85954329 for how that was done.
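Since the paper does not spell out the PCA-noise step, here is a standalone NumPy sketch of what the mxnet-style implementation computes. The eigenvectors and eigenvalues are those of the ImageNet RGB covariance matrix, scaled to the [0, 255] pixel range (the same constants as in the training script below); the TF code inside _parse_function does the same thing:

import numpy as np

# Eigenvectors (as columns) and eigenvalues of the ImageNet RGB covariance
# matrix, for pixel values in [0, 255].
eigvec = np.array([[-0.5675,  0.7192,  0.4009],
                   [-0.5808, -0.0045, -0.8140],
                   [-0.5836, -0.6948,  0.4203]])
eigval = np.array([55.46, 4.794, 1.148])

alpha = np.random.normal(0.0, 0.1, size=3)   # one random coefficient per component
noise = (eigvec * alpha) @ eigval            # RGB offset: sum_j eigvec[:,j]*alpha[j]*eigval[j]
# 'noise' is a length-3 RGB vector that gets added to every pixel of the image.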

Model Training

The paper mentions the following tricks:

  1. Training with a large batch size: increase the learning rate linearly as the batch size grows. For example, if the learning rate is 0.1 for batch size 128, it should be 0.2 for batch size 256. My GPU can only fit a batch size of 128 even at FP16 precision, so this trick was of no use to me.
  2. Learning-rate warmup: at the beginning of training, ramp the learning rate up gradually from 0. For example, with an initial learning rate of 0.1, the learning rate grows linearly from 0 to 0.1 over the first 1000 batches. This helps the model settle into a stable training state sooner.
  3. Initialize the γ of the last Batch Normalization layer in each residual block to 0, which helps the model in the initial stage of training (see the first sketch after this list).
  4. Do not apply L2 regularization to the bias terms.
  5. If the GPU supports mixed-precision computation, it can speed up training without hurting the network's accuracy.
  6. Use cosine decay for the learning rate. In practice I still used step decay, because it is more controllable and shortens training: with cosine decay the learning rate changes too slowly, while step decay lets me adjust the learning rate flexibly according to the loss and finish training faster. Of course, if your GPU is powerful enough, cosine decay is the most convenient and worry-free choice.
  7. Apply label smoothing to the training targets. For example, ImageNet has 1000 classes; for a given image, the target of its true class is set to 0.9 instead of 1, and each of the other 999 classes is set to 0.1/999. (Note that tf.keras's label_smoothing=0.1, used in the code below, computes the slightly different 0.9+0.1/1000 for the true class and 0.1/1000 for the others.)
  8. Knowledge distillation: use a more complex and more accurate teacher network to help the current network improve, for example a ResNet-152 teaching a ResNet-50. I did not use this trick.
  9. Mixup training: each time, linearly combine two sampled images, and combine their labels accordingly. This requires more training iterations. I did not use this trick either, but it is quite useful for object detection, where it can make the model more robust (see the second sketch after this list).
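Tricks 3 and 4 are not explicit in my training script below, so here is a minimal sketch of trick 3 under my own assumptions: zero_gamma_residual is a hypothetical variant of _residual in which the BatchNormalization closest to the Add is initialized with gamma_initializer='zeros', so the residual branch outputs zero at initialization and each block starts out as an identity mapping. (Trick 4 is in fact already satisfied by _conv above, which sets only kernel_regularizer and leaves the biases unregularized.)

import tensorflow as tf
l = tf.keras.layers

def zero_gamma_residual(inputs, out_channels):
    output = l.Conv2D(out_channels//2, 1, use_bias=False,
                      data_format='channels_first',
                      kernel_initializer='he_normal')(inputs)
    output = l.BatchNormalization(axis=1)(output)
    output = l.LeakyReLU(alpha=0.1)(output)
    output = l.Conv2D(out_channels, 3, padding='same', use_bias=False,
                      data_format='channels_first',
                      kernel_initializer='he_normal')(output)
    # gamma=0 => this BN outputs zero at initialization, and LeakyReLU(0)=0,
    # so the Add below initially just passes the shortcut through.
    output = l.BatchNormalization(axis=1, gamma_initializer='zeros')(output)
    output = l.LeakyReLU(alpha=0.1)(output)
    return l.Add()([inputs, output])

For trick 9, a minimal sketch of batch-level Mixup; again this is my own illustration rather than code from the paper, and mixup_batch and the Beta parameter 0.2 are names and values I chose for the example. It would be applied after .batch() in the input pipeline, mixing each example with another one from the same batch:

def mixup_batch(images, labels, alpha=0.2):
    # Sample one mixing weight from Beta(alpha, alpha) via the Gamma trick:
    # X/(X+Y) ~ Beta(a, b) when X~Gamma(a) and Y~Gamma(b).
    g1 = tf.random.gamma([], alpha)
    g2 = tf.random.gamma([], alpha)
    lam = g1/(g1 + g2)
    # Mix each example with the batch in reverse order; the labels are
    # one-hot, so they can be combined with the same weight.
    mixed_images = lam*images + (1.0 - lam)*tf.reverse(images, axis=[0])
    mixed_labels = lam*labels + (1.0 - lam)*tf.reverse(labels, axis=[0])
    return mixed_images, mixed_labels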

Code

The complete training code is as follows:

import tensorflow as tf
import tensorflow_addons as tfa
import math
import os
import random
import time
import numpy as np
from darknet53_model import darknet53_base
from tensorflow.keras.mixed_precision import experimental as mixed_precision
l = tf.keras.layers 

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)

imageWidth = 224
imageHeight = 224
imageDepth = 3
batch_size = 128
resize_min = 256
train_images = 1280000
batches_per_epoch = train_images//batch_size
train_epochs = 80
total_steps = batches_per_epoch*train_epochs

random_min_aspect = 0.75
random_max_aspect = 1/0.75
random_min_area = 0.08
random_angle = 7.

initial_warmup_steps = 1000
initial_lr = 0.02

eigvec = tf.constant([[-0.5675, 0.7192, 0.4009], [-0.5808, -0.0045, -0.8140], [-0.5836, -0.6948, 0.4203]], shape=[3,3], dtype=tf.float32)
eigval = tf.constant([55.46, 4.794, 1.148], shape=[3,1], dtype=tf.float32)

mean_RGB = tf.constant([123.68, 116.779, 103.939], dtype=tf.float32)
std_RGB = tf.constant([58.393, 57.12, 57.375], dtype=tf.float32)
 
train_files_names = os.listdir('../train_tf/')
train_files = ['../train_tf/'+item for item in train_files_names]
valid_files_names = os.listdir('../valid_tf/')
valid_files = ['../valid_tf/'+item for item in valid_files_names]

# Parse the TFRecord and distort the image for training
def _parse_function(example_proto):
    features = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "height": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "width": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "channels": tf.io.FixedLenFeature([1], tf.int64, default_value=[3]),
        "colorspace": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "img_format": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "label": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "bbox_xmin": tf.io.VarLenFeature(tf.float32),
        "bbox_xmax": tf.io.VarLenFeature(tf.float32),
        "bbox_ymin": tf.io.VarLenFeature(tf.float32),
        "bbox_ymax": tf.io.VarLenFeature(tf.float32),
        "text": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "filename": tf.io.FixedLenFeature([], tf.string, default_value="")
    }
    parsed_features = tf.io.parse_single_example(example_proto, features)
    image_decoded = tf.image.decode_jpeg(parsed_features["image"], channels=3)
    image_decoded = tf.cast(image_decoded, dtype=tf.float32)
    # Randomly crop the image
    shape = tf.shape(image_decoded)
    height, width = shape[0], shape[1]
    random_aspect = tf.random.uniform(shape=[], minval=random_min_aspect, maxval=random_max_aspect)
    random_area = tf.random.uniform(shape=[], minval=random_min_area, maxval=1.0)
    crop_width = tf.math.sqrt(
        tf.divide(
            tf.multiply(
                tf.cast(tf.multiply(height,width), tf.float32),
                random_area),
            random_aspect)
        )
    crop_height = tf.cast(crop_width * random_aspect, tf.int32)
    crop_height = tf.cond(crop_height<height, lambda:crop_height, lambda:height)
    crop_width = tf.cast(crop_width, tf.int32)
    crop_width = tf.cond(crop_width<width, lambda:crop_width, lambda:width)
    cropped = tf.image.random_crop(image_decoded, [crop_height, crop_width, 3])
    resized = tf.image.resize(cropped, [imageHeight, imageWidth])
    # Flip to add a little more random distortion.
    flipped = tf.image.random_flip_left_right(resized)
    # Randomly rotate the image
    angle = tf.random.uniform(shape=[], minval=-random_angle, maxval=random_angle)*np.pi/180
    rotated = tfa.image.rotate(flipped, angle)
    # Randomly distort hue, saturation and brightness
    distorted = tf.image.random_hue(rotated, max_delta=0.3)
    distorted = tf.image.random_saturation(distorted, lower=0.6, upper=1.4)
    distorted = tf.image.random_brightness(distorted, max_delta=0.3)
    # Add PCA noise
    alpha = tf.random.normal([3], mean=0.0, stddev=0.1)
    pca_noise = tf.reshape(tf.matmul(tf.multiply(eigvec,alpha), eigval), [3])
    distorted = tf.add(distorted, pca_noise)
    # Normalize RGB
    distorted = tf.subtract(distorted, mean_RGB)
    distorted = tf.divide(distorted, std_RGB)

    image_train = tf.transpose(distorted, perm=[2, 0, 1])
    features = {'input_1': image_train}
    labels = tf.one_hot(parsed_features["label"][0], depth=1000)
    return features, labels
 
def train_input_fn():
    dataset_train = tf.data.TFRecordDataset(train_files)
    dataset_train = dataset_train.map(_parse_function, num_parallel_calls=4)
    dataset_train = dataset_train.shuffle(
        buffer_size=12800, 
        reshuffle_each_iteration=True
    )
    dataset_train = dataset_train.repeat(10)
    dataset_train = dataset_train.batch(batch_size)
    dataset_train = dataset_train.prefetch(batch_size)
    return dataset_train

def _parse_test_function(example_proto):
    features = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "height": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "width": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "channels": tf.io.FixedLenFeature([1], tf.int64, default_value=[3]),
        "colorspace": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "img_format": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "label": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "bbox_xmin": tf.io.VarLenFeature(tf.float32),
        "bbox_xmax": tf.io.VarLenFeature(tf.float32),
        "bbox_ymin": tf.io.VarLenFeature(tf.float32),
        "bbox_ymax": tf.io.VarLenFeature(tf.float32),
        "text": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "filename": tf.io.FixedLenFeature([], tf.string, default_value="")
    }
    parsed_features = tf.io.parse_single_example(example_proto, features)
    image_decoded = tf.image.decode_jpeg(parsed_features["image"], channels=3)
    image_decoded = tf.cast(image_decoded, dtype=tf.float32)
    shape = tf.shape(image_decoded)
    height, width = shape[0], shape[1]
    resized_height, resized_width = tf.cond(height<width,
        lambda: (resize_min, tf.cast(tf.multiply(tf.cast(width, tf.float64),tf.divide(resize_min,height)), tf.int32)),
        lambda: (tf.cast(tf.multiply(tf.cast(height, tf.float64),tf.divide(resize_min,width)), tf.int32), resize_min))
    image_resized = tf.image.resize(image_decoded, [resized_height, resized_width])
    # Calculate the offsets for the center crop
    shape = tf.shape(image_resized)  
    height, width = shape[0], shape[1]
    amount_to_be_cropped_h = (height - imageHeight)
    crop_top = amount_to_be_cropped_h // 2
    amount_to_be_cropped_w = (width - imageWidth)
    crop_left = amount_to_be_cropped_w // 2
    image_cropped = tf.slice(image_resized, [crop_top, crop_left, 0], [imageHeight, imageWidth, -1])
    # Normalize RGB
    image_valid = tf.subtract(image_cropped, mean_RGB)
    image_valid = tf.divide(image_valid, std_RGB)
    image_valid = tf.transpose(image_valid, perm=[2, 0, 1])
    features = {'input_1': image_valid}
    labels = tf.one_hot(parsed_features["label"][0], depth=1000)
    return features, labels
 
def val_input_fn():
    dataset_valid = tf.data.TFRecordDataset(valid_files)
    dataset_valid = dataset_valid.map(_parse_test_function, num_parallel_calls=4)
    dataset_valid = dataset_valid.batch(batch_size)
    dataset_valid = dataset_valid.prefetch(batch_size)
    return dataset_valid

boundaries = [30000, 60000, 90000, 120000, 150000, 170000, 190000, 220000, 260000, 275000]
values = [0.02, 0.01, 0.005, 0.002, 0.001, 0.0005, 0.00025, 0.0001, 0.00005, 0.000025, 0.00001]
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)

class LRCallback(tf.keras.callbacks.Callback):
    def __init__(self, starttime):
        super(LRCallback, self).__init__()
        self.epoch_starttime = starttime
        self.batch_starttime = starttime
    def on_train_batch_end(self, batch, logs):
        step = tf.keras.backend.get_value(self.model.optimizer.iterations)
        lr = tf.keras.backend.get_value(self.model.optimizer.lr)
        # Initial warmup phase, linearly increase the learning rate
        if step < initial_warmup_steps:
            newlr = (initial_lr/initial_warmup_steps)*step
            tf.keras.backend.set_value(self.model.optimizer.lr, newlr)
        # Calculate the lr based on cosine decay, not used here
        '''
        else:
            newlr = (1+math.cos(step/total_steps*math.pi))*initial_lr/2
            tf.keras.backend.set_value(self.model.optimizer.lr, newlr)
        '''
        if step%100==0:
            elapsed_time = time.time()-self.batch_starttime
            self.batch_starttime = time.time()
            # Step-decay learning rate
            if step >= initial_warmup_steps:
                tf.keras.backend.set_value(self.model.optimizer.lr, learning_rate_fn(step))
            print("Steps:{}, LR:{:6.4f}, Loss:{:4.2f}, Time:{:4.1f}s"\
                  .format(step, lr, logs['loss'], elapsed_time))
    def on_epoch_end(self, epoch, logs=None):
        epoch_elapsed_time = time.time()-self.epoch_starttime
        print("Epoch:{}, Top-1 Accuracy:{:5.3f}, Top-5 Accuracy:{:5.3f}, Time:{:5.1f}s"\
              .format(epoch, logs['val_output_top_1_accuracy'], logs['val_output_top_5_accuracy'], epoch_elapsed_time))
    def on_epoch_begin(self, epoch, logs=None):
        tf.keras.backend.set_learning_phase(True)
        self.epoch_starttime=time.time()
    def on_test_begin(self, logs=None):
        tf.keras.backend.set_learning_phase(False)
 
tensorboard_cbk = tf.keras.callbacks.TensorBoard(log_dir='darknet53_20200203/logs')
checkpoint_cbk = tf.keras.callbacks.ModelCheckpoint(filepath='darknet53_20200203/epoch_{epoch}.h5', verbose=1)

model = darknet53_base()
model.compile(
    loss={
        'output': 
            tf.keras.losses.CategoricalCrossentropy(
                from_logits=True, label_smoothing=0.1)
    },
    optimizer=tf.keras.optimizers.SGD(
        learning_rate=0.001, momentum=0.9),
    metrics={
        'output':[
            tf.keras.metrics.CategoricalAccuracy(
                name='top_1_accuracy'),
            tf.keras.metrics.TopKCategoricalAccuracy(
                k=5, 
                name='top_5_accuracy')]
    }
)

train_data = train_input_fn()
val_data = val_input_fn()
_ = model.fit(
    train_data,
    validation_data=val_data,
    epochs=2,
    initial_epoch=0,
    verbose=0,
    callbacks=[LRCallback(time.time()), tensorboard_cbk, checkpoint_cbk],
    steps_per_epoch=5000)

Summary

After training for 300,000 batches (30 epochs), the model reached 71.5% Top-1 and 90.6% Top-5 accuracy on the validation set. This still falls short of the numbers reported in the paper and of the accuracy reported for YOLOv3's backbone, but for now I cannot think of a way to improve it further.
