A Complete Implementation of YOLO V3 with TensorFlow 2.0

YOLO V3 is a powerful and fast object detection model that is also conceptually fairly simple. In an earlier blog post I described how to implement YOLO V1 with TensorFlow, and I later implemented YOLO V3 with TensorFlow 1.X as well. TensorFlow has now evolved to version 2.0, which is a big improvement over 1.X and much easier to use, so here I record how to implement YOLO V3 with TensorFlow 2.0. Most of the TensorFlow YOLO V3 code that can be found online does not include a complete training pipeline; it usually just converts and loads the pretrained weights that the YOLO author released for Darknet and runs detection directly. My code implements the full training workflow: building the Darknet53 backbone and pretraining it on ImageNet, adding the YOLO V3 detection layers and training them for object detection, evaluating the trained model, and finally running detection with it.

Preparing the Training Data

Two sets of training data are needed. The first is the ImageNet classification dataset, which covers 1,000 object categories with about 1.28 million images. It has to be preprocessed into TFRecord format; the details are described in my earlier post https://blog.csdn.net/gzroy/article/details/85954329. The second is an object detection dataset. Several are available, for example COCO (bounding boxes for 80 object categories), Open Images and Pascal VOC; COCO is the most popular, and most object detection papers report their results on it. I use COCO as well, again preprocessed into TFRecord format; the details are in another post of mine, https://blog.csdn.net/gzroy/article/details/95027532
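For reference, below is a minimal sketch of how one image and its bounding boxes might be serialized into a tf.train.Example with the same feature keys that the parsing function later in this post expects. The actual preprocessing scripts are in the posts linked above, so the helper below is only illustrative:

import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

#Sketch only: serialize one image (jpeg_bytes and filename are bytes) and its boxes
def make_example(jpeg_bytes, filename, height, width, labels, xmins, xmaxs, ymins, ymaxs):
    feature = {
        'image': _bytes_feature(jpeg_bytes),
        'height': _int64_feature([height]),
        'width': _int64_feature([width]),
        'channels': _int64_feature([3]),
        'colorspace': _bytes_feature(b'RGB'),
        'img_format': _bytes_feature(b'JPEG'),
        'label': _int64_feature(labels),
        'bbox_xmin': _int64_feature(xmins),
        'bbox_xmax': _int64_feature(xmaxs),
        'bbox_ymin': _int64_feature(ymins),
        'bbox_ymax': _int64_feature(ymaxs),
        'filename': _bytes_feature(filename)
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))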

Building the Network Model

According to the YOLO V3 paper, the backbone is a network called Darknet53, which contains 53 convolutional layers. Its architecture is shown below:

Darknet53 can be built quite conveniently with TensorFlow; the code is as follows:

import tensorflow as tf
from tensorflow.keras import Model
l=tf.keras.layers

def _conv(inputs, filters, kernel_size, strides, padding, bias=False, normalize=True, activation='relu', last=False):
    output = inputs
    padding_str = 'same'
    if padding>0:
        output = l.ZeroPadding2D(padding=padding, data_format='channels_first')(output)
        padding_str = 'valid'
    output = l.Conv2D(filters, kernel_size, strides, padding_str, \
                  'channels_first', use_bias=bias, \
                  kernel_initializer='he_normal', \
                  kernel_regularizer=tf.keras.regularizers.l2(l=5e-4))(output)
    if normalize:
        if not last:
            output = l.BatchNormalization(axis=1)(output)
        else:
            output = l.BatchNormalization(axis=1, gamma_initializer='zeros')(output)
    if activation=='relu':
        output = l.ReLU()(output)
    if activation=='relu6':
        output = l.ReLU(max_value=6)(output)
    if activation=='leaky_relu':
        output = l.LeakyReLU(alpha=0.1)(output)
    return output

def _residual(inputs, out_channels, activation='relu', name=None):
    output1 = _conv(inputs, out_channels//2, 1, 1, 0, False, True, 'leaky_relu', False)
    output2 = _conv(output1, out_channels, 3, 1, 1, False, True, 'leaky_relu', True)
    output = l.Add(name=name)([inputs, output2])
    return output 

def darknet53_base():
    image = tf.keras.Input(shape=(3,None,None))
    net = _conv(image, 32, 3, 1, 1, False, True, 'leaky_relu')     #32*H*W
    net = _conv(net, 64, 3, 2, 1, False, True, 'leaky_relu')       #64*H/2*W/2
    net = _residual(net, 64, 'leaky_relu')                         #64*H/2*W/2
    net = _conv(net, 128, 3, 2, 1, False, True, 'leaky_relu')      #128*H/4*W/4
    net = _residual(net, 128, 'leaky_relu')                        #128*H/4*W/4
    net = _residual(net, 128, 'leaky_relu')                        #128*H/4*W/4
    net = _conv(net, 256, 3, 2, 1, False, True, 'leaky_relu')      #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    net = _residual(net, 256, 'leaky_relu')                        #256*H/8*W/8
    route1 = l.Activation('linear', dtype='float32', name='route1')(net)
    net = _conv(net, 512, 3, 2, 1, False, True, 'leaky_relu')   #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    net = _residual(net, 512, 'leaky_relu')                        #512*H/16*W/16
    route2 = l.Activation('linear', dtype='float32', name='route2')(net)
    net = _conv(net, 1024, 3, 2, 1, False, True, 'leaky_relu')     #1024*H/32*W/32
    net = _residual(net, 1024, 'leaky_relu')                       #1024*H/32*W/32
    net = _residual(net, 1024, 'leaky_relu')                       #1024*H/32*W/32
    net = _residual(net, 1024, 'leaky_relu')                       #1024*H/32*W/32
    net = _residual(net, 1024, 'leaky_relu')                       #1024*H/32*W/32
    route3 = l.Activation('linear', dtype='float32', name='route3')(net)
    net = tf.reduce_mean(net, axis=[2,3], keepdims=True)
    net = _conv(net, 1000, 1, 1, 0, True, False, 'linear')         #1000
    net = l.Flatten(data_format='channels_first', name='logits')(net)
    net = l.Activation('linear', dtype='float32', name='output')(net)
    model = tf.keras.Model(inputs=image, outputs=[net, route1, route2, route3])
    return model

We first need to pretrain this backbone on ImageNet so that it learns to extract useful image features. I trained the network for 30 epochs and reached about 71% top-1 and 91% top-5 accuracy. The details of the training are described in my post https://blog.csdn.net/gzroy/article/details/104170537
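The exact training script is in the post above; as a rough sketch (with illustrative hyperparameters and file names), the classification output of the backbone can be trained roughly as follows, where imagenet_train_input_fn is a hypothetical tf.data pipeline yielding batches of channels-first 224*224 images with integer labels:

#Sketch of the ImageNet pretraining step; hyperparameters and checkpoint names are illustrative
model = darknet53_base()
#Use only the 'output' (logits) head for classification pretraining
classifier = tf.keras.Model(inputs=model.input, outputs=model.get_layer('output').output)
classifier.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy(),
             tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)])
classifier.fit(imagenet_train_input_fn(), epochs=30)
#Saving the full multi-output model keeps route1/route2/route3 available for detection later
model.save('darknet53/epoch_30.h5')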

Once the backbone is trained, we can add further convolutional layers on top of it to build a feature-pyramid (FPN) style architecture. Here the backbone outputs route1, route2 and route3, feature maps at different resolutions, are used to build a grid-based detector that downsamples the input image by factors of 8, 16 and 32. For a 416*416 training image, for example, the detector outputs predictions on 52*52, 26*26 and 13*13 grids. The underlying principles are explained in many articles online. Following the Darknet source code, I built the YOLO V3 network as follows:

category_num = 80
vector_size = 3*(1+4+category_num)
def darknet53_yolov3():
    route1 = tf.keras.Input(shape=(256,None,None), name='input1')        #256*H/8*W/8
    route2 = tf.keras.Input(shape=(512,None,None), name='input2')        #512*H/16*W/16
    route3 = tf.keras.Input(shape=(1024,None,None), name='input3')       #1024*H/32*W/32
    net = _conv(route3, 512, 1, 1, 0, False, True, 'leaky_relu')         #512*H/32*W/32
    net = _conv(net, 1024, 3, 1, 1, False, True, 'leaky_relu')           #1024*H/32*W/32
    net = _conv(net, 512, 1, 1, 0, False, True, 'leaky_relu')            #512*H/32*W/32
    net = _conv(net, 1024, 3, 1, 1, False, True, 'leaky_relu')           #1024*H/32*W/32
    net = _conv(net, 512, 1, 1, 0, False, True, 'leaky_relu')            #512*H/32*W/32
    route4 = tf.identity(net, 'route4')
    net = _conv(net, 1024, 3, 1, 1, False, True, 'leaky_relu')           #1024*H/32*W/32
    predict1 = _conv(net, vector_size, 1, 1, 0, True, False, 'linear')   #vector_size*H/32*W/32
    predict1 = l.Activation('linear', dtype='float32')(predict1)
    predict1 = l.Reshape((vector_size, -1))(predict1)              #flatten to vector_size*(H/32*W/32)
    net = _conv(route4, 256, 1, 1, 0, False, True, 'leaky_relu')         #256*H/32*W/32
    net = l.UpSampling2D((2,2),"channels_first",'nearest')(net)    #256*H/16*W/16
    net = l.Concatenate(axis=1)([route2, net])                     #768*H/16*W/16
    net = _conv(net, 256, 1, 1, 0, False, True, 'leaky_relu')            #256*H/16*W/16
    net = _conv(net, 512, 3, 1, 1, False, True, 'leaky_relu')            #512*H/16*W/16
    net = _conv(net, 256, 1, 1, 0, False, True, 'leaky_relu')            #256*H/16*W/16
    net = _conv(net, 512, 3, 1, 1, False, True, 'leaky_relu')            #512*H/16*W/16
    net = _conv(net, 256, 1, 1, 0, False, True, 'leaky_relu')            #256*H/16*W/16
    route5 = tf.identity(net, 'route5')
    net = _conv(net, 512, 3, 1, 1, False, True, 'leaky_relu')            #512*H/16*W/16
    predict2 = _conv(net, vector_size, 1, 1, 0, True, False, 'linear')   #vector_size*H/16*W/16
    predict2 = l.Activation('linear', dtype='float32')(predict2)
    predict2 = l.Reshape((vector_size, -1))(predict2)              #flatten to vector_size*(H/16*W/16)
    net = _conv(route5, 128, 1, 1, 0, False, True, 'leaky_relu')         #128*H/16*W/16
    net = l.UpSampling2D((2,2),"channels_first",'nearest')(net)    #128*H/8*W/8
    net = l.Concatenate(axis=1)([route1, net])                     #384*H/8*W/8
    net = _conv(net, 128, 1, 1, 0, False, True, 'leaky_relu')            #128*H/8*W/8
    net = _conv(net, 256, 3, 1, 1, False, True, 'leaky_relu')            #256*H/8*W/8
    net = _conv(net, 128, 1, 1, 0, False, True, 'leaky_relu')            #128*H/8*W/8
    net = _conv(net, 256, 3, 1, 1, False, True, 'leaky_relu')            #256*H/8*W/8
    net = _conv(net, 128, 1, 1, 0, False, True, 'leaky_relu')            #128*H/8*W/8
    net = _conv(net, 256, 3, 1, 1, False, True, 'leaky_relu')            #256*H/8*W/8
    predict3 = _conv(net, vector_size, 1, 1, 0, True, False, 'linear')   #vector_size*H/8*W/8
    predict3 = l.Activation('linear', dtype='float32')(predict3)
    predict3 = l.Reshape((vector_size, -1))(predict3)              #flatten to vector_size*(H/8*W/8)
    predict = l.Concatenate()([predict3, predict2, predict1])
    predict = tf.transpose(predict, perm=[0, 2, 1], name='predict')
    model = tf.keras.Model(inputs=[route1, route2, route3], outputs=predict, name='darknet53_yolo')
    return model

As you can see, this detection network takes the backbone's three outputs route1, route2 and route3 as its inputs, and the final Predict output combines the predictions from the three different scales.
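As a quick sanity check, the backbone and the detection head can be chained together and the output shape verified with a dummy 416*416 input (the weights are randomly initialized here; this sketch only checks shapes):

#Sketch: verify the combined output shape for a 416*416 input
backbone = darknet53_base()
head = darknet53_yolov3()
image = tf.keras.Input(shape=(3, 416, 416))
_, r1, r2, r3 = backbone(image)
combined = tf.keras.Model(inputs=image, outputs=head([r1, r2, r3]))
out = combined(tf.zeros([1, 3, 416, 416]))
print(out.shape)   #expected (1, 3549, 255): 52*52+26*26+13*13 grid cells, 85 values * 3 anchors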

The YOLO V3 Training Process

With the training data prepared and the network model built, we can start training. The whole process consists of the following steps:

1. Train the backbone at a higher resolution

The backbone was trained to extract image features at a resolution of 224*224. For object detection this resolution is too low and makes it hard to detect small objects, so we want to work at a higher resolution such as 416*416. The backbone can be trained for a few more epochs at this higher resolution so that it adapts to it, which ultimately improves the detection performance. To do this, we simply reload the previously trained backbone and continue training.
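A minimal sketch of this step, assuming a hypothetical imagenet_train_input_fn_416 pipeline that yields 416*416 channels-first images with integer labels; the checkpoint names and hyperparameters are illustrative:

#Sketch: reload the pretrained backbone and continue ImageNet training at 416*416
model = tf.keras.models.load_model('darknet53/epoch_30.h5')   #checkpoint name is illustrative
classifier = tf.keras.Model(inputs=model.input, outputs=model.get_layer('output').output)
classifier.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
classifier.fit(imagenet_train_input_fn_416(), epochs=5)
model.save('darknet53/epoch_35_416.h5')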

2. Combine the backbone and the detection network

The pretrained backbone and the detection network are combined into one YOLO V3 model. A training image first goes through the backbone for feature extraction, producing the three feature maps route1, route2 and route3 at different resolutions, which are then fed into the detection network, yielding predictions at three scales. In this combined network the backbone parameters are set to non-trainable and only the detection network parameters are trained. The code is as follows:

#Load the pretrained backbone model
model_base = tf.keras.models.load_model('darknet53/epoch_60.h5')
model_base.trainable = False
image = tf.keras.Input(shape=(3,None,None))
_, route1, route2, route3 = model_base(image, training=False)

#The detection model takes the backbone outputs as its inputs
predict = darknet53_yolov3()([route1, route2, route3])

#Construct the combined yolo model
model_yolo = tf.keras.Model(inputs=image, outputs=predict, name='model_yolo')

3. Read the training data and preprocess it

Read the COCO training images and their bounding boxes, then perform data augmentation and generate the training labels. Besides the augmentations used in the Darknet source code, I also followed the augmentation pipeline proposed in the paper https://arxiv.org/pdf/1902.04103.pdf and added mixup: two images are taken at a time and, after applying random transparency, overlaid into a single image.
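At its core the mixup step is just a weighted blend of the two images, with the weight drawn from a Beta distribution. A minimal sketch (image_a and image_b are placeholders for the two decoded images; the full pipeline is in _mixup_function below):

import numpy as np
#Sketch of the mixup blend: draw a weight from Beta(1.5, 1.5) and blend the two images;
#each image's boxes keep the corresponding weight so the loss can be scaled later
lam = np.random.beta(1.5, 1.5)
image_mix = lam * image_a + (1.0 - lam) * image_b
weight_a, weight_b = lam, 1.0 - lam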

Because the bounding box positions are involved, the boxes must be adjusted along with the images during augmentation. The preprocessing includes the following steps:

  1. Image scaling: randomly jitter the image width and height (with a random factor between roughly 0.7 and 1.3), compute the resulting aspect ratio, then resize so that the longer side becomes the input dimension 416 and scale the shorter side by the same ratio.
  2. Image padding: after the previous step the longer side is 416 pixels, so the shorter side is padded up to 416 pixels as well.
  3. Box adjustment: adjust the box coordinates according to the transformations above.
  4. Randomly flip the image horizontally.
  5. Adjust the boxes again after the flip.
  6. Randomly adjust the image saturation, brightness, etc.
  7. Add PCA noise.
  8. Normalize the RGB channel values.
  9. Based on the size of each box, determine which anchor is responsible for predicting it.
  10. The image data and the box data become the features.
  11. Generate the label from the box data. For each grid cell the label vector has dimension (1+1+4+1+80)*3=261: position 1 is the grid id, position 2 the object-existence flag, positions 3-6 the object's center coordinates and width/height, position 7 the mixup weight, and the last 80 positions encode the object class (a small worked example follows this list).
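A tiny worked example of how this label is filled for one box (the actual code is in _label_fn below). Assume a 416*416 image with a ground-truth box centered at (100, 200), 50 pixels wide and 80 pixels high, with class 16, and assume the box was assigned to anchor 4 (the 62x45 anchor), which belongs to the 26x26 feature map:

import math
#Worked example with assumed values, following the logic of _label_fn below
anchor_id = 4
featuremap_id = anchor_id // 3                           #1 -> the 26x26 grid (stride 16)
anchor_offset = anchor_id % 3                            #1 -> second anchor slot of the cell
g_w = g_h = 16.0
grid_id = int(200 // g_h) * 26 + int(100 // g_w)         #12*26 + 6 = 318
center_x_offset = (100 % g_w) / g_w                      #0.25
center_y_offset = (200 % g_h) / g_h                      #0.5
w_target = math.log(50 / 62)                             #width relative to the anchor width
h_target = math.log(80 / 45)                             #height relative to the anchor height
#These values (plus the objectness flag 1) fill slots 1-5 of the second 87-value block of
#grid cell 318 on the 26x26 map; slot 0 holds the grid id, slot 6 the mixup weight,
#and the last 80 slots hold the (smoothed) class label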

First, define some parameters of the model:

import math
import random
import numpy as np
import tensorflow as tf

mixup_flag = True
#Parameters for PCA noise
eigvec = tf.constant(
    [
        [-0.5675, 0.7192, 0.4009], 
        [-0.5808, -0.0045, -0.8140], 
        [-0.5836, -0.6948, 0.4203]
    ], 
    shape=[3,3], 
    dtype=tf.float32
)
eigval = tf.constant([55.46, 4.794, 1.148], shape=[3,1], dtype=tf.float32)
#Parameters for normalization
mean_RGB = tf.constant([123.68, 116.779, 109.939], dtype=tf.float32)
std_RGB = tf.constant([58.393, 57.12, 57.375], dtype=tf.float32)
#Train and valid batch size
batch_size = 16
val_batch_size = 10
epoch_size = 118287
epoch_batch = int(epoch_size/batch_size)
#Parameters for yolo loss scale
no_object_scale = 1.0
iou_threshold = 0.7
object_scale=3.0
class_scale=1.0
jitter = 0.3
#Label and prediction vector size
category_num = 80
label_vector_size = 1+1+4+1+category_num  #index 0:grid_id,1:obj_conf,2-5:(x,y,w,h),6:mixup weight,7-86:category
vector_size = 1+4+category_num  #index 0:obj_conf,1-4:(x,y,w,h),5-84:category
#Images parameter
image_size_list = [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
image_size = image_size_list[random.randint(0,9)]
val_image_size = 608
#Grids parameter
grid_wh_array = np.array([[8.,8.],[16.,16.],[32.,32.]])
grid_size = [8.,16.,32.]
#The Anchor size for image_size 416*416
anchors_base = [10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326]

Define the function that parses the training TFRecord files:

def _parse_function(example_proto):
    features = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "height": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "width": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "channels": tf.io.FixedLenFeature([1], tf.int64, default_value=[3]),
        "colorspace": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "img_format": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "label": tf.io.VarLenFeature(tf.int64),
        "bbox_xmin": tf.io.VarLenFeature(tf.int64),
        "bbox_xmax": tf.io.VarLenFeature(tf.int64),
        "bbox_ymin": tf.io.VarLenFeature(tf.int64),
        "bbox_ymax": tf.io.VarLenFeature(tf.int64),
        "filename": tf.io.FixedLenFeature([], tf.string, default_value="")
    }
    parsed_features = tf.io.parse_single_example(example_proto, features)
    label = tf.expand_dims(parsed_features["label"].values, 0)
    label = tf.cast(label, tf.float32)
    image_raw  = tf.image.decode_jpeg(parsed_features["image"], channels=3)
    image_decoded = tf.cast(image_raw, dtype=tf.float32)
    filename = parsed_features["filename"]
    #Get the coco image id as we need to use COCO API to evaluate
    image_id = tf.strings.to_number(tf.strings.substr(filename, 0, 12), tf.int32)
    image_id = tf.expand_dims(image_id, 0)
    #Get the bbox
    xmin = tf.cast(tf.expand_dims(parsed_features["bbox_xmin"].values, 0), tf.float32)
    xmax = tf.cast(tf.expand_dims(parsed_features["bbox_xmax"].values, 0), tf.float32)
    ymin = tf.cast(tf.expand_dims(parsed_features["bbox_ymin"].values, 0), tf.float32)
    ymax = tf.cast(tf.expand_dims(parsed_features["bbox_ymax"].values, 0), tf.float32)
    mixup_w = tf.ones_like(xmin)
    boxes = tf.concat([xmin,ymin,xmax,ymax,label,mixup_w], axis=0)
    boxes = tf.transpose(boxes, [1, 0])
    return {'image':image_decoded, 'bbox':boxes, 'imageid':image_id}

Define a flat_map function that takes two consecutive images at a time and batches them together:

def _flatmap_function(feature):
    dataset_image = feature['image'].padded_batch(2, [-1,-1,3])
    dataset_bbox = feature['bbox'].padded_batch(2, [-1,6])
    dataset_combined = tf.data.Dataset.zip({'image':dataset_image, 'bbox':dataset_bbox})
    return dataset_combined

The mixup function augments the combined pair of images and generates the training labels at the same time:

def _label_fn(bbox):
    global image_size,grid_wh_array,anchors_base
    grids_list = [image_size//8, image_size//16, image_size//32]
    image_ratio = image_size/416
    anchors = [round(a*image_ratio) for a in anchors_base]
    labels_list = [np.zeros([a**2,label_vector_size]) for a in grids_list]
    for i in range(3):
        labels_list[i][:,0] = np.arange(grids_list[i]**2)
    labels_list = [np.tile(a,3) for a in labels_list]
    box_num, _ = bbox.shape
    for i in range(box_num):
        center_x = (bbox[i,0]+bbox[i,2])/2
        center_y = (bbox[i,1]+bbox[i,3])/2
        if (center_x==0 and center_y==0):
            continue
        box_width = bbox[i,2]-bbox[i,0]
        box_height = bbox[i,3]-bbox[i,1]
        label = int(bbox[i,4].numpy())
        anchor_id = int(bbox[i,5].numpy())
        featuremap_id = anchor_id//3
        anchorid_offset = anchor_id%3
        g_h = grid_wh_array[featuremap_id,1]
        g_w = grid_wh_array[featuremap_id,0]
        grid_id = int((center_y//g_h*grids_list[featuremap_id] + center_x//g_w).numpy())
        index = anchorid_offset*label_vector_size
        #set the object exist flag
        labels_list[featuremap_id][grid_id, index+1] = 1.
        #set the center_x_offset
        labels_list[featuremap_id][grid_id, index+2]=(center_x%g_w)/g_w
        #set the center_y_offset
        labels_list[featuremap_id][grid_id, index+3]=(center_y%g_h)/g_h
        #set the width
        labels_list[featuremap_id][grid_id, index+4]=math.log(box_width/anchors[2*anchor_id])
        #set the height
        labels_list[featuremap_id][grid_id, index+5]=math.log(box_height/anchors[2*anchor_id+1])
        #set the mixup weight
        labels_list[featuremap_id][grid_id, index+6]=bbox[i,6]
        #set the class label, using label smoothing
        labels_list[featuremap_id][grid_id, (index+7):(index+label_vector_size)]=0.1/(category_num-1)
        labels_list[featuremap_id][grid_id, index+7+label]=0.9
        #labels_list[featuremap_id][grid_id, index+7+label]=1.0
    return tf.concat(labels_list, axis=0)

def _mixup_function(features):
    global anchors_base,image_size,mixup_flag,grid_size
    image_ratio = image_size/416
    anchors = [round(a*image_ratio) for a in anchors_base]
    image_height = image_size
    image_width = image_size
    images = features['image']
    bboxes = features['bbox']
    #imageid = features['imageid']
    if mixup_flag:
        lam = np.random.beta(1.5,1.5,1)
        lam_all = np.vstack([lam,1.-lam])
        lam_all = np.expand_dims(lam_all, 1)
        #bboxes = tf.cast(bboxes, tf.float32)
        mixup_w = bboxes[...,-1:] + lam_all
        bboxes_mixup = tf.concat([bboxes[...,:-1], mixup_w], axis=-1)
        bboxes_mixup = tf.reshape(bboxes_mixup, [-1,6])
        #Filter out the padded all-zero boxes; a real box has xmax>0 or ymax>0
        true_box_mask = tf.logical_or(
            bboxes_mixup[:,2]>0,
            bboxes_mixup[:,3]>0
        )
        bboxes_all = tf.boolean_mask(bboxes_mixup, true_box_mask)
        image_mix = (images[0]*lam[0] + images[1]*(1.-lam[0]))
    else:
        image_mix = images
        bboxes_all = bboxes
    #Random jitter and resize the image
    height = tf.shape(image_mix)[0]
    width = tf.shape(image_mix)[1]
    dw = jitter*tf.cast(width, tf.float32)
    dh = jitter*tf.cast(height, tf.float32)
    new_ar = tf.truediv(
        tf.add(
            tf.cast(width, tf.float32), 
            tf.random.uniform([1], minval=tf.math.negative(dw), maxval=dw)),
        tf.add(
            tf.cast(height, tf.float32), 
            tf.random.uniform([1], minval=tf.math.negative(dh), maxval=dh)))
    nh, nw = tf.cond(
        tf.less(new_ar[0],1), \
        lambda:(image_height, tf.cast(tf.cast(image_height, tf.float32)*new_ar[0], tf.int32)), \
        lambda:(tf.cast(tf.cast(image_width, tf.float32)/new_ar[0], tf.int32), image_width)
    )
    dx = tf.cond(
        tf.equal(image_width, nw), \
        lambda:tf.constant([0]), \
        lambda:tf.random.uniform([1], minval=0, maxval=(image_width-nw), dtype=tf.int32)
    )
    dy = tf.cond(
        tf.equal(image_height, nh), \
        lambda:tf.constant([0]), \
        lambda:tf.random.uniform([1], minval=0, maxval=(image_height-nh), dtype=tf.int32)
    )
    image_resize = tf.image.resize(image_mix, [nh, nw])
    image_padded = tf.image.pad_to_bounding_box(image_resize, dy[0], dx[0], image_height, image_width)
    #Adjust the boxes
    xmin_new = tf.cast(tf.truediv(nw, width) * tf.cast(bboxes_all[:,0:1],tf.float64), tf.int32) + dx
    xmax_new = tf.cast(tf.truediv(nw, width) * tf.cast(bboxes_all[:,2:3],tf.float64), tf.int32) + dx
    ymin_new = tf.cast(tf.truediv(nh, height) * tf.cast(bboxes_all[:,1:2],tf.float64), tf.int32) + dy
    ymax_new = tf.cast(tf.truediv(nh, height) * tf.cast(bboxes_all[:,3:4],tf.float64), tf.int32) + dy
    # Random flip flag
    random_flip_flag = tf.random.uniform([1], minval=0, maxval=1, dtype=tf.float32)
    def flip_box():
        xmax_flip = image_width - xmin_new
        xmin_flip = image_width - xmax_new
        image_flip = tf.image.flip_left_right(image_padded)
        return xmin_flip, xmax_flip, image_flip
    def notflip():
        return xmin_new, xmax_new, image_padded
    xmin_flip, xmax_flip, image_flip = tf.cond(tf.less(random_flip_flag[0], 0.5), notflip, flip_box)
    boxes_width = xmax_flip-xmin_flip
    boxes_height = ymax_new-ymin_new
    boxes_area = boxes_width*boxes_height
    # Determine the anchor
    iou_list = []
    for i in range(9):
        intersect_area = tf.minimum(boxes_width, anchors[2*i])*tf.minimum(boxes_height, anchors[2*i+1])
        union_area = boxes_area+anchors[2*i]*anchors[2*i+1]-intersect_area
        iou_list.append(intersect_area/union_area)
    iou = tf.concat(iou_list, axis=1)
    anchor_id = tf.reshape(tf.argmax(iou, axis=1), [-1,1])
    # Random distort the image
    distorted = tf.image.random_hue(image_flip, max_delta=0.3)
    distorted = tf.image.random_saturation(distorted, lower=0.6, upper=1.4)
    distorted = tf.image.random_brightness(distorted, max_delta=0.3)
    # Add PCA noise
    alpha = tf.random.normal([3], mean=0.0, stddev=0.1)
    pca_noise = tf.reshape(tf.matmul(tf.multiply(eigvec,alpha), eigval), [3])
    distorted = tf.add(distorted, pca_noise)
    # Normalize RGB
    distorted = tf.subtract(distorted, mean_RGB)
    distorted = tf.divide(distorted, std_RGB)
    # Get the adjusted boxes
    xmin_flip = tf.cast(xmin_flip, tf.float32)
    xmax_flip = tf.cast(xmax_flip, tf.float32)
    ymin_new = tf.cast(ymin_new, tf.float32)
    ymax_new = tf.cast(ymax_new, tf.float32)
    anchor_id = tf.cast(anchor_id, tf.float32)
    boxes_new = tf.concat([xmin_flip,ymin_new,xmax_flip,ymax_new,bboxes_all[:,4:5],anchor_id,bboxes_all[:,-1:]], axis=1)
    # Remove the boxes that height or width less than 5 pixels
    boxes_mask = tf.math.logical_and(
        tf.math.greater((boxes_new[:,2]-boxes_new[:,0]), 5),
        tf.math.greater((boxes_new[:,3]-boxes_new[:,1]), 5))
    boxes_new = tf.boolean_mask(boxes_new, boxes_mask)
    boxes_new = tf.cast(boxes_new, tf.float32)
    # Generate the labels
    labels = tf.py_function(_label_fn, [boxes_new], [tf.float64])
    labels = tf.cast(labels, tf.float32)
    
    image_train = tf.transpose(distorted, perm=[2, 0, 1])
    #features = {'images':image_train, 'bboxes':boxes_new, 'images_flip':image_flip, 'image_id':imageid}
    features = {'images':image_train, 'bboxes':boxes_new}
    return features, labels[0]

Now the training dataset can be constructed:

def train_input_fn():
    global image_size
    train_files = tf.data.Dataset.list_files("../dataset/coco/train2017_tf/*.tfrecord")
    dataset_train = train_files.interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset_train = dataset_train.shuffle(buffer_size=1000, reshuffle_each_iteration=True)
    dataset_train = dataset_train.repeat(8)
    dataset_train = dataset_train.map(_parse_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    if mixup_flag:
        dataset_train = dataset_train.window(2)
        dataset_train = dataset_train.flat_map(_flatmap_function)
    dataset_train = dataset_train.map(_mixup_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset_train = dataset_train.padded_batch(batch_size, \
        padded_shapes=(
            {
                'images':[3,image_size,image_size],
                'bboxes':[None,7]
            }, 
            [None, label_vector_size*3]
        )
    )
    dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset_train

The effect of the data augmentation can be seen in the figure below:

4. Define the loss function

This is the most challenging part of the whole training process. Following the YOLO V3 source code, the loss function consists of two parts:

  1. The loss for grid cells that are not responsible for any object. Since these cells have no object assigned to them, only the difference between their predicted objectness and the label (0) is penalized; their predicted box coordinates and class probabilities are not. The paper also points out that although these cells are not responsible for a prediction, if the IOU between one of their predicted boxes and any ground-truth box exceeds a threshold (0.5 in the paper; my code uses 0.7), the objectness loss for that prediction should be ignored as well. Since the number of ground-truth boxes varies from image to image, computing the IOU between every ground-truth box and every predicted box needs some care; I do it with broadcasting. The ground-truth boxes are passed in as a feature of shape [batch, V, 4], where V is variable and equal to the largest number of boxes in the batch, and 4 stands for xmin, ymin, xmax, ymax. For example, if the image with the most objects in the batch has 12 boxes, then V=12 and the boxes of the other images are padded up to 12. The ground-truth boxes are then expanded to shape [batch, 1, V, 4], while the boxes decoded from the predictions have shape [batch, 52*52+26*26+13*13, 4*3]: the second dimension is the total number of cells over the three grid sizes, and each cell predicts 3 boxes. The predicted and ground-truth boxes are matched by IOU through broadcasting, the maximum IOU per prediction is taken, and if it exceeds the threshold the objectness of that prediction is not penalized. The objectness loss itself uses sigmoid cross entropy.
  2. The loss for grid cells that are responsible for an object. This part is straightforward: for the object's center coordinates, the sigmoid cross entropy between the raw predictions and the labels is used (the predictions are passed through a sigmoid); for the width and height, the squared error between prediction and label; and for the objectness and the class probabilities, cross entropy again. The different terms are additionally weighted with different scale factors.

The code is as follows:

# Predict: concatenation of the three scales, shape [batch, 52*52+26*26+13*13, 85*3]
# Label: concatenation of the three scales, shape [batch, 52*52+26*26+13*13, 87*3]
def new_loss_func(predict, label, gt_box, grids_property):
    global image_size
    predict = tf.reshape(predict, [batch_size,-1,vector_size]) #[batch, (52*52+26*26+13*13)*3, 85] 
    label = tf.reshape(label, [batch_size,-1,label_vector_size]) #[batch, (52*52+26*26+13*13)*3, 87] 
    noobj_mask = tf.cast(label[...,1:2]==0.0, tf.float32)
    obj_mask = tf.cast(label[...,1:2]==1.0, tf.float32)
    #Get the predict box center xy
    predict_xy = (grids_property[...,0:2]+tf.nn.sigmoid(predict[...,1:3]))*grids_property[...,-2:]
    #Get the predicted box half width/height, used to build the box corners for the IOU check
    predict_half_wh = tf.exp(predict[...,3:5])*grids_property[...,2:4]/2
    predict_xmin = tf.clip_by_value((predict_xy[...,0:1]-predict_half_wh[...,0:1]), 0, image_size)
    predict_xmax = tf.clip_by_value((predict_xy[...,0:1]+predict_half_wh[...,0:1]), 0, image_size)
    predict_ymin = tf.clip_by_value((predict_xy[...,1:2]-predict_half_wh[...,1:2]), 0, image_size)
    predict_ymax = tf.clip_by_value((predict_xy[...,1:2]+predict_half_wh[...,1:2]), 0, image_size)
    predict_boxes_area = (predict_xmax-predict_xmin)*(predict_ymax-predict_ymin) #[batch, (52*52+26*26+13*13)*3, 1]
    #Assemble the predict box coords and expand dim, shape: [batch, (52*52+26*26+13*13)*3, 1, 4]
    predict_boxes = tf.concat([predict_xmin,predict_ymin,predict_xmax,predict_ymax], axis=-1)
    predict_boxes = tf.expand_dims(predict_boxes, 2)
    #Expand ground boxes dim for broadcast, shape: [batch, 1, V, 4]
    gt_box = tf.expand_dims(gt_box, 1)
    gt_box = tf.cast(gt_box, tf.float32)
    #gt_box_area = (gt_box[...,2:3]-gt_box[...,0:1])*(gt_box[...,3:4]-gt_box[...,1:2]) #[batch, 1, V, 1]
    gt_box_area = (gt_box[...,2]-gt_box[...,0])*(gt_box[...,3]-gt_box[...,1]) #[batch, 1, V]
    #Broadcast calculation, intersect_boxes_width shape [batch, noobjs_num, V, 1]
    intersect_boxes_width = tf.minimum(predict_boxes[...,2:3], gt_box[...,2:3])-tf.maximum(predict_boxes[...,0:1], gt_box[...,0:1])
    intersect_boxes_width = tf.clip_by_value(intersect_boxes_width, clip_value_min=0, clip_value_max=image_size)
    intersect_boxes_height = tf.minimum(predict_boxes[...,3:4], gt_box[...,3:4])-tf.maximum(predict_boxes[...,1:2], gt_box[...,1:2])
    intersect_boxes_height = tf.clip_by_value(intersect_boxes_height, clip_value_min=0, clip_value_max=image_size)
    intersect_boxes_area = intersect_boxes_width * intersect_boxes_height # [batch, (52*52+26*26+13*13)*3, V, 1]
    intersect_boxes_area = tf.squeeze(intersect_boxes_area, axis=-1) # [batch, (52*52+26*26+13*13)*3, V]
    #Calculate the noobj predict box IOU with ground truth boxes, shape:[batch, (52*52+26*26+13*13)*3, V]
    iou_boxes = intersect_boxes_area/(predict_boxes_area+gt_box_area-intersect_boxes_area) #
    iou_max = tf.reduce_max(iou_boxes, axis=2, keepdims=True)  #[batch, (52*52+26*26+13*13)*3, 1]
    #iou_max = tf.expand_dims(iou_max, 2)
    #Ignore the noobj loss for the IOU larger than threshold
    no_ignore_mask = tf.cast(iou_max[...,0:1]<iou_threshold, tf.float32)
    noobj_loss_mask = tf.cast(noobj_mask*no_ignore_mask, tf.bool)
    noobj_loss_mask = tf.reshape(noobj_loss_mask, [batch_size, -1])
    noobj_predict = tf.boolean_mask(predict, noobj_loss_mask)[...,0:1]
    #noobj_predict_topk, _ = tf.nn.top_k(noobj_predict[...,0], k=500)
    #Calculate the noobj predict loss
    #loss_noobj = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(tf.zeros_like(noobj_predict_topk), noobj_predict_topk))
    loss_noobj = tf.reduce_sum(
        tf.nn.sigmoid_cross_entropy_with_logits(tf.zeros_like(noobj_predict), noobj_predict))
    loss_noobj = loss_noobj/batch_size
    
    #Calculate the scale for object coord loss
    coord_scale = 2.0-\
        (tf.exp(label[...,4:5])*grids_property[...,2:3])*\
        (tf.exp(label[...,5:6])*grids_property[...,3:4])/\
        image_size**2
    loss_xy = tf.reduce_sum(
        0.5*coord_scale*label[...,6:7]*obj_mask*
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=label[...,2:4], 
            logits=predict[...,1:3]
        )
    )
    loss_wh = tf.reduce_sum(
        0.5*coord_scale*label[...,6:7]*obj_mask*
        tf.square(
            label[...,4:6]-
            predict[...,3:5]
        )
    )
    #Calculate the conf loss
    loss_conf = tf.reduce_sum(
        object_scale*label[...,6:7]*obj_mask*
        tf.nn.sigmoid_cross_entropy_with_logits(
            tf.ones_like(predict[...,0:1]),
            predict[...,0:1]
        )
    )
    #Calculate the predict class loss
    loss_class = tf.reduce_sum(
        class_scale*label[...,6:7]*obj_mask*
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=label[...,7:], 
            logits=predict[...,5:]
        )
    )
    loss_obj = (loss_xy+loss_wh+loss_conf+loss_class)/batch_size
    final_loss = loss_obj + loss_noobj
    return final_loss
tf_new_loss_func = tf.function(new_loss_func, experimental_relax_shapes=True)

5. Train the model

I train the model with a custom training loop. The YOLO paper mentions that multiple image scales (for example 416*416, 608*608, 352*352) can be sampled randomly during training, which helps the model handle images of different sizes.

The code for randomly selecting the image scale is as follows:

def random_image():
    global image_size_list,image_size
    global grid_wh_array
    global anchors_base
    
    image_size = image_size_list[random.randint(0,9)]
    #image_size = 608
    image_ratio = image_size/416
    grids_list = [image_size//8, image_size//16, image_size//32]
    anchors = [round(a*image_ratio) for a in anchors_base]
    grids_x_list = [np.reshape(np.arange(a**2)%a,[-1,1]) for a in grids_list]
    grids_x = np.vstack(grids_x_list)
    grids_x = np.reshape(np.hstack([grids_x,grids_x,grids_x]),[-1,1])
    grids_y_list = [np.reshape(np.arange(a**2)//a,[-1,1]) for a in grids_list]
    grids_y = np.vstack(grids_y_list)
    grids_y = np.reshape(np.hstack([grids_y,grids_y,grids_y]),[-1,1])
    anchors_all = np.vstack(
        [
            np.reshape(np.tile(np.reshape(np.array(anchors[:6]),[-1,6]),[grids_list[0]**2,1]),[-1,2]),
            np.reshape(np.tile(np.reshape(np.array(anchors[6:12]),[-1,6]),[grids_list[1]**2,1]),[-1,2]),
            np.reshape(np.tile(np.reshape(np.array(anchors[12:]),[-1,6]),[grids_list[2]**2,1]),[-1,2])
        ]
    )
    grid_wh_all = np.vstack(
        [
            np.tile(grid_wh_array[:1,:], (grids_list[0]**2*3,1)),
            np.tile(grid_wh_array[1:2,:], (grids_list[1]**2*3,1)),
            np.tile(grid_wh_array[2:3,:], (grids_list[2]**2*3,1))
        ]
    )
    grids_property = np.concatenate([grids_x, grids_y, anchors_all, grid_wh_all], axis=-1)
    grids_property_all = tf.constant(grids_property, dtype=tf.float32)
    grids_property_all = tf.expand_dims(grids_property_all, 0)
    grids_property_all = tf.tile(grids_property_all, [batch_size,1,1])
    return grids_property_all

The custom training loop is as follows:

model_base = tf.keras.models.load_model('darknet53_20200228/epoch_42.h5')
model_base.trainable = False
image = tf.keras.Input(shape=(3,None,None))
_, route1, route2, route3 = model_base(image, training=False)
predict = darknet53_yolov3()([route1, route2, route3])
model_yolo = tf.keras.Model(inputs=image, outputs=predict, name='model_yolo')

START_EPOCH = 0
NUM_EPOCH = 1
STEPS_EPOCH = epoch_batch
STEPS_OFFSET = STEPS_EPOCH*START_EPOCH
initial_warmup_steps = 1000
initial_lr = 0.0005

optimizer=tf.keras.optimizers.SGD(learning_rate=0.00001, momentum=0.9)
mp_opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)
def train_step(images, bbox, labels, grids_property_all):
    with tf.GradientTape() as tape:
        predict = model_yolo(images, training=True)
        regularization_loss = tf.math.add_n(model_yolo.losses)
        pred_loss = tf_new_loss_func(predict, labels, bbox, grids_property_all)
        total_loss = pred_loss + regularization_loss
    gradients = tape.gradient(total_loss, model_yolo.trainable_variables)
    mp_opt.apply_gradients(zip(gradients, model_yolo.trainable_variables))
    return total_loss, predict
tf_train_step = tf.function(train_step, experimental_relax_shapes=True)
#Learning rate step decay
boundaries = [STEPS_EPOCH*4, STEPS_EPOCH*10, STEPS_EPOCH*13, STEPS_EPOCH*16]
values = [0.0005, 0.0001, 0.00005, 0.00001, 0.00005]
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)

steps = STEPS_OFFSET
for epoch in range(NUM_EPOCH):
    loss_sum = 0
    start_time = time.time()
    grids_property_all = random_image()
    train_data = iter(train_input_fn())
    #for features, labels in train_data:
    while(True):
        if steps < initial_warmup_steps:
            newlr = (initial_lr/initial_warmup_steps)*steps
            tf.keras.backend.set_value(optimizer.lr, newlr)
        features, data_labels = next(train_data)
        loss_temp, predict_temp = tf_train_step(features['images'], features['bboxes'], data_labels, grids_property_all)
        loss_sum += loss_temp
        steps += 1
        if steps%100 == 0:
            elasp_time = time.time()-start_time
            lr = tf.keras.backend.get_value(optimizer.lr)
            print("Step:{}, Image_size:{:d}, Loss:{:4.2f}, LR:{:5f}, Time:{:3.1f}s".format(steps, image_size, loss_sum/100, lr, elasp_time))
            loss_sum = 0
            if steps > initial_warmup_steps:
                tf.keras.backend.set_value(optimizer.lr, learning_rate_fn(steps))
            start_time = time.time()
        if steps%STEPS_EPOCH == 0:
            START_EPOCH += 1
            model_yolo.save('model_yolov3/yolo_v10_'+str(START_EPOCH)+'.h5')
            break

Training the model is very time-consuming: on my machine (a 2080 Ti) one epoch takes about two hours. I trained for 14 epochs and reached roughly 32% mAP at IOU 0.50, which is still quite far from the 57.9% reported in the paper. The Darknet source trains for more than 200 epochs, however, so continuing the training may improve the accuracy further; this remains to be verified.

6. Evaluate the model

Object detection performance is usually measured with mAP. Computing this metric is fairly involved, so I use the COCO API directly, which also keeps the evaluation consistent with the paper.

First, build the COCO validation dataset; the code is as follows:

def _parse_val_function(example_proto):
    global val_image_size
    features = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "height": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "width": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
        "channels": tf.io.FixedLenFeature([1], tf.int64, default_value=[3]),
        "colorspace": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "img_format": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "label": tf.io.VarLenFeature(tf.int64),
        "bbox_xmin": tf.io.VarLenFeature(tf.int64),
        "bbox_xmax": tf.io.VarLenFeature(tf.int64),
        "bbox_ymin": tf.io.VarLenFeature(tf.int64),
        "bbox_ymax": tf.io.VarLenFeature(tf.int64),
        "filename": tf.io.FixedLenFeature([], tf.string, default_value="")
    }
    parsed_features = tf.io.parse_single_example(example_proto, features)
    label = tf.expand_dims(parsed_features["label"].values, 0)
    label = tf.cast(label, tf.int32)
    channels = parsed_features["channels"]
    filename = parsed_features["filename"]
    #Get the coco image id as we need to use COCO API to evaluate
    image_id = tf.strings.to_number(
        tf.strings.substr(filename, 0, 12),
        tf.int32
    )
    #Decode the image
    image_raw  = tf.image.decode_jpeg(parsed_features["image"], channels=3)
    image_decoded = tf.cast(image_raw, dtype=tf.float32)
    image_h = tf.constant(val_image_size)
    image_w = tf.constant(val_image_size)
    height = tf.shape(image_decoded)[0]
    width = tf.shape(image_decoded)[1]
    original_size = tf.stack([height, width], axis=0)
    original_size = tf.cast(original_size, tf.float32)
    ratio = tf.truediv(tf.cast(height, tf.float32), tf.cast(width, tf.float32))
    nh, nw = tf.cond(
        tf.less(ratio,1),
        lambda:(tf.cast(tf.cast(image_h, tf.float32)*ratio, tf.int32), image_w),
        lambda:(image_h, tf.cast(tf.cast(image_w, tf.float32)/ratio, tf.int32)))
    dx = tf.cond(
        tf.equal(image_w, nw), \
        lambda:tf.constant(0), \
        lambda:tf.cast((image_w-nw)/2, tf.int32))
    dy = tf.cond(
        tf.equal(image_h, nh), \
        lambda:0, \
        lambda:tf.cast((image_h-nh)/2, tf.int32))
    image_resize = tf.image.resize(image_decoded, [nh, nw])
    image_padded = tf.image.pad_to_bounding_box(image_resize, dy, dx, image_h, image_w)
    image_normalize = tf.subtract(image_padded, mean_RGB)
    image_normalize = tf.divide(image_normalize, std_RGB)
    image_val = tf.transpose(image_normalize, perm=[2, 0, 1])
    features = {'images':image_val, 'image_id':image_id, 'original_size':original_size}
    return features

def val_input_fn():
    val_files = tf.data.Dataset.list_files("../dataset/coco/val2017_tf/*.tfrecord")
    dataset_val = val_files.interleave(tf.data.TFRecordDataset, cycle_length=12, num_parallel_calls=12)
    dataset_val = dataset_val.map(_parse_val_function, num_parallel_calls=12)
    dataset_val = dataset_val.batch(val_batch_size)
    dataset_val = dataset_val.prefetch(1)
    return dataset_val

Decode the predictions and convert them into the corresponding bounding boxes, as in the following code:

def predict_func(predict, image_id, original_size):
    global val_image_size, anchors_base
    val_grids_list = [val_image_size//8, val_image_size//16, val_image_size//32]
    image_ratio = val_image_size/416
    val_anchors = [round(a*image_ratio) for a in anchors_base]
    val_grids_x_list = [np.reshape(np.arange(a**2)%a,[-1,1]) for a in val_grids_list]
    val_grids_x = np.vstack(val_grids_x_list)
    val_grids_y_list = [np.reshape(np.arange(a**2)//a,[-1,1]) for a in val_grids_list]
    val_grids_y = np.vstack(val_grids_y_list)
    val_anchors_all = np.vstack(
        [
            np.tile(np.reshape(np.array(val_anchors[:6]),[-1,6]),[val_grids_list[0]**2,1]),
            np.tile(np.reshape(np.array(val_anchors[6:12]),[-1,6]),[val_grids_list[1]**2,1]),
            np.tile(np.reshape(np.array(val_anchors[12:]),[-1,6]),[val_grids_list[2]**2,1])
        ]
    )
    grid_wh_all = np.vstack(
        [
            np.tile(grid_wh_array[:1,:], (val_grids_list[0]**2,1)),
            np.tile(grid_wh_array[1:2,:], (val_grids_list[1]**2,1)),
            np.tile(grid_wh_array[2:3,:], (val_grids_list[2]**2,1))
        ]
    )
    val_grids_property = np.concatenate([val_grids_x, val_grids_y, val_anchors_all, grid_wh_all], axis=-1)
    val_grids_property_all = tf.constant(val_grids_property, dtype=tf.float32)
    val_grids_property_all = tf.expand_dims(val_grids_property_all, 0)
    val_grids_property_all = tf.tile(val_grids_property_all, [predict.shape[0],1,1])
    result_json = []
    original_height = original_size[...,0]
    original_width = original_size[...,1]
    hw_ratio = original_height/original_width
    hw_ratio_mask = tf.cast(tf.less(hw_ratio, 1.), tf.float32)
    ratio = \
        hw_ratio_mask*(original_width/val_image_size) + \
        (1.-hw_ratio_mask)*(original_height/val_image_size)
    dx = (1.-hw_ratio_mask)*((original_height-original_width)//2)
    dy = hw_ratio_mask*((original_width-original_height)//2)
    
    confidence_threshold = 0.2
    probabilty_threshold = 0.5
    predict_boxes_list = []
    for i in range(3):
        predict_conf = tf.nn.sigmoid(predict[...,i*vector_size:(i*vector_size+1)])
        predict_xy = tf.nn.sigmoid(predict[...,(i*vector_size+1):(i*vector_size+3)])
        predict_xy = predict_xy + val_grids_property_all[...,0:2]
        predict_x = predict_xy[...,0:1] * val_grids_property_all[...,-2:-1]
        predict_y = predict_xy[...,1:] * val_grids_property_all[...,-1:]
        predict_w = tf.exp(predict[...,(i*vector_size+3):(i*vector_size+4)])
        predict_w = predict_w * val_grids_property_all[...,(2+i*2):(2+i*2+1)]
        predict_h = tf.exp(predict[...,(i*vector_size+4):(i*vector_size+5)])
        predict_h = predict_h * val_grids_property_all[...,(2+i*2+1):(2+i*2+2)]
        min_x = tf.clip_by_value((predict_x-predict_w/2), 0, val_image_size)
        max_x = tf.clip_by_value((predict_x + predict_w/2), 0, val_image_size)
        min_y = tf.clip_by_value((predict_y - predict_h/2), 0, val_image_size)
        max_y = tf.clip_by_value((predict_y + predict_h/2), 0, val_image_size)
        predict_class = tf.argmax(predict[...,(i*vector_size+5):((i+1)*vector_size)], axis=-1)
        predict_class = tf.cast(predict_class, tf.float32)
        predict_class = tf.expand_dims(predict_class, 2)
        predict_proba = tf.nn.sigmoid(
            tf.reduce_max(
                predict[...,(i*vector_size+5):((i+1)*vector_size)], axis=-1, keepdims=True
            )
        )
        predict_box = tf.concat([predict_conf, min_x, min_y, max_x, max_y, predict_class, predict_proba], axis=-1)
        predict_boxes_list.append(predict_box)
    predict_boxes = tf.concat(predict_boxes_list, axis=1)
    
    for i in range(predict.shape[0]):
        obj_mask = tf.logical_and(
            predict_boxes[i,:,0]>=confidence_threshold,
            predict_boxes[i,:,-1]>=probabilty_threshold)
        predict_true_box = tf.boolean_mask(predict_boxes[i], obj_mask)
        predict_classes, _ = tf.unique(predict_true_box[:,5])
        predict_classes_list = tf.unstack(predict_classes)
        for class_id in predict_classes_list:
            class_mask = tf.math.equal(predict_true_box[:, 5], class_id)
            predict_true_box_class = tf.boolean_mask(predict_true_box, class_mask)
            predict_true_box_xy = predict_true_box_class[:, 1:5]
            predict_true_box_score = predict_true_box_class[:, 6]*predict_true_box_class[:, 0]
            #predict_true_box_score = predict_true_box_class[:, 0]
            selected_indices = tf.image.non_max_suppression(
                predict_true_box_xy,
                predict_true_box_score,
                100,
                iou_threshold=0.2
                #score_threshold=confidence_threshold
            )
            #Shape [box_num, 7]
            selected_boxes = tf.gather(predict_true_box_class, selected_indices) 
            original_bbox_xmin = tf.clip_by_value(
                selected_boxes[:,1:2]*ratio[i]-dx[i], 0, original_width[i])
            original_bbox_xmax = tf.clip_by_value(
                selected_boxes[:,3:4]*ratio[i]-dx[i], 0, original_width[i])
            original_bbox_ymin = tf.clip_by_value(
                selected_boxes[:,2:3]*ratio[i]-dy[i], 0, original_height[i])
            original_bbox_ymax = tf.clip_by_value(
                selected_boxes[:,4:5]*ratio[i]-dy[i], 0, original_height[i])
            original_bbox_width = original_bbox_xmax - original_bbox_xmin
            original_bbox_height = original_bbox_ymax - original_bbox_ymin
            original_bbox = tf.concat(
                [
                    selected_boxes[:,0:1],
                    original_bbox_xmin,
                    original_bbox_ymin,
                    original_bbox_width,
                    original_bbox_height,
                    selected_boxes[:,5:]
                ], axis=-1
            )
            original_bbox_list = tf.unstack(original_bbox)
            for item in original_bbox_list:
                result = {}
                result['image_id'] = int(image_id.numpy()[i])
                result['category_id'] = cocoid_mapping_labels[int(class_id.numpy())]
                result['bbox'] = item[1:5].numpy().tolist()
                result['bbox'] = [int(a*10)/10 for a in result['bbox']]
                result['score'] = int((item[0]*item[6]).numpy()*1000)/1000
                result['conf'] = str(int(item[0].numpy()*1000)/1000)
                result['prop'] = str(int(item[6].numpy()*1000)/1000)
                result_json.append(result)
    return result_json
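
The evaluation below assumes pycocotools is installed, that coco holds the COCO ground-truth annotations, and that cocoid_mapping_labels (used in predict_func above) maps the contiguous 0-79 training labels back to the original COCO category ids. A sketch of this setup, with an illustrative annotation file path:

import json
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

#Load the ground-truth annotations; the path depends on where the dataset is stored
coco = COCO('../dataset/coco/annotations/instances_val2017.json')
#Assumed mapping from the contiguous 0-79 training labels to the original COCO category ids
cocoid_mapping_labels = {i: cat_id for i, cat_id in enumerate(sorted(coco.getCatIds()))}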

Use the COCO API to compute the mAP:

START_EPOCH = 14
val_image_size = 608
dataset_val = val_input_fn()
all_result_json = []
i = 0
for val_features in dataset_val:
    predict = model_yolo(val_features['images'], training=False)
    result_json = predict_func(
        predict, val_features['image_id'], val_features['original_size']
    )
    all_result_json.extend(result_json)
    i +=1
all_result_str = ','.join([json.dumps(item) for item in all_result_json])
all_result_str = '['+all_result_str+']'
result_filename = 'test_v11_epoch_'+str(START_EPOCH)+'_result.json'
result_file = open(result_filename, 'w')
result_file.write(all_result_str)
result_file.close()
cocodt = coco.loadRes(result_filename)
annType = 'bbox'
imgIds=sorted(coco.getImgIds())
cocoEval = COCOeval(coco,cocodt,annType)
cocoEval.params.imgIds = imgIds
cocoEval.evaluate()
cocoEval.accumulate()
cocoEval.summarize()

Model Prediction Results

Finally, let's see how well the model predicts. First, the kite image:

For comparison, the prediction of the official Darknet YOLOv3 model:

Somewhat unexpectedly :) my model's result looks a bit better here: the official model misses several kites, although my model mistakenly detects a wave as a person. Overall my model seems slightly more accurate on this image.

Now another image, dog. Here is my model's detection result:

The official model's result:

Overall the detections are basically the same, although the official model's boxes are a bit more precise (better IOU).

 
