Preface

Time for a TF2 version... it turns out to be very similar to the Keras one!

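What is YOLOV4

YOLOV4 is an improved version of YOLOV3 that combines a large number of small tricks on top of YOLOV3.
Although it brings no revolutionary change to object detection, YOLOV4 still strikes a very good balance between speed and accuracy.
As the figure above shows, YOLOV4 reaches an mAP of about 44 with no drop in FPS compared with YOLOV3, which is a very clear improvement.

The overall detection approach of YOLOV4 differs little from YOLOV3: both use three feature layers for classification and regression prediction.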

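Please note!

I strongly recommend studying YOLOV3 before YOLOV4, because YOLOV4 really can be seen as YOLOV3 combined with a series of improvements!

(Important things are worth repeating!)

For YOLOV3, see this blog post: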
https://blog.csdn.net/weixin_44791964/article/details/103276106

Code download

https://github.com/bubbliiiing/yolov4-tf2
Feel free to give it a star if you like it!

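YOLOV4 improvements (incomplete list)

1. Backbone feature extraction network: DarkNet53 => CSPDarkNet53

2. Feature pyramid: SPP, PAN

3. Classification and regression layer: YOLOv3 (unchanged)

4. Training tricks: Mosaic data augmentation, label smoothing, CIOU, cosine annealing learning-rate decay

5. Activation function: Mish

This is not the full list of improvements. YOLOV4 uses so many that it is hard to implement and list them all, so only the ones I find most interesting and most effective are listed here.

The whole post analyses YOLOV4 in terms of its differences from YOLOV3.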

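YOLOV4 structure analysis

1. Backbone feature extraction network

When the input is 416x416, the feature structure is as follows:

When the input is 608x608, the feature structure is as follows:

The backbone has two improvements:
a). Backbone network: DarkNet53 => CSPDarkNet53
b). Activation function: Mish

If you are familiar with YOLOV3, you will know the structure of Darknet53: it is built from a series of residual structures. Darknet53 contains the resblock_body module below, which consists of one downsampling step followed by a stack of residual units. Darknet53 is assembled from these resblock_body modules.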

def resblock_body(x, num_filters, num_blocks):
    x = ZeroPadding2D(((1,0),(1,0)))(x)
    x = DarknetConv2D_BN_Leaky(num_filters, (3,3), strides=(2,2))(x)
    for i in range(num_blocks):
        y = DarknetConv2D_BN_Leaky(num_filters//2, (1,1))(x)
        y = DarknetConv2D_BN_Leaky(num_filters, (3,3))(y)
        x = Add()([x,y])
    return x

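YOLOV4 modifies this part in two ways.
1. First, the activation function of DarknetConv2D is changed from LeakyReLU to Mish, so the convolution block DarknetConv2D_BN_Leaky becomes DarknetConv2D_BN_Mish.
The formula and graph of the Mish function are as follows:
$\mathrm{Mish}(x) = x \cdot \tanh(\ln(1+e^x))$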
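
To get a feel for the formula, here is a minimal pure-Python sketch that evaluates Mish at a few points. The function name mish and the numerically stable form of softplus are my own choices for illustration; the Keras layer in the full listing is what the project actually uses.

```python
import math

def mish(x: float) -> float:
    """Mish(x) = x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)."""
    # Stable softplus: max(x, 0) + log1p(exp(-|x|)) avoids overflow for large x
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"mish({x:+.1f}) = {mish(x):+.4f}")
```

Unlike LeakyReLU, the curve is smooth everywhere, dips slightly below zero for negative inputs, and approaches the identity for large positive inputs.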

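2. Second, the structure of resblock_body is changed to use the CSPNet structure. The Darknet53 in YOLOV4 thereby becomes CSPDarknet53.

The CSPNet structure is not complicated. It splits the original stack of residual blocks into two parts:
the main branch continues with the original stack of residual blocks;
the other branch acts like a residual edge and, after a small amount of processing, is connected directly to the end.
CSP can therefore be regarded as containing one large residual edge.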

#---------------------------------------------------#
#   CSPDarknet structural block
#   Contains one large residual edge
#   that bypasses a whole stack of residual units
#---------------------------------------------------#
def resblock_body(x, num_filters, num_blocks, all_narrow=True):
    # Downsample height and width
    preconv1 = ZeroPadding2D(((1,0),(1,0)))(x)
    preconv1 = DarknetConv2D_BN_Mish(num_filters, (3,3), strides=(2,2))(preconv1)

    # Build the large residual edge
    shortconv = DarknetConv2D_BN_Mish(num_filters//2 if all_narrow else num_filters, (1,1))(preconv1)

    # Convolution on the main branch
    mainconv = DarknetConv2D_BN_Mish(num_filters//2 if all_narrow else num_filters, (1,1))(preconv1)
    # 1x1 convolution to mix channels -> 3x3 convolution to extract features, with a residual connection
    for i in range(num_blocks):
        y = compose(
                DarknetConv2D_BN_Mish(num_filters//2, (1,1)),
                DarknetConv2D_BN_Mish(num_filters//2 if all_narrow else num_filters, (3,3)))(mainconv)
        mainconv = Add()([mainconv,y])
    # 1x1 convolution, then concatenate with the residual edge
    postconv = DarknetConv2D_BN_Mish(num_filters//2 if all_narrow else num_filters, (1,1))(mainconv)
    route = Concatenate()([postconv, shortconv])

    # Finally mix the channels
    return DarknetConv2D_BN_Mish(num_filters, (1,1))(route)

The full implementation is:

from functools import wraps
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Conv2D, Add, ZeroPadding2D, UpSampling2D, Concatenate, MaxPooling2D, Layer, LeakyReLU, BatchNormalization
from tensorflow.keras.regularizers import l2
from utils.utils import compose


class Mish(Layer):
    def __init__(self, **kwargs):
        super(Mish, self).__init__(**kwargs)
        self.supports_masking = True

    def call(self, inputs):
        return inputs * K.tanh(K.softplus(inputs))

    def get_config(self):
        config = super(Mish, self).get_config()
        return config

    def compute_output_shape(self, input_shape):
        return input_shape
#--------------------------------------------------#
#   Single convolution
#--------------------------------------------------#
@wraps(Conv2D)
def DarknetConv2D(*args, **kwargs):
    darknet_conv_kwargs = {'kernel_regularizer': l2(5e-4)}
    darknet_conv_kwargs['padding'] = 'valid' if kwargs.get('strides')==(2,2) else 'same'
    darknet_conv_kwargs.update(kwargs)
    return Conv2D(*args, **darknet_conv_kwargs)

#---------------------------------------------------#
#   Convolution block
#   DarknetConv2D + BatchNormalization + Mish
#---------------------------------------------------#
def DarknetConv2D_BN_Mish(*args, **kwargs):
    no_bias_kwargs = {'use_bias': False}
    no_bias_kwargs.update(kwargs)
    return compose(
        DarknetConv2D(*args, **no_bias_kwargs),
        BatchNormalization(),
        Mish())

#---------------------------------------------------#
#   CSPDarknet structural block
#   Contains one large residual edge
#   that bypasses a whole stack of residual units
#---------------------------------------------------#
def resblock_body(x, num_filters, num_blocks, all_narrow=True):
    # Downsample height and width
    preconv1 = ZeroPadding2D(((1,0),(1,0)))(x)
    preconv1 = DarknetConv2D_BN_Mish(num_filters, (3,3), strides=(2,2))(preconv1)

    # Build the large residual edge
    shortconv = DarknetConv2D_BN_Mish(num_filters//2 if all_narrow else num_filters, (1,1))(preconv1)

    # Convolution on the main branch
    mainconv = DarknetConv2D_BN_Mish(num_filters//2 if all_narrow else num_filters, (1,1))(preconv1)
    # 1x1 convolution to mix channels -> 3x3 convolution to extract features, with a residual connection
    for i in range(num_blocks):
        y = compose(
                DarknetConv2D_BN_Mish(num_filters//2, (1,1)),
                DarknetConv2D_BN_Mish(num_filters//2 if all_narrow else num_filters, (3,3)))(mainconv)
        mainconv = Add()([mainconv,y])
    # 1x1 convolution, then concatenate with the residual edge
    postconv = DarknetConv2D_BN_Mish(num_filters//2 if all_narrow else num_filters, (1,1))(mainconv)
    route = Concatenate()([postconv, shortconv])

    # Finally mix the channels
    return DarknetConv2D_BN_Mish(num_filters, (1,1))(route)

#---------------------------------------------------#
#   Main body of CSPDarknet53
#---------------------------------------------------#
def darknet_body(x):
    x = DarknetConv2D_BN_Mish(32, (3,3))(x)
    x = resblock_body(x, 64, 1, False)
    x = resblock_body(x, 128, 2)
    x = resblock_body(x, 256, 8)
    feat1 = x
    x = resblock_body(x, 512, 8)
    feat2 = x
    x = resblock_body(x, 1024, 4)
    feat3 = x
    return feat1,feat2,feat3

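2. Feature pyramid

When the input is 416x416, the feature structure is as follows:

When the input is 608x608, the feature structure is as follows:

In the feature pyramid, YOLOV4 combines two improvements:
a). The SPP structure.
b). The PANet structure.
As the figure above shows, everything apart from CSPDarknet53 and the Yolo Head belongs to the feature pyramid.
1. The SPP structure is embedded in the convolutions applied to the last feature layer of CSPDarknet53. After three DarknetConv2D_BN_Leaky convolutions on that layer, max pooling at four different scales is applied, with kernel sizes 13x13, 9x9, 5x5 and 1x1 (1x1 meaning no processing).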

# SPP structure: max pooling at several scales, then concatenation.
maxpool1 = MaxPooling2D(pool_size=(13,13), strides=(1,1), padding='same')(P5)
maxpool2 = MaxPooling2D(pool_size=(9,9), strides=(1,1), padding='same')(P5)
maxpool3 = MaxPooling2D(pool_size=(5,5), strides=(1,1), padding='same')(P5)
P5 = Concatenate()([maxpool1, maxpool2, maxpool3, P5])

This greatly enlarges the receptive field and separates out the most significant context features.
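
The effect of stride-1 pooling with 'same' padding can be seen on a toy example. The following is a pure-Python sketch on a single 2-D map, not the Keras layers used above; max_pool_same and spp are hypothetical helper names. It shows that the spatial size is preserved, so the pooled maps can be stacked with the input, and that larger kernels spread a single activation over a larger area.

```python
def max_pool_same(grid, k):
    """Stride-1 max pooling with 'same' padding on a 2-D list (k is odd).

    Border windows are simply clipped to the map, which gives the same
    result as padding that never wins the max.
    """
    h, w, r = len(grid), len(grid[0]), k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [grid[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            row.append(max(window))
        out.append(row)
    return out

def spp(grid, kernel_sizes=(13, 9, 5)):
    """Return the pooled maps plus the untouched input as a list of 'channels'."""
    return [max_pool_same(grid, k) for k in kernel_sizes] + [grid]

feature = [[0] * 7 for _ in range(7)]
feature[3][3] = 1  # a single activation in the centre
channels = spp(feature)
print(len(channels), "maps of size", len(channels[0]), "x", len(channels[0][0]))
```

In the real network each of the four branches keeps all 512 channels, so the concatenated tensor has 4 x 512 = 2048 channels before the following 1x1 convolution reduces it again.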

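2. PANet is an instance segmentation algorithm from 2018; its structure is built around repeatedly refining features.

The figure above shows the original PANet structure. Its most important characteristic is the repeated extraction of features.
(a) is the traditional feature pyramid structure. After the feature pyramid has completed feature extraction from bottom to top, the top-to-bottom feature extraction in (b) still has to be carried out.

In YOLOV4, the PANet structure is applied mainly to the three effective feature layers.

The implementation is as follows: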

#---------------------------------------------------#
#   Feature layers -> final outputs
#---------------------------------------------------#
def yolo_body(inputs, num_anchors, num_classes):
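    # Build the CSPDarknet53 backbone
    # (DarknetConv2D_BN_Leaky, make_five_convs and Model are defined or imported elsewhere in the repository)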
    feat1,feat2,feat3 = darknet_body(inputs)

    P5 = DarknetConv2D_BN_Leaky(512, (1,1))(feat3)
    P5 = DarknetConv2D_BN_Leaky(1024, (3,3))(P5)
    P5 = DarknetConv2D_BN_Leaky(512, (1,1))(P5)
    # SPP structure: max pooling at several scales, then concatenation.
    maxpool1 = MaxPooling2D(pool_size=(13,13), strides=(1,1), padding='same')(P5)
    maxpool2 = MaxPooling2D(pool_size=(9,9), strides=(1,1), padding='same')(P5)
    maxpool3 = MaxPooling2D(pool_size=(5,5), strides=(1,1), padding='same')(P5)
    P5 = Concatenate()([maxpool1, maxpool2, maxpool3, P5])
    P5 = DarknetConv2D_BN_Leaky(512, (1,1))(P5)
    P5 = DarknetConv2D_BN_Leaky(1024, (3,3))(P5)
    P5 = DarknetConv2D_BN_Leaky(512, (1,1))(P5)

    P5_upsample = compose(DarknetConv2D_BN_Leaky(256, (1,1)), UpSampling2D(2))(P5)
    
    P4 = DarknetConv2D_BN_Leaky(256, (1,1))(feat2)
    P4 = Concatenate()([P4, P5_upsample])
    P4 = make_five_convs(P4,256)

    P4_upsample = compose(DarknetConv2D_BN_Leaky(128, (1,1)), UpSampling2D(2))(P4)
    
    P3 = DarknetConv2D_BN_Leaky(128, (1,1))(feat1)
    P3 = Concatenate()([P3, P4_upsample])
    P3 = make_five_convs(P3,128)

    # 76x76 output
    P3_output = DarknetConv2D_BN_Leaky(256, (3,3))(P3)
    P3_output = DarknetConv2D(num_anchors*(num_classes+5), (1,1))(P3_output)

    P3_downsample = ZeroPadding2D(((1,0),(1,0)))(P3)
    P3_downsample = DarknetConv2D_BN_Leaky(256, (3,3), strides=(2,2))(P3_downsample)
    P4 = Concatenate()([P3_downsample, P4])
    P4 = make_five_convs(P4,256)
    
    # 38x38 output
    P4_output = DarknetConv2D_BN_Leaky(512, (3,3))(P4)
    P4_output = DarknetConv2D(num_anchors*(num_classes+5), (1,1))(P4_output)
    

    P4_downsample = ZeroPadding2D(((1,0),(1,0)))(P4)
    P4_downsample = DarknetConv2D_BN_Leaky(512, (3,3), strides=(2,2))(P4_downsample)
    P5 = Concatenate()([P4_downsample, P5])
    P5 = make_five_convs(P5,512)
    
    # 19x19 output
    P5_output = DarknetConv2D_BN_Leaky(1024, (3,3))(P5)
    P5_output = DarknetConv2D(num_anchors*(num_classes+5), (1,1))(P5_output)

    return Model(inputs, [P5_output, P4_output, P3_output])

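3. YoloHead: prediction from the extracted features

When the input is 416x416, the feature structure is as follows:

When the input is 608x608, the feature structure is as follows:

1. For feature utilisation, YoloV4 extracts multiple feature layers for object detection. Three feature layers are taken, from the middle, lower-middle and bottom of the network, with shapes (76,76,256), (38,38,512) and (19,19,1024).

2. The output layers have shapes (19,19,75), (38,38,75) and (76,76,75). The last dimension is 75 because this figure is based on the VOC dataset, which has 20 classes, and YoloV4 uses 3 prior boxes (anchors) per feature layer, so the last dimension is 3x25;
with the COCO training set there are 80 classes, so the last dimension is 255 = 3x85 and the three output layers have shapes (19,19,255), (38,38,255) and (76,76,255).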
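
The arithmetic behind these shapes can be written down in a few lines. This is only an illustrative sketch: head_output_shapes is a hypothetical helper, and the strides 32, 16 and 8 are inferred from the 19/38/76 grids for a 608x608 input.

```python
def head_output_shapes(input_size, num_classes, anchors_per_layer=3, strides=(32, 16, 8)):
    """Shapes of the three head outputs for a square input image."""
    # Each anchor predicts 4 box values + 1 confidence + one score per class
    channels = anchors_per_layer * (num_classes + 5)
    return [(input_size // s, input_size // s, channels) for s in strides]

print(head_output_shapes(608, 20))  # VOC
print(head_output_shapes(608, 80))  # COCO
print(head_output_shapes(416, 80))  # smaller input, same channel count
```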

The implementation is as follows:

#---------------------------------------------------#
#   Feature layers -> final outputs
#---------------------------------------------------#
def yolo_body(inputs, num_anchors, num_classes):
# Part of the function is omitted; only the final head is shown
    P3_output = DarknetConv2D_BN_Leaky(256, (3,3))(P3)
    P3_output = DarknetConv2D(num_anchors*(num_classes+5), (1,1))(P3_output)

    P4_output = DarknetConv2D_BN_Leaky(512, (3,3))(P4)
    P4_output = DarknetConv2D(num_anchors*(num_classes+5), (1,1))(P4_output)

    P5_output = DarknetConv2D_BN_Leaky(1024, (3,3))(P5)
    P5_output = DarknetConv2D(num_anchors*(num_classes+5), (1,1))(P5_output)

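4. Decoding the predictions

The previous step gives the predictions of the three feature layers, with shapes (N,19,19,255), (N,38,38,255) and (N,76,76,255). They correspond to the positions of 3 predicted boxes on each cell of the 19x19, 38x38 and 76x76 grids that each image is divided into.

These predictions do not yet correspond to the final box positions on the image; they have to be decoded first.

A word on the prediction principle, which is the same as in YOLOV3: the 3 feature layers divide the whole image into 19x19, 38x38 and 76x76 grids, and each grid point is responsible for detecting one region.

Since each feature layer predicts three boxes per grid point, we first reshape the outputs to (N,19,19,3,85), (N,38,38,3,85) and (N,76,76,3,85).

The 85 in the last dimension is 4+1+80: x_offset, y_offset, w and h; the confidence; and the classification results.

Decoding adds the corresponding x_offset and y_offset to each grid point, which gives the centre of the predicted box, and then combines the prior box with w and h to compute the width and height of the predicted box. This yields the full position of the predicted box.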
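
For a single grid cell and a single prior box, the decoding described above reduces to a handful of scalar operations. The sketch below mirrors the box_xy and box_wh expressions of the yolo_head function in this post, but in plain Python; decode_box and its argument names are hypothetical, and the anchor (116, 90) is just one of the prior boxes used as an example.

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(t_xy, t_wh, cell_xy, grid_size, anchor_wh, input_size):
    """Decode one raw prediction into (cx, cy, w, h), all normalised to [0, 1]."""
    # Centre: sigmoid keeps the offset inside the cell, then add the cell index
    cx = (sigmoid(t_xy[0]) + cell_xy[0]) / grid_size
    cy = (sigmoid(t_xy[1]) + cell_xy[1]) / grid_size
    # Size: the prior box is scaled by exp(t) and normalised by the input size
    w = math.exp(t_wh[0]) * anchor_wh[0] / input_size
    h = math.exp(t_wh[1]) * anchor_wh[1] / input_size
    return cx, cy, w, h

# A zero prediction in the centre cell of the 19x19 grid
box = decode_box(t_xy=(0.0, 0.0), t_wh=(0.0, 0.0), cell_xy=(9, 9),
                 grid_size=19, anchor_wh=(116, 90), input_size=608)
print(box)
```

With all raw values at zero, the box sits exactly in the middle of its cell and has the size of the prior box, which is why the network only has to learn small corrections.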

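After the final predictions are obtained, they still have to be sorted by score and filtered with non-maximum suppression.
This part is common to essentially all object detectors. This project, however, handles it separately for each class:
1. Take the boxes and scores of each class whose score is greater than self.obj_threshold (passed as score_threshold in the yolo_eval code below).
2. Apply non-maximum suppression using the box positions and scores.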
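
The two steps can be sketched for a single class in pure Python. This is a simplified greedy NMS written for illustration, not the tf.image.non_max_suppression call the project uses; the helper names box_area, box_iou_2d and nms_single_class are hypothetical, and the default thresholds are copied from the yolo_eval signature.

```python
def box_area(b):
    # Boxes are (y_min, x_min, y_max, x_max)
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def box_iou_2d(a, b):
    ih = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iw = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ih * iw
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def nms_single_class(boxes, scores, score_threshold=0.6, iou_threshold=0.5, max_boxes=20):
    """Return the indices of the boxes kept for one class."""
    # Step 1: keep only boxes above the score threshold, best first
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    # Step 2: greedily keep a box unless it overlaps an already kept one too much
    for i in order:
        if len(keep) >= max_boxes:
            break
        if all(box_iou_2d(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = nms_single_class(boxes, scores)
print(kept)
```

Running the same routine once per class and concatenating the results gives the per-class behaviour described above.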

The implementation is as follows. Calling yolo_eval decodes every feature layer:

#---------------------------------------------------#
#   Convert each feature layer's predictions into real values
#---------------------------------------------------#
def yolo_head(feats, anchors, num_classes, input_shape, calc_loss=False):
    num_anchors = len(anchors)
    # [1, 1, 1, num_anchors, 2]
    anchors_tensor = K.reshape(K.constant(anchors), [1, 1, 1, num_anchors, 2])

    # Build the x, y grid
    # (19, 19, 1, 2)
    grid_shape = K.shape(feats)[1:3] # height, width
    grid_y = K.tile(K.reshape(K.arange(0, stop=grid_shape[0]), [-1, 1, 1, 1]),
        [1, grid_shape[1], 1, 1])
    grid_x = K.tile(K.reshape(K.arange(0, stop=grid_shape[1]), [1, -1, 1, 1]),
        [grid_shape[0], 1, 1, 1])
    grid = K.concatenate([grid_x, grid_y])
    grid = K.cast(grid, K.dtype(feats))

    # (batch_size,19,19,3,85)
    feats = K.reshape(feats, [-1, grid_shape[0], grid_shape[1], num_anchors, num_classes + 5])

    # Convert predictions into real values
    # box_xy is the box centre
    # box_wh is the box width and height
    box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[::-1], K.dtype(feats))
    box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[::-1], K.dtype(feats))
    box_confidence = K.sigmoid(feats[..., 4:5])
    box_class_probs = K.sigmoid(feats[..., 5:])

    # Return these values when computing the loss
    if calc_loss:
        return grid, feats, box_xy, box_wh
    return box_xy, box_wh, box_confidence, box_class_probs

#---------------------------------------------------#
#   Adjust the boxes so that they match the real image
#---------------------------------------------------#
def yolo_correct_boxes(box_xy, box_wh, input_shape, image_shape):
    box_yx = box_xy[..., ::-1]
    box_hw = box_wh[..., ::-1]
    
    input_shape = K.cast(input_shape, K.dtype(box_yx))
    image_shape = K.cast(image_shape, K.dtype(box_yx))

    new_shape = K.round(image_shape * K.min(input_shape/image_shape))
    offset = (input_shape-new_shape)/2./input_shape
    scale = input_shape/new_shape

    box_yx = (box_yx - offset) * scale
    box_hw *= scale

    box_mins = box_yx - (box_hw / 2.)
    box_maxes = box_yx + (box_hw / 2.)
    boxes =  K.concatenate([
        box_mins[..., 0:1],  # y_min
        box_mins[..., 1:2],  # x_min
        box_maxes[..., 0:1],  # y_max
        box_maxes[..., 1:2]  # x_max
    ])

    boxes *= K.concatenate([image_shape, image_shape])
    return boxes

#---------------------------------------------------#
#   Get each box and its score
#---------------------------------------------------#
def yolo_boxes_and_scores(feats, anchors, num_classes, input_shape, image_shape):
    # Convert predictions into real values
    # box_xy is the box centre
    # box_wh is the box width and height
    # -1,19,19,3,2; -1,19,19,3,2; -1,19,19,3,1; -1,19,19,3,80
    box_xy, box_wh, box_confidence, box_class_probs = yolo_head(feats, anchors, num_classes, input_shape)
    # Convert box_xy and box_wh into y_min, x_min, y_max, x_max
    boxes = yolo_correct_boxes(box_xy, box_wh, input_shape, image_shape)
    # Get the scores and boxes
    boxes = K.reshape(boxes, [-1, 4])
    box_scores = box_confidence * box_class_probs
    box_scores = K.reshape(box_scores, [-1, num_classes])
    return boxes, box_scores

#---------------------------------------------------#
#   Image prediction
#---------------------------------------------------#
def yolo_eval(yolo_outputs,
              anchors,
              num_classes,
              image_shape,
              max_boxes=20,
              score_threshold=.6,
              iou_threshold=.5):
    # Number of feature layers
    num_layers = len(yolo_outputs)
    # Feature layer 1 uses anchors 6, 7, 8
    # Feature layer 2 uses anchors 3, 4, 5
    # Feature layer 3 uses anchors 0, 1, 2
    anchor_mask = [[6,7,8], [3,4,5], [0,1,2]]
    
    input_shape = K.shape(yolo_outputs[0])[1:3] * 32
    boxes = []
    box_scores = []
    # Process each feature layer
    for l in range(num_layers):
        _boxes, _box_scores = yolo_boxes_and_scores(yolo_outputs[l], anchors[anchor_mask[l]], num_classes, input_shape, image_shape)
        boxes.append(_boxes)
        box_scores.append(_box_scores)
    # Concatenate the results of all feature layers
    boxes = K.concatenate(boxes, axis=0)
    box_scores = K.concatenate(box_scores, axis=0)

    mask = box_scores >= score_threshold
    max_boxes_tensor = K.constant(max_boxes, dtype='int32')
    boxes_ = []
    scores_ = []
    classes_ = []
    for c in range(num_classes):
        # Take all boxes with box_scores >= score_threshold, and their scores
        class_boxes = tf.boolean_mask(boxes, mask[:, c])
        class_box_scores = tf.boolean_mask(box_scores[:, c], mask[:, c])

        # Non-maximum suppression: drop boxes that overlap heavily
        nms_index = tf.image.non_max_suppression(
            class_boxes, class_box_scores, max_boxes_tensor, iou_threshold=iou_threshold)

        # Results after non-maximum suppression:
        # box positions, scores and classes
        class_boxes = K.gather(class_boxes, nms_index)
        class_box_scores = K.gather(class_box_scores, nms_index)
        classes = K.ones_like(class_box_scores, 'int32') * c
        boxes_.append(class_boxes)
        scores_.append(class_box_scores)
        classes_.append(classes)
    boxes_ = K.concatenate(boxes_, axis=0)
    scores_ = K.concatenate(scores_, axis=0)
    classes_ = K.concatenate(classes_, axis=0)

    return boxes_, scores_, classes_

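5. Drawing on the original image

Step 4 gives the positions of the predicted boxes on the original image, already filtered. These boxes can be drawn directly on the image to obtain the result.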

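Training YOLOV4

1. Improved training tricks in YOLOV4

a). Mosaic data augmentation

YOLOV4's mosaic data augmentation is based on the CutMix augmentation and is similar to it in principle.
CutMix stitches two images together.

Mosaic uses four images instead. According to the paper, its big advantage is that it enriches the backgrounds of the detected objects, and batch normalisation then computes statistics over four images' worth of data at once.
As in the figure below:

The approach is:
1. Read four images at a time.

2. Flip, scale and colour-jitter each of the four images, and place them in the four corner positions.

3. Combine the images and combine the boxes.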

def rand(a=0, b=1):
    return np.random.rand()*(b-a) + a

def merge_bboxes(bboxes, cutx, cuty):

    merge_bbox = []
    for i in range(len(bboxes)):
        for box in bboxes[i]:
            tmp_box = []
            x1,y1,x2,y2 = box[0], box[1], box[2], box[3]

            if i == 0:
                if y1 > cuty or x1 > cutx:
                    continue
                if y2 >= cuty and y1 <= cuty:
                    y2 = cuty
                    if y2-y1 < 5:
                        continue
                if x2 >= cutx and x1 <= cutx:
                    x2 = cutx
                    if x2-x1 < 5:
                        continue
                
            if i == 1:
                if y2 < cuty or x1 > cutx:
                    continue

                if y2 >= cuty and y1 <= cuty:
                    y1 = cuty
                    if y2-y1 < 5:
                        continue
                
                if x2 >= cutx and x1 <= cutx:
                    x2 = cutx
                    if x2-x1 < 5:
                        continue

            if i == 2:
                if y2 < cuty or x2 < cutx:
                    continue

                if y2 >= cuty and y1 <= cuty:
                    y1 = cuty
                    if y2-y1 < 5:
                        continue

                if x2 >= cutx and x1 <= cutx:
                    x1 = cutx
                    if x2-x1 < 5:
                        continue

            if i == 3:
                if y1 > cuty or x2 < cutx:
                    continue

                if y2 >= cuty and y1 <= cuty:
                    y2 = cuty
                    if y2-y1 < 5:
                        continue

                if x2 >= cutx and x1 <= cutx:
                    x1 = cutx
                    if x2-x1 < 5:
                        continue

            tmp_box.append(x1)
            tmp_box.append(y1)
            tmp_box.append(x2)
            tmp_box.append(y2)
            tmp_box.append(box[-1])
            merge_bbox.append(tmp_box)
    return merge_bbox

def get_random_data(annotation_line, input_shape, random=True, hue=.1, sat=1.5, val=1.5, proc_img=True):
    '''random preprocessing for real-time data augmentation'''
    h, w = input_shape
    min_offset_x = 0.4
    min_offset_y = 0.4
    scale_low = 1-min(min_offset_x,min_offset_y)
    scale_high = scale_low+0.2

    image_datas = [] 
    box_datas = []
    index = 0

    # Paste positions of the four images; the vertical offsets must scale with h, not w
    place_x = [0,0,int(w*min_offset_x),int(w*min_offset_x)]
    place_y = [0,int(h*min_offset_y),int(h*min_offset_y),0]
    for line in annotation_line:
        # Split the annotation line
        line_content = line.split()
        # Open the image
        image = Image.open(line_content[0])
        image = image.convert("RGB") 
        # Image size
        iw, ih = image.size
        # Box positions
        box = np.array([np.array(list(map(int,box.split(',')))) for box in line_content[1:]])
        
        # image.save(str(index)+".jpg")
        # Randomly decide whether to flip the image
        flip = rand()<.5
        if flip and len(box)>0:
            image = image.transpose(Image.FLIP_LEFT_RIGHT)
            box[:, [0,2]] = iw - box[:, [2,0]]

        # Scale the input image
        new_ar = w/h
        scale = rand(scale_low, scale_high)
        if new_ar < 1:
            nh = int(scale*h)
            nw = int(nh*new_ar)
        else:
            nw = int(scale*w)
            nh = int(nw/new_ar)
        image = image.resize((nw,nh), Image.BICUBIC)

        # Colour-space jitter; local names keep the hue/sat/val arguments intact for the next image
        hue_shift = rand(-hue, hue)
        sat_scale = rand(1, sat) if rand()<.5 else 1/rand(1, sat)
        val_scale = rand(1, val) if rand()<.5 else 1/rand(1, val)
        x = rgb_to_hsv(np.array(image)/255.)
        x[..., 0] += hue_shift
        x[..., 0][x[..., 0]>1] -= 1
        x[..., 0][x[..., 0]<0] += 1
        x[..., 1] *= sat_scale
        x[..., 2] *= val_scale
        x[x>1] = 1
        x[x<0] = 0
        image = hsv_to_rgb(x)

        image = Image.fromarray((image*255).astype(np.uint8))
        # Place the image at the position of its quadrant in the mosaic
        dx = place_x[index]
        dy = place_y[index]
        new_image = Image.new('RGB', (w,h), (128,128,128))
        new_image.paste(image, (dx, dy))
        image_data = np.array(new_image)/255

        # Image.fromarray((image_data*255).astype(np.uint8)).save(str(index)+"distort.jpg")
        
        index = index + 1
        box_data = []
        # Recompute the boxes for the resized and shifted image
        if len(box)>0:
            np.random.shuffle(box)
            box[:, [0,2]] = box[:, [0,2]]*nw/iw + dx
            box[:, [1,3]] = box[:, [1,3]]*nh/ih + dy
            box[:, 0:2][box[:, 0:2]<0] = 0
            box[:, 2][box[:, 2]>w] = w
            box[:, 3][box[:, 3]>h] = h
            box_w = box[:, 2] - box[:, 0]
            box_h = box[:, 3] - box[:, 1]
            box = box[np.logical_and(box_w>1, box_h>1)]
            box_data = np.zeros((len(box),5))
            box_data[:len(box)] = box
        
        image_datas.append(image_data)
        box_datas.append(box_data)

        # Debug preview: draw the boxes on each placed image and display it (disable for real training runs)
        img = Image.fromarray((image_data*255).astype(np.uint8))
        for j in range(len(box_data)):
            thickness = 3
            left, top, right, bottom  = box_data[j][0:4]
            draw = ImageDraw.Draw(img)
            for i in range(thickness):
                draw.rectangle([left + i, top + i, right - i, bottom - i],outline=(255,255,255))
        img.show()

    
    # Cut the four images and stitch them together
    cutx = np.random.randint(int(w*min_offset_x), int(w*(1 - min_offset_x)))
    cuty = np.random.randint(int(h*min_offset_y), int(h*(1 - min_offset_y)))

    new_image = np.zeros([h,w,3])
    new_image[:cuty, :cutx, :] = image_datas[0][:cuty, :cutx, :]
    new_image[cuty:, :cutx, :] = image_datas[1][cuty:, :cutx, :]
    new_image[cuty:, cutx:, :] = image_datas[2][cuty:, cutx:, :]
    new_image[:cuty, cutx:, :] = image_datas[3][:cuty, cutx:, :]

    # Clip and merge the boxes to match the stitched image
    new_boxes = merge_bboxes(box_datas, cutx, cuty)

    return new_image, new_boxes

b). Label smoothing

The idea of label smoothing is simple. The formula is:

new_onehot_labels = onehot_labels * (1 - label_smoothing) + label_smoothing / num_classes

When label_smoothing is 0.01, the formula becomes:

new_onehot_labels = y * (1 - 0.01) + 0.01 / num_classes

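Label smoothing simply smooths the labels: the original labels 0 and 1 become 0.005 (for two classes) and 0.995. In other words it slightly penalises over-confident classification, so the model cannot fit the labels too exactly, since doing so tends to cause overfitting.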
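
The numbers quoted above are easy to verify by hand. Here is a minimal plain-Python version of the formula; smooth_labels is a hypothetical name and the function works on an ordinary list rather than a tensor.

```python
def smooth_labels(one_hot, label_smoothing):
    """Apply y * (1 - s) + s / num_classes to a one-hot label vector."""
    num_classes = len(one_hot)
    return [y * (1.0 - label_smoothing) + label_smoothing / num_classes
            for y in one_hot]

two_class = smooth_labels([0.0, 1.0], 0.01)
print(two_class)  # approximately [0.005, 0.995]

# With 20 classes the "off" value shrinks, since the mass is spread more thinly
voc = smooth_labels([1.0] + [0.0] * 19, 0.01)
print(voc[0], voc[1])
```

Note that the smoothed vector still sums to 1, so it remains a valid target distribution.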

The implementation is as follows:

#---------------------------------------------------#
#   Label smoothing
#---------------------------------------------------#
def _smooth_labels(y_true, label_smoothing):
    num_classes = tf.cast(K.shape(y_true)[-1], dtype=K.floatx())
    label_smoothing = K.constant(label_smoothing, dtype=K.floatx())
    return y_true * (1.0 - label_smoothing) + label_smoothing / num_classes

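c). CIOU

IoU is a ratio, so it is insensitive to the scale of the target object. However, the commonly used bounding-box regression losses are not fully equivalent to optimising IoU, and plain IoU cannot directly optimise boxes that do not overlap.

This led to proposals to use IoU directly as the regression loss; CIOU is one of the best of these ideas.

CIOU takes into account the distance between the target and the anchor, the overlap ratio, the scale and a penalty term, which makes box regression more stable and avoids problems such as divergence during training that can occur with IoU and GIoU. The penalty term accounts for how well the aspect ratio of the predicted box fits that of the ground-truth box.

The CIOU formula is:
$CIOU = IOU - \frac{\rho^2(b,b^{gt})}{c^2} - \alpha v$
where $\rho^2(b,b^{gt})$ is the squared Euclidean distance between the centres of the predicted box and the ground-truth box, and $c$ is the diagonal length of the smallest enclosing region that contains both boxes.

The formulas for $\alpha$ and $v$ are:
$\alpha = \frac{v}{1-IOU+v}$
$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}}-\arctan\frac{w}{h}\right)^2$
Taking 1 - CIOU gives the corresponding loss:
$LOSS_{CIOU} = 1 - IOU + \frac{\rho^2(b,b^{gt})}{c^2} + \alpha v$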

import math
import tensorflow as tf
from tensorflow.keras import backend as K

def box_ciou(b1, b2):
    """
    Inputs:
    ----------
    b1: tensor, shape=(batch, feat_h, feat_w, anchor_num, 4), xywh
    b2: tensor, shape=(batch, feat_h, feat_w, anchor_num, 4), xywh

    Returns:
    -------
    ciou: tensor, shape=(batch, feat_h, feat_w, anchor_num, 1)
    """
    # Top-left and bottom-right corners of the predicted boxes
    b1_xy = b1[..., :2]
    b1_wh = b1[..., 2:4]
    b1_wh_half = b1_wh/2.
    b1_mins = b1_xy - b1_wh_half
    b1_maxes = b1_xy + b1_wh_half
    # Top-left and bottom-right corners of the ground-truth boxes
    b2_xy = b2[..., :2]
    b2_wh = b2[..., 2:4]
    b2_wh_half = b2_wh/2.
    b2_mins = b2_xy - b2_wh_half
    b2_maxes = b2_xy + b2_wh_half

    # IoU between every ground-truth box and predicted box
    intersect_mins = K.maximum(b1_mins, b2_mins)
    intersect_maxes = K.minimum(b1_maxes, b2_maxes)
    intersect_wh = K.maximum(intersect_maxes - intersect_mins, 0.)
    intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
    b1_area = b1_wh[..., 0] * b1_wh[..., 1]
    b2_area = b2_wh[..., 0] * b2_wh[..., 1]
    union_area = b1_area + b2_area - intersect_area
    iou = intersect_area / (union_area + K.epsilon())

    # Squared distance between the two centres
    center_distance = K.sum(K.square(b1_xy - b2_xy), axis=-1)
    # Top-left and bottom-right corners of the smallest box enclosing both boxes
    enclose_mins = K.minimum(b1_mins, b2_mins)
    enclose_maxes = K.maximum(b1_maxes, b2_maxes)
    enclose_wh = K.maximum(enclose_maxes - enclose_mins, 0.0)
    # Squared diagonal length of the enclosing box
    enclose_diagonal = K.sum(K.square(enclose_wh), axis=-1)
    # Distance term; epsilon in the denominator avoids division by zero
    ciou = iou - 1.0 * (center_distance) / (enclose_diagonal + K.epsilon())

    # Aspect-ratio term: v and alpha extend the expression to CIoU
    v = 4*K.square(tf.math.atan2(b1_wh[..., 0], b1_wh[..., 1]) - tf.math.atan2(b2_wh[..., 0], b2_wh[..., 1])) / (math.pi * math.pi)
    # Guard the denominator: it is 0 when the boxes coincide (iou = 1, v = 0)
    alpha = v / K.maximum(1.0 - iou + v, K.epsilon())
    ciou = ciou - alpha * v

    ciou = K.expand_dims(ciou, -1)
    return ciou
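
To check the CIOU formula by hand on a pair of boxes, a scalar version is convenient. The following is a plain-Python sketch under my own assumptions (boxes given as centre x, centre y, width, height; the helper name ciou_scalar is hypothetical); it follows the formula term by term rather than reproducing the tensor code.

```python
import math

def ciou_scalar(box1, box2):
    """CIoU of two boxes given as (cx, cy, w, h)."""
    x1, y1, w1, h1 = box1
    x2, y2, w2, h2 = box2
    # Plain IoU
    inter_w = max(0.0, min(x1 + w1 / 2, x2 + w2 / 2) - max(x1 - w1 / 2, x2 - w2 / 2))
    inter_h = max(0.0, min(y1 + h1 / 2, y2 + h2 / 2) - max(y1 - h1 / 2, y2 - h2 / 2))
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / union if union > 0 else 0.0
    # Squared centre distance over squared diagonal of the enclosing box
    rho2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
    enc_w = max(x1 + w1 / 2, x2 + w2 / 2) - min(x1 - w1 / 2, x2 - w2 / 2)
    enc_h = max(y1 + h1 / 2, y2 + h2 / 2) - min(y1 - h1 / 2, y2 - h2 / 2)
    c2 = enc_w ** 2 + enc_h ** 2
    distance_term = rho2 / c2 if c2 > 0 else 0.0
    # Aspect-ratio consistency term
    v = 4 / math.pi ** 2 * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    denom = 1.0 - iou + v
    alpha = v / denom if denom > 0 else 0.0
    return iou - distance_term - alpha * v

same = ciou_scalar((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2))
apart = ciou_scalar((0.0, 0.0, 1.0, 1.0), (3.0, 0.0, 1.0, 1.0))
rotated = ciou_scalar((0.0, 0.0, 2.0, 1.0), (0.0, 0.0, 1.0, 2.0))
print(same, apart, rotated)
```

The second case shows why CIOU is useful as a loss: two boxes that do not overlap at all still get a value that depends on how far apart they are, whereas plain IoU would be 0 regardless of distance.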

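d). Cosine annealing learning-rate decay

With cosine annealing decay, the learning rate first rises and then falls, which is the idea behind annealing optimisation. (Look up simulated annealing if you are not familiar with it.)

The rise is linear and the fall follows a cosine curve. This is repeated several times.

The effect is shown in the figure:

In TF2 this can be implemented with the built-in tf.keras.experimental.CosineDecayRestarts (newer TensorFlow releases expose the same schedule as tf.keras.optimizers.schedules.CosineDecayRestarts). Note that this built-in schedule covers only the cosine decay with restarts; it has no linear warm-up phase.

Cosine annealing decay has a few essential parameters:
1. learning_rate_base: the maximum learning rate.
2. first_decay_steps: the number of steps before the first restart.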

lr_schedule = tf.keras.experimental.CosineDecayRestarts(
    initial_learning_rate = learning_rate_base, 
    first_decay_steps = 5*epoch_size, 
    t_mul = 1.0,
    alpha = 1e-2
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
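
To see the shape of this schedule without running TensorFlow, here is a pure-Python sketch of cosine decay with restarts for the special case t_mul = 1.0 used above (every cycle has the same length) and with no damping of the peak between cycles. The function name cosine_decay_restarts and the example values are assumptions for illustration.

```python
import math

def cosine_decay_restarts(step, base_lr, first_decay_steps, alpha=1e-2):
    """Learning rate at a given step, for equal-length cycles (t_mul = 1)."""
    # Position inside the current cycle, in [0, 1)
    progress = (step % first_decay_steps) / first_decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # alpha sets the floor as a fraction of the base learning rate
    return base_lr * ((1.0 - alpha) * cosine + alpha)

base_lr, cycle = 1e-3, 100
for step in (0, 50, 99, 100, 150):
    print(step, cosine_decay_restarts(step, base_lr, cycle))
```

The rate starts at base_lr, falls along a cosine curve to roughly alpha * base_lr, and jumps back to base_lr at the start of each new cycle.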

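2. Components of the loss

a). Inputs needed to compute the loss

Computing the loss is in fact a comparison between y_pred and y_true:
y_pred is the output of the network for an image and contains the three feature layers; it has to be decoded before it can be drawn on the image.
y_true holds, for a real image, the offset position, width and height, and class of each ground-truth box on the (19,19), (38,38) and (76,76) grids. It has to be encoded so that its structure matches y_pred.
y_pred and y_true both have the shapes
(batch_size,19,19,3,85)
(batch_size,38,38,3,85)
(batch_size,76,76,3,85)

b). What y_pred is

The network's final output is, for each grid point of the three feature layers, the predicted boxes and their classes: each feature layer corresponds to the image divided into a grid of a different size, with the position, confidence and class for the three prior boxes at every grid point.
For the outputs y1, y2 and y3, [..., :2] is the offset relative to each grid point, [..., 2:4] is the width and height, [..., 4:5] is the confidence of the box, and [..., 5:] is the predicted probability of each class.
At this point y_pred is not yet decoded; only after decoding does it describe the situation on the real image.

c). What y_true is

y_true holds, for a real image, the offset position, width and height, and class of each ground-truth box on the (19,19), (38,38) and (76,76) grids. It has to be encoded so that its structure matches y_pred.
The YOLOV4 code uses a dedicated function to process the ground-truth boxes of the images it reads.
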
def preprocess_true_boxes(true_boxes, input_shape, anchors, num_classes):

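Its inputs are:
true_boxes: shape (m, T, 5), the x_min, y_min, x_max, y_max and class_id of T boxes in each of m images.
input_shape: the input shape, here 608, 608.
anchors: the sizes of the 9 prior boxes.
num_classes: the number of classes.
Processing the ground-truth boxes converts them into xywh relative to the grid in the image. The steps are:
1. Take the ground-truth values, compute each box's centre, width and height, and divide by input_shape to turn them into proportions.
2. Create y_true filled with zeros. y_true is a list containing three feature layers, with shapes (batch_size,19,19,3,85), (batch_size,38,38,3,85) and (batch_size,76,76,3,85).
3. For each image, compare the wh of every ground-truth box with the wh of the prior boxes, compute the IOU, pick the prior box with the highest IOU, find the feature layer it belongs to and the grid cell position, and store the contents in the corresponding y_true.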

for t, n in enumerate(best_anchor):
    for l in range(num_layers):
        if n in anchor_mask[l]:

            # Grid cell of this target on feature layer l
            i = np.floor(true_boxes[b,t,0]*grid_shapes[l][1]).astype('int32')
            j = np.floor(true_boxes[b,t,1]*grid_shapes[l][0]).astype('int32')

            # Position of best_anchor within this layer's anchor mask
            k = anchor_mask[l].index(n)
            c = true_boxes[b,t, 4].astype('int32')
            
            # Store in y_true
            y_true[l][b, j, i, k, 0:4] = true_boxes[b,t, 0:4]
            y_true[l][b, j, i, k, 4] = 1
            y_true[l][b, j, i, k, 5+c] = 1
In the final y_true, only the positions that best match each box in each image contain data; everything else is 0.
The complete code of preprocess_true_boxes is:

#---------------------------------------------------#
#   Read in the annotated boxes (from the xml files) and output y_true
#---------------------------------------------------#
def preprocess_true_boxes(true_boxes, input_shape, anchors, num_classes):
    assert (true_boxes[..., 4]<num_classes).all(), 'class id must be less than num_classes'
    # There are three feature layers in total
    num_layers = len(anchors)//3
    # Prior boxes
    # 678 are 116,90,  156,198,  373,326
    # 345 are 30,61,  62,45,  59,119
    # 012 are 10,13,  16,30,  33,23,  
    anchor_mask = [[6,7,8], [3,4,5], [0,1,2]] if num_layers==3 else [[3,4,5], [1,2,3]]

    true_boxes = np.array(true_boxes, dtype='float32')
    input_shape = np.array(input_shape, dtype='int32') # e.g. 608,608
    # Compute the centres and the width/height
    # Centres have shape (m,n,2)
    boxes_xy = (true_boxes[..., 0:2] + true_boxes[..., 2:4]) // 2
    boxes_wh = true_boxes[..., 2:4] - true_boxes[..., 0:2]
    # Convert to proportions of the input size
    true_boxes[..., 0:2] = boxes_xy/input_shape[:]
    true_boxes[..., 2:4] = boxes_wh/input_shape[:]

    # m images
    m = true_boxes.shape[0]
    # Grid shapes are 19,19; 38,38; 76,76
    grid_shapes = [input_shape//{0:32, 1:16, 2:8}[l] for l in range(num_layers)]
    # y_true has the format (m,19,19,3,85)(m,38,38,3,85)(m,76,76,3,85)
    y_true = [np.zeros((m,grid_shapes[l][0],grid_shapes[l][1],len(anchor_mask[l]),5+num_classes),
        dtype='float32') for l in range(num_layers)]
    # [1,9,2]
    anchors = np.expand_dims(anchors, 0)
    anchor_maxes = anchors / 2.
    anchor_mins = -anchor_maxes
    # A box is valid only if its width is greater than 0
    valid_mask = boxes_wh[..., 0]>0

    for b in range(m):
        # Process each image
        wh = boxes_wh[b, valid_mask[b]]
        if len(wh)==0: continue
        # [n,1,2]
        wh = np.expand_dims(wh, -2)
        box_maxes = wh / 2.
        box_mins = -box_maxes

        # Find which prior box fits each ground-truth box best
        intersect_mins = np.maximum(box_mins, anchor_mins)
        intersect_maxes = np.minimum(box_maxes, anchor_maxes)
        intersect_wh = np.maximum(intersect_maxes - intersect_mins, 0.)
        intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
        box_area = wh[..., 0] * wh[..., 1]
        anchor_area = anchors[..., 0] * anchors[..., 1]
        iou = intersect_area / (box_area + anchor_area - intersect_area)
        # Shape is (n); thanks to 消盡不死鳥 for pointing this out
        best_anchor = np.argmax(iou, axis=-1)

        for t, n in enumerate(best_anchor):
            for l in range(num_layers):
                if n in anchor_mask[l]:
                    # floor rounds down to the grid cell index
                    i = np.floor(true_boxes[b,t,0]*grid_shapes[l][1]).astype('int32')
                    j = np.floor(true_boxes[b,t,1]*grid_shapes[l][0]).astype('int32')
                    # Index of the matched anchor within feature layer l for image b
                    k = anchor_mask[l].index(n)
                    c = true_boxes[b,t, 4].astype('int32')
                    y_true[l][b, j, i, k, 0:4] = true_boxes[b,t, 0:4]
                    y_true[l][b, j, i, k, 4] = 1
                    y_true[l][b, j, i, k, 5+c] = 1

    return y_true

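d). How the loss is computed

Once y_pred and y_true are available, how are they compared? Not by simple subtraction!

The loss is computed over the three feature layers; the smallest feature layer is used as the example here.
1. Use y_true to extract the positions in this feature layer where a target really exists, (m,19,19,3,1), and the corresponding classes, (m,19,19,3,80).
2. Process the yolo_outputs predictions to obtain the reshaped prediction y_pred with shape (m,19,19,3,85), together with the decoded xy and wh.
3. For each image, compute the IOU between all ground-truth boxes and the predicted boxes. Predicted boxes whose overlap with a ground-truth box is greater than 0.5 are ignored.
4. Compute CIOU as the regression loss; only positive samples contribute to the regression loss.
5. Compute the confidence loss, which has two parts: where a target really exists, the predicted confidence is compared with 1; where no target exists, the predicted confidence is compared with 0.
6. Compute the class loss, which measures the gap between the predicted and true classes where a target really exists.

The total loss is the sum of three losses:

For boxes that really exist, the CIOU loss.
For boxes that really exist, the predicted confidence compared with 1; for boxes that do not exist, the predicted confidence compared with 0, excluding the ignored boxes that contain no target.
For boxes that really exist, the predicted class compared with the true class.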
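
The confidence term is the one where the ignore rule matters, so here is a small plain-Python sketch of it for a flat list of predictions. It is an illustration under my own assumptions, not the project's yolo_loss: bce and confidence_loss are hypothetical names, and best_iou stands for the highest IOU between each predicted box and any ground-truth box in the image.

```python
import math

def bce(target, p, eps=1e-7):
    """Binary cross-entropy for a single probability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def confidence_loss(object_mask, pred_conf, best_iou, ignore_thresh=0.5):
    """Sum of confidence losses over positives and non-ignored negatives."""
    total = 0.0
    for has_obj, p, iou in zip(object_mask, pred_conf, best_iou):
        if has_obj:
            total += bce(1.0, p)   # a target exists: compare with 1
        elif iou < ignore_thresh:
            total += bce(0.0, p)   # true negative: compare with 0
        # else: overlaps a ground-truth box well but is not the assigned
        # anchor, so it is ignored rather than punished as a negative
    return total

loss = confidence_loss(object_mask=[1, 0, 0],
                       pred_conf=[0.9, 0.1, 0.8],
                       best_iou=[0.9, 0.1, 0.7])
print(loss)
```

In this example the third prediction is confident and overlaps a ground-truth box, yet contributes nothing to the loss, which is exactly the purpose of the ignore threshold.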

The actual code is below; yolo_loss returns the loss value:

#---------------------------------------------------#
#   Label smoothing
#---------------------------------------------------#
def _smooth_labels(y_true, label_smoothing):
    # Cast the class count to float (a trailing comma here would turn it into a tuple)
    num_classes = tf.cast(K.shape(y_true)[-1], dtype=K.floatx())
    label_smoothing = K.constant(label_smoothing, dtype=K.floatx())
    return y_true * (1.0 - label_smoothing) + label_smoothing / num_classes
#---------------------------------------------------#
#   Convert each feature layer's predictions into real values
#---------------------------------------------------#
def yolo_head(feats, anchors, num_classes, input_shape, calc_loss=False):
    num_anchors = len(anchors)
    # [1, 1, 1, num_anchors, 2]
    anchors_tensor = K.reshape(K.constant(anchors), [1, 1, 1, num_anchors, 2])

    # Build the x, y grid
    # (19,19, 1, 2)
    grid_shape = K.shape(feats)[1:3] # height, width
    grid_y = K.tile(K.reshape(K.arange(0, stop=grid_shape[0]), [-1, 1, 1, 1]),
        [1, grid_shape[1], 1, 1])
    grid_x = K.tile(K.reshape(K.arange(0, stop=grid_shape[1]), [1, -1, 1, 1]),
        [grid_shape[0], 1, 1, 1])
    grid = K.concatenate([grid_x, grid_y])
    grid = K.cast(grid, K.dtype(feats))

    # (batch_size,19,19,3,85)
    feats = K.reshape(feats, [-1, grid_shape[0], grid_shape[1], num_anchors, num_classes + 5])

    # Convert predictions into real values
    # box_xy is the box centre
    # box_wh is the box width and height
    box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[::-1], K.dtype(feats))
    box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[::-1], K.dtype(feats))
    box_confidence = K.sigmoid(feats[..., 4:5])
    box_class_probs = K.sigmoid(feats[..., 5:])

    # Return these values when computing the loss
    if calc_loss:
        return grid, feats, box_xy, box_wh
    return box_xy, box_wh, box_confidence, box_class_probs

#---------------------------------------------------#
#   Computes the IOU between every predicted box and every ground-truth box
#---------------------------------------------------#
def box_iou(b1, b2):
    # 19,19,3,1,4
    # Compute the top-left and bottom-right corners
    b1 = K.expand_dims(b1, -2)
    b1_xy = b1[..., :2]
    b1_wh = b1[..., 2:4]
    b1_wh_half = b1_wh/2.
    b1_mins = b1_xy - b1_wh_half
    b1_maxes = b1_xy + b1_wh_half

    # 1,n,4
    # Compute the top-left and bottom-right corners
    b2 = K.expand_dims(b2, 0)
    b2_xy = b2[..., :2]
    b2_wh = b2[..., 2:4]
    b2_wh_half = b2_wh/2.
    b2_mins = b2_xy - b2_wh_half
    b2_maxes = b2_xy + b2_wh_half

    # Compute the overlap area
    intersect_mins = K.maximum(b1_mins, b2_mins)
    intersect_maxes = K.minimum(b1_maxes, b2_maxes)
    intersect_wh = K.maximum(intersect_maxes - intersect_mins, 0.)
    intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
    b1_area = b1_wh[..., 0] * b1_wh[..., 1]
    b2_area = b2_wh[..., 0] * b2_wh[..., 1]
    iou = intersect_area / (b1_area + b2_area - intersect_area)

    return iou

#---------------------------------------------------#
#   Loss computation
#---------------------------------------------------#
def yolo_loss(args, anchors, num_classes, ignore_thresh=.5, label_smoothing=0.1, print_loss=False):

    # There are three layers in total
    num_layers = len(anchors)//3

    # Split predictions from the ground truth; args is [*model_body.output, *y_true]
    # y_true is a list of three feature layers with shapes (m,13,13,3,85),(m,26,26,3,85),(m,52,52,3,85).
    # yolo_outputs is a list of three feature layers with shapes (m,13,13,255),(m,26,26,255),(m,52,52,255).
    y_true = args[num_layers:]
    yolo_outputs = args[:num_layers]

    # Anchor boxes
    # indices 6,7,8: 142,110,  192,243,  459,401
    # indices 3,4,5: 36,75,  76,55,  72,146
    # indices 0,1,2: 12,16,  19,36,  40,28
    anchor_mask = [[6,7,8], [3,4,5], [0,1,2]] if num_layers==3 else [[3,4,5], [1,2,3]]

    # input_shape is the coarsest grid size * 32,
    # e.g. (416,416) for a 13x13 grid or (608,608) for a 19x19 grid
    input_shape = K.cast(K.shape(yolo_outputs[0])[1:3] * 32, K.dtype(y_true[0]))

    loss = 0

    # m is the batch_size (number of images)
    m = K.shape(yolo_outputs[0])[0]
    mf = K.cast(m, K.dtype(yolo_outputs[0]))

    # y_true is a list of three feature layers with shapes (m,13,13,3,85),(m,26,26,3,85),(m,52,52,3,85).
    # yolo_outputs is a list of three feature layers with shapes (m,13,13,255),(m,26,26,255),(m,52,52,255).
    for l in range(num_layers):
        # Take the first feature layer (m,13,13,3,85) as the example
        # Positions in this feature layer where a target exists, (m,13,13,3,1)
        object_mask = y_true[l][..., 4:5]
        # The corresponding classes, (m,13,13,3,80)
        true_class_probs = y_true[l][..., 5:]
        if label_smoothing:
            true_class_probs = _smooth_labels(true_class_probs, label_smoothing)

        # Process this feature layer's output from yolo_outputs
        # grid is the grid structure (13,13,1,2); raw_pred is the unprocessed prediction (m,13,13,3,85)
        # plus the decoded xy and wh, each (m,13,13,3,2)
        grid, raw_pred, pred_xy, pred_wh = yolo_head(yolo_outputs[l],
             anchors[anchor_mask[l]], num_classes, input_shape, calc_loss=True)

        # Decoded position of the predicted boxes
        # (m,13,13,3,4)
        pred_box = K.concatenate([pred_xy, pred_wh])

        # Find the negative samples; the first step is to create an empty array, []
        ignore_mask = tf.TensorArray(K.dtype(y_true[0]), size=1, dynamic_size=True)
        object_mask_bool = K.cast(object_mask, 'bool')

        # Compute ignore_mask for each image
        def loop_body(b, ignore_mask):
            # Take the parameters of all boxes that really exist in image b
            # n,4
            true_box = tf.boolean_mask(y_true[l][b,...,0:4], object_mask_bool[b,...,0])
            # Compute the IOU between the predictions and the ground truth
            # pred_box is 13,13,3,4
            # The result is the IOU of each pred_box against every ground-truth box
            # 13,13,3,n
            iou = box_iou(pred_box[b], true_box)

            # 13,13,3
            best_iou = K.max(iou, axis=-1)

            # A predicted box whose best IOU exceeds ignore_thresh (0.5) is ignored:
            # the mask is 1 only where best_iou < ignore_thresh.
            ignore_mask = ignore_mask.write(b, K.cast(best_iou<ignore_thresh, K.dtype(true_box)))
            return b+1, ignore_mask

        # Loop over all images
        _, ignore_mask = tf.while_loop(lambda b,*args: b<m, loop_body, [0, ignore_mask])

        # Stack the per-image results into a single tensor
        ignore_mask = ignore_mask.stack()
        #(m,13,13,3,1)
        ignore_mask = K.expand_dims(ignore_mask, -1)

        box_loss_scale = 2 - y_true[l][...,2:3]*y_true[l][...,3:4]

        # Calculate ciou loss as location loss
        raw_true_box = y_true[l][...,0:4]
        ciou = box_ciou(pred_box, raw_true_box)
        ciou_loss = object_mask * box_loss_scale * (1 - ciou)
        ciou_loss = K.sum(ciou_loss) / mf
        location_loss = ciou_loss
        
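        # If a box exists at this position, compute the cross-entropy between the confidence and 1
        # If no box exists here and best_iou<ignore_thresh, the position is treated as a negative sample
        # best_iou<ignore_thresh limits the number of negative samples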
        confidence_loss = object_mask * K.binary_crossentropy(object_mask, raw_pred[...,4:5], from_logits=True)+ \
            (1-object_mask) * K.binary_crossentropy(object_mask, raw_pred[...,4:5], from_logits=True) * ignore_mask
        
        class_loss = object_mask * K.binary_crossentropy(true_class_probs, raw_pred[...,5:], from_logits=True)

        confidence_loss = K.sum(confidence_loss) / mf
        class_loss = K.sum(class_loss) / mf
        loss += location_loss + confidence_loss + class_loss
    loss = K.expand_dims(loss, axis=-1)
    return loss
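
To see what _smooth_labels does to a one-hot class vector, here is a tiny NumPy re-statement of the same formula. It is a sketch for checking the arithmetic by hand, not part of the repository; smooth_labels is an illustrative name and the four-class vector is made up.

```python
import numpy as np

def smooth_labels(one_hot, label_smoothing):
    """Same formula as _smooth_labels, written with NumPy for illustration."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - label_smoothing) + label_smoothing / num_classes

one_hot = np.array([0.0, 1.0, 0.0, 0.0])      # hypothetical 4-class target
smoothed = smooth_labels(one_hot, 0.1)
print(smoothed)                                # 0.925 for the true class, 0.025 elsewhere
```

The true class target drops from 1 to 0.925 and every other class rises from 0 to 0.025, so the class loss stops rewarding ever-larger logits for the true class.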

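Training your own YOLOV4 model

The overall folder structure of yolo4 is as follows:

This article trains with data in VOC format.
Before training, put the label files in the Annotation folder under VOC2007, inside the VOCdevkit folder.

Before training, put the image files in the JPEGImages folder under VOC2007, inside the VOCdevkit folder.

Before training, run voc2yolo4.py to generate the corresponding txt files.

Then run voc_annotation.py in the root directory; before running it, change classes to your own classes.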

classes = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

This generates the corresponding 2007_train.txt, in which each line holds an image path followed by the positions of its ground-truth boxes.
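
As a rough sketch of reading one such line back: the exact layout is an assumption here, namely the image path followed by space-separated boxes, each written as x_min,y_min,x_max,y_max,class_id. The function name parse_annotation_line and the sample path and numbers are made up for illustration, so check a line of your own 2007_train.txt before relying on this.

```python
def parse_annotation_line(line):
    """Split one annotation line into (image_path, list of box tuples).

    Assumed layout (not confirmed by this article):
        path x_min,y_min,x_max,y_max,class_id x_min,y_min,x_max,y_max,class_id ...
    Paths that contain spaces would break this simple split.
    """
    parts = line.split()
    image_path = parts[0]
    boxes = [tuple(int(v) for v in box.split(",")) for box in parts[1:]]
    return image_path, boxes

# Hypothetical example line
sample = "VOCdevkit/VOC2007/JPEGImages/000001.jpg 48,240,195,371,11 8,12,352,498,14"
path, boxes = parse_annotation_line(sample)
```

Here class_id would be the index into the classes list above, so 11 corresponds to "dog" and 14 to "person" under this assumption.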

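Before training, you also need to edit voc_classes.txt in model_data and change the classes to your own classes.

Run train.py to start training.

To support TensorFlow 2's Eager mode, I also wrote a separate train_eager.py. Its parameters are much the same as those of train.py, and you can run it to train as well.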
