學習前言

也做了一下pytorch版本的。

什麼是YOLOV4

YOLOV4是YOLOV3的改進版，在YOLOV3的基礎上結合了非常多的小Tricks。
儘管沒有目標檢測上革命性的改變，但是YOLOV4依然很好的結合了速度與精度。
根據上圖也可以看出來，YOLOV4在YOLOV3的基礎上，在FPS不下降的情況下，mAP達到了44，提高非常明顯。

YOLOV4整體上的檢測思路和YOLOV3相比相差並不大，都是使用三個特徵層進行分類與迴歸預測。

請注意！

強烈建議在學習YOLOV4之前學習YOLOV3，因爲YOLOV4確實可以看作是YOLOV3結合一系列改進的版本！

（重要的事情說三遍！）

YOLOV3可參考該博客：
https://blog.csdn.net/weixin_44791964/article/details/105310627

代碼下載

https://github.com/bubbliiiing/yolov4-pytorch
喜歡的可以給個star噢！

YOLOV4改進的部分（不完全）

1、主幹特徵提取網絡：DarkNet53 => CSPDarkNet53

2、特徵金字塔：SPP，PAN

3、分類迴歸層：YOLOv3（未改變）

4、訓練用到的小技巧：Mosaic數據增強、Label Smoothing平滑、CIOU、學習率餘弦退火衰減

5、激活函數：使用Mish激活函數

以上並非全部的改進部分，還存在一些其它的改進，由於YOLOV4使用的改進實在太多了，很難完全實現與列出來，這裏只列出來了一些我比較感興趣，而且非常有效的改進。

整篇BLOG會結合YOLOV3與YOLOV4的差別進行解析

YOLOV4結構解析

爲方便理解，本文將所有通道數都放到了最後一維度。
爲方便理解，本文將所有通道數都放到了最後一維度。
爲方便理解，本文將所有通道數都放到了最後一維度。

1、主幹特徵提取網絡Backbone

當輸入是416x416時，特徵結構如下：

當輸入是608x608時，特徵結構如下：

主幹特徵提取網絡Backbone的改進點有兩個：
a).主幹特徵提取網絡：DarkNet53 => CSPDarkNet53
b).激活函數：使用Mish激活函數

如果大家對YOLOV3比較熟悉的話，應該知道Darknet53的結構，其由一系列殘差網絡結構構成。在Darknet53中，其存在resblock_body模塊，其由一次下采樣和多次殘差結構的堆疊構成，Darknet53便是由resblock_body模塊組合而成。

而在YOLOV4中，其對該部分進行了一定的修改。
1、其一是將DarknetConv2D的激活函數由LeakyReLU修改成了Mish，卷積塊由DarknetConv2D_BN_Leaky變成了DarknetConv2D_BN_Mish。
Mish函數的公式與圖像如下：
$Mish=x \times tanh(ln(1+e^x))$

2、其二是將resblock_body的結構進行修改，使用了CSPnet結構。此時YOLOV4當中的Darknet53被修改成了CSPDarknet53。

CSPnet結構並不算複雜，就是將原來的殘差塊的堆疊進行了一個拆分，拆成左右兩部分：
主幹部分繼續進行原來的殘差塊的堆疊；
另一部分則像一個殘差邊一樣，經過少量處理直接連接到最後。
因此可以認爲CSP中存在一個大的殘差邊。

#---------------------------------------------------#
#   CSPdarknet的結構塊
#   存在一個大殘差邊
#   這個大殘差邊繞過了很多的殘差結構
#---------------------------------------------------#
class Resblock_body(nn.Module):
    def __init__(self, in_channels, out_channels, num_blocks, first):
        super(Resblock_body, self).__init__()

        self.downsample_conv = BasicConv(in_channels, out_channels, 3, stride=2)

        if first:
            self.split_conv0 = BasicConv(out_channels, out_channels, 1)
            self.split_conv1 = BasicConv(out_channels, out_channels, 1)  
            self.blocks_conv = nn.Sequential(
                Resblock(channels=out_channels, hidden_channels=out_channels//2),
                BasicConv(out_channels, out_channels, 1)
            )
            self.concat_conv = BasicConv(out_channels*2, out_channels, 1)
        else:
            self.split_conv0 = BasicConv(out_channels, out_channels//2, 1)
            self.split_conv1 = BasicConv(out_channels, out_channels//2, 1)

            self.blocks_conv = nn.Sequential(
                *[Resblock(out_channels//2) for _ in range(num_blocks)],
                BasicConv(out_channels//2, out_channels//2, 1)
            )
            self.concat_conv = BasicConv(out_channels, out_channels, 1)

    def forward(self, x):
        x = self.downsample_conv(x)

        x0 = self.split_conv0(x)

        x1 = self.split_conv1(x)
        x1 = self.blocks_conv(x1)

        x = torch.cat([x1, x0], dim=1)
        x = self.concat_conv(x)

        return x

全部實現代碼爲：

import torch
import torch.nn.functional as F
import torch.nn as nn
import math
from collections import OrderedDict

#-------------------------------------------------#
#   MISH激活函數
#-------------------------------------------------#
class Mish(nn.Module):
    def __init__(self):
        super(Mish, self).__init__()

    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

#-------------------------------------------------#
#   卷積塊
#   CONV+BATCHNORM+MISH
#-------------------------------------------------#
class BasicConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1):
        super(BasicConv, self).__init__()

        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, kernel_size//2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.activation = Mish()

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.activation(x)
        return x

#---------------------------------------------------#
#   CSPdarknet的結構塊的組成部分
#   內部堆疊的殘差塊
#---------------------------------------------------#
class Resblock(nn.Module):
    def __init__(self, channels, hidden_channels=None, residual_activation=nn.Identity()):
        super(Resblock, self).__init__()

        if hidden_channels is None:
            hidden_channels = channels

        self.block = nn.Sequential(
            BasicConv(channels, hidden_channels, 1),
            BasicConv(hidden_channels, channels, 3)
        )

    def forward(self, x):
        return x+self.block(x)

#---------------------------------------------------#
#   CSPdarknet的結構塊
#   存在一個大殘差邊
#   這個大殘差邊繞過了很多的殘差結構
#---------------------------------------------------#
class Resblock_body(nn.Module):
    def __init__(self, in_channels, out_channels, num_blocks, first):
        super(Resblock_body, self).__init__()

        self.downsample_conv = BasicConv(in_channels, out_channels, 3, stride=2)

        if first:
            self.split_conv0 = BasicConv(out_channels, out_channels, 1)
            self.split_conv1 = BasicConv(out_channels, out_channels, 1)  
            self.blocks_conv = nn.Sequential(
                Resblock(channels=out_channels, hidden_channels=out_channels//2),
                BasicConv(out_channels, out_channels, 1)
            )
            self.concat_conv = BasicConv(out_channels*2, out_channels, 1)
        else:
            self.split_conv0 = BasicConv(out_channels, out_channels//2, 1)
            self.split_conv1 = BasicConv(out_channels, out_channels//2, 1)

            self.blocks_conv = nn.Sequential(
                *[Resblock(out_channels//2) for _ in range(num_blocks)],
                BasicConv(out_channels//2, out_channels//2, 1)
            )
            self.concat_conv = BasicConv(out_channels, out_channels, 1)

    def forward(self, x):
        x = self.downsample_conv(x)

        x0 = self.split_conv0(x)

        x1 = self.split_conv1(x)
        x1 = self.blocks_conv(x1)

        x = torch.cat([x1, x0], dim=1)
        x = self.concat_conv(x)

        return x

class CSPDarkNet(nn.Module):
    def __init__(self, layers):
        super(CSPDarkNet, self).__init__()
        self.inplanes = 32
        self.conv1 = BasicConv(3, self.inplanes, kernel_size=3, stride=1)
        self.feature_channels = [64, 128, 256, 512, 1024]

        self.stages = nn.ModuleList([
            Resblock_body(self.inplanes, self.feature_channels[0], layers[0], first=True),
            Resblock_body(self.feature_channels[0], self.feature_channels[1], layers[1], first=False),
            Resblock_body(self.feature_channels[1], self.feature_channels[2], layers[2], first=False),
            Resblock_body(self.feature_channels[2], self.feature_channels[3], layers[3], first=False),
            Resblock_body(self.feature_channels[3], self.feature_channels[4], layers[4], first=False)
        ])

        self.num_features = 1
        # 進行權值初始化
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()


    def forward(self, x):
        x = self.conv1(x)

        x = self.stages[0](x)
        x = self.stages[1](x)
        out3 = self.stages[2](x)
        out4 = self.stages[3](out3)
        out5 = self.stages[4](out4)

        return out3, out4, out5

def darknet53(pretrained, **kwargs):
    model = CSPDarkNet([1, 2, 8, 8, 4])
    if pretrained:
        if isinstance(pretrained, str):
            model.load_state_dict(torch.load(pretrained))
        else:
            raise Exception("darknet request a pretrained path. got [{}]".format(pretrained))
    return model

2、特徵金字塔

當輸入是416x416時，特徵結構如下：

當輸入是608x608時，特徵結構如下：

在特徵金字塔部分，YOLOV4結合了兩種改進:
a).使用了SPP結構。
b).使用了PANet結構。
如上圖所示，除去CSPDarknet53和Yolo Head的結構外，都是特徵金字塔的結構。
1、SPP結構參雜在對CSPdarknet53的最後一個特徵層的卷積裏，在對CSPdarknet53的最後一個特徵層進行三次DarknetConv2D_BN_Leaky卷積後，分別利用四個不同尺度的最大池化進行處理，最大池化的池化核大小分別爲13x13、9x9、5x5、1x1（1x1即無處理）

#---------------------------------------------------#
#   SPP結構，利用不同大小的池化核進行池化
#   池化後堆疊
#---------------------------------------------------#
class SpatialPyramidPooling(nn.Module):
    def __init__(self, pool_sizes=[5, 9, 13]):
        super(SpatialPyramidPooling, self).__init__()

        self.maxpools = nn.ModuleList([nn.MaxPool2d(pool_size, 1, pool_size//2) for pool_size in pool_sizes])

    def forward(self, x):
        features = [maxpool(x) for maxpool in self.maxpools[::-1]]
        features = torch.cat(features + [x], dim=1)

        return features

其可以它能夠極大地增加感受野，分離出最顯著的上下文特徵。

2、PANet是2018的一種實例分割算法，其具體結構由反覆提升特徵的意思。

上圖爲原始的PANet的結構，可以看出來其具有一個非常重要的特點就是特徵的反覆提取。
在（a）裏面是傳統的特徵金字塔結構，在完成特徵金字塔從下到上的特徵提取後，還需要實現（b）中從上到下的特徵提取。

而在YOLOV4當中，其主要是在三個有效特徵層上使用了PANet結構。

實現代碼如下：

#---------------------------------------------------#
#   yolo_body
#---------------------------------------------------#
class YoloBody(nn.Module):
    def __init__(self, config):
        super(YoloBody, self).__init__()
        self.config = config
        #  backbone
        self.backbone = darknet53(None)

        self.conv1 = make_three_conv([512,1024],1024)
        self.SPP = SpatialPyramidPooling()
        self.conv2 = make_three_conv([512,1024],2048)

        self.upsample1 = Upsample(512,256)
        self.conv_for_P4 = conv2d(512,256,1)
        self.make_five_conv1 = make_five_conv([256, 512],512)

        self.upsample2 = Upsample(256,128)
        self.conv_for_P3 = conv2d(256,128,1)
        self.make_five_conv2 = make_five_conv([128, 256],256)
        # 3*(5+num_classes)=3*(5+20)=3*(4+1+20)=75
        final_out_filter2 = len(config["yolo"]["anchors"][2]) * (5 + config["yolo"]["classes"])
        self.yolo_head3 = yolo_head([256, final_out_filter2],128)

        self.down_sample1 = conv2d(128,256,3,stride=2)
        self.make_five_conv3 = make_five_conv([256, 512],512)
        # 3*(5+num_classes)=3*(5+20)=3*(4+1+20)=75
        final_out_filter1 = len(config["yolo"]["anchors"][1]) * (5 + config["yolo"]["classes"])
        self.yolo_head2 = yolo_head([512, final_out_filter1],256)


        self.down_sample2 = conv2d(256,512,3,stride=2)
        self.make_five_conv4 = make_five_conv([512, 1024],1024)
        # 3*(5+num_classes)=3*(5+20)=3*(4+1+20)=75
        final_out_filter0 = len(config["yolo"]["anchors"][0]) * (5 + config["yolo"]["classes"])
        self.yolo_head1 = yolo_head([1024, final_out_filter0],512)


    def forward(self, x):
        #  backbone
        x2, x1, x0 = self.backbone(x)

        P5 = self.conv1(x0)
        P5 = self.SPP(P5)
        P5 = self.conv2(P5)

        P5_upsample = self.upsample1(P5)
        P4 = self.conv_for_P4(x1)
        P4 = torch.cat([P4,P5_upsample],axis=1)
        P4 = self.make_five_conv1(P4)

        P4_upsample = self.upsample2(P4)
        P3 = self.conv_for_P3(x2)
        P3 = torch.cat([P3,P4_upsample],axis=1)
        P3 = self.make_five_conv2(P3)

        P3_downsample = self.down_sample1(P3)
        P4 = torch.cat([P3_downsample,P4],axis=1)
        P4 = self.make_five_conv3(P4)

        P4_downsample = self.down_sample2(P4)
        P5 = torch.cat([P4_downsample,P5],axis=1)
        P5 = self.make_five_conv4(P5)

        out2 = self.yolo_head3(P3)
        out1 = self.yolo_head2(P4)
        out0 = self.yolo_head1(P5)

        return out0, out1, out2

3、YoloHead利用獲得到的特徵進行預測

當輸入是416x416時，特徵結構如下：

當輸入是608x608時，特徵結構如下：

1、在特徵利用部分，YoloV4提取多特徵層進行目標檢測，一共提取三個特徵層，分別位於中間層，中下層，底層，三個特徵層的shape分別爲(76,76,256)、(38,38,512)、(19,19,1024)。

2、輸出層的shape分別爲(19,19,75)，(38,38,75)，(76,76,75)，最後一個維度爲75是因爲該圖是基於voc數據集的，它的類爲20種，YoloV4只有針對每一個特徵層存在3個先驗框，所以最後維度爲3x25；
如果使用的是coco訓練集，類則爲80種，最後的維度應該爲255 = 3x85，三個特徵層的shape爲(19,19,255)，(38,38,255)，(76,76,255)

實現代碼如下：

#---------------------------------------------------#
#   最後獲得yolov4的輸出
#---------------------------------------------------#
def yolo_head(filters_list, in_filters):
    m = nn.Sequential(
        conv2d(in_filters, filters_list[0], 3),
        nn.Conv2d(filters_list[0], filters_list[1], 1),
    )
    return m

#---------------------------------------------------#
#   yolo_body
#---------------------------------------------------#
class YoloBody(nn.Module):
    def __init__(self, config):
        super(YoloBody, self).__init__()
        self.config = config
        #  backbone
        self.backbone = darknet53(None)

        self.conv1 = make_three_conv([512,1024],1024)
        self.SPP = SpatialPyramidPooling()
        self.conv2 = make_three_conv([512,1024],2048)

        self.upsample1 = Upsample(512,256)
        self.conv_for_P4 = conv2d(512,256,1)
        self.make_five_conv1 = make_five_conv([256, 512],512)

        self.upsample2 = Upsample(256,128)
        self.conv_for_P3 = conv2d(256,128,1)
        self.make_five_conv2 = make_five_conv([128, 256],256)
        # 3*(5+num_classes)=3*(5+20)=3*(4+1+20)=75
        final_out_filter2 = len(config["yolo"]["anchors"][2]) * (5 + config["yolo"]["classes"])
        self.yolo_head3 = yolo_head([256, final_out_filter2],128)

        self.down_sample1 = conv2d(128,256,3,stride=2)
        self.make_five_conv3 = make_five_conv([256, 512],512)
        # 3*(5+num_classes)=3*(5+20)=3*(4+1+20)=75
        final_out_filter1 = len(config["yolo"]["anchors"][1]) * (5 + config["yolo"]["classes"])
        self.yolo_head2 = yolo_head([512, final_out_filter1],256)


        self.down_sample2 = conv2d(256,512,3,stride=2)
        self.make_five_conv4 = make_five_conv([512, 1024],1024)
        # 3*(5+num_classes)=3*(5+20)=3*(4+1+20)=75
        final_out_filter0 = len(config["yolo"]["anchors"][0]) * (5 + config["yolo"]["classes"])
        self.yolo_head1 = yolo_head([1024, final_out_filter0],512)


    def forward(self, x):
        #  backbone
        x2, x1, x0 = self.backbone(x)

        P5 = self.conv1(x0)
        P5 = self.SPP(P5)
        P5 = self.conv2(P5)

        P5_upsample = self.upsample1(P5)
        P4 = self.conv_for_P4(x1)
        P4 = torch.cat([P4,P5_upsample],axis=1)
        P4 = self.make_five_conv1(P4)

        P4_upsample = self.upsample2(P4)
        P3 = self.conv_for_P3(x2)
        P3 = torch.cat([P3,P4_upsample],axis=1)
        P3 = self.make_five_conv2(P3)

        P3_downsample = self.down_sample1(P3)
        P4 = torch.cat([P3_downsample,P4],axis=1)
        P4 = self.make_five_conv3(P4)

        P4_downsample = self.down_sample2(P4)
        P5 = torch.cat([P4_downsample,P5],axis=1)
        P5 = self.make_five_conv4(P5)

        out2 = self.yolo_head3(P3)
        out1 = self.yolo_head2(P4)
        out0 = self.yolo_head1(P5)

        return out0, out1, out2

4、預測結果的解碼

由第二步我們可以獲得三個特徵層的預測結果，shape分別爲(N,19,19,255)，(N,38,38,255)，(N,76,76,255)的數據，對應每個圖分爲19x19、38x38、76x76的網格上3個預測框的位置。

但是這個預測結果並不對應着最終的預測框在圖片上的位置，還需要解碼纔可以完成。

此處要講一下yolo3的預測原理，yolo3的3個特徵層分別將整幅圖分爲19x19、38x38、76x76的網格，每個網絡點負責一個區域的檢測。

我們知道特徵層的預測結果對應着三個預測框的位置，我們先將其reshape一下，其結果爲(N,19,19,3,85)，(N,38,38,3,85)，(N,76,76,3,85)。

最後一個維度中的85包含了4+1+80，分別代表x_offset、y_offset、h和w、置信度、分類結果。

yolo3的解碼過程就是將每個網格點加上它對應的x_offset和y_offset，加完後的結果就是預測框的中心，然後再利用先驗框和h、w結合計算出預測框的長和寬。這樣就能得到整個預測框的位置了。

當然得到最終的預測結構後還要進行得分排序與非極大抑制篩選
這一部分基本上是所有目標檢測通用的部分。不過該項目的處理方式與其它項目不同。其對於每一個類進行判別。
1、取出每一類得分大於self.obj_threshold的框和得分。
2、利用框的位置和得分進行非極大抑制。

實現代碼如下，當調用yolo_eval時，就會對每個特徵層進行解碼：

class DecodeBox(nn.Module):
    def __init__(self, anchors, num_classes, img_size):
        super(DecodeBox, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.bbox_attrs = 5 + num_classes
        self.img_size = img_size

    def forward(self, input):
        # input爲bs,3*(1+4+num_classes),13,13

        # 一共多少張圖片
        batch_size = input.size(0)
        # 13，13
        input_height = input.size(2)
        input_width = input.size(3)

        # 計算步長
        # 每一個特徵點對應原來的圖片上多少個像素點
        # 如果特徵層爲13x13的話，一個特徵點就對應原來的圖片上的32個像素點
        # 416/13 = 32
        stride_h = self.img_size[1] / input_height
        stride_w = self.img_size[0] / input_width

        # 把先驗框的尺寸調整成特徵層大小的形式
        # 計算出先驗框在特徵層上對應的寬高
        scaled_anchors = [(anchor_width / stride_w, anchor_height / stride_h) for anchor_width, anchor_height in self.anchors]

        # bs,3*(5+num_classes),13,13 -> bs,3,13,13,(5+num_classes)
        prediction = input.view(batch_size, self.num_anchors,
                                self.bbox_attrs, input_height, input_width).permute(0, 1, 3, 4, 2).contiguous()

        # 先驗框的中心位置的調整參數
        x = torch.sigmoid(prediction[..., 0])  
        y = torch.sigmoid(prediction[..., 1])
        # 先驗框的寬高調整參數
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height

        # 獲得置信度，是否有物體
        conf = torch.sigmoid(prediction[..., 4])
        # 種類置信度
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred.

        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

        # 生成網格，先驗框中心，網格左上角 batch_size,3,13,13
        grid_x = torch.linspace(0, input_width - 1, input_width).repeat(input_width, 1).repeat(
            batch_size * self.num_anchors, 1, 1).view(x.shape).type(FloatTensor)
        grid_y = torch.linspace(0, input_height - 1, input_height).repeat(input_height, 1).t().repeat(
            batch_size * self.num_anchors, 1, 1).view(y.shape).type(FloatTensor)

        # 生成先驗框的寬高
        anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
        anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))
        anchor_w = anchor_w.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(w.shape)
        anchor_h = anchor_h.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(h.shape)
        
        # 計算調整後的先驗框中心與寬高
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0] = x.data + grid_x
        pred_boxes[..., 1] = y.data + grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * anchor_w
        pred_boxes[..., 3] = torch.exp(h.data) * anchor_h

        # 用於將輸出調整爲相對於416x416的大小
        _scale = torch.Tensor([stride_w, stride_h] * 2).type(FloatTensor)
        output = torch.cat((pred_boxes.view(batch_size, -1, 4) * _scale,
                            conf.view(batch_size, -1, 1), pred_cls.view(batch_size, -1, self.num_classes)), -1)
        return output.data

5、在原圖上進行繪製

通過第四步，我們可以獲得預測框在原圖上的位置，而且這些預測框都是經過篩選的。這些篩選後的框可以直接繪製在圖片上，就可以獲得結果了。

YOLOV4的訓練

1、YOLOV4的改進訓練技巧

a)、Mosaic數據增強

Yolov4的mosaic數據增強參考了CutMix數據增強方式，理論上具有一定的相似性！
CutMix數據增強方式利用兩張圖片進行拼接。

但是mosaic利用了四張圖片，根據論文所說其擁有一個巨大的優點是豐富檢測物體的背景！且在BN計算的時候一下子會計算四張圖片的數據！
就像下圖這樣：

實現思路如下：
1、每次讀取四張圖片。

2、分別對四張圖片進行翻轉、縮放、色域變化等，並且按照四個方向位置擺好。

3、進行圖片的組合和框的組合

def rand(a=0, b=1):
    return np.random.rand()*(b-a) + a

def merge_bboxes(bboxes, cutx, cuty):

    merge_bbox = []
    for i in range(len(bboxes)):
        for box in bboxes[i]:
            tmp_box = []
            x1,y1,x2,y2 = box[0], box[1], box[2], box[3]

            if i == 0:
                if y1 > cuty or x1 > cutx:
                    continue
                if y2 >= cuty and y1 <= cuty:
                    y2 = cuty
                    if y2-y1 < 5:
                        continue
                if x2 >= cutx and x1 <= cutx:
                    x2 = cutx
                    if x2-x1 < 5:
                        continue
                
            if i == 1:
                if y2 < cuty or x1 > cutx:
                    continue

                if y2 >= cuty and y1 <= cuty:
                    y1 = cuty
                    if y2-y1 < 5:
                        continue
                
                if x2 >= cutx and x1 <= cutx:
                    x2 = cutx
                    if x2-x1 < 5:
                        continue

            if i == 2:
                if y2 < cuty or x2 < cutx:
                    continue

                if y2 >= cuty and y1 <= cuty:
                    y1 = cuty
                    if y2-y1 < 5:
                        continue

                if x2 >= cutx and x1 <= cutx:
                    x1 = cutx
                    if x2-x1 < 5:
                        continue

            if i == 3:
                if y1 > cuty or x2 < cutx:
                    continue

                if y2 >= cuty and y1 <= cuty:
                    y2 = cuty
                    if y2-y1 < 5:
                        continue

                if x2 >= cutx and x1 <= cutx:
                    x1 = cutx
                    if x2-x1 < 5:
                        continue

            tmp_box.append(x1)
            tmp_box.append(y1)
            tmp_box.append(x2)
            tmp_box.append(y2)
            tmp_box.append(box[-1])
            merge_bbox.append(tmp_box)
    return merge_bbox

def get_random_data(annotation_line, input_shape, random=True, hue=.1, sat=1.5, val=1.5, proc_img=True):
    '''random preprocessing for real-time data augmentation'''
    h, w = input_shape
    min_offset_x = 0.4
    min_offset_y = 0.4
    scale_low = 1-min(min_offset_x,min_offset_y)
    scale_high = scale_low+0.2

    image_datas = [] 
    box_datas = []
    index = 0

    place_x = [0,0,int(w*min_offset_x),int(w*min_offset_x)]
    place_y = [0,int(h*min_offset_y),int(w*min_offset_y),0]
    for line in annotation_line:
        # 每一行進行分割
        line_content = line.split()
        # 打開圖片
        image = Image.open(line_content[0])
        image = image.convert("RGB") 
        # 圖片的大小
        iw, ih = image.size
        # 保存框的位置
        box = np.array([np.array(list(map(int,box.split(',')))) for box in line_content[1:]])
        
        # image.save(str(index)+".jpg")
        # 是否翻轉圖片
        flip = rand()<.5
        if flip and len(box)>0:
            image = image.transpose(Image.FLIP_LEFT_RIGHT)
            box[:, [0,2]] = iw - box[:, [2,0]]

        # 對輸入進來的圖片進行縮放
        new_ar = w/h
        scale = rand(scale_low, scale_high)
        if new_ar < 1:
            nh = int(scale*h)
            nw = int(nh*new_ar)
        else:
            nw = int(scale*w)
            nh = int(nw/new_ar)
        image = image.resize((nw,nh), Image.BICUBIC)

        # 進行色域變換
        hue = rand(-hue, hue)
        sat = rand(1, sat) if rand()<.5 else 1/rand(1, sat)
        val = rand(1, val) if rand()<.5 else 1/rand(1, val)
        x = rgb_to_hsv(np.array(image)/255.)
        x[..., 0] += hue
        x[..., 0][x[..., 0]>1] -= 1
        x[..., 0][x[..., 0]<0] += 1
        x[..., 1] *= sat
        x[..., 2] *= val
        x[x>1] = 1
        x[x<0] = 0
        image = hsv_to_rgb(x)

        image = Image.fromarray((image*255).astype(np.uint8))
        # 將圖片進行放置，分別對應四張分割圖片的位置
        dx = place_x[index]
        dy = place_y[index]
        new_image = Image.new('RGB', (w,h), (128,128,128))
        new_image.paste(image, (dx, dy))
        image_data = np.array(new_image)/255

        # Image.fromarray((image_data*255).astype(np.uint8)).save(str(index)+"distort.jpg")
        
        index = index + 1
        box_data = []
        # 對box進行重新處理
        if len(box)>0:
            np.random.shuffle(box)
            box[:, [0,2]] = box[:, [0,2]]*nw/iw + dx
            box[:, [1,3]] = box[:, [1,3]]*nh/ih + dy
            box[:, 0:2][box[:, 0:2]<0] = 0
            box[:, 2][box[:, 2]>w] = w
            box[:, 3][box[:, 3]>h] = h
            box_w = box[:, 2] - box[:, 0]
            box_h = box[:, 3] - box[:, 1]
            box = box[np.logical_and(box_w>1, box_h>1)]
            box_data = np.zeros((len(box),5))
            box_data[:len(box)] = box
        
        image_datas.append(image_data)
        box_datas.append(box_data)

        img = Image.fromarray((image_data*255).astype(np.uint8))
        for j in range(len(box_data)):
            thickness = 3
            left, top, right, bottom  = box_data[j][0:4]
            draw = ImageDraw.Draw(img)
            for i in range(thickness):
                draw.rectangle([left + i, top + i, right - i, bottom - i],outline=(255,255,255))
        img.show()

    
    # 將圖片分割，放在一起
    cutx = np.random.randint(int(w*min_offset_x), int(w*(1 - min_offset_x)))
    cuty = np.random.randint(int(h*min_offset_y), int(h*(1 - min_offset_y)))

    new_image = np.zeros([h,w,3])
    new_image[:cuty, :cutx, :] = image_datas[0][:cuty, :cutx, :]
    new_image[cuty:, :cutx, :] = image_datas[1][cuty:, :cutx, :]
    new_image[cuty:, cutx:, :] = image_datas[2][cuty:, cutx:, :]
    new_image[:cuty, cutx:, :] = image_datas[3][:cuty, cutx:, :]

    # 對框進行進一步的處理
    new_boxes = merge_bboxes(box_datas, cutx, cuty)

    return new_image, new_boxes

b)、Label Smoothing平滑

標籤平滑的思想很簡單，具體公式如下：

new_onehot_labels = onehot_labels * (1 - label_smoothing) + label_smoothing / num_classes

當label_smoothing的值爲0.01得時候，公式變成如下所示：

new_onehot_labels = y * (1 - 0.01) + 0.01 / num_classes

其實Label Smoothing平滑就是將標籤進行一個平滑，原始的標籤是0、1，在平滑後變成0.005(如果是二分類)、0.995，也就是說對分類準確做了一點懲罰，讓模型不可以分類的太準確，太準確容易過擬合。

實現代碼如下：

#---------------------------------------------------#
#   平滑標籤
#---------------------------------------------------#
def smooth_labels(y_true, label_smoothing,num_classes):
    return y_true * (1.0 - label_smoothing) + label_smoothing / num_classes

c)、CIOU

IoU是比值的概念，對目標物體的scale是不敏感的。然而常用的BBox的迴歸損失優化和IoU優化不是完全等價的，尋常的IoU無法直接優化沒有重疊的部分。

於是有人提出直接使用IOU作爲迴歸優化loss，CIOU是其中非常優秀的一種想法。

CIOU將目標與anchor之間的距離，重疊率、尺度以及懲罰項都考慮進去，使得目標框迴歸變得更加穩定，不會像IoU和GIoU一樣出現訓練過程中發散等問題。而懲罰因子把預測框長寬比擬合目標框的長寬比考慮進去。

CIOU公式如下
$CIOU = IOU - \frac{\rho^2(b,b^{gt})}{c^2} - \alpha v$
其中， $\rho^2(b,b^{gt})$ 分別代表了預測框和真實框的中心點的歐式距離。 c代表的是能夠同時包含預測框和真實框的最小閉包區域的對角線距離。

而 $\alpha$ 和 $v$ 的公式如下
$\alpha = \frac{v}{1-IOU+v}$
$v = \frac{4}{\pi ^2}(arctan\frac{w^{gt}}{h^{gt}}-arctan\frac{w}{h})^2$
把1-CIOU就可以得到相應的LOSS了。
$LOSS_{CIOU} = 1 - IOU + \frac{\rho^2(b,b^{gt})}{c^2} + \alpha v$

def box_ciou(b1, b2):
    """
    輸入爲：
    ----------
    b1: tensor, shape=(batch, feat_w, feat_h, anchor_num, 4), xywh
    b2: tensor, shape=(batch, feat_w, feat_h, anchor_num, 4), xywh

    返回爲：
    -------
    ciou: tensor, shape=(batch, feat_w, feat_h, anchor_num, 1)
    """
    # 求出預測框左上角右下角
    b1_xy = b1[..., :2]
    b1_wh = b1[..., 2:4]
    b1_wh_half = b1_wh/2.
    b1_mins = b1_xy - b1_wh_half
    b1_maxes = b1_xy + b1_wh_half
    # 求出真實框左上角右下角
    b2_xy = b2[..., :2]
    b2_wh = b2[..., 2:4]
    b2_wh_half = b2_wh/2.
    b2_mins = b2_xy - b2_wh_half
    b2_maxes = b2_xy + b2_wh_half

    # 求真實框和預測框所有的iou
    intersect_mins = torch.max(b1_mins, b2_mins)
    intersect_maxes = torch.min(b1_maxes, b2_maxes)
    intersect_wh = torch.max(intersect_maxes - intersect_mins, torch.zeros_like(intersect_maxes))
    intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
    b1_area = b1_wh[..., 0] * b1_wh[..., 1]
    b2_area = b2_wh[..., 0] * b2_wh[..., 1]
    union_area = b1_area + b2_area - intersect_area
    iou = intersect_area / (union_area + 1e-6)

    # 計算中心的差距
    center_distance = torch.sum(torch.pow((b1_xy - b2_xy), 2), axis=-1)
    # 找到包裹兩個框的最小框的左上角和右下角
    enclose_mins = torch.min(b1_mins, b2_mins)
    enclose_maxes = torch.max(b1_maxes, b2_maxes)
    enclose_wh = torch.max(enclose_maxes - enclose_mins, torch.zeros_like(intersect_maxes))
    # 計算對角線距離
    enclose_diagonal = torch.sum(torch.pow(enclose_wh,2), axis=-1)
    ciou = iou - 1.0 * (center_distance) / (enclose_diagonal + 1e-7)
    
    v = (4 / (math.pi ** 2)) * torch.pow((torch.atan(b1_wh[..., 0]/b1_wh[..., 1]) - torch.atan(b2_wh[..., 0]/b2_wh[..., 1])), 2)
    alpha = v / (1.0 - iou + v)
    ciou = ciou - alpha * v
    return ciou

d)、學習率餘弦退火衰減

餘弦退火衰減法，學習率會先上升再下降，這是退火優化法的思想。（關於什麼是退火算法可以百度。）

上升的時候使用線性上升，下降的時候模擬cos函數下降。執行多次。

效果如圖所示：

pytorch有直接實現的函數，可直接調用。

lr_scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5, eta_min=1e-5)

2、loss組成

a)、計算loss所需參數

在計算loss的時候，實際上是y_pre和y_true之間的對比：
y_pre就是一幅圖像經過網絡之後的輸出，內部含有三個特徵層的內容；其需要解碼才能夠在圖上作畫
y_true就是一個真實圖像中，它的每個真實框對應的(19,19)、(38,38)、(76,76)網格上的偏移位置、長寬與種類。其仍需要編碼才能與y_pred的結構一致
實際上y_pre和y_true內容的shape都是
(batch_size,19,19,3,85)
(batch_size,38,38,3,85)
(batch_size,76,76,3,85)

b)、y_pre是什麼

網絡最後輸出的內容就是三個特徵層每個網格點對應的預測框及其種類，即三個特徵層分別對應着圖片被分爲不同size的網格後，每個網格點上三個先驗框對應的位置、置信度及其種類。
對於輸出的y1、y2、y3而言，[…, : 2]指的是相對於每個網格點的偏移量，[…, 2: 4]指的是寬和高，[…, 4: 5]指的是該框的置信度，[…, 5: ]指的是每個種類的預測概率。
現在的y_pre還是沒有解碼的，解碼了之後纔是真實圖像上的情況。

c)、y_true是什麼。

y_true就是一個真實圖像中，它的每個真實框對應的(19,19)、(38,38)、(76,76)網格上的偏移位置、長寬與種類。其仍需要編碼才能與y_pred的結構一致

d)、loss的計算過程

在得到了y_pre和y_true後怎麼對比呢？不是簡單的減一下!

loss值需要對三個特徵層進行處理，這裏以最小的特徵層爲例。
1、利用y_true取出該特徵層中真實存在目標的點的位置(m,19,19,3,1)及其對應的種類(m,19,19,3,80)。
2、將prediction的預測值輸出進行處理，得到reshape後的預測值y_pre，shape爲(m,19,19,3,85)。還有解碼後的xy，wh。
3、對於每一幅圖，計算其中所有真實框與預測框的IOU，如果某些預測框和真實框的重合程度大於0.5，則忽略。
4、計算ciou作爲迴歸的loss，這裏只計算正樣本的迴歸loss。
5、計算置信度的loss，其有兩部分構成，第一部分是實際上存在目標的，預測結果中置信度的值與1對比；第二部分是實際上不存在目標的，在第四步中得到其最大IOU的值與0對比。
6、計算預測種類的loss，其計算的是實際上存在目標的，預測類與真實類的差距。

其實際上計算的總的loss是三個loss的和，這三個loss分別是：

實際存在的框，CIOU LOSS。
實際存在的框，預測結果中置信度的值與1對比；實際不存在的框，預測結果中置信度的值與0對比，該部分要去除被忽略的不包含目標的框。
實際存在的框，種類預測結果與實際結果的對比。

其實際代碼如下：


#---------------------------------------------------#
#   平滑標籤
#---------------------------------------------------#
def smooth_labels(y_true, label_smoothing,num_classes):
    return y_true * (1.0 - label_smoothing) + label_smoothing / num_classes

def box_ciou(b1, b2):
    """
    輸入爲：
    ----------
    b1: tensor, shape=(batch, feat_w, feat_h, anchor_num, 4), xywh
    b2: tensor, shape=(batch, feat_w, feat_h, anchor_num, 4), xywh

    返回爲：
    -------
    ciou: tensor, shape=(batch, feat_w, feat_h, anchor_num, 1)
    """
    # 求出預測框左上角右下角
    b1_xy = b1[..., :2]
    b1_wh = b1[..., 2:4]
    b1_wh_half = b1_wh/2.
    b1_mins = b1_xy - b1_wh_half
    b1_maxes = b1_xy + b1_wh_half
    # 求出真實框左上角右下角
    b2_xy = b2[..., :2]
    b2_wh = b2[..., 2:4]
    b2_wh_half = b2_wh/2.
    b2_mins = b2_xy - b2_wh_half
    b2_maxes = b2_xy + b2_wh_half

    # 求真實框和預測框所有的iou
    intersect_mins = torch.max(b1_mins, b2_mins)
    intersect_maxes = torch.min(b1_maxes, b2_maxes)
    intersect_wh = torch.max(intersect_maxes - intersect_mins, torch.zeros_like(intersect_maxes))
    intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
    b1_area = b1_wh[..., 0] * b1_wh[..., 1]
    b2_area = b2_wh[..., 0] * b2_wh[..., 1]
    union_area = b1_area + b2_area - intersect_area
    iou = intersect_area / (union_area + 1e-6)

    # 計算中心的差距
    center_distance = torch.sum(torch.pow((b1_xy - b2_xy), 2), axis=-1)
    # 找到包裹兩個框的最小框的左上角和右下角
    enclose_mins = torch.min(b1_mins, b2_mins)
    enclose_maxes = torch.max(b1_maxes, b2_maxes)
    enclose_wh = torch.max(enclose_maxes - enclose_mins, torch.zeros_like(intersect_maxes))
    # 計算對角線距離
    enclose_diagonal = torch.sum(torch.pow(enclose_wh,2), axis=-1)
    ciou = iou - 1.0 * (center_distance) / (enclose_diagonal + 1e-7)
    
    v = (4 / (math.pi ** 2)) * torch.pow((torch.atan(b1_wh[..., 0]/b1_wh[..., 1]) - torch.atan(b2_wh[..., 0]/b2_wh[..., 1])), 2)
    alpha = v / (1.0 - iou + v)
    ciou = ciou - alpha * v
    return ciou

def clip_by_tensor(t,t_min,t_max):
    t=t.float()
    result = (t >= t_min).float() * t + (t < t_min).float() * t_min
    result = (result <= t_max).float() * result + (result > t_max).float() * t_max
    return result

def MSELoss(pred,target):
    return (pred-target)**2

def BCELoss(pred,target):
    epsilon = 1e-7
    pred = clip_by_tensor(pred, epsilon, 1.0 - epsilon)
    output = -target * torch.log(pred) - (1.0 - target) * torch.log(1.0 - pred)
    return output

class YOLOLoss(nn.Module):
    def __init__(self, anchors, num_classes, img_size, label_smooth=0, cuda=True):
        super(YOLOLoss, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.bbox_attrs = 5 + num_classes
        self.img_size = img_size
        self.label_smooth = label_smooth

        self.ignore_threshold = 0.5
        self.lambda_conf = 1.0
        self.lambda_cls = 1.0
        self.lambda_loc = 1.0
        self.cuda = cuda

    def forward(self, input, targets=None):
        # input爲bs,3*(5+num_classes),13,13
        
        # 一共多少張圖片
        bs = input.size(0)
        # 特徵層的高
        in_h = input.size(2)
        # 特徵層的寬
        in_w = input.size(3)

        # 計算步長
        # 每一個特徵點對應原來的圖片上多少個像素點
        # 如果特徵層爲13x13的話，一個特徵點就對應原來的圖片上的32個像素點
        stride_h = self.img_size[1] / in_h
        stride_w = self.img_size[0] / in_w

        # 把先驗框的尺寸調整成特徵層大小的形式
        # 計算出先驗框在特徵層上對應的寬高
        scaled_anchors = [(a_w / stride_w, a_h / stride_h) for a_w, a_h in self.anchors]
        # bs,3*(5+num_classes),13,13 -> bs,3,13,13,(5+num_classes)
        prediction = input.view(bs, int(self.num_anchors/3),
                                self.bbox_attrs, in_h, in_w).permute(0, 1, 3, 4, 2).contiguous()
        
        # 對prediction預測進行調整
        conf = torch.sigmoid(prediction[..., 4])  # Conf
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred.

        # 找到哪些先驗框內部包含物體
        mask, noobj_mask, t_box, tconf, tcls, box_loss_scale_x, box_loss_scale_y = self.get_target(targets, scaled_anchors,in_w, in_h,self.ignore_threshold)

        noobj_mask, pred_boxes_for_ciou = self.get_ignore(prediction, targets, scaled_anchors, in_w, in_h, noobj_mask)

        if self.cuda:
            mask, noobj_mask = mask.cuda(), noobj_mask.cuda()
            box_loss_scale_x, box_loss_scale_y= box_loss_scale_x.cuda(), box_loss_scale_y.cuda()
            tconf, tcls = tconf.cuda(), tcls.cuda()
            pred_boxes_for_ciou = pred_boxes_for_ciou.cuda()
            t_box = t_box.cuda()

        box_loss_scale = 2-box_loss_scale_x*box_loss_scale_y
        #  losses.
        ciou = (1 - box_ciou( pred_boxes_for_ciou[mask.bool()], t_box[mask.bool()]))* box_loss_scale[mask.bool()]

        loss_loc = torch.sum(ciou / bs)
        loss_conf = torch.sum(BCELoss(conf, mask) * mask / bs) + \
                    torch.sum(BCELoss(conf, mask) * noobj_mask / bs)
                    
        # print(smooth_labels(tcls[mask == 1],self.label_smooth,self.num_classes))
        loss_cls = torch.sum(BCELoss(pred_cls[mask == 1], smooth_labels(tcls[mask == 1],self.label_smooth,self.num_classes))/bs)
        # print(loss_loc,loss_conf,loss_cls)
        loss = loss_conf * self.lambda_conf + loss_cls * self.lambda_cls + loss_loc * self.lambda_loc
        return loss, loss_conf.item(), loss_cls.item(), loss_loc.item()

    def get_target(self, target, anchors, in_w, in_h, ignore_threshold):
        # 計算一共有多少張圖片
        bs = len(target)
        # 獲得先驗框
        anchor_index = [[0,1,2],[3,4,5],[6,7,8]][[13,26,52].index(in_w)]
        subtract_index = [0,3,6][[13,26,52].index(in_w)]
        # 創建全是0或者全是1的陣列
        mask = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        noobj_mask = torch.ones(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)

        tx = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        ty = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        tw = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        th = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        t_box = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, 4, requires_grad=False)
        tconf = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        tcls = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, self.num_classes, requires_grad=False)

        box_loss_scale_x = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        box_loss_scale_y = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        for b in range(bs):
            for t in range(target[b].shape[0]):
                # 計算出在特徵層上的點位
                gx = target[b][t, 0] * in_w
                gy = target[b][t, 1] * in_h
                
                gw = target[b][t, 2] * in_w
                gh = target[b][t, 3] * in_h

                # 計算出屬於哪個網格
                gi = int(gx)
                gj = int(gy)

                # 計算真實框的位置
                gt_box = torch.FloatTensor(np.array([0, 0, gw, gh])).unsqueeze(0)
                
                # 計算出所有先驗框的位置
                anchor_shapes = torch.FloatTensor(np.concatenate((np.zeros((self.num_anchors, 2)),
                                                                  np.array(anchors)), 1))
                # 計算重合程度
                anch_ious = bbox_iou(gt_box, anchor_shapes)
               
                # Find the best matching anchor box
                best_n = np.argmax(anch_ious)
                if best_n not in anchor_index:
                    continue
                # Masks
                if (gj < in_h) and (gi < in_w):
                    best_n = best_n - subtract_index
                    # 判定哪些先驗框內部真實的存在物體
                    noobj_mask[b, best_n, gj, gi] = 0
                    mask[b, best_n, gj, gi] = 1
                    # 計算先驗框中心調整參數
                    tx[b, best_n, gj, gi] = gx
                    ty[b, best_n, gj, gi] = gy
                    # 計算先驗框寬高調整參數
                    tw[b, best_n, gj, gi] = gw
                    th[b, best_n, gj, gi] = gh
                    # 用於獲得xywh的比例
                    box_loss_scale_x[b, best_n, gj, gi] = target[b][t, 2]
                    box_loss_scale_y[b, best_n, gj, gi] = target[b][t, 3]
                    # 物體置信度
                    tconf[b, best_n, gj, gi] = 1
                    # 種類
                    tcls[b, best_n, gj, gi, int(target[b][t, 4])] = 1
                else:
                    print('Step {0} out of bound'.format(b))
                    print('gj: {0}, height: {1} | gi: {2}, width: {3}'.format(gj, in_h, gi, in_w))
                    continue
        t_box[...,0] = tx
        t_box[...,1] = ty
        t_box[...,2] = tw
        t_box[...,3] = th
        return mask, noobj_mask, t_box, tconf, tcls, box_loss_scale_x, box_loss_scale_y

    def get_ignore(self,prediction,target,scaled_anchors,in_w, in_h,noobj_mask):
        bs = len(target)
        anchor_index = [[0,1,2],[3,4,5],[6,7,8]][[13,26,52].index(in_w)]
        scaled_anchors = np.array(scaled_anchors)[anchor_index]
        # 先驗框的中心位置的調整參數
        x = torch.sigmoid(prediction[..., 0])  
        y = torch.sigmoid(prediction[..., 1])
        # 先驗框的寬高調整參數
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height

        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

        # 生成網格，先驗框中心，網格左上角
        grid_x = torch.linspace(0, in_w - 1, in_w).repeat(in_w, 1).repeat(
            int(bs*self.num_anchors/3), 1, 1).view(x.shape).type(FloatTensor)
        grid_y = torch.linspace(0, in_h - 1, in_h).repeat(in_h, 1).t().repeat(
            int(bs*self.num_anchors/3), 1, 1).view(y.shape).type(FloatTensor)

        # 生成先驗框的寬高
        anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
        anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))
        
        anchor_w = anchor_w.repeat(bs, 1).repeat(1, 1, in_h * in_w).view(w.shape)
        anchor_h = anchor_h.repeat(bs, 1).repeat(1, 1, in_h * in_w).view(h.shape)
        
        # 計算調整後的先驗框中心與寬高
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0] = x + grid_x
        pred_boxes[..., 1] = y + grid_y
        pred_boxes[..., 2] = torch.exp(w) * anchor_w
        pred_boxes[..., 3] = torch.exp(h) * anchor_h
        for i in range(bs):
            pred_boxes_for_ignore = pred_boxes[i]
            pred_boxes_for_ignore = pred_boxes_for_ignore.view(-1, 4)

            for t in range(target[i].shape[0]):
                gx = target[i][t, 0] * in_w
                gy = target[i][t, 1] * in_h
                gw = target[i][t, 2] * in_w
                gh = target[i][t, 3] * in_h
                gt_box = torch.FloatTensor(np.array([gx, gy, gw, gh])).unsqueeze(0).type(FloatTensor)

                anch_ious = bbox_iou(gt_box, pred_boxes_for_ignore, x1y1x2y2=False)
                anch_ious = anch_ious.view(pred_boxes[i].size()[:3])
                noobj_mask[i][anch_ious>self.ignore_threshold] = 0
        return noobj_mask, pred_boxes

訓練自己的YOLOV4模型

yolo4整體的文件夾構架如下：

本文使用VOC格式進行訓練。
訓練前將標籤文件放在VOCdevkit文件夾下的VOC2007文件夾下的Annotation中。

訓練前將圖片文件放在VOCdevkit文件夾下的VOC2007文件夾下的JPEGImages中。

在訓練前利用voc2yolo3.py文件生成對應的txt。

再運行根目錄下的voc_annotation.py，運行前需要將classes改成你自己的classes。

classes = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

就會生成對應的2007_train.txt，每一行對應其圖片位置及其真實框的位置。

在訓練前需要修改model_data裏面的voc_classes.txt文件，需要將classes改成你自己的classes。

運行train.py即可開始訓練。

睿智的目標檢測30——Pytorch搭建YoloV4目標檢測平臺