睿智的目標檢測33——Pytorch搭建Efficientdet目標檢測平臺

學習前言

一起來看看Efficientdet的keras實現吧，順便訓練一下自己的數據。

什麼是Efficientdet目標檢測算法

最近，谷歌大腦 Mingxing Tan、Ruoming Pang 和 Quoc V. Le 提出新架構 EfficientDet，結合 EfficientNet（同樣來自該團隊）和新提出的 BiFPN，實現新的 SOTA 結果。

源碼下載

https://github.com/bubbliiiing/efficientdet-pytorch
喜歡的可以點個star噢。

Efficientdet實現思路

一、預測部分

1、主幹網絡介紹

Efficientdet採用Efficientnet作爲主幹特徵提取網絡。EfficientNet-B0對應Efficientdet-D0；EfficientNet-B1對應Efficientdet-D1；以此類推。

EfficientNet模型具有很獨特的特點，這個特點是參考其它優秀神經網絡設計出來的。經典的神經網絡特點如下：
1、利用殘差神經網絡增大神經網絡的深度，通過更深的神經網絡實現特徵提取。
2、改變每一層提取的特徵層數，實現更多層的特徵提取，得到更多的特徵，提升寬度。
3、通過增大輸入圖片的分辨率也可以使得網絡可以學習與表達的東西更加豐富，有利於提高精確度。

EfficientNet就是將這三個特點結合起來，通過一起縮放baseline模型（MobileNet中就通過縮放α實現縮放模型，不同的α有不同的模型精度，α=1時爲baseline模型；ResNet其實也是有一個baseline模型，在baseline的基礎上通過改變圖片的深度實現不同的模型實現），同時調整深度、寬度、輸入圖片的分辨率完成一個優秀的網絡設計。

在EfficientNet模型中，其使用一組固定的縮放係數統一縮放網絡深度、寬度和分辨率。
假設想使用 2^N倍的計算資源，我們可以簡單的對網絡深度擴大α^N倍、寬度擴大β^N 、圖像尺寸擴大γ^N倍，這裏的α,β,γ都是由原來的小模型上做微小的網格搜索決定的常量係數。
如圖爲EfficientNet的設計思路，從三個方面同時拓充網絡的特性。

本博客以Efficientnet-B0和Efficientdet-D0爲例，進行Efficientdet的解析。

Efficientnet-B0由1個Stem+16個大Blocks堆疊構成，16個大Blocks可以分爲1、2、2、3、3、4、1個Block。Block的通用結構如下，其總體的設計思路是Inverted residuals結構和殘差結構，在3x3或者5x5網絡結構前利用1x1卷積升維，在3x3或者5x5網絡結構後增加了一個關於通道的注意力機制，最後利用1x1卷積降維後增加一個大殘差邊。

整體結構如下：

最終獲得三個有效特徵層傳入到BIFPN當中進行下一步的操作。

import torch
from torch import nn
from torch.nn import functional as F

from nets.layers import (
    round_filters,
    round_repeats,
    drop_connect,
    get_same_padding_conv2d,
    get_model_params,
    efficientnet_params,
    load_pretrained_weights,
    Swish,
    MemoryEfficientSwish,
)


class MBConvBlock(nn.Module):
    '''
    EfficientNet-b0:
    [BlockArgs(kernel_size=3, num_repeat=1, input_filters=32, output_filters=16, expand_ratio=1, id_skip=True, stride=[1], se_ratio=0.25), 
     BlockArgs(kernel_size=3, num_repeat=2, input_filters=16, output_filters=24, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=5, num_repeat=2, input_filters=24, output_filters=40, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=3, num_repeat=3, input_filters=40, output_filters=80, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=5, num_repeat=3, input_filters=80, output_filters=112, expand_ratio=6, id_skip=True, stride=[1], se_ratio=0.25), 
     BlockArgs(kernel_size=5, num_repeat=4, input_filters=112, output_filters=192, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=3, num_repeat=1, input_filters=192, output_filters=320, expand_ratio=6, id_skip=True, stride=[1], se_ratio=0.25)]
    
    GlobalParams(batch_norm_momentum=0.99, batch_norm_epsilon=0.001, dropout_rate=0.2, num_classes=1000, width_coefficient=1.0, 
                    depth_coefficient=1.0, depth_divisor=8, min_depth=None, drop_connect_rate=0.2, image_size=224)
    '''
    def __init__(self, block_args, global_params):
        super().__init__()
        self._block_args = block_args
        # 獲得標準化的參數
        self._bn_mom = 1 - global_params.batch_norm_momentum
        self._bn_eps = global_params.batch_norm_epsilon

        # 注意力機制的縮放比例
        self.has_se = (self._block_args.se_ratio is not None) and (
            0 < self._block_args.se_ratio <= 1)
        # 是否需要短接邊
        self.id_skip = block_args.id_skip 

        Conv2d = get_same_padding_conv2d(image_size=global_params.image_size)

        # 1x1卷積通道擴張
        inp = self._block_args.input_filters  # number of input channels
        oup = self._block_args.input_filters * self._block_args.expand_ratio  # number of output channels
        if self._block_args.expand_ratio != 1:
            self._expand_conv = Conv2d(
                in_channels=inp, out_channels=oup, kernel_size=1, bias=False)
            self._bn0 = nn.BatchNorm2d(
                num_features=oup, momentum=self._bn_mom, eps=self._bn_eps)

        # 深度可分離卷積
        k = self._block_args.kernel_size
        s = self._block_args.stride
        self._depthwise_conv = Conv2d(
            in_channels=oup, out_channels=oup, groups=oup,
            kernel_size=k, stride=s, bias=False)
        self._bn1 = nn.BatchNorm2d(
            num_features=oup, momentum=self._bn_mom, eps=self._bn_eps)

        # 注意力機制模塊組，先進行通道數的收縮再進行通道數的擴張
        if self.has_se:
            num_squeezed_channels = max(
                1, int(self._block_args.input_filters * self._block_args.se_ratio))
            self._se_reduce = Conv2d(
                in_channels=oup, out_channels=num_squeezed_channels, kernel_size=1)
            self._se_expand = Conv2d(
                in_channels=num_squeezed_channels, out_channels=oup, kernel_size=1)

        # 輸出部分
        final_oup = self._block_args.output_filters
        self._project_conv = Conv2d(
            in_channels=oup, out_channels=final_oup, kernel_size=1, bias=False)
        self._bn2 = nn.BatchNorm2d(
            num_features=final_oup, momentum=self._bn_mom, eps=self._bn_eps)
        self._swish = MemoryEfficientSwish()

    def forward(self, inputs, drop_connect_rate=None):
        x = inputs
        if self._block_args.expand_ratio != 1:
            x = self._swish(self._bn0(self._expand_conv(inputs)))

        x = self._swish(self._bn1(self._depthwise_conv(x)))

        # 添加了注意力機制
        if self.has_se:
            x_squeezed = F.adaptive_avg_pool2d(x, 1)
            x_squeezed = self._se_expand(
                self._swish(self._se_reduce(x_squeezed)))
            x = torch.sigmoid(x_squeezed) * x

        x = self._bn2(self._project_conv(x))

        # 滿足以下條件纔可以短接
        input_filters, output_filters = self._block_args.input_filters, self._block_args.output_filters
        if self.id_skip and self._block_args.stride == 1 and input_filters == output_filters:
            if drop_connect_rate:
                x = drop_connect(x, p=drop_connect_rate,
                                 training=self.training)
            x = x + inputs  # skip connection
        return x

    def set_swish(self, memory_efficient=True):
        """Sets swish function as memory efficient (for training) or standard (for export)"""
        self._swish = MemoryEfficientSwish() if memory_efficient else Swish()

class EfficientNet(nn.Module):
    '''
    EfficientNet-b0:
    [BlockArgs(kernel_size=3, num_repeat=1, input_filters=32, output_filters=16, expand_ratio=1, id_skip=True, stride=[1], se_ratio=0.25), 
     BlockArgs(kernel_size=3, num_repeat=2, input_filters=16, output_filters=24, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=5, num_repeat=2, input_filters=24, output_filters=40, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=3, num_repeat=3, input_filters=40, output_filters=80, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=5, num_repeat=3, input_filters=80, output_filters=112, expand_ratio=6, id_skip=True, stride=[1], se_ratio=0.25), 
     BlockArgs(kernel_size=5, num_repeat=4, input_filters=112, output_filters=192, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=3, num_repeat=1, input_filters=192, output_filters=320, expand_ratio=6, id_skip=True, stride=[1], se_ratio=0.25)]
    
    GlobalParams(batch_norm_momentum=0.99, batch_norm_epsilon=0.001, dropout_rate=0.2, num_classes=1000, width_coefficient=1.0, 
                    depth_coefficient=1.0, depth_divisor=8, min_depth=None, drop_connect_rate=0.2, image_size=224)
    '''
    def __init__(self, blocks_args=None, global_params=None):
        super().__init__()
        assert isinstance(blocks_args, list), 'blocks_args should be a list'
        assert len(blocks_args) > 0, 'block args must be greater than 0'
        self._global_params = global_params
        self._blocks_args = blocks_args
        # 獲得一種卷積方法
        Conv2d = get_same_padding_conv2d(image_size=global_params.image_size)

        # 獲得標準化的參數
        bn_mom = 1 - self._global_params.batch_norm_momentum
        bn_eps = self._global_params.batch_norm_epsilon

        # 網絡主幹部分開始
        # 設定輸入進來的是RGB三通道圖像
        in_channels = 3  
        # 利用round_filters可以使得通道數在擴張的時候可以被8整除
        out_channels = round_filters(32, self._global_params)

        # 卷積+標準化
        self._conv_stem = Conv2d(
            in_channels, out_channels, kernel_size=3, stride=2, bias=False)
        self._bn0 = nn.BatchNorm2d(
            num_features=out_channels, momentum=bn_mom, eps=bn_eps)

        # 對每個block的參數進行修改
        self._blocks = nn.ModuleList([])
        for i in range(len(self._blocks_args)):
            # 對每個block的參數進行修改，根據所選的efficient版本進行修改
            self._blocks_args[i] = self._blocks_args[i]._replace(
                input_filters=round_filters(
                    self._blocks_args[i].input_filters, self._global_params),
                output_filters=round_filters(
                    self._blocks_args[i].output_filters, self._global_params),
                num_repeat=round_repeats(
                    self._blocks_args[i].num_repeat, self._global_params)
            )

            # 第一次大的Block裏面的卷積需要考慮步長和輸入進來的通道數！
            self._blocks.append(MBConvBlock(self._blocks_args[i], self._global_params))

            if self._blocks_args[i].num_repeat > 1:
                self._blocks_args[i] = self._blocks_args[i]._replace(input_filters=self._blocks_args[i].output_filters, stride=1)
            for _ in range(self._blocks_args[i].num_repeat - 1):
                self._blocks.append(MBConvBlock(self._blocks_args[i], self._global_params))

        # 增加了head部分
        in_channels = self._blocks_args[len(self._blocks_args)-1].output_filters
        out_channels = round_filters(1280, self._global_params)

        # 卷積+標準化
        self._conv_head = Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self._bn1 = nn.BatchNorm2d(num_features=out_channels, momentum=bn_mom, eps=bn_eps)

        # 最後的線性全連接層
        self._avg_pooling = nn.AdaptiveAvgPool2d(1)
        self._dropout = nn.Dropout(self._global_params.dropout_rate)
        self._fc = nn.Linear(out_channels, self._global_params.num_classes)
        # 進行swish激活函數
        self._swish = MemoryEfficientSwish()

    def set_swish(self, memory_efficient=True):
        """Sets swish function as memory efficient (for training) or standard (for export)"""
        # swish函數
        self._swish = MemoryEfficientSwish() if memory_efficient else Swish()
        for block in self._blocks:
            block.set_swish(memory_efficient)

    def extract_features(self, inputs):
        """ Returns output of the final convolution layer """

        # Stem
        x = self._swish(self._bn0(self._conv_stem(inputs)))

        # Blocks
        for idx, block in enumerate(self._blocks):
            drop_connect_rate = self._global_params.drop_connect_rate
            if drop_connect_rate:
                drop_connect_rate *= float(idx) / len(self._blocks)
            x = block(x, drop_connect_rate=drop_connect_rate)
        # Head
        x = self._swish(self._bn1(self._conv_head(x)))

        return x

    def forward(self, inputs):
        """ Calls extract_features to extract features, applies final linear layer, and returns logits. """
        bs = inputs.size(0)
        # Convolution layers
        x = self.extract_features(inputs)

        # Pooling and final linear layer
        x = self._avg_pooling(x)
        x = x.view(bs, -1)
        x = self._dropout(x)
        x = self._fc(x)
        return x

    @classmethod
    def from_name(cls, model_name, override_params=None):
        cls._check_model_name_is_valid(model_name)
        blocks_args, global_params = get_model_params(model_name, override_params)
        return cls(blocks_args, global_params)

    @classmethod
    def from_pretrained(cls, model_name, load_weights=True, advprop=True, num_classes=1000, in_channels=3):
        model = cls.from_name(model_name, override_params={'num_classes': num_classes})
        if load_weights:
            load_pretrained_weights(model, model_name, load_fc=(num_classes == 1000), advprop=advprop)
        if in_channels != 3:
            Conv2d = get_same_padding_conv2d(image_size = model._global_params.image_size)
            out_channels = round_filters(32, model._global_params)
            model._conv_stem = Conv2d(in_channels, out_channels, kernel_size=3, stride=2, bias=False)
        return model

    @classmethod
    def get_image_size(cls, model_name):
        cls._check_model_name_is_valid(model_name)
        _, _, res, _ = efficientnet_params(model_name)
        return res

    @classmethod
    def _check_model_name_is_valid(cls, model_name):
        """ Validates model name. """
        valid_models = ['efficientnet-b'+str(i) for i in range(9)]
        if model_name not in valid_models:
            raise ValueError('model_name should be one of: ' + ', '.join(valid_models))

2、BiFPN加強特徵提取

BiFPN簡單來講是一個加強版本的FPN，上圖是BiFPN，下圖是普通的FPN，大家可以看到，與普通的FPN相比，BiFPN的FPN構建更加複雜，中間還增加了許多連接。

構建BiFPN可以分爲多步：
1、獲得P3_in、P4_in、P5_in、P6_in、P7_in，通過主幹特徵提取網絡，我們已經可以獲得P3、P4、P5，還需要進行兩次下采樣獲得P6、P7。
P3、P4、P5在經過1x1卷積調整通道數後，就可以作爲P3_in、P4_in、P5_in了，在構建BiFPN的第一步，需要構建兩個P4_in、P5_in（原版是這樣設計的）。

實現代碼如下：

p3_in = self.p3_down_channel(p3)
p4_in_1 = self.p4_down_channel(p4)
p5_in_1 = self.p5_down_channel(p5)
p4_in_2 = self.p4_down_channel_2(p4)
p5_in_2 = self.p5_down_channel_2(p5)
p6_in = self.p5_to_p6(p5)
p7_in = self.p6_to_p7(p6_in)

2、在獲得P3_in、P4_in_1、P4_in_2、P5_in_1、P5_in_2、P6_in、P7_in之後需要對P7_in進行上採樣，上採樣後與P6_in堆疊獲得P6_td；之後對P6_td進行上採樣，上採樣後與P5_in_1進行堆疊獲得P5_td；之後對P5_td進行上採樣，上採樣後與P4_in_1進行堆疊獲得P4_td；之後對P4_td進行上採樣，上採樣後與P3_in進行堆疊獲得P3_out。

實現代碼如下：

# 簡單的注意力機制，用於確定更關注p7_in還是p6_in
p6_w1 = self.p6_w1_relu(self.p6_w1)
weight = p6_w1 / (torch.sum(p6_w1, dim=0) + self.epsilon)
p6_up = self.conv6_up(self.swish(weight[0] * p6_in + weight[1] * self.p6_upsample(p7_in)))

# 簡單的注意力機制，用於確定更關注p6_up還是p5_in
p5_w1 = self.p5_w1_relu(self.p5_w1)
weight = p5_w1 / (torch.sum(p5_w1, dim=0) + self.epsilon)
p5_up = self.conv5_up(self.swish(weight[0] * p5_in_1 + weight[1] * self.p5_upsample(p6_up)))

# 簡單的注意力機制，用於確定更關注p5_up還是p4_in
p4_w1 = self.p4_w1_relu(self.p4_w1)
weight = p4_w1 / (torch.sum(p4_w1, dim=0) + self.epsilon)
p4_up = self.conv4_up(self.swish(weight[0] * p4_in_1 + weight[1] * self.p4_upsample(p5_up)))

# 簡單的注意力機制，用於確定更關注p4_up還是p3_in
p3_w1 = self.p3_w1_relu(self.p3_w1)
weight = p3_w1 / (torch.sum(p3_w1, dim=0) + self.epsilon)
p3_out = self.conv3_up(self.swish(weight[0] * p3_in + weight[1] * self.p3_upsample(p4_up)))

3、在獲得P3_out、P4_td、P4_in_2、P5_td、P5_in_2、P6_in、P6_td、P7_in之後，之後需要對P3_out進行下采樣，下采樣後與P4_td、P4_in_2堆疊獲得P4_out；之後對P4_out進行下采樣，下采樣後與P5_td、P5_in_2進行堆疊獲得P5_out；之後對P5_out進行下采樣，下采樣後與P6_in、P6_td進行堆疊獲得P6_out；之後對P6_out進行下采樣，下采樣後與P7_in進行堆疊獲得P7_out。

實現代碼如下：

# 簡單的注意力機制，用於確定更關注p4_in_2還是p4_up還是p3_out
p4_w2 = self.p4_w2_relu(self.p4_w2)
weight = p4_w2 / (torch.sum(p4_w2, dim=0) + self.epsilon)
# Connections for P4_0, P4_1 and P3_2 to P4_2 respectively
p4_out = self.conv4_down(
    self.swish(weight[0] * p4_in_2 + weight[1] * p4_up + weight[2] * self.p4_downsample(p3_out)))

# 簡單的注意力機制，用於確定更關注p5_in_2還是p5_up還是p4_out
p5_w2 = self.p5_w2_relu(self.p5_w2)
weight = p5_w2 / (torch.sum(p5_w2, dim=0) + self.epsilon)
p5_out = self.conv5_down(
    self.swish(weight[0] * p5_in_2 + weight[1] * p5_up + weight[2] * self.p5_downsample(p4_out)))

# 簡單的注意力機制，用於確定更關注p6_in還是p6_up還是p5_out
p6_w2 = self.p6_w2_relu(self.p6_w2)
weight = p6_w2 / (torch.sum(p6_w2, dim=0) + self.epsilon)
p6_out = self.conv6_down(
    self.swish(weight[0] * p6_in + weight[1] * p6_up + weight[2] * self.p6_downsample(p5_out)))

# 簡單的注意力機制，用於確定更關注p7_in還是p7_up還是p6_out
p7_w2 = self.p7_w2_relu(self.p7_w2)
weight = p7_w2 / (torch.sum(p7_w2, dim=0) + self.epsilon)
p7_out = self.conv7_down(self.swish(weight[0] * p7_in + weight[1] * self.p7_downsamp

4、將獲得的P3_out、P4_out、P5_out、P6_out、P7_out作爲P3_in、P4_in、P5_in、P6_in、P7_in，重複2、3步驟進行堆疊即可，對於Effiicientdet B0來講，還需要重複2次，需要注意P4_in_1和P4_in_2此時不需要分開了，P5也是。

實現代碼如下：

p3_in, p4_in, p5_in, p6_in, p7_in = inputs

# 簡單的注意力機制，用於確定更關注p7_in還是p6_in
p6_w1 = self.p6_w1_relu(self.p6_w1)
weight = p6_w1 / (torch.sum(p6_w1, dim=0) + self.epsilon)
p6_up = self.conv6_up(self.swish(weight[0] * p6_in + weight[1] * self.p6_upsample(p7_in)))

# 簡單的注意力機制，用於確定更關注p6_up還是p5_in
p5_w1 = self.p5_w1_relu(self.p5_w1)
weight = p5_w1 / (torch.sum(p5_w1, dim=0) + self.epsilon)
p5_up = self.conv5_up(self.swish(weight[0] * p5_in + weight[1] * self.p5_upsample(p6_up)))

# 簡單的注意力機制，用於確定更關注p5_up還是p4_in
p4_w1 = self.p4_w1_relu(self.p4_w1)
weight = p4_w1 / (torch.sum(p4_w1, dim=0) + self.epsilon)
p4_up = self.conv4_up(self.swish(weight[0] * p4_in + weight[1] * self.p4_upsample(p5_up)))

# 簡單的注意力機制，用於確定更關注p4_up還是p3_in
p3_w1 = self.p3_w1_relu(self.p3_w1)
weight = p3_w1 / (torch.sum(p3_w1, dim=0) + self.epsilon)
p3_out = self.conv3_up(self.swish(weight[0] * p3_in + weight[1] * self.p3_upsample(p4_up)))


# 簡單的注意力機制，用於確定更關注p4_in還是p4_up還是p3_out
p4_w2 = self.p4_w2_relu(self.p4_w2)
weight = p4_w2 / (torch.sum(p4_w2, dim=0) + self.epsilon)
# Connections for P4_0, P4_1 and P3_2 to P4_2 respectively
p4_out = self.conv4_down(
    self.swish(weight[0] * p4_in + weight[1] * p4_up + weight[2] * self.p4_downsample(p3_out)))

# 簡單的注意力機制，用於確定更關注p5_in還是p5_up還是p4_out
p5_w2 = self.p5_w2_relu(self.p5_w2)
weight = p5_w2 / (torch.sum(p5_w2, dim=0) + self.epsilon)
p5_out = self.conv5_down(
    self.swish(weight[0] * p5_in + weight[1] * p5_up + weight[2] * self.p5_downsample(p4_out)))

# 簡單的注意力機制，用於確定更關注p6_in還是p6_up還是p5_out
p6_w2 = self.p6_w2_relu(self.p6_w2)
weight = p6_w2 / (torch.sum(p6_w2, dim=0) + self.epsilon)
p6_out = self.conv6_down(
    self.swish(weight[0] * p6_in + weight[1] * p6_up + weight[2] * self.p6_downsample(p5_out)))

# 簡單的注意力機制，用於確定更關注p7_in還是p7_up還是p6_out
p7_w2 = self.p7_w2_relu(self.p7_w2)
weight = p7_w2 / (torch.sum(p7_w2, dim=0) + self.epsilon)
p7_out = self.conv7_down(self.swish(weight[0] * p7_in + weight[1] * self.p7_downsample(p6_out)))

3、從特徵獲取預測結果

通過第二部的重複運算，我們獲得了P3_out, P4_out, P5_out, P6_out, P7_out。

爲了和普通特徵層區分，我們稱之爲有效特徵層，將這五個有效的特徵層傳輸過ClassNet+BoxNet就可以獲得預測結果了。

對於Efficientdet-B0來講：
ClassNet採用3次64通道的卷積和1次num_priors x num_classes的卷積，num_priors指的是該特徵層所擁有的先驗框數量，num_classes指的是網絡一共對多少類的目標進行檢測。

BoxNet採用3次64通道的卷積和1次num_priors x 4的卷積，num_priors指的是該特徵層所擁有的先驗框數量，4指的是先驗框的調整情況。

需要注意的是，每個特徵層所用的ClassNet是同一個ClassNet；每個特徵層所用的BoxNet是同一個BoxNet。

其中：
num_priors x 4的卷積 用於預測 該特徵層上 每一個網格點上每一個先驗框的變化情況。**

num_priors x num_classes的卷積 用於預測 該特徵層上 每一個網格點上 每一個預測框對應的種類。

實現代碼爲：

class BoxNet(nn.Module):
    def __init__(self, in_channels, num_anchors, num_layers, onnx_export=False):
        super(BoxNet, self).__init__()
        self.num_layers = num_layers

        self.conv_list = nn.ModuleList(
            [SeparableConvBlock(in_channels, in_channels, norm=False, activation=False) for i in range(num_layers)])
        # 每一個有效特徵層對應的Batchnor不同
        self.bn_list = nn.ModuleList(
            [nn.ModuleList([nn.BatchNorm2d(in_channels, momentum=0.01, eps=1e-3) for i in range(num_layers)]) for j in
             range(5)])
        self.header = SeparableConvBlock(in_channels, num_anchors * 4, norm=False, activation=False)
        self.swish = MemoryEfficientSwish() if not onnx_export else Swish()

    def forward(self, inputs):
        feats = []
        # 對每個特徵層循環
        for feat, bn_list in zip(inputs, self.bn_list):
            # 每個特徵層需要進行num_layer次卷積+標準化+激活函數
            for i, bn, conv in zip(range(self.num_layers), bn_list, self.conv_list):
                feat = conv(feat)
                feat = bn(feat)
                feat = self.swish(feat)
            feat = self.header(feat)

            feat = feat.permute(0, 2, 3, 1)
            feat = feat.contiguous().view(feat.shape[0], -1, 4)
            feats.append(feat)
        # 進行一個堆疊
        feats = torch.cat(feats, dim=1)

        return feats


class ClassNet(nn.Module):
    def __init__(self, in_channels, num_anchors, num_classes, num_layers, onnx_export=False):
        super(ClassNet, self).__init__()
        self.num_anchors = num_anchors
        self.num_classes = num_classes
        self.num_layers = num_layers
        self.conv_list = nn.ModuleList(
            [SeparableConvBlock(in_channels, in_channels, norm=False, activation=False) for i in range(num_layers)])
        # 每一個有效特徵層對應的Batchnor不同
        self.bn_list = nn.ModuleList(
            [nn.ModuleList([nn.BatchNorm2d(in_channels, momentum=0.01, eps=1e-3) for i in range(num_layers)]) for j in
             range(5)])
        self.header = SeparableConvBlock(in_channels, num_anchors * num_classes, norm=False, activation=False)
        self.swish = MemoryEfficientSwish() if not onnx_export else Swish()

    def forward(self, inputs):
        feats = []
        # 對每個特徵層循環
        for feat, bn_list in zip(inputs, self.bn_list):
            for i, bn, conv in zip(range(self.num_layers), bn_list, self.conv_list):
                # 每個特徵層需要進行num_layer次卷積+標準化+激活函數
                feat = conv(feat)
                feat = bn(feat)
                feat = self.swish(feat)
            feat = self.header(feat)

            feat = feat.permute(0, 2, 3, 1)
            feat = feat.contiguous().view(feat.shape[0], feat.shape[1], feat.shape[2], self.num_anchors,
                                          self.num_classes)
            feat = feat.contiguous().view(feat.shape[0], -1, self.num_classes)

            feats.append(feat)
        # 進行一個堆疊
        feats = torch.cat(feats, dim=1)
        # 取sigmoid表示概率
        feats = feats.sigmoid()

        return feats

4、預測結果的解碼

我們通過對每一個特徵層的處理，可以獲得三個內容，分別是：

num_priors x 4的卷積 用於預測 該特徵層上 每一個網格點上每一個先驗框的變化情況。**

num_priors x num_classes的卷積 用於預測 該特徵層上 每一個網格點上 每一個預測框對應的種類。

每一個有效特徵層對應的先驗框對應着該特徵層上 每一個網格點上預先設定好的9個框。

我們利用 num_priors x 4的卷積 與 每一個有效特徵層對應的先驗框 獲得框的真實位置。

每一個有效特徵層對應的先驗框就是，如圖所示的作用：
每一個有效特徵層將整個圖片分成與其長寬對應的網格，如P3的特徵層就是將整個圖像分成64x64個網格；然後從每個網格中心建立9個先驗框，一共64x64x9個，36864‬個先驗框。

先驗框雖然可以代表一定的框的位置信息與框的大小信息，但是其是有限的，無法表示任意情況，因此還需要調整，Efficientdet利用3次64通道的卷積+num_priors x 4的卷積的結果對先驗框進行調整。

num_priors x 4中的num_priors表示了這個網格點所包含的先驗框數量，其中的4表示了框的左上角xy軸，右下角xy的調整情況。

Efficientdet解碼過程就是將對應的先驗框的左上角和右下角進行位置的調整，調整完的結果就是預測框的位置了。

當然得到最終的預測結構後還要進行得分排序與非極大抑制篩選這一部分基本上是所有目標檢測通用的部分。
1、取出每一類得分大於confidence_threshold的框和得分。
2、利用框的位置和得分進行非極大抑制。
實現代碼如下：

def bbox_iou(box1, box2, x1y1x2y2=True):
    """
        計算IOU
    """
    if not x1y1x2y2:
        b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
        b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2
        b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
        b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2
    else:
        b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]
        b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]

    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)

    inter_area = torch.clamp(inter_rect_x2 - inter_rect_x1 + 1, min=0) * \
                 torch.clamp(inter_rect_y2 - inter_rect_y1 + 1, min=0)
                 
    b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)

    iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)

    return iou


def non_max_suppression(prediction, num_classes, conf_thres=0.5, nms_thres=0.4):

    output = [None for _ in range(len(prediction))]
    for image_i, image_pred in enumerate(prediction):

        # 獲得種類及其置信度
        class_conf, class_pred = torch.max(image_pred[:, 4:], 1, keepdim=True)

        # 利用置信度進行第一輪篩選
        conf_mask = (class_conf >= conf_thres).squeeze()
        image_pred = image_pred[conf_mask]
        class_conf, class_pred = class_conf[conf_mask], class_pred[conf_mask]

        if not image_pred.size(0):
            continue

        # 獲得的內容爲(x1, y1, x2, y2, class_conf, class_pred)
        detections = torch.cat((image_pred[:, :4], class_conf.float(), class_pred.float()), 1)

        # 獲得種類
        unique_labels = detections[:, -1].cpu().unique()

        if prediction.is_cuda:
            unique_labels = unique_labels.cuda()

        for c in unique_labels:
            # 獲得某一類初步篩選後全部的預測結果
            detections_class = detections[detections[:, -1] == c]
            # 按照存在物體的置信度排序
            _, conf_sort_index = torch.sort(detections_class[:, 4], descending=True)
            detections_class = detections_class[conf_sort_index]
            # 進行非極大抑制
            max_detections = []
            while detections_class.size(0):
                # 取出這一類置信度最高的，一步一步往下判斷，判斷重合程度是否大於nms_thres，如果是則去除掉
                max_detections.append(detections_class[0].unsqueeze(0))
                if len(detections_class) == 1:
                    break
                ious = bbox_iou(max_detections[-1], detections_class[1:])
                detections_class = detections_class[1:][ious < nms_thres]
            # 堆疊
            max_detections = torch.cat(max_detections).data
            # Add max detections to outputs
            output[image_i] = max_detections if output[image_i] is None else torch.cat(
                (output[image_i], max_detections))

    return output

5、在原圖上進行繪製

通過第三步，我們可以獲得預測框在原圖上的位置，而且這些預測框都是經過篩選的。這些篩選後的框可以直接繪製在圖片上，就可以獲得結果了。

二、訓練部分

1、真實框的處理

從預測部分我們知道，每個特徵層的預測結果，num_priors x 4的卷積 用於預測 該特徵層上 每一個網格點上每一個先驗框的變化情況。

也就是說，我們直接利用Efficientdet網絡預測到的結果，並不是預測框在圖片上的真實位置，需要解碼才能得到真實位置。

而在訓練的時候，我們需要計算loss函數，這個loss函數是相對於Efficientdet網絡的預測結果的。我們需要把圖片輸入到當前的Efficientdet網絡中，得到預測結果；同時還需要把真實框的信息，進行編碼，這個編碼是把真實框的位置信息格式轉化爲Efficientdet預測結果的格式信息。

也就是，我們需要找到 每一張用於訓練的圖片的每一個真實框對應的先驗框，並求出如果想要得到這樣一個真實框，我們的預測結果應該是怎麼樣的。

從預測結果獲得真實框的過程被稱作解碼，而從真實框獲得預測結果的過程就是編碼的過程。

因此我們只需要將解碼過程逆過來就是編碼過程了。

編碼過程首先需要找到真實框對應的先驗框，在Efficientdet中採取的辦法是求取真實框和所有先驗框的重合程度IOU，IOU＞0.5以上的則認爲可以利用這些先驗框調整獲得預測框。

而IOU在0.4-0.5之間的，給予忽略。

IOU在0.4以下的作爲負樣本。

實現代碼如下：

def calc_iou(a, b):
    area = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    iw = torch.min(torch.unsqueeze(a[:, 3], dim=1), b[:, 2]) - torch.max(torch.unsqueeze(a[:, 1], 1), b[:, 0])
    ih = torch.min(torch.unsqueeze(a[:, 2], dim=1), b[:, 3]) - torch.max(torch.unsqueeze(a[:, 0], 1), b[:, 1])
    iw = torch.clamp(iw, min=0)
    ih = torch.clamp(ih, min=0)
    ua = torch.unsqueeze((a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1]), dim=1) + area - iw * ih
    ua = torch.clamp(ua, min=1e-8)
    intersection = iw * ih
    IoU = intersection / ua

    return IoU

def get_target(anchor, bbox_annotation, classification):
    IoU = calc_iou(anchor[:, :], bbox_annotation[:, :4])

    IoU_max, IoU_argmax = torch.max(IoU, dim=1)

    # compute the loss for classification
    targets = torch.ones_like(classification) * -1
    if torch.cuda.is_available():
        targets = targets.cuda()

    targets[torch.lt(IoU_max, 0.4), :] = 0

    positive_indices = torch.ge(IoU_max, 0.5)

    num_positive_anchors = positive_indices.sum()

    assigned_annotations = bbox_annotation[IoU_argmax, :]

    targets[positive_indices, :] = 0
    targets[positive_indices, assigned_annotations[positive_indices, 4].long()] = 1
    return targets, num_positive_anchors, positive_indices, assigned_annotations

def encode_bbox(assigned_annotations, positive_indices, anchor_widths, anchor_heights, anchor_ctr_x, anchor_ctr_y):
    assigned_annotations = assigned_annotations[positive_indices, :]

    anchor_widths_pi = anchor_widths[positive_indices]
    anchor_heights_pi = anchor_heights[positive_indices]
    anchor_ctr_x_pi = anchor_ctr_x[positive_indices]
    anchor_ctr_y_pi = anchor_ctr_y[positive_indices]

    gt_widths = assigned_annotations[:, 2] - assigned_annotations[:, 0]
    gt_heights = assigned_annotations[:, 3] - assigned_annotations[:, 1]
    gt_ctr_x = assigned_annotations[:, 0] + 0.5 * gt_widths
    gt_ctr_y = assigned_annotations[:, 1] + 0.5 * gt_heights

    # efficientdet style
    gt_widths = torch.clamp(gt_widths, min=1)
    gt_heights = torch.clamp(gt_heights, min=1)

    targets_dx = (gt_ctr_x - anchor_ctr_x_pi) / anchor_widths_pi
    targets_dy = (gt_ctr_y - anchor_ctr_y_pi) / anchor_heights_pi
    targets_dw = torch.log(gt_widths / anchor_widths_pi)
    targets_dh = torch.log(gt_heights / anchor_heights_pi)

    targets = torch.stack((targets_dy, targets_dx, targets_dh, targets_dw))
    targets = targets.t()
    return targets

2、利用處理完的真實框與對應圖片的預測結果計算loss

loss的計算分爲兩個部分：
1、Smooth Loss：獲取所有正標籤的框的預測結果的迴歸loss。
2、Focal Loss：獲取所有未被忽略的種類的預測結果的交叉熵loss。

由於在Efficientdet的訓練過程中，正負樣本極其不平衡，即存在對應真實框的先驗框可能只有若干個，但是不存在對應真實框的負樣本卻有上萬個，這就會導致負樣本的loss值極大，因此引入了Focal Loss進行正負樣本的平衡，關於Focal Loss的介紹可以看這個博客。
https://blog.csdn.net/weixin_44791964/article/details/102853782

實現代碼如下：

class FocalLoss(nn.Module):
    def __init__(self):
        super(FocalLoss, self).__init__()

    def forward(self, classifications, regressions, anchors, annotations, alpha = 0.25, gamma = 2.0):
        # 設置
        dtype = regressions.dtype
        batch_size = classifications.shape[0]
        classification_losses = []
        regression_losses = []

        # 獲得先驗框，將先驗框轉換成中心寬高的形勢
        anchor = anchors[0, :, :].to(dtype)
        # 轉換成中心，寬高的形式
        anchor_widths = anchor[:, 3] - anchor[:, 1]
        anchor_heights = anchor[:, 2] - anchor[:, 0]
        anchor_ctr_x = anchor[:, 1] + 0.5 * anchor_widths
        anchor_ctr_y = anchor[:, 0] + 0.5 * anchor_heights

        for j in range(batch_size):
            # 取出真實框
            bbox_annotation = annotations[j]

            if bbox_annotation.shape[0] == 0:
                alpha_factor = torch.ones_like(classification) * alpha
                
                if torch.cuda.is_available():
                    alpha_factor = alpha_factor.cuda()
                alpha_factor = 1. - alpha_factor
                focal_weight = classification
                focal_weight = alpha_factor * torch.pow(focal_weight, gamma)
                
                bce = -(torch.log(1.0 - classification))
                
                cls_loss = focal_weight * bce
                
                if torch.cuda.is_available():
                    regression_losses.append(torch.tensor(0).to(dtype).cuda())
                else:
                    regression_losses.append(torch.tensor(0).to(dtype))
                classification_losses.append(cls_loss.sum())

            # 獲得每張圖片的分類結果和迴歸預測結果
            classification = classifications[j, :, :]
            regression = regressions[j, :, :]
            # 平滑標籤
            classification = torch.clamp(classification, 1e-4, 1.0 - 1e-4)
            # 獲得目標預測結果
            targets, num_positive_anchors, positive_indices, assigned_annotations = get_target(anchor, bbox_annotation, classification)
            
            alpha_factor = torch.ones_like(targets) * alpha
            if torch.cuda.is_available():
                alpha_factor = alpha_factor.cuda()

            alpha_factor = torch.where(torch.eq(targets, 1.), alpha_factor, 1. - alpha_factor)
            focal_weight = torch.where(torch.eq(targets, 1.), 1. - classification, classification)
            focal_weight = alpha_factor * torch.pow(focal_weight, gamma)

            bce = -(targets * torch.log(classification) + (1.0 - targets) * torch.log(1.0 - classification))

            cls_loss = focal_weight * bce

            zeros = torch.zeros_like(cls_loss)
            if torch.cuda.is_available():
                zeros = zeros.cuda()
            cls_loss = torch.where(torch.ne(targets, -1.0), cls_loss, zeros)

            classification_losses.append(cls_loss.sum() / torch.clamp(num_positive_anchors.to(dtype), min=1.0))

            if positive_indices.sum() > 0:
                targets = encode_bbox(assigned_annotations, positive_indices, anchor_widths, anchor_heights, anchor_ctr_x, anchor_ctr_y)
               
                regression_diff = torch.abs(targets - regression[positive_indices, :])

                regression_loss = torch.where(
                    torch.le(regression_diff, 1.0 / 9.0),
                    0.5 * 9.0 * torch.pow(regression_diff, 2),
                    regression_diff - 0.5 / 9.0
                )
                regression_losses.append(regression_loss.mean())
            else:
                if torch.cuda.is_available():
                    regression_losses.append(torch.tensor(0).to(dtype).cuda())
                else:
                    regression_losses.append(torch.tensor(0).to(dtype))
        
        loss = torch.stack(classification_losses).mean() + torch.stack(regression_losses).mean()
        return loss

訓練自己的Efficientdet模型

Efficientdet整體的文件夾構架如下：

本文使用VOC格式進行訓練。
訓練前將標籤文件放在VOCdevkit文件夾下的VOC2007文件夾下的Annotation中。

訓練前將圖片文件放在VOCdevkit文件夾下的VOC2007文件夾下的JPEGImages中。

在訓練前利用voc2efficientdet.py文件生成對應的txt。

再運行根目錄下的voc_annotation.py，運行前需要將classes改成你自己的classes。

classes = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

就會生成對應的2007_train.txt，每一行對應其圖片位置及其真實框的位置。

在訓練前需要修改model_data裏面的voc_classes.txt文件，需要將classes改成你自己的classes。

運行train.py即可開始訓練。

修改train.py文件下的phi可以修改efficientdet的版本，訓練前注意權重文件與Efficientdet版本的對齊。

睿智的目標檢測36——Pytorch搭建Efficientdet目標檢測平臺