Object Detection Paper Notes: M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network

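This AAAI 2019 paper proposes a new feature pyramid network, MLFPN, and a detector built on it that achieves strong results on the COCO dataset. This post walks through the model in the order an image passes through it during detection, with selected code along the way.
Full code: M2Det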

Why propose a new feature pyramid architecture?

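The performance of an object detection framework depends heavily on how well it extracts features. To extract features thoroughly while also handling scale variation (objects of the same class appear at different sizes depending on their distance from the camera, and detection quality varies accordingly), researchers developed two kinds of pyramid. The first is the image pyramid: the input image is rescaled to several sizes and detection is run at each scale, which is computationally expensive and slow. The second, which most detectors now favor, is the feature pyramid network (FPN), of which there are many variants. M2Det is itself an FPN variant, as shown in the figure below.
[Figure: feature pyramid designs compared, panels (a) to (d)]
Panel (d) is the architecture proposed in the paper. It looks complicated at first glance, but most of it is the same module repeated.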

BackBone Network

The paper uses two families of backbone, VGG and ResNet. They are not covered further here.

MLFPN

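MLFPN is the feature pyramid network proposed in the paper. It consists of three main parts: TUM (Thinned U-shape Module), FFM (Feature Fusion Module), and SFAM (Scale-wise Feature Aggregation Module). The overall structure is shown below.
[Figure: MLFPN structure]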

TUM

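Each TUM follows the FPN design, and the network uses 8 TUMs in total. The structure of a single TUM is shown below:
[Figure: structure of a single TUM]
The input is a tensor of shape (256, 40, 40). It passes through a series of downsampling convolutions, is then upsampled back step by step, and finally goes through 1×1 convolutions (which, according to the paper, make the features smoother). Each TUM therefore produces 6 feature maps at different scales: the smaller the map, the more deep (semantic) information it carries; the larger the map, the more shallow (low-level) information it carries. The largest output, of shape (128, 40, 40), is also passed through FFM to form part of the input to the next TUM.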

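import torch
import torch.nn as nn
import torch.nn.functional as F

# BasicConv is defined elsewhere in the M2Det repo and is not shown in this post.
# Its positional arguments are assumed to be
# (in_planes, out_planes, kernel_size, stride, padding).
class TUM(nn.Module):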
    def __init__(self, first_level=True, input_planes=128, is_smooth=True, side_channel=512, scales=6):
        super(TUM, self).__init__()
        self.is_smooth = is_smooth
        self.side_channel = side_channel
        self.input_planes = input_planes
        self.planes = 2 * self.input_planes
        self.first_level = first_level
        self.scales = scales
        self.in1 = input_planes + side_channel if not first_level else input_planes

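        # Encoder: scales - 1 convs in total. All but the last use stride 2 to
        # downsample; the last uses a 3x3 kernel with stride 1 and no padding.
        self.layers = nn.Sequential()
        self.layers.add_module('{}'.format(len(self.layers)), BasicConv(self.in1, self.planes, 3, 2, 1))
        for i in range(self.scales-2):
            if i != self.scales - 3: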
                self.layers.add_module(
                        '{}'.format(len(self.layers)),
                        BasicConv(self.planes, self.planes, 3, 2, 1)
                        )
            else:
                self.layers.add_module(
                        '{}'.format(len(self.layers)),
                        BasicConv(self.planes, self.planes, 3, 1, 0)
                        )
        self.toplayer = nn.Sequential(BasicConv(self.planes, self.planes, 1, 1, 0))
        
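        # Lateral 3x3 convs applied to encoder features before they are
        # added to the upsampled decoder features.
        self.latlayer = nn.Sequential()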
        for i in range(self.scales-2):
            self.latlayer.add_module(
                    '{}'.format(len(self.latlayer)),
                    BasicConv(self.planes, self.planes, 3, 1, 1)
                    )
        self.latlayer.add_module('{}'.format(len(self.latlayer)),BasicConv(self.in1, self.planes, 3, 1, 1))

        # Optional 1x1 smoothing convs on every decoder output except the smallest.
        if self.is_smooth:
            smooth = list()
            for i in range(self.scales-1):
                smooth.append(
                        BasicConv(self.planes, self.planes, 1, 1, 0)
                        )
            self.smooth = nn.Sequential(*smooth)

    def _upsample_add(self, x, y, fuse_type='interp'):
        """Upsample x to the spatial size of y (nearest neighbour) and add them."""
        _,_,H,W = y.size()
        if fuse_type=='interp':
            return F.interpolate(x, size=(H,W), mode='nearest') + y
        else:
            # Only interpolation is implemented; a transposed-conv variant is not.
            raise NotImplementedError

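    def forward(self, x, y):
        # Every TUM after the first also receives y, concatenated along channels.
        if not self.first_level:
            x = torch.cat([x,y],1)
        # Encoder pass: keep the input and every intermediate feature map.
        conved_feat = [x]
        for i in range(len(self.layers)):
            x = self.layers[i](x)
            conved_feat.append(x)

        # Decoder pass: start from the smallest map and fuse upwards.
        deconved_feat = [self.toplayer[0](conved_feat[-1])]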
        for i in range(len(self.latlayer)):
            deconved_feat.append(
                    self._upsample_add(
                        deconved_feat[i], self.latlayer[i](conved_feat[len(self.layers)-1-i])
                        )
                    )
        if self.is_smooth:
            smoothed_feat = [deconved_feat[0]]
            for i in range(len(self.smooth)):
                smoothed_feat.append(
                        self.smooth[i](deconved_feat[i+1])
                        )
            return smoothed_feat
        return deconved_feat
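
As a quick check on the shapes, the encoder half of the TUM code above can be traced with the standard convolution output-size formula. The sketch below is plain Python and not part of the M2Det code. It assumes that the last three BasicConv arguments are (kernel_size, stride, padding) and that the input is 40×40 as stated in the text; `conv_out_size` and `tum_encoder_sizes` are hypothetical helper names.

```python
def conv_out_size(n, kernel, stride, padding):
    """Spatial size after a conv layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


def tum_encoder_sizes(input_size=40, scales=6):
    """Mirror the loop that builds `self.layers` in TUM.__init__."""
    # (kernel, stride, padding) per layer, read off the BasicConv calls
    configs = [(3, 2, 1)]                  # first down-sampling conv
    for i in range(scales - 2):
        if i != scales - 3:
            configs.append((3, 2, 1))      # stride-2 conv with padding 1
        else:
            configs.append((3, 1, 0))      # last conv: stride 1, no padding
    sizes = [input_size]
    for k, s, p in configs:
        sizes.append(conv_out_size(sizes[-1], k, s, p))
    return sizes


sizes = tum_encoder_sizes()
print(sizes)  # [40, 20, 10, 5, 3, 1]
```

Under these assumptions the encoder yields six spatial sizes, 40, 20, 10, 5, 3 and 1, which is where the six output scales of each TUM come from.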

FFM

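[Figure: (a) FFMv1, (b) FFMv2]
There are two kinds of FFM: FFMv1 (panel a) and FFMv2 (panel b). FFMv1 concatenates features from the last two layers of the backbone; the deeper one must be upsampled first so that both have the same spatial size. FFMv2 concatenates the output of FFMv1 with the output of the previous TUM.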
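
The two fusion steps can be sketched with NumPy arrays to make the shapes concrete. This is only an illustration of "upsample, then concatenate along channels", not the M2Det implementation: any convolutions that the real modules apply before concatenating are left out, the function names are hypothetical, and the channel counts are assumptions (the 128 + 128 split is chosen only so that the result matches the 256-channel TUM input quoted in the TUM section).

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def ffm_v1(shallow, deep):
    """Concatenate a shallow map with an upsampled deeper map along channels."""
    factor = shallow.shape[1] // deep.shape[1]
    return np.concatenate([shallow, upsample_nearest(deep, factor)], axis=0)


def ffm_v2(base, prev_tum_largest):
    """Concatenate the base feature with the previous TUM's largest output."""
    return np.concatenate([base, prev_tum_largest], axis=0)


# Illustrative shapes only; the channel counts below are assumptions.
shallow = np.zeros((256, 40, 40), dtype=np.float32)
deep = np.ones((512, 20, 20), dtype=np.float32)
base = ffm_v1(shallow, deep)
print(base.shape)    # (768, 40, 40)

base_reduced = np.zeros((128, 40, 40), dtype=np.float32)  # assumed channel reduction
prev_out = np.ones((128, 40, 40), dtype=np.float32)       # largest map of previous TUM
fused = ffm_v2(base_reduced, prev_out)
print(fused.shape)   # (256, 40, 40)
```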

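SFAM

At this point we have 8 sets of 128-channel feature maps, one set per TUM, each with six spatial sizes (1, 3, 5, 10, 20, 40). SFAM aggregates the multi-level, multi-scale features produced by the TUMs into a multi-level feature pyramid. First, feature maps of the same spatial size are concatenated along the channel axis, so each concatenated map has shape n×n×1024 (128 × 8 = 1024) and contains features from every depth. Each of the 6 concatenated maps is then squeezed by global average pooling into a 1×1×1024 descriptor, which passes through two 1×1 convolutions (acting as fully connected layers) that learn per-channel weights. These weights rescale the channels, letting the network emphasize the features best suited to detection at each scale.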

import torch.nn as nn

class SFAM(nn.Module):
    def __init__(self, planes, num_levels, num_scales, compress_ratio=16):
        super(SFAM, self).__init__()
        self.planes = planes
        self.num_levels = num_levels
        self.num_scales = num_scales
        self.compress_ratio = compress_ratio

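        channels = self.planes * self.num_levels
        # Build one conv per scale with a comprehension: `[module] * n` would
        # repeat the same module object, so all scales would share one set of weights.
        self.fc1 = nn.ModuleList([nn.Conv2d(channels, channels // self.compress_ratio, 1, 1, 0)
                                  for _ in range(self.num_scales)])
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.ModuleList([nn.Conv2d(channels // self.compress_ratio, channels, 1, 1, 0)
                                  for _ in range(self.num_scales)])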
        self.sigmoid = nn.Sigmoid()
        self.avgpool = nn.AdaptiveAvgPool2d(1)

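    def forward(self, x):
        # x: list of num_scales tensors, each with planes * num_levels channels
        # (i.e. already concatenated across levels).
        attention_feat = []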
        for i, _mf in enumerate(x):
            _tmp_f = self.avgpool(_mf)
            _tmp_f = self.fc1[i](_tmp_f)
            _tmp_f = self.relu(_tmp_f)
            _tmp_f = self.fc2[i](_tmp_f)
            _tmp_f = self.sigmoid(_tmp_f)
            attention_feat.append(_mf*_tmp_f)
        return attention_feat
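
To make the two SFAM steps concrete (concatenate same-size maps across levels, then reweight channels), here is a small NumPy sketch. It is an illustration under assumptions rather than the repository code: the two 1×1 convolutions are written as plain matrix products without biases, the weights are random, and the function name `sfam` and its argument layout are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sfam(pyramids, w1, w2):
    """pyramids: list over levels, each a list over scales of (C, H, W) arrays.
    w1, w2: per-scale weight matrices standing in for the two 1x1 convs."""
    num_scales = len(pyramids[0])
    out = []
    for s in range(num_scales):
        # 1) concatenate same-size maps from every level along channels
        x = np.concatenate([p[s] for p in pyramids], axis=0)
        # 2) squeeze: global average pooling gives one value per channel
        z = x.mean(axis=(1, 2))
        # 3) excite: FC -> ReLU -> FC -> sigmoid gives per-channel weights
        a = sigmoid(w2[s] @ np.maximum(w1[s] @ z, 0.0))
        # 4) rescale every channel by its weight
        out.append(x * a[:, None, None])
    return out


rng = np.random.default_rng(0)
levels, channels, sizes, ratio = 8, 128, [40, 20, 10, 5, 3, 1], 16
pyramids = [[rng.standard_normal((channels, n, n)) for n in sizes]
            for _ in range(levels)]
c = levels * channels  # 8 * 128 = 1024
w1 = [rng.standard_normal((c // ratio, c)) * 0.01 for _ in sizes]
w2 = [rng.standard_normal((c, c // ratio)) * 0.01 for _ in sizes]
feats = sfam(pyramids, w1, w2)
print([f.shape for f in feats])
```

Each of the six outputs keeps its spatial size and has 1024 channels, which is the n×n×1024 shape described in the text.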

Detection stage

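In the detection stage, two convolutional layers are attached to each feature map, one for box regression and one for classification. Each pixel location gets 6 anchors with three aspect ratios, and the detection scale ranges follow SSD. A score threshold of 0.05 filters out most low-scoring anchors, and soft-NMS is then applied as post-processing to keep the more accurate boxes. Lowering the threshold to 0.01 gives better detection results but slows inference.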
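
For readers unfamiliar with soft-NMS, the following is a minimal pure-Python sketch of the Gaussian variant: instead of deleting boxes that overlap the current best box, it decays their scores and drops them only when they fall below the threshold. It is a generic illustration, not the post-processing code used by M2Det; the 0.05 threshold comes from the text above, while `sigma=0.5` and the function names are assumptions.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.05):
    """Gaussian soft-NMS; returns (box, score) pairs in selection order."""
    # Drop low-scoring candidates up front
    cand = [(s, b) for s, b in zip(scores, boxes) if s >= score_thresh]
    keep = []
    while cand:
        # Select the highest-scoring remaining box
        best = max(range(len(cand)), key=lambda i: cand[i][0])
        best_score, best_box = cand.pop(best)
        keep.append((best_box, best_score))
        # Decay the scores of the others according to their overlap with it
        decayed = []
        for s, b in cand:
            s = s * math.exp(-iou(best_box, b) ** 2 / sigma)
            if s >= score_thresh:
                decayed.append((s, b))
        cand = decayed
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.03]
for box, score in soft_nms(boxes, scores):
    print(box, round(score, 3))
```

In this example the last box is filtered out by the threshold, and the second box survives with a reduced score because it overlaps heavily with the first.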
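A TensorFlow implementation of focal loss (together with the box loss and an alternative classification loss based on hard negative mining):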

# Written against the TensorFlow 1.x API (tf.log, tf.to_float, tf.to_int32).
import tensorflow as tf

def calc_focal_loss(cls_outputs, cls_targets, alpha=0.25, gamma=2.0):
    """
    Args:
        cls_outputs: [batch_size, num_anchors, num_classes]
        cls_targets: [batch_size, num_anchors, num_classes]
    Returns:
        cls_loss: [batch_size]
    Compute focal loss:
        FL = -alpha_t * (1 - pt)^gamma * log(pt), where pt = p if y == 1 else 1 - p
        and alpha_t = alpha if y == 1 else 1 - alpha
        cf. https://arxiv.org/pdf/1708.02002.pdf
    """
    positive_mask = tf.equal(cls_targets, 1.0)
    pos = tf.where(positive_mask, 1.0 - cls_outputs, tf.zeros_like(cls_outputs))
    neg = tf.where(positive_mask, tf.zeros_like(cls_outputs), cls_outputs)
    pos_loss = - alpha * tf.pow(pos, gamma) * tf.log(tf.clip_by_value(cls_outputs, 1e-15, 1.0))
    neg_loss = - (1 - alpha) * tf.pow(neg, gamma) * tf.log(tf.clip_by_value(1.0 - cls_outputs, 1e-15, 1.0))
    loss = tf.reduce_sum(pos_loss + neg_loss, axis=[1, 2])
    return loss
    
def calc_cls_loss(cls_outputs, cls_targets, positive_flag):
    batch_size = tf.shape(cls_outputs)[0]
    num_anchors = tf.to_float(tf.shape(cls_outputs)[1])
    num_positives = tf.reduce_sum(positive_flag, axis=-1) # shape: [batch_size,]
    num_negatives = tf.minimum(3 * num_positives, num_anchors - num_positives) # neg_pos_ratio is 3
    negative_mask = tf.greater(num_negatives, 0)

    cls_outputs = tf.clip_by_value(cls_outputs, 1e-15, 1 - 1e-15)
    conf_loss = -tf.reduce_sum(cls_targets * tf.log(cls_outputs), axis=-1)
    pos_conf_loss = tf.reduce_sum(conf_loss * positive_flag, axis=1) 
    
    has_min = tf.to_float(tf.reduce_any(negative_mask)) # would be 0.0 if ALL num_neg are 0
    num_neg = tf.concat(axis=0, values=[num_negatives, [(1 - has_min) * 100]])
    # minimum value under the condition the value > 0
    # use num_neg (which carries the fallback entry) so the mask is never empty
    num_neg_batch = tf.reduce_min(tf.boolean_mask(num_neg, tf.greater(num_neg, 0)))
    num_neg_batch = tf.to_int32(num_neg_batch)
    max_confs = tf.reduce_max(cls_outputs[:, :, 1:], axis=2) # except background class
    _, indices = tf.nn.top_k(max_confs * (1 - positive_flag), k=num_neg_batch)
    batch_idx = tf.expand_dims(tf.range(0, batch_size), 1)
    batch_idx = tf.tile(batch_idx, (1, num_neg_batch))
    full_indices = (tf.reshape(batch_idx, [-1]) * tf.to_int32(num_anchors) + tf.reshape(indices, [-1]))
    neg_conf_loss = tf.gather(tf.reshape(conf_loss, [-1]), full_indices)
    neg_conf_loss = tf.reshape(neg_conf_loss, [batch_size, num_neg_batch])
    neg_conf_loss = tf.reduce_sum(neg_conf_loss, axis=1)

    cls_loss = pos_conf_loss + neg_conf_loss
    cls_loss /= (num_positives + tf.to_float(num_neg_batch))
    return cls_loss
    
def calc_box_loss(box_outputs, box_targets, positive_flag, delta=0.1):
    num_positives = tf.reduce_sum(positive_flag, axis=-1) # shape: [batch_size,]
    normalizer = num_positives * 4
    normalizer = tf.where(tf.not_equal(normalizer, 0), normalizer, tf.ones_like(normalizer)) # to avoid division by 0

    loss_scale = 2.0 - box_targets[:, :, 2:3] * box_targets[:, :, 3:4]

    sq_loss = 0.5 * (box_targets - box_outputs) ** 2
    abs_loss = 0.5 * delta ** 2 + delta * (tf.abs(box_outputs - box_targets) - delta)
    l1_loss = tf.where(tf.less(tf.abs(box_outputs - box_targets), delta), sq_loss, abs_loss)

    box_loss = tf.reduce_sum(l1_loss, axis=-1, keepdims=True)
    box_loss = box_loss * loss_scale
    box_loss = tf.reduce_sum(box_loss, axis=-1)
    box_loss = tf.reduce_sum(box_loss * positive_flag, axis=-1)
    box_loss = box_loss / normalizer

    return box_loss

def calc_loss(y_true, y_pred, box_loss_weight):
    """
    Args:
        y_true: [batch_size, num_anchors, 4 + num_classes + 1]
        y_pred: [batch_size, num_anchors, 4 + num_classes]
            num_classes is including the back-ground class
            last element of y_true denotes if the box is positive or negative:
    Returns:
        total_loss:
    cf. https://github.com/tensorflow/tpu/blob/master/models/official/retinanet/retinanet_model.py
    """
    
    box_outputs = y_pred[:, :, :4]
    box_targets = y_true[:, :, :4]
    cls_outputs = y_pred[:, :, 4:]
    cls_targets = y_true[:, :, 4:-1]
    positive_flag = y_true[:, :, -1]
    num_positives = tf.reduce_sum(positive_flag, axis=-1) # shape: [batch_size,]

    box_loss = calc_box_loss(box_outputs, box_targets, positive_flag)
    ##cls_loss = calc_cls_loss(cls_outputs, cls_targets, positive_flag)
    cls_loss = calc_focal_loss(cls_outputs, cls_targets)

    total_loss = cls_loss + box_loss_weight * box_loss

    return tf.reduce_mean(total_loss)
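
To see what the focal term in `calc_focal_loss` does numerically, the per-element formula can be evaluated by hand for a few probabilities. The sketch below uses only the standard library and the same defaults (alpha=0.25, gamma=2.0); `focal_term` and `cross_entropy` are hypothetical helper names, and the probabilities are made-up inputs.

```python
import math


def focal_term(p, y, alpha=0.25, gamma=2.0):
    """Per-element focal loss for predicted probability p and binary target y."""
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


def cross_entropy(p, y):
    """Plain binary cross-entropy for comparison."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


# Positive target predicted with high, medium and low confidence
for p in (0.9, 0.5, 0.1):
    print(p, round(cross_entropy(p, 1), 4), round(focal_term(p, 1), 6))
```

The well-classified case (p = 0.9) is scaled by (1 - 0.9)^2 = 0.01, while the badly classified case (p = 0.1) is scaled by 0.81, so easy examples contribute far less to the total than they would under plain cross-entropy.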

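Summary:

The paper focuses on improving the network structure, but in my view 8 TUMs are too computationally expensive. Using fewer TUMs, and giving each TUM a learnable weight, might achieve better results.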
