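This AAAI 2019 paper proposes a new feature pyramid network, MLFPN (Multi-Level Feature Pyramid Network). The detector built on it, M2Det, achieves strong results on the COCO dataset. This post walks through the model in the order an image passes through the detection pipeline, with some code included.
Code: M2Det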
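Why propose a new feature pyramid architecture
The performance of an object detection framework depends heavily on how well it extracts features. To extract features thoroughly while also handling scale variation (objects of the same class appear at different sizes depending on their distance from the camera, and detection quality varies accordingly), researchers developed two kinds of pyramid. The first is the image pyramid: the input image is rescaled and detection is run at several scales, but this is computationally expensive and slow. The second, now generally preferred, is the feature pyramid network (FPN), which has spawned many variants. M2Det is itself an FPN variant, as shown in the figure below.
Panel (d) of the figure is the architecture proposed in the paper. It looks complicated at first glance, but it consists mostly of repeated modules.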
BackBone Network
The paper uses two families of backbone, VGG and ResNet. They are not covered further here.
MLFPN
MLFPN is the pyramid network proposed in the paper. It consists of three parts: TUM (Thinned U-shape Module), FFM (Feature Fusion Module), and SFAM (Scale-wise Feature Aggregation Module). The structure is shown below.
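TUM
Each TUM follows an FPN-style encoder-decoder design, and eight TUMs are used in total. The structure of a single TUM is as follows:
The input is a tensor of shape (256, 40, 40). It goes through a series of downsampling steps, then upsampling, then a 1×1 convolution (which the paper says improves smoothness), so each TUM produces six feature maps at different scales. The smaller the feature map, the more deep (semantic) information it carries; the larger the map, the more shallow information it carries. The (128, 40, 40) output is also passed through FFM to form part of the input to the next TUM.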
import torch
import torch.nn as nn
import torch.nn.functional as F

# BasicConv(in_planes, out_planes, kernel_size, stride, padding) is a convolution
# helper defined elsewhere in the M2Det repository; it is not shown here.

class TUM(nn.Module):
    def __init__(self, first_level=True, input_planes=128, is_smooth=True, side_channel=512, scales=6):
        super(TUM, self).__init__()
        self.is_smooth = is_smooth
        self.side_channel = side_channel
        self.input_planes = input_planes
        self.planes = 2 * self.input_planes
        self.first_level = first_level
        self.scales = scales
        # every TUM except the first also receives the previous TUM's output
        self.in1 = input_planes + side_channel if not first_level else input_planes

        # encoder: stride-2 3x3 convs, then one unpadded 3x3 conv for the smallest scale
        self.layers = nn.Sequential()
        self.layers.add_module('{}'.format(len(self.layers)), BasicConv(self.in1, self.planes, 3, 2, 1))
        for i in range(self.scales-2):
            if not i == self.scales - 3:
                self.layers.add_module(
                    '{}'.format(len(self.layers)),
                    BasicConv(self.planes, self.planes, 3, 2, 1)
                )
            else:
                self.layers.add_module(
                    '{}'.format(len(self.layers)),
                    BasicConv(self.planes, self.planes, 3, 1, 0)
                )
        self.toplayer = nn.Sequential(BasicConv(self.planes, self.planes, 1, 1, 0))

        # lateral 3x3 convs applied to encoder features before they are added in the decoder
        self.latlayer = nn.Sequential()
        for i in range(self.scales-2):
            self.latlayer.add_module(
                '{}'.format(len(self.latlayer)),
                BasicConv(self.planes, self.planes, 3, 1, 1)
            )
        self.latlayer.add_module('{}'.format(len(self.latlayer)),BasicConv(self.in1, self.planes, 3, 1, 1))

        # optional 1x1 smoothing convs on the decoder outputs
        if self.is_smooth:
            smooth = list()
            for i in range(self.scales-1):
                smooth.append(
                    BasicConv(self.planes, self.planes, 1, 1, 0)
                )
            self.smooth = nn.Sequential(*smooth)

    def _upsample_add(self, x, y, fuse_type='interp'):
        # upsample x to the spatial size of y, then add element-wise
        _,_,H,W = y.size()
        if fuse_type=='interp':
            return F.interpolate(x, size=(H,W), mode='nearest') + y
        else:
            raise NotImplementedError
            #return nn.ConvTranspose2d(16, 16, 3, stride=2, padding=1)

    def forward(self, x, y):
        if not self.first_level:
            x = torch.cat([x,y],1)
        # encoder pass: keep the input and every downsampled feature map
        conved_feat = [x]
        for i in range(len(self.layers)):
            x = self.layers[i](x)
            conved_feat.append(x)
        # decoder pass: start from the smallest map and fuse upwards
        deconved_feat = [self.toplayer[0](conved_feat[-1])]
        for i in range(len(self.latlayer)):
            deconved_feat.append(
                self._upsample_add(
                    deconved_feat[i], self.latlayer[i](conved_feat[len(self.layers)-1-i])
                )
            )
        if self.is_smooth:
            smoothed_feat = [deconved_feat[0]]
            for i in range(len(self.smooth)):
                smoothed_feat.append(
                    self.smooth[i](deconved_feat[i+1])
                )
            return smoothed_feat
        return deconved_feat
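The six scales can be checked by hand with the standard convolution output-size formula. The snippet below is only a sanity check, not part of the M2Det code: it assumes a 40×40 input and reuses the kernel, stride, and padding values that the encoder above passes to BasicConv. The helper names `conv_out` and `tum_encoder_sizes` are made up for this illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def tum_encoder_sizes(size=40, scales=6):
    """Feature-map sizes in the TUM encoder, starting from the input size."""
    sizes = [size]
    # scales - 2 convolutions with kernel 3, stride 2, padding 1
    for _ in range(scales - 2):
        size = conv_out(size, kernel=3, stride=2, padding=1)
        sizes.append(size)
    # last convolution: kernel 3, stride 1, no padding
    size = conv_out(size, kernel=3, stride=1, padding=0)
    sizes.append(size)
    return sizes


print(tum_encoder_sizes())  # [40, 20, 10, 5, 3, 1]
```

The final unpadded 3×3 convolution is what turns the 3×3 map into the 1×1 map at the top of the pyramid.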
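FFM
FFM comes in two versions, FFMv1 (figure a) and FFMv2 (figure b). FFMv1 concatenates the features of the last two backbone layers; note that the last (deeper) feature map must be upsampled first so that both have the same spatial size. FFMv2 concatenates the output of FFMv1 with the output of the previous TUM.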
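To make the "upsample, then concat" step concrete, here is a toy sketch in plain Python on nested lists laid out as [channel][row][column]. It is a simplification under stated assumptions: the channel-reducing convolutions that the real FFM applies are left out, the tiny inputs are invented, and `upsample_nearest` and `concat_channels` are illustrative names rather than functions from the M2Det repository.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a [C][H][W] nested list by an integer factor."""
    return [
        [[v for v in row for _ in range(factor)]      # repeat every column
         for row in channel for _ in range(factor)]   # repeat every row
        for channel in fmap
    ]


def concat_channels(a, b):
    """Concatenate two [C][H][W] maps along the channel axis."""
    if len(a[0]) != len(b[0]) or len(a[0][0]) != len(b[0][0]):
        raise ValueError("spatial sizes must match before concatenation")
    return a + b


shallow = [[[1, 2], [3, 4]]]   # 1 channel, 2x2: stands in for the shallower layer
deep = [[[9]], [[7]]]          # 2 channels, 1x1: stands in for the deeper layer

fused = concat_channels(shallow, upsample_nearest(deep, 2))
print(len(fused), len(fused[0]), len(fused[0][0]))  # 3 2 2
print(fused[1])                                     # [[9, 9], [9, 9]]
```

Without the upsampling step the two maps have different spatial sizes and cannot be stacked along the channel axis, which is why the deeper map is resized first.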
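SFAM
At this point we have the outputs of 8 TUMs, each consisting of 128-channel feature maps at six scales (1×1, 3×3, 5×5, 10×10, 20×20, 40×40). The goal of SFAM is to aggregate the multi-level, multi-scale features produced by the TUMs into a multi-level feature pyramid. First, feature maps of the same spatial size are concatenated along the channel dimension, so each concatenated map has shape n×n×1024 (128 × 8 = 1024) and contains features from every depth level. Next, each of the six maps is squeezed to 1×1×1024 by global average pooling and passed through two fully connected layers (implemented as 1×1 convolutions) that learn channel-wise weights, so that the network can emphasize the channels best suited to detection at each scale.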
import torch.nn as nn

class SFAM(nn.Module):
    def __init__(self, planes, num_levels, num_scales, compress_ratio=16):
        super(SFAM, self).__init__()
        self.planes = planes
        self.num_levels = num_levels
        self.num_scales = num_scales
        self.compress_ratio = compress_ratio

        # One pair of 1x1 convs per scale. A list comprehension is used because
        # `[module] * n` would repeat a single module and share its weights.
        self.fc1 = nn.ModuleList([
            nn.Conv2d(self.planes*self.num_levels,
                      self.planes*self.num_levels // self.compress_ratio,
                      1, 1, 0)
            for _ in range(self.num_scales)])
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.ModuleList([
            nn.Conv2d(self.planes*self.num_levels // self.compress_ratio,
                      self.planes*self.num_levels,
                      1, 1, 0)
            for _ in range(self.num_scales)])
        self.sigmoid = nn.Sigmoid()
        self.avgpool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        # x: list of concatenated feature maps, one per scale
        attention_feat = []
        for i, _mf in enumerate(x):
            _tmp_f = self.avgpool(_mf)       # squeeze to 1x1 per channel
            _tmp_f = self.fc1[i](_tmp_f)
            _tmp_f = self.relu(_tmp_f)
            _tmp_f = self.fc2[i](_tmp_f)
            _tmp_f = self.sigmoid(_tmp_f)    # per-channel weights in (0, 1)
            attention_feat.append(_mf*_tmp_f)
        return attention_feat
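The squeeze, two fully connected layers, and channel reweighting can be traced by hand on a tiny example. The following plain-Python sketch mirrors the steps of the forward pass above for a single scale; the 2-channel feature map and the weight matrices `w1` and `w2` are hand-picked for illustration and are not learned or taken from the paper.

```python
import math


def global_avg_pool(fmap):
    """[C][H][W] -> [C]: mean value of each channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def linear(vec, weights):
    """Matrix-vector product; weights has shape [out][in] (no bias, for brevity)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def channel_attention(fmap, w1, w2):
    z = global_avg_pool(fmap)                                         # squeeze
    hidden = [max(0.0, h) for h in linear(z, w1)]                     # FC + ReLU
    gates = [1.0 / (1.0 + math.exp(-s)) for s in linear(hidden, w2)]  # FC + sigmoid
    # multiply every value in a channel by that channel's gate
    out = [[[g * v for v in row] for row in ch] for ch, g in zip(fmap, gates)]
    return out, gates


fmap = [[[1.0, 1.0], [1.0, 1.0]],   # channel 0
        [[2.0, 2.0], [2.0, 2.0]]]   # channel 1
w1 = [[0.5, 0.5]]                   # 2 channels -> 1 hidden unit (compression)
w2 = [[1.0], [-1.0]]                # 1 hidden unit -> 2 channel scores

out, gates = channel_attention(fmap, w1, w2)
print([round(g, 3) for g in gates])  # [0.818, 0.182]
```

With these weights channel 0 keeps most of its magnitude while channel 1 is suppressed, which is the kind of per-channel selection the module is meant to learn at each scale.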
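Detection stage
In the detection stage, two convolution layers are attached to each feature map, one for box regression and one for classification. Six anchors are set at each pixel, covering three aspect ratios, and the detection scale ranges of the default boxes follow SSD. A score threshold of 0.05 is then used to filter out most low-scoring anchors, and soft-NMS is applied as post-processing to keep the more accurate bounding boxes. Lowering the threshold to 0.01 gives better detection results but is slower.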
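The filtering and soft-NMS steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the post-processing code used by M2Det: it assumes the Gaussian variant of soft-NMS with a hypothetical `sigma=0.5`, boxes given as `(x1, y1, x2, y2)`, and the 0.05 score threshold mentioned above.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.05):
    """Gaussian soft-NMS: decay the scores of overlapping boxes instead of deleting them."""
    candidates = [(s, b) for s, b in zip(scores, boxes) if s >= score_thresh]
    kept = []
    while candidates:
        # pick the highest-scoring remaining box
        best = max(range(len(candidates)), key=lambda i: candidates[i][0])
        score, box = candidates.pop(best)
        kept.append((box, score))
        # decay the rest according to their overlap with the picked box
        decayed = []
        for s, b in candidates:
            s *= math.exp(-(iou(box, b) ** 2) / sigma)
            if s >= score_thresh:
                decayed.append((s, b))
        candidates = decayed
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.01]
for box, score in soft_nms(boxes, scores):
    print(box, round(score, 3))
```

In this example the box with score 0.01 is dropped by the threshold, the heavily overlapping second box survives with a reduced score, and the isolated third box keeps its score of 0.7.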
A TensorFlow implementation of focal loss:
import tensorflow as tf

def calc_focal_loss(cls_outputs, cls_targets, alpha=0.25, gamma=2.0):
    """
    Args:
        cls_outputs: [batch_size, num_anchors, num_classes]
        cls_targets: [batch_size, num_anchors, num_classes]
    Returns:
        cls_loss: [batch_size]
    Compute focal loss:
        FL = -(1 - pt)^gamma * log(pt), where pt = p if y == 1 else 1 - p
    cf. https://arxiv.org/pdf/1708.02002.pdf
    """
    positive_mask = tf.equal(cls_targets, 1.0)
    # modulating factor (1 - pt): 1 - p for positive targets, p for negative targets
    pos = tf.where(positive_mask, 1.0 - cls_outputs, tf.zeros_like(cls_outputs))
    neg = tf.where(positive_mask, tf.zeros_like(cls_outputs), cls_outputs)
    # clip before the log to avoid log(0)
    pos_loss = - alpha * tf.pow(pos, gamma) * tf.math.log(tf.clip_by_value(cls_outputs, 1e-15, 1.0))
    neg_loss = - (1 - alpha) * tf.pow(neg, gamma) * tf.math.log(tf.clip_by_value(1.0 - cls_outputs, 1e-15, 1.0))
    loss = tf.reduce_sum(pos_loss + neg_loss, axis=[1, 2])
    return loss
def calc_cls_loss(cls_outputs, cls_targets, positive_flag):
    # cross-entropy with hard negative mining (SSD style)
    batch_size = tf.shape(cls_outputs)[0]
    num_anchors = tf.cast(tf.shape(cls_outputs)[1], tf.float32)
    num_positives = tf.reduce_sum(positive_flag, axis=-1)  # shape: [batch_size,]
    num_negatives = tf.minimum(3 * num_positives, num_anchors - num_positives)  # neg_pos_ratio is 3
    negative_mask = tf.greater(num_negatives, 0)

    cls_outputs = tf.clip_by_value(cls_outputs, 1e-15, 1 - 1e-15)
    conf_loss = -tf.reduce_sum(cls_targets * tf.math.log(cls_outputs), axis=-1)
    pos_conf_loss = tf.reduce_sum(conf_loss * positive_flag, axis=1)

    has_min = tf.cast(tf.reduce_any(negative_mask), tf.float32)  # would be 0.0 if ALL num_neg are 0
    # append a fallback of 100 negatives, used only when no image has any to mine
    num_neg = tf.concat(axis=0, values=[num_negatives, [(1 - has_min) * 100]])
    # minimum value under the condition the value > 0
    num_neg_batch = tf.reduce_min(tf.boolean_mask(num_neg, tf.greater(num_neg, 0)))
    num_neg_batch = tf.cast(num_neg_batch, tf.int32)

    max_confs = tf.reduce_max(cls_outputs[:, :, 1:], axis=2)  # except background class
    # hardest negatives: non-positive anchors with the highest foreground confidence
    _, indices = tf.nn.top_k(max_confs * (1 - positive_flag), k=num_neg_batch)
    batch_idx = tf.expand_dims(tf.range(0, batch_size), 1)
    batch_idx = tf.tile(batch_idx, (1, num_neg_batch))
    full_indices = (tf.reshape(batch_idx, [-1]) * tf.cast(num_anchors, tf.int32) + tf.reshape(indices, [-1]))

    neg_conf_loss = tf.gather(tf.reshape(conf_loss, [-1]), full_indices)
    neg_conf_loss = tf.reshape(neg_conf_loss, [batch_size, num_neg_batch])
    neg_conf_loss = tf.reduce_sum(neg_conf_loss, axis=1)

    cls_loss = pos_conf_loss + neg_conf_loss
    cls_loss /= (num_positives + tf.cast(num_neg_batch, tf.float32))
    return cls_loss
def calc_box_loss(box_outputs, box_targets, positive_flag, delta=0.1):
    num_positives = tf.reduce_sum(positive_flag, axis=-1)  # shape: [batch_size,]
    normalizer = num_positives * 4
    normalizer = tf.where(tf.not_equal(normalizer, 0), normalizer, tf.ones_like(normalizer))  # to avoid division by 0

    # larger weight for smaller targets (columns 2 and 3 are taken to be width and height)
    loss_scale = 2.0 - box_targets[:, :, 2:3] * box_targets[:, :, 3:4]

    # Huber loss: quadratic below delta, linear above
    sq_loss = 0.5 * (box_targets - box_outputs) ** 2
    abs_loss = 0.5 * delta ** 2 + delta * (tf.abs(box_outputs - box_targets) - delta)
    l1_loss = tf.where(tf.less(tf.abs(box_outputs - box_targets), delta), sq_loss, abs_loss)

    box_loss = tf.reduce_sum(l1_loss, axis=-1, keepdims=True)
    box_loss = box_loss * loss_scale
    box_loss = tf.reduce_sum(box_loss, axis=-1)
    box_loss = tf.reduce_sum(box_loss * positive_flag, axis=-1)  # only positive anchors contribute
    box_loss = box_loss / normalizer
    return box_loss
def calc_loss(y_true, y_pred, box_loss_weight):
    """
    Args:
        y_true: [batch_size, num_anchors, 4 + num_classes + 1]
        y_pred: [batch_size, num_anchors, 4 + num_classes]
            num_classes includes the background class
            the last element of y_true denotes whether the box is positive or negative
    Returns:
        total_loss: scalar, mean over the batch
    cf. https://github.com/tensorflow/tpu/blob/master/models/official/retinanet/retinanet_model.py
    """
    box_outputs = y_pred[:, :, :4]
    box_targets = y_true[:, :, :4]
    cls_outputs = y_pred[:, :, 4:]
    cls_targets = y_true[:, :, 4:-1]
    positive_flag = y_true[:, :, -1]

    box_loss = calc_box_loss(box_outputs, box_targets, positive_flag)
    # alternative: cross-entropy with hard negative mining
    ##cls_loss = calc_cls_loss(cls_outputs, cls_targets, positive_flag)
    cls_loss = calc_focal_loss(cls_outputs, cls_targets)

    total_loss = cls_loss + box_loss_weight * box_loss
    return tf.reduce_mean(total_loss)
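To see what the modulating factor in the focal loss does, it helps to compare it with plain cross-entropy on a few single predictions. The sketch below is a scalar, plain-Python version of the same formula as `calc_focal_loss` (with the same `alpha=0.25` and `gamma=2.0` defaults); `focal_loss` and `cross_entropy` are names introduced here only for the comparison.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    pt = p if y == 1 else 1.0 - p
    return -math.log(pt)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - pt)^gamma."""
    pt = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - pt) ** gamma * math.log(pt)


# positive target (y = 1): easy, ambiguous, and hard predictions
for p in (0.9, 0.5, 0.1):
    ce = cross_entropy(p, 1)
    fl = focal_loss(p, 1)
    print("p=%.1f  CE=%.4f  FL=%.6f  FL/CE=%.4f" % (p, ce, fl, fl / ce))
```

The ratio FL/CE equals alpha_t * (1 - pt)^gamma, so a well-classified example (p = 0.9) is scaled down far more than a badly classified one (p = 0.1). This is how focal loss keeps the many easy background anchors from dominating training.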
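Summary:
This paper focuses mainly on improving the network architecture, but eight TUMs strike me as too computationally expensive. Reducing the number of TUMs somewhat, while giving each TUM a learnable weight, might achieve better results.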