1 Background and Motivation

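Current feature pyramid designs are still inefficient at integrating semantic information across different scales.

To address the shortcomings of SSD and FPN, the authors build on SSD and design a new feature pyramid structure that gives the object detection model stronger feature representations.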

SSD is one of the earliest methods to use feature pyramids for object detection.

(Image from "SSD詳解")

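Its drawback is that SSD detects small objects from shallow layers, and those layers lack high-level semantic information, so small-object detection performs poorly.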

FPN's top-down feature fusion is linear, which is too simple to capture highly nonlinear patterns in more complicated and practical cases.

2 Advantages / Contributions

Proposes feature pyramids built from global-local reconfigurations, which enhance multi-scale representations.
All scales in the feature pyramid are processed simultaneously, which is more efficient than layer-by-layer transformation.

3 Method

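1) ConvNet Feature Hierarchy

The backbone features used for object detection can be written as the set of outputs of its layers, where

$L$: the total number of layers in the backbone
$x_l$: the output of the $l^{th}$ layer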

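In SSD, the set of feature maps used for prediction is the subset of these outputs that starts at layer $P$.

e.g. $P = 23$ for VGG

$x_P$ has high resolution but limited semantic information, and SSD does not reuse the deeper, more semantic features, which hurts small-object detection.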
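The two sets described in this subsection can be written out explicitly. This is a reconstruction from the definitions of $L$, $x_l$ and $P$ given here; the set names $X_{net}$ and $X_{pred}$ are labels assumed for illustration.

```latex
X_{net} = \{x_1, x_2, \dots, x_L\}, \qquad
X_{pred} = \{x_P, x_{P+1}, \dots, x_L\}, \quad 1 \le P \le L
```

With $P = 23$, every prediction map comes from layer 23 or deeper, and the shallowest one, $x_P$, is used as-is without any contribution from the deeper layers.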

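2) Lateral Connection

In FPN, features are fused top-down: each level combines its own feature map with the fused map from the level above.

$\alpha$ and $\beta$ are a convolution (with no activation function) and up-sampling (bilinear interpolation) respectively, so both can be regarded as linear operations. Written in general form, FPN amounts to a polynomial expansion over the features.

The fused feature maps form the new set used for prediction.

This representation power is not enough for complex object detection tasks.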
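To make the "polynomial expansion" concrete, the top-down fusion can be unrolled for the two deepest fused levels. This is a sketch built only from the description above; the per-level subscripts on $\alpha$ and $\beta$ and the combined weights $w_k^{(l)}$ are notation assumed here.

```latex
\begin{aligned}
x'_L     &= x_L \\
x'_{L-1} &= \alpha_{L-1} x_{L-1} + \beta_{L-1} x'_L \\
x'_{L-2} &= \alpha_{L-2} x_{L-2} + \beta_{L-2} x'_{L-1} \\
         &= \alpha_{L-2} x_{L-2} + \beta_{L-2}\alpha_{L-1} x_{L-1} + \beta_{L-2}\beta_{L-1} x_L \\
x'_l     &= \sum_{k=l}^{L} w_k^{(l)} x_k
\end{aligned}
```

Each $w_k^{(l)}$ is a composition of the linear operators $\alpha$ and $\beta$, so every fused map is a linear combination of the backbone features at its own level and deeper.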

3.1 Deep Feature Reconfiguration

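The first step is to go from linear to nonlinear.

Each prediction feature map is produced by a nonlinear function $H_l(X)$, which consists of a global attention operation and a local reconfiguration operation.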
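In symbols, the reconfigured pyramid might be written as below. This is a hedged reconstruction: it assumes $X$ denotes the feature hierarchy used for detection, and $X'_{pred}$ is a label assumed for the new prediction set.

```latex
x'_l = H_l(X), \qquad
X'_{pred} = \{H_P(X), H_{P+1}(X), \dots, H_L(X)\}
```

Because every $H_l$ takes the same input $X$, the levels do not depend on one another and can be computed at the same time, unlike the level-by-level recursion of a top-down pathway.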

3.1.1 Global Attention for Feature Hierarchy

This part uses the SENet approach, i.e. channel attention.

For an introduction to SENet, see 【SENet】《Squeeze-and-Excitation Networks》

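Step 1: squeeze the spatial resolution, reducing the feature map to a vector with one entry per channel.

$x_l^c(i,j)$ denotes the value at row $i$, column $j$ of the $c^{th}$ channel

Step 2: apply two fully connected layers.

There is a small notational error in the formula: the weights should be $W_1^l$ and $W_2^l$. $\delta$ is ReLU and $\sigma$ is sigmoid.

The channel-dimension vector is reduced by a factor of $r=16$.

Step 3: multiply the original feature map channel-wise by the learned channel weights.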
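The three steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes a single feature map of shape `(C, H, W)`, random weights, and no bias terms. The function name `channel_attention` and the weight shapes are hypothetical, following the usual SENet layout in which the first layer reduces the channel count by $r$ and the second restores it.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """SE-style channel attention on one feature map.

    x:  (C, H, W) feature map
    w1: (C // r, C) weights of the first fully connected layer (reduce)
    w2: (C, C // r) weights of the second fully connected layer (restore)
    """
    # Step 1: squeeze. Average over rows and columns, one value per channel.
    z = x.mean(axis=(1, 2))                      # shape (C,)
    # Step 2: two fully connected layers, ReLU then sigmoid.
    hidden = np.maximum(w1 @ z, 0.0)             # shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # shape (C,), each in (0, 1)
    # Step 3: channel-wise multiplication with the original feature map.
    return x * s[:, None, None], s

# Illustrative sizes only (assumed, not from the paper).
rng = np.random.default_rng(0)
C, H, W, r = 32, 8, 8, 16
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y, s = channel_attention(x, w1, w2)
```

The output `y` has the same shape as `x`; only the relative strength of each channel changes.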

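Finally, the paper writes out the set of feature maps used for prediction.

In that expression it would be better to drop the subscript $l$; adding it is redundant.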

3.1.2 Local Reconfiguration

This is a ResNet-style structure: the output of the channel attention is fed into a bottleneck, paired with a 1x1 conv shortcut branch.

The benefit of the residual design is that it is easier to optimize the residual mapping than to optimize the desired underlying mapping.
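A residual block of this kind can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the paper's exact layer: it assumes a 1x1, 3x3, 1x1 bottleneck with ReLU, omits normalization and biases, and uses hypothetical names (`local_reconfiguration`, `w_reduce`, `w_mid`, `w_expand`, `w_shortcut`) and channel sizes.

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in)
    return np.einsum("oc,chw->ohw", w, x)

def conv3x3(x, w):
    # x: (C_in, H, W), w: (C_out, C_in, 3, 3); zero padding keeps H and W.
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for di in range(3):
        for dj in range(3):
            patch = xp[:, di:di + h, dj:dj + wd]
            out += np.einsum("oc,chw->ohw", w[:, :, di, dj], patch)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def local_reconfiguration(x, w_reduce, w_mid, w_expand, w_shortcut):
    # Residual branch: bottleneck of 1x1 -> 3x3 -> 1x1 convolutions.
    residual = relu(conv1x1(x, w_reduce))
    residual = relu(conv3x3(residual, w_mid))
    residual = conv1x1(residual, w_expand)
    # Shortcut branch: a single 1x1 convolution to match the output channels.
    shortcut = conv1x1(x, w_shortcut)
    return relu(residual + shortcut)

# Illustrative sizes only (assumed, not from the paper).
rng = np.random.default_rng(0)
c_in, c_mid, c_out, h, w = 64, 16, 32, 6, 6
x = rng.standard_normal((c_in, h, w))
w_reduce = rng.standard_normal((c_mid, c_in)) * 0.1
w_mid = rng.standard_normal((c_mid, c_mid, 3, 3)) * 0.1
w_expand = rng.standard_normal((c_out, c_mid)) * 0.1
w_shortcut = rng.standard_normal((c_out, c_in)) * 0.1
out = local_reconfiguration(x, w_reduce, w_mid, w_expand, w_shortcut)
```

If the residual branch outputs zero, the block reduces to the shortcut alone, which is one way to see why the residual mapping is the easier target to optimize.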