Global Context-Aware Progressive Aggregation Network for Salient Object Detection

Original document: https://www.yuque.com/lart/papers/kyxtc1


AAAI 2020

Main Contributions

This paper is very similar in spirit to F3Net. Both hold that "the previous works mainly adopted multiple-level feature integration yet ignored the gap between different features."

The other observation, "there also exists a dilution process of high-level features as they passed on the top-down pathway", is in fact borrowed from the earlier PoolNet.

Overall, therefore, the paper argues that existing FCN-based models suffer from the following two problems:

  1. Due to the gap between different level features, the simple combination of semantic information and appearance information is insufficient and lacks consideration of the different contribution of different features for salient object detection;
  2. Most of the previous works ignored the global context information, which benefits for deducing the relationship among multiple salient regions and producing more complete saliency result.

To address these two problems, several modules are proposed:

  • For the first problem: the Feature Interweaved Aggregation (FIA) module fully integrates the high-level semantic features, low-level detail features, and global context features, which is expected to suppress the noises but recover more structural and detail information.
  • General improvements:
    • the Head Attention (HA) module is used to reduce information redundancy and enhance the top-layer features by leveraging spatial and channel-wise attention;
    • the Self Refinement (SR) module is utilized to further refine and heighten the input features.
  • For the second problem: the Global Context Flow (GCF) module generates the global context information at different stages, which aims to learn the relationship among different salient regions and alleviate the dilution effect of high-level features.

Architecture

[Figure: overall architecture of GCPANet]

The network consists of four main components, each briefly introduced below.

FIA

[Figure: structure of the FIA module]

Multiplication is used in several places here: "The multiplication operation can strengthen the response of salient objects, meanwhile suppress the background noises." The overall computation can be grasped fairly intuitively from the figure. Note that the middle and right branches take $\tilde{f}_l$ as input rather than $f_l$, i.e., the intermediate feature of the left branch.

The module takes three inputs (a sketch follows the list):

  1. the high-level features from the output of the previous layer
  2. the low-level features from the corresponding bottom layer
  3. the global context feature generated by the GCF module
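
Based on the figure and the inputs above, a minimal PyTorch sketch of what FIA could look like is given below. This is a reconstruction, not the authors' code: the channel width (256), the BatchNorm placement, and the exact fusion of the three interweaved branches by concatenation are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

class FIA(nn.Module):
    # Each input is projected to a common width; the low-level feature
    # \tilde{f}_l is multiplied with the upsampled high-level and global
    # context features, and the interweaved maps are fused by concat + conv.
    def __init__(self, ch_low, ch_high, ch_global, ch=256):
        super().__init__()
        def proj(c_in):
            return nn.Sequential(nn.Conv2d(c_in, ch, 3, padding=1),
                                 nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.proj_l = proj(ch_low)
        self.proj_h = proj(ch_high)
        self.proj_g = proj(ch_global)
        self.fuse = proj(3 * ch)

    def forward(self, f_l, f_h, g):
        f_l = self.proj_l(f_l)                                  # \tilde{f}_l
        f_h = nnf.interpolate(self.proj_h(f_h), size=f_l.shape[2:],
                              mode='bilinear', align_corners=False)
        g = nnf.interpolate(self.proj_g(g), size=f_l.shape[2:],
                            mode='bilinear', align_corners=False)
        z_hl = nnf.relu(f_h * f_l)   # high-level semantics gate low-level details
        z_gl = nnf.relu(g * f_l)     # global context gates low-level details
        return self.fuse(torch.cat([z_hl, z_gl, f_l], dim=1))
```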

SR

For example, there may be holes in the predicted salient objects, caused by contradictory responses from different layers. The SR module is therefore designed to further refine and enhance the feature maps after they pass through the HA and FIA modules, using multiplication and addition operations.

SR itself is simple: just a three-layer convolutional structure. Not much more to say; a sketch is given below.
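
As a rough illustration of "refinement via multiplication and addition", here is a hedged sketch: a reduction convolution followed by a convolution that predicts a per-pixel weight $w$ and bias $b$, applied as $\mathrm{ReLU}(w \odot f + b)$. The channel widths and the weight/bias split are assumptions based on the description above.

```python
import torch.nn as nn
import torch.nn.functional as nnf

class SR(nn.Module):
    # A reduction conv produces a 256-channel feature; a second conv predicts
    # a per-pixel weight w and bias b, and the output is ReLU(w * f + b).
    def __init__(self, in_ch, ch=256):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1),
                                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.wb = nn.Conv2d(ch, 2 * ch, 3, padding=1)  # predicts [w; b]

    def forward(self, x):
        f = self.reduce(x)
        w, b = self.wb(f).chunk(2, dim=1)   # split into weight and bias
        return nnf.relu(w * f + b)          # multiplicative + additive refinement
```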

HA

Since the top-layer features of the encoder are usually redundant for salient object detection, an HA module is attached after the top layer to learn more selective and representative features by leveraging spatial and channel-wise attention mechanisms.

Given an input feature map $F$, the channels are first reduced to 256, yielding $\tilde{F}$; a simple convolutional structure then produces the first-stage feature $F_1$.

$F$ is then processed by global average pooling into a channel-wise feature vector $f$, followed by two fully connected layers activated by ReLU and Sigmoid respectively, yielding the weight vector $y$.

The final output is $F_1 \odot y$, i.e., $F_1$ weighted channel-wise by $y$. A sketch putting these steps together is given below.
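
A minimal sketch of the above, assuming a single 3×3 conv block for the spatial branch and a small hidden width for the two fully connected layers (both assumptions):

```python
import torch.nn as nn
import torch.nn.functional as nnf

class HA(nn.Module):
    # Spatial branch: reduce F to 256 channels (\tilde{F}), then a conv block
    # gives F1. Channel branch: GAP(F) -> two FC layers (ReLU, Sigmoid) -> y.
    def __init__(self, in_ch, ch=256, hidden=64):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1),
                                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.spatial = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                     nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.fc = nn.Sequential(nn.Linear(in_ch, hidden), nn.ReLU(inplace=True),
                                nn.Linear(hidden, ch), nn.Sigmoid())

    def forward(self, x):                            # x is F in the text
        f1 = self.spatial(self.reduce(x))            # first-stage feature F1
        f = nnf.adaptive_avg_pool2d(x, 1).flatten(1) # channel vector f
        y = self.fc(f)[:, :, None, None]             # weight vector y in (0, 1)
        return f1 * y                                # F1 weighted channel-wise by y
```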

GCF

Unlike PoolNet, the differing contributions of different stages are taken into account. Global average pooling first extracts the global context information; then, for each stage, different weights are reassigned to the different channels of the global context feature map. Note that, as the figure shows, a separate GCF is instantiated for each stage.

[Figure: structure of the GCF module]

The output of each GCF is fed into the right branch of the corresponding FIA module as input, as sketched below.
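
A hedged sketch of one GCF instance, following the description above; the use of 1×1 convolutions for the stage-specific channel reweighting is an assumption.

```python
import torch.nn as nn

class GCF(nn.Module):
    # One instance per decoder stage. GAP extracts a global descriptor from
    # the top feature; stage-specific 1x1 convs turn it into channel weights
    # that reweight the (projected) global feature for that stage.
    def __init__(self, in_ch, ch=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv2d(in_ch, ch, 1),
                                  nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.gate = nn.Sequential(nn.Conv2d(in_ch, ch, 1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, f_top):
        g = f_top.mean(dim=(2, 3), keepdim=True)   # global average pooling
        return self.proj(f_top) * self.gate(g)     # per-stage channel reweighting
```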

Loss Function

A cross-entropy loss in deeply supervised form is used.

[Equation: deeply supervised cross-entropy loss over the dominant and auxiliary outputs]

To facilitate the optimization of the proposed network, we add auxiliary losses at three decoder stages. Specifically, a 3×3 convolution operation is applied at each stage to squeeze the channel of the output feature maps to 1. Then these maps are up-sampled to the same size as the ground truth via bilinear interpolation, and a sigmoid function is used to normalize the predicted values into [0, 1].

The auxiliary loss branches only exist during the training stage and are abandoned during inference.
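
A minimal sketch of the deeply supervised loss described above, assuming equal weighting of the dominant and auxiliary terms (the paper may weight them differently):

```python
import torch.nn.functional as nnf

def gcpanet_loss(dominant_logits, aux_logits, gt):
    # dominant_logits: full-resolution 1-channel prediction; aux_logits: list
    # of 1-channel maps from the three decoder stages; gt: float mask in [0,1].
    # BCE-with-logits folds in the sigmoid mentioned above.
    loss = nnf.binary_cross_entropy_with_logits(dominant_logits, gt)
    for aux in aux_logits:
        aux = nnf.interpolate(aux, size=gt.shape[2:], mode='bilinear',
                              align_corners=False)  # up-sample to GT size
        loss = loss + nnf.binary_cross_entropy_with_logits(aux, gt)
    return loss
```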

Implementation Details

  • We adopt ResNet-50 (He et al. 2016) pretrained on ImageNet (Deng et al. 2009) as our network backbone.
  • In the training stage, we resize each image to 320×320 with random horizontal flipping, then randomly crop a patch with the size of 288 × 288 for training.
  • During the inference stage, images are simply **resized to 320×320** and then fed into the network to obtain prediction without any other post-processing (e.g., CRF).
  • We use Pytorch (Paszke et al. 2017) to implement our model.
  • Mini-batch stochastic gradient descent (SGD) is used to optimize the whole network with a batch size of 32, a momentum of 0.9, and a weight decay of 5e-4.
  • We use the **warm-up and linear decay strategies** with a maximum learning rate of 5e-3 for the backbone and 0.05 for the other parts, and stop training after 30 epochs (see the sketch after this list).
  • The inference of a 320×320 image takes about 0.02s (over 50 fps) with the acceleration of one NVIDIA Titan-Xp GPU card.
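
The optimizer and schedule above could be wired up roughly as follows; the `backbone` name prefix used to split the parameter groups and the warm-up length are assumptions.

```python
import torch

def build_optimizer(model, steps_per_epoch, epochs=30, warmup_steps=500):
    # Backbone params get a 5e-3 max LR, everything else 0.05, with momentum
    # 0.9 and weight decay 5e-4 as stated above.
    backbone = [p for n, p in model.named_parameters() if n.startswith('backbone')]
    head = [p for n, p in model.named_parameters() if not n.startswith('backbone')]
    opt = torch.optim.SGD([{'params': backbone, 'lr': 5e-3},
                           {'params': head, 'lr': 0.05}],
                          momentum=0.9, weight_decay=5e-4)
    total = epochs * steps_per_epoch

    def factor(step):                 # scheduler.step() called once per iteration
        if step < warmup_steps:
            return (step + 1) / warmup_steps                      # linear warm-up
        return max(0.0, (total - step) / (total - warmup_steps))  # linear decay

    return opt, torch.optim.lr_scheduler.LambdaLR(opt, factor)
```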

[Figures: experimental results]
