Global Context-Aware Progressive Aggregation Network for Salient Object Detection

Original document: https://www.yuque.com/lart/papers/kyxtc1


AAAI 2020

Main Contributions

This paper is very similar in spirit to F3Net. Both hold that "the previous works mainly adopted multiple-level feature integration yet ignored the gap between different features."

The other observation, "there also exists a dilution process of high-level features as they passed on the top-down pathway", is in fact borrowed from the earlier PoolNet.

Overall, therefore, the paper argues that existing FCN-based models suffer from the following two problems:

  1. Due to the gap between different level features, the simple combination of semantic information and appearance information is insufficient and lacks consideration of the different contribution of different features for salient object detection;
  2. Most of the previous works ignored the global context information, which benefits for deducing the relationship among multiple salient regions and producing more complete saliency result.

To address these two problems, several modules are proposed:

  • For the first problem: the Feature Interweaved Aggregation (FIA) module fully integrates the high-level semantic features, low-level detail features, and global context features, which is expected to suppress the noises but recover more structural and detail information.
  • General improvements:
    • the Head Attention (HA) module is used to reduce information redundancy and enhance the top-layer features by leveraging spatial and channel-wise attention;
    • the Self Refinement (SR) module is utilized to further refine and heighten the input features.
  • For the second problem: the Global Context Flow (GCF) module generates the global context information at different stages, which aims to learn the relationship among different salient regions and alleviate the dilution effect of high-level features.

Architecture

[Figure: overall architecture of GCPANet]

The network consists of four main components, each briefly introduced below.

FIA

[Figure: structure of the FIA module]

Multiplication is used in several places here: "The multiplication operation can strengthen the response of salient objects, meanwhile suppress the background noises." The overall computation can be grasped fairly intuitively from the figure. Note that the middle and right branches take $\tilde{f}_l$ as input rather than $f_l$, i.e., the intermediate feature of the left branch.

The module takes three inputs (a sketch follows the list):

  1. the high-level features from the output of the previous layer
  2. the low-level features from the corresponding bottom layer
  3. the global context feature generated by the GCF module
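
Based on the figure and the inputs above, a minimal PyTorch sketch of what FIA could look like is given below. This is a reconstruction, not the authors' code: the channel width (256), the BatchNorm placement, and the exact fusion of the three interweaved branches by concatenation are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

class FIA(nn.Module):
    # Each input is projected to a common width; the low-level feature
    # \tilde{f}_l is multiplied with the upsampled high-level and global
    # context features, and the interweaved maps are fused by concat + conv.
    def __init__(self, ch_low, ch_high, ch_global, ch=256):
        super().__init__()
        def proj(c_in):
            return nn.Sequential(nn.Conv2d(c_in, ch, 3, padding=1),
                                 nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.proj_l = proj(ch_low)
        self.proj_h = proj(ch_high)
        self.proj_g = proj(ch_global)
        self.fuse = proj(3 * ch)

    def forward(self, f_l, f_h, g):
        f_l = self.proj_l(f_l)                                  # \tilde{f}_l
        f_h = nnf.interpolate(self.proj_h(f_h), size=f_l.shape[2:],
                              mode='bilinear', align_corners=False)
        g = nnf.interpolate(self.proj_g(g), size=f_l.shape[2:],
                            mode='bilinear', align_corners=False)
        z_hl = nnf.relu(f_h * f_l)   # high-level semantics gate low-level details
        z_gl = nnf.relu(g * f_l)     # global context gates low-level details
        return self.fuse(torch.cat([z_hl, z_gl, f_l], dim=1))
```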

SR

For example, there may be holes in the predicted salient objects, caused by contradictory responses from different layers. The SR module is therefore designed to further refine and enhance the feature maps after they pass through the HA and FIA modules, using multiplication and addition operations.

SR itself is simple: just a three-layer convolutional structure. Not much more to say; a sketch is given below.
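
As a rough illustration of "refinement via multiplication and addition", here is a hedged sketch: a reduction convolution followed by a convolution that predicts a per-pixel weight $w$ and bias $b$, applied as $\mathrm{ReLU}(w \odot f + b)$. The channel widths and the weight/bias split are assumptions based on the description above.

```python
import torch.nn as nn
import torch.nn.functional as nnf

class SR(nn.Module):
    # A reduction conv produces a 256-channel feature; a second conv predicts
    # a per-pixel weight w and bias b, and the output is ReLU(w * f + b).
    def __init__(self, in_ch, ch=256):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1),
                                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.wb = nn.Conv2d(ch, 2 * ch, 3, padding=1)  # predicts [w; b]

    def forward(self, x):
        f = self.reduce(x)
        w, b = self.wb(f).chunk(2, dim=1)   # split into weight and bias
        return nnf.relu(w * f + b)          # multiplicative + additive refinement
```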

HA

Since the top-layer features of the encoder are usually redundant for salient object detection, an HA module is attached after the top layer to learn more selective and representative features by leveraging spatial and channel-wise attention mechanisms.

Given an input feature map $F$, the channels are first reduced to 256, yielding $\tilde{F}$; a simple convolutional structure then produces the first-stage feature $F_1$.

$F$ is then processed by global average pooling into a channel-wise feature vector $f$, followed by two fully connected layers activated by ReLU and Sigmoid respectively, yielding the weight vector $y$.

The final output is $F_1 \odot y$, i.e., $F_1$ weighted channel-wise by $y$. A sketch putting these steps together is given below.
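
A minimal sketch of the above, assuming a single 3×3 conv block for the spatial branch and a small hidden width for the two fully connected layers (both assumptions):

```python
import torch.nn as nn
import torch.nn.functional as nnf

class HA(nn.Module):
    # Spatial branch: reduce F to 256 channels (\tilde{F}), then a conv block
    # gives F1. Channel branch: GAP(F) -> two FC layers (ReLU, Sigmoid) -> y.
    def __init__(self, in_ch, ch=256, hidden=64):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1),
                                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.spatial = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                     nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.fc = nn.Sequential(nn.Linear(in_ch, hidden), nn.ReLU(inplace=True),
                                nn.Linear(hidden, ch), nn.Sigmoid())

    def forward(self, x):                            # x is F in the text
        f1 = self.spatial(self.reduce(x))            # first-stage feature F1
        f = nnf.adaptive_avg_pool2d(x, 1).flatten(1) # channel vector f
        y = self.fc(f)[:, :, None, None]             # weight vector y in (0, 1)
        return f1 * y                                # F1 weighted channel-wise by y
```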

GCF

Unlike PoolNet, the differing contributions of different stages are taken into account. Global average pooling first extracts the global context information; then, for each stage, different weights are reassigned to the different channels of the global context feature map. Note that, as the figure shows, a separate GCF is instantiated for each stage.

[Figure: structure of the GCF module]

The output of each GCF is fed into the right branch of the corresponding FIA module as input, as sketched below.
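
A hedged sketch of one GCF instance, following the description above; the use of 1×1 convolutions for the stage-specific channel reweighting is an assumption.

```python
import torch.nn as nn

class GCF(nn.Module):
    # One instance per decoder stage. GAP extracts a global descriptor from
    # the top feature; stage-specific 1x1 convs turn it into channel weights
    # that reweight the (projected) global feature for that stage.
    def __init__(self, in_ch, ch=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv2d(in_ch, ch, 1),
                                  nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.gate = nn.Sequential(nn.Conv2d(in_ch, ch, 1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, f_top):
        g = f_top.mean(dim=(2, 3), keepdim=True)   # global average pooling
        return self.proj(f_top) * self.gate(g)     # per-stage channel reweighting
```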

Loss Function

A cross-entropy loss in deeply supervised form is used.

[Equation: deeply supervised cross-entropy loss over the dominant and auxiliary outputs]

To facilitate the optimization of the proposed network, we add auxiliary losses at three decoder stages. Specifically, a 3×3 convolution operation is applied at each stage to squeeze the channel of the output feature maps to 1. Then these maps are up-sampled to the same size as the ground truth via bilinear interpolation, and a sigmoid function is used to normalize the predicted values into [0, 1].

The auxiliary loss branches only exist during the training stage and are abandoned during inference.
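
A minimal sketch of the deeply supervised loss described above, assuming equal weighting of the dominant and auxiliary terms (the paper may weight them differently):

```python
import torch.nn.functional as nnf

def gcpanet_loss(dominant_logits, aux_logits, gt):
    # dominant_logits: full-resolution 1-channel prediction; aux_logits: list
    # of 1-channel maps from the three decoder stages; gt: float mask in [0,1].
    # BCE-with-logits folds in the sigmoid mentioned above.
    loss = nnf.binary_cross_entropy_with_logits(dominant_logits, gt)
    for aux in aux_logits:
        aux = nnf.interpolate(aux, size=gt.shape[2:], mode='bilinear',
                              align_corners=False)  # up-sample to GT size
        loss = loss + nnf.binary_cross_entropy_with_logits(aux, gt)
    return loss
```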

Implementation Details

  • We adopt ResNet-50 (He et al. 2016) pretrained on ImageNet (Deng et al. 2009) as our network backbone.
  • In the training stage, we resize each image to 320×320 with random horizontal flipping, then randomly crop a patch with the size of 288 × 288 for training.
  • During the inference stage, images are simply **resized to 320×320** and then fed into the network to obtain prediction without any other post-processing (e.g., CRF).
  • We use Pytorch (Paszke et al. 2017) to implement our model.
  • Mini-batch stochastic gradient descent (SGD) is used to optimize the whole network with a batch size of 32, a momentum of 0.9, and a weight decay of 5e-4.
  • We use the **warm-up and linear decay strategies** with a maximum learning rate of 5e-3 for the backbone and 0.05 for the other parts, and stop training after 30 epochs (see the sketch after this list).
  • The inference of a 320×320 image takes about 0.02s (over 50 fps) with the acceleration of one NVIDIA Titan-Xp GPU card.
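
The optimizer and schedule above could be wired up roughly as follows; the `backbone` name prefix used to split the parameter groups and the warm-up length are assumptions.

```python
import torch

def build_optimizer(model, steps_per_epoch, epochs=30, warmup_steps=500):
    # Backbone params get a 5e-3 max LR, everything else 0.05, with momentum
    # 0.9 and weight decay 5e-4 as stated above.
    backbone = [p for n, p in model.named_parameters() if n.startswith('backbone')]
    head = [p for n, p in model.named_parameters() if not n.startswith('backbone')]
    opt = torch.optim.SGD([{'params': backbone, 'lr': 5e-3},
                           {'params': head, 'lr': 0.05}],
                          momentum=0.9, weight_decay=5e-4)
    total = epochs * steps_per_epoch

    def factor(step):                 # scheduler.step() called once per iteration
        if step < warmup_steps:
            return (step + 1) / warmup_steps                      # linear warm-up
        return max(0.0, (total - step) / (total - warmup_steps))  # linear decay

    return opt, torch.optim.lr_scheduler.LambdaLR(opt, factor)
```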

[Figures: experimental results]
