F3Net: Fusion, Feedback and Focus for Salient Object Detection

Original document: https://www.yuque.com/lart/papers/mq2x1y


AAAI 2020

Opening Remarks

The code for this paper uses several very useful tricks, and the one I consider most important is multi-scale training. It is indeed a method that can, to some extent, alleviate the multi-scale problem (I will dig through Zhihu later; I remember a question there asking about something similar).

In my own project, I ported this training strategy over and obtained a very large performance improvement. Some experimental results can be found at: https://github.com/lartpang/MINet/tree/master/code#more-experiments

Main Work

This paper mainly addresses the following two problems:

  • reduce the impact of inconsistency between features of different levels
  • assign larger weights to those truly important pixels

Different components are proposed to handle each of them:

  • CFM & CFD:
    • First, to mitigate the discrepancy between features, we design cross feature module (CFM), which fuses features of different levels by element-wise multiplication. Different from addition and concatenation, CFM takes a selective fusion strategy, where redundant information will be suppressed to avoid the contamination between features and important features will complement each other. Compared with traditional fusion methods, CFM is able to remove background noises and sharpen boundaries.
    • Second, due to downsampling, high level features may suffer from information loss and distortion, which can not be solved by CFM. Therefore, we develop the cascaded feedback decoder (CFD) to refine these features iteratively. CFD contains multiple sub-decoders, each of which contains both bottom-up and top-down processes.
      • For bottom-up process, multi-level features are aggregated by CFM gradually.
      • For top-down process, aggregated features are fed back into previous features to refine them.
  • PPA:
    • We propose the pixel position aware loss (PPA) to improve the commonly used binary cross entropy loss which treats all pixels equally. In fact, pixels located at boundaries or elongated areas are more difficult and discriminating. Paying more attention to these hard pixels can further enhance model generalization. PPA loss assigns different weights to different pixels, which extends binary cross entropy. The weight of each pixel is determined by its surrounding pixels. Hard pixels will get larger weights and easy pixels will get smaller ones.

Main Architecture

[Figure: overall architecture of F3Net]

Judging from the figure, the structure is quite intuitive. With the CFM structure here, the authors aim to achieve: "By multiple feature crossings, fl and fh will gradually absorb useful information from each other to complement themselves, i.e., noises of fl will be suppressed and boundaries of fh will be sharpened." A minimal sketch of this module follows.
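Below is a minimal PyTorch sketch of the CFM idea as I understand it. The helper name, module layout, and the channel count of 64 are my own assumptions for illustration; this is not the exact official implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(channels):
    # Hypothetical helper: channel-preserving 3x3 conv + BN + ReLU.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

class CFM(nn.Module):
    """Cross Feature Module (sketch): fuses a low-level feature f_l and a
    high-level feature f_h by element-wise multiplication, then feeds the
    shared part back into both branches."""
    def __init__(self, channels=64):
        super().__init__()
        self.trans_l = conv_bn_relu(channels)
        self.trans_h = conv_bn_relu(channels)
        self.post_l = conv_bn_relu(channels)
        self.post_h = conv_bn_relu(channels)

    def forward(self, f_l, f_h):
        # Upsample the high-level feature to the low-level resolution.
        if f_h.shape[2:] != f_l.shape[2:]:
            f_h = F.interpolate(f_h, size=f_l.shape[2:],
                                mode='bilinear', align_corners=False)
        l, h = self.trans_l(f_l), self.trans_h(f_h)
        # Selective fusion: multiplication keeps components both branches
        # agree on, suppressing noise in f_l and blur in f_h.
        shared = l * h
        # Residual additions retain each branch's complementary parts.
        return self.post_l(shared + l), self.post_h(shared + h)
```

Multiplication acts as a soft gate here: positions where either branch responds weakly are suppressed in the shared term, which is the "selective fusion" the quote above refers to.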

A strategy of multiple cascaded bidirectional decoders is also used here:

  • On cascading:

Cascaded feedback decoder (CFD) is built upon CFM, which refines the multi-level features and generates saliency maps iteratively.

  • On bidirectionality:

In fact, features of different levels may have missing or redundant parts because of downsamplings and noises. Even with CFM, these parts are still difficult to identify and restore, which may hurt the final performance.
Considering the output saliency map is relatively complete and approximate to ground truth, we propose to propagate the features of the last convolution layer back to features of previous layers to correct and refine them.
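A conceptual sketch of this feedback, assuming each sub-decoder maps the multi-level feature list to one-channel saliency logits; the function names and the broadcast addition are my own simplification, not the official code:

```python
import torch.nn.functional as F

def cascaded_feedback_decode(features, decoders):
    """features: list of multi-level feature maps, low level to high level.
    decoders: list of sub-decoders; each performs a bottom-up aggregation
    (via CFM) followed by a top-down pass and returns saliency logits."""
    pred = None
    for decoder in decoders:
        if pred is not None:
            # Feedback: resize the previous coarse prediction and add it
            # back to every level, so that missing or redundant parts can
            # be corrected in the next pass.
            features = [
                f + F.interpolate(pred, size=f.shape[2:],
                                  mode='bilinear', align_corners=False)
                for f in features
            ]
        pred = decoder(features)
    return pred
```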

Loss Function

Three drawbacks of BCE are pointed out here:

  1. Pixel-level loss: First, it calculates the loss for each pixel independently and ignores the global structure of the image.
  2. Easily dominated by large regions: Second, in pictures where the background is dominant, the loss of foreground pixels will be diluted.
  3. Treats all pixels equally: Third, it treats all pixels equally. In fact, pixels located on cluttered or elongated areas (e.g., pole and horn) are prone to wrong predictions and deserve more attention, while pixels located on smooth areas, like sky and grass, deserve less attention.

Ultimately, position-based reweighting is used to combine the pixel-level BCE loss and the region-level IoU loss for supervision:

[Equations: weighted BCE loss, weighted IoU loss, and the pixel weight α]
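The equation images are not preserved here; from my reading of the paper and its released code, the weighted losses take roughly the following form (treat this as a reconstruction: γ is a balancing hyper-parameter, 5 in the released code; A_ij is a local window around pixel (i, j); g and p denote the ground truth and the predicted probability):

```latex
\alpha_{ij} = \left| \frac{\sum_{(m,n) \in A_{ij}} g_{mn}}{\sum_{(m,n) \in A_{ij}} 1} - g_{ij} \right|

L_{wbce} = \frac{\sum_{i,j} (1 + \gamma \alpha_{ij}) \, \ell^{bce}_{ij}}
                {\sum_{i,j} (1 + \gamma \alpha_{ij})},
\qquad
\ell^{bce}_{ij} = -\big[ g_{ij} \log p_{ij} + (1 - g_{ij}) \log (1 - p_{ij}) \big]

L_{wiou} = 1 - \frac{\sum_{i,j} (1 + \gamma \alpha_{ij}) \, g_{ij} \, p_{ij}}
                    {\sum_{i,j} (1 + \gamma \alpha_{ij}) \, (g_{ij} + p_{ij} - g_{ij} p_{ij})}
```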

The rest needs no detailed discussion; the key is the weight parameter alpha. It uses the ground-truth values of the pixels within a window around the current position to estimate whether the current pixel lies in a hard region. The visualization is as follows:

[Figure: visualization of the weight α]

As can be seen, the hard regions indeed receive good attention. The overall loss is as follows:

[Equations: PPA loss and the total loss under deep supervision]

Deep supervision is also used here; the individual supervision positions are shown in the network architecture figure above. Lsi denotes the prediction losses of the intermediate cascaded decoders, while Lsj denotes the prediction losses of the individual levels within the final decoder. A sketch of the PPA loss follows.
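Here is a PyTorch sketch of the PPA loss, following the formulation above; the 31×31 averaging window and γ = 5 match what I recall from the released implementation, so treat them as assumptions:

```python
import torch
import torch.nn.functional as F

def ppa_loss(pred, mask):
    """pred: raw logits (B, 1, H, W); mask: binary ground truth (B, 1, H, W)."""
    # alpha: how much a pixel differs from the mean of its local window;
    # large on boundaries / elongated structures, small in flat regions.
    alpha = torch.abs(F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    weit = 1 + 5 * alpha  # gamma = 5

    # Weighted BCE, normalized by the total weight.
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    # Weighted IoU on probabilities.
    prob = torch.sigmoid(pred)
    inter = ((prob * mask) * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)

    return (wbce + wiou).mean()
```

Note that the same per-pixel weight modulates both terms, so hard boundary pixels influence the pixel-level and the region-level supervision simultaneously.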

Experimental Details

  • DUTS-TR is used to train F3Net and other above mentioned datasets are used to evaluate F3Net.
  • For data augmentation, we use horizontal flip, random crop and multi-scale input images (see the sketch after this list).
  • ResNet-50 (He et al. 2016), pre-trained on ImageNet, is used as the backbone network. Maximum learning rate is set to 0.005 for ResNet-50 backbone and 0.05 for other parts.
  • Warm-up and linear decay strategies are used to adjust the learning rate.
  • The whole network is trained end-to-end, using stochastic gradient descent (SGD). Momentum and weight decay are set to 0.9 and 0.0005, respectively.
  • Batch size is set to 32 and the maximum epoch is set to 32.
  • We use Pytorch 1.3 to implement our model. An RTX 2080Ti GPU is used for acceleration.
  • During testing, each image is resized to 352 × 352 and then fed to F3Net to predict saliency maps without any post-processing.
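Since multi-scale training is the trick I emphasized at the beginning, here is a minimal sketch of one way to implement it in the training loop. The scale set {0.75, 1.0, 1.25} around the 352 base size is what I recall from the released code; the snap-to-multiple-of-32 step, the loop structure, and ppa_loss are my own assumptions (learning-rate scheduling is omitted):

```python
import random
import torch.nn.functional as F

BASE_SIZE = 352
SCALES = (0.75, 1.0, 1.25)  # per-batch scale factors (assumed values)

def train_one_epoch(model, loader, optimizer, ppa_loss):
    model.train()
    for images, masks in loader:  # batches already resized to BASE_SIZE
        # Choose one scale per batch and resize inputs and targets together,
        # snapping to a multiple of 32 to match the backbone's total stride.
        size = int(round(BASE_SIZE * random.choice(SCALES) / 32)) * 32
        images = F.interpolate(images, size=(size, size),
                               mode='bilinear', align_corners=False)
        masks = F.interpolate(masks, size=(size, size), mode='nearest')

        loss = ppa_loss(model(images), masks)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Resizing per batch (rather than per image) keeps tensor shapes uniform within a batch, which is why this augmentation lives in the training loop instead of the dataset transform.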

[Tables/Figures: quantitative comparisons with other methods and visual result examples]
