Progressive Feature Polishing Network for Salient Object Detection
AAAI 2020
Main Work
This paper again approaches the problem from the angle of exploiting multi-level features, proposing a Progressive Feature Polishing Network (PFPN) that uses Feature Polishing Modules (FPMs) to "polish" (i.e., refine) the multi-level features.
On its own, this does not feel particularly novel.
What is more interesting is that the paper repeatedly claims the proposed structure can avoid the long-term dependency problem. This claim mainly cites Bengio et al.'s 1994 paper, Learning long-term dependencies with gradient descent is difficult. Here are an introductory article and links to the original:
- https://www.jianshu.com/p/46367f985c64
- http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=23894D6306F585CE5D8D7B028F821C40?doi=10.1.1.41.7128&rep=rep1&type=pdf
- [Study notes] Long Short-Term Memory - article by 卓柳舟 on Zhihu: https://zhuanlan.zhihu.com/p/22670364
Here is the abstract of Bengio's paper:
Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered.
The PFPN paper argues that "the integrations between multi-level features performed indirectly in these (existing) methods could be deficient due to the incurred long-term dependency problem," where the reference is to Bengio's finding that training RNN models by gradient descent is difficult (roughly, that back-propagation struggles to discover dependencies spanning long ranges). But directly transferring a conclusion drawn on RNNs more than a decade ago to today's CNNs seems somewhat far-fetched.
Main Architecture
The overall architecture is fairly direct and clear: each component is a combination of convolutional blocks. One can see shades of DSS here (progressive skip-layer integration).
Regarding this multi-step iterative computation, the paper states:
However, since the predicted results (from some existing methods focusing on saliency-map refinement, such as DSS and DHSNet) have severe information loss than original representations, the refinement might be deficient. Different from these methods, our approach progressively improves the multi-level representations in a recurrent manner instead of attempting to rectify the predicted results.
Besides, most previous refinements are performed in a deep-to-shallow manner, in which at each step only the features specific to that step are exploited. In contrast to that, our method polishes the representations at every level with multi-level context information at each step.
Below we introduce the key structure of the paper, the FPM.
Here FPM1-3 or FPM2-3 is taken as an example, as shown in the figure above. The overall computation is as follows:
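Since the figure and equations are not reproduced here, below is a minimal PyTorch sketch of what one polishing step could look like, assuming each level's feature is updated by fusing transformed, resolution-matched features from all levels plus an identity (residual) connection. The channel counts, concatenation-based fusion, and 3x3/1x1 convolutions are my assumptions, not the paper's exact specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPM(nn.Module):
    """One Feature Polishing Module (sketch): polishes every level's feature
    with multi-level context; fusion details are assumptions."""
    def __init__(self, channels, num_levels):
        super().__init__()
        # one transform conv per (target level, source level) pair
        self.transforms = nn.ModuleList([
            nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1)
                           for _ in range(num_levels)])
            for _ in range(num_levels)])
        # 1x1 conv to fuse the concatenated multi-level context per level
        self.fuse = nn.ModuleList([
            nn.Conv2d(channels * num_levels, channels, 1)
            for _ in range(num_levels)])

    def forward(self, feats):  # feats: list of maps, shallow -> deep
        polished = []
        for i, f in enumerate(feats):
            ctx = []
            for j, g in enumerate(feats):
                g = self.transforms[i][j](g)
                # resize every source level to the target level's resolution
                g = F.interpolate(g, size=f.shape[2:], mode='bilinear',
                                  align_corners=False)
                ctx.append(g)
            # identity connection keeps each path short, which is the paper's
            # stated argument for avoiding the long-term dependency problem
            polished.append(f + self.fuse[i](torch.cat(ctx, dim=1)))
        return polished
```

Stacking several such FPMs back to back would then give the "progressive" polishing: each step re-polishes every level with context from every other level, rather than refining only in a deep-to-shallow order.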
The loss function used for training is as follows:
The paper uses deep supervision, where s denotes the final prediction and si denotes the intermediate predictions of the network:
In detail, 1x1 convolutional layers are performed on the multi-level feature maps before the Fusion Module to obtain a series of intermediate results.
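Given the description above, the deeply supervised loss can be sketched as a sum of binary cross-entropy terms over the final map s and the intermediate maps si. The uniform weighting and the use of BCE-with-logits are my assumptions, since the paper's exact loss weights are not quoted here.

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(final_pred, side_preds, gt):
    """Sum of BCE losses on the final map s and intermediate maps s_i
    (sketch; uniform weighting is an assumption)."""
    loss = F.binary_cross_entropy_with_logits(final_pred, gt)
    for s_i in side_preds:
        # resize each intermediate output to ground-truth resolution
        s_i = F.interpolate(s_i, size=gt.shape[2:], mode='bilinear',
                            align_corners=False)
        loss = loss + F.binary_cross_entropy_with_logits(s_i, gt)
    return loss
```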
Experiment Details
- Pytorch
- ResNet-101 and VGG-16 as backbones. The backbone layers are initialized with weights pre-trained on the ImageNet classification task, and the remaining layers are randomly initialized.
- They follow the source code of PiCA (Liu, Han, and Yang 2018) released by its authors and of FQN (Li et al. 2019), and freeze the BatchNorm statistics of the backbone.
- Following conventional practice (Liu, Han, and Yang 2018; Zhang et al. 2018a), the proposed model is trained on the training set of the DUTS dataset.
- We also perform a data augmentation similar to (Liu, Han, and Yang 2018) during training to mitigate the over-fitting problem.
- Specifically, the image is first resized to 300x300 and then a 256x256 image patch is randomly cropped from it.
- Random horizontal flipping is also applied.
- The Adam optimizer is used to train the model, without evaluation, until the training loss converges.
- The initial learning rate is set to 1e-4 and the overall training procedure takes about 16000 iterations.
- For testing, the images are scaled to 256x256 to feed into the network and then the predicted saliency maps are bilinearly interpolated to the size of the original image.