Contour Loss: Boundary-Aware Learning for Salient Object Segmentation


Original document: https://www.yuque.com/lart/papers/contourloss

arXiv, 2019-08-09 21:48:09. The paper is so far available only on arXiv; the link is given in the last section.

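Main contributions of the paper:

It proposes a contour loss that uses object contours to guide the model towards more discriminative features, preserving object boundaries while also improving local saliency prediction to some extent.
It proposes a hierarchical global saliency module that pushes the model to acquire global context stage by stage and so capture global saliency.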

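Network Architecture

The network is an FPN-like structure.
$E_i$ denotes the output feature of the i-th encoder block; the set of all these features is denoted $F_E$.
A residual structure is used at each encoder level to integrate multi-scale features; the decoder turns $E_i$ into $Res_i$, and the set of all these features is denoted $F_R$.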

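Here $\delta$ denotes a convolutional layer and $\theta$ its parameters.
$\bigoplus$ denotes concatenation, and $\{\star \}^{up \times 2}$ denotes 2x upsampling.
To implement deep supervision, each $Res_i$ is upsampled directly to 224x224 to obtain $U_i$, and a convolutional layer with sigmoid activation then produces the saliency prediction $P_i$.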
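
The equation for this step is not reproduced in these notes, so the following is only a minimal NumPy sketch of the data flow described above: upsample the deeper decoder feature by 2, concatenate it with $E_i$, apply a convolution, then upsample to 224x224 and apply a sigmoid-activated convolution. The 1x1 convolution, the nearest-neighbour upsampling, the toy channel counts and all function names are assumptions made for illustration, not the paper's actual layers.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv1x1(x, weight, bias):
    """1x1 convolution: weight is (C_out, C_in), x is (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decoder_step(e_i, res_next, weight, bias):
    """Concatenate E_i with the 2x-upsampled deeper feature, then convolve (assumed 1x1)."""
    fused = np.concatenate([e_i, upsample_nearest(res_next, 2)], axis=0)
    return conv1x1(fused, weight, bias)

def side_prediction(res_i, weight, bias, out_size=224):
    """U_i is Res_i upsampled to out_size; P_i is a sigmoid-activated 1-channel conv of U_i."""
    u_i = upsample_nearest(res_i, out_size // res_i.shape[1])
    return sigmoid(conv1x1(u_i, weight, bias))

rng = np.random.default_rng(0)
e_4 = rng.standard_normal((8, 28, 28))    # toy encoder feature at level 4
res_5 = rng.standard_normal((8, 14, 14))  # toy decoder feature from the deeper level
res_4 = decoder_step(e_4, res_5, rng.standard_normal((8, 16)) * 0.1, np.zeros(8))
p_4 = side_prediction(res_4, rng.standard_normal((1, 8)) * 0.1, np.zeros(1))
print(res_4.shape, p_4.shape)
```

Every level produces a prediction at the same 224x224 resolution, which is what allows each one to be supervised against the same ground-truth mask.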

Training uses losses at five levels, with a different weight for each level.
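
As a small illustration, the weighted sum over levels can be sketched as below. The weights are the values of WL5, …, WL1 quoted in the experimental details; the per-level loss values are made up for the example, and the names `LEVEL_WEIGHTS` and `total_loss` are hypothetical.

```python
# Level weights W_L5 .. W_L1 as quoted in the experimental details.
LEVEL_WEIGHTS = {5: 0.3, 4: 0.4, 3: 0.6, 2: 0.8, 1: 1.0}

def total_loss(level_losses):
    """Weighted sum of per-level losses; level_losses maps level index -> scalar loss."""
    return sum(LEVEL_WEIGHTS[i] * loss for i, loss in level_losses.items())

example = {5: 0.9, 4: 0.7, 3: 0.5, 2: 0.4, 1: 0.3}  # made-up loss values
print(total_loss(example))
```

The finest prediction P1 gets the largest weight, so the coarse levels act more as auxiliary supervision.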

Contour Loss

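In the dense per-pixel prediction of salient object detection (which here is effectively the same as segmentation), pixels near the object boundary can be regarded as hard examples. Borrowing from the design of Focal Loss, the cross-entropy is spatially weighted so that pixels on the boundary of the salient object receive higher weights. The spatial weights form the set given by the following expression.

$(\star; S)^+$ denotes dilation and $(\star; S)^-$ denotes erosion, both carried out with a 5x5 kernel S.
K is a hyperparameter, set to 5 in the paper.
Gauss gives some attention to pixels that are close to the boundary but not on it; its range is set to 5x5.
$\mathbb{1}$ is an indicator over all pixels of the image (e.g. 224x224): it is 1 if the pixel is far from the boundary and 0 otherwise.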
Compared with some boundary operators, such as Laplace operator, the above approach can generate thicker object contours for considerable error rates.
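
The weight equation itself is missing from these notes. Going by the symbol descriptions above, it can be read as a Gaussian-blurred, K-scaled contour band (dilation minus erosion of the ground-truth mask) plus the indicator term. The NumPy sketch below follows that reading. The Gaussian sigma of 1.0, the zero/one padding at the image border, and the interpretation of "far from the boundary" as "not on the contour band" are assumptions, and all function names are hypothetical.

```python
import numpy as np

def shifted_stack(x, k, pad_value):
    """Stack the k*k shifted copies of a padded 2-D array, shape (k*k, H, W)."""
    pad = k // 2
    padded = np.pad(x, pad, mode="constant", constant_values=pad_value)
    h, w = x.shape
    return np.stack([padded[dy:dy + h, dx:dx + w] for dy in range(k) for dx in range(k)])

def gaussian_kernel(k=5, sigma=1.0):
    """Normalised k x k Gaussian kernel (sigma is an assumed value)."""
    ax = np.arange(k) - k // 2
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def contour_weight(mask, k_size=5, K=5.0, sigma=1.0):
    """Spatial weight map from a binary ground-truth mask."""
    mask = mask.astype(np.float64)
    dilated = shifted_stack(mask, k_size, 0.0).max(axis=0)  # 5x5 dilation
    eroded = shifted_stack(mask, k_size, 1.0).min(axis=0)   # 5x5 erosion
    band = dilated - eroded                                  # thick contour band
    stack = shifted_stack(K * band, k_size, 0.0)
    blurred = np.tensordot(gaussian_kernel(k_size, sigma).ravel(), stack, axes=1)
    far = (band == 0).astype(np.float64)  # assumed reading of the indicator term
    return blurred + far

mask = np.zeros((32, 32))
mask[8:24, 8:24] = 1.0                    # toy square object
w = contour_weight(mask)
print(w[0, 0], w[16, 16], round(float(w[8, 16]), 3))
```

With these choices, pixels away from the contour keep a weight of 1, pixels on the band get a weight close to K, and the blur lets the weight fall off smoothly on either side.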
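The overall loss function is defined as follows:

The paper's description here appears to be mistaken: M is the weight, Y is the ground truth, and Y* is the prediction.
In practice, the loss in Eq. (3) is the $L_C$ of the corresponding level in Eq. (5).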
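
A spatially weighted cross-entropy of this kind can be sketched in NumPy as follows, with M as the weight map, Y as the ground truth and Y* as the prediction. Averaging over pixels (rather than summing) and the clipping constant are assumptions, and `contour_loss` is a hypothetical name.

```python
import numpy as np

def contour_loss(pred, target, weight, eps=1e-7):
    """Binary cross-entropy weighted per pixel by `weight`, averaged over pixels."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    bce = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float((weight * bce).mean())

target = np.array([[0.0, 1.0], [1.0, 0.0]])
pred = np.array([[0.1, 0.6], [0.9, 0.2]])
uniform = np.ones_like(target)
boundary_heavy = np.array([[1.0, 5.0], [1.0, 1.0]])  # up-weight the worst-predicted pixel

loss_uniform = contour_loss(pred, target, uniform)
loss_heavy = contour_loss(pred, target, boundary_heavy)
print(loss_uniform, loss_heavy)
```

With uniform weights this reduces to ordinary binary cross-entropy; raising the weight on a poorly predicted pixel increases its share of the loss, which is the intended effect for boundary pixels.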

Hierarchical Global Attention Module

The authors argue that the attention modules in most existing saliency detection methods are based on the softmax function, which enormously emphasizes several important pixels and endows the others with a very small value. Therefore these attention modules cannot attend to global contexts in high-resolution, which easily lead to overfitting in training.

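The paper therefore uses a new method based on global contrast to exploit global context, with the mean of the feature map as the reference: Since a region is conspicuous in feature maps, each pixel in the region is also significant with a relatively large value, for example, over the mean. In other words, the inconsequential features often have a relatively small value in feature maps, which are often smaller than the mean. The mean is thus subtracted from the feature map, so that positive values indicate salient regions and negative values indicate non-salient regions. This gives the following hierarchical global attention: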

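Here $F_{In}$ denotes the input feature, and Aver and Var denote the mean and variance of $F_{In}$.
$\lambda$ denotes a regularization term, set to 0.1 here.
$\epsilon$ is a small constant that prevents division by zero.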
Compared with softmax results, the pixel-wise disparity of our attention maps is more reasonable, in other words, our attention method can retain conspicuous regions from feature maps in high-resolution
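
The contrast between the two behaviours can be illustrated with a small NumPy sketch. It normalises a feature map by subtracting its mean and dividing by its variance plus $\epsilon$, and compares the result with a softmax over spatial positions. The attention equation is not reproduced in these notes, so the sketch leaves out $\lambda$ (its placement is not recoverable from the notes) and should not be read as the paper's exact formula. Computing the statistics per channel is also an assumption.

```python
import numpy as np

def global_contrast(f_in, eps=1e-5):
    """(F - mean) / (var + eps), with statistics taken per channel over (H, W)."""
    mean = f_in.mean(axis=(1, 2), keepdims=True)
    var = f_in.var(axis=(1, 2), keepdims=True)
    return (f_in - mean) / (var + eps)

def spatial_softmax(f_in):
    """Softmax over all spatial positions of each channel."""
    flat = f_in.reshape(f_in.shape[0], -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(flat)
    return (e / e.sum(axis=1, keepdims=True)).reshape(f_in.shape)

feat = np.zeros((1, 8, 8))
feat[0, 2:5, 2:5] = 5.0   # a conspicuous 3x3 region
feat[0, 3, 3] = 12.0      # one very strong response inside it

contrast = global_contrast(feat)
softmax_map = spatial_softmax(feat)
print(float(softmax_map.max()), int((contrast > 0).sum()))
```

In this toy case the softmax puts almost all of its mass on the single strongest pixel, while the mean-subtracted map is positive over the whole 3x3 region and negative on the background.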

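A hierarchical global attention module (HGAM) is proposed to capture multi-scale global context.

The module takes three inputs: the upsampled feature $U_i$ of the current level, the encoder feature $E_i$ of the current level, and the HGAM output $Hout_{i+1}$ from the previous level.
To extract global context information, max pooling and average pooling are applied to $U_i$ to obtain H1 and H2, following [CBAM: Convolutional Block Attention Module].
The channels and resolution of $E_i$ and $Hout_{i+1}$ are adjusted to obtain H3 and H4, respectively.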
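
The two pooling operations can be sketched as below, assuming they are taken along the channel axis as in CBAM's spatial attention branch. The notes do not state the pooling axis, so this is an assumption, and `channel_pool` is a hypothetical name. How H1 to H4 are then combined is left out because the notes do not describe it.

```python
import numpy as np

def channel_pool(u_i):
    """Max- and average-pool a (C, H, W) feature along the channel axis."""
    h1 = u_i.max(axis=0, keepdims=True)   # (1, H, W) channel-wise maximum
    h2 = u_i.mean(axis=0, keepdims=True)  # (1, H, W) channel-wise average
    return h1, h2

rng = np.random.default_rng(0)
u_i = rng.standard_normal((16, 56, 56))   # toy feature map
h1, h2 = channel_pool(u_i)
print(h1.shape, h2.shape)
```

Each pooled map summarises all channels at a given location in one value, which gives the attention a compact view of where the feature responds strongly.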

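HAtten is obtained via Eq. (6).
$Hout_i$ is passed on to the next HGAM, where it plays the role of $Hout_{i+1}$, while HAtten is used to guide the residual structure:

Here $\bigodot$ denotes pixel-wise multiplication.
So the final prediction is actually the P generated from $ResG_1$.

The overall training loss is revised to: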

Experimental Details

Our experiments are based on the PyTorch framework and run on a PC machine with a single NVIDIA TITAN X GPU (with 12G memory).
For training, we adopt DUTS-TR as training set and utilize data augmentation, which resamples each image to 256×256 before random flipping, and randomly crops the 224×224 region.
We employ stochastic gradient descent (SGD) as the optimizer with a momentum (0.9) and a weight decay (1e-4).
We also set basic learning rate to 1e-3 and finetune the VGG-16 backbone with a 0.05 times smaller learning rate.
Since the saliency maps of hierarchical predictions are coarse to fine from P5 to P1, we set the incremental weights with these predictions. Therefore WL5, …, WL1 are set to 0.3, 0.4, 0.6, 0.8, 1 respectively in both Eq 3 and 9.
The minibatch size of our network is set to 10. The maximum iteration is set to 150 epochs with the learning rate decay by a factor of 0.05 for each 10 epochs.
As it costs less than 500s for one epoch including training and evaluation, the total training time is below 21 hours.
For testing, following the training settings, we also resize the feeding images to 224×224, and only utilize the final output P. Since the testing time for each image is 0.038s, our model achieves 26 fps speed with 224×224 resolution.
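
The training-time augmentation quoted above (resample to 256×256, random flip, random 224×224 crop) can be sketched in NumPy as follows. The nearest-neighbour resampling, the horizontal flip direction, the flip probability of 0.5 and the function names are assumptions, since the quoted text does not specify them. The image and its mask share the same random choices so that they stay aligned.

```python
import numpy as np

def resize_nearest(img, size):
    """Nearest-neighbour resize of an (H, W, ...) array to (size, size, ...)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def augment(image, mask, rng, resize=256, crop=224):
    """Resize, randomly flip, then randomly crop an image and its mask together."""
    image, mask = resize_nearest(image, resize), resize_nearest(mask, resize)
    if rng.random() < 0.5:                       # assumed: horizontal flip, p = 0.5
        image, mask = image[:, ::-1], mask[:, ::-1]
    top, left = rng.integers(0, resize - crop + 1, size=2)
    return (image[top:top + crop, left:left + crop],
            mask[top:top + crop, left:left + crop])

rng = np.random.default_rng(0)
image = rng.random((300, 400, 3))     # toy RGB image
mask = image[..., 0] > 0.5            # toy mask derived from the image
out_img, out_mask = augment(image, mask, rng)
print(out_img.shape, out_mask.shape)
```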