Towards High-Resolution Salient Object Detection


image.png

Original document: https://www.yuque.com/lart/papers/fgwcg5

Main Contributions

  • Provides the first high-resolution salient object detection dataset
  • Points out the shortcomings of current saliency detection models on high-resolution images, and offers a new research direction by starting from a high-resolution image dataset
  • Proposes its own solution tailored to high-resolution images: meeting the challenge by integrating both global semantic information and local high-resolution details
  • Proposes a patch sampling strategy that concentrates local processing on object boundaries (the so-called hard regions), making patch processing more efficient

Problems Addressed

First, why high-resolution saliency detection is necessary:

  • For dense prediction tasks (e.g., segmentation), low-resolution images lead to blurred object edges
  • Most images captured by today's consumer devices are high resolution and need to be processed as such, so it is worth studying salient object detection on high-resolution images

Shortcomings of existing methods on the high-resolution detection task:

  • Existing datasets consist mostly of low-resolution images, and most models are also trained with low-resolution inputs
  • Existing models typically obtain high-level semantic information through downsampling; as the resolution is progressively reduced, detail information is gradually lost, which is especially harmful for high-resolution images

Shortcomings of existing research on the high-resolution salient object detection task:

  • There is currently no dedicated high-resolution salient object detection dataset, so existing research is mostly conducted on low-resolution images, and few works study salient object detection on high-resolution images

Existing main strategies for handling high-resolution images:

  • The first is simply increasing the input size to maintain a relatively high resolution and object details after a series of pooling operations. However, the large input size results in significant increases in memory usage. Moreover, it remains a question whether we can effectively extract details from lower-level layers in such a deep network through back propagation.
  • The second method is partitioning inputs into patches and making predictions patch-by-patch. However, this type of method is time-consuming and can easily be affected by background noise.
  • The third one includes some post-processing methods such as CRF or graph cuts, which can address this issue to a certain degree. But very few works attempted to solve it directly within the neural network training process.

As a result, the problem of applying DNNs to high-resolution salient object detection is largely unsolved.

Main Method

image.png

The figure above shows the complete network architecture. The main roles of each module are:

  1. Global Semantic Network (GSN): extracts global semantic information
  2. Local Refinement Network (LRN): optimizes local details
  3. Attended Patch Sampling (APS) scheme: forces LRN to focus on uncertain regions, providing a good trade-off between performance and efficiency
  4. Global-Local Fusion Network (GLFN): enforces spatial consistency and further boosts performance at high resolution

GSN extracts semantic knowledge from a global perspective. Under GSN's guidance, LRN refines the uncertain sub-regions. Finally, GLFN takes the high-resolution image as input and further enforces spatial consistency on the fused predictions of GSN and LRN.

Network pipeline:

  1. The input image I is downsampled to low resolution (384×384) and passed through GSN to obtain a coarse prediction, which is then upsampled back to the original resolution, yielding Fi.
    image.png
  2. The image is then fed into the APS module to generate M sub-image patches, where the m-th of the M patches of the i-th input image Ii is denoted PIim.
  3. Each PIim is fed into LRN to obtain a refined saliency prediction RIim; semantic guidance from GSN is used in this step.
    image.png
  4. The outputs of GSN and LRN are fused and fed into GLFN to obtain the final prediction Si.
    image.png
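The four-stage pipeline above can be sketched as the following data flow. This is a minimal shape-level illustration, not the paper's implementation: `gsn`, `lrn`, and `resize` are placeholder stand-ins, and APS is replaced by a single dummy patch location.

```python
import numpy as np

D = 384  # GSN/LRN input size used in the paper

def gsn(img_lr):
    """Placeholder for the Global Semantic Network: coarse saliency map."""
    return np.zeros(img_lr.shape[:2], dtype=np.float32)

def lrn(patch, guidance_patch):
    """Placeholder for the Local Refinement Network: refined patch prediction."""
    return np.zeros(patch.shape[:2], dtype=np.float32)

def resize(x, h, w):
    """Nearest-neighbor resize placeholder (real code would use bilinear)."""
    ys = np.arange(h) * x.shape[0] // h
    xs = np.arange(w) * x.shape[1] // w
    return x[np.ix_(ys, xs)]

# 1. Downsample the high-resolution image, run GSN, upsample back -> F_i.
I = np.random.rand(1024, 1024, 3).astype(np.float32)
F = resize(gsn(resize(I, D, D)), I.shape[0], I.shape[1])

# 2. APS would yield M patch locations; here a single dummy (y, x, size) box.
patches = [(0, 0, D)]

# 3. Refine each patch with LRN, guided by the matching crop PF_im of F_i.
refined = []
for (y, x, s) in patches:
    PI = I[y:y + s, x:x + s]   # sub-image PI_im
    PF = F[y:y + s, x:x + s]   # guidance crop PF_im
    refined.append(lrn(PI, PF))
# 4. GSN and LRN outputs would then be fused and passed through GLFN.
```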

GSN & LRN

image.png

For the construction here, the same backbone is adopted for both GSN and LRN: the model is simply built on the FCN architecture with the pre-trained 16-layer VGG network.

Because GSN and LRN take different inputs, each has its own weaknesses:

  1. The saliency maps generated by GSN are based on the full image and embedded with rich contextual information. As a result, GSN is competent in giving a rough saliency prediction but insufficient to precisely localize salient objects.
  2. In contrast, LRN takes sub-images as input, avoiding the downsampling that causes loss of details. However, since sub-images are too local to indicate which area is more salient, LRN may be confused about which region should be highlighted. Also, LRN alone may raise false alarms in some locally salient regions.

Therefore, the paper introduces semantic guidance from GSN to LRN, which enhances global contextual knowledge while maintaining high-resolution details. As shown in Fig. 4b, given GSN's coarse prediction Fi, a patch PFim is first cropped from Fi according to the location of the patch PIim in LRN. PFim is then concatenated with the corresponding feature maps of LRN for further processing.
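The concatenation step can be sketched as attaching the guidance crop as one extra channel of the LRN feature maps. The shapes below (64 channels, 96×96) are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical shapes: a C-channel LRN feature map for one patch, plus the
# coarse-prediction crop PF_im resized to the same spatial size.
feat = np.random.rand(64, 96, 96).astype(np.float32)  # (C, H, W) LRN features
PF = np.random.rand(96, 96).astype(np.float32)        # crop of F_i, resized

# Concatenate the guidance map as an extra channel before further convolutions.
guided = np.concatenate([feat, PF[None]], axis=0)
assert guided.shape == (65, 96, 96)
```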

APS

Traditional patch-based methods usually rely on sliding windows or superpixels to infer every patch, which is time-consuming. Noting that the coarse saliency map from GSN is actually already correct for most pixels, the paper exploits this knowledge to extract patches more efficiently.

This effectively constructs a hierarchical prediction scheme: GSN provides good predictions for the easy regions, while LRN makes fine-grained predictions for the complex, hard regions, making the model both more efficient and more accurate.

Guided by GSN's output, sub-images attached to the uncertain regions can be generated. An attention map Ai is first computed to indicate all uncertain pixels:

image.png

In effect, all pixels whose values fall between the thresholds T1 and T2 are regarded as uncertain and marked as 1. Patches are then determined according to the following algorithm:

image.png

The algorithm uses several constants:

  1. D is the base crop size; the paper uses 384
  2. n controls the degree of overlap between different patches; the paper uses 5
  3. r is a random value used to generate sub-images of different sizes, with r ∈ [-D/6, D/6]
  4. T1 and T2 are two thresholds, set to 50 and 200 respectively
  5. w is the width of the non-zero region in Ai
  6. XL and XR are the leftmost and rightmost x-coordinates of the non-zero region in Ai

We have performed grid search for setting these hyper-parameters and found that the results were not sensitive to their specific choices.
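The thresholding step and the bounding statistics above can be sketched as follows. The attention-map definition and the constants (T1, T2, D, n) are from the paper; the synthetic map and the stride formula at the end are assumptions, since the exact sampling loop lives in the paper's Algorithm 1, which is not reproduced here.

```python
import numpy as np

T1, T2 = 50, 200  # uncertainty thresholds from the paper (on 0-255 maps)
D, n = 384, 5     # base crop size and overlap-control constant

# Coarse GSN prediction scaled to 0-255; a synthetic example here.
F = np.zeros((1024, 1024), dtype=np.float32)
F[300:700, 400:800] = 128  # an "uncertain" block between T1 and T2

# Attention map A_i: 1 wherever the prediction is neither confidently
# background (< T1) nor confidently salient (> T2).
A = ((F > T1) & (F < T2)).astype(np.uint8)

# Bounding statistics of the non-zero region used by the sampling algorithm.
ys, xs = np.nonzero(A)
XL, XR = xs.min(), xs.max()  # leftmost / rightmost non-zero x-coordinates
w = XR - XL + 1              # width of the non-zero region

# One plausible reading of the overlap control: windows of size ~D spaced so
# that n governs how much consecutive patches overlap (an assumption, not
# the paper's exact rule).
stride = max(1, (w - D) // n) if w > D else w
```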

GLFN

image.png

The final prediction is obtained by fusing the outputs of GSN and LRN. As for the fusion strategy, a simple approach is to directly replace the corresponding uncertain regions of GSN's result Fi with the RIim from LRN's results, averaging over overlapping regions. However, this kind of fusion lacks spatial consistency and does not leverage the rich details in the original high-resolution images.
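The naive paste-and-average baseline described above can be sketched with an accumulator and a counter map. The tiny 8×8 maps and patch placements are illustrative only.

```python
import numpy as np

# Coarse full-image prediction F_i and two refined patches that overlap.
F = np.full((8, 8), 0.5, dtype=np.float32)
patches = [((0, 0), np.ones((4, 4), np.float32)),
           ((0, 2), np.zeros((4, 4), np.float32))]  # (top-left, RI_im)

# Paste refined patches over F, averaging where patches overlap.
acc = np.zeros_like(F)
cnt = np.zeros_like(F)
for (y, x), RI in patches:
    h, w = RI.shape
    acc[y:y + h, x:x + w] += RI
    cnt[y:y + h, x:x + w] += 1
# Pixels covered by at least one patch take the patch average;
# uncovered pixels keep the coarse GSN value.
fused = np.where(cnt > 0, acc / np.maximum(cnt, 1), F)
```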

Instead, the paper proposes directly training a network that incorporates high-resolution information to help fuse GSN and LRN. To maintain all the high-resolution details from images, this network should not include any pooling layers or convolutional layers with large strides. Due to GPU memory limits, a lightweight Global-Local Fusion Network (GLFN) is proposed to handle this; its detailed structure is shown in Fig. 5.

  1. High-resolution RGB images and combined maps from GSN and LRN are concatenated together to be the inputs of GLFN.
  2. GLFN consists of several convolutional layers with dense connectivity. The growth rate g is set to 2 to save memory, and the bottleneck layers (1×1 convolution) produce 4g feature maps.
  3. On top of these densely connected layers, four dilated convolutional layers are added to enlarge the receptive field. All dilated convolutional layers share the same kernel size and output channels, i.e., k = 3 and c = 2, with dilation rates of 1, 6, 12, 18 respectively.
  4. At last, a 3×3 convolution is appended for the final prediction.
  5. Notably, the proposed GLFN has an extremely small model size (i.e., 11.9 kB).
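The point of the dilated stack is a large receptive field with no pooling. Under the standard rule that each stride-1 layer adds (k − 1) × dilation to the receptive field, the four 3×3 layers listed above can be checked with a few lines of arithmetic:

```python
# Receptive field of the four stacked 3x3 dilated convolutions in GLFN
# (dilations 1, 6, 12, 18, all stride 1). Each layer adds (k - 1) * dilation.
k = 3
rf = 1
for d in (1, 6, 12, 18):
    rf += (k - 1) * d
print(rf)  # 75: a 75x75 receptive field without any pooling
```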

Implementation Details

  • All experiments are conducted on a PC with an i7-8700 CPU and a 1080 Ti GPU, with the Caffe toolbox.
  • In our method, every stage is trained to minimize a pixelwise softmax loss function using stochastic gradient descent (SGD).
  • Empirically, the momentum parameter is set to 0.9 and the weight decay is set to 0.0005.
  • For GSN and LRN:
    • The inputs are first warped into 384×384 and the batch size is set to 32.
    • The weights in block 1 to block 5 are initialized with the pre-trained VGG model, while weight parameters of newly-added convolutional layers are randomly initialized by using the “msra” method.
    • The learning rates of the pre-trained and newly-added layers are set to 1e-3 and 1e-2, respectively.
  • GLFN:
    • It is trained from scratch, and its weight parameters of convolutional layers are also randomly initialized by using the “msra” method.
    • Its inputs are warped into 1024×1024 and the batch size is set to 2.
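The training settings above can be collected into one place. This is just the paper's reported hyper-parameters expressed as a plain dict; the original uses Caffe solver files, and the key names here are illustrative.

```python
# Training configuration reported in the paper, as an illustrative dict.
solver = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "weight_decay": 5e-4,
    "gsn_lrn": {
        "input_size": (384, 384),
        "batch_size": 32,
        "lr_pretrained": 1e-3,  # VGG blocks 1-5
        "lr_new_layers": 1e-2,  # newly-added, "msra"-initialized layers
    },
    "glfn": {
        "input_size": (1024, 1024),
        "batch_size": 2,
        "init": "msra",         # trained from scratch
    },
}
```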

image.png
image.png
image.png
