Delve into FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation

1. What is the contribution of this paper?

This paper proposes a joint upsampling module named Joint Pyramid Upsampling (JPU), obtained by formulating the task of extracting high-resolution feature maps as a joint upsampling problem. This module can replace the dilated convolutions embedded in the backbone to extract high-resolution feature maps, resulting in lower computational complexity and a smaller memory footprint with no loss in performance.
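To make the contrast concrete, here is my own small sketch (not code from the paper): torchvision's ResNet exposes a `replace_stride_with_dilation` flag that turns the stride-2 convolutions of the last two stages into dilated ones, which is exactly the DilatedFCN-style backbone that JPU is meant to make unnecessary.

```python
import torch
from torchvision.models import resnet101

# DilatedFCN-style backbone: conv4/conv5 keep stride 1 and use dilation,
# so the final feature map stays at output stride 8.
dilated = resnet101(replace_stride_with_dilation=[False, True, True])

# FastFCN-style backbone: the plain strided ResNet, final output stride 32.
# High-resolution features are then recovered by the JPU head instead.
strided = resnet101()

def last_stage_size(model, x):
    # run the stem and the four residual stages, return conv5's spatial size
    x = model.maxpool(model.relu(model.bn1(model.conv1(x))))
    for stage in (model.layer1, model.layer2, model.layer3, model.layer4):
        x = stage(x)
    return tuple(x.shape[-2:])

x = torch.randn(1, 3, 512, 512)
print(last_stage_size(dilated, x))  # (64, 64) -> output stride 8
print(last_stage_size(strided, x))  # (16, 16) -> output stride 32
```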

2. What is the method?

Main idea: the feature map extracted by a dilated convolution can be approximated from a regular convolution whose input $x_l$ is downsampled from the input $x$ of the dilated convolution. Formally, given the input feature map $x$ and its downsampled version $x_l$ (usually obtained with a regular convolution of stride $\ge 2$), the output feature map $y_d$ of the dilated convolution on $x$ can be approximated from the output feature map $y_s$ of the regular convolution on $x_l$. Mathematically,

$$y_d \approx \hat{h}(x), \quad \text{where} \quad \hat{h} = \arg\min_{h \in \mathcal{H}} \lVert y_s - h(x_l) \rVert, \quad x_l = \mathrm{Conv}_{stride \ge 2}(x)$$

This is exactly a joint upsampling problem. Thus, the key of the method is how to find the mapping $\hat{h}$. In this paper, the authors design the JPU (Joint Pyramid Upsampling) module to simulate this optimization process. The goal is to produce a feature map with output stride (OS) 8 while the last backbone feature map has a downsampling rate of 32. Here is the JPU module figure:

[Figure: the JPU module]
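For reference, here is a minimal PyTorch sketch of a JPU-like head in the spirit of the figure; it is my own simplification, and the channel counts (512/1024/2048 for conv3/conv4/conv5, internal width 512) are assumptions matching ResNet-101. Each input is projected by a 3x3 convolution, upsampled to conv3's resolution (OS = 8) and concatenated, then processed by parallel separable convolutions with dilation rates 1, 2, 4, 8.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableConv(nn.Module):
    """Depthwise 3x3 convolution (with dilation) followed by a pointwise 1x1."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=dilation,
                                   dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.pointwise(self.depthwise(x))))

class JPU(nn.Module):
    """JPU-like head: fuse conv3/conv4/conv5 into one OS=8 feature map."""
    def __init__(self, in_channels=(512, 1024, 2048), width=512):
        super().__init__()
        # one 3x3 projection per input feature map
        self.proj = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, width, 3, padding=1, bias=False),
                          nn.BatchNorm2d(width), nn.ReLU(inplace=True))
            for c in in_channels)
        # parallel separable convolutions with different dilation rates
        self.dilated = nn.ModuleList(
            SeparableConv(3 * width, width, d) for d in (1, 2, 4, 8))

    def forward(self, c3, c4, c5):
        feats = [p(c) for p, c in zip(self.proj, (c3, c4, c5))]
        size = feats[0].shape[-2:]                 # conv3 resolution (OS = 8)
        feats = [F.interpolate(f, size=size, mode='bilinear',
                               align_corners=False) for f in feats]
        x = torch.cat(feats, dim=1)
        return torch.cat([d(x) for d in self.dilated], dim=1)

# toy usage with ResNet-101-like channel counts at OS 8 / 16 / 32
c3 = torch.randn(1, 512, 64, 64)
c4 = torch.randn(1, 1024, 32, 32)
c5 = torch.randn(1, 2048, 16, 16)
print(JPU()(c3, c4, c5).shape)  # torch.Size([1, 2048, 64, 64])
```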

3. My comments

This paper redefines the relationship between dilated convolution and regular convolution with stride $\ge 2$ as a joint upsampling problem, and gives a reasonable proof; the detailed derivation can be found in Sec. 3.3 of the paper. It then proposes the Joint Pyramid Upsampling module to solve for the optimal mapping $\hat{h}$. However, I think the authors did not explain clearly why JPU can solve for the mapping $\hat{h}$. For example, if the trained JPU module is regarded as the optimal $\hat{h}$, what are the high-resolution input, the low-resolution input, and the low-resolution target, which are essential in a joint upsampling problem? Thus, it is not convincing that JPU actually solves for $\hat{h}$. I think JPU simply collects richer contextual information for prediction, and can be considered an extension of ASPP.

4. Something worth thinking about

4.1 Why are the computational complexity and memory footprint lower?

In the paper, the authors state that "Compared to DilatedFCN, our method takes 4 times fewer computation and memory resources in 23 residual blocks (69 layers) and 16 times fewer in 3 blocks (9 layers) when the backbone is ResNet-101." Why 4 times and 16 times fewer? The conv4 feature map's spatial size differs by a factor of 2 per side and conv5's by a factor of 4, so why are the savings not 2 times and 4 times? How is this calculated?
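One way to see where the factors might come from (my own back-of-the-envelope reasoning, not an explanation from the paper): the FLOPs and activation memory of a convolutional layer scale with the output area H x W, so a feature map that is 2 times smaller per side costs roughly 4 times less, and one that is 4 times smaller per side costs roughly 16 times less. A quick sketch, with illustrative layer sizes:

```python
def conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulate count of a k x k convolution producing an h x w map."""
    return h * w * c_in * c_out * k * k

# conv4 of ResNet-101: OS=8 in DilatedFCN vs OS=16 in FastFCN (e.g. 512x512 input)
print(conv_macs(64, 64, 1024, 1024) / conv_macs(32, 32, 1024, 1024))  # 4.0

# conv5 of ResNet-101: OS=8 in DilatedFCN vs OS=32 in FastFCN
print(conv_macs(64, 64, 2048, 2048) / conv_macs(16, 16, 2048, 2048))  # 16.0
```

So a per-side factor of 2 in resolution becomes a factor of 4 in area (and hence in cost), and a per-side factor of 4 becomes 16.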
