1. What is the contribution of this paper?
This paper proposes a joint upsampling module named Joint Pyramid Upsampling (JPU), obtained by formulating the task of extracting high-resolution feature maps as a joint upsampling problem. The module can replace the dilated convolutions embedded in the backbone for extracting high-resolution feature maps, resulting in lower computation complexity and memory footprint with no performance loss.
2. What is the method?
Main idea: the features extracted by a dilated convolution can be approximated by a regular convolution whose input is downsampled from the input of the dilated convolution. Formally, given an input feature map $x$ and $x_0$, where $x_0$ is downsampled from $x$ (usually by a regular convolution with stride $\ge 2$), the output of a dilated convolution on $x$ can be approximated from the output $y_0 = C_r(x_0)$ of a regular convolution $C_r$ on $x_0$. Mathematically,
$$\hat{h} = \arg\min_{h \in \mathcal{H}} \lVert y_0 - h(x_0) \rVert, \qquad \hat{y} = \hat{h}(x),$$
where $\mathcal{H}$ is a set of possible transformations. This is exactly the formulation of joint upsampling: learn the mapping $h$ from the low-resolution pair $(x_0, y_0)$, then apply it to the high-resolution input $x$. Thus, the key of the method is how to find the mapping $h$. The authors design the JPU (Joint Pyramid Upsampling) module to simulate this optimization process. The goal is to produce a feature map with output stride (OS) 8 when the last backbone feature map has a downsampling rate of 32. See the JPU module figure in the paper.
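The split-convolve-merge equivalence that underlies this observation (a dilated convolution equals regular convolutions applied to phase-subsampled inputs, then interleaved) can be checked numerically. This is a minimal 1-D sketch under my own notation, not code from the paper:

```python
import numpy as np

def dilated_conv1d(x, k, rate):
    """Valid 1-D dilated convolution (cross-correlation form)."""
    K = len(k)
    out_len = len(x) - (K - 1) * rate
    return np.array([sum(k[j] * x[i + j * rate] for j in range(K))
                     for i in range(out_len)])

def regular_conv1d(x, k):
    # A regular convolution is a dilated convolution with rate 1.
    return dilated_conv1d(x, k, rate=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(12)   # input feature (1-D for simplicity)
k = rng.standard_normal(3)    # shared kernel
r = 2                         # dilation rate

y_dilated = dilated_conv1d(x, k, r)

# Split the input into r phase-subsequences, convolve each with the
# *regular* kernel, then merge (interleave) the per-phase outputs.
phases = [regular_conv1d(x[p::r], k) for p in range(r)]
y_merged = np.empty_like(y_dilated)
for p in range(r):
    y_merged[p::r] = phases[p]

assert np.allclose(y_dilated, y_merged)
```

The phase-subsampled inputs here play the role of the downsampled $x_0$: operating on them with a regular convolution reproduces the dilated result exactly, which is why recovering the high-resolution output reduces to an upsampling problem.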
3. My comment
This paper redefines the relationship between dilated convolution and regular convolution with stride ≥ 2 as a joint upsampling problem, and gives a reasonable proof; the detailed derivation can be found in Sec. 3.3 of the paper. It then proposes the Joint Pyramid Upsampling module to solve for the optimal mapping $h$. However, I think the authors did not explain clearly why JPU can solve for $h$. For example, if the trained JPU module is regarded as the optimized $\hat{h}$, what are the high-resolution input, the low-resolution input, and the low-resolution target, all of which are essential in a joint upsampling problem? Thus, it is unreasonable to conclude that JPU solves for $h$. I think JPU just collects richer contextual information for prediction, and can be considered an extension of ASPP.
4. Something worth thinking about
4.1 Why is the computation complexity and memory footprint lower?
In the paper, the authors state: “Compared to DilatedFCN, our method takes 4 times fewer computation and memory resources in 23 residual blocks (69 layers) and 16 times fewer in 3 blocks (9 layers) when the backbone is ResNet-101.” Why 4 times and 16 times? In DilatedFCN, the conv4 feature map's spatial side length is 2 times larger than in the original backbone, and conv5's is 4 times larger, so why isn't the factor 2 times and 4 times? How is it calculated?
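One possible reading of the 4×/16× numbers, under the assumption (mine, not stated this way in the paper) that a residual block's FLOPs and activation memory scale roughly linearly with the spatial *area* of its feature map: halving the output stride doubles each spatial dimension, so the area, and hence the cost, grows by the square of the per-dimension factor:

```python
def area_ratio(os_dilated, os_regular):
    """Ratio of feature-map areas when output stride changes.

    Each halving of the output stride doubles both H and W,
    so the area (and, roughly, per-block cost) is the *square*
    of the per-dimension ratio.
    """
    return (os_regular / os_dilated) ** 2

# conv4 (23 residual blocks): DilatedFCN runs at OS=8 instead of OS=16
ratio_conv4 = area_ratio(8, 16)   # 2x per dimension -> 4x area

# conv5 (3 residual blocks): DilatedFCN runs at OS=8 instead of OS=32
ratio_conv5 = area_ratio(8, 32)   # 4x per dimension -> 16x area

print(ratio_conv4, ratio_conv5)
```

If this reading is right, the "2 times" and "4 times" intuition counts only one spatial dimension; counting both H and W gives the paper's 4× and 16× figures.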