Asymmetric Non-local Neural Networks for Semantic Segmentation
Zhen Zhu , Mengde Xu , Song Bai , Tengteng Huang , Xiang Bai
Huazhong University of Science and Technology, University of Oxford
[GitHub]: https://github.com/MendelXu/ANN
[paper]: https://arxiv.org/pdf/1908.07678.pdf
This paper is very well written.
[Non-Local Attention series]
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond [my CSDN]
Asymmetric Non-local Neural Networks for Semantic Segmentation [my CSDN]
Efficient Attention: Attention with Linear Complexities [my CSDN]
CCNet: Criss-Cross Attention for Semantic Segmentation [my CSDN]
Non-locally Enhanced Encoder-Decoder Network for Single Image De-raining [my CSDN]
Image Restoration via Residual Non-local Attention Networks [my CSDN]
Table of Contents
Asymmetric Non-local Neural Networks for Semantic Segmentation
Asymmetric Non-local Neural Network
Asymmetric Pyramid Non-local Block
Asymmetric Fusion Non-local Block
Datasets and Evaluation Metrics
Comparisons with Other Methods
Abstract
The non-local module works as a particularly useful technique for semantic segmentation while criticized for its prohibitive computation and GPU memory occupation.
[The motivation of the paper, i.e., the problem it sets out to solve]
In this paper, we present Asymmetric Non-local Neural Network to semantic segmentation, which has two prominent components: Asymmetric Pyramid Non-local Block (APNB) and Asymmetric Fusion Non-local Block (AFNB). APNB leverages a pyramid sampling module into the non-local block to largely reduce the computation and memory consumption without sacrificing the performance. AFNB is adapted from APNB to fuse the features of different levels under a sufficient consideration of long range dependencies and thus considerably improves the performance.
[Introduction of the network structure, concise and clear]
Extensive experiments on semantic segmentation benchmarks demonstrate the effectiveness and efficiency of our work. In particular, we report the state-of-the-art performance of 81.3 mIoU on the Cityscapes test set. For a 256 × 128 input, APNB is around 6 times faster than a non-local block on GPU while 28 times smaller in GPU running memory occupation.
[Finally, the experimental results should echo the motivation, i.e., show that the proposed method really does solve the problem raised in the motivation]
Introduction
Some recent studies [20, 33, 47] indicate that the performance could be improved if making sufficient use of long range dependencies. However, models that solely rely on convolutions exhibit limited ability in capturing these long range dependencies. A possible reason is the receptive field of a single convolutional layer is inadequate to cover correlated areas. Choosing a big kernel or composing a very deep network is able to enlarge the receptive field. However, such strategies require extensive computation and parameters, thus being very inefficient [44]. Consequently, several works [33, 47] resort to use global operations like non-local means [2] and spatial pyramid pooling [12, 16].
Conventional CNNs cannot capture long range dependencies efficiently. Non-local operations and spatial pyramid pooling can effectively address this problem.
In [33], Wang et al. combined CNNs and traditional non-local means [2] to compose a network module named non-local block in order to leverage features from all locations in an image. This module improves the performance of existing methods [33]. However, the prohibitive computational cost and vast GPU memory occupation hinder its usage in many real applications. The architecture of a common non-local block [33] is depicted in Fig. 1(a). The block first calculates the similarities of all locations between each other, requiring a matrix multiplication of computational complexity $O(CH^2W^2)$, given an input feature map with size $C \times H \times W$. Then it requires another matrix multiplication of computational complexity $O(CH^2W^2)$ to gather the influence of all locations to themselves. Concerning the high complexity brought by the matrix multiplications, we are interested in this work if there are efficient ways to solve this without sacrificing the performance.
This describes the structure and complexity of Non-local.
These two paragraphs state the problem.
Figure 1: Architecture of a standard non-local block (a) and the asymmetric non-local block (b). $N = H \cdot W$ while $S \ll N$.
We notice that as long as the outputs of the key branch and value branch hold the same size, the output size of the non-local block remains unchanged. Considering this, if we could sample only a few representative points from key branch and value branch, it is possible that the time complexity is significantly decreased without sacrificing the performance. This motivation is demonstrated in Fig. 1 when changing a large value N in the key branch and value branch to a much smaller value S (From (a) to (b)).
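The paper's observation and motivation. Non-local computes the correlation between every point and all other points (i.e., each pixel has its own attention map over the whole image). This paper argues that such a dense attention map is unnecessary, and instead computes, for each pixel, an attention map over a few local regions.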
In this paper, we propose a simple yet effective non-local module called Asymmetric Pyramid Non-local Block (APNB) to decrease the computation and GPU memory consumption of the standard non-local module [33] with applications to semantic segmentation. Motivated by the spatial pyramid pooling [12, 16, 47] strategy, we propose to embed a pyramid sampling module into non-local blocks, which could largely reduce the computation overhead of matrix multiplications yet provide substantial semantic feature statistics. This spirit is also related to the sub-sampling tricks [33] (e.g., max pooling). Our experiments suggest that APNB yields much better performance than those sub-sampling tricks with a decent decrease of computations. To better illustrate the boosted efficiency, we compare the GPU times of APNB and a standard non-local block in Fig. 2, averaging the running time of 10 different runs with the same configuration. Our APNB largely reduces the time cost on matrix multiplications, thus being nearly 6 times faster than a non-local block.
Describes how APNB is formed (Non-Local + Spatial Pyramid Pooling), offering low complexity and high performance.
Besides, we also adapt APNB to fuse the features of different stages of a deep network, which brings a considerable improvement over the baseline model. We call the adapted block as Asymmetric Fusion Non-local Block (AFNB). AFNB calculates the correlations between every pixel of the low-level and high-level feature maps, yielding a fused feature with long range interactions. Our network is built based on a standard ResNet-FCN model by integrating APNB and AFNB together.
Describes AFNB: it fuses low-level and high-level features, producing a fused feature with long range interactions.
Related Work
Recent advances focus on exploring the context information and can be roughly categorized into five directions:
In short, existing methods for mining context information fall into five research directions:
Encoder-Decoder: U-Net
Conditional Random Field: DeepLab
Different Convolutions: Dilated Conv.
Spatial Pyramid Pooling: PSPNet, Atrous Spatial Pyramid Pooling layer (ASPP)
Non-local Network: GCNet(19CVPR), NLNet
Different from these works, our network uniquely incorporates pyramid sampling strategies with non-local blocks to capture the semantic statistics of different scales with only a minor budget of computation, while maintaining the excellent performance as the original non-local modules.
Asymmetric Non-local Neural Network
While APNB aims to decrease the computational overhead of non-local blocks, AFNB improves the learning capacity of non-local blocks thereby improving the segmentation performance.
In other words: APNB targets the computational cost of non-local blocks, while AFNB raises their learning capacity and thereby the segmentation performance.
Revisiting Non-local Block
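I won't go over the principle of Non-local in detail; listed here are the formulas needed later. The normalizing function $f$ can take the form of softmax, rescaling, or none. Choosing softmax gives the Embedded Gaussian version.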
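For reference, the formulas that the later sections refer to (Eq. (2), Eq. (4), θ, γ) can be written roughly as below. This is a reconstruction from the symbols used elsewhere in these notes, not a copy of the paper's typesetting: the symbols $\phi$ (query embedding), $W_o$ (output 1×1 convolution) and the concatenation in the last line are assumptions. Here $X$ is the input feature map, the embeddings are flattened to size $\hat{C} \times N$, and $f$ is the normalizing function.

```latex
\begin{aligned}
\phi &= W_\phi(X), \quad \theta = W_\theta(X), \quad \gamma = W_\gamma(X) && (1)\\
V &= \phi^{\top} \times \theta, \qquad V \in \mathbb{R}^{N \times N} && (2)\\
\vec{V} &= f(V) && (3)\\
O &= \vec{V} \times \gamma^{\top}, \qquad O \in \mathbb{R}^{N \times \hat{C}} && (4)\\
Y &= \mathrm{cat}\big(W_o(O^{\top}),\, X\big) && (5)
\end{aligned}
```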
Asymmetric Pyramid Non-local Block
- Motivation and Analysis
By inspecting the general computing flow of a non-local block, one could clearly find that Eq. (2) and Eq. (4) dominate the computation. The time complexities of the two matrix multiplications are both $O(\hat{C}N^2)$. In semantic segmentation, the output of the network usually has a large resolution to retain detailed semantic features [6, 47]. That means N is large (for example in our training phase, N = 96 × 96 = 9216). Hence, the large matrix multiplication is the main cause of the inefficiency of a non-local block (see our statistic in Fig. 2). A more straightforward pipeline is given as
$$\mathbb{R}^{N \times \hat{C}} \times \mathbb{R}^{\hat{C} \times N} \rightarrow \mathbb{R}^{N \times N} \times \mathbb{R}^{N \times \hat{C}} \rightarrow \mathbb{R}^{N \times \hat{C}}. \quad (6)$$
This paragraph is easy to follow: it explains in detail why the complexity of Non-local is so high.
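As a rough illustration of why Eq. (2) and Eq. (4) dominate, the NumPy sketch below runs the two matrix multiplications on toy-sized embeddings and counts their multiply-adds. The helper name `non_local_attention`, the toy sizes, and the choice of softmax as the normalizing function are assumptions made for this example, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along `axis`.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_attention(phi, theta, gamma):
    """Hypothetical helper: phi, theta, gamma are flattened embeddings of shape (C_hat, N)."""
    V = phi.T @ theta          # Eq. (2): (N, C_hat) x (C_hat, N) -> (N, N)
    V = softmax(V, axis=1)     # normalization, here softmax over each row
    return V @ gamma.T         # Eq. (4): (N, N) x (N, C_hat) -> (N, C_hat)

rng = np.random.default_rng(0)
C_hat, H, W = 8, 12, 12        # toy sizes chosen for the example
N = H * W
phi, theta, gamma = (rng.standard_normal((C_hat, N)) for _ in range(3))

O = non_local_attention(phi, theta, gamma)
macs_toy = 2 * C_hat * N * N          # multiply-adds of the two matrix products
macs_train = 2 * C_hat * 9216 ** 2    # same count at N = 96 x 96 = 9216
print(O.shape, macs_toy, macs_train)
```

Both products scale with N squared, so going from the toy N = 144 to the training-time N = 9216 multiplies the cost by (9216 / 144)² = 4096.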
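We hold a key yet intuitive observation that by changing N to another number $S$ ($S \ll N$), the output size will remain the same, as
$$\mathbb{R}^{N \times \hat{C}} \times \mathbb{R}^{\hat{C} \times S} \rightarrow \mathbb{R}^{N \times S} \times \mathbb{R}^{S \times \hat{C}} \rightarrow \mathbb{R}^{N \times \hat{C}}. \quad (7)$$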
Returning to the design of the non-local block, changing N to a small number S is equivalent to sampling several representative points from θ and γ instead of feeding all the spatial points, as illustrated in Fig. 1. Consequently, the computational complexity could be considerably decreased.
This part is also easy to follow: N is shrunk to S, with S far smaller than N. So how is this operation actually implemented?
- Solution
Based on the above observation, we propose to add sampling modules $P_\theta$ and $P_\gamma$ after θ and γ to sample several sparse anchor points denoted as $\theta_P \in \mathbb{R}^{\hat{C} \times S}$ and $\gamma_P \in \mathbb{R}^{\hat{C} \times S}$, where S is the number of sampled anchor points. Mathematically, this is computed by
$$\theta_P = P_\theta(\theta), \quad \gamma_P = P_\gamma(\gamma). \quad (8)$$
The similarity matrix $V_P$ between the query embedding $\phi$ and the anchor points $\theta_P$ is thus calculated by
$$V_P = \phi^{\top} \times \theta_P. \quad (9)$$
Note that $V_P$ is an asymmetric matrix of size N × S. $V_P$ then goes through the same normalizing function as a standard non-local block, giving the unified similarity matrix $\vec{V}_P$. And the attention output is acquired by
$$O_P = \vec{V}_P \times \gamma_P^{\top}, \quad (10)$$
where the output is in the same size as that of Eq. (4). Following non-local blocks, the final output $Y_P$ is given as
$$Y_P = \mathrm{cat}\big(W_o(O_P^{\top}),\, X\big). \quad (11)$$
The time complexity of such an asymmetric matrix multiplication is only $O(\hat{C}NS)$, significantly lower than $O(\hat{C}N^2)$ in a standard non-local block. It is ideal that S should be much smaller than N. However, it is hard to ensure that when S is small, the performance would not drop too much in the meantime.
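The asymmetric matrix multiplication described above can be sketched in NumPy as follows. The anchors are passed in as arguments. Picking S evenly spaced locations is only a placeholder sampler for this sketch (not the sampling the paper proposes), and `asymmetric_attention`, the channel count and S = 110 are example choices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along `axis`.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_attention(phi, theta_p, gamma_p):
    """phi: (C_hat, N) queries; theta_p, gamma_p: (C_hat, S) sampled anchors."""
    V_p = softmax(phi.T @ theta_p, axis=1)   # (N, S) asymmetric similarity matrix
    return V_p @ gamma_p.T                   # (N, C_hat), same size as the standard output

rng = np.random.default_rng(0)
C_hat, N, S = 8, 96 * 96, 110                # N from the training setting, S as an example
phi, theta, gamma = (rng.standard_normal((C_hat, N)) for _ in range(3))

# Placeholder sampler (an assumption of this sketch): S evenly spaced locations.
idx = np.linspace(0, N - 1, S).astype(int)
theta_p, gamma_p = theta[:, idx], gamma[:, idx]

O_p = asymmetric_attention(phi, theta_p, gamma_p)
print(O_p.shape)   # (9216, 8)
print(N / S)       # ratio between the O(C_hat N^2) and O(C_hat N S) costs
```

The query side is untouched, so every pixel still receives an output vector. Only the key and value sides shrink from N to S columns.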
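This part details how pyramid pooling is used to reduce the complexity. In Non-local, the key and value are obtained by 1x1 convolutions and have size $\hat{C} \times N$. Here, after the 1x1 convolutions, the key and value are additionally sampled by pyramid pooling, giving size $\hat{C} \times S$. That is, every pixel still serves as a query in both cases, but where Non-local compares each query against all N positions, APNB samples the feature map and compares each query against only a few sparse anchor points. The computational complexity thus drops from $O(\hat{C}N^2)$ to $O(\hat{C}NS)$.
The authors also note that S cannot be too small.
The next part goes further into how the pyramid pooling actually operates.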
- Spatial Pyramid Pooling
As discovered by previous works [16, 47], global and multi-scale representations are useful for categorizing scene semantics. Such representations can be comprehensively carved by Spatial Pyramid Pooling [16], which contains several pooling layers with different output sizes in parallel. In addition to this virtue, spatial pyramid pooling is also parameter-free and very efficient. Therefore, we embed pyramid pooling in the non-local block to enhance the global representations while reducing the computational overhead.
By doing so, we now arrive at the final formulation of Asymmetric Pyramid Non-local Block (APNB), as given in Fig. 3. As can be seen, our APNB derives from the design of a standard non-local block [33]. A vital change is to add a spatial pyramid pooling module after θ and γ respectively to sample representative anchors. This sampling process is clearly depicted in Fig. 4, where several pooling layers are applied after θ or γ and then the four pooling results are flattened and concatenated to serve as the input to the next layer. We denote the spatial pyramid pooling modules as $P_\theta^n$ and $P_\gamma^n$, where the superscript $n$ means the width (or height) of the output size of the pooling layer (empirically, the width is equal to the height). In our model, we set $n \subseteq \{1, 3, 6, 8\}$. Then the total number of the anchor points is
$$S = 110 = \sum_{n \in \{1, 3, 6, 8\}} n^2.$$
Figure 3: Overview of the proposed Asymmetric Non-local Neural Network. In our implementation, the key branch and the value branch in APNB share the same 1×1 convolution and sampling module, which decreases the number of parameters and computation without sacrificing the performance.
Figure 4: Demonstration of the pyramid max or average sampling process.
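Details of the SPP operation. Taking H = 128 and W = 256 as an example, with S = 110 anchor points the method cuts the cost of the matrix multiplications to about 1/298 of Non-local's, since (256 × 128)/110 ≈ 298.
Note: in APNB, the authors share the 1x1 conv and the pyramid pooling between the key and value branches, i.e., both use the same set of weights.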
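A minimal NumPy sketch of the pyramid sampling step, assuming average pooling with adaptive bins and output sizes 1, 3, 6 and 8 (which give 1 + 9 + 36 + 64 = 110 anchors). The function names `adaptive_avg_pool` and `pyramid_sample` and the channel count are hypothetical. Fig. 4 mentions max or average sampling, and only the average variant is shown here.

```python
import numpy as np

def adaptive_avg_pool(x, n):
    """Average-pool x of shape (C, H, W) into an (C, n, n) grid of bins."""
    C, H, W = x.shape
    out = np.empty((C, n, n), dtype=x.dtype)
    for i in range(n):
        h0, h1 = (i * H) // n, ((i + 1) * H + n - 1) // n   # floor start, ceil end
        for j in range(n):
            w0, w1 = (j * W) // n, ((j + 1) * W + n - 1) // n
            out[:, i, j] = x[:, h0:h1, w0:w1].mean(axis=(1, 2))
    return out

def pyramid_sample(x, sizes=(1, 3, 6, 8)):
    """Pool at every size, flatten each result, and concatenate: (C, H, W) -> (C, S)."""
    C = x.shape[0]
    return np.concatenate([adaptive_avg_pool(x, n).reshape(C, -1) for n in sizes], axis=1)

rng = np.random.default_rng(0)
theta = rng.standard_normal((4, 128, 256))   # toy embedding with H = 128, W = 256
anchors = pyramid_sample(theta)

N, S = 128 * 256, anchors.shape[1]
print(anchors.shape)    # (4, 110)
print(round(N / S))     # 298
```

Because the key and value branches share the same 1x1 convolution and sampling module, one call of this kind would serve both branches.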
Asymmetric Fusion Non-local Block
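Fusing features of different levels is helpful to semantic segmentation and object tracking as hinted in [16, 18, 26, 41, 46, 51]. Common fusing operations such as addition/concatenation are conducted in a pixel-wise and local manner. We provide an alternative that leverages long range dependencies through a non-local block to fuse multi-level features, called Fusion Non-local Block (FNB). A standard non-local block only has one input source while FNB has two: a high-level feature map $X_h \in \mathbb{R}^{C_h \times N_h}$ and a low-level feature map $X_l \in \mathbb{R}^{C_l \times N_l}$. $N_h$ and $N_l$ are the numbers of spatial locations of $X_h$ and $X_l$, respectively. $C_h$ and $C_l$ are the channel numbers of $X_h$ and $X_l$, respectively. Likewise, 1 × 1 convolutions $W_\phi^h$, $W_\theta^l$ and $W_\gamma^l$ are used to transform $X_h$ and $X_l$ to embeddings $\phi_h \in \mathbb{R}^{\hat{C} \times N_h}$, $\theta_l \in \mathbb{R}^{\hat{C} \times N_l}$ and $\gamma_l \in \mathbb{R}^{\hat{C} \times N_l}$ as
$$\phi_h = W_\phi^h(X_h), \quad \theta_l = W_\theta^l(X_l), \quad \gamma_l = W_\gamma^l(X_l).$$
Then the similarity matrix $V_F \in \mathbb{R}^{N_h \times N_l}$ between $\phi_h$ and $\theta_l$ is computed by
$$V_F = \phi_h^{\top} \times \theta_l.$$
After normalization, the unified similarity matrix $\vec{V}_F$ is combined with $\gamma_l$ as
$$O_F = \vec{V}_F \times \gamma_l^{\top}.$$
The output $O_F$ reflects the bonus of $X_l$ to $X_h$, which are carefully selected from all locations in $X_l$.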
Details of AFNB; Fig. 3 already explains them quite clearly.
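The two-input fusion can be sketched in NumPy as below, under stated assumptions: the feature maps are already flattened, each 1×1 convolution is written as a bias-free matrix product over channels, softmax is used for normalization, and all names and sizes (`fusion_non_local`, `W_phi`, `W_o`, the channel counts) are made up for this example. The sketch uses the full low-level map. An asymmetric variant would pass pooled anchors in place of `theta_l` and `gamma_l`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along `axis`.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fusion_non_local(x_h, x_l, W_phi, W_theta, W_gamma, W_o):
    """x_h: (C_h, N_h) high-level map, x_l: (C_l, N_l) low-level map (both flattened)."""
    phi_h = W_phi @ x_h                        # (C_hat, N_h) query from the high level
    theta_l = W_theta @ x_l                    # (C_hat, N_l) key from the low level
    gamma_l = W_gamma @ x_l                    # (C_hat, N_l) value from the low level
    V_f = softmax(phi_h.T @ theta_l, axis=1)   # (N_h, N_l) cross-level similarity
    O_f = V_f @ gamma_l.T                      # (N_h, C_hat)
    fused = W_o @ O_f.T                        # (C_h, N_h) back to the high-level channels
    return np.concatenate([fused, x_h], axis=0)  # concatenate with the input source

rng = np.random.default_rng(0)
C_h, C_l, C_hat, N_h, N_l = 16, 8, 4, 64, 64   # toy sizes for the example
x_h = rng.standard_normal((C_h, N_h))
x_l = rng.standard_normal((C_l, N_l))
W_phi = rng.standard_normal((C_hat, C_h))
W_theta = rng.standard_normal((C_hat, C_l))
W_gamma = rng.standard_normal((C_hat, C_l))
W_o = rng.standard_normal((C_h, C_hat))

y = fusion_non_local(x_h, x_l, W_phi, W_theta, W_gamma, W_o)
print(y.shape)   # (32, 64)
```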
Network Architecture
The overall architecture of our network is depicted in Fig. 3. We choose ResNet-101 [13] as our backbone network following the choice of most previous works [38, 47, 48]. We remove the last two down-sampling operations and use the dilation convolutions instead to hold the feature maps from the last two stages 1/8 of the input image. Concretely, all the feature maps in the last three stages have the same spatial size. According to our experimental trials, we fuse the features of Stage4 and Stage5 using AFNB. The fused features are thereupon concatenated with the feature maps after Stage5, avoiding situations that AFNB could not produce accurate strengthened features particularly when the training just begins and degrades the overall performance. Such features, full of rich long range cues from different feature levels, serve as the input to APNB, which then help to discover the correlations among pixels. As done for AFNB, the output of APNB is also concatenated with its input source. Note that in our implementation for APNB, $W_\theta$ and $W_\gamma$ share parameters in order to save parameters and computation, following the design of [42]. This design doesn’t decrease the performance of APNB. Finally, a classifier is followed up to produce channel-wise semantic maps that later receive their supervisions from the ground truth maps. Note we also add another supervision to Stage4 following the settings of [47], as it is beneficial to improve the performance.
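Network composition details:
1. ResNet-101
2. The resolution of Stage4 and Stage5 is kept unchanged, which is exactly the dilated ResNet [see my blog post on Dilated Residual Networks]
3. AFNB fuses Stage4 and Stage5
4. AFNB is followed by APNB, connected in a residual-like manner
5. $W_\theta$ and $W_\gamma$ share parameters [42: OCNet: Object context network for scene parsing]
6. add another supervision to Stage4 [47: Pyramid scene parsing network]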
Experiments
Datasets and Evaluation Metrics
Datasets: Cityscapes [9], ADE20K [50] and PASCAL Context [21].
Evaluation Metric: Mean IoU (mean of class-wise intersection over union)
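For concreteness, mean IoU can be computed from a confusion matrix as in the sketch below. The function name `mean_iou` and the ignore label 255 are assumptions of this example, not something stated in these notes, and classes that never appear in either map are skipped when averaging.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """pred, target: integer label maps of the same shape."""
    mask = target != ignore_index                 # drop unlabeled pixels
    pred, target = pred[mask], target[mask]
    # conf[t, p] counts pixels with ground truth t that were predicted as p.
    conf = np.bincount(num_classes * target + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0                             # classes present in pred or target
    return (inter[valid] / union[valid]).mean()

target = np.array([[0, 0, 1], [1, 2, 255]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
miou = mean_iou(pred, target, num_classes=3)
print(miou)   # per-class IoU is 1/2, 2/3 and 1, so the mean is 13/18
```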
Implementation Details
Training Objectives
Following [47: Pyramid scene parsing network], our model has two supervisions: one after the final output of our model while another at the output layer of Stage4. For $L_{final}$, we perform online hard pixel mining, which excels at coping with difficult cases.
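One common way to realize online hard pixel mining is to average the cross-entropy only over pixels whose predicted probability for the true class falls below a threshold, while always keeping a minimum number of the hardest pixels. The sketch below follows that recipe. The threshold, `min_kept`, the auxiliary weight and all function names are placeholder assumptions, not values or code taken from the paper.

```python
import numpy as np

def pixel_log_probs(logits, target):
    """logits: (P, K) per-pixel class scores; target: (P,) labels. Returns log p(true class)."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return log_prob[np.arange(target.size), target]

def ohem_cross_entropy(logits, target, thresh=0.7, min_kept=2):
    true_lp = pixel_log_probs(logits, target)
    prob = np.exp(true_lp)
    k = min(min_kept, prob.size)
    kth_smallest = np.sort(prob)[k - 1]
    # Keep pixels below the threshold, raising it if fewer than min_kept would survive.
    keep = prob <= max(thresh, kth_smallest)
    return -true_lp[keep].mean()

def total_loss(final_logits, aux_logits, target, aux_weight=0.4):
    # Hard pixel mining on the final output, plain cross-entropy on the Stage4 output.
    l_final = ohem_cross_entropy(final_logits, target)
    l_aux = -pixel_log_probs(aux_logits, target).mean()
    return l_final + aux_weight * l_aux

logits = np.array([[5.0, 0.0], [0.0, 5.0], [0.1, 0.0], [0.0, 0.2]])
target = np.array([0, 1, 0, 0])
loss_hard = ohem_cross_entropy(logits, target)             # averages the two hard pixels only
loss_all = ohem_cross_entropy(logits, target, thresh=1.0)  # keeps every pixel
print(loss_hard, loss_all)
```

In this toy case the two confidently correct pixels are dropped, so the mined loss is larger than the plain average and the gradient concentrates on the difficult pixels.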
Comparisons with Other Methods
1. Efficiency Comparison with Non-local Block
2. Performance Comparisons
3. Ablation Study
Efficacy of the APNB and AFNB
Selection of Sampling Methods:
Influence of the Anchor Points Numbers
Conclusion
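Personal summary:
This paper has two highlights:
1. Difference from Non-local:
Non-local: the correlation between every point (query) and every other point; pixel-wise
APNB: the relation between every point and regional content; patch-wise
2. Fusing features of different levels: AFNB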