The non-local module works as a particularly useful technique for semantic segmentation while criticized for its prohibitive computation and GPU memory occupation.

[The paper's motivation, i.e., the problem it sets out to solve]

In this paper, we present Asymmetric Non-local Neural Network to semantic segmentation, which has two prominent components: Asymmetric Pyramid Non-local Block (APNB) and Asymmetric Fusion Non-local Block (AFNB). APNB leverages a pyramid sampling module into the non-local block to largely reduce the computation and memory consumption without sacrificing the performance. AFNB is adapted from APNB to fuse the features of different levels under a sufficient consideration of long range dependencies and thus considerably improves the performance.

[Introduction to the network structure, concise and clear]

Extensive experiments on semantic segmentation benchmarks demonstrate the effectiveness and efficiency of our work. In particular, we report the state-of-the-art performance of 81.3 mIoU on the Cityscapes test set. For a 256 × 128 input, APNB is around 6 times faster than a non-local block on GPU while 28 times smaller in GPU running memory occupation.

[Finally, the experimental results should echo the motivation, i.e., show that the proposed method really does solve the problem raised in the motivation]

Introduction

Some recent studies [20, 33, 47] indicate that the performance could be improved if making sufficient use of long range dependencies. However, models that solely rely on convolutions exhibit limited ability in capturing these long range dependencies. A possible reason is the receptive field of a single convolutional layer is inadequate to cover correlated areas. Choosing a big kernel or composing a very deep network is able to enlarge the receptive field. However, such strategies require extensive computation and parameters, thus being very inefficient [44]. Consequently, several works [33, 47] resort to use global operations like non-local means [2] and spatial pyramid pooling [12, 16].

Traditional CNNs cannot capture long range dependencies efficiently. Non-local operations and spatial pyramid pooling can effectively address this problem.

In [33], Wang et al. combined CNNs and traditional non-local means [2] to compose a network module named non-local block in order to leverage features from all locations in an image. This module improves the performance of existing methods [33]. However, the prohibitive computational cost and vast GPU memory occupation hinder its usage in many real applications. The architecture of a common non-local block [33] is depicted in Fig. 1(a). The block first calculates the similarities of all locations between each other, requiring a matrix multiplication of computational complexity $\small O(CH^2W^2)$ , given an input feature map with size $C\times H\times W$ . Then it requires another matrix multiplication of computational complexity $\small O(CH^2W^2)$ to gather the influence of all locations to themselves. Concerning the high complexity brought by the matrix multiplications, we are interested in this work if there are efficient ways to solve this without sacrificing the performance.

Describes the structure and complexity of the non-local block.

These two paragraphs pose the problem.

Figure 1: Architecture of a standard non-local block (a) and the asymmetric non-local block (b). $N=H\cdot W$ while $S\ll N$ .

We notice that as long as the outputs of the key branch and value branch hold the same size, the output size of the non-local block remains unchanged. Considering this, if we could sample only a few representative points from key branch and value branch, it is possible that the time complexity is significantly decreased without sacrificing the performance. This motivation is demonstrated in Fig. 1 when changing a large value N in the key branch and value branch to a much smaller value S (From (a) to (b)).

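The paper's observation and motivation. Non-local computes the correlation between each point and all other points (i.e., every pixel has its own attention map over all positions). This paper argues that the full attention map of each point is unnecessary, and instead computes, for each pixel, an attention map over a few local regions.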

In this paper, we propose a simple yet effective non-local module called Asymmetric Pyramid Non-local Block (APNB) to decrease the computation and GPU memory consumption of the standard non-local module [33] with applications to semantic segmentation. Motivated by the spatial pyramid pooling [12, 16, 47] strategy, we propose to embed a pyramid sampling module into non-local blocks, which could largely reduce the computation overhead of matrix multiplications yet provide substantial semantic feature statistics. This spirit is also related to the sub-sampling tricks [33] (e.g., max pooling). Our experiments suggest that APNB yields much better performance than those sub-sampling tricks with a decent decrease of computations. To better illustrate the boosted efficiency, we compare the GPU times of APNB and a standard non-local block in Fig. 2, averaging the running time of 10 different runs with the same configuration. Our APNB largely reduces the time cost on matrix multiplications, thus being nearly 6 times faster than a non-local block.

Describes how APNB is formed (Non-local + Spatial Pyramid Pooling): low complexity with high performance.

Besides, we also adapt APNB to fuse the features of different stages of a deep network, which brings a considerable improvement over the baseline model. We call the adapted block as Asymmetric Fusion Non-local Block (AFNB). AFNB calculates the correlations between every pixel of the low-level and high-level feature maps, yielding a fused feature with long range interactions. Our network is built based on a standard ResNet-FCN model by integrating APNB and AFNB together.

Describes AFNB: it fuses low-level and high-level features, producing a fused feature with long range interactions.

Recent advances focus on exploring the context information and can be roughly categorized into five directions:

To exploit context information, existing methods roughly fall into 5 research directions:

Encoder-Decoder: U-Net

Conditional Random Field: Deeplab

Different Convolutions: Dilated Conv.

Spatial Pyramid Pooling: PSPNet, Atrous Spatial Pyramid Pooling layer (ASPP)

Non-local Network: GCNet(19CVPR), NLNet

Different from these works, our network uniquely incorporates pyramid sampling strategies with non-local blocks to capture the semantic statistics of different scales with only a minor budget of computation, while maintaining the excellent performance as the original non-local modules.

Asymmetric Non-local Neural Network

While APNB aims to decrease the computational overhead of non-local blocks, AFNB improves the learning capacity of non-local blocks thereby improving the segmentation performance.

In short: APNB targets the computational cost of non-local blocks, while AFNB targets their learning capacity and hence the segmentation accuracy.

Revisiting Non-local Block

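I won't go over the principle of Non-local here; below are the formulas that will be needed later. Given an input $\small X\in\mathcal{R}^{C\times H\times W}$ , three 1 × 1 convolutions $\small W_\varphi$ , $\small W_\theta$ and $\small W_\gamma$ produce the query, key and value embeddings, each flattened to size $\small \hat{C}\times N$ with $\small N=H\cdot W$ , and $\small W_o$ is the output 1 × 1 convolution:

$\large \varphi=W_\varphi(X),~\theta=W_\theta(X),~\gamma=W_\gamma(X)$ (1)

$\large V=\varphi^T\times\theta$ (2)

$\large \vec{V}=f(V)$ (3)

$\large O=\vec{V}\times\gamma^T$ (4)

$\large Y=W_o(O^T)+X$ (5)

The normalizing function $\small f$ can take the form of softmax, rescaling, or none. Choosing softmax gives the Embedded Gaussian version.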
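A standard non-local block with the softmax (Embedded Gaussian) normalization can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the 1 × 1 convolutions are modeled as bias-free matrix products on a flattened feature map, and the names `non_local_block`, `W_phi`, `W_theta`, `W_gamma`, `W_o` and the toy sizes are chosen here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(X, W_phi, W_theta, W_gamma, W_o):
    """Standard non-local block on a flattened feature map.

    X: (C, N) input with N = H * W positions.
    W_phi, W_theta, W_gamma: (C_hat, C) weights standing in for 1x1 convs.
    W_o: (C, C_hat) weights of the output 1x1 conv.
    """
    phi = W_phi @ X               # query, (C_hat, N)
    theta = W_theta @ X           # key,   (C_hat, N)
    gamma = W_gamma @ X           # value, (C_hat, N)
    V = phi.T @ theta             # similarity matrix, (N, N)
    V_norm = softmax(V, axis=1)   # each query's weights over all keys sum to 1
    O = V_norm @ gamma.T          # attention output, (N, C_hat)
    Y = W_o @ O.T + X             # residual output, (C, N)
    return Y, V_norm

rng = np.random.default_rng(0)
C, C_hat, H, W = 8, 4, 6, 5       # toy sizes (assumed)
X = rng.standard_normal((C, H * W))
W_phi, W_theta, W_gamma = (rng.standard_normal((C_hat, C)) * 0.1 for _ in range(3))
W_o = rng.standard_normal((C, C_hat)) * 0.1
Y, V_norm = non_local_block(X, W_phi, W_theta, W_gamma, W_o)
print(Y.shape, V_norm.shape)      # (8, 30) (30, 30)
```

The (N, N) similarity matrix is the part that becomes expensive once N grows to the resolutions used in segmentation.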

Asymmetric Pyramid Non-local Block

Motivation and Analysis

By inspecting the general computing flow of a non-local block, one could clearly find that Eq. (2) and Eq. (4) dominate the computation. The time complexities of the two matrix multiplications are both $\small O(\hat{C}N^2)=O(\hat{C}H^2W^2)$ . In semantic segmentation, the output of the network usually has a large resolution to retain detailed semantic features [6, 47]. That means N is large (for example in our training phase, N = 96 × 96 = 9216). Hence, the large matrix multiplication is the main cause of the inefficiency of a non-local block (see our statistics in Fig. 2). A more straightforward pipeline is given as

$\large \mathcal{R}^{N\times\hat{C}}\times\mathcal{R}^{\hat{C}\times N}\rightarrow\mathcal{R}^{N\times N}\times\mathcal{R}^{N\times\hat{C}}\rightarrow\mathcal{R}^{N\times\hat{C}}$ (6)
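To get a feel for these numbers, the cost of the two matrix multiplications can be counted directly. The sketch below is only a back-of-the-envelope calculation: N = 96 × 96 comes from the text, while the embedding width `C_HAT = 256` and the float32 storage are assumptions made here for illustration.

```python
def matmul_macs(rows, inner, cols):
    """Multiply-accumulate operations of a (rows x inner) @ (inner x cols) product."""
    return rows * inner * cols

def non_local_macs(c_hat, n):
    """MACs of the two matrix multiplications in a standard non-local block."""
    similarity = matmul_macs(n, c_hat, n)   # query^T @ key      -> (N, N)
    aggregate = matmul_macs(n, n, c_hat)    # weights @ value^T  -> (N, C_hat)
    return similarity + aggregate

N = 96 * 96      # spatial positions during training, from the text
C_HAT = 256      # assumed embedding width, not stated in these notes
total = non_local_macs(C_HAT, N)
similarity_mib = N * N * 4 / 2**20   # float32 storage of the (N, N) matrix
print(N, total)          # 9216 43486543872
print(similarity_mib)    # 324.0
```

Both the operation count and the memory of the similarity matrix grow with N squared, which is why shrinking one side of that matrix pays off so much.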

This paragraph is easy to follow: it explains in detail why the complexity of Non-local is so high.

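We hold a key yet intuitive observation that by changing N to another number $\small S(S\ll N)$ , the output size will remain the same, as

$\large \mathcal{R}^{N\times\hat{C}}\times\mathcal{R}^{\hat{C}\times S}\rightarrow\mathcal{R}^{N\times S}\times\mathcal{R}^{S\times\hat{C}}\rightarrow\mathcal{R}^{N\times\hat{C}}$ (7)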

Returning to the design of the non-local block, changing N to a small number S is equivalent to sampling several representative points from θ and γ instead of feeding all the spatial points, as illustrated in Fig. 1. Consequently, the computational complexity could be considerably decreased.

This paragraph is also easy to follow: N is reduced to S, with S far smaller than N. So how is this operation actually implemented?

Solution

Based on the above observation, we propose to add sampling modules $\small \mathcal{P}_\theta$ and $\small \mathcal{P}_\gamma$ after θ and γ to sample several sparse anchor points denoted as $\small \theta _P\in\mathcal{R}^{\hat{C}\times S}$ and $\small \gamma_P\in\mathcal{R}^{\hat{C}\times S}$ , where S is the number of sampled anchor points. Mathematically, this is computed by

$\large \theta _P=\mathcal{P}_\theta (\theta ),~\gamma _P=\mathcal{P}_\gamma(\gamma )$ (8)

The similarity matrix $\small {V}_P$ between the query embedding $\small \varphi$ and the anchor points $\small {\theta}_P$ is thus calculated by

$\large V_P=\varphi^T\times\theta_P$ (9)

Note that $\small {V}_P$ is an asymmetric matrix of size N × S. $\small {V}_P$ then goes through the same normalizing function as a standard non-local block, giving the unified similarity matrix $\small \vec{V}_P$ . And the attention output is acquired by

$\large O_P=\vec{V}_P\times\gamma_P^T$ (10)

where the output is in the same size as that of Eq. (4). Following non-local blocks, the final output $\small Y_P$ is given as

$\large Y_P=W_o(O_P^T)+X$ (11)

with $\small W_o$ the output 1 × 1 convolution and $\small X$ the input feature map.

The time complexity of such an asymmetric matrix multiplication is only $\small O(\hat{C}NS)$ , significantly lower than $\small O(\hat{C}N^2)$ in a standard non-local block. It is ideal that S should be much smaller than N. However, it is hard to ensure that when S is small, the performance would not drop too much in the meantime.

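This part explains in detail how pyramid pooling is used to reduce the complexity. In Non-local, the key and value are obtained by 1x1 convolutions and have size $\small \hat{C}\times N$ . Here, after the 1x1 convolutions, the key and value are additionally sampled by pyramid pooling, giving size $\small \hat{C}\times S$ . In other words, Non-local lets every pixel (query) attend to every pixel, whereas APNB keeps every pixel as a query but samples the feature map and uses only a few sparse anchor points as keys and values. The computational complexity thus drops from $\small O(\hat{C}N^2)$ to $O(\hat{C}NS)$ .

The authors also point out that S cannot be too small.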
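The asymmetric attention can be sketched in NumPy as below. This is an illustrative sketch, not the paper's code: `chunk_mean_sampler` is a hypothetical stand-in for the sampling module (it simply averages consecutive groups of positions), and the sizes are toy values chosen here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_attention(phi, theta_p, gamma_p):
    """phi: (C_hat, N) queries; theta_p, gamma_p: (C_hat, S) sampled keys/values."""
    V_p = phi.T @ theta_p             # (N, S) asymmetric similarity matrix
    V_p_norm = softmax(V_p, axis=1)   # normalize over the S anchors
    O_p = V_p_norm @ gamma_p.T        # (N, C_hat), same size as the standard block
    return O_p, V_p_norm

def chunk_mean_sampler(feat, s):
    """Stand-in sampler (assumption): average consecutive positions into s anchors."""
    c, n = feat.shape
    assert n % s == 0, "toy sampler needs N divisible by S"
    return feat.reshape(c, s, n // s).mean(axis=2)

rng = np.random.default_rng(0)
C_hat, N, S = 4, 64, 8
phi = rng.standard_normal((C_hat, N))
theta = rng.standard_normal((C_hat, N))
gamma = rng.standard_normal((C_hat, N))
theta_p = chunk_mean_sampler(theta, S)
gamma_p = chunk_mean_sampler(gamma, S)
O_p, V_p_norm = asymmetric_attention(phi, theta_p, gamma_p)

macs_full = 2 * C_hat * N * N   # two N x N sized products
macs_asym = 2 * C_hat * N * S   # two N x S sized products
print(O_p.shape, V_p_norm.shape, macs_full // macs_asym)   # (64, 4) (64, 8) 8
```

The output keeps one row per query position, so the block can be dropped in wherever a standard non-local block was used; only the key/value side shrinks.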

The next part goes further into how the pyramid pooling is actually carried out.

Spatial Pyramid Pooling

As discovered by previous works [16, 47], global and multi-scale representations are useful for categorizing scene semantics. Such representations can be comprehensively carved by Spatial Pyramid Pooling [16], which contains several pooling layers with different output sizes in parallel. In addition to this virtue, spatial pyramid pooling is also parameter-free and very efficient. Therefore, we embed pyramid pooling in the non-local block to enhance the global representations while reducing the computational overhead.

By doing so, we now arrive at the final formulation of Asymmetric Pyramid Non-local Block (APNB), as given in Fig. 3. As can be seen, our APNB derives from the design of a standard non-local block [33]. A vital change is to add a spatial pyramid pooling module after θ and γ respectively to sample representative anchors. This sampling process is clearly depicted in Fig. 4, where several pooling layers are applied after θ or γ and then the four pooling results are flattened and concatenated to serve as the input to the next layer. We denote the spatial pyramid pooling modules as $\small \mathcal{P}^n_\theta$ and $\small \mathcal{P}^n_\gamma$ , where the superscript $\small n$ means the width (or height) of the output size of the pooling layer (empirically, the width is equal to the height). In our model, we set $\small n\in \{1,3,6,8\}$ . Then the total number of the anchor points is

$\large S=\sum_{n\in\{1,3,6,8\}}n^2=110$ (12)

Figure 3: Overview of the proposed Asymmetric Non-local Neural Network. In our implementation, the key branch and the value branch in APNB share the same 1×1 convolution and sampling module, which decreases the number of parameters and computation without sacrificing the performance.

Figure 4: Demonstration of the pyramid max or average sampling process.

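Details of how SPP operates. With $\small n\in\{1,3,6,8\}$ there are S = 1 + 9 + 36 + 64 = 110 anchor points. Taking H = 128 and W = 256 as an example, the asymmetric matrix multiplication is (256 × 128)/110 ≈ 298 times cheaper than the one in Non-local.

Note: in APNB, the authors share the 1x1 conv and the pyramid pooling between key and value, i.e., the same set of weights is used for both.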
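The pyramid sampling step (pool to several output sizes, flatten, concatenate) can be sketched as follows. This is a minimal NumPy sketch under assumptions: average pooling is used, the bins follow a floor/ceil edge rule chosen here for illustration, and `adaptive_avg_pool2d` and `pyramid_sample` are names introduced for this example rather than the paper's code.

```python
import math
import numpy as np

def adaptive_avg_pool2d(feat, n):
    """Average-pool a (C, H, W) map to (C, n, n) using floor/ceil bin edges."""
    c, h, w = feat.shape
    out = np.empty((c, n, n), dtype=feat.dtype)
    for i in range(n):
        r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
        for j in range(n):
            c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
            out[:, i, j] = feat[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out

def pyramid_sample(feat, sizes=(1, 3, 6, 8)):
    """Pool at every pyramid size, flatten each result, concatenate to (C, S)."""
    pooled = [adaptive_avg_pool2d(feat, n).reshape(feat.shape[0], -1) for n in sizes]
    return np.concatenate(pooled, axis=1)

rng = np.random.default_rng(0)
theta = rng.standard_normal((4, 16, 32))   # (C_hat, H, W), toy sizes
theta_p = pyramid_sample(theta)
S = theta_p.shape[1]
print(theta_p.shape)               # (4, 110)
print(round(256 * 128 / S, 1))     # 297.9
```

The first anchor is the global average of the map, and the remaining 109 anchors summarize progressively finer regions, which is how a small S can still carry multi-scale context.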

Asymmetric Fusion Non-local Block

Fusing features of different levels is helpful to semantic segmentation and object tracking as hinted in [16, 18, 26, 41, 46, 51]. Common fusing operations such as addition/concatenation, are conducted in a pixel-wise and local manner. We provide an alternative that leverages long range dependencies through a non-local block to fuse multi-level features, called Fusion Non-local Block. A standard non-local block only has one input source while FNB has two: a high-level feature map $\small X_h\in \mathcal{R}^{C_h\times N_h}$ and a low-level feature map $\small X_l\in \mathcal{R}^{C_l\times N_l}$ . $\small N_h$ and $\small N_l$ are the numbers of spatial locations of $\small X_h$ and $\small X_l$ , respectively. $\small C_h$ and $\small C_l$ are the channel numbers of $\small X_h$ and $\small X_l$ , respectively. Likewise, 1 × 1 convolutions $\small W_h^\varphi$ , $\small W_l^\theta$ and $\small W_l^\gamma$ are used to transform $\small X_h$ and $\small X_l$ to embeddings $\small \varphi_h\in \mathcal{R}^{\hat{C}\times N_h}$ , $\small \theta _l\in \mathcal{R}^{\hat{C}\times N_l}$ and $\small \gamma _l\in \mathcal{R}^{\hat{C}\times N_l}$ as

$\large \varphi_h=W_h^\varphi(X_h),~\theta_l=W_l^\theta(X_l),~\gamma_l=W_l^\gamma(X_l)$

The similarity matrix $\small V_F\in \mathcal{R}^{N_h\times N_l}$ between $\small \varphi_h$ and $\small \theta_l$ is then computed as

$\large V_F=\varphi_h^T\times\theta_l$

After normalization, which gives $\small \vec{V}_F$ , the low-level values are aggregated as

$\large O_F=\vec{V}_F\times\gamma_l^T$

The output $\small O_F\in \mathcal{R}^{N_h\times \hat{C}}$ reflects the bonus of $\small X_l$ to $\small X_h$ , which are carefully selected from all locations in $\small X_l$ .

These are the details of AFNB; Fig. 3 already explains it quite clearly.
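The fusion block can be sketched in NumPy with queries taken from the high-level map and keys/values from the low-level map. This is a hedged sketch, not the authors' implementation: 1 × 1 convolutions are modeled as bias-free matrix products, the strided `sampler` is a hypothetical stand-in for pyramid sampling, and concatenating the fused feature with $X_h$ follows the description of the architecture in these notes.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fusion_non_local(X_h, X_l, W_phi_h, W_theta_l, W_gamma_l, W_o, sampler=None):
    """Fuse a low-level map X_l (C_l, N_l) into a high-level map X_h (C_h, N_h)."""
    phi_h = W_phi_h @ X_h          # query from the high-level map, (C_hat, N_h)
    theta_l = W_theta_l @ X_l      # key from the low-level map,    (C_hat, N_l)
    gamma_l = W_gamma_l @ X_l      # value from the low-level map,  (C_hat, N_l)
    if sampler is not None:        # asymmetric variant: keep only S anchors
        theta_l, gamma_l = sampler(theta_l), sampler(gamma_l)
    V_f = softmax(phi_h.T @ theta_l, axis=1)      # (N_h, N_l) or (N_h, S)
    O_f = V_f @ gamma_l.T                         # (N_h, C_hat)
    fused = W_o @ O_f.T                           # back to (C_h, N_h)
    return np.concatenate([fused, X_h], axis=0)   # (2 * C_h, N_h)

rng = np.random.default_rng(0)
C_h, C_l, C_hat, N_h, N_l = 8, 6, 4, 36, 36       # toy sizes (assumed)
X_h = rng.standard_normal((C_h, N_h))
X_l = rng.standard_normal((C_l, N_l))
W_phi_h = rng.standard_normal((C_hat, C_h)) * 0.1
W_theta_l = rng.standard_normal((C_hat, C_l)) * 0.1
W_gamma_l = rng.standard_normal((C_hat, C_l)) * 0.1
W_o = rng.standard_normal((C_h, C_hat)) * 0.1

Y_full = fusion_non_local(X_h, X_l, W_phi_h, W_theta_l, W_gamma_l, W_o)
Y_asym = fusion_non_local(X_h, X_l, W_phi_h, W_theta_l, W_gamma_l, W_o,
                          sampler=lambda f: f[:, ::4])   # stand-in sampler
print(Y_full.shape, Y_asym.shape)   # (16, 36) (16, 36)
```

Because the query comes from $X_h$, the output always has one row per high-level position, whether or not the low-level side is sampled.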

Network Architecture

The overall architecture of our network is depicted in Fig. 3. We choose ResNet-101 [13] as our backbone network following the choice of most previous works [38, 47, 48]. We remove the last two down-sampling operations and use the dilation convolutions instead to hold the feature maps from the last two stages 1/8 of the input image. Concretely, all the feature maps in the last three stages have the same spatial size. According to our experimental trials, we fuse the features of Stage4 and Stage5 using AFNB. The fused features are thereupon concatenated with the feature maps after Stage5, avoiding situations that AFNB could not produce accurate strengthened features particularly when the training just begins and degrades the overall performance. Such features, full of rich long range cues from different feature levels, serve as the input to APNB, which then help to discover the correlations among pixels. As done for AFNB, the output of APNB is also concatenated with its input source. Note that in our implementation for APNB, $\small W_\theta$ and $\small W_\gamma$ share parameters in order to save parameters and computation, following the design of [42]. This design doesn’t decrease the performance of APNB. Finally, a classifier is followed up to produce channel-wise semantic maps that later receive their supervisions from the ground truth maps. Note we also add another supervision to Stage4 following the settings of [47], as it is beneficial to improve the performance.

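Network construction details:

1. ResNet 101

2. Stages 4 and 5 keep the same resolution, which is exactly the dilated ResNet [see my blog post on Dilated Residual Networks]

3. AFNB fuses stages 4 and 5

4. AFNB is followed by APNB, each with a skip connection (the block output is concatenated with its input)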

5. $\small W_\theta$ and $\small W_\gamma$ share parameters [42: OCnet: Object context network for scene parsing]

6. add another supervision to Stage4 [47: Pyramid scene parsing network]

Experiments

Datasets and Evaluation Metrics

Datasets: Cityscapes [9], ADE20K [50] and PASCAL Context [21].

Evaluation Metric: Mean IoU (mean of classwise intersection over union)
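Mean IoU can be computed from a confusion matrix, as in the small NumPy sketch below. The function name `mean_iou`, the ignore label 255 and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for this example; benchmark toolkits typically accumulate the confusion matrix over the whole dataset before averaging.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean of class-wise IoU for integer label maps of identical shape."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    # confusion[t, p] counts pixels of true class t predicted as class p
    confusion = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0               # skip classes absent from both maps
    return (intersection[present] / union[present]).mean()

target = np.array([[0, 0, 1], [1, 2, 255]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
miou = mean_iou(pred, target, num_classes=3)
print(miou)   # per-class IoUs are 1/2, 2/3 and 1, so the mean is 13/18
```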

Implementation Details

Training Objectives

Following [47 Pyramid scene parsing network], our model has two supervisions: one after the final output of our model while another at the output layer of Stage4. For the loss on the final output, $\small L_{final}$ , we perform online hard pixel mining, which excels at coping with difficult cases.
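A two-term objective with online hard pixel mining on the final output might look like the NumPy sketch below. Everything numeric here is an assumption for illustration, not taken from these notes: the probability threshold `prob_thresh`, the minimum number of kept pixels `min_kept` and the auxiliary weight `aux_weight` are hypothetical values, and the mining rule (keep pixels whose true-class probability is below a threshold) is one common variant rather than a confirmed detail of the paper.

```python
import numpy as np

def pixel_cross_entropy(logits, labels):
    """Per-pixel cross entropy. logits: (P, K), labels: (P,) -> (P,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(labels.size), labels]

def ohem_loss(logits, labels, prob_thresh=0.7, min_kept=2):
    """Average the loss over hard pixels only (assumed mining rule)."""
    losses = pixel_cross_entropy(logits, labels)
    loss_thresh = -np.log(prob_thresh)   # p_true < thresh  <=>  loss > -log(thresh)
    hard = losses > loss_thresh
    if hard.sum() < min_kept:            # fall back to the hardest min_kept pixels
        k = min(min_kept, losses.size)
        hard = np.zeros_like(hard)
        hard[np.argsort(losses)[-k:]] = True
    return losses[hard].mean()

def total_loss(final_logits, stage4_logits, labels, aux_weight=0.4):
    """Final-output loss with mining plus a weighted auxiliary Stage4 loss."""
    l_final = ohem_loss(final_logits, labels)
    l_stage4 = pixel_cross_entropy(stage4_logits, labels).mean()
    return l_final + aux_weight * l_stage4

rng = np.random.default_rng(0)
P, K = 12, 3                              # toy pixel and class counts
labels = rng.integers(0, K, size=P)
final_logits = rng.standard_normal((P, K))
stage4_logits = rng.standard_normal((P, K))
loss = total_loss(final_logits, stage4_logits, labels)
print(float(loss) > 0)   # True
```

Since only the hardest pixels contribute to the final-output term, easy regions such as large road or sky areas stop dominating the gradient.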

Comparisons with Other Methods

1. Efficiency Comparison with Non-local Block

2. Performance Comparisons

3. Ablation Study

Efficacy of the APNB and AFNB

Selection of Sampling Methods

Influence of the Anchor Points Numbers

Conclusion

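Personal summary:

This paper has two highlights:

1. What differs from Non-local:

Non-local: the correlation between each point (query) and every other point; pixel-wise

APNB: the relation between each point and local content; patch-wise

2. Fusing features of different levels: AFNB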

MyDLNote - Attention: [NLA Series] Asymmetric Non-local Neural Networks for Semantic Segmentation

Asymmetric Non-local Neural Networks for Semantic Segmentation
