MyDLNote - Attention: [NLA Series] CCNet: Criss-Cross Attention for Semantic Segmentation

CCNet: Criss-Cross Attention for Semantic Segmentation

[paper] : CCNet: Criss-Cross Attention for Semantic Segmentation

[github] : https://github.com/speedinghzl/CCNet

[Non-Local Attention Series]

Non-local neural networks

GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond [my CSDN]

Asymmetric Non-local Neural Networks for Semantic Segmentation [my CSDN]

Efficient Attention: Attention with Linear Complexities [my CSDN]

CCNet: Criss-Cross Attention for Semantic Segmentation [my CSDN]

Non-locally Enhanced Encoder-Decoder Network for Single Image De-raining [my CSDN]

Image Restoration via Residual Non-local Attention Networks [my CSDN]


Contents

CCNet: Criss-Cross Attention for Semantic Segmentation

Abstract

Introduction

Related work

Attention model

Approach

Overall Framework

Criss-Cross Attention

Recurrent Criss-Cross Attention



Abstract

Long-range dependencies can capture useful contextual information to benefit visual understanding problems. In this work, we propose a Criss-Cross Network (CCNet) for obtaining such important information in a more effective and efficient way. Concretely, for each pixel, our CCNet can harvest the contextual information of its surrounding pixels on the criss-cross path through a novel criss-cross attention module. By taking a further recurrent operation, each pixel can finally capture the long-range dependencies from all pixels. Overall, our CCNet has the following merits: 1) GPU memory friendly. Compared with the non-local block, the recurrent criss-cross attention module requires 11× less GPU memory usage. 2) High computational efficiency. The recurrent criss-cross attention reduces the FLOPs of the non-local block by about 85% when computing long-range dependencies. 3) State-of-the-art performance. We conduct extensive experiments on popular semantic segmentation benchmarks including Cityscapes and ADE20K, and the instance segmentation benchmark COCO. In particular, our CCNet achieves mIoU scores of 81.4 and 45.22 on the Cityscapes test set and the ADE20K validation set, respectively, which are the new state-of-the-art results.

CCNet in one sentence: for each pixel, CCNet harvests the contextual information of the surrounding pixels on its criss-cross path through a novel criss-cross attention module; by applying this operation twice, each pixel can finally capture the long-range dependencies from all pixels.

 


Introduction

To capture long-range dependencies, Chen et al. [Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs] proposed the atrous spatial pyramid pooling module with multi-scale dilated convolutions for contextual information aggregation. Zhao et al. [Pyramid scene parsing network] further introduced PSPNet with a pyramid pooling module to capture contextual information. However, dilated-convolution-based methods collect information from a few surrounding pixels and cannot actually generate dense contextual information. Meanwhile, pooling-based methods aggregate contextual information in a non-adaptive manner, and the homogeneous contextual information is adopted by all image pixels, which does not satisfy the requirement that different pixels need different contextual dependencies.

[Key point]: the shortcomings of dilated convolutions and pyramid pooling in capturing contextual information.

Dilated-convolution-based methods collect information from only a few surrounding pixels and cannot actually generate dense contextual information.

Pooling-based methods aggregate contextual information in a non-adaptive manner, so all image pixels adopt the same homogeneous context, which cannot satisfy the need of different pixels for different contextual dependencies.

 

To generate dense, pixel-wise contextual information, PSANet [PSANet: Point-wise spatial attention network for scene parsing] learns to aggregate contextual information for each position via a predicted attention map. Non-local Networks [Non-local Neural Networks] utilize a self-attention mechanism, which enables a single feature at any position to perceive the features of all other positions, leading to more powerful pixel-wise representations. Here, each position in the feature map is connected with all other ones through self-adaptively predicted attention maps, thus harvesting contextual information of various ranges, see Fig. 1 (a). However, these attention-based methods need to generate huge attention maps to measure the relationship of each pixel pair, whose complexity in both time and space is O((H\times W)\times(H\times W)), where H \times W denotes the spatial dimension of the input feature map. Since the input feature map always has a high resolution in the semantic segmentation task, self-attention-based methods have high computational complexity and occupy a huge amount of GPU memory. This raises the question: is there an alternative solution that achieves the same goal in a more efficient way?

This paragraph makes two points: 1. PSANet and Non-local Networks are methods that generate dense, pixel-wise contextual information.

2. It gives the motivation: the non-local block costs too much computation and memory, and this paper aims to propose a more efficient method.

Figure 1. Diagrams of two attention-based context aggregation methods. (a) For each position (e.g. blue), Non-local module generates a dense attention map which has H \times W weights (in green). (b) For each position (e.g. blue), criss-cross attention module generates a sparse attention map which only has H + W - 1 weights. After recurrent operation, each position (e.g. red) in the final output feature maps can capture long-range dependencies from all pixels. For clear display, residual connections are ignored.

We found that the non-local operation adopted by [Non-local Neural Networks] can be replaced by two consecutive criss-cross operations, each of which only has sparse connections (H + W - 1) for each position in the feature maps. This motivates us to propose the criss-cross attention module to aggregate long-range, pixel-wise contextual information in the horizontal and vertical directions. By serially stacking two criss-cross attention modules, it can collect contextual information from all pixels. The decomposition greatly reduces the complexity in time and space from O((H\times W)\times(H\times W)) to O((H\times W)\times(H + W-1)).

This is the core idea of the paper: the non-local operation used in non-local neural networks can be replaced by two consecutive criss-cross operations, which directly reduces the complexity to O((H\times W)\times(H + W-1)).
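To make the reduction concrete, here is a tiny back-of-the-envelope sketch in Python comparing the number of attention-map entries of the two schemes. The 97×97 feature-map size is my own assumption (roughly 1/8 of a 769×769 Cityscapes crop), and the raw entry-count ratio is not the same thing as the 11× whole-module GPU-memory figure quoted in the abstract, which measures the full blocks.

```python
# Back-of-the-envelope comparison of attention-map sizes.
# Assumption: H = W = 97, roughly 1/8 of a 769x769 Cityscapes crop.
H, W = 97, 97
N = H * W

non_local_entries = N * N                    # O((HxW) x (HxW))
criss_cross_entries = N * (H + W - 1)        # O((HxW) x (H+W-1)), per loop

print(f"non-local attention map    : {non_local_entries:,} entries")
print(f"criss-cross map (one loop) : {criss_cross_entries:,} entries")
print(f"entry ratio                : {non_local_entries / criss_cross_entries:.1f}x")
```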

 

Concretely, our criss-cross attention module is able to harvest various information, nearby and far away, on the criss-cross path. As shown in Fig. 1, both the non-local module and the criss-cross attention module take input feature maps with spatial size H \times W to generate attention maps (upper branch) and adapted feature maps (lower branch), respectively. Then, a weighted sum is adopted for aggregation. 1) In the criss-cross attention module, each position (e.g., blue color) in the feature map is connected with the other positions in the same row and the same column through a sparsely predicted attention map. The predicted attention map only has H+W-1 weights rather than the H\times W of the non-local module. 2) Furthermore, we propose the recurrent criss-cross attention module to capture the long-range dependencies from all pixels. The local features are first passed into the criss-cross attention module once, which collects the contextual information in the horizontal and vertical directions. The output feature map of this criss-cross attention module is then fed into the next criss-cross attention module; each position (e.g., red color) in the second feature map collects information from all others to augment the pixel-wise representations. All criss-cross attention modules share parameters to avoid extra parameters. Our criss-cross attention module can be plugged into any fully convolutional neural network, named CCNet, for learning to segment in an end-to-end manner.

This paragraph further explains how CCNet is constructed:

1) In the criss-cross attention module, each position in the feature map (e.g., the blue one) is connected with the other positions in the same row and column through a predicted sparse attention map.

2) On top of this, a recurrent criss-cross attention module is proposed to capture the long-range dependencies of all pixels. (Why? The next two sentences explain: the local features are passed through the criss-cross attention module once, which collects contextual information in the horizontal and vertical directions. The output feature map of this module is fed into the next criss-cross attention module; each position in the second feature map (e.g., the red one) then collects information from all other positions to augment the pixel-wise representation.)

The criss-cross attention module can be plugged into any fully convolutional neural network; the resulting network is called CCNet.

 

In summary, our main contributions are two-fold:

• We propose a novel criss-cross attention module in this work, which can be leveraged to capture contextual information from long-range dependencies in a more efficient and effective way.

• We propose a CCNet by taking advantage of two recurrent criss-cross attention modules, achieving leading performance on segmentation-based benchmarks, including Cityscapes, ADE20K and MSCOCO.

 


Related work

Attention model

The attention model is widely used in various tasks. Squeeze-and-Excitation Networks enhance the representational power of the network by modeling channel-wise relationships with an attention mechanism. Chen et al. [Scale-aware semantic image segmentation] made use of several attention masks to fuse feature maps or predictions from different branches. Vaswani et al. [Attention is all you need] applied a self-attention model to machine translation. Wang et al. [Non-local Neural Networks] proposed the non-local module, which generates a huge attention map by calculating the correlation matrix between every pair of spatial points in the feature map; the attention map then guides dense contextual information aggregation. OCNet [OCNet: Object context network for scene parsing] and DANet [Dual attention network for scene segmentation] utilized the self-attention mechanism to harvest contextual information. PSA [PSANet: Point-wise spatial attention network for scene parsing] learned an attention map to aggregate contextual information for each individual point adaptively and specifically. Our CCNet is different from the aforementioned studies, which generate a huge attention map to record the relationship of each pixel pair in the feature map. In CCNet, contextual information is aggregated by the criss-cross attention module along criss-cross paths. Besides, CCNet can also obtain dense contextual information in a recurrent fashion, which is more effective and efficient.

Several attention models are introduced here; I have given links to all of them.

Our CCNet differs from the studies above, which generate a huge attention map recording the relationship of every pixel pair in the feature map. In CCNet, contextual information is aggregated by the criss-cross attention module along criss-cross paths. Moreover, CCNet obtains dense contextual information in a recurrent fashion, which is more effective and efficient.

 


Approach

In this section, we give the details of the proposed Criss-Cross Network (CCNet) for semantic segmentation. First, we present the general framework of our network. Then, we introduce the criss-cross attention module, which captures long-range contextual information in the horizontal and vertical directions. Finally, to capture dense and global contextual information, we propose the recurrent criss-cross attention module.

In this section, the Criss-Cross Network (CCNet) for semantic segmentation is described in detail: first the overall framework, then the criss-cross attention module that captures long-range contextual information horizontally and vertically, and finally the recurrent criss-cross attention module for dense, global contextual information.

 

Overall Framework

The network architecture is given in Fig. 2. An input image is passed through a deep convolutional neural network (DCNN), designed in a fully convolutional fashion [Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs], which produces a feature map X. We denote the spatial size of X as H \times W. In order to retain more details and efficiently produce dense feature maps, we remove the last two down-sampling operations and employ dilated convolutions in the subsequent convolutional layers, thus enlarging the width/height of the output feature map X to 1/8 of the input image.

The input image is passed through a deep convolutional neural network (DCNN) designed in a fully convolutional fashion. To retain more details and efficiently generate dense feature maps, the last two down-sampling operations are removed and dilated convolutions are used in the subsequent convolutional layers, so the width/height of the output feature map is 1/8 of the input image.

 

Figure 2. Overview of the proposed CCNet for semantic segmentation. The proposed recurrent criss-cross attention takes feature maps H as input and outputs feature maps H'', which obtain rich and dense contextual information from all pixels. The recurrent criss-cross attention module can be unrolled into R = 2 loops, in which all criss-cross attention modules share parameters.

After obtaining the feature maps X, we first apply a convolution layer to obtain the feature maps H of reduced dimension; then, the feature maps H are fed into the criss-cross attention (CCA) module to generate new feature maps H', which aggregate long-range contextual information for each pixel in a criss-cross way. The feature maps H' only aggregate the contextual information in the horizontal and vertical directions, which is not powerful enough for semantic segmentation. To obtain richer and denser contextual information, we feed the feature maps H' into the criss-cross attention module again and output the feature maps H''. Thus, each position in the feature maps H'' actually gathers information from all pixels. The two criss-cross attention modules share the same parameters to avoid adding too many extra parameters. We name this recurrent structure the recurrent criss-cross attention (RCCA) module.

First, a convolution layer is applied to the feature maps for dimension reduction;

then, the reduced feature maps are fed into the criss-cross attention (CCA) module to generate new feature maps H', which aggregate long-range contextual information for each pixel in a criss-cross way.

To obtain richer and denser contextual information, the feature maps H' are fed into the criss-cross attention module again, producing the feature maps H''. Each position in H'' then actually gathers information from all pixels.

The two attention modules share the same parameters, avoiding too many extra parameters. This recurrent structure is named the recurrent criss-cross attention (RCCA) module.

Then we concatenate the dense contextual feature H'' with the local representation feature X. It is followed by one or several convolutional layers with batch normalization and activation for feature fusion. Finally, the fused features are fed into the segmentation layer to generate the final segmentation map.

Then the dense contextual feature H'' is concatenated with the local representation feature X.

This is followed by one or several convolutional layers with batch normalization and activation for feature fusion. Finally, the fused features are fed into the segmentation layer to generate the final segmentation map.
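Putting the pieces of this framework together, the sketch below (pure PyTorch, my own reading of the description rather than the official code) shows one possible segmentation head: reduce X, run the RCCA module (a sketch of it follows in the Recurrent Criss-Cross Attention section below), concatenate with X, fuse, and classify. The channel widths (2048 → 512), the 3×3 kernels, and the 19 classes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class CCHead(nn.Module):
    """Backbone feature X -> reduce -> RCCA -> concat with X -> fuse -> logits."""

    def __init__(self, rcca_module: nn.Module, in_ch: int = 2048, mid_ch: int = 512,
                 num_classes: int = 19):
        super().__init__()
        self.reduce = nn.Sequential(                      # dimension reduction of X
            nn.Conv2d(in_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.rcca = rcca_module                           # recurrent criss-cross attention (R = 2)
        self.fuse = nn.Sequential(                        # fuse H'' with the local feature X
            nn.Conv2d(in_ch + mid_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(mid_ch, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: backbone feature X at 1/8 resolution
        h = self.rcca(self.reduce(x))                      # H -> H' -> H''
        out = self.fuse(torch.cat([x, h], dim=1))          # concat dense context with X
        return self.classifier(out)                        # logits; upsampled x8 outside this head
```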

 

Criss-Cross Attention

In order to model long-range contextual dependencies over local feature representations using lightweight computation and memory, we introduce a criss-cross attention module. The criss-cross attention module collects contextual information in horizontal and vertical directions to enhance pixel-wise representative capability.

The criss-cross attention module collects contextual information in the horizontal and vertical directions to enhance the pixel-wise representative capability.

As shown in Fig 3, given a local feature H \in \mathbb{R}^{C\times W\times H}, the criss-cross attention module firstly applies two convolution layers with 1\times 1 filters on H to generate two feature maps Q and K, respectively, where \{Q, K\} \in \mathbb{R}^{ C'\times W\times H}. C' is the channel number of feature maps, which is less than C for dimension reduction.

The criss-cross attention module first applies two 1\times 1 convolution layers to generate two feature maps Q and K, with \{Q, K\} \in \mathbb{R}^{C'\times W\times H}.

After obtaining the feature maps Q and K, we further generate the attention maps A \in \mathbb{R}^{(H+W-1)\times W\times H} via an Affinity operation. At each position u in the spatial dimension of the feature maps Q, we can get a vector Q_u \in \mathbb{R}^{C'}. Meanwhile, we can obtain the set \Omega_u by extracting the feature vectors from K which are in the same row or column as position u. Thus, \Omega_u\in\mathbb{R}^{(H+W-1)\times C'}, and \Omega_{i,u}\in \mathbb{R}^{C'} is the i-th element of \Omega_u. The Affinity operation is defined as follows:

d_{i,u}=Q_u\Omega^T_{i,u}                     (1)

in which d_{i,u} \in D denotes the degree of correlation between the features Q_u and \Omega_{i,u}, i = 1, ..., |\Omega_u|, and D \in \mathbb{R}^{(H+W-1)\times W\times H}. Then, we apply a softmax layer on D along the channel dimension to calculate the attention map A.

To follow this part, some familiarity with self-attention or non-local attention helps.

Q is an H\times W \times C' tensor; reshape it into an N \times C' matrix, where N=H\times W is the number of positions u. There are N row vectors Q_u \in \mathbb{R}^{C'} in total.

In non-local attention, \Omega is the matrix obtained by reshaping K, with \Omega \in \mathbb{R}^{N\times C'}. In criss-cross attention this is no longer the case: \Omega_u\in\mathbb{R}^{(H+W-1)\times C'} is the set obtained by extracting the feature vectors of K that lie in the same row or column as position u. \Omega_{i,u}\in \mathbb{R}^{C'} is the i-th element of \Omega_u, and there are H+W-1 such row vectors.

Each d_{i,u}=Q_u\Omega^T_{i,u} is a single value, and all of these values together form the tensor D. How many are there? There are H\times W vectors Q_u \in \mathbb{R}^{C'}, and for each of them H+W-1 vectors \Omega_{i,u}\in \mathbb{R}^{C'}, so D \in \mathbb{R}^{(H+W-1)\times W\times H}. Normalizing D with softmax gives A.
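Below is a minimal pure-PyTorch sketch of this Affinity operation, written for clarity rather than speed; the official repo implements it with a dedicated CUDA kernel, so treat this as an illustrative re-derivation rather than its code. The function name and the (B, H+W, H, W) layout are my own choices; the position u itself is masked out of the column term so that, as in the paper, each position effectively gets H+W-1 affinities.

```python
import torch

def criss_cross_affinity(Q: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Q, K: (B, C', H, W)  ->  attention map A: (B, H + W, H, W), one entry masked."""
    B, Cp, H, W = Q.shape

    # Row term: for position u = (h, w), correlate Q_u with every key in row h.
    # d_row[b, v, h, w] = <Q[:, h, w], K[:, h, v]>, giving W affinities per position.
    d_row = torch.einsum("bchw,bchv->bvhw", Q, K)            # (B, W, H, W)

    # Column term: correlate Q_u with every key in column w.
    # d_col[b, u, h, w] = <Q[:, h, w], K[:, u, w]>, giving H affinities per position.
    d_col = torch.einsum("bchw,bcuw->buhw", Q, K)            # (B, H, H, W)

    # Position u appears in both terms; mask it out of the column term so that
    # each position contributes H + W - 1 affinities, matching Eq. (1).
    mask = torch.eye(H, dtype=torch.bool, device=Q.device).view(1, H, H, 1)
    d_col = d_col.masked_fill(mask, float("-inf"))

    D = torch.cat([d_row, d_col], dim=1)                      # (B, H + W, H, W)
    return torch.softmax(D, dim=1)                            # softmax along the "channel" dim -> A
```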

Figure 3. The details of criss-cross attention module.

Then another convolutional layer with 1 \times 1 filters is applied on H to generate V \in \mathbb{R}^{C\times W \times H} for feature adaption. At each position u in the spatial dimension of the feature maps V, we can obtain a vector V_u \in \mathbb{R}^C and a set \Phi_u \in \mathbb{R}^{(H+W-1)\times C}. The set \Phi_u is the collection of feature vectors in V which are in the same row or column as position u. The long-range contextual information is collected by the Aggregation operation:

H'_u = \sum_{i=1}^{|\Phi_u|} A_{i,u}\Phi_{i,u} + H_u              (2)

in which H'_u denotes a feature vector in the output feature maps H'\in \mathbb{R}^{C\times W \times H} at position u, and A_{i,u} is the scalar value at channel i and position u in A. The contextual information is added to the local feature H to enhance the local features and augment the pixel-wise representation. Therefore, it has a wide contextual view and selectively aggregates contexts according to the spatial attention map. These feature representations achieve mutual gains and are more robust for semantic segmentation.

The remaining operations are the same as in ordinary non-local attention.

Adding the contextual information back onto the local feature H enhances the local features and augments the pixel-wise representation. The module therefore has a wide contextual view and selectively aggregates contexts according to the spatial attention map; the resulting representations achieve mutual gains and are more robust for semantic segmentation.
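A matching sketch of the Aggregation operation (Eq. 2), using the attention map produced by criss_cross_affinity above and the same row-first layout. Again, this is an illustrative pure-PyTorch version rather than the official CUDA kernel (which, as far as I recall, also applies a learnable scale to the aggregated context before the residual addition).

```python
import torch

def criss_cross_aggregate(A: torch.Tensor, V: torch.Tensor, H_in: torch.Tensor) -> torch.Tensor:
    """A: (B, H + W, H, W) attention (row weights first); V, H_in: (B, C, H, W) -> H': (B, C, H, W)."""
    B, _, H, W = V.shape
    A_row, A_col = A[:, :W], A[:, W:]                      # split row weights / column weights

    # Row aggregation:    out_row[b, c, h, w] = sum_v A_row[b, v, h, w] * V[b, c, h, v]
    out_row = torch.einsum("bvhw,bchv->bchw", A_row, V)
    # Column aggregation: out_col[b, c, h, w] = sum_u A_col[b, u, h, w] * V[b, c, u, w]
    out_col = torch.einsum("buhw,bcuw->bchw", A_col, V)

    # Residual connection of Eq. (2): contextual information is added back onto H.
    return out_row + out_col + H_in
```

As a quick shape check: with Q, K of shape (2, 8, 13, 17) and V, H_in of shape (2, 64, 13, 17), criss_cross_aggregate(criss_cross_affinity(Q, K), V, H_in) returns a (2, 64, 13, 17) tensor.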

The proposed criss-cross attention module is a self-contained module which can be dropped into a CNN architecture at any point, and in any number, to obtain rich contextual information. This module is computationally very cheap and adds only a few parameters, causing very little GPU memory usage.

The criss-cross attention module is a self-contained module that can be dropped into a CNN architecture at any point and in any number, to obtain rich contextual information. It is computationally very cheap, adds only a few parameters, and uses very little GPU memory.

 

Recurrent Criss-Cross Attention

Although a criss-cross attention module can capture long-range contextual information in the horizontal and vertical directions, the connections between a pixel and the pixels around it are still sparse. Obtaining dense contextual information is helpful for semantic segmentation. To achieve this, we introduce the recurrent criss-cross attention based on the criss-cross attention module described above. The recurrent criss-cross attention module can be unrolled into R loops. In the first loop, the criss-cross attention module takes as input the feature maps H extracted from a CNN model and outputs feature maps H', where H and H' have the same shape. In the second loop, the criss-cross attention module takes as input the feature maps H' and outputs feature maps H''. As shown in Fig. 2, the recurrent criss-cross attention module has two loops (R=2), which is enough to harvest the long-range dependencies from all pixels and generate new feature maps with dense and rich contextual information.

Although the criss-cross attention module can capture long-range contextual information in the horizontal and vertical directions, the connections between a pixel and the pixels around it are still sparse.

Obtaining dense contextual information is helpful for semantic segmentation.

To this end, a recurrent criss-cross attention is introduced based on the criss-cross attention module described above. The recurrent criss-cross attention module can be unrolled into R loops.

In the first loop, the criss-cross attention module takes the feature maps H extracted from a CNN model as input and outputs feature maps H' of the same shape.

In the second loop, the criss-cross attention module takes the feature maps H' as input and outputs feature maps H''.

As shown in Fig. 2, the recurrent criss-cross attention module has two loops (R=2), which is enough to harvest the long-range dependencies from all pixels and generate feature maps with dense, rich contextual information.
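The recurrence itself is just repeated application of one module instance, so the parameters are shared across loops. A minimal sketch of such a wrapper, assuming `cca` is a criss-cross attention module like the one sketched in the previous section:

```python
import torch.nn as nn

class RCCA(nn.Module):
    """Recurrent criss-cross attention: apply the same CCA module R times (shared parameters)."""

    def __init__(self, cca: nn.Module, recurrence: int = 2):
        super().__init__()
        self.cca = cca                  # a single module instance => parameters shared across loops
        self.recurrence = recurrence    # R = 2 in the paper

    def forward(self, h):
        for _ in range(self.recurrence):
            h = self.cca(h)             # loop 1: H -> H', loop 2: H' -> H''
        return h
```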

A and A' denote the attention maps in loop 1 and loop 2, respectively. Since we are interested only in how contextual information spreads in the spatial dimension rather than in the channel dimension, the convolutional layers with 1\times1 filters can be viewed as identity connections. In addition, the mapping function from position (x', y') to the weight A_{i,x,y} is defined as A_{i,x,y} = f(A, x, y, x', y'). For any position u in the feature map H'' and any position \theta in the feature map H, there is a connection when R = 2. One case is that u and \theta are in the same row or column:

H''_u \leftarrow [f(A, u, \theta ) + 1] \cdot f(A' , u, \theta ) \cdot H_{\theta }                (3)

in which \leftarrow denotes the add-to operation.

A and A' are the attention maps in loop 1 and loop 2, respectively. Since we only care about how contextual information propagates in the spatial dimension, not in the channel dimension, the 1\times1 convolution layers can be regarded as identity connections.

In addition, the mapping function from position (x', y') to the weight A_{i,x,y} is defined as A_{i,x,y} = f(A, x, y, x', y'). For any position u in the feature map H'' and any position \theta in the feature map H, there is a connection when R = 2. One case is that u and \theta are in the same row or column:

H''_u \leftarrow [f(A, u, \theta ) + 1] \cdot f(A' , u, \theta ) \cdot H_{\theta }

where \leftarrow denotes the add-to operation.

Figure 4. An example of information propagation when the loop number is 2.

The other case is that u and \theta are in neither the same row nor the same column. Fig. 4 shows the propagation path of contextual information in the spatial dimension:

H''_u \leftarrow [f(A, u_x, \theta_y, \theta_x, \theta_y) \cdot f(A', u_x, u_y, u_x, \theta_y) + f(A, \theta_x, u_y, \theta_x, \theta_y) \cdot f(A', u_x, u_y, \theta_x, u_y)] \cdot H_{\theta}   (4)

In general, our recurrent criss-cross attention module makes up for the deficiency of the criss-cross attention module, which cannot obtain dense contextual information from all pixels. Compared with the criss-cross attention module, the recurrent criss-cross attention module (R = 2) does not bring extra parameters and achieves better performance at the cost of a minor computation increment. The recurrent criss-cross attention module is also a self-contained module that can be plugged into any CNN architecture at any stage and optimized in an end-to-end manner.

The other case is that u and \theta are in neither the same row nor the same column. Fig. 4 shows the propagation path of contextual information in the spatial dimension:

H''_u \leftarrow [f(A, u_x, \theta_y, \theta_x, \theta_y) \cdot f(A', u_x, u_y, u_x, \theta_y) + f(A, \theta_x, u_y, \theta_x, \theta_y) \cdot f(A', u_x, u_y, \theta_x, u_y)] \cdot H_{\theta}

Overall, the recurrent criss-cross attention module makes up for the inability of the criss-cross attention module to gather dense contextual information from all pixels.

Compared with the criss-cross attention module, the recurrent criss-cross attention module (R = 2) brings no extra parameters and achieves better performance at the cost of a minor increase in computation.

The recurrent criss-cross attention module is also a self-contained module that can be plugged into any CNN architecture at any stage and optimized in an end-to-end manner.
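As a quick sanity check of the argument above (that R = 2 already yields dense connectivity), the small script below builds the one-hop "same row or same column" reachability matrix for a small grid and verifies that every pair of positions is connected within two hops; the 5×7 grid size is just an assumption for the check.

```python
import torch

H, W = 5, 7                                            # arbitrary small grid
coords = [(i, j) for i in range(H) for j in range(W)]
N = len(coords)

# One criss-cross hop: u connects to every position in its row or column.
one_hop = torch.zeros(N, N, dtype=torch.bool)
for a, (ax, ay) in enumerate(coords):
    for b, (bx, by) in enumerate(coords):
        one_hop[a, b] = (ax == bx) or (ay == by)

# Two hops: reachable through some intermediate position, e.g. (u_x, theta_y).
two_hop = (one_hop.float() @ one_hop.float()) > 0
print(bool(two_hop.all()))                             # True -> dense connectivity at R = 2
```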

 

 

 

 

 

 

 

 
