MyDLNote - Attention: [2020 CVPR] Exploring Self-attention for Image Recognition

[PAPER] Exploring Self-attention for Image Recognition

 


Contents

MyDLNote - Attention: [2020 CVPR] Exploring Self-attention for Image Recognition

Abstract

Introduction

Related Work

Self-attention Networks

Pairwise Self-attention

Patchwise Self-attention

Self-attention Block

Comparison


Abstract

Recent work has shown that self-attention can serve as a basic building block for image recognition models. We explore variations of self-attention and assess their effectiveness for image recognition. We consider two forms of self-attention. One is pairwise self-attention, which generalizes standard dot-product attention and is fundamentally a set operator. The other is patchwise self-attention, which is strictly more powerful than convolution. Our pairwise self-attention networks match or outperform their convolutional counterparts, and the patchwise models substantially outperform the convolutional baselines. We also conduct experiments that probe the robustness of learned representations and conclude that self-attention networks may have significant benefits in terms of robustness and generalization.

 


Introduction

Let the storytelling begin:

Convolutional networks have revolutionized computer vision. Thirty years ago, they were applied successfully to recognizing handwritten digits [19]. Building directly on this work, convolutional networks were scaled up in 2012 to achieve breakthrough accuracy on the ImageNet dataset, outperforming all prior methods by a large margin and launching the deep learning era in computer vision [18, 29]. Subsequent architectural improvements yielded successively larger and more accurate convolutional networks for image recognition, including GoogLeNet [31], VGG [30], ResNet [12], DenseNet [16], and squeeze-and-excitation [15]. These architectures in turn serve as templates for applications in computer vision and beyond.

All these networks, from LeNet [19] onwards, are based fundamentally on the discrete convolution. The discrete convolution operator ∗ can be defined as follows:

(F\ast k)(p) = \sum_{s+t=p}F(s) k(t) .              (1)

Here F is a discrete function and k is a discrete filter. A key characteristic of the convolution is its translation invariance: the same filter k is applied across the image F. While the convolution has undoubtedly been effective as the basic operator in modern image recognition, it is not without drawbacks. For example, the convolution lacks rotation invariance. The number of parameters that must be learned grows with the footprint of the kernel k. And the stationarity of the filter can be seen as a drawback: the aggregation of information from a neighborhood cannot adapt to its content. Is it possible that networks based on the discrete convolution are a local optimum in the design space of image recognition models? Could other parts of the design space yield models with interesting new capabilities?

The first two paragraphs give the broad background: the glorious history of the traditional convolution and its problems (remarkably similar to the background section of my own PhD thesis).

Two problems with the traditional discrete convolution: it lacks rotation invariance, and the aggregation of information from a neighborhood cannot adapt to the neighborhood's content.

Recent work has shown that self-attention may constitute a viable alternative for building image recognition models [13, 27]. The self-attention operator has been adopted from natural language processing, where it serves as the basis for powerful architectures that have displaced recurrent and convolutional models across a variety of tasks [33, 7, 6, 40]. The development of effective self-attention architectures in computer vision holds the exciting prospect of discovering models with different and perhaps complementary properties to convolutional networks.

Third paragraph: the traditional discrete convolution has these problems, and self-attention arrives to alleviate them. A transitional paragraph.

In this work, we explore variations of the self-attention operator and assess their effectiveness as the basic building block for image recognition models. We explore two types of self-attention. The first is pairwise self-attention, which generalizes the standard dot-product attention used in natural language processing [33]. Pairwise attention is compelling because, unlike the convolution, it is fundamentally a set operator, rather than a sequence operator. Unlike the convolution, it does not attach stationary weights to specific locations (s in equation (1)) and is invariant to permutation and cardinality. One consequence is that the footprint of a self-attention operator can be increased (e.g., from a 3×3 to a 7×7 patch) or even made irregular without any impact on the number of parameters. We present a number of variants of pairwise attention that have greater expressive power than dot-product attention while retaining these invariance properties. In particular, our weight computation does not collapse the channel dimension and allows the feature aggregation to adapt to each channel.

Fourth paragraph: since self-attention is this promising, the paper's work builds on it.

First contribution: pairwise self-attention.

Unlike the convolution, it is fundamentally a set operator rather than a sequence operator;

it is invariant to permutation and cardinality (I am not entirely clear on these two concepts; roughly, the output should not depend on the ordering or the number of feature vectors in the footprint. Experts are welcome to clarify in the comments);

it is more expressive than dot-product attention;

its weight computation does not collapse the channel dimension and allows the feature aggregation to adapt to each channel (I had recently been thinking about exactly this; unfortunately this paper got there first).

 

Next, we explore a different class of operators, which we term patchwise self-attention. These operators, like the convolution, have the ability to uniquely identify specific locations within their footprint. They do not have the permutation or cardinality invariance of pairwise attention, but are strictly more powerful than convolution.

Fifth paragraph: the second contribution: patchwise self-attention.

Like the convolution, it is computed within a specific local footprint;

it does not have the permutation or cardinality invariance of pairwise attention;

but it is strictly more powerful than convolution.

Our experiments indicate that both forms of self-attention are effective for building image recognition models. We construct self-attention networks that can be directly compared to convolutional ResNet models [12], and conduct experiments on the ImageNet dataset [29]. Our pairwise self-attention networks match or outperform their convolutional counterparts, with similar or lower parameter and FLOP budgets. Controlled experiments also indicate that our vectorial operators outperform standard scalar attention. Furthermore, our patchwise models substantially outperform the convolutional baselines. For example, our mid-sized SAN15 with patchwise attention outperforms the much larger ResNet50, with a 78% top-1 accuracy for SAN15 versus 76.9% for ResNet50, with a 37% lower parameter and FLOP count. Finally, we conduct experiments that probe the robustness of learned representations and conclude that self-attention networks may have significant benefits in terms of robustness and generalization.

Final paragraph: the conclusions of this little story:

pairwise self-attention matches or outperforms convolutional networks, with similar or lower parameter counts and lower cost, and it surpasses standard scalar attention;

patchwise models substantially outperform convolution;

self-attention networks may have significant advantages in robustness and generalization.

 


Related Work

Most closely related to our work are the recent results of Hu et al. [13] and Ramachandran et al. [27]. One of their key innovations is restricting the scope of self-attention to a local patch (for example, 7×7 pixels), in contrast to earlier constructions that applied self-attention globally over a whole feature map [35, 1]. Such local attention is key to limiting the memory and computation consumed by the model, facilitating successful application of self-attention throughout the network, including early high-resolution layers. Our work builds on these results and explores a broader variety of self-attention formulations. In particular, our primary self-attention mechanisms compute a vector attention that adapts to different channels, rather than a shared scalar weight. We also explore a family of patchwise attention operators that are structurally different from the forms used in [13, 27] and constitute strict generalizations of convolution. We show that all the presented forms of self-attention can be implemented at scale, with favorable parameter and FLOP budgets.

[13] Local relation networks for image recognition. In ICCV, 2019

[27] Stand-alone self-attention in vision models. In NeurIPS, 2019

A key innovation of [13] and [27] is restricting the scope of self-attention to a local patch (e.g., 7×7 pixels), whereas earlier work applied self-attention over the whole feature map.

This sharply reduces the memory and computation consumed, which is what allows self-attention to be applied throughout the network, including the early high-resolution layers.

In particular, the paper's primary self-attention mechanisms compute a vector attention that adapts to different channels, rather than a shared scalar weight.

The authors show that all the presented forms of self-attention can be implemented at scale, with favorable parameter and FLOP budgets.

 


Self-attention Networks

In convolutional networks for image recognition, the layers of the network perform two functions. The first is feature aggregation, which the convolution operation performs by combining features from all locations tapped by the kernel. The second function is feature transformation, which is performed by successive linear mappings and nonlinear scalar functions: these successive mappings and nonlinear operations shatter the feature space and give rise to complex piecewise mappings.

In convolutional networks for image recognition, each layer performs two functions.

The first is feature aggregation: the convolution combines the features from all locations tapped by the kernel.

The second is feature transformation, carried out by successive linear mappings and nonlinear scalar functions; these successive mappings and nonlinear operations shatter the feature space and give rise to complex piecewise mappings.

One observation that underlies our construction is that these two functions – feature aggregation and feature transformation – can be decoupled. If we have a mechanism that performs feature aggregation, then feature transformation can be performed by perceptron layers that process each feature vector (for each pixel) separately. A perceptron layer consists of a linear mapping and a nonlinear scalar function: this pointwise operation performs feature transformation. Our construction therefore focuses on feature aggregation.

Feature aggregation and feature transformation can be decoupled.

If there is a mechanism that performs feature aggregation, then feature transformation can be handled by perceptron layers that process each feature vector (each pixel) separately. A perceptron layer consists of a linear mapping and a nonlinear scalar function; this pointwise operation performs the feature transformation.

The construction therefore focuses on feature aggregation.

The convolution operator performs feature aggregation by a fixed kernel that applies pretrained weights to linearly combine feature values from a set of nearby locations. The weights are fixed and do not adapt to the content of the features. And since each location must be processed with a dedicated weight vector, the number of parameters scales linearly with the number of aggregated features. We present a number of alternative aggregation schemes and construct high-performing image recognition architectures that interleave feature aggregation (via self-attention) and feature transformation (via elementwise perceptrons).

The paper presents a number of alternative aggregation schemes and builds high-performing image recognition architectures that interleave feature aggregation (via self-attention) with feature transformation (via element-wise perceptrons).

 

Pairwise Self-attention

We explore two types of self-attention. The first, which we refer to as pairwise, has the following form:

y_i = \sum_{j\in {R}(i)} \alpha(x_i , x_j ) \odot \beta(x_j ),                                 (2)

where \odot is the Hadamard product, i is the spatial index of feature vector x_i (i.e., its location in the feature map), and R(i) is the local footprint of the aggregation. The footprint R(i) is a set of indices that specifies which feature vectors are aggregated to construct the new feature y_i.

The function \beta produces the feature vectors \beta(x_j) that are aggregated by the adaptive weight vectors \alpha(x_i , x_j ). Possible instantiations of this function, along with feature transformation elements that surround self-attention operations in our architecture, are discussed later in this section.

In this formula, \alpha corresponds to the attention map, \beta to the (transformed) input features, and R(i) plays the role of the convolution window.

Taking classic self-attention as an example: if the input is X, then \beta is V = W_v X and \alpha is Q K^T.

What is the Hadamard product? It is simply element-wise multiplication.
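As a rough illustration (my own sketch, not the paper's released code), the following PyTorch snippet evaluates Eq. (2) at a single location i with a toy 3×3 footprint, using the subtraction relation and the \gamma mapping introduced below; the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

C = 16                                # channel dimensionality
x_i = torch.randn(C)                  # feature vector at the query location i
x_R = torch.randn(9, C)               # feature vectors x_j in a 3x3 footprint R(i)

phi, psi, beta = nn.Linear(C, C), nn.Linear(C, C), nn.Linear(C, C)
gamma = nn.Sequential(nn.Linear(C, C), nn.ReLU(), nn.Linear(C, C))

delta = phi(x_i) - psi(x_R)           # relation delta(x_i, x_j), one vector per j
alpha = gamma(delta)                  # adaptive weight vectors alpha(x_i, x_j)
y_i = (alpha * beta(x_R)).sum(dim=0)  # Hadamard product, then sum over j in R(i)
```

Note that enlarging the footprint (more rows in x_R) changes nothing about the parameters, which is the cardinality invariance mentioned in the introduction.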

 

The function \alpha computes the weights \alpha(x_i , x_j ) that are used to combine the transformed features \beta(x_j). To simplify exposition of different forms of self-attention, we decompose \alpha as follows:

\alpha(x_i , x_j ) = \gamma(\delta(x_i , x_j )).                                            (3)

The relation function \delta outputs a single vector that represents the features x_i and x_j.

The function \gamma then maps this vector into a vector that can be combined with \beta(x_j) as shown in Eq. 2. The function \gamma enables us to explore relations \delta that produce vectors of varying dimensionality that need not match the dimensionality of \beta(x_j). It also allows us to introduce additional trainable transformations into the construction of the weights \alpha(x_i , x_j ), making this construction more expressive. This function performs a linear mapping, followed by a nonlinearity, followed by another linear mapping; i.e., \gamma={Linear\rightarrow ReLU\rightarrow Linear}. The output dimensionality of \gamma does not need to match that of \beta as attention weights can be shared across a group of channels.

We explore multiple forms for the relation function \delta:

Summation: \delta(x_i , x_j ) =\phi (x_i) + \psi (x_j )

Subtraction: \delta(x_i , x_j ) =\phi (x_i) - \psi (x_j )

Concatenation: \delta(x_i , x_j ) = [\phi (x_i) , \psi (x_j )]

Hadamard product: \delta(x_i , x_j ) =\phi (x_i) \odot \psi (x_j ) 

Dot product: \delta(x_i , x_j ) =\phi (x_i)^T \psi (x_j ) 

Here \phi and \psi are trainable transformations such as linear mappings, and have matching output dimensionality. With summation, subtraction, and Hadamard product, the dimensionality of \delta(x_i , x_j ) is the same as the dimensionality of the transformation functions. With concatenation, the dimensionality of \delta(x_i , x_j ) will be doubled. With the dot product, the dimensionality of \delta(x_i , x_j ) is 1.
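For reference, a small sketch (assuming PyTorch, with a = \phi(x_i) and b = \psi(x_j), both of dimensionality D) of the five relation forms and their output dimensionalities:

```python
import torch

D = 8
a, b = torch.randn(D), torch.randn(D)               # a = phi(x_i), b = psi(x_j)

summation   = a + b                                 # dimensionality D
subtraction = a - b                                 # dimensionality D
concat      = torch.cat([a, b], dim=-1)             # dimensionality 2D
hadamard    = a * b                                 # dimensionality D
dot         = (a * b).sum(dim=-1, keepdim=True)     # dimensionality 1
```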
 

In this part, the proposed pairwise self-attention differs from traditional self-attention in two ways:

1) the dot-product part of traditional self-attention can be replaced by other relation forms;

2) no sigmoid or softmax is used; instead, \gamma={Linear\rightarrow ReLU\rightarrow Linear}.

The output dimensionality of each form is also analyzed. Note that the dot-product form shares one spatial attention weight across all channels; summation, subtraction, and the Hadamard product compute a spatial attention weight per channel; concatenation produces twice as many channels as the input, which must then be fused (typically mapped back down) by a subsequent layer, so its spatial attention is likewise computed per channel rather than shared.

 

Here is another question that has puzzled me for a long time: CBAM, self-attention, non-local attention, and nearly all earlier spatial-attention models compute a single spatial attention map that is shared across all channels. Why? Why not compute a separate spatial attention map for each channel?

In RAM: Residual Attention Module for Single Image Super-Resolution, the authors used depth-wise convolution to compute a spatial attention map per channel, but when I experimented with it on image dehazing, the effect on network performance was negative! That paper was indeed rejected from CVPR and has few citations; perhaps the reviewers also noticed problems with the method.

So why is per-channel spatial attention effective when this paper proposes it again?

Whether spatial attention maps should be shared across channels is something I hope readers will discuss in the comments!

 

  • Position encoding

A distinguishing characteristic of pairwise attention is that feature vectors x_j are processed independently and the weight computation \alpha(x_i , x_j ) cannot incorporate information from any location other than i and j. To provide some spatial context to the model, we augment the feature maps with position information. The position is encoded as follows. The horizontal and vertical coordinates along the feature map are first normalized to the range [−1, 1] in each dimension. These normalized two-dimensional coordinates are then passed through a trainable linear layer, which can map them to an appropriate range for each layer in the network. This linear mapping outputs a two-dimensional position feature p_i for each location i in the feature map. For each pair (i, j) such that j \in R(i), we encode the relative position information by calculating the difference p_i-p_j. The output of \delta(x_i , x_j ) is augmented by concatenating [p_i-p_j] prior to the mapping \gamma.

This paragraph explains how positions are encoded. The issue stems from self-attention lacking any mechanism for recognizing spatial direction (relative position). In NLP, Universal Transformer and Transformer-XL were proposed to address the Transformer's lack of relative position encoding.

Where is this used?

The output of \delta(x_i , x_j ) is concatenated with [p_i - p_j] and then fed into the mapping \gamma.
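A minimal sketch of this position encoding, assuming PyTorch; the 14×14 feature map size is an arbitrary example, and the 2-dimensional output of the trainable linear layer follows the description above.

```python
import torch
import torch.nn as nn

H, W = 14, 14
ys = torch.linspace(-1.0, 1.0, H)     # normalized vertical coordinates
xs = torch.linspace(-1.0, 1.0, W)     # normalized horizontal coordinates
coords = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)   # (H, W, 2)

pos_linear = nn.Linear(2, 2)          # trainable mapping to position features p_i
p = pos_linear(coords)                # (H, W, 2), one position feature per location

i, j = (3, 3), (3, 4)                 # a pair of locations with j in R(i)
rel = p[i] - p[j]                     # relative position p_i - p_j, shape (2,)
# delta_ij = torch.cat([delta_ij, rel], dim=-1)   # augment delta before gamma
```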

 

Patchwise Self-attention

The other type of self-attention we explore is referred to as patchwise and has the following form:

y_i = \sum_{j\in R(i)} \alpha(x_{R(i)})_j \odot \beta(x_j ),                       (4)

where x_{R(i)} is the patch of feature vectors in the footprint R(i). \alpha(x_{R(i)}) is a tensor of the same spatial dimensionality as the patch x_{R(i)}. \alpha(x_{R(i)})_j is the vector at location j in this tensor, corresponding spatially to the vector x_j in x_{R(i)} .

The general form of patchwise attention is not hard to understand: the attention weights \alpha are computed within a local patch R(i).

 

In patchwise self-attention, we allow the construction of the weight vector that is applied to \beta(x_j ) to refer to and incorporate information from all feature vectors in the footprint R(i). Note that, unlike pairwise self-attention, patchwise self-attention is no longer a set operation with respect to the features x_j. It is not permutation-invariant or cardinality-invariant: the weight computation \alpha(x_{R(i)}) can index the feature vectors x_j individually, by location, and can intermix information from feature vectors from different locations within the footprint. Patchwise self-attention is thus strictly more powerful than convolution.

This paragraph analyzes patchwise attention.

I do not yet fully understand the emphasized part (that the weight computation can index the feature vectors individually, by location, and intermix information across the footprint); I hope readers will discuss it together!

 

We decompose \alpha(x_{R(i)}) as follows:

\alpha(x_{R(i)})=\gamma(\delta(x_{R(i)})) .                           (5)

The function \gamma maps a vector produced by \delta(x_{R(i)}) to a tensor of appropriate dimensionality. This tensor comprises weight vectors for all locations j. The function \delta combines the feature vectors x_j from the patch x_{R(i)} . We explore the following forms for this combination:

Star-product: \delta(x_{R(i)}) = [\phi(x_i)^T \psi(x_j)]_{\forall j\in R(i)}

Clique-product: \delta(x_{R(i)}) = [\phi(x_j)^T \psi(x_k)]_{\forall j,k\in R(i)}

Concatenation: \delta(x_{R(i)}) = [\phi(x_i), [\psi(x_j)]_{\forall j\in R(i)}]
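As an illustration of Eqs. (4) and (5) with the concatenation form of \delta, here is a rough PyTorch sketch for a single location i; the footprint size, the hidden width of \gamma, and the layer shapes are my own assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

C, K = 16, 9                                   # channels, footprint size |R(i)|
x_i = torch.randn(C)                           # feature vector at location i
x_R = torch.randn(K, C)                        # the patch x_{R(i)}

phi, psi, beta = nn.Linear(C, C), nn.Linear(C, C), nn.Linear(C, C)
# gamma maps the combined vector to one weight vector per location j in R(i)
gamma = nn.Sequential(nn.Linear(C + K * C, 256), nn.ReLU(), nn.Linear(256, K * C))

# Concatenation form: delta(x_{R(i)}) = [phi(x_i), [psi(x_j)]_{j in R(i)}]
delta = torch.cat([phi(x_i), psi(x_R).reshape(-1)], dim=0)
alpha = gamma(delta).reshape(K, C)             # weights can index each j individually
y_i = (alpha * beta(x_R)).sum(dim=0)           # Eq. (4)
```

Because \gamma sees the whole footprint at once, the weight for one location j can depend on every other location in the patch, which is exactly what pairwise attention cannot do.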

 

Self-attention Block

The self-attention operations described in Sections 3.1 and 3.2 can be used to construct residual blocks [12] that perform both feature aggregation and feature transformation. Our self-attention block is illustrated in Figure 1. The input feature tensor (channel dimensionality C) is passed through two processing streams. The left stream evaluates the attention weights \alpha by computing the function \delta (via the mappings \phi and \psi) and a subsequent mapping \gamma. The right stream applies a linear transformation \beta that transforms the input features and reduces their dimensionality for efficient processing. The outputs of the two streams are then aggregated via a Hadamard product. The combined features are passed through a normalization and an elementwise nonlinearity, and are processed by a final linear layer that expands their dimensionality back to C.

This part describes Figure 1.

Note 1: the aggregation is the Hadamard product, which requires the two tensors to have the same dimensions.

Note 2: the leftmost branch passes the input through a linear layer that outputs C/r_1 channels, while the middle branch's linear layer outputs C/r_2 channels. How can these be combined with a Hadamard product? The dimensions are reconciled inside \gamma={Linear\rightarrow ReLU\rightarrow Linear}; whether it is the first or the second linear layer that maps C/r_1 to C/r_2 is something you have to check in the code.
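Since the exact wiring is only fully specified in the released code, here is a hedged sketch of the block in Figure 1 for the pairwise subtraction form, assuming PyTorch. For simplicity a single reduction ratio is used for both streams (so the question of where C/r_1 meets C/r_2 does not arise), the footprint is 3×3, and 1×1 convolutions stand in for the linear layers; treat it as an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PairwiseSABlock(nn.Module):
    def __init__(self, C, r=4, k=3):
        super().__init__()
        self.k, self.mid = k, C // r
        self.phi = nn.Conv2d(C, self.mid, 1)          # 1x1 "linear" mappings
        self.psi = nn.Conv2d(C, self.mid, 1)
        self.beta = nn.Conv2d(C, self.mid, 1)
        self.gamma = nn.Sequential(                   # Linear -> ReLU -> Linear
            nn.Conv2d(self.mid, self.mid, 1), nn.ReLU(),
            nn.Conv2d(self.mid, self.mid, 1))
        self.expand = nn.Sequential(                  # norm, nonlinearity, expand back to C
            nn.BatchNorm2d(self.mid), nn.ReLU(),
            nn.Conv2d(self.mid, C, 1))

    def forward(self, x):
        B, C, H, W = x.shape
        K, pad = self.k * self.k, self.k // 2
        # gather the footprint R(i) around every location i: (B, mid, K, H*W)
        unfold = lambda t: F.unfold(t, self.k, padding=pad).view(B, self.mid, K, H * W)
        psi_j, beta_j = unfold(self.psi(x)), unfold(self.beta(x))
        phi_i = self.phi(x).view(B, self.mid, 1, H * W)
        alpha = self.gamma(phi_i - psi_j)                         # vector attention weights
        y = (alpha * beta_j).sum(dim=2).view(B, self.mid, H, W)   # aggregate over R(i)
        return x + self.expand(y)                                 # residual connection
```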

 

Figure 1. Our self-attention block. C is the channel dimensionality. The left stream evaluates the attention weights \alpha, the right stream transforms the features via a linear mapping \beta. Both streams reduce the channel dimensionality for efficient processing. The outputs of the streams are aggregated via a Hadamard product and the dimensionality is subsequently expanded back to C.

 

  • Network Architectures

Our network architectures generally follow residual networks, which we will use as baselines [12]. Table 1 presents three architectures obtained by stacking self-attention blocks at different resolutions. These architectures – SAN10, SAN15, and SAN19 – are in rough correspondence with ResNet26, ResNet38, and ResNet50. The number X in SANX refers to the number of self-attention blocks. Our architectures are based fully on self-attention.

Table 1. Self-attention networks for image recognition. ‘C-d linear’ means that the output dimensionality of the linear layer is ‘C’. ‘C-d sa’ stands for a self-attention operation with output dimensionality ‘C’. SAN10, SAN15, and SAN19 are in rough correspondence with ResNet26, ResNet38, and ResNet50, respectively. The number X in SANX refers to the number of self-attention blocks. Our architectures are based fully on self-attention.

The proposed SAN consists of three kinds of components: self-attention blocks, transition layers, and a classification head.

Backbone. The backbone of SAN has five stages, each with different spatial resolution, yielding a resolution reduction factor of 32. Each stage comprises multiple self-attention blocks. Consecutive stages are bridged by transition layers that reduce spatial resolution and expand channel dimensionality. The output of the last stage is processed by a classification layer that comprises global average pooling, a linear layer, and a softmax.
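A minimal sketch of the classification layer described above, assuming PyTorch; leaving the softmax to the loss function (e.g., nn.CrossEntropyLoss) at training time is a common convention rather than something stated in the paper.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, c_in, num_classes=1000):
        super().__init__()
        self.fc = nn.Linear(c_in, num_classes)

    def forward(self, x):                 # x: (B, C, H, W) from the last stage
        x = x.mean(dim=(2, 3))            # global average pooling -> (B, C)
        return self.fc(x)                 # class logits; softmax applied downstream
```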

Transition. Transition layers reduce spatial resolution, thus reducing the computational burden and expanding receptive field. The transition comprises a batch normalization layer, a ReLU [25], 2×2 max pooling with stride 2, and a linear mapping that expands channel dimensionality.
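Following the description above, a transition layer could be sketched as below (assuming PyTorch); the expansion from c_in to c_out is whatever Table 1 specifies per stage, so the concrete numbers here are placeholders.

```python
import torch.nn as nn

def transition(c_in, c_out):
    return nn.Sequential(
        nn.BatchNorm2d(c_in),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),   # halves the spatial resolution
        nn.Conv2d(c_in, c_out, kernel_size=1),   # linear mapping that expands channels
    )

# e.g. transition(64, 256) would bridge two consecutive stages
```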

Footprint of self-attention. The local footprint R(i) controls the amount of context gathered by a self-attention operator from the preceding feature layer. We set the footprint size to 7×7 for the last four stages of SAN. The footprint is set to 3×3 in the first stage due to the high resolution of that stage and the consequent memory consumption. Note that increasing the footprint size has no impact on the number of parameters in pairwise self-attention. We will study the effect of footprint size on accuracy, capacity, and FLOPs in Section 5.3.

Instantiations. The number of self-attention blocks in each stage can be adjusted to obtain networks with different capacities. In the networks presented in Table 1, the number of self-attention blocks used in the last four stages is the same as the number of residual blocks in ResNet26, ResNet38, and ResNet50, respectively.

The network structure is laid out clearly in Table 1.


Comparison

In this section, we relate the family of self-attention operators presented in Section 3 to other constructions, including convolution [19] and scalar attention [33, 35, 27, 13]. Table 2 summarizes some differences between the constructions. These are discussed in more detail below.

Table 2. The convolution does not adapt to the content of the image. Scalar attention produces scalar weights that do not vary along the channel dimension. Our operators efficiently compute attention weights that adapt across both spatial dimensions and channels.

Convolution. The regular convolution operator has fixed kernel weights that are independent of the content of the image. It does not adapt to the input content. The kernel weights can vary across channels.

Scalar attention. Scalar attention, as used in the transformer [33] and related constructions in computer vision [35, 27, 13], typically has the following form:

y_i = \sum_{j\in R(i)} \phi(x_i)^T \psi(x_j)\, \beta(x_j),                                 (6)

(A softmax and other forms of normalization can be added.) Unlike the convolution, the aggregation weights can vary across different locations, depending on the content of the image. On the other hand, the weight \phi (x_i)^T\psi (x_j ) is a scalar that is shared across all channels. (Hu et al. [13] explored alternatives to the dot product, but these alternatives operated on scalar weights that were likewise shared across channels.) This construction does not adapt the attention weights at different channels. Although this can be mitigated to some extent by introducing multiple heads [33], the number of heads is a small constant and scalar weights are shared by all channels within a head.

Vector attention (ours). The operators presented in Section 3 subsume scalar attention and generalize it in important ways. First, within the pairwise attention family, the relation function \delta can produce vector output. This is the case for the summation, subtraction, Hadamard, and concatenation forms. This vector can then be further processed and mapped to the right dimensionality by \gamma, which can also take position encoding channels as input. The mapping \gamma produces a vector that has compatible dimensionality to the transformed features \beta. This gives the construction significant flexibility in accommodating different relation functions and auxiliary inputs, expressive power due to multiple linear mappings and nonlinearities along the computation graph, ability to produce attention weights that vary along both spatial and channel dimensions, and computational efficiency due to the ability to reduce dimensionality by the mappings \gamma and \beta.
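To make the contrast concrete, here is a tiny sketch (assuming PyTorch) of the weight shapes for one location i and a footprint of K positions; a and b stand for \phi(x_i) (broadcast over j) and \psi(x_j), and the subtraction relation is used as a representative vector form.

```python
import torch

C, K = 16, 9                                   # channels, footprint size
a, b = torch.randn(K, C), torch.randn(K, C)    # phi(x_i) broadcast over j, psi(x_j)

scalar_w = (a * b).sum(dim=-1, keepdim=True)   # shape (K, 1): one weight per j, shared by all channels
vector_w = a - b                               # shape (K, C): a separate weight per j and per channel
```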

The patchwise family of operators generalizes convolution while retaining parameter and FLOP efficiency. This family of operators produces weight vectors for all positions along a feature map that also vary along the channel dimension. The weight vectors are informed by the entirety of the footprint of the operator.

[To be continued...]

 
