MyDLNote - Attention: ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

MyDLNote - Attention: [2020CVPR] ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

Qilong Wang1 , Banggu Wu1 , Pengfei Zhu1 , Peihua Li2 , Wangmeng Zuo3 , Qinghua Hu1,∗

1 Tianjin Key Lab of Machine Learning, College of Intelligence and Computing, Tianjin University, China

2 Dalian University of Technology, China

3 Harbin Institute of Technology, China

[Preface] The contribution of this paper is that it changes how we think about the classic SE block. The takeaway for me: for any network, do not blindly accept its design details; running some real experiments of your own can lead to new conclusions and discoveries.

The writing itself is merely adequate.


Contents

MyDLNote - Attention: [2020CVPR] ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

Abstract

Introduction

Proposed Method

Revisiting Channel Attention in SE Block

Efficient Channel Attention (ECA) Module

Coverage of Local Cross-Channel Interaction

ECA Module for Deep CNNs

Code



Abstract

Recently, channel attention mechanism has demonstrated to offer great potential in improving the performance of deep convolutional neural networks (CNNs). However, most existing methods dedicate to developing more sophisticated attention modules for achieving better performance, which inevitably increase model complexity. To overcome the paradox of performance and complexity trade-off, this paper proposes an Efficient Channel Attention (ECA) module, which only involves a handful of parameters while bringing clear performance gain.

Motivation: existing methods pursue ever more sophisticated attention designs, whereas the proposed module introduces only a handful of parameters yet delivers a clear performance gain.

By dissecting the channel attention module in SENet, we empirically show avoiding dimensionality reduction is important for learning channel attention, and appropriate cross-channel interaction can preserve performance while significantly decreasing model complexity.

Core idea: avoiding dimensionality reduction is important for learning channel attention, and appropriate cross-channel interaction can preserve performance while significantly reducing model complexity.

Therefore, we propose a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented via 1D convolution. Furthermore, we develop a method to adaptively select kernel size of 1D convolution, determining coverage of local cross-channel interaction.

The core work of the paper:

a local cross-channel interaction strategy without dimensionality reduction is proposed, which can be implemented efficiently via 1D convolution;

a method is developed to adaptively select the kernel size of the 1D convolution, which determines the coverage of local cross-channel interaction.

The proposed ECA module is efficient yet effective, e.g., the parameters and computations of our modules against backbone of ResNet50 are 80 vs. 24.37M and 4.7e-4 GFLOPs vs. 3.86 GFLOPs, respectively, and the performance boost is more than 2% in terms of Top-1 accuracy. We extensively evaluate our ECA module on image classification, object detection and instance segmentation with backbones of ResNets and MobileNetV2. The experimental results show our module is more efficient while performing favorably against its counterparts.

Experimental results.


Introduction

The introduction tells a clear story:

Recently, incorporation of channel attention into convolution blocks has attracted a lot of interests, showing great potential in performance improvement [14, 33, 13, 4, 9, 18, 7]. 

Paragraph 1: broad background. Incorporating attention into convolution blocks has produced strong results.

[4] A2 -Nets: Double attention networks. In NIPS, 2018.

[7] Dual attention network for scene segmentation. In CVPR, 2019.

[9] Global second-order pooling convolutional networks. In CVPR, 2019.

[13] Gather-excite: Exploiting feature context in convolutional neural networks. In NeurIPS, 2018.

[14] Squeeze-and-excitation networks. In CVPR, 2018.

[18] Channel locality block: A variant of squeeze-and-excitation. arXiv:1901.01493, 2019.

[33] CBAM: Convolutional block attention module. In ECCV, 2018.

Following the setting of squeeze (i.e., feature aggregation) and excitation (i.e., feature recalibration) in SENet [14], some researches improve SE block by capturing more sophisticated channel-wise dependencies [33, 4, 9, 7] or by combining with additional spatial attention [33, 13, 7]. Although these methods have achieved higher accuracy, they often bring higher model complexity and suffer from heavier computational burden. Different from the aforementioned methods that achieve better performance at the cost of higher model complexity, this paper focuses instead on a question: Can one learn effective channel attention in a more efficient way?

Paragraph 2: posing the question. These gains, however, all come from designing more sophisticated channel attention or adding extra spatial attention. The paper asks: can effective channel attention be learned without a noticeable increase in complexity?

To answer this question, we first revisit the channel attention module in SENet. Specifically, given the input features, SE block first employs a global average pooling for each channel independently, then two fully-connected (FC) layers with non-linearity followed by a Sigmoid function are used to generate channel weights. The two FC layers are designed to capture non-linear cross-channel interaction, which involve dimensionality reduction for controlling model complexity. Although this strategy is widely used in subsequent channel attention modules [33, 13, 9], our empirical studies show dimensionality reduction brings side effect on channel attention prediction, and it is inefficient and unnecessary to capture dependencies across all channels.

Paragraph 3: the paper's key findings, i.e. two problems with SE:

1. Dimensionality reduction in the intermediate FC layer: it harms channel attention prediction (a problem everyone already knows about);

2. The FC layers model relations across all channels: capturing dependencies across all channels is inefficient and unnecessary (this one I heard here for the first time, and it is quite intriguing).

Therefore, this paper proposes an Efficient Channel Attention (ECA) module for deep CNNs, which avoids dimensionality reduction and captures cross-channel interaction in an efficient way. As illustrated in Figure 2, after channel-wise global average pooling without dimensionality reduction, our ECA captures local cross-channel interaction by considering every channel and its k neighbors. Such method is proven to guarantee both efficiency and effectiveness. Note that our ECA can be efficiently implemented by fast 1D convolution of size k, where kernel size k represents the coverage of local cross-channel interaction, i.e., how many neighbors participate in attention prediction of one channel. To avoid manual tuning of k via cross-validation, we develop a method to adaptively determine k, where coverage of interaction (i.e., kernel size k) is proportional to channel dimension. As shown in Figure 1 and Table 3, as opposed to the backbone models [11], deep CNNs with our ECA module (called ECA-Net) introduce very few additional parameters and negligible computations, while bringing notable performance gain. For example, for ResNet-50 with 24.37M parameters and 3.86 GFLOPs, the additional parameters and computations of ECA-Net50 are 80 and 4.7e-4 GFLOPs, respectively; meanwhile, ECA-Net50 outperforms ResNet-50 by 2.28% in terms of Top-1 accuracy.

Paragraph 4: the solutions proposed for these findings: (1) avoid the dimensionality reduction in the FC layers and capture cross-channel interaction in an efficient way; (2) adjust the size of k automatically.

Concretely, (1) replaces the perceptron with a local convolution layer whose output has the same dimension as its input; very simple, yet a genuine break from conventional thinking;

(2) the range of interaction (i.e., the kernel size k) is made proportional to the channel dimension.

Figure 2. Diagram of our efficient channel attention (ECA) module. Given the aggregated features obtained by global average pooling (GAP), ECA generates channel weights by performing a fast 1D convolution of size k, where k is adaptively determined via a mapping of channel dimension C.

Table 1 summarizes existing attention modules in terms of whether channel dimensionality reduction (DR), cross-channel interaction and lightweight model, where we can see that our ECA module learn effective channel attention by avoiding channel dimensionality reduction while capturing cross-channel interaction in an extremely lightweight way. To evaluate our method, we conduct experiments on ImageNet-1K [6] and MS COCO [23] in a variety of tasks using different deep CNN architectures.

Paragraph 5: a side-by-side comparison of the proposed solution against other methods, which earns the reader's goodwill and trust in the approach.

Table 1. Comparison of existing attention modules in terms of whether no channel dimensionality reduction (No DR), cross-channel interaction and less parameters than SE (indicated by lightweight) or not.

The contributions of this paper are summarized as follows. (1) We dissect the SE block and empirically demonstrate avoiding dimensionality reduction and appropriate cross-channel interaction are important to learn effective and efficient channel attention, respectively. (2) Based on above analysis, we make an attempt to develop an extremely lightweight channel attention module for deep CNNs by proposing an Efficient Channel Attention (ECA), which increases little model complexity while bringing clear improvement. (3) The experimental results on ImageNet-1K and MS COCO demonstrate our method has lower model complexity than state-of-the-arts while achieving very competitive performance.

Paragraph 6: summary of contributions: (1) the SE block is dissected, and it is shown empirically that avoiding dimensionality reduction and appropriate cross-channel interaction are important for learning effective and efficient channel attention, respectively; (2) based on this analysis, an extremely lightweight channel attention module, Efficient Channel Attention (ECA), is developed for deep CNNs, adding little model complexity while bringing a clear improvement; (3) experimental results on ImageNet-1K and MS COCO show that the method achieves very competitive performance with lower model complexity than the state of the art.


Proposed Method

In this section, we first revisit the channel attention module in SENet [14] (i.e., SE block). Then, we make an empirical diagnosis of SE block by analyzing effects of dimensionality reduction and cross-channel interaction. This motivates us to propose our ECA module. In addition, we develop a method to adaptively determine parameter of our ECA, and finally show how to adopt it for deep CNNs.

 

Revisiting Channel Attention in SE Block

Let the output of one convolution block be X \in \mathbb{R}^{W\times H\times C} , where W, H and C are width, height and channel dimension (i.e., number of filters). Accordingly, the weights of channels in SE block can be computed as

\omega = \sigma(f_{\{W_1,W_2\}}(g(X))),                           (1)

where g(X)=(WH)^{-1}\sum^{W,H}_{i=1,j=1}X_{i,j} is channel-wise global average pooling (GAP) and \sigma is a Sigmoid function. Let y = g(X); then f_{\{W_1,W_2\}} takes the form

f_{\{W_1,W_2\}}(y) = W_2\, ReLU(W_1 y),              (2)

where ReLU indicates the Rectified Linear Unit. To avoid high model complexity, sizes of W_1 and W_2 are set to C \times ( C/r ) and ( C/r ) \times C, respectively. We can see that f_{\{W_1,W_2\}} involves all parameters of channel attention block. While dimensionality reduction in Eq. (2) can reduce model complexity, it destroys the direct correspondence between channel and its weight. For example, one single FC layer predicts weight of each channel using a linear combination of all channels. But Eq. (2) first projects channel features into a low-dimensional space and then maps them back, making correspondence between channel and its weight be indirect.

The problem with SE: Eq. (2) first projects the channel features into a low-dimensional space and then maps them back, which makes the correspondence between a channel and its weight indirect.
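To make the dimensionality reduction concrete, here is a minimal PyTorch sketch of the SE channel attention described by Eqs. (1)-(2); treat it as my own illustration rather than the authors' code, and note that the class name SEAttention and the default reduction ratio r = 16 are my choices.

import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """SE channel attention (Eqs. (1)-(2)): GAP -> FC (C -> C/r) -> ReLU -> FC (C/r -> C) -> Sigmoid."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r, bias=False),  # W1: dimensionality reduction to C/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels, bias=False),  # W2: map back to C
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                   # g(X): channel-wise global average pooling, (B, C)
        w = self.fc(y)                           # omega: channel weights of Eq. (1), (B, C)
        return x * w.view(x.size(0), -1, 1, 1)   # recalibrate the input features

The two Linear layers are exactly the W_1 and W_2 of Eq. (2); it is this detour through the C/r-dimensional space that the paper argues against.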

 

Efficient Channel Attention (ECA) Module

Avoiding Dimensionality Reduction

As discussed above, dimensionality reduction in Eq. (2) makes correspondence between channel and its weight be indirect. To verify its effect, we compare the original SE block with its three variants (i.e., SE-Var1, SE-Var2 and SEVar3), all of which do not perform dimensionality reduction. As presented in Table 2, SE-Var1 with no parameter is still superior to the original network, indicating channel attention has ability to improve performance of deep CNNs. Meanwhile, SE-Var2 learns the weight of each channel independently, which is slightly superior to SE block while involving less parameters. It may suggest that channel and its weight needs a direct correspondence while avoiding dimensionality reduction is more important than consideration of nonlinear channel dependencies. Additionally, SE-Var3 employing one single FC layer performs better than two FC layers with dimensionality reduction in SE block. All of above results clearly demonstrate avoiding dimensionality reduction is helpful to learn effective channel attention. Therefore, we develop our ECA module without channel dimensionality reduction.

This subsection gives the key experimental observations, namely:

SE-Var1: SE-Var1, with no parameters at all, is still superior to the original network (did they phrase this wrong?), indicating that channel attention by itself can improve the performance of deep CNNs.

SE-Var2: SE-Var2 learns the weight of each channel independently (a depth-wise operation, essentially?) and is slightly better than the SE block while using fewer parameters.

SE-Var3: SE-Var3, which uses a single FC layer, performs better than the SE block with its two dimensionality-reducing FC layers.

All of these results clearly show that avoiding dimensionality reduction helps learn effective channel attention; hence the ECA module is developed without reducing the channel dimension.

Table 2. Comparison of various channel attention modules using ResNet-50 as backbone model on ImageNet. #.Param. indicates number of parameters of the channel attention module; \odot indicates element-wise product; GC and C1D indicate group convolutions and 1D convolution, respectively; k is kernel size of C1D.
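For concreteness, here is a rough PyTorch sketch of the three variants as I read them from Table 2 and the discussion below (SE-Var1: no parameters; SE-Var2: one independent weight per channel; SE-Var3: one full FC layer). The class names and exact forms are my reconstruction, not code from the paper.

import torch
import torch.nn as nn

class SEVar1(nn.Module):
    """No parameters: the weights are just the Sigmoid of the pooled features."""
    def forward(self, x):
        y = x.mean(dim=(2, 3))                                       # (B, C)
        return x * torch.sigmoid(y).view(x.size(0), -1, 1, 1)

class SEVar2(nn.Module):
    """One learnable weight per channel (a diagonal W): C parameters, no cross-channel interaction."""
    def __init__(self, channels):
        super().__init__()
        self.w = nn.Parameter(torch.ones(channels))
    def forward(self, x):
        y = x.mean(dim=(2, 3))
        return x * torch.sigmoid(self.w * y).view(x.size(0), -1, 1, 1)

class SEVar3(nn.Module):
    """One full FC layer (a full C x C matrix W): C^2 parameters, global cross-channel interaction."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels, bias=False)
    def forward(self, x):
        y = x.mean(dim=(2, 3))
        return x * torch.sigmoid(self.fc(y)).view(x.size(0), -1, 1, 1)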

 

Local Cross-Channel Interaction

Given the aggregated feature y \in \mathbb{R}^C without dimensionality reduction, channel attention can be learned by \omega =\sigma(Wy), where W is a C \times C parameter matrix. In particular, for SE-Var2 and SE-Var3 we have

W_{var2} = \mathrm{diag}(w^{1,1}, w^{2,2}, \dots, w^{C,C}), \qquad W_{var3} = \begin{bmatrix} w^{1,1} & \cdots & w^{1,C} \\ \vdots & \ddots & \vdots \\ w^{C,1} & \cdots & w^{C,C} \end{bmatrix},                           (4)

where W_{var2} for SE-Var2 is a diagonal matrix, involving C parameters; W_{var3} for SE-Var3 is a full matrix, involving C \times C parameters. As shown in Eq. (4), the key difference is that SE-Var3 considers cross-channel interaction while SE-Var2 does not, and consequently SE-Var3 achieves better performance. This result indicates that cross-channel interaction is beneficial to learn channel attention. However, SE-Var3 requires a mass of parameters, leading to high model complexity, especially for large channel numbers.

The previous subsection established that SE-Var2 and SE-Var3, which avoid dimensionality reduction, work best. This passage explains why SE-Var3 outperforms SE-Var2: SE-Var3 considers cross-channel interaction while SE-Var2 does not, hence the better performance.

Taking this one step further: would combining SE-Var2 and SE-Var3 be even better?

A possible compromise between SE-Var2 and SE-Var3 is extension of W_{var2} to a block diagonal matrix, i.e.,

W_G = \mathrm{diag}(W^1, W^2, \dots, W^G),                           (5)

where Eq. (5) divides channel into G groups each of which includes C/G channels, and learns channel attention in each group independently, which captures cross-channel interaction in a local manner. Accordingly, it involves C^2/G parameters. From perspective of convolution, SE-Var2, SE-Var3 and Eq. (5) can be regarded as a depth-wise separable convolution, a FC layer and group convolutions, respectively. Here, SE block with group convolutions (SE-GC) is indicated by \sigma(GC_G(y)) = \sigma(W_Gy). However, as shown in [24], excessive group convolutions will increase memory access cost and so decrease computational efficiency. Furthermore, as shown in Table 2, SE-GC with varying groups bring no gain over SE-Var2, indicating it is not an effective scheme to capture local cross-channel interaction. The reason may be that SE-GC completely discards dependences among different groups.

So the way to combine SE-Var2 and SE-Var3 is group convolution, i.e. SE-GC.

Clever, but it does not work well. The likely reason is that SE-GC completely discards the dependencies between different groups.

In this paper, we explore another method to capture local cross-channel interaction, aiming at guaranteeing both efficiency and effectiveness. Specifically, we employ a band matrix W_k to learn channel attention, where W_k has the form

W_k = \begin{bmatrix} w^{1,1} & \cdots & w^{1,k} & 0 & \cdots & 0 \\ 0 & w^{2,2} & \cdots & w^{2,k+1} & \cdots & 0 \\ \vdots & & \ddots & & \ddots & \vdots \\ 0 & \cdots & 0 & w^{C,C-k+1} & \cdots & w^{C,C} \end{bmatrix}.                           (6)

Clearly, W_k in Eq. (6) involves k \times C parameters, which is usually less than those of Eq. (5). Furthermore, Eq. (6) avoids complete independence among different groups in Eq. (5). As compared in Table 2, the method in Eq. (6) (namely ECA-NS) outperforms SE-GC of Eq. (5). As for Eq. (6), the weight of y_i is calculated by only considering interaction between y_i and its k neighbors, i.e.,

\omega_i = \sigma\Big(\sum_{j=1}^{k} w_i^j y_i^j\Big), \quad y_i^j \in \Omega_i^k,                           (7)

where \Omega_i^k indicates the set of k adjacent channels of y_i.

A more efficient way is to make all channels share the same learning parameters, i.e.,

\omega_i = \sigma\Big(\sum_{j=1}^{k} w^j y_i^j\Big), \quad y_i^j \in \Omega_i^k.                           (8)

SE-GC fails because its interaction pattern is poor, so the paper proposes a scheme with better interaction: Eq. (6).

My reading is that Eq. (6) amounts to C separate 1\times k convolutions whose centers slide over the channels with stride 1.

To cut parameters further, all of these convolutions share their weights, i.e., a single 1\times k convolution with stride 1 over the channel dimension.

Note that such strategy can be readily implemented by a fast 1D convolution with kernel size of k, i.e.,

\omega = \sigma(C1D_k(y)),                      (9)

where C1D indicates 1D convolution. Here, the method in Eq. (9) is called by efficient channel attention (ECA) module, which only involves k parameters. As presented in Table 2, our ECA module with k = 3 achieves similar results with SE-var3 while having much lower model complexity, which guarantees both efficiency and effectiveness by appropriately capturing local cross-channel interaction.

Surprised? After all the experiments and derivations, the end result is simply a 1\times 3 convolution replacing the original FC layers!

This is the efficient channel attention (ECA) module. The conclusion is simple, but the process of getting there is careful.

Of course, k = 3 is an empirically chosen value; the next subsection proposes an adaptive way to determine the kernel length.
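As a quick sanity check of this reading, the shared-weight form of Eq. (8) really is nothing more than a 1D convolution over the channel vector, which is exactly what Eq. (9) states. A small sketch (the shapes and zero padding are my choices):

import torch
import torch.nn.functional as F

C, k = 8, 3
y = torch.randn(C)                 # aggregated features after GAP
w = torch.randn(k)                 # the k shared weights of Eq. (8)

# Eq. (9): channel attention via a fast 1D convolution over the channel dimension.
conv_out = F.conv1d(y.view(1, 1, C), w.view(1, 1, k), padding=k // 2).view(C)

# The same computation written out per channel: each channel only interacts with
# its k neighbours, and the k coefficients are shared by all channels.
y_pad = F.pad(y, (k // 2, k // 2))                                   # zero-pad the channel borders
manual = torch.stack([(w * y_pad[i:i + k]).sum() for i in range(C)])

print(torch.allclose(conv_out, manual))   # True
omega = torch.sigmoid(conv_out)           # the channel weights of Eq. (9)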

 

Coverage of Local Cross-Channel Interaction

Since our ECA module (9) aims at appropriately capturing local cross-channel interaction, so the coverage of interaction (i.e., kernel size k of 1D convolution) needs to be determined. The optimized coverage of interaction could be tuned manually for convolution blocks with different channel numbers in various CNN architectures. However, manual tuning via cross-validation will cost a lot of computing resources. Group convolutions have been successfully adopted to improve CNN architectures [37, 34, 16], where high-dimensional (low-dimensional) channels involve long range (short range) convolutions given the fixed number of groups. Sharing the similar philosophy, it is reasonable that the coverage of interaction (i.e., kernel size k of 1D convolution) is proportional to channel dimension C. In other words, there may exist a mapping \varphi between k and C:

C = \varphi (k).                          (10)

The simplest mapping is a linear function, i.e., \varphi (k) = \gamma \ast k- b.

The coverage of interaction should be proportional to the number of input channels; the simplest such mapping is the linear form \varphi (k) = \gamma \ast k- b.

[16] Deep roots: Improving CNN efficiency with hierarchical filter groups. In CVPR, 2017.

[34] Aggregated residual transformations for deep neural networks. In CVPR, 2017.

[37] Interleaved group convolutions. In ICCV, 2017.

However, the relations characterized by linear function are too limited. On the other hand, it is well known that channel dimension C (i.e., number of filters) usually is set to power of 2. Therefore, we introduce a possible solution by extending the linear function \varphi (k) = \gamma \ast k- b to a non-linear one, i.e.,

C = \varphi (k) = 2^{(\gamma \ast k-b)} .                                   (11)

Then, given channel dimension C, kernel size k can be adaptively determined by

k = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd},                                   (12)

where |t|_{odd} indicates the nearest odd number of t. In this paper, we set \gamma and b to 2 and 1 throughout all the experiments, respectively. Clearly, through the mapping \psi, high-dimensional channels have longer range interaction while low-dimensional ones undergo shorter range interaction by using a non-linear mapping.

However, the linear form is too restrictive: it cannot really capture the underlying relation between the interaction coverage and the number of input channels.

Since C is almost always a power of 2, the paper instead designs the non-linear relation

C = \varphi (k) = 2^{(\gamma \ast k-b)}

and finally solves for k, rounding to an odd number, with

\gamma=2, ~~ b=1.
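A short sketch of the adaptive rule of Eq. (12), written the way it is usually implemented; the function name adaptive_kernel_size is mine.

import math

def adaptive_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """k = psi(C) = |log2(C)/gamma + b/gamma|_odd, i.e. the nearest odd number (Eq. (12))."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1          # force an odd kernel size

# Example values for typical ResNet stage widths: C = 64 -> k = 3, C = 256 -> k = 5, C = 2048 -> k = 7.
for c in (64, 128, 256, 512, 1024, 2048):
    print(c, adaptive_kernel_size(c))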

 

ECA Module for Deep CNNs

Figure 2 illustrates the overview of our ECA module. After aggregating convolution features using GAP without dimensionality reduction, ECA module first adaptively determines kernel size k, and then performs 1D convolution followed by a Sigmoid function to learn channel attention. For applying our ECA to deep CNNs, we replace SE block by our ECA module following the same configuration in [14]. The resulting networks are named by ECA-Net. Figure 3 gives PyTorch code of our ECA.

To apply ECA to deep CNNs, the SE block is simply replaced by the ECA module, following the same configuration as in [14].

[14] Squeeze-and-excitation networks. In CVPR, 2018.


Code

Figure 3. PyTorch code of our ECA module.
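Since Figure 3 itself is not reproduced in this note, below is a minimal PyTorch sketch of an ECA layer that matches the description above (GAP without dimensionality reduction, a 1D convolution of adaptively chosen kernel size k, then a Sigmoid). It is my reimplementation, not the authors' Figure 3; the class name ECALayer is mine.

import math
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """Efficient Channel Attention: GAP -> 1D convolution of size k (no dimensionality reduction) -> Sigmoid."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Kernel size k adaptively chosen from the channel dimension C, Eq. (12).
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                               # x: (B, C, H, W)
        y = x.mean(dim=(2, 3), keepdim=True)            # GAP: (B, C, 1, 1)
        y = y.squeeze(-1).transpose(1, 2)               # (B, 1, C): treat channels as the sequence axis
        y = self.conv(y)                                # local cross-channel interaction, Eq. (9)
        y = y.transpose(1, 2).unsqueeze(-1)             # back to (B, C, 1, 1)
        return x * self.sigmoid(y)                      # recalibrate the input features

# Usage: drop it in wherever an SE block would sit, e.g. after the last convolution of a residual block.
x = torch.randn(2, 256, 14, 14)
out = ECALayer(256)(x)            # same shape as x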

 

 

 

The experiments section will be analysed when I actually need it... [to be continued]

 
