ECANet: Efficient Channel Attention for Deep Convolutional Neural Networks
Abstract
Channel attention has recently been demonstrated to offer great potential in improving the performance of deep convolutional neural networks (CNNs). However, most existing methods dedicate themselves to developing more sophisticated attention modules for better performance, inevitably increasing the computational burden. To overcome the paradox of the performance-complexity trade-off, this paper investigates an extremely lightweight attention module for boosting the performance of deep CNNs. In particular, we propose an Efficient Channel Attention (ECA) module, which only involves a handful of parameters while bringing clear performance gain. By revisiting the channel attention module in SENet, we empirically show that avoiding dimensionality reduction and appropriate cross-channel interaction are important to learn effective channel attention. Therefore, we propose a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented by a fast 1D convolution. Furthermore, we develop a function of channel dimension to adaptively determine the kernel size of the 1D convolution, which determines the coverage of local cross-channel interaction. Our ECA module can be flexibly incorporated into existing CNN architectures, and the resulting CNNs are named ECA-Net. We extensively evaluate the proposed ECA-Net on image classification, object detection and instance segmentation with ResNet and MobileNetV2 backbones. The experimental results show our ECA-Net is more efficient while performing favorably against its counterparts. The source code and models are available at https://github.com/BangguWu/ECANet.
Introduction
Deep convolutional neural networks (CNNs) have been widely used in artificial intelligence and have achieved great progress in a broad range of tasks, e.g., image classification, object detection and semantic segmentation. Starting from the groundbreaking AlexNet [18], many studies have continuously investigated how to further improve the performance of deep CNNs [30, 31, 11, 15, 19, 21, 33]. Recently, the incorporation of attention mechanisms into convolution blocks has attracted a lot of interest, showing great potential for performance improvement [14, 34, 13, 4, 9, 7]. Among these methods, one of the representative works is squeeze-and-excitation networks (SENet) [14], which learns channel attention for each convolution block, bringing clear performance gain over various deep CNN architectures.
Following the setting of squeeze (i.e., feature aggregation) and excitation (i.e., feature recalibration) in SENet [14], some studies improve the SE block by capturing more sophisticated channel-wise dependencies [34, 4, 9, 7] or by combining it with additional spatial attention [34, 13, 7]. Although these methods have achieved higher accuracy, they often bring higher model complexity and suffer from a heavier computational burden. Different from the aforementioned methods, which achieve better performance at the cost of higher model complexity, this paper instead focuses on a question: Can one learn effective channel attention in a more efficient way?
To answer this question, we first revisit the channel attention module in SENet. Specifically, given the input features, the SE block first employs global average pooling for each channel independently, then two fully-connected (FC) layers with a nonlinearity followed by a Sigmoid function are used to generate the weight of each channel. The two FC layers are designed to capture nonlinear cross-channel interaction, and they involve dimensionality reduction to avoid excessively high model complexity. Although this strategy is widely used in subsequent channel attention modules [34, 13, 9], our empirical analyses demonstrate that dimensionality reduction brings side effects on the prediction of channel attention, and that it is inefficient and unnecessary to capture dependencies across all channels.
Based on the above analyses, avoiding dimensionality reduction and appropriate cross-channel interaction appear to play a vital role in developing channel attention mechanisms. Therefore, this paper proposes an Efficient Channel Attention (ECA) module for deep CNNs based on these two properties. As illustrated in Figure 2 (b), after channel-wise global average pooling without dimensionality reduction, our ECA captures local cross-channel interaction by considering every channel and its neighbors. As such, our ECA can be efficiently implemented by a fast 1D convolution of size $k$. The kernel size $k$ represents the coverage of local cross-channel interaction, i.e., how many neighbors participate in the attention prediction of one channel. Clearly, it affects both the efficiency and effectiveness of ECA. Since it is reasonable that the coverage of interaction is related to the channel dimension $C$, we propose a function of $C$ to adaptively determine $k$. As shown in Figure 1 and Table 2, compared with the backbone models [11], deep CNNs with our ECA module (called ECA-Net) introduce very few additional parameters and negligible computations while bringing notable performance gain. For example, for ResNet-50 with 24.37M parameters and 3.86 GFLOPs, the additional parameters and computations of ECA-Net50 are 80 and 4.7e-4 GFLOPs, respectively; meanwhile, ECA-Net50 outperforms ResNet-50 by 2.28% in terms of Top-1 accuracy. To evaluate our method, we conduct experiments on ImageNet-1K [6] and MS COCO [23] using different deep CNN architectures and tasks. The contributions of this paper are summarized as follows.

We empirically demonstrate that avoiding dimensionality reduction and appropriate cross-channel interaction are important to learn efficient and effective channel attention for deep CNNs.

We make an attempt to develop an extremely lightweight channel attention module for deep CNNs by proposing a novel Efficient Channel Attention (ECA) module, which adds little model complexity while bringing clear improvement.

The experimental results on ImageNet-1K and MS COCO demonstrate that our method has lower model complexity than state-of-the-art methods while achieving very competitive performance.
Related Work
Attention mechanisms have proven to be a promising means to strengthen deep CNNs. SENet [14] presents for the first time an effective mechanism to learn channel attention and achieves promising performance. Subsequently, the development of attention modules can be roughly divided into two directions: (1) enhancement of feature aggregation; (2) combination of channel and spatial attention. Specifically, CBAM [34] employs both average and max pooling to aggregate features. GSoP [9] introduces second-order pooling for more effective feature aggregation. GE [13] explores spatial extension using a depth-wise convolution [5] to aggregate features. scSE [28] and CBAM [34] compute spatial attention using a 2D convolution of kernel size $k \times k$, then combine it with channel attention. Sharing a similar philosophy with non-local neural networks [33], Double Attention Networks (A²-Nets) [4] introduce a novel relation function for image or video recognition, while Dual Attention Network (DAN) [7] and Criss-Cross Network (CCNet) [16] simultaneously consider non-local channel and non-local spatial attention for semantic segmentation. Analogously, Li et al. propose an Expectation-Maximization Attention (EMA) module for semantic segmentation [20]. However, these non-local attention modules can only be used in a single or a few convolution blocks due to their high model complexity. Obviously, all of the above methods focus on developing sophisticated attention modules for better performance. Different from them, our ECA aims at learning effective channel attention with low model complexity.
Our work is also related to efficient convolutions, which are designed for lightweight CNN architectures. The two most widely used efficient convolutions are group convolutions [38, 35, 17] and depth-wise separable convolutions [5, 29, 39, 24]. As given in Table 1, although these efficient convolutions involve fewer parameters, they show little effectiveness in the attention module. Our ECA module aims at capturing local cross-channel interaction, which shares some similarities with channel local convolutions [37] and channel-wise convolutions [8]; different from them, our method focuses on a 1D convolution with adaptive kernel size to replace the FC layers in the channel attention module. Compared with group and depth-wise separable convolutions, our method achieves better results with lower model complexity.
Proposed Method
In this section, we first revisit the channel attention module in SENet [14]. Then, we make an empirical comparison to analyze the effects of dimensionality reduction and cross-channel interaction, which motivates us to propose our efficient channel attention (ECA) module. In addition, we introduce adaptive kernel size selection for our ECA module and finally show how to adopt it in deep CNNs.
Table 1. Comparison of the SE block and its variants using ResNet-50 as backbone on ImageNet (Top-1/Top-5 accuracy in %); attention formulas and parameter counts follow Eqs. (1)-(6).

Methods | Attention | Param. | Top-1 | Top-5
--- | --- | --- | --- | ---
Vanilla | N/A | 0 | 75.30 | 92.20
SE | σ(f_{W1,W2}(y)) | 2C²/r | 76.71 | 93.38
SE-Var1 | σ(y) | 0 | 76.00 | 92.90
SE-Var2 | σ(w ⊙ y) | C | 77.07 | 93.31
SE-Var3 | σ(Wy) | C² | 77.42 | 93.64
SE-GC1 | σ(GC_G(y)) | C²/G | 76.95 | 93.47
SE-GC2 | σ(GC_G(y)) | C²/G | 76.98 | 93.31
SE-GC3 | σ(GC_G(y)) | C²/G | 76.96 | 93.38
ECA-NS | with Eq. (4) | k·C | 77.35 | 93.61
ECA (Ours) | σ(C1D_k(y)) | k | 77.43 | 93.65
Revisiting Channel Attention
Let the output of one convolution block be $\mathcal{X} \in \mathbb{R}^{W \times H \times C}$, where $W$, $H$ and $C$ are width, height and channel dimension (i.e., number of filters), respectively. As shown in Figure 2 (a), the weights of channel attention in the SE block can be computed as

$$\omega = \sigma(f_{\{W_1, W_2\}}(g(\mathcal{X}))), \qquad (1)$$

where $g(\mathcal{X}) = \frac{1}{WH}\sum_{i=1,j=1}^{W,H} \mathcal{X}_{ij}$ is channel-wise global average pooling (GAP) and $\sigma$ is a Sigmoid function. Let $\mathbf{y} = g(\mathcal{X})$; then $f_{\{W_1, W_2\}}$ takes the form

$$f_{\{W_1, W_2\}}(\mathbf{y}) = W_2\,\mathrm{ReLU}(W_1 \mathbf{y}), \qquad (2)$$

where ReLU indicates the Rectified Linear Unit [26]. To avoid excessively high model complexity, the sizes of $W_1$ and $W_2$ are set to $\frac{C}{r} \times C$ and $C \times \frac{C}{r}$, respectively. We can see that $f_{\{W_1, W_2\}}$ involves all parameters of the channel attention block. While dimensionality reduction in Eq. (2) can reduce model complexity, it destroys the direct correspondence between a channel and its weight.^1

^1 For example, one single FC layer predicts the weight of each channel using a linear combination of all channels. But Eq. (2) first projects channel features into a low-dimensional space and then maps them back, making the correspondence between a channel and its weight indirect.
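Read together, Eqs. (1) and (2) amount to only a few lines of code. Below is a minimal PyTorch sketch of the SE block; the class and variable names are ours, for illustration only:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """SE channel attention per Eqs. (1)-(2): GAP -> FC (C -> C/r) -> ReLU -> FC (C/r -> C) -> Sigmoid."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r, bias=False)  # W1: dimensionality reduction
        self.fc2 = nn.Linear(channels // r, channels, bias=False)  # W2: map back to C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x.mean(dim=(2, 3))                                # squeeze: channel-wise GAP, (N, C)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(y))))  # excitation: Eq. (2) + Sigmoid
        return x * w[:, :, None, None]                        # recalibrate the input features
```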
Efficient Channel Attention (ECA) Module
In this subsection, we make an empirical comparison for a deeper analysis of the effects of channel dimensionality reduction and cross-channel interaction on learning channel attention. Based on these analyses, we propose our efficient channel attention (ECA) module.
Avoiding Dimensionality Reduction
As discussed above, dimensionality reduction in Eq. (2) makes the correspondence between a channel and its weight indirect. To verify its effect, we compare the original SE block with three variants (i.e., SE-Var1, SE-Var2 and SE-Var3), none of which performs dimensionality reduction. As presented in Table 1, SE-Var1, which has no parameters, is still superior to the original network, indicating that channel attention has the ability to improve the performance of deep CNNs. Meanwhile, SE-Var2 learns the weight of each channel independently and is slightly superior to the SE block while involving fewer parameters. This may suggest that a channel and its weight need a direct correspondence, and that avoiding dimensionality reduction is more important than considering nonlinear channel dependencies. Additionally, SE-Var3, which employs a single FC layer, performs better than the two FC layers with dimensionality reduction in the SE block. All of the above results clearly demonstrate the importance of avoiding dimensionality reduction in the attention module. Therefore, we develop our ECA module without channel dimensionality reduction.
Local CrossChannel Interaction
Although both SE-Var2 and SE-Var3 keep the channel dimension unchanged, the latter achieves better performance. The main difference is that SE-Var3 captures cross-channel interaction while SE-Var2 does not. This indicates that cross-channel interaction is helpful to learn effective attention. However, SE-Var3 involves a mass of parameters, leading to excessively high model complexity. From the perspective of efficient convolutions [38, 35], SE-Var2 can be regarded as a depth-wise separable convolution [5]. Naturally, group convolutions, another kind of efficient convolution, can also be used to capture cross-channel interaction. Given an FC layer, group convolutions divide it into multiple groups and perform a linear transform in each group independently. The SE block with group convolutions (SE-GC) is written as

$$\omega = \sigma(\mathrm{GC}_G(\mathbf{y})) = \sigma(W_G \mathbf{y}), \qquad (3)$$

where $W_G$ is a block diagonal matrix involving $C^2/G$ parameters and $G$ is the number of groups. However, as shown in Table 1, SE-GC with varying numbers of groups brings no gain over SE-Var2, indicating that group convolution is not an effective scheme for exploiting cross-channel interaction. Meanwhile, excessive group convolutions increase memory access cost [24].
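To make the complexity comparison concrete, the parameter counts of the variants discussed here can be tabulated in a few lines of Python; the formulas follow the definitions above, and the default `r`, `G` and `k` values are illustrative:

```python
def attention_params(C: int, r: int = 16, G: int = 16, k: int = 3) -> dict:
    """Parameter counts of the channel attention variants discussed above."""
    return {
        "SE": 2 * C * C // r,   # two FC layers with reduction ratio r
        "SE-Var1": 0,           # parameter-free
        "SE-Var2": C,           # one independent weight per channel
        "SE-Var3": C * C,       # one full FC layer
        "SE-GC": C * C // G,    # block-diagonal (group) FC layer
        "ECA": k,               # one shared 1D-conv kernel (Eq. (6))
    }
```

For a typical $C = 256$ block, the full FC layer costs 65,536 parameters while ECA costs only 3, which is the asymmetry the rest of this section exploits.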
By visualizing the channel features $\mathbf{y}$, we find that they usually exhibit a certain local periodicity (please refer to Appendix A1 for details). Therefore, different from the above methods (i.e., depth-wise separable convolutions, group convolutions and FC layers), we aim at capturing local cross-channel interaction, i.e., only considering the interaction between each channel and its $k$ neighbors. Thus, the weight of $y_i$ can be calculated as

$$\omega_i = \sigma\Big(\sum_{j=1}^{k} w_i^j y_i^j\Big), \quad y_i^j \in \Omega_i^k, \qquad (4)$$

where $\Omega_i^k$ indicates the set of $k$ adjacent channels of $y_i$. Clearly, Eq. (4) captures local cross-channel interaction, and such a locality constraint avoids interaction across all channels, allowing high model efficiency. In this way, the channel attention module involves $k \times C$ parameters. To further reduce model complexity and improve efficiency, we let all channels share the same learning parameters, i.e.,

$$\omega_i = \sigma\Big(\sum_{j=1}^{k} w^j y_i^j\Big), \quad y_i^j \in \Omega_i^k. \qquad (5)$$
As such, our efficient channel attention (ECA) module can be readily implemented by a fast 1D convolution with kernel size $k$, i.e.,

$$\omega = \sigma(\mathrm{C1D}_k(\mathbf{y})), \qquad (6)$$

where C1D indicates 1D convolution. As listed in Table 1, by introducing local cross-channel interaction, our ECA achieves similar results to SE-Var3 and to ECA-NS with Eq. (4) (i.e., ECA without shared parameters), while having much lower model complexity (it only involves $k$ parameters). In Table 1, $k$ is set to 3.
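Eq. (6) can be sketched in PyTorch as follows; this is our own minimal reading of the module, not the authors' released code, and the class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """ECA per Eq. (6): channel-wise GAP, then a shared 1D conv of size k across channels, then Sigmoid."""
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        # A single shared kernel of k parameters; padding keeps the channel length unchanged.
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=(k_size - 1) // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W)
        y = self.gap(x)                                  # (N, C, 1, 1): y = g(X)
        y = y.squeeze(-1).transpose(-1, -2)              # (N, 1, C): channels become the 1D "length"
        y = self.conv(y)                                 # C1D_k(y): local cross-channel interaction
        w = torch.sigmoid(y).transpose(-1, -2).unsqueeze(-1)  # (N, C, 1, 1)
        return x * w.expand_as(x)                        # recalibrate features
```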
Adaptive Selection of Kernel Size
In our ECA module (Eq. (6)), the kernel size $k$ is a key parameter. Since the 1D convolution is used to capture local cross-channel interaction, $k$ determines the coverage of interaction, which may vary across convolution blocks with different channel numbers and across various CNN architectures. Although $k$ could be tuned manually, doing so would cost a lot of computing resources. It is reasonable that $k$ is related to the channel dimension $C$: in general, larger channel dimensions are expected to favor long-range interaction while smaller ones prefer short-range interaction. In other words, there may exist a mapping $\phi$ between $k$ and $C$:

$$C = \phi(k). \qquad (7)$$

The optimal formulation of the mapping $\phi$ is usually unknown. However, based on the above analysis, $C$ is suggested to be nonlinearly proportional to $k$, so a parameterized exponential function is a feasible choice. Meanwhile, in the classical kernel tricks [2, 25], exponential-family functions (e.g., Gaussian) are the most widely used kernel functions for handling unknown mappings. Therefore, we approximate the mapping $\phi$ using an exponential function, i.e.,

$$C = \phi(k) \approx \exp(\gamma * k - b). \qquad (8)$$

Furthermore, since the channel dimension $C$ (i.e., number of filters) is usually set to an integral power of 2, we replace $\exp(\gamma * k - b)$ by $2^{(\gamma * k - b)}$ (note that $2^{x} = e^{x \ln 2}$ is itself an exponential function). Then, given the channel dimension $C$, the kernel size $k$ can be adaptively determined by

$$k = \psi(C) = \left|\frac{\log_2(C)}{\gamma} + \frac{b}{\gamma}\right|_{\mathrm{odd}}, \qquad (9)$$

where $|t|_{\mathrm{odd}}$ indicates the nearest odd number of $t$. In this paper, we set $\gamma$ and $b$ to 2 and 1, respectively. Clearly, through the mapping $\psi$, larger channel dimensions have longer-range interaction and vice versa.
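Eq. (9) reduces to a one-line computation. The sketch below follows the common floor-then-round-up-to-odd reading of $|t|_{\mathrm{odd}}$; the exact rounding convention is an assumption on our part:

```python
import math

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Adaptive kernel size per Eq. (9): k = psi(C) = |log2(C)/gamma + b/gamma|_odd.
    The "nearest odd" step is implemented as floor, then bump even values up by one."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1
```

With the paper's $\gamma = 2$, $b = 1$, a block with 64 channels gets $k = 3$ while one with 512 channels gets $k = 5$, matching the intent that wider blocks receive longer-range interaction.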
ECA for Deep CNNs
Figure 2 compares our ECA module with the SE block. To adopt our ECA in deep CNNs, we use exactly the same configuration as SENet [14] and simply replace the SE block with our ECA module. The resulting networks are named ECA-Net. Figure 3 gives the PyTorch code of our ECA, which is easy to reproduce.
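Figure 3 itself is not reproduced here; as a hedged illustration of how the module slots into a deep CNN, the sketch below inserts ECA before the residual addition of a simplified ResNet bottleneck. The layer layout and the 1×1 shortcut are our simplifications, not the exact published architecture:

```python
import torch
import torch.nn as nn

class ECABottleneck(nn.Module):
    """Simplified ResNet bottleneck with ECA applied before the residual addition."""
    expansion = 4

    def __init__(self, in_ch: int, planes: int, k_size: int = 3):
        super().__init__()
        out_ch = planes * self.expansion
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, planes, 1, bias=False), nn.BatchNorm2d(planes), nn.ReLU(inplace=True),
            nn.Conv2d(planes, planes, 3, padding=1, bias=False), nn.BatchNorm2d(planes), nn.ReLU(inplace=True),
            nn.Conv2d(planes, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.eca_conv = nn.Conv1d(1, 1, k_size, padding=(k_size - 1) // 2, bias=False)
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, bias=False) if in_ch != out_ch else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.body(x)
        y = out.mean(dim=(2, 3), keepdim=True)              # channel-wise GAP: (N, C, 1, 1)
        y = self.eca_conv(y.squeeze(-1).transpose(-1, -2))  # shared 1D conv across channels
        w = torch.sigmoid(y).transpose(-1, -2).unsqueeze(-1)
        return self.relu(out * w + self.shortcut(x))        # recalibrate, then add the shortcut
```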
Experiments
In this section, we evaluate the proposed method on large-scale image classification and object detection using ImageNet [6] and MS COCO [23], respectively. Specifically, we first assess the effect of kernel size on our ECA module and compare with state-of-the-art counterparts on ImageNet. Then, we verify the effectiveness of our ECA module on object detection using Faster R-CNN [27] and Mask R-CNN [10].
Implementation Details
To evaluate our ECA-Net on ImageNet classification, we employ four widely used CNNs as backbone models: ResNet-50 [11], ResNet-101 [11], ResNet-152 [11] and MobileNetV2 [29]. For training ResNet-50, ResNet-101 and ResNet-152 with our ECA, we adopt exactly the same data augmentation and hyper-parameter settings as in [11, 14]. Specifically, the input images are randomly cropped to 224×224 with random horizontal flipping. The network parameters are optimized by stochastic gradient descent (SGD) with a weight decay of 1e-4, momentum of 0.9 and mini-batch size of 256. All models are trained for 100 epochs with an initial learning rate of 0.1, which is decreased by a factor of 10 every 30 epochs. For training MobileNetV2 with our ECA, we follow the settings in [29]: networks are trained for 400 epochs using SGD with a weight decay of 4e-5, momentum of 0.9 and mini-batch size of 96. The initial learning rate is set to 0.045 and is decreased by a linear decay rate of 0.98. For testing on the validation set, the shorter side of an input image is first resized to 256 and a center crop of 224×224 is used for evaluation. All models are implemented with the PyTorch toolkit (https://github.com/pytorch/pytorch).
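The ResNet learning-rate schedule described above (initial rate 0.1, divided by 10 every 30 epochs) can be sketched as a small helper; this is an illustration of the stated schedule, not the authors' training script:

```python
def step_lr(epoch: int, base_lr: float = 0.1, step: int = 30, gamma: float = 0.1) -> float:
    """Learning rate at a given epoch under step decay: divide by 10 every `step` epochs."""
    return base_lr * gamma ** (epoch // step)
```

In practice the same schedule is what `torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)` produces.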
We further evaluate our method on MS COCO using Faster R-CNN [27] and Mask R-CNN [10], where ResNet-50 and ResNet-101 along with FPN [22] are used as backbone models. We implement all detectors using the MMDetection toolkit [3] with the default settings. Specifically, the shorter side of input images is resized to 800, then all models are optimized using SGD with a weight decay of 1e-4, momentum of 0.9 and mini-batch size of 8 (4 GPUs with 2 images per GPU). The learning rate is initialized to 0.01 and is decreased by a factor of 10 after 8 and 11 epochs, respectively. We train all detectors for 12 epochs on COCO train2017 and report results on val2017 for comparison. All programs run on a PC equipped with four RTX 2080Ti GPUs and an Intel(R) Xeon Silver 4112 CPU @ 2.60GHz.
Largescale Image Classification on ImageNet1K
Here, we first assess the effect of the kernel size $k$ on our ECA module and the effectiveness of adaptive kernel size selection, then compare with state-of-the-art counterparts and CNN models using ResNet-50, ResNet-101, ResNet-152 and MobileNetV2.
Effect of Kernel Size and Adaptive Kernel Size Selection
As shown in Eq. (6), our ECA module involves a parameter $k$, i.e., the kernel size of the 1D convolution. In this part, we evaluate its effect on our ECA module and validate the effectiveness of the proposed adaptive selection of kernel size. To this end, we employ ResNet-50 and ResNet-101 as backbone models and train them with our ECA module under a range of fixed values of $k$. The results are illustrated in Figure 4, from which we have the following observations.
Table 2. Comparison of different methods on ImageNet in terms of network parameters (Param.), floating point operations (FLOPs), training/inference speed (FPS) and Top-1/Top-5 accuracy (%).

Method | Backbone | Param. | FLOPs | Training | Inference | Top-1 | Top-5
--- | --- | --- | --- | --- | --- | --- | ---
ResNet [11] | ResNet-50 | 24.37M | 3.86G | 1024 FPS | 1855 FPS | 75.20 | 92.52
SENet [14] | ResNet-50 | 26.77M | 3.87G | 759 FPS | 1620 FPS | 76.71 | 93.38
CBAM [34] | ResNet-50 | 26.77M | 3.87G | 472 FPS | 1213 FPS | 77.34 | 93.69
A²-Nets [4] | ResNet-50 | 33.00M | 6.50G | N/A | N/A | 77.00 | 93.50
GSoP-Net1 [9] | ResNet-50 | 28.05M | 6.18G | 596 FPS | 1383 FPS | 77.68 | 93.98
AA-Net [1] | ResNet-50 | 25.80M | 4.15G | N/A | N/A | 77.70 | 93.80
ECA-Net (Ours) | ResNet-50 | 24.37M | 3.86G | 785 FPS | 1805 FPS | 77.48 | 93.68
ResNet [11] | ResNet-101 | 42.49M | 7.34G | 386 FPS | 1174 FPS | 76.83 | 93.48
SENet [14] | ResNet-101 | 47.01M | 7.35G | 367 FPS | 1044 FPS | 77.62 | 93.93
CBAM [34] | ResNet-101 | 47.01M | 7.35G | 270 FPS | 635 FPS | 78.49 | 94.31
AA-Net [1] | ResNet-101 | 45.40M | 8.05G | N/A | N/A | 78.70 | 94.40
ECA-Net (Ours) | ResNet-101 | 42.49M | 7.35G | 380 FPS | 1089 FPS | 78.65 | 94.34
ResNet [11] | ResNet-152 | 57.40M | 10.82G | 281 FPS | 815 FPS | 77.58 | 93.66
SENet [14] | ResNet-152 | 63.68M | 10.85G | 268 FPS | 761 FPS | 78.43 | 94.27
ECA-Net (Ours) | ResNet-152 | 57.40M | 10.83G | 279 FPS | 785 FPS | 78.92 | 94.55
MobileNetV2 [29] | MobileNetV2 | 3.34M | 319.4M | 711 FPS | 2086 FPS | 71.64 | 90.20
SENet | MobileNetV2 | 3.40M | 320.1M | 671 FPS | 2000 FPS | 72.42 | 90.67
ECA-Net (Ours) | MobileNetV2 | 3.34M | 319.9M | 676 FPS | 2010 FPS | 72.56 | 90.81
First, when $k$ is fixed for all convolution blocks, the ECA module obtains its best results at a larger kernel size for ResNet-50 than for ResNet-101. Since ResNet-101 has more intermediate layers, which dominate its performance, it may prefer a smaller kernel size. Furthermore, these results show that different deep CNNs have different optimal values of $k$, and that $k$ has a clear effect on the performance of ECA-Net. Second, our adaptive selection of kernel size seeks an appropriate $k$ for each convolution block, which alleviates the effect of network depth and avoids manual tuning of $k$. Moreover, it usually brings further improvement, demonstrating its effectiveness. Finally, the ECA module with various values of $k$ consistently outperforms the SE block, indicating that avoiding dimensionality reduction and local cross-channel interaction indeed have positive effects on learning channel attention.
Comparisons using ResNet50
Next, we compare our ECA module with several state-of-the-art attention methods using ResNet-50 on ImageNet, including SENet [14], CBAM [34], A²-Nets [4], AA-Net [1] and GSoP-Net1 [9]. The evaluation metrics concern both efficiency (i.e., network parameters, floating point operations (FLOPs) and training/inference speed) and effectiveness (i.e., Top-1/Top-5 accuracy). For a fair comparison, we copy the results of all compared methods from their original papers, except for training/inference speed. To test the training/inference speed of the various models, we employ publicly available implementations of the compared CNNs and run them on the same computing platform. The results are given in Table 2, where we can see that our ECA-Net shares almost the same model complexity (i.e., network parameters, FLOPs and speed) as the original ResNet-50, while achieving a 2.28% gain in Top-1 accuracy. Compared with state-of-the-art counterparts (i.e., SENet, CBAM, A²-Nets, AA-Net and GSoP-Net1), ECA-Net obtains better or competitive performance while benefiting from lower model complexity.
Table 3. Comparison with other state-of-the-art CNN models on ImageNet.

CNN Models | Param. | FLOPs | Top-1 | Top-5
--- | --- | --- | --- | ---
ResNet-152 | 57.40M | 10.82G | 77.58 | 93.66
SENet-152 | 63.68M | 10.85G | 78.43 | 94.27
ResNet-200 | 74.45M | 14.10G | 78.20 | 94.00
ResNeXt-101 | 46.66M | 7.53G | 78.80 | 94.40
DenseNet-264 | 28.78M | 5.15G | 77.85 | 93.78
ECA-Net50 (Ours) | 24.37M | 3.86G | 77.48 | 93.68
ECA-Net101 (Ours) | 42.49M | 7.35G | 78.65 | 94.34
Comparisons using ResNet101
Using ResNet-101 as the backbone model, we compare our ECA-Net with SENet [14], CBAM [34] and AA-Net [1]. From Table 2 we can see that ECA-Net outperforms the original ResNet-101 by 1.8% in Top-1 accuracy with almost the same model complexity. Showing the same tendency as on ResNet-50, ECA-Net is superior to SENet and CBAM while being very competitive with AA-Net at lower model complexity.
Comparisons using ResNet152
Using ResNet-152 as the backbone model, we compare our ECA-Net with SENet [14]. From Table 2 we can see that ECA-Net improves the original ResNet-152 by about 1.3% in Top-1 accuracy with almost the same model complexity, while outperforming SENet by 0.5% in Top-1 accuracy with lower model complexity. The results on ResNet-50, ResNet-101 and ResNet-152 demonstrate the effectiveness of our ECA module on the widely used ResNet architectures.
Table 4. Object detection results of different methods on COCO val2017.

Methods | Detector | Param. | GFLOPs | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
ResNet-50 | Faster R-CNN | 41.53M | 207.07 | 36.4 | 58.2 | 39.2 | 21.8 | 40.0 | 46.2
+ SE block | Faster R-CNN | 44.02M | 207.18 | 37.7 | 60.1 | 40.9 | 22.9 | 41.9 | 48.2
+ ECA (Ours) | Faster R-CNN | 41.53M | 207.18 | 38.0 | 60.6 | 40.9 | 23.4 | 42.1 | 48.0
ResNet-101 | Faster R-CNN | 60.52M | 283.14 | 38.7 | 60.6 | 41.9 | 22.7 | 43.2 | 50.4
+ SE block | Faster R-CNN | 65.24M | 283.33 | 39.6 | 62.0 | 43.1 | 23.7 | 44.0 | 51.4
+ ECA (Ours) | Faster R-CNN | 60.52M | 283.32 | 40.3 | 62.9 | 44.0 | 24.5 | 44.7 | 51.3
ResNet-50 | Mask R-CNN | 44.18M | 275.58 | 37.2 | 58.9 | 40.3 | 22.2 | 40.7 | 48.0
+ SE block | Mask R-CNN | 46.67M | 275.69 | 38.7 | 60.9 | 42.1 | 23.4 | 42.7 | 50.0
+ ECA (Ours) | Mask R-CNN | 44.18M | 275.69 | 39.0 | 61.3 | 42.1 | 24.2 | 42.8 | 49.9
ResNet-101 | Mask R-CNN | 63.17M | 351.65 | 39.4 | 60.9 | 43.3 | 23.0 | 43.7 | 51.4
+ SE block | Mask R-CNN | 67.89M | 351.84 | 40.7 | 62.5 | 44.3 | 23.9 | 45.2 | 52.8
+ ECA (Ours) | Mask R-CNN | 63.17M | 351.83 | 41.3 | 63.1 | 44.8 | 25.1 | 45.8 | 52.9
Comparisons using MobileNetV2
Besides ResNet architectures, we also verify the effectiveness of our ECA module on lightweight CNN architectures. To this end, we employ MobileNetV2 [29] as the backbone model and compare our ECA module with the SE block. In particular, we integrate the SE block and ECA module into the convolution layer before the residual connection in each 'bottleneck' of MobileNetV2, and the reduction parameter $r$ of the SE block is set to 8. All models are trained using exactly the same settings. The results in Table 2 show that our ECA-Net improves the original MobileNetV2 and SENet by about 0.9% and 0.14% in Top-1 accuracy, respectively. Furthermore, our ECA-Net has a smaller model size and faster training/inference speed than SENet. These results again demonstrate the efficiency and effectiveness of our ECA module in deep CNNs.
Table 5. Instance segmentation results of different methods using Mask R-CNN on COCO val2017.

Methods | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L
--- | --- | --- | --- | --- | --- | ---
ResNet-50 | 34.1 | 55.5 | 36.2 | 16.1 | 36.7 | 50.0
+ SE block | 35.4 | 57.4 | 37.8 | 17.1 | 38.6 | 51.8
+ ECA (Ours) | 35.6 | 58.1 | 37.7 | 17.6 | 39.0 | 51.8
ResNet-101 | 35.9 | 57.7 | 38.4 | 16.8 | 39.1 | 53.6
+ SE block | 36.8 | 59.3 | 39.2 | 17.2 | 40.3 | 53.6
+ ECA (Ours) | 37.4 | 59.9 | 39.8 | 18.1 | 41.1 | 54.1
Comparisons with Other CNN Models
At the end of this part, we compare our ECA-Net with other state-of-the-art CNN models, including ResNet-152 [11], SENet-152 [14], ResNet-200 [12], ResNeXt-101 [35] and DenseNet-264 [15]. These CNN models have deeper and wider architectures, and their results are all copied from the original papers. As listed in Table 3, our ECA-Net50 is comparable to ResNet-152 while ECA-Net101 outperforms SENet-152 and ResNet-200, indicating that our ECA-Net can improve the performance of deep CNNs at much lower computational cost. Meanwhile, our ECA-Net101 is very competitive with ResNeXt-101, even though the latter employs more convolution filters and expensive group convolutions. In addition, ECA-Net50 is comparable to DenseNet-264 but has lower model complexity. All of the above results demonstrate that our ECA-Net performs favorably against state-of-the-art CNNs while benefiting from much lower model complexity. Note that our ECA also has great potential to further improve the performance of the compared CNN models.
Object Detection on MS COCO
In this subsection, we evaluate our ECA-Net on the object detection task using Faster R-CNN [27] and Mask R-CNN [10]. Here, we compare our ECA-Net with the original ResNet and SENet. All CNN models are first pre-trained on ImageNet and then transferred to MS COCO by fine-tuning.
Comparisons using Faster RCNN
Using Faster R-CNN as the basic detector, we employ ResNets of 50 and 101 layers along with FPN [22] as backbone models. As shown in Table 4, integrating either the SE block or our ECA module improves object detection performance by a clear margin. Meanwhile, our ECA outperforms the SE block by 0.3% and 0.7% in AP using ResNet-50 and ResNet-101, respectively. Furthermore, our ECA module has lower model complexity than the SE block. It is worth mentioning that our ECA module achieves larger gains on small objects, which are usually harder to detect.
Comparisons using Mask RCNN
We further exploit Mask R-CNN to verify the effectiveness of our ECA-Net on the object detection task. As listed in Table 4, our ECA module is superior to the original ResNet by 1.8% and 1.9% in AP under the 50-layer and 101-layer settings, respectively. Meanwhile, the ECA module achieves 0.3% and 0.6% gains over the SE block using ResNet-50 and ResNet-101, respectively. The results in Table 4 demonstrate that our ECA module generalizes well to object detection and is particularly suitable for detecting small objects.
Instance Segmentation on MS COCO
Finally, we report instance segmentation results of our ECA module using Mask R-CNN on MS COCO. As compared in Table 5, the ECA module achieves notable gains over the original ResNet while performing better than the SE block with lower model complexity. These results verify that our ECA module generalizes well to various tasks.
Conclusion
In this paper, we focus on learning channel attention for deep CNNs with low model complexity. To this end, we propose a novel efficient channel attention (ECA) module, which generates channel attention through a fast 1D convolution whose kernel size can be adaptively determined by a function of the channel dimension. Experimental results demonstrate that our ECA is an extremely lightweight plug-and-play block that improves the performance of various deep CNN architectures, including the widely used ResNets and the lightweight MobileNetV2. Moreover, our ECA-Net exhibits good generalization ability on object detection and instance segmentation tasks. In the future, we will apply our ECA module to more CNN architectures (e.g., ResNeXt and Inception [32]) and further investigate the interaction between ECA and spatial attention modules.
Appendix A1. Visualization of Global Average Pooling of Convolution Activations
Here, we visualize the results of global average pooling (GAP) of convolution activations, which are fed to the attention modules for learning channel weights. Specifically, we first train ECA-Net50 on the training set of ImageNet. Then, we randomly select some images from the ImageNet validation set. Given a selected image, we feed it through ECA-Net50 and compute the global average pooling of activations from different convolution layers. The selected images are illustrated on the left side of Figure 6, and we visualize the GAP values of activations computed from conv_2_3, conv_3_2, conv_3_4, conv_4_3, conv_4_6 and conv_5_3, which are indicated by GAP_2_3, GAP_3_2, GAP_3_4, GAP_4_3, GAP_4_6 and GAP_5_3, respectively. Here, conv_2_3 indicates the 3rd convolution block of the 2nd stage. As shown in Figure 6, we can observe that different images have similar trends in the same convolution layer, and these trends usually exhibit a certain local periodicity; some of them are indicated by red rectangular boxes. This phenomenon suggests that channel interaction can be captured in a local manner.
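A generic way to collect such channel-wise GAP values from named layers is a forward hook; the sketch below is our own utility for this kind of inspection (the function name and the toy test model are illustrative), not part of the paper's code:

```python
import torch
import torch.nn as nn

def collect_gap(model: nn.Module, image: torch.Tensor, layer_names) -> dict:
    """Record channel-wise GAP of the activations produced by the named layers."""
    feats, hooks = {}, []
    for name, module in model.named_modules():
        if name in layer_names:
            # Capture the layer output and reduce spatial dims to one value per channel.
            hooks.append(module.register_forward_hook(
                lambda m, inp, out, name=name: feats.__setitem__(name, out.mean(dim=(2, 3)))))
    with torch.no_grad():
        model(image)
    for h in hooks:
        h.remove()
    return feats
```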
Appendix A2. Visualization of Weights Learned by ECA Modules and SE Blocks
To further analyze the effect of our ECA module on learning channel attention, we visualize the weights learned by ECA modules and compare them with those of SE blocks. Here, we employ ResNet-50 as the backbone model and illustrate the weights of different convolution blocks. Specifically, we randomly sample four classes from ImageNet: hammerhead shark, ambulance, medicine chest and butternut squash. Some example images are illustrated in Figure 5. After training the networks, for all images of each class collected from the ImageNet validation set, we compute the average channel weights of the convolution blocks. Figure 7 visualizes the channel weights of conv_n_b, where n indicates the n-th stage and b the b-th convolution block in that stage. Besides the visualization results of the four randomly sampled classes, we also give the distribution of the average weights across classes as a reference. The channel weights learned by ECA modules and SE blocks are illustrated at the bottom and top of each row, respectively.
From Figure 7 we have the following observations. First, for both ECA modules and SE blocks, the distributions of channel weights for different classes are very similar in the earlier layers (i.e., conv_2_1 to conv_3_4). This may be because the earlier layers aim at capturing basic elements (e.g., boundaries and corners) [36], which are almost the same for different classes. A similar phenomenon was also described in the extended version of [14] (https://arxiv.org/abs/1709.01507). Second, among the channel weights learned by SE blocks for different classes, most tend to the same value (i.e., 0.5) from conv_4_2 to conv_4_5, so the differences among classes are not obvious. On the contrary, the weights learned by ECA modules clearly differ across channels and classes. Since convolution blocks in the 4th stage prefer to learn semantic information, the weights learned by ECA modules can better distinguish different classes. Finally, convolution blocks in the final stage (i.e., conv_5_1, conv_5_2 and conv_5_3) capture high-level semantic features and are more class-specific. Obviously, the weights learned by ECA modules are more class-specific than those learned by SE blocks. These results clearly demonstrate that the weights learned by our ECA modules have better discriminative ability.
References
 [1] (2019) Attention augmented convolutional networks. arXiv:1904.09925.
 [2] (1992) A training algorithm for optimal margin classifiers. In COLT, pp. 144–152.
 [3] (2019) MMDetection: open MMLab detection toolbox and benchmark. arXiv:1906.07155.
 [4] (2018) A²-Nets: double attention networks. In NeurIPS.
 [5] (2017) Xception: deep learning with depthwise separable convolutions. In CVPR.
 [6] (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
 [7] (2019) Dual attention network for scene segmentation. In CVPR.
 [8] (2018) ChannelNets: compact and efficient convolutional neural networks via channel-wise convolutions. In NeurIPS.
 [9] (2019) Global second-order pooling convolutional networks. In CVPR.
 [10] (2017) Mask R-CNN. In ICCV, pp. 2980–2988.
 [11] (2016) Deep residual learning for image recognition. In CVPR.
 [12] (2016) Identity mappings in deep residual networks. In ECCV.
 [13] (2018) Gather-Excite: exploiting feature context in convolutional neural networks. In NeurIPS.
 [14] (2018) Squeeze-and-excitation networks. In CVPR.
 [15] (2017) Densely connected convolutional networks. In CVPR.
 [16] (2019) CCNet: criss-cross attention for semantic segmentation. In ICCV.
 [17] (2017) Deep roots: improving CNN efficiency with hierarchical filter groups. In CVPR.
 [18] (2012) ImageNet classification with deep convolutional neural networks. In NIPS.
 [19] (2017) Is second-order information helpful for large-scale visual recognition? In ICCV.
 [20] (2019) Expectation-maximization attention networks for semantic segmentation. In ICCV.
 [21] (2017) Factorized bilinear models for image recognition. In ICCV.
 [22] (2017) Feature pyramid networks for object detection. In CVPR, pp. 936–944.
 [23] (2014) Microsoft COCO: common objects in context. In ECCV.
 [24] (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In ECCV.
 [25] (1998) Kernel PCA and de-noising in feature spaces. In NIPS, pp. 536–542.
 [26] (2010) Rectified linear units improve restricted Boltzmann machines. In ICML.
 [27] (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), pp. 1137–1149.
 [28] (2019) Recalibrating fully convolutional networks with spatial and channel "squeeze and excitation" blocks. IEEE Trans. Med. Imaging 38(2), pp. 540–549.
 [29] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR.
 [30] (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
 [31] (2015) Going deeper with convolutions. In CVPR.
 [32] (2016) Rethinking the inception architecture for computer vision. In CVPR.
 [33] (2018) Non-local neural networks. In CVPR.
 [34] (2018) CBAM: convolutional block attention module. In ECCV.
 [35] (2017) Aggregated residual transformations for deep neural networks. In CVPR.
 [36] (2014) Visualizing and understanding convolutional networks. In ECCV, pp. 818–833.
 [37] (2018) ClcNet: improving the efficiency of convolutional neural networks using channel local convolutions. In CVPR.
 [38] (2017) Interleaved group convolutions. In ICCV.
 [39] (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In CVPR.