ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks
Channel attention has recently been shown to offer great potential for improving the performance of deep convolutional neural networks (CNNs). However, most existing methods are dedicated to developing more sophisticated attention modules to achieve better performance, which inevitably increases the computational burden. To overcome the trade-off between performance and complexity, this paper investigates an extremely lightweight attention module for boosting the performance of deep CNNs. In particular, we propose an Efficient Channel Attention (ECA) module, which involves only a handful of parameters but brings clear performance gain. By revisiting the channel attention module in SENet, we empirically show that avoiding dimensionality reduction and using appropriate cross-channel interaction are important for learning effective channel attention. Therefore, we propose a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented by a fast 1D convolution. Furthermore, we develop a function of the channel dimension to adaptively determine the kernel size of the 1D convolution, which stands for the coverage of local cross-channel interaction. Our ECA module can be flexibly incorporated into existing CNN architectures, and the resulting CNNs are named ECA-Net. We extensively evaluate the proposed ECA-Net on image classification, object detection and instance segmentation with ResNet and MobileNetV2 backbones. The experimental results show that ECA-Net is more efficient while performing favorably against its counterparts. The source code and models are available at https://github.com/BangguWu/ECANet.
Deep convolutional neural networks (CNNs) have been widely used in artificial intelligence and have achieved great progress in a broad range of tasks, e.g., image classification, object detection and semantic segmentation. Starting from the groundbreaking AlexNet, much research has continuously sought to further improve the performance of deep CNNs [30, 31, 11, 15, 19, 21, 33]. Recently, incorporating attention mechanisms into convolution blocks has attracted considerable attention, showing great potential for performance improvement [14, 34, 13, 4, 9, 7]. Among these methods, one representative work is squeeze-and-excitation networks (SENet), which learns channel attention for each convolution block and brings clear performance gain over various deep CNN architectures.
Following the setting of squeeze (i.e., feature aggregation) and excitation (i.e., feature recalibration) in SENet, some studies improve the SE block by capturing more sophisticated channel-wise dependencies [34, 4, 9, 7] or by combining it with additional spatial attention [34, 13, 7]. Although these methods achieve higher accuracy, they often bring higher model complexity and a heavier computational burden. Unlike these methods, which trade model complexity for performance, this paper focuses on a different question: can one learn effective channel attention in a more efficient way?
To answer this question, we first revisit the channel attention module in SENet. Specifically, given the input features, the SE block first applies global average pooling to each channel independently; two fully-connected (FC) layers with a non-linearity, followed by a Sigmoid function, then generate the weight of each channel. The two FC layers are designed to capture non-linear cross-channel interaction, and they involve dimensionality reduction to keep model complexity under control. Although this strategy is widely used in subsequent channel attention modules [34, 13, 9], our empirical analyses demonstrate that dimensionality reduction has side effects on the prediction of channel attention, and that capturing dependencies across all channels is inefficient and unnecessary.
Based on the above analyses, avoiding dimensionality reduction and using appropriate cross-channel interaction are suggested to play a vital role in developing channel attention mechanisms. Therefore, this paper proposes an Efficient Channel Attention (ECA) module for deep CNNs based on these two properties. As illustrated in Figure 2 (b), after channel-wise global average pooling without dimensionality reduction, our ECA captures local cross-channel interaction by considering every channel and its $k$ neighbors. As such, our ECA can be efficiently implemented by a fast 1D convolution of size $k$. The kernel size $k$ represents the coverage of local cross-channel interaction, i.e., how many neighbors participate in the attention prediction of one channel. Clearly, it affects both the efficiency and the effectiveness of ECA. It is reasonable that the coverage of interaction is related to the channel dimension, so we propose a function of the channel dimension to adaptively determine $k$. As shown in Figure 1 and Table 2, compared with the backbone models, deep CNNs with our ECA module (called ECA-Net) introduce very few additional parameters and negligible computation while bringing notable performance gain. For example, for ResNet-50 with 24.37M parameters and 3.86 GFLOPs, the additional parameters and computations of ECA-Net50 are 80 and 4.7e-4 GFLOPs, respectively; meanwhile, ECA-Net50 outperforms ResNet-50 by 2.28% in terms of Top-1 accuracy. To evaluate our method, we conduct experiments on ImageNet-1K and MS COCO using different deep CNN architectures and tasks. The contributions of this paper are summarized as follows.
We empirically demonstrate avoiding dimensionality reduction and appropriate cross-channel interaction are important to learn efficient and effective channel attention for deep CNNs.
We make an attempt to develop an extremely lightweight channel attention module for deep CNNs by proposing a novel Efficient Channel Attention (ECA), which increases little model complexity but brings clear improvement.
The experimental results on ImageNet-1K and MS COCO demonstrate that our method has lower model complexity than state-of-the-art methods while achieving very competitive performance.
Attention mechanisms have proven to be a promising means of enhancing deep CNNs. SENet presents for the first time an effective mechanism to learn channel attention and achieves promising performance. Subsequent development of attention modules can be roughly divided into two directions: (1) enhancement of feature aggregation; (2) combination of channel and spatial attention. Specifically, CBAM employs both average and max pooling to aggregate features. GSoP introduces second-order pooling for more effective feature aggregation. GE explores spatial extension using a depth-wise convolution to aggregate features. scSE and CBAM compute spatial attention using a 2D convolution of kernel size $k \times k$, then combine it with channel attention. Sharing a similar philosophy with non-local neural networks, Double Attention Networks ($A^2$-Nets) introduce a novel relation function for image or video recognition, while Dual Attention Network (DAN) and Criss-Cross Network (CCNet) simultaneously consider non-local channel and non-local spatial attention for semantic segmentation. Analogously, Li et al. propose an Expectation-Maximization Attention (EMA) module for semantic segmentation. However, these non-local attention modules can only be used in one or a few convolution blocks due to their high model complexity. All of the above methods focus on developing sophisticated attention modules for better performance. In contrast, our ECA aims at learning effective channel attention with low model complexity.
Our work is also related to efficient convolutions, which are designed for lightweight CNN architectures. The two most widely used efficient convolutions are group convolutions [38, 35, 17] and depth-wise separable convolutions [5, 29, 39, 24]. As given in Table 1, although these efficient convolutions involve fewer parameters, they show little effectiveness in attention modules. Our ECA module aims at capturing local cross-channel interaction, which shares some similarities with channel local convolutions and channel-wise convolutions; unlike them, our method proposes a 1D convolution with adaptive kernel size to replace the FC layers in the channel attention module. Compared with group and depth-wise separable convolutions, our method achieves better results with lower model complexity.
In this section, we first revisit the channel attention module in SENet. Then, we make an empirical comparison to analyze the effects of dimensionality reduction and cross-channel interaction, which motivates our efficient channel attention (ECA) module. In addition, we introduce an adaptive kernel size selection for ECA and finally show how to adopt it in deep CNNs.
|ECA-NS||with Eq. (4)||77.35||93.61|
Revisiting Channel Attention
Let the output of one convolution block be $\mathcal{X} \in \mathbb{R}^{W \times H \times C}$, where $W$, $H$ and $C$ are width, height and channel dimension (i.e., number of filters), respectively. As shown in Figure 2 (a), the weights of channel attention in SE block can be computed as
$$\omega = \sigma\big(f_{\{W_1, W_2\}}(g(\mathcal{X}))\big), \tag{1}$$
where $g(\mathcal{X}) = \frac{1}{WH}\sum_{i=1}^{W}\sum_{j=1}^{H} \mathcal{X}_{ij}$ is channel-wise global average pooling (GAP) and $\sigma$ is a Sigmoid function. Let $y = g(\mathcal{X})$; then $f_{\{W_1, W_2\}}$ takes the form
$$f_{\{W_1, W_2\}}(y) = W_2\,\mathrm{ReLU}(W_1 y), \tag{2}$$
where ReLU indicates the Rectified Linear Unit. To avoid too high model complexity, sizes of $W_1$ and $W_2$ are set to $C \times (C/r)$ and $(C/r) \times C$, respectively. We can see that $f_{\{W_1, W_2\}}$ involves all parameters of the channel attention block. While dimensionality reduction in Eq. (2) can reduce model complexity, it destroys the direct correspondence between a channel and its weight (for example, one single FC layer predicts the weight of each channel using a linear combination of all channels, but Eq. (2) first projects channel features into a low-dimensional space and then maps them back, making the correspondence between a channel and its weight indirect).
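The squeeze and excitation steps of Eqs. (1) and (2) can be sketched in plain Python as follows. This is a minimal illustration rather than the SENet implementation: the helper names, the toy input and the weight values are hypothetical, bias terms are omitted, and the matrices are stored row-wise so that each row produces one output.

```python
import math


def global_average_pool(x):
    """Squeeze: average each channel of x, given as C channels of H x W values."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in x]


def matvec(matrix, vector):
    """Plain matrix-vector product; each row of `matrix` yields one output."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def se_weights(x, w1, w2):
    """Excitation of Eqs. (1)-(2): sigmoid(W2 ReLU(W1 y)) with y = GAP(x).

    w1 is stored as C/r rows of length C (the reduction step),
    w2 as C rows of length C/r (the expansion step).
    """
    y = global_average_pool(x)
    hidden = [max(0.0, h) for h in matvec(w1, y)]
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(w2, hidden)]


# Toy input (illustrative values): C = 4 channels of size 2 x 2, reduction r = 2.
x = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(4)]
w1 = [[0.1, 0.2, 0.3, 0.4], [-0.4, -0.3, -0.2, -0.1]]
w2 = [[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0], [0.25, 0.25]]
weights = se_weights(x, w1, w2)  # one attention weight per channel
```

Every channel weight here depends on the channel features only through the `C/r`-dimensional hidden vector, which is the indirect correspondence discussed above.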
Efficient Channel Attention (ECA) Module
In this subsection, we make an empirical comparison for deeper analysis on the effect of channel dimensionality reduction and cross-channel interaction on learning channel attention. According to these analyses, we propose our efficient channel attention (ECA) module.
Avoiding Dimensionality Reduction
As discussed above, dimensionality reduction in Eq. (2) makes the correspondence between a channel and its weight indirect. To verify its effect, we compare the original SE block with three variants (i.e., SE-Var1, SE-Var2 and SE-Var3), none of which performs dimensionality reduction. As presented in Table 1, SE-Var1, which has no parameters, is still superior to the original network, indicating that channel attention can improve the performance of deep CNNs. Meanwhile, SE-Var2 learns the weight of each channel independently and is slightly superior to the SE block while involving fewer parameters. This may suggest that a channel and its weight need a direct correspondence, and that avoiding dimensionality reduction is more important than considering nonlinear channel dependencies. Additionally, SE-Var3, which employs a single FC layer, performs better than the two FC layers with dimensionality reduction in the SE block. All of the above results clearly demonstrate the importance of avoiding dimensionality reduction in the attention module. Therefore, we develop our ECA module without channel dimensionality reduction.
Local Cross-Channel Interaction
Although both SE-Var2 and SE-Var3 keep the channel dimension unchanged, the latter achieves better performance. The main difference is that SE-Var3 captures cross-channel interaction while SE-Var2 does not, which indicates that cross-channel interaction helps to learn effective attention. However, SE-Var3 involves a large number of parameters, leading to overly high model complexity. From the perspective of efficient convolutions [38, 35], SE-Var2 can be regarded as a depth-wise separable convolution. Naturally, group convolutions, another kind of efficient convolution, can also be used to capture cross-channel interaction. Given an FC layer, group convolutions divide it into multiple groups and perform a linear transform in each group independently. The SE block with group convolutions (SE-GC) is written as
$$\omega = \sigma(W_G\, y), \tag{3}$$
where $W_G$ is a block diagonal matrix whose number of parameters is $C^2/G$, and $G$ is the number of groups. However, as shown in Table 1, SE-GC with varying numbers of groups brings no gain over SE-Var2, indicating that group convolution is not an effective scheme for exploiting cross-channel interaction. Meanwhile, excessive group convolutions increase memory access cost.
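To make the complexity differences among the SE variants concrete, the following sketch counts the learnable weights of one attention block for each variant. It assumes bias terms are ignored and that the channel dimension is divisible by the reduction ratio and the group number; the function name and the default values `reduction=16` and `groups=16` are illustrative assumptions, not settings taken from the text above.

```python
def attention_param_counts(channels, reduction=16, groups=16):
    """Learnable weights per attention block for the SE variants (biases ignored)."""
    c = channels
    return {
        "SE-Var1": 0,                   # sigmoid applied to the pooled features directly
        "SE-Var2": c,                   # one independent weight per channel
        "SE-Var3": c * c,               # a single full FC layer over all channels
        "SE": 2 * c * c // reduction,   # two FC layers with reduction ratio r
        "SE-GC": c * c // groups,       # block diagonal matrix with G groups
    }


counts = attention_param_counts(256)
```

For a block with 256 channels, the single FC layer of SE-Var3 needs 65536 weights, the SE block with `reduction=16` needs 8192, and SE-GC with `groups=16` needs 4096, while SE-Var2 needs only 256.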
By visualizing channel features $y$, we find that they usually exhibit a certain local periodicity (please refer to Appendix A1 for details). Therefore, unlike the above methods (i.e., depth-wise separable convolutions, group convolutions and FC layers), we aim at capturing local cross-channel interaction, i.e., only considering interaction between each channel and its $k$ neighbors. Thus, the weight of $y_i$ can be calculated as
$$\omega_i = \sigma\Big(\sum_{j=1}^{k} w_i^j\, y_i^j\Big), \quad y_i^j \in \Omega_i^k, \tag{4}$$
where $\Omega_i^k$ indicates the set of $k$ adjacent channels of $y_i$. Clearly, Eq. (4) captures local cross-channel interaction, and this locality constraint avoids interaction across all channels, which allows high model efficiency. In this way, each channel attention module involves $k \times C$ parameters. To further reduce model complexity and improve efficiency, we let all channels share the same learnable parameters, i.e.,
$$\omega_i = \sigma\Big(\sum_{j=1}^{k} w^j\, y_i^j\Big), \quad y_i^j \in \Omega_i^k. \tag{5}$$
As such, our efficient channel attention (ECA) module can be readily implemented by a fast 1D convolution with kernel size of $k$, i.e.,
$$\omega = \sigma(\mathrm{C1D}_k(y)), \tag{6}$$
where C1D indicates 1D convolution. As listed in Table 1, by introducing local cross-channel interaction, our ECA achieves results similar to SE-Var3 and ECA-NS in Eq. (4) (i.e., ECA without shared parameters), while having much lower model complexity (it involves only $k$ parameters). In Table 1, $k$ is set to 3.
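The shared-weight 1D convolution over the pooled channel vector can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the authors' PyTorch code: the function names are hypothetical, the channel vector is zero-padded at both ends so that the output keeps one weight per channel, and no bias is used.

```python
import math


def eca_weights(y, kernel):
    """Sigmoid of a 1D convolution over the pooled channel vector y.

    `kernel` holds the k shared weights (k odd). Zero padding at both ends is
    an assumption of this sketch, chosen so the output has one weight per channel.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    pad = k // 2
    padded = [0.0] * pad + list(y) + [0.0] * pad
    weights = []
    for i in range(len(y)):
        # Each channel interacts only with its k neighbours (itself included).
        s = sum(kernel[j] * padded[i + j] for j in range(k))
        weights.append(1.0 / (1.0 + math.exp(-s)))
    return weights


def rescale_channels(x, weights):
    """Multiply every value of channel c in x (C channels of H x W) by weights[c]."""
    return [[[v * w for v in row] for row in channel]
            for channel, w in zip(x, weights)]


# Illustrative pooled features for C = 4 channels and a kernel of size k = 3.
y = [1.0, 2.0, 3.0, 4.0]
omega = eca_weights(y, [1.0, 1.0, 1.0])
```

The only learnable quantities are the `k` entries of `kernel`, regardless of the number of channels, which is where the low parameter count of the module comes from.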
Adaptive Selection of Kernel Size
In our ECA module (Eq. (6)), kernel size $k$ is a key parameter. Since the 1D convolution is used to capture local cross-channel interaction, $k$ determines the coverage of interaction, which may vary across convolution blocks with different channel numbers and across CNN architectures. Although $k$ could be tuned manually, doing so would cost a lot of computing resources. It is reasonable that $k$ is related to channel dimension $C$. In general, it is expected that a larger channel dimension favors long-range interaction while a smaller channel dimension prefers short-range interaction. In other words, there may exist a mapping $\phi$ between $k$ and $C$:
$$C = \phi(k). \tag{7}$$
Here, the optimal form of the mapping $\phi$ is usually unknown. However, based on the above analysis, $k$ is suggested to be nonlinearly proportional to $C$, so a parameterized exponential function is a feasible choice. Meanwhile, in the classical kernel tricks [2, 25], exponential family functions (e.g., Gaussian) are the most widely used kernel functions for handling unknown mappings. Therefore, we approximate the mapping $\phi$ using an exponential function, i.e.,
$$C = \phi(k) \approx \exp(\gamma k - b). \tag{8}$$
Furthermore, since channel dimension $C$ (i.e., number of filters) is usually set to an integral power of 2, we replace $\exp(\gamma k - b)$ by $2^{(\gamma k - b)}$ (note that $2^{(\gamma k - b)} = \exp(\log(2)\,(\gamma k - b))$). Then, given channel dimension $C$, kernel size $k$ can be adaptively determined by
$$k = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd}, \tag{9}$$
where $|t|_{odd}$ indicates the nearest odd number of $t$. In this paper, we set $\gamma$ and $b$ to 2 and 1, respectively. Clearly, the mapping $\psi$ makes larger channel dimensions have longer-range interaction and vice versa.
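As a worked example of the adaptive kernel size rule, the sketch below evaluates it for a few channel dimensions. The function name is hypothetical, and the rounding rule is an assumption: the value is truncated to an integer and bumped to the next odd number when the result is even, which is one plausible reading of "nearest odd number" (ties between two odd neighbours are resolved upwards).

```python
import math


def adaptive_kernel_size(channels, gamma=2, b=1):
    """Kernel size from the channel dimension: |log2(C)/gamma + b/gamma|_odd.

    Rounding is an assumption of this sketch: truncate, then move an even
    result up to the next odd number.
    """
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1


sizes = {c: adaptive_kernel_size(c) for c in (64, 128, 256, 512, 1024, 2048)}
```

With `gamma=2` and `b=1`, this rounding gives a kernel size of 3 for 64 channels, 5 for 128 to 1024 channels, and 7 for 2048 channels, so wider blocks get a larger coverage of cross-channel interaction.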
ECA for Deep CNNs
Figure 2 compares our ECA module with the SE block. To adopt ECA in deep CNNs, we use exactly the same configuration as SENet and simply replace the SE block with our ECA module. The resulting networks are named ECA-Net. Figure 3 gives PyTorch code for our ECA module, which is easy to reproduce.
In this section, we evaluate the proposed method on large-scale image classification and object detection using ImageNet and MS COCO, respectively. Specifically, we first assess the effect of kernel size on our ECA module and compare with state-of-the-art counterparts on ImageNet. Then, we verify the effectiveness of our ECA module on object detection using Faster R-CNN and Mask R-CNN.
To evaluate our ECA-Net on ImageNet classification, we employ four widely used CNNs as backbone models: ResNet-50, ResNet-101, ResNet-152 and MobileNetV2. For training ResNet-50, ResNet-101 and ResNet-152 with our ECA, we adopt exactly the same data augmentation and hyper-parameter settings as in [11, 14]. Specifically, the input images are randomly cropped to 224 × 224 with random horizontal flipping. The network parameters are optimized by stochastic gradient descent (SGD) with weight decay of 1e-4, momentum of 0.9 and mini-batch size of 256. All models are trained for 100 epochs with an initial learning rate of 0.1, which is decreased by a factor of 10 every 30 epochs. For training MobileNetV2 with our ECA, we follow previously published settings, in which networks are trained for 400 epochs using SGD with weight decay of 4e-5, momentum of 0.9 and mini-batch size of 96. The initial learning rate is set to 0.045 and is decreased by a linear decay rate of 0.98. For testing on the validation set, the shorter side of an input image is first resized to 256 and a center crop of 224 × 224 is used for evaluation. All models are implemented with the PyTorch toolkit (https://github.com/pytorch/pytorch).
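The two learning-rate schedules described above can be written as small functions. This is a sketch under assumptions: epochs are counted from 0, the function names are hypothetical, and the "decay rate of 0.98" for MobileNetV2 is interpreted here as a multiplicative factor applied once per epoch, which the text does not state explicitly.

```python
def resnet_learning_rate(epoch, base_lr=0.1, step=30, factor=10.0):
    """Step schedule: divide the learning rate by `factor` every `step` epochs."""
    return base_lr / (factor ** (epoch // step))


def mobilenet_learning_rate(epoch, base_lr=0.045, rate=0.98):
    """Assumed per-epoch multiplicative decay by `rate` (an interpretation)."""
    return base_lr * rate ** epoch


# Learning rate at the start of each 30-epoch stage of the 100-epoch ResNet run.
resnet_stages = [resnet_learning_rate(e) for e in (0, 30, 60, 90)]
```

Under these assumptions the ResNet runs pass through four stages, with learning rates of roughly 0.1, 0.01, 0.001 and 0.0001, the last stage covering epochs 90 to 99.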
We further evaluate our method on MS COCO using Faster R-CNN and Mask R-CNN, where ResNet-50 and ResNet-101 along with FPN are used as backbone models. We implement all detectors using the MMDetection toolkit and employ the default settings. Specifically, the shorter side of each input image is resized to 800, and all models are optimized using SGD with weight decay of 1e-4, momentum of 0.9 and mini-batch size of 8 (4 GPUs with 2 images per GPU). The learning rate is initialized to 0.01 and is decreased by a factor of 10 after 8 and 11 epochs, respectively. We train all detectors for 12 epochs on train2017 of COCO and report the results on val2017 for comparison. All programs are run on a PC equipped with four RTX 2080Ti GPUs and an Intel(R) Xeon Silver 4112 CPU@2.60GHz.
Large-scale Image Classification on ImageNet-1K
Here, we first assess the effect of kernel size on our ECA module and the effectiveness of adaptive kernel size selection, then compare with state-of-the-art counterparts and CNN models using ResNet-50, ResNet-101, ResNet-152 and MobileNetV2.
Effect of Kernel Size and Adaptive Kernel Size Selection
As shown in Eq. (6), our ECA module involves a parameter $k$, i.e., the kernel size of the 1D convolution. In this part, we evaluate its effect on our ECA module and validate the effectiveness of the proposed adaptive selection of kernel size. To this end, we employ ResNet-50 and ResNet-101 as backbone models and train them with our ECA module while varying $k$ over a range of values. The results are illustrated in Figure 4, from which we make the following observations.
Table 2. Comparison of different attention methods on ImageNet in terms of network parameters, FLOPs, training and inference speed (frames per second, FPS), and Top-1/Top-5 accuracy (%).
|Method||Backbone||Params||FLOPs||Training||Inference||Top-1||Top-5|
|ResNet||ResNet-50||24.37M||3.86G||1024 FPS||1855 FPS||75.20||92.52|
|SENet||ResNet-50||26.77M||3.87G||759 FPS||1620 FPS||76.71||93.38|
|CBAM||ResNet-50||26.77M||3.87G||472 FPS||1213 FPS||77.34||93.69|
|GSoP-Net1||ResNet-50||28.05M||6.18G||596 FPS||1383 FPS||77.68||93.98|
|ECA-Net (Ours)||ResNet-50||24.37M||3.86G||785 FPS||1805 FPS||77.48||93.68|
|ResNet||ResNet-101||42.49M||7.34G||386 FPS||1174 FPS||76.83||93.48|
|SENet||ResNet-101||47.01M||7.35G||367 FPS||1044 FPS||77.62||93.93|
|CBAM||ResNet-101||47.01M||7.35G||270 FPS||635 FPS||78.49||94.31|
|ECA-Net (Ours)||ResNet-101||42.49M||7.35G||380 FPS||1089 FPS||78.65||94.34|
|ResNet||ResNet-152||57.40M||10.82G||281 FPS||815 FPS||77.58||93.66|
|SENet||ResNet-152||63.68M||10.85G||268 FPS||761 FPS||78.43||94.27|
|ECA-Net (Ours)||ResNet-152||57.40M||10.83G||279 FPS||785 FPS||78.92||94.55|
|MobileNetV2||MobileNetV2||3.34M||319.4M||711 FPS||2086 FPS||71.64||90.20|
|SENet||MobileNetV2||3.40M||320.1M||671 FPS||2000 FPS||72.42||90.67|
|ECA-Net (Ours)||MobileNetV2||3.34M||319.9M||676 FPS||2010 FPS||72.56||90.81|
Firstly, when $k$ is fixed in all convolution blocks, the ECA module obtains its best results at different values of $k$ for ResNet-50 and ResNet-101. Since ResNet-101 has more intermediate layers, which dominate its performance, it may prefer a small kernel size. Furthermore, these results show that different deep CNNs have different optimal values of $k$, and that $k$ has a clear effect on the performance of ECA-Net. Secondly, our adaptive selection of kernel size tries to find the optimal $k$ for each convolution block, which can alleviate the effect of network depth and avoid manual tuning of $k$. Moreover, it usually brings further improvement, demonstrating the effectiveness of adaptive kernel size selection. Finally, the ECA module with various values of $k$ consistently outperforms the SE block, indicating that avoiding dimensionality reduction and using local cross-channel interaction indeed have positive effects on learning channel attention.
Comparisons using ResNet-50
Next, we compare our ECA module with several state-of-the-art attention methods using ResNet-50 on ImageNet, including SENet, CBAM, $A^2$-Nets, AA-Net and GSoP-Net1. The evaluation metrics concern both efficiency (i.e., network parameters, floating-point operations (FLOPs) and training/inference speed) and effectiveness (i.e., Top-1/Top-5 accuracy). For a fair comparison, we take the results of all compared methods from their original papers, except for training/inference speed. To test the training/inference speed of the various models, we use publicly available models for the compared CNNs and run them on the same computing platform. The results are given in Table 2, where we can see that our ECA-Net has almost the same model complexity (i.e., network parameters, FLOPs and speed) as the original ResNet-50 while achieving a 2.28% gain in Top-1 accuracy. Compared with state-of-the-art counterparts (i.e., SENet, CBAM, $A^2$-Nets, AA-Net and GSoP-Net1), ECA-Net obtains better or competitive performance with lower model complexity.
Comparisons using ResNet-101
Using ResNet-101 as the backbone model, we compare our ECA-Net with SENet, CBAM and AA-Net. From Table 2 we can see that ECA-Net outperforms the original ResNet-101 by 1.8% in Top-1 accuracy with almost the same model complexity. Following the same trend as on ResNet-50, ECA-Net is superior to SENet and CBAM, while it is very competitive with AA-Net at lower model complexity.
Comparisons using ResNet-152
Using ResNet-152 as the backbone model, we compare our ECA-Net with SENet. From Table 2 we can see that ECA-Net improves on the original ResNet-152 by about 1.3% in Top-1 accuracy with almost the same model complexity, while outperforming SENet by 0.5% in Top-1 accuracy with lower model complexity. The results on ResNet-50, ResNet-101 and ResNet-152 demonstrate the effectiveness of our ECA module on the widely used ResNet architectures.
Table 4. Object detection results of different methods on COCO val2017.
|Method||Detector||Params||GFLOPs||AP||AP50||AP75||APS||APM||APL|
|ResNet-50||Faster R-CNN||41.53 M||207.07||36.4||58.2||39.2||21.8||40.0||46.2|
|ResNet-50 + SE block||Faster R-CNN||44.02 M||207.18||37.7||60.1||40.9||22.9||41.9||48.2|
|ResNet-50 + ECA (Ours)||Faster R-CNN||41.53 M||207.18||38.0||60.6||40.9||23.4||42.1||48.0|
|ResNet-101 + SE block||Faster R-CNN||65.24 M||283.33||39.6||62.0||43.1||23.7||44.0||51.4|
|ResNet-101 + ECA (Ours)||Faster R-CNN||60.52 M||283.32||40.3||62.9||44.0||24.5||44.7||51.3|
|ResNet-50||Mask R-CNN||44.18 M||275.58||37.2||58.9||40.3||22.2||40.7||48.0|
|ResNet-50 + SE block||Mask R-CNN||46.67 M||275.69||38.7||60.9||42.1||23.4||42.7||50.0|
|ResNet-50 + ECA (Ours)||Mask R-CNN||44.18 M||275.69||39.0||61.3||42.1||24.2||42.8||49.9|
|ResNet-101 + SE block||Mask R-CNN||67.89 M||351.84||40.7||62.5||44.3||23.9||45.2||52.8|
|ResNet-101 + ECA (Ours)||Mask R-CNN||63.17 M||351.83||41.3||63.1||44.8||25.1||45.8||52.9|
Comparisons using MobileNetV2
Besides ResNet architectures, we also verify the effectiveness of our ECA module on lightweight CNN architectures. To this end, we employ MobileNetV2 as the backbone model and compare our ECA module with the SE block. In particular, we integrate the SE block and the ECA module into the convolution layer before the residual connection in each 'bottleneck' of MobileNetV2, and the parameter $r$ of the SE block is set to 8. All models are trained using exactly the same settings. The results in Table 2 show that our ECA-Net improves on the original MobileNetV2 and SENet by about 0.9% and 0.14% in Top-1 accuracy, respectively. Furthermore, our ECA-Net has a smaller model size and faster training/inference speed than SENet. These results again demonstrate the efficiency and effectiveness of our ECA module in deep CNNs.
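The Top-1 gains quoted in the classification comparisons can be recomputed from the Table 2 entries. The short sketch below does this arithmetic; the dictionary layout and the function name are illustrative choices, and the accuracy values are copied from Table 2.

```python
# Top-1 accuracy (%) copied from Table 2: baseline, SENet and ECA-Net per backbone.
TOP1 = {
    "ResNet-50": {"baseline": 75.20, "SENet": 76.71, "ECA-Net": 77.48},
    "ResNet-101": {"baseline": 76.83, "SENet": 77.62, "ECA-Net": 78.65},
    "ResNet-152": {"baseline": 77.58, "SENet": 78.43, "ECA-Net": 78.92},
    "MobileNetV2": {"baseline": 71.64, "SENet": 72.42, "ECA-Net": 72.56},
}


def eca_gain(backbone, reference):
    """Top-1 gain of ECA-Net over `reference` ('baseline' or 'SENet'), in points."""
    row = TOP1[backbone]
    return round(row["ECA-Net"] - row[reference], 2)


gains_over_baseline = {name: eca_gain(name, "baseline") for name in TOP1}
gains_over_senet = {name: eca_gain(name, "SENet") for name in TOP1}
```

The gains over the baselines come out as 2.28, 1.82, 1.34 and 0.92 points for ResNet-50, ResNet-101, ResNet-152 and MobileNetV2, matching the rounded figures quoted in the text; the gains over SENet on ResNet-152 and MobileNetV2 are 0.49 and 0.14 points.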
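Table 5. Instance segmentation results of different methods using Mask R-CNN on MS COCO.
|Method||AP||AP50||AP75||APS||APM||APL|
|ResNet-50 + SE block||35.4||57.4||37.8||17.1||38.6||51.8|
|ResNet-50 + ECA (Ours)||35.6||58.1||37.7||17.6||39.0||51.8|
|ResNet-101 + SE block||36.8||59.3||39.2||17.2||40.3||53.6|
|ResNet-101 + ECA (Ours)||37.4||59.9||39.8||18.1||41.1||54.1|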
Comparisons with Other CNN Models
At the end of this part, we compare our ECA-Net with other state-of-the-art CNN models, including ResNet-152, SENet-152, ResNet-200, ResNeXt and DenseNet-264. These CNN models have deeper and wider architectures, and their results are all copied from the original papers. As listed in Table 3, our ECA-Net50 is comparable to ResNet-152, while ECA-Net101 outperforms SENet-152 and ResNet-200, indicating that ECA-Net can improve the performance of deep CNNs at much lower computational cost. Meanwhile, our ECA-Net101 is very competitive with ResNeXt-101, although the latter employs more convolution filters and expensive group convolutions. In addition, ECA-Net50 is comparable to DenseNet-264 but has lower model complexity. All of the above results demonstrate that ECA-Net performs favorably against state-of-the-art CNNs with much lower model complexity. Note that our ECA also has great potential to further improve the performance of the compared CNN models.
Object Detection on MS COCO
In this subsection, we evaluate our ECA-Net on the object detection task using Faster R-CNN and Mask R-CNN. Here, we compare our ECA-Net with the original ResNet and SENet. All CNN models are first pre-trained on ImageNet and then transferred to MS COCO by fine-tuning.
Comparisons using Faster R-CNN
Using Faster R-CNN as the basic detector, we employ ResNets of 50 and 101 layers along with FPN as backbone models. As shown in Table 4, integrating either the SE block or our ECA module improves object detection performance by a clear margin. Meanwhile, our ECA outperforms the SE block by 0.3% and 0.7% in terms of AP using ResNet-50 and ResNet-101, respectively. Furthermore, our ECA module has lower model complexity than the SE block. It is worth mentioning that our ECA module achieves larger gains for small objects, which are usually harder to detect.
Comparisons using Mask R-CNN
We further use Mask R-CNN to verify the effectiveness of our ECA-Net on the object detection task. As listed in Table 4, our ECA module is superior to the original ResNet by 1.8% and 1.9% in terms of AP under the settings of 50 and 101 layers, respectively. Meanwhile, the ECA module achieves 0.3% and 0.6% gains over the SE block using ResNet-50 and ResNet-101, respectively. The results in Table 4 demonstrate that our ECA module generalizes well to object detection and is more suitable for detecting small objects.
Instance Segmentation on MS COCO
Finally, we give instance segmentation results of our ECA module using Mask R-CNN on MS COCO. As compared in Table 5, the ECA module achieves notable gain over the original ResNet while performing better than the SE block with lower model complexity. These results verify that our ECA module generalizes well to various tasks.
In this paper, we focus on learning channel attention for deep CNNs with low model complexity. To this end, we propose a novel efficient channel attention (ECA) module, which generates channel attention through a fast 1D convolution whose kernel size can be adaptively determined by a function of the channel dimension. Experimental results demonstrate that our ECA is an extremely lightweight plug-and-play block that improves the performance of various deep CNN architectures, including the widely used ResNets and the lightweight MobileNetV2. Moreover, our ECA-Net exhibits good generalization ability in object detection and instance segmentation tasks. In the future, we will apply our ECA module to more CNN architectures (e.g., ResNeXt and Inception) and further investigate the interaction between ECA and spatial attention modules.
Appendix A1. Visualization of Global Average Pooling of Convolution Activations
Here, we visualize the results of global average pooling of convolution activations, which are fed to attention modules for learning channel weights. Specifically, we first train ECA-Net50 on the training set of ImageNet. Then, we randomly select some images from the ImageNet validation set. Given a selected image, we pass it through ECA-Net50 and compute the global average pooling of activations from different convolution layers. The selected images are shown on the left side of Figure 6, and we visualize the values of global average pooling of activations computed from conv_2_3, conv_3_2, conv_3_4, conv_4_3, conv_4_6 and conv_5_3, which are indicated by GAP_2_3, GAP_3_2, GAP_3_4, GAP_4_3, GAP_4_6 and GAP_5_3, respectively. Here, conv_2_3 indicates the 3rd convolution layer of the 2nd stage. As shown in Figure 6, different images show similar trends in the same convolution layer, and these trends usually exhibit a certain local periodicity. Some of them are indicated by red rectangular boxes. This phenomenon may suggest that we can capture channel interaction in a local manner.
Appendix A2. Visualization of Weights Learned by ECA Modules and SE Blocks
To further analyze the effect of our ECA module on learning channel attention, we visualize the weights learned by ECA modules and compare them with those of SE blocks. Here, we employ ResNet-50 as the backbone model and illustrate the weights of different convolution blocks. Specifically, we randomly sample four classes from ImageNet: hammerhead shark, ambulance, medicine chest and butternut squash. Some example images are illustrated in Figure 5. After training the networks, we compute the average channel weights of the convolution blocks over all images of each class collected from the ImageNet validation set. Figure 7 visualizes the channel weights of conv_$i$_$j$, where $i$ indicates the $i$-th stage and $j$ is the $j$-th convolution block in the $i$-th stage. Besides the visualization results of the four randomly sampled classes, we also give the distribution of the average weights across classes as a reference. The channel weights learned by ECA modules and SE blocks are illustrated at the bottom and top of each row, respectively.
From Figure 7 we make the following observations. Firstly, for both ECA modules and SE blocks, the distributions of channel weights for different classes are very similar at the earlier layers (i.e., those from conv_2_1 to conv_3_4), possibly because the earlier layers aim at capturing basic elements (e.g., boundaries and corners). These features are almost the same for different classes. This phenomenon was also described in the extended version of SENet (https://arxiv.org/abs/1709.01507). Secondly, for the channel weights of different classes learned by SE blocks, most of them tend to be the same (i.e., 0.5) in conv_4_2 to conv_4_5, and the differences among classes are not obvious. In contrast, the weights learned by ECA modules are clearly different across channels and classes. Since convolution blocks in the 4th stage tend to learn semantic information, the weights learned by ECA modules can better distinguish different classes. Finally, convolution blocks in the final stage (i.e., conv_5_1, conv_5_2 and conv_5_3) capture high-level semantic features and are more class-specific. The weights learned by ECA modules are more class-specific than those learned by SE blocks. The above results clearly demonstrate that the weights learned by our ECA modules have better discriminative ability.
[1] Attention augmented convolutional networks. arXiv:1904.09925, 2019.
[2] A training algorithm for optimal margin classifiers. In COLT, pp. 144–152, 1992.
[3] MMDetection: open mmlab detection toolbox and benchmark. arXiv:1906.07155, 2019.
[4] $A^2$-Nets: double attention networks. In NIPS, 2018.
[5] Xception: deep learning with depthwise separable convolutions. In CVPR, 2017.
[6] ImageNet: a large-scale hierarchical image database. In CVPR, 2009.
[7] Dual attention network for scene segmentation. In CVPR, 2019.
[8] ChannelNets: compact and efficient convolutional neural networks via channel-wise convolutions. In NeurIPS, 2018.
[9] Global second-order pooling convolutional networks. In CVPR, 2019.
[10] Mask R-CNN. In ICCV, pp. 2980–2988, 2017.
[11] Deep residual learning for image recognition. In CVPR, 2016.
[12] Identity mappings in deep residual networks. In ECCV, 2016.
[13] Gather-excite: exploiting feature context in convolutional neural networks. In NeurIPS, 2018.
[14] Squeeze-and-excitation networks. In CVPR, 2018.
[15] Densely connected convolutional networks. In CVPR, 2017.
-  (2019) CCNet: criss-cross attention for semantic segmentation. In ICCV, Cited by: Related Work.
-  (2017) Deep roots: improving cnn efficiency with hierarchical filter groups. In CVPR, Cited by: Related Work.
-  (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: Introduction.
-  (2017) Is second-order information helpful for large-scale visual recognition?. In ICCV, Cited by: Introduction.
-  (2019) Expectation-maximization attention networks for semantic segmentation. In ICCV, Cited by: Related Work.
-  (2017) Factorized bilinear models for image recognition. In ICCV, Cited by: Introduction.
-  (2017) Feature pyramid networks for object detection. In CVPR, pp. 936–944. Cited by: Implementation Details, Comparisons using Faster R-CNN.
-  (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: Introduction, Experiments.
-  (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In ECCV, Cited by: Related Work, Local Cross-Channel Interaction.
-  (1998) Kernel PCA and de-noising in feature spaces. In NIPS, pp. 536–542. Cited by: Adaptive Selection of Kernel Size .
-  (2010) Rectified linear units improve restricted boltzmann machines. In ICML, Cited by: Revisiting Channel Attention.
-  (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39 (6), pp. 1137–1149. Cited by: Implementation Details, Object Detection on MS COCO, Experiments.
-  (2019) Recalibrating fully convolutional networks with spatial and channel ”squeeze and excitation” blocks. IEEE Trans. Med. Imaging 38 (2), pp. 540–549. Cited by: Related Work.
-  (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, Cited by: Related Work, Implementation Details, Comparisons using MobileNetV2, Table 2.
-  (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: Introduction.
-  (2015) Going deeper with convolutions. In CVPR, Cited by: Introduction.
-  (2016) Rethinking the inception architecture for computer vision. In CVPR, Cited by: Conclusion.
-  (2018) Non-local neural networks. In CVPR, Cited by: Introduction, Related Work.
-  (2018) CBAM: convolutional block attention module. In ECCV, Cited by: Figure 1, Introduction, Introduction, Introduction, Related Work, Comparisons using ResNet-50, Comparisons using ResNet-101, Table 2.
-  (2017) Aggregated residual transformations for deep neural networks. In CVPR, Cited by: Related Work, Local Cross-Channel Interaction, Comparisons with Other CNN Models.
-  (2014) Visualizing and understanding convolutional networks. In ECCV, pp. 818–833. Cited by: Appendix A2. Visualization of Weights Learned by ECA Modules and SE Blocks.
-  (2018) ClcNet: improving the efficiency of convolutional neural network using channel local convolutions. In CVPR, Cited by: Related Work.
-  (2017) Interleaved group convolutions. In ICCV, Cited by: Related Work, Local Cross-Channel Interaction.
-  (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In CVPR, Cited by: Related Work.