Attentive CutMix: An Enhanced Data Augmentation Approach for Deep Learning Based Image Classification



Convolutional neural networks (CNNs) can learn robust representations under a range of regularization methods and activations because convolutional layers are spatially correlated. Building on this property, a large variety of regional dropout strategies have been proposed, such as Cutout [1], DropBlock [2], and CutMix [3]. These methods aim to help the network generalize better by partially occluding the discriminative parts of objects. However, all of them perform this operation randomly, without capturing the most important region(s) within an object. In this paper, we propose Attentive CutMix, an enhanced augmentation strategy based on CutMix [3]. In each training iteration, we choose the most descriptive regions based on the intermediate attention maps from a feature extractor, which enables searching for the most discriminative parts of an image. Our proposed method is simple yet effective, easy to implement, and boosts the baseline significantly. Extensive experiments on the CIFAR-10/100 and ImageNet datasets with various CNN architectures (in a unified setting) demonstrate the effectiveness of our proposed method, which consistently outperforms the baseline CutMix and other methods by a significant margin.


Devesh Walawalkar, Zhiqiang Shen1, Zechun Liu, Marios Savvides
Carnegie Mellon University, Department of Electrical and Computer Engineering, Pittsburgh, PA, USA

Keywords: Deep Neural Networks, Regularization, Data Augmentation, Image Classification

1 Introduction

Regularization techniques for deep neural networks, such as dropout, weight decay, and early stopping, are popular and effective training strategies that improve accuracy, robustness, and overall model performance while also reducing overfitting to some extent on limited training data. Among these, dropout is a widely recognized technique for training neural networks; in CNNs it is mainly applied to the fully connected layers [4], owing to the spatial correlation and dependence of convolutional features. In recent years, an interesting range of regional dropout or replacement strategies have been proposed, such as Cutout [1], DropBlock [2], and CutMix [3]. Specifically, Cutout proposed to randomly mask out square regions of the input during training in order to improve the robustness and overall performance of CNNs. DropBlock developed a structured form of dropout that is particularly effective in regularizing CNNs: during training, a contiguous region of a feature map is dropped instead of individual elements of the feature map. CutMix is motivated by Mixup [5]; regions of an image are randomly cut and pasted among training images, and the ground truth labels are mixed in proportion to the area of the regions. Although regional dropout or replacement methods have proven effective for recognition and localization on some benchmarks, the dropping or replacing operation is usually conducted randomly on the input. We argue that this strategy may reduce the efficiency of training and also limit the improvement if the networks are unable to capture the most discriminative regions. A representative comparison of our method with other strategies is shown in Fig. 1.

Figure 1: Comparison of our proposed Attentive CutMix with Mixup [5], Cutout [1] and CutMix [3].
Figure 2: Framework overview of proposed Attentive CutMix.

To address the aforementioned shortcoming, we propose an attention-based CutMix method. Our goal is to learn a more robust network that attends to the most important part(s) of an object and achieves better recognition performance without incurring any additional testing cost. We achieve this by using the attention maps generated by a pretrained network to guide the localization of the cutting and pasting operation among training image pairs in CutMix. Our proposed method is extremely simple yet effective. It boosts the strong baseline CutMix by a significant margin.

We conduct extensive experiments on the CIFAR-10/100 and ImageNet [6] datasets. Our results show that the proposed Attentive CutMix consistently improves accuracy across a variety of popular network architectures (ResNet [7], DenseNet [8] and EfficientNet [9]). For instance, on CIFAR-100, we achieve 75.37% accuracy with ResNet-152, which is 2.16% higher than the baseline CutMix (73.21%).

2 Related Work

Data augmentation. Data augmentation operations, such as flipping, rotation, scaling, cropping, contrast adjustment, translation, and adding Gaussian noise, are among the most popular techniques for training deep neural networks and improving the generalization capabilities of models. However, in the real world, natural data can still exist in a variety of conditions that cannot be accounted for by such simple strategies. For instance, the landscapes to be identified in images can range from rivers, blue sky, and freezing tundra to grasslands and forests. Thus, some works [10, 11, 12] have proposed to artificially generate effects such as different seasons to augment the dataset. In this paper, we focus on recognizing natural objects such as cats, dogs, cars, and people: we first discern the most important parts of an object, then use a cut-and-paste operation inspired by CutMix to generate a new image that helps the network better attend to the local regions of an image.

CutMix. CutMix is an augmentation strategy incorporating region-level replacement. For a pair of images, patches from one image are randomly cut and pasted onto the other image, and the ground truth labels are mixed in proportion to the area of the patches. Conventional regional dropout strategies [1, 2] have been shown to boost classification and localization performance to a certain degree, but the removed regions are usually replaced with zeros or filled with random noise, which greatly reduces the number of informative pixels in the training images. Instead of simply removing pixels, CutMix replaces the removed regions with a patch from another image, so that no pixel is uninformative during training, making it more efficient and effective.
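
For contrast with the attentive variant proposed later, the random region selection of baseline CutMix can be sketched as follows. This is a minimal illustration of the box-sampling scheme as we understand it from the CutMix paper [3], not the reference implementation: the function name `sample_cutmix_box` is hypothetical, and `alpha=1.0` and the 224×224 image size are assumptions. Note that in this convention `lam` weights the label of the image that keeps most of its pixels.

```python
import math
import random


def sample_cutmix_box(width, height, alpha=1.0, rng=random):
    """Sample a random CutMix box and the resulting label weight.

    Returns (x0, y0, x1, y1, lam), where lam is the fraction of the
    image left untouched, i.e. the weight of the original label.
    """
    lam = rng.betavariate(alpha, alpha)          # initial mixing ratio in [0, 1]
    cut_w = int(width * math.sqrt(1.0 - lam))    # box covers about (1 - lam) of the area
    cut_h = int(height * math.sqrt(1.0 - lam))
    cx, cy = rng.randrange(width), rng.randrange(height)  # random box centre

    # Clip the box to the image bounds.
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)

    # Recompute lam from the clipped box so the labels match the pasted area.
    lam = 1.0 - (x1 - x0) * (y1 - y0) / (width * height)
    return x0, y0, x1, y1, lam


x0, y0, x1, y1, lam = sample_cutmix_box(224, 224, rng=random.Random(0))
```

Nothing in this sampling step looks at image content, so the box may fall entirely on background.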

Attention mechanism. Attention can be viewed as the process of adjusting or allocating activation intensity towards the most informative or useful locations of the input. Several methods exploit attention mechanisms to improve image classification [13, 14] and object detection [15]. GFR-DSOD [15] proposed a gated CNN for object detection, based on [16], that passes messages between features of different resolutions and uses gating functions to control information flow. SENet [14] used an attention mechanism to model channel-wise relationships and enhance the representation ability of modules throughout the network. In this paper, we introduce a simple attention-based region selection that finds the most discriminative parts spatially.

3 Proposed Approach

3.1 Algorithm

The central idea of Attentive CutMix is to create a new training sample $(\tilde{x}, \tilde{y})$ given two distinct training samples $(x_1, y_1)$ and $(x_2, y_2)$. Here, $x$ is the training image and $y$ is the training label. Similar to CutMix [3], we define this combining operation as

$$\tilde{x} = B \odot x_1 + (\mathbf{1} - B) \odot x_2, \qquad \tilde{y} = \lambda y_1 + (1 - \lambda) y_2,$$

where $B$ denotes a binary mask indicating which pixels belong to either of the two images, $\mathbf{1}$ is a binary mask filled with ones, and $\odot$ is element-wise multiplication. Here $\lambda$ is the ratio of patches cut from the first image and pasted onto the second image to the total number of patches in the second image.
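
The combining operation described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `mix_pair`, the channel-last array layout, and the one-hot label encoding are assumptions, and the label weight is computed as the fraction of mask pixels set to one, which equals the patch ratio when all patches have the same size.

```python
import numpy as np


def mix_pair(x1, y1, x2, y2, mask):
    """Combine two samples with a binary mask (1 = pixel taken from x1).

    x1, x2: float arrays of shape (H, W, C); y1, y2: one-hot label vectors;
    mask: array of shape (H, W) with entries in {0, 1}.
    """
    b = mask[..., None].astype(x1.dtype)      # broadcast the mask over channels
    x_mixed = b * x1 + (1.0 - b) * x2         # mask * x1 + (1 - mask) * x2
    lam = float(mask.mean())                  # fraction of pixels taken from x1
    y_mixed = lam * y1 + (1.0 - lam) * y2     # labels mixed with the same ratio
    return x_mixed, y_mixed, lam


# Toy example: 4x4 RGB images, where the top-left 2x2 block comes from x1.
x1 = np.ones((4, 4, 3), dtype=np.float32)
x2 = np.zeros((4, 4, 3), dtype=np.float32)
y1 = np.array([1.0, 0.0])
y2 = np.array([0.0, 1.0])
mask = np.zeros((4, 4), dtype=np.uint8)
mask[:2, :2] = 1

x_mixed, y_mixed, lam = mix_pair(x1, y1, x2, y2, mask)
```

In the toy example, 4 of the 16 pixels come from the first image, so the first label receives a weight of 0.25.
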
We first obtain a heatmap (generally a 7×7 grid map) of the first image by passing it through an ImageNet-pretrained classification model such as ResNet-152 and taking the final 7×7 output feature map. We then select the top $N$ patches from this 7×7 grid as the attentive region patches to cut from the given image. Here $N$ can range from 1 to 49 (i.e. the entire image itself). Later, we will present an ablation study on the number of attentive patches to be cut from a given image.
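
The patch selection step can be sketched as below. The sketch assumes the feature map has already been reduced to a single attention score per grid cell; how the channels are aggregated is not specified in the text, so that reduction is left out. The function name `top_attentive_cells` and the toy heatmap values are hypothetical.

```python
import heapq


def top_attentive_cells(heatmap, n):
    """Return the (row, col) positions of the n highest-scoring grid cells."""
    positions = [
        (r, c) for r in range(len(heatmap)) for c in range(len(heatmap[0]))
    ]
    # nlargest returns the positions sorted by descending attention score.
    return heapq.nlargest(n, positions, key=lambda rc: heatmap[rc[0]][rc[1]])


# Toy 7x7 heatmap: zero everywhere except six "hot" cells near the centre.
heatmap = [[0.0] * 7 for _ in range(7)]
hot_cells = [(3, 3), (3, 4), (2, 3), (4, 3), (3, 2), (2, 4)]
for score, (r, c) in enumerate(hot_cells, start=1):
    heatmap[r][c] = float(score)

cells = top_attentive_cells(heatmap, n=6)
```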

We then map the selected attentive patches back to the original image. For example, a single patch in a 7×7 grid maps back to a 32×32 image patch on a 224×224 input image. The patches are cut from the first image and pasted onto the second image at their respective original locations, assuming both images are of the same size. The pair of training samples is randomly selected during each training phase. For the composite label, if we pick the top 6 attentive patches from a 7×7 grid, $\lambda$ would then be $6/49$. Every image in the training batch is augmented with patches cut out from another randomly selected image in the original batch. Please refer to Fig. 2 for an illustrative representation of our method.
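
The mapping from grid cells to pixel regions, and the resulting label weight, can be worked through as follows. This is a small sketch under the sizes stated in the text (a 7×7 grid on a 224×224 image); the helper names `cell_to_box` and `label_weight` and the particular cell positions are hypothetical.

```python
def cell_to_box(row, col, image_size=224, grid_size=7):
    """Map a grid cell to its pixel box (top, left, bottom, right), end-exclusive."""
    patch = image_size // grid_size          # 224 // 7 == 32 pixels per cell side
    top, left = row * patch, col * patch
    return top, left, top + patch, left + patch


def label_weight(num_patches, grid_size=7):
    """Weight of the first image's label: pasted patches / total patches."""
    return num_patches / (grid_size * grid_size)


# Hypothetical top-6 attentive cells as (row, col) positions on the 7x7 grid.
cells = [(2, 3), (2, 4), (3, 2), (3, 3), (3, 4), (4, 3)]
boxes = [cell_to_box(r, c) for r, c in cells]
lam = label_weight(len(cells))               # 6 / 49, roughly 0.122
```

With array-like images, each box would then be copied from the first image into the same coordinates of the second, for example `second[top:bottom, left:right] = first[top:bottom, left:right]`, so that the patch keeps its original location.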

3.2 Theoretical Improvements over CutMix

CutMix has shown good empirical ability to improve the classification accuracy of deep learning models. However, the theoretical foundations of its effectiveness are weak. One possible reason for its effectiveness is that pasting random patches onto an image randomly occludes the main classification subject in the image, making it harder for the model to overfit to a particular subject and forcing it to learn more important features associated with that subject.

However, the patch that is cut out has a random size and is taken from a random location, which creates the possibility of cutting an unimportant background patch and pasting it onto the background of the second image. Since the composite label contains a part of the first label, we are in effect associating that background region with the first label for the model to learn. This hinders the empirical gains of CutMix, and it is where Attentive CutMix improves over its CutMix counterpart.

Rather than selecting the patch randomly, Attentive CutMix uses a pretrained network to determine the most important or representative regions within the image for the classification task. The effectiveness of this technique therefore correlates directly with the pretrained model used: the better the model, the more effective Attentive CutMix will be. In addition, the attentive patch is pasted onto the same location in the second image as it occupied in the original image. This further helps to occlude the image, since the random pasting in CutMix leaves open the possibility of the patch being pasted onto the background rather than the object of interest. Thus Attentive CutMix improves on the dual fronts of patch selection and pasting, removing the randomness and using attention to fuse images more robustly.

4 Experiments and Analysis

4.1 Datasets and Models

To demonstrate the effectiveness of our data augmentation technique, we perform extensive experiments across a wide range of popular models and image classification datasets. We select four variants each of the ResNet [7], DenseNet [8] and EfficientNet [9] architectures. We selected these particular architectures since they differ substantially in their architectural concepts. The individual variants of each architecture help us test the method across different depths/sizes of that architecture. For image classification datasets we chose CIFAR10, CIFAR100 [17] and ImageNet [18], as these are widely used benchmark datasets on which to compare our method against others.

4.2 Implementation Details

We trained the individual models from scratch to prevent any pretraining bias from affecting our evaluation results. We ran the baselines according to the hyperparameter configurations used in their original papers; however, because some implementation details are missing, the settings may differ slightly on some datasets/networks. All data augmentation techniques for a given architecture and dataset were run for a fixed number of epochs, which was enough for the models to converge to a stable test set accuracy, since our prime objective was to compare other data augmentation techniques against ours rather than to match state-of-the-art results. All models and all data augmentation techniques were implemented in the PyTorch [19] framework.

4.3 Results on CIFAR10

For the CIFAR10 dataset, each model was trained for 80 epochs with each data augmentation technique. The batch size was kept at 32 and the learning rate at 1e-3. The model weights were initialized using the Xavier Normal technique. Weight decay was used for regularization, with its value kept at 1e-5. Results are presented in Table 1. Our method provides better results on all tested models compared to CutMix, Mixup and the baseline.


CIFAR-10 (%)
Method            Baseline   Mixup   CutMix   Attentive CutMix
ResNet-18         84.67      88.52   87.92    88.94
ResNet-34         87.12      88.70   88.75    90.40
ResNet-101        90.47      91.89   92.13    93.25
ResNet-152        92.45      94.21   94.35    94.79
DenseNet-121      85.65      87.56   87.98    88.34
DenseNet-169      87.67      89.12   89.23    90.45
DenseNet-201      91.21      93.21   93.45    94.16
DenseNet-264      92.78      94.20   94.34    94.83
EfficientNet-B0   87.45      88.07   88.67    88.94
EfficientNet-B1   90.12      90.99   91.37    92.10
EfficientNet-B6   92.74      93.76   93.28    93.92
EfficientNet-B7   94.95      95.11   95.25    95.86


Table 1: Comparison of accuracy (%) with baseline, Mixup and CutMix on CIFAR-10.


CIFAR-100 (%)
Method            Baseline   Mixup   CutMix   Attentive CutMix
ResNet-18         63.14      64.40   65.90    67.16
ResNet-34         65.54      67.83   68.32    70.03
ResNet-101        68.24      70.76   71.32    72.86
ResNet-152        71.49      74.81   73.21    75.37
DenseNet-121      65.12      66.84   67.62    69.23
DenseNet-169      66.42      68.24   69.58    71.34
DenseNet-201      70.28      72.89   73.57    74.65
DenseNet-264      73.51      76.49   75.23    77.58
EfficientNet-B0   64.67      65.78   66.95    67.48
EfficientNet-B1   66.89      68.23   68.12    68.96
EfficientNet-B6   71.34      73.56   73.75    74.82
EfficientNet-B7   75.67      77.21   77.57    78.52


Table 2: Comparison of accuracy (%) with baseline, Mixup and CutMix on CIFAR-100.

4.4 Results on CIFAR100

For the CIFAR100 dataset, every model was trained for 120 epochs. The batch size was kept at 32 and the learning rate at 1e-3. The model weights were initialized using the Xavier Normal technique. Weight decay was used for regularization, with its value kept at 1e-5. Our results are presented in Table 2. Our method again provides better overall results compared to other data augmentation methods.

4.5 Results on ImageNet

For the ImageNet dataset, each of the ResNet [7] and DenseNet [8] models was trained for 100 epochs with each data augmentation technique. EfficientNet models were trained longer, for 180 epochs. The batch size was kept at 64 and the learning rate at 1e-3. Input images were normalized using mean and standard deviation statistics derived from the dataset. Results are presented in Table 3. Our method consistently provides better results compared to other data augmentation methods.
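
For reference, the training settings stated in Sections 4.3 to 4.5 can be collected in one place. The structure below is only an illustrative summary: the `TrainConfig` class and the dictionary keys are hypothetical, and fields that the text does not specify (for example the optimizer, or the weight decay and initialization used on ImageNet) are left as `None` or omitted rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    epochs: int
    batch_size: int
    learning_rate: float
    weight_decay: Optional[float] = None   # None: not stated in the text
    init: Optional[str] = None             # None: not stated in the text


CONFIGS = {
    "cifar10": TrainConfig(80, 32, 1e-3, weight_decay=1e-5, init="xavier_normal"),
    "cifar100": TrainConfig(120, 32, 1e-3, weight_decay=1e-5, init="xavier_normal"),
    "imagenet_resnet_densenet": TrainConfig(100, 64, 1e-3),
    "imagenet_efficientnet": TrainConfig(180, 64, 1e-3),
}
```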


ImageNet (Top-1 accuracy %)
Method            Baseline   Mixup   CutMix   Attentive CutMix
ResNet-18         73.54      74.46   75.32    75.78
ResNet-34         77.31      79.03   79.22    80.13
ResNet-101        78.73      79.42   80.56    81.16
ResNet-152        78.98      80.01   80.25    80.93
DenseNet-121      75.87      76.89   77.34    77.98
DenseNet-169      77.03      79.10   79.32    79.78
DenseNet-201      78.67      80.14   80.23    80.87
DenseNet-264      79.59      82.11   82.36    82.79
EfficientNet-B0   76.12      78.19   78.21    78.79
EfficientNet-B1   78.47      79.96   80.17    81.03
EfficientNet-B6   83.89      84.43   84.60    85.29
EfficientNet-B7   84.34      85.12   85.19    85.32


Table 3: Comparison of accuracy (%) with baseline, Mixup and CutMix on ImageNet.

4.6 Ablation Study

The number of patches $N$ to be cut out from the first image is a hyperparameter that needs to be tuned for optimal performance of our method. We studied values of this hyperparameter across the range of 1 to 15. We tested each value in this range in all our experiments and found that cutting out the top 6 attentive patches gave the best average performance across the experiments. One explanation could be that cutting out fewer than 6 patches does not sufficiently occlude the main subject in the second image. Conversely, cutting out more than 6 patches may occlude the original subject so heavily that the respective label for that image is no longer discriminative enough for the model to learn anything useful.
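
To make the occlusion trade-off concrete, the share of the second image that gets covered can be tabulated over the studied range. This is plain arithmetic on the 7×7 grid described in Section 3.1, not a reproduction of the ablation itself; the names `GRID_CELLS`, `pasted_fraction` and `sweep` are hypothetical.

```python
GRID_CELLS = 7 * 7  # 49 candidate patches on the 7x7 attention grid


def pasted_fraction(num_patches, total=GRID_CELLS):
    """Fraction of the second image covered by pasted patches."""
    return num_patches / total


# Covered fraction for every patch count in the studied range 1..15.
sweep = {n: round(pasted_fraction(n), 3) for n in range(1, 16)}
```

Under this arithmetic, 6 patches cover about 12% of the image, while the upper end of the studied range (15 patches) covers about 31%.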

4.7 Discussion

Our experiments provide strong evidence that our method yields better results than CutMix and Mixup on average across different datasets and architectures. Attentive CutMix provides an average increase of 1.5% over other methods, which validates the effectiveness of our attention mechanism. One disadvantage of our method is that a pretrained feature extractor is needed in addition to the actual network being trained. However, depending on the classification task and the training complexity of the model and dataset, we can vary the size of the pretrained extractor used for data augmentation. This can thus prove to be a minor extra computation in the overall training scheme, and it can easily be offset by the performance gains of the method. We believe that this method could provide similar gains for other computer vision tasks such as object detection and instance segmentation, since all of them rely on robust features being extracted from the image, which is also the main foundation of the image classification task. The future direction for this work would thus be to incorporate this method into other vision tasks.

5 Conclusion

We have presented Attentive CutMix, an attention-based data augmentation method that can automatically find the most discriminative parts of an object and replace them with patches cut out from another image. Our proposed method is simple yet effective, straightforward to implement, and can help boost the baseline significantly. Our experimental evaluation on three benchmarks, CIFAR-10/100 and ImageNet, verifies the effectiveness of our proposed method, which obtains consistent improvement for a variety of network architectures.


  1. Accepted as a conference paper at ICASSP 2020. Corresponding author.


  1. Terrance DeVries and Graham W Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
  2. Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le, “Dropblock: A regularization method for convolutional networks,” in NeurIPS, 2018.
  3. Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in ICCV, 2019.
  4. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in NeurIPS, 2012.
  5. Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz, “mixup: Beyond empirical risk minimization,” in ICLR, 2018.
  6. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
  7. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  8. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, “Densely connected convolutional networks,” in CVPR, 2017.
  9. Mingxing Tan and Quoc Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in ICML, 2019.
  10. Zhiqiang Shen, Mingyang Huang, Jianping Shi, Xiangyang Xue, and Thomas S Huang, “Towards instance-level image-to-image translation,” in CVPR, 2019.
  11. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017.
  12. Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz, “Multimodal unsupervised image-to-image translation,” in ECCV, 2018.
  13. Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang, “Residual attention network for image classification,” in CVPR, 2017.
  14. Jie Hu, Li Shen, and Gang Sun, “Squeeze-and-excitation networks,” in CVPR, 2018.
  15. Zhiqiang Shen, Honghui Shi, Jiahui Yu, Hai Phan, Rogerio Feris, Liangliang Cao, Ding Liu, Xinchao Wang, Thomas Huang, and Marios Savvides, “Improving object detection from scratch via gated feature reuse,” in BMVC, 2019.
  16. Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue, “Dsod: Learning deeply supervised object detectors from scratch,” in ICCV, 2017.
  17. Alex Krizhevsky, Geoffrey Hinton, et al., “Learning multiple layers of features from tiny images,” Tech. Rep., Citeseer, 2009.
  18. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
  19. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in PyTorch,” in NeurIPS Autodiff Workshop, 2017.