CutMix: Regularization Strategy to Train Strong Classifiers
with Localizable Features
Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers. They have proved effective for guiding the model to attend to less discriminative parts of objects (e.g., the leg as opposed to the head of a person), thereby letting the network generalize better and gain better object localization capabilities. On the other hand, current methods for regional dropout remove informative pixels from training images by overlaying a patch of either black pixels or random noise. Such removal is not desirable because it leads to information loss and inefficiency during training. We therefore propose the CutMix augmentation strategy: patches are cut and pasted among training images, and the ground truth labels are mixed proportionally to the area of the patches. By making efficient use of training pixels and retaining the regularization effect of regional dropout, CutMix consistently outperforms state-of-the-art augmentation strategies on the CIFAR and ImageNet classification tasks, as well as on the ImageNet weakly-supervised localization task. Moreover, unlike previous augmentation methods, our CutMix-trained ImageNet classifier, when used as a pretrained model, yields consistent performance gains on the Pascal VOC detection and MS-COCO image captioning benchmarks. We also show that CutMix improves model robustness against input corruptions and its out-of-distribution detection performance.
Deep convolutional neural networks (CNNs) have shown promising performance on various computer vision problems such as image classification [30, 19, 11], object detection [29, 23], semantic segmentation [1, 24], and video analysis [27, 31]. To further improve training efficiency and performance, a number of training strategies have been proposed, including data augmentation and regularization techniques [33, 16, 37].
In particular, to prevent a CNN from focusing too much on a small set of intermediate activations or on a small region of input images, random feature removal regularizations have been proposed. Examples include dropout for randomly dropping hidden activations and regional dropout [2, 49, 32, 7] for erasing random regions of the input. Researchers have shown that these feature removal strategies improve generalization and localization by letting a model attend not only to the most discriminative parts of objects, but to the entire object region [32, 7].
| ||ResNet-50 (baseline)||Mixup||Cutout||CutMix|
|Label||Dog 1.0||Dog 0.5 Cat 0.5||Dog 1.0||Dog 0.6 Cat 0.4|
|ImageNet Cls (%)||76.3 (+0.0)||77.4 (+1.1)||77.1 (+0.8)||78.4 (+2.1)|
|ImageNet Loc (%)||46.3 (+0.0)||45.8 (-0.5)||46.7 (+0.4)||47.3 (+1.0)|
|Pascal VOC Det (mAP)||75.6 (+0.0)||73.9 (-1.7)||75.1 (-0.5)||76.7 (+1.1)|
While regional dropout strategies have improved classification and localization performance to a certain degree, the deleted regions are usually zeroed out [2, 32] or filled with random noise, greatly reducing the proportion of informative pixels in training images. We recognize this as a severe conceptual limitation, as CNNs are generally data hungry. How can we maximally utilize the deleted regions, while taking advantage of the better generalization and localization that regional dropout provides?
We address the above question by introducing the augmentation strategy CutMix. Instead of simply removing pixels, we replace the removed regions with a patch from another image (see Table 1). The ground truth labels are also mixed, in proportion to the number of pixels contributed by each of the combined images. CutMix thus enjoys the property that no pixel is uninformative during training, making training efficient, while retaining the advantage of regional dropout in attending to non-discriminative parts of objects. The added patches further enhance localization ability by requiring the model to identify the object from a partial view. The training and inference budgets remain unchanged.
CutMix shares similarity with Mixup, which mixes two samples by interpolating both the images and the labels. Mixup has been found to improve classification, but the interpolated samples tend to be unnatural (see the mixed image in Table 1). CutMix overcomes this problem by replacing an image region with an image patch from another training image.
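To make this contrast concrete, the following minimal NumPy sketch (with illustrative stand-in images, not actual training data) shows that Mixup blends every pixel globally, whereas CutMix copies a patch verbatim from the partner image:

```python
import numpy as np

def mixup(x_a, x_b, lam):
    """Mixup: every pixel is a global blend of the two images."""
    return lam * x_a + (1.0 - lam) * x_b

def cutmix(x_a, x_b, box):
    """CutMix: pixels inside `box` (x1, y1, x2, y2) come verbatim from x_b."""
    out = x_a.copy()
    x1, y1, x2, y2 = box
    out[y1:y2, x1:x2] = x_b[y1:y2, x1:x2]
    return out

x_a = np.zeros((4, 4))   # "dog" image stand-in
x_b = np.ones((4, 4))    # "cat" image stand-in
m = mixup(x_a, x_b, 0.5)            # every pixel equals 0.5: locally ambiguous
c = cutmix(x_a, x_b, (0, 0, 2, 2))  # top-left 2x2 patch is pure "cat"
```

Each CutMix pixel belongs unambiguously to one source image, which is why the generated samples remain locally natural.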
Table 1 gives an overview of Mixup, Cutout, and CutMix on image classification, weakly supervised localization, and transfer learning to object detection. Although Mixup and Cutout enhance ImageNet classification accuracy, they suffer performance degradation on the ImageNet localization and object detection tasks. CutMix, on the other hand, achieves significant improvements across all three tasks, demonstrating classification and localization ability beyond the baseline and the other augmentation methods.
We present extensive evaluations of CutMix on various CNN architectures, datasets, and tasks. Summarizing the key results: CutMix significantly improves the accuracy of a baseline classifier on CIFAR-100, obtaining a state-of-the-art top-1 error. On ImageNet, applying CutMix to ResNet-50 and ResNet-101 improves classification accuracy by +2.08% and +1.70%, respectively. On the localization front, CutMix improves performance on the weakly-supervised object localization (WSOL) task on both CUB200-2011 and ImageNet. The superior localization capability is further evidenced by fine-tuning a detector and an image caption generator on CutMix-ImageNet-pretrained models: CutMix pretraining improves overall detection performance on Pascal VOC (mAP) and image captioning performance on MS-COCO (BLEU score). CutMix also enhances model robustness and dramatically alleviates the over-confidence issue [12, 21] of deep networks.
2 Related Work
Regional dropout: Methods [2, 49, 32] that remove random regions of images have been proposed to enhance the generalization and localization performance of CNNs. CutMix is similar to these methods; the critical difference is that the removed regions are filled with patches from other images. At the feature level, DropBlock has generalized regional dropout to the feature space and has likewise shown enhanced generalizability. CutMix can also be performed in the feature space, as we will see in the experiments.
Synthesizing training data: Some works have explored synthesizing training data for further generalizability. Generating new training samples by stylizing ImageNet guides the model to focus more on shape than texture, leading to better classification and object detection performance. CutMix also generates new samples, by cutting and pasting patches within mini-batches, leading to performance boosts in many computer vision tasks; the main advantage of CutMix is that the additional cost for sample generation is negligible. For object detection, object insertion methods [4, 3] have been proposed as a way to synthesize objects in a background. These methods differ from CutMix in that they aim to represent a single object well, while CutMix generates combined samples that may contain multiple objects.
Mixup: CutMix shares similarity with Mixup [46, 39] in that both combine two samples, where the ground truth label of the new sample is given by the mixture of the one-hot labels. As we will see in the experiments, Mixup samples suffer from being locally ambiguous and unnatural, and therefore confuse the model, especially for localization. Recently, Mixup variants [40, 34, 9] have been proposed that perform feature-level interpolation and other types of transformations. The above works, however, generally lack a deep analysis of localization ability and transfer-learning performance.
Tricks for training deep networks: Efficient training of deep networks is one of the most important problems in the research community, as they require great amounts of compute and data. Methods such as weight decay, dropout, and Batch Normalization are widely used to train more generalizable deep networks. Recently, methods adding noise to internal features [16, 7, 44] or adding extra paths to the architecture [14, 13] have been proposed. CutMix is complementary to the above methods because it operates at the data level, without changing internal representations or the architecture.
3 CutMix

We describe the CutMix algorithm in detail.
Let $x \in \mathbb{R}^{W \times H \times C}$ and $y$ denote a training image and its label, respectively. The goal of CutMix is to generate a new training sample $(\tilde{x}, \tilde{y})$ by combining two training samples $(x_A, y_A)$ and $(x_B, y_B)$. The generated sample $(\tilde{x}, \tilde{y})$ is then used to train the model with its original loss function. To this end, we define the combining operation as

$$\tilde{x} = \mathbf{M} \odot x_A + (\mathbf{1} - \mathbf{M}) \odot x_B, \qquad \tilde{y} = \lambda\, y_A + (1 - \lambda)\, y_B, \tag{1}$$
where $\mathbf{M} \in \{0, 1\}^{W \times H}$ denotes a binary mask indicating where to drop out and fill in from the two images, $\mathbf{1}$ is a binary mask filled with ones, and $\odot$ is element-wise multiplication. Like Mixup, the combination ratio $\lambda$ between the two data points is sampled from the beta distribution $\mathrm{Beta}(\alpha, \alpha)$. In all our experiments, we set $\alpha$ to $1$, so that $\lambda$ is sampled from the uniform distribution $\mathrm{Unif}\,(0, 1)$. Note that the major difference is that CutMix replaces an image region with a patch from another training image, and can therefore generate locally more natural images than Mixup.
To sample the binary mask $\mathbf{M}$, we first sample the bounding box coordinates $\mathbf{B} = (r_x, r_y, r_w, r_h)$ indicating the cropping regions on $x_A$ and $x_B$. The region $\mathbf{B}$ in $x_A$ is removed and filled in with the patch cropped from $\mathbf{B}$ of $x_B$.
In our experiments, we sample rectangular masks $\mathbf{M}$ whose aspect ratio is proportional to that of the original image. The box coordinates are uniformly sampled according to

$$r_x \sim \mathrm{Unif}\,(0, W),\quad r_w = W\sqrt{1-\lambda},\qquad r_y \sim \mathrm{Unif}\,(0, H),\quad r_h = H\sqrt{1-\lambda}, \tag{2}$$

making the cropped area ratio $\frac{r_w\, r_h}{W H} = 1 - \lambda$. With the cropping region, the binary mask $\mathbf{M} \in \{0, 1\}^{W \times H}$ is decided by filling with $0$ within the bounding box $\mathbf{B}$, and with $1$ otherwise.
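A sketch of the box sampling of Equation (2) in NumPy follows. As in the publicly released implementation, the box center is sampled uniformly and the box is clipped at the image border, so the realized patch area can fall below the nominal $1-\lambda$ fraction; labels should then be mixed with the realized area ratio. The function name `rand_bbox` mirrors the released code, but the details here are illustrative:

```python
import numpy as np

def rand_bbox(W, H, lam, rng=None):
    """Sample a box covering (at most) a (1 - lam) fraction of the image:
    r_w = W * sqrt(1 - lam), r_h = H * sqrt(1 - lam), center uniform,
    clipped at the image border."""
    rng = rng or np.random.default_rng()
    cut_w, cut_h = int(W * np.sqrt(1.0 - lam)), int(H * np.sqrt(1.0 - lam))
    cx, cy = rng.integers(W), rng.integers(H)          # uniform box center
    x1, x2 = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)
    y1, y2 = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
    return x1, y1, x2, y2
```

Because width and height each scale with $\sqrt{1-\lambda}$, the unclipped box area is exactly $(1-\lambda)WH$, matching the cropped area ratio stated above.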
Implementing CutMix is simple and incurs negligible computational overhead, like the existing data augmentation techniques used in [35, 15], so we can efficiently utilize it to train any network architecture. In each training iteration, a CutMix-ed sample $(\tilde{x}, \tilde{y})$ is generated by combining two randomly selected training samples in a mini-batch according to Equation (1). Code-level details are presented in Appendix A.
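A minimal sketch of one CutMix step on a mini-batch is shown below, assuming NumPy arrays and one-hot labels. Pairing each sample with a shuffled partner within the batch follows common practice; the helper is illustrative rather than the exact Appendix A code:

```python
import numpy as np

def cutmix_batch(x, y_onehot, alpha=1.0, rng=None):
    """One CutMix step on a mini-batch of images x (N, H, W[, C]) with
    one-hot labels. Each image receives a patch from a shuffled partner;
    lam_adj is the realized area ratio used for label mixing."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)                        # mixing ratio ~ Beta(a, a)
    perm = rng.permutation(len(x))                      # random pairing in batch
    n, h, w = x.shape[:3]
    cut_h, cut_w = int(h * np.sqrt(1.0 - lam)), int(w * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(h), rng.integers(w)           # uniform box center
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    mixed = x.copy()
    mixed[:, y1:y2, x1:x2] = x[perm, y1:y2, x1:x2]      # paste partner patch
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)     # realized mixing ratio
    labels = lam_adj * y_onehot + (1.0 - lam_adj) * y_onehot[perm]
    return mixed, labels, lam_adj
```

With a standard cross-entropy criterion, the equivalent per-batch loss is `lam_adj * CE(pred, y_a) + (1 - lam_adj) * CE(pred, y_b)`, which avoids materializing the mixed label vectors.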
What does the model learn with CutMix? We have motivated CutMix so that full object regions are considered for classification, as Cutout is designed to encourage, while ensuring that two objects can be recognized from partial views in a single image to increase training efficiency. To verify that CutMix indeed learns to recognize two objects from their respective partial views in an augmented sample, we visually compare the activation maps for CutMix against those for Cutout and Mixup. Figure 1 shows example augmented inputs as well as the corresponding class activation maps (CAM) for the two classes present, Saint Bernard and Miniature Poodle. We use a vanilla ResNet-50 model (the ImageNet-pretrained model provided by PyTorch) for obtaining the CAMs, to isolate the effect of the augmentation method.
|Property||Mixup||Cutout||CutMix|
|Usage of full image region||✔||✘||✔|
|Mixed image & label||✔||✘||✔|
We observe that Cutout successfully lets a model focus on less discriminative parts of the object; for example, the model focuses on the belly of the Saint Bernard in the Cutout-ed sample. However, Cutout makes less efficient use of training data due to its uninformative pixels. Mixup, on the other hand, makes full use of pixels but introduces unnatural artifacts; the CAM for Mixup accordingly shows the model's confusion in choosing cues for recognition. We hypothesize that such confusion leads to Mixup's suboptimal performance in classification and localization, as we will see in Section 4. CutMix efficiently improves upon Cutout by localizing the two object classes accurately, whereas Cutout can only deal with one object in a single image. We summarize the comparison among Mixup, Cutout, and CutMix in Table 2.
Analysis of validation error: We analyze the effect of CutMix on stabilizing the training of deep networks. We compare the top-1 validation error during training with CutMix against the baseline. We train ResNet-50 for ImageNet classification and PyramidNet-200 for CIFAR-100 classification; Figure 2 shows the results.
We observe, first of all, that CutMix achieves lower validation error than the baseline at the end of training. After the learning rate is reduced at the halfway point of training, the baseline suffers from overfitting, with increasing validation error. CutMix, on the other hand, shows a steady decrease in validation error, demonstrating its ability to reduce overfitting by guiding training with diverse samples.
4 Experiments

In this section, we evaluate CutMix for its capability to improve the localizability as well as the generalizability of a trained model on multiple tasks. We first study the effect of CutMix on image classification (Section 4.1) and weakly supervised object localization (Section 4.2). Next, we show the transferability of a CutMix-pretrained model when it is fine-tuned for object detection and image captioning tasks (Section 4.3). We also show that CutMix can improve model robustness and alleviate the over-confidence issue in Section 4.4. Throughout the experiments, we verify that CutMix outperforms other state-of-the-art regularization methods on the above tasks, and we further analyze the inner mechanisms behind this superiority.
4.1 Image Classification
4.1.1 ImageNet Classification
We evaluate on the ImageNet-1K benchmark, a dataset containing over 1M training images and 50K validation images across 1K categories. For a fair comparison, we use the standard augmentation setting for the ImageNet dataset, such as resizing, cropping, and flipping, as also done in [10, 7, 15, 36]. We found that regularization methods including Stochastic Depth, Cutout, Mixup, and our CutMix require a greater number of training epochs until convergence; we therefore trained all models with a longer schedule, decaying the initial learning rate step-wise at fixed epochs. The mixture hyper-parameter $\alpha$ for CutMix is set to 1.0.
|Model||# Params||Top-1 Err (%)||Top-5 Err (%)|
|ResNet-101 + SE Layer* ||49.4 M||20.94||5.50|
|ResNet-101 + GE Layer* ||58.4 M||20.74||5.29|
|ResNet-50 + SE Layer* ||28.1 M||22.12||5.99|
|ResNet-50 + GE Layer* ||33.7 M||21.88||5.80|
|ResNet-50 (Baseline)||25.6 M||23.68||7.05|
|ResNet-50 + Cutout ||25.6 M||22.93||6.66|
|ResNet-50 + StochDepth ||25.6 M||22.46||6.27|
|ResNet-50 + Mixup ||25.6 M||22.58||6.40|
|ResNet-50 + Manifold Mixup ||25.6 M||22.50||6.21|
|ResNet-50 + DropBlock* ||25.6 M||21.87||5.98|
|ResNet-50 + Feature CutMix||25.6 M||21.80||6.06|
|ResNet-50 + CutMix||25.6 M||21.60||5.90|
We briefly describe the settings for the baseline augmentation schemes. The dropping rate of residual blocks for Stochastic Depth was set for its best performance. The mask size for Cutout is fixed, with the drop location uniformly sampled. The performance of DropBlock is from the original paper; the difference from our setting is the number of training epochs. Manifold Mixup applies the Mixup operation on a randomly chosen internal feature map. The hyper-parameter $\alpha$ for Mixup and Manifold Mixup was tested with 0.5 and 1.0, and we selected 1.0, which showed better performance. Conceptually, it is also possible to extend CutMix to feature-level augmentation. To test this, we propose the "Feature CutMix" scheme, which applies the CutMix operation at a randomly chosen layer per minibatch, as Manifold Mixup does.
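The paper does not spell out Feature CutMix in code; one plausible sketch, assuming (C, H, W) feature maps and the same box sampling as the image-level version, simply swaps a spatial patch between two samples' feature maps at the chosen layer:

```python
import numpy as np

def feature_cutmix(feat_a, feat_b, lam, rng=None):
    """Cut-and-paste a spatial patch between two (C, H, W) feature maps.
    The same box is applied across all channels; the returned lam_adj is
    the realized kept-area fraction, used for label mixing as with
    image-level CutMix."""
    rng = rng or np.random.default_rng()
    c, h, w = feat_a.shape
    cut_h, cut_w = int(h * np.sqrt(1.0 - lam)), int(w * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(h), rng.integers(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    out = feat_a.copy()
    out[:, y1:y2, x1:x2] = feat_b[:, y1:y2, x1:x2]
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    return out, lam_adj
```

As in Manifold Mixup, the layer index would be drawn at random per minibatch, with index 0 recovering the image-level operation.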
|Model||# Params||Top-1 Err (%)||Top-5 Err (%)|
|ResNet-101 (Baseline) ||44.6 M||21.87||6.29|
|ResNet-101 + CutMix||44.6 M||20.17||5.24|
|ResNeXt-101 (Baseline) ||44.1 M||21.18||5.57|
|ResNeXt-101 + CutMix||44.1 M||19.47||5.03|
Comparison against baseline augmentations: Results are given in Table 3. We observe that CutMix achieves the best result, 21.60% top-1 error, among the considered augmentation strategies. CutMix outperforms Cutout and Mixup, the two closest approaches to ours, by +1.33% and +0.98%, respectively. At the feature level as well, we find CutMix preferable to Mixup, with top-1 errors of 21.80% (Feature CutMix) and 22.50% (Manifold Mixup).
Comparison against architectural improvements: We have also compared the gains due to CutMix against gains due to architectural improvements (e.g., greater depth or additional modules). CutMix improves top-1 accuracy by +2.08%, while the SE and GE modules bring +1.56% and +1.80%, respectively, over the ResNet-50 baseline, and increasing depth (ResNet-50 → ResNet-152) likewise yields a smaller gain. The improvement due to CutMix is more impressive because it requires no additional parameters or computation per SGD update (as architectural changes do). CutMix is a competitive data augmentation scheme with minimal additional cost.
4.1.2 CIFAR Classification
Here we describe the results on CIFAR classification. For CIFAR, the learning rate was decayed step-wise over training. To ensure the effectiveness of the proposed method, we used a very strong baseline, PyramidNet-200 with widening factor $\tilde{\alpha}=240$ (26.8 M parameters), which shows state-of-the-art performance on CIFAR-100.
|PyramidNet-200 ($\tilde{\alpha}$=240) (# params: 26.8 M)||Top-1 Err (%)||Top-5 Err (%)|
|+ StochDepth ||15.86||3.33|
|+ Label smoothing ($\epsilon$=0.1) ||16.73||3.37|
|+ Cutout ||16.53||3.65|
|+ Cutout + Label smoothing ($\epsilon$=0.1)||15.61||3.88|
|+ DropBlock ||15.73||3.26|
|+ DropBlock + Label smoothing ($\epsilon$=0.1)||15.16||3.86|
|+ Mixup ($\alpha$=0.5) ||15.78||4.04|
|+ Mixup ($\alpha$=1.0) ||15.63||3.99|
|+ Manifold Mixup ($\alpha$=1.0) ||16.14||4.07|
|+ Cutout + Mixup ($\alpha$=1.0)||15.46||3.42|
|+ Cutout + Manifold Mixup ($\alpha$=1.0)||15.09||3.35|
|+ ShakeDrop ||15.08||2.72|
|+ CutMix + ShakeDrop ||13.81||2.29|
Table 5 shows the performance comparison with other state-of-the-art data augmentation and regularization methods. All experiments were conducted three times, and the averaged performance is reported.
Hyper-parameter settings: The hole size of Cutout is fixed, and the drop location is uniformly sampled. keep_prob and block_size for DropBlock are fixed as well. The drop rate for Stochastic Depth is set to 0.25. For Mixup, we tested the hyper-parameter $\alpha$ with 0.5 and 1.0. For Manifold Mixup, we applied the Mixup operation at a randomly chosen layer per minibatch.
|Model||# Params||Top-1 Err (%)||Top-5 Err (%)|
|PyramidNet-110 ||1.7 M||19.85||4.66|
|PyramidNet-110 + CutMix||1.7 M||17.97||3.83|
|ResNet-110 ||1.1 M||23.14||5.95|
|ResNet-110 + CutMix||1.1 M||20.11||4.43|
|PyramidNet-200 ($\tilde{\alpha}$=240)||Top-1 Error (%)|
|+ Mixup ($\alpha$=1.0)||3.09|
|+ Manifold Mixup ($\alpha$=1.0)||3.15|
Combination of regularization methods: Going one step further in validating each regularization method, we also tested combinations of the various methods. We found that neither Cutout nor label smoothing improved accuracy when adopted independently, but they were effective when used simultaneously. DropBlock, the generalization of Cutout to feature maps, was also more effective when combined with label smoothing. We observe that Mixup and Manifold Mixup both achieved higher accuracy when Cutout was applied to the images. The combination of Cutout and Mixup tends to generate locally separated and mixed samples, since the Cutout-ed region has less ambiguity than vanilla Mixup. Thus, the success of combining Cutout and Mixup shows that mixing in a cut-and-paste manner is better than interpolation, as we conjectured.
Consequently, CutMix achieves a substantial improvement over the baseline error rate on CIFAR-100 classification. We also note that we achieved a new state-of-the-art performance of 13.81% top-1 error by combining CutMix with ShakeDrop, a regularization technique that adds noise in the feature space (Table 5).
CutMix for CIFAR-10: We evaluated CutMix on the CIFAR-10 dataset using the same baseline and training settings as for CIFAR-100. The results are given in Table 7. CutMix again enhances performance over the baseline and outperforms Mixup and Cutout.
4.1.3 Ablation Studies
We conducted an ablation study on the CIFAR-100 dataset using the same experimental settings as in Section 4.1.2. We evaluated CutMix while varying the hyper-parameter $\alpha$ over a range of values; the results are given in the left graph of Figure 3. For all tested values of $\alpha$, we achieved better results than the baseline, and the best performance was achieved with $\alpha = 1.0$.
The performance of feature-level CutMix is given in the right graph of Figure 3. We varied the location where CutMix is applied, from the image level to the feature level. We denote the index as (0 = image level, 1 = after the first conv-bn, 2 = after layer1, 3 = after layer2, 4 = after layer3). CutMix achieved the best performance when applied to the input image. Still, feature-level CutMix, except for the layer3 case, improves accuracy over the baseline.
|Method||Top-1 Err (%)||Top-5 Err (%)|
|Center Gaussian CutMix||15.95||3.40|
Table 8 shows the performance of CutMix over various configurations. 'Center Gaussian CutMix' denotes the experiment adopting a Gaussian distribution centered on the image, instead of the uniform distribution, when sampling $r_x, r_y$ in Equation (2). 'Fixed-size CutMix' fixes the size of the cropping region, so the mixing ratio $\lambda$ is constant. 'Scheduled CutMix' changes the probability of applying CutMix during training, as [7, 16] do; we scheduled the probability linearly from 0 to 1 as the training epochs increase. 'One-hot CutMix' denotes the case where the label is not combined as in Equation (1), but set to the single label that occupies the larger portion of the image. The results show that adding a center prior (Center Gaussian CutMix) or fixing the size of the cropping region (Fixed-size CutMix) leads to performance degradation. Scheduled CutMix performs slightly worse than CutMix, and One-hot CutMix performs much worse, highlighting the effect of the combined label.
4.2 Weakly Supervised Object Localization
|Model||ImageNet Cls Top-1 Err (%)||Detection SSD (mAP)||Detection Faster-RCNN (mAP)||Captioning BLEU-1||Captioning BLEU-4|
|ResNet-50 (Baseline)||23.68||76.7 (+0.0)||75.6 (+0.0)||61.4 (+0.0)||22.9 (+0.0)|
|Mixup-trained||22.58||76.6 (-0.1)||73.9 (-1.7)||61.6 (+0.2)||23.2 (+0.3)|
|Cutout-trained||22.93||76.8 (+0.1)||75.0 (-0.6)||63.0 (+1.6)||24.0 (+1.1)|
|CutMix-trained||21.60||77.6 (+0.9)||76.7 (+1.1)||64.2 (+2.8)||24.9 (+2.0)|
The weakly supervised object localization (WSOL) task aims to train a classifier to localize target objects using only class labels. To localize targets well, it is important for the CNN to look at the full object region rather than focusing on a small discriminative part. Learning spatially distributed representations is thus the key to improving WSOL performance. We therefore measure how well CutMix learns spatially distributed representations, compared to the other baselines, on the WSOL task. We followed the training and evaluation strategy of existing WSOL methods [32, 47, 48]. ResNet-50 is used as the base model; it is initialized with an ImageNet-pretrained model and modified to enlarge the final convolutional feature map. The model is then fine-tuned on the CUB200-2011 and ImageNet-1K datasets using only class labels. At evaluation, we use class activation mapping (CAM) to estimate the bounding box of an object. Quantitative and qualitative results are given in Table 9 and Figure 4, respectively; implementation details are in Appendix B.
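As an illustration of CAM-based box estimation, the sketch below thresholds the activation map at a fraction of its maximum and takes the extent of the surviving pixels; the 20% threshold follows the original CAM protocol, and taking the full extent rather than the largest connected component is a simplification:

```python
import numpy as np

def cam_to_bbox(cam, thr_ratio=0.2):
    """Estimate an object box (x1, y1, x2, y2) from a class activation map
    by thresholding at thr_ratio * max and taking the extent of the
    surviving pixels. (The standard protocol uses the largest connected
    component; using the full extent is a simplification.)"""
    mask = cam >= thr_ratio * cam.max()
    ys, xs = np.where(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1
```

The estimated box is then compared against the ground-truth box under an intersection-over-union criterion to score localization accuracy.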
Comparison against Mixup and Cutout: CutMix outperforms Mixup by clear margins on both the CUB200-2011 and ImageNet datasets. We observe that Mixup degrades localization accuracy over the baseline and tends to focus on a small region of the image, as shown in Figure 4. As we hypothesized in Section 3.2, Mixup samples are ambiguous, so a CNN trained on them tends to focus on the most discriminative part for classification, which degrades localization ability. Although Cutout improves localization accuracy over the baseline, CutMix still outperforms Cutout on both CUB200-2011 and ImageNet.
4.3 Transfer Learning of Pretrained Model
In this section, we examine the generalization ability of CutMix-trained models by transferring them from image classification to other computer vision tasks, such as object detection and image captioning, that require localization ability in the learned CNN features. For each task, we replace the backbone network of the original model with ImageNet-pretrained models trained with Mixup, Cutout, or our CutMix. The model is then fine-tuned for each task, and we validate the performance improvement of CutMix over the original backbone and the other baselines. A ResNet-50 model is used as the baseline.
Transferring to Pascal VOC object detection: Two popular detection models, SSD and Faster R-CNN, are used for this experiment. Originally the two methods used VGG-16 as the backbone network, but we changed it to ResNet-50. The ResNet-50 backbone is initialized with the various ImageNet-pretrained models, fine-tuned on Pascal VOC 2007 and 2012 trainval data, and evaluated on the Pascal VOC 2007 test data using the mAP metric. We follow the fine-tuning strategy of the original methods [23, 29]; implementation details are in Appendix C. The results are presented in Table 10. The backbones pretrained with Cutout and Mixup failed to improve object detection performance over the original model. However, the backbone pretrained with CutMix clearly improves the performance of both SSD and Faster R-CNN, showing that our method trains CNNs whose representations generalize to object detection.
Transferring to MS-COCO image captioning: We used Neural Image Caption (NIC) as the base model for the image captioning experiment, changing the backbone network of the encoder from GoogLeNet to ResNet-50. The backbone is initialized with the ImageNet-pretrained models, and we then trained and evaluated NIC on the MS-COCO dataset. Implementation details and other evaluation metrics (METEOR, CIDEr, etc.) are in Appendix D. Table 10 shows the performance with each pretrained model; CutMix outperforms Mixup and Cutout on both the BLEU-1 and BLEU-4 metrics.
Note that simply replacing backbone network with our CutMix-pretrained model gives clear performance gains for object detection and image captioning tasks.
4.4 Robustness and Uncertainty
Many studies have shown that deep models are easily fooled by small, unrecognizable perturbations of the input image, known as adversarial attacks [8, 38]. One straightforward way to enhance robustness and uncertainty estimates is input augmentation that generates unseen samples. We evaluate the robustness and uncertainty improvements of the input augmentation methods Mixup, Cutout, and CutMix against the baseline.
Robustness: We evaluate the robustness of the trained models to adversarial samples, occluded samples, and in-between class samples. We use ImageNet-pretrained ResNet-50 models with the same settings as in Section 4.1.1.
The Fast Gradient Sign Method (FGSM) is used to generate adversarial perturbations, and we assume the adversary has full knowledge of the model (a white-box attack). We report the top-1 accuracy after the attack on the ImageNet validation set in Table 11. CutMix significantly improves robustness to adversarial attacks compared with the other augmentation methods.
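FGSM perturbs the input by $\epsilon \cdot \mathrm{sign}(\nabla_x L)$. A minimal illustration on a logistic model follows (hypothetical weights; the gradient is computed analytically instead of through a deep network, but the attack rule is the same):

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """White-box FGSM on a logistic model p = sigmoid(w . x + b):
    step x by eps in the sign of the input gradient of the binary
    cross-entropy loss; dL/dx = (p - y) * w for sigmoid + BCE."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)
```

For a deep network the gradient would instead be obtained by backpropagation to the input, with the rest of the attack unchanged.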
For the occlusion experiments, we generate occluded samples in two ways: center occlusion, by zeroing out a center hole, and boundary occlusion, by zeroing out everything outside the hole. In Figure 4(a), we measure the top-1 error while varying the hole size. In both occlusion scenarios, Cutout and CutMix achieve significant robustness improvements, while Mixup barely improves robustness to occlusion. Interestingly, CutMix achieves robustness comparable to Cutout even though, unlike Cutout, it never observes occluded samples during training.
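The two occlusion types can be sketched as follows (an illustrative helper, assuming a square hole centered in the image):

```python
import numpy as np

def occlude(img, hole, mode="center"):
    """Zero out a centered square hole ('center'), or zero out everything
    outside it ('boundary')."""
    h, w = img.shape[:2]
    y1, x1 = (h - hole) // 2, (w - hole) // 2
    if mode == "center":
        out = img.copy()
        out[y1:y1 + hole, x1:x1 + hole] = 0          # erase the hole
    else:
        out = np.zeros_like(img)
        out[y1:y1 + hole, x1:x1 + hole] = img[y1:y1 + hole, x1:x1 + hole]
    return out
```

Sweeping `hole` from zero up to the full image size reproduces the evaluation axis of Figure 4(a).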
Finally, we evaluate the top-1 error on Mixup- and CutMix-style in-between samples. The probability of predicting neither of the two classes, as the combination ratio varies, is illustrated in Figure 4(b). We randomly select in-between sample pairs from the ImageNet validation set. In both in-between-class experiments, Mixup and CutMix improve performance, while the improvement from Cutout is almost negligible. Notably, CutMix even improves robustness to the unseen Mixup-style in-between samples.
|Model||Baseline||Mixup||Cutout||CutMix|
|Top-1 Acc (%)||8.2||24.4||11.5||31.0|
|Method||TNR at TPR 95%||AUROC||Detection Acc.|
|Baseline||26.3 (+0)||87.3 (+0)||82.0 (+0)|
|Mixup||11.8 (-14.5)||49.3 (-38.0)||60.9 (-21.0)|
|Cutout||18.8 (-7.5)||68.7 (-18.6)||71.3 (-10.7)|
|CutMix||69.0 (+42.7)||94.4 (+7.1)||89.1 (+7.1)|
Uncertainty: We measure the performance of the out-of-distribution (OOD) detector proposed in [12], which determines whether a sample is in- or out-of-distribution by thresholding a confidence score. We use PyramidNet-200 trained on CIFAR-100 with the same settings as in Section 4.1.2. In Table 12, we report the averaged OOD detection performance against seven out-of-distribution datasets from [12, 21], including TinyImageNet, LSUN, uniform noise, and Gaussian noise; more results are given in Appendix E. Note that Mixup and Cutout seriously impair the baseline performance; in other words, the Mixup and Cutout augmentations aggravate the over-confidence issue of the base network. Meanwhile, our proposed CutMix significantly outperforms the baseline.
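The baseline detector of [12] scores each sample by its maximum softmax probability and flags low-confidence samples as out-of-distribution; a minimal sketch (the threshold value here is hypothetical, since in practice it is swept to produce the TNR/AUROC metrics):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def ood_score(logits):
    """Maximum softmax probability: high for confident (in-distribution)
    predictions, low for uncertain ones."""
    return softmax(logits).max(axis=-1)

def detect_ood(logits, threshold=0.5):
    """Flag a sample as out-of-distribution when confidence < threshold."""
    return ood_score(logits) < threshold
```

An over-confident network assigns high maximum-softmax scores even to OOD inputs, which is exactly the failure mode the table above quantifies.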
5 Conclusion

In this work, we introduced CutMix for training CNNs with strong classification and localization ability. CutMix is simple, easy to implement, and has no computational overhead, yet is surprisingly effective on various tasks. On ImageNet classification, applying CutMix to ResNet-50 and ResNet-101 brings +2.08% and +1.70% top-1 accuracy improvements, respectively. On CIFAR classification, CutMix also significantly improves over the baseline and achieves state-of-the-art top-1 error. On weakly supervised object localization (WSOL), CutMix enhances localization accuracy, achieving localization performance comparable to state-of-the-art WSOL methods without any dedicated WSOL techniques. Furthermore, simply using a CutMix-ImageNet-pretrained model as the backbone for object detection and image captioning brings overall performance improvements. Finally, CutMix achieves improvements on robustness and uncertainty benchmarks compared with the other augmentation methods.
We are grateful to Clova AI members with valuable discussions, and to Ziad Al-Halah for proofreading the manuscript.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
-  T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
-  N. Dvornik, J. Mairal, and C. Schmid. Modeling visual context is key to augmenting object detection datasets. In Proceedings of the European Conference on Computer Vision (ECCV), pages 364–380, 2018.
-  D. Dwibedi, I. Misra, and M. Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1301–1310, 2017.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
-  R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
-  G. Ghiasi, T.-Y. Lin, and Q. V. Le. Dropblock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pages 10750–10760, 2018.
-  I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
-  H. Guo, Y. Mao, and R. Zhang. Mixup as locally linear out-of-manifold regularization. arXiv preprint arXiv:1809.02499, 2018.
-  D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In CVPR, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017.
-  J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, pages 9423–9433, 2018.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
-  G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
-  G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
-  H. Kim, M. Kim, D. Seo, J. Kim, H. Park, S. Park, H. Jo, K. Kim, Y. Yang, Y. Kim, N. Sung, and J. Ha. NSML: meet the mlaas platform with a real-world case study. CoRR, abs/1810.09957, 2018.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  K. Lee, K. Lee, H. Lee, and J. Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7167–7177, 2018.
-  S. Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
-  A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
-  D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018.
-  H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4293–4302, 2016.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
-  K. K. Singh and Y. J. Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 3544–3553. IEEE, 2017.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
-  C. Summers and M. J. Dinneen. Improved mixed-example data augmentation. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1262–1270. IEEE, 2019.
-  C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. In ICLR Workshop, 2016.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
-  Y. Tokozume, Y. Ushiku, and T. Harada. Between-class learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5486–5494, 2018.
-  V. Verma, A. Lamb, C. Beckham, A. Courville, I. Mitliagkis, and Y. Bengio. Manifold mixup: Encouraging meaningful on-manifold interpolation as a regularizer. arXiv preprint arXiv:1806.05236, 2018.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
-  C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
-  Y. Yamada, M. Iwamura, T. Akiba, and K. Kise. Shakedrop regularization for deep residual learning. arXiv preprint arXiv:1802.02375, 2018.
-  F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
-  H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
-  X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. S. Huang. Adversarial complementary learning for weakly supervised object localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1325–1334, 2018.
-  X. Zhang, Y. Wei, G. Kang, Y. Yang, and T. Huang. Self-produced guidance for weakly-supervised object localization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 597–613, 2018.
-  Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.
Appendix A CutMix Algorithm
We present a code-level description of the CutMix algorithm in Algorithm A1, where N, C, and K denote the minibatch size, the number of input image channels, and the number of classes, respectively. First, CutMix shuffles the minibatch input and target along the first axis of the tensors. Then the mixing ratio lambda and the cropping region (x1,x2,y1,y2) are sampled, and input is mixed with input_s by replacing the cropping region of input with the corresponding region of input_s. The target label is mixed by linear interpolation with the same ratio.
Note that CutMix takes only a few lines to implement, so it is a very practical algorithm with significant impact on a wide range of tasks.
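As a concrete sketch, a minimal NumPy version of this procedure might look as follows (function and variable names are illustrative, not the exact Algorithm A1):

```python
import numpy as np

def rand_bbox(W, H, lam, rng):
    # Sample a box whose area is roughly (1 - lam) of the image area.
    cut_rat = np.sqrt(1.0 - lam)
    cut_w, cut_h = int(W * cut_rat), int(H * cut_rat)
    cx, cy = rng.integers(W), rng.integers(H)
    x1 = np.clip(cx - cut_w // 2, 0, W)
    x2 = np.clip(cx + cut_w // 2, 0, W)
    y1 = np.clip(cy - cut_h // 2, 0, H)
    y2 = np.clip(cy + cut_h // 2, 0, H)
    return x1, y1, x2, y2

def cutmix(inputs, targets, num_classes, alpha=1.0, seed=None):
    """inputs: (N, C, H, W) float array; targets: (N,) int labels."""
    rng = np.random.default_rng(seed)
    N, C, H, W = inputs.shape
    perm = rng.permutation(N)        # shuffle the minibatch
    lam = rng.beta(alpha, alpha)     # mixing ratio
    x1, y1, x2, y2 = rand_bbox(W, H, lam, rng)
    mixed = inputs.copy()
    # Paste the cropped region of the shuffled batch onto the original.
    mixed[:, :, y1:y2, x1:x2] = inputs[perm][:, :, y1:y2, x1:x2]
    # Re-compute lambda from the exact pasted area (clipping may shrink it).
    lam = 1.0 - (x2 - x1) * (y2 - y1) / (W * H)
    onehot = np.eye(num_classes)
    mixed_targets = lam * onehot[targets] + (1.0 - lam) * onehot[targets[perm]]
    return mixed, mixed_targets
```

Each mixed label is a convex combination of the two one-hot labels, so every row still sums to one.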
Appendix B Weakly-supervised Object Localization
We describe the training and evaluation procedure in detail.
Network modification: Weakly-supervised object localization (WSOL) basically uses the same training strategy as image classification, starting from an ImageNet-pretrained model. Compared to the base network structure (ResNet-50), WSOL uses a larger spatial size for the final feature map than the original ResNet-50 produces. To enlarge the spatial size, we modify the base network's final residual block (layer4) to have no stride.
Since the network is modified and the target dataset may differ from ImageNet, the last fully-connected layer is randomly initialized, with a final output dimension of 200 for CUB200-2011 and 1,000 for ImageNet.
Input image transformation: For fair comparison, we used the same data augmentation strategy (apart from Mixup, Cutout, and CutMix) as the state-of-the-art WSOL methods [32, 47]. In training, the input image is resized and then randomly cropped to the network input size. In testing, the input image is resized and center-cropped to the same size, which is called the single-crop strategy.
Estimating bounding box: We utilize class activation mapping (CAM)  to estimate the bounding box of an object. We first compute the CAM of an image and then determine the foreground region by binarizing the CAM with a threshold: regions with intensity above the threshold are set to 1, and the rest to 0. The threshold is a fixed fraction of the maximum CAM intensity, kept the same across all our experiments. From the binarized foreground map, the tightest box covering the largest connected region is selected as the bounding box for WSOL.
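A minimal sketch of this binarize-and-box step (for brevity it boxes all foreground pixels rather than the largest connected region, and the threshold rate shown is illustrative):

```python
import numpy as np

def cam_to_bbox(cam, rate=0.15):
    """Binarize a CAM at rate * max intensity and return the tightest
    (x1, y1, x2, y2) box over the foreground; x2/y2 are exclusive."""
    fg = cam >= rate * cam.max()
    ys, xs = np.where(fg)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1
```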
Evaluation metric: To measure localization accuracy, we report top-1 localization accuracy (Loc), as used in the ImageNet localization challenge. A prediction counts as correct for top-1 localization accuracy only if the intersection-over-union (IoU) between the estimated bounding box and the ground-truth box exceeds 0.5 and, at the same time, the estimated class label is correct; otherwise, it is counted as wrong.
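The IoU criterion can be computed as follows for axis-aligned (x1, y1, x2, y2) boxes, using the standard 0.5 threshold (a generic sketch, not tied to any particular evaluation toolkit):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def top1_loc_correct(pred_label, true_label, pred_box, true_box):
    # Correct only if the class matches AND IoU exceeds 0.5.
    return pred_label == true_label and iou(pred_box, true_box) > 0.5
```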
b.1 CUB200-2011 dataset
The CUB-200-2011 dataset  contains over 11 K images of 200 bird categories. We used the SGD optimizer with step-wise learning-rate decay, setting the learning rate for the randomly initialized last fully-connected layer separately from that of the pretrained layers. Table A1 shows that our model, ResNet-50 + CutMix, achieves the best localization accuracy on CUB200-2011, outperforming the other state-of-the-art WSOL methods [50, 47, 48].
b.2 ImageNet dataset
ImageNet-1K  is a large-scale general-object dataset consisting of 1.3 M training samples and 50 K validation samples. We again used the SGD optimizer with step-wise learning-rate decay, setting the learning rate for the last fully-connected layer separately from that of the other layers. In Table A1, our model, ResNet-50 + CutMix, achieves localization accuracy comparable to the other state-of-the-art WSOL methods [50, 47, 48] on ImageNet-1K.
Table A1: Top-1 localization accuracy (%) on CUB200-2011 and ImageNet-1K.

| Method | CUB200-2011 Top-1 Loc (%) | ImageNet-1K Top-1 Loc (%) |
| --- | --- | --- |
| VGG16 + CAM* | - | 42.80 |
| VGG16 + ACoL* | 45.92 | 45.83 |
| InceptionV3 + SPG* | 46.64 | 48.60 |
| ResNet-50 + Mixup | 49.30 | 45.84 |
| ResNet-50 + Cutout | 52.78 | 46.69 |
| ResNet-50 + CutMix | 54.81 | 47.25 |
Appendix C Transfer Learning to Object Detection
We evaluate the models on the Pascal VOC 2007 detection benchmark  with 5 K test images over 20 object categories. For training, we use both VOC2007 and VOC2012 trainval (VOC07+12).
Finetuning on SSD (https://github.com/amdegroot/ssd.pytorch): The input image is resized to 300×300 (SSD300), and we used the training strategy of the original paper, including data augmentation, prior boxes, and extra layers. Since the backbone network is changed from VGG16 to ResNet-50, the feature extraction location, conv4_3 in VGG16, is changed to the output of layer2 of ResNet-50. The batch size, learning rate, and number of training iterations are fixed, with the learning rate decayed step-wise.
Finetuning on Faster-RCNN (https://github.com/jwyang/faster-rcnn.pytorch): Faster-RCNN has a fully-convolutional structure, so we only replace the backbone from VGG16 with ResNet-50. The batch size, learning rate, and number of training iterations are fixed, with the learning rate decayed step-wise.
Appendix D Transfer Learning to Image Captioning
Table A2: Image captioning results on MS-COCO.

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 + Mixup | 61.6 | 44.1 | 31.6 | 23.2 | 22.9 | 47.9 | 72.2 |
| ResNet-50 + Cutout | 63.0 | 45.3 | 32.6 | 24.0 | 22.6 | 48.2 | 74.1 |
| ResNet-50 + CutMix | 64.2 | 46.3 | 33.6 | 24.9 | 23.1 | 49.0 | 77.6 |
The MS-COCO dataset  contains trainval and test splits of captioned images. Starting from the base model NIC (https://github.com/stevehuanghe/image_captioning), the backbone is changed from GoogLeNet to ResNet-50. The batch size, learning rate, and number of training epochs are fixed across all runs, and the same beam size is used for all evaluations. Image captioning results with various metrics are shown in Table A2.
Appendix E Robustness and Uncertainty
In this section, we describe the details of the experimental setting and evaluation methods.
We evaluate the model robustness to adversarial perturbations, occlusion, and in-between samples using ImageNet-trained models. For the base models, we use the ResNet-50 architecture and follow the settings in Section 4.1.1. For comparison, we use ResNet-50 trained without any additional regularization or augmentation techniques, and ResNet-50 trained with Mixup, with Cutout, and with our proposed CutMix strategy.
Fast Gradient Sign Method (FGSM): We employ the Fast Gradient Sign Method (FGSM)  to generate adversarial samples. For a given image x, ground-truth label y, and noise size ε, FGSM generates an adversarial sample as
x_adv = x + ε · sign(∇_x L(x, y)),
where L denotes a loss function, for example, cross-entropy. In our experiments, we use a fixed noise scale ε.
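An illustrative sketch of the FGSM step, using a toy logistic model with an analytic gradient in place of a real network and an assumed noise scale of 8/255:

```python
import numpy as np

def fgsm(x, grad, eps):
    """x_adv = x + eps * sign(d loss / d x), clipped to the valid pixel range."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# Toy stand-in for a network: logistic model p = sigmoid(w . x) with
# cross-entropy loss for label y = 1, so d(loss)/dx = (p - y) * w.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.8, 0.5])
p = 1.0 / (1.0 + np.exp(-(w @ x)))
grad = (p - 1.0) * w
x_adv = fgsm(x, grad, eps=8 / 255)
```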
Occlusion: Given a hole size s, we make a hole of width and height s in the center of the image. For center-occluded samples we zero out the inside of the hole, and for boundary-occluded samples we zero out the outside. In our experiments, we measure the top-1 ImageNet validation accuracy of the models while varying the hole size from zero to the full image size.
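The two occlusion variants can be sketched as follows (a minimal NumPy version; "center" zeroes out the hole, "boundary" keeps only the hole):

```python
import numpy as np

def occlude(img, hole, mode="center"):
    """img: (C, H, W) array; hole: side length of the centered square."""
    C, H, W = img.shape
    y1, y2 = (H - hole) // 2, (H + hole) // 2
    x1, x2 = (W - hole) // 2, (W + hole) // 2
    out = img.copy()
    if mode == "center":            # zero out inside the hole
        out[:, y1:y2, x1:x2] = 0.0
    else:                           # "boundary": zero out outside the hole
        keep = out[:, y1:y2, x1:x2].copy()
        out[:] = 0.0
        out[:, y1:y2, x1:x2] = keep
    return out
```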
In-between class samples: To generate in-between class samples, we first sample pairs of images from the ImageNet validation set. To generate Mixup samples from a selected pair (x1, x2), we compute x = λ·x1 + (1−λ)·x2, and report top-1 accuracy on these samples while varying λ from 0 to 1. To generate CutMix in-between samples, we employ a center mask instead of the random mask, following the hole generation process used in the occlusion experiments, and evaluate top-1 accuracy on the CutMix samples while varying the hole size.
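The Mixup in-between samples are simple pixel-wise interpolations; a sketch:

```python
import numpy as np

def mixup_sample(x1, x2, lam):
    """Pixel-wise interpolation between two images with ratio lam in [0, 1]."""
    return lam * x1 + (1.0 - lam) * x2

# Sweep lam from 0 to 1 to produce in-between samples of a pair.
a, b = np.zeros((3, 4, 4)), np.ones((3, 4, 4))
samples = [mixup_sample(a, b, lam) for lam in np.linspace(0.0, 1.0, 5)]
```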
Table A3: Out-of-distribution (OOD) detection results. Numbers in parentheses denote the difference from Baseline.

TinyImageNet (crop) / TinyImageNet (resize)

| Method | TNR at TPR 95% | AUROC | Detection Acc. | TNR at TPR 95% | AUROC | Detection Acc. |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 43.0 (0.0) | 88.9 (0.0) | 81.3 (0.0) | 29.8 (0.0) | 84.2 (0.0) | 77.0 (0.0) |
| Mixup | 22.6 (-20.4) | 71.6 (-17.3) | 69.8 (-11.5) | 12.3 (-17.5) | 56.8 (-27.4) | 61.0 (-16.0) |
| Cutout | 30.5 (-12.5) | 85.6 (-3.3) | 79.0 (-2.3) | 22.0 (-7.8) | 82.8 (-1.4) | 77.1 (+0.1) |
| CutMix | 57.1 (+14.1) | 92.4 (+3.5) | 85.0 (+3.7) | 55.4 (+25.6) | 91.9 (+7.7) | 84.5 (+7.5) |

LSUN (crop) / LSUN (resize)

| Method | TNR at TPR 95% | AUROC | Detection Acc. | TNR at TPR 95% | AUROC | Detection Acc. |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 34.6 (0.0) | 86.5 (0.0) | 79.5 (0.0) | 34.3 (0.0) | 86.4 (0.0) | 79.0 (0.0) |
| Mixup | 22.9 (-11.7) | 76.3 (-10.2) | 72.3 (-7.2) | 13.0 (-21.3) | 59.0 (-27.4) | 61.8 (-17.2) |
| Cutout | 33.2 (-1.4) | 85.7 (-0.8) | 78.5 (-1.0) | 23.7 (-10.6) | 84.0 (-2.4) | 78.4 (-0.6) |
| CutMix | 47.6 (+13.0) | 90.3 (+3.8) | 82.8 (+3.3) | 62.8 (+28.5) | 93.7 (+7.3) | 86.7 (+7.7) |

iSUN

| Method | TNR at TPR 95% | AUROC | Detection Acc. |
| --- | --- | --- | --- |
| Baseline | 32.0 (0.0) | 85.1 (0.0) | 77.8 (0.0) |
| Mixup | 11.8 (-20.2) | 57.0 (-28.1) | 61.0 (-16.8) |
| Cutout | 22.2 (-9.8) | 82.8 (-2.3) | 76.8 (-1.0) |
| CutMix | 60.1 (+28.1) | 93.0 (+7.9) | 85.7 (+7.9) |

Uniform noise / Gaussian noise

| Method | TNR at TPR 95% | AUROC | Detection Acc. | TNR at TPR 95% | AUROC | Detection Acc. |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 0.0 (0.0) | 89.2 (0.0) | 89.2 (0.0) | 10.4 (0.0) | 90.7 (0.0) | 89.9 (0.0) |
| Mixup | 0.0 (0.0) | 0.8 (-88.4) | 50.0 (-39.2) | 0.0 (-10.4) | 23.4 (-67.3) | 50.5 (-39.4) |
| Cutout | 0.0 (0.0) | 35.6 (-53.6) | 59.1 (-30.1) | 0.0 (-10.4) | 24.3 (-66.4) | 50.0 (-39.9) |
| CutMix | 100.0 (+100.0) | 99.8 (+10.6) | 99.7 (+10.5) | 100.0 (+89.6) | 99.7 (+9.0) | 99.0 (+9.1) |
Deep neural networks are often overconfident in their predictions; for example, they produce high-confidence predictions even for random noise. One standard benchmark for evaluating this overconfidence is out-of-distribution (OOD) detection, which uses a threshold-based detector to solve the binary task of separating in-distribution from out-of-distribution samples based on the predictions of the given network. A number of recent works enhance the performance of the baseline detector [21, 20], but in this paper we follow only the baseline detector algorithm, without any input enhancement or temperature scaling.
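The baseline detector can be sketched as thresholding the maximum softmax probability; the score function and threshold below are illustrative:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def msp_score(logits):
    """Baseline OOD score: maximum softmax probability per sample."""
    return softmax(logits).max(axis=1)

def is_in_distribution(logits, threshold):
    """Flag samples whose confidence exceeds the threshold as in-distribution."""
    return msp_score(logits) > threshold
```

Overconfident models assign high maximum softmax probability even to OOD inputs, which is exactly what degrades this detector.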
Setup: We compare OOD detector performance using the CIFAR-100-trained models described in Section 4.1.2: PyramidNet-200 without any regularization method, and PyramidNet-200 trained with Mixup, with Cutout, and with our proposed CutMix.
Evaluation Metrics and Out-of-distributions: In this work, we follow the experimental setting used in [12, 21]. To measure the performance of the OOD detector, we report the true negative rate (TNR) at 95% true positive rate (TPR), the area under the receiver operating characteristic curve (AUROC) and detection accuracy of each OOD detector. We use seven datasets for out-of-distribution: TinyImageNet (crop), TinyImageNet (resize), LSUN  (crop), LSUN (resize), iSUN, Uniform noise and Gaussian noise.
Results: We report OOD detection performance on the seven OOD datasets in Table A3. Overall, CutMix outperforms the baseline, Mixup, and Cutout. Moreover, we find that even though Mixup and Cutout improve classification performance, they largely degrade the baseline detector performance. In particular, for Uniform noise and Gaussian noise, Mixup and Cutout seriously impair the baseline performance, while CutMix dramatically improves it. From these experiments, we observe that our proposed CutMix enhances OOD detection, whereas Mixup and Cutout produce more overconfident predictions on OOD samples than the baseline.