MaxUp: A Simple Way to Improve Generalization of Neural Network Training



We propose MaxUp, an embarrassingly simple, highly effective technique for improving the generalization performance of machine learning models, especially deep neural networks. The idea is to generate a set of augmented data with some random perturbations or transforms, and minimize the maximum, or worst-case, loss over the augmented data. By doing so, we implicitly introduce a smoothness or robustness regularization against the random perturbations, and hence improve the generalization performance. For example, in the case of Gaussian perturbation, MaxUp is asymptotically equivalent to using the gradient norm of the loss as a penalty to encourage smoothness. We test MaxUp on a range of tasks, including image classification, language modeling, and adversarial certification, on which MaxUp consistently outperforms the existing best baseline methods, without introducing substantial computational overhead. In particular, we improve the state-of-the-art top-1 accuracy of ImageNet classification without extra data to 85.80% (Fix-EfficientNet-B8, Table 2). Code will be released soon.


1 Introduction

A central theme of machine learning is to alleviate overfitting and improve the generalization performance on test data. This is often achieved by leveraging prior knowledge of the models and data of interest. For example, regularization-based methods introduce a penalty on the complexity of the model, which often amounts to enforcing certain smoothness properties. Data augmentation techniques, on the other hand, leverage invariance properties of the data (such as the shift and rotation invariance of images) to improve performance. Novel approaches that exploit such knowledge of the models and data hold the potential of substantially improving the performance of machine learning systems.

We propose MaxUp, a simple yet powerful training method to improve the generalization performance and alleviate overfitting. Different from standard methods that minimize the average risk on the observed data, MaxUp generates a set of random perturbations or transforms of each observed data point, and minimizes the average risk of the worst augmented data of each data point. This allows us to enforce robustness against the random perturbations and transforms, and hence improve the generalization performance. MaxUp can easily leverage arbitrary state-of-the-art data augmentation schemes (e.g. Zhang et al., 2018; DeVries and Taylor, 2017; Cubuk et al., 2019a), and substantially improves over them by minimizing the worst (instead of average) risks on the augmented data, without adding significant computational overhead.

Theoretically, in the case of Gaussian perturbation, we show that MaxUp effectively introduces a gradient-norm regularization term that serves to encourage smoothness of the loss function, which does not appear in standard data augmentation methods that minimize the average risk.

MaxUp can be viewed as a “lightweight” variant of adversarial training against adversarial input perturbations (e.g. Tramèr et al., 2018; Madry et al., 2017), but is mainly designed to improve the generalization on the clean data, instead of robustness on perturbed data (although MaxUp does also increase the adversarial robustness in Gaussian adversarial certification, as we show in our experiments in Section 4.4). In addition, compared with standard adversarial training methods such as projected gradient descent (PGD) (Madry et al., 2017), MaxUp is much simpler and computationally much faster, and can be easily adapted to increase various kinds of robustness defined by the corresponding data augmentation schemes.

We test MaxUp on three challenging tasks: image classification, language modeling, and certified defense against adversarial examples (Cohen et al., 2019). We find that MaxUp can leverage the different state-of-the-art data augmentation methods and boost their performance to achieve new state-of-the-art results on a range of tasks, datasets, and neural architectures. In particular, we set a new state-of-the-art result on ImageNet classification without extra data, improving on the best top-1 accuracy of Xie et al. (2019) and reaching 85.80% (Table 2). For the adversarial certification task, we find that MaxUp allows us to train more verifiably robust classifiers than prior art such as the PGD-based adversarial training proposed by Salman et al. (2019).

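Algorithm 1 MaxUp with Stochastic Gradient Descent
  Input: Dataset $\mathcal{D}_n = \{x_i\}_{i=1}^n$; transformation distribution $P(\cdot|x)$; number of augmented data $m$; initialization $\theta_0$; SGD parameters (batch size, step size $\eta$, etc).
  repeat
     Draw a mini-batch $\mathcal{M}$ from $\mathcal{D}_n$, and update $\theta$ via
     $$\theta \leftarrow \theta - \eta \, \mathbb{E}_{x \in \mathcal{M}} \Big[ \nabla_\theta \max_{i \in [m]} L(x'_i, \theta) \Big],$$
     where $\{x'_i\}_{i=1}^m$ are drawn i.i.d. from $P(\cdot|x)$ for each $x$ in the mini-batch $\mathcal{M}$. See Equation 3.
  until convergence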

2 Main Method

We start with introducing the main idea of MaxUp, and then discuss its effect of introducing smoothness regularization in Section 2.1.


Given a dataset $\mathcal{D}_n = \{x_i\}_{i=1}^n$, learning often reduces to a form of empirical risk minimization (ERM):

$$\min_\theta \; \mathbb{E}_{x \sim \mathcal{D}_n} \big[ L(x, \theta) \big], \tag{1}$$

where $\theta$ is a parameter of interest (e.g., the weights of a neural network), and $L(x, \theta)$ denotes the loss associated with data point $x$. A key issue of ERM is the risk of overfitting, especially when the information in the data is insufficient.


We propose MaxUp to alleviate overfitting. The idea is to generate a set of random augmented data and minimize the maximum loss over the augmented data.

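Formally, for each data point $x$ in the dataset $\mathcal{D}_n$, we generate a set of perturbed data points $\{x'_i\}_{i=1}^m$ that are similar to $x$, and estimate the parameter $\theta$ by minimizing the maximum loss over $\{x'_i\}_{i=1}^m$:

$$\text{MaxUp:} \quad \min_\theta \; \mathbb{E}_{x \sim \mathcal{D}_n} \Big[ \max_{i \in [m]} L(x'_i, \theta) \Big]. \tag{2}$$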

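This loss can be easily minimized with stochastic gradient descent (SGD). Note that the gradient of the maximum loss is simply the gradient of the worst copy, that is,

$$\nabla_\theta \Big( \max_{i \in [m]} L(x'_i, \theta) \Big) = \nabla_\theta L(x'_{i^*}, \theta), \tag{3}$$

where $i^* = \arg\max_{i \in [m]} L(x'_i, \theta)$. This yields a simple and practical algorithm shown in Algorithm 1.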
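The worst-copy update can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes a logistic-regression model with labels in {-1, +1} and isotropic Gaussian perturbations, and the names `maxup_step`, `logistic_grad`, `sigma`, and `lr` are hypothetical.

```python
import numpy as np

def logistic_grad(w, x, y):
    """Gradient of log(1 + exp(-y * <w, x>)) w.r.t. w, one row per example."""
    margin = y * (x @ w)
    coeff = -y / (1.0 + np.exp(margin))
    return coeff[:, None] * x

def maxup_step(w, x_batch, y_batch, m, sigma, lr, rng):
    """One MaxUp SGD step (assumed setup: logistic loss, Gaussian perturbations)."""
    n, d = x_batch.shape
    # m i.i.d. perturbed copies of every data point: shape (n, m, d)
    x_aug = x_batch[:, None, :] + sigma * rng.standard_normal((n, m, d))
    # Forward pass only: loss of every copy, shape (n, m)
    losses = np.logaddexp(0.0, -y_batch[:, None] * (x_aug @ w))
    # Keep the worst copy of each data point
    worst = losses.argmax(axis=1)
    x_worst = x_aug[np.arange(n), worst]
    # Backward pass on the worst copies only
    grad = logistic_grad(w, x_worst, y_batch).mean(axis=0)
    return w - lr * grad, losses.max(axis=1).mean()

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 5))
y = np.sign(x @ np.ones(5) + 0.1)
w = np.zeros(5)
for _ in range(100):
    w, worst_loss = maxup_step(w, x, y, m=4, sigma=0.1, lr=0.5, rng=rng)
```

With `m=1` and `sigma=0` the step reduces to plain SGD on the clean mini-batch, which is a convenient sanity check.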

In our work, we assume the augmented data $\{x'_i\}_{i=1}^m$ is i.i.d. generated from a distribution $P(\cdot|x)$. The distribution $P(\cdot|x)$ can be based on small perturbations around $x$, e.g., $P(\cdot|x) = \mathcal{N}(x, \sigma^2 I)$, the Gaussian distribution with mean $x$ and isotropic variance $\sigma^2$. It can also be constructed based on invariant data transformations that are widely used in the data augmentation literature, such as random crops, equalizing, rotations, and clips for images (see e.g. Cubuk et al., 2019a; DeVries and Taylor, 2017; Cubuk et al., 2019b).

2.1 MaxUp as a Smoothness Regularization

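We provide a theoretical interpretation of MaxUp as introducing a gradient-norm regularization to the original ERM objective to encourage smoothness. Here we consider the simple case of isotropic Gaussian perturbation, when $P(\cdot|x) = \mathcal{N}(x, \sigma^2 I)$. To simplify notation, we define

$$\tilde{L}_{P,m}(x, \theta) := \mathbb{E}_{\{x'_i\}_{i=1}^m \sim P(\cdot|x)^m} \Big[ \max_{i \in [m]} L(x'_i, \theta) \Big], \tag{4}$$

which represents the expected MaxUp risk of data point $x$ with $m$ augmented copies.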

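Theorem 1 (MaxUp as Gradient-Norm Regularization).

Consider $\tilde{L}_{P,m}(x, \theta)$ defined in (4) with $P(\cdot|x) = \mathcal{N}(x, \sigma^2 I)$. Assume $L(x, \theta)$ is second-order differentiable w.r.t. $x$. Then

$$\tilde{L}_{P,m}(x, \theta) = L(x, \theta) + c_{m,\sigma} \, \| \nabla_x L(x, \theta) \|_2 + O(\sigma^2),$$

where $c_{m,\sigma}$ is a constant and $c_{m,\sigma} = \Theta(\sigma \sqrt{\log m})$, where $\Theta(\cdot)$ denotes the big-Theta notation.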

Theorem 1 shows that the expected MaxUp risk can be viewed as introducing a Lipschitz-like regularization with the gradient norm $\| \nabla_x L(x, \theta) \|_2$, which encourages the smoothness of $L(x, \theta)$ w.r.t. the input $x$. The strength of the regularization is controlled by $c_{m,\sigma}$, which depends on the number of samples $m$ and the perturbation magnitude $\sigma$.
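The first-order approximation in Theorem 1 can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the loss is a hypothetical logistic loss of a fixed linear score, the regularization constant is estimated by Monte Carlo as the perturbation scale times the mean of the maximum of `m` standard normals, and all function names are made up for this example.

```python
import numpy as np

def toy_loss(x, w):
    """Hypothetical smooth loss: log(1 + exp(-<w, x>)); works on batches of x."""
    return np.logaddexp(0.0, -(x @ w))

def toy_grad_x(x, w):
    """Gradient of toy_loss w.r.t. a single input x."""
    return -w / (1.0 + np.exp(x @ w))

def expected_maxup_risk(x, w, m, sigma, trials, rng):
    """Monte Carlo estimate of E[max_i loss(x + z_i)] with z_i ~ N(0, sigma^2 I)."""
    z = sigma * rng.standard_normal((trials, m, x.size))
    return toy_loss(x + z, w).max(axis=1).mean()

def first_order_prediction(x, w, m, sigma, trials, rng):
    """loss(x) + c * ||grad_x loss(x)||, with c = sigma * E[max of m N(0,1)]."""
    c = sigma * rng.standard_normal((trials, m)).max(axis=1).mean()
    return toy_loss(x, w) + c * np.linalg.norm(toy_grad_x(x, w))

rng = np.random.default_rng(0)
x = np.array([0.3, 0.1, -0.2])
w = np.array([1.0, -2.0, 0.5])
for sigma in (0.01, 0.05):
    mc = expected_maxup_risk(x, w, m=8, sigma=sigma, trials=20000, rng=rng)
    pred = first_order_prediction(x, w, m=8, sigma=sigma, trials=20000, rng=rng)
    print(sigma, mc - toy_loss(x, w), pred - toy_loss(x, w))
```

For small perturbation scales the two printed gaps should be close, with the remaining difference shrinking quadratically in the scale.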


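Proof. Using Taylor expansion, we have

$$\tilde{L}_{P,m}(x, \theta) = \mathbb{E} \Big[ \max_{i \in [m]} \big( L(x, \theta) + \langle \nabla_x L(x, \theta), z_i \rangle + O(\sigma^2) \big) \Big] = L(x, \theta) + \mathbb{E} \Big[ \max_{i \in [m]} \langle \nabla_x L(x, \theta), z_i \rangle \Big] + O(\sigma^2),$$

where we assume $x'_i = x + z_i$, with $z_i$ following $\mathcal{N}(0, \sigma^2 I)$. The rest of the proof is due to Lemma 1 below. ∎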

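Lemma 1.

Let $g$ be a fixed vector in $\mathbb{R}^d$, and let $\{z_i\}_{i=1}^m$ be i.i.d. random variables from $\mathcal{N}(0, \sigma^2 I)$. We have

$$\mathbb{E} \Big[ \max_{i \in [m]} \langle g, z_i \rangle \Big] = c_{m,\sigma} \, \| g \|_2, \qquad c_{m,\sigma} = \Theta(\sigma \sqrt{\log m}).$$

Proof. Define $y_i = \langle g, z_i \rangle / (\sigma \| g \|_2)$. Then $\{y_i\}_{i=1}^m$ is i.i.d. from $\mathcal{N}(0, 1)$. Therefore, $\mathbb{E}[\max_{i \in [m]} \langle g, z_i \rangle] = \sigma \| g \|_2 \, \mathbb{E}[\max_{i \in [m]} y_i]$, where $\mathbb{E}[\max_{i \in [m]} y_i]$ is well known to be $\Theta(\sqrt{\log m})$. See e.g., Orabona and Pál (2015); Kamath (2015) for bounds related to $\mathbb{E}[\max_{i \in [m]} y_i]$. More specifically, explicit constants for the upper and lower bounds follow Kamath (2015). ∎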
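The scaling argument in the proof of Lemma 1 is easy to verify by simulation. The following sketch, with hypothetical helper names, estimates the expected maximum projection onto a fixed vector, divides it by the noise scale times the vector's norm, and compares the result with the expected maximum of standard normals and with the classical upper bound sqrt(2 log m).

```python
import numpy as np

def mean_max_projection(g, m, sigma, trials, rng):
    """Monte Carlo estimate of E[max_i <g, z_i>] with z_i ~ N(0, sigma^2 I)."""
    z = sigma * rng.standard_normal((trials, m, g.size))
    return (z @ g).max(axis=1).mean()

def mean_max_standard_normal(m, trials, rng):
    """Monte Carlo estimate of E[max_i y_i] with y_i ~ N(0, 1)."""
    return rng.standard_normal((trials, m)).max(axis=1).mean()

rng = np.random.default_rng(0)
g = np.array([3.0, -4.0])   # ||g||_2 = 5
sigma = 0.5
for m in (4, 10, 20):
    ratio = mean_max_projection(g, m, sigma, 20000, rng) / (sigma * np.linalg.norm(g))
    ref = mean_max_standard_normal(m, 20000, rng)
    print(m, ratio, ref, np.sqrt(2 * np.log(m)))
```

The ratio grows slowly with the number of samples, which is consistent with the logarithmic dependence stated in the lemma.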

3 Related Methods and Discussion

MaxUp is closely related to both data augmentation and adversarial training. It can be viewed as an adversarial variant of data augmentation, in that it minimizes the worst-case loss on the perturbed data, instead of an average loss like typical data augmentation methods. MaxUp can also be viewed as a “lightweight” variant of adversarial training, in that the maximum loss is calculated by simple random sampling, instead of more accurate gradient-based optimizers for finding the adversarial loss, such as projected gradient descent (PGD); MaxUp is much simpler and faster than PGD-based adversarial training, and is more suitable for our purpose of alleviating overfitting on clean data (instead of adversarial defense). We now elaborate on these connections in depth.

3.1 Data Augmentation

Data augmentation has been widely used in machine learning, especially on image data which admits a rich set of invariance transforms (e.g. translation, rotation, random cropping). Recent augmentation techniques, such as MixUp (Zhang et al., 2018), CutMix (Yun et al., 2019) and manifold MixUp (Verma et al., 2019) have been found highly useful in training deep neural networks, especially in achieving state-of-the-art results on important image classification benchmarks such as SVHN, CIFAR and ImageNet. More recently, more advanced methods have been developed to find the optimal data augmentation policies using reinforcement learning or adversarial generative network (e.g. Cubuk et al., 2019a, b; Zhang et al., 2020).

MaxUp can easily leverage these advanced data augmentation techniques to achieve good performance. The key difference, however, is that MaxUp in (2) minimizes the maximum loss on the augmented data, while typical data augmentation methods minimize the average loss, that is,

$$\min_\theta \; \mathbb{E}_{x \sim \mathcal{D}_n} \Big[ \frac{1}{m} \sum_{i=1}^m L(x'_i, \theta) \Big], \tag{5}$$

which we refer to as standard data augmentation throughout the paper. It turns out that (2) and (5) behave very differently as regularization mechanisms: unlike (2), (5) does not introduce the gradient-norm regularization, because the first-order term in the Taylor expansion is canceled out by the averaging in (5).

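Specifically, let $P(\cdot|x)$ be any distribution whose expectation is $x$ and whose variance is of order $\sigma^2$, and assume $L(x, \theta)$ is second-order differentiable w.r.t. $x$. Define the expected loss related to (5) on data point $x$:

$$\hat{L}_{P,m}(x, \theta) := \mathbb{E}_{\{x'_i\}_{i=1}^m \sim P(\cdot|x)^m} \Big[ \frac{1}{m} \sum_{i=1}^m L(x'_i, \theta) \Big]. \tag{6}$$

Then with a simple Taylor expansion, we have

$$\hat{L}_{P,m}(x, \theta) = L(x, \theta) + O(\sigma^2),$$

which misses the gradient-norm regularization term when compared with the MaxUp decomposition in Theorem 1.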
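The cancellation can be written out explicitly. The derivation below is a sketch under an assumption made here for concreteness: each augmented copy is $x'_i = x + z_i$ with $\mathbb{E}[z_i] = 0$ and $\mathrm{Cov}(z_i) = \sigma^2 I$ (the text itself only requires the expectation of the copies to be $x$).

```latex
\begin{aligned}
\mathbb{E}\Big[\frac{1}{m}\sum_{i=1}^m L(x + z_i, \theta)\Big]
&= \frac{1}{m}\sum_{i=1}^m \mathbb{E}\Big[ L(x,\theta)
   + \langle \nabla_x L(x,\theta), z_i \rangle
   + \tfrac{1}{2}\, z_i^\top \nabla_x^2 L(x,\theta)\, z_i \Big]
   + \text{(higher-order terms)} \\
&= L(x,\theta)
   + \big\langle \nabla_x L(x,\theta), \underbrace{\mathbb{E}[z_i]}_{=\,0} \big\rangle
   + \frac{\sigma^2}{2}\, \operatorname{tr}\!\big(\nabla_x^2 L(x,\theta)\big)
   + \text{(higher-order terms)} .
\end{aligned}
```

The linear term vanishes because the expectation passes through the average, whereas in the MaxUp risk the maximum sits between the expectation and the linear term, so $\mathbb{E}[\max_i \langle \nabla_x L, z_i \rangle]$ is strictly positive whenever the gradient is nonzero and $m > 1$.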

Note that the MaxUp update is computationally faster than solving (5) with the same $m$, because we only need to backpropagate on the worst augmented copy for each data point (see Equation 3), while solving (5) requires backpropagating on all the $m$ copies at each iteration.

3.2 Adversarial Training

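Adversarial training has been developed to defend against various adversarial attacks on the data inputs (Madry et al., 2017). It estimates $\theta$ by solving the following problem:

$$\min_\theta \; \mathbb{E}_{x \sim \mathcal{D}_n} \Big[ \max_{x' \in \mathcal{B}(x, r)} L(x', \theta) \Big], \tag{7}$$

where $\mathcal{B}(x, r)$ represents a ball centered at $x$ with radius $r$ under some metric (e.g. $\ell_0$, $\ell_1$, $\ell_2$, or $\ell_\infty$ distances). The inner maximization is often solved by running projected gradient descent (PGD) for a number of iterations.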

MaxUp in (2) can be roughly viewed as solving the inner adversarial maximization problem in (7) using a “mild”, or “lightweight” optimizer by randomly drawing $m$ points from $P(\cdot|x)$ and finding the best. Such mild adversarial optimization increases the robustness against the random perturbation it introduces, and hence enhances the generalization performance. Adversarial ideas have also been used to improve generalization in a series of recent works (e.g., Xie et al., 2019; Zhu et al., 2020).
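The contrast between the two inner optimizers can be illustrated on a toy problem. In the sketch below, everything is an assumption for illustration: the model is a fixed linear classifier with logistic loss, the random search samples uniformly on the sphere of radius `radius` (the paper samples from its augmentation distribution instead), and the PGD variant uses normalized gradient steps with projection onto the l2 ball.

```python
import numpy as np

def loss(w, x, y):
    """Logistic loss of a linear classifier; x may be one input or a batch."""
    return np.logaddexp(0.0, -y * (x @ w))

def grad_x(w, x, y):
    """Gradient of the loss w.r.t. a single input x."""
    return -y * w / (1.0 + np.exp(y * (x @ w)))

def inner_max_random(w, x, y, radius, m, rng):
    """Lightweight inner step: best of m random perturbations, forward passes only."""
    d = rng.standard_normal((m, x.size))
    d *= radius / np.linalg.norm(d, axis=1, keepdims=True)
    return loss(w, x + d, y).max()

def inner_max_pgd(w, x, y, radius, steps, step_size):
    """PGD ascent in the l2 ball: one gradient evaluation per step."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = grad_x(w, x + delta, y)
        delta = delta + step_size * g / (np.linalg.norm(g) + 1e-12)
        norm = np.linalg.norm(delta)
        if norm > radius:
            delta *= radius / norm   # project back onto the ball
    return loss(w, x + delta, y)

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.1, -0.2])
clean = loss(w, x, 1.0)
rand = inner_max_random(w, x, 1.0, radius=0.5, m=64, rng=rng)
pgd = inner_max_pgd(w, x, 1.0, radius=0.5, steps=10, step_size=0.1)
print(clean, rand, pgd)
```

On this toy problem the random search lands between the clean loss and the PGD loss, which matches the description of MaxUp as a milder adversary that needs no extra gradient computations.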

Different from our method, typical adversarial training methods, especially those based on PGD (Madry et al., 2017), tend to solve the adversarial optimization much more aggressively to achieve higher robustness, but at the cost of sacrificing the accuracy on clean data. A clear trade-off has been shown between the accuracy of a classifier on clean data and its robustness against adversarial attacks (see e.g., Tsipras et al., 2019; Zhang et al., 2019; Yin et al., 2019; Schmidt et al., 2018). By using a mild adversarial optimizer, MaxUp strikes a better balance between the accuracy on clean data and adversarial robustness.

Besides, MaxUp is much more computationally efficient than PGD-based adversarial training, because it does not introduce additional back-propagation steps as PGD does. In practice, MaxUp can be equipped with various complex data augmentation methods (in which case $P(\cdot|x)$ can be a discrete distribution), while PGD-based adversarial training mostly focuses on perturbations in $\ell_p$ balls.

3.3 Online Hard Example Mining

Online hard example mining (OHEM) (Shrivastava et al., 2016) is a training method originally developed for region-based object detection, which improves the performance of neural networks by picking the hardest examples within mini-batches of stochastic gradient descent (SGD). It can be viewed as running SGD for minimizing the following expected loss

$$\min_\theta \; \mathbb{E}_{\mathcal{M}} \Big[ \max_{x \in \mathcal{M}} L(x, \theta) \Big],$$

which amounts to randomly picking a mini-batch $\mathcal{M}$ at each iteration and minimizing the loss of the hardest example within $\mathcal{M}$. By doing so, OHEM can focus more on the hard examples and hence improves the performance on borderline cases. This makes OHEM particularly useful for class-imbalanced tasks, e.g. object detection (Shrivastava et al., 2016) and person re-identification (Luo et al., 2019).

Unlike in MaxUp, the hardest examples in OHEM are selected within mini-batches consisting of independently selected examples, with no special correlation or similarity. Mathematically, it can be viewed as reweighting the data distribution to emphasize harder instances. This is substantially different from MaxUp, which is designed to enforce robustness against existing random data augmentation or perturbation schemes.
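The difference in what is being maximized over can be made concrete with a small sketch. The function names and the example loss values are hypothetical; the point is only the axis along which the maximum is taken.

```python
import numpy as np

def ohem_select(batch_losses):
    """OHEM: index of the hardest example among unrelated examples in a mini-batch."""
    return int(np.argmax(batch_losses))

def ohem_objective(batch_losses):
    """OHEM objective for one mini-batch: loss of its hardest example."""
    return float(np.max(batch_losses))

def maxup_select(aug_losses):
    """MaxUp: index of the hardest augmented copy for every example.

    aug_losses has shape (batch_size, m); row k holds the losses of the
    m augmented copies of example k.
    """
    return np.argmax(aug_losses, axis=1)

def maxup_objective(aug_losses):
    """MaxUp objective for one mini-batch: mean over examples of the worst copy."""
    return float(np.max(aug_losses, axis=1).mean())

# Two examples, three augmented copies each (made-up loss values)
aug_losses = np.array([[0.2, 0.9, 0.4],
                       [1.5, 0.3, 0.7]])
print(maxup_select(aug_losses))      # one worst copy per example
print(ohem_select(np.array([0.2, 1.5, 0.7])))  # one hardest example per batch
```

Every example still contributes to the MaxUp gradient through its worst copy, whereas OHEM discards all but the hardest example of the batch.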

4 Experiments

We test our method using both image classification and language modeling for which a variety of strong regularization techniques and data augmentation methods have been proposed. We show that MaxUp can outperform all of these methods on the most challenging datasets (e.g. ImageNet, Penn Treebank, and Wikitext-2) and state-of-the-art models (e.g. ResNet, EfficientNet, AWD-LSTM). In addition, we apply our method to adversarial certification via Gaussian smoothing (Cohen et al., 2019), for which we find that MaxUp can outperform both the augmented data baseline and PGD-based adversarial training baseline.

For all the tasks, if training from scratch, we first train the model with standard data augmentation for 5 epochs and then switch to MaxUp.

Time and Memory Cost

MaxUp only slightly increases the time and memory cost compared with standard training. During MaxUp, we only need to find the worst instance out of the $m$ augmented copies through forward-propagation, and then only back-propagate on the worst instance. Therefore, the additional cost of MaxUp over standard training is the extra forward-propagation on the augmented copies, which introduces no significant overhead in either memory or time.
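The cost argument amounts to simple pass counting, sketched below. The counts are idealized and the function and method names are hypothetical; a practical implementation may need one additional forward pass to rebuild the computation graph of the selected copy.

```python
def passes_per_example(method, m):
    """Idealized forward/backward passes per data point and SGD step."""
    if method == "standard":      # one augmented copy, average risk
        return {"forward": 1, "backward": 1}
    if method == "average_m":     # objective (5): average over m copies
        return {"forward": m, "backward": m}
    if method == "maxup":         # objective (2): worst of m copies
        return {"forward": m, "backward": 1}
    raise ValueError(f"unknown method: {method}")

for method in ("standard", "average_m", "maxup"):
    print(method, passes_per_example(method, m=4))
```

The extra forward passes need no stored activations for back-propagation, which is why the overhead in memory stays modest.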

4.1 ImageNet

We evaluate MaxUp on ILSVRC2012, a subset of ImageNet classification dataset (Deng et al., 2009). This dataset contains around 1.3 million training images and 50,000 validation images. We follow the standard data processing pipeline including scale and aspect ratio distortions, random crops, and horizontal flips in training. During the evaluation, we only use the single-crop setting.

| Method | Top-1 accuracy (%) | Top-5 accuracy (%) |
|---|---|---|
| Vanilla (He et al., 2016a) | 76.3 | - |
| Dropout (Srivastava et al., 2014) | 76.8 | 93.4 |
| DropPath (Larsson et al., 2017) | 77.1 | 93.5 |
| Manifold Mixup (Verma et al., 2019) | 77.5 | 93.8 |
| AutoAugment (Cubuk et al., 2019a) | 77.6 | 93.8 |
| Mixup (Zhang et al., 2018) | 77.9 | 93.9 |
| DropBlock (Ghiasi et al., 2018) | 78.3 | 94.1 |
| CutMix (Yun et al., 2019) | 78.6 | 94.0 |
| MaxUp+CutMix | 78.9 | 94.2 |

Table 1: Summary of top-1 and top-5 accuracies on the validation set of ImageNet for ResNet-50.
| Model | Model Size | FLOPs | +CutMix (%) | +MaxUp+CutMix (%) |
|---|---|---|---|---|
| ResNet-101 | 44.55M | 7.85G | 79.83 | 80.26 |
| ProxylessNet-CPU | 7.12M | 481M | 75.32 | 75.65 |
| ProxylessNet-GPU | 4.36M | 470M | 75.08 | 75.42 |
| ProxylessNet-Mobile | 6.86M | 603M | 76.71 | 77.17 |
| EfficientNet-B7 | 66.35M | 38.20G | 85.22 | 85.45 |
| Fix-EfficientNet-B8 | 87.42M | 101.79G | 85.57 | 85.80 |

Table 2: Top-1 accuracies of different models on the validation set of ImageNet 2012. For EfficientNet-B7 and Fix-EfficientNet-B8, MaxUp is applied to the pre-trained model and trained for 5 epochs.

Implementation Details

We test MaxUp with $P(\cdot|x)$ defined by the CutMix data augmentation technique (Yun et al., 2019) (referred to as MaxUp+CutMix). CutMix randomly cuts and pastes patches among training images, while the ground truth labels are also mixed proportionally to the area of the patches. MaxUp+CutMix applies CutMix to one image $m$ times (cutting different randomly sampled patches), and selects the worst case for backpropagation.
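A rough NumPy sketch of this procedure follows. It is not the authors' code: the box sampling follows the usual CutMix recipe (mixing ratio drawn from a Beta(1, 1) distribution, box side proportional to the square root of the pasted share), the partner image is drawn at random from a hypothetical `partners` list, and `loss_fn` stands in for a forward pass of the network.

```python
import numpy as np

def cutmix(img_a, label_a, img_b, label_b, rng):
    """Paste a random box from img_b into img_a; mix one-hot labels by pasted area."""
    h, w = img_a.shape[:2]
    lam = rng.beta(1.0, 1.0)                       # target share kept from img_a
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(h), rng.integers(w)      # box center
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed = img_a.copy()
    mixed[y0:y1, x0:x1] = img_b[y0:y1, x0:x1]
    share_b = (y1 - y0) * (x1 - x0) / (h * w)      # actual pasted share after clipping
    return mixed, (1.0 - share_b) * label_a + share_b * label_b

def maxup_cutmix(img, label, partners, m, loss_fn, rng):
    """Draw m CutMix copies of one image; return (loss, image, label) of the worst."""
    worst = None
    for _ in range(m):
        p_img, p_label = partners[rng.integers(len(partners))]
        x, y = cutmix(img, label, p_img, p_label, rng)
        value = loss_fn(x, y)                      # forward pass only
        if worst is None or value > worst[0]:
            worst = (value, x, y)
    return worst

rng = np.random.default_rng(0)
img = np.zeros((32, 32, 3))
other = np.ones((32, 32, 3))
lab, other_lab = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# Placeholder loss for demonstration: mean pixel value of the mixed image
loss, x_worst, y_worst = maxup_cutmix(img, lab, [(other, other_lab)], 4,
                                      lambda x, y: float(x.mean()), rng)
```

Only `x_worst` and `y_worst` would then be used for the backward pass.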

We test our method on ResNet-50, ResNet-101 (He et al., 2016a), as well as recent energy-efficient architectures, including ProxylessNet (Cai et al., 2019) and EfficientNet (Tan and Le, 2019). We resize the images to and for EfficientNet-B7 and EfficientNet-B8, respectively (Tan and Le, 2019), for which we process the images with the data processing pipelines proposed by Touvron et al. (2019). For the other models, the input image size is . To save computation resources, we only fine-tune the pre-trained models with MaxUp for a few epochs. We set for MaxUp in the ImageNet-2012 experiments unless indicated otherwise. This means that we optimize the worst case in augmented samples for each image.

For ResNet-50, ResNet-101 and ProxylessNets, we train the models for 20 epochs with learning rate and batch size on 4 GPUs. For EfficientNet, we fix the parameters in the batch normalization layers and train the other parameters with learning rate and batch size 1000 for 5 epochs.

As shown in Tables 1 and 2, for ResNet-50 and ResNet-101, we achieve the best results among all the data augmentation methods. For EfficientNet-B8, we further improve the state-of-the-art result on ImageNet with no extra data.

ResNet-50 on ImageNet

Table 1 compares a number of state-of-the-art regularization techniques with MaxUp+CutMix on ImageNet with ResNet-50.¹ We can see that MaxUp+CutMix achieves better performance than all the strong data augmentation and regularization baselines. From Table 1, we see that CutMix gives the best top-1 accuracy (78.6%) among all the baselines, and our method further improves it to 78.9%. DropBlock outperforms all the other baselines in terms of top-5 accuracy (94.1%), but by augmenting CutMix with MaxUp, we improve on DropBlock and reach 94.2%.

More Results on Different Architectures

Table 2 shows the results on ImageNet for ResNet-101, ProxylessNet-CPU/GPU/Mobile (Cai et al., 2019) and EfficientNet. We can see that MaxUp consistently improves the results in all these cases. On ResNet-101, it improves the 79.83% baseline to 80.26%. On ProxylessNet-CPU and ProxylessNet-GPU, MaxUp enhances the 75.32% and 75.08% top-1 accuracy to 75.65% and 75.42%, respectively. On ProxylessNet-Mobile, we improve the 76.71% top-1 accuracy to 77.17%.

For EfficientNet-B7, CutMix improves on the original top-1 accuracy of Tan and Le (2019), reaching 85.22%. MaxUp further improves the top-1 accuracy to 85.45%. On Fix-EfficientNet-B8, MaxUp obtains the state-of-the-art 85.80% top-1 accuracy. The previous state-of-the-art top-1 accuracy was achieved by EfficientNet-L2.

4.2 CIFAR-10 and CIFAR-100

We test MaxUp equipped with Cutout (DeVries and Taylor, 2017) on CIFAR-10 and CIFAR-100, and denote it by MaxUp+Cutout. We apply our method to several neural architectures, including ResNet-110 (He et al., 2016a), PreAct-ResNet-110 (He et al., 2016b) and WideResNet-28-10 (Zagoruyko and Komodakis, 2016). We set $m = 10$ for WideResNet and $m = 4$ for the other models. We use the public code² and keep its hyper-parameters.

Implementation Details

For CIFAR-10 and CIFAR-100, we use the standard data processing pipeline (mirror + crop) and train the model for 200 epochs. All the results reported in this section are averaged over five runs.

We train the models for 200 epochs on the training set with 256 examples per mini-batch, and evaluate the trained models on the test set. The learning rate starts at 0.1 and is divided by 10 after 100 and 150 epochs for ResNet-110 and PreAct-ResNet-110. For WideResNet-28-10, we follow the settings in the original paper (Zagoruyko and Komodakis, 2016), where the learning rate is divided by 10 after 60, 120 and 180 epochs. Weight decay is set to for all the models, and we do not use dropout.

| Model | + Cutout | + MaxUp+Cutout |
|---|---|---|
| ResNet-110 | 94.84 ± 0.11 | 95.41 ± 0.08 |
| PreAct-ResNet-110 | 95.02 ± 0.15 | 95.52 ± 0.06 |
| WideResNet-28-10 | 96.92 ± 0.16 | 97.18 ± 0.06 |

Table 3: Test accuracy (%) on CIFAR-10 for different architectures.

| Model | + Cutout | + MaxUp+Cutout |
|---|---|---|
| ResNet-110 | 73.64 ± 0.15 | 75.26 ± 0.21 |
| PreAct-ResNet-110 | 74.37 ± 0.13 | 75.63 ± 0.26 |
| WideResNet-28-10 | 81.59 ± 0.27 | 82.48 ± 0.23 |

Table 4: Test accuracy (%) on CIFAR-100 for different architectures.


The results on CIFAR-10 and CIFAR-100 are summarized in Table 3 and Table 4. We can see that the models trained using MaxUp+Cutout significantly outperform the standard Cutout for all the cases.

On CIFAR-10, MaxUp improves the standard Cutout baseline from 94.84% to 95.41% on ResNet-110. It also improves the accuracy from 95.02% to 95.52% on PreAct-ResNet-110.

On CIFAR-100, MaxUp obtains improvements by a large margin. On ResNet-110 and PreAct-ResNet-110, MaxUp improves the performance of Cutout from 73.64% and 74.37% to 75.26% and 75.63%, respectively. MaxUp+Cutout also improves the standard Cutout from 81.59% to 82.48% on WideResNet-28-10 on CIFAR-100.

Ablation Study

We test MaxUp with different sample sizes $m$ and investigate the impact on the performance of ResNet-110 (a relatively small model) and WideResNet-28-10 (a larger model).

Table 5 shows the result when we vary the sample size $m$ in $\{1, 4, 10, 20\}$. Note that MaxUp reduces to the naïve data augmentation method when $m = 1$. As shown in Table 5, MaxUp with every $m > 1$ improves on standard augmentation ($m = 1$). Setting $m = 4$ or $m = 10$ achieves the best performance on ResNet-110, and $m = 10$ obtains the best performance on WideResNet-28-10. We can see that the results are not sensitive once $m$ is in a proper range, and it is easy to outperform standard data augmentation without much tuning of $m$. Furthermore, we suggest using a large $m$ for large models, and a small $m$ for relatively small models.

| $m$ | ResNet-110 | WideResNet-28-10 |
|---|---|---|
| 1 | 73.64 ± 0.15 | 81.59 ± 0.27 |
| 4 | 75.26 ± 0.21 | 81.82 ± 0.22 |
| 10 | 75.19 ± 0.13 | 82.48 ± 0.23 |
| 20 | 74.37 ± 0.18 | 82.43 ± 0.24 |

Table 5: Test accuracy (%) on CIFAR-100 with ResNet-110 and WideResNet-28-10, when the sample size $m$ varies.
| Method | Params | Valid | Test |
|---|---|---|---|
| NAS-RNN (Zoph and Le, 2017) | 54M | - | 62.40 |
| AWD-LSTM (Merity et al., 2018) | 24M | 58.50 | 56.50 |
| AWD-LSTM + FRAGE (Gong et al., 2018) | 24M | 58.10 | 56.10 |
| AWD-LSTM + MoS (Yang et al., 2018) | 22M | 56.54 | 54.44 |
| w/o dynamic evaluation | | | |
| ADV-AWD-LSTM (Wang et al., 2019) | 24M | 57.15 | 55.01 |
| ADV-AWD-LSTM + MaxUp | 24M | 56.25 | 54.27 |
| + dynamic evaluation (Krause et al., 2018) | | | |
| ADV-AWD-LSTM (Wang et al., 2019) | 24M | 51.60 | 51.10 |
| ADV-AWD-LSTM + MaxUp | 24M | 50.83 | 50.29 |

Table 6: Perplexities on the validation and test sets of the Penn Treebank dataset. Smaller perplexities indicate better language modeling performance. Params denotes the number of model parameters.

| Method | Params | Valid | Test |
|---|---|---|---|
| AWD-LSTM (Merity et al., 2018) | 33M | 68.60 | 65.80 |
| AWD-LSTM + FRAGE (Gong et al., 2018) | 33M | 66.50 | 63.40 |
| AWD-LSTM + MoS (Yang et al., 2018) | 35M | 63.88 | 61.45 |
| w/o dynamic evaluation | | | |
| ADV-AWD-LSTM (Wang et al., 2019) | 33M | 63.68 | 61.34 |
| ADV-AWD-LSTM + MaxUp | 33M | 62.48 | 60.19 |
| + dynamic evaluation (Krause et al., 2018) | | | |
| ADV-AWD-LSTM (Wang et al., 2019) | 33M | 42.36 | 40.53 |
| ADV-AWD-LSTM + MaxUp | 33M | 41.29 | 39.61 |

Table 7: Perplexities on the validation and test sets of the WikiText-2 dataset. Smaller perplexities indicate better language modeling performance. Params denotes the number of model parameters.

4.3 Language Modeling

For language modeling, we test MaxUp on two benchmark datasets: Penn Treebank (PTB) and Wikitext-2 (WT2). We use the code provided by Wang et al. (2019) as our baseline³, which stacks a three-layer LSTM and implements a bag of regularization and optimization tricks for neural language modeling proposed by Merity et al. (2018), such as weight tying, word embedding dropout and Averaged SGD.

For this task, we apply MaxUp using word embedding dropout (Merity et al., 2018) as the random data augmentation method. Word embedding dropout applies dropout to the embedding matrix at the word level, where the dropout mask is broadcast across all the embedding dimensions of each word vector. For the selected words, their embedding vectors are set to zero vectors. The other word embeddings in the vocabulary are scaled by $1/(1-p)$, where $p$ is the probability of embedding dropout.

As the word embedding layer serves as the first layer in a neural language model, we apply MaxUp in this layer. We run the forward pass $m$ times and select the worst case for backpropagation for each given sentence. In this section, we set a small $m$ since the models are already well-regularized by other regularization techniques.
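A minimal sketch of this procedure is given below. It assumes the usual word-level embedding dropout with inverse-probability rescaling of the kept rows, and it uses a hypothetical `loss_fn` in place of the language model's forward pass; none of the names come from the original code base.

```python
import numpy as np

def embedding_dropout(embed, p, rng):
    """Zero whole rows (words) of the embedding matrix with probability p.

    Kept rows are rescaled by 1 / (1 - p), an assumption following the
    common inverted-dropout convention.
    """
    keep = rng.random(embed.shape[0]) >= p
    return embed * keep[:, None] / (1.0 - p)

def maxup_embedding_dropout(embed, token_ids, p, m, loss_fn, rng):
    """Run m dropout draws for one sentence and keep the one with the largest loss."""
    worst_loss, worst_inputs = -np.inf, None
    for _ in range(m):
        inputs = embedding_dropout(embed, p, rng)[token_ids]   # (seq_len, dim)
        value = loss_fn(inputs)                                # forward pass only
        if value > worst_loss:
            worst_loss, worst_inputs = value, inputs
    return worst_loss, worst_inputs

rng = np.random.default_rng(0)
embed = rng.standard_normal((100, 8))      # toy vocabulary of 100 words
sentence = np.array([3, 17, 42, 3])
# Placeholder loss for demonstration only
loss, inputs = maxup_embedding_dropout(embed, sentence, p=0.1, m=2,
                                       loss_fn=lambda x: float((x ** 2).sum()),
                                       rng=rng)
```

Because the mask is drawn per vocabulary row, repeated tokens in a sentence (token 3 above) are dropped or kept together within one draw.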

Implementation Details

The PTB corpus (Marcus et al., 1993) is a standard dataset for benchmarking language models. It consists of 923k training, 73k validation and 82k test words. We use the processed version provided by Mikolov et al. (2010) that is widely used for PTB.

The WT2 dataset is introduced in Merity et al. (2018) as an alternative to PTB. It contains pre-processed Wikipedia articles, and the training set contains 2 million words.

The training procedure can be decoupled into two stages: 1) optimizing the model with SGD and averaged SGD (ASGD); 2) restarting ASGD for fine-tuning twice. We apply MaxUp in both stages, and report the perplexity scores at the end of the second stage. We also report the perplexity scores with a recently proposed post-processing method, dynamic evaluation (Krause et al., 2018), applied after the training process.

Results on PTB and WT2

The results on the PTB and WT2 corpora are reported in Table 6 and Table 7, respectively. We calculate the perplexity on the validation and test sets for each method to evaluate its performance. We can see that MaxUp outperforms the state-of-the-art results achieved by Frage (Gong et al., 2018) and Mixture of SoftMax (Yang et al., 2018). We further compare MaxUp to the result of Wang et al. (2019) based on AWD-LSTM (Merity et al., 2018) at two checkpoints, with or without dynamic evaluation (Krause et al., 2018). On PTB, we improve the baseline test perplexity from 55.01/51.10 to 54.27/50.29 at these two checkpoints. On WT2, we improve the baseline test perplexity from 61.34/40.53 to 60.19/39.61 at these two checkpoints. Results on the validation set are reported in both Table 6 and Table 7 to show that the improvement cannot be achieved by simple hyper-parameter tuning on the test set.

4.4 Adversarial Certification

Cohen et al. (2019) (%) 60 43 34 23 17 14 12 10 8 6 4
Salman et al. (2019) (%) 74 57 48 38 33 29 25 19 17 14 12
Ours (%) 74 57 49 40 35 31 27 22 19 17 15
Table 8: Certified accuracy on CIFAR-10 of the best classifiers by different methods, evaluated against $\ell_2$ attacks of different radii.

Modern image classifiers are known to be sensitive to small, adversarially chosen perturbations of their inputs (Goodfellow et al., 2014). Therefore, for making high-stakes decisions, it is of critical importance to develop methods with certified robustness, which provide (high-probability) provable guarantees on the correctness of the prediction subject to arbitrary attacks within a certain perturbation ball.

Recently, Cohen et al. (2019) proposed to construct certifiably robust classifiers against $\ell_2$ attacks by introducing Gaussian smoothing on the inputs, which is shown to outperform all the previous $\ell_2$-robust classifiers on CIFAR-10. There have been two major methods for training such smoothed classifiers: Cohen et al. (2019) train the classifier with a Gaussian data augmentation technique, while Salman et al. (2019) improve the original Gaussian data augmentation by using PGD (projected gradient descent) adversarial training, in which PGD is used to find a local maximum within a given perturbation ball.

In our experiment, we use MaxUp with Gaussian perturbation (referred to as MaxUp+Gauss) to train better smoothed classifiers than the methods by Cohen et al. (2019) and Salman et al. (2019). Just as MaxUp improves upon standard data augmentation, it is natural to expect that MaxUp+Gauss can learn more robust classifiers than the standard Gaussian data augmentation method in Cohen et al. (2019).

Training Details

We apply MaxUp to Gaussian augmented data on CIFAR-10 with ResNet-110 (He et al., 2016a). We follow the training pipelines described in Salman et al. (2019). We set a batch size of 256 and an initial learning rate of 0.1, which drops by a factor of 10 every 50 epochs, and train the models for 150 epochs.


After training the smoothed classifiers, we evaluate the certified accuracy of different models under different perturbation sets. Given an input image $x$ and a perturbation region $\mathcal{B}$, the smoothed classifier is called certifiably correct if its prediction is correct and is provably unchanged for every perturbed input in $\mathcal{B}$. The certified accuracy is the percentage of images that are certifiably correct. Following Salman et al. (2019), we calculate the certified accuracy of all the classifiers for various radii and report the best result over all of the classifiers. We use the code provided by Cohen et al. (2019) to calculate certified accuracy.⁴
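As a rough illustration of how certified accuracy at a given radius can be computed, the sketch below uses the randomized-smoothing bound of Cohen et al. (2019), under which a smoothed classifier with noise level `sigma` and a lower confidence bound `p_lower > 0.5` on the probability of its top class is certified within l2 radius `sigma * Phi^{-1}(p_lower)`. The record format and function names are assumptions for this example, not the interface of the released code.

```python
from statistics import NormalDist

def certified_radius(p_lower, sigma):
    """l2 radius certified by randomized smoothing; None means the classifier abstains."""
    if p_lower <= 0.5:
        return None
    p = min(p_lower, 1.0 - 1e-12)          # inv_cdf is undefined at exactly 1
    return sigma * NormalDist().inv_cdf(p)

def certified_accuracy(records, sigma, radius):
    """Fraction of test points that are correct and certified at the given radius.

    records: list of (is_correct, p_lower) pairs, one per test image.
    """
    hits = 0
    for is_correct, p_lower in records:
        r = certified_radius(p_lower, sigma)
        if is_correct and r is not None and r >= radius:
            hits += 1
    return hits / len(records)

# Made-up records for four test images
records = [(True, 0.99), (True, 0.6), (False, 0.99), (True, 0.4)]
for radius in (0.1, 0.5):
    print(radius, certified_accuracy(records, sigma=0.5, radius=radius))
```

Certified accuracy is non-increasing in the radius, which is the pattern visible across the columns of Table 8.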

Following Salman et al. (2019), we select the best hyperparameters with grid search. The only two hyperparameters of our MaxUp+Gauss are the sample size $m$ and the variance $\sigma^2$ of the Gaussian perturbation, both of which are selected by this grid search. In comparison, Salman et al. (2019) requires searching a larger number of hyperparameters, including the number of steps of the PGD, the number of noise samples, the maximum perturbation, and the variance of Gaussian data augmentation during training and testing. Overall, Salman et al. (2019) requires training and evaluating over 150 models for hyperparameter tuning, while MaxUp+Gauss requires only 20 models.


We show the certified accuracies on CIFAR-10 in Table 8 under $\ell_2$ attacks for each radius. We find that MaxUp outperforms Cohen et al. (2019) for all the radii by a large margin. For example, MaxUp improves the certified accuracy at the smallest radius from 60% to 74% and at the largest radius from 4% to 15%. MaxUp also matches or outperforms the PGD-based adversarial training of Salman et al. (2019) for all the radii, boosting the accuracy from 14% to 17% and from 12% to 15% at the two largest radii.

In summary, MaxUp clearly outperforms both Cohen et al. (2019) and Salman et al. (2019). MaxUp is also much faster and requires less hyperparameter tuning than Salman et al. (2019). Although the PGD-based method of Salman et al. (2019) was designed to outperform the original method by Cohen et al. (2019), MaxUp+Gauss further improves upon Salman et al. (2019), likely because MaxUp with Gaussian perturbation is more compatible with the Gaussian smoothing based certification of Cohen et al. (2019) than PGD adversarial optimization.

5 Conclusion

In this paper, we propose MaxUp, a simple and efficient training algorithm for improving generalization, especially for deep neural networks. MaxUp can be viewed as introducing a gradient-norm smoothness regularization for Gaussian perturbation, but does not require evaluating the gradient norm explicitly, and can be easily combined with any existing data augmentation method. We empirically show that MaxUp can improve the performance of data augmentation methods in image classification, language modeling, and certified defense. In particular, we achieve state-of-the-art performance on ImageNet.

In future work, we will apply MaxUp to more applications and models, such as BERT (Devlin et al., 2019). Furthermore, we will generalize MaxUp to apply mild adversarial optimization on feature and label spaces for other challenging tasks in machine learning, such as transfer learning and semi-supervised learning.


  1. All the FLOPS and model size reported in this paper is calculated by
  2. The code is downloaded from


  1. Proxylessnas: direct neural architecture search on target task and hardware. ICLR. Cited by: §4.1, §4.1.
  2. Certified adversarial robustness via randomized smoothing. ICML. Cited by: §1, §4.4, §4.4, §4.4, §4.4, §4.4, Table 8, §4.
  3. Autoaugment: learning augmentation policies from data. CVPR. Cited by: §1, §2, §3.1, Table 1.
  4. RandAugment: practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719. Cited by: §2, §3.1.
  5. Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §4.1.
  6. BERT: pre-training of deep bidirectional transformers for language understanding. Cited by: §5.
  7. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §1, §2, §4.2.
  8. Dropblock: a regularization method for convolutional networks. In NeurIPS, pp. 10727–10737. Cited by: Table 1.
  9. Frage: frequency-agnostic word representation. In NeurIPS, pp. 1334–1345. Cited by: §4.3, Table 6, Table 7.
  10. Explaining and harnessing adversarial examples. ICLR. Cited by: §4.4.
  11. Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §4.1, §4.2, §4.4, Table 1.
  12. Identity mappings in deep residual networks. In ECCV, pp. 630–645. Cited by: §4.2.
  13. Bounds on the expectation of the maximum of samples from a Gaussian. URL http://www.gautamkamath.com/writings/gaussian_max.pdf. Cited by: §2.1.
  14. Dynamic evaluation of neural sequence models. ICML. Cited by: §4.3, §4.3, Table 6, Table 7.
  15. Fractalnet: ultra-deep neural networks without residuals. ICLR. Cited by: Table 1.
  16. Bag of tricks and a strong baseline for deep person re-identification. In CVPRW. Cited by: §3.3.
  17. Towards deep learning models resistant to adversarial attacks. ICLR. Cited by: §1, §3.2, §3.2.
  18. Building a large annotated corpus of english: the penn treebank. Cited by: §4.3.
  19. Regularizing and optimizing lstm language models. ICLR. Cited by: §4.3, §4.3, §4.3, §4.3, Table 6, Table 7.
  20. Recurrent neural network based language model. In ISCA. Cited by: §4.3.
  21. Optimal non-asymptotic lower bound on the minimax regret of learning with expert advice. arXiv preprint arXiv:1511.02176. Cited by: §2.1.
  22. Provably robust deep learning via adversarially trained smoothed classifiers. NeurIPS. Cited by: §1, §4.4, §4.4, §4.4, §4.4, §4.4, §4.4, §4.4, Table 8.
  23. Adversarially robust generalization requires more data. In NeurIPS, pp. 5014–5026. Cited by: §3.2.
  24. Training region-based object detectors with online hard example mining. In CVPR, pp. 761–769. Cited by: §3.3.
  25. Dropout: a simple way to prevent neural networks from overfitting. JMLR, pp. 1929–1958. Cited by: Table 1.
  26. EfficientNet: rethinking model scaling for convolutional neural networks. ICML. Cited by: §4.1, §4.1.
  27. Fixing the train-test resolution discrepancy. arXiv preprint arXiv:1906.06423. Cited by: §4.1.
  28. Ensemble adversarial training: attacks and defenses. ICLR. Cited by: §1.
  29. Robustness may be at odds with accuracy. In ICLR. Cited by: §3.2.
  30. Manifold mixup: encouraging meaningful on-manifold interpolation as a regularizer. ICML. Cited by: §3.1, Table 1.
  31. Improving neural language modeling via adversarial training. ICML. Cited by: §4.3, §4.3, Table 6, Table 7.
  32. Adversarial examples improve image recognition. arXiv preprint arXiv:1911.09665. Cited by: §1, §3.2.
  33. Breaking the softmax bottleneck: a high-rank RNN language model. In ICLR. Cited by: §4.3, Table 6, Table 7.
  34. Rademacher complexity for adversarially robust generalization. In ICML, pp. 7085–7094. Cited by: §3.2.
  35. Cutmix: regularization strategy to train strong classifiers with localizable features. ICCV. Cited by: §3.1, §4.1, Table 1.
  36. Wide residual networks. In BMVC, pp. 87.1–87.12. Cited by: §4.2, §4.2.
  37. Theoretically principled trade-off between robustness and accuracy. In ICML, K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 7472–7482. Cited by: §3.2.
  38. Mixup: beyond empirical risk minimization. In ICLR. Cited by: §1, §3.1, Table 1.
  39. Adversarial autoaugment. ICLR. Cited by: §3.1.
  40. FreeLB: enhanced adversarial training for language understanding. ICLR. Cited by: §3.2.
  41. Neural architecture search with reinforcement learning. ICLR. Cited by: Table 6.