RandAugment: Practical data augmentation with no separate search
Abstract
Recent work has shown that data augmentation has the potential to significantly improve the generalization of deep learning models. Recently, learned augmentation strategies have led to state-of-the-art results in image classification and object detection. While these strategies were optimized for improving validation accuracy, they also led to state-of-the-art results in semi-supervised learning and improved robustness to common corruptions of images. One obstacle to a large-scale adoption of these methods is a separate search phase which significantly increases the training complexity and may substantially increase the computational cost. Additionally, due to the separate search phase, these learned augmentation approaches are unable to adjust the regularization strength based on model or dataset size. Learned augmentation policies are often found by training small models on small datasets and subsequently applied to train larger models. In this work, we remove both of these obstacles. RandAugment may be trained on the model and dataset of interest with no need for a separate proxy task. Furthermore, due to the parameterization, the regularization strength may be tailored to different model and dataset sizes. RandAugment can be used uniformly across different tasks and datasets and works out of the box, matching or surpassing all previous learned augmentation approaches on CIFAR-10, CIFAR-100, SVHN, and ImageNet. On the ImageNet dataset we achieve 85.0% accuracy, a 0.6% increase over the previous state-of-the-art and 1.0% increase over baseline augmentation. On object detection, RandAugment leads to 1.0-1.3% improvement over baseline augmentation, and is within 0.3% mAP of AutoAugment on COCO. Finally, due to its interpretable hyperparameter, RandAugment may be used to investigate the role of data augmentation with varying model and dataset size.
            CIFAR-10        SVHN            ImageNet        ImageNet
            PyramidNet      Wide-ResNet     ResNet-50       EfficientNet-B7
            cost    acc.    cost    acc.    cost    acc.    cost    acc.
Baseline    0       97.3    0       98.5    0       76.3    0       84.0
AA          5K      98.5    1K      98.9    15K     77.6    15K     84.4
Fast AA     3.5     98.3    1.5     98.8    450     77.6    -       -
PBA         5.0     98.5    1.0     98.9    -       -       -       -
RA (ours)   0       98.5    0       99.0    0       77.6    0       85.0
1 Introduction
Data augmentation is a widely used method for generating additional data to improve machine learning systems, for image classification [42, 22, 7, 53], object detection [13], instance segmentation [10], and speech recognition [20, 15, 35]. Unfortunately, data augmentation methods require expertise and manual work to design policies that capture prior knowledge in each domain. This requirement makes it difficult to extend existing data augmentation methods to other applications and domains.
Learning policies for data augmentation has recently emerged as a method to automate the design of augmentation strategies and therefore has the potential to address some weaknesses of traditional data augmentation methods [5, 56, 19, 24]. Training a machine learning model with a learned data augmentation policy may significantly improve accuracy [5], model robustness [31, 51, 40], and performance on semi-supervised learning [49] for image classification; likewise, for object detection tasks on COCO and PASCAL-VOC [56]. Notably, unlike engineering better network architectures [58], all of these improvements in predictive performance incur no additional computational cost at inference time.
In spite of the benefits of learned data augmentation policies, the computational requirements as well as the added complexity of two separate optimization procedures can be prohibitive. The original presentation of neural architecture search (NAS) realized an analogous scenario in which the dual optimization procedure resulted in superior predictive performance, but the original implementation proved prohibitive in terms of complexity and computational demand. Subsequent work accelerated training efficiency and the efficacy of the procedure [29, 37, 27, 28], eventually making the method amenable to a unified optimization based on a differentiable process [29]. In the case of learned augmentations, subsequent work identified more efficient search methods [19, 24]; however, such methods still require a separate optimization procedure, which significantly increases the computational cost and complexity of training a machine learning model.
The original formulation for learned data augmentation postulated a separate search on a small, proxy task whose results may be transferred to a larger target task [58, 57]. This formulation makes a strong assumption that the proxy task provides a predictive indication of the larger task [27, 2]. In the case of learned data augmentation, we provide experimental evidence to challenge this core assumption. In particular, we demonstrate that this strategy is suboptimal as the strength of the augmentation depends strongly on model and dataset size. These results suggest that an improved data augmentation may be possible if one could remove the separate search phase on a proxy task.
In this work, we propose a method for data augmentation – termed RandAugment – that does not require a separate search. In order to remove a separate search, we find it necessary to dramatically reduce the search space for data augmentation. The reduction in parameter space is in fact so dramatic that simple grid search is sufficient to find a data augmentation policy that outperforms all learned augmentation methods that employ a separate search phase. Our contributions can be summarized as follows:

We demonstrate that the optimal strength of a data augmentation depends on the model size and training set size. This observation indicates that a separate optimization of an augmentation policy on a smaller proxy task may be suboptimal for learning and transferring augmentation policies.

We introduce a vastly simplified search space for data augmentation containing 2 interpretable hyperparameters. One may employ simple grid search to tailor the augmentation policy to a model and dataset, removing the need for a separate search process.

Leveraging this formulation, we demonstrate state-of-the-art results on CIFAR [21], SVHN [33], and ImageNet [6]. On object detection [26], our method is within 0.3% mAP of state-of-the-art. On ImageNet we achieve a state-of-the-art accuracy of 85.0%, a 0.6% increment over previous methods and 1.0% over baseline augmentation.
2 Related Work
Data augmentation has played a central role in the training of deep vision models. On natural images, horizontal flips and random cropping or translations of the images are commonly used in classification and detection models [52, 22, 13]. On MNIST, elastic distortions across scale, position, and orientation have been applied to achieve impressive results [42, 4, 48, 41]. While previous examples augment the data while keeping it in the training set distribution, operations that do the opposite can also be effective in increasing generalization. Some methods randomly erase or add noise to patches of images for increased validation accuracy [8, 54], robustness [45, 51, 11], or both [31]. Mixup [53] is a particularly effective augmentation method on CIFAR-10 and ImageNet, where the neural network is trained on convex combinations of images and their corresponding labels. Object-centric cropping is commonly used for object detection tasks [30], whereas [9] adds new objects on training images by cut-and-paste.
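The convex combination used by Mixup can be illustrated with a short sketch. It is not taken from [53]: the function name `mixup`, the flat-list image representation, and the Beta-distributed mixing coefficient with a hypothetical `alpha=0.2` are assumptions made for the example.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two (image, one-hot label) pairs with one shared coefficient."""
    # Assumption: the coefficient is drawn from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # The same convex combination is applied to pixels and to labels.
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Two toy "images" with two pixels each and two classes.
mixed_x, mixed_y, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
print(mixed_x, mixed_y, lam)
```

Because the label is mixed with the same coefficient as the pixels, the mixed label remains a valid probability vector.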
Moving away from individual operations to augment data, other work has focused on finding optimal strategies for combining different operations. For example, Smart Augmentation learns a network that merges two or more samples from the same class to generate new data [23]. Tran et al. generate augmented data via a Bayesian approach, based on the distribution learned from the training set [47]. DeVries et al. use transformations (e.g. noise, interpolations and extrapolations) in the learned feature space to augment data [7]. Furthermore, generative adversarial networks (GANs) have been used to choose optimal sequences of data augmentation operations [38]. GANs have also been used to generate training data directly [36, 32, 55, 1, 43]; however, this approach does not seem to be as beneficial as learning sequences of data augmentation operations that are predefined [39].
Another approach to learning data augmentation strategies from data is AutoAugment [5], which originally used reinforcement learning to choose a sequence of operations as well as their probability of application and magnitude. Application of AutoAugment policies involves stochasticity at multiple levels: 1) for every image in every minibatch, a sub-policy is chosen with uniform probability; 2) each operation in a sub-policy has an associated probability of application; 3) some operations have stochasticity over direction, for example, an image can be rotated clockwise or counter-clockwise. The layers of stochasticity increase the amount of diversity that the network is trained on, which in turn was found to significantly improve generalization on many datasets. More recently, several papers used the AutoAugment search space and formalism with improved optimization algorithms to find AutoAugment policies more efficiently [19, 24]. Although the time it takes to search for policies has been reduced significantly, having to implement these methods in a separate search phase reduces the applicability of AutoAugment. For this reason, this work aims to eliminate the search phase on a separate proxy task completely.
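The three levels of stochasticity in applying an AutoAugment-style policy can be sketched as follows. This is a minimal illustration and not the AutoAugment implementation: the policy contents, the `apply_policy` name, and the set of operations with a random direction are assumptions, and the image is replaced by a record of which operations would be applied.

```python
import random

# Hypothetical policy: each sub-policy is a list of
# (operation, probability of application, magnitude).
# The values are illustrative and not taken from any learned policy.
POLICY = [
    [("rotate", 0.7, 2), ("color", 0.4, 9)],
    [("shear-x", 0.9, 4), ("equalize", 0.6, 0)],
]

# Assumed set of operations whose direction is randomized.
SIGNED_OPS = {"rotate", "shear-x", "shear-y", "translate-x", "translate-y"}

def apply_policy(policy, rng):
    """Return the (operation, signed magnitude) pairs chosen for one image."""
    sub_policy = rng.choice(policy)          # 1) uniform choice of sub-policy
    applied = []
    for op, prob, magnitude in sub_policy:
        if rng.random() >= prob:             # 2) apply with the op's probability
            continue
        if op in SIGNED_OPS and rng.random() < 0.5:
            magnitude = -magnitude           # 3) random direction
        applied.append((op, magnitude))
    return applied

print(apply_policy(POLICY, random.Random(0)))
```

Each call can yield a different subset of operations for the same image, which is the source of the diversity described above.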
Some of the developments in RandAugment were inspired by the recent improvements to searching over data augmentation policies. For example, Population Based Augmentation (PBA) [19] found that the optimal magnitude of augmentations increased during the course of training, which inspired us to not search over optimal magnitudes for each transformation but have a fixed magnitude schedule, which we discuss in detail in Section 3. Furthermore, authors of Fast AutoAugment [24] found that a data augmentation policy that is trained for density matching leads to improved generalization accuracy, which inspired our first order differentiable term for improving augmentation (see Section 4.7).
3 Methods
The primary goal of RandAugment is to remove the need for a separate search phase on a proxy task. We wish to remove the search phase because it significantly complicates training and is computationally expensive. More importantly, the proxy task may provide suboptimal results (see Section 4.1). In order to remove a separate search phase, we aspire to fold the parameters for the data augmentation strategy into the hyperparameters for training a model. Given that previous learned augmentation methods contained 30+ parameters [5, 24, 19], we focus on vastly reducing the parameter space for data augmentation.
Previous work indicates that the main benefit of learned augmentation policies arises from increasing the diversity of examples [5, 19, 24]. Indeed, previous work enumerated a policy in terms of choosing which transformations to apply out of $K$=14 available transformations, and probabilities for applying each transformation:
identity  autoContrast  equalize
rotate  solarize  color
posterize  contrast  brightness
sharpness  shear-x  shear-y
translate-x  translate-y
In order to reduce the parameter space but still maintain image diversity, we replace the learned policies and probabilities for applying each transformation with a parameter-free procedure of always selecting a transformation with uniform probability $\frac{1}{K}$. Given $N$ transformations for a training image, RandAugment may thus express $K^{N}$ potential policies.
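As a worked check of the size of this space, the sketch below enumerates every ordered sequence of transformations for 14 available transformations and 2 applications per image. The variable names `K` and `N` and the reading of a policy as an ordered sequence with repetition allowed are assumptions of the example.

```python
import itertools

K = 14  # number of available transformations
N = 2   # transformations applied per image

# Ordered sequences of N transformation indices, repetition allowed.
policies = list(itertools.product(range(K), repeat=N))

print(len(policies), K ** N)  # prints "196 196"
```

Even at this size the space is small enough to enumerate, in contrast to a learned policy with a separate probability and magnitude per operation.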
The final set of parameters to consider is the magnitude of each augmentation distortion. Following [5], we employ the same linear scale for indicating the strength of each transformation. Briefly, each transformation resides on an integer scale from 0 to 10, where a value of 10 indicates the maximum scale for a given transformation. A data augmentation policy consists of identifying an integer for each augmentation [5, 24, 19]. In order to reduce the parameter space further, we observe that the learned magnitude for each transformation follows a similar schedule during training (e.g. Figure 4 in [19]) and postulate that a single global distortion $M$ may suffice for parameterizing all transformations.
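One way to read the shared linear scale is as a map from an integer magnitude to a transformation-specific parameter. The sketch below assumes that reading; the per-transformation maxima in `MAX_VALUE` and the function name `magnitude_to_parameter` are hypothetical and are not specified in the text.

```python
MAX_LEVEL = 10  # a magnitude of 10 corresponds to the maximum scale

# Assumed maximum parameter per transformation (illustrative values only):
# degrees for rotate, shear factor for shear-x, pixels for translate-x.
MAX_VALUE = {"rotate": 30.0, "shear-x": 0.3, "translate-x": 10.0}

def magnitude_to_parameter(op, magnitude):
    """Linearly map a magnitude on the 0..MAX_LEVEL scale to the op's parameter."""
    return MAX_VALUE[op] * magnitude / MAX_LEVEL

# A single global magnitude sets the strength of every transformation at once.
global_magnitude = 5
for op in MAX_VALUE:
    print(op, magnitude_to_parameter(op, global_magnitude))
```

With this map, one integer controls the strength of all transformations, which is what makes a single global magnitude a usable hyperparameter.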
We experimented with four methods for selecting $M$: a random magnitude, a constant magnitude, a linearly increasing magnitude and a random magnitude with increasing upper bound [19]. A random magnitude uniformly randomly samples the distortion magnitude between two values. A constant magnitude sets the distortion magnitude to a constant number during the course of training. A linearly increasing magnitude interpolates the distortion magnitude during training between two values. A random magnitude with increasing upper bound is similar to a random magnitude, but the upper bound is increased linearly during training. In preliminary experiments, we found that all strategies worked equally well. Thus, we selected a constant magnitude because this strategy includes only a single hyperparameter, and we employ this for the rest of the work.
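The four magnitude schedules described above can be sketched as functions of the training step. The function names, the argument layout, and the choice to interpolate over `total_steps - 1` intervals are assumptions for illustration, not the authors' code.

```python
import random

def constant_magnitude(step, total_steps, value):
    """Same magnitude at every step (a single hyperparameter)."""
    return value

def random_magnitude(step, total_steps, low, high, rng=random):
    """Uniform sample between two fixed values."""
    return rng.uniform(low, high)

def linear_magnitude(step, total_steps, start, end):
    """Linear interpolation from start (first step) to end (last step)."""
    fraction = step / max(total_steps - 1, 1)
    return start + fraction * (end - start)

def random_magnitude_increasing_bound(step, total_steps, low, start_high,
                                      end_high, rng=random):
    """Uniform sample whose upper bound grows linearly during training."""
    high = linear_magnitude(step, total_steps, start_high, end_high)
    return rng.uniform(low, high)

for step in (0, 5, 10):
    print(step, linear_magnitude(step, 11, 0, 10))
```

The constant schedule is the only one of the four that needs a single number, which matches the reason given for selecting it.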
The resulting algorithm contains two parameters, $N$ (the number of transformations applied) and $M$ (the shared magnitude), and may be expressed simply in two lines of Python code (Figure 2). Both parameters are human-interpretable such that larger values of $N$ and $M$ increase regularization strength. Standard methods may be employed to efficiently perform hyperparameter optimization [44, 14]; however, given the extremely small search space, we find that naive grid search is quite effective (Section 4.1). We justify all of the choices of this proposed algorithm in the subsequent sections by comparing the efficacy of the learned augmentations to all previous learned data augmentation methods.
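Figure 2 is not reproduced in this text, so the following is a sketch of what the two-line procedure might look like rather than the authors' listing. It assumes sampling with replacement using the standard-library `random` module; the names `TRANSFORMS` and `randaugment` are chosen for the example, and the function returns (transformation, magnitude) pairs instead of applying them to an image.

```python
import random

# The 14 transformations listed in the Methods section.
TRANSFORMS = [
    "identity", "autoContrast", "equalize", "rotate", "solarize", "color",
    "posterize", "contrast", "brightness", "sharpness", "shear-x", "shear-y",
    "translate-x", "translate-y",
]

def randaugment(n, m, rng=random):
    """Pick n transformations uniformly at random, each at magnitude m."""
    sampled_ops = rng.choices(TRANSFORMS, k=n)   # uniform, with replacement
    return [(op, m) for op in sampled_ops]

print(randaugment(2, 9))
```

The body of the function is two lines: one to sample the operations and one to pair them with the shared magnitude.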
4 Results
To explore the space of data augmentations, we experiment with core image classification and object detection tasks. In particular, we focus on CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets as well as COCO object detection so that we may compare with previous work. For all of these datasets, we replicate the corresponding architectures and set of data transformations. Our goal is to demonstrate the relative benefits of employing this method over previous learned augmentation methods.
                    baseline  PBA   Fast AA  AA    RA
CIFAR-10
Wide-ResNet-28-2    94.9      -     -        95.9  95.8
Wide-ResNet-28-10   96.1      97.4  97.3     97.4  97.3
Shake-Shake         97.1      98.0  98.0     98.0  98.0
PyramidNet          97.3      98.5  98.3     98.5  98.5
CIFAR-100
Wide-ResNet-28-2    75.4      -     -        78.5  78.3
Wide-ResNet-28-10   81.2      83.3  82.7     82.9  83.3
SVHN (core set)
Wide-ResNet-28-2    96.7      -     -        98.0  98.3
Wide-ResNet-28-10   96.9      -     -        98.1  98.3
SVHN
Wide-ResNet-28-2    98.2      -     -        98.7  98.7
Wide-ResNet-28-10   98.5      98.9  98.8     98.9  99.0
4.1 Systematic failures of a separate proxy task
A central premise of learned data augmentation is to construct a small, proxy task that may be reflective of a larger task [57, 58, 5]. Although this assumption is sufficient for identifying learned augmentation policies to improve performance [5, 56, 35, 24, 19], it is unclear if this assumption is overly stringent and may lead to suboptimal data augmentation policies.
In this first section, we challenge the hypothesis that formulating the problem in terms of a small proxy task is appropriate for learned data augmentation. In particular, we explore this question along two separate dimensions that are commonly restricted to achieve a small proxy task: model size and dataset size.
To explore this hypothesis, we systematically measure the effects of data augmentation policies on CIFAR-10. First, we train a family of Wide-ResNet architectures [52], where the model size may be systematically altered through the widening parameter governing the number of convolutional filters. For each of these networks, we train the model on CIFAR-10 and measure the final accuracy compared to a baseline model trained with default data augmentations (i.e. flip left-right and random translations). The Wide-ResNet models are trained with the additional $K$=14 data augmentations (see Methods) over a range of global distortion magnitudes $M$ parameterized on a uniform linear scale ranging from [0, 30].^{1}
^{1} Note that the range of magnitudes exceeds the specified range of magnitudes in the Methods because we wish to explore a larger range of magnitudes for this preliminary experiment. We retain the same scale as [5] for a value of 10 to maintain comparable results.
Figure 3a demonstrates the relative gain in accuracy of a model trained across increasing distortion magnitudes for three Wide-ResNet models. The squares indicate the distortion magnitude that achieves the highest accuracy. Note that in spite of the measurement noise, Figure 3a demonstrates systematic trends across distortion magnitudes. In particular, plotting all Wide-ResNet architectures versus the optimal distortion magnitude highlights a clear monotonic trend across increasing network sizes (Figure 3b). Namely, larger networks demand larger data distortions for regularization. Figure 1 highlights the visual difference in the optimal distortion magnitude for differently sized models. Conversely, a learned policy based on [5] provides a fixed distortion magnitude (Figure 3b, dashed line) for all architectures that is clearly suboptimal.
A second dimension for constructing a small proxy task is to train the proxy on a small subset of the training data. Figure 3c demonstrates the relative gain in accuracy of Wide-ResNet-28-10 trained across increasing distortion magnitudes for varying amounts of CIFAR-10 training data. The squares indicate the distortion magnitude that achieves the highest accuracy. Note that in spite of the measurement noise, Figure 3c demonstrates systematic trends across distortion magnitudes. We first observe that models trained on smaller training sets may gain more improvement from data augmentation (e.g. 3.0% versus 1.5% in Figure 3c). Furthermore, we see that the optimal distortion magnitude is larger for models that are trained on larger datasets. At first glance, this may disagree with the expectation that smaller datasets require stronger regularization.
Figure 3d demonstrates that the optimal distortion magnitude increases monotonically with training set size. One hypothesis for this counter-intuitive behavior is that aggressive data augmentation leads to a low signal-to-noise ratio in small datasets. Regardless, this trend highlights the need for increasing the strength of data augmentation on larger datasets and the shortcomings of optimizing learned augmentation policies on a proxy task comprised of a subset of the training data. Namely, the learned augmentation may learn an augmentation strength more tailored to the proxy task instead of the larger task of interest.
The dependence of augmentation strength on the dataset and model size indicates that a small proxy task may provide a suboptimal indicator of performance on a larger task. This empirical result suggests that a distinct strategy may be necessary for finding an optimal data augmentation policy. In particular, we propose in this work to focus on a unified optimization of the model weights and data augmentation policy.
Figure 3 suggests that merely searching for a shared distortion magnitude $M$ across all transformations may provide sufficient gains that exceed learned optimization methods [5]. Furthermore, Figures 3a and 3c indicate that merely sampling a few distortion magnitudes is sufficient to achieve good results. Coupled with a second free parameter $N$, we consider these results to prescribe an algorithm for learning an augmentation policy. In the subsequent sections, we identify the two free parameters $N$ and $M$ specifying RandAugment through a minimal grid search and compare these results against computationally-heavy learned data augmentations based on proxy tasks.
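A minimal grid search over the two free parameters (the number of transformations and the shared magnitude) might look like the sketch below. The `validation_accuracy` callable stands in for training a model with a given setting and scoring it on held-out data; it, the toy scoring function, and the grid values are hypothetical and not results from this work.

```python
import itertools

def grid_search(n_values, m_values, validation_accuracy):
    """Evaluate every (n, m) pair once and return the best pair and all scores."""
    results = {
        (n, m): validation_accuracy(n, m)
        for n, m in itertools.product(n_values, m_values)
    }
    best = max(results, key=results.get)
    return best, results

# Toy stand-in for "train with (n, m), then measure held-out accuracy".
def toy_accuracy(n, m):
    return -((n - 2) ** 2) - (m - 9) ** 2

best, results = grid_search([1, 2, 3], [5, 7, 9, 11, 13], toy_accuracy)
print(best, len(results))
```

Each grid point costs one full training run on the task of interest, so the total cost scales with the number of settings tried rather than with a separate search procedure.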
4.2 CIFAR
CIFAR-10 has been extensively studied with previous data augmentation methods and we first test this proposed method on this data. The default augmentations for all methods include flips, pad-and-crop and Cutout [8]. $N$ and $M$ were selected based on the validation performance on 5K held out examples from the training set for 1 and 5 settings for $N$ and $M$, respectively. Results indicate that RandAugment achieves either competitive (i.e. within 0.1%) or state-of-the-art accuracy on CIFAR-10 across four network architectures (Table 2).
As a more challenging task, we additionally compare the efficacy of RandAugment on CIFAR-100 for Wide-ResNet-28-2 and Wide-ResNet-28-10. On the held out 5K dataset, we sampled 2 and 4 settings for $N$ and $M$, respectively (i.e. $N$=$\{1, 2\}$ and $M$=$\{2, 6, 10, 14\}$). For Wide-ResNet-28-2 and Wide-ResNet-28-10, we find that $N$=1, $M$=2 and $N$=2, $M$=14 achieve the best results, respectively. Again, RandAugment achieves competitive or superior results across both architectures (Table 2).
4.3 SVHN
Because SVHN is composed of numbers instead of natural images, the data augmentation strategy for SVHN may differ substantially from CIFAR-10. Indeed, [5] identified a qualitatively different policy for CIFAR-10 than SVHN. Likewise, in a semi-supervised setting for CIFAR-10, a policy learned from CIFAR-10 performs better than a policy learned from SVHN [49].
SVHN has a core training set of 73K images [33]. In addition, SVHN contains 531K less difficult “extra” images to augment training. We compare the performance of the augmentation methods on SVHN with and without the extra data on Wide-ResNet-28-2 and Wide-ResNet-28-10 (Table 2). In spite of the large differences between SVHN and CIFAR, RandAugment consistently matches or outperforms previous methods with no alteration to the list of transformations employed. Notably, for Wide-ResNet-28-2, applying RandAugment to the core training dataset improves performance more than augmenting with 531K additional training images (98.3% vs. 98.2%). For Wide-ResNet-28-10, RandAugment is competitive with augmenting the core training set with 531K training images (i.e. within 0.2%). Nonetheless, Wide-ResNet-28-10 with RandAugment matches the previous state-of-the-art accuracy on SVHN which used a more advanced model [5].
4.4 ImageNet
                  baseline     Fast AA      AA           RA
ResNet-50         76.3 / 93.1  77.6 / 93.7  77.6 / 93.8  77.6 / 93.8
EfficientNet-B5   83.2 / 96.7  -            83.3 / 96.7  83.9 / 96.8
EfficientNet-B7   84.0 / 96.9  -            84.4 / 97.1  85.0 / 97.2
Data augmentation methods that improve CIFAR-10 and SVHN models do not always improve large-scale tasks such as ImageNet. For instance, Cutout substantially improves CIFAR and SVHN performance [8], but fails to improve ImageNet [31]. Likewise, AutoAugment does not increase the performance on ImageNet as much as other tasks [5], especially for large networks (e.g. +0.4% for AmoebaNet-C [5] and +0.1% for EfficientNet-B5 [46]). One plausible reason for the lack of strong gains is that the small proxy task was particularly impoverished by restricting the task to 10% of the 1000 ImageNet classes.
Table 3 compares the performance of RandAugment to other learned augmentation approaches on ImageNet. RandAugment matches the performance of AutoAugment and Fast AutoAugment on the smallest model (ResNet-50), but on larger models RandAugment significantly outperforms other methods, achieving increases of up to +1.3% above the baseline. For instance, on EfficientNet-B7, the resulting model achieves 85.0%, a new state-of-the-art accuracy, exhibiting a 1.0% improvement over the baseline augmentation. These systematic gains are similar to the improvements achieved with engineering new architectures [58, 27]; however, these gains arise without incurring additional computational cost at inference time.
4.5 COCO
To further test the generality of this approach, we next explore a related task of large-scale object detection on the COCO dataset [26]. Learned augmentation policies have improved object detection and led to state-of-the-art results [56]. We followed previous work by training on the same architectures and following the same training schedules (see Appendix A.3). Briefly, we employed RetinaNet [25] with ResNet-101 and ResNet-200 as a backbone [16]. Models were trained for 300 epochs from random initialization.
Table 4 compares results between a baseline model, AutoAugment and RandAugment. AutoAugment leveraged additional, specialized transformations not afforded to RandAugment in order to augment the localized bounding box of an image [56]. In addition, note that AutoAugment expended 15K GPU hours for search, whereas RandAugment was tuned on merely 6 values of the hyperparameters (see Appendix A.3). In spite of the smaller library of specialized transformations and the lack of a separate search phase, RandAugment surpasses the baseline model and provides competitive accuracy with AutoAugment. We leave for future work the expansion of the transformation library to include bounding-box-specific transformations, which may improve RandAugment results even further.
Model       Augmentation  mAP   cost
ResNet-101  Baseline      38.8  0
ResNet-101  AutoAugment   40.4  15K
ResNet-101  RandAugment   40.1  0
ResNet-200  Baseline      39.9  0
ResNet-200  AutoAugment   42.1  15K
ResNet-200  RandAugment   41.9  0
4.6 Investigating the dependence on image transformations
RandAugment achieves state-of-the-art results across different tasks and datasets using the same list of transformations. This result suggests that RandAugment is largely insensitive to the selection of transformations for different datasets. To further study the sensitivity, we experimented with RandAugment on a Wide-ResNet-28-2 trained on 4K samples of CIFAR-10, where we remove one of the transformations from the list randomly. Table 5 indicates that removing one transformation randomly does not significantly reduce the accuracy of RandAugment (85.6% vs. 85.5%). We likewise tried using only geometric transformations in RandAugment (e.g. identity, rotate, shear-x, translate-x, translate-y, and shear-y) and found that the performance of RandAugment deteriorates significantly (82.6% vs. 85.5%), indicating that color transformations play a significant role in improving the predictive performance.
accuracy  

baseline  
all operations  
one transformation removed  
only geometric transformations 
4.7 Learning the probabilities for selecting image transformations
                    baseline  AA    RA    + 1^{st}
Reduced CIFAR-10
Wide-ResNet-28-2    82.0      85.6  85.3  85.5
Wide-ResNet-28-10   83.5      87.7  86.8  87.4
CIFAR-10
Wide-ResNet-28-2    94.9      95.9  95.8  96.1
Wide-ResNet-28-10   96.1      97.4  97.3  97.4
RandAugment selects all image transformations with equal probability. This opens up the question of whether learning probabilities may improve performance further. Most of the image transformations (except posterize, equalize, and autoContrast) are differentiable, which permits back-propagation to learn the probabilities [29]. Let us denote $\alpha_{ij}$ as the learned probability of selecting image transformation $i$ for operation $j$. For $K$=14 image transformations and $N$=2 operations, $\alpha_{ij}$ constitutes 28 parameters. We initialize all weights such that each transformation is equal probability (i.e. RandAugment), and update these parameters based on how well a model classifies a held out set of validation images distorted by $\alpha_{ij}$. This approach was inspired by density matching [24], but instead uses a differentiable approach in lieu of Bayesian optimization. We label this method as a 1^{st}-order density matching approximation.
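The parameter count and the uniform initialization described above can be sketched as follows. The softmax-over-logits parameterization, the names `logits`, `alpha` and `sample_ops`, and the storage order `alpha[j][i]` (operation slot first) are assumptions of the example; the gradient-based update on validation images is not shown.

```python
import math
import random

K, N = 14, 2  # image transformations, operations per image

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# logits[j][i]: unnormalized score of transformation i for operation slot j.
# Zero logits give every transformation probability 1/K, i.e. RandAugment.
logits = [[0.0] * K for _ in range(N)]
alpha = [softmax(row) for row in logits]

def sample_ops(alpha, rng=random):
    """Draw one transformation index per operation slot according to alpha."""
    return [rng.choices(range(len(p)), weights=p, k=1)[0] for p in alpha]

print(sum(len(row) for row in alpha))  # 28 learnable probabilities
print(sample_ops(alpha))
```

Starting from zero logits means the learned variant begins as plain uniform sampling and only departs from it as the parameters are updated.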
To test the efficacy of density matching to learn the probabilities of each transformation, we trained Wide-ResNet-28-2 and Wide-ResNet-28-10 on CIFAR-10 and the reduced form of CIFAR-10 containing 4K training samples. Table 6 indicates that learning the probabilities slightly improves performance on reduced and full CIFAR-10 (RA vs 1^{st}). The 1^{st}-order method improves accuracy by more than 3.0% for both models on reduced CIFAR-10 compared to the baseline of flips and pad-and-crop. On CIFAR-10, the 1^{st}-order method improves accuracy by 0.9% on the smaller model and 1.2% on the larger model compared to the baseline. We further see that the 1^{st}-order method always performs better than RandAugment, with the largest improvement on Wide-ResNet-28-10 trained on reduced CIFAR-10 (87.4% vs. 86.8%). On CIFAR-10, the 1^{st}-order method outperforms AutoAugment on Wide-ResNet-28-2 (96.1% vs. 95.9%) and matches AutoAugment on Wide-ResNet-28-10.^{2} Although the density matching approach is promising, this method can be expensive as one must apply all $K$ transformations $N$ times to each image independently. Hence, because the computational demand of $KN$ transformations is prohibitive for large images, we reserve this for future exploration. In summary, we take these results to indicate that learning the probabilities through density matching may improve the performance on small-scale tasks and reserve explorations to larger-scale tasks for the future.
^{2} As a baseline comparison, in preliminary experiments we additionally learn $\alpha_{ij}$ based on differentiating through a virtual training step [29]. In this approach, the 2^{nd}-order approximation yielded consistently negative results (see Appendix A.1).
5 Discussion
Data augmentation is a necessary method for achieving state-of-the-art performance [42, 22, 7, 53, 13, 35]. Learned data augmentation strategies have helped automate the design of such strategies and likewise achieved state-of-the-art results [5, 24, 19, 56]. In this work, we demonstrated that previous methods of learned augmentation suffer from systematic drawbacks. Namely, not tailoring the number of distortions and the distortion magnitude to the dataset size and the model size leads to suboptimal performance. To remedy this situation, we propose a simple parameterization for targeting augmentation to particular model and dataset sizes. We demonstrate that RandAugment is competitive with or outperforms previous approaches [5, 24, 19, 56] on CIFAR-10, CIFAR-100, SVHN, ImageNet and COCO without a separate search for data augmentation policies.
In previous work, scaling learned data augmentation to larger datasets and models has been a notable obstacle. For example, AutoAugment and Fast AutoAugment could only be optimized for small models on reduced subsets of data [5, 24]; population based augmentation was not reported for large-scale problems [19]. The proposed method scales quite well to datasets such as ImageNet and COCO while incurring minimal computational cost (e.g. 2 hyperparameters) and delivering notable predictive performance gains. An open question remains how this method may improve model robustness [31, 51, 40] or semi-supervised learning [49].
Future work will study how this method applies to other machine learning domains, where data augmentation is known to improve predictive performance, such as image segmentation [3], 3D perception [34], speech recognition [18] or audio recognition [17]. In particular, we wish to better understand if or when datasets or tasks may require a separate search phase to achieve optimal performance. Finally, an open question remains how one may tailor the set of transformations to a given task in order to further improve the predictive performance of a given model.
6 Acknowledgements
We thank Samy Bengio, Daniel Ho, Jaehoon Lee, Hanxiao Liu, Raphael Gontijo Lopes, Ruoming Pang, Ben Poole, Mingxing Tan, and the rest of the Brain team for their help.
References
 [1] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.
 [2] Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems, pages 8699–8710, 2018.
 [3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
 [4] Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 3642–3649. IEEE, 2012.
 [5] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
 [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
 [7] Terrance DeVries and Graham W Taylor. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017.
 [8] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
 [9] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1301–1310, 2017.
 [10] Hao-Shu Fang, Jianhua Sun, Runzhong Wang, Minghao Gou, Yong-Lu Li, and Cewu Lu. Instaboost: Boosting instance segmentation via probability map guided copy-pasting. arXiv preprint arXiv:1908.07801, 2019.
 [11] Nic Ford, Justin Gilmer, Nicolas Carlini, and Dogus Cubuk. Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513, 2019.
 [12] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
 [13] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron, 2018.
 [14] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017.
 [15] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
 [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 [17] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. Cnn architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2017.
 [18] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.
 [19] Daniel Ho, Eric Liang, Ion Stoica, Pieter Abbeel, and Xi Chen. Population based augmentation: Efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393, 2019.
 [20] Naoyuki Kanda, Ryu Takeda, and Yasunari Obuchi. Elastic spectral distortion for low resource speech recognition with deep neural networks. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 309–314. IEEE, 2013.
 [21] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 [22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
 [23] Joseph Lemley, Shabab Bazrafkan, and Peter Corcoran. Smart augmentation learning an optimal data augmentation strategy. IEEE Access, 5:5858–5869, 2017.
 [24] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and Sungwoong Kim. Fast autoaugment. arXiv preprint arXiv:1905.00397, 2019.
 [25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
 [26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
 [27] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.
 [28] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. In International Conference on Learning Representations, 2018.
 [29] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 [30] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
 [31] Raphael Gontijo Lopes, Dong Yin, Ben Poole, Justin Gilmer, and Ekin D Cubuk. Improving robustness without sacrificing accuracy with patch gaussian augmentation. arXiv preprint arXiv:1906.02611, 2019.
 [32] Seongkyu Mun, Sangwook Park, David K Han, and Hanseok Ko. Generative adversarial network based acoustic scene training set augmentation and selection using svm hyperplane. In Detection and Classification of Acoustic Scenes and Events Workshop, 2017.
 [33] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 [34] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Patrick Nguyen, et al. Starnet: Targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069, 2019.
 [35] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
 [36] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621, 2017.
 [37] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, 2018.
 [38] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems, pages 3239–3249, 2017.
 [39] Suman Ravuri and Oriol Vinyals. Classification accuracy score for conditional generative models. arXiv preprint arXiv:1905.10887, 2019.
 [40] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? arXiv preprint arXiv:1902.10811, 2019.
 [41] Ikuro Sato, Hiroki Nishimura, and Kensuke Yokoi. Apac: Augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229, 2015.
 [42] Patrice Y Simard, David Steinkraus, John C Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of International Conference on Document Analysis and Recognition, 2003.
 [43] Leon Sixt, Benjamin Wild, and Tim Landgraf. Rendergan: Generating realistic labeled data. arXiv preprint arXiv:1611.01331, 2016.
 [44] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
 [45] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 [46] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
 [47] Toan Tran, Trung Pham, Gustavo Carneiro, Lyle Palmer, and Ian Reid. A bayesian data augmentation approach for learning deep models. In Advances in Neural Information Processing Systems, pages 2794–2803, 2017.
 [48] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
 [49] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.
 [50] Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. Shakedrop regularization. arXiv preprint arXiv:1802.02375, 2018.
 [51] Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, Ekin D Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision. arXiv preprint arXiv:1906.08988, 2019.
 [52] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
 [53] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
 [54] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
 [55] Xinyue Zhu, Yifan Liu, Zengchang Qin, and Jiahong Li. Data augmentation in emotion classification using generative adversarial networks. arXiv preprint arXiv:1711.00648, 2017.
 [56] Barret Zoph, Ekin D Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, and Quoc V Le. Learning data augmentation strategies for object detection. arXiv preprint arXiv:1906.11172, 2019.
 [57] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.
 [58] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Appendix A Appendix
A.1 Second order term from bilevel optimization
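For the second order term for the optimization of augmentation parameters, we follow the formulation in [29], which we summarize below. We treat the optimization of augmentation parameters and weights of the neural network as a bilevel optimization problem, where $\alpha$ are the augmentation parameters and $\theta$ are the weights of the neural network. Then the goal is to find the optimal augmentation parameters $\alpha$ such that when weights are optimized on the training set using data augmentation given by $\alpha$ parameters, the validation loss is minimized. In other words:
$$\min_{\alpha} \mathcal{L}_{val}(\theta^{*}(\alpha), \alpha) \quad \text{s.t.} \quad \theta^{*}(\alpha) = \arg\min_{\theta} \mathcal{L}_{train}(\theta, \alpha). \tag{1}$$
Then, again following [29], we approximate this bilevel optimization by a single virtual training step,
$$\nabla_{\alpha} \mathcal{L}_{val}(\theta^{*}(\alpha), \alpha) \approx \nabla_{\alpha} \mathcal{L}_{val}(\theta - \xi \nabla_{\theta} \mathcal{L}_{train}(\theta, \alpha), \alpha), \tag{2}$$
where $\xi$ is the virtual learning rate. Eq. 2 can be expanded as
$$\nabla_{\alpha} \mathcal{L}_{val}(\theta - \xi \nabla_{\theta} \mathcal{L}_{train}(\theta, \alpha), \alpha) = \nabla_{\alpha} \mathcal{L}_{val}(\theta', \alpha) - \xi \nabla^{2}_{\alpha,\theta} \mathcal{L}_{train}(\theta, \alpha) \nabla_{\theta'} \mathcal{L}_{val}(\theta', \alpha), \tag{3}$$
where $\theta' = \theta - \xi \nabla_{\theta} \mathcal{L}_{train}(\theta, \alpha)$. In the case where the virtual learning rate, $\xi$, is zero, the second term disappears and the first term becomes $\nabla_{\alpha} \mathcal{L}_{val}(\theta, \alpha)$, which was called the first-order approximation [29]. This first-order approximation was found to be highly significant for architecture search, where most of the improvement (0.3% out of 0.5%) could be achieved using this approximation in a more efficient manner (1.5 days as opposed to 4 days). Unfortunately, when $\alpha$ represents augmentation parameters, the first-order approximation is irrelevant since the predictions of a model on the clean validation images do not depend on the augmentation parameters $\alpha$. Then we are left with just the second order approximation, where $\xi > 0$, which we approximate via finite difference approximation as
$$\nabla^{2}_{\alpha,\theta} \mathcal{L}_{train}(\theta, \alpha) \nabla_{\theta'} \mathcal{L}_{val}(\theta', \alpha) \approx \frac{\nabla_{\alpha} \mathcal{L}_{train}(\theta^{+}, \alpha) - \nabla_{\alpha} \mathcal{L}_{train}(\theta^{-}, \alpha)}{2\epsilon}, \tag{4}$$
where $\theta^{\pm} = \theta \pm \epsilon \nabla_{\theta'} \mathcal{L}_{val}(\theta', \alpha)$ and $\epsilon$ is a small number.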
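The finite-difference step can be checked numerically on a toy problem. The sketch below is not the authors' implementation: the scalar augmentation magnitude `alpha`, the weights `theta`, the virtual learning rate `xi_lr`, the step `eps`, and the quadratic training and validation losses are all illustrative assumptions. The "augmentation" simply shifts a single training input by `alpha` along a fixed direction `d`, and the validation loss is computed on a clean input, so only the second order term contributes.

```python
# Toy check of the finite-difference approximation of the second order term.
# All losses, names, and numbers here are illustrative assumptions.

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def shifted_input(x, d, alpha):
    # "Augmented" training input: x moved by magnitude alpha along direction d.
    return [xi + alpha * di for xi, di in zip(x, d)]

def train_grad_theta(theta, alpha, x, d, y):
    # Gradient w.r.t. theta of L_train = 0.5 * (theta . z - y)^2, with z = x + alpha * d.
    z = shifted_input(x, d, alpha)
    r = dot(theta, z) - y
    return [r * zi for zi in z]

def train_grad_alpha(theta, alpha, x, d, y):
    # Gradient w.r.t. alpha of the same training loss.
    z = shifted_input(x, d, alpha)
    return (dot(theta, z) - y) * dot(theta, d)

def val_loss(theta, xv, yv):
    # Clean validation loss: it does not depend on alpha directly.
    return 0.5 * (dot(theta, xv) - yv) ** 2

def val_grad_theta(theta, xv, yv):
    r = dot(theta, xv) - yv
    return [r * xi for xi in xv]

def virtual_step(theta, alpha, xi_lr, x, d, y):
    # One virtual training step: theta' = theta - xi * grad_theta L_train.
    g = train_grad_theta(theta, alpha, x, d, y)
    return [t - xi_lr * gi for t, gi in zip(theta, g)]

def hypergradient(theta, alpha, xi_lr, eps, x, d, y, xv, yv):
    # Second order term only: -xi * (mixed second derivative) . v,
    # with the matrix-vector product replaced by a central finite difference.
    theta_prime = virtual_step(theta, alpha, xi_lr, x, d, y)
    v = val_grad_theta(theta_prime, xv, yv)
    theta_plus = [t + eps * vi for t, vi in zip(theta, v)]
    theta_minus = [t - eps * vi for t, vi in zip(theta, v)]
    fd = (train_grad_alpha(theta_plus, alpha, x, d, y)
          - train_grad_alpha(theta_minus, alpha, x, d, y)) / (2 * eps)
    return -xi_lr * fd

theta, x, d, y = [0.5, -0.3], [1.0, 2.0], [0.2, -0.1], 1.0
xv, yv = [1.5, 0.5], 0.7
print(hypergradient(theta, 0.4, 0.1, 1e-3, x, d, y, xv, yv))
```

The finite difference needs only two extra evaluations of the training gradient with respect to the augmentation parameters, so the mixed second derivative is never formed explicitly.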
A.2 Experimental Details
A.2.1 CIFAR
The Wide-ResNet models were trained for 200 epochs with a learning rate of 0.1, batch size of 128, weight decay of 5e-4, and cosine learning rate decay. The Shake-Shake [12] model was trained for 1800 epochs with a learning rate of 0.01, batch size of 128, weight decay of 1e-3, and cosine learning rate decay. ShakeDrop [50] models were trained for 1800 epochs with a learning rate of 0.05, batch size of 64 (as 128 did not fit on a single GPU), weight decay of 5e-5, and cosine learning rate decay.
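As an illustration of the cosine learning rate decay mentioned above, here is a minimal sketch. It assumes the rate is updated once per epoch, decays to zero at the final epoch, and uses no warmup or restarts; the text does not state these details, and `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(base_lr, epoch, total_epochs):
    """Cosine decay from base_lr at epoch 0 down to 0 at total_epochs.

    Assumes no warmup and no restarts (not specified in the text).
    """
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Wide-ResNet setting from the text: learning rate 0.1 over 200 epochs.
for epoch in (0, 50, 100, 150, 200):
    print(epoch, round(cosine_lr(0.1, epoch, 200), 4))
```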
On CIFAR-10, we used 3 for the number of operations applied ($N$) and tried 4, 5, 7, 9, and 11 for magnitude. For Wide-ResNet-28-2 and Wide-ResNet-28-10, we find that the optimal magnitude is 4 and 5, respectively. For Shake-Shake (26 2x96d) and PyramidNet + ShakeDrop models, the optimal magnitude was 9 and 7, respectively.
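To show how the two hyperparameters are used, the following minimal sketch draws the given number of operations uniformly at random (with replacement) and pairs each with one shared magnitude. The operation names are placeholders, and `sample_policy` is a hypothetical helper; mapping each name and magnitude to an actual image transformation is left out.

```python
import random

# Placeholder operation names; a real pipeline would map each name
# to an image transformation that accepts a magnitude.
TRANSFORMS = [
    "identity", "auto_contrast", "equalize", "rotate", "solarize",
    "color", "posterize", "contrast", "brightness", "sharpness",
    "shear_x", "shear_y", "translate_x", "translate_y",
]

def sample_policy(n_ops, magnitude, rng=random):
    """Pick n_ops operations uniformly at random, all at the same magnitude."""
    ops = rng.choices(TRANSFORMS, k=n_ops)  # uniform, with replacement
    return [(op, magnitude) for op in ops]

# CIFAR-10 Wide-ResNet-28-10 setting from the text: 3 operations, magnitude 5.
print(sample_policy(3, 5, random.Random(0)))
```

A fresh policy would be sampled for every image, so the only values to tune are the two arguments.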
A.2.2 SVHN
For both SVHN datasets, we applied cutout after RandAugment, as was done for AutoAugment and related methods. On core SVHN, for both Wide-ResNet-28-2 and Wide-ResNet-28-10, we used a learning rate of 5e-3, weight decay of 5e-3, and cosine learning rate decay for 200 epochs. We set and tried 5, 7, 9, and 11 for magnitude. For both Wide-ResNet-28-2 and Wide-ResNet-28-10, we find the optimal magnitude to be 9.
On full SVHN, for both Wide-ResNet-28-2 and Wide-ResNet-28-10, we used a learning rate of 5e-3, weight decay of 1e-3, and cosine learning rate decay for 160 epochs. We set and tried 5, 7, 9, and 11 for magnitude. For Wide-ResNet-28-2, we find the optimal magnitude to be 5, whereas for Wide-ResNet-28-10, we find the optimal magnitude to be 7.
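Cutout [8], applied here after RandAugment, masks a square region of the image. The sketch below is one common variant and not necessarily the exact one used in these experiments: the patch center is sampled uniformly over the image and the patch is clipped at the borders. The patch size, the fill value, and the list-of-rows image format are assumptions made for illustration.

```python
import random

def cutout(image, size, rng=random, fill=0):
    """Return a copy of `image` (a list of rows) with one square patch filled.

    The center is sampled uniformly over the image; the patch is clipped
    at the borders, so fewer than size * size pixels may be masked.
    """
    height, width = len(image), len(image[0])
    cy, cx = rng.randrange(height), rng.randrange(width)
    half = size // 2
    y0, y1 = max(0, cy - half), min(height, cy + half)
    x0, x1 = max(0, cx - half), min(width, cx + half)
    out = [row[:] for row in image]  # leave the input untouched
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = fill
    return out

image = [[1] * 8 for _ in range(8)]
for row in cutout(image, 4, random.Random(0)):
    print(row)
```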
A.2.3 ImageNet
The ResNet models were trained for 180 epochs using the standard ResNet-50 training hyperparameters. The image size was 224 by 224, the weight decay was 0.0001, and the momentum optimizer with a momentum parameter of 0.9 was used. The learning rate was 0.1, scaled by the batch size divided by 256. A global batch size of 4096 was used, split across 32 workers. For ResNet-50 the optimal distortion magnitude was 9 and (). The distortion magnitudes we tried were 5, 7, 9, 11, 13, 15 and the values tried for the number of operations $N$ were 1, 2 and 3.
The EfficientNet experiments used the default hyperparameters and training schedule, which can be found in [46]. We trained for 350 epochs with a batch size of 4096 split across 256 replicas. The learning rate was 0.016, scaled by the batch size divided by 256. We used the RMSProp optimizer with a momentum rate of 0.9, epsilon of 0.001 and a decay of 0.9. The weight decay used was 1e-5. For EfficientNet B5 the image size was 456 by 456 and for EfficientNet B7 it was 600 by 600. For EfficientNet B5 we tried and and found them to perform about the same. We found the optimal distortion magnitude for B5 to be 17. The different magnitudes we tried were 8, 11, 14, 17, 21. For EfficientNet B7 we used and found the optimal distortion magnitude to be 28. The magnitudes tried were 17, 25, 28, 31.
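The ResNet-50 search described above amounts to a small grid over two values. The sketch below enumerates it under the assumption that every combination of the listed magnitudes and operation counts was tried, which the text does not state explicitly. The `evaluate` callback is hypothetical and stands in for a full training run that returns validation accuracy.

```python
from itertools import product

MAGNITUDES = [5, 7, 9, 11, 13, 15]  # distortion magnitudes tried for ResNet-50
NUM_OPS = [1, 2, 3]                 # numbers of operations tried

def grid(num_ops, magnitudes):
    """All (number of operations, magnitude) pairs, assuming a full grid."""
    return list(product(num_ops, magnitudes))

def pick_best(configs, evaluate):
    """Return the configuration with the highest score.

    `evaluate` is a hypothetical callback: it would train the target model
    with the given configuration and return its validation accuracy.
    """
    return max(configs, key=evaluate)

configs = grid(NUM_OPS, MAGNITUDES)
print(len(configs))  # 18 configurations
```

Each configuration is evaluated on the model and dataset of interest, so no proxy task is involved.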
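The learning rate scaling rule used for both model families is simple arithmetic, sketched below. The helper names are hypothetical, and the per-worker batch size assumes the global batch is split evenly across workers.

```python
def scaled_lr(base_lr, global_batch_size, reference_batch_size=256):
    """Scale the base learning rate by batch size / 256, as described in the text."""
    return base_lr * global_batch_size / reference_batch_size

def per_worker_batch(global_batch_size, num_workers):
    """Per-worker batch size, assuming an even split across workers."""
    if global_batch_size % num_workers != 0:
        raise ValueError("global batch size must divide evenly across workers")
    return global_batch_size // num_workers

# ResNet-50: base rate 0.1, batch 4096 over 32 workers.
print(scaled_lr(0.1, 4096), per_worker_batch(4096, 32))
# EfficientNet: base rate 0.016, batch 4096 over 256 replicas.
print(scaled_lr(0.016, 4096), per_worker_batch(4096, 256))
```

With a global batch size of 4096 the scale factor is 16, giving effective learning rates of about 1.6 for ResNet-50 and about 0.256 for EfficientNet.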
The default augmentations of horizontal flipping and random cropping were used on ImageNet, applied before RandAugment. The standard training and validation splits were employed for training and evaluation.
A.3 COCO
We applied horizontal flipping and scale jittering in addition to RandAugment. We used the same list of data augmentation transformations as in the classification tasks. Geometric operations transformed the bounding boxes as defined in Ref. [56]. We used a learning rate of 0.08 and a weight decay of 1e-4. The focal loss parameters are set to be and . We set and tried distortion magnitudes between 4 and 9. We found the optimal distortion magnitude for ResNet-101 and ResNet-200 to be 5 and 6, respectively.
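For detection, a geometric operation has to move the bounding boxes together with the pixels. The sketch below shows the simplest case, a horizontal flip, and is only an illustration of the idea, not the definition used in Ref. [56]. The `(x_min, y_min, x_max, y_max)` box format, the continuous pixel coordinates, and the helper names are assumptions.

```python
def hflip_image(image):
    """Flip an image, given as a list of rows, left to right."""
    return [row[::-1] for row in image]

def hflip_box(box, image_width):
    """Mirror an (x_min, y_min, x_max, y_max) box about the vertical center line.

    The left and right edges swap roles so that x_min <= x_max still holds.
    """
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)

box = (10, 20, 30, 40)
flipped = hflip_box(box, image_width=100)
print(flipped)                                # (70, 20, 90, 40)
print(hflip_box(flipped, image_width=100))    # back to (10, 20, 30, 40)
```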