Improving AutoAugment via Augmentation-Wise Weight Sharing
Abstract
Recent progress on automatically searching augmentation policies has substantially boosted performance on various tasks. A key component of automatic augmentation search is the evaluation process for a particular augmentation policy, which is used to return the reward and usually runs thousands of times. A plain evaluation process, which includes full model training and validation, would be time-consuming. To achieve efficiency, many methods sacrifice evaluation reliability for speed. In this paper, we dive into the dynamics of augmented training of the model. This inspires us to design a powerful and efficient proxy task based on Augmentation-Wise Weight Sharing (AWS) to form a fast yet accurate evaluation process. Comprehensive analysis verifies the superiority of this approach in terms of effectiveness and efficiency. The augmentation policies found by our method achieve the best accuracy compared with existing auto-augmentation search methods. On CIFAR-10, we achieve a top-1 error rate of 1.24%, which is currently the best-performing single model without extra training data. On ImageNet, we get a top-1 error rate of 20.36% for ResNet-50, which is a 3.34% absolute error rate reduction over the baseline augmentation.
1 Introduction
Deep learning techniques have been heavily utilized in computer vision and have made remarkable progress in many tasks, such as image classification AlexNet (); AA22 (); zhang2017mixup (), object detection FAA21 (); FAA27 (), segmentation FAA2 (); FAA9 (), image captioning vinyals2015show (), and human pose estimation toshev2014deeppose (). Overfitting is a commonly acknowledged issue of deep learning algorithms, and various regularization techniques have been proposed in different tasks to fight it. Data augmentation, which increases both the amount and the diversity of the data by applying semantically invariant image transformations to training samples AA22 (); AA11 (), is the most commonly used regularization due to its simplicity and effectiveness. There are various frequently used augmentation operations for image data, including traditional image transformations such as resizing, cropping, shearing, horizontal flipping, translation, and rotation. Recently, several special operations, such as Cutout Cutout () and Sample Pairing SP (), have also been proposed. It has been widely observed LeNet (); AlexNet (); DeepImage () that augmentation strategies influence the final performance of deep learning models considerably.
However, choosing appropriate data augmentation strategies is time-consuming and requires extensive effort from experienced human experts. Hence automatic augmentation techniques AA (); RA (); FAA (); PBA (); OHL (); AAA () are leveraged to search for performant augmentation strategies according to specific datasets and models. Numerous experiments show that these searched policies are superior to hand-crafted policies in many computer vision tasks. These techniques design different evaluation processes to conduct searches.
The most straightforward approach AA () uses a plain evaluation process, which fully trains the model with a different augmentation policy each time to obtain the reward for a reinforcement learning agent. Inevitably, this approach is time-consuming, as it requires a tremendous amount of computational resources to train thousands of child models to completion.
To alleviate the computational cost, most of the efficient works PBA (); OHL (); AAA () utilize a joint optimization approach that evaluates the strategies every few iterations, getting rid of training multiple networks from scratch repeatedly. Although efficient, most of these methods achieve only mediocre performance, similar to that of random augmentation RA (), due to the compromised evaluation process. Specifically, the compromised evaluation process distorts the ranking of augmentation strategies, since the ranks of models trained with too few iterations are known to be inconsistent with those of the final models trained with sufficient iterations. This phenomenon is shown in Fig.2, where the relative ranks change a lot during the whole training process.
An ideal evaluation process should be efficient as well as highly reliable, so that it produces accurate rewards for augmentation strategies. To achieve this, we dive into the training dynamics with different data augmentations. We observe that the augmentation operations in the later training period are more influential. Based on this, we design a new evaluation process, which is a proxy task with an Augmentation-Wise Weight Sharing (AWS) strategy. Compared with AA (), we improve efficiency significantly via this weight sharing strategy and make it affordable to search directly on large-scale datasets, and the performance gains are also substantial. Compared with previous efficient methods, our method produces a more reliable evaluation, as shown in Sec.4.4, with competitive computation resources. Our main contributions can be summarized as follows: 1) We propose an efficient yet reliable proxy task utilizing a novel augmentation-wise weight sharing strategy to be the evaluation process for augmentation search methods. 2) We design a new search pipeline for auto-augmentation search utilizing the proposed proxy task and achieve the best accuracy compared with existing auto-augmentation search methods.
The augmentation policies found by our approach achieve outstanding performance. On CIFAR-10, we achieve a top-1 error rate of 1.24%, which is currently the best-performing single model without extra training data. On ImageNet, we get a top-1 error rate of 20.36% for ResNet-50, which is a 3.34% absolute improvement over the baseline augmentation. The augmentation policies we found on both the CIFAR and ImageNet benchmarks will be released to the public as off-the-shelf augmentation policies to push the boundary of the state-of-the-art performance.
2 Related Work
2.1 Auto Machine Learning and Neural Architecture Search
Auto Machine Learning (AutoML) aims to free human practitioners and researchers from menial tasks. Recent advances focus on automatically searching neural network architectures. One of the first attempts zoph2017learning (); zoph2016neural () utilized reinforcement learning to train a controller representing a policy that generates a sequence of symbols representing the network architecture. An alternative to reinforcement learning is evolutionary algorithms, which evolve the topology of architectures by mutating the best architectures found so far real2018regularized (); xie2017genetic (); saxena2016convolutional (). Recent efforts such as liu2018darts (); luo2018neural (); pham2018efficient () utilize several techniques to reduce the search cost. Note that AA () utilized a similar controller inspired by zoph2017learning (), whose training is time-consuming, to guide the augmentation policy search. Our auto-augmentation strategy is much more efficient and economical compared to these methods.
2.2 Automatic Augmentation
Recently, several automatic augmentation approaches AA (); FAA (); PBA (); OHL (); RA (); AAA () have been proposed. Their common purpose is to search for powerful augmentation policies automatically, by which the performance of deep models can be enhanced further. AA () formulates the automatic augmentation policy search as a discrete search problem and employs a reinforcement learning framework to search for a policy consisting of possible augmentation operations; it is the work most closely related to ours. Our proposed method also optimizes the distribution of the discrete augmentation operations, but it is much more computationally economical, benefiting from the weight sharing technique. Many other previous works PBA (); OHL (); AAA () take the single-step approximation to reduce the computational cost dramatically by getting rid of training multiple networks.
3 Method
3.1 Motivations
Key Observation
As a powerful regularization technique, data augmentation is applied to relieve overfitting shorten2019survey (). Another popular regularization method is early stopping sarle1996stopped (), which computes the validation error periodically and stops training when the validation error starts to go up. This suggests that overfitting may mostly occur in the late stages of training. Thus, a natural conjecture can be raised: data augmentation improves the generalization of the model mainly in the later training process.
To investigate and verify this, we explore the relationship between the performance gains and the augmented periods. We train ResNet-18 ResNet () on CIFAR-10 CIFAR () for 300 epochs in total, some of which are augmented by the searched policy of AutoAug AA (). Specifically, we apply augmentation in the first or the last $N_{aug}$ epochs, where $N_{aug}$ denotes the number of epochs with augmentation. We repeat each experiment eight times to ensure reliability. The result is shown in Fig.2, which provides two main pieces of evidence: 1) With the same number of augmented epochs $N_{aug}$, applying data augmentation in the later stages consistently yields better model performance, as the dashed curve is always above the solid one. 2) To train models to the same level of performance, conducting data augmentation in the later stages requires fewer epochs of augmentation than conducting it in the early stages, as the dashed curve is always to the left of the solid one.
In sum, our empirical results show that data augmentation matters more in the late training stages, which can be exploited to produce efficient and reliable reward estimation for different augmentation strategies.
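The two schedules compared in this experiment can be sketched as a per-epoch on/off mask. This is only an illustration: the function name `augmented_epochs` and the value 50 for the number of augmented epochs are assumptions made for the example, not settings reported here.

```python
def augmented_epochs(total_epochs, n_aug, where):
    """Return one boolean per epoch: True if augmentation is applied in that epoch."""
    if not 0 <= n_aug <= total_epochs:
        raise ValueError("n_aug must lie between 0 and total_epochs")
    if where == "start":
        # Augment only the first n_aug epochs.
        return [epoch < n_aug for epoch in range(total_epochs)]
    if where == "end":
        # Augment only the last n_aug epochs.
        return [epoch >= total_epochs - n_aug for epoch in range(total_epochs)]
    raise ValueError("where must be 'start' or 'end'")


# Hypothetical setting: 300 epochs in total, 50 of them augmented.
early = augmented_epochs(300, 50, "start")
late = augmented_epochs(300, 50, "end")
```

Both masks contain the same number of augmented epochs, so any difference in final accuracy between the two runs can be attributed to when the augmentation is applied.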
Augmentation-Wise Weight Sharing
Inspired by our observation, we propose a new proxy task for automatic augmentation. It consists of two stages. In the first stage, we choose a shared augmentation strategy to train the shared weights, namely, the augmentation-wise shared model weights. We borrow the weight sharing paradigm from NAS, which shares weights among different network architectures to speed up the search. To the best of our knowledge, this is the first work to investigate the weight sharing technique for automatic augmentation search. In the second stage, we conduct the policy search efficiently. Reliability is retained because augmentation operations matter more in the late stages; the experiments in Sec.4.4 also verify this.
3.2 AutoAug Formulation
The AutoAug strategy aims to find a set of augmentation operations for the training data that maximizes the performance of a deep model. In this work, we denote the training set as $\mathcal{D}_{tr}$ and the validation set as $\mathcal{D}_{val}$. We use $x$ and $y$ to denote an image and its label. We search the data augmentation strategy for a specific model denoted as $\mathcal{M}_\omega$, which is parameterized by $\omega$. We regard our augmentation strategy as a distribution $p_\theta(\mathcal{O})$ over candidate image transformations, which is controlled by $\theta$; $\mathcal{O}$ denotes the set of operations. The detailed construction of the augmentation policy space is introduced in Sec.3.4.
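The objective of obtaining the best augmentation policy (solving for $\theta$) can be described as a bilevel optimization problem. The inner level is the model weight optimization, which solves for the optimal weights $\omega_\theta$ given a fixed augmentation policy $\theta$:
$$\omega_\theta = \arg\min_{\omega} \frac{1}{|\mathcal{D}_{tr}|} \sum_{(x, y) \in \mathcal{D}_{tr}} \mathbb{E}_{O \sim p_\theta(\mathcal{O})} \left[ \mathcal{L}\left(\mathcal{M}_\omega(O(x)), y\right) \right] \tag{1}$$
where $\mathcal{L}$ denotes the loss function, i.e., the cross-entropy loss.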
The outer level is the augmentation policy optimization, which optimizes the policy parameter $\theta$ given the result of the inner level problem. Notably, the objective for the optimization of $\theta$ is the validation accuracy ACC:
$$\theta^* = \arg\max_{\theta} \mathrm{ACC}(\omega_\theta, \mathcal{D}_{val}) \tag{2}$$
where $\theta^*$ denotes the parameter of the optimal policy and $\mathrm{ACC}(\omega_\theta, \mathcal{D}_{val})$ denotes the validation accuracy obtained by $\omega_\theta$. This is a typical bilevel optimization problem bilevel (). Solving the inner loop exactly is extremely time-consuming, so it is almost impossible to generalize this approach to a large-scale dataset without compromise AA (). More recent works PBA (); OHL (); AAA () focus on reducing the time complexity of solving the bilevel optimization problem. They take a single-step approximation borrowed from the NAS literature liu2018darts () to avoid training multiple networks from scratch. Instead of solving the inner level problem, the single-step approximation takes only one step for $\omega$ based on the previous $\omega$, and utilizes the resulting $\omega'$, which approximates the solution of the inner level problem, to update $\theta$. These approaches are empirically efficient, but RA () shows that it is possible to achieve comparable or even stronger performance using a random augmentation policy. Thus, a new approach to efficient auto-augmentation search is desirable.
3.3 Our Proxy Task
Inspired by our observation that the later augmentation operations are more influential than the early ones, we propose a new proxy task that replaces the process of solving the inner level optimization with a computationally efficient evaluation process.
The basic idea of our proxy task is to partition the augmented training of the network parameters (i.e., the inner level optimization) into two parts. In the first part (i.e., the early stage), a shared augmentation policy is applied to train the network regardless of the current policy given by the outer level optimization; in the second part (i.e., the late stage), the network model is finetuned from the augmentation-wise shared weights with the given policy, so that it can be used to evaluate the performance of this policy. Since the shared augmented training in the first part is independent of the given policy $p_\theta$, it only needs to be performed once for all candidate augmentation policies in the search, which significantly speeds up the optimization. We call this strategy augmentation-wise weight sharing.
Now our problem boils down to finding a good shared augmentation policy for the first part of training. In the following, we show that this can be obtained directly via the following proposition.
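Proposition. Let $\tau = (O_1, O_2, \ldots, O_N)$ denote an arbitrary augmentation trajectory consisting of $N$ augmentation operations. Let $p(\tau)$ and $\bar{p}(\tau)$ be the trajectory distributions without and with the augmentation-wise weight sharing, respectively, that is: $p(\tau) = \prod_{i=1}^{N} p_\theta(O_i)$ and $\bar{p}(\tau) = \prod_{i=1}^{N_{share}} p_{share}(O_i) \prod_{i=N_{share}+1}^{N} p_\theta(O_i)$, where $N_{share}$ denotes the number of augmentation operations in the early training stage. Here $p_{share}$ and $p_\theta$ indicate the shared and the given policy, respectively. The KL-divergence between $p$ and $\bar{p}$ is minimized when $p_{share}$ is uniform sampling, i.e., $p_{share}(O) = 1/K$ for all the possible $O$, where $K$ is the number of candidate operations. The detailed proof is provided in the supplementary materials.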
The above proposition tells us that with simple uniform sampling for augmentation-wise weight sharing, the obtained augmentation trajectories are similar to those obtained without augmentation-wise weight sharing. This is a favorable property for the following reason. To enhance the reliability of the search algorithm, it is necessary to maintain a high correlation between $\mathrm{ACC}(\bar{\omega}_\theta, \mathcal{D}_{val})$ and $\mathrm{ACC}(\omega_\theta, \mathcal{D}_{val})$, where $\bar{\omega}_\theta$ indicates the network parameters trained by our proxy task. So, it is desirable to make $\bar{\omega}_\theta$ and $\omega_\theta$ as close as possible. We can achieve this by producing similar augmentation trajectories via uniform sampling for the shared augmentation policy $p_{share}$.
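To make the two trajectory distributions concrete, the sketch below evaluates the divergence numerically on a tiny example. It assumes the direction KL(p || p_bar), with p the distribution without sharing, and uses made-up probabilities over three operations; none of these numbers come from the experiments. Because both distributions are products over independent steps and differ only in the first `n_share` steps, the brute-force value should match `n_share` times the per-step divergence.

```python
import itertools
import math


def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def trajectory_kl_bruteforce(p_theta, p_share, n_total, n_share):
    """KL(p || p_bar) computed by enumerating every trajectory of length n_total."""
    k = len(p_theta)
    total = 0.0
    for traj in itertools.product(range(k), repeat=n_total):
        # Without sharing: every step is drawn from the given policy.
        p = math.prod(p_theta[o] for o in traj)
        # With sharing: the first n_share steps come from the shared policy.
        p_bar = math.prod(p_share[o] for o in traj[:n_share]) * math.prod(
            p_theta[o] for o in traj[n_share:]
        )
        if p > 0:
            total += p * math.log(p / p_bar)
    return total


# Hypothetical given policy over K = 3 operations and a uniform shared policy.
p_theta = [0.5, 0.3, 0.2]
uniform = [1.0 / 3.0] * 3
brute = trajectory_kl_bruteforce(p_theta, uniform, n_total=4, n_share=2)
closed_form = 2 * kl(p_theta, uniform)
```

With a uniform shared policy the per-step term equals log K minus the entropy of the given policy, so under these assumptions the gap between the two trajectory distributions grows linearly with the length of the shared stage.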
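We sample the augmentation transforms from a uniform distribution $U(\mathcal{O})$ to train the shared parameter checkpoint $\bar{\omega}_{share}$:
$$\bar{\omega}_{share} = \arg\min_{\omega} \frac{1}{|\mathcal{D}_{tr}|} \sum_{(x, y) \in \mathcal{D}_{tr}} \mathbb{E}_{O \sim U(\mathcal{O})} \left[ \mathcal{L}\left(\mathcal{M}_\omega(O(x)), y\right) \right] \tag{3}$$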
In the second part of training, to get the performance estimate for a particular augmentation policy with parameter $\theta$, we load the checkpoint $\bar{\omega}_{share}$ and finetune it with this augmentation policy. We denote the parameters obtained by finetuning $\bar{\omega}_{share}$ with the augmentation policy $\theta$ as $\bar{\omega}_\theta$. Note that obtaining $\bar{\omega}_\theta$ is very cheap compared with training from scratch. Thus, we optimize the augmentation policy parameters with
$$\theta^* = \arg\max_{\theta} \mathrm{ACC}(\bar{\omega}_\theta, \mathcal{D}_{val}) \tag{4}$$
In other words, we obtain $\bar{\omega}_{share}$ once, then reuse it for every policy evaluation in the late training process that optimizes the policy parameter.
Moreover, by adjusting the number of finetuning epochs, we can still maintain the reliability of policy evaluation to a large extent, which is verified in the supplementary material. In Sec.4.4 we study the superiority of this proxy task, as we empirically find that there is a strong correlation between $\mathrm{ACC}(\bar{\omega}_\theta, \mathcal{D}_{val})$ and $\mathrm{ACC}(\omega_\theta, \mathcal{D}_{val})$.
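The two-part proxy task can be sketched as follows. This is a schematic under stated assumptions, not the actual implementation: `train_fn` and `eval_fn` are hypothetical callables standing in for real training and validation, the toy versions below only track a scalar "quality", and the candidates are enumerated from a fixed dictionary, whereas the actual search proposes policies with a learned controller.

```python
import copy


def search_with_aws(init_weights, train_fn, eval_fn, uniform_policy,
                    candidate_policies, shared_epochs, finetune_epochs):
    """Train shared weights once, then finetune and evaluate each candidate."""
    # Stage 1: a single shared run with the uniform augmentation policy.
    shared = train_fn(copy.deepcopy(init_weights), uniform_policy, shared_epochs)
    rewards = {}
    for name, policy in candidate_policies.items():
        # Stage 2: every candidate starts from a fresh copy of the shared checkpoint.
        weights = train_fn(copy.deepcopy(shared), policy, finetune_epochs)
        rewards[name] = eval_fn(weights)
    best = max(rewards, key=rewards.get)
    return best, rewards, shared


# Toy stand-ins (hypothetical): "training" adds a policy-dependent gain per epoch.
calls = []


def toy_train(weights, policy, epochs):
    calls.append((policy["name"], epochs))
    weights["quality"] += policy["gain"] * epochs
    return weights


def toy_eval(weights):
    return weights["quality"]


uniform = {"name": "uniform", "gain": 0.2}
candidates = {"a": {"name": "a", "gain": 0.1}, "b": {"name": "b", "gain": 0.3}}
best, rewards, shared = search_with_aws(
    {"quality": 0.0}, toy_train, toy_eval, uniform, candidates,
    shared_epochs=10, finetune_epochs=2,
)
```

The point of the design is visible in `calls`: the expensive shared stage appears once, while each candidate costs only `finetune_epochs` of additional training.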
3.4 Augmentation Policy Space and Search Pipeline
Augmentation Policy Space In this paper, we regard the policy parameter $\theta$ as defining a probability distribution over the possible augmentation operations. Let $K$ be the number of available data augmentation operations in the search space, and $\mathcal{O} = \{O_1, \ldots, O_K\}$ be the set of candidates. Accordingly, each of them has a probability of being selected, denoted by $p_\theta(O_k)$. For each training image, we sample an augmentation operation from the distribution $p_\theta$, then apply it to the image. Each augmentation operation is a pair of augmentation elements. Following AA (), we select the same augmentation elements, except Cutout Cutout () and Sample Pairing SP (). There are 36 different augmentation elements in total. The details are listed in the supplementary material.
More precisely, the augmentation distribution is a multinomial distribution with $K$ possible outcomes. The probability of the $k$-th operation is a normalized sigmoid function of $\theta_k$. As a single augmentation operation is defined as a pair of two elements, resulting in $K = 36^2$ possible combinations (the same augmentation choice may repeat), we have
$$p_\theta(O_k) = \frac{1 / (1 + e^{-\theta_k})}{\sum_{i=1}^{K} 1 / (1 + e^{-\theta_i})}$$
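A minimal sketch of this parameterization follows: one logit per ordered pair of elements, a normalized sigmoid to turn logits into probabilities, and one sampled operation per image. The helper names and the index-to-pair mapping via `divmod` are assumptions made for illustration only.

```python
import math
import random

NUM_ELEMENTS = 36
K = NUM_ELEMENTS ** 2  # ordered pairs of elements, repeats allowed


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def policy_probs(theta):
    """Normalized sigmoid: probability of each operation given its logit."""
    s = [sigmoid(t) for t in theta]
    z = sum(s)
    return [v / z for v in s]


def sample_operation(theta, rng):
    """Draw one operation and return it as (first element index, second element index)."""
    k = rng.choices(range(len(theta)), weights=policy_probs(theta), k=1)[0]
    return divmod(k, NUM_ELEMENTS)  # assumed row-major layout of the pairs


theta = [0.0] * K  # all-zero logits give the uniform policy
probs = policy_probs(theta)
first, second = sample_operation(theta, random.Random(0))
```

With all logits equal the distribution is uniform over the $K$ pairs, which corresponds to the shared policy used in the first training stage.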
Search Pipeline As our proxy task is flexible, any heuristic search algorithm is applicable. In our implementation, we empirically find that Proximal Policy Optimization PPO () is good enough to find a good $\theta$ in Eq.4. We also utilize the baseline trick PGbaseline () to reduce the variance of the gradient estimation. The baseline function is an exponential moving average of previous rewards with a weight of 0.9. The complete task pipeline using the augmentation-wise weight sharing technique is presented in Algorithm 1.
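The baseline trick can be sketched as below. Two details are assumptions, since the text does not spell them out: the weight 0.9 is applied to the previous baseline value (and 0.1 to the new reward), and the baseline is initialized with the first observed reward. The class name is hypothetical.

```python
class MovingAverageBaseline:
    """Exponential moving average of past rewards, used to centre new rewards."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = None  # no reward seen yet

    def advantage(self, reward):
        """Return reward minus the current baseline, then update the baseline."""
        if self.value is None:
            self.value = reward  # assumed initialization
        adv = reward - self.value
        self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return adv


baseline = MovingAverageBaseline(decay=0.9)
first_adv = baseline.advantage(0.90)   # hypothetical validation accuracies
second_adv = baseline.advantage(1.00)
```

The centred value, rather than the raw validation accuracy, would then scale the policy-gradient update, so policies are rewarded only for doing better than the recent average.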
4 Experiments and Results
4.1 Datasets and Comparison Methods
Following the literature on automatic augmentation, we evaluate the performance of our proposed method on three classification datasets: CIFAR-10 CIFAR (), CIFAR-100 CIFAR (), and ImageNet ImageNet (). The detailed description and splits of these datasets are presented in the supplementary material. To fully demonstrate the advantage of our proposed method, we make a comprehensive comparison with state-of-the-art augmentation methods, including Cutout Cutout (), AutoAugment (AutoAug) AA (), Fast AutoAugment (Fast AA) FAA (), OHL-Auto-Aug (OHL) OHL (), PBA PBA (), Rand Augment (RandAug) RA (), and Adversarial AutoAugment (Adv. AA) AAA ().
4.2 Implementation Details
CIFAR On CIFAR-10 and CIFAR-100, following the literature, we use ResNet-18 ResNet () and Wide-ResNet-28-10 WRN (), respectively, as the basic models to search the policies, and transfer the searched policies to other models, including Shake-Shake (26 d) SKSK () and PyramidNet+ShakeDrop PYN (). As mentioned, our training process is divided into two parts. The numbers of epochs of each part are set to and , respectively, leading to total number of epochs in the search process. The is set to . To optimize the policy, we use the Adam optimizer with a learning rate of , and . Some other details are studied and reported in the supplementary.
ImageNet During the policy search process, we use ResNet-50 ResNet () as the basic model, and then transfer the policies to ResNet-200 ResNet (). The learning rate is set to 0.2. The numbers of epochs of the two training stages are set to and , respectively. Other hyperparameters of the search process are the same as what we use for the CIFAR datasets. Some other details are studied and reported in the supplementary.
4.3 Comparison with the State of the Art
The comparisons between our AWS method and the state of the art are reported in Tab.1 and Tab.3. To minimize the influence of randomness, we run our method eight times on CIFAR and four times on ImageNet, and report our test error rates as mean ± STD (standard deviation). For the other methods in comparison, we directly quote their results from the original papers. Except for Adv. AA AAA (), these methods only report the average test error rates. “Baseline” in Tab.1 and Tab.3 refers to the basic models using only the default preprocessing, without the searched augmentation policies or Cutout. For a fair comparison, we report our results both with and without the Enlarge Batch (EB) proposed by Adv. AA AAA (). With EB, the mini-batch size is enlarged by the EB ratio, while the number of iterations is unchanged. Besides, our searched policies have strong preferences, as only a few augmentation operations are preserved eventually, which is quite different from other methods like AA (); FAA (); AAA (). Details about them are presented in the supplementary material.
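The mean ± STD reporting can be sketched as follows. The error rates in the example are made up purely to exercise the helper, and the use of the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption.

```python
import statistics


def mean_std(errors):
    """Mean and sample standard deviation of error rates from repeated runs."""
    return statistics.mean(errors), statistics.stdev(errors)


def format_mean_std(errors):
    """Render repeated-run results in the 'mean ± STD' form used in the tables."""
    mean, std = mean_std(errors)
    return f"{mean:.2f} ± {std:.3f}"


# Hypothetical test errors (%) from four repeated runs.
runs = [2.9, 3.0, 2.8, 2.9]
summary = format_mean_std(runs)
```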
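| Approach | Res-18 | WRN | Shake-Shake | PyramidNet |
|---|---|---|---|---|
| Baseline | 4.66 | 3.87 | 2.86 | 2.67 |
| Cutout Cutout () | 3.62 | 3.08 | 2.56 | 2.31 |
| Fast AA FAA () | | 2.7 | 2.0 | 1.7 |
| RandAug RA () | | 2.7 | 2.0 | 1.5 |
| AutoAug AA () | 3.46 | 2.68 | 1.99 | 1.48 |
| PBA PBA () | | 2.58 | 2.03 | 1.46 |
| OHL OHL () | 3.29 | 2.61 | | |
| Adv. AA (EB) AAA () | | 1.90 ± 0.15 | | |
| Ours | 2.91 ± 0.062 | 1.95 ± 0.047 | | |
| Ours (EB) | 2.38 ± 0.041 | 1.57 ± 0.038 | | |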
| Approach | WRN | Shake-Shake | PyramidNet |
|---|---|---|---|
| Baseline | 18.80 | 17.1 | 13.99 |
| Cutout Cutout () | 18.41 | 16.0 | 12.19 |
| Fast AA FAA () | 17.3 | 14.6 | 11.7 |
| RandAug RA () | 16.7 | | |
| AutoAug AA () | 17.1 | 14.3 | 10.67 |
| PBA PBA () | 16.7 | 15.3 | 10.94 |
| Adv. AA (EB) AAA () | | | |
| Ours | 15.28 ± 0.067 | | |
| Ours (EB) | 14.16 ± 0.055 | | |
Results on CIFAR The results on CIFAR-10 are summarized in Tab.1. Comparing the results horizontally in Tab.1, it can be seen that the policies learned using ResNet-18 transfer well to training other network models such as WRN WRN (), Shake-Shake SKSK () and PyramidNet PYN (). Compared with the baseline without the searched augmentation, the performance of all these models improves significantly after applying our searched policies. Comparing the results vertically in Tab.1, our AWS method is the best performer across all four network architectures. Specifically, ours achieves the best top-1 test error of 1.24% with PyramidNet+ShakeDrop, which is better than the second-best performer Adv. AA AAA (), even though we, unlike AAA (), do not use Sample Pairing SP () in the search. Consistent observations are found in the results on CIFAR-100 in Tab.1. Ours again performs best on all network architectures among the methods in comparison.
| Approach | ResNet-50 (top-1 / top-5) | ResNet-200 (top-1 / top-5) |
|---|---|---|
| Baseline | 23.7 / 6.9 | 21.5 / 5.8 |
| Fast AA FAA () | 22.4 / 6.3 | 19.4 / 4.7 |
| AutoAug AA () | 22.4 / 6.2 | 20.0 / 5.0 |
| RandAug RA () | 22.4 / 6.2 | |
| OHL OHL () | 21.07 / 5.68 | |
| Adv. AA (EB) | 20.60 ± 0.15 / 5.53 ± 0.05 | 18.68 ± 0.18 / 4.70 ± 0.05 |
| Ours | 20.61 ± 0.17 / 5.49 ± 0.08 | 18.64 ± 0.16 / 4.67 ± 0.07 |
| Ours (EB) | 20.36 ± 0.15 / 5.41 ± 0.07 | 18.56 ± 0.14 / 4.62 ± 0.05 |
Results on ImageNet The results on ImageNet are summarized in Tab.3. Following the convention, we report both the top-1 and the top-5 test errors. Adv. AA AAA () has evaluated its performance for different EB ratios $M$. The test accuracy improves rapidly as $M$ increases up to 8, and a further increase of $M$ does not bring a significant improvement, so $M = 8$ is finally used in Adv. AA. We tried to increase the batch size as well, but due to limited resources we could only use reduced EB ratios for ResNet-50 and ResNet-200. As can be seen, we still achieve superior performance over the automatic augmentation works in comparison. Moreover, our outstanding performance with the heavy model ResNet-200 also verifies the generalization of our learned augmentation policies.
Results on computational cost We further compare the computational cost of different auto-augmentation methods and report the error reductions of WRN relative to Cutout. Following existing works, the computation costs on CIFAR-10 are reported in Tab.4. In this table, we use the GPU hours used by OHL-Auto-Aug OHL () as the baseline, and report the time consumption relative to this baseline. As can be seen, our method is many times faster than AutoAugment AA () while performing even better. Although it is slightly slower than OHL, our method has a salient performance advantage over it. Overall, our proposed method is a very promising approach: it has the best performance with acceptable computational cost.
4.4 Comparison Among Proxy Tasks
To verify the superiority of our chosen proxy, we compare different proxy tasks using ResNet-18 on CIFAR-10. To solve the problem in Eq.4, we design four optional proxies, which are summarized in Tab.5. The correlations between $\mathrm{ACC}(\bar{\omega}_\theta, \mathcal{D}_{val})$ and $\mathrm{ACC}(\omega_\theta, \mathcal{D}_{val})$ are investigated and shown in Fig.4. As can be seen in Tab.5, among the four options, our selection (the first row) produces the highest Pearson correlation coefficient, 0.85, which outperforms the other options by a large margin. Specifically, the second proxy trains the model without data augmentation in the first stage, and only searches the augmentation policy in the second stage like AutoAugment AA (). Its inferior performance to our proposed proxy may suggest that simply performing AutoAugment AA () only in the late stage does not lead to good results. For the third proxy, there is no first-stage training and the network parameters in the second stage are randomly initialized. Its inferior performance to ours may suggest that less trained network parameters cannot generate a reliable ranking for rewarding. Finally, the fourth proxy is similar to that used in Fast AutoAugment FAA (), and its low correlation coefficient is consistent with the mediocre performance of Fast AutoAugment FAA () in our previous comparisons.
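The reliability measure used here is the standard Pearson correlation coefficient between the accuracy predicted by a proxy and the accuracy after full training, computed over a set of policies. A minimal, dependency-free sketch is given below; the accuracy lists are invented for illustration and are not measurements from this work.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies for five policies: proxy estimate vs. full training.
proxy_acc = [0.931, 0.940, 0.936, 0.948, 0.952]
full_acc = [0.955, 0.961, 0.960, 0.968, 0.971]
r = pearson_r(proxy_acc, full_acc)
```

A value of `r` close to 1 means the proxy ranks policies almost as full training would, which is the property the search relies on.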
| Symbolic representation | The way to obtain $\bar{\omega}_{share}$ | The way to optimize $\theta$ | Pearson $r$ |
|---|---|---|---|
| (ours) | Train with Augmentation | Finetune+Eval | 0.85 |
| | Train without Augmentation | Finetune+Eval | 0.55 |
| | Random Initialized | Finetune+Eval | 0.36 |
| | Train with Augmentation | Eval | 0.045 |
5 Conclusion
In this paper, we propose an innovative and elegant way to search for auto-augmentation policies effectively and efficiently. We first verify that data augmentation operations matter more in the late training stages. Based on this phenomenon, we propose an efficient and reliable proxy task for fast evaluation of augmentation policies and solve the auto-augmentation search problem with an augmentation-wise weight sharing proxy. Intensive empirical evaluations show that the proposed AWS auto-augmentation outperforms all previous searched or hand-crafted augmentation policies. To our knowledge, this is the first time a weight sharing proxy paradigm has been applied to augmentation search. The augmentation policies we found on both the CIFAR and ImageNet benchmarks are released to the public as off-the-shelf augmentation policies to push the boundary of the state-of-the-art performance.
Broader Impact
In this paper, we propose a new framework to conduct an efficient and reliable Automated Augmentation (AutoAug) search and achieve superior performance compared with existing methods. As a typical Automated Machine Learning (AutoML) technique, AutoAug enhances the performance of deep models.
For fundamental research and ML applications, our research contributes to the many computer vision areas that benefit from image data augmentation. It may help reduce the demand for data scientists by enabling domain experts to automatically design tailored augmentation strategies without extensive knowledge of statistics and machine learning.
For broader societal implications, as an AutoML technique, our approach can be utilized to build models and establish reasonable performance lower bounds for them quickly and cheaply. It may be useful and powerful for ML practitioners in various sectors, such as the media industry, the transportation industry, and the automated production industries. However, each of these uses may result in job losses. Other issues, such as leaks of personal information, may also arise when this technique is used by malicious actors. In summary, this technique may be socially beneficial or harmful, depending on the users. We encourage researchers, general practitioners, and anyone else to use it for social benefit, rather than to infringe on the interests of individuals and the nation or to threaten social stability.
Supplementary Material
Appendix A Ablation Study
In this ablation study, we further investigate the power of the policies searched by our approach and by the closely related method AutoAug AA (). We rank the augmentation operations by their probabilities in decreasing order, so the operations ranked at the top can be deemed the most important augmentations. Then we gradually remove the most important operations from the searched policy one by one and investigate the change in the top-1 test error rates, as reported in Tab.6. As can be seen, when the most important operations are removed gradually, the performance of AutoAug remains similar. In contrast, the performance of ours drops significantly during this process. This shows that the augmentations in our policy are much more powerful than those in AutoAug.
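The ablation procedure can be sketched as follows. The operation names and probabilities are invented for the example, and renormalizing the remaining probabilities after removal is an assumption, since the text does not say how the reduced policy is sampled.

```python
def remove_top_operations(probs, k):
    """Drop the k most probable operations and renormalize the remainder.

    probs maps an operation name to its selection probability.
    """
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept = ranked[k:]
    if not kept:
        raise ValueError("cannot remove every operation")
    total = sum(probs[name] for name in kept)
    return {name: probs[name] / total for name in kept}


# Hypothetical searched policy over three operations.
policy = {"op_a": 0.5, "op_b": 0.3, "op_c": 0.2}
without_top1 = remove_top_operations(policy, 1)
without_top2 = remove_top_operations(policy, 2)
```

Retraining with `without_top1`, `without_top2`, and so on, and comparing the test error against the full policy, gives one column of the ablation table per removal step.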
| Approach | Apply All | Without Top | Without Top | Without Top |
|---|---|---|---|---|
| AutoAug AA () (our impl.) | 3.40 ± 0.070 | 3.45 ± 0.066 | 3.36 ± 0.065 | 3.46 ± 0.082 |
| Ours | 2.91 ± 0.062 | 3.10 ± 0.056 | 3.13 ± 0.099 | 3.19 ± 0.110 |
Appendix B Proof
The proof of the proposition in Sec. 3.3 is as follows:
Proof. The KL-divergence between $p$ and $\bar{p}$ is as follows:
(5) 
Since is constant with respect to , the that minimizes should satisfy:
(6) 
By Jensen’s inequality with the strictly concave function $\log(\cdot)$, we have:
(7) 
The equality holds if and only if $p_{share}(O) = 1/K$ for all the possible $O$. In other words, the KL divergence is minimized when $p_{share}$ is uniform sampling.
Appendix C Investigation of the Number of Epochs of Finetuning
We investigate different numbers of epochs in the late training stage(). By adjusting we can still maintain the reliability of policy evaluation to a large extent. The results are shown in Fig.5. We find that the policy optimization becomes hard to converge when a small is used, as the performances among different are too close. And when a large is used, is notably higher. However, the final performance does not benefit from this, as the correlation between and does not change much between and . Thus we choose as the final configuration for the efficiency.
Appendix D Augmentation Elements
The augmentation elements are listed as follows. We use almost the same elements as AutoAug’s AA (). But we do not introduce Cutout Cutout () and Sample Pairing SP () into the search space.
Elements 



Horizontal Shear  
Vertical Shear  
Horizontal Translate  
Vertical Translate  
Rotate  
Color Adjust  
Posterize  
Solarize  
Contrast  
Sharpness  
Brightness  
Autocontrast  None  
Equalize  None  
Invert  None 
Appendix E Datasets Splitting Details
| Dataset | Train Set Size | Validation Set Size | Test Set Size |
|---|---|---|---|
| CIFAR-10 CIFAR () | 40,000 | 10,000 | 10,000 |
| CIFAR-100 CIFAR () | 40,000 | 10,000 | 10,000 |
| Reduced ImageNet ImageNet () | 128,000 | 50,000 | 50,000 |
Appendix F More Implementation Details
CIFAR Once the policies have been learned, they are applied to train the models again from scratch, as well as to train other network models in order to investigate the transferability of the policies between different network models. For ResNet-18 and Wide-ResNet-28-10, we use a mini-batch size of 256 and SGD with a Nesterov momentum of 0.9. The weight decay is set to , and the cosine learning rate scheme is utilized with the maximum learning rate of . The number of epochs is set to 300. For PyramidNet+ShakeDrop and Shake-Shake (26 d), we use the same settings as those in FAA ().
For a fair comparison among different augmentation methods, we apply a basic preprocessing following the convention of the state-of-the-art CIFAR-10 models: standardizing the data, random horizontal flips with 50% probability, zero-padding and random crops, and finally Cutout Cutout () with 16×16 pixels. In our comparison, the searched policy is applied on top of this basic preprocessing step. That is, for each input training image, the basic preprocessing is performed first, then the policies learned by an augmentation method, and finally Cutout.
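The order of operations described above can be sketched as a simple composition. Everything here is illustrative: images are plain nested lists, `basic_preprocess` and `searched_policy` are placeholder callables, and the Cutout variant (a square centred at a uniformly random pixel and clipped at the image border) is an assumption about the exact masking rule.

```python
import random


def cutout(image, size, rng, fill=0):
    """Mask a size x size square centred at a random pixel, clipped at the borders."""
    height, width = len(image), len(image[0])
    cy, cx = rng.randrange(height), rng.randrange(width)
    top, bottom = max(0, cy - size // 2), min(height, cy + size // 2)
    left, right = max(0, cx - size // 2), min(width, cx + size // 2)
    out = [row[:] for row in image]  # copy so the input is left untouched
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = fill
    return out


def make_pipeline(basic_preprocess, searched_policy, rng, cutout_size=16):
    """Compose: basic preprocessing -> searched policy -> Cutout."""
    def apply(image):
        image = basic_preprocess(image)
        image = searched_policy(image)
        return cutout(image, cutout_size, rng)
    return apply


# Placeholder stages: identity functions stand in for the real transforms.
identity = lambda img: img
pipeline = make_pipeline(identity, identity, random.Random(0))
image = [[1] * 32 for _ in range(32)]
augmented = pipeline(image)
```

Keeping Cutout as the last stage means the masked region is never altered by the searched operations that precede it.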
ImageNet Once the policies have been obtained, they are used to train ResNet-50 from scratch, as well as another network model, ResNet-200, to study policy transferability. The hyperparameters used to train ResNet-50 and ResNet-200 are the same as those in AA (), except that a cosine learning rate scheduler is used. Moreover, our learned policies are applied on top of a standard Inception-style preprocessing, which includes standardizing, random horizontal flips with 50% probability, and random distortions of colors AA59 (). This preprocessing step is applied uniformly to all the methods in comparison.
Appendix G Details of Searched Policies
References
 H. S. Baird. Document image defect models. In Structured Document Image Analysis, pages 546–556. Springer, 1992.
 L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
 B. Colson, P. Marcotte, and G. Savard. An overview of bilevel optimization. Annals of operations research, 153(1):235–256, 2007.
 E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 113–123, 2019.
 E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical automated data augmentation with a reduced search space. arXiv preprint arXiv:1909.13719, 2019.
 J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
 T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
 X. Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
 D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5927–5935, 2017.
 K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
 K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen. Population based augmentation: Efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393, 2019.
 H. Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.
 A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
 A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim. Fast autoaugment. In Advances in Neural Information Processing Systems, pages 6662–6672, 2019.
 C. Lin, M. Guo, C. Li, X. Yuan, W. Wu, J. Yan, D. Lin, and W. Ouyang. Online hyperparameter learning for auto-augmentation strategy. In Proceedings of the IEEE International Conference on Computer Vision, pages 6579–6588, 2019.
 H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
 R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems, pages 7827–7838, 2018.
 H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
 E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
 S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
 W. S. Sarle. Stopped training and other remedies for overfitting. Computing science and statistics, pages 352–360, 1996.
 S. Saxena and J. Verbeek. Convolutional neural fabrics. In NIPS, pages 4053–4061, 2016.
 J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.
 D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
 P. Y. Simard, D. Steinkraus, J. C. Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, 2003.
 C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1653–1660, 2014.
 O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
 R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 7(8), 2015.
 L. Xie and A. L. Yuille. Genetic cnn. In ICCV, pages 1388–1397, 2017.
 S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
 X. Zhang, Q. Wang, J. Zhang, and Z. Zhong. Adversarial autoaugment. arXiv preprint arXiv:1912.11188, 2019.
 B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
 B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2(6), 2017.