Improving Auto-Augment via Augmentation-Wise Weight Sharing


Recent progress in automatically searching for augmentation policies has substantially boosted performance on various tasks. A key component of automatic augmentation search is the evaluation process for a particular augmentation policy, which returns the reward and usually runs thousands of times. A plain evaluation process, which includes full model training and validation, is time-consuming. To gain efficiency, many methods sacrifice evaluation reliability for speed. In this paper, we dive into the dynamics of augmented training of the model. This inspires us to design a powerful and efficient proxy task based on Augmentation-Wise Weight Sharing (AWS) that forms a fast yet accurate evaluation process. Comprehensive analysis verifies the superiority of this approach in terms of effectiveness and efficiency. The augmentation policies found by our method achieve the best accuracy among existing auto-augmentation search methods. On CIFAR-10, we achieve a top-1 error rate of 1.24%, which is currently the best-performing single model without extra training data. On ImageNet, we obtain a top-1 error rate of 20.36% for ResNet-50, a 3.34% absolute error rate reduction over the baseline augmentation.

1 Introduction

Deep learning techniques have been heavily utilized in computer vision and have made remarkable progress in many tasks, such as image classification AlexNet (); AA22 (); zhang2017mixup (), object detection FAA21 (); FAA27 (), segmentation FAA2 (); FAA9 (), image captioning vinyals2015show (), and human pose estimation toshev2014deeppose (). Overfitting is a commonly acknowledged issue of deep learning algorithms, and various regularization techniques have been proposed for different tasks to combat it. Data augmentation, which increases both the amount and the diversity of the data by applying semantically invariant image transformations to training samples AA22 (); AA11 (), is the most commonly used regularization due to its simplicity and effectiveness. Frequently used augmentation operations for image data include traditional image transformations such as resizing, cropping, shearing, horizontal flipping, translation, and rotation. Recently, several special operations, such as Cutout Cutout () and Sample Pairing SP (), have also been proposed. It has been widely observed LeNet (); AlexNet (); DeepImage () that augmentation strategies considerably influence the final performance of deep learning models.

However, choosing appropriate data augmentation strategies is time-consuming and requires extensive effort from experienced human experts. Hence automatic augmentation techniques AA (); RA (); FAA (); PBA (); OHL (); AAA () are leveraged to search for performant augmentation strategies tailored to specific datasets and models. Numerous experiments show that these searched policies are superior to hand-crafted policies in many computer vision tasks. These techniques design different evaluation processes to conduct the search.

The most straightforward approach AA () uses a plain evaluation process that fully trains the model with different augmentation policies repeatedly to obtain the reward for a reinforcement learning agent. Inevitably, this approach is time-consuming, as it requires a tremendous amount of computational resources to train thousands of child models to completion.

Figure 1: An investigation of the change of rankings in augmented training. We train ResNet-18 ResNet () on CIFAR-10 CIFAR () for 300 epochs in total, utilizing different augmentation strategies (AA (), RA (), and ours).
Figure 2: An investigation of the relationship between the performance gains and the augmented training periods. We apply augmentation AA () in the start or the end epochs.

To alleviate the computational cost, most efficient works PBA (); OHL (); AAA () utilize a joint optimization approach that evaluates the strategies every few iterations, getting rid of training multiple networks from scratch repeatedly. Although efficient, most of these methods have only mediocre performance, similar to that of random augmentation RA (), due to the compromised evaluation process. Specifically, the compromised evaluation process distorts the ranking of augmentation strategies, since the ranking of models trained with too few iterations is known to be inconsistent with that of the final models trained with sufficient iterations. This phenomenon is shown in Fig.1, where the relative ranks change considerably during the whole training process.

An ideal evaluation process should be both efficient and highly reliable, so that it produces accurate rewards for augmentation strategies. To achieve this, we dive into the training dynamics with different data augmentations. We observe that the augmentation operations in the later training period are more influential. Based on this, we design a new evaluation process, which is a proxy task with an Augmentation-Wise Weight Sharing (AWS) strategy. Compared with AA (), we improve efficiency significantly via this weight sharing strategy and make it affordable to search directly on large-scale datasets, while the performance gains are also substantial. Compared with previous efficient methods, our method produces more reliable evaluation, as shown in Sec.4.4, with competitive computational resources. Our main contributions can be summarized as follows: 1) We propose an efficient yet reliable proxy task that uses a novel augmentation-wise weight sharing strategy as the evaluation process for augmentation search methods. 2) We design a new search pipeline for auto-augmentation search based on the proposed proxy task and achieve the best accuracy among existing auto-augmentation search methods.

The augmentation policies found by our approach achieve outstanding performance. On CIFAR-10, we achieve a top-1 error rate of 1.24%, which is the currently best-performing single model without extra training data. On ImageNet, we get a top-1 error rate of 20.36% for ResNet-50, which leads to 3.34% improvement over the baseline augmentation. The augmentation policies we found on both CIFAR and ImageNet benchmark will be released to the public as an off-the-shelf augmentation policy to push the boundary of the state-of-the-art performance.

2 Related Work

2.1 Auto Machine Learning and Neural Architecture Search

Auto Machine Learning (AutoML) aims to free human practitioners and researchers from menial tasks. Recent advances focus on automatically searching neural network architectures. One of the first attempts zoph2017learning (); zoph2016neural () utilized reinforcement learning to train a controller representing a policy that generates a sequence of symbols describing the network architecture. An alternative to reinforcement learning is evolutionary algorithms, which evolve the topology of architectures by mutating the best architectures found so far real2018regularized (); xie2017genetic (); saxena2016convolutional (). Recent efforts such as liu2018darts (); luo2018neural (); pham2018efficient () utilize several techniques to reduce the search cost. Note that AA () utilizes a similar controller inspired by zoph2017learning (), whose training is time-consuming, to guide the augmentation policy search. Our auto-augmentation strategy is much more efficient and economical than these methods.

2.2 Automatic Augmentation

Recently, several automatic augmentation approaches AA (); FAA (); PBA (); OHL (); RA (); AAA () have been proposed. Their common purpose is to search for powerful augmentation policies automatically, by which the performance of deep models can be enhanced further. AA () formulates the automatic augmentation policy search as a discrete search problem and employs a reinforcement learning framework to search for a policy consisting of possible augmentation operations; it is the work most closely related to ours. Our proposed method also optimizes the distribution of the discrete augmentation operations, but it is much more computationally economical, benefiting from the weight sharing technique. Many other previous works PBA (); OHL (); AAA () take the single-step approximation to reduce the computational cost dramatically by getting rid of training multiple networks.

3 Method

3.1 Motivations

Key Observation

As a powerful regularization technique, data augmentation is applied to relieve overfitting shorten2019survey (). Another popular regularization method is early stopping sarle1996stopped (), which computes the validation error periodically and stops training when the validation error starts to go up. This indicates that overfitting mostly occurs in the late stages of training. Thus, a natural conjecture arises: data augmentation improves the generalization of the model mainly in the later training process.

To investigate and verify this, we explore the relationship between the performance gains and the augmented periods. We train ResNet-18 ResNet () on CIFAR-10 CIFAR () for 300 epochs in total, some of which are augmented by the searched policy of AutoAug AA (). Specifically, we apply augmentation in the first or the last $N_{aug}$ epochs, where $N_{aug}$ denotes the number of epochs with augmentation. We repeat each experiment eight times to ensure reliability. The result is shown in Fig.2, which provides two main pieces of evidence: 1) With the same number of augmented epochs $N_{aug}$, applying data augmentation in the later stages consistently yields better model performance, as the dashed curve is always above the solid one. 2) To train models to the same level of performance, conducting data augmentation in the later stages requires fewer epochs of augmentation than conducting it in the early stages, as the dashed curve is always to the left of the solid one.

In sum, our empirical results show that data augmentation matters more in the late training stages, which can be exploited to produce efficient and reliable reward estimates for different augmentation strategies.
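The two schedules compared in this experiment can be made concrete with a small sketch. The function name `augmented_epochs` and the 0-based epoch indexing are illustrative assumptions, not taken from the paper; the value 50 in the usage lines is likewise only an example.

```python
def augmented_epochs(total_epochs, n_aug, where):
    """Return the set of 0-based epoch indices that are trained with augmentation.

    where="start": augment the first n_aug epochs.
    where="end":   augment the last n_aug epochs.
    """
    if not 0 <= n_aug <= total_epochs:
        raise ValueError("n_aug must lie in [0, total_epochs]")
    if where == "start":
        return set(range(n_aug))
    if where == "end":
        return set(range(total_epochs - n_aug, total_epochs))
    raise ValueError("where must be 'start' or 'end'")


# Example: 300 epochs in total, 50 of them augmented (50 is a made-up value).
early = augmented_epochs(300, 50, "start")
late = augmented_epochs(300, 50, "end")
```

In a training loop, one would check `epoch in late` (or `early`) to decide whether the searched policy is applied in that epoch.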

Augmentation-Wise Weight Sharing

Inspired by our observation, we propose a new proxy task for automatic augmentation. It consists of two stages. In the first stage, we choose a shared augmentation strategy to train the shared weights, namely, the augmentation-wise shared model weights. We borrow the weight sharing paradigm from NAS, which shares weights among different network architectures to speed up the search. To the best of our knowledge, this is the first work to investigate the weight sharing technique for automatic augmentation search. In the second stage, we conduct the policy search efficiently by fine-tuning from the shared weights. Reliability is retained because augmentation operations matter more in the late stages, which the experiments in Sec.4.4 also verify.

3.2 Auto-Aug Formulation

Auto-Aug strategy aims to find a set of augmentation operations for the training data that maximizes the performance of a deep model. In this work, we denote the training set as $\mathcal{D}_{tr}$ and the validation set as $\mathcal{D}_{val}$. We use $x$ and $y$ to denote an image and its label. We search the data augmentation strategy for a specific model denoted as $\mathcal{M}_\omega$, which is parameterized by $\omega$. We regard our augmentation strategy as a distribution $p_\theta(\mathcal{O})$ over candidate image transformations, which is controlled by $\theta$. $\mathcal{O}$ denotes the set of operations. A more detailed construction of the augmentation policy space is introduced in Sec.3.4.

The objective of obtaining the best augmentation policy (solving for $\theta$) can be described as a bilevel optimization problem. The inner level is the model weight optimization, which solves for the optimal $\omega^*_\theta$ given a fixed augmentation policy $\theta$:

$$\omega^*_\theta = \arg\min_{\omega} \frac{1}{|\mathcal{D}_{tr}|} \sum_{(x,y)\in\mathcal{D}_{tr}} \mathbb{E}_{O\sim p_\theta(\mathcal{O})}\, \mathcal{L}\big(\mathcal{M}_\omega(O(x)), y\big), \tag{1}$$

where $\mathcal{L}$ denotes the loss function, i.e. cross entropy loss.

The outer level is the augmentation policy optimization, which optimizes the policy parameter $\theta$ given the result of the inner level problem. Notably, the objective for the optimization of $\theta$ is the validation accuracy ACC:

$$\theta^* = \arg\max_{\theta} \mathrm{ACC}(\omega^*_\theta, \mathcal{D}_{val}), \tag{2}$$

where $\theta^*$ denotes the parameter of the optimal policy and $\mathrm{ACC}(\omega^*_\theta, \mathcal{D}_{val})$ denotes the validation accuracy obtained by $\omega^*_\theta$. This is a typical bilevel optimization problem bilevel (). Solving the inner loop exactly is extremely time-consuming. Thus, it is almost impossible to generalize this approach to a large-scale dataset without compromise AA (). More recent works PBA (); OHL (); AAA () focus on reducing the time complexity of solving the bilevel optimization problem. They take a single-step approximation borrowed from the NAS literature liu2018darts () to avoid training multiple networks from scratch. Instead of solving the inner level problem, the single-step approximation takes only one update step on $\omega$ from its previous value and uses the resulting weights, which approximate the solution of the inner level problem, to update $\theta$. These approaches are empirically efficient, but RA () shows that it is possible to achieve comparable or even stronger performance using a random augmentation policy. Thus, a new approach to efficient auto-augmentation search is desirable.

Figure 3: The overview of our method. Firstly, we train the model with the shared augmentation policy to get the augmentation-wise shared weights $\bar{\omega}_{share}$. Then we fine-tune these weights repeatedly and use the resulting validation accuracy to update the policy under search.

3.3 Our Proxy Task

Inspired by our observation that later augmentation operations are more influential than early ones, we propose a new proxy task that replaces the inner level optimization with a computationally efficient evaluation process.

The basic idea of our proxy task is to partition the augmented training of the network parameters (i.e., the inner level optimization) into two parts. In the first part (i.e., the early stage), a shared augmentation policy $\bar{\theta}$ is applied to train the network regardless of the current policy $\theta$ given by the outer level optimization; in the second part (i.e., the late stage), the network is fine-tuned from the augmentation-wise shared weights $\bar{\omega}_{share}$ with the given policy $\theta$ so that it can be used to evaluate the performance of this policy. Since the shared augmented training in the first part is independent of the given policy $\theta$, it only needs to be run once for all candidate augmentation policies, which significantly speeds up the optimization. We call this strategy augmentation-wise weight sharing.

Now our problem boils down to finding a good shared augmentation policy $\bar{\theta}$ for the first part of training. In the following, we show that it can be obtained easily via the following proposition.

Proposition. Let $\tau = (\tau_1, \dots, \tau_N)$ denote an arbitrary augmentation trajectory consisting of $N$ augmentation operations. Let $p(\tau)$ and $q(\tau)$ be the trajectory distributions without and with the augmentation-wise weight sharing, that is: $p(\tau) = \prod_{i=1}^{N} p_\theta(\tau_i)$, and $q(\tau) = \prod_{i=1}^{N_s} p_{\bar{\theta}}(\tau_i) \prod_{i=N_s+1}^{N} p_\theta(\tau_i)$, where $N_s$ denotes the number of augmentation operations in the early training stage. Here $\bar{\theta}$ and $\theta$ indicate the shared and the given policy, respectively. The KL-divergence between $p$ and $q$ is minimized when $\bar{\theta}$ is uniform sampling, i.e., $p_{\bar{\theta}}(\tau_i) = 1/|\mathcal{O}|$ for all possible $\tau_i$, where $|\mathcal{O}|$ is the number of candidate operations. The detailed proof is provided in the supplementary materials.

The above proposition tells us that with simple uniform sampling for augmentation-wise weight sharing, the obtained augmentation trajectories are similar to those without augmentation-wise weight sharing. This is a favorable property for the following reason. To enhance the reliability of the search algorithm, it is necessary to maintain a high correlation between ACC($\bar{\omega}_\theta$) and ACC($\omega^*_\theta$), where $\omega^*_\theta$ denotes the optimal network parameters for the policy $\theta$ and $\bar{\omega}_\theta$ indicates the network parameters trained by our proxy task. So, it is desirable to make $\bar{\omega}_\theta$ and $\omega^*_\theta$ as close as possible. We can achieve this by producing similar augmentation trajectories via uniform sampling for the shared augmentation policy $\bar{\theta}$.

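We train the shared parameter checkpoint $\bar{\omega}_{share}$ with augmentation transforms sampled from the uniform distribution $p_{\bar{\theta}}$:

$$\bar{\omega}_{share} = \arg\min_{\omega} \frac{1}{|\mathcal{D}_{tr}|} \sum_{(x,y)\in\mathcal{D}_{tr}} \mathbb{E}_{O\sim p_{\bar{\theta}}(\mathcal{O})}\, \mathcal{L}\big(\mathcal{M}_\omega(O(x)), y\big), \tag{3}$$

where $\mathcal{D}_{tr}$ is the training set, $\mathcal{M}_\omega$ is the model with weights $\omega$, and $\mathcal{L}$ is the loss function.

In the second part of training, to estimate the performance of a particular augmentation policy with parameter $\theta$, we load the checkpoint $\bar{\omega}_{share}$ and fine-tune it with this augmentation policy. We denote the parameters obtained by fine-tuning $\bar{\omega}_{share}$ with the augmentation policy $\theta$ as $\bar{\omega}_\theta$. Note that obtaining $\bar{\omega}_\theta$ is very cheap compared with training from scratch. Thus, we optimize the augmentation policy parameters with

$$\theta^* = \arg\max_{\theta} \mathrm{ACC}(\bar{\omega}_\theta, \mathcal{D}_{val}), \tag{4}$$

where $\mathcal{D}_{val}$ is the validation set.

In other words, we obtain $\bar{\omega}_{share}$ once, then reuse it $T$ times, each time conducting the late training process to optimize the policy parameter.

Moreover, by adjusting the number of fine-tuning epochs, we can still maintain the reliability of policy evaluation to a large extent, which is verified in the supplementary material. In Sec.4.4 we study the superiority of this proxy task, as we empirically find that there is a strong correlation between ACC($\bar{\omega}_\theta$) and ACC($\omega^*_\theta$), where $\omega^*_\theta$ denotes the weights obtained by full training with the policy $\theta$.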

3.4 Augmentation Policy Space and Search Pipeline

Augmentation Policy Space In this paper, we regard the policy parameter $\theta$ as defining a probability distribution over the possible augmentation operations. Let $K$ be the number of available data augmentation operations in the search space, and $\mathcal{O} = \{O_k\}_{k=1}^{K}$ be the set of candidates. Accordingly, each of them has a probability of being selected, denoted by $p_\theta(O_k)$. For each training image, we sample an augmentation operation from the distribution $p_\theta$ and apply it to the image. Each augmentation operation is a pair of augmentation elements. Following AA (), we select the same augmentation elements, except Cutout Cutout () and Sample Pairing SP (). There are 36 different augmentation elements in total. The details are listed in the supplementary material.

More precisely, the augmentation distribution is a multinomial distribution with $K$ possible outcomes. The probability of the $k$-th operation is a normalized sigmoid function of $\theta_k$. As a single augmentation operation is defined as a pair of two elements, resulting in $K = 36^2$ possible combinations (the same augmentation choice may repeat), we have $p_\theta(O_k) = \sigma(\theta_k) / \sum_{i=1}^{K} \sigma(\theta_i)$, where $\sigma(t) = 1/(1+e^{-t})$ is the sigmoid function.
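As a minimal sketch of the normalized sigmoid distribution described above, the following code builds the probabilities from a parameter vector and samples one operation. The names `policy_probs` and `sample_operation`, the zero initialisation of the parameters, and the decoding of a flat index into an ordered pair of element indices via `divmod` are all assumptions made for illustration; the paper does not specify them.

```python
import math
import random

NUM_ELEMENTS = 36                 # number of augmentation elements stated in the text
NUM_OPS = NUM_ELEMENTS ** 2       # ordered pairs of elements, repeats allowed


def policy_probs(theta):
    """Normalized sigmoid: p_k = sigmoid(theta_k) / sum_i sigmoid(theta_i)."""
    sig = [1.0 / (1.0 + math.exp(-t)) for t in theta]
    total = sum(sig)
    return [s / total for s in sig]


def sample_operation(probs, rng):
    """Sample one operation index and decode it into a pair of element indices."""
    k = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return divmod(k, NUM_ELEMENTS)  # hypothetical encoding: k = first * 36 + second


theta = [0.0] * NUM_OPS           # assumed initialisation: all operations equally likely
probs = policy_probs(theta)
pair = sample_operation(probs, random.Random(0))
```

With all parameters equal, every operation has probability 1/1296, which coincides with the uniform shared policy used in the first training stage.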

Search Pipeline As our proxy task is flexible, any heuristic search algorithm is applicable. In our implementation, we empirically find that Proximal Policy Optimization PPO () is good enough to find a good $\theta$ in Equ.4. In practice, we also utilize the baseline trick PGbaseline () to reduce the variance of the gradient estimation. The baseline function is an exponential moving average of previous rewards with a weight of 0.9. The complete pipeline using the augmentation-wise weight sharing technique is presented in Algorithm 1.
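The baseline trick mentioned above can be sketched as follows. This is only an illustration under two assumptions that the text does not state: the weight 0.9 is applied to the previous baseline value, and the baseline is initialised with the first reward. The class name `MovingAverageBaseline` is hypothetical.

```python
class MovingAverageBaseline:
    """Exponential moving average of past rewards, used as a variance-reducing baseline."""

    def __init__(self, decay=0.9):
        self.decay = decay      # assumed to weight the previous baseline value
        self.value = None       # no reward seen yet

    def advantage(self, reward):
        """Return reward minus the current baseline, then update the baseline."""
        if self.value is None:
            self.value = reward  # assumption: start from the first observed reward
        adv = reward - self.value
        self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return adv


baseline = MovingAverageBaseline(decay=0.9)
rewards = [0.90, 0.92, 0.91]     # hypothetical validation accuracies
advantages = [baseline.advantage(r) for r in rewards]
```

The advantages, rather than the raw rewards, would then scale the policy gradient inside the PPO update.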

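  Obtain $\bar{\omega}_{share}$ in Equ.3;
  while the number of search iterations is below $T$ do
     Load $\bar{\omega}_{share}$;
     Finetune $\bar{\omega}_{share}$ with the current policy $\theta$ to get $\bar{\omega}_\theta$;
     Use ACC($\bar{\omega}_\theta$, $\mathcal{D}_{val}$) to update $\theta$;
  end while
  return $\theta$;
Algorithm 1 AWS Auto-Aug Search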
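The control flow of Algorithm 1 can be sketched in a few lines. The function `aws_search` and its callback arguments are hypothetical names; the toy callbacks below replace real network training with scalar arithmetic, and the toy update rule merely stands in for PPO. The sketch only shows that the shared checkpoint is trained once and then reloaded for every policy evaluation.

```python
import copy


def aws_search(train_shared, finetune, evaluate, update_policy, theta, num_iters):
    """Skeleton of the AWS search loop with user-supplied (hypothetical) callbacks."""
    shared = train_shared()                  # stage 1: trained once with the shared policy
    for _ in range(num_iters):
        weights = copy.deepcopy(shared)      # load the shared checkpoint
        weights = finetune(weights, theta)   # stage 2: short fine-tune under policy theta
        reward = evaluate(weights)           # validation accuracy used as reward
        theta = update_policy(theta, reward)
    return theta


# Toy stand-ins so the skeleton can run; none of this is the paper's model.
calls = {"shared": 0, "finetune": 0}


def toy_train_shared():
    calls["shared"] += 1
    return {"w": 0.0}


def toy_finetune(weights, theta):
    calls["finetune"] += 1
    weights["w"] += theta
    return weights


def toy_evaluate(weights):
    return -abs(weights["w"] - 1.0)          # toy reward, maximal at w == 1


def toy_update(theta, reward):
    return theta + 0.5 * (-reward)           # hypothetical update rule, not PPO


final_theta = aws_search(toy_train_shared, toy_finetune, toy_evaluate, toy_update, 0.0, 5)
```

Because each iteration starts from a fresh copy of the shared checkpoint, evaluations of different policies do not contaminate each other.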

4 Experiments and Results

4.1 Datasets and Comparison Methods

Following the literature on automatic augmentation, we evaluate the performance of our proposed method on three classification datasets: CIFAR-10 CIFAR (), CIFAR-100 CIFAR (), and ImageNet ImageNet (). The detailed description and splits of these datasets are presented in the supplementary material. To fully demonstrate the advantage of our proposed method, we make a comprehensive comparison with state-of-the-art augmentation methods, including Cutout Cutout (), AutoAugment (AutoAug) AA (), Fast AutoAugment (Fast AA) FAA (), OHL-Auto-Aug (OHL) OHL (), PBA PBA (), Rand Augment (RandAug) RA (), and Adversarial AutoAugment (Adv. AA) AAA ().

4.2 Implementation Details

Cifar On CIFAR-10 and CIFAR-100, following the literature, we use ResNet-18 ResNet () and Wide-ResNet-28-10 WRN (), respectively, as the basic models to search the policies, and transfer the searched policies to other models, including to Shake-Shake (26 d) SKSK () and PyramidNet+ShakeDrop PYN (). As mentioned, our training process is divided into two parts. The numbers of epochs of each part are set to and , respectively, leading to total number of epochs in the search process. The is set to . To optimize the policy, we use the Adam optimizer with a learning rate of , and . Some other details are studied and reported in the supplementary.

ImageNet During the policy search process, we use ResNet-50 ResNet () as the basic model, and then transfer the policies to ResNet-200 ResNet (). The learning rate is set to 0.2. The numbers of epochs of the two training stages are set to and , respectively. Other hyper-parameters of the search process are the same as what we use for the CIFAR datasets. Some other details are studied and reported in the supplementary.

4.3 Comparison with the state-of-the-arts

The comparisons between our AWS method and the state of the art are reported in Tab.1, Tab.2 and Tab.3. To minimize the influence of randomness, we run our method eight times on CIFAR and four times on ImageNet, and report our test error rates as Mean ± STD (standard deviation). For the other methods in comparison, we directly quote their results from the original papers. Except for Adv. AA AAA (), these methods only report the average test error rates. “Baseline” in the tables refers to the basic models using only the default pre-processing, without the searched augmentation policies or Cutout. For a fair comparison, we report our results both with and without the Enlarge Batch (EB) proposed by Adv. AA AAA (). With EB, the mini-batch size is $M$ times larger, where $M$ is the EB ratio, while the number of iterations is unchanged. Besides, our searched policies have strong preferences, as only a few augmentation operations are preserved eventually, which is quite different from other methods like AA (); FAA (); AAA (). Details are presented in the supplementary material.

| Approach | Res-18 | WRN | Shake-Shake | PyramidNet |
|---|---|---|---|---|
| Baseline | 4.66 | 3.87 | 2.86 | 2.67 |
| Cutout Cutout () | 3.62 | 3.08 | 2.56 | 2.31 |
| Fast AA FAA () | - | 2.7 | 2.0 | 1.7 |
| RandAug RA () | - | 2.7 | 2.0 | 1.5 |
| AutoAug AA () | 3.46 | 2.68 | 1.99 | 1.48 |
| PBA PBA () | - | 2.58 | 2.03 | 1.46 |
| OHL OHL () | 3.29 | 2.61 | - | - |
| Adv. AA (EB) AAA () | - | 1.90 ± 0.15 | | |
| Ours | 2.91 ± 0.062 | 1.95 ± 0.047 | | |
| Ours (EB) | 2.38 ± 0.041 | 1.57 ± 0.038 | | |

Table 1: CIFAR-10 results. Top-1 test error rates (%) are reported (lower is better). For a fair comparison, we report our results both with and without the Enlarge Batch proposed by Adv. AA AAA (). We report Mean ± STD (standard deviation) of the test error rates wherever available.

| Approach | WRN | Shake-Shake | PyramidNet |
|---|---|---|---|
| Baseline | 18.80 | 17.1 | 13.99 |
| Cutout Cutout () | 18.41 | 16.0 | 12.19 |
| Fast AA FAA () | 17.3 | 14.6 | 11.7 |
| RandAug RA () | 16.7 | - | - |
| AutoAug AA () | 17.1 | 14.3 | 10.67 |
| PBA PBA () | 16.7 | 15.3 | 10.94 |
| Adv. AA (EB) AAA () | | | |
| Ours | 15.28 ± 0.067 | | |
| Ours (EB) | 14.16 ± 0.055 | | |

Table 2: CIFAR-100 results. Top-1 test error rates (%) are reported (lower is better). We report Mean ± STD (standard deviation) of the test error rates wherever available.

Results on CIFAR The results on CIFAR-10 are summarized in Tab.1. Comparing the results horizontally in Tab.1, it can be seen that our policies learned with ResNet-18 transfer well to training other network models such as WRN WRN (), Shake-Shake SKSK () and PyramidNet PYN (). Compared with the baseline without the searched augmentation, the performance of all these models improves significantly after applying our searched policies. Comparing the results vertically in Tab.1, our AWS method is the best performer across all four network architectures. Specifically, ours achieves the best top-1 test error of 1.24% with PyramidNet+ShakeDrop, which is better than the second-best performer Adv. AA AAA (), even though we, unlike AAA (), do not use Sample Pairing SP () in the search. Consistent observations are found on the CIFAR-100 results in Tab.2, where ours again performs best on all network architectures among the methods in comparison.

| Approach | ResNet-50 | ResNet-200 |
|---|---|---|
| Baseline | 23.7 / 6.9 | 21.5 / 5.8 |
| Fast AA FAA () | 22.4 / 6.3 | 19.4 / 4.7 |
| AutoAug AA () | 22.4 / 6.2 | 20.0 / 5.0 |
| RandAug RA () | 22.4 / 6.2 | - |
| OHL OHL () | 21.07 / 5.68 | - |
| Adv. AA (EB) | 20.60 ± 0.15 / 5.53 ± 0.05 | 18.68 ± 0.18 / 4.70 ± 0.05 |
| Ours | 20.61 ± 0.17 / 5.49 ± 0.08 | 18.64 ± 0.16 / 4.67 ± 0.07 |
| Ours (4× or 2× EB) | 20.36 ± 0.15 / 5.41 ± 0.07 | 18.56 ± 0.14 / 4.62 ± 0.05 |

Table 3: ImageNet results. Top-1 / Top-5 test error rates (%) are reported (lower is better). We report Mean ± STD (standard deviation) of the test error rates of our method. “Ours” denotes our approach without EB. “Ours (4× or 2× EB)” denotes our approach using 4 times the mini-batch size for ResNet-50 and 2 times the mini-batch size for ResNet-200. Please note that Adv. AA used 8 times the mini-batch size.

Results on ImageNet The results on ImageNet are summarized in Tab.3. Following convention, we report both the top-1 and the top-5 test errors. Adv. AA AAA () has evaluated its performance for different EB ratios $M$. The test accuracy improves rapidly as $M$ increases up to 8, and a further increase of $M$ does not bring a significant improvement, so $M = 8$ is finally used in Adv. AA. We tried to increase the batch size, but due to limited resources we can only use 4× EB for ResNet-50 and 2× EB for ResNet-200. As can be seen, we still achieve superior performance over the automatic augmentation works in comparison. Moreover, our strong performance with the heavy model ResNet-200 also verifies the generalization of our learned augmentation policies.

| Approach | Cutout Cutout () | AutoAug AA () | OHL OHL () | Adv. AA AAA () | Ours |
|---|---|---|---|---|---|
| Relative time cost (×) | - | 60 | 1 | 5 | 1.5 |
| Relative error reduction (%) | 0 | 12.99 | 15.26 | 38.31 | 49.02 |

Table 4: Comparison of the computation cost.

Results on computational cost We further compare the computational cost of different auto-augmentation methods and report the error reductions of WRN relative to Cutout. Following existing works, the computation costs on CIFAR-10 are reported in Tab.4. In this table, we use the GPU hours used by OHL-Auto-Aug OHL () as the baseline and report the time consumption relative to this baseline. As can be seen, our method is 40 times faster than AutoAugment AA () (1.5 versus 60 in Tab.4) with even better performance. Although it is slightly slower than OHL, our method has a salient performance advantage over it. Overall, our proposed method is a very promising approach: it has the best performance at an acceptable computational cost.
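The relative error reductions in Tab.4 can be reproduced from the WRN column of Tab.1. The sketch below assumes, based on the numbers matching, that the “Ours” entry of Tab.4 corresponds to the EB result (1.57%); the function name `relative_error_reduction` is illustrative.

```python
def relative_error_reduction(err, ref_err):
    """Error reduction relative to a reference error, in percent."""
    return 100.0 * (ref_err - err) / ref_err


# WRN top-1 test errors (%) on CIFAR-10, taken from Tab.1.
wrn_errors = {
    "Cutout": 3.08,
    "AutoAug": 2.68,
    "OHL": 2.61,
    "Adv. AA": 1.90,
    "Ours (EB)": 1.57,   # assumed to be the row used in Tab.4
}

reductions = {
    name: relative_error_reduction(err, wrn_errors["Cutout"])
    for name, err in wrn_errors.items()
}

# Relative time costs from Tab.4 (OHL = 1): AutoAug 60, ours 1.5.
speedup_vs_autoaug = 60 / 1.5
```

The computed values agree with Tab.4 to within rounding in the second decimal place, and the ratio of time costs gives the speed-up over AutoAugment.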

4.4 Comparison Among Proxy Tasks

To verify the superiority of the proxy we select, we compare different proxy tasks using ResNet-18 on CIFAR-10. To solve the problem in Equ.4, we design four optional proxies, which are summarized in Tab.5. The correlations between the accuracy obtained by each proxy task and the accuracy obtained by full training are investigated and shown in Fig.4. As can be seen in Tab.5, among the four options, our selection (the first proxy) produces the highest Pearson correlation coefficient, 0.85, which outperforms the other options by a large margin. Specifically, the second proxy trains the model without data augmentation in the first stage, and only searches the augmentation policy in the second stage like AutoAugment AA (). Its inferior performance compared with our proposed proxy suggests that simply performing AutoAugment AA () only in the late stage does not lead to good results. As for the third proxy, there is no first-stage training and the network parameters in the second stage are randomly initialized. Its inferior performance compared with ours suggests that less-trained network parameters cannot generate a reliable ranking for rewarding. Finally, the fourth proxy is similar to that used in Fast AutoAugment FAA (), and its low correlation coefficient is consistent with the mediocre performance of Fast AutoAugment FAA () in our previous comparisons.

| Proxy | The way to obtain the checkpoint | The way to optimize the policy | pearsonr |
|---|---|---|---|
| First (ours) | Train with Augmentation | Finetune+Eval | 0.85 |
| Second | Train without Augmentation | Finetune+Eval | 0.55 |
| Third | Random Initialized | Finetune+Eval | 0.36 |
| Fourth | Train with Augmentation | Eval | 0.045 |

Table 5: Optional proxy tasks. The first three proxies fine-tune the checkpoint (the third does not learn one) in different ways with the same goal of maximizing the validation accuracy. The last proxy aims to maximize the accuracy on an augmented validation set without fine-tuning.
Figure 4: Correlation between ACC($\bar{\omega}_\theta$) and ACC($\omega^*_\theta$), where $\omega^*_\theta$ denotes the optimal network parameters for a fixed augmentation policy $\theta$ and $\bar{\omega}_\theta$ denotes the network parameters obtained by a proxy task via fine-tuning the checkpoint. The four proxy tasks of Tab.5 are investigated.
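For readers who want to reproduce this kind of reliability check, the Pearson correlation coefficient between proxy accuracies and fully trained accuracies can be computed as sketched here. The accuracy lists are hypothetical placeholders, not measurements from the paper, and `pearson_r` is an illustrative pure-Python helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies of five policies under the proxy task and full training.
proxy_acc = [0.940, 0.945, 0.950, 0.955, 0.960]
full_acc = [0.951, 0.955, 0.953, 0.962, 0.964]
r = pearson_r(proxy_acc, full_acc)
```

A coefficient close to 1 means that ranking policies by the proxy accuracy largely preserves the ranking they would receive under full training.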

5 Conclusion

In this paper, we propose an innovative and elegant way to search for auto-augmentation policies effectively and efficiently. We first verify that data augmentation operations matter more in the late training stages. Based on this phenomenon, we propose an efficient and reliable proxy task for fast evaluation of augmentation policies and solve the auto-augmentation search problem with an augmentation-wise weight sharing proxy. Intensive empirical evaluations show that the proposed AWS auto-augmentation outperforms all previous searched or handcrafted augmentation policies. To our knowledge, this is the first time a weight sharing proxy paradigm has been applied to augmentation search. The augmentation policies we found on both the CIFAR and ImageNet benchmarks are released to the public as off-the-shelf augmentation policies to push the boundary of state-of-the-art performance.

Broader Impact

In this paper, we propose a new framework to conduct an efficient and reliable Automated Augmentation (AutoAug) search and achieve superior performance compared with existing methods. AutoAug enhances the performances of deep models as a typical Automated Machine Learning (AutoML) technique.

For fundamental research and ML applications, our research contributes towards many computer vision areas that benefit from image data augmentations. It may help reduce the demand for data scientists by enabling domain experts to automatically design tailored augmentation strategies without extensive knowledge of statistics and machine learning.

For broader societal implications, as an AutoML technique, our approach can be utilized to build models and establish reasonable lower bounds on their performance quickly and cheaply. It may be useful and powerful to ML practitioners in various sectors, such as the media industry, the transportation industry, and the automated production industries. However, each of these uses may result in job losses. Other issues, such as leaks of personal privacy, may also arise when this technique is used by malicious actors. In summary, this technique may be socially beneficial or harmful, depending on its users. We encourage researchers, general practitioners, and anyone else to use it for social benefit, rather than to infringe the interests of individuals and the nation or to threaten social stability.

Supplementary Material

Appendix A Ablation Study

In this ablation study, we further investigate the power of the policies searched by our approach and by the closely related method AutoAug AA (). We rank the augmentation operations by their probabilities in decreasing order, so the operations ranked at the top can be deemed the most important augmentations. Then we gradually remove the most important operations from the searched policy one by one and investigate the change in the Top-1 test error rates, as reported in Tab.6. As can be seen, when the most important operations are removed gradually, the performance of AutoAug remains similar. In contrast, the performance of ours drops significantly during this process. This shows that the top-ranked augmentations in our policy are much more powerful than those in AutoAug.

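| Approach | Apply All | Without Top | Without Top | Without Top |
|---|---|---|---|---|
| AutoAug AA () (our impl.) | 3.40 ± 0.070 | 3.45 ± 0.066 | 3.36 ± 0.065 | 3.46 ± 0.082 |
| Ours | 2.91 ± 0.062 | 3.10 ± 0.056 | 3.13 ± 0.099 | 3.19 ± 0.110 |

Table 6: Ablation Study. Top-1 test error rates (%) are reported (lower is better). The “Without Top” columns are ordered by increasing number of removed top-ranked operations. We report Mean ± STD (standard deviation) of the test error rates.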
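The removal procedure of this ablation can be illustrated with a short sketch. The operation names and probabilities are made up, the helper `remove_top_operations` is hypothetical, and renormalising the remaining probabilities is an assumption; the text does not say how the reduced policy is sampled.

```python
def remove_top_operations(probs, num_removed):
    """Drop the num_removed most probable operations and renormalise the rest."""
    if not 0 <= num_removed < len(probs):
        raise ValueError("num_removed must leave at least one operation")
    ranked = sorted(probs, key=probs.get, reverse=True)   # most probable first
    kept = ranked[num_removed:]
    total = sum(probs[name] for name in kept)
    return {name: probs[name] / total for name in kept}   # assumed renormalisation


# Hypothetical searched policy over four operations.
policy = {"op_a": 0.50, "op_b": 0.30, "op_c": 0.15, "op_d": 0.05}
reduced = remove_top_operations(policy, 1)
```

Training with `reduced` instead of `policy`, for an increasing number of removed operations, would yield one column of Tab.6 per setting.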

Appendix B Proof

The proof of the proposition in Sec. 3.3 is as follows:

Proof. The KL-divergence between and is as follows:


Since is constant with respect to , the that minimizes should satisfy:


By Jensen’s inequality with the strictly concave function , we have:


The equality holds if and only if for all the possible . In other words, the KL divergence is minimized when is uniform sampling.
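
As a sketch of this style of argument under assumed notation, consider a finite set of N candidates, an equal-weight target u, and a sampling distribution p over the same set. These symbols are hypothetical and chosen only for illustration; they are not necessarily those of the proposition in Sec. 3.3.

```latex
% Hypothetical notation (an assumption, not the notation of Sec. 3.3):
%   \mathcal{T} : finite set of N candidates
%   u(\tau) = 1/N : equal-weight target distribution
%   p(\tau) : sampling distribution being chosen
\begin{align}
D_{\mathrm{KL}}(u \,\|\, p)
  &= \sum_{\tau \in \mathcal{T}} \frac{1}{N} \log \frac{1/N}{p(\tau)}
   = -\log N - \frac{1}{N} \sum_{\tau \in \mathcal{T}} \log p(\tau).
\end{align}
% The term -\log N does not depend on p, so minimizing the divergence
% is the same as maximizing (1/N) \sum_\tau \log p(\tau).
% Jensen's inequality with the strictly concave \log gives
\begin{align}
\frac{1}{N} \sum_{\tau \in \mathcal{T}} \log p(\tau)
  &\le \log \Big( \frac{1}{N} \sum_{\tau \in \mathcal{T}} p(\tau) \Big)
   = \log \frac{1}{N},
\end{align}
% with equality if and only if p(\tau) = 1/N for every \tau.
% Hence D_{\mathrm{KL}}(u \,\|\, p) \ge 0, attained only by uniform p.
```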

Appendix C Investigation of the Number of Epochs of Fine-tuning

Figure 5: We investigate the key hyper-parameter by visualizing the difference it brings to the search dynamics. All the experiments are conducted with ResNet-18 ResNet () on CIFAR-10 CIFAR (). The Pearson correlation coefficient () between and are annotated in the figure.

We investigate different numbers of epochs in the late training stage(). By adjusting we can still maintain the reliability of policy evaluation to a large extent. The results are shown in Fig.5. We find that the policy optimization becomes hard to converge when a small is used, as the performances among different are too close. And when a large is used, is notably higher. However, the final performance does not benefit from this, as the correlation between and does not change much between and . Thus we choose as the final configuration for the efficiency.
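
The Pearson correlation coefficient annotated in Fig.5 measures how closely one set of scores tracks another. A minimal sketch of how such a coefficient can be computed is given below. The function name `pearson_r` and the sample accuracies are illustrative assumptions, not values from the experiments.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up accuracies (%) for four hypothetical policies: one list from a
# cheap proxy evaluation, the other from full training.
proxy_acc = [93.1, 94.0, 94.6, 95.2]
final_acc = [95.8, 96.3, 96.9, 97.1]
r = pearson_r(proxy_acc, final_acc)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 indicates that the cheap evaluation ranks policies in nearly the same way as the expensive one.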

Appendix D Augmentation Elements

The augmentation elements are listed as follows. We use almost the same elements as AutoAug AA (), except that we do not introduce Cutout Cutout () and Sample Pairing SP () into the search space.

 Ranges of
Horizontal Shear
Vertical Shear
Horizontal Translate
Vertical Translate
Color Adjust
Autocontrast None
Equalize None
Invert None
Table 7: List of Candidate Augmentation Elements.
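
To illustrate what elements without a magnitude range look like, the sketch below implements Invert and Autocontrast for an 8-bit grayscale image stored as a list of rows. This is a simplified illustration under that assumed image representation, not the implementation used in the experiments, and the function names are hypothetical.

```python
def invert(image):
    """Invert an 8-bit grayscale image given as a list of pixel rows."""
    return [[255 - v for v in row] for row in image]


def autocontrast(image):
    """Stretch pixel values linearly so the darkest pixel maps to 0
    and the brightest pixel maps to 255."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant image has no contrast to stretch; return a copy.
        return [row[:] for row in image]
    span = hi - lo
    # Integer arithmetic with rounding to the nearest level.
    return [[((v - lo) * 255 + span // 2) // span for v in row] for row in image]


tiny = [[50, 100], [150, 200]]
print(invert(tiny))
print(autocontrast(tiny))
```

Neither function takes a strength parameter, which is why such elements carry no magnitude range in the table.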

Appendix E Datasets Splitting Details

Dataset                         Train Set Size    Validation Set Size    Test Set Size
CIFAR-10 CIFAR ()               40,000            10,000                 10,000
CIFAR-100 CIFAR ()              40,000            10,000                 10,000
Reduced ImageNet ImageNet ()    128,000           50,000                 50,000
Table 8: Datasets Splitting Details. On both CIFAR CIFAR () datasets, we use a validation set of 10,000 images, randomly split from the original 50,000-image training set, to calculate the validation accuracy during the search. For ImageNet ImageNet (), we use a reduced subset of the ImageNet training set when searching for policies, although our method is efficient enough to be run directly on the full ImageNet. This subset contains 128,000 images from randomly chosen classes. We also set aside a validation set of 50,000 images, split from the training dataset, which has no intersection with the reduced training subset and contains the same classes, to compute the validation accuracy.
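
The CIFAR split described above can be sketched as a random partition of sample indices. The helper name `split_indices` and the fixed seed are assumptions made for reproducibility of the example; the actual splitting code may differ.

```python
import random


def split_indices(num_samples, num_val, seed=0):
    """Randomly partition range(num_samples) into train and validation indices."""
    indices = list(range(num_samples))
    rng = random.Random(seed)  # local generator, so global state is untouched
    rng.shuffle(indices)
    val_idx = indices[:num_val]
    train_idx = indices[num_val:]
    return train_idx, val_idx


# 50,000 original CIFAR training images -> 40,000 train / 10,000 validation.
train_idx, val_idx = split_indices(50_000, 10_000)
print(len(train_idx), len(val_idx))
```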

Appendix F More Implementation Details

CIFAR. Once the policies have been learned, they are applied to train the models again from scratch, as well as to other network models in order to investigate the transferability of the policies between different network models. For ResNet-18 and Wide-ResNet-28-10, we use a mini-batch size of 256 and SGD with a Nesterov momentum of 0.9. The weight decay is set to , and the cosine learning rate scheme is utilized with the maximum learning rate of . The number of epochs is set to 300. For PyramidNet+ShakeDrop and Shake-Shake (26 d), we use the same settings as those in FAA ().
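
A cosine learning rate scheme of the kind mentioned above can be sketched as follows. This is a common per-epoch formulation without warm-up, which is an assumption, and the maximum learning rate of 0.1 is a placeholder chosen for the example, not the value used in the experiments.

```python
import math


def cosine_lr(epoch, total_epochs, max_lr, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing.

    Starts at max_lr for epoch 0 and decays to min_lr at total_epochs.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_EPOCHS = 300   # number of epochs stated in the text
PLACEHOLDER_LR = 0.1  # hypothetical value, for illustration only

for epoch in (0, 75, 150, 225, 300):
    print(epoch, round(cosine_lr(epoch, TOTAL_EPOCHS, PLACEHOLDER_LR), 5))
```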

For a fair comparison among different augmentation methods, we apply a basic pre-processing following the convention of the state-of-the-art CIFAR-10 models: standardizing the data, random horizontal flips with 50% probability, zero-padding and random crops, and finally Cutout Cutout () with 16×16 pixels. During our comparison, the searched policy is applied on top of this basic pre-processing step. That is, for each input training image, the basic pre-processing is performed first, then the policies learned by an augmentation method are applied, and finally Cutout is applied.
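
For reference, a Cutout step with a 16×16 mask can be sketched as below for a single-channel image stored as a list of rows. Centering the mask at a random pixel and clipping it at the image border is an assumption about the variant used, and the function signature is hypothetical.

```python
import random


def cutout(image, size=16, rng=random):
    """Return a copy of the image with a square region set to zero.

    The square is centered at a uniformly random pixel and clipped at the
    image border, so fewer than size * size pixels may be masked.
    """
    height, width = len(image), len(image[0])
    cy = rng.randrange(height)
    cx = rng.randrange(width)
    top, bottom = max(0, cy - size // 2), min(height, cy + size // 2)
    left, right = max(0, cx - size // 2), min(width, cx + size // 2)
    out = [row[:] for row in image]  # leave the input untouched
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = 0
    return out


# A 32x32 image of ones, the spatial size of a CIFAR image.
image = [[1] * 32 for _ in range(32)]
masked = cutout(image, size=16, rng=random.Random(0))
print(sum(v == 0 for row in masked for v in row), "pixels masked")
```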

ImageNet. Once the policies have been obtained, they are applied to train ResNet-50 from scratch, as well as another network model, ResNet-200, to study policy transferability. The hyper-parameters used to train ResNet-50 and ResNet-200 are the same as those in AA (), except that we use a cosine learning rate scheduler. Moreover, our learned policies are applied on top of a standard Inception-style pre-processing, which includes standardizing, random horizontal flips with 50% probability, and random distortions of colors AA59 (). This pre-processing step is applied uniformly to all the methods in comparison.

Appendix G Details of Searched Policies

Figure 6: Probability distribution of the searched policies on CIFAR-10 (left) and ImageNet (right) over time. We calculate the marginal distribution parameters of the first element in our augmentation operations. Specifically, we convert the policy parameter to a 36 × 36 matrix, then sum up each row of the matrix following normalization. As shown in the figure, our searched policies have strong preferences: only a few augmentation operations are preserved eventually, and the probabilities of many other operations are close to zero, which is quite different from other methods such as AA (); FAA (); AAA ().
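
The marginalization described in the caption of Fig.6 can be sketched as follows. The sketch assumes that the policy parameter is a matrix of unnormalized scores, that normalization is a softmax over all entries, and that rows index the first element of an operation; these are assumptions for illustration, and a 3 × 3 toy matrix stands in for the 36 × 36 one.

```python
import math


def first_element_marginal(scores):
    """Normalize a score matrix with a softmax over all entries, then sum
    each row to get the marginal distribution over the row index."""
    peak = max(v for row in scores for v in row)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [sum(row) / total for row in exps]


# Toy scores: the first row is favoured over the other two.
toy = [
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
marginal = first_element_marginal(toy)
print([round(p, 3) for p in marginal])
```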


  1. H. S. Baird. Document image defect models. In Structured Document Image Analysis, pages 546–556. Springer, 1992.
  2. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
  3. B. Colson, P. Marcotte, and G. Savard. An overview of bilevel optimization. Annals of operations research, 153(1):235–256, 2007.
  4. E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 113–123, 2019.
  5. E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical automated data augmentation with a reduced search space. arXiv preprint arXiv:1909.13719, 2019.
  6. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  7. T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  8. X. Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
  9. D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5927–5935, 2017.
  10. K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  11. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  12. D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen. Population based augmentation: Efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393, 2019.
  13. H. Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.
  14. A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  15. A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  16. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  17. S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim. Fast autoaugment. In Advances in Neural Information Processing Systems, pages 6662–6672, 2019.
  18. C. Lin, M. Guo, C. Li, X. Yuan, W. Wu, J. Yan, D. Lin, and W. Ouyang. Online hyper-parameter learning for auto-augmentation strategy. In Proceedings of the IEEE International Conference on Computer Vision, pages 6579–6588, 2019.
  19. H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
  20. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  21. R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems, pages 7827–7838, 2018.
  22. H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
  23. E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
  24. S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  25. W. S. Sarle. Stopped training and other remedies for overfitting. Computing science and statistics, pages 352–360, 1996.
  26. S. Saxena and J. Verbeek. Convolutional neural fabrics. In NIPS, pages 4053–4061, 2016.
  27. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  28. C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.
  29. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  30. P. Y. Simard, D. Steinkraus, J. C. Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, 2003.
  31. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  32. A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1653–1660, 2014.
  33. O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
  34. R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 7(8), 2015.
  35. L. Xie and A. L. Yuille. Genetic cnn. In ICCV, pages 1388–1397, 2017.
  36. S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  37. H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  38. X. Zhang, Q. Wang, J. Zhang, and Z. Zhong. Adversarial autoaugment. arXiv preprint arXiv:1912.11188, 2019.
  39. B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
  40. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2(6), 2017.