Improving Auto-Augment via Augmentation-Wise Weight Sharing
Recent progress in automatically searching for augmentation policies has substantially boosted performance on various tasks. A key component of automatic augmentation search is the evaluation process for a particular augmentation policy, which returns the reward and usually runs thousands of times. A plain evaluation process, which includes full model training and validation, is time-consuming. To achieve efficiency, many methods sacrifice evaluation reliability for speed. In this paper, we dive into the dynamics of augmented training of the model. This inspires us to design a powerful and efficient proxy task based on Augmentation-Wise Weight Sharing (AWS) to form a fast yet accurate evaluation process. Comprehensive analysis verifies the superiority of this approach in terms of effectiveness and efficiency. The augmentation policies found by our method achieve the best accuracy compared with existing auto-augmentation search methods. On CIFAR-10, we achieve a top-1 error rate of 1.24%, which is currently the best-performing single model without extra training data. On ImageNet, we get a top-1 error rate of 20.36% for ResNet-50, which is a 3.34% absolute error rate reduction over the baseline augmentation.
Deep learning techniques have been heavily utilized in computer vision and have made remarkable progress in many tasks, such as image classification AlexNet (); AA22 (); zhang2017mixup (), object detection FAA21 (); FAA27 (), segmentation FAA2 (); FAA9 (), image captioning vinyals2015show (), and human pose estimation toshev2014deeppose (). Overfitting is a commonly acknowledged issue of deep learning algorithms, and various regularization techniques have been proposed in different tasks to fight it. Data augmentation, which increases both the amount and the diversity of the data by applying semantically invariant image transformations to training samples AA22 (); AA11 (), is the most commonly used regularization due to its simplicity and effectiveness. There are various frequently used augmentation operations for image data, including traditional image transformations such as resizing, cropping, shearing, horizontal flipping, translation, and rotation. Recently, several special operations, such as Cutout Cutout () and Sample Pairing SP (), have also been proposed. It has been widely observed LeNet (); AlexNet (); DeepImage () that augmentation strategies influence the final performance of deep learning models considerably.
However, choosing appropriate data augmentation strategies is time-consuming and requires extensive efforts from experienced human experts. Hence automatic augmentation techniques AA (); RA (); FAA (); PBA (); OHL (); AAA () are leveraged to search for performant augmentation strategy according to specific datasets and models. Numerous experiments show that these searched policies are superior to hand-crafted policies in many computer vision tasks. These techniques design different evaluation processes to conduct searches.
The most straightforward approach AA () uses a plain evaluation process that fully trains the model with different augmentation policies repeatedly to obtain the reward for the reinforcement learning agent. Inevitably, this approach is time-consuming, as it requires a tremendous amount of computational resources to train thousands of child models to completion.
To alleviate the computational cost, most of the efficient works PBA (); OHL (); AAA () utilize a joint optimization approach that evaluates the strategies every few iterations, getting rid of training multiple networks from scratch repeatedly. Although efficient, most of these methods achieve only mediocre performance, similar to that of random augmentation RA (), due to the compromised evaluation process. Specifically, the compromised evaluation process distorts the ranking of augmentation strategies, since the ranks of models trained with too few iterations are known to be inconsistent with those of the final models trained with sufficient iterations. This phenomenon is shown in Fig.2, where the relative ranks change a lot during the whole training process.
An ideal evaluation process should be efficient as well as highly reliable, so that it produces accurate rewards for augmentation strategies. In order to achieve this, we dive into the training dynamics with different data augmentations. We observe that the augmentation operations in the later training period are more influential. Based on this, we design a new evaluation process, which is a proxy task with an Augmentation-Wise Weight Sharing (AWS) strategy. Compared with AA (), we improve efficiency significantly via this weight sharing strategy and make it affordable to search directly on large-scale datasets, while the performance gains are also substantial. Compared with previous efficient methods, our method produces more reliable evaluation, as shown in Sec.4.4, with competitive computation resources. Our main contributions can be summarized as follows: 1) We propose an efficient yet reliable proxy task utilizing a novel augmentation-wise weight sharing strategy to be the evaluation process for augmentation search methods. 2) We design a new search pipeline for auto-augmentation search utilizing the proposed proxy task and achieve the best accuracy compared with existing auto-augmentation search methods.
The augmentation policies found by our approach achieve outstanding performance. On CIFAR-10, we achieve a top-1 error rate of 1.24%, which is the currently best-performing single model without extra training data. On ImageNet, we get a top-1 error rate of 20.36% for ResNet-50, which leads to 3.34% improvement over the baseline augmentation. The augmentation policies we found on both CIFAR and ImageNet benchmark will be released to the public as an off-the-shelf augmentation policy to push the boundary of the state-of-the-art performance.
2 Related Work
2.1 Auto Machine Learning and Neural Architecture Search
Auto Machine Learning (AutoML) aims to free human practitioners and researchers from menial tasks. Recent advances focus on automatically searching neural network architectures. One of the first attempts zoph2017learning (); zoph2016neural () utilized reinforcement learning to train a controller representing a policy to generate a sequence of symbols representing the network architecture. An alternative to reinforcement learning is evolutionary algorithms, which evolve the topology of architectures by mutating the best architectures found so far real2018regularized (); xie2017genetic (); saxena2016convolutional (). Recent efforts such as liu2018darts (); luo2018neural (); pham2018efficient () utilized several techniques to reduce the search cost. Note that AA () utilized a similar controller inspired by zoph2017learning (), whose training is time-consuming, to guide the augmentation policy search. Our auto-augmentation strategy is much more efficient and economical compared to these methods.
2.2 Automatic Augmentation
Recently, some automatic augmentation approaches AA (); FAA (); PBA (); OHL (); RA (); AAA () have been proposed. Their common purpose is to search for powerful augmentation policies automatically, by which the performance of deep models can be enhanced further. AA () formulates the automatic augmentation policy search as a discrete search problem and employs a reinforcement learning framework to search the policy consisting of possible augmentation operations, which is most closely related to our work. Our proposed method also optimizes the distribution of the discrete augmentation operations, but it is much more computationally economical, benefiting from the weight sharing technique. Many other previous works PBA (); OHL (); AAA () take the single-step approximation to reduce the computational cost dramatically by getting rid of training multiple networks.
As a powerful regularization technique, data augmentation is applied to relieve overfitting shorten2019survey (). Another popular regularization method is early stopping sarle1996stopped (), which computes the validation error periodically and stops training when the validation error starts to go up. This suggests that overfitting may mostly occur in the late stages of training. Thus, a natural conjecture can be raised: data augmentation improves the generalization of the model mainly in the later training process.
To investigate and verify this, we explore the relationship between the performance gains and the augmented periods. We train ResNet-18 ResNet () on CIFAR-10 CIFAR () for 300 epochs in total, some of which are augmented by the searched policy of AutoAug AA (). Specifically, we apply augmentation in the start or the end $N_{aug}$ epochs, where $N_{aug}$ denotes the number of epochs with augmentation. We repeat each experiment eight times to ensure reliability. The result is shown in Fig.2, which indicates two main pieces of evidence as follows: 1) With the same number of augmented epochs $N_{aug}$, applying data augmentation in the later stages consistently gets better model performance, as the dashed curve is always above the solid one. 2) In order to train models to the same level of performance, conducting data augmentation in the later stages requires fewer epochs of augmentation compared with conducting it in the early stages, as the dashed curve is always on the left of the solid one.
In sum, our empirical results show that data augmentation functions more in the late training stages, which can be exploited to produce efficient and reliable reward estimation for different augmentation strategies.
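The two schedules compared in this experiment (augmenting only the first epochs versus only the last epochs) can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name `is_augmented`, the argument names, and the value `n_aug = 50` are hypothetical, and only the total of 300 epochs comes from the text.

```python
def is_augmented(epoch: int, total_epochs: int, n_aug: int, where: str) -> bool:
    """Return True if the 0-indexed `epoch` lies inside the augmented window."""
    if not 0 <= n_aug <= total_epochs:
        raise ValueError("n_aug must lie in [0, total_epochs]")
    if where == "start":
        # Augment only the first n_aug epochs.
        return epoch < n_aug
    if where == "end":
        # Augment only the last n_aug epochs.
        return epoch >= total_epochs - n_aug
    raise ValueError("where must be 'start' or 'end'")


total_epochs = 300  # total training length used in the experiment above
n_aug = 50          # illustrative value only, not taken from the paper

early = [e for e in range(total_epochs) if is_augmented(e, total_epochs, n_aug, "start")]
late = [e for e in range(total_epochs) if is_augmented(e, total_epochs, n_aug, "end")]
```

Both schedules spend the same number of epochs on augmented data; they differ only in where that budget is placed, which is the variable the experiment isolates.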
Augmentation-Wise Weight Sharing
Inspired by our observation, we propose a new proxy task for automatic augmentation. It consists of two stages. In the first stage, we choose a shared augmentation strategy to train the shared weights, namely, the augmentation-wise shared model weights. We borrow the weight sharing paradigm from NAS, which shares weights among different network architectures to speed up the search. To the best of our knowledge, this is the first work to investigate the weight sharing technique for automatic augmentation search. In the second stage, we conduct the policy search efficiently by fine-tuning from these shared weights. Reliability is preserved because augmentation operations function more in the late stages, and the experiments in Sec.4.4 verify this.
3.2 Auto-Aug Formulation
Auto-Aug strategy aims to find a set of augmentation operations for training data that maximizes the performance of a deep model. In this work, we denote the training set as $\mathcal{D}_{tr}$ and the validation set as $\mathcal{D}_{val}$. We use $x$ and $y$ to denote an image and its label. Here we are searching for a data augmentation strategy for a specific model denoted as $\mathcal{M}_{\omega}$, which is parameterized by $\omega$. We regard our augmentation strategy as a distribution $p_{\theta}(\mathcal{O})$ over candidate image transformations $\mathcal{O} \in \mathbb{O}$, which is controlled by $\theta$. $\mathbb{O}$ denotes the set of operations. A more detailed construction of the augmentation policy space is introduced in Sec.3.4.
The objective of obtaining the best augmentation policy (solving for $\theta$) can be described as a bilevel optimization problem. The inner level is the model weight optimization, which solves for the optimal $\omega^{*}$ given a fixed augmentation policy $p_{\theta}$
$$\omega^{*} = \arg\min_{\omega} \frac{1}{|\mathcal{D}_{tr}|} \sum_{(x,y) \in \mathcal{D}_{tr}} \mathbb{E}_{\mathcal{O} \sim p_{\theta}} \left[ \mathcal{L}\left(\mathcal{M}_{\omega}(\mathcal{O}(x)), y\right) \right], \tag{1}$$
where $\mathcal{L}$ denotes the loss function, i.e. cross entropy loss.
The outer level is the augmentation policy optimization, which optimizes the policy parameter $\theta$ given the result of the inner level problem. Notably, the objective for the optimization of $\theta$ is the validation accuracy ACC
$$\theta^{*} = \arg\max_{\theta} \mathrm{ACC}(\omega^{*}, \mathcal{D}_{val}), \tag{2}$$
where $\theta^{*}$ denotes the parameter of the optimal policy and $\mathrm{ACC}(\omega^{*}, \mathcal{D}_{val})$ denotes the validation accuracy obtained by $\omega^{*}$. This problem is a typical bilevel optimization problem bilevel (). Solving the inner loop exactly is extremely time-consuming. Thus, it is almost impossible to generalize this approach to a large-scale dataset without compromise AA (). More recent works focusing on reducing the time complexity of solving bilevel optimization problems PBA (); OHL (); AAA () have been proposed. They take a single-step approximation borrowed from the NAS literature liu2018darts () to avoid training multiple networks from scratch. Instead of solving the inner level problem, the single-step approximation takes only one step for $\omega$ based on the previous $\theta$, and utilizes the resulting weights, which approximate the solution of the inner level problem, to update $\theta$. These approaches are empirically efficient, but RA () shows it is possible to achieve comparable or even stronger performance using a random augmentation policy. Thus, a new approach to perform an efficient auto-augmentation search is desirable.
3.3 Our Proxy Task
Inspired by our observation that the later augmentation operations are more influential than the early ones, in this paper, we propose a new proxy task that substitutes the process of solving the inner level optimization with a computationally efficient evaluation process.
The basic idea of our proxy task is to partition the augmented training of the network parameters (i.e., the inner level optimization) into two parts. In the first part (i.e., the early stage), a shared augmentation policy is applied to train the network regardless of the current policy given by the outer level optimization; in the second part (i.e., the late stage), the network model is fine-tuned from the augmentation-wise shared weights with the given policy so that it can be used to evaluate the performance of this policy. Since the shared augmented training in the first part is independent of the given policy, it only needs to be performed once for all candidate augmentation policies in the search, which significantly speeds up the optimization. We call this strategy augmentation-wise weight sharing.
Now our problem boils down to finding a good shared augmentation policy for the first part of training. In the following, we show that this can be obtained trivially via the following proposition.
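Proposition. Let $\tau = (\mathcal{O}_1, \mathcal{O}_2, \dots, \mathcal{O}_{N})$ denote an arbitrary augmentation trajectory consisting of $N$ augmentation operations. Let $p(\tau)$ and $\hat{p}(\tau)$ be the trajectory distributions without and with the augmentation-wise weight sharing, that is: $p(\tau) = \prod_{i=1}^{N} p_{\theta}(\mathcal{O}_i)$, and $\hat{p}(\tau) = \prod_{i=1}^{N_{share}} p_{share}(\mathcal{O}_i) \prod_{i=N_{share}+1}^{N} p_{\theta}(\mathcal{O}_i)$, where $N_{share}$ denotes the number of augmentation operations in the early training stage. Here $p_{share}$ and $p_{\theta}$ indicate the shared and the given policy, respectively. The KL-divergence between $p$ and $\hat{p}$ is minimized when $p_{share}$ is uniform sampling, i.e., $p_{share}(\mathcal{O}) = 1/K$ for all the possible $\mathcal{O}$, where $K$ is the number of candidate operations. The detailed proof is provided in the supplementary materials.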
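The above proposition tells us that with a simple uniform sampling for augmentation-wise weight sharing, the obtained augmentation trajectories would be similar to those without using the augmentation-wise weight sharing. This is a favorable property for the following reason. To enhance the reliability of the search algorithm, it is necessary to maintain a high correlation between $\mathrm{ACC}(\bar{\omega}_{\theta}, \mathcal{D}_{val})$ and $\mathrm{ACC}(\omega^{*}, \mathcal{D}_{val})$, where $\bar{\omega}_{\theta}$ indicates the network parameters trained by our proxy task and $\omega^{*}$ the solution of the inner level problem for the same policy. So, it is desirable to make $\bar{\omega}_{\theta}$ and $\omega^{*}$ as close as possible. We can achieve this by producing similar augmentation trajectories via employing a uniform sampling for the shared augmentation policy $p_{share}$.
We use a uniform distribution $p_{share}$ over the augmentation transforms to train the shared parameter checkpoint $\omega_{share}$, with the loss $\mathcal{L}$, the model $\mathcal{M}_{\omega}$, and the training set $\mathcal{D}_{tr}$ of Sec.3.2:
$$\omega_{share} = \arg\min_{\omega} \frac{1}{|\mathcal{D}_{tr}|} \sum_{(x,y) \in \mathcal{D}_{tr}} \mathbb{E}_{\mathcal{O} \sim p_{share}} \left[ \mathcal{L}\left(\mathcal{M}_{\omega}(\mathcal{O}(x)), y\right) \right]. \tag{3}$$
In the second part of training, to get the performance estimate for a particular augmentation policy with parameter $\theta$, we load the checkpoint $\omega_{share}$ and fine-tune it with this augmentation policy. We denote the parameter obtained by fine-tuning $\omega_{share}$ with the augmentation policy $p_{\theta}$ as $\bar{\omega}_{\theta}$. Note that obtaining $\bar{\omega}_{\theta}$ is very cheap compared with training from scratch. Thus, we optimize the augmentation policy parameters with
$$\theta^{*} = \arg\max_{\theta} \mathrm{ACC}(\bar{\omega}_{\theta}, \mathcal{D}_{val}). \tag{4}$$
In other words, we obtain $\omega_{share}$ once and then reuse it for every candidate policy, repeating only the late training process to optimize the policy parameter.
Moreover, by adjusting the number of epochs of fine-tuning, we can still maintain the reliability of policy evaluation to a large extent, which is verified in the supplementary material. In Sec.4.4 we study the superiority of this proxy task, as we empirically find that there is a strong correlation between $\mathrm{ACC}(\bar{\omega}_{\theta}, \mathcal{D}_{val})$ and $\mathrm{ACC}(\omega^{*}, \mathcal{D}_{val})$.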
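The two-stage proxy described in this section can be sketched structurally as below. This is a hedged illustration rather than the authors' implementation: `aws_rewards`, `uniform_policy`, `toy_train`, and `toy_evaluate` are hypothetical names, the "model" is a dictionary with toy dynamics, and a fixed list of candidate policies stands in for the policies that a search algorithm would propose one after another.

```python
import copy
from typing import Callable, Dict, List, Sequence

Weights = Dict[str, float]


def uniform_policy(num_ops: int) -> List[float]:
    """Shared policy for the early stage: every operation is equally likely."""
    return [1.0 / num_ops] * num_ops


def aws_rewards(
    init_weights: Weights,
    candidate_policies: Sequence[Sequence[float]],
    train: Callable[[Weights, Sequence[float], int], Weights],
    evaluate: Callable[[Weights], float],
    shared_epochs: int,
    finetune_epochs: int,
) -> List[float]:
    """Return one reward per candidate policy using augmentation-wise weight sharing."""
    num_ops = len(candidate_policies[0])
    # Stage 1: train the shared checkpoint once, independently of any candidate.
    shared = train(copy.deepcopy(init_weights), uniform_policy(num_ops), shared_epochs)
    rewards = []
    for policy in candidate_policies:
        # Stage 2: fine-tune a copy of the shared checkpoint with the candidate policy.
        tuned = train(copy.deepcopy(shared), policy, finetune_epochs)
        rewards.append(evaluate(tuned))
    return rewards


# Toy stand-ins (assumptions for illustration only) that count training epochs.
epoch_counter = {"total": 0}


def toy_train(weights: Weights, policy: Sequence[float], epochs: int) -> Weights:
    epoch_counter["total"] += epochs
    weights["score"] += epochs * policy[0]  # made-up dynamics, not a real model
    return weights


def toy_evaluate(weights: Weights) -> float:
    return weights["score"]


policies = [[0.5, 0.5], [0.9, 0.1], [0.1, 0.9]]
rewards = aws_rewards({"score": 0.0}, policies, toy_train, toy_evaluate,
                      shared_epochs=10, finetune_epochs=2)
```

In this toy run the three candidates cost 10 + 3 × 2 = 16 training epochs in total, whereas training each candidate for 12 epochs from scratch would cost 36; the saving grows with the number of candidates because the shared stage is paid only once.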
3.4 Augmentation Policy Space and Search Pipeline
Augmentation Policy Space In this paper, we regard the policy parameter $\theta$ as defining a probability distribution over the possible augmentation operations. Let $K$ be the number of available data augmentation operations in the search space, and $\mathbb{O} = \{\mathcal{O}_1, \dots, \mathcal{O}_K\}$ be the set of candidates. Accordingly, each operation $\mathcal{O}_k$ has a probability of being selected, denoted by $p_{\theta}(\mathcal{O}_k)$. For each training image, we sample an augmentation operation from the distribution $p_{\theta}$, then apply it to the image. Each augmentation operation is a pair of augmentation elements. Following AA (), we select the same augmentation elements, except Cutout Cutout () and Sample Pairing SP (). There are 36 different augmentation elements in total. The details are listed in the supplementary material.
More precisely, the augmentation distribution is a multinomial distribution which has $K$ possible outcomes. The probability of the $k$-th operation is a normalized sigmoid function of $\theta_k$. As a single augmentation operation is defined as a pair of two elements, resulting in $K = 36^2$ possible combinations (the same augmentation choice may repeat), we have
$$p_{\theta}(\mathcal{O}_k) = \frac{\sigma(\theta_k)}{\sum_{i=1}^{K} \sigma(\theta_i)}, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}. \tag{5}$$
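The policy distribution described above, a normalized sigmoid over all ordered pairs of the 36 augmentation elements, can be sketched with NumPy. This is an assumption-laden illustration: the names `policy_probs` and `sample_operation` are hypothetical, and mapping a flat index to an element pair with `divmod` is one possible convention, not necessarily the one used by the authors.

```python
import numpy as np

NUM_ELEMENTS = 36          # number of augmentation elements stated in the text
K = NUM_ELEMENTS ** 2      # ordered pairs with repetition allowed: 1296 operations


def policy_probs(theta: np.ndarray) -> np.ndarray:
    """Normalized sigmoid: sigma(theta_k) / sum_i sigma(theta_i)."""
    sig = 1.0 / (1.0 + np.exp(-theta))
    return sig / sig.sum()


def sample_operation(theta: np.ndarray, rng: np.random.Generator) -> tuple:
    """Draw one operation and decode it into (first_element, second_element)."""
    k = rng.choice(K, p=policy_probs(theta))
    # Assumed indexing convention: operation k = first * 36 + second.
    return divmod(int(k), NUM_ELEMENTS)


rng = np.random.default_rng(0)
theta = np.zeros(K)                 # all-equal parameters give a uniform policy
probs = policy_probs(theta)
first, second = sample_operation(theta, rng)
```

With all parameters equal, every operation receives probability 1/1296, which coincides with the uniform shared policy used for the first training stage.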
Search Pipeline As our proxy task is flexible, any heuristic search algorithm is applicable. In our implementation, we empirically find that Proximal Policy Optimization PPO () is good enough to find a good $\theta^{*}$ in Equ.4. We also utilize the baseline trick PGbaseline () to reduce the variance of the gradient estimation. The baseline function is an exponential moving average of previous rewards with a weight of 0.9. The complete task pipeline using the augmentation-wise weight sharing technique is presented in Algorithm 1.
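The baseline trick mentioned above can be sketched as follows. The text only states that the baseline is an exponential moving average of previous rewards with a weight of 0.9; this sketch assumes that the 0.9 weight is applied to the previous baseline value and that the baseline is initialized with the first reward. The class name `MovingAverageBaseline` and the reward values are hypothetical.

```python
class MovingAverageBaseline:
    """Exponential moving average of past rewards, used to center new rewards."""

    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.value = None  # no reward has been observed yet

    def advantage(self, reward: float) -> float:
        if self.value is None:
            # Assumption: start the average at the first observed reward.
            self.value = reward
        # The advantage uses the baseline built from previous rewards only.
        adv = reward - self.value
        # Update the moving average after computing the advantage.
        self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return adv


baseline = MovingAverageBaseline(decay=0.9)
# Made-up validation accuracies standing in for rewards of three sampled policies.
advantages = [baseline.advantage(r) for r in [0.90, 0.92, 0.91]]
```

Subtracting the baseline leaves the expected policy gradient unchanged while reducing its variance, which is why the centered reward rather than the raw accuracy is passed to the policy update.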
4 Experiments and Results
4.1 Datasets and Comparison Methods
Following the literature on automatic augmentation, we evaluate the performance of our proposed method on three classification datasets: CIFAR-10 CIFAR (), CIFAR-100 CIFAR (), and ImageNet ImageNet (). The detailed description and splits of these datasets are presented in the supplementary material. To fully demonstrate the advantage of our proposed method, we make a comprehensive comparison with state-of-the-art augmentation methods, including Cutout Cutout (), AutoAugment (AutoAug) AA (), Fast AutoAugment (Fast AA) FAA (), OHL-Auto-Aug (OHL) OHL (), PBA PBA (), Rand Augment (RandAug) RA (), and Adversarial AutoAugment (Adv. AA) AAA ().
4.2 Implementation Details
Cifar On CIFAR-10 and CIFAR-100, following the literature, we use ResNet-18 ResNet () and Wide-ResNet-28-10 WRN (), respectively, as the basic models to search the policies, and transfer the searched policies to other models, including Shake-Shake (26 d) SKSK () and PyramidNet+ShakeDrop PYN (). As mentioned, our training process is divided into two parts. The numbers of epochs of each part are set to and , respectively, leading to total number of epochs in the search process. The is set to . To optimize the policy, we use the Adam optimizer with a learning rate of , and . Some other details are studied and reported in the supplementary material.
ImageNet During the policy search process, we use ResNet-50 ResNet () as the basic model, and then transfer the policies to ResNet-200 ResNet (). The learning rate is set to 0.2. The numbers of epochs of the two training stages are set to and , respectively. Other hyper-parameters of the search process are the same as what we use for the CIFAR datasets. Some other details are studied and reported in the supplementary.
4.3 Comparison with the State of the Art
The comparisons between our AWS method and the state-of-the-art methods are reported in Tab.1 and Tab.3. To minimize the influence of randomness, we run our method eight times on CIFAR and four times on ImageNet, and report our test error rates in terms of Mean ± STD (standard deviation). For the other methods in comparison, we directly quote their results from the original papers. Except for Adv. AA AAA (), these methods only report the average test error rates. “Baseline” in Tab.1 and Tab.3 refers to the basic models using only the default pre-processing, without applying the searched augmentation policies or Cutout. For a fair comparison, we report our results both with and without the Enlarge Batch (EB) proposed by Adv. AA AAA (). By leveraging EB in practice, the mini-batch size is times larger, while the number of iterations is not changed. Besides, our searched policies have strong preferences, as only a few augmentation operations are preserved eventually, which is quite different from other methods like AA (); FAA (); AAA (). Details about them are presented in the supplementary material.
|Method||ResNet-18||WRN-28-10||Shake-Shake||PyramidNet+ShakeDrop|
|Cutout Cutout ()||3.62||3.08||2.56||2.31|
|Fast AA FAA ()||-||2.7||2.0||1.7|
|RandAug RA ()||-||2.7||2.0||1.5|
|AutoAug AA ()||3.46||2.68||1.99||1.48|
|PBA PBA ()||-||2.58||2.03||1.46|
|OHL OHL ()||3.29||2.61||-||-|
|Adv. AA (EB) AAA ()||-||1.90 ± 0.15|
|Ours||2.91 ± 0.062||1.95 ± 0.047|
|Ours (EB)||2.38 ± 0.041||1.57 ± 0.038|
|Method||WRN-28-10||Shake-Shake||PyramidNet+ShakeDrop|
|Cutout Cutout ()||18.41||16.0||12.19|
|Fast AA FAA ()||17.3||14.6||11.7|
|RandAug RA ()||16.7||-||-|
|AutoAug AA ()||17.1||14.3||10.67|
|PBA PBA ()||16.7||15.3||10.94|
|Adv. AA (EB) AAA ()|
|Ours (EB)||14.16 ± 0.055|
Results on CIFAR The results on CIFAR-10 are summarized in Tab.1. Comparing the results horizontally in Tab.1, it can be seen that our policies learned using ResNet-18 transfer well to training other network models like WRN WRN (), Shake-Shake SKSK () and PyramidNet PYN (). Compared with the baseline without the searched augmentation, the performance of all these models improves significantly after applying our searched policies for augmentation. Comparing the results vertically in Tab.1, our AWS method is the best performer across all four network architectures. Specifically, ours achieves the best top-1 test error of 1.24% with PyramidNet+ShakeDrop, which is better than that of the second-best performer Adv. AA AAA (), even though we, unlike AAA (), do not use Sample Pairing SP () for search. Consistent observations are found on the results on CIFAR-100 in Tab.1. Ours again performs best on all four network architectures among the methods in comparison.
|Method||ResNet-50 (top-1 / top-5)||ResNet-200 (top-1 / top-5)|
|Baseline||23.7 / 6.9||21.5 / 5.8|
|Fast AA FAA ()||22.4 / 6.3||19.4 / 4.7|
|AutoAug AA ()||22.4 / 6.2||20.0 / 5.0|
|RandAug RA ()||22.4 / 6.2||-|
|OHL OHL ()||21.07 / 5.68||-|
|Adv. AA (EB)||20.60 ± 0.15 / 5.53 ± 0.05||18.68 ± 0.18 / 4.70 ± 0.05|
|Ours||20.61 ± 0.17 / 5.49 ± 0.08||18.64 ± 0.16 / 4.67 ± 0.07|
|Ours ( or EB)||20.36 ± 0.15 / 5.41 ± 0.07||18.56 ± 0.14 / 4.62 ± 0.05|
Results on ImageNet The results on ImageNet are summarized in Tab.3. We report both the top-1 and the top-5 test errors following the convention. Adv. AA AAA () has evaluated its performance for different EB ratios. Its test accuracy improves rapidly as the ratio increases up to 8, and a further increase does not bring a significant improvement, so a ratio of 8 is finally used in Adv. AA. We tried to increase the batch size, but we can only use EB for ResNet-50 and EB for ResNet-200 due to the limited resources. As can be seen, we still achieve superior performance over the automatic augmentation works in comparison. Moreover, our outstanding performance using the heavy model ResNet-200 also verifies the generalization of our learned augmentation policies.
Results on computational cost We further compare the computational cost among different auto-augmentation methods and report the error reductions of WRN relative to Cutout. Following the existing works, the computation costs on CIFAR-10 are reported in Tab.4. In this table, we use the GPU hours used by OHL-Auto-Aug OHL () as the baseline, and report the time consumption of each method relative to this baseline. As can be seen, our method is times faster than AutoAugment AA () with an even better performance. Although it is slightly slower than OHL, our method has a salient performance advantage over it. Overall, our proposed method is a very promising approach: it has the best performance with acceptable computational cost.
4.4 Comparison Among Proxy Tasks
To verify the superiority of the proxy selected by us, we make comparisons among different proxy tasks using ResNet-18 on CIFAR-10. To solve the problem in Equ.4, we design four optional proxies, which are summarized in Tab.5. The correlations between the accuracy obtained with each proxy and the accuracy obtained with full training are investigated and shown in Fig.4. As can be seen in Tab.5, among the four options, our selection produces the highest Pearson correlation coefficient, 0.85, which outperforms the other options by a large margin. Specifically, the proxy that trains the model without data augmentation in the first stage only searches the augmentation policy in the second stage, like AutoAugment AA (). Its inferior performance to our proposed proxy may suggest that simply performing AutoAugment AA () only in the late stage could not lead to good results. As for the proxy without first-stage training, the network parameters in the second stage are randomly initialized. Its inferior performance to ours may suggest that insufficiently trained network parameters could not generate a reliable ranking for rewarding. Finally, the proxy that evaluates a policy without fine-tuning is similar to that used in Fast AutoAugment FAA (), and its low correlation coefficient is consistent with the mediocre performance of Fast AutoAugment FAA () in our previous comparisons.
|Symbolic||The way to||The way to||pearsonr|
|(ours)||Train with Augmentation||Finetune+Eval||0.85|
|Train without Augmentation||Finetune+Eval||0.55|
|Train with Augmentation||Eval||0.045|
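The Pearson correlation coefficient used to compare the proxies can be computed as in the sketch below. The accuracy values are made up for illustration and are not results from the paper; `pearson_r`, `proxy_acc`, and `full_acc` are hypothetical names.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Illustrative numbers only: accuracy of five policies under a proxy task
# and under full training with the same policies.
proxy_acc = [0.930, 0.941, 0.935, 0.952, 0.947]
full_acc = [0.951, 0.958, 0.957, 0.966, 0.960]

r = pearson_r(proxy_acc, full_acc)
```

A coefficient close to 1 means the proxy ranks and spaces the policies almost as full training does, which is the property a reward signal for the search needs; a value near 0, as reported for the evaluation-only proxy, means the proxy reward carries little information about final accuracy.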
In this paper, we propose an innovative and elegant way to search for auto-augmentation policies effectively and efficiently. We first verify that data augmentation operations function more in the late training stages. Based on this phenomenon, we propose an efficient and reliable proxy task for fast evaluation of augmentation policies and solve the auto-augmentation search problem with an augmentation-wise weight sharing proxy. Intensive empirical evaluations show that the proposed AWS auto-augmentation outperforms all previous searched or handcrafted augmentation policies. To our knowledge, this is the first time a weight sharing proxy paradigm has been applied to augmentation search. The augmentation policies we found on both the CIFAR and ImageNet benchmarks are released to the public as off-the-shelf augmentation policies to push the boundary of state-of-the-art performance.
In this paper, we propose a new framework to conduct an efficient and reliable Automated Augmentation (AutoAug) search and achieve superior performance compared with existing methods. AutoAug enhances the performances of deep models as a typical Automated Machine Learning (AutoML) technique.
For fundamental research and ML applications, our research contributes towards many computer vision areas that benefit from image data augmentations. It may help reduce the demand for data scientists by enabling domain experts to automatically design tailored augmentation strategies without extensive knowledge of statistics and machine learning.
For broader societal implications, as an AutoML technique, our approach can be utilized to build models and establish reasonable lower bounds on their performance quickly and cheaply. It may be useful and powerful to ML practitioners in various sectors, such as the media industry, the transportation industry, and the automated production industries. However, each of these uses may result in job losses. Other issues, such as personal privacy leaks, may also arise when this technique is used by malicious parties. In summary, this technique may be socially beneficial or harmful, depending on its users. We encourage researchers, general practitioners, and anyone else to use it for social benefit, rather than to infringe on the interests of individuals and the nation or to threaten social stability.
Appendix A Ablation Study
In this ablation study, we further investigate the power of the policies searched by our approach and the closely related method AutoAug AA (). We rank the augmentation operations based on their probabilities in decreasing order. Therefore, the operations ranked on the top could be deemed as the most important augmentations. Then we gradually remove the most important operations from the searched policy one by one and investigate the change of the Top-1 test error rates, as reported in Tab.6. As can be seen, when the most important operations are removed gradually, the performance of the AutoAug remains similar. On the contrary, during this process, the performance of ours drops significantly. This shows that the augmentations in our policy are much more powerful than those in the AutoAug.
|AutoAug AA () (our impl.)||3.40 ± 0.070||3.45 ± 0.066||3.36 ± 0.065||3.46 ± 0.082|
|Ours||2.91 ± 0.062||3.10 ± 0.056||3.13 ± 0.099||3.19 ± 0.110|
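The ablation procedure, ranking operations by probability and removing the top ones one by one, can be sketched as follows. The text does not say how the remaining probabilities are treated after a removal; this sketch assumes they are renormalized to sum to one. The function name `remove_top_operations` and the example probabilities are hypothetical.

```python
import numpy as np


def remove_top_operations(probs, num_removed: int) -> np.ndarray:
    """Zero out the `num_removed` most probable operations and renormalize the rest."""
    probs = np.asarray(probs, dtype=float)
    # Rank operations by probability in decreasing order.
    order = np.argsort(-probs, kind="stable")
    kept = probs.copy()
    kept[order[:num_removed]] = 0.0
    total = kept.sum()
    if total == 0.0:
        raise ValueError("cannot remove every operation from the policy")
    # Assumption: the remaining operations share the full probability mass.
    return kept / total


searched = [0.5, 0.3, 0.15, 0.05]   # made-up policy over four operations
ablated = remove_top_operations(searched, num_removed=1)
```

Evaluating the policy returned for `num_removed` equal to 0, 1, 2, and so on yields one error rate per removal step, matching the layout of a row in Tab.6.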
Appendix B Proof
The proof of the proposition in Sec. 3.3 is as follows:
Proof. The KL-divergence between and is as follows:
Since is constant with respect to , the that minimizes should satisfy:
By Jensen’s inequality with the strictly concave function , we have:
The equality holds if and only if for all the possible . In other words, the KL divergence is minimized when is uniform sampling.
Appendix C Investigation of the Number of Epochs of Fine-tuning
We investigate different numbers of epochs in the late training stage(). By adjusting we can still maintain the reliability of policy evaluation to a large extent. The results are shown in Fig.5. We find that the policy optimization becomes hard to converge when a small is used, as the performances among different are too close. And when a large is used, is notably higher. However, the final performance does not benefit from this, as the correlation between and does not change much between and . Thus we choose as the final configuration for the efficiency.
Appendix D Augmentation Elements
Appendix E Datasets Splitting Details
|Dataset||Train Set Size||Validation Set Size||Test Set Size|
|CIFAR-10 CIFAR ()||40,000||10,000||10,000|
|CIFAR-100 CIFAR ()||40,000||10,000||10,000|
|Reduced ImageNet ImageNet ()||128,000||50,000||50,000|
Appendix F More Implementation Details
Cifar Once the policies have been learned, they are applied to training the models again from scratch, as well as to other network models to investigate the transferability of the policies between different network models. For ResNet-18 and Wide-ResNet-28-10, we use a mini-batch size of 256 and SGD with a Nesterov momentum of 0.9. The weight decay is set to , and the cosine learning rate scheme is utilized with the maximum learning rate of . The number of epochs is set to 300. For PyramidNet+ShakeDrop and Shake-Shake (26 d), we use the same settings as those in FAA ().
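A cosine learning rate scheme of the kind mentioned above can be sketched as below. This assumes the common cosine annealing form without restarts or warm-up and a minimum learning rate of zero; the maximum learning rate of 0.1 is a placeholder chosen for illustration, not the value used in the paper, and `cosine_lr` is a hypothetical name.

```python
import math


def cosine_lr(epoch: int, total_epochs: int, max_lr: float, min_lr: float = 0.0) -> float:
    """Cosine annealing from max_lr at epoch 0 down to min_lr at epoch total_epochs."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_epochs = 300   # number of training epochs stated in the text
max_lr = 0.1         # placeholder value, an assumption for this sketch

schedule = [cosine_lr(e, total_epochs, max_lr) for e in range(total_epochs)]
```

The schedule decays slowly at first, fastest around the middle of training, and slowly again near the end, so the final epochs run with a small learning rate.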
For a fair comparison among different augmentation methods, we apply a basic pre-processing following the convention of the state-of-the-art CIFAR-10 models: standardizing the data, random horizontal flips with 50% probability, zero-padding and random crops, and finally Cutout Cutout () with 16×16 pixels. During our comparison, the searched policy is applied on top of this basic pre-processing step. That is, for each input training image, the basic pre-processing is first performed, then the policies learned by an augmentation method, and finally the Cutout.
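The order of operations described above (basic pre-processing, then the searched policy, then Cutout) can be sketched with NumPy. Several details are assumptions made for illustration: the padding of 4 pixels, the per-channel `mean` and `std` arguments, the Cutout square being clipped at the image border, and all function names. `policy_op` stands for whatever operation the searched policy samples.

```python
import numpy as np


def standardize(img, mean, std):
    """Per-channel standardization of an H x W x C image."""
    return (img - mean) / std


def random_flip(img, rng):
    """Horizontal flip with 50% probability."""
    return img[:, ::-1] if rng.random() < 0.5 else img


def pad_and_crop(img, rng, pad=4):
    """Zero-pad by `pad` pixels (assumed value) and take a random crop of the original size."""
    h, w = img.shape[:2]
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]


def cutout(img, rng, size=16):
    """Zero a size x size square centered at a random pixel, clipped at the border."""
    h, w = img.shape[:2]
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = img.copy()
    out[y0:y1, x0:x1] = 0.0
    return out


def preprocess(img, policy_op, rng, mean, std):
    """Basic pre-processing, then the searched policy, then Cutout."""
    img = standardize(img, mean, std)
    img = random_flip(img, rng)
    img = pad_and_crop(img, rng)
    img = policy_op(img)          # operation sampled from the searched policy
    return cutout(img, rng)


rng = np.random.default_rng(0)
image = np.ones((32, 32, 3))
result = preprocess(image, lambda x: x, rng, mean=np.zeros(3), std=np.ones(3))
```

Here the identity function stands in for the searched policy, so the example only exercises the fixed parts of the pipeline.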
ImageNet Once the policies have been obtained, they are applied to training ResNet-50 from scratch, as well as another network model ResNet-200 for the study of policy transferability. The hyper-parameters used to train ResNet-50 and ResNet-200 are the same as those in AA () except a cosine learning rate scheduler. Moreover, our learned policies are applied on top of a standard Inception-style pre-processing, which includes standardizing, random horizontal flips with 50% probability, and random distortions of colors AA59 (). This pre-processing step is uniformly applied to all the methods in comparison.
Appendix G Details of Searched Policies
References
- H. S. Baird. Document image defect models. In Structured Document Image Analysis, pages 546–556. Springer, 1992.
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
- B. Colson, P. Marcotte, and G. Savard. An overview of bilevel optimization. Annals of operations research, 153(1):235–256, 2007.
- E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 113–123, 2019.
- E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical automated data augmentation with a reduced search space. arXiv preprint arXiv:1909.13719, 2019.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
- T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
- X. Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
- D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5927–5935, 2017.
- K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen. Population based augmentation: Efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393, 2019.
- H. Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.
- A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim. Fast autoaugment. In Advances in Neural Information Processing Systems, pages 6662–6672, 2019.
- C. Lin, M. Guo, C. Li, X. Yuan, W. Wu, J. Yan, D. Lin, and W. Ouyang. Online hyper-parameter learning for auto-augmentation strategy. In Proceedings of the IEEE International Conference on Computer Vision, pages 6579–6588, 2019.
- H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
- W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
- R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems, pages 7827–7838, 2018.
- H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
- E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
- S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
- W. S. Sarle. Stopped training and other remedies for overfitting. Computing science and statistics, pages 352–360, 1996.
- S. Saxena and J. Verbeek. Convolutional neural fabrics. In NIPS, pages 4053–4061, 2016.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.
- D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- P. Y. Simard, D. Steinkraus, J. C. Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, 2003.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
- A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1653–1660, 2014.
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
- R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 2015.
- L. Xie and A. L. Yuille. Genetic cnn. In ICCV, pages 1388–1397, 2017.
- S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
- X. Zhang, Q. Wang, J. Zhang, and Z. Zhong. Adversarial autoaugment. arXiv preprint arXiv:1912.11188, 2019.
- B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
- B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.