Improving Deep Learning using Generic Data Augmentation
Deep artificial neural networks require a large corpus of training data in order to effectively learn, where collection of such training data is often expensive and laborious. Data augmentation overcomes this issue by artificially inflating the training set with label preserving transformations. Recently there has been extensive use of generic data augmentation to improve Convolutional Neural Network (CNN) task performance. This study benchmarks various popular data augmentation schemes to allow researchers to make informed decisions as to which training methods are most appropriate for their data sets. Various geometric and photometric schemes are evaluated on a coarse-grained data set using a relatively simple CNN. Experimental results, run using 4-fold cross-validation and reported in terms of Top-1 and Top-5 accuracy, indicate that cropping in geometric augmentation significantly increases CNN task performance.
Luke Taylor University of Cape Town Department of Computer Science Cape Town, South Africa firstname.lastname@example.org Geoff Nitschke University of Cape Town Department of Computer Science Cape Town, South Africa email@example.com
Convolutional Neural Networks (CNNs) [?] are synonymous with deep learning, a hierarchical model of learning with multiple levels of representations, where higher levels capture more abstract concepts. A CNN's connectivity pattern, inspired by the animal visual cortex, enables the recognition and learning of spatial data such as images [?], audio [?] and text [?]. With recent developments of large data sets and increased computing power, CNNs have achieved state-of-the-art results in various computer vision tasks, including large-scale image and video classification [?]. However, most large data sets are not publicly available, and training a CNN on a small data-set makes it prone to over-fitting, inhibiting the CNN's capability to generalize to unseen invariant data.
A potential solution is to make use of Data Augmentation (DA) [?], a regularization scheme that artificially inflates the data-set by using label preserving transformations to add more invariant examples. Generic DA is a set of computationally inexpensive methods [?] that was previously used to reduce over-fitting when training a CNN for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [?], achieving state-of-the-art results at the time. This augmentation scheme consists of geometric and photometric transformations [?], [?].
Geometric transformations alter the geometry of the image with the aim of making the CNN invariant to changes in position and orientation. Example transformations include flipping, cropping, scaling and rotating. Photometric transformations amend the color channels with the objective of making the CNN invariant to changes in lighting and color. Examples include color jittering and Fancy Principal Component Analysis (PCA) [?], [?].
Complex DA is a scheme that artificially inflates the data set by using domain-specific synthesis to produce more training data. This scheme has become increasingly popular [?; ?] as it has the ability to generate richer training data compared to the generic augmentation methods. For example, Masi et al. [?] developed a facial recognition system using synthesized faces with different poses and facial expressions to increase appearance variability, enabling task performance comparable to state-of-the-art facial recognition systems while using less training data. At the frontier of data synthesis are Generative Adversarial Networks (GANs) [?], which can generate new samples after being trained on samples drawn from some distribution. For example, Zhang et al. [?] used a stacked construction of GANs to generate realistic images of birds and flowers from text descriptions.
Thus, DA is a scheme to further boost CNN performance and prevent over-fitting. The use of DA is especially well-suited when the training data is limited or laborious to collect. Complex DA, albeit being a powerful augmentation scheme, is computationally expensive and time-consuming to implement. A viable option is to apply generic DA as it is computationally inexpensive and easy to implement.
Prominent studies that comparatively evaluate various popular DA methods include those outlined in table 1 and described in the following. Chatfield et al. [?] addressed how different CNN architectures compared to each other by evaluating them on a common data-set. This study was mainly focused on rigorous evaluation of deep architectures and shallow encoding methods, though an evaluation of three augmentation methods was included. These consisted of flipping, and of combining flipping and cropping, on training images in the coarse-grained Caltech101 and Pascal VOC data-sets. In additional experiments the authors trained a CNN using gray-scale images, though lower task performance was observed. Overall results indicated that combining flipping and cropping yielded an increase in task performance. A shortcoming of this study was the small number of DA methods evaluated.
Mash et al. [?] benchmarked a variety of geometric augmentation methods for the task of aircraft classification, using a fine-grained data-set. Augmentation methods tested included cropping, rotating, re-scaling, polygon occlusion and combinations of these methods. The cropping scheme combined with occlusion yielded the most benefit, achieving an improvement over a benchmark task performance. Although this study evaluated various DA methods, photometric methods were not investigated and none were benchmarked on a coarse-grained data-set.
In line with this work, further research [?] noted that certain augmentation methods benefit from fine-grained data-sets. For example, rotation of training images has been used extensively to increase CNN task performance for galaxy morphology classification using a fine-grained data-set [?].
However, to date, there have been no comprehensive studies that comparatively evaluate various popular DA methods on large coarse-grained data-sets in order to ascertain the most appropriate DA method for any given data-set. Hence, this study's objective is to evaluate a variety of popular geometric and photometric augmentation schemes on the coarse-grained Caltech101 data-set. Using a relatively simple CNN based on that used by Zeiler and Fergus [?], the goal is to contribute empirical data to the field of deep learning to enable researchers to select the most appropriate generic augmentation scheme for a given data-set.
|Chatfield et al.||✓||✓|
|Mash et al.||✓|
2 Data Augmentation (DA) Methods
DA refers to any method that artificially inflates the original training set with label preserving transformations and can be represented as the mapping:
$$\phi : S \mapsto T$$
where $S$ is the original training set and $T$ is the augmented set of $S$. The artificially inflated training set is thus represented as:
$$S' = S \cup T$$
where $S'$ contains the original training set and the respective transformations defined by $\phi$.
Note that the term label preserving transformations refers to the fact that if image $x$ is an element of class $y$ then $\phi(x)$ is also an element of class $y$.
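As a minimal sketch of this definition (not the authors' implementation), the inflated training set can be built by keeping every original sample and appending each transformed copy under the label of its source image. The names `augment` and `brighten`, and the nested-list image representation, are hypothetical and used only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]      # hypothetical grayscale image: rows of pixel values
Sample = Tuple[Image, str]   # (image, class label)


def augment(dataset: Sequence[Sample],
            transforms: Sequence[Callable[[Image], Image]]) -> List[Sample]:
    """Return the original samples followed by every transformed copy."""
    augmented = list(dataset)  # the original training set
    for image, label in dataset:
        for transform in transforms:
            # Label preserving: the transformed image keeps its source label.
            augmented.append((transform(image), label))
    return augmented


def brighten(image: Image) -> Image:
    """Illustrative transform: raise every pixel by 10, capped at 255."""
    return [[min(255, pixel + 10) for pixel in row] for row in image]


train = [([[1, 2], [3, 4]], "cat"), ([[5, 6], [7, 8]], "dog")]
inflated = augment(train, [brighten])
```

With one transform, the inflated set is twice the size of the original; with several transforms it grows proportionally.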
As there is an endless array of mappings that satisfy the constraint of being label preserving, this paper evaluates popular augmentation methods used in recent literature [?], [?] as well as a new augmentation method (section 2.2). Specifically, seven augmentation methods were evaluated (figure 2): three geometric methods, three photometric methods, and No-Augmentation, which acted as the task performance benchmark for all experiments.
2.1 Geometric Methods
These are transformations that alter the geometry of the image by mapping the individual pixel values to new destinations. The underlying shape of the class represented within the image is preserved but altered to some new position and orientation.
Given their success in related work [?; ?; ?] we investigated the flipping, rotation and cropping schemes.
Flipping mirrors the image across its vertical axis. It is one of the most used augmentation schemes after being popularized by Krizhevsky et al. [?]. It is computationally efficient and easy to implement, as it only requires each row of the image matrix to be reversed.
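The row-reversal idea can be sketched in a few lines. This assumes an image stored as a list of pixel rows; the function name `flip_horizontal` is illustrative only.

```python
def flip_horizontal(image):
    """Mirror an image across its vertical axis by reversing every row."""
    return [list(reversed(row)) for row in image]


flipped = flip_horizontal([[1, 2, 3],
                           [4, 5, 6]])
```

Applying the function twice returns the original image, so the transformation loses no information.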
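The rotation scheme rotates the image around its center by mapping each pixel $(x, y)$ of an image, with coordinates taken relative to the image center, to $(x', y')$ with the following transformation [?; ?]:
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$$
Exploratory experiments indicated that the two values chosen for $\theta$ establish rotations that are large enough to generate new invariant samples.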
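A rough sketch of such a rotation, under stated assumptions: the image is a list of pixel rows, sampling is nearest-neighbour, and pixels that fall outside the source are filled with a constant. The function name `rotate_image` and these choices are illustrative, not taken from the original implementation. The code walks over output pixels and applies the inverse rotation to find each source pixel, which avoids holes in the result.

```python
import math


def rotate_image(image, theta_degrees, fill=0):
    """Rotate a 2-D image about its center using nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    theta = math.radians(theta_degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = [[fill] * width for _ in range(height)]
    for y_out in range(height):
        for x_out in range(width):
            # Inverse rotation: locate the source pixel that maps onto this one.
            dx, dy = x_out - cx, y_out - cy
            x_src = cos_t * dx + sin_t * dy + cx
            y_src = -sin_t * dx + cos_t * dy + cy
            xi, yi = int(round(x_src)), int(round(y_src))
            if 0 <= xi < width and 0 <= yi < height:
                rotated[y_out][x_out] = image[yi][xi]
    return rotated


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
quarter_turn = rotate_image(grid, 90)
```

Because image rows are indexed downwards, a positive angle appears as a clockwise turn on screen.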
Cropping is another augmentation scheme popularized by Krizhevsky et al. [?]. We used the same procedure as in related work [?], which consisted of extracting crops from the four corners and the center of the image.
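The corner-and-center procedure can be sketched as follows. The crop size is a parameter here because its value is not stated above; the function name `five_crops` and the list-of-rows image format are assumptions for illustration.

```python
def five_crops(image, crop_height, crop_width):
    """Return the four corner crops and the center crop of an image."""
    height, width = len(image), len(image[0])
    if crop_height > height or crop_width > width:
        raise ValueError("crop must not be larger than the image")
    bottom, right = height - crop_height, width - crop_width
    origins = [
        (0, 0),                      # top-left
        (0, right),                  # top-right
        (bottom, 0),                 # bottom-left
        (bottom, right),             # bottom-right
        (bottom // 2, right // 2),   # center
    ]
    return [
        [row[left:left + crop_width] for row in image[top:top + crop_height]]
        for top, left in origins
    ]


image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
crops = five_crops(image, 2, 2)
```

Each source image yields five training samples, which is why cropping inflates the training set more than a single flip or rotation does.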
2.2 Photometric Methods
These are transformations that alter the color channels by shifting each pixel value to a new pixel value according to pre-defined heuristics. This adjusts image lighting and color and leaves the geometry unchanged.
We investigated the color jittering, edge enhancement and fancy PCA photometric methods.
Color jittering is a method that either uses random color manipulation [?] or set color adjustment [?]. We selected the latter due to its accessible implementation using a HSB filter (www.jhlabs.com/ip/filters/HSBAdjustFilter.html).
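A set color adjustment in HSB space might look like the sketch below, which uses Python's standard `colorsys` module rather than the Java filter named above. The additive offsets, their clamping, and the function name `adjust_hsb` are assumptions for illustration; the offsets actually used in the experiments are not given here.

```python
import colorsys


def adjust_hsb(pixel, hue_shift=0.0, saturation_shift=0.0, brightness_shift=0.0):
    """Shift one RGB pixel (0-255 per channel) by fixed offsets in HSB space."""
    r, g, b = (channel / 255.0 for channel in pixel)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + hue_shift) % 1.0                      # hue wraps around
    s = min(1.0, max(0.0, s + saturation_shift))   # clamp to [0, 1]
    v = min(1.0, max(0.0, v + brightness_shift))   # clamp to [0, 1]
    return tuple(int(round(c * 255)) for c in colorsys.hsv_to_rgb(h, s, v))


cyan = adjust_hsb((255, 0, 0), hue_shift=0.5)
```

Applying the same offsets to every pixel changes the color and lighting of the image while leaving its geometry untouched.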
Edge enhancement is a new augmentation scheme that enhances the contours of the class represented within the image. As the learned kernels in the CNN identify shapes, it was hypothesized that CNN performance could be boosted by providing training images with intensified contours. This augmentation scheme was implemented as presented in Algorithm 1. Edge filtering was accomplished using the Sobel edge detector [?], where edges were identified by computing the local gradient at each pixel in the image.
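The Sobel step alone can be sketched as below. This covers only the gradient computation on a grayscale image; the remaining steps of Algorithm 1, such as overlaying the filtered image onto the source, are not reproduced. The function name `sobel_magnitude` and the choice to leave border pixels at zero are assumptions.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(gray):
    """Return the gradient magnitude at each interior pixel of a grayscale image."""
    height, width = len(gray), len(gray[0])
    edges = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            for ky in range(3):
                for kx in range(3):
                    value = gray[y + ky - 1][x + kx - 1]
                    gx += SOBEL_X[ky][kx] * value
                    gy += SOBEL_Y[ky][kx] * value
            edges[y][x] = math.hypot(gx, gy)
    return edges


step = [[0, 0, 255, 255]] * 3          # vertical edge between columns 1 and 2
edge_map = sobel_magnitude(step)
```

Flat regions give a magnitude of zero, while the vertical step produces a strong response on both sides of the edge.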
Fancy PCA is a scheme that performs PCA on sets of pixels and adds multiples of the resulting principal components to the training images (Algorithm 2). In related work [?] it is unclear whether the authors performed PCA on individual images or on the entire training set. However, due to computation and memory constraints, we adopted the former approach and performed PCA on individual images.
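For reference, the formulation of Krizhevsky et al. for this scheme can be written as below. This is the commonly cited form, with the Gaussian scale of 0.1 taken from that work; the exact steps of Algorithm 2 used in this study may differ, and here the covariance matrix would be computed per image rather than over the whole training set.

```latex
% RGB pixel at position (x, y) of one image
I_{xy} = \left[ I_{xy}^{R},\; I_{xy}^{G},\; I_{xy}^{B} \right]^{T}

% p_i, \lambda_i: i-th eigenvector and eigenvalue of the 3x3 covariance
% matrix of RGB pixel values; \alpha_i: random scalar, drawn once per image
I'_{xy} = I_{xy} + \left[ p_1,\; p_2,\; p_3 \right]
          \left[ \alpha_1 \lambda_1,\; \alpha_2 \lambda_2,\; \alpha_3 \lambda_3 \right]^{T},
\qquad \alpha_i \sim \mathcal{N}(0,\, 0.1^{2})
```

The same offset vector is added to every pixel of the image, so the perturbation follows the dominant directions of color variation without altering object shape.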
3 CNN Architecture
A CNN architecture was developed with the objective of obtaining a favorable tradeoff between task performance and training speed.
Training speed was crucial as the CNN had to be trained seven times, on data-sets containing thousands of images, using 4-fold cross-validation. Also, the CNN had to be deep enough (containing enough parameters) so that the network would fit the training data.
We used an architecture (figure 3) similar to that described by [?], where exploratory experiments indicated a reasonable tradeoff between topology size and training speed. The architecture consisted of five trainable layers: three convolutional layers, one fully connected layer and one softmax layer. The CNN took a 3-channel image (representing the RGB channels) as input. A first set of filters was convolved over the image with a fixed stride and padding, producing a feature map that was then compressed by a max-pooling filter into a smaller feature map. A second set of filters, with the same size and stride as before, was convolved over this layer to produce a new feature map, and a max-pooling function of the same size and stride as before reduced it further. This was fed into a last convolutional layer, whose output was finally fed into a fully connected layer connected to a soft-max layer.
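The spatial size of each feature map in such a stack follows from the standard output-size relation for convolution and pooling layers. The sketch below applies it with illustrative values that are not the ones used in this study; the function name `output_size` is likewise hypothetical.

```python
def output_size(input_size, kernel_size, stride, padding=0):
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Illustrative values only (assumed, not taken from the study):
# a 224-pixel input, 7x7 filters, stride 2, padding 1, then 3x3 pooling, stride 2.
after_conv = output_size(224, kernel_size=7, stride=2, padding=1)
after_pool = output_size(after_conv, kernel_size=3, stride=2)
```

Chaining the function layer by layer gives the dimensions of every feature map and therefore the number of inputs to the fully connected layer.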
Overlapping pooling was deployed, which increased CNN performance by reducing over-fitting [?]. With this scheme, each pooling unit summarized an overlapping neighboring group of neurons in the same feature map. The size of the fully connected layer was chosen as larger sizes did not generate greater improvements in performance [?]. Exploratory experiments indicated that smaller layer sizes resulted in richer encodings of the distinct classes, yielding better generalization. The depths of the individual convolutional layers were determined by trial and error. Further convolutional layers did not increase performance considerably and were thus omitted to reduce computation time. It was also found that a second fully connected layer did not improve task performance.
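Pooling overlaps whenever the window is larger than the stride, so neighboring windows share some inputs. A small sketch, assuming a 3x3 window with stride 2 (the configuration popularized by Krizhevsky et al., not necessarily the one used here) and a hypothetical function name `max_pool`:

```python
def max_pool(feature_map, size, stride):
    """Max pooling over a 2-D feature map; windows overlap when size > stride."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[y + dy][x + dx]
                for dy in range(size) for dx in range(size))
            for x in range(0, width - size + 1, stride)
        ]
        for y in range(0, height - size + 1, stride)
    ]


feature_map = [[row * 5 + col for col in range(5)] for row in range(5)]
pooled = max_pool(feature_map, size=3, stride=2)
```

In this example the two windows in each row both cover the middle column, which is the overlap the text refers to.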
All neurons used a Rectified Linear Unit [?], with the weights initialised from a Gaussian distribution. An initialisation scheme known as Xavier [?] was deployed, which mitigated slow convergence to local minima. All weights were updated using back-propagation and stochastic gradient descent. A variety of update functions were tested, including Adam [?] and Adagrad [?]; however, we selected Nesterov momentum [?], which we found to converge relatively fast and not to suffer from numerical instability or stagnating convergence. Additionally, regularization was deployed in the form of L2 gradient normalization to reduce over-fitting. Hyper-parameters for the activation function, learning rate, optimisation algorithm and updater were based on those used in related work [?], and all other parameter values were determined by exploratory experiments. All CNN parameters used in this study are presented in table 2.
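The Nesterov update differs from classical momentum in that the gradient is evaluated at a look-ahead point. The sketch below shows one common formulation on a one-dimensional quadratic; the learning rate of 0.1 and momentum of 0.9 are hypothetical values chosen for the demonstration and are not the settings of this study.

```python
def nesterov_minimize(grad, x0, learning_rate, momentum, steps):
    """Minimize a 1-D function with Nesterov accelerated gradient descent."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        # Evaluate the gradient where the current velocity is about to take us.
        lookahead = x + momentum * velocity
        velocity = momentum * velocity - learning_rate * grad(lookahead)
        x += velocity
    return x


# f(x) = x^2 has gradient 2x and its minimum at x = 0.
minimum = nesterov_minimize(lambda x: 2.0 * x, x0=5.0,
                            learning_rate=0.1, momentum=0.9, steps=200)
```

In a CNN the same update is applied to every weight, with the gradient supplied by back-propagation on a mini-batch.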
|Regularization||L2 Gradient Normalization|
This study evaluated various DA methods on the Caltech101 data-set, a coarse-grained data-set of images organized into object categories. The Caltech101 data-set was chosen as it is a well-established data-set for training CNNs that contains a large number of varying classes [?; ?].
For CNN training, most images in the Caltech101 data-set were used. Some images were omitted, including the background category, which contained many uncorrelated images. Further trimming was applied such that the cardinality of every class was divisible by 4 for cross-validation, which further increased the number of images excluded. We elected to use cross-validation, which maximized the use of the selected images and better estimated how the CNN's performance would scale to other unknown data-sets. Specifically, 4-fold cross-validation was used (higher-fold cross-validation would have taken too long to train), which partitioned the data-set into 4 equal-sized subsets, where 3 subsets were used for training and the other for validation purposes.
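The partitioning can be sketched as a generator that holds out one fold at a time. This is a simplified illustration: it splits a single list, whereas the trimming described above suggests the split would be applied within each class. The function name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(items, k=4):
    """Yield (training, validation) lists, holding out each of k folds once."""
    if len(items) % k != 0:
        raise ValueError("number of items must be divisible by k")
    fold_size = len(items) // k
    folds = [items[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for held_out in range(k):
        training = [item
                    for index, fold in enumerate(folds) if index != held_out
                    for item in fold]
        yield training, folds[held_out]


splits = list(k_fold_splits(list(range(8))))
```

Every item is used for validation exactly once and for training in the other three rounds, which is why the class sizes must be divisible by the number of folds.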
All images within the data-set were transformed to a fixed size. Every image was first downsized such that its largest dimension was equal to the target size. This downsized image was then drawn centrally on top of a black image (a numeric value of 0 for all channels, thus acting as zero padding). This enabled all augmentation schemes to have access to the full image at a fixed resolution. Finally, the transformed images underwent normalization by rescaling all pixel values.
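The padding and normalization steps can be sketched as follows, leaving out the downsizing itself. The square canvas, the target range of [0, 1] and the function names `pad_to_square` and `normalize` are assumptions for illustration, since the exact sizes and ranges are not given above.

```python
def pad_to_square(image, side, fill=0):
    """Draw an image centrally on a side x side canvas filled with zeros."""
    height, width = len(image), len(image[0])
    if height > side or width > side:
        raise ValueError("image must already fit inside the canvas")
    top, left = (side - height) // 2, (side - width) // 2
    canvas = [[fill] * side for _ in range(side)]
    for y, row in enumerate(image):
        canvas[top + y][left:left + width] = row
    return canvas


def normalize(image, max_value=255.0):
    """Rescale pixel values from [0, max_value] to [0, 1] (assumed range)."""
    return [[pixel / max_value for pixel in row] for row in image]


padded = pad_to_square([[255, 255]], side=4)
scaled = normalize(padded)
```

Padding rather than stretching keeps the aspect ratio of the object intact, so the geometric augmentations operate on an undistorted image.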
Every CNN was trained for a fixed number of epochs, a value determined by exploratory experiments that evaluated validation and test scores every epoch. All implementation was completed in Java 8 using DL4J, with the experiments conducted on an NVIDIA Tesla K80 GPU using CUDA. All source code and experiment details can be found online.
5 Results and Discussion
| DA Method | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|
| Baseline | 48.13 ± 0.42% | 64.50 ± 0.65% |
| Flipping | 49.73 ± 1.13% | 67.36 ± 1.38% |
| Rotating | 50.80 ± 0.63% | 69.41 ± 0.48% |
| Cropping | **61.95 ± 1.01%** | **79.10 ± 0.80%** |
| Color Jittering | **49.57 ± 0.53%** | 67.18 ± 0.42% |
| Edge Enhancement | 49.29 ± 1.16% | 66.49 ± 0.84% |
| Fancy PCA | 49.41 ± 0.84% | **67.54 ± 1.01%** |
Table 3 presents experimental results, where Top-1 and Top-5 scores were evaluated as percentages as done in the ImageNet competition [?], though we report accuracies rather than error rates. The CNN's output is a multinomial distribution over all classes, that is, a vector of class probabilities $p_c$ that sum to one:
$$\sum_{c} p_c = 1$$
The Top-1 score is the percentage of testing images for which the highest probability is associated with the correct target.
The Top-5 score is the percentage of testing images for which the correct label is contained within the 5 highest probabilities.
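Both scores are instances of top-k accuracy, which can be sketched as below. The function name `top_k_accuracy` and the toy three-class probabilities are illustrative assumptions; the function returns a fraction, which is multiplied by 100 to give a percentage.

```python
def top_k_accuracy(probabilities, targets, k):
    """Fraction of samples whose true class is among the k most probable."""
    hits = 0
    for scores, target in zip(probabilities, targets):
        # Class indices ordered from most to least probable.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


outputs = [[0.7, 0.2, 0.1],
           [0.1, 0.3, 0.6],
           [0.4, 0.5, 0.1]]
labels = [0, 1, 0]
top1 = top_k_accuracy(outputs, labels, k=1)
top2 = top_k_accuracy(outputs, labels, k=2)
```

Top-k accuracy can never decrease as k grows, which is why every Top-5 score in the results exceeds the corresponding Top-1 score.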
As the CNN's accuracy was evaluated using cross-validation, a standard deviation was associated with every result, indicating how variable the result was over the different testing folds. In table 3, the highest Top-1 and Top-5 scores in each of the geometric and photometric DA categories are represented in bold.
Results indicate that in all cases of applying DA, CNN classification task performance increased. Notably, the geometric augmentation schemes outperformed the photometric schemes in terms of Top-1 and Top-5 scores. The exception was the flipping scheme, whose Top-5 score was inferior to the fancy PCA Top-5 score. For all schemes, standard deviations of between 0.42% and 1.38% (table 3) indicated similar results over all folds of the cross-validation.
The cropping scheme yielded the greatest improvement in Top-1 score, with an improvement of 13.82% in classification accuracy over the baseline (from 48.13% to 61.95%, table 3). Results also indicated that Top-5 classification yielded a similar task improvement (14.60%), which corroborated related work [?; ?]. We theorize that the cropping scheme outperforms the other methods as it generates more sample images than the other augmentation schemes. This increase in training data reduces the likelihood of over-fitting, improving generalization and thus increasing overall classification task performance. Also, cropping represents specific translations, allowing the CNN exposure to a greater receptive view of training images, which the other augmentation schemes do not take advantage of [?].
However, the photometric augmentation methods yielded only modest improvements in performance compared to the geometric schemes, indicating that the CNN yields increased task performance when trained on images containing invariance in geometry rather than in lighting and color. The most appropriate photometric schemes were found to be color jittering, with a Top-1 classification improvement of 1.44%, and fancy PCA, which improved Top-5 classification by 3.04% (both relative to the baseline in table 3). Fancy PCA increased Top-1 performance by 1.28%, which supported the findings of previous work [?].
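The improvement of each scheme over the baseline follows directly from the mean accuracies in table 3. The short sketch below recomputes these differences in percentage points; the dictionary layout and the function name `improvement_over_baseline` are illustrative choices.

```python
# Mean accuracies in percent, copied from table 3 as (Top-1, Top-5).
SCORES = {
    "Baseline": (48.13, 64.50),
    "Flipping": (49.73, 67.36),
    "Rotating": (50.80, 69.41),
    "Cropping": (61.95, 79.10),
    "Color Jittering": (49.57, 67.18),
    "Edge Enhancement": (49.29, 66.49),
    "Fancy PCA": (49.41, 67.54),
}


def improvement_over_baseline(scores, baseline="Baseline"):
    """Return (Top-1, Top-5) gains of every scheme over the baseline."""
    base_top1, base_top5 = scores[baseline]
    return {
        name: (round(top1 - base_top1, 2), round(top5 - base_top5, 2))
        for name, (top1, top5) in scores.items()
        if name != baseline
    }


gains = improvement_over_baseline(SCORES)
```

These differences compare fold means only and do not account for the standard deviations reported alongside them.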
We also hypothesize that color jittering outperformed the other photometric schemes in Top-1 classification as this scheme generated augmented images containing more variation compared to the other methods (figure 2). Also, edge enhancement augmentation did not yield comparable task performance, likely because the overlay of the transformed image onto the source image (as described in section 2.2) did not enhance the contours enough but rather lightened the entire image. However, the exact mechanisms responsible for the variability of CNN classification task performance given geometric versus photometric augmentation methods for coarse-grained data-sets remain the topic of ongoing research.
This study's results demonstrate that an effective method of increasing CNN classification task performance is to make use of Data Augmentation (DA). Specifically, having evaluated a range of DA schemes using a relatively simple CNN architecture, we were able to demonstrate that geometric augmentation methods outperform photometric methods when training on a coarse-grained data-set (that is, the Caltech101 data-set). The greatest task performance improvement was yielded by the specific translations generated by the cropping method, with a Top-1 score increase of 13.82% over the baseline (table 3). These results indicate the importance of augmenting coarse-grained training data-sets using transformations that alter the geometry of the images rather than just lighting and color.
Future work will experiment with different coarse-grained data-sets to establish whether results obtained using Caltech101 are transferable to other data-sets. Additionally, different CNN architectures, as well as other DA methods and combinations of DA methods, will be investigated in comprehensive studies to ascertain the impact of applying a broad range of generic DA methods on coarse-grained data-sets.
- [Abdel-Hamid et al., 2014] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu. Convolutional neural networks for speech recognition. In IEEE/ACM Transactions on Audio, Speech, and Language Processing, pages 1533–1545, 2014.
- [Chatfield et al., 2014] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014.
- [Dieleman et al., 2015] S. Dieleman, K. Willett, and J. Dambre. Rotation-invariant convolutional neural networks for galaxy morphology prediction. In Monthly Notices of the Royal Astronomical Society, pages 1441–1459, 2015.
- [Duchi et al., 2011] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(1):2121–2159, 2011.
- [Glorot and Bengio, 2010] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Aistats, volume 9, pages 249–256, 2010.
- [Gonzalez and Woods, 1992] R. Gonzalez and R. Woods. Digital image processing. In Addison Wesley, pages 414–428, 1992.
- [Goodfellow et al., 2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- [Hahnloser et al., 2000] R. Hahnloser, R. Sarpeshkar, M. Mahowald, R. Douglas, and H. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405:947–951, 2000.
- [He et al., 2014] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision ECCV 2014, pages 346–361. Springer, Berlin, Germany, 2014.
- [Howard, 2013] A. Howard. Some improvements on deep convolutional neural network based image classification. In arXiv preprint arXiv:1312.5402, 2013.
- [Hu et al., 2014] B. Hu, Z. Lu, H. Li, and Q. Chen. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, pages 2042–2050, 2014.
- [Kingma and Ba, 2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In arXiv preprint arXiv:1412.6980, 2014.
- [Krizhevsky et al., 2012] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- [LeCun et al., 1996] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1996.
- [Mash et al., 2016] R. Mash, B. Borghetti, and J. Pecarina. Improved aircraft recognition for aerial refueling through data augmentation in convolutional neural networks. In International Symposium on Visual Computing, pages 113–122. Springer, 2016.
- [Masi et al., 2016] I. Masi, A. Tran, T. Hassner, J. Leksut, and G. Medioni. Do we really need to collect millions of faces for effective face recognition. In European Conference on Computer Vision, pages 579–596. Springer International Publishing, 2016.
- [Nesterov, 1983] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.
- [Parkhi et al., 2015] O. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, page 6, 2015.
- [Peng et al., 2015] X. Peng, B. Sun, K. Ali, and K. Saenko. Learning deep object detectors from 3d models. In Proceedings of the IEEE International Conference on Computer Vision, pages 1278–1286, 2015.
- [Razavian et al., 2014] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.
- [Rogez and Schmid, 2015] G. Rogez and C. Schmid. MoCap-guided data augmentation for 3D pose estimation in the wild. In International Journal of Computer Vision, pages 211–252, 2015.
- [Russakovsky et al., 2016] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein. Imagenet large scale visual recognition challenge. In Advances in Neural Information Processing Systems, pages 3108–3116, 2016.
- [Simonyan and Zisserman, 2014] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, Montreal, Canada, 2014. NIPS Foundation, Inc.
- [Wu et al., 2015] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep image: Scaling up image recognition. In arXiv preprint arXiv:1501.02876, 2015.
- [Yaeger et al., 1996] L. Yaeger, R. Lyon, and B. Webb. Effective training of a neural network character classifier for word recognition. In Advances in Neural Information Processing Systems, pages 807–813, 1996.
- [Zeiler and Fergus, 2014] M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer International Publishing, 2014.
- [Zhang et al., 2016] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In arXiv preprint arXiv:1612.03242, 2016.