STaDA: Style Transfer as Data Augmentation
The success of training deep Convolutional Neural Networks (CNNs) heavily depends on a significant amount of labelled data. Recent research has found that neural style transfer algorithms can apply the artistic style of one image to another image without changing the latter’s high-level semantic content, which makes it feasible to employ neural style transfer as a data augmentation method to add more variation to the training dataset. The contribution of this paper is a thorough evaluation of the effectiveness of neural style transfer as a data augmentation method for image classification tasks. We explore state-of-the-art neural style transfer algorithms and apply them as a data augmentation method on the Caltech 101 and Caltech 256 datasets, where we find an improvement of around 2 percentage points, from 83% to 85%, in image classification accuracy with VGG16 compared with traditional data augmentation strategies. We also combine this new method with conventional data augmentation approaches to further improve the performance of image classification. This work shows the potential of neural style transfer in the computer vision field, such as helping to reduce the difficulty of collecting sufficient labelled data and to improve the performance of generic image-based deep learning algorithms.
Neural Style Transfer, Data Augmentation, Image Classification
Data augmentation refers to the task of adding diversity to the training data of a neural network, especially when training samples are scarce. Popular deep architectures such as AlexNet [krizhevsky2012imagenet] or VGGNet [simonyan2014very] have millions of parameters and thus require a reasonably large dataset to be trained for a particular task. Lack of adequate data leads to overfitting, i.e. high training accuracy but poor generalisation over the test samples [caruana2001overfitting]. In many computer vision tasks, gathering raw data can be very time-consuming and expensive. For example, in the domain of medical image analysis, in critical tasks such as cancer detection [kyprianidis2013state] and cancer classification [vasconcelos2017increasing], researchers are often restricted by the lack of reliable data. Thus, it is common practice to use data augmentation techniques such as flipping, rotation, cropping etc. to increase the variety of samples fed to the network. Recently, a more complex technique that uses a green screen with different random backgrounds to increase the number of training images has been introduced by [chalasani2018Egocentric].
In this paper, we explore the capacity of neural style transfer [gatys2016image] as an alternative data augmentation strategy. We propose a pipeline that applies this strategy to image classification tasks and verify its effectiveness on multiple datasets. Style transfer refers to the task of applying the artistic style of one image to another, without changing the high-level semantic content (Figure 1). The main idea of this algorithm is to jointly minimise the distance between content representations and the distance between style representations learned on different layers of a convolutional neural network. In the fast variant, a network trained per style translates an input to the target image in a single forward pass.
The crux of this work is to use a style transfer network as a generative model to create more samples for training a CNN. Since style transfer preserves the overall semantic content of the original image, the high-level discriminative features of an object are maintained. On the other hand, by changing the artistic style of some randomly selected training samples, it is possible to train a classifier which is invariant to undesirable components of the data distribution. For example, let us assume a scenario where a dataset of simple objects such as Caltech 256 [griffin2007caltech] has a category car, but there are more images of red cars than of any other colour. A model trained on such a dataset will associate the colour red with the car category, which is undesirable for a generic car classifier. Using style transfer as the data augmentation method can be an effective strategy to avoid such associations.
In this work, we use eight different style images as a palette to augment the original image datasets (Section 3.3). Additionally, we investigate whether different image styles have different effects on image classification. This paper is organised as follows. In Section 2, we discuss research related to our work. In Section 3, we describe the different components of the architecture used. In Section 4, we describe our experiments and report and analyse the results.
[Figure 1: (a) Raw image; (b) Style image; (c) Stylized image.]
2 Related Work
In this section, we review the state-of-the-art research relevant to the problem addressed in this paper. In Section 2.1, we review the development of neural style transfer algorithms and in Section 2.2, we discuss the traditional data augmentation strategies and how they affect the performance of a CNN in the standard computer vision tasks such as classification, detection, segmentation etc.
2.1 Neural Style Transfer
The main goal of style transfer is to synthesise a stylised image from two different images, one supplying the texture and another providing the high-level semantic content. Broadly, the state-of-the-art style transfer algorithms can be divided into two groups: descriptive and generative. The descriptive approach changes the pixels of a noise image iteratively, whereas the generative approach achieves the same result in a single forward pass by using a pre-trained model of the desired style [JingYFYS17].
Descriptive Approach: The work done by [gatys2016image] is the first descriptive approach to neural style transfer. Starting from random noise, their algorithm transforms the noise image iteratively such that it mimics the content of one image and the style or texture of another. The content is optimised by minimising the distance between the high-level CNN features of the content image and the stylised image. The style, on the other hand, is optimised by matching the Gram matrices of the style image and the stylised image. Several algorithms followed directly from this approach by addressing some of its limitations. [risser2017stable] propose a more stable optimisation strategy by adding histogram losses. [li2017demystifying] closely investigate the different components of the work by [gatys2016image] and present key insights on the optimisation methods, kernels used and normalisation strategies adopted. [yin2016content, chen2016towards] propose content-aware neural style transfer strategies which aim to preserve high-level semantic correspondence between the content and the target.
Generative Approach: The descriptive approach is limited by its slow computation time. In the generative, or fast, approach, a model is trained in advance for each style image. While inference in the descriptive approach proceeds slowly over several iterations, here it is achieved through a single forward pass. [johnson2016Perceptual] propose a two-component architecture consisting of a generator network and a loss network. They also introduce a novel loss function based on the perceptual difference between the content and the target. In [ulyanov2016texture], the authors improve the work by [johnson2016Perceptual] by using a better generator network. [li2016precomputed] train a Markovian feed-forward network by using an adversarial loss function. [dumoulin2016learned] train for multiple styles using the same network.
2.2 Data Augmentation
As discussed previously in Section 1, data augmentation refers to the process of adding more variation to the training data in order to improve the generalisation capabilities of the trained model. It is particularly useful in scenarios where there is a scarcity of training samples. Some common strategies for data augmentation are flipping, re-scaling, cropping, etc. Research related to developing novel data augmentation strategies is almost as old as the research in deeper neural networks with more parameters.
In [krizhevsky2012imagenet], the authors applied two different augmentation methods to improve the performance of their model. The first one is horizontal flip, without which their network showed substantial overfitting even with only five layers. The second strategy is to perform a PCA on the RGB values of the image pixels in the training set and to use the top principal components, which reduced the top-1 error rate by over 1%. Similarly, ZFNet [zeiler2014visualizing] and VGGNet [simonyan2014very] also apply multiple different crops and flips to each training image to boost the training set size and improve the performance of their models. These well-known CNN architectures achieved outstanding results in the ImageNet challenge [deng2009imagenet], and their success demonstrated the effectiveness and importance of data augmentation.
Besides the traditional transformative data augmentation strategies discussed above, some recently proposed methods follow a generative approach. In [perez2017effectiveness], the authors explore Generative Adversarial Nets to generate stylised images to augment the dataset. Their method, called Neural Augmentation, uses CycleGAN [zhu2017unpaired] to create new stylised images in the training dataset. It is tested on a 5-layer network with the MNIST dataset and delivers better performance than most traditional approaches.
3 Design and Implementation
In this section, we propose a modular design for using style transfer as a data augmentation technique. Figure 2 summarises the modular architecture we followed for creating our data augmentation strategy. We choose the network designed in [engstrom2016faststyletransfer] for our style transfer module. The reasons for this choice, and the architecture and implementation of the network, are explained in Section 3.1. For testing the data augmentation technique we use the classification module. In our experiments we used the standard VGG16 from [simonyan2014very], explained in Section 3.2. In Section 3.3 we briefly explain the datasets used for the evaluation of our strategy.
3.1 Style Transfer Architecture
For style transfer to be as viable as the traditional data augmentation strategies (crop, flip etc.), we need a fast-running solution. There have been successful style transfer solutions using CNNs, but they are considerably slow [gatys2016image]. To alleviate this problem, we choose a generative architecture that only needs a forward pass to compute a stylised image [engstrom2016faststyletransfer]. This network consists of a generative Style Transfer Network coupled with a Loss Network that computes a cumulative loss accounting for both the style of the style image and the content of the training image. In the following subsections (3.1.1, 3.1.2) we look at the architectures of the Style Transfer Network and the Loss Network.
3.1.1 Transformation Network
For the transformation network, we follow the state-of-the-art implementation from [engstrom2016faststyletransfer] and [resNets], with changes to the hyperparameters based on our experiments.
Five residual blocks are used in the style transformation network to avoid optimisation difficulties as the network gets deeper [he2016deep]. The other, non-residual convolutional layers are followed by Ulyanov’s instance normalisation [ulyanov1607instance] and ReLU layers. At the output layer, a scaled tanh is used to produce an output image with pixel values in the range from 0 to 255.
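As a minimal sketch of the output scaling described above, the snippet below maps unbounded activations to the pixel range [0, 255] with a scaled tanh. The function name `scaled_tanh` and the exact scale and offset (127.5) are assumptions chosen so that the output covers exactly [0, 255]; the source does not state the constants used in the actual network.

```python
import numpy as np

def scaled_tanh(activations):
    """Map unbounded activations to pixel values in [0, 255].

    Assumption: tanh output in [-1, 1] is shifted and scaled by 127.5;
    the constants used in the original network are not given in the text.
    """
    return 127.5 * (np.tanh(activations) + 1.0)

# Large negative, zero and large positive activations.
example_pixels = scaled_tanh(np.array([-50.0, 0.0, 50.0]))
```

Any monotone squashing of this kind keeps the generated image within a valid pixel range without a hard clipping step.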
This network is trained together with the Loss Network (described in the following subsection) using stochastic gradient descent [bottou2010large] to minimise a weighted combination of loss functions. We treat the overall loss as a linear combination of the content reconstruction loss and the style reconstruction loss. The weights of the two losses can be fine-tuned depending on preference. By minimising the overall loss, we obtain a model that is well trained for each style.
3.1.2 Loss Network
Having defined the transformation network that generates stylised images, we also need a loss network. It provides the loss function used to evaluate the generated images and to optimise the style transfer network with stochastic gradient descent.
We use a deep convolutional neural network pretrained for image classification on ImageNet to measure the texture and content differences between the generated image and the target image. Recent work has shown that deep convolutional neural networks can encode texture information and high-level content information in their feature maps [gatys2015texture, mahendran2015understanding]. Based on this finding, we define a content reconstruction loss and a style reconstruction loss in the loss network and use their weighted sum to measure the total difference between the stylised image and the image we want to obtain. For every style, we train the transformation network with the same pretrained loss network.
Content Reconstruction Loss: To measure content differences, an image needs to be reconstructed from the image information encoded in the neural network, i.e., by computing an approximate inverse from the feature map. Given an image $\vec{x}$, the image goes through the CNN model and is encoded in each layer by the filter responses to it. We use $F^l$ to store the feature maps in a layer $l$, where $F^l_{ij}$ is the feature map of the $i^{th}$ filter at position $j$ in layer $l$. Let $P^l$ be the feature maps for the content image in layer $l$. We can then update the pixels of image $\vec{x}$ to minimise the following loss, so that the two images have similar feature maps in the network and thus similar semantic content:

$$\mathcal{L}_{content} = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$
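The content reconstruction loss can be illustrated with a short NumPy sketch. It assumes the feature maps of one layer are stored as arrays of shape (filters, positions), and it uses the 1/2 factor from the formulation of [gatys2016image]; the function and variable names are hypothetical and not taken from the authors' code.

```python
import numpy as np

def content_loss(generated_features, content_features):
    """Half the summed squared difference between two feature maps.

    Both inputs have shape (num_filters, num_positions) for a single layer.
    The 1/2 factor is an assumption following Gatys et al.
    """
    diff = generated_features - content_features
    return 0.5 * np.sum(diff ** 2)

# Toy feature maps: 2 filters, 3 spatial positions.
toy_generated = np.ones((2, 3))
toy_content = np.zeros((2, 3))
toy_content_loss = content_loss(toy_generated, toy_content)
```

The loss is zero only when the two feature maps agree at every filter and position, which is what ties the generated image to the semantic content of the content image.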
Style Reconstruction Loss: We also want the generated images to have a similar texture to the style target image, so we penalise style differences with the style reconstruction loss. The feature maps in a few layers of a trained network are used to represent the texture through the correlations between them. Instead of using these feature maps directly, the correlations between the different channels of the feature maps are given by the Gram matrix $G^l$, where $G^l_{ij}$ is the inner product between the vectorised feature map $i$ and feature map $j$ in layer $l$, and $F^l_{ik}$ denotes the response of filter $i$ at position $k$ in layer $l$:

$$G^l_{ij} = \sum_{k} F^l_{ik} F^l_{jk}$$
The original texture is passed through the CNN and the Gram matrices on the feature responses of some layers are computed. We can pass a white noise image through the CNN and compute the Gram matrix difference on every layer included in the texture model as the loss. If we use this loss to perform gradient descent on the white noise image and try to minimise the Gram matrix difference, we can find a new image that has the same style as the original image texture. The loss is computed as the mean-squared distance between the Gram matrices of the two images. Let $A^l$ and $G^l$ be the Gram matrices of the two images in layer $l$, let $N_l$ be the number of filters in that layer and $M_l$ the size of each vectorised feature map. The loss of that layer is:

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$

and the loss for all chosen layers, with $w_l$ the weight given to layer $l$, is:

$$\mathcal{L}_{style} = \sum_{l} w_l E_l$$
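The Gram-matrix style loss can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: feature maps are arrays of shape (filters, positions), the per-layer normalisation 1/(4 N² M²) follows [gatys2016image], and the layer weights are supplied by the caller. All function names are hypothetical.

```python
import numpy as np

def gram_matrix(features):
    """Inner products between all pairs of vectorised feature maps.

    features has shape (num_filters, num_positions); result is square.
    """
    return features @ features.T

def layer_style_loss(generated_features, style_features):
    """Normalised squared distance between the Gram matrices of one layer."""
    num_filters, num_positions = generated_features.shape
    gram_generated = gram_matrix(generated_features)
    gram_style = gram_matrix(style_features)
    scale = 4.0 * num_filters ** 2 * num_positions ** 2
    return np.sum((gram_generated - gram_style) ** 2) / scale

def style_loss(generated_layers, style_layers, layer_weights):
    """Weighted sum of the per-layer style losses."""
    return sum(
        weight * layer_style_loss(generated, style)
        for generated, style, weight in zip(generated_layers, style_layers, layer_weights)
    )
```

Because the Gram matrix sums over spatial positions, it discards where a feature occurs and keeps only which features co-occur, which is why it captures texture rather than layout.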
Total Variation Regularization: We also follow prior work [gatys2016image] and make use of a total variation regulariser $\mathcal{L}_{TV}$ to encourage spatial smoothness in the output image $\vec{x}$.
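Terms Balancing: With the above definitions, generating an image can be seen as solving the optimisation problem in the style transfer module in Figure 2. We initialise the image with white noise; the work [gatys2016image] found that the initialisation has a minimal impact on the final results. $\alpha$, $\beta$, and $\gamma$ are hyperparameters that we can tune by monitoring the results. To get the stylised image, we need to minimise a weighted combination of the two loss functions and the regularisation term, where $\mathcal{L}_{content}$ is the content reconstruction loss, $\mathcal{L}_{style}$ the style reconstruction loss and $\mathcal{L}_{TV}$ the total variation regulariser:

$$\mathcal{L}_{total} = \alpha \mathcal{L}_{content} + \beta \mathcal{L}_{style} + \gamma \mathcal{L}_{TV}$$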
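To make the weighting of the terms concrete, the sketch below computes a total variation penalty and combines it with given content and style loss values. The squared-difference form of the regulariser is one common choice and is an assumption here, as are the names `total_variation`, `total_loss` and the parameter names `alpha`, `beta`, `gamma`; the source does not specify the values used.

```python
import numpy as np

def total_variation(image):
    """Sum of squared differences between neighbouring pixels.

    image has shape (height, width, channels). The squared form is an
    assumption; an absolute-difference form is also widely used.
    """
    vertical = image[1:, :, :] - image[:-1, :, :]
    horizontal = image[:, 1:, :] - image[:, :-1, :]
    return np.sum(vertical ** 2) + np.sum(horizontal ** 2)

def total_loss(content_term, style_term, tv_term, alpha, beta, gamma):
    """Weighted combination of the content, style and smoothness terms."""
    return alpha * content_term + beta * style_term + gamma * tv_term
```

Raising `alpha` relative to `beta` pushes the result towards the content image, while raising `beta` strengthens the style; `gamma` only controls how strongly noisy pixel-level variation is suppressed.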
3.2 Image Classification
To evaluate the effectiveness of this design, we perform image classification tasks with the stylised images. In an image classification task, for each given image, the algorithm needs to produce the most plausible object category [ILSVRC15]. The performance of the algorithm is evaluated on whether the predicted category matches the ground truth label for the image. Since we provide input images from multiple categories, the overall performance of an algorithm is the average score over all test images.
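The evaluation criterion described above amounts to Top-1 accuracy. A minimal sketch, with a hypothetical function name and toy labels that are not taken from the experiments, could look like this:

```python
def top1_accuracy(predicted_labels, true_labels):
    """Fraction of images whose predicted category equals the ground truth."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("prediction and label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one labelled image is required")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# Toy example: three of four predictions match the labels.
toy_accuracy = top1_accuracy(["car", "cat", "dog", "car"],
                             ["car", "dog", "dog", "car"])
```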
Once we have trained the transformation network that generates the stylised images, we apply it to the training dataset to create a larger dataset. The stylised images are saved to disk with their ground truth categories. We then use them together with their original images to train the neural networks for the image classification problem. In this research, the model we chose is VGGNet, which is a simple but effective model [simonyan2014very]. Its authors achieved first place in the localisation task and second place in the classification task of the ImageNet Challenge 2014. This model strictly uses $3 \times 3$ filters with stride and pad of 1, along with $2 \times 2$ max-pooling layers with stride 2. Three $3 \times 3$ convolutional layers back to back have an effective receptive field of $7 \times 7$. Compared with one $7 \times 7$ filter, the $3 \times 3$ filter size can provide the same effective receptive field with fewer parameters.
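The receptive-field and parameter argument for small filters can be checked with a few lines of arithmetic. The sketch assumes the 3 × 3 kernels of VGGNet [simonyan2014very], stride 1, the same number of input and output channels in every layer, and no bias terms; the helper names are hypothetical.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Effective receptive field of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def conv_weight_count(kernel_size, channels, num_layers=1):
    """Weights in a stack of conv layers with equal in/out channels, no biases."""
    return num_layers * kernel_size * kernel_size * channels * channels

# Three stacked 3x3 layers versus a single 7x7 layer, with 64 channels.
field_of_three_3x3 = stacked_receptive_field(3, 3)
weights_three_3x3 = conv_weight_count(3, 64, num_layers=3)
weights_one_7x7 = conv_weight_count(7, 64)
```

With C channels the stack needs 27C² weights against 49C² for the single large filter, and it additionally inserts two extra non-linearities.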
To fully understand the effectiveness of style transfer, and to explore how it compares and combines with traditional data augmentation approaches for image classification, we need to experiment from multiple perspectives. The first set of experiments uses traditional transformations alone. For each input image, we generate a set of duplicate images that are shifted, rotated, or flipped versions of the original. Both the original image and the duplicates are fed into the neural network to train the model. The classification performance measured on the validation dataset serves as the baseline against which the augmentation strategies are compared. The pipeline can be found in Figure 2. The second set of experiments applies the trained transformation network to augment the training dataset. For each input image, we select a style image from a subset of different styles by famous artists and use the transformation network to generate new images from the original image. We store the newly generated images on disk, and both the original and stylised images are fed to the image classification network. To explore whether we can get better results, we further combine the two approaches to obtain more images in the training dataset.
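The traditional transformations mentioned above can be sketched with NumPy alone. This is a simplified stand-in, not the pipeline used in the experiments: the rotation is fixed at 90 degrees, the shift wraps pixels around the image border, and the shift size is arbitrary, since the text does not specify angles or offsets. The function name is hypothetical.

```python
import numpy as np

def traditional_augment(image, shift_pixels=2):
    """Return flipped, rotated and shifted copies of an (H, W, C) image.

    Assumptions: horizontal flip, 90-degree rotation, and a horizontal
    shift that wraps around the border (np.roll).
    """
    flipped = np.fliplr(image)                       # mirror left-right
    rotated = np.rot90(image)                        # rotate in the H-W plane
    shifted = np.roll(image, shift_pixels, axis=1)   # shift along the width
    return [flipped, rotated, shifted]
```

Each returned copy keeps the label of the original image, so one labelled image yields four training samples under this scheme.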
3.3 Dataset and Image Styles
The transformation network is trained on the COCO 2014 collection [DBLP:journals/corr/LinMBHPRDZ14], which contains more than 83k training images, enough to train the transformation network well. Since these images are only used as input to the transformation network, we ignore the label information during training.
Two different datasets are used for the image classification tasks, Caltech 101 [fei2006one] and Caltech 256 [griffin2007caltech]. We keep the number of training and testing images fixed, repeat the experiment with differently augmented data, and compare the results. The images are divided by a 70:30 split between training and validation for both datasets.
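A 70:30 split of this kind can be sketched as follows. The shuffling, the fixed seed and the function name are assumptions for illustration; the text does not say whether the split was random or stratified per category.

```python
import random

def split_train_validation(items, train_fraction=0.7, seed=0):
    """Shuffle items reproducibly and split them into train and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Toy example with ten image identifiers.
toy_train, toy_validation = split_train_validation(range(10))
```

Only the training part would then be augmented, so that the validation images remain untouched originals.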
For the styles, we tried to select images that look very different from each other. In the end, eight different images were chosen as the style inputs to train the transformation network. All styles can be found in the GitHub repository: https://github.com/zhengxu001/neural-data-augmentation.
This section presents the evaluation of style transfer for data augmentation. We evaluate the results from multiple perspectives based on the classification Top-1 accuracy of VGGNet. The results of experiments on traditional augmentation are collected as the baseline. We then do experiments on every single style and some combined styles. We also combine the traditional methods with our style transfer method to verify their effectiveness and see if we can improve on previous methods.
4.1 Traditional Image Augmentation
We first used the pretrained VGGNet without any data augmentation and reached a classification accuracy of 83.34% in one hour of training time. We then applied two different traditional image augmentation strategies, Flipping and Rotation, to train the model. Finally, we combined both strategies. Detailed results can be found in Table 1. We found that the model works best without any augmentation. Using Flipping as the data augmentation strategy gives very similar results; however, the combination of Rotation and Flipping significantly reduces classification accuracy to 77%. Adding Rotation as data augmentation does not seem to help classification, as is also shown in our following experiments.
[Table 1: Traditional Image Augmentation.]
4.2 Single Style
We select eight styles that look different from each other to train the transformation network. All styles can be found in the Appendix. We feed each of the images in the training set to the eight fully trained transformation networks to generate eight stylised images. Both the original images and the stylised images are fed to VGGNet to train the network, and the best validation accuracy over all epochs is recorded. The results for each style can be found in Table 2. We can see that 7 out of 8 styles work better than the traditional strategies. It can also be seen that the Snow style works best, reaching an accuracy of 85.26%, whereas the YourName style only reaches 82.61%. We attribute this to the addition of too much noise and colour to the original images for that particular style. Figure 3 shows a comparison between the original image and two different stylised images. As can be seen, the YourName style adds too many colours and shapes to the original image, which explains its poor image classification accuracy.
[Table 2: Single Style with VGG16 and Caltech 101.]
4.3 Combined Methods
To evaluate the combination of two different styles, we take the original images, feed them to two different transformation networks, and generate two stylised images for each input image. We then merge the stylised images and the original images to compose the final training dataset. This gives us three times as many images as the original dataset. We use this augmented dataset to train the VGGNet model. We also combine the traditional augmentation methods with style transfer and evaluate the performance. The results can be found in Table 3.
We notice a very slight increase in performance of the combined style (ScreamWave) over the single styles. We further notice that adding Flipping as a data augmentation strategy degrades the performance, although to a lesser degree for the combined style.
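The way the combined training set in this section is composed can be sketched as below. The stylisers are hypothetical callables standing in for trained transformation networks, and strings stand in for images; only the bookkeeping (original images plus one stylised copy per style, each keeping its label) reflects the description in the text.

```python
def build_augmented_dataset(samples, stylisers):
    """Merge original (image, label) pairs with one stylised copy per styliser.

    stylisers is a list of callables mapping an image to a stylised image;
    here they are placeholders for trained transformation networks.
    """
    augmented = list(samples)
    for stylise in stylisers:
        augmented.extend((stylise(image), label) for image, label in samples)
    return augmented

# Toy example: strings stand in for images, lambdas for style networks.
toy_samples = [("img0", "car"), ("img1", "cat")]
toy_stylisers = [lambda image: image + "_scream", lambda image: image + "_wave"]
toy_dataset = build_augmented_dataset(toy_samples, toy_stylisers)
```

With two styles, the merged set is three times the size of the original one, matching the factor stated for the combined-style experiment.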
4.4 Content Weights Change
In this experiment, we change the ratio of content weight to style weight in the transformation network to evaluate the impact on performance. We increased the content weight to create a new style named Wave2. As can be seen from Table 4, changing the content weight produces no significant change. As can be seen in Figure 4, the images for the two content weights look very similar to each other, which explains the minimal impact in our experiments. In future work, we would like to examine this effect further with more content weights.
[Table 4: Different Content Weights. Columns: Traditional Method, Style Name, Result.]
To evaluate how well our approach generalises across network architectures, we used VGG19 as the classification network and repeated our experiments. The results can be found in Table 5.
[Table 5: Experiments on VGG19. Columns: Traditional Method, Style Name, Result.]
The baseline classification accuracy for Caltech 101 is 84.5%. Based on this number, there are some interesting findings in line with those for VGG16. Using Wave or Flipping alone does not improve the performance of VGG19, but if we combine Flipping with Wave we get a considerable improvement, with accuracy reaching 85.81%.
Similar results can be seen for our experiments on Caltech 256. The use of Flipping as data augmentation gives an accuracy of 66.66%, while the use of Scream gives 63.13%. However, if we combine the two approaches, we get an accuracy of 67.28%. The combination of Flipping and Wave gives 66.32% accuracy, which is higher than for Wave alone.
The experiments performed on VGG19 show that style transfer is still an effective data augmentation method, which can be combined with the traditional approaches to further improve performance.
In this paper, we proposed a novel data augmentation approach based on the neural style transfer algorithm. From our experiments, we observe that this approach is an effective way to increase the performance of CNNs in image classification tasks. The accuracy for VGG16 increased from 83.34% to 85.26%, while for VGG19 the accuracy increased from 84.50% to 85.81%. We also found that we can combine this new approach with traditional methods such as flipping to boost performance.
We tested our method only on the image classification task. As future work, it would be interesting to try it on other computer vision tasks such as segmentation, detection etc. The set of styles is also limited. Even though we tried to select images of different styles, we did not classify the style images according to their category. A more elaborate set of styles might train more robust models. Another limitation of this approach is the speed of training, which is quite slow. However, once the classifier is trained, inference can still be fast as it does not involve augmentation.
Our approach is independent of the architecture. Better and faster models for style transfer will enable more diverse and robust augmentation for a CNN. We hope that the proposed approach will be useful for computer vision research in future.
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under the Grant Number 15/RP/2776.