Neural Style Transfer: A Review
The recent work of Gatys et al. demonstrated the power of Convolutional Neural Networks (CNNs) in creating artistic imagery by separating and recombining image content and style. This process of using a CNN to migrate the semantic content of one image to different styles is referred to as Neural Style Transfer. Since then, Neural Style Transfer has become a trending topic in both academic literature and industrial applications. It is receiving increasing attention from computer vision researchers, and several methods have been proposed to either improve or extend the original neural algorithm of Gatys et al. This review aims to provide an overview of current progress in Neural Style Transfer and to discuss its various applications and open problems for future research. A list of the papers mentioned in this review, together with the corresponding code, pre-trained models and further comparison results, is publicly available at: https://github.com/ycjing/Neural-Style-Transfer-Papers.
Painting is a popular form of art. For hundreds of years, people have been drawn to painting by a succession of remarkable artworks, e.g., Van Gogh’s “The Starry Night”. In the past, manually redrawing an image in a particular style required a well-trained artist and a great deal of time.
Since the mid-1990s, the art theories behind these artworks have attracted the attention of not only artists but also many computer science researchers. Many studies explore how to automatically turn images into synthetic artworks so that everyone can be an artist. Among these studies, the advances in Non-photorealistic Rendering (NPR) [33, 81, 74] are inspiring, and NPR is nowadays a firmly established field within computer graphics. However, most of these pre-neural NPR stylisation algorithms depend heavily on the specific artistic styles they simulate [74, 30] and cannot easily be extended to other artistic styles. In the computer vision community, by contrast, style transfer is usually studied as a generalisation of texture synthesis, i.e., extracting texture from a source image and transferring it to a target. However, only low-level features are exploited during this transfer process, and the results are sometimes unimpressive.
Recently, inspired by the power of Convolutional Neural Networks (CNNs) and by advances in Visual Texture Modelling, Gatys et al. first studied how to use a CNN to reproduce famous painting styles on natural images. They proposed to model the content of a photo as the feature responses of a pre-trained CNN, and to model the style of an artwork as its texture information, represented by spatial summary statistics over CNN responses. Their experimental results demonstrated that the modelled high-level content and style representations are separable, which indicates the possibility of changing a photo’s style while preserving its desired semantic content. Based on this finding, Gatys et al. were the first to exploit the responses of a pre-trained CNN to recombine the content of a given photo with the style of famous artworks. The key idea behind their algorithm is to optimise an image with the objective of matching desired CNN feature statistics, which involve both the photo’s content information and the artwork’s style information. Their algorithm successfully produces stylised images with the appearance of a given artwork. An example of transferring the style of the Chinese painting “Dwelling in the Fuchun Mountains” onto a photo of The Great Wall is shown in Figure 1. Since this Neural Style Transfer algorithm places no explicit restrictions on the type of style image, it removes the constraints of previous approaches. The work of Gatys et al. opened up a new field called Neural Style Transfer, which is the process of using a Convolutional Neural Network to migrate the semantic content of one image to different styles.
The seminal work of Gatys et al. has attracted wide attention from both academia and industry. In academia, many follow-up studies were proposed to either improve or extend this innovative algorithm, and before long these technologies were applied in many successful industrial applications (e.g., Prisma, Ostagram, Deep Forger). However, there is no comprehensive survey summarising and discussing recent significant advances as well as challenges within this new field of Neural Style Transfer.
In this paper, we aim to give an overview of advances (up to March 2018) in Neural Style Transfer. Our contributions are threefold. Firstly, we investigate, classify and summarise recent advances in the field of Neural Style Transfer. Secondly, we present several evaluation methods and experimentally compare different Neural Style Transfer methods. Thirdly, we summarise current challenges in this field, and propose the corresponding possible solutions.
The organisation of this paper is as follows. We start our review with an introduction of the style transfer methods in the pre-neural era. Then we explain the derivations and foundations of Neural Style Transfer in Section 3 as prior knowledge. Based on Section 3, Section 4 categorises existing Neural Style Transfer methods and explains these methods in detail. Also, some improvement strategies for these methods as well as extensions will be given in Section 5. Section 6 presents several methodologies for evaluating Neural Style Transfer algorithms, and aims to build a standardised benchmark for follow-up studies. Section 7 demonstrates commercial applications of these Neural Style Transfer methods, including current successful usages as well as potential applications. In Section 8, we summarise current challenges in Neural Style Transfer, as well as propose corresponding possible solutions. Finally, Section 9 concludes the paper, and Section 10 delineates several promising directions for future research.
2 Style Transfer in Pre-neural Era
Style transfer is a long-standing research topic. Owing to its wide variety of applications, it has been an important research area for more than two decades. Before the appearance of Neural Style Transfer, related research in computer graphics had grown into an area called Non-photorealistic Rendering (NPR). In the field of computer vision, style transfer is often considered a generalised problem of texture synthesis. In this section, we briefly review some of these pre-neural style transfer algorithms. For a more comprehensive overview, we recommend [74, 51, 77].
Style Transfer via Stroke-based Rendering.
Stroke-based rendering (SBR) refers to the process of placing virtual strokes (e.g., brush strokes, tiles, stipples) upon a digital canvas to render a photograph in a particular style. An SBR process generally starts from a source photo, incrementally composites strokes to match the photo, and finally produces non-photorealistic imagery that looks like the photo but has an artistic style. During this process, an objective function is designed to guide the greedy or iterative placement of strokes. Despite the effectiveness of the large body of SBR algorithms, each algorithm is usually designed for one particular style (e.g., oil paintings, watercolours, sketches), which limits flexibility.
Style Transfer via Image Analogy.
Image analogy is a framework that aims to learn a mapping between a pair of source and target stylised images in a supervised manner. The training set comprises pairs of unstylised source images and corresponding stylised images in a particular style. From these example pairs, an image analogy algorithm learns the analogous transformation and then creates an analogous stylised result for a test input photograph. Image analogy can also be extended in various ways, e.g., to learn stroke placements for portrait painting rendering. In general, image analogy is effective for a variety of artistic styles. However, paired training data are usually unavailable in practice.
Style Transfer via Image Filtering.
Creating an artistic image is essentially a process of image simplification and abstraction. It is therefore natural to consider adapting or combining related image processing filters to render a given photo. For example, Winnemöller et al. were the first to exploit bilateral and difference-of-Gaussians filters to automatically produce cartoon-like effects. In general, rendering algorithms based on image filtering are straightforward to implement and efficient in practice. The price is that they are very limited in style diversity.
Style Transfer via Texture Synthesis.
Textures are repeated visual patterns in an image. Texture synthesis is the process of growing new texture that resembles a given source texture image. It has long been an active research topic in computer vision [25, 91]. Given the distribution of a texture instance $I_s$, texture synthesis can be regarded as drawing a sample $I$ from that distribution:
$$I \sim p(I \mid I_s). \tag{1}$$
Texture synthesis is closely related to style transfer, since one can consider style as a kind of texture. In that sense, style transfer is a process of texture transfer, which constrains the given semantic content of an image $I_c$ while synthesising textures:
$$I \sim p(I \mid I_c, I_s). \tag{2}$$
This view was already proposed in one of the earliest works on texture synthesis. Many related works [24, 21] follow this route, and they are generally built upon patch matching and quilting. Even recently, Frigo et al. proposed an effective style transfer algorithm based on traditional texture synthesis techniques. Their idea is to first divide the input image adaptively into suitable patches, search for the optimal mapping from candidate regions in the style image, and then apply bilinear blending and colour transfer to obtain the final stylised result. Another recent work proposes a series of steps to stylise images, including similarity matching of content and style patches, patch aggregation, content fusion based on a segmentation algorithm, etc. Built upon traditional texture synthesis techniques, style transfer can be performed in an unsupervised manner. However, these texture synthesis based algorithms exploit only low-level image features, which limits their performance. Also, to our knowledge, none of these algorithms can stylise images in real time, owing to their step-by-step procedure.
3 Derivations of Neural Style Transfer
For a better understanding of the development of Neural Style Transfer, we start by introducing its derivations. To transfer an artistic style automatically, the first and most important issue is how to model and extract style from an image. Since style is also a form of texture, a straightforward approach is to relate visual style modelling back to the well-studied Visual Texture Modelling methods. After obtaining the style representation, the next issue is how to reconstruct an image that carries the desired style information while preserving its semantic content. This is addressed by Image Reconstruction techniques.
3.1 Visual Texture Modelling
Visual texture modelling has previously been studied as the heart of texture synthesis. Throughout its history, there have been two distinct approaches to modelling visual textures: Parametric Texture Modelling with Summary Statistics and Non-parametric Texture Modelling with MRFs.
Parametric Texture Modelling with Summary Statistics.
One path towards texture modelling is to capture image statistics from a sample texture and exploit summary statistical properties to model the texture. The idea was first described by Julesz, who models textures as pixel-based $N$-th order statistics. Later work exploits filter responses to analyse textures, instead of direct pixel-based measurements. Portilla and Simoncelli further introduce a texture model based on multi-scale oriented filter responses and use gradient descent to improve the synthesised results. A recent parametric texture modelling approach proposed by Gatys et al. is the first to measure summary statistics in the domain of Convolutional Neural Networks (CNNs). They design a Gram-based representation to model textures, namely the correlations between filter responses of a texture image in different layers of a pre-trained classification network (the VGG network), i.e., the Gram-based representation encodes the second order statistics of the set of CNN filter responses. We next explain this representation in detail, as it is used in the following sections.
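Assume that the feature map of a sample texture image $I_s$ at layer $l$ of the VGG network is $\mathcal{F}^l(I_s) \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels, and $H$ and $W$ are the height and width of the feature map $\mathcal{F}^l(I_s)$. The Gram-based representation is then obtained by computing the Gram matrix $\mathcal{G}(\mathcal{F}^l(I_s)') \in \mathbb{R}^{C \times C}$ over the feature map $\mathcal{F}^l(I_s)' \in \mathbb{R}^{C \times (HW)}$ (a reshaped version of $\mathcal{F}^l(I_s)$):
$$\mathcal{G}(\mathcal{F}^l(I_s)') = [\mathcal{F}^l(I_s)'][\mathcal{F}^l(I_s)']^T. \tag{3}$$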
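As a minimal sketch of the Gram-based representation, the snippet below computes the Gram matrix of a feature map stored as a NumPy array of shape (C, H, W). The random array is only a stand-in for real VGG activations, and the function name `gram_matrix` is an illustrative choice rather than anything prescribed by the original work.

```python
import numpy as np


def gram_matrix(feature_map):
    """Gram matrix F' F'^T of a (C, H, W) feature map, with F' of shape (C, H*W)."""
    c, h, w = feature_map.shape
    flat = feature_map.reshape(c, h * w)
    return flat @ flat.T


rng = np.random.default_rng(0)
features = rng.standard_normal((3, 4, 5))  # assumed stand-in for VGG activations
gram = gram_matrix(features)

# Shuffling the spatial positions leaves the Gram matrix unchanged,
# because the sum over positions ignores where each response occurred.
perm = rng.permutation(4 * 5)
shuffled = features.reshape(3, -1)[:, perm].reshape(3, 4, 5)
gram_shuffled = gram_matrix(shuffled)
```

The result is a symmetric C x C matrix whose size is independent of the spatial resolution, and the shuffling check illustrates that the representation carries no information about spatial arrangement.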
This Gram-based texture representation from a CNN is effective at modelling a wide variety of both natural and non-natural textures. However, the Gram-based representation is designed to capture global statistics and discards spatial arrangement, which leads to unsatisfying results when modelling regular textures with long-range symmetric structures. To address this problem, Berger and Memisevic propose to translate the feature map horizontally and vertically by $\delta$ pixels, so as to correlate the feature at position $(i, j)$ with those at positions $(i + \delta, j)$ and $(i, j + \delta)$. In this way, the representation incorporates spatial arrangement information and is more effective at modelling textures with symmetric properties.
Non-parametric Texture Modelling with MRFs.
Another notable methodology is non-parametric resampling. A variety of non-parametric methods are based on the Markov Random Field (MRF) model, which assumes that in a texture image each pixel is entirely characterised by its spatial neighbourhood. Under this assumption, Efros and Leung propose to synthesise pixels one by one, searching for similar neighbourhoods in the source texture image and assigning the corresponding pixel. Their work is one of the earliest non-parametric algorithms based on the MRF model. Following their work, Wei and Levoy further speed up the neighbourhood matching process by always using a fixed neighbourhood.
3.2 Image Reconstruction
In general, an essential step in many vision tasks (e.g., object recognition) is to extract an abstract representation from the input image. Image reconstruction is the reverse process: reconstructing the whole input image from the extracted image representation. It has previously been studied as a way to analyse a particular image representation and discover what information it contains. Here our major focus is on image reconstruction algorithms based on CNN representations, which can be categorised into Slow Image Reconstruction based on Online Image Optimisation and Fast Image Reconstruction based on Offline Model Optimisation.
Slow Image Reconstruction based on Online Image Optimisation.
The first algorithm to reverse CNN representations was proposed by Mahendran and Vedaldi [63, 64]. Given a CNN representation to be reversed, their algorithm iteratively optimises an image (generally starting from random noise) until its CNN representation is similar to the desired one. The iterative optimisation process is based on gradient descent in image space. The process is therefore time-consuming, especially when the desired reconstructed image is large.
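The following toy sketch illustrates optimisation in image space. It is not the algorithm of [63, 64]: the CNN is replaced by a fixed random linear map (`represent`, a hypothetical stand-in), so the gradient can be written by hand, and the image is a flat vector of 16 values. The loop structure, however, is the same: start from noise and repeatedly step the image along the negative gradient of the representation mismatch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_features = 16, 64
# Assumed stand-in for a CNN: a fixed random linear feature extractor.
W = rng.standard_normal((n_features, n_pixels)) / np.sqrt(n_features)


def represent(image):
    """Map a flattened image to its (toy) representation."""
    return W @ image


def reconstruct(target_rep, steps=500):
    """Gradient descent in image space on 0.5 * ||represent(image) - target_rep||^2."""
    image = rng.standard_normal(n_pixels)          # start from random noise
    step_size = 1.0 / np.linalg.norm(W, 2) ** 2    # 1 / largest eigenvalue of W^T W
    for _ in range(steps):
        residual = represent(image) - target_rep
        grad = W.T @ residual                      # gradient with respect to the image
        image -= step_size * grad
    return image


original = rng.random(n_pixels)
recovered = reconstruct(represent(original))
error = np.max(np.abs(recovered - original))
```

With a real network the gradient would come from backpropagation and each step would require a full forward and backward pass, which is why this family of methods is slow for large images.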
Fast Image Reconstruction based on Offline Model Optimisation.
To address the efficiency issue of [63, 64], Dosovitskiy and Brox propose to train a feed-forward network in advance, shifting the computational burden to the training stage. At the testing stage, the reverse process is simply done with a network forward pass. Their algorithm significantly speeds up the image reconstruction process. In later work, they further incorporate a Generative Adversarial Network (GAN) to improve the results.
4 A Taxonomy of Neural Style Transfer Algorithms
Neural Style Transfer is a subset of the large body of style transfer work discussed above, as shown in Figure 2. It denotes the group of Style Transfer via Neural Network. One can also say that Neural Style Transfer is a combination of Style Transfer via Texture Synthesis and Convolutional Neural Networks. In this section, we provide a categorisation of Neural Style Transfer methods. Current methods fit into one of two categories: Slow Neural Method Based On Online Image Optimisation and Fast Neural Method Based On Offline Model Optimisation. The first category transfers the style by iteratively optimising an image, i.e., algorithms in this category are built on Slow Image Reconstruction techniques. The second category optimises a generative model offline and produces the stylised image with a single forward pass, which exploits the idea of Fast Image Reconstruction techniques.
4.1 Slow Neural Method Based On Online Image Optimisation
DeepDream was the first to propose generating artistic work by reversing CNN representations with Slow Image Reconstruction techniques. By further exploiting Visual Texture Modelling techniques to model style, Slow Neural Methods Based On Online Image Optimisation were subsequently proposed, which laid the early foundations for the field of Neural Style Transfer. The basic idea is to first model and extract style and content information from the style and content images, combine them as the target representation, and then iteratively reconstruct a (yet unknown) result that matches the target representation. In general, different Slow Neural Methods share the same Slow Image Reconstruction technique and differ only in the way they model visual style, which builds on the two categories of Visual Texture Modelling techniques introduced in Section 3.1.
4.1.1 Parametric Slow Neural Method with Summary Statistics
The first subset of Slow Neural Methods is based on Parametric Texture Modelling with Summary Statistics. The style is characterised as a set of spatial summary statistics.
We start by introducing the first Neural Style Transfer algorithm, proposed by Gatys et al. [28, 30]. By reconstructing representations from intermediate layers of the VGG network, Gatys et al. observe that a deep convolutional neural network is capable of extracting semantic image content from an arbitrary photograph and some appearance information from a well-known artwork. Based on this observation, they build the content component of the newly stylised image by penalising the difference between the high-level representations derived from the content image and the stylised image, and build the style component by matching Gram-based summary statistics of the stylised image and the style image, which derive from their texture modelling technique (see Section 3.1 for details).
Given a content image $I_c$ and a style image $I_s$, the algorithm tries to find a stylised image $I$ that minimises the following objective:
$$I^* = \arg\min_I \mathcal{L}_{total}(I_c, I_s, I) = \arg\min_I \; \alpha \mathcal{L}_c(I_c, I) + \beta \mathcal{L}_s(I_s, I), \tag{4}$$
where $\mathcal{L}_c$ compares the content representation of the content image to that of the (yet unknown) stylised image, and $\mathcal{L}_s$ compares the Gram-based style representation derived from the style image to that of the (yet unknown) stylised image. $\alpha$ and $\beta$ balance the content component and the style component in the stylised result.
The content loss $\mathcal{L}_c$ is defined as the squared Euclidean distance between the feature representation $\mathcal{F}^l$ of the content image $I_c$ in layer $l$ and that of the (yet unknown) stylised image $I$:
$$\mathcal{L}_c = \sum_{l \in \{l_c\}} \| \mathcal{F}^l(I_c) - \mathcal{F}^l(I) \|^2, \tag{5}$$
where $\{l_c\}$ denotes the set of VGG layers used for computing the content loss. For the style loss $\mathcal{L}_s$, the Gram-based visual texture modelling technique described in Section 3.1 is used to model the style. The style loss is therefore defined as the squared Euclidean distance between the Gram-based style representations of $I_s$ and $I$:
$$\mathcal{L}_s = \sum_{l \in \{l_s\}} \| \mathcal{G}(\mathcal{F}^l(I_s)') - \mathcal{G}(\mathcal{F}^l(I)') \|^2, \tag{6}$$
where $\mathcal{G}$ is the aforementioned Gram matrix, which encodes the second order statistics of the set of filter responses (the prime denotes the feature map reshaped to $C \times HW$), and $\{l_s\}$ is the set of VGG layers used for calculating the style loss. The choice of $\{l_c\}$ and $\{l_s\}$ empirically follows the principle that lower layers tend to retain low-level features (e.g., colours), while higher layers generally preserve more high-level semantic content information. Therefore, $\mathcal{L}_s$ is usually computed with lower layers and $\mathcal{L}_c$ with higher layers. Gatys et al. use the pre-trained VGG-19 as the loss network, with a specific choice of $\{l_s\}$ and $\{l_c\}$. The VGG loss network is not the only option: similar performance can be achieved with other pre-trained classification networks, e.g., ResNet.
In Equation (4), both $\mathcal{L}_c$ and $\mathcal{L}_s$ are differentiable. Thus, with random noise as the initial $I$, Equation (4) can be minimised by gradient descent in image space with backpropagation to generate the final stylised result. In addition, a total variation denoising term is usually added in practice to encourage smoothness in the stylised result.
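To make the structure of the objective concrete, the sketch below evaluates a weighted sum of a content loss, a Gram-based style loss and a total variation term on NumPy arrays. Several things here are assumptions for illustration only: `toy_features` is a hypothetical two-layer stand-in for the VGG loss network (the image itself and a 2x2 average pool), the default weights are arbitrary, and the total variation term uses squared differences, which is one common variant.

```python
import numpy as np


def toy_features(image):
    """Hypothetical stand-in for VGG: layer 0 is the image, layer 1 a 2x2 average pool.

    Expects a (C, H, W) array with even H and W.
    """
    c, h, w = image.shape
    pooled = image.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return [image, pooled]


def gram(features):
    flat = features.reshape(features.shape[0], -1)
    return flat @ flat.T


def content_loss(feats_content, feats_stylised):
    """Sum of squared feature differences over the chosen content layers."""
    return sum(np.sum((fc - f) ** 2) for fc, f in zip(feats_content, feats_stylised))


def style_loss(feats_style, feats_stylised):
    """Sum of squared Gram matrix differences over the chosen style layers."""
    return sum(np.sum((gram(fs) - gram(f)) ** 2) for fs, f in zip(feats_style, feats_stylised))


def total_variation(image):
    """Squared differences between neighbouring pixels (smoothness penalty)."""
    dh = image[:, 1:, :] - image[:, :-1, :]
    dw = image[:, :, 1:] - image[:, :, :-1]
    return np.sum(dh ** 2) + np.sum(dw ** 2)


def total_loss(content_img, style_img, stylised, alpha=1.0, beta=1.0, tv_weight=0.0):
    fc, fs, f = toy_features(content_img), toy_features(style_img), toy_features(stylised)
    l_c = content_loss(fc[1:], f[1:])  # content: higher layer only
    l_s = style_loss(fs, f)            # style: all layers, including the lowest
    return alpha * l_c + beta * l_s + tv_weight * total_variation(stylised)
```

In an actual implementation the feature maps would come from a pre-trained network and the gradient of `total_loss` with respect to the stylised image would be obtained by backpropagation in a framework with automatic differentiation.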
The Gram-based style representation is not the only way to encode style information statistically. Other effective statistical style representations can be derived from it. Li et al. derive different style representations by considering style transfer in the context of transfer learning, or more specifically, domain adaptation. Given that training and testing data are drawn from different distributions, the goal of domain adaptation is to adapt a model trained on labelled training data from a source domain so that it predicts the labels of unlabelled testing data from a target domain. One approach to domain adaptation is to match a sample in the source domain to one in the target domain by minimising their distribution discrepancy, for which Maximum Mean Discrepancy (MMD) is a popular measure. Li et al. prove that matching the Gram-based style representations of the style image and the stylised image is intrinsically equivalent to minimising the MMD with a quadratic polynomial kernel. It is therefore expected that other kernel functions for MMD can equally be applied in style transfer, e.g., the linear kernel, polynomial kernel or Gaussian kernel. Another related representation is the BN statistic representation, which uses the mean and variance statistics of the feature maps in VGG layers to model style:
$$\mathcal{L}_s = \sum_{l \in \{l_s\}} \frac{1}{C^l} \sum_{c=1}^{C^l} \left( \| \mu(\mathcal{F}^l_c(I_s)) - \mu(\mathcal{F}^l_c(I)) \|^2 + \| \sigma(\mathcal{F}^l_c(I_s)) - \sigma(\mathcal{F}^l_c(I)) \|^2 \right), \tag{7}$$
where $\mathcal{F}^l_c$ is the $c$-th feature map channel at layer $l$ of the VGG network, $C^l$ is the number of channels, and $\mu$ and $\sigma$ denote the channel-wise mean and standard deviation.
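Both statistical views can be checked numerically on toy data. The sketch below first verifies that the squared distance between two Gram matrices equals an unnormalised squared MMD with the quadratic kernel $k(a, b) = (a^T b)^2$, where each spatial position contributes one sample (constant normalisation factors are dropped, which is an assumption of this sketch). It then computes a single-layer mean and standard deviation matching loss. The function names are illustrative.

```python
import numpy as np


def gram(flat):
    """Gram matrix of features flattened to shape (C, positions)."""
    return flat @ flat.T


def quadratic_kernel_sum(a, b):
    """Sum of k(a_i, b_j) = (a_i^T b_j)^2 over all pairs of columns."""
    return np.sum((a.T @ b) ** 2)


def mmd_quadratic(x, y):
    """Unnormalised squared MMD between the column sets of x and y."""
    return (quadratic_kernel_sum(x, x) + quadratic_kernel_sum(y, y)
            - 2.0 * quadratic_kernel_sum(x, y))


def bn_statistics_loss(feats_style, feats_stylised):
    """Per-channel mean and standard deviation matching for one layer."""
    c = feats_style.shape[0]
    fs = feats_style.reshape(c, -1)
    f = feats_stylised.reshape(c, -1)
    mean_term = (fs.mean(axis=1) - f.mean(axis=1)) ** 2
    std_term = (fs.std(axis=1) - f.std(axis=1)) ** 2
    return float(np.sum(mean_term + std_term) / c)


rng = np.random.default_rng(0)
style = rng.standard_normal((4, 6, 6))      # stand-ins for VGG activations
stylised = rng.standard_normal((4, 6, 6))
s_flat, f_flat = style.reshape(4, -1), stylised.reshape(4, -1)

gram_distance = np.sum((gram(s_flat) - gram(f_flat)) ** 2)
mmd_value = mmd_quadratic(s_flat, f_flat)   # agrees with gram_distance
```

Swapping `quadratic_kernel_sum` for another kernel gives the alternative style losses that the MMD interpretation suggests.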
However, the Gram-based algorithm suffers from instabilities during optimisation. It also requires manual tuning of the parameters, which is very tedious. Risser et al. find that feature activations with quite different means and variances can still have the same Gram matrix, which is the main reason for the instabilities. Inspired by this observation, Risser et al. introduce an extra histogram loss, which guides the optimisation to match the entire histogram of feature activations. They also present a preliminary solution to automatic parameter tuning, which explicitly prevents gradients with extreme values through extreme gradient normalisation.
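The ambiguity is easy to reproduce with a hand-picked example. For a single channel with $N$ positions, the Gram entry is the sum of squared activations, i.e. $N(\mu^2 + \sigma^2)$, so mean and variance can be traded against each other without changing it. The two arrays below are illustrative values chosen for this sketch, not data from the cited work.

```python
import numpy as np

# One channel, four spatial positions each.
a = np.array([[1.0, 1.0, 1.0, 1.0]])   # constant activations
b = np.array([[2.0, 0.0, 0.0, 0.0]])   # one strong, three silent activations

gram_a = a @ a.T
gram_b = b @ b.T
```

Both Gram matrices are identical although the activation histograms clearly differ, which is the kind of mismatch a histogram loss is meant to penalise.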
The neural methods above share one common aspect: they compare the content and stylised images only in the CNN feature space to make the stylised image semantically similar to the content image. But since CNN features inevitably lose some low-level information contained in the image, the stylised results usually contain some unappealing distorted structures and irregular artefacts. To preserve the coherence of fine structures during stylisation, Li et al. propose to incorporate additional constraints on low-level features in pixel space. They introduce an additional Laplacian loss, defined as the squared Euclidean distance between the Laplacian filter responses of the content image and the stylised result. The Laplacian filter computes the second order derivatives of the pixels in an image and is widely used for edge detection.
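A simplified sketch of such a loss on single-channel images is given below. The 4-neighbour 3x3 kernel and the valid-mode filtering are assumptions made for brevity; the cited method may differ in kernel choice and preprocessing.

```python
import numpy as np

# Assumed discrete Laplacian kernel (4-neighbour variant).
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])


def laplacian_response(gray):
    """Valid-mode response of a 2-D image to the 3x3 Laplacian kernel."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for di in range(3):
        for dj in range(3):
            out += LAPLACIAN[di, dj] * gray[di:di + h - 2, dj:dj + w - 2]
    return out


def laplacian_loss(content, stylised):
    """Squared Euclidean distance between the two Laplacian responses."""
    diff = laplacian_response(content) - laplacian_response(stylised)
    return float(np.sum(diff ** 2))
```

Because the Laplacian of a constant is zero, the loss ignores global brightness shifts and reacts only to changes in edges and fine structure, which is the property the pixel-space constraint relies on.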
4.1.2 Non-parametric Slow Neural Method with MRFs
The Non-parametric Slow Neural Method is built on Non-parametric Texture Modelling with MRFs. This category considers Neural Style Transfer at a local level, i.e., it operates on patches to match the style.
Li and Wand are the first to propose an MRF-based Neural Style Transfer algorithm. They find that the parametric style transfer method with summary statistics captures only per-pixel feature correlations and does not constrain the spatial layout, which leads to less visually plausible results for photorealistic styles. Their solution is to model the style in a non-parametric way and to introduce a new style loss function that includes a patch-based MRF prior:
$$\mathcal{L}_s = \sum_{l \in \{l_s\}} \sum_{i=1}^{m} \| \Psi_i(\mathcal{F}^l(I)) - \Psi_{NN(i)}(\mathcal{F}^l(I_s)) \|^2, \tag{8}$$
where $\Psi(\mathcal{F}^l(I))$ is the set of all local patches from the feature map $\mathcal{F}^l(I)$. $\Psi_i$ denotes the $i$-th local patch and $\Psi_{NN(i)}$ denotes the style patch most similar to the $i$-th local patch in the stylised image $I$. The best match $\Psi_{NN(i)}$ is obtained by calculating the normalised cross-correlation over all style patches in the style image $I_s$. $m$ is the total number of local patches. Since their algorithm matches style at the patch level, fine structure and arrangement are preserved much better. Given a photograph as the content, their algorithm achieves remarkable results, especially for photorealistic styles.
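The patch-based loss for a single layer can be sketched as follows. Feature maps are NumPy arrays of shape (C, H, W) standing in for VGG activations, patches are extracted densely with stride 1, and the nearest neighbour is selected by normalised cross-correlation computed as a cosine similarity between flattened patches. The patch size, stride and function names are assumptions of this sketch.

```python
import numpy as np


def extract_patches(features, size=3):
    """All size x size patches of a (C, H, W) feature map, one flattened patch per row."""
    c, h, w = features.shape
    patches = [features[:, i:i + size, j:j + size].ravel()
               for i in range(h - size + 1)
               for j in range(w - size + 1)]
    return np.stack(patches)


def mrf_style_loss(feats_stylised, feats_style, size=3, eps=1e-8):
    """Sum of squared distances between each patch and its best-matching style patch."""
    p = extract_patches(feats_stylised, size)
    q = extract_patches(feats_style, size)
    # Normalised cross-correlation between every stylised patch and every style patch.
    p_unit = p / (np.linalg.norm(p, axis=1, keepdims=True) + eps)
    q_unit = q / (np.linalg.norm(q, axis=1, keepdims=True) + eps)
    nearest = np.argmax(p_unit @ q_unit.T, axis=1)   # NN(i) for each patch i
    return float(np.sum((p - q[nearest]) ** 2))
```

Unlike the Gram-based loss, this loss changes when local neighbourhoods are rearranged into configurations that do not occur in the style image, which is what constrains the local layout.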
4.2 Fast Neural Method Based On Offline Model Optimisation
Although the Slow Neural Method Based On Online Image Optimisation is able to yield impressive stylised images, it still has limitations. The most pressing one is efficiency. The second category, the Fast Neural Method, addresses speed and computational cost by exploiting Fast Image Reconstruction based on Offline Model Optimisation to reconstruct the stylised result, i.e., a feed-forward network $g$ with parameters $\theta$ is optimised in advance over a large set of images $I_c$ for one or more style images $I_s$:
$$\theta^* = \arg\min_\theta \mathcal{L}_{total}(I_c, I_s, g_\theta(I_c)), \qquad I^* = g_{\theta^*}(I_c). \tag{9}$$
Depending on the number of artistic styles a single $g$ can produce, Fast Neural Methods are further divided into Per-Style-Per-Model Fast Neural Method (PSPM), Multiple-Style-Per-Model Fast Neural Method (MSPM), and Arbitrary-Style-Per-Model Fast Neural Method (ASPM).
4.2.1 Per-Style-Per-Model Fast Neural Method
The first two Fast Neural Methods were proposed by Johnson et al. and Ulyanov et al. respectively. The two methods share a similar idea: pre-train a feed-forward style-specific network and produce a stylised result with a single forward pass at the testing stage. They differ only in network architecture: Johnson et al.’s design roughly follows the network proposed by Radford et al. but with residual blocks as well as strided and fractionally strided convolutions, while Ulyanov et al. use a multi-scale architecture as the generator network. The objective function is similar to that of Gatys et al., which indicates that these are also parametric methods with summary statistics.
Shortly after [48, 84], Ulyanov et al. further find that simply applying normalisation to every single image, rather than to a batch of images (i.e., batch normalisation (BN)), leads to a significant improvement in stylisation quality. This single-image normalisation is called Instance Normalisation (IN), which is equivalent to batch normalisation when the batch size is set to 1. A style transfer network with IN is shown to converge faster than one with BN and also achieves visually better results. One interpretation is that IN is a form of style normalisation that can directly normalise the style of each content image to the desired style. The objective is therefore easier to learn, as the rest of the network only needs to take care of the content loss.
Another work by Li and Wand is inspired by the MRF-based Neural Style Transfer algorithm of Section 4.1.2. They address the efficiency issue by training a Markovian feed-forward network using adversarial training. Like the method of Section 4.1.2, their algorithm is a patch-based non-parametric method with MRFs. Thanks to its patch-based design, their method is shown to outperform the algorithms of Johnson et al. and Ulyanov et al. in preserving coherent textures in complex images.
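The relation between the two normalisations can be checked with a few lines of NumPy. The sketch assumes training-mode statistics (computed from the current input) and omits the learnable affine parameters; activations are stored as (N, C, H, W) arrays.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Normalise each channel over the batch and spatial axes of a (N, C, H, W) array."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def instance_norm(x, eps=1e-5):
    """Normalise each channel of each image over its own spatial axes only."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
single = rng.standard_normal((1, 3, 4, 4))   # batch size 1
batch = rng.standard_normal((4, 3, 4, 4))    # batch size 4

same_for_single = np.allclose(batch_norm(single), instance_norm(single))
same_for_batch = np.allclose(batch_norm(batch), instance_norm(batch))
```

The two functions agree for a batch of one image and differ for larger batches, where instance normalisation keeps the statistics of each image independent of the other images in the batch.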
4.2.2 Multiple-Style-Per-Model Fast Neural Method
Although the above PSPM approaches can produce stylised images two orders of magnitude faster than slow methods, a separate generative network has to be trained for each style image, which is time-consuming and inflexible. Yet many paintings (e.g., impressionist paintings) share similar paint strokes and differ only in their colour palettes. Intuitively, it is redundant to train a separate network for each of them. MSPM is therefore proposed, which improves the flexibility of PSPM by incorporating multiple styles into a single model. There are generally two paths towards handling this problem: 1) tying only a small number of parameters in a network to each style [22, 14], and 2) still exploiting only a single network like PSPM but combining both style and content as inputs [95, 55].
1) Tying only a small amount of parameters to each style.
An early work by Dumoulin et al. builds on the IN layer proposed for PSPM and introduced in Section 4.2.1. They surprisingly find that keeping the same convolutional parameters and changing only the scaling and shifting parameters of the IN layers is sufficient to model different styles. They therefore propose an algorithm that trains a conditional style transfer network for multiple styles based on conditional instance normalisation (CIN), defined as:
$$CIN(\mathcal{F}(I_c), s) = \gamma^s \left( \frac{\mathcal{F}(I_c) - \mu(\mathcal{F}(I_c))}{\sigma(\mathcal{F}(I_c))} \right) + \beta^s, \tag{10}$$
where $\mathcal{F}$ is the input feature activation and $s$ is the index of the desired style from a set of style images. As shown in Equation (10), the conditioning for each style is done by the scaling and shifting parameters $\gamma^s$ and $\beta^s$ after normalising the feature activation $\mathcal{F}(I_c)$, i.e., each style can be achieved by tuning the parameters of an affine transformation. One interpretation of its impressive results is similar to that given for IN in Section 4.2.1: normalising feature statistics with different affine parameters can normalise the input content image to different styles. Furthermore, the algorithm of Dumoulin et al. can also be extended to combine multiple styles in a single stylised result by combining the parameters of different styles.
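A minimal sketch of conditional instance normalisation on a single (C, H, W) feature map follows. The per-style parameter tables `gammas` and `betas` hold made-up values here; in a trained network they would be learned, one row per style. The small `eps` for numerical stability is also an addition of this sketch.

```python
import numpy as np


def conditional_instance_norm(features, gamma, beta, eps=1e-5):
    """Normalise each channel of a (C, H, W) map, then scale by gamma and shift by beta.

    gamma and beta are length-C vectors belonging to the selected style.
    """
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(features.var(axis=(1, 2), keepdims=True) + eps)
    normalised = (features - mean) / std
    return gamma[:, None, None] * normalised + beta[:, None, None]


rng = np.random.default_rng(0)
features = rng.standard_normal((2, 8, 8))        # stand-in for encoder activations
gammas = np.array([[1.0, 2.0], [0.5, 3.0]])      # hypothetical table: (num_styles, C)
betas = np.array([[0.0, 1.0], [-1.0, 0.5]])

s = 1                                            # index of the desired style
out = conditional_instance_norm(features, gammas[s], betas[s])

# Combining styles by interpolating their affine parameters.
mixed = conditional_instance_norm(features,
                                  0.5 * gammas[0] + 0.5 * gammas[1],
                                  0.5 * betas[0] + 0.5 * betas[1])
```

After the layer, each channel has mean close to `betas[s]` and standard deviation close to `gammas[s]`, so switching style only requires selecting a different row of the two tables while all convolutional weights stay shared.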
Another algorithm that follows the first path of MSPM is proposed by Chen et al. Their idea is to explicitly decouple style and content, i.e., to use separate network components to learn the corresponding content and style information. More specifically, they use mid-level convolutional filters (called the “StyleBank” layer) to learn different styles individually. Each style is tied to a set of parameters in the “StyleBank” layer. The remaining components of the network are used to learn semantic content information, which is shared across styles. Their algorithm also supports flexible incremental training, which fixes the content components of the network and trains only the “StyleBank” layer for a newly added style.
2) Combining both style and content as inputs.
The disadvantage of the first path is that the model size becomes larger with the increase of the number of learned styles. The second path of MSPM addresses this limitation by fully exploring the expressive ability of a single network and combining both content and style into the network for style identification. Different MSPM algorithms differ in the way to incorporate style into the network.
Li et al. propose to first sample a set of noise maps from a uniform distribution and to establish a one-to-one mapping between each style and a noise map. For clarity, we divide the style transfer network into an encoder ($Enc$) and decoder ($Dec$) pair. At the training stage, for each style $I_s$, the corresponding noise map $\mathcal{N}(I_s)$ is concatenated ($\oplus$) with the encoded feature activations $Enc(I_c)$ and then fed into the decoder to obtain the stylised result: $I = Dec(Enc(I_c) \oplus \mathcal{N}(I_s))$.
Zhang and Dana first forward each style image in the style set through the pre-trained VGG network and obtain multi-scale feature activations $\mathcal{F}(I_s)$ from different VGG layers. The multi-scale $\mathcal{F}(I_s)$ are then combined with multi-scale encoded features $Enc(I_c)$ from different layers of the encoder through their proposed inspiration layers. The inspiration layers are designed to reshape $\mathcal{F}(I_s)$ to match the desired dimension, and they also have a learnable weight matrix that tunes the feature maps to help minimise the objective function.
4.2.3 Arbitrary-Style-Per-Model Fast Neural Method
Arbitrary-Style-Per-Model Fast Neural Method (ASPM) aims for one-model-for-all, \ie, one single trainable model to transfer arbitrary artistic styles. There are also two types of ASPM, one built upon Non-parametric Texture Modelling with MRFs and the other one built upon Parametric Texture Modelling with Summary Statistics.
1) Non-parametric ASPM with MRFs.
The first ASPM algorithm is proposed by Chen and Schmidt. They first extract a set of activation patches from content and style feature activations computed in a pre-trained VGG network. They then match each content patch to the most similar style patch and swap them (called “Style Swap”). The stylised result can be produced by reconstructing the resulting activation map after “Style Swap”, with either Slow Image Reconstruction based on Online Image Optimisation or Fast Image Reconstruction based on Offline Model Optimisation.
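The patch matching and swapping described above can be sketched as follows. This is a simplified illustration rather than the authors' code: the 3 × 3 patch size, the use of cosine similarity as the matching criterion, and the averaging of overlapping patches are assumptions of this sketch, and the feature maps are random stand-ins for VGG activations.

```python
import numpy as np

def extract_patches(feat, size=3):
    # feat: (C, H, W). Returns all overlapping patches, flattened to rows.
    c, h, w = feat.shape
    patches = [feat[:, i:i + size, j:j + size].ravel()
               for i in range(h - size + 1)
               for j in range(w - size + 1)]
    return np.stack(patches)  # (num_patches, C * size * size)

def style_swap(content_feat, style_feat, size=3):
    c, h, w = content_feat.shape
    content_patches = extract_patches(content_feat, size)
    style_patches = extract_patches(style_feat, size)
    # Normalising the style patches makes the argmax equivalent to picking
    # the style patch with the highest cosine similarity to each content patch.
    norms = np.linalg.norm(style_patches, axis=1, keepdims=True) + 1e-8
    best = np.argmax(content_patches @ (style_patches / norms).T, axis=1)
    # Rebuild the activation map, averaging where swapped patches overlap.
    out = np.zeros_like(content_feat)
    count = np.zeros((1, h, w))
    idx = 0
    for i in range(h - size + 1):
        for j in range(w - size + 1):
            out[:, i:i + size, j:j + size] += style_patches[best[idx]].reshape(c, size, size)
            count[:, i:i + size, j:j + size] += 1
            idx += 1
    return out / count
```

The swapped activation map would then be passed to an image reconstruction step (optimisation-based or a trained decoder) to obtain the stylised image.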
2) Parametric ASPM with summary statistics.
Building on the conditional instance normalisation (CIN) approach in Section 4.2.2, the simplest approach for arbitrary style transfer is to train a separate parameter prediction network that predicts the scaling and shifting parameters in Equation (10) from a number of training styles. Given a test style image, the CIN layers in the style transfer network take the affine parameters predicted for that image, and normalise the input content image to the desired style with a forward pass.
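A minimal sketch of this idea is given below, assuming the usual form of conditional instance normalisation (per-channel normalisation followed by a style-specific scale and shift). The function `predict_affine_params` is a hypothetical stand-in: a single linear map replaces the parameter prediction network, and the style embedding is random rather than computed from a style image.

```python
import numpy as np

def conditional_instance_norm(feat, gamma, beta, eps=1e-5):
    # feat: (C, H, W); gamma, beta: (C,) style-specific scale and shift.
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True)
    normalised = (feat - mean) / (std + eps)
    return gamma[:, None, None] * normalised + beta[:, None, None]

def predict_affine_params(style_embedding, w_gamma, w_beta):
    # Hypothetical linear prediction head (an assumption of this sketch);
    # a real parameter prediction network would be a trained deep model.
    return style_embedding @ w_gamma, style_embedding @ w_beta

rng = np.random.default_rng(0)
channels, embed_dim = 16, 8
w_gamma = rng.standard_normal((embed_dim, channels))
w_beta = rng.standard_normal((embed_dim, channels))
style_embedding = rng.standard_normal(embed_dim)      # stand-in for style features
content_feat = rng.standard_normal((channels, 32, 32))

gamma, beta = predict_affine_params(style_embedding, w_gamma, w_beta)
stylised_feat = conditional_instance_norm(content_feat, gamma, beta)
```

Because the affine parameters are predicted rather than stored per style, an unseen style only requires one forward pass through the prediction network.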
A similar approach is proposed by Huang and Belongie. Instead of training a parameter prediction network, Huang and Belongie propose to modify conditional instance normalisation (CIN) in Equation (10) to adaptive instance normalisation (AdaIN):
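Written with assumed notation, where $F(I_c)$ and $F(I_s)$ denote the content and style feature activations and $\mu(\cdot)$, $\sigma(\cdot)$ are the channel-wise mean and standard deviation, the AdaIN operation of Huang and Belongie takes the following standard form:

```latex
\mathrm{AdaIN}\big(F(I_c), F(I_s)\big)
  = \sigma\big(F(I_s)\big)\,
    \frac{F(I_c) - \mu\big(F(I_c)\big)}{\sigma\big(F(I_c)\big)}
  + \mu\big(F(I_s)\big)
```

Compared with CIN, the learned affine parameters are replaced by statistics computed directly from the style features, so no per-style parameters need to be stored.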
AdaIN transfers the channel-wise mean and variance feature statistics between content and style feature activations. Unlike the approaches above, the encoder in Huang and Belongie’s style transfer network is fixed and comprises the first few layers of a pre-trained VGG network. Therefore, the content feature fed to AdaIN is actually the feature activation from a pre-trained VGG network. The decoder needs to be trained with a large set of style and content images to decode the feature activations produced by AdaIN into the stylised result.
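The statistic transfer itself is a few lines of array arithmetic. The sketch below operates on a single (C, H, W) feature map and uses random arrays in place of VGG activations; the function name `adain` and the small `eps` stabiliser are choices made here, not taken from the original text.

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    # Both inputs are (C, H, W); the spatial sizes may differ.
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True) + eps
    # Normalise the content features per channel, then rescale and shift
    # them so that they carry the style's channel-wise mean and deviation.
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

In a full pipeline the output of `adain` would be passed to the trained decoder; only the decoder has learnable weights, which is why one model can serve arbitrary styles.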
A more recent work by Li \etal attempts to exploit a series of feature transformations, namely whitening and colouring transformations, to transfer arbitrary artistic styles in a style-learning-free manner. Similar to Huang and Belongie, Li \etal use the first few layers of pre-trained VGG as the encoder and train the corresponding decoder. But they replace the AdaIN layer between the encoder and decoder with a pair of whitening and colouring transformations (WCT). Their algorithm is built on the finding that the whitening transformation can remove style-related information while preserving the structure of the content. Therefore, on receiving content activations from the encoder, the whitening transformation filters the original style out of the input content image and returns a filtered representation with only content information. Then, by applying the colouring transformation, the style patterns contained in the style image are incorporated into the filtered content representation, and the stylised result is obtained by decoding the transformed feature representations. They also extend this single-level stylisation to multi-level stylisation to further improve visual quality.
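The whitening and colouring steps can be sketched with an eigendecomposition of the feature covariance matrices. This is an illustrative single-level version under stated assumptions: random arrays stand in for VGG activations, a small `eps` is added to the covariances for numerical stability, and the blending between stylised and content features used in practice is left out.

```python
import numpy as np

def wct(content_feat, style_feat, eps=1e-5):
    c, h, w = content_feat.shape
    fc = content_feat.reshape(c, -1)
    fs = style_feat.reshape(c, -1)
    fc = fc - fc.mean(axis=1, keepdims=True)
    style_mean = fs.mean(axis=1, keepdims=True)
    fs = fs - style_mean

    # Regularised channel covariance matrices of content and style features.
    cov_c = fc @ fc.T / (fc.shape[1] - 1) + eps * np.eye(c)
    cov_s = fs @ fs.T / (fs.shape[1] - 1) + eps * np.eye(c)
    eval_c, evec_c = np.linalg.eigh(cov_c)
    eval_s, evec_s = np.linalg.eigh(cov_s)

    # Whitening: remove the content features' own correlations (cov -> identity).
    whitened = evec_c @ np.diag(eval_c ** -0.5) @ evec_c.T @ fc
    # Colouring: impose the style covariance, then restore the style mean.
    coloured = evec_s @ np.diag(eval_s ** 0.5) @ evec_s.T @ whitened
    return (coloured + style_mean).reshape(c, h, w)
```

After the transform, the channel covariance of the output approximately matches that of the style features, while the spatial arrangement still comes from the content features.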
5 Improvements and Extensions
Since the boom of Neural Style Transfer, some research has also been devoted to improving current Neural Style Transfer algorithms by controlling perceptual factors (\eg, stroke size control, spatial style control, colour control). Also, all of the aforementioned Neural Style Transfer methods are designed for general still images. They may not be appropriate for other types of images (\eg, doodles, head portraits, video frames). A variety of follow-up studies aim to extend general Neural Style Transfer algorithms to these specific types of images, and even extend them beyond artistic image style (\eg, audio style).
Controlling Perceptual Factors in Neural Style Transfer.
Gatys \etal themselves propose several slight modifications to improve their previous algorithm. They demonstrate a spatial style control strategy, which is to define a guidance map for the feature activations, where the desired region (getting the style) is assigned 1 and all other regions 0. As for colour control, the original algorithm produces stylised images with the colour distribution of the style image. However, sometimes people prefer a colour-preserving style transfer, \ie, preserving the colour of the content image during style transfer. The method is to first transform the style image’s colours to match the content image’s colours before style transfer, or alternatively to perform style transfer only in the luminance channel.
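The luminance-only variant of colour preservation can be sketched as a recombination step: the luminance of the stylised result is kept, and the chrominance is taken from the content image. The YIQ colour space and its rounded conversion matrix are assumptions of this sketch, and the stylised image is assumed to come from any style transfer routine; no clipping to the valid pixel range is applied here.

```python
import numpy as np

# Rounded RGB -> YIQ matrix (Y is luminance, I and Q carry the colour).
RGB_TO_YIQ = np.array([[0.299, 0.587, 0.114],
                       [0.596, -0.274, -0.322],
                       [0.211, -0.523, 0.312]])
YIQ_TO_RGB = np.linalg.inv(RGB_TO_YIQ)

def rgb_to_yiq(image):
    # image: (H, W, 3) array with channels last.
    return image @ RGB_TO_YIQ.T

def yiq_to_rgb(image):
    return image @ YIQ_TO_RGB.T

def recombine_luminance(stylised_rgb, content_rgb):
    # Keep the stylised luminance, restore the content image's colour channels.
    stylised_yiq = rgb_to_yiq(stylised_rgb)
    content_yiq = rgb_to_yiq(content_rgb)
    merged = np.concatenate([stylised_yiq[..., :1], content_yiq[..., 1:]], axis=-1)
    return yiq_to_rgb(merged)
```

In practice the result would be clipped to the displayable range before saving.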
1) Slow Neural Style Transfer with non-high-resolution images: Since current style statistics (\eg, Gram-based and BN-based statistics) are scale-sensitive, different stroke sizes can be achieved simply by resizing the style image to different scales.
2) Fast Style Transfer with non-high-resolution images: One possible solution is to resize the input image to different scales before the forward pass, which inevitably hurts stylisation quality. Another possible solution is to train multiple models with different scales of the style image, which is space and time consuming. This solution also fails to preserve stroke consistency among results with different stroke sizes, \ie, results with different stroke sizes vary in stroke orientations, stroke configurations, \etc. However, users generally desire to change only the stroke size and nothing else. To address this problem, Jing \etal propose a stroke controllable PSPM Fast Style Transfer algorithm. The core component of their algorithm is a StrokePyramid module, which learns different stroke sizes with adaptive receptive fields. Without trading off quality and speed, their algorithm is the first to exploit one single model to achieve flexible continuous stroke size control while preserving stroke consistency, and further achieves spatial stroke size control to produce new artistic effects. Although one can also use an ASPM algorithm to control stroke size, ASPM trades off quality and speed. As a result, ASPM is not effective at producing fine strokes and details compared with the algorithm of Jing \etal.
3) Slow Neural Style Transfer with high-resolution images: For high-resolution images (\eg, pixels in ), a large stroke size cannot be achieved by simply resizing the style image to a large scale. Since a neuron in the loss network can only affect a region of the content image with the receptive field size of VGG, there is almost no difference between a large and an even larger brush stroke within a small image region of receptive field size. Gatys \etal tackle this problem by proposing a coarse-to-fine Slow Style Transfer procedure with several steps of downsampling, stylising, upsampling and final stylising.
4) Fast Style Transfer with high-resolution images: Similar to 3), the stroke size in the stylised result does not vary with the style image scale for high-resolution images. The solution is also similar to Gatys \etal’s algorithm, \ie, a coarse-to-fine stylisation procedure. The idea is to use a multimodel, which comprises multiple subnetworks. Each subnetwork receives the upsampled stylised result of the previous subnetwork as its input and stylises it again with finer strokes.
Another limitation of current algorithms is that they do not consider the depth information contained in the image. To address this limitation, depth preserving Neural Style Transfer algorithms [59, 60] are proposed. Their approach is to add a depth loss function that measures the difference in depth between the content image and the (yet unknown) stylised image. The image depth is acquired by applying a single-image depth estimation algorithm (\eg, Zoran \etal’s work).
Semantic Style Transfer.
Given a pair of style and content images which are similar in content, the goal of semantic style transfer is to build a semantic correspondence that maps each style region to a content region, and to transfer the style in each style region to the corresponding semantically similar content region.
1) Slow Semantic Style Transfer. Since the patch matching scheme is naturally consistent with region-based correspondence, Champandard proposes to build a semantic style transfer algorithm based on the aforementioned patch-based algorithm (Section 4.1.2). In fact, the result produced by the patch-based algorithm is already close to the target of semantic style transfer, but it does not incorporate an accurate segmentation mask, which sometimes leads to a wrong semantic match. Therefore, Champandard augments the patch-based algorithm with an additional semantic channel, which is a downsampled semantic segmentation map. The segmentation map can be either manually annotated or obtained from a semantic segmentation algorithm. Despite the remarkable results produced by Champandard’s algorithm, an MRF-based design is not the only choice. Instead of incorporating an MRF prior, Chen and Hsu provide an alternative way for semantic style transfer, which is to exploit a masking-out process to constrain the spatial correspondence and a higher-order style feature statistic to further improve the result.
2) Fast Semantic Style Transfer. As before, efficiency is a major issue. Both of the above algorithms are based on Slow Neural Style Transfer algorithms and therefore leave much room for improvement. Lu \etal speed up the process by optimising the objective function in the feature space, instead of in the pixel space. More specifically, they propose to do feature reconstruction, instead of image reconstruction as previous algorithms do. This optimisation strategy reduces the computational burden since the loss does not need to propagate through a deep network. The resulting reconstructed feature is decoded into the final result with a trained decoder. Since their algorithm does not reach real-time speed, there is still much room for further research.
Instance Style Transfer.
Instance style transfer is built on instance segmentation and aims to stylise only a single user-specified object within an image. The challenge mainly lies in the transition between the stylised object and the non-stylised background. Castillo \etal address this challenge by adding an extra MRF-based loss to smooth and anti-alias boundary pixels.
Doodle Style Transfer.
An interesting extension is to exploit Neural Style Transfer to transform rough sketches into fine artworks. The method simply discards the content loss term and uses the doodles as a segmentation map for semantic style transfer.
Stereoscopic Style Transfer.
Driven by the demand from AR/VR, Chen \etal propose a stereoscopic Neural Style Transfer algorithm for stereoscopic images. They propose a disparity loss to penalise the bidirectional disparity. Their algorithm is shown to produce more consistent strokes across different views.
Portrait Style Transfer.
Current style transfer algorithms are usually not appropriate for head portrait images. As they do not impose spatial constraints, directly applying these existing algorithms to head portraits will deform facial structures, which is unacceptable to the human visual system. Selim \etal address this problem by extending Neural Style Transfer to head portrait painting transfer. They propose to use the notion of gain maps to constrain spatial configurations, which transfers the texture of the style image while preserving the facial structures.
Video Style Transfer.
Neural Style Transfer algorithms for video sequences were proposed shortly after Gatys \etal’s first style transfer algorithm for still images. Unlike still image style transfer, the design of video style transfer algorithms needs to consider smooth transitions between adjacent video frames. As before, we divide related algorithms into Slow and Fast Video Style Transfer.
1) Slow Video Style Transfer based on online image optimisation. The first video style transfer algorithm is proposed by Ruder \etal. They introduce a temporal consistency loss guided by optical flow to penalise deviations along point trajectories. The optical flow is calculated by novel optical flow estimation algorithms [92, 72]. As a result, their algorithm eliminates temporal artifacts and produces smooth stylised videos. However, they build their algorithm upon online image optimisation and need several minutes to process a single frame in practice.
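A flow-guided temporal consistency loss of this kind can be sketched as follows. This is a simplified illustration, not Ruder \etal’s implementation: the flow is assumed to be a backward flow (for each pixel of the current frame, the offset to its position in the previous frame), warping uses nearest-neighbour lookup instead of bilinear interpolation, and the per-pixel `mask` (0 at occlusions and unreliable flow, 1 elsewhere) is assumed to be given.

```python
import numpy as np

def warp_nearest(prev_frame, flow):
    # prev_frame: (H, W, 3); flow: (H, W, 2) holding (dx, dy) offsets that point
    # from each pixel of the current frame to its source in the previous frame.
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_frame[src_y, src_x]

def temporal_consistency_loss(curr_stylised, prev_stylised, flow, mask):
    # Penalise differences between the current stylised frame and the previous
    # stylised frame warped along the flow, ignoring pixels where mask is 0.
    warped = warp_nearest(prev_stylised, flow)
    squared_diff = (curr_stylised - warped) ** 2
    return float((mask[..., None] * squared_diff).sum() / curr_stylised.size)
```

During optimisation this term would be added to the usual content and style losses for each frame after the first.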
2) Fast Video Style Transfer based on offline model optimisation. Several follow-up studies are devoted to stylising a given video in real-time. Huang \etal propose to add Ruder \etal’s temporal consistency loss to a current PSPM algorithm. Given two consecutive frames, an optical flow based temporal consistency loss is directly computed using the two outputs of the style transfer network to encourage pixel-wise consistency, and a corresponding two-frame synergic training strategy is introduced for the temporal consistency loss. Another concurrent work shares a similar idea, with an additional detailed explanation of the style instability problem. Different from [42, 37], Chen \etal propose a flow subnetwork to produce feature flow and incorporate optical flow information in the feature space. Also, their algorithm is built on a pre-trained style transfer network (an encoder-decoder pair) and warps feature activations from the pre-trained stylisation encoder using the obtained feature flow.
Character Style Transfer.
Given a style image containing multiple characters, Character Style Transfer aims to apply the idea of Neural Style Transfer to generate new fonts and text effects. Atarsaikhan \etal directly apply an existing Neural Style Transfer algorithm to font style transfer and achieve remarkable results. Yang \etal propose to first characterise style elements and then exploit the extracted characteristics to guide the generation of text effects. A more recent work designs a conditional GAN model for glyph shape prediction, together with an ornamentation network for colour and texture prediction. By training these two networks jointly, font style transfer is realised in an end-to-end manner.
Colour Style Transfer.
Colour style transfer aims to transfer the style of colour distributions. The general idea is to build upon current semantic style transfer but to eliminate distortions and preserve the original structure of the content image.
1) Slow Colour Style Transfer. The earliest colour style transfer approach is proposed by Luan \etal. They propose to add a photorealism regularisation to a semantic style transfer algorithm to penalise image distortions, and achieve remarkable results. But since Luan \etal’s algorithm is built on a Slow Semantic Style Transfer algorithm based on online image optimisation, it is computationally expensive.
2) Fast Colour Style Transfer. Li \etal address the efficiency issue of Luan \etal’s algorithm by handling this problem in two steps, a stylisation step and a smoothing step. The stylisation step applies an existing stylisation algorithm but replaces upsampling layers with unpooling layers to produce a stylised result with fewer distortions. The smoothing step is then applied to further eliminate structural artifacts. The two aforementioned algorithms are mainly designed for natural images. Another work proposes to exploit GAN to transfer the colour from human-designed anime images to sketches. Their algorithm demonstrates a promising application of Colour Style Transfer, namely automatic image colourisation.
Attribute Style Transfer.
Image attributes refer to image colours, textures, \etc. Previously, image attribute style transfer was accomplished by image analogy in a supervised manner (Section 2). By incorporating the idea of Neural Style Transfer, Liao \etal propose a deep image analogy to study image analogy in the domain of CNN features. Their algorithm is based on patch matching and realises a weakly supervised image analogy, \ie, given only a single pair of source image and target image.
Fashion Style Transfer.
Fashion style transfer takes a fashion style image as the target and generates clothing images with the desired fashion styles. The challenge of Fashion Style Transfer lies in preserving a design similar to the basic clothing while blending in the desired style patterns. This idea is first proposed by Jiang and Fu. They address this problem by proposing a pair of fashion style generator and discriminator.
Audio Style Transfer.
In addition to transferring image style, [88, 65] extend the domain of image style to audio style, and synthesise new sounds by transferring the desired style from a target audio. The study of audio style transfer also follows the route of image style transfer, \ie, Slow Audio Style Transfer and then Fast Audio Style Transfer. Inspired by image optimisation based image style transfer, Verma and Smith propose a Slow Audio Style Transfer algorithm based on online audio optimisation. They start from a noise signal and optimise it iteratively using back-propagation. A subsequent algorithm improves the efficiency by transferring audio in a feed-forward manner and produces the result in real-time.
6 Evaluation Methodology
The evaluation of Neural Style Transfer algorithms remains an open and important problem in this field since there is no ground truth. Neural Style Transfer is the creation of art. For the same stylised result, different people may have different or even opposite views. From our point of view, there are two major types of evaluation methodologies that can be employed in the field of Neural Style Transfer, \ie, qualitative evaluation and quantitative evaluation. Qualitative evaluation relies on the aesthetic judgements of observers. The evaluation results may vary with many factors (\eg, age and occupation of participants). Quantitative evaluation, in contrast, focuses on precise evaluation metrics, which include time complexity, loss variation, \etc.
In this section, we experimentally compare different Neural Style Transfer algorithms both qualitatively and quantitatively. We hope that our study can build a standardised benchmark for this area.
[Figure: qualitative comparison of stylised results. Columns: Group I to Group VIII. Rows: Content & Style; Li and Wand; Zhang and Dana; Chen and Schmidt; Huang and Belongie.]
6.1 Experimental Setup
In total, there are ten style images and forty content images. For style images, we select artworks of diverse styles, as is shown in Figure 4. For example, there are impressionist, cubist, abstract, contemporary, futurist, surrealist and expressionist artworks. Regarding the mediums, some of these artworks are painted on canvas, while others are painted on cardboard or wool, cotton, polyester, \etc. For content images, we also try to select a wide variety of photos, which includes animal photography, still life photography, landscape photography and portrait photography. None of the images are seen during training.
To maximise the fairness of comparison, we also adhere to the following principles during our experiment:
1) In order to cover every detail in each algorithm, we try to use the official implementation provided by the authors. Where no official implementation is provided by the authors, we use a popular open source code which is also acknowledged by the author. Except for [22, 8] which are based on TensorFlow, all the other codes are based on Torch 7, which maximises fairness especially for speed comparison.
2) Since the visual effect is influenced by the content and style ratio, it is difficult to compare results with different degrees of stylisation. Simply giving the same content and style weight is not an optimal solution due to the different ways to calculate losses in each algorithm (\eg, different choices of content and style layers, different loss functions). Therefore, in our experiment, we try our best to balance the content and style ratio among different algorithms.
3) We try to use the default parameters (\eg, choice of layers, learning rate, \etc) provided by the authors except for the aforementioned content and style weight. Although the results for some algorithms may be further improved by more careful hyperparameter tuning, we select the authors’ default parameters since we hold that sensitivity to hyperparameters is also an important implicit criterion for comparison. For example, we cannot say an algorithm is good if it needs heavy work to tune its parameters for each style.
Implementation details. There are also some other implementation details to be noted. For two of the compared algorithms, we use the instance normalisation strategy proposed in later work, which is not covered in the original papers. Since this experiment mainly focuses on the comparison of general results of stylisation algorithms, we do not consider the diversity loss term, proposed in two of the compared works, for any algorithm. For Chen and Schmidt’s algorithm, we use the feed-forward reconstruction to reconstruct stylised results.
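Table 1: Average stylisation time in seconds for three image sizes.
| Method | 256 × 256 | 512 × 512 | 1024 × 1024 | Styles per model |
| Li and Wand | 0.015 | 0.055 | 0.229 | 1 |
| Zhang and Dana | 0.019 (0.039) | 0.059 (0.133) | 0.230 (0.533) | |
| Chen and Schmidt | 0.123 (0.130) | 1.495 (1.520) | not reported | |
| Huang and Belongie | 0.026 (0.037) | 0.095 (0.137) | 0.382 (0.552) | |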
Note: The fifth column shows the number of styles that a single model can produce. Times are shown both excluding (outside parentheses) and including (in parentheses) the style encoding process, since the algorithms of Zhang and Dana, Chen and Schmidt, and Huang and Belongie support storing encoded style statistics in advance to further speed up the stylisation process for the same style but different content images. The time of Chen and Schmidt’s algorithm for producing 1024 × 1024 images is not shown due to the memory limitation. The speed of [22, 32] is similar to that of Johnson \etal since they share a similar architecture. Therefore, we do not redundantly list them in this table.
6.2 Qualitative Evaluation
Since the results of qualitative evaluation vary a lot with different observers, here we choose to present stylised results of different algorithms and leave the judgement to readers. Example stylised results are shown in Figure 5. More results can be found in the supplementary material (http://yongchengjing.com/pdf/review_supp.pdf). In Figure 5, we build several blocks to separate results of different categories of Neural Style Transfer algorithms.
1) Results of Slow Neural Style Transfer.
Following the example style and content images, the first block contains the results of Gatys \etal’s Slow Neural Style Transfer algorithm based on online image optimisation. The style transfer process is computationally expensive, but in return the results are appealing in visual quality. In much of the literature, the algorithm of Gatys \etal is regarded as the gold-standard method.
2) Results of PSPM Fast Style Transfer.
The second block shows the results of Per-Style-Per-Model Fast Style Transfer algorithms (Section 4.2). Each model only fits one style. It can be noticed that the stylised results of Ulyanov \etal and Johnson \etal are somewhat similar. This is not surprising since they share similar ideas and only differ in their detailed network architectures. The results of Li and Wand are slightly less impressive. Since their algorithm is based on a Generative Adversarial Network (GAN), the training process is, to some extent, not that stable. But we believe that GAN-based style transfer is a very promising direction and there are already some GAN-based works [6, 96, 99] in the community of Neural Style Transfer.
3) Results of MSPM Fast Style Transfer.
The third block demonstrates the results of Multiple-Style-Per-Model Fast Style Transfer. Multiple styles are incorporated in a single model. The idea of both Dumoulin \etal’s algorithm and Chen \etal’s algorithm is to tie a small number of parameters to each style. Also, both of them build their algorithms upon the same architecture. Therefore, it is not surprising that their results are visually similar. Although their results are appealing, their model size becomes larger as the number of learned styles increases. In contrast, Zhang and Dana’s algorithm and Li \etal’s algorithm use a single network with the same trainable network weights for multiple styles. The model size problem is solved, but there may be some interference between different styles (Group II and VII). Different styles are not well separated during training, which slightly degrades the stylisation quality.
4) Results of ASPM Fast Style Transfer.
The fourth block presents the last category of Fast Style Transfer, namely Arbitrary-Style-Per-Model Fast Neural Methods. Their idea is one model for all. Overall, the results of ASPM are slightly less impressive than those of other types of algorithms. This is acceptable in that a three-way trade-off between speed, flexibility and quality is common in research. Chen and Schmidt’s patch-based algorithm seems not to combine enough style elements into the content image. Their algorithm is based on swapping similar patches. When many content patches are swapped with style patches that do not contain enough style elements, the target style will not be well reflected. Ghiasi \etal’s algorithm is data-driven and its stylisation quality is very dependent on the variety of training styles and contents. Huang and Belongie propose to match global summary feature statistics and successfully improve visual quality. However, their algorithm seems not to be good at handling complex style patterns (Group III and VI), since its stylisation quality is still related to the training styles. The algorithm of Li \etal replaces the training process over large numbers of style images with a series of transformations. But their algorithm is not effective at producing sharp details and fine strokes.
6.3 Quantitative Evaluation
Regarding the quantitative evaluation, we mainly focus on five evaluation metrics, which are: generating time for a single content image of different sizes; training time for a single model; average loss for content images to measure how well the loss function is minimised; loss variation during training to measure how fast the model converges; style scalability to measure how large the learned style set can be.
The issue of efficiency is the focus of Fast Style Transfer. In this subsection, we compare different algorithms quantitatively in terms of stylisation speed. Table 1 shows the average time to stylise one image at three resolutions using different algorithms. In our experiment, the style images have the same size as the content images. The fifth column of Table 1 gives the number of styles that one model of each algorithm can produce: an entry indicating multiple styles corresponds to MSPM algorithms, and an entry indicating any style corresponds to ASPM algorithms. The numbers reported in Table 1 are obtained by averaging the generating time of 100 images. Note that we do not include the speed of [22, 32] in Table 1 as their algorithm is to scale and shift parameters based on the algorithm of Johnson \etal. The time required to stylise one image using [22, 8] is very close to that of Johnson \etal under the same conditions. For Chen \etal’s algorithm, since it is protected by patent, here we just attach the speed information provided by the authors for reference: On a Pascal Titan X GPU, 256 × 256: 0.007s; 512 × 512: 0.024s; 1024 × 1024: 0.089s. For Chen and Schmidt’s algorithm, the time for producing a 1024 × 1024 image is not reported due to the limit of video memory. Swapping patches of two 1024 × 1024 images needs more than 24 GB of video memory and thus, the stylisation is not practical. We can observe that except for [16, 56], all the other Fast Style Transfer algorithms are capable of stylising even high-resolution content images in real-time. ASPM algorithms are generally slower than PSPM and MSPM, which again demonstrates the aforementioned three-way trade-off.
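The averaging procedure can be sketched with a small timing harness. This is a generic illustration rather than the script used for Table 1: `stylise` is a hypothetical callable standing in for one forward pass of a style transfer model, and the warm-up run is an assumption added here so that one-off initialisation costs are excluded.

```python
import time

def average_stylisation_time(stylise, images, warmup=1):
    # Run a few untimed passes first so that start-up costs are not measured.
    for image in images[:warmup]:
        stylise(image)
    start = time.perf_counter()
    for image in images:
        stylise(image)
    elapsed = time.perf_counter() - start
    # Average wall-clock time per image, in seconds.
    return elapsed / len(images)
```

For GPU models, the device would additionally need to be synchronised before each clock reading, since GPU kernels typically run asynchronously.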
Another concern is the training time for one single model. The training time of different algorithms is hard to compare, as a model trained for just a few iterations is sometimes already capable of producing visually appealing results. So we just outline our training time for different algorithms (under the default settings provided by the authors) as a reference for follow-up studies. On an NVIDIA Quadro M6000, the training time for a single model is about 3.5 hours for the algorithm of Johnson \etal, 3 hours for the algorithm of Ulyanov \etal, 2 hours for the algorithm of Li and Wand, 6.5 hours for Chen \etal, 4 hours for Zhang and Dana and 8 hours for Li \etal. Chen and Schmidt’s algorithm and Huang and Belongie’s algorithm take much longer (\eg, a couple of days), which is acceptable since a pre-trained model can work for any style. The training time of  depends on how large the training style set is. For MSPM algorithms, the training time can be further reduced through incremental learning over a pre-trained model. For example, the algorithm of Chen \etal only needs 8 minutes to incrementally learn a new style.
One way to evaluate Fast Style Transfer algorithms which share the same loss function is to compare their loss variation during training, \ie, the training curve comparison. It helps researchers to justify the choice of architecture design by measuring how fast the model converges and how well the same loss function can be minimised during training. Here we compare the training curves of two popular Fast Style Transfer algorithms in Figure 6, since most follow-up works are based on their architecture designs. We remove the total variation term and keep the same objective for both algorithms. Other settings (\eg, loss network, chosen layers) are also kept the same. For the style images, we randomly select four styles from our style set and represent them in different colours. In Figure 6, it seems that the two algorithms are similar in terms of convergence speed. Both algorithms minimise the content loss well during training and they mainly differ in the speed of learning the style objective. The algorithm of  seems to minimise the style loss better.
Another related criterion is to compare the final loss values of different algorithms over a set of test images. For a fair comparison, the loss function and other settings are also required to be kept the same for their corresponding stylised images. This demonstrates how well the same loss function is minimised by different algorithms. We show the results of one Slow and two Fast Neural Style Transfer algorithms in Figure 7. The result is consistent with the aforementioned trade-off between speed and quality. Although Fast Style Transfer algorithms are capable of stylising images in real-time, they are not as good as the Slow Neural Style Transfer algorithm in terms of minimising the same loss function.
Scalability is a very important criterion for MSPM algorithms. However, it is very hard to measure since the maximum capability of a single model is highly related to the set of chosen styles. If most of the styles have somewhat similar patterns, a single model can produce thousands of styles or even more, since these similar styles share a somewhat similar distribution of style feature statistics. In contrast, if the style patterns vary a lot among style images, the capability of a single model will be much smaller. But it is hard to measure how much these styles vary from each other. Therefore, to provide the reader with a reference, here we only summarise the authors’ attempt for style scalability: the number is for , for both  and , and for .
7 Applications
Due to the impressive stylised results, research on Neural Style Transfer has led to many successful industrial applications and begun to deliver commercial benefits. There are also some application papers investigating how to apply the Neural Style Transfer technique in different applications [7, 49]. This section summarises these applications and presents some potential usages.
7.1 Social Communication
One of the reasons why Neural Style Transfer attracts attention in both academia and industry is its popularity on some social networking sites, for instance, Facebook and Twitter. A recently emerged mobile application named Prisma is one of the first industrial applications that provides the Neural Style Transfer algorithm as a service. Before Prisma, the general public had hardly imagined that one day they would be able to turn their photos into art paintings in only a few minutes. Due to its high quality, Prisma achieved great success and is becoming popular around the world. Soon, applications providing the same service appeared one after another and began to deliver commercial benefits, \eg, the web application Ostagram requires users to pay for a faster generating speed. With the help of these industrial applications [10, 1, 80], people are able to create their own art paintings like a painter and share the artwork with others on Twitter and Facebook, which brings a new form of social communication. There are also some related application papers: one introduces an iOS app, Pictory, which combines style transfer techniques with image filtering; another further presents the technical implementation details of Pictory; a third demonstrates the design of another GPU-based mobile app, ProsumerFX.
The application of Neural Style Transfer in social communication reinforces connections between people and also has positive effects on both academia and industry. For academia, when people share their own masterpieces, they usually comment on the disadvantages of the service, which helps researchers to further improve the algorithm. Moreover, the application of Neural Style Transfer in social communication also drives the advance of other new techniques. For instance, inspired by the real-time requirements of Neural Style Transfer for videos, Facebook AI Research (FAIR) first developed a new mobile-embedded deep learning system Caffe2Go and then Caffe2, which can run deep neural networks on mobile phones. For industry, the application brings commercial benefits and promotes economic development.
7.2 User-assisted Creation Tools
Another use of Neural Style Transfer is as a user-assisted creation tool. Although, to the best of our knowledge, there are no popular applications that apply the Neural Style Transfer technique in creation tools, we believe that this will be a promising usage in the future.
Neural Style Transfer is capable of acting as a creation tool for painters and designers. It makes it more convenient for a painter to create an artwork in a specific style, especially when creating computer-made fine art images. Moreover, with Neural Style Transfer algorithms it is trivial to produce stylised fashion elements for fashion designers and stylised CAD drawings for architects in a variety of styles, which are costly to produce by hand.
7.3 Production Tools for Entertainment Applications
Entertainment applications such as movies, animations and games are probably where Neural Style Transfer is most applicable. For example, creating an animation usually requires 8 to 24 painted frames per second. The production costs will be largely reduced if Neural Style Transfer can be applied to automatically stylise a live-action video into an animation style. Similarly, Neural Style Transfer can save significant time and cost when applied to the creation of some movies and computer games.
There are already some application papers introducing how to apply Neural Style Transfer to the production of movies, e.g., Joshi et al. explore the use of Neural Style Transfer in redrawing some scenes in a movie named Come Swim , which indicates promising potential applications of Neural Style Transfer in this field.
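To give a sense of scale for the frame rates quoted above, the following minimal sketch computes how many frames a clip requires at 8 and at 24 frames per second. The function name `frames_needed` and the 90-second clip length are illustrative assumptions, not figures taken from any production.

```python
def frames_needed(duration_seconds: float, frames_per_second: int) -> int:
    """Return the number of frames required for a clip of the given length."""
    if duration_seconds < 0 or frames_per_second <= 0:
        raise ValueError("duration must be non-negative and frame rate positive")
    return round(duration_seconds * frames_per_second)


# Hypothetical 90-second clip at the two frame rates quoted in the text.
low = frames_needed(90, 8)    # 90 * 8  = 720 frames
high = frames_needed(90, 24)  # 90 * 24 = 2160 frames
print(low, high)
```

Even a short clip therefore needs hundreds to thousands of frames, which is why automating the stylisation of each frame could reduce costs.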
8 Challenges and Possible Solutions
The advances in the field of Neural Style Transfer are impressive, and some algorithms have already found use in industrial applications. Although current algorithms achieve remarkable results, there are still several challenges and open issues. In this section, we summarise key challenges within the field of Neural Style Transfer and discuss possible solutions.
Probably the most pressing challenge is the three-way trade-off between speed, flexibility and quality in Neural Style Transfer. Although current ASPM algorithms successfully transfer arbitrary styles, they are not that satisfying in perceptual quality and speed. The quality of data-driven ASPM relies heavily on the diversity of training styles. However, one can hardly cover every style due to the great diversity of artwork. Image-transformation-based ASPM algorithms transfer arbitrary styles in a learning-free manner, but they lag behind in speed.
One of the keys to this problem may be a better understanding of the optimisation procedure. The choice of optimiser (e.g., Adam and L-BFGS) in Neural Style Transfer greatly influences the visual quality. We believe that a deep understanding of the optimisation procedure will help us understand how to find the local minima that lead to high quality. A well-studied automatic layer selection strategy may also help improve the quality.
Interpretable Neural Style Transfer.
Another significant issue is the interpretability of Neural Style Transfer. Like many other CNN-based vision tasks, Neural Style Transfer is a black box, which makes it quite uncontrollable. Interpreting Neural Style Transfer based on CNN feature statistics can benefit the separation of different style attributes and address the problem of finer control during stylisation. For example, current Neural Style Transfer algorithms cannot guarantee the detailed orientations and continuities of curves in stylised results. However, brush stroke orientation is an important element in paintings, which can impress the viewer and convey the painter’s ideas. Fortunately, there is already research devoted to interpreting CNN  which may shed light on interpretable Neural Style Transfer.
Adversarial Neural Style Transfer.
Several studies have shown that deep classification networks are easily fooled by adversarial examples [82, 36], which are generated by applying perturbations to input images (e.g., Figure 8(c)). The emergence of adversarial examples reveals the difference between deep neural networks and the human vision system. A perturbed version of an originally correctly classified image is still recognisable to humans, but is misclassified by the deep neural network. Previous studies on adversarial examples mainly focus on deep classification networks. However, in Figure 8, we demonstrate that adversarial examples also exist for deep generative networks. In Figure 8(d), one can hardly recognise the semantic content that is originally contained in Figure 8(c). Countermeasures to this adversarial Neural Style Transfer may benefit from previous research on deep classification networks. A recent survey on adversarial examples can be found in .
We believe that the lack of a gold-standard aesthetic criterion is a major cause that prevents Neural Style Transfer from becoming a mainstream research direction in computer vision (like object detection and recognition). Li et al. propose to design a user study to address the aesthetic evaluation problem. However, this is not practical, since the results vary a lot between observers. We conduct a user study ourselves and show the results in Figure 9. Given the same stylised result, different observers give quite different ratings. We believe that the problem of a standard aesthetic criterion is an extension of the problem of Photographic Image Aesthetic Assessment, and one could draw inspiration from related research in this area. Here, we recommend  for an overview of Photographic Image Aesthetic Assessment.
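As a minimal illustration of how a small perturbation can flip a classifier's decision, the sketch below applies the fast gradient sign method from the adversarial-examples literature to a toy logistic-regression classifier. The weights, input and `epsilon` are hypothetical values chosen for illustration; this is not a deep network and not the procedure used to produce Figure 8.

```python
import math


def predict(weights, bias, x):
    """Probability of class 1 under a toy logistic-regression classifier."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def fgsm_perturb(weights, bias, x, label, epsilon):
    """Shift each input value by epsilon in the direction that increases the loss.

    For the logistic (cross-entropy) loss, the gradient with respect to the
    input is (p - label) * weights, so only its sign is needed here.
    """
    p = predict(weights, bias, x)
    grad = [(p - label) * w for w in weights]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * s for xi, s in zip(x, sign)]


# Hypothetical classifier and input, chosen only for illustration.
weights = [2.0, -3.0, 1.0]
bias = 0.0
x = [0.2, 0.1, 0.1]  # initially assigned to class 1
x_adv = fgsm_perturb(weights, bias, x, label=1, epsilon=0.1)

print("original is class 1: ", predict(weights, bias, x) > 0.5)
print("perturbed is class 1:", predict(weights, bias, x_adv) > 0.5)
```

Each input value changes by at most `epsilon`, yet the predicted class changes, which mirrors the observation that perturbed images can remain recognisable to humans while misleading the network.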
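One simple way to quantify how much observers disagree is to compare the spread of their ratings for each stylised result. The sketch below does this with the mean and population standard deviation. The ratings are made-up placeholders on an assumed 1 to 5 scale, not data from the user study in Figure 9, and the names are hypothetical.

```python
from statistics import mean, pstdev

# Made-up placeholder ratings (1 = poor, 5 = excellent); NOT data from the study.
ratings = {
    "stylised_result_A": [1, 5, 3, 4, 2],
    "stylised_result_B": [3, 3, 4, 3, 3],
}


def summarise(scores):
    """Return (mean rating, population standard deviation) for one result."""
    return mean(scores), pstdev(scores)


for name, scores in ratings.items():
    avg, spread = summarise(scores)
    print(f"{name}: mean={avg:.2f}, spread={spread:.2f}")
```

In this placeholder data the two results have similar mean ratings but very different spreads, which is the kind of observer disagreement that makes a mean score alone an unreliable aesthetic criterion.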
9 Discussions and Conclusions
Over the past several years, Neural Style Transfer has continued to be an inspiring research area, motivated by both scientific challenges and industrial demands. A considerable amount of research has been conducted in the field of Neural Style Transfer. Key advances in this field are summarised in Figure 2. Neural Style Transfer is quite a fast-paced area, and we are looking forward to more exciting works devoted to advancing the development of this field.
While preparing this review, we were also delighted to find that research on Neural Style Transfer brings new inspiration to other areas and accelerates the development of the wider vision community:
1) For the area of Image Reconstruction, derived from Neural Style Transfer, Ulyanov et al. propose a novel deep image prior, which replaces the manually-designed total variation regulariser in  with a randomly initialised deep neural network. Given a task-dependent loss function , an image and a fixed uniform noise as input, their algorithm can be formulated as:
One can easily notice that Equation (12) is very similar to Equation (9). Actually, the process in  is equivalent to the training process of Fast Style Transfer when there is only one available image in the training set, but replacing with and with . In other words, in  is trained to overfit one single sample.
2) Inspired by Neural Style Transfer, Upchurch et al. propose a deep feature interpolation technique and provide a new baseline for the area of Image Transformation (e.g., face aging and smiling). On top of the style transfer procedure, they add an extra step of interpolating in the VGG feature space, and successfully change image contents in a learning-free manner.
3) Another area closely related to Neural Style Transfer is Face Photo-sketch Synthesis. For example,  exploits style transfer to generate shadings and textures for final sketches.
4) Neural Style Transfer also provides a new solution for Domain Adaptation, as validated in , where it is used to improve a Monocular Depth Estimation model.
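For item 1) above, the deep image prior objective can be sketched as follows. The notation is an assumption made here for illustration and may differ from the symbols used elsewhere in this review: $g_{\theta}$ is the randomly initialised network with parameters $\theta$, $z$ is the fixed noise input, $I_o$ is the given image and $\mathcal{L}$ is the task-dependent loss.

```latex
\theta^{*} = \arg\min_{\theta} \; \mathcal{L}\big(g_{\theta}(z),\, I_o\big),
\qquad
I^{*} = g_{\theta^{*}}(z)
```

Under this reading, only the network parameters are optimised while the input noise stays fixed, and the reconstructed image $I^{*}$ is the output of the fitted network. This is why the procedure resembles training a feed-forward stylisation network on a single sample.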
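For item 2) above, the interpolation step of deep feature interpolation can be sketched as shifting an image's feature vector along the difference between the mean features of a target set and a source set. The sketch uses tiny hand-written vectors in place of real VGG features and omits the final step of reconstructing an image from the shifted features. All function names and numbers are illustrative assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attribute_direction(source_feats, target_feats):
    """Direction in feature space pointing from the source set to the target set."""
    src = mean_vector(source_feats)
    tgt = mean_vector(target_feats)
    return [t - s for s, t in zip(src, tgt)]


def interpolate(feature, direction, alpha):
    """Shift a feature vector by alpha times the attribute direction."""
    return [f + alpha * d for f, d in zip(feature, direction)]


# Toy 2-D "features" standing in for deep features (hypothetical values).
source = [[0.0, 1.0], [2.0, 1.0]]   # e.g. images without the attribute
target = [[3.0, 2.0], [5.0, 4.0]]   # e.g. images with the attribute
w = attribute_direction(source, target)
shifted = interpolate([1.0, 0.0], w, alpha=0.5)
print(w, shifted)
```

No parameters are learned in this step, which matches the learning-free character described above. The scalar `alpha` controls how strongly the attribute is applied.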
10 Future Work
Promising directions for future research in Neural Style Transfer mainly focus on three aspects. The first is to solve the aforementioned challenges for current algorithms. These challenges and their possible solutions have been discussed in Section 8. The second is to derive more extensions to Neural Style Transfer, as presented in Section 5. These interesting extensions may become trending topics in the future and even expand into a new area. The third is to exploit Neural Style Transfer techniques to benefit other vision communities, as introduced in Section 9.
We would like to thank Hang Zhang, Dongdong Chen and Tian Qi Chen for providing pre-trained models for our study, and thank Xun Huang and Yijun Li for helpful discussions. We would also like to thank the anonymous reviewers for their insightful comments and suggestions.
This work is supported in part by National Key Research and Development Program (2016YFB1200203), National Natural Science Foundation of China (61572428, U1509206), Fundamental Research Funds for the Central Universities (2017FZA5014), Key Research and Development Program (2016YFB1200203) of Zhejiang Province (2018C01004), and Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies.
-  DeepArt, 2016.
-  Ostagram, 2016.
-  N. Akhtar and A. Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. arXiv preprint arXiv:1801.00553, 2018.
-  A. Atapour-Abarghouei and T. Breckon. Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  G. Atarsaikhan, B. K. Iwana, A. Narusawa, K. Yanai, and S. Uchida. Neural font style transfer. In Proceedings of the IAPR International Conference on Document Analysis and Recognition, volume 5, pages 51–56. IEEE, 2017.
-  S. Azadi, M. Fisher, V. Kim, Z. Wang, E. Shechtman, and T. Darrell. Multi-content gan for few-shot font style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  F. Becattini, A. Ferracani, L. Landucci, D. Pezzatini, T. Uricchio, and A. Del Bimbo. Imaging novecento. a mobile app for automatic recognition of artworks and transfer of artistic styles. In Euro-Mediterranean Conference, pages 781–791. Springer, 2016.
-  G. Berger and R. Memisevic. Incorporating long-range consistency in cnn-based texture generation. In International Conference on Learning Representations, 2017.
-  C. Castillo, S. De, X. Han, B. Singh, A. K. Yadav, and T. Goldstein. Son of zorn’s lemma: Targeted style transfer using instance-aware semantic segmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1348–1352. IEEE, 2017.
-  A. J. Champandard. Deep forger: Paint photos in the style of famous artists, 2015.
-  A. J. Champandard. Semantic style transfer and turning two-bit doodles into fine artworks. ArXiv e-prints, Mar. 2016.
-  C. Chen, X. Tan, and K.-Y. K. Wong. Face sketch synthesis with style transfer using pyramid column feature. In IEEE Winter Conference on Applications of Computer Vision. Lake Tahoe, USA, 2018.
-  D. Chen, J. Liao, L. Yuan, N. Yu, and G. Hua. Coherent online video style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1105–1114, 2017.
-  D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stylebank: An explicit representation for neural image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1897–1906, 2017.
-  D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stereoscopic neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. ArXiv e-prints, Dec. 2016.
-  Y.-L. Chen and C.-T. Hsu. Towards deep style transfer: A content-aware perspective. In Proceedings of the British Machine Vision Conference, 2016.
-  Y. Deng, C. C. Loy, and X. Tang. Image aesthetic assessment: An experimental survey. IEEE Signal Processing Magazine, 34(4):80–106, 2017.
-  A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.
-  A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4829–4837, 2016.
-  I. Drori, D. Cohen-Or, and H. Yeshurun. Example-based style synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages II–143. IEEE, 2003.
-  V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In International Conference on Learning Representations, 2017.
-  T. Dürschmid, M. Söchting, A. Semmo, M. Trapp, and J. Döllner. Prosumerfx: Mobile design of image stylization components. In SIGGRAPH Asia 2017 Mobile Graphics & Interactive Applications, pages 1:1–1:8. ACM, 2017.
-  A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 341–346. ACM, 2001.
-  A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In Proceedings of the IEEE International Conference on Computer Vision, volume 2, pages 1033–1038. IEEE, 1999.
-  M. Elad and P. Milanfar. Style transfer via texture synthesis. IEEE Transactions on Image Processing, 26(5):2338–2351, 2017.
-  O. Frigo, N. Sabater, J. Delon, and P. Hellier. Split and match: Example-based adaptive patch sampling for unsupervised style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 553–561, 2016.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. ArXiv e-prints, Aug. 2015.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
-  L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman. Controlling perceptual factors in neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3985–3993, 2017.
-  G. Ghiasi, H. Lee, M. Kudlur, V. Dumoulin, and J. Shlens. Exploring the structure of a real-time, arbitrary neural artistic stylization network. In Proceedings of the British Machine Vision Conference, 2017.
-  B. Gooch and A. Gooch. Non-photorealistic rendering. A. K. Peters, Ltd., Natick, MA, USA, 2001.
-  B. Gooch, E. Reinhard, and A. Gooch. Human facial illustrations: Creation and psychophysical evaluation. ACM Transactions on Graphics, 23(1):27–44, 2004.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
-  A. Gupta, J. Johnson, A. Alahi, and L. Fei-Fei. Characterizing and improving stability in neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4067–4076, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 229–238. ACM, 1995.
-  A. Hertzmann. Painterly rendering with curved brush strokes of multiple sizes. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pages 453–460. ACM, 1998.
-  A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 327–340. ACM, 2001.
-  H. Huang, H. Wang, W. Luo, L. Ma, W. Jiang, X. Zhu, Z. Li, and W. Liu. Real-time neural style transfer for videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 783–791, 2017.
-  X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
-  Y. Jia and P. Vajda. Delivering real-time ai in the palm of your hand, 2016.
-  S. Jiang and Y. Fu. Fashion style generator. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3721–3727. AAAI Press, 2017.
-  Y. Jing, Y. Liu, Y. Yang, Z. Feng, Y. Yu, D. Tao, and M. Song. Stroke controllable fast style transfer with adaptive receptive fields. arXiv preprint arXiv:1802.07101, 2018.
-  J. Johnson. neural-style. https://github.com/jcjohnson/neural-style, 2015.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711, 2016.
-  B. J. Joshi, K. Stewart, and D. Shapiro. Bringing impressionism to life with neural style transfer in come swim. ArXiv e-prints, Jan. 2017.
-  B. Julesz. Visual pattern discrimination. IRE transactions on Information Theory, 8(2):84–92, 1962.
-  J. E. Kyprianidis, J. Collomosse, T. Wang, and T. Isenberg. State of the “art”: A taxonomy of artistic stylization techniques for images and video. IEEE transactions on visualization and computer graphics, 19(5):866–885, 2013.
-  C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
-  C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716, 2016.
-  S. Li, X. Xu, L. Nie, and T.-S. Chua. Laplacian-steered neural style transfer. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1716–1724. ACM, 2017.
-  Y. Li, F. Chen, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3920–3928, 2017.
-  Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems, pages 385–395, 2017.
-  Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz. A closed-form solution to photorealistic image stylization. arXiv preprint arXiv:1802.06474, 2018.
-  Y. Li, N. Wang, J. Liu, and X. Hou. Demystifying neural style transfer. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 2230–2236, 2017.
-  R. Liao, Y. Xia, and X. Zhang. Depth-preserving style transfer. 2016.
-  X.-C. Liu, M.-M. Cheng, Y.-K. Lai, and P. L. Rosin. Depth-aware neural style transfer. In Proceedings of the Symposium on Non-Photorealistic Animation and Rendering, pages 4:1–4:10, 2017.
-  M. Lu, H. Zhao, A. Yao, F. Xu, Y. Chen, and L. Zhang. Decoder network over lightweight reconstructed feature for fast semantic style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2469–2477, 2017.
-  F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep photo style transfer. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 6997–7005. IEEE, 2017.
-  A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5188–5196, 2015.
-  A. Mahendran and A. Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120(3):233–255, 2016.
-  P. K. Mital. Time domain neural audio style transfer. In Proceedings of the NIPS Workshop on Machine Learning for Creativity and Design, 2018.
-  A. Mordvintsev, C. Olah, and M. Tyka. Inceptionism: Going deeper into neural networks, 2015.
-  S. Pasewaldt, A. Semmo, M. Klingbeil, and J. Döllner. Pictory - neural style transfer and editing with coreml. In SIGGRAPH Asia 2017 Mobile Graphics & Interactive Applications, pages 12:1–12:2. ACM, 2017.
-  V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE signal processing magazine, 32(3):53–69, 2015.
-  J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International journal of computer vision, 40(1):49–70, 2000.
-  I. Prisma Labs. Prisma: Turn memories into art using artificial intelligence, 2016.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ArXiv e-prints, Nov. 2015.
-  J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Epicflow: Edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1164–1172, 2015.
-  E. Risser, P. Wilmot, and C. Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. ArXiv e-prints, Jan. 2017.
-  P. Rosin and J. Collomosse. Image and video-based artistic stylisation, volume 42. Springer Science & Business Media, 2012.
-  M. Ruder, A. Dosovitskiy, and T. Brox. Artistic style transfer for videos. In German Conference on Pattern Recognition, pages 26–36, 2016.
-  A. Selim, M. Elgharib, and L. Doyle. Painting style transfer for head portraits using convolutional neural networks. ACM Transactions on Graphics, 35(4):129, 2016.
-  A. Semmo, T. Isenberg, and J. Döllner. Neural style transfer: A paradigm shift for image-based artistic rendering? In Proceedings of the Symposium on Non-Photorealistic Animation and Rendering, pages 5:1–5:13. ACM, 2017.
-  A. Semmo, M. Trapp, J. Döllner, and M. Klingbeil. Pictory: Combining neural style transfer and image filtering. In ACM SIGGRAPH 2017 Appy Hour, pages 5:1–5:2. ACM, 2017.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  R. Sreeraman. Neuralstyler: Turn your videos/photos/gif into art, 2016.
-  T. Strothotte and S. Schlechtweg. Non-photorealistic computer graphics: modeling, rendering, and animation. Morgan Kaufmann, 2002.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
-  C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proceedings of the IEEE International Conference on Computer Vision, pages 839–846. IEEE, 1998.
-  D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In International Conference on Machine Learning, pages 1349–1357, 2016.
-  D. Ulyanov, A. Vedaldi, and V. Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6924–6932, 2017.
-  D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.
-  P. Verma and J. O. Smith. Neural style transfer for audio spectograms. In Proceedings of the NIPS Workshop on Machine Learning for Creativity and Design, 2017.
-  X. Wang, G. Oxholm, D. Zhang, and Y.-F. Wang. Multimodal transfer: A hierarchical deep convolutional neural network for fast artistic style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5239–5247, 2017.
-  L.-Y. Wei, S. Lefebvre, V. Kwatra, and G. Turk. State of the art in example-based texture synthesis. In Eurographics 2009, State of the Art Report, EG-STAR, pages 93–117. Eurographics Association, 2009.
-  L.-Y. Wei and M. Levoy. Fast texture synthesis using tree-structured vector quantization. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 479–488. ACM Press/Addison-Wesley Publishing Co., 2000.
-  P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. Deepflow: Large displacement optical flow with deep matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 1385–1392. IEEE, 2013.
-  H. Winnemöller, S. C. Olsen, and B. Gooch. Real-time video abstraction. In ACM Transactions On Graphics (TOG), volume 25, pages 1221–1226. ACM, 2006.
-  S. Yang, J. Liu, Z. Lian, and Z. Guo. Awesome typography: Statistics-based text effects transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7464–7473, 2017.
-  H. Zhang and K. Dana. Multi-style generative network for real-time transfer. arXiv preprint arXiv:1703.06953, 2017.
-  L. Zhang, Y. Ji, and X. Lin. Style transfer for anime sketches with enhanced residual u-net and auxiliary classifier gan. In Proceedings of the Asian Conference on Pattern Recognition, 2017.
-  Q.-s. Zhang and S.-C. Zhu. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering, 19(1):27–39, 2018.
-  M. Zhao and S.-C. Zhu. Portrait painting using active templates. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Non-Photorealistic Animation and Rendering, pages 117–124. ACM, 2011.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2223–2232, 2017.
-  D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman. Learning ordinal relationships for mid-level vision. In Proceedings of the IEEE International Conference on Computer Vision, pages 388–396, 2015.