Two-Stream FCNs to Balance Content and Style for Style Transfer


Duc Minh Vo1
SOKENDAI (Graduate University for Advanced Studies)
Tokyo, Japan
vmduc@nii.ac.jp
   Akihiro Sugimoto
The National Institute of Informatics
Tokyo, Japan
1 Corresponding author.
Abstract

Style transfer renders the content of a given image in a given style, and it plays an important role in both fundamental computer vision research and industrial applications. Following the success of deep learning based approaches, this problem has very recently been re-launched, but it still remains a difficult task because of the trade-off between preserving the content and faithfully rendering the style. In this paper, we propose end-to-end two-stream Fully Convolutional Networks (FCNs) aiming at balancing the contributions of the content and the style in rendered images. Our proposed network consists of encoder and decoder parts. The encoder part utilizes an FCN for content and an FCN for style, where the two FCNs have feature injections and are independently trained to preserve the semantic content and to learn the faithful style representation, respectively. The semantic content feature and the style representation feature are then adaptively concatenated and fed into the decoder to generate style-transferred (stylized) images. In order to train our proposed network, we employ a loss network, the pre-trained VGG-16, to compute the content loss and the style loss, both of which are efficiently used for the feature injection as well as the feature concatenation. Our extensive experiments show that our proposed model generates stylized images that are more balanced in content and style than those of state-of-the-art methods. Moreover, our proposed network achieves efficiency in speed.


1 Introduction

Figure 1: Example of stylized results. Left-most column: content image (large) and style image (small). From left to right: the stylized images by our method, Johnson+ [16], Huang+ [13], Gatys+ [7], Sheng+ [32], Chen+ [3], and Li+ [22]. Our results, surrounded with red rectangles, are more balanced in content and style than the others.
Figure 2: Example of stylized results obtained by Johnson+ [16] and Gatys+ [7] when changing the ratio of content and style from 1:5 to 1:1. Left-most column: content image (large) and style image (small). In each block, from left to right: the stylized images with various ratios of content and style.

How New York would look in “The Starry Night” by Vincent van Gogh is an interesting question and, at the same time, a difficult one to answer. In practice, re-painting an image in a famous fine-art style takes much time and requires well-trained artists. Answering this question can be stated as the problem of migrating the semantic content of one image into different styles, which is called style transfer.

Style transfer is a long-standing problem that falls into image synthesis, a fundamental research topic in computer vision. Style transfer has its origin in non-photo-realistic rendering [18] and is closely related to texture synthesis and color transfer [1, 5]. Along with the impressive progress of various computer vision tasks using deep neural networks, this topic has very recently been re-launched in both academia and industry. [7] showed that the image representation derived from a Convolutional Neural Network (CNN) can be used to represent both the semantic content and the style of an image, which opened up a new trend of CNN-based style transfer.

CNN-based approaches in style transfer fall into two categories [15]: Image-Optimisation-Based Online Neural Methods (IOB-NST) and Model-Optimisation-Based Offline Neural Methods (MOB-NST). The key idea of IOB-NST is to synthesize a stylized image by directly and iteratively updating the pixels of the image through back-propagation. IOB-NST methods such as [7, 25, 27] start with a noise image and iteratively update it by changing the distribution of the noise to match the statistics of the content and the style until the defined loss function is minimized. MOB-NST methods such as [16, 13, 32, 3, 22, 37, 29, 2], on the other hand, first optimize a generative model through iterations, and then render the stylized image using a single forward pass. In order to optimize the generative model, MOB-NST trains a feed-forward model for each specific style by using gradient descent over a large dataset. IOB-NST is known to produce stylized results of better quality than MOB-NST [15], while MOB-NST is more efficient in speed.

Although existing methods [16, 13, 7, 32, 3, 22, 25, 27, 37, 29, 2] show the capability of rendering image contents in different styles, the generated stylized images are not always well balanced in content and style. Such methods take care of either the content or the style, but not both, producing unbalanced stylized images. IOB-NST is good at faithfully rendering the style while it tends to lose the content. MOB-NST, on the other hand, preserves more of the semantic content than the style. How to keep the balance between the content and the style is a crucial issue in improving the quality of stylized images, because such balance is required in many applications, for instance, font transfer [38] and realistic photo transfer [27, 23]. IOB-NST and MOB-NST have the capability of controlling the balance between the content and the style; namely, they allow the user to manually change the ratio of content and style. However, changing the ratio does not guarantee that the network parameters, and thus the stylized images, change as expected, meaning that the contributions of the content and the style in a stylized image are uncontrollable in reality. Fig. 2 shows examples obtained by IOB-NST (Gatys+ [7]) and MOB-NST (Johnson+ [16]) with various settings of the contributions of the content and the style. We can see that although the ratio of content and style is changed significantly, the results do not change much.

Another important issue to address is the computational speed. Although MOB-NST methods such as [16, 13, 32, 3, 22, 37, 29, 2] are able to produce stylized images quickly, they rely on strong computational power. Therefore, both IOB-NST and MOB-NST are hard to apply to real-time applications.

We propose an end-to-end two-stream network for balancing the content and the style in stylized images, where the contributions of the content and the style are adaptively taken into account. The encoder part of our network consists of a content stream and a style stream with different architectures. The two streams are connected by adaptive feature injection and are independently trained to learn the semantic content or the style representation. The content features and the style features are then combined in our proposed adaptive concatenation layer to ensure a balanced contribution of each stream. As the decoder part of our network, we use a feed-forward model to reduce the rendering time at the expense of training time, like [16, 13, 32, 3, 22, 37]. Unlike other methods that train a new model from scratch for a yet unknown style, we fine-tune parameters from an existing model, allowing our network not only to accommodate fast training but also to easily adapt to new styles. Our experiments demonstrate that our method produces stylized images that are more balanced in content and style than the state-of-the-art methods (Fig. 1). They also show that our method runs about 22 times faster than the state-of-the-art methods. We remark that our proposed model is trained for one style only, but it can easily be fine-tuned to other styles incrementally at a low cost.

The rest of this paper is organized as follows. We briefly review and analyze related work in Section 2. Next, we analyze the semantic levels of image features for content and style in Section 3. Then, we present the details of our proposed method in Section 4. Sections 5 and 6 discuss our experiments. Section 7 draws the conclusion. We remark that this paper extends the work reported in [36]. Our main extensions are building a new network using both our proposed adaptive feature injection and adaptive concatenation, and adding more experiments.

2 Related work

Figure 3: Examples of the feature reconstruction for several layers from the VGG-16 pre-trained network.
Figure 4: Examples of style image reconstruction for several layers from the VGG-16 pre-trained network.
Figure 5: Examples of combining content and style images at different layers. Left-most column: content image (large) and style image (small). From left to right: the stylized images at different combination levels by Gatys+ [7], where the ratio of the contributions of content and style is 1:1.

Early work on style transfer was reported in the context of texture synthesis. Some methods used histogram matching [10] and/or non-parametric sampling [1, 5]. These methods had limited results because they relied on hand-crafted low-level features and often failed to capture semantic-level features of the content and the style.

[7] for the first time proposed a method using CNNs and showed remarkable results. Their method uses a CNN to learn the semantic information of the content image and match it with the distribution of the style. It starts from a randomly distributed noise image and iteratively updates the image to produce one satisfying the semantic distribution of the content image and the appearance statistics of the style. During the iterations, the weighted sum of the style loss and the content loss is minimized. As follow-up work of [7], [25] proposed a structure preservation method using the Matting Laplacian for photo-realistic style transfer. [27] utilized the screened Poisson equation to make a stylized image more photo-realistic. [20] proposed a Laplacian loss that computes the Euclidean distance between the Laplacian filter responses of a content image and a stylized image in order to keep the fine structure of the content image. These approaches fall into the IOB-NST category, and all face the problem of computational speed.

[16] and [34], on the other hand, took the MOB-NST approach, proposing feed-forward CNNs trained with a perceptual loss function through gradient-based optimization. The perceptual loss used there is similar to the content and style losses in [7]. Their models only need to pass the content image through a single forward network to produce a stylized image, which is fast. The two models differ only in the network architecture: [16] follows the design of [30] with the modification of using residual blocks and fractionally strided convolutions, while [34] uses a multi-scale architecture in its generator. [37] also utilized a feed-forward network and used multiple generators to improve the quality of results. These methods are fast in generating stylized images, but they are capable of dealing with a single style only.

[4] proposed a multi-style network that introduces shared computation over many style images, where instance normalization (IN) [35] is used for balancing features from the content and from the style. They also proposed an improvement of IN that learns a different set of affine parameters for multiple styles in a batch manner. However, their model can be trained on only a limited number of styles because the network capacity is limited. [3] proposed a method that overcomes the limitation on the number of styles by using a patch-based method. Their method first extracts a set of patches from the content and the style each, and then, for each content patch, finds its closest style patch and swaps their activations. In this way, their method transfers an unlimited number of styles; however, the cost of patch extraction and swapping increases the computational time significantly. [22] also proposed a method for multi-style transfer using feature transformations. They first employ the pre-trained VGG-19 as their encoder to train a decoder for image reconstruction. Then, with both the encoder (VGG-19) and the decoder fixed, their model performs style transfer through whitening and coloring transforms on a given content image and a style image. Though their method successfully handles multi-style transfer, it still suffers from high computational cost and loses content due to the feature transformations.

Recently, [13] and [32] proposed multi-style transfer models consisting of two CNN streams for content and style. [13] employed the pre-trained VGG-16 to extract content and style features and introduced Adaptive Instance Normalization (AIN) to make the mean and the variance of the content features similar to those of the style features. [32], on the other hand, proposed AvatarNet, which employs the pre-trained VGG-19 to extract the content and style features. These features are matched by using style-swap [3] or AIN [13] before being fed into the decoder. Different from [13], their model has skip-connections from the style encoder to the decoder. [13] and [32], however, used the same architecture for the content CNN and the style CNN. Having the same CNN architecture for the content and the style causes unavoidable unbalance between the content and the style because the semantic levels extracted from the content and the style should not be the same in style transfer. Those models also require expensive computation. Furthermore, AIN [13] assumes a standard distribution of pixel values, which is not always ensured for styles when normalizing data. Indeed, AIN [13] tends to produce many artifacts, which are especially visible on flat surfaces [29]. We remark that the skip-connections in AvatarNet [32] weight the style contribution more heavily, causing unbalance in stylized images.

Along with the use of Generative Adversarial Networks (GANs) [8] in image synthesis, several GAN-based models for style transfer have also been proposed [29, 2, 19]. These models also optimize the network with a large number of content images during the training step, and thus fall into the MOB-NST category. Though GAN-based models bring a promising approach to improving the quality of stylized images, their results, at this time, are still less impressive [15]. Furthermore, as is common with GAN-based approaches, their training process is also unstable.

Different from the methods above, we take into account the contributions of the content and the style through a two-stream feed-forward network to balance the content and the style in stylized images. In particular, our proposed two-stream network is different from [13, 32] in that our network uses different depths for the content and the style encoders to extract different semantic levels of the content and the style. In addition, separating content and style makes our method easy to fine-tune to other styles at a lower computational cost (re-training time, number of required training images) than models possessing only one encoder [16, 3, 37]. As a result, our method can easily deal with multiple styles.

3 Semantic levels of image features for content and style

A CNN is known to extract image features at different semantic levels along its depth. As demonstrated in [16, 7], features in early layers reflect colors, textures, and common patterns of images, while those in later layers preserve the content and spatial structure of images. We therefore expect that the features in lower layers work as style features and those in higher layers work as content features. Using appropriate semantic levels of image features in style transfer is crucial. We thus experimentally examine the semantic levels of image features in VGG-16 [33] to choose suitable layer depths for extracting content and style features in our network. We remark that we refer to [16, 7], in which image reconstruction is learned using hidden features of CNN layers.

For the content image reconstruction, we randomly prepare 100 images. We feed each of the 100 images into the VGG-16 [33] pre-trained on object recognition using the ImageNet dataset [31], without any fine-tuning, and extract the features at each Rectified Linear Unit (ReLU) [28]. These features are employed to reconstruct the original images using the inverting technique [26]. Fig. 3 shows some examples of image reconstruction at several layers. We see that at low levels, i.e., from the nd layer () to the th layer (), the reconstructed images are similar to the original image, meaning that these layers successfully keep colors, textures, and common patterns of images. At higher levels, i.e., from the th layer () to the th layer (), the reconstructed images preserve the content and spatial structure. At even higher layers, starting from the th layer (), semantic features are gradually learned; the exact shape, on the other hand, is not preserved.
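For readers who wish to reproduce this kind of layer-wise analysis, the following is a minimal sketch of how intermediate ReLU activations can be collected from the pre-trained VGG-16 with forward hooks in PyTorch; the layer indices, the image path, and the preprocessing are illustrative assumptions, not the exact settings used in this paper.

```python
# Minimal sketch (assumptions noted above): collect intermediate ReLU activations
# of a pre-trained VGG-16 with forward hooks.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg16(pretrained=True).features.eval()

# Indices of some ReLU layers inside vgg.features (illustrative selection only).
relu_indices = [3, 8, 15, 22]
activations = {}

def make_hook(idx):
    def hook(module, inputs, output):
        activations[idx] = output.detach()
    return hook

for idx in relu_indices:
    vgg[idx].register_forward_hook(make_hook(idx))

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(256),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("content.jpg").convert("RGB")).unsqueeze(0)  # placeholder path
with torch.no_grad():
    vgg(img)

for idx, feat in activations.items():
    print(f"ReLU at features[{idx}]: {tuple(feat.shape)}")
```

The collected activations can then be fed to an inversion procedure such as [26] to produce the reconstructions discussed above.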

For the style image reconstruction, we use Adam optimization [17] to find an image that minimizes the style reconstruction loss (proposed in [7]). To obtain style-reconstructed images, we start from a noise image and optimize the style loss as in [7] using the VGG-16 pre-trained on ImageNet. Fig. 4 shows an example of the style image reconstruction. We see that the style of the image can be obtained up to the th layer ().

The above observation holds true for all the images and styles that we evaluated. Combining it with the insight given by [16, 7], we may thus conclude that the low-level layers reflect the style of the image while the high-level layers capture the content of the image. More precisely, from the th layer () to the th layer (), the network is capable of appropriately capturing the content information in the images. The style information, on the other hand, can be obtained from the nd () to the th () layers.

[7] pointed out that image content and style cannot be completely disentangled. This indicates that, depending on the objective, we have to appropriately choose the layer levels of the content and style features for their combination. We thus further analyze the effectiveness of the layers from the th () to the th () for content matching to determine the best one for combination. We follow [7] to synthesize the stylized images, where we set the contributions of content and style to be equal to each other. To this end, we fix the style matching from the nd () to the th () layers, while performing the content matching at every single layer from the th () to the th () layers. Fig. 5 shows examples of stylized images with different layers used in the combination. We see that the content matching at the th and the th layers ( and ) is most reasonable for keeping the balance of content and style in stylized images.

Using the above observations, we design our network to fully exploit the characteristics of image features. We choose the th layer for content because it has a smaller number of parameters than the th layer (it is faster to learn). We choose the th layer for style because it is neither too early nor only marginally different from the layer used for content. In conclusion, we use the features at the th layer () for content and those at the th layer () for style.

Figure 6: Framework of our proposed method. Our network consists of two encoders having different architectures and one decoder. The loss network is used to train the encoders and the decoder.

4 Proposed method

4.1 Network design

Our network follows an end-to-end encoder-decoder architecture for rendering the content in a given style [16, 3, 37]. The networks in [16, 3, 37] possess only one encoder to extract the semantic content and the style. This means that the extracted semantic level of the content and that of the style are the same. When we stylize images, the role of the content should be different from that of the style because the content gives us what exists (object shapes and locations) in the rendered image and the style gives us the impression of the rendered image. Accordingly, the semantic level used for the rendering should differ depending on whether it is for the content or the style. Otherwise, unbalance between the content and the style remains in stylized images. We thus design a network having two encoders whose architectures are different from each other to extract different semantic levels of the content and the style. With the two encoders, our model treats the content and the style in different ways, allowing the network to balance the roles of the content and the style better than a model having only one encoder.

Ideally, the network should be able to retain the semantics of the content as well as the statistics of the style as much as possible. The semantic content and the style of an image are captured at different layers in the network (see [16, 7] and Section 3): the network captures the style at low-level layers, while high-level layers become more sensitive to the actual content of the image. We thus design the encoders with different depths to retain useful information from both the content and the style. Namely, we design a deep encoder for the content and a shallow encoder for the style. Moreover, in order to reflect the low-level features extracted from the style in those from the content, we employ feature injection via the skip-connection technique from the shallow encoder to the deep one. Because the content feature and the style feature are extracted at different levels in the network, they have different characteristics. We thus introduce an effective concatenation that enhances the contribution of each of these features, instead of a simple concatenation.

4.2 Network architecture

Our proposed network consists of three Fully Convolutional Networks (FCNs): two encoders and one decoder (Fig. 6). The two encoders are a deep network, the content subnet, to extract the content feature from a content image, and a shallow network, the style subnet, to extract the style feature from a style image. Feature injection is employed between the content subnet and the style subnet using the balance weight (cf. Section 4.4). This balance weight is also used to adaptively concatenate the features at the top of the content and style subnets before they are fed into a deep network, the generator subnet, to produce a stylized image. We employ the VGG-16 model [33] as the loss network in the training phase.

Our network receives the content and style images, each with the size of ( is the size of the image and 3 is the number of RGB channels), and synthesizes a stylized image of the same size. In the training phase, we use images of size ( = 256). Although we train the network on images with the size of , the network can accept images of any size at test time ( can be 64, 128, 256, or 512). We remark that the size of the content image and that of the style image have to be the same to ensure the consistency of the feature size when injecting and concatenating the content and the style features.

4.2.1 Content subnet

The content subnet is a stack of six convolution layers with the filter size of and the padding size of . We use the stride of at the third, the fifth, and the sixth layers to reduce the size of the feature maps and the stride of at the other layers. The numbers of output channels are 32, 48, 64, 80, 96, and 128, respectively. Each convolution layer is followed by a spatial instance normalization (IN) layer [35] and a Rectified Linear Unit (ReLU) layer [28]. In order to avoid the border artifacts caused by convolution, reflection padding is used instead of zero padding, similarly to [4].
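A minimal PyTorch sketch of the content subnet described above is given below; the 3×3 kernel size and the stride values (2 at the down-sampling layers, 1 elsewhere) are assumptions, since the exact numbers are not reproduced here, while the channel counts, instance normalization, ReLU, and reflection padding follow the text.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, stride):
    # Reflection padding instead of zero padding to avoid border artifacts;
    # the 3x3 kernel and the stride values are assumptions (see the text above).
    return nn.Sequential(
        nn.ReflectionPad2d(1),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride),
        nn.InstanceNorm2d(out_ch, affine=True),
        nn.ReLU(inplace=True),
    )

class ContentSubnet(nn.Module):
    """Deep encoder: six conv blocks; the feature map is reduced at the 3rd, 5th,
    and 6th blocks."""
    def __init__(self):
        super().__init__()
        channels = [3, 32, 48, 64, 80, 96, 128]
        strides = [1, 1, 2, 1, 2, 2]  # assumed: stride 2 where the map is reduced
        self.blocks = nn.ModuleList(
            [conv_block(channels[i], channels[i + 1], strides[i]) for i in range(6)]
        )

    def forward(self, x, return_intermediate=False):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)  # intermediate features are kept for the feature injection
        return (x, feats) if return_intermediate else x
```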

4.2.2 Style subnet

The style subnet, which has four convolution layers, is a shallow network (more precisely, shallower than the content subnet). All convolution layers have the filter size of , the reflection padding of , and the stride of , except for the first layer, which employs the stride of . The numbers of output channels are 32, 64, 96, and 128, respectively. Similarly to the content subnet, each convolution layer is followed by an IN layer [35] and a ReLU layer [28].
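The style subnet can be sketched analogously, reusing the conv_block helper from the content-subnet sketch above; the stride values are assumptions chosen so that the feature-map sizes match those of the content subnet at the injection points listed in Table 1.

```python
import torch.nn as nn

# conv_block is the helper defined in the content-subnet sketch above.

class StyleSubnet(nn.Module):
    """Shallow encoder: four conv blocks. The stride values are assumptions chosen
    so that the feature-map sizes match the content subnet at the injection points."""
    def __init__(self):
        super().__init__()
        channels = [3, 32, 64, 96, 128]
        strides = [1, 2, 2, 2]  # assumed: no down-sampling at the first layer only
        self.blocks = nn.ModuleList(
            [conv_block(channels[i], channels[i + 1], strides[i]) for i in range(4)]
        )

    def forward(self, x, return_intermediate=False):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)  # intermediate features are the sources of the injection
        return (x, feats) if return_intermediate else x
```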

We employ feature injection from the feature at the -th layer in the style subnet to that at the -th layer in the content subnet, where the two feature maps have the same size (Table 1). To take into account the contributions of the two features, we introduce the adaptive feature injection with the balance weight (cf. Section 4.4).

Content subnet Style subnet
No Layer Output channel No Layer Output channel
 0 Content image 3 0 Style image 3
 1 Convolution 32 1 Convolution 32
 2 Instance normalization 32 2 Instance normalization 32
 3 ReLU 32 3 ReLU 32
 4 Convolution 48
 5 Instance normalization 48
 6 ReLU 48
 7 Convolution 64 4 Convolution 64
 8 Instance normalization 64 5 Instance normalization 64
 9 ReLU 64 6 ReLU 64
 10 Convolution 80
 11 Instance normalization 80
 12 ReLU 80
 13 Convolution 96 7 Convolution 96
 14 Instance normalization 96 8 Instance normalization 96
 15 ReLU 96 9 ReLU 96
 16 Convolution 128 10 Convolution 128
 17 Instance normalization 128 11 Instance normalization 128
 18 ReLU 128 12 ReLU 128
Table 1: Architecture of our encoders. The arrows indicate the adaptive feature injection.

4.2.3 Generator subnet

The generator subnet consists of five residual blocks, three deconvolution layers, and two convolution layers in this order.

[16] argues that the residual block can enrich the information involved in the input feature. We therefore use residual blocks to increase the impact of the balance weight in the concatenated feature. Similarly to [16], we use five residual blocks outputting 256 channels, where each of them has two convolution layers with the filter size of , the reflection padding of , the stride of , and a summation layer as in [9]. All convolution layers are followed by an IN layer [35] (which we use to replace the batch normalization [14] in the original architecture [9]) and a ReLU layer [28].

To upscale the feature map, we employ three deconvolution layers with the same filter size of , the reflection-padding of , and the stride of , outputting 128, 96, and 64 channels, respectively.

In order to eliminate the effect of the convolution stride, we use two convolution layers with the filter size of , the padding of , and the stride of , outputting 32 and 3 channels. All deconvolution and convolution layers are followed by an IN layer [35] and a ReLU layer [28], except for the last convolution layer, which uses the activation to guarantee that the range of the output can be normalized to .
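Below is a minimal PyTorch sketch of the generator subnet as described above; the kernel sizes, the transposed-convolution parameters, and the final tanh activation are assumptions where the exact values are not reproduced here.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv layers, each followed by IN and ReLU, plus a summation (skip)
    layer as in [9]; the 3x3 kernel size is an assumption."""
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1),
            nn.Conv2d(ch, ch, kernel_size=3),
            nn.InstanceNorm2d(ch, affine=True),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            nn.Conv2d(ch, ch, kernel_size=3),
            nn.InstanceNorm2d(ch, affine=True),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)

def upsample_block(in_ch, out_ch):
    # Deconvolution (transposed conv) followed by IN and ReLU; the parameters are
    # assumptions chosen to double the spatial resolution.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                           padding=1, output_padding=1),
        nn.InstanceNorm2d(out_ch, affine=True),
        nn.ReLU(inplace=True),
    )

class GeneratorSubnet(nn.Module):
    def __init__(self):
        super().__init__()
        self.residuals = nn.Sequential(*[ResidualBlock(256) for _ in range(5)])
        self.up = nn.Sequential(
            upsample_block(256, 128),
            upsample_block(128, 96),
            upsample_block(96, 64),
        )
        self.out = nn.Sequential(
            nn.ReflectionPad2d(1),
            nn.Conv2d(64, 32, kernel_size=3, stride=1),
            nn.InstanceNorm2d(32, affine=True),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            nn.Conv2d(32, 3, kernel_size=3, stride=1),
            nn.Tanh(),  # assumed bounded activation; outputs rescaled to the image range
        )

    def forward(self, x):
        return self.out(self.up(self.residuals(x)))
```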

4.3 Loss function

We employ two loss functions, the content loss and the style loss, both computed from layers of the loss network. The content loss computes the similarity of high-level features between the content image and the stylized image. The style loss, on the other hand, computes the similarity of low-level features between the style image and the stylized image.

The overall loss is a weighted sum of the content loss and the style loss:

  L(c, s, ŷ) = α · L_c(c, ŷ) + (1 − α) · L_s(s, ŷ),   (1)

where c, s, and ŷ denote the content image, the style image, and the stylized image, respectively, and α is the combination weight (we set α = 0.5 in our experiments to equally weight these two loss functions).

We obtain the content loss at layers j ∈ J_c as follows:

  L_c(c, ŷ) = Σ_{j ∈ J_c} ‖ φ̄_j(ŷ) − φ̄_j(c) ‖²₂,

where φ̄_j denotes the normalized feature map at the j-th layer, which has C_j × H_j × W_j elements.

The style loss is computed at layers j ∈ J_s as follows:

  L_s(s, ŷ) = Σ_{j ∈ J_s} ‖ G_j(ŷ) − G_j(s) ‖²_F,

where ‖·‖_F denotes the Frobenius norm [11] and G_j is the Gram matrix [11] of the normalized feature map φ̄_j at the j-th layer. The Gram matrix has C_j × C_j elements G_j(x)_{k,k′} = ⟨ φ̄_j(x)_k , φ̄_j(x)_{k′} ⟩, where φ̄_j(x)_k and φ̄_j(x)_{k′} are the features at the k-th and the k′-th channels, respectively, of the feature map φ̄_j(x).

4.4 Adaptive feature injection and concatenation layer

In our network, we employ feature injection between the content features and the style features. We also concatenate them before feeding them into the generator subnet. To weight the contributions of the content features and the style features, we introduce the balance weight . This balance weight is adaptively updated during training so that it retains the balance between the content and the style in stylized images.

At the -th iteration in the training phase, the balance weight is computed as follows:

To restrict the fluctuation of the balance weight, we compute at every non-overlapping iterations and use it for the next iterations:

(2)

Using the balance weight, we sum up the content feature at the -th layer and the style feature at the -th layer to obtain the feature for the adaptive feature injection as follows:

Similarly, we concatenate the content feature and the style feature in the adaptive concatenation layer as follows:

The learned balance weight ensures the balance of the contributions of the content feature and the style feature in both the feature injection and the concatenation layers. For example, when is smaller than (meaning in Eq. (2)), the contribution of the style feature is increased in the next iterations, and vice versa. Moreover, the learned balance weight is more advantageous than a fixed balance weight, which does not take the balance of the losses into account.
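A possible implementation of the adaptive balance weight and the two adaptive layers is sketched below; the update rule w = mean(L_c) / (mean(L_c) + mean(L_s)) and the re-computation period are assumptions that merely reproduce the behaviour described above (a smaller content loss increases the style contribution in the following iterations).

```python
import torch

class AdaptiveBalance:
    """Running balance weight w in [0, 1], re-computed every `period` iterations
    from the recent content/style losses. The update rule and the default period
    are assumptions; they only reproduce the described behaviour that a smaller
    content loss increases the style contribution in the following iterations."""
    def __init__(self, period=100, init_w=0.5):
        self.period = period
        self.w = init_w
        self.c_losses, self.s_losses = [], []

    def update(self, loss_c, loss_s):
        self.c_losses.append(float(loss_c))
        self.s_losses.append(float(loss_s))
        if len(self.c_losses) == self.period:
            mc = sum(self.c_losses) / self.period
            ms = sum(self.s_losses) / self.period
            self.w = mc / (mc + ms + 1e-8)   # assumed rule: w shrinks if L_c < L_s
            self.c_losses, self.s_losses = [], []
        return self.w

def adaptive_injection(f_content, f_style, w):
    # Weighted sum of a content feature and a style feature of the same shape.
    return w * f_content + (1.0 - w) * f_style

def adaptive_concat(f_content, f_style, w):
    # Weighted concatenation along the channel dimension before the generator subnet.
    return torch.cat([w * f_content, (1.0 - w) * f_style], dim=1)
```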

5 Experimental setup

5.1 Dataset and compared methods

5.1.1 Dataset

In our experiments, we used images in the MS-COCO 2014 dataset [24] as our content images and six famous paintings widely used in style transfer [7, 16, 13] as our style images (cf. Fig. 7).

We used the MS-COCO 2014 training set for our training, and we randomly selected 20 images from the MS-COCO 2014 validation set for our validation. In the testing phase, on the other hand, we randomly selected 50 images from the MS-COCO 2014 validation set (different from the 20 images used in our validation).

Figure 7: Styles used in experiments. From left to right: Starry Night, Mosaic, Composition VII, La Muse, The Wave, and Feathers.

5.1.2 Compared methods

We compared our method with state-of-the-art methods: Gatys+ [7], Johnson+ [16], Huang+ [13], Sheng+ [32], Chen+ [3], and Li+ [22]. We note that Gatys+ is based on IOB-NST and the others are based on MOB-NST. For Gatys+, we used the re-implementation by J. Johnson (https://github.com/jcjohnson/neural-style). For the others, we used publicly available source codes with the parameters recommended by the authors: Johnson+ (https://github.com/jcjohnson/fast-neural-style), Huang+ (https://github.com/xunhuang1995/AdaIN-style), Sheng+ (https://github.com/LucasSheng/avatar-net), Chen+ (https://github.com/rtqichen/style-swap), and Li+ (https://github.com/Yijunmaverick/UniversalStyleTransfer). We remark that we set 1000 iterations for Gatys+.

5.2 Implementation details

5.2.1 Implementation setup

We implemented our method in PyTorch (https://pytorch.org/). We used the instance-incremental learning strategy for dealing with multiple styles. We conducted all experiments on a PC with a Core i7 CPU at 3.7 GHz, 12 GB of RAM, and a GTX 770 GPU with 4 GB of VRAM.

We performed the adaptive feature injection from layers in the style subnet to the corresponding layers in the content subnet (Table 1). We adopted the VGG-16 model [33] pre-trained on ImageNet [31] as the loss network without any fine-tuning; all layers after the last layer we use were dropped. We obtained the content loss at one layer and the style loss at three layers of the loss network (cf. Section 4.3).

5.2.2 Training the model

Our method adopts a one-style model to reduce computational time. For a new, yet unknown style, we fine-tune parameters from an existing model. With this learning strategy, our method can adapt to a new style at a lower cost than existing work [7, 16, 13, 37]. Moreover, the fine-tuning enables our method to deal with an unlimited number of styles quickly, unlike existing methods such as [3, 4].

We first trained an initial model on the Starry Night style and then incrementally fine-tuned it on the other styles one by one. We trained the network on the Starry Night style with a batch size of 2 for 80k iterations, corresponding to 2 epochs. The balance weight in Eq. (2) was re-computed at every iterations. All the training and validation images were resized to . To train the model, we used the Adam optimizer [17] with the learning rate of , the moments and , and the division-by-zero parameter . We used neither learning rate decay nor weight decay.

For the initial model, we trained all subnets simultaneously while independently updating the weights of each subnet. Validation was performed at every 100 iterations during the training process. While observing the content loss and the style loss on the validation set, if either loss indicated overfitting, we stopped updating the weights of the corresponding subnet.

We incrementally fine-tuned the initial model to the other styles one by one. 2000 images in the MS-COCO 2014 training set [24] were randomly selected as content images for training. The network was trained for 1000 iterations with a batch size of 2. The Adam optimizer [17] was used with the same parameters as in the training of the initial model. The balance weight in Eq. (2) was re-computed at every iterations. The loss-based technique described above was also applied to avoid overfitting, where validation was performed at every 50 iterations.
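The incremental fine-tuning procedure can be sketched as follows; the learning rate, the forward signatures of the model and the loss network, and the use of a single optimizer over all subnets are assumptions made for illustration, and AdaptiveBalance refers to the sketch in Section 4.4.

```python
import copy
import torch
from torch.optim import Adam

def finetune_on_new_style(initial_model, style_image, content_loader, loss_net,
                          iterations=1000, lr=1e-3):
    """Fine-tune an existing one-style model on a new style. The learning rate,
    the forward signatures of `initial_model` and `loss_net`, and the single
    optimizer over all subnets are assumptions."""
    model = copy.deepcopy(initial_model)          # keep the original style model intact
    optimizer = Adam(model.parameters(), lr=lr)
    balance = AdaptiveBalance(period=100)         # hypothetical re-computation period
    w = balance.w

    data_iter = iter(content_loader)
    for _ in range(iterations):
        try:
            content = next(data_iter)
        except StopIteration:                     # restart the loader when exhausted
            data_iter = iter(content_loader)
            content = next(data_iter)

        stylized = model(content, style_image, w)                   # assumed signature
        loss_c, loss_s = loss_net(stylized, content, style_image)   # assumed signature
        loss = 0.5 * loss_c + 0.5 * loss_s                          # equal weighting

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        w = balance.update(loss_c.item(), loss_s.item())
    return model
```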

5.3 Evaluation metric

We introduce a metric to evaluate how well a stylized image is balanced in content and style.

For each pair of a content image and a style image , we compute the content loss and the style loss . When the two losses are almost the same, we may say that the stylized image is well balanced in content and style. This means that, in the 2D plane whose coordinate system is defined by the content loss and the style loss, how close the point is to the line "content loss" = "style loss" (called the balanced axis hereafter) can be a criterion to evaluate how well the stylized image is balanced in content and style. The distance between the origin and the point is, of course, a criterion for evaluating the quality of the stylized image.

We assume that we have N stylized images. We normalize the content loss and the style loss of each stylized image over the N images:

  L̃_c = (L_c − μ_c) / σ_c,   L̃_s = (L_s − μ_s) / σ_s,

where μ_c, σ_c, μ_s, and σ_s are the mean and the standard deviation of the content loss and of the style loss over the N stylized images, respectively.

Let denote the angle between the line going through the origin and and the content loss axis or the style loss axis (the smaller angle is selected):

Larger indicates that the point is closer to the balanced axis, meaning that the stylized image is more balanced in content and style.

The quality of a stylized image is evaluated using the distance between the origin and the point:

Using and above, we define our metric :

This metric concerns both the criterion for balance and the criterion for quality; therefore, it is a useful metric for evaluating stylized images. We note that a larger value is better because the angle should be larger and the distance should be smaller for better stylized images.
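A sketch of how this metric can be computed is given below; the z-score normalization and the distance to the origin follow the text, while combining the angle and the distance as theta / d is an assumption consistent with "the angle should be larger and the distance should be smaller".

```python
import numpy as np

def balance_metric(content_losses, style_losses):
    """Per stylized image: distance d to the origin and angle theta (radians) to
    the nearer loss axis in the normalized (content loss, style loss) plane.
    The z-score normalization follows the text; the combination theta / d is an
    assumption."""
    c = np.asarray(content_losses, dtype=float)
    s = np.asarray(style_losses, dtype=float)

    # Normalize each loss over the set of stylized images.
    cz = (c - c.mean()) / (c.std() + 1e-12)
    sz = (s - s.mean()) / (s.std() + 1e-12)

    d = np.sqrt(cz ** 2 + sz ** 2)                      # distance to the origin
    ang_content = np.arctan2(np.abs(sz), np.abs(cz))    # angle to the content-loss axis
    ang_style = np.arctan2(np.abs(cz), np.abs(sz))      # angle to the style-loss axis
    theta = np.minimum(ang_content, ang_style)          # the smaller angle is selected
    score = theta / (d + 1e-12)                         # assumed combination; larger is better
    return d, theta, score
```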

6 Experimental results

6.1 Qualitative evaluation

Figure 8 shows examples of the obtained results; the stylized images obtained by our method are more balanced in content and style. We also see that, overall, the results obtained by Gatys+ [7], Sheng+ [32], and Li+ [22] reflect the style well, but they mostly lose the content (the content of the stylized results for the La Muse and Feathers styles is hard to recognize). For some styles (Starry Night, Composition VII, and The Wave), we see that Johnson+ [16] seems to randomly select a patch in the style and paste it into the content image. Huang+ [13] also loses the content and suffers from a so-called checkerboard effect. We also see that Chen+ [3] loses almost all the style and tends to keep the original content image.

Figure 8: Visual comparison of our method against the state-of-the-art methods. Left-most column: content image (large) and style image (small). From left to right: the stylized images by our method, Johnson+ [16], Huang+ [13], Gatys+ [7], Sheng+ [32], Chen+ [3], and Li+ [22]. Our results, surrounded with red rectangles, are more balanced in content and style than the others. Note that all stylized images have the size of .

To objectively compare the obtained results, we conducted three user studies: overall quality, content preserving, and style look-like. From the visual comparison in Fig. 8, we see that evaluating all stylized results among the compared methods is quite difficult. We thus selected only three methods for our user studies. To this end, we investigated the quantitative comparison (Section 6.2). As Gatys+ [7] is known to keep styles most while Johnson+ [16] retains the content most, these methods are appropriate choices for our user studies. Among the remaining compared methods, we see that Huang+ [13] is the most balanced (the loss distributions of Huang+ [13] appear near the balanced axis (Fig. 10)). We therefore chose Gatys+ [7], Johnson+ [16], and Huang+ [13] for our user studies.

For our user studies, we randomly selected 20 images from the 50 testing images as content images and chose 5 styles by excluding The Wave style because it is simpler than the other styles (Fig. 7). We remark that the combination of 20 content images and 5 styles results in 100 stylized images for each method. In each user study, we presented 100 sets of images to 31 subjects, where each set consists of a content image, a style image, and four output images obtained by our method and the three comparison methods [7, 16, 13]. We then asked the subjects to rank the four output stylized images in each set (1st is best, and 4th is worst). For the overall quality study, the subjects were asked to give the ranking based on the overall quality of each set. For the content preserving study, the subjects were asked to rank the output images in each set based on how faithfully the images preserve the content of the content images. For the style look-like study, on the other hand, the subjects ranked the output images in each set based on how much the images look like the style of the style images. We note that the four output images were aligned in random order in each set and that each set was displayed for 6 seconds.

Table 2, Table 3, and Table 4 show the average of rankings over the 100 sets for the overall quality, the content preserving, and the style look-like studies, respectively. We also computed the average of rankings in each style, which is also illustrated in Table 2, Table 3, and Table 4.

We see that our method takes the best ranking among the four methods in overall quality (Table 2). Looking into the results in more detail, we see that our method is ranked first for the Mosaic style and second for the others (except for the Composition VII style). This indicates that our method performs stably well in overall quality in accordance with human cognition. We remark that the Composition VII style is rather complex (Fig. 7) and the results for this style are difficult to evaluate. We also remark that the single-style models (ours, Gatys+ [7], and Johnson+ [16]) performed better than the multi-style model (Huang+ [13]).

For the content preserving (Table 3) and the style look-like (Table 4) studies, our method takes the second best ranking. Note that the scores in these studies are more widely distributed than those in the overall quality study. As MOB-NST is known to perform better in content preserving than IOB-NST [15], Johnson+ [16], which is MOB-NST, takes the best ranking in the content preserving study. Gatys+ [7], on the other hand, which is IOB-NST, takes the best ranking in the style look-like study. In contrast, our method is ranked second for all styles in the content preserving study (except for the Composition VII style) (Table 3) and in the style look-like study (except for the Feathers style) (Table 4). These results indicate that our method stably produces stylized images balanced in content and style for almost all the styles. We remark that, in the case of the Feathers style, the two best methods in the style look-like study follow the MOB-NST approach. As MOB-NST is known not to keep styles well [15], this suggests that the Feathers style is a difficult style for users to evaluate.

Style Ours Johnson+ Huang+ Gatys+
[16] [13] [7]
Starry Night 2.12 2.72 3.14 2.01
Mosaic 2.21 2.25 2.91 2.63
Composition VII 2.47 2.95 2.4 2.18
La Muse 2.38 2.28 2.82 2.51
Feathers 2.15 1.82 3.28 2.74
All together 2.27 2.40 2.91 2.41
Table 2: Average of rankings in the overall quality study. The best and the second best results are given in red and blue, respectively.
Style Ours Johnson+ Huang+ Gatys+
[16] [13] [7]
Starry Night 2.53 1.96 2.67 2.84
Mosaic 2.13 1.60 3.05 3.22
Composition VII 3.02 1.81 2.50 2.67
La Muse 1.99 1.82 3.06 3.13
Feathers 2.02 1.81 2.50 2.67
All together 2.34 1.80 2.87 2.99
Table 3: Average of rankings in the content preserving study. The best and the second best results are given in red and blue, respectively.
Style Ours Johnson+ Huang+ Gatys+
[16] [13] [7]
Starry Night 2.27 2.77 3.34 1.61
Mosaic 2.26 2.66 2.94 2.13
Composition VII 2.49 2.96 2.65 1.90
La Muse 2.71 2.81 2.81 1.67
Feathers 1.69 2.34 3.42 2.55
All together 2.28 2.71 3.03 1.97
Table 4: Average of rankings in the style look-like study. The best and the second best results are given in red and blue, respectively.

6.2 Quantitative evaluation

Method (smaller is better) (larger is better)
Ours 0.37 2.95
Johnson+ [16] 0.54 1.60
Huang+ [13] 0.45 1.23
Gatys+ [7] 0.45 1.36
Sheng+ [32] 0.52 1.21
Chen+ [3] 0.59 0.72
Li+ [22] 0.49 1.40
Table 5: Averages of (smaller is better) and (larger is better).
(a) ().
(b) ().
Figure 9: Averages of and in each style.
(a) Starry Night style.
(b) Mosaic style.
(c) Composition VII style.
(d) La Muse style.
(e) The Wave style.
(f) Feathers style.
Figure 10: Loss distribution in each style. Red lines denote the balanced axis. Our method has the distributions nearer the balanced axis than the other methods.

In order to quantitatively evaluate the obtained results, we computed the averages of the two measures over 300 (50 contents × 6 styles) sets for each method (Table 5). We see that our method performs best in both measures. We also computed the averages in each style, which are illustrated in Fig. 9. Fig. 9 shows that our method performs best in both measures for all the styles.

To look into the results in more detail, we show the loss distributions of the 50 stylized images in each style (Fig. 10). We see that (1) the content loss and the style loss of each stylized result in our method are similar to each other and that (2) the loss distributions of our method appear densely near the balanced axis for all the styles while those of the other methods do not.

6.3 Computational speed

Method Time (s) by image size Implemented framework
Ours 0.05 0.18 PyTorch
Johnson+ [16] 1.12 3.79 Torch
Huang+ [13] 1.98 6.78 Torch
Gatys+ [7] 74.12 269.74 Torch
Sheng+ [32] 3.04 10.67 TensorFlow
Chen+ [3] 2.74 9.33 Torch
Li+ [22] 3.53 9.42 Torch
Table 6: The average wall-clock time in second for producing one stylized image.

We measured the running time for generating 300 stylized images with the sizes of and by each method, and compared the average time for generating one stylized image.

Table 6 shows the average running time for generating one stylized image. As we see, our method is the fastest: it is 22 times faster for the image size of and 21 times faster for that of compared with the fastest state-of-the-art method [16]. We can thus conclude that our method is promising for real-time applications.
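A simple way to reproduce such a measurement is sketched below; the model's forward signature and the use of CUDA synchronization points are assumptions.

```python
import time
import torch

def average_stylization_time(model, content_images, style_image, device="cuda"):
    """Average wall-clock time (seconds) to produce one stylized image.
    The model's forward signature follows the sketches assumed earlier."""
    model = model.to(device).eval()
    style_image = style_image.to(device)
    times = []
    with torch.no_grad():
        for content in content_images:
            content = content.to(device)
            if device == "cuda":
                torch.cuda.synchronize()      # make sure timing covers the GPU work
            start = time.perf_counter()
            _ = model(content, style_image, 0.5)   # assumed forward signature
            if device == "cuda":
                torch.cuda.synchronize()
            times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```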

6.4 More detailed analysis

6.4.1 Effectiveness of feature injection

Figure 11: Visual comparison of the method with and without adaptive feature injection. In each block, from left to right, a content image (large) with a style (small) is followed by the outputs of the method with adaptive feature injection and without feature injection. Note that all stylized images have the size of .
(a) Starry Night style.
(b) Mosaic style.
(c) Composition VII style.
(d) La Muse style.
(e) The Wave style.
(f) Feathers style.
Figure 12: Loss distribution in each style obtained by the method with/without adaptive feature injection. Red lines denote the balanced axis.

In this section, we evaluate the effectiveness of introducing the adaptive feature injection between the content subnet and the style subnet.

We compared our complete method with a variant without adaptive feature injection; the results are shown in Fig. 11. Fig. 11 shows that the stylized images obtained by the method with adaptive feature injection are more balanced in content and style than those without adaptive feature injection.

We also compared the two measures of the stylized images (Table 7). We see that the method with adaptive feature injection performs better in both measures than the method without adaptive feature injection. Table 7 also shows that employing adaptive feature injection improves both measures for each style (except for the La Muse style). This indicates that adaptive feature injection is effective in improving not only the quality but also the balance of content and style in stylized images. With respect to the La Muse style, the first measure of the method with adaptive feature injection is comparable to that of the method without feature injection, whereas the second is not. This can be explained as follows. The La Muse style follows Cubism and is thus very distinctive. Because of this, the adaptive feature injection tends to keep more style in order to reflect the impression of this style.

Finally, we compare the loss distributions of the 50 stylized images in each style (Fig. 12). We see that for all styles (except for the La Muse style) the loss distributions of the method with adaptive feature injection appear more densely near the balanced axis and are closer to the origin than those of the method without feature injection. In the case of the Starry Night style (Fig. 12(a)), we see that the method without adaptive feature injection preserves much more content than style because its loss distribution appears far above the balanced axis. This observation also holds true for the Mosaic style (Fig. 12(b)), the Composition VII style (Fig. 12(c)), and the La Muse style (Fig. 12(d)). By using adaptive feature injection, the method is able to reduce the style loss in stylized images (e.g., for the Starry Night, Mosaic, Composition VII, and La Muse styles), compared to the method without adaptive feature injection. These observations indicate that the adaptive feature injection effectively helps to keep the balance of content and style in stylized images.

Style (smaller is better) (larger is better)
w/ w/o w/ w/o
Starry Night 0.34 0.46 2.12 1.35
Mosaic 0.45 0.57 1.80 1.03
Composition VII 0.21 0.26 3.61 3.21
La Muse 0.30 0.27 1.72 2.54
The Wave 0.15 0.24 5.39 3.15
Feathers 0.23 0.28 3.71 3.20
All together 0.28 0.35 3.06 2.41
Table 7: Averages of (smaller is better) and (larger is better) for the method with (denoted by w/) and without (denoted by w/o) adaptive feature injection.

6.4.2 Effectiveness of changing combination weight in loss function

Here, we evaluate whether the combination weight α in Eq. (1) plays the role of controlling the contributions of the content and the style.

We confirmed that the balance weight in Eq. (2) is learned through the content and the style losses so that the two losses become well balanced. Thanks to this balance weight, we no longer need to control α to balance the two losses. This is the reason why we set α = 0.5 in the above experiments. This also suggests that α can serve as an indicator of whether the content or the style is emphasized in the obtained stylized images.

To confirm the role of α, we generated stylized images with different values of α, from 0.1 to 0.9. Some examples under different values of α are illustrated in Fig. 13. As we see, for smaller α, the style is more emphasized and the results become more similar to those by Gatys+ [7] than in the case of α = 0.5. For larger α, on the other hand, the content is more emphasized and the results become more similar to those by Johnson+ [16]. We may conclude that α indeed directly controls the contributions of the content and the style in stylized images.

Figure 13: Example of stylized images obtained by changing α from 0.1 to 0.9. Left-most column: content image (large) and style image (small). From left to right: the stylized images with various values of α.

7 Conclusion

We presented an end-to-end two-stream network for balancing the content and the style in stylized images. Our proposed method utilizes a deep FCN to preserve the semantic content and a shallow FCN to faithfully learn the style representation; their features are adaptively injected and concatenated using the balance weight and then fed into the decoder to generate stylized images. Our extensive experiments demonstrate the effectiveness of our proposed method against state-of-the-art methods in terms of balancing content and style. Furthermore, our proposed method outperforms the state-of-the-art methods in speed.

As an extension of image style transfer, real-time video stylization methods have recently been proposed [12, 6, 21]. Since our proposed method runs fast, we believe that it can be useful for real-time video stylization. Though video stylization is out of the scope of this paper, we applied our method in a frame-by-frame manner to several videos as a demonstration. Fig. 14 shows some examples of stylized frames from a video. Our approach was able to stylize videos in real time at 30 FPS or more. As we see, our method produces reasonable results for consecutive frames with varying appearance, meaning that using our method for real-time video stylization is promising. We remark that we used neither temporal regularization nor post-processing. Different from image style transfer, real-time video stylization needs to pay attention to the temporal consistency among adjacent video frames. Incorporating temporal consistency into our method for real-time video stylization is left for future work.
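A minimal frame-by-frame sketch of such a video demonstration is given below, using OpenCV for decoding and encoding; the model's forward signature, its output range, and the requirement that the style image be pre-resized to the frame size are assumptions.

```python
import cv2
import numpy as np
import torch

def stylize_video(model, style_image, in_path, out_path, device="cuda"):
    """Frame-by-frame video stylization without temporal regularization or
    post-processing. The model's forward signature and its [0, 1] output range
    are assumptions; style_image is assumed pre-resized to the frame size."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = None
    model = model.to(device).eval()
    style_image = style_image.to(device)

    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
            x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0).to(device)
            y = model(x, style_image, 0.5)                       # assumed forward signature
            out = (y.clamp(0, 1)[0].permute(1, 2, 0).cpu().numpy() * 255).astype(np.uint8)
            out_bgr = cv2.cvtColor(out, cv2.COLOR_RGB2BGR)
            if writer is None:
                h, w = out_bgr.shape[:2]
                writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                                         fps, (w, h))
            writer.write(out_bgr)

    cap.release()
    if writer is not None:
        writer.release()
```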

Figure 14: Examples of real-time video stylization using the "Starry Night" style. We use Eadweard Muybridge's "The Horse in Motion" (1878) as the content input. Our model processes every frame independently without any post-processing, at 30 FPS.

References

  • [1] M. Ashikhmin (2001) Synthesizing natural textures. In Symposium on Interactive 3D Graphics, pp. 217–226.
  • [2] S. Azadi, M. Fisher, V. Kim, Z. Wang, E. Shechtman, and T. Darrell (2018) Multi-content GAN for few-shot font style transfer. In CVPR.
  • [3] T. Q. Chen and M. Schmidt (2016) Fast patch-based style transfer of arbitrary style. In NIPS.
  • [4] V. Dumoulin, J. Shlens, and M. Kudlur (2017) A learned representation for artistic style. In ICLR.
  • [5] A. A. Efros and W. T. Freeman (2001) Image quilting for texture synthesis and transfer. In SIGGRAPH, pp. 341–346.
  • [6] C. Gao, D. Gu, F. Zhang, and Y. Yu (2018) ReCoNet: real-time coherent video style transfer network. In ACCV.
  • [7] L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In CVPR, pp. 2414–2423.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • [10] D. J. Heeger and J. R. Bergen (1995) Pyramid-based texture analysis/synthesis. In SIGGRAPH, pp. 229–238.
  • [11] R. A. Horn and C. R. Johnson (2012) Matrix analysis. 2nd edition, Cambridge University Press, New York, NY, USA.
  • [12] H. Huang, H. Wang, W. Luo, L. Ma, W. Jiang, X. Zhu, Z. Li, and W. Liu (2017) Real-time neural style transfer for videos. In CVPR, pp. 7044–7052.
  • [13] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV.
  • [14] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456.
  • [15] Y. Jing, Y. Yang, Z. Feng, J. Ye, and M. Song (2018) Neural style transfer: a review. CoRR abs/1705.04058.
  • [16] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV.
  • [17] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
  • [18] J. E. Kyprianidis, J. Collomosse, T. Wang, and T. Isenberg (2013) State of the "art": a taxonomy of artistic stylization techniques for images and video. IEEE Transactions on Visualization and Computer Graphics, pp. 866–885.
  • [19] C. Li and M. Wand (2016) Precomputed real-time texture synthesis with markovian generative adversarial networks. In ECCV, pp. 702–716.
  • [20] S. Li, X. Xu, L. Nie, and T. Chua (2017) Laplacian-steered neural style transfer. In ACM-MM, pp. 1716–1724.
  • [21] W. Li, L. Wen, X. Bian, and S. Lyu (2018) Evolvement constrained adversarial learning for video style transfer. In ACCV.
  • [22] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang (2017) Universal style transfer via feature transforms. In NIPS.
  • [23] Y. Li, M. Liu, X. Li, M. Yang, and J. Kautz (2018) A closed-form solution to photorealistic image stylization. In ECCV.
  • [24] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV.
  • [25] F. Luan, S. Paris, E. Shechtman, and K. Bala (2017) Deep photo style transfer. In CVPR, pp. 6997–7005.
  • [26] A. Mahendran and A. Vedaldi (2015) Understanding deep image representations by inverting them. In CVPR, pp. 5188–5196.
  • [27] R. Mechrez, E. Shechtman, and L. Zelnik-Manor (2017) Photorealistic style transfer with screened poisson equation. In BMVC.
  • [28] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807–814.
  • [29] B. Ommer (2018) A style-aware content loss for real-time HD style transfer. In ECCV.
  • [30] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434.
  • [31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision, pp. 211–252.
  • [32] L. Sheng, Z. Lin, J. Shao, and X. Wang (2018) Avatar-Net: multi-scale zero-shot style transfer by feature decoration. In CVPR, pp. 1–9.
  • [33] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
  • [34] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky (2016) Texture networks: feed-forward synthesis of textures and stylized images. In ICML, pp. 1349–1357.
  • [35] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. In ICML.
  • [36] D. M. Vo, T. N. Le, and A. Sugimoto (2018) Balancing content and style with two-stream FCNs for style transfer. In WACV.
  • [37] X. Wang, G. Oxholm, D. Zhang, and Y. Wang (2017) Multimodal transfer: a hierarchical deep convolutional neural network for fast artistic style transfer. In CVPR.
  • [38] Y. Zhang, Y. Zhang, and W. Cai (2018) Separating style and content for generalized style transfer. In CVPR.