Pair-wise Exchangeable Feature Extraction for Arbitrary Style Transfer
Style transfer has been an important topic in both computer vision and graphics. Gatys et al. first showed that deep features extracted by a pre-trained VGG network represent both the content and style of an image, and hence style transfer can be achieved through optimization in feature space. Huang et al. then showed that real-time arbitrary style transfer can be done by simply aligning the mean and variance of each feature channel. In this paper, however, we argue that aligning only the global statistics of deep features cannot always guarantee a good style transfer. Instead, we propose to jointly analyze the input image pair and extract common/exchangeable style features between the two. In addition, a new fusion mode is developed for combining content and style information in feature space. Qualitative and quantitative experiments demonstrate the advantages of our approach.
A style transfer method takes a pair of images as input and synthesizes an output image that preserves the content of the first image while mimicking the style of the second. The study of this topic has drawn much attention in recent years due to its scientific and artistic value. Recently, the seminal work of Gatys et al. found that multi-level feature statistics extracted from a pre-trained CNN model can be used to separate content and style information, making it possible to combine the content and style of arbitrary images. This method, however, depends on a slow iterative optimization, which limits its range of application.
Since then, many attempts have been made to accelerate the above approach by replacing the optimization process with feed-forward neural networks [5, 14, 19, 34, 31]. While these methods can effectively speed up the stylization process, they are generally constrained to a predefined set of styles and cannot adapt to an arbitrary style specified by a single exemplar image.
Notable effort [22, 28, 4] has been devoted to solving this flexibility vs. speed dilemma. A successful direction is to apply statistical transformation, which aligns the feature statistics of the input content image to those of the style image [11, 29, 21]. Such approaches implicitly assume that feature statistics (i.e., channel-wise mean and variance) contain all and only style information, which can be exchanged between any pair of content and style images. When this assumption does not hold for a given pair of images, the corresponding style transfer result can be poor.
Instead of aligning the input image to features independently computed from either a batch of samples (batch normalization) or a single style sample (instance normalization), we jointly consider both content and style images and extract exchangeable style features that are customized for this pair of images only. As a result, the stylizations of different content images are guided by different exchangeable features even under the same style image. Our experiments demonstrate that performing style transfer through pairwise exchangeable feature extraction yields more structured results and better visual details than existing approaches; see, e.g., Figures 1 and 5.
To compute exchangeable style features from feature statistics of two input images, a novel Feature Exchanging Block is designed, which is inspired by the works on private-shared component analysis [2, 3]. In addition, we propose a new Content-Style Fusion mode to fuse together the content information and exchangeable style information, before a decoder is used to synthesize the output image. To summarize, the contributions of our work include:
The importance of computing pairwise exchangeable features for style transfer between two images is clearly demonstrated.
A novel Feature Exchanging Block is designed for learning common information in-between features extracted from a pair of input images.
A simple yet effective mode is developed to fuse content and style information together through channel compression and expansion.
The overall end-to-end style transfer framework can perform arbitrary style transfer in real-time and synthesize highly detailed results with favored styles.
2 Related Work
2.1 Style Transfer
Intuitively, style transfer aims at changing the style of an image while preserving its content. Earlier non-parametric methods are usually built upon low-level image features [10, 9]. Recently, impressive neural style transfer was realized by Gatys et al. In this pioneering work, they found that the deep feature maps extracted by a neural network pre-trained on a large dataset (e.g., ImageNet) are a good representation of the content information of an image, whereas the correlations between different filter responses at a given layer of the network encode the style information. By matching the two representations between the deep features of the content and style images, the output image can be iteratively updated until a satisfactory stylization is reached.
The iterative optimization process used in the above approach is slow and thus limits its practical application. Since then, numerous methods [14, 18, 31] have been proposed to accelerate it by training feed-forward neural networks with the same loss. Other studies have aimed to improve quality, photorealism, and user controllability [7, 32]. Very recently, Sanakoyeu et al. proposed to define style based on a collection of related artistic images, achieving better stylization from the perspective of art-history experts. Nonetheless, most of the above methods are constrained to a limited set of styles. Dumoulin et al. address this problem and succeed in training a feed-forward network capable of encoding 32 styles. Li et al. then extend the number of styles to 1,000. But still, the set of transferable styles is fixed and these models cannot adapt to arbitrary new styles.
To achieve both efficiency and flexibility, Huang et al. propose to explicitly match the mean and variance of each feature channel of the content image to those of the style image. This simple yet effective approach enables transferring an arbitrary style specified by a single exemplar image. Li et al. further apply a whitening and coloring transform to the extracted deep features.
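The channel-wise statistics alignment of Huang et al. can be sketched as follows; this is a minimal NumPy illustration of the AdaIN operation, not the authors' implementation:

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Align the per-channel mean and std of a (C, H, W) content feature
    map to those of a style feature map (statistics over H and W)."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

After this transform, each channel of the content feature carries the style feature's first- and second-order statistics, which is exactly the implicit assumption our pairwise analysis relaxes.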
In this paper, we argue that aligning only global statistics cannot guarantee good style transfer results, especially when there are significant differences between the content and style images. Inspired by domain adaptation works [2, 3], we jointly analyze both content and style images to compute an exchangeable feature component. By manipulating the feature channels of the content image based on this extracted exchangeable component, the final stylization is significantly improved, as evidenced in the results section.
2.2 Image-to-Image Translation
Image-to-image translation refers to the task of mapping an image from a source domain to a target domain. Isola et al. first proposed a supervised framework based on conditional GANs, where paired training data are required. Several unsupervised methods were proposed later to learn the translation between two image collections from unpaired data only [36, 23, 33]. Nevertheless, these methods suffer from a lack of mapping diversity. To tackle this issue, some recent works [16, 12, 8, 24] adopt a disentanglement strategy: the disentangled shared/common part is considered the content representation, while the private/domain-specific part represents the style component.
Since we are not pursuing multi-modal mapping, we still follow the assumptions made by Gatys et al. The key difference is that, for better stylization, we analyze the style features of the two input images jointly. A common style feature is disentangled, which then guides the extraction of exchangeable style representations from the raw style features of the content and style images.
3 Developed Framework
As shown in Figure 2, we present a new framework that enables fast style transfer by learning to extract exchangeable style features between the two input images, which are then intertwined with content codes for decoding the final synthesized results. A distinct feature of our approach is that it is trained over pairs of input images; hence, a dataset of content images and style images provides us with training sample pairs. Such pairwise training allows our framework to better leverage the inter-dependency between the two input images and improve the final results. In this architecture, inspired by work on private-shared component analysis [2, 3], we develop a novel block, named the Feature Exchange block, to learn common style features from the input content image and the corresponding style image. One common feature and two private features are used to represent the styles of the two input images. A simple yet efficient mode to fuse content and style is then studied. Figure 3 illustrates the architecture of our framework.
3.1 Exchangeable Feature for Style Transfer
The overall goal of the presented framework is to learn two exchangeable style features, one for the content image and one for the style image, which can be fused with content features to decode either a reconstructed or a stylized image. As illustrated in Figure 3, our framework consists of a shared encoder, several Feature Exchange blocks, and two decoders. Similar to prior work [11, 29], we use the first few layers of the pre-trained VGG-19 model (up to relu4_1) to initialize the encoder module, which is fixed during training. The VGG-based encoder maps the images into a latent space, producing one feature map for the content image and one for the style image. Both feature maps have 512 channels at each pixel location and encode the basic content information of the corresponding input images.
Next, we compute a covariance matrix for each of the two feature maps by treating the channels as the elements of a random vector. The covariance matrices store the raw style features of the two images and contain richer information than just the per-channel mean and variance. To reduce the number of parameters, each covariance matrix is fed into multiple convolution layers followed by a fully-connected layer, resulting in one style vector per image. Inspired by private-shared component analysis, the two style vectors are further processed to output three feature vectors: a unique feature vector for each image and a common feature vector shared by both. More precisely, the two unique vectors are initialized by feeding the respective style vectors through two fully-connected layers, whereas the common vector is initialized by feeding their concatenation into a fully-connected layer. These three initial feature vectors are then refined using several Feature Exchange blocks that are chained together; see Section 3.2.
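The raw style feature, the covariance between feature channels, can be sketched as follows (a NumPy illustration; in the actual model this is computed on learned VGG features):

```python
import numpy as np

def channel_covariance(feat):
    """Covariance between the channels of a (C, H, W) feature map.

    Each channel is flattened into a sample of length H*W; the resulting
    (C, C) matrix captures cross-channel correlations, i.e., richer style
    information than per-channel mean and variance alone.
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)
    x = x - x.mean(axis=1, keepdims=True)
    return x @ x.T / (h * w - 1)
```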
The refined common feature is employed to guide each raw style feature to learn content-purification weights and an exchangeable style feature for the respective image. To be specific, taking the style image as an example, the refined common feature is concatenated with its style vector to form a joint vector, from which three vectors are computed, each through a dedicated fully-connected layer. The first is a weight vector used to suppress style-related information in the original feature map; this is achieved by multiplying it with the feature map in a channel-wise attention manner for content purification, yielding a purified feature map (one is likewise computed for the content image). The next two vectors, a column vector and a row vector, encode the exchangeable style features, which can be fused with either purified feature map; see Section 3.3.
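The channel-wise attention step above can be sketched as follows; in the actual model the weight vector comes from a learned fully-connected layer, but here it is simply given:

```python
import numpy as np

def purify(feat, weights):
    """Channel-wise attention for content purification: each of the C
    channels of a (C, H, W) feature map is scaled by a learned weight,
    suppressing style-heavy channels."""
    return feat * weights[:, None, None]
```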
Finally, a decoder is learned to invert the fused feature maps back to image space. Fusing the purified content feature with the style image's exchangeable style feature yields the stylization that transfers the style onto the content image, whereas fusing the style image's own purified feature with its style feature yields a reconstruction of the style image. The symmetric operation produces the stylization of the style image and the reconstruction of the content image. Note that in our framework each decoder is shared by one stylization output and one reconstruction output, which leads to more structured synthesized results.
3.2 Feature Exchange Block
The architecture of a single Feature Exchange block is illustrated in Figure 4. Generally speaking, the main idea of the Feature Exchange block is to use residual features to convey messages, so that each input feature is updated in an iterative manner, akin to a message passing operation. This property allows us to chain any number of Feature Exchange blocks in a model without breaking its initial behavior. Each block includes two Residual Message Passing Units, each of which learns a residual feature via an attention gate.
The proposed Residual Message Passing Unit takes two features as input, as depicted in Figure 4(b). The unit learns two residual vectors for updating the two original input features; it efficiently considers both input features at the same time and determines how much information to output. In particular, this component is built with four learnable weights. The original inputs are respectively weighted by the first two weights, followed by a non-linear operation (ReLU). The two processed features are then added up, and the sum is fed into two different learnable weighting layers for the final attentional gating. Two residual features are thus the eventual outputs for the two inputs. Note that all the learnable layers in this unit have the same size in our experiments.
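One possible reading of the unit, sketched in NumPy; the sigmoid placement of the attentional gates is our assumption based on the description of Figure 4(b), not a specification from the paper:

```python
import numpy as np

def residual_message_passing(fa, fb, W1, W2, W3, W4):
    """Sketch of a Residual Message Passing Unit.

    Two input feature vectors are weighted (W1, W2) and passed through a
    ReLU, summed into a shared intermediate, then gated by two further
    weight layers (W3, W4) to produce one residual per input.
    """
    shared = np.maximum(W1 @ fa, 0.0) + np.maximum(W2 @ fb, 0.0)
    gate_a = 1.0 / (1.0 + np.exp(-(W3 @ shared)))  # attentional gate for fa
    gate_b = 1.0 / (1.0 + np.exp(-(W4 @ shared)))  # attentional gate for fb
    return fa + gate_a * shared, fb + gate_b * shared
```

Because the outputs are residual updates, chaining several such units (or whole blocks) keeps the identity path intact, which matches the paper's claim that any number of blocks can be stacked.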
To gradually refine the common feature, each Feature Exchange block takes three inputs. The middle feature vector encodes the common information, and the other two represent the unique information of the corresponding images. As shown in Figure 4(a), the common vector is simultaneously fed into two residual message passing units and is updated using the outputs of both. It is hence encouraged to encode information shared by the two images. The residual messages that are unique to individual images are passed to the two unique vectors.
Employing residual connections facilitates gradient propagation during training and makes direct modifications to the original features. The four learnable weights in each unit are expected to learn to accommodate the importance of the intermediate features. It is also worth noting that the Feature Exchange block is easy to extend to learn a common feature across more images or for other tasks.
3.3 Content-style Fusion
In this section, we present a simple yet effective mode that fuses content and style features in a channel compression-then-expansion manner. Without loss of generality, we discuss here the fusion between the content information from the content image (represented by its purified feature map) and the exchangeable style information from the style image (represented by a column vector and a row vector). Our first step, referred to as style-aware content pooling, is designed to remove from the purified feature map the information that does not match the target style, through channel compression. That is:
where the purified feature map (512 channels at each pixel location) is projected onto the column style vector. This operation effectively compresses all 512 channels at a given pixel location into a single scalar, leaving the number of pixels unchanged.
Then the channels are restored based on the style information extracted from the style image, i.e.:
where the single-channel pooled map is expanded by the row style vector, producing the final stylized feature map.
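The compression-then-expansion fusion can be sketched as follows (a NumPy illustration under the description above; the variable names are ours):

```python
import numpy as np

def fuse(purified, col_style, row_style):
    """Channel compression then expansion.

    purified:  (C, H, W) purified content feature map
    col_style: (C,) exchangeable style column vector (compression)
    row_style: (C,) exchangeable style row vector (expansion)
    """
    # Style-aware content pooling: C channels -> 1 scalar per pixel.
    pooled = np.einsum('c,chw->hw', col_style, purified)
    # Channel expansion: the scalar map is scaled per channel, so the
    # expanded channels are all linearly dependent.
    return row_style[:, None, None] * pooled[None, :, :]
```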
Compared to existing methods [11, 21, 29], the proposed mode employs the target style feature vectors to discard unrelated information and merge useful information. Figure 8 shows that the proposed fusion mode is more successful at removing rich color information from the content image while still preserving the structure. Two alternative fusion modes (concatenation and AdaIN) fail, validating our new fusion mode. It is also noteworthy that, although the expanded channels are all linearly dependent, our learned decoder is capable of inferring a high-resolution stylized image from the fusion results.
3.4 Loss Function for Training
As illustrated in Figure 2, three different types of losses are computed for each input image pair. The first one is a perceptual loss, which is used to evaluate the stylized results. Following previous work [11, 29], we employ a VGG model pre-trained on ImageNet to compute the perceptual content loss:
and style loss
where the VGG-based encoder extracts the features and a Gram matrix is computed for the features extracted at layer i of the encoder module. The set L contains the conv1_1, conv2_1, conv3_1, and conv4_1 layers.
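The Gram-matrix style term can be sketched as follows; the squared-difference form and the normalization constant are common choices, assumed here rather than taken from the paper:

```python
import numpy as np

def gram(feat):
    """Gram matrix of a (C, H, W) feature map, normalized by its size."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)
    return x @ x.T / (c * h * w)

def style_loss(feats_out, feats_style):
    """Sum of squared Gram-matrix differences over the chosen layers
    (conv1_1, conv2_1, conv3_1, conv4_1 in the paper)."""
    return sum(float(((gram(a) - gram(b)) ** 2).sum())
               for a, b in zip(feats_out, feats_style))
```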
The second one is a reconstruction loss, which helps to improve the fidelity of our model. It penalizes the difference between the reconstructed content image and the original one, as well as the difference between the reconstructed and original style images. That is:
Finally, a feature exchange loss term is defined to facilitate common feature extraction in the Feature Exchange blocks. According to the work on private-shared component analysis, the disentangled common feature should be different from the two unique features, and meanwhile their combination should reconstruct the original style features. In other words, taking the content feature as an example, we want the common and unique feature vectors to be as orthogonal as possible while also being able to reconstruct the original style feature. The reconstruction is performed by feeding the sum of the common and unique vectors into a fully-connected layer, which is trained to output the original style vector. Hence, the overall feature exchange loss is computed as:
where the reconstruction is the output of the fully-connected reconstruction layer. Note that this layer is only used in the training stage, and the loss is computed only over the outputs of the last Feature Exchange block.
To summarize, the full objective function of our proposed network is:
where the four weighting parameters are respectively set to 1, 2, 5, and 7 throughout the experiments.
3.5 Implementation Details
We implement our model in TensorFlow. The Places365 database and the WikiArt dataset are used for content and style images, respectively, following prior work. During training, we resize the smaller dimension of each image to 512 pixels while keeping the original aspect ratio, and then train our model on randomly sampled patches. Note that in the testing stage, both the content and style images can be of any size.
In general, our framework consists of one encoder, three Feature Exchange blocks, and two decoders. In each decoder branch, we first use three residual blocks to process the content codes. After the fusion of style and content features, two extra residual blocks are applied, followed by several upsampling operations. A nearest-neighbor upscaling plus convolution strategy is used to reduce artifacts in the upsampling stage.
We use the Adam optimizer with a batch size of 4, a learning rate of 0.0001, and the default decay rates, training for 350,000 iterations.
4 Experimental Results
Comparison with Existing Methods
We compare our approach with three types of state-of-the-art techniques: 1) the general but slow optimization-based approach; 2) three feed-forward neural methods for arbitrary style transfer (AdaIN, WCT, and Avatar-Net); and 3) the recent image-to-image translation algorithm (DRIT) that uses disentangled representations and can be adapted to the style transfer task. We set the maximum number of iterations to 500 for the optimization-based approach. For AdaIN, WCT, Avatar-Net, and DRIT, publicly available code released by the authors is used with the default configurations.
Results of qualitative comparisons are shown in Figure 5. As we can see, our method achieves favorable performance against the state-of-the-art approaches. The optimization-based method can transfer arbitrary styles but is prone to getting stuck in local minima, causing distortions in the results (see rows 3 & 5); additionally, it takes several minutes to generate the final results, which is inconvenient for parameter tuning. AdaIN significantly speeds up this process; however, it does not respect semantic information and sometimes generates results with a color distribution different from the style image (see row 4). WCT uses the covariance matrix to improve performance but heavily depends on hyperparameters; as shown in rows 1 & 4, it sometimes produces messy, less-structured images. Avatar-Net improves on AdaIN and WCT with a feature decoration module, but it heavily distorts semantic structures and introduces blurring and color artifacts. As an image-to-image translation technique, DRIT can generate results with high fidelity, but they are often insufficiently stylized (see rows 2, 4, & 5). In contrast, our method learns exchangeable style features for individual image pairs, which allows us to generate more semantically structured images with better visual details (see row 1) as well as richer color distributions (see row 5).
Figure 6 provides close-up views for a better comparison of the generated details. Compared to the other baselines, our model produces results with better structures and stylization (such as the stroke-like textures and a color distribution similar to the style image). AdaIN fails to transfer the temple into the target style, while the result of WCT is less structured and even a bit messy, losing texture details. DRIT is poor in color distribution and fails to transfer the texture details as well.
Table 1 further compares the different methods quantitatively in terms of perceptual loss. This evaluation metric contains both content and style terms and has been used in previous approaches. It is worth noting that our approach does not minimize the perceptual loss directly, since it is only one of the three types of losses we use. Nevertheless, our model achieves the lowest perceptual loss among all feed-forward models, with the style loss being the lowest and the content loss slightly higher than some of the baselines. This indicates that our approach favors fully stylized results over results with high content fidelity.
| Gatys et al. | 14.0196 | 68.3269 | 82.3465 |
| Gatys et al. | 16.51 | 43.25 | 162.49 |
Table 2 lists the running time of our approach and various state-of-the-art baselines [11, 6, 21, 29, 16] at three image scales. Existing feed-forward approaches [11, 21, 29] are known to be faster than the optimization-based method. Among them, WCT requires several passes and an extra SVD operation, whereas Avatar-Net uses CPU-based operations; this makes them more than an order of magnitude slower than the other neural methods. Our approach is slower than, but still comparable to, the fastest AdaIN algorithm.
Ablation Study
Here we evaluate the impact of common feature learning and the proposed content-style fusion mode. Common feature disentanglement during joint analysis plays a key role in our approach. Its importance can be evaluated by disabling the feature exchange loss, which prevents the network from learning exchangeable features. As shown in Figures 7(a-b), without this loss term, the color distribution and texture patterns in the result image no longer mimic the target style image. In comparison, our proposed model yields a much more favorable result; see Figure 7(c).
The proposed fusion mode is evaluated by replacing it with two alternatives while fixing all other parts: AdaIN-style fusion [11, 12] and concatenation as in Lee et al. The comparison shown in Figure 8 demonstrates that only our fusion mode can effectively remove the rich colors from the content image, leading to a better stylization result with respect to the input style.
Applications
We demonstrate the flexibility of our model in three applications. All these tasks are completed with the same trained model without any further fine-tuning.
Being able to adjust the degree of stylization is a useful feature. In our model, this can be achieved by blending the stylized feature map and the reconstructed feature map before feeding the result to the decoder. That is, we have:
By definition, the network outputs the reconstructed image at one extreme and the fully stylized image at the other, with a smooth transition between the two as the blending weight gradually changes from 0 to 1; see Figure 9.
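This degree control amounts to a linear interpolation in feature space before decoding; a minimal sketch, where the blending-weight name is ours:

```python
def blend(stylized_feat, reconstructed_feat, alpha):
    """Interpolate between the reconstruction (alpha = 0) and the fully
    stylized result (alpha = 1) before feeding the decoder."""
    return alpha * stylized_feat + (1.0 - alpha) * reconstructed_feat
```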
In Figure 10, we present our model's ability to apply different styles to different image regions. Masks are used to specify the correspondences between content image regions and the desired styles. Pairwise exchangeable feature extraction considers only the masked regions when applying a given style, helping to achieve the desired stylization effect for individual regions.
Our method can also be applied to video stylization based on per-frame style transfer; see Figure 11. Compared to WCT, the color distributions in our stylization results are closer to the provided style image and the semantic structures of the content frames are better preserved. Moreover, adjacent frames are more coherent thanks to our sample-level common feature analysis.
5 Conclusions and Future Work
In this paper, we have presented a novel framework for transferring an arbitrary style onto a content image. By analyzing the common style feature of both inputs as guidance, exchangeable style features are extracted. Better stylization is achieved for the content image by fusing its purified content feature with the exchangeable style feature of the style image. In addition, we study a simple yet efficient mode to fuse content and style in a channel compression-then-expansion manner. Experiments show that our method significantly improves stylization performance over prior state-of-the-art methods.
Many directions can be explored in the future. Currently, the covariance matrices are computed from the VGG feature map at a fixed layer; whether involving covariance matrices from other layers can enhance performance is worth investigating. The presented Feature Exchange block has proven powerful for learning the inter-dependency between samples; how to apply it to other tasks, such as image-to-image translation or domain adaptation, could be investigated later. Finally, the presented channel compression-then-expansion fusion mode may discard too much information, since the resulting channels are linearly dependent; designing a more advanced strategy could further improve quality.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
-  K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In NIPS, 2016.
-  J. Cao, O. Katzir, P. Jiang, D. Lischinski, D. Cohen-Or, C. Tu, and Y. Li. Dida: Disentangled synthesis for domain adaptation. arXiv preprint arXiv:1805.08019, 2018.
-  T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. CoRR, abs/1612.04337, 2016.
-  V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. CoRR, abs/1610.07629, 2016.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
-  L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman. Controlling perceptual factors in neural style transfer. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3730–3738, 2017.
-  A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio. Image-to-image translation for cross-domain disentanglement. In NIPS, 2018.
-  D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In SIGGRAPH, 1995.
-  A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. Salesin. Image analogies. In SIGGRAPH, 2001.
-  X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. 2017 IEEE International Conference on Computer Vision (ICCV), pages 1510–1519, 2017.
-  X. Huang, M.-Y. Liu, S. J. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. CoRR, abs/1804.04732, 2018.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
-  H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. K. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision, 2018.
-  C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2479–2486, 2016.
-  C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In ECCV, 2016.
-  Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 266–274, 2017.
-  Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Universal style transfer via feature transforms. In NIPS, 2017.
-  Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz. A closed-form solution to photorealistic image stylization. CoRR, abs/1802.06474, 2018.
-  M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
-  L. Ma, X. Jia, S. Georgoulis, T. Tuytelaars, and L. Van Gool. Exemplar guided unsupervised image-to-image translation with semantic consistency. arXiv preprint arXiv:1805.11145, 2018.
-  K. Nichol. Painter by numbers, wikiart. https://www.kaggle.com/c/painter-by-numbers, 2016.
-  A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
-  A. Sanakoyeu, D. Kotovenko, S. Lang, and B. Ommer. A style-aware content loss for real-time hd style transfer. CoRR, abs/1807.10201, 2018.
-  F. Shen, S. Yan, and G. Zeng. Meta networks for neural style transfer. CoRR, abs/1709.04111, 2017.
-  L. Sheng, Z. Lin, J. Shao, and X. Wang. Avatar-net: Multi-scale zero-shot style transfer by feature decoration. CoRR, abs/1805.03857, 2018.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
-  D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.
-  P. Wilmot, E. Risser, and C. Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. CoRR, abs/1701.08893, 2017.
-  Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In IEEE International Conference on Computer Vision, 2017.
-  H. Zhang and K. J. Dana. Multi-style generative network for real-time transfer. CoRR, abs/1703.06953, 2017.
-  B. Zhou, À. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
Appendix A Appendix
a.1 Ablation Study
In this section, additional results of ablation study on the proposed content-style fusion mode and the Feature Exchange blocks are presented.
Content-style Fusion Mode.
The proposed fusion mode combines content and style features in a compression-then-expansion manner. Compared to the two existing modes, AdaIN and concatenation, our proposed mode can discard unrelated information in the content features based on the target style. As visualized in Figure 12, the proposed mode is more successful in adapting the color distribution to the style image than the other two modes.
Number of Feature Exchange Blocks.
To further evaluate the impact of the Feature Exchange blocks on common feature learning, we train a series of models in which the number of blocks varies from 0 to 3. In addition, a model that iterates over one shared block three times is compared. As shown in Figure 13, more blocks reduce unexpected artifacts and boost performance, while the model iterating over a single block cannot achieve the same effect.
a.2 Comparison with Existing Methods
Figure 14 presents additional comparisons with several state-of-the-art methods. As we can see, our proposed framework generates more structured and better stylized results. Moreover, our model is more successful in removing unrelated information from the content features, and better correspondences between the style and content images can be seen in our results.
a.3 More Stylization Results
Full style-swap of our framework.
Figure 17 lists the full results of our framework. As described in the paper, we obtain four different types of generated images, among which the stylization of the content image is the goal of our method. Note that the reconstruction of the input images mainly serves to stabilize training. Although unrelated information in the content features is discarded during fusion, our model is still able to reasonably reconstruct the input images.
a.4 High-resolution stylization
In this section, we demonstrate the ability of our proposed model to transfer styles to high-resolution images. Figure 22 shows a comparison between Avatar-Net and our framework on a high-resolution content image. One can see that our synthesized image exhibits rich details, such as the color transitions within the mountains, and that the semantic structures of the various objects are preserved very well. In contrast, the result of Avatar-Net is noisier and less structured.
a.5 Video Stylization
A supplementary video covering various contents and styles is attached. At the beginning of the video, we compare results generated by our model with those produced by the baseline method WCT; our framework generates much more stable stylized videos. The remaining part shows several impressive stylization results produced by our method. Please refer to the YouTube link: https://www.youtube.com/watch?v=Vo-S1RiQBUg.