EFANet: Exchangeable Feature Alignment Network for Arbitrary Style Transfer
Style transfer has been an important topic in both computer vision and graphics. Since the seminal work of Gatys et al. first demonstrated the power of stylization through optimization in the deep feature space, quite a few approaches have achieved real-time arbitrary style transfer with straightforward statistic matching techniques. In this work, our key observation is that considering only the features of the input style image for global deep feature statistic matching or local patch swap may not always ensure a satisfactory style transfer; see, e.g., Figure 1. Instead, we propose a novel transfer framework, EFANet, that jointly analyzes and better aligns exchangeable features extracted from the content and style image pair. In this way, the style features from the style image seek the best compatibility with the content information in the content image, leading to more structured stylization results. In addition, a new whitening loss is developed to purify the computed content features and enable better fusion with styles in feature space. Qualitative and quantitative experiments demonstrate the advantages of our approach.
A style transfer method takes a pair of images as input and synthesizes an output image that preserves the content of the first image while mimicking the style of the second. The study of this topic has drawn much attention in recent years due to its scientific and artistic value. Recently, the seminal work of Gatys et al. found that multi-level feature statistics extracted from a pre-trained CNN model can be used to separate content and style information, making it possible to combine the content and style of arbitrary images. This method, however, depends on a slow iterative optimization, which limits its range of application.
Since then, many attempts have been made to accelerate the above approach by replacing the optimization process with feed-forward neural networks [6, 12, 16, 27, 30]. While these methods effectively speed up the stylization process, they are generally constrained to a predefined set of styles and cannot adapt to an arbitrary style specified by a single exemplar image.
Notable efforts [4, 9, 17, 23, 24] have been devoted to solving this flexibility vs. speed dilemma. A successful direction is to apply a statistical transformation that aligns the feature statistics of the input content image to those of the style image [9, 17, 24]. However, as shown in Figure 1, style images can differ dramatically from each other and from the content image, both in semantic structure and in style features. Performing style transfer by statistically matching different content images to the same set of features extracted from the style image often introduces unexpected or distorted patterns [9, 17]. Several methods [24, 29, 21] overcome these disadvantages through patch swap with multi-scale feature fusion, but may spatially distort semantic structures when the local patterns of the input images differ greatly.
To address the aforementioned problems, in this paper we jointly consider both the content and style images and extract common style features, which are customized for this pair of images only. By maximizing the common features, our goal is to align the style features of the content and style images as much as possible. This follows the intuition that when the target style features are compatible with the content image, a good transfer result can be obtained. Since the style features of the content image are computed from its own content information, the two are naturally compatible. Hence, aligning the style features of the two images helps to improve the final stylization; see the comparison of our method with and without common features in Figure 1.
Intuitively, the common style features we extract bridge the gap between the input content and style images, allowing our method to outperform existing methods in many challenging scenarios. We call the aligned style features exchangeable style features. Experiments demonstrate that performing style transfer based on our exchangeable style features yields more structured results with better visual style patterns than existing approaches; see, e.g., Figures 1 and 5.
To compute exchangeable style features from the feature statistics of the two input images, a novel Feature Exchange Block is designed, inspired by works on private-shared component analysis [2, 3]. In addition, we propose a new whitening loss that facilitates the combination of content and style features by removing style patterns that exist in content images. To summarize, the contributions of our work include:
The importance of aligning style features for style transfer between two images is clearly demonstrated.
A novel Feature Exchange Block, together with a constraint loss function, is designed for the pair-wise analysis that learns the common information between style features.
A simple yet effective whitening loss is developed to encourage the fusion of content and style information by filtering out style patterns in content images.
The overall end-to-end style transfer framework can perform arbitrary style transfer in real-time and synthesize high-quality results with favored styles.
Fast Arbitrary Style Transfer
Intuitively, style transfer aims at changing the style of an image while preserving its content. Recently, impressive style transfer was realized by Gatys et al. \citeyeargatys2016image based on deep neural networks. Since then, many methods have been proposed to train a single model that can transfer arbitrary styles. Here we only review related work on arbitrary style transfer and refer the readers to the review of neural style transfer for a comprehensive survey.
Chen et al. \citeyearChen2016FastPS realize the first fast neural method by matching and swapping local patches between the intermediate features of the content and style images, hence the name Style-Swap. Huang et al. \citeyearHuang2017ArbitraryST then propose adaptive instance normalization (AdaIN) to explicitly match the mean and variance of each feature channel of the content image to those of the style image. Li et al. \citeyearLi2017UniversalST further apply the whitening and coloring transform (WCT) to align the correlations between the extracted deep features. Sheng et al. \citeyearSheng2018AvatarNetMZ develop Avatar-Net to combine local and holistic style pattern transformation, achieving better stylization regardless of the domain gap. Very recently, AAMS (Yao et al. \citeyearyao2019attention) transfers multi-stroke patterns by introducing a self-attention mechanism. Meanwhile, SANet improves upon Avatar-Net by learning a similarity matrix and flexibly matching the semantically nearest style features onto the content features, and Li et al. \citeyearLi_2019_CVPR speed up WCT with a linear propagation module. To boost generalization ability, ETNet evaluates errors in the synthesized results and corrects them iteratively. The above methods, however, all achieve stylization by straightforward statistic matching or local patch matching and ignore the gap between input features, which may prevent them from adapting to the unlimited variety of input pairs.
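The channel-wise statistic matching at the heart of AdaIN can be sketched in a few lines. This is an illustrative NumPy sketch, not the reference implementation; the (C, H, W) layout and the eps constant are our own assumptions:

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization: re-scale each channel of the
    content feature so that its mean and standard deviation match
    those of the style feature. Both features are (C, H, W) arrays."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    # Normalize content statistics, then impose the style statistics.
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

Note how the transform uses only per-channel statistics of the style image, which is exactly the limitation discussed above: the same style statistics are imposed regardless of the content image.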
In this paper, we still follow the holistic alignment with respect to feature correlations. The key difference is that before applying the style features, we jointly analyze the similarities between the style features of the content and style images. These style features can thus be aligned accordingly, which allows them to match the content images more flexibly and significantly improves the compatibility between the target content and style features.
Feature Disentanglement
Learning a disentangled representation aims at separating the learned internal representation into the factors of data variation. It improves the re-usability and interpretability of a model, which is very useful for, e.g., domain adaptation [2, 3]. Recently, several concurrent works [14, 10, 8, 18] have been proposed for multi-modal image-to-image translation. They map the input images into one common feature space for content representation and two unique feature spaces for styles. Yi et al. \citeyearyi2018branched design BranchGAN to achieve scale disentanglement in image generation. Wu et al. \citeyearSAGnet19 advance 3D shape generation by disentangling geometry and structure information. For style transfer, some efforts [32, 31] have also been made to separate the representation of an image into content and style. Different from the mentioned methods, we perform feature disentanglement only on the style features of the input image pair. A common component is thus extracted, which is then used to compute exchangeable style features for style transfer.
Developed Framework
Following prior work, we consider the deep features extracted by a network pre-trained on a large dataset as the content representation of an image, and the feature correlations at a given layer as the style information. By fusing the content feature with a new target style feature, we can generate a stylized image.
The overall goal of our framework is to better align the style features of the style and content images, such that the style features from one image can better match the content of the other, adaptively yielding a better stylization. To achieve this, a key module, the Feature Exchange block, is proposed to jointly analyze the style features of the two input images. A common feature is disentangled to encode the shared components between the style features, indicating the similarity information among them. Then, with the common features as guides, we can make the target style features more similar to the input contents and facilitate the alignment between them.
Exchangeable Feature for Style Transfer
As illustrated in Figure 3(a), our framework mainly consists of three parts: one encoder, several EFANet modules, and one decoder for generating the final images. We denote the feature maps output by the pre-trained VGG encoder for the content and style images as the content and style features, respectively. We equip a multi-scale style adaption strategy to advance the stylization performance. Specifically, in the bottleneck of the conventional encoder-decoder architecture, starting from these encoded features, different EFANet modules are applied to progressively fuse the styles from the input images into the corresponding decoded features in a coarse-to-fine manner, with each decoded stylized feature upsampled before being passed to the next, finer scale. Note that the decoded feature is initially set to the encoded content feature; in the following paragraphs, the superscript is used to index the style vectors of a Gram matrix.
In Figure 3(b), given the decoded feature and the style feature as inputs, we first compute two Gram matrices across the feature channels as the raw style representations. To preserve more style details in the output results and reduce the computation burden, we process only a part of the style information at a time and represent each Gram matrix as a list of style vectors. Each style vector compactly encodes the mutual relationships between one channel and the whole feature map. Each pair of corresponding style vectors is then processed by one Feature Exchange block, from which a common feature and two unique feature vectors, one for the decoded information (as content) and one for the style, are disentangled.
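The raw style representation described above can be sketched as follows. This is an illustrative NumPy sketch with our own function name and (C, H, W) layout; the normalization by the number of spatial positions is an assumption:

```python
import numpy as np

def gram_style_vectors(feat):
    """Compute the Gram matrix of a (C, H, W) feature map and view it
    as a list of C style vectors, where the i-th vector encodes the
    correlation of channel i with every channel of the feature map."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    gram = flat @ flat.T / (h * w)        # (C, C) raw style representation
    return [gram[i] for i in range(c)]    # per-channel style vectors
```

Processing one style vector (one row of the Gram matrix) at a time is what lets the framework handle a part of the style information per Feature Exchange block.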
Guided by the common feature, the style features are aligned in the following manner: we first concatenate the common feature with each of the raw style vectors. These are then sent into fully connected layers individually, yielding the aligned style vectors. We call them exchangeable style features since each of them can easily be used to adapt its style to the target image. Finally, we stack the aligned style vectors into two matrices for later fusion.
Inspired by the whitening operation of WCT, we assume that better stylization results can be achieved when the target content features are uncorrelated before content-style fusion. The whitening operation can be regarded as a function in which the content feature is filtered by its corresponding style information. Thus, after the feature alignment, to facilitate transferring a new style to an image, we use the exchangeable style features to purify the image's own content feature through a fusion,
where the fusion operation is implemented with a learnable matrix [32, 30]. Moreover, we develop a whitening loss to further encourage the removal of correlations between different channels; see Figure 2 for a validating example. The details of the whitening loss are discussed in the Loss Function section below.
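For reference, the classical whitening operation of WCT that inspires this purification step can be sketched as follows. Our model replaces this closed form with a learnable fusion; the sketch below is only the reference operation, with the eps floor being our own numerical safeguard:

```python
import numpy as np

def whiten(feat, eps=1e-8):
    """Classical whitening as in WCT: decorrelate the channels of a
    (C, H*W) feature map so that its covariance matrix becomes the
    identity, via an eigendecomposition of the covariance."""
    centered = feat - feat.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / (feat.shape[1] - 1)
    d, e = np.linalg.eigh(cov)                 # eigenvalues / eigenvectors
    d = np.maximum(d, eps)                     # guard against tiny eigenvalues
    w = e @ np.diag(d ** -0.5) @ e.T           # whitening matrix
    return w @ centered
```

After this operation the channel covariance is (up to numerics) the identity matrix, which is exactly the property the whitening loss below encourages in a learned manner.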
Finally, we exchange the aligned style vectors and fuse them with the purified content features.
The fused features are then propagated to receive style information at finer scales or decoded to output the stylized images. The decoder is trained to learn the inversion from the fused feature map to image space, and hereby style transfer is eventually achieved for both input images. Note that each resulting output is the stylized image that transfers the style of one input image onto the content of the other.
Feature Exchange Block
According to Bousmalis et al. \shortciteBousmalis2016DomainSN, explicitly modeling the unique information helps improve the extraction of the shared component. To adapt this idea for our exchangeable style features, a Feature Exchange block is proposed to jointly analyze the style features of both input images and model their inter-relationships, based on which we explicitly update the common feature and two unique features for the disentanglement. Figure 4 illustrates the detailed architecture, where the unique features are first initialized with the two input style vectors and the common feature with their combination. They are then updated by learned residual features. Residual learning is used to facilitate gradient propagation during training and to convey messages so that each input feature can be updated directly. This property allows us to chain any number of Feature Exchange blocks in a model without breaking its initial behavior.
As shown in Figure 4, there are two shared fully-connected layers inside each block. To be specific, the disentangled features are updated as:
where the two fully connected layers output the residuals for the common feature and the unique features, respectively, and concatenation is used to combine their inputs. The other unique feature is updated in a similar way.
By doing so, the Feature Exchange block enables the common feature and each unique feature to interact with each other by modelling their dependencies, so that they are refined toward the optimum.
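One residual update of the disentangled features, as described above, can be sketched as follows. The exact concatenation order, layer widths, and activation are not specified in the text, so they are assumptions of this sketch; the weights would be learned in the real model and are plain arrays here:

```python
import numpy as np

def fc(x, w, b):
    """Stand-in fully connected layer with ReLU; w and b are
    placeholder (would-be learned) parameters."""
    return np.maximum(w @ x + b, 0.0)

def feature_exchange_step(common, u1, u2, params):
    """One residual update of the disentangled features: the shared
    layers predict residuals for the common feature (from all three
    features) and for each unique feature (from itself and the common
    feature), so inputs are refined rather than replaced."""
    wc, bc, wu, bu = params
    common = common + fc(np.concatenate([common, u1, u2]), wc, bc)
    u1 = u1 + fc(np.concatenate([u1, common]), wu, bu)
    u2 = u2 + fc(np.concatenate([u2, common]), wu, bu)   # shared weights
    return common, u1, u2
```

Because each output is the input plus a residual, chaining several such steps preserves the initial behavior of the block, as noted in the text.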
On the other hand, to make sure the Feature Exchange block conducts a proper disentanglement, a constraint on the disentangled features is added following Bousmalis et al. \shortciteBousmalis2016DomainSN. First, the common feature should be as orthogonal as possible to both unique features. Meanwhile, the disentangled features should allow the original style vectors to be reconstructed. Therefore, a feature exchange loss can be defined as:
where each reconstructed style vector is obtained by feeding the sum of the common feature and the corresponding unique feature into a fully connected layer. Note that this fully connected layer for reconstruction is only used in the training stage, and the loss is only computed on the final output of the Feature Exchange block. We use only one Feature Exchange block in each EFANet module.
Finally, to maximize the common information, we also penalize the magnitude of the unique features. Thus the final loss function for the common feature extraction is:
where the unique features are measured with a vector norm, and the weight of this penalty is set to 0.0001 in all our experiments.
Loss Function for Training
As illustrated in Figure 3, three different types of losses are computed for each input image pair. The first is the perceptual loss, which is used to evaluate the stylized results. Following previous work [9, 24], we employ a VGG model pre-trained on ImageNet to compute the perceptual content loss:
and style loss:
where the VGG-based encoder extracts the features and a Gram matrix represents the style of the features extracted at each scale of the encoder module. As mentioned before, the losses are computed over multiple scales.
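The two perceptual terms can be sketched as follows. The feature lists stand in for VGG activations at several scales, and applying the content term at every scale is an assumption of this sketch (the paper may use a single layer for content):

```python
import numpy as np

def gram(feat):
    """Gram matrix of a (C, H, W) feature, normalized by spatial size."""
    c = feat.shape[0]
    flat = feat.reshape(c, -1)
    return flat @ flat.T / flat.shape[1]

def perceptual_losses(out_feats, content_feats, style_feats):
    """Content loss: distance between encoder features of the result
    and of the content image. Style loss: distance between their Gram
    matrices, summed over scales."""
    l_content = sum(np.mean((o - c) ** 2)
                    for o, c in zip(out_feats, content_feats))
    l_style = sum(np.mean((gram(o) - gram(s)) ** 2)
                  for o, s in zip(out_feats, style_feats))
    return l_content, l_style
```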
The second is the whitening loss, which is used to remove style information from the target content images during training. According to Li et al. \shortciteLi2017UniversalST, after the whitening operation the covariance of the content feature should equal the identity matrix. Thus we define the whitening loss as:
where the identity matrix serves as the target. By doing so, we encourage the feature map to be as uncorrelated as possible.
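This loss can be sketched as the distance between the normalized Gram matrix of the purified content feature and the identity; the squared Frobenius norm used below is an assumption of this sketch:

```python
import numpy as np

def whitening_loss(feat):
    """Penalize correlation between channels of a (C, H*W) feature:
    after whitening, its normalized Gram matrix should equal the
    identity, so we measure the squared Frobenius distance to it."""
    c = feat.shape[0]
    g = feat @ feat.T / feat.shape[1]
    return np.sum((g - np.eye(c)) ** 2)
```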
The third is the common feature loss defined previously for a better feature disentanglement.
Note that, for both the style and whitening losses, we sum the losses over all scales, with the superscript indicating a loss computed at a given scale. To summarize, the full objective function of our proposed network is:
where the four weighting parameters are set to 1, 7, 0.1, and 5, respectively, throughout the experiments.
Implementation Details
We implement our model in TensorFlow. In general, our framework consists of an encoder, several EFANet modules, and a decoder. Similar to prior work [9, 24], we use the VGG-19 model (up to relu4_1) pre-trained on ImageNet to initialize the fixed encoder. For the decoder, after the fusion of style and content features, two residual blocks are used, followed by upsampling operations. A nearest-neighbor upscaling plus convolution strategy is used to reduce artifacts in the upsampling stage. We use the Adam optimizer with a batch size of 4 and a learning rate of 0.0001, with default decay rates, for 150,000 iterations.
The Places365 database and the WikiArt dataset are used for content and style images, respectively, following prior work. During training, we resize the smaller dimension of each image to 512 pixels while keeping the original aspect ratio. We then train our model on randomly sampled patches. Note that in the testing stage, both the content and style images can be of any size.
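The resize rule used during training can be expressed compactly; the function name is our own, and the crop size itself is not specified in this excerpt:

```python
def training_size(h, w, short_side=512):
    """Resize rule: scale an (h, w) image so that its smaller
    dimension becomes short_side pixels while keeping the aspect
    ratio; random patches are then cropped from the resized image."""
    scale = short_side / min(h, w)
    return round(h * scale), round(w * scale)
```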
Table 1: Quantitative comparison of AdaIN, WCT, Avatar-Net, AAMS, SANet, Li et al., our model without common features (Ours w/o CF), and our full model, in terms of perceptual loss, user preference, and running time.
Comparison with Existing Methods
We compare our approach with six state-of-the-art methods for arbitrary style transfer: AdaIN, WCT, Avatar-Net, AAMS, SANet, and Li et al. For the compared methods, publicly available code with default configurations is used for a fair comparison.
Results of qualitative comparisons are shown in Figure 5. Among the holistic statistic matching pipelines, AdaIN achieves arbitrary style transfer in real-time. However, it does not respect semantic information and sometimes generates less stylized results whose color distribution differs from the style image (see rows 1 & 3). WCT improves the stylization considerably but often introduces distorted patterns; as shown in rows 3 & 4, it sometimes produces messy and less-structured images. Li et al. \citeyearLi_2019_CVPR propose a linear propagation module and achieve the fastest transfer among all the compared methods, but their method often suffers from under-stylization and cannot adapt compatible style patterns or color variations to the results (rows 1 & 3).
Avatar-Net improves over the holistic matching methods by adapting more style details to the results with a feature decorating module, but it also blurs semantic structures (row 3) and sometimes distorts salient style patterns (see rows 1 & 5). While AAMS stylizes images with multi-stroke style patterns, similar to Avatar-Net, it still suffers from structure distortion (row 3) and introduces unseen dot-wise artifacts (rows 2 & 5). It also fails to capture the patterns presented in the style image (row 5). To match the semantically nearest style features onto the content features, SANet shares a similar spirit with Avatar-Net but employs a style attention module in a more flexible way. Still, it may blur content structures (row 3) and directly copy some semantic patterns from the content images into the stylization results (e.g., the eyes in rows 1, 2 & 3). Due to the local patch matching, SANet also distorts the presented style patterns and fails to preserve texture consistency (row 5).
In contrast, our approach achieves more favorable performance. The alignment of style features allows our model to better match the regions in content images with patterns in style images. The target style textures can be adaptively transferred to the content images, manifesting superior texture detail (last row) and richer color variation (2nd row). Compared to most methods, our approach also generates more structured results while style patterns, such as brush strokes, are well preserved (3rd row).
Assessing style transfer results can be subjective. We thus conduct two quantitative comparisons, reported in the first two rows of Table 1. We first compare the different methods in terms of perceptual loss. This evaluation metric contains both content and style terms, as used in previous approaches. It is worth noting that our approach does not minimize the perceptual loss directly, since it is only one of the three types of losses we use. Nevertheless, our model achieves the lowest perceptual loss among all feed-forward models, with the style loss being the lowest and the content loss only slightly higher than AdaIN's. This indicates that our approach favors fully stylized results over results with high content fidelity.
We then conduct a user study to evaluate the visual preference for the six methods. 30 content images and 30 style images are randomly selected from the test set, and 900 stylization results are generated for each method. Results of the same stylization are randomly chosen for a participant, who is asked to vote for the method that achieves the best stylization. Each participant does 20 rounds of comparison, with the stylized results from the different methods exhibited in random order. In total, we collect 600 votes from 30 subjects. The average preference scores of the different methods are reported in Column 4 of Table 1, showing that our method obtains the highest score.
Table 1 also lists the running time of our approach and the various state-of-the-art baselines. All results are obtained on a 12GB Titan V GPU and averaged over 100 test images. Generally speaking, existing patch-based network approaches are known to be slower than the holistic matching methods. Among all the approaches, Li et al. achieve the fastest stylization with their linear propagation module. Our full model, equipped with the multi-scale strategy, slightly increases the computation burden but is still comparable to AdaIN, thus achieving style transfer in real-time.
Here we respectively evaluate the impact of the common feature learning, the proposed whitening loss on content features, and the multi-scale usage of our framework.
Common feature disentanglement during the joint analysis plays a key role in our approach. Its importance can be evaluated by removing the Feature Exchange block and disabling the feature exchange loss, which prevents the network from learning exchangeable features. As shown in Figure 1, for the ablated model without common features, the color distribution and texture patterns in the result image no longer mimic the target style image. Visually, our full model yields a much more favorable result. We also compare the perceptual losses over 100 test images for both the baseline model (i.e., our model without common features) and our full model. As reported in Table 1, the style loss of our full model improves significantly over the baseline, demonstrating the effectiveness of the common features.
To verify the effect of the whitening operation applied to content features, we remove the learnable matrices at all scales and observe how the performance changes. As shown in Figure 9, without the purification operation and whitening loss, the baseline model blurs the overall contours with yellow blobs. In contrast, our full model better matches the target style to the content image and preserves the spatial structures and style pattern consistency, yielding more visually pleasing results. This shows that the proposed operation makes the content features more compatible with the target styles.
The multi-scale strategy is evaluated by replacing the full model with an alternative model that fuses content and style at a single layer while keeping the other parts fixed. The comparison shown in Figure 8 demonstrates that the multi-scale strategy is more successful at capturing salient style patterns, leading to better stylization results.
We demonstrate the flexibility of our model with two applications. Both tasks are completed with the same trained model, without any further fine-tuning.
Being able to adjust the degree of stylization is a useful feature. In our model, this can be achieved by blending the stylized feature map with the VGG-based content feature before feeding the result to the decoder, using a blending weight between 0 and 1.
By definition, the network outputs the reconstructed image when the weight is 0, the fully stylized image when it is 1, and a smooth transition between the two as the weight is gradually changed from 0 to 1; see Figure 6.
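The blending described above is a simple linear interpolation in feature space; the function and parameter names in this sketch are our own:

```python
import numpy as np

def blend_stylization(content_feat, stylized_feat, alpha):
    """Content-style trade-off: interpolate between the VGG-based
    content feature (alpha = 0, reconstruction) and the fully
    stylized feature (alpha = 1) before decoding."""
    return (1.0 - alpha) * content_feat + alpha * stylized_feat
```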
In Figure 7, we present our model's ability to apply different styles to different image regions. Masks are used to specify the correspondences between different content image regions and the desired styles. The pair-wise exchangeable feature extraction considers only the masked region when applying a given style, helping to achieve the optimal stylization effect for each region.
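The final composition of the per-region results can be sketched as a masked blend. This sketch composes two already-stylized outputs; broadcasting the mask over channels and the use of exactly two regions are assumptions for illustration:

```python
import numpy as np

def masked_style_transfer(stylized_a, stylized_b, mask):
    """Spatial control sketch: compose two stylized outputs with a
    binary mask so that each region receives its own style. The mask
    is 1 where style A applies and 0 where style B applies."""
    return mask * stylized_a + (1.0 - mask) * stylized_b
```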
In this paper, we have presented a novel framework, EFANet, for transferring an arbitrary style to a content image. By analyzing the common style features from both inputs as a guide for alignment, exchangeable style features are extracted. Better stylization can be achieved for the content image by fusing its purified content feature with the aligned style feature from the style image. Experiments show that our method significantly improves the stylization performance over prior state-of-the-art methods.
-  (2016) TensorFlow: a system for large-scale machine learning. In OSDI, Vol. 16, pp. 265–283. Cited by: Implementation Details.
-  (2016) Domain separation networks. In NIPS, Cited by: Introduction, Feature Disentanglement.
-  (2018) DiDA: disentangled synthesis for domain adaptation. arXiv preprint arXiv:1805.08019. Cited by: Introduction, Feature Disentanglement.
-  (2016) Fast patch-based style transfer of arbitrary style. CoRR abs/1612.04337. Cited by: Introduction.
-  (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: Loss Function for Training, Implementation Details.
-  (2016) A learned representation for artistic style. CoRR abs/1610.07629. Cited by: Introduction.
-  (2016) Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414–2423. Cited by: Introduction, Developed Framework.
-  (2018) Image-to-image translation for cross-domain disentanglement. In NIPS, Cited by: Feature Disentanglement.
-  (2017) Arbitrary style transfer in real-time with adaptive instance normalization. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. Cited by: Introduction, Loss Function for Training, Implementation Details, Comparison with Existing Methods, Comparison with Existing Methods, Comparison with Existing Methods.
-  (2018) Multimodal unsupervised image-to-image translation. CoRR abs/1804.04732. Cited by: Feature Disentanglement.
-  (2017) Neural style transfer: a review. arXiv preprint arXiv:1705.04058. Cited by: Fast Abitrary Style Transfer.
-  (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711. Cited by: Introduction, Loss Function for Training.
-  (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: Implementation Details.
-  (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51. Cited by: Feature Disentanglement.
-  (2019-06) Learning linear transformations for fast image and video style transfer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Comparison with Existing Methods.
-  (2017) Diversified texture synthesis with feed-forward networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 266–274. Cited by: Introduction.
-  (2017) Universal style transfer via feature transforms. In NIPS, Cited by: Introduction, Exchangeable Feature for Style Transfer, Comparison with Existing Methods, Comparison with Existing Methods.
-  (2018) Exemplar guided unsupervised image-to-image translation with semantic consistency. arXiv preprint arXiv:1805.11145. Cited by: Feature Disentanglement.
-  (2016) Painter by numbers, wikiart.. https://www.kaggle.com/c/painter-by-numbers. External Links: Cited by: Implementation Details.
-  (2016) Deconvolution and checkerboard artifacts. Distill. External Links: Cited by: Implementation Details.
-  (2019-06) Arbitrary style transfer with style-attentional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Introduction, Fast Abitrary Style Transfer, Comparison with Existing Methods, Comparison with Existing Methods.
-  (2018) A style-aware content loss for real-time hd style transfer. CoRR abs/1807.10201. Cited by: Implementation Details.
-  (2017) Meta networks for neural style transfer. CoRR abs/1709.04111. Cited by: Introduction.
-  (2018) Avatar-net: multi-scale zero-shot style transfer by feature decoration. pp. 8242–8250. Cited by: Introduction, Loss Function for Training, Implementation Details, Comparison with Existing Methods, Comparison with Existing Methods.
-  (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: Loss Function for Training, Implementation Details.
-  (2019) ETNet: error transition network for arbitrary style transfer. External Links: Cited by: Fast Abitrary Style Transfer.
-  (2016) Texture networks: feed-forward synthesis of textures and stylized images. In ICML, Cited by: Introduction.
-  (2016) Disentangled representations in neural models. arXiv preprint arXiv:1602.02383. Cited by: Feature Disentanglement.
-  (2019) Attention-aware multi-stroke style transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Introduction, Comparison with Existing Methods, Comparison with Existing Methods.
-  (2017) Multi-style generative network for real-time transfer. CoRR abs/1703.06953. Cited by: Introduction, Exchangeable Feature for Style Transfer.
-  (2018) Style separation and synthesis via generative adversarial networks. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 183–191. Cited by: Feature Disentanglement.
-  (2018) Separating style and content for generalized style transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8447–8455. Cited by: Feature Disentanglement, Exchangeable Feature for Style Transfer.
-  (2014) Learning deep features for scene recognition using places database. In NIPS, Cited by: Implementation Details.