EFANet: Exchangeable Feature Alignment Network for Arbitrary Style Transfer

Zhijie Wu1, Chunjin Song1, Yang Zhou1, Minglun Gong2, Hui Huang1
1 Shenzhen University,   2 University of Guelph
{wzj.micker, songchunjin1990, zhouyangvcc}@gmail.com,
minglun@uoguelph.ca, hhzhiyan@gmail.com
Equal contribution; order determined by coin toss. Corresponding authors.
Abstract

Style transfer has been an important topic in both computer vision and graphics. Since the seminal work of Gatys et al. first demonstrated the power of stylization through optimization in the deep feature space, quite a few approaches have achieved real-time arbitrary style transfer with straightforward statistic matching techniques. In this work, our key observation is that considering only the features of the input style image for global deep feature statistic matching or local patch swapping may not always ensure a satisfactory style transfer; see e.g., Figure 1. Instead, we propose a novel transfer framework, EFANet, that jointly analyzes and better aligns exchangeable features extracted from the content and style image pair. In this way, the style features from the style image seek the best compatibility with the content information in the content image, leading to more structured stylization results. In addition, a new whitening loss is developed for purifying the computed content features and better fusing them with styles in feature space. Qualitative and quantitative experiments demonstrate the advantages of our approach.

Introduction

A style transfer method takes a pair of images as input and synthesizes an output image that preserves the content of the first image while mimicking the style of the second. The study of this topic has drawn much attention in recent years due to its scientific and artistic value. The seminal work [7] found that multi-level feature statistics extracted from a pre-trained CNN model can be used to separate content and style information, making it possible to combine the content and style of arbitrary images. This method, however, depends on a slow iterative optimization, which limits its range of application.

Since then, many attempts have been made to accelerate the above approach by replacing the optimization process with a feed-forward neural network [6, 12, 16, 27, 30]. While these methods effectively speed up the stylization process, they are generally constrained to a predefined set of styles and cannot adapt to an arbitrary style specified by a single exemplar image.

Figure 1: The existing method (AdaIN) ignores differences among style images, while our approach jointly analyzes each content-style image pair and computes exchangeable style features. As a result, AdaIN and the baseline model without common features (4th column) only work well with a simple style (1st and 2nd rows). When the target styles become more complex and the content and style images have different patterns/color distributions, AdaIN and the baseline model fail to capture the salient style patterns and suffer from insufficiently stylized results (color distribution and textures in the 3rd & 4th rows). In comparison, our model better adapts to pattern/color variation in the content image and maps compatible patterns/colors from the style images accordingly.

Notable efforts [4, 9, 17, 23, 24] have been devoted to solving this flexibility-versus-speed dilemma. A successful direction is to apply a statistical transformation that aligns the feature statistics of the input content image to those of the style image [9, 17, 24]. However, as shown in Figure 1, style images can differ dramatically from each other and from the content image, both in terms of semantic structures and style features. Performing style transfer by statistically matching different content images to the same set of features extracted from the style image often introduces unexpected or distorted patterns [9, 17]. Several methods [24, 29, 21] overcome these drawbacks through patch swapping with multi-scale feature fusion, but they may spatially distort semantic structures when the local patterns of the input images differ greatly.

To address the aforementioned problems, in this paper we jointly consider both the content and style images and extract common style features, which are customized for this image pair only. By maximizing the common features, our goal is to align the style features of the content and style images as much as possible. This follows the intuition that when the target style features are compatible with the content image, we obtain a good transfer result. Since the style features of the content image are computed from its own content information, the two are naturally compatible. Hence, aligning the style features of the two images helps to improve the final stylization; see the comparison of our method with and without common features in Figure 1.

Intuitively, the extracted common style features bridge the gap between the input content and style images, allowing our method to outperform existing methods in many challenging scenarios. We call the aligned style features exchangeable style features. Experiments demonstrate that performing style transfer based on our exchangeable style features yields more structured results with better visual style patterns than existing approaches; see e.g., Figures 1 and 5.

To compute exchangeable style features from the feature statistics of the two input images, a novel Feature Exchange Block is designed, inspired by works on private-shared component analysis [2, 3]. In addition, we propose a new whitening loss to facilitate the combination of content and style features by removing the style patterns present in content images. To summarize, the contributions of our work include:

  • The importance of aligning style features for style transfer between two images is clearly demonstrated.

  • A novel Feature Exchange Block, together with a constraint loss function, is designed for the pair-wise analysis that learns the common information between style features.

  • A simple yet effective whitening loss is developed to encourage the fusion between content and style information by filtering style patterns in content images.

  • The overall end-to-end style transfer framework performs arbitrary style transfer in real-time and synthesizes high-quality results with the desired styles.

Related Work

Fast Arbitrary Style Transfer

Intuitively, style transfer aims at changing the style of an image while preserving its content. Impressive style transfer was realized by Gatys et al. \citeyear{gatys2016image} based on deep neural networks. Since then, many methods have been proposed to train a single model that can transfer arbitrary styles. Here we only review related works on arbitrary style transfer and refer readers to [11] for a comprehensive survey.

Chen et al. \citeyear{Chen2016FastPS} realize the first fast neural method by matching and swapping local patches between the intermediate features of the content and style images, hence called Style-Swap. Huang et al. \citeyear{Huang2017ArbitraryST} then propose adaptive instance normalization (AdaIN) to explicitly match the mean and variance of each feature channel of the content image to those of the style image. Li et al. \citeyear{Li2017UniversalST} further apply the whitening and coloring transform (WCT) to align the correlations between the extracted deep features. Sheng et al. \citeyear{Sheng2018AvatarNetMZ} develop Avatar-Net to combine local and holistic style pattern transformation, achieving better stylization regardless of the domain gap. More recently, AAMS (Yao et al. \citeyear{yao2019attention}) transfers multi-stroke patterns by introducing a self-attention mechanism. Meanwhile, SANet [21] improves on Avatar-Net by learning a similarity matrix and flexibly matching the semantically nearest style features onto the content features, and Li et al. \citeyear{Li_2019_CVPR} speed up WCT with a linear propagation module. To boost generalization ability, ETNet [26] evaluates errors in the synthesized results and corrects them iteratively. The above methods, however, all achieve stylization by straightforward statistic matching or local patch matching and ignore the gaps between input features, which may prevent them from adapting to the unlimited variety of styles.

In this paper, we still follow the holistic alignment with respect to feature correlations. The key difference is that, before applying style features, we jointly analyze the similarities between the style features of the content and style images. These style features can thus be aligned accordingly, which enables them to match the content images more flexibly and significantly improves the compatibility between the target content and style features.

Feature Disentanglement

Learning a disentangled representation aims at separating the learned internal representation into factors of data variation [28]. It improves the re-usability and interpretability of a model, which is very useful for, e.g., domain adaptation [2, 3]. Recently, several concurrent works [14, 10, 8, 18] have been proposed for multi-modal image-to-image translation. They map the input images into one common feature space for content representation and two unique feature spaces for styles. Yi et al. \citeyear{yi2018branched} design BranchGAN to achieve scale disentanglement in image generation, and Wu et al. \citeyear{SAGnet19} advance 3D shape generation by disentangling geometry and structure information. For style transfer, some efforts [32, 31] have also been made to separate the representation of an image into content and style. Different from the mentioned methods, we perform feature disentanglement only on the style features of the input image pair. A common component is thus extracted, which is then used to compute exchangeable style features for style transfer.

Developed Framework

Figure 2: Images decoded from whitened features. The results on the right are rescaled for better visualization. The whitened features still retain spatial structures, but the various style patterns are removed.

Following [7], we consider the deep features extracted by a network pre-trained on a large dataset as the content representation of an image, and the feature correlations at a given layer as its style information. By fusing the content feature with a new target style feature, we can generate a stylized image.

The overall goal of our framework is to better align the style features of the style and content images, such that the style features from one image can better match the content of the other, resulting in better, adaptive stylization. To achieve this, a key module, the Feature Exchange block, is proposed to jointly analyze the style features of the two input images. A common feature is disentangled to encode the shared components between the style features, indicating the similarity information between them. Then, with the common features as guidance, we can make the target style features more similar to the input content and facilitate the alignment between them.

Figure 3: (a) Architecture overview. The input content and style images go through the pre-trained VGG encoder to extract their feature maps. Then, starting from these feature maps, different EFANet modules are applied to progressively fuse styles into the corresponding decoded features for the final stylized images. (b) Architecture of an EFANet module. Given two feature maps as inputs, we compute two Gram matrices as the raw styles and then represent them as two lists of style vectors. Each corresponding style vector pair is fed into the newly proposed Feature Exchange Block, and a common feature vector is extracted via joint analysis. We concatenate the common feature with each raw style vector to learn two exchangeable style features. Each exchangeable style feature is used to purify the content feature of its own image, which is then fused with the exchangeable style feature of the other image. The fused feature is either propagated to receive finer-scale information or decoded into a stylized image.
Figure 4: Architecture of a Feature Exchange Block; additions in the diagram are element-wise. Each block has three input features: one common feature and two unique features for the content and style images, respectively. The unique features are first initialized with the corresponding raw style vectors, and the common feature with their combination. The block then allows the common feature to interact with the unique features and outputs refined versions of all three.

Exchangeable Feature for Style Transfer

As illustrated in Figure 3(a), our framework mainly consists of three parts: an encoder, several EFANet modules, and a decoder for generating the final images. The feature maps output by the pre-trained VGG encoder for the content and style images serve as the inputs to the EFANet modules. We adopt a multi-scale style adaptation strategy to improve the stylization performance. Specifically, in the bottleneck of the conventional encoder-decoder architecture, starting from the two encoded feature maps, different EFANet modules are applied to progressively fuse the styles of the input images into the corresponding decoded features in a coarse-to-fine manner, where each decoded stylized feature is upsampled before being passed to the module at the next, finer scale. The decoded features are initialized with the encoder outputs; in the following paragraphs, a superscript denotes the i-th style vector of a Gram matrix.

In Figure 3(b), given the decoded feature and the style feature as inputs, we first compute two Gram matrices across the feature channels as the raw style representations, each of size C x C, where C is the number of feature channels. In order to preserve more style details in the output and reduce the computational burden, we process only a part of the style information at a time and represent each Gram matrix as a list of C style vectors. The i-th style vector compactly encodes the mutual relationships between the i-th channel and the whole feature map. Each corresponding pair of style vectors is then processed by one Feature Exchange block, from which a common feature and two unique feature vectors, one for the decoded (content) information and one for the style, are disentangled.
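For concreteness, the sketch below (our own TensorFlow illustration, not the authors' code; all names are ours) computes such a channel-wise Gram matrix and splits it into per-channel style vectors:

```python
import tensorflow as tf

def gram_matrix(feat):
    """Raw style representation: channel-wise Gram matrix of a feature map.
    feat: [B, H, W, C] feature map from the VGG encoder.
    Returns [B, C, C]; row i encodes how channel i correlates with all channels.
    (Normalizing by H*W is our choice; the paper does not state it here.)"""
    b, h, w, c = tf.unstack(tf.shape(feat))
    flat = tf.reshape(feat, [b, h * w, c])          # [B, HW, C]
    gram = tf.matmul(flat, flat, transpose_a=True)  # [B, C, C]
    return gram / tf.cast(h * w, feat.dtype)

def style_vectors(gram, channels):
    """Treat each row of the Gram matrix as one style vector, so the style
    information can be processed one channel at a time."""
    return tf.unstack(gram, num=channels, axis=1)   # list of `channels` tensors, each [B, C]
```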

Guided by the common feature, the style features are aligned in the following manner: we first concatenate the common feature with each of the two raw style vectors. The two concatenations are then sent through fully connected layers individually, yielding the aligned style vectors. We call them exchangeable style features, since each of them can easily adapt its style to the target image. Finally, we stack the aligned style vectors of each image into a matrix for the later fusion.

Inspired by the whitening operation of WCT [17], we assume that better stylization results can be achieved when the target content features are uncorrelated before the content-style fusion. The whitening operation can be regarded as a function in which the content feature is filtered by its own style information. Thus, after the feature alignment, to facilitate transferring a new style to an image, we use the exchangeable style feature to purify its own content feature through a fusion implemented with a learnable matrix [32, 30]. Moreover, we develop a whitening loss to further encourage the removal of correlations between different channels; see Figure 2 for a validating example. The details of the whitening loss are discussed in the Loss Function section below.
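The exact form of this fusion is not spelled out above, so the following is only one plausible instantiation under our own assumptions (all class and argument names are ours): a learnable matrix, predicted from the image's own exchangeable style, filters the channels of its content feature, in the spirit of the learned linear transforms cited in [32, 30].

```python
import tensorflow as tf

class ContentPurifier(tf.keras.layers.Layer):
    """Hypothetical purification step: a learnable transform, conditioned on the
    image's own exchangeable style, filters (whitens) its content feature.
    This is a sketch of one reasonable reading, not the paper's exact fusion."""

    def __init__(self, channels):
        super().__init__()
        self.channels = channels
        # Predicts a C x C transform from a pooled exchangeable style code.
        self.to_matrix = tf.keras.layers.Dense(channels * channels)

    def call(self, content_feat, exch_style):
        # content_feat: [B, H, W, C]; exch_style: [B, D] pooled style code (assumption).
        b = tf.shape(content_feat)[0]
        h, w = tf.shape(content_feat)[1], tf.shape(content_feat)[2]
        transform = tf.reshape(self.to_matrix(exch_style),
                               [b, self.channels, self.channels])
        flat = tf.reshape(content_feat, [b, h * w, self.channels])
        purified = tf.matmul(flat, transform)   # filter each pixel's channel vector
        return tf.reshape(purified, [b, h, w, self.channels])
```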

Finally, we exchange the aligned style vectors between the two images and fuse them with the purified content features. The fused feature is then either propagated to receive style information at finer scales or decoded to output the stylized images. The decoder is trained to learn the inversion from the fused feature map back to image space, and thereby style transfer is eventually achieved for both input images: each result transfers the style of one input image onto the content of the other.

Feature Exchange Block

According to Bousmalis et al. \shortcite{Bousmalis2016DomainSN}, explicitly modeling the unique information helps improve the extraction of the shared component. To adapt this idea to our exchangeable style features, a Feature Exchange block is proposed to jointly analyze the style features of both input images and model their inter-relationships, based on which we explicitly update the common feature and the two unique features for the disentanglement. Figure 4 illustrates the detailed architecture: the unique features are first initialized with the corresponding raw style vectors and the common feature with their combination; they are then updated by learned residual features. Residual learning facilitates gradient propagation during training and conveys messages so that each input feature can be updated directly. This property allows us to chain any number of Feature Exchange blocks in a model without breaking its initial behavior.

As shown in Figure 4, there are two shared fully-connected layers inside each block: one outputs the residual for the common feature and the other outputs the residuals for the unique features, each taking a concatenation of the current features as input. The unique feature of the other branch is updated in the same way.

By doing so, the Feature Exchange block enables the common feature and each unique feature to interact with each other by modeling their dependencies, so that they are refined toward the optimum.
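As an illustration of the residual updates described above, here is a minimal TensorFlow sketch of a Feature Exchange block; the exact inputs to each fully-connected layer and the initialization of the common feature are our assumptions, not taken verbatim from the paper.

```python
import tensorflow as tf

class FeatureExchangeBlock(tf.keras.layers.Layer):
    """Sketch of a Feature Exchange block (Figure 4). The two fully-connected
    layers are shared between the content and style branches; the concatenation
    pattern below is our assumption."""

    def __init__(self, dim):
        super().__init__()
        self.fc_common = tf.keras.layers.Dense(dim)  # residual for the common feature
        self.fc_unique = tf.keras.layers.Dense(dim)  # residual for the unique features

    def call(self, common, unique_c, unique_s):
        # Residual update of each unique feature, conditioned on the common one.
        unique_c = unique_c + self.fc_unique(tf.concat([common, unique_c], axis=-1))
        unique_s = unique_s + self.fc_unique(tf.concat([common, unique_s], axis=-1))
        # Residual update of the common feature from both refined unique features.
        common = common + self.fc_common(
            tf.concat([common, unique_c, unique_s], axis=-1))
        return common, unique_c, unique_s

# Initialization as described in the text: unique features start from the raw
# style vectors; the common feature starts from their combination (here: mean,
# which is our assumption).
def init_features(style_vec_c, style_vec_s):
    return 0.5 * (style_vec_c + style_vec_s), style_vec_c, style_vec_s
```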

On the other hand, to make sure the Feature Exchange block conducts a proper disentanglement, a constraint on the disentangled features is added following Bousmalis et al. \shortcite{Bousmalis2016DomainSN}. First, the common feature should be as orthogonal as possible to both unique features. Meanwhile, the original style vectors should be reconstructable from the finally disentangled features: each raw style vector is reconstructed by feeding the sum of the common feature and the corresponding unique feature into a fully connected layer. A feature exchange loss combining these orthogonality and reconstruction terms is defined accordingly. Note that this fully connected layer for reconstruction is only used at the training stage, and the loss is only computed on the final outputs of the Feature Exchange block; we use only one Feature Exchange block in each EFANet module.

Finally, to maximize the common information, we also penalize the magnitude of the unique features. The final loss for the common feature extraction is thus the feature exchange loss plus a norm penalty on the unique feature vectors, whose weight is set to 0.0001 in all our experiments.
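A hedged sketch of this disentanglement loss is given below; the squared-dot-product orthogonality term, the mean-squared reconstruction error, and the L2 norm penalty are our assumptions about the unspecified details, and `recon_fc` stands for the training-only reconstruction layer mentioned above.

```python
import tensorflow as tf

def common_feature_loss(common, unique_c, unique_s,
                        raw_c, raw_s, recon_fc, weight=1e-4):
    """Sketch of the loss used for the common-feature disentanglement."""
    # 1) The common feature should be orthogonal to both unique features.
    ortho = (tf.reduce_mean(tf.square(tf.reduce_sum(common * unique_c, axis=-1))) +
             tf.reduce_mean(tf.square(tf.reduce_sum(common * unique_s, axis=-1))))
    # 2) The raw style vectors should be reconstructable from common + unique.
    recon = (tf.reduce_mean(tf.square(recon_fc(common + unique_c) - raw_c)) +
             tf.reduce_mean(tf.square(recon_fc(common + unique_s) - raw_s)))
    # 3) Penalize the magnitude of the unique features so that as much
    #    information as possible is pushed into the common feature.
    norm_penalty = tf.reduce_mean(tf.norm(unique_c, axis=-1) +
                                  tf.norm(unique_s, axis=-1))
    return ortho + recon + weight * norm_penalty
```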

Loss Function for Training

As illustrated in Figure 3, three different types of losses are computed for each input image pair. The first is the perceptual loss [12], which is used to evaluate the stylized results. Following previous work [9, 24], we employ a VGG model [25] pre-trained on ImageNet [5] to compute a perceptual content loss and a style loss: the content loss compares the deep features of the stylized result with the target content representation, while the style loss compares the Gram matrices of features extracted at each scale of the VGG-based encoder and is summed over all scales.
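The sketch below illustrates perceptual content and style losses of this kind, reusing the `gram_matrix` helper from the earlier snippet; it follows the common formulation of [9, 12, 24] rather than the paper's exact definition, and `encoder` is assumed to return a list of multi-scale VGG feature maps.

```python
import tensorflow as tf

def perceptual_losses(stylized, content_img, style_img, encoder):
    """Illustrative perceptual losses. Comparing against the content image's
    deepest features and the style image's Gram matrices is a standard choice;
    the paper's exact targets may differ."""
    feats_out = encoder(stylized)
    feats_c = encoder(content_img)
    feats_s = encoder(style_img)

    # Content loss: feature distance at the deepest scale.
    content_loss = tf.reduce_mean(tf.square(feats_out[-1] - feats_c[-1]))

    # Style loss: Gram-matrix distances summed over all scales.
    style_loss = tf.add_n([
        tf.reduce_mean(tf.square(gram_matrix(fo) - gram_matrix(fs)))
        for fo, fs in zip(feats_out, feats_s)
    ])
    return content_loss, style_loss
```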

The second is the whitening loss, which is used to remove style information from the target content features at the training stage. According to Li et al. \shortcite{Li2017UniversalST}, after the whitening operation the channel-wise covariance of the content feature should equal the identity matrix. We therefore define the whitening loss as the distance between this covariance matrix and the identity matrix, which encourages the channels of the purified feature map to be as uncorrelated as possible.
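A minimal sketch of such a whitening loss follows, assuming a squared Frobenius-norm distance between the channel covariance and the identity (the exact norm is not given above):

```python
import tensorflow as tf

def whitening_loss(purified_feat):
    """Encourage the purified content feature to have identity channel covariance.
    Assumes the channel dimension is statically known."""
    b = tf.shape(purified_feat)[0]
    h, w = tf.shape(purified_feat)[1], tf.shape(purified_feat)[2]
    c = purified_feat.shape[-1]
    flat = tf.reshape(purified_feat, [b, h * w, c])
    flat = flat - tf.reduce_mean(flat, axis=1, keepdims=True)      # zero-mean per channel
    cov = tf.matmul(flat, flat, transpose_a=True) / tf.cast(h * w, flat.dtype)
    return tf.reduce_mean(tf.square(cov - tf.eye(c, dtype=flat.dtype)))
```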

The third is the common feature loss defined previously, which promotes a better feature disentanglement.

Note that the losses defined per scale are summed over all scales. To summarize, the full objective function of our proposed network is a weighted combination of the content loss, the style loss, the whitening loss, and the common feature loss, where the four weighting parameters are set to 1, 7, 0.1, and 5, respectively, throughout the experiments.

Implementation Details

We implement our model with TensorFlow [1]. In general, our framework consists of an encoder, several EFANet modules, and a decoder. Similar to prior work [9, 24], we use the VGG-19 model [25] (up to relu4_1) pre-trained on ImageNet [5] as the fixed encoder. For the decoder, after the fusion of style and content features, two residual blocks are used, followed by upsampling operations. A nearest-neighbor upscaling plus convolution strategy is used to reduce artifacts in the upsampling stage [20]. We use the Adam optimizer [13] with a batch size of 4, a learning rate of 0.0001, and default decay rates, training for 150,000 iterations.
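A sketch of the nearest-neighbor-upscaling-plus-convolution step used in the decoder, together with the optimizer setup; the kernel size and activation are our assumptions:

```python
import tensorflow as tf

class UpsampleConv(tf.keras.layers.Layer):
    """Nearest-neighbor upscaling followed by a convolution, used to reduce
    checkerboard artifacts in the decoder [20]. The 3x3 kernel and ReLU
    activation are placeholders, not taken from the paper."""

    def __init__(self, filters):
        super().__init__()
        self.conv = tf.keras.layers.Conv2D(filters, 3, padding='same',
                                           activation='relu')

    def call(self, x):
        h, w = tf.shape(x)[1], tf.shape(x)[2]
        x = tf.image.resize(x, [2 * h, 2 * w], method='nearest')
        return self.conv(x)

# Optimizer setup as described: Adam with learning rate 1e-4 and default decay rates.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```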

The Places365 database [33] and the WikiArt dataset [19] are used for content and style images respectively, following [22]. During training, we resize the smaller dimension of each image to 512 pixels while preserving the aspect ratio, and then train our model on randomly sampled square patches. Note that at the testing stage, both the content and style images can be of any size.
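The data preparation described above can be sketched as follows; the crop size is not given in this excerpt, so the 256-pixel value below is purely a placeholder:

```python
import tensorflow as tf

def preprocess(image, patch_size=256):
    """Resize the smaller side to 512 pixels (preserving the aspect ratio)
    and take a random square crop. `patch_size` is a hypothetical value;
    the paper's actual training patch size is not stated here."""
    shape = tf.cast(tf.shape(image)[:2], tf.float32)
    scale = 512.0 / tf.reduce_min(shape)
    new_size = tf.cast(tf.round(shape * scale), tf.int32)
    image = tf.image.resize(image, new_size)
    return tf.image.random_crop(image, [patch_size, patch_size, 3])  # assumes RGB
```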

Experimental Results

Figure 5: Comparison with results from different methods. Note that the proposed model generates images with better visual quality while the results of other baselines have various artifacts; see text for detailed discussions.
                     AdaIN    WCT      Avatar-Net  AAMS     SANet    Li et al.  Ours w/o CF  Ours (single-scale)  Ours
Content loss         14.4226  19.5318  16.8482     17.1321  23.3074  18.7288    16.3763      16.8600              16.5927
Style loss           40.5989  27.1998  31.1532     34.7786  29.7760  37.3573    22.6713      24.9123              14.8582
Preference score     0.110    0.155    0.150       0.137    0.140    0.108      -            -                    0.200
Time (sec)           0.0192   0.4268   0.9258      1.1938   0.0983   0.0071     0.0227       0.0208               0.0234
Table 1: Quantitative comparison of different models in terms of perceptual (content & style) loss, the preference score from the user study, and running time. All results are averaged over 100 test images, except the preference score. "Ours (single-scale)" denotes our model equipped with the single-scale strategy, and "Ours w/o CF" is our model without common features.

Comparison with Existing Methods

We compare our approach with six state-of-the-art methods for arbitrary style transfer: AdaIN [9], WCT [17], Avatar-Net [24], AAMS [29], SANet [21], and Li et al. [15]. For the compared methods, publicly available code with default configurations is used for a fair comparison.

Results of qualitative comparisons are shown in Figure 5. Among the holistic statistic matching pipelines, AdaIN [9] achieves arbitrary style transfer in real-time. However, it does not respect semantic information and sometimes generates less stylized results with a color distribution different from the style image (rows 1 & 3). WCT [17] improves the stylization considerably but often introduces distorted patterns; as shown in rows 3 & 4, it sometimes produces messy and less-structured images. Li et al. \citeyear{Li_2019_CVPR} propose a linear propagation module and achieve the fastest transfer among all the compared methods, but their method often suffers from under-stylization and cannot adapt compatible style patterns or color variations to the results (rows 1 & 3).

Avatar-Net [24] improves over the holistic matching methods by adapting more style details to the results with a feature decorating module, but it also blurs semantic structures (row 3) and sometimes distorts the salient style patterns (rows 1 & 5). While AAMS [29] stylizes images with multi-stroke style patterns, similar to Avatar-Net it still suffers from structure distortion (row 3) and introduces unseen dot-wise artifacts (rows 2 & 5); it also fails to capture the patterns presented in the style image (row 5). To match the semantically nearest style features onto the content features, SANet [21] shares a similar spirit with Avatar-Net but employs a style attention module in a more flexible way. However, it may still blur content structures (row 3) and directly copy some semantic patterns from the content images into the stylization results (e.g., the eyes in rows 1, 2 & 3). Due to the local patch matching, SANet also distorts the presented style patterns and fails to preserve texture consistency (row 5).

In contrast, our approach achieves more favorable performance. The alignment of style features allows our model to better match the regions in content images with patterns in style images. The target style textures can be adaptively transferred to the content images, manifesting superior texture detail (last row) and richer color variation (2nd row). Compared to most methods, our approach also generates more structured results while style patterns, such as brush strokes, are well preserved (3rd row).

Figure 6: Balance between content and style. At the testing stage, the degree of stylization can be controlled by a blending parameter.
Figure 7: Application for spatial control. Left: content image. Middle: style images with masks to indicate target regions. Right: synthesized result.
Figure 8: Ablation study on multi-scale strategy. By fusing the content and style in multi-scales, we can enrich the local and global style patterns for stylized images.
Figure 9: Ablation study on whitening loss. With the proposed loss, clearer content contours and better style pattern consistency are achieved.

Assessing style transfer results can be subjective. We therefore conduct two quantitative comparisons, both reported in Table 1. We first compare the different methods in terms of perceptual loss. This evaluation metric contains both content and style terms and has been used in previous approaches [9]. It is worth noting that our approach does not minimize the perceptual loss alone, since it is only one of the three types of losses we use. Nevertheless, our model achieves the lowest perceptual loss among all feed-forward models, with the lowest style loss and a content loss only slightly higher than AdaIN. This indicates that our approach favors fully stylized results over results with high content fidelity.

We then conduct a user study to evaluate the visual preference among the compared methods and ours. 30 content images and 30 style images are randomly selected from the test set, and 900 stylization results are generated for each method. In each round, the results of the same stylization produced by the different methods are shown to a participant in random order, and the participant is asked to vote for the method that achieves the best stylization. Each participant completes 20 rounds of comparison, so we collect 600 votes from 30 subjects in total. The average preference scores are reported in the Preference row of Table 1, which shows that our method obtains the highest score.

Table 1 also lists the running time of our approach and the various state-of-the-art baselines. All results are obtained on a 12GB Titan V GPU and averaged over 100 test images. Generally speaking, patch-based approaches are known to be slower than the holistic matching methods. Among all the approaches, Li et al. achieve the fastest stylization with their linear propagation module. Our full model equipped with the multi-scale strategy slightly increases the computational burden but is still comparable to AdaIN, thus achieving style transfer in real-time.

Ablation Study

Here we respectively evaluate the impact of common feature learning, the proposed whitening loss on content features, and the multi-scale strategy of our framework.

Common feature disentanglement during the joint analysis plays a key role in our approach. Its importance can be evaluated by removing the Feature Exchange block and disabling the feature exchange loss, which prevents the network from learning exchangeable features. As shown in Figure 1, for the ablated model without common features, the color distribution and texture patterns in the result no longer mimic the target style image; visually, our full model yields a much more favorable result. We also compare the perceptual losses over 100 test images for the baseline model (i.e., our model without common features) and our full model. As reported in Table 1, the style loss of our full model is significantly improved over the baseline, demonstrating the effectiveness of common features.

To verify the effect of the whitening operation on content features, we remove the learnable matrices at all scales to see how the performance changes. As shown in Figure 9, without the purification operation and the whitening loss, the baseline model blurs the overall contours with yellow blobs. In contrast, our full model better matches the target style to the content image and preserves the spatial structures and style pattern consistency, yielding more visually pleasing results. This confirms that the proposed operation makes the content features more compatible with the target styles.

The multi-scale strategy is evaluated by replacing the full model with an alternative that fuses content and style at a single scale only, while keeping the other parts fixed. The comparison shown in Figure 8 demonstrates that the multi-scale strategy is more successful at capturing the salient style patterns, leading to better stylization results.

Applications

We demonstrate the flexibility of our model with two applications. Both tasks are completed with the same trained model, without any further fine-tuning.

Being able to adjust the degree of stylization is a useful feature. In our model, this can be achieved by blending the stylized feature map with the VGG-based content feature before feeding it to the decoder. The network outputs the reconstructed image when the blending weight is 0, the fully stylized image when it is 1, and a smooth transition between the two as the weight gradually changes from 0 to 1; see Figure 6.
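A one-line sketch of this blending step (the parameter name `alpha` is ours):

```python
def blend_features(stylized_feat, content_feat, alpha):
    """Content-style trade-off: alpha = 0 reproduces the content image,
    alpha = 1 gives the fully stylized result; intermediate values
    interpolate smoothly between the two."""
    return alpha * stylized_feat + (1.0 - alpha) * content_feat
```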

In Figure 7, we present our model’s ability to apply different styles to different image regions. Masks are used to specify the correspondences between content image regions and the desired styles. The pair-wise exchangeable feature extraction considers only the masked regions when applying a given style, helping to achieve an optimal stylization effect for each individual region.

Conclusions

In this paper, we have presented a novel framework, EFANet, for transferring an arbitrary style to a content image. By analyzing the common style features of both inputs as guidance for alignment, exchangeable style features are extracted. Better stylization is then achieved for the content image by fusing its purified content feature with the aligned style feature from the style image. Experiments show that our method significantly improves stylization performance over prior state-of-the-art methods.

References

Figure 10: Stylization matrix of transferring different content images to different styles. The first row consists of style images and the content images are listed in the leftmost column.
Figure 11: Stylization matrix of transferring different content images to different styles. The first row consists of style images and the content images are listed in the leftmost column.
Figure 12: Stylization matrix of transferring different content images to different styles. The first row consists of style images and the content images are listed in the leftmost column.