Wasserstein Style Transfer
Abstract
We propose Gaussian optimal transport for image style transfer in an encoder/decoder framework. Optimal transport between Gaussian measures has closed-form Monge maps from the source to the target distribution. Moreover, interpolates between a content and a style image can be seen as geodesics in the Wasserstein geometry. Using this insight, we show how to mix different target styles using Wasserstein barycenters of Gaussian measures. Since Gaussians are closed under Wasserstein barycenters, this yields a simple procedure for style transfer, style mixing, and interpolation. We further show how mixing different styles can be achieved using other geodesic metrics between Gaussians, such as the Fisher-Rao metric, while the transport of the content to the new interpolated style is still performed with Gaussian OT maps. Our simple methodology allows generating new stylized content by interpolating between many artistic styles. The metric used in the interpolation results in different stylizations.
Youssef Mroueh IBM Research IBM T.J Watson Research Center mroueh@us.ibm.com
Preprint. Under review.
1 Introduction
Image style transfer is the task of modifying an image in a way that preserves its content while matching the artistic style of a target image or a collection of images. Defining a loss function that captures this content/style constraint is challenging. Significant progress was made with the introduction of neural style transfer in the seminal work of Gatys et al. [1, 2], who showed that by matching statistics of the spatial distribution of images in the feature space of deep convolutional neural networks (spatial Gram matrices), one can define a style loss function. In their method, the image is updated via an optimization process to minimize this "network loss". One shortcoming of this approach is that it is slow and requires an optimization per content and per style image. Many workarounds have been introduced to speed up this process via feedforward networks that produce stylizations in a single forward pass [3, 4, 5, 6]. Nevertheless, this approach was still limited to a single style image. [7] introduced Instance Normalization (IN) to improve the quality and diversity of stylization. Multiple-style neural transfer was then introduced in [8] thanks to Conditional Instance Normalization (CIN), which adapts the normalized statistics of the transposed convolutional layers in the feedforward network with scalings and biases learned for each style image, for a fixed number of style images. The concept of layer swap in [9] resulted in one of the first arbitrary style transfer methods. Adaptive Instance Normalization (AdaIN) was introduced in [10] by making the CIN scalings and biases learned functions of the style image, which also enabled arbitrary style transfer.
The Whitening and Coloring Transform (WCT) [11], which we discuss in detail in Section 2, developed a simple framework for arbitrary style transfer using an encoder/decoder architecture, applying a simple normalization transform (WCT) in the encoder feature space to perform the style transfer.
Our work is closest to the WCT transform. We start by noticing that instance normalization layers (IN, CIN, AdaIN and WCT) perform a transport map from the spatial distribution of a content image to that of a style image, and that the implicit assumption in deriving those maps is the Gaussianity of the spatial distribution of images in a deep CNN feature space. The Wasserstein geometry of Gaussian measures is very well studied in optimal transport [12], and Gaussian Optimal Transport (OT) maps have closed forms. We show in Section 3 that those normalization transforms are approximations of the OT maps. Linear interpolations of different contents or styles at the level of those normalization feature transforms have been successfully applied in [10, 8]; we show in Section 4 that this can be interpreted, and improved, as Gaussian geodesics in the Wasserstein geometry. Furthermore, using this insight, we show in Section 5 that we can define novel styles using Wasserstein barycenters of Gaussians [13]. We also extend this to other Fréchet means in order to study the impact of the ground metric used on the covariances on the novel style obtained via this non-linear interpolation. Experiments are presented in Section 7.
2 Universal Style Transfer
We review in this section the universal style transfer approach of WCT [11].
Encoding Map. Given a content image x_c and a style image x_s, and a feature extractor Φ, where Φ(x) ∈ R^{m×d} is the spatial output of Φ and d is its feature dimension, define the following encoding map E: x → μ(x) = (1/m) Σ_{j=1}^{m} δ_{Φ_j(x)}, where μ(x) belongs to the space of empirical measures on R^d. For example, Φ is a VGG [14] CNN that maps an image to R^{C×H×W} (C is the number of channels, H the height and W the width). In other words, the CNN defines a distribution in the space of dimension d = C, and we are given m = H·W samples of this distribution. We note μ(x) this empirical distribution, i.e. the spatial distribution of the image x in the feature space of a deep convolutional network.
Decoding Map. We assume that the encoding Φ is invertible, i.e. there exists a decoder D such that D(Φ(x)) = x. (Φ, D) is, for instance, a VGG image encoder/decoder trained to map from the pixel domain to a spatial convolutional layer output of VGG and vice versa.
Universal Style Transfer in Feature Space. The universal style transfer approach of [11] works in the following way: WCT (Whitening and Coloring Transform) defines a transform T (we will elaborate later on this transform) in the feature space; the style transfer transform T operates in the feature space and naturally defines a push-forward map T_# on the spatial distribution μ(x_c) of the features of the content image x_c.
T is defined so that the style transfer happens in the feature space, i.e. T_# μ(x_c) ≈ μ(x_s). We obtain the stylized image by decoding back to the image domain: x̂ = D(T(Φ(x_c))).
From this formalism we see that the universal style transfer problem amounts to finding a transport map from the spatial distribution of a content image in a feature space to the spatial distribution of a target style image in the same feature space. We show in the next section how to leverage optimal transport theory to define such maps. Moreover, we show that the WCT transform and Adaptive Instance Normalization are approximations of the optimal transport maps.
3 Wasserstein Universal Style Transfer
Given μ_c = μ(x_c) and μ_s = μ(x_s), we formulate the style transfer problem as finding an optimal Monge map:

min_T ∫ ‖x − T(x)‖² dμ_c(x) subject to T_# μ_c = μ_s.   (1)

The optimal value of this problem is W_2²(μ_c, μ_s), the squared Wasserstein-2 distance between μ_c and μ_s. Under some regularity conditions on the distributions, the optimal transport map exists, is unique, and is the gradient of a convex potential [15].
Wasserstein Geometry of Gaussian Measures. Computationally, Problem (1) can be solved using, for example, entropic regularization of the equivalent Kantorovich form [16, 17], or in an end-to-end approach using automatic differentiation of a Sinkhorn loss [18, 19]. We take here another route, using the known fact that Gaussian OT provides a lower bound on the Wasserstein distance [20]. For any two measures μ_1 and μ_2:

W_2²(μ_1, μ_2) ≥ W_2²(N(m_1, Σ_1), N(m_2, Σ_2)),

where m_1, Σ_1 are the mean and covariance of μ_1, and m_2, Σ_2 those of μ_2. The Wasserstein geometry of Gaussian measures is well studied and has many convenient computational properties [12]; we summarize them in the following:
1) Closed-Form Distance. Given two Gaussian distributions ν_1 = N(m_1, Σ_1) and ν_2 = N(m_2, Σ_2), we have:

W_2²(ν_1, ν_2) = ‖m_1 − m_2‖² + B²(Σ_1, Σ_2),

where

B²(Σ_1, Σ_2) = Tr(Σ_1 + Σ_2 − 2 (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2})

is the Bures metric between covariances. The Bures metric is a geodesic metric on the PSD cone (in Section 5.2 we discuss properties of this metric).
2) Closed-Form Monge Map. The optimal transport map between two Gaussians with non-degenerate covariances (full-rank Σ_1, Σ_2) has a closed form: T(x) = m_2 + A(x − m_1), where A = Σ_1^{−1/2} (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2} Σ_1^{−1/2}, i.e. A Σ_1 A = Σ_2, and T is optimal in the W_2 sense. If the Gaussians were degenerate, we can replace the inverse square-root matrices with pseudo-inverses [21].
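As a concrete illustration, the Bures metric and the closed-form Monge map can be computed in a few lines of NumPy/SciPy. The sketch below is our illustration (not the paper's code) and follows the formulas above verbatim:

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_sq(S1, S2):
    """Squared Bures metric B^2(S1, S2) = Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})."""
    r1 = sqrtm(S1).real
    return float(np.trace(S1 + S2 - 2.0 * sqrtm(r1 @ S2 @ r1).real))

def gaussian_monge_map(m1, S1, m2, S2):
    """Closed-form OT map T(x) = m2 + A (x - m1) between N(m1, S1) and N(m2, S2),
    with A = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}, so that A S1 A = S2."""
    r1 = sqrtm(S1).real
    r1_inv = np.linalg.inv(r1)
    A = r1_inv @ sqrtm(r1 @ S2 @ r1).real @ r1_inv
    # A is symmetric, so (x - m1) @ A equals A (x - m1) applied row-wise
    return A, (lambda x: m2 + (x - m1) @ A)
```

One can check directly that the returned matrix satisfies A Σ_1 A = Σ_2, which is exactly the push-forward condition for Gaussians.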
Gaussian Wasserstein Style Transfer. The spatial distribution of images in a CNN feature space is not exactly Gaussian, but instead of solving Problem (1) we can use the Gaussian lower bound and obtain a closed-form optimal map from the content distribution to the style distribution as follows:

T(x) = m_s + A(x − m_c), where A = Σ_c^{−1/2} (Σ_c^{1/2} Σ_s Σ_c^{1/2})^{1/2} Σ_c^{−1/2},   (2)

and m_c, Σ_c and m_s, Σ_s are the means and covariances of μ_c and μ_s, respectively. Finally, the universal Wasserstein style transfer can be written in the following compact way, summarized in Figure 1:

x̂ = D(T(Φ(x_c))).   (3)
Relation to WCT and to Adaptive Instance Normalization.
We consider two particular cases:
1) Commuting covariances and WCT [11]. Assuming that the covariances Σ_c and Σ_s commute, meaning that Σ_c Σ_s = Σ_s Σ_c (Σ_c and Σ_s have a common orthonormal eigenbasis), it is easy to see that the optimal transport map reduces to:

T(x) = m_s + Σ_s^{1/2} Σ_c^{−1/2} (x − m_c),

which is exactly the Whitening and Coloring Transform (WCT). Hence we see that WCT [11] is only optimal when the covariances commute (a particular case is diagonal covariances).
2) Diagonal Covariances and AdaIN, Instance Normalization (IN) and Conditional Instance Normalization (CIN) [10, 7, 8]. Let σ_c² be the diagonal of Σ_c and σ_s² be the diagonal of Σ_s. If the covariances were diagonal, it is easy to see that:

T(x) = m_s + diag(σ_s / σ_c)(x − m_c),

which is exactly the expression of Adaptive Instance Normalization (AdaIN). We conclude that AdaIN, IN and CIN implement a diagonal approximation of the optimal Gaussian transport map (the scalings and biases are learned constants in IN and CIN, and are adaptive in AdaIN).
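The three transforms can be compared directly in code. The following sketch is our illustration (the function names are ours): it implements the exact Gaussian map of Equation (2), WCT, and AdaIN, which coincide when the covariances are diagonal:

```python
import numpy as np
from scipy.linalg import sqrtm

def ot_transform(x, m_c, S_c, m_s, S_s):
    """Exact Gaussian OT map, Eq. (2)."""
    r = sqrtm(S_c).real
    r_inv = np.linalg.inv(r)
    A = r_inv @ sqrtm(r @ S_s @ r).real @ r_inv
    return m_s + (x - m_c) @ A

def wct_transform(x, m_c, S_c, m_s, S_s):
    """WCT: whiten with S_c^{-1/2}, color with S_s^{1/2}; optimal only if S_c, S_s commute."""
    A = sqrtm(S_s).real @ np.linalg.inv(sqrtm(S_c).real)
    return m_s + (x - m_c) @ A.T

def adain_transform(x, m_c, S_c, m_s, S_s):
    """AdaIN: per-channel scaling, i.e. the diagonal approximation of the OT map."""
    scale = np.sqrt(np.diag(S_s) / np.diag(S_c))
    return m_s + (x - m_c) * scale
```

With non-commuting covariances, only `ot_transform` matches the target covariance exactly; WCT and AdaIN are the commuting and diagonal approximations discussed above.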
4 Wasserstein Style/Content Interpolation with McCann Interpolates
One shortcoming of the formulation in Problem (1) is that it does not allow balancing content preservation against style, as is the case in end-to-end style transfer. Let t ∈ [0, 1]; we formulate the style transfer problem with content preservation as follows:

μ_t = argmin_μ (1 − t) W_2²(μ, μ_c) + t W_2²(μ, μ_s).   (4)

The first term in Equation (4) measures the usual "content loss" in style transfer, and the second term measures the "style loss"; t balances the interpolation between the style and the content. In optimal transport theory, Problem (4) is known as the McCann interpolate [22] between μ_c and μ_s; its solution lies on a Wasserstein geodesic from μ_c to μ_s and is given by:

μ_t = ((1 − t) Id + t T)_# μ_c.
The spatial distribution of images in a CNN is not exactly Gaussian, but instead of solving Problem (4) we can again use the corresponding Gaussian lower bound:

ν_t = argmin_ν (1 − t) W_2²(ν, N(m_c, Σ_c)) + t W_2²(ν, N(m_s, Σ_s)).   (5)

Fortunately, this problem also has a closed form [22]: ν_t = (T_t)_# N(m_c, Σ_c), with T_t = (1 − t) Id + t T, where T is given in Equation (2). (ν_t)_{t∈[0,1]} is a geodesic between N(m_c, Σ_c) and N(m_s, Σ_s). Finally, the Wasserstein style/content interpolation can be written in the following compact way:

x̂_t = D(T_t(Φ(x_c))).   (6)
In practice, both WCT and AdaIN propose similar interpolations in feature space; we give here a formal justification for this approach. This formalism allows us to generalize to multiple-style interpolation using the Gaussian Wasserstein geometry of the spatial distribution of CNN image features.
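The displacement interpolation T_t = (1 − t) Id + t T can be sketched as follows; this is our illustration, reusing the closed-form map of Equation (2):

```python
import numpy as np
from scipy.linalg import sqrtm

def mccann_map(m_c, S_c, m_s, S_s, t):
    """Displacement interpolation T_t = (1 - t) Id + t T between N(m_c, S_c)
    and N(m_s, S_s), with T the closed-form Gaussian OT map of Eq. (2)."""
    r = sqrtm(S_c).real
    r_inv = np.linalg.inv(r)
    A = r_inv @ sqrtm(r @ S_s @ r).real @ r_inv
    return lambda x: (1.0 - t) * x + t * (m_s + (x - m_c) @ A)
```

At t = 0 the map is the identity (pure content), at t = 1 it is the full OT map (pure style), and in between the pushed-forward Gaussian travels along the Wasserstein geodesic.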
5 Wasserstein Style Interpolation
Given N target style images x_{s_1}, …, x_{s_N} and a content image x_c, let λ_1, …, λ_N be interpolation factors such that Σ_{i=1}^{N} λ_i = 1. A naive approach to content/styles interpolation is the linear combination Σ_{i=1}^{N} λ_i T_i(x), where T_i is the transform targeting style i. This approach was proposed in both WCT and AdaIN, with T_i replaced by the WCT and AdaIN transforms, respectively. We show here how to define a non-linear interpolation that exploits the Wasserstein geometry of Gaussian measures.
5.1 Interpolation with Wasserstein Barycenters
Similarly to the content/style interpolation, we formulate the content/styles interpolation problem as a Wasserstein barycenter problem [13] as follows. Let λ_0 be the weight of the content and λ_1, …, λ_N the weights of the styles, with Σ_{i=0}^{N} λ_i = 1; we propose to solve the following Wasserstein barycenter problem:

μ̄ = argmin_μ λ_0 W_2²(μ, μ_c) + Σ_{i=1}^{N} λ_i W_2²(μ, μ(x_{s_i})),

and then find the optimal map T̄ from μ_c to the barycenter measure μ̄. The final stylized image is obtained as follows: x̂ = D(T̄(Φ(x_c))).
Again, we resort to the Gaussian optimal transport lower bound of the above problem:

ν̄ = argmin_ν λ_0 W_2²(ν, N(m_c, Σ_c)) + Σ_{i=1}^{N} λ_i W_2²(ν, N(m_{s_i}, Σ_{s_i})).   (7)

As shown by Agueh and Carlier [13], the Wasserstein barycenter of Gaussians is itself a Gaussian ν̄ = N(m̄, Σ̄), where m̄ is the weighted arithmetic mean of the means and Σ̄ is a Bures mean. Noting m_0 = m_c and Σ_0 = Σ_c, we have:

m̄ = Σ_{i=0}^{N} λ_i m_i.
Agueh and Carlier showed that Σ̄ is the unique positive-definite solution of the fixed-point equation Σ = Σ_{i=0}^{N} λ_i (Σ^{1/2} Σ_i Σ^{1/2})^{1/2}. In order to solve this problem, we use an alternative fixed-point strategy proposed in [23], since it converges faster in practice:

Σ_{n+1} = Σ_n^{−1/2} (Σ_{i=0}^{N} λ_i (Σ_n^{1/2} Σ_i Σ_n^{1/2})^{1/2})² Σ_n^{−1/2},   (8)

and we initialize as in [21]; we found that a small number of iterations was enough for convergence. Matrix square roots and inverses were computed using the SVD, which gives an overall complexity of O(d³), and we used truncated SVD to stabilize the inverses. Finally, since the barycenter is a Gaussian, the optimal transport map from the Gaussian spatial content distribution to the barycenter (mix of styles and content) is given in closed form as in Equation (2):
T̄(x) = m̄ + Σ_c^{−1/2} (Σ_c^{1/2} Σ̄ Σ_c^{1/2})^{1/2} Σ_c^{−1/2} (x − m_c).   (9)

Finally, to obtain the stylized image as a result of targeting the mixed content/style, we decode back: x̂ = D(T̄(Φ(x_c))).
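The fixed-point iteration (8) for the barycenter covariance can be sketched as follows. This is our illustration; the arithmetic-mean initialization is an assumption (the paper initializes as in [21]), and we use dense matrix square roots rather than the truncated SVD mentioned above:

```python
import numpy as np
from scipy.linalg import sqrtm

def wasserstein_barycenter_cov(covs, weights, n_iter=30):
    """Fixed-point iteration (8) of Alvarez-Esteban et al. for the Bures
    (Wasserstein) barycenter of the covariances `covs` with weights `weights`."""
    S = sum(w * C for w, C in zip(weights, covs))  # arithmetic-mean init (assumption)
    for _ in range(n_iter):
        r = sqrtm(S).real
        r_inv = np.linalg.inv(r)
        # M = sum_i w_i (S^{1/2} C_i S^{1/2})^{1/2}
        M = sum(w * sqrtm(r @ C @ r).real for w, C in zip(weights, covs))
        S = r_inv @ M @ M @ r_inv  # S <- S^{-1/2} M^2 S^{-1/2}
    return S
```

For commuting covariances the barycenter has a simple closed form, Σ̄^{1/2} = Σ_i λ_i Σ_i^{1/2}, which gives a convenient sanity check.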
Figure 2 gives an example of our approach for mixing content images with style images. We see that the Wasserstein barycenter captures not only the color distribution but also the details of the artistic style (for instance, Frida Kahlo's unibrow is captured smoothly in the transition between Picasso's self-portrait and Frida Kahlo).
5.2 Style Interpolation with Fréchet Means
In the previous section we defined interpolations between the content and the style images. In this section we define a "novel style" via an interpolation of style images only; we then map the content to the novel style using Gaussian optimal transport.
From Wasserstein Barycenters to Fréchet Means on the PSD manifold. As discussed earlier, the Wasserstein barycenter of the Gaussian approximations of the spatial distributions of style images in CNN feature space can be written as a Fréchet mean:

m̄ = argmin_m Σ_{i=1}^{N} λ_i ‖m − m_{s_i}‖², Σ̄ = argmin_Σ Σ_{i=1}^{N} λ_i d²(Σ, Σ_{s_i}),   (10)

for the Euclidean metric on the means and d = B, the Bures metric, on the covariances. The Bures metric is a geodesic metric on the positive-definite cone and has another representation as a Procrustes registration metric [24]:

B²(Σ_1, Σ_2) = min_{U: U U^T = I} ‖Σ_1^{1/2} − Σ_2^{1/2} U‖²_F.

From this we see the advantage of the Wasserstein barycenter over, for example, the Frobenius norm: the Bures metric aligns the square roots of the covariances using a rotation. We also see that by defining a new metric on the covariances we can get different forms of interpolates; we fix the metric on the means, and hence on the means we always use the arithmetic mean m̄ = Σ_{i=1}^{N} λ_i m_{s_i}. We give here different metrics that define different Fréchet means on the PSD manifold (see [25] and references therein):
1) Arithmetic Mean: Solving Eq. (10) for d(Σ, Σ') = ‖Σ − Σ'‖_F, we define the target style N(m̄, Σ̄_A), where Σ̄_A = Σ_{i=1}^{N} λ_i Σ_{s_i}.
2) Harmonic Mean: Solving Eq. (10) for d(Σ, Σ') = ‖Σ^{−1} − Σ'^{−1}‖_F, we define the target style N(m̄, Σ̄_H), where Σ̄_H = (Σ_{i=1}^{N} λ_i Σ_{s_i}^{−1})^{−1}.
3) Fisher-Rao Mean (Karcher or Geometric Mean): For d(Σ, Σ') = ‖log(Σ^{−1/2} Σ' Σ^{−1/2})‖_F, that is, the natural Riemannian metric, or the Fisher-Rao metric between centered Gaussians; log here refers to the matrix logarithm. The Fisher-Rao metric is a geodesic distance and its metric tensor is the Fisher information matrix. Solving Eq. (10) with the Fisher-Rao metric, we obtain the so-called Karcher mean between PSD matrices Σ̄_K, and we define the target style N(m̄, Σ̄_K).
In order to find the Karcher mean, we use the manifold optimization techniques of [26] as follows. The Riemannian gradient update is:

Σ_{n+1} = Σ_n^{1/2} exp(η Σ_{i=1}^{N} λ_i log(Σ_n^{−1/2} Σ_{s_i} Σ_n^{−1/2})) Σ_n^{1/2},   (11)

where exp and log denote the matrix exponential and logarithm. We initialize as in the Wasserstein case and iterate for a fixed number of iterations with learning rate η.
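This Riemannian gradient descent can be sketched as follows; again this is our illustration, and the arithmetic-mean initialization and fixed iteration count are assumptions:

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm

def karcher_mean(covs, weights, n_iter=20, lr=1.0):
    """Riemannian gradient descent for the Karcher (geometric) mean of SPD
    matrices under the affine-invariant / Fisher-Rao metric, following the
    update S <- S^{1/2} exp(lr * sum_i w_i log(S^{-1/2} C_i S^{-1/2})) S^{1/2}."""
    S = sum(w * C for w, C in zip(weights, covs))  # arithmetic-mean init (assumption)
    for _ in range(n_iter):
        r = sqrtm(S).real
        r_inv = np.linalg.inv(r)
        # tangent-space average of the log-maps of each covariance at S
        G = sum(w * logm(r_inv @ C @ r_inv).real for w, C in zip(weights, covs))
        S = r @ expm(lr * G).real @ r
    return S
```

For commuting covariances the Karcher mean reduces to the usual geometric mean, exp(Σ_i λ_i log Σ_i), which the iteration reaches in a single step with lr = 1.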
Remark 1.
While we defined here the barycentric style of each metric as a Gaussian, the Wasserstein barycenter is the only one for which the barycenter of Gaussian measures is guaranteed to be Gaussian [13].
6 Related works
OT for Style Transfer and Image Coloring. Color transfer between images using regularized optimal transport on the color distributions of images (RGB for example) was studied and applied in [27]. The color distribution is not Gaussian, and hence the OT problem has to be solved using regularization. Optimal transport for style transfer using the spatial distribution in the feature space of a deep CNN was also explored in [28, 29]. [28] uses the Wasserstein distance between Gaussians as content and style losses and optimizes them in an end-to-end fashion similar to [1, 2]. [29] uses an approximation of the Wasserstein distance as a loss that is also optimized in an end-to-end fashion. Neither approach allows universal style transfer, and an optimization is needed for every style/content image pair.
Wasserstein Barycenters for Texture Mixing. Similar to our approach of Wasserstein mixing in an encoder/decoder framework, [30] uses the wavelet transform to encode textures, applies a Wasserstein barycenter on the wavelet coefficients, and then decodes back using the inverse wavelet transform to synthesize a novel mixed texture. The Wasserstein barycenter problem there has to be solved exactly, and the Gaussian approximation cannot be used since the wavelet coefficient distribution is not Gaussian. A special model for Gaussian texture mixing was developed in [21]. The advantage of using the features of a CNN is that the Gaussian lower bound of the Wasserstein distance seems to be tight.
7 Experiments
In order to test our approach of geometric mixing of styles, we use the WCT framework [11], with a pyramid of encoders (E_1, …, E_L) at different spatial resolutions, where E_1 corresponds to the coarsest resolution and E_L to the finest. Following WCT, we use a coarse-to-fine approach to style transfer as follows. Given interpolation weights (λ_0, …, λ_N), we start with k = 1 and x̂_0 = x_c:

1) We find the Wasserstein transport map T̄_k at resolution k between the content x̂_{k−1} and the novel style, and compute the transformed features T̄_k(E_k(x̂_{k−1})).

2) We decode the novel image at resolution k: x̂_k = D_k(T̄_k(E_k(x̂_{k−1}))).

3) We set the content to x̂_k, then set k to k + 1 and go to step 1) until reaching k = L.
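The coarse-to-fine loop above can be sketched as follows, with hypothetical encoder/decoder callables standing in for the trained VGG encoders/decoders (the names and the sample-based covariance estimation are our assumptions):

```python
import numpy as np
from scipy.linalg import sqrtm

def fit_ot_map(Fc, Fs):
    """Gaussian OT map estimated from feature samples (rows), content -> style."""
    mc, ms = Fc.mean(axis=0), Fs.mean(axis=0)
    Sc = np.cov(Fc, rowvar=False)
    Ss = np.cov(Fs, rowvar=False)
    r = sqrtm(Sc).real
    r_inv = np.linalg.inv(r)
    A = r_inv @ sqrtm(r @ Ss @ r).real @ r_inv
    return ms + (Fc - mc) @ A

def coarse_to_fine(x_c, x_s, encoders, decoders):
    """One pass per pyramid level: encode, transport content features onto the
    (mixed) style features, decode, and reuse the output as the next level's content."""
    y = x_c
    for E, D in zip(encoders, decoders):
        y = D(fit_ot_map(E(y), E(x_s)))
    return y
```

With identity encoders/decoders this reduces to a single Gaussian OT step, which matches the style feature statistics exactly.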
The stylized output of this procedure is x̂_L. In the Appendix we also experiment with applying the same approach in a fine-to-coarse manner, starting from the highest-resolution encoder and moving to the lowest-resolution one. We show in Figure 3 the output of our mixing strategy using two of the geodesic metrics, namely the Wasserstein and Fisher-Rao barycenters. As a baseline we give the AdaIN output (this same example was given in [10]; we reproduce it using their available code). We show that using geodesic metrics to define the mixed style successfully captures the subtle details of the different styles. More examples, comparisons to the literature, and other types of mixing can be found in the Appendix.
8 Discussion and Conclusion
We conclude this paper with the following three observations on the spatial distribution of features in a deep convolutional neural network:

1) The success of Gaussian optimal transport between spatial distributions of deep CNN features that we demonstrated in this paper suggests that the network learned to "gaussianize" the space. Gaussianization [35] is a principle in unsupervised learning. It will be interesting to further study this Gaussianity hypothesis, and to see whether Gaussianization can be used as a regularizer for learning deep CNNs or as an objective in self-supervised learning.

2) We showed that many of the spatial normalization layers used in deep learning, such as Instance Normalization [7] and related variants, can be understood as approximations of Gaussian optimal transport. When used in an architecture between layers, the normalization layer acts like a transport map between the spatial distributions of consecutive layers. We hope this angle will help in developing new normalization layers and in better understanding existing ones.

3) Geodesic metrics such as the Wasserstein and Fisher-Rao metrics allow better non-linear interpolation in feature space.
References
 [1] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
 [2] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Controlling perceptual factors in neural style transfer. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017.
 [3] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. Lecture Notes in Computer Science, 2016.
 [4] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.
 [5] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In ECCV (3), 2016.
 [6] Xin Wang, Geoffrey Oxholm, Da Zhang, and Yuan-Fang Wang. Multimodal transfer: A hierarchical deep convolutional neural network for fast artistic style transfer. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [7] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [8] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. 2017.
 [9] Tian Qi Chen and Mark Schmidt. Fast patch-based style transfer of arbitrary style, 2016.
 [10] Xun Huang and Serge Belongie. Arbitrary style transfer in realtime with adaptive instance normalization. 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
 [11] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems 30. 2017.
 [12] Asuka Takatsu. Wasserstein geometry of gaussian measures. Osaka J. Math., 2011.
 [13] Martial Agueh and Guillaume Carlier. Barycenters in the wasserstein space. SIAM J. Math. Analysis, 43, 2011.
 [14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR arXiv:1409.1556, 2014.
 [15] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 2000.
 [16] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
 [17] Gabriel Peyré and Marco Cuturi. Computational optimal transport. Technical report, 2017.
 [18] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
 [19] J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A. Trouvé, and G. Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. ArXiv e-prints, 2018.
 [20] J. A. Cuesta-Albertos, C. Matrán-Bea, and A. Tuero-Diaz. On lower bounds for the L2-Wasserstein metric in a Hilbert space. Journal of Theoretical Probability, 1996.
 [21] Gui-Song Xia, Sira Ferradans, Gabriel Peyré, and Jean-François Aujol. Synthesizing and mixing stationary gaussian texture models. SIAM J. Imaging Sciences, 2014.
 [22] Robert J. McCann. A convexity principle for interacting gases. Advances in Mathematics, 1997.
 [23] Pedro C. Álvarez-Esteban, E. del Barrio, J.A. Cuesta-Albertos, and C. Matrán. A fixed-point approach to barycenters in Wasserstein space. http://arxiv.org/pdf/1511.05355.
 [24] Valentina Masarotto, Victor M Panaretos, and Yoav Zemel. Procrustes metrics on covariance operators and optimal transportation of gaussian processes. Sankhya A, 2018.
 [25] Rajendra Bhatia. The Riemannian Mean of Positive Matrices. Springer Berlin Heidelberg, 2013.
 [26] Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In COLT, 2016.
 [27] Sira Ferradans, Nicolas Papadakis, Julien Rabin, Gabriel Peyré, and Jean-François Aujol. Regularized discrete optimal transport. Scale Space and Variational Methods in Computer Vision, 2013.
 [28] Style transfer as optimal transport. https://github.com/VinceMarron/style_transfer.
 [29] Nicholas Kolkin, Jason Salavon, and Greg Shakhnarovich. Style transfer by relaxed optimal transport and self-similarity, 2019.
 [30] Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In Proceedings of the Third International Conference on Scale Space and Variational Methods in Computer Vision, SSVM’11, 2012.
 [31] Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, and Mingli Song. Neural style transfer: A review. CoRR.
 [32] Chuan Li and Michael Wand. Combining Markov random fields and convolutional neural networks for image synthesis. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016.
 [33] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017.
 [34] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
 [35] Scott Saobing Chen and Ramesh A. Gopinath. Gaussianization. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 423–429. MIT Press, 2001.
Supplementary Material for Wasserstein Style Transfer
Appendix A Algorithms
Appendix B Examples of Interpolating Content and Styles with Wasserstein Barycenter and Optimal Transport
Appendix C Mixing Styles with Fréchet Means and Optimal Transport Style Transfer
Coarse to Fine.
We give results of different mixing strategies and content stylization in a coarse-to-fine procedure as follows: Wasserstein mixing in Table 8; Fisher-Rao mixing in Table 9; arithmetic mixing, which would be close to the WCT baseline [11], in Table 10; harmonic mixing in Table 11; AdaIN mixing in Table 12. We also give another set of results: Wasserstein barycenter mixing in Table 15, Fisher-Rao in Table 16, and AdaIN in Table 17.