# Learning Wasserstein Embeddings

## Abstract

The Wasserstein distance received a lot of attention recently in the community of machine learning, especially for its principled way of comparing distributions. It has found numerous applications in several hard problems, such as domain adaptation, dimensionality reduction or generative models. However, its use is still limited by a heavy computational cost. Our goal is to alleviate this problem by providing an approximation mechanism that allows to break its inherent complexity. It relies on the search of an embedding where the Euclidean distance mimics the Wasserstein distance. We show that such an embedding can be found with a siamese architecture associated with a decoder network that allows to move from the embedding space back to the original input space. Once this embedding has been found, computing optimization problems in the Wasserstein space (*e.g.* barycenters, principal directions or even archetypes) can be conducted extremely fast. Numerical experiments supporting this idea are conducted on image datasets, and show the wide potential benefits of our method.

## 1Introduction

The Wasserstein distance is a powerful tool based on the theory of optimal transport to compare data distributions with wide applications in image processing, computer vision and machine learning [26]. In a context of machine learning, it has recently found numerous applications, *e.g.* domain adaptation [12], word embedding [21] or generative models [3]. Its power comes from two major reasons: *i)* it allows to operate on empirical data distributions in a non-parametric way *ii)* the geometry of the underlying space can be leveraged to compare the distributions in a geometrically sound way. The space of probability measures equipped with the Wasserstein distance can be used to construct objects of interest such as barycenters [1] or geodesics [33] that can be used in data analysis and mining tasks.

More formally, let be a metric space endowed with a metric . Let and the space of all Borel probability measures on with finite moments of order p, *i.e.* for all in . The p-Wasserstein distance between and is defined as:

Here, is the set of probabilistic couplings on . As such, for every Borel subsets , we have that and . It is well known that defines a metric over as long as (e.g. [38], Definition 6.2).

When , is also known as Earth Mover’s distance (EMD) or Monge-Kantorovich distance. The geometry of (, ) has been thoroughly studied, and there exists several works on computing EMD for point sets in (e.g. [34]). However, in a number of applications the use of (a.k.a *root mean square bipartite matching distance*) is a more natural distance arising in computer vision [7], computer graphics [8] or machine learning [14]. See [16] for a discussion on the quality comparison between and .

Yet, the deployment of Wasserstein distances in a wide class of applications is somehow limited, especially because of an heavy computational burden. In the discrete version of the above optimisation problem, the number of variables scale quadratically with the number of samples in the distributions, and solving the associated linear program with network flow algorithms is known to have a cubical complexity. While recent strategies implying slicing technique [7], entropic regularization [13] or involving stochastic optimization [20], have emerged, the cost of computing pairwise Wasserstein distances between a large number of distributions (like an image collection) is prohibitive. This is all the more true if one considers the problem of computing barycenters [14] or population means. A recent attempt by Staib and colleagues [37] use distributed computing for solving this problem in a scalable way.

We propose in this work to learn an Euclidean embedding of distributions where the Euclidean norm approximates the Wasserstein distances. Finding such an embedding enables the use of standard Euclidean methods in the embedded space and significant speedup in pairwise Wasserstein distance computation, or construction of objects of interests such as barycenters. The embedding is expressed as a deep neural network, and is learnt with a strategy similar to those of Siamese networks [11]. We also show that simultaneously learning the inverse of the embedding function is possible and allows for a reconstruction of a probability distribution from the embedding. We first start by describing existing works on Wasserstein space embedding. We then proceed by presenting our learning framework and give proof of concepts and empirical results on existing datasets.

## 2Related work

**Metric embedding** The question of metric embedding usually arises in the context of approximation algorithms. Generally speaking, one seeks a new representation (embedding) of data at hand in a new space where the distances from the original space are preserved. This new representation should, as a positive side effect, offers computational ease for time-consuming task (e.g. searching for a nearest neighbor), or interpretation facilities (e.g. visualization of high-dimensional datasets). More formally, given two metrics spaces and and , a mapping is an embedding with distortion at most if there exists a coefficient such that . Here, the parameter is to be understood as a global scaling coefficient. The **distortion** of the mapping is the infimum over all possible such that the previous relation holds. Obviously, the lower the , the better the quality of the embedding is. It should be noted that the existence of exact (**isometric**) embedding () is not always guaranteed but sometimes possible. Finally, the embeddability of a metric space into another is possible if there exists a mapping with constant distortion. A good introduction on metric embedding can be found in [29].

**Theoretical results on Wasserstein space embedding** Embedding Wasserstein space in normed metric space is still a theoretical and open questions [30]. Most of the theoretical guarantees were obtained with . In the simple case where , there exists an isometric embedding with between two absolutely continuous (*wrt.* the Lebesgue measure) probability measures and given by their by their cumulative distribution functions and , *i.e.* . This fact has been exploited in the computation of sliced Wasserstein distance [7]. Conversely, there is no known isometric embedding for pointsets in , *i.e.* regularly sampled grids in , but best known distortions are between and [10]. Regarding , recent results [2] have shown there does not exist meaningful embedding over with constant approximation. Their results show notably that an embedding of pointsets of size into must incur a distortion of . Regarding our choice of , there does not exist embeddability results up to our knowledge, but we show that, for a population of locally concentrated measures, a good approximation can be obtained with our technique. We now turn to existing methods that consider local linear approximations of the transport problem.

**Linearization of Wasserstein space** Another line of work [39] also considers the Riemannian structure of the Wasserstein space to provide meaningful linearization by projecting onto the tangent space. By doing so, they notably allows for faster computation of pairwise Wasserstein distances (only transport computations instead of with the number of samples in the dataset) and allow for statistical analysis of the embedded data. They proceed by specifying a template element and compute, from particle approximations of the data, linear transport plans with this template element, that allow to derive an embedding used for analysis. Seguy and Cuturi [33] also proposed a similar pipeline, based on velocity field, but without relying on an implicit embedding. It is to be noted that for data in 2D, such as images, the use of cumulative Radon transform also allows for an embedding which can be used for interpolation or analysis [7], by exploiting the exact solution of the optimal transport in 1D through cumulative distribution functions.

Our work is the first to propose to learn a generic embedding rather than constructing it from explicit approximations/transformations of the data and analytical operators such as Riemannian Logarithm maps. As such, our formulation is generic and adapts to any type of data. Finally, since the mapping to the embedded space is constructed explicitly, handling unseen data does not require to compute new optimal transport plans or optimization, yielding extremely fast computation performances, with similar approximation performances.

## 3Deep Wasserstein Embedding (DWE)

### 3.1Wasserstein learning and reconstruction with siamese networks

We discuss here how our method, coined DWE for Deep Wasserstein Embedding, learns in a supervised way a new representation of the data. To this end we need a pre-computed dataset that consists of pairs of histograms of dimensionality and their corresponding Wasserstein distance . One immediate way to solve the problem would be to concatenate the samples and and learn a deep network that predicts . This would work in theory but it would prevent us from interpreting the Wasserstein space and it is not by default symmetric which is a key property of the Wasserstein distance.

Another way to encode this symmetry and to have a meaningful embedding that can be used more broadly is to use a Siamese neural network [9]. Originally designed for metric learning purpose and similarity learning (based on labels), this type of architecture is usually defined by replicating a network which takes as input two samples from the same learning set, and learns a mapping to new space with a contrastive loss. It has mainly been used in computer vision, with successful applications to face recognition [11] or one-shot learning for example [24]. Though its capacity to learn meaningful embeddings has been highlighted in [40], it has never been used, to the best of our knowledge, for mimicking a specific distance that exhibits computation challenges. This is precisely our objective here.

We propose to learn and embedding network that takes as input a histogram and project it in a given Euclidean space of . In practice, this embedding should mirror the geometrical property of the Wasserstein space. We also propose to regularize the computation this of this embedding by adding a reconstruction loss based on a decoding network . This has two important impacts: First we observed empirically that it eases the learning of the embedding and improves the generalization performance of the network (see experimental results) by forcing the embedded representation to catch sufficient information of the input data to allow a good reconstruction. This type of autoencoder regularization loss has been discussed in [42] in the different context of embedding learning. Second, disposing of.a decoder network allows the interpretation of the results, which is of prime importance in several data-mining tasks (discussed in the next subsection).

An overall picture depicting the whole process is given in Figure 1. The global objective function reads

where weights the two data fitting terms and is the Kullbach-Leibler divergence. This choice is motivated by the fact that the Wasserstein metric operates on probability distributions.

### 3.2Wasserstein data mining in the embedded space

Once the functions and have been learned, several data mining tasks can be operated in the Wasserstein space. We discuss here the potential applications of our computational scheme and its wide range of applications on problems where the Wasserstein distance plays an important role. Though our method is not an exact Wasserstein estimator, we empirically show in the numerical experiments that it performs very well and competes favorably with other classical computation strategies.

**Wasserstein barycenters .** Barycenters in Wasserstein space were first discussed by Agueh and Carlier [1]. Designed through an analogy with barycenters in a Euclidean space, the Wasserstein barycenters of a family of measures are defined as minimizers of a weighted sum of squared Wasserstein distances. In our framework, barycenters can be obtained as

where are the data samples and the weights obeys the following constraints: and . Note that when we have only two samples, the barycenter corresponds to a Wasserstein interpolation between the two distributions with and [32]. When the weights are uniform and the whole data collection is considered, the barycenter is the Wasserstein population mean, also known as Fréchet mean [5].

**Principal Geodesic Analysis in Wasserstein space .** PGA, or Principal Geodesic Analysis, has first been introduced by Fletcher et al. [18]. It can be seen as a generalization of PCA on general Riemannian manifolds. Its goal is to find a set of directions, called geodesic directions or principal geodesics, that best encode the statistical variability of the data. It is possible to define PGA by making an analogy with PCA. Let be a set of elements, the classical PCA amounts to *i)* find the mean of the data and subtract it to all the samples *ii)* build recursively a subspace by solving the following maximization problem:

Fletcher gives a generalization of this problem for complete geodesic spaces by extending three important concepts: **variance** as the expected value of the squared Riemannian distance from mean, **Geodesic subspaces** as a portion of the manifold generated by principal directions, and a **projection** operator onto that geodesic submanifold. The space of probability distribution equipped with the Wasserstein metric (, ) defines a geodesic space with a Riemannian structure [32], and an application of PGA is then an appealing tool for analyzing distributional data. However, as noted in [33], a direct application of Fletcher’s original algorithm is intractable because is infinite dimensional and there is no analytical expression for the exponential or logarithmic maps allowing to travel to and from the corresponding Wasserstein tangent space. We propose a novel PGA approximation as the following procedure: *i)* find the approximate Fréchet mean of the data as and subtract it to all the samples *ii)* build recursively a subspace in the embedding space ( being of the dimension of the embedded space) by solving the following maximization problem:

which is strictly equivalent to perform PCA in the embedded space. Any reconstruction from the corresponding subspace to the original space is conducted through . We postpone a detailed analytical study of this approximation to subsequent works, as it is beyond the goals of this paper.

**Other possible methods.** As a matter of facts, several other methods that operate on distributions can benefit from our approximation scheme. Most of those methods are the transposition of their Euclidian counterparts in the embedding space. Among them, clustering methods, such as Wasserstein k-means [14], are readily adaptable to our framework. Recent works have also highlighted the success of using Wasserstein distance in dictionary learning [31] or archetypal Analysis [41].

## 4Numerical experiments

In this section we evaluate the performances of our method on grayscale images normalized as histograms. Images are offering a nice testbed because of their dimensionality and because large datasets are frequently available in computer vision.

### 4.1Architecture for DWE between grayscale images

The framework of our approach as shown in Figure 1 consists of an encoder and a decoder composed as a cascade. The encoder produces the representation of input images . The architecture used for the embedding consists in 2 convolutional layers with ReLu activations: first a convolutional layer of 20 filters with a kernel of size 3 by 3, then a convolutional layer of 5 filters of size 5 by 5. The convolutional layers are followed by two linear dense layers respectively of size 100 and the final layer of size . The architecture for the reconstruction consists in a dense layer of output 100 with ReLu activation, followed by a dense layer of output 5*784. We reshape the layer to map the input of a convolutional layer: we reshape the output vector into a (5,28,28) 3D-tensor. Eventually, we invert the convolutional layers of with two convolutional layers: first a convolutional layer of 20 filters with ReLu activation and a kernel of size 5 by 5, followed by a second layer with 1 filter, with a kernel of size 3 by 3. Eventually the decoder outputs a reconstruction image of shape 28 by 28. In this work, we only consider grayscale images, that are normalized to represent probability distributions. Hence each image is depicted as an histogram. In order to normalize the decoder reconstruction we use a softmax activation for the last layer.

All the dataset considered are handwritten data and hence holds an inherent sparsity. In our case, we cannot promote the output sparsity through a convex L1 regularization because the softmax outputs positive values only and forces the sum of the output to be 1. Instead, we apply a pseudo -norm regularization with on the reconstructed image, which promotes sparse output and allows for a sharper reconstruction of the images [19].

### 4.2MNIST digit dataset

**Dataset and training.** Our first numerical experiment is performed on the well known MNIST digits dataset. This dataset contains images from 10 digit classes In order to create the training dataset we draw randomly one million pairs of indexes from the 60 000 training samples and compute the exact Wasserstein distance for quadratic ground metric using the POT toolbox [17]. All those pairwise distances can be computed in an embarrassingly parallel scheme (1h30 on 1 CPU). Among this million, 700 000 are used for learning the neural network, 200 000 are used for validation and 100 000 pairs are used for testing purposes. The DWE model is learnt on a standard NVIDIA GPU node and takes around 1h20 with a stopping criterion computed from on a validation set.

**Numerical precision and computational performance** The true and predicted values for the Wasserstein distances are given in Fig. ?. We can see that we reach a good precision with a test MSE of 0.4 and a relative MSE of 2e-3. The correlation is of 0.996 and the quantiles show that we have a very small uncertainty with only a slight bias for large values where only a small number of samples is available. This results show that a good approximation of the can be performed by our approach (1e-3 relative error).

W^2_2

Now we investigate the ability of our approach to compute efficiently. To this end we compute the average speed of Wasserstein distance computation on test dataset to estimate the number of computations per second in the Table of Fig. ?. Note that there are 2 ways to compute the with our approach denoted as Indep and Pairwise. This comes from the fact that our computation is basically a squared Euclidean norm in the embedding space. The first computation measures the time to compute the between independent samples by projecting both in the embedding and computing their distance. The second computation aims at computing all the pairwise between two sets of samples and this time one only needs to project the samples once and compute all the pairwise distances, making it more efficient. Note that the second approach would be the one used in a retrieval problem where one would just embed the query and then compute the distance to all or a selection of the dataset to find a Wasserstein nearest neighbor for instance. The speedup achieved by our method is very impressive even on CPU with speedup of x18 and x1000 respectively for Indep and Pairwise. But the GPU allows an even larger speedup of respectively x1000 and x500 000 with respect to a state-of-the-art C compiled Network Flow LP solver of the POT Toolbox [17]. Of course this speed-up comes at the price of a time-consuming learning phase, which makes our method better suited for mining large scale datasets and online applications.

**Wasserstein Barycenters** Next we evaluate our embedding on the task of computing Wasserstein Barycenters for each class of the MNIST dataset. We take 1000 samples per class from the test dataset and compute their uniform weight Wasserstein Barycenter using Equation 3. The resulting barycenters and their Euclidean means are reported in Figure 2. Note that not only those barycenters are sensible but also conserve most of their sharpness which is a problem that occurs for regularized barycenters [36]. The computation of those barycenters is also very efficient since it requires only 20ms per barycenter (for 1000 samples) and its complexity scales linearly with the number of samples.

**Principal Geodesic Analysis** We report in Figure ? the Principal Component Analysis (L2) and Principal Geodesic Analysis (DWE) for 3 classes of the MNIST dataset. We can see that using Wasserstein to encode the displacement of mass leads to more semantic and nonlinear subspaces such as rotation/width of the stroke and global sizes of the digits. This is well known and has been illustrated in [33]. Nevertheless our method allows for estimating the principal component even in large scale datasets and our reconstruction seems to be more detailed compared to [33] maybe because our approach can use a very large number of samples for subspace estimation.

L2 | DWE | L2 | DWE | L2 | DWE |

### 4.3Google doodle dataset

**Datasets** The Google Doodle dataset is a crowd sourced dataset that is freely available from the web^{1}

**Numerical precision and cross dataset comparison** The numerical performances of the learned models on each of the doodle dataset is reported in the diagonal of Table 1. Those datasets are much more difficult than MNIST because they have not been curated and contain a very large variance due to numerous unfinished doodles. An interesting comparison is the cross comparison between datasets where we use the embedding learned on one dataset to compute the on another. The cross performances is given in Table Table 1 and shows that while there is definitively a loss in accuracy of the prediction, this loss is limited between the doodle datasets that all have an important variety. Performance loss across doodle and MNIST dataset is larger because the latter is highly structured and one needs to have a representative dataset to generalize well which is not the case with MNIST.

Network Data | CAT | CRAB | FACE | MNIST |
---|---|---|---|---|

CAT | 1.195 |
1.718 |
2.07 | 12.132 |

CRAB | 2.621 |
0.854 |
3.158 | 10.881 |

FACE | 5.025 |
5.532 | 3.158 |
50.527 |

MNIST | 9.118 | 6.643 | 4.68 |
0.405 |

**Wasserstein interpolation** We first compute the Wasserstein interpolation between four samples of each datasets in Figure Figure 3. Note that these interpolation might not be optimal w.r.t. the objects but we clearly see a continuous displacement of mass that is characteristic of optimal transport. This leads to surprising artefacts for example when the eye of a face fuse with the border while the nose turns into an eye. Also note that there is no reason for a Wasserstein barycenter to be a realistic sample.

Next we qualitatively evaluate the subspace learned by DWE by comparing the Wasserstein interpolation of our approach with the true Wasserstein interpolation estimated by solving the OT linear program and by using regularized OT with Bregman projections [4]. The interpolation results for all those methods and the Euclidean interpolation are available in Figure 4. The LP solver takes a long time (20 sec/interp) and leads to a “noisy” interpolation as already explained in [15]. The regularized Wasserstein barycenter is obtained more rapidly (4 sec/interp) but is also very smooth at the risk of loosing some details, despite choosing a small regularization that prevents numerical problems. Our reconstruction also looses some details due to the Auto-Encoder error but is very fast and can be done in real time (4 ms/interp).

## 5Conclusion and discussion

In this work we presented a computational approximation of the Wasserstein distance suitable for large scale data mining tasks. Our method finds an embedding of the samples in a space where the Euclidean distance emulates the behavior of the Wasserstein distance. Thanks to this embedding, numerous data analysis tasks can be conducted at a very cheap computational price. We forecast that this strategy can help in generalizing the use of Wasserstein distance in numerous applications.

However, while our method is very appealing in practice it still raises a few questions about the theoretical guarantees and approximation quality. First it is difficult to foresee from a given network architecture if it is sufficiently (or too much) complex for finding a successful embedding. It can be conjectured that it is dependent on the complexity of the data at hand and also the locality of the manifold where the data live in. Second, the theoretical existence results on such Wasserstein embedding with constant distortion are still lacking. Future works will consider these questions as well as applications of our approximation strategy on a wider range of ground loss and data mining tasks. Also, we will study the transferability of one database to another to diminish the computational burden of computing Wasserstein distances on numerous pairs for the learning process.

### Footnotes

### References

**Barycenters in the wasserstein space.**

M. Agueh and G. Carlier.*SIAM Journal on Mathematical Analysis*, 43(2):904–924, 2011.**Impossibility of Sketching of the 3D Transportation Metric with Quadratic Cost.**

A. Andoni, A. Naor, and O. Neiman. In*43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016)*, volume 55, pages 83:1–83:14, 2016.**Wasserstein generative adversarial networks.**

M. Arjovsky, S. Chintala, and L. Bottou. In*Proceedings of the 34th International Conference on Machine Learning*, volume 70, pages 214–223, Sydney, Australia, 06–11 Aug 2017.**Iterative Bregman projections for regularized transportation problems.**

J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré.*SISC*, 2015.**Geodesic pca in the wasserstein space by convex pca.**

J. Bigot, R. Gouet, T. Klein, A. López, et al. In*Annales de l’Institut Henri Poincaré, Probabilités et Statistiques*, volume 53, pages 1–26. Institut Henri Poincaré, 2017.**Wasserstein barycentric coordinates: Histogram regression using optimal transport.**

N. Bonneel, G. Peyré, and M. Cuturi.*ACM Trans. Graph.*, 35(4):71:1–71:10, July 2016.**Sliced and radon wasserstein barycenters of measures.**

N. Bonneel, J. Rabin, G. Peyré, and H. Pfister.*Journal of Mathematical Imaging and Vision*, 51(1):22–45, Jan 2015.**Displacement interpolation using Lagrangian mass transport.**

N. Bonneel, M. van de Panne, S. Paris, and W. Heidrich.*ACM Transaction on Graphics*, 30(6), 2011.**Signature verification using a“ siamese” time delay neural network.**

J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. In*Advances in Neural Information Processing Systems*, pages 737–744, 1994.**Similarity estimation techniques from rounding algorithms.**

M. Charikar. In*Proceedings of the Thiry-fourth Annual ACM Symposium on Theory of Computing*, STOC ’02, pages 380–388, 2002.**Learning a similarity metric discriminatively, with application to face verification.**

S. Chopra, R. Hadsell, and Y. LeCun. In*Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on*, volume 1, pages 539–546. IEEE, 2005.**Optimal transport for domain adaptation.**

N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy.*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2017.**Sinkhorn distances: Lightspeed computation of optimal transportation.**

M. Cuturi. In*Advances on Neural Information Processing Systems (NIPS)*, pages 2292–2300, 2013.**Fast computation of Wasserstein barycenters.**

M. Cuturi and A. Doucet. In*ICML*, 2014.**A smoothed dual approach for variational wasserstein problems.**

M. Cuturi and G. Peyré.*SIAM Journal on Imaging Sciences*, 9(1):320–343, 2016.**Blue noise through optimal transport.**

F. de Goes, K. Breeden, V. Ostromoukhov, and M. Desbrun.*ACM Trans. Graph.*, 31(6):171:1–171:11, Nov. 2012.**Pot python optimal transport library.**

R. Flamary and N. Courty. 2017.**Principal geodesic analysis for the study of nonlinear statistics of shape.**

P. T. Fletcher, C. Lu, S. M. Pizer, and S. Joshi.*IEEE Trans. Medical Imaging*, 23(8):995–1005, Aug. 2004.**Recovering sparse signals with a certain family of nonconvex penalties and dc programming.**

G. Gasso, A. Rakotomamonjy, and S. Canu.*IEEE Transactions on Signal Processing*, 57(12):4686–4698, 2009.**Stochastic optimization for large-scale optimal transport.**

A. Genevay, M. Cuturi, G. Peyré, and F. Bach. In*NIPS*, pages 3432–3440, 2016.**Supervised word mover’s distance.**

G. Huang, C. Guo, M. Kusner, Y. Sun, F. Sha, and K. Weinberger. In*Advances in Neural Information Processing Systems*, pages 4862–4870, 2016.**Fast image retrieval via embeddings.**

P. Indyk and N. Thaper. In*3rd International Workshop on Statistical and Computational Theories of Vision*, pages 1–15, 2003.**Nonembeddability theorems via fourier analysis.**

S. Khot and A. Naor.*Mathematische Annalen*, 334(4):821–852, Apr 2006.**Siamese neural networks for one-shot image recognition.**

G. Koch, R. Zemel, and R. Salakhutdinov. In*ICML Deep Learning Workshop*, volume 2, 2015.**The radon cumulative distribution transform and its application to image classification.**

S. Kolouri, S. R. Park, and G. K. Rohde.*IEEE Transactions on Image Processing*, 25(2):920–934, 2016.**Optimal mass transport: Signal processing and machine-learning applications.**

S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde.*IEEE Signal Processing Magazine*, 34(4):43–59, July 2017.**A continuous linear optimal transport approach for pattern analysis in image datasets.**

S. Kolouri, A. Tosun, J. Ozolek, and G. Rohde.*Pattern Recognition*, 51:453 – 462, 2016.**Sliced wasserstein kernels for probability distributions.**

S. Kolouri, Y. Zou, and G. Rohde.*2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, pages 5258–5267, 2016.**Lecture notes on metric embeddings.**

J. Matoušek. Technical report, ETH Zurich, 2013.**Open problems on embeddings of finite metric spaces.**

J. Matoušek and A. Naor. available at http://kam.mff.cuni.cz/~matousek/metrop.ps, 2011.**Fast dictionary learning with a smoothed wasserstein loss.**

A. Rolet, M. Cuturi, and G. Peyré. In*AISTATS*, pages 630–638, 2016.**Introduction to optimal transport theory.**

F. Santambrogio.*Notes*, 2014.**Principal geodesic analysis for probability measures under the optimal transport metric.**

V. Seguy and M. Cuturi. In*Advances in Neural Information Processing Systems*, pages 3312–3320, 2015.**Approximate earth movers distance in linear time.**

S. Shirdhonkar and D. W. Jacobs. In*CVPR*, pages 1 – 8, June 2008.**Convolutional wasserstein distances: Efficient optimal transportation on geometric domains.**

J. Solomon, F. de Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas.*ACM Trans. Graph.*, 34(4):66:1–66:11, July 2015.**Convolutional wasserstein distances: Efficient optimal transportation on geometric domains.**

J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas.*ACM Transactions on Graphics (TOG)*, 34(4):66, 2015.**Parallel streaming wasserstein barycenters.**

M. Staib, S. Claici, J. Solomon, and S. Jegelka.*CoRR*, abs/1705.07443, 2017.*Optimal transport: old and new*.

C. Villani. Grund. der mathematischen Wissenschaften. Springer, 2009.**A linear optimal transportation framework for quantifying and visualizing variations in sets of images.**

W. Wang, D. Slepčev, S. Basu, J. Ozolek, and G. Rohde.*International Journal of Computer Vision*, 101(2):254–269, Jan 2013.**Deep learning via semi-supervised embedding.**

J. Weston, F. Ratle, H. Mobahi, and R. Collobert. In*Neural Networks: Tricks of the Trade*, pages 639–655. Springer, 2012.**Statistical archetypal analysis.**

C. Wu and E. Tabak.*arXiv preprint arXiv:1701.08916*, 2017.**Embedding with autoencoder regularization.**

W. Yu, G. Zeng, P. Luo, F. Zhuang, Q. He, and Z. Shi. In*ECML/PKDD*, pages 208–223. Springer, 2013.