Stochastic Wasserstein Barycenters
Abstract
We present a stochastic algorithm to compute the barycenter of a set of probability distributions under the Wasserstein metric from optimal transport. Unlike previous approaches, our method extends to continuous input distributions and allows the support of the barycenter to be adjusted in each iteration. We tackle the problem without regularization, allowing us to recover a much sharper output. We give examples where our algorithm recovers a more meaningful barycenter than previous work. Our method is versatile and can be extended to applications such as generating super samples from a given distribution and recovering blue noise approximations.
1 Introduction
Several scenarios in machine learning require summarizing a collection of probability distributions with shared structure but individual bias. For instance, multiple sensors might gather data from the same environment with different noise distributions; the samples they collect must be assembled into a single signal. As another example, a dataset might be split among multiple computers, each of which carries out MCMC Bayesian inference for a given model; the resulting “subset posterior” latent variable distributions must be reassembled into a single posterior for the entire dataset. In each case, the summarized whole can be better than the sum of its parts: noise in the input distributions cancels when averaging, while shared structure is reinforced.
The theory of optimal transport (OT) provides a promising and theoretically justified approach to averaging distributions over a geometric domain. OT equips the space of measures with a distance metric known as the Wasserstein distance; the average, or barycenter, of a collection is then defined as a Fréchet mean minimizing the sum of squared Wasserstein distances to the input distributions Agueh and Carlier (2011). This mean is aware of the geometric structure of the underlying space. For example, the Wasserstein barycenter of two Dirac distributions $\delta_x$ and $\delta_y$ supported at points $x$ and $y$ is a single Dirac delta $\delta_{(x+y)/2}$ at the midpoint, rather than the bimodal superposition obtained by averaging algebraically.
If the input distributions are discrete, then the Wasserstein barycenter is computable in polynomial time by solving a large linear program Anderes et al. (2016). Adding entropic regularization yields elegant and efficient approximation algorithms Genevay et al. (2016); Cuturi and Peyré (2016); Cuturi and Doucet (2014); Ye et al. (2017). These and other state-of-the-art methods typically suffer from one or more of a few drawbacks: (1) poor behavior as regularization decreases, (2) required access to the distribution functions rather than sampling machinery, and/or (3) a fixed discretization on which the input or output distribution is supported, chosen without knowledge of the barycenter's structure.
Given sample access to distributions $\mu_1, \ldots, \mu_J$, we propose an algorithm that iteratively refines an approximation to the true Wasserstein barycenter. The support of our barycenter is adjusted in each iteration, adapting to the geometry of the desired output. Unlike most existing OT algorithms, we tackle the problem without regularization, yielding a sharp result. Experiments show that the support of our barycenter is contained (to tolerance) within the support of the true barycenter, even though we use stochastic optimization rather than computational geometry.
Contributions.
We give a straightforward parallelizable stochastic algorithm to approximate and sample from the Wasserstein barycenter of a collection of distributions, which does not rely on regularization to make the problem tractable. We only employ samplers from the input distributions, and our technique is not restricted to input or output distributions supported on a fixed set of points. We verify convergence properties and showcase examples where our approach is inherently more suitable than competing approaches that require a fixed support.
2 Related Work
OT has made significant inroads in computation and machine learning; see Lévy and Schwindt (2017); Peyré and Cuturi (2018); Solomon (2018) for surveys. Although most algorithms we highlight approximate OT distances rather than barycenters, they serve as potential starting points for barycenter computation.
Cuturi (2013) renewed interest in OT in machine learning through introduction of entropic regularization. The resulting Sinkhorn algorithm is compact and efficient; it has been extended to barycenter problems through gradient descent Cuturi and Doucet (2014) or iterative projection Benamou et al. (2015). Improvements for structured instances enhance Sinkhorn’s efficiency, e.g. via fast convolution Solomon et al. (2015) or multiscale approximation Schmitzer (2016).
Our technique, however, is influenced more by semidiscrete methods, which compute OT distances to distributions supported on a finite set of points. Semidiscrete OT is equivalent to computing a power diagram Aurenhammer (1987); Aurenhammer et al. (1992), a generalization of a Voronoi diagram whose cells receive the mass transported to each support point. Algorithms by Mérigot (2011) in 2D and Lévy (2015) in 3D use computational geometry to extract gradients for the dual semidiscrete problem; Kitagawa et al. (2016a) accelerate convergence via a second-order Newton method. Similar to our technique, De Goes et al. (2012) move the support of a discrete approximation to a distribution to reduce Wasserstein distance.
Recent stochastic techniques target learning applications. Genevay et al. (2016) propose a scalable stochastic algorithm based on the dual of the entropically regularized problem; they are among the first to consider the setting of sample-based access to distributions, but they rely on entropic regularization to smooth out the problem and approximate OT distances rather than barycenters. Staib et al. (2017) propose a stochastic barycenter algorithm from samples, but a finite, fixed set of support points must be provided a priori. Arjovsky et al. (2017) incorporate a coarse stochastic approximation of the 1-Wasserstein distance into a generative adversarial network (GAN); the 1-Wasserstein distance typically is not suitable for barycenter computation.
Further machine learning applications range from supervised learning to Bayesian inference. Schmitz et al. (2017) leverage OT theory for dictionary learning. Carrière et al. (2017) apply the Wasserstein distance to point cloud segmentation by developing a notion of distance on topological persistence diagrams. Courty et al. (2016) utilize the optimal transport plan for transfer learning on different domains. Srivastava et al. (2015a, b) use the Wasserstein barycenter to approximate the posterior distribution of a full dataset by the barycenter of the posteriors on smaller subsets; their method provably recovers the full posterior as the number of subsets increases.
3 Background and Preliminaries
Let $(X, d)$ be a metric space, and let $\mathcal{P}(X)$ be the space of probability measures on $X$ with finite second moment. Given two measures $\mu, \nu \in \mathcal{P}(X)$, the squared 2-Wasserstein distance between $\mu$ and $\nu$ is given by
$$W_2^2(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times X} d(x, y)^2 \, \mathrm{d}\gamma(x, y). \qquad (1)$$
Here, $\Gamma(\mu, \nu)$ is the set of measure couplings between $\mu$ and $\nu$:
$$\Gamma(\mu, \nu) = \left\{ \gamma \in \mathcal{P}(X \times X) : (\pi_1)_\# \gamma = \mu, \ (\pi_2)_\# \gamma = \nu \right\},$$
where $\pi_1$ and $\pi_2$ are the two projections of $X \times X$, and the pushforward of a measure $\mu$ through a measurable map $f$ is defined by $f_\# \mu(A) = \mu(f^{-1}(A))$ for any set $A$ in the $\sigma$-algebra of the target space. The 2-Wasserstein distance $W_2$ is a metric on $\mathcal{P}(X)$.
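For concreteness (a toy illustration, not part of the paper's method): in one dimension, the optimal coupling between two empirical measures with equally many uniformly weighted samples simply pairs the samples in sorted order, so (1) can be evaluated directly. The helper name below is ours:

```python
import numpy as np

def w2_squared_1d(x, y):
    """Squared 2-Wasserstein distance between two 1D empirical measures
    with the same number of uniformly weighted samples. In 1D the optimal
    coupling pairs the samples in sorted order."""
    xs = np.sort(np.asarray(x, dtype=float))
    ys = np.sort(np.asarray(y, dtype=float))
    assert xs.shape == ys.shape, "equal sample counts assumed"
    return float(np.mean((xs - ys) ** 2))

# Two Dirac masses at 0 and 2: W_2^2 = 4.
print(w2_squared_1d([0.0], [2.0]))
```

Consistent with the Dirac example from the introduction, the point $m$ minimizing `w2_squared_1d([m], [0.0]) + w2_squared_1d([m], [2.0])` is the midpoint $m = 1$.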
For measures $\mu_1, \ldots, \mu_J \in \mathcal{P}(X)$, we can define the Wasserstein barycenter as the minimizer of the functional
$$F[\nu] = \frac{1}{J} \sum_{j=1}^{J} W_2^2(\mu_j, \nu). \qquad (2)$$
When the input measures are discrete distributions, (2) is a linear program solvable in polynomial time.
If at least one of the measures $\mu_j$ is absolutely continuous with respect to the Lebesgue measure, then (2) admits a unique minimizer $\bar\nu$ Agueh and Carlier (2011); Santambrogio (2015). However, $\bar\nu$ will then also be absolutely continuous, implying that computational systems typically can only find an inexact finite approximation.
We study a discretization of this problem. Suppose $\Sigma = \{x_1, \ldots, x_n\} \subset X$ consists of $n$ points, let $\nu_\Sigma = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$ be the uniform measure on $\Sigma$, and define the functional
$$F[\Sigma] = \frac{1}{J} \sum_{j=1}^{J} W_2^2(\mu_j, \nu_\Sigma). \qquad (3)$$
We define the main problem.
Problem 1 (Semidiscrete approximation).
Find a minimizer $\Sigma$ of $F[\Sigma]$ subject to the constraints $|\Sigma| = n$ and $\Sigma \subset X$.
Solving Problem 1 for a single input measure is equivalent to finding the optimal $n$-point approximation to the input measure. We can use the solution as a set of supersamples from the input Chen et al. (2010), or, if the input distribution is a grayscale image, the solution yields a blue noise approximation to the image De Goes et al. (2012).
4 Mathematical Formulation
The OT problem (1) admits an equivalent dual problem
$$W_2^2(\mu, \nu) = \max_{\varphi \in L^1(\nu)} \int_X \varphi(y) \, \mathrm{d}\nu(y) + \int_X \varphi^c(x) \, \mathrm{d}\mu(x), \qquad (4)$$
where $\varphi$ is the Kantorovich potential and $\varphi^c(x) = \inf_{y \in X} \left[ d(x, y)^2 - \varphi(y) \right]$ is the $c$-transform of $\varphi$ Santambrogio (2015); Villani (2009).
Following Santambrogio (2015), if $\nu = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$ is a finite measure supported on $\Sigma$, then (4) becomes
$$W_2^2(\mu, \nu) = \max_{w \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^{n} w_i + \int_X w^c(x) \, \mathrm{d}\mu(x), \qquad (5)$$
where $w^c(x) = \min_i \left[ d(x, x_i)^2 - w_i \right]$. The key observation is that the function $\varphi$ is replaced with a finite-dimensional vector $w \in \mathbb{R}^n$.
With this formula in mind, define
$$\mathcal{F}[w, \Sigma] = \frac{1}{n} \sum_{i=1}^{n} w_i + \int_X \min_i \left[ d(x, x_i)^2 - w_i \right] \mathrm{d}\mu(x). \qquad (6)$$
Note that constant shifts in the $w_i$ do not change the value of $\mathcal{F}$: adding a constant to every $w_i$ adds it to the first term and subtracts it from the second. $\mathcal{F}$ has a simple derivative with respect to the $w_i$'s:
$$\frac{\partial \mathcal{F}}{\partial w_i} = \frac{1}{n} - \mu\!\left( V_i^w \right), \qquad (7)$$
where $V_i^w$ is the power cell of point $x_i$:
$$V_i^w = \left\{ x \in X : d(x, x_i)^2 - w_i \le d(x, x_k)^2 - w_k \ \text{for all } k \right\}.$$
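Membership in a power cell is just a weighted nearest-neighbor test; a small sketch, with an illustrative helper name of our own:

```python
import numpy as np

def power_cell_index(x, points, weights):
    """Index i of the power cell containing x: argmin_i ||x - x_i||^2 - w_i.
    With equal weights this reduces to the ordinary Voronoi assignment."""
    scores = np.sum((points - x) ** 2, axis=1) - weights
    return int(np.argmin(scores))

pts = np.array([[0.0, 0.0], [1.0, 0.0]])
x = np.array([0.4, 0.0])
i0 = power_cell_index(x, pts, np.zeros(2))           # Voronoi: closer to the first point
i1 = power_cell_index(x, pts, np.array([0.0, 0.5]))  # raising w_2 grows the second cell
print(i0, i1)
```

Raising a point's weight enlarges its cell, which is exactly the mechanism the ascent step below uses to equalize cell masses.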
From here on we work with compact subsets $X \subset \mathbb{R}^d$ of Euclidean space endowed with the Euclidean metric, $d(x, y) = \|x - y\|_2$. To differentiate with respect to the $x_i$'s, notice that the first term in equation (6) does not depend on the positions of the points. We rewrite the second term as
$$\int_X \min_i \left[ \|x - x_i\|^2 - w_i \right] \mathrm{d}\mu(x) = \sum_{i=1}^{n} \int_{V_i^w} \left( \|x - x_i\|^2 - w_i \right) \mathrm{d}\mu(x).$$
Using Reynolds' transport theorem to differentiate while accounting for boundary terms shows
$$\nabla_{x_i} \mathcal{F} = 2 \left( x_i \, \mu\!\left( V_i^w \right) - \int_{V_i^w} x \, \mathrm{d}\mu(x) \right). \qquad (8)$$
Equation (7) confirms the intuition that, at optimality, each cell contains exactly the mass $1/n$ of its associated source point. We will leverage (8) to design a fixed-point iteration that moves each point to the barycenter of its power cell.
Each subproblem of (3) admits a different Kantorovich potential $w^{(j)}$, giving the following optimization functional:
$$\mathcal{F}\!\left[ \{ w^{(j)} \}_{j=1}^{J}, \Sigma \right] = \frac{1}{J} \sum_{j=1}^{J} \left[ \frac{1}{n} \sum_{i=1}^{n} w_i^{(j)} + \int_X \min_i \left( \|x - x_i\|^2 - w_i^{(j)} \right) \mathrm{d}\mu_j(x) \right]. \qquad (9)$$
Define the power cell of $x_i$ with respect to measure $\mu_j$ as
$$V_i^{(j)} = \left\{ x \in X : \|x - x_i\|^2 - w_i^{(j)} \le \|x - x_k\|^2 - w_k^{(j)} \ \text{for all } k \right\}.$$
With this notation in place, the partial derivatives are
$$\frac{\partial \mathcal{F}}{\partial w_i^{(j)}} = \frac{1}{J} \left( \frac{1}{n} - \mu_j\!\left( V_i^{(j)} \right) \right), \qquad \nabla_{x_i} \mathcal{F} = \frac{2}{J} \sum_{j=1}^{J} \left( x_i \, \mu_j\!\left( V_i^{(j)} \right) - \int_{V_i^{(j)}} x \, \mathrm{d}\mu_j(x) \right). \qquad (10)$$
5 Optimization
With our optimization objective in place, we now introduce our barycenter algorithm. To simplify nomenclature, from here on we refer to the dual potentials $w^{(j)}$ as weights on the generalized Voronoi diagram. Our overall strategy is an alternating optimization of $\mathcal{F}$ in (9):
- For fixed point positions, $\mathcal{F}$ is concave in the weights and is optimized using stochastic gradient ascent.
- For fixed weights, we apply a single fixed-point iteration akin to Lloyd's algorithm Lloyd (1982).
5.1 Estimating Gradients
Each of $\mu_j(V_i^{(j)})$ and $\int_{V_i^{(j)}} x \, \mathrm{d}\mu_j(x)$ can be expressed as an expectation of a simple function with respect to $\mu_j$. We estimate these quantities by a simple Monte Carlo scheme.
In more detail, we can rewrite them as
$$\mu_j\!\left( V_i^{(j)} \right) = \mathbb{E}_{x \sim \mu_j}\!\left[ \mathbb{1}_{V_i^{(j)}}(x) \right], \qquad \int_{V_i^{(j)}} x \, \mathrm{d}\mu_j(x) = \mathbb{E}_{x \sim \mu_j}\!\left[ x \, \mathbb{1}_{V_i^{(j)}}(x) \right].$$
Here, $\mathbb{1}_S$ denotes the indicator function of a set $S$.
Since we have sample access to each $\mu_j$, the expectations can be approximated by drawing $K$ points $x_1, \ldots, x_K \sim \mu_j$ independently and computing
$$\mu_j\!\left( V_i^{(j)} \right) \approx \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}_{V_i^{(j)}}(x_k), \qquad \int_{V_i^{(j)}} x \, \mathrm{d}\mu_j(x) \approx \frac{1}{K} \sum_{k=1}^{K} x_k \, \mathbb{1}_{V_i^{(j)}}(x_k). \qquad (11)$$
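The estimators in (11) amount to binning samples by power cell and averaging; a minimal NumPy sketch under the Euclidean cost (all names are illustrative):

```python
import numpy as np

def cell_statistics(samples, points, weights):
    """Monte Carlo estimates, as in (11), of each power cell's mass and of
    the integral of x over each cell, from samples of one input measure."""
    # Assign every sample to its power cell: argmin_i ||x - x_i||^2 - w_i.
    d2 = ((samples[:, None, :] - points[None, :, :]) ** 2).sum(-1) - weights[None, :]
    idx = d2.argmin(axis=1)
    n_pts, n_smp = len(points), len(samples)
    mass = np.bincount(idx, minlength=n_pts) / n_smp
    moment = np.zeros_like(points)  # estimates of the integral of x over V_i
    for i in range(n_pts):
        moment[i] = samples[idx == i].sum(axis=0) / n_smp
    return mass, moment

# Uniform grid standing in for samples of the uniform measure on [0,1]^2.
g = np.linspace(0.005, 0.995, 100)
X, Y = np.meshgrid(g, g)
samples = np.stack([X.ravel(), Y.ravel()], axis=1)
pts = np.array([[0.25, 0.5], [0.75, 0.5]])
mass, moment = cell_statistics(samples, pts, np.zeros(2))
print(mass)                  # each cell captures half the mass
print(moment[0] / mass[0])   # cell barycenter, near (0.25, 0.5)
```

With true i.i.d. samples in place of the grid, the same estimates hold up to Monte Carlo error.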
5.2 Concave Maximization
The first step in our alternating optimization maximizes over the weights while the points are fixed. We call this step of the algorithm an ascent step.
For a fixed set of points, the functional $\mathcal{F}$ is concave in the weights $w^{(j)}$, since it is the dual of the convex semidiscrete transport problem. To solve for the weights, we perform gradient ascent using the formula in (10), where $\mu_j(V_i^{(j)})$ is approximated via (11). Note that the gradient for a set of weights $w^{(j)}$ only requires computation of the cell masses of a single measure $\mu_j$, implying that the ascent steps can be decoupled across different measures.
Write $w_0^{(j)}$ for the initial iterate. The simplest version of our algorithm updates
$$w_{t+1}^{(j)} = w_t^{(j)} + \gamma \, \nabla_{w^{(j)}} \mathcal{F}\!\left( w_t^{(j)} \right).$$
Here each $w^{(j)}$ is an $n$-dimensional vector representing a weight for each point. The iterates converge when each point's power cell contains equal mass $1/n$.
$\mathcal{F}$ has a known Hessian as a function of the $w_i^{(j)}$ that can be used in Newton's algorithm Kitagawa et al. (2016b). Computing the Hessian, however, is only possible with access to the density functions of the $\mu_j$'s, as it requires computing a density of the measure on the boundary between two power cells. The boundary set is inherently lower dimensional than the problem space, and hence sample access to the $\mu_j$ is insufficient. Moreover, even if we had access to the probability density functions, computing the Hessian would require the Delaunay triangulation of the point set, which is expensive in more than two dimensions.
In any event, choosing the step size is important for convergence. Line search is difficult, as we do not have access to the true objective value at each iterate. Instead, we rely on Nesterov acceleration to improve performance Nesterov (1983). With acceleration, our iterates are
$$w_{t+1}^{(j)} = \tilde{w}_t^{(j)} + \gamma \, \nabla_{w^{(j)}} \mathcal{F}\!\left( \tilde{w}_t^{(j)} \right) \qquad (12)$$
$$\tilde{w}_{t+1}^{(j)} = w_{t+1}^{(j)} + \beta_t \left( w_{t+1}^{(j)} - w_t^{(j)} \right) \qquad (13)$$
In our experiments, we use a fixed step size $\gamma$ and a standard momentum schedule $\beta_t$. Convergence of the accelerated gradient method can be shown when $\gamma \le 1/L$, where $L$ is the Lipschitz constant of $\nabla_{w^{(j)}} \mathcal{F}$; in §6, we give an estimate of this constant. Our convergence criterion for this step is that the norm of the estimated gradient fall below a fixed tolerance.
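A compact sketch of the ascent step for a single input measure, assuming the Euclidean cost and the uniform target mass $1/n$ per cell. Plain momentum stands in for the Nesterov schedule in (12)–(13), and the step size and momentum values are illustrative choices, not the paper's:

```python
import numpy as np

def ascend_weights(samples, points, iters=200, step=0.02, beta=0.5):
    """Momentum gradient ascent on the dual weights for one measure.
    The gradient of (6) in w_i is 1/n - mu(V_i), estimated here by the
    fraction of samples falling in each power cell."""
    n = len(points)
    w = np.zeros(n)
    v = np.zeros(n)
    mass = np.full(n, 1.0 / n)
    for _ in range(iters):
        # Power-cell assignment: argmin_i ||x - x_i||^2 - w_i.
        d2 = ((samples[:, None, :] - points[None, :, :]) ** 2).sum(-1) - w[None, :]
        mass = np.bincount(d2.argmin(axis=1), minlength=n) / len(samples)
        grad = 1.0 / n - mass          # ascent direction: equalize cell masses
        v = beta * v + step * grad
        w = w + v
    return w, mass

# Uniform measure on [0,1] with points at 0.1 and 0.2: the ascent raises
# the left point's weight until each cell captures half the mass.
samples = np.linspace(0.0005, 0.9995, 1000)[:, None]
pts = np.array([[0.1], [0.2]])
w, mass = ascend_weights(samples, pts)
print(mass)
```

Since only constant differences between weights matter, the iterates are free to drift by a common shift without affecting the cells.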
5.3 Fixed Point Iteration
The second step of our optimization is a fixed point iteration on the point positions. This step is similar to the point update in a $k$-means algorithm in that it snaps points to the centers of local cells, and we refer to it as a snap step.
To derive the iteration, we set the second gradient in (10) to zero:
$$\sum_{j=1}^{J} \left( x_i \, \mu_j\!\left( V_i^{(j)} \right) - \int_{V_i^{(j)}} x \, \mathrm{d}\mu_j(x) \right) = 0,$$
which leads to the point update
$$x_i = \frac{\sum_{j=1}^{J} \int_{V_i^{(j)}} x \, \mathrm{d}\mu_j(x)}{\sum_{j=1}^{J} \mu_j\!\left( V_i^{(j)} \right)}. \qquad (14)$$
This suggests a fixed point iteration for the $x_i$'s that can be decomposed into the following steps:
- First, find the barycenter of the power cell of each $x_i$ with respect to each $\mu_j$.
- Then, average these barycenters with weights given by the mass $\mu_j(V_i^{(j)})$ of each measure in the cell.
If the concave maximization has converged appropriately, so that uniform cell masses have been achieved, then the update step becomes a uniform average over the barycenters with respect to each measure.
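A sketch of one snap step implementing (14) under the same sample-access model, with one dual weight vector per input measure (all names illustrative):

```python
import numpy as np

def snap_step(points, weight_sets, sample_sets):
    """Move each x_i to the mass-weighted average of its power-cell
    barycenters across the input measures, as in update (14)."""
    n = len(points)
    num = np.zeros_like(points)  # Monte Carlo estimate of sum_j of the cell moments
    den = np.zeros(n)            # Monte Carlo estimate of sum_j of the cell masses
    for w, samples in zip(weight_sets, sample_sets):
        d2 = ((samples[:, None, :] - points[None, :, :]) ** 2).sum(-1) - w[None, :]
        idx = d2.argmin(axis=1)
        for i in range(n):
            sel = samples[idx == i]
            num[i] += sel.sum(axis=0) / len(samples)
            den[i] += len(sel) / len(samples)
    # Keep the old position if a cell captured no samples at all.
    safe = np.maximum(den, 1e-12)[:, None]
    return np.where(den[:, None] > 0, num / safe, points)

# Two 1D uniform measures centered at -1 and +1; a single point snaps to 0.
mu1 = np.linspace(-1.5, -0.5, 101)[:, None]
mu2 = np.linspace(0.5, 1.5, 101)[:, None]
new_pts = snap_step(np.array([[0.3]]), [np.zeros(1), np.zeros(1)], [mu1, mu2])
print(new_pts)
```

With a single point, each cell covers the whole domain and the update reduces to averaging the two measures' means, matching the uniform-average case described above.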
5.4 Global and Local Strategies
The ascent and snap steps can be used to refine a configuration of $n$ points. Once the iterates converge, we have an $n$-point approximation to the barycenter that can be used as an initialization for an $(n+1)$-point approximation in two ways. A new point $x_{n+1}$ is sampled uniformly from $X$, and then we have a choice between (1) moving all $n+1$ points, including the new one, or (2) allowing only $x_{n+1}$ to move.
These two approaches are codified in Algorithm 1, where the choice of the set of movable points dictates which points move. The number of iterations of the outer loop is fixed beforehand. Typically, we see convergence in a small number of steps, and empirically, we observe good performance even with very few outer iterations. The two most natural choices for the movable set are all points and only the newly added point. If the barycenter is absolutely continuous with respect to the underlying Lebesgue measure, these two strategies converge at the same rate asymptotically Brancolini et al. (2009). The latter, however, can generate spurious samples that are not in the support of the barycenter. Note that optimizing the weights remains a global problem regardless, as moving or introducing just one point can change the volumes of the power cells of neighboring points.
Both algorithms are highly parallelizable, since (1) the gradient estimates are expectations computed using Monte Carlo integration and (2) the gradient step in the weights decouples across distributions.
6 Analysis
We justify the use of uniform finitely supported measures, and then show that our algorithm monotonically decreases the objective under mild assumptions.
We assume in this section that at least one of the distributions is absolutely continuous with respect to the Lebesgue measure, ensuring a unique Wasserstein barycenter.
6.1 Approximation Suitability
The simplest approach for absolutely continuous measures is to sample points from each of the measures and solve for the true barycenter of the resulting empirical distributions Anderes et al. (2016). This approach approximates the barycenter increasingly well as the number of samples increases, but requires the solution of a linear program whose number of variables grows with the number of samples. As an alternative, Staib et al. (2017) propose a stochastic problem for approximating barycenters. They are able to prove a rate of convergence, but the support of their approximate barycenter is fixed to a finite set of points.
Our technique allows the support points to move during the optimization procedure, empirically allowing a better approximation of the barycenter with fewer points. The following theoretical result shows that the use of uniform measures supported on a finite set of points can approximate the barycenter arbitrarily well:
Theorem (Metric convergence, Kloeckner (2012); Brancolini et al. (2009)).
Suppose $\nu_n$ is a uniform measure supported on $n$ points that minimizes $F$, and let $\bar\nu$ denote the true barycenter of the measures $\mu_1, \ldots, \mu_J$. Then $W_2(\nu_n, \bar\nu) = O(n^{-1/d})$, where the constant depends on the underlying space $X$, the dimension $d$, and the metric.
Note that this implies weak convergence of $\nu_n$ to $\bar\nu$, since the Wasserstein distance metrizes weak convergence Villani (2009). Brancolini et al. (2009) also show asymptotic equivalence of the local and global algorithms.
While we cannot guarantee that our method converges to the global minimizer of (3), these properties indicate that such a minimizer provides an effective approximant to the true barycenter as the number of support points $n \to \infty$.
6.2 Algorithmic Properties
The functional $\mathcal{F}$ is concave in the weights for fixed point positions, and in fact strictly concave up to constant shifts. We can therefore investigate the convergence properties of the gradient ascent step of the algorithm. However, we will show that the gradient of $\mathcal{F}$ is not necessarily Lipschitz continuous.
Counterexample.
Assume $X$ is a compact subset of $\mathbb{R}^d$. There are measures $\mu$ for which the gradient of $\mathcal{F}$ is not Lipschitz continuous. Moreover, a set of weights satisfying $\nabla_w \mathcal{F} = 0$ may not exist, and if it does, it may not be unique.
Construction.
We provide a counterexample in one dimension. Let $X \subset \mathbb{R}$ with the standard metric, fix two distinct point positions, and let $\mu$ place an atom on the bisector of the two points. An arbitrarily small perturbation of the weights then moves the atom from one power cell to the other, so the gradient changes by a fixed amount while the change in the weights tends to zero; hence the gradient is not Lipschitz.
Nonexistence is shown in Figure 1. To see nonuniqueness, take $\mu$ discrete as before: the cell masses $\mu(V_i^w)$ are then piecewise constant in $w$, so an entire range of weights attains the maximum of $\mathcal{F}$. ∎
For mildly behaved measures, however, the gradient of $\mathcal{F}$ with respect to $w$ is Lipschitz continuous:
Lemma.
Assume $X$ is a compact subset of $\mathbb{R}^d$, and $\mu$ is absolutely continuous with respect to the Lebesgue measure, with density function $f$. If the points of $\Sigma$ are distinct and $f \le C$ almost everywhere for some constant $C$, then
$$\left\| \nabla_w \mathcal{F}(w) - \nabla_w \mathcal{F}(w') \right\| \le \frac{C \, A}{\delta} \left\| w - w' \right\|,$$
where $A$ denotes the surface area of $X$ and $\delta$ denotes the minimum pairwise distance between points in $\Sigma$.
Proof.
Consider the $i$th component of the gradient difference:
$$\left| \frac{\partial \mathcal{F}}{\partial w_i}(w) - \frac{\partial \mathcal{F}}{\partial w_i}(w') \right| = \left| \mu\!\left( V_i^{w'} \right) - \mu\!\left( V_i^{w} \right) \right| \le C \, \mathrm{vol}\!\left( V_i^{w} \,\triangle\, V_i^{w'} \right) \le \frac{C \, A}{\delta} \left\| w - w' \right\|.$$
The second inequality follows because the area of a power cell's boundary is bounded by $A$ and the faces of the cells move at a rate linear in the change of the weights. This rate depends on the distance between the points, which is why the constant $\delta$ is required. The Lipschitz bound follows directly from considering all components of the gradient difference together. ∎
This lemma implies convergence of gradient ascent for a step size equal to the inverse of the Lipschitz constant. While the above requires absolute continuity of $\mu$, we have found that our ascent steps, and our method as a whole, often converge even when this assumption is not satisfied (see Figures 3 and 4).
We may also show that our algorithm monotonically decreases the objective $F$ (defined in equation (3)) after each pair of snap and then ascent steps, for compact domain $X$ and absolutely continuous $\mu_j$. For this purpose, recall that the transport cost for a map $T$ sending measure $\mu$ to $T_\# \mu$ is
$$C(T) = \int_X \| x - T(x) \|^2 \, \mathrm{d}\mu(x).$$
Fixing the power cells after an ascent step, we may define $C_j(\Sigma)$ to be the transport cost for the map sending the power cells of $\mu_j$ to the point set $\Sigma$, and we may define $C(\Sigma) = \frac{1}{J} \sum_{j} C_j(\Sigma)$ to be the joint (average) transport cost. Letting $\Sigma'$ denote the new positions after a snap step, we may now show:
Lemma.
For $X$ compact, and $\mu_j$ absolutely continuous with respect to the Lebesgue measure for all $j$:
$$F[\Sigma'] \le C(\Sigma') \le C(\Sigma) = F[\Sigma].$$
Proof.
By strong duality, we have the following equality for each $j$ when the weights $w^{(j)}$ have been optimized after an ascent step:
$$W_2^2\!\left( \mu_j, \nu_\Sigma \right) = C_j(\Sigma).$$
This implies that $F[\Sigma] = C(\Sigma)$, as $C_j(\Sigma)$ is simply the optimal transport cost. We now argue that $C(\Sigma') \le C(\Sigma)$. We may split up the integrals for transport cost over the power cells corresponding to each $i$th point. We differentiate with respect to $x_i$ to find the point with lowest joint transport cost to the cells $V_i^{(j)}$. Setting this derivative to zero yields
$$x_i = \frac{\sum_{j} \int_{V_i^{(j)}} x \, \mathrm{d}\mu_j(x)}{\sum_{j} \mu_j\!\left( V_i^{(j)} \right)}.$$
Note that this is equivalent to the barycenter update step in equation (14), and with convergence of the previous ascent step, the cell masses are uniform. This demonstrates that snapping to the uniform average of barycenters lowers the joint transport cost, so $C(\Sigma') \le C(\Sigma)$. The first inequality follows because the next ascent step finds the optimal transport, whose cost cannot exceed that of the current map. ∎
Since the joint transport cost is nonnegative, this implies that our objective function decreases at every step. It does not imply that our iterates converge, as there may not be a unique minimizing point configuration (see Figure 2). Empirically, our iterates converge in all of our test cases. We note also that our update formula bears some resemblance to the mean-shift algorithm and to Lloyd's algorithm, both of which are also known to converge under some assumptions Li et al. (2007); Bottou and Bengio (1995).
7 Experiments
We showcase the versatility of our method on several applications. We typically use between 16K and 256K samples per input distribution to approximate the power cell density and barycenter; the range reflects differing problem sizes and dimensionality of the input measures. We stop the gradient ascent step when the estimated gradient is sufficiently small. The snap step empirically converges in under 20 iterations, and several of our examples use only one step.
7.1 Distributions with Sharp Features
Our algorithm is wellsuited to problems where the input distributions have very sharp features. We test against the algorithms in Staib et al. (2017) and Solomon et al. (2015) on two test cases: ten uniform distributions over lines in the 2D plane (Figure 3), and 20 uniform distributions over ellipses (Figure 4).
The results in Figures 3 and 4 show that our barycenter is more sharply supported than those of competing methods. Our output broadly agrees with that of Solomon et al. (2015), but our results more closely match the expected behavior: we strongly suspect that the true barycenter in Figure 3 is also a uniform measure on a line, while that in Figure 4 is supported on a circle centered at the origin.
7.2 The Case $J = 2$
In the case of two input measures $\mu_1$ and $\mu_2$, we expect the barycenter to be McCann's interpolant Agueh and Carlier (2011); McCann (1997):
$$\bar\nu = \left( \tfrac{1}{2} \left( \mathrm{id} + T \right) \right)_\# \mu_1,$$
where $T$ is the optimal map pushing $\mu_1$ forward to $\mu_2$, and $\#$ denotes the pushforward of a measure.
We test this on two uniform distributions on unit squares in Figure 5. The optimal map in this case translates the entire distribution along a straight line. As expected from McCann's interpolant, we recover a uniform distribution on a unit square halfway between the two input distributions. We show our results alongside those of Staib et al. (2017). Notice that their output barycenter is not uniform and has nonzero mass outside the support of the true barycenter.
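As a sanity check on this case (a toy sketch, not the paper's algorithm): in one dimension the optimal map between empirical measures is monotone, so McCann's interpolant can be computed by sorting. Two uniform measures, on $[0,1]$ and on $[2,3]$, give a midpoint interpolant uniform on $[1,2]$:

```python
import numpy as np

def mccann_interpolant_1d(x, y, t=0.5):
    """Samples of the displacement interpolant ((1-t) id + t T)_# mu_1 for
    1D empirical measures: the optimal map T pairs sorted samples."""
    xs = np.sort(np.asarray(x, dtype=float))
    ys = np.sort(np.asarray(y, dtype=float))
    return (1.0 - t) * xs + t * ys

a = np.linspace(0.0, 1.0, 101)   # uniform on [0, 1]
b = np.linspace(2.0, 3.0, 101)   # uniform on [2, 3]
mid = mccann_interpolant_1d(a, b)
print(mid.min(), mid.max())      # interpolant spans [1, 2]
```

The interpolant's samples stay evenly spaced, i.e. the halfway measure is again uniform, mirroring the unit-square experiment.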
7.3 The Case $J = 1$
The case of a single input measure bears interest as well. There are instances when sampling i.i.d. from a distribution yields samples that do not approximate the underlying distribution accurately. We showcase two applications: generating supersamples from distributions, and approximating grayscale images through blue noise.
7.3.1 Blue Noise
The term blue noise refers to an unstructured but even and isotropic distribution of points. It has been used in image dithering as it captures image intensity via local point density, without the need for varying point sizes as in halftoning.
De Goes et al. (2012) describe the link between optimal transport and blue noise generation. We recover a stochastic version of their algorithm by taking a discrete distribution over the image pixels with mass proportional to intensity. As our method is more general, we incur some performance loss, but the output is of comparable quality (Figure 7).
7.3.2 Super Samples
Our method can be adapted to generate supersamples from complex distributions Chen et al. (2010). Figure 6 details our results on a mixture of ten Gaussians. Our method better approximates the shape of the underlying distribution due to negative autocorrelation: points move away from oversampled regions. Points drawn i.i.d. from the mixture tend to oversample the larger modes and do not trace density contours as well.
8 Conclusion
We have proposed an algorithm for computing the Wasserstein barycenter of continuous measures using only samples from the input distributions. The algorithm decomposes into a concave maximization and a fixed-point iteration similar to the mean-shift and $k$-means algorithms. Our algorithm is easy to implement and parallelize, and it does not rely on a fixed discretization of the support. This allows us to recover much sharper approximations to the barycenter than previous methods, and our approach is general and versatile enough to be applied to problems beyond barycenter computation.
There are several avenues for future work. Solving the concave maximization problem is currently a bottleneck for our algorithm, as we do not have access to the true function value or the Hessian, but we believe multiscale methods can be adapted to our approach. The potential applications of this method also extend beyond those covered here; one we highlight is developing coresets that minimize the Wasserstein distance to the empirical distribution on the input data.
Acknowledgements
The authors acknowledge the generous support of Army Research Office grant W911NF12R0011 (“Smooth Modeling of Flows on Graphs”), from the MIT Research Support Committee, from the MIT–IBM Watson AI Lab, from the Skoltech–MIT Next Generation Program, and from an Amazon Research Award.
References
 Agueh and Carlier [2011] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM J. Math. Anal., 43(2):904–924, January 2011. ISSN 0036-1410. doi: 10.1137/100805741.
 Anderes et al. [2016] Ethan Anderes, Steffen Borgwardt, and Jacob Miller. Discrete Wasserstein barycenters: Optimal transport for discrete data. Math. Meth. Oper. Res., 84(2):389–409, October 2016. ISSN 1432-2994, 1432-5217. doi: 10.1007/s00186-016-0549-x.
 Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.
 Aurenhammer [1987] Franz Aurenhammer. Power diagrams: properties, algorithms and applications. SIAM Journal on Computing, 16(1):78–96, 1987.
 Aurenhammer et al. [1992] Franz Aurenhammer, Friedrich Hoffmann, and Boris Aronov. Minkowski-type theorems and least-squares partitioning. In Proceedings of the Eighth Annual Symposium on Computational Geometry, pages 350–357. ACM, 1992.
 Benamou et al. [2015] J. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM J. Sci. Comput., 37(2):A1111–A1138, January 2015. ISSN 1064-8275. doi: 10.1137/141000439.
 Bottou and Bengio [1995] Léon Bottou and Yoshua Bengio. Convergence properties of the k-means algorithms. In Advances in Neural Information Processing Systems, pages 585–592, 1995.
 Brancolini et al. [2009] Alessio Brancolini, Giuseppe Buttazzo, Filippo Santambrogio, and Eugene Stepanov. Longterm planning versus shortterm planning in the asymptotical location problem. ESAIM: Control, Optimisation and Calculus of Variations, 15(3):509–524, 2009.
 Carrière et al. [2017] Mathieu Carrière, Marco Cuturi, and Steve Oudot. Sliced Wasserstein kernel for persistence diagrams. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, pages 664–673, 2017. URL http://proceedings.mlr.press/v70/carriere17a.html.
 Chen et al. [2010] Yutian Chen, Max Welling, and Alexander J. Smola. Super-samples from kernel herding. In UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, July 8–11, 2010, pages 109–116, 2010. URL https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=2148&proceeding_id=26.
 Courty et al. [2016] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell., PP(99):1–1, 2016. ISSN 0162-8828. doi: 10.1109/TPAMI.2016.2615921.
 Cuturi [2013] Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.
 Cuturi and Doucet [2014] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pages 685–693, 2014. URL http://jmlr.org/proceedings/papers/v32/cuturi14.html.
 Cuturi and Peyré [2016] Marco Cuturi and Gabriel Peyré. A Smoothed Dual Approach for Variational Wasserstein Problems. SIAM J. Imaging Sci., 9(1):320–343, January 2016. doi: 10.1137/15M1032600.
 De Goes et al. [2012] Fernando De Goes, Katherine Breeden, Victor Ostromoukhov, and Mathieu Desbrun. Blue noise through optimal transport. ACM Transactions on Graphics (TOG), 31(6):171, 2012.
 Genevay et al. [2016] Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic Optimization for Largescale Optimal Transport. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3440–3448. Curran Associates, Inc., 2016.
 Kitagawa et al. [2016a] Jun Kitagawa, Quentin Mérigot, and Boris Thibert. Convergence of a Newton algorithm for semidiscrete optimal transport. arXiv:1603.05579, 2016a.
 Kitagawa et al. [2016b] Jun Kitagawa, Quentin Mérigot, and Boris Thibert. Convergence of a Newton algorithm for semidiscrete optimal transport. arXiv:1603.05579 [cs, math], March 2016b.
 Kloeckner [2012] Benoît Kloeckner. Approximation by finitely supported measures. ESAIM Control Optim. Calc. Var., 18(2):343–359, 2012. ISSN 1292-8119.
 Lévy [2015] Bruno Lévy. A numerical algorithm for semi-discrete optimal transport in 3D. ESAIM: Mathematical Modelling and Numerical Analysis, 49(6):1693–1715, 2015.
 Lévy and Schwindt [2017] Bruno Lévy and Erica Schwindt. Notions of optimal transport theory and how to implement them on a computer. arXiv:1710.02634, 2017.
 Li et al. [2007] Xiangru Li, Zhanyi Hu, and Fuchao Wu. A note on the convergence of the mean shift. Pattern Recognition, 40(6):1756–1762, 2007.
 Lloyd [1982] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
 McCann [1997] Robert J McCann. A convexity principle for interacting gases. Advances in mathematics, 128(1):153–179, 1997.
 Mérigot [2011] Quentin Mérigot. A multiscale approach to optimal transport. In Computer Graphics Forum, volume 30, pages 1583–1592. Wiley Online Library, 2011.
 Nesterov [1983] Yurii Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
 Peyré and Cuturi [2018] Gabriel Peyré and Marco Cuturi. Computational Optimal Transport. Submitted, 2018.
 Santambrogio [2015] Filippo Santambrogio. Optimal Transport for Applied Mathematicians, volume 87 of Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, Cham, 2015. ISBN 978-3-319-20827-5, 978-3-319-20828-2. doi: 10.1007/978-3-319-20828-2.
 Schmitz et al. [2017] Morgan A. Schmitz, Matthieu Heitz, Nicolas Bonneel, Fred Maurice Ngolè Mboula, David Coeurjolly, Marco Cuturi, Gabriel Peyré, and JeanLuc Starck. Wasserstein dictionary learning: Optimal transportbased unsupervised nonlinear dictionary learning. CoRR, abs/1708.01955, 2017. URL http://arxiv.org/abs/1708.01955.
 Schmitzer [2016] Bernhard Schmitzer. A sparse multiscale algorithm for dense optimal transport. Journal of Mathematical Imaging and Vision, 56(2):238–259, 2016.
 Solomon [2018] Justin Solomon. Optimal Transport on Discrete Domains. AMS Short Course on Discrete Differential Geometry, 2018.
 Solomon et al. [2015] Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph., 34(4):66:1–66:11, July 2015. ISSN 0730-0301. doi: 10.1145/2766963.
 Srivastava et al. [2015a] Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 912–920, San Diego, California, USA, 09–12 May 2015a. PMLR. URL http://proceedings.mlr.press/v38/srivastava15.html.
 Srivastava et al. [2015b] Sanvesh Srivastava, Volkan Cevher, Quoc TranDinh, and David B. Dunson. WASP: scalable bayes via barycenters of subset posteriors. 2015b. URL http://jmlr.org/proceedings/papers/v38/srivastava15.html.
 Staib et al. [2017] Matthew Staib, Sebastian Claici, Justin M Solomon, and Stefanie Jegelka. Parallel streaming Wasserstein barycenters. In Advances in Neural Information Processing Systems, pages 2644–2655, 2017.
 Villani [2009] Cédric Villani. Optimal Transport: Old and New. Number 338 in Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009. ISBN 978-3-540-71049-3. OCLC: ocn244421231.
 Ye et al. [2017] J. Ye, P. Wu, J. Z. Wang, and J. Li. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Trans. Signal Process., 65(9):2317–2332, May 2017. ISSN 1053-587X. doi: 10.1109/TSP.2017.2659647.