A Fast Globally Linearly Convergent Algorithm for the Computation of Wasserstein Barycenters
Abstract
In this paper, we consider the problem of computing a Wasserstein barycenter for a set of discrete probability distributions with finite supports, which finds many applications in areas such as statistics, machine learning and image processing. When the support points of the barycenter are prespecified, this problem can be modeled as a linear program (LP) whose problem size can be extremely large. To handle this large-scale LP, we analyze the structure of its dual problem, which is conceivably more tractable and can be reformulated as a well-structured convex problem with 3 kinds of block variables and a coupling linear equality constraint. We then adapt a symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM) to solve the resulting dual problem and establish its global convergence and global linear convergence rate. As a critical component for efficient computation, we also show how all the subproblems involved can be solved exactly and efficiently. This makes our method suitable for computing a Wasserstein barycenter on a large dataset, without introducing an entropy regularization term as is commonly practiced. In addition, our sGS-ADMM can be used as a subroutine in an alternating minimization method to compute a barycenter when its support points are not prespecified. Numerical results on synthetic datasets and image datasets demonstrate that our method is highly competitive for solving large-scale problems, in comparison to two existing representative methods and the commercial software Gurobi.
Institute of Operations Research and Analytics
National University of Singapore
10 Lower Kent Ridge Road, Singapore 119076

Jia Li (jiali@stat.psu.edu)
Department of Statistics
Pennsylvania State University
University Park, PA 16802, USA

Defeng Sun¹ (defeng.sun@polyu.edu.hk)
Department of Applied Mathematics
The Hong Kong Polytechnic University
Hung Hom, Kowloon, Hong Kong

Kim-Chuan Toh² (mattohkc@nus.edu.sg)
Department of Mathematics and Institute of Operations Research and Analytics
National University of Singapore
10 Lower Kent Ridge Road, Singapore 119076

¹ The research of this author was supported in part by a start-up research grant from the Hong Kong Polytechnic University.
² The research of this author was supported in part by the Ministry of Education, Singapore, Academic Research Fund (Grant No. R-146-000-256-114).
Keywords: Wasserstein barycenter; discrete probability distribution; semi-proximal ADMM; symmetric Gauss-Seidel.
1 Introduction
In this paper, we consider the problem of computing the mean of a set of discrete probability distributions under the Wasserstein distance (also known as the optimal transport distance or the earth mover’s distance). This mean, called the Wasserstein barycenter, is also a discrete probability distribution (Agueh and Carlier, 2011). Recently, the Wasserstein barycenter has attracted much attention due to its promising performance in many application areas such as data analysis and statistics (Bigot and Klein, 2018), machine learning (Cuturi and Doucet, 2014; Li and Wang, 2008; Ye and Li, 2014; Ye et al., 2017) and image processing (Rabin et al., 2011). For a set of discrete probability distributions with finite support points, a Wasserstein barycenter with its support points being prespecified can be computed by solving a linear programming (LP) problem (Anderes et al., 2016). However, the problem size can be extremely large when the number of discrete distributions or the number of support points of each distribution is large. Thus, classical LP methods such as the simplex method and the interior point method are no longer efficient enough or consume too much memory when solving this problem. This motivates the study of fast algorithms for the computation of Wasserstein barycenters; see, for example, (Benamou et al., 2015; Borgwardt and Patterson, 2018; Carlier et al., 2015; Claici et al., 2018; Cuturi and Doucet, 2014; Cuturi and Peyré, 2016; Oberman and Ruan, 2015; Schmitzer, 2019; Uribe et al., 2018; Xie et al., 2018; Ye and Li, 2014; Ye et al., 2017).
One representative approach is to introduce an entropy regularization in the LP and then apply some efficient first-order methods, e.g., the gradient descent method (Cuturi and Doucet, 2014) and the iterative Bregman projection (IBP) method (Benamou et al., 2015), to solve the regularized problem. These methods can be implemented efficiently and hence are suitable for large-scale datasets. However, they can only return an approximate solution of the LP (due to the entropy regularization) and often encounter numerical issues when the regularization parameter becomes small. IBP is highly efficient if an approximate solution is adequate, as is the case in many learning tasks. However, our aim here is to obtain a high precision solution at low computational costs. Ye et al. (2017) provide detailed empirical studies on the pros and cons of using IBP to obtain higher-precision solutions by diminishing the regularization parameter; they found that it can be difficult to drive the regularization parameter to smaller values, so as to obtain more accurate solutions, without encountering numerical issues. We will provide a comparison with IBP in the experiments. Another approach is to consider the LP as a constrained convex optimization problem with a separable structure and then apply some splitting methods to solve it. For example, the alternating direction method of multipliers (ADMM) was adapted in (Ye and Li, 2014). However, solving the quadratic programming subproblems involved is still highly expensive. Later, Ye et al. (2017) developed a modified Bregman ADMM (BADMM) based on the original one (Wang and Banerjee, 2014) to solve the LP. In this method, all subproblems have closed-form solutions and hence can be solved efficiently. Promising numerical performance was also reported in (Ye et al., 2017). However, this modified Bregman ADMM still has no convergence guarantee to date.
In this paper, we also consider the LP as a constrained convex problem with multiple blocks of variables and develop an efficient method to solve its dual LP without introducing the entropy regularization to modify the objective function. We believe it is important to have an efficient algorithm that can faithfully solve the original nonregularized LP. Moreover, the barycenter problem itself is often embedded as a subroutine of a larger problem; in that scenario, the original barycenter problem should be solved accurately so that the computed barycenter does not introduce significant error into the solution of the outer problem.
Our method is a convergent 3-block ADMM that is designed based on recent progress in research on convergent multi-block ADMM-type methods for solving convex composite conic programming; see (Chen et al., 2017; Li et al., 2016). It is well known that the classical ADMM was originally proposed to solve a convex problem that contains 2 blocks of variables and a coupling linear equality constraint (Gabay and Mercier, 1976; Glowinski and Marroco, 1975). The 2-block ADMM can be naturally extended to a multi-block ADMM for solving a convex problem with more than 2 blocks of variables. However, it has been shown in (Chen et al., 2016) that the directly extended ADMM may not converge when applied to a problem with 3 or more blocks of variables. This has motivated many researchers to develop various convergent variants of the ADMM for convex problems with more than 2 blocks of variables; see, for example, (Chen et al., 2017, 2018; He et al., 2012; Li et al., 2015, 2016; Sun et al., 2015). Among them, the Schur complement based convergent semi-proximal ADMM (sPADMM) was proposed by Li et al. (2016) to solve a large class of linearly constrained convex problems with multiple blocks of variables, whose objective can be the sum of two proper closed convex functions and a finite number of convex quadratic or linear functions. This method modified the original ADMM by performing one more forward Gauss-Seidel sweep after updating the block of variables corresponding to the nonsmooth function in the objective. With this novel strategy, Li et al. (2016) showed that their method can be reformulated as a 2-block sPADMM with specially designed semi-proximal terms, and its convergence is guaranteed by that of the 2-block sPADMM; see (Fazel et al., 2013, Appendix B). Later, this method was generalized to the inexact symmetric Gauss-Seidel based ADMM (sGS-ADMM) for more general convex problems (Chen et al., 2017; Li et al., 2018).
The numerical results reported in (Chen et al., 2017; Li et al., 2016, 2018) also showed that the sGS-ADMM always performs much better than the possibly nonconvergent directly extended ADMM. In addition, as the sGS-ADMM is equivalent to a 2-block sPADMM with specially designed proximal terms, the linear convergence rate of the sGS-ADMM can also be derived from the linear convergence rate of the 2-block sPADMM under some mild conditions; more details can be found in (Han et al., 2018, Section 4.1).
Motivated by the above studies, in this paper, we adapt the sGS-ADMM to compute a Wasserstein barycenter by solving the dual problem of the original primal LP. The contributions of this paper are as follows:

We derive the dual problem of the original primal LP and characterize the properties of their optimal solutions; see Proposition 1. The resulting dual problem is our target problem, which we reformulate as a linearly constrained convex problem containing 3 blocks of variables with a carefully delineated separable structure designed for efficient computations. We emphasize again that we do not introduce an entropic or quadratic regularization to modify the LP so as to make it computationally more tractable. This is in contrast to many existing works that primarily focus on solving an approximation of the original LP arising from optimal transport related problems; see, for example, (Benamou et al., 2015; Cuturi, 2013; Cuturi and Doucet, 2014; Dessein et al., 2018; Essid and Solomon, 2018).

We apply the sGS-ADMM to solve the resulting dual problem and analyze its global convergence as well as its global linear convergence rate without any condition; see Theorems 1 and 2. As a critical component of the paper, we also develop essential numerical strategies to show how all the subproblems in our method can be solved efficiently and that the subproblems at each step can be computed in parallel. This makes our sGS-ADMM highly suitable for computing Wasserstein barycenters on a large dataset.

We conduct rigorous numerical experiments on synthetic datasets and MNIST to evaluate the performance of our sGS-ADMM in comparison to existing state-of-the-art methods (IBP and BADMM) and the highly powerful commercial solver Gurobi. The computational results show that our sGS-ADMM is highly competitive compared to IBP and BADMM, and is also able to outperform Gurobi in solving large-scale LPs arising from Wasserstein barycenter problems.
The rest of this paper is organized as follows. In Section 2, we describe the basic problem of computing Wasserstein barycenters and derive its dual problem. In Section 3, we adapt the sGS-ADMM to solve the resulting dual problem and present the efficient implementations for each step that are crucial in making our method competitive. The convergence analysis of the sGS-ADMM is presented in Section 4. Finally, numerical results are presented in Section 5, with some concluding remarks given in Section 6.
Notation and Preliminaries
In this paper, we present scalars, vectors and matrices in lower case letters, bold lower case letters and upper case letters, respectively. We use $\mathbb{R}$, $\mathbb{R}^n$, $\mathbb{R}^n_+$ and $\mathbb{R}^{m\times n}$ to denote the set of real numbers, the set of $n$-dimensional real vectors, the set of $n$-dimensional real vectors with nonnegative entries and the set of $m\times n$ real matrices, respectively. For a vector $\boldsymbol{x}$, $x_i$ denotes its $i$th entry, $\|\boldsymbol{x}\|$ denotes its Euclidean norm and $\|\boldsymbol{x}\|_M := \sqrt{\langle \boldsymbol{x},\, M\boldsymbol{x}\rangle}$ denotes its weighted norm associated with the symmetric positive semidefinite matrix $M$. For a matrix $X$, $x_{ij}$ denotes its $(i,j)$th entry, $X_{i:}$ denotes its $i$th row, $X_{:j}$ denotes its $j$th column, $\|X\|_F$ denotes its Frobenius norm and $\mathrm{vec}(X)$ denotes the vectorization of $X$. We also use $\boldsymbol{x} \ge 0$ and $X \ge 0$ to denote $x_i \ge 0$ for all $i$ and $x_{ij} \ge 0$ for all $(i,j)$. The identity matrix of size $n \times n$ is denoted by $I_n$. For any two matrices $X$ and $Y$ with the same number of rows, $[X,\,Y]$ denotes the matrix obtained by horizontally concatenating $X$ and $Y$. For any two matrices $X$ and $Y$ with the same number of columns, $[X;\,Y]$ denotes the matrix obtained by vertically concatenating $X$ and $Y$. For any $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{p\times q}$, the Kronecker product is defined as
$$A \otimes B := \begin{bmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix} \in \mathbb{R}^{mp\times nq}.$$
For an extended-real-valued function $f$, we say that it is proper if $f(\boldsymbol{x}) > -\infty$ for all $\boldsymbol{x}$ and its domain $\mathrm{dom}\,f := \{\boldsymbol{x} : f(\boldsymbol{x}) < +\infty\}$ is nonempty. A proper function is said to be closed if it is lower semicontinuous. Assume that $f$ is a proper and closed convex function. The subdifferential of $f$ at $\boldsymbol{x}$ is defined by $\partial f(\boldsymbol{x}) := \{\boldsymbol{d} : f(\boldsymbol{y}) \ge f(\boldsymbol{x}) + \langle \boldsymbol{d},\, \boldsymbol{y} - \boldsymbol{x} \rangle \ \ \forall\, \boldsymbol{y}\}$ and its conjugate function is defined by $f^*(\boldsymbol{y}) := \sup_{\boldsymbol{x}} \{\langle \boldsymbol{y}, \boldsymbol{x}\rangle - f(\boldsymbol{x})\}$. For any $\boldsymbol{x}$ and $\boldsymbol{y}$, it follows from (Rockafellar, 1970, Theorem 23.5) that

(1) $\quad \boldsymbol{y} \in \partial f(\boldsymbol{x}) \iff \boldsymbol{x} \in \partial f^*(\boldsymbol{y}) \iff f(\boldsymbol{x}) + f^*(\boldsymbol{y}) = \langle \boldsymbol{x}, \boldsymbol{y} \rangle.$
For any $\boldsymbol{y}$, the proximal mapping of $f$ at $\boldsymbol{y}$ is defined by
$$\mathrm{prox}_{f}(\boldsymbol{y}) := \operatorname*{arg\,min}_{\boldsymbol{x}} \Big\{ f(\boldsymbol{x}) + \tfrac{1}{2}\|\boldsymbol{x} - \boldsymbol{y}\|^2 \Big\}.$$
For a closed convex set $C$, its indicator function $\delta_C$ is defined by $\delta_C(\boldsymbol{x}) = 0$ if $\boldsymbol{x} \in C$ and $\delta_C(\boldsymbol{x}) = +\infty$ otherwise. Moreover, we use $\Pi_C(\boldsymbol{x})$ to denote the projection of $\boldsymbol{x}$ onto a closed convex set $C$. It is easy to see that $\Pi_C = \mathrm{prox}_{\delta_C}$.
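As a quick numerical illustration of the Moreau decomposition $\mathrm{prox}_f(\boldsymbol{x}) + \mathrm{prox}_{f^*}(\boldsymbol{x}) = \boldsymbol{x}$, which is invoked later when solving the subproblems in Step 1, one can take $f = |\cdot|$ applied entrywise: its proximal mapping is soft-thresholding, and its conjugate is the indicator of $[-1,1]$, whose proximal mapping is clipping. The snippet below is our own check, not code from the paper:

```python
import numpy as np

# Moreau decomposition: prox_f(x) + prox_{f*}(x) = x.
# Illustration with f = |.| entrywise, so prox_f is soft-thresholding and
# f* is the indicator of [-1, 1], whose prox is the projection (clipping).
x = np.array([-2.5, -0.4, 0.0, 0.9, 3.1])
prox_f = np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)  # soft-threshold at 1
prox_fstar = np.clip(x, -1.0, 1.0)                      # project onto [-1, 1]
assert np.allclose(prox_f + prox_fstar, x)              # Moreau identity holds
```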
In the following, a discrete probability distribution $\mathcal{P}$ with finite support points is specified by $\mathcal{P} := \{(a_i,\, \boldsymbol{q}_i) : i = 1, \dots, m\}$, where $\boldsymbol{q}_1, \dots, \boldsymbol{q}_m$ are the support points or vectors and $a_1, \dots, a_m$ are the associated probabilities or weights satisfying $\sum_{i=1}^{m} a_i = 1$ and $a_i \ge 0$, $i = 1, \dots, m$.
2 Problem statement
In this section, we briefly recall the Wasserstein distance and describe the problem of computing a Wasserstein barycenter for a set of discrete probability distributions with finite support points. We refer interested readers to (Villani, 2008, Chapter 6) for more details on the Wasserstein distance and to (Agueh and Carlier, 2011; Anderes et al., 2016) for more details on the Wasserstein barycenter.
Given two discrete distributions and , , the 2-Wasserstein distance between and is defined by
where is the optimal objective value of the following linear program:
Then, given a set of discrete probability distributions with , a Wasserstein barycenter with support points is an optimal solution of the following problem
for given weights satisfying and , . It is worth noting that the number of support points of the true barycenter is theoretically upper bounded by ; see (Anderes et al., 2016, Theorem 2). However, in practice, one usually chooses a small to simplify computations. The above problem is a two-stage optimization problem that can easily be shown to be equivalent to
(2) 
where

(resp. ) denotes the (resp. ) dimensional vector with all entries being 1;

, ;

, for ;

, for .
Note that problem (2) is a nonconvex problem, where one needs to find the optimal support and the optimal weight vector of a barycenter simultaneously. However, in many real applications, the support of a barycenter can be specified empirically from the support points of . Thus, one only needs to find the weight vector of a barycenter. In view of this, from now on, we assume that the support is given. Consequently, problem (2) reduces to the following problem:
(3) 
where denotes for simplicity. This is also the main problem studied in (Benamou et al., 2015; Borgwardt and Patterson, 2018; Carlier et al., 2015; Claici et al., 2018; Cuturi and Doucet, 2014; Cuturi and Peyré, 2016; Oberman and Ruan, 2015; Schmitzer, 2019; Uribe et al., 2018; Ye and Li, 2014; Ye et al., 2017) for the computation of Wasserstein barycenters. Moreover, one can easily see that problem (3) is indeed a large-scale LP containing nonnegative variables and equality constraints. For , and for all , such an LP has about nonnegative variables and equality constraints.
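To make the structure of this LP concrete, the sketch below assembles and solves a tiny instance of problem (3) with an off-the-shelf LP solver: the variables are the barycenter weights $w$ and one transport plan $\Pi^t$ per distribution, with row-sum constraints tying each plan to $w$ and column-sum constraints tying it to the given weights $a^t$. All names and the 1-D toy data are ours, and this brute-force dense formulation is exactly what becomes impractical at the sizes discussed above:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, m = 3, 4                                # number of distributions, barycenter support size
mts = [4, 5, 3]                            # support sizes m_t of the input distributions
X = np.linspace(0.0, 1.0, m)               # prespecified barycenter support (1-D toy)
gamma = np.full(N, 1.0 / N)                # weights gamma_t of the barycenter objective

dists = []
for mt in mts:
    q = rng.random(mt)                     # support points of distribution t
    a = rng.random(mt); a /= a.sum()       # probabilities a^t (sum to one)
    dists.append((q, a))

# Variable layout: [w (m entries), vec(Pi^1), ..., vec(Pi^N)] (row-major vec)
nvar = m + sum(m * mt for mt in mts)
c = np.zeros(nvar)
off = m
for g, (q, a) in zip(gamma, dists):
    D = (X[:, None] - q[None, :]) ** 2     # squared-distance cost matrix, m x m_t
    c[off:off + D.size] = g * D.ravel()
    off += D.size

nrow = sum(m + mt for mt in mts)
A = np.zeros((nrow, nvar)); b = np.zeros(nrow)
r, off = 0, m
for (q, a) in dists:
    mt = len(q)
    A[r:r + m, :m] = -np.eye(m)            # row sums: Pi^t 1 - w = 0
    A[r:r + m, off:off + m * mt] = np.kron(np.eye(m), np.ones((1, mt)))
    r += m
    A[r:r + mt, off:off + m * mt] = np.kron(np.ones((1, m)), np.eye(mt))
    b[r:r + mt] = a                        # column sums: (Pi^t)^T 1 = a^t
    r += mt
    off += m * mt

res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
w = res.x[:m]                              # barycenter weights on the fixed support
```

Note that $\sum_i w_i = 1$ is implied by the column-sum constraints, so it need not be imposed separately; the equality system is (harmlessly) rank-deficient for the same reason.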
Remark 1 (Practical computational consideration when is sparse)
Note that any feasible point of problem (3) must satisfy and for any . This implies that if for some and , then for all , i.e., all entries in the th column of are zeros. Based on this fact, one can verify the following statements.

For any optimal solution of problem (3), the point is also an optimal solution of the following problem
(4) where denotes the support set of , i.e., , denotes the cardinality of , denotes the subvector of obtained by selecting the entries indexed by and denotes the submatrix of obtained by selecting the columns indexed by .
Therefore, one can obtain an optimal solution of problem (3) by computing an optimal solution of problem (4). Note that the size of problem (4) can be much smaller than that of problem (3) when each is sparse, i.e., . Thus, solving problem (4) can reduce the computational cost and save memory in practice. Since problem (4) takes the same form as problem (3), we only consider problem (3) in the following.
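The reduction in Remark 1 amounts to dropping the zero-weight entries of each distribution and the corresponding columns of its cost matrix before solving. A minimal sketch (the function name and variables are illustrative, not from the paper's code):

```python
import numpy as np

def restrict_to_support(a_t, D_t):
    """Shrink distribution t to its support, as in Remark 1 (a sketch).

    a_t : weights a^t of distribution t (may contain zeros)
    D_t : cost matrix with one column per support point of distribution t
    Returns the subvector of a_t on its support set and the submatrix of
    D_t with the corresponding columns, reducing the LP size when a^t is
    sparse.
    """
    J = np.flatnonzero(a_t > 0)   # support set J_t = {j : a^t_j > 0}
    return a_t[J], D_t[:, J]
```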
For notational simplicity, let and be the indicator function over for each . Then, problem (3) can be equivalently written as
(5) 
We next derive the dual problem of (5) (hence (3)). To this end, we write down the Lagrangian function associated with (5) as follows:
(6) 
where , , are multipliers. Then, the dual problem of (5) is given by
(7) 
Observe that
where is the Fenchel conjugate of . Thus, (7) is equivalent to
By introducing auxiliary variables , we can further reformulate the above problem as
(8) 
Note that problem (8) can be viewed as a linearly constrained convex problem with 3 blocks of variables grouped as , and , whose objective is nonsmooth only with respect to and linear with respect to the other two. Thus, this problem falls exactly into the class of convex problems for which the sGS-ADMM is applicable; see (Chen et al., 2017; Li et al., 2016). It is then natural to adapt the sGS-ADMM for solving problem (8), as presented in the next section.
Remark 2 (2-block ADMM for solving (3))
It is worth noting that one can also apply the 2-block ADMM to solve the primal problem (3) by introducing some proper auxiliary variables. For example, one can consider the following equivalent reformulation of (3):
where . Then, the 2-block ADMM can be readily applied with being one block and as the other. This 2-block ADMM avoids the need to solve quadratic programming subproblems and hence is more efficient than the one used in (Ye and Li, 2014). However, it needs to compute the projection onto the dimensional simplex times when solving the subproblem in each iteration. This is still time-consuming when or is large. Thus, this 2-block ADMM is also not efficient enough for solving large-scale problems. In addition, we have adapted the 2-block ADMM for solving other reformulations of (3), but they all perform worse than our sGS-ADMM to be presented later. Hence, we will no longer consider ADMM-type methods for solving the primal problem (3) or its equivalent variants in this paper.
3 sGS-ADMM for computing Wasserstein barycenters
In this section, we present the sGSADMM for solving problem (8). First, we write down the Lagrangian function associated with (8) as follows:
(9) 
where , , are multipliers. Then, the augmented Lagrangian function associated with (8) is
where is the penalty parameter. The sGS-ADMM for solving (8) is now readily presented in Algorithm 1.
Input: the penalty parameter , and the initialization , , , , , , . Set .
while a termination criterion is not met, do

1. Compute

2a. Compute

2b. Compute

2c. Compute

3. Compute
where is the dual stepsize that is typically set to .
end while
Output: , , , , , .
Compared with the directly extended ADMM, our sGS-ADMM in Algorithm 1 has just one more update of in Step 2a. This step is actually the key to guaranteeing the convergence of the algorithm. In the next section, we shall see that computed from Steps 2a–2c is equivalent to minimizing plus a special proximal term simultaneously with respect to . Moreover, all subproblems in Algorithm 1 can be solved efficiently (in fact analytically) and the subproblems in each step can also be computed in parallel. This makes our method highly suitable for solving large-scale problems.
The reader may have observed that instead of computing and sequentially as in Steps 2a–2c, one can also compute simultaneously in one step by solving a huge linear system of equations of dimension . Unfortunately, for the latter approach, the computation of the solution would require the Cholesky factorization of a huge coefficient matrix, and this approach is not practically viable. In contrast, for our approach in Steps 2a–2c, we shall see shortly that the solutions can be computed analytically without the need to perform Cholesky factorizations of large coefficient matrices. This also explains why we have designed the computations as in Steps 2a–2c.
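The equivalence mentioned above, namely that the extra update in Step 2a turns the backward-then-forward sweep of Steps 2a–2c into a joint minimization with an added proximal term, can be verified numerically on a generic two-block quadratic via the sGS decomposition of Li et al. (2016). The toy check below is our own sketch (random data, generic blocks, not the paper's specific subproblems):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 3, 4
# Random symmetric positive definite Q, split into a 2 x 2 block structure
M = rng.standard_normal((n1 + n2, n1 + n2))
Q = M @ M.T + (n1 + n2) * np.eye(n1 + n2)
b = rng.standard_normal(n1 + n2)
Q11, Q12 = Q[:n1, :n1], Q[:n1, n1:]
Q21, Q22 = Q[n1:, :n1], Q[n1:, n1:]
b1, b2 = b[:n1], b[n1:]
yk = rng.standard_normal(n1 + n2)          # current iterate

# One sGS cycle on f(y) = 0.5 y'Qy - b'y:
# extra backward step on block 2, then a forward Gauss-Seidel sweep
y2_hat = np.linalg.solve(Q22, b2 - Q21 @ yk[:n1])   # backward (the "Step 2a" role)
y1_new = np.linalg.solve(Q11, b1 - Q12 @ y2_hat)    # forward, block 1
y2_new = np.linalg.solve(Q22, b2 - Q21 @ y1_new)    # forward, block 2

# Equivalent joint proximal step: argmin f(y) + 0.5 ||y - yk||_T^2
# with the sGS proximal term T = blkdiag(Q12 Q22^{-1} Q21, 0)
T = np.zeros_like(Q)
T[:n1, :n1] = Q12 @ np.linalg.solve(Q22, Q21)
y_prox = np.linalg.solve(Q + T, b + T @ yk)

assert np.allclose(np.concatenate([y1_new, y2_new]), y_prox)
```

The proximal term penalizes only the first block, which is why the sweep order (extra backward step first) matters.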
Before ending this section, we present the computational details and the efficient implementations in each step of Algorithm 1.

1. Note that is actually separable with respect to and hence one can compute independently. Specifically, is obtained by solving
Thus, we have
where the last equality follows from the Moreau decomposition (Bauschke and Combettes, 2011, Theorem 14.3(ii)), i.e., $\boldsymbol{y} = \mathrm{prox}_{f}(\boldsymbol{y}) + \mathrm{prox}_{f^*}(\boldsymbol{y})$ for any $\boldsymbol{y}$, and the proximal mapping of can be computed efficiently by the algorithm proposed in (Condat, 2016) with the $O(n)$ complexity that is typically observed in practice. Moreover, for each , can be computed in parallel by solving
Then, it is easy to see that
where . Note that is already computed for updating in the previous iteration and thus it can be reused in the current iteration. The computational complexity in this step is . We should emphasize that because the matrices such as , are large and numerous, even performing simple operations such as adding two such matrices can be time-consuming. Thus we have paid special attention to arranging the computations in each step of the sGS-ADMM so that matrices computed in one step can be reused in the next step.
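The projection onto the probability simplex invoked above is computed in the paper via Condat's algorithm with typically $O(n)$ cost; as a simpler stand-in, the classical sort-based projection (reviewed as a baseline in Condat, 2016) can be sketched as follows. This is our own $O(n \log n)$ illustration, not the implementation used in the experiments:

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}.

    Classical sort-based scheme: find the threshold tau such that
    max(v - tau, 0) sums to one. O(n log n) due to the sort; Condat's
    (2016) algorithm achieves O(n) observed complexity.
    """
    u = np.sort(v)[::-1]                 # entries in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    # largest k with u_k + (1 - cumsum_k)/k > 0 (always holds for k = 1)
    rho = np.max(np.nonzero(u + (1.0 - css) / k > 0)[0]) + 1
    tau = (css[rho - 1] - 1.0) / rho
    return np.maximum(v - tau, 0.0)
```

For a point already in the simplex the threshold $\tau$ is zero and the projection is the identity, which is an easy sanity check.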

2a. Similarly, is separable with respect to and then one can also compute , , in parallel. For each , is obtained by solving
It is easy to prove that
where . Note that has already been computed in Step 1 and hence can be computed by just a simple operation. We note that is computed analytically for all , and the computational complexity in this step is .

2b. In this step, one can see that , , are coupled in (due to the quadratic term ) and hence the problem of minimizing with respect to cannot be reduced to separable subproblems. However, one can still compute them efficiently based on the following observation. Note that , , is obtained by solving
The gradient of with respect to is
where and . It follows from the optimality conditions, namely, that, for any ,
(10) By dividing in (10) for , adding all resulting equations and doing some simple algebraic manipulations, one can obtain that