A Fast Globally Linearly Convergent Algorithm for the Computation of Wasserstein Barycenters
In this paper, we consider the problem of computing a Wasserstein barycenter for a set of discrete probability distributions with finite supports, which finds many applications in areas such as statistics, machine learning and image processing. When the support points of the barycenter are pre-specified, this problem can be modeled as a linear program (LP) whose problem size can be extremely large. To handle this large-scale LP, we analyze the structure of its dual problem, which is conceivably more tractable and can be reformulated as a well-structured convex problem with three types of block variables and a coupling linear equality constraint. We then adapt a symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM) to solve the resulting dual problem and establish its global convergence and global linear convergence rate. As a critical component for efficient computation, we also show how all the subproblems involved can be solved exactly and efficiently. This makes our method suitable for computing a Wasserstein barycenter on a large dataset, without introducing an entropy regularization term as is commonly practiced. In addition, our sGS-ADMM can be used as a subroutine in an alternating minimization method to compute a barycenter when its support points are not pre-specified. Numerical results on synthetic datasets and image datasets demonstrate that our method is highly competitive for solving large-scale problems, in comparison to two existing representative methods and the commercial software Gurobi.
Keywords: Wasserstein barycenter; discrete probability distribution; semi-proximal ADMM; symmetric Gauss-Seidel.
In this paper, we consider the problem of computing the mean of a set of discrete probability distributions under the Wasserstein distance (also known as the optimal transport distance or the earth mover’s distance). This mean, called the Wasserstein barycenter, is also a discrete probability distribution (Agueh and Carlier, 2011). Recently, the Wasserstein barycenter has attracted much attention due to its promising performance in many application areas such as data analysis and statistics (Bigot and Klein, 2018), machine learning (Cuturi and Doucet, 2014; Li and Wang, 2008; Ye and Li, 2014; Ye et al., 2017) and image processing (Rabin et al., 2011). For a set of discrete probability distributions with finite support points, a Wasserstein barycenter with its support points being pre-specified can be computed by solving a linear programming (LP) problem (Anderes et al., 2016). However, the problem size can be extremely large when the number of discrete distributions or the number of support points of each distribution is large. Thus, classical LP methods such as the simplex method and the interior point method are no longer efficient enough or consume too much memory when solving this problem. This motivates the study of fast algorithms for the computation of Wasserstein barycenters; see, for example, (Benamou et al., 2015; Borgwardt and Patterson, 2018; Carlier et al., 2015; Claici et al., 2018; Cuturi and Doucet, 2014; Cuturi and Peyré, 2016; Oberman and Ruan, 2015; Schmitzer, 2019; Uribe et al., 2018; Xie et al., 2018; Ye and Li, 2014; Ye et al., 2017).
One representative approach is to introduce an entropy regularization in the LP and then apply some efficient first-order methods, e.g., the gradient descent method (Cuturi and Doucet, 2014) and the iterative Bregman projection (IBP) method (Benamou et al., 2015), to solve the regularized problem. These methods can be implemented efficiently and hence are suitable for large-scale datasets. However, they can only return an approximate solution of the LP (due to the entropy regularization) and often encounter numerical issues when the regularization parameter becomes small. IBP is highly efficient when an approximate solution is adequate, as is the case in many learning tasks; our aim here, however, is to obtain a high-precision solution at low computational cost. Ye et al. (2017) provide detailed empirical studies on the pros and cons of IBP when higher-precision solutions are sought by diminishing the regularization parameter; they found that it can be difficult to drive the parameter to smaller values without encountering numerical issues. We will provide a comparison with IBP in the experiments. Another approach is to consider the LP as a constrained convex optimization problem with a separable structure and then apply some splitting methods to solve it. For example, the alternating direction method of multipliers (ADMM) was adapted in (Ye and Li, 2014). However, solving the quadratic programming subproblems involved is still highly expensive. Later, Ye et al. (2017) developed a modified Bregman ADMM (BADMM) based on the original one (Wang and Banerjee, 2014) to solve the LP. In this method, all subproblems have closed-form solutions and hence can be solved efficiently. Promising numerical performance was also reported in (Ye et al., 2017). However, this modified Bregman ADMM still lacks a convergence guarantee.
In this paper, we also consider the LP as a constrained convex problem with multiple blocks of variables and develop an efficient method to solve its dual LP without introducing the entropy regularization to modify the objective function. We believe it is important to have an efficient algorithm that can faithfully solve the original non-regularized LP. It is conceivable that the barycenter problem itself is usually embedded as a subroutine of a larger problem, and in that scenario, the original barycenter problem should be solved accurately so that the computed barycenter would not introduce significant error to the solution of the outer problem.
Our method is a convergent 3-block ADMM that is designed based on recent progress in research on convergent multi-block ADMM-type methods for solving convex composite conic programming; see (Chen et al., 2017; Li et al., 2016). It is well known that the classical ADMM was originally proposed to solve a convex problem that contains 2 blocks of variables and a coupling linear equality constraint (Gabay and Mercier, 1976; Glowinski and Marroco, 1975). The 2-block ADMM can be naturally extended to a multi-block ADMM for solving a convex problem with more than 2 blocks of variables. However, it has been shown in (Chen et al., 2016) that the directly extended ADMM may not converge when applied to a problem with 3 or more blocks of variables. This has motivated many researchers to develop various convergent variants of the ADMM for convex problems with more than 2 blocks of variables; see, for example, (Chen et al., 2017, 2018; He et al., 2012; Li et al., 2015, 2016; Sun et al., 2015). Among them, the Schur complement based convergent semi-proximal ADMM (sPADMM) was proposed by Li et al. (2016) to solve a large class of linearly constrained convex problems with multiple blocks of variables, whose objective can be the sum of two proper closed convex functions and a finite number of convex quadratic or linear functions. This method modified the original ADMM by performing one more forward Gauss-Seidel sweep after updating the block of variables corresponding to the nonsmooth function in the objective. With this novel strategy, Li et al. (2016) showed that their method can be reformulated as a 2-block sPADMM with specially designed semi-proximal terms and its convergence is guaranteed from that of the 2-block sPADMM; see (Fazel et al., 2013, Appendix B). Later, this method was generalized to the inexact symmetric Gauss-Seidel based ADMM (sGS-ADMM) for more general convex problems (Chen et al., 2017; Li et al., 2018).
The numerical results reported in (Chen et al., 2017; Li et al., 2016, 2018) also showed that the sGS-ADMM always performs much better than the possibly non-convergent directly extended ADMM. In addition, as the sGS-ADMM is equivalent to a 2-block sPADMM with specially designed proximal terms, the linear convergence rate of the sGS-ADMM can also be derived based on the linear convergence rate of the 2-block sPADMM under some mild conditions; more details can be found in (Han et al., 2018, Section 4.1).
Motivated by the above studies, in this paper, we adapt the sGS-ADMM to compute a Wasserstein barycenter by solving the dual problem of the original primal LP. The contributions of this paper are listed as follows:
We derive the dual problem of the original primal LP and characterize the properties of their optimal solutions; see Proposition 1. The resulting dual problem is our target problem, which we reformulate as a linearly constrained convex problem containing 3 blocks of variables with a carefully delineated separable structure designed for efficient computations. We should emphasize again that we do not introduce the entropic or quadratic regularization to modify the LP so as to make it computationally more tractable. This is in contrast to many existing works that primarily focus on solving an approximation of the original LP arising from optimal transport related problems; see, for example, (Benamou et al., 2015; Cuturi, 2013; Cuturi and Doucet, 2014; Dessein et al., 2018; Essid and Solomon, 2018).
We apply the sGS-ADMM to solve the resulting dual problem and analyze its global convergence as well as its global linear convergence rate without any condition; see Theorems 1 and 2. As a critical component of the paper, we also develop essential numerical strategies to show how all the subproblems in our method can be solved efficiently and that the subproblems at each step can be computed in parallel. This makes our sGS-ADMM highly suitable for computing Wasserstein barycenters on a large dataset.
We conduct rigorous numerical experiments on synthetic datasets and MNIST to evaluate the performance of our sGS-ADMM in comparison to existing state-of-the-art methods (IBP and BADMM) and the highly powerful commercial solver Gurobi. The computational results show that our sGS-ADMM is highly competitive compared to IBP and BADMM, and is also able to outperform Gurobi in solving large-scale LPs arising from Wasserstein barycenter problems.
The rest of this paper is organized as follows. In Section 2, we describe the basic problem of computing Wasserstein barycenters and derive its dual problem. In Section 3, we adapt the sGS-ADMM to solve the resulting dual problem and present the efficient implementations for each step that are crucial in making our method competitive. The convergence analysis of the sGS-ADMM is presented in Section 4. Finally, numerical results are presented in Section 5, with some concluding remarks given in Section 6.
Notation and Preliminaries
In this paper, we present scalars, vectors and matrices in lower case letters, bold lower case letters and upper case letters, respectively. We use $\mathbb{R}$, $\mathbb{R}^n$, $\mathbb{R}^n_+$ and $\mathbb{R}^{m\times n}$ to denote the set of real numbers, the set of $n$-dimensional real vectors, the set of $n$-dimensional real vectors with nonnegative entries and the set of $m\times n$ real matrices, respectively. For a vector $x$, $x_i$ denotes its $i$-th entry, $\|x\|$ denotes its Euclidean norm and $\|x\|_M := \sqrt{x^{\top}Mx}$ denotes its weighted norm associated with the symmetric positive semidefinite matrix $M$. For a matrix $X$, $x_{ij}$ denotes its $(i,j)$-th entry, $X_{i:}$ denotes its $i$-th row, $X_{:j}$ denotes its $j$-th column, $\|X\|_F$ denotes its Frobenius norm and $\mathrm{vec}(X)$ denotes the vectorization of $X$. We also use $x \geq 0$ and $X \geq 0$ to denote $x_i \geq 0$ for all $i$ and $x_{ij} \geq 0$ for all $(i,j)$. The identity matrix of size $n \times n$ is denoted by $I_n$. For any $A \in \mathbb{R}^{m\times n_1}$ and $B \in \mathbb{R}^{m\times n_2}$, $[A,\,B]$ denotes the matrix obtained by horizontally concatenating $A$ and $B$. For any $A \in \mathbb{R}^{m_1\times n}$ and $B \in \mathbb{R}^{m_2\times n}$, $[A;\,B]$ denotes the matrix obtained by vertically concatenating $A$ and $B$. For any $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{p\times q}$, the Kronecker product $A \otimes B$ is defined as the $mp \times nq$ block matrix whose $(i,j)$-th block is $a_{ij}B$.
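Since the Kronecker product and the vectorization operator appear throughout formulations of this kind, a quick numerical check of the standard identity $\mathrm{vec}(AXB) = (B^{\top} \otimes A)\,\mathrm{vec}(X)$ may be useful; the matrices below are random illustrative data, not taken from the paper.

```python
import numpy as np

# Sanity check of the standard identity vec(A X B) = (B^T kron A) vec(X),
# where vec(.) stacks the columns of a matrix (column-major order).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

lhs = (A @ X @ B).flatten(order="F")          # vec(A X B)
rhs = np.kron(B.T, A) @ X.flatten(order="F")  # (B^T ⊗ A) vec(X)
assert np.allclose(lhs, rhs)
```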
For an extended-real-valued function $f : \mathbb{R}^n \to [-\infty, +\infty]$, we say that it is proper if $f(x) > -\infty$ for all $x \in \mathbb{R}^n$ and its domain $\mathrm{dom}\,f := \{x \in \mathbb{R}^n : f(x) < +\infty\}$ is nonempty. A proper function is said to be closed if it is lower semicontinuous. Assume that $f : \mathbb{R}^n \to (-\infty, +\infty]$ is a proper and closed convex function. The subdifferential of $f$ at $x \in \mathbb{R}^n$ is defined by $\partial f(x) := \{d \in \mathbb{R}^n : f(y) \geq f(x) + \langle d,\, y - x\rangle \ \ \forall\, y \in \mathbb{R}^n\}$ and its conjugate function $f^* : \mathbb{R}^n \to (-\infty, +\infty]$ is defined by $f^*(y) := \sup_{x \in \mathbb{R}^n}\{\langle x,\, y\rangle - f(x)\}$. For any $x$ and $y$, it follows from (Rockafellar, 1970, Theorem 23.5) that
$$y \in \partial f(x) \iff x \in \partial f^*(y) \iff f(x) + f^*(y) = \langle x,\, y\rangle.$$
For any $y \in \mathbb{R}^n$, the proximal mapping of $f$ at $y$ is defined by
$$\mathrm{prox}_{f}(y) := \operatorname*{argmin}_{x \in \mathbb{R}^n}\Big\{ f(x) + \tfrac{1}{2}\|x - y\|^2 \Big\}.$$
For a closed convex set $\mathcal{C}$, its indicator function $\delta_{\mathcal{C}}$ is defined by $\delta_{\mathcal{C}}(x) = 0$ if $x \in \mathcal{C}$ and $\delta_{\mathcal{C}}(x) = +\infty$ otherwise. Moreover, we use $\Pi_{\mathcal{C}}(y)$ to denote the projection of $y$ onto a closed convex set $\mathcal{C}$. It is easy to see that $\mathrm{prox}_{\delta_{\mathcal{C}}}(y) = \Pi_{\mathcal{C}}(y)$.
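As a small illustration of the last identity (a sketch with an example set of our own choosing, not the paper's code), take the set to be the nonnegative orthant: the proximal mapping of its indicator function is the componentwise clamp at zero, and the Moreau decomposition of a point into its two projections can be checked directly.

```python
import numpy as np

# prox of the indicator of C = R^n_+ is exactly the projection onto C,
# i.e., a componentwise clamp at zero.
def proj_nonneg(y):
    return np.maximum(y, 0.0)

# Moreau decomposition with f = indicator of R^n_+: its conjugate is the
# indicator of R^n_-, so y = proj_{R^n_+}(y) + proj_{R^n_-}(y).
y = np.array([1.5, -2.0, 0.0, 3.0])
assert np.allclose(proj_nonneg(y) + np.minimum(y, 0.0), y)
```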
In the following, a discrete probability distribution $\mathcal{P}$ with finite support points is specified by $\{(a_i,\, q_i) : i = 1, \dots, m\}$, where $q_1, \dots, q_m \in \mathbb{R}^d$ are the support points or vectors and $a_1, \dots, a_m$ are the associated probabilities or weights satisfying $\sum_{i=1}^{m} a_i = 1$ and $a_i \geq 0$, $i = 1, \dots, m$.
2 Problem statement
In this section, we briefly recall the Wasserstein distance and describe the problem of computing a Wasserstein barycenter for a set of discrete probability distributions with finite support points. We refer interested readers to (Villani, 2008, Chapter 6) for more details on the Wasserstein distance and to (Agueh and Carlier, 2011; Anderes et al., 2016) for more details on the Wasserstein barycenter.
Given two discrete distributions $\mathcal{P}^{(1)} = \{(a^{(1)}_i,\, q^{(1)}_i) : i = 1, \dots, m_1\}$ and $\mathcal{P}^{(2)} = \{(a^{(2)}_i,\, q^{(2)}_i) : i = 1, \dots, m_2\}$, the 2-Wasserstein distance between $\mathcal{P}^{(1)}$ and $\mathcal{P}^{(2)}$ is defined by
$$\mathcal{W}_2(\mathcal{P}^{(1)},\, \mathcal{P}^{(2)}) := \sqrt{v^{*}},$$
where $v^{*}$ is the optimal objective value of the following linear program:
$$v^{*} := \min_{\Pi}\ \Big\{\, \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} \pi_{ij}\, \|q^{(1)}_i - q^{(2)}_j\|^2 \ :\ \Pi\,\mathbf{1}_{m_2} = \mathbf{a}^{(1)},\ \ \Pi^{\top}\mathbf{1}_{m_1} = \mathbf{a}^{(2)},\ \ \Pi \geq 0 \,\Big\},$$
where $\Pi = (\pi_{ij}) \in \mathbb{R}^{m_1 \times m_2}$ and $\mathbf{a}^{(k)} := (a^{(k)}_1, \dots, a^{(k)}_{m_k})^{\top}$ for $k = 1, 2$.
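As a toy illustration (all data and variable names below are ours, not the paper's), the transport LP whose optimal value gives the squared 2-Wasserstein distance between two small discrete distributions can be solved directly with a generic LP solver:

```python
import numpy as np
from scipy.optimize import linprog

# Two discrete distributions on the real line: supports q, qt and weights a, b.
q  = np.array([[0.0], [1.0]]);  a = np.array([0.5, 0.5])
qt = np.array([[0.0], [2.0]]);  b = np.array([0.5, 0.5])

# Cost matrix C[i, j] = ||q_i - qt_j||^2.
C = ((q[:, None, :] - qt[None, :, :]) ** 2).sum(-1)

m, n = C.shape
# Marginal constraints on the (row-major vectorized) transport plan pi:
# row sums of pi equal a, column sums equal b.
A_eq = np.vstack([np.kron(np.eye(m), np.ones(n)),   # row-sum constraints
                  np.kron(np.ones(m), np.eye(n))])  # column-sum constraints
b_eq = np.concatenate([a, b])
res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
# Optimal plan keeps the mass at 0 in place and moves mass 0.5 from 1 to 2,
# so the squared distance is 0.5 * |1 - 2|^2 = 0.5.
assert np.isclose(res.fun, 0.5)
```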
Then, given a set of discrete probability distributions $\{\mathcal{P}^{(t)} : t = 1, \dots, N\}$ with $\mathcal{P}^{(t)} = \{(a^{(t)}_i,\, q^{(t)}_i) : i = 1, \dots, m_t\}$, a Wasserstein barycenter $\mathcal{P} := \{(w_i,\, x_i) : i = 1, \dots, m\}$ with $m$ support points is an optimal solution of the following problem
$$\min_{\mathcal{P}}\ \ \sum_{t=1}^{N} \gamma_t\, \mathcal{W}_2^2(\mathcal{P},\, \mathcal{P}^{(t)})$$
for given weights $(\gamma_1, \dots, \gamma_N)$ satisfying $\sum_{t=1}^{N} \gamma_t = 1$ and $\gamma_t > 0$, $t = 1, \dots, N$. It is worth noting that the number of support points of the true barycenter is theoretically upper bounded by $\sum_{t=1}^{N} m_t - N + 1$; see (Anderes et al., 2016, Theorem 2). However, in practice, one usually chooses a small $m$ to simplify computations. The above problem is a two-stage optimization problem that can be easily shown to be equivalent to
(resp. ) denotes the (resp. ) dimensional vector with all entries being 1;
, for ;
, for .
Note that problem (2) is a nonconvex problem, where one needs to find the optimal support points and the optimal weight vector of a barycenter simultaneously. However, in many real applications, the support of a barycenter can be specified empirically from the support points of the given distributions. Thus, one only needs to find the weight vector of a barycenter. In view of this, from now on, we assume that the support is given. Consequently, problem (2) reduces to the following problem:
where denotes for simplicity. This is also the main problem studied in (Benamou et al., 2015; Borgwardt and Patterson, 2018; Carlier et al., 2015; Claici et al., 2018; Cuturi and Doucet, 2014; Cuturi and Peyré, 2016; Oberman and Ruan, 2015; Schmitzer, 2019; Uribe et al., 2018; Ye and Li, 2014; Ye et al., 2017) for the computation of Wasserstein barycenters. Moreover, one can easily see that problem (3) is indeed a large-scale LP containing nonnegative variables and equality constraints. For , and for all , such LP has about nonnegative variables and equality constraints.
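To make the structure of this LP concrete, the following sketch (with illustrative data and variable names of our own choosing, not the paper's formulation) assembles and solves a tiny fixed-support barycenter LP with a generic solver. The decision variables are the barycenter weight vector together with one transport plan per input distribution; each plan shares the barycenter marginal. For two point masses at 0 and 2 with equal weights, the barycenter places all its mass at the midpoint 1.

```python
import numpy as np
from scipy.optimize import linprog

supports = [np.array([0.0, 2.0]), np.array([0.0, 2.0])]  # support points q^t
weights  = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # probabilities a^t
bary_support = np.array([0.0, 1.0, 2.0])                 # pre-specified support
gamma = [0.5, 0.5]                                       # barycenter weights
m, N = len(bary_support), len(supports)

# Squared-distance cost matrices D^t between barycenter and input supports.
costs = [np.abs(bary_support[:, None] - s[None, :]) ** 2 for s in supports]
nvar = m + sum(c.size for c in costs)  # w plus all (row-major) vectorized plans
c_obj = np.concatenate([np.zeros(m)] + [g * c.ravel() for g, c in zip(gamma, costs)])

rows, rhs = [], []
off = m
for c, a in zip(costs, weights):
    mt = a.size
    # Coupling: Pi^t @ 1 = w (every plan has the barycenter marginal w).
    blk = np.zeros((m, nvar)); blk[:, :m] = -np.eye(m)
    blk[:, off:off + m * mt] = np.kron(np.eye(m), np.ones(mt))
    rows.append(blk); rhs.append(np.zeros(m))
    # Pi^t.T @ 1 = a^t (the other marginal is the given distribution).
    blk = np.zeros((mt, nvar))
    blk[:, off:off + m * mt] = np.kron(np.ones(m), np.eye(mt))
    rows.append(blk); rhs.append(a)
    off += m * mt
blk = np.zeros((1, nvar)); blk[0, :m] = 1.0  # 1^T w = 1
rows.append(blk); rhs.append([1.0])

res = linprog(c_obj, A_eq=np.vstack(rows), b_eq=np.concatenate(rhs),
              bounds=(0, None))
assert np.isclose(res.fun, 1.0)                          # optimal objective
assert np.allclose(res.x[:m], [0.0, 1.0, 0.0], atol=1e-6)  # mass at midpoint
```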
Remark 1 (Practical computational consideration when is sparse)
Note that any feasible point of problem (3) must satisfy and for any . This implies that if for some and , then for all , i.e., all entries in the -th column of are zeros. Based on this fact, one can verify the following statements.
For any optimal solution of problem (3), the point is also an optimal solution of the following problem
where denotes the support set of , i.e., , denotes the cardinality of , denotes the subvector of obtained by selecting the entries indexed by and denotes the submatrix of obtained by selecting the columns indexed by .
Therefore, one can obtain an optimal solution of problem (3) by computing an optimal solution of problem (4). Note that the size of problem (4) can be much smaller than that of problem (3) when each is sparse, i.e., . Thus, solving problem (4) can reduce the computational cost and save memory in practice. Since problem (4) takes the same form as problem (3), we only consider problem (3) in the following.
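A minimal sketch of this reduction (variable names are illustrative, not the paper's): keep only the columns indexed by the support set of each sparse probability vector.

```python
import numpy as np

a = np.array([0.4, 0.0, 0.6, 0.0])           # sparse probability vector a^t
D = np.arange(12, dtype=float).reshape(3, 4)  # cost matrix D^t
J = np.flatnonzero(a)                         # support set of a^t
a_J, D_J = a[J], D[:, J]                      # reduced data for problem (4)
# Columns of the transport plan outside J must be zero, so they are dropped.
assert D_J.shape == (3, 2) and np.all(a_J > 0)
```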
For notational simplicity, let and be the indicator function over for each . Then, problem (3) can be equivalently written as
where , , are multipliers. Then, the dual problem of (5) is given by
where is the Fenchel conjugate of . Thus, (7) is equivalent to
By introducing auxiliary variables , we can further reformulate the above problem as
Note that problem (8) can be viewed as a linearly constrained convex problem with 3 blocks of variables grouped as , and , whose objective is nonsmooth only with respect to and linear with respect to the other two. Thus, this problem exactly falls into the class of convex problems for which the sGS-ADMM is applicable; see (Chen et al., 2017; Li et al., 2016). Then, it is natural to adapt the sGS-ADMM for solving problem (8), which is presented in the next section.
Remark 2 (2-block ADMM for solving (3))
It is worth noting that one can also apply the 2-block ADMM to solve the primal problem (3) by introducing some proper auxiliary variables. For example, one can consider the following equivalent reformulation of (3):
where . Then, the 2-block ADMM can be readily applied with being one block and as the other. This 2-block ADMM avoids the need to solve quadratic programming subproblems and hence is more efficient than the one used in (Ye and Li, 2014). However, it needs to compute the projection onto the -dimensional simplex times when solving the -subproblem in each iteration. This is still time-consuming when or is large. Thus, this 2-block ADMM is also not efficient enough for solving large-scale problems. In addition, we have adapted the 2-block ADMM for solving other reformulations of (3), but they all perform worse than our sGS-ADMM to be presented later. Hence, we will no longer consider ADMM-type methods for solving the primal problem (3) or its equivalent variants in this paper.
3 sGS-ADMM for computing Wasserstein barycenters
where , , are multipliers. Then, the augmented Lagrangian function associated with (8) is
Input: the penalty parameter , and the initialization , , , , , , . Set .
while a termination criterion is not met, do
where $\tau$ is the dual step-size that is typically set to $1.618$.
Output: , , , , , .
Compared with the directly extended ADMM, our sGS-ADMM in Algorithm 1 has just one more update of in Step 2a. This step is actually the key to guaranteeing the convergence of the algorithm. In the next section, we shall see that computed from Steps 2a–2c is equivalent to minimizing plus a special proximal term simultaneously with respect to . Moreover, all subproblems in Algorithm 1 can be solved efficiently (in fact analytically) and the subproblems in each step can also be computed in parallel. This makes our method highly suitable for solving large-scale problems.
The reader may have observed that instead of computing and sequentially as in Steps 2a–2c, one can also compute simultaneously in one step by solving a huge linear system of equations of dimension . Unfortunately, for the latter approach, the computation of the solution would require the Cholesky factorization of a huge coefficient matrix, and this approach is not practically viable. In contrast, for our approach in Steps 2a–2c, we shall see shortly that the solutions can be computed analytically without the need to perform Cholesky factorizations of large coefficient matrices. This also explains why we have designed the computations as in Steps 2a–2c.
Before ending this section, we present the computational details and the efficient implementations in each step of Algorithm 1.
1. Note that is actually separable with respect to and hence one can compute independently. Specifically, is obtained by solving
Thus, we have
where the last equality follows from the Moreau decomposition (Bauschke and Combettes, 2011, Theorem 14.3(ii)), i.e., $y = \mathrm{prox}_{\nu f}(y) + \nu\,\mathrm{prox}_{\nu^{-1} f^{*}}(y/\nu)$ for any $y$ and $\nu > 0$, and the proximal mapping of can be computed efficiently by the algorithm proposed in (Condat, 2016) with the $O(n)$ complexity that is typically observed in practice. Moreover, for each , can be computed in parallel by solving
Then, it is easy to see that
where . Note that is already computed for updating in the previous iteration and thus it can be reused in the current iteration. The computational complexity in this step is . We should emphasize that because the matrices such as , are large and numerous, even performing simple operations such as adding two such matrices can be time consuming. Thus we have paid special attention to arrange the computations in each step of the sGS-ADMM so that matrices computed in one step can be reused for the next step.
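For reference, a simple sort-based projection onto the probability simplex is sketched below; it computes the same projection as Condat's algorithm cited above, at an $O(n \log n)$ cost instead of the typically linear cost of Condat's method. This is our own illustrative variant, not the paper's implementation.

```python
import numpy as np

# Sort-based projection onto the simplex {x : x >= 0, sum(x) = 1}.
def proj_simplex(y):
    u = np.sort(y)[::-1]                        # sort entries in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, y.size + 1)
    rho = np.max(np.where(u - css / k > 0)[0])  # last index with positive gap
    tau = css[rho] / (rho + 1)                  # shift that makes the sum 1
    return np.maximum(y - tau, 0.0)

x = proj_simplex(np.array([0.2, 0.9, -0.1]))
assert np.allclose(x, [0.15, 0.85, 0.0])
```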
2a. Similarly, is separable with respect to and then one can also compute , , in parallel. For each , is obtained by solving
It is easy to prove that
where . Note that has already been computed in Step 1 and hence can be computed by just a simple operation. We note that is computed analytically for all , and the computational complexity in this step is .
2b. In this step, one can see that , , are coupled in (due to the quadratic term ) and hence the problem of minimizing with respect to cannot be reduced to separable subproblems. However, one can still compute them efficiently based on the following observation. Note that , , is obtained by solving
The gradient of with respect to is
where and . It follows from the optimality conditions, namely, that, for any ,
By dividing in (10) for , adding all resulting equations and doing some simple algebraic manipulations, one can obtain that