
# A Fast Globally Linearly Convergent Algorithm for the Computation of Wasserstein Barycenters

Lei Yang orayl@nus.edu.sg
Institute of Operations Research and Analytics
National University of Singapore
10 Lower Kent Ridge Road, Singapore 119076

Jia Li jiali@stat.psu.edu
Department of Statistics
Pennsylvania State University
University Park, PA 16802, USA

Defeng Sun defeng.sun@polyu.edu.hk (The research of this author was supported in part by a start-up research grant from the Hong Kong Polytechnic University.)
Department of Applied Mathematics
The Hong Kong Polytechnic University
Hung Hom, Kowloon, Hong Kong

Kim-Chuan Toh mattohkc@nus.edu.sg (The research of this author was supported in part by the Ministry of Education, Singapore, Academic Research Fund (Grant No. R-146-000-256-114).)
Department of Mathematics and Institute of Operations Research and Analytics
National University of Singapore
10 Lower Kent Ridge Road, Singapore 119076
###### Abstract

In this paper, we consider the problem of computing a Wasserstein barycenter for a set of discrete probability distributions with finite supports, which finds many applications in areas such as statistics, machine learning and image processing. When the support points of the barycenter are pre-specified, this problem can be modeled as a linear program (LP) whose problem size can be extremely large. To handle this large-scale LP, we analyze the structure of its dual problem, which is conceivably more tractable and can be reformulated as a well-structured convex problem with three types of block variables and a coupling linear equality constraint. We then adapt a symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM) to solve the resulting dual problem and establish its global convergence and global linear convergence rate. As a critical component for efficient computation, we also show how all the subproblems involved can be solved exactly and efficiently. This makes our method suitable for computing a Wasserstein barycenter on a large dataset, without introducing an entropy regularization term as is commonly practiced. In addition, our sGS-ADMM can be used as a subroutine in an alternating minimization method to compute a barycenter when its support points are not pre-specified. Numerical results on synthetic datasets and image datasets demonstrate that our method is highly competitive for solving large-scale problems, in comparison to two existing representative methods and the commercial software Gurobi.


Keywords: Wasserstein barycenter; discrete probability distribution; semi-proximal ADMM; symmetric Gauss-Seidel.

## 1 Introduction

In this paper, we consider the problem of computing the mean of a set of discrete probability distributions under the Wasserstein distance (also known as the optimal transport distance or the earth mover’s distance). This mean, called the Wasserstein barycenter, is also a discrete probability distribution (Agueh and Carlier, 2011). Recently, the Wasserstein barycenter has attracted much attention due to its promising performance in many application areas such as data analysis and statistics (Bigot and Klein, 2018), machine learning (Cuturi and Doucet, 2014; Li and Wang, 2008; Ye and Li, 2014; Ye et al., 2017) and image processing (Rabin et al., 2011). For a set of discrete probability distributions with finite support points, a Wasserstein barycenter with its support points being pre-specified can be computed by solving a linear programming (LP) problem (Anderes et al., 2016). However, the problem size can be extremely large when the number of discrete distributions or the number of support points of each distribution is large. Thus, classical LP methods such as the simplex method and the interior point method are no longer efficient enough or consume too much memory when solving this problem. This motivates the study of fast algorithms for the computation of Wasserstein barycenters; see, for example, (Benamou et al., 2015; Borgwardt and Patterson, 2018; Carlier et al., 2015; Claici et al., 2018; Cuturi and Doucet, 2014; Cuturi and Peyré, 2016; Oberman and Ruan, 2015; Schmitzer, 2019; Uribe et al., 2018; Xie et al., 2018; Ye and Li, 2014; Ye et al., 2017).

One representative approach is to introduce an entropy regularization in the LP and then apply some efficient first-order methods, e.g., the gradient descent method (Cuturi and Doucet, 2014) and the iterative Bregman projection (IBP) method (Benamou et al., 2015), to solve the regularized problem. These methods can be implemented efficiently and hence are suitable for large-scale datasets. However, they can only return an approximate solution of the LP (due to the entropy regularization) and often encounter numerical issues when the regularization parameter becomes small. IBP is highly efficient if an approximate solution is adequate, as is the case in many learning tasks. However, our aim here is to obtain a high precision solution at low computational costs. Detailed empirical studies by Ye et al. (2017) on the pros and cons of IBP found that it can be difficult to drive the regularization parameter to smaller values, so as to obtain a more accurate solution, without encountering numerical issues. We will provide a comparison with IBP in the experiments. Another approach is to consider the LP as a constrained convex optimization problem with a separable structure and then apply some splitting methods to solve it. For example, the alternating direction method of multipliers (ADMM) was adapted in (Ye and Li, 2014). However, solving the quadratic programming subproblems involved is still highly expensive. Later, Ye et al. (2017) developed a modified Bregman ADMM (BADMM) based on the original one (Wang and Banerjee, 2014) to solve the LP. In this method, all subproblems have closed-form solutions and hence can be solved efficiently. Promising numerical performance was also reported in (Ye et al., 2017). However, this modified Bregman ADMM still lacks a convergence guarantee.

In this paper, we also consider the LP as a constrained convex problem with multiple blocks of variables and develop an efficient method to solve its dual LP without introducing an entropy regularization to modify the objective function. We believe it is important to have an efficient algorithm that can faithfully solve the original non-regularized LP. Indeed, the barycenter problem itself is often embedded as a subroutine of a larger problem, and in that scenario, the original barycenter problem should be solved accurately so that the computed barycenter does not introduce significant error into the solution of the outer problem.

Motivated by the above studies, in this paper, we adapt the sGS-ADMM to compute a Wasserstein barycenter by solving the dual problem of the original primal LP. The contributions of this paper are listed as follows:

• We derive the dual problem of the original primal LP and characterize the properties of their optimal solutions; see Proposition 1. The resulting dual problem is our target problem, which we reformulate as a linearly constrained convex problem containing 3 blocks of variables with a carefully delineated separable structure designed for efficient computations. We should emphasize again that we do not introduce the entropic or quadratic regularization to modify the LP so as to make it computationally more tractable. This is in contrast to many existing works that primarily focus on solving an approximation of the original LP arising from optimal transport related problems; see, for example, (Benamou et al., 2015; Cuturi, 2013; Cuturi and Doucet, 2014; Dessein et al., 2018; Essid and Solomon, 2018).

• We apply the sGS-ADMM to solve the resulting dual problem and analyze its global convergence as well as its global linear convergence rate without any condition; see Theorems 1 and 2. As a critical component of the paper, we also develop essential numerical strategies to show how all the subproblems in our method can be solved efficiently and that the subproblems at each step can be computed in parallel. This makes our sGS-ADMM highly suitable for computing Wasserstein barycenters on a large dataset.

• We conduct rigorous numerical experiments on synthetic datasets and MNIST to evaluate the performance of our sGS-ADMM in comparison to existing state-of-the-art methods (IBP and BADMM) and the highly powerful commercial solver Gurobi. The computational results show that our sGS-ADMM is highly competitive compared to IBP and BADMM, and is also able to outperform Gurobi in solving large-scale LPs arising from Wasserstein barycenter problems.

The rest of this paper is organized as follows. In Section 2, we describe the basic problem of computing Wasserstein barycenters and derive its dual problem. In Section 3, we adapt the sGS-ADMM to solve the resulting dual problem and present the efficient implementations for each step that are crucial in making our method competitive. The convergence analysis of the sGS-ADMM is presented in Section 4. Finally, numerical results are presented in Section 5, with some concluding remarks given in Section 6.

#### Notation and Preliminaries

In this paper, we present scalars, vectors and matrices in lower case letters, bold lower case letters and upper case letters, respectively. We use $\mathbb{R}$, $\mathbb{R}^n$, $\mathbb{R}^n_+$ and $\mathbb{R}^{m\times n}$ to denote the set of real numbers, $n$-dimensional real vectors, $n$-dimensional real vectors with nonnegative entries and $m\times n$ real matrices, respectively. For a vector $x$, $x_i$ denotes its $i$-th entry, $\|x\|$ denotes its Euclidean norm and $\|x\|_M := \sqrt{x^\top M x}$ denotes its weighted norm associated with the symmetric positive semidefinite matrix $M$. For a matrix $X$, $x_{ij}$ denotes its $(i,j)$-th entry, $X_{i:}$ denotes its $i$-th row, $X_{:j}$ denotes its $j$-th column, $\|X\|_F$ denotes its Frobenius norm and $\mathrm{vec}(X)$ denotes the vectorization of $X$. We also use $x \ge 0$ and $X \ge 0$ to denote $x_i \ge 0$ for all $i$ and $x_{ij} \ge 0$ for all $(i,j)$. The identity matrix of size $n \times n$ is denoted by $I_n$. For any $X \in \mathbb{R}^{m\times n_1}$ and $Y \in \mathbb{R}^{m\times n_2}$, $[X,\,Y]$ denotes the matrix obtained by horizontally concatenating $X$ and $Y$. For any $X \in \mathbb{R}^{m_1\times n}$ and $Y \in \mathbb{R}^{m_2\times n}$, $[X;\,Y]$ denotes the matrix obtained by vertically concatenating $X$ and $Y$. For any $X \in \mathbb{R}^{m\times n}$ and $Y \in \mathbb{R}^{p\times q}$, the Kronecker product $X \otimes Y$ is defined as

$$X \otimes Y = \begin{bmatrix} x_{11}Y & \cdots & x_{1n}Y \\ \vdots & & \vdots \\ x_{m1}Y & \cdots & x_{mn}Y \end{bmatrix}.$$

For an extended-real-valued function $f : \mathbb{R}^n \to [-\infty, +\infty]$, we say that it is proper if $f(x) > -\infty$ for all $x \in \mathbb{R}^n$ and its domain $\mathrm{dom}\,f := \{x \in \mathbb{R}^n : f(x) < +\infty\}$ is nonempty. A proper function is said to be closed if it is lower semicontinuous. Assume that $f : \mathbb{R}^n \to (-\infty, +\infty]$ is a proper and closed convex function. The subdifferential of $f$ at $x \in \mathbb{R}^n$ is defined by $\partial f(x) := \{d \in \mathbb{R}^n : f(y) \ge f(x) + \langle d,\, y - x\rangle \ \ \forall\, y \in \mathbb{R}^n\}$ and its conjugate function $f^* : \mathbb{R}^n \to (-\infty, +\infty]$ is defined by $f^*(y) := \sup_{x}\{\langle x,\, y\rangle - f(x)\}$. For any $x$ and $y$, it follows from (Rockafellar, 1970, Theorem 23.5) that

$$y \in \partial f(x) \iff x \in \partial f^*(y). \qquad (1)$$

For any $\nu > 0$, the proximal mapping of $\nu f$ at $y \in \mathbb{R}^n$ is defined by

$$\mathrm{Prox}_{\nu f}(y) := \operatorname*{argmin}_{x}\Big\{ f(x) + \frac{1}{2\nu}\|x - y\|^2 \Big\}.$$

For a closed convex set $\mathcal{X}$, its indicator function $\delta_{\mathcal{X}}$ is defined by $\delta_{\mathcal{X}}(x) = 0$ if $x \in \mathcal{X}$ and $\delta_{\mathcal{X}}(x) = +\infty$ otherwise. Moreover, we use $\mathcal{P}_{\mathcal{X}}(x)$ to denote the projection of $x$ onto a closed convex set $\mathcal{X}$. It is easy to see that $\mathcal{P}_{\mathcal{X}}(x) = \mathrm{Prox}_{\delta_{\mathcal{X}}}(x)$.
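As a tiny numerical illustration of the last identity (ours, not the paper's): for the nonnegative orthant, the proximal mapping of the indicator function is exactly the entrywise projection.

```python
import numpy as np

# Prox of the indicator of a closed convex set equals the projection onto it;
# for {x : x >= 0} the projection is the entrywise maximum with zero.
def proj_nonneg(x):
    return np.maximum(x, 0.0)
```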

In the following, a discrete probability distribution $\mathcal{P}$ with finite support points is specified by $\{(a_i,\, q_i) : i = 1, \dots, m\}$, where $q_1, \dots, q_m \in \mathbb{R}^d$ are the support points or vectors and $a_1, \dots, a_m$ are the associated probabilities or weights satisfying $\sum_{i=1}^{m} a_i = 1$ and $a_i \ge 0$, $i = 1, \dots, m$.

## 2 Problem statement

In this section, we briefly recall the Wasserstein distance and describe the problem of computing a Wasserstein barycenter for a set of discrete probability distributions with finite support points. We refer interested readers to (Villani, 2008, Chapter 6) for more details on the Wasserstein distance and to (Agueh and Carlier, 2011; Anderes et al., 2016) for more details on the Wasserstein barycenter.

Given two discrete distributions $\mathcal{P}^{(u)} = \{(u_i,\, p_i) : i = 1, \dots, m\}$ and $\mathcal{P}^{(v)} = \{(v_j,\, q_j) : j = 1, \dots, n\}$, the 2-Wasserstein distance between $\mathcal{P}^{(u)}$ and $\mathcal{P}^{(v)}$ is defined by

$$W_2(\mathcal{P}^{(u)},\, \mathcal{P}^{(v)}) := \sqrt{v^*},$$

where $v^*$ is the optimal objective value of the following linear program:

$$v^* := \min_{\Pi \ge 0}\ \Big\{ \sum_{i=1}^{m}\sum_{j=1}^{n} \pi_{ij}\,\|p_i - q_j\|^2 \ : \ \sum_{j=1}^{n}\pi_{ij} = u_i, \ i = 1, \dots, m, \quad \sum_{i=1}^{m}\pi_{ij} = v_j, \ j = 1, \dots, n \Big\}.$$

Then, given a set of discrete probability distributions $\{\mathcal{P}^{(1)}, \dots, \mathcal{P}^{(N)}\}$ with $\mathcal{P}^{(t)} = \{(a^{(t)}_j,\, q^{(t)}_j) : j = 1, \dots, m_t\}$, a Wasserstein barycenter $\mathcal{P} = \{(w_i,\, x_i) : i = 1, \dots, m\}$ with $m$ support points is an optimal solution of the following problem

$$\min_{\mathcal{P}}\ \ \sum_{t=1}^{N} \gamma_t \big(W_2(\mathcal{P},\, \mathcal{P}^{(t)})\big)^2$$

for given weights $(\gamma_1, \dots, \gamma_N)$ satisfying $\sum_{t=1}^{N}\gamma_t = 1$ and $\gamma_t > 0$, $t = 1, \dots, N$. It is worth noting that the number of support points of the true barycenter is theoretically upper bounded by $\sum_{t=1}^{N} m_t - N + 1$; see (Anderes et al., 2016, Theorem 2). However, in practice, one usually chooses a small $m$ to simplify computations. The above problem is a two-stage optimization problem that can be easily shown to be equivalent to

$$\begin{array}{rl} \min\limits_{\{x_i\},\,w,\,\{\Pi^{(t)}\}} & \sum_{t=1}^{N} \gamma_t\, \langle D^{(t)},\, \Pi^{(t)}\rangle \\[4pt] \mathrm{s.t.} & \Pi^{(t)} e_{m_t} = w, \quad (\Pi^{(t)})^\top e_m = a^{(t)}, \quad \Pi^{(t)} \ge 0, \quad t = 1, \dots, N, \\[2pt] & e_m^\top w = 1, \quad w \ge 0, \end{array} \qquad (2)$$

where

• $e_m$ (resp. $e_{m_t}$) denotes the $m$ (resp. $m_t$) dimensional vector with all entries being 1;

• $w := (w_1, \dots, w_m)^\top \in \mathbb{R}^m$, $a^{(t)} := (a^{(t)}_1, \dots, a^{(t)}_{m_t})^\top \in \mathbb{R}^{m_t}$;

• $D^{(t)} := \big(\|x_i - q^{(t)}_j\|^2\big) \in \mathbb{R}^{m\times m_t}$, for $t = 1, \dots, N$;

• $\Pi^{(t)} \in \mathbb{R}^{m\times m_t}$ denotes the transport plan between $\mathcal{P}$ and $\mathcal{P}^{(t)}$, for $t = 1, \dots, N$.

Note that problem (2) is a nonconvex problem, where one needs to find the optimal support points $\{x_1, \dots, x_m\}$ and the optimal weight vector $w$ of a barycenter simultaneously. However, in many real applications, the support of a barycenter can be specified empirically from the support points of $\mathcal{P}^{(1)}, \dots, \mathcal{P}^{(N)}$. Thus, one only needs to find the weight vector of a barycenter. In view of this, from now on, we assume that the support points $\{x_1, \dots, x_m\}$ are given. Consequently, problem (2) reduces to the following problem:

$$\begin{array}{rl} \min\limits_{w,\,\{\Pi^{(t)}\}} & \sum_{t=1}^{N} \langle D^{(t)},\, \Pi^{(t)}\rangle \\[4pt] \mathrm{s.t.} & \Pi^{(t)} e_{m_t} = w, \quad (\Pi^{(t)})^\top e_m = a^{(t)}, \quad \Pi^{(t)} \ge 0, \quad t = 1, \dots, N, \\[2pt] & e_m^\top w = 1, \quad w \ge 0, \end{array} \qquad (3)$$

where, for simplicity, $D^{(t)}$ now denotes the scaled matrix $\gamma_t D^{(t)}$. This is also the main problem studied in (Benamou et al., 2015; Borgwardt and Patterson, 2018; Carlier et al., 2015; Claici et al., 2018; Cuturi and Doucet, 2014; Cuturi and Peyré, 2016; Oberman and Ruan, 2015; Schmitzer, 2019; Uribe et al., 2018; Ye and Li, 2014; Ye et al., 2017) for the computation of Wasserstein barycenters. Moreover, one can easily see that problem (3) is indeed a large-scale LP containing $m\big(1 + \sum_{t=1}^{N} m_t\big)$ nonnegative variables and $Nm + \sum_{t=1}^{N} m_t + 1$ equality constraints. Even for moderate values of $m$, $m_t$ and $N$, such an LP can easily have tens of millions of nonnegative variables and hundreds of thousands of equality constraints.
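To get a feel for these sizes, the following sketch (ours, with illustrative values only) tallies the counts, assuming the variables are $w$ together with the $N$ transport plans $\Pi^{(t)}$ and the constraints are those displayed above:

```python
def lp_size(m, m_ts):
    """Count nonnegative variables and equality constraints of the LP,
    for a barycenter with m support points and distributions of sizes m_ts."""
    N = len(m_ts)
    num_vars = m + m * sum(m_ts)      # w plus all Pi^(t)
    num_eqs = N * m + sum(m_ts) + 1   # row sums, column sums, sum(w) = 1
    return num_vars, num_eqs

# e.g. 100 distributions with 1000 support points each, m = 1000:
print(lp_size(1000, [1000] * 100))  # about 1e8 variables, 2e5 constraints
```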

###### Remark 1 (Practical computational consideration when a(t) is sparse)

Note that any feasible point $\{\Pi^{(t)}\}$ of problem (3) must satisfy $(\Pi^{(t)})^\top e_m = a^{(t)}$ and $\Pi^{(t)} \ge 0$ for any $t$. This implies that if $a^{(t)}_j = 0$ for some $t$ and $j$, then $\pi^{(t)}_{ij} = 0$ for all $i$, i.e., all entries in the $j$-th column of $\Pi^{(t)}$ are zeros. Based on this fact, one can verify the following statements.

• For any optimal solution $(w^*, \{\Pi^{(t),*}\})$ of problem (3), the point $(w^*, \{\Pi^{(t),*}_{\mathcal{J}_t}\})$ is also an optimal solution of the following problem

$$\begin{array}{rl} \min\limits_{w,\,\{\Pi^{(t)}\}} & \sum_{t=1}^{N} \langle D^{(t)}_{\mathcal{J}_t},\, \Pi^{(t)}\rangle \\[4pt] \mathrm{s.t.} & \Pi^{(t)} e_{|\mathcal{J}_t|} = w, \quad (\Pi^{(t)})^\top e_m = a^{(t)}_{\mathcal{J}_t}, \quad \Pi^{(t)} \ge 0, \quad t = 1, \dots, N, \\[2pt] & e_m^\top w = 1, \quad w \ge 0, \end{array} \qquad (4)$$

where $\mathcal{J}_t$ denotes the support set of $a^{(t)}$, i.e., $\mathcal{J}_t := \{j : a^{(t)}_j > 0\}$, $|\mathcal{J}_t|$ denotes the cardinality of $\mathcal{J}_t$, $a^{(t)}_{\mathcal{J}_t}$ denotes the subvector of $a^{(t)}$ obtained by selecting the entries indexed by $\mathcal{J}_t$, and $D^{(t)}_{\mathcal{J}_t}$ (resp. $\Pi^{(t)}_{\mathcal{J}_t}$) denotes the submatrix of $D^{(t)}$ (resp. $\Pi^{(t)}$) obtained by selecting the columns indexed by $\mathcal{J}_t$.

• For any optimal solution $(\bar{w}, \{\bar{\Pi}^{(t)}\})$ of problem (4), the point $(\bar{w}, \{\Pi^{(t)}\})$ obtained by setting $\Pi^{(t)}_{\mathcal{J}_t} = \bar{\Pi}^{(t)}$ and $\Pi^{(t)}_{\mathcal{J}^c_t} = 0$ is also an optimal solution of problem (3), where $\mathcal{J}^c_t := \{1, \dots, m_t\} \setminus \mathcal{J}_t$.

Therefore, one can obtain an optimal solution of problem (3) by computing an optimal solution of problem (4). Note that the size of problem (4) can be much smaller than that of problem (3) when each $a^{(t)}$ is sparse, i.e., $|\mathcal{J}_t| \ll m_t$. Thus, solving problem (4) can reduce the computational cost and save memory in practice. Since problem (4) takes the same form as problem (3), we only consider problem (3) in the following.
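A minimal sketch of this reduction (ours, not the authors' code): `reduce_columns` forms the data of the reduced problem from the support set of $a^{(t)}$, and `expand_plan` pads a reduced plan back to full size with zero columns.

```python
import numpy as np

def reduce_columns(D, a):
    """Drop the columns of D^(t) whose weight a_j^(t) is zero."""
    J = np.flatnonzero(a > 0)      # support set of a^(t)
    return D[:, J], a[J], J

def expand_plan(Pi_small, J, m_t):
    """Pad an optimal plan of the reduced problem with zero columns."""
    Pi = np.zeros((Pi_small.shape[0], m_t))
    Pi[:, J] = Pi_small
    return Pi
```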

For notational simplicity, let $\Delta_m := \{w \in \mathbb{R}^m : e_m^\top w = 1,\ w \ge 0\}$ denote the unit simplex and let $\delta_{t+}$ denote the indicator function over $\mathbb{R}^{m\times m_t}_+$ for each $t = 1, \dots, N$. Then, problem (3) can be equivalently written as

$$\begin{array}{rl} \min\limits_{w,\,\{\Pi^{(t)}\}} & \delta_{\Delta_m}(w) + \sum_{t=1}^{N} \delta_{t+}(\Pi^{(t)}) + \sum_{t=1}^{N} \langle D^{(t)},\, \Pi^{(t)}\rangle \\[4pt] \mathrm{s.t.} & \Pi^{(t)} e_{m_t} = w, \quad (\Pi^{(t)})^\top e_m = a^{(t)}, \quad t = 1, \dots, N. \end{array} \qquad (5)$$

We next derive the dual problem of (5) (hence (3)). To this end, we write down the Lagrangian function associated with (5) as follows:

$$\begin{aligned} \Upsilon\big(w, \{\Pi^{(t)}\};\, \{y^{(t)}\}, \{z^{(t)}\}\big) := {} & \delta_{\Delta_m}(w) + \sum_{t=1}^{N} \delta_{t+}(\Pi^{(t)}) + \sum_{t=1}^{N} \langle D^{(t)},\, \Pi^{(t)}\rangle \\ & + \sum_{t=1}^{N} \big\langle y^{(t)},\, \Pi^{(t)} e_{m_t} - w\big\rangle + \sum_{t=1}^{N} \big\langle z^{(t)},\, (\Pi^{(t)})^\top e_m - a^{(t)}\big\rangle, \end{aligned} \qquad (6)$$

where $y^{(t)} \in \mathbb{R}^m$ and $z^{(t)} \in \mathbb{R}^{m_t}$, $t = 1, \dots, N$, are multipliers. Then, the dual problem of (5) is given by

$$\max_{\{y^{(t)}\},\,\{z^{(t)}\}}\ \min_{w,\,\{\Pi^{(t)}\}}\ \Upsilon\big(w, \{\Pi^{(t)}\};\, \{y^{(t)}\}, \{z^{(t)}\}\big). \qquad (7)$$

Observe that

$$\begin{aligned} &\min_{w,\,\{\Pi^{(t)}\}}\ \Upsilon\big(w, \{\Pi^{(t)}\};\, \{y^{(t)}\}, \{z^{(t)}\}\big) \\ &\quad= \min_{w,\,\{\Pi^{(t)}\}} \Big\{ \delta_{\Delta_m}(w) - \Big\langle \sum_{t=1}^{N} y^{(t)},\, w \Big\rangle + \sum_{t=1}^{N} \Big( \delta_{t+}(\Pi^{(t)}) + \big\langle D^{(t)} + y^{(t)} e_{m_t}^\top + e_m (z^{(t)})^\top,\, \Pi^{(t)} \big\rangle \Big) - \sum_{t=1}^{N} \langle z^{(t)},\, a^{(t)}\rangle \Big\} \\ &\quad= \begin{cases} -\delta^*_{\Delta_m}\Big(\sum_{t=1}^{N} y^{(t)}\Big) - \sum_{t=1}^{N} \langle z^{(t)},\, a^{(t)}\rangle, & \text{if } D^{(t)} + y^{(t)} e_{m_t}^\top + e_m (z^{(t)})^\top \ge 0,\ t = 1, \dots, N, \\ -\infty, & \text{otherwise}, \end{cases} \end{aligned}$$

where $\delta^*_{\Delta_m}$ is the Fenchel conjugate of $\delta_{\Delta_m}$. Thus, (7) is equivalent to

$$\begin{array}{rl} \min\limits_{\{y^{(t)}\},\,\{z^{(t)}\}} & \delta^*_{\Delta_m}\Big(\sum_{t=1}^{N} y^{(t)}\Big) + \sum_{t=1}^{N} \langle z^{(t)},\, a^{(t)}\rangle \\[4pt] \mathrm{s.t.} & D^{(t)} + y^{(t)} e_{m_t}^\top + e_m (z^{(t)})^\top \ge 0, \quad t = 1, \dots, N. \end{array}$$

By introducing auxiliary variables $u \in \mathbb{R}^m$ and $V^{(t)} \in \mathbb{R}^{m\times m_t}$, $t = 1, \dots, N$, we can further reformulate the above problem as

$$\begin{array}{rl} \min\limits_{u,\,\{V^{(t)}\},\,\{y^{(t)}\},\,\{z^{(t)}\}} & \delta^*_{\Delta_m}(u) + \sum_{t=1}^{N} \delta_{t+}(V^{(t)}) + \sum_{t=1}^{N} \langle z^{(t)},\, a^{(t)}\rangle \\[4pt] \mathrm{s.t.} & \sum_{t=1}^{N} y^{(t)} - u = 0, \\[2pt] & V^{(t)} - D^{(t)} - y^{(t)} e_{m_t}^\top - e_m (z^{(t)})^\top = 0, \quad t = 1, \dots, N. \end{array} \qquad (8)$$

Note that problem (8) can be viewed as a linearly constrained convex problem with 3 blocks of variables grouped as $(u, \{V^{(t)}\})$, $\{y^{(t)}\}$ and $\{z^{(t)}\}$, whose objective is nonsmooth only with respect to $(u, \{V^{(t)}\})$ and linear with respect to the other two. Thus, this problem exactly falls into the class of convex problems for which the sGS-ADMM is applicable; see (Chen et al., 2017; Li et al., 2016). Then, it is natural to adapt the sGS-ADMM for solving problem (8), which is presented in the next section.

###### Remark 2 (2-block ADMM for solving (3))

It is worth noting that one can also apply the 2-block ADMM to solve the primal problem (3) by introducing some proper auxiliary variables. For example, one can consider the following equivalent reformulation of (3):

$$\begin{array}{rl} \min\limits_{w,\,\{\Pi^{(t)}\},\,\{\Gamma^{(t)}\}} & \delta_{\Delta_m}(w) + \sum_{t=1}^{N} \delta_{\Delta_{\Pi}^{(t)}}(\Pi^{(t)}) + \sum_{t=1}^{N} \langle D^{(t)},\, \Pi^{(t)}\rangle \\[4pt] \mathrm{s.t.} & \Pi^{(t)} = \Gamma^{(t)}, \quad \Gamma^{(t)} e_{m_t} = w, \quad t = 1, \dots, N, \end{array}$$

where $\Delta_{\Pi}^{(t)} := \{\Pi \in \mathbb{R}^{m\times m_t} : \Pi^\top e_m = a^{(t)},\ \Pi \ge 0\}$. Then, the 2-block ADMM can be readily applied with $(w, \{\Gamma^{(t)}\})$ being one block and $\{\Pi^{(t)}\}$ as the other. This 2-block ADMM avoids the need to solve quadratic programming subproblems and hence is more efficient than the one used in (Ye and Li, 2014). However, it needs to compute the projection onto an $m$-dimensional simplex-type set $\sum_{t=1}^{N} m_t$ times when solving the $\{\Pi^{(t)}\}$-subproblem in each iteration. This is still time-consuming when $m$ or $\sum_{t} m_t$ is large. Thus, this 2-block ADMM is also not efficient enough for solving large-scale problems. In addition, we have adapted the 2-block ADMM for solving other reformulations of (3), but they all perform worse than our sGS-ADMM to be presented later. Hence, we will no longer consider ADMM-type methods for solving the primal problem (3) or its equivalent variants in this paper.

## 3 sGS-ADMM for computing Wasserstein barycenters

In this section, we present the sGS-ADMM for solving problem (8). First, we write down the Lagrangian function associated with (8) as follows:

$$\begin{aligned} \widetilde{\Upsilon}\big(u, \{V^{(t)}\}, \{y^{(t)}\}, \{z^{(t)}\};\, \lambda, \{\Lambda^{(t)}\}\big) = {} & \delta^*_{\Delta_m}(u) + \sum_{t=1}^{N} \delta_{t+}(V^{(t)}) + \sum_{t=1}^{N} \langle z^{(t)},\, a^{(t)}\rangle + \Big\langle \lambda,\, \sum_{t=1}^{N} y^{(t)} - u \Big\rangle \\ & + \sum_{t=1}^{N} \big\langle \Lambda^{(t)},\, V^{(t)} - D^{(t)} - y^{(t)} e_{m_t}^\top - e_m (z^{(t)})^\top \big\rangle, \end{aligned} \qquad (9)$$

where $\lambda \in \mathbb{R}^m$ and $\Lambda^{(t)} \in \mathbb{R}^{m\times m_t}$, $t = 1, \dots, N$, are multipliers. Then, the augmented Lagrangian function associated with (8) is

$$\begin{aligned} L_{\beta}\big(u, \{V^{(t)}\}, \{y^{(t)}\}, \{z^{(t)}\};\, \lambda, \{\Lambda^{(t)}\}\big) = {} & \widetilde{\Upsilon}\big(u, \{V^{(t)}\}, \{y^{(t)}\}, \{z^{(t)}\};\, \lambda, \{\Lambda^{(t)}\}\big) + \frac{\beta}{2}\Big\|\sum_{t=1}^{N} y^{(t)} - u\Big\|^2 \\ & + \frac{\beta}{2}\sum_{t=1}^{N} \big\|V^{(t)} - D^{(t)} - y^{(t)} e_{m_t}^\top - e_m (z^{(t)})^\top \big\|_F^2, \end{aligned}$$

where $\beta > 0$ is the penalty parameter. The sGS-ADMM for solving (8) is now readily presented in Algorithm 1.

Compared with the directly extended ADMM, our sGS-ADMM in Algorithm 1 just has one more update of $\{z^{(t)}\}$ in Step 2a. This step is actually the key to guaranteeing the convergence of the algorithm. In the next section, we shall see that $(\{y^{(t),k+1}\}, \{z^{(t),k+1}\})$ computed from Steps 2a–2c is equivalent to minimizing $L_\beta$ plus a special proximal term simultaneously with respect to $(\{y^{(t)}\}, \{z^{(t)}\})$. Moreover, all subproblems in Algorithm 1 can be solved efficiently (in fact analytically) and the subproblems in each step can also be computed in parallel. This makes our method highly suitable for solving large-scale problems.

The reader may have observed that instead of computing $\{y^{(t)}\}$ and $\{z^{(t)}\}$ sequentially as in Steps 2a–2c, one can also compute $(\{y^{(t)}\}, \{z^{(t)}\})$ simultaneously in one step by solving a huge linear system of equations of dimension $Nm + \sum_{t=1}^{N} m_t$. Unfortunately, for the latter approach, the computation of the solution would require the Cholesky factorization of a huge coefficient matrix, and this approach is not practically viable. In contrast, for our approach in Steps 2a–2c, we shall see shortly that the solutions can be computed analytically, without the need to perform Cholesky factorizations of large coefficient matrices. This also explains why we have designed the computations as in Steps 2a–2c.

Before ending this section, we present the computational details and the efficient implementations in each step of Algorithm 1.

• 1. Note that $L_\beta$ is actually separable with respect to $u$ and $\{V^{(t)}\}$, and hence one can compute $u^{k+1}$ and $\{V^{(t),k+1}\}$ independently. Specifically, $u^{k+1}$ is obtained by solving

$$\min_{u \in \mathbb{R}^m}\ \Big\{ \delta^*_{\Delta_m}(u) - \langle \lambda^k,\, u\rangle + \frac{\beta}{2}\Big\|\sum_{t=1}^{N} y^{(t),k} - u\Big\|^2 \Big\}.$$

Thus, we have

$$u^{k+1} = \mathrm{Prox}_{\beta^{-1}\delta^*_{\Delta_m}}\Big(\beta^{-1}\lambda^k + \sum_{t=1}^{N} y^{(t),k}\Big) = \Big(\beta^{-1}\lambda^k + \sum_{t=1}^{N} y^{(t),k}\Big) - \beta^{-1}\,\mathrm{Prox}_{\beta\delta_{\Delta_m}}\Big(\lambda^k + \beta\sum_{t=1}^{N} y^{(t),k}\Big),$$

where the last equality follows from the Moreau decomposition (Bauschke and Combettes, 2011, Theorem 14.3(ii)), i.e., $y = \mathrm{Prox}_{\nu f}(y) + \nu\,\mathrm{Prox}_{\nu^{-1} f^*}(y/\nu)$ for any $y$ and $\nu > 0$. The proximal mapping of $\beta\delta_{\Delta_m}$, which is just the projection onto the simplex $\Delta_m$, can be computed efficiently by the algorithm proposed in (Condat, 2016), with the $O(m)$ complexity that is typically observed in practice. Moreover, for each $t = 1, \dots, N$, $V^{(t),k+1}$ can be computed in parallel by solving

$$\min_{V^{(t)}}\ \Big\{ \delta_{t+}(V^{(t)}) + \langle \Lambda^{(t),k},\, V^{(t)}\rangle + \frac{\beta}{2}\big\|V^{(t)} - D^{(t)} - y^{(t),k} e_{m_t}^\top - e_m (z^{(t),k})^\top\big\|_F^2 \Big\}.$$

Then, it is easy to see that

$$V^{(t),k+1} = \max\big\{\widetilde{D}^{(t),k} - \beta^{-1}\Lambda^{(t),k},\ 0\big\},$$

where $\widetilde{D}^{(t),k} := D^{(t)} + y^{(t),k} e_{m_t}^\top + e_m (z^{(t),k})^\top$. Note that $\widetilde{D}^{(t),k}$ is already computed for updating $\Lambda^{(t),k}$ in the previous iteration and thus it can be reused in the current iteration. The computational complexity in this step is $O\big(m\sum_{t=1}^{N} m_t\big)$. We should emphasize that because the matrices such as $\widetilde{D}^{(t),k}$ and $\Lambda^{(t),k}$, $t = 1, \dots, N$, are large and numerous, even performing simple operations such as adding two such matrices can be time-consuming. Thus we have paid special attention to arranging the computations in each step of the sGS-ADMM so that matrices computed in one step can be reused in the next step.
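To make Step 1 concrete, here is a small sketch (ours, not the authors' implementation; the names `lam`, `ys`, `D_tilde`, `Lam` follow the reconstruction in this section, and a standard sorting-based simplex projection stands in for Condat's algorithm):

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1},
    via the classical sorting-based scheme (O(m log m))."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def u_update(lam, ys, beta):
    """u-update via the Moreau decomposition: prox of the conjugate of
    the simplex indicator, evaluated at lam/beta + sum of the y^(t)."""
    x = lam / beta + sum(ys)
    return x - proj_simplex(beta * x) / beta

def v_update(D_tilde, Lam, beta):
    """V^(t)-update: entrywise projection onto the nonnegative orthant."""
    return np.maximum(D_tilde - Lam / beta, 0.0)
```

Both updates touch each entry a constant number of times, which matches the per-iteration cost discussed above.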

• 2a. Similarly, $L_\beta$ is separable with respect to $z^{(1)}, \dots, z^{(N)}$, and then one can also compute $\tilde{z}^{(t),k+1}$, $t = 1, \dots, N$, in parallel. For each $t$, $\tilde{z}^{(t),k+1}$ is obtained by solving

$$\min_{z^{(t)}}\ \Big\{ \langle z^{(t)},\, a^{(t)}\rangle - \langle \Lambda^{(t),k},\, e_m (z^{(t)})^\top\rangle + \frac{\beta}{2}\big\|V^{(t),k+1} - D^{(t)} - y^{(t),k} e_{m_t}^\top - e_m (z^{(t)})^\top\big\|_F^2 \Big\}.$$

It is easy to prove that

$$\tilde{z}^{(t),k+1} = z^{(t),k} - \frac{1}{\beta m}\Big(a^{(t)} + \beta\,(B^{(t),k})^\top e_m\Big),$$

where $B^{(t),k} := \widetilde{D}^{(t),k} - \beta^{-1}\Lambda^{(t),k} - V^{(t),k+1}$. Note that $\widetilde{D}^{(t),k} - \beta^{-1}\Lambda^{(t),k}$ has already been computed in Step 1 and hence $B^{(t),k}$ can be computed by just a simple operation. We note that $\tilde{z}^{(t),k+1}$ is computed analytically for all $t = 1, \dots, N$, and the computational complexity in this step is $O\big(m\sum_{t=1}^{N} m_t\big)$.
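A minimal sketch of this closed form (ours; the names `B`, `a`, `z` follow the reconstruction above, and the data below is illustrative). The sanity check is that the gradient of the Step 2a subproblem vanishes at the output:

```python
import numpy as np

def z_update(z, a, B, beta, m):
    """Closed-form minimizer of the Step 2a subproblem, assuming
    B = D_tilde - Lam/beta - V^{k+1}; only the column sums of B are needed."""
    return z - (a + beta * B.T @ np.ones(m)) / (beta * m)

# Sanity check on random data: the subproblem gradient
# a - Lam^T e + beta * (m z - (V - D - y e^T)^T e) vanishes at the output.
rng = np.random.default_rng(0)
m, mt, beta = 3, 2, 1.7
D, Lam, V = (rng.standard_normal((m, mt)) for _ in range(3))
y, z, a = rng.standard_normal(m), rng.standard_normal(mt), rng.random(mt)
D_tilde = D + np.outer(y, np.ones(mt)) + np.outer(np.ones(m), z)
z_new = z_update(z, a, D_tilde - Lam / beta - V, beta, m)
grad = a - Lam.T @ np.ones(m) + beta * (
    m * z_new - (V - D - np.outer(y, np.ones(mt))).T @ np.ones(m))
```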

• 2b. In this step, one can see that $y^{(1)}, \dots, y^{(N)}$ are coupled in $L_\beta$ (due to the quadratic term $\frac{\beta}{2}\|\sum_{t=1}^{N} y^{(t)} - u^{k+1}\|^2$) and hence the problem of minimizing $L_\beta$ with respect to $\{y^{(t)}\}$ cannot be reduced to $N$ separable subproblems. However, one can still compute them efficiently based on the following observation. Note that $\{y^{(t),k+1}\}$, $t = 1, \dots, N$, is obtained by solving $\min_{\{y^{(t)}\}} \Phi^k(y^{(1)}, \dots, y^{(N)})$, where, up to a constant,

$$\Phi^k(y^{(1)}, \dots, y^{(N)}) := \Big\langle \lambda^k,\, \sum_{t=1}^{N} y^{(t)} \Big\rangle + \frac{\beta}{2}\Big\|\sum_{t=1}^{N} y^{(t)} - u^{k+1}\Big\|^2 + \frac{\beta}{2}\sum_{t=1}^{N} \big\| V^{(t),k+1} - D^{(t)} - y^{(t)} e_{m_t}^\top - e_m (\tilde{z}^{(t),k+1})^\top + \beta^{-1}\Lambda^{(t),k} \big\|_F^2.$$

The gradient of $\Phi^k$ with respect to $y^{(t)}$ is

$$\begin{aligned} \nabla_{y^{(t)}} \Phi^k(y^{(1)}, \dots, y^{(N)}) &= \lambda^k + \beta\Big(\sum_{\ell=1}^{N} y^{(\ell)} - u^{k+1}\Big) + \beta\big(-\beta^{-1}\Lambda^{(t),k} + D^{(t)} + y^{(t)} e_{m_t}^\top + e_m(\tilde{z}^{(t),k+1})^\top - V^{(t),k+1}\big)e_{m_t} \\ &= \beta\sum_{\ell=1}^{N} y^{(\ell)} + \beta m_t\big(y^{(t)} - y^{(t),k}\big) + \lambda^k - \beta u^{k+1} + \beta\big(B^{(t),k} + e_m(\tilde{z}^{(t),k+1} - z^{(t),k})^\top\big)e_{m_t} \\ &= \beta\sum_{\ell=1}^{N}\big(y^{(\ell)} - y^{(\ell),k}\big) + \beta m_t\big(y^{(t)} - y^{(t),k}\big) + \beta h^k + \beta \widetilde{B}^{(t),k} e_{m_t}, \end{aligned}$$

where $h^k := \beta^{-1}\lambda^k - u^{k+1} + \sum_{\ell=1}^{N} y^{(\ell),k}$ and $\widetilde{B}^{(t),k} := B^{(t),k} + e_m(\tilde{z}^{(t),k+1} - z^{(t),k})^\top$. It follows from the optimality conditions, namely $\nabla_{y^{(t)}}\Phi^k(y^{(1),k+1}, \dots, y^{(N),k+1}) = 0$, that, for any $t = 1, \dots, N$,

$$\sum_{\ell=1}^{N}\big(y^{(\ell),k+1} - y^{(\ell),k}\big) + m_t\big(y^{(t),k+1} - y^{(t),k}\big) + h^k + \widetilde{B}^{(t),k} e_{m_t} = 0. \qquad (10)$$

By dividing (10) by $m_t$ for $t = 1, \dots, N$, adding all the resulting equations and doing some simple algebraic manipulations, one can obtain that

$$\tilde{b}^k := \sum_{\ell=1}^{N}\big(y^{(\ell),k+1} - y^{(\ell),k}\big) = -\Big(1 + \sum_{\ell=1}^{N} m_\ell^{-1}\Big)^{-1}\Big(\Big(\sum_{\ell=1}^{N} m_\ell^{-1}\Big)h^k + \sum_{\ell=1}^{N} m_\ell^{-1}\,\widetilde{B}^{(\ell),k} e_{m_\ell}\Big).$$
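Assuming the formula above (our completion of the truncated derivation), the y-update can be sketched as follows: first aggregate the total change $\tilde{b}^k$, then recover each $y^{(t),k+1}$ from (10). The vectors `h` and `gs` stand for $h^k$ and the row sums $\widetilde{B}^{(t),k} e_{m_t}$; this is an illustration, not the authors' implementation.

```python
import numpy as np

def y_update(ys, h, gs, m_ts):
    """Solve the linear optimality system (10) in closed form.
    ys: list of y^(t),k; h: the vector h^k; gs: list of B_tilde^(t),k e_{m_t};
    m_ts: the sizes m_t."""
    inv = [1.0 / mt for mt in m_ts]
    s = sum(inv)
    # total change b = sum_t (y^(t),k+1 - y^(t),k)
    b = -(s * h + sum(w * g for w, g in zip(inv, gs))) / (1.0 + s)
    # from (10): y^(t),k+1 = y^(t),k - (b + h + g_t) / m_t
    return [y - (b + h + g) / mt for y, g, mt in zip(ys, gs, m_ts)]
```

Note that the only coupling across the $N$ blocks is through the single aggregated vector $\tilde{b}^k$, which is why no large Cholesky factorization is needed.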