
Abstract

We provide theoretical complexity analysis for new algorithms to compute the optimal transport (OT) distance between two discrete probability distributions, and demonstrate their favorable practical performance compared to state-of-the-art primal-dual algorithms. First, we introduce the accelerated primal-dual randomized coordinate descent (APDRCD) algorithm for computing the OT distance. We show that its complexity is $\widetilde{\mathcal{O}}(n^{5/2}/\varepsilon)$, where $n$ stands for the number of atoms of these probability measures and $\varepsilon > 0$ is the desired accuracy. This complexity bound matches the best known complexities of primal-dual algorithms for the OT problem, including the adaptive primal-dual accelerated gradient descent (APDAGD) and the adaptive primal-dual accelerated mirror descent (APDAMD) algorithms. Then, we demonstrate the improved practical efficiency of the APDRCD algorithm through extensive comparative experimental studies. We also propose a greedy version of APDRCD, which we refer to as accelerated primal-dual greedy coordinate descent (APDGCD), to further enhance practical performance. Finally, we generalize the APDRCD and APDGCD algorithms to distributed algorithms for computing the Wasserstein barycenter of multiple probability distributions.

1 Introduction

Optimal transport has become an important topic in statistical machine learning. It finds the minimal cost couplings between pairs of probability measures and provides a geometrically faithful way to compare two probability distributions, with diverse applications in areas including Bayesian nonparametrics [19, 20], scalable Bayesian inference [25, 26], topic modeling [16], isotonic regression [23], and deep learning [4, 2, 27].

Nevertheless, the practical impact of OT has been limited by its computational burden. By viewing the optimal transport distance as a linear programming problem, interior-point methods have been employed as a computational solver, with a best known practical complexity of $\widetilde{\mathcal{O}}(n^3)$ [21]. Recently, Lee and Sidford [15] proposed to use a Laplacian linear system solver to theoretically improve the complexity of interior-point methods to $\widetilde{\mathcal{O}}(n^{5/2})$. However, it remains a practical challenge to develop efficient interior-point implementations for the high-dimensional settings of OT applications in machine learning.

Several algorithms have been proposed to circumvent the scalability issues of interior-point methods, including the Sinkhorn algorithm [24, 14, 12, 6], which has a complexity bound of $\widetilde{\mathcal{O}}(n^2/\varepsilon^2)$, where $\varepsilon$ is the desired accuracy [8]. The Greenkhorn algorithm [1] further improves the practical performance of the Sinkhorn algorithm, with a theoretical complexity of $\widetilde{\mathcal{O}}(n^2/\varepsilon^2)$ [17]. However, for large-scale applications of the OT problem, particularly in randomized and asynchronous scenarios such as computing Wasserstein barycenters, the existing literature has shown that neither the Sinkhorn algorithm nor the Greenkhorn algorithm is sufficiently scalable and flexible [5, 7].
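To make this baseline concrete, the Sinkhorn iteration admits a very short implementation. The following is a minimal sketch in plain NumPy, without the log-domain stabilization that practical implementations typically use:

```python
import numpy as np

def sinkhorn(C, r, c, eta, n_iter=1000):
    """Minimal Sinkhorn sketch: alternately rescale the rows and columns of
    the Gibbs kernel K = exp(-C/eta) until the plan diag(u) K diag(v)
    approximately matches the marginals r and c."""
    K = np.exp(-C / eta)
    u = np.ones_like(r)
    v = np.ones_like(c)
    for _ in range(n_iter):
        u = r / (K @ v)      # fix the row marginals
        v = c / (K.T @ u)    # fix the column marginals
    return u[:, None] * K * v[None, :]
```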

Recent research has demonstrated the advantages of the family of accelerated primal-dual algorithms [8, 17] over the Sinkhorn algorithm. This family includes the adaptive primal-dual accelerated gradient descent (APDAGD) algorithm [9] and the adaptive primal-dual accelerated mirror descent (APDAMD) algorithm [17], with complexity bounds of $\widetilde{\mathcal{O}}(n^{5/2}/\varepsilon)$ and $\widetilde{\mathcal{O}}(n^2\sqrt{\delta}/\varepsilon)$, respectively, where $\delta$ is the inverse of the strong convexity constant of the Bregman divergence with respect to the $\ell_\infty$-norm. In addition, the primal-dual algorithms possess the requisite flexibility and scalability compared to the Sinkhorn algorithm, which is crucial for computational OT problems in large-scale applications [5, 11]. Specifically, they are flexible enough to be generalized to the computation of the Wasserstein barycenter of multiple probability distributions in decentralized and asynchronous settings [5, 7].

In the optimization literature, primal-dual methods have served as standard techniques that are readily parallelized for high-dimensional applications [3, 28]. On the other hand, coordinate descent methods have been shown to be well suited to the solution of large-scale machine learning problems [18, 22, 10].

Our contributions. The contributions of the paper are three-fold.

  1. We introduce a novel accelerated primal-dual randomized coordinate descent (APDRCD) algorithm for solving the OT problem. We provide a complexity upper bound of $\widetilde{\mathcal{O}}(n^{5/2}/\varepsilon)$ for the APDRCD algorithm, which matches the complexity of state-of-the-art primal-dual algorithms for the OT problem, such as the APDAGD and APDAMD algorithms [8, 17]. To the best of our knowledge, this is the first accelerated primal-dual coordinate descent algorithm for the OT problem.

  2. We show that the APDRCD algorithm outperforms the APDAGD and APDAMD algorithms on both synthetic and real image datasets with extensive experiments. To further improve the practical performance of the APDRCD algorithm, we present a greedy version of it, which we refer to as the accelerated primal-dual greedy coordinate descent (APDGCD) algorithm. To the best of our knowledge, the APDRCD and APDGCD algorithms achieve the best performance among state-of-the-art accelerated primal-dual algorithms for solving the OT problem.

  3. We demonstrate that the APDRCD and APDGCD algorithms are suitable and parallelizable for other large-scale problems besides the OT problem, e.g., approximating the Wasserstein barycenter for a finite set of probability measures stored over a distributed network.

Organization. The remainder of the paper is organized as follows. In Section 2, we provide the formulation of the entropic OT problem and its dual form. In Section 3, we introduce the APDRCD algorithm for solving the regularized OT problem and provide a complexity upper bound for it. We then present the greedy APDGCD algorithm. In Section 4, we present comparative experiments between the APDRCD and APDGCD algorithms and other primal-dual algorithms, including the APDAGD and APDAMD algorithms. We conclude the paper with a few future directions in Section 5. Finally, the proofs of all the results are included in Appendix A, the generalized APDRCD and APDGCD algorithms for approximating Wasserstein barycenters are presented in Appendix B, and additional experiments are presented in Appendix C.

Notation. We denote by $\Delta^n := \{u \in \mathbb{R}^n : \sum_{i=1}^{n} u_i = 1, \, u \ge 0\}$ the probability simplex for $n \ge 2$. Furthermore, $[n]$ stands for the set $\{1, 2, \ldots, n\}$ while $\mathbb{R}^n_+$ stands for the set of all vectors in $\mathbb{R}^n$ with nonnegative components. For a vector $x \in \mathbb{R}^n$ and $1 \le p \le \infty$, we denote by $\|x\|_p$ its $\ell_p$-norm and by $\text{diag}(x)$ the diagonal matrix with $x$ on the diagonal. For a matrix $X \in \mathbb{R}^{n \times n}$, the notation $\text{vec}(X)$ stands for the vector in $\mathbb{R}^{n^2}$ obtained by concatenating the rows of $X$. $\mathbf{1}$ stands for a vector with all of its components equal to $1$. $\nabla_i f$ refers to the partial gradient of $f$ with respect to its $i$-th coordinate. Lastly, given the dimension $n$ and accuracy $\varepsilon$, the notation $a = \mathcal{O}(b(n, \varepsilon))$ stands for the upper bound $a \le K \cdot b(n, \varepsilon)$, where $K$ is a constant independent of $n$ and $\varepsilon$. Similarly, the notation $a = \widetilde{\mathcal{O}}(b(n, \varepsilon))$ indicates that the previous inequality may additionally depend on a logarithmic function of $n$ and $\varepsilon$.


Figure 1: Performance of the APDRCD and APDAGD algorithms on the synthetic images. In the top two images, the comparison is based on the distance to the transportation polytope, and on the maximum, median and minimum of competitive ratios over ten random pairs of images. In the bottom left image, the comparison is based on varying the regularization parameter $\eta$ and reporting the optimal value of the original optimal transport problem without entropic regularization. Note that the foreground covers 10% of the synthetic images here. In the bottom right image, we compare the algorithms using the median of competitive ratios with the coverage ratio of the foreground varying in $\{0.1, 0.5, 0.9\}$.

2 Problem Setup

In this section, we provide necessary background for the entropic regularized OT problem. The objective function for the problem is presented in Section 2.1 while its dual form as well as the key properties of that dual form are given in Section 2.2.

2.1 Entropic Regularized OT

As shown in [13], the problem of approximating the OT distance between two discrete probability distributions with at most $n$ components is equivalent to the following linear programming problem:

$$\min_{X \in \mathbb{R}^{n \times n}_{+}} \; \langle C, X \rangle \quad \text{s.t.} \quad X\mathbf{1} = r, \;\; X^{\top}\mathbf{1} = c, \tag{1}$$

where $X$ is a transportation plan, $C \in \mathbb{R}^{n \times n}_{+}$ is a cost matrix with nonnegative elements, and $r$ and $c$ refer to two known probability distributions in the probability simplex $\Delta^n$. The best known practical complexity bound for (1) is $\widetilde{\mathcal{O}}(n^3)$ [21], while the best theoretical complexity bound is $\widetilde{\mathcal{O}}(n^{5/2})$ [15], achieved via interior-point methods. However, these methods are not efficient in the high-dimensional settings of OT applications in machine learning. This motivates the entropic regularized OT problem [6]:

$$\min_{X \in \mathbb{R}^{n \times n}_{+}} \; \langle C, X \rangle - \eta H(X) \quad \text{s.t.} \quad X\mathbf{1} = r, \;\; X^{\top}\mathbf{1} = c, \tag{2}$$

where $\eta > 0$ is the regularization parameter and $H(X)$ is the entropic regularization given by $H(X) = -\sum_{i,j=1}^{n} X_{ij} \log X_{ij}$. The main focus of the paper is to determine an $\varepsilon$-approximate transportation plan $\hat{X} \in \mathbb{R}^{n \times n}_{+}$ such that $\hat{X}\mathbf{1} = r$ and $\hat{X}^{\top}\mathbf{1} = c$ and the following bound holds:

$$\langle C, \hat{X} \rangle \le \langle C, X^{*} \rangle + \varepsilon, \tag{3}$$

where $X^{*}$ is an optimal solution, i.e., an optimal transportation plan for the OT problem (1). To ease the ensuing presentation, we refer to such an $\hat{X}$ as an $\varepsilon$-approximation for the OT distance. Furthermore, we define the matrix $A \in \{0, 1\}^{2n \times n^2}$ such that $A \, \text{vec}(X) = \begin{pmatrix} X\mathbf{1} \\ X^{\top}\mathbf{1} \end{pmatrix}$ for any $X \in \mathbb{R}^{n \times n}$.
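For small $n$, problem (1) can also be handed directly to a generic LP solver, which provides a useful correctness reference for the iterative methods studied below. The following sketch uses SciPy's linprog; the equality constraints built here realize exactly the matrix $A$ defined above (one of the $2n$ constraints is redundant, which the solver tolerates):

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(C, r, c):
    """Solve the OT linear program (1) exactly. The n^2 variables are the
    entries of X in row-major order, i.e., vec(X)."""
    n = len(r)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # row-sum constraint: (X 1)_i = r_i
        A_eq[n + i, i::n] = 1.0            # column-sum constraint: (X^T 1)_i = c_i
    b_eq = np.concatenate([r, c])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(n, n), res.fun    # optimal plan and OT value
```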

2.2 Dual Entropic Regularized OT

The Lagrangian function for problem (2) is given by
$$\mathcal{L}(X, \alpha, \beta) = \langle C, X \rangle - \eta H(X) + \langle \alpha, r - X\mathbf{1} \rangle + \langle \beta, c - X^{\top}\mathbf{1} \rangle,$$
where $\alpha, \beta \in \mathbb{R}^n$ are dual variables. Given the Lagrangian function, the dual form of the entropic regularized OT problem can be obtained by solving the optimization problem $\max_{\alpha, \beta} \min_{X} \mathcal{L}(X, \alpha, \beta)$. Since the Lagrangian function is strictly convex in $X$, the inner minimization can be solved by setting $\partial \mathcal{L}/\partial X_{ij} = 0$, which leads to the following form of the transportation plan: $X_{ij} = e^{(\alpha_i + \beta_j - C_{ij})/\eta - 1}$ for all $i, j \in [n]$. With this solution, we have $\min_X \mathcal{L}(X, \alpha, \beta) = \langle \alpha, r \rangle + \langle \beta, c \rangle - \eta \sum_{i,j=1}^{n} e^{(\alpha_i + \beta_j - C_{ij})/\eta - 1}$. The dual entropic regularized OT problem is, therefore, equivalent to the following optimization problem:

$$\min_{\alpha, \beta \in \mathbb{R}^n} \; \varphi(\alpha, \beta) := \eta \sum_{i,j=1}^{n} e^{(\alpha_i + \beta_j - C_{ij})/\eta - 1} - \langle \alpha, r \rangle - \langle \beta, c \rangle. \tag{4}$$
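In code, the map from a dual point to its primal plan and the evaluation of $\varphi$ are one-liners; this is a sketch following the Gibbs form and the entropy convention derived above:

```python
import numpy as np

def plan_from_dual(alpha, beta, C, eta):
    """Transportation plan X(alpha, beta) from the first-order condition
    X_ij = exp((alpha_i + beta_j - C_ij)/eta - 1)."""
    return np.exp((alpha[:, None] + beta[None, :] - C) / eta - 1.0)

def dual_objective(alpha, beta, C, eta, r, c):
    """Dual objective phi(alpha, beta) of Eq. (4)."""
    X = plan_from_dual(alpha, beta, C, eta)
    return eta * X.sum() - alpha @ r - beta @ c
```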

Building on Lemma 4.1 in [17], the dual objective function $\varphi$ can be shown to be smooth with respect to the $\ell_2$-norm:

Lemma 2.1.

The dual objective function $\varphi$ is smooth with respect to the $\ell_2$-norm: for all $\lambda_1, \lambda_2 \in \mathbb{R}^{2n}$,
$$\varphi(\lambda_1) \le \varphi(\lambda_2) + \langle \nabla \varphi(\lambda_2), \lambda_1 - \lambda_2 \rangle + \frac{2}{\eta} \|\lambda_1 - \lambda_2\|_2^2,$$
where we write $\lambda = (\alpha, \beta)$ for the stacked dual variable.

Proof.

The proof is a straightforward application of the result of Lemma 4.1 in [17]. Here, we provide the details of this proof for completeness. Indeed, invoking Lemma 4.1 in [17], we find that $\varphi$ is smooth with respect to the $\ell_2$-norm with smoothness constant $\|A\|_1^2/\eta$.

Since $\|A\|_1$ is equal to the maximum $\ell_1$-norm of a column of $A$, and each column of $A$ contains only two nonzero elements, which are equal to one, we have $\|A\|_1 = 2$. Combining this with the smoothness constant $\|A\|_1^2/\eta$, we establish the result. ∎
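The structural fact used in the proof is easy to verify numerically: every column of $A$ contains exactly two unit entries, so its maximum column $\ell_1$-norm is 2. A small sketch:

```python
import numpy as np

def constraint_matrix(n):
    """The (2n x n^2) matrix A with A vec(X) = (X 1, X^T 1), vec in
    row-major order. Column (i*n + j) touches row-constraint i and
    column-constraint j, so it has exactly two unit entries."""
    A = np.zeros((2 * n, n * n))
    for i in range(n):
        for j in range(n):
            A[i, i * n + j] = 1.0
            A[n + j, i * n + j] = 1.0
    return A

A = constraint_matrix(4)
assert np.abs(A).sum(axis=0).max() == 2.0   # ||A||_1 = 2, as in the proof
```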


Figure 2: Performance of the APDRCD and APDAMD algorithms on the synthetic images. The organization of the images is similar to that in Figure 1.

3 Accelerated Primal-Dual Coordinate Descent Algorithms

In this section, we present and analyze accelerated primal-dual coordinate descent algorithms for obtaining an $\varepsilon$-approximate transportation plan for the OT problem (1). First, in Section 3.1, we introduce the accelerated primal-dual randomized coordinate descent (APDRCD) method for the entropic regularized OT problem. Then, following the approximation scheme of [1], we show how to approximate the OT distance within the APDRCD algorithm; see Algorithm 2 for the pseudo-code for this problem. Furthermore, in Section 3.2 we provide theoretical analysis establishing a complexity bound of $\widetilde{\mathcal{O}}(n^{5/2}/\varepsilon)$ for the APDRCD algorithm to achieve an $\varepsilon$-approximate transportation plan for the OT problem. This complexity upper bound of the APDRCD algorithm matches the best known complexity bounds of the APDAGD [8] and APDAMD [17] algorithms. Finally, to further improve the practical performance of the algorithm, we present a greedy variant, the accelerated primal-dual greedy coordinate descent (APDGCD) algorithm, in Section 3.3.

1 Input: $\theta_0 = 1$, $k = 0$, $\lambda^0 = z^0 = \mathbf{0} \in \mathbb{R}^{2n}$
2 while $\|A\bar{x}^k - b\|_1 > \varepsilon'$ do
3      Set $y^k = (1 - \theta_k)\lambda^k + \theta_k z^k$
4       Compute $x^k = x(y^k)$
5       Randomly sample one coordinate $i_k$ where $i_k \in \{1, 2, \ldots, 2n\}$:
6       Update
(5) $\quad \lambda^{k+1} = y^k - \frac{1}{L} \nabla_{i_k} \varphi(y^k) \, e_{i_k}$
7       Update
(6) $\quad z^{k+1} = z^k - \frac{1}{2nL\theta_k} \nabla_{i_k} \varphi(y^k) \, e_{i_k}$
8       Update $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$ and $k = k + 1$
9 end while
Output: $\bar{x}^k = \frac{1}{C_k} \sum_{j=0}^{k} \frac{x^j}{\theta_j}$ where $C_k = \sum_{j=0}^{k} \frac{1}{\theta_j}$
Algorithm 1 APDRCD ($C, \eta, A, b, \varepsilon'$)

3.1 Accelerated Primal-Dual Randomized Coordinate Descent (APDRCD)

We denote by $L$ the Lipschitz constant of the gradient of the dual objective function $\varphi$, which means that $\|\nabla \varphi(\lambda_1) - \nabla \varphi(\lambda_2)\|_2 \le L \|\lambda_1 - \lambda_2\|_2$ for all $\lambda_1, \lambda_2 \in \mathbb{R}^{2n}$, and $L = 4/\eta$ by Lemma 2.1. The APDRCD algorithm is initialized with the auxiliary sequence $\{\theta_k\}$ and two auxiliary dual variable sequences $\{\lambda^k\}$ and $\{z^k\}$, where the first auxiliary sequence is used for the key averaging step and the two dual variable sequences are used to perform the accelerated randomized coordinate descent on the dual objective function as a subroutine. The APDRCD algorithm is composed of two main parts. First, exploiting the convexity of the dual objective function, we perform a randomized accelerated coordinate descent step on the dual objective function as a subroutine in steps 5 and 6. In the second part, we take a weighted average over the past iterations to get a good approximate solution for the primal problem from the approximate solutions to the dual problem (4). Notice that the auxiliary sequence $\{\theta_k\}$ is decreasing, so the primal solutions corresponding to the more recent dual solutions have larger weight in this average.
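A compact, runnable sketch of the method is given below. It mirrors the structure of Algorithm 1, but the specific step sizes ($1/L$ and $1/(2nL\theta_k)$) and the $\theta$-update should be read as our rendering of the standard accelerated coordinate descent template rather than a verbatim transcription of the paper's constants:

```python
import numpy as np

def apdrcd(C, r, c, eta, eps_prime, max_iter=100000, seed=0):
    """Sketch of APDRCD (Algorithm 1). The dual variable stacks (alpha, beta);
    step sizes and the theta-update follow the standard accelerated
    coordinate descent template (an assumption, not the paper's verbatim
    constants)."""
    rng = np.random.default_rng(seed)
    n = len(r)
    b = np.concatenate([r, c])
    L = 4.0 / eta                      # Lipschitz constant from Lemma 2.1
    lam = np.zeros(2 * n)              # dual iterate lambda^k
    z = np.zeros(2 * n)                # auxiliary dual iterate z^k
    theta = 1.0
    x_sum = np.zeros((n, n))           # accumulates x^j / theta_j
    c_sum = 0.0                        # accumulates C_k = sum_j 1/theta_j

    for _ in range(max_iter):
        y = (1.0 - theta) * lam + theta * z
        X = np.exp((y[:n, None] + y[None, n:] - C) / eta - 1.0)  # x(y^k)
        x_sum += X / theta
        c_sum += 1.0 / theta
        x_bar = x_sum / c_sum          # weighted primal average
        gap = np.concatenate([x_bar.sum(axis=1), x_bar.sum(axis=0)]) - b
        if np.abs(gap).sum() <= eps_prime:     # stopping rule ||A x_bar - b||_1
            break
        grad = np.concatenate([X.sum(axis=1), X.sum(axis=0)]) - b  # grad phi(y^k)
        i = rng.integers(2 * n)        # sample one coordinate uniformly
        lam = y.copy()
        lam[i] -= grad[i] / L                          # Eq. (5)
        z[i] -= grad[i] / (2 * n * L * theta)          # Eq. (6)
        theta = 0.5 * (np.sqrt(theta**4 + 4.0 * theta**2) - theta**2)
    return x_bar
```

For exposition this sketch recomputes the full $n \times n$ plan each iteration; a more careful implementation can exploit the fact that a single-coordinate dual update rescales only one row or column of the plan.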

  Input: $\eta = \frac{\varepsilon}{4 \log n}$ and $\varepsilon' = \frac{\varepsilon}{8 \|C\|_\infty}$.
  Step 1: Let $\tilde{r}$ and $\tilde{c}$ be defined as $\tilde{r} = \left(1 - \frac{\varepsilon'}{8}\right) r + \frac{\varepsilon'}{8n} \mathbf{1}$ and $\tilde{c} = \left(1 - \frac{\varepsilon'}{8}\right) c + \frac{\varepsilon'}{8n} \mathbf{1}$.
  Step 2: Let $A$ and $b$ be defined by $A \, \text{vec}(X) = \begin{pmatrix} X\mathbf{1} \\ X^{\top}\mathbf{1} \end{pmatrix}$ and $b = \begin{pmatrix} \tilde{r} \\ \tilde{c} \end{pmatrix}$.
  Step 3: Compute $\text{vec}(\tilde{X}) = \text{APDRCD}(C, \eta, A, b, \varepsilon'/2)$ with $\varphi$ defined in Section 2.2.
  Step 4: Round $\tilde{X}$ to $\hat{X}$ by Algorithm 2 of [1] such that $\hat{X}\mathbf{1} = r$, $\hat{X}^{\top}\mathbf{1} = c$.
  Output: $\hat{X}$.
Algorithm 2 Approximating OT by APDRCD
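Step 4 invokes the rounding procedure of [1] (their Algorithm 2), which forces the marginals to be exactly $r$ and $c$. A sketch of that procedure (assuming a strictly positive input plan, as produced by the Gibbs form):

```python
import numpy as np

def round_to_polytope(X, r, c):
    """Round an approximate plan X onto the transportation polytope,
    following Algorithm 2 of [1]: scale rows and columns down so they never
    exceed the targets, then fix the residual with a rank-one update that
    preserves nonnegativity."""
    u = np.minimum(r / X.sum(axis=1), 1.0)
    X = u[:, None] * X
    v = np.minimum(c / X.sum(axis=0), 1.0)
    X = X * v[None, :]
    err_r = r - X.sum(axis=1)
    err_c = c - X.sum(axis=0)
    if err_r.sum() > 0:                # nothing to fix if already feasible
        X = X + np.outer(err_r, err_c) / err_r.sum()
    return X
```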

3.2 Complexity Analysis of APDRCD

Given the updates from the APDRCD algorithm in Algorithm 1, we have the following result regarding the difference of the values of $\varphi$ at $\lambda^{k+1}$ and $y^k$:

Lemma 3.1.

Given the updates $\lambda^{k+1}$ and $y^k$ from the APDRCD algorithm, we have the following inequality:
$$\varphi(\lambda^{k+1}) \le \varphi(y^k) - \frac{1}{2L} \left( \nabla_{i_k} \varphi(y^k) \right)^2,$$
where $i_k$ is the coordinate chosen in the APDRCD algorithm.

Proof.

For convenience, we define a vector-valued function $g^{i_k} : \mathbb{R}^{2n} \to \mathbb{R}^{2n}$ such that $[g^{i_k}(y)]_i = \nabla_i \varphi(y)$ if $i = i_k$, and $[g^{i_k}(y)]_i = 0$ otherwise. By the update in Eq. (5) of Algorithm 1, we obtain:

(7) $\quad \lambda^{k+1} - y^k = -\frac{1}{L} \, g^{i_k}(y^k).$

Due to the smoothness of $\varphi$ with respect to the $\ell_2$-norm in Lemma 2.1, the following inequality holds:

(8) $\quad \varphi(\lambda^{k+1}) \le \varphi(y^k) + \langle \nabla \varphi(y^k), \lambda^{k+1} - y^k \rangle + \frac{L}{2} \|\lambda^{k+1} - y^k\|_2^2.$

Combining the results of Eq. (7) and Eq. (8) completes the proof of the lemma. ∎

The result of Lemma 3.1 is vital for establishing an upper bound on the expected dual objective value, as shown in the following lemma.

Lemma 3.2.

For each iteration $k$ ($k \ge 1$) of the APDRCD algorithm, we have

where the outer expectation in the above display is taken with respect to the random coordinate in Algorithm 1.

The proof of Lemma 3.2 is in Appendix A.1. Now, equipped with the result of Lemma 3.2, we are ready to provide the convergence guarantee and complexity bound of the APDRCD algorithm for approximating the OT problem. First, we start with the following result, which gives an upper bound on the number of iterations needed to reach the stopping rule $\|A\bar{x}^k - b\|_1 \le \varepsilon'$. Here, the outer expectation is taken with respect to the random coordinates $i_1, \ldots, i_k$ in Algorithm 1.

Theorem 3.3.

The APDRCD algorithm for approximating optimal transport (Algorithm 2) returns an output $\bar{x}^k$ that satisfies the stopping criterion $\mathbb{E}\left[\|A\bar{x}^k - b\|_1\right] \le \varepsilon'$ in a number of iterations $k$ bounded as follows:

where $R$ denotes an upper bound on the $\ell_\infty$-norm of an optimal solution of the dual problem (4). Here, $\eta$ and $\varepsilon'$ are chosen as in Algorithm 2.

The proof of Theorem 3.3 is provided in Appendix A.2. Given the upper bound on the number of iterations needed to reach the stopping rule in Theorem 3.3, we obtain the following complexity bound for the APDRCD algorithm.

Theorem 3.4.

The APDRCD algorithm for approximating optimal transport (Algorithm 2) returns $\hat{X}$ satisfying $\hat{X}\mathbf{1} = r$, $\hat{X}^{\top}\mathbf{1} = c$ and $\langle C, \hat{X} \rangle \le \langle C, X^{*} \rangle + \varepsilon$ in a total number of
$$\mathcal{O}\left( \frac{n^{5/2} \sqrt{\log n} \, \|C\|_\infty}{\varepsilon} \right)$$
arithmetic operations.

The proof of Theorem 3.4 is provided in Appendix A.3. We show in Appendix A.3 that Theorem 3.4 also directly implies a complexity bound for obtaining an $\varepsilon$-optimal solution with high probability. Theorem 3.4 indicates that the complexity upper bound of APDRCD matches the best known complexity of the APDAGD [8] and APDAMD [17] algorithms. Furthermore, the complexity of APDRCD is better than that of the Sinkhorn and Greenkhorn algorithms, which is $\widetilde{\mathcal{O}}(n^2/\varepsilon^2)$, in terms of the desired accuracy $\varepsilon$. Later, in Section 4, we demonstrate that the APDRCD algorithm also has better practical performance than the APDAGD and APDAMD algorithms on both synthetic and real datasets.

1 Input: $\theta_0 = 1$, $k = 0$, $\lambda^0 = z^0 = \mathbf{0} \in \mathbb{R}^{2n}$
2 while $\|A\bar{x}^k - b\|_1 > \varepsilon'$ do
3      Set $y^k = (1 - \theta_k)\lambda^k + \theta_k z^k$
4       Compute $x^k = x(y^k)$
5       Select the coordinate $i_k = \arg\max_{1 \le i \le 2n} |\nabla_i \varphi(y^k)|$: Update $\lambda^{k+1} = y^k - \frac{1}{L} \nabla_{i_k} \varphi(y^k) \, e_{i_k}$
Update $z^{k+1} = z^k - \frac{1}{2nL\theta_k} \nabla_{i_k} \varphi(y^k) \, e_{i_k}$
Update $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$ and $k = k + 1$
6 end while
Output: $\bar{x}^k = \frac{1}{C_k} \sum_{j=0}^{k} \frac{x^j}{\theta_j}$ where $C_k = \sum_{j=0}^{k} \frac{1}{\theta_j}$
Algorithm 3 APDGCD ($C, \eta, A, b, \varepsilon'$)

Figure 3: Performance of the APDRCD, APDAGD and APDAMD algorithms on the MNIST images. In the first row of images, we compare the APDRCD and APDAGD algorithms in terms of iteration counts. The leftmost image specifies the distances to the transportation polytope for the two algorithms; the middle image specifies the maximum, median and minimum of competitive ratios on ten random pairs of MNIST images; the rightmost image specifies the values of regularized OT with varying regularization parameter $\eta$. In addition, the second row of images presents comparative results for APDRCD versus APDAMD.

3.3 Accelerated Primal-Dual Greedy Coordinate Descent (APDGCD)

We next present a greedy version of the APDRCD algorithm, which we refer to as the accelerated primal-dual greedy coordinate descent (APDGCD) algorithm. The detailed pseudo-code of that algorithm is given in Algorithm 3, while the corresponding scheme for approximating the OT distance based on the APDGCD algorithm is summarized in Algorithm 4.

Both the APDGCD and APDRCD algorithms follow the general accelerated primal-dual coordinate descent framework. Similar to the APDRCD algorithm, the algorithmic framework of APDGCD is composed of two main parts. First, instead of performing a randomized accelerated coordinate descent step on the dual objective function, the APDGCD algorithm chooses the coordinate that maximizes the absolute value of the gradient of the dual objective function of the regularized OT problem among all the coordinates (see the sketch after this paragraph). In the second part, we follow the key averaging step of the APDRCD algorithm by taking a weighted average over the past iterations to obtain a good approximate solution for the primal problem from the approximate solutions to the dual problem. Since the auxiliary sequence $\{\theta_k\}$ is decreasing, the primal solutions corresponding to the more recent dual solutions have larger weight in this average.
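In code, the only change relative to the APDRCD sketch in Section 3.1 is the coordinate selection rule; with `grad` denoting the full gradient of $\varphi$ at $y^k$, the uniform sampling line is replaced by a deterministic argmax:

```python
import numpy as np

def greedy_coordinate(grad):
    """APDGCD coordinate rule: pick the coordinate with the largest
    absolute partial derivative of the dual objective, instead of
    sampling a coordinate uniformly at random as in APDRCD."""
    return int(np.argmax(np.abs(grad)))
```

Since the sketches above already form the full gradient at each iteration, the greedy rule adds only the cost of a pass over the $2n$ coordinates while typically making more progress per iteration.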

We demonstrate that the APDGCD algorithm enjoys better practical performance than the APDRCD algorithm on both synthetic and real datasets (cf. Appendix C).

  Input: $\eta = \frac{\varepsilon}{4 \log n}$ and $\varepsilon' = \frac{\varepsilon}{8 \|C\|_\infty}$.
  Step 1: Let $\tilde{r}$ and $\tilde{c}$ be defined as $\tilde{r} = \left(1 - \frac{\varepsilon'}{8}\right) r + \frac{\varepsilon'}{8n} \mathbf{1}$ and $\tilde{c} = \left(1 - \frac{\varepsilon'}{8}\right) c + \frac{\varepsilon'}{8n} \mathbf{1}$.
  Step 2: Let $A$ and $b$ be defined by $A \, \text{vec}(X) = \begin{pmatrix} X\mathbf{1} \\ X^{\top}\mathbf{1} \end{pmatrix}$ and $b = \begin{pmatrix} \tilde{r} \\ \tilde{c} \end{pmatrix}$.
  Step 3: Compute $\text{vec}(\tilde{X}) = \text{APDGCD}(C, \eta, A, b, \varepsilon'/2)$ with $\varphi$ defined in Section 2.2.
  Step 4: Round $\tilde{X}$ to $\hat{X}$ by Algorithm 2 of [1] such that $\hat{X}\mathbf{1} = r$, $\hat{X}^{\top}\mathbf{1} = c$.
  Output: $\hat{X}$.
Algorithm 4 Approximating OT by APDGCD

4 Experiments

In this section, we carry out comparative experiments between the APDRCD and APDGCD algorithms and state-of-the-art primal-dual algorithms for the OT problem, including the APDAGD and APDAMD algorithms, on both synthetic images and the MNIST Digits dataset. Due to space constraints, the comparative experiments between the APDGCD algorithm and the APDAGD/APDAMD algorithms, as well as further experiments for the APDRCD algorithm on larger synthetic datasets and the CIFAR10 dataset, are deferred to Appendix C. Note that for the above comparisons, we also utilize the default linear programming solver in MATLAB to obtain the optimal value of the original optimal transport problem without entropic regularization.

4.1 APDRCD Algorithm with Synthetic Images

We conduct extensive comparisons of the performance of the APDRCD algorithm with the APDAGD and APDAMD algorithms on synthetic images. The generation of synthetic images follows the procedure of [1, 17]. In particular, the images are generated by randomly placing a foreground square in an otherwise black background. For the intensities of the background pixels and foreground pixels, we choose uniform distributions on [0, 1] and [0, 50], respectively. For comprehensive results, we vary the proportion of the size of the foreground square in $\{0.1, 0.5, 0.9\}$ of the full size of the image and implement all the algorithms on the different kinds of synthetic images.
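For reproducibility, a sketch of the image-generation procedure is given below; the image side length is an assumption (the references use small images), and each image is normalized to a probability distribution before computing OT distances:

```python
import numpy as np

def synthetic_image(size, frac, rng):
    """One synthetic image in the style of [1, 17]: uniform [0, 1] noise as
    background, with a randomly placed foreground square of uniform [0, 50]
    intensities covering roughly `frac` of the pixels."""
    img = rng.uniform(0.0, 1.0, (size, size))
    side = max(1, int(round(np.sqrt(frac) * size)))
    i0 = rng.integers(0, size - side + 1)
    j0 = rng.integers(0, size - side + 1)
    img[i0:i0 + side, j0:j0 + side] = rng.uniform(0.0, 50.0, (side, side))
    return img / img.sum()   # normalize to a probability distribution

rng = np.random.default_rng(0)
r = synthetic_image(20, 0.1, rng).ravel()   # source distribution
c = synthetic_image(20, 0.1, rng).ravel()   # target distribution
```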

Evaluation metric: We utilize the metrics from [1]. The first metric is the distance between the output of the algorithm and the transportation polytope, $d(X) := \|r(X) - r\|_1 + \|c(X) - c\|_1$, where $r(X)$ and $c(X)$ are the row and column marginal vectors of the output matrix $X$ while $r$ and $c$ stand for the true row and column marginal vectors. The second metric is the competitive ratio, defined by $\log(d_1/d_2)$, where $d_1$ and $d_2$ refer to the distances between the outputs of the two competing algorithms and the transportation polytope.
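Both metrics are straightforward to compute from the algorithm outputs; the sign convention of the competitive ratio (which algorithm's distance goes in the numerator) is an assumption here:

```python
import numpy as np

def dist_to_polytope(X, r, c):
    """First metric: l1 deviation of the marginals of X from (r, c)."""
    return np.abs(X.sum(axis=1) - r).sum() + np.abs(X.sum(axis=0) - c).sum()

def competitive_ratio(X1, X2, r, c):
    """Second metric: log of the ratio of the two algorithms' distances to
    the transportation polytope; negative values favor algorithm 1."""
    return np.log(dist_to_polytope(X1, r, c) / dist_to_polytope(X2, r, c))
```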

Experimental settings and results: We perform two pairwise comparative experiments for the APDRCD algorithm versus the APDAGD and APDAMD algorithms by running these algorithms on ten randomly selected pairs of synthetic images. We also evaluate all the algorithms with varying regularization parameter $\eta$ against the optimal value of the original OT problem without entropic regularization, as suggested by [1, 17].

We present the results in Figure 1 and Figure 2. The APDRCD algorithm has better performance than the APDAGD and APDAMD algorithms in terms of the number of iterations. When the number of iterations is small, the APDRCD algorithm achieves a faster and more stable decrease than the other two algorithms with regard to both the distance to the polytope and the value of OT during the computation, which is beneficial for tuning purposes. This superior behavior of APDRCD illustrates the improvement achieved by using randomized coordinate descent on the dual regularized problem, and is consistent with the theoretical finding that the complexity bound of the APDRCD algorithm matches those of the APDAGD and APDAMD algorithms.

4.2 APDRCD Algorithm with MNIST Images

We present a comparison between the APDRCD algorithm and the APDAGD and APDAMD algorithms on MNIST images, using the same set of evaluation metrics. The image pre-processing follows the procedure suggested in [17]; we omit the details for the sake of brevity.

We present the results with the MNIST images in Figure 3 for various values of the regularization parameter $\eta$. We also evaluate all the algorithms against the optimal value of the original optimal transport problem without entropic regularization. As shown in Figure 3, the APDRCD algorithm outperforms both the APDAGD and APDAMD algorithms on the MNIST dataset in terms of the number of iterations. Additionally, the APDRCD algorithm displays faster and smoother convergence than the other algorithms at small iteration numbers with regard to both evaluation metrics, which suggests that it is easier to tune in practice. In summary, the consistently superior performance of the APDRCD algorithm over the APDAGD and APDAMD algorithms on both the synthetic and MNIST datasets is consistent with the matched complexity bounds of these three algorithms, and shows the advantage of using randomized coordinate descent.

5 Discussion

We have proposed and analyzed new accelerated primal-dual coordinate descent algorithms for approximating the optimal transport distance between two discrete probability measures. These accelerated primal-dual coordinate descent algorithms have theoretical complexity comparable to that of existing accelerated primal-dual algorithms while enjoying better experimental performance in practice. Furthermore, we showed that the APDRCD and APDGCD algorithms are suitable for other large-scale problems apart from computing the OT distance; in particular, we proposed extensions that approximate the Wasserstein barycenter of multiple probability distributions.

There are several directions for future work. On the experimental side, given the favorable practical performance of the APDRCD and APDGCD algorithms over existing primal-dual algorithms, it is of interest to carry out more extensive experiments on the distributed APDRCD and APDGCD algorithms for computing Wasserstein barycenters, with multiple machines in a real network. Another important direction is to construct fast distributed algorithms for the case of time-varying and directed machine networks for approximating Wasserstein barycenters. It remains an interesting and important open question how the dynamics of the network affect the performance of these algorithms.

Acknowledgements

We thank Tianyi Lin for many helpful discussions. This work was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.

Appendix

In the appendix, we first provide detailed proofs of all the remaining results in Section A. Then, we present an application of the APDRCD and APDGCD algorithms to the Wasserstein barycenter problem in Section B. Finally, further comparative experiments between the APDGCD algorithm and the APDRCD, APDAGD and APDAMD algorithms, as well as experiments on larger synthetic image datasets and the CIFAR10 dataset, are presented in Section C.

Appendix A Proofs for all results

In this appendix, we provide the complete proofs for the remaining results in the main text.

A.1 Proof of Lemma 3.2

By Eq. (5), we have the following equations

Combining the above equations with Lemma 3.1, we have

(9)

Furthermore, the results from Eq. (5) and Eq. (6) lead to the following equations

Therefore, we have

The above equation together with Eq. (9) yields the following inequality

(10)

By the result of Eq. (6), we have

Therefore, for any $\lambda$, we find that

Note that the above equation is equivalent to the following:

where the second equality in the above display comes from simple algebra. Rewriting the above equality, we have:

Combining the above equation with Eq. (10) yields the following inequality:

(11)

Recall the definition of $y^k$ in step 3 of Algorithm 1 as follows:

which can be rewritten as:

(12)

for any $k$.

The above equation implies that

where the last inequality comes from the convexity of $\varphi$. Combining this equation and taking the expectation over $i_k$ for the first two terms of Eq. (11), we have:

(13)

where we use Eq. (12) and the convexity of the dual function $\varphi$ in the last step. For the last term on the right-hand side of Eq. (11), by taking the expectation over $i_k$, we have:

(14)

where the last equality comes from the following equations:

where the last inequality is due to Jensen's inequality. Therefore, by simple algebra, we have

Notice that the above bound holds for any value of $\lambda$. Hence, by combining the results from Eq. (13) and Eq. (14) with Eq. (11), at each iteration with a suitable value of $\lambda$, we obtain that

As a consequence, we achieve the conclusion of the lemma.


Figure 4: Performance of the APDGCD and APDAGD algorithms on the synthetic images. The organization of the images is similar to that in Figure 1.

A.2 Proof of Theorem 3.3

By the result of Lemma 3.2 and the definition of the sequence $\{\theta_k\}$ in Algorithm 1, we obtain the following bounds:

where the outer expectations are taken with respect to the random sequence of coordinate indices in Algorithm 1. Iterating the above bound and using the facts that $\theta_0 = 1$ and $\frac{1 - \theta_k}{\theta_k^2} = \frac{1}{\theta_{k-1}^2}$, we arrive at the following inequalities: