We provide theoretical complexity analysis for new algorithms to compute the optimal transport (OT) distance between two discrete probability distributions, and demonstrate their favorable practical performance compared to state-of-art primal-dual algorithms. First, we introduce the accelerated primal-dual randomized coordinate descent (APDRCD) algorithm for computing the OT distance. We show that its complexity is , where stands for the number of atoms of these probability measures and is the desired accuracy. This complexity bound matches the best known complexities of primal-dual algorithms for the OT problems, including the adaptive primal-dual accelerated gradient descent (APDAGD) and the adaptive primal-dual accelerated mirror descent (APDAMD) algorithms. Then, we demonstrate the improved practical efficiency of the APDRCD algorithm through extensive comparative experimental studies. We also propose a greedy version of APDRCD, which we refer to as accelerated primal-dual greedy coordinate descent (APDGCD), to further enhance practical performance. Finally, we generalize the APDRCD and APDGCD algorithms to distributed algorithms for computing the Wasserstein barycenter for multiple probability distributions.
Optimal transport has become an important topic in statistical machine learning. It finds the minimal cost couplings between pairs of probability measures and provides a geometrically faithful way to compare two probability distributions, with diverse applications in areas including Bayesian nonparametrics [19, 20], scalable Bayesian inference [25, 26], topic modeling , isotonic regression , and deep learning [4, 2, 27].
Nevertheless, the practical impact of OT has been limited by its computational burden. By viewing the optimal transport distance as a linear programming problem, interior-point methods have been employed as a computational solver, with a best known practical complexity of . Recently, Lee and Sidford  proposed to use the Laplace linear system solver to theoretically improve the complexity of interior-point methods to . However, it remains a practical challenge to develop efficient interior-point implementations in the high-dimensional settings for OT applications in machine learning.
Several algorithms have been proposed to circumvent the scalability issue of the interior-point methods, including the Sinkhorn algorithm [24, 14, 12, 6], which has a complexity bound of where is the desired accuracy . The Greenkhorn algorithm  further improves the performance of the Sinkhorn algorithm, with a theoretical complexity of . However, for large-scale applications of the OT problem, particularly in randomized and asynchronous scenarios such as computational Wasserstein barycenters, existing literature has shown that neither the Sinkhorn algorithm nor the Greenkhorn algorithm are sufficiently scalable and flexible [5, 7].
Recent research has demonstrated the advantages of the family of accelerated primal-dual algorithms [8, 17] over the Sinkhorn algorithms. This family includes the adaptive primal-dual accelerated gradient descent (APDAGD) algorithm  and the adaptive primal-dual accelerated mirror descent (APDAMD) algorithms , with complexity bounds of and , respectively, where is the inverse of the strong complexity constant of Bregman divergence with respect to the -norm. In addition, the primal-dual algorithms possess the requisite flexibility and scalability compared to the Sinkhorn algorithm, which is crucial for computational OT problems in large-scale applications [5, 11]. Specifically, they are flexible enough to be generalized to the computation of the Wasserstein barycenter for multiple probability distributions in decentralized and asynchronous settings [5, 7].
In the optimization literature, primal-dual methods have served as standard techniques that are readily parallelized for high-dimensional applications [3, 28]. On the other hand, coordinate descent methods have been shown to be well suited to the solution of large-scale machine learning problems [18, 22, 10].
Our contributions. The contributions of the paper are three-fold.
We introduce a novel accelerated primal-dual randomized coordinate descent (APDRCD) algorithm for solving the OT problem. We provide a complexity upper bound of for the APDRCD algorithm, which is comparable to the complexity of state-of-art primal-dual algorithms for OT problems, such as the APDAGD and APDAMD algorithms [8, 17]. To the best of our knowledge, this is the first accelerated primal-dual coordinate descent algorithm for computing the OT problem.
We show that the APDRCD algorithm outperforms the APDAGD and APDAMD algorithms on both synthetic and real image datasets with extensive experiments. To further improve the practical performance of the APDRCD algorithm, we present a greedy version of it which we refer to as the accelerated primal-dual greedy coordinate descent (APDGCD) algorithm. To the best of our knowledge, the APDRCD and APDGCD algorithms achieve the best performance among state-of-art accelerated primal-dual algorithms on solving the OT problem.
We demonstrate that the APDRCD and APDGCD algorithms are suitable and parallelizable for other large-scale problems besides the OT problem, e.g., approximating the Wasserstein barycenter for a finite set of probability measures stored over a distributed network.
Organization. The remainder of the paper is organized as follows. In Section 2, we provide the formulation of the entropic OT problem and its dual form. In Section 3, we introduce the APDRCD algorithm for solving the regularized OT problem and provide a complexity upper bound for it. We then present the greedy APDGCD algorithm. In Section 4, we present comparative experiments between the APDRCD, the APDGCD algorithms and other primal-dual algorithms including the APDAGD and APDAMD algorithms. We conclude the paper with a few future directions in Section 5. Finally, the proofs of all the results are included in the Appendix A, and the generalized APDRCD and APDGCD algorithms for approximating Wasserstein barycenters are presented in Appendix B. Additional experiments are presented in Appendix C.
Notation. We denote the probability simplex for . Furthermore, stands for the set while stands for the set of all vectors in with nonnegative components for any . For a vector and , we denote as its -norm and as the diagonal matrix with on the diagonal. For a matrix , the notation stands for the vector in obtained from concatenating the rows and columns of . 1 stands for a vector with all of its components equal to . refers to a partial gradient of with respect to . Lastly, given the dimension and accuracy , the notation stands for the upper bound where is independent of and . Similarly, the notation indicates the previous inequality may depend on the logarithmic function of and , and where .
2 Problem Setup
In this section, we provide necessary background for the entropic regularized OT problem. The objective function for the problem is presented in Section 2.1 while its dual form as well as the key properties of that dual form are given in Section 2.2.
2.1 Entropic Regularized OT
As shown in , the problem of approximating the OT distance between two discrete probability distributions with at most components is equivalent to the following linear programming problem:
where is a transportation plan, is a cost matrix with non negative elements, and and refer to two known probability distributions in the probability simplex . The best known practical complexity bound for (1) is , while the best theoretical complexity bound is , achieved via interior-point methods. However, these methods are not efficient in the high-dimensional setting of OT applications in machine learning. This motivates the entropic regularized OT problem :
where is the regularization parameter and is the entropic regularization given by . The main focus of the paper is to determine an -approximate transportation plan such that and and the following bound holds:
where is an optimal solution; i.e., an optimal transportation plan for the OT problem (1). To ease the ensuing presentation, we let denote an -approximation for the OT distance. Furthermore, we define matrix such that for any
2.2 Dual Entropic Regularized OT
The Lagrangian function for problem (2) is given by
Given the Lagrangian function, the dual form of the entropic regularized OT problem can be obtained by solving the optimization problem . Since the Lagrangian function is strictly convex, that optimization problem can be solved by setting , which leads to the following form of the transportation plan: for all . With this solution, we have . The dual entropic regularized OT problem is, therefore, equivalent to the following optimization problem:
Building on Lemma 4.1 in , the dual objective function can be shown to be smooth with respect to norm:
The dual objective function is smooth with respect to norm:
Since is equal to the maximum -norm of a column of A and each column of A contains only two nonzero elements which are equal to one, we have . Combining with the fact that , we establish the result. ∎
3 Accelerated Primal-Dual Coordinate Descent Algorithms
In this section, we present and analyze an accelerated primal-dual coordinate descent algorithms to obtain an -approximate transportation plan for the OT problem (1). First, in Section 3.1, we introduce the accelerated primal-dual randomized coordinate descent (APDRCD) method for the entropic regularized OT problem. Then, following the approximation scheme of , we show how to approximate the OT distance within the APDRCD algorithm; see Algorithm 2 for the pseudo-code for this problem. Furthermore, we provide theoretical analysis to establish a complexity bound of for the APDRCD algorithm to achieve an -approximate transportation plan for the OT problem in Section 3.2. This complexity upper bound of the APDRCD algorithm matches the best known complexity bounds of the APDAGD  and APDAMD algorithms . Finally, to further improve the practical performance of the algorithm, we present a greedy variant—the accelerated primal-dual greedy coordinate descent (APDGCD) algorithm—in Section 3.3.
3.1 Accelerated Primal-Dual Randomized Coordinate Descent (APDRCD)
We denote by the Lipschitz constant for the dual objective function , which means that , and . The APDRCD algorithm is initialized with the auxiliary sequence and two auxiliary dual variable sequences and , where the first auxiliary sequence is used for the key averaging step and the two dual variable sequences are used to perform the accelerated randomized coordinate descent on the dual objective function as a subroutine. The APDRCD algorithm is composed of two main parts. First, exploiting the convexity property of the dual objective function, we perform a randomized accelerated coordinate descent step on the dual objective function as a subroutine in step 5 and 6. In the second part, we take a weighted average over the past iterations to get a good approximate solution for the primal problem from the approximate solutions to the dual problem (2.2). Notice that the auxiliary sequence is decreasing and the primal solutions corresponding to the more recent dual solutions have larger weight in this average.
3.2 Complexity Analysis of APDRCD
Given the updates from APDRCD algorithm in Algorithm 1, we have the following result regarding the difference of the values of at and :
Given the updates and from the APDRCD algorithm, we have the following inequality:
where is chosen in the APDRCD algorithm.
Due to the smoothness of with respect to norm in Lemma 2.1, the following inequalities hold:
The result of Lemma 3.1 is vital for establishing an upper bound for , as shown in the following lemma.
For each iteration () of the APDRCD algorithm, we have
where the outer expectation in the above display is taken with respect to the random coordinate in Algorithm 1.
The proof of Lemma 3.2 is in Appendix A.1. Now, equipped with the result of Lemma 3.2, we are ready to provide the convergence guarantee and complexity bound of the APDRCD algorithm for approximating the OT problem. First, we start with the following result regarding an upper bound on to reach the stopping rule for . Here, the outer expectation is taken with respect to the random coordinates in Algorithm 1 for .
The proof of Theorem 3.3 is provided in Appendix A.2. Given an upper bound on for the stopping rule in Theorem 3.3 where in Theorem 3.3, we obtain the following complexity bound for the APDRCD algorithm.
The APDRCD algorithm for approximating optimal transport (Algorithm 2) returns satisfying , and in a total number of
The proof of Theorem 3.4 is provided in Appendix A.3. We show in Appendix A.3 that Theorem 3.4 also directly implies a complexity bound for obtaining an -optimal solution with high probability. Theorem 3.4 indicates that the complexity upper bound of APDRCD matches the best known complexity of the APDAGD  and APDAMD  algorithms. Furthermore, that complexity of APDRCD is better than that of the Sinkhorn and Greenkhorn algorithms, which is , in terms of the desired accuracy . Later in Section 4, we demonstrate that the APDRCD algorithm also has better practical performance than APDAGD and APDAMD algorithms on both synthetic and real datasets.
3.3 Accelerated Primal-Dual Greedy Coordinate Descent (APDGCD)
We next present a greedy version of APDRCD algorithm, which we refer to as the accelerated primal-dual greedy coordinate descent (APDGCD) algorithm. The detailed pseudo-code of that algorithm is in Algorithm 3 while an approximating scheme of OT based on the APDGCD algorithm is summarized in Algorithm 4.
Both the APDGCD and APDRCD algorithms follow along the general accelerated primal-dual coordinate descent framework. Similar to the APDRCD algorithm, the algorithmic framework of APDGCD is composed by two main parts: First, instead of performing randomized accelerated coordinate descent on the dual objective function as a subroutine, the APDGCD algorithm chooses the best coordinate that maximizes the absolute value of the gradient of the dual objective function of regularized OT problem among all the coordinates. In the second part, we follow the key averaging step in the APDRCD algorithm by taking a weighted average over the past iterations to get a good approximated solution for the primal problem from the approximated solutions to the dual problem. Since the auxiliary sequence is decreasing, the primal solutions corresponding to the more recent dual solutions have larger weight in this average.
We demonstrate that the APDGCD algorithm enjoys favorable practical performance than APDRCD algorithm on both synthetic and real datasets (cf. Appendix C).
In this section, we carry out comparative experiments between the APDRCD, APDGCD algorithms and the state-of-art primal-dual algorithms for the OT problem including the APDAGD and APDAMD algorithms, on both synthetic images and the MNIST Digits dataset.
4.1 APDRCD Algorithm with Synthetic Images
We conduct extensive comparisons of the performance of the APDRCD algorithm with the APDAGD and APDAMD algorithms on synthetic images. The generation of synthetic images follows the procedure of [1, 17]. In particular, the images are of size and generated based on randomly placing a foreground square in the otherwise black background. Then, for the intensities of the background pixels and foreground pixels, we choose uniform distributions on [0,1] and [0, 50] respectively. For comprehensive results, we vary the proportion of the size of the foreground square in 0.1, 0.5, 0.9 of the full size of the image and implement all the algorithms on different kinds of synthetic images.
Evaluation metric: We utilize the metrics from . The first metric is the distance between the output of the algorithm and the transportation polytope, , where and are the row and column marginal vectors of the output matrix while and stand for the true row and column marginal vectors. The second metric is the competitive ratio, defined by where and refer to the distance between the outputs of two algorithms and the transportation polytope.
Experimental settings and results: We perform two pairwise comparative experiments for the APDRCD algorithm versus the APDAGD and APDAMD algorithms by running these algorithms with ten randomly selected pairs of synthetic images. We also evaluate all the algorithms with varying regularization parameter and the optimal value of the original OT problem without the entropic regularization, as suggested by [1, 17].
We present the results in Figure 1 and Figure 2. The APDRCD algorithm has better performance than the APDAGD and APDAMD algorithms in terms of the iterations. When the number of iterations is small, the APDRCD algorithm achieves faster and more stable decrements than the other two algorithms with regard to both the distance to polytope and the value of OT during the computing process, which is beneficial for the purposes of tuning. This superior behavior of APDRCD illustrates the improvement achieved by using randomized coordinate descent on the dual regularized problem, and supports the theoretical assertion that the complexity bound for the APDRCD algorithm matches those of the APDAGD and APDAMD algorithms.
4.2 APDRCD Algorithm with MNIST Images
We present comparison between the APDRCD algorithm and the APDAGD and APDAMD algorithms on MNIST images with the same set of evaluation metrics. The image pre-processing follows the same pre-processing procedure as suggested in ; we omit the details for the sake of brevity.
We present the results with the MNIST images in Figure 3 with various values for the regularization parameter . We also evaluate all the algorithms with the optimal value of the original optimal transport problem without entropic regularization. As shown in Figure 3, the APDRCD algorithm outperforms both the APDAGD and APDAMD algorithms on the MNIST dataset in terms of the number of iterations. Additionally, the APDRCD algorithm displays faster and smoother convergence than the other algorithms at small iteration numbers with regard to both the evaluation metrics, which implies the advantage that it is easier to be tuned in practice. In summary, the consistent superior performance of the APDRCD algorithm over the APDAGD and APDAMD algorithms on both the synthetic and MNIST datasets supports the theoretical assertion on the matched complexity bounds of these three algorithms, and shows the advantage of using randomized coordinate descent.
We have proposed and analyzed new accelerated primal-dual coordinate descent algorithms for approximating the optimal transport distance between two discrete probability measures. These accelerated primal-dual coordinate descent algorithms have comparable theoretical complexity as that of existing accelerated primal-dual algorithms while enjoying better experimental performance in practice. Furthermore, we show that the APDRCD and APDGCD algorithms are suitable for other large-scale problems apart from computing the OT distance; in particular, we proposed extensions that approximate the Wasserstein barycenters for multiple probability distributions.
There are several directions for future work. On the experimental side, given the favorable practical performance of the APDRCD and the APDGCD algorithms over existing primal-dual algorithms, it is of interest to carry out more extensive experiments of the distributed APDRCD and APDGCD algorithms for computing Wasserstein barycenters, with multiple machines in a real network. Another important direction is to construct fast distributed algorithms for the case of time-varying and directed machine networks to approximate the Wasserstein barycenters. It remains as an interesting and important open research question how the dynamics of the network affect the performance of these algorithms.
We thank Tianyi Lin for many helpful discussions. This work was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.
In the appendix, we first provide detailed proofs of all the remaining results in Section A. Then, we present an application of APDRCD and APDGCD algorithms for the Wasserstein barycenter problem in Section B. Finally, further comparative experiments between APDGCD algorithm versus APDRCD, APDAGD and APDAMD algorithms, and experiments on larger synthetic image datasets and CIFAR10 dataset are in Section C.
Appendix A Proofs for all results
In this appendix, we provide the complete proofs for remaining results in the main text.
a.1 Proof of Lemma 3.2
By Eq. (5), we have the following equations
Combining the above equations with Lemma 3.1, we have
Therefore, we have
The above equation together with Eq. (A.1) yields the following inequality
By the result of Eq. (6), we have
Therefore, for any , we find that
Note that the above equation is equivalent to the following:
where the second equality in the above display comes from simple algebra. Rewriting the above equality, we have:
Combining the above equation with Eq. (A.1) yields the following inequality:
Recall the definition of in Step 3 of Algorithm 1 as follows:
which can be rewritten as:
for any .
The above equation implies that
where the last inequality comes from the convexity of . Combining this equation and taking expectation over for the first two terms of Eq. (A.1), we have:
where the last equality comes from the following equations:
where the last inequality is due to the fact that and Jensen’s inequality. Therefore, by simple algebra, we have
As a consequence, we achieve the conclusion of the lemma.
a.2 Proof of Theorem 3.3
where the outer expectations are taken with respect to the random sequence of the coordinate indices in Algorithm 1. Keep iterating the above bound and using the fact that and , we arrive at the following inequalities: