Convergence of a Newton algorithm for semi-discrete optimal transport
A popular way to solve optimal transport problems numerically is to assume that the source probability measure is absolutely continuous while the target measure is finitely supported. We introduce a damped Newton’s algorithm in this setting, which is experimentally efficient, and we establish its global linear convergence for cost functions satisfying an assumption that appears in the regularity theory for optimal transport.
Some problems in geometric optics or convex geometry can be recast as optimal transport problems between probability measures: this includes the far-field reflector antenna problem, Alexandrov’s Gaussian curvature prescription problem, etc. A popular way to solve these problems numerically is to assume that the source probability measure is absolutely continuous while the target measure is finitely supported. We refer to this setting as semi-discrete optimal transport. Among the several algorithms proposed to solve semi-discrete optimal transport problems, one currently needs to choose between algorithms that are slow but come with a convergence speed analysis [29, 8, 21] or algorithms that are much faster in practice but which come with no convergence guarantees [5, 27, 11, 22, 10]. Algorithms of the first kind rely on coordinate-wise increments and the number of iterations required to reach the solution up to an error of is of order , where is the number of Dirac masses in the target measure. On the other hand, algorithms of the second kind typically rely on the formulation of the semi-discrete optimal transport problem as an unconstrained convex optimization problem which is solved using a Newton or quasi-Newton method.
The purpose of this article is to bridge this gap between theory and practice by introducing a damped Newton’s algorithm which is experimentally efficient and by proving the global convergence of this algorithm with optimal rates. The main assumptions is that the cost function satisfies a condition that appears in the regularity theory for optimal transport (the Ma-Trudinger-Wang condition) and that the support of the source density is connected in a quantitative way (it must satisfy a weighted Poincaré-Wirtinger inequality). In §1.7, we compare this algorithm and the convergence theorem to previous computational approaches to optimal transport.
1.1. Semi-discrete optimal transport
The source space is an open domain of a -dimensional Riemannian manifold, which we endow with the measure induced by the Riemannian metric on the manifold. The target space is an (abstract) finite set . We are given a cost function on the product space , or equivalently a collection of functions on . We assume that the functions are of class on :
Here denotes the class of functions which are -times differentiable and whose -th derivatives are -Hölder continuous. In particular, is the space of -Hölder continuous functions. We consider a compact subset of and a probability density on , i.e. such that is a probability measure. By an abuse of notation, we will often conflate the density with the measure itself. Note that the support of is contained in , but we do not assume that it is actually equal to . The push-forward of by a measurable map is the finitely supported measure . The map is called a transport map between and a probability measure on if . The semi-discrete optimal transport problem consists in minimizing the transport cost over all transport maps between and , that is
This problem is an instance of Monge’s optimal transport problem, where the target measure is finitely supported. Kantorovich proposed a relaxed version of the problem (M) as an infinite dimensional linear programming problem over the space of probability measures with marginals and .
1.2. Laguerre tessellation and economic interpretation
In the semi-discrete setting, the dual of Kantorovich’s relaxation can be conveniently phrased using the notion of Laguerre tessellation. We start with an economic metaphor. Assume that the probability density describes the population distribution over a large city , and that the finite set describes the location of bakeries in the city. Customers living at a location in try to minimize the walking cost , resulting in a decomposition of the space called a Voronoi tessellation. The number of customers received by a bakery is equal to the integral of over its Voronoi cell, namely
If the price of bread is given by a function , customers living at location in make a compromise between walking cost and price by minimizing the sum . This leads to the notion of Laguerre tessellation, whose cells are given by
When the sets and are contained in and the cost is the squared Euclidean distance, the computation of the Laguerre tessellation is a classical problem of computational geometry, for which there exists very efficient softwares, such as CGAL  or Geogram . For other cost functions, one has to adapt the algorithms, as was done for the reflector cost on the sphere in . The shape of the Voronoi and Laguerre tessellations is depicted in Figure 1.
We want the Laguerre cells to form a partition of up to a negligible set. By the implicit function theorem, this will be the case if the following twist condition holds,
where denotes differentiation with respect to the first variable. The twist condition implies that for any prices on , the transport map induced by the Laguerre tessellation
is uniquely defined almost everywhere. It is easy to see (see Proposition 2.2), that for any function on , the map is an optimal transport map between and the pushforward measure
1.3. Kantorovich’s functional
The map is an optimal transport map between and . Conversely, Theorem 1.1 below ensures that any semi-discrete optimal transport problem admits such a solution. In other words, for any probability density on and any probability measures on there exists a function (price) on such that . The proof of this theorem is an easy generalization of the proof given in  for the quadratic cost, but it is nonetheless included in Section 2 for the sake of completeness.
Here and after, we denote the canonical basis of , and the Euclidean norm induced by this basis, while will denote the norm induced by the Riemannian metric on either or (which will be clear from context). We will, slightly abusively, consider the space of probability measures as a subset of .
The following statements are equivalent:
is a global maximizer of ;
is an optimal transport map between and ;
, or equivalently,
We call Kantorovich’s functional the function introduced in (1.3). Note that both this functional and its gradient are invariant by addition of a constant. The non-linear equation (MA) can be considered as a discrete version of the generalized Monge-Ampère equation that characterizes the solutions to optimal transport problems (see for instance Chapter 12 in ).
1.4. Damped Newton algorithm
We consider a simple damped Newton’s algorithm to solve semi-discrete optimal transport problem. This algorithm is very close to the one used by Mirebeau in . To phrase this algorithm in a more general way, we introduce a notation for the measure of Laguerre cells: for we set
so that . In the algorithm (Algorithm 1), we denote by the pseudo-inverse of the matrix .
A tolerance and an initial such that
- Step 1:
- Step 2:
Determine the minimum such that satisfies
- Step 3:
Set and .
The goal of this article is to prove the global convergence of this damped Newton algorithm and to establish estimates on the speed of convergence. As shown in Proposition 6.1, the convergence of Algorithm 1 depends on the regularity and strong monotonicity of the map . As we will see, the regularity of will depend mostly on the geometry of the cost function and the regularity of the density. On the other hand, the strong monotonicity of will require a strong connectedness assumption on the support of , in the form of a weighted Poincaré-Wirtinger inequality. Before stating our main theorem we give some indication about these intermediate regularity and monotonicity results and their assumptions.
1.5. Regularity of Kantorovich’s functional and MTW condition
In order to establish the convergence of a damped Newton algorithm for (MA), we need to study the regularity of Kantorovich’s functional . However, while regularity of follows rather easily from the (Twist) hypothesis (or even from weaker hypothesis, see Theorem 2.1), higher order regularity seems to depend on the geometry of the cost function in a more subtle manner. We found that a sufficient condition for the regularity of is the Ma-Trudinger-Wang condition , which appeared naturally in the study of the regularity of optimal transport maps. We use a discretization of Loeper’s geometric reformulation of the Ma-Trundinger-Wang condition .
Definition 1.1 (Loeper’s condition).
The cost satisfies Loeper’s condition if for every in there exists a convex open subset of and a diffeomorphism such that the functions
are quasi-convex for all in . The map is called the -exponential with respect to , and the domain is an exponential chart.
We comment here, that when is a finite subset of a continuous space and satisfies conditions (Reg), (Twist), and (A3w) (which can be found on pages Reg, Twist, and A3w respectively), the -exponential map defined in the usual sense in optimal transport theory (see Remarks 1.1 and 4.4) will satisfy what we call Loeper’s condition above. However, it will become apparent that for our purposes what is essential is the above quasi-convexity property and not the actual definition of . Thus we will elect to use the notation even in cases when is not a finite subset of a continuous space.
Definition 1.2 (-Convexity).
Assuming Loeper’s condition, a subset of is -convex with respect to a point of if its inverse image is convex. The subset is said to be -convex if it is -convex with respect to every point in .
Note that by assumption, the domain itself is -convex. The connection between this discrete version of Loeper’s condition and the conditions used in the regularity theory for optimal transport is detailed in Remark 1.1. The (QC) condition implies the convexity of each Laguerre cell in its own exponential charts, namely is convex for every in . This plays a crucial role in the regularity of Kantorovich’s functional.
The proof of this theorem and a more precise statement are given in Section 4 (Theorem 4.1), showing that the estimate can be made uniform when the mass of the Laguerre cells is bounded from below by a positive constant.
We remark that under certain assumptions on the cost , our (QC) condition is implied by classical conditions introduced in a smooth setting by X.-N. Ma, N. Trudinger, and X.-J. Wang , which include the well known (MTW) or (A3) condition. See Remark 4.4 for more specifics.
There are a wide variety of known examples satisfying these conditions. Aside from the canonical example of the inner product on , and other costs on Euclidean spaces mentioned in [26, 33], there are the nonflat examples of Riemannian distance squared and on (a subset of) (see ). The last cost is associated to the far-field reflector antenna problem. We refer the reader to [19, p. 1331] for a (more) comprehensive list of such costs.
1.6. Strong concavity of Kantorovich’s functional
As noted earlier, Kantorovich’s functional cannot be strictly concave, since it is invariant under addition of a constant. This implies that the Hessian has a zero eigenvalue corresponding to the constants. A more serious obstruction to the strict concavity of at a point arises when the discrete graph induced by the Hessian (where two points are connected iff ) is not connected. This can happen either because one of the Laguerre cells is empty (hence not connected to any neighbor) or if the support of the probability density is itself disconnected. In order to avoid the latter phenomena, we will require that satisfies a weighted Poincaré-Wirtinger inequality.
Definition 1.3 (weighted Poincaré-Wirtinger).
A continuous probability density on a compact set satisfies a weighted Poincaré-Wirtinger inequality with constant if for every function on ,
where and .
We denote the orthogonal complement (in ) of the space of constant functions on , that is As before, is the set of functions whose Laguerre cells all have positive mass.
1.7. Convergence result
Putting Proposition 6.1, Theorem 1.3 and Theorem 1.4 together, we can prove the global convergence of the damped Newton algorithm for semi-discrete optimal transport (Algorithm 1) together with optimal convergence rates.
The support of the probability density is included in a compact, -convex subset of , and for in .
has positive Poincaré-Wirtinger (PW) constant.
Then the damped Newton algorithm for semi-discrete optimal transport (Algorithm 1) converges globally with linear rate and locally with rate .
This theorem makes no assumption about the convexity (or -convexity) of the support of the source density . Such cases are not handled by other numerical methods for Monge-Ampère equations [6, 24]. For completeness, we provide in Appendix A an explicit example of a radial measure on whose support is an annulus but whose Poincaré-Wirtinger constant is nonetheless positive.
The positive lower bound on the damping parameter ( in Algorithm 1) established in this theorem degrades as grows to infinity. It is plausible (but far from direct) that one could control this quantity when is large by a comparison to the continuous Monge-Ampère equation. The strong concavity estimate (Theorem 1.4) would then need to be replaced by uniform ellipticity estimates for the linearized Monge-Ampère equation, while the regularity estimate (Theorem 1.3) would be replaced by regularity estimates for solutions to the Monge-Ampère equation. We refer to Loeper and Rapetti  for an implementation of this ideas in a continuous setting. The space-discretization of their approach is open.
Comparison to previous work.
There exist a few other numerical methods relying on Newton’s algorithm for the resolution of the standard Monge-Ampère equation or for the quadratic optimal transport problem. Here, we highlight some of the differences between Algorithm 1 and Theorem 1.5 and these existing results. First, we note that many authors have reported the good behavior in practice of Newton’s or quasi-Newton’s methods for solving discretized Monge-Ampère equations or optimal transport problems [27, 11, 6]. Note however that none of these works contain convergence proofs for the Newton’s algorithm.
Loeper and Rapetti  (refined by Saumier, Agueh, and Khouider ) establish the global convergence of a damped Newton’s method for solving quadratic optimal transport on the torus, relying heavily on Caffarelli’s regularity theory. In particular, the convergence of the algorithm requires a positive lower bound on the probability densities, while this condition is not necessary for Theorem 1.5 (see Section 5 and Appendix A where we construct explicitly probability densities with non-convex support that still satisfy the hypothesis of Theorem 1.5). A second drawback on relying on the regularity theory for optimal transport is that the damping parameter, which is an input parameter of the algorithm used in , cannot be determined explicitly from the data. Third, the convergence proof is for continuous densities, and it seems difficult to adapt it to the space-discretized problem. On the positive side, it seems likely that the convergence proof of  can be adapted to cost functions satisfying the Ma-Trudinger-Wang condition (which is equivalent to Loeper’s condition (QC) that we also require).
Oliker and Prussner prove the local convergence of Newton’s method for finding Alexandrov’s solutions to the Monge-Ampère equation with Dirichlet boundary conditions, where is a finitely supported measure . Global convergence for a damped Newton’s algorithm is established by Mirebeau  for a variant of Oliker and Prussner’s discretization, but without convergence rates. Theorem 1.5 can be seen as an extension of the strategy used by Mirebeau to optimal transport problems, which amounts to (a) replacing the Dirichlet boundary conditions with the second boundary value conditions from optimal transport (b) replacing the Lebesgue measure by more general probability densities and (c) changing the Monge-Ampère equation itself in order to deal with more general cost functions.
We also comment here that our result Theorem 5.1 answers a conjecture first raised by Gangbo and McCann, in the case when the cost function satisfies the Ma-Trudinger-Wang condition. In [15, Example 1.6], a numerical approach to the semi-discrete optimal transport problem is suggested by taking what is equivalent to the negative gradient flow of the Kantorovich function defined in (1.3) above. There, Gangbo and McCann conjecture that this gradient flow should converge, and our result of uniform concavity of the Kantorovich functional provides a positive answer to a quantitative strengthening of this conjecture, at least for costs, measures, and domains satisfying the assumptions of Theorem 5.1.
Finally, we note that the overall strategy for proving the convergence of Algorithm 1 (proving regularity then strict concavity of ) shares features to the one used in  to study the relationship between highly anisotropic semi-discrete quadratic optimal transport and Knothe rearrangement.
In Section 2, we establish the differentiability of Kantorovich’s functional , adapting arguments from . In Sections 3 and 4, we prove the (uniform) second-differentiability of Kantorovich’s functional when the cost function satisfies Loeper’s (QC) condition. Section 5 is devoted to the proof of uniform concavity of Kantorovich’s functional, when the probability density satisfies a Poincaré-Wirtinger inequality (PW). In Section 6, we combine these intermediate results to prove the convergence of the damped Newton’s algorithm (Theorem 1.5), and we present a numerical illustration. Appendix A presents an explicit construction of a probability density with non-convex support over which satisfies the assumptions of Theorem 1.5. Appendix B contains the details of the proof of the main theorem of Section 4.
QM and BT would like to acknowledge the support of the French ANR through the grant ANR-16-CE40-0014 (MAGA). BT is also partially supported by LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01).
2. Kantorovich’s functional
The purpose of this section is to present the variational formulation introduced in  for the semi-discrete optimal transport problem, adapting the arguments presented for the squared Euclidean cost in  to cost functions satisfying (Reg’) and (Twist’), which are weaker than the conditions (Reg) and (Twist) presented in the introduction:
Note that under (Twist’), the map defined by (1.2) is uniquely-defined –almost everywhere. Most of the results presented here are well known in the optimal transport literature, we include proofs for completeness.
For any , the map is an optimal transport map for the cost between any probability density on and the pushforward measure .
Assume that where is a measurable map between and . Then, by definition of one has
Multiplying this inequality by and integrating it over gives
Since , the change of variable formula gives
Substracting this equality from the inequality above shows that is optimal:
Let be a probability density over a compact subset of , and let in be such that for all . Then, the function is continuous.
We consider the function . By hypothesis, . Moreover, using Lebesgue’s monotone convergence theorem one easily sees that (resp. ) is right-continuous (resp. left-continuous). This concludes the proof ∎
Proof of Proposition 2.3.
Proving the continuity of amounts to proving the continuity of the functions for any in . Fix in and remark that by definition, where
Denoting the symmetric difference of two sets, we have the following inequalities
Fix , and denote . Then,
2.1. Proof of Theorem 1.1
We simultaneously show that the functional is concave and compute its gradient. For any function on and any measurable map , one has , which by integration gives
Moreover, equality holds when . Taking another function on and setting in Equation (2.11) gives
where is defined as in the statement of Proposition 2.3. This proves that the superdifferential contains , thus establishing the concavity of and its differentiability almost everywhere. It is known that the supergradient is characterized by [30, Theorem 25.6]
where denotes the convex envelope and the set of sequences converging to such that is differentiable at . By Proposition 2.3, the map is continuous, meaning that we have constructed a continuous selection of a supergradient in the superdifferential of the concave function
This proves that is , and that .
3. Local regularity in a -exponential chart
The results presented in this section constitute an intermediate step in the proof of regularity of Kantorovich’s functional. Let be a compact, convex subset of and be functions on which are quasi-convex, meaning that for any scalar the closed sublevel sets are convex. Let be a continuous probability density over . The purpose of this section is to give sufficient conditions to ensure the regularity of the following function near the origin of :
3.1. Assumptions and statement of the theorem.
We will impose two conditions on the functions . As we will see in Section 4, both conditions are satisfied when these functions are constructed from a semi-discrete optimal transport transport problem whose cost function satisfy Loeper’s condition (see Definition 1.1).
The functions satisfy the nondegeneracy condition if the norm of their gradients is bounded from below:
This condition is necessary for the continuity of the map even when .
The boundary of the convex set can be decomposed into facets, namely and . The purpose of the transversality condition we consider is to ensure that the angle between adjacent facets is bounded from below when remains close to some fixed vector .
Definition 3.1 (Normal cone).
Let be a convex compact set of . The normal cone to at a point in is the set
and its elements are said to be normal to at .
Definition 3.2 (Transversality).
The family of functions satisfy the transversality condition near if there exists positive constants and such that for every in satisfying for the usual norm on and every point in one has,
Note that is smooth at , is the ray spanned by the exterior normal to at .
Assume that the functions satisfy the non-degeneracy condition (ND) and the transversality condition (T) near . Let be a probability density on . Then, the map defined in (3.12) is of class on the cube and has partial derivatives given by
In addition, the norm is bounded by a constant depending only on , on the diameter of and on
Note that the constant of depends on the transversality constant but that it does not depend on .
3.2. Sketch of proof
The correct expression for the partial derivatives of , given by equation (3.14), can easily be guessed by applying the co-area formula. The non-degeneracy condition then ensures that the denominator in this expression does not vanish. What is more delicate is to prove that these partial derivatives are -Hölder, with a uniform estimate on the -Hölder norm. A second application of the co-area formula on the manifold suggests that for one should have,
under the assumption that the density is and the facet does not intersect . It will turn out that, thanks to the transversality hypothesis, the -measure of the union of these facets can be bounded uniformly:
Note also that equivalently, a point belongs to the singular set if and only if it satisfies one of the assumptions in (T). In the next subsection, we prove an upper bound on (see Proposition 3.2). The proof of Theorem 3.1 follows from this upper bound and from several applications of the co-area formula. Since it is elementary but quite long, we have postponed the proof of the theorem itself to Appendix B.
3.3. A control on the –Hausdorff measure of singular points
In this section, we prove that the transversality condition (T) and the quasi-convexity of the functions imply a uniform upper bound on the –Hausdorff measure of .
Assuming the transversality condition (T), there exists a constant depending only on and such that for every ,
We will deduce this proposition from a general upper bound on the –Hausdorff measure of the set of –singular points of a compact convex body. A more general and quantitative version of this bound can be found in . We provide below a straightforward and easy proof based on the notions of packing and covering numbers.
Let be a convex, compact set of and . Then,
Recall that the covering number of a subset is the minimum number of Euclidean balls of radius required to cover . The packing number of a subset is given by
We will use the following comparisons between covering and packing numbers:
Proof of Proposition 3.3.
The proof consists in comparing a lower bound and an upper bound of the packing number of the set
Step 1. We first calculate an upper bound on the covering number of the unit bundle . Given a positive radius , we denote by the set of points that are within distance of . By convexity, the projection map , mapping a point to its orthogonal projection on , is well defined and -Lipschitz. We consider
The map is surjective and has Lipschitz constant . We deduce an upper bound on covering number of from the covering number of the level set :
Now, consider a sphere with diameter that encloses the tubular neighborhood with . The projection map is -Lipschitz, and . Using the same argument as above, we have:
Combining these bounds with the inclusion gives us
Step 2. We now establish a lower bound for . Let be a -singular point and be two unit vectors such that . This implies that contains a spherical geodesic segment of length at least , giving us a lower bound on the packing number of , namely . Now, let be a maximal set in the definition of the packing number and for every , let be a maximal set in the definition of the packing number , so that . Then, the set is a packing of , and the cardinality of this set is bounded from below by . This gives
4. regularity of Kantorovich’s functional
For the remainder of the section, for any point in , we will denote the inverse image of the domain in the exponential chart at . The set is convex by -concavity of . We consider the functions
which are quasi-concave by (QC). The main difficulty in deducing Theorem 4.1 from Theorem 3.1 is in establishing the quantitative transversality condition (T) introduced on p. T for the family of functions .