From Knothe’s transport to Brenier’s map and a continuation method for optimal transport

From Knothe’s transport to Brenier’s map and a continuation method for optimal transport

Abstract

A simple procedure to map two probability measures in is the so-called Knothe-Rosenblatt rearrangement, which consists in rearranging monotonically the marginal distributions of the last coordinate, and then the conditional distributions, iteratively. We show that this mapping is the limit of solutions to a class of Monge-Kantorovich mass transportation problems with quadratic costs, with the weights of the coordinates asymptotically dominating one another. This enables us to design a continuation method for numerically solving the optimal transport problem.

Keywords: optimal transport, rearrangement of vector-valued maps, Knothe-Rosenblatt transport, continuation methods.

1 Introduction

Given two Borel probability measures and on , a Borel map : is said to be a transport map between and if where denotes the push-forward (or image measure) of through (i.e. for every Borel ). In the present article, we will focus on two particular transport maps: the Knothe-Rosenblatt transport and the Brenier’s map.

The Knothe-Rosenblatt transport. The Knothe-Rosenblatt rearrangement was independently proposed by Rosenblatt [6] for statistical purposes and by Knothe [4] in order to extend the Brunn-Minkowski inequalities. The principle is the following, as explained in Villani [8]. Let and be two Borel probability measures on and assume for simplicity for the moment that is absolutely continuous with respect to the Lebesgue measure. Let us denote by (respectively ) the -th marginal of (respectively ) and , ,…, (respectively , ,…, ) the successive disintegrations (or conditional measures) of (respectively ) given , ,…, (respectively given , , …, ). Now let be the monotone nondecreasing map transporting to , such a map is well-defined and unique as soon as has no atoms and in this case, it is explicitly given by (with and ). Then let be such that is monotone and maps to . One repeats the construction (well-known by statisticians under the name of conditional quantile transforms) iteratively until we define , which is monotone in and transports onto . Finally, the Knothe-Rosenblatt rearrangement is defined by . Obviously, is a transport map from to , i.e. . By construction, the Knothe transport has a triangular Jacobian matrix with nonnegative entries on its diagonal. Note also that the computation of the Knothe transport only involves one-dimensional monotone rearrangements and that it is well defined as soon the measures one transports have no atoms. The precise assumption is the following.

Assumption (H-source): the measure , as well as almost all the measures , and the measures for a.e. and a.e. …up to almost all the measures , which are all measures on the real line, must have no atoms.

Notice that (H-source) is satisfied as soon as is absolutely continuous with respect to the Lebesgue measure.

The Monge-Kantorovich problem and the Brenier’s map. Optimal transportation theory provides an alternative way to transport to . We recall that in the case of the quadratic cost, the Monge-Kantorovich problem reads as

 infπ∈Γ(μ,ν)∫Rd×Rd|x−y|2dπ(x,y) (1.1)

where denotes the set of transport plans between and i.e. the set of probability measure on with marginals and . We refer to the books of Villani [7], [8] for a modern account of optimal transportation theory. The linear problem (1.1) is a relaxation of the Monge problem

 infS:S♯μ=ν∫Rd|x−S(x)|2dμ(x) (1.2)

When is absolutely continuous with respect to the Lebesgue measure, Brenier [2] proved that (1.2) has a unique solution which is characterized by the fact that it is the gradient of some convex function. More precisely, there exists a unique (up to constants and -a.e. equivalence) convex function such that . Also is the only solution of (1.1) and is called the Brenier’s map between and .

The Knothe-Rosenblatt as a limit of optimal transportation plans. Let us slightly modify the quadratic cost in (1.1) and replace it with the weighted quadratic cost

 cε(x,y):=d∑i=1λi(ε)(xi−yi)2

where the ’s are positive scalars depending on a parameter . If is absolutely continuous with respect to the Lebesgue measure, the corresponding optimal transportation problem admits a unique solution . When in addition, for all , as , it is natural to expect the convergence of to the Knothe transport . We will show that this convergence holds provided satisfies some additional condition, and namely

Assumption (H-target): the measure , as well as almost all the measures , and the measures for a.e. and a.e. …up to almost all the measures , which are all measures on the real line, must have no atoms.

Notice that (H-target) is not natural as (H-source) is. Yet, we will show a counter-example to the convergence result when it is not satisfied. (H-target) as well is satisfied should be absolutely continuous (actually, this assumption is slightly weaker then (H-source), since the last disintegration measures are not concerned).

This convergence result was conjectured by Y. Brenier as a very natural one, and actually its proof is not hard. Yet, it was not known before that extra assumptions on were needed. This makes one of the point of interest of this paper.

The other point is what we investigate later in the paper, i.e. the other direction: from Knothe to Brenier. We will study the dependence by means of the evolution with respect to of the dual variables. This will enable us, to design a numericaly strategy to approximate all the optimal transports taking as initial condition the (cheap to compute) Knothe transport .

An example. To illustrate the problem in a particular case where explicit solutions are available, take , and and two Gaussian measures where and . Take and . Then it can be verified that is linear, and that its matrix in the canonical basis of is

 Tε=1√aε2+c+2ε√ac−b2(aε+√ac−b2bbεc+ε√ac−b2)

which converges as to , which is precisely the matrix of the Knothe transport from to .

Organization of the paper. In section 2, we show, under suitable assumptions, that the optimal transportation maps for the cost converge to Knothe’s transport map as the parameter goes to , we will also emphasize that some conditions are to be imposed on for the convergence to hold. In section 3, we show that the evolution of the dual variables in the optimal transportation problem for cost the is given by a well-posed ordinary differential equation. Finally in section 4, we discretize this equation and give several numerical results.

2 Knothe transport as a limit of quadratic optimal transports

We directly state our first result, whose proof, in the spirit of convergence developments (see [1]), will require several steps.

Theorem 2.1.

Let and be two probability measures on satisfying (H-source) and (H-target), respectively, with finite second moments, and be an optimal transport plan for the costs , for some weights . Suppose that for all , as . Let be the Knothe-Rosenblatt map between and and the associated transport plan (i.e. ). Then as .

Moreover, should the plans be induced by transport maps , then these maps would converge to in as .

Proof.

Take the plans that are optimal for the Brenier-like cost given by

 cε(x,y)=d∑i=1λi(ε)(xi−yi)2

(we suppose for simplicity and ). Suppose (which is possible, up to subsequences) . We want to prove .

By comparing to and using optimality we first get

 ∫cεdγε≤∫cεdγK (2.1)

and, passing to the limit as , since converges locally uniformly to , we get

 ∫c(d)dγ≤∫c(d)dγK.

Yet, the function only depends on the variables and and this shows that the measure gets a better result than with respect to the quadratic cost ( being the projection onto the last coordinates, i.e. ). Yet, the measure has been chosen on purpose to get optimality from to with respect to this cost, and the two measures share the same marginals. Moreover, thanks to the assumptions on , this optimal transport plan (which is actually induced by a transport map) is unique. This implies . Let us call this common measure.

We go back to (2.1) and go on by noticing that all the measures have the same marginals as and hence their (separate) projection onto and are and , respectively. This implies that must realize a result which is worse than as far as the quadratic cost is concerned and consequently we have

which implies, by simplifying the common term in , dividing by and passing to the limit,

 ∫c(d−1)dγ≤∫c(d−1)dγK

(we use the general notation ). We can notice that both integrals depend on the variables and only. Anyway, we can project onto the variables and (obtaining measures and ) so that we disintegrate with respect to the measure . We have

 ∫dγd(xd,yd)∫|xd−1−yd−1|2dγd−1(xd,yd)(xd−1,yd−1) ≤∫dγd(xd,yd)∫|xd−1−yd−1|2dγd−1(xd,yd),K(xd−1,yd−1). (2.2)

It is is sufficient to prove that the measures share the same marginals on and as the corresponding to get that their quadratic performance should be worse than the corresponding performance of (this is because the Knothe measure has been chosen exactly with the intention of being quadratically optimal on once and are fixed). Yet, (2.2) shows that, on average, the result given by the those measures is not worse than the results of the optimal ones. Thus, the two results coincide for almost any pair and, by uniqueness of the optimal transports (this relies on the assumptions on the measures ), we get . To let this proof work it is sufficient to prove that the projections of the two measures coincide for a.e. pair . For fixed we would like to prove, for any

 ∫ϕ(xd−1)dγd−1(xd,yd)=∫ϕ(xd−1)dγd−1(xd,yd),K

(and to prove an analogous equality for functions of ). Since we accept to prove it for a.e. pair , it is sufficient to prove this equality:

 ∫dγd(xd,yd)ψ(xd,yd)∫ϕ(xd−1)dγd−1(xd,yd)=∫dγd(xd,yd)ψ(xd,yd)∫ϕ(xd−1)dγd−1(xd,yd),K

for any and any . This means proving

 ∫ψ(xd,yd)ϕ(xd−1)dγd−1=∫ψ(xd,yd)ϕ(xd−1)dγd−1K,

which is not trivial since we only know that the two measures and have the same marginals with respect to the pairs , (since they have the same projections onto and onto ) and (since we just proved it). But here there is a function of the three variables . Yet, we know that the measure is concentrated on the set for a certain map , and this allows to replace the expression of , thus getting rid of one variable. This proves that the function is actually a function of only, and that equality holds when passing from to .The same can be performed on functions but we have in this case to ensure that we can replace with a function of , i.e. that we can invert . This is possible thanks to the assumption on , since is the optimal transport from to , but an optimal transport exists in the other direction as well and it gives the same optimal plan (thanks to uniqueness). These facts prove that the measures and have the same marginals and hence, since they are both optimal, they coincide for a.e. pair . This implies .

Now, it is possible to go on by steps: once we have proven that , let us take (2.1) and estimate all the terms with and thanks to the optimality of , thus getting

 ∑i≥hλi(ε)∫|xi−yi|2dγK+h−1∑i=1λi(ε)∫(xi−yi)2dγε≤∫cεdγε≤∫cεdγK=∑i≥hλi(ε)∫|xi−yi|2dγK+h−1∑i=1λi(ε)∫(xi−yi)2dγK,

and consequently, by dividing by and passing to the limit,

 ∫c(h−1)dγ≤∫c(h−1)dγK.

We disintegrate with respect to and we act exacly as before: proving that the marginals of the disintegrations coincide is sufficient to prove equality of the measures. Here we will use test-functions fo the form

 ψ(xh,xh+1,\mathellipsis,xd,yh,yh+1,\mathellipsis,yd)ϕ(xh−1)

and

 ψ(xh,xh+1,\mathellipsis,xd,yh,yh+1,\mathellipsis,yd)ϕ(yh−1).

The same trick as before, i.e. replacing the variables with functions of the variables is again possible. To invert the trick and replace with one needs to invert part of Knothe’s transport. This is possible since our assumptions imply that all the monotone transports we get are invertible. In the end we get, as before, . This procedure may go on up to , thus arriving at .

We have now proven . Yet, if all these transport plans come from transport maps, it is well known that implies in , for any , as far as is bounded in . Actually, weak convergence is a simple consequence of boundedness: to go on, we can look at Young’s measures. The assumption (the limit is a transport map as well) exactly means that all the Young measures are dirac masses, which implies strong convergence. In particular we get convergence and -a.e. convergence on a subsequence. ∎

Let us remark here that if instead of considering the quadratic cost , one considers the more general separable cost

 cε(x,y):=d∑i=1λi(ε)ci(xi−yi)

where each is a smooth strictly convex function (with suitable growth), then the previous convergence proof carries over.

A counterexample when the measures have atoms We now show that interestingly, and perhaps counterintuitively, the hypothesis of absence of atoms in theorem 2.1 is necessary not only for , but also for . We propose a very simple example in where is absolutely continuous with respect to the Lebesgue measure but does not satisfy (H-target), and we show that the conclusion of theorem 2.1 fails to hold. On the square , define such that so that the measure is uniformly spread on the upper left and the lower right quadrants, and , being the segment .

The Knothe-Rosenblatt map is easily computed as . The solution of any symmetric transportation problem with is (no transport may do better than this one, which projects on the support of ). Therefore, in this example the optimal transportation maps fail to tend to the Knothe-Rosenblatt map. The reason is the atom in the measure .

Convergence even with atoms The convergence result of theorem 2.1 requires the absence of atoms in the projections of . This is obviously not the case if itself is purely atomic! Yet, this will precisely be the case we will consider in the algorithm we propose in the sequel. The same proof may be extended to this case under the following assumption. Keep the same assumptions on but suppose that is concentrated on a set with the property

 y,z∈S,y≠z⇒yd≠zd.

This means that, if we restrict ourselves to , then all the variables for are actually a function of the last variable . This is particularly useful when is purely atomic, concentrated on a finite (or countable) set of points with different components.

Just come back to the proof. The equality

 ∫dγd(xd,yd)ψ(xd,yd)∫ϕ(xd−1)dγd−1(xd,yd)=∫dγd(xd,yd)ψ(xd,yd)∫ϕ(xd−1)dγd−1(xd,yd),K

only relied on being a function of , which is still true. The other equality, namely

 ∫dγd(xd,yd)ψ(xd,yd)∫ϕ(yd−1)dγd−1(xd,yd)=∫dγd(xd,yd)ψ(xd,yd)∫ϕ(yd−1)dγd−1(xd,yd),K

gives some extra troubles. It is not any more true that is a function of . Yet, it is true that is a function of and this allows us to reduce the expression to functions of only, which is sufficient to get equality. The same procedure may be performed at subsequent steps as well.

3 An ODE for the dual variables

In this section, we consider for simplicity the case (although our method extends to higher dimensions), uniform on some convex polyhedron (for the sake of simplicity we will assume ) and where all the points have a different second coordinate . For every , let be the diagonal matrix with diagonal entries and let be the quadratic cost defined by . We are interested in solving the family of optimal transportation problems

 infπ∈Γ(μ,ν)∫Rd×Rdcε(x,y)dπ(x,y) (3.1)

for all values of the parameter . It is well-known, that (3.1) can be conveniently solved by the dual problem formulated in terms of prices:

 suppΦ(p,ε):=1NN∑i=1pi+∫Ωp∗ε(x)dx, (3.2)

where and we impose as a normalization . For each , there is a unique maximizer . For each we define . It is easy to check that is concave differentiable and that the gradient of is given by

 ∂Φε∂pi(p)=1N−|C(p,ε)i|.

By concavity of , the solution of (3.2) is characterized by the equation . The optimal transportation between and for the cost is then the piecewise map taking value in the cell . Our aim is to charcacterize the evolution of as varies. Formally, differentiating the optimality condition , we obtain a differential equation for the evolution of :

 ∂∂ε∇pΦ(p(ε),ε)+D2p,pΦ(p(ε),ε)⋅dpdε(ε)=0. (3.3)

Our aim now is to show that the equation (3.3) is well-posed starting with the initial condition (corresponding to horizontal cells of area ); this will involve computing the second derivatives of in (3.3), proving their Lipschitz behavior as well as obtaining a negative bound on the larger eigenvalue of the negative semidefinite matrix .

The price vector , along the evolution, will always be such that all the areas are equal (and are equal to ). Yet, we need to prove that the differential equation is well posed and we will set it in an open set,

 O={(p,ε):12N<|C(p,ε)i|<2Nfor every i}. (3.4)

The initial datum of the equation will be such that and we will look at the solution only up to the first moment where it exits . Yet inside the set it will be well-posed and it will imply conservation of the areas. Hence we will never exit .

All the quantities we are interested in depend on the position of the vertices of the cells , which are all polygons. Let us call the two extremal points of the common boundary between and , that we call (if such a common boundary exists; should it be a single point we consider the two points as coinciding). Each one of this points is obtained as the intersection of at least this common boundary with another one, or with the boundary of (which is supposed to be a polygon as well, so that in the neighbourhood of almost any point locally the boundary is a line). We want to investigate the dependence of these points with respect to and prove that this dependence is Lipschitz. Notice that each point is not defined for any value of but only on a certain (closed) subset of the space .

Lemma 3.1.

The positions of the vertices depend in a Lipschitz way on and .

Proof.

Locally it is true that the same point is defined either by a system of equations

 {Aε(x−yi)(x−yi)−pi=Aε(x−yj)(x−yj)−pj,Aε(x−yi)(x−yi)−pi=Aε(x−yh)(x−yh)−ph (3.5)

in the case of a point on the intersection of two common boundaries, or by a system

 {Aε(x−yi)(x−yi)−pi=Aε(x−yj)(x−yj)−pj,Lx=l0 (3.6)

in the case of intersection with the boundary (locally being given by the equation ) . The first system, after simplifying, reads as

 {2Aε(yj−yi)(x)=Aε(yj)(yj)−Aε(yi)(yi)−pj+pi,2Aε(yh−yi)(x)=Aε(yh)(yh)−Aε(yi)(yi)−ph+pi, (3.7)

and the second as well may be simplified the same way. This means that they are of the form

 M(ε)x=V(ε,p)

for a matrix which reads, in usual coordinates,

 M(ε)=⎛⎝ε(y(1)j−y(1)i)y(2)j−y(2)iε(y(1)h−y(1)i)y(2)h−y(2)i⎞⎠ in the first case and M(ε)=(ε(y(1)j−y(1)i)y(2)j−y(2)il1l2) in the second.

The vector is obtained regarding the right hand sides of the system, as in (3.7). Both and depend Lipschitzly on , with uniformly bounded Lipschitz constants. Hence, to check that the dependence of on is Lipschitz, we only need to bound (away from zero) . The determinant of a matrix is given by the product of the modulus of the two vector composing its lines, times the sinus of the angle between them. In the first case the vectors are and , while in the second they are and . In both cases they are the normal vectors to the sides of the cell we are considering. This implies, thanks to the lower bound on the areas of the cells, that the angles between these vectors may not be too small. Actually, since is bounded and the cells are convex, the area of each cell is smaller than being any angle between two neighbour sides. This implies a lower bound on . The lower bound on the moduli comes from the fact that the vectors are chosen with modulus one and the modulus of is always greater than its vertical component, which is , which is supposed different from zero for any pair . Notice that in the first case the matrix has in the determinant, even if we proved a lower bound on such a determinant, independent on : this agrees with the fact that actually, this kind of crossing (between two common boundaries of two cells) will only happen for (for we only have almost horizontal strips crossing non-horizontal sides of ).∎

Lemma 3.2.

The function admits pure second derivatives with respect to and mixed second derivatives with respect to and , and these second derivatives are Lipschitz continuous.

Proof.

We have proven that the positions of the points depend Lipschitzly, with uniformly bounded Lipschitz constants, on and . Since the volumes of the cells are Lipschitz functions of these points, this implies that is . Hence it admits derivatives almost everywhere and we can compute them in the following way.

The derivative of a volume of a polygonal cell is given, side by side, by the length of the side times the average of the components which are normal to such a side of the two derivatives and (the other terms - which are mainly the terms at the corners - are of higher order). The equation of the side is, as we know,

 2Aε(yj−yi)(x)=Aε(yj)(yj)−Aε(yi)(yi)−pj+pi (3.8)

and the normal unit vector to the side is

 n=Aε(yj−yi)|Aε(yj−yi)|.

Let us start from the derivatives of the cell with respect to a variable with . We differentiate (3.8) with respect to and we get

 2Aε(yj−yi)(˙x)=−1.

This formula only works for and . Obviously it ony works where they are differentiable, i.e. almost everywhere. Hence the derivative, by summing up and rescaling the normal vector, is given by

 ∂|C(p,ε)i|∂pj=−li,j2|Aε(yj−yi)|, (3.9)

where is the length of .

As far as the derivative with respect to is concerned, it is not difficult to check that we have (by summing up the results on every side)

 ∂|C(p,ε)i|∂pi=∑jli,j2|Aε(yj−yi)|, (3.10)

where the sum is performed on all the indices such that the cell is in contact with the cell .

Automatically, since these derivatives only depend on the values of , which depend in a Lipschitz manner on the positions of , they are Lipschitz as well. This proves that is actually and that these derivatives are well defined and admit the previous expressions (3.9)-(3.10) everywhere.

The computation of the derivatives with respect to is a bit trickier. We derive again (3.8), but with respect to . Since , we get

 2Aε(yj−yi)(˙x)=−2B(yj−yi)(x)+B(yj)(yj)−B(yi)(yi)=2B(yj−yi)(yj+yi2−x).

Then we renormalize the normal vector, sum up the results for and , multiply by the lengths and sum up the results for all the sides, and get

 ∂|C(p,ε)i|∂ε=∑jli,jB(yj−yi)(yj+yi−x(p,ε)+i,j−x(p,ε)−i,j)2|Aε(yj−yi)|. (3.11)

In this case as well the result is Lipschitz in and hence is differentiable everywhere with respect to , with Lipschitz derivative.

We can come now back to the evolution of and consider again the differential equation (3.3). To solve this equation we need to prove that the matrix is actually invertible (for numerical purpose, we will also need to bound its eigenvalues away from zero). It is important to recall that we look at the evolution of the vector , since we may assume for all . Hence, we will not look at the entries in the vectors or the matrices. The matrix we consider is has the following properties:

• on each line, outside the diagonal we have negative terms ;

• each element on the diagonal is the sum of minus all the others on the same line (hence it is positive), and possibly of the term which should be in the same line at the first column;

• an entry of the matrix is non-zero if and only if the cells and share a common boundary with positive length;

• in particular, for any pair , even if the entry at place is zero, it is possible to find a path so that the matrix has non-zero values at all the positions ;

• the total of the entries of the first column (the one which is not present in the matrix) is strictly positive.

The invertibility of is ensured by the following:

Lemma 3.3.

Let the matrix satisfy the following properties

 (H1) for all i,Mi,i≥∑j≠i|Mi,j|,(H2) there exists i such that Mi,i>∑j≠i|Mi,j|,(H3) for any pair (i,j) there is a sequence i0,i1,i2,\mathellipsis,ikwith i1=i,ik=j, and Mih,ih+1≠0.

then is invertible.

Proof.

Let and let be an index such that is maximal. We may suppose for simplicity that is positive. Then we have

 0=M¯i,¯ix¯i−∑jM¯i,jxj≥M¯i,¯ix¯i−∑jM¯i,jx¯i=x¯i(M¯i,¯i−∑jM¯i,j)≥0.

This implies that all inequalities are equalities and in particular whenever . Hence, the entries of on all the indices which are “neighbours” of equal (and they are maximal as well). This allows to repeat the argument replacing with another maximizing index and so on… since any index is connected by a chain of neighbours to , we get that all the entries are equal. But this implies that the vector in the kernel we selected must be a multiple of the vector . Yet, this vector is not in the kernel since the sum of the elements on each line is not zero for all lines, by assumption . This proves that is invertible. ∎

Finding a lower bound for the modulus of the eigenvalues of , i.e. quantifying its inversibility is not straightforward. Indeed, the properties , ,