Regularized Optimal Transport is Ground Cost Adversarial

Abstract

Regularizing Wasserstein distances has proved to be the key in the recent advances of optimal transport (OT) in machine learning. Most prominent is the entropic regularization of OT, which not only allows for fast computations and differentiation using the Sinkhorn algorithm, but also improves stability with respect to data and accuracy in many numerical experiments. The theoretical understanding of these benefits remains incomplete, although recent statistical works have shown that entropy-regularized OT mitigates classical OT’s curse of dimensionality. In this paper, we adopt a more geometrical point of view, and show using Fenchel duality that any convex regularization of OT can be interpreted as ground cost adversarial. This incidentally gives access to a robust dissimilarity measure on the ground space, which can in turn be used in other applications. We propose algorithms to compute this robust cost, and illustrate the interest of this approach empirically.

1 Introduction

Optimal transport (OT) has become a generic tool in machine learning, with applications in various domains such as supervised machine learning Frogner et al. (2015); Abadeh et al. (2015); Courty et al. (2016), graphics Solomon et al. (2015); Bonneel et al. (2016), imaging Rabin and Papadakis (2015); Cuturi and Peyré (2016), generative models Arjovsky et al. (2017); Salimans et al. (2018), biology Hashimoto et al. (2016); Schiebinger et al. (2019) or NLP Grave et al. (2019); Alaux et al. (2019). The key to using OT in these applications lies in the different forms of regularization of the original OT problem studied in the renowned books of Villani (2009); Santambrogio (2015). Adding a small convex regularization to the classical linear cost not only helps on the algorithmic side, by convexifying the objective and allowing for faster solvers, but also adds some stability with respect to the input measures, improving numerical results.

Regularizing OT

Although entropy-regularized OT appears as the most studied regularization of OT, due to its algorithmic advantages Cuturi (2013), several other convex regularizations of the transport plan have been proposed in the community: quadratically-regularized OT Essid and Solomon (2017), OT with capacity constraints Korman and McCann (2015), Group-Lasso regularized OT Courty et al. (2016), OT with Laplacian regularization Flamary et al. (2014), among others. On the other hand, regularizing the dual Kantorovich problem was shown in Liero et al. (2018) to be equivalent to unbalanced OT, that is optimal transport with relaxed marginal constraints.

Understanding why regularization helps

The question of understanding why regularizing OT proves critical has triggered several approaches. A particularly active one is the statistical study of entropic regularization: although classical OT suffers from the curse of dimensionality, as its empirical version converges at a rate of order $O(n^{-1/d})$ Dudley (1969); Fournier and Guillin (2015); Weed and Bach (2019), Sinkhorn divergences have a sample complexity of $O(1/\sqrt{n})$ Genevay et al. (2018); Mena and Niles-Weed (2019). Entropic OT was also shown to perform maximum likelihood estimation in the Gaussian deconvolution model Rigollet and Weed (2018). Taking another approach, Dessein et al. (2018); Blondel et al. (2018) have considered general classes of convex regularizations and characterized them from a more geometrical perspective. Recently, several papers Flamary et al. (2018); Deshpande et al. (2019); Kolouri et al. (2019); Niles-Weed and Rigollet (2019); Paty and Cuturi (2019) proposed to maximize OT with respect to the ground cost, which can in turn be interpreted in light of ground metric learning Cuturi and Avis (2014). Continuing along these lines, we make a connection between regularizing and maximizing OT.

Contributions

Our main goal is to provide a novel interpretation of regularized optimal transport in terms of ground cost robustness: regularizing OT amounts to maximizing unregularized OT with respect to the ground cost. Our contributions are:

1. We show that any convex regularization of the transport plan corresponds to ground-cost robustness (section 3);

2. We reinterpret classical regularizations of OT in the ground-cost adversarial setting (section 3.3);

3. We prove, under some technical assumption, a duality theorem for regularized OT, which we use to show that under the same assumption, there exists an optimal adversarial ground-cost that is separable (section 4);

4. We propose to extend the notion of ground-cost robustness to more than two measures, and focus on the case where the measures are time-varying (section 5);

5. We give some algorithms to solve the above-mentioned problems (section 6) and illustrate them on data (section 7).

2 Background on Optimal Transport and Notations

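Let $\mathcal{X}$ be a compact Hausdorff space, and define $\mathcal{P}(\mathcal{X})$ the set of Borel probability measures over $\mathcal{X}$. We write $\mathcal{C}(\mathcal{X})$ for the set of continuous functions from $\mathcal{X}$ to $\mathbb{R}$, endowed with the supremum norm. For $\phi,\psi\in\mathcal{C}(\mathcal{X})$, we write $\phi\oplus\psi\in\mathcal{C}(\mathcal{X}^2)$ for the function $\phi\oplus\psi:(x,y)\mapsto\phi(x)+\psi(y)$.

For $n\in\mathbb{N}$, we write $[n]=\{1,\dots,n\}$. All vectors will be denoted with bold symbols. For a Boolean assertion $A$, we write $\iota(A)$ for its indicator function: $\iota(A)=0$ if $A$ is true and $\iota(A)=+\infty$ otherwise.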

Kantorovich Formulation of OT

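For $\mu,\nu\in\mathcal{P}(\mathcal{X})$, we write $\Pi(\mu,\nu)$ for the set of couplings

 \[ \Pi(\mu,\nu)=\{\pi\in\mathcal{P}(\mathcal{X}^2)\ \text{s.t.}\ \forall A,B\subset\mathcal{X}\ \text{Borel},\ \pi(A\times\mathcal{X})=\mu(A),\ \pi(\mathcal{X}\times B)=\nu(B)\}. \]

For a real-valued continuous function $c\in\mathcal{C}(\mathcal{X}^2)$, the optimal transport cost between $\mu$ and $\nu$ is defined as

 \[ \mathcal{T}_c(\mu,\nu):=\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathcal{X}^2}c(x,y)\,d\pi(x,y). \tag{1} \]

Since $c$ is continuous and $\mathcal{X}$ is compact, the infimum in (1) is attained, see Theorem 1.4 in Santambrogio (2015). Problem (1) admits the following dual formulation, see Proposition 1.11 and Theorem 1.39 in Santambrogio (2015):

 \[ \mathcal{T}_c(\mu,\nu)=\max_{\substack{\phi,\psi\in\mathcal{C}(\mathcal{X})\\ \phi\oplus\psi\le c}}\int\phi\,d\mu+\int\psi\,d\nu. \tag{2} \]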
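As a small numerical illustration of (1) and (2) in the discrete case, the following sketch assumes two uniform histograms on $n$ points, for which some optimal coupling is a permutation matrix scaled by $1/n$ (Birkhoff-von Neumann theorem). It brute-forces the primal value and compares it with the value of one feasible dual pair. The function names and the 3 x 3 cost are illustrative assumptions, not taken from the paper.

```python
from itertools import permutations


def ot_uniform(cost):
    """Brute-force OT cost (1) between two uniform histograms on n points.

    Assumption: both marginals are uniform, so some optimal coupling is a
    permutation matrix scaled by 1/n (Birkhoff-von Neumann theorem).
    """
    n = len(cost)
    return min(
        sum(cost[i][sigma[i]] for i in range(n)) / n
        for sigma in permutations(range(n))
    )


def dual_value(cost):
    """Dual objective (2) at the feasible pair phi_i = min_j c_ij, psi = 0."""
    n = len(cost)
    return sum(min(row) for row in cost) / n


# Hypothetical 3 x 3 ground cost, chosen only for illustration.
cost = [[2.0, 1.0, 4.0],
        [1.0, 3.0, 1.0],
        [4.0, 1.0, 2.0]]
primal = ot_uniform(cost)
lower_bound = dual_value(cost)
print(primal, lower_bound)  # the dual value never exceeds the primal one
```

The brute-force search is only practical for very small $n$; it is used here because it needs no solver and makes the weak-duality inequality easy to inspect.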

Space of Measures

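Since $\mathcal{X}$ is compact, the dual space of $\mathcal{C}(\mathcal{X}^2)$ is the set $\mathcal{M}(\mathcal{X}^2)$ of Borel finite signed measures over $\mathcal{X}^2$. For $F:\mathcal{M}(\mathcal{X}^2)\to\mathbb{R}$, we recall that $F$ is Fréchet-differentiable at $\pi$ if there exists $\nabla F(\pi)\in\mathcal{C}(\mathcal{X}^2)$ such that for any $h\in\mathcal{M}(\mathcal{X}^2)$, as $t\to0$

 \[ F(\pi+th)=F(\pi)+t\int\nabla F(\pi)\,dh+o(t). \]

Similarly, $G:\mathcal{C}(\mathcal{X}^2)\to\mathbb{R}$ is Fréchet-differentiable at $c$ if there exists $\nabla G(c)\in\mathcal{M}(\mathcal{X}^2)$ such that for any $h\in\mathcal{C}(\mathcal{X}^2)$, as $t\to0$

 \[ G(c+th)=G(c)+t\int h\,d\nabla G(c)+o(t). \]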

Legendre–Fenchel Transformation

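For any functional $F:\mathcal{M}(\mathcal{X}^2)\to\mathbb{R}\cup\{+\infty\}$, we can define its convex conjugate $F^*$ and biconjugate $F^{**}$ as

 \[ F^*(c):=\sup_{\pi\in\mathcal{M}(\mathcal{X}^2)}\int c\,d\pi-F(\pi),\qquad F^{**}(\pi):=\sup_{c\in\mathcal{C}(\mathcal{X}^2)}\int c\,d\pi-F^*(c). \]

$F^*$ is always lower semi-continuous (lsc) and convex as the supremum of continuous affine functions.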

Specific notations

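For $F:\mathcal{M}(\mathcal{X}^2)\to\mathbb{R}\cup\{+\infty\}$, we write $\operatorname{dom}F=\{\pi\in\mathcal{M}(\mathcal{X}^2)\,:\,F(\pi)<+\infty\}$ for its domain and will say $F$ is proper if $\operatorname{dom}F\neq\emptyset$.

We will denote by $\mathcal{F}$ the set of proper lsc convex functions $F:\mathcal{M}(\mathcal{X}^2)\to\mathbb{R}\cup\{+\infty\}$, and for $\mu,\nu\in\mathcal{P}(\mathcal{X})$, we define the set of lsc convex functions that are proper on $\Pi(\mu,\nu)$:

 \[ \mathcal{F}(\mu,\nu)=\{F\in\mathcal{F}\mid\exists\pi\in\Pi(\mu,\nu),\ F(\pi)<+\infty\}. \]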

3 Ground Cost Adversarial Optimal Transport

3.1 Definition

Instead of considering the classical linear formulation of optimal transport (1), we will consider the following more general nonlinear formulation:

{definition}

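Let $F\in\mathcal{F}$. For $\mu,\nu\in\mathcal{P}(\mathcal{X})$, we define:

 \[ \mathcal{W}_F(\mu,\nu)=\inf_{\pi\in\Pi(\mu,\nu)}F(\pi). \tag{3} \]
{lemma}

The infimum in (3) is attained. Moreover, if $F\in\mathcal{F}(\mu,\nu)$, $\mathcal{W}_F(\mu,\nu)<+\infty$.

Proof.

We can apply Weierstrass’s theorem since $\Pi(\mu,\nu)$ is compact and $F$ is lsc by definition.
For $F\in\mathcal{F}(\mu,\nu)$, there exists $\pi_0\in\Pi(\mu,\nu)$ such that $F(\pi_0)<+\infty$, so $\mathcal{W}_F(\mu,\nu)\le F(\pi_0)<+\infty$. ∎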

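The main result of this paper is the following interpretation of problem (3) as a ground-cost adversarial OT problem: {theorem} For $\mu,\nu\in\mathcal{P}(\mathcal{X})$ and $F\in\mathcal{F}(\mu,\nu)$, minimizing $F$ over $\Pi(\mu,\nu)$ is equivalent to the following concave problem:

 \[ \mathcal{W}_F(\mu,\nu)=\sup_{c\in\mathcal{C}(\mathcal{X}^2)}\mathcal{T}_c(\mu,\nu)-F^*(c). \tag{4} \]
Proof.

Since $F$ is proper, lsc and convex, the Fenchel-Moreau theorem ensures that it is equal to its convex biconjugate $F^{**}$, so:

 \[ \min_{\pi\in\Pi(\mu,\nu)}F(\pi)=\min_{\pi\in\Pi(\mu,\nu)}F^{**}(\pi)=\min_{\pi\in\Pi(\mu,\nu)}\sup_{c\in\mathcal{C}(\mathcal{X}^2)}\int c\,d\pi-F^*(c). \]

Define the objective $\ell(\pi,c):=\int c\,d\pi-F^*(c)$. Since $F^*$ is lsc as the convex conjugate of $F$, for any $\pi\in\Pi(\mu,\nu)$, $\ell(\pi,\cdot)$ is usc. It is also concave as the sum of concave functions. Likewise, for any $c\in\mathcal{C}(\mathcal{X}^2)$, $\ell(\cdot,c)$ is continuous and convex (in fact affine). Since $\Pi(\mu,\nu)$ and $\mathcal{C}(\mathcal{X}^2)$ are convex, and $\Pi(\mu,\nu)$ is compact, we can use Sion’s minimax theorem to swap the min and the sup:

 \[ \min_{\pi\in\Pi(\mu,\nu)}F(\pi)=\sup_{c\in\mathcal{C}(\mathcal{X}^2)}\min_{\pi\in\Pi(\mu,\nu)}\int c\,d\pi-F^*(c). \]

By definition (1), the inner minimum equals $\mathcal{T}_c(\mu,\nu)-F^*(c)$, which gives (4). ∎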

{remark}

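Note that the inequality

 \[ \mathcal{W}_F(\mu,\nu)\ge\sup_{c\in\mathcal{C}(\mathcal{X}^2)}\mathcal{T}_c(\mu,\nu)-F^*(c) \]

is in fact verified for any $F:\mathcal{M}(\mathcal{X}^2)\to\mathbb{R}\cup\{+\infty\}$ since $F\ge F^{**}$ is always verified.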

The supremum in equation (4) is not necessarily attained. Under some regularity assumption on $F$, we show that the supremum is attained and relate the optimal couplings and the optimal ground costs: {proposition} Let $\mu,\nu\in\mathcal{P}(\mathcal{X})$ and $F\in\mathcal{F}(\mu,\nu)$. Suppose that $F$ is Fréchet-differentiable on $\Pi(\mu,\nu)$. Then the supremum in (4) is attained at $c^\star=\nabla F(\pi^\star)$ where $\pi^\star$ is any minimizer of (3). Conversely, suppose $F^*$ is Fréchet-differentiable everywhere. If $c^\star$ is the unique maximizer in (4), then $\pi^\star=\nabla F^*(c^\star)$ is a minimizer of (3). A proof is given in the appendix. In section 4, we will further characterize $c^\star$ for a class of functions $F$.

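One interesting particular case of Theorem 3.1 is when the convex cost $F$ is a convex regularization of the classical linear optimal transport: {corollary} Let $c_0\in\mathcal{C}(\mathcal{X}^2)$, $\epsilon>0$. Let $\mu,\nu\in\mathcal{P}(\mathcal{X})$ and $R\in\mathcal{F}(\mu,\nu)$. Then:

 \[ \min_{\pi\in\Pi(\mu,\nu)}\int c_0\,d\pi+\epsilon R(\pi)=\sup_{c\in\mathcal{C}(\mathcal{X}^2)}\mathcal{T}_c(\mu,\nu)-\epsilon R^*\!\left(\frac{c-c_0}{\epsilon}\right). \tag{5} \]
Proof.

We apply theorem 3.1 with $F(\pi)=\int c_0\,d\pi+\epsilon R(\pi)$, for which we only need to compute the convex conjugate:

 \[ F^*(c)=\sup_{\pi\in\mathcal{M}(\mathcal{X}^2)}\int(c-c_0)\,d\pi-\epsilon R(\pi)=\epsilon\sup_{\pi\in\mathcal{M}(\mathcal{X}^2)}\int\frac{c-c_0}{\epsilon}\,d\pi-R(\pi)=\epsilon R^*\!\left(\frac{c-c_0}{\epsilon}\right). \]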

3.2 Discrete Separable Case

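In this subsection, we will focus on the discrete case where the space $\mathcal{X}=[n]$ for some $n\in\mathbb{N}$. A probability measure $\mu\in\mathcal{P}(\mathcal{X})$ is then a histogram of size $n$ that we will represent by a vector $\boldsymbol{\mu}\in\mathbb{R}^n_+$ such that $\sum_{i=1}^n\boldsymbol{\mu}_i=1$. Cost functions and transport plans are now matrices $\mathbf{c},\boldsymbol{\pi}\in\mathbb{R}^{n\times n}$.

We focus on regularization functions that are separable, i.e. of the form

 \[ R(\boldsymbol{\pi})=\sum_{i=1}^n\sum_{j=1}^nR_{ij}(\boldsymbol{\pi}_{ij}) \]

for some differentiable, convex, proper, lsc functions $R_{ij}$ of one real variable.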

In applications, it may be natural to ask that the ground cost has nonnegative entries. Adding this constraint on the adversarial cost corresponds to linearizing “at short range” the regularization for “small transport values”:

{proposition}

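Let $\epsilon>0$. For $\boldsymbol{\mu},\boldsymbol{\nu}\in\mathcal{P}(\mathcal{X})$, it holds:

 \[ \sup_{\mathbf{c}\in\mathbb{R}^{n\times n}_+}\mathcal{T}_{\mathbf{c}}(\boldsymbol{\mu},\boldsymbol{\nu})-\epsilon\sum_{ij}R^*_{ij}\!\left(\frac{\mathbf{c}_{ij}-(\mathbf{c}_0)_{ij}}{\epsilon}\right)=\min_{\boldsymbol{\pi}\in\Pi(\boldsymbol{\mu},\boldsymbol{\nu})}\langle\mathbf{c}_0,\boldsymbol{\pi}\rangle+\epsilon\sum_{ij}\hat{R}_{ij}(\boldsymbol{\pi}_{ij}) \tag{6} \]

where $\hat{R}_{ij}$ is the continuous convex function defined as

 \[ \hat{R}_{ij}(x):=\begin{cases}R_{ij}(x)&\text{if }x\ge{R^*_{ij}}'\!\left(-\frac{(\mathbf{c}_0)_{ij}}{\epsilon}\right)\\[2pt]-\frac{(\mathbf{c}_0)_{ij}}{\epsilon}\,x-R^*_{ij}\!\left(-\frac{(\mathbf{c}_0)_{ij}}{\epsilon}\right)&\text{otherwise.}\end{cases} \]

Moreover, if $R_{ij}$ is of class $C^1$, then $\hat{R}_{ij}$ is also $C^1$. We give a proof in the appendix.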

3.3 Examples

As presented in the introduction, several convex regularizations have been proposed. We give the ground cost adversarial counterpart for some of them: two examples in the continuous setting, and three $\ell_p$-norm based regularizations in the discrete case.

{example}

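[Entropic Regularization] Let $\mu,\nu\in\mathcal{P}(\mathcal{X})$. For $\pi\in\mathcal{P}(\mathcal{X}^2)$, we define its relative entropy as $\mathrm{KL}(\pi\|\mu\otimes\nu)=\int\log\frac{d\pi}{d\mu\otimes\nu}\,d\pi$. Then for $c_0\in\mathcal{C}(\mathcal{X}^2)$ and $\epsilon>0$, it holds:

 \[ \min_{\pi\in\Pi(\mu,\nu)}\int c_0\,d\pi+\epsilon\,\mathrm{KL}(\pi\|\mu\otimes\nu)=\sup_{c\in\mathcal{C}(\mathcal{X}^2)}\mathcal{T}_c(\mu,\nu)-\epsilon\int\exp\!\left(\frac{c-c_0}{\epsilon}\right)d\mu\otimes\nu+\epsilon. \]
Proof.

For $\pi\in\mathcal{M}(\mathcal{X}^2)$, let

 \[ R(\pi)=\begin{cases}\int\log\frac{d\pi}{d\mu\otimes\nu}\,d\pi-\int d\pi+1&\text{if }\pi\ll\mu\otimes\nu\\+\infty&\text{otherwise.}\end{cases} \]

$R$ is convex, and using proposition 7 in Feydy et al. (2019),

 \[ R^*(c)=\int\left(e^{c}-1\right)d\mu\otimes\nu. \]

Since $R(\pi)=\mathrm{KL}(\pi\|\mu\otimes\nu)$ for any probability measure $\pi$, applying corollary 3.1 concludes the proof. ∎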
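The identity of this example can be checked numerically on a small discrete problem. The sketch below runs plain Sinkhorn iterations for the primal problem, forms the candidate adversarial cost $c=c_0+\epsilon\log\frac{d\pi}{d\mu\otimes\nu}$ suggested by proposition 3.1, and evaluates the right-hand side at that cost. The data, the function name `sinkhorn` and the iteration count are illustrative assumptions, not the authors' implementation.

```python
import math


def sinkhorn(c0, mu, nu, eps, n_iter=200):
    """Sinkhorn iterations for min <c0, pi> + eps * KL(pi || mu x nu).

    Returns the coupling pi_ij = a_i * b_j * mu_i * nu_j * exp(-c0_ij / eps).
    """
    n, m = len(mu), len(nu)
    K = [[math.exp(-c0[i][j] / eps) for j in range(m)] for i in range(n)]
    a, b = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        a = [1.0 / sum(K[i][j] * b[j] * nu[j] for j in range(m)) for i in range(n)]
        b = [1.0 / sum(K[i][j] * a[i] * mu[i] for i in range(n)) for j in range(m)]
    return [[a[i] * mu[i] * K[i][j] * b[j] * nu[j] for j in range(m)]
            for i in range(n)]


# Hypothetical data, chosen only for illustration.
mu, nu = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
c0 = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
eps = 1.0
idx = [(i, j) for i in range(3) for j in range(3)]

pi = sinkhorn(c0, mu, nu, eps)

# Primal value: <c0, pi> + eps * KL(pi || mu x nu).
primal = sum(pi[i][j] * (c0[i][j] + eps * math.log(pi[i][j] / (mu[i] * nu[j])))
             for i, j in idx)

# Candidate adversarial cost c = c0 + eps * log(d pi / d(mu x nu)).
c_adv = [[c0[i][j] + eps * math.log(pi[i][j] / (mu[i] * nu[j])) for j in range(3)]
         for i in range(3)]

# c_adv is separable (c_adv_ij = f_i + g_j), so every coupling has the same
# cost for it; we evaluate its OT cost with the product coupling mu x nu.
ot_adv = sum(c_adv[i][j] * mu[i] * nu[j] for i, j in idx)
penalty = sum(mu[i] * nu[j] * math.exp((c_adv[i][j] - c0[i][j]) / eps)
              for i, j in idx)
dual = ot_adv - eps * penalty + eps
print(primal, dual)  # the two values agree up to the Sinkhorn tolerance
```

Evaluating the OT cost of `c_adv` with the product coupling is only valid because this particular cost is separable; for a general cost an OT solver would be needed.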

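Another case of interest is the so-called Subspace Robust Wasserstein distance recently proposed by Paty and Cuturi (2019). Here, the set of adversarial metrics is parameterized by a finite-dimensional parameter $\Omega$, which allows one to recover an adversarial metric defined on the whole space even when the measures are finitely supported. {example}[Subspace Robust Wasserstein] Let $k\in[d]$, and $\mu,\nu\in\mathcal{P}(\mathbb{R}^d)$ with a finite second-order moment. For $\pi\in\Pi(\mu,\nu)$, define $V_\pi:=\int(x-y)(x-y)^\top d\pi(x,y)$ and $\lambda_1(V_\pi)\ge\dots\ge\lambda_d(V_\pi)$ its ordered eigenvalues.

Then $F:\pi\mapsto\sum_{l=1}^k\lambda_l(V_\pi)$ is convex, and

 \[ \mathrm{SRW}_k(\mu,\nu):=\min_{\pi\in\Pi(\mu,\nu)}\sum_{l=1}^k\lambda_l(V_\pi)=\max_{\substack{0\preceq\Omega\preceq I\\ \operatorname{trace}(\Omega)=k}}\mathcal{T}_{d^2_\Omega}(\mu,\nu) \]

where $d^2_\Omega(x,y)=(x-y)^\top\Omega(x-y)$ is the squared Mahalanobis distance.

Proof.

See Theorem 1 in Paty and Cuturi (2019). Note that in this case, $\mathcal{X}=\mathbb{R}^d$ is not compact. This actually poses no problem since $F^*=+\infty$ outside a compact set, i.e. the set of metrics on which the maximization takes place is compact. Indeed, one can show that:

 \[ F^*(c)=\iota\left(\exists\,0\preceq\Omega\preceq I\ \text{with}\ \operatorname{trace}(\Omega)=k\ \text{s.t.}\ c=d^2_\Omega\right). \]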

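Let us now consider $\ell_p$-norm based examples, which will subsume quadratically-regularized ($p=2$) OT studied in Essid and Solomon (2017); Lorenz et al. (2019) and capacity-constrained ($p=+\infty$) OT proposed by Korman and McCann (2015).

For a matrix $\mathbf{w}\in\mathbb{R}^{n\times n}$ with positive entries and $p\ge1$, we will denote by $\|\cdot\|^p_{\mathbf{w},p}$ the $\mathbf{w}$-weighted (powered) $p$-norm of a matrix. We also write $1/\mathbf{w}^{q-1}$ for the matrix defined by $(1/\mathbf{w}^{q-1})_{ij}=1/\mathbf{w}_{ij}^{q-1}$. In the following, we take $p$ and $q$ to be conjugate exponents, i.e. such that $1/p+1/q=1$. {example}[$\ell_p^p$ Regularization]

 \[ \min_{\boldsymbol{\pi}\in\Pi(\boldsymbol{\mu},\boldsymbol{\nu})}\langle\mathbf{c}_0,\boldsymbol{\pi}\rangle+\epsilon\frac{1}{p}\|\boldsymbol{\pi}\|^p_{\mathbf{w},p}=\sup_{\mathbf{c}\in\mathbb{R}^{n\times n}}\mathcal{T}_{\mathbf{c}}(\boldsymbol{\mu},\boldsymbol{\nu})-\epsilon\frac{1}{q}\left\|\frac{\mathbf{c}-\mathbf{c}_0}{\epsilon}\right\|^q_{1/\mathbf{w}^{q-1},q}. \]

In particular when $p=2$ and $\mathbf{w}=\mathbf{1}$, this corresponds to quadratically-regularized OT studied in Essid and Solomon (2017); Lorenz et al. (2019). We give the details of the (straightforward) computations in the appendix.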
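The conjugate behind this identity can be sanity-checked entry by entry. The sketch below assumes that each entry of the regularizer has the form $w|x|^p/p$ (the reading consistent with the weights $1/\mathbf{w}^{q-1}$ in the display above) and compares a grid approximation of its convex conjugate with the closed form $|y|^q/(q\,w^{q-1})$. The values of `p` and `w`, and the helper names, are hypothetical.

```python
def conjugate_on_grid(f, y, grid):
    """Approximate f*(y) = sup_x (x * y - f(x)) by a maximum over a grid."""
    return max(x * y - f(x) for x in grid)


p, w = 3.0, 2.0          # hypothetical exponent and weight
q = p / (p - 1.0)        # conjugate exponent, 1/p + 1/q = 1


def powered_term(x):
    """One entry of the regularizer (assumed form): w * |x|^p / p."""
    return w * abs(x) ** p / p


def closed_form(y):
    """Candidate conjugate: |y|^q / (q * w^(q - 1))."""
    return abs(y) ** q / (q * w ** (q - 1.0))


# Grid on [-5, 5] with spacing 1e-3; the maximizers below lie well inside it.
grid = [k / 1000.0 for k in range(-5000, 5001)]
errors = [abs(conjugate_on_grid(powered_term, y, grid) - closed_form(y))
          for y in (-1.5, 0.3, 2.0)]
print(max(errors))  # small, limited only by the grid resolution
```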

{example}

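[$\ell_p$ Penalization]

 \[ \min_{\boldsymbol{\pi}\in\Pi(\boldsymbol{\mu},\boldsymbol{\nu})}\langle\mathbf{c}_0,\boldsymbol{\pi}\rangle+\epsilon\|\boldsymbol{\pi}\|_{\mathbf{w},p}=\sup_{\substack{\mathbf{c}\in\mathbb{R}^{n\times n}\\ \|\mathbf{c}-\mathbf{c}_0\|_{1/\mathbf{w},q}\le\epsilon}}\mathcal{T}_{\mathbf{c}}(\boldsymbol{\mu},\boldsymbol{\nu}). \]
Proof.

We apply Corollary 3.1 with $R$ defined as $R(\boldsymbol{\pi})=\|\boldsymbol{\pi}\|_{\mathbf{w},p}$, for which we need to compute its convex conjugate. We know that the dual of $\|\cdot\|_{\mathbf{w},p}$ is $\|\cdot\|_{1/\mathbf{w},q}$, and using classical results about convex conjugates, $R^*(\mathbf{c})=\iota(\|\mathbf{c}\|_{1/\mathbf{w},q}\le1)$. ∎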

{example}

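[$\ell_p$ Regularization]

 \[ \min_{\substack{\boldsymbol{\pi}\in\Pi(\boldsymbol{\mu},\boldsymbol{\nu})\\ \|\boldsymbol{\pi}\|_{\mathbf{w},p}\le\epsilon}}\langle\mathbf{c}_0,\boldsymbol{\pi}\rangle=\sup_{\mathbf{c}\in\mathbb{R}^{n\times n}}\mathcal{T}_{\mathbf{c}}(\boldsymbol{\mu},\boldsymbol{\nu})-\epsilon\|\mathbf{c}-\mathbf{c}_0\|_{1/\mathbf{w},q}. \]

In particular when $p=+\infty$ and $\mathbf{w}=\mathbf{1}$, this coincides with capacity-constrained OT proposed by Korman and McCann (2015).

Proof.

We apply Corollary 3.1 with $R$ defined as $R(\boldsymbol{\pi})=\iota(\|\boldsymbol{\pi}\|_{\mathbf{w},p}\le\epsilon)$, for which we need to compute its convex conjugate. We know that the dual of $\|\cdot\|_{\mathbf{w},p}$ is $\|\cdot\|_{1/\mathbf{w},q}$, and using classical results about convex conjugates, $R^*(\mathbf{c})=\epsilon\|\mathbf{c}\|_{1/\mathbf{w},q}$. ∎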

4 Properties of the Adversarial Cost

Theorem 3.1 shows that regularizing OT is equivalent to maximizing unregularized OT with respect to the ground cost. This gives access to a robustly computed cost on the ground space, which we characterize in this section. We have already seen in proposition 3.1 that we can get an optimal adversarial cost $c^\star=\nabla F(\pi^\star)$ once we have solved the primal problem (3). Under some technical assumption on $F$, we can show that there exists an optimal adversarial cost which is separable, that is of the form $c^\star=\phi\oplus\psi$ for some functions $\phi,\psi\in\mathcal{C}(\mathcal{X})$.

{definition}

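Let $F:\mathcal{M}(\mathcal{X}^2)\to\mathbb{R}\cup\{+\infty\}$. We will say that $F$ is separably $*$-increasing if for any $\phi,\psi\in\mathcal{C}(\mathcal{X})$ and any $c\in\mathcal{C}(\mathcal{X}^2)$:

 \[ \phi\oplus\psi\le c\ \Rightarrow\ F^*(\phi\oplus\psi)\le F^*(c). \tag{7} \]

This condition, albeit not always verified e.g. in the classical linear case $F(\pi)=\int c_0\,d\pi$, is verified in various cases of interest, e.g. for the entropic or $\ell_p^p$ regularizations: {example} For $\mu,\nu\in\mathcal{P}(\mathcal{X})$, $c_0\in\mathcal{C}(\mathcal{X}^2)$ and $\epsilon>0$, the entropy-regularized OT function

 \[ F:\pi\mapsto\int c_0\,d\pi+\epsilon\,\mathrm{KL}(\pi\|\mu\otimes\nu) \]

is separably $*$-increasing.

Proof.

As in the proof of example 3.3,

 \[ F^*(c)=\epsilon\int\left(\exp\!\left(\frac{c-c_0}{\epsilon}\right)-1\right)d\mu\otimes\nu \]

which clearly verifies condition (7). ∎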

{example}

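In the discrete setting $\mathcal{X}=[n]$, let $p>1$, $\boldsymbol{\mu},\boldsymbol{\nu}\in\mathbb{R}^n_+$, summing to $1$. Take $\mathbf{c}_0\in\mathbb{R}^{n\times n}$ and $\epsilon>0$. With $\varphi_p(x)=x^p/p$ if $x\ge0$ and $\varphi_p(x)=+\infty$ if $x<0$, the $\ell_p^p$-regularized OT function

 \[ F:\boldsymbol{\pi}\mapsto\langle\mathbf{c}_0,\boldsymbol{\pi}\rangle+\epsilon\sum_{ij}\mathbf{w}_{ij}\varphi_p(\boldsymbol{\pi}_{ij}) \]

is separably $*$-increasing.

Proof.

Note that minimizing $F$ over $\Pi(\boldsymbol{\mu},\boldsymbol{\nu})$ is equivalent to minimizing $\boldsymbol{\pi}\mapsto\langle\mathbf{c}_0,\boldsymbol{\pi}\rangle+\frac{\epsilon}{p}\sum_{ij}\mathbf{w}_{ij}|\boldsymbol{\pi}_{ij}|^p$, since couplings have nonnegative entries. One can show that, with $q$ such that $1/p+1/q=1$ and $x_+:=\max(x,0)$:

 \[ F^*(\mathbf{c})=\frac{\epsilon}{q}\sum_{ij}\mathbf{w}_{ij}\left(\frac{(\mathbf{c}_{ij}-(\mathbf{c}_0)_{ij})_+}{\epsilon\,\mathbf{w}_{ij}}\right)^q \]

which clearly verifies condition (7). ∎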

When $F$ is separably $*$-increasing, we can easily prove a duality theorem for problem (3): {theorem}[$\mathcal{W}_F$ duality] Let $\mu,\nu\in\mathcal{P}(\mathcal{X})$ and $F\in\mathcal{F}(\mu,\nu)$ a separably $*$-increasing function. Then:

 \[ \mathcal{W}_F(\mu,\nu)=\max_{\phi,\psi\in\mathcal{C}(\mathcal{X})}\int\phi\,d\mu+\int\psi\,d\nu-F^*(\phi\oplus\psi). \tag{8} \]
Proof.

The main idea is to use Kantorovich duality (2) in the cost-adversarial formulation (4) of $\mathcal{W}_F$. Then the $*$-increasing property appears naturally as a condition for duality to hold. See the details in the appendix. ∎

{corollary}

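If $\phi^\star,\psi^\star$ are optimal solutions in (8), the cost $c^\star:=\phi^\star\oplus\psi^\star$ is an optimal adversarial cost in (4).

Proof.

For $\phi,\psi\in\mathcal{C}(\mathcal{X})$, note that

 \[ \mathcal{T}_{\phi\oplus\psi}(\mu,\nu)=\int\phi\,d\mu+\int\psi\,d\nu. \]

Then using duality:

 \[ \begin{aligned}\mathcal{W}_F(\mu,\nu)&=\max_{\phi,\psi\in\mathcal{C}(\mathcal{X})}\int\phi\,d\mu+\int\psi\,d\nu-F^*(\phi\oplus\psi)\\&=\max_{\phi,\psi\in\mathcal{C}(\mathcal{X})}\mathcal{T}_{\phi\oplus\psi}(\mu,\nu)-F^*(\phi\oplus\psi)\\&\le\sup_{c\in\mathcal{C}(\mathcal{X}^2)}\mathcal{T}_c(\mu,\nu)-F^*(c)\\&=\mathcal{W}_F(\mu,\nu)\end{aligned} \]

where we have used Theorem 3.1 in the last line. This shows that the inequality is in fact an equality, so if $\phi^\star,\psi^\star$ are optimal dual potentials in (8), $\phi^\star\oplus\psi^\star$ is an optimal adversarial cost in (4). ∎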
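The first step of this proof, that a separable cost assigns the same transport cost to every coupling, can be observed directly in the discrete case. The sketch below uses hypothetical potentials `phi` and `psi` and three different couplings of two uniform histograms; all names and values are illustrative assumptions.

```python
def transport_cost(cost, pi):
    """Linear transport cost <cost, pi> of a coupling pi."""
    n = len(cost)
    return sum(cost[i][j] * pi[i][j] for i in range(n) for j in range(n))


# Hypothetical potentials and uniform marginals on three points.
phi = [0.3, -1.0, 2.0]
psi = [1.5, 0.0, -0.7]
mu = [1.0 / 3.0] * 3
nu = [1.0 / 3.0] * 3

# Separable cost (phi + psi)(i, j) = phi_i + psi_j.
cost = [[phi[i] + psi[j] for j in range(3)] for i in range(3)]

# Three couplings of (mu, nu): product, identity matching, cyclic shift.
product = [[mu[i] * nu[j] for j in range(3)] for i in range(3)]
identity = [[1.0 / 3.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
shift = [[1.0 / 3.0 if j == (i + 1) % 3 else 0.0 for j in range(3)]
         for i in range(3)]

expected = (sum(phi[i] * mu[i] for i in range(3))
            + sum(psi[j] * nu[j] for j in range(3)))
costs = [transport_cost(cost, pi) for pi in (product, identity, shift)]
print(expected, costs)  # every coupling yields the same value
```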

5 Adversarial Ground-Cost Sequence for Time-varying Measures

For two measures $\mu,\nu\in\mathcal{P}(\mathcal{X})$ and a separably $*$-increasing function $F\in\mathcal{F}(\mu,\nu)$, corollary 4 shows that there exists an optimal adversarial ground cost $c^\star$ that is separable. This separability, which is verified e.g. in the entropic or quadratic case, means that the OT problem for $c^\star$ is degenerate in the sense that any transport plan is optimal for the cost $c^\star$. From a metric learning point of view, $c^\star$ is not a suitable dissimilarity measure on $\mathcal{X}$. But why limit ourselves to two measures? If we observe $N$ measures $\mu_1,\dots,\mu_N\in\mathcal{P}(\mathcal{X})$, we could look for a ground cost $c\in\mathcal{C}(\mathcal{X}^2)$ that is adversarial to (part of) all the pairs:

 \[ \mathcal{W}^{\mathrm{pairs}}_{F^*}(\mu_1,\dots,\mu_N):=\sup_{c\in\mathcal{C}(\mathcal{X}^2)}\sum_{i\neq j}\mathcal{T}_c(\mu_i,\mu_j)-F^*(c) \tag{9} \]

for some convex regularization $F^*$. Although interesting from an application point of view, problem (9) does not correspond to any regularization of a transport plan. We thus study a slightly different problem.

5.1 Definition

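For a sequence of measures $\mu_1,\dots,\mu_T\in\mathcal{P}(\mathcal{X})$, $T\ge2$, e.g. when we observe time-evolving data, we can look for a sequence of adversarial costs $c_1,\dots,c_{T-1}$ which is globally adversarial: {definition} For $\eta\ge0$, and for $D:\mathcal{C}(\mathcal{X}^2)\times\mathcal{C}(\mathcal{X}^2)\to\mathbb{R}$, $F_1,\dots,F_{T-1}\in\mathcal{F}$, we define:

 \[ \mathcal{W}^{\eta}_{D,F}(\mu_{1:T}):=\sup_{c_{1:T-1}}\sum_{t=1}^{T-1}\mathcal{T}_{c_t}(\mu_t,\mu_{t+1})-\eta D(c_t,c_{t+1})-F_t^*(c_t) \tag{10} \]

with the convention $D(c_{T-1},c_T)=0$.

As we show in the two following propositions, $\mathcal{W}^{\eta}_{D,F}$ interpolates between two different behaviours: as $\eta\to0$, $\mathcal{W}^{\eta}_{D,F}$ will solve independently the successive regularized OT problems, while as $\eta\to+\infty$, $\mathcal{W}^{\eta}_{D,F}$ enforces the uniqueness of a joint adversarial cost. The limit can then be reinterpreted as a regularized multimarginal OT problem.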

{proposition}

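With the notations of definition 5.1, for $\eta=0$:

 \[ \mathcal{W}^{0}_{D,F}(\mu_{1:T})=\sum_{t=1}^{T-1}\mathcal{W}_{F_t}(\mu_t,\mu_{t+1}). \]
Proof.

Since the optimization problem is separable, it holds:

 \[ \mathcal{W}^{0}_{D,F}(\mu_{1:T})=\sup_{c_{1:T-1}}\sum_{t=1}^{T-1}\mathcal{T}_{c_t}(\mu_t,\mu_{t+1})-F_t^*(c_t)=\sum_{t=1}^{T-1}\sup_{c_t\in\mathcal{C}(\mathcal{X}^2)}\mathcal{T}_{c_t}(\mu_t,\mu_{t+1})-F_t^*(c_t) \]

which gives the result using Theorem 3.1. ∎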

{proposition}

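[Multimarginal interpretation] With the notations of definition 5.1, suppose that:

1. $D$ is continuous,

2. $D$ is a divergence, i.e. $D\ge0$ and $D(c,c')=0\Leftrightarrow c=c'$,

3. there exists a compact set $K\subset\mathcal{C}(\mathcal{X}^2)$ such that for all $t\in[T-1]$, $F_t^*=+\infty$ outside of $K$.

Then:

 \[ \lim_{\eta\to+\infty}\mathcal{W}^{\eta}_{D,F}(\mu_{1:T})=\max_{c\in\mathcal{C}(\mathcal{X}^2)}\sum_{t=1}^{T-1}\mathcal{T}_c(\mu_t,\mu_{t+1})-F_t^*(c)=\min_{\pi\in\Pi(\mu_{1:T})}\left(F_1\,\square\,\dots\,\square\,F_{T-1}\right)\!\left(\sum_{t=1}^{T-1}(\mathrm{proj}_{t,t+1})_\sharp\pi\right) \]

where $\Pi(\mu_{1:T})$ is the set of probability measures in $\mathcal{P}(\mathcal{X}^T)$ with marginals $\mu_1,\dots,\mu_T$, where for $t\in[T-1]$

 \[ \mathrm{proj}_{t,t+1}:\mathcal{X}^T\ni(x_1,\dots,x_T)\mapsto(x_t,x_{t+1})\in\mathcal{X}^2 \]

and $\square$ is the infimal convolution:

 \[ F_1\,\square\,\dots\,\square\,F_{T-1}:\mathcal{P}(\mathcal{X}^2)\to\mathbb{R}\cup\{+\infty\},\qquad\pi\mapsto\inf\left\{\sum_{t=1}^{T-1}F_t(\gamma_t)\ \middle|\ \gamma_{1:T-1}\in\mathcal{M}(\mathcal{X}^2),\ \sum_{t=1}^{T-1}\gamma_t=\pi\right\}. \]

We give a proof in appendix.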

5.2 Time-varying Subspace Robust Wasserstein

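Taking inspiration from the Subspace Robust Wasserstein (SRW) distance, we propose as a particular case of definition 5.1 a generalization of SRW to the case of a sequence of measures $\mu_1,\dots,\mu_T\in\mathcal{P}(\mathbb{R}^d)$, $T\ge2$: {definition} Let $k\in[d]$ and $\eta\ge0$. Define $\mathcal{R}_k=\{\Omega\in\mathbb{R}^{d\times d}\,:\,0\preceq\Omega\preceq I,\ \operatorname{trace}(\Omega)=k\}$. We define the time-varying SRW between $\mu_1,\dots,\mu_T$ as:

 \[ \mathrm{tSRW}_{k,\eta}(\mu_{1:T}):=\sup_{\Omega_1,\dots,\Omega_{T-1}\in\mathcal{R}_k}\sum_{t=1}^{T-1}\mathcal{T}_{d^2_{\Omega_t}}(\mu_t,\mu_{t+1})-\eta\,\mathfrak{B}^2(\Omega_t,\Omega_{t+1}) \tag{11} \]

where $\mathfrak{B}^2$ is the squared Bures metric on the SDP cone.

Note that problem (11) is convex and verifies the hypothesis of proposition 5.1. If $T=2$, the time-varying SRW is equal to the classical SRW distance: $\mathrm{tSRW}_{k,\eta}(\mu_1,\mu_2)=\mathrm{SRW}_k(\mu_1,\mu_2)$.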

6 Algorithms

From now on, we only consider the discrete case $\mathcal{X}=[n]$.

6.1 Projected (Sub)gradient Ascent Solves Nonnegative Adversarial Cost OT

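In the setting of subsection 3.2, we propose to run a projected subgradient ascent on the ground cost $\mathbf{c}\in\mathbb{R}^{n\times n}_+$ to solve problem (6). Note that in this case, $F$ is not separably $*$-increasing, so we can hope that the optimal adversarial ground cost will not be separable.

At each iteration of the ascent, we need to compute a subgradient of the objective $g:\mathbf{c}\mapsto\mathcal{T}_{\mathbf{c}}(\boldsymbol{\mu},\boldsymbol{\nu})-\epsilon R^*\!\left(\frac{\mathbf{c}-\mathbf{c}_0}{\epsilon}\right)$, given by Danskin’s theorem:

 \[ \partial g(\mathbf{c})=\left\{\boldsymbol{\pi}^\star-\nabla R^*\!\left(\frac{\mathbf{c}-\mathbf{c}_0}{\epsilon}\right)\ \middle|\ \boldsymbol{\pi}^\star\in\operatorname*{arg\,min}_{\boldsymbol{\pi}\in\Pi(\boldsymbol{\mu},\boldsymbol{\nu})}\langle\mathbf{c},\boldsymbol{\pi}\rangle\right\}. \]

Although projected subgradient ascent does converge, having access to gradients instead of subgradients, hence regularity, helps the convergence. We therefore propose to replace $\mathcal{T}_{\mathbf{c}}(\boldsymbol{\mu},\boldsymbol{\nu})$ by its entropy-regularized version

 \[ \mathcal{S}^{\eta}_{\mathbf{c}}(\boldsymbol{\mu},\boldsymbol{\nu})=\min_{\boldsymbol{\pi}\in\Pi(\boldsymbol{\mu},\boldsymbol{\nu})}\langle\mathbf{c},\boldsymbol{\pi}\rangle+\eta\sum_{ij}\boldsymbol{\pi}_{ij}(\log\boldsymbol{\pi}_{ij}-1) \]

in the definition of the objective $g$. Then $g$ is differentiable, because there exists a unique solution $\boldsymbol{\pi}^\star$ in the entropic case. This will also speed up the computations of the gradient at each iteration using the Sinkhorn algorithm. We can interpret this addition of a small entropy term in the adversarial cost formulation as a further regularization of the primal: {corollary} Using the same notations as in Theorem 3.1, for $\eta\ge0$:

 \[ \sup_{\mathbf{c}\in\mathbb{R}^{n\times n}}\mathcal{S}^{\eta}_{\mathbf{c}}(\boldsymbol{\mu},\boldsymbol{\nu})-F^*(\mathbf{c})=\min_{\boldsymbol{\pi}\in\Pi(\boldsymbol{\mu},\boldsymbol{\nu})}F(\boldsymbol{\pi})+\eta\sum_{ij}\boldsymbol{\pi}_{ij}(\log\boldsymbol{\pi}_{ij}-1). \]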
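The procedure described in this subsection can be sketched as follows, under explicit assumptions: the regularizer is the discrete entropy $R(\boldsymbol{\pi})=\sum_{ij}\boldsymbol{\pi}_{ij}(\log\boldsymbol{\pi}_{ij}-1)$, so that $\nabla R^*(\mathbf{y})_{ij}=\exp(\mathbf{y}_{ij})$; the OT term is replaced by its Sinkhorn-smoothed version; and the step size, iteration counts and data are arbitrary illustrative choices. This is not the authors' implementation, only one possible reading of it.

```python
import math


def sinkhorn_plan(c, mu, nu, eta, n_iter=300):
    """Minimizer of <c, pi> + eta * sum_ij pi_ij (log pi_ij - 1) over couplings."""
    n, m = len(mu), len(nu)
    K = [[math.exp(-c[i][j] / eta) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        u = [mu[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [nu[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def objective(c, c0, mu, nu, eps, eta):
    """Smoothed objective: Sinkhorn value minus eps * sum_ij exp((c - c0) / eps)."""
    pi = sinkhorn_plan(c, mu, nu, eta)
    value = 0.0
    for i in range(len(mu)):
        for j in range(len(nu)):
            value += pi[i][j] * (c[i][j] + eta * (math.log(pi[i][j]) - 1.0))
            value -= eps * math.exp((c[i][j] - c0[i][j]) / eps)
    return value


def projected_gradient_ascent(c0, mu, nu, eps, eta, step=0.05, n_steps=200):
    """Gradient ascent on the cost, projected onto nonnegative matrices."""
    n, m = len(mu), len(nu)
    c = [row[:] for row in c0]
    for _ in range(n_steps):
        pi = sinkhorn_plan(c, mu, nu, eta)
        # Gradient: pi*(c) - grad R*((c - c0) / eps), with R*(y) = sum exp(y_ij).
        c = [[max(0.0, c[i][j]
                  + step * (pi[i][j] - math.exp((c[i][j] - c0[i][j]) / eps)))
              for j in range(m)] for i in range(n)]
    return c


# Hypothetical data, chosen only for illustration.
mu, nu = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
c0 = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
eps, eta = 1.0, 0.5

before = objective(c0, c0, mu, nu, eps, eta)
c_adv = projected_gradient_ascent(c0, mu, nu, eps, eta)
after = objective(c_adv, c0, mu, nu, eps, eta)
print(before, after)  # the ascent should increase the objective
```

The fixed step size is a simplification; a line search or a decreasing schedule may be preferable on larger problems.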

6.2 Sinkhorn-like Algorithm for $*$-increasing $F\in\mathcal{F}$

If the function $F$ is separably $*$-increasing, we can directly write the optimality conditions for the concave dual problem (8):

 \[ \boldsymbol{\mu}=\nabla F^*(\boldsymbol{\phi}^\star\oplus\boldsymbol{\psi}^\star)\mathbf{1} \tag{12} \]
 \[ \boldsymbol{\nu}=\nabla F^*(\boldsymbol{\phi}^\star\oplus\boldsymbol{\psi}^\star)^\top\mathbf{1} \tag{13} \]

where $\mathbf{1}$ is the vector of all ones. We can then alternate between fixing $\boldsymbol{\psi}$ and solving for $\boldsymbol{\phi}$ in (12) and fixing $\boldsymbol{\phi}$ and solving for $\boldsymbol{\psi}$ in (13). In the case of entropy-regularized OT, this is equivalent to the Sinkhorn algorithm. In quadratically-regularized OT, this is equivalent to the alternate minimization proposed by Blondel et al. (2018). We give the detailed derivation of these facts in the appendix.
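For the entropic case, the two updates can be written in closed form. The short derivation below is a sketch that assumes the discrete analogue of the conjugate from example 3.3, $F^*(\mathbf{c})=\epsilon\sum_{ij}\boldsymbol{\mu}_i\boldsymbol{\nu}_j\left(\exp\!\left(\frac{\mathbf{c}_{ij}-(\mathbf{c}_0)_{ij}}{\epsilon}\right)-1\right)$; it is not a substitute for the derivation given in the appendix.

```latex
% Gradient of the (assumed) discrete entropic conjugate:
\[
  \nabla F^*(\mathbf{c})_{ij}
  = \boldsymbol{\mu}_i \boldsymbol{\nu}_j
    \exp\!\left(\frac{\mathbf{c}_{ij} - (\mathbf{c}_0)_{ij}}{\epsilon}\right).
\]
% Condition (12) at c = phi (+) psi, row i:
\[
  \boldsymbol{\mu}_i
  = \sum_{j} \boldsymbol{\mu}_i \boldsymbol{\nu}_j
    \exp\!\left(\frac{\boldsymbol{\phi}_i + \boldsymbol{\psi}_j
      - (\mathbf{c}_0)_{ij}}{\epsilon}\right)
  \;\Longleftrightarrow\;
  \boldsymbol{\phi}_i
  = -\epsilon \log \sum_{j} \boldsymbol{\nu}_j
    \exp\!\left(\frac{\boldsymbol{\psi}_j - (\mathbf{c}_0)_{ij}}{\epsilon}\right).
\]
% Condition (13), column j, by symmetry:
\[
  \boldsymbol{\psi}_j
  = -\epsilon \log \sum_{i} \boldsymbol{\mu}_i
    \exp\!\left(\frac{\boldsymbol{\phi}_i - (\mathbf{c}_0)_{ij}}{\epsilon}\right).
\]
```

Alternating these two updates is the log-domain form of the Sinkhorn iterations, with $\boldsymbol{\mu}\otimes\boldsymbol{\nu}$ as reference measure.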

6.3 Coordinate Ascent for Time-varying SRW

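Problem (11) is a globally convex problem in $(\Omega_1,\dots,\Omega_{T-1})$. We propose to run a randomized coordinate ascent on the concave objective, i.e. to select $\tau\in[T-1]$ randomly at each iteration and do a gradient step for $\Omega_\tau$. We need to compute a subgradient of the objective, seen as a function $h$ of $\Omega_\tau$, given by:

 \[ \nabla h(\Omega_\tau)=V(\boldsymbol{\pi}^\star_\tau)-\eta\,\partial_1\mathfrak{B}^2(\Omega_\tau,\Omega_{\tau+1})-\eta\,\partial_2\mathfrak{B}^2(\Omega_{\tau-1},\Omega_\tau) \tag{14} \]

where $V$ is defined in example 3.3, $\boldsymbol{\pi}^\star_\tau$ is any optimal transport plan between $\mu_\tau,\mu_{\tau+1}$ for cost $d^2_{\Omega_\tau}$, and $\partial_1\mathfrak{B}^2,\partial_2\mathfrak{B}^2$ are the gradients of the squared Bures metric with respect to the first and second arguments, computed e.g. in Muzellec and Cuturi (2018).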

7 Experiments

7.1 Linearized Entropy-Regularized OT

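We consider the entropy-regularized OT problem in the discrete setting:

 \[ \mathcal{S}_\epsilon(\boldsymbol{\mu},\boldsymbol{\nu})=\min_{\boldsymbol{\pi}\in\Pi(\boldsymbol{\mu},\boldsymbol{\nu})}\langle\mathbf{c}_0,\boldsymbol{\pi}\rangle+\epsilon R(\boldsymbol{\pi}) \]

where $R(\boldsymbol{\pi})=\sum_{ij}\boldsymbol{\pi}_{ij}(\log\boldsymbol{\pi}_{ij}-1)$ and $\epsilon>0$. Since $R$ is separable, we can constrain the associated adversarial cost to be nonnegative by linearizing the entropic regularization. By proposition 3.2, this amounts to solving

 \[ \sup_{\mathbf{c}\in\mathbb{R}^{n\times n}_+}\mathcal{T}_{\mathbf{c}}(\boldsymbol{\mu},\boldsymbol{\nu})-\epsilon\sum_{ij}\exp\!\left(\frac{\mathbf{c}_{ij}-(\mathbf{c}_0)_{ij}}{\epsilon}\right)=\min_{\boldsymbol{\pi}\in\Pi(\boldsymbol{\mu},\boldsymbol{\nu})}\langle\mathbf{c}_0,\boldsymbol{\pi}\rangle+\epsilon\sum_{ij}\hat{R}_{ij}(\boldsymbol{\pi}_{ij}) \tag{15} \]

where $\hat{R}_{ij}$ is defined as

 \[ \hat{R}_{ij}(x):=\begin{cases}x(\log x-1)&\text{if }x\ge\exp\!\left(-\frac{(\mathbf{c}_0)_{ij}}{\epsilon}\right)\\[2pt]-\frac{(\mathbf{c}_0)_{ij}}{\epsilon}\,x-\exp\!\left(-\frac{(\mathbf{c}_0)_{ij}}{\epsilon}\right)&\text{otherwise.}\end{cases} \]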
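The linearized regularizer above is simple to implement and inspect. The following sketch writes one entry of it for a single pair $(i,j)$ and checks that the two branches meet at the threshold and that the linear branch never exceeds the entropy term. The function names and the values of `c0_ij` and `eps` are illustrative assumptions.

```python
import math


def entropy_term(x):
    """R_ij(x) = x (log x - 1), extended by continuity with R_ij(0) = 0."""
    return x * (math.log(x) - 1.0) if x > 0.0 else 0.0


def linearized_entropy_term(x, c0_ij, eps):
    """Entropy term above the threshold exp(-c0_ij / eps), its tangent below."""
    threshold = math.exp(-c0_ij / eps)
    if x >= threshold:
        return entropy_term(x)
    return -(c0_ij / eps) * x - threshold


# Hypothetical values, chosen only for illustration.
c0_ij, eps = 1.0, 0.5
threshold = math.exp(-c0_ij / eps)

# The two branches agree at the threshold (continuity).
gap_at_threshold = abs(
    linearized_entropy_term(threshold, c0_ij, eps)
    - (-(c0_ij / eps) * threshold - threshold)
)

# Below the threshold, the tangent line lies under the convex entropy term.
grid = [k / 1000.0 for k in range(1, 1001)]
max_excess = max(linearized_entropy_term(x, c0_ij, eps) - entropy_term(x)
                 for x in grid)
print(gap_at_threshold, max_excess)
```

Below the threshold the function has slope $-(\mathbf{c}_0)_{ij}/\epsilon$, which is also the derivative $\log x$ of the entropy term at the threshold; this is the $C^1$ property stated in proposition 3.2.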

We first consider couples of measures in dimension , each measure being a uniform measure on samples from a Gaussian distribution with covariance matrix drawn from a Wishart distribution with degrees of freedom. For each couple, we run Algorithm 1 to solve problem (15). This gives an adversarial cost . We plot in Figure 2 the mean value of depending on , for equal to , and the value of (15). For small values of , all three values converge to the real Wasserstein distance. For large , Sinkhorn stabilizes to the MMD Genevay et al. (2016) while the robust cost goes to (for the adversarial cost goes to ).

In Figure 3, we visualize the effect of the regularization on the ground cost itself, for measures plotted in Figure 2(a). We use multidimensional scaling on the adversarial cost matrix (with distances between points from the same measures unchanged) to recover points in . For large values of , the adversarial cost goes to , which corresponds in the primal to a fully diffusive transport plan .

7.2 Learning a Metric on the Color Space

We consider 20 measures on the red-green-blue color space identified with . Each measure is a point cloud corresponding to the colors used in a painting, divided into two types: ten portraits by Modigliani () and ten by Schiele (), see the appendix for the 20 pictures. As in SRW and time-varying SRW formulations, we learn a metric parameterized by a matrix such that that best separates the Modiglianis and the Schieles:

 \optΩ∈\argmaxΩ∈R1∑i∈