Regularized Optimal Transport is Ground Cost Adversarial
Abstract
Regularizing Wasserstein distances has proved to be the key in the recent advances of optimal transport (OT) in machine learning. Most prominent is the entropic regularization of OT, which not only allows for fast computations and differentiation using the Sinkhorn algorithm, but also improves stability with respect to data and accuracy in many numerical experiments. Theoretical understanding of these benefits remains incomplete, although recent statistical works have shown that entropy-regularized OT mitigates classical OT's curse of dimensionality. In this paper, we adopt a more geometrical point of view, and show using Fenchel duality that any convex regularization of OT can be interpreted as ground cost adversarial. This incidentally gives access to a robust dissimilarity measure on the ground space, which can in turn be used in other applications. We propose algorithms to compute this robust cost, and illustrate the interest of this approach empirically.
1 Introduction
Optimal transport (OT) has become a generic tool in machine learning, with applications in various domains such as supervised machine learning Frogner et al. (2015); Abadeh et al. (2015); Courty et al. (2016), graphics Solomon et al. (2015); Bonneel et al. (2016), imaging Rabin and Papadakis (2015); Cuturi and Peyré (2016), generative models Arjovsky et al. (2017); Salimans et al. (2018), biology Hashimoto et al. (2016); Schiebinger et al. (2019) or NLP Grave et al. (2019); Alaux et al. (2019). The key to using OT in these applications lies in the different forms of regularization of the original OT problem studied in the renowned books of Villani (2009); Santambrogio (2015). Adding a small convex regularization to the classical linear cost not only helps on the algorithmic side, by convexifying the objective and allowing for faster solvers, but also adds stability with respect to the input measures, improving numerical results.
Regularizing OT
Although entropy-regularized OT appears as the most studied regularization of OT, due to its algorithmic advantages Cuturi (2013), several other convex regularizations of the transport plan have been proposed in the community: quadratically-regularized OT Essid and Solomon (2017), OT with capacity constraints Korman and McCann (2015), Group-Lasso regularized OT Courty et al. (2016), OT with Laplacian regularization Flamary et al. (2014), among others. On the other hand, regularizing the dual Kantorovich problem was shown in Liero et al. (2018) to be equivalent to unbalanced OT, that is optimal transport with relaxed marginal constraints.
Understanding why regularization helps
The question of understanding why regularizing OT proves critical has triggered several approaches. One particularly active is the statistical study of entropic regularization: although classical OT suffers from the curse of dimensionality, as its empirical version converges at a rate of order $n^{-1/d}$ Dudley (1969); Fournier and Guillin (2015); Weed and Bach (2019), Sinkhorn divergences have a sample complexity of order $1/\sqrt{n}$ Genevay et al. (2018); Mena and Niles-Weed (2019). Entropic OT was also shown to perform maximum likelihood estimation in the Gaussian deconvolution model Rigollet and Weed (2018). Taking another approach, Dessein et al. (2018); Blondel et al. (2018) have considered general classes of convex regularizations and characterized them from a more geometrical perspective. Recently, several papers Flamary et al. (2018); Deshpande et al. (2019); Kolouri et al. (2019); Niles-Weed and Rigollet (2019); Paty and Cuturi (2019) proposed to maximize OT with respect to the ground cost, which can in turn be interpreted in light of ground metric learning Cuturi and Avis (2014). Continuing along these lines, we make a connection between regularizing and maximizing OT.
Contributions
Our main goal is to provide a novel interpretation of regularized optimal transport in terms of ground cost robustness: regularizing OT amounts to maximizing unregularized OT with respect to the ground cost. Our contributions are:

We show that any convex regularization of the transport plan corresponds to ground-cost robustness (section 3);

We reinterpret classical regularizations of OT in the ground-cost adversarial setting (section 3.3);

We prove, under some technical assumption, a duality theorem for regularized OT, which we use to show that under the same assumption, there exists an optimal adversarial ground cost that is separable (section 4);

We propose to extend the notion of ground-cost robustness to more than two measures, and focus on the case where the measures are time-varying (section 5).
2 Background on Optimal Transport and Notations
Let $\mathcal{X}$ be a compact Hausdorff space, and define $\mathcal{P}(\mathcal{X})$ the set of Borel probability measures over $\mathcal{X}$. We write $\mathcal{C}(\mathcal{X})$ for the set of continuous functions from $\mathcal{X}$ to $\mathbb{R}$, endowed with the supremum norm $\|\cdot\|_\infty$. For $f, g \in \mathcal{C}(\mathcal{X})$, we write $f \oplus g$ for the function $(x, y) \mapsto f(x) + g(y)$.
For $n \in \mathbb{N}$, we write $[\![n]\!] = \{1, \ldots, n\}$. All vectors will be denoted with bold symbols. For a Boolean assertion $A$, we write $\iota_{\{A\}}$ for its indicator function, equal to $0$ if $A$ is true and $+\infty$ otherwise.
Kantorovich Formulation of OT
For $\mu, \nu \in \mathcal{P}(\mathcal{X})$, we write $\Pi(\mu, \nu)$ for the set of couplings
$$\Pi(\mu, \nu) = \left\{ \pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) \,:\, \forall A, B \subset \mathcal{X} \text{ Borel},\ \pi(A \times \mathcal{X}) = \mu(A),\ \pi(\mathcal{X} \times B) = \nu(B) \right\}.$$
For a real-valued continuous function $c \in \mathcal{C}(\mathcal{X} \times \mathcal{X})$, the optimal transport cost between $\mu$ and $\nu$ is defined as
$$\mathrm{OT}_c(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int c \, d\pi. \qquad (1)$$
Since $c$ is continuous and $\Pi(\mu, \nu)$ is compact, the infimum in (1) is attained, see Theorem 1.4 in Santambrogio (2015). Problem (1) admits the following dual formulation, see Proposition 1.11 and Theorem 1.39 in Santambrogio (2015):
$$\mathrm{OT}_c(\mu, \nu) = \sup_{\substack{f, g \in \mathcal{C}(\mathcal{X}) \\ f \oplus g \le c}} \int f \, d\mu + \int g \, d\nu. \qquad (2)$$
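In the discrete case, problem (1) becomes a finite linear program and can be solved directly. A minimal sketch, assuming NumPy and SciPy are available; the marginals `a`, `b` and cost `C` are illustrative:

```python
# A minimal sketch: discrete OT as the linear program (1),
# solved with scipy.optimize.linprog. Marginals a, b and cost C are illustrative.
import numpy as np
from scipy.optimize import linprog

def ot_lp(a, b, C):
    """Solve min_{P in Pi(a,b)} <C, P> as a linear program."""
    n, m = len(a), len(b)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums: P @ 1 = a
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # column sums: P.T @ 1 = b
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, m)

a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
cost, P = ot_lp(a, b, C)   # the identity coupling is optimal here
```

The LP has one redundant equality constraint (both marginals sum to one), which modern solvers such as HiGHS handle without issue.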
Space of Measures
Since $\mathcal{X}$ is compact, the dual space of $\mathcal{C}(\mathcal{X})$ is the set $\mathcal{M}(\mathcal{X})$ of Borel finite signed measures over $\mathcal{X}$. For $F : \mathcal{M}(\mathcal{X}) \to \mathbb{R} \cup \{+\infty\}$, we recall that $F$ is Fréchet-differentiable at $\pi \in \mathcal{M}(\mathcal{X})$ if there exists $\nabla F(\pi) \in \mathcal{C}(\mathcal{X})$ such that for any $\rho \in \mathcal{M}(\mathcal{X})$,
$$F(\pi + \rho) = F(\pi) + \int \nabla F(\pi) \, d\rho + o(\|\rho\|_{TV}) \quad \text{as } \|\rho\|_{TV} \to 0.$$
Similarly, $G : \mathcal{C}(\mathcal{X}) \to \mathbb{R} \cup \{+\infty\}$ is Fréchet-differentiable at $c \in \mathcal{C}(\mathcal{X})$ if there exists $\nabla G(c) \in \mathcal{M}(\mathcal{X})$ such that for any $h \in \mathcal{C}(\mathcal{X})$,
$$G(c + h) = G(c) + \int h \, d\nabla G(c) + o(\|h\|_{\infty}) \quad \text{as } \|h\|_{\infty} \to 0.$$
Legendre–Fenchel Transformation
For any functional $F : \mathcal{M}(\mathcal{X}) \to \mathbb{R} \cup \{+\infty\}$, we can define its convex conjugate $F^* : \mathcal{C}(\mathcal{X}) \to \mathbb{R} \cup \{+\infty\}$ and biconjugate $F^{**} : \mathcal{M}(\mathcal{X}) \to \mathbb{R} \cup \{+\infty\}$ as
$$F^*(c) = \sup_{\pi \in \mathcal{M}(\mathcal{X})} \int c \, d\pi - F(\pi), \qquad F^{**}(\pi) = \sup_{c \in \mathcal{C}(\mathcal{X})} \int c \, d\pi - F^*(c).$$
$F^*$ is always lower semicontinuous (lsc) and convex as the supremum of continuous linear functions.
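The conjugate can be illustrated numerically in one dimension: approximating the supremum over a grid recovers the textbook identity that $x \mapsto x^2/2$ is its own conjugate. A small sketch, with an illustrative grid:

```python
# Numerical illustration of the convex conjugate f*(y) = sup_x (x*y - f(x))
# in one dimension: for f(x) = x^2 / 2 the conjugate is f*(y) = y^2 / 2.
# The grid below is an illustrative stand-in for the supremum over x.
import numpy as np

x = np.linspace(-10.0, 10.0, 20001)
f = 0.5 * x**2

def conjugate(y):
    # discretized sup_x (x * y - f(x))
    return np.max(x * y - f)

ys = np.linspace(-3.0, 3.0, 13)
approx = np.array([conjugate(y) for y in ys])
exact = 0.5 * ys**2
```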
Specific notations
For $F : \mathcal{M}(\mathcal{X}) \to \mathbb{R} \cup \{+\infty\}$, we write $\mathrm{dom}\, F = \{\pi : F(\pi) < +\infty\}$ for its domain and will say $F$ is proper if $\mathrm{dom}\, F \neq \emptyset$.
We will denote by $\Gamma_0$ the set of proper lsc convex functions $F : \mathcal{M}(\mathcal{X} \times \mathcal{X}) \to \mathbb{R} \cup \{+\infty\}$, and for $\mu, \nu \in \mathcal{P}(\mathcal{X})$, we define $\Gamma(\mu, \nu)$ the set of lsc convex functions that are proper on $\Pi(\mu, \nu)$:
$$\Gamma(\mu, \nu) = \left\{ F : \mathcal{M}(\mathcal{X} \times \mathcal{X}) \to \mathbb{R} \cup \{+\infty\} \text{ lsc convex} \,:\, \mathrm{dom}\, F \cap \Pi(\mu, \nu) \neq \emptyset \right\}.$$
3 Ground Cost Adversarial Optimal Transport
3.1 Definition
Instead of considering the classical linear formulation of optimal transport (1), we will consider the following more general nonlinear formulation:
Let $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and $F \in \Gamma(\mu, \nu)$. We define:
$$\mathrm{OT}_F(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} F(\pi). \qquad (3)$$
The infimum in (3) is attained. Moreover, if $F \in \Gamma(\mu, \nu)$, $\mathrm{OT}_F(\mu, \nu) < +\infty$.
Proof.
We can apply Weierstrass's theorem since $\Pi(\mu, \nu)$ is compact and $F$ is lsc by definition.
Since $F \in \Gamma(\mu, \nu)$, there exists $\pi \in \Pi(\mu, \nu)$ such that $F(\pi) < +\infty$, so $\mathrm{OT}_F(\mu, \nu) < +\infty$.
∎
The main result of this paper is the following interpretation of problem (3) as a ground-cost adversarial OT problem: {theorem} For $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and $F \in \Gamma(\mu, \nu)$, minimizing $F$ over $\Pi(\mu, \nu)$ is equivalent to the following concave problem:
$$\mathrm{OT}_F(\mu, \nu) = \sup_{c \in \mathcal{C}(\mathcal{X} \times \mathcal{X})} \mathrm{OT}_c(\mu, \nu) - F^*(c). \qquad (4)$$
Proof.
Since $F$ is proper, lsc and convex, the Fenchel–Moreau theorem ensures that it is equal to its convex biconjugate $F^{**}$, so:
$$\mathrm{OT}_F(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \sup_{c \in \mathcal{C}(\mathcal{X} \times \mathcal{X})} \int c \, d\pi - F^*(c).$$
Define the objective $\Phi(\pi, c) = \int c \, d\pi - F^*(c)$. Since $F^*$ is lsc as the convex conjugate of $F$, for any $\pi$, $c \mapsto \Phi(\pi, c)$ is usc. It is also concave as the sum of concave functions. Likewise, for any $c$, $\pi \mapsto \Phi(\pi, c)$ is continuous and convex (in fact linear). Since $\mathcal{C}(\mathcal{X} \times \mathcal{X})$ and $\Pi(\mu, \nu)$ are convex, and $\Pi(\mu, \nu)$ is compact, we can use Sion's minimax theorem to swap the min and the sup:
$$\mathrm{OT}_F(\mu, \nu) = \sup_{c \in \mathcal{C}(\mathcal{X} \times \mathcal{X})} \inf_{\pi \in \Pi(\mu, \nu)} \int c \, d\pi - F^*(c) = \sup_{c \in \mathcal{C}(\mathcal{X} \times \mathcal{X})} \mathrm{OT}_c(\mu, \nu) - F^*(c).$$
∎
Note that the inequality
$$\mathrm{OT}_F(\mu, \nu) \ge \sup_{c \in \mathcal{C}(\mathcal{X} \times \mathcal{X})} \mathrm{OT}_c(\mu, \nu) - F^*(c)$$
is in fact verified for any proper functional $F$ since $F \ge F^{**}$ is always verified.
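This weak-duality inequality can be checked numerically on a tiny discrete entropic example, where the couplings of two 2-point histograms form a one-parameter family and the conjugate of the entropic term over probability measures has the closed Donsker–Varadhan form. A hedged sketch; all names and parameter values are illustrative:

```python
# Sanity check of OT_F >= OT_c - F*(c) on a tiny discrete entropic example.
# F(P) = <C0, P> + eps * KL(P || a b^T); its conjugate over probability
# measures is the Donsker-Varadhan log-partition eps*log E_Q[exp((c-C0)/eps)].
import numpy as np

rng = np.random.default_rng(0)
a = np.array([0.5, 0.5]); b = np.array([0.5, 0.5])
C0 = np.array([[0.0, 1.0], [1.0, 0.0]])
Q = np.outer(a, b)
eps = 0.1

def coupling(t):  # couplings of a and b form a one-parameter family
    return np.array([[t, 0.5 - t], [0.5 - t, t]])

def kl(P, Q):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def F(t):
    P = coupling(t)
    return float(np.sum(C0 * P)) + eps * kl(P, Q)

ot_F = min(F(t) for t in np.linspace(1e-6, 0.5 - 1e-6, 2001))

def F_star(c):  # closed-form conjugate of the entropic F
    return eps * np.log(np.sum(Q * np.exp((c - C0) / eps)))

def ot_c(c):    # linear in t, so the minimum sits at an endpoint
    return min(float(np.sum(c * coupling(t))) for t in (0.0, 0.5))
```

Sampling random costs `c` and checking `ot_c(c) - F_star(c) <= ot_F` then verifies the inequality empirically.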
The supremum in equation (4) is not necessarily attained. Under some regularity assumption on $F$, we show that the supremum is attained and relate the optimal couplings and the optimal ground costs: {proposition} Let $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and $F \in \Gamma(\mu, \nu)$. Suppose that $F$ is Fréchet-differentiable on $\mathrm{dom}\, F$. Then the supremum in (4) is attained at $c^* = \nabla F(\pi^*)$, where $\pi^*$ is any minimizer of (3). Conversely, suppose $F^*$ is Fréchet-differentiable everywhere. If $c^*$ is the unique maximizer in (4), then $\nabla F^*(c^*)$ is a minimizer of (3). In section 4, we will further characterize the optimal adversarial cost for a class of functions $F$. See a proof in appendix.
One interesting particular case of Theorem 3.1 is when the convex cost $F$ is a convex regularization of the classical linear optimal transport: {corollary} Let $c_0 \in \mathcal{C}(\mathcal{X} \times \mathcal{X})$ and $\epsilon > 0$. Let $R \in \Gamma(\mu, \nu)$ and $F = \pi \mapsto \int c_0 \, d\pi + \epsilon R(\pi)$. Then:
$$\mathrm{OT}_F(\mu, \nu) = \sup_{c \in \mathcal{C}(\mathcal{X} \times \mathcal{X})} \mathrm{OT}_c(\mu, \nu) - \epsilon R^*\!\left(\frac{c - c_0}{\epsilon}\right). \qquad (5)$$
Proof.
3.2 Discrete Separable Case
In this subsection, we will focus on the discrete case where the space $\mathcal{X} = \{x_1, \ldots, x_n\}$ for some $n \in \mathbb{N}$. A probability measure is then a histogram of size $n$ that we will represent by a vector $\mathbf{a} \in \mathbb{R}_+^n$ such that $\sum_i a_i = 1$. Cost functions and transport plans are now matrices $C, P \in \mathbb{R}^{n \times n}$.
We focus on regularization functions $R$ that are separable, i.e. of the form
$$R(P) = \sum_{i, j} r(P_{ij})$$
for some differentiable convex proper lsc $r : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$.
In applications, it may be natural to require that the ground cost has nonnegative entries. Adding this constraint on the adversarial cost corresponds to linearizing the regularization “at short range”, i.e. for small transport values:
Let . For , it holds:
(6) 
where is the continuous convex function defined as
Moreover, if $r$ is of class $C^1$, then so is the linearized regularizer. We give a proof in the appendix.
3.3 Examples
As presented in the introduction, several convex regularizations have been proposed. We give the ground-cost adversarial counterpart for some of them: two examples in the continuous setting, and three norm-based regularizations in the discrete case.
[Entropic Regularization] Let $\epsilon > 0$. For $\pi, \xi \in \mathcal{P}(\mathcal{X} \times \mathcal{X})$, we define the relative entropy of $\pi$ with respect to $\xi$ as $H(\pi \,|\, \xi) = \int \log\!\left(\frac{d\pi}{d\xi}\right) d\pi$ if $\pi \ll \xi$, and $+\infty$ otherwise. Then for $c_0 \in \mathcal{C}(\mathcal{X} \times \mathcal{X})$ and $F = \pi \mapsto \int c_0 \, d\pi + \epsilon H(\pi \,|\, \mu \otimes \nu)$, it holds:
$$\mathrm{OT}_F(\mu, \nu) = \sup_{c \in \mathcal{C}(\mathcal{X} \times \mathcal{X})} \mathrm{OT}_c(\mu, \nu) - \epsilon \log \int e^{(c - c_0)/\epsilon} \, d\mu \otimes \nu.$$
Proof.
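In the discrete setting, the primal of the entropic example above is the problem solved in practice by the Sinkhorn algorithm. A minimal sketch; marginals, cost and regularization strength are illustrative:

```python
# Minimal Sinkhorn iteration for entropy-regularized discrete OT,
# the primal counterpart of the adversarial formulation above.
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=500):
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)         # scale to match column marginals
        u = a / (K @ v)           # scale to match row marginals
    return u[:, None] * K * v[None, :]

a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
P = sinkhorn(a, b, C)             # entropic transport plan
```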
Another case of interest is the so-called Subspace Robust Wasserstein distance recently proposed by Paty and Cuturi (2019). Here, the set of adversarial metrics is parameterized by a finite-dimensional parameter, which allows to recover an adversarial metric defined on the whole space even when the measures are finitely supported. {example}[Subspace Robust Wasserstein] Let , and with a finite second-order moment. For , define and its ordered eigenvalues.
Then is convex, and
where is the squared Mahalanobis distance.
Proof.
See Theorem 1 in Paty and Cuturi (2019). Note that in this case, the ground space is not compact. This actually poses no problem since the conjugate is $+\infty$ outside a compact set, i.e. the set of metrics over which the maximization takes place is compact. Indeed, one can show that:
∎
Let us now consider norm-based examples, which will subsume quadratically-regularized ($\ell^2$) OT studied in Essid and Solomon (2017); Lorenz et al. (2019) and capacity-constrained ($\ell^\infty$) OT proposed by Korman and McCann (2015).
For a matrix with and , we will denote by the weighted (powered) norm of . We also write for the matrix defined by . In the following, we take $p, q \in [1, +\infty]$ such that $\frac{1}{p} + \frac{1}{q} = 1$. {example}[ Regularization]
In particular when and , this corresponds to quadratically-regularized OT studied in Essid and Solomon (2017); Lorenz et al. (2019). We give the details of the (straightforward) computations in the appendix.
[ Penalization]
Proof.
We apply Corollary 3.1 with $R$ defined as the weighted $\ell^p$ norm, for which we need to compute its convex conjugate. We know that the dual norm of the $\ell^p$ norm is the $\ell^q$ norm, and using classical results about convex conjugates, we obtain the stated expression. ∎
[ Regularization]
In particular when and , this coincides with capacity-constrained OT proposed by Korman and McCann (2015).
Proof.
We apply Corollary 3.1 with $R$ defined as the weighted $\ell^\infty$ norm, for which we need to compute its convex conjugate. We know that the dual norm of the $\ell^\infty$ norm is the $\ell^1$ norm, and using classical results about convex conjugates, we obtain the stated expression. ∎
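Both norm-based computations rest on the Hölder conjugacy between $\ell^p$ and $\ell^q$ norms. In one dimension this can be checked numerically, approximating the supremum defining the conjugate of $|t|^p/p$ on a grid (exponent and grid are illustrative):

```python
# One-dimensional numerical check of Holder conjugacy: the convex conjugate
# of t -> |t|^p / p is s -> |s|^q / q, with 1/p + 1/q = 1.
import numpy as np

p = 3.0
q = p / (p - 1.0)                       # Holder conjugate exponent
t = np.linspace(-50.0, 50.0, 200001)    # grid standing in for the sup over t
f = np.abs(t)**p / p

def conj(s):
    # discretized sup_t (s * t - f(t))
    return np.max(s * t - f)

ss = np.linspace(-2.0, 2.0, 9)
approx = np.array([conj(s) for s in ss])
exact = np.abs(ss)**q / q
```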
4 Properties of the Adversarial Cost
Theorem 3.1 shows that regularizing OT is equivalent to maximizing unregularized OT with respect to the ground cost. This gives access to a robustly computed cost on the ground space, which we characterize in this section. We have already seen in proposition 3.1 that we can get the optimal adversarial cost as $\nabla F(\pi^*)$ if we have solved the primal problem (3). Under some technical assumption on $F$, we can show that there exists an optimal adversarial cost which is separable, that is of the form $c(x, y) = f(x) + g(y)$ for some functions $f, g \in \mathcal{C}(\mathcal{X})$.
Let . We will say that is separably increasing if for any and any :
(7) 
This definition, albeit not always verified, e.g. in the classical linear case, is verified in various cases of interest, e.g. for the entropic or quadratic regularizations: {example} For $\epsilon > 0$ and $\mu, \nu \in \mathcal{P}(\mathcal{X})$, the entropy-regularized OT function
is separably increasing.
In the discrete setting , let , , summing to . Take and . With if and if , the regularized OT function
is separably increasing.
Proof.
Note that minimizing over is equivalent to minimizing . One can show that, with such that and :
which clearly verifies condition (7). ∎
When $F$ is separably increasing, we can easily prove a duality theorem for problem (3): {theorem}[duality] Let $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and $F \in \Gamma(\mu, \nu)$ a separably increasing function. Then:
$$\mathrm{OT}_F(\mu, \nu) = \sup_{f, g \in \mathcal{C}(\mathcal{X})} \int f \, d\mu + \int g \, d\nu - F^*(f \oplus g). \qquad (8)$$
Proof.
The main idea is to use Kantorovich duality (2) in the cost-adversarial formulation of Theorem 3.1. The separably increasing property then appears naturally as a condition for duality to hold. See the details in the appendix. ∎
5 Adversarial Ground-Cost Sequence for Time-varying Measures
For two measures $\mu, \nu$ and a separably increasing function $F$, corollary 4 shows that there exists an optimal adversarial ground cost $f \oplus g$ that is separable. This separability, which is verified e.g. in the entropic or quadratic case, means that the OT problem for such a cost is degenerate: since $\int (f \oplus g) \, d\pi = \int f \, d\mu + \int g \, d\nu$ does not depend on the plan, any transport plan is optimal. From a metric learning point of view, a separable cost is not a suitable dissimilarity measure on the ground space. But why limit ourselves to two measures? If we observe measures $\mu_1, \ldots, \mu_N$, we could look for a ground cost that is adversarial to (part of) all the pairs:
(9) 
for some convex regularization . Although interesting from an application point of view, problem (9) does not correspond to any regularization of a transport plan. We thus study a slightly different problem.
5.1 Definition
For a sequence of measures $\mu_1, \ldots, \mu_N$, e.g. when we observe time-evolving data, we can look for a sequence of adversarial costs which is globally adversarial: {definition} For , and for , , we define:
(10)  
with the convention
As we show in the two following propositions, the problem interpolates between two different behaviours: as the coupling parameter goes to $0$, it solves the successive regularized OT problems independently, while as it goes to $+\infty$, it enforces the uniqueness of a joint adversarial cost, and can then be reinterpreted as a regularized multimarginal OT problem.
With the notations of definition 5.1, for :
Proof.
[Multimarginal interpretation] With the notations of definition 5.1, suppose that:

is continuous,

is a divergence, i.e. it is nonnegative and vanishes if and only if its two arguments are equal,

there exists a compact set such that for all , outside of .
Then:
where is the set of probability measures in with marginals , where for
and is the infimal convolution:
We give a proof in appendix.
5.2 Time-varying Subspace Robust Wasserstein
Taking inspiration from the Subspace Robust Wasserstein (SRW) distance, we propose as a particular case of definition 5.1 a generalization of SRW to the case of a sequence of measures $\mu_1, \ldots, \mu_N$: {definition} Let and . Define . We define the time-varying SRW between $\mu_1, \ldots, \mu_N$ as:
(11)  
where is the squared Bures metric on the SDP cone.
6 Algorithms
From now on, we only consider the discrete case where $\mathcal{X} = \{x_1, \ldots, x_n\}$.
6.1 Projected (Sub)gradient Ascent Solves Nonnegative Adversarial Cost OT
In the setting of subsection 3.2, we propose to run a projected subgradient ascent on the ground cost to solve problem (3.2). Note that in this case, is not separably increasing, so we can hope that the optimal adversarial ground cost will not be separable.
At each iteration of the ascent, we need to compute a subgradient of the objective, given by Danskin's theorem:
Although projected subgradient ascent does converge, having access to gradients instead of subgradients, hence to some regularity, helps convergence. We therefore propose to replace $\mathrm{OT}_C$ by its entropy-regularized version in the definition of the objective. The objective then becomes differentiable, because the optimal plan is unique in the entropic case. This will also speed up the computation of the gradient at each iteration using the Sinkhorn algorithm. We can interpret this addition of a small entropy term in the adversarial cost formulation as a further regularization of the primal: {corollary} Using the same notations as in Theorem 3.1, for :
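The resulting scheme can be sketched as follows, taking for concreteness a quadratic separable regularization $r(t) = t^2/2$ (so the concave objective reads $C \mapsto \mathrm{OT}_C(\mathbf{a}, \mathbf{b}) - \|C - C_0\|^2 / (2\epsilon)$) and an entropically smoothed inner solver whose optimal plan serves as the Danskin gradient. All names, step sizes and data are illustrative:

```python
# Projected gradient ascent on the adversarial cost, with a quadratic
# regularizer (an illustrative choice) and an entropically smoothed inner OT
# solver whose unique optimal plan is the gradient of C -> OT_C(a, b).
import numpy as np

def sinkhorn_plan(a, b, C, eta=0.05, n_iter=300):
    K = np.exp(-C / eta)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def adversarial_cost(a, b, C0, eps=0.5, lr=0.2, steps=200):
    C = C0.copy()
    for _ in range(steps):
        P = sinkhorn_plan(a, b, C)         # Danskin gradient of C -> OT_C(a, b)
        C = C + lr * (P - (C - C0) / eps)  # ascent step on the full objective
        C = np.maximum(C, 0.0)             # projection onto nonnegative costs
    return C

a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
C0 = np.array([[0.0, 1.0], [1.0, 0.0]])
C_adv = adversarial_cost(a, b, C0)
```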
6.2 Sinkhorn-like Algorithm for Separably Increasing $F$
If the function $F$ is separably increasing, we can directly write the optimality conditions for the concave dual problem (8):
(12)  
(13) 
where $\mathbf{1}$ is the vector of all ones. We can then alternate between fixing $g$ and solving for $f$ in (12), and fixing $f$ and solving for $g$ in (13). In the case of entropy-regularized OT, this is equivalent to the Sinkhorn algorithm. In quadratically-regularized OT, this is equivalent to the alternating minimization proposed by Blondel et al. (2018). We give the detailed derivation of these facts in the appendix.
6.3 Coordinate Ascent for Timevarying SRW
Problem (11) is a globally concave maximization problem. We propose to run a randomized coordinate ascent on the objective, i.e. to select a coordinate at random at each iteration and perform a gradient step on it. We need to compute a subgradient of the objective, given by:
(14)  
where is defined in example 3.3, is any optimal transport plan between for cost , and are the gradients of the squared Bures metric with respect to the first and second arguments, computed e.g. in Muzellec and Cuturi (2018).
7 Experiments
7.1 Linearized Entropy-Regularized OT
We consider the entropy-regularized OT problem in the discrete setting:
where and . Since the entropic regularization is separable, we can constrain the associated adversarial cost to be nonnegative by linearizing it. By proposition 3.2, this amounts to solving
(15)  
where is defined as
We first consider couples of measures in dimension , each measure being a uniform measure on samples from a Gaussian distribution with covariance matrix drawn from a Wishart distribution with degrees of freedom. For each couple, we run Algorithm 1 to solve problem (15). This gives an adversarial cost . We plot in Figure 2 the mean value of depending on , for equal to , and the value of (15). For small values of , all three values converge to the real Wasserstein distance. For large , Sinkhorn stabilizes to the MMD Genevay et al. (2016) while the robust cost goes to (for the adversarial cost goes to ).
In Figure 3, we visualize the effect of the regularization on the ground cost itself, for the measures plotted in Figure 2(a). We use multidimensional scaling on the adversarial cost matrix (with distances between points from the same measure unchanged) to recover embedded points. For large values of the regularization strength, the adversarial cost degenerates, which corresponds in the primal to a fully diffusive transport plan.
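The embedding step can be sketched with classical (eigendecomposition-based) multidimensional scaling; the points below are an illustrative stand-in for the adversarial cost matrix of the experiment:

```python
# Classical MDS: embed n points in R^k from an n x n matrix of squared
# dissimilarities, via double centering and an eigendecomposition.
import numpy as np

def classical_mds(D2, k=2):
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J                 # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]         # keep the top-k eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

pts = np.array([0.0, 1.0, 3.0])           # three points on a line
D2 = (pts[:, None] - pts[None, :])**2     # squared pairwise distances
emb = classical_mds(D2, k=1).ravel()      # a 1-D embedding recovers them
```

On exact Euclidean squared distances, classical MDS recovers the configuration up to isometry.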
7.2 Learning a Metric on the Color Space
We consider 20 measures on the red-green-blue color space identified with $[0, 1]^3$. Each measure is a point cloud corresponding to the colors used in a painting, divided into two types: ten portraits by Modigliani and ten by Schiele, see the appendix for the 20 pictures. As in the SRW and time-varying SRW formulations, we learn a metric parameterized by a matrix, chosen to best separate the Modiglianis from the Schieles: