Subspace Robust Wasserstein Distances

# Subspace Robust Wasserstein Distances

François-Pierre Paty    Marco Cuturi
###### Abstract

Making sense of Wasserstein distances between discrete measures in high-dimensional settings remains a challenge. Recent work has advocated a two-step approach to improve robustness and facilitate the computation of optimal transport, using for instance projections on random real lines, or a preliminary quantization of the measures to reduce the size of their support. We propose in this work a “max-min” robust variant of the Wasserstein distance by considering the maximal possible distance that can be realized between two measures, assuming they can be projected orthogonally on a lower -dimensional subspace. Alternatively, we show that the corresponding “min-max” OT problem has a tight convex relaxation which can be cast as that of finding an optimal transport plan with a low transportation cost, where the cost is alternatively defined as the sum of the largest eigenvalues of the second order moment matrix of the displacements (or matchings) corresponding to that plan (the usual OT definition only considers the trace of that matrix). We show that both quantities inherit several favorable properties from the OT geometry. We propose two algorithms to compute the latter formulation using entropic regularization, and illustrate the interest of this approach empirically.

Machine Learning, Optimal Transport, Wasserstein, ICML

## 1 Introduction

The optimal transport (OT) toolbox (Villani, 2009) is gaining popularity in machine learning, with several applications to data science outlined in the recent review paper (Peyré & Cuturi, 2019). When using OT on high-dimensional data, practitioners are often confronted to the intrinsic instability of OT with respect to input measures. A well known result states for instance that the sample complexity of Wasserstein distances can grow exponentially in dimension (Dudley, 1969; Fournier & Guillin, 2015), which means that an irrealistic amount of samples from two continuous measures is needed to approximate faithfully the true distance between them. This result can be mitigated when data lives on lower dimensional manifolds as shown in (Weed & Bach, 2017), but sample complexity bounds remain pessimistic even in that case. From a computational point of view, that problem can be interpreted as that of a lack of robustness and instability of OT metrics with respect to their inputs. This fact was already a common concern of the community when these tools were first adopted, as can be seen in the use of costs (Ling & Okada, 2007) or in the common practice of thresholding cost matrices (Pele & Werman, 2009).

#### Regularization

The idea to trade off a little optimality in exchange for more regularity is by now considered a crucial ingredient to make OT work in data sciences. A line of work initiated in (Cuturi, 2013) advocates adding an entropic penalty to the original OT problem, which results in faster and differentiable quantities, as well as improved sample complexity bounds (Genevay et al., 2019). Following this, other regularizations (Dessein et al., 2018), notably quadratic (Blondel et al., 2018), have also been investigated. Sticking to an entropic regularization, one can also interpret the recent proposal by Altschuler et al. (2018b) to approximate Gaussian kernel matrices appearing in the regularized OT problem with Nyström-type factorizations (or exact features using a Taylor expansion (Cotter et al., 2011) as in (Altschuler et al., 2018a)), as robust approaches that are willing to tradeoff yet a little more cost optimality in exchange for faster Sinkhorn iterations. In a different line of work, quantizing first the measures to be compared before carrying out OT on the resulting distriutions of centroids is a fruitful alternative (Canas & Rosasco, 2012) which has been recently revisited in (Forrow et al., 2019). Another approach exploits the fact that the OT problem between two distributions on the real line boils down to the direct comparison of their generalized quantile functions (Santambrogio, 2015, §2). Computing quantile functions only requires sorting values, with a mere log-linear complexity. The sliced approximation of OT (Rabin et al., 2011) consists in projecting two probability distributions on a given line, compute the optimal transport cost between these projected values, and then repeat this procedure to average these distances over several random lines. This approach can be used to define kernels (Kolouri et al., 2016), compute barycenters (Bonneel et al., 2015) but also to train generative models (Kolouri et al., 2018; Deshpande et al., 2018). Beyond its practical applicability, this approach is based on a perhaps surprising point-of-view: OT on the real line may be sufficient to extract geometric information from two high-dimensional distributions. Our work builds upon this idea, and more candidly asks what can be extracted from a little more than a real line, namely a subspace of dimension . Rather than project two measures on several lines, we consider in this paper projecting them on a -dimensional subspace that maximizes their transport cost. This results in optimizing the Wasserstein distance over the ground metric, which was already considered for supervised learning (Cuturi & Avis, 2014; Flamary et al., 2018).

#### Contributions

This optimal projection translates into a “max-min” robust OT problem with desirable features. Although that formulation cannot be solved with convex solvers, we show that the corresponding “min-max” problem admits on the contrary a tight convex relaxation and also has an intuitive interpretation. To see that, one can first notice that the usual 2-Wasserstein distance can be described as the minimization of the trace of the second order moment matrix of the displacements associated with a transport plan. We show that computing a maximally discriminating optimal dimensional subspace in this “min-max” formulation can be carried out by minimizing the sum of the largest eigenvalues (instead of the entire trace) of that second order moment matrix. A simple example summarizing the link between these two “min-max” and “max-min” quantities is given in Figure 1. That figure considers a toy example where points in dimension are projected on lines , our idea is designed to work for larger and , as shown in §6.

#### Paper structure

We start this paper with background material on Wasserstein distances in §2 and present an alternative formulation for the 2-Wasserstein distance using the second order moment matrix of displacements described in a transport plan. We define in §3 our “max-min” and “min-max” formulations for, respectively, projection (PRW) and subspace (SRW) robust Wasserstein distances. We study the geodesic structure induced by the SRW distance on the space of probability measures in §4, as well as its dependence on the dimension parameter . We provide computational tools to evaluate SRW using entropic regularization in §5. We conclude the paper with experiments in §6 to validate and illustrate our claims, on both simulated and real datasets.

## 2 Background on Optimal Transport

For , we write . Let be the set of Borel probability measures in , and let

 P2(Rd)={μ∈P(Rd)∣∣∣∫∥x∥2dμ(x)<∞}.

#### Monge and Kantorovich Formulations of OT

For , we write for the set of couplings

 Π(μ,ν)={π∈P(Rd×Rd) s.t.∀A,B⊂Rd Borel,π(A×Rd)=μ(A),π(Rd×B)=ν(B)},

and their -Wasserstein distance is defined as

 W2(μ,ν):=(infπ∈Π(μ,ν)∫∥x−y∥2dπ(x,y))1/2.

Because we only consider quadratic costs in the remainder of this paper, we drop the subscript in our notation and will only use to denote the -Wasserstein distance. For Borel , Borel and , we denote by the push-forward of by , i.e. the measure such that for any Borel set ,

 T#μ(A)=μ(T−1(A)).

The Monge (1781) formulation of optimal transport is, when this minimization is feasible, equivalent to that of Kantorovich, namely

 W(μ,ν)=(infT:T#μ=ν∫∥x−T(x)∥2dμ(x))1/2.

#### W as Trace-minimization

For any coupling , we define the second order displacement matrix

 Vπ:=∫(x−y)(x−y)Tdπ(x,y). (1)

Notice that when a coupling corresponds to a Monge map, namely , then one can interpret even more naturally as the second order moment of all displacement weighted by . With this convention, we remark that the total cost of a coupling is equal to the trace of , using the simple identity and the linearity of the integral sum. Computing the distance can therefore be interpreted as minimizing the trace of . This simple observation will play an important role in the next section, and more specifically the study of , the -th largest eigenvalue of .

## 3 Subspace Robust Wasserstein Distances

With the conventions and notations provided in §2, we consider here different robust formulations of the Wasserstein distance. Consider first for , the Grassmannian of -dimensional subspaces of :

 Gk={E⊂Rd|dim(E)=k}.

For , we note the orthogonal projector onto . Given two measures , a first attempt at computing a robust version of is to consider the worst possible OT cost over all possible low dimensional projections:

###### Definition 1.

For , the -dimensional projection robust -Wasserstein (PRW) distance between and is

 Pk(μ,ν)=supE∈GkW(PE#μ,PE#ν).

As we show in the supplementary material, this quantity is well posed and itself worthy of interest, yet difficult to compute. In this paper, we focus our attention on the corresponding “min-max” problem, to define the -dimensional subspace robust -Wasserstein (SRW) distance:

###### Definition 2.

For , the -dimensional subspace robust -Wasserstein distance between and is

###### Remark 1.

Both quantities and can be interpreted as robust variants of the distance. By a simple application of weak duality we have that .

###### Lemma 1.

Optimal solutions for exist, i.e.

We show next that the SRW variant can be elegantly reformulated as a function of the eigendecomposition of the displacement second-order moment matrix  (1):

###### Lemma 2.

For and , one has

 S2k(μ,ν)=minπ∈Π(μ,ν)maxU∈Rk×dUUT=Ik∫∥Ux−Uy∥2dπ(x,y)=minπ∈Π(μ,ν)k∑l=1λl(Vπ).

This characterization as a sum of eigenvalues will be crucial to study theoretical properties of . Subspace robust Wasserstein distances can in fact be interpreted as a convex relaxation of projection robust Wasserstein distances: they can be computed as the maximum of a concave function over a convex set, which will make computations tractable.

###### Theorem 1.

For and ,

 S2k(μ,ν) =minπ∈Π(μ,ν)max0⪯Ω⪯Itrace(Ω)=k∫d2Ωdπ (2) =max0⪯Ω⪯Itrace(Ω)=kminπ∈Π(μ,ν)∫d2Ωdπ (3) =max0⪯Ω⪯Itrace(Ω)=kW2(Ω1/2#μ,Ω1/2#ν) (4)

where stands for the Mahalanobis distance

 d2Ω(x,y)=(x−y)TΩ(x−y).

We can now prove that both PRW and SRW variants are, indeed, distances over .

###### Proposition 1.

For , both and are distances over .

Proof. Symmetry is clear for both objects, and for , . Let such that . Then and for any , , i.e. . Lemma 7 (in the supplementary material) then shows that . For the triangle inequalities, let . Let be optimal between and . Using the triangle inequalities for the Wasserstein distance,

 Sk(μ0,μ2)=W[Ω1/2⋆#μ0,Ω1/2⋆#μ2]≤W[Ω1/2⋆#μ0,Ω1/2⋆#μ1]+W[Ω1/2⋆#μ1,Ω1/2⋆#μ2]≤sup0⪯Ω⪯Itrace(Ω)=kW[Ω1/2#μ0,Ω1/2#μ1]+sup0⪯Ω⪯Itrace(Ω)=kW[Ω1/2#μ1,Ω1/2#μ2]=Sk(μ0,μ1)+Sk(μ1,μ2).

The same argument, used this time with projections, yields the triangle inequalities for .

## 4 Geometry of Subspace Robust Distances

We prove in this section that SRW distances share several fundamental geometric properties with the Wasserstein distance. The first one states that distances between Diracs match the ground metric:

###### Lemma 3.

For and ,

 Sk(δx,δy)=∥x−y∥.

#### Metric Equivalence.

Subspace robust Wasserstein distances are equivalent to the Wasserstein distance :

###### Proposition 2.

For , is equivalent to . More precisely, for ,

Moreover, the constants are tight since

 Sk(δx,δy)=W(δx,δy)Sk(δ0,σ)=√kdW(δ0,σ)

where are Dirac masses at points and is the uniform probability distribution over the centered unit sphere in .

#### Dependence on the dimension.

We fix and we ask the following question : how does depend on the dimension ? The following lemma gives a result in terms of eigenvalues of , where is optimal for some dimension , then we translate in Proposition 3 this result in terms of .

###### Lemma 4.

Let . For any ,

 λk+1(Vπk+1)≤S2k+1(μ,ν)−S2k(μ,ν)≤λk+1(Vπk)

where for , .

###### Proposition 3.

Let . The sequence is increasing and concave. In particular, for ,

 S2k+1(μ,ν)−S2k(μ,ν)≥W2(μ,ν)−S2k(μ,ν)d−k.

Moreover, for any ,

 Sk(μ,ν)≤Sk+1(μ,ν)≤√k+1kSk(μ,ν).

#### Geodesics

We have shown in Proposition 2 that for any , is a metric space with the same topology as that of the Wasserstein space . We conclude this section by showing that is in fact a geodesic length space, and exhibits explicit constant speed geodesics. This can be used to interpolate between measures in sense.

###### Proposition 4.

Let and . Take

 π∗∈argminπ∈Π(μ,ν)k∑l=1λl(Vπ)

and let . Then the curve

 t↦μt:=ft#π∗

is a constant speed geodesic in connecting and . Consequently, is a geodesic space.

Proof. We first show that for any ,

 Sk(μs,μt)=|t−s|Sk(μ,ν)

by computing the cost of the transport plan and using the triangular inequality. Then the curve has constant speed

 |μ′t|=limϵ→0Sk(μt+ϵ,μt)|ϵ|=Sk(μ,ν),

and the length of the curve is

 sup{n−1∑i=0Sk(μti,μti+1)∣∣∣n≥10=t0<...

i.e. is a geodesic connecting and .

## 5 Computation

We provide in this section algorithms to compute the saddle point solution of . are now discrete with respectively and points and weights and : and . For , three different objects are of interest: (i) the value of , (ii) an optimal subspace obtained through the relaxation for SRW, (iii) an optimal transport plan solving SRW. A subspace can be used for dimensionality reduction, whereas an optimal transport plan can be used to compute a geodesic, i.e. to interpolate between and .

### 5.1 Computational challenges to approximate Sk

We observe that solving is challenging. Considering a direct projection onto the transportation polytope

 Π(μ,ν)={π∈Rn×m|π1m=a,πT1n=b}

would result in a costly quadratic network flow problem. The Frank-Wolfe algorithm, which does not require such projections, cannot be used directly because the application is not smooth.

On the other hand, thanks to Theorem 1, solving the maximization problem is easier. Indeed, we can project onto the set of constraints using Dykstra’s projection algorithm (Boyle & Dykstra, 1986). In this case, we will only get the value of and an optimal subspace, but not necessarily the actual optimal transport plan due to the lack of uniqueness for OT plans in general.

#### Smoothing

It is well known that saddle points are hard to compute for a bilinear objective (Hammond, 1984). Computations are greatly facilitated by adding smoothness, which allows the use of saddle point Frank-Wolfe algorithms (Gidel et al., 2017). Out of the two problems, the maximization problem is seemingly easier. Indeed, we can leverage the framework of regularized OT (Cuturi, 2013) to output, using Sinkhorn’s algorithm, a unique optimal transport plan at each inner loop of the maximization. To save time, we remark that initial iterations can be solved with a low accuracy by limiting the number of iterations, and benefit from warm starts, using the scalings computed at the previous iteration, see (Peyré & Cuturi, 2019, §4).

### 5.2 Projected Supergradient Method for SRW

In order to compute SRW and an optimal subspace, we can solve equation (3) by maximizing the concave function

 f:Ω↦minπ∈Π(μ,ν)∑i,jd2Ω(xi,yj)πi,j=minπ∈Π(μ,ν)⟨Ω|Vπ⟩

over the convex set . Since is not differentiable, but only superdifferentiable, we can only use a projected supergradient method. This algorithm is outlined in Algorithm 1. Note that by Danskin’s theorem, for any ,

 ∂f(Ω)=Conv{Vπ∗∣∣∣π∗∈argminπ∈Π(μ,ν)⟨Ω|Vπ⟩}.

### 5.3 Frank-Wolfe using Entropy Regularization

Entropy-regularized optimal transport can be used to compute a unique optimal plan given a subspace. Let be the regularization strength. In this case, we want to maximize the concave function

 fγ:Ω↦minπ∈Π(μ,ν)⟨Ω|Vπ⟩+γ∑i,jπi,j[log(πi,j)−1]

over the convex set . Since for all , there is a unique minimizing , is differentiable. Instead of running a projected gradient ascent on , we propose to use the Frank-Wolfe algorithm when the regularization strength is positive. Indeed, there is no need to tune a learning rate in Frank-Wolfe, making it easier to use. We only need to compute, for fixed , the maximum over of :

###### Lemma 5.

For , compute the eigendecomposition of with . Then for , solves

 max0⪯Ω⪯Itrace(Ω)=k∫d2Ωdπ.

This algorithm is outlined in algorithm 2.

### 5.4 Initialization and Stopping Criterion

We propose to initialize Algorithms 1 and 2 with where is the matrix of the top eigenvectors (i.e. the eigenvectors associated with the top eigenvalues) of and is an optimal transport plan between and . In other words, is the projection matrix onto the first principal components of the transport-weighted displacement vectors. Note that would be optimal is were optimal for the min-max problem, and that this initialization only costs the equivalent of one iteration.

When entropic regularization is used, Sinkhorn algorithm is run at each iteration of Algorithms 1 and 2. We propose to initialize the potentials in Sinkhorn algorithm with the latest computed potentials, so that the number of iterations in Sinkhorn algorithm should be small after a few iterations of Algorithms 1 or 2.

We sometimes need to compute for all , for example to choose the optimal with an “elbow” rule. To speed the computations up, we propose to compute this sequence iteratively from to . At each iteration, i.e. for each dimension , we initialize the algorithm with , where is the matrix of the top eigenvectors of and is the optimal transport plan for dimension . We also initialize the Sinkhorn algorithm with the latest computed potentials.

Instead of running a fixed number of iterations in Algorithm 2, we propose to stop the algorithm when the computation error is smaller than a fixed threshold . The computation error at iteration is:

 |Sk(μ,ν)−ˆSk(t)|Sk(μ,ν)≤Δ(t)ˆSk(t)

where is the computed “max-min” value and is the duality gap at iteration . We stop as soon as .

## 6 Experiments

We first compare SRW with the experimental setup used to evaluate FactoredOT (Forrow et al., 2019). We then study the ability of SRW distances to capture the dimension of sampled measures by looking at their value for increasing dimensions , as well as their robustness to noise.

### 6.1 Fragmented Hypercube

We first consider to be uniform over an hypercube, and the pushforward of under the map , where sign is taken elementwise, and is the canonical basis of . The map splits the hypercube into four different hyperrectangles. is a subgradient of a convex function, so by Brenier’s theorem (1991) it is an optimal transport map between and and

 W2(μ,ν)=∫∥x−T(x)∥2dμ(x)=4k∗.

Note that for any , the displacement vector lies in the -dimensional subspace , which is optimal. This means that for , is constant equal to . We show the interest of plotting, based on two empirical distributions from and from , the sequence , for different values of . That sequence is increasing concave by proposition 3, and increases more slowly after , as can be seen on Figure 2. This is the case because the last dimensions only represent noise, but is recovered in our plot.

We consider next , and choose from the result of Figure 2, . We look at the estimation error where are empirical measures from and respectively with points each. In Figure 3, we plot this estimation error depending on the number of points . In Figure 4, we plot the subspace estimation error depending on , where is the optimal projection matrix onto .

We also plot the optimal transport plan (in the sense of , Figure 5 left) and the optimal transport plan (in the sense of ) between and (with points each, Figure 5 right).

### 6.2 Robustness, with 20-D Gaussians

We consider and , with semidefinite positive of rank . It means that the supports of and are -dimensional subspaces of . Although those two subspaces are -dimensional, they may be different. Since the union of two -dimensional subspaces is included in a -dimensional subspace, for any , .

For our experiment, we simulated independent couples of covariance matrices in dimension , each having independently a Wishart distribution with degrees of freedom. For each couple of matrices, we draw points from and and considered and the empirical measures on those points. In Figure 6, we plot the mean (over the samples) of . We plot the same curve for noisy data, where each point was added a random vector. With moderate noise, the data is only approximately on the two -dimensional subspaces, but the SRW does not vary too much.

### 6.3 Sk is Robust to Noise

As in experiment 6.2, we consider independent samples of couples , each following independently a Wishart distribution with degrees of freedom. For each couple, we draw points from and and consider the empirical measures and on those points. We then gradually add Gaussian noise to the points, giving measures , . In Figure 7, we plot the mean (over the samples) of the relative errors

 σ↦|S25(^μσ,^νσ)−S25(^μ0,^ν0)|S25(^μ0,^ν0)

and

 σ↦|W2(^μσ,^νσ)−W2(^μ0,^ν0)|W2(^μ0,^ν0).

Note that for small noise level, the imprecision in the computation of the SRW distance adds up to the error caused by the added noise. SRW distances seem more robust to noise than the Wasserstein distance when the noise has moderate to high variance.

### 6.4 Computation time

We consider the Fragmented Hypercube experiment, with increasing dimension and fixed . Using and Algorithm 2 with and stopping threshold , we plot in Figure 8 the mean computation time of both SRW and Wasserstein distances on GPU, over random samplings with . It shows that SRW computation is quadratic in dimension , because of the eigendecomposition of matrix in Algorithm 2.

### 6.5 Real Data Experiment

We consider the scripts of seven movies. Each script is transformed into a list of words, and using word2vec (Mikolov et al., 2018), into a measure over where the weights correspond to the frequency of the words. We then compute the SRW distance between all pairs of films: see Figure 9 for the SRW values. Movies with a same genre or thematic tend to be closer to each other: this can be visualized by running a two-dimensional metric multidimensional scaling (mMDS) on the SRW distances, as shown in Figure 10 (left).

In Figure 10 (right), we display the projection of the two measures associated with films Kill Bill Vol.1 and Interstellar onto their optimal subspace. We compute the first (weighted) principal component of each projected measure, and find among the whole dictionary their 5 nearest neighbors in terms of cosine similarity. For Kill Bill Vol.1 , these are: ’swords’, ’hull’, ’sword’, ’ice’, ’blade’. For Interstellar, they are: ’spacecraft’, ’planets’, ’satellites’, ’asteroids’, ’planet’. The optimal subspace recovers the semantic dissimilarities between the two films.

## 7 Conclusion

We have proposed in this paper a new family of OT distances with robust properties. These distances take a particular interest when used with a squared-Euclidean cost, in which case they have several properties, both theoretical and computational. These distances share important properties with the -Wasserstein distance, yet seem far more robust to random perturbation of the data and able to capture better signal. We have provided algorithmic tools to compute these SRW distance. They come at a relatively modest overhead, given that they require using regularized OT as the inner loop of a FW type algorithm. Future work includes the investigation of even faster techniques to carry out these computations, eventually automatic differentiation schemes as those currently benefitting the simple use of Sinkhorn divergences.

#### Acknowledegments.

Both authors acknowledge the support of a ”Chaire d’excellence de l’IDEX Paris Saclay”. We thank P. Rigollet and J. Weed for their remarks and their valuable input.

## References

• Altschuler et al. (2018a) Altschuler, J., Bach, F., Rudi, A., and Weed, J. Approximating the quadratic transportation metric in near-linear time. arXiv preprint arXiv:1810.10046, 2018a.
• Altschuler et al. (2018b) Altschuler, J., Bach, F., Rudi, A., and Weed, J. Massively scalable sinkhorn distances via the nyström method. arXiv preprint arXiv:1812.05189, 2018b.
• Blondel et al. (2018) Blondel, M., Seguy, V., and Rolet, A. Smooth and sparse optimal transport. In International Conference on Artificial Intelligence and Statistics, pp. 880–889, 2018.
• Bonneel et al. (2015) Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
• Boyle & Dykstra (1986) Boyle, J. P. and Dykstra, R. L. A method for finding projections onto the intersection of convex sets in hilbert spaces. In Advances in order restricted statistical inference, pp. 28–47. Springer, 1986.
• Brenier (1991) Brenier, Y. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.
• Canas & Rosasco (2012) Canas, G. and Rosasco, L. Learning probability measures with respect to optimal transport metrics. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 2492–2500. 2012.
• Cotter et al. (2011) Cotter, A., Keshet, J., and Srebro, N. Explicit approximations of the gaussian kernel. arXiv preprint arXiv:1109.4603, 2011.
• Cuturi (2013) Cuturi, M. Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pp. 2292–2300, 2013.
• Cuturi & Avis (2014) Cuturi, M. and Avis, D. Ground metric learning. The Journal of Machine Learning Research, 15(1):533–564, 2014.
• Deshpande et al. (2018) Deshpande, I., Zhang, Z., and Schwing, A. G. Generative modeling using the sliced wasserstein distance. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
• Dessein et al. (2018) Dessein, A., Papadakis, N., and Rouas, J.-L. Regularized optimal transport and the rot mover’s distance. Journal of Machine Learning Research, 19(15):1–53, 2018.
• Dudley (1969) Dudley, R. M. The speed of mean Glivenko-Cantelli convergence. Annals of Mathematical Statistics, 40(1):40–50, 1969.
• Fan (1949) Fan, K. On a theorem of weyl concerning eigenvalues of linear transformations i. Proceedings of the National Academy of Sciences, 35(11):652–655, 1949.
• Flamary et al. (2018) Flamary, R., Cuturi, M., Courty, N., and Rakotomamonjy, A. Wasserstein discriminant analysis. Machine Learning, 107(12):1923–1945, 2018.
• Forrow et al. (2019) Forrow, A., Hütter, J.-C., Nitzan, M., Rigollet, P., Schiebinger, G., and Weed, J. Statistical optimal transport via factored couplings. 2019.
• Fournier & Guillin (2015) Fournier, N. and Guillin, A. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015.
• Genevay et al. (2019) Genevay, A., Chizat, L., Bach, F., Cuturi, M., and Peyré, G. Sample complexity of sinkhorn divergences. 2019.
• Gidel et al. (2017) Gidel, G., Jebara, T., and Lacoste-Julien, S. Frank-Wolfe algorithms for saddle point problems. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
• Hammond (1984) Hammond, J. H. Solving asymmetric variational inequality problems and systems of equations with generalized nonlinear programming algorithms. PhD thesis, Massachusetts Institute of Technology, 1984.
• Kolouri et al. (2016) Kolouri, S., Zou, Y., and Rohde, G. K. Sliced Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5258–5267, 2016.
• Kolouri et al. (2018) Kolouri, S., Martin, C. E., and Rohde, G. K. Sliced-wasserstein autoencoder: An embarrassingly simple generative model. arXiv preprint arXiv:1804.01947, 2018.
• Ling & Okada (2007) Ling, H. and Okada, K. An efficient earth mover’s distance algorithm for robust histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):840–853, 2007.
• Mikolov et al. (2018) Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
• Monge (1781) Monge, G. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences, pp. 666–704, 1781.
• Overton & Womersley (1993) Overton, M. L. and Womersley, R. S. Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Mathematical Programming, 62(1-3):321–357, 1993.
• Pele & Werman (2009) Pele, O. and Werman, M. Fast and robust earth mover’s distances. In IEEE 12th International Conference on Computer Vision, pp. 460–467, 2009.
• Peyré & Cuturi (2019) Peyré, G. and Cuturi, M. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6), 2019.
• Rabin et al. (2011) Rabin, J., Peyré, G., Delon, J., and Bernot, M. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435–446. Springer, 2011.
• Santambrogio (2015) Santambrogio, F. Optimal transport for applied mathematicians. Birkhauser, 2015.
• Villani (2009) Villani, C. Optimal Transport: Old and New, volume 338. Springer Verlag, 2009.
• Weed & Bach (2017) Weed, J. and Bach, F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.

## Appendix A Projection Robust Wasserstein Distances

In this section, we prove some basic properties of projection robust Wasserstein distances . First note that the definition of makes sense, since for any , and , and have a second moment (for orthogonal projections are -Lipschitz).

is also well posed, since one can prove the existence of a maximizing subspace. To prove this, we will need the following lemma stating that the admissible set of couplings between the projected measures are exactly the projections of the admissible couplings between the original measures:

###### Lemma 6.

Let Borel and . Then .

This can be used to get the following result:

###### Proposition 5.

For and , there exists a subspace such that

 Pk(μ,ν)=W(PE∗#μ,PE∗#ν).
###### Proof.

The Grassmannian is compact, and we show that the application is upper semicontinuous, which gives existence. ∎

Note that we could define projection robust Wasserstein distances for any by:

 supE∈GkWp(PE#μ,PE#ν).

Then there is still existence of optimal subspaces, and it defines a distance over

 Pp(Rd)={μ∈P(Rd)∣∣∣∫∥x∥pdμ(x)<∞}.

To prove the identity of indiscernibles, we use the following Lemma due to Rényi, generalizing Cramér-Wold theorem:

###### Lemma 7.

Let be a family of subspaces of such that . Let such that for all , . Then .

## Appendix B Proofs

#### Proof of Lemma 1.

For , the application is continuous and is compact, so the supremum is a maximum. Moreover, the application is lower semicontinuous as the maximum of lower semicontinuous functions. Since is compact (for any sequence in is tight), the infimum is a minimum.

#### Proof of Lemma 2.

A classical variational result by (Fan, 1949) states that

 k∑l=1λl(Vπ)=maxU∈Rk×dUUT=Iktrace(UVπUT).

Then using the linearity of the trace:

 k∑l=1λl(Vπ)=maxU∈Rk×dUUT=Ik∫trace[U(x−y)(x−y)TUT]dπ(x,y)=maxU∈Rk×dUUT=Ik∫∥U(x−y)∥2dπ(x,y)=maxE∈Gk∫∥PE(x)−PE(y)∥2dπ(x,y).

Taking the minimum over yields the result.

#### Proof of Theorem 1.

= (2) : We fix and focus on the inner maximization in (2) :

 max0⪯Ω⪯Itrace(Ω)=k∫d2Ωdπ=max0⪯Ω⪯Itrace(Ω)=k⟨Ω|Vπ⟩.

A result by (Overton & Womersley, 1993) shows that this is equal to

 maxU∈Rk×dUUT=Iktrace(UVπUT)

which is nothing but the sum of the largest eigenvalues of by Fan’s result. By lemma 2, taking the minimum over yields the result.

(2) = (3) : We will use Sion’s minimax theorem to interchange the minimum and the maximum. Put and . Note that is convex and compact, and is convex (and actually compact, but we won’t need it here). Moreover, is bilinear and for any , is continuous. Let . Let us show that is lower semicontinuous for the weak convergence. Let be an increasing sequence of bounded continuous functions, converging pointwise to . Then . For , is continuous and bounded, so is continuous for the weak convergence. Then is lower semicontinuous as the supremum of continuous functions. Then by Sion’s minimax theorem,

 minπ∈Π(μ,ν)maxΩ∈Rf(Ω,π)=maxΩ∈Rminπ∈Π(μ,ν)f(Ω,π)

which is exactly (2) = (3).

(3) = (4) : Fix . One has:

 minπ∈Π(μ,ν)∫d2Ω