###### Abstract

Optimal transport distances are powerful tools to compare probability distributions and have found many applications in machine learning. Yet their algorithmic complexity prevents their direct use on large scale datasets. To overcome this challenge, practitioners compute these distances on minibatches, i.e., they average the outcome of several smaller optimal transport problems. We propose in this paper an analysis of this practice, whose effects are not yet well understood. We notably argue that it is equivalent to an implicit regularization of the original problem, with appealing properties such as unbiased estimators and gradients and a concentration bound around the expectation, but also with defects such as the loss of the distance property. Along with this theoretical analysis, we also conduct empirical experiments on gradient flows, GANs and color transfer that highlight the practical interest of this strategy.

Learning with minibatch Wasserstein: asymptotic and gradient properties

Kilian Fatras¹, Younes Zine², Rémi Flamary³, Rémi Gribonval²⁴, Nicolas Courty¹

¹ Univ Bretagne Sud, IRISA, CNRS, Inria, France
² Univ Rennes, Inria, CNRS, IRISA
³ Univ Côte d’Azur, OCA, UMR 7293, CNRS, Laboratoire Lagrange, France
⁴ Univ Lyon, Inria, CNRS, ENS de Lyon, UCB Lyon 1, LIP UMR 5668, F-69342, Lyon, France

## 1 Introduction

Measuring distances between probability distributions is a key problem in machine learning. Considering the space P(X) of probability distributions over a space X, and given an empirical probability distribution β ∈ P(X), we want to find a parametrized distribution α_θ which approximates β. Measuring the distance between the distributions requires a loss function L : P(X) × P(X) → R⁺. The model α_θ is parametrized by a vector θ, and the goal is to find the best θ minimizing the distance between α_θ and β, i.e., min_θ L(α_θ, β). As the distributions are empirical, we need a distance with good statistical performance and with optimization guarantees under modern optimization techniques. Optimal transport (OT) losses have recently emerged as a competitive tool for this problem [Genevay et al., 2018, Arjovsky et al., 2017]. The corresponding estimator is usually found in the literature under the name of Minimum Kantorovich Estimator [Bassetti et al., 2006, Peyré and Cuturi, 2019]. Furthermore, OT losses have been widely used to transport samples from a source domain to a target domain through barycentric mappings [Ferradans et al., 2013, Courty et al., 2017, Seguy et al., 2018].

Several previous works have addressed the heavy computational cost of optimal transport: the Wasserstein distance comes with a complexity of O(n³ log n), where n is the size of the probability distributions' supports. Variants of optimal transport have been proposed to reduce this complexity. [Cuturi, 2013] used an entropic regularization term to get a strongly convex problem which is solvable with the Sinkhorn algorithm at an O(n²) cost per iteration, in both time and space. However, despite some scalable solvers based on stochastic optimization [Genevay et al., 2016, Seguy et al., 2018], in the big data setting n is very large and still leads to computational bottlenecks, especially when trying to minimize an OT loss. That is why [Genevay et al., 2018, Damodaran et al., 2018] use a minibatch strategy in their implementations to reduce the cost per iteration. They propose to compute the average of several optimal transport terms between minibatches from the source and the target distributions. However, using this strategy leads to a different optimization problem that results in a “non-optimal” transportation plan between the full original distributions. Recently, [Bernton et al., 2017] studied the minimizers of minibatch losses, and [Sommerfeld et al., 2019] derived a bound between the true optimal transport and the minibatch optimal transport. However, they did not study the asymptotic convergence, the loss properties, or the behavior of the minibatch loss.

In this paper we propose to study the minibatch optimal transport loss by reviewing its relevance as a loss function. After defining the minibatch formalism, we show which properties are inherited and which ones are lost. We describe the asymptotic behavior of our estimator and show that we can derive a concentration bound without dependence on the data space dimension. Then, we prove that the gradients of the minibatch OT losses are unbiased, which justifies their use with SGD in [Genevay et al., 2018]. Finally, we demonstrate the effectiveness of minibatches in a large scale setting and how they alleviate the memory issues of barycentric mappings. The paper is structured as follows: in Section 2, we propose a brief review of the different optimal transport losses. In Section 3, we give formal definitions of the minibatch strategy and illustrate its impact on OT plans. Basic properties, asymptotic behaviors of the estimator and differentiability are then described. Finally, in Section 4, we highlight the behavior of the minibatch OT losses on a number of experiments: gradient flows, generative networks and color transfer.

## 2 Wasserstein distance and regularization

##### Wasserstein distance

The Optimal Transport metric measures a distance between two probability distributions α and β by considering a ground cost c on the underlying space [Peyré and Cuturi, 2019]. Formally, the Wasserstein distance between two distributions α and β can be expressed as

 W_c(α, β) = min_{π ∈ U(α,β)} ∫_{X×Y} c(x, y) dπ(x, y),  (1)

where U(α, β) is the set of joint probability distributions with marginals α and β, i.e., such that π_X = α and π_Y = β, where π_X (resp. π_Y) denotes the marginalization of π over X (resp. Y). The ground cost c is usually chosen as the Euclidean or squared Euclidean distance, in which case W_c is a metric (or the square of a metric) as well. Note that the optimization problem above is called the Kantorovich formulation of OT and the optimal π is called an optimal transport plan. When the distributions are discrete, the problem becomes a discrete linear program that can be solved with cubic complexity in the size of the distributions' support. Also, the convergence in population of the Wasserstein distance is known to be slow, with a rate that degrades with the dimensionality of the space and depends on the size of the population [Weed and Bach, 2019].
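For intuition, when both distributions are uniform over n points, Birkhoff's theorem guarantees that an optimal plan is supported on a permutation, so the linear program can be solved by brute force for tiny supports. A minimal pure-Python sketch (the function name is ours, and enumeration is only sensible for very small n):

```python
import itertools
import math

def wasserstein_uniform(xs, ys, c=lambda x, y: abs(x - y)):
    """Exact W_c between two uniform discrete distributions with equal
    support sizes, by enumerating permutations (Birkhoff: an optimal
    plan is a permutation matrix scaled by 1/n)."""
    n = len(xs)
    assert len(ys) == n
    best = math.inf
    for sigma in itertools.permutations(range(n)):
        cost = sum(c(xs[i], ys[sigma[i]]) for i in range(n)) / n
        best = min(best, cost)
    return best
```

In 1D with a convex cost, the optimum is the sorted matching, which gives a quick sanity check on such a sketch.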

##### Entropic regularization

Regularized entropic OT was proposed in [Cuturi, 2013] and leads to a more efficient solver. We define the entropic loss as
 W_c^ε(α, β) := min_{π ∈ U(α,β)} ∫_{X×Y} c(x, y) dπ(x, y) + ε H(π | α ⊗ β),
with H(π | α ⊗ β) := ∫ log( dπ(x, y) / (dα(x) dβ(y)) ) dπ(x, y), where ε > 0 is the regularization coefficient. We call this function the entropic OT loss. As we will see later, this entropic regularization also makes the problem strongly convex and differentiable with respect to the cost or the input distributions.
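The resulting problem can be solved with alternating scaling updates. A minimal sketch of Sinkhorn iterations (pure Python, dense matrices, no log-domain stabilization; `sinkhorn_plan` is an illustrative name, not the solver used in the paper):

```python
import math

def sinkhorn_plan(C, a, b, eps=0.1, n_iter=500):
    """Entropic OT plan via Sinkhorn iterations.
    C: cost matrix, a/b: marginal weights, eps: regularization coefficient."""
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-C / eps)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # plan entries pi_ij = u_i * K_ij * v_j
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each iteration costs O(n²), matching the per-iteration complexity mentioned in the introduction; a small eps yields a plan closer to the unregularized optimum at the price of slower convergence and numerical underflow in this naive version.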

It is well known that adding an entropic regularization leads to solutions that are sub-optimal for the original problem, and W_c^ε is not a metric since W_c^ε(α, α) ≠ 0. This motivated [Genevay et al., 2018] to introduce an unbiased loss which uses the entropic regularization, called the Sinkhorn divergence. It is defined as:

 S_ε(α, β) := W_c^ε(α, β) − (1/2) W_c^ε(α, α) − (1/2) W_c^ε(β, β)

It can still be computed with the same order of complexity as the entropic loss and has been proven to interpolate between OT and maximum mean discrepancy (MMD) [Feydy et al., 2019] with respect to the regularization coefficient ε. MMDs are integral probability metrics over a reproducing kernel Hilbert space [Gretton et al., ]. When ε tends to 0, we recover the OT solution, and when ε tends to +∞, we get a solution closer to the MMD solution. Second, as proved by [Feydy et al., 2019], if the cost c is Lipschitz, then S_ε is a convex, symmetric, positive definite loss function; hence the use of the Sinkhorn divergence instead of regularized OT. The sample complexity of the Sinkhorn divergence, that is, the convergence rate between the divergence on a probability distribution and on its empirical counterpart as a function of the number of samples n, was proven in [Genevay et al., 2019] to be of order O((1 + ε^{−⌊d/2⌋})/√n), where d is the dimension of the space. We again see an interpolation depending on the value of ε: when it tends to 0, we get OT's sample complexity, and when it tends to +∞, the MMD loss's sample complexity.

##### Minibatch Wasserstein

While the entropic loss has better computational complexity than the original Wasserstein distance, it is still challenging to compute it for a large dataset. To overcome this issue, several papers rely on a minibatch computation [Genevay et al., 2018, Damodaran et al., 2018]. Instead of computing the OT problem between the full distributions, they compute an average of OT problems between batches from the source and the target domains. Several works came out to justify the use of the minibatch paradigm. [Bernton et al., 2017] showed that for generative models, the minimizers of the minibatch loss converge to the true minimizer when the minibatch size increases. [Sommerfeld et al., 2019] considered another approach, where they approximate OT with the minibatch strategy and exhibit a deviation bound between the two quantities. We follow a different approach from the two previous works. We are interested in the behavior of the minibatch strategy as a loss function and in its resulting transportation plan. We want to study the asymptotic behavior of using minibatches, the optimization procedure, the transportation plan and the behavior of such a loss for data fitting problems.

## 3 Minibatch Wasserstein

In this section we first define the Minibatch Wasserstein and illustrate it on simple examples. Next we study its asymptotic properties and optimization behavior.

### 3.1 Notations and Definitions

##### Notations

Let x₁, …, xₙ (resp. y₁, …, yₙ) be n samples of iid random variables drawn from a distribution α (resp. β) on the source (resp. target) domain. We denote by αₙ and βₙ the empirical distributions of support {x₁, …, xₙ} and {y₁, …, yₙ} respectively. The weights of αₙ (resp. βₙ) are uniform, i.e., equal to 1/n. We further suppose that α and β have compact support; the ground cost c is then bounded by a constant M. X ∼ α^{⊗m} denotes a sample of m random variables following α. In the rest of the paper, we will not make a difference between a batch of cardinality m and its associated (uniform probability) distribution. The number of possible mini-batches of size m over n distinct samples is the binomial coefficient binom(n, m). For m ≤ n, we write P_m(αₙ) (resp. P_m(βₙ)) for the collection of subsets of cardinality m of {x₁, …, xₙ} (resp. of {y₁, …, yₙ}).

##### Definitions

We will first give formal definitions of the different quantities that we will use in this paper. We start with minibatch Wasserstein losses for continuous, semi-discrete and discrete distributions.

###### Definition 1 (Minibatch Wasserstein definitions).

Given a kernel function h mapping a pair of batches to a real number, we define the following quantities:

The continuous loss:

 U_h(α, β) := E_{(X,Y) ∼ α^{⊗m} ⊗ β^{⊗m}} [h(X, Y)]  (2)

The semi-discrete loss:

 U_h(α_n, β) := binom(n,m)^{-1} Σ_{A ∈ P_m(α_n)} E_{Y ∼ β^{⊗m}} [h(A, Y)]  (3)

The discrete-discrete loss:

 U_h(α_n, β_n) := binom(n,m)^{-2} Σ_{A ∈ P_m(α_n)} Σ_{B ∈ P_m(β_n)} h(A, B)  (4)

where h can be the Wasserstein distance W_c, the entropic loss W_c^ε, or the Sinkhorn divergence S_ε for some ground cost c.

These quantities represent an average of Wasserstein distances over batches of size m. Note that samples in a batch have uniform weights and that the ground cost can be computed between all pairs of batches A and B. It is easy to see that Eq. (4) is an empirical estimator of Eq. (2). In real world applications, computing the average over all mini-batches is too costly as we have a combinatorial number of batches; that is why we will rely on a subsampled version of this quantity.
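In 1D, where the per-batch OT is a sorted matching, Eq. (4) can be written out directly. A brute-force sketch (our illustrative names; feasible only for small n since there are binom(n,m)² batch pairs):

```python
import itertools

def w1_1d(batch_a, batch_b):
    """Exact 1D Wasserstein-1 between two uniform batches: sorted matching."""
    m = len(batch_a)
    return sum(abs(x - y) for x, y in zip(sorted(batch_a), sorted(batch_b))) / m

def minibatch_ot_loss(xs, ys, m):
    """U_h(alpha_n, beta_n) of Eq. (4): average h over ALL pairs of size-m batches."""
    pairs = [(A, B) for A in itertools.combinations(xs, m)
             for B in itertools.combinations(ys, m)]
    return sum(w1_1d(A, B) for A, B in pairs) / len(pairs)
```

With m = n this recovers the plain Wasserstein distance; with m < n the loss of a distribution against itself is strictly positive, previewing Proposition 3.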

###### Definition 2 (Minibatch subsampling).

Pick an integer k ≥ 1. We define:

 Ũ_h^k(α_n, β_n) := k^{-1} Σ_{(A,B) ∈ D_k} h(A, B)  (5)

where D_k is a set of cardinality k whose elements are drawn at random from the uniform distribution on P_m(α_n) × P_m(β_n).
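A sketch of this subsampled estimator in 1D, where each per-batch OT reduces to a sorted matching (function names and seeding are illustrative):

```python
import random

def w1_1d(batch_a, batch_b):
    """Exact 1D Wasserstein-1 between two uniform batches: sorted matching."""
    m = len(batch_a)
    return sum(abs(x - y) for x, y in zip(sorted(batch_a), sorted(batch_b))) / m

def minibatch_ot_loss_subsampled(xs, ys, m, k, seed=0):
    """Estimator of Eq. (5): average h over k batch pairs drawn at random."""
    rng = random.Random(seed)
    return sum(w1_1d(rng.sample(xs, m), rng.sample(ys, m)) for _ in range(k)) / k
```

When m equals the support size, every draw is the full distribution and the estimator is exact; otherwise it fluctuates around the exhaustive average of Definition 1.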

Let us now review, the minibatch definition for the OT plan.

###### Definition 3 (Mini-batch transport plan).

Consider α_n and β_n two discrete probability distributions. For each A ∈ P_m(α_n) and B ∈ P_m(β_n), we denote by Π^{A,B} the optimal plan between the batches, considered as an n × n matrix where all entries are zero except those indexed in A × B. We define the averaged mini-batch transport matrix:

 Π^m(α_n, β_n) := binom(n,m)^{-2} Σ_{A ∈ P_m(α_n)} Σ_{B ∈ P_m(β_n)} Π^{A,B}.  (6)

Following the subsampling idea, we define the subsampled minibatch transportation matrix for m ≤ n and k ≥ 1:

 Π^k(α_n, β_n) := k^{-1} Σ_{(A,B) ∈ D_k} Π^{A,B}  (7)

where D_k is drawn as in Definition 2.

Formal definitions of h and Π^{A,B} will be provided in the appendix.

In this paper, we will study two different biases. The first bias we will encounter is the empirical estimator’s bias, and then, we will study the bias in the gradients for first order optimization methods.

### 3.2 Illustration on simple examples

To illustrate the effect of the minibatch we compute the exact minibatch transportation matrix (6) on two simple examples.

##### Distributions in 1D

The 1D case is an interesting problem because we have access to a closed-form of the optimal transport solution which allows us to calculate the closed-form of a minibatch paradigm. It is the foundation of the sliced Wasserstein distance [Bonnotte, 2013] which is widely used as an alternative to the Wasserstein distance [Liutkus et al., 2019, Kolouri et al., 2016].

We suppose that we have uniform empirical distributions αₙ and βₙ supported on n points each. We assume (without loss of generality) that the points are ordered within their own distribution, i.e., x₁ ≤ … ≤ xₙ and y₁ ≤ … ≤ yₙ. In such a case, we can compute the 1D Wasserstein-1 distance with cost |x − y| as W₁(αₙ, βₙ) = (1/n) Σᵢ |xᵢ − yᵢ|, and the OT matrix is simply the identity matrix scaled by 1/n (see [Peyré and Cuturi, 2019] for more details). After a short combinatorial computation (given in appendix A.5), the coefficients of the 1D minibatch transportation matrix can be computed as:

 (Π^m)_{j,k} = (1/m) binom(n,m)^{-2} Σ_{i=i_min}^{i_max} binom(j−1, i−1) binom(k−1, i−1) binom(n−j, m−i) binom(n−k, m−i)  (8)

where i_min = max(1, m+j−n, m+k−n) and i_max = min(j, k, m). The binomial coefficients binom(j−1, i−1) and binom(k−1, i−1) encode the sorting constraints: sample j (resp. k) must be the i-th smallest element of its batch.
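This closed form can be cross-checked against direct enumeration. The sketch below (pure Python, illustrative names) implements our reading of Eq. (8) — the product of the binomial counts of batches in which sample j (resp. k) is the i-th smallest — and compares it to the brute-force average of sorted-matching plans over all batch pairs:

```python
import itertools
from math import comb

def plan_1d_closed_form(n, m):
    """Minibatch OT matrix for sorted 1D uniform samples, Eq. (8) as we read it:
    (1/m) binom(n,m)^-2 sum_i C(j-1,i-1) C(k-1,i-1) C(n-j,m-i) C(n-k,m-i)."""
    norm = m * comb(n, m) ** 2
    P = [[0.0] * n for _ in range(n)]
    for j in range(1, n + 1):
        for k in range(1, n + 1):
            lo = max(1, m + j - n, m + k - n)
            hi = min(j, k, m)
            s = sum(comb(j - 1, i - 1) * comb(k - 1, i - 1)
                    * comb(n - j, m - i) * comb(n - k, m - i)
                    for i in range(lo, hi + 1))
            P[j - 1][k - 1] = s / norm
    return P

def plan_1d_brute_force(n, m):
    """Average, over all batch pairs, of the sorted-matching plan (entries 1/m)."""
    P = [[0.0] * n for _ in range(n)]
    batches = list(itertools.combinations(range(n), m))
    for A in batches:
        for B in batches:
            for i in range(m):
                P[A[i]][B[i]] += 1.0 / m  # i-th smallest of A -> i-th smallest of B
    w = len(batches) ** 2
    return [[v / w for v in row] for row in P]
```

The rows of the resulting matrix sum to 1/n, consistent with Proposition 1.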

We show on the first row of Figure 1 the minibatch OT matrices for different values of the minibatch size m. We also provide on the second row of the figure a plot of the mass distribution in several rows of the matrix. We give the matrices for entropic and quadratic regularized OT for comparison purposes. It is clear from the figure that the OT matrix densifies when m decreases, which has a similar effect as entropic regularization. Note the more localized spread of mass of quadratic regularization, which conserves sparsity as discussed in [Blondel et al., 2018]. While the entropic regularization spreads the mass in a similar manner for all samples, minibatch OT spreads less mass on samples at the extremities. Note that the minibatch OT matrix is computed for ordered samples and does not depend on the positions of the samples once ordered, as opposed to the regularized OT methods. This will be better illustrated in the next example.

##### Minibatch Wasserstein in 2D

We illustrate the OT matrix between two empirical distributions of 10 samples each in 2D in Figure 2. We use two 2D empirical distributions (point cloud) where the samples have a cluster structure and the samples are sorted w.r.t. their cluster. We can see from the OT matrices in the first row of the figure that the cluster structure is more or less recovered with the regularization effect of the minibatches (and also regularized OT). On the second row one can see the effect of the geometry of the samples on the spread of mass. Similarly to 1D, for Minibatch OT, samples on the border of the simplex cannot spread as much mass as those in the center and have darker rows. This effect is less visible on regularized OT.

### 3.3 Basic properties

We now propose some basic properties for minibatch Wasserstein losses. All properties are proved in the appendix. The first property concerns the transportation plan between the two initial distributions, defined in (6).

###### Proposition 1.

The averaged transportation plan Π^m(α_n, β_n) is an admissible transportation plan between the full input distributions α_n and β_n.

The fact that Π^m is a transportation plan means that, even though it is not optimal, we still perform transportation, similarly to regularized OT. Note that Π^k is not a transportation plan in general for finite k, but we study the asymptotic convergence of its marginals in the next section. Regarding our empirical estimator, when we have iid data, it enjoys the following property:

###### Proposition 2 (Unbiased estimator).

U_h(α_n, β_n) is an unbiased estimator of U_h(α, β) in the continuous setting, and U_h(α_n, β) is an unbiased estimator of U_h(α, β) in the semi-discrete setting.

As we use minibatch OT for loss function, it is of interest to see if it is still a distance on the distribution space such as the Wasserstein distance or the Sinkhorn divergence.

###### Proposition 3 (Positivity and symmetry).

The minibatch Wasserstein losses are positive and symmetric losses. Furthermore, they are convex with respect to their inputs. However, they are not metrics since U_h(α, α) > 0 in general.

The minibatch Wasserstein losses inherit some properties from the Wasserstein distance, but the minibatch procedure leads to strictly positive losses even for unbiased base losses such as the Sinkhorn divergence or the Wasserstein distance. This breaks the fundamental separation axiom. Remarkably, the Sinkhorn divergence was introduced in the literature to correct the bias from the entropic regularization, and interestingly it was used in practice in GAN experiments with a minibatch strategy, which reintroduced a bias.

Whether removing the bias by following the same idea as the Sinkhorn divergence leads to a positive loss is an open question left to future work. An important parameter is the value of the minibatch size m. We remark that the minibatch procedure allows us to interpolate between OT, when m = n, and an averaged pairwise distance, when m = 1. The value of m will also be important for the convergence of our estimator. Let us now review the asymptotic properties of our estimators.

### 3.4 Asymptotic convergence

We are now interested in the asymptotic behavior of our estimator and its deviation from U_h(α, β). We will give a deviation bound between our subsampled estimator and the expectation of our estimator. This result is given in the continuous setting, but a similar result holds for the semi-discrete setting with the same proof. We will give a bound with respect to both n and k.

###### Theorem 1 (Maximal deviation bound).

Let m ≤ n and δ ∈ (0, 1) be fixed, and consider two distributions α, β with bounded support and a kernel h bounded by M_h. We have a deviation bound between Ũ_h^k(α_n, β_n) and U_h(α, β) depending on the number of empirical data n and the number of batches k: with probability at least 1 − 2δ on the draw of X₁, …, Xₙ ∼ α, Y₁, …, Yₙ ∼ β and D_k, we have:

 |Ũ_h^k(α_n, β_n) − U_h(α, β)| ≤ M_h √(m log(1/δ) / (2n)) + 2 M_h √(2 log(1/δ) / k)  (9)

where M_h depends on the kernel h and scales at most as the bound M on the ground cost.

The proof is based on two quantities obtained from the triangle inequality. The first is the difference between U_h(α_n, β_n) and its expectation U_h(α, β); U_h(α_n, β_n) is a two-sample U-statistic, for which a deviation bound in probability is available [Hoeffding, 1963]. The second is the difference between Ũ_h^k(α_n, β_n) and its conditional expectation U_h(α_n, β_n); conditionally on the data, Ũ_h^k is an average of k bounded iid terms, and Hoeffding's inequality yields the dependence on k.

This deviation bound shows that if we increase the number of data n and the number of batches k while keeping the minibatch size m fixed, we get closer to the expectation. We will investigate the dependence on n and k in different scenarios in the numerical experiments. Remarkably, the bound does not depend on the dimension of the data space, which is an appealing property when optimizing in high dimension.

As discussed before, an interesting output of minibatch Wasserstein is the minibatch OT matrix Π^m, but since it is hard to compute in practice, we investigate the error on the marginal constraints of Π^k. In what follows, we denote by Π^{(i)} the i-th row of a matrix Π and by 1 the vector whose entries are all equal to 1.

###### Theorem 2 (Distance to marginals).

Let m ≤ n and k ≥ 1, and consider two distributions α, β. For all 1 ≤ i ≤ n and all δ ∈ (0, 1), with probability at least 1 − δ on the draw of the data and D_k, we have:

 |Π^k(α_n, β_n)^{(i)} 1 − 1/n| ≤ √(2 log(2/δ) / k)  (10)


The proof uses the convergence of Π^k to Π^m and the fact that Π^m is a transportation plan and hence respects the marginal constraints.
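This marginal defect is easy to probe numerically. A 1D sketch (sorted matching as the per-batch OT; names and parameters are illustrative) builds Π^k and measures the quantity bounded in Theorem 2:

```python
import random

def subsampled_plan_1d(n, m, k, seed=0):
    """Pi^k of Eq. (7) for sorted 1D samples: average over k random batch
    pairs of the sorted-matching plan, whose nonzero entries are 1/m."""
    rng = random.Random(seed)
    P = [[0.0] * n for _ in range(n)]
    for _ in range(k):
        A = sorted(rng.sample(range(n), m))
        B = sorted(rng.sample(range(n), m))
        for i in range(m):
            P[A[i]][B[i]] += 1.0 / (m * k)
    return P

def marginal_deviation(P, n):
    """max_i |row-sum_i - 1/n|, the deviation bounded in Theorem 2."""
    return max(abs(sum(row) - 1.0 / n) for row in P)
```

For k = 1 a row either carries mass 1/m or none at all, so the deviation is large; averaging over many batch pairs shrinks it at the O(1/√k) rate of the theorem.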

### 3.5 Optimization properties

In this section we review the optimization properties of the minibatch OT losses, to ensure the convergence of our loss functions within modern optimization frameworks. We study a standard parametric data fitting problem. Given some discrete samples α_n from some unknown distribution, we want to fit a parametric model β_λ to α_n using the mini-batch Wasserstein distance, for λ in a set Λ of a Euclidean space.

 minλ∈ΛUh(αn,βλ) (11)

Such problems are semi-discrete OT problems, because one of the distributions is continuous while the other is discrete. For instance, generative models fall under the scope of such problems [Genevay et al., 2018], also known as minimal Wasserstein estimation. As we have an expectation over one of the distributions, we would like to use a stochastic gradient descent (SGD) strategy to minimize the problem. Using SGD for their method, [Genevay et al., 2018] observed that it worked well in practice and obtained meaningful results with minibatches. However, it is well known that the empirical Wasserstein distance is a biased estimator of the Wasserstein distance between the true distributions and leads to biased gradients, as discussed in [Bellemare et al., 2017]; hence SGD might fail. The goal of this section is to prove that, unlike the full Wasserstein distance, the minibatch strategy does not suffer from biased gradients.
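As a toy instance of problem (11), one can fit a 1D location model (samples z_j + λ) to data by SGD on minibatch gradients; in 1D the optimal plan is the sorted matching and, by Danskin's theorem, the gradient flows through the fixed plan. The sketch below uses the squared cost for smooth gradients; the function names and hyperparameters are our illustrative choices:

```python
import random

def w2sq_grad_shift(xb, zb, lam):
    """Gradient wrt lam of the squared-W2 loss between batch xb and the
    shifted model batch zb + lam; the sorted matching is the optimal plan."""
    m = len(xb)
    return sum(2.0 * (z + lam - x) for x, z in zip(sorted(xb), sorted(zb))) / m

def fit_shift(xs, zs, m, lr=0.01, steps=4000, seed=0):
    """Minimal Wasserstein estimation by SGD on one minibatch pair per step."""
    rng = random.Random(seed)
    lam = 0.0
    for _ in range(steps):
        lam -= lr * w2sq_grad_shift(rng.sample(xs, m), rng.sample(zs, m), lam)
    return lam
```

Because the minibatch gradients are unbiased, the iterates fluctuate around the true shift rather than drifting away from it.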

As stated in Proposition 2, we enjoy an unbiased estimator. However, the original Wasserstein distance is not differentiable; hence we will, from now on, only consider the entropic loss and the Sinkhorn divergence, which are differentiable.

###### Theorem 3 (Exchange of Gradient and expectation ).

Let us suppose that we have two distributions α_n and β_λ on two bounded subsets X and Y, a differentiable cost c, and that the parametrized data Y_λ is differentiable with respect to λ. Then we are allowed to exchange gradient and expectation when h is the entropic loss or the Sinkhorn divergence:

 ∇_λ E_{Y_λ ∼ β_λ^{⊗m}} [h(A, Y_λ)] = E_{Y_λ ∼ β_λ^{⊗m}} [∇_λ h(A, Y_λ)]  (12)

The proof relies on the differentiation lemma. Contrary to the full Wasserstein distance, we proved that the minibatch OT losses do not suffer from biased gradients and this justifies the use of SGD to optimize the problem.

## 4 Experiments

In this section, we illustrate the behavior of minibatch Wasserstein. We use it as a loss function for generative models, gradient flows and color transfer experiments. For our experiments, we relied on the POT package [Flamary and Courty, 2017] to compute the exact OT solver or the entropic OT loss, and on the GeomLoss package [Feydy et al., 2019] for the Sinkhorn divergence. The generative model and gradient flow experiments were implemented in PyTorch [Paszke et al., 2017], and all the code will be released upon publication.

### 4.1 Minibatch Wasserstein generative networks

We illustrate the use of the minibatch Wasserstein loss for generative modeling [Goodfellow et al., 2014]. The goal is to learn a generative model producing data close to the target data. We draw 8000 points following 8 different Gaussian modes (1000 points per mode) in 2D, where the modes form a circle. After generating the data, we use a minibatch Wasserstein distance and a minibatch Sinkhorn divergence as loss functions with a squared Euclidean cost, and compare them to WGAN [Arjovsky et al., 2017] and its variant with gradient penalty, WGAN-GP [Gulrajani et al., 2017].

We use a standard Gaussian noise in a latent space of dimension 10, and the generator is designed as a simple multilayer perceptron with 2 hidden layers of respectively 128 and 32 units with ReLU activation functions, and one final layer with 2 output neurons. For the different OT losses, the generator is trained with the same learning rate, equal to 0.05, with the Adam optimizer. For the Sinkhorn divergence we set ε to 0.01. For WGAN and WGAN-GP we train a discriminator with the same hidden layers as the generator. We update the discriminator 5 times before each update of the generator. WGAN is trained with the RMSprop optimizer and WGAN-GP with the Adam optimizer, as done in their original papers; the learning rate is the same for both. WGAN-GP has a gradient penalty parameter set to 10. All models are trained for 30000 iterations with a batch size of 100. Our minibatch OT losses use k = 1, which means that we compute the stochastic gradient on only one minibatch; a larger k was not needed to get meaningful results.

We show the estimated 2D distributions in Figure 4. For the same architecture, it seems that MB Wasserstein trains better generators than WGAN and WGAN-GP. This could come from the fact that MB Wasserstein minimizes a complex but well-posed objective function (with the squared Euclidean cost), while WGAN still needs to solve a minmax problem, making convergence more difficult, especially on this 2D problem.

### 4.2 Minibatch Wasserstein gradient flow

For a given target distribution β, the purpose of gradient flows is to model a distribution which at each iteration follows the gradient direction of the loss [Peyré, 2015, Liutkus et al., 2019]. The gradient flow simulates the non-parametric setting of the data fitting problem. In this setting, the modeled distribution is parametrized by a vector x, the position vector that encodes its support.

We follow the same procedure as in [Feydy et al., 2019]. The original gradient flow algorithm uses an Euler scheme: starting from an initial distribution at time t = 0, at each iteration we integrate the ODE

 ˙x(t)=−∇xF(x(t)).

In our case, we cannot compute the gradient directly from our minibatch OT losses. As the OT loss inputs are distributions, we have an inherent bias when we compute the gradient from the weights of the samples. To correct this bias, we multiply the gradient by the inverse weight m. Finally, for each sample we integrate:

 ẋ(t) = −m ∇_x [Ũ_h^k(α_n, β_n)](x(t))  (13)

We recall that, due to the inherent bias of the minibatch loss, the final solution cannot exactly match the target distribution.
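A toy version of this Euler scheme on 2D point clouds, with k = 1 and brute-force per-batch assignments (only viable for tiny m; names, step size, and the squared cost are our illustrative choices):

```python
import itertools
import random

def best_assignment(A, B):
    """Brute-force optimal assignment between equal-size 2D batches (squared cost)."""
    m = len(A)
    cost = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return min(itertools.permutations(range(m)),
               key=lambda s: sum(cost(A[i], B[s[i]]) for i in range(m)))

def gradient_flow_step(X, Y, m, lr, rng):
    """One Euler step of Eq. (13) with k = 1: the factor m cancels the 1/m
    batch weight, so each selected particle moves by -lr * 2 * (x - y_match)."""
    idx = rng.sample(range(len(X)), m)
    B = rng.sample(Y, m)
    A = [X[i] for i in idx]
    sigma = best_assignment(A, B)
    for pos, i in enumerate(idx):
        tx, ty = B[sigma[pos]]
        x, y = X[i]
        X[i] = (x - lr * 2.0 * (x - tx), y - lr * 2.0 * (y - ty))
```

Iterating this step drives the source cloud toward the target cloud, up to the minibatch bias discussed above.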

The considered data come from the CelebA dataset [Liu et al., 2015]. We use 5000 male images as source data and 5000 female images as target data. We show the evolution of 3 samples from the source data in Figure 3. We use a batch size of 500, a learning rate of 0.05, and run 750 iterations. k did not need to be large and was set to 10 in order to stabilize the gradient flow. We see a natural evolution of the images along the gradient flow, similar to the results obtained in [Liutkus et al., 2019]. Interestingly, the gradient flow with MB Wasserstein in Figure 3 leads to possibly more detailed backgrounds than with MB Sinkhorn (provided in the supplementary), probably due to the two layers of regularization in the latter.

### 4.3 Large scale barycentric mapping for color transfer

The purpose of color transfer is to transform the colors of a source image so that they follow the colors of a target image. Optimal transport is a well known method to solve this problem and has been studied before in [Ferradans et al., 2013, Blondel et al., 2018]. Images are represented by point clouds in the RGB color space, identified with [0, 1]³. Then, by computing the transportation plan between the two point clouds, we get a color transfer mapping via a barycentric projection. As the number of pixels might be huge, previous works selected a subset of pixels using k-means clustering on each point cloud. This strategy makes the problem memory-tractable but loses some information. With MB optimal transport, we can compute a barycentric mapping for all pixels in the image by incrementally updating the mapping at each minibatch: when one selects a source batch A and a target batch B, one just needs to update the transformed vectors of the samples in A using the minibatch plan Π^{A,B}. Indeed, to perform the color transfer when we have the full matrix, we compute the matrix product:

 Ys=nsΠk(αn,βn)Xt (14)

that can be computed incrementally by considering restriction to batches (the full algorithm is given in appendix). To the best of our knowledge, it is the first time that a barycentric mapping algorithm has been scaled up to 1M pixel images.

The source image has RGB dimension (943000, 3) and the target image has RGB dimension (933314, 3). For this experiment, we compare the results of the minibatch framework with the Wasserstein distance for several m and k. We used batches of size 10, 100 and 1000. We selected k so as to obtain a good visual quality and observed that a smaller k was needed when using large minibatches. Further experiments showing the dependence on k can be found in the appendix. Also note that MB optimal transport can be computed in parallel and can be greatly sped up on multi-CPU architectures. One can see in Figure 5 the color transfer (in both directions) obtained with our method. We can see that the diversity of colors falls when the batch size is too small, as the entropic solver would do for a large regularization parameter. However, even for 1M pixels, a batch size of 1000 is enough to keep a good diversity.
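A sketch of this incremental update on scalar "colors" (1D, sorted matching as the per-batch plan; the function name and the 1/(mk) normalization are our reading of the restriction of Eq. (14) to batches) shows that the full n_s × n_t plan is never materialized:

```python
import random

def incremental_barycentric_map(Xs, Xt, m, k, seed=0):
    """Accumulate n_s * Pi^k(alpha_n, beta_n) @ Xt batch pair by batch pair.
    Each sampled pair contributes a sorted-matching plan with entries 1/m."""
    rng = random.Random(seed)
    n_s = len(Xs)
    Ys = [0.0] * n_s
    for _ in range(k):
        A = rng.sample(range(n_s), m)
        B = rng.sample(range(len(Xt)), m)
        A = sorted(A, key=lambda i: Xs[i])  # ranks within the source batch
        B = sorted(B, key=lambda j: Xt[j])  # ranks within the target batch
        for i, j in zip(A, B):
            # restriction of Eq. (14): n_s * (1/k) * (1/m) * Xt[j] added to row i
            Ys[i] += n_s * Xt[j] / (m * k)
    return Ys
```

Memory usage is O(n_s) for the accumulator plus one small batch plan at a time, which is what makes the million-pixel mapping tractable.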

We also studied empirically the result of Theorem 2: as shown in Figure 6, we recover the O(1/√k) convergence rate on the marginals, with a constant depending on the batch size m. Furthermore, we also empirically studied the computational time and showed that our method is not affected by the number of points, its complexity being fixed by m and k, while an algorithm like Sinkhorn still has a quadratic complexity in n. These experiments show that the minibatch Wasserstein losses are well suited for large scale problems where both memory and computational time are issues.

## 5 Conclusion

In this paper, we studied the impact of using a minibatch strategy to reduce the computational complexity of the Wasserstein distance. We reviewed its basic properties and studied the asymptotic behavior of our estimator. We showed a deviation bound between our subsampled estimator and its expectation. Furthermore, we studied the optimization of our estimator and proved that it enjoys unbiased gradients. Finally, we demonstrated the effect of the minibatch strategy on gradient flow, color transfer and GAN experiments. Future work will focus on the geometry of minibatch Wasserstein (for instance on barycenters) and on investigating a debiasing approach similar to the one used for the Sinkhorn divergence.

#### Acknowledgements

Authors would like to thank Thibault Séjourné and Jean Feydy for fruitful discussions. This work is partially funded through the project OATMIL ANR-17-CE23-0012 of the French National Research Agency (ANR).

## References

• [Arjovsky et al., 2017] Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning.
• [Bassetti et al., 2006] Bassetti, F., Bodini, A., and Regazzini, E. (2006). On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76.
• [Bellemare et al., 2017] Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., and Munos, R. (2017). The Cramér distance as a solution to biased Wasserstein gradients. CoRR, abs/1705.10743.
• [Bernton et al., 2017] Bernton, E., Jacob, P., Gerber, M., and Robert, C. (2017). Inference in generative models using the Wasserstein distance. working paper or preprint.
• [Blondel et al., 2018] Blondel, M., Seguy, V., and Rolet, A. (2018). Smooth and sparse optimal transport. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics.
• [Bonneel et al., 2011] Bonneel, N., van de Panne, M., Paris, S., and Heidrich, W. (2011). Displacement interpolation using lagrangian mass transport. In Proceedings of the 2011 SIGGRAPH Asia Conference, New York, NY, USA.
• [Bonnotte, 2013] Bonnotte, N. (2013). Unidimensional and Evolution Methods for Optimal Transportation. PhD thesis, Université de Paris-Sud.
• [Bunne et al., 2019] Bunne, C., Alvarez-Melis, D., Krause, A., and Jegelka, S. (2019). Learning generative models across incomparable spaces. In Proceedings of the 36th International Conference on Machine Learning.
• [Clémençon, 2011] Clémençon, S. J. (2011). On U-processes and clustering performance. In Advances in Neural Information Processing Systems.
• [Clémençon et al., 2016] Clémençon, S., Colin, I., and Bellet, A. (2016). Scaling-up empirical risk minimization: Optimization of incomplete U-statistics. Journal of Machine Learning Research.
• [Clémençon et al., 2008] Clémençon, S., Lugosi, G., Vayatis, N., et al. (2008). Ranking and empirical minimization of u-statistics. The Annals of Statistics.
• [Clémençon et al., 2013] Clémençon, S., Robbiano, S., and Tressou, J. (2013). Maximal deviations of incomplete u-statistics with applications to empirical risk sampling. In Proceedings of the 2013 SIAM International Conference on Data Mining.
• [Courty et al., 2017] Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. (2017). Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
• [Cuturi, 2013] Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26.
• [Damodaran et al., 2018] Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., and Courty, N. (2018). DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation. In ECCV 2018 - 15th European Conference on Computer Vision. Springer.
• [Ferradans et al., 2013] Ferradans, S., Papadakis, N., Rabin, J., Peyré, G., and Aujol, J.-F. (2013). Regularized discrete optimal transport. In Scale Space and Variational Methods in Computer Vision. Springer Berlin Heidelberg.
• [Feydy et al., 2019] Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S.-i., Trouve, A., and Peyré, G. (2019). Interpolating between optimal transport and mmd using sinkhorn divergences. In Proceedings of Machine Learning Research.
• [Flamary and Courty, 2017] Flamary, R. and Courty, N. (2017). Pot python optimal transport library.
• [Frogner et al., 2015] Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. (2015). Learning with a wasserstein loss. In Advances in Neural Information Processing Systems 28.
• [Genevay et al., 2019] Genevay, A., Chizat, L., Bach, F., Cuturi, M., and Peyré, G. (2019). Sample complexity of sinkhorn divergences. In Proceedings of Machine Learning Research.
• [Genevay et al., 2016] Genevay, A., Cuturi, M., Peyré, G., and Bach, F. (2016). Stochastic optimization for large-scale optimal transport. In Advances in neural information processing systems.
• [Genevay et al., 2018] Genevay, A., Peyre, G., and Cuturi, M. (2018). Learning generative models with sinkhorn divergences. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics.
• [Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems 27.
• [Gretton et al., 2012] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13.
• [Gulrajani et al., 2017] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of wasserstein gans. In Advances in Neural Information Processing Systems 30.
• [Hoeffding, 1963] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association.
• [Hull, 1994] Hull, J. (1994). Database for handwritten text recognition research. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 16.
• [J Lee, 2019] Lee, A. J. (2019). U-Statistics: Theory and Practice. SERBIULA (sistema Librum 2.0).
• [Kolouri et al., 2016] Kolouri, S., Zou, Y., and Rohde, G. K. (2016). Sliced wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
• [LeCun and Cortes, 2010] LeCun, Y. and Cortes, C. (2010). MNIST handwritten digit database.
• [Liu et al., 2015] Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
• [Liutkus et al., 2019] Liutkus, A., Simsekli, U., Majewski, S., Durmus, A., and Stöter, F.-R. (2019). Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In Proceedings of the 36th International Conference on Machine Learning.
• [Bińkowski et al., 2018] Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. (2018). Demystifying MMD GANs. In International Conference on Learning Representations.
• [Papa et al., 2015] Papa, G., Clémençon, S., and Bellet, A. (2015). Sgd algorithms based on incomplete u-statistics: Large-scale minimization of empirical risk. In Advances in Neural Information Processing Systems 28.
• [Paszke et al., 2017] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch.
• [Patrini et al., 2019] Patrini, G., van den Berg, R., Forré, P., Carioni, M., Bhargav, S., Welling, M., Genewein, T., and Nielsen, F. (2019). Sinkhorn autoencoders. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence.
• [Peyré, 2015] Peyré, G. (2015). Entropic approximation of wasserstein gradient flows. SIAM Journal on Imaging Sciences.
• [Peyré and Cuturi, 2019] Peyré, G. and Cuturi, M. (2019). Computational optimal transport. Foundations and Trends® in Machine Learning.
• [Seguy et al., 2018] Seguy, V., Damodaran, B. B., Flamary, R., Courty, N., Rolet, A., and Blondel, M. (2018). Large-scale optimal transport and mapping estimation. In International Conference on Learning Representations (ICLR).
• [Sommerfeld et al., 2019] Sommerfeld, M., Schrieber, J., Zemel, Y., and Munk, A. (2019). Optimal transport: Fast probabilistic approximation with exact solvers. Journal of Machine Learning Research.
• [Weed and Bach, 2019] Weed, J. and Bach, F. (2019). Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli.
• [Wu et al., 2019] Wu, J., Huang, Z., Acharya, D., Li, W., Thoma, J., Paudel, D. P., and Gool, L. V. (2019). Sliced wasserstein generative models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Learning with minibatch Wasserstein: asymptotic and gradient properties

Supplementary material

##### Outline.

The supplementary material of this paper is organized as follows:

• In section A, we first review the formalism with definitions, proofs of the basic properties, statistical proofs and optimization proofs. Then we give details about the 1D case.

• In section B, we give extra experiments on domain adaptation, on minibatch Wasserstein gradient flows in 2D and on the CelebA dataset, and finally on color transfer.

## Appendix A Formalism

In what follows, without loss of generality and in order to simplify the notations, we will work with the cost matrix $C$.

### a.1 Definitions

We start by giving the formal definitions for the transportation plan $\Pi_m$.

###### Definition 4 (Mini-batch Transport).

Let $A \in P_m(\alpha_n)$ and $B \in P_m(\beta_n)$ be two sets of $m$ samples. We denote by $\Pi_0^{A,B}$ an optimizer of the corresponding optimal transport problem. Formally,

$$\Pi_0^{A,B} = \mathop{\mathrm{argmin}}_{\Pi \in U(A,B)} \langle \Pi, C_{|A,B} \rangle \tag{15}$$

where $C_{|A,B}$ is the matrix extracted from $C$ by considering the elements whose lines (resp. columns) belong to $A$ (resp. $B$).

For two sets $A$ and $B$ we denote by $\Pi^{A,B}$ the matrix

$$\Pi^{A,B} = \Big(\Pi_0^{A,B}(i,j)\,\mathbb{1}_A(i)\,\mathbb{1}_B(j)\Big)_{(i,j) \in \alpha_n \times \beta_n} \tag{16}$$
###### Definition 5 (Averaged mini-batch transport).

We define the empirical averaged mini-batch transport matrix $\Pi_m$ by the formula

$$\Pi_m := \frac{1}{\binom{n}{m}^2} \sum_{A \in P_m(\alpha_n)} \sum_{B \in P_m(\beta_n)} \Pi^{A,B} \tag{17}$$

Moreover, we can define the averaged Wasserstein distance over all mini-batches as:

$$U_W(\alpha_n, \beta_n) = \langle \Pi_m, C \rangle \tag{18}$$
###### Remark 1.

Note that this construction is consistent with the definition of $U_W$.
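As a concrete illustration, the complete estimator of Definition 5 can be computed by brute force on tiny problems. The sketch below is ours, not from the paper: it uses 1D samples with cost $|x-y|$, for which each mini-batch OT problem has the closed-form sorted-matching solution, and the helper names (`w1`, `minibatch_w1`) are hypothetical.

```python
from itertools import combinations

def w1(xs, ys):
    """Exact 1D Wasserstein-1 between two uniform empirical
    distributions of equal size: match sorted samples."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def minibatch_w1(x, y, m):
    """Complete minibatch estimator U_W: average the OT cost
    over ALL pairs of size-m sub-samples (eqs. 17-18)."""
    pairs = [(A, B)
             for A in combinations(x, m)
             for B in combinations(y, m)]
    return sum(w1(A, B) for A, B in pairs) / len(pairs)

x = [0.0, 0.1, 0.5, 0.9, 1.0]
y = [0.2, 0.3, 0.4, 0.8, 1.1]

full = w1(x, y)
mb = minibatch_w1(x, y, m=2)
# U_W upper-bounds the full OT cost: the averaged plan Pi_m is a
# feasible coupling (Proposition 4) but not the optimal one.
assert mb >= full
# Loss of the distance property: U_W(alpha_n, alpha_n) > 0.
assert minibatch_w1(x, x, m=2) > 0 and w1(x, x) == 0
```

The last two assertions illustrate both behaviors discussed in the paper: averaging over mini-batches overestimates the exact cost, and the minibatch loss of a distribution against itself is strictly positive.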

### a.2 Basic properties

###### Proposition 4.

$\Pi_m$ is a transportation plan between the empirical distributions $\alpha_n$ and $\beta_n$.

###### Proof.

We need to verify that the marginal constraints hold, i.e., that the sum over any row (resp. column) is equal to $1/n$. Without loss of generality, we fix a source sample (or row) $i_0$. A simple combinatorial argument gives that $\sum_{A \in P_m(\alpha_n)} \mathbb{1}_A(i_0) = \binom{n-1}{m-1}$. Now we are ready to sum over the row $i_0$.

$$\begin{aligned}
\sum_{j=1}^{n} \Pi_m(i_0,j) &= \frac{1}{\binom{n}{m}\binom{n}{m}} \sum_{j=1}^{n} \sum_{A \in P_m(\alpha_n)} \sum_{B \in P_m(\beta_n)} \Pi^{A,B}(i_0,j) && (19) \\
&= \frac{1}{\binom{n}{m}\binom{n}{m}} \sum_{B \in P_m(\beta_n)} \sum_{j=1}^{n} \Pi_0^{A,B}(i_0,j)\,\mathbb{1}_B(j) \sum_{A \in P_m(\alpha_n)} \mathbb{1}_A(i_0) && (20) \\
&= \frac{1}{\binom{n}{m}\binom{n}{m}} \sum_{B \in P_m(\beta_n)} \underbrace{\sum_{j=1}^{n} \Pi_0^{A,B}(i_0,j)\,\mathbb{1}_B(j)}_{=1/m} \binom{n-1}{m-1} && (21) \\
&= \frac{1}{\binom{n}{m}\binom{n}{m}} \binom{n}{m} \frac{1}{m} \binom{n-1}{m-1} && (22) \\
&= \frac{1}{n} && (23)
\end{aligned}$$

The argument is similar for the summation over any column. ∎
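The combinatorial argument above can be checked numerically on a tiny instance: build $\Pi_m$ explicitly by averaging the mini-batch optimal plans over all pairs of size-$m$ index subsets, and verify that every row and column sums to $1/n$. This is an illustrative sketch of ours (not the paper's code), using sorted 1D points so that each mini-batch plan is a sorted matching with mass $1/m$ per matched pair.

```python
from itertools import combinations
from math import comb, isclose

n, m = 5, 2
x = [0.0, 0.3, 0.5, 0.7, 1.0]   # source samples (sorted)
y = [0.1, 0.2, 0.6, 0.8, 0.9]   # target samples (sorted)

Pi_m = [[0.0] * n for _ in range(n)]
for A in combinations(range(n), m):
    for B in combinations(range(n), m):
        # 1D OT between sorted sub-samples: match the i-th smallest
        # index of A with the i-th smallest of B, mass 1/m each.
        for i, j in zip(sorted(A), sorted(B)):
            Pi_m[i][j] += 1.0 / m
total_pairs = comb(n, m) ** 2
Pi_m = [[v / total_pairs for v in row] for row in Pi_m]

# Every row (and column) marginal equals 1/n, as proved above.
assert all(isclose(sum(row), 1 / n) for row in Pi_m)
assert all(isclose(sum(Pi_m[i][j] for i in range(n)), 1 / n)
           for j in range(n))
```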

###### Remark 2 (Positivity, symmetry and bias).

The quantity $U_h$ is positive and symmetric, but $U_h(\alpha_n, \alpha_n)$ is also strictly positive, i.e. $U_h(\alpha_n, \alpha_n) > 0$. Indeed,

$$\begin{aligned}
U_h(\alpha_n,\alpha_n) &:= \frac{1}{\binom{n}{m}^2} \sum_{A \in P_m(\alpha_n)} \sum_{A' \in P_m(\alpha_n)} h(A,A') && (24) \\
&\geq \frac{1}{\binom{n}{m}^2} \sum_{(A,A') \in P_m(\alpha_n)^2,\, A \neq A'} h(A,A') > 0 && (25)
\end{aligned}$$
###### Proposition 5 (Convexity).

The minibatch OT losses are convex with respect to their inputs.

###### Proof.

Here, for $A \in P_m(\alpha_n)$ and $B \in P_m(\beta_n)$ we denote, for a vector $x$, by $x_{|A}$ the vector such that $(x_{|A})_i = x_i$ if $i \in A$ and $(x_{|A})_i = 0$ otherwise. Viewing the probability distributions as vectors, it is enough to see, since convexity is preserved when summing convex functions and pre-composing them with linear functions, that for each $A$ and $B$ the restriction map $(\alpha_n, \beta_n) \mapsto (\alpha_{n|A}, \beta_{n|B})$ is linear and that the map $(a,b) \mapsto h(a,b)$ is convex. The latter is well known and can be derived immediately thanks to the convexity of the set of transport plans. ∎

###### Remark 3 (Jointly convexity).

The minibatch Wasserstein loss and the minibatch entropic loss are jointly convex with respect to their inputs, because the underlying OT loss is itself jointly convex for every $\varepsilon \geq 0$. However, the minibatch Sinkhorn divergence is convex with respect to $\alpha_n$ and with respect to $\beta_n$, but not jointly convex when $\varepsilon > 0$.

### a.3 Statistical proofs

Note that because the distributions $\alpha$ and $\beta$ are compactly supported, there exists a constant $M_W > 0$ such that $C(x,y) \leq M_W$ for any $(x,y)$ in the supports. We define the following quantity $M_h$, depending on the kernel $h$:

$$M_h := \begin{cases} M_W & \text{if } h = W, \\ M_W + \varepsilon(\log(m^2)+1) & \text{if } h = W_\varepsilon, \\ 2\big(M_W + \varepsilon(\log(m^2)+1)\big) & \text{if } h = S_\varepsilon. \end{cases} \tag{26}$$
###### Lemma 1 (Upper bounds).

Let $A \in P_m(\alpha_n)$ and $B \in P_m(\beta_n)$. We have the following bound for any of the kernels $h$ defined above:

$$|h(A,B)| \leq 2 M_h \tag{27}$$
###### Proof.

We start with the case $h = W$. Note that with our choice of cost matrix one has $0 \leq C_{ij} \leq M_W$. We have, for a transport plan $\Pi$ between $A$ and $B$ (with respect to the cost matrix $C_{|A,B}$),

$$|\langle \Pi, C_{|A,B} \rangle| \leq \sum_{1 \leq i,j \leq m} (C_{|A,B})_{ij}\, \Pi_{i,j} \leq M_W \sum_{1 \leq i,j \leq m} \Pi_{i,j} = M_W$$

Hence, $|h(A,B)| \leq M_W \leq 2M_h$.
Suppose now that $h = W_\varepsilon$ for an $\varepsilon > 0$. Let us denote by $H(\Pi)$ the Shannon entropy of the discrete probability distribution $\Pi$. Using the classical fact $0 \leq H(\Pi) \leq \log(m^2)$, one estimates, for a transport plan $\Pi$:

$$|\langle \Pi, C_{|A,B} \rangle + \varepsilon H(\Pi)| \leq M_W + \varepsilon (H(\Pi) + 1) \leq M_W + \varepsilon (\log(m^2) + 1) \leq 2 M_h$$

which gives the intended bound by definition of $M_h$. Lastly, for $h = S_\varepsilon$, since it is essentially the sum of three terms of the form $\langle \Pi, C \rangle + \varepsilon H(\Pi)$, one can conclude. ∎

Proof of Theorem 1. We now give the details of the proof of Theorem 1.

###### Lemma 2 (U-statistics concentration bound).

Let $m \leq n$ and $\delta \in (0,1)$ be fixed. We have a concentration bound between $U_h(\alpha_n,\beta_n)$ and the expectation $U_h(\alpha,\beta)$ over minibatches, depending on the number $n$ of empirical data which follow $\alpha$ and $\beta$:

$$|U_h(\alpha_n,\beta_n) - U_h(\alpha,\beta)| \leq M_h \sqrt{\frac{m \log(2/\delta)}{2n}} \tag{28}$$

with probability at least $1 - \delta$.

###### Proof.

$U_h(\alpha_n,\beta_n)$ is a two-sample U-statistic and $U_h(\alpha,\beta)$ is its expectation, as $\alpha_n$ and $\beta_n$ are built from i.i.d. random variables. $U_h(\alpha_n,\beta_n)$ is a sum of dependent variables, but Hoeffding found a way to rewrite it as a sum of independent random variables. He then applied his inequality to this transformation and deduced the bound. The proof can be found in [Hoeffding, 1963, Section 5] (the two-sample U-statistic case is discussed in 5.b). ∎

###### Lemma 3 (Deviation bound).

Let $\delta \in (0,1)$ and $k \geq 1$. We have a deviation bound between $\tilde{U}_h^k(\alpha_n,\beta_n)$ and $U_h(\alpha_n,\beta_n)$ depending on the number of batches $k$:

$$|\tilde{U}_h^k(\alpha_n,\beta_n) - U_h(\alpha_n,\beta_n)| \leq M_h \sqrt{\frac{2\log(2/\delta)}{k}} \tag{29}$$

with probability at least $1 - \delta$.

###### Proof.

First note that $\tilde{U}_h^k(\alpha_n,\beta_n)$ is an incomplete U-statistic associated with $U_h(\alpha_n,\beta_n)$. Let us consider the sequence of random variables $(\varepsilon_l^{A,B})_{1 \leq l \leq k}$ such that $\varepsilon_l^{A,B}$ is equal to $1$ if the pair $(A,B)$ has been selected at the $l$-th draw and $0$ otherwise. By construction of $\tilde{U}_h^k$, the aforementioned sequence is an i.i.d. sequence of random vectors and the $\varepsilon_l^{A,B}$ are Bernoulli random variables of parameter $1/\binom{n}{m}^2$. We then have

$$\tilde{U}_h^k(\alpha_n,\beta_n) - U_h(\alpha_n,\beta_n) = \frac{1}{k} \sum_{l=1}^{k} \omega_l \tag{30}$$

where $\omega_l := \sum_{A \in P_m(\alpha_n)} \sum_{B \in P_m(\beta_n)} \varepsilon_l^{A,B}\, h(A,B) - U_h(\alpha_n,\beta_n)$. Conditioned upon $X$ and $Y$, the variables $\omega_l$ are independent, centered and bounded thanks to Lemma 1. Using Hoeffding's inequality yields

$$\begin{aligned}
\mathbb{P}\Big(|\tilde{U}_h^k(\alpha_n,\beta_n) - U_h(\alpha_n,\beta_n)| > \varepsilon\Big) &= \mathbb{E}\Big[\mathbb{P}\Big(|\tilde{U}_h^k(\alpha_n,\beta_n) - U_h(\alpha_n,\beta_n)| > \varepsilon \,\Big|\, X, Y\Big)\Big] && (31) \\
&= \mathbb{E}\Big[\mathbb{P}\Big(\Big|\frac{1}{k}\sum_{l=1}^{k} \omega_l\Big| > \varepsilon \,\Big|\, X, Y\Big)\Big] && (32) \\
&\leq \mathbb{E}\Big[2 e^{-\frac{k\varepsilon^2}{2M_h^2}}\Big] = 2 e^{-\frac{k\varepsilon^2}{2M_h^2}} && (33)
\end{aligned}$$

which concludes the proof. ∎
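The incomplete U-statistic analyzed above can be simulated directly: draw $k$ sub-sample pairs uniformly at random and average, then compare with the complete U-statistic. A hedged sketch of ours (function names are hypothetical; exact 1D OT via sorted matching, so the $O(1/\sqrt{k})$ deviation of Lemma 3 can be observed):

```python
import random
from itertools import combinations

def w1(xs, ys):
    # Exact 1D Wasserstein-1 for equal-size uniform samples.
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def complete_U(x, y, m):
    # Complete two-sample U-statistic: all pairs of m-subsets.
    vals = [w1(A, B)
            for A in combinations(x, m)
            for B in combinations(y, m)]
    return sum(vals) / len(vals)

def incomplete_U(x, y, m, k, rng):
    # Incomplete U-statistic: k subset pairs drawn uniformly.
    return sum(w1(rng.sample(x, m), rng.sample(y, m))
               for _ in range(k)) / k

rng = random.Random(0)
x = [rng.random() for _ in range(10)]
y = [rng.random() + 0.5 for _ in range(10)]

U = complete_U(x, y, m=3)
U_k = incomplete_U(x, y, m=3, k=5000, rng=rng)
# The deviation between the incomplete and complete estimators
# shrinks at rate O(1/sqrt(k)) (Lemma 3).
assert abs(U_k - U) < 0.05
```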

###### Theorem 4 (Maximal deviation bound).

Let $m \leq n$, $\delta \in (0,1)$ and $k \geq 1$ be fixed. We have a maximal deviation bound between $\tilde{U}_h^k(\alpha_n,\beta_n)$ and the expectation $U_h(\alpha,\beta)$ over minibatches, depending on the number $n$ of empirical data which follow $\alpha$ and $\beta$ and the number of batches $k$:

$$|\tilde{U}_h^k(\alpha_n,\beta_n) - U_h(\alpha,\beta)| \leq M_h \sqrt{\frac{m \log(1/\delta)}{2n}} + M_h \sqrt{\frac{2\log(1/\delta)}{k}} \tag{34}$$

with probability at least $1 - 2\delta$.

###### Proof.

Thanks to Lemmas 2 and 3 and the triangle inequality we get

$$\begin{aligned}
|\tilde{U}_h^k(\alpha_n,\beta_n) - U_h(\alpha,\beta)| &\leq |\tilde{U}_h^k(\alpha_n,\beta_n) - U_h(\alpha_n,\beta_n)| + |U_h(\alpha_n,\beta_n) - U_h(\alpha,\beta)| && (35) \\
&\leq M_h \sqrt{\frac{m \log(1/\delta)}{2n}} + M_h \sqrt{\frac{2\log(1/\delta)}{k}} && (36)
\end{aligned}$$

with probability at least $1 - 2\delta$. ∎

Proof of Theorem 2. We now give the details of the proof of Theorem 2. In what follows, we denote by $\Pi^k(\alpha_n,\beta_n)(i)$ the $i$-th row of the matrix $\Pi^k(\alpha_n,\beta_n)$. Let us denote by $\mathbb{1}$ the vector whose entries are all equal to $1$.

###### Theorem 5 (Distance to marginals).

Let $\delta \in (0,1)$. We have, for all $n$ and all $1 \leq i \leq n$:

$$\Big|\Pi^k(\alpha_n,\beta_n)(i)\,\mathbb{1} - \frac{1}{n}\Big| \leq \sqrt{\frac{2\log(2/\delta)}{k}} \tag{37}$$

with probability at least $1 - \delta$.

###### Proof.

We would like to remind that $\Pi_m$ is a transportation plan between the full input distributions $\alpha_n$ and $\beta_n$ and hence it verifies the marginal constraints, i.e. $\Pi_m(\alpha_n,\beta_n)(i)\,\mathbb{1} = 1/n$. Let us consider the sequence of random variables $(\varepsilon_p^{A,B})_{1 \leq p \leq k}$ such that $\varepsilon_p^{A,B}$ is equal to $1$ if the pair $(A,B)$ has been selected at the $p$-th draw and $0$ otherwise. By construction of $\Pi^k$, the aforementioned sequence is an i.i.d. sequence of random vectors and the $\varepsilon_p^{A,B}$ are Bernoulli random variables of parameter $1/\binom{n}{m}^2$. We then have

$$\Pi^k(\alpha_n,\beta_n)(i)\,\mathbb{1} = \frac{1}{k} \sum_{p=1}^{k} \omega_p \tag{38}$$

where $\omega_p := \sum_{A \in P_m(\alpha_n)} \sum_{B \in P_m(\beta_n)} \varepsilon_p^{A,B}\, \Pi^{A,B}(i)\,\mathbb{1}$. Conditioned upon $X$ and $Y$, the random variables $\omega_p$ are independent and bounded by $1$. Moreover, one can observe that $\mathbb{E}[\omega_p \,|\, X, Y] = \Pi_m(\alpha_n,\beta_n)(i)\,\mathbb{1} = 1/n$. Using Hoeffding's inequality yields

$$\begin{aligned}
\mathbb{P}\Big(\big|\Pi^k(\alpha_n,\beta_n)(i)\,\mathbb{1} - \Pi_m(\alpha_n,\beta_n)(i)\,\mathbb{1}\big| > \varepsilon\Big) &= \mathbb{E}\Big[\mathbb{P}\Big(\Big|\frac{1}{k}\sum_{p=1}^{k}\omega_p - \mathbb{E}\Big[\frac{1}{k}\sum_{p=1}^{k}\omega_p\Big]\Big| > \varepsilon \,\Big|\, X, Y\Big)\Big] && (39) \\
&\leq 2 e^{-2k\varepsilon^2} && (40)
\end{aligned}$$

which concludes the proof. ∎
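Theorem 5 can be illustrated numerically: average $k$ randomly drawn mini-batch plans and check that each row sum stays within the stated radius of $1/n$. A sketch of ours under the same sorted-1D assumption as before (for sorted samples, the optimal mini-batch plan matches the selected indices in increasing order, each pair carrying mass $1/m$):

```python
import random
from math import log, sqrt

rng = random.Random(0)
n, m, k = 10, 3, 4000

# Build the k-sampled averaged plan Pi_k on index space.
Pi_k = [[0.0] * n for _ in range(n)]
for _ in range(k):
    A = sorted(rng.sample(range(n), m))
    B = sorted(rng.sample(range(n), m))
    for i, j in zip(A, B):
        Pi_k[i][j] += 1.0 / (m * k)

delta = 0.05
radius = sqrt(2 * log(2 / delta) / k)   # right-hand side of (37)
max_dev = max(abs(sum(row) - 1.0 / n) for row in Pi_k)
# Each row marginal concentrates around 1/n at rate 1/sqrt(k).
assert max_dev <= radius
```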

### a.4 Optimization

The main goal of this section is to justify the optimization of our minibatch OT losses by giving the proof of Theorem 3. More precisely, we show that for the entropic loss and the Sinkhorn divergence, one can exchange the gradient symbol $\nabla$ and the expectation $\mathbb{E}$. This shows, for example, that a stochastic gradient descent procedure is unbiased and as such legitimate.
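As a toy illustration of why unbiased gradients matter, the sketch below (ours, not the paper's setup) fits a translation parameter $\lambda$ by SGD on minibatch squared 1D Wasserstein-2 losses. For sorted matching the per-batch gradient is $2(\lambda + \bar{y}_B - \bar{x}_A)$, whose conditional expectation equals the gradient of the averaged loss, so SGD drifts to the optimum $\lambda^* = \mathrm{mean}(x) - \mathrm{mean}(y)$.

```python
import random

rng = random.Random(0)
x = [rng.gauss(3.0, 1.0) for _ in range(200)]   # target samples
y = [rng.gauss(0.0, 1.0) for _ in range(200)]   # source samples

lam, lr, m = 0.0, 0.05, 10
for _ in range(2000):
    A = rng.sample(x, m)
    B = rng.sample(y, m)
    # Gradient of the minibatch W2^2 loss (1/m) sum (x_(i) - y_(i) - lam)^2
    # under sorted matching; the order statistics cancel in the mean.
    grad = 2.0 * (lam + sum(B) / m - sum(A) / m)
    lam -= lr * grad

opt = sum(x) / len(x) - sum(y) / len(y)
# SGD with unbiased minibatch gradients converges near the optimum.
assert abs(lam - opt) < 0.5
```

Each minibatch gradient is noisy, but since it is unbiased the averaged update direction points toward $\lambda^*$, which is exactly the property established by the exchange of gradient and expectation below.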

Main hypothesis. We assume that the parametrized data $Y_\lambda$ is differentiable with respect to $\lambda$. For instance for GANs, this is verified when the neural network in the generator is differentiable (which is the case if the nonlinear activation functions are all differentiable) and when the cost chosen in the Wasserstein distance is also differentiable.
We introduce the map

$$g : (\Pi, C) \mapsto \langle \Pi, C \rangle - \varepsilon H(\Pi)$$

To prove this theorem, we will use the following "Differentiation Lemma".

###### Lemma 4 (Differentiation lemma).

Let $V$ be a nontrivial open set in $\mathbb{R}^d$ and let $P$ be a probability distribution on $\mathcal{X} \times \mathcal{Y}$. Define a map $C : \mathcal{X} \times \mathcal{Y} \times V \to \mathbb{R}$ with the following properties:

• For any $\lambda \in V$, the map $(x,y) \mapsto C(x,y,\lambda)$ is $P$-integrable.

• For $P$-almost all $(x,y)$, the map $\lambda \mapsto C(x,y,\lambda)$ is differentiable.

• There exists a $P$-integrable function $g$ such that $|\partial_\lambda C(x,y,\lambda)| \leq g(x,y)$ for all $\lambda \in V$.

Then, for any $\lambda \in V$, $\mathbb{E}_P[\partial_\lambda C(X,Y,\lambda)]$ exists, and the function $\lambda \mapsto \mathbb{E}_P[C(X,Y,\lambda)]$ is differentiable with differential:

$$\mathbb{E}_P[\partial_\lambda C(X,Y,\lambda)] = \partial_\lambda\, \mathbb{E}_P[C(X,Y,\lambda)] \tag{41}$$

The following result will also be useful.

###### Lemma 5 (Danskin, Rockafellar).

Let $g : \mathbb{R}^d \times W \to \mathbb{R}$ be a function. We define $\varphi(z) := \min_{w \in W} g(z,w)$, where $W$ is compact. We assume that for each $w \in W$, the function $z \mapsto g(z,w)$ is differentiable and that $\nabla_z g(z,w)$ depends continuously on $(z,w)$. If in addition $g$ is convex in $z$, and if $\bar{z}$ is a point such that the minimizer $\bar{w}$ of $w \mapsto g(\bar{z},w)$ is unique, then $\varphi$ is differentiable at $\bar{z}$ and verifies

$$\nabla \varphi(\bar{z}) = \nabla_z g(\bar{z}, \bar{w}) \tag{42}$$

The last lemma shows that the entropic loss is differentiable with respect to the cost matrix. Indeed, the lemma directly applies since the entropic problem is strongly convex, so its minimizer is unique. This remark enables us to obtain the following result.

###### Theorem 6 (Exchange gradient and expectation).

Let us suppose that we have two distributions $\alpha$ and $\beta$ on two bounded subsets $\mathcal{X}$ and $\mathcal{Y}$, a cost $C$, and that the parametrized data $Y_\lambda$ is differentiable with respect to $\lambda$. Then, for a set $A$ of cardinality $m$, we are allowed to exchange gradients and expectation for the entropic loss and the Sinkhorn divergence:

$$\nabla_\lambda\, \mathbb{E}_{Y_\lambda \sim \beta_\lambda^{\otimes m}}\, h(A, Y_\lambda) = \mathbb{E}_{Y_\lambda \sim \beta_\lambda^{\otimes m}}\, \nabla_\lambda\, h(A, Y_\lambda) \tag{43}$$