# Minimax Distribution Estimation in Wasserstein Distance

Shashank Singh
sss1@cs.cmu.edu
Machine Learning Department
Department of Statistics & Data Science
Carnegie Mellon University
Barnabás Póczos
bapoczos@cs.cmu.edu
Machine Learning Department
Carnegie Mellon University

###### Abstract

The Wasserstein metric is an important measure of distance between probability distributions, with applications in machine learning, statistics, probability theory, and data analysis. This paper provides upper and lower bounds on statistical minimax rates for the problem of estimating a probability distribution under Wasserstein loss, using only metric properties, such as covering and packing numbers, of the sample space, and weak moment assumptions on the probability distributions.

Preprint. Work in progress.

## 1 Introduction

The Wasserstein metric is an important measure of distance between probability distributions, based on the cost of transforming either distribution into the other through mass transport, under a base metric on the sample space. Originating in the optimal transport literature (it has been variously attributed to Monge, Kantorovich, Rubinstein, Gini, Mallows, and others; see Chapter 3 of [villani2008optimalTransport] for detailed history), the Wasserstein metric has, owing to its intuitive and general nature, been utilized in such diverse areas as probability theory and statistics, economics, image processing, text mining, robust optimization, and physics [villani2008optimalTransport, fournier2015rate, esfahani2015robustOptimization, gao2016distributionallyRobust].

In the analysis of image data, the Wasserstein metric has been used for various tasks such as texture classification and face recognition [sandler2011NMFImageAnalysis], reflectance interpolation, color transfer, and geometry processing [solomon2015imageOptimalTrans], image retrieval [rubner2000imageRetrieval], and image segmentation [ni2009imageSegmentation], and, in the analysis of text data, for tasks such as document classification [kusner2015documentDistances] and machine translation [zhang2016machineTranslation].

In contrast to a number of other popular notions of dissimilarity between probability distributions, such as $L^p$ distances or Kullback-Leibler and other $f$-divergences [morimoto1963divergences, csiszar1964divergences, ali1966divergences], which require distributions to be absolutely continuous with respect to each other or to a base measure, Wasserstein distance is well-defined between any pair of probability distributions over a sample space equipped with a metric. (Hence, we use “distribution estimation” in this paper, rather than the more common “density estimation”.) As a particularly important consequence, Wasserstein distances between discrete (e.g., empirical) distributions and continuous distributions are well-defined, finite, and informative (e.g., can decay to $0$ as the distributions become more similar).

Partly for this reason, many central limit theorems and related approximation results [ruschendorf1985wasserstein, johnson2005central, chatterjee2008normalApproximation, rio2009upper, rio2011asymptotic, chen10SteinsMethod, reitzner2013central] are expressed using Wasserstein distances. Within machine learning and statistics, this same property motivates a class of so-called minimum Wasserstein distance estimates [del1999CLT, del2003correction, bassetti2006minimum, bernton2017inferenceUsingWasserstein] of distributions, ranging from exponential distributions [baillo2016exponentialWasserstein] to more exotic models such as restricted Boltzmann machines (RBMs) [montavon2016wassersteinRBMs] and generative adversarial networks (GANs) [arjovsky2017wassersteinGAN]. This class of estimators also includes $k$-means and $k$-medians, where the hypothesis class is taken to be discrete distributions supported on at most $k$ points [pollard1982quantization]; more flexible algorithms such as hierarchical $k$-means [ho2017multilevel] and $k$-flats [tseng2000kFlats] can also be expressed in this way, using more elaborate hypothesis classes. PCA can also be expressed and generalized to manifolds using Wasserstein distance minimization [boissard2015template]. These estimators are conceptually equivalent to empirical risk minimization, leveraging the fact that Wasserstein distances between the empirical distribution and distributions in the relevant hypothesis class are well-behaved. Moreover, these estimates often perform well in practice because they are free of both tuning parameters and strong distributional assumptions.

For many of the above applications, it is important to understand how quickly the empirical distribution converges to the true distribution in Wasserstein distance, and whether there exist distribution estimators that converge more quickly. For example, [canas2012learning] use bounds on Wasserstein convergence to prove learning bounds for $k$-means, while [arora2017generalization] used the slow rate of convergence in Wasserstein distance in certain cases to argue that GANs based on Wasserstein distances fail to generalize with fewer than exponentially many samples in the dimension.

To this end, the main contribution of this paper is to identify, in a wide variety of settings, the minimax convergence rate for the problem of estimating a distribution using Wasserstein distance as a loss function. Our setting is very general, relying only on metric properties of the support of the distribution and the number of finite moments the distribution has; some diverse examples to which our results apply are given in Section 6. Specifically, we assume only that the distribution has some number of finite moments in a given metric. We then prove bounds on the minimax convergence rates of distribution estimation, utilizing covering numbers of the sample space for upper bounds and packing numbers for lower bounds. It may at first be surprising that positive results can be obtained under such mild assumptions; this highlights that the Wasserstein metric is quite a weak metric (see our Lemma 11 and the subsequent remark for discussion of this). Moreover, our results imply that, without further assumptions on the population distribution, the empirical distribution is typically minimax rate-optimal. Note that, while there has been previous work on upper bounds (discussed in Section 3), this paper is the first to study minimax lower bounds for this problem.

Organization: The remainder of this paper is organized as follows. Section 2 provides notation required to formally state both the problem of interest and our results, while Section 3 reviews previous work studying convergence of distributions in Wasserstein distance. Sections 4 and 5 respectively contain our main upper and lower bound results. Since the proofs of the upper bounds are fairly long, Appendices A and B provide high-level sketches of the proofs, followed by detailed proofs in Appendix C. The lower bound is proven in Appendix D. Finally, in Section 6, we apply our upper and lower bounds to identify minimax convergence rates in a number of concrete examples. Section 7 concludes with a summary of our contributions and suggested avenues for future work.

## 2 Notation and Problem Setting

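For any positive integer $n$, $[n] = \{1, 2, \ldots, n\}$ denotes the set of the first $n$ positive integers. For sequences $\{a_n\}_{n \in \mathbb{N}}$ and $\{b_n\}_{n \in \mathbb{N}}$ of non-negative reals, $a_n \lesssim b_n$ and, equivalently, $b_n \gtrsim a_n$, indicate the existence of a constant $C > 0$ such that $\limsup_{n \to \infty} a_n / b_n \leq C$. $a_n \asymp b_n$ indicates $a_n \lesssim b_n \lesssim a_n$.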

### 2.1 Problem Setting

For the remainder of this paper, fix a metric space $(\Omega, \rho)$, over which $\Sigma$ denotes the Borel $\sigma$-algebra, and let $\mathcal{P}$ denote the family of all Borel probability distributions on $\Omega$. The main object of study in this paper is the Wasserstein distance on $\mathcal{P}$, defined as follows:

###### Definition 1 (r-Wasserstein Distance).

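Given two Borel probability distributions $P$ and $Q$ over $\Omega$ and $r \in [1, \infty)$, the $r$-Wasserstein distance $W_r(P, Q) \in [0, \infty]$ between $P$ and $Q$ is defined by

$$W_r(P, Q) := \inf_{\mu \in \Pi(P, Q)} \left( \mathbb{E}_{(X, Y) \sim \mu} \left[ \rho^r(X, Y) \right] \right)^{1/r},$$

where $\Pi(P, Q)$ denotes all couplings between $P$ and $Q$; that is,

$$\Pi(P, Q) := \left\{ \mu : \Sigma^2 \to [0, 1] \;\middle|\; \text{for all } A \in \Sigma,\ \mu(A \times \Omega) = P(A) \text{ and } \mu(\Omega \times A) = Q(A) \right\}$$

is the set of joint probability measures over $\Omega \times \Omega$ with marginals $P$ and $Q$.

Intuitively, $W_r(P, Q)$ quantifies the $r$-weighted total cost of transforming mass distributed according to $P$ to be distributed according to $Q$, where the cost of moving a unit mass from $x \in \Omega$ to $y \in \Omega$ is $\rho(x, y)$. $W_r$ is sometimes defined in terms of equivalent (e.g., dual) formulations; these formulations will not be needed in this paper. $W_r$ is symmetric in its arguments and satisfies the triangle inequality, and, for all $P \in \mathcal{P}$, $W_r(P, P) = 0$. Thus, $W_r$ is always a pseudometric. Moreover, it is a proper metric (i.e., $W_r(P, Q) = 0$ implies $P = Q$) if and only if $\rho$ is as well.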
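As a small illustration of Definition 1, the following sketch computes $W_r$ between two empirical distributions on the real line with equally many atoms. It relies on a standard fact that is not stated in this paper: when $\Omega = \mathbb{R}$ and $\rho(x, y) = |x - y|$, matching the sorted samples is an optimal coupling between two uniform empirical measures of the same size. The function name `wasserstein_1d` is our own and purely illustrative.

```python
def wasserstein_1d(xs, ys, r=1.0):
    """r-Wasserstein distance between the uniform empirical measures on xs and ys.

    Assumptions (not from the paper): both samples live on the real line with
    rho(x, y) = |x - y| and have the same size, so that pairing the sorted
    samples gives an optimal coupling.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("need two non-empty samples of equal size")
    if r < 1:
        raise ValueError("r must be at least 1")
    n = len(xs)
    # Average transport cost rho^r under the sorted (monotone) coupling.
    cost = sum(abs(x - y) ** r for x, y in zip(sorted(xs), sorted(ys))) / n
    return cost ** (1.0 / r)
```

For general metric spaces no such shortcut is available, and the infimum over couplings has to be computed by solving a transport problem.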

This paper studies the following problem:

Formal Problem Statement: Suppose $(\Omega, \rho)$ is a known metric space. Suppose $P$ is an unknown Borel probability distribution on $\Omega$, from which we observe $n$ IID samples $X_1, \ldots, X_n \overset{IID}{\sim} P$. We are interested in studying the minimax rates at which $P$ can be estimated from $X_1, \ldots, X_n$, in terms of the ($r^{\text{th}}$ power of the) $r$-Wasserstein loss. Specifically, we are interested in deriving finite-sample upper and lower bounds, in terms of only properties of the space $(\Omega, \rho)$, on the quantity

$$\inf_{\hat{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_{X_1, \ldots, X_n \overset{IID}{\sim} P} \left[ W_r^r\left(P, \hat{P}(X_1, \ldots, X_n)\right) \right], \tag{1}$$

where the infimum is taken over all estimators $\hat{P}$ (i.e., (potentially randomized) functions of the data). In the sequel, we suppress the dependence of $\hat{P} = \hat{P}(X_1, \ldots, X_n)$ on the data in the notation.

### 2.2 Definitions for Stating our Results

Here, we give notation and definitions needed to state our main results in Sections 4 and 5.

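Let $2^\Omega$ denote the power set of $\Omega$. Let $\mathbb{S}$ denote the family of all Borel partitions of $\Omega$:

$$\mathbb{S} := \left\{ \mathcal{S} \subseteq \Sigma : \Omega \subseteq \bigcup_{S \in \mathcal{S}} S \text{ and } \forall S, T \in \mathcal{S} \text{ with } S \neq T,\ S \cap T = \emptyset \right\}.$$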

We now define some metric notions that will later be useful for bounding Wasserstein distances:

###### Definition 2 (Diameter and Separation of a Set, Resolution of a Partition).

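For any set $S \subseteq \Omega$, the diameter of $S$ is defined by $\mathrm{Diam}(S) := \sup_{x, y \in S} \rho(x, y)$, and the separation of $S$ is defined by $\mathrm{Sep}(S) := \inf_{x \neq y \in S} \rho(x, y)$. If $\mathcal{S} \in \mathbb{S}$ is a partition of $\Omega$, then the resolution of $\mathcal{S}$, defined by $\mathrm{Res}(\mathcal{S}) := \sup_{S \in \mathcal{S}} \mathrm{Diam}(S)$, is the largest diameter of any set in $\mathcal{S}$.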

We now define the covering and packing number of a metric space, which are classic and widely used measures of the size or complexity of a metric space [dudley1967coveringNumbers, haussler1995sphere, zhou2002covering, zhang2002covering]. Our main convergence results will be stated in terms of these quantities, as well as the packing radius, which acts, approximately, as the inverse of the packing number.

###### Definition 3 (Covering Number, Packing Number, and Packing Radius of a Metric Space).

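The covering number $N$ of $(\Omega, \rho)$ is defined for all $\varepsilon > 0$ by

$$N(\varepsilon) := \min \left\{ |\mathcal{S}| : \mathcal{S} \in \mathbb{S} \text{ and } \mathrm{Res}(\mathcal{S}) \leq \varepsilon \right\}.$$

The packing number $M$ of $(\Omega, \rho)$ is defined for all $\varepsilon > 0$ by

$$M(\varepsilon) := \max \left\{ |S| : S \subseteq \Omega \text{ and } \mathrm{Sep}(S) \geq \varepsilon \right\}.$$

Finally, the packing radius $R$ is defined for all $n \in \mathbb{N}$ by

$$R(n) := \sup \left\{ \mathrm{Sep}(S) : S \subseteq \Omega \text{ and } |S| \geq n \right\}.$$

Sometimes, we use the covering or packing number of a metric space, say $(\Theta, \tau)$, other than $(\Omega, \rho)$; in such cases, we write $N(\Theta, \tau, \varepsilon)$ or $M(\Theta, \tau, \varepsilon)$ rather than $N(\varepsilon)$ or $M(\varepsilon)$, respectively. For specific $\varepsilon > 0$, we will also refer to $N(\varepsilon)$ as the $\varepsilon$-covering number of $(\Omega, \rho)$.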

###### Remark 4.

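The covering and packing numbers of a metric space are closely related. In particular, for any $\varepsilon > 0$, we always have

$$M(\varepsilon) \leq N(\varepsilon) \leq M(\varepsilon / 2). \tag{2}$$

The packing number and packing radius also have a close approximate inverse relationship. In particular, for any $\varepsilon > 0$ and $n \in \mathbb{N}$, we always have

$$R(M(\varepsilon)) \geq \varepsilon \quad \text{and} \quad M(R(n)) \geq n. \tag{3}$$

However, it is possible that $R(M(\varepsilon)) > \varepsilon$ or $M(R(n)) > n$.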
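To make the relationship in Inequality (3) concrete, here is a brute-force sketch for a small finite metric space. The helper names `sep`, `packing_number`, and `packing_radius` are ours, the enumeration is exponential in $|\Omega|$ and only meant for toy examples, and returning `0.0` when no subset of size $n$ exists is a convention we chose for the supremum over an empty family.

```python
from itertools import combinations

def sep(points, rho):
    """Separation: smallest distance between two distinct points (inf if fewer than 2)."""
    return min((rho(x, y) for x, y in combinations(points, 2)), default=float("inf"))

def packing_number(omega, rho, eps):
    """M(eps): size of the largest subset of omega with separation >= eps."""
    for size in range(len(omega), 0, -1):
        if any(sep(s, rho) >= eps for s in combinations(omega, size)):
            return size
    return 0

def packing_radius(omega, rho, n):
    """R(n): largest separation of any subset of omega with at least n points."""
    candidates = [sep(s, rho)
                  for size in range(n, len(omega) + 1)
                  for s in combinations(omega, size)]
    return max(candidates, default=0.0)  # convention for an empty family (assumption)
```

On the five-point grid $\{0, 0.25, 0.5, 0.75, 1\}$ with $\rho(x, y) = |x - y|$, one gets $M(0.3) = 3$ and $R(3) = 0.5$, so $R(M(0.3)) = 0.5 > 0.3$, an instance where the first inequality in (3) is strict.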

Finally, when we consider unbounded metric spaces, we will require some sort of concentration conditions on the probability distributions of interest, to obtain useful results. Specifically, we use an appropriately generalized version of the $\ell^{\text{th}}$ moment of the distribution, given in Definition 6 below.

###### Remark 5.

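We defined the covering number slightly differently from usual (using partitions rather than covers). However, the given definition is equivalent to the usual definition, since (a) any partition is itself a cover (i.e., a set $\mathcal{C} \subseteq 2^\Omega$ such that $\Omega \subseteq \bigcup_{C \in \mathcal{C}} C$), and (b), for any countable cover $\mathcal{C} := \{C_1, C_2, \ldots\}$, there exists a partition $\mathcal{S} := \{S_1, S_2, \ldots\}$ with $|\mathcal{S}| \leq |\mathcal{C}|$ and each $S_i \subseteq C_i$, defined recursively by $S_i := C_i \setminus \bigcup_{j=1}^{i-1} S_j$. $\mathcal{S}$ is often called the disjointification of $\mathcal{C}$.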
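The disjointification step in Remark 5 can be sketched for finite covers as follows. This is an illustration under our own conventions: sets are Python `set` objects, the function name `disjointify` is hypothetical, and empty cells are kept so that the $i$-th output is always a subset of the $i$-th input (dropping them afterwards only shrinks the partition).

```python
def disjointify(cover):
    """Turn a list of sets covering a space into pairwise disjoint sets S_i within C_i."""
    seen = set()       # union of all cells produced so far
    partition = []
    for c in cover:
        s = set(c) - seen   # remove everything already assigned to earlier cells
        partition.append(s)
        seen |= s
    return partition
```

For example, the cover `[{1, 2, 3}, {3, 4}, {1, 4, 5}]` becomes `[{1, 2, 3}, {4}, {5}]`: the union is unchanged, the cells are disjoint, and no cell has grown, so the resolution cannot increase.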

###### Definition 6 (Metric Moments of a Probability Distribution).

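For any $\ell \in (0, \infty]$, probability measure $P \in \mathcal{P}$, and $x \in \Omega$, the $\ell^{\text{th}}$ metric moment $m_{\ell, x}(P)$ of $P$ around $x$ is defined by

$$m_{\ell, x}(P) := \left( \mathbb{E}_{Y \sim P} \left[ (\rho(x, Y))^\ell \right] \right)^{1/\ell} \in [0, \infty],$$

using the appropriate limit if $\ell = \infty$. The chosen reference point $x$ only affects constant factors since,

$$\text{for all } x, x' \in \Omega, \quad \left| m_{\ell, x}^\ell(P) - m_{\ell, x'}^\ell(P) \right| \leq (\rho(x, x'))^\ell.$$

Note that, if $\Omega$ has linear structure with respect to which $\rho$ is translation-invariant (e.g., if $(\Omega, \rho)$ is a Fréchet space), we can state our results more simply in terms of $m_\ell(P) := \inf_{x \in \Omega} m_{\ell, x}(P)$. As an example, if $\Omega = \mathbb{R}$ and $\rho(x, y) = |x - y|$, then $m_2(P)$ is precisely the standard deviation of $P$.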
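As a quick numerical illustration of Definition 6, the sketch below evaluates the metric moment of the uniform empirical measure on a finite sample. The function name `metric_moment` and the default choice of the absolute-value metric on the real line are our assumptions; for $\ell = \infty$ the limit is taken to be the largest distance from the reference point.

```python
def metric_moment(sample, x, ell, rho=lambda a, b: abs(a - b)):
    """ell-th metric moment around x of the uniform empirical measure on `sample`."""
    if len(sample) == 0:
        raise ValueError("sample must be non-empty")
    if ell == float("inf"):
        # Limiting case: the largest distance from the reference point.
        return max(rho(x, y) for y in sample)
    return (sum(rho(x, y) ** ell for y in sample) / len(sample)) ** (1.0 / ell)
```

With `sample = [1.0, 3.0]`, the second moment around the mean `2.0` equals `1.0`, which is the (population) standard deviation of that two-point distribution.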

## 3 Related Work

A long line of work [dudley1969speed, ajtai1984optimalMatchings, canas2012learning, dereich2013constructive, boissard2014mean, fournier2015rate, weed2017sharp, lei2018convergence] has studied the rate of convergence of the empirical distribution to the population distribution in Wasserstein distance. In terms of upper bounds, the most general and tight are those in the recent works of [weed2017sharp] and [lei2018convergence]. As we describe below, while these two papers overlap significantly, neither supersedes the other, and our upper bound combines the key strengths of those in [weed2017sharp] and [lei2018convergence].

The results of [weed2017sharp] are expressed in terms of a particular notion of dimension, which they call the Wasserstein dimension $s$, since they derive convergence rates of order $n^{-r/s}$ (matching the $n^{-r/D}$ rate achieved on the unit cube $[0, 1]^D$). The definition of $s$ is complex (e.g., it depends on the sample size $n$), but [weed2017sharp] show that, in many cases, $s$ converges to certain common definitions of the intrinsic dimension of the support of the distribution. This paper overcomes three main limitations of [weed2017sharp]:

1. The upper bounds of [weed2017sharp] apply only to totally bounded metric spaces. In contrast, our upper bounds permit unbounded metric spaces under the assumption that the distribution $P$ has some finite moment $m_{\ell, x}(P) < \infty$. The results of [weed2017sharp] correspond to the special case $\ell = \infty$.

2. Their main upper bound (their Proposition 10) only holds when $s > 2r$, with constant factors diverging to infinity as $s \downarrow 2r$. Hence, their rates are loose when $r$ is large or when the data have low intrinsic dimension. In contrast, our upper bound is tight even when $s \leq 2r$.

3. As we discuss in our Example 4, the upper bound of [weed2017sharp] becomes loose as the Wasserstein dimension approaches $\infty$, limiting its utility in infinite-dimensional function spaces. In contrast, we show that our upper and lower bounds match for several standard function spaces.

Intuitively, we find that the finite-sample bounds of [weed2017sharp] are tight when the intrinsic dimension of the data lies in an interval $[a, b]$ with $2r < a < b < \infty$, but they can be loose outside this range. In contrast, we find our results give tight rates for a larger class of problems.

On the other hand, [lei2018convergence] focuses on the case where $\Omega$ is a (potentially unbounded and infinite-dimensional) Banach space, under moment assumptions on the distributions. Thus, while the results of [lei2018convergence] cover interesting cases such as infinite-dimensional Gaussian processes, they do not demonstrate that convergence rates improve when the intrinsic dimension of the support of $P$ is smaller than that of $\Omega$ (unless this support lies within a linear subspace of $\Omega$). As a simple example, if the distribution is in fact supported on a finite set of linearly independent points, the bound of [lei2018convergence] implies only a slower convergence rate, whereas we give a bound of order $n^{-1/2}$. Although we do not delve into this here, our results (unlike those of [lei2018convergence]) should also benefit from the multi-scale behavior discussed in Section 5 of [weed2017sharp]; namely, much faster convergence rates are often observed for small $n$ than for large $n$. This may help explain why algorithms such as functional $k$-means [garcia2015functionalKMeans] work in practice, even though the results of [lei2018convergence] imply only a slow convergence rate in this case.

Under similarly general conditions, [sriperumbudur2010integralProbabilityMetrics, sriperumbudur2012empirical] have studied the related problem of estimating the Wasserstein distance between two unknown distributions given samples from those two distributions. Since one can estimate Wasserstein distances by plugging in empirical distributions, our upper bounds imply upper bounds for Wasserstein distance estimation. These bounds are tighter, in several cases, than those of [sriperumbudur2010integralProbabilityMetrics, sriperumbudur2012empirical]; this is the case, for example, when $\Omega = [0, 1]^D$ is the Euclidean unit cube. Minimax rates for this problem are currently unknown, and it is presently unclear to us under what conditions recent results on estimation of $L^1$ distances between discrete distributions [jiao2017minimaxL1] might imply an improved rate for estimation of Wasserstein distance.

To the best of our knowledge, minimax lower bounds for distribution estimation under Wasserstein loss remain unstudied, except in the very specific case when $\Omega = [0, 1]^D$ is the Euclidean unit cube and $r = 1$ [liang2017well]. As noted above, most previous works have focused on studying the convergence rate of the empirical distribution to the true distribution in Wasserstein distance. For this rate, several lower bounds have been established, matching known upper bounds in many cases. However, many distribution estimators besides the empirical distribution can be considered. For example, it is tempting (especially given the infinite dimensionality of the distribution to be estimated) to try to reduce variance by techniques such as smoothing or importance sampling [bucklew2013introduction]. Our lower bound results, given in Section 5, imply that the empirical distribution is already minimax optimal, up to constant factors, in many cases.

## 4 Upper Bounds

In this section, we present our main upper bounds on the convergence rate of the empirical distribution to the true distribution in Wasserstein distance. We begin by presenting a simpler result for the case of totally bounded metric spaces, followed by a more complex but general result for arbitrary metric spaces under finite-moment assumptions on the distribution.

###### Theorem 7.

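Let $(\Omega, \rho)$ be a metric space on which $P$ is a Borel probability measure. Let $\hat{P}$ denote the empirical distribution of $n$ IID samples $X_1, \ldots, X_n \overset{IID}{\sim} P$, given by

$$\hat{P}(S) := \frac{1}{n} \sum_{i=1}^n 1_{\{X_i \in S\}}, \quad \forall S \in \Sigma. \tag{4}$$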

Then, for any non-increasing sequence with ,

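$$\mathbb{E}\left[ W_r^r(P, \hat{P}) \right] \leq \varepsilon_K^r + \frac{1}{\sqrt{n}} \sum_{k=1}^K \left( \sum_{j=k}^K 2^{K-j} \varepsilon_j \right)^r \sqrt{N(\varepsilon_k) - 1}.$$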

In the proof of the above theorem, the sequence $\{\varepsilon_k\}_{k \in [K]}$ gives the resolutions of a sequence of increasingly fine partitions of $\Omega$. The basic idea of the proof is to recursively bound the error over each partition at resolution $\varepsilon_k$ in terms of $N(\varepsilon_k)$ and the error over the partition at the adjacent resolution. The parameter $K$ restricts us to a particular finite resolution, with optimal value typically increasing with $n$. Note that this “multi-resolution” proof approach has been utilized in several special cases. Our Theorem 7 is most comparable to the upper bound (Proposition 10) of [weed2017sharp].
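To get a feel for how the resolutions trade off against the covering numbers, the following sketch evaluates the right-hand side of the bound in Theorem 7 as we read it from the display above. The function name `theorem7_bound` is ours, and the covering number is passed in as a callable, since in practice $N(\varepsilon)$ is usually only known through an upper bound.

```python
import math

def theorem7_bound(eps, covering_number, n, r=1.0):
    """Evaluate eps_K^r + n^{-1/2} * sum_k (sum_{j>=k} 2^{K-j} eps_j)^r * sqrt(N(eps_k) - 1).

    eps: non-increasing resolutions [eps_1, ..., eps_K] (0-indexed in code).
    covering_number: callable returning N(eps) or an upper bound on it (assumption).
    """
    K = len(eps)
    if K == 0 or any(eps[i] < eps[i + 1] for i in range(K - 1)):
        raise ValueError("eps must be a non-empty, non-increasing sequence")
    total = eps[-1] ** r
    for k in range(1, K + 1):
        # Inner sum over j = k, ..., K of 2^(K - j) * eps_j.
        inner = sum(2 ** (K - j) * eps[j - 1] for j in range(k, K + 1))
        total += inner ** r * math.sqrt(covering_number(eps[k - 1]) - 1) / math.sqrt(n)
    return total
```

For instance, with a single resolution `eps = [0.5]`, a constant covering number of 5, `n = 100`, and `r = 1`, the bound evaluates to `0.5 + 0.5 * 2 / 10 = 0.6`.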

Theorem 7 requires $(\Omega, \rho)$ to be totally bounded in order for $N(\varepsilon)$ to be finite. Next, we present a more complex bound, which, under the additional assumption that $P$ has some number of finite moments, is often finite even when $\Omega$ is not totally bounded. The key idea of the proof is to partition $\Omega$ into bounded subsets, over each of which we can apply a bound similar to Theorem 7. Thus, instead of the covering number of $\Omega$, this result uses covering numbers of a partition into totally bounded subsets.

###### Theorem 8 (General Upper Bound for Unbounded Metric Spaces).

Let and suppose . Let . Fix two non-decreasing real-valued sequences and , of which is non-decreasing with and and is non-increasing. For each , define . Then,

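$$\mathbb{E}\left[ W_r^r(P, \hat{P}) \right] \leq m_{\ell, x_0}^\ell(P) \sum_{k \in \mathbb{N}} \left( w_k^{-\ell} (\varepsilon_J)^r + 2^r w_k^{r - \ell/2} \min\left\{ 2 w_k^{-\ell/2}, \sqrt{\frac{1}{n}} \right\} + \sum_{j=1}^J \left( \sum_{t=j}^J 2^{J-t} \varepsilon_t \right)^r \min\left\{ 2 w_k^{-\ell}, \sqrt{\frac{w_k^{-\ell}}{n} N(B_k, \rho, \varepsilon_j)} \right\} \right).$$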

In the above, $\{w_k\}_{k \in \mathbb{N}}$ corresponds to radii of the partition of $\Omega$ into a sequence of “spherical shells” $\{B_k\}_{k \in \mathbb{N}}$, whereas $\{\varepsilon_j\}_{j \in [J]}$, as in the previous result, corresponds to resolutions of partitions of the $B_k$’s. As with $K$ in the previous result, $J$ is used to ensure that we restrict ourselves to a particular finite resolution. The $\min$ terms appear because, for large $k$, the error is controlled by the fact that $P(B_k)$ is small (due to the moment assumption), rather than using a covering of $B_k$.

## 5 Lower Bounds

In this section, we provide a minimax lower bound (over the family $\mathcal{P}$ of all Borel distributions on $\Omega$) for distribution estimation in Wasserstein distance; that is, a lower bound on the quantity

$$\inf_{\hat{P} : \Omega^n \to \mathcal{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_{X_1, \ldots, X_n \overset{IID}{\sim} P} \left[ W_r^r(P, \hat{P}) \right], \tag{5}$$

where the infimum is over all estimators $\hat{P}$ of $P$ (i.e., all (potentially randomized) functions $\hat{P} : \Omega^n \to \mathcal{P}$). Our bound depends primarily on the packing radius $R$ of $(\Omega, \rho)$, and, presently, we handle only the case without finite-moment assumptions on $P$. However, we show in the next section that this often implies tight lower bounds when enough (roughly, $\ell > 2r$) moments exist.

###### Theorem 9.

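Let $(\Omega, \rho)$ be a metric space, on which $\mathcal{P}$ is the set of Borel probability measures. Then,

$$\inf_{\hat{P} : \Omega^n \to \mathcal{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_{X_1, \ldots, X_n \overset{IID}{\sim} P} \left[ W_r^r\left(P, \hat{P}(X_1, \ldots, X_n)\right) \right] \geq c_r \sup_{k \in [32n]} R^r(k) \sqrt{\frac{k - 1}{n}},$$

where $c_r$ depends only on $r$.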

## 6 Example Applications

Our theorems in the previous sections are quite abstract and have many tuning parameters. Thus, we conclude by exploring applications of our results to cases of interest. In each of the following examples, $P$ is an unknown Borel probability measure over the specified $\Omega$, from which we observe $n$ IID samples. For upper bounds, $\hat{P}$ denotes the empirical distribution (4) of these samples.

###### Example 1 (Finite Space).

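Consider the case where $\Omega$ is a finite set, over which $\rho$ is the discrete metric given, for some $\delta > 0$, by $\rho(x, y) = \delta 1_{\{x \neq y\}}$, for all $x, y \in \Omega$. Then, for any $\varepsilon \in (0, \delta)$, the covering number is $N(\varepsilon) = |\Omega|$. Thus, setting $K = 1$ and sending $\varepsilon_1 \to 0$ in Theorem 7 gives

$$\mathbb{E}\left[ W_r^r(P, \hat{P}) \right] \leq \delta^r \sqrt{\frac{|\Omega| - 1}{n}}.$$

On the other hand, $R(|\Omega|) = \delta$, and so, setting $k = |\Omega|$ in Theorem 9 (which requires $|\Omega| \leq 32n$) yields

$$\inf_{\hat{P}} \sup_{P \in \mathcal{P}} \mathbb{E}\left[ W_r^r(P, \hat{P}) \right] \geq c_r \delta^r \sqrt{\frac{|\Omega| - 1}{n}}.$$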

###### Example 2 (Unit Cube, Euclidean Metric).

Consider the case where is the unit cube and is the Euclidean metric. Assuming , using the fact that  [pollard1990empirical] and plugging and into Theorem 8 gives (after a straightforward but very tedious calculation) a constant depending only on , , and such that

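$$\mathbb{E}\left[ W_r^r(P, \hat{P}) \right] \leq C_{D, \ell, r}\, m_\ell^\ell(P) \left( n^{-\frac{\ell - r}{\ell}} + 2^{-2Jr} + n^{-1/2} \sum_{j=1}^J 2^{(D - 2r) j} \right). \tag{6}$$

Of these three terms, the first depends only on the number $\ell$ of finite moments $P$ is assumed to have and the order $r$ of the Wasserstein distance, whereas the second and third terms depend on choosing the parameter $J$. The optimal choice of $J$ scales with the sample size $n$ at a rate depending on the quantity $D - 2r$. Specifically, if $D = 2r$, then setting $J \asymp \log n$ gives a rate of $n^{-\frac{\ell - r}{\ell}} + n^{-1/2} \log n$. If $D \neq 2r$, then (6) reduces to

$$\mathbb{E}\left[ W_r^r(P, \hat{P}) \right] \leq C_{D, \ell, r}\, m_\ell^\ell(P) \left( n^{-\frac{\ell - r}{\ell}} + 2^{-2Jr} + n^{-1/2}\, \frac{2^{(D - 2r) J} - 1}{2^{D - 2r} - 1} \right).$$

Then, if $2r > D$, sending $J \to \infty$ gives a rate of $n^{-\frac{\ell - r}{\ell}} + n^{-1/2}$. Finally, if $2r < D$, then setting $J = \frac{1}{2D} \log_2 n$ gives a rate of $n^{-\frac{\ell - r}{\ell}} + n^{-r/D}$. To summarize,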

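$$\mathbb{E}\left[ W_r^r(P, \hat{P}) \right] \lesssim n^{-\frac{\ell - r}{\ell}} + \begin{cases} n^{-1/2} & \text{if } 2r > D \\ n^{-1/2} \log n & \text{if } 2r = D \\ n^{-r/D} & \text{if } 2r < D \end{cases}$$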

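(reproducing Theorem 1 of [fournier2015rate]). On the other hand, it is easy to check that the packing radius satisfies $R(n) \geq n^{-1/D}$ and $R(2) = \sqrt{D}$. Thus, Theorem 9 with $k = n + 1$ and $k = 2$ yields

$$\inf_{\hat{P}} \sup_{P \in \mathcal{P}} \mathbb{E}\left[ W_r^r(\hat{P}, P) \right] \gtrsim \max\left\{ (n + 1)^{-r/D},\ D^{r/2} n^{-1/2} \right\}.$$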

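Together, these bounds give the following minimax rates for distribution estimation in Wasserstein loss:

$$\inf_{\hat{P}} \sup_{P \in \mathcal{P}} \mathbb{E}\left[ W_r^r(\hat{P}, P) \right] \asymp \begin{cases} n^{-1/2} & \text{if } \ell > 2r > D \\ n^{-r/D} & \text{if } 2r < D \text{ and } \ell > \frac{Dr}{D - r} \end{cases}$$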

When and , our upper and lower bounds are separated by a factor of . The main result of [ajtai1984optimalMatchings] implies that, for the case and , the empirical distribution converges as , suggesting that the factor in our upper bound may be tight. Further generalization of Theorem 9 is needed to give lower bounds when both or when and .
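The case analysis of the upper bound in Example 2 can be condensed into a small helper. This is only a bookkeeping sketch under our reading of the summary above, in which the moment-dependent term decays as $n^{-(\ell - r)/\ell}$ and the slower of the two terms determines the overall rate; the function name `upper_rate` and its return convention are our own.

```python
def upper_rate(D, r, ell):
    """Return (a, has_log) such that the upper bound scales as n**(-a), times log n if has_log.

    Assumes r >= 1, ell > r, and D >= 1 (our reading of the setting of Example 2).
    """
    if not (r >= 1 and ell > r and D >= 1):
        raise ValueError("expected r >= 1, ell > r, and D >= 1")
    moment_exp = (ell - r) / ell          # exponent of the moment-dependent term
    if 2 * r > D:
        dim_exp, has_log = 0.5, False     # parametric regime
    elif 2 * r == D:
        dim_exp, has_log = 0.5, True      # boundary case with a log factor
    else:
        dim_exp, has_log = r / D, False   # dimension-dependent regime
    if moment_exp < dim_exp:
        return moment_exp, False          # too few moments: that term dominates
    return dim_exp, has_log
```

For example, `upper_rate(4, 1, 4)` returns `(0.25, False)`, i.e., a rate of order $n^{-1/4}$ for $D = 4$, $r = 1$, and four finite moments.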

The next example demonstrates how the rate of convergence in Wasserstein metric depends on properties of the metric space at both large and small scales. Specifically, if we discretize $\Omega$, then the phase transition at $2r = D$ disappears.

###### Example 3.

Suppose is a -dimensional grid of integers and is -metric (given by ). Since and the and Euclidean metrics are topologically equivalent, the upper bounds from Example 2 clearly apply, up to a factor of . However, we also have the fact that, whenever , . Therefore, setting , , and in Theorem 8 gives, for a constant depending only on , , and ,

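$$\mathbb{E}\left[ W_r^r(P, \hat{P}) \right] \leq C_{D, \ell, r}\, m_\ell^\ell(P) \left( n^{-\frac{\ell - r}{\ell}} + \sum_{k \in \mathbb{N}} \sqrt{\frac{2^{(D - \ell) k}}{n}} \right).$$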

When $\ell > D$, the sum over $k$ converges and this reduces to $n^{-\frac{\ell - r}{\ell}} + n^{-1/2}$, giving a tighter rate than in Example 2 when $2r \leq D$. To the best of our knowledge, no prior results in the literature imply this fact.

Finally, we consider distributions over an infinite dimensional space of smooth functions.

###### Example 4 (Hölder Ball, L∞ Metric).

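Suppose that, for some $\alpha \in (0, 1]$,

$$\Omega := \left\{ f : [0, 1]^D \to [-1, 1] \;\middle|\; \forall x, y \in [0, 1]^D,\ |f(x) - f(y)| \leq \|x - y\|_2^\alpha \right\}$$

is the class of unit $\alpha$-Hölder functions on the unit cube and $\rho$ is the $L^\infty$-metric given by

$$\rho(f, g) = \sup_{x \in [0, 1]^D} |f(x) - g(x)|, \quad \text{for all } f, g \in \Omega.$$

The covering and packing numbers of $(\Omega, \rho)$ are well-known to be of order $\exp\left(\varepsilon^{-D/\alpha}\right)$ [devore1993approximation]; specifically, there exist positive constants $c_1 \leq c_2$ such that, for all $\varepsilon \in (0, 1)$,

$$c_1 \exp\left(\varepsilon^{-D/\alpha}\right) \leq N(\varepsilon) \leq M(\varepsilon) \leq c_2 \exp\left(\varepsilon^{-D/\alpha}\right).$$

Since $\mathrm{Diam}(\Omega) = 2$, applying Theorem 7 with $K = 1$ and

$$\varepsilon_1 = \left( 2 \log n - (\alpha r / D) \log\log n \right)^{-\frac{\alpha r}{D}} \quad \text{gives} \quad \mathbb{E}\left[ W_r^r(P, \hat{P}) \right] \lesssim (\log n)^{-\frac{\alpha r}{D}}.$$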

Conversely, Inequality (3) implies $R(n) \gtrsim (\log n)^{-\alpha/D}$, and so setting $k = n$ in Theorem 9 gives

$$\inf_{\hat{P}} \sup_{P \in \mathcal{P}} \mathbb{E}\left[ W_r^r(P, \hat{P}) \right] \gtrsim (\log n)^{-\frac{\alpha r}{D}},$$

showing that distribution estimation over $(\Omega, \rho)$ has the extremely slow minimax rate $(\log n)^{-\frac{\alpha r}{D}}$. Although we considered only $\alpha \in (0, 1]$ (due to the notational complexity of defining higher-order Hölder spaces), analogous rates hold for all $\alpha > 0$. Also, since our rates depend only on covering and packing numbers of $\Omega$, identical rates can be derived for related Sobolev and Besov classes. Note that, in this setting, the upper bound of the prior work [weed2017sharp] (their Proposition 10), which is stated in terms of the Wasserstein dimension, fails to converge as $n \to \infty$.

One might wonder why we are interested in studying Wasserstein convergence of distributions over spaces of smooth functions, as in Example 4. Our motivation comes from the fact that smooth function spaces have historically been widely used for modeling images and other complex naturalistic signals [mallat1999wavelet, peyre2011numerical, sadhanala2016totalVariation]. Empirical breakthroughs have recently been made in generative modeling, particularly of images, based on the principle of minimizing Wasserstein distance between the empirical distribution and a large class of models encoded by a deep neural network [montavon2016wassersteinRBMs, arjovsky2017wassersteinGAN, gulrajani2017improved].

However, little is known about theoretical properties of these methods; while there has been some work studying the optimization landscape of such models [nagarajan2017gradient, liang2018interaction], we know of far less work exploring their statistical properties. Given the extremely slow minimax convergence rate we derived above, it must be the case that the class of distributions encoded by such models is far smaller or sparser than $\mathcal{P}$. An important avenue for further work is thus to explicitly identify stronger assumptions that can be made on distributions over interesting classes of signals, such as images, to bridge the gap between empirical performance and our theoretical understanding.

## 7 Conclusion

In this paper, we derived upper and lower bounds for distribution estimation under Wasserstein loss. Our upper bounds generalize prior results and are tighter in certain cases, while our lower bounds are, to the best of our knowledge, the first minimax lower bounds for this problem. We also provided several concrete examples in which our bounds imply novel convergence rates.

### 7.1 Future Work

We studied minimax rates over the very large class $\mathcal{P}$ of all distributions with some number of finite moments. It would be useful to understand how minimax rates improve when additional assumptions, such as smoothness, are made (see, e.g., [liang2017well] for somewhat improved upper bounds under smoothness assumptions when $\Omega$ is the Euclidean unit cube). Given the slow convergence rates we found over $\mathcal{P}$ in many cases, studying minimax rates under stronger assumptions may help to explain the relatively favorable empirical performance of popular distribution estimators based on empirical risk minimization in Wasserstein loss. Moreover, while rates over all of $\mathcal{P}$ are of interest only for very weak metrics such as the Wasserstein distance (as stronger metrics may be infinite or undefined), studying minimax rates under additional assumptions will allow for a better understanding of the Wasserstein metric in relation to other commonly used metrics.

#### Acknowledgments

This work was partly supported by an NSF Graduate Research Fellowship DGE-1252522 to S.S.

## Appendix A Preliminary Lemmas and Proof Sketch of Theorem 7

In this section, we outline the proof of Theorem 7, our upper bound for the case of totally bounded metric spaces. The proof of the more general Theorem 8 for unbounded metric spaces, which is given in the next section, builds on this.

We begin by providing a few basic lemmas; these lemmas are not fundamentally novel, but they will be used in the subsequent proofs of our main upper and lower bounds, and also help provide intuition for the behavior of the Wasserstein metric and its connections to other metrics between probability distributions. The proofs of these lemmas are given later, in Appendix C. Our first lemma relates Wasserstein distance to the notion of resolution of a partition.

###### Lemma 10.

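Suppose $\mathcal{S} \in \mathbb{S}$ is a countable Borel partition of $\Omega$. Let $P$ and $Q$ be Borel probability measures such that, for every $S \in \mathcal{S}$, $P(S) = Q(S)$. Then, for any $r \geq 1$, $W_r(P, Q) \leq \mathrm{Res}(\mathcal{S})$.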

Our next lemma gives simple lower and upper bounds on the Wasserstein distance between distributions supported on a countable subset $\mathcal{X} \subseteq \Omega$, in terms of $\mathrm{Diam}(\mathcal{X})$ and $\mathrm{Sep}(\mathcal{X})$. Since our main results will utilize coverings and packings to approximate $\Omega$ by finite sets, this lemma will provide a first step towards approximating (in Wasserstein distance) distributions on $\Omega$ by distributions on these finite sets. Indeed, the lower bound in Inequality (7) will suffice to prove our lower bounds, although a tighter upper bound, based on the upper bound in (7), will be necessary to obtain tight upper bounds.

###### Lemma 11.

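Suppose $(\Omega, \rho)$ is a metric space, and suppose $P$ and $Q$ are Borel probability distributions on $\Omega$ with countable support; i.e., there exists a countable set $\mathcal{X} \subseteq \Omega$ with $P(\mathcal{X}) = Q(\mathcal{X}) = 1$. Then, for any $r \geq 1$,

$$(\mathrm{Sep}(\mathcal{X}))^r \sum_{x \in \mathcal{X}} |P(\{x\}) - Q(\{x\})| \leq W_r^r(P, Q) \leq (\mathrm{Diam}(\mathcal{X}))^r \sum_{x \in \mathcal{X}} |P(\{x\}) - Q(\{x\})|. \tag{7}$$

###### Remark 12.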

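Recall that the term $\sum_{x \in \mathcal{X}} |P(\{x\}) - Q(\{x\})|$ in Inequality (7) is the $L^1$ distance

$$\|p - q\|_1 := \sum_{x \in \mathcal{X}} |p(x) - q(x)|$$

between the densities $p$ and $q$ of $P$ and $Q$ with respect to the counting measure on $\mathcal{X}$, and that this same quantity is twice the total variation distance

$$\mathrm{TV}(P, Q) := \sup_{A \subseteq \Omega} |P(A) - Q(A)|.$$

Hence, Lemma 11 can be equivalently written as

$$\mathrm{Sep}(\Omega) \left( \|p - q\|_1 \right)^{1/r} \leq W_r(P, Q) \leq \mathrm{Diam}(\Omega) \left( \|p - q\|_1 \right)^{1/r}$$

and as

$$\mathrm{Sep}(\Omega) \left( 2\, \mathrm{TV}(P, Q) \right)^{1/r} \leq W_r(P, Q) \leq \mathrm{Diam}(\Omega) \left( 2\, \mathrm{TV}(P, Q) \right)^{1/r},$$

bounding the $r$-Wasserstein distance in terms of the $L^1$ and total variation distance. As noted in Example 1, equality holds in (7) precisely when $\rho$ is the unit discrete metric given by $\rho(x, y) = 1_{\{x \neq y\}}$ for all $x, y \in \Omega$.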

On metric spaces that are discrete (i.e., when $\mathrm{Sep}(\Omega) > 0$), the Wasserstein metric is (topologically) at least as strong as the total variation metric (and the $L^1$ metric, when it is well-defined), in that convergence in Wasserstein metric implies convergence in total variation (and $L^1$, respectively). On the other hand, on bounded metric spaces (i.e., when $\mathrm{Diam}(\Omega) < \infty$), the converse is true. In either of these cases, rates of convergence may differ between metrics, although, in metric spaces that are both discrete and bounded (e.g., any finite space), we have $W_r^r(P, Q) \asymp \mathrm{TV}(P, Q)$.

To obtain tight bounds as discussed below, we will require not only a partition of the sample space , but a nested sequence of partitions, defined as follows.

###### Definition 13 (Refinement of a Partition, Nested Partitions).

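Suppose $\mathcal{S}, \mathcal{T} \in \mathbb{S}$ are partitions of $\Omega$. $\mathcal{T}$ is said to be a refinement of $\mathcal{S}$ if, for every $T \in \mathcal{T}$, there exists $S \in \mathcal{S}$ with $T \subseteq S$. A sequence $\{\mathcal{S}_k\}_{k \in [K]}$ of partitions is called nested if, for each $k \in [K - 1]$, $\mathcal{S}_{k+1}$ is a refinement of $\mathcal{S}_k$.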
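For finite partitions, the two notions in Definition 13 are easy to check mechanically. The sketch below assumes the usual convention that each partition in a nested sequence refines the one before it; the names `is_refinement` and `is_nested` are ours, and partitions are represented as lists of Python sets.

```python
def is_refinement(fine, coarse):
    """True if every cell of `fine` is contained in some cell of `coarse`."""
    return all(any(set(t) <= set(s) for s in coarse) for t in fine)

def is_nested(partitions):
    """True if each partition in the list refines the one before it."""
    return all(is_refinement(partitions[k + 1], partitions[k])
               for k in range(len(partitions) - 1))
```

For example, `[{1, 2, 3, 4}]`, `[{1, 2}, {3, 4}]`, `[{1}, {2}, {3, 4}]` is a nested sequence, whereas `[{1, 2}, {3, 4}]` is not a refinement of `[{1}, {2}, {3, 4}]`.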

While Lemma 11 gave a simple upper bound on the Wasserstein distance, the factor of $\operatorname{Diam}(\Omega)$ turns out to be too large to obtain tight rates for a number of cases of interest (such as the Euclidean unit cube discussed in Example 2). The following lemma gives a tighter upper bound, based on a hierarchy of nested partitions of $\Omega$; this allows us to obtain tighter bounds (than $\operatorname{Diam}(\Omega)$) on the distance that mass must be transported between $P$ and $Q$. Note that, in its base case, Lemma 14 reduces to a trivial combination of Lemmas 10 and 11; indeed, these lemmas are the starting point for proving Lemma 14 by induction on $K$.

Note that the idea of such a “multi-resolution” upper bound has been used extensively before, and numerous versions have been proven (see, e.g., Fact 6 of [do2011sublinearTimeEMD], Lemma 6 of [fournier2015rate], or Proposition 1 of [weed2017sharp]). Most of these versions have been specific to Euclidean space; to the best of our knowledge, only Proposition 1 of [weed2017sharp] applies to general metric spaces. However, that result also requires that $\Omega$ is totally bounded (more precisely, that , for some ).

###### Lemma 14.

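Let $K$ be a positive integer. Suppose $\{\mathcal{S}_k\}_{k}$ is a nested sequence of countable Borel partitions of $\Omega$. Then, for any $r \geq 1$ and Borel probability measures $P$ and $Q$ on $\Omega$,

$$W_r^r(P, Q) \leq (\operatorname{Res}(\mathcal{S}_0))^r + \sum_{k=1}^{\infty} (\operatorname{Res}(\mathcal{S}_k))^r \left( \sum_{S \in \mathcal{S}_{k+1}} |P(S) - Q(S)| \right). \tag{8}$$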

Lemma 14 requires a sequence of partitions of $\Omega$ that is not only multi-resolution but also nested. While the $\varepsilon$-covering number implies the existence of small partitions with small resolution, these partitions need not be nested as $\varepsilon$ becomes small. For this reason, we now give a technical lemma that, given any sequence of partitions, constructs a nested sequence of partitions of the same cardinality, with only a small increase in resolution.

###### Lemma 15.

Suppose and are partitions of , and suppose is countable. Then, there exists a partition of such that:

1. .

2. .

3. is a refinement of .

Lemmas 14 and 15 are the main tools needed to decompose the expected Wasserstein distance $\mathbb{E}[W_r^r(P, \hat{P})]$ of the empirical distribution from the true distribution into a sum of its expected errors on each element of a nested partition of $\Omega$. Then, we will need to control the total expected error across these partition elements, which we will show behaves similarly to the error of the standard maximum likelihood (mean) estimator of a multinomial distribution from its true mean. Thus, the following result of [han2015minimax] will be useful.

###### Lemma 16 (Theorem 1 of [han2015minimax]).

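Suppose $(X_1, \dots, X_K) \sim \operatorname{Multinomial}(n, p_1, \dots, p_K)$. Let

$$Z := \|X - np\|_1 = \sum_{k=1}^{K} |X_k - n p_k|.$$

Then, $\mathbb{E}[Z / n] \leq \sqrt{(K - 1)/n}$.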
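For small $n$ and $K$, the expectation in Lemma 16 can be computed exactly by enumerating all multinomial outcomes. The sketch below does this and compares $\mathbb{E}[Z]$ with $\sqrt{(K-1)n}$, which is the same bound multiplied through by $n$; the function names and the parameter values are illustrative assumptions.

```python
from itertools import product
from math import factorial, prod, sqrt


def multinomial_pmf(counts, probs):
    """Probability of observing `counts` under Multinomial(sum(counts), probs)."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)  # exact at every step
    return coeff * prod(p ** c for p, c in zip(probs, counts))


def expected_l1_deviation(n, probs):
    """Exact E[ sum_k |X_k - n p_k| ] by enumerating all count vectors summing to n."""
    total = 0.0
    for counts in product(range(n + 1), repeat=len(probs)):
        if sum(counts) != n:
            continue
        z = sum(abs(c - n * p) for c, p in zip(counts, probs))
        total += multinomial_pmf(counts, probs) * z
    return total


# Hypothetical example: n = 6 draws over K = 3 categories.
n, probs = 6, (0.5, 0.3, 0.2)
z_mean = expected_l1_deviation(n, probs)
z_bound = sqrt((len(probs) - 1) * n)
```

The enumeration visits $(n+1)^K$ candidate count vectors, so it is only practical for very small cases.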

Finally, we are ready to prove Theorem 7.

###### Theorem 7.

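Let $(\Omega, \rho)$ be a metric space on which $P$ is a Borel probability measure. Let $\hat{P}$ denote the empirical distribution of $n$ IID samples $X_1, \dots, X_n \sim P$, given by

$$\hat{P}(S) := \frac{1}{n} \sum_{i=1}^{n} 1_{\{X_i \in S\}}, \quad \forall S \in \Sigma.$$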

Then, for any sequence with ,

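$$\mathbb{E}\left[ W_r^r(P, \hat{P}) \right] \leq \varepsilon_K^r + \frac{1}{\sqrt{n}} \sum_{k=1}^{K} \left( \sum_{j=k-1}^{K} 2^{j-k} \varepsilon_j \right)^r \sqrt{N(\varepsilon_k) - 1}.$$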
###### Proof.

By recursively applying Lemma 15, there exists a sequence of partitions of satisfying the following conditions:

1. for each , .

2. for each , .

3. is nested.

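Note that, for any $k \in [K]$, the vector $n\hat{P}(S)$ (indexed by $S \in \mathcal{S}_k$) follows an $n$-multinomial distribution over $|\mathcal{S}_k|$ categories, with means given by $nP(S)$; i.e.,

$$(n\hat{P}(S_1), \dots, n\hat{P}(S_k)) \sim \operatorname{Multinomial}(n, P(S_1), \dots, P(S_k)).$$

Thus, by Lemma 16, for each $k \in [K]$,

$$\mathbb{E}\left[ \sum_{S \in \mathcal{S}_k} \left| P(S) - \hat{P}(S) \right| \right] \leq \sqrt{\frac{|\mathcal{S}_k| - 1}{n}} = \sqrt{\frac{N(\varepsilon_k) - 1}{n}}.$$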

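Thus, by Lemma 14 (applied with $Q = \hat{P}$),

$$\begin{aligned}
\mathbb{E}\left[ W_r^r(P, \hat{P}) \right]
&\leq \mathbb{E}\left[ \varepsilon_K^r + \sum_{k=1}^{K} \left( \sum_{j=k}^{K} 2^{j-k} \varepsilon_j \right)^r \left( \sum_{S \in \mathcal{S}_k} |P(S) - \hat{P}(S)| \right) \right] \\
&\leq \varepsilon_K^r + \sum_{k=1}^{K} \left( \sum_{j=k}^{K} 2^{j-k} \varepsilon_j \right)^r \mathbb{E}\left[ \sum_{S \in \mathcal{S}_k} |P(S) - \hat{P}(S)| \right] \\
&\leq \varepsilon_K^r + \frac{1}{\sqrt{n}} \sum_{k=1}^{K} \left( \sum_{j=k}^{K} 2^{j-k} \varepsilon_j \right)^r \sqrt{N(\varepsilon_k) - 1}.
\end{aligned}$$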
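As a numerical illustration of the multinomial step in the proof above, the following sketch simulates the total cell-wise error $\sum_{S} |P(S) - \hat{P}(S)|$ and compares its Monte Carlo average with $\sqrt{(|\mathcal{S}| - 1)/n}$. It assumes, purely for illustration, that $P$ is uniform on $[0, 1)$ and that the partition consists of $m$ equal-width intervals; the function names, sample size, and seed are hypothetical choices.

```python
import random
from math import sqrt


def cell_l1_error(samples, m):
    """sum over cells S of |P(S) - P_hat(S)| for P uniform on [0, 1), m equal cells."""
    n = len(samples)
    counts = [0] * m
    for x in samples:
        counts[min(int(x * m), m - 1)] += 1  # index of the cell containing x
    return sum(abs(1.0 / m - c / n) for c in counts)


def average_error(n, m, trials, seed=0):
    """Monte Carlo estimate of the expected cell-wise L1 error."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += cell_l1_error([rng.random() for _ in range(n)], m)
    return total / trials


n, m = 200, 8
avg = average_error(n, m, trials=500)
bound = sqrt((m - 1) / n)
```

Because the estimate is random, it only suggests (rather than proves) that the bound holds with some slack in this setting.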

## Appendix B Proof Sketch of Theorem 8

In this section, we prove our more general upper bound, Theorem 8, which applies to potentially unbounded metric spaces $(\Omega, \rho)$, assuming that $P$ is sufficiently concentrated (i.e., has at least $\ell$ finite moments).

The basic idea is to partition the potentially unbounded metric space $\Omega$ into countably many totally bounded subsets $B_0, B_1, \dots$, and to decompose the Wasserstein error into its error on each $B_k$, weighted by the probability $P(B_k)$. Specifically, fixing an arbitrary base point $x_0 \in \Omega$, $B_0, B_1, \dots$ will be spherical shells, such that $\Omega = \bigcup_{k} B_k$, and both the distance between $x_0$ and $B_k$, as well as the size (covering number) of $B_k$, increase with $k$. For large $k$, the assumption that $P$ has bounded moments implies (by Markov’s inequality) that $P(B_k)$ is small, whereas, for small $k$, we adapt our previous result Theorem 7 in terms of the covering number.
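The Markov step for the outer shells can be checked exactly on a discrete example: for any $w > 0$ and $\ell > 0$, the mass at distance at least $w$ from $x_0$ is at most $\mathbb{E}[\rho(x_0, X)^\ell] / w^\ell$. The sketch below assumes a distribution on the real line with $\rho(x, y) = |x - y|$; the function names and the geometric-type example distribution are illustrative assumptions.

```python
def moment(points, probs, x0, ell):
    """E[ |X - x0|^ell ] for a finitely supported distribution on the real line."""
    return sum(p * abs(x - x0) ** ell for x, p in zip(points, probs))


def tail_mass(points, probs, x0, w):
    """P( |X - x0| >= w )."""
    return sum(p for x, p in zip(points, probs) if abs(x - x0) >= w)


# Hypothetical example: mass proportional to 2^{-k} at the points k = 0, ..., 29.
points = list(range(30))
weights = [2.0 ** (-k) for k in points]
probs = [wt / sum(weights) for wt in weights]

ell = 2
m_ell = moment(points, probs, x0=0, ell=ell)
checks = [
    (w, tail_mass(points, probs, 0, w), m_ell / w ** ell)
    for w in (1, 2, 4, 8, 16)
]
```

Each entry of `checks` holds a radius, the exact tail mass beyond it, and the corresponding Markov bound.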

To carry out this approach, we will need two new lemmas. The first decomposes Wasserstein distance into the sum of its distances on each $B_k$, and can be considered an adaptation of Lemma 2.2 of [lei2018convergence] (for Banach spaces) to general metric spaces.

###### Lemma 17.

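Fix a reference point $x_0 \in \Omega$ and a non-decreasing real-valued sequence $\{w_k\}_{k \in \mathbb{N}}$ with $w_0 = 0$ and $\lim_{k \to \infty} w_k = \infty$. For each $k \in \mathbb{N}$, define

$$B_k := \{x \in \Omega : w_k \leq \rho(x_0, x) < w_{k+1}\}.$$

Then, there exists a constant $C_r$ depending only on $r$ such that, for any Borel probability measures $P$ and $Q$ on $\Omega$,

$$W_r^r(P, Q) \leq C_r \sum_{k=0}^{\infty} w_k^r \min\{P(B_k), Q(B_k)\} \, W_r^r(P_{B_k}, Q_{B_k}) + |P(B_k) - Q(B_k)|,$$

where, for any sets $A, B \subseteq \Omega$,

$$P_A(B) = \frac{P(A \cap B)}{P(A)}$$

(under the convention that $0/0 = 0$) denotes the conditional probability of $B$ given $A$, under $P$.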
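To make the shell construction concrete, the following sketch assigns points to shells of the form $\{x : w_k \leq \rho(x_0, x) < w_{k+1}\}$ and accumulates the mass of each shell. It assumes points on the real line with $\rho(x, y) = |x - y|$ and treats the last listed radius as the start of an unbounded final shell; the function names and the radii $0, 1, 2, 4, 8$ are illustrative assumptions.

```python
from bisect import bisect_right


def shell_index(distance, w):
    """Index k with w[k] <= distance < w[k+1]; the last shell is unbounded."""
    if distance < w[0]:
        raise ValueError("distance lies below the first radius")
    return bisect_right(w, distance) - 1


def shell_masses(points, probs, x0, w):
    """Probability mass of each shell around x0 for a discrete distribution."""
    masses = [0.0] * len(w)
    for x, p in zip(points, probs):
        masses[shell_index(abs(x - x0), w)] += p
    return masses


# Hypothetical radii: w_0 = 0, then doubling.
w = [0, 1, 2, 4, 8]
points = [-3, 0.5, 1, 5]
probs = [0.25, 0.25, 0.25, 0.25]
masses = shell_masses(points, probs, x0=0, w=w)
```

Using `bisect_right` makes each shell closed on the inside and open on the outside, so a point at distance exactly `w[k]` lands in shell `k`.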

The second lemma is a more nuanced variant of Lemma 16 (albeit leading to slightly looser constants). When $k$ is large, the covering number of $B_k$ can become quite large, but the total probability $P(B_k)$ is quite small. Whereas Lemma 16 depends only on the size of the partition, the following result will allow us to control the total error using both of these factors.

###### Lemma 18 (Theorem 1 of [berend2013binomialMAD]).

Suppose $X \sim \operatorname{Binomial}(n, p)$. Then, we have the bound

$$\mathbb{E}\left[ |X - np| \right] \leq n \min\left\{ 2p, \sqrt{p/n} \right\} \tag{9}$$

on the mean absolute deviation of $X$.
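The two regimes of the bound in (9) can be seen by computing the binomial mean absolute deviation exactly: the term $2np$ is the smaller one when $p$ is below roughly $1/(4n)$, and $\sqrt{np}$ is smaller otherwise. The sketch below is a minimal check over a hypothetical grid of $(n, p)$ values; the function names are illustrative.

```python
from math import comb, sqrt


def binomial_mad(n, p):
    """Exact E|X - np| for X ~ Binomial(n, p)."""
    return sum(
        comb(n, x) * p ** x * (1 - p) ** (n - x) * abs(x - n * p)
        for x in range(n + 1)
    )


def mad_bound(n, p):
    """Right-hand side n * min{2p, sqrt(p / n)}."""
    return n * min(2 * p, sqrt(p / n))


grid = [(n, p) for n in (1, 5, 50) for p in (0.001, 0.01, 0.1, 0.5, 0.9)]
results = [(n, p, binomial_mad(n, p), mad_bound(n, p)) for n, p in grid]
```

For example, with $n = 50$ and $p = 0.001$ the bound equals $2np = 0.1$, which is what makes the lemma useful on shells of very small probability.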

Finally, we are ready to prove our main upper bound result for unbounded metric spaces.

###### Theorem 8 (General Upper Bound for Unbounded Metric Spaces).

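Let and suppose . Let $J$ be a positive integer. Fix two real-valued sequences $\{w_k\}_{k \in \mathbb{N}}$ and $\{\varepsilon_j\}_{j \in [J]}$, of which $\{w_k\}_{k \in \mathbb{N}}$ is non-decreasing with $w_0 = 0$ and $\lim_{k \to \infty} w_k = \infty$ and $\{\varepsilon_j\}_{j \in [J]}$ is non-increasing. For each $k \in \mathbb{N}$, define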

$$B_k(x_0) := \{y \in \Omega : w_k \leq \rho(x_0, y) < w_{k+1}\}.$$

Then,

 E[Wrr(P,ˆP)] +J∑j=1(J∑t=j2J−tεt)rmin⎧⎪⎨⎪⎩2w−ℓk,√w−ℓknN(Bk,ρ,εj)⎫⎪⎬⎪⎭.
###### Proof.

As in the proof of Theorem 7, by recursively applying Lemma 15, for each $k \in \mathbb{N}$, we can construct a nested sequence $\{\mathcal{S}_{k,j}\}_{j \in [J]}$ of partitions of $B_k$ such that, for each $j \in [J]$,

 |Sk,j|=N(Bk,ρ,