Finite sample properties of parametric MMD estimation: robustness to misspecification and dependence

# Finite sample properties of parametric MMD estimation: robustness to misspecification and dependence

## Abstract

Many works in statistics aim at designing a universal estimation procedure. This question is of major interest, in particular because it leads to robust estimators, a very hot topic in statistics and machine learning. In this paper, we tackle the problem of universal estimation using a minimum distance estimator presented in [Briol et al., 2019] based on the Maximum Mean Discrepancy. We show that the estimator is robust to both dependence and to the presence of outliers in the dataset. We also highlight the connections that may exist with minimum distance estimators using -distance. Finally, we provide a theoretical study of the stochastic gradient descent algorithm used to compute the estimator, and we support our findings with numerical simulations.

## 1 Introduction

One of the main challenges in statistics is the design of a universal estimation procedure. Given data, a universal procedure is an algorithm that provides an estimator of the generating distribution which is simultaneously statistically optimal when the true distribution belongs to the model, and robust otherwise. Typically, a universal estimator is consistent for any model, with minimax-optimal or fast rates of convergence and is robust to small departures from the model assumptions [Bickel, 1976] such as sparse instead of dense effects or non-Gaussian errors in high dimensional linear regression. Unfortunately, most statistical procedures are based upon strong assumptions on the model or on the corresponding parameter set, and very famous estimation methods such as maximum likelihood estimation (MLE), method of moments or Bayesian posterior inference may fail even on simple problems when such assumptions do not hold. For instance, even though MLE is consistent and asymptotically normal with optimal rates of convergence in parametric estimation under suitable regularity assumptions [Le Cam, 1970, Van der Vaart, 1990] and in nonparametric estimation under entropy conditions, this method behaves poorly in case of misspecification when the true generating distribution of the data does not belong to the chosen model.

Let us investigate a simple example presented in [Birgé, 2006] that illustrates the non-universal characteristic of MLE. We observe a collection of independent and identically distributed (i.i.d) random variables that are distributed according to some mixture distribution where is the uniform distribution between and . We consider the parametric model of independent uniform distributions , , and we choose the squared Hellinger distance as the risk measure. Here the maximum likelihood is the maximum of the observations , and is a good approximation of the generating distribution as for . Hence, one would expect that goes to as , which is actually not the case. We do not even have consistency: . Hence, the MLE is not robust to this small deviation from the parametric assumption. Other problems can arise for the MLE: for instance, the quadratic risk can be much bigger than the minimax risk, and the performance of the MLE may be too sensitive to the choice of the family of densities used in the model, see respectively [Birgé, 2006] and [Baraud and Birgé, 2016]. The same happens in Bayesian statistics: the regular posterior distribution is not always robust to model misspecification. Indeed, authors of [Barron et al., 1999, Grünwald and Van Ommen, 2017] show pathologic cases where the posterior does not concentrate to the true distribution.

Universal estimation is all the more important since it provides a generic approach to tackle the more and more popular problem of robustness to outliers under the i.i.d assumption, although definitions and goals involved in robust statistics are quite different from the universal estimation perspective. Huber introduced a framework that models situations where a small fraction of data is contaminated, and he assumes that the true generated distribution can be written where is the contaminating distribution and is the proportion of corrupted observations [Hüber, 1964]. The goal when using this approach is to estimate the true parameter given a misspecified model with . A procedure is then said to be robust in this case if it leads to a good estimation of the true parameter . More generally, when a procedure is able to provide a good estimate of the generating distribution of i.i.d data when a small proportion of them is corrupted, whatever the values of these outliers, then such an estimator is considered as robust.

### 1.1 Related work

Several authors attempted to design a general universal estimation method. Sture Holm [Bickel, 1976] suggested that Minimum Distance Estimators (MDE) were the most natural procedures being robust to misspecification. Motivated by [Wolfowitz, 1957, Parr and Schucany, 1980], MDE consists in minimizing some probability distance between the empirical distribution and a distribution in the model. The MDE is defined by:

 d(^Pn,P^θn)=infθ∈Θd(^Pn,Pθ)

where is the empirical measure and the parameter set associated to the model. If the minimum does not exist, then one can consider a -approximate solution. In fact, this minimum distance estimator is used in many usual procedures. Indeed, the generalized method of moments [Hansen, 1982] is actually defined as minimizing the weighted Euclidean distance between moments of and while the MLE minimizes the KL divergence. When the distance is wisely chosen, among others, it must be bounded, then MDE can be robust and consistent. A typical choice of the metric is the Total Variation (TV) distance [Yatracos, 1985, Devroye and Lugosi, 2001]. [Yatracos, 1985] showed that under the i.i.d assumption, the minimum distance estimator based on the TV metric is uniformly consistent in TV distance and is robust to misspecification without any assumption on the parameter set, with a rate of convergence depending on the Kolmogorov entropy of the space of measures. A few decades later, Devroye and Lugosi studied in details the skeleton estimate, a variant of the estimator of [Yatracos, 1985] that is based on the TV-distance restricted to the so-called Yatracos sets, see [Devroye and Lugosi, 2001]. Unfortunately, the skeleton estimate and the original Yatracos estimate are not computationally tractable.

In [Baraud and Birgé, 2016] and [Baraud et al., 2017], Baraud, Birgé and Sart introduced in the independent framework the so-called -estimators, a universal method that retains some appealing properties of the MLE such as efficiency under some regularity assumptions, while being robust to Hellinger deviations. -estimation is inspired from T-estimation [Birgé, 2006], itself inspired from earlier works of Le Cam [Le Cam, 1973, Le Cam, 1975] and Birgé [Birgé, 1983], and goes beyond the classical compactness assumption used in T-estimation. In compact models, -estimators can be seen as variants of T-estimators also based on robust tests, but they can be extended to noncompact models such as linear regression with fixed or random design with various error distributions. As T-estimators, they enjoy robustness properties, but involve other metric dimensions which lead to optimal rates of convergence with respect to the Hellinger distance even in cases where T-estimators can not be defined. Moreover, note that when the sample size is large enough, -estimation recovers the usual MLE in density estimation when the model is parametric, well-specified and regular enough. Hence, -estimation can be seen as a robust version of the MLE, but once again, such a strategy is intractable.

More recently, [Briol et al., 2019] showed that using the Maximum Mean Discrepancy (MMD) [Gretton et al., 2012] as a minimum distance estimator leads to both robust and tractable estimation in the i.i.d case. MMD, a metric based on embeddings of probability measures into a reproducing kernel Hilbert space, has been applied successfully in a wide range of problems such as kernel Bayesian inference [Song et al., 2011], approximate Bayesian computation [Park et al., 2016], two-sample [Gretton et al., 2012] and goodness-of-fit testing [Jitkrittum et al, 2017], and MMD GANs [Dziugaite et al, 2015, Li and Zemel, 2015] and autoencoders [Zhao et al ,2017], to name a few prominent examples. Such minimum MMD-based estimators are proved to be consistent, asymptotically normal and robust to model misspecification. The trade-off between the statistical efficiency and the robustness is made through the choice of the kernel. The authors investigated the geometry induced by the MMD on the finite-dimensional parameter space and introduced a (natural) gradient descent algorithm for efficient computation of the estimator. This algorithm is inspired from the stochastic gradient descent (SGD) used in the context of MMD GANs where the usual discriminator is replaced with a two-sample test based on MMD [Dziugaite et al, 2015]. These results were extended in the Bayesian framework by [Chérief-Abdellatif and Alquier, 2019].

### 1.2 Contributions

In this paper, we further investigate universality properties of minimum distance estimation based on MMD distance [Briol et al., 2019]. Inspired by the related literature, our contributions in this paper are the following:

• We go beyond the classical i.i.d framework. Indeed, we prove that the the estimator is robust to dependence between observations. To do so, we introduce a new dependence coefficient expressed as a covariance in some reproducing kernel Hilbert space, and which is very simple to use in practice.

• We show that our oracle inequalities imply robust estimation under the i.i.d assumption in the Huber contamination model and in the case of adversarial contamination.

• We also highlight the connection between our MMD estimator and minimum distance estimation using -distance for radial kernels.

• We propose a theoretical analysis of the SGD algorithm used to compute this estimator in [Briol et al., 2019] and [Dziugaite et al, 2015] for some finite dimensional models. We provide numerical simulation to illustrate our theoretical results.

The first result of this paper is a generalization bound in the non-i.i.d setting. It states that under a very general dependent assumption, the generalization error with respect to the MMD distance decreases in as . This result extends the inequalities in [Briol et al., 2019] that are only available in the i.i.d framework, and is obtained using dependence concepts for stochastic processes. Since the seminal work of [Rosenblatt, 1956], many mixing conditions, that is, restrictions on the dependence bewteen observations, were defined. These conditions lead to limit theorems (LLN, CLT) useful to analyze the asymptotic behavior of time series [Doukhan, 1994]. Nevertheless, checking mixing assumptions is difficult in practice and many classes of processes that are of interest in statistics such as elementary Markov chains are sometimes not mixing. More recently, [Doukhan and Louhichi, 1999] proposed a new weak dependence condition for time series that is built on covariance-based coefficients which are much easier to compute than mixing ones, and that is more general than mixing as it stands for most relevant classes of processes. We introduce in this paper a new dependence coefficient in the wake of [Doukhan and Louhichi, 1999] which can be expressed as a covariance in some reproducing kernel Hilbert space associated with MMD, which can be easily computed in many situations and which may be related to usual mixing coefficients such as the popular -mixing one. We show that a weak assumption on this new dependence coefficient can relax the i.i.d assumption of [Briol et al., 2019] and can lead to valid generalization bounds even in the dependent setting.

Also, we provide inequalities in -distance. Previous attempts of designing a universal estimator lead to bounds in TV or Hellinger distances [Baraud and Birgé, 2016, Devroye and Lugosi, 2001], but state that the quadratic loss is to be avoided as a minimum distance estimator, in particular because this metric exclude distributions for which no density is available. We show here that for radial kernels, the MMD distance is a good approximation of the -metric when densities exist, and thus can be seen as a “universalized” and robustified version of the -distance-based minimum distance estimator. We introduce conditions on the kernel leading to valid generalization bounds in quadratic loss. Moreover, we show how our results can be used in the context of robust estimation with contamination, and how they can provide statistically optimal robust Gaussian mean estimation with respect to the Euclidean distance.

Regarding computational issues, we provide a Stochastic Gradient Descent algorithm as in [Briol et al., 2019, Dziugaite et al, 2015] involving a U-statistic approximation of the expectation in the formula of the MMD distance. We theoretically analyze this algorithm in parametric estimation using a convex parameter set. We also perform numerical simulations that illustrate the efficiency of our method, especially by testing the behavior of the algorithm in the presence of outliers.

The rest of this paper is organized as follows. Section 2 defines the MMD-based minimum distance estimator and our new dependence coefficient based on the kernel mean embedding. Section 3 provides nonasymptotic bounds in the dependent and misspecified framework, with results in robust parametric estimation and the connection with density estimation using quadratic loss. Section 4 illustrates the efficiency of our method in several different frameworks. We finally present an SGD algorithm with theoretical convergence guarantees in Section 5 and we perform numerical simulations in Section 6. Section 7 is dedicated to the proofs.

## 2 Background and definitions

In this section, we introduce first some notations and present the statistical setting of the paper in Section 2.1. Then we remind in Section 2.2 some theory on reproducing kernel Hilbert spaces (RKHS) and we define both the maximum mean discrepancy (MMD) and our minimum distance estimator based on the MMD. Finally, we introduce in Section 2.3 a new dependence coefficient expressed as a covariance in a RKHS.

### 2.1 Statistical setting

We shall consider a dependent setting throughout the paper. We observe in a measurable space a collection of random variables ,…, generated from a stationary process. This implies that the ’s are identically distributed, we will let denote their marginal distribution. Note that this include as an example the case where the ’s are i.i.d with generating distribution . We introduce a statistical model indexed by a parameter space .

### 2.2 Maximum Mean Discrepancy

We consider a positive definite kernel function , i.e a symmetric function such that for any integer , for any and for any :

 n∑i=1n∑j=1cicjk(xi,xj)≥0.

We then consider the reproducing kernel Hilbert space (RKHS) associated with the kernel which satisfies the reproducing property for any function and any . From now on, we assume that the kernel is bounded by some positive constant, that will be assumed to be without loss of generality.

Now we introduce the notion of kernel mean embedding, a Hilbert space embedding of a probability measure that can be viewed as a generalization of the original feature map used in support vector machines and other kernel methods. The basic idea is to map measures into the RKHS , enabling to apply all various kernel methods to the underlying measures. Given a probability measure , we define the mean embedding as:

 μP(⋅):=EX∼P[k(X,⋅)]∈Hk.

All the applications and the theoretical properties of those embeddings have been well studied [Muandet et al., 2017]. In particular, the mean embedding satisfies the relationship for any function , and induces a semi-metric 1 on measures called maximum mean discrepancy and defined for two measures and as follows:

 Dk(P,Q)=∥μP−μQ∥Hk

or alternatively

 D2k(P,Q)=EX,X′∼P[k(X,X′)]−2EX∼P,Y∼Q[k(X,Y)]+EY,Y′∼Q[k(Y,Y′)].

A kernel is said to be characteristic if is injective. This ensures that is a metric, and not only a semi-metric. Subsection 3.3.1 of the thorough survey [Muandet et al., 2017] provides a wide range of conditions ensuring that is characteristic. They also provide many examples of characteristic kernels, see their Table 3.1. Among others, the Gaussian kernel and the Laplace kernel , that we will use in all of our applications, are known to be characteristic. From now, we will assume that is characteristic.

Note that there are many applications of the kernel mean embedding and MMD in statistics such as two-sample testing [Gretton et al., 2012], change-point detection [Arlot et al., 2012], detection [Lerasle et al., 2019], we refer the reader to [Liu et al., 2019] for a thorough introduction to the applications of kernels and MMD to computational biology .

Here, we will focus on estimation of parameters based on MMD. This principle was used to train generative networks [Dziugaite et al, 2015, Li and Zemel, 2015], it’s only recently that it was studied as a general principle for estimation [Briol et al., 2019]. Following these papers we define the MMD estimator such that:

where is the empirical measure, i.e.:

 ^θn=argminθ∈Θ{EX,X′∼Pθ[k(X,X′)]−2nn∑i=1EX∼Pθ[k(X,Xi)]}.

It could be that there is no minimizer, see the discussion in Theorem 1 page 9 in [Briol et al., 2019]. In this case, we can use an approximate minimizer. More precisely, for any we can always find a such that:

 Dk(P^θn,ε,^Pn)≤infθ∈ΘDk(Pθ,^Pn)+ε.

In what follows, we will consider the case where the minimizer exists (that is, ) but when this is not the case, everything can be easily extended by considering .

### 2.3 Covariances in RKHS

In this subsection, we introduce and discuss a new dependence coefficient based on the kernel mean embedding. This coefficient allows to go beyond the i.i.d case and to show that the MMD estimator of [Briol et al., 2019] is actually robust to dependence. Please refer to [Briol et al., 2019] for results in the i.i.d case.

###### Definition 2.1.

We define, for any ,

 ϱt=∣∣E⟨k(Xt,⋅)−μP0,k(X0,⋅)−μP0⟩Hk∣∣.

In the i.i.d case, note that for any . In general, the following assumption will ensure the consistency of our estimator:

###### Assumption 2.1.

There is a such that, for any , .

Our mean embedding dependence coefficient may be seen as a covariance expressed in the RKHS . We shall see throughout the paper that the kernel mean embedding coefficient can be easily computed in many situations, and that it is closely related to the widely used mixing coefficients. In particular, we will show in Section 4.2 that our coefficient is upper-bounded by the celebrated -mixing coefficient for radial kernels in the case of a strictly stationary time series. More importantly, we exhibit in Section 4.3 an example of a non-mixing process such that but for which is exponentially decaying and hence Assumption 2.1 still holds, which means that is a more general and weaker dependence coefficient than usual mixing coefficients, due to its covariance structure. Hence, Assumption 2.1 may be referred to as a weak dependence condition in the wake of the concept of weak dependence introduced in [Doukhan and Louhichi, 1999]. Using a Hoeffding-like inequality due to [Rio, 2000], we will show in the next section that under Assumption 2.1, we can obtain a nonasymptotic generalization bound of the same order than in the i.i.d case.

## 3 Nonasymptotic bounds in the dependent, misspecified case

In this section, we provide nonasymptotic generalization bounds in MMD distance for the minimum MMD estimator. In particular, we show in Subsection 3.1 that under a weak dependence assumption, it is robust to both dependence and misspecification, and is consistent at the same rate than in the i.i.d case. In particular, we give explicit bounds in the Huber contamination model and in a more general adversarial setting in Subsection 3.2. Finally, we connect in Subsection 3.3 our MMD-based estimator to an -based one when densities exist. We discuss conditions on the kernel and provide oracle inequalities in -distance.

### 3.1 Estimation with respect to the MMD distance

First, we begin with a theorem that gives an upper bound on the generalization error, i.e the expectation of . The rate of convergence of this error is of order independently of the dimensions and the property of the kernel.

###### Theorem 3.1.

We have:

 E[Dk(P^θn,P0)]≤infθ∈ΘDk(Pθ,P0)+2√1+2∑nt=1ϱtn.

As a consequence, under Assumption 2.1:

 E[Dk(P^θn,P0)]≤infθ∈ΘDk(Pθ,P0)+2√1+2Σn.

We remind that all the proofs are deferred to Section 7. It is also possible to provide a result that holds with large probability as in [Briol et al., 2019] and in [Dziugaite et al, 2015]. Naturally, it requires stronger assumptions, and the conditions on the dependence become more intricate in this case. Here, we use a condition introduced in [Louhichi, 1998] for generic metric spaces that we adapt to the kernel embedding and to stationarity:

###### Assumption 3.1.

Assume that there is a family of nonnegative numbers such that, for any integer , for any and any function such that

 |g(a1,…,aℓ)−g(b1,…,bℓ)|≤ℓ∑i=1∥ai−bi∥Hk,

we have, , almost surely. Assume that there is a .

Again, note that in the case of independence, we can take all the and hence in addition to . We can now state our result in probability:

###### Theorem 3.2.

Assume that Assumptions 2.1 and 3.1 are satisfied. Then, for any ,

 P⎡⎢ ⎢ ⎢ ⎢⎣Dk(P^θn,P0)≤infθ∈ΘDk(Pθ,P0)+2√1+2Σ+(1+Γ)√2log(1δ)√n⎤⎥ ⎥ ⎥ ⎥⎦≥1−δ.

Assumption 3.1 is fundamental to obtain a result in probability. Indeed, the rate of convergence in Theorem 3.2 is characterized by some concentration inequality upper bounding the MMD distance between the empirical and the true distribution as done in [Briol et al., 2019]. Nevertheless, the proof of this inequality in [Briol et al., 2019] is based on a Hoeffding-type inequality known as McDiarmid’s inequality [McDiarmid, 1989] that is only valid for independent variables, which makes this inequality not applicable in our dependent setting. Hence we use a version of McDiarmid’s inequality for time series obtained by Rio [Rio, 2013] which is available under a polynomial decay assumption on some mixing dependence coefficients . This decay assumption is expressed here in the RKHS of Kernel as Assumption 3.1.

###### Remark 3.1 (The i.i.d case).

Note that when the ’s are i.i.d, Assumptions 2.1 and 3.1 are always satisfied with and thus Theorem 3.1 gives simply

 E[Dk(P^θn,P0)]≤infθ∈ΘDk(Pθ,P0)+2√n

while Theorem 3.2 gives

 P⎡⎢ ⎢ ⎢ ⎢⎣Dk(P^θn,P0)≤infθ∈ΘDk(Pθ,P0)+21+√2log(1δ)√n⎤⎥ ⎥ ⎥ ⎥⎦≥1−δ.

### 3.2 Robust parametric estimation

#### Contamination models

As explained in the introduction, when all observations but a small proportion of them are sampled independently from a generating distribution (), robust parametric estimation consists in finding estimators being both rate optimal and resistant to outliers. Two among the most popular frameworks for studying robust estimation are the so-called Huber’s contamination model and the adversarial contamination model.

Huber’s contamination model is as follows. We observe a collection of random variables . We consider a contamination rate , latent i.i.d random variables and some noise distribution , such that the distribution of given is , and that the distribution of given is . Hence, the observations ’s are independent and sampled from the mixture .

The adversarial model is more general. Contrary to Huber’s contamination where outliers were all sampled from the contaminating distribution, we do not make any particular assumption on the outliers here. Hence, we shall adopt slightly different notations. We assume that are identically distributed from for some . However, the statistician only observes where can be any arbitrary value for , where is an arbitrary set subject to the constraint , and for . The estimators are built based on these observations .

#### Literature

One hot research trend in robust statistics is focused on the search of both statistically optimal and computationally tractable procedures for the Gaussian mean estimation problem in the presence of outliers under the i.i.d assumption, which remains a major challenge. Usual robust estimators such as the coordinatewise median and the geometric median are known to be suboptimal in this case, and there is a need to look at more complex estimators such as Tukey’s median that achieves the minimax optimal rate of convergence with respect to the squared Euclidean distance, where is the dimension, is the sample size and is the proportion of corrupted data. Unfortunately, computation of Tukey’s median is not tractable and even approximate algorithms lead to an complexity [Chan, 2004, Amenta et al., 2000]. This has led to the rise of the recent studies in robust statistics which adress how to build robust and optimal statistical procedures, in the wake of the works of [Tukey, 1975] and [Hüber, 1964], but that are also computationally efficient.

This research area started with two seminal works presenting two procedures for the normal mean estimation problem: the iterative filtering [Diakonikolas et al, 2016] and the dimension halving [Lai et al, 2016]. These algorithms are based upon the idea of using higher moments in order to obtain a good robust moment estimation, and are minimax optimal up to a poly-logarithmic factor in polynomial time. This idea was then used in several other problems in robust statistics, for instance in sparse functionals estimation [Du et al, 2017], clustering [Kothari et al, 2018], mixtures of spherical Gaussians learning [Diakonikolas et al, 2018a], and robust linear regression [Diakonikolas et al, 2018b]. [Gao et al., 2019] offers a different perspective on robust estimation and connects the robust normal mean estimation problem with Generative Adversarial Networks (GANs) [Goodfellow et al., 2014, Biau et al., 2018], what enables computing robust estimators using efficient tools developed for training GANs. Hence, the authors compute depth-like estimators that retain the same appealing robustness properties than Tukey’s median and that can be trained using stochastic gradient descent (SGD) algorithms that were originally designed for GANs.

Another popular approach for the more general problem of mean estimation under the i.i.d assumption in the presence of outliers is the study of finite-sample sub-Gaussian deviation bounds. Indeed, designing estimators achieving sub-Gaussian performance under minimal assumptions ensures robustness to outliers that are inevitably present when the generating distribution is heavy-tailed. In the univariate case, some estimators present a sub-Gaussian behavior for all distributions under first and second order moments. A simple but powerful strategy, the Median-of-Means (MOM), dates back to [Nemirovski and Yudin, 1983, Jerrum et al, 1986, Alon et al., 1999]. This method consists in randomly splitting the data into several equal-size blocks, then computing the empirical mean within each block, and finally taking the median of them. Most MOM-based procedures lead to estimators that are simultaneously statistically optimal [Lugosi and Mendelson, 2016, Devroye et al., 2016, Lecué et al., 2018, Lerasle et al., 2019, Chinot et al, 2019] and computationally efficient [Hopkins, 2019, Cherapanamjeri et al, 2019, Depersin and Lecué, 2019]. Moreover, this approach be easily extended to the multivariate case [Minsker, 2015, Hsu and Sabato, 2016]. An important advantage is that the MOM estimator has good performance even for distributions with infinite variance. An elegant alternative to the MOM strategy is due to Catoni, whose estimator is based on PAC-Bayesian truncation in order to mitigate heavy tails [Catoni, 2012]. It has the same performance guarantees than the MOM method but with sharper and near-optimal constants. In [Catoni and Giulini, 2017], Catoni and Giulini proposed a very simple and trivial-to-compute multidimensional extension of Catoni’s M-estimator defined as an empirical average of the data, with the observations with large norm shrunk towards zero, and that still satisfies a sub-Gaussian concentration using PAC-Bayes inequalities. The influence function of Catoni and Giulini has been widely used since then, see [Giulini, 2017, Giulini, 2018, Holland, 2019a, Holland, 2019b]. We refer the reader to the excellent review of [Lugosi and Mendelson, 2019] for more details on those mean estimation procedures.

#### Robust MMD estimation

In this section, we show the properties of our MMD-based estimator in robust parametric estimation with outliers, both in Huber’s and in the adversarial contamination model. Our bounds are obtained by working directly in the RKHS rather than in the parameter space, and going back and forth between the two spaces.

First we consider Huber’s contamination model [Hüber, 1964]. The objective is to estimate by observing contaminated random variables , …, with actual distribution is for some , and some . We state the key following lemma:

###### Lemma 3.3.

We have: .

As a consequence of Lemma 3.3 and Theorem 3.1, we have the following result.

###### Corollary 3.4.

Assume that are identically distributed from for some , some , with . Then:

 E[Dk(P^θn,Pθ0)]≤2ϵ+2√1+2∑nt=1ϱtn.

If moreover we assume that Assumptions 2.1 and 3.1 are satisfied, then for any ,

 P⎡⎢ ⎢ ⎢ ⎢⎣Dk(P^θn,Pθ0)≤2⎛⎜ ⎜ ⎜ ⎜⎝ϵ+√1+2Σ+(1+Γ)√2log(1δ)√n⎞⎟ ⎟ ⎟ ⎟⎠⎤⎥ ⎥ ⎥ ⎥⎦≥1−δ.

We obtain a rate (in MMD distance) that can be expressed in the same way than the minimax rate (in the Euclidean distance) when estimating a Gaussian mean. When , then we recover the minimax rate of convergence without contamination, and when , then the rate is dominated by the contamination ratio . Hence, the maximum number of outliers which can be tolerated without breaking down the minimax rate is .

This result can also be extended to the adversarial contamination setting, where no assumption is made on the outliers.

###### Proposition 3.5.

Assume that are identically distributed from from for some . However, the statistician only observes where can be any arbitrary value for , is any arbitrary set subject to the constraint , and for and builds the estimator based on these observations:

 Dk(P~θn,1nn∑i=1δ~Xi)=infθ∈ΘDk(Pθ,1nn∑i=1δ~Xi).

Then:

 Dk(P~θn,Pθ0)≤4ϵ+2Dk(P^θn,Pθ0).

Thus

 E[Dk(P~θn,Pθ0)]≤4ϵ+4√1+2∑nt=1ϱtn

and, under Assumptions 2.1 and 3.1, for any ,

 P⎡⎢ ⎢ ⎢ ⎢⎣Dk(P^θn,Pθ0)≤4⎛⎜ ⎜ ⎜ ⎜⎝ϵ+√1+2Σ+(1+Γ)√2log(1δ)√n⎞⎟ ⎟ ⎟ ⎟⎠⎤⎥ ⎥ ⎥ ⎥⎦≥1−δ.

One can see that the rate of convergence we obtain without making any assumption on the outliers is exactly the same than in Huber’s contamination model. The only different thing is that the constant in the right hand side of the inequality is tighter in Huber’s contamination model.

### 3.3 Density estimation with quadratic loss

To obtain universal oracle inequalities in statistics is a goal that has been studied by many authors, and that leads to the question: what distance should be used between probability measures? Results with the total variation (TV) norm were obtained by [Yatracos, 1985], and more recently, the monograph [Devroye and Lugosi, 2001] gives a complete overview of estimation with TV. While the Kullback-Leibler distance seems natural as it leads to maximum likelihood estimation, it leads to many problems, including a very strong sensitivity to misspecification, see for example the discussion in [Baraud and Birgé, 2016]. There, the authors derive universal inequalities for the Hellinger distance. The Wasserstein distance became recenly extremely popular, partly due to its ability to take into account the geometry of the space , see [Peyré and Cuturi, 2019]. Some attempts to obtain universal estimation with the Wasserstein distance can be found in [Bernton et al., 2017, Lee and Raginsky, 2018]. Note that in [Devroye and Lugosi, 2001] and in [Baraud and Birgé, 2016], it is argued that the -distance between densities is not a good distance. Among others:

• it is not universal, as some probability measures don’t have densities, while some others have densities that are not in ,

• it depends on the choice of the reference measure (usually taken as the Lebesgue measure).

Regarding the first objection, we argue here that for reasonable kernels, the MMD distance is an approximation of the -distance that is defined for any probability distribution. Thus, from the oracle inequalities on the MMD distance it is possible to derive oracle inequalities on the -distance. On the other hand, the MMD distance remains well-defined for any probability measure, even those without density. To this regard, estimation with respect to the MMD distance can be seen as a universal approximation of density estimation in .

Regarding the second objection, our estimator does not depend on any reference measure, but it depends on the choice of the kernel. However, we believe that this can actually be an attractive property – indeed, the popularity of the Wasserstein distance is due to the fact that it takes into account the geometry of through the choice of a distance on . But the same argument hold for MMD-based estimation, the geometry of being taken into account through the choice of the kernel. For example, when , a property shared with the Wasserstein distance, and which does not hold for the Hellinger nor the TV distance.

Let us come back to the link with the distance. Consider for example the Gaussian kernel . On the one hand, when , for any , this leads to for any and . This case is not very useful as it does not “see” the difference between any probability distributions. On the other hand, when , assume that and have densities and with respect to the Lebesgue measure respectively. Under suitable regularity and integrability assumptions on and , we have

 EX∼P,Y∼Q[kγ(X,Y)]∼πd2γd∫p(x)q(x)dx

when , and thus

 Dkγ(P,Q)∼πd4γd2∥p−q∥L2.

Considering small enough, but not zero, will then allow to give a sense to estimation even for densities that are not in . Of course, when and are not regular, these approximations can become wrong. Thus, the regularity of the model is something important in the link between MMD and distance. In order to make this discussion more formal, let us first introduce a measure of the distortion between the MMD and the distance.

###### Assumption 3.2.

Assume that and that where and .

When Assumption 3.2 is satisfied, we will use the notation .

###### Assumption 3.3.

For any , has density w.r.t the Lebesgue measure.

###### Definition 3.1.

When Assumptions 3.2 and 3.3 are satisfied, we define:

 L(γ) =inf(θ,θ′)∈ΘDkγ(Pθ,Pθ′)∥pθ−pθ′∥L2, and U(γ) =sup(θ,θ′)∈ΘDkγ(Pθ,Pθ′)∥pθ−pθ′∥L2.
###### Assumption 3.4.

The true distribution has density w.r.t the Lebesgue measure.

When all these assumptions are satisfied, the model and the true distribution are well-behaved and it possible to derive from MMD estimation a bound for estimation. The tightness of the bound will depend on the ratio .

###### Theorem 3.6.

Under Assumptions 3.2, 3.3 and 3.4,

 E[∥p^θn−p0∥L2]≤(1+2U(γ)L(γ))infθ∈Θ∥pθ−p0∥L2+2L(γ) ⎷1+2∑nt=1(1−tn)ϱtn.

If moreover we assume that Assumptions 2.1 and 3.1 are satisfied,

 P[∥p^θn−p0∥L2≤(1+2U(γ)L(γ))infθ∈Θ∥pθ−p0∥L2+2√1+2Σ+(1+Γ)√2log(1/δ)L(γ)√n]≥1−δ.

We end this subsection by a proposition that allows to upper bound and . We remind the definition of the Fourier transform of a function :

 F[f](t)=∫f(x)exp(−2iπ⟨t,x⟩)dx.
###### Assumption 3.5.

The kernel is such that for some function with

1. for any ,

2. for some ,

3. there is an such that .

###### Example 3.1.

All these conditions are satisfied by the Gaussian kernel: and . Then as required. Moreover: and so we have 1) and 2) with and 3) with and .

###### Proposition 3.7.

Under 1. and 2. in Assumption 3.5,

 U(γ)≤D1/2γd/2.

Under 3. in Assumption 3.5,

 L(γ)≥bγd/2A(aγ)

where

 A(ξ):=inf(θ,θ′)∈Θ2  ⎷∫∥t∥≤ξ|F[pθ−pθ′](t)|2dt∫Rd|F[pθ−pθ′](t)|2dt.

Let us now discuss the consequences of Theorem 3.6 and Proposition 3.7 for density estimation, as well as the role of and . For the sake of simplicity we stick to the bound in expectation in the discussion, but the the same comments apply to the bound in probability. First, pluging Proposition 3.7 into Theorem 3.6 gives:

 E(∥p^θn−p0∥L2)≤(1+2√DbA(aγ))infθ∈Θ∥pθ−p0∥L2+2bγd2A(aγ) ⎷1+2∑nt=1(1−tn)ϱtn (1)

and, in the well-specified case,

 E(∥p^θn−p0∥L2)≤2bγd2A(aγ) ⎷1+2∑nt=1(1−tn)ϱtn. (2)

It is clear that the optimization with respect to might change the way the bound in (1) depends on , but will not affect the way it depends on . The first examples in Section 4 will clearly illustrate this fact. So, in small dimensions, one can always take to obtain the bound:

 E(∥p^θn−p0∥L2)≤(1+2√DbA(a))infθ∈Θ∥pθ−p0∥L2+2bA(a) ⎷1+2∑nt=1(1−tn)ϱtn.

However, in large dimension, the dependence on matters. Note that the function satisfies:

• ,

• is nondecreasing,

• .

For a fixed , is the ratio between the energy in low frequencies and the whole energy of which might exhibit different behaviors depending on the smoothness of .

When has enough energy in its low frequencies for any and , then one can expect for . In this case, the function would satisfy and , which means that even though the function might have a global minimum somewhere in between and , taking as large as possible cannot really hurt in (2) – even though it will make the first term in the right-hand side of (1) explode, which means that taking too large is unsafe in case of misspecification.

For nonsmooth models, one can however have for . In this case, both and will make the r.h.s explode both in (2) and (1). In this case, a careful optimization w.r.t is in order.

## 4 Examples

### 4.1 Independent observations

In this subsection, we focus on i.i.d observations. That is, for any . Moreover, we will only use the Gaussian kernel . Note that for this kernel, Proposition 3.7 gives . We will use this bound in some situations, however, in the Gaussian model, we will see that it is possible to derive the explicit dependence between the norm and the MMD norm, thus avoiding Proposition 3.7. We assume that Assumption 3.4 is satisfied in this whole section.

#### Estimation of the mean in a Gaussian model

Here, and we are interested in the estimation of the mean in a Gaussian model. For the sake of simplicity, we assume that the variance is known. In this case, the proof of Proposition 4.1 below will show that we have explicit formulas, for any , for and for , both as functions of . In particular, this leads to exact formulas

The complete proof is postponed to Section 7.

###### Proposition 4.1.

Assume that for . Then, for any ,

 P[∥p^θn−p0∥L2≤(1+4σ2+γ22σ2)infθ∈Θ∥pθ−p0∥L2+4σ2+γ22σ2(4σ2+γ24πσ2γ2)d/22+2√2log(1/δ)√n]≥1−δ. (3)

Moreover, the second term in the upper bound is minimized for which leads to

 ∥p^θn−p0∥L2≤(3+d)infθ∈Θ∥pθ−p0∥L2+e(d+2)(4πσ2)d22+2√2log(1/δ)√n, (4)

still with probability . Finally, assume that we are in an adversarial contamination model where a proportion at most of the observations is contaminated, then, with probability ,

 ∥~θn−θ0∥2≤−2σ2(d+2)log⎡⎢ ⎢ ⎢ ⎢ ⎢⎣1−8e⎛⎜ ⎜ ⎜ ⎜⎝ϵ+1+√2log(1δ)√n⎞⎟ ⎟ ⎟ ⎟⎠2⎤⎥ ⎥ ⎥ ⎥ ⎥⎦. (5)

Note that when is small and is large,

 ∥~θn−θ0∥2≤−2σ2(d+2)log⎡⎢ ⎢ ⎢ ⎢ ⎢⎣1−16e⎛⎜ ⎜ ⎜ ⎜ ⎜⎝ϵ2+(1+√2log(1δ))2n⎞⎟ ⎟ ⎟ ⎟ ⎟⎠⎤⎥ ⎥ ⎥ ⎥ ⎥⎦∼32eσ2(d+2)⎛⎜ ⎜ ⎜ ⎜ ⎜⎝ϵ2+(1+√2log(1δ))2n⎞⎟ ⎟ ⎟ ⎟ ⎟⎠.

According to Theorems 2.1 and 2.2 in [Chen et al., 2018], the minimax rate with respect , and is . Hence, we obtain a convergence rate that achieves a quadratic dependence in , contrary to most popular robust estimators such as Median-of-Means which dependence in is linear. Note that the rate of convergence we obtain is the one achieved by the geometric median.

#### Cauchy model

Here, and where has density . This time, we use the generic upper bound and prove a lower bound for thanks to Proposition 3.7. We obtain the following result.

###### Proposition 4.2.

Assume that for . Then, taking leads to, for any ,

 P[∥p^θn−p0∥L2≤14infθ∈Θ∥pθ−p0∥L2+6+6√2log(1/δ)√n]≥1−δ.

Moreover, assume that we are in an adversarial contamination model where a proportion at most of the observations is contaminated, then, with probability ,

 (~θn−θ0)2≤4⎛⎜ ⎜⎝1−11−96π(ϵ2+2+4log(1/δ)n)⎞⎟ ⎟⎠.

Note that

 (~θn−θ0)2≤4⎛⎜ ⎜⎝1−11−96π(ϵ2+2+4log(1/δ)n)⎞⎟ ⎟⎠∼384π(ϵ2+2+4log(1/δ)n).

Again, we achieve the optimal quadratic dependence in .

#### Uniform model

Here, and .

###### Proposition 4.3.

Assume that , Then, taking leads to, for any ,

 P[∥p^θn−p0∥L2≤23.4infθ∈Θ∥pθ−p0∥L2+10+10√2log(1/δ)√n]≥1−δ.

Note that the proof shows that in the well specified case , . Thus, for large enough to ensure that the bound is smaller than , Proposition 4.3 states that, with probability at least ,

 |^θn−θ0|≤10+10√2log(1/δ)√n.

Note that in this model, the moment estimator reaches the rate but the MLE reaches the rate . In practice, we indeed observe in the simulations that for , the MMD estimator is “as bad” as the moment estimator. However, on the contrary to MLE and moment estimators, it is highly robust to the presence of outliers.

Moreover, for , we observe that the MMD estimator becomes as good as the MLE in the nice situation (correct specification, no outliers). We were not able to explain this with our theoretical analysis and leave it to future works.

#### Estimation with a dictionary

We consider here estimation of the density as a linear combination of given functions in a dictionary. This framework actually appears in various models:

• first, when the dictionary contains densities, this is simply a mixture of known components. In this case, the linear combination is actually a convex combination. This context is for example studied in [Dalalyan and Sebbar, 2017].

• in nonparametric density estimation, we can use this setting, the dictionary being a basis of . This is for example the point of view in [Alquier, 2008, Bunea et al., 2007, Bunea et al., 2010].

We will here focus on the first setting, but an extension to the second one is quite straightforward. Let be a family of probability measures over . For we remind that

 μΦi(⋅)=∫k(x,⋅)Φi(dx).

Define the measure with respect to the Lebesgue measure, and we define the model . We could consider in a general framework, but as we only study the mixture case, we assume that . Note that in the fisrt case, most ’s are not probability measures, but this is in accordance with our definition of a statistical model. The estimator is then

 ^θn=argminθ∈Θ∥∥ ∥∥d∑ℓ=1θℓμΦℓ(