# Learning from MOM’s principles: Le Cam’s approach

Guillaume Lecué (CREST, CNRS, Université Paris Saclay) and Matthieu Lerasle (Laboratoire de Mathématiques d’Orsay, Univ. Paris-Sud, CNRS, Université Paris Saclay)
###### Abstract

We obtain estimation error rates for estimators built by aggregating regularized median-of-means tests, following a construction of Le Cam. The results hold with exponentially large probability, under weak moment assumptions on the data. Any norm may be used for regularization; when it has some sparsity-inducing power, we recover sparse rates of convergence. The procedure is robust: a large part of the data may consist of outliers that have nothing to do with the oracle we want to reconstruct. Our general risk bound is of order

 max( minimax rate in the i.i.d. setup, (number of outliers) / (number of observations) ).

In particular, the number of outliers may be as large as (number of data) × (minimax rate) without affecting this rate. The other data do not have to be identically distributed, but should only have equivalent first and second moments. For example, the minimax rate of recovery of an s-sparse vector in ℝ^d is achieved with exponentially large probability by a median-of-means version of the LASSO when the noise has q₀ moments for some q₀ > 2, the entries of the design matrix have logarithmically many moments, and the dataset can be corrupted by a number of outliers proportional to (number of data) × (minimax rate).

###### keywords:
robust statistics, statistical learning, high dimensional statistics.
###### Msc:
[2010] 62G35, 62G08.

guillaume.lecue@ensae.fr

matthieu.lerasle@math.u-psud.fr

## 1 Introduction

Consider the problem of estimating f* ∈ argmin_{f∈F} E(Y − f(X))², a minimizer of the integrated square-loss over a convex class F of functions, based on a dataset (X_i, Y_i)_{i=1,…,N}. The labels Y and Y_i’s are real-valued, while the inputs X and X_i’s take values in an abstract measurable space 𝒳.

Empirical Risk Minimizers (ERM) of MR1641250; MR0474638 and, later on, their regularized versions replace the unknown distribution P in the definition of the risk by the empirical distribution P_N based on the sample. Given a regularization function reg(·), this produces regularized ERM defined by

 f̂^{RERM}_N ∈ argmin_{f∈F} { P_N(Y − f(X))² + reg(f) }.

These estimators are optimal in i.i.d. subgaussian setups but suffer several drawbacks when data are heavy-tailed or corrupted by “outliers”, see MR3052407; HubRonch2009. These issues are critical in many modern applications, such as high-frequency trading, where heavy-tailed data are quite common, or various areas of biology, such as micro-array analysis or neuroscience, where data sometimes remain corrupted even after preprocessing. To overcome the problem, various methods have been proposed. The most common strategy is to replace the square-loss function by one less sensitive to outliers. For example, MR0161415 proposed a loss that interpolates between the square and absolute losses to produce an estimator between the unbiased (but non-robust) empirical mean and the (more robust but biased) empirical median. Huber’s estimators have been intensively studied asymptotically by MR0161415; HubRonch2009; non-asymptotic results have also been obtained more recently, for example by MR3217454; shahar_general_loss; FanLiWang2016. An alternative approach has been proposed by MR3052407 and used in learning frameworks such as least-squares regression by MR2906886 and for more general loss functions by MR3405602.
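Huber’s interpolation between the square and absolute losses can be sketched numerically. The threshold `delta` and the iteratively-reweighted scheme below are standard illustrative choices, not constructions taken from the references above.

```python
import numpy as np

def huber_loss(r, delta):
    """Huber loss: quadratic for |r| <= delta, linear beyond.

    Interpolates between the square loss (delta -> infinity) and
    a shifted multiple of the absolute loss (delta -> 0)."""
    r = np.asarray(r, dtype=float)
    quad = 0.5 * r**2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)

def huber_location(z, delta, n_iter=100):
    """Location M-estimator for the Huber loss, computed by
    iteratively reweighted averaging (a standard heuristic)."""
    m = np.median(z)
    for _ in range(n_iter):
        r = z - m
        w = np.where(np.abs(r) <= delta, 1.0, delta / np.abs(r))
        m = np.sum(w * z) / np.sum(w)
    return m
```

On a sample with one wild observation, `huber_location` stays near the bulk of the data, between the (robust, biased) median and the (unbiased, non-robust) mean.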

Another line of research building robust estimators and robust selection procedures was initiated by MR0334381; MR856411 and further developed by MR2219712, MR2834722 and BaraudBirgeSart. It is based on comparisons, or tests, between elements of F. More precisely, the approach builds on test statistics T(g, f) comparing pairs f and g in F. These tests define, for each f, the set B(f) of all g’s that have been preferred to f, and the final estimator is a minimizer over f of the diameter of B(f). The measure of diameter is directly related to the statistical performance one seeks for the estimator. These methods mostly focus on the Hellinger loss and are generally considered difficult to compute, see however MR3224300; MR3224298.

In a related but different approach, LugosiMendelson2016 have recently introduced “median-of-means tournaments”. Tests built from the median-of-means estimators of MR1688610; MR855970; MR702836 compare elements of F. A “champion” is an element f such that the radius of the set of functions preferred to f is smaller than a computable upper bound. They prove that the risk of any champion is controlled by this upper bound. An important message of this paper is that Le Cam’s estimators are quite common in statistics, in particular in robust statistics. For example, Section 3 shows that any minimizer of a penalized empirical loss can be obtained by Le Cam’s approach and that Le Cam’s estimators based on median-of-means tests are champions of median-of-means tournaments.

This paper studies estimators derived from Le Cam’s procedure based on regularized median-of-means (MOM) tests (see Section 4.1). Our estimators are therefore particular instances of champions of MOM tournaments, and another motivation is to push further the analysis of this particular champion. The main advantage of MOM tests over Le Cam’s original ones is that they allow for more classical loss functions than the Hellinger loss. This idea is illustrated on the square-loss. Compared to Huber’s or Catoni’s losses, this approach makes it easy to control the risk of our estimators using classical tools from empirical process theory; it also allows us to tackle the problem of “aggressive” outliers.

The closest work is certainly that of LugosiMendelson2016, but we believe that our paper contains substantial improvements. We stress the intimate relationship between their estimator and Le Cam’s general construction and use this parallel to propose a much simpler estimator. Our risk bounds are always better, and we extend their results to possibly corrupted datasets.

To investigate robustness properties of median-of-means estimators, we partition the dataset into two parts. One is made of outliers, indexed by a set O of cardinality |O|. On those data, absolutely nothing is assumed: they may not be independent, may have distributions totally different from P, may have no moments at all, etc. These are typically data polluting datasets, as with declarative data on the internet, or when something went wrong during storage, compression or transfer and resulted in nonsensical data. They may also be observations met in biology, as in the classical eQTL studies (Expression Quantitative Trait Loci and The Phenogen Database) from eQTL. Many other examples of datasets containing outliers could be provided, including fraud detection and detection of terrorist activity. Of course, outliers are not flagged in advance and the statistician is given no a priori information on which data are outliers. The other part of the dataset is made of the data on which the MOM estimator relies to estimate the oracle f*. There should be enough information in those data so that estimation of f* is possible, even in the presence of outliers, provided these remain in a “decent proportion”. We therefore call the non-outliers the informative data, those that bring information on f*. We denote by I the set indexing these data. We therefore end up with a partition of {1, …, N} as O ∪ I which, again, is not known by the statistician.

The radii of the sets B(f) are computed for both the regularization and L₂(P) norms. The regularization norm is chosen in advance by the statistician to promote sparsity or smoothness. It can be used freely in our procedure, but it does not by itself ensure a small risk for the estimator. The L₂(P)-norm is unknown in general since it depends on the distribution of X. Furthermore, the classical empirical L₂(P_N) metric fails to estimate the L₂(P) metric without subgaussian properties of the design X. Fortunately, it can be replaced by a median-of-means metric. To handle both the regularization and L₂(P) norms simultaneously, we also slightly extend Le Cam’s principle. Our first important result shows that the resulting estimator is well localized w.r.t. both norms.

Median-of-means estimators rely on a splitting of the data into K blocks, and this parameter K drives the resulting statistical performance (cf. MR3576558). To achieve optimal rates, K should ultimately be chosen using parameters that depend on the oracle, such as its sparsity, which is in general not available to the statistician. To bypass this problem, the strategy of MR1147167 is used, as in MR3576558, to select K adaptively and get a fully data-driven procedure.

There are four important features in our approach. First, all results are proved under weak moment assumptions on the noise. This is an almost minimal condition for the problem to make sense. The class F is only assumed to satisfy a weak “L₂/L₁” norm comparison. Second, the performance of the estimators is not affected by the presence of complete outliers, as long as their number remains comparable to (number of observations) × (rate of convergence). Third, all results are non-asymptotic and the regression function x ↦ E[Y | X = x] is never assumed to belong to the class F. In particular, the noise ζ can be correlated with X. Finally, even the “informative data”, those that are not “outliers”, are not required to be i.i.d., but only to have close first and second moments on the class F. Nevertheless, the estimators are shown to behave as well as the ERM when the data are i.i.d., the noise and the class are Gaussian, and the noise is independent of the design.

Example: sparse recovery via MOM-LASSO. As a proof of concept, theoretical properties are illustrated on the classical example of sparse recovery in high-dimensional spaces using ℓ₁-regularization. This example illustrates the typical results that follow from our analysis in one of the most classical problems of high-dimensional statistics (cf. MR2807761; MR3307991). The interested reader can check that the analysis also applies to other procedures such as Slope (cf. slope1; slope2) and trace-norm regularization, as well as kernel methods, for instance by using the results in LM_reg_comp; LM_reg_comp_2.

Recall this classical setup. Let X denote a random vector in ℝ^d such that E⟨X, t⟩² = ‖t‖₂² for all t ∈ ℝ^d (X is isotropic) and let Y be a real-valued random variable. Let ζ = Y − ⟨X, t*⟩. Let (X_i, Y_i)_{i=1,…,N} denote data corrupted by outliers: no assumption is made on a subset O of the dataset. Let I denote the indices of informative data: for all i ∈ I, the (X_i, Y_i) are independent with the same distribution as (X, Y). For the sake of simplicity, we only consider the case of i.i.d. informative data in this example. In high-dimensional statistics, N ≪ d, but t* has only s (≪ N) non-zero coordinates. To estimate t*, the ℓ₁-norm is used for penalization to promote zero coordinates. The following result holds.

###### Theorem 1.

[Theorem 1.4 in LM_reg_comp] Assume that t* is s-sparse, X is isotropic and

1. |O| = 0 and I = {1, …, N} (no outliers in the dataset),

2. ζ ∈ L_{q₀} for some q₀ > 2,

3. there exists κ₀ such that for all t ∈ ℝ^d and all 1 ≤ p ≤ c₁ log d, ‖⟨X, t⟩‖_{L_p} ≤ κ₀ √p ‖⟨X, t⟩‖_{L₂},

4. there exist u₀ > 0 and β₀ > 0 such that for all t ∈ ℝ^d,

 P[ |⟨X, t⟩| ≥ u₀ ‖⟨X, t⟩‖_{L₂} ] ≥ β₀.

The LASSO estimator, defined for a regularization parameter λ > 0 by

 t̃ ∈ argmin_{t∈ℝ^d} { P_N(Y − ⟨X, t⟩)² + λ‖t‖₁ },

satisfies, for a suitable choice of λ and every 1 ≤ p ≤ 2,

 ‖t̃ − t*‖_p ≤ c₄(L, u₀, κ₀) ‖ζ‖_{L_{q₀}} s^{1/p} √(log(ed)/N),

with probability at least

 1 − c₂ log^{q₀}(N) / N^{q₀/2 − 1} − 2 exp(−c₃ s log(ed/s)). (1)

This paper shows that Theorem 1 holds for a MOM version of the LASSO estimator under much weaker assumptions, with a better probability estimate than (1). More precisely, the following theorem is proved.

###### Theorem 2.

Assume that t* is s-sparse, X is isotropic and

1. |O| ≤ c₁ s log(ed/s) (the number of outliers may be proportional to the sparsity times log(ed/s)),

2. ζ ∈ L_{q₀} for some q₀ > 2,

3. max_{j∈[d]} ‖⟨X, e_j⟩‖_{L_{c₀ log d}} ≤ L, where (e₁, …, e_d) is the canonical basis of ℝ^d and c₀ is some absolute constant,

4. there exists θ₀ such that ‖⟨X, t⟩‖_{L₂} ≤ θ₀ ‖⟨X, t⟩‖_{L₁}, for all t ∈ ℝ^d,

5. there exists θ_m such that var(ζ⟨X, t⟩) ≤ θ_m² ‖t‖₂², for all t ∈ ℝ^d.

There exists an estimator t̂, called MOM-LASSO, satisfying, for every 1 ≤ p ≤ 2,

 ‖t̂ − t*‖_p ≤ c₄(L, θ_m) ‖ζ‖_{L_{q₀}} s^{1/p} √( (1/N) log(ed/s) ),

with probability at least

 1 − c₂ exp(−c₃ s log(ed/s)). (2)

Theoretical properties of MOM LASSO outperform those of LASSO in several ways.

• Estimation rates achieved by MOM-LASSO are the actual minimax rates s^{1/p} √(log(ed/s)/N), see BLT16, while classical LASSO estimators achieve the rate s^{1/p} √(log(ed)/N). This improvement is possible thanks to the adaptation step in MOM-LASSO.

• The probability deviation in (1) is polynomial in N, while it is exponentially small for MOM-LASSO in (2). Exponential rates for LASSO hold only if ζ is subgaussian (‖ζ‖_{L_p} ≤ C√p for all p ≥ 2).

• MOM-LASSO is insensitive to data corruption by up to c s log(ed/s) outliers, while only one outlier can be responsible for a dramatic breakdown of the performance of LASSO.

• All assumptions on X are weaker for MOM-LASSO than for LASSO. In particular, condition (5) holds with θ_m = C‖ζ‖_{L₄} as soon as ‖⟨X, t⟩‖_{L₄} ≤ C‖⟨X, t⟩‖_{L₂} for all t ∈ ℝ^d, which is a much weaker requirement than condition (3) for LASSO.
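To fix ideas, the MOM-LASSO principle can be sketched by a simple heuristic: run a proximal-gradient (ISTA) step on the block whose empirical loss is the median among K blocks, so that wildly corrupted blocks never drive the updates. This is only an illustrative sketch on simulated data, not the exact estimator analyzed in this paper (in particular it omits the adaptive choice of K).

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def mom_lasso(X, y, K, lam, n_iter=300):
    """Heuristic MOM-LASSO sketch: at each iteration, pick the block whose
    empirical quadratic loss is the median of the K block losses, then take
    one ISTA step computed on that block only."""
    N, d = X.shape
    blocks = np.array_split(np.random.permutation(N), K)
    t = np.zeros(d)
    for _ in range(n_iter):
        losses = [np.mean((y[B] - X[B] @ t) ** 2) for B in blocks]
        B = blocks[int(np.argsort(losses)[K // 2])]  # median block
        grad = -2.0 * X[B].T @ (y[B] - X[B] @ t) / len(B)
        L = 2.0 * np.linalg.norm(X[B], 2) ** 2 / len(B)  # block Lipschitz constant
        t = soft_threshold(t - grad / L, lam / L)
    return t
```

Since a few arbitrarily corrupted labels contaminate at most as many blocks, the median block remains informative and the iterates are unaffected by the outliers, in contrast with a plain LASSO fit on the same data.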

From a mathematical point of view, our results are based on a slight extension of the small ball method (SBM) of MR3431642; Shahar-COLT to handle non-i.i.d. data. The SBM is also extended to bound both the quadratic and multiplier parts of the quadratic loss process. Otherwise, all arguments are standard, which makes the approach very attractive and easily reproducible in other frameworks of statistical learning.

The paper is organized as follows. Section 2 briefly presents the general setting and our main illustrative example. Section 3 presents Le Cam’s construction of estimators based on tests. We also show why many learning procedures may be obtained by this approach. The construction of estimators and the main assumptions are gathered in Section 4. Our main theorems are stated in Section 5 and proved in Section 6.

#### Notation

For any real number x, let ⌊x⌋ denote the largest integer smaller than or equal to x. For any finite set A, let |A| denote its cardinality. All along the paper, c, c₁, c₂, … denote absolute constants which may vary from line to line, and θ, with various subscripts, denotes real-valued parameters introduced in the assumptions. Finally, for any set C for which it makes sense, any function g and any c ∈ ℝ,

 g + cC = cC + g = { h : ∃ g′ ∈ C such that h = g + cg′ }.

We also denote by I(A) the indicator function of the set A, which equals 1 when A holds and 0 otherwise.

## 2 Setting

Let 𝒳 denote a measurable space and let (X, Y), (X₁, Y₁), …, (X_N, Y_N) denote random variables taking values in 𝒳 × ℝ, with respective distributions P, P₁, …, P_N. Given a probability distribution Q, let L₂(Q) denote the space of all functions f from 𝒳 to ℝ such that ‖f‖_{L₂(Q)} < ∞, where ‖f‖²_{L₂(Q)} = Q[f²]. Let F denote a convex class of functions f : 𝒳 → ℝ. Assume that F ⊂ L₂(P) and let, for all f ∈ F,

 R(f) = P[(Y − f(X))²], f* ∈ argmin_{f∈F} R(f) and ζ = Y − f*(X).

Let ‖·‖ denote a norm defined on a linear subspace E of L₂(P) containing F.

#### Example : ℓ1-regularization of linear functionals

For every t ∈ ℝ^d and x ∈ ℝ^d, let f_t(x) = ⟨x, t⟩. Let F = {f_t : t ∈ ℝ^d}, where

 t* ∈ argmin_{t∈ℝ^d} { P(Y − ⟨X, t⟩)² }.

Whenever necessary, (e₁, …, e_d) will denote the canonical basis of ℝ^d, and B₁^d (resp. S₁^{d−1}) will denote the unit ball (resp. sphere) associated with the ℓ₁-norm. To ease readability in this example, we focus on rates of convergence; we do not consider the “full” non-i.i.d. setup and assume that P_i = P for all i ∈ I. We write f_t instead of ⟨·, t⟩ to shorten notations.

## 3 Learning from tests

### 3.1 General Principle

This section details the ideas underlying the construction of a MOM estimator using an extension of Le Cam’s approach.

#### Basic idea

By definition of the oracle f*, one has

 f* = argmin_{f∈F} R(f) = argmin_{f∈F} sup_{g∈F} { R(f) − R(g) }, where R(f) = P[(Y − f(X))²].

As R(f) − R(g) depends on the unknown distribution P, we estimate it by test statistics T_N(g, f), that is, real random variables such that

 TN(f,g)+TN(g,f)=0. (3)

These statistics are used to compare f to g, simply by saying that g beats f iff T_N(g, f) ≥ 0. In this paper, the statistics T_N(g, f) are median-of-means estimators of R(f) − R(g) (cf. (12) in Section 4.1).

#### Le Cam’s construction

Let (T_N(g, f))_{f,g∈F} denote a collection of test statistics and let d denote a pseudo-distance on F measuring (or related to) the risk we want to control. For all f ∈ F, let

 B_{T_N}(f) = { g ∈ F : T_N(g, f) ≥ 0 }

be the set of all functions that beat f. If f is far from f*, then B_{T_N}(f) is expected to have a large radius w.r.t. d. We therefore introduce this radius as a criterion to minimize: for all f ∈ F, let C_{T_N}(f) = sup_{g∈B_{T_N}(f)} d(f, g).

By (3), for any f ∈ F, either f* ∈ B_{T_N}(f) or f ∈ B_{T_N}(f*) (both happen if T_N(f, f*) = 0). In particular, for all f ∈ F,

 d(f, f*) ≤ C_{T_N}(f) ∨ C_{T_N}(f*). (4)

Eq (4) suggests to define the estimator

 f̂_{T_N} ∈ argmin_{f∈F} C_{T_N}(f) = argmin_{f∈F} sup_{g∈B_{T_N}(f)} d(f, g). (5)

This estimator satisfies, from Eq. (4),

 d(f̂_{T_N}, f*) ≤ C_{T_N}(f*). (6)

Risk bounds for f̂_{T_N} follow from (6) and upper bounds on the radius C_{T_N}(f*).
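Over a finite class, the construction (5) can be written directly. The scalar grid and the empirical criterion used as a test below are illustrative choices.

```python
import numpy as np

def le_cam_estimator(F, T, d):
    """Le Cam's estimator (5) over a finite class F (a list of candidates):
    return the candidate f minimizing the radius, w.r.t. d, of the set of
    candidates g that beat f, i.e. those with T(g, f) >= 0."""
    radii = []
    for f in F:
        beaters = [g for g in F if T(g, f) >= 0]  # contains f, since T(f, f) = 0
        radii.append(max(d(f, g) for g in beaters))
    return F[int(np.argmin(radii))]

# Illustration with a separated test, as in Example 1 of Section 3.2:
# the estimator then reduces to the minimizer of the empirical criterion.
z = np.array([1.9, 2.1, 2.0])
crit = lambda m: np.mean((z - m) ** 2)
T = lambda g, f: crit(f) - crit(g)
grid = [0.0, 1.0, 2.0, 3.0]
best = le_cam_estimator(grid, T, lambda f, g: abs(f - g))
```

Here the unique minimizer of the criterion has a beater set reduced to itself (radius zero), so it is returned by the construction.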

###### Remark 1.

More generally, one can compare only the elements of a subset F′ ⊂ F, typically a maximal ε-net, by introducing for all f ∈ F′ the set

 B_{T_N}(f, F′) = { g ∈ F′ : T_N(g, f) ≥ 0 } (7)

and then by minimizing the diameter of B_{T_N}(f, F′) over F′. This usually improves the rates of convergence for constant-deviation results when there is a gap in Sudakov’s inequality for the localized sets (cf. Section 5 in LM13 for more details). These results are not presented here because we are interested in exponentially large deviation results, for which our results are optimal.

#### Dealing with regularization : the link function

The statistical performance of the estimators and the radius of B_{T_N}(f) can be measured by two norms: the regularization norm ‖·‖ and the L₂(P)-norm. As (5) allows only one distance d, we propose the following extension of Le Cam’s approach to handle two metrics.

To introduce this extension, assume first that ‖f − g‖ can be computed for all f, g ∈ F (this is the case if the distribution of the design is known). The next paragraph explains how to deal with the more common framework where this distance is unknown. Remark that

 C_{T_N}(f) = sup_{g∈B_{T_N}(f)} ‖f − g‖ = min{ ρ ≥ 0 : sup_{g∈B_{T_N}(f)} ‖g − f‖ ≤ ρ }.

The main point in extending Le Cam’s approach to control two norms simultaneously is to design a link function r(·). In a nutshell, the value r(ρ) is the L₂(P)-minimax rate of convergence over a ball of radius ρ for the regularization norm (cf. (13) in Section 4.3 for a formal definition). Then one can define

 C^{(2)}_{T_N}(f) = min{ ρ ≥ 0 : sup_{g∈B_{T_N}(f)} ‖g − f‖ ≤ ρ and sup_{g∈B_{T_N}(f)} d(f, g) ≤ r(ρ) }.

Theorem 3 shows that, while a minimizer of C_{T_N} only has a controlled risk w.r.t. d, a minimizer of C^{(2)}_{T_N} has both the regularization norm and the L₂(P)-norm properly controlled.

#### Dealing with unknown norms : the isometry property

In general, L₂(P)-distances cannot be directly computed and have to be estimated. To deal with this issue, one usually considers the empirical distance L₂(P_N) and proves that empirical and actual distances are equivalent outside an L₂(P)-ball centered at f* (cf., for instance, the remark after Lemma 2.6 in LM13). Unfortunately, this approach only works under strong concentration properties that we want to relax in this paper.

The unknown L₂(P)-metric is instead estimated by a median-of-means approach, that is, we use MOM estimators d_N(f, g) of the distances ‖f − g‖_{L₂(P)} for all f, g ∈ F (cf. Section 4.4). The final estimator is therefore defined as a minimizer of

 C″_{T_N}(f) = min{ ρ ≥ 0 : sup_{g∈B_{T_N}(f)} ‖g − f‖ ≤ ρ and sup_{g∈B_{T_N}(f)} d_N(f, g) ≤ r(ρ) }.

### 3.2 Examples

Le Cam’s approach has been used by Birgé to define T-estimators (cf. MR2449129; MR2219712; MR3186748) and by Baraud, Birgé and Sart to define ρ-estimators (cf. MR3565484; BaraudBirgeSart). MR2834722; MR3224300 also built efficient estimator-selection procedures with this approach. It also extends many common procedures in statistical learning theory, as shown by the following examples.

#### Example 1 : Empirical minimizers

Assume T_N(g, f) = crit_N(f) − crit_N(g) for some random criterion crit_N and denote by f̂ a minimizer of this criterion (provided that it exists and is unique). Then it is easy to check that B_{T_N}(f̂) = {f̂}, so its radius is null, while the radius of B_{T_N}(f) for any other point f is positive (whatever the non-degenerate notion of pseudo-distance used for d). It follows that f̂ is the estimator (5). In particular, any possibly penalized empirical risk minimizer

 f̂ = argmin_{f∈F} { P_N ℓ_f + reg(f) }

is obtained by Le Cam’s construction with the tests

 T_N(g, f) = P_N(ℓ_f − ℓ_g) + reg(f) − reg(g).

These examples encompass classical empirical risk minimizers of MR1641250 but also their robust versions from MR0161415; MR2906886.

#### Example 2 : median-of-means estimators

Another, perhaps less obvious, example is the median-of-means estimator MR1688610; MR855970; MR702836 of the expectation of a real-valued random variable Z. Let Z₁, …, Z_N denote a sample and let B₁, …, B_K denote a partition of {1, …, N} into K bins of equal size N/K. The estimator MOM_K(Z) is the (empirical) median of the vector of empirical means (P_{B₁}Z, …, P_{B_K}Z). Recall that

 PZ = argmin_{m∈ℝ} P(Z − m)² = argmin_{m∈ℝ} max_{m′∈ℝ} P[(Z − m)² − (Z − m′)²].

Define the MOM test statistic to compare any m, m′ ∈ ℝ by

 T_N(m, m′) = MOM_K[(Z − m′)² − (Z − m)²].

Basic properties of the median (recalled in Eq (8) and (9) of Section 4.1) yield

 T_N(m, m′) = (m′)² − m² + MOM_K[−2Z(m′ − m)]
            = (m′)² − 2m′ MOM_K(Z) − [m² − 2m MOM_K(Z)]
            = (m′ − MOM_K(Z))² − (m − MOM_K(Z))².

Defining ℓ_N(m) = (m − MOM_K(Z))², one has

 T_N(m, m′) = ℓ_N(m′) − ℓ_N(m).

As in the previous example, Le Cam’s estimator based on T_N is therefore the unique minimizer of ℓ_N, that is, MOM_K(Z).
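A minimal implementation of MOM_K(Z), with contiguous blocks and the at-most-K-points-dropped convention:

```python
import numpy as np

def mom(z, K):
    """Median-of-means: split the sample into K equal-size blocks,
    average within each block, then take the median of the K averages.
    If K does not divide len(z), the last few points are dropped."""
    z = np.asarray(z, dtype=float)
    n = (len(z) // K) * K
    block_means = z[:n].reshape(K, -1).mean(axis=1)
    return float(np.median(block_means))
```

Unlike the empirical mean, this value is barely moved by a few wild observations, since those can contaminate only a minority of the K block averages.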

#### Example 3 : “Champions” of a Tournament

LugosiMendelson2016 introduced median-of-means tournaments. More precisely, they used median-of-means tests to compare elements in F. Unlike the tests of Example 1, these tests cannot in general be separated, i.e., written as a difference crit_N(f) − crit_N(g). LugosiMendelson2016 assume that an upper bound r* on the radius of B_{T_N}(f*) (holding with exponentially large probability) is known by the statistician and call a “champion” any element f of F such that C_{T_N}(f) ≤ r*. It is clear that, by definition, the radius of B_{T_N}(f̂_{T_N}) is smaller than C_{T_N}(f*) and therefore smaller than r*. This means that f̂_{T_N} is a “champion” in this terminology. The main advantage of Le Cam’s approach is that r* (which usually depends on some attribute of the oracle, like its sparsity) is not required to build the estimator f̂_{T_N}.

## 4 Construction of the regularized MOM estimators

### 4.1 Quantile of means processes and median-of-means tests

This section presents the median-of-means (MOM) tests used in this work. Designing a family of tests is one of the most important building blocks in Le Cam’s approach, together with the right choice of the metric measuring the diameters of the sets B_{T_N}(f).

Start with a few notations. For all α ∈ (0, 1), ℓ ∈ ℕ* and z = (z₁, …, z_ℓ) ∈ ℝ^ℓ, the set of α-quantiles of z is denoted by

 Q_α(z) = { x ∈ ℝ : (1/ℓ) Σ_{k=1}^ℓ I(z_k ≤ x) ≥ α and (1/ℓ) Σ_{k=1}^ℓ I(z_k ≥ x) ≥ 1 − α }.

For a non-empty subset B ⊂ {1, …, N} and a function f, let

 P_B f = (1/|B|) Σ_{i∈B} f(X_i, Y_i) and P̄_B f = (1/|B|) Σ_{i∈B} P_i f.

Let K ∈ {1, …, N} and let B₁, …, B_K denote an equipartition of {1, …, N} into K bins of size N/K. When K does not divide N, at most K data can be removed from the dataset. For any α ∈ (0, 1) and any function f, the set of α-quantiles of the empirical means is denoted by

 Q_{α,K}(f) = Q_α( (P_{B_k} f)_{k∈[K]} ).

With a slight abuse of notation, we shall repeatedly denote by Q_{α,K}(f) any element of this set, write Q_{α,K}(f) ≤ x if q ≤ x for some q ∈ Q_{α,K}(f), Q_{α,K}(f) ≥ x if q ≥ x for some q ∈ Q_{α,K}(f), and Q_{α,K}(f) + Q_{α,K}(f′) for any element in the Minkowski sum of the two sets. Let also MOM_K(f) = Q_{1/2,K}(f) denote an empirical median of the empirical means on the blocks B₁, …, B_K. Empirical quantiles satisfy, for any c ≥ 0, α ∈ (0, 1) and functions f, f′,

 Q_{α,K}(cf) = cQ_{α,K}(f), (8)
 Q_{α,K}(−f) = −Q_{1−α,K}(f), (9)
 sup{ Q_{1/4,K}(f) + Q_{1/4,K}(f′) } ≤ inf Q_{1/2,K}(f + f′), (10)
 sup Q_{1/2,K}(f + f′) ≤ inf{ Q_{3/4,K}(f) + Q_{3/4,K}(f′) }. (11)

With some abuse of notation, we shall write these properties respectively as

 Q_{α,K}(cf) = cQ_{α,K}(f), Q_{α,K}(−f) = −Q_{1−α,K}(f),
 Q_{1/4,K}(f) + Q_{1/4,K}(f′) ≤ MOM_K[f + f′] ≤ Q_{3/4,K}(f) + Q_{3/4,K}(f′).
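The quantile set Q_α(z) can be computed explicitly: it is the interval between two order statistics of z. The helper below is an illustrative implementation, on which the symmetry property (9) can be checked numerically.

```python
import numpy as np

def quantile_set(alpha, z):
    """Endpoints (lo, hi) of the set Q_alpha(z) of alpha-quantiles, i.e.
    of all x with #{k : z_k <= x}/l >= alpha and #{k : z_k >= x}/l >= 1 - alpha.
    These are the order statistics of ranks ceil(alpha*l)
    and l + 1 - ceil((1 - alpha)*l)."""
    z = np.sort(np.asarray(z, dtype=float))
    l = len(z)
    lo = z[int(np.ceil(alpha * l)) - 1]
    hi = z[l - int(np.ceil((1.0 - alpha) * l))]
    return lo, hi
```

For instance, the set of medians of (1, 2, 3, 4) is the whole interval [2, 3], and Q_{1/4}(−z) = −Q_{3/4}(z) as in (9).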

A regularization parameter λ > 0 is introduced to balance data adequacy and regularization. The (quadratic) loss and regularized (quadratic) loss are respectively defined as the real-valued functions such that

 ℓ_f(x, y) = (y − f(x))², ℓ^λ_f = ℓ_f + λ‖f‖, ∀(f, x, y) ∈ F × 𝒳 × ℝ.

To compare/test functions f and g in F, median-of-means tests between f and g are now defined by

 T_{K,λ}(g, f) = MOM_K[ℓ^λ_f − ℓ^λ_g] = MOM_K[ℓ_f − ℓ_g] + λ(‖f‖ − ‖g‖). (12)

From (9), T_{K,λ} satisfies (3) and is therefore a test statistic in the sense of Section 3.
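For the linear/ℓ₁ example, the test (12) can be implemented directly. The sketch below uses the numpy median in the role of MOM_K; the antisymmetry (3) then holds exactly, because the numpy median of −v is the negative of the median of v.

```python
import numpy as np

def mom_test(X, y, t_f, t_g, K, lam):
    """Regularized MOM test T_{K,lam}(g, f) of (12) for linear functions
    f = <., t_f> and g = <., t_g>, with the l1 regularization norm:
    median over K blocks of the block means of the loss increments
    l_f - l_g, plus the difference of penalties."""
    n = (len(y) // K) * K
    incr = (y - X @ t_f) ** 2 - (y - X @ t_g) ** 2  # pointwise l_f - l_g
    block_means = incr[:n].reshape(K, -1).mean(axis=1)
    return float(np.median(block_means)) + lam * (
        np.linalg.norm(t_f, 1) - np.linalg.norm(t_g, 1))

# g "beats" f iff mom_test(X, y, t_f, t_g, K, lam) >= 0.
```

In a noiseless simulation, the true parameter beats any other candidate, and swapping the two arguments flips the sign of the test.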

### 4.2 Main assumptions

Recall that {1, …, N} = O ∪ I and that O is a set of outliers on which we make no assumption, so these may be aggressive in any sense one can imagine. The remaining informative data need to bring enough information on f*. We therefore need some assumptions on the sub-dataset (X_i, Y_i)_{i∈I} and, in particular, some connection between the distributions P_i, for i ∈ I, and P. These assumptions are pretty weak since we essentially only assume that the L₁ and L₂ geometries are comparable in the following sense.

###### Assumption 1.

There exists θ_r such that, for all i ∈ I and f ∈ F,

 ‖f − f*‖_{L₂(P_i)} ≤ θ_r ‖f − f*‖_{L₂(P)}.

Of course, Assumption 1 holds in the i.i.d. framework, with P_i = P and θ_r = 1. The second assumption bounds the correlation between the noise ζ and the shifted class F − f* in L₂(P_i), for all i ∈ I.

###### Assumption 2.

There exists θ_m such that, for all i ∈ I and f ∈ F,

 var_{P_i}(ζ(f − f*)) = P_i[ζ²(f − f*)²] − [P_i(ζ(f − f*))]² ≤ θ_m² ‖f − f*‖²_{L₂(P)}.

Let us give some examples where Assumption 2 holds. If the noise ζ (resp. ζ_i = Y_i − f*(X_i) for i ∈ I) has a variance conditionally on X (resp. X_i) that is uniformly bounded, then Assumption 2 holds. This is, for example, the case when ζ (resp. ζ_i) is independent of X (resp. X_i) and has a finite second moment. It also holds without independence under higher moment conditions. For example, assume ‖ζ‖_{L₄} < ∞ and, for every f ∈ F, ‖f − f*‖_{L₄(P_i)} ≤ C‖f − f*‖_{L₂(P)}; then, by the Cauchy-Schwarz inequality, P_i[ζ²(f − f*)²] ≤ ‖ζ‖²_{L₄} ‖f − f*‖²_{L₄(P_i)}, and so Assumption 2 holds with θ_m = C‖ζ‖_{L₄}.

###### Assumption 3.

There exists θ₀ such that, for all i ∈ I and all f ∈ F,

 ‖f − f*‖_{L₂(P)} ≤ θ₀ ‖f − f*‖_{L₁(P_i)}.

By the Cauchy-Schwarz inequality, ‖f − f*‖_{L₁(P_i)} ≤ ‖f − f*‖_{L₂(P_i)} for all i ∈ I and f ∈ F. Therefore, Assumptions 1 and 3 together imply that all the norms L₁(P_i), L₂(P_i), for i ∈ I, and L₂(P) are equivalent over F − f*. Note also that Assumption 3 is related to the small ball property (cf. MR3431642; Shahar-COLT), as shown by Proposition 1 below. The small ball property has recently been used in learning theory and signal processing. We refer to MR3431642; LM_compressed; shahar_general_loss; MR3364699; Shahar-ACM; RV_small_ball for examples of distributions satisfying this assumption.

###### Proposition 1.

Let Z be a real-valued random variable.

1. If there exist κ₀ > 0 and u₀ > 0 such that P(|Z| ≥ κ₀‖Z‖_{L₂}) ≥ u₀, then ‖Z‖_{L₂} ≤ (u₀κ₀)^{−1} ‖Z‖_{L₁}.

2. If there exists θ₀ such that ‖Z‖_{L₂} ≤ θ₀‖Z‖_{L₁}, then, for any κ₀ ∈ (0, 1/θ₀), P(|Z| ≥ κ₀‖Z‖_{L₂}) ≥ u₀, where u₀ = (θ₀^{−1} − κ₀)².

###### Proof.

If P(|Z| ≥ κ₀‖Z‖_{L₂}) ≥ u₀, then

 ‖Z‖_{L₁} ≥ ∫_{|z|≥κ₀‖Z‖_{L₂}} |z| P_Z(dz) ≥ u₀κ₀‖Z‖_{L₂},

where P_Z denotes the distribution of Z. Conversely, if ‖Z‖_{L₂} ≤ θ₀‖Z‖_{L₁}, the Paley-Zygmund argument (MR1666908, Proposition 3.3.1) shows that, with u₀ = P(|Z| ≥ κ₀‖Z‖_{L₂}),

 ‖Z‖_{L₂} ≤ θ₀‖Z‖_{L₁} = θ₀( E[|Z| I(|Z| ≤ κ₀‖Z‖_{L₂})] + E[|Z| I(|Z| ≥ κ₀‖Z‖_{L₂})] ) ≤ θ₀( κ₀‖Z‖_{L₂} + √u₀ ‖Z‖_{L₂} ).

As one can assume that θ₀κ₀ < 1, u₀ ≥ (θ₀^{−1} − κ₀)². ∎

### 4.3 Complexity parameters and the link function

This section defines the link function r(·) making the connection between the norms that is required in the extension of Le Cam’s approach to the simultaneous control of two norms (one of the two being unknown). For any f ∈ E and ρ > 0, let

 B(f, ρ) = { g ∈ E : ‖f − g‖ ≤ ρ }, S(f, ρ) = { g ∈ E : ‖g − f‖ = ρ }.
###### Definition 1.

Let ϵ₁, …, ϵ_N be independent Rademacher random variables, independent from (X_i, Y_i)_{i=1,…,N}. For any γ_Q, γ_M > 0 and ρ > 0, let

 Q^{γ_Q}_{f⋆,ρ} = { r > 0 : ∀J ∈ 𝒥, E sup_{f∈F_{f⋆,ρ,r}} | Σ_{i∈J} ϵ_i (f − f⋆)(X_i) | ≤ γ_Q |J| r },
 M^{γ_M}_{f⋆,ρ} = { r > 0 : ∀J ∈ 𝒥, E sup_{f∈F_{f⋆,ρ,r}} | Σ_{i∈J} ϵ_i (Y_i − f⋆(X_i)) (f − f⋆)(X_i) | ≤ γ_M |J| r² }

and the two fixed-point functions

 r_Q(ρ, γ_Q) = sup_{f⋆∈F} { inf Q^{γ_Q}_{f⋆,ρ} }, r_M(ρ, γ_M) = sup_{f⋆∈F} { inf M^{γ_M}_{f⋆,ρ} }.

The link function is any continuous and non-decreasing function r(·) such that, for all ρ > 0,

 r(ρ)