Learning from MOM’s principles: Le Cam’s approach
Abstract
We obtain estimation error rates for estimators obtained by aggregation of regularized median-of-means tests, following a construction of Le Cam. The results hold with exponentially large probability, under only weak moment assumptions on the data. Any norm may be used for regularization; when it has some sparsity-inducing power, we recover sparse rates of convergence. The procedure is robust: a large part of the data may be corrupted by outliers that have nothing to do with the oracle we want to reconstruct. Our general risk bound is of the order of the maximum of the minimax rate in the i.i.d. setup and the proportion of outliers.
In particular, the number of outliers may be as large as (number of data) × (minimax rate) without affecting this rate. The other data do not have to be identically distributed but should only have equivalent first and second moments. For example, the minimax rate of recovery of a sparse vector is achieved with exponentially large probability by a median-of-means version of the LASSO when the noise has only a few finite moments, the entries of the design matrix satisfy weak moment assumptions, and the dataset may be corrupted by a number of outliers proportional to (number of data) × (minimax rate).
keywords: robust statistics, statistical learning, high-dimensional statistics
MSC: [2010] 62G35, 62G08
guillaume.lecue@ensae.fr
matthieu.lerasle@math.u-psud.fr
1 Introduction
Consider the problem of estimating a minimizer of the integrated square loss over a convex class of functions, based on a dataset of input-output pairs. The labels are real-valued while the inputs take values in an abstract measurable space.
Empirical Risk Minimizers (ERM) of MR1641250; MR0474638 and, later on, their regularized versions replace the unknown distribution in the definition of the oracle by the empirical distribution based on the sample. Given a regularization function, this produces regularized ERM defined by
These estimators are optimal in i.i.d. sub-Gaussian setups but suffer several drawbacks when data are heavy-tailed or corrupted by “outliers”, see MR3052407; HubRonch2009. These issues are critical in many modern applications such as high-frequency trading, where heavy-tailed data are quite common, or in various areas of biology such as microarray analysis or neuroscience, where data are sometimes still nasty after being preprocessed. To overcome the problem, various methods have been proposed. The most common strategy is to replace the square-loss function by one less sensitive to outliers. For example, MR0161415 proposed a loss that interpolates between the square and absolute losses to produce an estimator between the unbiased (but non-robust) empirical mean and the (more robust but biased) empirical median. Huber’s estimators have been intensively studied asymptotically by MR0161415; HubRonch2009; non-asymptotic results have also been obtained more recently, by MR3217454; shahar_general_loss; FanLiWang2016 for example. An alternative approach has been proposed by MR3052407 and used in learning frameworks such as least-squares regression by MR2906886 and for more general loss functions by MR3405602.
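Huber's interpolation between the square and absolute losses can be sketched as follows (a minimal illustration, not one of the estimators studied in this paper; the function names, the threshold `delta` and the fixed-point solver are our own illustrative choices):

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond.

    A large delta recovers the square loss, a small delta the absolute
    loss (up to scaling), trading bias for robustness to outliers.
    """
    r = np.asarray(r, dtype=float)
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)

def huber_location(x, delta=1.0, n_iter=100):
    """Huber M-estimator of location via IRLS: approximately minimizes
    m -> mean(huber_loss(x - m))."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)  # robust starting point
    for _ in range(n_iter):
        r = np.abs(x - m)
        w = np.minimum(1.0, delta / np.maximum(r, 1e-12))  # IRLS weights
        m = np.sum(w * x) / np.sum(w)
    return m
```

With most observations near the bulk and a single gross outlier, `huber_location` stays near the bulk while the empirical mean is dragged away, which illustrates the robustness/bias trade-off mentioned above.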
Another line of research to build robust estimators and robust selection procedures was initiated by MR0334381; MR856411 and further developed by MR2219712, MR2834722 and BaraudBirgeSart. It is based on comparisons, or tests, between elements of the class. More precisely, the approach builds on test statistics comparing pairs of candidate functions. These tests define, for each candidate, the set of all functions that have been preferred to it, and the final estimator is a minimizer of the diameter of this set. The measure of diameter is directly related to the statistical performance one seeks for the estimator. These methods mostly focus on the Hellinger loss and are generally considered difficult to compute; see however MR3224300; MR3224298.
In a related but different approach, LugosiMendelson2016 have recently introduced “median-of-means tournaments”, where median-of-means estimators in the spirit of MR1688610; MR855970; MR702836 are used to compare elements of the class. A “champion” is an element whose set of preferred competitors has a radius smaller than a computable upper bound. They prove that the risk of any champion is controlled by this upper bound. An important message of this paper is that Le Cam’s estimators are quite common in statistics, in particular in robust statistics. For example, Section 3 shows that any penalized empirical risk minimizer can be obtained by Le Cam’s approach and that Le Cam’s estimators based on median-of-means tests are champions of median-of-means tournaments.
This paper studies estimators derived from Le Cam’s procedure based on regularized median-of-means (MOM) tests (see Section 4.1). Our estimators are therefore particular instances of champions of MOM tournaments, and another motivation is to push further the analysis of this particular champion. The main advantage of MOM tests over Le Cam’s original ones is that they allow for more classical loss functions than the Hellinger loss. This idea is illustrated on the square loss. Compared to Huber’s or Catoni’s losses, this approach makes it easy to control the risk of our estimators using classical tools from empirical process theory; it also makes it possible to tackle the problem of “aggressive” outliers.
The closest work is certainly that of LugosiMendelson2016, but we believe that our paper contains substantial improvements. We stress the intimate relationship between their estimator and Le Cam’s general construction and use this parallel to propose a much simpler estimator. Our risk bounds are always better, and we extend their results to possibly corrupted datasets.
To investigate robustness properties of median-of-means estimators, we partition the dataset into two parts. One is made of outlier data, indexed by a set that we denote by O. On those data, absolutely nothing is assumed: they may not be independent, may have distributions totally different from those of the informative data, may have no moments at all, etc. These are typically data polluting datasets, as in the case of declarative data on the internet, or when something went wrong during storage, compression or transfer, resulting in complete nonsense data. They may also be observations met in biology, as in the classical eQTL (Expression Quantitative Trait Loci) data from the Phenogen Database. Many other examples of datasets containing outliers could be provided, including fraud detection and terrorist activity detection. Of course, outliers are not flagged in advance and the statistician is given no a priori information on which data are outliers. The other part of the dataset is made of data on which the MOM estimator relies to estimate the oracle. There should be enough information in those data so that estimation of the oracle is possible, even in the presence of outliers, provided they remain in a “decent proportion”. We therefore call the non-outliers the informative data, those that bring information on the oracle, and denote by I the set indexing them. We therefore end up with a partition of the dataset into O and I which, again, is not known to the statistician.
The radii of the sets of preferred competitors are computed for both the regularization norm and the L2 norm. The regularization norm is chosen in advance by the statistician to promote sparsity or smoothness; it can be used freely in our procedure, but it does not by itself ensure a small risk for the estimator. The L2 norm is unknown in general since it depends on the distribution of the design. Furthermore, the classical empirical metric fails to estimate the L2 metric without sub-Gaussian properties of the design vector. Fortunately, it can be replaced by a median-of-means metric. To handle both the regularization and L2 norms simultaneously, we will also slightly extend Le Cam’s principle. Our first important result shows that the resulting estimator is well localized w.r.t. both norms.
Median-of-means estimators rely on a splitting of the data into blocks, and the number K of blocks drives the resulting statistical performance (cf. MR3576558). To achieve optimal rates, K should ultimately be chosen using parameters that depend on the oracle, such as its sparsity, which are not available to the statistician in general. To bypass this problem, the strategy of MR1147167 is used, as in MR3576558, to select K adaptively and get a fully data-driven procedure.
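For the toy problem of estimating a univariate mean, a Lepski-type selection of the number K of blocks can be sketched as follows (an illustrative scheme: the grid of values of K and the heuristic interval half-widths are our own placeholders, not the calibrated quantities used in the paper or in MR3576558):

```python
import numpy as np

def mom(x, K):
    """Median-of-means: split x into K equal blocks, average each block,
    return the median of the block means (up to K-1 points are dropped)."""
    n = (len(x) // K) * K
    return float(np.median(x[:n].reshape(K, -1).mean(axis=1)))

def lepski_mom(x, Ks=None):
    """Lepski-type rule: return the estimate for the smallest K whose
    confidence interval intersects the intervals of all larger K in the
    grid.  Half-widths 2*sigma*sqrt(K/n) are heuristic placeholders."""
    x = np.asarray(x, dtype=float)
    if Ks is None:
        Ks = [2 ** j for j in range(1, int(np.log2(len(x))))]
    sigma = np.median(np.abs(x - np.median(x)))  # robust scale proxy
    est = {K: mom(x, K) for K in Ks}
    half = {K: 2.0 * sigma * np.sqrt(K / len(x)) for K in Ks}
    for K in sorted(Ks):
        lo = max(est[J] - half[J] for J in Ks if J >= K)
        hi = min(est[J] + half[J] for J in Ks if J >= K)
        if lo <= hi:  # in dimension 1, pairwise intersection == common point
            return est[K]
    return est[max(Ks)]
```

The rule skips small K (whose block means, hence their median, can be corrupted by outliers) because their intervals fail to intersect the clean ones, and stops at the first K whose estimate is compatible with all larger choices.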
There are four important features in our approach. First, all results are proved under weak moment assumptions on the noise, an almost minimal condition for the problem to make sense. The class is only assumed to satisfy a weak norm-comparison condition. Second, performances of the estimators are not affected by the presence of complete outliers, as long as their number remains comparable to (number of observations) × (rate of convergence). Third, all results are non-asymptotic and the regression function is never assumed to belong to the class; in particular, the noise can be correlated with the design. Finally, even the “informative data”, those that are not “outliers”, are not requested to be i.i.d., but only to have close first and second moments. Nevertheless, the estimators are shown to behave as well as the ERM when the data are i.i.d., the noise and the class are Gaussian and the noise is independent from the design.
Example: sparse recovery via MOM-LASSO. As a proof of concept, theoretical properties are illustrated on the classical example of sparse recovery in high-dimensional spaces using ℓ1 regularization. This example illustrates the typical results that follow from our analysis in one of the most classical problems of high-dimensional statistics (cf. MR2807761; MR3307991). The interested reader can check that it also applies to other procedures such as Slope (cf. slope1; slope2) and trace-norm regularization, as well as kernel methods, for instance by using the results in LM_reg_comp; LM_reg_comp_2.
Recall this classical setup. Let the design be an isotropic random vector in a high-dimensional Euclidean space and let the label be a real-valued random variable. The dataset consists of independent data corrupted by outliers: no assumption is made on a subset of the dataset, while the informative data are independent with the same distribution as the design-label pair. For the sake of simplicity, we only consider the case of i.i.d. informative data in this example. In high-dimensional statistics, the dimension may be large but the oracle has only a small number of nonzero coordinates. To estimate it, the ℓ1 norm is used for penalization to promote zero coordinates. The following result holds.
Theorem 1.
[Theorem 1.4 in LM_reg_comp] Assume is sparse, , is isotropic and

and (no outliers in the dataset),

for some

there exists such that for all and all ,

there exist and such that for all ,
The LASSO estimator, defined by
satisfies for every ,
with probability at least
(1) 
This paper shows that Theorem 1 holds for a MOM version of the LASSO estimator under much weaker assumptions, with a better probability estimate than (1). More precisely, the following theorem is proved.
Theorem 2.
Assume that is sparse, , is isotropic and

and (the number of outliers may be proportional to the sparsity times ),

for some

for every , where is the canonical basis of and is some absolute constant,

there exists such that , for all ,

there exists such that , for all .
There exists an estimator, called MOM-LASSO, satisfying for every ,
with probability at least
(2) 
The theoretical properties of MOM-LASSO outperform those of the LASSO in several ways.

Estimation rates achieved by MOM-LASSO are the actual minimax rates, see BLT16, while classical LASSO estimators achieve a rate that is suboptimal by a logarithmic factor. This improvement is possible thanks to the adaptation step in MOM-LASSO.

MOM-LASSO is insensitive to data corruption by a number of outliers of the order of the number of observations times the minimax rate, while a single outlier can be responsible for a dramatic breakdown of the performance of the LASSO.

All assumptions are weaker for MOM-LASSO than for the LASSO. In particular, condition v) holds under weak moment assumptions on the coordinates of the design, which is a much weaker requirement than condition iii) for the LASSO.
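The sensitivity of the plain LASSO to a single corrupted label, asserted above, can be illustrated numerically with a minimal proximal-gradient (ISTA) implementation of the LASSO. This is only an illustration of the breakdown phenomenon, not the MOM-LASSO of Theorem 2; the data, dimensions and regularization level below are synthetic choices of ours:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """LASSO via ISTA (proximal gradient descent): approximately
    minimizes (1/(2n)) * ||y - X w||^2 + lam * ||w||_1."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = w - grad / L
        w = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft threshold
    return w
```

On clean sub-Gaussian data the estimator recovers a sparse vector accurately, whereas corrupting a single label destroys the estimate entirely, which is the breakdown that MOM-LASSO is designed to avoid.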
From a mathematical point of view, our results are based on a slight extension of the Small Ball Method (SBM) of MR3431642; ShaharCOLT to handle non-i.i.d. data. The SBM is also extended to bound both the quadratic and multiplier parts of the quadratic loss. Otherwise, all arguments are standard, which makes the approach very attractive and easily reproducible in other frameworks of statistical learning.
The paper is organized as follows. Section 2 briefly presents the general setting and our main illustrative example. Section 3 presents Le Cam’s construction of estimators based on tests. We also show why many learning procedures may be obtained by this approach. The construction of estimators and the main assumptions are gathered in Section 4. Our main theorems are stated in Section 5 and proved in Section 6.
Notation
For any real number, the floor function denotes the largest integer smaller than it. For any finite set, its cardinality is denoted by the usual bars. All along the paper, c and C denote absolute constants which may vary from line to line and, with various subscripts, denote real-valued parameters introduced in the assumptions. Finally, for any set for which it makes sense, for any , and ,
We also denote the indicator function of a set, which equals 1 on the set and 0 otherwise.
2 Setting
Let denote a measurable space and let denote random variables taking values in , with respective distributions . Given a probability distribution , let denote the space of all functions from to such that where . Let denote a convex class of functions . Assume that and let, for all ,
Let denote a norm defined onto a linear subspace of containing .
Example: regularization of linear functionals
For every and , let
Let , where
Whenever necessary, we will denote the canonical basis of the ambient space and the unit ball (resp. sphere) associated to it. To ease readability in this example, we focus on rates of convergence; we do not consider the “full” non-i.i.d. setup and assume that all informative data share the same distribution. Notations are shortened accordingly.
3 Learning from tests
3.1 General Principle
This section details the ideas underlying the construction of a MOM estimator using an extension of Le Cam’s approach.
Basic idea
By definition of the oracle , one has
As this quantity depends on the unknown distribution, we estimate it by test statistics, that is, real random variables such that
(3) 
These statistics are used to compare f to g, simply by saying that f beats g iff the test statistic is nonnegative. In this paper, the statistics are median-of-means estimators (cf. (12) in Section 4.1).
Le Cam’s construction
Let denote a collection of test statistics and let denote a pseudodistance on measuring (or related to) the risk we want to control. Let for all ,
be the set of all functions that beat f. If f is far from the oracle, then this set is expected to have a large radius w.r.t. the pseudo-distance. We therefore introduce this radius as a criterion to minimize: for all f, let r(f) denote the radius of the set of functions beating f.
By (3), or (both happen if ), hence . In particular, for all ,
(4) 
Eq (4) suggests to define the estimator
(5) 
This estimator satisfies, from Eq (4),
(6) 
Risk bounds for follow from (6) and upper bounds on the radii of .
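On a finite class, the construction (5) can be written generically. In this sketch the test `T` and the pseudo-distance `d` are abstract callables supplied by the caller, and the convention "`T(f, g) < 0` means g beats f" is an illustrative choice of ours, not the paper's formal definition:

```python
def le_cam_estimator(F, T, d):
    """Le Cam-type estimator on a finite class F.

    T(f, g) is a test statistic; here g is said to beat f when
    T(f, g) < 0.  For each f, collect every g that beats f, compute the
    radius of that set w.r.t. the pseudo-distance d, and return the
    element of F minimizing this radius, as in (5).
    """
    def radius(f):
        beaten_by = [g for g in F if g is not f and T(f, g) < 0]
        return max((d(f, g) for g in beaten_by), default=0.0)
    return min(F, key=radius)
```

For instance, with tests given by differences of an empirical criterion, no candidate beats the empirical minimizer, whose radius is therefore zero, and the procedure returns the empirical risk minimizer over the class, as in Example 1 below.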
Remark 1.
More generally, one can compare only the elements of a subset, typically a maximal net, by introducing, for each of its elements, the set
(7) 
and then by minimizing the diameter of the corresponding sets over the net. This usually improves the rates of convergence for constant-deviation results when there is a gap in Sudakov’s inequality for the localized sets (cf. Section 5 in LM13 for more details). These results are not presented here because we are interested in exponentially large deviation results, for which our results are optimal.
Dealing with regularization: the link function
Statistical performances of estimators and the radii of the sets of preferred competitors can be measured by two norms: the regularization norm and the L2 norm. As (5) allows only for one distance, we propose the following extension of Le Cam’s approach to handle two metrics.
To introduce this extension, assume first that the L2 distance can be computed (this is the case if the distribution of the design is known). The next paragraph explains how to deal with the more common framework where this distance is unknown. Remark that
The main point in extending Le Cam’s approach to simultaneously control two norms is to design a link function. In a nutshell, its value at a given radius is the minimax rate of convergence over a ball of that radius for the regularization norm (cf. (13) in Section 4.3 for a formal definition). Then one can define
Theorem 3 shows that while a minimizer of the first criterion has a controlled risk only w.r.t. one norm, a minimizer of the second has both norms properly controlled.
Dealing with unknown norms: the isometry property
In general, L2 distances cannot be directly computed and have to be estimated. To deal with this issue, one usually considers the empirical distance and proves that empirical and actual distances are equivalent outside a ball centered at the oracle (cf., for instance, the remark after Lemma 2.6 in LM13). Unfortunately, this approach only works under strong concentration properties that we want to relax in this paper.
The unknown metric is instead estimated by a median-of-means approach, that is, we use MOM estimators of the distances (cf. Section 4.4). The final estimator is therefore defined as a minimizer of
3.2 Examples
Le Cam’s approach has been used by Birgé to define T-estimators (cf. MR2449129; MR2219712; MR3186748) and by Baraud, Birgé and Sart to define ρ-estimators (cf. MR3565484; BaraudBirgeSart). MR2834722; MR3224300 also built efficient estimator selection procedures with this approach. The approach also extends many common procedures in statistical learning theory, as shown by the following examples.
Example 1: Empirical minimizers
Assume that the tests derive from a random criterion, that is, the test comparing two functions is the difference of their criterion values, and denote by the minimizer of this criterion (provided that it exists and is unique). Then it is easy to check that no function beats this minimizer, so its radius is null, while the radius of any other point is positive (whatever the nondegenerate notion of pseudo-distance used). It follows that this minimizer is the estimator (5). In particular, any possibly penalized empirical risk minimizer
is obtained by Le Cam’s construction with the tests
These examples encompass classical empirical risk minimizers of MR1641250 but also their robust versions from MR0161415; MR2906886.
Example 2: median-of-means estimators
Another, perhaps less obvious, example is the median-of-means estimator MR1688610; MR855970; MR702836 of the expectation of a real-valued random variable. Let a sample of size N be given and let B_1, …, B_K denote a partition of {1, …, N} into K bins of equal size. The estimator is the (empirical) median of the vector of empirical means over the bins. Recall that
Define the MOM test statistic to compare any pair of candidate values by
Basic properties of the median (recalled in Eq (8) and (9) of Section 4.1) yield
Defining , one has
As in the previous example, Le Cam’s estimator based on these tests is therefore the unique minimizer of the radius, that is, the median-of-means estimator itself.
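The median-of-means estimator discussed in this example can be sketched as follows (a minimal illustration; the block handling and names are ours):

```python
import numpy as np

def median_of_means(x, K):
    """Split x into K equal-size blocks, average each block, and return
    the empirical median of the K block means.  When K does not divide
    len(x), at most K - 1 trailing points are dropped."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // K) * K
    block_means = x[:n].reshape(K, -1).mean(axis=1)
    return float(np.median(block_means))
```

Because a gross outlier can corrupt at most one block mean, the median over blocks stays close to the true expectation while the plain empirical mean is destroyed.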
Example 3: “Champions” of a Tournament
LugosiMendelson2016 introduced median-of-means tournaments. More precisely, they used median-of-means tests to compare elements of the class. These tests cannot be separated in general. LugosiMendelson2016 assume that an upper bound on the radius of the set of functions beating the oracle (a bound that holds with exponentially large probability) is known to the statistician and call “champion” any element whose set of preferred competitors has radius at most this bound. By definition, the radius associated with our estimator is smaller than that of the oracle and therefore smaller than the bound; our estimator is thus a “champion” in this terminology. The main advantage of Le Cam’s approach is that the bound (which usually depends on some attribute of the oracle, like its sparsity) is not required to build the estimator.
4 Construction of the regularized MOM estimators
4.1 Quantile-of-means processes and median-of-means tests
This section presents the median-of-means (MOM) tests used in this work. Designing a family of tests is one of the most important building blocks in Le Cam’s approach, together with the right choice of the metric measuring the diameters of the sets of preferred competitors.
Start with a few notations. For all , and , the set of quantiles of is denoted by
For a nonempty subset and a function , let
Let K ≤ N and let B_1, …, B_K denote an equipartition of {1, …, N} into K bins of equal size. When K does not divide N, at most K data can be removed from the dataset. For any real number and any function, the set of quantiles of empirical means is denoted by
With a slight abuse of notation, we shall repeatedly denote by any element in and write if , if , if , and any element in the Minkowski sum . Let also denote an empirical median of the empirical means on the blocks . Empirical quantiles satisfy for any , and ,
(8)  
(9)  
(10)  
(11) 
With some abuse of notation, we shall write these properties respectively
A regularization parameter is introduced to balance data fidelity and regularization. The (quadratic) loss and regularized (quadratic) loss are respectively defined as the real-valued functions such that
To compare/test functions in the class, median-of-means tests between them are now defined by
(12) 
From (9), this quantity satisfies (3) and is a test statistic in the sense of Section 3.
4.2 Main assumptions
Recall that the dataset is split between outliers, on which we make no assumption (so these data may be aggressive in any sense one can imagine), and the remaining informative data, which need to bring enough information on the oracle. We therefore need some assumptions on this informative sub-dataset and, in particular, some connection between the distributions of the informative data. These assumptions are pretty weak since we essentially only assume that the first and second moment geometries are comparable in the following sense.
Assumption 1.
There exists such that, for all and ,
Of course, Assumption 1 holds in the i.i.d. framework. The second assumption bounds the correlation between the noise function and the design on the shifted class.
Assumption 2.
There exists such that, for all and ,
Let us give some examples where Assumption 2 holds. If the noise random variable has a conditional variance given the design that is uniformly bounded, then Assumption 2 holds. This is, for example, the case when the noise is independent of the design and has a finite second moment. It also holds without independence under higher moment conditions; in that case the bound follows from the Cauchy–Schwarz inequality.
Assumption 3.
There exists such that for all and all
By the Cauchy–Schwarz inequality, the first absolute moment is bounded by the second moment for all functions in the class. Therefore, Assumptions 1 and 3 together imply that these norms are equivalent over the class. Note also that Assumption 3 is related to the small ball property (cf. MR3431642; ShaharCOLT), as shown by Proposition 1 below. The small ball property has recently been used in learning theory and signal processing. We refer to MR3431642; LM_compressed; shahar_general_loss; MR3364699; ShaharACM; RV_small_ball for examples of distributions satisfying this assumption.
Proposition 1.
Let be a real-valued random variable.

If there exist and such that then .

If there exists such that , then for any , where .
Proof.
If then
where denotes the distribution of . Conversely, if , the Paley–Zygmund argument (MR1666908, Proposition 3.3.1) shows that, if ,
As one can assume that , . ∎
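The Paley–Zygmund step of this proof can be checked numerically. The Monte Carlo sanity check below applies the inequality P(Z² > t·E Z²) ≥ (1 − t)²·(E Z²)²/E Z⁴ to a standard Gaussian sample; it is an illustration of the bound, not part of the proof, and the function name is ours:

```python
import numpy as np

def paley_zygmund_check(z, t=0.5):
    """Empirically compare both sides of the Paley-Zygmund inequality
    P(z^2 > t * E[z^2]) >= (1 - t)^2 * (E[z^2])^2 / E[z^4]
    on a Monte Carlo sample z (moments replaced by sample means)."""
    z = np.asarray(z, dtype=float)
    m2 = np.mean(z ** 2)
    m4 = np.mean(z ** 4)
    lhs = np.mean(z ** 2 > t * m2)       # empirical small-ball probability
    rhs = (1.0 - t) ** 2 * m2 ** 2 / m4  # Paley-Zygmund lower bound
    return float(lhs), float(rhs)
```

For a standard Gaussian sample at t = 1/2 the lower bound is about (1/4)·(1/3) ≈ 0.083, while the true probability P(Z² > 1/2) is about 0.48, so the small-ball constant delivered by the moment comparison is conservative but nontrivial.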
4.3 Complexity parameters and the link function
This section defines the link function making the connection between norms that is required in the extension of Le Cam’s approach to the simultaneous control of two norms (one of the two being unknown). For any and any , let
Definition 1.
Let be independent Rademacher random variables, independent of the data, and let . For any and , let
and the two fixed point functions
The link function is any continuous and non-decreasing function such that for all