Rho-estimators revisited: general theory and applications
Abstract
Following Baraud et al. (2017), we pursue our attempt to design a robust universal estimator of the joint distribution of independent (but not necessarily i.i.d.) observations for a Hellinger-type loss. Given such observations with an unknown joint distribution $\mathbf{P}$ and a dominated model $\mathscr{Q}$ for $\mathbf{P}$, we build an estimator $\widehat{\mathbf{P}}$ based on $\mathscr{Q}$ (a $\rho$-estimator) and measure its risk by a Hellinger-type distance. When $\mathbf{P}$ does belong to the model, this risk is bounded by some quantity which relies on the local complexity of the model in a vicinity of $\mathbf{P}$. In most situations this bound corresponds to the minimax risk over the model (up to a possible logarithmic factor). When $\mathbf{P}$ does not belong to the model, its risk involves an additional bias term proportional to the distance between $\mathbf{P}$ and $\mathscr{Q}$, whatever the true distribution $\mathbf{P}$. From this point of view, this new version of $\rho$-estimators improves upon the previous one described in Baraud et al. (2017), which required that $\mathbf{P}$ be absolutely continuous with respect to some known reference measure. Further improvements have been brought as compared to the former construction. In particular, it provides a very general treatment of the regression framework with random design as well as a computationally tractable procedure for aggregating estimators. We also give some conditions for the Maximum Likelihood Estimator to be a $\rho$-estimator. Finally, we consider the situation where the Statistician has at his or her disposal many different models and we build a penalized version of the $\rho$-estimator for model selection and adaptation purposes. In the regression setting, this penalized estimator not only allows one to estimate the regression function but also the distribution of the errors.
arXiv:1605.05051v5
MSC classification: Primary 62G35; secondary 62G05, 62G07, 62G08, 62C20, 62F99.
Keywords: $\rho$-estimation, robust estimation, density estimation, regression with random design, statistical models, maximum likelihood estimators, metric dimension, VC-classes.
1 Introduction
In a previous paper, namely Baraud et al. (2017), we introduced a new class of estimators that we called $\rho$-estimators for estimating the distribution of a random variable $X=(X_1,\dots,X_n)$ with values in some measurable space, under the assumption that the $X_i$ are independent but not necessarily i.i.d. These estimators are based on density models, a density model being a family of densities with respect to some reference measure. We also assumed that the true distribution $\mathbf{P}$ of $X$ was absolutely continuous with respect to this reference measure and, following Le Cam (1973), we measured the performance of an estimator $\widehat{\mathbf{P}}$ of $\mathbf{P}$ in terms of $\mathbb{E}\bigl[\mathbf{h}^2(\mathbf{P},\widehat{\mathbf{P}})\bigr]$, where $\mathbf{h}$ is a Hellinger-type distance to be defined later. Originally, the motivations for this construction were to design an estimator of $\mathbf{P}$ with the following properties.
— Given a density model, the estimator should be nearly optimal over it from the minimax point of view, which means that it should be possible to bound the risk of the estimator over the model from above by some quantity which is approximately of the order of the minimax risk over this model.
— Since in Statistics we typically have incomplete information about the true distribution of the observations, when we assume that $\mathbf{P}$ belongs to the model nothing ever warrants that this is true. We may more reasonably expect that $\mathbf{P}$ is close to the model, which means that the model is not exact but only approximate and that the distance from $\mathbf{P}$ to the model might therefore be positive. In this case we would like the risk of the estimator to be bounded by a quantity involving this distance, up to some universal constant. In the case of $\rho$-estimators, the previous bound can actually be slightly refined and expressed in the following way. It is possible to define on the model a positive function such that the risk of the estimator is not larger than the value of this function if $\mathbf{P}$ belongs to the model, and not larger than this value plus a multiple of the distance from $\mathbf{P}$ to the model when $\mathbf{P}$ does not belong to it.
The weak sensitivity of this risk bound to small deviations, with respect to the Hellinger-type distance, between $\mathbf{P}$ and an element of the model covers some classical notions of robustness, among which robustness to a possible contamination of the data and robustness to outliers, as we shall see in Section 5.
There are nevertheless some limitations to the properties of $\rho$-estimators as defined in Baraud et al. (2017).

— The study of random design regression required that either the distribution of the design be known or that the errors have a symmetric distribution. We want to relax these assumptions and consider the random design regression framework with greater generality.

— We always worked with some reference measure and assumed that all the probabilities we considered, including the true distribution of the observations, were absolutely continuous with respect to it. This is quite natural for the probabilities that belong to our models since the models are, by assumption, dominated and, typically, defined via a reference measure and a family of densities with respect to it. Nevertheless, the assumption that the true distribution of the observations also be dominated by this measure is questionable. We would therefore like to get rid of it and let the true distribution be completely arbitrary, thus relaxing the assumption that a density for $\mathbf{P}$ exists. Unexpectedly, such an extension leads to subtle complications, as we shall see below, and this generalization is actually far from straightforward.

— Our construction was necessarily restricted to countable models rather than the uncountable ones commonly used in Statistics.
We want here to design a method based on “probability models” rather than “density models”, which means working with dominated models consisting of probabilities rather than of densities. Of course, the choice of a dominating measure and a specific set of densities leads to a probability model. This is, by the way, what is actually done in Statistics, but the converse is definitely not true and there exist many ways of representing a dominated probability model by a reference measure and a set of densities. It turns out — see Section 2.3 — that the performance of a very familiar estimator, namely the MLE (Maximum Likelihood Estimator), can be strongly affected by the choice of a specific version of the densities. Our purpose here is to design an estimator the performance of which only depends on the probability model and not on the choice of the reference measure and the densities that are used to represent it.
In order to get rid of the abovementioned restrictions, we have to modify our original construction which leads to the new version that we present here. This new version retains all the nice properties that we proved in Baraud et al. (2017) and the numerous illustrations we considered there remain valid for the new version. It additionally provides a general treatment of conditional density estimation and regression, allowing the Statistician to estimate both the regression function and the error distribution even when the distribution of the design is totally unknown and the errors admit no finite moments. From this point of view, our approach contrasts very much with that based on the classical least squares. An alternative point of view on the particular problem of estimating a conditional density can be found in Sart (2015).
A thorough study of the performance of the least squares estimator (or truncated versions of it) can be found in Györfi et al. (2002) and we refer the reader to the references therein. A nice feature of these results lies in the fact that they hold without any assumption on the distribution of the design. While only mild moment conditions on the errors are necessary to bound the integrated risk of their estimator, much stronger ones, typically boundedness of the errors, are necessary to obtain exponential deviation bounds. In contrast, in linear regression, Audibert and Catoni (2011) established exponential deviation bounds for the risk of some robust versions of the ordinary least squares estimator. Their idea is to replace the sum of squares by the sum of truncated versions of the squares, in view of designing a new criterion which is less sensitive to possible outliers than the original least squares. Their way of modifying the least squares criterion shares some similarity with our way of modifying the log-likelihood criterion, as we shall see below. However, their results require some conditions on the distribution of the design as well as some (weak) moment condition on the errors while ours do not.
It is known, and we shall give an additional example below, that the MLE, which is often considered as a “universal” estimator, does not possess, in general, the properties that we require, and more specifically robustness. An illustration of the lack of robustness of the MLE with respect to Hellinger deviations is provided in Baraud and Birgé (2016a). Some other weaknesses of the MLE have been described in Le Cam (1990) and Birgé (2006), among other authors, and various alternatives aimed at designing some sorts of “universal” estimators (for the problem we consider here) which would not suffer from the same weaknesses have been proposed in the past by Le Cam (1973) and (1975), followed by Birgé (1983) and (2006). The construction of $\rho$-estimators, as described in Baraud et al. (2017), was along these lines. In that paper, we actually introduced $\rho$-estimators via a testing argument, as was the case for Le Cam's and Birgé's methods. This argument remains valid for the generalized version we consider here — see Lemma 4 in Appendix D.10 — but $\rho$-estimators can also be viewed as a generalization, and in fact a robustified version, of the MLE. We shall even show, in Section 6, that in favorable situations (i.i.d. observations and a convex separable set of densities as a model for the true density) the MLE is actually a $\rho$-estimator and therefore shares their properties.
To explain the idea underlying the construction of $\rho$-estimators, let us assume that we observe an i.i.d. sample $X_1,\dots,X_n$ with an unknown density $p$ belonging to a set $S$ of densities with respect to some reference measure $\mu$. We may write the log-likelihood of $q\in S$ as $\sum_{i=1}^{n}\log q(X_i)$ and the log-likelihood ratios as $\sum_{i=1}^{n}\log\bigl(q'(X_i)/q(X_i)\bigr)$,
so that maximizing the likelihood over $S$ is equivalent to minimizing, with respect to $q\in S$,
$$\sup_{q'\in S}\ \sum_{i=1}^{n}\log\frac{q'(X_i)}{q(X_i)}.$$
This happens simply because of the magic property of the logarithm which says that $\log(a/b)=\log a-\log b$. However, the use of the unbounded log function in this criterion leads to various problems that are responsible for some weaknesses of the MLE. Replacing the log function by another function $\psi$ amounts to replacing $\sum_{i=1}^{n}\log\bigl(q'(X_i)/q(X_i)\bigr)$ by
$$\mathbf{T}(X,q,q')=\sum_{i=1}^{n}\psi\!\left(\sqrt{\frac{q'(X_i)}{q(X_i)}}\right),\qquad(1)$$
which does not decompose as a difference of two terms since $\psi$ is not the log function. We may nevertheless define the analogue of the maximum likelihood criterion, namely
$$\boldsymbol{\Upsilon}(X,q)=\sup_{q'\in S}\mathbf{T}(X,q,q'),\qquad(2)$$
and define our estimator as a minimizer, with respect to $q\in S$, of the quantity $\boldsymbol{\Upsilon}(X,q)$. The resulting estimator is an alternative to the maximum likelihood estimator and we shall show that, for a suitable choice of a bounded function $\psi$, it enjoys various properties, among which robustness, that are often not shared by the MLE.
To analyze the performance of this new estimator, we have to study the behaviour of the process $q'\mapsto\mathbf{T}(X,q,q')$ when $q$ is fixed and close to the true density of the $X_i$ and $q'$ varies in $S$. Since the function $\psi$ is bounded, this process is similar to those considered in learning theory for the purpose of studying empirical risk minimization. As a consequence, the tools we use are also similar to those described in great detail in Koltchinskii (2006).
It is well-known that working with a single model for estimating an unknown distribution is not very efficient unless one has very precise pieces of information about the true distribution, which is rarely the case. Working with many models simultaneously and performing model selection improves the situation drastically. Refining the previous construction of $\rho$-estimators by adding suitable penalty terms to the statistic $\mathbf{T}$ allows one to work with a finite or countable family of probability models instead of a single one, each model leading to a risk bound of the form described above, and to choose from the observations a model with approximately the best possible bound. This results in a final estimator and a risk bound which, compared to the bound for a single model, contains an additional term connected to the complexity of the family of models we use.
The paper is organised as follows. We shall first make precise, in Section 2, our framework, which is based on dominated families of probabilities rather than on families of densities with respect to a given dominating measure. This section is devoted to the definition of models and of our new version of $\rho$-estimators, then to the assumptions that the function $\psi$ we use to define the statistic in (1) should satisfy. In Section 3, we define the dimension function of a model, a quantity which measures the difficulty of estimation within the model using a $\rho$-estimator, and present the main results, namely the performance of these new estimators. Section 4 is devoted to the extension of the construction from countable to uncountable statistical models (which are the ones commonly used in Statistics) under suitable assumptions. We describe the robustness properties of $\rho$-estimators in Section 5. In Section 6 we investigate the relationship between $\rho$-estimators and the MLE when the model is a convex set of densities. Section 7 provides various methods that allow one to bound the dimension functions of different types of models and indicates how these bounds are to be used to bound the risk of $\rho$-estimators in typical situations, with applications to the minimax risk over classical statistical models. We also provide a few examples of computations of bounds for the dimension function. Many applications of our results about $\rho$-estimators have already been given in Baraud et al. (2017) and we deal here with a new one: estimation of conditional distributions in Section 8. In Section 9 we apply this additional result to the special case of random design regression when the distribution of the design is completely unknown, a situation for which not many results are known. We provide here a complete treatment of this regression framework with simultaneous estimation of both the regression function and the density of the errors.
Section 10 is devoted to estimator selection and aggregation: we show there how our procedure can be used either to select an element from a family of preliminary estimators or to aggregate them in a convex way. The Appendices (Supplementary material) contain the proofs as well as some additional facts.
2 Our new framework and estimation strategy
As already mentioned, our method is based on statistical models which are sets of probability distributions, as opposed to more classical models which are sets of densities with respect to a given dominating measure.
2.1 A probabilistic framework
We observe a random variable $X=(X_1,\dots,X_n)$ defined on some probability space, with independent components $X_i$ taking values in measurable spaces $(\mathscr{X}_i,\mathscr{A}_i)$. We denote by $\mathscr{P}$ the set of all product probabilities on the product space $\bigl(\prod_{i=1}^{n}\mathscr{X}_i,\bigotimes_{i=1}^{n}\mathscr{A}_i\bigr)$ and by $\mathbf{P}=P_1\otimes\cdots\otimes P_n$ the true distribution of $X$. We identify an element $\mathbf{Q}=Q_1\otimes\cdots\otimes Q_n$ of $\mathscr{P}$ with the tuple $(Q_1,\dots,Q_n)$ and extend this identification to the elements of the set of all finite product measures on this product space.
When $\mathbf{P}$ is absolutely continuous with respect to $\boldsymbol{\mu}=\mu_1\otimes\cdots\otimes\mu_n$ ($\mathbf{P}\ll\boldsymbol{\mu}$) or, equivalently, $\boldsymbol{\mu}$ dominates $\mathbf{P}$, each $P_i$, for $i=1,\dots,n$, is absolutely continuous with respect to $\mu_i$ with density $p_i$. We denote by $\mathcal{L}(\mu_i)$ the set of all densities with respect to $\mu_i$, i.e. the set of measurable functions $t$ from $\mathscr{X}_i$ to $[0,+\infty)$ such that $\int_{\mathscr{X}_i}t\,d\mu_i=1$. We then write $\mathbf{P}=\mathbf{p}\cdot\boldsymbol{\mu}$, where $\mathbf{p}$ is the tuple $(p_1,\dots,p_n)$, and we say that $\mathbf{p}$ is a density for $\mathbf{P}$ with respect to $\boldsymbol{\mu}$. We denote by $\mathbf{L}(\boldsymbol{\mu})$ the set of such densities and by $\mathscr{P}_{\boldsymbol{\mu}}$ the set of all those probabilities in $\mathscr{P}$ which are absolutely continuous with respect to $\boldsymbol{\mu}$.
Our aim is to estimate the unknown distribution $\mathbf{P}$ from the observation of $X$. In order to evaluate the performance of an estimator of $\mathbf{P}$, we shall introduce, following Le Cam (1975), a Hellinger-type distance on $\mathscr{P}$. We recall that, given two probabilities $P$ and $Q$ on a measurable space, the Hellinger distance $h$ and the Hellinger affinity $\rho$ between $P$ and $Q$ are respectively given by
$$h^2(P,Q)=\frac{1}{2}\int\left(\sqrt{\frac{dP}{d\mu}}-\sqrt{\frac{dQ}{d\mu}}\right)^{2}d\mu,\qquad \rho(P,Q)=\int\sqrt{\frac{dP}{d\mu}\,\frac{dQ}{d\mu}}\,d\mu,\qquad(3)$$
where $\mu$ denotes any measure that dominates both $P$ and $Q$, the result being independent of the choice of $\mu$; note that $\rho(P,Q)=1-h^2(P,Q)$. The Hellinger-type distance $\mathbf{h}$ and affinity between two elements $\mathbf{P}=(P_1,\dots,P_n)$ and $\mathbf{Q}=(Q_1,\dots,Q_n)$ of $\mathscr{P}$ are then given by the formulas
$$\mathbf{h}^2(\mathbf{P},\mathbf{Q})=\sum_{i=1}^{n}h^2(P_i,Q_i),\qquad \boldsymbol{\rho}(\mathbf{P},\mathbf{Q})=\sum_{i=1}^{n}\rho(P_i,Q_i).$$
We shall consider on $\mathscr{P}$ the topology of the metric space $(\mathscr{P},\mathbf{h})$.
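To fix ideas, the quantities in (3) are straightforward to compute for discrete distributions. The following sketch (our own illustration, not part of the paper) evaluates $h^2$ and $\rho$ for probability vectors and checks the identity $\rho(P,Q)=1-h^2(P,Q)$.

```python
import numpy as np

def hellinger2(p, q):
    """Squared Hellinger distance h^2(P,Q) = (1/2) * sum_k (sqrt(p_k) - sqrt(q_k))^2
    for two discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def affinity(p, q):
    """Hellinger affinity rho(P,Q) = sum_k sqrt(p_k * q_k)."""
    return float(np.sum(np.sqrt(np.asarray(p, float) * np.asarray(q, float))))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
h2 = hellinger2(p, q)
```

For probability vectors one always has $\rho(P,Q)=1-h^2(P,Q)$ and $0\le h^2\le1$, with $h^2=1$ exactly when the supports are disjoint.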
2.2 Models and their representations
Let us start with this definition:
Definition 1.
We call model any dominated subset $\mathscr{Q}$ of $\mathscr{P}$ and we call representation of (the model) $\mathscr{Q}$ a pair $(\boldsymbol{\mu},\mathcal{Q})$ where $\boldsymbol{\mu}$ is a finite product measure which dominates $\mathscr{Q}$ and $\mathcal{Q}$ is a subset of $\mathbf{L}(\boldsymbol{\mu})$ such that for any $\mathbf{Q}$ in $\mathscr{Q}$ there exists a unique density $\mathbf{q}\in\mathcal{Q}$ with $\mathbf{Q}=\mathbf{q}\cdot\boldsymbol{\mu}$.
This means that, given a representation $(\boldsymbol{\mu},\mathcal{Q})$ of the model $\mathscr{Q}$, we can associate to each probability $\mathbf{Q}\in\mathscr{Q}$ a density $\mathbf{q}\in\mathcal{Q}$ and vice versa. Clearly, a dominated subset $\mathscr{Q}$ has different representations depending on the choice of the dominating measure $\boldsymbol{\mu}$ and the versions of the densities.
Our estimation strategy is based on specific dominated subsets of $\mathscr{P}$ that we call models.
Definition 2.
A model is a countable (which in this paper always means either finite or infinite and countable) subset of $\mathscr{P}$.
A model being countable, it is necessarily dominated. One should think of it as a set of probabilities to which the true distribution $\mathbf{P}$ is believed to be close (with respect to the Hellinger-type distance $\mathbf{h}$).
2.3 Construction of a $\rho$-estimator on a model
Given the model $\mathscr{Q}$, our estimator is defined as a random element of $\overline{\mathscr{Q}}$, where $\overline{\mathscr{Q}}$ denotes the closure of the subset $\mathscr{Q}$ in the metric space $(\mathscr{P},\mathbf{h})$, and its construction relies on a particular representation of the model $\mathscr{Q}$. It actually depends on three elements with specific properties to be made precise below:

— A function $\psi$ (which will serve as a substitute for the logarithm to derive an alternative to the MLE) with the following properties:
Assumption 1.
The function $\psi$ is non-decreasing from $[0,+\infty]$ to $[-1,1]$, Lipschitz and satisfies
$$\psi(1)=0\quad\text{and}\quad\psi(1/x)=-\psi(x)\ \text{ for all }x\in[0,+\infty].\qquad(4)$$
Throughout this paper we shall only consider, without further notice, functions $\psi$ satisfying Assumption 1.

— A subset $\mathscr{Q}$ of $\mathscr{P}$ (in most cases a model) with a representation $(\boldsymbol{\mu},\mathcal{Q})$.

— A penalty function “$\mathrm{pen}$” mapping $\mathcal{Q}$ to $[0,+\infty)$, the role of which will be explained later in Section 3. We may, at first reading, assume that this penalty function is identically 0.
It is essential to note that the dominating measure $\boldsymbol{\mu}$ is chosen by the statistician and that there is no reason that the true distribution $\mathbf{P}$ of $X$ be absolutely continuous with respect to $\boldsymbol{\mu}$. On the contrary, all the probabilities belonging to $\mathscr{Q}$ are, by construction, absolutely continuous with respect to $\boldsymbol{\mu}$.
Given the function $\psi$ and the representation $(\boldsymbol{\mu},\mathcal{Q})$, we define the real-valued function $\mathbf{T}$ on $\mathscr{X}\times\mathcal{Q}\times\mathcal{Q}$ by
$$\mathbf{T}(X,\mathbf{q},\mathbf{q}')=\sum_{i=1}^{n}\psi\!\left(\sqrt{\frac{q'_i(X_i)}{q_i(X_i)}}\right),\qquad(5)$$
with the conventions $0/0=1$ and $a/0=+\infty$ for all $a>0$. We then set, for $\mathbf{q}\in\mathcal{Q}$,
$$\boldsymbol{\Upsilon}(X,\mathbf{q})=\sup_{\mathbf{q}'\in\mathcal{Q}}\bigl[\mathbf{T}(X,\mathbf{q},\mathbf{q}')-\mathrm{pen}(\mathbf{q}')\bigr]+\mathrm{pen}(\mathbf{q}).\qquad(6)$$
Definition 3 ($\rho$-estimators).
Let $\mathscr{E}(X)$ be the (non-void) set
$$\mathscr{E}(X)=\left\{\mathbf{Q}=\mathbf{q}\cdot\boldsymbol{\mu},\ \mathbf{q}\in\mathcal{Q}\ \text{ with }\ \boldsymbol{\Upsilon}(X,\mathbf{q})\le\inf_{\mathbf{q}'\in\mathcal{Q}}\boldsymbol{\Upsilon}(X,\mathbf{q}')+\kappa\right\},\qquad(7)$$
where the positive constant $\kappa$ is given by (19) below. A $\rho$-estimator $\widehat{\mathbf{P}}$ relative to $(\mathscr{Q},\mathrm{pen})$ is any (measurable) element of the closure of $\mathscr{E}(X)$ in $(\mathscr{P},\mathbf{h})$.
Since $\widehat{\mathbf{P}}$ belongs to $\overline{\mathscr{Q}}\subset\mathscr{P}_{\boldsymbol{\mu}}$, the elements of which are dominated by $\boldsymbol{\mu}$, there exists a random density $\widehat{\mathbf{p}}$ such that $\widehat{\mathbf{P}}=\widehat{\mathbf{p}}\cdot\boldsymbol{\mu}$. Note that $\widehat{\mathbf{p}}$ might not belong to $\mathcal{Q}$.
As an immediate consequence of Assumption 1 and the convention $0/0=1$, $\mathbf{T}(X,\mathbf{q},\mathbf{q})=0$ and
$$\mathbf{T}(X,\mathbf{q}',\mathbf{q})=-\mathbf{T}(X,\mathbf{q},\mathbf{q}')\quad\text{for all }\mathbf{q},\mathbf{q}'\in\mathcal{Q}.\qquad(8)$$
Moreover,
$$\boldsymbol{\Upsilon}(X,\mathbf{q})\ge\mathbf{T}(X,\mathbf{q},\mathbf{q})=0$$
for all $\mathbf{q}\in\mathcal{Q}$, which implies that any element $\widehat{\mathbf{q}}$ in $\mathcal{Q}$ such that $\boldsymbol{\Upsilon}(X,\widehat{\mathbf{q}})=0$ is (the density of) a $\rho$-estimator. In particular, when $\mathrm{pen}(\mathbf{q})=0$ for all $\mathbf{q}\in\mathcal{Q}$ (which we shall write in the sequel $\mathrm{pen}\equiv0$) and $\boldsymbol{\Upsilon}(X,\widehat{\mathbf{q}})=0$, it follows from (6) that
$$\mathbf{T}(X,\widehat{\mathbf{q}},\mathbf{q}')\le0\le\mathbf{T}(X,\mathbf{q},\widehat{\mathbf{q}})$$
for all $\mathbf{q},\mathbf{q}'\in\mathcal{Q}$. This means that, in this case, $(\widehat{\mathbf{q}},\widehat{\mathbf{q}})$ is a saddle point of the map $(\mathbf{q},\mathbf{q}')\mapsto\mathbf{T}(X,\mathbf{q},\mathbf{q}')$.
A $\rho$-estimator depends on the chosen representation $(\boldsymbol{\mu},\mathcal{Q})$ of $\mathscr{Q}$, and there are different versions of the estimators associated to $\mathscr{Q}$, even though, most of the time, $\mathscr{Q}$ will directly be given by a specific representation, that is a family of densities with respect to some reference measure. Here is the important point, to be proven in Section 3: when $\mathscr{Q}$ is a model, the risk bounds we shall derive only depend on $\mathscr{Q}$ and the penalty function but not on the chosen representation of $\mathscr{Q}$, which allows us to choose the most convenient one for the construction. In contrast, the performance of many classical estimators is sensitive to the representation of the model, and this is in particular the case of the MLE, as shown by the following example.
Proposition 1.
Let us consider a sequence $X_1,\dots,X_n$ of i.i.d. random variables with normal distribution $\mathcal{N}(\theta,1)$ for some unknown $\theta\in\mathbb{R}$. We choose the Lebesgue measure on $\mathbb{R}$ for reference measure and, for the version of the density $p_\theta$, $\theta\in\mathbb{R}$, the function
(9) 
Whatever the value of the true parameter $\theta$, on an event of probability tending to 1 as $n$ goes to infinity, the MLE is inconsistent.
The proof of Proposition 1 is given in Section D.1 of the Appendix. Note that the usual choice for $p_\theta$, namely $p_\theta(x)=(2\pi)^{-1/2}\exp\bigl(-(x-\theta)^2/2\bigr)$ for $x\in\mathbb{R}$, is purely conventional. Mathematically speaking, our choice (9) is perfectly correct but leads to an inconsistent MLE. Also note that the usual tools that are used to prove consistency of the MLE, like bracketing entropy (see for instance Theorem 7.4 of van de Geer (2000)), are not stable with respect to changes of the versions of the densities in the family. The same is true for arguments based on VC-classes that we used in Baraud et al. (2017). Choosing a convenient set of densities to work with is well-grounded as long as the reference measure not only dominates the model but also the true distribution $\mathbf{P}$. If not, sets of null measure with respect to the reference measure might have a positive probability under $\mathbf{P}$ and it becomes unclear how the choice of this set of densities influences the performance of the estimator.
2.4 Notations and conventions
Throughout this paper, given a representation $(\boldsymbol{\mu},\mathcal{Q})$ of a model $\mathscr{Q}$, we shall use the lower case letters $\mathbf{q}$ and $\mathbf{q}'$ to denote the chosen densities of $\mathbf{Q}$ and $\mathbf{Q}'$ with respect to the reference measure $\boldsymbol{\mu}$. We denote by $|A|$ the cardinality of the set $A$ and by $\mathscr{B}(\mathbf{P},r)$ the closed Hellinger-type ball in $(\mathscr{P},\mathbf{h})$ with center $\mathbf{P}$ and radius $r$. Given a set $A$, a nonnegative function $f$ on $A$ and a subset $B$ of $A$, we write $\sup_{B}f$ for $\sup_{a\in B}f(a)$. By convention $\inf_{\varnothing}=+\infty$, and the ratio $a/b$ equals $+\infty$ for $a>0$ and $b=0$, $0$ for $a=0$ and $b>0$, and 1 for $a=b=0$.
2.5 Our assumptions
Given the model $\mathscr{Q}$, let us now indicate what further properties the function $\psi$ (satisfying Assumption 1) is required to possess in view of controlling the risk of the resulting $\rho$-estimators.
Assumption 2.
Let $\mathscr{Q}$ be the model to be used for the construction of $\rho$-estimators. There exist three positive constants $a_0$, $a_1$, $a_2$ such that, whatever the representation $(\boldsymbol{\mu},\mathcal{Q})$ of $\mathscr{Q}$, the densities $\mathbf{q},\mathbf{q}'\in\mathcal{Q}$, the probability $\mathbf{P}\in\mathscr{P}$ and $i\in\{1,\dots,n\}$,
$$\int_{\mathscr{X}_i}\psi\!\left(\sqrt{\frac{q'_i}{q_i}}\right)dP_i\ \le\ a_0\,h^2(P_i,Q_i)-a_1\,h^2(P_i,Q'_i),\qquad(10)$$
$$\int_{\mathscr{X}_i}\psi^2\!\left(\sqrt{\frac{q'_i}{q_i}}\right)dP_i\ \le\ a_2^2\bigl[h^2(P_i,Q_i)+h^2(P_i,Q'_i)\bigr].\qquad(11)$$
Note that the left-hand sides of (10) and (11) depend on the choices of the reference measure and of the versions of the densities $\mathbf{q}$ and $\mathbf{q}'$ while the corresponding right-hand sides do not.
Given that $\psi$ satisfies Assumption 2, the values of $a_0$, $a_1$ and $a_2$ are clearly not uniquely defined but, in the sequel, when we say that Assumption 2 holds, this will mean that the function $\psi$ satisfies (10) and (11) with given values of these constants, which will therefore be considered as fixed once $\psi$ has been chosen. When we say that some quantity depends on $\psi$, it will implicitly mean that it depends on these chosen values of $a_0$, $a_1$ and $a_2$.
An important consequence of (8), (10) and (11) is the fact that, for all $\mathbf{P}\in\mathscr{P}$ and $\mathbf{Q},\mathbf{Q}'$ in $\mathscr{Q}$,
$$a_1\,\mathbf{h}^2(\mathbf{P},\mathbf{Q})-a_0\,\mathbf{h}^2(\mathbf{P},\mathbf{Q}')\ \le\ \mathbb{E}\bigl[\mathbf{T}(X,\mathbf{q},\mathbf{q}')\bigr]\ \le\ a_0\,\mathbf{h}^2(\mathbf{P},\mathbf{Q})-a_1\,\mathbf{h}^2(\mathbf{P},\mathbf{Q}').\qquad(12)$$
These inequalities follow by summing the inequalities (10) with respect to $i$, then exchanging the roles of $\mathbf{q}$ and $\mathbf{q}'$ and applying (8). They imply that the sign of $\mathbb{E}\bigl[\mathbf{T}(X,\mathbf{q},\mathbf{q}')\bigr]$ tells us which of the two distributions $\mathbf{Q}$ and $\mathbf{Q}'$ is closer to the true one when the ratio between the distances $\mathbf{h}(\mathbf{P},\mathbf{Q})$ and $\mathbf{h}(\mathbf{P},\mathbf{Q}')$ is far enough from one.
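The sign property just described is easy to probe by simulation. In this sketch (our own; the distributions, sample sizes and the choice $\psi(x)=(x-1)/\sqrt{1+x^2}$ are illustrative assumptions), the empirical mean of $\mathbf{T}(X,\mathbf{q},\mathbf{q}')$ is negative when $\mathbf{Q}$ is markedly closer to the true distribution than $\mathbf{Q}'$, and positive in the opposite case.

```python
import numpy as np

rng = np.random.default_rng(1)

def psi(x):
    return (x - 1.0) / np.sqrt(1.0 + x ** 2)

def density(x, mu):
    # N(mu, 1) density
    return np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

def mean_T(mu_true, mu_q, mu_qp, n=200, reps=2000):
    """Monte Carlo estimate of E[T(X, q, q')] for i.i.d. N(mu_true, 1) data,
    q = N(mu_q, 1) and q' = N(mu_qp, 1)."""
    X = rng.normal(mu_true, 1.0, size=(reps, n))
    ratios = density(X, mu_qp) / density(X, mu_q)
    return float(np.mean(np.sum(psi(np.sqrt(ratios)), axis=1)))

# True P = N(0,1); Q = N(0.1,1) is much closer to P than Q' = N(2,1).
t1 = mean_T(0.0, 0.1, 2.0)  # Q closer than Q'  -> mean of T is negative
t2 = mean_T(0.0, 2.0, 0.1)  # Q' closer than Q  -> mean of T is positive
```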
In view of checking that a given function $\psi$ satisfies Assumption 2, the next result, to be proved in Section D.3 of the Appendix, is useful.
Proposition 2.
This proposition means that, up to a possible adjustment of the constants $a_0$ and $a_2$, it is actually enough to check that (10) and (11) hold true for a given representation of $\mathscr{Q}$ and all probabilities $\mathbf{P}\in\mathscr{P}$.
Let us now introduce two functions $\psi$ which do satisfy Assumption 2.
Proposition 3.
Let $\psi_1$ and $\psi_2$ be the functions taking the value 1 at $+\infty$ and defined for $x\in[0,+\infty)$ by
$$\psi_1(x)=\frac{x-1}{\sqrt{x^2+1}},\qquad \psi_2(x)=\frac{x-1}{x+1}.$$
These two functions are continuously increasing from $[0,+\infty]$ to $[-1,1]$, Lipschitz (with respective Lipschitz constants 1.143 and 2) and satisfy Assumption 2 for all models, with constants $a_0$, $a_1$, $a_2$ that depend on the chosen function.
Both functions can therefore be used everywhere in the applications of the present paper. Nevertheless, we prefer the first of them, $\psi_1(x)=(x-1)/\sqrt{x^2+1}$, because it leads to better constants in the risk bounds of the $\rho$-estimator. Proposition 3 is proved in Appendix D.4. Some comments on Assumption 2 can be found in Appendix D.2. When the model reduces to two elements, our selection procedure can be interpreted as a robust test between two simple hypotheses. Upper bounds on the errors of the first and second kinds are established in Appendix D.10.
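The stated properties are easy to verify numerically. The sketch below (our own check, with the two formulas $\psi_1(x)=(x-1)/\sqrt{x^2+1}$ and $\psi_2(x)=(x-1)/(x+1)$ taken as working assumptions) tests the antisymmetry $\psi(1/x)=-\psi(x)$ of Assumption 1, the range $[-1,1]$, monotonicity, and estimates the Lipschitz constants on a grid.

```python
import numpy as np

psi1 = lambda x: (x - 1.0) / np.sqrt(x ** 2 + 1.0)
psi2 = lambda x: (x - 1.0) / (x + 1.0)

x = np.linspace(1e-6, 50.0, 200001)
for psi in (psi1, psi2):
    # antisymmetry around 1: psi(1/x) = -psi(x), which forces psi(1) = 0
    assert np.allclose(psi(1.0 / x), -psi(x), atol=1e-9)
    # bounded in [-1, 1] and nondecreasing on the grid
    v = psi(x)
    assert v.min() >= -1.0 and v.max() <= 1.0
    assert np.all(np.diff(v) >= 0)

# empirical Lipschitz constants (maximal secant slope on the grid)
L1 = float(np.max(np.abs(np.diff(psi1(x))) / np.diff(x)))
L2 = float(np.max(np.abs(np.diff(psi2(x))) / np.diff(x)))
```

On this grid, `L1` comes out close to 1.143 and `L2` close to 2, in agreement with the constants stated in Proposition 3.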
3 The performance of $\rho$-estimators on models
3.1 The dimension function
The deviation between the true distribution $\mathbf{P}$ and a $\rho$-estimator $\widehat{\mathbf{P}}$ built on the model $\mathscr{Q}$ is controlled by two terms which are the analogues of the classical bias and variance terms, and we shall first introduce a function that replaces here the variance.
Let $\mathbf{P}\in\mathscr{P}$, $y>0$ and let $\mathscr{S}$ be an arbitrary subset of $\mathscr{Q}$; we define
and, for measurable nonnegative functions $f$ on $\mathscr{X}$, we set
(13) 
Given a representation $(\boldsymbol{\mu},\mathcal{Q})$ of $\mathscr{Q}$, we define
(14) 
where, for $\mathbf{Q}\in\mathscr{Q}$, $\mathbf{q}$ denotes the (unique) element of $\mathcal{Q}$ such that $\mathbf{Q}=\mathbf{q}\cdot\boldsymbol{\mu}$ and $\mathbf{q}'$ denotes the element of $\mathcal{Q}$ such that $\mathbf{Q}'=\mathbf{q}'\cdot\boldsymbol{\mu}$. We recall that we use the convention $0/0=1$. Since $\mathscr{Q}$ is countable, so is $\mathcal{Q}$. Therefore the supremum over $\mathcal{Q}$ is measurable and the right-hand side of (14) is well-defined. Also note that, since $|\psi|\le1$,
Hence .
Definition 4 (dimension function).
Let $\mathscr{Q}$ be a model and $\psi$ some function satisfying Assumption 2 with constants $a_0$, $a_1$ and $a_2$. The dimension function of $\mathscr{Q}$ is the mapping given by
(15) 
with and
where the infimum runs over all the representations of $\mathscr{Q}$.
Note that the dimension function of $\mathscr{Q}$ depends on the choice of the function $\psi$ but not on the choice of the representation of $\mathscr{Q}$. Since it measures the local fluctuations of a centred empirical process indexed by the model, it is quite similar to the local Rademacher complexities introduced in Koltchinskii (2006) for the purpose of studying empirical risk minimization. Its importance comes from the following property.
Proposition 4.
Let $\mathscr{Q}$ be a model and $(\boldsymbol{\mu},\mathcal{Q})$ an arbitrary representation of it. Whatever $\mathbf{P}\in\mathscr{P}$,
(16) 
hence, for all
(17) 
The proof is provided in Section D.5 of the Appendix.
3.2 Exponential deviation bounds
Our first theorem, to be proven in Section A.2, deals with the situation of a null penalty function ($\mathrm{pen}\equiv0$).
Theorem 1.
Let $\mathbf{P}$ be an arbitrary distribution in $\mathscr{P}$, $\mathscr{Q}$ a model and $\psi$ a function satisfying Assumption 2. Whatever the representation $(\boldsymbol{\mu},\mathcal{Q})$ of $\mathscr{Q}$, a $\rho$-estimator $\widehat{\mathbf{P}}$ relative to $(\mathscr{Q},\mathrm{pen}\equiv0)$ as defined in Section 2.3 satisfies, for all $\mathbf{Q}\in\mathscr{Q}$ and $\xi>0$,
(18) 
with
(19) 
In particular, if the dimension function is bounded from above by some number, then
(20) 
and the corresponding bound holds for the risk $\mathbb{E}\bigl[\mathbf{h}^2(\mathbf{P},\widehat{\mathbf{P}})\bigr]$, for some constant which only depends on the choice of $\psi$.
None of the quantities involved in (18) depends on the chosen representation of $\mathscr{Q}$, which means that the performance of $\widehat{\mathbf{P}}$ does not depend on this representation although its construction does. We shall therefore (abusively) refer to $\widehat{\mathbf{P}}$ as a $\rho$-estimator on $\mathscr{Q}$, omitting to mention which representation is used for its construction.
Introducing a nontrivial penalty function allows one to favour some probabilities as compared to others in $\mathscr{Q}$ and thus gives a Bayesian flavour to our estimation procedure. We shall mainly use it when we have at our disposal not one single model for $\mathbf{P}$ but rather a countable collection $(\mathscr{Q}_m)_{m\in\mathcal{M}}$ of candidate ones, in which case $\mathscr{Q}=\bigcup_{m\in\mathcal{M}}\mathscr{Q}_m$ is still a model that we call the reference model. The penalty function may not only be used for estimating $\mathbf{P}$ but also for performing model selection among the family, by deciding that the procedure selects the model $\mathscr{Q}_m$ if the resulting estimator belongs to $\overline{\mathscr{Q}}_m$. Since the estimator may belong to several models, this selection procedure may result in a (random) set of possible models for $\mathbf{P}$, and a common way of selecting one is to choose that with the smallest complexity in a suitable sense. In the present paper, the complexity of a model will be measured by means of a nonnegative weight function $\Delta$ mapping $\mathcal{M}$ into $[0,+\infty)$ which satisfies
$$\sum_{m\in\mathcal{M}}e^{-\Delta(m)}\le1,\qquad(21)$$
where the number “1” is chosen for convenience. When equality holds in (21), $m\mapsto e^{-\Delta(m)}$ can be viewed as a prior distribution on the family of models.
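As a toy illustration (our own example, assuming, as is standard in model selection, that (21) is the requirement $\sum_m e^{-\Delta(m)}\le1$): over a countably infinite family indexed by $m=1,2,\dots$, the weight $\Delta(m)=m$ satisfies the constraint strictly, since $\sum_{m\ge1}e^{-m}=1/(e-1)<1$, while over a finite family of $M$ models the constant weight $\Delta(m)=\log M$ achieves equality, i.e. the uniform prior.

```python
import math

# Countable family m = 1, 2, ...: Delta(m) = m gives sum_m e^{-Delta(m)} < 1
total = sum(math.exp(-m) for m in range(1, 10000))
# truncation error beyond m = 9999 is negligible; the series sums to 1/(e-1)
assert total < 1.0

# Finite family of M models: Delta(m) = log M gives equality (uniform prior)
M = 7
eq = sum(math.exp(-math.log(M)) for _ in range(M))
```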
In such a context, we shall describe how our penalty term should depend on this weight function in view of selecting a suitable model for $\mathbf{P}$. The next theorem is proved in Section A.3.
Theorem 2.
Let $\mathbf{P}$ be an arbitrary distribution in $\mathscr{P}$, $(\mathscr{Q}_m)_{m\in\mathcal{M}}$ be a countable collection of models, $\Delta$ a weight function satisfying (21), $(\boldsymbol{\mu},\mathcal{Q})$ a representation of $\mathscr{Q}=\bigcup_{m\in\mathcal{M}}\mathscr{Q}_m$, $\psi$ a function satisfying Assumption 2 and $\kappa$ be given by (19). Assume that there exist a mapping and a number such that, whatever $m\in\mathcal{M}$,
(22) 
Let the penalty function satisfy, for some constant,
(23) 
Then any $\rho$-estimator $\widehat{\mathbf{P}}$ relative to $(\mathscr{Q},\mathrm{pen})$ satisfies, for all $\xi>0$, with probability at least $1-e^{-\xi}$ and with $\kappa$ given by (19),
(24)  
3.3 The case of density estimation
Of special interest is the situation where the $X_i$ are assumed to be i.i.d. with values in a measurable set, in which case