Estimating the conditional density by histogram type estimators and model selection


We propose a new procedure for estimating the conditional density from independent and identically distributed data. Our procedure aims at using the data to select a function among arbitrary (at most countable) collections of candidates. By using a deterministic Hellinger distance as loss, we prove that the selected function satisfies a non-asymptotic oracle type inequality under minimal assumptions on the statistical setting. We derive an adaptive piecewise constant estimator on a random partition that achieves the expected rate of convergence over (possibly inhomogeneous and anisotropic) Besov spaces of small regularity. Moreover, we show that this oracle inequality may lead to a general model selection theorem under very mild assumptions on the statistical setting. This theorem guarantees the existence of estimators possessing nice statistical properties under various assumptions on the conditional density (such as smoothness or structural ones).


Let be independent and identically distributed random variables defined on an abstract probability space with values in . We suppose that the conditional law admits a density with respect to a known $\sigma$-finite measure . In this paper, we address the problem of estimating the conditional density on a given subset .

When admits a joint density with respect to a product measure , one can rewrite as

where stands for the density of with respect to . A first approach to estimate was introduced in the late 60’s by [28]. The idea was to replace the numerator and the denominator of this ratio by kernel estimators. We refer to [25] for a study of its asymptotic properties. An alternative point of view is to consider the conditional density estimation problem as a non-parametric regression problem. This has motivated the definition of local parametric estimators, whose asymptotic properties were studied by [20]. Another approach was proposed by [22]. He showed asymptotic results for his copula based estimator under smoothness assumptions on the marginal density of and the copula function.

The aforementioned procedures depend on some parameters that should be tuned according to the (usually unknown) regularity of . The practical choice of these parameters is for instance discussed in [21] (see also the references therein). Nonetheless, adaptive estimation procedures are rather scarce in the literature. We can cite the procedure of [18] which yields an oracle inequality for an integrated loss. His estimator is sharp minimax under Sobolev type constraints. [12] adapted the combinatorial method of [17] to the problem of bandwidth selection in kernel conditional density estimation. They showed that this method selects the bandwidth according to the regularity of by proving an oracle inequality for an integrated loss. The papers of [13] are based on the minimisation of a penalized contrast inspired by least squares. They established a model selection result for an empirical loss and then for an integrated loss. These procedures build adaptive estimators that may achieve the minimax rates over Besov classes. The paper of [14] is based on projection estimators, the Goldenshluger-Lepski methodology and a transformation of the data. She showed an oracle inequality for an integrated loss from which she deduced that her estimator is adaptive and reaches the expected rate of convergence under Sobolev constraints on an auxiliary function. [15] gave model selection results for the penalized maximum likelihood estimator for a loss based on a Jensen-Kullback-Leibler divergence and under bracketing entropy type assumptions on the models.

Another estimation procedure that can be found in the literature is that of $T$-estimation ($T$ for test) developed by [9]. It leads to much more general model selection theorems, which allow the statistician to model finely the available knowledge of the target function and thereby obtain accurate estimates. It is shown in [11] that one can build a $T$-estimator of the conditional density. We now define the loss used in that paper to compare it with ours. We suppose henceforth that the distribution of is absolutely continuous with respect to a known $\sigma$-finite measure . Let be its Radon-Nikodym derivative. We denote by the cone of non-negative integrable functions on with respect to the product measure vanishing outside . [11] measured the quality of his estimator by means of the Hellinger deterministic distance defined by

It is assumed in that paper that the marginal density of is bounded from below by a positive constant. This classical assumption seems natural in the sense that the estimation of is better in regions of high value of than regions of low value as stressed for instance in [8]. In the present paper, we bypass this assumption by measuring the quality of our estimators through the Hellinger distance defined by

The marginal density can even vanish, in contrast to most of the papers cited above. We propose a new and data-driven (penalized) criterion adapted to this unknown loss. Its definition is in the line of the ideas developed in [3].
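To make the role of the marginal density in the loss concrete, here is a minimal numerical sketch. It assumes that the squared loss has the form of half the integral of $(\sqrt{f}-\sqrt{g})^2$ weighted by the marginal density of the design; the exact display is not reproduced in this text, so this form, the unit-square domain and all names (`weighted_hellinger_sq`, `flat_design`, `half_design`) are assumptions made for illustration only.

```python
import numpy as np

def weighted_hellinger_sq(f, g, phi, n_grid=400):
    """Midpoint-rule approximation of
    0.5 * integral over [0,1]^2 of (sqrt(f) - sqrt(g))^2 * phi(x) dx dy.
    The weighted form and the domain [0,1]^2 are assumptions of this sketch."""
    # Midpoints of a regular grid on [0, 1].
    mid = (np.arange(n_grid) + 0.5) / n_grid
    X, Y = np.meshgrid(mid, mid, indexing="ij")
    gap = (np.sqrt(f(X, Y)) - np.sqrt(g(X, Y))) ** 2
    # Each grid cell has area 1 / n_grid**2.
    return 0.5 * np.sum(gap * phi(X)) / n_grid**2

# Two hypothetical conditional densities in y (they integrate to 1 for each x).
uniform = lambda x, y: np.ones_like(y)
linear = lambda x, y: 2.0 * y

# Two hypothetical design densities; the second one vanishes on [0.5, 1].
flat_design = lambda x: np.ones_like(x)
half_design = lambda x: 2.0 * (x < 0.5)

h2_flat = weighted_hellinger_sq(uniform, linear, flat_design)
h2_half = weighted_hellinger_sq(uniform, linear, half_design)
```

In this sketch the loss remains well defined when the design density vanishes on part of the domain: regions that the design never visits simply do not contribute to the integral.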

The main result is an oracle type inequality for (at most) countable families of functions of . This inequality holds true without additional assumptions on the statistical setting. We use it a first time as an alternative to resampling methods to select among families of piecewise constant estimators. We deduce an adaptive estimator that achieves the expected rates of convergence over a range of (possibly inhomogeneous and anisotropic) Besov classes, including the ones of small regularities. A second application of this inequality leads to a new general model selection theorem under very mild assumptions on the statistical setting. We propose illustrations of this result. The first shows the existence of an adaptive estimator that attains the expected rate of convergence (up to a logarithmic term) over a very wide range of (possibly inhomogeneous and anisotropic) Besov spaces. This estimator is therefore able to cope in a satisfactory way with very smooth conditional densities as well as with very irregular ones. The second illustration deals with the celebrated regression model. It shows that the rates of convergence can be faster than the ones we would obtain under pure smoothness assumptions on when the data actually obey a regression model (not necessarily Gaussian). The last illustration concerns the case where the random variables lie in a high dimensional linear space, say with large. In this case, we explain how our procedure can circumvent the curse of dimensionality.

The paper is organized as follows. In Section 2, we carry out the estimation procedure and the oracle inequality. We use it to select among a family of piecewise constant estimators and study the quality of the selected estimator. Section 3 is dedicated to the general model selection theorem and its applications. The proofs are postponed to Section 4.

We now introduce the notation that will be used throughout the paper. We set , , . For , (respectively ) stands for (respectively ). The positive part of a real number is denoted by . The distance between a point and a set in a metric space is denoted by . The cardinality of a finite set is denoted by . The restriction of a function to a set is denoted by $f_A$. The indicator function of a set is denoted by . The notations are for the constants. These constants may change from line to line.

2 Selection among points and hold-out

Throughout the paper, and is of the form with , .

2.1 Selection rule and main theorem.

Let be the subset of defined by

and let be an at most countable subset of . The aim of this section is to use the data in order to select a function close to the unknown conditional density . We begin by presenting the procedure. The underlying motivations will be further discussed below.

Let be a map on satisfying

We define the function on by

where the convention is used. We set for ,

We finally define our estimator as any element of such that

Remarks. The definition of comes from a decomposition of the Hellinger distance initiated by [3] and taken up again in [?]. We shall show in the proof of Theorem ? that for all , , the two following assertions hold true with probability larger than :

  • If , then

  • If , then .

In the above inequalities, and are positive universal constants. The sign of thus indicates which function among and is the closest to (up to the multiplicative constant and the remainder term ). Note that comparing directly to is not straightforward in practice since and are both unknown to the statistician.

The definition of the criterion looks like the one proposed in Section 4.1 of [29] for estimating the transition density of a Markov chain as well as the one proposed in [6] for estimating one or several densities. The underlying idea is that is roughly between and . It is thus natural to minimize to define an estimator of . To be more precise, when is large enough, the proof of Theorem ? shows that for all , the following chained inequalities hold true with probability larger than uniformly for ,


for universal constants , , . We recall that is the square of the Hellinger distance between the conditional density and the set , . Therefore, as satisfies (Equation 1),

Rewriting this last inequality and using that yields:

Note that the marginal density influences the performance of the estimator through the Hellinger loss only. Moreover, no information on is needed to build the estimator.

We can interpret the condition as a (sub)-probability on . The more complex , the larger the weights . When is finite, one can choose , and the above inequality becomes

The Hellinger quadratic risk of the estimator can therefore be bounded from above by a sum of two terms (up to a multiplicative constant): the first one stands for the bias term while the second one stands for the estimation term.
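The sub-probability interpretation of the weights can be checked numerically. The sketch below assumes that the condition on the weight map reads "the sum of the exponentials of minus the weights is at most 1", and that the finite-collection choice is the constant weight equal to the logarithm of the cardinality; both are inferred from the surrounding text rather than quoted from it, and the helper name `is_sub_probability` is hypothetical.

```python
import math

def is_sub_probability(weights, tol=1e-12):
    """Check that sum(exp(-w)) over the weights is at most 1, up to rounding."""
    return sum(math.exp(-w) for w in weights) <= 1.0 + tol

# Finite collection: the same weight log(N) for each of the N candidates,
# so that the exponentials sum to exactly 1.
n_candidates = 8
uniform_weights = [math.log(n_candidates)] * n_candidates

# Truncation of a countable collection: weight k * log(2) for the k-th candidate,
# so that the exponentials form a geometric series with sum below 1.
geometric_weights = [k * math.log(2.0) for k in range(1, 51)]

print(is_sub_probability(uniform_weights), is_sub_probability(geometric_weights))
```

Under these assumptions, assigning larger weights to more complex candidates amounts to placing less prior mass on them, which is the trade-off the remainder term of the risk bound reflects.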

Let us mention that assuming that is a subset of is not restrictive. Indeed, if belongs to , we can set

The function belongs to and always does better than :

Thereby, if is only assumed to be a subset of , the procedure applies with in place of (and with ). The resulting estimator then satisfies ( ?).

Remark: the procedure does not depend on the dominating measure . However, the set , which must be chosen by the statistician, must satisfy the above assumption , which usually requires the knowledge of . Actually, this assumption can be slightly strengthened to deal with an unknown, but finite measure . This may be of interest when is the (unknown) marginal distribution of (in which case ). More precisely, let be the set of non-negative measurable functions vanishing outside such that

The assumption can be satisfied without knowing and implies .


As a first application of our oracle inequality, we consider the situation in which the set is a family of estimators built on a preliminary sample. We suppose therefore that we have at hand two independent samples of : and . This is equivalent to splitting an initial sample of size into two equal parts: and .
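The sample splitting described above can be sketched as follows. The function name `split_sample` and the simulated data are hypothetical; the sketch only assumes that the observations are i.i.d., so that splitting by position is as good as any other split.

```python
import numpy as np

def split_sample(x, y):
    """Split paired observations into two halves of equal size.

    The first half is meant for building the candidate estimators and the
    second half for selecting among them. If the sample size is odd, the
    last observation is left out."""
    x = np.asarray(x)
    y = np.asarray(y)
    half = len(x) // 2
    first = (x[:half], y[:half])
    second = (x[half:2 * half], y[half:2 * half])
    return first, second

# Hypothetical data, only used to exercise the function.
rng = np.random.default_rng(0)
x = rng.uniform(size=10)
y = x + rng.normal(scale=0.1, size=10)
(x1, y1), (x2, y2) = split_sample(x, y)
```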

Let be an at most countable collection of estimators based only on the first sample . In view of Proposition ?, we may assume, without loss of generality, that for all ,

Let be a map defined on such that .

Conditionally on , is a deterministic set. We can therefore apply our selection rule to , and to the sample to derive an estimator such that:

By taking the expectation with respect to , we then deduce:

Note that there is almost no assumption on the preliminary estimators. It is only assumed that . Besides, the non-negativity of can always be ensured by taking its positive part if needed. We may therefore select among kernel estimators (to choose the bandwidth, for instance), local polynomial estimators, projection estimators, and so on. It is also possible to mix several types of estimators in the collection. From a numerical point of view, the procedure can be implemented in practice provided that is finite and not too large.

We shall illustrate this result by applying it to some families of piecewise constant estimators. As we shall see, the resulting estimator will be optimal and adaptive over some range of possibly anisotropic Hölder and possibly inhomogeneous Besov classes.

2.3 Histogram type estimators.

We now define the piecewise constant estimators. Let be a (finite) partition of , and

where the conventions , are used. [23] established an integrated risk bound for under Lipschitz conditions on . We are nevertheless unable to find in the literature a non-asymptotic risk bound for the Hellinger deterministic loss . We propose the following result (which is assumption-free on ):

This result shows that the Hellinger quadratic risk of the estimator can be bounded by a sum of two terms. The first one corresponds to a bias term whereas the second one corresponds to a variance or estimation term. A deviation bound can also be established for some partitions:
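For intuition, a piecewise constant estimator of this kind can be sketched on a rectangular partition with real-valued covariate and response. The sketch assumes the usual form: on a cell $I \times J$, the value is the number of observations falling in the cell, divided by the number of covariates falling in $I$ times the length of $J$, with the convention $0/0 = 0$. The defining display is not reproduced in this text, so this form, the Lebesgue dominating measure and the function name are assumptions.

```python
import numpy as np

def histogram_conditional_density(x, y, x_edges, y_edges):
    """Values of a piecewise constant conditional density estimate.

    Entry (i, j) is the estimate on the cell
    [x_edges[i], x_edges[i+1]) x [y_edges[j], y_edges[j+1])."""
    x_edges = np.asarray(x_edges, dtype=float)
    y_edges = np.asarray(y_edges, dtype=float)
    # Number of pairs (x, y) in each rectangular cell.
    joint_counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    # Number of covariates in each x-interval, whatever the response is.
    x_counts, _ = np.histogram(x, bins=x_edges)
    y_widths = np.diff(y_edges)
    denom = x_counts[:, None] * y_widths[None, :]
    values = np.zeros_like(joint_counts)
    # Convention 0/0 = 0: cells whose x-interval is empty keep the value 0.
    np.divide(joint_counts, denom, out=values, where=denom > 0)
    return values

x_obs = np.array([0.1, 0.2, 0.7])
y_obs = np.array([0.1, 0.6, 0.6])
edges = [0.0, 0.5, 1.0]
est = histogram_conditional_density(x_obs, y_obs, edges, edges)
```

With this normalisation, each row of `est` integrates to at most 1 in the response variable, and to exactly 1 when the corresponding x-interval contains observations whose responses all fall inside the range covered by `y_edges`.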

2.4 Selecting among piecewise constant estimators by Hold-out.

The risk of a histogram type estimator depends on the choice of the partition : the finer , the smaller the bias term but the larger the variance term . Choosing a good partition , that is, a partition that realizes a good trade-off between the bias and variance terms, is difficult in practice since is unknown (as it involves the unknown conditional density and the unknown distance ). Nevertheless, combining (Equation 2) and Proposition ? immediately entails the following corollary.

The novelty of this oracle inequality lies in the fact that it holds for an (unknown) deterministic Hellinger loss under very mild assumptions both on the partitions and the statistical setting. We avoid some classical assumptions that are required in the literature to prove similar inequalities (see, for instance, Theorem 3.1 of [2] for a result with respect to a loss).

2.5 Minimax rates over Hölder and Besov spaces.

We can now deduce from ( ?) estimators with nice statistical properties under smoothness assumptions on the conditional density. Throughout this section, , and is the Lebesgue measure.

Hölder spaces.

Given , we recall that the Hölder space is the set of functions on for which there exists such that

Given , the Hölder space is the set of functions on such that for all , ,

satisfies (Equation 3) with some constant independent of . We then set . When all the are equal, the Hölder space is said to be isotropic and anisotropic otherwise.

A suitable choice of the collection of partitions makes it possible to bound from above the right-hand side of ( ?) when the (restricted) square root of the conditional density is Hölderian. More precisely, for each integer , let be the regular partition of with pieces

We may define for each multi-integer ,

We now choose , to deduce (see, for instance, Lemma 4 and Corollary 2 of [10] among numerous other references):

The estimator therefore achieves the optimal rate of convergence over the anisotropic Hölder classes , . It is moreover adaptive since its construction does not involve the smoothness parameter .
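The product partitions indexed by a multi-integer can be sketched as follows. The sketch assumes that each axis of the unit cube is cut into a prescribed number of intervals of equal length and that the cells are the products of these intervals; the exact number of pieces used in the text is not reproduced here, and the function names are hypothetical.

```python
import itertools
import numpy as np

def regular_edges(n_pieces):
    """End points of the regular partition of [0, 1] into n_pieces intervals."""
    return np.linspace(0.0, 1.0, n_pieces + 1)

def product_partition(pieces_per_axis):
    """Cells of the product of regular partitions of [0, 1], one per axis.

    Each cell is a tuple of (left, right) pairs, one pair per axis."""
    axes = []
    for n_pieces in pieces_per_axis:
        edges = regular_edges(n_pieces)
        axes.append(list(zip(edges[:-1], edges[1:])))
    return list(itertools.product(*axes))

def cell_volume(cell):
    """Lebesgue measure of a rectangular cell."""
    return float(np.prod([right - left for left, right in cell]))

# A hypothetical anisotropic choice: finer cuts along the third axis.
cells = product_partition((4, 2, 8))
total_volume = sum(cell_volume(c) for c in cells)
```

Allowing a different number of pieces per axis is what lets a collection of this type adapt to anisotropic smoothness: the selection step can favour fine cuts only along the directions in which the target varies quickly.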

Besov spaces.

The preceding result may be generalized to the Besov classes under a mild assumption on the design density.

We refer to Section 2.3 of [1] for a precise definition of the Besov spaces. With the notation of that paper, stands for the Besov space with parameters , , and smoothness index . We denote its semi norm by . This space is said to be homogeneous when and inhomogeneous otherwise. It is said to be isotropic when all the are equal and anisotropic otherwise. We now set for ,

and denote by the semi norm associated to the space .

The algorithm of [1] provides a collection of partitions that makes it possible to bound the right-hand side of ( ?) from above when the (restricted) square root of the conditional density belongs to a Besov space. More precisely:

Remark: the control of the bias term in ( ?) naturally involves a smoothness assumption on the square root of instead of . However, the regularity of the square root of may be deduced from that of . Indeed, we can prove that if with then and . If, additionally, is positive on , then also belongs to and

Under the assumption of Corollary ?, we deduce that if for some , , ,

where depends only on ,,,.

3 Model selection

The construction of adaptive and optimal estimators over Hölder and Besov classes follows from the oracle inequality ( ?). This inequality is itself deduced from Theorem ?. Actually, this latter theorem can be applied in a different way to deduce a more general oracle inequality. We can then derive adaptive and (nearly) optimal estimators over more general classes of functions.

3.1 A general model selection theorem.

From now on, the following assumption holds.

Let be the space of square integrable functions on with respect to the product measure endowed with the distance

We say that a subset of is a model if it is a finite dimensional linear space.

The discretization trick described in Section 4.2 of [29] can be adapted to our statistical setting. It leads to the theorem below.

As in Theorem ?, the condition has a Bayesian flavour since it can be interpreted as a (sub)-probability on . When does not contain too many models per dimension, we can set , in which case ( ?) becomes

where is universal.

This theorem is more general than Corollary ? since it enables us to deal with more general models . Moreover, it provides a deviation bound for , unlike Corollary ?. In return, it requires an assumption on the marginal density and the bound involves a logarithmic term and .

Another difference between this theorem and Corollary ? lies in the computation time of the estimators. The estimator of Corollary ? may be built in practice in a reasonable amount of time if is not too large. By contrast, the procedure leading to the above estimator (which is described in the proof of the theorem) is numerically very expensive, and it is unlikely that it could be implemented in a reasonable amount of time. This estimator should therefore be considered for theoretical purposes only.

3.2 From model selection to estimation.

It is recognized that a model selection theorem such as Theorem ? is a bridge between statistics and approximation theory. Indeed, it remains to choose models with good approximation properties with respect to the assumptions we wish to consider on to automatically derive a good estimator .

A convenient way to model these assumptions is to consider a class of functions of and to suppose that the square root of the conditional density, restricted to the set on which it is estimated, belongs to . The aim is then to choose and to bound

from above since

where depend only on and where is universal. This work has already been carried out in the literature for different classes of interest. The flexibility of our approach enables the study of various assumptions as illustrated by the three examples below. We refer to [29] for additional examples. In the remainder of this section, and stand for the Lebesgue measure.

Besov classes.

We suppose that , and that is the class of smooth functions defined by

It is then shown in [29] that one can choose a collection provided by Theorem 1 of [1] to get:

where , , are such that and where depends only on , , .

With this choice of models, the estimator of Theorem ? converges at the expected rate (up to a logarithmic term) for the Hellinger deterministic loss over a very wide range of possibly inhomogeneous and anisotropic Besov spaces. It is moreover adaptive with respect to the (possibly unknown) regularity index of the (restricted) square root of the conditional density.

Regression model.

We can also tackle the celebrated regression model where is an unknown function and where is an unobserved random variable. For the sake of simplicity, , . The conditional density is of the form where is the density of with respect to the Lebesgue measure.
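The structure of the conditional density in the regression model can be illustrated with a short sketch: the density of the response given the covariate is the noise density shifted by the value of the regression function. The regression function, the Laplace noise (chosen to stress that Gaussian noise is not required) and all names below are hypothetical choices made for illustration.

```python
import numpy as np

def regression_conditional_density(regression_function, noise_density):
    """Conditional density of Y given X = x when Y = g(X) + noise."""
    def density(x, y):
        # Shift the noise density by the value of the regression function.
        return noise_density(y - regression_function(x))
    return density

def laplace_density(e, scale=0.2):
    """Density of a centred Laplace distribution (a non-Gaussian noise)."""
    return np.exp(-np.abs(e) / scale) / (2.0 * scale)

s = regression_conditional_density(lambda x: np.sin(2.0 * np.pi * x), laplace_density)

# Riemann sum in y at a fixed x: the total mass should be close to 1.
ys = np.linspace(-6.0, 6.0, 24001)
mass = np.sum(s(0.3, ys)) * (ys[1] - ys[0])
```

Although the resulting function depends on two variables, it is entirely described by two functions of one variable each, which is the structural feature that the faster rates exploit.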

Since and are unknown, we can, for instance, suppose that these functions are smooth, which amounts to saying that the (restricted) square root of the conditional density belongs to

Here, stands for the space of Hölderian functions on with regularity index and semi norm . The notation stands for the supremum norm: . An upper bound for may be found in Section 4.4 of [29]. Actually, we show in Section 4.6 that this bound can be slightly improved. To be more precise, the result is the following: for all , , , , , such that , and every function of the form ,

where depends only on , , , , , and where depends only on , , .

In particular, if is more regular than in the sense that , then the rate for estimating the conditional density is the same as the one for estimating the regression function (up to a logarithmic term). As shown in [29], this rate is always faster than the rate we would obtain under smoothness assumptions only that would ignore the specific form of .

Remark. The reader can find in [29] a bound for when corresponds to the heteroscedastic regression model , where are smooth unknown functions.

A single index type model.

In this last example, we investigate the situation in which the explanatory random variables lie in a high dimensional linear space, say with large. By contrast, the random variables lie in a low dimensional linear space, say with small. Our aim is then to estimate on .

It is well known (and this appears in (Equation 4)) that the curse of dimensionality prevents us from obtaining fast rates of convergence under pure smoothness assumptions on . A solution to overcome this difficulty is to use a single index approach as proposed by [24], that is, to suppose that the conditional distribution depends on through an unknown parameter . More precisely, we suppose in this section that is of the form where denotes the usual scalar product on and where is a smooth unknown function. Without loss of generality, we can suppose that belongs to the unit ball of denoted by . We can reformulate these different assumptions by saying that the (restricted) square root of the conditional density belongs to the set

A collection of models possessing nice approximation properties with respect to the elements of can be built by using the results of [7]. We prove in Section 4.6 that we can bound as follows: for all , , , and every function of the form ,

where depends only on , , and where depends only on , . Although is a function of variables, the rate of convergence of corresponds to the estimation rate of a smooth function of variables only (up to a logarithmic term).
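The single index structure can be illustrated with a small sketch in which the conditional density depends on the covariate only through its scalar product with a fixed direction. The direction `theta`, the Gaussian-shaped link function and the test points are hypothetical choices made for illustration.

```python
import numpy as np

def single_index_density(theta, link):
    """Conditional density of the form (x, y) -> link(<theta, x>, y)."""
    theta = np.asarray(theta, dtype=float)
    def density(x, y):
        # The covariate enters only through its projection on theta.
        return link(np.dot(theta, x), y)
    return density

def gaussian_link(t, y, sd=0.5):
    """Hypothetical smooth link: Gaussian density in y centred at the index t."""
    return np.exp(-0.5 * ((y - t) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

theta = np.array([0.6, 0.8, 0.0])    # Euclidean norm 1, so theta is in the unit ball
s = single_index_density(theta, gaussian_link)

# Two different covariates with the same projection on theta (both equal to 0.6).
x_a = np.array([1.0, 0.0, 5.0])
x_b = np.array([0.0, 0.75, -2.0])
same_value = np.isclose(s(x_a, 0.1), s(x_b, 0.1))
```

Covariates sharing the same projection yield the same conditional density, so the function to be learned effectively depends on the index and the response only, whatever the dimension of the covariate.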


4.1 Proof of Theorem .

Let and be the functions defined on by

where the convention is used. Let

We decompose as

and define