Approximation and Estimation of $s$-Concave Densities via Rényi Divergences
In this paper, we study the approximation and estimation of $s$-concave densities via Rényi divergence. We first show that the approximation of a probability measure $Q$ by an $s$-concave density exists and is unique via the procedure of minimizing a divergence functional proposed by Koenker and Mizera (2010) if and only if $Q$ admits full-dimensional support and a first moment. We also show continuity of the divergence functional in $Q$: if $Q_n \to Q$ in the Wasserstein metric, then the projected densities converge in weighted $L_1$ metrics and uniformly on closed subsets of the continuity set of the limit. Moreover, directional derivatives of the projected densities also enjoy local uniform convergence. This covers both on-the-model and off-the-model situations, and entails strong consistency of the divergence estimator of an $s$-concave density under mild conditions. One interesting and important feature of the Rényi divergence estimator of an $s$-concave density is that the estimator is intrinsically related to the estimation of log-concave densities via maximum likelihood methods. In fact, we show that for $d = 1$ at least, the Rényi divergence estimators for $s$-concave densities converge to the maximum likelihood estimator of a log-concave density as $s \nearrow 0$. The Rényi divergence estimator shares characterizations similar to those of the MLE for log-concave distributions, which allows us to develop pointwise asymptotic distribution theory assuming that the underlying density is $s$-concave.
Supported in part by NSF Grant DMS-1104832 and NI-AID grant R01 AI029168.
MSC subject classifications: Primary 62G07, 62H12; secondary 62G05, 62G20.
Keywords: $s$-concavity, consistency, projection, asymptotic distribution, mode estimation, nonparametric estimation, shape constraints.
- 1 Introduction
- 2 Theoretical properties of the divergence estimator
- 3 Limit behavior of $s$-concave densities
- 4 Limiting distribution theory of the divergence estimator
- 5 Discussion
- 6 Proofs
- 7 Appendix
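The class of $s$-concave densities on $\mathbb{R}^d$ is defined by the generalized means of order $s$ as follows. Let
$$
M_s(a, b; \theta) :=
\begin{cases}
\bigl((1-\theta)a^s + \theta b^s\bigr)^{1/s}, & s \neq 0,\ a, b > 0, \\
0, & s \neq 0,\ ab = 0, \\
a^{1-\theta} b^{\theta}, & s = 0, \\
\min\{a, b\}, & s = -\infty.
\end{cases}
$$
Then a density $p(\cdot)$ on $\mathbb{R}^d$ is called $s$-concave, i.e. $p \in \mathcal{P}_s(\mathbb{R}^d)$, if and only if for all $x_0, x_1 \in \mathbb{R}^d$ and $\theta \in (0,1)$, $p((1-\theta)x_0 + \theta x_1) \ge M_s(p(x_0), p(x_1); \theta)$. This definition apparently goes back to Avriel (1972) with further studies by Borell (1974, 1975), Das Gupta (1976), Rinott (1976), and Uhrin (1984); see also Dharmadhikari and Joag-Dev (1988) for a nice summary. It is easy to see that the densities $p(\cdot)$ have the form $p = \varphi_+^{1/s}$ for some concave function $\varphi$ if $s > 0$, $p = \exp(\varphi)$ for some concave $\varphi$ if $s = 0$, and $p = \varphi_+^{1/s}$ for some convex $\varphi$ if $s < 0$. The function classes $\mathcal{P}_s(\mathbb{R}^d)$ are nested in $s$ in that for every $r > 0 > s$, we have
$$
\mathcal{P}_r(\mathbb{R}^d) \subset \mathcal{P}_0(\mathbb{R}^d) \subset \mathcal{P}_s(\mathbb{R}^d) \subset \mathcal{P}_{-\infty}(\mathbb{R}^d).
$$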
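The defining inequality can be checked numerically on a grid. The following is a minimal sketch, assuming the standard generalized-mean convention in which the mean is set to zero whenever one of its arguments vanishes; the function names `generalized_mean` and `is_s_concave_on_grid` are hypothetical and chosen for illustration only. It uses the standard Cauchy density, for which $p^{-1/2}$ is convex, so the density should pass the check for $s = -1/2$ and fail it for $s = 0$ (it is not log-concave).

```python
import math


def generalized_mean(a, b, theta, s):
    """Generalized mean M_s(a, b; theta) of order s for a, b >= 0."""
    if a == 0 or b == 0:
        # Convention assumed here: the mean vanishes when either argument does.
        return 0.0
    if s == -math.inf:
        return min(a, b)
    if s == 0:
        return a ** (1 - theta) * b ** theta
    return ((1 - theta) * a ** s + theta * b ** s) ** (1 / s)


def is_s_concave_on_grid(p, s, grid, thetas=(0.25, 0.5, 0.75), tol=1e-12):
    """Check p((1-t)x0 + t x1) >= M_s(p(x0), p(x1); t) on all grid pairs."""
    for x0 in grid:
        for x1 in grid:
            for theta in thetas:
                mid = (1 - theta) * x0 + theta * x1
                if p(mid) < generalized_mean(p(x0), p(x1), theta, s) - tol:
                    return False
    return True


def cauchy(x):
    """Standard Cauchy density (Student t with one degree of freedom)."""
    return 1.0 / (math.pi * (1.0 + x * x))


grid = [i / 2 for i in range(-20, 21)]
print(is_s_concave_on_grid(cauchy, -0.5, grid))  # True
print(is_s_concave_on_grid(cauchy, 0.0, grid))   # False
```

A grid check of this kind can only refute $s$-concavity, not prove it; a passing result is evidence on the chosen grid and nothing more.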
Nonparametric estimation of $s$-concave densities has been the subject of intense research in recent years. In particular, much attention has been paid to estimation in the special case $s = 0$, which corresponds to all log-concave densities on $\mathbb{R}^d$. The nonparametric maximum likelihood estimator (MLE) of a log-concave density was studied in the univariate setting by Walther (2002), Dümbgen and Rufibach (2009), Pal, Woodroofe and Meyer (2007); and in the multivariate setting by Cule, Samworth and Stewart (2010); Cule and Samworth (2010). The limiting distribution theory at fixed points when $d = 1$ was studied in Balabdaoui, Rufibach and Wellner (2009), and rate results in Doss and Wellner (2016); Kim and Samworth (2015). Dümbgen, Samworth and Schuhmacher (2011) also studied stability properties of the MLE projection of any probability measure onto the class of log-concave densities.
Compared with the well-studied log-concave densities (i.e. $s = 0$), much remains unknown concerning estimation and inference procedures for the larger classes $\mathcal{P}_s$, $s < 0$. One important feature of this larger class is that the densities in $\mathcal{P}_s$ are allowed to have heavier and heavier tails as $s \to -\infty$. In fact, $t$-distributions with $\nu$ degrees of freedom belong to $\mathcal{P}_{-1/(\nu+1)}(\mathbb{R})$ (and hence also to $\mathcal{P}_s(\mathbb{R})$ for any $s < -1/(\nu+1)$). The study of maximum likelihood estimators (MLE’s in the following) for general $s$-concave densities in Seregin and Wellner (2010) shows that the MLE exists and is consistent for $s \in (-1, \infty)$. However, there is no known result about uniqueness of the MLE of $s$-concave densities except for $s = 0$. The difficulties in the theory of estimation via MLE lie in the fact that we still have very little knowledge of ‘good’ characterizations of the MLE in the $s$-concave setting. This has hindered further development of both theoretical and statistical properties of the estimation procedure.
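A standard heavy-tailed example in these classes is the Student $t$ density. The short derivation below is a sketch, assuming the usual form of the $t$ density with $\nu$ degrees of freedom and writing $c_\nu$ for its normalizing constant (a symbol introduced here for illustration). It verifies that the density is $s$-concave for $s = -1/(\nu+1)$ by showing that its $s$-th power is convex.

```latex
\begin{align*}
p_\nu(x) &= c_\nu \Bigl(1 + \frac{x^2}{\nu}\Bigr)^{-(\nu+1)/2},
  \qquad s = -\frac{1}{\nu+1}, \\
p_\nu(x)^{s} &= c_\nu^{\,s} \Bigl(1 + \frac{x^2}{\nu}\Bigr)^{1/2}, \\
\frac{d^2}{dx^2}\Bigl(1 + \frac{x^2}{\nu}\Bigr)^{1/2}
  &= \frac{1}{\nu}\Bigl(1 + \frac{x^2}{\nu}\Bigr)^{-3/2} > 0 .
\end{align*}
```

Since $p_\nu^{s}$ is convex and $s < 0$, $p_\nu$ has the form $\varphi^{1/s}$ with $\varphi$ convex, which is the $s$-concave form for negative $s$. Smaller $\nu$ (heavier tails) forces a more negative index.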
Some alternative approaches to estimation of $s$-concave densities have been proposed in the literature by using divergences other than the log-likelihood functional (Kullback-Leibler divergence in some sense). Koenker and Mizera (2010) proposed an alternative to maximum likelihood based on generalized Rényi entropies. Similar procedures were also proposed in parametric settings by Basu et al. (1998) using a family of discrepancy measures. In our setting of $s$-concave densities with $s < 0$, the methods of Koenker and Mizera (2010) can be formulated as follows.
Given i.i.d. observations , consider the primal optimization problem :
where denotes all non-negative closed convex functions supported on the convex set , the empirical measure and . As is shown by Koenker and Mizera (2010), the associated dual problem is
where is the polar cone of , and is the conjugate index of , i.e. . Here , the space of signed Radon measures on , is the topological dual of , the space of continuous functions on . We also note that the constraint in the dual form (1.2) comes from the ‘dual’ of the primal constraint , and the constraint can be derived from the dual computation of :
Here we used the notation , and is the functional defined by for clarity. Now the dual form (1.2) follows by the well known fact (e.g. Rockafellar (1971) Corollary 4A) that the form of the above dual functional is given by
For the primal problem and the dual problem , Koenker and Mizera (2010) proved the following results:
Theorem 1.1 (Theorem 4.1, Koenker and Mizera (2010)).
admits a unique solution if , where is a polyhedral convex function supported on .
Theorem 1.2 (Theorem 3.1, Koenker and Mizera (2010)).
Strong duality between and holds. Any dual feasible solution is actually a density on with respect to the canonical Lebesgue measure. The dual optimal solution exists, and satisfies
We note that the above results are all obtained in the empirical setting. At the population level, given a probability measure with suitable regularity conditions, consider
and denotes the class of all (non-negative) closed convex functions with non-empty interior, which are coercive in the sense that . Koenker and Mizera (2010) show that Fisher consistency holds at the population level: Suppose is defined for some where ; then is an optimal solution for .
Koenker and Mizera (2010) also proposed a general discretization scheme corresponding to the primal form (1.1) and the dual form (1.2) for fast computation, by which the one dimensional problem can be solved via linear programming and the two dimensional problem via semi-definite programming. These have been implemented in the R package REBayes by Koenker and Mizera (2014). Koenker’s package depends in turn on the MOSEK implementation of MOSEK ApS (2011); see Appendix B of Koenker and Mizera (2010) for further details. On the other hand, in the special case , computation of the MLE’s of log-concave densities has been implemented in the R package LogConcDEAD developed in Cule, Samworth and Stewart (2010) in arbitrary dimensions. However, expensive search for the proper triangulation of the support renders computation difficult in high dimensions.
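To make the idea of discretizing the primal objective concrete, here is a minimal one-dimensional sketch. It assumes the empirical objective has the form $\frac{1}{n}\sum_i g(X_i) + \frac{1}{|\beta|}\int g(x)^{\beta}\,dx$ with $\beta = 1 + 1/s$ and $-1 < s < 0$, integrated over the range of the data; this form, the function name `renyi_objective`, and the plain midpoint rule are all assumptions made for illustration and are not the linear-programming scheme of Koenker and Mizera (2010) or the REBayes implementation. Under that assumed form, among constant functions $g \equiv c$ on a data range of length one the objective is minimized at $c = 1$, which corresponds to the uniform density $g^{1/s} = 1$.

```python
def renyi_objective(g, data, s, n_cells=2000):
    """Midpoint-rule value of (1/n) sum g(X_i) + (1/|beta|) * integral of g^beta.

    Assumed form of the empirical objective, with beta = 1 + 1/s and -1 < s < 0.
    The integral runs over [min(data), max(data)], the convex hull of the data.
    """
    if not -1.0 < s < 0.0:
        raise ValueError("this sketch assumes -1 < s < 0")
    beta = 1.0 + 1.0 / s
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_cells
    # Midpoint rule on n_cells equal cells.
    integral = sum(g(lo + (k + 0.5) * width) ** beta for k in range(n_cells)) * width
    empirical_mean = sum(g(x) for x in data) / len(data)
    return empirical_mean + integral / abs(beta)


def constant(c):
    """Constant candidate g(x) = c (trivially convex)."""
    return lambda x: c


data = [0.0, 0.2, 0.5, 0.9, 1.0]  # illustrative data with range of length one
s = -0.5                          # so beta = -1

values = {c: renyi_objective(constant(c), data, s) for c in (0.9, 1.0, 1.1)}
print(values)
```

A real solver would optimize over all convex piecewise-linear candidates rather than constants; the sketch only shows how a candidate is scored once the integral is discretized.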
In this paper, we show that the estimation procedure proposed by Koenker and Mizera (2010) is the ‘natural’ way to estimate $s$-concave densities. As a starting point, since the classes $\mathcal{P}_s$ are nested in $s$, it is natural to consider estimation of the extreme case $s = 0$ (the class of log-concave densities) as some kind of ‘limit’ of estimation of the larger classes $\mathcal{P}_s$, $s < 0$. As we will see, estimation of $s$-concave distributions via Rényi divergences is intrinsically related to the estimation of log-concave distributions via maximum likelihood methods. In fact we show that in the empirical setting in dimension 1, the Rényi divergence estimators converge to the maximum likelihood estimator for log-concave densities as $s \nearrow 0$.
We will show that the Rényi divergence estimators share characterization and stability properties similar to the analogous properties established in the log-concave setting by Dümbgen and Rufibach (2009); Cule and Samworth (2010) and Dümbgen, Samworth and Schuhmacher (2011). Once these properties are available, further theoretical and statistical considerations in estimation of $s$-concave densities become possible. In particular, the characterizations developed here enable us to overcome some of the difficulties of maximum likelihood estimators as proposed by Seregin and Wellner (2010), and to develop limit distribution theory at fixed points assuming that the underlying model is $s$-concave. The pointwise rate and limit distribution results follow a pattern similar to the corresponding results for the MLE’s in the log-concave setting obtained by Balabdaoui, Rufibach and Wellner (2009). This local point of view also underlines the results on global rates of convergence considered in Doss and Wellner (2016), showing that the difficulty of estimation for such densities, whether light-tailed or heavy-tailed, comes almost solely from the shape constraints, namely, the convexity-based constraints.
The rest of the paper is organized as follows. In Section 2, we study the basic theoretical properties of the approximation/projection scheme defined by the procedure (1.3). In Section 3, we study the limit behavior of $s$-concave probability measures in the setting of weak convergence under dimensionality conditions on the supports of the limiting sequence. In Section 4, we develop limiting distribution theory of the divergence estimator in dimension 1 under curvature conditions with tools developed in Sections 2 and 3. Related issues and further problems are discussed in Section 5. Proofs are given in Sections 6 and 7.
In this paper, we denote the canonical Lebesgue measure on by or and write for the canonical Euclidean -norm in , and unless otherwise specified. stands for the open ball of radius centered at in , and for the indicator function of . We use to denote the norm of a measurable function on if no confusion arises.
We write for the convex support of a measure defined on , i.e.
We let denote all probability measures on whose convex support has non-void interior, while denotes the set of all probability measures with finite first moment: .
We write if converges weakly to for the corresponding probability measures and .
We write unless otherwise specified.
2 Theoretical properties of the divergence estimator
In this section, we study the basic theoretical properties of the proposed projection scheme via Rényi divergence (1.3). Starting from a given probability measure $Q$, we first show the existence and uniqueness of such projections via Rényi divergence under assumptions on the index $s$ and $Q$. We will call such a projection the Rényi divergence estimator for the given probability measure $Q$ in the following discussions. We next show that the projection scheme is continuous in $Q$ in the following sense: if a sequence of probability measures $Q_n$, for which the projections onto the class of $s$-concave densities exist, converges to a limiting probability measure $Q$ in Wasserstein distance, then the corresponding projected densities converge in weighted $L_1$ metrics and uniformly on closed subsets of the continuity set of the limit. The directional derivatives of such projected densities also converge uniformly in all directions in a local sense. We then turn our attention to the explicit characterizations of the Rényi divergence estimators, especially in dimension 1. This helps in two ways. First, it helps to understand the continuity of the projection scheme in the index $s$, i.e. it answers affirmatively the question: for a given probability measure $Q$, does the Rényi divergence estimator converge to the log-concave projection as studied in Dümbgen, Samworth and Schuhmacher (2011) as $s \nearrow 0$? Second, the explicit characterizations are exploited in the development of asymptotic distribution theory presented in Section 4.
2.1 Existence and uniqueness
For a given probability measure , let .
Assume and . Then if and only if .
Now we state our main theorem for the existence of Rényi divergence projection corresponding to a general measure on .
Assume and . Then (1.3) achieves its nontrivial minimum for some . Moreover, is bounded away from zero, and is a bounded density with respect to .
The uniqueness of the solution follows immediately from the strict convexity of the functional .
is the unique solution for if .
By the above discussion, we conclude that the map is well-defined for probability measures with suitable regularity conditions: in particular, if and , it is well-defined if and only if . From now on we denote the optimal solution as or simply if no confusion arises, and write for the corresponding -concave distribution, and say that is the Rényi projection of to .
2.2 Weighted global convergence in and
Assume . Let be a sequence of probability measures converging weakly to . Then
If we further assume that
For any closed set contained in the continuity points of and ,
Furthermore, let , and be any compact set. Then
where denotes the (one-sided) directional derivative along .
The one-sided directional derivative for a convex function is well-defined and , hence well-defined for . See Section 23 in Rockafellar (1997) for more details.
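The reason one-sided directional derivatives of a convex function always exist is that the difference quotient $(f(x + tv) - f(x))/t$ is nondecreasing in $t > 0$, so it has a limit as $t \downarrow 0$. The following minimal sketch illustrates this on a hypothetical convex test function, $f(x) = |x_1| + x_1^2 + x_2^2$, which is not differentiable at the origin; the function and helper names are chosen for illustration only.

```python
def difference_quotients(f, x, v, steps=(1.0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6)):
    """Return (f(x + t v) - f(x)) / t for a decreasing sequence of t > 0."""
    quotients = []
    for t in steps:
        shifted = [xi + t * vi for xi, vi in zip(x, v)]
        quotients.append((f(shifted) - f(x)) / t)
    return quotients


def f(x):
    """Convex test function with a kink along x[0] = 0."""
    return abs(x[0]) + x[0] ** 2 + x[1] ** 2


origin = [0.0, 0.0]
right = difference_quotients(f, origin, [1.0, 0.0])
left = difference_quotients(f, origin, [-1.0, 0.0])

# Both sequences decrease monotonically toward 1 as t shrinks.
print(right[-1], left[-1])
```

Both one-sided derivatives at the origin equal 1, so the function has well-defined directional derivatives in the directions $\pm e_1$ even though the ordinary partial derivative does not exist there.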
As a direct consequence, we have the following result covering both on and off-the-model cases.
Assume . Let be a probability measure such that , with the density function corresponding to the Rényi projection (as in Remark 2.4). Let be the empirical measure when are i.i.d. with distribution on . Let and be the Rényi divergence estimator of . Then, almost surely we have
For any closed set contained in the continuity points of and ,
Furthermore, for any compact set ,
Now we return to the correctly specified case and relax the previous assumption that for the case of the empirical measure and some measure with finite mean and bounded density with .
2.3 Characterization of the Rényi divergence projection and estimator
We now develop characterizations for the Rényi divergence projection, especially in dimension . All proofs for this subsection can be found in Appendix 6.1.
We note that the assumption is imposed only for the existence and uniqueness of the Rényi divergence projection. For the specific case of empirical measure , this condition can be relaxed to .
Now we give a variational characterization in the spirit of Theorem 2.2 in Dümbgen and Rufibach (2009). This result holds for all dimensions .
Assume and . Then if and only if
holds for all such that there exists with holds for all .
Assume and and let be any closed convex function. Then
where is the Rényi projection of to (as in Remark 2.4).
As a direct consequence, we have
Corollary 2.11 (Moment Inequalities).
Assume and . Let . Then . Furthermore if , we have and where is the covariance matrix defined by . Generally if for some , then holds for all .
Now we restrict our attention to , and in the following we will give a full characterization of the Rényi divergence estimator. Suppose we observe i.i.d. on , and let be the order statistics of . Let be the empirical distribution function corresponding to the empirical probability measure . Let and . From Theorem 4.1 in Koenker and Mizera (2010) it follows that is a convex function supported on , and linear on for all . For a continuous piecewise linear function , define the set of knots to be
Let be a convex function taking the value on and linear on for all . Let
Assume . Then if and only if
For , we have
Finally we give a characterization of the Rényi divergence estimator in terms of distribution function as Theorem 2.7 in Dümbgen, Samworth and Schuhmacher (2011).
Assume and is a probability measure on with distribution function . Let be such that is a density on , with distribution function . Then if and only if
for all with equality when .
The above theorem is useful for understanding the projected -concave density given an arbitrary probability measure . The following example illustrates these projections and also gives some insight concerning the boundary properties of the class of -concave densities.
Consider the class of densities defined by
Note that is -concave and not -concave for any . We start from arbitrary with , and we will show in the following that the projection of onto the class of -concave () distribution through will be given by . Let be the distribution function of , then we can calculate
It is easy to check by direct calculation that with equality attained if and only if . It is clear that and hence the conditions in Theorem 2.14 are verified. Note that, in Example 2.9 of Dümbgen, Samworth and Schuhmacher (2011), the log-concave approximation of the rescaled density is the Laplace distribution. It is easy to see from the above calculation that the log-concave projection of the whole class will be the Laplace distribution . Therefore the log-concave approximation fails to distinguish densities at least amongst the class .
2.4 Continuity of the Rényi divergence estimator in
Recall that , and then is a conjugate pair with where . For , let
where is the polar cone of and is the empirical measure. The maximum likelihood estimation of a log-concave density has dual form
Let and be the solutions of and . For simplicity we drop the explicit notational dependence of on . Since as for smooth enough, it is natural to expect some convergence property of to . The main result is summarized as follows.
Suppose . For all , we have the following weighted convergence
Moreover, for any closed set contained in the continuity points of ,
for all .
3 Limit behavior of $s$-concave densities
Let be a sequence of -concave densities with corresponding measures . Suppose . From Borell (1974, 1975) and Brascamp and Lieb (1976), we know that each is a concave measure with if , if , and if . This result is proved via different methods by Rinott (1976). Furthermore, if the dimension of the support of is , then it follows from Borell (1974), Theorem 2.2 that the limit measure is concave and hence has a Lebesgue density with . Here we pursue this type of result in somewhat more detail. Our key dimensionality condition will be formulated in terms of the set . We will show that if
holds, then the limiting probability measure admits an upper semi-continuous -concave density on . Furthermore, if a sequence of -concave densities converges weakly to some density (in the sense that the corresponding probability measures converge weakly), then is -concave, and converges to in weighted metrics and uniformly on any closed set of continuity points of . The directional derivatives of also converge uniformly in all directions in a local sense.
In the following sections, we will not fully exploit the strength of the results we have obtained. The results obtained will be interesting in their own right, and careful readers will find them useful as technical support for Sections 2 and 4.
3.1 Limit characterization via dimensionality condition
Note that is a convex set. For a general convex set , we follow the convention (see Rockafellar (1997)) that , where is the affine hull of . It is well known that the dimension of a convex set is the maximum of the dimensions of the various simplices included in (cf. Theorem 2.4, Rockafellar (1997)).
Assume (D1). Then .
Let be probability measures with upper semi-continuous -concave densities such that weakly as . Here is a probability measure with density . Then , and can be taken as and hence upper semi-continuous -concave.
In many situations, uniform boundedness of a sequence of $s$-concave densities gives rise to good stability and convergence properties.
Assume . Let be a sequence of -concave densities on . If where as above, then .
Now we state one limit characterization theorem.
Assume . Under either condition of (D1), is absolutely continuous with respect to , with a version of the Radon-Nikodym derivative , which is an upper semi-continuous and an -concave density on .
3.2 Modes of convergence
It is shown above that the weak convergence of $s$-concave probability measures implies almost everywhere pointwise convergence at the density level. In many applications, different or stronger types of convergence are needed. This subsection is devoted to the study of the following two types of convergence:
Convergence in the $L_1$ metric;
Convergence in the $L_\infty$ metric.
We start by investigating convergence properties in the $L_1$ metric.
Assume . Let be probability measures with upper semi-continuous -concave densities such that weakly as . Then there exists such that
Once the existence of a suitable integrable envelope function is established, we conclude naturally by dominated convergence theorem that
Assume . Let be probability measures with upper semi-continuous -concave densities such that weakly as . Then for ,
Next we examine convergence of -concave densities in norm. We denote unless otherwise specified. Since we have established pointwise convergence in Lemma 3.2, classical convex analysis guarantees that the convergence is uniform over compact sets in . To establish global uniform convergence result, we only need to control the tail behavior of the class of -concave functions and the region near the boundary of . This is accomplished via Lemmas 6.2 and 6.3.
Let be probability measures with upper semi-continuous -concave densities such that weakly as . Then for any closed set contained in the continuity points of and ,
We note that no assumption on the index is required here.
3.3 Local convergence of directional derivatives
It is known in convex analysis that if a sequence of convex functions converges pointwise to on an open convex set, then the subdifferential of also ‘converges’ to the subdifferential of . If we further assume smoothness of , then local uniform convergence of the derivatives automatically follows. See Theorems 24.5 and 25.7 in Rockafellar (1997) for precise statements. Here we pursue this issue at the level of transformed densities.
Let be probability measures with upper semi-continuous -concave densities such that weakly as . Let , and be any compact set. Then
4 Limiting distribution theory of the divergence estimator
In this section we establish local asymptotic distribution theory of the divergence estimator at a fixed point $x_0$. Limit distribution theory in shape-constrained estimation was pioneered for monotone density and regression estimators by Prakasa Rao (1969), Brunk (1970), Wright (1981) and Groeneboom (1985). Groeneboom, Jongbloed and Wellner (2001) established pointwise limit theory for the MLE’s and LSE’s of a convex decreasing density, and also treated pointwise limit theory for estimation of a convex regression function. Balabdaoui, Rufibach and Wellner (2009) established pointwise limit theorems for the MLE’s of log-concave densities on $\mathbb{R}$. On the other hand, for nonparametric estimation of $s$-concave densities, asymptotic theory beyond the Hellinger consistency results for the MLE’s established by Seregin and Wellner (2010) has been non-existent. Doss and Wellner (2016) have shown in the case of $d = 1$ that the MLE’s have Hellinger convergence rates of order $n^{-2/5}$ for each $s \in (-1, \infty)$ (which includes the log-concave case $s = 0$). However, due at least in part to the lack of explicit characterizations of the MLE for $s$-concave classes, no results concerning limiting distributions of the MLE at fixed points are currently available. In the remainder of this section we formulate results of this type for the Rényi divergence estimators. These results are comparable to the pointwise limit distribution results for the MLE’s of log-concave densities obtained by Balabdaoui, Rufibach and Wellner (2009).
In the following, we will see how natural and strong characterizations developed in Section 2 help us to understand the limit behavior of the Rényi divergence estimator at a fixed point. For this purpose, we assume the true density satisfies the following:
and is an -concave density on , where ;
is locally around for some .
Let , and if the above set is empty. Assume is continuous around .
4.1 Limit distribution theory
Before we state the main results concerning the limit distribution theory for the Rényi divergence estimator, let us sketch the route by which the theory is developed. We first denote , and . We also denote and . Due to the form of the characterizations obtained in Theorem 2.12, we define local processes at the level of integrated distribution functions as follows:
where and are defined so that by virtue of Theorem 2.12. Since we wish to derive asymptotic theory at the level of the underlying convex function, we modify the processes by