Log-concave density estimation with symmetry or modal constraints


Charles R. Doss
School of Statistics, University of Minnesota, Minneapolis, MN 55455
e-mail: cdoss@stat.umn.edu

Jon A. Wellner
Department of Statistics, Box 354322, University of Washington, Seattle, WA 98195-4322
e-mail: jaw@stat.washington.edu
Abstract

We study nonparametric maximum likelihood estimation of a log-concave density function f0 which is known to satisfy further constraints, where either (a) the mode m of f0 is known, or (b) f0 is known to be symmetric about a fixed point m. We develop asymptotic theory for both constrained log-concave maximum likelihood estimators (MLE's), including consistency, global rates of convergence, and local limit distribution theory. In both cases, we find the MLE's pointwise limit distribution at m (either the known mode or the known center of symmetry) and at a point x0 ≠ m. Software to compute the constrained estimators is available in the R package logcondens.mode.

The symmetry-constrained MLE is particularly useful in the context of location estimation. The mode-constrained MLE is useful for modal regression, and can also be used to form a likelihood ratio test for the location of the mode of f0. These problems are studied in separate papers. In particular, in a separate paper we show that, under a curvature assumption, the likelihood ratio statistic for the location of the mode can be used for hypothesis tests or confidence intervals that depend on neither tuning parameters nor nuisance parameters.


Running title: Mode-constrained MLE

URL: http://www.stat.washington.edu/jaw/
t1: Supported in part by NSF Grant DMS-1104832 and a University of Minnesota Grant-in-Aid award.
t2: Supported in part by NSF Grants DMS-1104832 and DMS-1566514, and NIAID grant 2R01 AI291968-04.

AMS subject classifications: Primary 62G07; secondary 62G05, 62G20.

Keywords and phrases: mode, consistency, convergence rate, empirical processes, convex optimization, log-concave, shape constraints, symmetric.

1 Introduction and overview

The classes of log-concave densities on ℝ (and on ℝ^d) have great importance in statistics for a variety of reasons, including their many natural closure properties: closure under convolution, affine transformations, convergence in distribution, and marginalization. These classes consist of unimodal densities and serve as important nonparametric generalizations of the class of Gaussian distributions.

Nonparametric estimation in the unconstrained classes of log-concave densities has developed rapidly in the past 10–15 years. Existence of maximum likelihood estimators for log-concave densities on ℝ was established by Walther [2002], while Pal, Woodroofe and Meyer [2007] established consistency. Dümbgen and Rufibach [2009] gave rates of convergence in certain uniform metrics and provided efficient algorithms based on "active set" methods (see also Dümbgen and Rufibach [2011]). Balabdaoui, Rufibach and Wellner [2009] established pointwise limit distribution theory for the MLE's, while Doss and Wellner [2016a] established rates of convergence of the MLE in the Hellinger metric. There has also been rapid progress in estimation of log-concave densities on ℝ^d; see e.g. Cule, Samworth and Stewart [2010], Cule and Samworth [2010a], Dümbgen, Samworth and Schuhmacher [2011], Seregin and Wellner [2010], and Han and Wellner [2016].

Interesting uses of the unconstrained log-concave MLE’s in more complicated models, mostly in mixture modeling and clustering, have been considered by Chang and Walther [2007], Eilers and Borgdorff [2007], Walther [2009], and Cule and Samworth [2010b].

On the other hand, for a number of important statistical problems it is of great interest to understand estimation in several important sub-classes of the class of all log-concave densities on ℝ.

• For testing that a log-concave density on ℝ is symmetric about a known point, we need to know how to estimate the log-concave density both with and without the constraint of symmetry.

• For the basic problem of estimation of location with a symmetric error density, it is important to know how to estimate a symmetric log-concave density with mode (and median and mean) equal to the fixed center point.

• For inference about the mode of a log-concave density it is necessary to understand how to estimate a log-concave density with a known mode (but without the constraint of symmetry).

Once the properties of nonparametric estimators within these sub-classes are understood, the estimators can be used to develop statistical methods with known properties for other, more complex statistical problems. For example, the basic procedures we study here can be viewed as building blocks for, among others:

(a) Testing the hypothesis of symmetry of a log-concave density.

(b) Estimation of the location of a symmetric log-concave density.

(c) Inference about the mode of a log-concave density.

(d) Nonparametric modal regression (as in Chen et al. [2016], but using log-concavity).

(e) Semiparametric estimation in mixture models based on symmetric log-concave distributions; see e.g. Balabdaoui and Doss [2018], Pu and Arias-Castro [2017], and Eilers and Borgdorff [2007].

(f) Modal clustering (as in Chacón [2016], but using log-concavity).

(g) Estimation of a spherically symmetric multivariate log-concave density, which is pursued in Xu and Samworth [2017].

(h) Inference about the center of an elliptical multivariate distribution based on the assumption of a log-concave underlying shape.

Thus our focus here is on estimation of a log-concave density in two important sub-classes. Let LC denote the class of all log-concave densities on the real line ℝ. The two subclasses we study here are:
(1) the class LC_m of all log-concave densities with mode a fixed number m ∈ ℝ;
(2) the class SLC_0 of all log-concave densities symmetric about 0.

We rely on the methods and properties developed here for the subclass LC_m to derive new inference procedures for the mode in Doss and Wellner [2016b]. The sub-class SLC_0 has already been used in Balabdaoui and Doss [2018] to study semiparametric mixture models. The methods developed here for SLC_0 are also being used in an on-going study by Laha [2018] of efficient estimation of a location parameter in the classical semiparametric symmetric location model under the (very natural) assumption of a symmetric log-concave error distribution. In on-going work, Doss (2017) studies new methods for modal regression based on estimation in the class LC_m.

Thus our main goals here are the following:

1. To show that the constrained MLE's over LC_m and SLC_0 exist and to provide useful characterizations thereof.

2. Establish useful finite-sample properties of the two constrained MLE's.

3. Establish consistency of the mode-constrained and symmetric mode-constrained MLE's with respect to the Hellinger metric.

4. Establish local rates of convergence of the constrained estimators and establish their (pointwise) asymptotic distributions.

5. Establish global rates of convergence of the constrained estimators.

Here is a brief summary of the paper. In Section 2 we show that the constrained estimators exist and satisfy useful characterizations. Section 3 provides plots of the constrained estimators and compares them to each other and to the unconstrained maximum likelihood estimators. In Section 4 we summarize results concerning consistency and global rates of convergence, while Section 5 addresses local rates of convergence and limiting distributions at fixed points. Section 6 summarizes some problems and difficulties concerning extensions to higher dimensions. All the proofs are given in the Appendix.

2 Maximum likelihood estimator finite sample properties: unconstrained and mode-constrained

2.1 Notation and terminology

Several classes of concave functions will play a central role in this paper. In particular, we let

 C:={φ: R→[−∞,∞) | φ  is concave, closed, and proper} (2.1)

and, for any fixed ,

 Cm:={φ∈C | φ(m)≥φ(x)  for all  x∈R} (2.2)

is the class of concave functions on ℝ with mode at m. We also let

 SC0:={φ∈C0 | φ(−x)=φ(x)  for all  x∈R}. (2.3)

Here proper and closed concave functions are as defined in Rockafellar [1970], pages 24 and 50. We follow the convention that all concave functions are defined on all of ℝ and take the value −∞ off of their effective domains (Rockafellar [1970], page 40). The classes of unconstrained and constrained log-concave densities are then

 LC:={eφ : ∫eφdλ=1,  φ∈C}, LCm:={eφ : ∫eφdλ=1,  φ∈Cm},   and SLC0:={eφ :∫eφdλ=1,  φ∈SC0},

where λ is Lebesgue measure on ℝ. We let X_1, …, X_n be the observations, independent and identically distributed with density f0 with respect to Lebesgue measure. Here we assume throughout that f0 ∈ LC, and frequently that f0 ∈ LC_m for some m or f0 ∈ SLC_0. We let X_(1) ≤ ⋯ ≤ X_(n) denote the order statistics of the X_i's, and write |X|_(1) ≤ ⋯ ≤ |X|_(n) for the order statistics of the |X_i|'s. We let Pn denote the empirical measure and Fn the empirical distribution function of the X_i's, and we also make use of the empirical distribution function of the |X_i|'s.

We define the log-likelihood criterion function Ψn by

 Ψn(φ) = (1/n) ∑_{i=1}^{n} φ(X_i) − ∫_ℝ e^{φ(x)} dx = Pn φ − ∫_ℝ e^φ dλ, (2.4)

where we have used the standard device of including the Lagrange term in Ψn to avoid the normalization constraints involved in the classes LC, LC_m, and SLC_0. This is as in Silverman [1982], Dümbgen and Rufibach [2009], and other current literature.
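The paper's estimators are computed by the R package logcondens.mode; purely as a numerical illustration of the criterion (2.4), the sketch below evaluates Ψn(φ) for a candidate φ stored as a piecewise-linear function on a grid (the grid representation and the standard-normal candidate are assumptions for illustration, not the package's algorithm):

```python
import numpy as np

def psi_n(grid, phi_vals, data):
    """Evaluate Psi_n(phi) = (1/n) sum_i phi(X_i) - int exp(phi) dx for a
    candidate phi that is piecewise linear on `grid` and -inf outside it."""
    # phi(X_i) by linear interpolation; -inf outside [grid[0], grid[-1]]
    phi_at_data = np.interp(data, grid, phi_vals, left=-np.inf, right=-np.inf)
    # trapezoid rule for the Lagrange term, int exp(phi) dx
    y = np.exp(phi_vals)
    integral = np.sum((y[1:] + y[:-1]) / 2 * np.diff(grid))
    return phi_at_data.mean() - integral

rng = np.random.default_rng(0)
x = rng.normal(size=500)
grid = np.linspace(-6.0, 6.0, 2001)
phi = -0.5 * grid**2 - 0.5 * np.log(2 * np.pi)  # standard normal log-density
val = psi_n(grid, phi, x)
```

For a candidate that already integrates to one, Ψn(φ) reduces to Pnφ − 1, which gives a quick sanity check on the implementation.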

We will denote the unconstrained MLE's of φ0, f0, and F0 by ˆφn, ˆfn, and ˆFn, respectively. The corresponding constrained estimators with mode m will be denoted by ˆφ0n, ˆf0n, and ˆF0n, and the symmetric estimators with mode at 0 by ˆψ0n, ˆg0n, and ˆG0n. Thus

 ˆφn≡argmaxφ∈CΨn(φ),  ˆφ0n≡argmaxφ∈CmΨn(φ),  and  ˆψ0n≡argmaxψ∈SC0Ψn(ψ).

Before proceeding to results concerning existence and uniqueness of the constrained estimators ˆφ0n and ˆψ0n, we first explain some undesirable properties of "naive" constrained estimators based on the unconstrained MLE ˆφn.

2.2 Naive Estimators

We can easily construct "naive" estimators under our two classes of constraints. For instance, a naive mode-constrained estimator based on the unconstrained log-concave MLE is ~φ0n(x) := ˆφn(x + (ˆmn − m)), where ˆmn is the mode of ˆφn. Then ~φ0n indeed has mode m. Unfortunately, these estimators have quite undesirable properties. For example, when x0 ≠ m and φ0′(x0) ≠ 0, we can see that

 n^{2/5}(~φ0n(x0) − φ0(x0)) = n^{2/5}(ˆφn(x0 + (ˆmn − m)) − φ0(x0) − φ0′(x0)(ˆmn − m)) + n^{2/5} φ0′(x0)(ˆmn − m) = Op(1) + n^{1/5} Op(1)

since ˆmn − m = Op(n^{−1/5}) by Theorem 3.6 of Balabdaoui, Rufibach and Wellner [2009], so the first summand is Op(1) by Corollary 2.2 of Balabdaoui, Rufibach and Wellner [2009] (and its proof, see their (4.34)). Thus, away from the mode, this naive estimator in fact converges at a slower rate than the n^{−2/5} rate attained by ˆφn.

Similarly, a naive 0-symmetric estimator ~gsn based on the unconstrained MLE can be constructed. Then ~gsn is symmetric about its mode. (It is not necessarily a bona fide density that integrates to 1, but its integral converges to 1.) Again, unfortunately, a similar analysis as above shows that if x0 ≠ 0 and φ0′(x0) ≠ 0, then

 n^{2/5}(log ~gsn(x0) − φ0(x0)) = Op(1) + n^{2/5} ˆmn φ0′(x0)

since the first term is Op(1) as before. Since ˆmn = Op(n^{−1/5}) and φ0′(x0) ≠ 0, we again see that the naive estimator converges at a slower rate than n^{−2/5}.

In summary, naive plug-in estimation under the mode and symmetry constraints does not work. The poor performance of these and other "naive" or "plug-in" estimators motivates the study of the constrained MLE's, which we now pursue.

2.3 The unconstrained and the constrained MLE’s

To develop theory for the mode-constrained estimators , , and it will be helpful to consider mode-augmented data with or as follows:

(1)

If for some then for and .

(2)

If for some (where and ), then we define for , , and for . In this case
and .

Theorem 2.1.

The following statements hold almost surely when are i.i.d. from a density on .

1. (Pal, Woodroofe and Meyer [2007], Rufibach [2006]) For n ≥ 2 the (unconstrained) nonparametric MLE ˆφn exists and is unique. It is linear on all intervals [X_(i), X_(i+1)], 1 ≤ i ≤ n − 1. Moreover, ˆφn = −∞ on (−∞, X_(1)) and on (X_(n), ∞).

2. (Doss [2013b]) For n ≥ 2 the mode-constrained MLE ˆφ0n exists and is unique. It is piecewise linear with knots at the mode-augmented data points and domain equal to their range. If m is not a data point, then at least one of (ˆφ0n)′(m−) or (ˆφ0n)′(m+) is 0.

3. The constrained MLE exists for and is unique. It is piecewise linear with knots contained in the set of points
, and is for . If , then .

The previous result shows that the MLE's exist. Unfortunately, there is no closed-form expression for them. However, since they are solutions to optimization problems, they satisfy certain optimality conditions. Thus, the next two theorems provide systems of inequalities and equalities that characterize the MLE's.

Theorem 2.2.

1. (Rufibach [2006], Dümbgen and Rufibach [2009]) Let ˆφn ∈ C be a concave function such that ∫ e^{ˆφn} dλ = 1. Then ˆφn is the unconstrained MLE if and only if

 ∫Δ(x)dFn(x)≤∫Δ(x)exp(ˆφn(x))dx=∫Δ(x)dˆFn(x)

for any function Δ such that ˆφn + tΔ is concave for some t > 0.

2. (Doss [2013b]) Suppose that . Then is the MLE over if and only if

 ∫ΔdFn≤∫ΔdˆF0n (2.5)

for all Δ such that ˆφ0n + tΔ ∈ C_m for some t > 0.

3. Suppose that and . Then is the MLE over if and only if

 ∫ΔdFn≤∫ΔdˆG0n (2.6)

for all Δ such that ˆψ0n + tΔ ∈ SC_0 for some t > 0.

To state the second characterization theorem for the MLE's, we first introduce some further notation and definitions. For a continuous and piecewise linear function h on an interval [A, B] we define its set of knots to be

 Sn(h) := {t ∈ (A, B) : h′(t−) ≠ h′(t+)} ∪ {A, B}.
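The knot set Sn(h) has a direct discrete analogue: for a piecewise-linear h stored by its values on a grid, the interior knots are the grid points where adjacent slopes differ. A minimal sketch (the grid representation and tolerance are illustrative assumptions, not from the paper):

```python
import numpy as np

def knot_set(grid, h_vals, tol=1e-9):
    """S_n(h): interior points of (A, B) where the left and right slopes of
    the continuous piecewise-linear h differ, plus the endpoints A and B."""
    slopes = np.diff(h_vals) / np.diff(grid)
    interior = [float(grid[i]) for i in range(1, len(grid) - 1)
                if abs(slopes[i] - slopes[i - 1]) > tol]
    return [float(grid[0])] + interior + [float(grid[-1])]

# h(x) = -|x| on [-1, 1]: the only interior knot is at 0
print(knot_set(np.array([-1.0, -0.5, 0.0, 0.5, 1.0]),
               np.array([-1.0, -0.5, 0.0, -0.5, -1.0])))  # [-1.0, 0.0, 1.0]
```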

Note that ˆφn, ˆφ0n, and ˆψ0n are all continuous and piecewise linear functions (with A = X_(1) and B = X_(n) in the case of ˆφn and ˆφ0n, and with A = −|X|_(n) and B = |X|_(n) in the case of ˆψ0n), and we have

 Sn(ˆφn)⊂{X(1),X(2),…,X(n)}, Sn(ˆφ0n)⊂{X(1),X(2),…,X(n)}, Sn(ˆψ0n)⊂{−|X|(n),…,|X|(n)}.

Now suppose that is piecewise linear with knots at the (mode-augmented) data, let , and assume that . For define

 (2.7)
Definition 2.3.

With considered as a possible knot of we say that is a left knot (or LK) if and that is a right knot (or RK) if . We say that is not a knot (or NK) if . All other knots are considered to be left knots (LKs) or right knots (RKs) depending on whether they are strictly smaller or strictly larger than .

Theorem 2.4.

1. (Rufibach [2006], Dümbgen and Rufibach [2009]) Let , and assume further that . Then is the MLE in if and only if

 ˆHn(t)≡∫tX(1)ˆFn(y)dy≤∫tX(1)Fn(y)dy≡Yn(t)   for all  t∈R

with equality if .

2. (Doss [2013b]) With the notation in (2.7), ˆφ0n is the MLE over C_m if and only if

 ˆH0n,L(t)≤Yn,L(t)   for  X(1)≤t≤m (2.8)
 ˆH0n,R(t)≤Yn,R(t)   for  m≤t≤X(n) (2.9)

with equality in (2.8) if t is a left knot of ˆφ0n and equality in (2.9) if t is a right knot of ˆφ0n.

3. is the MLE if and only if satisfies, with and ,

Remark 2.5.

The conditions (2.8) and (2.9) only involve data from the left and right sides of m, respectively, and hence are separate characterizations in a sense. But they are coupled by way of the global integral constraint, which involves the data on both sides of m.

Remark 2.6.

The parts of Theorems 2.1, 2.2, and 2.4 concerning the symmetric estimator are related to corresponding (more complex) results of Balabdaoui and Doss [2018], who treated symmetric log-concave densities in the setting of a two-component mixture model.

These characterization theorems have two important corollaries. (Recall that denotes the empirical distribution function of the ’s.)

Corollary 2.7.

(MLE’s related to at knot points) Each of the following holds almost surely.

1. on .

2. on .

3. on .

Now for any distribution function on let and .

Corollary 2.8.

(Mean and variance inequalities)
A. and .
B. and .

Because does not have mode , and because only has mode if , we cannot make comparisons between the mean and variances of and .

3 Simulation results

Software to compute the mode-constrained estimator, and also to implement the likelihood ratio test and corresponding confidence intervals studied in Doss and Wellner [2016b], is available in the package logcondens.mode [Doss, 2013a] in R [R Core Team, 2016]. Here we illustrate the existence and characterization results on simulated data in Figure 1. There are two columns of four plots. The left column includes the mode-constrained log-concave MLE; the right column includes the 0-symmetric log-concave MLE. The data points are represented by vertical hash lines along the bottom of each plot. The density, log density, and distribution function are plotted in the top three rows, with the unconstrained log-concave MLE in red and the true (unknown) function in black. On the left the mode-constrained MLE is in blue, and on the right the 0-symmetric MLE is in blue. The empirical distribution function is plotted in green in the third row. In the last row, we plot the characterizing processes in blue to illustrate part 2 of Theorem 2.4 (left plot) and part 3 of Theorem 2.4 (right plot), and in red to illustrate part 1 of Theorem 2.4. In all the plots, dashed vertical red lines give the knots of the unconstrained estimator and dashed vertical blue lines give the knots of the constrained estimator (which frequently overlap). The solid blue line is the specified mode value for the mode-constrained MLE. Plots from further simulations can be found on pages 10–13 of Doss and Wellner [2016c].

Figure 2 gives plots of the 0-symmetric MLE ("SC"), the mode-constrained MLE ("MC"), the unconstrained MLE ("UC"), and the empirical distribution function ("E"). The left and right plots each show one simulation, at a smaller and a larger sample size respectively, from a symmetric distribution. The plots show improvements by the constrained estimators over the unconstrained one. The MC and UC lines are indistinguishable at the larger sample size, since one needs to plot locally near the mode to see differences between the two when n is large.

4 Hellinger consistency and rates

Pal, Woodroofe and Meyer [2007] showed that the unconstrained MLE's are a.s. consistent in the Hellinger metric H, where H²(f, g) = (1/2)∫(√f − √g)² dλ, and their methods also yield consistency for the MLE's over any sub-class for which the MLE's exist and satisfy

 sup_n sup_x log ˆgn(x) < ∞   a.s.

This nicely includes the subclass LC_m when the mode has been correctly specified. Further consistency results are due to Cule and Samworth [2010a], Rufibach [2006], and Dümbgen and Rufibach [2009]. To the best of our knowledge, the present paper gives the first treatment of the consistency and global rate properties of the constrained estimators.
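To make the metric concrete: under one common convention, H²(f, g) = (1/2)∫(√f − √g)² dλ, the Hellinger distance can be checked numerically. A sketch (the unit-variance normal example and its closed form are standard facts, included only for illustration):

```python
import numpy as np

def hellinger(f, g, grid):
    """H(f, g) with H^2 = (1/2) * int (sqrt(f) - sqrt(g))^2, trapezoid rule."""
    d = (np.sqrt(f(grid)) - np.sqrt(g(grid))) ** 2
    h2 = 0.5 * np.sum((d[1:] + d[:-1]) / 2 * np.diff(grid))
    return np.sqrt(h2)

def normal_pdf(mu):
    return lambda x: np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

grid = np.linspace(-12.0, 12.0, 40001)
h = hellinger(normal_pdf(0.0), normal_pdf(0.5), grid)
# closed form for two unit-variance normals: H^2 = 1 - exp(-(mu1 - mu2)^2 / 8)
h_exact = np.sqrt(1.0 - np.exp(-0.25 / 8.0))
```

The numerical value agrees with the closed form to high accuracy, which is a useful unit test for any implementation of Hellinger-based diagnostics.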

Theorem 4.1.

(Hellinger consistency and rates of convergence)
A. (Doss and Wellner [2016a]) If f0 ∈ LC, then H(ˆfn, f0) = Op(n^{−2/5}).
B. (Doss [2013b]) If f0 ∈ LC_m, then H(ˆf0n, f0) = Op(n^{−2/5}).
C. (Balabdaoui and Doss [2018]) If f0 ∈ SLC_0, then H(ˆg0n, f0) = Op(n^{−2/5}).

Remark 4.2.

Kim and Samworth [2016] extend Part A of Theorem 4.1 by upper bounding the maximal risk of the unconstrained MLE: their Theorem 5 implies that sup_{f ∈ LC} E_f H²(ˆfn, f) = O(n^{−4/5}) (considering squared Hellinger rather than Hellinger distance). They also provide a matching lower bound: their Theorem 1 implies

 inf_{~fn} sup_{f ∈ LC} E_f H²(~fn, f) ≥ c n^{−4/5},

for some c > 0, where the infimum is over all (measurable) estimators ~fn of f. Neither upper nor lower bounds for the (Hellinger) minimax risk are known for either of the constrained density classes we consider in the present paper, although we conjecture that the n^{−4/5} rate (for squared Hellinger distance) is the minimax rate of convergence in both cases.

Remark 4.3.

When f0 ∉ LC_m, we can show that ˆf0n converges to f*0, where f*0 satisfies

 K(f0,f∗0)=infg∈LCmK(f0,g);

see Figure 4 of Doss and Wellner [2016c]. Similarly, when f0 ∉ SLC_0, we can show that ˆg0n converges to g*0, where g*0 satisfies

 K(f0,g∗0)=infg∈SLC0K(f0,g);

but we will not pursue this here, since our goal in this paper is to understand the null hypothesis (or correctly specified) behavior of the constrained estimators. See Doss and Wellner [2016b] for some initial steps concerning the power of the likelihood ratio test based on the mode-constrained MLE when the mode is misspecified.

In addition to Hellinger distance, one can consider the sup norm (on compact sets) as a metric for global convergence. It turns out that the proofs in Doss and Wellner [2016b] rely crucially on knowing the rate of sup-norm convergence for the constrained estimator (as well as for the unconstrained one). Thus we study the sup-norm rate of convergence in that paper; in its Theorem 4.1 we find, when the true log-density satisfies a Hölder condition, that the corresponding rate of convergence holds on compact sets interior to the support of f0.

5 Local limit processes and limiting distributions at fixed points

Our goal in this section is to describe the limiting distributions of our estimators, both unconstrained and constrained, at fixed points (and and ) at which the true density satisfies a curvature condition. We also want to compare and contrast the behavior of the three different estimators.

5.1 The limit processes, unconstrained and constrained

We first need to introduce the local limit processes which we need to treat the local (at a single point or in a neighborhood of a point) limiting distributions of the estimators, unconstrained and constrained. For all of our estimators (including the unconstrained and the two different mode-constrained estimators), the limit distributions are not Gaussian; they are defined in terms of so-called invelope processes of integrated Brownian motion. We first recall the invelope process related to the limit distribution for the unconstrained estimators; this process was first presented and studied in Groeneboom, Jongbloed and Wellner [2001a] (and shown to yield the limit distribution in several convex function estimation problems in Groeneboom, Jongbloed and Wellner [2001b]). Let W be a two-sided standard Brownian motion starting at 0, and for any t ∈ ℝ let

 X(t)=W(t)−4t3,    Y(t)=∫t0X(s)ds=∫t0W(s)ds−t4. (5.1)
Theorem 5.1 (Groeneboom, Jongbloed and Wellner [2001a]).

Let X, Y, and W be as in (5.1). Then there exists an almost surely uniquely defined random continuous function H satisfying the following conditions:
(i)   The function H is everywhere below Y:

 H(t)≤Y(t) for all t∈R.

(ii)   H has a concave second derivative.
(iii) H satisfies

 ∫∞−∞(H(t)−Y(t))d(H(3))(t)=0.

The random variables H′′(0) and H′′′(0) give the universal component of the limit distributions of the estimators at a fixed point; see Theorem 5.5, below.
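The driving processes in (5.1) are straightforward to simulate. The following sketch (grid, horizon, and seed are arbitrary illustrative choices) generates one path of X(t) = W(t) − 4t³ and Y(t) = ∫₀ᵗ W(s) ds − t⁴ for t ≥ 0; the invelope H of Theorem 5.1 is the object that would then be fitted below Y:

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.linspace(0.0, 2.0, 4001)
dt = t[1] - t[0]
# standard Brownian motion path with W(0) = 0
W = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), t.size - 1))])
X = W - 4.0 * t**3
# Y(t) = int_0^t W(s) ds - t^4, cumulative trapezoid rule for the integral
intW = np.concatenate([[0.0], np.cumsum((W[1:] + W[:-1]) / 2 * dt)])
Y = intW - t**4
```

Since Y′ = X by construction, a numerical derivative of Y recovers X up to discretization error; the strong negative drift −t⁴ dominates the integrated Brownian fluctuation for moderate t.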

Theorem 5.1 concerns a process H related to the unconstrained concave estimation problem. In the mode-constrained estimation problem, instead of having one process we have two: one for the left-hand side of 0 (negative axis) and one for the right-hand side of 0 (positive axis). (Here, 0 corresponds to the mode m, by a translation.) The definitions of the left- and right-hand processes depend on random starting points for the corresponding integrals involved, which we will eventually denote τL and τR (this is made clear in (8.29), (8.30), and (8.31), below). To define τL and τR, we must define rigorously the possible 'bend points' of the limit process ˆφ0. To describe the situation exactly, we will also define 'bend points' τ0− and τ0+, satisfying τL ≤ τ0− and τ0+ ≤ τR, where the inequalities may or may not be strict; these bend points arise in (5.10) below.

Theorem 5.2.

Assume , are random processes where . Define the ‘bend points’ by

 (5.2)

Next, define

 τ0−(ˆφ0) ≡ τ0− = sup{t ∈ ˆS0 : (ˆφ0)′((t−ε)−) > 0 for all ε > 0}, (5.3)
 τ0+(ˆφ0) ≡ τ0+ = inf{t ∈ ˆS0 : (ˆφ0)′((t+ε)+) < 0 for all ε > 0}, (5.4)
 τL = sup(ˆS0 ∩ (−∞, 0)) and τR = inf(ˆS0 ∩ (0, ∞)). (5.5)

Let W be a standard two-sided Brownian motion with W(0) = 0, and for t ∈ ℝ let

 X(t)=W(t)−4t3,
 YL(t)=∫τLt∫τLudX(v)du and YR(t)=∫tτR∫uτRdX(v)du. (5.6)

With these definitions, we assume that:
(i) and and

 ∫τRτL(ˆφ0(v)dv−dX(v))=0. (5.7)

(ii)

 HL(t)−YL(t)≤0 for t≤0, (5.8) HR(t)−YR(t)≤0 for t≥0, (5.9)

(iii)

 ∫(−∞,τ0−](HL(u)−YL(u))d(ˆφ0)′(u)=0=∫[τ0+,∞)(HR(u)−YR(u))d(ˆφ0)′(u). (5.10)

Then, and are unique, as are and .

Theorem 5.2 shows that processes with the given properties are unique; that they exist will follow from Section 5.3.

If ˆφ0 is, in fact, piecewise linear, then τ0− is just the last knot point of ˆφ0 with strictly positive slope to its left. By Theorem 23.1 of Rockafellar [1970], a finite concave function on ℝ such as ˆφ0 has well-defined right- and left-derivatives everywhere; the specification of left- and right-derivatives in the definitions of τ0− and τ0+ is for concreteness but is not necessary, since we consider all ε > 0.

The distinction between τ0+ and τR depends only on the behavior of ˆφ0 at 0, and can be understood by considering the case where the infima in (5.4) and (5.5) are actually minima (the infima are attained). In that case, we see that τ0+ can be thought of as the smallest "right-knot", in the sense that ˆφ0 has a strictly negative slope to the right of τ0+. And τR can be thought of as the smallest positive knot. Note that (by concavity) all positive knots are right-knots, so that τ0+ ≤ τR. Note also that the infimum defining τR in (5.5) is taken over knots that are strictly larger than 0, so that (when the infima are attained) we have τR > 0. On the other hand, if 0 is a right-knot then τ0+ = 0, so that then τ0+ and τR are distinct. If 0 is not a right-knot, then we will have τ0+ = τR. These statements are slightly complicated by the fact that τ0+ and τR are defined as infima rather than minima, but the intuitive differences are captured by the previous description. Corresponding statements hold for τ0− and τL.
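For a piecewise-linear concave function with mode at 0, stored by its values on a grid, the bend points in (5.3)-(5.5) have simple discrete analogues computable from the slope sequence. A hypothetical sketch (the grid representation is an assumption for illustration, not from the paper):

```python
import numpy as np

def bend_points(grid, phi_vals, tol=1e-9):
    """Discrete analogues of (5.3)-(5.5): tau0_minus is the last knot with
    strictly positive slope on its left, tau0_plus the first knot with
    strictly negative slope on its right, and tau_L / tau_R the nearest
    knots strictly to the left / right of 0 (None if no such knot)."""
    slopes = np.diff(phi_vals) / np.diff(grid)
    knots = [float(grid[i]) for i in range(1, len(grid) - 1)
             if abs(slopes[i] - slopes[i - 1]) > tol]
    tau0_minus = max((t for t in knots
                      if slopes[np.searchsorted(grid, t) - 1] > tol), default=None)
    tau0_plus = min((t for t in knots
                     if slopes[np.searchsorted(grid, t)] < -tol), default=None)
    tau_L = max((t for t in knots if t < 0), default=None)
    tau_R = min((t for t in knots if t > 0), default=None)
    return tau0_minus, tau0_plus, tau_L, tau_R

# a function with a flat piece at the mode: knots at -1 and 1
print(bend_points(np.array([-2.0, -1.0, 1.0, 2.0]),
                  np.array([-1.0, 0.0, 0.0, -1.0])))  # (-1.0, 1.0, -1.0, 1.0)
```

When 0 is itself a knot (a strict peak at the mode), the sketch returns τ0− = τ0+ = 0 while no knots lie strictly to the left or right, mirroring the case discussed above in which the two pairs of bend points differ.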

The distinction between the two pairs of knots is important because many of our arguments depend on constructing "perturbations" of ˆφ0, and we can use different types of perturbations at each pair. This means that the different knot pairs have different properties: if we replace τL and τR by τ0− and τ0+ in (5.7), then that display may not hold, and similarly, if we replace τ0− and τ0+ by τL and τR in (5.10), then that display may not hold. The following lemma holds for τ0− and τ0+ but not necessarily for τL and τR.

Lemma 5.3.

With the definitions and assumptions as in Theorem 5.2,

 (ˆφ0)′(t)=0 for t∈(τ0−,τ0+). (5.11)

Now we introduce the appropriate limit processes for the symmetric, mode-constrained estimators. The characterization is similar to that for the mode-constrained (but not symmetric) processes, but since it is defined only on [0, ∞) the processes are not the same.


Theorem 5.4.

Assume is a random process on , and assume that . Define

 (ˆS+)c≡S(ˆψ0)c={t≥0:(ˆψ0)(2)(t±)=0}∖{0},
 τ++=inf{t∈ˆS+:(ˆψ0)′(t+ε+)<0 for all ε>0}, and
 τ+R=infˆS+∩(0,∞).

For t ≥ 0, let W be a one-sided standard Brownian motion with W(0) = 0, let X(t) = W(t) − 4t³, and let

 Y+(t)=∫tτ+R∫uτ+RdX(v)du.

Suppose that and
(i)

 ∫τ+R0(ˆψ0(u)du−dX(u))=0, (5.12)

(ii)

 H+(t)−Y