# Approximation by log-concave distributions, with applications to regression

###### Abstract

We study the approximation of arbitrary distributions on -dimensional space by distributions with log-concave density. Approximation means minimizing a Kullback–Leibler-type functional. We show that such an approximation exists if and only if has finite first moments and is not supported by some hyperplane. Furthermore we show that this approximation depends continuously on with respect to Mallows distance . This result implies consistency of the maximum likelihood estimator of a log-concave density under fairly general conditions. It also allows us to prove existence and consistency of estimators in regression models with a response , where and are independent, belongs to a certain class of regression functions while is a random error with log-concave density and mean zero.

10.1214/10-AOS853 \volume39 \issue2 2011 \firstpage702 \lastpage730 \newproclaimRemark[Theorem]Remark \newproclaimExample[Theorem]Example

Log-concave approximations

A]\fnmsLutz \snmDümbgen\corref\thanksreft1label=e1]duembgen@stat.unibe.ch,
B]\fnmsRichard \snmSamworthlabel=e2]r.samworth@statslab.cam.ac.uk

and
A]\fnmsDominic \snmSchuhmacher\thanksreft1label=e3]dominic.schuhmacher@stat.unibe.ch

t1Supported by the Swiss National Science Foundation.

class=AMS] \kwd62E17 \kwd62G05 \kwd62G07 \kwd62G08 \kwd62G35 \kwd62H12. Convex support \kwdisotonic regression \kwdlinear regression \kwdMallows distance \kwdprojection \kwdweak semicontinuity.

## 1 Introduction

Log-concave distributions, that is, distributions with a Lebesgue density the logarithm of which is concave, are an interesting nonparametric model comprising many parametric families of distributions. Bagnoli and Bergstrom (2005) give an overview of many interesting properties and applications in econometrics. Indeed, these distributions have received a lot of attention among statisticians recently as described in the review by Walther (2009). The nonparametric maximum likelihood estimator was studied in the univariate setting by Pal, Woodroofe and Meyer (2007), Rufibach (2006), Dümbgen, Hüsler and Rufibach (2007), Balabdaoui, Rufibach and Wellner (2009) and Dümbgen and Rufibach (2009). These references contain characterizations of the estimators, consistency results and explicit algorithms. Extensions of one or more of these aspects to the multivariate setting are presented by Cule, Samworth and Stewart (2010), Cule and Samworth (2010), Koenker and Mizera (2010), Seregin and Wellner (2010) and Schuhmacher and Dümbgen (2010). Both Cule and Samworth (2010) and Schuhmacher, Hüsler and Dümbgen (2009) show that multivariate log-concave distributions are a very well-behaved nonparametric class. For instance, moments of arbitrary order are continuous statistical functionals with respect to weak convergence.

The first aim of the present paper is a deeper understanding of the approximation scheme underlying the maximum likelihood estimator of a log-concave density. Let us put this into a somewhat broader context: let be the empirical distribution of independent random vectors , with distribution on a given open set . Suppose that has a density belonging to a given class of probability densities on . The maximum likelihood estimator of may be written as

(provided this exists and is unique). Even if fails to have a density within , one may view as an estimator of the approximating density

In fact, if has a density on such that the integral exists in , one may rewrite as the minimizer of the Kullback–Leibler divergence,

over all . Note the well-known fact that unless almost everywhere. Viewing a maximum likelihood estimator as an estimator of an approximation within a given model is common in statistics [see, e.g., Pfanzagl (1990), Patilea (2001), Doksum et al. (2007) and Cule and Samworth (2010)]. Pfanzagl (1990) and Patilea (2001) show that under suitable regularity conditions on and , the estimator is consistent with certain large deviation bounds or rates of convergence, even in the case of misspecified models. To the best of our knowledge, their results are not directly applicable in the setting of log-concave densities, which is treated by Cule and Samworth (2010). Our ambition is to identify the largest possible class of distributions such that is well defined and unique. Moreover, we want to show that the mapping is continuous on that class with respect to a coarse topology, ideally the topology of weak convergence.

With these goals in mind, let us tell a short success story about Grenander’s estimator [Grenander (1956)], also a key example of Patilea (2001): let be the class of all nonincreasing and left-continuous probability densities on . Then for any distribution on , the maximizer

is well defined and unique. Namely, if denotes the distribution function of , then is the left-sided derivative of the smallest concave majorant of on [see Barlow et al. (1972)]. With this characterization one can show that for any sequence of distributions on converging weakly to ,

Since the sequence of empirical measures converges weakly to almost surely, this entails strong consistency of the Grenander estimator in total variation distance.

In the remainder of the present paper we consider the class of log-concave probability densities on . We will show in Section 2 that exists and is unique in if and only if

and

Some additional properties of will be established as well. We show that the mapping is continuous with respect to Mallows distance [Mallows (1972)] , also known as a Wasserstein, Monge–Kantorovich or Earth Mover’s distance. Precisely, let satisfy the properties just mentioned, and let be a sequence of probability distributions converging to in ; in other words,

(1) |

as . Then is well defined for sufficiently large and

This entails strong consistency of the maximum likelihood estimator , because converges almost surely to with respect to Mallows distance . In addition we show that is convex and upper semicontinuous with respect to weak convergence.

In Section 3 we apply these results to the following type of regression problem: suppose that we observe independent real random variables , such that

for given fixed design points in some set , some unknown regression function and independent, identically distributed random errors with unknown log-concave density and mean zero. We will show that a maximum likelihood estimator of exists and is consistent under certain regularity conditions in the following two cases: (i) and is affine (i.e., affine linear); (ii) and is nondecreasing. These methods are illustrated with a real data set.

Many proofs and technical arguments are deferred to Section 4. A longer and more detailed version of this paper is the technical report by Dümbgen, Samworth and Schuhmacher (2010), referred to as [DSS 2010] hereafter. It contains all proofs, additional examples and plots, a detailed description of our algorithms and extensive simulation studies. There we also indicate potential applications to change-point analyses.

## 2 Log-concave approximations

For a fixed dimension , let be the family of concave functions which are upper semicontinuous and coercive in the sense that

In particular, for any there exist constants and such that , so is finite. Further let be the family of all probability distributions on . Then we define a log-likelihood-type functional

and a profile log-likelihood

If, for fixed , there exists a function such that , then it will automatically satisfy

To verify this, note that for any fixed function and arbitrary , and

if . Thus is minimal for .

### 2.1 Existence, uniqueness and basic properties

The next theorem provides a complete characterization of all distributions with real profile log-likelihood . To state the result we first define the convex support of a distribution and collect some of its properties.

###### Lemma 2.1 ((Dss 2010))

For any , the set

is itself closed and convex with . The following three properties of are equivalent:

[(a)]

has nonempty interior;

for any hyperplane ;

with denoting Lebesgue measure on ,

###### Theorem 2.2

For any , the value of is real if and only if

In that case, there exists a unique function

This function satisfies and

[[Moment (in)equalities]] Let satisfy the properties stated in Theorem 2.2. Then the log-density satisfies the following requirements: , and for any function ,

(2) |

This follows from

Let be the approximating probability measure with . It satisfies the following (in)equalities:

(3) | |||||

(4) |

To verify (3), let be a subgradient of at , that is, for all . Since for arbitrary and suitable constants and , the function satisfies the requirement that whenever . Hence the asserted inequality follows from (2). The equality for the first moments follows by setting for arbitrary .

In what follows let

Thus if and only if . Moreover, the proof of Theorem 2.2 shows that

[(Affine equivariance)] Suppose that . For arbitrary vectors and nonsingular, real matrices define to be the distribution of when has distribution . Then , too, and elementary considerations reveal that

and

[(Convexity, DSS 2010)] The profile log-likelihood is convex on . Precisely, for arbitrary and ,

The two sides are equal and real if and only if with . {Remark}[(Concave majorants, DSS 2010)] Let for a distribution . For any open set there exists a (pointwise) minimal function such that on . In particular, with equality on . One can also show that is the pointwise infimum of all affine functions such that on . If denotes the smallest closed set with , then

Furthermore, suppose that has a density on an open set such that on this set. Then

### 2.2 The one-dimensional case

For the case of one can generalize Theorem 2.4 of Dümbgen and Rufibach (2009) as follows: for a function let

The log-concave approximation of a distribution on can be characterized in terms of distribution functions only:

###### Theorem 2.3

Let be a nondegenerate distribution on with finite first moment and distribution function . Let be a distribution function with log-density . Then if and only if

and

[(DSS 2010)] One consequence of this theorem is that the c.d.f. of follows the c.d.f. of quite closely in that

Let be a rescaled version of Student’s distribution with density and distribution function

respectively. The best approximating log-concave distribution is the Laplace distribution with density and distribution function

respectively. To verify this claim, note that by symmetry it suffices to show that

Indeed the integral on the left-hand side equals

for all . Clearly this expression is zero for , and elementary considerations show that it is nonpositive for all . Numerical calculations reveal that is smaller than everywhere. {Remark} Let such that for some bounded interval . Then is linear on . This follows from Remark 2.1, applied to . Note that and , because otherwise on or on . But this would be incompatible with , because both and are positive. {Remark}[(DSS 2010)] Suppose that has a continuous but not log-concave density . Nevertheless one can say the following about the approximating log-density :

Suppose that is concave on an interval with and . Then there exists a point such that is linear on and on .

Suppose that is differentiable everywhere, convex on a bounded interval and concave on both and . Then there exist points and such that is linear on while on .

Suppose that is convex on an interval such that . Then is linear on . {Example} Let us illustrate part (ii) of Remark 2.2 with a numerical example. Figure 1 shows the bimodal

density (green/dotted line) of the Gaussian mixture together with its log-concave approximation (blue line). As predicted, there exists an interval such that on and is linear on .

### 2.3 Continuity in

For the applications to regression problems to follow we need to understand the properties of both and on . Our first hope was that both mappings would be continuous with respect to the weak topology. It turned out, however, that we need a somewhat stronger notion of convergence, namely, convergence with respect to Mallows distance which is defined as follows: for two probability distributions ,

where the infimum is taken over all pairs of random vectors and on a common probability space. It is well known that the infimum in is a minimum. The distance is also known as Wasserstein, Monge–Kantorovich or Earth Mover’s distance. An alternative representation due to Kantorovič and Rubinšteĭn (1958) is

where consists of all such that for all . Moreover, for a sequence in , it is known that (1) is equivalent to as [Mallows (1972), Bickel and Freedman (1981)]. In case of , the optimal coupling of and is given by the quantile transformation: if and denote the respective distribution functions, then

A good starting point for more detailed information on Mallows distance is Chapter 7 of Villani (2003).

Before presenting the main results of this section we mention two useful facts about the convex support of distributions.

###### Lemma 2.4

Given a distribution , a point is an interior point of if and only if

Moreover, if is a sequence in converging weakly to , then

This lemma implies that the set is an open subset of with respect to the topology of weak convergence. The supremum is a maximum over closed halfspaces and is related to Tukey’s halfspace depth [Donoho and Gasko (1992), Section 6]. For a proof of Lemma 2.4 we refer to [DSS 2010]. Now we are ready to state the main results of this section.

###### Theorem 2.5 ((Weak upper semicontinuity))

Let be a sequence of distributions in converging weakly to some . Then

Moreover, if and only if

This result already entails continuity of on with respect to Mallows distance . The next theorem extends this result to :

###### Theorem 2.6 ((Continuity with respect to Mallows distance ))

Let be a sequence of distributions in such that for some . Then

In case of , the probability densities and are well defined for sufficiently large and satisfy

[(Stronger modes of convergence)] The convergence of to in total variation distance may be strengthened considerably. It follows from recent results of Cule and Samworth (2010) or Schuhmacher, Hüsler and Dümbgen (2009) that uniformly on arbitrary closed subsets of , where is the set of discontinuity points of . The latter set is contained in the boundary of the convex set , hence a null set with respect to Lebesgue measure. Moreover, there exists a number such that

More generally,

for any sublinear function such that .

## 3 Applications to regression problems

Now we consider the regression setting described in the Introduction with observations , , where the are given fixed design points, is an unknown regression function, and the are independent random errors with mean zero and unknown distribution on such that is well defined. The regression function is assumed to belong to a given family with the property that

### 3.1 Maximum likelihood estimation

We propose to estimate by a maximizer of

over all . Note that remains unchanged if we replace with for an arbitrary . For fixed , the maximizer of over will automatically satisfy and . Thus if maximizes over , then

maximizes over all satisfying the additional constraint that defines a probability density with mean zero.

Define and . Then we may write

with the empirical distributions

for . Thus our procedure aims to find

and this representation is our key to proving the existence of . Before doing so we state a simple inequality of independent interest, which follows from Jensen’s inequality and elementary considerations:

###### Lemma 3.1 ((Dss 2010))

For any distribution ,

where is a median of while denotes its mean .

###### Theorem 3.2 ((Existence in regression))

Suppose that the set is closed and does not contain . Then there exists a maximizer of over all .

The constraint excludes situations with perfect fit. In that case, the Dirac measure would be the most plausible error distribution. {Example}[(Linear regression)] Let , and let consist of all affine functions on . Then is the column space of the design matrix

hence a linear subspace of . Consequently there exists a maximizer of over , unless . {Example}[(Isotonic regression)] Let be some interval on the real line, and let consist of all isotonic functions . Then the set is a closed convex cone in . Here the condition that is equivalent to the existence of two indices such that but .

#### Fisher consistency

The maximum likelihood estimator need not be unique in general. Nevertheless we will prove it to be consistent under certain regularity conditions. A key point here is Fisher consistency in the following sense: note that the expectation measure of the empirical distribution equals

with

But

with equality if and only if is constant on . This follows from a more general inequality which is somewhat reminiscent of Anderson’s lemma [Anderson (1955)]:

###### Theorem 3.3

Let and . Then and

Equality holds if and only if for some .

### 3.2 Consistency

In this subsection we consider a triangular scheme of independent observations , , with fixed design points and

where is an unknown regression function in and are unobserved independent random errors with mean zero and unknown distribution . Two basic assumptions are:

[(A.2)]

is a closed subset of for every ;

for some distribution .

We write for a maximizer of over all pairs such that and , where stands for the empirical distribution of the residuals , . We also need to consider its expectation measure

Furthermore we write

It is also convenient to metrize weak convergence. In Theorem 3.4 below we utilize the bounded Lipschitz distance: for probability distributions on the real line let

where is the family of all functions such that for all .

###### Theorem 3.4 ((Consistency in regression))

Let assumptions (A.1) and (A.2) be satisfied. Suppose further that: {longlist}[(A.2)]

for arbitrary fixed ,

Then, with and , the maximum likelihood estimator of exists with asymptotic probability one and satisfies

We know already that assumption (A.1) is satisfied for multiple linear regression and isotonic regression. Assumption (A.2) is a generalization of assuming a fixed error distribution for all sample sizes. The crucial point, of course, is assumption (A.3). In our two examples it is satisfied under mild conditions:

###### Theorem 3.5 ((Linear regression))

Let be the family of all affine functions on . If assumption (A.2) is satisfied, then (A.3) follows from

###### Theorem 3.6 ((Isotonic regression))

Let be the set of all nondecreasing functions on an interval . If assumption (A.2) holds true, then (A.3) follows from

### 3.3 Algorithms and numerical results

Computing the maximum likelihood estimator from Section 3.1 turns out to be a rather difficult task, because the function can have multiple local maxima. In [DSS 2010] we discuss strengths and weaknesses of three different algorithms, including an alternating and a stochastic search algorithm. The third procedure, which is highly successful in the case of linear regression, is global maximization of the profile log-likelihood , where for every , by means of differential evolution [Price, Storn and Lampinen (2005)].

Extensive simulation studies in [DSS 2010] suggest that provides rather accurate estimates even if is only moderately large. For various skewed error distributions, may be considerably better than the corresponding least squares estimator. As an example consider the simple linear regression model with observations

where are independent design points from the distribution and are independent errors from a centered gamma distribution with shape parameter and variance . Note that the distribution of does not depend on or . Monte Carlo estimation of the root mean squared error based on 1000 simulations of this model gives 0.023 for the estimator versus 0.118 for the least squares estimator of if , and 0.095 versus 0.113 for the same comparison if .

### 3.4 A data example

A familiar task in econometrics is to model expenditure () of households as a function of their income (). Not only the mean curve (Engel curve) but also quantile curves play an important role. A related application are growth charts in which, for instance, is the age of a newborn or infant and is its height or weight.

We applied our methods to a survey of households in the United Kingdom in 1973 (data courtesy of W. Härdle, HU Berlin). The two variables we considered were annual income () and annual expenditure for food (). Figure 2 shows scatter plots of the raw and log-transformed data. To enhance visibility we only show a random subsample of size . In addition, isotonic quantile curves are added for , , , , (based on all observations). These pictures show clearly that the raw data are heteroscedastic, whereas for the log-transformed data, , an additive model seems appropriate.

Interestingly, neither linear nor quadratic nor cubic regression yield convincing fits to these data. Polynomial regression of degree four or cubic splines with knot points at, say, , , , , seem to fit the data quite well. Moreover, exact Monte Carlo goodness-of-fits test, assuming the regression function to be a cubic spline and based on a Kolmogorov–Smirnov statistic applied to studentized residuals, revealed the regression errors to be definitely non-Gaussian.

Figure 3 shows the data and estimated -quantile curves for , , , , , based on our additive regression model. Note that the estimated -quantile curve is simply the estimated mean curve plus the -quantile of the estimated error distribution. On the left-hand side, we only assumed to be nondecreasing, on the right-hand side we fitted the aforementioned spline model. In both cases the fitted quantile curves are similar to the quantile curves in Figure 2 but with fewer irregularities such as big jumps which may be artifacts due to sampling error.

## 4 Proofs

For the proof of Theorem 2.2 we need an elementary bound for the Lebesgue measure of level sets of log-concave distributions:

###### Lemma 4.1 ((Dss 2010))

Let be such that . For real define the level set . Then for ,

Another key ingredient for the proofs of Theorems 2.2 and 2.6 is a lemma on pointwise limits of sequences in :

###### Lemma 4.2 ((Dss 2010))

Let and be functions in such that for all . Further suppose that the set

is nonempty. Then there exist a subsequence of and a function such that and

Proof of Theorem 2.2 Suppose first that . Since any is majorized by for suitable constants and , this entails that .

Second, suppose that but . According to Lemma 2.1, the latter fact is equivalent to for some hyperplane