A penalized exponential risk bound in parametric estimation

Spokoiny, Vladimir
Weierstrass-Institute,
Mohrenstr. 39, 10117 Berlin, Germany
spokoiny@wias-berlin.de
Abstract

The paper offers a novel unified approach to studying the accuracy of parameter estimation by the quasi likelihood method. Important features of the approach are: (1) The underlying model is not assumed to be parametric. (2) No conditions on parameter identifiability are required; the parameter set can be unbounded. (3) The model assumptions are quite general and there are no specific structural assumptions like independence or weak dependence of observations. The imposed conditions on the model are very mild and can be easily checked in specific applications. (4) The established risk bounds are nonasymptotic and valid for large, moderate and small samples. (5) The main result is the concentration property of the quasi MLE, giving a nonasymptotic exponential bound for the probability that the considered estimate deviates from a small neighborhood of the “true” point.

In standard situations under mild regularity conditions, the usual consistency and rate results can be easily obtained as corollaries from the established risk bounds. The approach and the results are illustrated by the examples of generalized linear and single-index models.

JEL codes: C13, C22.

Keywords: exponential risk bounds, rate function, quasi maximum likelihood

1 Introduction

One of the most popular approaches in statistics is based on the parametric assumption that the distribution $\mathbb{P}$ of the observed data belongs to a given parametric family $(\mathbb{P}_{\theta}, \theta \in \Theta)$, where $\Theta$ is a subset of a finite dimensional space $\mathbb{R}^{p}$. In this situation, the statistical inference about $\mathbb{P}$ is reduced to recovering the parameter $\theta$. The standard likelihood principle suggests estimating $\theta$ by maximizing the corresponding log-likelihood function $L(\theta)$. The classical parametric statistical theory focuses mostly on asymptotic properties of the difference between the estimate $\tilde{\theta}$ and the true value $\theta^{*}$ as the sample size tends to infinity. There is a vast literature on this issue. We only mention the book by Ibragimov and Khas’minskij (1981), which provides a comprehensive study of asymptotic properties of maximum likelihood and Bayesian estimators. The related analysis is effectively based on the Taylor expansion of the likelihood function near the true point under the assumption that the considered estimate concentrates well in a small (root-n) neighborhood of this point. On the contrary, there are only a few results which establish this desired concentration property. Ibragimov and Khas’minskij (1981), Section 1.5, presents some exponential concentration bounds in the i.i.d. parametric case. Large deviation results about minimum contrast estimators can be found in Jensen and Wood (1998) and Sieders and Dzhaparidze (1987), while subtle small sample size properties of these estimators are presented in Field (1982) and Field and Ronchetti (1990). This paper aims at studying the concentration properties of a general parametric estimate. The main result describes some concentration sets for the considered estimate and establishes an exponential bound for the probability that the estimate deviates from such sets.

In the modern statistical literature there is a number of papers considering maximum likelihood or, more generally, minimum contrast estimators in a general i.i.d. situation, when the parameter set is a subset of some functional space. We mention the papers Van de Geer (1993), Birgé and Massart (1993), Birgé and Massart (1998), Birgé (2006) and references therein. These studies mostly focused on the concentration properties of the maximum over $\theta$ of the log-likelihood $L(\theta)$ rather than on the properties of the estimator $\tilde{\theta}$, which is the point of maximum of $L(\theta)$. The established results are based on deep probabilistic facts from the empirical process theory (see e.g. Talagrand (1996), van der Vaart and Wellner (1996), Boucheron et al. (2003)). Our approach is similar in the sense that the analysis also focuses on the properties of the maximum of $L(\theta)$ over $\theta \in \Theta$. However, we do not assume any specific structure of the model. In particular, we do not assume independent observations and thus cannot apply the methods from the empirical process theory.

The aim of this paper is to offer a general and unified approach to the statistical estimation problem which delivers meaningful and informative results in a general framework under mild regularity assumptions. An important feature of the proposed approach is that it allows one to go beyond the parametric case; that is, most of the results and conclusions continue to apply even if the parametric assumption is not precisely fulfilled. Then the target of estimation can be viewed as the best parametric fit. Other important features of the proposed approach are that the established risk bounds are nonasymptotic and equally apply to large, moderate and small samples, and that the results describe nonasymptotic confidence and concentration sets in terms of the quasi log-likelihood rather than the accuracy of point estimation. In most examples, the usual consistency and rate results can be easily obtained as corollaries from the established risk bounds. The results are obtained under very mild conditions which are easy to verify in particular applications. There are no specific assumptions on the structure of the data like independence or weak dependence of observations, and the parameter set can be unbounded. Another interesting feature of the proposed approach is that it does not require any identifiability conditions. Informally, one can say that whatever the quasi likelihood or contrast is, the corresponding estimate belongs with a dominating probability to the corresponding concentration set. Examples show that the resulting concentration sets are of the right magnitude; in typical situations this is a root-n vicinity of the true point.

Now we specify the considered set-up. Let $Y$ stand for the observed data. For notational simplicity we assume that $Y$ is a vector in $\mathbb{R}^{n}$. By $\mathbb{P}$ we denote the measure describing the distribution of the whole sample $Y$. The parametric approach discussed below allows one to reduce the whole description of the model to a few parameters which have to be estimated from the data. Let $(\mathbb{P}_{\theta}, \theta \in \Theta)$ be a given parametric family of measures on $\mathbb{R}^{n}$. The parametric assumption means simply that $\mathbb{P} = \mathbb{P}_{\theta^{*}}$ for some $\theta^{*} \in \Theta$. The parameter vector $\theta^{*}$ can be estimated using the maximum likelihood (MLE) approach. Let $L(\theta)$ be the log-likelihood for the considered parametric model: $L(\theta) = \log \frac{d\mathbb{P}_{\theta}}{d\mu_{0}}(Y)$, where $\mu_{0}$ is any dominating measure for the family $(\mathbb{P}_{\theta})$. The MLE $\tilde{\theta}$ of the parameter $\theta^{*}$ is given by maximizing the log-likelihood $L(\theta)$:

$\tilde{\theta} = \operatorname{argmax}_{\theta \in \Theta} L(\theta).$

(1.1)

Note that the value of the estimate $\tilde{\theta}$ will not be changed if the process $L(\theta)$ is multiplied by any positive constant.

The quasi maximum likelihood approach admits that the underlying distribution $\mathbb{P}$ does not belong to the family $(\mathbb{P}_{\theta}, \theta \in \Theta)$. The estimate $\tilde{\theta}$ from (1.1) is still meaningful, and it becomes the quasi MLE. Later we show that $\tilde{\theta}$ estimates the value $\theta^{*}$ defined by maximizing the expected value of $L(\theta)$:

$\theta^{*} = \operatorname{argmax}_{\theta \in \Theta} \mathbb{E} L(\theta),$

which is the true value in the parametric situation and can be viewed as the parameter of the best parametric fit in the general case.
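
To make the notion of the best parametric fit concrete, the following sketch (our illustration in Python; the distributions and sample sizes are arbitrary choices, not taken from the paper) approximates $\mathbb{E} L(\theta)$ by Monte Carlo for a misspecified Gaussian working model and compares its maximizer with the quasi MLE.

```python
import numpy as np

# The quasi-MLE target theta* maximizes the expected working log-likelihood
# E L(theta) even when the working model is misspecified.  Here the data are
# Student-t (non-Gaussian), while the working model is the Gaussian shift
# N(theta, 1); theta* is then the best Gaussian fit, i.e. the location.
rng = np.random.default_rng(0)
Y = 2.0 + rng.standard_t(df=3, size=100_000)    # true location 2.0

def L(theta, y):
    """Gaussian working log-likelihood, up to a theta-free constant."""
    return -0.5 * np.sum((y - theta) ** 2)

grid = np.linspace(0.0, 4.0, 401)
expected_L = [L(t, Y) / Y.size for t in grid]   # Monte Carlo proxy for E L(theta) / n
theta_star = grid[np.argmax(expected_L)]        # best parametric fit
theta_tilde = Y.mean()                          # quasi-MLE for the Gaussian shift
print(theta_star, theta_tilde)                  # both are close to 2.0
```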

Note that the presented set-up is quite general, and most statistical estimation procedures can be represented as quasi maximum likelihood for a properly selected parametric family. In particular, the popular least squares, least absolute deviations and M-estimates can be represented as quasi MLEs.

The set-up of this paper is even more general. Namely, we consider a general estimate $\tilde{\theta}$ defined by maximizing a random field $L(\theta)$, $\theta \in \Theta$. The basic example we have in mind is the scaled quasi log-likelihood $\mu L(\theta)$ for some $\mu > 0$. In some cases, especially if the parameter set $\Theta$ is unbounded, the scaling factor can also be taken depending on $\theta$, that is, $\mu = \mu(\theta)$. We focus on the properties of the process $L(\theta)$ as a function of the parameter $\theta$. Therefore, we suppress the argument $Y$ there. One has to keep in mind that $L(\theta)$ is random and depends on the observed data $Y$. The study focuses on the concentration properties of the estimate $\tilde{\theta}$ which is defined by maximization of the random process $L(\theta)$. Let

$\tilde{\theta} = \operatorname{argmax}_{\theta \in \Theta} L(\theta).$

We also define $L(\tilde{\theta}) = \sup_{\theta \in \Theta} L(\theta)$. The aim of our study is to bound the value of the quasi maximum likelihood $L(\tilde{\theta})$. The basic assumption imposed on the process $L(\theta)$ is that the difference $L(\theta) - L(\theta^{*})$ has bounded exponential moments for every $\theta \in \Theta$. Our primary goal is to bound the supremum of such differences, or more precisely, to establish an exponential bound for the value $\sup_{\theta \in \Theta} \{ L(\theta) - L(\theta^{*}) \}$. The standard approach of empirical process theory is to consider separately the mean and the centered stochastic deviations of the process $L(\theta)$. Here a slightly different standardization of the process is used. Assume that the exponential moment $\mathbb{E} \exp\{ \mu ( L(\theta) - L(\theta^{*}) ) \}$ is finite for all $\theta \in \Theta$. This enables us to define for each $\theta$ the rate function $\mathfrak{M}(\mu, \theta)$ which ensures the identity

$\mathbb{E} \exp\bigl\{ \mu \bigl( L(\theta) - L(\theta^{*}) \bigr) + \mathfrak{M}(\mu, \theta) \bigr\} = 1.$

This means that the process $\mu ( L(\theta) - L(\theta^{*}) ) + \mathfrak{M}(\mu, \theta)$ is pointwise stochastically bounded in a rather strict sense. We aim at establishing a similar bound for the maximum of this expression over $\theta \in \Theta$. It turns out that some payment for taking the maximum is necessary. Namely, we present a penalty function $\operatorname{pen}(\theta)$ which ensures that the maximum of the penalized expression is bounded with exponential moments. Then we show that this fundamental fact yields a number of straightforward corollaries about the quality of estimation.
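
For the Gaussian shift model the rate function is available in closed form, which makes the pointwise identity easy to verify numerically. The sketch below assumes the reading $\mathfrak{M}(\mu, \theta) = -\log \mathbb{E} \exp\{ \mu ( L(\theta) - L(\theta^{*}) ) \}$ introduced above; the model and the numbers are our illustration.

```python
import numpy as np

# Gaussian shift with a single observation Y ~ N(theta*, 1):
#   L(theta) - L(theta*) = d * xi - d**2 / 2,  d = theta - theta*,  xi ~ N(0, 1).
# A direct computation gives M(mu, theta) = mu * (1 - mu) * d**2 / 2, so that
#   E exp{ mu * (L(theta) - L(theta*)) + M(mu, theta) } = 1   for mu in (0, 1).
rng = np.random.default_rng(1)
xi = rng.standard_normal(1_000_000)
mu, d = 0.5, 1.7
diff = d * xi - d ** 2 / 2                # L(theta) - L(theta*)
M = mu * (1 - mu) * d ** 2 / 2            # closed-form rate function
print(np.exp(mu * diff + M).mean())       # Monte Carlo value is close to 1
```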

The paper is organized as follows. The next section presents the main result, which describes an exponential upper bound for the (quasi) maximum likelihood. Section 2.2 discusses some implications of this exponential bound for statistical inference. In particular, we present a general likelihood-based construction of confidence sets and establish an exponential bound for the coverage probability. We also show that the considered estimate concentrates on the level sets of the rate function. Under some standard conditions we show that such concentration sets become the usual root-n neighborhoods of the target $\theta^{*}$. Sections 3 and 4 illustrate the obtained general results for two quite popular statistical models: generalized linear and single-index models. These models are very well studied; the existing results claim asymptotic normality and efficiency of the maximum likelihood estimate as the sample size grows to infinity. On the contrary, our study focuses on nonasymptotic deviation bounds and concentration properties of this estimate. The main result giving an exponential bound for the maximum likelihood is based on general results for the maximum of a random field described in Section 5.

2 Exponential bound for the maximum likelihood

This section presents a general exponential bound on the (quasi) maximum likelihood value in a quite general set-up. The main result concerns the value of the maximum $L(\tilde{\theta})$ rather than the point of maximum $\tilde{\theta}$. Namely, we aim at establishing some exponential bounds on the supremum in $\theta$ of the random field

$L(\theta) - L(\theta^{*}), \qquad \theta \in \Theta.$

In this paper we do not specify the structure of the process $L(\theta)$. The basic assumption we impose on the considered model is that $L(\theta)$ is absolutely continuous in $\theta$ and that $L(\theta)$ and its gradient w.r.t. $\theta$ have bounded exponential moments.

The rate function $\mathfrak{M}(\mu, \theta)$ is finite for all $\theta \in \Theta$:

$\mathfrak{M}(\mu, \theta) = -\log \mathbb{E} \exp\bigl\{ \mu \bigl( L(\theta) - L(\theta^{*}) \bigr) \bigr\} < \infty.$

Note that this condition is automatically fulfilled in the parametric case, when $\mathbb{P} = \mathbb{P}_{\theta^{*}}$ and $L(\theta) - L(\theta^{*})$ is the log-likelihood ratio $\log \frac{d\mathbb{P}_{\theta}}{d\mathbb{P}_{\theta^{*}}}(Y)$, provided that all the measures $\mathbb{P}_{\theta}$ are absolutely continuous w.r.t. $\mathbb{P}_{\theta^{*}}$. With $\mu = 1$, it holds $\mathbb{E} \exp\{ L(\theta) - L(\theta^{*}) \} = 1$, that is, $\mathfrak{M}(1, \theta) = 0$. For $\mu < 1$, $\mathfrak{M}(\mu, \theta) \ge 0$ by the Jensen inequality.
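
The following small Monte Carlo check (ours, for the Gaussian shift) illustrates both statements: with $\mu = 1$ the likelihood ratio integrates to one, while for $\mu < 1$ the Jensen inequality pushes the exponential moment below one, so the rate function is nonnegative.

```python
import numpy as np

# Correctly specified Gaussian shift: Y ~ N(theta*, 1) and
#   log dP_theta/dP_theta*(Y) = (theta - theta*) * Y - (theta**2 - theta_star**2) / 2.
rng = np.random.default_rng(2)
theta_star, theta = 0.3, 1.1
Y = theta_star + rng.standard_normal(1_000_000)
Z = (theta - theta_star) * Y - (theta ** 2 - theta_star ** 2) / 2

print(np.exp(Z).mean())        # close to 1:  E exp{L(theta) - L(theta*)} = 1
print(np.exp(0.5 * Z).mean())  # below 1:     Jensen inequality for mu = 0.5
```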

The main observation behind this condition is that, for every fixed $\theta$, the exponential Chebyshev inequality yields

$\mathbb{P}\bigl( \mu \bigl( L(\theta) - L(\theta^{*}) \bigr) + \mathfrak{M}(\mu, \theta) > z \bigr) \le e^{-z}, \qquad z \ge 0.$

Our main goal is to get a similar bound for the maximum of the random field over $\Theta$. Below in Section 2.2 we show that such a bound implies an exponential bound for the coverage probability of a confidence set and that the estimate $\tilde{\theta}$ concentrates on a set $A$ in the sense that the probability of the event $\{ \tilde{\theta} \notin A \}$ is exponentially small.

Unfortunately, in some situations, the exponential moment of the maximum of $L(\theta) - L(\theta^{*})$ can be unbounded. We present a simple example of this sort.

Example 2.1.

Consider a Gaussian shift model $Y \sim \mathcal{N}(\theta, 1)$ with only one observation and suppose that the true parameter is $\theta^{*} = 0$. Then the log-likelihood ratio reads as $L(\theta) - L(0) = \theta Y - \theta^{2}/2$, and it holds $\sup_{\theta} \{ L(\theta) - L(0) \} = Y^{2}/2$ and $\mathbb{E} \exp\{ Y^{2}/2 \} = \infty$.
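
The blow-up is easy to reproduce numerically: the Monte Carlo average of $\exp\{Y^{2}/2\}$ never stabilizes because the integrand is not integrable. A minimal sketch (ours):

```python
import numpy as np

# For xi ~ N(0, 1), E exp{xi**2 / 2} = integral of (2*pi)**(-1/2) dx = infinity,
# so the un-penalized maximum sup_theta {L(theta) - L(0)} = xi**2 / 2 has no
# finite exponential moment.  The sample average keeps growing with the sample
# size, being dominated by the largest observation so far.
rng = np.random.default_rng(3)
for n in [10 ** 3, 10 ** 5, 10 ** 7]:
    xi = rng.standard_normal(n)
    print(n, np.exp(xi ** 2 / 2).mean())   # grows (roughly logarithmically) in n
```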

We therefore consider the penalized expression $\mu ( L(\theta) - L(\theta^{*}) ) + \mathfrak{M}(\mu, \theta) - \operatorname{pen}(\theta)$, where the penalty function $\operatorname{pen}(\theta)$ should provide some bounded exponential moments for its supremum over $\theta \in \Theta$.

To bound the local fluctuations of the process $L(\theta)$, we introduce an exponential moment condition on its stochastic component $\zeta(\theta) = L(\theta) - \mathbb{E} L(\theta)$:

Suppose also that the random function $\zeta(\theta)$ is differentiable in $\theta$ and that its gradient $\nabla \zeta(\theta)$ fulfills the following condition:

There exist a continuous symmetric matrix function $V(\theta)$, $\theta \in \Theta$, and a fixed constant such that for all $\theta \in \Theta$

(2.1)

Define for every , and

Next, introduce for every the local vicinity such that for all .

Let also the matrix function $V(\theta)$ from this condition satisfy the following regularity condition:

There exist constants and such that

Now we are prepared to state the main result, which gives a sufficient condition on the penalty function $\operatorname{pen}(\theta)$ ensuring the desired penalized exponential bound. It is a specification of a more general result from Theorem 5.5 in Section 5.

Here and in what follows, the volume and the entropy number of the unit ball in $\mathbb{R}^{p}$ enter the bounds as fixed dimensional constants.

Theorem 2.1.

Suppose that the conditions above are fulfilled with some $\mu > 0$ and a matrix function $V(\theta)$ satisfying the stated regularity condition. If the penalty function $\operatorname{pen}(\theta)$ fulfills

(2.2)

with the constants defined below, then

(2.3)

where

(2.4)

2.1 Penalty via the norm

The choice of the penalty function $\operatorname{pen}(\theta)$ can be made more precise if $V(\theta) \equiv V$ for a fixed matrix $V$ and all $\theta \in \Theta$. This section describes how the penalty function can be defined in terms of the norm $\| V^{1/2} ( \theta - \theta_{0} ) \|$ for a fixed reference point $\theta_{0}$.

Theorem 2.2.

Let the conditions of Theorem 2.1 be fulfilled and, in addition, let $V(\theta) \le V$ for some fixed matrix $V$ and all $\theta \in \Theta$. Let the remaining constants be fixed to ensure the conditions of Section 2. Suppose that $\nu(\cdot)$ is a monotonously decreasing positive function on $[0, \infty)$ satisfying

(2.5)

Define

(2.6)

Then the assertion (2.3) holds with

Proof.

This result is a straightforward corollary of Theorem 2.1 applied with this choice of the penalty function; the required condition is then fulfilled. ∎

Here we mention two natural ways of defining the penalty function $\operatorname{pen}(\theta)$: quadratic or logarithmic in the norm $\| V^{1/2} ( \theta - \theta_{0} ) \|$. The functions and the corresponding values are:

(2.7)

where the involved quantities are fixed constants. The corresponding penalties read as:

2.2 Some corollaries

Theorem 2.1 claims that the penalized expression $\mu ( L(\theta) - L(\theta^{*}) ) + \mathfrak{M}(\mu, \theta) - \operatorname{pen}(\theta)$ is stochastically bounded uniformly in $\theta \in \Theta$. In particular, one can plug the estimate $\tilde{\theta}$ in place of $\theta$:

(2.8)

Below we present some corollaries of this result.

2.2.1 Concentration properties of the estimator

Define for every subset $A$ of the parameter set $\Theta$ the value $z(A)$ by

(2.9)

The next result shows that the estimator $\tilde{\theta}$ deviates from the set $A$ with an exponentially small probability of the order $e^{-z(A)}$.

Corollary 2.3.

Suppose (2.8). Then for any set $A \subseteq \Theta$

Proof.

If , then . As , it follows

as required. ∎

Two particular choices of the set $A$ can be mentioned:

For the set , Corollary 2.3 yields

For the set , define additionally as the minimal value for which

or, equivalently,

(2.10)
Corollary 2.4.

Suppose (2.8). Then for any $z > 0$

In typical situations the rate function is nearly proportional to the sample size and nearly quadratic in $\theta - \theta^{*}$, so that for a fixed $z$ the corresponding concentration set is a root-$n$ neighborhood of the point $\theta^{*}$. See Section 2.4 below for a precise formulation.
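
A simple simulation (our illustration) for the Gaussian shift with $n$ i.i.d. observations makes this explicit: the rate function is proportional to $n (\theta - \theta^{*})^{2}$, the level set $\{ \theta : n (\theta - \theta^{*})^{2} / 2 \le z \}$ is a root-$n$ interval around $\theta^{*}$, and the deviation probability decays exponentially in $z$.

```python
import numpy as np

# Gaussian shift, n i.i.d. observations: the quasi-MLE is the sample mean and
# M(theta) = n * (theta - theta*)**2 / 2 up to constants.  We estimate the
# deviation probability P(M(tilde_theta) > z) and compare it with e^{-z};
# here P(M > z) = P(chi2_1 > 2z) <= e^{-z}, so the bound is conservative.
rng = np.random.default_rng(4)
n, reps, theta_star = 50, 200_000, 1.0
theta_tilde = rng.normal(theta_star, 1.0, size=(reps, n)).mean(axis=1)
M = n * (theta_tilde - theta_star) ** 2 / 2
for z in [1.0, 2.0, 4.0]:
    print(z, (M > z).mean(), np.exp(-z))   # empirical tail stays below e^{-z}
```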

2.2.2 Confidence sets based on $L(\tilde{\theta})$

Next we discuss how the exponential bound can be used for establishing some risk bounds and for constructing confidence sets for the target $\theta^{*}$ based on the maximized value $L(\tilde{\theta})$. The inequality (2.8) claims that $L(\tilde{\theta}) - L(\theta^{*})$ is stochastically bounded with finite exponential moments. This implies boundedness of the polynomial moments.

Define

(2.11)
Corollary 2.5.

Suppose (2.8) and let the constant from (2.11) be finite. Then

Proof.

Observe that

as required. ∎

For the same reasons, one can construct confidence sets based on the (quasi) likelihood process. Define

The exponential bound ensures that the target $\theta^{*}$ belongs to this set with a high probability, provided that $z$ is large enough. The next result claims that the probability that this set does not cover the target decreases exponentially in $z$.

Corollary 2.6.

Suppose (2.8). For any $z > 0$

Proof.

The bound (2.8) implies for the event

as required. ∎

2.3 Identifiability condition

Until this point no identifiability condition on the model has been used; that is, the presented results apply even for a very poor parametrization. Actually, a particular parametrization of the parameter set plays no role as long as only the value of the maximum $L(\tilde{\theta})$ is considered. If we want to derive any quantitative result on the point of maximum $\tilde{\theta}$, then the parametrization matters and an identifiability condition is really necessary. Here we follow the usual path by applying a quadratic lower bound for the rate function in a vicinity of the point $\theta^{*}$. Suppose that the rate function $\mathfrak{M}(\mu, \theta)$ is two times continuously differentiable in $\theta$. Obviously $\mathfrak{M}(\mu, \theta^{*}) = 0$, and simple algebra yields for the gradient:

$\nabla_{\theta} \mathfrak{M}(\mu, \theta^{*}) = 0,$

because $\theta^{*}$ is the point of maximum of $\mathbb{E} L(\theta)$. The Taylor expansion of the second order in a vicinity of $\theta^{*}$ yields for all $\theta$ close to $\theta^{*}$ the following approximation:

$\mathfrak{M}(\mu, \theta) \approx (\theta - \theta^{*})^{\top} \mathbb{F} (\theta - \theta^{*}) / 2$

with the matrix $\mathbb{F} = \nabla^{2}_{\theta} \mathfrak{M}(\mu, \theta^{*})$. So, one can expect that the rate function is nearly quadratic in $\theta - \theta^{*}$ in a neighborhood of the point $\theta^{*}$.
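
A quick numerical check (ours; it uses the excess of the expected log-likelihood as a convenient stand-in sharing the same local quadratic behavior as the rate function) confirms this approximation for a Poisson observation with the canonical working parametrization.

```python
import numpy as np

# Poisson observation Y with mean lam, canonical working model
# L(theta) = Y * theta - exp(theta).  Then E L(theta) = lam * theta - exp(theta)
# is maximized at theta* = log(lam), and the excess
#   D(theta) = E L(theta*) - E L(theta) = lam * (exp(d) - 1 - d),  d = theta - theta*,
# behaves like F * d**2 / 2 with curvature F = exp(theta*) = lam near theta*.
lam = 3.0
theta_star = np.log(lam)

def D(theta):
    """Excess expected log-likelihood E L(theta*) - E L(theta)."""
    EL = lambda t: lam * t - np.exp(t)
    return EL(theta_star) - EL(theta)

for d in [0.5, 0.1, 0.02]:
    print(d, D(theta_star + d), lam * d ** 2 / 2)   # ratio tends to 1 as d -> 0
```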

Corollary 2.7.

Let (2.8) hold. Suppose that for some positive symmetric matrix $\mathbb{F}$, the rate function fulfills

(2.12)

Then for any $z > 0$

Proof.

It is obvious that

and the result follows from Corollary 2.4. ∎

In the next theorem we assume the lower bound (2.12) to be fulfilled on the whole parameter set $\Theta$. The general case can be reduced to this one by using once again the concentration property of Corollary 2.4.

Theorem 2.8.

Suppose the conditions of Section 2 hold with $V(\theta) \le V$ for a fixed matrix $V$. Let also, for some fixed constant,

(2.13)

Fix some constant and define the penalty function by

(2.14)

Then, with the above choice of the penalty, it holds

(2.15)

for some fixed constant. In addition, the value from (2.10) fulfills the corresponding bound for all $z$, yielding for any $z > 0$ the concentration property and confidence bound:

Proof.

We apply Theorem 2.2 with

leading, for the choices above, to the formula (2.14) for the penalty function. By simple algebra

cf. the bound (2.7). This implies the bound (2.15) because the remaining terms are bounded by some fixed constants.

The inequality (2.13) ensures the required lower bound on the rate function. Finally, the concentration and coverage bounds follow from Corollaries 2.4 and 2.6. ∎

Remark 2.1.

If the quadratic lower bound (2.13) is only fulfilled for $\theta$ from an elliptic neighborhood of the point $\theta^{*}$ with a sufficiently large radius, then it is reasonable to redefine the penalty function using the hybrid proposal:

Then the bound (2.15) still applies with an obvious correction of the constants. However, the values from (2.10) entering our risk bounds have to be corrected depending on the behavior of the rate function outside this neighborhood.

2.4 Discussion

This section collects some comments about the presented exponential bound.

Bounds for polynomial loss

Our concentration result is stated in terms of the rate function $\mathfrak{M}(\mu, \theta)$. Note that the bounds (2.15) and (2.13) imply the usual result for the quadratic loss $\| \theta - \theta^{*} \|^{2}$:

Note, however, that the result (2.15) in terms of the rate function is more accurate because the lower bound (2.13) can be very rough. The bound (2.13) is only used to evaluate the constants in the exponential risk bound. Moreover, the leading term in the risk bound does not depend on the constants entering these rough bounds.

Coverage probability and risk bounds

The result of Corollary 2.5 justifies the use of the likelihood-based confidence set. However, the bound for the coverage probability given by this result is quite rough and cannot be used for practical purposes. One has to apply one or another resampling scheme to fix a proper value of $z$ providing the prescribed coverage probability.

The same remark applies to the result of Corollary 2.7. All these bounds are deduced from rather rough exponential inequalities, and the constants shown there are not optimal. However, the concentration property enables us to apply the classical one-step improvement technique to build a new estimate which achieves the asymptotic efficiency bound.
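
For concreteness, here is a generic sketch of the classical one-step improvement (a textbook construction, not the paper's): starting from any estimate that already lies in a root-$n$ neighborhood of $\theta^{*}$, a single Newton step on the log-likelihood reproduces the behavior of the MLE. In the Gaussian shift example the log-likelihood is exactly quadratic, so one step from the sample median lands exactly at the sample mean.

```python
import numpy as np

# One-step improvement: theta_hat = theta_0 + F^{-1} * grad L(theta_0), where
# theta_0 is a preliminary root-n consistent estimate and F is the Fisher
# information of the sample.  Gaussian shift with unit variance for simplicity.
rng = np.random.default_rng(5)
n = 400
Y = rng.normal(1.0, 1.0, size=n)

theta0 = np.median(Y)                # preliminary estimate: concentrated but inefficient
grad_L = np.sum(Y - theta0)          # score of the Gaussian log-likelihood at theta0
F = float(n)                         # Fisher information of n observations
theta_hat = theta0 + grad_L / F      # one Newton step
print(theta0, theta_hat, Y.mean())   # theta_hat coincides with the MLE here
```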

Root-n consistency

Suppose that there exists a constant $n$ (usually this constant means the sample size) such that the functions

are continuous and bounded on every compact set by constants which only depend on this set. In addition, we assume, similarly to (2.12), that for some fixed symmetric positive matrix $\mathbb{F}$ it holds in a vicinity of the point $\theta^{*}$:

(2.16)

Then the elliptic set $\{ \theta : n (\theta - \theta^{*})^{\top} \mathbb{F} (\theta - \theta^{*}) / 2 \le z \}$ is a root-$n$ neighborhood of the point $\theta^{*}$. By Theorem 2.8 the estimate $\tilde{\theta}$ deviates from this neighborhood with a probability which decreases exponentially in $z$:

Local approximation

The standard asymptotic theory of parameter estimation heavily uses the idea of local approximation: the considered (quasi) log-likelihood is approximated by the log-likelihood of another, simpler model in a vicinity of the true point, yielding the local asymptotic equivalence of the original and the approximating models. The local asymptotic normality (LAN) condition is the most popular example of this approach; see Ibragimov and Khas’minskij (1981), Ch. 2, for more details. A combination of this idea with the concentration property of Corollary 2.4 can be used to derive sharp asymptotic risk bounds for the estimate $\tilde{\theta}$; see again Ibragimov and Khas’minskij (1981), Ch. 3. Similarly, one can derive nonasymptotic risk bounds in the framework of this paper. However, a precise formulation of the related results is to be given elsewhere.

Large and moderate deviation

The obtained results can be used to derive large and moderate deviation bounds for the estimate $\tilde{\theta}$; cf. Jensen and Wood (1998), Sieders and Dzhaparidze (1987). In particular, the deviation result from Corollary 2.4 can be used to study the efficiency of the estimate in the Bahadur sense; see e.g. Arcones (2006) and references therein.

3 Estimation in a generalized linear model

In this section we illustrate the general results of Sections 2 and 2.2 on the problem of estimating the parameter vector $\theta$ in the so-called generalized linear model. Let $(P_{\upsilon})$ be an exponential family with canonical parametrization (EFC), which means that the corresponding log-likelihood function can be written in the form

$\ell(y, \upsilon) = y \upsilon - d(\upsilon) + r(y),$

where $d(\upsilon)$ is a given convex function; see Green and Silverman (1994). The term $r(y)$ is unimportant, and it cancels in the log-likelihood ratio.

Let $Y = (Y_{1}, \ldots, Y_{n})$ be an observed sample. The generalized linear assumption means that the $Y_{i}$'s are independent, the distribution of every $Y_{i}$ belongs to the family $(P_{\upsilon})$, and the corresponding parameter $\upsilon_{i}$ linearly depends on a given feature vector $\Psi_{i} \in \mathbb{R}^{p}$:

$\upsilon_{i} = \Psi_{i}^{\top} \theta^{*}, \qquad i = 1, \ldots, n.$

(3.1)

To be more specific, we consider deterministic explanatory variables $\Psi_{i}$. The case of a random design can be treated in the same way.

The parametric assumption (3.1) leads to the log-likelihood $L(\theta)$:

$L(\theta) = \sum_{i=1}^{n} \bigl\{ Y_{i} \Psi_{i}^{\top} \theta - d( \Psi_{i}^{\top} \theta ) \bigr\}.$

(3.2)

Asymptotic properties of the MLE are well studied. We refer to Fahrmeir and Kaufmann (1985), Lang (1996), Chen et al. (1999) and the book by McCullagh and Nelder (1989) for further references. The results claim asymptotic consistency, normality and efficiency of the estimate $\tilde{\theta}$.

Our approach is a bit different because we do not assume that the underlying model follows (3.1). The observations $Y_{i}$ are assumed to be independent; otherwise no particular structure is required. In particular, the distribution of every $Y_{i}$ does not necessarily belong to the family $(P_{\upsilon})$. The considered problem is that of the best parametric approximation of the data distribution by a GLM of the form (3.1).

Example 3.1.

[Mean regression] The least squares estimate $\tilde{\theta}$ in the classical mean regression minimizes the sum of squared residuals:

$\tilde{\theta} = \operatorname{argmin}_{\theta} \sum_{i=1}^{n} ( Y_{i} - \Psi_{i}^{\top} \theta )^{2}.$

This estimate can be viewed as the quasi MLE corresponding to Gaussian homogeneous errors. However, many of its properties continue to hold even if the errors are not i.i.d. Gaussian. All we need is the existence of exponential moments of the errors.
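
A short simulation (ours; the design and the parameter values are arbitrary) illustrates this robustness: the errors below are centered exponential rather than Gaussian, yet the least squares quasi-MLE still recovers the linear fit.

```python
import numpy as np

# Least squares as the quasi-MLE for the Gaussian working model.  The true
# errors are centered exponential, so the working model is misspecified, but
# the errors have exponential moments and the quasi-MLE behaves as expected.
rng = np.random.default_rng(6)
n, theta_true = 500, np.array([1.0, -2.0])
Psi = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])   # feature vectors
errors = rng.exponential(1.0, n) - 1.0                       # non-Gaussian, mean zero
Y = Psi @ theta_true + errors

theta_tilde, *_ = np.linalg.lstsq(Psi, Y, rcond=None)        # LSE = quasi-MLE
print(theta_tilde)                                           # close to theta_true
```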

Example 3.2.

[Poisson regression] Let $Y_{1}, \ldots, Y_{n}$ be some nonnegative integers observed at “locations” $X_{1}, \ldots, X_{n}$. Such data often appear in digital imaging, positron emission tomography, queueing and traffic theory, and many other fields. A natural way of modeling such data is to assume that every $Y_{i}$ is Poissonian with a parameter which depends on the location $X_{i}$ through the feature vector $\Psi_{i}$. A generalized linear model assumes that the canonical parameter of the underlying Poisson distribution of $Y_{i}$ linearly depends on the vector $\Psi_{i}$, leading to the (quasi) MLE

$\tilde{\theta} = \operatorname{argmax}_{\theta} \sum_{i=1}^{n} \bigl\{ Y_{i} \Psi_{i}^{\top} \theta - \exp( \Psi_{i}^{\top} \theta ) \bigr\}.$

For our further analysis we only require that every $Y_{i}$ has a bounded exponential moment; see the condition (3.5) below for a precise formulation.
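
The following sketch (ours; the design and the true parameter are arbitrary choices) computes the Poisson quasi-MLE by direct numerical maximization of the working log-likelihood $\sum_{i} \{ Y_{i} \Psi_{i}^{\top} \theta - \exp( \Psi_{i}^{\top} \theta ) \}$.

```python
import numpy as np
from scipy.optimize import minimize

# Poisson regression as a (quasi) MLE in the GLM with canonical link: the
# working log-likelihood (up to a theta-free term) is concave in theta and
# is maximized here by a generic numerical optimizer.
rng = np.random.default_rng(7)
n, theta_true = 1000, np.array([0.5, 1.0])
Psi = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
Y = rng.poisson(np.exp(Psi @ theta_true))

def neg_L(theta):
    eta = Psi @ theta
    return -np.sum(Y * eta - np.exp(eta))   # negative working log-likelihood

res = minimize(neg_L, x0=np.zeros(2), method="BFGS")
print(res.x)                                # close to theta_true
```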

In the general situation, for some scaling factor $\mu > 0$ which will be fixed later, define

The target of estimation $\theta^{*}$ maximizes $\mathbb{E} L(\theta)$: