
AIC for the Non-concave Penalized Likelihood Method

Abstract

Non-concave penalized maximum likelihood methods, such as the Bridge, the SCAD, and the MCP, are widely used because they not only perform the parameter estimation and variable selection simultaneously but also are more efficient than the Lasso. They include a tuning parameter which controls a penalty level, and several information criteria have been developed for selecting it. While these criteria assure the model selection consistency, they have a problem in that there are no appropriate rules for choosing one from the class of information criteria satisfying such a preferred asymptotic property. In this paper, we derive an information criterion based on the original definition of the AIC by considering minimization of the prediction error rather than model selection consistency. Concretely speaking, we derive a function of the score statistic that is asymptotically equivalent to the non-concave penalized maximum likelihood estimator and then provide an estimator of the Kullback-Leibler divergence between the true distribution and the estimated distribution based on the function, whose bias converges in mean to zero. Furthermore, through simulation studies, we find that the performance of the proposed information criterion is about the same as or even better than that of the cross-validation.

KEY WORDS: information criterion; Kullback-Leibler divergence; regularization; statistical asymptotic theory; tuning parameter; variable selection.

1 Introduction

The Lasso (Tibshirani 1996) is a regularization method that imposes an $\ell_1$ penalty term on an estimating function with respect to an unknown parameter vector, where a tuning parameter $\lambda$ controls the penalty level. The Lasso can simultaneously perform estimation and variable selection by exploiting the non-differentiability of the penalty term at the origin. Concretely speaking, several components of the Lasso estimator shrink to exactly 0 when $\lambda$ is not close to 0. However, parameter estimation based on the Lasso is not necessarily efficient, because the Lasso shrinks the estimator toward the zero vector too much. To avoid this problem, it has been proposed to use penalty terms that do not shrink estimators with large values. Typical examples of such regularization methods are the Bridge (Frank and Friedman 1993), the smoothly clipped absolute deviation (SCAD; Fan and Li 2001), and the minimax concave penalty (MCP; Zhang 2010). Whereas the Bridge uses an $\ell_q$ penalty term ($0 < q \le 1$), the SCAD and the MCP use penalty terms that can be approximated by an $\ell_1$ penalty term in a neighborhood of the origin, which we call an $\ell_q$ type. Although it is difficult to obtain a global optimum because these penalties are non-convex, there are several algorithms, such as the coordinate descent method and the gradient descent method, that assure convergence to a local optimal solution.

On the other hand, in the above regularization methods, we have to choose a proper value of the tuning parameter $\lambda$, and this is an important task for appropriate model selection. One of the simplest ways of selecting $\lambda$ is cross-validation (CV; Stone 1974). The stability selection method (Meinshausen and Bühlmann 2010), which is based on subsampling in order to avoid the problems caused by selecting a model from only one value of $\lambda$, would also be attractive, but it carries a considerable computational cost, as does CV. Recently, information criteria without such a problem have been developed (Yuan and Lin 2007; Wang et al. 2007, 2009; Zhang et al. 2010; Fan and Tang 2013). Letting $\ell_n(\cdot)$ be the log-likelihood function and $\hat{\beta}$ be the estimator obtained by the above regularization methods, these information criteria combine $-2\ell_n(\hat{\beta})$ with a model-complexity penalty whose coefficient $C_n$ depends on at least the sample size $n$. Accordingly, model selection consistency is at least assured for some such sequences $C_n$. For example, the information criterion with $C_n = \log n$ is proposed as the BIC. This approach includes results for the case in which the dimension of the parameter vector goes to infinity, and hence, it is considered to be significant. However, the choice of $C_n$ remains somewhat arbitrary. That is, there is a class of sequences $C_n$ assuring a preferred asymptotic property such as model selection consistency, but there are no appropriate rules for choosing one from the class. For example, since the BIC described above is not derived from the Bayes factor, there is no reason to use $\log n$ rather than, say, $2 \log n$. This is a severe problem because data analysts can choose $C_n$ arbitrarily and carry out model selection as they wish.
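Criteria of this family can be compared numerically; the snippet below is a minimal sketch in which the generic form $-2\ell_n(\hat{\beta}) + C_n \times (\text{number of nonzero coefficients})$, the function names, and the toy log-likelihood values are our illustrative assumptions, not the paper's notation.

```python
import numpy as np

def gic(loglik, n_nonzero, C_n):
    """Generic information criterion: -2 * (maximized log-likelihood)
    plus C_n times the number of active (nonzero) coefficients."""
    return -2.0 * loglik + C_n * n_nonzero

# Toy path of fits: (lambda, maximized log-likelihood, nonzero count).
n = 100
path = [(0.01, -119.0, 8), (0.10, -123.0, 5), (1.00, -140.0, 2)]

# A BIC-like choice uses C_n = log n; an AIC-like choice uses C_n = 2.
for C_n, name in [(np.log(n), "BIC-like"), (2.0, "AIC-like")]:
    best = min(path, key=lambda fit: gic(fit[1], fit[2], C_n))
    print(name, "selects lambda =", best[0])
```

In this toy example the heavier penalty $C_n = \log n$ picks a sparser fit than $C_n = 2$; since many diverging choices of $C_n$ yield consistency, the selected model depends on this arbitrary choice, which is exactly the problem discussed above.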

Information criteria without such an arbitrariness problem have been proposed by Efron et al. (2004) and Zou et al. (2007) for Gaussian linear regression and by Ninomiya and Kawano (2014) for generalized linear regression. Concretely speaking, on the basis of the original definition of the $C_p$ or the AIC, they derive an unbiased estimator of the mean squared error or an asymptotically unbiased estimator of a Kullback-Leibler divergence. However, these criteria are basically only for the Lasso. In addition, the asymptotic setting used in Ninomiya and Kawano (2014) does not assure even estimation consistency.

Our goal in this paper is to derive an information criterion based on the original definition of the AIC in an asymptotic setting that assures estimation consistency for regularization methods using non-concave penalties, including the Bridge, the SCAD, and the MCP. To achieve this, we slightly extend the results presented in Hjort and Pollard (1993) to derive an asymptotic property of the estimator. Then, for the Kullback-Leibler divergence, we construct an asymptotically unbiased estimator by evaluating the asymptotic bias between the divergence and the log-likelihood into which the estimator is plugged. Moreover, we verify that this evaluation is the asymptotic bias in the strict sense; that is, the bias converges in mean to the evaluation. This sort of verification has usually been ignored in the literature (see, e.g., Konishi and Kitagawa 2008).

The rest of the paper is organized as follows. Section 2 introduces the generalized linear model and the regularization method and describes the assumptions for our asymptotic theory. In Section 3, we discuss the asymptotic property of the estimator obtained from the regularization method, and in Section 4, we use it to evaluate the asymptotic bias, which is needed to derive the AIC. In Section 5, we discuss the moment convergence of the estimator to show that the bias converges in mean to our evaluation. Section 6 presents the results of simulation studies showing the validity of the proposed information criterion for several models, and Section 7 gives concluding remarks and mentions future work. The proofs are relegated to the appendices.

2 Setting and assumptions for asymptotics

Let us consider a natural exponential family with a natural parameter in for an -dimensional random variable , whose density is

with respect to a -finite measure. We assume that is the natural parameter space; that is, in satisfies . Accordingly, all the derivatives of and all the moments of exist in the interior of , and, in particular, and . For a function , we denote and by and , respectively. We also assume that is positive definite, and hence, is a strictly convex function with respect to .
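Since the mathematical symbols are elided in this copy, the following is a standard rendering of such a family in our own notation (natural parameter $\theta$, cumulant function $\psi$); it is a sketch of the usual display, not the paper's exact one.

```latex
% Natural exponential family with respect to a sigma-finite measure;
% \psi is the cumulant (log-normalizing) function.
f(y;\theta) = \exp\{\theta^{\top} y - \psi(\theta)\}.
% Differentiating \psi gives the moments, so strict convexity of \psi
% corresponds to a positive-definite covariance matrix:
\mathrm{E}[y] = \psi'(\theta), \qquad \mathrm{Var}(y) = \psi''(\theta).
```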

Let be the -th set of responses and regressors ; we assume that are independent -dimensional random vectors and in are -matrices of known constants. We will consider generalized linear models with natural link functions for such data (see McCullagh and Nelder 1989); that is, we will consider a class of density functions for ; thus, the log-likelihood function of is given by

where is a -dimensional coefficient vector and is an open convex set. To develop an asymptotic theory for this model, we assume two conditions about the behavior of , as follows:

  • (C1) is a compact set with for all and .

  • (C2) There exists an invariant distribution on . In particular, converges to a positive-definite matrix .

In the above setting, we can prove the following lemma.

Lemma 1.

Let be the true value of . Then, under conditions (C1) and (C2), we obtain the following:

  • There exists a convex and differentiable function such that for each .

  • converges to .

  • .

See Ninomiya and Kawano (2014) for the proof. Note that we can explicitly write

(1)

since we assume (C2), and hence, we can prove its convexity and differentiability without using the techniques of convex analysis (Rockafellar 1970).

Let us consider a non-concave penalized maximum likelihood estimator,

(2)

where $\lambda$ is a tuning parameter and the penalty term, which is not necessarily convex, is a function of the parameter. We assume that the penalty satisfies the following conditions; hereafter, we call it an $\ell_q$ type:

  • (C3) is not differentiable only at the origin, symmetric with respect to , and monotone non-decreasing with respect to .

  • (C4) .

Such penalty terms for the Bridge, the SCAD, and the MCP are, for $\theta \ge 0$,

$p(\theta) = \lambda \theta^{q}$,

$p(\theta) = \lambda\theta$ for $\theta \le \lambda$, $\{2a\lambda\theta - \theta^{2} - \lambda^{2}\}/\{2(a-1)\}$ for $\lambda < \theta \le a\lambda$, and $(a+1)\lambda^{2}/2$ for $\theta > a\lambda$,

and

$p(\theta) = \lambda\theta - \theta^{2}/(2\gamma)$ for $\theta \le \gamma\lambda$ and $\gamma\lambda^{2}/2$ for $\theta > \gamma\lambda$,

where $a > 2$ and $\gamma > 1$. The Bridge penalty is the Lasso penalty itself when $q = 1$, and it has the property that the derivative at the origin diverges when $0 < q < 1$. For the SCAD and MCP penalties, condition (C4) on the behavior in the neighborhood of the origin is satisfied by setting $q = 1$, just as for the Lasso penalty. Thus, it is easy to imagine that a lot of penalties satisfy these conditions. Note that by using such penalties, several components of the estimator tend to be exactly 0 because of the non-differentiability at the origin. Also note that the penalty is assumed not to depend on the subscript of the parameter for simplicity; this is not essential. While Ninomiya and Kawano (2014) put a different weight on the penalty term, the weight we put on it in this study allows us to prove estimation consistency. Moreover, we can prove weak convergence of the estimator, although the asymptotic distribution is not normal in general.
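As a concrete sketch, the three penalties can be computed as follows; we use the standard textbook parameterizations (with the tuning parameter folded into the penalty), which may differ cosmetically from the paper's display.

```python
import numpy as np

def bridge(theta, lam, q):
    """Bridge penalty: lam * |theta|^q (the Lasso when q = 1)."""
    return lam * np.abs(theta) ** q

def scad(theta, lam, a=3.7):
    """SCAD penalty (Fan and Li 2001), a > 2: linear near the origin,
    quadratic in the middle, constant in the tails."""
    t = np.abs(theta)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                 (a + 1) * lam**2 / 2))

def mcp(theta, lam, gamma=3.0):
    """MCP (Zhang 2010), gamma > 1: linear near the origin, flat beyond
    gamma * lam."""
    t = np.abs(theta)
    return np.where(t <= gamma * lam, lam * t - t**2 / (2 * gamma),
                    gamma * lam**2 / 2)

# Near the origin all three behave like a Lasso-type penalty lam * |theta|
# (the Bridge with q = 1), which is what sets small coefficients to 0,
# while SCAD and MCP stop growing for large |theta| and so do not
# over-shrink large coefficients.
print(scad(0.05, lam=0.1), mcp(0.05, lam=0.1))
```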

3 Asymptotic behavior

3.1 Preparations

Although the objective function in (2) is no longer convex because of the non-convexity of the penalty, the consistency of the estimator can be derived by an argument similar to the one in Knight and Fu (2000). First, the following lemma holds.

Lemma 2.

The estimator in (2) is a consistent estimator of the true value under conditions (C1)–(C4).

This lemma is proved through uniform convergence of the random function,

(3)

The details are given in Section A.1. Hereafter, we will denote by so long as there is no confusion. In addition, we denote and by and , respectively. Moreover, the vector and the matrix will be denoted by and , respectively, and we will sometimes express, for example, as .

To develop the asymptotic property of the penalized maximum likelihood estimator in (2), which will be used to derive an information criterion, we need to make a small generalization of the result in Hjort and Pollard (1993), as follows:

Lemma 3.

Suppose that is a strictly convex random function that is approximated by . Let be a subvector of , and let and be continuous functions such that and converge to and uniformly over and in any compact set, respectively, and assume that is convex and . In addition, for

let and be the argmin of and , respectively, and assume that is unique and . Then, for any , and , there exists such that

(4)

where

(5)

Hjort and Pollard (1993) derived an inequality ; they assumed that is convex. Although is non-convex (hence is too), we will use the fact that converge to over . In fact, if is sufficiently large, the inequality satisfied by the convex function is approximately satisfied for ; that is, we have

(6)

in . Here, is a unit vector such that , and is in , since . Moreover, if is sufficiently small and is sufficiently large, since , we have

(7)

in . Hence, we can show that

(8)

in the same way as in Hjort and Pollard (1993), from which we obtain the above lemma. See Section A.2 for the details.

3.2 Limiting distribution

We use Lemma 3 to derive the asymptotic property of the penalized maximum likelihood estimator in (2). Because the asymptotic property depends on the value of , we will develop our argument by setting . Furthermore, we will use for the sake of simplicity.

Let us define a strictly convex random function,

(9)

and

(10)

where . By making a Taylor expansion around , can be expressed as

plus . Note that the term converges to from (C2), and the terms including reduce to . Accordingly, we see that is asymptotically equivalent to . Next, letting be and letting

(11)

and

(12)

we can see from (C3) and (C4) that and uniformly converge to a function,

(13)

over in a compact set, respectively, where . In addition, letting and , we see that the argmins of and are given by

Note that is not convex but satisfies that . Using Lemma 3 together with the above preliminaries, we find that, for any , and , there exists such that

(14)

where and are the functions defined in (5). The triangle inequality, the convexity of and the uniform convergence of and imply

(15)

Let be half the smallest eigenvalue of . Then, a simple calculation gives

(16)

From (15) and (16), by considering a sufficiently small and a sufficiently large , the first term on the right-hand side in (14) can be made arbitrarily small. In addition, we can generalize the result in Radchenko (2005) with respect to the model and the penalty term; thus, for any , we have

(17)

See Section A.3 for the proof of (17). From this, by considering a sufficiently large and a sufficiently large , the second and third terms on the right-hand side in (14) can be made arbitrarily small. Thus, we conclude that

Theorem 1.

Let and

(18)

Under conditions (C1)–(C4), we have

when , and we have

(19)

and

(20)

when .

We can obtain the result for the case of in almost the same way as in the other case (see Section A.4 for details). From Theorem 1, the estimator in (2) is shown to converge in distribution to some function of a Gaussian distributed random variable. When , we immediately see that it is 0 or the Gaussian distributed random variable itself, and this simple fact is useful for deriving an information criterion explicitly and for reducing the computational cost of model selection. On the other hand, when , we can prove weak convergence, since the convex objective function in (18) converges uniformly by the convexity lemma in Hjort and Pollard (1993).

Corollary 1.

Let be a Gaussian distributed random variable with mean and covariance matrix and

(21)

Then, under the same conditions as in Theorem 1, we have

when , and we have

when .

In the case of , we still need to solve the minimization problem in (21) to evaluate the AIC, but this is easy because the objective function is convex with respect to the argument, so we can use existing convex optimization techniques. The proximal gradient method (Rockafellar 1976; Beck and Teboulle 2009) is known to be effective for such a minimization problem, in which the objective function is the sum of a differentiable function and a non-differentiable function. However, we will use the coordinate descent method (Mazumder et al. 2011), because the objective function can be minimized explicitly for each variable. Actually, when we fix all the elements of the argument except for the -th one, the minimizer is given by

Then, for the -th step in the algorithm, we have only to update as follows:

for , and we repeat this update until converges. Note that the optimal value satisfies if and otherwise.
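As a minimal sketch of this coordinate-wise scheme, consider an objective of the form $\tfrac12 u^{\top} J u - z^{\top} u + \lambda \lVert u \rVert_1$, our illustrative stand-in for the convex objective in (21); the matrix $J$, the vector $z$, and the function names are assumptions for illustration. Each one-dimensional subproblem is solved exactly by soft-thresholding, which reproduces the zero/nonzero dichotomy of the optimal value noted above.

```python
import numpy as np

def soft_threshold(x, t):
    """Closed-form argmin over u of 0.5 * (u - x)**2 + t * |u|."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def coordinate_descent(J, z, lam, n_iter=200):
    """Minimize 0.5 * u'Ju - z'u + lam * ||u||_1 by cycling over the
    coordinates; each one-dimensional subproblem is solved exactly."""
    p = len(z)
    u = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Remove the j-th coordinate's own contribution.
            r = z[j] - J[j].dot(u) + J[j, j] * u[j]
            u[j] = soft_threshold(r / J[j, j], lam / J[j, j])
    return u

J = np.array([[2.0, 0.3], [0.3, 1.5]])
z = np.array([1.0, 0.05])
u = coordinate_descent(J, z, lam=0.2)
# The weakly supported second coordinate is thresholded to exactly 0.
print(u)
```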

4 Information criterion

From the perspective of prediction, model selection using the AIC aims to minimize twice the Kullback-Leibler divergence (Kullback and Leibler 1951) between the true distribution and the estimated distribution,

where is a copy of ; in other words, has the same distribution as and is independent of . In addition, and denote a log-likelihood function based on , that is, , and the expectation with respect to only , respectively. Because the first term is a constant, i.e., it does not depend on the model selection, we only need to consider the second term, and then the AIC is defined as an asymptotically unbiased estimator of it (Akaike 1973). A simple estimator of the second term in our setting is , but it underestimates the second term. Consequently, we will consider the bias correction,

(22)

in AIC-type information criteria (see Konishi and Kitagawa 2008). Because the expectation in (22), i.e., the bias term, depends on the true distribution, it cannot be explicitly given in general; thus, we will evaluate it asymptotically in the same way as was done for the AIC.
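To fix ideas (in our own notation, since the paper's symbols are elided in this copy), the classical decomposition behind (22) can be written as follows; this is a sketch of the standard AIC argument, not the paper's exact display.

```latex
% Twice the Kullback-Leibler divergence between the true density g and
% the fitted density f(.;\hat\beta), evaluated on an independent copy
% \tilde{y} of the data:
2\,\mathrm{KL}
  = 2\,\mathrm{E}_{\tilde{y}}\!\left[\log g(\tilde{y})\right]
  - 2\,\mathrm{E}_{\tilde{y}}\!\left[\log f(\tilde{y};\hat{\beta})\right].
% The first term is constant over models; the plug-in estimate
% \ell_n(\hat\beta) of the second term is optimistic, so an AIC-type
% criterion subtracts an estimate \hat{b}_n of the bias
% b_n = E[\ell_n(\hat\beta) - E_{\tilde{y}}[\log f(\tilde{y};\hat\beta)]]:
\mathrm{AIC} = -2\,\ell_n(\hat{\beta}) + 2\,\hat{b}_n .
```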

For the Lasso, Efron et al. (2004) and Zou et al. (2007) developed the $C_p$-type information criterion as an unbiased estimator of the prediction squared error in a Gaussian linear regression setting; in other words, it is a finite correction of the AIC (Sugiura 1978) in a Gaussian linear setting with a known variance. For the Lasso estimator, it can be expressed as

where the index set is called an active set. Unfortunately, since Stein’s unbiased risk estimation theory (Stein 1981) was used for deriving this criterion, it was difficult to extend this result to other models. In that situation, Ninomiya and Kawano (2014) relied on statistical asymptotic theory and extended the result to generalized linear models based on the asymptotic distribution of the Lasso estimator. The Lasso estimator in their paper is defined by

but, as was mentioned in the previous section, estimation consistency is not assured because of the order of the penalty term. In this study, we derive an information criterion in a setting in which estimation consistency holds, as in Lemma 2, not only for the Lasso but also for the non-concave penalized likelihood method.
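As a sketch of the $C_p$-type recipe for a Gaussian linear model with known variance, the criterion takes the residual sum of squares over the variance, recenters by the sample size, and penalizes by twice the active-set size; the function below is our illustration of that recipe, with made-up data, not the paper's code.

```python
import numpy as np

def cp_type(y, y_hat, beta_hat, sigma2):
    """C_p-type criterion for the Lasso in Gaussian linear regression
    with known variance sigma2: RSS / sigma2 - n + 2 * |active set|,
    where the active set is the support of the estimator."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    df = np.count_nonzero(beta_hat)  # |active set| as degrees of freedom
    return rss / sigma2 - n + 2 * df

# Toy data from a sparse truth, then two candidate fits to compare.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
beta = np.array([1.0, 0.5, 0.0, 0.0])
y = X @ beta + rng.normal(size=50)
sparse = np.array([0.9, 0.4, 0.0, 0.0])
dense = np.array([0.9, 0.4, 0.1, -0.1])
print(cp_type(y, X @ sparse, sparse, 1.0),
      cp_type(y, X @ dense, dense, 1.0))
```

With equal fits, each extra active coefficient costs exactly 2, mirroring the AIC penalty.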

The bias term in (22) can be rewritten as the expectation of

(23)

so we can derive an AIC by evaluating the expectation of the limit to which (23) converges in distribution; we call this expectation the asymptotic bias. Here, we will develop an argument by setting .

Using Taylor’s theorem, the first term in (23) can be expressed as

(24)

where is a vector on the segment from to . Note that