Sparse linear discriminant analysis by thresholding for high dimensional data


Jun Shao, Yazhen Wang, Xinwei Deng and Sijian Wang

East China Normal University and University of Wisconsin

Department of Statistics
University of Wisconsin
1300 University Ave.
Madison, Wisconsin 53706
USA
E-mail: shao@stat.wisc.edu; yzwang@stat.wisc.edu; xdeng@stat.wisc.edu; wangs@stat.wisc.edu

Received February 2010; revised September 2010
Abstract

In many social, economic, biological and medical studies, one objective is to classify a subject into one of several classes based on a set of variables observed from the subject. Because the probability distribution of the variables is usually unknown, the classification rule is constructed using a training sample. The well-known linear discriminant analysis (LDA) works well for the situation where the number of variables used for classification is much smaller than the training sample size. Because of advances in technologies, modern statistical studies often face classification problems with the number of variables much larger than the sample size, and the LDA may perform poorly. We explore when and why the LDA has poor performance and propose a sparse LDA that is asymptotically optimal under some sparsity conditions on the unknown parameters. For illustration of application, we discuss an example of classifying human cancer into two classes of leukemia based on a set of 7,129 genes and a training sample of size 72. A simulation is also conducted to check the performance of the proposed method.

DOI: 10.1214/10-AOS870. The Annals of Statistics, Volume 39, Issue 2 (2011), pages 1241–1265.

Running title: Sparse linear discriminant analysis

Jun Shao's research was supported in part by NSF Grant SES-0705033. Yazhen Wang's research was supported in part by NSF Grant DMS-10-05635.

AMS subject classifications: Primary 62H30; secondary 62F12, 62G12.

Keywords: Classification, high dimensionality, misclassification rate, normality, optimal classification rule, sparse estimates.

1 Introduction

The objective of a classification problem is to classify a subject into one of several classes based on a $p$-dimensional vector $X$ of characteristics observed from the subject. In most applications, variability exists, and hence $X$ is random. If the distribution of $X$ is known, then we can construct an optimal classification rule that has the smallest possible misclassification rate. However, the distribution of $X$ is usually unknown, and a classification rule has to be constructed using a training sample. A statistical issue is how to use the training sample to construct a classification rule that has a misclassification rate close to that of the optimal rule.

In traditional applications, the dimension $p$ of $X$ is fixed while the training sample size $n$ is large. Because of the advance in technologies, nowadays a much larger amount of information can be collected, and the resulting $X$ is of a high dimension. In many recent applications, $p$ is much larger than the training sample size $n$, which is referred to as the large-$p$-small-$n$ problem, or the ultra-high dimension problem when $\log p = O(n^{\nu})$ for some $0 < \nu < 1$. An example is a study with genetic or microarray data. In our example presented in Section 5, for instance, a crucial step for a successful chemotherapy treatment is to classify human cancer into two classes of leukemia, acute myeloid leukemia and acute lymphoblastic leukemia, based on $p = 7{,}129$ genes and a training sample of 72 patients. Other examples include data from radiology, biomedical imaging, signal processing, climate and finance. Although more information is better when the distribution of $X$ is known, a larger dimension produces more uncertainty when the distribution of $X$ is unknown and, hence, results in a greater challenge for data analysis, since the training sample size $n$ cannot increase as fast as $p$.

The well-known linear discriminant analysis (LDA) works well for fixed-$p$-large-$n$ situations and is asymptotically optimal in the sense that, when $n$ increases to infinity, its misclassification rate over that of the optimal rule converges to one. In fact, we show in this paper that the LDA is still asymptotically optimal when $p$ diverges to infinity at a sufficiently slow rate compared with $n$. On the other hand, Bickel and Levina (2004) showed that the LDA is asymptotically as bad as random guessing when $p/n \to \infty$; some similar results are also given in this paper. The main purpose of this paper is to construct a sparse LDA and show that it is asymptotically optimal under some sparsity conditions on the unknown parameters and some condition on the divergence rate of $p$. Our proposed sparse LDA is based on the thresholding methodology, which was developed in wavelet shrinkage for function estimation [Donoho and Johnstone (1994), Donoho et al. (1995)] and covariance matrix estimation [Bickel and Levina (2008)]. There exist a few other sparse LDA methods, for example, Guo, Hastie and Tibshirani (2007), Clemmensen, Hastie and Ersbøll (2008) and Qiao, Zhou and Huang (2009). The key differences between the existing methods and ours are the conditions on sparsity and the construction of sparse estimators of the parameters. However, no asymptotic results were established in the existing papers.

For high-dimensional $X$ in regression, there exist some variable selection methods [see a recent review by Fan and Lv (2010)]. For constructing a classification rule using variable selection, we must identify not only the components of $X$ having mean effects for classification, but also the components of $X$ having effects for classification through their correlations with other components [see, e.g., Kohavi and John (1997), Zhang and Wang (2010)]. This may be a very difficult task when $p$ is much larger than $n$, such as $p = 7{,}129$ and $n = 72$ in the leukemia example in Section 5. Ignoring the correlation, Fan and Fan (2008) proposed the features annealed independence rule (FAIR), which first selects components of $X$ having mean effects for classification and then applies the naive Bayes rule (obtained by assuming that the components of $X$ are independent) using the selected components of $X$ only. Although no sparsity condition on the covariance matrix of $X$ is required, the FAIR is not asymptotically optimal because the correlation between components of $X$ is ignored. Our approach is not a variable selection approach, that is, we do not try to identify a subset of components of $X$ with a size smaller than $n$. We use thresholding estimators of the mean effects as well as Bickel and Levina's (2008) thresholding estimator of the covariance matrix of $X$, but we allow the number of nonzero estimators (for the mean differences or covariances) to be much larger than $n$ to ensure the asymptotic optimality of the resulting classification rule.

The rest of this paper is organized as follows. In Section 2, after introducing some notation and terminology, we establish a sufficient condition on the divergence of $p$ under which the LDA is still asymptotically close to the optimal rule. We also show that, when $p$ is large compared with $n$, the performance of the LDA is not good even if we know the covariance matrix of $X$, which indicates the need of sparse estimators for both the mean difference and the covariance matrix. Our main result is given in Section 3, along with some discussions about various sparsity conditions and divergence rates of $p$ for which the proposed sparse LDA performs well asymptotically. Extensions of the main result are discussed in Section 4. In Section 5, the proposed sparse LDA is illustrated in the example of classifying human cancer into two classes of leukemia, along with some simulation results for examining misclassification rates. All technical proofs are given in Section 6.

2 The optimal rule and linear discriminant analysis

We focus on the classification problem with two classes. The general case with three or more classes is discussed in Section 4. Let $X$ be a $p$-dimensional normal random vector belonging to class $k$ if $X \sim N_p(\mu_k, \Sigma)$, $k = 1, 2$, where $\mu_1 \neq \mu_2$ and $\Sigma$ is positive definite. The misclassification rate of any classification rule is the average of the probabilities of making the two types of misclassification: classifying $X$ to class 1 when $X \sim N_p(\mu_2, \Sigma)$ and classifying $X$ to class 2 when $X \sim N_p(\mu_1, \Sigma)$.

If $\mu_1$, $\mu_2$ and $\Sigma$ are known, then the optimal classification rule, that is, the rule with the smallest misclassification rate, classifies $X$ to class 1 if and only if $\delta'\Sigma^{-1}(X - \bar\mu) \ge 0$, where $\delta = \mu_1 - \mu_2$, $\bar\mu = (\mu_1 + \mu_2)/2$, and $\delta'$ denotes the transpose of the vector $\delta$. This rule is also the Bayes rule with equal prior probabilities for the two classes. Let $R_{OPT}$ denote the misclassification rate of the optimal rule. Using the normal distribution, we can show that

$R_{OPT} = \Phi\bigl(-\Delta_p^{1/2}/2\bigr), \qquad \Delta_p = \delta'\Sigma^{-1}\delta,$ (1)

where $\Phi$ is the standard normal distribution function. Although $R_{OPT} > 0$, $R_{OPT} \to 0$ if $\Delta_p \to \infty$ as $p \to \infty$ and $R_{OPT} \to 1/2$ if $\Delta_p \to 0$. Since $1/2$ is the misclassification rate of random guessing, we assume the following regularity conditions: there is a constant $c_0 > 0$ (not depending on $p$) such that

all eigenvalues of $\Sigma$ lie in the interval $[c_0^{-1}, c_0]$ (2)

and

$\max_{j \le p} \delta_j^2 \le c_0,$ (3)

where $\delta_j$ is the $j$th component of $\delta$. Under (2)–(3), $\Delta_p \le c_0^2 p$, and hence $R_{OPT} \ge \Phi(-c_0 p^{1/2}/2) > 0$. Also, $c_0^{-1}\|\delta\|^2 \le \Delta_p \le c_0\|\delta\|^2$, so that the rate of $\Delta_p$ is the same as the rate of $\|\delta\|^2$, where $\|\delta\|$ is the $L_2$-norm of the vector $\delta$.
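As a quick numerical companion to (1), the optimal rate can be computed directly from $\delta$ and $\Sigma$. The sketch below is illustrative only; the toy inputs are made up.

```python
import numpy as np
from scipy.stats import norm

def optimal_rate(delta, Sigma):
    """Misclassification rate of the optimal rule as in (1):
    Phi(-Delta_p^{1/2} / 2) with Delta_p = delta' Sigma^{-1} delta."""
    Delta_p = float(delta @ np.linalg.solve(Sigma, delta))
    return norm.cdf(-0.5 * np.sqrt(Delta_p))

# toy example with p = 3 and an equicorrelated covariance matrix
Sigma = 0.5 * np.eye(3) + 0.5 * np.ones((3, 3))
delta = np.array([1.0, 0.5, 0.0])     # mu_1 - mu_2
print(optimal_rate(delta, Sigma))     # strictly below the random-guessing rate 1/2
```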

In practice, $\mu_1$, $\mu_2$ and $\Sigma$ are typically unknown, and we have a training sample $\{X_{ki}: i = 1, \dots, n_k, k = 1, 2\}$, where $n_k$ is the sample size for class $k$, $X_{ki} \sim N_p(\mu_k, \Sigma)$, all $X_{ki}$'s are independent, and the training sample is independent of the $X$ to be classified. The limiting process considered in this paper is the one with $n = n_1 + n_2 \to \infty$. We assume that $n_1/n$ converges to a constant strictly between 0 and 1; $p$ is a function of $n$, but the subscript $n$ is omitted for simplicity. When $n \to \infty$, $p$ may diverge to $\infty$, and the limit of $\Delta_p$ may be 0, a positive constant, or $\infty$.

For a classification rule $T$ constructed using the training sample, its performance can be assessed by the conditional misclassification rate $R_T$, defined as the average of the conditional probabilities of making the two types of misclassification, where the conditional probabilities are with respect to $X$, given the training sample. The unconditional misclassification rate is $E(R_T)$. The asymptotic performance of $T$ refers to the limiting behavior of $R_T$ or $E(R_T)$ as $n \to \infty$. Since $0 \le R_T \le 1$, by the dominated convergence theorem, if $R_T \to_p c$, where $c$ is a constant and $\to_p$ denotes convergence in probability, then $E(R_T) \to c$. Hence, in this paper we focus on the limiting behavior of the conditional misclassification rate $R_T$.
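Since the conditional misclassification rate fixes the training sample and averages only over the new observation $X$, it can be approximated by simulation for any given rule. A minimal sketch, assuming the fitted rule is available as a hypothetical function `classify` that returns the class label:

```python
import numpy as np

def conditional_rate_mc(classify, mu1, mu2, Sigma, n_sim=100_000, seed=0):
    """Monte Carlo approximation of the conditional misclassification rate:
    the average of the two class-conditional error probabilities, with the
    classification rule (built from the training sample) held fixed."""
    rng = np.random.default_rng(seed)
    X1 = rng.multivariate_normal(mu1, Sigma, n_sim)   # future observations, class 1
    X2 = rng.multivariate_normal(mu2, Sigma, n_sim)   # future observations, class 2
    err1 = np.mean([classify(x) != 1 for x in X1])
    err2 = np.mean([classify(x) != 2 for x in X2])
    return 0.5 * (err1 + err2)
```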

We hope to find a rule $T$ such that $R_T$ converges in probability to the same limit as $R_{OPT}$, the misclassification rate of the optimal rule. If $R_{OPT} \to 0$, however, we hope not only that $R_T \to_p 0$, but also that $R_T$ and $R_{OPT}$ have the same convergence rate. This leads to the following definition.

Definition

Let $T$ be a classification rule with conditional misclassification rate $R_T$, given the training sample.

(i) $T$ is asymptotically optimal if $R_T / R_{OPT} \to_p 1$.

(ii) $T$ is asymptotically sub-optimal if $R_T - R_{OPT} \to_p 0$.

(iii) $T$ is asymptotically worst if $R_T \to_p 1/2$.

If $R_{OPT}$ is bounded away from 0 [i.e., $\Delta_p$ in (1) is bounded], then asymptotic sub-optimality is the same as asymptotic optimality. Part (iii) of Definition 2 comes from the fact that $1/2$ is the misclassification rate of random guessing.

In this paper we focus on classification rules of the form: classify $X$ to class 1 if and only if

$\hat\delta' \hat\Sigma^{-1} (X - \hat\mu) \ge 0,$ (4)

where $\hat\delta$, $\hat\mu$ and $\hat\Sigma$ are estimators of $\delta$, $\bar\mu$ and $\Sigma$, respectively, constructed using the training sample.

The well-known linear discriminant analysis (LDA) uses the maximum likelihood estimators $\bar X_1$, $\bar X_2$ and $S$, where

$\bar X_k = \frac{1}{n_k}\sum_{i=1}^{n_k} X_{ki}, \quad k = 1, 2, \qquad S = \frac{1}{n}\sum_{k=1}^{2}\sum_{i=1}^{n_k} (X_{ki} - \bar X_k)(X_{ki} - \bar X_k)'.$

The LDA is given by (4) with $\hat\delta = \bar X_1 - \bar X_2$, $\hat\mu = \bar X = (\bar X_1 + \bar X_2)/2$ and $\hat\Sigma^{-1} = S^{-1}$ when $S^{-1}$ exists, and a generalized inverse $S^{-}$ when $S^{-1}$ does not exist (e.g., when $p > n$). A straightforward calculation shows that, given the training sample, the conditional misclassification rate of the LDA is

(5)
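A sketch of the LDA ingredients just described: the maximum likelihood estimators, the rule (4) with a generalized (pseudo-)inverse when $S$ is singular, and the conditional misclassification rate obtained by conditioning on the training sample (the average of the two class-conditional normal probabilities). The notation and the exact expression are my transcription, intended to match (5) in substance rather than reproduce it.

```python
import numpy as np
from scipy.stats import norm

def lda_fit(X1, X2):
    """Maximum likelihood estimators from the two training samples
    (rows of X1, X2 are observations)."""
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    S = ((X1 - xbar1).T @ (X1 - xbar1) + (X2 - xbar2).T @ (X2 - xbar2)) / (n1 + n2)
    return xbar1, xbar2, np.linalg.pinv(S)   # pseudo-inverse handles p > n

def lda_classify(x, xbar1, xbar2, Sinv):
    """Rule (4): classify x to class 1 iff
    (xbar1 - xbar2)' S^- (x - (xbar1 + xbar2)/2) >= 0."""
    d = xbar1 - xbar2
    return 1 if float(d @ Sinv @ (x - (xbar1 + xbar2) / 2)) >= 0 else 2

def conditional_rate(xbar1, xbar2, Sinv, mu1, mu2, Sigma):
    """Conditional misclassification rate given the training sample:
    the average of the two class-conditional error probabilities under
    X ~ N(mu_k, Sigma), with the fitted rule held fixed."""
    d = xbar1 - xbar2
    xbar = (xbar1 + xbar2) / 2
    w = Sinv @ d                                   # normal vector of the linear rule
    sd = np.sqrt(float(w @ Sigma @ w))
    r1 = norm.cdf(-float(w @ (mu1 - xbar)) / sd)   # P(misclassify | class 1)
    r2 = norm.cdf(float(w @ (mu2 - xbar)) / sd)    # P(misclassify | class 2)
    return 0.5 * (r1 + r2)
```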

Is the LDA asymptotically optimal or sub-optimal according to Definition 2? Bickel and Levina [(2004), Theorem 1] showed that, if $p/n \to \infty$, then under their conditions the unconditional misclassification rate of the LDA converges to $1/2$, so that the LDA is asymptotically worst. A natural question is: for what kind of $p$ (which may diverge to $\infty$) is the LDA asymptotically optimal or sub-optimal? The following result provides an answer.

Theorem

Suppose that (2)–(3) hold and .

(i) The conditional misclassification rate of the LDA is equal to

(ii) If is bounded, then the LDA is asymptotically optimal and

(iii) If , then the LDA is asymptotically sub-optimal.

(iv) If and , then the LDA is asymptotically optimal.

Remark

Since under conditions (2) and (3), when is bounded, is the same as , which is satisfied if with . When , is stronger than . Under (2)–(3), . Hence, the extreme case is is a constant times , and the condition in part (iv) becomes , which holds when with . In the traditional applications with a fixed , is bounded, as and thus Theorem 2 proves that the LDA is asymptotically optimal.

The proof of part (iv) of Theorem 2 (see Section 6) utilizes the following lemma, which is also used in the proofs of other results in this paper.

Lemma 1

Let and be two sequences of positive numbers such that and as . If , where may be 0, positive, or , then
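For context, comparisons of normal tail probabilities of the kind treated in this lemma typically rest on the following standard fact (background only, not the paper's statement):

\[ \Phi(-t) \;=\; 1 - \Phi(t) \;\sim\; \frac{\phi(t)}{t}, \qquad t \to \infty, \]

where $\phi$ is the standard normal density. In particular, for positive sequences $a_n \to \infty$ and $b_n \to \infty$,

\[ \frac{\Phi(-a_n)}{\Phi(-b_n)} \;=\; \frac{b_n}{a_n}\, e^{-(a_n^2 - b_n^2)/2}\,\bigl(1 + o(1)\bigr), \]

so the behavior of such ratios is governed by $a_n^2 - b_n^2$ up to the factor $b_n/a_n$.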

Since the LDA uses $S$ to estimate $\Sigma$ and, when $p/n \to \infty$, is asymptotically worst as Bickel and Levina (2004) showed, one may think that the bad performance of the LDA is caused by the fact that $S$ is not a good estimator of $\Sigma$. Our following result shows that the LDA may still be asymptotically worst even if we can estimate $\Sigma$ perfectly.

Theorem

Suppose that (2)–(3) hold, and that $\Sigma$ is known so that the LDA is given by (4) with $\hat\delta = \bar X_1 - \bar X_2$, $\hat\mu = \bar X$ and $\hat\Sigma = \Sigma$.

(i) If (which is true if ), then .

(ii) If with , then a constant strictly between 0 and and .

(iii) If , then but .

Theorem 2 shows that even if $\Sigma$ is known, the LDA may be asymptotically worst, and the best we can hope is that the LDA is asymptotically sub-optimal. It can also be shown that, when and are known and we apply the LDA with and , the LDA is still not asymptotically optimal when , where is any sub-vector of with dimension . This indicates that, in order to obtain an asymptotically optimal classification rule when $p$ is much larger than $n$, we need sparsity conditions on $\delta$ and $\Sigma$ when both of them are unknown. For bounded $\Delta_p$ (in which case the asymptotic optimality is the same as the asymptotic sub-optimality), by imposing sparsity conditions on $\mu_1$, $\mu_2$ and $\Sigma$, Theorem 2 of Bickel and Levina (2004) shows the existence of an asymptotically optimal classification rule. In the next section, we obtain a result by relaxing the boundedness of $\Delta_p$ and by imposing sparsity conditions on $\delta$ and $\Sigma$. Since the difference between the two normal distributions lies in $\delta = \mu_1 - \mu_2$, imposing a sparsity condition on $\delta$ is weaker and more reasonable than imposing sparsity conditions on both $\mu_1$ and $\mu_2$.

3 Sparse linear discriminant analysis

We focus on the situation where the limit of $\Delta_p$ is positive or $\infty$. The following sparsity measure on $\Sigma$ is considered in Bickel and Levina (2008):

(6)

where $\sigma_{ij}$ is the $(i,j)$th element of $\Sigma$, $h$ is a constant not depending on $p$, and $0^0$ is defined to be 0. In the special case of $h = 0$, the quantity in (6) is the maximum of the numbers of nonzero elements among the rows of $\Sigma$, so that a value much smaller than $p$ implies that many elements of $\Sigma$ are equal to 0. If the quantity in (6) is much smaller than $p$ for a constant $h > 0$, then $\Sigma$ is sparse in the sense that many elements of $\Sigma$ are very small. An example of much smaller than is or .
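For orientation, Bickel and Levina (2008) measure the row-wise sparsity of a covariance matrix by a quantity of the following form (my transcription; the exact symbol and conventions used in (6) may differ):

\[ c_{h,p}(\Sigma) \;=\; \max_{i \le p} \sum_{j=1}^{p} |\sigma_{ij}|^{h}, \qquad 0 \le h < 1, \]

with the convention $0^0 = 0$, so that $c_{0,p}(\Sigma)$ is the maximum number of nonzero entries in a row of $\Sigma$.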

Under conditions (2) and

(7)

Bickel and Levina (2008) showed that

(8)

where $\tilde\Sigma$ denotes $S$ thresholded at $M\sqrt{\log p / n}$ with a positive constant $M$; that is, the $(i,j)$th element of $\tilde\Sigma$ is $\hat\sigma_{ij}\, I(|\hat\sigma_{ij}| \ge M\sqrt{\log p / n})$, where $\hat\sigma_{ij}$ is the $(i,j)$th element of $S$ and $I(A)$ is the indicator function of the set $A$. We consider a slight modification, that is, only the off-diagonal elements of $S$ are thresholded. The resulting estimator is still denoted by $\tilde\Sigma$, and it has property (8) under conditions (2) and (7).
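A sketch of the modified thresholding estimator just described: hard-threshold only the off-diagonal entries of the pooled maximum likelihood covariance estimator, using the Bickel–Levina-type level $M\sqrt{\log p / n}$ assumed above.

```python
import numpy as np

def threshold_covariance(X1, X2, M=1.0):
    """Pooled MLE of Sigma with off-diagonal hard thresholding at M*sqrt(log p / n)."""
    n1, n2 = len(X1), len(X2)
    n, p = n1 + n2, X1.shape[1]
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    S = ((X1 - xbar1).T @ (X1 - xbar1) + (X2 - xbar2).T @ (X2 - xbar2)) / n
    t = M * np.sqrt(np.log(p) / n)
    Sigma_t = np.where(np.abs(S) >= t, S, 0.0)   # threshold every entry ...
    np.fill_diagonal(Sigma_t, np.diag(S))        # ... then restore the diagonal
    return Sigma_t
```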

We now turn to the sparsity of $\delta$. On the one hand, a large $\|\delta\|$ results in a large difference between $\mu_1$ and $\mu_2$, so that the optimal rule has a small misclassification rate. On the other hand, a larger divergence rate of $\|\delta\|$ results in a more difficult task of constructing a good classification rule, since $\delta$ has to be estimated based on a training sample of a size much smaller than $p$. We consider the following sparsity measure on $\delta$, which is similar to the sparsity measure on $\Sigma$:

(9)

where $\delta_j$ is the $j$th component of $\delta$ and $g$ is a constant not depending on $p$ and $n$. If the quantity in (9) is much smaller than $p$ for a constant $g$, then $\delta$ is sparse. For $\Delta_p$ defined in (1), under (2)–(3), the rate of divergence of $\Delta_p$ is always smaller than that of the quantity in (9) and, in particular, $\Delta_p$ is bounded when the quantity in (9) is bounded for some $g$.
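By analogy with the covariance measure, one plausible form for the measure in (9) is the following (an assumption for illustration; the paper's exact exponent convention may differ):

\[ D_{g,p}(\delta) \;=\; \sum_{j=1}^{p} |\delta_j|^{g}, \qquad 0 \le g < 2, \]

with $0^0 = 0$, so that $D_{0,p}(\delta)$ counts the nonzero components of $\delta$. Under (2)–(3), $\Delta_p \le c_0 \sum_j \delta_j^2 \le c_0^{(4-g)/2} D_{g,p}(\delta)$, which is consistent with the remark that $\Delta_p$ is bounded whenever the measure in (9) is bounded.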

We consider the sparse estimator $\tilde\delta$, which is $\hat\delta = \bar X_1 - \bar X_2$ with its components thresholded at

(10)

with two constants; that is, the $j$th component of $\tilde\delta$ is $\hat\delta_j\, I(|\hat\delta_j| > a_n)$, where $\hat\delta_j$ is the $j$th component of $\hat\delta$ and $a_n$ denotes the threshold given in (10). The following result is useful.

Lemma 2

Let $\hat\delta_j$ be the $j$th component of $\hat\delta$, $\delta_j$ be the $j$th component of $\delta$, $a_n$ be given by (10) and $c$ be a fixed constant.

If (7) holds, then

(11)

and

(12)

Let the number of ’s with , the number of ’s with and the number of ’s with . If (7) holds, then

We propose a sparse linear discriminant analysis (SLDA) for high-dimensional $X$, which is given by (4) with $\hat\delta = \tilde\delta$, $\hat\mu = \bar X = (\bar X_1 + \bar X_2)/2$ and $\hat\Sigma = \tilde\Sigma$. The following result establishes the asymptotic optimality of the SLDA under some conditions on the rates of divergence of $p$, $\Delta_p$ and the sparsity measures in (6) and (9).
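A sketch of the resulting SLDA decision rule, taking the thresholded estimates as inputs; they could be produced, for example, by the thresholding sketches above, with `mu_bar` the average of the two sample mean vectors.

```python
import numpy as np

def slda_classify(x, delta_t, mu_bar, Sigma_t):
    """SLDA rule of the form (4): classify x to class 1 iff
    delta_t' Sigma_t^{-} (x - mu_bar) >= 0, using a pseudo-inverse in case
    the thresholded covariance estimate is singular."""
    w = np.linalg.pinv(Sigma_t) @ delta_t
    return 1 if float(w @ (x - mu_bar)) >= 0 else 2
```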

Theorem

Let be given by (6), be given by (9), be given by (10), be as defined in Lemma 2 and . Assume that conditions (2), (3) and (7) hold and

(13)
(i) The conditional misclassification rate of the SLDA is equal to

(ii) If is bounded, then the SLDA is asymptotically optimal and

(iii) If , then the SLDA is asymptotically sub-optimal.

(iv) If and , then the SLDA is asymptotically optimal.

{remark}

Condition (13) may be achieved by an appropriate choice of in , given the divergence rates of , , and .

{remark}

When is bounded and (2)–(3) hold, condition (13) is the same as

(14)
{remark}

When , condition (13), which is sufficient for the asymptotic sub-optimality of the SLDA, is implied by , and . When , the condition , which is sufficient for the asymptotic optimality of the SLDA, is the same as

(15)

We now study when condition (13) holds and when with . By Remarks 3 and 4, (13) is the same as condition (14) when is bounded, and is the same as condition (15) when .

  1. If there are two constants and such that for any nonzero , then is exactly the number of nonzero ’s. Under condition (3), and have exactly the order .

    1. [(a)]

    2. If is bounded (e.g., there are only finitely many nonzero ’s), then is bounded and condition (13) is the same as condition (14). The last two convergence requirements in (14) are implied by , which is the condition for the consistency of proposed by Bickel and Levina (2008).

    3. When (), we assume that and with and . Then, condition (15) is implied by

      (16)

      If we choose , then condition (16) holds when and . To achieve (16) we need to know the divergence rate of . If for a , then , and thus condition (16) holds when and . If for a , which is referred to as an ultra-high dimension, then , and condition (16) holds if and .

  2. Since

    and

    we conclude that

    (17)

    The right-hand side of (17) can be used as a bound of the divergence rate of when , although it may not be a tight bound. For example, if and the right-hand side of (17) is used as a bound for , then the last convergence requirement in (14) or (15) is implied by the first convergence requirement in (14) or (15) when .

  3. If , then the second convergence requirement in (14) or (15) is implied by the first convergence requirement in (14) or (15) when .

  4. Consider the case where , and an ultra-high dimension, that is, for a . From the previous discussion, condition (14) holds if , and (15) holds if . Since , , which converges to 0 if . If is bounded, then is sufficient for condition (13). If , then the largest divergence rate of is and (i.e., the SLDA is asymptotically optimal) when . When , this means .

  5. If the divergence rate of is smaller than then we can afford to have a larger than divergence rate for and . For example, if for a and for a and a positive constant , then diverges to at a rate slower than . We now study when condition (14) holds. First, , which converges to 0 if . Second, , which converges to 0 if is chosen so that . Finally, if we use the right-hand side of (17) as a bound for , then , which converges to 0 if . Thus, condition (14) holds if and . For condition (15), we assume that with a ( corresponds to a bounded ). Then, a similar analysis leads to the conclusion that condition (15) holds if and .

To apply the SLDA, we need to choose two constants: the constant $M$ in the thresholding estimator $\tilde\Sigma$ and the constant, say $a$, in the thresholding estimator $\tilde\delta$. We suggest a data-driven method via a cross-validation procedure. Let $\mathcal{T}_{ki}$ be the data set containing the entire training sample but with $X_{ki}$ deleted, and let $T_{ki}$ be the SLDA rule based on $\mathcal{T}_{ki}$, $i = 1, \dots, n_k$, $k = 1, 2$. The leave-one-out cross-validation estimator of the misclassification rate of the SLDA is

where $I(\cdot)$ is the indicator function of whether $T_{ki}$ classifies $X_{ki}$ incorrectly. Let $R(n_1, n_2)$ denote the unconditional misclassification rate of the SLDA when the sample sizes are $n_1$ and $n_2$. Then

which is close to $R(n_1, n_2)$ for large $n$. Let $\hat R(M, a)$ be the cross-validation estimator when $(M, a)$ is used in thresholding $\tilde\Sigma$ and $\tilde\delta$. Then, a data-driven method of selecting $(M, a)$ is to minimize $\hat R(M, a)$ over a suitable range of $(M, a)$. The resulting minimized $\hat R(M, a)$ can also be used as an estimate of the misclassification rate of the SLDA.
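A sketch of the leave-one-out cross-validation search over the two constants, using the threshold forms assumed in the earlier sketches; the grid values and the fixed exponent `alpha` are placeholders, not the paper's choices.

```python
import numpy as np

def fit_slda(X1, X2, M, a, alpha=0.3):
    """Thresholded estimates: (delta_t, mu_bar, pseudo-inverse of Sigma_t)."""
    n, p = len(X1) + len(X2), X1.shape[1]
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    S = ((X1 - xbar1).T @ (X1 - xbar1) + (X2 - xbar2).T @ (X2 - xbar2)) / n
    Sigma_t = np.where(np.abs(S) >= M * np.sqrt(np.log(p) / n), S, 0.0)
    np.fill_diagonal(Sigma_t, np.diag(S))                       # keep the diagonal
    dhat = xbar1 - xbar2
    delta_t = np.where(np.abs(dhat) > a * (np.log(p) / n) ** alpha, dhat, 0.0)
    return delta_t, (xbar1 + xbar2) / 2, np.linalg.pinv(Sigma_t)

def loocv_error(X1, X2, M, a):
    """Leave-one-out CV estimate of the misclassification rate (average of the
    two class-wise leave-one-out error rates) for a given pair of constants."""
    class_err = []
    for k in (1, 2):
        Xk, Xo = (X1, X2) if k == 1 else (X2, X1)
        wrong = 0
        for i in range(len(Xk)):
            Xk_minus = np.delete(Xk, i, axis=0)
            A1, A2 = (Xk_minus, Xo) if k == 1 else (Xo, Xk_minus)
            d, m, Sinv = fit_slda(A1, A2, M, a)
            pred = 1 if float((Sinv @ d) @ (Xk[i] - m)) >= 0 else 2
            wrong += (pred != k)
        class_err.append(wrong / len(Xk))
    return 0.5 * (class_err[0] + class_err[1])

def select_constants(X1, X2, grid_M=(0.5, 1.0, 2.0), grid_a=(0.5, 1.0, 2.0)):
    """Pick (M, a) minimizing the CV error; also returns the minimized error."""
    return min((loocv_error(X1, X2, M, a), M, a) for M in grid_M for a in grid_a)
```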

4 Extensions

We first consider an extension of the main result in Section 3 to nonnormal $X$ and $X_{ki}$'s. For nonnormal $X$, the LDA with known $\delta$ and $\Sigma$, that is, the rule classifying $X$ to class 1 if and only if $\delta'\Sigma^{-1}(X - \bar\mu) \ge 0$, is still optimal when $X$ has an elliptical distribution [see, e.g., Fang and Anderson (1990)] with density

(18)

where the location vector is either $\mu_1$ or $\mu_2$, the function appearing in the density is monotone on $[0, \infty)$, and the leading factor is a normalizing constant. Special cases of (18) are the multivariate $t$-distribution and the multivariate double-exponential distribution. Although this rule is not necessarily optimal when the distribution of $X$ is not of the form (18), it is still a reasonably good rule when $\delta$ and $\Sigma$ are known. Thus, when $\delta$ and $\Sigma$ are unknown, we study whether the misclassification rate of the SLDA defined in Section 3 is close to that of the LDA with known $\delta$ and $\Sigma$.
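For concreteness, the elliptical family described above can be written in the following standard form (my transcription of the verbal description, not necessarily the exact display (18)):

\[ f(x) \;=\; C_p\, |\Sigma|^{-1/2}\, g\bigl((x - \mu)'\Sigma^{-1}(x - \mu)\bigr), \]

where $\mu$ is either $\mu_1$ or $\mu_2$, $g$ is a monotone function on $[0, \infty)$ and $C_p$ is a normalizing constant; $g(u) = e^{-u/2}$ recovers the normal case, while other choices of $g$ yield the multivariate $t$ and double-exponential distributions mentioned above.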

From the proofs of the asymptotic properties of the SLDA in Section 3, the results that depend on the normality assumption are:

(i) result (8), the consistency of the thresholded estimator $\tilde\Sigma$;

(ii) results (11) and (12) in Lemma 2;

(iii) the form of the optimal misclassification rate given by (1);

(iv) the result in Lemma 1.

Thus, if we relax the normality assumption, we need to address (i)–(iv). For (i), it was discussed in Section 2.3 of Bickel and Levina (2008) that result (8) still holds when the normality assumption is replaced by one of the following two conditions. The first condition is

(19)

for a constant , where is the th component of . Under condition (19), result (8) holds without any modification. The second condition is

(20)

for a constant . Under condition (20), result (8) holds with changed to . The same argument can be used to address (ii), that is, results (11) and (12) hold under condition (19) or condition (20) with replaced by . For (iii), the normality of can be relaxed to that, for any -dimensional nonrandom vector with and any real number ,

(21)

where is an unknown distribution function symmetric about 0 but it does not depend on . Distributions satisfying (21) include elliptical distributions [e.g., a distribution of the form (18)] and the multivariate scale mixture of normals [Fang and Anderson (1990)]. Under (21), when and are known, the LDA has misclassification rate with given by (1). It remains to address (iv). Note that the following result,

(22)

is the key for Lemma 1. Without assuming normality, we consider the condition

(23)

where is a constant, , is a constant and is a positive constant. For the case where is standard normal, condition (23) holds with , and . Under condition (23), we can show that the result in Lemma holds for the case of , which is needed to extend the result in Theorem 3(iv). This leads to the following extension.

Theorem

Assume condition (21) and either condition (19) or (20). When condition (19) holds, let be defined by (13). When condition (20) holds, let and be defined by (10) and (13), respectively, with replaced by . Assume that and .

(i) The conditional misclassification rate of the SLDA is

(ii) If is bounded, then

where is the misclassification rate of the LDA when $\delta$ and $\Sigma$ are known.

(iii) If , then .

(iv) If and , then

We next consider extending the results in Sections 2 and 3 to the classification problem with $K \ge 3$ classes. Let $X$ be a $p$-dimensional normal random vector belonging to class $k$ if $X \sim N_p(\mu_k, \Sigma)$, $k = 1, \dots, K$, and let the training sample be $\{X_{ki}: i = 1, \dots, n_k, k = 1, \dots, K\}$, where $n_k$ is the sample size for class $k$, $X_{ki} \sim N_p(\mu_k, \Sigma)$, and all $X_{ki}$'s are independent. The LDA classifies $X$ to class $k$ if and only if, for all $l \ne k$,