Optimal Feature Selection in High-Dimensional Discriminant Analysis

Mladen Kolar and Han Liu

Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15217, USA; e-mail: mladenk@cs.cmu.edu

Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; e-mail: hanliu@princeton.edu

Research supported by NSF Grant IIS-1116730.
Abstract

We consider the high-dimensional discriminant analysis problem. For this problem, different methods have been proposed and justified by establishing exact convergence rates for the classification risk as well as convergence results for the estimated discriminant direction. However, a sharp theoretical analysis of the variable selection performance of these procedures has not been established, even though model interpretation is of fundamental importance in scientific data analysis. This paper bridges the gap by providing sharp sufficient conditions for consistent variable selection using the sparse discriminant analysis estimator (Mai et al., 2012). Through careful analysis, we establish rates of convergence that are significantly faster than the best known results and admit an optimal scaling of the sample size $n$, dimensionality $p$, and sparsity level $s$ in the high-dimensional setting. The sufficient conditions are complemented by necessary information theoretic limits on the variable selection problem in the context of high-dimensional discriminant analysis. Exploiting a numerical equivalence result, our analysis also establishes optimal results for the ROAD estimator (Fan et al., 2012) and the sparse optimal scaling estimator (Clemmensen et al., 2011). Furthermore, we analyze an exhaustive search procedure, whose performance serves as a benchmark, and show that it is variable selection consistent under weaker conditions. Extensive simulations demonstrating the sharpness of the bounds are also provided.

Keywords: high-dimensional statistics; discriminant analysis; variable selection; optimal rates of convergence

1 Introduction

We consider the problem of binary classification with high-dimensional features. More specifically, given $n$ data points $\{(x_i, y_i)\}_{i=1}^n$ sampled from a joint distribution of $(X, Y) \in \mathbb{R}^p \times \{1, 2\}$, we want to determine the class label $y$ for a new data point $x \in \mathbb{R}^p$.

Let $p_1(x)$ and $p_2(x)$ be the density functions of $X$ given $Y = 1$ (class 1) and $Y = 2$ (class 2), respectively, and let $\pi_1 = \mathbb{P}(Y = 1)$ and $\pi_2 = \mathbb{P}(Y = 2)$ be the prior probabilities. Classical multivariate analysis theory shows that the Bayes rule classifies a new data point $x$ to class 1 if and only if

$$\log \frac{p_1(x)}{p_2(x)} + \log \frac{\pi_1}{\pi_2} \;\ge\; 0. \qquad (1.1)$$

The Bayes rule usually serves as an oracle benchmark, since, in practical data analysis, the class conditional densities $p_1$ and $p_2$ are unknown and need to be estimated from the data.

Throughout the paper, we assume that the class conditional densities $p_1$ and $p_2$ are Gaussian. That is, we assume that

$$X \mid Y = 1 \sim \mathcal{N}(\mu_1, \Sigma) \qquad \text{and} \qquad X \mid Y = 2 \sim \mathcal{N}(\mu_2, \Sigma). \qquad (1.2)$$

This assumption leads us to linear discriminant analysis (LDA) and the Bayes rule in (1.1) becomes

$$\Big( x - \frac{\mu_1 + \mu_2}{2} \Big)^\top \beta + \log \frac{\pi_1}{\pi_2} \;\ge\; 0, \qquad (1.3)$$

where $\beta = \Sigma^{-1}(\mu_1 - \mu_2)$. Theoretical properties of the plug-in rule, in which $\mu_1$, $\mu_2$, and $\Sigma$ are replaced by their sample estimates $\hat\mu_1$, $\hat\mu_2$, and $\hat\Sigma$, have been well studied when the dimension $p$ is low (Anderson, 2003).
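For concreteness, the plug-in rule can be written in a few lines of code. The following sketch is purely illustrative (it is not the implementation used in the paper, and all function and variable names are ours); it estimates $\mu_1$, $\mu_2$, and $\Sigma$ by the sample means and the pooled covariance and then applies the linear rule in (1.3).

```python
import numpy as np

def fit_plugin_lda(X1, X2):
    """Plug-in LDA: replace population quantities by sample estimates."""
    n1, n2 = X1.shape[0], X2.shape[0]
    mu1_hat, mu2_hat = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled (within-class) covariance estimate.
    S1 = (X1 - mu1_hat).T @ (X1 - mu1_hat)
    S2 = (X2 - mu2_hat).T @ (X2 - mu2_hat)
    Sigma_hat = (S1 + S2) / (n1 + n2 - 2)
    beta_hat = np.linalg.solve(Sigma_hat, mu1_hat - mu2_hat)
    intercept = -0.5 * (mu1_hat + mu2_hat) @ beta_hat + np.log(n1 / n2)
    return beta_hat, intercept

def predict_class(x, beta_hat, intercept):
    """Assign class 1 when the linear score is non-negative, class 2 otherwise."""
    return 1 if x @ beta_hat + intercept >= 0 else 2

# Toy example: two Gaussian classes sharing a covariance matrix.
rng = np.random.default_rng(0)
p = 5
Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1)-type
mu1 = np.zeros(p)
mu2 = np.concatenate(([1.5, -1.0], np.zeros(p - 2)))
L = np.linalg.cholesky(Sigma)
X1 = rng.standard_normal((100, p)) @ L.T + mu1
X2 = rng.standard_normal((100, p)) @ L.T + mu2
beta_hat, b0 = fit_plugin_lda(X1, X2)
print(predict_class(mu1, beta_hat, b0), predict_class(mu2, beta_hat, b0))
```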

In high dimensions, the standard plug-in rule works poorly and may even fail completely. For example, Bickel and Levina (2004) show that the classical low-dimensional normal-based linear discriminant analysis is asymptotically equivalent to random guessing when the dimension $p$ increases at a rate comparable to the sample size $n$. To overcome this curse of dimensionality, it is common to impose certain sparsity assumptions on the model and then estimate the high-dimensional discriminant rule using plug-in estimators. The most popular approach is to assume that both $\mu_d := \mu_1 - \mu_2$ and $\Sigma$ are sparse. Under this assumption, Shao et al. (2011) propose to use a thresholding procedure to estimate $\mu_d$ and $\Sigma$ and then plug them into the Bayes rule. In a more extreme case, Tibshirani et al. (2003), Wang and Zhu (2007), and Fan and Fan (2008) assume that $\Sigma$ is diagonal and estimate $\mu_d$ using a shrinkage method. Another common approach is to assume that $\Sigma^{-1}$ and $\mu_d$ are sparse. Under this assumption, Witten and Tibshirani (2009) propose the scout method, which estimates $\Sigma^{-1}$ using a shrunken estimator. Though these plug-in approaches are simple, they are not appropriate for conducting variable selection in the discriminant analysis setting. As has been elaborated in Cai et al. (2011) and Mai et al. (2012), for variable selection in high-dimensional discriminant analysis, we need to directly impose sparsity assumptions on the Bayes discriminant direction $\beta = \Sigma^{-1}\mu_d$ instead of separately on $\mu_d$ and $\Sigma$. In particular, it is assumed that $\beta_j = 0$ for all $j$ outside a small set of relevant variables. Their key observation comes from the fact that Fisher's discriminant rule depends on $\mu_d$ and $\Sigma$ only through the product $\Sigma^{-1}\mu_d$. Furthermore, in the high-dimensional setting, it is scientifically meaningful that only a small set of variables is relevant to classification, which is equivalent to the assumption that $\beta$ is sparse. Using a simple example of tumor classification, Mai et al. (2012) elaborate why it is scientifically more informative to impose the sparsity assumption directly on $\beta$ instead of on $\mu_d$ (for more details, see Section 2 of their paper). In addition, Cai et al. (2011) point out that the sparsity assumption on $\beta$ is much weaker than imposing sparsity assumptions on $\mu_d$ and $\Sigma$ separately. A number of authors have also studied classification in this setting (Wu et al., 2009, Fan et al., 2012, Witten and Tibshirani, 2011, Clemmensen et al., 2011, Cai et al., 2011, Mai et al., 2012).
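To see why the assumption on $\beta$ is weaker, consider the following toy numpy calculation (illustrative only; the covariance model and numbers are ours): the covariance matrix and the mean difference are both dense, yet $\beta = \Sigma^{-1}(\mu_1 - \mu_2)$ is sparse.

```python
import numpy as np

p, s = 20, 3

# Dense equicorrelation covariance: neither Sigma nor its inverse is sparse.
Sigma = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)

# Sparse discriminant direction beta, supported on the first s coordinates.
beta = np.zeros(p)
beta[:s] = [2.0, -1.0, 1.0]

# Mean difference chosen so that beta = Sigma^{-1} (mu_1 - mu_2); mu_d is dense.
mu_d = Sigma @ beta
print("nonzero entries of mu_d:", np.count_nonzero(np.abs(mu_d) > 1e-12))  # all p of them
print("nonzero entries of beta:", np.count_nonzero(np.abs(beta) > 1e-12))  # only s of them
```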

In this paper, we adopt the same assumption that $\beta$ is sparse and focus on analyzing the SDA (Sparse Discriminant Analysis) estimator proposed by Mai et al. (2012). This method estimates the discriminant direction $\beta$ (more precisely, it estimates a quantity that is proportional to $\beta$), and our focus will be on variable selection consistency, that is, whether this method can recover the set $S = \{j : \beta_j \neq 0\}$ with high probability. In a recent work, Mai and Zou (2012) prove that the SDA estimator is numerically equivalent to the ROAD estimator (Fan et al., 2012) and the sparse optimal scaling estimator (Clemmensen et al., 2011). By exploiting this result, our theoretical analysis provides a unified theoretical justification for all three methods.

1.1 Main Results

Let $n_1$ and $n_2$ denote the number of data points in class 1 and class 2, respectively, and let $n = n_1 + n_2$. The SDA estimator is obtained by solving the following least squares optimization problem

$$\hat{v} \;=\; \operatorname*{argmin}_{v \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \sum_{i \in [n]} \big( y_i - \bar{y} - (x_i - \bar{x})^\top v \big)^2 \;+\; \lambda \, \| v \|_1 \Big\}, \qquad (1.4)$$

where $[n]$ denotes the set $\{1, \ldots, n\}$, $\bar{y} = n^{-1} \sum_{i \in [n]} y_i$ and $\bar{x} = n^{-1} \sum_{i \in [n]} x_i$ denote the sample averages, and the vector $y = (y_1, \ldots, y_n)^\top$ encodes the class labels by assigning one fixed numerical value to observations from class 1 and another to observations from class 2. Here $\lambda \ge 0$ is a regularization parameter.

The SDA estimator in (1.4) uses an $\ell_1$-norm penalty to estimate a sparse discriminant direction and avoid the curse of dimensionality. Mai et al. (2012) studied its variable selection properties under a different encoding scheme for the response $y$. However, as we show later, different coding schemes do not affect the results. When the regularization parameter $\lambda$ is set to zero, the SDA estimator reduces to the classical Fisher's discriminant rule.
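Since (1.4) is an $\ell_1$-penalized least squares problem with a numerically encoded response, it can be computed with any off-the-shelf lasso solver. The sketch below is an illustration under assumptions made here for concreteness (a zero-mean label coding and scikit-learn's $(2n)^{-1}$ least-squares scaling); it is not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sda_estimator(X, labels, lam):
    """Compute an SDA-type direction by l1-penalized regression of encoded labels on X.

    X      : (n, p) data matrix
    labels : array with entries in {1, 2}
    lam    : regularization parameter (passed to sklearn as alpha)
    """
    n = X.shape[0]
    n1, n2 = np.sum(labels == 1), np.sum(labels == 2)
    # A convenient zero-mean coding of the class labels; as discussed in
    # Section 2.2, different codings give proportional solutions.
    y = np.where(labels == 1, n / n1, -n / n2)
    # sklearn's Lasso minimizes (2n)^{-1} ||y - b0 - X v||^2 + alpha ||v||_1;
    # fit_intercept=True performs the centering appearing in (1.4).
    model = Lasso(alpha=lam, fit_intercept=True)
    model.fit(X, y)
    return model.coef_

# Synthetic example: only the first three features carry signal.
rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.standard_normal((n, p))
labels = np.repeat([1, 2], n // 2)
X[labels == 1, :3] += 2.0
v_hat = sda_estimator(X, labels, lam=0.5)
print("estimated support:", np.nonzero(v_hat)[0])
```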

The main focus of the paper is to sharply characterize the variable selection performance of the SDA estimator. From a theoretical perspective, unlike the high-dimensional regression setting, where sharp theoretical results exist for prediction, estimation, and variable selection consistency, most existing theories for high-dimensional discriminant analysis concern estimation consistency or risk consistency, but not variable selection consistency (see, for example, Fan et al., 2012, Cai et al., 2011, Shao et al., 2011). Mai et al. (2012) provide a variable selection consistency result for the SDA estimator in (1.4). However, as we will show later, their obtained scaling in terms of $(n, p, s)$ is not optimal. Though some theoretical analysis of $\ell_1$-norm penalized M-estimators exists (see Wainwright (2009a), Negahban et al. (2012)), these techniques are not applicable to the estimator given in (1.4). In high-dimensional discriminant analysis the underlying statistical model is fundamentally different from that of regression analysis. At a high level, to establish variable selection consistency of the SDA estimator, we characterize the Karush-Kuhn-Tucker (KKT) conditions for the optimization problem in (1.4). Unlike the $\ell_1$-norm penalized least squares regression, which directly estimates the regression coefficients, the solution to (1.4) is a quantity that is only proportional to the Bayes rule's direction $\beta$. To analyze such scaled estimators, we need to resort to different techniques and utilize sophisticated multivariate analysis results to characterize the sampling distributions of the estimated quantities. More specifically, we provide sufficient conditions under which the SDA estimator is variable selection consistent with a significantly improved scaling compared to that obtained by Mai et al. (2012). In addition, we complement these sufficient conditions with information theoretic limitations on recovery of the feature set $S$. In particular, we provide lower bounds on the sample size and the signal level needed to recover the set of relevant variables by any procedure. We identify the family of problems for which the estimator (1.4) is variable selection optimal. To provide more insight into the problem, we analyze an exhaustive search procedure, which requires weaker conditions to consistently select the relevant variables. This estimator, however, is not practical and serves only as a benchmark. The obtained variable selection consistency result also enables us to establish risk consistency for the SDA estimator. In addition, Mai and Zou (2012) show that the SDA estimator is numerically equivalent to the ROAD estimator proposed by Wu et al. (2009), Fan et al. (2012) and the sparse optimal scaling estimator proposed by Clemmensen et al. (2011). Therefore, the results provided in this paper also apply to those estimators. Some of the main results of this paper are summarized below.

Let $\hat{v}$ denote the minimizer of (1.4). We show that if the sample size satisfies

$$n \;\ge\; C \, \frac{s \log(p - s)}{\Lambda_{\min}(\Sigma_{SS})}, \qquad (1.5)$$

where $C$ is a fixed constant which does not scale with $n$, $p$, and $s$, and $\Lambda_{\min}(\Sigma_{SS})$ denotes the minimum eigenvalue of $\Sigma_{SS}$, then the estimated vector $\hat{v}$ has the same sparsity pattern as the true $\beta$, thus establishing variable selection consistency (or sparsistency) for the SDA estimator. This is the first result proving that consistent variable selection in discriminant analysis can be achieved under a theoretical scaling similar to that of variable selection in the regression setting (in terms of $n$, $p$, and $s$). To prove (1.5), we impose the conditions that $\beta_{\min}$ is not too small and that $\|\Sigma_{S^c S}\Sigma_{SS}^{-1}\|_\infty \le 1 - \alpha$ for some $\alpha \in (0, 1]$, where $S^c = [p] \setminus S$. The latter is the irrepresentable condition, which is commonly used in the $\ell_1$-norm penalized least squares regression problem (Zou, 2006, Meinshausen and Bühlmann, 2006, Zhao and Yu, 2006, Wainwright, 2009a). Here $\beta_{\min}$ denotes the smallest absolute value of the non-zero components of $\beta$. Our analysis of information theoretic limitations reveals that, whenever the sample size or the signal level $\beta_{\min}$ falls below a corresponding threshold, no procedure can reliably recover the set $S$. In particular, under certain regimes, we establish that the SDA estimator is optimal for the purpose of variable selection. The analysis of the exhaustive search decoder reveals a similar result. However, the exhaustive search decoder does not need the irrepresentable condition to be satisfied by the covariance matrix. Thorough numerical simulations are provided to demonstrate the sharpness of our theoretical results.
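The irrepresentable condition can be checked numerically for a given covariance model. The following sketch (illustrative; it assumes the matrix $\ell_\infty$-operator-norm form of the condition stated above) evaluates $\|\Sigma_{S^c S}\Sigma_{SS}^{-1}\|_\infty$ for an autoregressive covariance and a contiguous support set.

```python
import numpy as np

def irrepresentable_constant(Sigma, S):
    """Return ||Sigma_{S^c S} Sigma_{SS}^{-1}||_inf (maximum absolute row sum);
    the irrepresentable condition asks this to be at most 1 - alpha."""
    p = Sigma.shape[0]
    Sc = np.setdiff1d(np.arange(p), S)
    M = Sigma[np.ix_(Sc, S)] @ np.linalg.inv(Sigma[np.ix_(S, S)])
    return np.abs(M).sum(axis=1).max()

p = 30
S = np.array([0, 1, 2])
for rho in (0.1, 0.4, 0.7):
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    print(f"rho = {rho}: constant = {irrepresentable_constant(Sigma, S):.3f}")
```

In this autoregressive example the printed constants stay below one, so the condition holds, with a slack that shrinks as the correlation grows.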

In a preliminary work, Kolar and Liu (2013) present some variable selection consistency results related to the ROAD estimator under additional simplifying assumptions. However, it is hard to directly compare their analysis with that of Mai et al. (2012) to understand why an improved scaling is achievable, since the ROAD estimator is the solution to a constrained optimization problem while the SDA estimator is the solution to an unconstrained optimization problem. This paper analyzes the SDA estimator and is directly comparable with the result of Mai et al. (2012). As we will discuss later, our analysis attains a better scaling due to a more careful characterization of the sampling distributions of several scaled statistics. In contrast, the analysis in Mai et al. (2012) hinges on sup-norm control of the deviation of the sample mean and covariance from their population counterparts, which is not sufficient to obtain the optimal rate. Using the numerical equivalence between the SDA and the ROAD estimators, the theoretical results of this paper also apply to the ROAD estimator. In addition, we also study an exhaustive search decoder and information theoretic limits on variable selection in high-dimensional discriminant analysis. Furthermore, we provide discussions on risk consistency and approximate sparsity, which shed light on future investigations.

The rest of this paper is organized as follows. In the remainder of this section, we introduce some more notation. In Section 2, we study sparsistency of the SDA estimator. An information theoretic lower bound is given in Section 3. We characterize the behavior of the exhaustive search procedure in Section 4. Consequences of our results are discussed in more detail in Section 5. Numerical simulations that illustrate our theoretical findings are given in Section 6. We conclude the paper with a discussion and some results on risk consistency and approximate sparsity in Section 7. Technical results and proofs are deferred to the appendix.

1.2 Notation

We denote $[n]$ to be the set $\{1, \ldots, n\}$. Let $S$ be an index set; we denote $v_S$ to be the subvector containing the entries of the vector $v$ indexed by the set $S$, and $M_S$ denotes the submatrix containing the columns of $M$ indexed by $S$. Similarly, we denote $M_{SS}$ to be the submatrix of $M$ with rows and columns indexed by $S$. For a vector $v$, we denote $\operatorname{supp}(v) = \{j : v_j \neq 0\}$ to be the support set. We also use $\|v\|_q$, $q \in [1, \infty)$, to be the $\ell_q$-norm defined as $\|v\|_q = (\sum_j |v_j|^q)^{1/q}$, with the usual extensions for $q \in \{0, \infty\}$, that is, $\|v\|_0 = |\operatorname{supp}(v)|$ and $\|v\|_\infty = \max_j |v_j|$. For a matrix $M$ and $q \in [1, \infty]$, we denote $\|M\|_q$ the operator norm induced by the vector $\ell_q$-norm; in particular, $\|M\|_\infty = \max_i \sum_j |M_{ij}|$. For a symmetric positive definite matrix $A$, we denote $\Lambda_{\min}(A)$ and $\Lambda_{\max}(A)$ to be the smallest and largest eigenvalues, respectively. We also represent the quadratic form $\|v\|_A^2 = v^\top A v$ for a symmetric positive definite matrix $A$. We denote $I$ to be the identity matrix and $1$ to be the vector with all components equal to $1$. For two sequences $a_n$ and $b_n$, we use $a_n \lesssim b_n$ to denote that $a_n \le C b_n$ for some finite positive constant $C$. We also denote $a_n \gtrsim b_n$ to be $b_n \lesssim a_n$. If $a_n \lesssim b_n$ and $a_n \gtrsim b_n$, we denote this as $a_n \asymp b_n$. The notation $a_n = o(b_n)$ is used to denote that $a_n / b_n \to 0$.

2 Sparsistency of the SDA Estimator

In this section, we provide a sharp sparsistency analysis for the SDA estimator defined in (1.4). Our analysis decomposes into two parts: (i) We first analyze the population version of the SDA estimator, in which we assume that $\mu_1$, $\mu_2$, and $\Sigma$ are known. The solution to the population problem provides insight into the variable selection problem and allows us to write down sufficient conditions for consistent variable selection. (ii) We then extend the analysis from the population problem to the sample version of the problem in (1.4). For this, we need to replace $\mu_1$, $\mu_2$, and $\Sigma$ by their corresponding sample estimates $\hat\mu_1$, $\hat\mu_2$, and $\hat\Sigma$. The statement of the main result is provided in Section 2.2, with an outline of the proof in Section 2.3.

2.1 Population Version Analysis of the SDA Estimator

We first lay out conditions that characterize the solution to the population version of the SDA optimization problem.

Let $X_1 \in \mathbb{R}^{n_1 \times p}$ be the matrix whose rows contain the data points from the first class and similarly define $X_2 \in \mathbb{R}^{n_2 \times p}$ to be the matrix whose rows contain the data points from the second class. We denote $H_1 = I_{n_1} - n_1^{-1} 1 1^\top$ and $H_2 = I_{n_2} - n_2^{-1} 1 1^\top$ to be the centering matrices. We define the sample means $\hat\mu_1 = n_1^{-1} X_1^\top 1$ and $\hat\mu_2 = n_2^{-1} X_2^\top 1$, the sample mean difference $\hat\mu_d = \hat\mu_1 - \hat\mu_2$, and the pooled sample covariance matrix $\hat\Sigma = n^{-1}(X_1^\top H_1 X_1 + X_2^\top H_2 X_2)$.

With this notation, observe that the optimization problem in (1.4) can be rewritten as

where we have dropped terms that do not depend on $v$. Therefore, we define the population version of the SDA optimization problem as

(2.1)

Let $w$ be the solution of (2.1). We aim to characterize conditions under which the solution recovers the sparsity pattern of $\beta$. Recall that $S = \operatorname{supp}(\beta)$ denotes the true support set and $s = |S|$; under the sparsity assumption, we have

$$\beta_S = \Sigma_{SS}^{-1} \mu_{dS} \qquad \text{and} \qquad \beta_{S^c} = 0. \qquad (2.2)$$

We define as

(2.3)

The following theorem characterizes the solution to the population version of the SDA optimization problem in (2.1).

Theorem 1.

Let $\lambda > 0$ be a constant and let $w$ be the solution to the problem in (2.1). Under the assumptions that

$$\big\| \Sigma_{S^c S} \, \Sigma_{SS}^{-1} \big\|_\infty \;\le\; 1 - \alpha \qquad \text{for some } \alpha \in (0, 1], \qquad (2.4)$$
(2.5)

we have $w = (w_S^\top, 0^\top)^\top$ with $w_S$ given by

(2.6)

Furthermore, we have $\operatorname{supp}(w) = S$.

Equations (2.4) and (2.5) provide sufficient conditions under which the solution to (2.1) recovers the true support. The condition in (2.4) takes the same form as the irrepresentable condition commonly used in the $\ell_1$-penalized least squares regression problem (Zou, 2006, Meinshausen and Bühlmann, 2006, Zhao and Yu, 2006, Wainwright, 2009a). Equation (2.5) specifies that the smallest component of $\beta_S$ should not be too small compared to the regularization parameter $\lambda$. In particular, if the non-zero components of $\beta$ are bounded away from zero, then (2.5) suggests that $w$ recovers the true support of $\beta$ as long as $\lambda$ is chosen sufficiently small. Equation (2.6) provides an explicit form for the solution $w_S$, from which we see that the SDA optimization procedure estimates a scaled version of the optimal discriminant direction. Whenever $\lambda > 0$, $w$ is a biased estimator. However, such estimation bias does not affect the recovery of the support set of $\beta$ when $\lambda$ is small enough.

We present the proof of Theorem 1, as the analysis of the sample version of the SDA estimator will follow the same lines. We start with the Karush-Kuhn-Tucker (KKT) conditions for the optimization problem in (2.1):

(2.7)

where $z$ is an element of the subdifferential of the $\ell_1$-norm $\|\cdot\|_1$ evaluated at $w$.

Let $w_S$ be defined as in (2.6). We need to show that there exists a subgradient $z$ such that the vector $w = (w_S^\top, 0^\top)^\top$, paired with $z$, satisfies the KKT conditions in (2.7) and $\|z_{S^c}\|_\infty < 1$.

The explicit form of $w_S$ is obtained as the solution to an oracle optimization problem, specified in (2.8), where the solution is forced to be non-zero only on the set $S$. Under the assumptions of Theorem 1, the solution to the oracle optimization problem has all of its components different from zero. We complete the proof by showing that the vector $w = (w_S^\top, 0^\top)^\top$ satisfies the KKT conditions for the full optimization problem.

We define the oracle optimization problem to be

(2.8)

The solution to the oracle optimization problem (2.8) coincides with $w_S$ given in (2.6). It is immediately clear that, under the conditions of Theorem 1, all components of $w_S$ are non-zero.

The next lemma shows that the vector $w = (w_S^\top, 0^\top)^\top$ is the solution to the optimization problem in (2.1) under the assumptions of Theorem 1.

Lemma 2.

Under the conditions of Theorem 1, we have that $w = (w_S^\top, 0^\top)^\top$ is the solution to the problem in (2.1), where $w_S$ is defined in (2.6).

This completes the proof of Theorem 1.

The next theorem shows that the irrepresentable condition in (2.4) is almost necessary for sign consistency, even if the population quantities $\mu_d$ and $\Sigma$ are known.

Theorem 3.

Let $w$ be the solution to the problem in (2.1). If $\operatorname{sign}(w) = \operatorname{sign}(\beta)$, then there must be

$$\big\| \Sigma_{S^c S} \, \Sigma_{SS}^{-1} \operatorname{sign}(\beta_S) \big\|_\infty \;\le\; 1. \qquad (2.9)$$

The proof of this theorem follows arguments similar to those used in the regression setting by Zhao and Yu (2006) and Zou (2006).

2.2 Sample Version Analysis of the SDA Estimator

In this section, we analyze the variable selection performance of the sample version of the SDA estimator defined in (1.4). In particular, we will establish sufficient conditions under which $\hat{v}$ correctly recovers the support set of $\beta$ (i.e., we will derive conditions under which $\hat{v}_j \neq 0$ for all $j \in S$ and $\hat{v}_j = 0$ for all $j \in S^c$). The proof construction follows the same line of reasoning as the population version analysis. However, proving analogous results in the sample version of the problem is much more challenging and requires careful analysis of the sampling distribution of scaled functionals of Gaussian random vectors.

The following theorem is the main result that characterizes the variable selection consistency of the SDA estimator.

Theorem 4.

We assume that the condition in (2.4) holds. Let the penalty parameter $\lambda$ satisfy

(2.10)

where the leading constant is sufficiently large. Suppose that $\beta_{\min}$ satisfies

(2.11)

for a sufficiently large constant. If the sample size satisfies

(2.12)

for some constant, then $\hat{v} = (\hat{v}_S^\top, 0^\top)^\top$ is the solution to the optimization problem in (1.4), where $\hat{v}_S$ is given by

(2.13)

with probability tending to one. Furthermore, $\operatorname{supp}(\hat{v}) = S$.

Theorem 4 is a sample version of Theorem 1 given in the previous section. Compared to the population version result, in addition to the irrepresentable condition and a lower bound on $\beta_{\min}$, we also need the sample size to be large enough for the SDA procedure to recover the true support set with high probability.

At first sight, the conditions of the theorem look complicated. To highlight the main result, we consider a case where the eigenvalues of $\Sigma_{SS}$ and the class proportions are bounded by fixed constants. In this case, it is sufficient that the sample size scales as $n \asymp s \log(p - s)$ and that $\beta_{\min} \gtrsim \sqrt{\log(p - s)/n}$. This scaling is of the same order as for the Lasso procedure, where $n \asymp s \log(p - s)$ is needed for correct recovery of the relevant variables under the same assumptions (see Theorem 3 in Wainwright, 2009a). In Section 5, we provide a more detailed explanation of this theorem and complement it with the necessary conditions given by the information theoretic limits.

Variable selection consistency of the SDA estimator was studied by Mai et al. (2012). Let $\Sigma_m$ denote the marginal covariance matrix of $X$. Under boundedness assumptions on the model parameters, Mai et al. (2012) show that the following conditions

are sufficient for consistent support recovery of $S$. This is suboptimal compared to our results. Inspection of the proof given in Mai et al. (2012) reveals that their result hinges on uniform control of the elementwise deviation of $\hat\Sigma$ from $\Sigma$ and of $\hat\mu_d$ from $\mu_d$. These uniform deviation controls are too rough to establish the sharp results given in Theorem 4. In our proofs, we use more sophisticated multivariate analysis tools to control the deviations of the relevant scaled statistics directly; that is, we focus on analyzing quantities such as $\hat\Sigma_{SS}^{-1}\hat\mu_{dS}$ as a whole instead of studying $\hat\Sigma$ and $\hat\mu_d$ separately.

The optimization problem in (1.4) uses a particular scheme to encode the class labels in the vector $y$, though other choices are possible as well. For example, suppose that we choose $y_i = a$ if the $i$-th observation belongs to class 1 and $y_i = b$ if it belongs to class 2, with constants $a$ and $b$ chosen so that the coding is suitably normalized. The optimality conditions for a vector to be a solution to (1.4) with the alternative coding are

(2.14)
(2.15)

Now, choosing the regularization parameter appropriately, we obtain that the vector satisfying (2.14) and (2.15) is proportional to $\hat{v}$. Therefore, the choice of the coding scheme for the response variable does not affect the result.
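This invariance is easy to verify empirically: after centering, any two two-valued codings of the labels are scalar multiples of each other, so the corresponding lasso solutions agree up to scaling once the regularization parameter is rescaled accordingly. The check below is illustrative and relies on scikit-learn's scaling conventions rather than on the exact normalization used in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 200, 20
X = rng.standard_normal((n, p))
labels = np.repeat([1, 2], n // 2)
X[labels == 1, :2] += 2.0

n1, n2 = np.sum(labels == 1), np.sum(labels == 2)
y_a = np.where(labels == 1, n / n1, -n / n2)   # one zero-mean coding
y_b = np.where(labels == 1, 1.0, -1.0)         # a different coding

# After centering, (y_a - mean(y_a)) = c * (y_b - mean(y_b)) with
c = n ** 2 / (2 * n1 * n2)

lam = 0.2
fit_a = Lasso(alpha=lam * c, fit_intercept=True, tol=1e-10, max_iter=100000).fit(X, y_a)
fit_b = Lasso(alpha=lam, fit_intercept=True, tol=1e-10, max_iter=100000).fit(X, y_b)

# The two coefficient vectors are proportional, hence have the same support.
print("max |coef_a - c * coef_b| =", np.max(np.abs(fit_a.coef_ - c * fit_b.coef_)))
print("same support:", np.array_equal(fit_a.coef_ != 0, fit_b.coef_ != 0))
```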

The proof of Theorem 4 is outlined in the next subsection.

2.3 Proof of Sparsistency of the SDA Estimator

The proof of Theorem 4 follows the same strategy as the proof of Theorem 1. More specifically, we only need to show that there exists an element of the subdifferential of the $\ell_1$-norm such that the solution to the optimization problem in (1.4) satisfies the sample version of the KKT conditions with high probability. For this, we proceed in two steps. In the first step, we assume that the true support set $S$ is known and solve an oracle optimization problem to obtain $\tilde{v}_S$, which exploits the knowledge of $S$. In the second step, we show that there exists a dual variable $\hat{z}$ from the subdifferential of the $\ell_1$-norm such that the vector $\tilde{v} = (\tilde{v}_S^\top, 0^\top)^\top$, paired with $\hat{z}$, satisfies the KKT conditions for the original optimization problem given in (1.4). This proves that $\tilde{v}$ is a global minimizer of the problem in (1.4). Finally, we show that $\tilde{v}$ is the unique solution to the optimization problem in (1.4) with high probability.
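This construction can be mimicked numerically: solve the problem restricted to the true support, then check that the inactive coordinates satisfy the subgradient bound, in which case the zero-padded vector solves the full problem. The sketch below is illustrative only (synthetic data, an assumed zero-mean label coding, and scikit-learn's $(2n)^{-1}$ scaling); it is not the construction used in the formal proof.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 2000, 100
S = np.array([0, 1, 2])
Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
L = np.linalg.cholesky(Sigma)

beta = np.zeros(p)
beta[S] = [2.0, -1.5, 1.5]
mu_d = Sigma @ beta                      # mean difference consistent with beta

labels = np.repeat([1, 2], n // 2)
X = rng.standard_normal((n, p)) @ L.T
X[labels == 1] += mu_d                   # class 1 has mean mu_d, class 2 has mean 0
y = np.where(labels == 1, 2.0, -2.0)     # a zero-mean coding (classes are balanced)

lam = 0.15
Xc, yc = X - X.mean(axis=0), y - y.mean()

# Step 1: oracle problem -- the lasso restricted to the true support S.
oracle = Lasso(alpha=lam, fit_intercept=False, tol=1e-10, max_iter=100000)
oracle.fit(Xc[:, S], yc)
v_S = oracle.coef_

# Step 2: strict dual feasibility -- check the bound |n^{-1} x_j' r| <= lam on S^c.
r = yc - Xc[:, S] @ v_S
Sc = np.setdiff1d(np.arange(p), S)
dual = np.abs(Xc[:, Sc].T @ r) / n
print("oracle support:", np.nonzero(v_S)[0])
print("max dual value on S^c:", dual.max(), " lambda:", lam)
```

If the printed maximum stays below $\lambda$, padding the oracle solution with zeros yields a point satisfying the KKT conditions of the full problem; controlling the probability of this event is exactly the content of Lemmas 5 and 6 below.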

Let $\hat{S}$ be the support of a solution $\hat{v}$ to the optimization problem in (1.4) and let $z \in \partial\|\hat{v}\|_1$ be a corresponding subgradient. Any solution to (1.4) needs to satisfy the following Karush-Kuhn-Tucker (KKT) conditions

(2.16)
(2.17)

We construct a solution to (1.4) and show that it is unique with high probability.

First, we consider the following oracle optimization problem

(2.18)

The optimization problem in (2.18) is related to the one in (1.4); however, the solution is calculated only over the subset $S$ of coordinates, with the remaining quantities restricted accordingly. The solution can be computed in closed form as

(2.19)

The solution is unique, since the corresponding Gram matrix is positive definite with probability one.

The following result establishes that the solution to the auxiliary oracle optimization problem (2.18) recovers the correct sign pattern with high probability, under the conditions of Theorem 4.

Lemma 5.

Under the assumption that the conditions of Theorem 4 are satisfied, the solution $\tilde{v}_S$ to (2.18) has no zero components and $\operatorname{sign}(\tilde{v}_S) = \operatorname{sign}(\beta_S)$ with high probability.

The proof of Lemma 5 relies on a careful characterization of the deviations of several sample quantities, involving $\hat\Sigma_{SS}$ and $\hat\mu_{dS}$, from their expected values.

Using Lemma 5, we have that $\tilde{v}_S$ defined in (2.19) satisfies $\operatorname{sign}(\tilde{v}_S) = \operatorname{sign}(\beta_S)$. Next, we show that $\tilde{v} = (\tilde{v}_S^\top, 0^\top)^\top$ is a solution to (1.4) under the conditions of Theorem 4.

Lemma 6.

Assuming that the conditions of Theorem 4 are satisfied, we have that $\tilde{v} = (\tilde{v}_S^\top, 0^\top)^\top$ is a solution to (1.4) with high probability.

The proof of Theorem 4 will be complete once we show that $\tilde{v}$ is the unique solution. We proceed as in the uniqueness argument of Wainwright (2009a). Let $\check{v}$ be another solution to the optimization problem in (1.4) satisfying the KKT condition

for some subgradient $\check{z} \in \partial\|\check{v}\|_1$. Given the subgradient $\check{z}$, any optimal solution needs to satisfy the complementary slackness condition $\check{z}_j \check{v}_j = |\check{v}_j|$ for all $j$, which holds only if $\check{v}_j = 0$ for all $j$ such that $|\check{z}_j| < 1$. In the proof of Lemma 6, we established that $|\hat{z}_j| < 1$ for $j \in S^c$. Therefore, any solution to (1.4) has the same sparsity pattern as $\tilde{v}$. Uniqueness now follows since $\tilde{v}_S$ is the unique solution of (2.18) when the problem is constrained to the support set $S$.

3 Lower Bound

Theorem 4 provides sufficient conditions for the SDA estimator to reliably recover the true set of nonzero elements of the discriminant direction $\beta$. In this section, we provide results of a complementary nature. More specifically, we provide necessary conditions that must be satisfied for any procedure to succeed in reliable estimation of the support set $S$. Thus, we focus on the information theoretic limits in the context of high-dimensional discriminant analysis.

We denote $\hat{S}$ to be an estimator of the support set $S$, that is, any measurable function that maps the data to a subset of $[p]$. Let $\theta = (\mu_1, \mu_2, \Sigma)$ denote the problem parameters and $\Theta$ the parameter space. We define the maximum risk, corresponding to the $0/1$ loss, as

where $\mathbb{P}_\theta$ denotes the joint distribution of the data under the parameter $\theta$, with $X \mid Y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$ and $X \mid Y = 2 \sim \mathcal{N}(\mu_2, \Sigma)$ (recall that $\beta = \Sigma^{-1}(\mu_1 - \mu_2)$). Let $\mathcal{S}(s)$ be the class of all subsets of the set $[p]$ of cardinality $s$. We consider the parameter space

(3.1)

where $\beta_{\min}$ determines the signal strength. The minimax risk is defined as the infimum of the maximum risk over all estimators $\hat{S}$.

In what follows we provide a lower bound on the minimax risk. Before stating the result, we introduce the following three quantities that will be used to state Theorem 7

(3.2)
(3.3)

and

(3.4)

The first quantity measures the difficulty of distinguishing two close support sets that differ in only one position. The second quantity measures the effect of a large number of support sets that are far from the true support set $S$. The quantity in (3.4) is a threshold for the signal strength. Our main result on the minimax lower bound is presented in Theorem 7.

Theorem 7.

For any , there exists some constant , such that