Classification with Ultrahigh-Dimensional Features
Abstract
Although much progress has been made in classification with high-dimensional features [10, 16, 6, 47], classification with ultrahigh-dimensional features, wherein the features vastly outnumber the sample size, defies most existing work. This paper introduces a novel and computationally feasible multivariate screening and classification method for ultrahigh-dimensional data. Leveraging inter-feature correlations, the proposed method enables detection of marginally weak and sparse signals and recovery of the true informative feature set, and achieves asymptotically optimal misclassification rates. We also show that the proposed procedure provides more powerful discovery boundaries than those in Cai and Sun [6] and Jin [20]. The performance of the proposed procedure is evaluated in simulation studies and demonstrated via the classification of patients with different post-transplantation renal functional types.
Supported in part by NSA Grant H98230-15-1-0260 and NIH Grant 1R01MH105561.

Keywords: classification, correlation, multivariate screening, rare and weak signal model, screening, sparsity.
1 Introduction
High-throughput data such as genomic, proteomic, microarray and neuroimaging data are now routinely collected in many contemporary biomedical studies, wherein the number of variables far exceeds the number of observations. For example, our motivating kidney transplantation study [15] assayed a total of 12,625 genes from 62 tissue samples with 4 different post-transplantation kidney functional types. A key scientific interest lies in classifying patients with different functional types based on their molecular information, for the purpose of precision medicine.
Classification methods for high-dimensional data have been widely studied in recent years [17, 10, 43, 16, 34, 44, 11, 29, 47]. However, most classification approaches [16, 44, 47] and software (penalizedLDA, lda in R [39], etc.) use penalized methods, which are not directly applicable to ultrahigh-dimensional cases due to computational infeasibility.
Furthermore, discriminant methods based on independence rules, which ignore correlations among features [13, 14], have been widely practiced. This is partly because, as Bickel and Levina [2] pointed out, the independence rule may perform better in the high-dimensional case due to ill-conditioned high-dimensional covariance matrices and the cumulative error arising from their estimation. However, this conclusion only holds for marginally strong signals. Most current ultrahigh-dimensional screening methods [10, 6] have assumed the independence rule when detecting marginally strong signals. In many real-world applications, an independence rule is restrictive and frequently fails to detect marginally weak features. Xu et al. [47] showed that even though a feature is marginally nondiscriminative, it can be informative when considered jointly with other marginally differentiating features. For example, Figure 1 indicates that the best classifier involves both features jointly, although one of them has no discriminant power on its own.
Table 1 summarizes popular modern high-dimensional classification methods. It reveals that weak signal detection has been less studied in the literature, even though detecting marginally weak signals has drawn recent attention [4, 24]. In this paper, we propose an ultrahigh-dimensional screening and classification method that accounts for inter-feature correlations and enables detection of marginally weak or nondiscriminant features that might be missed by marginal screening approaches. We cast the proposed method in a unified algorithm for both ultrahigh-dimensional variable selection and classification, motivated by the formulation of linear discriminant analysis (LDA). It contains a screening step that simultaneously selects informative features and estimates the parameters, as well as a classification step. We show that the screening step can recover the informative features with probability tending to one, and that the classification step attains an optimal misclassification rate even for weak signals. Furthermore, it improves the discovery boundary defined in Jin [20] and Cai and Sun [6]. The procedure is computationally efficient, and an R package, “CISLDA,” implementing the proposed method has been developed.
The rest of the paper is organized as follows. In Section 2, we introduce the covariance-insured screening and classification method for ultrahigh-dimensional data. Section 3 develops theoretical properties of the proposed method. In Section 4, the performance of the proposed method is evaluated using simulation studies. We apply the proposed procedure to the renal transplantation data in Section 5, and Section 6 concludes with a discussion.
Methods  Handle ultrahigh dimension?  Consider inter-feature correlation?  Detect marginally weak features?  Identify pair/class-specific features?
Guo [16]  X  X  X  
Witten and Tibshirani [44]  X  X  X  X
Clemmensen et al. [9]  X  X  X  X
Fan and Lv [13]  X  X  X  
Xu et al. [47]  X  
Cai and Sun [6]  X  X  X  
Cai and Liu [5]  X  X
2 Covariance-insured screening and classification
2.1 Notation
Throughout the paper, we use bold uppercase letters to denote vectors and matrices. For a matrix , denote by its transpose, by its th column vector, and by its entry on the th row and th column, . Denote by the cardinality of a set and by the complement of .
Denote by the graph induced by a symmetric matrix the graph whose node set corresponds to the row indices of the matrix and whose edge set contains all the edges; an edge between two nodes exists if and only if the corresponding off-diagonal entry is nonzero. For a subset of nodes, denote the principal submatrix restricted to that subset and the corresponding edge set. We say that a subgraph is a connected component if the following two conditions are satisfied: (i) any two nodes in it are connected by edges within it, and (ii) it is maximal, i.e. no node in it is connected by an edge to any node outside it.
For a symmetric matrix , denote by the trace of , denote by and the minimum and maximum eigenvalues of . Denote the operator norm and the Frobenius norm by and , respectively.
2.2 Method
We assume that there are distinct classes. Let the class label take values accordingly, with the class probabilities summing to one. Let X be a vector of features whose distribution, conditional on the corresponding class membership, follows a multivariate Gaussian
(2.1) 
where is a dimensional vector, and is a positive definite matrix. Let be independent observations of . When , (2.1) can be rewritten as
(2.2) 
For ease of presentation, we assume two classes throughout. We note that the theoretical properties, implementation, and our R package can all be readily extended to the multiclass case.
We recast (2.2) in the framework of rare/weak feature models [20], where is only different from in fraction of features. Specifically, let be samples from a Bernoulli() distribution, and
(2.3) 
where denotes the th coordinate of , . and for all . Equation (2.3) assigns the mean difference between the two classes on features with . Set . Therefore, controls the strength of signals and controls the rareness of marginally differentiable features. Jin [20] and Cai and Sun [6] showed that in the ultrahighdimensional case with both small and , it is impossible to discriminate one class from the other. Following Jin [20], let for some and for some . Then and are controlled by and , respectively. As gets closer to the signal becomes weaker, while as gets closer to the signal becomes sparser.
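As a concrete illustration of the rare/weak feature model above, the following sketch simulates Bernoulli signal indicators and the resulting mean-difference vector. The parameter values and names (`beta`, `r`, `tau`) are our own illustrative choices, not values from the paper; the construction of the signal strength from the sparsity and strength exponents follows the standard rare/weak parameterization.

```python
import numpy as np

# Rare/weak feature model sketch (hypothetical parameter values)
rng = np.random.default_rng(0)
p = 10_000                         # number of features
beta, r = 0.6, 0.4                 # sparsity / strength exponents (our choices)
eps = p ** (-beta)                 # fraction of differential features
tau = np.sqrt(2 * r * np.log(p))   # per-feature signal strength

b = rng.binomial(1, eps, size=p)   # Bernoulli(eps) signal indicators
mu1 = np.zeros(p)                  # class-1 mean
mu2 = tau * b                      # class-2 mean differs only on signal features

# Larger beta -> sparser signals; smaller r -> weaker signals.
n_signals = int(b.sum())
```

Increasing `beta` toward one drives the expected number of signals toward a vanishing fraction of `p`, which is the regime where marginal screening alone starts to fail.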
Next we outline a new covariance-insured screening (CIS) and classification method that utilizes the feature correlation structure to detect signals that are marginally uninformative but jointly informative (MUJI). Denote the precision matrix, whose off-diagonal entry is zero if and only if the corresponding pair of variables is conditionally uncorrelated given the rest of the variables. It was shown in [47] that a necessary and sufficient condition for a variable to be noninformative in differentiating the two classes is
(Condition A) 
Accurately estimating the precision matrix involves inverting a large square matrix and is computationally expensive. However, under a sparsity assumption on the precision matrix, it is possible to estimate the relevant entries without uncovering the whole matrix. Specifically, assume that in each row of the precision matrix, the fraction of nonzero entries is small, so that each marginally informative signal is partially correlated with only a limited number of MUJI signals. Under the sparsity assumption, the induced graph is sparse in the sense that each node has small degree, and each feature is partially correlated with only a limited number of other features. Suppose the graph admits a decomposition into connected components, with precision submatrices corresponding to those connected components, i.e.
where each subgraph corresponds to a connected component. Condition A reveals that if a marginally noninformative feature is jointly informative, there must exist a marginally informative feature with which it is partially correlated. That is, the feature has to be connected to some marginally informative feature. This implies that, in order to detect the MUJI signals, one only needs to focus on the connected components that contain at least one marginally informative feature. See Figure 2 for a graphical illustration.
Under the sparsity assumption, the connected components and the graph structure of can be inferred by thresholding the sample covariance matrix . Denote by the thresholded sample covariance matrix with a threshold , where , with being the indicator function. Denote by the graph corresponding to and suppose it can be decomposed into connected components
with , , being the node set, edge set and the submatrix in corresponding to the th connected component, . For two symmetric matrices and , we say that graph belongs to graph , denote by , if and .
Lemma 1.
Let be the connected component of a marginally informative feature in , and be the connected component of feature in , then for sufficiently large and , we have
where , are some positive constants and .
Remark. The efficiency gain in estimating the precision matrix stems from two facts: (i) instead of solving an ultrahigh-dimensional graphical lasso problem, the method computes the connected components of the thresholded sample covariance matrix, reducing computational cost; (ii) it only needs to detect the connected components containing marginally informative features. Lemma 1 guarantees the efficacy of the proposed method in reconstructing the connected component structure.
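The thresholding-plus-connected-components idea above can be sketched in a few lines. The function name `thresholded_components`, the toy data, and the threshold value are our own illustrative choices; we use off-the-shelf sparse-graph routines rather than the paper's implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def thresholded_components(X, delta):
    """Threshold the sample covariance of X (n x p) at delta and return
    (number of components, component label of each feature)."""
    S = np.cov(X, rowvar=False)
    A = np.abs(S) >= delta
    np.fill_diagonal(A, False)          # keep edges only, no self-loops
    return connected_components(csr_matrix(A), directed=False)

# toy check: features 0-2 share a latent factor, features 3-5 are pure noise
rng = np.random.default_rng(1)
z = rng.standard_normal(300)
X = rng.standard_normal((300, 6))
X[:, :3] = z[:, None] + 0.3 * X[:, :3]
n_comp, labels = thresholded_components(X, delta=0.5)
```

With the strong shared factor, features 0-2 land in one component while each noise feature remains a singleton, mirroring how the screening step isolates correlation blocks without ever forming the full precision matrix.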
2.3 Covariance-insured screening and classification steps
Given the data set , denote by the number of samples in class , . Denote by the value of feature for sample and the mean of feature in class , . We further detail the proposed procedures that consist of screening and classification steps.

Screening step

1. Standardize the feature matrix column-wise.

2. Select the set of features satisfying
(2.4)
where the thresholding parameter controls the strength of the marginal effects.

3. For a given threshold, construct a sparse estimate of the covariance matrix by setting its entries below the threshold to zero.

4. For each feature in the selected set, detect its connected component in the thresholded covariance graph and denote the corresponding index set. Suppose the first components are the smallest number of components covering the selected features. Compute the precision matrix of these components.

5. Evaluate Condition A and select informative features accordingly, for some prespecified threshold. In our numerical studies, we rank the magnitudes of the estimated Condition A quantities, which serve as importance scores of the predictors, and select up to the first features, with the cap determined by the sample size.
Remark. The optimal tuning parameter values are selected through cross-validation. Resampling strategies such as stability selection [31] can be applied to reduce the variation from random sampling.
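Condition A, which underlies step 5, can be illustrated with a toy example. The function name `cis_importance` and the numeric values below are our own: the importance score of a feature is the magnitude of the corresponding coordinate of the (estimated, block-restricted) precision matrix applied to the mean-difference vector, so a marginally null feature can still receive a nonzero score through its partial correlations.

```python
import numpy as np

def cis_importance(mean_diff, Omega_hat):
    """Importance scores based on Condition A: feature j is informative
    exactly when the j-th coordinate of Omega @ mean-difference is nonzero.
    Omega_hat is the block-diagonal estimated precision matrix restricted
    to the retained connected components (names here are our own)."""
    return np.abs(Omega_hat @ mean_diff)

# toy example: feature 2 has zero marginal mean difference but is
# partially correlated with feature 0, so it earns a nonzero score (MUJI)
Omega = np.array([[ 2.0, 0.0, -0.8],
                  [ 0.0, 1.0,  0.0],
                  [-0.8, 0.0,  2.0]])
d = np.array([1.0, 0.0, 0.0])   # marginal mean differences
scores = cis_importance(d, Omega)
```

Here feature 1 is uncorrelated noise and scores exactly zero, while feature 2 is marginally null yet scores 0.8, which is how the procedure rescues MUJI signals that marginal screening would drop.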

Post-screening classification
A new observation is classified with the linear discriminant decision rule, where the superscript “CIS” restricts the features to the set selected in the screening step.
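A minimal sketch of a post-screening linear discriminant rule of this form, assuming two classes and generic Fisher-rule notation (the names are ours, not the paper's estimators):

```python
import numpy as np

def cis_classify(x, mu1, mu2, Omega):
    """Post-screening linear discriminant rule on the selected feature set.
    All inputs are already restricted to the CIS-selected coordinates; this
    is a generic Fisher-rule sketch under known parameters."""
    w = Omega @ (mu1 - mu2)                # discriminant direction
    score = (x - 0.5 * (mu1 + mu2)) @ w    # signed distance from the midpoint
    return 1 if score > 0 else 2
```

In practice the class means and precision matrix would be replaced by their screening-step estimates, with the precision matrix block-diagonal over the retained connected components.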
Remark. Notice that steps 2–4 of the screening step mainly threshold the marginal mean differences and the correlation parameters; they estimate the parameters needed to construct the screening statistics. Step 5 performs variable selection based on Condition A. Although steps 2–4 can identify marginally informative signals and their connected components, they do not by themselves select MUJI signals without step 5.
At step 4 of screening, for each feature in the marginally selected set, we look for features with which its thresholded sample covariance is relatively large. For each of those features, we then recursively look for their connected features until all the features in the connected component are found. Other connected-component labeling algorithms, such as those listed in [35], can be employed. For a specific feature, it is possible that its connected component is of a greater dimension than the sample size, and calculating its precision matrix is by itself a complex high-dimensional graphical model problem. In such cases, we consider an m-depth connected subgraph, which contains the features connecting to the given feature through at most m edges in the thresholded graph. We then restrict the covariance and precision matrices to the features in this subgraph only. This is similar to the bounded-size subgraph in Zhang, Jin and Fan [48]. With a proper choice of m, one can avoid inverting a high-dimensional covariance matrix while involving essentially all the highly correlated features. The impact of different sizes of m on screening and classification is explored in Section 4. Using different depths for different connected components is, to some extent, equivalent to adaptively thresholding the corresponding correlation blocks for different features.
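The m-depth connected subgraph described above can be extracted with a breadth-first search; the function name `depth_m_subgraph` and the adjacency-dictionary representation are our own illustrative choices.

```python
from collections import deque

def depth_m_subgraph(adj, seed, m):
    """Return the features reachable from `seed` within m edges in the
    thresholded correlation graph; `adj` maps each node to its neighbor
    set. A sketch of the m-depth subgraph used to avoid inverting a huge
    correlation block."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == m:
            continue                       # do not expand past depth m
        for nb in adj.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return seen
```

On a chain graph 0–1–2–3, depth 1 from node 0 yields {0, 1} and depth 2 yields {0, 1, 2}, so m directly caps the size of the block whose precision matrix must be computed.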
3 Theoretical properties
Let be the true informative set. Then , where and . And .
Let be the index set of features selected by the CIS screening step with thresholding parameters . Let , where is a constant not depending on . Note that can be used as a measure of the overall sparsity of [3, 34, 12]. Then, under the following assumptions, the CIS screening step can uncover the true informative set with probability tending to one.
(A1) for some positive ;
(A2) for some ;
(A3) for some such that for in (A2);
(A4) there exist positive constants and such that ;
(A5) , and ;
(A6) , i.e. both the marginally informative signals and the MUJI signals are sparse enough.
Theorem 2.
(Sure Screening Property) Under assumptions (A1)–(A5), for any ,
(3.1) 
Theorem 3.
(False Positive Control Property) Under assumptions (A1)–(A6), for any , we have
(3.2) 
Remark. Theorems 2 and 3 ensure that, with probability tending to one, the screening step of the CIS method captures all and only the informative features.
Let denote the misclassification rate of the oracle rule, or the optimal rule, which assumes that the parameters are known [30]. Under model (2.1), it can be shown that the oracle misclassification rate can be written in terms of the standard normal cumulative distribution function [30]. Consider the following definitions introduced in Shao et al. [34].
Definition 4.
Let be a classification rule with the conditional misclassification rate , given the training sample .


(D3.1) is asymptotically optimal if ;

(D3.2) is asymptotically suboptimal if .
For any thresholding parameters and of the order specified in assumption (A5), denote by and by the estimated precision matrix from the CIS screening step (fill the entries with zeros if ). We first consider a classification rule using all the features: with and . Its misclassification rate is given by [10]:
(3.3) 
We further assume the following:
(A7) There exist positive constants and such that ;
(A8) Assume that ;
(A9) Let be the number of features with nonzero mean differences between two classes. Then ;
(A10) Let for some constant and . Then .
Theorem 5.
(Asymptotic Misclassification Rate) Under assumptions (A7)–(A10), when ignoring the effect of variable selection based on Condition A in the screening step and constructing a classification rule using only the estimated parameters, we have:
Theorem 5 reveals that in ultrahighdimensional settings, when not combining with a variable selection method, building a decision rule simply based on all features can lead to a misclassification error no better than a mere random guess. Fan and Fan [10] showed that in the case of uncorrelated features, for any classifier , the misclassification rate is for some constant . When the signal strength is not strong enough to balance out the increasing dimensionality , as and thus . Similarly for the proposed CIS classification, when all the features are used, according to Theorem 5, the signal strength needs to be strong enough to balance out to achieve asymptotic optimal and zero misclassification rates. When , Theorem 5 reduces to Theorem 3 in [34], where sparsity is only assumed for marginally informative features, but not for MUJI features.
However, according to Theorems 2 and 3, the proposed CIS screening step retains only truly informative predictors with large probability. If we base the postscreening classification only on , it can achieve better classification performance, because then only needs to be strong enough to balance out to yield an asymptotic zero and optimal postscreening misclassification rate.
Define the CIS post-screening classifier as , where the superscript CIS restricts the features to the set obtained from the CIS screening step. Then its misclassification rate is
Theorem 6.
Now let be the informative feature set discovered by a decision rule . In order to characterize the phase transition in terms of optimal discovery, Cai and Sun [6] defined the discovery boundary that satisfies the following conditions:

The false positive rate is vanishingly small, i.e. ;

A nonempty discovery set is constructed with high probability, i.e. .
Assuming that the marginally informative features have a common marginal variance and marginally uninformative ones have a marginal variance , Cai and Sun [6] derived the discovery boundary that divides the  plane into discoverable and nondiscoverable areas:

For , ;

For ,

For ,
Specifically, they showed that (a) If , it is possible to find a classification rule that fulfills (C1) and (C2) simultaneously. (b) If , it is impossible to find a classification rule that fulfills (C1) and (C2) simultaneously. A similar detection boundary about reliably detecting any existing signals was defined in [20]:
The discovery boundaries coincide with on . And on . For , one can detect the existence of signals reliably but it is impossible to separate any signal from noise [6].
However, when accounting for inter-feature correlations, we show in Theorem 7 that the discovery boundary can be improved, and marginally uninformative features that would have been undiscoverable in [6] and [20] can become “discoverable.” Though finding the exact optimal discovery boundary for model (2.2) is beyond the scope of this paper, we do provide an upper bound for it and show that this upper bound lies below the discovery and detection boundaries in [6] and [20] (see Figure 3).
For simplicity, from now on we assume a common signal strength and a common marginal variance across features. We also assume that, on average, the number of marginally noninformative signals correlated with some marginally informative signal is larger than the number of marginally informative signals. In most cases this is a reasonable assumption; for example, in genomic data, MUJI markers are usually less sparse than the marginally informative ones.
Theorem 7.
(Discovery Boundary) Denote by the discovery boundary associated with the CIS decision rule. Let , , then


(T6.1) For , ;

(T6.2) For ,

(T6.3) For ,
Define the power of the discovery boundary of some classification procedure as the ratio of the discoverable area to the nondiscoverable one. From Theorem 7, we can observe that for any single signal, the discovery boundary from the CIS procedure is at least as powerful as the discovery boundary in [6]. Moreover, as the MUJI signals get less sparse, the CIS discovery boundary has more power. Figure 3 depicts the discovery boundaries for the optimal screening procedure in Cai and Sun [6] and the CIS procedure with different values of . Theorem 7 implies that when , the detection boundary from the CIS procedure has more power than that in [20].
4 Numerical studies
In this section, we assess the finite sample performance of the proposed CIS screening and classification method and compare it with the marginal screening method [6, 20]. First, we generate 3 classes of equal sample size with 10,000 features, of which only the first 20 variables are informative.
The mean structure of the first 20 variables is given in Table 2. For example, variables 1–4 and 11–14 have mean 0 in both classes 1 and 2; variables 5 and 15 have mean 0.5 in class 1 and mean 2 in class 2; and variables 6–10 and 16–20 have mean -1.5 in class 1 and mean 1.5 in class 2. Therefore variables 5, 15, 6–10 and 16–20 are all marginally informative for differentiating classes 1 and 2. Variables 1–4 and 11–14, though marginally noninformative for differentiating classes 1 and 2, are correlated with variables 5 and 15, respectively. Effectively, all of the first 20 variables are informative for differentiating classes 1 and 2.
Variables 1–5 and 11–15 are marginally informative for differentiating classes 2 and 3. Variables 6–10 and 16–20 are marginally noninformative for differentiating classes 2 and 3, and are set to be independent of variables 1–5 and 11–15, respectively. Therefore, variables 6–10 and 16–20 are noninformative for differentiating classes 2 and 3.
Variables  Class 1  Class 2  Class 3
V1–4, V11–14  0  0  2.5
V5, V15  0.5  2  2.5
V6–10, V16–20  -1.5  1.5  1.5
The remaining 9,980 variables are independently and identically distributed across all classes and are independent of the first 20 variables; hence they are noninformative for differentiating either classes 1 and 2 or classes 2 and 3.
Our goal is to examine whether the proposed CIS classification method has more power to detect the MUJI variables 1–4 and 11–14 when classifying classes 1 and 2, and whether it performs similarly to the marginal screening methods when classifying classes 2 and 3.
Regarding the correlation structure between the MUJI variables and the marginally informative variables, we divide the first 20 variables into four correlation blocks and set variables from different correlation blocks to be independent. Variables within the same block are governed by the same correlation structure. We specifically consider two correlation structures:
Example 1 (Compound Symmetry): each block has a compound symmetry (CS) correlation structure with a common correlation coefficient. The MUJI variables 1–4 and 11–14 for differentiating classes 1 and 2 are equally correlated with the marginally informative variables 5 and 15, respectively.
Example 2 (First-Order Autocorrelation): each block has a first-order autoregressive (AR(1)) correlation structure. The farther the MUJI variables are from the marginally informative ones, the weaker their correlations.
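The two correlation structures can be generated as follows; `cs_block` and `ar1_block` are our own helper names, and the correlation coefficient is left as a parameter since the simulation's exact value is not reproduced here.

```python
import numpy as np

def cs_block(k, rho):
    """k x k compound-symmetry (CS) correlation block: all off-diagonal
    correlations equal rho, unit diagonal."""
    return (1 - rho) * np.eye(k) + rho * np.ones((k, k))

def ar1_block(k, rho):
    """k x k first-order autoregressive (AR(1)) correlation block:
    corr(i, j) = rho ** |i - j|, so correlation decays with distance."""
    idx = np.arange(k)
    return rho ** np.abs(idx[:, None] - idx[None, :])
```

Under CS, every MUJI variable in a block is equally tied to the block's marginally informative variable; under AR(1), variables farther from it carry exponentially weaker correlations, which is exactly the contrast the two examples are designed to probe.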
We generate 500 data sets, each of which is divided into a training set with 100 samples per class and a testing set with 50 samples per class. Features are selected by applying the screening procedure (steps 1–5) described in the previous section to the training set. Post-screening classification is performed on the testing set. The thresholding tuning parameter is obtained by 5-fold cross-validation. We also explore different correlation thresholding values and report the corresponding screening results. In both examples, we fix the depth parameter m, which controls the maximum number of edges that connect any variable to a marginally informative variable in its connected component. In our numerical experience, the choice of m is not critical.
We compare the performance of the proposed method and the marginal screening method using the false positives (FP), false negatives (FN), sensitivity (se), specificity (sp), and the minimum model size (MMS), i.e., the minimum number of features needed to include all informative features. To assess the classification performance, we report the percentage of misclassified samples (ER). Tables 3 and 4 summarize the results under the CS and AR(1) correlation structures. The presented values are averages over the 500 replicates, except that the median is reported for the MMS. Tables 3 and 4 reveal that in classifying between classes 1 and 2, where MUJI signals exist, the proposed method yields much smaller false negatives and much larger sensitivity values. The smaller MMS for the proposed method also indicates that it identifies MUJI signals more efficiently than the marginal screening method. A smaller correlation thresholding value gives a better screening result, as it includes more marginally weak features, but at the cost of increased computational time and memory for detecting their connected components. In classifying between classes 2 and 3, where no MUJI signals exist, the proposed method performs similarly to the marginal screening method in both screening and classification.
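For reference, the screening metrics reported in Tables 3 and 4 can be computed from a selected feature set versus the true informative set as below; the function name and the example numbers are ours.

```python
def screening_metrics(selected, truth, p):
    """FP, FN, sensitivity and specificity of a screening result, given the
    selected index set, the true informative set, and the total number of
    features p."""
    selected, truth = set(selected), set(truth)
    fp = len(selected - truth)                     # noise features selected
    fn = len(truth - selected)                     # informative features missed
    se = (len(truth) - fn) / len(truth)            # sensitivity
    sp = (p - len(truth) - fp) / (p - len(truth))  # specificity
    return fp, fn, se, sp
```

For instance, with 20 true features out of 10,000, selecting 18 of them plus one noise feature gives FP = 1, FN = 2 and sensitivity 0.9, matching the scale of the entries in the tables.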
We further investigate the effect of the depth parameter m on feature selection and classification in Example 3, where data are simulated from the CS correlation structure and the other settings remain the same as in Example 1. Table 5 shows that CIS gives satisfactory screening and classification performance even when the depth m is small.
Method  class pair  FP#  FN#  se  sp  MMS  ER  
CIS  12  1.1  0.1  0.994  0.999  20  1.4  
(3.4)  (0.5)  (0.02)  ()  (0)  (1.1)  
23  9.3  0  1  0.999  19  3.1  
(3.2)  (0)  (0)  ()  (3)  (2.9)  
MS  12  16.4  9.1  0.54  0.998  8685  3.1  
(23.9)  (3.2)  (0.2)  (0.002)  (2049)  (4.2)  
23  3.0  0  1  0.999  10  3.4  
(14.9)  (0)  (0)  (0.001)  (0)  (0.9)  
CIS  12  11.8  2.5  0.88  0.999  27  2.7  
(22.4)  (3.4)  (0.07)  (0.002)  (7)  (1.5)  
23  5.6  0  1  0.999  10  2.9  
(11.0)  (0)  (0)  ()  (10)  (2.1)  
MS  12  18.0  9.0  0.55  0.998  8634  3.0  
(29.1)  (3.1)  (0.15)  (0.003)  (2060)  (4.1)  
23  3.5  0  1  0.999  10  3.4  
(17.1)  (0)  (0)  (0.002)  (0)  (1.0)  
CIS  12  12.2  7.8  0.61  0.999  21  1.6  
(19.6)  (0.5)  (0.02)  (0.002)  (2)  (0.9)  
23  2.7  0  1  0.999  10  3.4  
(9.5)  (0)  (0)  ()  (0)  (0.9)  
MS  12  17.3  9.1  0.54  0.998  8876  3.1  
(25.8)  (3.2)  (0.2)  (0.003)  (1913)  (4.3)  
23  2.7  0  1  0.999  10  3.4  
(9.5)  (0)  (0)  ()  (0)  (0.9)  
* MS = marginal screening, ignoring the inter-feature correlation. FP = false positives, FN = false negatives, se = sensitivity, sp = specificity, MMS = median model size over the 500 replicated experiments, ER = misclassification error rate (percentage of samples whose class label was mistakenly predicted). Numbers in parentheses are standard errors for FP, FN, se, sp and ER, and IQRs for MMS.
Method  class pair  FP#  FN#  se  sp  MMS  ER  

CIS  12  0.9  0.05  0.997  0.999  20  0.2  
(3.2)  (0.3)  (0.01)  ()  (0)  (0.9)  
23  9.9  0  1  0.999  17  0.9  
(4.7)  (0)  (0)  ()  (5)  (1.5)  
MS  12  19.9  9.6  0.52  0.998  8774  2.1  
(35.4)  (3.6)  (0.2)  (0.003)  (1998)  (5.0)  
23  4.5  0  1  0.999  10  1.2  
(20.5)  (0)  (0)  0.002  (0)  (0.7)  
CIS  12  9.1  5.3  0.74  0.999  24  0.6  
(15.5)  (3.4)  (0.17)  (0.002)  (5)  (1.3)  
23  10.9  0  1  0.999  20  1.4  
(23.2)  (0)  (0)  (0.002)  (19)  (1.7)  
MS  12  11.2  11.0  0.45  0.999  9354  3.3  
(16.1)  (4.7)  (0.23)  (0.002)  (418)  (6.1)  
23  4.6  0  1  0.999  10  1.4  
(21.7)  (0)  (0)  (0.002)  (0)  (0.7)  
CIS  12  1.3  9.1  0.54  0.999  20  0.4  
(3.4)  (1.9)  (0.09)  ()  (0)  (1.2)  
23  0.2  0.008  0.999  0.999  10  0.9  
(1.3)  (0.09)  (0.009)  (0.001)  (0)  (0.6)  
MS  12  1.9  11.7  0.41  0.999  7470  5.6  
(4.3)  (5.3)  (0.3)  ()  (3679)  (7.4)  
23  0.2  0.02  0.998  0.999  10  1.0  
(1.3)  (0.4)  (0.04)  ()  (0)  (0.7) 
Method  class pair  FP#  FN#  se  sp  MMS  ER  

CIS  12  11.8  2.5  0.88  0.999  27  2.7  
(22.4)  (3.4)  (0.07)  (0.002)  (7)  (1.5)  
23  5.6  0  1  0.999  10  2.9  
(11.0)  (0)  (0)  ()  (10)  (2.1)  
MS  12  18.0  9.0  0.55  0.998  8634  3.0  
(29.1)  (3.1)  (0.15)  (0.003)  (2060)  (4.1)  
23  3.5  0  1  0.999  10  3.4  
(17.1)  (0)  (0)  (0.002)  (0)  (1.0)  
CIS  12  12.7  4.1  0.80  0.999  28  2.1  
(19.9)  (2.9)  (0.14)  (0.002)  (8)  (1.8)  
23  8.6  0  1  0.999  10  4.3  
(20.0)  (0)  (0)  (0.002)  (8)  (1.7)  
MS  12  21.6  9.3  0.54  0.998  8752  2.6  
(33.5)  (3.4)  (0.2)  (0.003)  (1749)  (4.3)  
23  1.6  0  1  0.999  10  2.8  
(6.4)  (0)  (0)  ()  (0)  (0.7)  
CIS  12  15.1  5.9  0.71  0.998  24  1.5  
(24.3)  (1.8)  (0.09)  (0.002)  (10)  (1.6)  
23  7.3  0  1  0.999  10  3.8  
(20.1)  (0)  (0)  (0.002)  (7)  (1.5)  
MS  12  21.1  9.6  0.52  0.998  8772  2.9  
(32.8)  (3.7)  (0.2)  (0.003)  (1630)  (4.6)  
23  1.4  0  1  0.999  10  2.8  
(6.7)  (0)  (0)  ()  (0)  (0.7) 
5 Real data analysis
There are 4 functional types of tissues among the 62 kidney tissue samples in the kidney transplant study [15]: 17 normal donor kidneys (C), 19 well-functioning kidneys more than 1 year post-transplant (TX), 13 biopsy-confirmed acute rejections (AR), and 13 acute dysfunctions with no rejection (NR). Each sample has gene expression profiles assayed from kidney biopsies and peripheral blood lymphocytes on 12,625 genes. Distinguishing these 4 types of tissues is important for balancing the need for immunosuppression to prevent rejection against minimizing drug-induced toxicities.
We assess the classification performance by predicting each individual's class using a leave-one-out procedure. Specifically, for each sample, we classify it between any class pair using the remaining samples belonging to that pair to select informative features and optimal tuning parameters. The thresholding parameter is chosen by 5-fold cross-validation. The covariance (or correlation, as we have standardized each feature) thresholding parameter is fixed based on the median absolute correlation between features in our data, and the depth parameter is fixed as well. Since there are 4 classes, a majority voting strategy is employed to decide which class a sample is eventually classified to. For example, if a sample is classified to class C between C and TX, to C between C and AR, to C between C and NR, to TX between TX and AR, to NR between TX and NR, and to NR between AR and NR, then the sample is eventually classified to C. This pairwise comparison strategy reduces the multiclass problem to multiple two-class problems [37]; the leave-one-out plus majority-voting classification error is 6 out of 62 samples.
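The one-vs-one majority-voting scheme described above can be sketched as follows; `pairwise_winner` stands in for the pairwise CIS classifier, and the hard-coded outcomes reproduce the worked example in the text.

```python
from collections import Counter
from itertools import combinations

def majority_vote(pairwise_winner, classes, x):
    """One-vs-one aggregation: classify x between every class pair, then
    pick the class with the most pairwise wins (ties broken by class order)."""
    votes = Counter(pairwise_winner(x, a, b)
                    for a, b in combinations(classes, 2))
    return max(classes, key=lambda c: votes[c])

# the worked example from the text: C beats TX, AR and NR; TX beats AR;
# NR beats TX and AR, so C wins with 3 of the 6 pairwise votes
wins = {("C", "TX"): "C", ("C", "AR"): "C", ("C", "NR"): "C",
        ("TX", "AR"): "TX", ("TX", "NR"): "NR", ("AR", "NR"): "NR"}
label = majority_vote(lambda x, a, b: wins[(a, b)],
                      ["C", "TX", "AR", "NR"], None)
```

With 4 classes this requires 6 pairwise classifiers, each trained only on the two classes it separates, which is what keeps the per-pair feature selection focused.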
To select informative variables, we draw 100 bootstrap samples for each pairwise two-class comparison and use 5-fold cross-validation to select the tuning parameters. Top genes with high selection frequencies for each class pair are identified and reported in Table 6. MUJI genes are indicated in the third column of Table 6. The last column lists literature that validates the biological relevance of the selected genes.
Our method identifies some MUJI genes as well as some marginally informative genes supported by the clinical literature. For example, DDX19A has been identified as a MUJI gene with high discriminating power between classes C and AR. [26] identified DDX19A, a member of the DEAD/H-box protein family, as a novel component of the NLRP3 inflammasome, which plays a major role in innate immune and inflammatory pathologic responses relevant to post-renal-transplantation recovery. To our knowledge, DDX19A has not been reported elsewhere as a statistically informative gene for post-renal-transplantation rejection types.
The individual and joint effects of some MUJI genes are illustrated in Figure 4. The marginal effects of these MUJI genes and their correlated, marginally discriminative genes, together with the ranks of the MUJI genes' importance scores, are summarized in Table 7. For example, in the top panel of Figure 4, the marginal standardized mean difference between C and TX for gene ADAM8 is 0.48 and ranks 5,071 out of 12,625 genes. Yet ADAM8 is highly correlated with a marginally discriminative gene, IFITM1, whose standardized marginal mean difference ranks 22. By incorporating the covariance structure, the CIS procedure promotes the rank of the importance score for gene ADAM8 to 18.
Table 6: Top genes with high selection frequencies for each class pair.

Gene | class pair (selection frequency %) | MUJI? | literature evidence
PABPC4 | C-TX (54) | | Chen et al. [8]
GNB2L1 | C-AR (66) | | Wang et al. [41]; TCGA [38]; Li et al. [25]
TG (Thyroglobulin) | C-AR (57) | | Ju et al. [21]; Wu et al. [46]; Sellittia et al. [33]; Luo et al. [28]
PTTG1 | C-AR (52) | | Hamid, Malik and Kakar [18]; Wondergem et al. [45]; Wei et al. [42]; Kristopher et al. [23]
DDX19A | C-AR (44) | yes | Li et al. [26]
ASH2L | C-AR (40) | | Santa et al. [32]
MAPK3 | C-ADNR (100) | | Zhang et al. [49]; Cassidy et al. [7]; Awazu, Omori and Hida [1]; Zhang et al. [50]; Zhang et al. [51]; Kim and Choi [22]
TMEM199 | AR-ADNR (53) | | Hogan et al. [19]
KIAA1467 | AR-ADNR (40) | | Sui et al. [36]
… | … | … | …

* MUJI = marginally uninformative but jointly informative. ** Selection frequency = the percentage of bootstrapped data sets in which the gene is selected.
Table 7: Marginal effects of MUJI genes and their correlated marginally informative genes.

MUJI gene | mean diff. | rank | correlated MI gene | mean diff. of MI gene | rank of MI gene | correlation | rank of IS after CIS
ADAM8 | 0.48 | 5071 | IFITM1 | 1.35 | 22 | 0.70 | 18
HDDC2 | 0.82 | 2497 | ATP5D | 1.53 | 51 | 0.71 | 77
PSME2.1 | 0.84 | 3962 | XRCC6.1 | 1.71 | 22 | 0.74 | 22

* Mean diff. = standardized marginal mean difference of the gene. ** MI = marginally informative; IS = importance score.
Figure 5 illustrates the classification effect of a MUJI signal, gene IPO5, when considered together with a marginally stronger signal, gene TTC37. Considered by itself, gene TTC37 attains a minimal misclassification number of 10, as shown in the left panel of Figure 5. When considered jointly with gene IPO5, the minimal misclassification number reduces to 2, as shown in the right panel. Figure 5 clearly demonstrates that the two genes have more discriminating power when considered jointly than individually.
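The MUJI phenomenon is easy to reproduce on simulated data: a feature that barely separates the classes on its own can sharply reduce the misclassification count once its correlation with a marginally stronger feature is exploited. The class means, correlation, and Fisher-discriminant rule below are illustrative assumptions, not the actual values for genes TTC37 and IPO5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
cov = [[1.0, 0.9], [0.9, 1.0]]                     # strong within-class correlation
x0 = rng.multivariate_normal([0.0, 0.0], cov, n)   # class 0
x1 = rng.multivariate_normal([1.0, 0.1], cov, n)   # class 1: 2nd feature weak alone

def best_threshold_errors(a, b):
    """Smallest number of misclassified points over all 1-D thresholds."""
    a, b = np.ravel(a), np.ravel(b)
    errs = [min((a >= t).sum() + (b < t).sum(),    # try both orientations
                (a < t).sum() + (b >= t).sum()) for t in np.r_[a, b]]
    return min(errs)

def fisher_errors(a, b):
    """Project onto Fisher's discriminant direction, then threshold."""
    sw = np.cov(a.T) + np.cov(b.T)                 # pooled within-class scatter
    w = np.linalg.solve(sw, b.mean(0) - a.mean(0))
    return best_threshold_errors(a @ w, b @ w)

marginal = best_threshold_errors(x0[:, 0], x1[:, 0])
joint = fisher_errors(x0, x1)
print(marginal, joint)                             # joint rule misclassifies fewer points
```

The joint rule wins because the weak feature, though nearly useless marginally, carries the noise component of the strong feature through their correlation.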
6 Discussion
Leveraging dependence among features, we propose a covariance-insured screening and classification procedure for weak signal detection and ultrahigh-dimensional classification. The procedure is accompanied by easily implementable algorithms and enjoys desirable theoretical properties, such as model selection consistency with optimal misclassification rates.
The proposed method has a wide range of applications, such as classification of disease subtypes based on spatially correlated fMRI brain image data and classification of portfolio categories based on time-series price data. The local correlation structure can be parameterized with fewer parameters (such as AR(1)) to further improve the estimation accuracy and efficiency of precision matrices.
Several aspects of the proposed method can inspire future work. A key component of our approach is identifying block-diagonal precision matrices by thresholding the sample covariance matrix. On the other hand, intensive biomedical research has generated a large body of biological knowledge, e.g., pathway information, that may help specify the structure of such matrices. How to effectively utilize this information, and to balance prior knowledge against data-driven approaches to achieve optimal decision rules, remains rather challenging.
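The block-identification step just mentioned, thresholding the sample correlation matrix and taking connected components of the resulting graph, can be sketched as follows. The choice of the threshold `tau` by cross-validation is not reproduced here; the data-generating setup is illustrative.

```python
import numpy as np

def covariance_blocks(X, tau):
    """Threshold the sample correlation matrix and return the block
    partition, i.e., the connected components of the thresholded graph."""
    adj = np.abs(np.corrcoef(X.T)) > tau
    p = adj.shape[0]
    label, comp = -np.ones(p, dtype=int), 0
    for s in range(p):
        if label[s] >= 0:
            continue
        stack, label[s] = [s], comp            # depth-first search from s
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adj[u]):
                if label[v] < 0:
                    label[v] = comp
                    stack.append(v)
        comp += 1
    return [np.flatnonzero(label == c) for c in range(comp)]

# Two independent feature blocks should be recovered as separate components.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(500, 1)), rng.normal(size=(500, 1))
X = np.hstack([z1 + 0.1 * rng.normal(size=(500, 2)),   # features 0, 1 share z1
               z2 + 0.1 * rng.normal(size=(500, 2))])  # features 2, 3 share z2
blocks = covariance_blocks(X, tau=0.5)
print([list(b) for b in blocks])  # → [[0, 1], [2, 3]]
```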
We assume a common covariance matrix across classes. However, this restrictive assumption can be relaxed by allowing heteroscedastic normal models, which are more realistic in many practical applications. As stated in the proofs, the normality assumption can also be relaxed and replaced by sub-Gaussian families, which will facilitate modeling non-Gaussian data with heavy tails.
There is much room for developing more efficient and faster algorithms. For instance, when the whole feature space can be divided into uncorrelated subspaces, such as different chromosomes in the genome or different functional regions in the brain, parallel computing over the partitioned feature spaces becomes available. We will pursue this in future work.
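When the feature space partitions into uncorrelated blocks, each block can be screened independently and in parallel, as in this minimal sketch. The per-block statistic is a simple standardized mean difference standing in for the actual block-wise computation.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def block_stat(Xb, y):
    """Per-block screening statistic: a standardized mean difference
    (an illustrative placeholder for the block-wise computation)."""
    return np.abs(Xb[y == 0].mean(0) - Xb[y == 1].mean(0)) / Xb.std(0)

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 25)
X = rng.normal(size=(50, 12))
parts = np.array_split(np.arange(12), 3)       # e.g., three chromosomes
with ThreadPoolExecutor() as pool:
    stats = list(pool.map(lambda b: block_stat(X[:, b], y), parts))
scores = np.concatenate(stats)                 # identical to screening all at once
print(scores.shape)  # → (12,)
```

Because the blocks are uncorrelated, the concatenated per-block scores coincide with the scores computed on the full feature matrix, so parallelization changes nothing but the wall-clock time.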
Appendix A Proofs of the main results
We first state the Gaussian tail inequality, which will be used throughout Appendix A. Let $Z$ be a standard normal random variable. Then for any $t > 0$,
\[
P(|Z| > t) \le \sqrt{\frac{2}{\pi}}\, \frac{e^{-t^2/2}}{t}.
\]
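As a quick sanity check, the standard Gaussian tail bound $P(|Z| > t) \le \sqrt{2/\pi}\, e^{-t^2/2}/t$ can be verified numerically against the exact tail $P(|Z| > t) = \mathrm{erfc}(t/\sqrt{2})$:

```python
import math

def tail_exact(t):
    """Exact two-sided tail P(|Z| > t) for standard normal Z."""
    return math.erfc(t / math.sqrt(2))

def tail_bound(t):
    """Gaussian tail bound sqrt(2/pi) * exp(-t^2/2) / t, valid for t > 0."""
    return math.sqrt(2 / math.pi) * math.exp(-t * t / 2) / t

for t in (0.5, 1.0, 2.0, 4.0):
    print(t, tail_exact(t), tail_bound(t))     # bound dominates the exact tail
```

The bound is loose for small $t$ but becomes sharp as $t$ grows, which is what matters in the large-deviation arguments of the appendix.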
Denote by $\widehat\Sigma_\tau = \big(\widehat\sigma_{jk}\,\mathbf{1}\{|\widehat\sigma_{jk}| \ge \tau\}\big)$ the thresholded estimator, with $\widehat\Sigma = (\widehat\sigma_{jk})$ being the sample covariance matrix. Bickel and Levina [3] showed that, for $\tau \asymp \sqrt{\log p / n}$, $\|\widehat\Sigma_\tau - \Sigma\| = O_P\big(c_p (\log p / n)^{(1-q)/2}\big)$. Furthermore, Bickel and Levina [3] and Fan, Liao and Min [12] showed that
\[
\|\widehat\Sigma_\tau^{-1} - \Sigma^{-1}\| = O_P\big(c_p (\log p / n)^{(1-q)/2}\big). \tag{A.1}
\]
Therefore, under the sparsity assumptions, the CIS-estimated precision matrix satisfies the rate in (A.1) for some constant and sufficiently large $n$. The following Lemma 8 is useful for controlling the number of nonzero entries in a row of the precision matrix. Its proof is similar to that of Theorem 2 in [27] and is thus omitted.
Lemma 8.
Let the notation be as above. Under assumptions (A2)–(A4), the stated bound on the number of nonzero entries in each row holds for some positive constants.