Classification with Ultrahigh-Dimensional Features\thanksrefT1

# Classification with Ultrahigh-Dimensional Features\thanksrefT1

\fnmsYanming \snmLi\thanksrefm1label=e1]liyanmin@umich.edu [    \fnmsHyokyoung \snmHong\thanksreft2,m2label=e2]hhong@stt.msu.edu [    \fnmsJian \snmKang\thanksreft3,m1 label=e3]jiankang@umich.edu [    \fnmsKevin \snmHe\thanksrefm1 label=e4]kevinhe@umich.edu [    \fnmsJi \snmZhu\thanksrefm1 label=e5]jizhu@umich.edu [    \fnmsYi \snmLi\thanksrefm1 label=e6]yili@med.umich.edu [ University of Michigan\thanksmarkm1 and Michigan State University\thanksmarkm2 Yanming Li
Department of Biostatistics,
University of Michigan
1415 Washington Heights
Ann Arbor, Michigan 48109
Yi Li
Department of Biostatistics,
University of Michigan
1415 Washington Heights
Ann Arbor, Michigan 48109
###### Abstract

Although much progress has been made in classification with high-dimensional features [10, 16, 6, 47], classification with ultrahigh-dimensional features, wherein the features much outnumber the sample size, defies most existing work. This paper introduces a novel and computationally feasible multivariate screening and classification method for ultrahigh-dimensional data. Leveraging inter-feature correlations, the proposed method enables detection of marginally weak and sparse signals and recovery of the true informative feature set, and achieves asymptotic optimal misclassification rates. We also show that the proposed procedure provides more powerful discovery boundaries compared to those in Cai and Sun [6] and Jin [20]. The performance of the proposed procedure is evaluated using simulation studies and demonstrated via classification of patients with different post-transplantation renal functional types.

[
\kwd
\arxiv

arXiv:0000.0000 \startlocaldefs \endlocaldefs

\runtitle

Classification with Ultrahigh-Dimensional Features \thankstextT1Footnote to the title with the “thankstext” command.

{aug}

, , , , and

\thankstext

t1Some comment \thankstextt2Supported in part by the NSA Grant H98230-15-1-0260 \thankstextt3Supported in part by the NIH Grant 1R01MH105561

class=MSC] \kwd[Primary ]60K35 \kwd60K35 \kwd[; secondary ]60K35

Classification \kwdcorrelation \kwdmultivariate screening \kwdrare and weak signal model \kwdscreening \kwdsparsity.

## 1 Introduction

High-throughput data such as genomic, proteomic, microarray and neuroimaging data have now been routinely collected in many contemporary biomedical studies, wherein the number of variables far exceeds the number of observations. For example, our motivating kidney transplantation study [15] assayed a total of 12,625 genes from 62 tissue samples with 4 different post-transplantation kidney functional types. A key scientific interest lies in classifying patients with different functional types based on their molecular information, for the purpose of precision medicine.

Classification methods for high-dimensional data have been widely studied in recent years [17, 10, 43, 16, 34, 44, 11, 29, 47]. However, most classification approaches [16, 44, 47] and software (penalizedLDA, lda in R [39], etc.) use penalized methods, which are not directly applicable to ultrahigh-dimensional cases due to computational infeasibility.

Furthermore, discriminant methods based on independence rules which ignore correlations among features [13, 14] have been widely practiced. This is partly because, as Bickel and Levina [2] pointed out, the independence rule may perform better in the high-dimensional case due to ill-conditioned high-dimensional covariance matrices and cumulative error arising from estimation. However, this conclusion only holds for marginally strong signals. Most current ultrahigh-dimensional screening methods [10, 6] have assumed the independence rule when detect marginally strong signals. In many real world applications, an independence rule is restrictive and frequently fails to detect marginally weak features. Xu et al. [47] showed that even though a feature is marginally non-discriminative, it could be informative when considered jointly with other marginally differentiating features. For example, Figure 1 indicates that the best classifier involves both features and while does not have any discriminant power on its own.

Table 1, which summarizes popular modern high-dimensional classification methods. It reveals that weak signal detection has been less studied in the literature, even though detecting marginally weak signals has drawn recent attention[4, 24]. In this paper, we propose an ultrahigh-dimensional screening and classification method, which accounts for inter-feature correlations and enables detection of marginally weak or non-discriminant features that might be missed by marginal screening approaches. We cast the proposed method in a unified algorithm for both ultrahigh-dimensional variable selection and classification motivated by the formulation of linear discriminant analysis (LDA). It contains a screening step that simultaneously selects informative features and estimates the parameters, as well as a classification step. We show that the screening step can recover the informative features with a probability tending to and the classification step reaches an optimal misclassification rate even for weak signals. Furthermore, it improves the discovery boundary defined in Jin [20] and Cai and Sun [6]. The procedure is computationally efficient and an R package, “CIS-LDA,” implementing the proposed method is also developed.

The rest of the paper is organized as follows. In Section 2, we introduce the covariance insured screening and classification method for ultrahigh-dimensional data. Section 3 develops theoretical properties of the proposed method. In Section 4, the performance of the proposed method is evaluated using simulation studies. We apply the proposed procedure to the renal transplantation data in Section 5 and Section 6 provides some discussions.

## 2 Covariance insured screening and classification

### 2.1 Notation

Throughout the paper, we use bold upper-case letters to denote vectors and matrices. For a matrix , denote by its transpose, by its th column vector, and by its entry on the th row and th column, . Denote by the cardinality of a set and by the complement of .

Denote by the graph induced by a symmetric matrix , where the node set corresponds to the row indices of and the edge set contains all the edges. An edge between nodes and exists if and only if . For a subset , denote by the principle submatrix of confined on and the corresponding edge set. We say that the graph is a connected component of if the following two conditions are satisfied: (i) any two nodes in are connected by edges in and (ii) for any node in , there exists another node in such that the two can not be connected by edges in .

For a symmetric matrix , denote by the trace of , denote by and the minimum and maximum eigenvalues of . Denote the operator norm and the Frobenius norm by and , respectively.

### 2.2 Method

We assume that there are distinct classes. Let represent class label taking values in . Denote by for such that . Let X be a -dimensional vector of features, of which the distribution conditional on the corresponding class membership follows a multivariate Gaussian

 \textslX|Y=k∼N(μk,Σ),    k=1,⋯,K, (2.1)

where is a -dimensional vector, and is a positive definite matrix. Let be independent observations of . When , (2.1) can be rewritten as

 \textslXi|Yi∼(2−Yi)N(μ1,Σ)+(Yi−1)N(μ2,Σ),    i=1,⋯,n. (2.2)

For the ease of presentation, we assume throughout. We note that the theoretical properties, implementation, and our R package can all be readily extended to the case.

We recast (2.2) in the framework of rare/weak feature models [20], where is only different from in fraction of features. Specifically, let be samples from a Bernoulli() distribution, and

 μ2j=μ1j+δjIj,  1≤j≤p, (2.3)

where denotes the th coordinate of , . and for all . Equation (2.3) assigns the mean difference between the two classes on features with . Set . Therefore, controls the strength of signals and controls the rareness of marginally differentiable features. Jin [20] and Cai and Sun [6] showed that in the ultrahigh-dimensional case with both small and , it is impossible to discriminate one class from the other. Following Jin [20], let for some and for some . Then and are controlled by and , respectively. As gets closer to the signal becomes weaker, while as gets closer to the signal becomes sparser.

Next we outline a new covariance-insured screening (CIS) and classification method that utilizes the feature correlation structures to detect the signals that are marginally uninformative but jointly informative (MUJI). Denote by the precision matrix, where if and only if that the th and th () variables are conditionally uncorrelated given the rest of the variables. It was shown in [47] that a sufficient and necessary condition for variable to be non-informative in differentiating the two classes is

 p∑j′=1Ωjj′(μ1j′−μ2j′)=0. (Condition A)

Accurately estimating the precision matrix involves inverting a large square matrix and requires computational steps. However, under the sparsity assumption of , it is possible to estimate without fully uncovering the whole matrix. Specifically, assume that in each row of , the fraction of its nonzero entries is at most for . Hence, each marginally informative signal is partially correlated with at most number of the MUJI signals. Under the sparsity assumption, is a -sparse graph in the sense that the degree of each node . Particularly, denote by , then each feature in is partially correlated with at most features. Suppose the graph admits a decomposition of connected components and denote by the precision matrices in corresponding to those connected components, i.e.

 G(V,E;Ω)=B⋃l=1G(Vl,El;Ωl),

where is the sub-graph corresponds to the th connected component. Condition A reveals that if a marginally non-informative feature, say , is jointly informative, there must exist a feature such that and . That is, feature has to be connected to some marginally informative feature. This implies that, in order to detect the MUJI signals, one only needs to focus on the connected components that contains at least one marginally informative features in . See Figure 2 for a graphical illustration.

Under the sparsity assumption, the connected components and the graphic structure of can be inferred by thresholding the sample covariance matrix . Denote by the thresholded sample covariance matrix with a threshold , where , with being the indicator function. Denote by the graph corresponding to and suppose it can be decomposed into connected components

 G(V,~E;~Σα)=~B⋃l=1G(~Vl,~El;~Σαl)

with , , being the node set, edge set and the sub-matrix in corresponding to the th connected component, . For two symmetric matrices and , we say that graph belongs to graph , denote by , if and .

###### Lemma 1.

Let be the connected component of a marginally informative feature in , and be the connected component of feature in , then for sufficiently large and , we have

 P(G(Vj,Ej;Ωj)⊆G(~Vj,~Ej;~Σαj))≥1−C1exp(−C2n1−2κ),

where , are some positive constants and .

The proof of Lemma 1 is similar to Theorem 1 in [27] and is provided in the supplemental material.

Remark. The efficiency gain on estimating stems from that: (i) instead of solving an ultrahigh-dimensional graphic lasso problem, it computes the connected components of the thresholded sample covariance matrix, reducing computational cost; (ii) it only needs detection of the connected components of the marginally informative ones. Lemma 1 guarantees the efficacy of the proposed method on reconstructing the connected component structure.

### 2.3 covariance-insured screening and classification steps

Given the data set , denote by the number of samples in class , . Denote by the value of feature for sample and the mean of feature in class , . We further detail the proposed procedures that consist of screening and classification steps.

• Screening step

• Standardize columnwise.

• Select the set consisting of the features satisfying

 ∣∣¯X(1)⋅j−¯X(2)⋅j∣∣=∣∣ ∣∣1n1∑Yi=1Xij−1n2∑Yi=2Xij∣∣ ∣∣>τ, (2.4)

where is a thresholding parameter that controls the strength of the marginal effects.

• For a given threshold , construct a sparse estimate of the covariance matrix by setting , .

• For each feature in , detect its connected component in . Denote by the corresponding index set. Suppose the first components are the smallest number of components such that . Compute the precision matrix of the first components, .

• Evaluate Condition A and select informative features by

 S\tiny{CIS}(α,τ;νn)=⎧⎪ ⎪⎨⎪ ⎪⎩1≤j≤p:∣∣ ∣ ∣∣∑j′∈∪k(α,τ)g=1Sg^Ωα,τjj′(¯X(1)⋅j′−¯X(2)⋅j′)∣∣ ∣ ∣∣≥νn⎫⎪ ⎪⎬⎪ ⎪⎭

for some pre-specified . In our numerical studies, we rank the magnitude of the estimated Condition A: and select up to the first features, where is the sample size. refers to the importance score of predictor .

Remark. The optimal tuning parameter values and are selected through cross-validation. Resampling strategies such as stability selection [31] can be applied to reduce the variation from random sampling.

• Post-screening classification

A new observation is to be classified with the decision rule with . Here , , , and superscript “CIS” restricts features to .

Remark. Notice that 2-4 in the screening step are mainly thresholding the marginal mean difference and the correlation parameters. They estimate the parameters needed in constructing the screening statistics. And 5 in the screening step does the job of variable selection based on Condition A. Even though 2-4 can identify marginally informative signals and their connected components, but they do not select MUJI signals by themselves without 5.

At step 4 of screening, for each feature , we look for such that is relatively large. For each of those s, we then recursively look for its connected features till all the features in the connected component of feature are found. Other connected component labeling algorithms such as those listed in [35] can be employed. For a specific and some , it is possible that is of a greater dimension than the sample size and calculating is by itself a complex high-dimensional graphical model problem. In such cases, we consider an -depth connected subgraph, which contains features connecting to the feature through at most edges in the graph corresponding to . Denote the node set of such a connected subgraph by . We then set and to be the covariance and precision matrices of features in only. This is similar to the -sized subgraph in Zhang, Jin and Fan [48]. With a proper choice of , one can avoid inverting a high-dimensional covariance matrix while involving essentially all the highly correlated features. The impact of the different sizes of m on screening and classification is explored in Section 4. Using different depth for different connected component is, to some extent, equivalent to adaptively thresholding the corresponding correlation blocks for different features in .

## 3 Theoretical properties

Let be the true informative set. Then , where and . And .

Let be the index set of features selected by the CIS screening step with thresholding parameters . Let , where is a constant not depending on . Note that can be used as a measure of the overall sparsity of [3, 34, 12]. Then under the following assumptions, can uncover the true informative set with probability tending to .

(A1) for some positive ;
(A2) for some ;
(A3) for some such that for in (A2);
(A4) there exist positive constants and such that ;
(A5) , and ;
(A6) , i.e. both the marginally informative signals and the MUJI signals are sparse enough.

###### Theorem 2.

(Sure Screening Property) Under assumptions (A1)-(A5), for any ,

 P(|S0∩S\tiny{CIS}(α,τ;νn)|≥(1−ϵ)|S0|)→1 as n→∞. (3.1)
###### Theorem 3.

(False Positive Control Property) Under assumptions (A1)-(A6), for any , we have

 P(|S\tiny{CIS}(α,τ;νn)∩Sc0|≤ζ−1n|Sc0|)→1 as n→∞. (3.2)

Remark. Theorems 2 and 3 ensure that with probability tending to , screening step of the CIS method captures all and only the informative features.

Let denote the misclassifiction rate of the oracle rule or the optimal rule, which assumes that the parameters , and are known [30]. Under model (2.1), it can be shown that , where [30] with and is the standard normal cumulative distribution function. Consider the following definitions introduced in Shao et al. [34].

###### Definition 4.

Let be a classification rule with the conditional misclassification rate , given the training sample .

• [leftmargin=*]

• (D3.1) is asymptotically optimal if ;

• (D3.2) is asymptotically sub-optimal if .

For any thresholding parameters and of the order specified in assumption (A5), denote by and by the estimated precision matrix from the CIS screening step (fill the entries with zeros if ). We first consider a classification rule using all the features: with and . Its misclassification rate is given by [10]:

 R\tiny{CIS}≡122∑k=1Φ⎛⎜ ⎜⎝−(μk−^μ)′^Ωα,τ^δ√^δ′^Ωα,τΣ^Ωα,τ^δ⎞⎟ ⎟⎠. (3.3)

We further assume the following:

(A7) There exist positive constants and such that ;
(A8) Assume that ;
(A9) Let be the number of features with nonzero mean differences between two classes. Then ;
(A10) Let for some constant and . Then .

###### Theorem 5.

(Asymptotic Misclassification Rate) Under the assumptions (A7)-(A10), when ignoring the effect of variable selection based on Condition A in the screening step and constructing a classification rule just using the parameter estimators and , we have:

• [leftmargin=*]

• (T5.1) .

• (T5.2) If is bounded, the is asymptotically optimal.

• (T5.3) If , the and is asymptotically sub-optimal.

Theorem 5 reveals that in ultrahigh-dimensional settings, when not combining with a variable selection method, building a decision rule simply based on all features can lead to a misclassification error no better than a mere random guess. Fan and Fan [10] showed that in the case of uncorrelated features, for any classifier , the misclassification rate is for some constant . When the signal strength is not strong enough to balance out the increasing dimensionality , as and thus . Similarly for the proposed CIS classification, when all the features are used, according to Theorem 5, the signal strength needs to be strong enough to balance out to achieve asymptotic optimal and zero misclassification rates. When , Theorem 5 reduces to Theorem 3 in [34], where sparsity is only assumed for marginally informative features, but not for MUJI features.

However, according to Theorems 2 and 3, the proposed CIS screening step retains only truly informative predictors with large probability. If we base the post-screening classification only on , it can achieve better classification performance, because then only needs to be strong enough to balance out to yield an asymptotic zero and optimal post-screening misclassification rate.

Define the CIS post-screening classifier as , where with superscript CIS restricts the features to the set obtained from the CIS screening step. Then its misclassifcation rate is

 R\scriptsize{CIS\_PSC}≡122∑k=1Φ⎛⎜ ⎜⎝−(μ\tiny{CIS}k−^μ% \tiny{CIS})′^Ω\tiny{CIS}^δ\tiny{CIS}√^δ\tiny{CIS}′^Ω\tiny{CIS}^Ω\tiny{CIS}Σ\tiny{CIS}^Ω\tiny{CIS}^δ%CIS⎞⎟ ⎟⎠.
###### Theorem 6.

(Post-screening Misclassification Rate) Under the same conditions in Theorem 5, when classifying based on features selected from the screening step, we have

• [leftmargin=*]

• (T6.1) .

• (T6.2) If is bounded, then is asymptotically optimal.

• (T6.3) If , then and is asymptotically sub-optimal.

Now let be the informative feature set discovered by a decision rule . In order to characterize the phase transition in terms of optimal discovery, Cai and Sun [6] defined the discovery boundary that satisfies the following conditions:

• The false positive rate is vanishingly small, i.e. ;

• A non-empty discovery set is constructed with high probability, i.e. .

Assuming that the marginally informative features have a common marginal variance and marginally uninformative ones have a marginal variance , Cai and Sun [6] derived the discovery boundary that divides the - plane into discoverable and non-discoverable areas:

• For , ;

• For ,

• For ,

Specifically, they showed that (a) If , it is possible to find a classification rule that fulfills (C1) and (C2) simultaneously. (b) If , it is impossible to find a classification rule that fulfills (C1) and (C2) simultaneously. A similar detection boundary about reliably detecting any existing signals was defined in [20]:

 d\scriptsize{det}(β)=⎧⎪ ⎪⎨⎪ ⎪⎩0,0<β≤1/2;β−1/2,1/2<β≤3/4;(1−√1−β)2,3/4<β<1.

The discovery boundaries coincide with on . And on . For , one can detect the existence of signals reliably but it is impossible to separate any signal from noise [6].

However, when accounting for inter-feature correlations, we show in proposition 7 that the discovery boundary can be improved, and the marginally uninformative features that would have been undiscoverable in [6] and [20] can become “discoverable.” Though finding the exact optimal discovery bound for model (2.2) is beyond the scope of this paper, we do provide an upper bound of the exact optimal discovery bound and show that the upper bound is below the discovery or detection boundaries in [6] and [20] (See Figure 3).

For simplicity, from now on, we assume , for all and the features have the same marginal variance , . We also assume that , that is , which is equivalent to say that on average, the number of marginally non-informative signals correlated with some marginally informative signals is larger than the number of marginally informative signals. In most cases, the assumption of is a reasonable one to make. For example, in genomic data, MUJI markers are usually less sparse than the marginally informative ones.

###### Theorem 7.

(Discovery Boundary) Denote by the discovery boundary associated with the CIS decision rule. Let , , then

• [leftmargin=*]

• (T6.1) For , ;

• (T6.2) For ,

• (T6.3) For ,

Define the power of the discovery boundary of some classification procedure as the ratio of the discoverable area to the non-discoverable one. From Theorem 7, we can observe that for any single signal, the discovery boundary from the CIS procedure is at least as powerful as the discovery boundary in [6]. Moreover, as the MUJI signals get less sparse, the CIS discovery boundary has more power. Figure 3 depicts the discovery boundaries for the optimal screening procedure in Cai and Sun [6] and the CIS procedure with different values of . Theorem 7 implies that when , the detection boundary from the CIS procedure has more power than that in [20].

## 4 Numerical studies

In this section, we assess the finite sample performance of the proposed CIS screening and classification method and compare it with the marginal screening method [6, 20]. First, we generate 3 classes with equal sample size of with features, of which only the first 20 variables are informative.

The mean structure of the first variables is given in Table 2. For example, variables 1-4 and 11-14 have mean 0 in both classes 1 and 2, variables 5 and 15 have mean -0.5 in class 1 and mean 2 in class 2, and variables 6-10 and 16-20 have mean 1.5 in class 1 and mean -1.5 in class 2. Therefore variables 5, 15, 6-10, 16-20 all are marginally informative for differentiating classes 1 and 2. Variables 1-4 and 11-14, though marginally non-informative to differentiate classes 1 and 2, are correlated with variables 5 and 15 respectively. Effectively, all the first 20 variables are informative for differentiating classes 1 and 2.

Variables 1-5 and 11-15 are marginally informative for differentiating classes 2 and 3. Variables 6-10 and 16-20 are marginally non-informative for differentiating classes 2 and 3, and are set to be independent of variables 1-5 and 11-15 respectively. Therefore, variables 6-10 and 16-20 are non-informative for differentiating classes 2 and 3.

The remaining 9,980 variables are independently and identically distributed from for all classes and are independent of the first 20 variables, hence they are non-informative for differentiating neither classes 1 and 2 nor classes 2 and 3.

Our goal is to examine whether our proposed CIS-classification method renders more power to detect the MUJI variables 1-4 and 11-14 in classifying classes 1 and 2, and whether it performs similarly to the marginal screening methods in classifying classes 2 and 3.

Regarding the correlation structure between the MUJI variables and the marginally informative variables, we divide the first 20 variables into four correlation blocks , , and , and set variables from different correlation blocks to be independent. Variables within the same block are governed by the same correlation structure. We specifically consider two correlation structures:

Example 1 (Compound Symmetry): each block has a compound symmetry (CS) correlation structure with a correlation coefficient . The MUJI variables 1-4 and 11-14 for differentiating classes 1 and 2 are equally correlated with the marginally informative variables 5 and 15, respectively.

Example 2 (First Order Auto-Correlation): each block has a first order auto-correlation (AR1) structure with a correlation coefficient . The farther the MUJI variables are away from the marginally informative ones, the weaker their correlations are.

We generate 500 data sets, each of which is divided into a training set with 100 samples per class and a testing set with 50 samples per class. Features are selected by applying screening procedures (steps 1-5) on training dataset, described in the previous section. Post-screening is performed on the testing dataset. A thresholding tuning parameter is obtained by a 5-fold cross validation. We also explore the use of different correlation thresholding values of and report the corresponding screening results. In both examples, we fix the depth parameter as , which controls the maximum number of edges that connect any variable to a marginally informative variable in its connected component. In our numerical experience, we have found that the choice of is not critical.

We compare the performance of the proposed method and marginal screening method using the false positive (FP), false negative (FN), sensitivity (se), specificity (sp), and the minimum model size (MMS) that is the minimum number of features needed to include all informative features. To assess the classification performance, we report the percentage of samples that are misclassified (ER). Tables 3 and 4 summarize the results under the CS and AR1 correlation structures. The presented values are the averages of the 500 replicates, except that the median is reported for the MMS. Tables 3 and 4 reveal that in classifying between classes 1 and 2, where MUJI signals exist, the proposed method provides much smaller false negatives and much larger sensitivity values. The smaller MMS for the proposed method also indicates that it identifies MUJI signals more efficiently, compared to the marginal screening method. When the true correlation is , a smaller correlation matrix thresholding value () gives a better screening result as it includes more marginally weak features, but at the cost of increment of computational time and memory to detect their connected components. In classification between classes 2 and 3, where no MUJI signals exist, the proposed method performs similarly to the marginal screening method in both screening and classification.

We further investigate the effect of the “depth” parameter on the feature selection and classification in Example 3, where data are simulated from the CS correlation structure and other settings remain the same as Example 1. We fix at . Table 5 shows that CIS gives satisfactory screening and classification performance even when the depth is as small as .

## 5 Real data analysis

There are 4 functional types of tissues among the 62 kidney tissue samples in the kidney transplant study [15]: 17 normal donor kidneys (C), 19 well-functioning kidneys more than 1-year post-transplant (TX), 13 biopsy-confirmed acute rejection (AR), and 13 acute dysfunction with no rejection (NR). Each sample has gene profile assayed from kidney biopsies and peripheral blood lymphocytes on 12,625 genes. Distinguishing these 4 types of tissues is important in balancing the need for immunosuppression to prevent rejection and minimizing drug-induced toxicities.

We assess the classification performance by predicting each individual’s class using the leave-one-out procedure. Specifically, for each sample, we classify it between any class pair using the rest samples belong to the class pair to select informative features and optimal tuning parameters. The thresholding parameter is chosen by 5-fold cross validation. As the median absolute value of the correlation coefficient between two features for our data is , the covariance (or the correlation as we have standardized each feature) thresholding parameter is fixed at . The depth parameter is set as . Since there are 4 classes, a majority voting strategy is then employed to decide which class the sample is eventually classified to. For example, if a sample is classified to class C between the classes C and TX, to C between C and AR, to C between C and NR, to TX between TX and AR, to NR between TX and NR, and to NR between AR and NR, then the sample is eventually classified to C. The pairwise comparison strategy which reduces the multi-class problem to multiple two-class problems [37], and the leave-one-out plus majority voting classification error is 6 out of 62 samples.

In terms of selecting informative variables, we draw 100 bootstrap samples for each pairwise two-class comparison and use 5-fold cross-validation to select the tuning parameters. Top genes with high selection frequencies for each class pair are identified and reported in Table 6. MUJI genes are indicated in the third column of Table 6. The last column lists literature that validate the biological relevance of the selected genes.

Our method identifies some MUJI genes as well as some marginally informative genes supported by the clinical literature. For example, DDX19A has been identified as a MWJI gene which has high discrimination power between classes C and AR. [26] identified DDX19A, a member of the DEAD/H-box protein family, as a novel component of NLRP3 inflammasome. NLRP3 inflammasome plays a major role in innate immune and inflammatory pathologic responses, which is relevant to post-renal-transplantation recovery. To our knowledge, DDX19A has not been reported elsewhere as a statistically informative gene for post-renal-transplantation rejection types.

The individual and joint effects for some MUJI genes are illustrated in Figure 4. The marginal effects of those MUJI genes and their correlated marginally discriminative genes, together with rank of importance scores of the MUJI genes are summarized in Table 7. For example, in the top panel of Figure 4, the marginal standardized mean difference between C and TX for the gene ADAM8 is ranks 5,071 out of 12,625 genes. While gene ADAM8 is highly correlated with a marginally discriminative gene IFITM1, whose standardized marginal mean difference ranks 22. By incorporating the covariance structure, the CIS procedure promote rank of the IS for gene ADAM8 to 18.

The following Figure 5 illustrates the classification effect of a MUJI signal, gene IPO5, when considered with a marginally stronger signal, gene TTC37. Considered by itself, the minimal misclassification number of gene TTC37 is 10, as showed in the left panel of Figure 5. While jointly considered with gene IPO5, the minimal misclassification number reduces to 2, as showed in the right panel of Figure 5. Figure LABEL:ffig:MUJIeffect clearly demonstrates more discriminating power when two genes are considered jointly than when considered individually.

## 6 Discussion

Leveraging dependence among features, we propose a covariance-insured screening and classification procedure for weak signal detection and ultrahigh-dimensional classification. The procedure is accompanied by easily implementable algorithms and renders nice theoretical properties, such as model selection consistency with optimal misclassification rates.

The proposed method has a wide range of applications, such as classification of disease subtypes based on spatially correlated fMRI brain image data and classification of portfolio categories based on time series data for prices. The local correlation structure can be parameterized with fewer parameters (such as AR1) to further improve the estimation accuracy and efficiency of precision matrices.

Several aspects of the proposed method can inspire future work in several directions. A key component of our approach is the notion of identifying block-diagonal precision matrices by thresholding the sample covariance matrices. On the other hand, intensive biomedical research has generated a large body of biological knowledge, e.g. pathway information, that may help specify the structure of such matrics. How to effectively utilize such information, and further balance priori knowledge and data-driven approaches to achieve optimal decision rules remains rather challenging.

We assume a common covariance matrix across classes. However this restrictive assumption can be relaxed by allowing heteroscedastic normal models, which are more realistic in many practical applications. As stated in the proofs, the normality assumption can also be relaxed, and be replaced by sub-Gaussian families. This will facilitate modeling non-Gaussian data with heavy-tails.

There is much room for development of more efficient and faster algorithms. For instance, when the whole feature space can be divided into uncorrelated subspaces, such as different chromosomes in the genome or different functional regions in the brain, parallel computing on multiple partitioned feature spaces may be available. We will pursue this in future work.

## Appendix A Proofs of the main results

We first state the Gaussian tail inequality, which will be used throughout in Appendix A. Let be a standard normal random variable. Then for any ,

 P(Z>t)≤1√2πte−t22.

Denote by with being the sample covariance matrix. Bickel and Levina [3] showed that for , . Furthermore, Bickel and Levina [3] and Fan, Liao and Min [12] showed that,

 ∥(~Σα)−1−Σ−1∥=O(Ch,p(logpn)(1−h)/2). (A.1)

Therefore, under the sparsity assumptions, the CIS estimated precision matrix satisfies for some constant and sufficiently large . The following Lemma 8 is useful in controlling the size of non-zero entries in a row of . Its proof is similar to that of Theorem 2 in [27] and thus is omitted.

###### Lemma 8.

Let . Under assumptions (A2)-(A4), for some positive constants and .

Proof of Theorem 2. Denote for short by and let . Proving (3.1) is equivalent to proving the following when

 P(|Sα,τ;νnCIS∩S0|>(1−ϵ)|S0|) = P⎛⎝1|S0|∑j∈S01⎛⎝∑j′∈S∣∣^Ωα,τjj′(¯X(1)⋅j′−¯X(2)⋅j′)∣∣>νn⎞⎠>1−ϵ⎞⎠→0.

Let . We first develop an upper bound for . Notice that when is estimated as zero, it does not contribute to