# High-Dimensional Regularized Discriminant Analysis

## Abstract

Regularized discriminant analysis (RDA), proposed by Friedman (1989), is a widely popular classifier that lacks interpretability and is impractical for high-dimensional data sets. Here, we present an interpretable and computationally efficient classifier called high-dimensional RDA (HDRDA), designed for the small-sample, high-dimensional setting. For HDRDA, we show that each training observation, regardless of class, contributes to the class covariance matrix, resulting in an interpretable estimator that borrows from the pooled sample covariance matrix. Moreover, we show that HDRDA is equivalent to a classifier in a reduced-feature space with dimension approximately equal to the training sample size. As a result, the matrix operations employed by HDRDA are computationally linear in the number of features, making the classifier well-suited for high-dimensional classification in practice. We demonstrate that HDRDA is often superior to several sparse and regularized classifiers in terms of classification accuracy with three artificial and six real high-dimensional data sets. Also, timing comparisons between our HDRDA implementation in the sparsediscrim R package and the standard RDA formulation in the klaR R package demonstrate that as the number of features increases, the computational runtime of HDRDA is drastically smaller than that of RDA.

###### keywords:
Regularized discriminant analysis, High-dimensional classification, Covariance-matrix regularization, Singular value decomposition, Multivariate analysis, Dimension reduction
###### Msc:
[2010] 62H30, 65F15, 65F20, 65F22

## 1 Introduction

In this paper, we consider the classification of small-sample, high-dimensional data, where the number of features $p$ exceeds the training sample size $N$. In this setting, well-established classifiers, such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), become incalculable because the class and pooled covariance-matrix estimators are singular (Murphy, 2012; Bouveyron, Girard, and Schmid, 2007; Mkhadri, Celeux, and Nasroallah, 1997). To improve the accuracy of the class covariance-matrix estimators in the QDA classifier and to ensure that the estimators are nonsingular, Friedman (1989) proposed the regularized discriminant analysis (RDA) classifier, which incorporates a weighted average of the pooled sample covariance matrix and each class sample covariance matrix. To further improve the accuracy of the estimation of the class covariance matrix and to stabilize its inverse, Friedman (1989) also included a regularization component that shrinks the covariance-matrix estimator towards the identity matrix, yielding a nonsingular estimator in the spirit of the well-known ridge-regression approach of Hoerl and Kennard (1970). Despite its popularity, the “borrowing” operation employed in the RDA classifier lacks interpretability (Bensmail and Celeux, 1996). Furthermore, the RDA classifier is impractical for high-dimensional data sets because it computes the inverse and determinant of the $p \times p$ covariance-matrix estimator for each class; both matrix calculations are computationally expensive because the number of operations grows at a polynomial rate in the number of features. Moreover, model selection for the RDA classifier’s two tuning parameters is computationally burdensome because the matrix inverse and determinant of each class covariance-matrix estimator are computed across multiple cross-validation folds for each candidate tuning-parameter pair.

Here, we present the high-dimensional RDA (HDRDA) classifier, which is intended for the case when $p > N$. We reparameterize the RDA classifier similar to Hastie, Tibshirani, and Friedman (2008) and Halbe and Aladjem (2007) and employ a biased covariance-matrix estimator that partially pools the individual sample covariance matrices from the QDA classifier with the pooled sample covariance matrix from the LDA classifier. We then shrink the resulting covariance-matrix estimator towards a scaled identity matrix to ensure positive definiteness. We show that the pooling parameter in the HDRDA classifier determines the contribution of each training observation to the estimation of each class covariance matrix, enabling interpretability that has been previously lacking with the RDA classifier (Bensmail and Celeux, 1996). Our parameterization differs from that of Hastie et al. (2008) and that of Halbe and Aladjem (2007) in that our formulation allows the flexibility of various covariance-matrix estimators proposed in the literature, including a variety of ridge-like estimators, such as the one proposed by Srivastava and Kubokawa (2007).

Next, we establish that the matrix operations corresponding to the null space of the pooled sample covariance matrix are redundant and can be discarded from the HDRDA decision rule without loss of classificatory information when we apply reasoning similar to that of Ye and Wang (2006). As a result, we achieve a substantial reduction in dimension such that the matrix operations used in the HDRDA classifier are computationally linear in the number of features. Furthermore, we demonstrate that the HDRDA decision rule is invariant to adjustments to the approximately zero eigenvalues, so that the decision rule in the original feature space is equivalent to a decision rule in a lower dimension, such that matrix inverses and determinants of relatively small matrices can be rapidly computed. Finally, we show that several shrinkage methods that are special cases of the HDRDA classifier have no effect on the approximately zero eigenvalues of the covariance-matrix estimators when $p > N$. Such techniques include work from Srivastava and Kubokawa (2007), Rao and Mitra (1971), and other methods studied by Ramey and Young (2013) and Xu, Brock, and Parrish (2009).

We also provide an efficient algorithm along with pseudocode to estimate the HDRDA classifier’s tuning parameters in a grid search via cross-validation. Timing comparisons between our HDRDA implementation in the sparsediscrim R package available on CRAN and the standard RDA formulation in the klaR R package demonstrate that as the number of features increases, the computational runtime of the HDRDA classifier is drastically smaller than that of RDA. In fact, for the largest number of features considered, we show that the HDRDA classifier is 502.786 times faster on average than the RDA classifier. In this scenario, the HDRDA classifier’s model selection requires 2.979 seconds on average, while that of the RDA classifier requires 24.933 minutes on average.

Finally, we study the classification performance of the HDRDA classifier on six real high-dimensional data sets along with a simulation design that generalizes the experiments initially conducted by Guo et al. (2007). We demonstrate that the HDRDA classifier often attains superior classification accuracy to several recent classifiers designed for small-sample, high-dimensional data from Tong et al. (2012), Witten and Tibshirani (2011), Pang et al. (2009), and Guo et al. (2007). We also include as a benchmark the random forest from Breiman (2001) because Fernández-Delgado et al. (2014) have concluded that the random forest is often superior to other classifiers in benchmark studies. We show that our proposed classifier is competitive and often outperforms the random forest in terms of classification accuracy in the small-sample, high-dimensional setting.

The remainder of this paper is organized as follows. In Section 2 we introduce the classification problem and necessary notation to describe our contributions. In Section 3 we present the HDRDA classifier along with its interpretation. In Section 4, we provide properties of the HDRDA classifier and a computationally efficient model-selection procedure. In Section 5, we compare the model-selection timings of the HDRDA and RDA classifiers. In Section 6 we describe our simulation studies of artificial and real data sets and examine the experimental results. We conclude with a brief discussion in Section 7.

## 2 Preliminaries

### 2.1 Notation

To facilitate our discussion of covariance-matrix regularization and high-dimensional classification, we require the following notation. Let $\mathbb{R}^{m \times n}$ denote the space of all $m \times n$ matrices over the real field $\mathbb{R}$. Denote by $I_p$ the $p \times p$ identity matrix, and let $0_{m \times n}$ be the $m \times n$ matrix of zeros, such that $0_p$ is understood to denote $0_{p \times p}$. Define $\mathbf{1}_p$ as a $p \times 1$ vector of ones. Let $A^T$, $A^{+}$, and $\mathcal{N}(A)$ denote the transpose, the Moore-Penrose pseudoinverse, and the null space of the matrix $A$, respectively. Denote by $\mathbb{S}_{++}^p$ the cone of $p \times p$ real positive-definite matrices. Similarly, let $\mathbb{S}_{+}^p$ denote the cone of $p \times p$ real positive-semidefinite matrices. Let $\mathcal{V}^{\perp}$ denote the orthogonal complement of a vector space $\mathcal{V}$. For $y \in \{1, \ldots, K\}$, let $I(y = k) = 1$ if $y = k$ and $I(y = k) = 0$ otherwise.

### 2.2 Discriminant Analysis

In discriminant analysis we wish to assign an unlabeled observation vector $\mathbf{x} \in \mathbb{R}^p$ to one of $K$ unique, known classes by constructing a classifier from $N$ training observations. Let $\mathbf{x}_i \in \mathbb{R}^p$ be the $i$th training observation with true, unique class membership $y_i \in \{1, \ldots, K\}$, $i = 1, \ldots, N$. Denote by $n_k$ the number of training observations realized from class $k$, such that $N = \sum_{k=1}^K n_k$. We assume that $\mathbf{x}_i$ is a realization from a mixture distribution $f(\mathbf{x}) = \sum_{k=1}^K \pi_k f_k(\mathbf{x})$, where $f_k(\mathbf{x})$ is the probability density function (PDF) of the $k$th class and $\pi_k$ is the prior probability of class membership of the $k$th class. We further assume $\pi_k > 0$, $k = 1, \ldots, K$, with $\sum_{k=1}^K \pi_k = 1$.

The QDA classifier is the optimal Bayesian decision rule with respect to the 0-1 loss function when $f_k$ is the PDF of the multivariate normal distribution with known mean vector $\boldsymbol{\mu}_k$ and known covariance matrix $\Sigma_k \in \mathbb{S}_{++}^p$, $k = 1, \ldots, K$. Because $\boldsymbol{\mu}_k$ and $\Sigma_k$ are typically unknown, we assign an unlabeled observation $\mathbf{x}$ to class $k$ with the sample QDA classifier

$$D_{\text{QDA}}(\mathbf{x}) = \operatorname*{argmin}_{k}\, (\mathbf{x} - \bar{\mathbf{x}}_k)^T \hat{\Sigma}_k^{-1} (\mathbf{x} - \bar{\mathbf{x}}_k) + \log|\hat{\Sigma}_k|, \tag{1}$$

where $\bar{\mathbf{x}}_k$ and $\hat{\Sigma}_k$ are the maximum-likelihood estimators (MLEs) of $\boldsymbol{\mu}_k$ and $\Sigma_k$, respectively. If we assume further that $\Sigma_k = \Sigma$, $k = 1, \ldots, K$, then the pooled sample covariance matrix $\hat{\Sigma}$ is substituted for $\hat{\Sigma}_k$ in (1), where

$$\hat{\Sigma} = N^{-1}\sum_{k=1}^{K} n_k \hat{\Sigma}_k \tag{2}$$

is the MLE for $\Sigma$. Here, (1) reduces to the sample LDA classifier. We omit the log-determinant $\log|\hat{\Sigma}|$ because it is constant across the classes.

The smallest eigenvalues of $\hat{\Sigma}_k$ and the directions associated with their eigenvectors can highly influence the classifier in (1). In fact, the eigenvalues of $\hat{\Sigma}_k$ are well-known to be biased when $n_k$ is small relative to $p$, such that the smallest eigenvalues are underestimated (Seber, 2004). Moreover, if $n_k \leq p$, then rank$(\hat{\Sigma}_k) \leq n_k - 1 < p$, which implies that at least $p - n_k + 1$ eigenvalues of $\hat{\Sigma}_k$ are zero. Furthermore, although more feature information is available to discriminate among the classes, if $p > n_k$, (1) is incalculable because $\hat{\Sigma}_k^{-1}$ does not exist.

Several regularization methods, such as the methods considered by Xu et al. (2009), Guo et al. (2007), and Mkhadri (1995), have been proposed in the literature to adjust the eigenvalues of $\hat{\Sigma}_k$ so that (1) is calculable and has reduced variability. A common form of covariance-matrix regularization applies a shrinkage constant $\gamma > 0$, so that

$$\hat{\Sigma}_k(\gamma) = \hat{\Sigma}_k + \gamma I_p, \tag{3}$$

similar to a method employed in ridge regression (Hoerl and Kennard, 1970). Equation (3) effectively shrinks the sample covariance matrix toward the identity matrix, thereby increasing each eigenvalue of $\hat{\Sigma}_k$ by $\gamma$. Specifically, the zero eigenvalues are replaced with $\gamma > 0$, so that (3) is positive definite. For additional covariance-matrix regularization methods, see Ramey and Young (2013), Xu et al. (2009), and Ye and Ji (2009).
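As a concrete illustration of (3), the following numpy sketch (toy data; the dimensions are arbitrary assumptions) verifies that adding $\gamma I_p$ shifts every eigenvalue of a singular sample covariance matrix up by $\gamma$, producing a positive-definite estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small-sample, high-dimensional toy data: n = 10 observations, p = 25 features.
n, p = 10, 25
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)          # center by the sample mean
S = Xc.T @ Xc / n                # MLE sample covariance; rank <= n - 1 < p

# S is singular: at most n - 1 eigenvalues are numerically nonzero.
eigvals = np.linalg.eigvalsh(S)  # ascending order
assert np.sum(eigvals > 1e-10) <= n - 1

# Ridge-like shrinkage (3): every eigenvalue is increased by gamma, so the
# zero eigenvalues become gamma and the estimator is positive definite.
gamma = 0.5
S_gamma = S + gamma * np.eye(p)
shrunk = np.linalg.eigvalsh(S_gamma)
assert np.all(shrunk >= gamma - 1e-10)
assert np.allclose(shrunk, eigvals + gamma)
```

Because the eigenvectors are unchanged, only the spectrum is regularized; this is the sense in which (3) stabilizes the inverse.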

## 3 High-Dimensional Regularized Discriminant Analysis

Here, we define the HDRDA classifier by first formulating the covariance-matrix estimator and demonstrating its clear interpretation as a linear combination of the crossproducts of the training observations centered by their respective class sample means. We define the convex combination

$$\hat{\Sigma}_k(\lambda) \coloneqq (1 - \lambda)\hat{\Sigma}_k + \lambda\hat{\Sigma}, \quad k = 1, \ldots, K, \tag{4}$$

where $\lambda \in [0, 1]$ is the pooling parameter. By rewriting (4) in terms of the observations $\mathbf{x}_i$, $i = 1, \ldots, N$, each centered by its class sample mean, we attain a clear interpretation of $\hat{\Sigma}_k(\lambda)$. That is,

$$\begin{aligned} \hat{\Sigma}_k(\lambda) &= \left(1 - \lambda + \frac{\lambda n_k}{N}\right)\hat{\Sigma}_k + \frac{\lambda}{N}\sum_{\substack{k' = 1 \\ k' \neq k}}^{K} n_{k'}\hat{\Sigma}_{k'} \\ &= \left(\frac{1 - \lambda}{n_k} + \frac{\lambda}{N}\right)\sum_{i=1}^{N} I(y_i = k)\,\mathbf{x}_i\mathbf{x}_i^T + \frac{\lambda}{N}\sum_{i=1}^{N} I(y_i \neq k)\,\mathbf{x}_i\mathbf{x}_i^T \\ &= \sum_{i=1}^{N} c_i^{(k)}(\lambda)\,\mathbf{x}_i\mathbf{x}_i^T, \end{aligned} \tag{5}$$

where $c_i^{(k)}(\lambda) = (1 - \lambda)/n_k + \lambda/N$ if $y_i = k$, and $c_i^{(k)}(\lambda) = \lambda/N$ otherwise. From (5), we see that $\lambda$ weights the contribution of each of the $N$ observations in estimating $\Sigma_k$ from all classes rather than using only the $n_k$ observations from a single class. As a result, we can interpret (5) as a covariance-matrix estimator that borrows from $\hat{\Sigma}$ in (2) to estimate $\Sigma_k$.
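The identity between (4) and (5) can be checked numerically. The following numpy sketch (toy data and equal class sizes are assumptions) builds $\hat{\Sigma}_k(\lambda)$ both as the convex combination and as the weighted sum of centered cross-products:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy training data: K = 3 classes, n_per = 5 observations each, p = 8 features.
K, n_per, p = 3, 5, 8
N = K * n_per
X = rng.normal(size=(N, p))
y = np.repeat(np.arange(K), n_per)

# Class MLE covariances and the pooled MLE (2).
means = np.stack([X[y == k].mean(axis=0) for k in range(K)])
S_k = [(X[y == k] - means[k]).T @ (X[y == k] - means[k]) / n_per for k in range(K)]
S_pool = sum(n_per * S_k[k] for k in range(K)) / N

lam, k = 0.3, 0
# Convex combination (4).
direct = (1 - lam) * S_k[k] + lam * S_pool

# Weighted cross-products (5): every observation, centered by ITS OWN class
# sample mean, contributes with weight (1-lam)/n_k + lam/N (class k) or lam/N.
Xc = X - means[y]                 # center each row by its class sample mean
c = np.where(y == k, (1 - lam) / n_per + lam / N, lam / N)
weighted = (Xc * c[:, None]).T @ Xc

assert np.allclose(direct, weighted)
```

Every training observation receives a strictly positive weight whenever $\lambda > 0$, which is the borrowing interpretation described above.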

In Figure 1 we plot the contours of five multivariate normal populations for several values of $\lambda$ with unequal covariance matrices. As $\lambda$ approaches 1, the contours become more similar, resulting in identical contours for $\lambda = 1$. Below, we show that the pooling operation is advantageous in increasing the rank of each $\hat{\Sigma}_k(\lambda)$ from rank$(\hat{\Sigma}_k)$ to rank$(\hat{\Sigma})$ for $\lambda \in (0, 1]$. Notice that if $\lambda = 0$, then the observations from the remaining classes do not contribute to the estimation of $\Sigma_k$, corresponding to the QDA covariance-matrix estimator. Furthermore, if $\lambda = 1$, the weights in (5) reduce to $c_i^{(k)}(1) = 1/N$ for all $i$, corresponding to the LDA (pooled) covariance-matrix estimator. For brevity, when $\lambda = 0$, we define $\hat{\Sigma}_k(0) \coloneqq \hat{\Sigma}_k$ such that (4) yields the QDA estimator. Similarly, for $\lambda = 1$, we define $\hat{\Sigma}_k(1) \coloneqq \hat{\Sigma}$ such that (4) yields the LDA estimator.

 [Insert Figure 1 approximately here]

As we have discussed above, several eigenvalue-adjustment methods have been proposed that increase eigenvalues (approximately) equal to 0. To further improve the estimation of $\Sigma_k$ and to stabilize the estimator's inverse, we define the eigenvalue-adjusted version of (4) as

$$\tilde{\Sigma}_k \coloneqq \alpha_k\hat{\Sigma}_k(\lambda) + \gamma I_p, \tag{6}$$

where $\alpha_k > 0$ is a class-specific weight and $\gamma \geq 0$ is an eigenvalue-shrinkage constant. Thus, the pooling parameter $\lambda$ controls the amount of estimation information borrowed from $\hat{\Sigma}$ to estimate $\Sigma_k$, and the shrinkage parameter $\gamma$ determines the degree of eigenvalue shrinkage. The choice of $\alpha_k$ allows for a flexible formulation of covariance-matrix estimators. For instance, if $\alpha_k = 1$, $k = 1, \ldots, K$, then (6) resembles (3). Similarly, if $\alpha_k = 1 - \gamma$, then (6) has a form comparable to the RDA classifier from Friedman (1989). Substituting (6) into (1), we define the HDRDA classifier as

$$D_{\text{HDRDA}}(\mathbf{x}) = \operatorname*{argmin}_{k}\, (\mathbf{x} - \bar{\mathbf{x}}_k)^T\tilde{\Sigma}_k^{+}(\mathbf{x} - \bar{\mathbf{x}}_k) + \log|\tilde{\Sigma}_k|. \tag{7}$$

For $\gamma > 0$, $\tilde{\Sigma}_k$ is nonsingular such that $\tilde{\Sigma}_k^{-1}$ can be substituted for $\tilde{\Sigma}_k^{+}$ in (7). If $\gamma = 0$, we explicitly set $|\tilde{\Sigma}_k|$ equal to the product of the positive eigenvalues of $\tilde{\Sigma}_k$. Following Friedman (1989), we select $\lambda$ and $\gamma$ from a grid of candidate models via cross-validation (Hastie et al., 2008). We provide an implementation of (7) in the hdrda function contained in the sparsediscrim R package, which is available on CRAN.
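For a reference point before the computational refinements of Section 4, a deliberately naive numpy sketch of rule (7) follows, for the nonsingular case $\gamma > 0$ (toy data, equal priors, and $\alpha_k = 1$ are assumptions; the production implementation is the hdrda function in sparsediscrim):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy small-sample, high-dimensional data: K = 2 classes, p > N.
K, n, p = 2, 10, 30
N = K * n
X = rng.normal(size=(N, p))
X[:n] += 1.0                      # shift class 0 away from class 1
y = np.repeat([0, 1], n)

means = np.stack([X[y == k].mean(axis=0) for k in range(K)])
S_k = [(X[y == k] - means[k]).T @ (X[y == k] - means[k]) / n for k in range(K)]
S_pool = sum(n * S for S in S_k) / N

def hdrda_scores(x, lam, gamma, alpha=None):
    """Naive discriminant scores for rule (7); smaller is better."""
    if alpha is None:
        alpha = [1.0] * K         # alpha_k = 1 gives a ridge-like estimator
    scores = []
    for k in range(K):
        # Estimator (6): pool toward S_pool, then shrink toward gamma * I.
        Sig = alpha[k] * ((1 - lam) * S_k[k] + lam * S_pool) + gamma * np.eye(p)
        d = x - means[k]
        # gamma > 0 makes Sig nonsingular, so inv() stands in for the
        # pseudoinverse and slogdet() is well-defined.
        scores.append(d @ np.linalg.inv(Sig) @ d + np.linalg.slogdet(Sig)[1])
    return np.array(scores)

# Sanity check: each class sample mean is assigned to its own class.
assert np.argmin(hdrda_scores(means[0], lam=0.5, gamma=0.1)) == 0
assert np.argmin(hdrda_scores(means[1], lam=0.5, gamma=0.1)) == 1
```

This version inverts a $p \times p$ matrix per class per candidate $(\lambda, \gamma)$, which is exactly the cost that Section 4 eliminates.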

The choice of $\alpha_k$ in (6) is one of convenience and allows the flexibility of various covariance-matrix estimators proposed in the literature. In practice, we generally are not interested in estimating $\alpha_k$ because the estimation of additional tuning parameters via cross-validation is counterproductive to our goal of computational efficiency. For appropriate values of $\alpha_k$, $\lambda$, and $\gamma$, the HDRDA covariance-matrix estimator includes or resembles a large family of estimators. Notice that if $\alpha_k = 1$ and $\lambda = 0$, (6) is equivalent to the standard ridge-like covariance-matrix estimator in (3). Other estimators proposed in the literature can be obtained when one selects $\gamma$ accordingly. For instance, with a suitable data-based choice of $\gamma$, we obtain the estimator from Srivastava and Kubokawa (2007).

When $\alpha_k = 1 - \gamma$, (6) resembles the biased covariance-matrix estimator

$$\hat{\Sigma}_k(\lambda, \gamma) = (1 - \gamma)\hat{\Sigma}_k^{(\text{RDA})}(\lambda) + \frac{\gamma}{p}\,\text{tr}\left\{\hat{\Sigma}_k^{(\text{RDA})}(\lambda)\right\} I_p \tag{8}$$

employed in the RDA classifier, where $\hat{\Sigma}_k^{(\text{RDA})}(\lambda)$ is a partially pooled estimator of $\Sigma_k$ and $\gamma \in [0, 1]$ is a regularization parameter that controls the shrinkage of (8) towards the identity matrix weighted by the average of the eigenvalues of $\hat{\Sigma}_k^{(\text{RDA})}(\lambda)$. Despite the similarity of (8) to the HDRDA covariance-matrix estimator in (6), the RDA classifier is impractical for high-dimensional data because the inverse and determinant of the $p \times p$ matrix (8) must be calculated when substituted into (1). Furthermore, (8) has no clear interpretation.

## 4 Properties of the HDRDA Classifier

Next, we establish properties of the covariance-matrix estimator and the decision rule employed in the HDRDA classifier. By doing so, we demonstrate that (7) lends itself to a more efficient calculation. We decompose (7) into a sum of two components, where the first summand consists of matrix operations applied to low-dimensional matrices and the second summand corresponds to the null space of $\hat{\Sigma}$ in (2). We show that the matrix operations performed on the null space of $\hat{\Sigma}$ yield constant quadratic forms across all classes and can be omitted. For $p \gg N$, the constant component involves determinants and inverses of high-dimensional matrices, and by ignoring these calculations, we achieve a substantial reduction in computational costs. Furthermore, a byproduct is that adjustments to the associated eigenvalues have no effect on (7). Lastly, we utilize the singular value decomposition to efficiently calculate the eigenvalue decomposition of $\hat{\Sigma}$, further reducing the computational costs of the HDRDA classifier.

First, we require the following relationship regarding the null spaces of $\hat{\Sigma}_k$, $\hat{\Sigma}$, and $\hat{\Sigma}_k(\lambda)$.

###### Lemma 1.

Let $\hat{\Sigma}_k$ and $\hat{\Sigma}$ be the MLEs of $\Sigma_k$ and $\Sigma$, respectively, and let $\hat{\Sigma}_k(\lambda)$ be defined as in (4). Then, $\mathcal{N}\{\hat{\Sigma}_k(\lambda)\} = \mathcal{N}(\hat{\Sigma})$ for $\lambda \in (0, 1]$, $k = 1, \ldots, K$.

###### Proof.

Let $\mathbf{u} \in \mathcal{N}(\hat{\Sigma})$ for some $\mathbf{u} \in \mathbb{R}^p$. Hence, $\hat{\Sigma}\mathbf{u} = \mathbf{0}$. Because $\hat{\Sigma} = N^{-1}\sum_{k=1}^K n_k\hat{\Sigma}_k$ with each $\hat{\Sigma}_k \in \mathbb{S}_{+}^p$, we have $\mathbf{u}^T\hat{\Sigma}\mathbf{u} = 0$ and $\mathbf{u}^T\hat{\Sigma}_k\mathbf{u} = 0$, $k = 1, \ldots, K$, so that $\hat{\Sigma}_k\mathbf{u} = \mathbf{0}$. In particular, we have that $\hat{\Sigma}_k(\lambda)\mathbf{u} = (1 - \lambda)\hat{\Sigma}_k\mathbf{u} + \lambda\hat{\Sigma}\mathbf{u} = \mathbf{0}$, so that $\mathcal{N}(\hat{\Sigma}) \subseteq \mathcal{N}\{\hat{\Sigma}_k(\lambda)\}$. Now, suppose $\mathbf{u} \in \mathcal{N}\{\hat{\Sigma}_k(\lambda)\}$ with $\lambda \in (0, 1]$. Similarly, we have that $(1 - \lambda)\mathbf{u}^T\hat{\Sigma}_k\mathbf{u} + \lambda\mathbf{u}^T\hat{\Sigma}\mathbf{u} = 0$, which implies that $\hat{\Sigma}\mathbf{u} = \mathbf{0}$ because both summands are nonnegative and $\lambda > 0$. Therefore, $\mathcal{N}\{\hat{\Sigma}_k(\lambda)\} = \mathcal{N}(\hat{\Sigma})$ for $\lambda \in (0, 1]$. ∎

In Lemma 2 below, we derive an alternative expression for $\tilde{\Sigma}_k$ in terms of the matrix of eigenvectors of $\hat{\Sigma}$. Let $\hat{\Sigma} = UDU^T$ be the eigendecomposition of $\hat{\Sigma}$ such that $D$ is the $p \times p$ diagonal matrix of eigenvalues of $\hat{\Sigma}$ with

$$D = \begin{bmatrix} D_q & 0 \\ 0 & 0_{p-q} \end{bmatrix},$$

$D_q$ is the $q \times q$ diagonal matrix consisting of the positive eigenvalues of $\hat{\Sigma}$, the columns of $U$ are the corresponding orthonormal eigenvectors of $\hat{\Sigma}$, and $q = \text{rank}(\hat{\Sigma})$. Then, we partition $U = [U_1 \;\; U_2]$ such that $U_1 \in \mathbb{R}^{p \times q}$ and $U_2 \in \mathbb{R}^{p \times (p - q)}$.

###### Lemma 2.

Let $\hat{\Sigma} = UDU^T$ be the eigendecomposition of $\hat{\Sigma}$ as above, and suppose that $\text{rank}(\hat{\Sigma}) = q$. Then, we have

$$\tilde{\Sigma}_k = U\begin{bmatrix} W_k & 0 \\ 0 & \gamma I_{p-q} \end{bmatrix}U^T, \quad k = 1, \ldots, K, \tag{9}$$

where

$$W_k = \alpha_k\left\{(1 - \lambda)U_1^T\hat{\Sigma}_kU_1 + \lambda D_q\right\} + \gamma I_q. \tag{10}$$
###### Proof.

From Lemma 1, the columns of $U_2$ span the null space of $\hat{\Sigma}$, which implies that $\hat{\Sigma}_kU_2 = 0_{p \times (p - q)}$, $k = 1, \ldots, K$. Hence,

$$U^T\hat{\Sigma}_kU = \begin{bmatrix} U_1^T\hat{\Sigma}_kU_1 & 0 \\ 0 & 0_{p-q} \end{bmatrix}, \quad k = 1, \ldots, K.$$

Thus, $U^T\tilde{\Sigma}_kU$ is block-diagonal with blocks $W_k$ and $\gamma I_{p - q}$, and (9) holds because $U$ is orthogonal. ∎

As an immediate consequence of Lemma 2, we have the following corollary.

###### Corollary 1.

Let $\hat{\Sigma}_k(\lambda)$ be defined as in (4). Then, for $\lambda \in (0, 1]$, $\text{rank}\{\hat{\Sigma}_k(\lambda)\} = \text{rank}(\hat{\Sigma})$, $k = 1, \ldots, K$.

###### Proof.

The proof follows when we set $\alpha_k = 1$ and $\gamma = 0$ in Lemma 2. ∎

Thus, by incorporating each training observation into the estimation of $\Sigma_k$, we increase the rank of $\hat{\Sigma}_k(\lambda)$ from rank$(\hat{\Sigma}_k)$ to $q$ if $\lambda \in (0, 1]$. Next, we provide an essential result that enables us to prove that (7) is invariant to adjustments to the eigenvalues of $\tilde{\Sigma}_k$ corresponding to the null space of $\hat{\Sigma}$.

###### Lemma 3.

Let $U_2$ be defined as above. Then, for all $\mathbf{x} \in \mathbb{R}^p$ and all $k, k' \in \{1, \ldots, K\}$, $Q(\mathbf{x} - \bar{\mathbf{x}}_k) = Q(\mathbf{x} - \bar{\mathbf{x}}_{k'})$, where $Q(\mathbf{v}) = \mathbf{v}^TU_2U_2^T\mathbf{v}$.

###### Proof.

Let $\mathbf{z} = \bar{\mathbf{x}}_k - \bar{\mathbf{x}}_{k'}$, and suppose that $\mathbf{u}$ is a column of $U_2$. Recall that the columns of $U_2$ span $\mathcal{N}(\hat{\Sigma})$, which implies that $\mathbf{u} \in \mathcal{N}(\hat{\Sigma}_k)$, $k = 1, \ldots, K$ (Kollo and von Rosen, 2005, Lemma 1.2.5). Now, because $\mathbf{z}$ lies in $\mathcal{N}(\hat{\Sigma})^{\perp}$, we have $U_2^T\mathbf{z} = \mathbf{0}$. Hence, $U_2^T(\mathbf{x} - \bar{\mathbf{x}}_k) = U_2^T(\mathbf{x} - \bar{\mathbf{x}}_{k'})$, where $\mathbf{x} \in \mathbb{R}^p$ is arbitrary. Therefore, $Q(\mathbf{x} - \bar{\mathbf{x}}_k) = Q(\mathbf{x} - \bar{\mathbf{x}}_{k'})$. ∎

We now present our main result, where we decompose (7) and show, using Lemma 3, that the term requiring the largest computational costs does not contribute to the classification of an unlabeled observation. Hence, we reduce (7) to an equivalent, more computationally efficient decision rule.

###### Theorem 2.

Let $\tilde{\Sigma}_k$ and $W_k$ be defined as in (9) and (10), respectively, and let $U_1$ be defined as above. Then, the decision rule in (7) is equivalent to

$$D_{\text{HDRDA}}(\mathbf{x}) = \operatorname*{argmin}_{k}\, (\mathbf{x} - \bar{\mathbf{x}}_k)^TU_1W_k^{-1}U_1^T(\mathbf{x} - \bar{\mathbf{x}}_k) + \log|W_k|. \tag{11}$$
###### Proof.

From (9), we have that

$$\tilde{\Sigma}_k^{+} = U\begin{bmatrix} W_k^{-1} & 0 \\ 0 & \gamma^{+}I_{p-q} \end{bmatrix}U^T$$

where $\gamma^{+} = \gamma^{-1}$ if $\gamma > 0$ and $\gamma^{+} = 0$ otherwise, and $\log|\tilde{\Sigma}_k| = \log|W_k| + (p - q)\log\gamma$ for $\gamma > 0$. Therefore, for all $\mathbf{x} \in \mathbb{R}^p$, we have that

$$\begin{aligned} (\mathbf{x} - \bar{\mathbf{x}}_k)^T\tilde{\Sigma}_k^{+}(\mathbf{x} - \bar{\mathbf{x}}_k) + \log|\tilde{\Sigma}_k| &= (\mathbf{x} - \bar{\mathbf{x}}_k)^TU_1W_k^{-1}U_1^T(\mathbf{x} - \bar{\mathbf{x}}_k) \\ &\quad + \gamma^{+}(\mathbf{x} - \bar{\mathbf{x}}_k)^TU_2U_2^T(\mathbf{x} - \bar{\mathbf{x}}_k) + \log|W_k| \\ &\quad + (p - q)\log\gamma. \end{aligned}$$

Because $(p - q)\log\gamma$ is constant for $k = 1, \ldots, K$, we can omit the term and thereby avoid computations involving the $p - q$ eigenvalues equal to $\gamma$. Then, the proof follows from Lemma 3 because $\gamma^{+}(\mathbf{x} - \bar{\mathbf{x}}_k)^TU_2U_2^T(\mathbf{x} - \bar{\mathbf{x}}_k)$ is constant for $k = 1, \ldots, K$. ∎

Using Theorem 2, we can avoid the time-consuming inverses and determinants of the $p \times p$ covariance-matrix estimators in (7) and instead calculate these same operations on the $q \times q$ matrices $W_k$ in (11). The substantial computational improvements arise because our proposed classifier in (7) is invariant to the term corresponding to the null space of $\hat{\Sigma}$, thus yielding an equivalent classifier in (11) with a substantial reduction in computational complexity. Here, we demonstrate that the computational efficiency in calculating the inverse and determinant of $W_k$ can be further improved via standard matrix operations when we show that these operations can be performed on matrices of size $n_k \times n_k$.

###### Proposition 1.

Let $W_k$ be defined as in (10). Then, $|W_k| = |Q_k||\Gamma_k|$ and

$$W_k^{-1} = \Gamma_k^{-1} - n_k^{-1}\alpha_k(1 - \lambda)\Gamma_k^{-1}U_1^TX_k^TQ_k^{-1}X_kU_1\Gamma_k^{-1}, \tag{12}$$

where

$$Q_k = I_{n_k} + n_k^{-1}\alpha_k(1 - \lambda)X_kU_1\Gamma_k^{-1}U_1^TX_k^T \tag{13}$$

and

$$\Gamma_k = \alpha_k\lambda D_q + \gamma I_q. \tag{14}$$
###### Proof.

First, we write $W_k = \Gamma_k + n_k^{-1}\alpha_k(1 - \lambda)U_1^TX_k^TX_kU_1$, where $X_k$ denotes the $n_k \times p$ matrix of class-$k$ training observations centered by $\bar{\mathbf{x}}_k$, so that $\hat{\Sigma}_k = n_k^{-1}X_k^TX_k$. To calculate $|W_k|$, we apply Theorem 18.1.1 from Harville (2008), which states that $|R + STU| = |R||T||T^{-1} + UR^{-1}S|$ for conformable matrices with $R$ and $T$ nonsingular. Thus, setting $R = \Gamma_k$, $S = n_k^{-1}\alpha_k(1 - \lambda)U_1^TX_k^T$, $T = I_{n_k}$, and $U = X_kU_1$, we have $|W_k| = |Q_k||\Gamma_k|$. Similarly, (12) follows from the well-known Sherman-Morrison-Woodbury formula (Harville, 2008, Theorem 18.2.8) because $\Gamma_k$ is nonsingular. ∎

Notice that $\Gamma_k$ is singular when $\lambda = \gamma = 0$ because $\Gamma_k = 0_q$, in which case we use the formulation in (11) instead. Also, notice that if $\alpha_k$ is constant across the classes, then $\Gamma_k$ in (14) is independent of $k$. Consequently, $|\Gamma_k|$ is constant across the classes and need not be calculated in (11).
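Proposition 1 is a Woodbury-type identity and is easy to check numerically. The sketch below (arbitrary toy dimensions and an arbitrary orthonormal $U_1$ are assumptions) confirms both the inverse formula (12)-(14) and the determinant identity:

```python
import numpy as np

rng = np.random.default_rng(4)

# Dimensions for one class: n_k observations, reduced dimension q.
n_k, q, p = 6, 10, 25
alpha, lam, gamma = 1.0, 0.3, 0.05

Xk = rng.normal(size=(n_k, p))                  # class-k data ...
Xk -= Xk.mean(axis=0)                           # ... centered by its mean
U1 = np.linalg.qr(rng.normal(size=(p, q)))[0]   # arbitrary orthonormal basis
Dq = np.diag(rng.uniform(0.5, 2.0, size=q))

# W_k from (10), using Sigma_k = n_k^{-1} X_k^T X_k.
S_k = Xk.T @ Xk / n_k
W = alpha * ((1 - lam) * (U1.T @ S_k @ U1) + lam * Dq) + gamma * np.eye(q)

# Proposition 1: only q x q and n_k x n_k matrices are inverted.
Gamma = alpha * lam * Dq + gamma * np.eye(q)    # (14), diagonal
Gi = np.linalg.inv(Gamma)
B = U1.T @ Xk.T                                 # q x n_k
c = alpha * (1 - lam) / n_k
Q = np.eye(n_k) + c * (B.T @ Gi @ B)            # (13)
W_inv = Gi - c * (Gi @ B @ np.linalg.inv(Q) @ B.T @ Gi)   # (12)

assert np.allclose(W_inv, np.linalg.inv(W))
# Determinant identity |W_k| = |Q_k| |Gamma_k|.
assert np.isclose(np.linalg.det(W), np.linalg.det(Q) * np.linalg.det(Gamma))
```

Because $\Gamma_k$ is diagonal and $Q_k$ is only $n_k \times n_k$, the dominant costs no longer depend on $p$ once $X_kU_1$ is available.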

### 4.1 Model Selection

Thus far, we have presented the HDRDA classifier and properties that facilitate an efficient calculation of its decision rule. Here, we describe an efficient model-selection procedure along with pseudocode in Algorithm 1 to select the optimal tuning-parameter estimates $(\hat{\lambda}, \hat{\gamma})$ from the Cartesian product of grids of candidate values for $\lambda$ and $\gamma$. We estimate the cross-validation error rate for each candidate pair and select the pair $(\hat{\lambda}, \hat{\gamma})$ that attains the minimum error rate. To calculate the cross-validation error rate, we partition the original training data into mutually exclusive and exhaustive folds that have approximately the same number of observations. Then, for each fold, we classify the observations in that fold by training a classifier on the remaining folds. We calculate the cross-validation error rate as the proportion of misclassified observations across the folds.

A primary contributing factor to the efficiency of Algorithm 1 is our usage of the compact singular value decomposition (SVD). Rather than computing the eigenvalue decomposition of the $p \times p$ matrix $\hat{\Sigma}$ to obtain $U_1$, we instead obtain $U_1$ by computing the eigendecomposition of a much smaller $N \times N$ matrix when $N \ll p$ (Hastie et al., 2008, Chapter 18.3.5). Let $X$ denote the $N \times p$ matrix whose rows are the training observations centered by their class sample means. Applying the SVD, we decompose $X = V\Delta U^T$, where $V$ is orthogonal, $\Delta$ is a diagonal matrix consisting of the singular values of $X$, and $U$ is orthogonal. Recalling that $\hat{\Sigma} = N^{-1}X^TX$, we have the eigendecomposition $\hat{\Sigma} = UDU^T$, where $U$ is the matrix of eigenvectors of $\hat{\Sigma}$ and $D$ is the diagonal matrix of eigenvalues of $\hat{\Sigma}$. Now, we can obtain $V$ and the nonzero eigenvalues in $D$ efficiently from the eigenvalue decomposition of the $N \times N$ matrix $N^{-1}XX^T$. Next, we compute $N^{-1/2}X^TVD^{+/2}$, where

$$D^{+/2} = \begin{bmatrix} D_q^{-1/2} & 0 \\ 0 & 0_{N-q} \end{bmatrix}.$$

We then determine $q$, the number of numerically nonzero eigenvalues present in $D$, by calculating the number of eigenvalues that exceed some small tolerance value. We then extract $U_1$ as the first $q$ columns of $N^{-1/2}X^TVD^{+/2}$.

As a result of the compact SVD, we need calculate $U_1$ only once per cross-validation fold, requiring $O(N^2p)$ operations. Hence, the computational costs of expensive calculations, such as matrix inverses and determinants, are greatly reduced because they are performed in the $q$-dimensional subspace. Similarly, we reduce the dimension of the test data set by projecting it onto the columns of $U_1$ once per fold. Conveniently, we see that the most costly computation involved in $W_k$ and $Q_k$ is $X_kU_1$, which can be extracted from $XU_1$. Thus, after the initial calculation of $XU_1$ per cross-validation fold, $W_k$ requires $O(n_kq^2)$ operations. Because $W_k$ is $q \times q$, both its determinant and inverse require $O(q^3)$ operations. Consequently, $Q_k$ requires $O(n_k^2q)$ operations. Also, the inverse of the diagonal matrix $\Gamma_k$ requires only $O(q)$ operations. Finally, we remark that each quadratic form in (11) requires $O(q^2)$ operations once the test observation has been projected onto the columns of $U_1$.
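The compact-SVD device can be sketched directly in numpy (toy dimensions are assumptions): the eigenvectors of the $p \times p$ matrix $\hat{\Sigma}$ with positive eigenvalues are recovered from the eigendecomposition of the much smaller $N \times N$ Gram matrix:

```python
import numpy as np

rng = np.random.default_rng(5)

N, p = 20, 500                    # N << p
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)               # centered data matrix; Sigma_hat = X^T X / N

# Eigendecomposition of the small N x N matrix X X^T / N ...
evals, V = np.linalg.eigh(X @ X.T / N)
order = np.argsort(evals)[::-1]
evals, V = evals[order], V[:, order]

tol = evals[0] * 1e-8
q = int(np.sum(evals > tol))      # numerical rank

# ... yields the leading eigenvectors of the p x p matrix X^T X / N:
# U1 = N^{-1/2} X^T V_q D_q^{-1/2}, at O(N^2 p) cost instead of O(p^3).
U1 = X.T @ V[:, :q] / np.sqrt(N * evals[:q])

# Check: U1 is orthonormal and diagonalizes Sigma_hat with eigenvalues D_q.
S = X.T @ X / N
assert np.allclose(U1.T @ U1, np.eye(q), atol=1e-8)
assert np.allclose(U1.T @ S @ U1, np.diag(evals[:q]), atol=1e-8)
```

The nonzero eigenvalues of $N^{-1}XX^T$ and $N^{-1}X^TX$ coincide, which is why the small Gram matrix suffices.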

The expressions given in Proposition 1 also expedite the selection of $\lambda$ and $\gamma$ via cross-validation because the most time-consuming matrix operation involved in computing $W_k^{-1}$ and $|W_k|$ is $X_kU_1$, which is independent of $\lambda$ and $\gamma$. The subsequent operations in calculating $W_k^{-1}$ and $|W_k|$ can be simply updated for different pairs of $\lambda$ and $\gamma$ without repeating the costly computations. Also, rather than calculating the quadratic form in (11) individually for each test observation, we can apply the matrix of centered test observations to both sides of $U_1W_k^{-1}U_1^T$ at once. The diagonal elements of the resulting matrix contain the individual quadratic forms of the test observations.

## 5 Timing Comparisons between RDA and HDRDA

In this section, we demonstrate that the model selection employed in the HDRDA classifier is substantially faster than that of the RDA classifier on small-sample, high-dimensional data sets. The relative difference in runtime between the two classifiers drastically increases as $p$ increases. To compare the two classifiers, we generated 25 observations from each of $K = 4$ multivariate normal populations with distinct mean vectors. We set the covariance matrix of each population to the identity matrix. For each data set generated, we estimated the parameters $\lambda$ and $\gamma$ for both classifiers using a grid of 5 equidistant candidate values between 0 and 1, inclusive. We set $\alpha_k = 1$, $k = 1, \ldots, K$, in the HDRDA classifier. At each pair of $\lambda$ and $\gamma$, we computed the 10-fold cross-validation error rate (Hastie et al., 2008). Then, we selected the model that minimized the 10-fold cross-validation error rate.

We compared the runtime of both classifiers by increasing the number of features $p$ in increments of 500. For each value of $p$, we generated 100 data sets and computed the training and model-selection runtime of both classifiers. Our timing comparisons are based on our HDRDA implementation in the sparsediscrim R package and the standard RDA implementation in the klaR R package. All timing comparisons were conducted on an Amazon Elastic Compute Cloud (EC2) c4.4xlarge instance using version 3.3.1 of the open-source statistical software R. Our timing comparisons can be reproduced with the code available at https://github.com/ramhiser/paper-hdrda.

### 5.1 Timing Comparison Results

In Figure 2, we plot the runtime of the model selections for both the HDRDA and RDA classifiers as a function of $p$. We observed that the HDRDA classifier was substantially faster than the RDA classifier as $p$ increased. In the left panel of Figure 2, we fit a quadratic regression line to the RDA runtimes and a simple linear regression line to the HDRDA runtimes. For improved readability, in the right panel we repeat the same scatterplot and linear fit with the timings restricted to the observed range of the HDRDA timings. Figure 2 suggests that the usage of a matrix inverse and determinant in the klaR R package's discriminant function yields model-selection timings that exceed linear growth in $p$. Because the HDRDA classifier avoids these high-dimensional inverses and determinants, it was computationally more efficient than the RDA classifier, especially as $p$ increased. In fact, for the largest dimension considered, the RDA classifier required 24.933 minutes on average to perform model selection, while the HDRDA classifier selected its optimal model in 2.979 seconds on average. Clearly, the model selection employed by the HDRDA classifier is substantially faster than that of the RDA classifier.

We quantified the relative timing comparisons between the two classifiers by calculating the ratio of the mean timing of the RDA classifier to that of the HDRDA classifier for each value of $p$. We employed nonparametric bootstrapping to estimate the mean ratio along with 95% confidence intervals. In Figure 3, the bootstrap sampling distributions for the ratio of mean timings are given. First, we observe that the mean relative timings increased as $p$ increased. Even for the smallest dimension considered, the relative difference in computing was sizable, with the average ratio of the mean timings equal to 14.513 and a 95% confidence interval of (14.191, 14.855). Furthermore, the ratio of mean computing times suggested that the RDA classifier is impractical for higher dimensions: for the largest dimension considered, the ratio of mean computing times increased to 502.786 with a 95% confidence interval of (462.863, 546.396).

 [Insert Figure 2 approximately here]
 [Insert Figure 3 approximately here]

## 6 Classification Study

In this section, we compare our proposed classifier with four classifiers recently proposed for small-sample, high-dimensional data along with the random-forest classifier from Breiman (2001) using version 3.3.1 of the open-source statistical software R. Within our study, we included penalized linear discriminant analysis from Witten and Tibshirani (2011), implemented in the penalizedLDA package. We also considered shrunken centroids regularized discriminant analysis from Guo et al. (2007) in the rda package. Because the rda package does not perform the authors' "Min-Min" rule automatically, we applied this rule within our R code. We included two modifications of diagonal linear discriminant analysis from Tong et al. (2012) and Pang et al. (2009), where the former employs an improved mean estimator and the latter utilizes an improved variance estimator. Both classifiers are available in the sparsediscrim package. Finally, we incorporated the random forest as a benchmark based on the findings of Fernández-Delgado et al. (2014), who concluded that the random forest is often superior to other classifiers in benchmark studies. We used the implementation of the random-forest classifier from the randomForest package with 250 trees and 100 maximum nodes. For each classifier we explicitly set the prior probabilities to be equal, where applicable. All other classifier options were set to their default settings. Below, we refer to each classifier by the first author's surname. All simulations were conducted on an Amazon EC2 c4.4xlarge instance. Our analyses can be reproduced via the code available at https://github.com/ramhiser/paper-hdrda.

For the HDRDA classifier in (11), we examined the classification performance of two models. For the first HDRDA model, we set $\alpha_k = 1$, $k = 1, \ldots, K$, so that the covariance-matrix estimator (6) resembled (3). We estimated $\lambda$ from a grid of 21 equidistant candidate values between 0 and 1, inclusive. Similarly, we estimated $\gamma$ from a small grid of nonnegative candidate values. We selected optimal estimates of $\lambda$ and $\gamma$ using cross-validation. For the second model, we set $\alpha_k = 1 - \gamma$, $k = 1, \ldots, K$, to resemble Friedman's parameterization, and we estimated both $\lambda$ and $\gamma$ from a grid of 21 equidistant candidate values between 0 and 1, inclusive.

We did not include the RDA classifier in our classification study because its training runtime was prohibitively slow on high-dimensional data in our preliminary experiments. As shown in Section 5, the runtime of the RDA classifier was drastically larger than that of the HDRDA classifier even for a $5 \times 5$ tuning grid. Consequently, a fair comparison between the RDA and HDRDA classifiers would require model selection of $21 \times 21 = 441$ different pairs of tuning parameters in the RDA classifier. A tuning grid of this size yielded excessively slow training runtimes for the RDA implementation from the klaR R package.

### 6.1 Simulation Study

In this section we compare the competing classifiers using the simulation design from Guo et al. (2007). This design is widely used within the high-dimensional classification literature, including the studies by Ramey and Young (2013) and Witten and Tibshirani (2011). First, we consider the block-diagonal covariance matrix from Guo et al. (2007),

$$\Sigma_k = \begin{bmatrix} \Sigma(\rho_k) & 0_{100} & 0_{100} & \cdots \\ 0_{100} & \Sigma(-\rho_k) & 0_{100} & \cdots \\ 0_{100} & 0_{100} & \Sigma(\rho_k) & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}, \tag{15}$$

where the diagonal blocks alternate between $\Sigma(\rho_k)$ and $\Sigma(-\rho_k)$,

and the $(i, j)$th entry of the $100 \times 100$ block matrix $\Sigma(\rho_k)$ is

$$\Sigma(\rho_k)_{ij} = \left\{\rho_k^{|i - j|}\right\}_{1 \leq i, j \leq 100}.$$

The block-diagonal covariance structure in (15) resembles gene-expression data: within each block of pathways, genes are correlated, and the correlation decays as a function of the distance between any two genes. The original design from Guo et al. (2007) comprised two $p$-dimensional multivariate normal populations with a common block-diagonal covariance matrix.
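The structure in (15) is straightforward to construct. A numpy sketch (the helper names are our own) with the block size fixed at 100:

```python
import numpy as np

def autocorr_block(rho, size=100):
    """size x size block with (i, j) entry rho^{|i - j|}."""
    idx = np.arange(size)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def block_cov(rho, p, size=100):
    """Block-diagonal covariance (15): blocks alternate rho and -rho."""
    out = np.zeros((p, p))
    for b in range(p // size):
        block = autocorr_block(rho if b % 2 == 0 else -rho, size)
        out[size * b:size * (b + 1), size * b:size * (b + 1)] = block
    return out

Sigma = block_cov(0.9, 500)
assert Sigma.shape == (500, 500)
assert np.allclose(np.diag(Sigma), 1.0)          # unit variances
assert np.isclose(Sigma[0, 1], 0.9)              # within-block decay
assert np.isclose(Sigma[100, 101], -0.9)         # second block uses -rho
assert Sigma[0, 100] == 0.0                      # blocks are uncorrelated
```

The correlation decays geometrically with the distance $|i - j|$ inside a block and is exactly zero across blocks, mirroring the pathway interpretation above.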

Although the design is indeed standard, the simulation configuration lacks artifacts commonly observed in real data, such as skewness and extreme outliers. As a result, we wished to investigate the effect of outliers on the high-dimensional classifiers. To accomplish this goal, we generalized the block-diagonal simulation configuration by sampling from a $p$-dimensional multivariate contaminated normal distribution. Denoting the PDF of the $p$-dimensional multivariate normal distribution with mean vector $\boldsymbol{\mu}_k$ and covariance matrix $\Sigma_k$ by $N_p(\mathbf{x}|\boldsymbol{\mu}_k, \Sigma_k)$, we write the PDF of the $k$th class as

$$p(\mathbf{x}|\omega_k) = (1 - \epsilon)N_p(\mathbf{x}|\boldsymbol{\mu}_k, \Sigma_k) + \epsilon N_p(\mathbf{x}|\boldsymbol{\mu}_k, \eta\Sigma_k), \tag{16}$$

where $\epsilon \in [0, 1]$ is the probability that an observation is contaminated (i.e., drawn from a distribution with larger variance) and $\eta > 1$ scales the covariance matrix to increase the extremity of outliers. For $\epsilon = 0$, we have the benchmark block-diagonal simulation design from Guo et al. (2007). As $\epsilon$ is increased, the average number of outliers is increased. In our simulation, we fixed $\eta$ and considered four increasing values of $\epsilon$, beginning with the uncontaminated case $\epsilon = 0$.
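Sampling from the contaminated normal PDF (16) amounts to flipping an $\epsilon$-coin per observation and inflating the covariance by $\eta$ for the contaminated draws. A numpy sketch (the function name and the particular $\epsilon$, $\eta$ values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

def rcontam_normal(n, mean, cov, eps, eta, rng):
    """Draw n samples from (16): with prob eps, use N(mean, eta * cov)."""
    p = mean.shape[0]
    contaminated = rng.random(n) < eps
    # Scaling a N(0, cov) draw by sqrt(eta) yields a N(0, eta * cov) draw.
    scale = np.where(contaminated, np.sqrt(eta), 1.0)
    z = rng.multivariate_normal(np.zeros(p), cov, size=n)
    return mean + scale[:, None] * z, contaminated

mean = np.zeros(5)
cov = np.eye(5)
X, flags = rcontam_normal(10000, mean, cov, eps=0.1, eta=25.0, rng=rng)

# Roughly eps of the draws are contaminated, with eta-times-larger variance.
assert abs(flags.mean() - 0.1) < 0.02
assert X[flags].var() > 5 * X[~flags].var()
```

Both mixture components share the same mean, so contamination inflates only the spread, which is the outlier mechanism studied in this section.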

We generated $K = 3$ populations from (16) with $\Sigma_k$ given in (15) and set the mean vector of class 1 to the zero vector. Next, comparable to Guo et al. (2007), the first 100 features of the class-2 mean vector were set to 1/2, while the rest were set to 0. The class-3 mean vector was defined similarly, so that the three populations differed in their mean vectors only in the first 100 features corresponding to the first block, with no difference in the means in the remaining blocks.

From each of the three populations, we sampled 25 training observations ($n_k = 25$ for all $k$) and 10,000 test observations. After training each classifier on the training data, we classified the test data sets and computed the proportion of mislabeled test observations to estimate the classification error rate for each classifier. Repeating this process 500 times, we computed the average of the error-rate estimates for each classifier. We allowed the number of features $p$ to vary in increments of 100 to examine the classification accuracy as the feature dimension increased while maintaining a small sample size. Guo et al. (2007) originally considered a common value of $\rho_k$ for all $k$. Alternatively, to explore the more realistic assumption of unequal covariance matrices, we assigned a different value of $\rho_k$ to each class.
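The Monte Carlo error-rate estimation described above can be sketched generically as follows. This is a schematic harness (our own, assuming NumPy): `train_and_predict` stands in for any of the competing classifiers, and `sample_class` for a draw from one of the three populations.

```python
import numpy as np

def estimate_error_rate(train_and_predict, sample_class, n_train=25,
                        n_test=10000, n_classes=3, n_reps=500, rng=None):
    """Monte Carlo estimate of the classification error rate: in each
    repetition, draw training and test samples from every class, fit the
    classifier, and record the proportion of mislabeled test observations.
    Returns the average error-rate estimate over all repetitions."""
    if rng is None:
        rng = np.random.default_rng()
    errors = []
    for _ in range(n_reps):
        x_tr = np.vstack([sample_class(k, n_train, rng) for k in range(n_classes)])
        y_tr = np.repeat(np.arange(n_classes), n_train)
        x_te = np.vstack([sample_class(k, n_test, rng) for k in range(n_classes)])
        y_te = np.repeat(np.arange(n_classes), n_test)
        y_hat = train_and_predict(x_tr, y_tr, x_te)
        errors.append(float(np.mean(y_hat != y_te)))
    return float(np.mean(errors))
```

The defaults mirror the design above: 25 training and 10,000 test observations per class, averaged over 500 repetitions.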

#### Simulation Results

In Figure 4, we present each classifier's average classification error rates for the considered values of $\epsilon$ and $p$. Unsurprisingly, the average error rate increased for each classifier as the contamination probability increased, regardless of the value of $p$. Sensitivity to the presence of outliers was most apparent for the Pang, Tong, and Witten classifiers. For smaller dimensions, the random-forest and HDRDA classifiers tended to outperform the remaining classifiers, with the random forest performing best. As the feature dimension $p$ increased, both HDRDA classifiers outperformed all other classifiers, suggesting that their inherent dimension reduction better captured the classificatory information in the small training samples, even in the presence of outliers.

 [Insert Figure 4 approximately here]

The Pang, Tong, and Witten methods yielded practically the same, and consistently the worst, error rates when outliers were present ($\epsilon > 0$), suggesting that these classifiers were sensitive to outliers. Notice, for example, that for one of the feature dimensions considered, the error rates of the Pang, Tong, and Witten classifiers increased dramatically from approximately 19% when no outliers were present to approximately 43% once contamination was introduced. The sharp increase in average error rates for these three classifiers continued as $\epsilon$ increased. Guo's method always outperformed those of Pang, Witten, and Tong, but once outliers were introduced, the Guo classifier's average error rate was not competitive with the HDRDA classifiers or the random-forest classifier.

 [Insert Figure 5 approximately here]

In Figure 5, we again examine the simulation results, now as a function of $p$ for a subset of the values of $\epsilon$. This set of plots allows us to investigate the effect of feature dimensionality on classification performance. When no outliers were present (i.e., $\epsilon = 0$), the random-forest classifier was outperformed by all other classifiers. Furthermore, the HDRDA classifiers were superior in terms of average error rate in this setting. As $p$ increased, an elevation in average error rate was expected for all classifiers, but the observed increase was not substantial.

For $\epsilon > 0$, we observed a different behavior in classification performance. First, the Pang, Tong, and Witten methods, along with the random-forest method, increased in average error rate as $p$ increased. Contrarily, the performance of the HDRDA and Guo classifiers was hardly affected by $p$. Also, as discussed above, the HDRDA classifiers were superior to all other classifiers for large values of $p$, with only the random-forest classifier outperforming them in smaller feature-dimension cases.

### 6.2 Application to Gene Expression Data

We compared the HDRDA classifier to the five competing classifiers on six benchmark gene-expression microarray data sets. First, we evaluated the classification accuracy of each classifier by randomly partitioning the data set under consideration into a training set and a test set. To expedite the computational runtime, we reduced the training data to the top 1000 variables by employing the variable-selection method proposed by Dudoit et al. (2002) and then restricted the test data to the same 1000 variables. After training each classifier on the training data, we classified the test data sets and computed the proportion of mislabeled test observations to estimate the classification error rate for each classifier. Repeating this process 100 times, we computed the average of the error-rate estimates for each classifier. We next provide a concise description of each high-dimensional data set examined in our classification study.
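The variable-screening step can be sketched as follows. The code implements the ratio of between-group to within-group sums of squares used by Dudoit et al. (2002) as a gene-screening criterion; this is our own minimal NumPy version, not the code used in the study.

```python
import numpy as np

def top_bss_wss_features(x, y, n_keep=1000):
    """Rank the columns of x by the ratio of between-group to within-group
    sums of squares (BSS/WSS) and return the indices of the n_keep
    highest-ranked features."""
    classes = np.unique(y)
    grand_mean = x.mean(axis=0)
    bss = np.zeros(x.shape[1])
    wss = np.zeros(x.shape[1])
    for k in classes:
        xk = x[y == k]
        class_mean = xk.mean(axis=0)
        bss += len(xk) * (class_mean - grand_mean) ** 2
        wss += ((xk - class_mean) ** 2).sum(axis=0)
    # Guard against zero within-group variance to avoid division by zero.
    ratio = bss / np.maximum(wss, np.finfo(float).tiny)
    return np.argsort(ratio)[::-1][:n_keep]
```

In our study, the indices would be computed on the training partition only and then used to subset both the training and test data, so that the screening step does not leak information from the test set.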

#### Chiaretti et al. (2004) Data Set

Chiaretti et al. (2004) measured the gene-expression profiles of 128 individuals with acute lymphoblastic leukemia (ALL) using Affymetrix human 95Av2 arrays. Following Xu et al. (2009), we restricted the data set to two classes: observations without cytogenetic abnormalities and observations with a detected BCR/ABL gene. The robust multichip average normalization method was applied to all 12,625 gene-expression levels.

#### Chowdary et al. (2006) Data Set

Chowdary et al. (2006) investigated 52 matched pairs of tissues from colon and breast tumors using Affymetrix U133A arrays and ribonucleic-acid (RNA) amplification. Each tissue pair was gathered from the same patient and consisted of a snap-frozen tissue and a tissue suspended in an RNAlater preservative. Overall, 31 breast-cancer and 21 colon-cancer pairs were gathered, yielding two classes. A purpose of the study was to determine whether the disease state could be identified from 22,283 gene-expression profiles.

#### Nakayama et al. (2007) Data Set

Nakayama et al. (2007) acquired 105 gene-expression samples of 10 types of soft-tissue tumors through an oligonucleotide microarray, including 16 samples of synovial sarcoma (SS), 19 samples of myxoid/round cell liposarcoma (MLS), 3 samples of lipoma, 3 samples of well-differentiated liposarcoma (WDLS), 15 samples of dedifferentiated liposarcoma (DDLS), 15 samples of myxofibrosarcoma (MFS), 6 samples of leiomyosarcoma (LMS), 3 samples of malignant peripheral nerve sheath tumor (MPNST), 4 samples of fibrosarcoma (FS), and 21 samples of malignant fibrous histiocytoma (MFH). Nakayama et al. (2007) determined from their data that these 10 types fell into 4 broader groups: (1) SS; (2) MLS; (3) lipoma, WDLS, and part of DDLS; and (4) spindle cell and pleomorphic sarcomas, including DDLS, MFS, LMS, MPNST, FS, and MFH. Following Witten and Tibshirani (2011), we restricted our analysis to the five tumor types having at least 15 observations.

#### Shipp et al. (2002) Data Set

According to Shipp et al. (2002), approximately 30-40% of adult non-Hodgkin lymphomas are diffuse large B-cell lymphomas (DLBCLs). However, only a small proportion of DLBCL patients are cured with modern chemotherapeutic regimens. Several models, such as the International Prognostic Index (IPI), have been proposed to determine a patient's curability. These models rely on clinical covariates, such as age, and are often ineffective, leading Shipp et al. (2002) to argue that researchers need more effective means of determining a patient's curability. The authors measured 6,817 gene-expression levels from 58 DLBCL patient samples with customized cDNA (lymphochip) microarrays to investigate the curability of patients treated with cyclophosphamide, adriamycin, vincristine, and prednisone (CHOP)-based chemotherapy. Among the 58 DLBCL patient samples, 32 are from cured patients, while 26 are from patients with fatal or refractory disease.

#### Singh et al. (2002) Data Set

Singh et al. (2002) examined 235 radical prostatectomy specimens from surgery patients treated between 1995 and 1997. The authors used oligonucleotide microarrays containing probes for approximately 12,600 genes and expressed sequence tags. They reported that 102 of the radical prostatectomy specimens were of high quality: 52 prostate-tumor samples and 50 non-tumor prostate samples.

#### Tian et al. (2003) Data Set

Tian et al. (2003) investigated purified plasma cells from the bone marrow of control patients and of patients with newly diagnosed multiple myeloma. Expression profiles for 12,625 genes were obtained via Affymetrix U95Av2 microarrays. The plasma cells were subjected to biochemical and immunohistochemical analyses to identify molecular determinants of osteolytic lesions. For 36 multiple-myeloma patients, focal bone lesions could not be detected by magnetic resonance imaging (MRI), whereas MRI detected such lesions in 137 patients.

#### Classification Results

Similar to Witten and Tibshirani (2011), we report in Table 1 the average test error rates obtained over 100 random training-test partitions, along with the standard deviations of the test error rates in parentheses. The HDRDA and Guo classifiers were superior in classification performance on the majority of the data sets. The HDRDA classifiers yielded the best classification accuracy on the Chowdary and Shipp data sets. Although the random forest's accuracy slightly exceeded that of the HDRDA classifiers on the Tian data set, our proposed classifiers outperformed the other competing classifiers considered here. Moreover, the HDRDA classifiers yielded comparable performance on five of the six data sets.

 [Insert Table 1 approximately here]

The average error-rate estimates for the Pang, Tong, and Witten classifiers were comparable across all six data sets. Furthermore, the average error rates for the Pang and Tong classifiers were approximately equal for all data sets except the Chiaretti data set. This result suggests that the mean and variance estimators used in lieu of the MLEs provided little improvement in classification accuracy. However, we investigated the Pang classifier's poor performance on the Chiaretti data set and determined that its variance estimator exhibited numerical instability: the variance estimate in the classifier's denominator was approximately zero for both classes, which led to the poor classification performance.

The random-forest classifier was competitive when applied to the Chowdary and Singh data sets and yielded the smallest error rate of the considered classifiers on the Tian data set. The fact that the HDRDA and Guo classifiers typically outperformed the random-forest classifier challenges the claim of Fernández-Delgado et al. (2014) that random forests are typically superior. Further studies should be performed to validate this statement in the small-sample, high-dimensional setting.

Finally, the Pang, Tong, and Witten classifiers consistently yielded the largest average error rates across the six data sets. Given that the standard deviations were relatively large, we hesitate to generalize claims regarding the ranking of these three classifiers in terms of the average error rate. However, the classifiers’ error rates and their variability across multiple random partitions of each data set were large enough that we might question their benefit when applied to real data.