Continuum directions for supervised dimension reduction

# Continuum directions for supervised dimension reduction

Sungkyu Jung Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, U.S.A.
###### Abstract

Dimension reduction of multivariate data supervised by auxiliary information is considered. A series of basis for dimension reduction is obtained as minimizers of a novel criterion. The proposed method is akin to continuum regression, and the resulting basis is called continuum directions. With a presence of binary supervision data, these directions continuously bridge the principal component, mean difference and linear discriminant directions, thus ranging from unsupervised to fully supervised dimension reduction. High-dimensional asymptotic studies of continuum directions for binary supervision reveal several interesting facts. The conditions under which the sample continuum directions are inconsistent, but their classification performance is good, are specified. While the proposed method can be directly used for binary and multi-category classification, its generalizations to incorporate any form of auxiliary data are also presented. The proposed method enjoys fast computation, and the performance is better or on par with more computer-intensive alternatives.

###### keywords:
continuum regression, dimension reduction, linear discriminant analysis, high-dimension, low-sample-size (HDLSS), maximum data piling, principal component analysis
60K35
journal: CSDA

## 1 Introduction

In modern complex data, it becomes increasingly common that multiple data sets are available. We consider the data situation where a supervised dimension reduction is naturally considered. Two types of data are collected on a same set of subjects: a data set of primary interest and an auxiliary data set . The goal of supervised dimension reduction is to delineate major signals in , dependent to . Relevant application areas include genomics (genetic studies collect both gene expression and SNP data—Li et al. (2016)), finance data (stocks as in relation to characteristics of each stock: size, value, momentum and volatility—Connor et al. (2012)), and batch effect adjustments (Lee et al., 2014).

There has been a number of work in dealing with the multi-source data situation. Lock et al. (2013) developed JIVE to separate joint variation from individual variations. Large-scale correlation studies can identify millions of pairwise associations between two data sets through and multiple canonical correlation analysis (Witten and Tibshirani, 2009). These methods, however, do not provide supervised dimension reduction of a particular data set , since all data sets assume an equal role.

In contrast, reduced-rank regression (RRR, Izenman, 1975; Tso, 1981) and envelop models (Cook et al., 2010) provide sufficient dimension reduction (Cook and Ni, 2005) for regression problems. See Cook et al. (2013) for connections between envelops and partial least square regression. Variants of principal component analysis (PCA) have been proposed to incorporate auxiliary information; see Fan et al. (2016) and references therein. Recently, Li et al. (2016) proposed SupSVD, a supervised PCA that encompasses regular PCA to RRR. Our goal is similar to that of SupSVD, which extends RRR and envelop models, in that the primary and auxiliary data sets play different roles. We consider a basis (or subspace) recovery to extract the part of main data set which is relevant to the auxiliary data set. Unlike SupSVD, which provides a fully supervised dimension reduction, we seek a unified framework that covers a wide spectrum from fully-supervised to unsupervised dimension reduction.

A potential drawback of fully supervised dimension reduction as a preprocessing for further application of predictive modeling is a double-dipping problem: The same signal is considered both at dimension reduction and at classifiers. In high dimensional data situations, small signals can sway the whole analysis, often leading to a spurious finding that can not be replicated in subsequent studies. A regularized semi-supervised dimension reduction has a potential to mitigate the double-dipping problem.

We propose a semi-supervised basis learning for the primary data that covers a wide range of spectrum from supervised to unsupervised dimension reduction. A meta-parameter is introduced to control the degrees of supervision. The spectrum of dimension reduction given by different is best understood when there exists a single binary supervision. In such a special case, the directional vectors of the basis continuously bridge the principal component direction, mean difference and Fisher’s linear discriminant directions.

The proposed method was motivated by the continuum regression (Stone and Brooks, 1990), regressors ranging from the ordinary least square to the principal component regression. In the context of regression, our primary data set is predictors while the auxiliary data are the response. The new basis proposed in this work, called continuum directions, can be used with multivariate supervision data, consisting of either categorical or continuous variables.

We also pay a close attention to the high-dimension, low-sample-size situations (or the case), and give a new insight on the maximum data piling (MDP) direction , proposed as a discriminant direction for binary classification by Ahn and Marron (2010). In particular, we show that is a special case of the proposed continuum direction, and if , MDP is preferable to linear discriminant directions in terms of Fisher’s original criterion for linear discriminant analysis (LDA, Fisher, 1936). We further show, under the high-dimension, low-sample-size asymptotic scenario (Hall et al., 2005), although the empirical continuum directions are inconsistent with their population counterparts, the classification performance using the empirical continuum directions can be good, if the signal strength is large enough.

As an application of the continuum directions, we endeavor to use the continuum directions in classification problems. Recently, numerous efforts to improve classifications for the situation have been made. Linear classifiers such as LDA, the support vector machine (Vapnik, 2013) or distance weighted discrimination (Marron et al., 2007; Qiao et al., 2010) often yield better classification than nonlinear methods, in high dimensional data analysis. A recent trend is sparse estimations. Bickel and Levina (2004) studied the independence rule, ignoring off-diagonal entries of . Additionally assuming sparsity of the population mean difference, Fan and Fan (2008) proposed the features annealed independence rule (FAIR). Wu et al. (2009) and Shao et al. (2011) proposed sparse LDA estimations, and Clemmensen et al. (2011) proposed sparse discriminant analysis (SDA) to learn sparse basis for multi-category classification. Cai and Liu (2011) proposed the linear programming discriminant rule (LPD) for sparse estimation of the discriminant direction vector. The sparse LDA, SDA and LPD are designed to work well if their sparsity assumptions are satisfied. Sophisticated methods such as those of Wu et al. (2009) and Cai and Liu (2011) usually suffer from heavy computational cost. On the other hand, simpler methods such as FAIR and MDP are much faster. Our method, when applied to the binary classification problem, leads to analytic solutions, and the computation times are scalable. We show via simulation studies that classification performance using the continuum directions is among the best when the true signal is not sparse and the variables are highly correlated.

The rest of the paper is organized as follows. In Section 2, we introduce continuum directions and discuss its relation to continuum regression. In the same section, we provide some insights for continuum directions in high dimensions. In Section 3, we show numerical procedures that are efficient for high-dimensional data. Simulation studies for classification performance in high dimensions can be found in Section 4. We further show advantages of our method by a few real data examples in Section 5. We conclude with a discussion. Proofs are contained in Appendix.

## 2 Continuum directions

### 2.1 Motivation

To motivate the proposed framework for dimension reduction, we first analyze a special case where the supervision data consist of a binary variable. We discuss a few meaningful directions for such situations, viewed in terms of a two-group classification problem. These directions are special cases of the continuum directions, defined later in (6).

Let and be the numbers of observations in each group and . Denote and for the -dimensional observations of the first and second group, respectively. In our study it is sufficient to keep the sample variance-covariances. Denote for the within-group variance matrix, i.e. the estimated (pooled) common covariance, and for the between-group variance matrix. The total variance matrix is with the common mean , and .

Fisher’s criterion for discriminant directions is to find a direction vector such that, when data are projected onto , the between-variance is maximized while the within-variance is minimized. That is, one wishes to find a maximum of

 T(w)=wTSBwwTSWw. (1)

If is non-singular, i.e. the data are not collinear and , the solution is given by , where . It has been a common practice to extend the solution to the case using a generalized inverse, i.e.,

 wLDA∝S−Wd,

where stands for the Moore–Penrose pseudoinverse of square matrix .

In retrospect, when , Fisher’s criterion is ill-posed since there are infinitely many ’s satisfying . Any such , which also satisfies , leads to . In fact, in such a situation, is not a maximizer of but merely a critical point of . Ahn and Marron (2010) proposed a maximal data piling (MDP) direction which maximizes the between-group variance subject to , and is

 wMDP∝S−Td.

Note that also maximizes a criterion

 TMDP(w)=wTSBwwTSTw. (2)

In the conventional case where , the criteria (1) and (2) are equivalent up to a constant, and . We discuss further in Section 2.5 that MDP is more preferable than LDA in the high-dimensional situations.

A widely used modification to Fisher’s criterion is to shrink toward a diagonal matrix, leading to

 Tα(w)=wTSBwwT(SW+αI)w,  for some α≥0. (3)

This approach has been understood in a similar flavor to ridge regression (Hastie et al., 2009). The solution of the above criterion is simply given by A special case is in the limit , where the solution becomes the direction of mean difference (MD) , which maximizes

 TMD(w)=wTSBw, (4)

with a conventional constraint .

In high dimensional data situations, utilizing the principal components is a natural and nonparametric way to filter out redundant noise. Principal component analysis (PCA) reduces the dimension to some low number so that the subspace formed by the first principal component directions contains maximal variation of the data among all other -dimensional subspaces. In particular, the first principal component direction maximizes the criterion for the first principal component direction,

 TPCA(w)=wTSTwwTw. (5)

The important three directions of MDP, MD and PCA differ only in criteria maximized. With the constraint , the criteria (2)–(5) are functions of total-variance and between-variance . For the binary supervision case, a generalized criterion that embraces all three methods is

 Tγ(w)=(wTSBw)(wTSTw)γ−1  subject to  wTw=1, (6)

where takes some value in . The special cases are MDP as , MD at , and PCA when . The direction vector that maximizes is called the continuum direction for .

### 2.2 General Continuum directions

The continuum direction (6) defined for the binary supervision is now generalized to incorporate any form of supervision.

Denote for the primary data matrix and for the matrix with secondary information. The matrix contains the supervision information that can be binary, categorical, and continuous. For example if the supervision information is a binary indicator for two-group classification with group sizes and , as in Section 2.1, then the matrix can be coded as the matrix where and is the length- vector, whose th element is if the th subject is in the th group, and zero otherwise. Similarly, if the supervision information is multicategory with groups, then is the matrix whose th row is , where is the number of observations belonging to category . If the supervision is continuous and multivariate, such as responses in multivariate regression, then the matrix would collect centered measurements of response variables.

Assuming for simplicity that is centered, we write the total variance-covariance matrix of by , and the -relevant variance-covariance matrix of by . A completely unsupervised dimension reduction can be obtained by eigendecomposition of . On the contrary, a fully-supervised approach is to focus on the column space of , corresponding to the mean difference direction when is binary. An extreme approach that nullifies the variation in to maximize the signals in can be obtained by eigendecomposition of . When is categorical, this reduces to the reduced-rank LDA.

Generalizing (6), the following approach encompasses the whole spectrum from the supervised to unsupervised dimension reduction. A meta-parameter controls the degree of supervision. For each , we obtain a basis for dimension reduction of in a sequential fashion. In particular, given , the th direction is defined by maximizing

 Tγ(w) =(wTSBw)(wTSTw)γ−1, (7) subject to wTw=1  and wTSTw(ℓ)=0, ℓ=1,…,k.

The sequence of directions for a given value of is then -orthonormal to each other: for . An advantage of requiring -orthogonality is that the resulting scores are uncorrelated with for . This is desirable if these scores are used for further analysis, such as a classification based on these scores.

In sequentially solving (7), choosing large provides nearly unsupervised solutions while yields an extremely supervised dimension reduction. The spectrum from unsupervised to supervised dimension reduction is illustrated in a real data example shown in Example 1.

###### Example 1.

We demonstrate the proposed method of dimension reduction for a real data set from a microarray study. This data set, described in detail in Bhattacharjee et al. (2001), contains genes (primary data) from patients while the patients are labeled by four different lung cancer subtypes (supervision data). The primary dataset is the matrix of normalized gene expressions, while the supervision data is the matrix , coded to use the categorical cancer subtypes as the supervision.

The continuum directions can provide basis of dimension reduction, ranging from the unsupervised () to the fully supervised (). In Fig. 1, the projected scores of the original data are plotted for four choices of .

A dimension reduction by PCA has been useful for this data set, since the four subtypes are visually separated by using the first few sample principal components. The principal component scores are similar to those plotted in the first panel of Fig. 1 when is large enough. On the other hand, a fully supervised dimension reduction given by the MDP directions, plotted in the bottom right panel, nullifies any variation in the primary data set. Specifically, all observations corresponding to the same subtype project to a single point, a feature due to the high dimensionality. Thus the projected scores for contain information only relevant to the supervision.

The continuum direction as a function of is continuous (shown later in Proposition 2), thus the projected scores are also continuous with respect to . The continuous transition of the scores from large to small in Fig. 1 is thus expected. The question of which value of to use in final dimension reduction depends on the purpose of analysis. For exploratory analysis, several values of may be used to examine the data from a variety of viewpoints. If the dimension reduction is performed for regression or classification, a cross-validation can be used, which is discussed in Section 2.3.

### 2.3 Continuum directions for classification

When the supervision data is binary or categorical, it is natural to seek an application of continuum directions for the basis of classification. In particular, for the binary supervision case, as shown in Section 2.1, the continuum direction can be thought of as the normal direction to the separating hyperplane.

In the general -group situation, for each , the sequence of directions are used to obtain dimension-reduced scores , , for secondary discriminant analysis. In particular, we choose and use , , in training the ordinary LDA. For prediction, the scores of a new observation are used as the input to the LDA, trained on the -dimensional space of scores. This classification rule is called continuum discriminant analysis (CDA).

The CDA depends on the choice of . A 10-fold cross-validation to minimize the expected risk with the 0-1 loss can be used to tune . We use a cross-validation index that counts the number of misclassified observations for each given , divided by the total number of training sample. As exemplified with real data examples in Section 5, the index is typically U-shaped. This is because that the two ends of the spectrum are quite extreme. Choosing results in the unmodified LDA or MDP, while choosing results in using PC1 direction for classification. In our real data examples, the minimizer of is found in the interval .

### 2.4 Relation to continuum regression

A special case of the proposed method, specifically (6) for the binary supervision, can be viewed as a special case of continuum regression (Stone and Brooks, 1990). The continuum regression leads to a series of regressors that bridges ordinary least squares, partial least squares and principal component regressions. In connection with the continuum directions for binary classification, ordinary least squares regression corresponds to LDA (or MDP in (2)), and partial least squares corresponds to mean difference. In particular, in the traditional case where , it is well known that is identical to the coefficients of least squares regression, up to some constant.

Some related work has shed light on the relationship between continuum regression and ridge regression (Sundberg, 1993; de Jong and Farebrother, 1994; Bjorkstrom and Sundberg, 1999). A similar relationship can be established for our case when is of rank 1. For simplicity, we assume that the column space of is spanned by the vector . (In the binary classification case, , as discussed in Section 2.1.) To find the continuum direction that maximizes in (6), differentiating the Lagrangian function with respect to leads to the equation

 SBwwTSBw+(γ−1)STwwTSTw−λw=0. (8)

Left multiplication of leads to . A critical point of the preceding equation system gives the maximum of . Since , one can further simplify the equation for a critical point

 w∝(ST+γ1−γwTSTwwTwIp)−d=(ST+αIp)−d:=wRα. (9)

For each , there exists an such that the continuum discriminant direction is given by the ridge estimator . This parallels the observation made by Sundberg (1993) in regression context. We allow negative , so that the relation to ridge estimators is extended for .

###### Theorem 1.

If is not orthogonal to all eigenvectors corresponding to the largest eigenvalue of , then there exists a number for each such that , including the limiting cases and .

The above theorem can be shown by an application of Proposition 2.1 of Bjorkstrom and Sundberg (1999) who showed that, in our notation, the solution of is of the ridge form. See Appendix for a proof of the theorem.

The relation between and is nonlinear and depends on . A typical form of relation is plotted in Fig. 2, and is explained in the following example.

###### Example 2.

From Fisher’s iris data, we chose ‘versicolor’ and ‘virginica’ as two groups each with samples. For presentational purpose, we use the first two principal component scores of the data. For a dense set of , the corresponding is plotted (in the left panel of Fig. 2), which exhibits the typical relationship between and . The MDP at corresponds to the ridge solution with . As approaches 1, the corresponding ridge solution is obtained with . For , is negative and approaches as . The continuum directions range from (which is the same as since ) to as illustrated in the right panel of Fig. 2.

The ridge solution may not give a global maximum of when the assumption in Theorem 1 does not hold. An analytic solution for such a case is also provided in Proposition 7 in Appendix.

### 2.5 Continuum directions in high dimensions

In high-dimensional situations where the dimension of the primary data is much higher than the sample size , the continuum directions are still well-defined. We return to discuss that, if , MDP has more preferable properties than LDA for binary classification. The ridge solution plays an important role in the following discussion.

In the conventional case where , It is easy to see that the ridge criterion (3) and its solution (9) bridge LDA and MD. However, if and thus is rank deficient, one extreme of the ridge criterion is connected to MDP but not to LDA. The following proposition shows that ranges from MD to MDP, giving a reason to favor MDP over LDA in high dimensions.

###### Proposition 2.

For , . Moreover is continuous with respect to . The boundaries meet MDP and MD directions, that is, and .

While is a limit of ridge solutions, does not meet with . When , is orthogonal to if the mean difference is not in the range of , i.e., (Ahn and Marron, 2010). This fact and Proposition 2 give .

Algebraically, the discontinuity of the ridge direction to comes from the discontinuity of the pseudoinverse. Heuristically, the discontinuity comes from the fact that does not completely lie in the column space of . In such a case, there is a direction vector orthogonal to the column space of containing information about (i.e., ). Using in LDA ignores such information. On the other hand, MDP uses , which preserves all information contained in the special direction .

The values of Fisher’s criterion for various choices of in Fig. 3 exemplifies that should be used as Fisher discriminant direction rather than in high dimensions. In our experiments on classification (in Sections 4 and 5), we check that the empirical performance of LDA is among the worst.

Our discussion so far assumes that the covariance matrices are the sample covariance matrices. It is well-known that these matrices are inconsistent estimators of the population covariance matrices when , as . Only with strong assumptions on the covariance and mean difference (such as sparsity), it is possible to devise consistent estimators. In such situation, the sufficient statistics and can be replaced by consistent estimators and , in the evaluation of the continuum directions (7). This approach has a potential to provide an estimator of , consistent with a suitably defined population continuum directions, when . In the next section, we present a high-dimensional asymptotic study when and are used in computing the empirical continuum directions.

### 2.6 HDLSS asymptotic study of continuum directions

We employ the high-dimension, low-sample-size (HDLSS) asymptotics, that is, the asymptotic study of while the sample size is held fixed, to understand the high-dimensional behaviors of the true and sample continuum directions. The HDLSS asymptotics has been successfully used in revealing the properties of conventional multivariate methods in high dimensions, such as classification (Hall et al., 2005; Qiao et al., 2010), PCA (Jung and Marron, 2009; Yata and Aoshima, 2009; Zhou et al., 2015), and clustering (Ahn et al., 2012), to name a few. For a review of recent developments, see Aoshima et al. (2017).

To set up, suppose that are i.i.d. and are i.i.d. . The empirical continuum directions are given by (6) where and as defined in Section 2.1. By Theorem 1, the elements in the set of true continuum directions can also be parametrized by

 α(γ,ST)=γ1−γωTγSTωγωTγωγ, (10)

which leads to . For each fixed , if the dimension of increases, then the total variance of also increases, which in turn leads that in (10) be increasing. To lessen the technical difficulty in the exposition for this section, we use the ridge parameterization by for the continuum directions. In particular, we parameterize the continuum directions by , which is an increasing function of the dimension . For each , we consider the set of sample continuum directions, denoted by , for .

The population counterpart of the sample continuum directions is defined similarly. For , , , and , the population continuum directions are parameterized by , and are denoted by . Assume the following:

1. There exists a constant such that as .

2. , as .

3. The eigenvalues of (and ) are sufficiently concentrated, in the sense that as , for .

The condition C1 has also appeared in, e.g., Hall et al. (2005); Qiao et al. (2010); Ahn et al. (2012), and requires that the true mean difference grows as the dimension increases. The conditions C2 and C3 include the covariance matrix models for both independent variables and mildly-spiked cases (i.e., few eigenvalues are moderately larger than the others), and were first appeared in Ahn et al. (2007). These conditions can be generalized and the Gaussian assumption can be relaxed, as done in, e.g., Jung and Marron (2009); Jung et al. (2012), to produce the equivalent results shown below. We keep it simple for brevity.

The asymptotic behavior of the sample continuum directions , when , is investigated in two ways. We first show that is inconsistent, and has a non-negligible constant angular bias when compared to its population counterpart . Despite the bias, the CDA, the classification rule discussed in Section 2.3, can perfectly classify new observations under certain conditions.

###### Theorem 3.

Under the setting in this section, including the conditions C1—C3, the following holds.

(i) The sample continuum directions are inconsistent with its population counterparts. In particular, for any ,

 Angle(ωα,^wα)→cos−1(δ2δ2+σ21/n1+σ22/n2)1/2,

in probability as .

(ii) The probability that CDA classifies a new observation correctly tends to 1 as if .

Both results in Theorem 3 depend on the quantity in the condition C1, which may be interpreted as a signal strength. When is large, the sample continuum direction is less biased, and is small. On the other hand, if , then is strongly inconsistent with , and is asymptotically orthogonal to . The performance of CDA also depends on . Consider the case where and . Then CDA classification is perfect whenever is positive. On the other hand, if , then the classification is only as good as random guess. These observations are consistent with Hall et al. (2005) and Qiao et al. (2010), in which HDLSS asymptotic behaviors of the centroid rule, SVM and DWD are studied.

We conjecture that if the within-covariance matrix has a large first eigenvalue (that is, a large variance of the first principal component), then the sample continuum direction is less biased than in Theorem 3, even under smaller size of signal . This conjecture seems to be true, as shown in the simulation studies in Section 4, but rigorously proving this conjecture has been challenging.

## 3 Computations

### 3.1 Numerical algorithms for binary supervision

When is of rank 1, or when the supervision is binary, the relationship between and ridge solutions in Theorem 1 can be used to compute a discrete sequence of the first continuum directions. In particular, there is a corresponding for each ridge parameter . Let be a maximum value for evaluating . In our experience it is sufficient to choose , ten times larger than the largest eigenvalue of . Define and for for some number . The small number keeps the matrix invertible and was chosen to for numerical stability. For each or , we get , where satisfies and

 γ(α)=αwTγ(α)STwγ(α)+α.

The sequence is augmented by the two extremes and .

If is orthogonal to all eigenvectors corresponding to , then does not tend to infinity even though has reached . In such a case, the remaining sequence of directions is analytically computed using Proposition 7 in Appendix.

### 3.2 Numerical algorithms for general case

In general cases where , the connection to generalized ridge solutions in Theorem 1 does not hold. Even with binary supervision, when a sequence of continuum directions is desirable, the ridge parameter is different for different in , even when is held fixed. Here, we propose a gradient descent algorithm to sequentially solve (7) for a given .

We first discuss a gradient descent algorithm for . Since the only constraint is that the vector is of unit size, the unit sphere is the feasible space. To make the iterate confined in the feasible space we update a candidate with , for a step size , where the gradient vector is . To expedite convergence, is initially chosen to be large so that . If this choice of overshoots, i.e., with this update then we immediately reduce to unity, so that the convergence to maximum is guaranteed, sacrificing fast rate of convergence. The iteration is stopped if or for a needed precision . The step size can be reduced if needed, but setting has ensured convergence with a precision level in our experience.

For the second and subsequent directions, suppose we have and are in search for the . The -orthogonality and the unit size condition lead to the feasible space . Since any is orthogonal to , , the solution lies in the nullspace of . We use orthogonal projection matrix to project the variance-covariance matrices and onto the nullspace of , and obtain and . The gradient descent algorithm discussed above for is now applied with and to update candidates of , without the -orthogonality constraint.

The following lemma justifies this iterative algorithm converges to the solution .

###### Lemma 4.
1. Let be the projection of onto the nullspace of . Write . Then and .

2. For , .

3. The solution of the unconstrained optimization problem satisfies for .

It can be seen from Lemma 4 that the optimization is performed with the part of data that is -orthogonal to . While making the optimization simpler, we do not lose generality because the original criterion has the same value as for candidate in the feasible region (Lemma 4(ii)). This with the last result (iii) shows that our optimization procedure leads to (at least) local maximum in the feasible region.

Note that the sequence depends on the choice of . To obtain a spectrum of continuum directions, one needs to repeat the iterative algorithm for several choices of .

### 3.3 Efficient computation when p≫n

For large , directly working with matrices and needs to be avoided. For such cases, utilizing the eigendecomposition of (or, equivalently, the singular value decomposition of ) provides efficient and fast computation for continuum directions. Write , where spans thecolumn space of , for . Then the algorithms discussed in the previous sections can be applied to and , in place of and , to obtain . The continuum directions are then . If , this requires much less computing time than working with and directly. The next lemma ensures that our solution is the maximizing the criterion (7).

###### Lemma 5.

Any maximizer of (7) lies in the column space of .

In the case of binary supervision, one needs to avoid the inversion of large matrix . The continuum directions are obtained via only involving the inversion of matrices: In all of our experiments, involving moderately large data sets, where is tens of thousands and is hundreds, the computation takes only a few seconds at most, compared to minutes for, e.g., the estimation of Clemmensen et al. (2011).

## 4 Simulation studies

We present two simulation studies to empirically reveal the underlying model under which the continuum directions are useful. We numerically compare the performance of CDA, the linear classification followed by continuum dimension reduction, with several other classification methods, in binary or multi-category classification.

### 4.1 Binary classification

For binary classification, our method is compared with LDA (using the pseudoinverse), independence rule (IR) by Bickel and Levina (2004), the features annealed independence rule (FAIR) by Fan and Fan (2008), the distance weighted discrimination (DWD) by Marron et al. (2007) and the sparse discriminant analysis (SDA) by Clemmensen et al. (2011).

The setup for the simulation study is as follows. We assume two groups with mean and for some constant , where is the vector of length ,and . We choose or , to examine both sparse and non-sparse models. The common covariance matrix is for . This so-called compound symmetry model allows examination from independent to highly correlated settings. The scalar varies for different to keep the Mahalanobis distance between and equal to 3.

Training and testing data of size are generated from normal distribution of dimension and . The parameter of CDA is chosen by the 10-fold cross-validation. The number of features for FAIR, as well as the tuning parameters for SDA, were also chosen by 10-fold cross-validation. The mean and standard deviation of the misclassification rates, based on 100 replications, are listed in Table 1.

Our results show that CDA performs much better than other methods when the variables is strongly correlated (), for non-sparse models. In the independent setting (), the performance of CDA is comparable to IR and DWD. FAIR is significantly better than CDA under sparse model with independent variables, because the crucial assumption of FAIR that the non-zero coordinates of are sparse is also satisfied. Both IR and FAIR severely suffer from strong dependence (), in which case their classification rates are close to 50. DWD also suffers from the highly correlated structure. SDA performs well for all settings under the sparse model, as expected. However, for non-sparse models, CDA performs significantly better than SDA.

Another observation is that the performance of LDA is better for larger s. A possible explanation is that the underlying distribution becomes nearly degenerate for larger . The true covariance matrix has a very large first eigenvalue compared to the rest of eigenvalues , . As conjectured in Section 2.6, both LDA and CDA benefit from extensively incorporating the covariance structure, in spite of the poor estimation of when . Note that in terms of the conditions C1—C3 in Section 2.6, all of these models have signal strength and the condition C3 is violated when .

Poor performance of FAIR for the strongly correlated case is also reported in Fan et al. (2012), where they proposed the regularized optimal affine discriminant (ROAD), which is computed by a coordinate descent algorithm. Due to the heavy computational cost, we excluded the ROAD as well as the linear programming discriminant rule (LPD) by Cai and Liu (2011). We exclude results from Wu et al. (2009) since SDA (Clemmensen et al., 2011) performed uniformly better than the method of Wu et al. These methods aim to select few features as well as to classify, based on assumptions of sparse signals. CDA does not require such assumptions.

### 4.2 Multi-category classification

For multi-category classification, CDA is compared with the reduced-rank LDA (cf. Hastie et al., 2009) and SDA (Clemmensen et al., 2011).

The setup in the simulation study is as follows. We assume groups with means , and , for either or . The common covariance matrix is the compound symmetry model, parameterized by , and the scalar is set as explained in Section 4.1.

Training and testing data of size are generated from normal distribution of dimension and . The classification performances of CDA, reduced-rank LDA and SDA for these models are estimated by 100 replications, and are summarized in Table 2.

The simulation results for multi-category classification provide a similar insight obtained from the binary classification study. CDA performs better when the correlation between variables is strong (), for both sparse and non-sparse models. Our method is outperformed by SDA for the sparse model, but has significantly smaller misclassification rates for non-sparse models.

## 5 Real data examples

In this section, we provide three real data examples, where the supervision information is categorical with two or more categories.

### 5.1 Leukemia data

We first use the well-known data set of Golub et al. (1999), which consists of expression levels of 7129 genes from 72 acute leukemia patients. The data are prepared as done in Cai and Liu (2011). In particular, 140 genes with extreme variances, i.e., either larger than or smaller than are filtered out. Then genes with the 3000 largest absolute -statistics were chosen. The dataset included 38 training cases (27 AMLs and 11 ALLs) and 34 testing cases (20 AMLs and 14 ALLs).

With binary classification in mind, we obtain for a discrete set of , using the computational procedure discussed in Section 3.1. A 10-fold cross-validation leads to . As shown in Fig. 4, the smallest cross validatory misclassification rate is . (We chose to use the smallest among all minimizers of .) Figure 4 also shows the classification errors of training and testing data for different . For smaller values, including (corresponding to MDP) and , the classification errors are 1 out of 34 for the test set, and 0 out of 38 for the training set. In comparison, LDA, IR, DWD and SVM result in 2–6 testing errors. From the work of Fan and Fan (2008) and Cai and Liu (2011), FAIR and LPD makes only 1/34 testing error. Sparse LDA methods, SLDA of Wu et al. (2009) and SDA of Clemmensen et al. (2011), also performed quite well. The results are summarized in Table 3.

### 5.2 Liver cell nuclei shapes

In a biomedical study, it is of interest to quantify the difference between normal and cancerous cell nuclei, based on the shape of cells. We analyze discretized cell outlines, aligned to each other to extract shape information (Wang et al., 2011b). The data consist of outlines from normal liver tissues and hepatoblastoma tissues. Each outline is represented by 90 planar landmarks, leading to .

In the context of discriminating the disease based on the cell shapes, we compare our method with LDA, DWD, FAIR, and a quadratic discriminant analysis (QDA). As explained in Section 4, the threshold value of FAIR is chosen by cross validation. The QDA is modified to have smaller variability by using a ridge-type covariance estimator.

For the comparison, we randomly assign 50 cases as a testing data set, and each classifier is calculated with the remaining 450 cases. The empirical misclassification rates of classifiers are computed based on the training dataset and on the testing dataset. This is repeated for 100 times to observe the variation of the misclassification rates. For the continuum directions with varying , we observe that the misclassification rates become stable as increases, as shown in Fig. 5. Both the training and testing error rates become close to as moves closer to MD and to PCA. This is because, for this dataset, and are close to each other with , and both exhibit good classification performances, with error rate close to . For each training dataset, is chosen by the cross validation. Many chosen s have values between , but a few of those are as large as , as shown in Fig. 5. The performance of CDA with cross-validated is compared with other methods in Table 4. Based on the testing error rate, CDA performs comparable to more sophisticated methods such as FAIR and DWD. Both LDA and QDA tend to overfit and result in larger misclassification rates than other methods.

### 5.3 Invasive lobula breast cancer data

Invasive lobula carcinoma (ILC) is the second most prevalent subtype of invasive breast cancer. We use the protein expression data of breast tumors, measured by RNA sequencing (Ciriello et al., 2015), to demonstrate the use of continuum directions when the supervision information is categorical with 5 possible values. The dataset consists of genes of breast tumor samples, categorized into five subtypes—luminal A, basal-like, luminal B, HER2-enriched, and normal-like—by a pathology committee. Despite the large size of the data, computing the continuum directions is fast (few seconds, using a standard personal computer). Figure 6 displays the spectrum of continuum dimension reduction, parameterized by the meta-parameter .

To compare the performance of the multicategory classification with the reduced-rank LDA and SDA of Clemmensen et al. (2011), we keep only the 500 genes with the largest standard deviations, and formed a training set of samples and a testing set of samples. For each of the classifiers, the training set is used to train the classification rule, while the testing set is used to estimate the misclassification error. We randomly permute the memberships to the training and testing sets, for 10 times.

The result of experiment is summarized in Table 5. Our method exhibits the lowest misclassification error rates. Poor performance of SDA may indicate that the true signal in the data is not sparse. As expected, the reduced-rank LDA severely overfits.

## 6 Discussion

We proposed a criterion evaluating useful multivariate direction vectors, called continuum directions, while the degrees of supervision from an auxiliary data set are controlled by a meta-parameter . An application of the proposed dimension reduction to classification was discussed. Numerical properties of the proposed classifier have demonstrated good performance for high dimensional situation. In particular, our method outperforms several other methods when the variance of the first principal component is much larger than the rest.

The proposed method is akin to the continuum regression and connects several well-known approaches, LDA, MDP, MD, ridge estimators and PCA, thus providing a simple but unified framework in understanding the aforementioned methods. There are several other criteria that also give a transition between LDA (or MDP) and PCA. A slightly modified criterion from (6), with the constraint , gives the ridge solution with the same . This criterion is first introduced in a regression problem (Bjorkstrom and Sundberg, 1999), but has not been adopted into classification framework. Wang et al. (2011a) proposed a modified Fisher’s criterion

 τδ(w)=wTSTwwT(SW+δI)w, (11)

that bridges between LDA and PCA. For , the criterion (11) becomes identical to equation (1) up to the constant 1, thus equivalent to LDA. In the limit of , converges to the criterion for . The maximizer of is a solution of a generalized eigenvalue problem. We leave further investigation of these criteria as future research directions.

Lee et al. (2013) also discussed discrimination methods that bridge MDP and MD, in high dimensions. The method of Lee et al. (2013) is in fact equivalent to a part of continuum directions, restricted for . In this paper, the continuum between MDP to PCA is completed by also considering , the method is extended for supervised dimension reduction, and a connection to continuum regression is made clear.

The study for HDLSS asymptotic behavior of the continuum directions has a room for more investigation. We conjecture that the magnitude of large eigenvalues, in fast-diverging eigenvalue models, is a key parameter for successful dimension reduction, which may be shown using HDLSS asymptotic investigation similar to Jung et al. (2012).

## Acknowledgments

The author is grateful to Gustavo Rohde for sharing the nuclei cell outlines data, and to Jeongyoun Ahn and Myung Hee Lee for their helpful suggestions and discussions.

## Appendix A Technical details

### a.1 Proof of Theorem 1.

In a multivariate linear regression problem, with the design matrix and the vector of responses, denote a regressor by a linear combination of variables. Both and are assumed centered. Let be the sample variance of the regressor. Let be the sample covariance between the regressor and and be the sample correlation, which is proportional to . The following theorem is from Bjorkstrom and Sundberg (1999).

###### Theorem 6 (Proposition 2.1 of Bjorkstrom and Sundberg (1999)).

If a regressor is defined according to the rule

 wf=argmax∥w∥=1f(K2(w),V(w)),

where is increasing in (or ) for constant , and increasing in for constant , and if is not orthogonal to all eigenvectors corresponding to the largest eigenvalue of , then there exists a number such that , including the limiting cases and .

A two-group classification problem is understood as a special case of regression. In particular, let be if the th observation is in the first group or if it is in the second group. Then the total variance matrix and the mean difference . The criterion (6) is , which satisfies the assumptions of Theorem 6. Theorem 1 is thus a special case of Theorem 6.

### a.2 Analytic solution for the rare case

The ridge solution may not give a global maximum of when the assumption in Theorem 1 does not hold. We give an analytic solution for such a case. It is convenient to write in the canonical coordinates of . Let be the eigen-decomposition of with , for , with convention . To incorporate any duplicity of the first eigenvalue let represent the number of eigenvalues having the same value as , that is, . Denote and . Let and .

###### Proposition 7.

Suppose is orthogonal to all eigenvectors corresponding to and is not orthogonal to all eigenvectors corresponding to . Let

 zα=(Λ2+αI)−1δ2√δT2(Λ2+αI)−2δ2  for  α∈(−∞,−λ1]∪[0,∞).
1. If