Dynamic Linear Discriminant Analysis in High Dimensional Space


Binyan Jiang, Ziqi Chen, Chenlei Leng
Jiang is with the Hong Kong Polytechnic University; Chen is with Central South University; Leng is with the University of Warwick.
Abstract

High-dimensional data that evolve dynamically are a prominent feature of the modern data era. As a partial response, recent years have seen increasing efforts to address the dimensionality challenge. However, the non-static nature of these datasets is largely ignored. This paper addresses both challenges by proposing a novel yet simple dynamic linear programming discriminant (DLPD) rule for binary classification. Different from the usual static linear discriminant analysis, the new method is able to capture the changing distributions of the underlying populations by modeling their means and covariances as smooth functions of covariates of interest. Under an approximate sparsity condition, we show that the conditional misclassification rate of the DLPD rule converges to the Bayes risk in probability uniformly over the range of the variables used for modeling the dynamics, when the dimensionality is allowed to grow exponentially with the sample size. The minimax lower bound for the estimation of the Bayes risk is also established, implying that the misclassification rate of our proposed rule is minimax-rate optimal. The promising performance of the DLPD rule is illustrated via extensive simulation studies and the analysis of a breast cancer dataset.

Key Words: Bayes rule; Discriminant analysis; Dynamic linear programming; High-dimensional data; Kernel smoothing; Sparsity.

1 Introduction

The rapid development of modern measurement technologies has enabled us to gather datasets that are increasingly large. As the rule rather than the exception, these datasets have been gathered at different times, under different conditions, subject to a variety of perturbations, and so on. As a result, the complexity of many modern datasets is predominantly characterized by high dimensionality and data dynamics. The former refers to a large number of variables in comparison to the sample size, while the latter manifests itself in a data distribution that is non-static and depends on covariates such as time. Any approach ignoring either of the two aspects may give unsatisfactory performance or even lead to incorrect conclusions.

The main aim of this paper is to address these two challenges simultaneously, for the first time, by developing a simple yet useful dynamic linear programming discriminant (DLPD) rule for classification. Specializing to binary classification, we allow the means and the covariance matrices of the populations to vary with covariates of interest, and estimate them via local smoothing (Fan & Gijbels, 1996). Under an approximate sparsity assumption on a linear index that is central to classification, we propose to estimate the index vector via a technique akin to the Dantzig selector (Candes & Tao, 2007; Cai & Liu, 2011) in a dynamic setting. We show that the conditional misclassification rate of the DLPD rule converges to the Bayes risk in probability uniformly over the range of the variables used for modeling the dynamics, where the dimensionality is allowed to be exponentially high relative to the sample size. The uniformity result is of particular importance as it permits simultaneous statements over the whole range of the covariate. In addition, we derive minimax lower bounds for the Bayes risk, which indicate that the misclassification rate of our DLPD rule is minimax-rate optimal. To the best of our knowledge, this is the first attempt to develop a high-dimensional discriminant method that exhibits local features of the data with sound theory. We remark that, using existing approaches such as the one in Cai & Liu (2011) coupled with local smoothing, it is possible to establish a pointwise result for the misclassification rate. However, a pointwise convergence result is not sufficient in a dynamic setting, as the main interest is often to assess the estimated classification rule across the whole range of the covariates, not just at a single point.

Before we proceed further, let us quickly look at a dataset that motivated this study. In traditional disease diagnosis studies, the same classification rule was often applied to all patients. However, it has become increasingly desirable to develop personalized rules that take into account individual characteristics (Alyass, Turcotte & Meyre, 2015). Intuitively, these patient-specific factors can be treated as dynamic factors in deriving decision rules. For example, in the breast cancer data studied in Section 4.3, both (low-dimensional) clinical risk factors (tumor size, age, histological grade, etc.) and (high-dimensional) expression levels for 24,481 gene probes were collected for 97 lymph node-negative breast cancer patients. Among them, 46 patients developed distant metastases within 5 years while the remaining 51 stayed metastasis-free for at least 5 years. To appreciate the need to incorporate dynamic information into the analysis, we look at the 100 genes with the largest absolute t-statistic values between the two groups, choosing the tumor size as the dynamic variable. We fit the gene expression levels as a function of the tumor size using a local regression model (Cleveland, Grosse & Shyu, 1992). The fitted plots for some randomly selected genes are presented in Figure 1, from which we can see that the gene expression levels of the patients in the two classes exhibit different levels as the tumor size changes. Similarly, the covariance matrix of these 100 genes is also found to behave dynamically in response to changes in the tumor size. To see this, we separate the 97 observations into two groups depending on whether the tumor size is greater than the median of the tumor sizes, 2.485. The p-value of the test of Li & Chen (2011) indicates that we should reject the null hypothesis that the population covariance matrices of the two groups are equal. The method developed in this paper aims to capture this dynamic information in a high-dimensional setting for classification.

Figure 1: Gene expression level versus tumor size. Upper panel: selected genes from one class; lower panel: selected genes from the other class. The curves are LOWESS fits.
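To make the exploratory analysis just described concrete, here is a minimal sketch in Python. It assumes the data sit in numpy arrays named expr (patients by genes), tumor_size and label; these names, the use of statsmodels' LOWESS smoother and the smoothing fraction are our own illustrative choices, and the formal test of covariance equality (Li & Chen, 2011) is not implemented here.

```python
# Sketch of the exploratory analysis: rank genes by two-sample t-statistics,
# fit LOWESS curves of expression versus tumor size, and compare the sample
# covariance matrices of the selected genes after splitting at the median tumor size.
# Array names (expr, tumor_size, label) are hypothetical.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.nonparametric.smoothers_lowess import lowess

def top_genes_by_t_stat(expr, label, k=100):
    """Indices of the k genes with the largest absolute two-sample t-statistics."""
    t_stat, _ = ttest_ind(expr[label == 0], expr[label == 1], axis=0)
    return np.argsort(-np.abs(t_stat))[:k]

def lowess_fit(gene_expr, tumor_size, frac=0.6):
    """LOWESS fit of one gene's expression level against tumor size."""
    return lowess(gene_expr, tumor_size, frac=frac, return_sorted=True)

def median_split_covariances(expr, tumor_size, gene_idx):
    """Sample covariance matrices of the selected genes for small versus large tumors."""
    small = tumor_size <= np.median(tumor_size)
    cov_small = np.cov(expr[small][:, gene_idx], rowvar=False)
    cov_large = np.cov(expr[~small][:, gene_idx], rowvar=False)
    return cov_small, cov_large
```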

1.1 The setup

We now introduce the problem formally. Let $X$ and $Y$ be $p$-dimensional random vectors and let $U$ be a $q$-dimensional random covariate, where for simplicity we assume that $q$ is a fixed integer. In this paper we deal with the situation where $p$ is large. Given $U = u$, we assume that $X \mid U = u \sim N(\mu_1(u), \Sigma(u))$, where $\mu_1(u)$ is the conditional mean vector and $\Sigma(u)$ is the conditional covariance matrix. Similarly, the conditional distribution of $Y$ given $U = u$ is given as $Y \mid U = u \sim N(\mu_2(u), \Sigma(u))$, where $\mu_2(u)$ is the corresponding conditional mean vector. In other words, different from traditional linear discriminant analysis, we assume that the first and second moments of $X$ and $Y$ change over a $q$-dimensional covariate $U$. Here $U$ could be dependent on the features $X$ and $Y$. When $U$ is a vector of discrete variables, the above model is known as the location-scale model and has been used for discriminant analysis with mixed data under fixed-dimension assumptions; see, for example, Krzanowski (1993) and the references therein.

In discriminant analysis, it is well known that the Bayes procedure is admissible; see for example Anderson (2003). Let $Z$ be a generic random sample which can come from either population $X$ or population $Y$. In this paper we assume a priori that it is equally likely that $Z$ comes from either population $X$ or population $Y$. Following simple algebra, it can be easily shown that, given $U = u$, the Bayes procedure is the following:

  • Classify $Z$ into population $X$ if $\{Z - (\mu_1(u) + \mu_2(u))/2\}^{\top}\Sigma^{-1}(u)\{\mu_1(u) - \mu_2(u)\} \ge 0$;

  • Classify $Z$ into population $Y$ otherwise.

Given $U = u$, by standard calculation, the conditional misclassification rate of this rule is

$$R(u) = \Phi(-\Delta_u/2), \qquad (1)$$

where $\Delta_u = \left[\{\mu_1(u) - \mu_2(u)\}^{\top}\Sigma^{-1}(u)\{\mu_1(u) - \mu_2(u)\}\right]^{1/2}$ and $\Phi$ is the cumulative distribution function of a standard normal random variable. The expected misclassification rate is defined as

$$R = E_U\{R(U)\}, \qquad (2)$$

where $E_U$ means taking expectation with respect to $U$. In practice, $\mu_1(u)$, $\mu_2(u)$ and $\Sigma(u)$ are unknown, but there is a sequence of independent random observations $(X_i, U_{1i})$, $i = 1, \ldots, n_1$, from population $X$ and a sequence of independent random observations $(Y_j, U_{2j})$, $j = 1, \ldots, n_2$, from population $Y$. The central problem then becomes proposing methods based on these samples that give misclassification rates converging to that of the Bayes rule under appropriate assumptions.
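As a complement to the formulas above, the following is a minimal sketch of the oracle Bayes rule and its conditional misclassification rate (1), assuming the dynamic means and the common covariance are supplied as callables mu1, mu2 and sigma evaluated at the covariate value u; it is purely illustrative and is not the estimator proposed in this paper.

```python
# Oracle Bayes rule for the dynamic LDA model: classify z at covariate value u
# using the (assumed known) conditional means and covariance.
import numpy as np
from scipy.stats import norm

def bayes_classify(z, u, mu1, mu2, sigma):
    """Return 1 if z is assigned to the first population at covariate value u, else 2."""
    delta = mu1(u) - mu2(u)
    beta = np.linalg.solve(sigma(u), delta)          # beta(u) = Sigma(u)^{-1} {mu1(u) - mu2(u)}
    score = (z - 0.5 * (mu1(u) + mu2(u))) @ beta
    return 1 if score >= 0 else 2

def bayes_conditional_risk(u, mu1, mu2, sigma):
    """Conditional misclassification rate Phi(-Delta_u / 2) as in (1)."""
    delta = mu1(u) - mu2(u)
    maha = np.sqrt(delta @ np.linalg.solve(sigma(u), delta))  # Mahalanobis distance Delta_u
    return norm.cdf(-maha / 2.0)
```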

1.2 Existing works

There has been increasing emphasis in recent years on addressing the high-dimensionality challenge posed by modern data where $p$ is large. However, the dynamic nature of the data collection process is often ignored, in that the means and the covariance matrix are assumed to be independent of $u$. In this static case, the Bayes procedure given above reduces to the well-known Fisher's linear discriminant analysis (LDA). In static high dimensional discriminant analysis, Bickel & Levina (2004) first highlighted that Fisher's LDA with a plug-in sample covariance estimate can perform no better than random guessing when $p$ exceeds the sample size. Fortunately, in many problems, various quantities in the LDA can be assumed sparse; see, for example, Witten & Tibshirani (2011), Shao et al. (2011), Cai & Liu (2011), Fan, Feng & Tong (2012), Mai, Zou & Yuan (2012), and Mai & Zou (2013) for a summary of selected sparse LDA methods. Further studies along this line can be found in Fan, Jin & Yao (2013) and Hao, Dong & Fan (2015). More recently, quadratic discriminant analysis has attracted increasing attention, where the population covariance matrices are assumed static but different. This has motivated the study of more flexible models exploiting variable interactions for classification, analogous to two-way interactions in linear regression; see, for example, Fan, Ke & Liu (2015), Fan et al. (2015), and Jiang, Wang & Leng (2015). However, none of these works addresses the dynamic nature of the means and covariance matrices.

In our setup where dynamics exist, in addition to handling the high dimensionality, we need to obtain dynamic estimators for $\mu_1(u)$, $\mu_2(u)$ and $\Sigma(u)$, or for the index vector

$$\beta(u) = \Sigma^{-1}(u)\{\mu_1(u) - \mu_2(u)\},$$

as functions of $u$. Under a similar setup where $U$ is categorical and supported on a set of finitely many elements, Guo et al. (2011) proposed a sparse estimator of the discriminant direction. The emphasis of this work is on continuous $U$ that is compactly supported. Chen & Leng (2015) proposed nonparametric estimators of a sparse covariance matrix using thresholding techniques for univariate $U$. The focus of this paper on high-dimensional classification is completely different. Importantly, we do not require a sparsity assumption on $\Sigma(u)$, and our theory applies for any fixed-dimensional $U$. Our paper is also different from Cai & Liu (2011), Fan, Feng & Tong (2012) and Mai, Zou & Yuan (2012) in that $\beta(u)$ is allowed not only to be a smooth function of $u$, but also to be approximately sparse (see Theorem 1). Our efforts greatly advance the now-classical approach of local polynomial smoothing (Fan & Gijbels, 1996) to the modern era of high-dimensional data analysis.

If we denote by $\hat\mu_1(u)$, $\hat\mu_2(u)$ and $\hat\beta(u)$ the estimators of $\mu_1(u)$, $\mu_2(u)$ and $\beta(u)$ defined in Section 2, our Dynamic Linear Programming Discriminant (DLPD) rule is given as follows:

  • Classify $Z$ into population $X$ if $\{Z - (\hat\mu_1(u) + \hat\mu_2(u))/2\}^{\top}\hat\beta(u) \ge 0$;

  • Classify $Z$ into population $Y$ otherwise.

The rest of this paper is organized as follows. In Section 2, we propose estimators for the components in the Bayes rule and introduce the DLPD rule. Section 3 provides theoretical results for the DLPD rule. In particular, we show that under appropriate conditions, the risk function of the DLPD rule converges to the Bayes risk function uniformly in $u$. In addition, we derive minimax lower bounds for the estimation of the Bayes risk. In Section 4, simulation studies are conducted to assess the finite-sample performance of the proposed method. The DLPD rule is then applied to discriminant problems arising from a breast cancer dataset. Concluding remarks are made in Section 5. All the theoretical proofs are given in the Appendix.

2 A dynamic linear programming discriminant rule

We begin by introducing some notation. For any matrix $A$, we use $A^{\top}$, $|A|$ and $\mathrm{tr}(A)$ to denote its transpose, determinant and trace. Let $a = (a_1, \ldots, a_p)^{\top}$ be a $p$-dimensional vector. Define $\|a\|_1 = \sum_{i=1}^{p}|a_i|$ as the $L_1$ norm and $\|a\|_\infty = \max_{1 \le i \le p}|a_i|$ as the $L_\infty$ norm. For any $0 < r < \infty$, the $L_r$ norm of $a$ is defined as $\|a\|_r = (\sum_{i=1}^{p}|a_i|^r)^{1/r}$. We denote the $p$-dimensional vector of ones as $\mathbf{1}_p$ and the $p$-dimensional vector of zeros as $\mathbf{0}_p$.

Denote $u = (u_1, \ldots, u_q)^{\top}$ and let $K(\cdot)$ be a kernel function such that

$$K(u) = \prod_{i=1}^{q} k(u_i),$$

where $k(\cdot)$ is a univariate kernel function, for example the Epanechnikov kernel used in kernel smoothing (Fan & Gijbels, 1996). Recent literature on multivariate kernel estimation can be found in Gu, Li & Yang (2015) and the references therein. Let $H = \mathrm{diag}(h_1, \ldots, h_q)$ be a diagonal bandwidth matrix and define

$$K_H(u) = |H|^{-1}K(H^{-1}u).$$

Recall that we assume that there is a sequence of independent random observations $(X_i, U_{1i})$, $i = 1, \ldots, n_1$, from population $X$ and a sequence of independent random observations $(Y_j, U_{2j})$, $j = 1, \ldots, n_2$, from population $Y$. For simplicity, throughout this paper we assume that the sample sizes $n_1$ and $n_2$ are of the same order, and we write $n$ for their common order.

One of the most popular nonparametric estimators of a conditional expectation is the Nadaraya-Watson estimator, which is a locally weighted average using a kernel as the weighting function. Let $H_1$ be a given bandwidth matrix. We estimate $\mu_1(u)$ using the Nadaraya-Watson estimator (Nadaraya, 1964) $\hat\mu_1(u) = \sum_{i=1}^{n_1} w_{1i}(u)X_i$, where

$$w_{1i}(u) = \frac{K_{H_1}(U_{1i} - u)}{\sum_{l=1}^{n_1} K_{H_1}(U_{1l} - u)}, \qquad i = 1, \ldots, n_1. \qquad (3)$$

Similarly, given a bandwidth matrix $H_2$, we estimate $\mu_2(u)$ by $\hat\mu_2(u) = \sum_{j=1}^{n_2} w_{2j}(u)Y_j$, where

$$w_{2j}(u) = \frac{K_{H_2}(U_{2j} - u)}{\sum_{l=1}^{n_2} K_{H_2}(U_{2l} - u)}, \qquad j = 1, \ldots, n_2. \qquad (4)$$

For the covariance matrix $\Sigma(u)$, we propose the following empirical estimator:

$$\hat\Sigma(u) = \frac{n_1\hat\Sigma_1(u) + n_2\hat\Sigma_2(u)}{n_1 + n_2}, \qquad (5)$$

where

$$\hat\Sigma_1(u) = \sum_{i=1}^{n_1} w_{1i}(u)\{X_i - \hat\mu_1(u)\}\{X_i - \hat\mu_1(u)\}^{\top} \qquad (6)$$

and

$$\hat\Sigma_2(u) = \sum_{j=1}^{n_2} w_{2j}(u)\{Y_j - \hat\mu_2(u)\}\{Y_j - \hat\mu_2(u)\}^{\top}. \qquad (7)$$
We remark that the estimators $\hat\Sigma_1(u)$ and $\hat\Sigma_2(u)$ are simply weighted sample covariance estimates with weights determined by the kernel.
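A minimal sketch of these kernel-weighted estimators follows, assuming a product Epanechnikov kernel and a diagonal bandwidth matrix stored as a vector h of bandwidths; the pooling of the two class-wise covariance estimators shown in pooled_cov is one natural convention, and the function and argument names are ours.

```python
# Kernel-weighted local estimators: Nadaraya-Watson conditional means as in (3)-(4)
# and kernel-weighted covariance estimates. X is an n x p data matrix from one
# population, U the n x q matrix of its covariates, u the target covariate value.
import numpy as np

def epanechnikov(t):
    return 0.75 * np.maximum(1.0 - t ** 2, 0.0)

def kernel_weights(U, u, h):
    """Normalized product-kernel weights w_i(u) with bandwidth vector h."""
    k = epanechnikov((U - u) / h) / h     # n x q matrix of scaled univariate kernels
    w = np.prod(k, axis=1)                # product kernel K_H(U_i - u)
    return w / w.sum()

def nw_mean(X, U, u, h):
    """Nadaraya-Watson estimator of the conditional mean."""
    return kernel_weights(U, u, h) @ X

def weighted_cov(X, U, u, h):
    """Kernel-weighted sample covariance around the estimated conditional mean."""
    w = kernel_weights(U, u, h)
    resid = X - nw_mean(X, U, u, h)
    return (w[:, None] * resid).T @ resid

def pooled_cov(X1, U1, X2, U2, u, h1, h2):
    """Pool the two class-wise covariance estimators (one possible convention)."""
    n1, n2 = X1.shape[0], X2.shape[0]
    return (n1 * weighted_cov(X1, U1, u, h1) + n2 * weighted_cov(X2, U2, u, h2)) / (n1 + n2)
```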

For a given $u$, we then estimate $\beta(u)$ using a Dantzig selector (Candes & Tao, 2007; Cai & Liu, 2011) as

$$\hat\beta(u) = \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{subject to} \quad \big\|\hat\Sigma(u)\beta - \{\hat\mu_1(u) - \hat\mu_2(u)\}\big\|_\infty \le \lambda_n, \qquad (8)$$

where $\lambda_n > 0$ is a tuning parameter.

Given a new observation $Z$ with covariate value $u$, our dynamic linear programming discriminant rule is obtained by plugging the estimators given in (3), (4), (5) and (8) into the Bayes rule given in Section 1. That is,

  • Classify $Z$ into population $X$ if $\{Z - (\hat\mu_1(u) + \hat\mu_2(u))/2\}^{\top}\hat\beta(u) \ge 0$;

  • Classify $Z$ into population $Y$ otherwise.
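To illustrate how the pieces fit together, the following is a minimal sketch of the Dantzig-type estimator (8), solved as a linear program with the usual split of the coefficient vector into positive and negative parts, and of the resulting plug-in classification. It uses scipy.optimize.linprog and the local estimates as inputs; it is an illustration under our own conventions, not the authors' implementation.

```python
# Dantzig-type estimation of beta(u) as in (8) and the DLPD classification step.
# Sigma_hat is the p x p local covariance estimate, mu1_hat and mu2_hat the local
# means at a fixed covariate value u, and lam the tuning parameter.
import numpy as np
from scipy.optimize import linprog

def dantzig_beta(Sigma_hat, mu1_hat, mu2_hat, lam):
    """Minimize ||beta||_1 subject to ||Sigma_hat beta - (mu1_hat - mu2_hat)||_inf <= lam."""
    p = Sigma_hat.shape[0]
    delta_hat = mu1_hat - mu2_hat
    c = np.ones(2 * p)                      # objective: sum(b_plus) + sum(b_minus)
    A = np.hstack([Sigma_hat, -Sigma_hat])  # Sigma_hat @ (b_plus - b_minus)
    A_ub = np.vstack([A, -A])               # encodes |Sigma_hat beta - delta_hat| <= lam
    b_ub = np.concatenate([lam + delta_hat, lam - delta_hat])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]

def dlpd_classify(z, mu1_hat, mu2_hat, beta_hat):
    """Assign z to population 1 if {z - (mu1_hat + mu2_hat)/2}' beta_hat >= 0, else 2."""
    return 1 if (z - 0.5 * (mu1_hat + mu2_hat)) @ beta_hat >= 0 else 2
```

Splitting the coefficient vector into nonnegative parts is the standard way to express an $L_1$ objective as a linear program: at the optimum one of the two parts is zero coordinatewise, so the objective equals $\|\beta\|_1$.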

3 Theory

In this section we first derive the theoretical properties of our proposed dynamic linear programming discriminant rule; in particular, upper bounds on the misclassification rate are established. We then derive minimax lower bounds for the estimation of the misclassification rate. The upper and lower bounds together show that the misclassification rate of our proposed discriminant rule achieves the optimal rate of convergence.

3.1 Upper bound analysis

In high dimensional data analysis, Bernstein-type inequalities are widely used to prove important theoretical results; see, for example, Lemma 4 of Bickel & Levina (2004), Merlevède, Peligrad & Rio (2009), and Lemma 1 of Cai & Liu (2011). Different from the existing literature on high dimensional linear discriminant analysis, we need to accommodate the dynamic pattern. In particular, to prove our main results in this section, we establish uniform Bernstein-type inequalities for the mean estimators $\hat\mu_1(u)$, $\hat\mu_2(u)$ and the covariance matrix estimators $\hat\Sigma_1(u)$ and $\hat\Sigma_2(u)$; see Lemma 4 and Lemma 5. We point out that these uniform concentration inequalities could be essential in other research problems that encounter high dimensionality and non-stationarity simultaneously. We first present the risk function of the DLPD rule.

Lemma 1.

Let $\mathcal{U}$ be the support of $U$ and write $\hat\mu(u) = \{\hat\mu_1(u) + \hat\mu_2(u)\}/2$. Given $U = u$, the conditional misclassification rate of the DLPD rule is

$$R_n(u) = \frac{1}{2}\Phi\!\left(-\frac{\{\mu_1(u) - \hat\mu(u)\}^{\top}\hat\beta(u)}{\sqrt{\hat\beta(u)^{\top}\Sigma(u)\hat\beta(u)}}\right) + \frac{1}{2}\Phi\!\left(\frac{\{\mu_2(u) - \hat\mu(u)\}^{\top}\hat\beta(u)}{\sqrt{\hat\beta(u)^{\top}\Sigma(u)\hat\beta(u)}}\right).$$

To obtain our main theoretical results, we make the following assumptions.

  • The kernel function $k(\cdot)$ is symmetric in that $k(t) = k(-t)$, and there exists a constant such that for . In addition, there exist constants and such that and .

  • We assume the sample sizes and denote . In addition we assume that as and for simplicity we also assume that is large enough such that .

  • The covariates $U_{1i}$, $i = 1, \ldots, n_1$, and $U_{2j}$, $j = 1, \ldots, n_2$, are independently and identically sampled from a distribution with a density function $f(u)$, which has a compact support $\mathcal{U}$. In addition, $f(u)$ is twice continuously differentiable and is bounded away from zero on its support.

  • The bandwidths satisfy , , for

  • Let $\lambda_{\min}(u)$ and $\lambda_{\max}(u)$ be the smallest and largest eigenvalues of $\Sigma(u)$, respectively. We assume that there exists a positive constant such that . In addition, there exists a constant such that .

  • The mean functions $\mu_1(u)$, $\mu_2(u)$ and all the entries of $\Sigma(u)$ have continuous second-order derivatives in a neighborhood of each $u$ belonging to the interior of $\mathcal{U}$.

    Clearly, all the suprema and infima in this paper can be relaxed to essential suprema and infima.

Assumptions (A1), (A3) and (A4) are commonly made on kernel functions in the nonparametric smoothing literature; see, for example, Einmahl & Mason (2005), Fan & Gijbels (1996) and Pagan & Ullah (1999). The first statement of assumption (A2) is made for simplicity, and the second statement indicates that our approach allows the dimension to be as large as for any constant . That is, the dimensionality is allowed to be exponentially high in terms of the sample size. Assumption (A5) is also routinely made in high dimensional discriminant analysis; see for example Cai & Liu (2011). Assumption (A6) is a smoothness condition to ensure estimability and is commonly used in the nonparametric estimation literature; see for example Fan & Gijbels (1996); Tsybakov (2009).

The following theorem shows that the risk function of the DLPD rule given in Lemma 1 converges to the Bayes risk function (1) uniformly in .

Theorem 1.

Assume that assumptions (A1)-(A6) and the following assumption hold:

(9)

For any constant , by choosing for some constant large enough, we have with probability larger than ,

Consequently, we have

Here $\Delta_u$ measures the Mahalanobis distance between the two population centroids for a given $u$. This theorem does not require $\beta(u)$ to be sparse, but assumes that the $L_1$ norm of $\beta(u)$ divided by the Mahalanobis distance is bounded uniformly by a factor whose order is smaller than . In particular, the dimensionality is allowed to diverge as quickly as . This theorem shows that, uniformly in $u \in \mathcal{U}$, the conditional misclassification rate converges to the Bayes risk in probability. In order to connect this theorem to the situation where $\beta(u)$ is sparse, we note that from the Cauchy-Schwarz inequality and assumption (A5), we have for any $u$,
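The displayed inequality is not reproduced above; as a sketch of the Cauchy-Schwarz step, writing $s_u$ for the number of nonzero entries of $\beta(u)$ and $\lambda_{\min}(u)$ for the smallest eigenvalue of $\Sigma(u)$ (our notation, and up to the authors' exact constants), one has

```latex
\frac{\|\beta(u)\|_1}{\Delta_u}
  \;\le\; \frac{\sqrt{s_u}\,\|\beta(u)\|_2}{\sqrt{\beta(u)^{\top}\Sigma(u)\beta(u)}}
  \;\le\; \sqrt{\frac{s_u}{\lambda_{\min}(u)}},
```

since $\Delta_u^2 = \beta(u)^{\top}\Sigma(u)\beta(u)$ and $\beta(u)^{\top}\Sigma(u)\beta(u) \ge \lambda_{\min}(u)\|\beta(u)\|_2^2$. A bound on the support size of $\beta(u)$ therefore implies the boundedness condition used in Theorem 1.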

Consequently we have:

Corollary 1.

Assume that assumptions (A1)-(A6) and the following assumption hold:

(10)

For any constant , by choosing for some constant large enough, we have with probability larger than ,

This corollary states that the conditional misclassification rate again converges to the Bayes risk if the cardinality of the support of $\beta(u)$ diverges at an order smaller than . Thus, our results apply to approximately sparse models as in Theorem 1 and to sparse models as in Corollary 1.

In many high dimensional problems without a dynamic variable , it has been commonly assumed that the dimension and sample size satisfy . Denote or . From our proofs we see that in the dynamic case where has an effect, due to the local estimation, the dimension-sample-size condition becomes , which becomes under Assumption (A4). We give here a heuristic explanation for the change in the dimension-sample-size condition when . It is known that the variance of a kernel estimator is usually of order (Fan & Gijbels, 1996). On the one hand, similar to the asymptotic results in local kernel estimation, the sample size would become in the denominator of the dimension-sample-size condition to account for the local nature of the estimators. On the other hand, for simplicity, assume that for some constants . To control the estimation error or bias for a -dimensional parameter uniformly over , it is to some degree equivalent to controlling the estimation error of a parameter of dimension proportional to . Therefore the numerator in the dimension-sample-size condition becomes in our case.

Note that when the Bayes misclassification rate tends to zero, any classifier whose misclassification rate tends to zero more slowly would still satisfy the convergence in Theorem 1. To better characterize the misclassification rate of our DLPD rule, we establish the following stronger results on the rate of convergence in terms of the ratio of the conditional misclassification rate of the DLPD rule to the Bayes risk.

Theorem 2.

Assume that assumptions (A1)-(A6) and the following assumption hold:

(11)

For any constant , by choosing for some constant large enough, we have with probability larger than ,

Consequently, we have

3.2 Minimax lower bound

We first introduce the parameter space and some known results from the literature on minimax lower bound theory. We consider the following parameter space:

where denotes the Hölder class of order two (Tsybakov, 2009). For definiteness, is defined to be 1. Clearly, assumptions (A3) and (A6) together imply that the mean functions and the entries of $\Sigma(u)$ belong to the Hölder class with domain $\mathcal{U}$. We shall denote .

Suppose is a family of probability measures and is the parameter of interest with values in the functional space . Let be any functional of some parameter . By noticing that defines a semi-distance for any , from LeCam’s Lemma (LeCam, 1973; Yu, 1997; Cai, Zhang & Zhou, 2011) we have

Lemma 2.

Let be any functional of and let be an estimator of on taking values in the metric space . Let and be two -separated subsets of in that . Let be the corresponding probability measure for , , and let where are nonnegative weights such that . We then have:

By the above version of LeCam's lemma, the derivation of minimax lower bounds thus relies on the construction of the probability measure corresponding to the null hypothesis , the probability measures corresponding to the alternative, and the weights such that (i) and the distance is as large as possible, while (ii) the total variation is controlled to be away from 1. These technical details are deferred to the Appendix. By setting and , where and are defined as in (1), the following theorem establishes minimax lower bounds for the Bayes misclassification rate.

Theorem 3.

Assume that for some constant and . Let and be estimators of and respectively. Assume that and let . We have,

(12)

and

(13)

Note that the upper bound we obtained in Theorem 1 is of order in . Together with Theorem 3, we conclude that the misclassification rate of our proposed DLPD rule achieves the optimal rate of convergence over .

4 Numerical studies

4.1 Choice of tuning parameters

Given , the bandwidth matrix for computing can be chosen using what we call a subset-$X$-variables cross-validation procedure. More specifically, we repeat the following procedure times. For the -th () replication, we randomly choose an -dimensional () subset-$X$-variable from . For the -th observation we denote . Let and be estimators of the mean and covariance matrix of computed as in (3) and (2) by leaving out the -th observation . We then choose such that

is minimized. The bandwidth matrix for computing is chosen similarly.

We remark that when the dimension is larger than the sample size, the estimated covariance matrix is no longer invertible, and so the usual log-likelihood-type leave-one-out cross-validation (Yin et al., 2010) fails as it requires the computation of its inverse. Our procedure, on the other hand, selects the bandwidth such that the average prediction error over all the sampled subset-X-variables is minimized. Since the dimension of the subset-X-variables can be chosen to be much smaller than the sample size, the inverse of the covariance matrix estimator of the subset variables is well defined, and so our proposed procedure remains feasible under high dimensionality. In some sense, our bandwidth selection method can be seen as a computationally feasible way of approximating the log-likelihood-type leave-one-out cross-validation in Yin et al. (2010).
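A simplified sketch of this bandwidth selection follows, reusing kernel_weights from the earlier sketch. The leave-one-out criterion shown (a Gaussian log-likelihood-type prediction error on each random subset of X-variables), the subset size s and the number of replications B are our own assumptions for illustration; the original procedure may differ in these details.

```python
# Subset-X-variables cross-validation for the bandwidth vector h of one population.
# X is the n x p data matrix, U the n x q covariate matrix.
import numpy as np

def subset_cv_score(X, U, h, s=5, B=20, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    score = 0.0
    for _ in range(B):
        idx = rng.choice(p, size=s, replace=False)   # a random s-dimensional subset of X-variables
        Xs = X[:, idx]
        for i in range(n):
            keep = np.arange(n) != i                 # leave the i-th observation out
            w = kernel_weights(U[keep], U[i], h)
            mu = w @ Xs[keep]
            resid = Xs[keep] - mu
            Sigma = (w[:, None] * resid).T @ resid
            r = Xs[i] - mu
            _, logdet = np.linalg.slogdet(Sigma)
            score += r @ np.linalg.solve(Sigma, r) + logdet   # leave-one-out prediction error
    return score / (B * n)

# The bandwidth is chosen by minimizing subset_cv_score over a grid of candidate h values.
```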

Now we obtain the estimators $\hat\mu_1(u)$, $\hat\mu_2(u)$ and $\hat\Sigma(u)$ for a given $u$. For a given $\lambda_n$, the convex optimization problem (8) is implemented via linear programming as

$$\min_{\beta^{+} \ge \mathbf{0}_p,\ \beta^{-} \ge \mathbf{0}_p}\ \mathbf{1}_p^{\top}(\beta^{+} + \beta^{-}) \quad \text{subject to} \quad \big|\hat\sigma_i^{\top}(\beta^{+} - \beta^{-}) - \{\hat\mu_{1i}(u) - \hat\mu_{2i}(u)\}\big| \le \lambda_n, \quad i = 1, \ldots, p,$$

where $\hat\mu_{1i}(u)$ and $\hat\mu_{2i}(u)$ are the $i$-th entries of $\hat\mu_1(u)$ and $\hat\mu_2(u)$, $\hat\beta(u) = \beta^{+} - \beta^{-}$ at the optimum, and $\hat\sigma_i^{\top}$ is the $i$-th row of $\hat\Sigma(u)$.

This is similar to the Dantzig selector (Candes & Tao, 2007; Cai & Liu, 2011). The tuning parameter in (8) is chosen using -fold cross-validation. More specifically, we randomly divide the index set into subgroups , and divide into subgroups . Denote the full sample set as and let for . For a given and , let , and be estimators of , and computed using (3), (4) and (8), based on the samples in and the bandwidths . For each , let

and

Here is the indicator function. Clearly, gives the total number of correct classifications for the test data set using the DLPD rule based on . We then find such that the following averaged correct-classification count is maximized:

We remark that one can also choose the bandwidths together with the tuning parameter globally such that the above correct-classification count is maximized. Note that local smoothing estimates are obtained in our method before applying linear programming. Hence the computation time consists of the time for local smoothing and the time for linear programming. The proposed method is computationally manageable for high-dimensional data. For example, in our simulation study it took about 9 minutes to finish one replication on a Windows workstation (2.40 GHz Intel Core i7-5500U CPU) when .
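To close this subsection, the following is a minimal sketch of the cross-validation just described for the tuning parameter in (8): the candidate value maximizing the number of correctly classified held-out observations is selected. The fold construction and the grid are illustrative, and the helpers nw_mean, pooled_cov, dantzig_beta and dlpd_classify refer to the sketches given earlier, not to the authors' code.

```python
# Cross-validation for the Dantzig tuning parameter: pick the value of lam that
# maximizes the number of correct classifications on held-out observations.
import numpy as np

def cv_choose_lambda(X1, U1, X2, U2, h1, h2, lam_grid, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds1 = np.array_split(rng.permutation(len(X1)), n_folds)
    folds2 = np.array_split(rng.permutation(len(X2)), n_folds)
    correct = np.zeros(len(lam_grid))
    for f in range(n_folds):
        tr1 = np.setdiff1d(np.arange(len(X1)), folds1[f])
        tr2 = np.setdiff1d(np.arange(len(X2)), folds2[f])
        test = [(X1[i], U1[i], 1) for i in folds1[f]] + [(X2[j], U2[j], 2) for j in folds2[f]]
        for z, u, lab in test:
            # Local estimates based on the training folds only.
            mu1 = nw_mean(X1[tr1], U1[tr1], u, h1)
            mu2 = nw_mean(X2[tr2], U2[tr2], u, h2)
            Sigma = pooled_cov(X1[tr1], U1[tr1], X2[tr2], U2[tr2], u, h1, h2)
            for k, lam in enumerate(lam_grid):
                beta = dantzig_beta(Sigma, mu1, mu2, lam)
                correct[k] += dlpd_classify(z, mu1, mu2, beta) == lab
    return lam_grid[np.argmax(correct)]
```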

4.2 Simulation

For the simulation study, we consider the following four models:

Model 1. We generate independently from , and generate , for . The mean functions and are set as , and