Matrix Linear Discriminant Analysis
Abstract
We propose a novel linear discriminant analysis approach for the classification of high-dimensional matrix-valued data that commonly arise from imaging studies. Motivated by the equivalence of the conventional linear discriminant analysis and the ordinary least squares, we consider an efficient nuclear norm penalized regression that encourages a low-rank structure. Theoretical properties, including a non-asymptotic risk bound and a rank consistency result, are established. Simulation studies and an application to electroencephalography data show the superior performance of the proposed method over existing approaches.
Keywords: Linear discriminant analysis; Low rank; Matrix data; Nuclear norm; Risk bound; Rank consistency
1 Introduction
Modern technologies have generated a large number of datasets that possess a matrix structure for classification purposes. For example, in neuropsychiatric disease studies, it is often of interest to evaluate the prediction accuracy of prognostic biomarkers by relating two-dimensional imaging predictors, e.g., electroencephalography (EEG) and magnetoencephalography, to clinical outcomes such as diagnostic status (Mu and Gage, 2011). In this paper, we focus on extending one of the most commonly used classification methods, Fisher linear discriminant analysis (LDA), to matrix-valued predictors. Progress has been made in recent years on developing sparse LDA using regularization (Tibshirani, 1996), including Shao et al. (2011); Fan et al. (2012); Mai et al. (2012). However, all these methods deal only with vector-valued covariates, and it remains challenging to accommodate the matrix structure. Naively transforming the matrix data into a high-dimensional vector will produce unsatisfactory results for several reasons. First, vectorization destroys the structural information within the matrix, such as shapes and spatial correlations. Second, turning a matrix into a vector generates unmanageably high dimensionality; for example, estimating the population precision matrix for LDA can be troublesome when the dimension of the vectorized covariate exceeds the sample size. Third, sparsity-inducing regularization does not necessarily work well, because the underlying two-dimensional signals are usually approximately low-rank rather than sparse.
Recently, there have been several developments of regression methods for matrix covariates. Zhou and Li (2014) proposed a class of regularized matrix regression methods based on spectral regularization. Wang and Zhu (2017) developed a generalized scalar-on-image regression model via total variation. Chen et al. (2013) introduced an adaptive nuclear norm penalization approach for low-rank matrix approximation.
In this paper, we propose a new matrix LDA approach that builds on the equivalence between the classical LDA and the ordinary least squares. We formulate the binary classification as a nuclear norm penalized least squares problem, which efficiently exploits the low-rank structure of the two-dimensional discriminant direction matrix. The involved optimization is amenable to the accelerated proximal gradient method. Although our problem is formulated as a penalized regression problem, a fundamental difference is that the covariates and the residuals are no longer independent in our case. This requires extra effort in developing the risk bound and the rank consistency result. The risk bound is explicit in terms of the rank of the image, the image size, the sample size, and the eigenvalues of the covariance matrix of the image covariates. This result also implies estimation consistency provided the image dimensions grow at a suitable rate relative to the sample size. Under stronger conditions, we show that the rank of the coefficient matrix can be consistently estimated as well. The proof is based on exploiting the spectral norm of random matrices with mixture-of-Gaussian components and extending the results in Bach (2008) to allow diverging matrix dimensions.
It is worth noting that a 2D classification method has been developed by Zhong and Suslick (2015). They proposed a discriminant analysis method for matrix data, which projects the matrix onto the row space and the column space separately. The two projections are estimated iteratively and integrated for classification. Compared with this approach, our method has several advantages. First, the rank of the two-dimensional discriminant direction matrix in Zhong and Suslick (2015) is set to one because of the separability assumption, while we allow the rank of the direction matrix to be flexible and selected by a data-driven procedure. Second, the directions in Zhong and Suslick (2015) are estimated through an iterative procedure. In contrast, our proposal adopts a direct estimation approach by solving a nuclear norm penalized regression problem, which is computationally much faster. Finally, we provide theoretical guarantees for our estimator.
2 Method
We first define some useful notation. Let vec(·) be the vectorization operator, which stacks the columns of a matrix into a vector. The inner product between two matrices A and B of the same size is defined as ⟨A, B⟩ = trace(AᵀB) = vec(A)ᵀvec(B).
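As a minimal numerical check of these definitions, the identity ⟨A, B⟩ = trace(AᵀB) = vec(A)ᵀvec(B) can be verified directly. The paper's own software is in Matlab; NumPy is used here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

# vec(.) stacks the columns of a matrix into a single vector (column-major order)
vec = lambda M: M.reshape(-1, order="F")

inner_trace = np.trace(A.T @ B)  # <A, B> = trace(A^T B)
inner_vec = vec(A) @ vec(B)      # = vec(A)^T vec(B)

assert np.isclose(inner_trace, inner_vec)
```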
Consider a binary classification problem, where X ∈ R^{p×q} is a two-dimensional image covariate and Y ∈ {1, 2} denotes the class label. The LDA model assumes that vec(X) | Y = k ~ N(μ_k, Σ) for k = 1, 2, and P(Y = 1) = π₁. Suppose we have n subjects, with n₁ subjects belonging to class 1 and n₂ subjects to class 2. It is well known that LDA is connected to linear regression with the class labels as responses (Duda et al., 2012; Mika, 2002). When n > pq, the classical LDA is equivalent to solving
(β̂₀, B̂) = argmin_{β₀, B} Σ_{i=1}^{n} ( y_i − β₀ − ⟨X_i, B⟩ )²,    (1)
where y_i = −n/n₁ if subject i is in class 1, and y_i = n/n₂ if subject i is in class 2. Although this connection gives the exact LDA direction when n > pq, it has two potential drawbacks. First, when pq > n, the equivalence between Fisher LDA and (1) is lost because of the non-uniqueness of the solution. Second, the formulation (1) does not incorporate the 2D image structure when estimating the direction, because ⟨X_i, B⟩ = vec(X_i)ᵀvec(B) depends on X_i only through its vectorization. These considerations motivate us to consider a penalized version of (1) as follows
(β̂₀, B̂) = argmin_{β₀, B} { Σ_{i=1}^{n} ( y_i − β₀ − ⟨X_i, B⟩ )² + λ ‖B‖_* },    (2)
where λ > 0 is a tuning parameter and the nuclear norm ‖B‖_* = Σ_j σ_j(B), the σ_j(B)'s being the singular values of the matrix B. The nuclear norm plays an important role because it imposes a low-rank structure on the estimated direction B̂. An alternative choice is to add a Lasso-type penalty, i.e., λ Σ_{j,k} |b_{jk}|, where b_{jk} is the (j, k)th element of B. However, the Lasso-type penalty can identify at most n nonzero components, and in most imaging studies the signal is usually not that sparse. More importantly, the Lasso-type penalty ignores the matrix structure, because it is equivalent to vectorizing the array and applying sparse LDA. Once B̂ from (2) is obtained, a naive classification rule assigns the ith subject to class 2 if β̂₀ + ⟨X_i, B̂⟩ > 0. However, it can be shown that the intercept β̂₀ obtained from (2) is not optimal. Instead, we use the optimal intercept that minimizes the training error after obtaining B̂. Mai et al. (2012) showed that the intercept of LDA actually has a closed form, and their derivations can be easily applied to our case. In particular, given the direction B̂,
β̃₀ = −⟨(X̄₁ + X̄₂)/2, B̂⟩ + [ vec(B̂)ᵀ Σ̂ vec(B̂) / ⟨X̄₂ − X̄₁, B̂⟩ ] log(n₂/n₁),    (3)
where X̄_k is the sample mean of the subjects in class k and Σ̂ is the estimated covariance matrix of vec(X). Once B̂ is obtained from (2), we can plug it into (3) to obtain the optimal intercept β̃₀. The optimal classification rule assigns the ith subject to class 2 if β̃₀ + ⟨X_i, B̂⟩ > 0.
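Before turning to the algorithm, the LDA–least-squares equivalence invoked above can be checked numerically: with vector covariates and a sample size exceeding the dimension, the OLS slope for coded class labels is proportional to the classical LDA direction S⁻¹(x̄₂ − x̄₁). A sketch with simulated Gaussian data, using the −n/n₁, n/n₂ label coding:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, d = 300, 200, 6
n = n1 + n2
mu2 = np.full(d, 0.8)  # class 2 mean shift; class 1 is centered at zero

X = np.vstack([rng.standard_normal((n1, d)),
               rng.standard_normal((n2, d)) + mu2])
y = np.concatenate([-np.full(n1, n / n1), np.full(n2, n / n2)])

# OLS with intercept: slope = coefficients excluding the leading intercept term
Z = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(Z, y, rcond=None)[0][1:]

# Classical LDA direction: pooled within-class covariance inverse times mean difference
xbar1, xbar2 = X[:n1].mean(0), X[n1:].mean(0)
Xc1, Xc2 = X[:n1] - xbar1, X[n1:] - xbar2
S = (Xc1.T @ Xc1 + Xc2.T @ Xc2) / (n - 2)
lda_dir = np.linalg.solve(S, xbar2 - xbar1)

# The two directions coincide up to a positive scalar
cos = beta @ lda_dir / (np.linalg.norm(beta) * np.linalg.norm(lda_dir))
assert cos > 0.999
```

The proportionality is exact in finite samples (by the Sherman–Morrison identity relating the total and within-class scatter matrices), which is why the cosine similarity is 1 up to rounding.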
For any fixed λ, the optimization problem in (2) can be solved using the accelerated proximal gradient method (Nesterov, 1983; Beck and Teboulle, 2009). Zhou and Li (2014) studied this algorithm for nuclear norm regularized matrix regression. The nuclear norm is not differentiable, but its subdifferential exists; therefore, a point is a local minimum of (2) if and only if zero belongs to the subdifferential of the objective at that point. Thanks to the convexity of the nuclear norm, any local minimum is also global. Based on these facts, the singular value thresholding method for nuclear norm regularization (Cai et al., 2010) serves as the building block of Nesterov's method. Compared with the classical gradient descent method, whose convergence rate is O(1/t), where t denotes the number of iterations, Nesterov's accelerated gradient descent achieves a convergence rate of O(1/t²). It differs from traditional algorithms by utilizing the estimates from the two previous iterations to generate the next estimate. Although motivated by a different goal, our optimization problem (2) can be solved using the matrix_sparsereg function in the tensor regression Matlab toolbox (Zhou, 2013). For the tuning of λ, we adopt the BIC derived by Zhou and Li (2014) under the nuclear norm regularized matrix regression framework.
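A minimal sketch of the solver described above — Nesterov acceleration with singular value thresholding as the proximal map. The 1/(2n) loss scaling, the fixed step size, and the fixed iteration count are simplifying assumptions for illustration, not the exact implementation of the matrix_sparsereg toolbox function:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal map of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def matrix_lda_pgd(Xs, y, lam, n_iter=300):
    """Accelerated proximal gradient (FISTA-style) for
       min_{b0, B}  (1/2n) * sum_i (y_i - b0 - <X_i, B>)^2 + lam * ||B||_*,
    where Xs has shape (n, p, q)."""
    n, p, q = Xs.shape
    D = Xs.reshape(n, -1)                        # flattened design matrix
    step = n / (np.linalg.norm(D, 2) ** 2 + n)   # 1/L bound for the smooth part
    B, b0 = np.zeros((p, q)), 0.0
    Bm, b0m, t = B.copy(), b0, 1.0               # extrapolated (momentum) point
    for _ in range(n_iter):
        r = D @ Bm.reshape(-1) + b0m - y         # residuals at momentum point
        gB = (D.T @ r).reshape(p, q) / n         # gradient with respect to B
        B_new = svt(Bm - step * gB, step * lam)  # proximal gradient step
        b0_new = b0m - step * r.mean()
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Nesterov extrapolation: combine the two previous iterates
        Bm = B_new + ((t - 1.0) / t_new) * (B_new - B)
        b0m = b0_new + ((t - 1.0) / t_new) * (b0_new - b0)
        B, b0, t = B_new, b0_new, t_new
    return b0, B
```

Because the update applies the exact proximal map of the nuclear norm, small singular values are thresholded to exactly zero, so the returned B̂ is genuinely low-rank rather than merely approximately so.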
3 Theory
In this section we present an error bound for the risk of the proposed regularized estimator under the Frobenius norm, together with a rank consistency result. Denote the residuals by ε_i = y_i − β₀ − ⟨X_i, B*⟩ and the true coefficient matrix by B*. By the equivalence between the LDA direction and least squares, we know that vec(B*) can be written as c Σ⁻¹(μ₂ − μ₁) for some positive constant c. Consider the singular value decomposition B* = U D Vᵀ. Let U_⊥ and V_⊥ be (arbitrary) orthogonal complements of U and V, respectively. We make the following assumptions.

(A1) We assume that the second-order moment matrix of the vectorized covariate vec(X), denoted by Σ, satisfies 0 < c₁ ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ c₂ < ∞, where λ_min(Σ) and λ_max(Σ) are the smallest and largest eigenvalues of Σ, respectively, and c₁, c₂ are some positive constants.

(A2) Let r be the unknown rank of the true coefficient matrix B*. Define Λ as the matrix arising in the analysis of Bach (2008), constructed from Σ and the orthogonal complements U_⊥ and V_⊥. We assume its spectral norm satisfies ‖Λ‖ < 1.

(A3) Assume that the tuning parameter λ and the dimensions p and q satisfy suitable rate conditions as n → ∞, strengthening the requirements needed for Theorem 1.

(A4) There exists a positive constant C such that the number of nonzero elements of Σ⁻¹(μ₂ − μ₁) is at most C(p + q).
Condition (A1) requires bounded eigenvalues of the covariance matrix of the vectorized covariate, which is standard in the literature. Condition (A2) is also used by Bach (2008) and can be viewed as an analog of the strong irrepresentable condition for the Lasso (Zhao and Yu, 2006). Condition (A3) imposes stronger requirements on the order of λ, p, and q in order to obtain consistent rank estimation in addition to consistent coefficient estimation. This is expected, since rank estimation consistency is usually not implied by parameter estimation consistency. Condition (A4) can be viewed as a sparsity assumption on Σ⁻¹(μ₂ − μ₁). Recall that the solution (the slope) to the classical LDA problem with vector covariates depends on the term Σ⁻¹(μ₂ − μ₁). This assumption essentially implies that there are at most O(p + q) nonzero elements in the true coefficient matrix, given that the rank of B* is fixed.
Next, we briefly review two important concepts, namely the decomposable regularizer and the strongly convex loss function, proposed by Negahban et al. (2012), and highlight their connection to the risk bound property of our estimator.
Definition 1.
A regularizer Ω is decomposable with respect to a given pair of subspaces (M, M̄⊥) with M ⊆ M̄ if Ω(u + v) = Ω(u) + Ω(v) for all u ∈ M and v ∈ M̄⊥.
In our setting, Ω is the nuclear norm. For a matrix B to be estimated, the nuclear norm is decomposable with respect to the pair of subspaces

M = { B : the row space of B lies in span(V) and the column space in span(U) },
M̄⊥ = { B : the row space of B is orthogonal to span(V) and the column space to span(U) },

where U and V represent B*'s left and right singular vectors. For any pair of matrices A ∈ M and C ∈ M̄⊥, the inner product ⟨A, C⟩ is zero owing to their mutually orthogonal rows and columns. Hence we conclude ‖A + C‖_* = ‖A‖_* + ‖C‖_*. Since we assume the true parameter has a low-rank structure, we expect the regularized estimator to have a large projection onto M and a relatively small projection onto M̄⊥.
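The orthogonality argument can be written out as follows (a standard verification, with notation matching the singular subspaces of B*):

```latex
% For A with column space in span(U) and row space in span(V), and
% C with column space in span(U_perp) and row space in span(V_perp):
\langle A, C \rangle \;=\; \operatorname{tr}(A^{\top} C) \;=\; 0,
\qquad
\| A + C \|_{*} \;=\; \| A \|_{*} + \| C \|_{*},
```

since the singular value decomposition of A + C is obtained by concatenating the decompositions of A and C, so their singular values simply add.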
Even when the excess loss L(B̂) − L(B*) is close to zero, it is insufficient to claim that B̂ − B* is small if the loss function is relatively flat. This is why the strong convexity condition is required.
Definition 2.
For a given loss function L and norm ‖·‖, we say L is strongly convex with curvature κ > 0 and tolerance function τ if

δL(Δ) ≥ κ ‖Δ‖² − τ²(B*),

where δL(Δ) = L(B* + Δ) − L(B*) − ⟨∇L(B*), Δ⟩ and Δ = B̂ − B*.
Now we are ready to state the main result on the risk bound for our estimator. The proof is provided in Appendix B.
Theorem 1.
Suppose that (A1) and (A4) hold, and let B̂ be the solution to (2). If λ is chosen of a suitable order, then, with probability at least 1 − c₁ exp{−c₂(p + q)} for some positive constants c₁ and c₂, the estimation error ‖B̂ − B*‖_F² is bounded by an explicit quantity depending on the rank r, the dimensions p and q, the sample size n, and the eigenvalue bounds in (A1).
Theorem 1 implies the consistency of B̂ as long as the error bound vanishes as n → ∞. If the rank of B* is fixed, then both p and q can diverge with n at appropriate rates, as can their product pq. This result is compatible with Theorem 1 in Raskutti and Yuan (2015). Note that the estimated intercept converges to a shifted value that deviates from the true intercept β₀. This is expected, because the solution to OLS is only equivalent to LDA's solution in terms of the slope B, not the intercept. More precisely, for OLS, taking the derivative of the squared loss with respect to β₀ and setting it to zero essentially requires the residuals to have mean zero and be uncorrelated with the covariates. However, this does not hold in our case. Instead, we need to shift the residuals to balance off the bias in the cross-product term between the covariates and the residuals. The proof of the theorem uses a Gaussian comparison inequality, which allows us to handle vec(X) following a general Gaussian distribution instead of a standard Gaussian distribution, given that the largest eigenvalue of Σ is bounded. Based on this connection, we further utilize the concentration properties of the spectral norm of Gaussian random matrices.
Next we show that B̂ is rank-consistent under stronger conditions.
Theorem 2.
Suppose that (A1)–(A4) hold. Then the estimator B̂ is rank-consistent; that is, P{rank(B̂) = rank(B*)} → 1 as n → ∞.
Similar to the Lasso, estimation consistency does not guarantee correct rank estimation for matrix regularization. In fact, the assumptions here are stronger than those in Theorem 1; for example, Theorem 2 requires a stricter order for the tuning parameter λ than Theorem 1 does. The proof is based on the arguments in Bach (2008), with modifications to allow diverging p and q.
Remark 1.
Although nuclear norm penalized least squares is used to estimate the classification direction, there is a fundamental difference between our theorems and the theoretical results derived for nuclear norm penalized least squares regression (Bach, 2008; Negahban et al., 2012). The previous work assumes that the data obey a linear regression model with covariate-independent additive noise, which is not true in our case. In particular, the covariates and the residuals are no longer independent in our problem, which brings additional challenges to the development of the theoretical results.
4 Numerical results
4.1 Simulation
We conduct simulation studies to evaluate the numerical performance of our proposed method. We compare its performance with that of a few alternatives: "Lasso LDA", which adopts a naive Lasso penalty in LDA without taking the matrix structure into account, and the regularized matrix logistic regression (Zhou and Li, 2014) with nuclear norm and Lasso penalties, denoted by "Logistic Nuclear" and "Logistic Lasso", respectively. We generate samples from two classes with weights (π₁, π₂) ∈ {(0.5, 0.5), (0.75, 0.25)} and sample sizes n = 100, 200, and 500. For each class, we generate the vectorized predictors from a multivariate normal distribution with class-specific means and a common covariance matrix Σ, where Σ has a two-dimensional autoregressive structure. The true signal is generated based on a 64-by-64 image. We consider three settings: a cross, a triangle, and a butterfly; these pictures are shown in Figure 1(a). In particular, the white color denotes the value 0 and black denotes 0.05. We apply each fitted model to an independent test data set and summarize the misclassification rates based on 1000 Monte Carlo replications. The results are contained in Table 1.
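A two-dimensional autoregressive covariance can be formed, for instance, as a Kronecker product of two AR(1) covariances. This is a hedged sketch: the Kronecker form, the value ρ = 0.5, and the cross-shaped mean signal below are illustrative assumptions, not the paper's exact specification:

```python
import numpy as np

def ar1_cov(dim, rho):
    """AR(1) covariance matrix: entry (i, j) equals rho^|i - j|."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def sample_matrix_normal(n, M, Sig_row, Sig_col, rng):
    """Draw n matrices with mean M and Cov{vec(X)} = Sig_col (x) Sig_row."""
    p, q = M.shape
    Lr = np.linalg.cholesky(Sig_row)   # row-direction correlation factor
    Lc = np.linalg.cholesky(Sig_col)   # column-direction correlation factor
    Z = rng.standard_normal((n, p, q))
    return M + Lr @ Z @ Lc.T

rng = np.random.default_rng(2)
p = q = 64
Sig_row, Sig_col = ar1_cov(p, 0.5), ar1_cov(q, 0.5)

# Illustrative cross-shaped mean shift for class 2 (signal value 0.05, as in the text)
M2 = np.zeros((p, q))
M2[20:44, 28:36] = 0.05
M2[28:36, 20:44] = 0.05

X = sample_matrix_normal(5, M2, Sig_row, Sig_col, rng)
print(X.shape)  # (5, 64, 64)
```

Sampling via the two Cholesky factors avoids ever forming the 4096 × 4096 Kronecker covariance explicitly.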
Table 1: Misclassification rates, with standard errors in parentheses, over 1000 replications.

Shape  n  (π₁, π₂)  Our method  Lasso LDA  Logistic Nuclear  Logistic Lasso
Cross  100  (0.5, 0.5)  3.65 (0.02)  17.81 (0.07)  3.70 (0.02)  19.51 (0.07)
Cross  100  (0.75, 0.25)  3.32 (0.02)  14.89 (0.05)  6.62 (0.04)  18.84 (0.04)
Cross  200  (0.5, 0.5)  3.22 (0.02)  11.69 (0.05)  3.26 (0.02)  13.39 (0.05)
Cross  200  (0.75, 0.25)  2.87 (0.02)  9.89 (0.04)  4.14 (0.03)  16.27 (0.04)
Cross  500  (0.5, 0.5)  3.09 (0.02)  6.97 (0.03)  3.11 (0.02)  8.19 (0.04)
Cross  500  (0.75, 0.25)  2.62 (0.02)  5.81 (0.03)  3.59 (0.02)  14.91 (0.03)
Triangle  100  (0.5, 0.5)  3.12 (0.02)  15.73 (0.06)  3.11 (0.02)  17.70 (0.07)
Triangle  100  (0.75, 0.25)  2.66 (0.02)  13.72 (0.05)  6.10 (0.04)  17.19 (0.04)
Triangle  200  (0.5, 0.5)  2.85 (0.02)  9.90 (0.04)  2.81 (0.02)  11.81 (0.04)
Triangle  200  (0.75, 0.25)  2.43 (0.02)  8.72 (0.03)  3.62 (0.02)  13.40 (0.04)
Triangle  500  (0.5, 0.5)  2.67 (0.02)  5.67 (0.03)  2.73 (0.02)  6.96 (0.03)
Triangle  500  (0.75, 0.25)  2.29 (0.01)  4.89 (0.02)  2.74 (0.02)  9.97 (0.03)
Butterfly  100  (0.5, 0.5)  3.86 (0.02)  17.10 (0.06)  4.16 (0.02)  18.82 (0.07)
Butterfly  100  (0.75, 0.25)  3.47 (0.02)  14.79 (0.05)  7.14 (0.04)  17.78 (0.04)
Butterfly  200  (0.5, 0.5)  3.67 (0.02)  11.00 (0.04)  3.78 (0.02)  12.66 (0.05)
Butterfly  200  (0.75, 0.25)  3.26 (0.02)  9.80 (0.04)  4.50 (0.02)  13.93 (0.04)
Butterfly  500  (0.5, 0.5)  3.56 (0.02)  6.50 (0.03)  3.52 (0.02)  7.70 (0.03)
Butterfly  500  (0.75, 0.25)  3.02 (0.02)  5.74 (0.03)  3.51 (0.02)  10.49 (0.03)
The results show that our method performs much better than "Lasso LDA" and "Logistic Lasso" under all scenarios. This is expected, because these two methods ignore the matrix structure. "Logistic Nuclear" has misclassification rates similar to our method for balanced data, but does not perform as well as ours for unbalanced data. We have also plotted the estimates using the nuclear norm and the Lasso penalties from one randomly selected Monte Carlo replication in Figure 1(b)-(c). It can be seen that the proposed nuclear norm regularization is much better than the Lasso regularization in recovering the matrix signal for the different shapes. By comparing the recovery of the different shapes in Column (b) of Figure 1, we find that our method works better for the cross than for the triangle and the butterfly. This is expected, since the triangle and the butterfly do not have a low-rank structure.
Figure 1: (a) The true signals; (b) estimates with the nuclear norm penalty; (c) estimates with the Lasso penalty.
4.2 Real data application
We apply our method to an EEG dataset, which is available at https://archive.ics.uci.edu/ml/datasets/EEG+Database. The data were collected by the Neurodynamics Laboratory to study the EEG correlates of genetic predisposition to alcoholism. They contain measurements from 64 electrodes placed on each subject's scalp, sampled at 256 Hz (3.9-msec epoch) for 1 second. Each subject was exposed to three stimuli: a single stimulus, two matched stimuli, and two unmatched stimuli. Among the 122 subjects in the study, 77 were alcoholic individuals and 45 were controls. More details about the study can be found in Zhang et al. (1995). For each subject, we use the average over all 120 runs under the single-stimulus condition as the covariate X, which is a 256 × 64 matrix. The classification label is alcoholic or not. We divide the full data into training and testing samples using K-fold cross-validation with K = 5, 10, 20, and 122, where K = 122 corresponds to leave-one-out cross-validation. We apply the proposed method and the alternative methods to the data and summarize the average misclassification rates over all folds in Table 2. It can be seen that our method performs uniformly better than the others, which agrees with the simulation findings for unbalanced data. We also checked the fitted signal matrix, and it agrees well with the one obtained by Zhou and Li (2014).
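The cross-validation scheme described above can be sketched generically as follows. This is an illustrative harness, not the paper's code: `fit` and `predict` are hypothetical placeholders standing in for any classifier, including the proposed one:

```python
import numpy as np

def kfold_indices(n, K, rng):
    """Randomly split {0, ..., n-1} into K roughly equal folds."""
    return np.array_split(rng.permutation(n), K)

def cv_misclassification(X, y, fit, predict, K, seed=0):
    """Average test misclassification rate over K folds.
    `fit(Xtr, ytr)` returns a fitted model; `predict(model, Xte)` returns labels."""
    rng = np.random.default_rng(seed)
    errs = []
    for test_idx in kfold_indices(len(y), K, rng):
        mask = np.ones(len(y), dtype=bool)
        mask[test_idx] = False                 # False marks the held-out fold
        model = fit(X[mask], y[mask])
        errs.append(np.mean(predict(model, X[~mask]) != y[~mask]))
    return float(np.mean(errs))
```

With K equal to the sample size, each fold holds out a single subject, recovering leave-one-out cross-validation.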
Table 2: Average misclassification rates on the EEG data.

Method  leave-one-out  5-fold  10-fold  20-fold
Our method  0.205  0.172  0.189  0.180
Lasso LDA  0.238  0.221  0.250  0.262
Logistic Nuclear  0.230  0.214  0.222  0.181
Logistic Lasso  0.246  0.287  0.271  0.264
References
Anderson [1955] Theodore W. Anderson. The integral of a symmetric unimodal function over a symmetric convex set and some probability inequalities. Proceedings of the American Mathematical Society, 6(2):170–176, 1955.
Bach [2008] F. R. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9:1019–1048, 2008.
Beck and Teboulle [2009] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Cai et al. [2010] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
Chen et al. [2013] Kun Chen, Hongbo Dong, and Kung-Sik Chan. Reduced rank regression via adaptive nuclear norm penalization. Biometrika, 100(4):901–920, 2013.
Duda et al. [2012] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, 2012.
Fan et al. [2012] Jianqing Fan, Yang Feng, and Xin Tong. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4):745–771, 2012. doi: 10.1111/j.1467-9868.2012.01029.x.
Mai et al. [2012] Qing Mai, Hui Zou, and Ming Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.
Mika [2002] S. Mika. Kernel Fisher Discriminants. PhD thesis, University of Technology, Berlin, 2002.
Mu and Gage [2011] Y. Mu and F. Gage. Adult hippocampal neurogenesis and its role in Alzheimer's disease. Molecular Neurodegeneration, 6:85, 2011.
Negahban et al. [2012] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27:538–557, 2012.
Nesterov [1983] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27:372–376, 1983.
Raskutti and Yuan [2015] G. Raskutti and M. Yuan. Convex regularization for high-dimensional tensor regression. Technical report, arXiv:1512.01215, 2015.
Shao et al. [2011] Jun Shao, Yazhen Wang, Xinwei Deng, and Sijian Wang. Sparse linear discriminant analysis by thresholding for high dimensional data. Annals of Statistics, 39(2):1241–1265, 2011. doi: 10.1214/10-AOS870.
Slepian [1962] D. Slepian. The one-sided barrier problem for Gaussian noise. Bell System Technical Journal, 41:463–501, 1962.
Tibshirani [1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
Wainwright [2015] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. In preparation, University of California, Berkeley, 2015.
Wang and Zhu [2017] Xiao Wang and Hongtu Zhu. Generalized scalar-on-image regression models via total variation. Journal of the American Statistical Association, 112(519):1156–1168, 2017.
Zhang et al. [1995] Xiao Lei Zhang, Henri Begleiter, Bernice Porjesz, Wenyu Wang, and Ann Litke. Event related potentials during object recognition tasks. Brain Research Bulletin, 38(6):531–538, 1995.
Zhao and Yu [2006] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
Zhong and Suslick [2015] Wenxuan Zhong and Kenneth S. Suslick. Matrix discriminant analysis with application to colorimetric sensor array data. Technometrics, 57(4):524–534, 2015.
Zhou [2013] Hua Zhou. Matlab TensorReg toolbox. http://huazhou.github.io/softwares/tensorreg/, July 2013. Available online.
Zhou and Li [2014] Hua Zhou and Lexin Li. Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):463–483, 2014.