Worst-Case Linear Discriminant Analysis as Scalable Semidefinite Feasibility Problems
Abstract
In this paper, we propose an efficient semidefinite programming (SDP) approach to worst-case linear discriminant analysis (WLDA). Compared with traditional LDA, WLDA considers the dimensionality reduction problem from the worst-case viewpoint, which is in general more robust for classification. However, the original formulation of WLDA is nonconvex and difficult to optimize. In this paper, we reformulate the optimization problem of WLDA into a sequence of semidefinite feasibility problems. To solve these semidefinite feasibility problems efficiently, we design a new scalable optimization method with quasi-Newton methods and eigendecomposition as its core components. The proposed method is orders of magnitude faster than standard interior-point based SDP solvers.
Experiments on a variety of classification problems demonstrate that our approach achieves better performance than standard LDA. Our method is also much faster and more scalable than WLDA based on standard interior-point SDP solvers. The per-iteration computational complexity for an SDP with m constraints and n × n matrices is roughly reduced from O(m³ + m²n² + mn³) to O(n³) (with m ≪ n² in our case).
I. Introduction
Dimensionality reduction is a critical problem in machine learning, pattern recognition and computer vision. In the linear case, it learns a transformation matrix that projects the input data to a lower-dimensional space such that the important structure or geometry of the input data is preserved. It can help us to eliminate the inherent noise of the data and improve classification performance. It can also decrease the computational complexity of subsequent machine learning tasks. Two classical dimensionality reduction methods are widely used in many applications: principal component analysis (PCA) and linear discriminant analysis (LDA). PCA is an unsupervised linear dimensionality reduction method, which seeks the subspace of maximum data variance and projects the input data onto it. PCA may not give good classification performance due to its unsupervised nature. LDA is a supervised method, which aims to maximize the average distance between each pair of class means and minimize the average distance between each pair of samples within the same class. However, it has some limitations: 1) For K-class data, the target dimension of the projected subspace is restricted to at most K − 1. In this sense, LDA is suboptimal and may cause the so-called class separation problem: LDA tends to merge classes which are close in the original space. 2) It assumes that the class-conditional probability density functions are normally distributed, which does not hold in many cases. 3) The scatter matrices are required to be nonsingular.
There are a large number of extensions of LDA and PCA [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Among these methods, our focus in this paper is on the discrimination criterion of worst-case linear discriminant analysis (WLDA), which was proposed by Zhang et al. [5]. Instead of using average between-class and within-class distances as in LDA, WLDA considers scatter measures from the worst-case view, which arguably is more suitable for applications like classification. Specifically, WLDA tries to maximize the minimum of the pairwise distances between class means, and to minimize the maximum of the within-class pairwise distances over all classes. Due to the complex formulation of its criterion, the WLDA problem is difficult to optimize.
In [5], the WLDA problem was first relaxed to a metric learning problem on M = W Wᵀ, which can be solved by a sequence of SDP optimization procedures, where the SDP problems are solved by standard interior-point methods (we denote this approach Zhang et al. (SDP)). However, standard interior-point SDP solvers scale poorly to problems of high dimensionality, as the computational complexity is roughly O(m³ + m²n² + mn³) per iteration, where n is the problem size and m is the number of constraints in the SDP. An alternative optimization procedure was then proposed in [5] for large-scale WLDA problems, in which one column of the transformation matrix is found at each iteration while the other directions are fixed. We denote this method Zhang et al. (SOCP), since it is eventually reformulated as a series of second-order cone programming (SOCP) problems. Typically, this greedy method is not guaranteed to find a globally optimal solution.
In this paper we propose a fast SDP approach to the WLDA problem. The problem is converted into a sequence of SDP feasibility problems using a bisection search strategy, which finds a globally optimal solution to the relaxed problem. More importantly, we adopt a novel approach to solve the SDP feasibility problem at each iteration. Motivated by [11], a Frobenius-norm regularized SDP formulation is used, whose Lagrangian dual can be solved effectively by quasi-Newton methods. The computational complexity of this optimization method is dominated by an eigendecomposition at each iteration, which costs O(n³). The proposed method is denoted SDWLDA. The main contributions of this work are: 1) By introducing an auxiliary variable, the original WLDA problem is reformulated and can be solved via a sequence of convex feasibility problems, by which the global optimum can be obtained for the relaxed metric learning problem. 2) By virtue of Frobenius norm regularization, the optimization problem can be addressed by solving its Lagrangian dual, where first-order methods such as quasi-Newton can be used. This approach is much faster than solving the corresponding primal problem using standard interior-point methods, and can be applied to large-scale problems. Next, we briefly review some relevant work.
Dimensionality reduction In order to overcome the drawbacks of LDA and improve classification accuracy, many extensions have been proposed, such as relevant component analysis (RCA) [1], neighborhood component analysis (NCA) [2], null-space LDA (NLDA) [3], orthogonal LDA (OLDA) [4], the enhanced Fisher discriminant criterion (EFDC) [6], geometric mean-based subspace selection (GMSS) [7], harmonic mean-based subspace selection (HMSS) [8], and max–min distance analysis (MMDA) [9]. Assuming that dimensions with large within-class covariance are not relevant to subsequent classification tasks, RCA [1] assigns large weights to “relevant dimensions” and small weights to “irrelevant dimensions”, where the relevance is estimated using equivalence constraints. NCA [2] learns the transformation matrix directly by minimizing the expected leave-one-out classification error of nearest neighbours in the transformed space. Because the objective function to be optimized is not convex, NCA tends to converge to a local optimum. NLDA [3], OLDA [4] and EFDC [6] were proposed to address the failure of standard LDA when the scatter matrices are singular. NLDA maximizes the between-class distance in the null space of the within-class scatter matrix, while OLDA calculates a set of orthogonal discriminant vectors by diagonalizing the scatter matrices simultaneously. The resulting transformation matrices of NLDA and OLDA are both orthogonal, and the two methods are equivalent under a mild condition [12]. EFDC incorporates the intra-class variation into the Fisher discriminant criterion, so that data from the same class can be mapped to a subspace where both the intra-class compactness and the intra-class variation are well preserved. In this way, the method is robust to intra-class variation and has good generalization capability. To avoid the class separation problem of LDA, Tao et al.
[13] proposed a general averaged divergence analysis (GADA) framework, which uses a general mean function in place of the arithmetic mean used in LDA. By choosing different mean functions, several subspace selection algorithms have been developed. GMSS [7] investigates the effectiveness of geometric mean-based subspace selection, which maximizes the geometric mean of the Kullback–Leibler (KL) divergences between different class pairs. HMSS [8] maximizes the harmonic mean of the symmetric KL divergences between all class pairs. These methods adaptively give large weights to class pairs that are close to each other, and achieve better class separation than LDA. Instead of assigning weights to class pairs, MMDA [9] directly maximizes the minimum pairwise distance over all class pairs in the low-dimensional subspace, which guarantees the separation of all class pairs. However, MMDA does not take into account the within-class pairwise distances over all classes. Recently, Bian et al. [10] presented an asymptotic generalization analysis of LDA, which further enriched the existing theory of LDA. They showed that the generalization ability of LDA is mainly determined by the ratio of dimensionality to training sample size, where both the feature dimensionality and the training data size can be proportionally large.
Many dimensionality reduction algorithms, such as PCA and LDA, can be formulated as a trace ratio optimization problem [14]. Guo et al. [15] presented a generalized Fisher discriminant criterion, which is essentially a trace ratio. They proposed a heuristic bisection method, which was proven to converge to the exact solution. Wang et al. [16] tackled the trace ratio problem directly by an efficient iterative procedure, in which a trace difference problem is solved via eigendecomposition at each step. Shen et al. [17] provided a geometric revisit of the trace ratio problem in the framework of optimization on the Grassmann manifold. Different from [16], they proposed another efficient algorithm, which employs only one step of the parallel Rayleigh quotient iteration at each iteration. Kokiopoulou et al. [18] also treated dimensionality reduction as trace optimization, and gave an overview of the eigenvalue problems encountered in the area of dimensionality reduction. They made a comparison between nonlinear and linear methods for dimensionality reduction, including locally linear embedding (LLE), Laplacian eigenmaps, PCA, locality preserving projections (LPP), LDA, etc., and showed that all the eigenvalue problems arising in explicit linear projections can be regarded as projected analogues of the so-called nonlinear projections.
Different from the aforementioned methods, WLDA considers the dimensionality reduction problem from a worst-case viewpoint. It maximizes the worst-case between-class scatter measure and minimizes the worst-case within-class scatter measure simultaneously, which can lead to more robust classification performance. The inner maximization and minimization over discrete variables make it different from the general trace ratio problem, and difficult to solve; methods for the general trace ratio problem cannot be directly extended to it. Furthermore, unlike the iterative algorithm for the trace ratio problem [16], we formulate the WLDA problem as a sequence of SDP problems and propose an efficient method for solving them. The eigendecomposition we use serves to evaluate the gradient of the Lagrangian dual, which differs from the eigendecomposition employed in solving the trace ratio optimization problem.
Solving large-scale SDP problems Instead of learning the transformation matrix W, quadratic Mahalanobis distance metric learning methods (which are highly related to dimensionality reduction methods) optimize over M = W Wᵀ in order to obtain a convex relaxation. The transformation matrix W can be recovered from the eigendecomposition of M. Because M is positive semidefinite (p.s.d.) by definition, quadratic Mahalanobis metric learning methods that optimize over M usually need to solve an SDP problem.
Xing et al. [19] formulated metric learning as a convex (SDP) optimization problem, for which a globally optimal solution can be obtained. Weinberger et al. [20] presented a distance metric learning method, which optimizes a Mahalanobis metric such that the nearest neighbours always belong to the same class while samples from different classes are separated by a large margin. As the SDP solver, they proposed an alternating projection method, where the learned metric is projected back onto the p.s.d. cone by eigendecomposition at each iteration. MMDA [9] was solved approximately by a sequence of SDP problems using standard interior-point methods. Shen et al. [21] proposed a novel SDP-based method for directly solving trace quotient problems for dimensionality reduction, with which globally optimal solutions can be obtained.
As we can see, many of the aforementioned methods use standard interior-point SDP solvers, which are unfortunately computationally expensive (the computational complexity is roughly O(m³ + m²n² + mn³) per iteration) and scale poorly to large problems. An efficient SDP optimization approach is therefore critical for large-scale metric learning problems.
There is much recent work addressing large-scale SDP problems arising from distance metric learning and other computer vision tasks. Shen et al. [11] proposed a fast SDP approach for solving the Mahalanobis metric learning problem. They introduced a Frobenius-norm regularization into the objective function of the SDP, which leads to a much simpler Lagrangian dual problem: the dual objective function is continuously differentiable and the p.s.d. constraints in the dual can be eliminated. L-BFGS-B was used to solve the dual, where a partial eigendecomposition is computed at each iteration. Wang et al. [22] employed a similar dual approach to solve binary quadratic problems for computer vision tasks such as image segmentation, co-segmentation and image registration. The SDP optimization methods in [11, 22] can be seen as extensions of the works in [23, 24], which considered semidefinite least-squares problems. The key observation of [23, 24] is that the objective function of the corresponding dual problem is continuously differentiable but not twice differentiable, so first-order methods can be applied. Malick [24] and Boyd and Xiao [23] proposed to use quasi-Newton methods and projected gradient methods, respectively, to solve the Lagrangian dual of semidefinite least-squares problems. Semismooth Newton-CG methods [25] and smoothing Newton methods [26] have also been exploited for semidefinite least-squares problems; they require far fewer iterations at the cost of higher computational complexity per iteration (a full eigendecomposition plus conjugate gradient).
Alternatively, stochastic (sub)gradient descent (SGD) methods [27] have also been employed to solve SDP problems. Combined with alternating direction methods [28, 29], SGD can be used for SDP problems with inequality and equality constraints. The computational bottleneck of typical SGD is the projection of an infeasible point onto the p.s.d. cone at each iteration, which requires the eigendecomposition of an n × n matrix. A number of methods have been proposed to speed up this projection. Chen et al. [30] proposed a low-rank SGD method, in which a low-rank stochastic gradient is constructed so that the projection operation reduces to computing only a few leading eigenpairs. In the works of [31, 32, 33, 34], the distance metric is updated by rank-one matrices iteratively, and no eigendecomposition, or only one leading eigenpair, is required. Note that SGD methods usually need more iterations to converge than dual approaches based on quasi-Newton methods [11].
The work most related to ours is that of Shen et al. [11]. We use a similar SDP optimization technique to that in [11]; however, we consider SDP feasibility problems, while [11] focuses on standard SDP problems with linear objective functions.
Notation We use a bold lowercase letter x to denote a column vector, and a bold capital letter X to denote a matrix. Xᵀ is the transpose of X. ℝ^{m×n} denotes the set of m × n real matrices. I represents an identity matrix. X ⪰ 0 indicates that the matrix X is positive semidefinite. ⟨X, Y⟩ = tr(XᵀY) denotes the inner product of two matrices or vectors. tr(X) is the trace of a matrix. ‖X‖_F is the Frobenius norm of a matrix. diag(·) returns a diagonal matrix with the input elements on its main diagonal. Suppose the eigendecomposition of a symmetric matrix X is X = U diag(λ₁, …, λ_D) Uᵀ, where U is the orthonormal matrix of eigenvectors of X and λ₁, …, λ_D are the corresponding eigenvalues. We define the positive and negative parts of X respectively as
X₊ = U diag(max(λ₁, 0), …, max(λ_D, 0)) Uᵀ, (1)
X₋ = U diag(min(λ₁, 0), …, min(λ_D, 0)) Uᵀ. (2)
It is clear that X = X₊ + X₋ holds.
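The positive/negative-part decomposition above is straightforward to compute from one symmetric eigendecomposition. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def psd_parts(X):
    """Split a symmetric matrix into its positive and negative parts,
    so that X = X_plus + X_minus, with X_plus p.s.d. (keeps eigenvalues
    clamped at zero from below) and X_minus negative semidefinite."""
    w, V = np.linalg.eigh(X)                 # X = V diag(w) V^T
    X_plus = (V * np.maximum(w, 0.0)) @ V.T  # V diag(max(w,0)) V^T
    X_minus = (V * np.minimum(w, 0.0)) @ V.T # V diag(min(w,0)) V^T
    return X_plus, X_minus
```

Note that X₊ is exactly the Euclidean (Frobenius-norm) projection of X onto the p.s.d. cone, which is why this operator reappears in the dual formulation later.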
II. Worst-Case Linear Discriminant Analysis
We first briefly review the WLDA problem proposed in [5]. We are given a training set of N samples x₁, …, x_N ∈ ℝ^D, consisting of K classes C₁, …, C_K, where class C_k contains N_k samples. As mentioned before, the aim of linear dimensionality reduction is to find a transformation matrix W ∈ ℝ^{D×d} with d < D.
We define the within-class scatter matrix of the k-th class as

S_k = (1/N_k) Σ_{x_i ∈ C_k} (x_i − m_k)(x_i − m_k)ᵀ, (3)

which is also the covariance matrix of the k-th class, where m_k = (1/N_k) Σ_{x_i ∈ C_k} x_i is the class mean of the k-th class C_k. The between-class scatter matrix of the i-th and j-th classes is defined as

S_{ij} = (m_i − m_j)(m_i − m_j)ᵀ. (4)
Unlike LDA, which seeks to minimize the average within-class pairwise distance, the within-class scatter measure used in WLDA is defined as

s_w(W) = max_{1≤k≤K} tr(Wᵀ S_k W), (5)

which is the maximum of the within-class distances over all classes.

On the other hand, the between-class scatter measure used in WLDA is defined as

s_b(W) = min_{1≤i<j≤K} tr(Wᵀ S_{ij} W). (6)

It uses the minimum of the pairwise distances between class means, instead of the average distance between each class mean and the sample mean employed in LDA.
The optimality criterion of WLDA is to maximize the ratio of the between-class scatter measure to the within-class scatter measure:

max_W s_b(W) / s_w(W) (7a)
s.t. WᵀW = I_d. (7b)
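The worst-case scatter measures are cheap to evaluate for a given projection. The following NumPy sketch computes the WLDA criterion s_b(W)/s_w(W) from raw per-class data; the function and variable names are ours, introduced only for illustration:

```python
import numpy as np

def worst_case_ratio(W, class_data):
    """WLDA criterion for a fixed projection W (D x d, orthonormal columns):
    min over class pairs of tr(W^T S_ij W), divided by
    max over classes of tr(W^T S_k W).
    class_data: list of (n_k, D) arrays, one array per class."""
    means = [X.mean(axis=0) for X in class_data]
    # within-class covariance S_k for each class (biased, i.e. divide by n_k)
    covs = [np.cov(X, rowvar=False, bias=True) for X in class_data]
    within = max(np.trace(W.T @ Sk @ W) for Sk in covs)
    between = min(
        np.trace(W.T @ np.outer(means[i] - means[j], means[i] - means[j]) @ W)
        for i in range(len(means)) for j in range(i + 1, len(means)))
    return between / within
```

Evaluating the criterion is easy; the hard part, addressed next, is optimizing it over W under the orthonormality constraint (7b).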
As stated in [5], problem (7) is not easy to optimize with respect to W, so a new variable M = W Wᵀ is introduced and problem (7) is reformulated as a metric learning problem.
Theorem II.1
Define the sets Γ = {W Wᵀ : W ∈ ℝ^{D×d}, WᵀW = I_d} and Ω = {M : tr(M) = d, 0 ⪯ M ⪯ I}. Then Γ is the set of extreme points of Ω, and Ω is the convex hull of Γ.
This theorem has been widely used and its proof can be found in [35]. According to Theorem II.1, the orthonormality constraint on W can be relaxed to convex constraints on M, and problem (7) can be relaxed to the following problem defined on a p.s.d. variable M [5]:
max_M [min_{1≤i<j≤K} tr(S_{ij} M)] / [max_{1≤k≤K} tr(S_k M)] (8a)
s.t. tr(M) = d, (8b)
0 ⪯ M ⪯ I. (8c)
Once the optimal solution M* is obtained, the optimal W for problem (7) can be recovered from the top d eigenvectors of M*.
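The recovery step is one symmetric eigendecomposition. A minimal sketch (function name ours):

```python
import numpy as np

def recover_transform(M, d):
    """Recover a projection W (D x d) from a relaxed solution M:
    take the eigenvectors of the d largest eigenvalues of M."""
    w, V = np.linalg.eigh(M)          # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:d]     # indices of the d largest eigenvalues
    return V[:, top]
```

When M* happens to be an extreme point of the relaxed set (i.e. M* = W Wᵀ with WᵀW = I_d), this recovery is exact: the returned columns span the same subspace as W.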
In [5], Zhang et al. proposed two methods to solve (7), as stated in Section I. In the first, an iterative algorithm was presented to solve the relaxed problem (8), where an SDP problem needs to be solved at each step by a standard SDP solver. This method does not scale to problems of high dimensionality or with many training data points. The second is based on a greedy approach, which is not guaranteed to find a globally optimal solution.
Hence, in the next section, we describe our algorithm (called SDWLDA) for finding the transformation matrix that maximizes (8), and demonstrate how to solve the problem efficiently.
III. A Fast SDP Approach to WLDA
In this section, problem (8) is first reformulated into a sequence of SDP optimization problems based on a bisection search. Then, a Frobenius norm regularization is introduced and the SDP problem at each step is solved through a Lagrangian dual formulation. With this SDWLDA method, the global optimum can be obtained for the relaxed problem (8). Moreover, solving the dual problem with quasi-Newton methods reduces the computational complexity, compared with solving the primal problem directly by an interior-point algorithm.
III-A Problem Reformulation
By introducing an auxiliary variable t, problem (8) can be rewritten as

max_{M,t} t (9a)
s.t. tr(S_{ij} M) ≥ t · tr(S_k M), ∀ 1 ≤ i < j ≤ K, 1 ≤ k ≤ K, (9b)
tr(M) = d, (9c)
0 ⪯ M ⪯ I. (9d)
There are two variables to be optimized in problem (9), but we are interested in finding the M that maximizes t. Problem (9) is clearly nonconvex in M and t, since the constraint (9b) is not convex. However, noting that (9b) becomes linear once t is given, we employ a bisection search strategy and convert problem (9) into a sequence of convex feasibility problems, by which the global optimum can be computed effectively.
Let t* denote the optimal value of (9a). Given t, if the convex feasibility problem

find M (10a)
s.t. tr(S_{ij} M) ≥ t · tr(S_k M), ∀ 1 ≤ i < j ≤ K, 1 ≤ k ≤ K, (10b)
tr(M) = d, (10c)
0 ⪯ M ⪯ I (10d)

is feasible, then t* ≥ t. Otherwise, if the above problem is infeasible, then t* < t.
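The bisection strategy just described can be sketched generically; the feasibility oracle stands in for solving problem (10) at a fixed t, and the function and parameter names are ours:

```python
def bisection_max(is_feasible, lo, hi, tol=1e-5):
    """Bisection search for the largest t in [lo, hi] such that the
    (convex) feasibility problem parameterised by t is feasible.
    `is_feasible(t)` is a user-supplied oracle, e.g. an SDP feasibility
    check; feasibility is assumed monotone (feasible for all t <= t*)."""
    while hi - lo > tol:
        t = 0.5 * (lo + hi)
        if is_feasible(t):
            lo = t   # t* >= t: raise the lower bound
        else:
            hi = t   # t* < t: lower the upper bound
    return lo
```

Each call of the oracle solves one convex feasibility problem, so the number of SDP solves grows only logarithmically with the required precision of t*.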
III-B Lagrangian Dual Formulation
Algorithm 1 shows that an SDP feasibility problem needs to be solved at each step of the bisection search. Considering that standard interior-point SDP solvers have a computational complexity of roughly O(m³ + m²D² + mD³) per iteration, where D is the dimension of the input data and m is the number of constraints in the SDP, they become quite expensive for high-dimensional data. In this subsection, we reformulate the feasibility problem (10) into a Frobenius-norm regularized SDP problem, which can be solved efficiently via its Lagrangian dual using first-order methods such as quasi-Newton. The computational complexity is thereby reduced to O(D³) per iteration. The primal solution can then be calculated from the dual solution based on the Karush–Kuhn–Tucker (KKT) [36, p. 243] conditions.
The problem (10) can be expressed equivalently in the following form:
find X (11a)
s.t. tr(S̃_{ij} X) ≥ t · tr(S̃_k X), ∀ 1 ≤ i < j ≤ K, 1 ≤ k ≤ K, (11b)
tr(M) = d, (11c)
M + Z = I, (11d)
X ⪰ 0, (11e)

where X = diag(M, Z), Z = I − M, and S̃_{ij}, S̃_k denote the scatter matrices zero-padded to the size of X.

In the above formulation, the variable M is replaced by the block-diagonal X = diag(M, Z), where Z = I − M. The constraints (11b) and (11c) correspond to (10b) and (10c), respectively. The constraint (11d) stems from the fact that 0 ⪯ M ⪯ I is equivalent to M ⪰ 0 and Z = I − M ⪰ 0.
We then introduce a Frobenius-norm regularized counterpart of (11):

min_X ½ ‖X‖²_F (12a)
s.t. tr(S̃_{ij} X) ≥ t · tr(S̃_k X), ∀ 1 ≤ i < j ≤ K, 1 ≤ k ≤ K, (12b)
tr(M) = d, (12c)
M + Z = I, (12d)
X ⪰ 0. (12e)

Proposition III.1

If problem (12) is feasible, its optimal solution is also a feasible solution to (11), and a solution to problem (10) can be obtained as well.
Problem (12) is a standard semidefinite least-squares problem and can readily be solved by off-the-shelf SDP solvers. However, as mentioned before, the computational cost is prohibitively high if we solve the primal problem directly by standard interior-point SDP solvers, which greatly hampers the use of WLDA in large-scale problems. Thanks to the Frobenius norm regularization in the objective function of (12), we can instead solve the problem easily through its Lagrangian dual.
Introducing Lagrange multipliers u ⩾ 0 for the inequality constraints (12b), v and w for the equality constraints (12c) and (12d), and a symmetric p.s.d. matrix for the p.s.d. constraint (12e), the following result can be obtained.

Proposition III.2

Writing λ = (u, v, w) and collecting the linear constraints (12b)–(12d) as A(X) ⩾ b (with equality rows for (12c) and (12d)), the Lagrangian dual of (12) simplifies to

max_λ − ½ ‖(A*(λ))₊‖²_F + λᵀ b, s.t. u ⩾ 0, (13)

where A* is the adjoint of A. The p.s.d. dual variable has been eliminated, and the optimal primal solution is recovered as X* = (A*(λ*))₊.

From the definition of X* and the operator (·)₊, X* is automatically p.s.d. and block-diagonal, so the optimal solution to problem (10) can be obtained easily. In addition, since the objective function of (13) is differentiable (but not necessarily twice differentiable), we can solve the dual problem efficiently using first-order methods, such as quasi-Newton methods.

The gradient of the objective function in problem (13) is

∇g(λ) = b − A(X̄₊), (14)

where X̄₊ = (A*(λ))₊ is computed from the eigendecomposition of A*(λ).
III-C Feasibility Condition
As stated in Proposition III.1, if (10) is feasible, a solution can be found by solving problem (12). While running the quasi-Newton algorithm on problem (12), an infeasibility condition for problem (10) needs to be checked at each iteration, which is presented here:
Proposition III.3

If the dual objective value of (13) exceeds a sufficiently large positive threshold η at some iterate λ, i.e.,

− ½ ‖(A*(λ))₊‖²_F + λᵀ b > η, (15)

then problem (10) is infeasible.

This infeasibility condition can be deduced from a general conic feasibility problem presented in [37]; detailed explanations are given in the appendix. We check this condition at each iteration of the quasi-Newton algorithm. The dual objective value is already evaluated when computing the callback function, so the check brings no extra computational cost. Once condition (15) is satisfied, problem (10) (equivalently (11)) is infeasible and the quasi-Newton algorithm is stopped. Otherwise, the quasi-Newton algorithm keeps running until convergence, and a feasible solution to problem (11) is found.
III-D Solving the Feasibility Problem
In this subsection, we summarize the procedure of solving problem (10) by our fast SDP optimization algorithm. It has been demonstrated that we can find a feasible solution to (10) by solving the dual problem (13) with quasi-Newton methods. In this work, we use L-BFGS-B [38, 39], a limited-memory quasi-Newton algorithm package, which can handle problems with box constraints. We only need to provide the callback function for L-BFGS-B, which calculates the objective function of (13) and its gradient. The procedure of finding the feasible solution is described in Algorithm 2.
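The same dual machinery can be illustrated end-to-end on a simpler, classical instance of a Frobenius-regularized SDP: the nearest-correlation-matrix problem min_{X ⪰ 0} ½‖X − C‖²_F s.t. diag(X) = 1. This is not the WLDA dual itself, and the function name is ours; the sketch only mirrors the structure described above: a smooth concave dual solved by L-BFGS-B (here via SciPy rather than the Fortran package), one eigendecomposition per objective/gradient evaluation, and primal recovery through the (·)₊ operator:

```python
import numpy as np
from scipy.optimize import minimize

def nearest_correlation(C):
    """Solve min_{X>=0} 0.5*||X - C||_F^2 s.t. diag(X) = 1 via its dual.
    With one multiplier u_i per diagonal constraint, the dual reduces to
    an unconstrained smooth problem; its gradient needs one symmetric
    eigendecomposition per evaluation (the callback passed to L-BFGS-B)."""
    n = C.shape[0]

    def neg_dual(u):
        # Y = C + Diag(u); the dual objective depends on Y through (Y)_+
        w, V = np.linalg.eigh(C + np.diag(u))
        wp = np.maximum(w, 0.0)
        obj = 0.5 * np.sum(wp ** 2) - np.sum(u)   # negated dual (minimized)
        X_plus = (V * wp) @ V.T                    # (C + Diag(u))_+
        grad = np.diag(X_plus) - 1.0               # negated dual gradient
        return obj, grad

    res = minimize(neg_dual, np.zeros(n), jac=True, method="L-BFGS-B")
    # primal recovery from the KKT conditions: X* = (C + Diag(u*))_+
    w, V = np.linalg.eigh(C + np.diag(res.x))
    return (V * np.maximum(w, 0.0)) @ V.T
```

The design point carried over to SDWLDA is that the dual has only as many variables as there are linear constraints, while the p.s.d. constraint is absorbed into the (·)₊ operator, so each L-BFGS-B iteration costs one eigendecomposition.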
III-E Computational Complexity Analysis
The computational complexity of L-BFGS-B is O(tm) per iteration, where t is a moderate number between 3 and 20 (the number of stored corrections) and m is the size of the problem solved by L-BFGS-B, which equals the number of constraints in the primal SDP problem (12). At each iteration of L-BFGS-B, an eigendecomposition is carried out to compute X̄₊, which is used to evaluate the dual objective value, the gradient, and the feasibility condition (15); its computational complexity is O(D³). Hence, the overall computational complexity of SDWLDA is O(tm + D³) per iteration. Since tm ≪ D³, the eigendecomposition dominates the computation of SDWLDA, which is O(D³). On the other hand, solving an SDP problem using standard interior-point methods requires roughly O(m³ + m²D² + mD³) per iteration. Since D is large in our case, our algorithm is much faster than interior-point methods and can be applied to large-scale problems.
IV. Experiments
In this section, experiments are performed to verify the performance of SDWLDA. We compare SDWLDA with other methods in terms of both classification performance and computational cost. Classification performance is compared among SDWLDA, LDA, LMNN and OLDA. We also compare the performance of SDWLDA with both optimization methods proposed by Zhang et al. in [5] (Zhang et al. (SDP) and Zhang et al. (SOCP), respectively), which serves to verify the correctness of our algorithm. Computational cost is compared among SDWLDA, standard interior-point algorithms applied to our SDP formulation (SDPT3 [40] and SeDuMi [41]), Zhang et al. (SDP), which also uses SDPT3, and Zhang et al. (SOCP).
All algorithms are tested on the same Intel CPU. The SDWLDA algorithm is implemented in Matlab, where the Fortran code of L-BFGS-B is employed to solve the dual problem (13). The Matlab routine “eig” is used to compute eigendecompositions. The tolerance settings of L-BFGS-B are left at their defaults. The tolerance in Algorithm 1 and the threshold in the feasibility condition (15) are set to small fixed values.
IV-A Experiments on UCI Datasets
We first use several UCI datasets [42]. For each dataset we perform random splits into training and test samples. Classification performance is evaluated with a nearest-neighbour (NN) classifier. For fair comparison with LDA, the final dimension is set to K − 1.
The experimental results are presented in Table I, where the baseline results are obtained by applying the NN classifier directly in the original feature space. For each dataset, the experiment is repeated over the random splits, and the error rate is reported as the mean error together with the standard deviation. The smallest classification error is shown in bold. The results illustrate that WLDA gives smaller classification error rates than the other algorithms on most datasets. The classification results of our fast SDP algorithm SDWLDA and of Zhang et al. (SDP) are quite similar, with small differences coming from numerical error during computation. The error rates obtained by Zhang et al. (SOCP) sometimes differ considerably from those of Zhang et al. (SDP) and SDWLDA, e.g., on the “Heart” and “Waveform” datasets, which results from the different relaxation methods and optimization procedures employed.
In terms of computational speed, the SDWLDA approach is much faster than the other methods. Zhang et al. (SDP), which is also solved by the standard interior-point algorithm SDPT3, is faster than Ours (SDPT3), because of the different SDP problem formulations. The computational merit of SDWLDA is even more dramatic for high-dimensional problems. For example, we compare the computational speeds of SDWLDA and the interior-point solvers on the “Iris” and “Waveform” datasets, which have the same number of classes. SDWLDA is considerably faster than both Zhang et al. (SDP) and Ours (SDPT3) on “Iris”, which has few training samples and a low input dimension, and the speedup grows markedly on “Waveform”, which has many more training samples and a higher input dimension. The computation time of SeDuMi increases even more significantly with the input dimension. Zhang et al. (SOCP) has no computational advantage on problems with few training samples and low dimensionality; its computational superiority appears as the dimensionality increases, as shown by the results on the “Waveform” dataset, where it is quicker than Zhang et al. (SDP). Because of its column-wise iterative solving procedure, the computational complexity of Zhang et al. (SOCP) is closely related to the final dimension. That is why Zhang et al. (SOCP) is quite fast on the “Sonar” and “Ionosphere” datasets, where the final dimension is 1. However, it is still slower than SDWLDA.
TABLE I: Dataset statistics (train/test sizes, number of classes, input and final dimensions), classification error rates (%, mean and standard deviation), and computation times on the Heart, Waveform, Iris, Balance, Sonar and Ionosphere datasets, for the Euclidean baseline, LDA, LMNN, OLDA, Zhang et al. (SDP), Zhang et al. (SOCP), Ours (SDPT3), Ours (SeDuMi), and SDWLDA. (Numeric entries not recovered.)
To demonstrate the robust classification performance of SDWLDA, we vary the ratio of training samples on the “Sonar” and “Ionosphere” datasets. For each ratio, we calculate the average test error and the standard deviation across repeated trials for SDWLDA and LDA respectively. The results in Fig. 1 demonstrate that SDWLDA is superior to LDA when the number of training samples is small. This illustrates that WLDA alleviates the dependence of classification performance on a large number of training samples.
LDA requires the data to be mapped to at most K − 1 dimensions, while SDWLDA, which is based on an SDP optimization method, has no such restriction. We therefore perform another experiment with SDWLDA on the “Heart” dataset, varying the final dimension. The results in Table II show that K − 1 is not the best final dimension for “Heart”, so the SDWLDA algorithm is more flexible.
TABLE II: Error rates (%, mean and standard deviation) of SDWLDA on the “Heart” dataset for different final dimensions. (Numeric entries not recovered.)
IV-B Experiments on Face, Object and Letter Datasets
As shown above, the SDWLDA algorithm can be used to solve large-scale problems thanks to its computational efficiency. In this section, experiments are carried out on face, object and letter image datasets, which have high input dimensions and more classes. The images are resized to different resolutions before the experiments, as shown in Table III. PCA is applied beforehand to reduce the original dimension and remove noise. The final dimension is still set to K − 1 for fair comparison with LDA.
IV-B.1 Face Recognition
Four face databases are employed here. ORL [43] consists of 400 face images of 40 individuals, each with 10 images. We randomly choose part of the samples for training and the remainder for testing. The Yale dataset [44] contains 165 greyscale images of 15 individuals, 11 images per subject; we split them into training and test sets by random sampling as well. The PIE dataset (Pose C29) [45] contains a number of images for each of its subjects, part of which are chosen randomly for training. The UMist dataset [46] contains 564 greyscale images of 20 different people; we use only a small portion of the samples for training to test the performance of SDWLDA.
The experimental results in Table III show that SDWLDA gives the best classification performance on all datasets. The classification error rates of SDWLDA and Zhang et al. (SDP) are identical to each other, as the PCA applied beforehand has already removed the noise, which confirms the correctness of our algorithm. However, SDWLDA is much faster than the Zhang et al. (SDP) method; for example, SDWLDA runs dramatically faster than Zhang et al. (SDP) on the Yale dataset. The error rates obtained by Zhang et al. (SOCP) are considerably larger, which results from the non-globally optimal solutions the algorithm reaches. The computational superiority of Zhang et al. (SOCP) does not show up here either.
To illustrate the computational speed of SDWLDA and the two methods in [5] with respect to the number of classes and the input data dimension (here, the dimension after PCA), more experiments are performed on the Yale dataset. First, we fix the dimension after PCA and vary the number of classes. The results in Fig. 2(1) demonstrate that, compared with the other methods, the speed of SDWLDA is less sensitive to the increase in the number of classes. Since the final dimension is set to c−1, the computational cost of Zhang et al. (SOCP) increases markedly with the number of classes. Second, we use all classes and vary the dimension after PCA. The results in Fig. 2(2) show that the computational cost of SDWLDA rises much more slowly with the input dimension than that of Zhang et al. (SDP). Zhang et al. (SOCP) becomes faster than Zhang et al. (SDP) when the input dimension exceeds 90, which confirms that Zhang et al. (SOCP) is more suitable for high-dimensional datasets than Zhang et al. (SDP), as presented in [5]; however, that method cannot guarantee a globally optimal solution. Finally, we test on the Yale dataset with an even higher input dimension. The results in Table IV demonstrate that SDWLDA is clearly faster than Zhang et al. (SDP), while both achieve similar classification error rates. Although Zhang et al. (SOCP) is comparable with SDWLDA in computational time, its error rate is relatively larger.
In addition, to show the superiority of SDWLDA in classification, another experiment is conducted with SDWLDA and LDA on the Yale dataset, varying the final dimension. The test results shown in Fig. 3 demonstrate that SDWLDA always gives a lower test error than LDA, which further confirms its good classification performance.
Table III. Summary of the datasets (ORL, Yale, PIE, UMist, Coil20, Coil30, ALOI, BA1, BA2): training/test split sizes, number of classes, input dimension, dimension after PCA, final dimension, and number of runs; classification error rates (%) with standard deviations for Euclidean, LDA, LMNN, OLDA, Zhang et al. (SDP), Zhang et al. (SOCP) and SDWLDA; and computational time (once) for Ours (SDPT3), Ours (SeDuMi), Zhang et al. (SDP), Zhang et al. (SOCP) and SDWLDA. (Numerical entries omitted.)
Table IV. Error rates (%) and computational time (once) of SDWLDA, Zhang et al. (SDP) and Zhang et al. (SOCP) on the Yale dataset with a high input dimension. (Numerical entries omitted.)
Table V. Test error rates (%) of SDWLDA and LDA on the ORL dataset for different numbers of classes. (Numerical entries omitted.)
Table VI. Test error rates (%) of SDWLDA and LDA on the Coil20 dataset for different numbers of classes. (Numerical entries omitted.)
IV-B2 Object recognition
Three datasets are used here: Coil20 [47], Coil30 [48], and ALOI [49]. The Coil20 dataset contains 1440 greyscale images of 20 objects on a black background, with 72 images per object. The Coil30 dataset consists of RGB images from the COIL-100 collection; we choose the first 30 objects and convert their images to greyscale in our experiment. The ALOI dataset consists of objects photographed under varied viewing angles, illumination angles, etc.; we use a subset of the objects here, with the same number of images per object. Different training/test splitting ratios are adopted for the different datasets in order to test performance under different situations; see Table III. The experimental results demonstrate that SDWLDA is superior in computational speed. Although OLDA produces the smallest classification error on Coil20, SDWLDA gives a comparable result.
IV-B3 Letter recognition
The Binary Alphadigits dataset [50] is employed here. BA1 contains the digits 0 to 9, and BA2 contains the capital letters A through Z. The experimental results in Table III show that SDWLDA produces better classification performance on both databases. Zhang et al. (SDP) and SDWLDA give similar classification results, while Zhang et al. (SOCP) leads to much larger error rates. In terms of computational speed, SDWLDA runs many times faster than Zhang et al. (SDP) on BA1, and the speedup is even larger on BA2, which has more training samples and classes. This experiment demonstrates again that SDWLDA is more efficient at processing large-scale datasets.
IV-B4 Classification performance with respect to the number of classes
It has been shown that SDWLDA can achieve smaller classification errors than LDA with a small training set. In this section, we evaluate the classification performance of SDWLDA and LDA with respect to the number of classes, taking the ORL and Coil20 datasets as examples. The experimental settings for each dataset are given in the captions of Tables V and VI, respectively. We choose different numbers of classes from each dataset and compare the test errors of SDWLDA and LDA. The results in Tables V and VI demonstrate that when the number of classes is small, SDWLDA and LDA have comparable classification results, whereas as the number of classes increases, SDWLDA shows better classification performance.
V Conclusion
In this work, an efficient SDP optimization algorithm has been proposed to solve the problem of worst-case linear discriminant analysis. WLDA takes into account the worst-case pairwise distances between and within classes, which leads to better classification performance than conventional LDA. To reduce the computational complexity so that the method can be applied to large-scale problems, a fast algorithm has been presented: a Frobenius-norm regularization is introduced so that the Lagrangian dual can be simplified, and the global optimum of each subproblem can be obtained efficiently. The algorithm is simple to implement and much faster than conventional interior-point SDP solvers. Experimental results on several UCI databases as well as face and object recognition tasks show the effectiveness in classification and the computational efficiency of SDWLDA.
References
 [1] N. Shental, T. Hertz, D. Weinshall, and M. Pavel, “Adjustment learning and relevant component analysis,” in Proc. Eur. Conf. Comp. Vis., 2002, pp. 776–790.
 [2] K. Q. Weinberger, J. Blitzer, and L. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Proc. Adv. Neural Inf. Process. Syst., vol. 18, 2006, pp. 1473–1481.
 [3] L.F. Chen, H.Y. M. Liao, M.T. Ko, J.C. Lin, and G.J. Yu, “A new LDA-based face recognition system which can solve the small sample size problem,” Pattern Recogn., vol. 33, no. 10, pp. 1713–1726, 2000.
 [4] J. Ye and B. Yu, “Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems,” J. Mach. Learn. Res., vol. 6, no. 4, pp. 483–502, 2005.
 [5] Y. Zhang and D.Y. Yeung, “Worstcase linear discriminant analysis,” in Proc. Adv. Neural Inf. Process. Syst., vol. 23, 2010, pp. 2568–2576.
 [6] Q. Gao, J. Liu, J. Zhang, J. Hou, and X. Yang, “Enhanced fisher discriminant criterion for image recognition,” Pattern Recogn., vol. 45, no. 10, pp. 3717–3724, 2012.
 [7] D. Tao, X. Li, X. Wu, and S. J. Maybank, “Geometric mean for subspace selection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 260–274, 2009.
 [8] W. Bian and D. Tao, “Harmonic mean for subspace selection,” in Proc. Int. Conf. Patt. Recogn., 2008, pp. 1–4.
 [9] ——, “Max-min distance analysis by using sequential SDP relaxation for dimension reduction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 1037–1050, 2011.
 [10] ——, “Asymptotic generalization bound of fisher’s linear discriminant analysis,” CoRR, vol. abs/1208.3030, 2012. [Online]. Available: http://arxiv.org/abs/1208.3030
 [11] C. Shen, J. Kim, and L. Wang, “A scalable dual approach to semidefinite metric learning,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2011, pp. 2601–2608.
 [12] J. Ye and T. Xiong, “Computational and theoretical analysis of null space and orthogonal linear discriminant analysis,” J. Machine Learning Research, vol. 7, pp. 1183–1204, 2006.
 [13] D. Tao, X. Li, X. Wu, and S. J. Maybank, “General averaged divergence analysis,” in Proc. of IEEE International Conference on Data Mining, 2007, pp. 302–311.
 [14] Y. Jia, F. Nie, and C. Zhang, “Trace ratio problem revisited,” IEEE Trans. Neural Netw., vol. 20, no. 4, pp. 729–735, 2009.
 [15] Y. Guo, S. Li, J. Yang, T. Shu, and L. Wu, “A generalized foley sammon transform based on generalized fisher discriminant criterion and its application to face recognition,” Pattern Recogn. Lett., vol. 24, no. 13, pp. 147–158, 2003.
 [16] H. Wang, S. Yan, D. Xu, X. Tang, and T. Huang, “Trace ratio vs. ratio trace for dimensionality reduction,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2007, pp. 1–8.
 [17] H. Shen, K. Diepold, and K. Hueper, “A geometric revisit to the trace quotient problem,” in Proc. of 19th International Symposium of Mathematical Theory of Networks and Systems, 2010.
 [18] E. Kokiopoulou, J. Chen, and Y. Saad, “Trace optimization and eigenproblems in dimension reduction methods,” Numerical Linear Algebra with Applications, vol. 18, no. 3, pp. 565–602, 2011.
 [19] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, “Distance metric learning with application to clustering with sideinformation,” in Proc. Adv. Neural Inf. Process. Syst., vol. 15, 2003, pp. 521–528.
 [20] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbour classification,” J. Mach. Learn. Res., vol. 10, pp. 207–244, 2009.
 [21] C. Shen, H. Li, and M. J. Brooks, “Supervised dimensionality reduction via sequential semidefinite programming,” Pattern Recogn., vol. 41, pp. 3644–3652, 2008.
 [22] P. Wang, C. Shen, and A. van den Hengel, “A fast semidefinite approach to solving binary quadratic problems,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013, pp. 1312–1319.
 [23] S. Boyd and L. Xiao, “Leastsquares covariance matrix adjustment,” SIAM J. Matrix Anal. Appl., vol. 27, no. 2, pp. 532–546, 2005.
 [24] J. Malick, “A dual approach to semidefinite leastsquares problems,” SIAM J. Matrix Anal. Appl., vol. 26, no. 1, pp. 272–284, 2004.
 [25] X.Y. Zhao, D. Sun, and K.C. Toh, “A newtoncg augmented lagrangian method for semidefinite programming,” SIAM J. Optim., vol. 20, no. 4, pp. 1737–1765, 2010.
 [26] Y. Gao and D. Sun, “Calibrating least squares covariance matrix problems with equality and inequality constraints,” SIAM J. Matrix Anal. Appl., vol. 31, pp. 1432–1457, 2009.
 [27] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM J. Optim., vol. 19, no. 4, pp. 1574–1609, 2009.
 [28] Z. Wen, D. Goldfarb, and W. Yin, “Alternating direction augmented lagrangian methods for semidefinite programming,” Mathematical Programming Computation, vol. 2, no. 34, pp. 203–230, 2010.
 [29] H. Ouyang, N. He, L. Tran, and A. Gray, “Stochastic alternating direction method of multipliers,” in Proc. Int. Conf. Mach. Learn., 2013, pp. 80–88.
 [30] J. Chen, T. Yang, and S. Zhu, “Efficient lowrank stochastic gradient descent methods for solving semidefinite programs,” in Proc. Int. Workshop Artificial Intell. & Statistics, 2014, pp. 122–130.
 [31] S. ShalevShwartz, Y. Singer, and A. Y. Ng, “Online and batch learning of pseudometrics,” in Proc. Int. Conf. Mach. Learn., 2004, pp. 743–750.
 [32] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Informationtheoretic metric learning,” in Proc. Int. Conf. Mach. Learn., 2007, pp. 209–216.
 [33] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, “Online metric learning and fast similarity search,” in Proc. Adv. Neural Inf. Process. Syst., vol. 8, 2008, pp. 761–768.
 [34] C. Shen, J. Kim, L. Wang, and A. van den Hengel, “Positive semidefinite metric learning using boostinglike algorithms,” J. Machine Learning Research, vol. 9, no. 1, pp. 1007–1036, 2012.
 [35] M. L. Overton and R. S. Womersley, “On the sum of the largest eigenvalues of a symmetric matrix,” SIAM J. Matrix Anal. Appl., vol. 13, no. 1, pp. 41–45, 1992.
 [36] S. P. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
 [37] D. Henrion and J. Malick, “Projection methods for conic feasibility problems, applications to polynomial sumofsquares decompositions,” Optim. Methods and Softw., vol. 26, no. 1, pp. 23–46, 2009.
 [38] C. Zhu, R. Byrd, P. Lu, and J. Nocedal, “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization,” ACM Transactions on Mathematical Software, vol. 23, pp. 550–560, 1997.
 [39] J. L. Morales and J. Nocedal, “Remark on ‘Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization’,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, p. 7, 2011.
 [40] K. C. Toh, M. Todd, and R. H. Tütüncü, “SDPT3—a MATLAB software package for semidefinite programming,” Optimizat. Methods & Softw., vol. 11, pp. 545–581, 1999.
 [41] J. F. Sturm, “Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones,” Optimizat. Methods & Softw., vol. 11, pp. 625–653, 1999.
 [42] A. Frank and A. Asuncion. (2010) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. [Online]. Available: http://archive.ics.uci.edu/ml
 [43] F. S. Samaria and A. C. Harter, “Parameterisation of a stochastic model for human face identification,” in Proc. of 2nd IEEE workshop on Applications of Computer Vision, 1994, pp. 138–142.
 [44] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 711–720, 1997.
 [45] D. Cai, X. He, Y. Hu, J. Han, and T. Huang. Popular face data sets in matlab format. [Online]. Available: http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html
 [46] D. B. Graham and N. M. Allinson, “Characterizing virtual eigensignatures for general purpose face recognition,” Face Recognition: From Theory to Applications; NATO ASI Series F, Computer and Systems Sciences, vol. 163, pp. 446–456, 1998.
 [47] S. A. Nene, S. K. Nayar, and H. Murase, “Columbia object image library (coil20),” in Technical Report CUCS00596, Feb 1996.
 [48] ——, “Columbia object image library (coil100),” in Technical Report CUCS00696, Feb 1996.
 [49] J. Geusebroek, G. Burghouts, and A. Smeulders, “The amsterdam library of object images,” Int’l J. Computer Vision, vol. 61, no. 1, pp. 103–112, 2005.
 [50] S. Roweis. Data for MATLAB hackers: Handwritten digits. University of Toronto, Department of Computer Science. [Online]. Available: http://www.cs.nyu.edu/~roweis/data.html
Appendix A.1 Proof of Proposition III.2
This section presents the derivation of Proposition III.2.

With the Lagrangian dual multipliers introduced for the constraints of the primal problem (12), the Lagrangian function can be written as
(A16) with the quantities defined as in (14).
By the KKT conditions for convex problems, stationarity of the Lagrangian holds at the primal and dual optimal variables. The relationship between the primal and dual optimal variables is therefore:
(A17) Substituting this expression back into the Lagrangian (A16), the dual problem is formulated as
(A18a) (A18b) With the remaining dual variables fixed, problem (A18) can be simplified to
(A19) which is equivalent to projecting a matrix onto the positive semidefinite cone. The closed-form solution to (A19) is:
(A20)
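The closed-form solution (A20) is the standard Euclidean projection onto the PSD cone: eigendecompose the symmetric matrix and zero out its negative eigenvalues. A minimal numpy sketch (the input matrix here is an arbitrary example):

```python
import numpy as np

def project_psd(A):
    """Euclidean projection of a symmetric matrix onto the PSD cone.

    Eigendecompose A = V diag(w) V^T and keep only the nonnegative
    eigenvalues: P(A) = V diag(max(w, 0)) V^T.
    """
    A = (A + A.T) / 2.0               # symmetrize against round-off
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T

A = np.array([[1.0, 2.0],
              [2.0, -3.0]])
P = project_psd(A)
print(np.all(np.linalg.eigvalsh(P) >= -1e-12))  # True: the projection is PSD
```

Since the projection leaves PSD matrices unchanged, `project_psd` is idempotent, which matches its interpretation as a Euclidean projection.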
Appendix A.2 Explanation of the Feasibility Condition
In this part, we briefly review a feasibility condition for the conic feasibility problem described in [37], and then extend it to our feasibility problem (10).
Consider the conic feasibility problem of finding a point such that
(A22)
where the problem data are given: the linear equality constraints define an affine subspace, and the conic constraint requires membership in a convex cone.
Define the polar cone of the convex cone as the set of points whose projection onto the cone is 0, i.e.,
(A23)
Henrion and Malick [37] proposed the following proposition.
Proposition A.2.1
If there exists a point such that the conditions
(A24)
are satisfied, then there is no feasible solution to (A22).
Based on this proposition, we can obtain a feasibility condition for our problem (10). In our problem, the cone is the set of positive semidefinite matrices, whose polar cone is the set of negative semidefinite matrices. Then the feasibility problem
(A25) 
gives a certificate of infeasibility for problem (11), which is equivalent to (10). That is to say, if there is a feasible solution to (A25), then there is no feasible solution to (10). We formalize this result in the following proposition.
Proposition A.2.2

(i) Suppose that there is a feasible solution to (A25), and take a feasible point of problem (11). The polar-cone condition implies that
(A26)
Observing that the feasible point satisfies the constraints in (11), the above inequality (A26) is equivalent to
(A27)
By the sign conditions on the dual variables, inequality (A27) cannot hold, which means there is no feasible solution to (11).
(ii) The stated condition, combined with the remaining constraints, implies that (A25) is feasible; therefore problem (11) is infeasible by (i).
The condition also admits an equivalent form which, for numerical reasons, is the one adopted in the feasibility condition (15).
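For the PSD cone, checking membership in the polar cone, the first ingredient of an infeasibility certificate such as (A25), reduces to an eigenvalue test, since the projection of a symmetric matrix onto the PSD cone is zero exactly when the matrix is negative semidefinite. A small sketch (the test matrices are illustrative; assembling the full certificate is problem specific):

```python
import numpy as np

def in_polar_cone(Y, tol=1e-10):
    """Membership test for the polar cone of the PSD cone.

    The polar cone of the PSD cone is the NSD cone: the projection of Y
    onto the PSD cone is zero iff all eigenvalues of Y are <= 0.
    """
    Y = (Y + Y.T) / 2.0               # symmetrize against round-off
    return bool(np.all(np.linalg.eigvalsh(Y) <= tol))

print(in_polar_cone(-np.eye(3)))                  # True: -I is NSD
print(in_polar_cone(np.diag([1.0, -1.0, 0.0])))   # False: has a positive eigenvalue
```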
Appendix A.3 Data Visualization
This section presents the experimental results on data visualization. The input data are projected onto a two-dimensional subspace using the linear transformation matrices learned by PCA, LDA, OLDA and the proposed SDWLDA algorithm. The data distributions after projection are shown in Fig. A4. Several datasets are evaluated: the Yale face dataset [44] with 5 classes, the ALOI object dataset [49] with 5 classes, the Coil20 object dataset [47] with 10 classes, and the Binary Alphadigits dataset [50] with images of three digits. As shown in Fig. A4, SDWLDA separates the data better than PCA, LDA and OLDA on these datasets. PCA (unsupervised) preserves the directions of largest variance, but much of the discriminant information is lost. LDA (supervised) measures scatter in an average sense, so a poorly separated pair of classes is likely to be concealed by other well-separated pairs; for example, in Fig. A4 (6), LDA separates most of the classes but fails on one pair. Because some classes are far apart, the LDA criterion cannot reveal that one pair of classes remains unseparated. OLDA removes the singularity restriction on the scatter matrices in LDA, but its scatter measures are still taken from the average viewpoint. In contrast, SDWLDA separates the data from the worst-case viewpoint, so the separation of every class pair is taken into account.