Worst-Case Linear Discriminant Analysis as Scalable Semidefinite Feasibility Problems
In this paper, we propose an efficient semidefinite programming (SDP) approach to worst-case linear discriminant analysis (WLDA). Compared with the traditional LDA, WLDA considers the dimensionality reduction problem from the worst-case viewpoint, which is in general more robust for classification. However, the original problem of WLDA is non-convex and difficult to optimize. In this paper, we reformulate the optimization problem of WLDA into a sequence of semidefinite feasibility problems. To efficiently solve the semidefinite feasibility problems, we design a new scalable optimization method with quasi-Newton methods and eigen-decomposition being the core components. The proposed method is orders of magnitude faster than standard interior-point based SDP solvers.
Experiments on a variety of classification problems demonstrate that our approach achieves better performance than standard LDA. Our method is also much faster and more scalable than standard interior-point SDP solvers based WLDA. The computational complexity for an SDP with constraints and matrices of size by is roughly reduced from to ( in our case).
Dimensionality reduction is a critical problem in machine learning, pattern recognition and computer vision, which for linear case learns a transformation matrix to project the input data to a lower-dimensional space such that the important structure or geometry of input data is preserved. It can help us to eliminate the inherent noise of data, and improve the classification performance. It can also decrease the computational complexities of subsequent machine learning tasks. There are two classical dimensionality reduction methods used widely in many applications, principal component analysis (PCA) and linear discriminant analysis (LDA). PCA is an unsupervised linear dimensionality reduction method, which seeks a subspace of the data that have the maximum variance and subsequently projects the input data onto it. PCA may not give good classification performance due to its unsupervised nature. LDA is in supervised fashion, which aims to maximize the average distance between each two class means and minimize the average distance between each two samples within the same class. However, it has some limitations: 1) For -class data, the target dimension of the projected subspace is restricted to be at most . In this sense, LDA is suboptimal and may cause so called class separation problem that LDA tends to merge classes which are close in the original space; 2) It assumes that conditional probability density functions are normally distributed, which does not hold in many cases; 3) The scatter matrices are required to be nonsingular.
There are a large number of extension works to LDA and PCA [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Among these methods, our focus in this paper is on the discrimination criterion of worst-case linear discriminant analysis (WLDA), which was proposed by Zhang et al. . Instead of using average between-class and within-class distances as LDA, WLDA considers scatter measures from the worst-case view, which arguably is more suitable for applications like classification. Specifically, WLDA tries to maximize the minimum of pairwise distances between class means, and minimize the maximum of within-class pairwise distances over all classes. Due to the complex formulation of its criterion, the problem of WLDA is difficult to optimize.
In , the problem of WLDA was firstly relaxed to a metric learning problem on , which can be solved by a sequence of SDP optimization procedures, where SDP problems are solved by standard interior-point methods (We denote it as Zhang et al. (SDP)). However, standard interior-point SDP solvers scale poorly to problems with high dimensionality, as the computational complexity is , where is the problem size, and is the number of constraints in SDP. An alternative optimization procedure was then proposed in  for large-scale WLDA problems, in which one column of the transformation matrix was found at each iteration while fixing the other directions. We denote this method as Zhang et al. (SOCP) since it was reformulated as a series of second-order cone programming (SOCP) problems lastly. Typically, this greedy method does not guarantee to find a globally optimal solution.
In this paper we propose a fast SDP approach to solve WLDA problem. The problem is converted to a sequence of SDP feasibility problems using bisection search strategy, which can find a globally optimal solution to the relaxed problem. More importantly, we adopt a novel approach to solve SDP feasibility problem at each iteration. Motivated by , a Frobenius-norm regularized SDP formulation is used, and its Lagrangian dual can be solved effectively by quasi-Newton methods. The computational complexity of this optimization method is dominated by eigen-decomposition at each iteration, which is . The proposed method is denoted as SD-WLDA. The main contributions of this work are: 1) By introducing an auxiliary variable, the original WLDA problem is reformulated and can be solved via a sequence of convex feasibility problems, by which the global optimum can be obtained for the relaxed metric learning problem. 2) By virtue of the use of Frobenius norm regularization, the optimization problem can be addressed by solving its Lagrange dual, where first-order methods such as quasi-Newton can be used. This approach is much faster than solving the corresponding primal problem using standard interior-point methods, and can be applied to large-scale problems. Next, we briefly review some relevant work.
Dimensionality reduction In order to overcome the drawbacks of LDA and improve the accuracy in classification, many extensions have been proposed, such as relevant component analysis (RCA) , neighborhood component analysis (NCA) , null space LDA (NLDA) , orthogonal LDA (OLDA) , Enhanced fisher discriminant criterion (EFDC) , Geometric mean-based subspace selection (GMSS) , Harmonic mean-based subspace selection (HMSS) , and Max-min distance analysis (MMDA) . Assuming dimensions with large within-class covariance are not relevant to subsequent classification tasks, RCA  assigns large weights to “relevant dimensions” and small weights to “irrelevant dimensions”, where the relevance is estimated using equivalence constraints. NCA  learns the transformation matrix directly by minimizing the expected leave-one-out classification error of -nearest neighbours on the transformed space. Because the objective function to be optimized is not convex, NCA tends to converge to a local optimum. NLDA , OLDA  and EFDC  were proposed to address the problem that standard LDA fails when scatter matrices are singular. NLDA maximizes the between-class distance in the null space of the within-class scatter matrix, while OLDA calculates a set of orthogonal discriminant vectors by diagonalizing the scatter matrices simultaneously. The resulting transformation matrices are both orthogonal for NLDA and OLDA, and they are equivalent to each other under a mild condition . EFDC incorporates the intra-class variation into the Fisher discriminant criterion, so that data from the same class can be mapped to a subspace where both the intraclass compactness and intraclass variation are well preserved. In this way, this method is robust to the intraclass variation and results in a good generalization capability. To avoid the class separation problem of LDA, Tao et al.  proposed a general averaged divergence analysis (GADA) framework, which presented a general mean function in place of the arithmetic mean used in LDA. By choosing different mean functions, several subspace selection algorithms have been developed. GMSS  investigates the effectiveness of the geometric mean-based subspace selection, which maximizes the geometric mean of Kullback-Leibler (KL) divergences between different class pairs. HMSS  maximizes the harmonic mean of the symmetric KL divergences between all class pairs. They adaptively give large weights to class pairs that are close to each other, and result in better class separation performance than LDA. Instead of assigning weights to class pairs, MMDA  directly maximizes the minimum pairwise distance of all class pairs in the low-dimensional subspace, which guarantees the separation of all class pairs. However, MMDA does not take into account of the within-class pairwise distances over all classes. Recently, Bian et al.  presented an asymptotic generalization analysis of LDA, which enriched the existing theory of LDA further. They showed that the generalization ability of LDA is mainly determined by the ratio of dimensionality to training sample size, where both feature dimensionality and training data size can be proportionally large.
Many dimensionality reduction algorithm such as PCA and LDA can be formulated into a trace ratio optimization problem . Guo et al.  presented a generalized Fisher discriminant criterion, which is essentially a trace ratio. They proposed a heuristic bisection way, which was proven to converge to the precise solution. Wang et al.  tackled the trace ratio problem directly by an efficient iterative procedure, where a trace difference problem was solved via the eigen-decomposition method in each step. Shen et al.  provided a geometric revisit to the trace ratio problem in the framework of optimization on the Grassmann manifold. Different from , they proposed another efficient algorithm, which employed only one step of the parallel Rayleigh quotient iteration at each iteration. Kokiopoulou et al.  also treated the dimensionality reduction problem as trace optimization problems, and gave an overview of the eigenvalue problems encountered in dimensionality reduction area. They made a comparition between nonlinear and linear methods for dimensionality reduction, including Locally Linear Embedding (LLE), Laplacean Eigenmaps, PCA, Locality Preserving Projections (LPP), LDA, etc., and showed that all the eigenvalue problems in explicit linear projections can be regarded as projected analogues of the so-called nonlinear projections.
Different from the aforementioned methods, WLDA considers the dimensionality reduction problem from a worst-case viewpoint. It maximizes the worst-case between-class scatter matrix and minimizes the worst-case within-class scatter matrix simultaneously, which can lead to more robust classification performance. The inner maximization and minimization over discrete variables make it different from the general trace ratio problem, and difficult to solve. The method of solving the general trace ratio problem cannot be extended here directly. Furthermore, different from the iterative algorithm for trace ratio optimization problem , we formulate the WLDA problem as a sequence of SDP problems, and propose an efficient SDP solving method. The eigen-decomposition we used is to solve the Lagrange dual gradient, which differs from that employed in solving the trace ratio optimization problem.
Solving large-scale SDP problems Instead of learning the transformation matrix , quadratic Mahalanobis distance metric learning methods (which are highly related to dimensionality reduction methods) optimize over , in order to obtain a convex relaxation. The transformation matrix can be recovered from the eigen-decomposition of . Because is positive semidefinite (p.s.d.) by definition, quadratic Mahalanobis metric learning methods optimizing on usually need to solve an SDP problem.
Xing et al.  formulated metric learning as a convex (SDP) optimization problem, and a globally optimal solution can be obtained. Weinberger et al.  presented a distance metric learning method, which optimizes a Mahalanobis metric such that the -nearest neighbours always belong to the same class while samples from different classes are separated by a large margin. In terms of SDP solver, they proposed an alternate projection method, where the learned metric is projected back onto the p.s.d. cone by eigen-decomposition at each iteration. MMDA  was solved approximately by a sequence of SDP problems using standard interior-point methods. Shen et al.  proposed a novel SDP based method for directly solving trace quotient problems for dimensionality reduction. With this method, globally-optimal solutions can be obtained for trace quotient problems.
As we can see, many aforementioned methods used standard interior-point SDP solvers, which are unfortunately computationally expensive (computational complexity is ) and scale poorly to large-scale problems. Thus an efficient SDP optimization approach is critical for large-scale metric learning problems.
There are many recent work to address large-scale SDP problems arising from distance metric learning and other computer vision tasks. Shen et al.  proposed a fast SDP approach for solving Mahalanobis metric learning problem. They introduced a Frobenius-norm regularization in the objective function of SDP problems, which leads to a much simpler Lagrangian dual problem: the objective function is continuously differentiable and p.s.d. constraints in the dual can be eliminated. L-BFGS-B was used to solve the dual, where a partial eigen-decomposition needed to be calculated at each iteration. Wang et al.  also employed a similar dual approach to solve binary quadratic problems for computer vision tasks, such as image segmentation, co-segmentation, image registration. SDP optimization method in [11, 22] can be seen as an extension of the works in [23, 24], which considered semidefinite least-squares problems. The key motivation of [23, 24] is that the objective function of the corresponding dual problem is continuously differentiable but not twice differentiable, therefore first-order methods can be applied. Malick  and Boyd and Xiao  proposed to use quasi-Newton methods and projected gradient methods respectively, to solve the Lagrangian dual of semidefinite least-squares problems. Semismooth Newton-CG methods  and smoothing Newton methods  are also exploited for semidefinite least-squares problems, which require much less number of iterations at the cost of higher computational complexity per iteration (full eigen-decomposition plus conjugate gradient).
Alternatively, stochastic (sub)gradient descent (SGD) methods  were also employed to solve SDP problems. Combining with alternating direction methods [28, 29], SGD can be used for SDP problems with inequality and equality constraints. The computational bottleneck of typical SGD is the projection of one infeasible point onto the p.s.d. cone at each iteration, which leads to the eigen-decomposition of a matrix. A number of methods have been proposed to speed up the projection operation at each iteration. Chen et al.  proposed a low-rank SGD method, in which rank- stochastic gradient is constructed and then the projection operation is simplified to compute at most eigenpairs. In the works of [31, 32, 33, 34], the distance metric is updated by rank-one matrices iteratively, and no eigen-decomposition or only one leading eigenpair is required. Note that SGD methods usually need more iterations to converge than the dual approaches based on quasi-Newton methods .
The most related work to ours may be Shen et al.’s . We use similar SDP optimization technique as that in . However, SDP feasibility problems are considered in our paper while the work in  focuses on standard SDP problems with linear objective functions.
Notation We use a bold lower-case letter to denote a column vector, and a bold capital letter to denote a matrix. is the transposition of . indicates the set of matrices. represents an identity matrix. indicates that the matrix is positive semidefinite. denotes the inner product of two matrices or vectors. indicates the trace of a matrix. is the Frobenius norm of a matrix. returns a diagonal matrix with the input elements on its main diagonal. Suppose that the eigen-decomposition of a symmetric matrix is , where is the orthonormal matrix of eigenvectors of , and are the corresponding eigenvalues, we define the positive and negative parts of respectively as
It is clear that holds.
Ii Worst-Case Linear Discriminant Analysis
We briefly review WLDA problem proposed by  firstly. Given a training set of samples (), which consists of classes , where class contains samples. As we mentioned before, the aim of linear dimensionality reduction is to find a transformation matrix with .
We define the within-class scatter matrix of the th class as
which is also the covariance matrix for the th class, and is the class mean of the th class . The between-class scatter matrix of the th and th classes is defined as
Unlike LDA which seeks to minimize the average within-class pairwise distance, the within-class scatter measure used in WLDA is defined as
which is the maximum of the within-class pairwise distances over all classes.
On the other hand, the between-class scatter measure used in WLDA is defined as
uses the minimum of the pairwise distances between class means, instead of the average of distances between each class mean and the sample mean employed in LDA.
The optimality criterion of WLDA is defined as to maximize the ratio of the between-class scatter measure to the within-class scatter measure:
Define sets , and . Then is the set of extreme points of , and is the convex hull of .
This theorem has been widely used and its proof can be found in . According to Theorem II.1, the orthonormal constraint on can be relaxed to convex constraints on , and the problem (7) can be relaxed to the following problem defined on a p.s.d. variable :
Once the optimal solution is obtained, the optimal for problem (7) can be recovered using the top eigenvectors of .
In , Zhang et al. proposed two methods to solve (7), as we stated in Section I. In the first one, an iterative algorithm was presented to solve the relaxed problem (8), where an SDP problem needs to be solved at each step by standard SDP solver. This method is not scalable to problems with high dimensionality or large training data points. The second one is based on a greedy approach, which cannot guarantee to find a globally optimal solution.
Hence, in the next section, we will describe our algorithm (so called SD-WLDA) of finding the transformation matrix that maximizes (8), and demonstrate how to solve it using an efficient approach.
Iii A Fast SDP Approach to WLDA
In this section, problem (8) is firstly reformulated into a sequence of SDP optimization problems based on bisection search. Then, a Frobenius norm regularization is introduced and the SDP problem in each step is solved through Lagrangian dual formulation. With this SD-WLDA method, the global optimum can be acquired for the relaxed problem (8). The computational complexity can be reduced as well by solving the dual problem using quasi-Newton methods, compared with solving the primal problem directly using interior-point based algorithm.
Iii-a Problem Reformulation
By introducing an auxiliary variable , problem (8) can be rewritten as
There are two variables to be optimized in problem (9), but we are interested in finding that can maximize . Problem (9) is clearly non-convex with respect to and since the constraint (9b) is not convex. However, noting that (9b) will become linear if is given, we employ the bisection search strategy and convert the optimization problem (9) into a set of convex feasibility problems, by which the global optimum can be computed effectively.
Let denote the optimal value of (9a). Given , if the convex feasibility problem
is feasible, then . Otherwise, if the above problem is infeasible, then .
Iii-B Lagrangian Dual Formulation
Algorithm 1 shows that an SDP feasibility problem needs to be solved at each step during the bisection search process. Considering that standard interior-point SDP solvers have a computational complexity of , where is the dimension of input data, and is the number of constraints in SDP, it becomes quite expensive for processing high-dimensional data. In this subsection, we reformulate the feasibility problem (10) into a Frobenius norm regularized SDP problem, which can be efficiently solved via its Lagrangian dual using first-order methods like quasi-Newton. The computational complexity will be reduced to . The primal solution can then be calculated from the dual solution based on Karush-Kuhn-Tucker (KKT) [36, p. 243] conditions.
The problem (10) can be expressed equivalently in the following form:
where , , and .
The problem (12) is a standard semidifinite least-square problem and can be solved readily by off-the-shelf SDP solvers. However, as we mentioned before, the computational complexity is really high if we solve the primal problem directly by standard interior-point SDP solvers. It will greatly hamper the use of WLDA in large-scale problems. Thanks to the Frobenius norm regularization in the objective function of (12), we can use Lagrangian dual approach to solve the problem easily.
From the definition of and the operator , is forced to be p.s.d. and block-diagonal, so the optimal solution to problem (10) can be acquired easily. In addition, it is noticed that the objective function of (13) is differentiable (but not necessarily to be twice differentiable), it allows us to solve the dual problem efficiently using first-order methods, such as quasi-Newton methods.
The gradient of the objective function in problem (13) is , where
Iii-C Feasibility Condition
As stated in Proposition III.1, if (10) is feasible, a solution can be found by solving the problem (12). During running quasi-Newton algorithms to solve the problem (12), an infeasibility condition of the problem (10) needs to be checked iteratively, which is presented here:
If the following conditions are satisfied
then the problem (10) is considered as infeasible.
This infeasibility condition can be deduced from a general conic feasibility problem presented in . Explanations are presented in detail in the appendix. We check this condition at each iteration of quasi-Newton algorithms. is evaluated during calculating the dual objective function, so it will not bring extra computational cost. Once the condition (15) is satisfied, problem (10) (equivalently (11)) is not feasible and quasi-Newton algorithms will be stopped. Otherwise, quasi-Newton algorithms keep running until convergence, and then a feasible solution to the problem (11) is found.
Iii-D Solving the Feasibility Problem
In this subsection, we summarize the procedure of solving the problem (10) by our fast SDP optimization algorithm. It has been domenstrated that we can find the feasible solution to (10) by solving the dual problem (13) with quasi-Newton methods. In this work, we use L-BFGS-B [38, 39], a limited-memory quasi-Newton algorithm package, which can handle the problem with box constraints. Here we only need to provide the callback function to L-BFGS-B, which calculates the objective function of (13) and its gradient. The procedure of finding the feasible solution is described in Algorithm 2.
Iii-E Computational Complexity Analysis
The computational complexity of L-BFGS-B is , where is a moderate number between 3 to 20, is the problem size to be solved by L-BFGS-B, which is equal to the number of constraints in the primal SDP problem (12). At each iteration of L-BFGS-B, the eigen-decomposition of a matrix is carried out to compute , which is used to evaluate all the dual objective values, gradients, as well as the feasibility conditions (15). The computational complexity is . Hence, the overall computational complexity of our algorithm SD-WLDA is . Since , eigen-decomposition dominates the most computational time of SD-WLDA, which is . On the other hand, solving an SDP problem using standard interior-point methods needs a computational complexity of . Since in our case, our algorithm is much faster than interior-point methods, and can be used to large-scale problems.
In this section, experiments are performed to verify the performance of SD-WLDA. We conduct comparisons between SD-WLDA and other methods on both classification performance and computational complexity. The classification performance is contrasted between SD-WLDA and LDA, LMNN, OLDA. We also compare the performance of our SD-WLDA with both optimization methods proposed by Zhang et al. in  (Zhang et al. (SDP) and Zhang et al. (SOCP) receptively). This can be used to verify the correctness of our algorithm. The computational complexity is compared between SD-WLDA, standard interior-point algorithms to solve our SDP formulation (SDPT3  and SeDuMi ), Zhang et al. (SDP) which uses SDPT3 as well, and Zhang et al. (SOCP).
All algorithms are tested on a GHz Intel CPU with G memory. The SD-WLDA algorithm is implemented in Matlab, where the Fortran code of L-BFGS-B is employed to solve the dual problem (13). The Matlab routine “eig” is used to compute eigen-decomposition. The tolerance setting of L-BFGS-B is set to default. The tolerance in Algorithm 1 is set to , and the parameter in the feasibility condition (15) is set to .
Iv-a Experiments on UCI Datasets
Some UCI datasets  are used here firstly. We perform random splits for each dataset, with as training samples and as test samples. The classification performance is evaluated based on nearest neighbour (-NN) classifier. For fair comparison with LDA, the final dimension is set to .
The experimental results are presented in Table I, where the baseline results are obtained by applying -NN classifier on the original feature space directly. For each dataset, the experiment runs times, and the error rate is reported by the mean error as well as the standard deviation. The smallest classification error is shown in bold. The results illustrate that WLDA gives smaller classification error rates compared to other algorithms in most datasets. The classification results by our fast SDP solving algorithm SD-WLDA and Zhang et al. (SDP) are quite similar, with small difference coming from numerical error during computation. The error rates calculated by Zhang et al. (SOCP) are sometimes quite different from that by Zhang et al. (SDP) and SD-WLDA, e.g., on “Heart” and “Waveform” datasets, which results from different relaxation methods and optimization procedures employed.
In terms of computational speed, SD-WLDA approach is much faster than other methods. Zhang et al. (SDP), which is also solved by standard interior-point algorithm SDPT3, is faster than Ours (SDPT3), because of different SDP problem formulations. The merit of SD-WLDA on computation is even more dramatic for high dimensional problems. For example, we compare the computational speeds of SD-WLDA and SDPT3 on the datasets of “Iris” and “Waveform”, which have the same number of classes. SD-WLDA is about times faster than Zhang et al. (SDP), and times faster than Ours (SDPT3) on “Iris” which has training samples with the input dimension as , whereas it becomes times quicker than Zhang et al. (SDP), and times quicker than Ours (SDPT3) on “Waveform” which has training samples with the input dimension as . The computational time increases more significantly for SeDuMi with respect to input dimension. Zhang et al. (SOCP) has no computational advantage on solving problems with few training samples and low dimensionality. The computational superiority appears when dimensionality increases, referring to the results on “Waveform” dataset, which is quicker than Zhang et al. (SDP). Because of the column-wise iteration solving method of Zhang et al. (SOCP), the computational complexity of Zhang et al. (SOCP) relates closely with the final dimension. That is why Zhang et al. (SOCP) is quite fast on “Sonar” and “Ionosphere” datasets, which set the final dimension to . However, it still slower than SD-WLDA.
|Error Rates (%)|
|Zhang et al. (SDP)||()||()||()||()||()||()|
|Zhang et al. (SOCP)||()||()||()||()||()||()|
|Zhang et al. (SDP)||s||s||s||s||s||s|
|Zhang et al. (SOCP)||s||s||s||s||s||s|
To prove the robust classification performance of SD-WLDA, we change the ratio of training samples from to on datasets “Sonar” and “Ionosphere”. For each value of , we calculate the average test error as well as the standard deviation across trials by SD-WLDA and LDA respectively. The results in Fig. 1 demonstrate that SD-WLDA is more superior than LDA when there is small number of training samples. This phenomenon illustrates that WLDA alleviates the dependence of classification performance on large number of training samples.
LDA requires the data to map to at most dimension, while SD-WLDA, which is based on an SDP optimization method, does not have such a restriction. Here we perform another experiment by SD-WLDA on “Heart” dataset, with different final dimensions. The result in Table II shows that is not the best final dimension for “Heart”. So SD-WLDA algorithm is more flexible.
Iv-B Experiments on Face, Object and Letter Datasets
As shown before, SD-WLDA algorithm can be used to solve large-scale problems due to its efficiency on computation. In this section, experiments are carried out on face, object and letter image datasets, which have high input dimension and more classes. The images are resized to different resolution before experiments, as shown in Table III. PCA has been applied beforehand to reduce the original dimension and also to remove noises. The final dimension is still set to be for fair comparison to LDA.
Iv-B1 Face recognition
four face databases are employed here. ORL  consists of face images of individuals, each with images. We randomly choose of the samples for training and the remaining for test. The Yale dataset  contains 165 grey-scale images of individuals, images per subject. We split them into training and test sets by sampling as well. PIE dataset (Pose C29)  has subjects, and images for each individual. of the samples are chosen randomly for training. UMist dataset  contains 564 grayscale images of different people. We only use of the samples for training to test the performance of SD-WLDA.
Experimental results in Table III show that SD-WLDA gives better classification performance for all datasets. The classification error rates by SD-WLDA and Zhang et al. (SDP) are identical with each other, as PCA used before has already removed the noises, which proves the correctness of our algorithm. However, SD-WLDA is much faster than Zhang et al. (SDP) method. For example, SD-WLDA runs almost times faster than Zhang et al. (SDP) on Yale dataset. The error rates calculated by Zhang et al. (SOCP) are rather larger, which result from the non-globally optimal solution the algorithm reached. The computational superiority of Zhang et al. (SOCP) does not show up as well.
In order to illustrate the computational speed of SD-WLDA and both methods in  with respect to the number of classes and the input data dimension (here it refers to the dimension after PCA) respectively, more experiments are performed on Yale dataset. Firstly, we set the dimension after PCA to be , and change the number of classes from to . Experimental results in Fig. 2(1) demonstrate that compared with the other methods, the speed of SD-WLDA is less sensitive to the increase of the amount of classes. Since the final dimension is set to be , the computational complexity of Zhang et al. (SOCP) jumps up obviously with the increase of classes. Secondly, we use all classes, and let the dimension after PCA change from to . Experimental results in Fig. 2(2) show that the computational cost of SD-WLDA rises up pretty slower with the growth of input dimension, in contrast to Zhang et al. (SDP). Zhang et al. (SOCP) becomes faster than Zhang et al. (SDP) when input dimension is larger than 90. This phenomenon certifies that Zhang et al. (SOCP) is more suitable for processing high dimensional datasets than Zhang et al. (SDP), as  presented. However, this method cannot guarantee to find a global optimal solution. Finally, we test on Yale dataset with an even high input dimension as . Experimental results in Table IV demonstrate that SD-WLDA is absolutely faster than Zhang et al. (SDP), both of which lead to similar classification error rates. Although Zhang et al. (SOCP) is comparable with SD-WLDA on computational time, the error rate it obtained is relatively bigger.
In addition, to show the superiority of SD-WLDA on classification, another experiment is conducted using SD-WLDA and LDA on Yale dataset, with the dimension after PCA as and classes. We reduce the final dimension from to . The test results shown in Fig. 3 demonstrate that SD-WLDA always gives lower test error than LDA, which further proves the good classification performance of SD-WLDA.
Dim. after PCA
|Error Rates (%)|
Zhang et al. (SDP)
Zhang et al. (SOCP)
|Computational Time (Once)|
Zhang et al. (SDP)
Zhang et al. (SOCP)
|Error Rates (%)||Computational Time (Once)|
|Zhang et al. (SDP)||()||hm|
|Zhang et al. (SOCP)||()||ms|
Iv-B2 Object recognition
Three datasets are used here: Coil20 , Coil30 , and ALOI . Coil20 dataset contains grey-scale images with black background for objects, with each containing different images. Coil30 dataset consists of RGB images for objects. We choose the first objects and convert them into greyscale images in our experiment. ALOI dataset consists of objects taken at varied viewing angles, illumination angles, etc.. We use the first objects here, with images for every object. Different training and test splitting ratios are adopted for different datasets in order to test the performance under different situations, which can be found in Table III. The experimental results demonstrate that SD-WLDA is better in computational speed. Although OLDA produces the smallest classification error on Coil20 (), SD-WLDA gives a comparable result ().
Iv-B3 Letter recognition
The Binary Alphadigits dateset  is employed here. The dataset BA1 contains digits of to , and BA2 contains capital letters through . Experimental results in Table III present that WLDA produces better classification performance on both databases. Zhang et al. (SDP) and SD-WLDA give similar classification results, while Zhang et al. (SOCP) lead to much larger error rates. In terms of computational speed, SD-WLDA runs times faster than Zhang et al. (SDP) on BA1, and more than times faster than Zhang et al. (SDP) on BA2, which has more training samples and number of classes. This experiment demonstrates again that SD-WLDA is more efficient in processing large-scale datasets.
Iv-B4 Classification performance regarding to number of classes
It has been validated that SD-WLDA can give smaller classification errors than LDA using a small size of training set. In this section, we evaluate the classification performance of SD-WLDA and LDA with respect to the number of classes . Take the datasets ORL and Coil20 as examples. The experimental settings for each dataset are shown in the captions of Table V and VI respectively. We choose different numbers of classes from each dataset, and compare the test errors of SD-WLDA and LDA. The results in Tables V and VI demonstrate that when the number of classes is small, SD-WLDA and LDA have comparable classification results, whereas when the number of classes increases, SD-WLDA shows better classification performance.
In this work, an efficient SDP optimization algorithm has been proposed to solve the problem of worst-case linear discriminant analysis. WLDA takes into account the worst-case pairwise distance between and within classes, which achieves better classification performance than conventional LDA. In order to reduce the computational complexity so that it can be applied to large-scale problems, a fast algorithm has been presented by introducing the Frobenius norm regularization, and its Lagrangian dual can be simplified. Using our algorithm, the global optimum can be obtained in time. The algorithm is simple to implement and much faster than conventional SDP solvers. Experimental results on some UCI databases as well as face and object recognition tasks show the effectiveness on classification performance and the efficiency on computation of SD-WLDA.
-  N. Shental, T. Hertz, D. Weinshall, and M. Pavel, “Adjustment learning and relevant component analysis,” in Proc. Eur. Conf. Comp. Vis., 2002, pp. 776–790.
-  K. Q. Weinberger, J. Blitzer, and L. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Proc. Adv. Neural Inf. Process. Syst., vol. 18, 2006, pp. 1473–1481.
-  L.-F. Chen, H.-Y. M. Liao, M.-T. Ko, J.-C. Lin, and G.-J. Yu, “A new lda-based face recognition system which can solve the small sample size problem,” Pattern Recogn., vol. 33, no. 10, pp. 1713–1726, 2000.
-  J. Ye and B. Yu, “Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems,” J. Mach. Learn. Res., vol. 6, no. 4, pp. 483–502, 2005.
-  Y. Zhang and D.-Y. Yeung, “Worst-case linear discriminant analysis,” in Proc. Adv. Neural Inf. Process. Syst., vol. 23, 2010, pp. 2568–2576.
-  Q. Gao, J. Liu, J. Zhang, J. Hou, and X. Yang, “Enhanced fisher discriminant criterion for image recognition,” Pattern Recogn., vol. 45, no. 10, pp. 3717–3724, 2012.
-  D. Tao, X. Li, X. Wu, and S. J. Maybank, “Geometric mean for subspace selection,” IEEE Trans. Anal. Mach. Intell., vol. 31, no. 2, pp. 260–274, 2009.
-  W. Bian and D. Tao, “Harmonic mean for subspace selection,” in in Proceedings of the 19th International Conference on Pattern Recognition, 2008, pp. 1–4.
-  ——, “Max-min distance analysis by using sequential sdp relaxation for dimension reduction,” IEEE Trans. Anal. Mach. Intell., vol. 33, no. 5, pp. 1037–1050, 2011.
-  ——, “Asymptotic generalization bound of fisher’s linear discriminant analysis,” CoRR, vol. abs/1208.3030, 2012. [Online]. Available: http://arxiv.org/abs/1208.3030
-  C. Shen, J. Kim, and L. Wang, “A scalable dual approach to semidefinite metric learning,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2011, pp. 2601–2608.
-  J. Ye and T. Xiong, “Computational and theoretical analysis of null space and orthogonal linear discriminant analysis,” J. Machine Learning Research, vol. 7, pp. 1183–1204, 2006.
-  D. Tao, X. Li, X. Wu, and S. J. Maybank, “General averaged divergence analysis,” in Proc. of IEEE International Conference on Data Mining, 2007, pp. 302–311.
-  Y. Jia, F. Nie, and C. Zhang, “Trace ratio problem revisited,” IEEE Trans. Neural Netw., vol. 20, no. 4, pp. 729–735, 2009.
-  Y. Guo, S. Li, J. Yang, T. Shu, and L. Wu, “A generalized foley- sammon transform based on generalized fisher discriminant criterion and its application to face recognition,” Pattern Recogn. Lett., vol. 24, no. 1-3, pp. 147–158, 2003.
-  H. Wang, S. Yan, D. Xu, X. Tang, and T. Huang, “Trace ratio vs. ratio trace for dimensionality reduction,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2007, pp. 1–8.
-  H. Shen, K.Diepold, and K. Hueper, “A geometric revisit to the trace quotient problem,” in Proc. of 19th International Symposium of Mathematical Theory of Networks and Systems, 2010.
-  E. Kokiopoulou, J. Chen, and Y. Saad, “Trace optimization and eigenproblems in dimension reduction methods,” Numerical Linear Algebra with Applications, vol. 18, no. 3, pp. 565–602, 2011.
-  E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, “Distance metric learning with application to clustering with side-information,” in Proc. Adv. Neural Inf. Process. Syst., vol. 15, 2003, pp. 521–528.
-  K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbour classification,” J. Mach. Learn. Res., vol. 10, pp. 207–244, 2009.
-  C. Shen, H. Li, and M. J. Brooks, “Supervised dimensionality reduction via sequential semidefinite programming,” Pattern Recogn., vol. 41, pp. 3644–3652, 2008.
-  P. Wang, C. Shen, and A. van den Hengel, “A fast semidefinite approach to solving binary quadratic problems,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013, pp. 1312–1319.
-  S. Boyd and L. Xiao, “Least-squares covariance matrix adjustment,” SIAM J. Matrix Anal. Appl., vol. 27, no. 2, pp. 532–546, 2005.
-  J. Malick, “A dual approach to semidefinite least-squares problems,” SIAM J. Matrix Anal. Appl., vol. 26, no. 1, pp. 272–284, 2004.
-  X.-Y. Zhao, D. Sun, and K.-C. Toh, “A newton-cg augmented lagrangian method for semidefinite programming,” SIAM J. Optim., vol. 20, no. 4, pp. 1737–1765, 2010.
-  Y. Gao and D. Sun, “Calibrating least squares covariance matrix problems with equality and inequality constraints,” SIAM J. Matrix Anal. Appl., vol. 31, pp. 1432–1457, 2009.
-  A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM J. Optim., vol. 19, no. 4, pp. 1574–1609, 2009.
-  Z. Wen, D. Goldfarb, and W. Yin, “Alternating direction augmented lagrangian methods for semidefinite programming,” Mathematical Programming Computation, vol. 2, no. 3-4, pp. 203–230, 2010.
-  H. Ouyang, N. He, L. Tran, and A. Gray, “Stochastic alternating direction method of multipliers,” in Proc. Int. Conf. Mach. Learn., 2013, pp. 80–88.
-  J. Chen, T. Yang, and S. Zhu, “Efficient low-rank stochastic gradient descent methods for solving semidefinite programs,” in Proc. Int. Workshop Artificial Intell. & Statistics, 2014, pp. 122–130.
-  S. Shalev-Shwartz, Y. Singer, and A. Y. Ng, “Online and batch learning of pseudo-metrics,” in Proc. Int. Conf. Mach. Learn., 2004, pp. 743–750.
-  J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in Proc. Int. Conf. Mach. Learn., 2007, pp. 209–216.
-  P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, “Online metric learning and fast similarity search,” in Proc. Adv. Neural Inf. Process. Syst., vol. 8, 2008, pp. 761–768.
-  C. Shen, J. Kim, L. Wang, and A. van den Hengel, “Positive semidefinite metric learning using boosting-like algorithms,” J. Machine Learning Research, vol. 9, no. 1, pp. 1007–1036, 2012.
-  M. L. Overton and R. S. Womersley, “On the sum of the largest eigenvalues of a symmetric matrix,” SIAM J. Matrix Anal. Appl., vol. 13, no. 1, pp. 41–45, 1992.
-  S. P. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
-  D. Henrion and J. Malick, “Projection methods for conic feasibility problems, applications to polynomial sum-of-squares decompositions,” Optim. Methods and Softw., vol. 26, no. 1, pp. 23–46, 2009.
-  C. Zhu, R. Byrd, P. Lu, and J. Nocedal, “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization,” ACM Transaction on Mathematical Software, vol. 23, pp. 550–560, 1997.
-  J. L. Morales and J. Nocedal, “Remark on âalgorithm 778: L-bfgs-b: Fortran subroutines for large-scale bound constrained optimizationâ,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, p. 7, 2011.
-  K. C. Toh, M. Todd, and R. H. TÃ¼tÃ¼ncÃ¼, “SDPT3—a MATLAB software package for semidefinite programming,” Optimizat. Methods & Softw., vol. 11, pp. 545–581, 1999.
-  J. F. Sturm, “Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones,” Optimizat. Methods & Softw., vol. 11, pp. 625–653, 1999.
-  A. Frank and A. Asuncion. (2010) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. [Online]. Available: http://archive.ics.uci.edu/ml
-  F. S. Samaria and A. C. Harter, “Parameterisation of a stochastic model for human face identification,” in Proc. of 2nd IEEE workshop on Applications of Computer Vision, 1994, pp. 138–142.
-  P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” IEEE Trans. Anal. Mach. Intell., vol. 19, no. 7, pp. 711–720, 1997.
-  D. Cai, X. He, Y. Hu, J. Han, and T. Huang. Popular face data sets in matlab format. [Online]. Available: http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html
-  D. B. Graham and N. M. Allinson, “Characterizing virtual eigensignatures for general purpose face recognition,” Face Recognition: From Theory to Applications; NATO ASI Series F, Computer and Systems Sciences, vol. 163, pp. 446–456, 1998.
-  S. A. Nene, S. K. Nayar, and H. Murase, “Columbia object image library (coil-20),” in Technical Report CUCS-005-96, Feb 1996.
-  ——, “Columbia object image library (coil-100),” in Technical Report CUCS-006-96, Feb 1996.
-  J. Geusebroek, G. Burghouts, and A. Smeulders, “The amsterdam library of object images,” Int’l J. Computer Vision, vol. 61, no. 1, pp. 103–112, 2005.
-  S.Roweis. Data for MATLAB hankers: Handwritten digits. University of Toronto, Department of Computer Science. [Online]. Available: http://www.cs.nyu.edu/~roweis/data.html
Appendix A.1 Proof of the Proposition iii.2
This section presents the derivation of Proposition III.2.
With the Lagrangian dual multipliers , , and , the Lagrangian function of the primal problem (12) can be written as
with , and defined as (14).
Based on KKT conditions for convex problems, we have at the , where , , , and stand for primal and dual optimal variables, respectively. Then the relationship between primal and dual optimal variables is:
Substituting the expression for back into the Lagrangian (A16), the dual problem is formulated as
Given fixed , , , problem (A18) can be simplified to
which is equivalent to projecting the matrix to the positive semidefinite cone. The closed-form solution to (A19) is:
Appendix A.2 Explanation on the Feasibility Condition
Consider a conic feasibility problem of finding a point such that
where and are given. defines an affine subspace , and is a convex cone.
Defining the polar cone of as the set of points whose projection into is 0, i.e.,
Henrion and Malick  proposed the following proposition.
If there exists a point such that the following conditions
are satisfied, there would be no feasible solution to (A22).
Based on this proposition, we can get a feasible condition to our problem (10). In our problem, is the set of positive semidefinite matrices, and the polar cone of is the set of negative semidefinite matrices. Then the feasibility problem
gives the certificate of infeasibility of problem (11) which is equivalent to (10). That is to say, if there is a feasible solution to (A25), there is no feasible solution to (10). We formalize this result in the following remark.
is also equivalent to . Due to numerical reasons, the latter is adopted in the feasibility condition (15).
Appendix A.3 Data Visualization
This section presents the experimental results of data visualization. The input data is projected to two dimensional subspace using the linear transformation matrix learned by PCA, LDA, OLDA and the proposed algorithm SD-WLDA. The data distributions after projection are shown in Fig. A4. Several datasets are evaluated: Yale face dataset  with 5 classes (th to th) adopted, ALOI object dataset  with 5 classes (th to th) used, Coil20 object dataset  with 10 classes (st to th) employed, and the Binary Alphadigits dateset  with images of digits , and used. As shown in Fig. A4, SD-WLDA separates data better than PCA, LDA and OLDA on those datasets. PCA (unsupervised) preserves directions with the largest variance but much of the discriminant information is lost. LDA (supervised) considers the scatter measure in the average view, so the poor separations of two classes are probably to be concealed by other good separations. For example, in Fig. A4 (6), LDA separates most of classes, but fails to separate one pair. Because some classes are separated far from each other, the LDA criterion cannot demonstrate the fact that one pair of classes have not been separated yet. OLDA solved the nonsingularity limitation of scatter matrices in LDA, however, the scatter measures are still from the average viewpoint. Alternatively, SD-WLDA tries to separate data from the worst-case viewpoint, so the separation of every class-pair is taken into account.