Optimal Solutions for Sparse Principal Component Analysis

# Optimal Solutions for Sparse Principal Component Analysis

\nameAlexandre d’Aspremont \emailaspremon@Princeton.edu
Princeton, NJ 08544, USA.
\AND\nameFrancis Bach \emailfrancis.bach@mines.org
Département d’Informatique, Ecole Normale Supérieure
45, rue d’Ulm, 75230 Paris, France
\AND\nameLaurent El Ghaoui \emailelghaoui@eecs.Berkeley.edu
Berkeley, CA 94720, USA.
###### Abstract

Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity , where is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases.

Optimal Solutions for Sparse Principal Component Analysis Alexandre d’Aspremont aspremon@Princeton.edu
ORFE, Princeton University,
Princeton, NJ 08544, USA.
Francis Bach francis.bach@mines.org
INRIA - Willow project
Département d’Informatique, Ecole Normale Supérieure
45, rue d’Ulm, 75230 Paris, France
Laurent El Ghaoui elghaoui@eecs.Berkeley.edu
EECS Department, U.C. Berkeley,
Berkeley, CA 94720, USA.

Editor:

Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso.

## 1 Introduction

Principal component analysis (PCA) is a classic tool for data analysis, visualization or compression and has a wide range of applications throughout science and engineering. Starting from a multivariate data set, PCA finds linear combinations of the variables called principal components, corresponding to orthogonal directions maximizing variance in the data. Numerically, a full PCA involves a singular value decomposition of the data matrix.

One of the key shortcomings of PCA is that the factors are linear combinations of all original variables; that is, most of factor coefficients (or loadings) are non-zero. This means that while PCA facilitates model interpretation and visualization by concentrating the information in a few factors, the factors themselves are still constructed using all variables, hence are often hard to interpret.

In many applications, the coordinate axes involved in the factors have a direct physical interpretation. In financial or biological applications, each axis might correspond to a specific asset or gene. In problems such as these, it is natural to seek a trade-off between the two goals of statistical fidelity (explaining most of the variance in the data) and interpretability (making sure that the factors involve only a few coordinate axes). Solutions that have only a few nonzero coefficients in the principal components are usually easier to interpret. Moreover, in some applications, nonzero coefficients have a direct cost (e.g., transaction costs in finance) hence there may be a direct trade-off between statistical fidelity and practicality. Our aim here is to efficiently derive sparse principal components, i.e, a set of sparse vectors that explain a maximum amount of variance. Our belief is that in many applications, the decrease in statistical fidelity required to obtain sparse factors is small and relatively benign.

In what follows, we will focus on the problem of finding sparse factors which explain a maximum amount of variance, which can be written:

 max∥z∥≤1zTΣz−ρCard(z) (1)

in the variable , where is the (symmetric positive semi-definite) sample covariance matrix, is a parameter controlling sparsity, and denotes the cardinal (or norm) of , i.e. the number of non zero coefficients of .

While PCA is numerically easy, each factor requires computing a leading eigenvector, which can be done in , sparse PCA is a hard combinatorial problem. In fact, Moghaddam et al. (2006b) show that the subset selection problem for ordinary least squares, which is NP-hard (Natarajan, 1995), can be reduced to a sparse generalized eigenvalue problem, of which sparse PCA is a particular intance. Sometimes ad hoc “rotation” techniques are used to post-process the results from PCA and find interpretable directions underlying a particular subspace (see Jolliffe (1995)). Another simple solution is to threshold the loadings with small absolute value to zero (Cadima and Jolliffe, 1995). A more systematic approach to the problem arose in recent years, with various researchers proposing nonconvex algorithms (e.g., SCoTLASS by Jolliffe et al. (2003), SLRA by Zhang et al. (2002) or D.C. based methods (Sriperumbudur et al., 2007) which find modified principal components with zero loadings. The SPCA algorithm, which is based on the representation of PCA as a regression-type optimization problem (Zou et al., 2006), allows the application of the LASSO (Tibshirani, 1996), a penalization technique based on the norm. With the exception of simple thresholding, all the algorithms above require solving non convex problems. Recently also, d’Aspremont et al. (2007b) derived an based semidefinite relaxation for the sparse PCA problem (1) with a complexity of for a given . Finally, Moghaddam et al. (2006a) used greedy search and branch-and-bound methods to solve small instances of problem (1) exactly and get good solutions for larger ones. Each step of this greedy algorithm has complexity , leading to a total complexity of for a full set of solutions.

Our contribution here is twofold. We first derive a greedy algorithm for computing a full set of good solutions (one for each target sparsity between 1 and ) at a total numerical cost of based on the convexity of the of the largest eigenvalue of a symmetric matrix. We then derive tractable sufficient conditions for a vector to be a global optimum of (1). This means in practice that, given a vector with support , we can test if is a globally optimal solution to problem (1) by performing a few binary search iterations to solve a one dimensional convex minimization problem. In fact, we can take any sparsity pattern candidate from any algorithm and test its optimality. This paper builds on the earlier conference version (d’Aspremont et al., 2007a), providing new and simpler conditions for optimality and describing applications to subset selection and sparse recovery.

While there is certainly a case to be made for penalized maximum eigenvalues (à la d’Aspremont et al. (2007b)), we strictly focus here on the formulation. However, it was shown recently (see Candès and Tao (2005), Donoho and Tanner (2005) or Meinshausen and Yu (2006) among others) that there is in fact a deep connection between constrained extremal eigenvalues and LASSO type variable selection algorithms. Sufficient conditions based on sparse eigenvalues (also called restricted isometry constants in Candès and Tao (2005)) guarantee consistent variable selection (in the LASSO case) or sparse recovery (in the decoding problem). The results we derive here produce upper bounds on sparse extremal eigenvalues and can thus be used to prove consistency in LASSO estimation, prove perfect recovery in sparse recovery problems, or prove that a particular solution of the subset selection problem is optimal. Of course, our conditions are only sufficient, not necessary (which would contradict the NP-Hardness of subset selection) and the duality bounds we produce on sparse extremal eigenvalues cannot always be tight, but we observe that the duality gap is often small.

The paper is organized as follows. We begin by formulating the sparse PCA problem in Section 2. In Section 3, we write an efficient algorithm for computing a full set of candidate solutions to problem (1) with total complexity . In Section 4 we then formulate a convex relaxation for the sparse PCA problem, which we use in Section 5 to derive tractable sufficient conditions for the global optimality of a particular sparsity pattern. In Section 6 we detail applications to subset selection, sparse recovery and variable selection. Finally, in Section 7, we test the numerical performance of these results.

### Notation

For a vector , we let and , is the cardinality of , i.e. the number of nonzero coefficients of , while the support of is the set and we use to denote its complement. For , we write and for (the set of symmetric matrix of size ) with eigenvalues , . The vector of all ones is written , while the identity matrix is written . The diagonal matrix with the vector on the diagonal is written .

## 2 Sparse PCA

Let be a symmetric matrix. We consider the following sparse PCA problem:

 ϕ(ρ)≡max∥z∥≤1zTΣz−ρCard(z) (2)

in the variable where is a parameter controlling sparsity. We assume without loss of generality that is positive semidefinite and that the variables are ordered by decreasing marginal variances, i.e. that . We also assume that we are given a square root of the matrix with , where and we denote by the columns of . Note that the problem and our algorithms are invariant by permutations of and by the choice of square root . In practice, we are very often given the data matrix instead of the covariance .

A problem that is directly related to (2) is that of computing a cardinality constrained maximum eigenvalue, by solving:

 maximizezTΣzsubject toCard(z)≤k∥z∥=1, (3)

in the variable . Of course, this problem and (2) are related. By duality, an upper bound on the optimal value of (3) is given by:

 infρ∈Pϕ(ρ)+ρk.

where is the set of penalty values for which has been computed. This means in particular that if a point is provably optimal for (2), it is also globally optimum for (3) with .

We now begin by reformulating (2) as a relatively simple convex maximization problem. Suppose that . Since and for all , we always have:

 ϕ(ρ)=max∥z∥≤1zTΣz−ρCard(z)≤(Σ11−ρ)Card(z)≤0,

hence the optimal solution to (2) when is . From now on, we assume in which case the inequality is tight. We can represent the sparsity pattern of a vector by a vector and rewrite (2) in the equivalent form:

using the fact that for all variables and that for any matrix , . We then have:

Hence we finally get, after maximizing in (and using ):

 ϕ(ρ)=max∥x∥=1n∑i=1((aTix)2−ρ)+, (4)

which is a nonconvex problem in the variable . We then select variables such that . Note that if , we must have hence variable will never be part of the optimal subset and we can remove it.

## 3 Greedy Solutions

In this section, we focus on finding a good solution to problem (2) using greedy methods. We first present very simple preprocessing solutions with complexity and . We then recall a simple greedy algorithm with complexity . Finally, our first contribution in this section is to derive an approximate greedy algorithm that computes a full set of (approximate) solutions for problem (2), with total complexity .

### 3.1 Sorting and Thresholding

The simplest ranking algorithm is to sort the diagonal of the matrix and rank the variables by variance. This works intuitively because the diagonal is a rough proxy for the eigenvalues: the Schur-Horn theorem states that the diagonal of a matrix majorizes its eigenvalues (Horn and Johnson, 1985); sorting costs . Another quick solution is to compute the leading eigenvector of and form a sparse vector by thresholding to zero the coefficients whose magnitude is smaller than a certain level. This can be done with cost .

### 3.2 Full greedy solution

Following Moghaddam et al. (2006a), starting from an initial solution of cardinality one at , we can update an increasing sequence of index sets , scanning all the remaining variables to find the index with maximum variance contribution.

• Input:

• Algorithm:

1. Preprocessing: sort variables by decreasing diagonal elements and permute elements of accordingly. Compute the Cholesky decomposition .

2. Initialization: , .

3. Compute .

4. Set and compute as the leading eigenvector of .

5. Set . If go back to step 3.

• Output: sparsity patterns .

 zk=argmax{zIck=0, ∥z∥=1}zTΣz−ρk,

which means that is formed by padding zeros to the leading eigenvector of the submatrix . Note that the entire algorithm can be written in terms of a factorization of the matrix , which means significant computational savings when is given as a Gram matrix. The matrices and have the same eigenvalues and their eigenvectors are transformed of each other through the matrix , i.e., if is an eigenvector of , then is an eigenvector of .

### 3.3 Approximate greedy solution

Computing eigenvalues at each iteration is costly and we can use the fact that is a subgradient of at if is a leading eigenvector of  (Boyd and Vandenberghe, 2004), to get:

 λmax⎛⎝∑j∈Ik∪{i}ajaTj⎞⎠≥λmax⎛⎝∑j∈IkajaTj⎞⎠+(xTkai)2, (5)

which means that the variance is increasing by at least when variable is added to . This provides a lower bound on the objective which does not require finding eigenvalues at each iteration. We then derive the following algorithm:

• Input:

• Algorithm:

1. Preprocessing. Sort variables by decreasing diagonal elements and permute elements of accordingly. Compute the Cholesky decomposition .

2. Initialization: , .

3. Compute

4. Set and compute as the leading eigenvector of .

5. Set . If go back to step 3.

• Output: sparsity patterns .

 zk=argmax{zIck=0, ∥z∥=1}zTΣz−ρk,

which means that is formed by padding zeros to the leading eigenvector of the submatrix . Better points can be found by testing the variables corresponding to the largest values of instead of picking only the best one.

### 3.4 Computational Complexity

The complexity of computing a greedy regularization path using the classic greedy algorithm in Section 3.2 is : at each step , it computes maximum eigenvalue of matrices with size . The approximate algorithm in Section 3.3 computes a full path in : the first Cholesky decomposition is , while the complexity of the -th iteration is for the maximum eigenvalue problem and for computing all products . Also, when the matrix is directly given as a Gram matrix with with , it is advantageous to use directly as the square root of and the total complexity of getting the path up to cardinality is then reduced to (which is for the eigenvalue problems and for computing the vector products).

## 4 Convex Relaxation

In Section 2, we showed that the original sparse PCA problem (2) could also be written as in (4):

 ϕ(ρ)=max∥x∥=1n∑i=1((aTix)2−ρ)+.

Because the variable appears solely through , we can reformulate the problem in terms of only, using the fact that when , is equivalent to , and . We thus rewrite (4) as:

 ϕ(ρ)=max.∑ni=1(aTiXai−ρ)+s.t.Tr(X)=1, Rank(X)=1X⪰0.

Note that because we are maximizing a convex function over the convex set (spectahedron) , the solution must be an extreme point of (i.e. a rank one matrix), hence we can drop the rank constraint here. Unfortunately, , the function we are maximizing, is convex in and not concave, which means that the above problem is still hard. However, we show below that on rank one elements of , it is also equal to a concave function of , and we use this to produce a semidefinite relaxation of problem (2).

###### Proposition 1

Let , and denote by the columns of , an upper bound on:

 ϕ(ρ)=max.∑ni=1(aTiXai−ρ)+s.t.Tr(X)=1, X⪰0, Rank(X)=1 (6)

can be computed by solving

 ψ(ρ)=max.∑ni=1Tr(X1/2BiX1/2)+s.t.Tr(X)=1, X⪰0. (7)

in the variables , where , or also:

 ψ(ρ)=max.∑ni=1Tr(PiBi)s.t.Tr(X)=1, X⪰0, X⪰Pi⪰0, (8)

which is a semidefinite program in the variables .

Proof We let denote the positive square root (i.e. with nonnegative eigenvalues) of a symmetric positive semi-definite matrix . In particular, if with , then , and for all , has one eigenvalue equal to and equal to 0, which implies . We thus get:

 (aTiXai−ρ)+ = Tr((aTixxTai−ρ)xxT)+ = Tr(x(xTaiaTix−ρ)xT)+ = Tr(X1/2aiaTiX1/2−ρX)+=Tr(X1/2(aiaTi−ρI)X1/2)+.

For any symmetric matrix , the function is concave on the set of symmetric positive semidefinite matrices, because we can write it as:

 Tr(X1/2BX1/2)+=max{0⪯P⪯X}Tr(PB)=min{Y⪰B, Y⪰0}Tr(YX),

where this last expression is a concave function of as a pointwise minimum of affine functions. We can now relax the original problem into a convex optimization problem by simply dropping the rank constraint, to get:

 ψ(ρ)≡max.∑ni=1Tr(X1/2aiaTiX1/2−ρX)+s.t.Tr(X)=1, X⪰0,

which is a convex program in . Note that because has at most one nonnegative eigenvalue, we can replace by in the above program. Using the representation of detailed above, problem (7) can be written as a semidefinite program:

 ψ(ρ)=max.∑ni=1Tr(PiBi)s.t.Tr(X)=1, X⪰0, X⪰Pi⪰0,

in the variables , which is the desired result.

Note that we always have and when the solution to the above semidefinite program has rank one, and the semidefinite relaxation (8) is tight. This simple fact allows us to derive sufficient global optimality conditions for the original sparse PCA problem.

## 5 Optimality Conditions

In this section, we derive necessary and sufficient conditions to test the optimality of solutions to the relaxations obtained in Sections 3, as well as sufficient condition for the tightness of the semidefinite relaxation in (8).

### 5.1 Dual problem and optimality conditions

We first derive the dual problem to (8) as well as the Karush-Kuhn-Tucker (KKT) optimality conditions:

###### Lemma 2

Let , and denote by the columns of . The dual of problem (8):

 ψ(ρ)=max.∑ni=1Tr(PiBi)s.t.Tr(X)=1, X⪰0, X⪰Pi⪰0,

in the variables , is given by:

 min.λmax(∑ni=1Yi)s.t.Yi⪰Bi, Yi⪰0,i=1,…,n. (9)

in the variables . Furthermore, the KKT optimality conditions for this pair of semidefinite programs are given by:

 ⎧⎪ ⎪⎨⎪ ⎪⎩(∑ni=1Yi)X=λmax(∑ni=1Yi)X(X−Pi)Yi=0, PiBi=PiYiYi⪰Bi, Yi,X,Pi⪰0, X⪰Pi, TrX=1. (10)

Proof  Starting from:

 max.∑ni=1Tr(PiBi)s.t.0⪯Pi⪯XTr(X)=1, X⪰0,

we can form the Lagrangian as:

 L(X,Pi,Yi)=n∑i=1Tr(PiBi)+Tr(Yi(X−Pi))

in the variables , with and . Maximizing in the primal variables and leads to problem (9). The KKT conditions for this primal-dual pair of SDP can be derived from Boyd and Vandenberghe (2004, p.267).

### 5.2 Optimality conditions for rank one solutions

We now derive the KKT conditions for problem (8) for the particular case where we are given a rank one candidate solution and need to test its optimality. These necessary and sufficient conditions for the optimality of for the convex relaxation then provide sufficient conditions for global optimality for the non-convex problem (2).

###### Lemma 3

Let , and denote by the columns of . The rank one matrix is an optimal solution of (8) if and only if there are matrices such that:

 ⎧⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪⎩λmax(∑ni=1Yi)=∑i∈I((aTix)2−ρ)xTYix={(aTix)2−ρ if i∈I0 if i∈IcYi⪰Bi, Yi⪰0. (11)

where and is the complement of the set defined by:

 maxi∉I(aTix)2≤ρ≤mini∈I(aTix)2.

Furthermore, must be a leading eigenvector of both and .

Proof  We apply Lemma 2 given . The condition is equivalent to and . The equation is then equivalent to , with and the condition becomes . This means that and the first-order condition in (10) becomes . Finally, we recall from Section 2 that:

hence must also be a leading eigenvector of .

The previous lemma shows that given a candidate vector , we can test the optimality of for the semidefinite program (7) by solving a semidefinite feasibility problem in the variables . If this (rank one) solution is indeed optimal for the semidefinite relaxation, then must also be globally optimal for the original nonconvex combinatorial problem in (2), so the above lemma provides sufficient global optimality conditions for the combinatorial problem (2) based on the (necessary and sufficient) optimality conditions for the convex relaxation (7) given in lemma 2. In practice, we are only given a sparsity pattern (using the results of Section 3 for example) rather than the vector , but Lemma 3 also shows that given , we can get the vector as the leading eigenvector of .

The next result provides more refined conditions under which such a pair is optimal for some value of the penalty based on a local optimality argument. In particular, they allow us to fully specify the dual variables for .

###### Proposition 4

Let , and denote by the columns of . Let be the largest eigenvector of . Let be such that:

 maxi∉I(aTix)2<ρ

the matrix is optimal for problem (8) if and only if there are matrices satisfying

 λmax(∑i∈IBixxTBixTBix+∑i∈IcYi)≤∑i∈I((aTix)2−ρ), (13)

with , where .

Proof  We first prove the necessary condition by computing a first order expansion of the functions around . The expansion is based on the results in Appendix A which show how to compute derivatives of eigenvalues and projections on eigensubspaces. More precisely, Lemma 10 states that if , then, for any :

 Fi((1−t)xxT+tY)=Fi(xxT)+txTBixTrBixxTBi(Y−xxT)+O(t3/2),

while if , then, for any ,:

 Fi((1−t)xxT+tY)=t+Tr(Y1/2(Bi−BixxTBixTBix)Y1/2)++O(t3/2).

Thus if is a global maximum of , then this first order expansion must reflect the fact that it is also local maximum, i.e. for all such that and , we must have:

 limt→0+1tn∑i=1[Fi((1−t)xxT+tY)−Fi(xxT)]≤0,

which is equivalent to:

 −∑i∈IxTBix+TrY(∑i∈IBixxTBixTBix)+∑i∈IcTr(Y1/2(Bi−BixxTBixTBix)Y1/2)+≤0.

Thus if is optimal, with , we get:

 maxY⪰0,TrY=1TrY(∑i∈IBixxTBixTBix−σI)+∑i∈IcTr(Y1/2(Bi−Bix(xTBix)†xTBi)Y1/2)+≤0

which is also in dual form (using the same techniques as in the proof of Proposition 1):

 min{Yi⪰Bi−BixxTBixTBix,Yi⪰0}λmax(∑i∈IBixxTBixTBix+∑i∈IcYi)≤σ,

which leads to the necessary condition. In order to prove sufficiency, the only non trivial condition to check in Lemma 3 is that for , which is a consequence of the inequality:

 xT(∑i∈IBixxTBixTBix+∑i∈IcYi)x≤λmax(∑i∈IBixxTBixTBix+∑i∈IcYi)≤xT(∑i∈IBixxTBixTBix)x.

This concludes the proof.
The original optimality conditions in (3) are highly degenerate in and this result refines these optimality conditions by invoking the local structure. The local optimality analysis in proposition 4 gives more specific constraints on the dual variables . For , must be equal to , while if , we must have , which is a stricter condition than (because ).

### 5.3 Efficient Optimality Conditions

The condition presented in Proposition 4 still requires solving a large semidefinite program. In practice, good candidates for can be found by solving for minimum trace matrices satisfying the feasibility conditions of proposition 4. As we will see below, this can be formulated as a semidefinite program which can be solved explicitly.

###### Lemma 5

Let , , and with the columns of . If and , an optimal solution of the semidefinite program:

 minimizeTrYisubject toYi⪰Bi−BixxTBixTBix, xTYix=0, Yi⪰0,

is given by:

 Yi=max{0,ρ(aTiai−ρ)(ρ−(aTix)2)}(I−xxT)aiaTi(I−xxT)∥(I−xxT)ai∥2. (14)

Proof  Let us write , we first compute:

 aTiMiai = (aTiai−ρ)aTiai−(aTiaiaTix−ρaTix)2(aTix)2−ρ = (aTiai−ρ)ρ−(aTix)2ρ(aTiai−(aTix)2).

When , the matrix is negative semidefinite, because means and . The solution of the minimum trace problem is then simply . We now assume that and first check feasibility of the candidate solution in (14). By construction, we have and , and a short calculation shows that:

 aTiYiai = ρ(aTiai−ρ)(ρ−(aTix)2)(aTiai−(aTix)2) = aTiMiai.

We only need to check that on the subspace spanned by and , for which there is equality. This means that in (14) is feasible and we now check its optimality. The dual of the original semidefinite program can be written:

 maximizeTrPiMisubject toI−Pi+νxxT⪰0Pi⪰0,

and the KKT optimality conditions for this problem are written:

 ⎧⎪⎨⎪⎩Yi(I−Pi+νxxT)=0, Pi(Yi−Mi)=0,