A convex integer programming approach for optimal sparse PCA

# A convex integer programming approach for optimal sparse PCA

## Abstract

Principal component analysis (PCA) is one of the most widely used dimensionality reduction tools in scientific data analysis. The PCA direction, given by the leading eigenvector of a covariance matrix, is a linear combination of all features with nonzero loadings—this impedes interpretability. Sparse principal component analysis (SPCA) is a framework that enhances interpretability by incorporating an additional sparsity requirement in the feature weights (factor loadings) while finding a direction that explains the maximal variation in the data. However, unlike PCA, the optimization problem associated with the SPCA problem is NP-hard. While many heuristic algorithms based on variants of the power method are used to obtain good solutions, they do not provide certificates of optimality on the solution-quality via associated dual bounds. Dual bounds are available via standard semidefinite programming (SDP) based relaxations, which may not be tight and the SDPs are difficult to scale using off-the-shelf solvers. In this paper, we present a convex integer programming (IP) framework to solve the SPCA problem to near-optimality, with an emphasis on deriving associated dual bounds. We present worst-case results on the quality of the dual bound provided by the convex IP. We empirically observe that the dual bounds are significantly better than worst-case performance, and are superior to the SDP bounds on some real-life instances. Moreover, solving the convex IP model using commercial IP solvers appears to scale much better that solving the SDP-relaxation using commercial solvers. To the best of our knowledge, we obtain the best dual bounds for real and artificial instances for SPCA problems involving covariance matrices of size up to .

## 1 Introduction

Principal component analysis (PCA) is one of the most widely used dimensionality reduction methods in data science. Given a data matrix (with samples and features; and each feature is centered to have zero mean), PCA seeks to find a principal component (PC) direction with that maximizes the variance of a weighted combination of features. Formally, this PC direction can be found by solving

 max∥x∥2=1x⊤Ax (PCA)

where is the sample covariance matrix. An obvious drawback of PCA is that all the entries of (an optimal solution to (PCA)) are (usually) nonzero, which leads to the PC direction being a linear combination of all features – this impedes interpretability [8, 19, 36]. In biomedical applications for example, when corresponds to the gene-expression measurements for different samples, it is desirable to obtain a PC direction which involves only a handful of the features (e.g, genes) for interpretation purposes. In financial applications (e.g, may denote a covariance matrix of stock-returns), a sparse subset of stocks that are responsible for driving the first PC direction may be desirable for interpretation purposes. Indeed, in many scientific and industrial applications [31, 1, 16], for additional interpretability, it is desirable for the factor loadings to be sparse, i.e., few of the entries in are nonzero and the rest are zero. This motivates the notion of a sparse principal component analysis (SPCA) [19, 16], wherein, in addition to maximizing the variance, one also desires the direction of the first PC to be sparse in the factor loadings. The most natural optimization formulation of this problem, modifies criterion PCA with an additional sparsity constraint on leading to:

 λk(A)≜max∥x∥2=1,∥x∥0≤kx⊤Ax (SPCA)

where , is equivalent to allowing at most components of to be nonzero. Unlike the PCA problem, the SPCA problem is NP-hard [23, 9].

Many heuristic algorithms have been proposed in the literature that use greedy methods [19, 35, 17, 14], alternating methods [33] and the related power methods [20]. However, conditions under which (some of) these computationally friendlier methods can be shown to work well, make very strong and often unverifiable assumptions on the problem data. Therefore, the performance of these heuristics (in terms of how close they are to an optimal solution of the  SPCA problem) on a given dataset is not clear.

Since SPCA is NP-hard, there has been exciting work in the statistics community [4, 30] in understanding the statistical properties of convex relaxations (e.g., those proposed by [10] and variants) of SPCA. It has been established [4, 30] that the statistical performance of estimators available from convex relaxations are sub-optimal (under suitable modeling assumptions) when compared to estimators obtained by (optimally) solving SPCA—this further underlines the importance of creating tools to be able to solve SPCA to optimality.

Our main goal in this paper is to propose an integer programming framework that allows the computation of good solutions to the SPCA problem with associated certificates of optimality via dual bounds, which make limited restrictive/unverifiable assumptions on the data. To the best of our knowledge, the only published methods for obtaining duals bounds of SPCA are based on semidefinite programming (SDP) relaxations [12, 14, 34, 13] (see Appendix A for the SDP relaxation) and spectral methods involving a low-rank approximation of the matrix  [25]. Both these approaches however, have some limitations. The SDP relaxation does not appear to scale easily (using off-the-shelf solvers) for matrices with more than a few hundred rows/columns, while applications can be significantly larger. Indeed, even a relatively recent implementation based on the Alternating Direction Method of Multipliers for solving the SDP considers instances with  [22]. The spectral methods involving a low-rank approximation of proposed in [25] have a running time of where is the rank of the matrix—in order to scale to large instances, no more than a rank approximation of the original matrix seems possible. The paper [3] presents a specialized branch and bound solver4 to obtain solutions to the SPCA problem, but their method can handle problems with – the approach presented here is different, and our proposal scales to problem instances that are much larger.

The methods proposed here are able to obtain approximate dual bounds of SPCA by solving convex integer programs and a related perturbed version of convex integer programs that are easier to solve. The dual bounds we obtain are incomparable to dual bounds based on the SDP relaxation, i.e. neither dominates the other, and the method appears to scale well to matrices up to sizes of .

## 2 Main results

Given a set , we denote as the convex hull of ; given a positive integer we denote by ; given a matrix , we denote its trace by .

Notice that the constraint implies that . Thus, one obtains the so-called -norm relaxation of SPCA:

 OPTℓ1≜max∥x∥2≤1,∥x∥1≤√kx⊤Ax. (ℓ1-relax)

The relaxation -relax has two advantages: (a) As shown in Theorem 1 below, -relax gives a constant factor bound on SPCA, (b) The feasible region is convex and all the nonconvexity is in the objective function. We build on these two advantages: our convex IP relaxation is a further relaxation of -relax (together with some implied linear inequalities for SPCA) which heavily use the fact that the feasible region of -relax is convex.

We note that -relax is an important estimator in its own right [31, 16]—it is commonly used in the statistics/machine-learning community as one that leads to an eigenvector of with entries having a small -norm (as opposed to a small -norm). However, we emphasize that -relax is a nonconvex optimization problem—globally optimizing -relax is challenging—we show in this paper, how one can use IP methods to solve -relax to certifiable optimality.

The rest of this section is organized as follows: In Section 2.1, we present the constant factor bound on SPCA given by -relax, improving upon some known results. In Section 2.2, we present the construction of our convex IP and prove results on the quality of bound provided. In Section 2.3, we discuss perturbing the original matrix in order to make the convex IP more efficiently solvable while still providing reasonable dual bounds. In Section 4, we present results from our computational experiments.

### 2.1 Quality of ℓ1-relaxation as a surrogate for the Spca problem

The following theorem is an improved version of a result appearing in [29] (Exercise 10.3.7).

###### Theorem 1.

The objective value is upper bounded by a multiplicative factor away from , i.e., with .

Proof of Theorem 1 is provided in Section 3. While we have improved upon the bound presented in [29], we do not know if this new bound is tight.

Theorem 1 has implications regarding existence of polynomial-time algorithms to obtain a constant-factor approximation guarantee for -relax. In particular, the proof of Theorem 1 implies that if one can obtain a solution for -relax which is within a constant factor, say , of , then a solution for SPCA problem can be obtained, which is within a constant factor (at most ) of . Therefore, the -relax problem is also inapproximable in general.

### 2.2 Convex integer programming method

A classical integer programming approach to solving SPCA would be to go to an extended space involving the product of -variables and include one binary variable per -variable in order to model the -norm constraint, resulting in a very large number of binary variables. In particular, a typical model could be of the form:

 max \textuptr(AX) (1) s.t. xi≤zi, i∈[n] (2) n∑j=1zi≤k (3) ∥x∥2≤1 (4) [1x⊤xX]⪰0 (5) \textuprank([1x⊤xX])=1 (6) z∈{0,1}n. (7)

It is easy to see that such a model is challenging due to (a) “quadratic” increase in number of variables () (b) the presence of the rank constraint and (c) binary variables. Even with significant progress, it is well-known that solving such problems beyond being a few hundred variables is extremely challenging [5, 15]. Indeed, solving instances with arbitrary quadratic objective and bound constraints cannot be solved (exactly) by modern state-of-the-art methods as soon as the number of variables exceed a hundred or so [7, 6].

On the other hand, as mentioned before, the feasible region of -relax is a convex set. Therefore, we do not have to include binary variables as in the case for modeling the -norm constraint. Moreover, Theorem 1 suggests that -relax will provide quite strong dual bounds for SPCA. Thus, we will use -relax as our basic relaxation — we note that the factor of is a worst case bound, and as we see in our numerical experiments the objective function values of the two problems are quite close.

Since the feasible region of -relax is a convex set we need to model/upper bound the objective function using convex IP techniques. We follow the following arguments:

1. By a spectral decomposition, let where are unit norm orthogonal eigen-pairs. Then the objective function of -relax is:

 n∑i=1λi(x⊤vi)2.
2. Assuming that , we have that for such that , where is the identity matrix. Therefore, if we split the eigenvalues into two sets as and , the objective function can be represented as

 λ+∑i∈{i:λi>λ}(λi−λ)(x⊤vi)2+∑i∈{i:λi≤λ}(λi−λ)(x⊤vi)2.
3. For each index , we replace with a single continuous variable , and set (or if we explicitly want a relaxation of -relax) as an upper bound of . Then for each with , we construct a piecewise linear upper approximation for . Such a piecewise linear upper approximation is usually modeled via SOS-2 constraints [24], and this seems to work well in our numerical experiments.

4. For the term , since , we obtain a convex constraint .

Therefore, a convex integer programming problem is obtained as follows:

 λ+maxx,y,g,ξ,η,s∑i∈{i:λi>λ}(λi−λ)ξi−s(≜OPT% convex-IP) s.t. gi=x⊤vi, i∈[n](x⊤Ax=∑ni=1λig2i)⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩gi=N∑j=−Nγjiηji, i∈{i:λi>λ}ξi=N∑j=−N(γji)2ηji, i∈{i:λi>λ}(η−Ni,…,ηNi)∈SOS-2, i∈{i:λi>λ}((x⊤vi)2≤ξi, i∈{i:λi>λ})⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩n∑i=1x2i≤1∑i∈{i:λi>λ}ξi+∑i∈{i:λi≤λ}g2i≤1+14N2∑i∈{i:λi>λ}θ2i(∥x∥2≤1) n∑i=1yi≤√k and yi≥xi,yi≥−xi i∈{i:λi>λ}(∥x∥1≤√k) −θi≤gi≤θi, i∈[n](|gi|≤θi, i∈[n]) ∑i∈{i:λi≤λ}−(λi−λ)g2i≤s (Convex-IP)

where, for each , let be the number of splitting points that partition the domain of , i.e., equally, and let be the value of the splitting point.

Since ’s are orthogonal, implies . Then together with representing , we can obtain the implied inequality . The extra term in the right-hand-side reflects the fact that is not exactly equal to , but only a piecewise linear upper bound on . In fact, we have that . This constraint (cutting-plane) helps in improving the dual bound at the root node and significantly improves the running time of the solver. We arrive at the following result:

###### Proposition 1.

The optimal objective value of Convex-IP is an upper bound on the SPCA problem.

Proposition 1 is formally verified in Appendix B.

Next combining the result of Theorem 1 with the quality of the approximation of the objective function of -relax by Convex-IP, we obtain the following result:

###### Proposition 2.

The optimal objective value of Convex-IP is upper bounded by

 OPTconvex-IP≤ρ2λk(A)+14N2∑i∈{i:λi>λ}(λi−λ)θ2i.

A proof of Proposition 2 is presented in Appendix C.

Finally, let us discuss why we expect Convex-IP to be appealing from a computational viewpoint. Unlike typical integer programming approaches, the number of binary variables in Convex-IP is which is usually significantly smaller than . Indeed, heuristics for SPCA generally produce good values of , and in almost all experiments we found that . Moreover, is a parameter we control. In order to highlight the “computational tractability” of Convex-IP, we formally state the following result:

###### Proposition 3.

Given the number of splitting points and the size of set , the Convex-IP problem can be solved in polynomial time.

Note that the convex integer programming method which is solvable in polynomial time, does not contradict the inapproxamability of the SPCA problem, since is upper bounded by the sum of and a term corresponding to the sample covariance matrix.

### 2.3 Improving the running time of Convex-IP

#### Perturbation of the covariance matrix A:

Empirically, we use a (sequence of) perturbations on the covariance matrix to reduce the running time of solving the convex IP. Recall that denotes a lower bound of .

1. Set (where are the eigenvalues of ). We assume . However, when , one can apply Algorithm 1 to obtain a matrix , such that . We then replace by .

2. Perturb the covariance matrix by . Note that the objective value in Convex-IP is an upper bound on . Replace by .

3. Therefore, the convex constraint in Convex-IP can be replaced by , i.e., .

4. The constraint is satisfied at equality for any optimal solution, therefore the following convex constraints

 1≤∑i∈{i:λi>λ}¯ξi+∑i∈{i:λi≤λ}¯g2i≤1+14N2∑i∈{i:λi>λ}θ2i, n∑i=1¯g2i=∑i∈{i:λi>λ}¯g2i+∑i∈{i:λi≤λ}¯g2i≤1,

imply the following inequalities:

 1−¯sλ−¯λ≤∑i∈{i:λi>λ}¯ξi≤1+14N2∑i∈{i:λi>λ}θ2i−¯sλ−¯λ, ∑i∈{i:λi>λ}¯g2i≤1−¯sλ−¯λ.

Thus a simplified convex IP corresponding to the perturbed version of the matrix is:

 λ+maxx,y,g,ξ,η,s∑i∈{i:λi>λ}(λi−λ)ξi−s≜OPTPert-% Convex-IP s.t. gi=x⊤vi, i∈{i:λi>λ}(x⊤¯Ax=∑i∈I1λig2i+∑i∈IC1¯λg2i)⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩gi=N∑j=−Nγjiηji, i∈{i:λi>λ}ξi=N∑j=−N(γji)2ηji, i∈{i:λi>λ}(η−Ni,…,ηNi)∈SOS-2, i∈{i:λi>λ}((x⊤vi)2≤ξi, i∈{i:λi>λ})⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩n∑i=1x2i≤1∑i∈{i:λi>λ}g2i≤1−sλ−¯λ1−sλ−¯λ≤∑i∈{i:λi>λ}ξi≤1+14N2∑i∈{i:λi>λ}θ2i−sλ−¯λ(∥x∥2≤1) n∑i=1yi≤√k and yi≥|xi|, i∈{i:λi>λ}(∥x∥1≤√k) −θi≤gi≤θi, i∈{i:λi>λ}.(|gi|≤θi, i∈[n]) (Pert-Convex-IP)
###### Proposition 4.

The optimal objective value of Pert-Convex-IP is upper bounded by

 OPTPert-Convex-IP≤ρ2λk(A)+ρ2(¯λ−λmin(A))+14N2∑i∈{i:λi>λ}(λi−λ)θ2i.

Note that in Pert-Convex-IP, we do not need the variables which greatly reduces the number of variables since in general . In practice, we note a significant reduction in running time, while the dual bound obtained from Pert-Convex-IP model remains reasonable. More details are presented in Section 4.

#### Refining the splitting points

Since the Pert-Convex-IP model runs much faster than the Convex-IP model, we run the Pert-Convex-IP model iteratively. In each new iteration, we add one extra splitting point describing each function. In particular, once we solve a Pert-Convex-IP model, we add one splitting point at the optimal value of .

#### Cutting planes

###### Proposition 5.

Let . Let where . Then let be the vector:

 vij={|xij|\textupifj≤k|xik|\textupifj>k. (8)

Also let . The inequality

 v⊤y≤b(v), (9)

is a valid inequality for SPCA.

The validity of this inequality is clear: If is a feasible point, then the support of is at most and . Therefore, . Notice that this inequality is not valid for -relax. Also see [21].

We add these inequalities at the end of each iteration for the model where the seeding for constructing is chosen to be the optimal solution of the previous iteration.

## 3 Proof of Theorem 1

Given a vector , we denote the coordinate as in this section. Define

 Sk ≜{x∈Rn:∥x∥2≤1,∥x∥0≤k}, (10) Tk ≜{x∈Rn:∥x∥2≤1,∥x∥1≤√k}. (11)

Note that any can be represented as a nonnegative combination of points in , i.e., and for all . Here we think of each as a projection onto some unique components of and setting the other components to . Let , then . Now we have, , and therefore

 1∑mi=1∥xi∥2x=m∑i=1∥xi∥2∑mi=1∥xi∥2⋅yi. (12)

Thus, if we scale by , then the resulting vector belongs to . Since we want this scaling factor to be as small as possible, we solve the following optimization problem:

 min∥x1∥2+…+∥xm∥2: x=x1+…+xm; xi∈Sk,∀i∈[m]. (Bound)

Without loss of generality, we assume that and . Let where is an optimal solution of Bound. The following proposition presents a result on an optimal solution of Bound.

###### Proposition 6.

Let be a collection of supports such that: indexes the largest (in absolute value) components in , indexes the second largest (in absolute value) components in , and so on (note that ). Then is an optimal set of supports for Bound.

###### Proof.

We prove this result by the method of contradiction. Suppose we have an optimal representation as  — and without loss of generality, we assume that . Let be the set of supports of respectively, where we assume that the indices within each support vector are ordered such that

 [x¯Ij]1≥[x¯Ij]2≥⋯≥[x¯Ij]g

for all (note that if ).

Let be the first support that is different from , i.e., and . Let be the first index in that does not belong to with since . Therefore, must be in where . Note now that by construction of and our assumption on , we have that . Now we exchange the index in with in . We have:

 √∥x¯Ip∥22+([xIp]q)2−([x¯Ip]k)2+√∥x¯Ip′∥22+([x¯Ip]k)2−([xIp]q)2≤∥x¯Ip∥2+∥x¯Ip′∥2, (13)

which holds because and .

Now repeating the above step, we obtain the result. ∎

Based on Proposition 6, for any fixed , we can find out an optimal solution of Bound in closed form. Now we would like to know, for which vector , the scaling factor will be the largest. Let be obtained by solving the following optimization problem:

 ρ=maxx∥xI1∥2+⋯+∥xIm∥2s.t.x=xI1+⋯+xIm∥x∥22=∥xI1∥22+⋯+∥xIm∥22≤1∥x∥1=∥xI1∥1+⋯+∥xIm∥1≤√k[x]1≥⋯≥[x]n≥0. (Approximation ratio)

Then we obtain

 Tk⊆ρ⋅Conv(Sk). (14)

Although the optimal objective value of Approximation ratio is hard to compute exactly, we can still find an upper bound.

###### Lemma 1.

The objective value of Approximation ratio is bounded from above by .

###### Proof.

First consider the case when . In this case, . Consider the optimization problem:

 θ=max u+v s.t. u2+v2≤1

If we think of as and as , then we see that the above problem is a relaxation of Approximation ratio and therefore is an upper bound on . Noting that for all , we have the result.

Now we assume that and consequently . From Approximation ratio, let and . Based on the standard relationship between and norm, we have

 γ≤t≤√kγ.

Since each coordinate of is smaller in magnitude than the average coordinate of , we have

 ∥xI2∥2≤ ⎷(∥xI2∥1k)2k=t√k. (15)

Also note that an alternative bound is given by

 ∥xI2∥2≤√1−γ2.

Using an argument similar to the one used to obtain (15), we obtain that

 m∑i=3∥xIi∥2≤m−1∑i=2 ⎷(∥xIi∥1k)2k=1√km−1∑i=2∥xIi∥1≤√k−t√k.

Therefore we obtain

 m∑i=1∥xIi∥2 =∥xI1∥2+∥xI2∥2+m∑i=3∥xIi∥2≤γ+min{t√k,√1−γ2}+1−t√k. (Upper-Bound)

Now we consider two cases:

1. If , then Upper-Bound becomes . Since , satisfies . Moreover we have that . Since iff we obtain two cases:

 γ+√1−γ2+1−t√k ≤⎧⎪ ⎪⎨⎪ ⎪⎩1+√kk+11+√kk+1 (16)

where (i) the first inequality holds when , (ii) the second inequality holds since the function achieves (local and global) maximum at point which is less than for , thus for part .

2. If , then Upper-Bound becomes . Note now that , implies that satisfies . Therefore, .

Therefore, this upper bound holds. ∎

Therefore, we can show Theorem 1 holds.

## 4 Numerical experiments

In this section, we report results on our empirical comparison of the performances of Convex-IP method, Pert-Convex-IP method and the SDP relaxation method.

### 4.1 Hardware and Software

All numerical experiments are implemented on MacBookPro13 with 2 GHz Intel Core i5 CPU and 8 GB 1867 MHz LPDDR3 Memory. Convex-IPs were solved using Gurobi 7.0.2. SDPs were solved using Mosek 8.0.0.60.

### 4.2 Obtaining primal solutions

We used a heuristic, which is very similar to the truncated power method [33], but has some advantages over the truncated power method. Given , let be the set of indices corresponding to the top entries of (in absolute value).

We start with a random initialization such that , and set where is a square root of , i.e. . In the iteration, we update

 Ii←Ik(V⊤xi), xi+1←argmax∥x∥2=1x⊤AIix (17)

where is the matrix with for all and otherwise. It is easy to see that satisfy the condition . Moreover, using the fact is a PSD matrix, it is easy to verify that for all . Therefore, in each iteration, the above heuristic method leads to an improved feasible solution for the SPCA problem.

Our method has two clear advantages over the truncated power method:

• We use standard and efficient numerical linear algebra methods to compute eigenvalues of small matrices.

• The termination criteria used in our algorithm is also simple: if for some , then we stop. Clearly, this leads to a finite termination criteria.

In practice, we stop using a stopping criterion based on improvement and number of iterations instead of checking . Details are presented in Algorithm 2.