The Sparse Principal Component
of a Constant-rank Matrix
The computation of the sparse principal component of a matrix is equivalent to the identification of its principal submatrix with the largest maximum eigenvalue. Finding this optimal submatrix is what renders the problem -hard. In this work, we prove that, if the matrix is positive semidefinite and its rank is constant, then its sparse principal component is polynomially computable. Our proof utilizes the auxiliary unit vector technique that has been recently developed to identify problems that are polynomially solvable. Moreover, we use this technique to design an algorithm which, for any sparsity value, computes the sparse principal component with complexity , where and are the matrix size and rank, respectively. Our algorithm is fully parallelizable and memory efficient.
Principal component analysis (PCA) is a well studied and broadly used dimensionality reduction tool. The principal components (PCs) of a set of observations on variables capture orthogonal directions of maximum variance and offer a distance-optimal, low-dimensional representation that -for many purposes- conveys sufficient amount of information. Without additional constraints, the PCs of a data set can be computed in polynomial time in using the eigenvalue decomposition.
A disadvantage of conventional PCA is that, in general, the extracted components are expected to have nonzero elements in all their entries. In many applications, sparse vectors that convey information are more favorable either due to sparsity of the actual signals ,  or because sparsity implies interpretability  when each coordinate of a PC corresponds, for example, to a different word in text analysis applications or the expression of a particular gene in bio data sets. Thus, provided that the application requires it, some of the maximum variance of the true PCs may be traded for sparsity. Recently, there has been an increasing interest in computing sparse components of data sets with applications that range from signal processing, communication networks, and machine learning, to bioinformatics, finance, and meteorology -.
To enforce sparsity on the extracted components, a linearly constrained -norm minimization problem is usually considered , , , . This problem is equivalent to the sparse variance maximization, that is, the maximization of the Rayleigh quotient of a matrix under an -norm constraint on the maximizing argument -, , , , -. In both problems, due to the additional cardinality constraint that is enforced, the sparsity-aware flavor of PCA, termed sparse PCA, comes at a higher cost: sparse PCA is an -hard problem .
To approximate sparse PCA, various methods have been introduced in the literature. A modified PCA technique based on the LASSO was introduced in . In , a nonconvex regression-type optimization approach combined with LASSO penalty was used to approximately tackle the problem. A nonconvex technique, locally solving difference-of-convex-functions programs, was presented in . Semidefinite programming (SDP) was used in , , while  augmented the SDP approach with an extra greedy step that offers favorable optimality guarantees under certain sufficient conditions. The authors of  considered greedy and branch-and-bound approaches. Generalized power method techniques using convex programs were also used to approximately solve sparse PCA . A sparse-adjusted deflation procedure was introduced in  and in  optimality guarantees were shown for specific types of covariance matrices under thresholding and SDP relaxations. Iterative thresholding was also considered in  in conjunction with certain guarantees while a truncated power method was presented in .
In this present work, we prove that the sparse principal component of an matrix can be obtained in polynomial time under a new sufficient condition: when can be written as a sum of a scaled identity matrix and a positive semidefinite update, i.e., , and the rank of the update is not a function of the problem size.111If , then we simply have a constant-rank matrix . Under this condition, we show that sparse PCA is solvable with complexity . Our proof utilizes the auxiliary unit vector technique that we developed in , . This technique has been inspired by the work in , which reappeared in  and was used in . It introduces an auxiliary unit vector that unlocks the constant-rank structure of a matrix (in this present work, matrix ). The constant-rank property along with the auxiliary vector enable us to scan a constant-dimensional space and identify a polynomial number of candidate vectors (i.e., candidate solutions to the original problem). Interestingly, the optimal solution always lies among these candidates and a polynomial time search can always retrieve it. As a result, we have applied the auxiliary unit vector technique to identify the polynomial solvability of certain optimization problems and provide polynomial-time algorithms that are directly implementable, fully parallelizable, and memory efficient -.
The rest of this paper is organized as follows. In Section II, we state the sparse PCA problem and indicate its -hardness. Then, in Section III, we follow the principles of the auxiliary unit vector technique to present a proof of the polynomial solvability of the sparse PCA problem under a constant-rank condition. Moreover, we design a novel algorithm222Early versions of our algorithm appeared in , . which, for any sparsity value, computes the sparse principal component with complexity . Our algorithm is simply implementable, fully parallelizable, and memory efficient. Especially for the case , an alternative nonparallelizable version of our algorithm with complexity is presented in Section IV.333Alternative (non)parallelizable implementations of our algorithm for have been presented in . A few conclusions are drawn in Section V.
Ii Problem Statement
We are interested in the computation of the real, unit-norm, and at most -sparse principal component of a matrix , i.e.,
Interestingly, when can be decomposed as a constant-rank positive semidefinite update of the identity matrix, i.e.,
where , is the identity matrix, and is a positive semidefinite matrix with rank , then the optimization (1) can always be rewritten as
Since is positive semidefinite and has rank , it can be decomposed as
is an matrix, and problem (1) can be written as
For the optimization problem in (6), we note that
where . In (7), set (which we call the support) consists of the indices of the potentially nonzero elements of . For a given support , the inner maximization is a -dimensional principal-component problem, where is the corresponding submatrix of . The solution to the inner maximization is denoted by
and equals the principal left singular vector of . Then, our optimization problem in (7) becomes
where denotes the principal singular value of matrix . That is, to solve our original problem in (1), according to (9), we need to find the -row submatrix of whose principal singular value is the maximum one among all submatrices. The indices that are contained in the optimal support that solves (9) correspond to the nonzero loadings of the solution to (1). Then, according to (8), the values of these nonzero loadings are directly computed by the left singular vector of .
From the above discussion, it turns out that the hardness of the original problem in (1) comes from the identification of the optimal support in (9). To obtain the optimal support , we could simply perform an exhaustive search among all possible supports and compare them against the metric of interest in (9). However, if is not constant but grows with , then such an approach has complexity that is exponential in , indicating the -hardness of (1), which was shown in . In this present work, we show that, if the rank of is constant, then (9) can be solved in time polynomial in . In fact, we develop an algorithm that has complexity and returns candidate supports, one of which is guaranteed to be the solution to (9). Then, by an exhaustive search among only these candidate supports, we identify the optimal support in (9) and, hence, the sparse principal component of and with complexity polynomial in , for any sparsity value between and (that is, even if grows with ).
Iii Computation of the Sparse Principal Component in Time
Prior to presenting the main result for the general rank- case, in the following subsection we provide insights as to why the sparse principal component of constant-rank matrices can be solved in polynomial time by first considering the trivial case .
Iii-a Rank-: A motivating example
In this case, has rank and . For a given support , we have . Then, our optimization problem in (7) becomes
where, for any given support , the corresponding vector in (8) is
Therefore, (7) becomes
and the optimal support is
That is, to determine the solution to (9), we only need to compare the elements of and select the largest ones. Then, their indices are the elements of .
The above observation, although simple, turns out to be critical for the developments that follow. Hence, to simplify the presentation, we define function which is parameterized in an integer , has as input a vector of length , and returns the indices of the largest values in :
Function operates by selecting the indices of the largest values among , , , . Its complexity is .
We conclude this subsection by noting that, if , then the optimal support in (9) is
and is computed with linear complexity.
Iii-B Rank-: Utilizing the auxiliary unit vector technique
We consider now the case where has rank and, hence, is an matrix. Without loss of generality (w.l.o.g.), we assume that each row of has at least one nonzero element, i.e., , . Indeed, as explained in , if there exists an index such that , then, independently of the value of the corresponding element of , the contribution of this row to the value of in (6) will be zero. Hence, there is no point in “spending” in a weight that could be distributed to other elements of ; we can ignore the th row of , replace by , and, hence, reduce the problem size from to . In the final solution , will be set to zero.
In our subsequent developments, we rely on the auxiliary unit vector technique that was introduced in  for matrices of rank and generalized in  for matrices of rank . This technique utilizes an auxiliary vector to generate the subspace spanned by the columns of and result in a rank- problem for each value of . Interestingly, for several problems, the number of different solutions that we obtain as scans the unit-radius hypersphere has polynomial size. If the rank- problem for each value of is polynomially solvable (as, for example, in the optimization problem of this work, as indicated in Subsection III-A), then the optimal solution is obtained with overall polynomial complexity. In a few words, the auxiliary unit vector technique of ,  is a fully parallelizable and memory efficient technique that translates -dimensional problems into a polynomial collection of rank- problems among which one results in the overall optimal solution.
For our sparse-principal-component problem, the auxiliary unit vector technique works as follows. Consider a unit vector . By Cauchy-Schwartz Inequality, for any ,
with equality if and only if is collinear to . Then,
The rightmost equality in (18) is obtained by interchanging the maximizations. This is a critical step of the auxiliary unit vector technique. It unlocks the constant-rank structure of and allows us to consider a simple rank- problem for each value of . Indeed, for each , the inner double maximization problem
which is obtained with complexity . Then, according to (18), the solution to our original problem in (9) is met by collecting all possible supports as scans the unit-radius -dimensional hypersphere. That is, in (9) belongs to
Set contains candidate supports one of which is the solution to our original optimization problem. If was available, then one would have to compare the elements of against the metric of interest in (9) to obtain the optimal support . Therefore, the size of and the complexity to build determine the overall complexity to solve (9). Our major contribution in this work is that we show that the cardinality of is upper bounded by
and develop an algorithm to build with complexity . After is constructed, each element (support) of it is mapped to the principal singular value of the matrix with complexity , since is constant. Finally, all computed singular values are compared with each other to obtain the optimal support in (9). Then, the solution to our original problem in (1) is the principal left singular vector of the matrix , computed with complexity . Therefore, we compute the optimal support and the sparse principal component of a rank- matrix with complexity .
A constructive proof is presented in detail in the next three subsections. To give some insight of the proof, we begin with the simple case . Then, we generalize our proof for the case of any arbitrary .
Iii-C Rank-: A simple instance of our proof
If , then has size and the auxiliary vector is a length-, unit vector that, as in , can be parameterized in an auxiliary angle . That is,
Hence, lies on the unit-radius semicircle.444We ignore the other semicircle because any pair of angles and with difference results in opposite vectors which, however, are equivalent with respect to the optimization metric in (18) and produce the same support in (21). Then, the candidate set in (22) is re-expressed as
where, according to (21),
and, according to (19),
That is, for any given , the corresponding support is obtained with complexity by selecting the indices of the absolutely largest elements of .
However, why should simplify the computation of a solution? The intuition behind the auxiliary unit vector technique is that every element of is actually a continuous function of , i.e., a curve (or -manifold) in , since
Hence, the support that corresponds to the absolutely largest elements of at a given point is a function of . Due to the continuity of the curves and the discrete nature of the support, we expect that the support will retain the same elements in an area around . Therefore, we expect the formation of intervals in , within which the indices of the absolutely largest elements of remain unaltered. A support might change only if the sorting of the amplitudes of two elements in , say and , changes. This occurs at points where , that is, points where two curves intersect. Finding all these intersection points is sufficient to determine intervals and construct all possible candidate supports . Among all candidate supports, lies the support that corresponds to the optimal -sparse principal component. Exhaustively checking the supports of all intervals suffices to retrieve the optimal one. The number of these intervals is exactly equal to number of possible intersections among the amplitudes of , which is exactly equal to , counting all possible combinations of element pairs.
Before we proceed, in Fig. 1, we illustrate the interval partition of for an arbitrary matrix (i.e., ). We plot the curves that originate from the rows of and observe the intervals that are formed, within which the sorting of the curves does not change. The borders of the intervals are denoted by vertical dashed lines at points of curve intersections. Our approach creates intervals which exceeds the total number of possible supports, however this is not true for greater values of . In addition, for sparsity , we observe that is partitioned into regions (sets of adjacent intervals); within each region , although the sorting changes, the set of largest curves does not change. For example, in Fig. 1, we identify the regions , , , and where the candidate support remains fixed. These regions are an interesting feature that might further decrease the number of intervals we need to check. We exploit this feature in the serial implementation of our algorithm in Section IV.
Iii-D Rank-: Algorithmic developments and complexity
Our goal is the construction of all possible candidate -sparse vectors, determined by the support of each interval in . This is a two-step process. First, we identify interval borders and, then, we determine the supports associated with these intervals.
Algorithmic Steps: We first determine all possible intersections of curve pairs in . Any pair of distinct elements in is associated with two intersections: and . Solving these two equations with respect to determines two points
where a new support of indices of the -largest values of might occur. We note that, when varies in (from to ), changes of the support may occur only over intersection points given in (29), for any with .
Next, we observe that, at an intersection point , the intersecting curves are equal, i.e., , while their relative order changes over . That is, if in the interval immediately preceding , then in the interval immediately following , and vice versa. Exactly at the point of intersection, we can determine the order of the curves in .
If both or none of the th and th curves are included in the set of largest curves at , then none of the two curves leaves or joins, respectively, that set at , despite the change in their relative order. Assuming that no other pair of curves intersect exactly at ,555Recall that the support can change only at an intersection point. If a second pair of curves, say the th and th ones, intersect exactly at causing the support to change over , then the correct support sets for the two incident intervals will be determined when the intersections of the th and th curves are examined. the two intervals incident to are associated with the same support which can be determined exactly at the intersection point .666In fact, since the support does not change at , there exists another intersection point where the same support will be collected. Hence, the support at may safely be ignored.
If that is not the case, i.e., if the th and th curves occupy the th and th order at , then the change in their relative order affects the support : one of the two curves leaves and the other one joins the set of largest curves at . The support sets associated with the two adjacent intervals differ only in one element (one contains index and the other contains index instead), while the remaining common indices correspond to the largest curves at the intersection point .777Since any interval in follows an intersection point, we actually need to evaluate the support of only the interval that follows the intersection point . The other interval will be examined at the intersection point that precedes it. To identify which of the two indices and is contained in the support of the interval that follows the intersection point , we can visit the “rightmost” point of , that is, . There, due to the continuity of and , the relative order of and within the interval that follows will be the identical or opposite relative order of and , depending on whether is positive or negative, respectively.
We have fully described a way to calculate the (at most) two support sets associated with the two intervals incident to one intersection point. Since all intervals are incident to at least one intersection point, it suffices to consider all intersection points, that is, all pairwise combinations of elements in , to collect the corresponding support sets.
Computational Complexity: A single intersection point can be computed in time . At an intersection point , the th-order element of (and the elements larger than that) can be determined in time which equals the construction time of the (at most) two support sets associated with the intervals incident to . Collecting all candidate supports requires examining all intersection points, implying a total construction cost of . Since we obtain (at most) two supports for any of the intersection points, the size of the candidate support set is .
Iii-E Rank-: A generalized proof
In the general case, is a matrix. In this subsection, we present our main result where we prove that the problem of identifying the -sparse principal component of a rank- matrix is solvable with complexity . The statement is true for any value of (that is, even if is a function of ). The rest of this subsection contains a constructive proof of this statement.
Since has size , the auxiliary vector is a length-, unit vector. We begin our constructive proof by introducing the auxiliary-angle vector and parameterizing , as in , according to
Hence, lies on the unit-radius semihypersphere.888As in the rank- case, we ignore the other semihypersphere because any pair of vectors and whose first elements and , respectively, have difference results in opposite vectors which, however, are equivalent with respect to the optimization metric in (18) and produce the same support in (21). Then, the candidate set in (22) is re-expressed as
where, according to (21),
and, according to (19),
That is, for any given , the corresponding support is obtained with complexity by selecting the indices of the absolutely largest elements of .
To gain some intuition into the purpose of inserting the auxiliary-angle vector , notice that every element of in (34) is actually a continuous function of and so are the elements of . That is, each element of is a hypersurface (or -manifold) in the -dimensional space . When we sort the elements of at a given point , we actually sort the hypersurfaces at point . The key observation in our algorithm is that, due to their continuity, the hypersurfaces will retain their sorting in an area “around” . This implies the partition of into cells each of which (say, cell ) is associated with a single set of indices of hypersurfaces that lie above and a single set of indices of hypersurfaces that lie below it. Moreover, each cell contains at least one vertex (that is, intersection of hypersurfaces). Finally, for any , there is a unique cell , called “normal,” which contains uncountably many points in and is associated with a single index-set of cardinality (that is, exactly hypersurfaces lie above ). In fact, the indices of these hypersurfaces (that is, the elements of ) are the elements of support . Although our discussion refers to the general- case, for illustrative purposes we consider again the case and, in Fig. 2, we revisit the example that we presented in Subsection III-C. The normal cells that are created by the curves are the shaded ones. These cells carry the property that lie below exactly curves. We observe that there is a one-to-one correspondence between normal cells and regions .
According to (31) and the above observations, we need to determine the index-set of every normal cell in the partition. If we collect all such index-sets, then we have constructed in (31). This will be achieved if, instead, we identify all cells in and, for each cell, determine the largest hypersurfaces that lie above an arbitrary point of it. The latter will return the desired index-set if the cell is normal. In Fig. 2, we observe that, for each normal cell, the indices of the largest curves that lie above it can be computed at the leftmost vertex of it (we can ignore the leftmost normal cell because it produces the same indices with the rightmost one). In the following, we identify all cells in the partition and compute a size- support for each such cell. This way, we obtain the index-set of any normal cell, among which one is the optimal support in (9).
Since each cell contains at least one vertex, we only need to find all vertices in the partition and determine for all neighboring cells. Recall that a vertex is an intersection of hypersurfaces. Consider arbitrary hypersurfaces; say, for example, , , , . Their intersection satisfies and is computed by solving the system of equations
For any sign combination, the solution to the latter consists of the spherical coordinates of the unit vector in the null space of the leftmost matrix.999If the matrix is full-rank, then its null space has rank and is uniquely determined (within a sign ambiguity which, however, does not affect the final decision on the index-set). If, instead, the matrix is rank-deficient, then the intersection of the hypersurfaces (i.e., the solution of (36)) is a -manifold (with ) on the -dimensional space and does not generate a new cell. Hence, combinations of rows of that result in linearly dependent rows of the matrix in (36) can be simply ignored. Then, the index-set that corresponds to a neighboring cell is computed by
Note that the intersecting hypersurfaces have the same value at . Hence, (37) returns ambiguity regarding the sorting of these particular hypersurfaces. If hypersurfaces of these belong to the largest ones, then, due to this ambiguity, we have to consider all combinations of hypersurfaces among the intersecting ones, where . Finally, we have to repeat the above procedure for all sign combinations in (36) and any combination of intersecting hypersurfaces among the ones. The total number of combinations is , hence the cardinality of is upper bounded by .
A pseudocode that includes all the above steps is presented in Algorithm 1. Its complexity to build is determined by the complexity to build each element of it (i.e., each index-set ) for each examined intersection through (37). Note that function has complexity and the cardinality of is . Hence, the overall complexity to build is . Finally, we mention that the computation of each element of is performed independently of each other. Therefore, the proposed Algorithm 1 that builds and solves (1) or, equivalently, (9) with complexity is fully parallelizable and memory efficient.
Iv An Algorithm of Complexity for Rank- Matrices
In the special case of a rank- matrix (i.e., ), Algorithm 1 computes the sparse principal component of with complexity for any sparsity value. In this section, we develop an algorithm that computes the sparse principal component of with complexity .
Iv-a A serial algorithm for rank- matrices
For the rank- case, Algorithm 1, discussed in Subsection III-D, relied on identifying a polynomial number of non-overlapping intervals on , each associated with a candidate support set . These intervals are induced by the pairwise intersections of curves in . The candidate support set associated with an interval is determined at the leftmost point of the interval. The two-step algorithm first computes all pairwise intersection points. In the second step, it determines , the set of indices of the largest elements of , at each intersection point individually, in linear time. In this section, we present an algorithmic enhancement that reduces the overall computational complexity, exploiting the correlation between candidate support sets of neighboring intervals.
The relative order of and changes only at the points where the th and th curves intersect. Conversely, if those two are the only curves intersecting at a point , then the ordering of the remaining elements of in the intervals immediately preceding and succeeding is identical. The limited change on ordering of the curves in across an intersection point implies that the differences between the candidate support sets of the adjacent intervals cannot be arbitrary. The support sets associated with two neighboring intervals differ in at most one element. More formally, let be the intersection point of the th and th curves of . There exists an such that is the only intersection point lying in . If in , then in , and vice versa. The ordering of all other curves remains unaltered over . If the th and th curves are both members of the candidate support in the interval preceding , then . The same holds if neither is included in . On the contrary, if exactly one of and belongs to , then the candidate sets associated with the two neighboring intervals differ in exactly one element. If, w.l.o.g., the th curve is the one belonging to , then .
The key observation is that, if the candidate support set associated with a particular interval is known, then the set associated with a neighboring interval can be determined with a constant complexity. The above observations readily suggest the following procedure for determining the candidate support sets:
Compute all pairwise intersection points of the curves in .
Sort the intersection points in increasing order. Let be the sorted sequence.
Determine , the set of indices of the largest curves in at the first intersection point .
Successively determine at consecutive intersection points. At , the candidate support set is determined by appropriately updating , the set associated with the previous interval.
Once all candidate support sets have been collected, the optimal solution is determined as in Algorithm 1.
An illustrative figure that explains the above steps is presented in Fig. 3, using the same matrix as the one examined in Section III (Figs. 2 and 3). We seek the optimal -sparse solution, i.e., . The algorithm starts at and scans , keeping track of the th order curve, highlighted with a black solid line. Vertical dashed lines indicate the intersections points at which the support of the optimal changes. Regions of corresponding to different support have a different background color. The optimal support in each region is depicted in the top of the figure. Note that 2 of the support sets need not be considered: sets and do not appear for any .
A pseudocode that implements the above serial algorithmic steps is presented in Algorithm 2. It improves upon the computational complexity of its parallel counterpart (Algorithm 1) circumventing the construction of the candidate set at each individual intersection point. The computation and sorting of all intersection points (steps 1 and 2) is performed in operations. The construction of is performed in linear time. Finally, each successive update in the last step requires operations. Therefore, the overall complexity of the serial Algorithm 2 is as opposed to the complexity of the parallel Algorithm 1 for . The disadvantage of Algorithm 2 is that it is not parallelizable.
Iv-B A modified serial algorithm for rank- matrices
Algorithms 1 and 2 described so far collect the candidate support sets by examining all pairwise intersection points. The number of distinct candidate support sets, however, can be significantly less: multiple intervals on may be associated with the same support set. In the following, we describe a simple modification of Algorithm 2 that aims at reducing the total number of intersection points computed and examined. Although the worst case complexity remains the same, the modified serial algorithm may significantly speed up execution in certain cases.
Consider three consecutive intersection points , , and , such that the candidate support sets on the two sides of are different (see, for example, Fig. 4). Let denote the set of indices of the largest curves in in the interval . Similarly, is the set associated with . The two sets differ by one element: if is the set of the common elements, then and , for some curves and .
At , over which the candidate support set changes, the two curves intersect. The th curve joins the set of largest curves of , displacing the th curve which was a member of . In particular, the th curve must be the smallest element of in . To see that, assume for the sake of contradiction that among the curves in , the th curve was the smallest one in , where . Then, in . By assumption, at . Due to the continuity of the curves, there must exist a point in where either or . However, no intersection point lies in . Following a similar argument, the th curve must be the largest curve in among those not included in . Moreover, the th curve becomes the th-order element of in , i.e., it is the smallest curve in .
The key observation is that, along , the candidate support set changes only at the intersection points of the th-order curve in . Assume that the candidate support set is known at and the th curve is the th-order curve in , i.e., the smallest curve in . Moving along , the first point where the candidate support set can potentially change is the closest intersection point of the th curve. Let be that point and be the intersecting curve. If , then the th curve joins the set at , displacing the th curve. If , then is only a point of internal reordering for , hence . In either case, however, the th curve becomes the th-order curve immediately after . Proceeding in a similar fashion, the next point where the candidate support set can potentially change is a point closest to , where the th curve intersects one of the other curves.