Subexponential-Time Algorithms for Sparse PCA
We study the computational cost of recovering a unit-norm sparse principal component $x \in \mathbb{R}^n$ planted in a random matrix, in either the Wigner or Wishart spiked model (observing either $W + \lambda xx^\top$ with $W$ drawn from the Gaussian orthogonal ensemble, or $N$ independent samples from $\mathcal{N}(0, I_n + \beta xx^\top)$, respectively). Prior work has shown that when the signal-to-noise ratio ($\lambda$ or $\beta\sqrt{N/n}$, respectively) is a small constant and the fraction of nonzero entries in the planted vector is $\rho$, it is possible to recover $x$ in polynomial time if $\rho \lesssim 1/\sqrt{n}$. While it is possible to recover $x$ in exponential time under the weaker condition $\rho \ll 1$, it is believed that polynomial-time recovery is impossible unless $\rho \lesssim 1/\sqrt{n}$. We investigate the precise amount of time required for recovery in the “possible but hard” regime $1/\sqrt{n} \ll \rho \ll 1$ by exploring the power of subexponential-time algorithms, i.e., algorithms running in time $\exp(n^{\delta})$ for some constant $\delta \in (0,1)$. For any $1/\sqrt{n} \ll \rho \ll 1$, we give a recovery algorithm with runtime roughly $\exp(\tilde{O}(\rho^2 n))$, demonstrating a smooth tradeoff between sparsity and runtime. Our family of algorithms interpolates smoothly between two existing algorithms: the polynomial-time diagonal thresholding algorithm and the $\exp(\tilde{O}(\rho n))$-time exhaustive search algorithm. Furthermore, by analyzing the low-degree likelihood ratio, we give rigorous evidence suggesting that the tradeoff achieved by our algorithms is optimal.
- 1 Introduction
- 2 Main Results
- 3 Proofs for Subexponential-Time Algorithms
- 4 Proofs for Low-Degree Likelihood Ratio Bounds
- A Chernoff Bounds
1.1 Spiked Matrix Models
Since the foundational work of Johnstone [Joh01], spiked random matrix ensembles have been widely studied throughout random matrix theory, statistics, and theoretical data science. These models describe a deformation of one of several canonical random matrix distributions by a rank-one perturbation or “spike,” intended to capture a signal corrupted by noise. Spectral properties of these spiked models have received much attention in random matrix theory [BBP05, BS06, Pau04, Péc06, FP07, CDMF09, BGN11, PRS13, KY13], leading to a theoretical understanding of methods based on principal component analysis (PCA) for recovering the direction of the rank-one spike [Joh01, JL04, Pau07, Nad08, JL09]. Spiked matrix models have also found more specific applications to problems such as community detection in graphs (see, e.g., [McS01, Vu18, DAM16], or [Moo17, Abb17] for surveys) and synchronization over groups (see, e.g., [Sin11, SS11, JMRT16, PWBM16, PWBM18a]).
We will study two classical variants of the spiked matrix model: the Wigner and Wishart models. The models differ in how noise is applied to the signal vector. In either case, let $x \in \mathbb{R}^n$ be the signal vector (or “spike”). We will either have $x$ deterministic with $\|x\| = 1$, or $x$ random (for each $n$) with $\|x\| \to 1$ in probability as $n \to \infty$.
Spiked Wigner Model. Let $\lambda \ge 0$. Observe $Y = \lambda xx^\top + W$, where $W$ is drawn from the Gaussian orthogonal ensemble $\mathrm{GOE}(n)$, i.e., $W$ is symmetric with entries distributed independently as $W_{ii} \sim \mathcal{N}(0, 2/n)$ for all $i$, and $W_{ij} = W_{ji} \sim \mathcal{N}(0, 1/n)$ for all $i < j$.
Spiked Wishart Model. Let $\beta \ge -1$ and $N \in \mathbb{N}$. Observe $N$ samples $y_1, \dots, y_N$ drawn independently from $\mathcal{N}(0, I_n + \beta xx^\top)$. The ratio of dimension to number of samples is denoted $\gamma = n/N$. We will focus on the high-dimensional regime where $\gamma$ converges to a constant as $n \to \infty$. We let $\hat{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i y_i^\top$ denote the sample covariance matrix.
Each of these planted models has a corresponding null model, given by sampling from the planted model with either $\lambda = 0$ (Wigner) or $\beta = 0$ (Wishart).
We are interested in the computational feasibility of the following two statistical tasks, to be performed given a realization of the data (either $Y$ or the samples $y_1, \dots, y_N$) drawn from either the null or planted distribution.
Detection. Perform a simple hypothesis test between the planted model and null model. We say that strong detection is achieved by a statistical test if both the type-I and type-II errors tend to 0 as $n \to \infty$.
Recovery. Estimate the spike $x$ given data drawn from the planted model. We say that a unit-norm estimator $\hat{x}$ achieves weak recovery if $|\langle \hat{x}, x \rangle|$ remains bounded away from zero with probability tending to 1 as $n \to \infty$.
1. Note that we cannot hope to distinguish between the planted models with signals $x$ and $-x$.
For high-dimensional inference problems such as the spiked Wigner and Wishart models, these two tasks typically share the same computational profile: with a given computational time budget, strong detection and weak recovery are possible in the same regions of parameter space.
1.2 Principal Component Analysis
Simple algorithms for both detection and recovery are given by principal component analysis (PCA). For detection, one computes and thresholds the maximum eigenvalue of the observed matrix ($Y$ or $\hat{Y}$), while for recovery one estimates $x$ using the leading eigenvector. Both the spiked Wishart and Wigner models are known to exhibit a sharp transition in their top eigenvalue as the model parameters vary. For the Wishart model, the celebrated “BBP transition” of Baik, Ben Arous, and Péché [BBP05, BS06] states that the maximum eigenvalue of the sample covariance matrix $\hat{Y}$ emerges from the Marchenko–Pastur-distributed bulk if and only if $\beta > \sqrt{\gamma}$. Similarly, in the Wigner model, the maximum eigenvalue of $Y$ emerges from the semicircular bulk if and only if $\lambda > 1$ [FP07]. More formally, the following statements hold.
Consider the spiked Wigner model with $\|x\| = 1$ and $\lambda \ge 0$ fixed. Then as $n \to \infty$,
if $\lambda > 1$, $\lambda_{\max}(Y) \to \lambda + 1/\lambda$ almost surely, and $\langle v, x \rangle^2 \to 1 - \lambda^{-2}$ almost surely (where $v$ denotes the leading unit-norm eigenvector of $Y$);
if $\lambda \le 1$, $\lambda_{\max}(Y) \to 2$ almost surely, and $\langle v, x \rangle^2 \to 0$ almost surely.
Let $\hat{Y}$ denote the sample covariance matrix in the spiked Wishart model with $\|x\| = 1$, $\beta \ge 0$ fixed, and $N = N(n)$ such that $\gamma = n/N$ converges to a constant as $n \to \infty$. Then as $n \to \infty$,
if $\beta > \sqrt{\gamma}$, $\lambda_{\max}(\hat{Y}) \to (1+\beta)(1+\gamma/\beta)$ almost surely, and $\langle v, x \rangle^2 \to \frac{1 - \gamma/\beta^2}{1 + \gamma/\beta}$ almost surely (where $v$ denotes the leading unit-norm eigenvector of $\hat{Y}$);
if $\beta \le \sqrt{\gamma}$, $\lambda_{\max}(\hat{Y}) \to (1+\sqrt{\gamma})^2$ almost surely, and $\langle v, x \rangle^2 \to 0$ almost surely.
We define the signal-to-noise ratio (SNR) by $\lambda$ in the Wigner model, and by $\lambda := \beta/\sqrt{\gamma} = \beta\sqrt{N/n}$ in the Wishart model.
Theorems 1.1 and 1.2 then characterize the performance of PCA in terms of $\lambda$. Namely, thresholding the largest eigenvalue succeeds at strong detection when $\lambda > 1$ and fails when $\lambda < 1$; similarly, the top eigenvector succeeds at weak recovery when $\lambda > 1$ and fails when $\lambda < 1$. For some distributions of $x$, including the spherical prior ($x$ drawn uniformly from the unit sphere) and the Rademacher prior (each entry drawn i.i.d. from $\{\pm 1/\sqrt{n}\}$), it is known that the PCA threshold is optimal, in the sense that strong detection and weak recovery are statistically impossible (for any test or estimator, regardless of computational cost) when $\lambda < 1$ [OMH13, MRZ15, DAM16, BMV18, PWBM18b].
1.3 Sparse PCA
Sparse PCA, a direction initiated by Johnstone and Lu [JL04, JL09], seeks to improve performance when the planted vector is known to be sparse in a given basis. While various sparsity assumptions have been considered, a simple and illustrative one is to take $x$ drawn from the sparse Rademacher prior, denoted $\mathcal{X}_\rho$, in which each entry $x_i$ is distributed independently as
$$x_i = \begin{cases} +1/\sqrt{\rho n} & \text{with probability } \rho/2, \\ 0 & \text{with probability } 1 - \rho, \\ -1/\sqrt{\rho n} & \text{with probability } \rho/2, \end{cases}$$
for some known sparsity parameter $\rho \in (0, 1]$, which may depend on $n$.
Consider the Wishart model (we will see that the Wigner model shares essentially the same behavior) in the regime $n \to \infty$ with $\gamma$ fixed and $\beta < \sqrt{\gamma}$ fixed (so that ordinary PCA fails at weak recovery). The simple diagonal thresholding algorithm proposed by [JL09] estimates the support of $x$ by identifying the $k = \rho n$ largest diagonal entries of $\hat{Y}$. This achieves exact support recovery under the condition $\rho \lesssim 1/\sqrt{n \log n}$ (up to constant factors depending on $\beta$ and $\gamma$) [AW08].
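Diagonal thresholding is simple enough to sketch in full. The following toy implementation is ours (the function name and parameter values are illustrative, and the signal here is far stronger than the hard regime, so that the method succeeds visibly); it exploits the fact that $\mathbb{E}[\hat{Y}_{ii}] = 1 + \beta x_i^2$ is inflated exactly on the support:

```python
import numpy as np

def diagonal_thresholding(samples, k):
    """Estimate supp(x) as the indices of the k largest diagonal entries of
    the sample covariance matrix (diagonal entries are inflated on the support)."""
    N = samples.shape[0]
    Yhat = samples.T @ samples / N
    return set(map(int, np.argsort(np.diag(Yhat))[-k:]))

# Toy planted Wishart instance (strong signal, for illustration only).
rng = np.random.default_rng(1)
n, k, beta, N = 200, 5, 10.0, 400
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)   # unit-norm, k-sparse
# Sample y = g + sqrt(beta)*xi*x with g ~ N(0,I), xi ~ N(0,1): cov = I + beta*x x^T.
samples = rng.normal(size=(N, n)) + np.sqrt(beta) * rng.normal(size=(N, 1)) * x
print(diagonal_thresholding(samples, k) == set(map(int, support)))
```

With these parameters the on-support diagonal entries concentrate near $1 + \beta/k = 3$ versus $1$ off-support, so the support is recovered with large margin.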
On the other hand, given unlimited computational power, an exhaustive search over all possible support sets of size $k = \rho n$ achieves exact support recovery under the much weaker assumption $k \lesssim n/\log n$ [PJ12, VL12, CMW13]. Similarly, strong detection and weak recovery are statistically possible even when $\lambda$ is a sufficiently small constant (depending on $\rho$) [BMV18, PWBM18b], and the precise critical constant is given by the replica formula from statistical physics (see, e.g., [LKZ15a, LKZ15b, KXZ16, DMK16, LM19, Mio17, EKJ17, EK18, EKJ18], or [Mio18] for a survey). However, no polynomial-time algorithm is known to succeed (for any reasonable notion of success) when $\rho \gg 1/\sqrt{n}$, despite extensive work on algorithms for sparse PCA [dGJL05, ZHT06, MWA06, dBG08, AW08, WTH09, BR13b, DM14a, KNV15, BPP18].
In fact, a growing body of theoretical evidence suggests that no polynomial-time algorithm can succeed when $\rho \gg 1/\sqrt{n}$ [BR13a, CMW13, MW15, WBS16, HKP17, BBH18, BB19]. Such evidence takes the form of reductions [BR13a, WBS16, BBH18, BB19] from the planted clique problem (which is widely conjectured to be hard in certain regimes [Jer92, DM15, MPW15, BHK19]), as well as lower bounds against the sum-of-squares hierarchy of convex relaxations [MW15, HKP17]. Thus, we expect sparse PCA to exhibit a large “possible but hard” regime $1/\sqrt{n} \ll \rho \ll 1$. Statistical-to-computational gaps of this kind are believed to occur and have been studied extensively in many other statistical inference problems, such as community detection in the stochastic block model [DKMZ11b, DKMZ11a] and tensor PCA [RM14, HSS15, HKP17].
1.4 Our Contributions
In this paper, we investigate precisely how hard the “hard” region ($1/\sqrt{n} \ll \rho \ll 1$) is in sparse PCA. We consider subexponential-time algorithms, i.e., algorithms with runtime $\exp(n^{\delta})$ for fixed $\delta \in (0,1)$. We show a smooth tradeoff between sparsity (governed by $\rho$) and runtime (governed by $\delta$). More specifically, our results (for both the Wishart and Wigner models) are as follows.
Algorithms. For any $1 \le t \le k$ (where $k = \rho n$), we give an algorithm with runtime $n^{O(t)}$ that achieves exact support recovery, provided $\lambda \gtrsim \sqrt{k^2/(tn)}$ up to logarithmic factors.
Lower bounds. Through an analysis of the low-degree likelihood ratio (see Section 1.5), we give formal evidence suggesting that the above condition is essentially tight, in the sense that no algorithm of runtime $n^{\tilde{o}(t)}$ can succeed when $\lambda \lesssim \sqrt{k^2/(tn)}$. (Our results are sharper than the sum-of-squares lower bounds of [HKP17] in that we pin down the precise exponent in this tradeoff.)
Our algorithm involves exhaustive search over subsets of $[n]$ of cardinality $t$. The case $t = 1$ is diagonal thresholding (which is polynomial-time and succeeds when $\rho \lesssim 1/\sqrt{n}$) and the case $t = k$ is exhaustive search over all possible spikes (which requires time $n^{O(k)}$ and succeeds when the problem is statistically possible). As $t$ varies in the range $1 \le t \le k$, our algorithm interpolates smoothly between these two extremes. For a given $\rho$ in the range $1/\sqrt{n} \ll \rho \ll 1$, the smallest admissible choice of $t$ is roughly $t = \rho^2 n$, yielding an algorithm of runtime $\exp(\tilde{O}(\rho^2 n))$.
Our results extend to the case $\lambda = o(1)$, e.g., $\lambda = n^{-c}$ for some constant $c > 0$. In this case, provided $\lambda \gtrsim \sqrt{\rho}$ up to logarithmic factors (which is information-theoretically necessary [PJ12, VL12, CMW13]), there is an $\exp(n^{\delta})$-time algorithm if $\rho \lesssim \lambda\, n^{(\delta-1)/2}$, and the low-degree likelihood ratio again suggests that this is optimal. In other words, for a given $\rho$ in this range, we can solve sparse PCA in time $\exp(\tilde{O}(\rho^2 n / \lambda^2))$.
The analysis of our algorithm applies not just to the sparse Rademacher spike prior, but also to a weaker set of assumptions on the spike that do not require all of the nonzero entries to have the same magnitude. Our algorithm is guaranteed (with high probability) to exactly recover both the support of $x$ and the signs of the nonzero entries of $x$. Once the support is known, it is straightforward to estimate $x$ via the leading eigenvector of the appropriate submatrix.
In independent work [HSV19], a different algorithm was proposed and shown to have essentially the same subexponential runtime as ours.
Certain problems besides sparse PCA have a similar smooth tradeoff between subexponential runtime requirements and statistical power. These include refuting random constraint satisfaction problems [RRS17] and tensor PCA [BGG16, BGL16, WEM19]. In contrast, other problems have a sharp threshold at which they transition from being solvable in polynomial time to (conjecturally) requiring essentially exponential time $\exp(n^{1-o(1)})$. Examples of this behavior occur at the spectral transition $\lambda = 1$ in the spiked Wishart and Wigner matrix models (see [BKW19, KWB19]), as well as at the Kesten–Stigum threshold in the stochastic block model (see [DKMZ11b, DKMZ11a, HS17, Hop18]).
1.5 Background on the Low-Degree Likelihood Ratio
A sequence of recent work on the sum-of-squares hierarchy [BHK19, HS17, HKP17, Hop18] has led to the development of a remarkably simple method for predicting the amount of computation time required to solve statistical tasks. This method—which we will refer to as the low-degree method—is based on analyzing the so-called low-degree likelihood ratio, and is believed to be intimately connected to the power of sum-of-squares (although formal implications have not been established). We now give an overview of this method; see [Hop18, KWB19] for more details.
We will consider the problem of distinguishing two simple hypotheses $\mathbb{P}_n$ and $\mathbb{Q}_n$, which are probability distributions on some domain $\mathbb{R}^{m}$ with $m = m(n)$. The idea of the low-degree method is to explore whether there is a low-degree polynomial $f : \mathbb{R}^{m} \to \mathbb{R}$ that can distinguish $\mathbb{P}_n$ from $\mathbb{Q}_n$.
We call $\mathbb{Q}_n$ the “null” distribution, which for us will always be i.i.d. Gaussian (see Definitions 2.4 and 2.17). $\mathbb{Q}_n$ induces an inner product on functions given by $\langle f, g \rangle := \mathbb{E}_{y \sim \mathbb{Q}_n}[f(y) g(y)]$, and a norm $\|f\| := \langle f, f \rangle^{1/2}$. For $D \in \mathbb{N}$, let $\mathbb{R}[y]_{\le D}$ denote the multivariate polynomials of degree at most $D$. For $f : \mathbb{R}^m \to \mathbb{R}$, let $f^{\le D}$ denote the orthogonal projection (with respect to $\langle \cdot, \cdot \rangle$) of $f$ onto $\mathbb{R}[y]_{\le D}$. The following result then relates the distinguishing power of low-degree polynomials (in a certain sense) to the low-degree likelihood ratio.
Let $\mathbb{P}_n$ and $\mathbb{Q}_n$ be probability distributions on $\mathbb{R}^{m}$. Suppose $\mathbb{P}_n$ is absolutely continuous with respect to $\mathbb{Q}_n$, so that the likelihood ratio $L_n = \frac{d\mathbb{P}_n}{d\mathbb{Q}_n}$ is defined. Then
$$\max_{f \in \mathbb{R}[y]_{\le D}} \frac{\mathbb{E}_{y \sim \mathbb{P}_n}[f(y)]}{\sqrt{\mathbb{E}_{y \sim \mathbb{Q}_n}[f(y)^2]}} = \|L_n^{\le D}\|. \qquad (3)$$
(The proof is straightforward: the fraction on the left can be written as $\langle f, L_n \rangle / \|f\|$, so the maximizer is $f = L_n^{\le D}$.) The left-hand side of (3) is a heuristic measure of how well degree-$D$ polynomials can distinguish $\mathbb{P}_n$ from $\mathbb{Q}_n$: if this quantity is $O(1)$ as $n \to \infty$, this suggests that no degree-$D$ polynomial can achieve strong detection (and indeed this is made formal by Theorem 4.3 of [KWB19]). The right-hand side of (3) is the norm of the low-degree likelihood ratio (LDLR), which can be computed or bounded in many cases, making this heuristic a practical tool for predicting computational feasibility of hypothesis testing.
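As a concrete one-dimensional toy illustration (ours, not from the paper) of projecting a likelihood ratio onto low-degree polynomials: for $\mathbb{P} = \mathcal{N}(\mu, 1)$ versus $\mathbb{Q} = \mathcal{N}(0, 1)$, expanding $L(y) = \exp(\mu y - \mu^2/2)$ in the probabilists' Hermite polynomials $\mathrm{He}_d$ (orthogonal under $\mathbb{Q}$ with $\mathbb{E}_{\mathbb{Q}}[\mathrm{He}_d^2] = d!$) gives $\langle L, \mathrm{He}_d \rangle = \mu^d$ and hence $\|L^{\le D}\|^2 = \sum_{d \le D} \mu^{2d}/d!$. The script below checks this by quadrature:

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, HermiteE

mu, D = 0.7, 4

# Gauss-Hermite quadrature for E_{y~N(0,1)}[f(y)]; hermegauss uses weight e^{-y^2/2},
# so dividing the weights by sqrt(2*pi) makes them sum to 1.
nodes, weights = hermegauss(60)
weights = weights / np.sqrt(2 * np.pi)

L = np.exp(mu * nodes - mu**2 / 2)               # likelihood ratio at the nodes
norm_sq = 0.0
for d in range(D + 1):
    He_d = HermiteE([0] * d + [1])(nodes)        # He_d evaluated at the nodes
    coeff = np.sum(weights * L * He_d)           # <L, He_d> under Q; equals mu^d
    norm_sq += coeff**2 / factorial(d)           # normalize: ||He_d||^2 = d!
closed_form = sum(mu ** (2 * d) / factorial(d) for d in range(D + 1))
print(round(norm_sq, 6), round(closed_form, 6))  # the two values agree
```

The same Hermite-expansion mechanics (in many variables) underlie LDLR computations for the Gaussian models in this paper.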
The key assumption underlying the low-degree method is that, for many natural distributions $\mathbb{P}_n$ and $\mathbb{Q}_n$, degree-$D$ polynomials are as powerful as algorithms of runtime $n^{\tilde{\Theta}(D)}$. This is captured by the following informal conjecture, which is based on [HS17, HKP17, Hop18]; in particular, see Hypothesis 2.1.5 of [Hop18].
Conjecture 1.5 (Informal).
Suppose $\delta \in (0,1)$. For “nice” sequences of distributions $\mathbb{P}_n$ and $\mathbb{Q}_n$, if $\|L_n^{\le D}\|$ remains bounded as $n \to \infty$ whenever $D = D(n) \le n^{\delta}\,\mathrm{polylog}(n)$, then there exists no sequence of functions $f_n$ with $f_n$ computable in time $\exp(n^{\delta})$ that strongly distinguishes $\mathbb{P}_n$ and $\mathbb{Q}_n$, i.e., that satisfies
On a finer scale, it is conjectured [HS17, Hop18] that if for some $\varepsilon > 0$ we have $D \ge (\log n)^{1+\varepsilon}$ and $\|L_n^{\le D}\| = O(1)$, then no polynomial-time algorithm can strongly distinguish $\mathbb{P}_n$ from $\mathbb{Q}_n$. In practice, it seems that the converse of Conjecture 1.5 often holds as well, in the sense that if $\|L_n^{\le D}\| \to \infty$ for some $D = D(n)$, then there is an $\exp(\tilde{O}(D))$-time distinguishing algorithm (however, see Remark 2.16 for one caveat).
Calculations with the LDLR have been carried out for problems such as community detection [HS17, Hop18], planted clique [BHK19, Hop18], the spiked Wishart model [BKW19], the spiked Wigner model [KWB19], and tensor PCA [HKP17, Hop18, KWB19] (tensor PCA exhibits a subexponential-time tradeoff similar to sparse PCA; see [KWB19]). In all of the above cases, the low-degree predictions coincide with widely-conjectured statistical-versus-computational tradeoffs.
Low-degree lower bounds of the form $\|L_n^{\le D}\| = O(1)$ are known to formally imply lower bounds (in a particular sense) against a general class of spectral methods; see Theorem 4.4 of [KWB19]. The low-degree predictions are also conjectured to coincide with the power of the sum-of-squares hierarchy and are in particular connected to the pseudo-calibration approach [BHK19]; see [HKP17, RSS18, Hop18]. We refer the reader to Section 4 of [KWB19] for further discussion of the implications (both formal and conjectural) of low-degree lower bounds.
Conjecture 1.5 is informal in the sense that we have not made precise the meaning of “nice.” Roughly speaking, highly-symmetric high-dimensional problems are considered “nice” so long as $\mathbb{P}_n$ and $\mathbb{Q}_n$ have at least a small amount of noise in order to rule out brittle high-degree algorithms such as Gaussian elimination. (In particular, we consider spiked Wigner and Wishart to be “nice.”) Conjecture 2.2.4 of [Hop18] is one formal variant of the low-degree conjecture, although it uses the more refined notion of coordinate degree and so does not apply to the calculations in this paper.
We remark that if $\|L_n\| = O(1)$ (the $D = \infty$ case), then it is statistically impossible to strongly distinguish $\mathbb{P}_n$ and $\mathbb{Q}_n$; this is a commonly-used second moment method (see, e.g., [MRZ15, BMV18, PWBM18b]) of which Conjecture 1.5 is a computationally-bounded analogue.
In this paper we give tight computational lower bounds for sparse PCA, conditional on Conjecture 1.5. Alternatively, one can view the results of this paper as a “stress test” for Conjecture 1.5: we show that Conjecture 1.5 predicts a certain statistical-versus-computational tradeoff and this indeed matches the best algorithms that we know.
The remainder of the paper is organized as follows. In Section 2, we present our subexponential-time algorithms and our lower bounds based on the low-degree likelihood ratio. In Section 3, we give proofs for the correctness of our algorithms. In Section 4, we give proofs for our analysis of the low-degree likelihood ratio.
We use standard asymptotic notation $o(\cdot)$, $O(\cdot)$, $\omega(\cdot)$, $\Omega(\cdot)$, always pertaining to the limit $n \to \infty$. We also use $\tilde{O}(f)$ to mean $O(f \cdot \mathrm{polylog}(n))$ and $\tilde{\Omega}(f)$ to mean $\Omega(f / \mathrm{polylog}(n))$. Also recall that $f \ll g$ means $f/g \to 0$ as $n \to \infty$ and $f \gg g$ means $f/g \to \infty$ as $n \to \infty$. An event occurs with high probability if it occurs with probability $1 - o(1)$. We sometimes use the shorthand $f \lesssim g$ to mean $f \le Cg$ for an absolute constant $C$, and the shorthand $f \gtrsim g$ to mean $g \lesssim f$.
2 Main Results
In the analysis of our algorithms, we consider the spiked Wishart and Wigner models with a signal $x$ satisfying the following properties.
For $\rho \in (0, 1]$ and $A \ge 1$, a vector $x \in \mathbb{R}^n$ is called $(\rho, A)$-sparse if
$\|x\| = 1$ and $|\mathrm{supp}(x)| = \rho n$, and
for any $i \in \mathrm{supp}(x)$, $\frac{1}{A\sqrt{\rho n}} \le |x_i| \le \frac{A}{\sqrt{\rho n}}$.
Here we have used the standard notations $[n] = \{1, 2, \dots, n\}$ and $\mathrm{supp}(x) = \{i \in [n] : x_i \ne 0\}$. We assume that $\rho$ (which may depend on $n$) is chosen so that $\rho n$ is an integer.
A lower bound on the magnitude of the nonzero entries of $x$ is essential for exact support recovery, since we cannot hope to distinguish tiny nonzero entries of $x$ from zero entries. The upper bound on $|x_i|$ is a technical condition that is likely not essential, and is only used for recovery in the Wishart model (Theorem 2.10).
In our calculations of the low-degree likelihood ratio, we instead assume the signal is drawn from the sparse Rademacher distribution, defined as follows.
The sparse Rademacher prior $\mathcal{X}_\rho$ with sparsity $\rho \in (0,1]$ is the distribution on $\mathbb{R}^n$ whereby $x$ has i.i.d. entries distributed as
$$x_i = \begin{cases} \pm 1/\sqrt{\rho n} & \text{with probability } \rho/2 \text{ each}, \\ 0 & \text{with probability } 1 - \rho. \end{cases}$$
Note that $x$ has $\|x\| \to 1$ in probability as $n \to \infty$.
2.1 The Wishart Model
We first present our results for the Wishart model. Our algorithms and results for the Wigner model are essentially identical and can be found in Section 2.2.
Definition 2.4 (Spiked Wishart model).
The spiked Wishart model with parameters $n, N \in \mathbb{N}$, $\beta \ge -1$, and planted signal $x \in \mathbb{R}^n$ is defined as follows.
Under the planted distribution $\mathbb{P}_n$, we observe $N$ independent samples $y_1, \dots, y_N \sim \mathcal{N}(0, I_n + \beta xx^\top)$.
Under the null distribution $\mathbb{Q}_n$, we observe $N$ independent samples $y_1, \dots, y_N \sim \mathcal{N}(0, I_n)$.
We will sometimes specify a prior for $x$, in which case $\mathbb{P}_n$ first draws $x$ from this prior and then draws the samples as above.
We first consider the detection problem, where the goal is to determine whether the given data was drawn from $\mathbb{P}_n$ or $\mathbb{Q}_n$.
The detection algorithm is motivated by the fact that $\langle u, y \rangle \sim \mathcal{N}(0, \|u\|^2)$ for any fixed $u$ when $y \sim \mathcal{N}(0, I_n)$. Under the planted model, $\langle u, y \rangle \sim \mathcal{N}(0, \|u\|^2 + \beta \langle u, x \rangle^2)$, and thus the empirical second moment $\frac{1}{N}\sum_{i=1}^{N} \langle u, y_i \rangle^2$ is inflated for any fixed $u$ correlated with $x$; as a result, if a sparse sign vector $u$ correctly “guesses” $t$ entries of the support of $x$ with correct signs (up to a global flip), then the contribution of $\beta \langle u, x \rangle^2$ to the variance of $\langle u, y_i \rangle$ will cause this statistic to be large.
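A brute-force rendering of this search step might look as follows. This is our sketch, not Algorithm 1 verbatim (the exact normalizations and thresholds of the paper's algorithm are not reproduced, and the toy parameters are ours); it compares the maximized statistic under null and planted data:

```python
import numpy as np
from itertools import combinations, product

def max_projection_statistic(samples, t):
    """Brute force over vectors u with t nonzero entries in {+1,-1}: return the
    largest empirical second moment (1/N) sum_i <u, y_i>^2. Cost ~ C(n,t) * 2^t."""
    N, n = samples.shape
    best = 0.0
    for S in combinations(range(n), t):
        sub = samples[:, S]                          # N x t submatrix
        for signs in product([1.0, -1.0], repeat=t):
            proj = sub @ np.array(signs)
            best = max(best, float(np.mean(proj ** 2)))
    return best

rng = np.random.default_rng(2)
n, k, t, beta, N = 12, 6, 3, 20.0, 300
x = np.zeros(n)
x[:k] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)   # unit-norm, k-sparse spike
null = rng.normal(size=(N, n))
planted = rng.normal(size=(N, n)) + np.sqrt(beta) * rng.normal(size=(N, 1)) * x
print(max_projection_statistic(null, t), max_projection_statistic(planted, t))
```

Under the null the statistic concentrates near $t$ (here 3, plus small fluctuations from the maximization), while a correctly-guessed $u$ contributes variance $t + \beta \langle u, x \rangle^2$ (here about 33), so the two cases separate sharply at these toy sizes.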
Remark 2.5 (Runtime).
The runtime of Algorithm 1 is dominated by exhaustive search over the candidate vectors $u$ during Step 3. Since the number of $t$-sparse sign vectors is $\binom{n}{t} 2^t \le n^{O(t)}$, the runtime is $n^{O(t)}$ (times polynomial factors). If $t = n^{\delta}$ for a constant $\delta \in (0,1)$, then the runtime is $\exp(\tilde{O}(n^{\delta}))$.
Theorem 2.6 (Wishart detection).
Consider the spiked Wishart model with a $(\rho, A)$-sparse signal $x$, and let $k = \rho n$. Let the data be drawn from either $\mathbb{P}_n$ or $\mathbb{Q}_n$, and let $T$ be the output of Algorithm 1. Suppose
Let $t$ be any integer in the interval
where the last inequality follows from (7).
Since the runtime is $n^{O(t)}$, for the best possible runtime we should choose $t$ as small as possible, i.e.,
We are primarily interested in the regime $n \to \infty$ with $\gamma$ fixed and $\beta$ fixed (so that the SNR $\lambda$ is a constant), and either $\rho$ a small constant, or $\rho \to 0$ with $k = \rho n \to \infty$. In this case, the requirement (6) reads $\lambda \gtrsim \sqrt{\rho}$ up to log factors, which is information-theoretically necessary up to log factors [PJ12, VL12, CMW13]. Choosing $t$ as in Remark 2.7 yields an algorithm of runtime $\exp(\tilde{O}(\rho^2 n / \lambda^2))$.
For $S \subseteq [n]$, let $\hat{Y}_S$ denote the corresponding principal submatrix of $\hat{Y}$ (i.e., restrict $\hat{Y}$ to the rows and columns whose indices lie in $S$). An alternative detection algorithm would be to threshold the test statistic
$$T' = \max_{S \subseteq [n],\, |S| = t} \lambda_{\max}(\hat{Y}_S),$$
i.e., the largest eigenvalue of any $t \times t$ principal submatrix. One can obtain similar guarantees for this algorithm as for Algorithm 1.
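A sketch of this variant (ours; the toy parameters are illustrative) scans all principal submatrices of a given size:

```python
import numpy as np
from itertools import combinations

def max_submatrix_eigenvalue(Yhat, t):
    """Largest eigenvalue over all t x t principal submatrices of Yhat.
    Like the sign-vector search, but relaxing u to arbitrary vectors
    supported on t coordinates."""
    best = -np.inf
    for S in combinations(range(Yhat.shape[0]), t):
        sub = Yhat[np.ix_(S, S)]
        best = max(best, float(np.linalg.eigvalsh(sub)[-1]))
    return best

rng = np.random.default_rng(3)
n, k, t, beta, N = 12, 6, 3, 20.0, 300
x = np.zeros(n)
x[:k] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
null = rng.normal(size=(N, n))
planted = rng.normal(size=(N, n)) + np.sqrt(beta) * rng.normal(size=(N, 1)) * x
Y0 = null.T @ null / N          # null sample covariance
Y1 = planted.T @ planted / N    # planted sample covariance
print(max_submatrix_eigenvalue(Y0, t), max_submatrix_eigenvalue(Y1, t))
```

At these toy sizes the null value stays close to 1 while the planted value is pushed up by the restricted spike, so the two cases again separate clearly.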
We now turn to the problem of exactly recovering the support and signs of $x$, given data drawn from $\mathbb{P}_n$. The goal is to output a vector $v \in \{+1, 0, -1\}^n$ such that $v = \pm \operatorname{sgn}(x)$, where $\operatorname{sgn}(\cdot)$ is applied entrywise, with $\operatorname{sgn}(z) = z/|z|$ for $z \neq 0$ and $\operatorname{sgn}(0) = 0$.
Note that we can only hope to recover $x$ up to a global sign flip, because the signals $x$ and $-x$ give rise to the same planted distribution.
For technical reasons, we divide our samples into two subsamples of size $\lfloor N/2 \rfloor$ (with one sample discarded if $N$ is odd) and produce two independent sample covariance matrices $\hat{Y}^{(1)}$ and $\hat{Y}^{(2)}$. The first step of the algorithm is similar to the detection algorithm: by exhaustive search over $t$-sparse sign vectors, we find the vector $u$ maximizing $u^\top \hat{Y}^{(1)} u$. In the course of proving that the algorithm succeeds, we will show that $u$ has nontrivial correlation with $x$. The second step is to recover the support (and signs) of $x$ by thresholding $\hat{Y}^{(2)} u$. Note that $\hat{Y}^{(2)} u$ discards (i.e., does not depend on) the columns of $\hat{Y}^{(2)}$ that do not lie in $\operatorname{supp}(u)$; since $u$ has substantial overlap with $x$, this serves to amplify the signal.
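The two-stage procedure can be sketched as follows. This is our simplified rendering, not Algorithm 2 verbatim: the thresholding step here simply keeps the $k$ largest-magnitude coordinates of $\hat{Y}^{(2)} u$, and the toy parameters put the instance well inside the easy regime:

```python
import numpy as np
from itertools import combinations, product

def best_sign_vector(Yhat, t):
    """Stage 1: t-sparse sign vector u maximizing u^T Yhat u (brute force)."""
    n = Yhat.shape[0]
    best, best_u = -np.inf, None
    for S in combinations(range(n), t):
        for signs in product([1.0, -1.0], repeat=t):
            u = np.zeros(n)
            u[list(S)] = signs
            val = float(u @ Yhat @ u)
            if val > best:
                best, best_u = val, u
    return best_u

def recover_signs(samples, t, k):
    """Two-stage support/sign recovery sketch: split the samples, search on the
    first half, then threshold Yhat2 @ u on the second half."""
    N = samples.shape[0]
    A, B = samples[: N // 2], samples[N // 2 :]
    Y1 = A.T @ A / (N // 2)
    Y2 = B.T @ B / (N - N // 2)
    u = best_sign_vector(Y1, t)
    scores = Y2 @ u                                  # amplified on supp(x)
    support = np.argsort(np.abs(scores))[-k:]        # keep k largest magnitudes
    v = np.zeros_like(scores)
    v[support] = np.sign(scores[support])
    return v

rng = np.random.default_rng(4)
n, k, t, beta, N = 12, 6, 3, 20.0, 600
x = np.zeros(n)
x[:k] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
samples = rng.normal(size=(N, n)) + np.sqrt(beta) * rng.normal(size=(N, 1)) * x
v = recover_signs(samples, t, k)
print(np.array_equal(v, np.sign(x)) or np.array_equal(v, -np.sign(x)))
```

The key point visible in the sketch: $(\hat{Y}^{(2)} u)_j$ has mean roughly $\beta x_j \langle x, u \rangle$ on the support and mean zero off it, so once $u$ is even partially aligned with $x$, the support coordinates dominate.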
Theorem 2.10 (Wishart support and sign recovery).
As for detection, the runtime of Algorithm 2 is $n^{O(t)}$, and we can minimize this by choosing $t$ as in Remark 2.7.
Once we obtain $\pm\operatorname{sgn}(x)$ using Algorithm 2, it is straightforward to estimate $x$ (up to global sign flip) using the leading eigenvector of the appropriate submatrix of $\hat{Y}$. This step of the algorithm requires only polynomial time.
Theorem 2.12 (Wishart recovery).
Consider the planted spiked Wishart model with an arbitrary $(\rho, A)$-sparse signal $x$, and let $k = \rho n$. Suppose we have access (e.g., via Algorithm 2) to $S = \operatorname{supp}(x)$. Let $\hat{Y}_S$ denote the $k \times k$ principal submatrix of $\hat{Y}$ indexed by $S$, and let $\hat{x}$ denote the unit-norm eigenvector corresponding to the maximum eigenvalue of $\hat{Y}_S$. Then, there exists an absolute constant $C$ such that, for any $\varepsilon > 0$,
Now, we turn to controlling the low-degree likelihood ratio (LDLR) (see Section 1.5) to provide rigorous evidence that the above algorithms are optimal. In this section we take a fully Bayesian approach, and assume that the planted signal $x$ is drawn from the sparse Rademacher prior $\mathcal{X}_\rho$. Recall that the signal-to-noise ratio is defined as $\lambda = \beta\sqrt{N/n} = \beta/\sqrt{\gamma}$.
As discussed in Section 1.5, we will determine the behavior of $\|L_n^{\le D}\|$ in the limit $n \to \infty$: if $\|L_n^{\le D}\| = O(1)$ for $D = \tilde{\Omega}(n^{\delta})$, this suggests hardness for $\exp(n^{\delta})$-time algorithms. We allow the parameters to depend on $n$, which we sometimes emphasize by writing, e.g., $\rho = \rho(n)$. For $\delta \in (0,1)$, our results suggest hardness for $\exp(n^{\delta})$-time algorithms whenever $\lambda$ is bounded above by a small constant and $\rho \gg \lambda\, n^{(\delta-1)/2}$ (up to logarithmic factors). This is essentially tight, matching PCA (which succeeds when $\lambda > 1$) and our algorithm with $t = n^{\delta}$ (which succeeds when $\rho \lesssim \lambda\, n^{(\delta-1)/2}$, up to logarithmic factors) (however, see Remark 2.16 below for one caveat).
Theorem 2.14 (Boundedness of LDLR for large ).
Under the spiked Wishart model with spike prior $\mathcal{X}_\rho$, suppose $D = D(n) \in \mathbb{N}$. If one of the following holds for sufficiently large $n$:
then, as $n \to \infty$, $\|L_n^{\le D}\| = O(1)$.
The following result on divergence of the LDLR serves as a sanity check: we show that $\|L_n^{\le D}\|$ indeed diverges in the regime where we know that an $\exp(\tilde{O}(D))$-time algorithm exists.
Theorem 2.15 (Divergence of LDLR for small ).
Under the spiked Wishart model with spike prior $\mathcal{X}_\rho$, suppose $\rho = \rho(n)$ and $D = D(n) \in \mathbb{N}$. If one of the following holds:
, and for sufficiently large $n$,
where $C$ is an absolute constant,
then, as $n \to \infty$, $\|L_n^{\le D}\| \to \infty$.
There is one regime where the above results give some unexpected behavior. Recall first that optimal Bayesian inference for sparse PCA can be performed in time $\exp(\tilde{O}(\rho n))$ by computing the likelihood ratio. Thus if $\|L_n^{\le D}\| = O(1)$ for some $D = \tilde{\Omega}(\rho n)$, this suggests that the problem is information-theoretically impossible; from our results above, there are regimes where this occurs (and indeed the problem is information-theoretically impossible), yet $\|L_n^{\le D'}\| \to \infty$ for some larger $D'$ (which incorrectly suggests that there should be an algorithm). This is analogous to a phenomenon where the second moment of the (non-low-degree) likelihood ratio can sometimes diverge even when strong detection is impossible (see, e.g., [BMNN16, BMV18, PWBM18b]). Luckily, this issue never occurs for us in the regime of interest $D = \tilde{O}(\rho n)$, and therefore does not prevent our results from being tight. Note also that none of these observations contradict Conjecture 1.5.
2.2 The Wigner Model
We now state our algorithms and results for the Wigner model. These are very similar to the Wishart case, so we omit some of the discussion.
Definition 2.17 (Spiked Wigner model).
The spiked Wigner model with parameters $n \in \mathbb{N}$, $\lambda \ge 0$, and planted signal $x \in \mathbb{R}^n$ is defined as follows.
Under $\mathbb{P}_n$, we observe the matrix $Y = \lambda xx^\top + W$, where $W \sim \mathrm{GOE}(n)$.
Under $\mathbb{Q}_n$, we observe the matrix $Y = W \sim \mathrm{GOE}(n)$.
Remark 2.18 (Runtime).
Theorem 2.19 (Wigner detection).
Consider the spiked Wigner model with an arbitrary $(\rho, A)$-sparse signal $x$. Let $Y$ be drawn from either $\mathbb{P}_n$ or $\mathbb{Q}_n$, and let $T$ be the output of Algorithm 3. Suppose
Let $t$ be any integer in the interval
where the last inequality follows from (13).
For technical reasons, our first step is to fictitiously “split” the data into two independent copies $Y^{(1)} = Y + Z$ and $Y^{(2)} = Y - Z$, where $Z \sim \mathrm{GOE}(n)$ is drawn independently of $Y$. Note that
$$Y^{(1)} = \lambda xx^\top + (W + Z), \qquad Y^{(2)} = \lambda xx^\top + (W - Z).$$
Since $W + Z$ and $W - Z$ are independent matrices, each distributed as $\sqrt{2}\,W'$ with $W' \sim \mathrm{GOE}(n)$, $Y^{(1)}$ and $Y^{(2)}$ are distributed as independent observations drawn from the spiked Wigner model with the same planted signal $x$ and with effective signal-to-noise ratio $\lambda/\sqrt{2}$.
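This splitting identity is easy to check numerically. The following sketch (ours) verifies the two defining facts: the two noise copies are uncorrelated entrywise, and each entry has doubled variance:

```python
import numpy as np

def goe(n, rng):
    """GOE(n): symmetric, off-diagonal entry variance 1/n, diagonal variance 2/n."""
    G = rng.normal(size=(n, n)) / np.sqrt(n)
    return (G + G.T) / np.sqrt(2)

rng = np.random.default_rng(5)
n, trials = 8, 20000
prods, var1 = [], []
for _ in range(trials):
    W, Z = goe(n, rng), goe(n, rng)
    A, B = W + Z, W - Z               # the two "fictitious" noise copies
    prods.append(A[0, 1] * B[0, 1])   # averages to Var(W_01) - Var(Z_01) = 0
    var1.append(A[0, 1] ** 2)         # averages to 2/n (doubled variance)
print(round(float(np.mean(prods)), 3), round(float(np.mean(var1)) * n / 2, 3))
```

Since the pair $(W + Z, W - Z)$ is jointly Gaussian, zero correlation here is equivalent to full independence of the two copies.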
Theorem 2.20 (Wigner support and sign recovery).
Consider the planted spiked Wigner model with an arbitrary $(\rho, A)$-sparse signal $x$. Suppose