Sparse PCA from Sparse Linear Regression

# Sparse PCA from Sparse Linear Regression

Guy Bresler
MIT
guy@mit.edu
&Sung Min Park
MIT
sp765@mit.edu
&Mădălina Persu
Two Sigma, MIT
mpersu@mit.edu
The views expressed herein are solely the views of the author(s) and are not necessarily the views of Two Sigma Investments, LP or any of its affiliates. They are not intended to provide, and should not be relied upon for, investment advice.
###### Abstract

Sparse Principal Component Analysis (SPCA) and Sparse Linear Regression (SLR) have a wide range of applications and have attracted a tremendous amount of attention in the last two decades as canonical examples of statistical problems in high dimension. A variety of algorithms have been proposed for both SPCA and SLR, but an explicit connection between the two had not been made. We show how to efficiently transform a black-box solver for SLR into an algorithm for SPCA: assuming the SLR solver satisfies prediction error guarantees achieved by existing efficient algorithms such as those based on the Lasso, the SPCA algorithm derived from it achieves near state of the art guarantees for testing and for support recovery for the single spiked covariance model as obtained by the current best polynomial-time algorithms. Our reduction not only highlights the inherent similarity between the two problems, but also, from a practical standpoint, allows one to obtain a collection of algorithms for SPCA directly from known algorithms for SLR. We provide experimental results on simulated data comparing our proposed framework to other algorithms for SPCA.

Sparse PCA from Sparse Linear Regression

Guy Bresler MIT guy@mit.edu Sung Min Park MIT sp765@mit.edu Mădălina Persu Two Sigmathanks: The views expressed herein are solely the views of the author(s) and are not necessarily the views of Two Sigma Investments, LP or any of its affiliates. They are not intended to provide, and should not be relied upon for, investment advice., MIT mpersu@mit.edu

\@float

noticebox[b]32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.\end@float

## 1 Introduction

Principal component analysis (PCA) is a fundamental technique for dimension reduction used widely in data analysis. PCA projects data along a few directions that explain most of the variance of observed data. One can also view this as linearly transforming the original set of variables into a (smaller) set of uncorrelated variables called principal components.

Recent work in high-dimensional statistics has focused on sparse principal component analysis (SPCA), as ordinary PCA estimates become inconsistent in this regime johnstone2012consistency (). In SPCA, we restrict the principal components to be sparse, meaning they have only a few nonzero entries in the original basis. This has the advantage, among others, that the components are more interpretable jolliffe2003modified (); zou2006sparse (), while components may no longer be uncorrelated. We study SPCA under the Gaussian (single) spiked covariance model introduced by johnstone2001distribution (): we observe samples of a random variable distributed according to a Gaussian distribution , where with at most nonzero entries,111Sometimes we will write this latter condition as where is the -ball” of radius . is the identity matrix, and is the signal-to-noise parameter. We study two settings of the problem, hypothesis testing and support recovery.

Sparsity assumptions have played an important role in a variety of other problems in high-dimensional statistics, in particular linear regression. Linear regression is also ill-posed in high dimensions, so via imposing sparsity on the regression vector we recover tractability.

Though the literature on two problems are largely disjoint, there is a striking similarity between the two problems, in particular when we consider statistical and computational trade-offs. The natural information-theoretically optimal algorithm for SPCA berthet2013optimal () involves searching over all possible supports of the hidden spike. This bears resemblance to the minimax optimal algorithm for SLR raskutti2011minimax (), which optimizes over all sparse supports of the regression vector. Both problems appear to exhibit gaps between statistically optimal algorithms and known computationally efficient algorithms, and conditioned on relatively standard complexity assumptions, these gaps seem irremovable berthet2013complexity (); wang2016statistical (); zhang2014lower ().

### 1.1 Our contributions

In this paper we give algorithmic evidence that this similarity is likely not a coincidence. Specifically, we give a simple, general, and efficient procedure for transforming a black-box solver for sparse linear regression to an algorithm for SPCA. At a high level, our algorithm tries to predict each coordinate222From here on, we will use “coordinate” and “variable” interchangeably. linearly from the rest of the coordinates using a black-box algorithm for SLR. The advantages of such a black-box framework are two fold: theoretically, it highlights a structural connection between the two problems; practically, it allows one to simply plug in any of the vast majority of solvers available for SLR and directly get an algorithm for SPCA with provable guarantees. In particular,

• For hypothesis testing: we match state of the art provable guarantee for computationally efficient algorithms; our algorithm successfully distinguishes between isotropic and spiked Gaussian distributions at signal strength . This matches the phase transition of diagonal thresholding johnstone2012consistency () and Minimal Dual Perturbation berthet2013optimal () up to constant factors.

• For support recovery: for general and , when each non-zero entry of is at least (a standard assumption in the literature), our algorithm succeeds with high probability for signal strength , which is nearly optimal.333In the scaling limit as , the covariance thresholding algorithm deshpande2014sparse () theoretically succeeds at a signal strength that is an order of smaller. However, our experimental results indicate that with an appropriate choice of black-box, our algorithm outperforms covariance thresholding

• In experiments, we demonstrate that using popular existing SLR algorithms as our black-box results in reasonable performance.

• We theoretically and empirically illustrate that our SPCA algorithm is also robust to rescaling of the data, for instance by using a Pearson correlation matrix instead of a covariance matrix.444Solving SPCA based on the correlation matrix was suggested in a few earlier works zou2006sparse (); vu2013fantope (). Many iterative methods rely on initialization via first running diagonal thresholding, which filters variables with higher variance; rescaling renders diganoal thresholding useless, so in some sense our framework is more robust.

## 2 Preliminaries

### 2.1 Problem formulation for SPCA

Hypothesis testing Here, we want to distinguish whether is distributed according to an isotropic Gaussian or a spiked covariance model. That is, our null and alternate hypotheses are:

 H0:X∼N(0,Id) and H1:X∼N(0,Id+θuu⊤),

Our goal is to design a test that discriminates and . More precisely, we say that discriminates between and with probability if both type I and II errors have a probability smaller than :

 PH0(ψ(X)=1)≤δ and PH1(ψ(X)=0)≤δ.

We assume the following additional condition on the spike :

1. for at least one where is some constant.

The above condition says that at least one coordinate has enough mass, yet the mass is not entirely concentrated on just that singlecoordinate. Trivially, we always have at least one s.t. , but this is not enough for our regression setup, since we want at least one other coordinate to have sufficient correlation with coordinate . We remark that the above condition is a very mild technical condition. If it were violated, almost all of the mass of is on a single coordinate, so a simple procedure for testing the variance (which is akin to diagonal thresholding) would suffice.

Support recovery The goal of support recovery is to identify the support of from our samples . More precisely, we say that a support recovery algorithm succeeds if the recovered support is the same as , the true support of . As standard in the literature amini2009high (); meinshausen2006high (), we need to assume a minimal bound on the size of the entries of in the support.

For our support recovery algorithm, we will assume the following condition (note that it implies Condition 1 and is much stronger):

1. for some constant

Though the settings are a bit different, this minimal bound along with our results are consistent with lower bounds known for sparse recovery. These lower bounds (fletcher2009necessary (); wainwright2007information (); bound of fletcher2009necessary () is a factor of weaker) imply that the number of samples must grow roughly as where is the smallest entry of our signal normalized by , which is qualitativley the same threhold required by our theorems.

### 2.2 Background on SLR

In linear regression, we observe a response vector and a design matrix that are linked by the linear model , where is some form of observation noise, typically with i.i.d. entries. Our goal is to recover given noisy observations . While the matrices we consider arise from a (random) correlated design (as analyzed in wainwright2007information (), wainwright2009sharp ()), it will make no difference to assume the matrices are deterministic by conditioning, as long as the distribution of the design matrix and noise are independent, which we will demonstrate in our case. Most of the relevant results on sparse linear regression pertain to deterministic design.

In sparse linear regression, we additionally assume that has only non-zero entries, where . This makes the problem well posed in the high-dimensional setting. Commonly used performance measures for SLR are tailored to prediction error ( where is our guess), support recovery (recovering support of ), or parameter estimation (minimizing under some norm). We focus on prediction error, analyzed over random realizations of the noise. There is a large amount of work on SLR and we defer a more in-depth overview to Appendix A.

Most efficient methods for SLR impose certain conditions on . We focus on the restricted eigenvalue condition, which roughly stated makes the prediction loss strongly convex near the optimum:

###### Definition 2.1 (Restricted eigenvalue zhang2014lower ()).

First define the cone , where denotes the complement, is restricted to the subset . The restricted eigenvalue (RE) constant of , denoted , is defined as the largest constant s.t.

 \nicefrac1n∥Xβ∥22≥γ∥β∥22for all β∈⋃|S|=k,S⊆[d]C(S)

For more discussion on the restricted eigenvalue, see Appendix A.

Black-box condition Given the known guarantees on SLR, we define a condition that is natural to require on the guarantee of our SLR black-box, which is invoked as .

###### Condition 2.2 (Black-box condition).

Let denote the restricted eigenvalue of . There are universal constants such that outputs that is -sparse and satisfies:

 1n∥Xˆβ−Xβ∗∥22≤cγ(X)2(σ2klogd)n∀β∗∈B0(k) w.p. ≥1−c′exp(−c′′klogd)

## 3 Algorithms and main results

We first discuss how to view samples from the spiked covariance model in terms of a linear model. We then give some intuition motivating our statistic. Finally, we state our algorithms and main theorems, and give a high-level sketch of the proof.

### 3.1 The linear model

Let be i.i.d. samples from the spiked covariance model; denote as the matrix whose rows are . Intuitively, if variable is contained in the support of the spike, then the rest of the support should allow to provide a nontrivial prediction for since variables in the support are correlated. Conversely, for not in the support (or under the isotropic null hypothesis), all of the variables are independent and other variables are useless for predicting . So we regress onto the rest of the variables.

Let denote the matrix of samples in the SPCA model with the th column removed. For each column , we can view our data as coming from a linear model with design matrix and the response variable .

The “true” regression vector depends on . Under the alternate hypothesis , if , we can write where and with .555By the theory of linear minimum mean-square-error (LMMSE) confirms that this choice of minimizes the error . See Appendix B.1, B.2 for details of this calculation. If , and for any under the null hypothesis, where (implicitly ).

### 3.2 Designing the test statistic

Based on the linear model above, we want to compute a test statistic that will indicate when a coordinate is on support. Intuitively, we predictive power of our linear model should be higher when is on support. Indeed, a calculation shows that the variance in is reduced by approximately . We want to measure this reduction in noise to detect when is on support or not.

Suppose for instance that we have access to rather than (note that this is not possible in practice since we do not know the support!). Since we want to measure the reduction in noise when the variable is on support, as a first step we might try the following statistic:

 Qi=\nicefrac1n∥y−Xβ∗∥22

Unfortunately, this statistic will not be able to distinguish the two hypotheses, as the reduction in the above error is too small (on the order of compared to overall order of ), so deviation due to random sampling will mask any reduction in noise. We can fix this by adding the variance term :

 Qi=\nicefrac1n∥y∥22−\nicefrac1n∥y−Xβ∗∥22

On a more intuitive level, including allows us to measure the relative gain in predictive power without being penalized by a possibly large variance in . Fluctuations in due to noise will typically be canceled out in the difference of terms in , minimizing the variance of our statistic.

We have to add one final fix to the above estimator. We obviously do not have access to , so we must use the estimate ( are as defined in Section 3.1) which we get from our black-box. As our analysis shows, this substitution does not affect much of the discriminative power of as long as the SLR black-box satisfies prediction error guarantees stated in Condition 2.2. This gives our final statistic:666As pionted out by a reviewer, Note that this statistic is actually equivalent to up to rescaling by sample variance. Note that our formula is slightly different though as we use the sample variance computed with population mean as opposed to sample mean, as the mean is known to be zero.

 Qi=\nicefrac1n∥y∥22−\nicefrac1n∥y−Xˆβ∥22.

### 3.3 Algorithms

Below we give algorithms for hypothesis testing and for support recovery, based on the statistic:

Below we summarize our guarantees for the above algorithms. The proofs are simple, but we defer them to Appendix C.

###### Theorem 3.1 (Hypothesis test).

Assume we have access to that satisfies Condition 2.2 and with runtime per instance. Under Condition 1, there exist universal constants s.t. if and , Algorithm 1 outputs s.t.

 PH0(ψ(X)=1)∨PH1(ψ(X)=0)≤c3exp(−c4klogd)

in time .

###### Theorem 3.2 (Support recovery).

Under the same conditions as above plus Condition 2, if , Algorithm 2 above finds with probability at least in time .

RE for sample design matrix Because population covariance has minimum eigenvalue 1, with high probability the sample design matrix has constant restricted eigenvalue value given enough samples, i.e. is large enough (see Appendix B.3 for more details), and the prediction error guarantee of Condition 2.2 will be good enough for our analysis.

Running time The runtime of both Algorithm 1 and 2 is . The discussion presented at the end of Appendix C details why this is competitive for (single spiked) SPCA, at least theoretically.

Unknown sparsity Throughout the paper we assume that the sparsity level is known. However, if is unknown, standard techniques could be used to adaptively find approximate values of (do2008sparsity ()). For instance, for hypothesis testing, we can start with an initial overestimate , and keep halving until we get enough coordinates with that passes the threshold for the given .

### Robustness of Q statistic to rescaling

Intuitively, our algorithms for detecting correlated structure in data should be invariant to rescaling of the data; the precise scale or units for which one variable is measured should not have an impact on our ability to find meaningful structure underlying the data. Our algorithms based on are robust to rescaling, perhaps unsurprisingly, since correlations between variables in the support remain under rescaling.

On the other hand, diagonal thresholding, an often-used preprocessing step for SPCA which filters variables strictly based on variance, would trivially fail under rescaling. This illustrates a strength of our framework over other existing algorithms for SPCA.

Below we show explicitly that statistics are indeed robust to rescaling: Let be the rescaling of , where is some diagonal matrix. Let be restricted to rows and columns in . Note that , the covariance matrix of the rescaled data, is just by expanding the definition. Similarly, note where denotes without row and column . Now, recall the term which dominated our analysis of under , , which was equal to

 Σ1,2:dΣ−12:dΣ2:d,1

We replace the covariances by their rescaled versions to obtain:

 ~β∗⊤˜Σ~β∗=(D1Σ1,2:dD2:d)D−12:dΣ−12:dD−12:d(D2:dΣ2:d,1D1)=D21(β∗)⊤Σ2:dβ∗

For the spiked covariance model, rescaling variances to one amount to rescaling with . Thus, we see that our signal strength is affected only by constant factor (assuming ).

## 4 Experiments

We test our algorithmic framework on randomly generated synthetic data and compare to other existing algorithms for SPCA. The code was implemented in Python using standard libraries.

We refer to our general algorithm from Section 3 that uses the statistic as SPCAvSLR. For our SLR “black-box,” we use thresholded Lasso zhang2014lower ().777Thresholded Lasso is a variant of lasso where after running lasso, we keep the largest (in magnitude) coefficients to make the estimator k-sparse. Proposition 2 in zhang2014lower () shows that thresholded Lasso satisfies Condition 2.2. (We experimented with other SLR algorithms such as the forward-backward algorithm of foba () and CoSaMP888CoSaMP in theory requires the stronger condition of restricted isometry on the “sensing” matrix. cosamp (), but results were similar and only slower.)

For more details on our experimental setup, including hyperparameter selection, see Appendix D.

Support recovery We randomly generate a spike by first choosing a random support of size , and then using random signs for each coordinate (uniformity is to make sure Condition 2 is met). Then spike is scaled appropriately with to build the spiked covariance matrix of our normal distribution, from which we draw samples.

We study how the performance of six algorithms vary over various values of for fixed and .999We remark that while the size of this dataset might seem too small to be representative “high-dimensional” setting, these are representative of the usual size of dataset that these methods are usually tested on. One bottleneck is the computation of the covariance matrix. As in the deshpande2014sparse (), our measure is the fraction of true support. We compare SPCAvSLR with the following algorithms: diagnoal thresholding, which is a simple baseline; “SPCA” (ZHT zou2006sparse ()) is a fast heuristic also based on the regression idea; the truncated power method of yuan2013truncated (), which is known for both strong theoretical guarantees and empirical performance; covariance thresholding, which has state-of-the-art theoretical guarantees.

We modified each algorithm to return the top most likely coordinates in the support (rather than thresholding based on a cutoff); for algorithms that compute a candidate eigenvector, we choose the top coordinates largest in absolute value.

We observe that SPCAvSLR performs better than covariance thresholding and diagonal thresholding, but its performance falls short of that of the truncated power method and the heuristic algorithm of zou2006sparse (). We suspect the playing with different SLR algorithms may slightly improve its performance. The reason for the gap between performance of SPCAvSLR and other state of the arts algorithms despite its theoretical guarantees is open to further investigation.

Hypothesis testing We generate data under two different distributions: for the spiked covariance model, we generate a spike by sampling a uniformly random direction from the -dimensional unit sphere, and embedding the vector at a random subset of coordinates among coordinates; for the null, we draw from standard isotropic Gaussian. In a single trial, we draw samples from each distribution and we compute various statistics101010While some of the algorithms used for support recovery in the previous section could in theory be adapated for hypothesis testing, the extensions were immediate so we do not consider them here. (diagonal thresholding (DT), Minimal Dual Perturbation (MDP), and our statistic, again using thresholded Lasso). We repeat for trials, and plot the resulting empirical distribution for each statistic. We observe similar performance of DT and , while MDP seems slightly more effective at distinguishing and at the same signal strength (that is, the distributions of the statistics under vs. are more well-separated).

Rescaling variables As discussed in Section 3, our algorithms are robust to rescaling the covariance matrix to the correlation matrix. As illustrated in Figure 2 (right), DT fails while appears to be still effective for distinguishing hypotheses the same regime of parameters. Other methods such as MDP and CT also appear to be robust to such rescaling (not shown). This suggests that more modern algorithms for SPCA may be more appropriate than diagonal thresholding in practice, particularly on instances where the relative scales of the variables may not be accurate or knowable in advance, but we still want to be able to find correlation between the variables.

## 5 Previous work

Here we discuss in more detail previous approaches to SPCA and how it relates to our work. Various approaches to SPCA have been designed in an extensive list of prior work. As we cannot cover all of them, we focus on works that aim to give computationally efficient (i.e. polynomial time) algorithms with provable guarantees in settings similar to ours.

These algorithms include fast, heuristic methods based on minimization jolliffe2003modified (); zou2006sparse (), rigorous but slow methods based on natural semidefinite program (SDP) relaxations d2007direct (); amini2009high (); vu2013fantope (); wang2016statistical (), iterative methods motivated by power methods for approximating eigenvectors yuan2013truncated (); journee2010generalized (), non-iterative methods based on random projections gataric2017sparse (), among others. Many iterative methods rely on initialization schemes, such as ordinary PCA or diagonal thresholding johnstone2012consistency ().

Below, we discuss the known sample bounds for support recovery and hypothesis testing.

Support recovery amini2009high () analyzed both diagonal thresholding and an SDP for support recovery under the spiked covariance model.111111They analyze the subcase when the spike is uniform in all coordinates. They showed that the SDP requires an order of fewer samples when the SDP optimal solution is rank one. However, krauthgamer2013semidefinite () showed that the rank one condition does not happen in general, particularly in the regime approaching the information theoretic limit (). This is consistent with computational lower bounds from berthet2013complexity () (), but a small gap remains (diagonal thresholding and SDP’s succeed only up to ). The above gap was closed by the covariance thresholding algorithm, first suggested by krauthgamer2013semidefinite () and analyzed by deshpande2014sparse (), that succeeds in the regime , although the theoretical guarantee is limited to the regime when due to relying on techniques from random matrix theory.

Hypothesis testing Some works berthet2013optimal (); amini2009high (); d2014approximation () have focused on the problem of detection. In this case, berthet2013optimal () observed that it suffices to work with the much simpler dual of the standard SDP called Minimal Dual Perturbation (MDP). Diagonal thresholding (DT) and MDP work up to the same signal threshold as for support recovery, but MDP seems to outperform DT on simulated data berthet2013optimal (). MDP works at the same signal threshold as the standard SDP relaxation for SPCA. d2014approximation () analyze a statistic based on an SDP relaxation and its approximation ratio to the optimal statistic. In the regime where are proportional to , their statistic succeeds at a signal threshold for that is independent of , unlike the MDP. However, their statistic is quite slow to compute; runtime is at least a high order polynomial in .

Regression based approaches To the best of our knowledge, our work is the first to give a general framework for SPCA that uses SLR in a black-box fashion. zou2006sparse () uses specific algorithms for SLR such as Lasso as a subroutine, but they use a heuristic alternating minimization procedure to solve a non-convex problem, and hence lack any theoretical guarantees. meinshausen2006high () applies a regression based approach to a restricted class of graphical models. While our regression setup is similar, their statistic is different and their analysis depends directly on the particulars of Lasso. Further, their algorithm requires extraneous conditions on the data.cai2013sparse () also uses a reduction to linear regression for their problem of sparse subspace estimation. Their iterative algorithm depends crucially on a good initialization done by a diagonal thresholding-like pre-processing step, which fails under rescaling of the data.121212See Section 3 for more discussion on rescaling. Furthermore, their framework uses regression for the specific case of orthogonal design, whereas our design matrix can be more general as long as it satisfies a condition similar to the restricted eigenvalue condition. On the other hand, their setup allows for more general -based sparsity as well as the estimation of an entire subspace as opposed to a single component. ma2013sparse () also achieves this more general setup, while still suffering from the same initialization problem.

Sparse priors Finally, connections between SPCA and SLR have been noted in the probabilistic setting koyejo2014prior (); khanna2015sparse (), albeit in an indirect manner: the same sparsity-inducing priors can be used for either problem. We view our work as entirely different as we focus on giving a black-box reduction. Furthermore, provable guarantees for the EM algorithm and variational methods are lacking in general, and it is not immediately obvious what signal threshold their algorithm achieves for the single spike covariance model.

## 6 Conclusion

We gave a black-box reduction for reducing instances of the SPCA problem under the spiked covariance model to instances of SLR. Given oracle access to SLR black-box meeting a certain natural condition, the reduction is shown to efficiently solve hypothesis testing and support recovery.

Several directions are open for future work. The work in this paper remains limited to the Gaussian setting and to the single spiked covariance model. Making the results more general would make the connection made here more appealing. Also, the algorithms designed here, though simple, seem a bit wasteful in that they do not aggregate information from different statistics. Designing a more efficient estimator that makes a more efficient use of samples would be interesting. Finally, there is certainly room for improvement by tuning the choice of the SLR black-box to make the algorithm more efficient for use in practice.

## References

• [1] Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pages 2454–2458. IEEE, 2009.
• [2] Afonso S Bandeira, Edgar Dobriban, Dustin G Mixon, and William F Sawin. Certifying the restricted isometry property is hard. IEEE transactions on information theory, 59(6):3448–3450, 2013.
• [3] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory, pages 1046–1066, 2013.
• [4] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.
• [5] Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
• [6] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
• [7] Florentina Bunea, Alexandre B Tsybakov, and Marten H Wegkamp. Aggregation for gaussian regression. The Annals of Statistics, 35(4):1674–1697, 2007.
• [8] Florentina Bunea, Alexandre B Tsybakov, and Marten H Wegkamp. Sparse density estimation with penalties. In Learning theory, pages 530–543. Springer, 2007.
• [9] T Tony Cai, Zongming Ma, and Yihong Wu. Sparse pca: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074–3110, 2013.
• [10] Emmanuel Candes and Terence Tao. The dantzig selector: statistical estimation when is much larger than . The Annals of Statistics, pages 2313–2351, 2007.
• [11] Emmanuel J Candes and Terence Tao. Decoding by linear programming. Information Theory, IEEE Transactions on, 51(12):4203–4215, 2005.
• [12] Dong Dai, Philippe Rigollet, Lucy Xia, and Tong Zhang. Aggregation of affine estimators. Electronic Journal of Statistics, 8(1):302–327, 2014.
• [13] Alexandre d’Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert RG Lanckriet. A direct formulation for sparse pca using semidefinite programming. SIAM review, 49(3):434–448, 2007.
• [14] Gamarnik David and Zadik Ilias. High dimensional regression with binary coefficients. estimating squared error and a phase transtition. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 948–953. PMLR, 2017.
• [15] Yash Deshpande and Andrea Montanari. Sparse pca via covariance thresholding. In Advances in Neural Information Processing Systems, pages 334–342, 2014.
• [16] Thong T Do, Lu Gan, Nam Nguyen, and Trac D Tran. Sparsity adaptive matching pursuit algorithm for practical compressed sensing. In Signals, Systems and Computers, 2008 42nd Asilomar Conference on, pages 581–587. IEEE, 2008.
• [17] Alexandre d’Aspremont, Francis Bach, and Laurent El Ghaoui. Approximation bounds for sparse principal component analysis. Mathematical Programming, 148(1-2):89–110, 2014.
• [18] Alyson K Fletcher, Sundeep Rangan, and Vivek K Goyal. Necessary and sufficient conditions for sparsity pattern recovery. IEEE Transactions on Information Theory, 55(12):5758–5772, 2009.
• [19] David Gamarnik and Ilias Zadik. Sparse high-dimensional linear regression. algorithmic barriers and a local search algorithm. arXiv preprint arXiv:1711.04952, 2017.
• [20] Milana Gataric, Tengyao Wang, and Richard J Samworth. Sparse principal component analysis via random projections. arXiv preprint arXiv:1712.05630, 2017.
• [21] Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of statistics, pages 295–327, 2001.
• [22] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 2009.
• [23] Ian T Jolliffe, Nickolay T Trendafilov, and Mudassir Uddin. A modified principal component technique based on the lasso. Journal of computational and Graphical Statistics, 12(3):531–547, 2003.
• [24] Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(Feb):517–553, 2010.
• [25] Rajiv Khanna, Joydeep Ghosh, Russell A Poldrack, and Oluwasanmi Koyejo. Sparse submodular probabilistic pca. In AISTATS, 2015.
• [26] Oluwasanmi O Koyejo, Rajiv Khanna, Joydeep Ghosh, and Russell Poldrack. On prior distributions and approximate inference for structured variables. In Advances in Neural Information Processing Systems, pages 676–684, 2014.
• [27] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations really solve sparse pca. Technical report, Technical report, Weizmann Institute of Science, 2013.
• [28] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
• [29] Zongming Ma et al. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772–801, 2013.
• [30] Stéphane G Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. Signal Processing, IEEE Transactions on, 41(12):3397–3415, 1993.
• [31] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The annals of statistics, pages 1436–1462, 2006.
• [32] Deanna Needell and Joel A Tropp. Cosamp: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.
• [33] Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of -estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.
• [34] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.
• [35] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over -balls. Information Theory, IEEE Transactions on, 57(10):6976–6994, 2011.
• [36] Philippe Rigollet and Alexandre Tsybakov. Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 39(2):731–771, 2011.
• [37] Mark Rudelson and Shuheng Zhou. Reconstruction from anisotropic random measurements. In Conference on Learning Theory, pages 10–1, 2012.
• [38] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
• [39] Sara van de Geer. The deterministic lasso. Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich, 2007.
• [40] Sara A Van De Geer, Peter Bühlmann, et al. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
• [41] Vincent Q Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse pca. In Advances in Neural Information Processing Systems, pages 2670–2678, 2013.
• [42] Martin Wainwright. Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting. In Information Theory, 2007. ISIT 2007. IEEE International Symposium on, pages 961–965. IEEE, 2007.
• [43] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using-constrained quadratic programming (lasso). Information Theory, IEEE Transactions on, 55(5):2183–2202, 2009.
• [44] Tengyao Wang, Quentin Berthet, Richard J Samworth, et al. Statistical and computational trade-offs in estimation of sparse principal components. The Annals of Statistics, 44(5):1896–1930, 2016.
• [45] Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. The Journal of Machine Learning Research, 14(1):899–925, 2013.
• [46] Tong Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Advances in Neural Information Processing Systems, pages 1921–1928, 2009.
• [47] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. arXiv preprint arXiv:1402.1918, 2014.
• [48] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Optimal prediction for sparse linear models? lower bounds for coordinate-separable m-estimators. arXiv preprint arXiv:1503.03188, 2015.
• [49] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.

## Appendix

The appendix is organized as follows. In Section A, we give additionaal background on the literature for sparse linear regression. In Section B, we give calculations for quantities from Section 3.1 in the main text, where we set up our main linear model. In Section C, we give complete proofs for Theorems 3.1 and 3.2. Finally, we provide additional details for our experiments in Section D.

## Appendix A Additional background on SLR

Efficient methods The estimator, which minimizes the reconstruction error over all -sparse regression vectors, achieves prediction error bound of form (btw07 (), raskutti2011minimax ()): but takes exponential time to compute. Various efficient methods have been proposed to circumvent this computational intractability: basis pursuit, Lassolasso (), and the Dantzig selector candes2007dantzig () are some of initial approaches. Greedy pursuit methods such as OMP omp (), IHTiht (), CoSaMPcosamp (), and FoBafoba () among others offer more computationally efficient alternatives.131313Note that some of these algorithms were presented for compressed sensing; nonetheless, their guarantees can be converted appropriately. These algorithms achieve the same prediction error guarantee as up to a constant, but under the assumption that satisfies certain properties, such as restricted eigenvalue (bickel2009simultaneous ()), compatibility (van2007deterministic ()), restricted isometry property (candes2005decoding ()), (in)coherence (bunea2007sparse ()), among others. In this work, we focus on the restricted eigenvalue (see Definition 2.1 for a formal definition). We remark that restricted eigenvalue is among the weakest, and is only slightly stronger than the compatibility condition. Moreover, zhang2014lower () give complexity-theoretic evidence for the necessity of dependence on the RE constant for certain worst case instances of the design matrix. See van2009conditions () for implications between various conditions. Without such conditions on , the best known guarantees provably obtain only a decay rather than a decay in prediction error as number of samples increase; zhang2015optimal () give some evidence that this gap may be unavoidable.

Optimal estimators The SLR estimators we consider are efficiently computable. Another line of work considers arbitrary estimators that are not necessarily efficiently computable. These include BIC btw07 (), Exponential Screening RT11 (), and Q-aggregation DRXZ14 (). Such estimators achieve strong guarantees regarding minimax optimality in the form of oracle inequalities on MSE.

Restricted Eigenvalue The restricted eigenvalue (RE) lower bounds the quadratic form defined by in all (approximately) sparse directions. RE is related to more general notions such as the restricted strong convexity negahban2009unified (), which roughly says that loss function is not too “flat” near the point of interest; this allows us to convert convergence in loss value to convergence in parameter value. In general when , we cannot guarantee this for all directions, but it suffices to consider the set of “mostly” sparse directions.

We remark that the above condition is very natural and likely unavoidable. zhang2014lower () indicate that the dependence of the above prediction error guarantee on RE cannot be removed, under a standard conjecture in complexity theory. raskutti2010restricted (); rudelson2012reconstruction () show that RE holds with high probability for correlated Gaussian designs (though it remains NP-hard to verify it bandeira2013certifying ()).

A recent line of work pmlr-v65-david17a (); gamarnik2017sparse () studies the algorithmic hardness of SLR when has Gaussian design.

## Appendix B Deferred calculations from Section 3.1

### b.1 Linear minimum mean-square-error estimation

Given random variables and (this can be a vector more generally), what is the best prediction for conditioned on knowing ? What is considered “best” can vary, but here we consider the mean squared error. That is, we want to come up with s.t.

 E[(Y−^y)2]

is minimized.

It is not hard to show that is just the conditional expectation of conditioned on . The minimum mean-square-error estimate can be a highly nontrivial function of .

The linear minimum mean-square-error (LMMSE) estimate instead restricts the attention to estimators of the form . Notice here that and are fixed and are not functions of .

One can show that the LMMSE estimator is given by: , where is the appropriately indexed covariance matrix, and is chosen in the obvious way to make our estimator unbiased.

### b.2 Calculations for the linear model

To recap our setup, we input the design matrix and the response variable as inputs to an SLR black-box. Our goal is to express as a linear function of plus some independent noise . Without loss of generality let , and for our discussion below assume . For illustration, at times we will simplify our calculation further for the uniform case where for and for .

For the moment, just consider one row of , corresponding to one particular sample of the original SPCA distribution. Since is jointly Gaussian, we can express (the expectation of) as a linear function of the other coordinates:

 E[X1|X2:d=x2:d]=Σ1,2:d(Σ2:d)−1x2:d

Hence we can write

 X1=Σ1,2:d(Σ2:d)−1X2:d+w

where for some to be determined and for .

By directly computing the variance of the above expression for , we deduce an expression for the noise level:

 σ2=Σ11−Σ1,2:d(Σ2:d)−1Σ2:d,1

Note that is just under . We proceed to compute under , when . To compute , we use (a special case of) the Sherman-Morrison formula: .

 Σ−12:d =(Id−1+θu−1u⊤−1)−1=Id−1−θ1+(1−u21)θu−1u⊤−1

where is restricted to coordinates .

 Σ1,2:d(Σ2:d)−1Σ2:d,1 =(θu11+(1−u21))2u⊤−1(I+θu−1u⊤−1)u−1 =θ2u21(1−u21)1+(1−u21)θ (specializing to uniform case again) =θ2k(1−1k)11+k−1kθ≈θ2k(1+θ)

Finally, substituting into the expression for

 σ2 =1+θu21−θ2u21(1−u21)1+(1−u21)θ =1+θu211+(1−u21)θ ≤2if θ≤1

We remark that the noise level of column 1 has been reduced by roughly by regressing on correlated columns.

In summary, under (and if ) we can write

 y=Xβ∗+w

where

 β∗ =(Σ2:d)−1Σ2:d,1 =(I−θ1+(1−u1)2θu−1u⊤−1)θu1u−1 =θu1(1−θ1+(1−u21)θ(1−u21))u−1 =θu11+(1−u21)θu−1

(technically, the definition of on the RHS is a dimensional vector, but we augment it with zeros to make it dimensional) and where . Note that in the uniform case, as where is uniform 1 on first coordinates, as expected.

### b.3 Properties of the design matrix X

#### Restricted eigenvalue (RE)

Here we check that defined as in Section 3.1 has constant restricted eigenvalue constant. This allows us to apply Condition 2.2 for the SLR black-box with good guarantee on prediction error.

The rows of are drawn from where is restricted to coordinates wlog.141414We assume here that as in the previous section

Let . We can show that satisfies RE with by bounding ’s minimum eigenvalue. First, we compute the eigenvalues of . has a nullspace of dimension , so eigenvalue 0 has multiplicity . is a trivial eigenvector with eigenvalue . Therefore, has eigenvalues and .

Now we can extend this to the sample matrix by applying Corollary 1 of raskutti2010restricted () (also see Example 3 therein), and conclude that as soon as the matrix satisfies RE with .

We remark that the following small technical condition also appears in known bounds on prediction error:

#### Column normalization

This is a condition on the scale of relative to the noise in SLR, which is always .

 ∥Xθ∥22n≤∥θ∥22

for all

We can always rescale (and hence ) to satisfy this, which would also rescale the noise level in our linear model since the noise is derived from coming from the SPCA generative model, rather than added independently as in the usual SLR setup.

Hence, since all scale dependent quantities are scaled by the same amount when we scale the original data , wlog we may continue to use the same and in our analysis. As the column normalization condition does not affect us, we drop it from Condition 2.2 of our black-box assumption.

## Appendix C Proofs of main Theorems

In this section we analyze the distribution of under both and on our way to proving Theorems 3.1 and 3.2. Note that though the dimension and the sparsity of our instances are and (since we remove one column from the SPCA data matrix to obtain the design matrix ), for ease of exposition we just use in their place since it does not affect the analysis in any meaningful way.

First, we review a useful tail bound on random variables.

###### Lemma C.1 (Concentration on upper and lower tails of the χ2 distribution (laurent (), Lemma 1)).

Let be the random variable with degrees of freedom. Then,

 Pr(Z−k≥2√kt+2t) ≤exp(−t) Pr(Z−X≥2√kt) ≤exp(−t)

We can simplify the upper tail bound as follows for convenience:

###### Corollary C.2.

For r.v. with degrees of freedom and deviation .

### c.1 Analysis of Qi under H1

Without loss of generality assume the support of , denoted , is and consider the first coordinate. We expand by using as follows:

 Q1 =1n∥y∥22−1n∥y−Xˆβ∥22=1n∥Xβ∗+w∥22−1n∥Xβ∗−Xˆβ∥22−2nw⊤(Xβ∗−Xˆβ)−1n∥w∥22 =1n∥Xβ∗∥22−2nw⊤Xβ∗−1n(∥Xβ∗−Xˆβ∥22)−2nw⊤(Xβ∗−Xˆβ)

Observe that the noise term cancels conveniently.

Before bounding each of these four terms, we introduce a useful lemma to bound cross terms involving noise :

###### Lemma C.3 (Lemmas 8 and 9, raskutti2011minimax ()).

For any fixed and independent noise vector with i.i.d. entries:

 |w⊤Xθ|n≤9σ∥Xθ∥2n√klogdk

for all w.p. at least

We bound each term as follows:

Term 1. The first term contains the signal from the spike; notice its resemblance to the -sparse eigenvalue statistic. Rewritten in another way,

 (β∗)⊤X⊤Xnβ∗=(β∗)⊤ˆΣ2:dβ∗

Hence, we expect this to concentrate around , which simplifies to (see Appendix B.2 for the full calculation):

 (β∗)⊤Σ2:dβ∗=(Σ1,2:dΣ−12:d)Σ2:d(Σ−12:dΣ2:d,1)=θ2u21(1−u21)1+(1−u21)θ

For concentration, observe that we may rewrite

 (β∗)⊤ˆΣ2:dβ∗=1nn∑i=1(X(i)β∗)2

where is the th row, representing the th sample. This is just an appropriately scaled chi-squared random variable with degrees of freedom (since each is i.i.d. normal), and the expected value of each term in the sum is the same as computed above. Applying a lower tail bound on distribution (see Appendix ), with probability at least we have

 (β∗)⊤ˆΣ2:dβ∗≥θ2u21(1−u21)1+(1−u21)θ⋅⎛⎝1−2√log(1/δ)n⎞⎠

Choosing ,

 ∥Xβ∗∥22n ≥θ2u21(1−u21)1+(1−u21)θ⋅(1−2√klogdn) \lx@stackrel(a)≥12⋅θ2u21(1−u21)1+(1−u21)θ \lx@stackrel(b)≥c2min4θ2k (1)

where as long as and since and under Condition 1.

Term 2. The absolute value of the second term can be bounded by using Lemma C.3. From (1) as long as ( is some constant that we will choose later),

 ∥Xβ∗∥22n≥c2min4θ2k≥c14klogdn

so the first two terms together are lower bounded by:

 ∥Xβ∗∥2n(∥Xβ∗∥2−18√klogd/k)≥c15klogdn, (2)

for large enough constant .

Term 3. The third term, which is the prediction error , is upper bounded by with probability at least by Condition 2.2 on our SLR black-box. Note if we assume .151515As smaller makes the problem only harder, we assume for ease of computation and as standard in literature. Now, with probability at least if since (see Appendix B.3 for more details). Then,

 1n∥Xβ∗−X^β∥22≤Cklogdn

Term 4. The contribution of the last cross term can also be bounded by Lemma C.3 w.h.p. (note )

 |w⊤X(β∗−ˆβ)|n≤9σ∥X(β∗−ˆβ)∥2n√klogdk.

Combined with the above bound for prediction error, this bounds the cross term’s contribution by at most .

Putting the bounds on four terms together, we get the following lower bound on .

###### Lemma C.4.

There exists constants s.t. if and , with probability at least , for any that satisfies the size bound in Condition 1,

 Qi>13klogdn
###### Proof.

From 1-4 above, by union bound, all four bounds fail to hold with probability at most for appropriate constants if (required by Term 2) and for some (note that both terms 1 and 3 require sufficient number of samples ). If all four bounds hold, we have:

 Qi>c15klogdn−C′klogdn

where are just some constants. So if is sufficiently large, the above bound is greater than .161616The choice of constant 13 may seem a little arbitrary, but this is just to be consistent with Lemma C.5. There, the constant just falls out of convenient choices for simplification, and is not optimized for in particular.

### c.2 Analysis of Qi under H0

We could proceed by decomposing the same way as in ; all the error terms including prediction error are still bounded by in magnitude, and the signal term is gone now since . This will give the same upper bound (up to a constant) as the following proof is about to show. However, we find the following direct analysis more informative and intuitive.

Since our goal is to upper bound under , we may let be the optimal possible choice given and (one that minimizes , and hence maximizes ). We further break this into two steps. We enumerate over all possible subsets of size , and conditioned on each , choose the optimal .

Fix some support of size . The span of is at most a -dimensional subspace of . Hence, we can consider some unitary transformation of that maps the span of into the subspace spanned by the first standard basis vectors. Since is an isometry by definition,

 nQi=∥y∥22−∥y−XˆβS∥22=∥Uy∥22−∥Uy−UXˆβS∥22

Let . Since has nonzero entries only in the first coordinates, the optimal choice (in the sense of maximizing the above quantity) of is to choose linear combinations of the first columns of so that equals the first coordinates of . Then, is just the squared norm of the first coordinates of . Since is some unitary matrix that is independent of (being a function of which is independent of ), still has i.i.d. entries, and hence is a -var with degrees of freedom.

Now we apply an upper tail bound on the distribution (see Corollary C.2). Choosing , and after union bounding over all supports , with probability at most as long as .

###### Lemma C.5.

Under , w.p. at least .

###### Remark C.6.

Union bounding over all is necessary for the analysis. For instance, we cannot just fix to be (this denotes the support of ) since is a function of , so fixing changes the distribution of .

###### Remark C.7.

Observe that this analysis of for also extends immediately to when coordinate is outside the support. The reason the analysis cannot extend to when is because is not independent of in this case.

###### Corollary C.8.

Under , if , w.p. at least .