Sparse PCA via Covariance Thresholding

# Sparse PCA via Covariance Thresholding

Yash Deshpande Department of Electrical Engineering, Stanford University    Andrea Montanari Departments of Electrical Engineering and Statistics, Stanford University
###### Abstract

In sparse principal component analysis we are given noisy observations of a low-rank matrix of dimension and seek to reconstruct it under additional sparsity assumptions. In particular, we assume here each of the principal components has at most non-zero entries. We are particularly interested in the high dimensional regime wherein is comparable to, or even much larger than .

In an influential paper, [JL04] introduced a simple algorithm that estimates the support of the principal vectors by the largest entries in the diagonal of the empirical covariance. This method can be shown to identify the correct support with high probability if , and to fail with high probability if for two constants . Despite a considerable amount of work over the last ten years, no practical algorithm exists with provably better support recovery guarantees.

Here we analyze a covariance thresholding algorithm that was recently proposed by [KNV13]. On the basis of numerical simulations (for the rank-one case), these authors conjectured that covariance thresholding correctly recover the support with high probability for (assuming of the same order as ). We prove this conjecture, and in fact establish a more general guarantee including higher-rank as well as much smaller than . Recent lower bounds [BR13, MW15a] suggest that no polynomial time algorithm can do significantly better.

The key technical component of our analysis develops new bounds on the norm of kernel random matrices, in regimes that were not considered before. Using these, we also derive sharp bounds for estimating the population covariance (in operator norm), and the principal component (in -norm).

## 1 Introduction

In the spiked covariance model proposed by [JL04], we are given data with of the form111Throughout the paper, we follow the convention of denoting scalars by lowercase, vectors by lowercase boldface, and matrices by uppercase boldface letters.:

 xi =r∑q=1√βquq,ivq+zi, (1)

Here is a set of orthonormal vectors, that we want to estimate, while and are independent and identically distributed. The quantity is a measure of signal-to-noise ratio. In the rest of this introduction, in order to simplify the exposition, we will refer to the rank one case and drop the subscript . Further, we will assume to be of the same order as . Our results and proofs hold for a broad range of scalings of , , , and will be stated in general form.

The standard method of principal component analysis involves computing the sample covariance matrix and estimates by its principal eigenvector . It is a well-known fact that, in the high dimensional regime, this yields an inconsistent estimate (see [JL09]). Namely unless . Even worse, [BBAP05] and [Pau07] demonstrate the following phase transition phenomenon. Assuming that , if the estimate is asymptotically orthogonal to the signal, i.e. . On the other hand, for , remains bounded away from zero as . This phase transition phenomenon has attracted considerable attention recently within random matrix theory [FP07, CDMF09, BGN11, KY13].

These inconsistency results motivated several efforts to exploit additional structural information on the signal . In two influential papers, [JL04, JL09] considered the case of a signal that is sparse in a suitable basis, e.g. in the wavelet domain. Without loss of generality, we will assume here that is sparse in the canonical basis , …. In a nutshell, [JL09] propose the following:

1. Order the diagonal entries of the Gram matrix , and let be the set of indices corresponding to the largest entries.

2. Set to zero all the entries of unless , and estimate with the principal eigenvector of the resulting matrix.

Johnstone and Lu formalized the sparsity assumption by requiring that belongs to a weak -ball with . [AW09] studied the more restricted case when every entry of has equal magnitude of . Within this restricted model, they proved diagonal thresholding successfully recovers the support of provided the sample size satisfies222Throughout the introduction, we write as a shorthand of for a some constant . [AW09]. This result is a striking improvement over vanilla PCA. While the latter requires a number of samples scaling with the number of parameters , sparse PCA via diagonal thresholding achieves the same objective with a number of samples that scales with the number of non-zero parameters, .

At the same time, this result is not as strong as might have been expected. By searching exhaustively over all possible supports of size (a method that has complexity of order ) the correct support can be identified with high probability as soon as . No method can succeed for much smaller , because of information theoretic obstructions. We refer the reader to [AW09] for more details.

Over the last ten years, a significant effort has been devoted to developing practical algorithms that outperform diagonal thresholding, see e.g. [MWA05, ZHT06, dEGJL07, dBG08, WTH09]. In particular, [dEGJL07] developed a promising M-estimator based on a semidefinite programming (SDP) relaxation. [AW09] also carried out an analysis of this method and proved that, if333Throughout the paper, we denote by constants that can depend on problem parameters and . We denote by upper case (lower case ) generic absolute constants that are bigger (resp. smaller) than 1, but which might change from line to line. (i) , and (ii) the SDP solution has rank one, then the SDP relaxation provides a consistent estimator of the support of .

At first sight, this appears as a satisfactory solution of the original problem. No procedure can estimate the support of from less than samples, and the SDP relaxation succeeds in doing it from –at most– a constant factor more samples. This picture was upset by a recent, remarkable result by [KNV13] who showed that the rank-one condition assumed by Amini and Wainwright does not hold for . This result is consistent with recent work of [BR13] demonstrating that sparse PCA cannot be performed in polynomial time in the regime , under a certain computational complexity conjecture for the so-called planted clique problem.

In summary, the sparse PCA problem demonstrates a fascinating interplay between computational and statistical barriers.

From a statistical perspective,

and disregarding computational considerations, the support of can be estimated consistently if and only if . This can be done, for instance, by exhaustive search over all the possible supports of . We refer to [VL12, CMW13] for a minimax analysis.

From a computational perspective,

the problem appears to be much more difficult. There is rigorous evidence [BR13, MW15b, MW15a, WBS14] that no polynomial algorithm can reconstruct the support unless . On the positive side, a very simple algorithm (Johnstone and Lu’s diagonal thresholding) succeeds for .

Of course, several elements are still missing in this emerging picture. In the present paper we address one of them, providing an answer to the following question:

Is there a polynomial time algorithm that is guaranteed to solve the sparse PCA problem with high probability for ?

We answer this question positively by analyzing a covariance thresholding algorithm that proceeds, briefly, as follows. (A precise, general definition, with some technical changes is given in the next section.)

1. Form the empirical covariance matrix and set to zero all its entries that are in modulus smaller than , for a suitably chosen constant.

2. Compute the principal eigenvector of this thresholded matrix.

3. Estimate the support of by thresholding .

Such a covariance thresholding approach was proposed in [KNV13], and is in turn related to earlier work by [BL08b, CZZ10]. The formulation discussed in the next section presents some technical differences that have been introduced to simplify the analysis. Notice that, to simplify proofs, we assume to be known: this issue is discussed in the next two sections.

The rest of the paper is organized as follows. In the next section we provide a detailed description of the algorithm and state our main results. The proof strategy for our results is explained in Section 3. Our theoretical results assume full knowledge of problem parameters for ease of proof. In light of this, in Section 4 we discuss a practical implementation of the same idea that does not require prior knowledge of problem parameters, and is data-driven. We also illustrate the method through simulations. The complete proofs are in Sections 5, 7 and 6 respectively.

A preliminary version of this paper appeared in [DM14]. This paper extends significantly the results in [DM14]. In particular, by following an analogous strategy, we improve greatly the bounds obtained by [DM14]. This significantly improves the regimes of on which we can obtain non-trivial results. The proofs follow a similar strategy but are, correspondingly, more careful.

## 2 Algorithm and main results

We provide a detailed description of the covariance thresholding algorithm for the general model (1) in Table 1. For notational convenience, we shall assume that sample vectors are given (instead of ): .

We start by splitting the data into two halves: and and compute the respective sample covariance matrices and respectively. Define to be the population covariance minus identity, i.e.

 Σ≡r∑q=1βqvqvTq. (2)

Throughout, we let and denote the support of and its size respectively, for . We further let and . The matrix is used, in steps to to obtain a good estimate for the low rank part of the population covariance . The algorithm first computes , a centered version of the empirical covariance as follows:

 ˆΣ ≡G−Ip, (3)

where is the sample covariance matrix.

It then obtains the estimate by soft thresholding each entry of at a threshold . Explicitly:

 (η(ˆΣ))ij ≡η(ˆΣij;τ√n). (4)

Here is the soft thresholding function

 η(z;λ) =⎧⎨⎩z−λ if z≥λ−z+λ if z≤−λ0 otherwise. (5)

In step of the algorithm, this estimate is used to construct good estimates of the eigenvectors . Finally, in step , these estimates are combined with the (independent) second half of the data to construct estimators for the support of the individual eigenvectors . In the first two subsections we will focus on the estimation of and the individual principal components. Our results on support recovery are provided in the final subsection.

### 2.1 Estimating the population covariance

Our first result bounds the estimation error of the soft thresholding procedure in operator norm.

###### Theorem 1.

There exist numerical constants such that the following happens. Assume , and let . We keep the thresholding level according to

 τ=⎧⎪⎨⎪⎩τ∗ when τ∗≤√logp/2,s20≤p/eC2τ∗ when τ∗≥√logp/2,s0≤p/e0 otherwise. (6)

. Then with probability :

 ≤C ⎷s20(β2∨1)n(logps20∨1). (7)

At this point, it is useful to compare Theorem 1 with available results in the literature. Classical denoising theory [DJ94, Joh15] provides upper bounds on the estimation error of soft-thresholding. However, estimation error is measured by (element-wise) norm, while here we are interested in operator norm.

[BL08a, BL08b, Kar08, CZZ10, CL11] considered the operator norm error of thresholding estimators for structured covariance matrices. Specializing to our case of exact sparsity, the result of [BL08a] implies that, with high probability:

 ∥∥ηH(ˆΣ)−Σ∥∥op≤C0√s20logpn. (8)

Here is the hard-thresholding function: , and the threshold is chosen to be . Also, is the matrix obtained by thresholding the entries of . In fact, [CZ12] showed that the rate in (8) is minimax optimal over the class of sparse population covariance matrices, with at most non-zero entries per row, under the assumption .

Theorem 1 ensures consistency under a weaker sparsity condition, viz. is sufficient. Also, the resulting rate depends on instead of . In other words, in order to achieve for a fixed , it is sufficient as opposed to .

Crucially, in this regime for , Theorem 1 suggests a threshold of order as opposed to which is used in [BL08a, CZ12]. As we will see in Section 3, this regime mathematically more challenging than the one of [BL08a, CZ12]. By setting for a large enough constant , all the entries of outside the support of are set to . In contrast, a large part of our proof is devoted to control the operator norm of the noise part of .

### 2.2 Estimating the principal components

We next turn to the question of estimating the principal components . Of course, these are not identifiable if there are degeneracies in the population eigenvalues . We thus introduce the following identifiability condition.

1. The spike strengths are all distinct. We denote by and . Namely, is the largest signal strength and is the minimum gap.

We measure estimation error through the following loss, defined for :

 L(x,y) ≡12mins∈{+1,−1}∥∥x−sy∥∥2 (9) =1−|⟨x,y⟩|. (10)

Notice the minimization over the sign . This is required because the principal components are only identifiable up to a sign. Analogous results can obtained for alternate loss functions such as the projection distance:

 Lp(x,y) ≡1√2∥xxT−yyT∥F=√1−⟨x,y⟩2. (11)

The theorem below is an immediate consequence of Theorem 1. In particular, it uses the guarantee of Theorem 1 to show that the corresponding principal components of provide good estimates of the principal components .

###### Theorem 2.

There exists a numerical constant such that the following holds. Suppose that Assumption A1 holds in addition to the conditions , , and . Set as according to Theorem 1, and let denote the principal eigenvectors of . Then, with probability

 maxq∈[r]L(ˆvq,vq)≤Cβ2mins20(β2∨1)nlogps20. (12)
###### Proof.

Let . By Davis-Kahn sin-theta theorem [DK70], we have, for ,

 (13)

For , the claim follows by using Theorem 1. If , the claim is obviously true since always. ∎

### 2.3 Support recovery

Finally, we consider the question of support recovery of the principal components . The second phase of our algorithm aims at estimating union of the supports from the estimated principal components . Note that, although is not even expected to be sparse, it is easy to see that the largest entries of should have significant overlap with . Step 6 of the algorithm exploit this property to construct a consistent estimator of the support of the spike .

We will require the following assumption to ensure support recovery.

1. There exist constants such that the following holds. The non-zero entries of the spikes satisfy for all . Further, for any for every . Without loss of generality, we will assume .

###### Theorem 3.

Assume the spiked covariance model of Eq. (1) satisfying assumptions A1 and A2, and further , , and for a large enough numerical constant. Consider the Covariance Thresholding algorithm of Table 1, with as in Theorem 1 .

Then there exists such that, if

 n≥K0s20rlogps20 (14)

then the algorithm recovers the union of supports of with probability (i.e. we have ).

The proof in Section 7 also provides an explicit expression for the constant .

###### Remark 2.1.

In Assumption A2, the requirement on the minimum size of is standard in support recovery literature [Wai09, MB06]. Additionally, however, we require that when the supports of overlap, they are of the same order, quantified by the parameter . Relaxing this condition is a potential direction for future work.

###### Remark 2.2.

Recovering the signed supports and , up to a sign flip, is possible using the same technique as recovering the supports above, and poses no additional difficulty.

## 3 Algorithm intuition and proof strategy

For the purposes of exposition, throughout this section, we will assume that and drop the corresponding subscript .

Denoting by the matrix with rows , …, by the matrix with rows , …, and letting , the model (1) can be rewritten as

 X =√βuvT+Z. (15)

Recall that . For , the principal eigenvector of , and hence of is positively correlated with , i.e. is bounded away from zero. However, for , the noise component in dominates and the two vectors become asymptotically orthogonal, i.e. for instance . In order to reduce the noise level, we must exploit the sparsity of the spike .

Now, letting , and , we can rewrite as

 ˆΣ =β′vvT+vwT+wvT+1nZTZ−Ip,. (16)

For a moment, let us neglect the cross terms . The ‘signal’ component is sparse with entries of magnitude , which (in the regime of interest ) is equivalent to . The ‘noise’ component is dense with entries of order . Assuming for some small constant , it should be possible to remove most of the noise by thresholding the entries at level of order . For technical reasons, we use the soft thresholding function . We will omit the second argument from wherever it is clear from context.

Consider again the decomposition (16). Since the soft thresholding function is affine when , we would expect that the following decomposition holds approximately (for instance, in operator norm):

 η(ˆΣ) ≈η(β′vvT)+η(1nZTZ−Ip). (17)

Since and each entry of has magnitude at least , the first term is still approximately rank one, with

 ∥∥η(β′vvT)−βvvT∥∥op ≤s0τ√n. (18)

This is straightforward to see since soft thresholding introduces a maximum bias of per entry of the matrix, while the factor comes due to the support size of (see Proposition 6.2 below for a rigorous argument).

The main technical challenge now is to control the operator norm of the perturbation . We know that has entries of variance , for . If entries were independent with mild tail conditions, this would imply –with high probability–

 (19)

for some constant . Combining the bias bound from Eq. (18) and the heuristic decomposition of Eq. (19) with the decomposition (17) results in the bound

 ∥∥η(ˆΣ)−βvvT∥∥op ≤s0τ√n+Cexp(−cτ2)√pn. (20)

Our analysis formalizes this argument and shows that such a bound is essentially correct when . A modified bound is proved for (see Theorem 4 for a general statement).

The matrix is a special case of so-called inner-product kernel random matrices, which have attracted recent interest within probability theory [EK10a, EK10b, CS13, FM15]. The basic object of study in this line of work is a matrix of the type:

 Mij =fn(⟨~zi,~zj⟩n−I(i=j)). (21)

In other words, is a kernel function and is applied entry-wise to the matrix , with a matrix with independent standard normal entries as above and are the columns of .

The key technical challenge in our proof is the analysis of the operator norm of such matrices, when is the soft-thresholding function, with threshold of order . Earlier results are not general enough to cover this case. [EK10a, EK10b] provide conditions under which the spectrum of is close to a rescaling of the spectrum of . We are interested instead in a different regime in which the spectrum of is very different from the one of . [CS13] consider -dependent kernels, but their results are asymptotic and concern the weak limit of the empirical spectral distribution of . This does not yield an upper bound on the spectral norm of . Finally, [FM15] consider the spectral norm of kernel random matrices for smooth kernels , only in the proportional regime .

Our approach to proving Theorem 1 follows instead the -net method: we develop high probability bounds on the maximum Rayleigh quotient:

 maxy∈Sp−1⟨y,η(ZTZ/n−Ip)y⟩ =maxy∈Sp−1∑i,jη(⟨~zi,~zj⟩n;τ√n)yiyj, (22)

by discretizing , the unit sphere in dimensions. For a fixed , the Rayleigh quotient is a (complicated) function of the underlying Gaussian random variables . One might hope that it is Lipschitz continuous with some Lipschitz constant , thereby implying, by Gaussian isoperimetry [Led01], that it concentrates to the scale around its expectation (i.e. 0). Then, by a standard union bound argument over a discretization of the sphere, one would obtain that the operator norm of is typically no more than .

Unfortunately, this turns out not to be true over the whole space of , i.e. the Rayleigh quotient is not Lipschitz continuous in the underlying Gaussian variables . Our approach, instead, shows that for typical values of , we can control the gradient of with respect to , and extract the required concentration only from such local information of the function. This is formalized in our concentration lemma 5.4, which we apply extensively while proving Theorem 1. This lemma is a significantly improved version of the analogous result in [DM14].

## 4 Practical aspects and empirical results

Specializing to the rank one case, Theorems 2 and 3 show that Covariance Thresholding succeeds with high probability for a number of samples , while Diagonal Thresholding requires . The reader might wonder whether eliminating the factor has any practical relevance or is a purely conceptual improvement. Figure 1 presents simulations on synthetic data under the strictly sparse model, and the Covariance Thresholding algorithm of Table 1, used in the proof of Theorem 3. The objective is to check whether the factor has an impact at moderate . We compare this with Diagonal Thresholding.

We plot the empirical success probability as a function of for several values of , with . The empirical success probability was computed by using independent instances of the problem. A few observations are of interest: Covariance Thresholding appears to have a significantly larger success probability in the ‘difficult’ regime where Diagonal Thresholding starts to fail; The curves for Diagonal Thresholding appear to decrease monotonically with indicating that proportional to is not the right scaling for this algorithm (as is known from theory); In contrast, the curves for Covariance Thresholding become steeper for larger , and, in particular, the success probability increases with for . This indicates a sharp threshold for , as suggested by our theory.

In terms of practical applicability, our algorithm in Table 1 has the shortcomings of requiring knowledge of problem parameters . Furthermore, the thresholds suggested by theory need not be optimal. We next describe a principled approach to estimating (where possible) the parameters of interest and running the algorithm in a purely data-dependent manner. Assume the following model, for

 xi =μ+∑q√βquq,ivq+σzi,

where is a fixed mean vector, have mean and variance , and have mean and covariance . Note that our focus in this section is not on rigorous analysis, but instead to demonstrate a principled approach to applying covariance thresholding in practice. We proceed as follows:

Estimating , :

We let be the empirical mean estimate for . Further letting we see that entries of are mean and variance . We let where denotes the median absolute deviation of the entries of the matrix in the argument, and is a constant scale factor. Guided by the Gaussian case, we take .

Choosing :

Although in the statement of the theorem, our choice of depends on the SNR , it is reasonable to instead threshold ‘at the noise level’, as follows. The noise component of entry of the sample covariance (ignoring lower order terms) is given by . By the central limit theorem, . Consequently, , and we need to choose the (rescaled) threshold proportional to . Using previous estimates, we let for a constant . In simulations, a choice appears to work well.

Estimating :

We define and soft threshold it to get using as above. Our proof of Theorem 2 relies on the fact that has eigenvalues that are separated from the bulk of the spectrum. Hence, we estimate using : the number of eigenvalues separated from the bulk in . The edge of the spectrum can be computed numerically using the Stieltjes transform method as in [CS13].

Estimating :

Let denote the eigenvector of . Our theoretical analysis indicates that is expected to be close to . In order to denoise , we assume , where is additive random noise (perhaps with some sparse corruptions). We then threshold ‘at the noise level’ to recover a better estimate of . To do this, we estimate the standard deviation of the “noise” by . Here we set –again guided by the Gaussian heuristic– . Since is sparse, this procedure returns a good estimate for the size of the noise deviation. We let denote the vector obtained by hard thresholding : set and We then let and return as our estimate for .

Note that –while different in several respects– this empirical approach shares the same philosophy of the algorithm in Table 1. On the other hand, the data-driven algorithm presented in this section is less straightforward to analyze, a task that we defer to future work.

Figure 1 also shows results of a support recovery experiment using the ‘data-driven’ version of this section. Covariance thresholding in this form also appears to work for supports of size . Figure 2 shows the performance of vanilla PCA, Diagonal Thresholding and Covariance Thresholding on the “Three Peak” example of [JL04]. This signal is sparse in the wavelet domain and the simulations employ the data-driven version of covariance thresholding. A similar experiment with the “box” example of Johnstone and Lu is provided in Figure 3. These experiments demonstrate that, while for large values of both Diagonal Thresholding and Covariance Thresholding perform well, the latter appears superior for smaller values of .

## 5 Proof preliminaries

In this section we review some notation and preliminary facts that we will use throughout the paper.

### 5.1 Notation

We let denote the set of first integers. We will represent vectors using boldface lower case letters, e.g. . The entries of a vector will be represented by . Matrices are represented using boldface upper case letters e.g. . The entries of a matrix are represented by for . Given a matrix , we generically let , denote its rows, and , its columns.

For , we define the projector operator by letting be the matrix with entries

 PE(A)ij={Aijif (i,j)∈E,0otherwise. (23)

For a matrix , and a set , we define its column restriction to be the matrix obtained by setting to columns outside :

 (AE)ij ={Aij if j∈E,0otherwise.

Similarly is obtained from by setting to zero all indices outside . The operator norm of a matrix is denoted by (or ) and its Frobenius norm by . We write for the standard norm of a vector . Other vector norms such as or are denoted with appropriate subscripts.

We let denotes the support of the spike . Also, we denote the union of the supports of by . The complement of a set is denoted by .

We write for the soft-thresholding function. By we denote the derivative of with respect to the first argument, which exists Lebesgue almost everywhere. To simplify the notation, we omit the second argument when it is understood from context.

For a random variable and a measurable set we write to denote , the expectation of constrained to the event .

In the statements of our results, consider the limit of large and large with certain conditions on (as in Theorem 1). This limit will be referred to either as “ large enough” or “ large enough” where the phrase “large enough” indicates dependence of (and thereby ) on specific problem parameters.

The Gaussian distribution function will be denoted by .

### 5.2 Preliminary facts

Let denote the unit sphere in dimensions, i.e. . We use the following definition [Ver12, Definition 5.2] of the -net of a set :

###### Definition 5.1 (Nets, Covering numbers).

A subset is called an -net of if every point in may be approximated by one in with error at most . More precisely:

 ∀x∈X,infy∈Tε(X)∥x−y∥ ≤ε.

The minimum cardinality of an -net of , if finite, is called its covering number.

The following two facts are useful while using -nets to bound the spectral norm of a matrix. For proofs, we refer the reader to [Ver12, Lemmas 5.2, 5.4].

###### Lemma 5.2.

Let be the unit sphere in dimensions. Then there exists an -net of , satisfying:

 |Tε(Sn−1)|≤(1+2ε)n.
###### Lemma 5.3.

Let be a symmetric matrix. Then, there exists such that

 |⟨x,Ax⟩| ≥(1−2ε)∥A∥. (24)
###### Proof.

Firstly, we have . Let be the maximizer (which exists as is compact and is continuous in ). Choose so that . Then:

 ⟨x,Ax⟩ =⟨x−x∗,A(x+x∗)⟩+⟨x∗,Ax∗⟩. (25)

The lemma then follows as . ∎

Throughout the paper we will denote by an -net on the unit sphere that satisfies Lemma 5.2. For a subset of indices we denote by the natural isometric embedding of in .

We now state a general concentration lemma. This will be our basic tool to establish Theorem 2, and thereby Theorem 3.

###### Lemma 5.4.

Let be vector of i.i.d. standard normal variables. Suppose is a finite set and we have functions for every . Assume is a Borel set such that for Lebesgue-almost every :

 maxs∈Smaxt∈[0,1]∥∇Fs(√tx+√1−ty)∥ ≤L. (26)

Then, for any :

 P{maxs∈S|Fs(z)−EFs(z)|≥Δ} (27)

Here is an independent copy of .

###### Proof.

We use the Maurey-Pisier method along with symmetrization. By centering, assume that for all . Further, by including the functions in the set (at most doubling its size), it suffices to prove the one-sided version of the inequality:

 P{maxs∈SFs(z)≥Δ} (28)

We first implement the symmetrization. Note that:

 {x:max