Robust Shift-and-Invert Preconditioning: Faster and More Sample Efficient Algorithms for Eigenvector Computation 1footnote 11footnote 1The manuscript is out of date. An updated version of this work is available at [GHJ{}^{+}16].

# Robust Shift-and-Invert Preconditioning: Faster and More Sample Efficient Algorithms for Eigenvector Computation 111The manuscript is out of date. An updated version of this work is available at [Ghj+16].

Chi Jin
UC Berkeley
chijin@eecs.berkeley.edu
University of Washington
sham@cs.washington.edu
Cameron Musco
MIT
cnmusco@mit.edu
Praneeth Netrapalli
Microsoft Research, New England
praneeth@microsoft.com
Aaron Sidford
Microsoft Research, New England
asid@microsoft.com
November 2, 2015
###### Abstract

In this paper we provide faster algorithms and improved sample complexities for approximating the top eigenvector of a matrix . In particular we give the following results for computing an approximate eigenvector - i.e. some such that :

• Offline Eigenvector Estimation: Given an explicit matrix , we show how to compute an approximate top eigenvector in time and . Here is the stable rank, is the multiplicative gap between the largest and second largest eigenvalues, and hides log factors in and . By separating the dependence from our first runtime improves classic iterative algorithms such as the power and Lanczos methods. It also improves on previous work separating the and terms using fast subspace embeddings [AC09, CW13] and stochastic optimization [Sha15c]. We obtain significantly improved dependencies on and and our second running time improves this further when .

• Online Eigenvector Estimation: Given a distribution over vectors with covariance matrix and a vector which is an approximate top eigenvector for , we show how to compute an approximate eigenvector using samples from . Here is a natural notion of the variance of . Combining our algorithm with a number of existing algorithms to initialize we obtain improved sample complexity and runtime results under a variety of assumptions on . Notably, we show that, for general distributions, our sample complexity result is asymptotically optimal - we achieve optimal accuracy as a function of sample size as the number of samples grows large.

We achieve our results using a general framework that we believe is of independent interest. We provide a robust analysis of the classic method of shift-and-invert preconditioning to reduce eigenvector computation to approximately solving a sequence of linear systems. We then apply variants of stochastic variance reduced gradient descent (SVRG) and additional recent advances in solving linear systems to achieve our claims. We believe our results suggest the generality and effectiveness of shift-and-invert based approaches and imply that further computational improvements may be reaped in practice.

## 1 Introduction

Given a matrix , computing the top eigenvector of is a fundamental problem in computer science with applications ranging from principal component analysis [Jol02], to spectral clustering and learning of mixture models [NJW02, VW04], to pagerank computation [PBMW99], and a number of other graph related computations [Kor03, Spi07].

In this paper we provide improved algorithms for computing the top eigenvector, both in the offline case, where the matrix is given explicitly as well as in the online or statistical case where we are simply given samples from a distribution over vectors and wish to compute the top eigenvector of , the covariance matrix of the distribution.

Our algorithms are based on the classic idea of shift-and-invert preconditioning for eigenvalue computation [Saa92]. We give a new robust analysis of the shifted-and-inverted power method, which allows us to efficiently reduce maximum eigenvector computation to approximately solving a sequence of linear systems in the matrix for some shift parameter . We then show how to solve these systems efficiently using variants of Stochastic Variance Reduced Gradient (SVRG) [JZ13] that optimize a convex function that is given as a sum of non-convex components.

### 1.1 Our Approach

The well known power method for computing the top eigenvector of a starts with a initial vector (often random) and repeatedly multiplies by , eventually causing to converge to the top eigenvector. Assuming a random start vector, convergence requires iterations, where . The dependence on this gap is inherent to the power method and ensures that the largest eigenvalue is significantly amplified in comparison to the remaining values.

If the gap is small, one way to attempt to deal with this dependence is to replace with a preconditioned matrix – i.e. a matrix with the same top eigenvector but a much larger eigenvalue gap. Specifically, let for some shift parameter . We can see that the smallest eigenvector of (the largest eigenvector of ) is equal to the largest eigenvector of . Additionally, if is near the largest eigenvalue of , , there will be a constant gap between the largest and second largest values of . For example, if , then we will have and .

This constant factor gap means that running the power method on converges quickly to the top eigenvector of , specifically in iterations. Of course, there is a catch – each iteration of the shifted-and-inverted power method requires solving a linear system in . Furthermore, the condition number of is proportional , so as gets smaller, solving this linear system becomes more difficult for standard iterative methods.

Fortunately, the problem of solving linear systems is incredibly well studied and there are many efficient iterative algorithms we can adapt to apply approximately. In particular, we show how to accelerate the iterations of the shifted-and-inverted power method using variants of Stochastic Variance Reduced Gradient (SVRG) [JZ13]. Due to the condition number of , we will not entirely avoid a dependence, however, we can separate this dependence from the input size .

Typically, stochastic gradient methods are used to optimize convex functions that are given as the sum of many convex components. To solve a linear system we minimize the convex function with components where is the row of . Such an approach can be used to solve systems in , however solving systems in requires more care. The components of our function are not so simple, and we require an analysis of SVRG that guarantees convergence even when some of these components are non-convex. We give a simple analysis for this setting, generalizing recent work appearing in the literature [SS15, CR15].

Given fast approximate solvers for , the second main component required by our algorithmic framework is a new error bound for the shifted-and-inverted power method, showing that it is robust to the approximate linear system solvers, such as SVRG. We give a general robustness analysis, showing exactly what accuracy each system must be solved to, allowing for faster implementations using linear solvers with weaker guarantees. Our proofs center around the potential function

 G(x)\lx@stackreldef=∥∥Pv⊥1x∥∥B∥∥Pv1x∥∥B

where and are the projections onto the top eigenvector and its complement respectively. This function resembles tangent based potential functions used in previous work [HP14] except that the norms of the projections are measured over . For the exact power method, this is irrelevant – it is not hard to see that progress is identical in both the and norms (see Lemma 33 of the Appendix). However, since is a natural norm for measuring the progress of linear system solvers for , our potential function makes it possible to show that progress is also made when we compute approximately up to some error with bounded .

### 1.2 Our Results

Our algorithmic framework described above offers several advantageous. Theoretically, we obtain improved running times for computing the top eigenvector. In the offline case, in Theorem 16 we give an algorithm running in time , where is the multiplicative gap between the largest and second largest eigenvalues, is the stable rank of , and is the number of non-zero entries in the matrix. Up to log factors, our runtime is in many settings proportional to the input size , and so is very efficient for large data matrices. In the case when we also use the results of [FGKS15, LMH15] to provide an accelerated runtime of , shown in Theorem 17.

Our algorithms return an approximate top eigenvector with . Note that, by choosing error , we can ensure that is actually close to – i.e. that . Further, we obtain the same asymptotic runtime since . We compare our runtimes with previous work in Table 1.

In the online case, in Theorem 25, we show how to improve an factor approximation to the top eigenvector to an approximation using samples where is a natural upper bound on the variance of our distribution. Our algorithm is based off the streaming SVRG algorithm of [FGKS14, LMH15]. It requires just amortized time to process each sample, uses just space, and is easy to parallelize. We can apply our result in a variety of regimes, using algorithms from the literature to obtain the initial factor approximation and our algorithm to refine this solution. As shown in Table 2, this gives improved runtime and sample complexity over existing work. Notably, we give improved asymptotic sample complexity over known matrix concentration results for general distributions, and give the first streaming algorithm that is asymptotically optimal in the popular Gaussian spike model.

Our bounds hold for any or distribution . In the offline case we require no initial knowledge of , the eigenvalue gap, or the top eigenvector. We are hopeful that our online algorithm can also be made to work without such estimates.

Outside of our runtime results, our robust analysis of the shifted-and-inverted power method provides new understanding of this well studied and widely implemented technique. It gives a means of obtaining provably accurate results when each iteration is implemented using fast approximate linear system solvers with rather weak accuracy guarantees.

In practice, this reduction between approximate linear system solving and eigenvector computation shows that regression libraries can be directly utilized to obtain faster running times for eigenvector computation in many cases. Furthermore, in theory we believe that our reduction suggests computational limits inherent in eigenvector computation as seen by the often easier-to-analyze problem of linear system solving. Indeed, in Section 7, we provide evidence that in certain regimes our statistical results are optimal.

We remark that during the preparation of our manuscript we found that previously and independently Dan Garber and Elad Hazan had discovered a similar technique using shift-and-invert preconditioning and SVRG for sums of non-convex functions to improve the running time for offline eigenvector computation [GH15].

### 1.3 Previous Work

#### Offline Eigenvector Computation

Due to its universal applicability, eigenvector computation in the offline case is extremely well studied. Classical methods, such as the QR algorithm, take roughly time to compute a full eigendecomposition. This can be accelerated to , where is the matrix multiplication constant [Wil12, LG14], however this is still prohibitively expensive for large matrices. Hence, faster iterative methods are often employed, especially when only the top eigenvector (or a few of the top eigenvectors) is desired.

As discussed, the popular power method requires iterations to converge to an approximate top eigenvector. Using Chebyshev iteration, or more commonly, the Lanczos method, this bound can be improved to [Saa92], giving total runtime of .

Unfortunately, if is very large and is small, this can still be quite expensive, and there is a natural desire to separate the dependence from the term. One approach is to use random subspace embedding matrices [AC09, CW13] or fast row sampling algorithms [CLM15], which can be applied in time and yield a matrix which is a good spectral approximation to the original. The number of rows in depends only on the stable rank of and the error of the embedding – hence it can be significantly smaller than . Applying such a subspace embedding and then computing the top eigenvector of will require runtime , achieving the goal of reducing runtime dependence on the input size . Unfortunately, the dependence on will be significantly suboptimal – such an approach cannot be used to obtain a linearly convergent algorithm. Further, the technique does not extend to online setting, unless we are willing to store a full subspace embedding of our sampled rows.

Another approach, which we follow more closely, is to apply stochastic optimization techniques, which iteratively update an estimate to the top eigenvector, considering a random row of with each update step. Such algorithms naturally extend to the online setting and have led to improved dependence on the input size for a variety of problems [Bot10]. Using variance-reduced stochastic gradient techniques, [Sha15c] achieves runtime for approximately computing the top eigenvector of a matrix with constant probability. Here is an upper bound on the squared row norms of . In the best case, when row norms are uniform, this runtime can be simplified to .

The result in [Sha15c] makes an important contribution in separating input size and gap dependencies using stochastic optimization techniques. Unfortunately, the algorithm requires an approximation to the eigenvalue gap and a starting vector that has a constant dot product with the top eigenvector. In [Sha15b] the analysis is extended to a random initialization, however loses polynomial factors in . Furthermore, the dependences on the stable rank and are suboptimal – we improve them to and respectively, obtaining true linear convergence.

#### Online Eigenvector Computation

While in the offline case the primary concern is computation time, in the online, or statistical setting, research also focuses on minimizing the number of samples that we must draw from in order to achieve a given accuracy on our eigenvector estimate. Especially sought after are results that achieve asymptotically optimal accuracy as the sample size grows large.

While the result we give in Theorem 25 will work for any distribution parameterized by a variance bound, in this section, in order to more easily compare to previous work, we normalize and assume we have the variance bound along with the row norm bound . Additionally, we compare runtimes for computing some such that , as this is the most popular guarantee studied in the literature. Theorem 25 is easily extended to this setting as obtaining with ensures . Our our algorithm requires samples to find such a vector under the assumptions given above.

The simplest algorithm in this setting is to take samples from and compute the leading eigenvector of the empirical estimate . By a matrix Bernstein bound, such as inequality Theorem 6.6.1 of [Tro15], samples is enough to insure . By Lemma 32 in the Appendix, this gives that, if is set to the top eigenvector of it will satisfy . Such an can then be approximated by applying any offline eigenvector algorithm to the empirical estimate.

A large body of work focuses on improving the computational and sample cost of this simple algorithm, under a variety of assumptions on . The most common focus is on obtaining streaming algorithms, in which the storage space is just - proportional to the size of a single sample.

In Table 2 we give a sampling of results in this area. All of these results rely on distributional assumptions at least as strong as those given above. In each setting, we can use the cited algorithm to first compute an approximate eigenvector, and then refine this approximation to an approximation using samples by applying our streaming SVRG based algorithm. This allows us to obtain improved runtimes and sample complexities. Notably, by the lower bound shown in Section 7, in all settings considered in Table 2, we achieve optimal asymptotic sample complexity - as our sample size grows large, our decreases at an optimal rate. To save space, we do not include our improved runtime bounds in Table 2, however they are easy to derive by adding the runtime required by the given algorithm to achieve accuracy, to – the runtime required by our streaming algorithm.

The bounds given for the simple matrix Bernstein based algorithm described above, Krasulina/Oja’s Algorithm [BDF13], and SGD [Sha15a] require no additional assumptions, aside from those given at the beginning of this section. The streaming results cited for [MCJ13] and [HP14] assume is generated from a Gaussian spike model, where and . We note that under this model, the matrix Bernstein results improve by a factor and so match our results in achieving asymptotically optimal convergence rate. The results of [MCJ13] and [HP14] sacrifice this optimality in order to operate under the streaming model. Our work gives the best of both works – a streaming algorithm giving asymptotically optimal results.

The streaming Alecton algorithm [SRO15] assumes for any symmetric that commutes with . This is a strictly stronger assumption than the assumption above that .

### 1.4 Paper Organization

Section 2

Review problem definitions and parameters for our runtime and sample bounds.

Section 3

Describe the shifted-and-inverted power method and show how it can be implemented using approximate system solvers.

Section 4

Show how to apply SVRG to solve systems in our shifted matrix, giving our main runtime results for eigenvector computation in the offline setting.

Section 5

Show how to use an online variant of SVRG to run the shifted-and-inverted power method, giving our main sampling complexity and runtime results in the statistical setting.

Section 6

Show how to efficiently estimate the shift parameters required by our algorithms, completing their analysis.

Section 7

Give a lower bound in the statistical setting, showing that our results are asymptotically optimal.

## 2 Preliminaries

We bold all matrix variables. We use . For a symmetric positive semidefinite (PSD) matrix we let and we let denote its eigenvalues in decreasing order. We use to denote the condition that for all .

### 2.1 The Offline Problem

We are given a matrix with rows and wish to compute an approximation the top eigenvector of . Specifically for some error parameter we want a unit vector such that .

### 2.2 The Statistical Problem

We are given independent samples from a distribution on and wish to compute the top eigenvector of . Again, for some error parameter we want to return a unit vector such that .

### 2.3 Problem Parameters

We parameterize the running time of our algorithm in terms of several natural properties of , , and . We let denote the eigenvalues of in decreasing order and we let denote their corresponding eigenvectors. We define the eigenvalue gap by .

We use the following additional parameters to provide running times for the offline and statistical problems respectively:

• Offline Problem: We let denote the stable rank of . Note that we always have . We let denote the number of non-zero entries in .

• Online Problem: We let denote a natural upper bound on the variance of in various settings. Note that .

## 3 Framework

Here we develop our robust shift-and-invert framework. In Section 3.1 we provide a basic overview of the framework and in Section 3.2 we introduce the potential function we use to measure progress of our algorithms. In Section 3.3 we show how to analyze the framework given access to an exact linear system solver and in Section 3.4 we strengthen this analysis to work with an inexact linear system solver. Finally, in Section 3.5 we discuss initializing the framework.

### 3.1 Shifted-and-Inverted Power Method Basics

We let denote the shifted matrix that we will use in our implementation of the shifted-and-inverted power method. As discussed, in order for to have a large eigenvalue gap, should be set to for some constant . Throughout this section we assume that we have a crude estimate of and and fix to be a value satisfying . (See Section 6 for how we can compute such a ). For the remainder of this section we work with such a fixed value of and therefore for convenience denote as .

Note that and so This large gap will ensure that, assuming the ability to apply , the power method will converge very quickly. In the remainder of this section we develop our error analysis for the shifted-and-inverted power method which demonstrates that approximate application of in each iteration in fact suffices.

### 3.2 Potential Function

Our analysis of the power method focuses on the objective of maximizing the Rayleigh quotient, for unit vector . Note that as the following lemma shows, this has a directly correspondence to the error in maximizing :

###### Lemma 1 (Bounding Eigenvector Error by Rayleigh Quotient).

For a unit vector let . If then

 ∣∣v⊤1x∣∣≥√1−ϵλ1⋅gap.
###### Proof.

Among all unit vectors such that , a minimizer of has the form for some . We have

 ϵ =λ1−x⊤Σx=λ1−λ1(1−δ2)−λ2δ2=(λ1−λ2)δ2.

Therefore by direct computation,

 ∣∣v⊤1x∣∣=√1−δ2=√1−ϵλ1−λ2=√1−ϵλ1⋅gap .

In order to track the progress of our algorithm we use a more complex potential function than just the Rayleigh quotient error, . Our potential function is defined for by

 G(x)\lx@stackreldef=∥∥Pv⊥1(x)∥∥B∥∥Pv1(x)∥∥B

where and denote the projections of in the direction of and on the subspace orthogonal to respectively. Equivalently, we have that:

 G(x)=√∥x∥2B−(v⊤1B1/2x)2∣∣v⊤1B1/2x∣∣=√∑i≥2α2iλi(B−1)√α21λ1(B−1). (1)

where .

When the Rayleigh quotient error of is small, we can show a strong relation between and . We prove this in two parts. First we prove a technical lemma, Lemma 2, that we will use several time for bounding the numerator of and then we prove the connection in Lemma 3.

###### Lemma 2.

For a unit vector and if then

 ϵ≤x⊤Bx−(v⊤1Bx)(v⊤1x)≤ϵ(1+λ−λ1λ1⋅gap).
###### Proof.

Since and since is an eigenvector of with eigenvalue we have

 x⊤Bx−(v⊤1Bx)(v⊤1x) =λ∥x∥22−x⊤Σx−(λv⊤1x−v⊤1Σx)(v⊤1x) =λ−λ1+ϵ−(λv⊤1x−λ1v⊤1x)(v⊤1x) =(λ−λ1)(1−(v⊤1x)2)+ϵ.

Now by Lemma 1 we know that , giving us the upper bound. Furthermore, since trivially and , we have the lower bound. ∎

###### Lemma 3 (Potential Function to Rayleigh Quotient Error Conversion).

For a unit vector and if , we have:

 ϵλ−λ1≤G(x)2≤(1+λ−λ1λ1⋅gap)(1+2ϵλ1⋅gap)ϵλ−λ1.
###### Proof.

Since is an eigenvector of , we can write . Lemmas 1 and 2 then give us:

 ϵλ−λ1≤G(x)2≤(1+λ−λ1λ1⋅gap)ϵ(λ−λ1)(1−ϵλ1⋅gap).

Since , we have . This proves the lemma. ∎

### 3.3 Power Iteration

Here we show that the shifted-and-inverted power iteration in fact makes progress with respect to our objective function given an exact linear system solver for . Formally, we show that applying to a vector decreases the potential function geometrically.

###### Theorem 4.

Let be a unit vector with and let , i.e. the power method update of on . Then, we have:

Note that may no longer be a unit vector. However, for any scaling parameter , so the theorem also holds for scaled to have unit norm.

###### Proof.

Writing in the eigenbasis, we have and . Since , and by the equivalent formation of given in (1):

 G(˜x)=√∑i≥2α2iλi(B−1)√α21λ1(B−1)≤λ2(B−1)λ1(B−1)⋅√∑i≥2α2iλi(B−1)√α21λ1(B−1)=λ2(B−1)λ1(B−1)⋅G(x) .

Recalling that yields the result. ∎

The challenge in using the above theorem, and any traditional analysis of the shifted-and-inverted power method, is that we don’t actually have access to . In the next section we show that the shifted-and-inverted power method is robust – we still make progress on our objective function even if we only approximate using a fast linear system solver.

### 3.4 Approximate Power Iteration

We are now ready to prove our main result on the shifted-and-inverted power method using approximate linear system solves at each iteration. In words, we show that each iteration of the power method makes constant factor expected progress on our potential function assuming we:

1. Start with a sufficiently good and an approximation of

2. Apply approximately using a system solver such that the function error (or distance to in the norm) is sufficiently small in expectation.

3. Estimate Rayleigh quotients over well enough to only accept updates that do not hurt progress on the objective function too much.

This third assumption is necessary since the second assumption is quite weak. An expected progress bound on the linear system solver allows, for example, the solver to occasionally return a solution that is entirely orthogonal to , causing us to make unbounded backwards progress on our potential function. The third assumption allows us to reject possibly harmful updates and ensure that we still make progress in expectation. In the offline setting, we can access and are able able to compute Rayleigh quotients exactly in time time. However, we only assume the ability to estimate quotients as in the online setting we only have access to through samples from .

While, our general theorem for the approximate power iteration, Theorem 5, assumes that we can solve linear systems to some absolute accuracy in expectation, this is not the standard assumption for many linear system solvers. Many fast iterative linear system solvers assume an initial approximation to and then show that the quality of this approximation is then improved geometrically in each iteration of the algorithm. In Corollary 6 we show how to find a coarse initial approximation to , in fact just approximating with . Moreover, using this course approximation in Corollary 6 we show that Theorem 5 actually implies that it suffices to just make a fixed relative amount of progress in solving the linear system.

Note that in both claims we measure error of the linear system solver using . This is a natural norm in which geometric convergence is shown for many linear system solvers and directly corresponds to the function error of minimizing to compute .

###### Theorem 5 (Approximate Shifted-and-Inverted Power Iteration – Warm Start).

Let be a unit vector such that . Suppose we know some shift parameter with and an estimate of such that . Furthermore, suppose we have a subroutine that on any input

for some , and a subroutine that on any input

 ∣∣ˆquot(x)−quot(x)∣∣≤130(λ−λ1) for all % nonzero x∈Rd.

where .

Then the following update procedure:

 Set ˆx=solve(x),
 Set ˜x=⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩ˆx if ⎧⎪⎨⎪⎩ˆquot(ˆx)≥ˆλ1−(λ−ˆλ1)/6 and ∥ˆx∥2≥231λ−ˆλ1x otherwise,

satisfies the following:

• and

• .

That is, not only do we decrease our potential function by a constant factor in expectation, but we are guaranteed that the potential function will never increase beyond .

###### Proof.

The first claim follows directly from our choice of from and . If , it holds trivially by our assumption that . Otherwise, and we know that

 λ1−quot(ˆx) ≤ˆλ1−quot(ˆx)≤ˆλ1−ˆquot(ˆx)+∣∣ˆquot(ˆx)−quot(ˆx)∣∣ ≤λ−ˆλ16+λ−λ130≤λ−λ15≤λ1⋅gap500 .

The claim then follows from Lemma 3 as

 G(ˆx)2 ≤(1+λ−λ1λ1⋅gap)(1+2(λ1−quot(ˆx))λ1⋅gap)λ1−quot(ˆx)λ−λ1

All that remains is to show the second claim, that . Let denote the event that we accept our iteration and set . That is:

 F \lx@stackreldef={ˆquot(ˆx)≥ˆλ1−λ−ˆλ16}∪{∥ˆx∥2≥231λ−ˆλ1}.

Using our bounds on and , we know that and . Therefore, since we have

 F ⊆{quot(ˆx)≥λ1−(λ−λ1)/2}∪{∥ˆx∥2≥231λ−λ1},

We will complete the proof in two steps. First we let and show that assuming is true then and are linearly related, i.e. expected error bounds on correspond to expected error bounds on . Second, we bound the probability that does not occur and bound error incurred in this case. Combining these yields the result.

To show the linear relationship in the case where is true, first note Lemma 1 shows that in this case . Consequently,

 ∥∥Pv1(ˆx)∥∥B=∣∣v⊤1ˆx∣∣√λ−λ1=∣∣∣v⊤1ˆx∥ˆx∥2∣∣∣⋅∥ˆx∥√λ−λ1≥34⋅231√λ−λ1=√λ1(B−1)2 .

However,

and by Theorem 4 and the definition of we have

Taking expectations, using that , and combining these three inequalities yields

 E[G(ˆx)|F]=E⎡⎢ ⎢⎣∥∥Pv⊥1(B−1x)∥∥B∥∥Pv1(B−1x)∥∥B∣∣ ∣ ∣∣F⎤⎥ ⎥⎦≤G(x)50+2E[∥ξ∥B∣∣F]√λ1(B−1) (2)

So, conditioning on making an update and changing (i.e. occurring), we see that our potential function changes exactly as in the exact case (Theorem 4) with additional additive error due to our inexact linear system solve.

Next we upper bound and use it to compute