Streaming PCA: Matching Matrix Bernstein and Near-Optimal Finite Sample Guarantees for Oja’s Algorithm


Abstract

This work provides improved guarantees for streaming principal component analysis (PCA). Given matrices A_1, …, A_n in R^{d×d} sampled independently from distributions satisfying E[A_i] = Σ for a symmetric PSD matrix Σ, this work provides an O(d)-space, linear-time, single-pass streaming algorithm for estimating the top eigenvector of Σ. The algorithm nearly matches (and in certain cases improves upon) the accuracy obtained by the standard batch method that computes the top eigenvector of the empirical covariance, as analyzed by the matrix Bernstein inequality. Moreover, to achieve constant accuracy, our algorithm improves upon the best previously known sample complexities of streaming algorithms by a multiplicative factor of either d or 1/gap, where gap denotes the relative distance between the top two eigenvalues of Σ.

These results are achieved through a novel analysis of the classic Oja's algorithm, one of the oldest and most popular algorithms for streaming PCA. In particular, this work shows that simply picking a random initial point w_0 and applying the update rule w_t = w_{t-1} + η_t A_t w_{t-1} (followed by normalization) suffices to accurately estimate the top eigenvector, with a suitable choice of the step sizes η_t. We believe our result sheds light on how to efficiently perform streaming PCA both in theory and in practice, and we hope that our analysis may serve as the basis for analyzing many variants and extensions of streaming PCA.

1 Introduction

Principal component analysis (PCA) is one of the most fundamental problems in machine learning, numerical linear algebra, and data analysis. It is commonly used for data compression, image processing, visualization [1], and many other tasks.

When we wish to perform PCA on large data sets, it may be the case that we cannot afford more than a single pass over the data (or, worse, cannot even store the data in the first place) [2, 3, 4]. To alleviate this issue, a popular line of research over the past several decades has been to consider streaming algorithms for PCA under the assumption that the data has reasonable statistical properties [5, 6, 7, 8, 9]. There have been significant breakthroughs in obtaining near-optimal streaming PCA algorithms under fairly specialized models, e.g., the spiked covariance model [9].

This work considers one of the most natural variants of PCA, estimating the top eigenvector of a symmetric matrix, under a mild (and standard) set of assumptions under which concentration of measure applies via the matrix Bernstein inequality [10, 11]. In particular, the setting is as follows:

Definition 1 (Streaming PCA).

Let A_1, …, A_n in R^{d×d} be a sequence of (not necessarily symmetric) random matrices sampled independently from distributions that satisfy the following:

  1. E[A_i] = Σ for a symmetric positive semidefinite (PSD) matrix Σ,

  2. ‖A_i‖_2 ≤ M with probability 1, and

  3. max{ ‖E[(A_i - Σ)(A_i - Σ)^⊤]‖_2, ‖E[(A_i - Σ)^⊤(A_i - Σ)]‖_2 } ≤ V.

Let v_1, …, v_d denote the eigenvectors of Σ and λ_1 > λ_2 ≥ … ≥ λ_d denote the corresponding eigenvalues. Our goal is to compute an ε-approximation to v_1, that is, a unit vector w such that sin^2(w, v_1) = 1 - (w^⊤ v_1)^2 ≤ ε, in a single pass while minimizing space, time, and error (i.e. ε). Note that sin(w, v_1) denotes the sine of the angle between w and v_1.

A special case of Streaming PCA is to estimate the top eigenvector of the covariance matrix of a distribution over R^d, i.e., given independent samples x_1, …, x_n, estimate the top eigenvector of Σ = E[x x^⊤] by taking A_i = x_i x_i^⊤. This encompasses the popular "spiked covariance model" [12].
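In this rank-one special case, Oja's update (Algorithm 1 below) never needs to form the d × d matrix A_i explicitly, since A_i w = (x_i^⊤ w) x_i costs only O(d) work per sample. The following minimal sketch (ours, not the paper's; the data model in sample_x is a hypothetical placeholder) checks this in numpy:

# Minimal sketch (not from the paper): with rank-one samples A_i = x_i x_i^T,
# Oja's update w <- w + eta * A_i w reduces to w <- w + eta * (x_i . w) * x_i,
# so it needs only O(d) time and memory per sample.
import numpy as np

rng = np.random.default_rng(0)
d = 50

def sample_x(rng, d):
    """Hypothetical data model: a planted direction plus isotropic noise."""
    planted = np.zeros(d)
    planted[0] = 1.0
    return 2.0 * rng.standard_normal() * planted + rng.standard_normal(d)

x = sample_x(rng, d)
A = np.outer(x, x)                     # the matrix sample A_i = x_i x_i^T
w = rng.standard_normal(d)
eta = 0.01
w_full = w + eta * (A @ w)             # update with the explicit d x d matrix
w_fast = w + eta * (x @ w) * x         # same update without forming A_i
assert np.allclose(w_full, w_fast)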

It is well known that to solve the Streaming PCA problem, one can simply compute the empirical average S_n = (A_1 + ⋯ + A_n)/n and compute the top right singular vector of this matrix. Here, the matrix Bernstein inequality [10, 11] and Wedin's theorem [13] imply the following standard sample complexity bound for the Streaming PCA problem:

Theorem 1.1 (Eigenvector Concentration using matrix Bernstein and Wedin’s theorem).

Under the assumptions of Definition 1, the top right singular vector w of S_n = (A_1 + ⋯ + A_n)/n is an ε-approximation to the top eigenvector v_1 of Σ with probability at least 1 - δ, where, up to lower-order terms in n, ε = O( V log(d/δ) / ((λ_1 - λ_2)^2 n) ).

Theorem 1.1 is essentially the best previously known sample complexity for estimating the top eigenvector. Unfortunately, the above is a purely statistical claim and, algorithmically, there are at least two concerns. First, computing the empirical average naively requires O(d^2) space (and time proportional to n d^2), and second, computing the top eigenvector of the empirical covariance matrix in general may require superlinear time [15]. While there have been many attempts to produce streaming algorithms that use only O(d) space to solve the streaming PCA problem, to our knowledge, all previous methods lose a multiplicative factor of either d or 1/gap in the analysis in order to achieve constant accuracy when applied in our setting [7, 8, 16, 9, 14].
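For reference, here is a minimal sketch of the batch baseline behind Theorem 1.1 (our illustration; the function name batch_top_direction is ours). It makes the Θ(d^2) memory cost of the empirical average explicit:

# Batch baseline from Theorem 1.1: average the matrix samples and take the top
# right singular vector of the average. Storing S_n requires Theta(d^2) memory,
# which is exactly what a streaming algorithm must avoid.
import numpy as np

def batch_top_direction(As):
    """As: iterable of d x d numpy arrays A_1, ..., A_n (need not be symmetric)."""
    S, n = None, 0
    for A in As:
        S = A.astype(float).copy() if S is None else S + A
        n += 1
    S /= n                              # empirical average S_n = (A_1 + ... + A_n)/n
    _, _, vt = np.linalg.svd(S)         # top right singular vector, as in Theorem 1.1
    return vt[0]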

In an attempt to overcome this limitation and improve the guarantees for solving the streaming PCA problem, this work seeks to address the following question:

Can we match the sample complexity of matrix Bernstein + Wedin's theorem with an algorithm that uses only O(d) space and takes a single linear-time pass over the input?

This work answers this question in the affirmative, showing that one can succeed with constant probability while matching the sample complexity of Theorem 1.1 up to logarithmic terms and small additive factors. Interestingly, this is achieved by providing a novel analysis of the classical Oja's algorithm, which is perhaps the most popular algorithm for Streaming PCA [6].

Input: number of samples n, step sizes {η_1, …, η_n}.
  Choose w_0 uniformly at random from the unit sphere
  for t = 1, …, n do
     w_t ← w_{t-1} + η_t A_t w_{t-1}
     w_t ← w_t / ‖w_t‖_2
  end for
  Output: w_n
Algorithm 1: Oja's algorithm for computing the top eigenvector

Oja's algorithm is one of the simplest algorithms one could imagine for the streaming PCA problem (see Algorithm 1). In fact, due to its simplicity, it was proposed as a neurally plausible algorithm. In the case that each A_t comes from the same distribution, it corresponds to simply performing projected stochastic gradient descent on the objective of maximizing the Rayleigh quotient w^⊤ Σ w over the distribution. It is well known that under very mild conditions on the step size sequence, Oja's algorithm asymptotically converges to the top eigenvector of the covariance matrix [6]. However, obtaining optimal rates of convergence, let alone finite sample guarantees, for Streaming PCA has been quite challenging; the best known results are off from Theorem 1.1 by a multiplicative factor of d [9].
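Algorithm 1 itself is only a few lines of code. A minimal numpy sketch (the helper names oja and sin_sq are ours, not the paper's), together with the error measure from Definition 1:

# Oja's algorithm (Algorithm 1): a single O(d)-space pass over the samples.
import numpy as np

def oja(As, etas, rng=np.random.default_rng(0)):
    """As: iterable of d x d arrays A_1, ..., A_n; etas: step sizes eta_1, ..., eta_n."""
    As = list(As)
    d = As[0].shape[0]
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)              # w_0 uniform on the unit sphere
    for A, eta in zip(As, etas):
        w = w + eta * (A @ w)           # w_t <- w_{t-1} + eta_t A_t w_{t-1}
        w /= np.linalg.norm(w)          # renormalize onto the sphere
    return w

def sin_sq(w, v1):
    """Error from Definition 1: sin^2 of the angle between w and v_1."""
    return 1.0 - float(w @ v1) ** 2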

This work shows that for a proper choice of learning rates η_t, Oja's algorithm can in fact improve upon the best known results for streaming PCA and answer our question in the affirmative. In particular, we have the following:

Theorem 1.2.

Let the assumptions of Definition 1 hold. Suppose the step size sequence for Algorithm 1 is chosen to be η_t = α / ((λ_1 - λ_2)(β + t)), with the constants α and β chosen as in Section 4 (Theorem 4.1).

Then the output w_n of Algorithm 1 is an ε-approximation to the top eigenvector v_1 of Σ, with ε bounded by C · V log d / ((λ_1 - λ_2)^2 n) plus lower-order terms (made precise in Section 4), with probability greater than 3/4. Here C is an absolute numerical constant.

The error above should be interpreted as the sum of a first-order term and a lower-order term which is at most ε once n is sufficiently large. In particular, this result shows that, up to an additive lower-order term, one can match Theorem 1.1 with an asymptotic error of O(V log d / ((λ_1 - λ_2)^2 n)) with constant probability. The lower-order term is itself composed of three parts. The first part is exactly the same as a corresponding term in Theorem 1.1. The second part carries an additional factor over the first-order term and becomes irrelevant once n is sufficiently large. Notably, the third part does not appear in Theorem 1.1; it arises here entirely for computational reasons: our setting allows only a single linear-time pass over the matrices A_1, …, A_n, while Theorem 1.1 makes no such assumption. For instance, consider the case V = 0, which means A_i = Σ for every i. Matrix Bernstein tells us that a single sample is sufficient to determine v_1. However, it is not evident how to compute it using a single linear-time pass over A_1, …, A_n. Note, however, that the rate at which the lower-order terms decrease with n is much better than that guaranteed by Theorem 1.1.

In fact, this result also improves upon the asymptotic error rate obtained by Theorem 1.1. In particular, the following result shows that Oja's algorithm achieves an asymptotic rate of O(V / ((λ_1 - λ_2)^2 n)), which is better than that of matrix Bernstein by a factor of log d.

Theorem 1.3.

Let the assumptions of Definition 1 hold. Suppose the step size sequence for Algorithm 1 is chosen to be η_t = α / ((λ_1 - λ_2)(β + t)), with the constants α and β chosen as in Section 4 (Theorem 4.1).

Suppose further that n is sufficiently large (as quantified in Section 4). Then the output w_n of Algorithm 1 is an ε-approximation to the top eigenvector v_1 of Σ satisfying sin^2(w_n, v_1) ≤ C · V / ((λ_1 - λ_2)^2 n), with probability greater than 3/4. Here C is an absolute numerical constant.

Note that Theorems 1.2 and 1.3 guarantee a success probability of 3/4. One way to boost the probability to 1 - δ, for any δ > 0, is to run O(log(1/δ)) copies of the algorithm, each with constant success probability, and then output the geometric median of the solutions, which can be computed in nearly linear time [17]. The details are omitted here.
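A minimal sketch of this boosting step (ours, not the paper's implementation): the independent estimates are first sign-aligned, since v and -v represent the same eigenvector, and the geometric median is approximated by plain Weiszfeld iterations rather than the nearly-linear-time routine of [17].

# Boosting the success probability: combine estimates from independent runs of
# Algorithm 1 via an (approximate) geometric median. Sketch only.
import numpy as np

def combine_estimates(ws, iters=100, eps=1e-9):
    """ws: list of unit-norm estimates returned by independent runs of Algorithm 1."""
    W = np.stack([w if w @ ws[0] >= 0 else -w for w in ws])   # resolve the sign ambiguity
    m = W.mean(axis=0)                                        # starting point
    for _ in range(iters):                                    # Weiszfeld iterations
        dists = np.maximum(np.linalg.norm(W - m, axis=1), eps)
        m = (W / dists[:, None]).sum(axis=0) / (1.0 / dists).sum()
    return m / np.linalg.norm(m)                              # project back to the sphere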

Beyond the improved sample complexities, we believe our analysis sheds light on the type of step sizes for which Oja's algorithm converges quickly and therefore illuminates how to efficiently perform streaming PCA. We note that we have essentially assumed an oracle that sets the step size sequence, and an important question is how to set the step size in a robust and data-driven manner. Moreover, we believe that our analysis is fairly general and hope that it may be extended to make progress on analyzing the many variants of PCA that occur in both theory and practice.

1.1 Comparison with Existing Results

Here we compare our sample complexity bounds with existing analyses of various methods. Recall that the error of an estimate w is sin^2(w, v_1) = 1 - (w^⊤ v_1)^2.

We consider three popular methods used for computing v_1. The first is the batch method, which computes the largest eigenvector of the empirical covariance and combines Wedin's theorem with the matrix Bernstein inequality (cf. Theorem 1.1). The second method is Alecton, which is very similar to Oja's algorithm [9]. Finally, we consider the block power method (BPM) [16, 8], which divides the samples into blocks and applies a power iteration to the empirical estimate from each block. See Table 1 for the comparison.

We stress that some of the results we compare to make different assumptions than Definition 1. The bounds stated for them are our best attempt to adapt their bounds to the setting of Definition 1 (which is quite standard). The next paragraph provides a simple example which demonstrates the improvement of our result compared to existing work.

As a concrete example, let A_i = x_i x_i^⊤, where x_i is a scaled standard basis vector taking one of two values with appropriate probabilities (so that Σ is diagonal). Note that the parameters M, V, and λ_1 - λ_2 of Definition 1 are easy to compute for this distribution. Even for constant accuracy ε, Theorem 1.2 requires fewer samples than the corresponding theorems of [9] and of [16]. Asymptotically, as n becomes larger, our error scales as 1/n, while the errors guaranteed by [9] and [16] decay at a slower rate (see Table 1). Combining matrix Bernstein and Wedin's theorems gives an asymptotic error that is a log d factor larger than ours.

Algorithm                                        | Error                           | O(d) space?
Oja's (this work, Theorem 4.1)                   | O(V / ((λ_1 - λ_2)^2 n))        | Yes
Matrix Bernstein + Wedin's theorem (Theorem 1.1) | O(V log d / ((λ_1 - λ_2)^2 n))  | No
Alecton [9]                                      | see [9]                         | Yes
Block Power Method [16]                          | see [16]                        | Yes
Table 1: Asymptotic error guaranteed by various methods under the assumptions of Definition 1, with at least constant probability and ignoring constant factors. Recall that the error is defined as sin^2(w, v_1). Our analysis provides the optimal 1/n error decay rate, whereas Alecton and the block power method obtain a slower rate. Moreover, our bound is tighter than that of Alecton [9] and tighter than that of the Block Power Method [16]. The assumptions made in [9] for Alecton differ from our (more standard) assumptions; we have optimized their bounds for our setting. See Section 1.1 for a concrete example where our analysis provides these improvements over [9, 16].
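The qualitative picture in Table 1 is easy to check on synthetic data. The sketch below (ours, not an experiment from the paper) reuses the helper functions oja, sin_sq, and batch_top_direction defined earlier; the Gaussian data model and the step-size constants are assumptions made purely for illustration.

# Synthetic sanity check: Oja's algorithm versus the batch estimator on Gaussian
# data with a planted top eigenvector. Reuses oja, sin_sq, batch_top_direction.
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 2000
lam1, lam2 = 1.0, 0.5                          # top two eigenvalues; gap = 0.5
v1 = np.zeros(d)
v1[0] = 1.0

scales = np.sqrt(np.array([lam1] + [lam2] * (d - 1)))
xs = rng.standard_normal((n, d)) * scales      # E[x x^T] = diag(lam1, lam2, ..., lam2)
As = [np.outer(x, x) for x in xs]              # rank-one matrix samples

# Assumed schedule eta_t = alpha / ((lam1 - lam2) * (beta + t)); alpha, beta are tuning knobs.
etas = [2.0 / ((lam1 - lam2) * (500 + t)) for t in range(1, n + 1)]

print("Oja   sin^2 error:", sin_sq(oja(As, etas, rng), v1))
print("Batch sin^2 error:", sin_sq(batch_top_direction(As), v1))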

1.2 Additional Related Work

Existing results for computing the largest eigenvector of a data covariance matrix from streaming samples can be divided into three broad settings: a) stochastic data, b) arbitrary sequences of data, and c) regret bounds for arbitrary sequences of data.

Stochastic data: Here, the data is assumed to be sampled i.i.d. from a fixed distribution. The analyses of Oja's algorithm as well as those of the block power method and Alecton mentioned earlier are in this setting. [8] also obtained a result in the restricted spiked covariance model. [7] provides an analysis of a modification of Oja's algorithm, but with an extra multiplicative factor compared to ours. [14] provides an algorithm based on the shift-and-invert framework that obtains the same asymptotic error as ours. However, their algorithm requires a warm start with a vector that is already within constant distance of the top eigenvector, which is itself a hard problem.

Arbitrary data: In this setting, each row of the data matrix is provided in an arbitrary order. Most of the existing methods here first compute a sketch of the matrix and use that to compute an estimate of the top eigenvector [18, 19, 20, 21, 22, 23]. However, a direct application of such techniques to the stochastic setting leads to sample complexity bounds which are larger by a multiplicative dimension-dependent factor (ignoring other quantities such as the variance). Finally, [24, 25, 14] also provide methods for eigenvector computation, but they require multiple passes over the data and hence do not apply to the streaming setting.

Regret bounds: Here, at each step the algorithm has to output an estimate w_t, for which it receives the reward w_t^⊤ A_t w_t, and the goal is to minimize the regret with respect to the best fixed direction in hindsight. The algorithms in this regime are mostly based on online convex optimization, and applying them in our setting would again result in the loss of a multiplicative dimension-dependent factor. Moreover, typical algorithms in this setting are not memory efficient [26, 27].

1.3 Notation

Bold lowercase letters such as w are used to denote vectors and bold uppercase letters such as A to denote matrices. For symmetric matrices A and B, A ⪯ B denotes the condition that x^⊤ A x ≤ x^⊤ B x for all x, and A ⪰ B is defined analogously. A symmetric matrix A is positive semidefinite (PSD) if A ⪰ 0. For symmetric matrices A and B, define their inner product as ⟨A, B⟩ = tr(A B).

1.4 Paper Organization

The rest of this paper is organized as follows. Section 2 introduces basic mathematical facts used throughout the paper and also provides a proof of the error bound of the standard batch method (Theorem 1.1). Section 3 provides an overview of our approach to analyzing Oja's algorithm and states the main technical result of the paper. This technical result is used in Section 4 to prove our main guarantees for Oja's algorithm and to justify the choice of step sizes. Section 5 presents the proof of the main technical result. Section 6 concludes and mentions a few interesting future directions.

2 Preliminaries

The following basic inequalities regarding power series, the exponential, and PSD matrices are used throughout. The facts are summarized here:

Lemma 2.1 (Basic Inequalities).

The following are true:

  • for all

  • for all

  • for PSD matrices with

  • for all matrices .

Proof.

The first inequality follows from the Taylor expansion of . The second comes from and for . The third follows by considering upper and lower Riemann sums of . The fourth from the fact that since is PSD there is a matrix with and therefore

The final follows from Cauchy Schwarz and Young’s inequality, i.e. as

The following is a matrix Bernstein based proof of the error bound of the batch method.

Proof of Theorem 1.1.

Using Theorem 1.4 of [11], we have, with probability at least 1 - δ:

‖(A_1 + ⋯ + A_n)/n - Σ‖_2 ≤ c ( √( V log(d/δ) / n ) + M log(d/δ) / n )    (1)

Let w be the top right singular vector of S_n = (A_1 + ⋯ + A_n)/n. Using Wedin's theorem [13], (1) implies:

sin(w, v_1) ≤ 2 ‖(A_1 + ⋯ + A_n)/n - Σ‖_2 / (λ_1 - λ_2),    (2)

provided n is large enough that the right-hand side of (1) is at most (λ_1 - λ_2)/2.

The theorem now follows by combining (1) and (2). ∎

3 Approach

Let us now describe our approach to analyzing Oja's algorithm. We state our main theorem regarding the convergence rate of Oja's algorithm and discuss how it is proved. The details of the proof are deferred to Section 5, and the use of the theorem to choose step sizes appears in Section 4.

One of the primary difficulties in analyzing Oja's algorithm, or more broadly any algorithm for streaming PCA, is choosing a suitable potential function with which to analyze the method. If we try to analyze the progress of Oja's algorithm in every iteration by measuring the quality of w_t, we run the risk that during the first few iterations a step may actually yield a w_t that is orthogonal to v_1. If this happens, then even in the best case, where all future samples A_t equal Σ itself, we would still fail to converge. In short, if we do not carefully account for randomness in our potential function, it is difficult to show that a rapidly convergent algorithm does not catastrophically fail.

Rather than analyzing the convergence of w_t directly, we instead analyze the convergence of Oja's algorithm as an operator on the random initial vector w_0. Oja's algorithm simply considers the matrix

B_n = (I + η_n A_n)(I + η_{n-1} A_{n-1}) ⋯ (I + η_1 A_1)    (3)

and outputs the normalized result of applying this matrix, B_n, to the random initial vector, i.e.

w_n = B_n w_0 / ‖B_n w_0‖_2.    (4)

Rather than analyze the improvement of w_t over w_{t-1}, we analyze B_t's improvement over B_{t-1}.
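Equations (3) and (4) are easy to verify numerically: the per-step normalization in Algorithm 1 only rescales the iterate, so the final direction coincides with B_n w_0 / ‖B_n w_0‖_2. A short sketch of this check (ours, with arbitrary synthetic inputs):

# Check of (3)-(4): Oja's iterates (normalized at every step) point in the same
# direction as B_n w_0, since each normalization is just a positive rescaling.
import numpy as np

rng = np.random.default_rng(2)
d, n = 20, 200
As = [0.1 * rng.standard_normal((d, d)) for _ in range(n)]   # arbitrary small samples
etas = [0.05] * n

w0 = rng.standard_normal(d)
w0 /= np.linalg.norm(w0)

w = w0.copy()                                  # Algorithm 1: normalize at every step
for A, eta in zip(As, etas):
    w = w + eta * (A @ w)
    w /= np.linalg.norm(w)

B = np.eye(d)                                  # B_n = (I + eta_n A_n) ... (I + eta_1 A_1)
for A, eta in zip(As, etas):
    B = (np.eye(d) + eta * A) @ B
u = B @ w0
u /= np.linalg.norm(u)

assert np.allclose(w @ u, 1.0, atol=1e-6)      # same direction (both are unit vectors)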

Another interpretation of (3) and (4) is that Oja's algorithm approximates v_1 by performing one step of the power method on the matrix B_n. Fortunately, analyzing when one step of the power method succeeds is fairly straightforward, as we show below:

Lemma 3.1 (One Step Power Method).

Let δ ∈ (0, 1), let v be a unit vector, and let V_⊥ be a matrix whose columns form an orthonormal basis of the subspace orthogonal to v. If w is chosen uniformly at random from the surface of the unit sphere, then with probability at least 1 - δ,

where C is an absolute constant.

Proof.

As w is distributed uniformly over the sphere, we can write w = g / ‖g‖_2 where g ∼ N(0, I_d). Consequently, with probability at least 1 - δ,

where c_1 and c_2 are absolute constants. The first bound follows since v^⊤ B g is a Gaussian random variable with variance ‖v^⊤ B‖_2^2, whose magnitude is therefore bounded away from zero with the stated probability. Similarly, the second bound follows from the fact that the corresponding quantity is a χ^2 random variable with at most d degrees of freedom. ∎
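The quantitative facts driving this proof are easy to check by simulation: for w_0 uniform on the unit sphere, E[(v^⊤ w_0)^2] = 1/d, and (v^⊤ w_0)^2 exceeds a constant multiple of 1/d with constant probability. A small sketch (ours):

# Monte Carlo check of the facts behind Lemma 3.1: a uniformly random unit vector
# w_0 has squared inner product about 1/d with any fixed direction v.
import numpy as np

rng = np.random.default_rng(3)
d, trials = 100, 50_000
v = np.zeros(d)
v[0] = 1.0

g = rng.standard_normal((trials, d))
w0 = g / np.linalg.norm(g, axis=1, keepdims=True)       # uniform on the unit sphere
proj_sq = (w0 @ v) ** 2

print("mean of (v^T w_0)^2     :", proj_sq.mean())              # close to 1/d = 0.01
print("P[(v^T w_0)^2 >= 0.1/d] :", (proj_sq >= 0.1 / d).mean()) # a constant (about 0.75)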

This lemma makes our goal clear. To show that Oja's algorithm succeeds, we simply need to show that with constant probability v_1^⊤ B_n w_0 is relatively large and ‖V_⊥^⊤ B_n‖_F is relatively small, where V_⊥ is a matrix whose columns form an orthonormal basis of the subspace orthogonal to v_1. This immediately alleviates the issues of catastrophic failure that plagued analyzing w_t directly. So long as we pick the step sizes sufficiently small, i.e. η_t ‖A_t‖_2 < 1, each factor I + η_t A_t is invertible. In this case B_n is invertible and v_1^⊤ B_n w_0 ≠ 0 for almost every w_0. In short, so long as we pick the step sizes sufficiently small, the quantity we wish to bound is always finite.

To actually bound ‖V_⊥^⊤ B_n‖_F and v_1^⊤ B_n w_0, we split the analysis into several parts in Section 5. First, we show that E[‖V_⊥^⊤ B_n‖_F^2] is small, which implies by Markov's inequality that ‖V_⊥^⊤ B_n‖_F is small with constant probability. Then, we show that E[‖v_1^⊤ B_n‖_2^2] is large and that its variance is small. By Chebyshev's inequality this implies that ‖v_1^⊤ B_n‖_2 is large with constant probability. Putting these together, we obtain the main technical result of our analysis of Oja's method. Once this roadmap is in place, the proof is fairly straightforward.

Theorem 3.1 (Oja’s Algorithm Convergence Rate).

Let δ ∈ (0, 1) and fix step sizes {η_t}_{t=1}^n. The output w_n of Algorithm 1 is an ε-approximation to v_1 with probability at least 1 - δ, where

Here C is an absolute constant.

Theorem 3.1 is proved in Section 5 and serves as the basis for our results regarding Oja's algorithm. In the next section we show how to use this theorem to choose step sizes and achieve the main results of this paper.

4 Main Results

Theorem 3.1, from the previous section, leads to our main results, provided here. The theorem and proof below essentially consist of choosing appropriate parameters to efficiently apply Theorem 3.1. Once we have this theorem, Theorems 1.2 and 1.3 follow by instantiating its parameters appropriately in each case.

Theorem 4.1.

Fix any δ ∈ (0, 1) and suppose the step sizes are set to η_t = α / ((λ_1 - λ_2)(β + t)) for suitable constants α and β.

Suppose the number of samples n is sufficiently large. Then the output w_n of Algorithm 1 satisfies:

with probability at least 1 - δ. Here C is an absolute numerical constant.

Proof.

Recall that Theorem 3.1 gives a bound of

(5)

where . Since , we have and by our assumption that , we have:

(6)

Moreover, since , we have

(7)

Note that . Moreover, as , we have:

(8)

Substituting (6), (7) and (8) into (5) proves the theorem. ∎

5 Bounding the Convergence of Oja’s Algorithm

In this section, we present a detailed proof of Theorem 3.1. The proof follows the approach outlined in Section 3 and uses the notation of that section, i.e.

  • We let B_t = (I + η_t A_t)(I + η_{t-1} A_{t-1}) ⋯ (I + η_1 A_1), with B_0 = I.

  • We let w_n = B_n w_0 / ‖B_n w_0‖_2 denote the output of Algorithm 1.

  • We let V_⊥ denote a matrix whose columns form an orthonormal basis for the subspace orthogonal to v_1.

We first provide several technical lemmas bounding the expected behavior of B_n and ultimately use these lemmas to prove Theorem 3.1. We begin with a straightforward lemma bounding the expected rate of increase of B_t in spectral norm.

Lemma 5.1.

For all and we have

Proof.

Let , i.e., . For all ,

(9)

where the last inequality follows from and,

Using (9) along with , , and , we have for :

The result follows by using induction along with and . ∎

Using Lemma 5.1, we next bound the expected value of ‖V_⊥^⊤ B_n‖_F^2. Ultimately this will allow us to bound ‖V_⊥^⊤ B_n‖_F with constant probability via Markov's inequality.

Lemma 5.2.

For all and the following holds

Proof.

Let . We first simplify as follows:

(10)

Recall that . Now, the second term on the right hand side can be bounded as follows:

where the first step follows from the fact that V_⊥ is orthogonal to v_1 and the second follows from the definition of V.

Plugging the above into (10), we get for all ,

where the last inequality follows from and using Lemma 5.1.

Recursing the above inequality, we obtain

Since we see that . Using that