Streaming PCA and Subspace Tracking: The Missing Data Case

Abstract

For many modern applications in science and engineering, data are collected in a streaming fashion carrying time-varying information, and practitioners need to process them with a limited amount of memory and computational resources in a timely manner for decision making. This is often coupled with the missing data problem, such that only a small fraction of data attributes are observed. These complications impose significant, and unconventional, constraints on the problem of streaming Principal Component Analysis (PCA) and subspace tracking, which is an essential building block for many inference tasks in signal processing and machine learning. This survey article reviews a variety of classical and recent algorithms for solving this problem with low computational and memory complexities, particularly those applicable in the big data regime with missing data. We illustrate that streaming PCA and subspace tracking algorithms can be understood through algebraic and geometric perspectives, and they need to be adjusted carefully to handle missing data. Both asymptotic and non-asymptotic convergence guarantees are reviewed. Finally, we benchmark the performance of several competitive algorithms in the presence of missing data for both well-conditioned and ill-conditioned systems.

1 Introduction

The explosion of data collection across a variety of domains, for purposes that range from scientific to commercial to policy-oriented, has created a data deluge that requires new tools for extracting useful insights from data. Principal Component Analysis (PCA) [1] and subspace tracking are arguably some of the most commonly used tools for exploring and understanding data. The fundamental mathematics and algorithms for identifying signal subspaces from data have been studied for nearly a century. However, in the modern context, many novel challenges arise due to the severe mismatch between the limited resources available on computational platforms and the increasing demand for processing high-dimensional data. In particular, this survey article is motivated by the following aspects of modern data processing.

  • Large-scale and high-rate. Data are collected at an extremely large scale with many variables, such as in video surveillance or internet monitoring, and they can accumulate at such high rates that real-time processing is necessary for timely decision making. Therefore, classical batch algorithms for data processing are replaced by online, streaming algorithms that have much smaller memory and computational footprints.

  • Missing data. At each time instance, only a very small subset of the data attributes may be measured, due to hardware limitations, power constraints, privacy concerns, or simple lack of observations. Therefore, classical algorithms that do not account for missing data may yield highly sub-optimal performance and need to be redesigned.

To elaborate on these modern challenges, we describe two concrete examples in more detail. First, consider recommendation systems [2], where users’ past product use and opinions are collected. Based on such data, the system attempts to predict other products of interest to those (and potentially other) users. This is of course a scenario involving extremely sparse observations in high dimensions—a user has only purchased or rated a vanishingly small number of products from a company. Moreover, as the users rate more products and as new products become available, it is desirable to update the system’s predictions on user preference in an online manner.

As another example, consider the rigid structure from motion problem in computer vision [3, 4]. One seeks to build a 3D model of a scene based on a sequence of 2D images that, for an orthographic camera, are projections of that scene onto a plane. Features in the scene can be tracked through the images, and a matrix of their locations in the images has a low-rank (3-dimensional) factorization in terms of the true 3D locations of feature points and the locations of the cameras at each image frame. The problem is obviously high dimensional, and it is also natural to consider the streaming setting, as large numbers of features can be tracked across image frames that arrive sequentially at a high rate. Moreover, many points in the scene are not visible in all image frames due to occlusion. Therefore, while the low-rank subspace of the data recovers the 3D structure of the entire scene, one must estimate this subspace in the presence of missing data.

The list of modern applications continues. The question is: can we have scalable and accurate algorithms for subspace learning that work well even in the presence of missing data in a dynamic environment?

1.1 Subspace Models and Missing Data

Subspace models have long been an excellent tool for capturing intrinsic, low-dimensional structures in large datasets. A celebrated example, PCA [1], has been successfully applied to many signal processing applications including medical imaging, communications, source localization and clutter tracking in radar and sonar, computer vision for object tracking, system identification, traffic data analysis, and speech recognition, to name just a few. The calculated principal components and best-fit subspaces to a dataset not only allow dimensionality reduction but also provide intermediate means for signal estimation, noise removal, and anomaly detection [5]. As we highlight in this paper, the principal components can be updated using incoming data in a streaming manner, thus offering tracking capabilities that are necessary for real-time decision making.

While there are a plethora of traditional algorithms for performing PCA on a batch dataset and for estimating and tracking the principal components in a streaming scenario (see, e.g., [6], for a survey of earlier literature), most of these algorithms were developed during a time when datasets of interest had a moderate number of variables (say 10-100) and were collected in a controlled environment with little or no missing entries. As argued earlier, modern datasets are being collected on vast scales, in a much less controlled way, often with overwhelmingly many missing entries. In light of this prevalent and modern challenge in signal processing and machine learning, classical algorithms must be adjusted in order to gracefully handle missing data.

When do we have hope to recover missing data? If the complete high-dimensional data are well-approximated by their projection onto a lower-dimensional subspace, and hence in some sense redundant, then it is conceivable that incomplete or subsampled data may provide sufficient information for the recovery of that subspace. A related problem in the batch setting is the celebrated problem of low-rank matrix completion [7, 8], which suggests that it is possible to recover a highly incomplete matrix if its rank is much smaller than the dimension. This is the central intuition that motivates work on streaming PCA and subspace tracking with missing data. A burst of research activity has been devoted to developing algorithms and theoretical underpinnings for this problem over the last several years in signal processing, machine learning, and statistics. Moreover, powerful results from random matrix theory and stochastic processes have been leveraged to develop performance guarantees for both traditional and newly proposed methods. At the same time, these methods are also finding new applications to emerging data science applications such as monitoring of smart infrastructures [9], neurological, and physiological signal processing and understanding [10].

1.2 Overview of Subspace Tracking Algorithms

There is a long history of subspace tracking algorithms in the literature of signal processing. An extensive survey of methods prior to 1990 was provided in a popular Proceedings of the IEEE article by Comon and Golub [6]. As the common problem dimension was relatively small at that time, the focus was mostly on performance and computational complexity for fully observed data of moderate dimensions. Since then, new algorithms have been and continue to be developed with a focus on minimizing computation and memory complexity for very high-dimensional problems with missing data, without suffering too much on performance [11]. Consider the problem of estimating or tracking a $d$-dimensional subspace in $\mathbb{R}^n$, where $d \ll n$. For modern applications, it is desirable that both the computational complexity (per update) and the memory complexity scale at most linearly with respect to $n$. Moreover, modern applications may require the algorithm to handle a range of missing data, from just a small fraction of missing entries to the information-theoretic limit of only $O(d \log n)$ entries observed in each data vector.1

Broadly speaking, there are two perspectives from which researchers have developed and studied streaming PCA and subspace tracking algorithms, as categorized by Smith [14]. The first class of algorithms can be interpreted through an algebraic lens; these can be regarded as variants of incremental methods for calculating the top-$d$ eigenvectors or singular vectors of a time-varying matrix, such as the sample covariance matrix. Since this time-varying matrix is typically updated by a rank-one modification, various matrix manipulation techniques can be exploited to reduce computational and memory complexities. This viewpoint is particularly useful for understanding algorithms such as Incremental SVD [15], Karasalo’s method [16], Oja’s method [17], Krasulina’s method [18, 19], and other algorithms based on power iterations [20, 21], to name a few.

The other class of algorithms can be interpreted through a geometric lens. These algorithms are constructed as the solution to the optimization of a certain loss function, e.g., via gradient descent, designed in either Euclidean space or on a matrix manifold such as the Grassmannian. We focus mainly on methods where the loss function is updated by one additional term per streaming column vector, and the previous estimate can be used as a warm start or initialization. This viewpoint is particularly useful in the presence of missing data, since missing data are easily incorporated into a loss function, and has therefore been leveraged more often than the algebraic viewpoint in the design of subspace tracking algorithms that are tolerant to missing data. Examples include GROUSE [22], PETRELS [23], ReProCS [24], PAST [25], online nuclear norm minimization [26], and other algorithms based on stochastic approximation [27], to name a few.

The two classes of algorithms, while having distinct features, can often be unified, as an algorithm can often be interpreted through both perspectives. The trade-off between convergence speed in static environments and tracking speed in dynamic environments is also an important consideration in practice, achieved by balancing the influence from historical data and current data. This can be done by discounting historical data in the construction of the time-varying matrix in algebraic methods, and in the construction of the loss function or selection of step sizes in geometric methods.

There is also a vast literature on establishing theoretical performance guarantees for various streaming PCA and subspace tracking algorithms. Classical analysis is primarily done in the asymptotic regime (see, e.g., [28, 29]), where the discrete-time stochastic processes associated with the algorithms are shown to converge, in the scaling limit [30, 31], to the solution of some deterministic differential equations. Recent developments in performance analysis include new and more tractable asymptotic analysis for high-dimensional cases [32, 33, 34, 35], as well as finite-sample probabilistic performance guarantees [36, 37, 38, 39, 40].

1.3 Organization of the Paper

We first describe in Section 2 the problem formulation of PCA and streaming PCA in the presence of missing data. We then survey algorithms that perform streaming subspace estimation and tracking with fully or incompletely observed columns: Section 3 focuses on those using algebraic approaches and Section 4 on those using geometric approaches. Many of these algorithms have associated theoretical analysis with regard to the estimation accuracy and algorithmic convergence rates, which we discuss in Section 5. We then provide numerical comparisons of a number of competitive algorithms in Section 6 and conclude in Section 7.

1.4 Notations

Throughout this paper, we use boldface letters to denote vectors and matrices, e.g., $\boldsymbol{v}$ and $\boldsymbol{M}$. For a positive semidefinite (PSD) matrix $\boldsymbol{M}$, we write $\boldsymbol{M} \succeq \boldsymbol{0}$. The transpose of $\boldsymbol{M}$ is denoted by $\boldsymbol{M}^{\top}$, and $\|\boldsymbol{M}\|$, $\|\boldsymbol{M}\|_F$, and $\mathrm{Tr}(\boldsymbol{M})$ denote the spectral norm, the Frobenius norm and the trace, respectively. The expectation of a random variable $x$ is written as $\mathbb{E}[x]$. The identity matrix of dimension $n$ is written as $\boldsymbol{I}_n$. We shall use $n$ to denote the dimension of the fully observed data vector and $d$ to denote the dimension of the subspace to be estimated. A subscript $t$ on the data vector, as in $\boldsymbol{x}_t$, refers to its order in a sequence of vectors, and the notation $x_t(i)$ refers to the $i$th component of the vector $\boldsymbol{x}_t$.

2 Problem Formulation

In this section, we start by formulating the problem of subspace estimation in the batch setting, which serves as a good starting point for motivating streaming PCA and subspace tracking with missing data.

2.1 PCA in the Batch Setting

The PCA or subspace estimation problem can be formulated either probabilistically, where data are assumed to be random vectors drawn from a distribution with mean zero and some covariance matrix whose principal subspace we wish to estimate, or deterministically, where we seek the best rank-$d$ subspace that fits the given data. Both models are used extensively throughout the literature. The former is used more prevalently in the signal processing and statistics literature, while the latter is more prevalent in the applied mathematics, optimization, and computer science literature. The two formulations result in equivalent optimization problems, and so we present them together here for a unified view.

(a) Probabilistic view: Consider a stationary, $n$-dimensional random process $\{\boldsymbol{x}_t\}_{t \geq 1}$, which has zero mean and covariance matrix $\boldsymbol{\Sigma} = \mathbb{E}[\boldsymbol{x}_t \boldsymbol{x}_t^{\top}]$.2 Denote the eigenvalue decomposition (EVD) of the covariance matrix as $\boldsymbol{\Sigma} = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^{\top}$, where $\boldsymbol{U} = [\boldsymbol{u}_1, \ldots, \boldsymbol{u}_n]$ has orthonormal columns, and $\boldsymbol{\Lambda} = \mathrm{diag}\{\lambda_1, \ldots, \lambda_n\}$, where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$ are the eigenvalues arranged in a non-increasing order. Our goal is to estimate the top-$d$ eigenvectors, also called the principal components, $\boldsymbol{U}_d = [\boldsymbol{u}_1, \ldots, \boldsymbol{u}_d]$ of $\boldsymbol{\Sigma}$, given a finite number of i.i.d. data samples, $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_T$. Note that we do not require $\boldsymbol{\Sigma}$ to be a rank-$d$ matrix.

(b) Deterministic view: In a deterministic formulation, the data samples $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_T$ are considered arbitrary. We wish to find the rank-$d$ subspace that best fits these data in the sense of minimizing the projection error, that is,

$\widehat{\boldsymbol{U}} = \mathop{\mathrm{argmin}}_{\boldsymbol{U} \in \mathbb{R}^{n \times d}:\ \boldsymbol{U}^{\top}\boldsymbol{U} = \boldsymbol{I}_d}\ \sum_{t=1}^{T} \big\| \boldsymbol{x}_t - \mathcal{P}_{\boldsymbol{U}}(\boldsymbol{x}_t) \big\|_2^2$   (1)
$= \mathop{\mathrm{argmin}}_{\boldsymbol{U}:\ \boldsymbol{U}^{\top}\boldsymbol{U} = \boldsymbol{I}_d}\ \big\| \boldsymbol{X}_T - \boldsymbol{U}\boldsymbol{U}^{\top}\boldsymbol{X}_T \big\|_F^2$   (2)
$= \mathop{\mathrm{argmax}}_{\boldsymbol{U}:\ \boldsymbol{U}^{\top}\boldsymbol{U} = \boldsymbol{I}_d}\ \mathrm{Tr}\big( \boldsymbol{U}^{\top} \boldsymbol{M}_T \boldsymbol{U} \big)$   (3)

where $\mathcal{P}_{\boldsymbol{U}}$ denotes the projection operator onto the column span of the matrix $\boldsymbol{U}$, with $\mathcal{P}_{\boldsymbol{U}} = \boldsymbol{U}\boldsymbol{U}^{\top}$ when $\boldsymbol{U}$ has orthonormal columns; $\boldsymbol{X}_T = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_T] \in \mathbb{R}^{n \times T}$ concatenates the data vectors as columns into a matrix; and $\boldsymbol{M}_T = \boldsymbol{X}_T \boldsymbol{X}_T^{\top}$ is the (unscaled) Sample Covariance Matrix (SCM).

The equivalence of (2) and (3) suggests that finding the subspace that maximizes the explained variance of $\boldsymbol{X}_T$ is equivalent to minimizing the approximation error of the data matrix $\boldsymbol{X}_T$. While the formulations (2) or (3) are non-convex, both due to the cost function’s non-convexity in $\boldsymbol{U}$ and the non-convex constraint $\boldsymbol{U}^{\top}\boldsymbol{U} = \boldsymbol{I}_d$, they admit a well-defined solution, solved by the SVD of $\boldsymbol{X}_T$, equivalently the EVD of $\boldsymbol{M}_T$, as was discovered independently by [42] (see [43, 44] for details) and [45]. Specifically, the solution $\widehat{\boldsymbol{U}}$ is given as the top-$d$ eigenvectors of the SCM $\boldsymbol{M}_T$.
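For concreteness, the following minimal batch-setting sketch (our own illustration on synthetic data) computes the top-$d$ principal components both ways, via the EVD of the SCM and via the SVD of the data matrix, and checks that the two estimates span the same subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 50, 3, 1000
X = rng.standard_normal((n, T))            # data matrix, columns are samples x_t

# Top-d eigenvectors of the (unscaled) sample covariance matrix M_T = X X^T
M = X @ X.T
evals, evecs = np.linalg.eigh(M)           # eigenvalues in ascending order
U_evd = evecs[:, -d:]

# Equivalently, top-d left singular vectors of X
U_svd, _, _ = np.linalg.svd(X, full_matrices=False)
U_svd = U_svd[:, :d]

# The two estimates span the same subspace (compare projection matrices)
print(np.linalg.norm(U_evd @ U_evd.T - U_svd @ U_svd.T))   # ~ 0
```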

(c) Unified perspective: Consider the following expected loss function

$L(\boldsymbol{U}) = \mathbb{E}\, \big\| \boldsymbol{x} - \boldsymbol{U}\boldsymbol{U}^{\top}\boldsymbol{x} \big\|_2^2,$   (4)

where $\boldsymbol{U} \in \mathbb{R}^{n \times d}$ and the expectation is taken with respect to $\boldsymbol{x}$, which is zero mean with a covariance matrix $\boldsymbol{\Sigma}$. The following important result was proven in [25]: if $\lambda_d > \lambda_{d+1}$, i.e., if there is a strict eigengap, then the global optima of $L(\boldsymbol{U})$ correspond to $\boldsymbol{U}$ that contains the top-$d$ eigenvectors of $\boldsymbol{\Sigma}$ up to an orthonormal transformation, matching the solution of PCA in the probabilistic view. Interestingly, the solution to the deterministic formulation (1) can be thought of as an empirical version of (4), if the data samples are indeed drawn according to the probabilistic model. Moreover, in this case, $\widehat{\boldsymbol{U}}$ produces an order-wise near-optimal estimate of $\boldsymbol{U}_d$ for a large family of distributions [46]. In this regard, the two formulations are equivalent in some sense, though in the deterministic setting, there need not be any generative model or “ground truth” for the underlying subspace.

2.2 Streaming PCA and Subspace Tracking

In a streaming setting, the data samples $\boldsymbol{x}_t$ arrive sequentially over time, with each sample only seen once,3 and one wishes to update the subspace estimate sequentially without accessing historical data. In a dynamic environment, either the covariance matrix $\boldsymbol{\Sigma}_t$ or the best rank-$d$ subspace can be time-varying; therefore, we wish to track such changes as quickly as possible.

In this survey article, we use the terminology “streaming PCA” and “subspace tracking” interchangeably to refer to algorithms that can update and track a data subspace using streaming observations. Nonetheless, we acknowledge they have different connotations and indeed they have arisen from different contexts. The terminology “subspace tracking” is common in the literature of signal processing [6], where one often needs to update the subspace in a dynamic environment as in array signal processing or communications. The more recent terminology of “online PCA” or “streaming PCA” can be found in the machine learning literature, motivated by the study in computer science of trying to replicate the behavior of batch PCA with streaming data [47] or data too large for memory. In addition, “incremental SVD” [15] or “updating the SVD” [48] are terminology used more classically in numerical methods. It turns out that all of the algorithms reviewed herein can handle both the static and the time-varying settings by adjusting parameters within the algorithm, such as data discounting factors or step sizes.

Streaming PCA can be considered as a nonconvex stochastic approximation problem, given by (4). The solution to the batch problem that we outlined in Section 2.1 is no longer appropriate for the streaming setting: it requires one to formulate and store the SCM $\boldsymbol{M}_T$, which has a memory complexity of $O(n^2)$, and to estimate the top-$d$ eigenvectors directly from the SCM, which has a time complexity of $O(n^3)$. Both of these memory and time complexities are too expensive for large-scale problems. It is greatly desirable to have algorithms with computation and memory complexities that grow at most linearly in $n$.

2.3 Missing Data

An important setting that we will consider in this survey is missing data, where only a subset of the coordinates of each sample $\boldsymbol{x}_t$ are observed. We denote this measurement as

$\boldsymbol{y}_t = \mathcal{P}_{\Omega_t}(\boldsymbol{x}_t),$   (5)

where $\mathcal{P}_{\Omega_t}$ is a projection operator onto the coordinates represented by an observation mask $\Omega_t \subseteq \{1, \ldots, n\}$, where $x_t(i)$ (the $i$th entry of $\boldsymbol{x}_t$) is observed if and only if $i \in \Omega_t$. This issue poses severe challenges for most PCA algorithms, particularly when the number of observed entries is much smaller than $n$. To begin, one may be concerned with identifiability: can we find a unique subspace of rank $d$ that is consistent with the partial observations? Luckily, the answer to this question is yes, at least in the batch setting where the problem is equivalent to that of low-rank matrix completion: under mild assumptions, the low-rank subspace can be reconstructed from subsampled column vectors as long as there are enough observations.4 It may also be tempting to execute subspace tracking algorithms by ignoring the missing data and padding the missing entries with zeros; however, the sample covariance matrix constructed in this way leads to a biased estimator [49, 50]. Therefore, one must think more carefully about how to handle missing data in this context.
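The bias of the zero-filled sample covariance is easy to see numerically. The toy sketch below (our own illustration, not from the original paper; all names are ours) draws data from a covariance with a rank-$d$ spike, observes each entry independently with probability $p$, and compares the zero-filled SCM against the true covariance: off-diagonal entries are attenuated by roughly $p^2$ while diagonal entries are attenuated by $p$, so a single global rescaling cannot undo the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T, p = 50, 3, 100000, 0.3

# Ground-truth covariance: rank-d spike plus a small isotropic term
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
Sigma = U @ np.diag([10.0, 5.0, 2.0]) @ U.T + 0.01 * np.eye(n)

X = rng.multivariate_normal(np.zeros(n), Sigma, size=T)   # T x n, full data
mask = rng.random((T, n)) < p                              # Bernoulli(p) observations
Y = X * mask                                               # zero-filled observations

M_zero = Y.T @ Y / T                                       # zero-filled SCM
off = ~np.eye(n, dtype=bool)

# Least-squares fit of the attenuation factors: ~p^2 off-diagonal, ~p on the diagonal
print(np.sum(M_zero[off] * Sigma[off]) / np.sum(Sigma[off] ** 2))            # approx 0.09
print(np.sum(np.diag(M_zero) * np.diag(Sigma)) / np.sum(np.diag(Sigma) ** 2)) # approx 0.30
```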

3 Algebraic Methods

In this section and the next, we will discuss two classes of algorithms based on algebraic approaches and geometric approaches, respectively, as outlined in Section 1.2. The algebraic approaches are based on finding the top eigenvectors of a recursively updated SCM, or a surrogate of it, given as

$\boldsymbol{M}_t = \beta_t \boldsymbol{M}_{t-1} + \alpha_t\, \boldsymbol{x}_t \boldsymbol{x}_t^{\top},$   (6)

where $\beta_t$ and $\alpha_t$ balance the contributions from the previous SCM and the current data sample. Two popular choices are equal weights on all time slots, which is $\beta_t = \alpha_t = 1$, so that $\boldsymbol{M}_t = \sum_{i=1}^{t} \boldsymbol{x}_i \boldsymbol{x}_i^{\top}$,

and discounting on historical data, which is $\beta_t = \lambda$ for a discount factor $0 < \lambda < 1$ and $\alpha_t = 1$, so that $\boldsymbol{M}_t = \sum_{i=1}^{t} \lambda^{t-i} \boldsymbol{x}_i \boldsymbol{x}_i^{\top}$.

Equivalently, the above can be reworded as finding the top singular vectors of a recursively updated (and possibly discounted) data matrix $\boldsymbol{X}_t = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_t]$. As we are interested in calculating or approximating the top-$d$ eigenvectors of $\boldsymbol{M}_t$, algebraic methods use matrix manipulations and exploit the simplicity of the rank-one update to reduce complexity.
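As a point of reference, here is a minimal sketch (ours) of the naive recursive baseline with a discount factor: it maintains $\boldsymbol{M}_t$ explicitly and re-computes its top-$d$ eigenvectors at every step, which is exactly the $O(n^2)$-memory, roughly $O(n^3)$-time approach that the algorithms surveyed below are designed to avoid.

```python
import numpy as np

def scm_top_d_stream(X, d, lam=0.98):
    """Track the top-d eigenvectors of a discounted SCM M_t = lam*M_{t-1} + x_t x_t^T."""
    T, n = X.shape
    M = np.zeros((n, n))
    for t in range(T):
        x = X[t]
        M = lam * M + np.outer(x, x)        # rank-one update of the surrogate SCM
        evals, evecs = np.linalg.eigh(M)    # full EVD: the expensive step fast methods avoid
        U = evecs[:, -d:]                   # top-d eigenvectors (columns)
    return U

# Example: data from a static 5-dimensional subspace in R^100 plus small noise
rng = np.random.default_rng(1)
n, d, T = 100, 5, 500
U_true, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = (U_true @ rng.standard_normal((d, T))).T + 0.01 * rng.standard_normal((T, n))
U_hat = scm_top_d_stream(X, d)
print(np.linalg.norm(U_true @ U_true.T - U_hat @ U_hat.T))  # small residual
```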

3.1 Incremental Singular Value Decomposition (ISVD)

We begin by discussing the ISVD approach of Bunch and Nielsen [48], which is an exact method to compute the full SVD of a streaming full-data matrix, i.e., with sequentially arriving, fully observed data vectors. This algorithm is given in Algorithm 1 and is the result of some simple observations about the relationship between the SVD of $\boldsymbol{X}_t$ and that of $\boldsymbol{X}_{t+1} = [\boldsymbol{X}_t\ \ \boldsymbol{x}_{t+1}]$. Suppose we are given the compact SVD of the data matrix at time $t$,

$\boldsymbol{X}_t = \boldsymbol{Q}_t \boldsymbol{S}_t \boldsymbol{V}_t^{\top},$

where $\boldsymbol{Q}_t \in \mathbb{R}^{n \times n}$ and $\boldsymbol{V}_t \in \mathbb{R}^{t \times t}$ are orthonormal, and $\boldsymbol{S}_t \in \mathbb{R}^{n \times t}$ is the concatenation of two matrices: a diagonal matrix (of size $t \times t$) with non-negative, non-increasing diagonal entries, and an all-zero matrix. Note that we are using the notation $\boldsymbol{Q}_t$ for the square orthogonal matrix, as opposed to $\boldsymbol{U}$ for an $n \times d$ matrix as in Section 2.1, because we are computing the full SVD, not a low-rank approximation. For simplicity of exposition, let us assume $t < n$ (but both cases $t < n$ and $t \geq n$ are described in Algorithm 1). We wish to compute the SVD of

$\boldsymbol{X}_{t+1} = [\boldsymbol{X}_t\ \ \boldsymbol{x}_{t+1}] = \boldsymbol{Q}_{t+1} \boldsymbol{S}_{t+1} \boldsymbol{V}_{t+1}^{\top},$

where $\boldsymbol{Q}_{t+1}$, $\boldsymbol{S}_{t+1}$, and $\boldsymbol{V}_{t+1}$ are defined similarly to $\boldsymbol{Q}_t$, $\boldsymbol{S}_t$, and $\boldsymbol{V}_t$.

Recognizing that

$\boldsymbol{X}_{t+1} = \big[ \boldsymbol{Q}_t\boldsymbol{S}_t\boldsymbol{V}_t^{\top}\ \ \boldsymbol{x}_{t+1} \big] = \boldsymbol{Q}_t \big[ \boldsymbol{S}_t\ \ \boldsymbol{z}_{t+1} \big] \begin{bmatrix} \boldsymbol{V}_t & \boldsymbol{0} \\ \boldsymbol{0} & 1 \end{bmatrix}^{\top}$

and

$\big[ \boldsymbol{S}_t\ \ \boldsymbol{z}_{t+1} \big]\big[ \boldsymbol{S}_t\ \ \boldsymbol{z}_{t+1} \big]^{\top} = \boldsymbol{S}_t\boldsymbol{S}_t^{\top} + \boldsymbol{z}_{t+1}\boldsymbol{z}_{t+1}^{\top},$

where $\boldsymbol{z}_{t+1} = \boldsymbol{Q}_t^{\top}\boldsymbol{x}_{t+1}$, we can compute the new singular values by finding the eigenvalues of $\boldsymbol{S}_t\boldsymbol{S}_t^{\top} + \boldsymbol{z}_{t+1}\boldsymbol{z}_{t+1}^{\top}$ using the zeros of the characteristic equation [51], which in this case has a special structure; in particular, if $s_1 \geq s_2 \geq \cdots$ are the diagonal values of $\boldsymbol{S}_t$, then the zeros of

$1 + \sum_{i=1}^{n} \frac{z_{t+1}(i)^2}{s_i^2 - \mu} = 0$   (7)

with respect to the variable $\mu$ identify the eigenvalues of $\boldsymbol{S}_t\boldsymbol{S}_t^{\top} + \boldsymbol{z}_{t+1}\boldsymbol{z}_{t+1}^{\top}$. Denote the resulting eigenvalues as $\mu_i$ for $i = 1, \ldots, n$. To update the left singular vectors to the new $\boldsymbol{Q}_{t+1}$, we need to solve $\big( \boldsymbol{S}_t\boldsymbol{S}_t^{\top} + \boldsymbol{z}_{t+1}\boldsymbol{z}_{t+1}^{\top} - \mu_i \boldsymbol{I}_n \big)\boldsymbol{q}_i = \boldsymbol{0}$ and normalize the solution. Therefore [51, 6],

$\boldsymbol{q}_i = \frac{\big( \boldsymbol{S}_t\boldsymbol{S}_t^{\top} - \mu_i \boldsymbol{I}_n \big)^{-1} \boldsymbol{z}_{t+1}}{\big\| \big( \boldsymbol{S}_t\boldsymbol{S}_t^{\top} - \mu_i \boldsymbol{I}_n \big)^{-1} \boldsymbol{z}_{t+1} \big\|_2}.$   (8)
1:  Given , set , , ;
2:   Set ;
3:  repeat
4:     Define ;
5:     Define ; ;
6:     if  then
7:        Compute the SVD of the update matrix:
(9)
by solving (7) for , which gives the diagonal entries of , where are the diagonal entries of and , and then solving (8) for .
8:        Set
(10)
9:     else if  then
10:        (this happens when or )
11:        Compute the SVD of the update matrix:
(11)
by solving (7) for , which gives the diagonal entries of , where are the diagonal entries of and , and then solving (8) for .
12:        Set
13:     end if
14:     ;
15:  until termination
Algorithm 1 ISVD

So far, the above derivations assume $\boldsymbol{Q}_t$ is a square orthogonal matrix, and the resulting computations are suitable for incremental updates of the full SVD. However, this still requires at least $O(n^2)$ complexity5 for every iteration computing the full SVD (i.e., all singular values and left/right singular vectors) using Algorithm 1, and the memory requirement grows as $t$ grows if the data are full rank (or low rank with even a very small amount of additive noise), which are both undesirable.

On the other hand, estimating a thin SVD or the top $d$-dimensional singular subspace can improve computation. In fact, if $\boldsymbol{X}_t$ is exactly rank-$d$, this incremental approach requires fewer computations, as pointed out in [52]. In this case, take the first $d$ columns of $\boldsymbol{Q}_t$ and call these $\boldsymbol{U}_t$. We only need these to represent $\boldsymbol{X}_t$, because the others correspond to zero singular values. Let $\boldsymbol{S}_t \in \mathbb{R}^{d \times d}$ and $\boldsymbol{V}_t \in \mathbb{R}^{t \times d}$ be the corresponding matrices for this thin SVD so that $\boldsymbol{X}_t = \boldsymbol{U}_t \boldsymbol{S}_t \boldsymbol{V}_t^{\top}$. We then notice as in [52, 53] that

$\boldsymbol{X}_{t+1} = \big[ \boldsymbol{U}_t\boldsymbol{S}_t\boldsymbol{V}_t^{\top}\ \ \boldsymbol{x}_{t+1} \big] = \begin{bmatrix} \boldsymbol{U}_t & \frac{\boldsymbol{r}_{t+1}}{\|\boldsymbol{r}_{t+1}\|_2} \end{bmatrix} \begin{bmatrix} \boldsymbol{S}_t & \boldsymbol{w}_{t+1} \\ \boldsymbol{0} & \|\boldsymbol{r}_{t+1}\|_2 \end{bmatrix} \begin{bmatrix} \boldsymbol{V}_t & \boldsymbol{0} \\ \boldsymbol{0} & 1 \end{bmatrix}^{\top},$   (12)

where $\boldsymbol{w}_{t+1} = \boldsymbol{U}_t^{\top}\boldsymbol{x}_{t+1}$ are the projection weights onto the span of this now tall matrix $\boldsymbol{U}_t$ and $\boldsymbol{r}_{t+1} = \boldsymbol{x}_{t+1} - \boldsymbol{U}_t\boldsymbol{w}_{t+1}$ is the residual from the projection. We only must diagonalize the center matrix of (12) to find the SVD of $\boldsymbol{X}_{t+1}$. We assumed only that $\boldsymbol{X}_t$ is rank-$d$, but making this assumption at every step is strong and means the matrix is exactly low rank. However, this technique is used successfully as a heuristic, by truncating the smaller singular values and corresponding singular vectors, even when this assumption does not hold. This method is described in the following section. Finally, we point out that Karasalo’s subspace averaging algorithm [16] is similar to ISVD, but it uses specific information about the noise covariance.
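To make (12) concrete, here is a minimal NumPy sketch (our own illustration) of one rank-$d$ incremental SVD step with truncation; it tracks only the left factor and singular values, not $\boldsymbol{V}_t$.

```python
import numpy as np

def isvd_step(U, S, x, d):
    """One thin-SVD update: incorporate column x into the rank-d factorization U diag(S) V^T.

    U: n x d orthonormal basis; S: length-d singular values (V is not tracked here).
    Returns the updated (U, S), truncated back to rank d.
    """
    w = U.T @ x                       # projection weights onto span(U)
    r = x - U @ w                     # residual, orthogonal to span(U)
    rho = np.linalg.norm(r)
    K = np.zeros((d + 1, d + 1))      # center matrix of the update in (12)
    K[:d, :d] = np.diag(S)
    K[:d, d] = w
    K[d, d] = rho
    Uk, Sk, _ = np.linalg.svd(K)
    U_ext = np.hstack([U, (r / rho if rho > 1e-12 else r).reshape(-1, 1)])
    return U_ext @ Uk[:, :d], Sk[:d]  # rotate the extended basis, keep the top d directions

# Usage: initialize from the first d columns, then stream the rest
rng = np.random.default_rng(2)
n, d = 80, 4
U_true, _ = np.linalg.qr(rng.standard_normal((n, d)))
X0 = U_true @ rng.standard_normal((d, d))
U, S, _ = np.linalg.svd(X0, full_matrices=False)
for _ in range(300):
    x = U_true @ rng.standard_normal(d)
    U, S = isvd_step(U, S, x, d)
print(np.linalg.norm(U_true @ U_true.T - U @ U.T))   # ~ 0 for exactly rank-d data
```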

3.2 MD-ISVD, Brand’s Algorithm and PIMC

A major drawback of many linear algebraic techniques is their inapplicability to datasets with missing data. While it is not straightforward how to adapt the ISVD for missing data, there are several different approaches in the literature [15, 54, 3]. All approaches begin in the same way, by considering how to compute two key quantities, the projection weights $\boldsymbol{w}_{t+1}$ and the residual $\boldsymbol{r}_{t+1}$, given missing data. Whereas for complete data we have $\boldsymbol{w}_{t+1} = \boldsymbol{U}_t^{\top}\boldsymbol{x}_{t+1}$, for missing data one may solve

$\boldsymbol{w}_{t+1} = \mathop{\mathrm{argmin}}_{\boldsymbol{w}} \big\| \mathcal{P}_{\Omega_{t+1}}\big( \boldsymbol{U}_t \boldsymbol{w} - \boldsymbol{x}_{t+1} \big) \big\|_2^2,$   (13)

where $\mathcal{P}_{\Omega_{t+1}}$ is the measurement operator with missing data as in (5). Then, letting $\boldsymbol{p}_{t+1} = \boldsymbol{U}_t \boldsymbol{w}_{t+1}$, define the residual to be

$r_{t+1}(i) = \begin{cases} x_{t+1}(i) - p_{t+1}(i), & i \in \Omega_{t+1}, \\ 0, & i \notin \Omega_{t+1}. \end{cases}$   (14)

All the methods we describe in this section use these quantities in place of the full-data weights and residual in Algorithm 1. They also all mimic (9) for the update, and once they have observed enough vectors they truncate the smallest singular value and corresponding singular vectors.

The methods diverge only in the way they replace the singular value matrix $\boldsymbol{S}_t$ in (9). Brand [15] replaces $\boldsymbol{S}_t$ with $\lambda \boldsymbol{S}_t$, where $\lambda \in (0, 1]$ is a scalar weight that diminishes the influence of previous singular values. If one takes $\lambda = 1$, this is arguably the most direct extension of ISVD to missing data, and following [3] we call this algorithm Missing Data-ISVD (MD-ISVD). Kennedy et al. [54] present a Polar Incremental Matrix Completion (PIMC) method, which weights $\boldsymbol{S}_t$ with a scalar based on the norm of the data observed thus far. These are different approaches to modeling the uncertainty in the singular values arising from incomplete observations. These algorithms are summarized together in Algorithm 2. These different rules provide different trade-offs in (6), as they represent different weighting schemes on historical data.

1:  Given an orthonormal matrix , ;
2:  For PIMC, .
3:   Set ;
4:  repeat
5:     Define ;
6:     Define ;
7:     Compute the SVD of the update matrix:
(15)
where
with for PIMC.
8:     Set ;
9:     Set as the first columns of and as the top -by- block of .
10:     ;
11:  until termination
Algorithm 2 MD-ISVD, Brand’s algorithm, PIMC
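The quantities (13) and (14) recur in several of the missing-data algorithms below (MD-ISVD, the missing-data variant of Oja’s method, GROUSE), so we sketch their computation once here; the function and variable names are ours.

```python
import numpy as np

def project_weights_and_residual(U, x_obs, omega):
    """Least-squares weights (13) and zero-padded residual (14) from partial observations.

    U     : n x d current orthonormal basis estimate
    x_obs : values of the data vector on the observed coordinates
    omega : indices of the observed coordinates
    """
    U_omega = U[omega, :]                                  # restrict the basis to observed rows
    w, *_ = np.linalg.lstsq(U_omega, x_obs, rcond=None)    # solves (13)
    r = np.zeros(U.shape[0])
    r[omega] = x_obs - U_omega @ w                         # residual on observed entries, 0 elsewhere
    return w, r

# Usage on one partially observed vector
rng = np.random.default_rng(3)
n, d = 100, 5
U_true, _ = np.linalg.qr(rng.standard_normal((n, d)))
x = U_true @ rng.standard_normal(d)
omega = np.sort(rng.choice(n, size=30, replace=False))
w, r = project_weights_and_residual(U_true, x[omega], omega)
print(np.linalg.norm(r))   # ~ 0 when U matches the true subspace
```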

3.3 Oja’s method

Oja’s method was originally proposed in 1982 [17]. It is a very popular method for streaming PCA, and recent attention has yielded significant insight into its practical performance (see the discussion in Section 5). Given an orthonormal initialization $\boldsymbol{U}_0 \in \mathbb{R}^{n \times d}$, at the $t$th time Oja’s method updates $\boldsymbol{U}_{t-1}$ to $\boldsymbol{U}_t$ according to the input data $\boldsymbol{x}_t$ as

$\boldsymbol{U}_t = \Pi\big( \boldsymbol{U}_{t-1} + \eta_t\, \boldsymbol{x}_t \boldsymbol{x}_t^{\top} \boldsymbol{U}_{t-1} \big),$   (16)

where $\Pi(\cdot)$ is an orthogonalization operator, i.e., $\Pi(\boldsymbol{A}) = \boldsymbol{Q}$, where $\boldsymbol{A} = \boldsymbol{Q}\boldsymbol{R}$ is the QR decomposition of $\boldsymbol{A}$. The parameter $\eta_t > 0$ is the step size or learning rate that may change with time.

While Oja’s method has not been derived for the missing data case in the literature, following our discussion on ISVD, one realizes that if, as before, we let $\boldsymbol{w}_t = \boldsymbol{U}_{t-1}^{\top}\boldsymbol{x}_t$ be the coefficient of $\boldsymbol{x}_t$ in the previous estimate $\boldsymbol{U}_{t-1}$, then Oja’s method is equivalent to

$\boldsymbol{U}_t = \Pi\big( \boldsymbol{U}_{t-1} + \eta_t\, \boldsymbol{x}_t \boldsymbol{w}_t^{\top} \big).$   (17)

A straightforward extension to the missing data case is then to estimate the coefficient $\boldsymbol{w}_t$ as in (13), and to fill in the missing entries of $\boldsymbol{x}_t$ as follows. Let $\boldsymbol{p}_t = \boldsymbol{U}_{t-1}\boldsymbol{w}_t$; the data vector can then be interpolated as

$\tilde{x}_t(i) = \begin{cases} x_t(i), & i \in \Omega_t, \\ p_t(i), & i \notin \Omega_t. \end{cases}$

Then Oja’s update rule in the missing data case becomes

$\boldsymbol{U}_t = \Pi\big( \boldsymbol{U}_{t-1} + \eta_t\, \tilde{\boldsymbol{x}}_t \boldsymbol{w}_t^{\top} \big).$   (18)

This algorithm is summarized in Algorithm 3. Note that Oja’s original method with full data becomes a special case of this update. We study this extension in the numerical experiments reported in Section 6.

1:  Given an orthonormal matrix $\boldsymbol{U}_0 \in \mathbb{R}^{n \times d}$;
2:  Set $t := 1$;
3:  repeat
4:     Define $\boldsymbol{w}_t := \mathrm{argmin}_{\boldsymbol{w}} \big\| \mathcal{P}_{\Omega_t}(\boldsymbol{U}_{t-1}\boldsymbol{w} - \boldsymbol{x}_t) \big\|_2^2$;
5:     Set $\boldsymbol{p}_t := \boldsymbol{U}_{t-1}\boldsymbol{w}_t$.
6:     Set $\tilde{x}_t(i) := x_t(i)$ for $i \in \Omega_t$ and $\tilde{x}_t(i) := p_t(i)$ for $i \notin \Omega_t$.
7:     $\boldsymbol{U}_t = \Pi\big( \boldsymbol{U}_{t-1} + \eta_t\, \tilde{\boldsymbol{x}}_t \boldsymbol{w}_t^{\top} \big)$,
8:     $t := t + 1$;
9:  until termination
Algorithm 3 Oja’s algorithm with missing data
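A minimal NumPy sketch of this missing-data Oja update (Algorithm 3); the step-size schedule and all names are our own illustrative choices.

```python
import numpy as np

def oja_missing_step(U, x_obs, omega, eta):
    """One step of Oja's update (18) with missing data."""
    U_omega = U[omega, :]
    w, *_ = np.linalg.lstsq(U_omega, x_obs, rcond=None)   # coefficients, as in (13)
    x_tilde = U @ w                                        # interpolate unobserved entries
    x_tilde[omega] = x_obs                                 # keep observed entries
    A = U + eta * np.outer(x_tilde, w)                     # rank-one Oja update
    Q, _ = np.linalg.qr(A)                                 # re-orthonormalize
    return Q

# Usage: track a static 5-dimensional subspace from 30%-sampled vectors
rng = np.random.default_rng(4)
n, d, T, p = 100, 5, 2000, 0.3
U_true, _ = np.linalg.qr(rng.standard_normal((n, d)))
U = np.linalg.qr(rng.standard_normal((n, d)))[0]
for t in range(1, T + 1):
    x = U_true @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)
    omega = np.flatnonzero(rng.random(n) < p)
    U = oja_missing_step(U, x[omega], omega, eta=0.5 / np.sqrt(t))
print(np.linalg.norm(U_true @ U_true.T - U @ U.T))   # much smaller than at initialization
```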

Finally, we note that closely related to Oja’s method is another method called Krasulina’s algorithm [19], which was developed for updating a rank-one subspace (a single principal component $\boldsymbol{u}_t \in \mathbb{R}^n$) with full data:

$\boldsymbol{u}_t = \boldsymbol{u}_{t-1} + \eta_t \left( \boldsymbol{x}_t\boldsymbol{x}_t^{\top} - \frac{\boldsymbol{u}_{t-1}^{\top}\boldsymbol{x}_t\boldsymbol{x}_t^{\top}\boldsymbol{u}_{t-1}}{\|\boldsymbol{u}_{t-1}\|_2^2}\, \boldsymbol{I}_n \right) \boldsymbol{u}_{t-1}.$   (19)

It can be viewed as a stochastic gradient descent method with the Rayleigh quotient as its objective. Oja’s method is equivalent to Krasulina’s method up to second-order terms [55, 18].

Remark 1 (Block Power Method).

A block variant of Oja’s method has been developed in the literature [36, 37, 56], where it partitions the input into blocks and each time processes one block in a way similar to Oja’s method. These methods are referred to as the block power method, or block Oja’s method, or the noisy power method. They are easier to analyze but yield suboptimal performance [40].
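A minimal sketch of the block (noisy) power method just described, assuming a block size of our choosing: each block’s empirical covariance is applied to the previous estimate, followed by QR re-orthonormalization.

```python
import numpy as np

def block_power_method(X, d, B):
    """Streaming top-d subspace estimate; X is T x n, processed in blocks of B samples."""
    T, n = X.shape
    rng = np.random.default_rng(0)
    U = np.linalg.qr(rng.standard_normal((n, d)))[0]
    for start in range(0, T - B + 1, B):
        block = X[start:start + B]                       # B x n
        # One power iteration with the block's empirical covariance (1/B) sum_i x_i x_i^T
        U, _ = np.linalg.qr(block.T @ (block @ U) / B)
    return U

# Usage
rng = np.random.default_rng(7)
n, d, T = 100, 5, 5000
U_true, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = (U_true @ rng.standard_normal((d, T))).T + 0.1 * rng.standard_normal((T, n))
U = block_power_method(X, d, B=250)
print(np.linalg.norm(U_true @ U_true.T - U @ U.T))   # small if the blocks are large enough
```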

4 Geometric Methods

In this section, we review subspace tracking algorithms developed via geometric approaches. These are developed by optimizing certain loss functions over matrices in Euclidean space or on the Grassmann manifold of rank-$d$ subspaces in $\mathbb{R}^n$. Subspace tracking is enabled by optimizing a recursively updated loss function, such as the squared projection loss onto the subspace (restricted to the observed entries), as

$F_t(\boldsymbol{U}) = \beta_t F_{t-1}(\boldsymbol{U}) + \alpha_t \min_{\boldsymbol{w}} \big\| \mathcal{P}_{\Omega_t}\big( \boldsymbol{x}_t - \boldsymbol{U}\boldsymbol{w} \big) \big\|_2^2,$   (20)

where $F_0(\boldsymbol{U}) \equiv 0$ and $t$ is the time index; the optimization is typically warm-started from the previous estimate $\boldsymbol{U}_{t-1}$. Similarly, the choice of $\alpha_t$ and $\beta_t$ balances the convergence rate (how fast it converges with data from a static subspace) and the tracking capability (how fast it can adapt to changes in the subspace). Additionally, the step size of some gradient algorithms can also be used as a tuning knob for tracking; a more aggressive step size will adapt more quickly to new data. Given the necessity of scalable and memory-efficient algorithms, first-order and second-order stochastic gradient descent [57] have recently gained a lot of popularity in signal processing and machine learning.

4.1 Grouse

Grassmannian Rank-One Update Subspace Estimation (GROUSE) was first introduced in [22] as an incremental gradient algorithm to build high-quality subspace estimates from very sparsely sampled vectors, and it has since been analyzed with fully sampled data [58, 59], noisy data [59], and missing or compressed data [58, 60] (see Section 5). The objective function for the algorithm is given by

$F_t(\boldsymbol{U}) = \sum_{i=1}^{t} \min_{\boldsymbol{w}} \big\| \mathcal{P}_{\Omega_i}\big( \boldsymbol{x}_i - \boldsymbol{U}\boldsymbol{w} \big) \big\|_2^2,$   (21)

which is a special case of (20) with $\alpha_t = \beta_t = 1$. GROUSE implements a first-order incremental gradient procedure [61] to minimize this objective with respect to the subspace variable $\boldsymbol{U}$ constrained to the Grassmannian [62], the manifold of all subspaces with a fixed rank, given as

$\mathcal{G}(n, d) := \big\{ \mathrm{span}(\boldsymbol{U}) : \boldsymbol{U} \in \mathbb{R}^{n \times d},\ \boldsymbol{U}^{\top}\boldsymbol{U} = \boldsymbol{I}_d \big\}.$

GROUSE has iteration complexity $O(nd + |\Omega_t| d^2)$ at the $t$th update and so is scalable to very high-dimensional applications. The algorithm steps are described in Algorithm 4.

The GROUSE update in (23) can also be written as:

$\boldsymbol{U}_t = \boldsymbol{U}_{t-1} + \Big( \boldsymbol{z}_t - \frac{\boldsymbol{p}_t}{\|\boldsymbol{p}_t\|_2} \Big) \frac{\boldsymbol{w}_t^{\top}}{\|\boldsymbol{w}_t\|_2},$

where

$\boldsymbol{z}_t = \cos(\theta_t)\, \frac{\boldsymbol{p}_t}{\|\boldsymbol{p}_t\|_2} + \sin(\theta_t)\, \frac{\boldsymbol{r}_t}{\|\boldsymbol{r}_t\|_2}.$

This form makes it clear that GROUSE is simply replacing a direction in the current subspace estimate, $\boldsymbol{p}_t/\|\boldsymbol{p}_t\|_2$, with a new vector $\boldsymbol{z}_t$ that is a linear combination of $\boldsymbol{p}_t$ and the residual vector $\boldsymbol{r}_t$. This of course keeps $\boldsymbol{z}_t$ orthogonal to the rest of the directions of $\boldsymbol{U}_{t-1}$ (those orthogonal to $\boldsymbol{p}_t$ within the span), which is why $\boldsymbol{U}_t$ will necessarily also have orthonormal columns.

We note that, if the step size is not given, one can use the step size prescribed in (22). This step size maximizes the per-iteration improvement of the algorithm in a greedy way, but it can therefore be susceptible to noise. For example, with fully observed data, this greedy step size will replace the direction $\boldsymbol{p}_t/\|\boldsymbol{p}_t\|_2$ in the current iterate with the (normalized) observed data vector $\boldsymbol{x}_t$. If bounds on the noise variance are known, one can use the noise-dependent step size given in [59], which decreases the step as the noise floor is reached.

1:  Given $\boldsymbol{U}_0$, an $n \times d$ orthonormal matrix, with $0 < d < n$;
2:  Optional input: Step size scheme $\{\eta_t\}_{t \geq 1}$;
3:  Set $t := 1$;
4:  repeat
5:     Define $\boldsymbol{w}_t := \mathrm{argmin}_{\boldsymbol{w}} \big\| \mathcal{P}_{\Omega_t}(\boldsymbol{U}_{t-1}\boldsymbol{w} - \boldsymbol{x}_t) \big\|_2^2$;
6:     Define $\boldsymbol{p}_t := \boldsymbol{U}_{t-1}\boldsymbol{w}_t$ and the residual $\boldsymbol{r}_t$ as in (14);
7:     if $\eta_t$ is given then
8:        Set $\theta_t = \eta_t\, \|\boldsymbol{r}_t\|_2 \|\boldsymbol{p}_t\|_2$.
9:     else
10:        Set
$\theta_t = \arctan\big( \|\boldsymbol{r}_t\|_2 / \|\boldsymbol{p}_t\|_2 \big)$   (22)
11:     end if
$\boldsymbol{U}_t = \boldsymbol{U}_{t-1} + \Big[ (\cos\theta_t - 1)\, \frac{\boldsymbol{p}_t}{\|\boldsymbol{p}_t\|_2} + \sin\theta_t\, \frac{\boldsymbol{r}_t}{\|\boldsymbol{r}_t\|_2} \Big] \frac{\boldsymbol{w}_t^{\top}}{\|\boldsymbol{w}_t\|_2}$   (23)
12:     $t := t + 1$;
13:  until termination
Algorithm 4 GROUSE [22]
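A compact NumPy sketch of one GROUSE step following Algorithm 4, with the greedy angle (22) as the default; all names are ours.

```python
import numpy as np

def grouse_step(U, x_obs, omega, eta=None):
    """One GROUSE update on the Grassmannian from a partially observed vector."""
    U_omega = U[omega, :]
    w, *_ = np.linalg.lstsq(U_omega, x_obs, rcond=None)    # weights, eq. (13)
    p = U @ w                                              # predicted full vector
    r = np.zeros(U.shape[0])
    r[omega] = x_obs - p[omega]                            # zero-padded residual, eq. (14)
    r_norm, p_norm, w_norm = np.linalg.norm(r), np.linalg.norm(p), np.linalg.norm(w)
    if r_norm < 1e-12 or w_norm < 1e-12:
        return U                                           # nothing to update
    theta = np.arctan(r_norm / p_norm) if eta is None else eta * r_norm * p_norm
    step = (np.cos(theta) - 1.0) * p / p_norm + np.sin(theta) * r / r_norm
    return U + np.outer(step, w / w_norm)                  # rank-one rotation, eq. (23)
```

Because the update rotates within the span of the current basis and the residual, the returned matrix stays orthonormal up to numerical round-off, so no explicit QR step is needed.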

In a follow-up work, Balzano et al. describe SAGE GROUSE [53, 3], which was derived in the context of Algorithm 2. SAGE GROUSE replaces the singular value matrix $\boldsymbol{S}_t$ with an identity matrix of the same size, which makes the algorithm completely agnostic to singular values or the relative weight of singular vectors that have been learned. This can be considered as yet another way of modeling uncertainty in the singular values learned thus far in a streaming context. SAGE GROUSE has been proven to be equivalent to the GROUSE gradient algorithm for a given step size [53], showing that indeed the distinction between “algebraic” and “geometric” algorithms is not fundamental.

Remark 2 (Snipe).

A block variant of GROUSE was presented in [63], called Subspace Navigation via Interpolation from Partial Entries (SNIPE). This algorithm partitions the input into blocks and for each block optimizes a subspace to fit the observed entries on that block but remain close to the previous subspace estimate.

4.2 Past

The Projection Approximation Subspace Tracking (PAST) algorithm was proposed by Yang [25, 28] for subspace tracking with full data and is described in Algorithm 5. PAST optimizes the following function at time $t$ without constraining $\boldsymbol{U}$ to have orthogonal columns:

$F_t(\boldsymbol{U}) = \sum_{i=1}^{t} \lambda^{t-i} \big\| \boldsymbol{x}_i - \boldsymbol{U}\boldsymbol{U}^{\top}\boldsymbol{x}_i \big\|_2^2,$   (24)

where prior observations are discounted by a geometric factor $0 < \lambda \leq 1$. The name “projection approximation” comes from the fact that the projection onto the subspace is approximated by $\boldsymbol{U}\boldsymbol{U}^{\top}$, without the constraint $\boldsymbol{U}^{\top}\boldsymbol{U} = \boldsymbol{I}_d$. This sum is further approximated by replacing the second $\boldsymbol{U}$ in (24) by $\boldsymbol{U}_{i-1}$, the estimate at time $i-1$, yielding

$F_t(\boldsymbol{U}) \approx \sum_{i=1}^{t} \lambda^{t-i} \big\| \boldsymbol{x}_i - \boldsymbol{U}\boldsymbol{U}_{i-1}^{\top}\boldsymbol{x}_i \big\|_2^2.$   (25)

Let the coefficient vector be $\boldsymbol{w}_i = \boldsymbol{U}_{i-1}^{\top}\boldsymbol{x}_i$; then (25) can be rewritten as

$\boldsymbol{U}_t = \mathop{\mathrm{argmin}}_{\boldsymbol{U}} \sum_{i=1}^{t} \lambda^{t-i} \big\| \boldsymbol{x}_i - \boldsymbol{U}\boldsymbol{w}_i \big\|_2^2,$   (26)

whose solution can be written in closed form and efficiently found via recursive least-squares. The PAST algorithm has a computational complexity of $O(nd)$ per update. PAST has been very popular due to its efficiency, and it has been extended and modified in various ways [20, 64, 65].

1:  Given $\boldsymbol{U}_0 \in \mathbb{R}^{n \times d}$ and $\boldsymbol{R}_0 = \delta \boldsymbol{I}_d$ for some $\delta > 0$;
2:  Set $t := 1$;
3:  repeat
4:     Define $\boldsymbol{w}_t := \boldsymbol{U}_{t-1}^{\top}\boldsymbol{x}_t$;
5:     $\boldsymbol{h}_t = \boldsymbol{R}_{t-1}\boldsymbol{w}_t$,
6:     $\boldsymbol{g}_t = \boldsymbol{h}_t / \big( \lambda + \boldsymbol{w}_t^{\top}\boldsymbol{h}_t \big)$,
7:     $\boldsymbol{R}_t = \lambda^{-1}\big( \boldsymbol{R}_{t-1} - \boldsymbol{g}_t\boldsymbol{h}_t^{\top} \big)$,
8:     $\boldsymbol{U}_t = \boldsymbol{U}_{t-1} + \big( \boldsymbol{x}_t - \boldsymbol{U}_{t-1}\boldsymbol{w}_t \big)\boldsymbol{g}_t^{\top}$.
9:     $t := t + 1$;
10:  until termination
Algorithm 5 PAST [25]
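A minimal sketch of a PAST-style RLS update for the loss (26) with full data; the initialization constant and all names are our own illustrative choices.

```python
import numpy as np

def past_step(U, R, x, lam=0.95):
    """One PAST-style update: exponentially weighted RLS on the loss (26)."""
    w = U.T @ x                          # projection coefficients
    h = R @ w
    g = h / (lam + w @ h)                # RLS gain
    R_new = (R - np.outer(g, h)) / lam   # rank-one update of the inverse correlation matrix
    U_new = U + np.outer(x - U @ w, g)   # subspace update
    return U_new, R_new

# Usage: track a static subspace from fully observed data
rng = np.random.default_rng(5)
n, d, T = 100, 5, 3000
U_true, _ = np.linalg.qr(rng.standard_normal((n, d)))
U = np.linalg.qr(rng.standard_normal((n, d)))[0]
R = 100.0 * np.eye(d)                    # delta * I initialization
for _ in range(T):
    x = U_true @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)
    U, R = past_step(U, R, x)
Q, _ = np.linalg.qr(U)                   # PAST's U need not be orthonormal; compare spans
print(np.linalg.norm(U_true @ U_true.T - Q @ Q.T))   # should be small
```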

4.3 Petrels

The PETRELS algorithm, proposed in [23], can be viewed as a modification of the PAST algorithm to handle missing data; it is summarized in Algorithm 6. PETRELS optimizes the following function at time $t$ without constraining $\boldsymbol{U}$ to have orthogonal columns:

$F_t(\boldsymbol{U}) = \sum_{i=1}^{t} \lambda^{t-i} \min_{\boldsymbol{w}_i} \big\| \mathcal{P}_{\Omega_i}\big( \boldsymbol{x}_i - \boldsymbol{U}\boldsymbol{w}_i \big) \big\|_2^2.$   (27)

At each time $t$, PETRELS alternates between coefficient estimation and subspace update. We first estimate the coefficient vector $\boldsymbol{w}_t$ by minimizing the projection residual using the previous subspace estimate:

$\boldsymbol{w}_t = \mathop{\mathrm{argmin}}_{\boldsymbol{w}} \big\| \mathcal{P}_{\Omega_t}\big( \boldsymbol{x}_t - \boldsymbol{U}_{t-1}\boldsymbol{w} \big) \big\|_2^2,$   (28)

where $\boldsymbol{U}_0$ is a random subspace initialization. The subspace $\boldsymbol{U}_t$ is then updated by minimizing

$\boldsymbol{U}_t = \mathop{\mathrm{argmin}}_{\boldsymbol{U}} \sum_{i=1}^{t} \lambda^{t-i} \big\| \mathcal{P}_{\Omega_i}\big( \boldsymbol{x}_i - \boldsymbol{U}\boldsymbol{w}_i \big) \big\|_2^2,$   (29)

where $\boldsymbol{w}_i$, $i = 1, \ldots, t$, are estimates from (28). The objective function in (29) decomposes into a parallel set of smaller problems, one for each row of $\boldsymbol{U} = [\boldsymbol{u}^1, \ldots, \boldsymbol{u}^n]^{\top}$, where $\boldsymbol{u}^m$ denotes the $m$th row of $\boldsymbol{U}$. Thus the $m$th row can be estimated by solving

$\boldsymbol{u}_t^m = \mathop{\mathrm{argmin}}_{\boldsymbol{u}} \sum_{i=1}^{t} \lambda^{t-i}\, p_i(m) \big( x_i(m) - \boldsymbol{w}_i^{\top}\boldsymbol{u} \big)^2$   (30)

for $m = 1, \ldots, n$, where $p_i(m) = 1$ if $m \in \Omega_i$ and $p_i(m) = 0$ otherwise. Again, the problem can be solved efficiently via recursive least-squares. Moreover, PETRELS can be made very efficient by parallelizing the implementation of (30).

1:  Given , and , for all .
2:  Set ;
3:  repeat
4:     Define ;
5:     for  do
6:        ,
7:        ,
8:        ,
9:        .
10:     end for
11:     ;
12:  until termination
Algorithm 6 PETRELS [23]
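A simplified sketch of the PETRELS idea: per-row discounted least squares (30), written here with explicit (non-recursive) $d \times d$ solves for clarity rather than the efficient RLS recursions of Algorithm 6; the small regularization and all names are ours.

```python
import numpy as np

def petrels_simplified(X, masks, d, lam=0.98, eps=1e-6, seed=0):
    """Track a subspace from partially observed columns via per-row discounted LS (30)."""
    rng = np.random.default_rng(seed)
    T, n = X.shape
    U = np.linalg.qr(rng.standard_normal((n, d)))[0]
    A = np.tile(eps * np.eye(d), (n, 1, 1))      # per-row d x d second-moment matrices
    b = np.zeros((n, d))                         # per-row cross terms
    for t in range(T):
        x, obs = X[t], masks[t]                  # obs: boolean mask of observed entries
        Uo = U[obs, :]
        w, *_ = np.linalg.lstsq(Uo, x[obs], rcond=None)   # coefficient estimate (28)
        A *= lam
        b *= lam
        A[obs] += np.outer(w, w)                 # only observed rows accumulate
        b[obs] += x[obs][:, None] * w
        U = np.linalg.solve(A, b[..., None])[..., 0]      # row-wise LS solutions (30)
    return U

# Usage
rng = np.random.default_rng(6)
n, d, T, p = 100, 5, 2000, 0.3
U_true, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = (U_true @ rng.standard_normal((d, T))).T + 0.01 * rng.standard_normal((T, n))
masks = rng.random((T, n)) < p
Q, _ = np.linalg.qr(petrels_simplified(X, masks, d))
print(np.linalg.norm(U_true @ U_true.T - Q @ Q.T))   # should be small
```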

Both PAST and PETRELS can be regarded as applying second-order stochastic gradient descent [57] to the loss function, and each step of the update is approximately a Newton step. Therefore, it is expected that the algorithm will converge quadratically when it is close to the optimal solution. Several algorithms can be developed along similar lines as PETRELS, where the loss function is revised to include regularization terms on the Frobenius norms of the subspace and the weight vectors; we refer the readers to [26, 27].

5 Performance Analysis

In this section we will describe general analysis methodologies as well as specific theoretical results for characterizing the performance of the aforementioned streaming PCA and subspace tracking algorithms.

To carry out the analysis, we need to make assumptions on how the data are generated. A popular approach that has been taken in the literature is to assume that each data vector is generated according to the following “spiked model” [66]:

$\boldsymbol{x}_t = \boldsymbol{U}\boldsymbol{s}_t + \sigma\,\boldsymbol{\epsilon}_t,$   (31)

where $\boldsymbol{U} \in \mathbb{R}^{n \times d}$ is a deterministic orthogonal matrix, $\boldsymbol{s}_t \in \mathbb{R}^{d}$ is a random signal vector with covariance matrix $\boldsymbol{\Lambda} = \mathrm{diag}\{\lambda_1, \ldots, \lambda_d\}$, and $\boldsymbol{\epsilon}_t \in \mathbb{R}^{n}$ is the noise vector. For simplicity, we assume that the covariance matrix of $\boldsymbol{\epsilon}_t$ is the identity matrix $\boldsymbol{I}_n$, and we use $\sigma$ to denote the noise level. This model arises in applications such as array signal processing, where $\boldsymbol{U}$ is the “steering matrix” and $d$ denotes the number of targets to be tracked by the array. The generative model (31) can also be seen as a special case of the probabilistic model described in Section 2.1, since the covariance matrix of $\boldsymbol{x}_t$ is a sum of an exact low-rank matrix and a (scaled) full-rank identity matrix. In the missing data case studied in this paper, only a subset of the coordinates of $\boldsymbol{x}_t$ are observed. Thus, the actual measurement is $\boldsymbol{y}_t = \mathcal{P}_{\Omega_t}(\boldsymbol{x}_t)$ as in (5), where $\mathcal{P}_{\Omega_t}$ denotes the projection operator onto an observation mask $\Omega_t \subseteq \{1, \ldots, n\}$. The $i$th entry of $\boldsymbol{x}_t$ is observed if and only if $i \in \Omega_t$. For simplicity, we shall assume that the indicators $\{\mathbb{1}\{i \in \Omega_t\}\}$ form a collection of i.i.d. binary random variables such that

$\mathbb{P}\big( i \in \Omega_t \big) = p$   (32)

for some constant $p \in (0, 1]$.

5.1 Classical Asymptotic Analysis

Historically, the first analysis of subspace tracking algorithms was done in the asymptotic regime (see, e.g., [28, 29]), where the algorithms are shown to converge, in the small step size limit, to the solution of some deterministic Ordinary Differential Equations (ODEs).

To understand the basic ideas underlying such analysis, we note that the essence of almost all the online algorithms described in Section 3 and Section 4 is a stochastic recursion of the form

$\boldsymbol{U}_t = \boldsymbol{U}_{t-1} + \eta_t\, \boldsymbol{F}\big( \boldsymbol{U}_{t-1}, \boldsymbol{x}_t, \Omega_t \big).$   (33)

Here, $\boldsymbol{U}_t$ is the estimate at time $t$; $\boldsymbol{F}(\cdot)$ is some nonlinear function of the previous estimate $\boldsymbol{U}_{t-1}$, the new complete data vector $\boldsymbol{x}_t$, and its observation mask $\Omega_t$; and $\eta_t$ is the step size (i.e., the learning rate). For example, Krasulina’s method given in (19) is just a special case of (33) [with $\Omega_t = \{1, \ldots, n\}$, i.e., fully observed data]. When the step size is small, we can perform a Taylor expansion (with respect to $\eta_t$) on the recursion formulas of Oja’s method (16) and GROUSE (23), and show that these two algorithms can also be written in the form of (33) after omitting higher-order terms in $\eta_t$.

Under the statistical model (31) and (32), the general algorithm (33) is simply a Markov chain with state vectors $\boldsymbol{U}_t$. The challenge in analyzing the convergence of (33) comes from the nonlinearity in the function $\boldsymbol{F}(\cdot)$. In the literature, a very powerful analytical tool is the so-called ODE method. It was introduced to the control and signal processing communities by Ljung [30] and Kushner [31] in the 1970s, and similar approaches have an even longer history in the literature of statistical physics and stochastic processes (see, e.g., [67, 68] for some historical remarks).

The basic idea of the ODE method is to associate the discrete-time stochastic process (33) with a continuous-time deterministic ODE. Asymptotically, as the step size $\eta_t \to 0$ and the number of steps $t \to \infty$, the process (33) can be shown to converge to the solution of an ODE. Specifically, we let the step sizes $\{\eta_t\}$ be such that

$\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty.$

For example, a popular choice is $\eta_t = c/t$ for some constant $c > 0$. By defining $\tau_t = \sum_{i=1}^{t} \eta_i$ as the “fictitious” time, we can convert the discrete-time process $\{\boldsymbol{U}_t\}$ to a continuous-time process via linear interpolation:

$\widetilde{\boldsymbol{U}}(\tau) = \boldsymbol{U}_t + \frac{\tau - \tau_t}{\tau_{t+1} - \tau_t}\big( \boldsymbol{U}_{t+1} - \boldsymbol{U}_t \big), \qquad \tau_t \leq \tau < \tau_{t+1}.$   (34)

Under certain regularity conditions on the function $\boldsymbol{F}(\cdot)$, one can then show that, as $t \to \infty$, the randomness in the trajectory of $\widetilde{\boldsymbol{U}}(\tau)$ will diminish and $\widetilde{\boldsymbol{U}}(\tau)$ will converge to the deterministic solution of an ODE [30, 31].

Although a rigorous proof of the above convergence is technical, the limiting ODE, if the convergence indeed holds, can be easily derived, at least in a non-rigorous way. To start, we can rewrite (33) as

$\boldsymbol{U}_t - \boldsymbol{U}_{t-1} = \eta_t\, \bar{\boldsymbol{F}}(\boldsymbol{U}_{t-1}) + \eta_t\, \boldsymbol{m}_t,$   (35)

where $\bar{\boldsymbol{F}}(\boldsymbol{U}_{t-1}) = \mathbb{E}\big[ \boldsymbol{F}(\boldsymbol{U}_{t-1}, \boldsymbol{x}_t, \Omega_t) \mid \boldsymbol{U}_{t-1} \big]$ denotes the conditional expectation of the “new information” given the current state $\boldsymbol{U}_{t-1}$, and $\boldsymbol{m}_t$ captures the remainder terms. From the construction of $\widetilde{\boldsymbol{U}}(\tau)$ in (34), the left-hand side of (35) is equal to