Non-convex Robust PCA
We propose a new method for robust PCA – the task of recovering a low-rank matrix from sparse corruptions that are of unknown value and support. Our method involves alternating between projecting appropriate residuals onto the set of low-rank matrices, and the set of sparse matrices; each projection is non-convex but easy to compute. In spite of this non-convexity, we establish exact recovery of the low-rank matrix, under the same conditions that are required by existing methods (which are based on convex optimization). For an input matrix (, our method has a running time of per iteration, and needs iterations to reach an accuracy of . This is close to the running times of simple PCA via the power method, which requires per iteration, and iterations. In contrast, the existing methods for robust PCA, which are based on convex optimization, have complexity per iteration, and take iterations, i.e., exponentially more iterations for the same accuracy.
Experiments on both synthetic and real data establishes the improved speed and accuracy of our method over existing convex implementations.
Robust PCA, matrix decomposition, non-convex methods, alternating projections.
Principal component analysis (PCA) is a common procedure for preprocessing and denoising, where a low rank approximation to the input matrix (such as the covariance matrix) is carried out. Although PCA is simple to implement via eigen-decomposition, it is sensitive to the presence of outliers, since it attempts to “force fit” the outliers to the low rank approximation. To overcome this, the notion of robust PCA is employed, where the goal is to remove sparse corruptions from an input matrix and obtain a low rank approximation. Robust PCA has been employed in a wide range of applications, including background modeling [LHGT04], 3d reconstruction [MZYM11], robust topic modeling [Shi13], and community detection [CSX12], and so on.
Concretely, robust PCA refers to the following problem: given an input matrix , the goal is to decompose it into sparse and low rank matrices. The seminal works of [CSPW11, CLMW11] showed that this problem can be provably solved via convex relaxation methods, under some natural conditions on the low rank and sparse components. While the theory is elegant, in practice, convex techniques are expensive to run on a large scale and have poor convergence rates. Concretely, for decomposing an matrix, say with , the best specialized implementations (typically first-order methods) have a per-iteration complexity of , and require number of iterations to achieve an error of . In contrast, the usual PCA, which carries out a rank- approximation of the input matrix, has complexity per iteration – drastically smaller when is much smaller than . Moreover, PCA requires exponentially fewer iterations for convergence: an accuracy is achieved with only iterations (assuming constant gap in singular values).
In this paper, we design a non-convex algorithm which is “best of both the worlds” and bridges the gap between (the usual) PCA and convex methods for robust PCA. Our method has low computational complexity similar to PCA (i.e. scaling costs and convergence rates), and at the same time, has provable global convergence guarantees, similar to the convex methods. Proving global convergence for non-convex methods is an exciting recent development in machine learning. Non-convex alternating minimization techniques have recently shown success in many settings such as matrix completion [Kes12, JNS13, Har13], phase retrieval [NJS13], dictionary learning [AAJ13], tensor decompositions for unsupervised learning [AGH12], and so on. Our current work on the analysis of non-convex methods for robust PCA is an important addition to this growing list.
1.1 Summary of Contributions
We propose a simple intuitive algorithm for robust PCA with low per-iteration cost and a fast convergence rate. We prove tight guarantees for recovery of sparse and low rank components, which match those for the convex methods. In the process, we derive novel matrix perturbation bounds, when subject to sparse perturbations. Our experiments reveal significant gains in terms of speed-ups over the convex relaxation techniques, especially as we scale the size of the input matrices.
Our method consists of simple alternating (non-convex) projections onto low-rank and sparse matrices. For an matrix, our method has a running time of , where is the rank of the low rank component. Thus, our method has a linear convergence rate, i.e. it requires iterations to achieve an error of , where is the rank of the low rank component . When the rank is small, this nearly matches the complexity of PCA, (which is ).
We prove recovery of the sparse and low rank components under a set of requirements which are tight and match those for the convex techniques (up to constant factors). In particular, under the deterministic sparsity model, where each row and each column of the sparse matrix has at most fraction of non-zeros, we require that , where is the incoherence factor (see Section 3).
In addition to strong theoretical guarantees, in practice, our method enjoys significant advantages over the state-of-art solver for (1), viz., the inexact augmented Lagrange multiplier (IALM) method [CLMW11]. Our method outperforms IALM in all instances, as we vary the sparsity levels, incoherence, and rank, in terms of running time to achieve a fixed level of accuracy. In addition, on a real dataset involving the standard task of foreground-background separation [CLMW11], our method is significantly faster and provides visually better separation.
Overview of our techniques:
Our proof technique involves establishing error contraction with each projection onto the sets of low rank and sparse matrices. We first describe the proof ideas when is rank one. The first projection step is a hard thresholding procedure on the input matrix to remove large entries and then we perform rank- projection of the residual to obtain . Standard matrix perturbation results (such as Davis-Kahan) provide error bounds between the singular vectors of and . However, these bounds do not suffice for establishing the correctness of our method. Since the next step in our method involves hard thresholding of the residual , we require element-wise error bounds on our low rank estimate. Inspired by the approach of Erdős et al. [EKYY13], where they obtain similar element-wise bounds for the eigenvectors of sparse Erdős–Rényi graphs, we derive these bounds by exploiting the fixed point characterization of the eigenvectors111If the input matrix is not symmetric, we embed it in a symmetric matrix and consider the eigenvectors of the corresponding matrix.. A Taylor’s series expansion reveals that the perturbation between the estimated and the true eigenvectors consists of bounding the walks in a graph whose adjacency matrix corresponds to (a subgraph of) the sparse component . We then show that if the graph is sparse enough, then this perturbation can be controlled, and thus, the next thresholding step results in further error contraction. We use an induction argument to show that the sparse estimate is always contained in the true support of , and that there is an error contraction in each step. For the case, where has rank , our algorithm proceeds in several stages, where we progressively compute higher rank projections which alternate with the hard thresholding steps. In stage , we compute rank- projections, and show that after a sufficient number of alternating projections, we reduce the error to the level of singular value of , using similar arguments as in the rank- case. We then proceed to performing rank- projections which alternate with hard thresholding. This stage-wise procedure is needed for ill-conditioned matrices, since we cannot hope to recover lower eigenvectors in the beginning when there are large perturbations. Thus, we establish global convergence guarantees for our proposed non-convex robust PCA method.
1.2 Related Work
Guaranteed methods for robust PCA have received a lot of attention in the past few years, starting from the seminal works of [CSPW11, CLMW11], where they showed recovery of an incoherent low rank matrix through the following convex relaxation method:
where denotes the nuclear norm of (nuclear norm is the sum of singular values). A typical solver for this convex program involves projection on to and nuclear norm balls (which are convex sets). Note that the convex method can be viewed as “soft” thresholding in the standard and spectral domains, while our method involves hard thresholding in these domains.
[CSPW11] and [CLMW11] consider two different models of sparsity for . Chandrasekaran et al. [CSPW11] consider a deterministic sparsity model, where each row and column of the matrix, , has at most fraction of non-zero entries. For guaranteed recovery, they require , where is the incoherence level of , and is its rank. Hsu et al. [HKZ11] improve upon this result to obtain guarantees for an optimal sparsity level of . This matches the requirements of our non-convex method for exact recovery. Note that when the rank , this allows for a constant fraction of corrupted entries. Candès et al. [CLMW11] consider a different model with random sparsity and additional incoherence constraints, viz., they require . Note that our assumption of incoherence, viz., , only yields . The additional assumption enables [CLMW11] to prove exact recovery with a constant fraction of corrupted entries, even when is nearly full-rank. We note that removing the condition for robust PCA would imply solving the planted clique problem when the clique size is less than [Che13]. Thus, our recovery guarantees are tight upto constants without these additional assumptions.
A number of works have considered modified models under the robust PCA framework, e.g. [ANW12, XCS12]. For instance, Agarwal et al. [ANW12] relax the incoherence assumption to a weaker “diffusivity” assumption, which bounds the magnitude of the entries in the low rank part, but incurs an additional approximation error. Xu et al.[XCS12] impose special sparsity structure where a column can either be non-zero or fully zero.
In terms of state-of-art specialized solvers, [CLMW11] implements the in-exact augmented Lagrangian multipliers (IALM) method and provides guidelines for parameter tuning. Other related methods such as multi-block alternating directions method of multipliers (ADMM) have also been considered for robust PCA, e.g. [WHML13]. Recently, a multi-step multi-block stochastic ADMM method was analyzed for this problem [SAJ14], and this requires iterations to achieve an error of . In addition, the convergence rate is tight in terms of scaling with respect to problem size and sparsity and rank parameters, under random noise models.
There is only one other work which considers a non-convex method for robust PCA [KC12]. However, their result holds only for significantly more restrictive settings and does not cover the deterministic sparsity assumption that we study. Moreover, the projection step in their method can have an arbitrarily large rank, so the running time is still , which is the same as the convex methods. In contrast, we have an improved running time of .
In this section, we present our algorithm for the robust PCA problem. The robust PCA problem can be formulated as the following optimization problem: find s.t. 222 is the desired reconstruction error and
lies in the set of low-rank matrices,
lies in the set of sparse matrices.
A natural algorithm for the above problem is to iteratively project onto the set of sparse matrices to update , and then to project onto the set of low-rank matrices to update . Alternatively, one can view the problem as that of finding a matrix in the intersection of the following two sets: a) set of rank- matrices, b) where is a sparse matrix. Note that these projections can be done efficiently, even though the sets are non-convex. Hard thresholding (HT) is employed for projections on to sparse matrices, and singular value decomposition (SVD) is used for projections on to low rank matrices.
Rank- case: We first describe our algorithm for the special case when is rank 1. Our algorithm performs an initial hard thresholding to remove very large entries from input . Note that if we performed the projection on to rank- matrices without the initial hard thresholding, we would not make any progress since it is subject to large perturbations. We alternate between computing the rank- projection of , and performing hard thresholding on to remove entries exceeding a certain threshold. This threshold is gradually decreased as the iterations proceed, and the algorithm is run for a certain number of iterations (which depends on the desired reconstruction error).
General rank case: When has rank , a naive extension of our algorithm consists of alternating projections on to rank- matrices and sparse matrices. However, such a method has poor performance on ill-conditioned matrices. This is because after the initial thresholding of the input matrix , the sparse corruptions in the residual are of the order of the top singular value (with the choice of threshold as specified in the algorithm). When the lower singular values are much smaller, the corresponding singular vectors are subject to relatively large perturbations and thus, we cannot make progress in improving the reconstruction error. To alleviate the dependence on the condition number, we propose an algorithm that proceeds in stages. In the stage, the algorithm alternates between rank- projections and hard thresholding for a certain number of iterations. We run the algorithm for stages, where is the rank of . Intuitively, through this procedure, we recover the lower singular values only after the input matrix is sufficiently denoised, i.e. sparse corruptions at the desired level have been removed. Figure 1 shows a pictorial representation of the alternating projections in different stages.
Parameters: As can be seen, the only real parameter to the algorithm is , used in thresholding, which represents “spikiness” of . That is if the user expects to be “spiky” and the sparse part to be heavily diffused, then higher value of can be provided. In our implementation, we found that selecting aggressively helped speed up recovery of our algorithm. In particular, we selected .
Complexity: The complexity of each iteration within a single stage is , since it involves calculating the rank- approximation333Note that we only require a rank- approximation of the matrix rather than the actual singular vectors. Thus, the computational complexity has no dependence on the gap between the singular values. of an matrix (done e.g. via vanilla PCA). The number of iterations in each stage is and there are at most stages. Thus the overall complexity of the entire algorithm is then . This is drastically lower than the best known bound of on the number of iterations required by convex methods, and just a factor away from the complexity of vanilla PCA.
In this section, we present our main result on the correctness of AltProj. We assume the following conditions:
Rank of is at most .
is -incoherent, i.e., if is the SVD of , then and , , where and denote the rows of and respectively.
Each row and column of have at most fraction of non-zero entries such that .
Note that in general, it is not possible to have a unique recovery of low-rank and sparse components. For example, if the input matrix is both sparse and low rank, then there is no unique decomposition (e.g. ). The above conditions ensure uniqueness of the matrix decomposition problem.
Additionally, we set the parameter in Algorithm 1 be set as .
We now establish that our proposed algorithm recovers the low rank and sparse components under the above conditions.
Theorem 1 (Noiseless Recovery).
Under conditions , and , and choice of as above, the outputs and of Algorithm 1 satisfy:
Remark (tight recovery conditions): Our result is tight up to constants, in terms of allowable sparsity level
under the deterministic sparsity model. In other words, if we exceed the sparsity limit imposed in S1, it is possible to construct instances where there is no unique decomposition444For instance, consider the matrix which has copies of the all ones matrix, each of
size , placed across the diagonal.
We see that this matrix has rank and is incoherent with parameter
. Note that a fraction of sparse perturbations suffice to erase one of these blocks
making it impossible to recover the matrix..
Our conditions L1, L2 and S1 also match the conditions required by the convex method for recovery, as established in [HKZ11].
Remark (convergence rate): Our method has a linear rate of convergence, i.e. to achieve an error of , and hence we provide a strongly polynomial method for robust PCA. In contrast, the best known bound for convex methods for robust PCA is iterations to converge to an -approximate solution.
Theorem 1 provides recovery guarantees assuming that is exactly rank-. However, in several real-world scenarios, can be nearly rank-. Our algorithm can handle such situations, where , with being an additive noise. Theorem 1 is a special case of the following theorem which provides recovery guarantees when has small norm.
Theorem 2 (Noisy Recovery).
3.1 Proof Sketch
We now present the key steps in the proof of Theorem 1. A detailed proof is provided in the appendix.
Step I: Reduce to the symmetric case, while maintaining incoherence of and sparsity of . Using standard symmetrization arguments, we can reduce the problem to the symmetric case, where all the matrices involved are symmetric. See appendix for details on this step.
Step II: Show decay in after projection onto the set of rank- matrices. The -th iterate of the -th stage is given by . Hence, is obtained by using the top principal components of a perturbation of given by . The key step in our analysis is to show that when an incoherent and low-rank is perturbed by a sparse matrix , then is small and is much smaller than . The following lemma formalizes the intuition; see the appendix for a detailed proof.
Step III: Show decay in after projection onto the set of sparse matrices. We next show that if is much smaller than then the iterate also has a much smaller error (w.r.t. ) than . The above given lemma formally provides the error bound.
Step IV: Recurse the argument. We have now reduced the norm of the sparse part by a factor of half, while maintaining its sparsity. We can now go back to steps II and III and repeat the arguments for subsequent iterations.
We now present an empirical study of our AltProj method. The goal of this study is two-fold: a) establish that our method indeed recovers the low-rank and sparse part exactly, without significant parameter tuning, b) demonstrate that AltProj is significantly faster than Conv-RPCA (see (1)); we solve Conv-RPCA using the IALM method [CLMW11], a state-of-the-art solver [LCM10]. We implemented our method in Matlab and used a Matlab implementation of the IALM method by [LCM10].
We consider both synthetic experiments and experiments on real data involving the problem of foreground-background separation in a video. Each of our results for synthetic datasets is averaged over runs.
Parameter Setting: Our pseudo-code (Algorithm 1) prescribes the threshold in Step 4, which depends on the knowledge of the singular values of the low rank component . Instead, in the experiments, we set the threshold at the -th step of -th stage as . For synthetic experiments, we employ the used for data generation, and for real-world datasets, we tune through cross-validation. We found that the above thresholding provides exact recovery while speeding up the computation significantly. We would also like to note that [CLMW11] sets the regularization parameter in Conv-RPCA (1) as (assuming ). However, we found that for problems with large incoherence such a parameter setting does not provide exact recovery. Instead, we set in our experiments.
Synthetic datasets: Following the experimental setup of [CLMW11], the low-rank part is generated using normally distributed , . Similarly, is generated by sampling a uniformly random subset of with size and each non-zero is drawn i.i.d. from the uniform distribution over . For increasing incoherence of , we randomly zero-out rows of and then re-normalize them.
There are three key problem parameters for RPCA with a fixed matrix size: a) sparsity of , b) incoherence of , c) rank of . We investigate performance of both AltProj and IALM by varying each of the three parameters while fixing the others. In our plots (see Figure 2), we report computational time required by each of the two methods for decomposing into up to a relative error () of . Figure 2 shows that AltProj scales significantly better than IALM for increasingly dense . We attribute this observation to the fact that as increases, the problem is “harder” and the intermediate iterates of IALM have ranks significantly larger than . Our intuition is confirmed by Figure 2 (b), which shows that when density () of is then the intermediate iterates of IALM can be of rank over while the rank of is only . We observe a similar trend for the other parameters, i.e., AltProj scales significantly better than IALM with increasing incoherence parameter (Figure 2 (c)) and increasing rank (Figure 2 (d)). See Appendix C for additional plots.
Real-world datasets: Next, we apply our method to the problem of foreground-background (F-B) separation in a video [LHGT04]. The observed matrix is formed by vectorizing each frame and stacking them column-wise. Intuitively, the background in a video is the static part and hence forms a low-rank component while the foreground is a dynamic but sparse perturbation.
Here, we used two benchmark datasets named Escalator and Restaurant dataset. The Escalator dataset has frames at a resolution of . We first applied the standard PCA method for extracting low-rank part. Figure 3 (b) shows the extracted background from the video. There are several artifacts (shadows of people near the escalator) that are not desirable. In contrast, both IALM and AltProj obtain significantly better F-B separation (see Figure 3(c), (d)). Interestingly, AltProj removes the steps of the escalator which are moving and arguably are part of the dynamic foreground, while IALM keeps the steps in the background part. Also, our method is significantly faster, i.e., our method, which takes is about times faster than IALM, which takes .
Restaurant dataset: Figure 4 shows the comparison of AltProj and IALM on a subset of the “Restaurant” dataset where we consider the last frames at a resolution of . AltProj was around times faster than IALM. Moreover, visually, the background extraction seems to be of better quality (for example, notice the blur near top corner counter in the IALM solution). Plot(b) shows the PCA solution and that also suffers from a similar blur at the top corner of the image, while the background frame extracted by AltProj does not have any noticeable artifacts.
In this work, we proposed a non-convex method for robust PCA, which consists of alternating projections on to low rank and sparse matrices. We established global convergence of our method under conditions which match those for convex methods. At the same time, our method has much faster running times, and has superior experimental performance. This work opens up a number of interesting questions for future investigation. While we match the convex methods, under the deterministic sparsity model, studying the random sparsity model is of interest. Our noisy recovery results assume deterministic noise; improving the results under random noise needs to be investigated. There are many decomposition problems beyond the robust PCA setting, e.g. structured sparsity models, robust tensor PCA problem, and so on. It is interesting to see if we can establish global convergence for non-convex methods in these settings.
AA and UN would like to acknowledge NSF grant CCF-1219234, ONR N00014-14-1-0665, and Microsoft faculty fellowship. SS would like to acknowledge NSF grants 1302435, 0954059, 1017525 and DTRA grant HDTRA1-13-1-0024. PJ would like to acknowledge Nikhil Srivastava and Deeparnab Chakrabarty for several insightful discussions during the course of the project.
- [AAJ13] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning Sparsely Used Overcomplete Dictionaries via Alternating Minimization. Available on arXiv:1310.7991, Oct. 2013.
- [AGH12] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor Methods for Learning Latent Variable Models. Available at arXiv:1210.7559, Oct. 2012.
- [ANW12] A. Agarwal, S. Negahban, and M. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197, 2012.
- [Bha97] Rajendra Bhatia. Matrix Analysis. Springer, 1997.
- [Che13] Y. Chen. Incoherence-Optimal Matrix Completion. ArXiv e-prints, October 2013.
- [CLMW11] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? J. ACM, 58(3):11, 2011.
- [CSPW11] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
- [CSX12] Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering sparse graphs. In Advances in neural information processing systems, pages 2204–2212, 2012.
- [EKYY13] László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Spectral statistics of Erdős–Rényi graphs I: Local semicircle law. The Annals of Probability, 2013.
- [Har13] Moritz Hardt. On the provable convergence of alternating minimization for matrix completion. arXiv:1312.0925, 2013.
- [HKZ11] Daniel Hsu, Sham M Kakade, and Tong Zhang. Robust matrix decomposition with sparse corruptions. ITIT, 2011.
- [JNS13] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In STOC, 2013.
- [KC12] Anastasios Kyrillidis and Volkan Cevher. Matrix alps: Accelerated low rank and sparse matrix reconstruction. In SSP Workshop, 2012.
- [Kes12] Raghunandan H. Keshavan. Efficient algorithms for collaborative filtering. Phd Thesis, Stanford University, 2012.
- [LCM10] Zhouchen Lin, Minming Chen, and Yi Ma. The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv:1009.5055, 2010.
- [LHGT04] Liyuan Li, Weimin Huang, IY-H Gu, and Qi Tian. Statistical modeling of complex backgrounds for foreground object detection. ITIP, 2004.
- [MZYM11] Hossein Mobahi, Zihan Zhou, Allen Y. Yang, and Yi Ma. Holistic 3d reconstruction of urban structures from low-rank textures. In ICCV Workshops, pages 593–600, 2011.
- [NJS13] Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. In NIPS, pages 2796–2804, 2013.
- [SAJ14] H. Sedghi, A. Anandkumar, and E. Jonckheere. Guarantees for Stochastic ADMM in High Dimensions. Preprint., Feb. 2014.
- [Shi13] Lei Shi. Sparse additive text models with low rank background. In Advances in Neural Information Processing Systems, pages 172–180, 2013.
- [WHML13] X. Wang, M. Hong, S. Ma, and Z. Luo. Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. arXiv:1308.5294, 2013.
- [XCS12] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust pca via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.
Appendix A Proof of Theorem 1
We will start with some preliminary lemmas. The first lemma is the well known Weyl’s inequality in the matrix setting[Bha97].
Suppose be an matrix. Let and be the eigenvalues of and respectively such that and . Then we have:
The following lemma is the Davis-Kahan theorem[Bha97], specialized for rank- matrices.
Suppose . Let be a rank- matrix with unit spectral norm. Suppose further that . Then, we have:
where and are the top eigenvalue eigenvector pair of .
As outlined in Section 3.1 (and formalized in the proof of Theorem 1), it is sufficient to prove the correctness of Algorithm 1 for the case of symmetric matrices. So, most of the lemmas we prove in this section assume that the matrices are symmetric.
Let satisfy assumption . Then, .
Proof of Lemma 4.
Let be unit vectors such that . Then, using , we have:
where the last inequality follows from the fact that has at most non-zeros per row and per column. ∎
Let satisfy assumption (). Also, let be a -incoherent orthogonal matrix, i.e., , where stands for the standard basis vector. Then, , the following holds:
Proof of Lemma 5.
We prove the lemma using mathematical induction.
Base Case (): This is just a restatement of the incoherence of .
Induction step: We have:
where follows by , and follows from assumption on and from the inductive hypothesis on . ∎
In what follows, we prove a number of lemmas concerning the structure of and . The following lemma shows that the threshold in (2) is close to that with replaced by .
The following lemma shows that under the same assumptions as in Lemma 6, we can obtain a bound on the norm of . This is the most crucial step in our analysis since we bound norm of errors which are quite hard to obtain.
Let be the eigenvalue decomposition of . Also, recall that . Then, for every eigenvector of , we have
By triangle inequality, we have
We now bound the two terms on the right hand side above.
We note that,
where we denote to be the SVD of . Let be the eigenvalue decomposition of . Note that . Recall that, . Also note that,
Hence, using Lemma 8, we have: