Adaptive Stochastic Gradient Descent on the Grassmannian for Robust Low-Rank Subspace Recovery and Clustering
Abstract
In this paper, we present GASG21 (Grassmannian Adaptive Stochastic Gradient for ℓ2,1 norm minimization), an adaptive stochastic gradient algorithm to robustly recover the low-rank subspace from a large matrix. In the presence of column outliers, we reformulate the batch-mode ℓ2,1 norm minimization with rank constraint as a stochastic optimization problem constrained on the Grassmann manifold. For each observed data vector, the low-rank subspace is updated by taking a gradient step along the geodesic of the Grassmannian. In order to accelerate the convergence of the stochastic gradient method, we adaptively tune a constant stepsize by leveraging consecutive gradients. Furthermore, we demonstrate that with proper initialization, the K-subspaces extension, KGASG21, can robustly cluster a large number of corrupted data vectors into a union of subspaces. Numerical experiments on synthetic and real data demonstrate the efficiency and accuracy of the proposed algorithms even under heavy column outlier corruption.
I Introduction
Low-rank subspaces have long been a powerful tool in data modeling and analysis. Applications in communications [1], source localization and target tracking in radar and sonar [2], medical imaging [3], and face recognition [4] all leverage subspace models in order to recover the signal of interest and reject noise. Moreover, it is often natural to model data as lying on a union of subspaces in order to explore the intrinsic structure of a dataset. For example, a video sequence could contain several moving objects, and different subspaces might be used to describe the motion of each object [5]. However, owing to instrumental failures, environmental effects, and human factors, practitioners routinely face incompletely measured high-dimensional vectors, or even observations seriously corrupted by outliers, which pose great challenges to traditional subspace methods. It is well known that the current de facto subspace learning method, principal component analysis (PCA), is extremely sensitive to outliers: even a single but severe outlier may degrade the effectiveness of the model [6].
On the other hand, with the explosion of online social networks and the emergence of the Internet of Things [7], we are seeing databases grow at unprecedented rates. This data deluge, so-called big data, also poses great challenges to modern data analysis [8]. Conventional subspace methods and many recently proposed robust subspace approaches all operate in batch mode, which requires all the available data to be stored and thus leads to increased memory requirements and high computational complexity. The majority of algorithms use SVD (singular value decomposition) computations to perform Robust PCA, for example [9] [10] [11] [12] [13]. The SVD is slow and cannot scale to massive volumes of data. For big data optimization, randomization provides a promising alternative that scales extraordinarily well to very large datasets; in particular, a popular and practical approach is the stochastic gradient method, which operates on only one randomly chosen data point with an approximate gradient at each iteration [14]. As a result, stochastic gradient methods have been adopted and incorporated into popular big data machine learning libraries, such as MLlib in Apache Spark [15] and Apache Mahout [16].
In order to address both of the issues discussed above, algorithms for modern big data analysis must be computationally fast and memory efficient, as well as robust to corruption and missing data.
I-A Related works
When dealing with robust subspace recovery, we first categorize how outliers contaminate the data matrix. For a data matrix, Figure 1 demonstrates two kinds of outlier corruption models: column corruption and element-wise corruption. In the column corruption model of Figure 1(a), some columns of the data matrix are seriously corrupted by outliers while the other columns, the inliers, are free from corruption; in the element-wise corruption model of Figure 1(b), outliers are distributed across the matrix. In this paper, we are interested in how to efficiently recover the low-rank subspace from an incomplete data matrix corrupted by column outliers (outliers for short in this paper).
For column corruption, [9] presents an outlier pursuit algorithm, and their supporting theory states that as long as the fraction of corrupted points is small enough, their algorithm will recover the low-rank subspace as well as identify which columns are outliers. Like [9], the work of [10] supposes that a constant fraction of the observations (rows or columns) are outliers. In [10], the authors provide an algorithm for Robust PCA and give very strong guarantees in terms of the breakdown point of the estimator [6]. The REAPER algorithm proposed in [11] and the GMS algorithm proposed in [12] both perform Robust PCA via convex relaxation of absolute subspace deviation.
For element-wise corruption, the work of [17] provided breakthrough theory for decomposing a matrix into a sum of a low-rank matrix and a sparse matrix; it defined a notion of rank-sparsity incoherence under which the problem is well posed. The authors in [18] provide very similar guarantees, with high probability for particular random models: the random orthogonal model for the low-rank matrix and a Bernoulli model for the sparse matrix. The work of [19] is slightly more general in the sense that it proves results about matrix decompositions that are the sum of a low-rank matrix and a matrix with complementary structure, of which sparsity is only one example. In [20], the authors again follow a similar story as [17], providing guarantees for the low-rank + sparse model for deterministic sparsity patterns.
Recently, several online/stochastic robust subspace recovery algorithms have been proposed. For the missing data scenario, the online algorithm proposed in [21] and its variant [22] can efficiently identify the low-rank subspace from highly incomplete observations, even when the data stream is badly conditioned, and a powerful parallel stochastic gradient algorithm has been proposed in [23] to complete a very large matrix. For outlier corruption, robust online/stochastic algorithms have been developed in [24] [25] [26] [27] [28] [29] [30] for element-wise corruption and in [31] [32] for column corruption, respectively.
For robust subspace clustering, the median K-flats algorithm proposed in [33] is a robust extension of the classical K-subspaces method that incorporates the ℓ1 norm into the loss function. Local best fit (LBF) and spectral LBF (SLBF), proposed in [34] and [35], tackle the robust K-subspaces problem by selecting a set of local best-fit flats, seeded from a large enough pool of candidate flats, that minimize a global error measure. Furthermore, [36] provides theoretical support for such minimization-based robust K-subspaces approaches. Among convex approaches to robust subspace clustering, [37] presents a low-rank representation approach (LRR) which extends the robust PCA model and is guaranteed to produce exact recovery. On the other hand, a sparse representation based method [38] also has a strong theoretical guarantee; it extends sparse subspace clustering (SSC) [39].
I-B Contributions
The contributions of this paper are threefold.

Firstly, we cast the batch-mode ℓ2,1 norm minimization with rank constraint for robust subspace recovery into a stochastic optimization framework constrained on the Grassmannian, which enables the algorithm to scale very well to very big matrices.

Secondly, we propose a novel adaptive stepsize rule which adaptively determines the constant stepsize. With the proposed stepsize rule, our approach demonstrates an empirically linear convergence rate, much faster than the classic diminishing stepsize for SGD methods.

Thirdly, with proper initialization by incorporating combinatorial K-subspaces selection, we extend the proposed adaptive SGD approach to handle the challenging robust subspace clustering problem. Experiments on the real-world Hopkins 155 dataset and numerical subspace clustering simulations show the excellent performance of this simple K-subspaces extension, which can compete with the state of the art.
The rest of this paper is organized as follows. In Section II, we reformulate the batch-mode ℓ2,1 norm minimization with rank constraint for robust subspace recovery as a stochastic optimization problem. In Section III, we present the adaptive stochastic gradient algorithm in detail, which we refer to as GASG21 (Grassmannian Adaptive Stochastic Gradient for ℓ2,1 norm minimization), and discuss the critical parts of the implementation. In Section IV, we present a simple K-subspaces extension of GASG21, KGASG21, to tackle the robust subspace clustering problem. In Section V, we compare GASG21 and KGASG21 with several other subspace learning and clustering algorithms via extensive numerical experiments as well as real-world face data and Hopkins 155 trajectory clustering experiments. Section VI concludes our work and discusses future directions.
II Model of Robust Subspace Recovery
We denote a d-dimensional subspace of R^n as S. In applications of interest we have d ≪ n. Let the columns of an n×d matrix U be orthonormal and span S. The set of all subspaces of R^n of fixed dimension d is called the Grassmannian, denoted by G(d, n). For an n×m matrix X, let x_j (j = 1, …, m) be the columns of X; the ℓ2,1 norm is defined as ‖X‖_{2,1} = Σ_{j=1}^m ‖x_j‖_2, a sum of the Euclidean norms of the columns. We also define the ℓ1 norm as ‖X‖_1 = Σ_{i,j} |X_{ij}|, a sum of the absolute values of all elements, and the matrix nuclear norm as ‖X‖_* = Σ_i σ_i(X), a sum of the singular values of the matrix.
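As a quick illustration, the three norms above can each be computed in one line; this is an explanatory sketch in NumPy with our own function names, not part of the algorithm:

```python
import numpy as np

def l21_norm(X):
    """l2,1 norm: sum of the Euclidean norms of the columns of X."""
    return np.linalg.norm(X, axis=0).sum()

def l1_norm(X):
    """l1 norm: sum of the absolute values of all entries of X."""
    return np.abs(X).sum()

def nuclear_norm(X):
    """Nuclear norm: sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()
```

For a matrix with a single nonzero column, the ℓ2,1 norm and the nuclear norm coincide, while the ℓ1 norm depends on the individual entries.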
II-A Spherizing the data matrix to the ℓ2 ball
For a matrix consisting of inliers and column outliers, denoted by X = [x_1, …, x_m] ∈ R^{n×m}, we assume that each inlier x_j is generated as follows:

(1)  x_j = Ū w_j + ξ_j

where Ū ∈ R^{n×d} is an orthonormal basis of the true subspace, w_j ∈ R^d is the weight vector, and ξ_j is a zero-mean Gaussian white noise vector with small variance. If x_j is an outlier, it is assumed to be a zero-mean Gaussian noise vector with arbitrarily large variance.
Though it would be difficult to directly pursue the low-rank subspace due to the large variance of the outliers, it has been pointed out that normalizing the data matrix onto the ℓ2 ball is a simple but powerful way to address this challenge [11]. Identifying the low-rank subspace of the inliers from the normalized data matrix is essentially equivalent to our original problem. Once the data matrix is spherized, one observes that the outliers are constrained to the surface of the ball and the low-rank subspace only has degrees of freedom (DoF) along the sphere. Figure 2 (a) and (b) illustrate that the inliers lying on a rank-2 subspace are transformed into a circle on the ball while the outliers are scattered over the sphere. The blue squares in Figure 2 (b) show the initial estimated subspace, which forms another circle. Our aim is to rotate the blue circle (the estimated subspace) to approach the red inlier circle (the ground truth subspace), as demonstrated in Figure 2 (c) and (d).
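The spherization step amounts to rescaling every column to unit Euclidean norm; a minimal sketch (the function name is ours):

```python
import numpy as np

def spherize(X, eps=1e-12):
    """Scale every column of X to unit Euclidean norm.

    eps guards against division by zero for all-zero columns."""
    norms = np.linalg.norm(X, axis=0)
    return X / np.maximum(norms, eps)
```

After this step, inliers and outliers alike lie on the unit sphere, so the large variance of the outliers no longer dominates the scale of the problem.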
II-B ℓ2,1 norm minimization with rank constraint
Suppose the data matrix has been normalized onto the ℓ2 ball. To tackle column outlier corruption, one direct approach is to consider the matrix decomposition problem (2):

(2)  min_{L,C} rank(L) + λ ‖C‖_{2,0}  s.t.  X = L + C

where ‖C‖_{2,0} counts the number of nonzero columns in the matrix C. That is, the matrix X can be decomposed into a low-rank matrix L and a column-sparse matrix C. However, this optimization problem is not directly tractable: both the rank and the ℓ2,0 norm are non-convex and combinatorial. By exploiting convex relaxation, an outlier pursuit model has been proposed in [40]:

(3)  min_{L,C} ‖L‖_* + λ ‖C‖_{2,1}  s.t.  X = L + C

which is guaranteed to recover the low-rank subspace and identify which columns are outliers, as long as the fraction of corrupted points is small enough.
However, because the nuclear norm is not decomposable, the outlier pursuit model (3) cannot easily be extended to handle the big data scenario. Here we take an alternative matrix factorization approach with a rank constraint. Specifically, since the objective in the outlier pursuit model (3) is to minimize the sum of the nuclear norm and the ℓ2,1 norm, which promotes low rank and column sparsity, we consider factorizing the low-rank matrix as L = U W and minimizing the ℓ2,1 norm of the residual, with the rank constraint imposed by constraining U to the Grassmann manifold:

(4)  min_{U, W} ‖X − U W‖_{2,1}  s.t.  U ∈ G(d, n)

where the orthonormal columns of U span the estimated d-dimensional subspace and U is constrained to the Grassmannian G(d, n).
In order to optimize (4), we can take an alternating approach: fix U, then calculate W; fix W, then update U. As the objective in Equation (4) is summable, it is easy to derive its gradient, and the subproblems can be optimized by classic conjugate gradient methods. However, for big data optimization, computing and storing the full gradient of a very large matrix at each iteration is infeasible [14]. We therefore turn to solving the ℓ2,1 norm minimization with rank constraint by stochastic gradient descent (SGD) on the Grassmannian.
II-C Reformulation by stochastic optimization
For a single data point x_j, and considering the incomplete-information scenario in which Ω_j ⊂ {1, …, n} is the set of observed indices of an incomplete data vector, we introduce the loss function as follows:

(5)  F(U; x_j) = min_w ‖x_{Ω_j} − U_{Ω_j} w‖_2

We then rewrite Equation (4) as

(6)  min_U Σ_{j=1}^m F(U; x_j)  s.t.  U ∈ G(d, n)

Then, instead of computing the full gradient of Equation (4) to update the column-orthonormal matrix U, we choose a data point x_j uniformly at random and compute ∇F, the gradient of the loss function F, to update U incrementally. In the theory of stochastic optimization, the random data point selection results in an unbiased gradient estimate [14].
III Robust Subspace Recovery by Adaptive Stochastic Gradient Descent on the Grassmannian
III-A Stochastic gradient descent on the Grassmannian
Stochastic gradient derivation
For a single vector x_j, minimizing ‖x_{Ω_j} − U_{Ω_j} w‖_2 and minimizing ‖x_{Ω_j} − U_{Ω_j} w‖_2^2 are essentially the same least squares optimization problem. Then for the loss function (5), the best-fit weight vector is the least squares solution: w* = (U_{Ω_j}^T U_{Ω_j})^{−1} U_{Ω_j}^T x_{Ω_j}.
From Equation (2.70) in [41], the gradient ∇F can be determined from the derivative of F with respect to the components of U. Let χ_{Ω_j} be the columns of the n×n identity matrix corresponding to the indices in Ω_j; that is, this matrix zero-pads a vector of length |Ω_j| to length n with zeros on the complement of Ω_j. The derivative of the loss function with respect to the components of U is as follows:

(7)  dF/dU = −χ_{Ω_j} (r / ‖r‖_2) w*^T

Here we denote the subspace fit residual as r = x_{Ω_j} − U_{Ω_j} w*, and r̃ = χ_{Ω_j} r / ‖r‖_2 is the zero-padded, normalized residual vector. Then the gradient is

(8)  ∇F = (I − U U^T) (dF/dU) = −(I − U U^T) r̃ w*^T = −r̃ w*^T

The final equality follows because the normalized residual vector r̃ is orthogonal to all of the columns of U.
Subspace update
It is easy to verify that ∇F = −r̃ w*^T is rank one, since r̃ is a single residual vector and w* is a weight vector. The derivation of the geodesic gradient step is similar to GROUSE [21] and GRASTA [27].
Following Equation (2.65) in [41], a gradient step of length η in the direction −∇F is given by

(9)  U(η) = U + ( (cos(ησ) − 1) U (w*/‖w*‖_2) + sin(ησ) r̃ ) (w*^T/‖w*‖_2)

where the sole nonzero singular value of the gradient is σ = ‖w*‖_2.
We summarize our stochastic gradient method for ℓ2,1 norm minimization constrained on the Grassmannian as Algorithm 1.
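To make the per-vector update concrete, the following sketch performs one stochastic step: least-squares weights on the observed entries, the rank-one residual gradient, and the geodesic step of Equation (9). The function and variable names are ours, and the stepsize eta is passed in rather than produced by the adaptive rule of Section III-B:

```python
import numpy as np

def gasg21_step(U, x, omega, eta):
    """One stochastic geodesic step on the Grassmannian (illustrative sketch).

    U     : n x d matrix with orthonormal columns (current subspace estimate)
    x     : length-n data vector
    omega : indices of the observed entries of x
    eta   : stepsize (assumed given; the paper derives it adaptively)"""
    U_omega = U[omega]                                   # restrict basis to observed rows
    w, *_ = np.linalg.lstsq(U_omega, x[omega], rcond=None)
    r = x[omega] - U_omega @ w                           # least-squares residual
    rn, wn = np.linalg.norm(r), np.linalg.norm(w)
    if rn < 1e-12 or wn < 1e-12:
        return U                                         # vector already (nearly) fit
    r_tilde = np.zeros(U.shape[0])
    r_tilde[omega] = r / rn                              # zero-padded normalized residual
    sigma = wn                                           # sole nonzero singular value of the gradient
    step = (np.cos(eta * sigma) - 1.0) * (U @ w / wn) + np.sin(eta * sigma) * r_tilde
    return U + np.outer(step, w / wn)                    # geodesic step; preserves orthonormality
```

Because the residual r is orthogonal to the columns of U_omega, the updated matrix retains orthonormal columns exactly, which is what keeps the iterate on the Grassmannian.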
III-B Adaptive stepsize rules
For SGD methods, if the stepsize is generated as η_t = C/t, where t is the iteration counter and C is a predefined constant stepsize scale, the stepsize clearly satisfies Σ_t η_t = ∞ and Σ_t η_t^2 < ∞. This is the classic diminishing stepsize rule, which has been proven to guarantee convergence to a stationary point [42] [43]. However, it unfortunately leads to a sublinear, slow convergence rate.
As pointed out in [44], a constant stepsize at each iteration quickly leads the SGD method to reduce its initial error. Inspired by the adaptive SGD work of [45] and [46], we propose a modified adaptive stepsize rule that produces a proper constant stepsize and empirically achieves a linear rate of convergence. Our modified adaptive SGD method incorporates a level idea into the stepsize update rule: essentially, it performs a different constant stepsize at each level. A lower level means a larger constant stepsize and a higher level means a smaller constant stepsize.
Our stepsize rule updates three main parameters: the adaptive variable, the level, and the constant stepsize. The adaptive variable is updated according to the inner product of two consecutive gradients as follows:

(10)
where the function is defined as:
(11) 
Two of the constants are chosen to control how much the adaptive variable grows or shrinks, and another controls the shape of the function; in this paper we fix all of them once for all experiments. By incorporating the level idea, we only let the adaptive variable change within prescribed lower and upper bounds. For a well-conditioned data matrix this range can be small; for an ill-conditioned data matrix the range should be large.
Once the value calculated by Equation (10) exceeds the upper bound, we increase the level variable by one and reset the adaptive variable; if it falls below the lower bound, we decrease the level by one and reset it likewise.
Then finally the constant stepsize is as follows:
(12) 
Combining these ideas, we state our new adaptive stepsize rule as Algorithm 2.
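The exact constants and functional form of Equations (10)-(12) are specific to the paper, but the level idea can be sketched with hypothetical constants: an adaptive variable mu grows when consecutive stochastic gradients oscillate (suggesting the iterate is near a minimizer) and shrinks when they agree, and crossing its bounds moves the level up or down, which in turn scales the constant stepsize. This is our simplified illustration, not the paper's exact rule:

```python
import numpy as np

def adapt_stepsize(grad_prev, grad_curr, mu, level,
                   mu_min=1.0, mu_max=15.0, eta0=0.1):
    """Level-based constant-stepsize adaptation (simplified, hypothetical constants).

    A negative inner product of consecutive gradients (oscillation) grows mu;
    a positive one shrinks it.  Crossing [mu_min, mu_max] changes the level,
    and the constant stepsize halves with each level increase."""
    inner = float(np.sum(grad_prev * grad_curr))
    mu *= 1.1 if inner < 0 else 0.9
    if mu > mu_max:                      # sustained oscillation: move to a higher level
        level += 1
        mu = 0.5 * (mu_min + mu_max)
    elif mu < mu_min and level > 0:      # sustained agreement: fall back a level
        level -= 1
        mu = 0.5 * (mu_min + mu_max)
    eta = eta0 / (2.0 ** level)          # higher level -> smaller constant stepsize
    return mu, level, eta
```

With this kind of rule the algorithm keeps a large constant stepsize while far from the optimum and only shrinks it, level by level, once the gradients start to oscillate.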
III-C Discussions
Complexity and memory usage
Each subspace update step of GASG21 needs only simple linear algebra. The total computational cost of each step of Algorithm 1 is O(|Ω|d^2 + nd), where |Ω| is the number of samples per vector used, d is the dimension of the subspace, and n is the ambient dimension. Specifically, computing the weights in Step 3 of Algorithm 1 costs at most O(|Ω|d^2) flops; computing the gradient needs only simple matrix-vector multiplications; producing the adaptive stepsize is cheap; and the final update step costs O(nd) flops.
Throughout the process, GASG21 only needs O(nd) memory elements to maintain the estimated low-rank orthonormal basis U, plus smaller buffers for the observed vector, the weight vector, and the previous-step gradient.
This analysis shows that GASG21 is both computationally and memory efficient.
Relationship with GROUSE and GRASTA
GASG21 is closely related to GROUSE [21] and GRASTA [27]. For GROUSE, which minimizes the squared loss, the gradient of the loss function is −2 χ_{Ω_j} r w*^T, so the gradient direction of GASG21 and GROUSE is actually the same. The main difference between the two algorithms is their stepsize rules. It has been proved that with a constant stepsize GROUSE converges locally at a linear rate [47]. However, GROUSE does not discriminate between inliers and outliers. This leads us to view GASG21 as essentially a weighted version of GROUSE. We leave this observation for future investigation.
GRASTA, in contrast, minimizes the element-wise ℓ1 matrix norm, so it is well suited to element-wise outlier corruption as demonstrated in Figure 1 (b). GRASTA can still work for column outlier corruption in some scenarios, but it spends considerable time on ADMM iterations for each vector [26]. GASG21 only needs a simple least squares estimate for each vector, which reduces the computational complexity of each subspace update.
IV Algorithm for Robust K-Subspaces Recovery
In this section, we show that the proposed adaptive SGD algorithm can easily be extended to robustly recover K subspaces. For the K-subspaces scenario [48, 49], we assume the observed data vectors lie on or near a union of subspaces [5] and that the total number of subspaces K is known a priori.
IV-A Model of robust K-subspaces recovery
For robust K-subspaces recovery, as discussed in Section II-A, the data matrix should also be spherized onto the ℓ2 ball to mitigate outlier corruption. Our extension follows the work on incremental K-subspaces with missing data [50], in which the authors establish a lower bound on how much information is needed to determine to which subspace an incomplete data vector belongs. Based on this theory [50], the GROUSE algorithm [21] can be extended to identify K subspaces if the candidate subspaces are linearly independent. When the data contain column outliers following the model of Figure 1 (a), a simple outlier detection and removal approach may apply; however, it becomes problematic when outliers dominate the data distribution, because the large number of outliers leads to incorrect subspace assignments, and the subsequent subspace updates will not converge.
Owing to the robustness of the proposed GASG21, we can expect that even if some outliers are assigned incorrectly to a subspace, GASG21 can still robustly find the true subspace. Formally, given a matrix consisting of inliers and column outliers, where the inliers fall into K subspaces, in order to cluster the data vectors of the matrix into K subspaces we extend the robust subspace model (4) to the robust K-subspaces model (13). Here we also consider the missing data scenario:

(13)  min_{U_1, …, U_K} Σ_{j=1}^m min_{1≤k≤K} F(U_k; x_j)  s.t.  U_k ∈ G(d_k, n)

where F is the loss function of Equation (5). We note here that the model (13) follows the subspace assignment / subspace update two-stage paradigm, in which the inner minimization indicates the subspace to which a data vector should be assigned. Essentially, (13) minimizes the column-wise energy by assigning each column to its proper subspace, which is an extension of the classic ℓ2,1 matrix norm minimization. This kind of robust K-subspaces model has been used and discussed in several recent works [33, 34, 35].
IV-B Stochastic algorithm for robust K-subspaces recovery
It is well known that the model (13) is a non-convex optimization problem; directly minimizing the cost function (13) by gradient descent will in general only reach a local minimum, especially under random initialization of the K subspaces [5]. Instead of simply making several restarts to look for the global optimum, we borrow the idea of [35, 34] to initialize good candidate subspaces, and then refine the K subspaces by our adaptive SGD approach.
Specifically, we use probabilistic farthest insertion [51] to generate candidate subspaces which best fit the nearest neighbours of the sampled data vectors, where the number of candidates exceeds K. In the case of missing data, we simply zero-fill the unobserved entries of each vector [50]. To initialize the robust K-subspaces algorithm well, we should select the K subspaces from the candidates that achieve the lowest loss value of model (13). However, this selection problem is combinatorial and difficult to solve exactly. We exploit the greedy selection algorithm proposed in [35, 34] to approximate the best K subspaces. Though these carefully initialized subspaces are not the final optimal K subspaces, they are typically good enough to cluster the data vectors and lead the final refinement process to global convergence with high probability. Figure 3 illustrates how the best 2 subspaces are initialized. In Figure 3 (a) the candidate subspaces (in blue) are generated by probabilistic farthest insertion, and Figure 3 (b) demonstrates that the selected 2 subspaces closely approximate the inlier subspaces (in red).
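The probabilistic farthest insertion seeding described above can be sketched as a K-means++-style selection of seed columns; this is our illustrative simplification (it seeds single columns rather than fitting local flats, and need not match the exact procedure of [51]):

```python
import numpy as np

def farthest_insertion_seeds(X, K, rng=None):
    """Pick K seed columns of X, each chosen with probability proportional
    to its squared distance from the already chosen seeds (sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    m = X.shape[1]
    seeds = [int(rng.integers(m))]
    for _ in range(K - 1):
        # squared distance of every column to its nearest current seed
        d2 = np.min([np.sum((X - X[:, [s]]) ** 2, axis=0) for s in seeds], axis=0)
        probs = d2 / d2.sum()
        seeds.append(int(rng.choice(m, p=probs)))
    return seeds
```

Each seed would then be fit, together with its nearest neighbours, by a local subspace to form one candidate for the greedy selection step.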
Due to the presence of outliers, the initialized K subspaces are not optimal. Once we obtain a good initialization, we can easily refine the K subspaces by our proposed adaptive SGD approach. Simply, for each data vector x_j we assign it to its nearest subspace,

(14)  k* = arg min_{1≤k≤K} F(U_k; x_j)

and then update U_{k*} according to the adaptive SGD method discussed in Section III-A. Though outliers will inevitably be misassigned to some candidate subspaces, the robust nature of our algorithm helps the K subspaces converge to the optimum. Essentially, the refinement process just performs GASG21 on each candidate subspace via the subspace assignment / subspace update paradigm. We conclude this section by listing our robust K-subspaces recovery approach, denoted KGASG21, as Algorithm 3.
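The assignment step (14) is just a least-squares residual comparison across the candidate bases, reusing the loss of Equation (5); a sketch with our own naming:

```python
import numpy as np

def assign_subspace(x, omega, subspaces):
    """Return the index of the candidate subspace with the smallest
    least-squares residual on the observed entries of x (Eq. (14) idea)."""
    best_k, best_res = -1, np.inf
    for k, U in enumerate(subspaces):
        U_omega = U[omega]
        w, *_ = np.linalg.lstsq(U_omega, x[omega], rcond=None)
        res = np.linalg.norm(x[omega] - U_omega @ w)
        if res < best_res:
            best_k, best_res = k, res
    return best_k
```

The selected basis would then receive the stochastic geodesic update, while the other candidate subspaces are left untouched for that vector.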
Input: A collection of vectors and their observed indices; the number of subspaces K and their dimensions; the number of candidate subspaces (larger than K); and a maximum number of iterations.
Output: The estimated subspaces spanned by the recovered bases and a partition of the data vectors into K disjoint clusters.
V Experiments
In order to evaluate the performance of GASG21 and its extension KGASG21, we conduct both synthetic numerical simulations and real-world dataset experiments to investigate convergence in different scenarios. In all of the following experiments, we use Matlab R2010b on a Macbook Pro laptop with a 2.3 GHz Intel Core i5 CPU and 8 GB RAM. To improve performance, we implement GASG21 and the greedy selection for the critical initialization of KGASG21 in C++ via the well-known linear algebra library Armadillo [52].
V-A Numerical experiments on robust subspace recovery
We generate the synthetic data matrix as the product of a random matrix, a diagonal matrix which controls the condition number, and another random matrix, where the two random factors have i.i.d. Gaussian entries. We then randomly select a subset of columns and replace them with a random matrix as outliers. In the following numerical experiment plots, we always use the principal angle between the simulated true subspace and the recovered subspace to evaluate convergence.
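A sketch of this synthetic setup and of the principal-angle metric, with the sizes left as parameters because the concrete values are not reproduced here (function names are ours):

```python
import numpy as np

def synth_outlier_matrix(n, m, d, out_frac, noise=1e-3, seed=0):
    """Rank-d inliers plus a fraction of random column outliers (sketch)."""
    rng = np.random.default_rng(seed)
    U = np.linalg.qr(rng.standard_normal((n, d)))[0]     # true subspace basis
    X = U @ rng.standard_normal((d, m)) + noise * rng.standard_normal((n, m))
    n_out = int(out_frac * m)
    idx = rng.choice(m, n_out, replace=False)
    X[:, idx] = rng.standard_normal((n, n_out))          # replace columns by outliers
    return X, U, idx

def max_principal_angle(U, V):
    """Largest principal angle (radians) between the spans of U and V."""
    Qu = np.linalg.qr(U)[0]
    Qv = np.linalg.qr(V)[0]
    s = np.linalg.svd(Qu.T @ Qv, compute_uv=False)
    return np.arccos(np.clip(s.min(), -1.0, 1.0))
```

A condition-number control could be added by inserting a diagonal scaling between the two random factors, as the text describes.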
Convergence comparison with GROUSE and GRASTA
Because of the close relationship between GASG21, GROUSE, and GRASTA, we want to examine the convergence behaviour of these algorithms on large matrices corrupted by column outliers. Besides, in order to show the fast convergence of GASG21 compared with the classic diminishing stepsize for SGD, we also consider the diminishing stepsize version of GASG21, denoted GASG21DM.
Firstly, we generate two low-rank matrices with different condition numbers: one well-conditioned matrix with singular values in a narrow range, and another with singular values in a wider range. We fix the outlier fraction and reveal only part of each matrix to the algorithms. For GROUSE we use the constant stepsize which has been proved to converge locally at a linear rate on clean matrices [47]; for GRASTA we also exploit our proposed adaptive stepsize method and denote it GRASTAML. It can be seen from Figure 4 that GASG21 converges linearly for both matrices, while GASG21DM converges sublinearly due to the diminishing stepsize. Though GROUSE basically steps along the same gradient direction on the Grassmannian as GASG21, GROUSE cannot converge to the true subspace in the presence of outliers. This is because a large fraction of outliers will always produce wrong update directions, since GROUSE treats outliers equally with inliers. One possible way to overcome outlier corruption in GROUSE is to incorporate outlier detection and take much smaller steps on outliers; however, this would complicate GROUSE, and the outlier threshold would be hard to tune across scenarios. On the contrary, our GASG21 treats outliers and inliers in a unified way and chooses the best constant stepsize adaptively. There is an interesting observation in Figure 4: though GRASTA essentially minimizes the ℓ1 matrix norm, it does successfully recover the low-rank subspace for well-conditioned matrices corrupted by column outliers, as shown in Figure 4 (a). However, Figure 4 (b) shows that GRASTA fails when the condition number of the matrix is slightly higher.
Secondly, we generate large matrices to examine the recovery results of these SGD algorithms while varying the percentage of outliers and the subsampling ratio, for a given number of iterations equal to only a few cycles over the matrices. In Figure 5 (a), we fix the subsampling ratio and vary the outlier fraction; in Figure 5 (b), we fix the percentage of outliers and vary the subsampling ratio. The results clearly demonstrate that GASG21 can resist a large fraction of outlier corruption even with highly incomplete data. Moreover, in our C++ implementation, the iterations of GASG21 for these large matrices cost on the order of seconds, much less than GRASTA's C++ implementation. Detailed running time results for GASG21 are reported in Figure 8.
The effects of
For our adaptive SGD approach on the Grassmannian, the important parameter regarding the convergence rate is the adaptation range stated in Algorithm 2. The adaptive variable only changes within this range, which controls how quickly the algorithm adapts to a smaller constant stepsize. With a smaller range, GASG21 is very likely to rise to a higher level quickly, generating smaller constant stepsizes and faster convergence; in contrast, with a larger range, rising to a higher level costs more iterations, so GASG21 converges more slowly. In Section III-B, we pointed out that the range can be small for a well-conditioned data matrix to obtain faster convergence, but it must be large enough to guarantee convergence for a moderately ill-conditioned data matrix.
Here, we generate two matrices of the same rank but with different condition numbers to examine how this parameter affects convergence. For the first matrix we manually set the singular values so that the condition number is small; for the second we set them so that the condition number is larger. For both matrices we randomly place outliers on a subset of columns, and we observe only part of the entries. We then vary the adaptation range and run GASG21 on the two matrices. Figure 6 (a) shows that for a well-conditioned matrix a smaller range indeed leads to faster convergence. However, Figure 6 (b) demonstrates that for a moderately ill-conditioned matrix the range should be large, because with a large range the SGD algorithm takes enough iterations at each level to reduce the error with that level's constant stepsize.
Recovery comparison with the state of the art
In the final numerical experiments, we compare GASG21 with state-of-the-art algorithms for robust subspace recovery. Here we consider three representative algorithms: two batch methods, REAPER [11] and Outlier Pursuit (OP for short) [40], and one stochastic method, Robust Online PCA (denoted RobustMD in accordance with the reference) [32], which is based on the REAPER formulation. In comparison with these state-of-the-art methods, we want to show how fast our GASG21 is for large matrices corrupted by a large fraction of outliers.
Firstly, we generate only small low-rank matrices to evaluate all four algorithms, because OP and RobustMD are very sensitive to the ambient dimension and the current implementation of RobustMD is not well optimized. Besides, due to the sublinear convergence rate of RobustMD [32], we cannot expect RobustMD to converge to the true subspace with high precision. In the following experiments, we therefore terminate all algorithms once the principal angle between the true subspace and the recovered subspace falls below a threshold. The percentage of outliers is varied from zero upward, with the nonzero singular values set within a fixed range. Figure 7 demonstrates that for these small matrices GASG21 reaches the stopping criterion far faster than RobustMD and OP. We also point out that at the largest outlier fraction OP fails to recover the subspace within the allotted iterations. The running time of REAPER is competitive with GASG21 for these small matrices.
Next, we compare GASG21 with REAPER on bigger matrices. We examine the running time of both algorithms from two aspects: the percentage of outliers and the ambient dimension of the low-rank subspace. As both algorithms converge to the true subspace precisely, in the following experiments we let the algorithms run until the principal angle falls below a small threshold. We generate big low-rank matrices and vary the percentage of outliers from zero upward, again with the nonzero singular values in a fixed range. Figure 8 (a) shows that REAPER costs the most time at the highest outlier fraction, because it needs more iterations on the big matrix and each REAPER iteration involves an SVD of the big matrix. On the contrary, GASG21 costs far less time in this case owing to its simple linear algebra computations at each iteration. In Figure 8 (b), we show how the ambient dimension of the subspace affects the running time; here the matrices are generated with fixed rank and numbers of inliers and outliers, and varying ambient dimension. Compared with the quickly growing running time of REAPER at larger ambient dimensions, GASG21 maintains an extremely low computational time. Its running time is linear in the ambient dimension, consistent with our complexity analysis in Section III-C. We can expect that even for very high-dimensional matrices GASG21 can recover the low-rank subspace in a short time, which would be problematic for SVD-based methods.
[Figure 8: panels (a) and (b)]
Finally, we turn to the recently proposed stochastic approach RobustMD [32]. Though GASG21 is much more efficient than RobustMD, especially for big matrices, we observe that RobustMD works consistently well on ill-conditioned matrices. For GASG21, however, Figure 9 shows that as the condition number of the matrix increases, GASG21 needs more iterations to converge. The main reason for this convergence issue on ill-conditioned matrices is that our stochastic optimization on the Grassmannian makes no use of the singular values of the matrix, as has been pointed out for GROUSE [22]. This motivates us to design a scaled version of our approach; we leave this endeavour for future work.
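The experiment above hinges on controlling the spread of the nonzero singular values. One simple way to generate rank-r test matrices with a prescribed condition number (a sketch under our own naming, not the paper's generator) is:

```python
import numpy as np

def rank_r_matrix_with_condition(d, n, r, kappa, seed=0):
    """Rank-r d x n matrix whose r nonzero singular values are
    log-spaced between 1 and kappa, so its condition number
    (restricted to the column space) is exactly kappa."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, r)))  # left factor
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))  # right factor
    s = np.logspace(0, np.log10(kappa), r)[::-1]      # kappa, ..., 1
    return (U * s) @ V.T

M = rank_r_matrix_with_condition(40, 30, 5, 100.0)
```

Increasing `kappa` reproduces the ill-conditioned regime where, as Figure 9 suggests, unscaled Grassmannian SGD needs more iterations.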
[Figure 9: panels (a), (b), and (c)]
V-B Robust subspace recovery for a real face dataset
We consider a face data set. Here images of an individual's face under different illumination conditions serve as inliers, which approximately lie on a low-dimensional subspace [4], and the outliers are random natural images. In our data set, the inliers are chosen from the Extended Yale Face Database B [53], which contains 10 individual faces with 64 images per face, and the outliers are chosen from the BACKGROUND/Google folder of the Caltech101 database [54].
We conduct a total of 10 groups of experiments. In each group, we compose a data set containing 64 face images from the same individual and 400 random images from BACKGROUND, as Figure 10 shows. Each image is converted to grayscale and downsampled to dimension . Here we set , so we aim to obtain a 9-dimensional subspace in our experiments. We compare GASG21 with two robust methods, REAPER and OP, and one non-robust method, PCA. For REAPER and OP we set the maximum number of iterations to 50, and for GASG21 we set it to 5000, which means GASG21 cycles through the dataset about 10 times.
[Figure 10: panels (a) and (b)]
In Figure 11, we visually compare GASG21 to REAPER. We recover the inlier faces by projecting them onto the learned subspace, so clearer images indicate better performance of the algorithm. From Figure 11, we can see that the robust methods, such as our GASG21, REAPER, and OP, outperform the non-robust PCA.
[Figure 11: panels (a) and (b)]
In Table I, we also quantitatively compare the performance of the investigated algorithms through the relative residual

$\mathrm{res} = \|X - \hat{X}\|_F / \|X\|_F,$ (15)

where $X$ denotes the original inlier face images and $\hat{X}$ the recovered images obtained by projecting the inliers onto the learned subspace. Table I indicates that GASG21 is superior to OP and competitive with REAPER; boldface indicates the best recovery result. We also record the running times of these algorithms in Table II. For all the experiments, GASG21 takes only around 4 seconds, whereas REAPER costs around 25 seconds.
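The recovered images and the residuals in Table I both come from projecting the inliers onto the learned subspace; the computation is a few lines of numpy (a minimal sketch assuming a relative Frobenius-norm residual; the function name is ours):

```python
import numpy as np

def projection_residual(X, U):
    """Relative residual of projecting the columns of X onto span(U).

    X: d x n data matrix (e.g. inlier face images as columns).
    U: d x r orthonormal basis of the learned subspace.
    """
    X_hat = U @ (U.T @ X)  # recovered images: orthogonal projection
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

Since the projection error is orthogonal to the subspace, this residual always lies in [0, 1]; smaller values mean the inliers are better captured.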
Table I: Recovery residuals (15) of the investigated algorithms on the 10 face groups.
Face1  Face2  Face3  Face4  Face5  
GASG21  0.1490  0.1671  0.1564  0.1663  0.1511 
REAPER  0.1456  0.1429  0.1531  0.1477  0.1407 
OP  0.1878  0.1864  0.1824  0.2063  0.1745 
PCA  0.2223  0.2236  0.2134  0.2354  0.2116 
Face6  Face7  Face8  Face9  Face10  
GASG21  0.1676  0.1615  0.1745  0.1799  0.1674 
REAPER  0.1616  0.1464  0.1755  0.1664  0.1567 
OP  0.1875  0.2099  0.2311  0.2090  0.1723 
PCA  0.2229  0.2401  0.2648  0.2406  0.1963 

Table II: Running time (seconds) of the investigated algorithms on the 10 face groups.
Face1  Face2  Face3  Face4  Face5  
GASG21  4.04  4.01  4.02  4.03  4.01 
REAPER  21.40  26.27  25.17  32.15  28.03 
OP  36.85  38.71  36.71  38.95  37.90 
PCA  0.33  0.42  0.42  0.70  0.43 
Face6  Face7  Face8  Face9  Face10  
GASG21  4.03  4.02  4.04  4.13  4.40 
REAPER  26.42  28.28  26.99  25.78  28.357 
OP  40.00  39.50  39.51  38.93  37.88 
PCA  0.42  0.43  0.43  0.45  0.47 

V-C Experiments on robust K-subspaces recovery
In this subsection, we validate the robustness and fast convergence of our K-subspaces extension.
Numerical experiments
In order to examine the convergence behavior of KGASG21, i.e., Algorithm 3, we conduct the following numerical experiment. We randomly generate subspaces with rank and ambient dimension . The inliers of each subspace are generated by , in which and are two factors with i.i.d. Gaussian entries; is and is . Here we set for each subspace, so in total we have inliers lying in the subspaces. We insert different fractions of outliers into the dataset in the following experiments. Moreover, as discussed in Section IV-B, the parameter is important for selecting the best initial K subspaces, so we also vary it, for example , to give an experimental suggestion for selecting a good value.
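The synthetic union-of-subspaces data described above (rank-r inlier blocks built from i.i.d. Gaussian factors, plus random column outliers) can be generated along these lines; the helper and its defaults are our own illustration, not the paper's code:

```python
import numpy as np

def make_union_of_subspaces(d, r, K, n_in, n_out, seed=0):
    """Columns from K rank-r subspaces in R^d (n_in inliers each)
    plus n_out outlier columns, assumed i.i.d. Gaussian here.
    Returns the shuffled data matrix and per-column labels
    (cluster index for inliers, -1 for outliers)."""
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for k in range(K):
        U = rng.standard_normal((d, r))       # d x r factor
        V = rng.standard_normal((n_in, r))    # n_in x r factor
        blocks.append(U @ V.T)                # rank-r inlier block U V^T
        labels += [k] * n_in
    X_out = rng.standard_normal((d, n_out))   # column outliers
    X = np.hstack(blocks + [X_out])
    labels = np.asarray(labels + [-1] * n_out)
    perm = rng.permutation(X.shape[1])        # interleave columns randomly
    return X[:, perm], labels[perm]

X, labels = make_union_of_subspaces(d=50, r=3, K=4, n_in=30, n_out=10)
```

Each labeled block has rank exactly r, so a clustering algorithm's output can be checked against `labels`.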
We first show the convergence plots of recovering the subspaces in the presence of outliers in Figure 12. In this experiment, we are given an matrix in which the inliers are generated as stated before. Here the column outliers are randomly inserted into the data matrix, and we randomly reveal entries of the matrix. The parameter is set to in this experiment. The maximum number of iterations of the SGD part of Algorithm 3, namely Step 3 to Step 11, is set to . We plot the logged projection residuals associated with one subspace in Figure 12 (a), from which we can clearly see the convergence trend despite the outliers. In Figure 12 (b), we manually filter the outliers' part out of the logged residuals for each subspace; we then get a clear picture that all subspaces enjoy a linear rate of convergence. Moreover, Figure 12 (c) demonstrates that the recovered subspaces approximate the ground-truth subspaces with high precision. We note that, due to the nonconvex nature of model (13), the nice convergence of all subspaces should be attributed to the well-selected initial K subspaces obtained via the important Step 1 and Step 2 of Algorithm 3.
[Figure 12: panels (a), (b), and (c)]
Next, we explore how the initialization of the K subspaces affects the final recovered results. For Step 1, we search the nearest neighbours to construct the Q subspaces by probabilistic farthest insertion [51]. For Step 2, we vary to examine the final recovery and clustering performance. The experimental settings are the same as before except , which is set to . Table III reports the average running time of Step 1, Step 2, and SGD (Step 3 to Step 11) for different . The recovery accuracy of the K subspaces is reported in the last three columns, which show the average worst, median, and mean of the principal angle between the true subspace and the recovered subspace, respectively. We take 5 runs for each setting. It can be seen from the first row, , that at least half of the subspaces are correctly identified. The decrease in with more candidate subspaces means that more and more subspaces are recovered. All subspaces are correctly identified when we seed more than candidates. Ideally, more candidate subspaces increase robustness, but on the other hand they incur more running time for initialization. In our simulation, for outliers, is a good tradeoff between running time and accuracy. Furthermore, if the data set is outlier free, then simply suffices to guarantee good performance [50, 55].
Table III: Average running time (seconds) of Step 1, Step 2, and the SGD phase, and the average worst/median/mean principal angles, for increasing numbers of candidate subspaces.
Step 1  Step 2  SGD  worst  median  mean
0.79  0.29  8.27  1.17  2.42E-8  1.56E-1
1.60  0.86  8.35  5.81E-1  8.53E-9  7.07E-2
2.36  1.46  8.31  2.40E-1  6.15E-9  3.28E-2
3.21  2.35  8.33  2.90E-1  4.57E-9  1.45E-2
3.97  3.27  8.23  1.95E-7  6.36E-9  2.04E-8
5.06  4.26  8.29  7.17E-8  5.61E-9  1.09E-8
6.11  5.66  8.31  1.39E-7  5.74E-9  1.78E-8
6.98  7.70  8.28  8.96E-8  4.51E-9  1.21E-8
7.97  8.81  8.30  6.62E-8  4.22E-9  1.15E-8
8.86  10.60  8.40  8.63E-8  5.36E-9  1.27E-8

Third, we report the experimental relationship between the fraction of outliers and the number of candidate subspaces in Table IV. We vary the fraction of outliers from gentle corruption to heavy corruption . For each corruption scenario, we set the number of candidate subspaces to , and for each setting the maximum number of iterations of the SGD phase is set to . In addition, for the heavy corruption ratio , we increase the to to check the convergence when is set to . Table IV shows the importance of selecting enough candidate subspaces in the case of outlier corruption. Although can work very well when the data is free of outliers [50, 55], we must seed at least candidate subspaces to correctly recover a union of subspaces when . Moreover, a larger fraction of outliers needs more candidates; for example, when , more than candidates could guarantee good recovery. We also need to take more iterations for the SGD to converge to high precision.
Table IV: Mean (standard deviation) of the principal-angle errors; columns correspond to increasing fractions of outliers and rows to increasing numbers of candidate subspaces. The rows marked 1⃝ and 2⃝ use an increased number of SGD iterations for the heaviest corruption.
2.86E-1 (6.61E-2)  1.17E+0 (1.98E-1)  —–  —–  
5.09E-5 (2.63E-6)  5.67E-1 (5.65E-2)  1.47E+0 (1.93E-1)  —–  
1.29E-15 (8.39E-16)  2.69E-4 (1.35E-5)  2.88E-1 (4.21E-2)  1.47E+0 (4.00E-1)  
—–  1.98E-12 (3.41E-13)  3.01E-1 (2.98E-2)  8.75E-1 (1.27E-1)  
—–  —–  1.06E-7 (1.43E-8)  6.42E-1 (1.02E-1)  
—–  —–  1.20E-7 (1.63E-8)  5.82E-1 (5.66E-2)  
1⃝  —–  —–  —–  4.77E-3 (5.39E-4)
2⃝  —–  —–  —–  2.22E-5 (1.24E-6)

Experiments on the Hopkins 155 dataset
Finally, we validate KGASG21 on the well-known Hopkins 155 dataset [56] for motion segmentation, which consists of 155 video sequences along with the coordinates of certain features extracted and tracked for each sequence in all its frames.
Because of its close relationship with the LBF (Local Best-fit Flats) algorithm [34, 35], we first compare KGASG21 with LBF, as well as LRR, on the original outlier-free Hopkins 155 dataset. The clustering quality is measured by the standard misclassification rate

$\mathrm{error}\,(\%) = \frac{\#\ \text{of misclassified points}}{\text{total}\ \#\ \text{of points}} \times 100.$ (16)

Here we vary of KGASG21 from to . Table V demonstrates that KGASG21 only requires seeding slightly more candidate subspaces, for example , for its overall performance to be superior to the baseline LBF with candidates. Because of the stochastic subspace learning phase in KGASG21, unlike LBF, once enough candidates are selected, sampling more candidates does not further improve the clustering performance: for example, is basically comparable to , and both are superior to LBF. Based on this observation, we can safely choose a moderate number of candidate subspaces when performing KGASG21 for subspace clustering. Though LRR achieves overall better segmentation performance in this experiment, our approach is at least faster than these competitors and has the lowest memory requirement.
To examine the capability of handling outlier corruption, outliers are manually inserted into the Hopkins 155 dataset. The outliers are generated as Gaussian random vectors with large variance. Here we only compare KGASG21 with LBF. Table VI shows that even without additional candidate subspaces, i.e., , KGASG21 still performs better than LBF, which selects the best K subspaces from and even candidates. In this experiment, we observe that the best results are attained when we set and . If we sample more candidate subspaces, it is interesting to note that the clustering accuracy can even deteriorate a little, which is not aligned with our simulation results. We observe that this issue also occurs in LBF. It is probably because some trajectories of the Hopkins 155 dataset are not well conditioned and some subspaces are dependent. We will investigate this issue in future work.
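The segmentation errors reported in Tables V and VI follow the usual Hopkins 155 convention: the fraction of misclassified points, minimized over permutations of the cluster labels. A brute-force sketch (fine for the small K of motion segmentation; the names are ours, not the benchmark's code):

```python
import numpy as np
from itertools import permutations

def misclassification_rate(true, pred, K):
    """Fraction of misclassified points, minimized over all
    permutations of the K predicted cluster labels."""
    true = np.asarray(true)
    pred = np.asarray(pred)
    best = 1.0
    for perm in permutations(range(K)):
        mapped = np.array([perm[p] for p in pred])  # relabel prediction
        best = min(best, float(np.mean(mapped != true)))
    return best

err = misclassification_rate([0, 0, 1, 1], [1, 1, 1, 0], 2)
# -> 0.25: under the best relabeling, one of four points is wrong
```

For larger K, a Hungarian assignment would replace the factorial search over permutations.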
Table V: Misclassification rates (%) on the original outlier-free Hopkins 155 dataset.
2-motion  Checker  Traffic  Articulated  All  Time  
mean  median  mean  median  mean  median  mean  median  
KGASG21 (Q=K)  6.33  2.15  0.91  0.00  5.39  0.00  4.88  0.86  63.18 
KGASG21 (Q=5K)  3.07  0.28  0.50  0.00  2.78  0.00  2.39  0.00  68.66 
KGASG21 (Q=10K)  3.22  0.00  0.50  0.00  2.01  0.00  2.42  0.00  76.21 
KGASG21 (Q=20K)  2.85  0.00  0.31  0.00  2.58  0.00  2.19  0.00  92.27 
KGASG21 (Q=40K)  2.07  0.00  0.67  0.00  3.04  0.00  1.82  0.00  121.88 
KGASG21 (Q=70K)  2.18  0.00  0.64  0.00  2.92  0.00  1.87  0.00  180.41 
BaselineLBF (Q=70K)  3.71  0.00  6.86  1.48  4.87  1.53  4.62  0.05  442.17 
LBF (Q=200K)  4.13  0.00  8.71  0.82  5.08  1.53  5.37  0.00  816.30 
LRR  1.50  0.00  0.52  0.00  2.25  0.00  1.33  0.00  457.93 
3-motion 
Checker  Traffic  Articulated  All  
mean  median  mean  median  mean  median  mean  median  
KGASG21 (Q=K)  13.72  10.23  5.23  5.18  19.36  19.36  11.94  9.64   
KGASG21 (Q=5K)  6.77  2.42  0.99  0.00  11.91  11.91  5.60  1.78   
KGASG21 (Q=10K)  7.30  2.06  0.69  0.00  10.43  10.43  5.16  1.21   
KGASG21 (Q=20K)  6.00  2.29  0.33  0.00  10.00  10.00  4.82  1.63   
KGASG21 (Q=40K)  5.45  1.86  1.72  0.00  8.94  8.94  4.70  0.92   
KGASG21 (Q=70K)  6.06  1.63  1.91  0.00  8.51  8.51  5.18  1.47   
BaselineLBF (Q=70K)  9.95  1.91  12.88  12.99  48.94  48.94  11.44  2.85   
LBF (Q=200K)  9.19  1.81  17.04  15.61  47.23  47.23  12.07  2.25   
LRR  2.57  0.11  1.58  0.00  8.51  8.51  2.51  0.00   

Table VI: Misclassification rates (%) on the outlier-corrupted Hopkins 155 dataset.
2-motion  Checker  Traffic  Articulated  All  Time  
mean  median  mean  median  mean  median  mean  median  
KGASG21 (Q=K)  10.55  8.54  3.70  0.99  7.09  1.49  8.49  5.82  83.15 
KGASG21 (Q=5K)  7.54  5.78  3.03  0.00  7.58  0.00  6.42  2.98  88.30 
KGASG21 (Q=10K)  8.21  6.91  3.95  0.00  7.23  0.00  7.04  1.36  101.58 
KGASG21 (Q=20K)  10.05  7.29  3.44  0.09  6.66  0.00  8.05  2.46  124.10 
BaselineLBF (Q=70K)  13.20  12.44  6.07  0.27  7.02  0.00  10.80  7.68  452.88 
LBF (Q=200K)  12.64  11.17  6.86  0.87  10.43  1.02  10.97  7.81  970.91 
3-motion 
Checker  Traffic  Articulated  All  
mean  median  mean  median  mean  median  mean  median  
KGASG21 (Q=K)  17.95  15.51  9.30  5.86  27.23  27.23  16.24  14.96   
KGASG21 (Q=5K)  14.92  15.33  7.24  2.22  25.74  25.74  13.47  12.24   
KGASG21 (Q=10K)  15.77  16.62  4.46  0.25  35.74  35.74  13.76  13.67   
KGASG21 (Q=20K)  17.68  19.67  7.60  2.15  20.64  20.64  15.46  17.82   
BaselineLBF (Q=70K)  24.12  22.83  14.79  8.97  25.96  25.96  22.04  21.56   
LBF (Q=200K)  25.91  26.51  14.07  13.76  37.66  37.66  23.54  22.93   

VI Conclusion
In this paper we have presented a stochastic gradient descent algorithm that robustly recovers a low-rank subspace by adaptive Grassmannian optimization. Our stochastic approach is both computationally and memory efficient, which permits it to tackle robust subspace recovery for big data. By incorporating the proposed adaptive step-size rule, our approach empirically exhibits a linear convergence rate. We also demonstrate that this efficient algorithm can easily be extended to cluster a large number of corrupted data vectors into a union of subspaces, given proper initialization.
Though this work presents an approach, and its extension, for robust subspace recovery and clustering that is more computationally efficient than the state of the art, a foremost remaining problem is how to extend the proposed method to ill-conditioned matrices. As the numerical comparison with RobustMD shows, our approach can only tolerate moderately ill-conditioned matrices, so we are very interested in developing a scaled version of the algorithm that takes the singular values into account. The recent works on both batch [58] [59] and online [22] completion for ill-conditioned matrices shed light on this direction.
Acknowledgment
We thank Laura Balzano for her thoughtful suggestions while we prepared this manuscript. We also thank John Goes for generously providing the code of RobustMD.
Footnotes
 Here we use Armadillo version , downloaded from http://arma.sourceforge.net/download.html
 Hopkins 155 dataset is downloaded from http://www.vision.jhu.edu/data/hopkins155/.
 http://www.math.umn.edu/~lerman/lbf/.
 https://sites.google.com/site/guangcanliu/.
References
 E. Moulines, P. Duhamel, J.F. Cardoso, and S. Mayrargue, “Subspace methods for the blind identification of multichannel fir filters,” Signal Processing, IEEE Transactions on, vol. 43, no. 2, pp. 516–525, feb 1995.
 H. Krim and M. Viberg, “Two decades of array signal processing research: the parametric approach,” Signal Processing Magazine, IEEE, vol. 13, no. 4, pp. 67–94, July 1996.
 M. A. Audette, F. P. Ferrie, and T. M. Peters, “An algorithmic overview of surface registration techniques for medical imaging,” Medical Image Analysis, vol. 4, no. 3, pp. 201 – 217, 2000. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1361841500000141
 R. Basri and D. Jacobs, “Lambertian reflectance and linear subspaces,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 25, no. 2, pp. 218–233, 2003.
 R. Vidal, “Subspace clustering,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 52 – 68, 2011.
 P. J. Huber and E. M. Ronchetti, Robust Statistics. Wiley, 2009.
 J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, “Internet of things (iot): A vision, architectural elements, and future directions,” Future Generation Computer Systems, vol. 29, no. 7, pp. 1645–1660, 2013.
 K. Slavakis, G. Giannakis, and G. Mateos, “Modeling and optimization for big data analytics:(statistical) learning tools for our era of data deluge,” Signal Processing Magazine, IEEE, vol. 31, no. 5, pp. 18–31, 2014.
 H. Xu, C. Caramanis, and S. Sanghavi, “Robust pca via outlier pursuit,” Information Theory, IEEE Transactions on, vol. 58, no. 5, pp. 3047–3064, 2012.
 H. Xu, C. Caramanis, and S. Mannor, “Outlierrobust pca: The high dimensional case,” IEEE Transactions on Information Theory, vol. 59, no. 1, pp. 546–572, 2013.
 G. Lerman, M. McCoy, J. A. Tropp, and T. Zhang, “Robust computation of linear models, or how to find a needle in a haystack,” arXiv preprint arXiv:1202.4044, 2012.
 T. Zhang and G. Lerman, “A novel mestimator for robust pca,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 749–808, 2014.
 E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” J. ACM, vol. 58, no. 3, pp. 11:1–11:37, Jun. 2011. [Online]. Available: http://doi.acm.org/10.1145/1970392.1970395
 V. Cevher, S. Becker, and M. Schmidt, “Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics,” Signal Processing Magazine, IEEE, vol. 31, no. 5, pp. 32–43, Sept 2014.
 Spark, http://spark.apache.org/.
 Mahout, http://mahout.apache.org/.
 V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky, “Ranksparsity incoherence for matrix decomposition,” SIAM Journal on Optimization, vol. 21, p. 572, 2011.
 J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, “Robust principal component analysis: Exact recovery of corrupted lowrank matrices via convex optimization,” Advances in neural information processing systems, vol. 22, pp. 2080–2088, 2009.
 A. Agarwal, S. Negahban, and M. J. Wainwright, “Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions,” The Annals of Statistics, vol. 40, no. 2, pp. 1171–1197, 2012.
 D. Hsu, S. M. Kakade, and T. Zhang, “Robust matrix decomposition with sparse corruptions,” Information Theory, IEEE Transactions on, vol. 57, no. 11, pp. 7221–7234, 2011.
 L. Balzano, R. Nowak, and B. Recht, “Online identification and tracking of subspaces from highly incomplete information,” in Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on. IEEE, 2010, pp. 704–711.
 R. Kennedy, C. J. Taylor, and L. Balzano, “Online completion of illconditioned lowrank matrices,” in IEEE Global Conference on Signal and Information Processing (GlobalSIP), December 2014.
 B. Recht and C. Ré, “Parallel stochastic gradient algorithms for largescale matrix completion,” Mathematical Programming Computation, vol. 5, no. 2, pp. 201–226, 2013.
 G. Mateos and G. B. Giannakis, “Robust pca as bilinear decomposition with outliersparsity regularization,” Signal Processing, IEEE Transactions on, vol. 60, no. 10, pp. 5176–5190, 2012.
 Y. Li, “On incremental and robust subspace learning,” Pattern Recognition, vol. 37, no. 7, pp. 1509–1518, 2004.
 J. He, L. Balzano, and J. Lui, “Online robust subspace tracking from partial information,” Arxiv preprint arXiv:1109.3827, 2011.
 J. He, L. Balzano, and A. Szlam, “Incremental gradient on the grassmannian for online foreground and background separation in subsampled video,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
 J. He, D. Zhang, L. Balzano, and T. Tao, “Iterative grassmannian optimization for robust image alignment,” Image and Vision Computing, vol. 32, no. 10, pp. 800–813, 2014.
 H. Guo, C. Qiu, and N. Vaswani, “An online algorithm for separating sparse and lowdimensional signal sequences from their sum,” Signal Processing, IEEE Transactions on, vol. 62, no. 16, pp. 4284–4297, Aug 2014.
 J. Feng, H. Xu, and S. Yan, “Online robust pca via stochastic optimization,” in Advances in Neural Information Processing Systems, 2013, pp. 404–412.
 J. Feng, H. Xu, S. Mannor, and S. Yan, “Online pca for contaminated data,” in Advances in Neural Information Processing Systems, 2013, pp. 764–772.
 J. Goes, T. Zhang, R. Arora, and G. Lerman, “Robust stochastic principal component analysis,” in Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, 2014, pp. 266–274.
 T. Zhang, A. Szlam, and G. Lerman, “Median kflats for hybrid linear modeling with many outliers,” in Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 234–241.
 T. Zhang, A. Szlam, Y. Wang, and G. Lerman, “Randomized hybrid linear modeling by local bestfit flats,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 1927–1934.
 ——, “Hybrid linear modeling via local bestfit flats,” International Journal of Computer Vision, vol. 100, no. 3, pp. 217–240, 2012.
 G. Lerman, T. Zhang et al., “Robust recovery of multiple subspaces by geometric lp minimization,” The Annals of Statistics, vol. 39, no. 5, pp. 2686–2715, 2011.
 G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by lowrank representation,” in Proceedings of the 27th International Conference on Machine Learning (ICML10), 2010, pp. 663–670.
 M. Soltanolkotabi, E. Elhamifar, E. J. Candes et al., “Robust subspace clustering,” The Annals of Statistics, vol. 42, no. 2, pp. 669–699, 2014.
 E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 2790–2797.
 H. Xu, C. Caramanis, and S. Sanghavi, “Robust pca via outlier pursuit,” Information Theory, IEEE Transactions on, vol. 58, no. 5, pp. 3047–3064, May 2012.
 A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with orthogonality constraints,” SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998.
 H. Robbins and S. Monro, “A stochastic approximation method,” The annals of mathematical statistics, pp. 400–407, 1951.
 H. J. Kushner and G. Yin, Stochastic approximation and recursive algorithms and applications. Springer, 2003, vol. 35.
 A. Nedić and D. Bertsekas, “Convergence rate of incremental subgradient algorithms,” in Stochastic optimization: algorithms and applications. Springer, 2001, pp. 223–264.
 A. Plakhov and P. Cruz, “A stochastic approximation algorithm with stepsize adaptation,” Journal of Mathematical Sciences, vol. 120, no. 1, pp. 964–973, 2004.
 S. Klein, J. Pluim, M. Staring, and M. Viergever, “Adaptive stochastic gradient descent optimisation for image registration,” International journal of computer vision, vol. 81, no. 3, pp. 227–239, 2009.
 L. Balzano and S. Wright, “Local convergence of an algorithm for subspace identification from partial data,” Foundations of Computational Mathematics, pp. 1–36, 2014. [Online]. Available: http://dx.doi.org/10.1007/s1020801492277
 P. K. Agarwal and N. H. Mustafa, “kmeans projective clustering,” in Proceedings of the twentythird ACM SIGMODSIGACTSIGART symposium on Principles of database systems. ACM, 2004, pp. 155–165.
 P. S. Bradley and O. L. Mangasarian, “kplane clustering,” Journal of Global Optimization, vol. 16, no. 1, pp. 23–32, 2000.
 L. Balzano, A. Szlam, B. Recht, and R. Nowak, “ksubspaces with missing data,” in Statistical Signal Processing Workshop (SSP), 2012 IEEE. IEEE, 2012, pp. 612–615.
 R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy, “The effectiveness of lloydtype methods for the kmeans problem,” in Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on. IEEE, 2006, pp. 165–176.
 C. Sanderson, “Armadillo: An open source c++ linear algebra library for fast prototyping and computationally intensive experiments,” Technical Report, 2010.
 K.C. Lee, J. Ho, and D. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 5, pp. 684–698, 2005.
 L. FeiFei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” Computer Vision and Image Understanding, vol. 106, no. 1, pp. 59–70, 2007.
 D. Pimentel, R. Nowak, and L. Balzano, “On the sample complexity of subspace clustering with missing data,” in Statistical Signal Processing (SSP), 2014 IEEE Workshop on. IEEE, 2014, pp. 280–283.
 R. Tron and R. Vidal, “A benchmark for the comparison of 3d motion segmentation algorithms,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–8.
 J. P. Costeira and T. Kanade, “A multibody factorization method for independently moving objects,” International Journal of Computer Vision, vol. 29, no. 3, pp. 159–179, 1998.
 T. Ngo and Y. Saad, “Scaled gradients on grassmann manifolds for matrix completion,” in Advances in Neural Information Processing Systems, 2012, pp. 1412–1420.
 B. Mishra, K. Adithya Apuroop, and R. Sepulchre, “A Riemannian geometry for lowrank matrix completion,” ArXiv eprints, Nov. 2012.