Robust Recovery of Subspace Structures by Low-Rank Representation


Guangcan Liu, Member, IEEE, Zhouchen Lin (Corresponding Author), Senior Member, IEEE, Shuicheng Yan, Senior Member, IEEE, Ju Sun, Student Member, IEEE, Yong Yu, and Yi Ma, Senior Member, IEEE
Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Visual Computing Group, Microsoft Research Asia, Beijing, China
Key Laboratory of Machine Perception, Peking University, Beijing, China
Electrical and Computer Engineering, National University of Singapore, Singapore
Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, USA
Electrical Engineering, Columbia University, USA
Abstract

In this work we address the subspace clustering problem. Given a set of data samples (vectors) approximately drawn from a union of multiple subspaces, our goal is to cluster the samples into their respective subspaces and remove possible outliers as well. To this end, we propose a novel objective function named Low-Rank Representation (LRR), which seeks the lowest-rank representation among all the candidates that can represent the data samples as linear combinations of the bases in a given dictionary. It is shown that the convex program associated with LRR solves the subspace clustering problem in the following sense: when the data is clean, we prove that LRR exactly recovers the true subspace structures; when the data are contaminated by outliers, we prove that under certain conditions LRR can exactly recover the row space of the original data and detect the outlier as well; for data corrupted by arbitrary sparse errors, LRR can also approximately recover the row space with theoretical guarantees. Since the subspace membership is provably determined by the row space, these further imply that LRR can perform robust subspace clustering and error correction, in an efficient and effective way.

Index Terms

low-rank representation, subspace clustering, segmentation, outlier detection.

I Introduction

In pattern analysis and signal processing, an underlying tenet is that the data often contains some type of structure that enables intelligent representation and processing. So one usually needs a parametric model to characterize a given set of data. To this end, the well-known (linear) subspaces are possibly the most common choice, mainly because they are easy to compute and often effective in real applications. Several types of visual data, such as motion [1, 2, 3], face [4] and texture [5], have been known to be well characterized by subspaces. Moreover, by applying the concept of reproducing kernel Hilbert space, one can easily extend the linear models to handle nonlinear data. So the subspace methods have been gaining much attention in recent years. For example, the widely used Principal Component Analysis (PCA) method and the recently established matrix completion [6] and recovery [7] methods are essentially based on the hypothesis that the data is approximately drawn from a low-rank subspace. However, a given data set can seldom be well described by a single subspace. A more reasonable model is to consider data as lying near several subspaces, namely the data is considered as samples approximately drawn from a mixture of several low-rank subspaces, as shown in Fig.1.

Fig. 1: A mixture of subspaces consisting of a 2D plane and two 1D lines. (a) The samples are strictly drawn from the underlying subspaces. (b) The samples are approximately drawn from the underlying subspaces.

The generality and importance of subspaces naturally lead to a challenging problem of subspace segmentation (or clustering), whose goal is to segment (cluster or group) data into clusters with each cluster corresponding to a subspace. Subspace segmentation is an important data clustering problem and arises in numerous research areas, including computer vision [3, 8, 9], image processing [5, 10] and system identification [11]. When the data is clean, i.e., the samples are strictly drawn from the subspaces, several existing methods (e.g., [12, 13, 14]) are able to exactly solve the subspace segmentation problem. So, as pointed out by [3, 14], the main challenge of subspace segmentation is to handle the errors (e.g., noise and corruptions) that possibly exist in data, i.e., to handle the data that may not strictly follow subspace structures. With this viewpoint, in this paper we therefore study the following subspace clustering [15] problem.

Problem I.1 (Subspace Clustering)

Given a set of data samples approximately (i.e., the data may contain errors) drawn from a union of linear subspaces, correct the possible errors and segment all samples into their respective subspaces simultaneously.

Fig. 2: Illustrating three typical types of errors: (a) noise [6], which indicates the phenomena that the data is slightly perturbed around the subspaces (what we show is a perturbed data matrix whose columns are samples drawn from the subspaces); (b) random corruptions [7], which indicate that a fraction of random entries are grossly corrupted; (c) sample-specific corruptions (and outliers), which indicate the phenomena that a fraction of the data samples (i.e., columns of the data matrix) are far away from the subspaces.

Notice that the word “error” generally refers to the deviation between the model assumption (i.e., subspaces) and the data. In reality it can manifest as noise [6], missing entries [6], outliers [16] and corruptions [7]. Fig.2 illustrates three typical types of errors under the context of subspace modeling. In this work, we shall focus on the sample-specific corruptions (and outliers) shown in Fig.2(c), with mild concern for the cases of Fig.2(a) and Fig.2(b). Notice that an outlier comes from a model other than the subspaces and is essentially different from a corrupted sample that belongs to the subspaces. We put them into the same category just because they can be handled in the same way, as will be shown in Section V-B.

To recover the subspace structures from the data containing errors, we propose a novel method termed low-rank representation (LRR) [14]. Given a set of data samples each of which can be represented as a linear combination of the bases in a dictionary, LRR aims at finding the lowest-rank representation of all data jointly. The computational procedure of LRR is to solve a nuclear norm [17] regularized optimization problem, which is convex and can be solved in polynomial time. By choosing a specific dictionary, it is shown that LRR can well solve the subspace clustering problem: when the data is clean, we prove that LRR exactly recovers the row space of the data; for the data contaminated by outliers, we prove that under certain conditions LRR can exactly recover the row space of the original data and detect the outlier as well; for the data corrupted by arbitrary errors, LRR can also approximately recover the row space with theoretical guarantees. Since the subspace membership is provably determined by the row space (we will discuss this in Section III-B), these further imply that LRR can perform robust subspace clustering and error correction, in an efficient way. In summary, the contributions of this work include:

  • We develop a simple yet effective method, termed LRR, which has been used to achieve state-of-the-art performance in several applications such as motion segmentation [4], image segmentation [18], saliency detection [19] and face recognition [4].

  • Our work extends the recovery of corrupted data from a single subspace [7] to multiple subspaces. Compared to [20], which requires the bases of subspaces to be known for handling the corrupted data from multiple subspaces, our method is autonomous, i.e., no extra clean data is required.

  • Theoretical results for robust recovery are provided. While our analysis shares similar features as previous work in matrix completion [6] and robust PCA (RPCA) [7, 16], it is considerably more challenging due to the fact that there is a dictionary matrix in LRR.

II Related Work

In this section, we discuss some existing subspace segmentation methods. In general, existing works can be roughly divided into four main categories: mixture of Gaussian, factorization, algebraic and spectral-type methods.

In statistical learning, mixed data is typically modeled as a set of independent samples drawn from a mixture of probabilistic distributions. As a single subspace can be well modeled by a (degenerate) Gaussian distribution, it is straightforward to assume that each probabilistic distribution is Gaussian, i.e., to adopt a mixture-of-Gaussians model. The problem of segmenting the data is then converted into a model estimation problem. The estimation can be performed either by using the Expectation Maximization (EM) algorithm to find a maximum likelihood estimate, as done in [21], or by iteratively finding a min-max estimate, as adopted by K-subspaces [8] and Random Sample Consensus (RANSAC) [10]. These methods are sensitive to errors, so several efforts have been made to improve their robustness, e.g., Median K-flats [22] for K-subspaces, the work [23] for RANSAC, and [5], which uses a coding length to characterize a mixture of Gaussians. These refinements may introduce some robustness. Nevertheless, the problem is still not well solved, because the optimization difficulty remains a bottleneck for these methods.

Factorization based methods [12] seek to approximate the given data matrix as a product of two matrices, such that the support pattern for one of the factors reveals the segmentation of the samples. In order to achieve robustness to noise, these methods modify the formulations by adding extra regularization terms. Nevertheless, such modifications usually lead to non-convex optimization problems, which need heuristic algorithms (often based on alternating minimization or EM-style algorithms) to solve. Getting stuck at local minima may undermine their performances, especially when the data is grossly corrupted. It will be shown that LRR can be regarded as a robust generalization of the method in [12] (which is referred to as PCA in this paper). The formulation of LRR is convex and can be solved in polynomial time.

Generalized Principal Component Analysis (GPCA) [24] presents an algebraic way to model data drawn from a union of multiple subspaces. This method describes a subspace containing a data point by using the gradient of a polynomial at that point. Subspace segmentation is then made equivalent to fitting the data with polynomials. GPCA can guarantee the success of the segmentation under certain conditions, and it does not impose any restriction on the subspaces. However, this method is sensitive to noise due to the difficulty of estimating the polynomials from real data, which also causes the high computational cost of GPCA. Recently, Robust Algebraic Segmentation (RAS) [25] has been proposed to resolve the robustness issue of GPCA. However, the computational cost of fitting polynomials remains prohibitively large, so RAS is practical only when the data dimension is low and the number of subspaces is small.

As a data clustering problem, subspace segmentation can be done by first learning an affinity matrix from the given data, and then obtaining the final segmentation results by spectral clustering algorithms such as Normalized Cuts (NCut) [26]. Many existing methods, such as Sparse Subspace Clustering (SSC) [13], Spectral Curvature Clustering (SCC) [27, 28], Spectral Local Best-fit Flats (SLBF) [29, 30], the proposed LRR method and [2, 31], share this spectral nature and are therefore called spectral-type methods. The main difference among various spectral-type methods is the approach for learning the affinity matrix. Under the assumption that the data is clean and the subspaces are independent, [13] shows that the solution produced by sparse representation (SR) [32] achieves the so-called subspace detection property (SDP): the within-class affinities are sparse and the between-class affinities are all zeros. In the presence of outliers, it is shown in [15] that the SR method can still obey the SDP. However, the SDP may not be sufficient to ensure the success of subspace segmentation [33]. Recently, Lerman and Zhang [34] proved that under certain conditions the multiple subspace structures can be exactly recovered via $\ell_p$ minimization. Unfortunately, since the formulation is not convex, it is still unknown how to efficiently obtain the globally optimal solution. In contrast, the formulation of LRR is convex and the corresponding optimization problem can be solved in polynomial time. What is more, even if the data is contaminated by outliers, the proposed LRR method is proven to exactly recover the right row space, which provably determines the subspace segmentation results (we shall discuss this in Section III-B). In the presence of arbitrary errors (e.g., corruptions, outliers and noise), LRR is also guaranteed to produce near recovery.

III Preliminaries and Problem Statement

III-A Summary of Main Notations

In this work, matrices are represented with capital symbols. In particular, $I$ is used to denote the identity matrix, and the entries of matrices are denoted by using $[\cdot]$ with subscripts. For instance, $M$ is a matrix, $[M]_{ij}$ is its $(i,j)$-th entry, $[M]_{i,:}$ is its $i$-th row, and $[M]_{:,j}$ is its $j$-th column. For ease of presentation, the horizontal (resp. vertical) concatenation of a collection of matrices along row (resp. column) is denoted by $[M_1, M_2, \ldots, M_k]$ (resp. $[M_1; M_2; \ldots; M_k]$). The block-diagonal matrix formed by a collection of matrices $M_1, M_2, \ldots, M_k$ is denoted by

$$\mathrm{diag}(M_1, M_2, \ldots, M_k) = \begin{bmatrix} M_1 & 0 & \cdots & 0 \\ 0 & M_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & M_k \end{bmatrix}. \qquad (1)$$

The only vector norm used is the $\ell_2$ norm, denoted by $\|\cdot\|_2$. A variety of norms on matrices will be used. The matrix $\ell_0$, $\ell_{2,0}$, $\ell_1$ and $\ell_{2,1}$ norms are defined by $\|M\|_0 = \#\{(i,j) : [M]_{ij} \neq 0\}$, $\|M\|_{2,0} = \#\{i : \|[M]_{:,i}\|_2 \neq 0\}$, $\|M\|_1 = \sum_{i,j}|[M]_{ij}|$, and $\|M\|_{2,1} = \sum_i \|[M]_{:,i}\|_2$, respectively. The matrix $\ell_\infty$ norm is defined as $\|M\|_\infty = \max_{i,j}|[M]_{ij}|$. The spectral norm of a matrix $M$ is denoted by $\|M\|$, i.e., $\|M\|$ is the largest singular value of $M$. The Frobenius norm and the nuclear norm (the sum of the singular values of a matrix) are denoted by $\|M\|_F$ and $\|M\|_*$, respectively. The Euclidean inner product between two matrices is $\langle M, N\rangle = \mathrm{tr}(M^TN)$, where $M^T$ is the transpose of a matrix and $\mathrm{tr}(\cdot)$ is the trace of a matrix.
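To make the notation concrete, the following NumPy snippet (ours, not part of the paper) evaluates the matrix norms defined above on an arbitrary matrix.

```python
import numpy as np

def matrix_norms(M):
    """Evaluate the matrix norms used throughout the paper (illustrative sketch)."""
    return {
        "l1": np.abs(M).sum(),                                # sum of absolute entries
        "l21": np.linalg.norm(M, axis=0).sum(),               # sum of column-wise ell_2 norms
        "linf": np.abs(M).max(),                              # largest absolute entry
        "fro": np.linalg.norm(M, "fro"),                      # Frobenius norm
        "nuclear": np.linalg.svd(M, compute_uv=False).sum(),  # sum of singular values
        "spectral": np.linalg.norm(M, 2),                     # largest singular value
    }

print(matrix_norms(np.random.randn(5, 8)))
```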

The supports of a matrix $M$ are the indices of its nonzero entries, i.e., $\{(i,j) : [M]_{ij} \neq 0\}$. Similarly, its column supports are the indices of its nonzero columns. The symbol $\Gamma$ (with superscripts, subscripts, etc.) is used to denote the column supports of a matrix, i.e., $\Gamma = \{i : \|[M]_{:,i}\|_2 \neq 0\}$; the corresponding complement set (i.e., the zero columns) is $\Gamma^c$. There are two projection operators associated with $\Gamma$ and $\Gamma^c$: $\mathcal{P}_{\Gamma}$ and $\mathcal{P}_{\Gamma^c}$. When applied to a matrix $M$, the matrix $\mathcal{P}_{\Gamma}(M)$ (resp. $\mathcal{P}_{\Gamma^c}(M)$) is obtained from $M$ by setting $[M]_{:,i}$ to zero for all $i \notin \Gamma$ (resp. $i \in \Gamma$).

We also adopt the conventions of using $\mathrm{span}(M)$ to denote the linear space spanned by the columns of a matrix $M$, using $y \in \mathrm{span}(M)$ to denote that a vector $y$ belongs to the space $\mathrm{span}(M)$, and using $Y \in \mathrm{span}(M)$ to denote that all column vectors of $Y$ belong to $\mathrm{span}(M)$.

Finally, in this paper we use several terminologies, including “block-diagonal matrix”, “union and sum of subspaces”, “independent (and disjoint) subspaces”, “full SVD and skinny SVD”, “pseudoinverse”, “column space and row space” and “affinity degree”. These terminologies are defined in Appendix.

III-B Relations Between Segmentation and Row Space

Fig. 3: An example of the matrix $V_0V_0^T$ computed from dependent subspaces. In this example, we create 11 pairwise disjoint subspaces, each of which is of dimension 20, and draw 20 samples from each subspace without errors. The ambient dimension is 200, which is smaller than the sum of the dimensions of the subspaces. So the subspaces are dependent and $V_0V_0^T$ is not strictly block-diagonal. Nevertheless, it is simple to see that high segmentation accuracy can be achieved by using the above affinity matrix to do spectral clustering.

Let $X_0 \in \mathbb{R}^{d\times n}$ with skinny SVD $U_0\Sigma_0V_0^T$ be a collection of data samples strictly drawn from a union of multiple subspaces (i.e., $X_0$ is clean); the subspace membership of the samples is determined by the row space of $X_0$. Indeed, as shown in [12], when the subspaces are independent, $V_0V_0^T$ forms a block-diagonal matrix: the $(i,j)$-th entry of $V_0V_0^T$ can be nonzero only if the $i$-th and $j$-th samples are from the same subspace. Hence, this matrix, termed the Shape Interaction Matrix (SIM) [12], has been widely used for subspace segmentation. Previous approaches simply compute the SVD of the data matrix $X = U\Sigma V^T$ and then use $|VV^T|$ (Footnote 2: For a matrix $M$, $|M|$ denotes the matrix whose $(i,j)$-th entry is the absolute value of $[M]_{ij}$.) for subspace segmentation. However, in the presence of outliers and corruptions, $V$ can be far away from $V_0$ and thus the segmentation using such approaches is inaccurate. In contrast, we show that LRR can recover $V_0V_0^T$ even when the data matrix $X$ is contaminated by outliers.

If the subspaces are not independent, $V_0V_0^T$ may not be strictly block-diagonal. This is indeed expected, since when the subspaces have nontrivial intersections, some samples may belong to multiple subspaces simultaneously. When the subspaces are pairwise disjoint (but not independent), our extensive numerical experiments show that $V_0V_0^T$ may still be close to block-diagonal, as exemplified in Fig. 3. Hence, recovering $V_0V_0^T$ is still of interest for subspace segmentation.

III-C Problem Statement

Problem I.1 only roughly describes what we want to study. More precisely, this paper addresses the following problem.

Problem III.1 (Subspace Clustering)

Let $X_0 \in \mathbb{R}^{d\times n}$ with skinny SVD $U_0\Sigma_0V_0^T$ store a set of $n$ $d$-dimensional samples (vectors) strictly drawn from a union of $k$ subspaces $\{\mathcal{S}_i\}_{i=1}^{k}$ of unknown dimensions ($k$ is unknown as well). Given a set of observation vectors $X$ generated by

$$X = X_0 + E_0,$$

the goal is to recover the row space of $X_0$, or equivalently to recover the true SIM $V_0V_0^T$.

The recovery of the row space can guarantee high segmentation accuracy, as analyzed in Section III-B. Also, the recovery of the row space naturally implies success in error correction. So it is sufficient to set the goal of subspace clustering as the recovery of the row space identified by $V_0V_0^T$. For ease of exposition, we consider the problem under three assumptions of increasing practicality and difficulty.

Assumption 1

The data is clean, i.e., $E_0 = 0$.

Assumption 2

A fraction of the data samples are grossly corrupted and the others are clean, i.e., $E_0$ has sparse column supports as shown in Fig.2(c).

Assumption 3

A fraction of the data samples are grossly corrupted and the others are contaminated by small Gaussian noise, i.e., $E_0$ is characterized by a combination of the models shown in Fig.2(a) and Fig.2(c).

Unlike [14], the independence assumption on the subspaces is not highlighted in this paper, because the analysis in this work focuses on recovering $V_0V_0^T$ rather than pursuing a block-diagonal matrix.

IV Low-Rank Representation for Matrix Recovery

In this section we abstractly present the LRR method for recovering a matrix from corrupted observations. The basic theorems and optimization algorithms will be presented. The specific methods and theories for handling the subspace clustering problem are deferred until Section V.

IV-A Low-Rank Representation

In order to recover the low-rank matrix $X_0$ from the given observation matrix $X$ corrupted by errors $E_0$ ($X = X_0 + E_0$), it is straightforward to consider the following regularized rank minimization problem:

$$\min_{D,E} \ \mathrm{rank}(D) + \lambda\|E\|_{\ell} \quad \text{s.t.} \quad X = D + E, \qquad (2)$$

where $\lambda > 0$ is a parameter and $\|\cdot\|_{\ell}$ indicates a certain regularization strategy, such as the squared Frobenius norm (i.e., $\|\cdot\|_F^2$) used for modeling the noise shown in Fig.2(a) [6], the $\ell_0$ norm adopted by [7] for characterizing the random corruptions shown in Fig.2(b), and the $\ell_{2,0}$ norm adopted by [14, 16] for dealing with sample-specific corruptions and outliers. Suppose $D^*$ is a minimizer with respect to the variable $D$; then it gives a low-rank recovery of the original data $X_0$.

The above formulation is adopted by the recently established Robust PCA (RPCA) method [7], which has been used to achieve state-of-the-art performance in several applications (e.g., [35]). However, this formulation implicitly assumes that the underlying data structure is a single low-rank subspace. When the data is drawn from a union of multiple subspaces, denoted as $\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_k$, it actually treats the data as being sampled from a single subspace defined by $\mathcal{S} = \sum_{i=1}^{k}\mathcal{S}_i$. Since the sum $\sum_{i=1}^{k}\mathcal{S}_i$ can be much larger than the union $\cup_{i=1}^{k}\mathcal{S}_i$, the specifics of the individual subspaces are not well considered, and so the recovery may be inaccurate.

To better handle the mixed data, here we suggest a more general rank minimization problem defined as follows:

$$\min_{Z,E} \ \mathrm{rank}(Z) + \lambda\|E\|_{\ell} \quad \text{s.t.} \quad X = AZ + E, \qquad (3)$$

where $A$ is a “dictionary” that linearly spans the data space. We call the minimizer $Z^*$ (with regard to the variable $Z$) the “lowest-rank representation” of data $X$ with respect to the dictionary $A$. After obtaining an optimal solution $(Z^*, E^*)$, we could recover the original data by using $AZ^*$ (or $X - E^*$). Since $\mathrm{rank}(AZ^*) \leq \mathrm{rank}(Z^*)$, $AZ^*$ is also a low-rank recovery of the original data $X_0$. By setting $A = I$, the formulation (3) falls back to (2). So LRR can be regarded as a generalization of RPCA that essentially uses the standard basis as the dictionary. By choosing an appropriate dictionary $A$, as we will see, the lowest-rank representation can recover the underlying row space so as to reveal the true segmentation of the data. So LRR can handle well the data drawn from a union of multiple subspaces.

IV-B Analysis on the LRR Problem

The optimization problem (3) is difficult to solve due to the discrete nature of the rank function. For ease of exploration, we begin with the “ideal” case in which the data is clean. That is, we consider the following rank minimization problem:

$$\min_{Z} \ \mathrm{rank}(Z) \quad \text{s.t.} \quad X = AZ. \qquad (4)$$

It is easy to see that the solution to (4) may not be unique. As a common practice in rank minimization problems, we replace the rank function with the nuclear norm, resulting in the following convex optimization problem:

$$\min_{Z} \ \|Z\|_* \quad \text{s.t.} \quad X = AZ. \qquad (5)$$

We will show that the solution to (5) is also a solution to (4) and that this special solution is useful for subspace segmentation.

In the following, we shall show some general properties of the minimizer to problem (5). These general conclusions form the foundations of LRR (the proofs can be found in Appendix).

IV-B1 Uniqueness of the Minimizer

The nuclear norm is convex, but not strongly convex. So it is possible that problem (5) has multiple optimal solutions. Fortunately, it can be proven that the minimizer to problem (5) is always uniquely defined by a closed form. This is summarized in the following theorem.

Theorem IV.1

Assume $A \neq 0$ and $X = AZ$ have feasible solution(s), i.e., $X \in \mathrm{span}(A)$. Then

$$Z^* = A^{\dagger}X \qquad (6)$$

is the unique minimizer to problem (5), where $A^{\dagger}$ is the pseudoinverse of $A$.

From the above theorem, we have the following corollary which shows that problem (5) is a good surrogate of problem (4).

Corollary IV.1

Assume $A \neq 0$ and $X = AZ$ have feasible solutions. Let $Z^*$ be the minimizer to problem (5); then $\mathrm{rank}(Z^*) = \mathrm{rank}(X)$ and $Z^*$ is also a minimal-rank solution to problem (4).
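As a quick numerical sanity check of Theorem IV.1 and Corollary IV.1, the following NumPy sketch (ours; the matrix sizes are arbitrary) verifies that $A^{\dagger}X$ is feasible and has the same rank as $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30)) @ rng.standard_normal((30, 40))  # dictionary of rank at most 30
X = A @ rng.standard_normal((40, 20))                              # clean data, X in span(A)

Z_star = np.linalg.pinv(A) @ X   # closed-form minimizer of ||Z||_* s.t. X = AZ (Theorem IV.1)

print(np.allclose(A @ Z_star, X))                                  # feasibility: X = A Z*
print(np.linalg.matrix_rank(Z_star) == np.linalg.matrix_rank(X))   # rank(Z*) = rank(X) (Corollary IV.1)
```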

IV-B2 Block-Diagonal Property of the Minimizer

By choosing an appropriate dictionary, the lowest-rank representation can reveal the true segmentation results. Namely, when the columns of $A$ and $X$ are exactly sampled from independent subspaces, the minimizer to problem (5) can reveal the subspace membership among the samples. Let $\{\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_k\}$ be a collection of $k$ subspaces, each of which has a rank (dimension) of $r_i > 0$. Also, let $A = [A_1, A_2, \ldots, A_k]$ and $X = [X_1, X_2, \ldots, X_k]$. Then we have the following theorem.

Theorem IV.2

Without loss of generality, assume that $A_i$ is a collection of $m_i$ samples of the $i$-th subspace $\mathcal{S}_i$, $X_i$ is a collection of $n_i$ samples from $\mathcal{S}_i$, and the sampling of each $A_i$ is sufficient such that $\mathrm{rank}(A_i) = r_i$ (i.e., $A_i$ can be regarded as a basis that spans the subspace). If the subspaces are independent, then the minimizer $Z^*$ to problem (5) is block-diagonal:

$$Z^* = \mathrm{diag}(Z_1^*, Z_2^*, \ldots, Z_k^*),$$

where $Z_i^*$ is an $m_i \times n_i$ coefficient matrix with $\mathrm{rank}(Z_i^*) = \mathrm{rank}(X_i), \ \forall i$.

Note that the claim $\mathrm{rank}(Z_i^*) = \mathrm{rank}(X_i)$ guarantees the high within-class homogeneity of $Z_i^*$, since the low-rank property generally requires $Z_i^*$ to be dense. This is different from SR, which is prone to produce a “trivial” solution if $A = X$, because the sparsest representation is the identity matrix in this case. It is also worth noting that the above block-diagonal property does not require that the data samples have already been grouped together according to their subspace memberships. There is no loss of generality in assuming that the indices of the samples have been rearranged to satisfy the true subspace memberships, because the solution produced by LRR is globally optimal and does not depend on the arrangement of the data samples.

IV-C Recovering Low-Rank Matrices by Convex Optimization

Corollary IV.1 suggests that it is appropriate to use the nuclear norm as a surrogate for the rank function in problem (3). Also, the matrix $\ell_1$ and $\ell_{2,1}$ norms are good relaxations of the $\ell_0$ and $\ell_{2,0}$ norms, respectively. So we could obtain a low-rank recovery of $X_0$ by solving the following convex optimization problem:

$$\min_{Z,E} \ \|Z\|_* + \lambda\|E\|_{2,1} \quad \text{s.t.} \quad X = AZ + E. \qquad (7)$$

Here, the $\ell_{2,1}$ norm is adopted to characterize the error term $E$, since we want to model the sample-specific corruptions (and outliers) shown in Fig.2(c). For the small Gaussian noise shown in Fig.2(a), $\|E\|_F^2$ should be chosen; for the random corruptions shown in Fig.2(b), the $\ell_1$ norm is an appropriate choice. After obtaining the minimizer $(Z^*, E^*)$, we could use $AZ^*$ (or $X - E^*$) to obtain a low-rank recovery of the original data $X_0$.

  Input: data matrix $X$, parameter $\lambda$.
  Initialize: $Z = J = 0$, $E = 0$, $Y_1 = 0$, $Y_2 = 0$, $\mu = 10^{-6}$, $\mu_{\max} = 10^{6}$, $\rho = 1.1$, $\varepsilon = 10^{-8}$.
  while not converged do
     1. fix the others and update $J$ by $J = \arg\min \frac{1}{\mu}\|J\|_* + \frac{1}{2}\|J - (Z + Y_2/\mu)\|_F^2$.
     2. fix the others and update $Z$ by $Z = (I + A^TA)^{-1}\big(A^T(X - E) + J + (A^TY_1 - Y_2)/\mu\big)$.
     3. fix the others and update $E$ by $E = \arg\min \frac{\lambda}{\mu}\|E\|_{2,1} + \frac{1}{2}\|E - (X - AZ + Y_1/\mu)\|_F^2$.
     4. update the multipliers: $Y_1 = Y_1 + \mu(X - AZ - E)$, $Y_2 = Y_2 + \mu(Z - J)$.
     5. update the parameter $\mu$ by $\mu = \min(\rho\mu, \mu_{\max})$.
     6. check the convergence conditions: $\|X - AZ - E\|_{\infty} < \varepsilon$ and $\|Z - J\|_{\infty} < \varepsilon$.
  end while
Algorithm 1 Solving Problem (7) by Inexact ALM

The optimization problem (7) is convex and can be solved by various methods. For efficiency, we adopt in this paper the Augmented Lagrange Multiplier (ALM) [36, 37] method. We first convert (7) to the following equivalent problem:

$$\min_{Z,E,J} \ \|J\|_* + \lambda\|E\|_{2,1} \quad \text{s.t.} \quad X = AZ + E, \ Z = J.$$

This problem can be solved by the ALM method, which minimizes the following augmented Lagrange function:

$$\mathcal{L} = \|J\|_* + \lambda\|E\|_{2,1} + \mathrm{tr}\big(Y_1^T(X - AZ - E)\big) + \mathrm{tr}\big(Y_2^T(Z - J)\big) + \frac{\mu}{2}\big(\|X - AZ - E\|_F^2 + \|Z - J\|_F^2\big).$$

The above problem is unconstrained, so it can be minimized with respect to $J$, $Z$ and $E$, respectively, by fixing the other variables and then updating the Lagrange multipliers $Y_1$ and $Y_2$, where $\mu > 0$ is a penalty parameter. The inexact ALM method, also called the alternating direction method, is outlined in Algorithm 1. (Footnote 3: To solve the problem $\min_{Z,E}\|Z\|_* + \lambda\|E\|_1$ s.t. $X = AZ + E$, one only needs to replace Step 3 of Algorithm 1 by $E = \arg\min \frac{\lambda}{\mu}\|E\|_1 + \frac{1}{2}\|E - (X - AZ + Y_1/\mu)\|_F^2$, which is solved by using the shrinkage operator [36]. Also, please note that the setting of $\rho$ is based on the assumption that the values in $X$ have been normalized to the range $[0,1]$.) Note that although Step 1 and Step 3 of the algorithm are convex problems, they both have closed-form solutions. Step 1 is solved via the Singular Value Thresholding (SVT) operator [38], while Step 3 is solved via the following lemma:

Lemma IV.1 ([39])

Let $Q = [q_1, q_2, \ldots, q_n]$ be a given matrix. If the optimal solution to

$$\min_{W} \ \lambda\|W\|_{2,1} + \frac{1}{2}\|W - Q\|_F^2$$

is $W^*$, then the $i$-th column of $W^*$ is

$$[W^*]_{:,i} = \begin{cases} \dfrac{\|q_i\|_2 - \lambda}{\|q_i\|_2}\, q_i, & \text{if } \|q_i\|_2 > \lambda, \\ 0, & \text{otherwise.} \end{cases}$$
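For concreteness, the following NumPy sketch implements the two closed-form updates (the SVT operator for Step 1 and the column-wise shrinkage of Lemma IV.1 for Step 3) together with a minimal inexact ALM loop for problem (7). It is an illustrative implementation under our own default settings (stopping rule, $\mu$, $\rho$, tolerance), not the exact configuration used in the paper.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: argmin_J tau*||J||_* + 0.5*||J - M||_F^2 (Step 1)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ (np.maximum(s - tau, 0)[:, None] * Vt)

def col_shrink(Q, tau):
    """Column-wise shrinkage from Lemma IV.1: argmin_W tau*||W||_{2,1} + 0.5*||W - Q||_F^2 (Step 3)."""
    W = np.zeros_like(Q)
    norms = np.linalg.norm(Q, axis=0)
    keep = norms > tau
    W[:, keep] = Q[:, keep] * (norms[keep] - tau) / norms[keep]
    return W

def lrr_inexact_alm(X, A, lam, rho=1.1, mu=1e-6, mu_max=1e6, tol=1e-8, max_iter=1000):
    """Minimal sketch of Algorithm 1: min ||Z||_* + lam*||E||_{2,1} s.t. X = AZ + E."""
    d, n = X.shape
    m = A.shape[1]
    Z = np.zeros((m, n)); J = np.zeros((m, n)); E = np.zeros((d, n))
    Y1 = np.zeros((d, n)); Y2 = np.zeros((m, n))
    AtA = A.T @ A
    for _ in range(max_iter):
        J = svt(Z + Y2 / mu, 1.0 / mu)                                    # Step 1
        Z = np.linalg.solve(np.eye(m) + AtA,                              # Step 2
                            A.T @ (X - E) + J + (A.T @ Y1 - Y2) / mu)
        E = col_shrink(X - A @ Z + Y1 / mu, lam / mu)                     # Step 3
        R1 = X - A @ Z - E; R2 = Z - J
        Y1 += mu * R1; Y2 += mu * R2                                      # Step 4
        mu = min(rho * mu, mu_max)                                        # Step 5
        if max(np.abs(R1).max(), np.abs(R2).max()) < tol:                 # Step 6
            break
    return Z, E
```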

IV-C1 Convergence Properties

When the objective function is smooth, the convergence of the exact ALM algorithm has been generally proven in [37]. For inexact ALM, which is a variation of exact ALM, convergence has also been well studied when the number of blocks is at most two [36, 40]. Up to now, it is still difficult to guarantee the convergence of inexact ALM with three or more blocks in general [40]. Since there are three blocks (namely $Z$, $J$ and $E$) in Algorithm 1 and the objective function of (7) is not smooth, it is not easy to prove the convergence in theory.

Fortunately, there actually exist some guarantees for ensuring the convergence of Algorithm 1. According to the theoretical results in [41], two conditions are sufficient (but may not be necessary) for Algorithm 1 to converge: the first condition is that the dictionary matrix $A$ is of full column rank; the second is that the optimality gap produced in each iteration step is monotonically decreasing, namely that the error

$$\epsilon_k = \big\|(Z_k, J_k) - (Z_k^*, J_k^*)\big\|_F^2$$

is monotonically decreasing, where $Z_k$ (resp. $J_k$) denotes the solution produced at the $k$-th iteration, and $(Z_k^*, J_k^*)$ indicates the “ideal” solution obtained by minimizing the Lagrange function with respect to both $Z$ and $J$ simultaneously. The first condition is easy to obey, since problem (7) can be converted into an equivalent problem where the full-column-rank condition is always satisfied (we will show this in the next subsection). For the monotonically decreasing condition, although it is not easy to prove strictly, the convexity of the Lagrange function could guarantee its validity to some extent [41]. So it can be well expected that Algorithm 1 has good convergence properties. Moreover, inexact ALM is known to generally perform well in practice, as illustrated in [40].

That $\mu$ should be upper bounded (Step 5 of Algorithm 1) is required by the traditional theory of the alternating direction method in order to guarantee the convergence of the algorithm, so we also adopt this convention. Nevertheless, please note that the upper boundedness may not be necessary for some particular problems, e.g., the RPCA problem as analyzed in [36].

IV-C2 Computational Complexity

For ease of analysis, in the following we assume that the sizes of both $A$ and $X$ are $d \times n$. The major computation of Algorithm 1 is Step 1, which requires computing the SVD of an $n \times n$ matrix. So it will be time consuming if $n$, the number of data samples, is large. Fortunately, the computational cost of LRR can be easily reduced by the following theorem, which follows from Theorem IV.1.

Theorem IV.3

For any optimal solution $(Z^*, E^*)$ to the LRR problem (7), we have that $Z^* \in \mathrm{span}(A^T)$.

The above theorem states that the optimal solution $Z^*$ (with respect to the variable $Z$) to (7) always lies within the subspace spanned by the rows of $A$. This means that $Z^*$ can be factorized into $Z^* = P^*\tilde{Z}^*$, where $P^*$ can be computed in advance by orthogonalizing the columns of $A^T$. Hence, problem (7) can be equivalently transformed into a simpler problem by replacing $Z$ with $P^*\tilde{Z}$:

$$\min_{\tilde{Z},E} \ \|\tilde{Z}\|_* + \lambda\|E\|_{2,1} \quad \text{s.t.} \quad X = B\tilde{Z} + E,$$

where $B = AP^*$. After obtaining a solution $(\tilde{Z}^*, E^*)$ to the above problem, the optimal solution to (7) is recovered by $Z^* = P^*\tilde{Z}^*$. Since the number of rows of $\tilde{Z}$ is at most $r_A$ (the rank of $A$), the above problem can be solved by Algorithm 1 with a per-iteration complexity of $O(dnr_A + nr_A^2 + r_A^3)$. So LRR is quite scalable for large-size datasets (large $n$), provided that a low-rank dictionary $A$ has been obtained. When using $A = X$, the per-iteration complexity is at most $O(d^2n)$ (assuming $d \leq n$). This is also fast provided that the data dimension $d$ is not high.

Taking into account the cost of the orthogonalization and the number of iterations needed to converge, the overall complexity of Algorithm 1 is

$$O(d^2n) + O\big(n_s(dnr_A + nr_A^2 + r_A^3)\big),$$

where $n_s$ is the number of iterations. The iteration number depends on the choice of $\rho$: $n_s$ is smaller when $\rho$ is larger, and vice versa. Although a larger $\rho$ does produce higher efficiency, using a large $\rho$ risks losing optimality [36]. In our experiments, we always set $\rho = 1.1$; under this setting, Algorithm 1 usually converges within a few hundred iterations.
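A minimal sketch of this size-reduction step is given below (our own helper name; QR is used for the orthogonalization, which assumes $A$ has full column rank — an SVD-based orthogonalization would handle the rank-deficient case).

```python
import numpy as np

def reduce_lrr_dictionary(A):
    """Sketch of the reduction implied by Theorem IV.3: return P, whose orthonormal
    columns span (at least) the row space of A, and the reduced dictionary B = A P."""
    P, _ = np.linalg.qr(A.T)   # orthogonalize the columns of A^T
    return P, A @ P

# usage sketch, with lrr_inexact_alm as in the previous snippet:
# P, B = reduce_lrr_dictionary(A)
# Zt, E = lrr_inexact_alm(X, B, lam)   # smaller problem: X = B Zt + E
# Z_star = P @ Zt                      # recover the solution of the original problem (7)
```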

V Subspace Clustering by LRR

In this section, we utilize LRR to address Problem III.1, which is to recover the original row space from a set of corrupted observations. Both theoretical and experimental results will be presented.

V-A Exactness to Clean Data

When there are no errors in the data, i.e., $X = X_0$ and $E_0 = 0$, it is simple to show that the row space of $X_0$ (identified by $V_0V_0^T$) is exactly recovered by solving the following nuclear norm minimization problem:

$$\min_{Z} \ \|Z\|_* \quad \text{s.t.} \quad X = XZ, \qquad (8)$$

which amounts to choosing the data matrix $X$ itself as the dictionary in (5). By Theorem IV.1, we have the following theorem, which has also been proven by Wei and Lin [42].

Theorem V.1

Suppose the skinny SVD of $X$ is $U\Sigma V^T$; then the minimizer to problem (8) is uniquely defined by

$$Z^* = VV^T.$$

This naturally implies that $Z^*$ exactly recovers $V_0V_0^T$ when $X$ is clean (i.e., $E_0 = 0$).

The above theorem reveals the connection between LRR and the method in [12], which is a counterpart of PCA (referred to as “PCA” for simplicity). Nevertheless, it is well known that PCA is fragile to the presence of outliers. In contrast, it can be proven in theory that LRR exactly recovers the row space of $X_0$ from data contaminated by outliers, as will be shown in the next subsection.
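A small NumPy sketch of this clean-data case: the minimizer of (8) is simply the shape interaction matrix $VV^T$ computed from the skinny SVD of $X$ (the rank cutoff below is our own numerical choice).

```python
import numpy as np

def shape_interaction_matrix(X, rel_tol=1e-10):
    """Sketch: VV^T from the skinny SVD of X, i.e., the minimizer of (8) for clean data (Theorem V.1)."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int((s > rel_tol * s[0]).sum())   # numerical rank of X
    V = Vt[:r].T
    return V @ V.T
```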

V-B Robustness to Outliers and Sample-Specific Corruptions

Assumption 2 is to imagine that a fraction of the data samples are away from the underlying subspaces. This implies that the error term $E_0$ has sparse column supports. So the $\ell_{2,1}$ norm is appropriate for characterizing $E_0$. By choosing $A = X$ in (7), we have the following convex optimization problem:

$$\min_{Z,E} \ \|Z\|_* + \lambda\|E\|_{2,1} \quad \text{s.t.} \quad X = XZ + E. \qquad (9)$$

The above formulation “seems” questionable, because the data matrix $X$ (which itself can contain errors) is used as the dictionary for error correction. Nevertheless, as shown in the following two subsections, $A = X$ is indeed a good choice for several particular problems. (Footnote 4: Note that this does not deny the importance of learning the dictionary. Indeed, the choice of dictionary is a very important aspect of LRR. We leave this as future work.)

V-B1 Exactness to Outliers

Fig. 4: An example of the matrices $Z^*$ and $E^*$ computed from data contaminated by outliers. In a similar way as [14], we create 5 pairwise disjoint subspaces, each of which is of dimension 4, and draw 40 samples (with ambient dimension 200) from each subspace. Then, 50 outliers are randomly generated from a Gaussian distribution whose standard deviation is set to be three times as large as the averaged magnitude of the samples. By choosing $\lambda$ appropriately, LRR produces a solution $(Z^*, E^*)$ with the column space of $Z^*$ exactly recovering the row space of $X_0$, and the column supports of $E^*$ exactly identifying the indices of the outliers.

When an observed data sample is far away from the underlying subspaces, a typical regime is that this sample comes from a model other than subspaces, and it is thus called an outlier. (Footnote 5: Precisely, we define an outlier as a data vector that is independent of the samples drawn from the subspaces [43].) In this case, the data matrix $X$ contains two parts: one part consists of authentic samples (denoted by $X_0$) strictly drawn from the underlying subspaces, and the other part consists of outliers (denoted by $E_0$) that are not subspace members. To precisely describe this setting, we need to impose an additional constraint on $X_0$, that is,

$$\mathcal{P}_{\Gamma_0}(X_0) = 0, \qquad (10)$$

where $\Gamma_0$ is the set of indices of the outliers (i.e., the column supports of $E_0$). Furthermore, we use $n$ to denote the total number of data samples in $X$, $\gamma \triangleq |\Gamma_0|/n$ the fraction of outliers, and $r_0$ the rank of $X_0$. With these notations, we have the following theorem, which states that LRR can exactly recover the row space of $X_0$ and identify the indices of the outliers as well.

Theorem V.2 ([43])

There exists $\gamma^* > 0$ such that LRR with parameter $\lambda = 3/(7\|X\|\sqrt{\gamma^* n})$ strictly succeeds, as long as $\gamma \leq \gamma^*$. Here, the success is in the sense that any minimizer $(Z^*, E^*)$ to (9) produces

$$U^*(U^*)^T = V_0V_0^T \quad \text{and} \quad \Gamma^* = \Gamma_0, \qquad (11)$$

where $U^*$ is the column space of $Z^*$, and $\Gamma^*$ is the column supports of $E^*$.

There are several important remarks regarding the above theorem. First, although the objective function of (9) is not strongly convex and multiple minimizers may exist, it is proven that any minimizer is effective for subspace clustering. Second, the coefficient matrix $Z^*$ itself does not recover $V_0V_0^T$ (notice that $Z^*$ is usually asymmetric, except when $E^* = 0$); it is the column space $U^*$ of $Z^*$ that recovers the row space of $X_0$. Third, the performance of LRR is measured by the value of $\gamma^*$ (the larger, the better), which depends on data properties such as the incoherence and the extrinsic rank $r_0$ ($\gamma^*$ is larger when $r_0$ is lower). For more details, please refer to [43].

Fig.4 shows some experimental results, which verify the conclusions of Theorem V.2. Notice that the parameter setting in the theorem is based on the condition $\gamma \leq \gamma^*$ (i.e., the outlier fraction is smaller than a certain threshold), which is just a sufficient (but not necessary) condition for ensuring the success of LRR. So, in practice (even for synthetic examples) where this condition may not hold, it is possible that other values of $\lambda$ achieve better performance.

V-B2 Robustness to Sample-Specific Corruptions

Fig. 5: Two examples of the matrix $Z^*$ computed from data corrupted by sample-specific corruptions. (a) The magnitudes of the corruptions are set to be about 0.7 times as large as the samples. Considering $|Z^*|$ as an affinity matrix, the corrupted samples still have large affinity degrees, which means that they can be projected back onto their respective subspaces. (b) The magnitudes of the corruptions are set to be about 3.5 times as large as the samples. The affinity degrees of the corrupted samples are all zero, which means that the corrupted samples are treated as outliers. In these experiments, the data samples are generated in the same way as in Fig.4, and 10% of the samples are randomly chosen to be corrupted by additive Gaussian errors. For each experiment, the parameter $\lambda$ is carefully determined such that the column supports of $E^*$ identify the indices of the corrupted samples.

For the phenomenon that an observed sample is away from the subspaces, another regime is that the sample is an authentic subspace member but is grossly corrupted. Usually, such corruptions only happen on a small fraction of the data samples, and are thus called “sample-specific” corruptions. The modeling of sample-specific corruptions is the same as for outliers, because in both cases $E_0$ has sparse column supports. So the formulation (9) is still applicable. However, the setting (10) is no longer valid, and thus LRR may not exactly recover the row space in this case. Empirically, the conclusion $\Gamma^* = \Gamma_0$ still holds [14], which means that the column supports of $E^*$ can identify the indices of the corrupted samples.

While both outliers and sample-specific corruptions (Footnote 6: Unlike an outlier, a corrupted sample need not be independent of the clean samples.) are handled in the same way, a question is how to deal with the cases where the authentic samples are so heavily corrupted that they have similar properties to outliers. If a sample is corrupted so heavily as to become independent of the underlying subspaces, it will be treated as an outlier in LRR, as illustrated in Fig.5. This is a reasonable behavior. For example, it is appropriate to treat a face image as a non-face outlier if the image has been corrupted to look like something else.

V-C Robustness in the Presence of Noise, Outliers and Sample-Specific Corruptions

Fig. 6: An example of the matrices $Z^*$ and $E^*$ computed from data contaminated by noise, outliers and sample-specific corruptions. In this experiment, we first create 10 pairwise disjoint subspaces (each of which is of dimension 4) and draw 40 samples (with ambient dimension 2000) from each subspace. Second, we randomly choose 10% of the samples to be grossly corrupted by large errors; the remaining 90% of the samples are slightly corrupted by small errors. Finally, as in Fig.4, 100 outliers are randomly generated. The total amount of errors (including noise, sample-specific corruptions and outliers) is measured by $\|E_0\|_F/\|X_0\|_F$. By choosing $\lambda$ appropriately, $U^*(U^*)^T$ approximately recovers $V_0V_0^T$ with a small error, and the column supports of $E^*$ accurately identify the indices of the outliers and corrupted samples. In contrast, the recovery error produced by PCA is 0.66, and that of the RPCA method (using the best parameters) introduced in [16] is 0.23.

When there is noise in the data, the column supports of $E_0$ are not strictly sparse. Nevertheless, the formulation (9) is still applicable, because the $\ell_{2,1}$ norm (which is relaxed from the $\ell_{2,0}$ norm) can handle well signals that approximately have sparse column supports. Since all observations may be contaminated, it is unlikely in theory that the row space can be exactly recovered, so we target near recovery in this case. By the triangle inequality of matrix norms, the following theorem can be proven without any extra assumptions.

Theorem V.3

Let the size of $X$ be $d \times n$, and the rank of $X_0$ be $r_0$. For any minimizer $(Z^*, E^*)$ to problem (9) with a properly chosen $\lambda$, the error of recovering the row space $V_0V_0^T$ from the column space of $Z^*$ is bounded in terms of $d$, $n$ and $r_0$.

Fig.6 demonstrates the performance of LRR, in the presence of noise, outliers and sample-specific corruptions. It can be seen that the results produced by LRR are quite promising.

One may have noticed that the bound given in the above theorem is somewhat loose. To obtain a more accurate bound in theory, one needs to relax the equality constraint of (9) into

$$\|X - XZ - E\|_F \leq \xi,$$

where $\xi$ is a parameter characterizing the amount of the dense noise (Fig.2(a)) possibly existing in the data. The above problem can be solved by ALM, in a similar procedure as Algorithm 1. However, this formulation requires invoking another parameter $\xi$, and thus we do not explore it further in this paper.

V-D Algorithms for Subspace Segmentation, Model Estimation and Outlier Detection

V-D1 Segmentation with Given Subspace Number

  Input: data matrix $X$, number $k$ of subspaces.
  1. obtain the minimizer $(Z^*, E^*)$ to problem (9).
  2. compute the skinny SVD $Z^* = U^*\Sigma^*(V^*)^T$.
  3. construct an affinity matrix $W$ by (12).
  4. use $W$ to perform NCut and segment the data samples into $k$ clusters.
Algorithm 2 Subspace Segmentation

After obtaining $(Z^*, E^*)$ by solving problem (9), the matrix $U^*$ that identifies the column space of $Z^*$ is useful for subspace segmentation. Let the skinny SVD of $Z^*$ be $U^*\Sigma^*(V^*)^T$. We define an affinity matrix $W$ as follows:

$$[W]_{ij} = \big([\tilde{U}\tilde{U}^T]_{ij}\big)^2, \qquad (12)$$

where $\tilde{U}$ is formed by $U^*(\Sigma^*)^{1/2}$ with normalized rows. Here, to obtain better performance on corrupted data, we assign each column of $U^*$ a weight by multiplying by $(\Sigma^*)^{1/2}$. Notice that when the data is clean, $\Sigma^*$ is an identity matrix and thus this weighting has no effect. Squaring the entries of $\tilde{U}\tilde{U}^T$ ensures that the values of the affinity matrix are positive (note that the matrix $\tilde{U}\tilde{U}^T$ can have negative values). Finally, we could use spectral clustering algorithms such as Normalized Cuts (NCut) [26] to segment the data samples into a given number of clusters. Algorithm 2 summarizes the whole procedure of performing segmentation by LRR.
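The following sketch mirrors Algorithm 2 in NumPy/scikit-learn; scikit-learn's SpectralClustering is used here as a stand-in for NCut, and the rank cutoff and normalization guard are our own choices.

```python
import numpy as np
from sklearn.cluster import SpectralClustering  # stand-in for NCut

def lrr_segmentation(Z_star, k):
    """Sketch of Algorithm 2: affinity (12) from the skinny SVD of Z*, then spectral clustering."""
    U, s, _ = np.linalg.svd(Z_star, full_matrices=False)
    r = int((s > 1e-8 * s[0]).sum())                        # numerical rank of Z*
    M = U[:, :r] * np.sqrt(s[:r])                           # columns of U* weighted by (Sigma*)^{1/2}
    M /= np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-12)  # normalize rows
    W = (M @ M.T) ** 2                                      # squaring keeps affinities nonnegative
    return SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(W)
```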

V-D2 Estimating the Subspace Number

  Input: data matrix $X$.
  1. compute the affinity matrix $W$ in the same way as in Algorithm 2.
  2. compute the normalized Laplacian matrix $L = I - D^{-1/2}WD^{-1/2}$, where $D = \mathrm{diag}(d_1, \ldots, d_n)$ with $d_i = \sum_j [W]_{ij}$.
  3. estimate the subspace number $\hat{k}$ by (13).
Algorithm 3 Estimating the Subspace Number

Although it is generally challenging to estimate the number of subspaces (i.e., the number of clusters), it is possible to resolve this model estimation problem due to the block-diagonal structure of the affinity matrix produced by specific algorithms [13, 44, 45]. When a strictly block-diagonal affinity matrix $W$ is obtained, the subspace number $k$ can be found by first computing the normalized Laplacian matrix $L$ of $W$, and then counting the number of zero singular values of $L$. When the obtained affinity matrix is only approximately block-diagonal (which is the case in reality), one could predict the subspace number as the number of singular values smaller than a threshold. Here, we suggest a soft thresholding approach that outputs the estimated subspace number $\hat{k}$ as

$$\hat{k} = n - \mathrm{round}\left(\sum_{i=1}^{n} f_{\tau}(\sigma_i)\right). \qquad (13)$$

Here, $n$ is the total number of data samples, $\{\sigma_i\}_{i=1}^{n}$ are the singular values of the Laplacian matrix $L$, $\mathrm{round}(\cdot)$ is the function that outputs the nearest integer of a real number, and $f_{\tau}(\cdot)$ is a soft thresholding operator defined as

$$f_{\tau}(\sigma) = \begin{cases} 1, & \text{if } \sigma \geq \tau, \\ \log_2\!\left(1 + \dfrac{\sigma^2}{\tau^2}\right), & \text{otherwise,} \end{cases}$$

where $0 < \tau < 1$ is a parameter. Algorithm 3 summarizes the whole procedure of estimating the subspace number based on LRR.
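A sketch of Algorithm 3 follows, assuming the reconstruction of (13) and the soft thresholding operator given above; the default τ = 0.08 mirrors the Hopkins155 setting reported later.

```python
import numpy as np

def estimate_subspace_number(W, tau=0.08):
    """Sketch of Algorithm 3: softly count the near-zero singular values of the normalized Laplacian."""
    n = W.shape[0]
    d = np.maximum(W.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt            # normalized Laplacian
    sigma = np.linalg.svd(L, compute_uv=False)
    f = np.where(sigma >= tau, 1.0, np.log2(1.0 + sigma**2 / tau**2))
    return int(round(n - f.sum()))
```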

V-D3 Outlier Detection

As shown in Theorem V.2, the minimizer $E^*$ (with respect to the variable $E$) can be used to detect the outliers that possibly exist in the data. This can be simply done by finding the nonzero columns of $E^*$ when all or a fraction of the data samples are clean (i.e., Assumption 1 and Assumption 2). For the cases where the learnt $E^*$ only approximately has sparse column supports, one could use a thresholding strategy; that is, the $i$-th data vector of $X$ is judged to be an outlier if and only if

$$\|[E^*]_{:,i}\|_2 > \delta, \qquad (14)$$

where $\delta > 0$ is a parameter.

Since the affinity degrees of the outliers are zero or close to zero (see Fig.4 and Fig.6), the possible outliers can also be removed by discarding the data samples whose affinity degrees are smaller than a certain threshold. Such a strategy is commonly used in spectral-type methods [13, 34]. Generally, the underlying principle of this strategy is essentially the same as (14). Compared to the strategy of characterizing the outliers by affinity degrees, there is an advantage in using $E^*$ to indicate outliers: the formulation (9) can be easily extended to include more priors, e.g., the multiple visual features as done in [18, 19]. A minimal sketch of the thresholding rule (14) is shown below.
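In the sketch, normalizing the column norms to $[0,1]$ is our own convenience so that the threshold δ is scale-free; both the default value and the function name are hypothetical.

```python
import numpy as np

def detect_outliers(E_star, delta=0.5):
    """Sketch of (14): flag the i-th sample as an outlier when the ell_2 norm of
    the i-th column of E* is large (normalization and the default delta are ours)."""
    col_norms = np.linalg.norm(E_star, axis=0)
    scores = col_norms / max(col_norms.max(), 1e-12)
    return np.flatnonzero(scores > delta)
```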

VI Experiments

LRR has been used to achieve state-of-the-art performance in several applications such as motion segmentation [4], image segmentation [18], face recognition [4] and saliency detection [19]. In the experiments of this paper, we shall focus on analyzing the essential aspects of LRR, under the context of subspace segmentation and outlier detection.

VI-A Experimental Data

           data dimension   # of samples   # of subspaces   error level
  max           201              556              3            0.0130
  min            31               39              2            0.0002
  mean          59.7            295.7            2.3           0.0009
  std.          20.2            140.8            0.5           0.0012
TABLE I: Some information about Hopkins155.

VI-A1 Hopkins155

To verify the segmentation performance of LRR, we adopt for experiments the Hopkins155 [46] motion database, which provides an extensive benchmark for testing various subspace segmentation algorithms. In Hopkins155, there are 156 video sequences along with the features extracted and tracked in all the frames. Each sequence is a sole dataset (i.e., data matrix), so there are in total 156 datasets of different properties, including the number of subspaces, the data dimension and the number of data samples. Although the outliers in the data have been manually removed and the overall error level is low, some sequences (about 10) are grossly corrupted and have notable error levels. Table I summarizes some information about Hopkins155. For a sequence represented as a data matrix $X$, its error level is estimated by its rank-$r$ approximation: $\|X - U_r\Sigma_rV_r^T\|_F / \|X\|_F$, where $\Sigma_r$ contains the largest $r$ singular values of $X$, and $U_r$ (resp. $V_r$) is formed by taking the corresponding top left (resp. right) singular vectors. Here, we set $r = 4k$ ($k$ is the subspace number of the sequence), due to the fact that the rank of each subspace in motion data is at most 4.
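Following the rank-$4k$ approximation described above, the error level of a sequence can be computed as in this sketch (variable and function names are ours).

```python
import numpy as np

def sequence_error_level(X, k):
    """Sketch: error level of a Hopkins155 sequence via its rank-(4k) approximation."""
    r = 4 * k
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_r = (U[:, :r] * s[:r]) @ Vt[:r]
    return np.linalg.norm(X - X_r, "fro") / np.linalg.norm(X, "fro")
```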

Fig. 7: Examples of the images in the Yale-Caltech dataset.

VI-A2 Yale-Caltech

To test LRR’s effectiveness in the presence of outliers and corruptions, we create a dataset by combining Extended Yale Database B [47] and Caltech101 [48]. For Extended Yale Database B, we remove the images pictured under extreme light conditions. Namely, we only use the images with view directions smaller than 45 degrees and light source directions smaller than 60 degrees, resulting in 1204 authentic samples approximately drawn from a union of 38 low-rank subspaces (each face class corresponds to a subspace). For Caltech101, we only select the classes containing no more than 40 images, resulting in 609 non-face outliers. Fig.7 shows some examples of this dataset.

VI-B Baselines and Evaluation Metrics

Due to the close connections between PCA and LRR, we choose PCA and RPCA methods as the baselines. Moreover, some previous subspace segmentation methods are also considered.

VI-B1 PCA (i.e., SIM)

The PCA method is widely used for dimension reduction. Actually, it can also be applied to subspace segmentation and outlier detection as follows: first, we use SVD to obtain the rank-$r$ ($r$ is a parameter) approximation of the data matrix $X$, denoted as $U_r\Sigma_rV_r^T$; second, we utilize $V_rV_r^T$, which is an estimation of the true SIM $V_0V_0^T$, for subspace segmentation in a similar way as Algorithm 2 (the only difference is the estimation of the SIM); finally, we compute $E = X - U_r\Sigma_rV_r^T$ and use $E$ to detect outliers according to (14).
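A sketch of this PCA baseline (the function name and rank-selection interface are ours):

```python
import numpy as np

def pca_baseline(X, r):
    """Sketch of the PCA (SIM) baseline: rank-r SVD gives an estimated SIM V_r V_r^T
    and a residual E used for outlier detection via (14)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    sim = Vt[:r].T @ Vt[:r]               # estimated shape interaction matrix
    E = X - (U[:, :r] * s[:r]) @ Vt[:r]   # residual
    return sim, E
```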

VI-B2 RPCA

As an improvement over PCA, robust PCA (RPCA) methods can also perform subspace segmentation and outlier detection. In this work, we consider two RPCA methods introduced in [7] and [16], which are based on minimizing

$$\min_{D,E} \ \|D\|_* + \lambda\|E\|_{\ell} \quad \text{s.t.} \quad X = D + E.$$

In [7], the $\ell_1$ norm is used to characterize random corruptions, so the method is referred to as “RPCA$_1$”. In [16], the $\ell_{2,1}$ norm is adopted for detecting outliers, so the method is referred to as “RPCA$_{2,1}$”. The detailed procedures for subspace segmentation and outlier detection are almost the same as in the PCA case above. The only difference is that $V_r$ is formed from the skinny SVD of $D^*$ (not $X$), which is obtained by solving the above optimization problem. Note that the value of $r$ is determined by the parameter $\lambda$, and thus one only needs to select $\lambda$.

VI-B3 SR

LRR has a similar appearance to SR, which has been applied to subspace segmentation [13]. For fair comparison, in this work we implement an $\ell_{2,1}$-norm based SR method that computes an affinity matrix by minimizing

$$\min_{Z,E} \ \|Z\|_1 + \lambda\|E\|_{2,1} \quad \text{s.t.} \quad X = XZ + E, \ [Z]_{ii} = 0.$$

Here, SR needs to enforce $[Z]_{ii} = 0$ to avoid the trivial solution $Z = I$. After obtaining a minimizer $(Z^*, E^*)$, we use $|Z^*| + |(Z^*)^T|$ as the affinity matrix to do subspace segmentation. The procedure of using $E^*$ to perform outlier detection is the same as for LRR.

VI-B4 Some Other Methods

We also consider for comparison some previous subspace segmentation methods, including Random Sample Consensus (RANSAC) [10], Generalized PCA (GPCA) [24], Local Subspace Analysis (LSA) [2], Agglomerative Lossy Compression (ALC) [3], Sparse Subspace Clustering (SSC) [13], Spectral Clustering (SC) [31], Spectral Curvature Clustering (SCC) [27], Multi Stage Learning (MSL) [49], Locally Linear Manifold Clustering (LLMC) [50], Local Best-fit Flats (LBF) [29] and Spectral LBF (SLBF) [29].

VI-B5 Evaluation Metrics

Segmentation accuracy (error) is used to measure the performance of segmentation. The area under the receiver operating characteristic (ROC) curve, known as AUC, is used for evaluating the quality of outlier detection. For more details about these two evaluation metrics, please refer to the Appendix.

VI-C Results on Hopkins155

Fig. 8: The influence of the parameter $\lambda$ on LRR. (a) On all 156 sequences of Hopkins155, the overall segmentation performance is equally good over a wide range of $\lambda$. (b) On a clean, easy sequence, the segmentation error is always 0 over a wide range of $\lambda$. (c) On a sequence with notable corruptions, the segmentation performance is good only within a narrow range of $\lambda$.

VI-C1 Choosing the Parameter λ

The parameter $\lambda$ is used to balance the effects of the two parts in problem (9). In general, the choice of this parameter depends on prior knowledge of the error level of the data. When the errors are slight, we should use a relatively large $\lambda$; when the errors are heavy, we should set $\lambda$ to be relatively small.

Fig.8(a) shows the evaluation results over all 156 sequences in Hopkins155: the segmentation error varies only slightly over a wide range of $\lambda$, and within a fairly large interval it remains almost unchanged, only varying from 1.69% to 1.87%. This phenomenon is mainly due to two reasons. First, on most sequences (about 80%), which are almost clean and easy to segment, LRR works well for almost arbitrary $\lambda$, as exemplified in Fig.8(b). Second, there is an “invariance” in LRR: Theorem IV.3 implies that the minimizer $Z^*$ to problem (9) always lies within the row space of $X$, which means that the solution of LRR can be partially stable while $\lambda$ is varying.

The analysis above does not deny the importance of model selection. As shown in Fig.8(c), the parameter $\lambda$ can largely affect the segmentation performance on some sequences. Actually, if we tune the best $\lambda$ for each sequence, the overall error rate is only 0.07%. Although this number is achieved in an “impractical” way, it verifies the significance of selecting the parameter $\lambda$, especially when the data is corrupted. For the experiments below, we choose $\lambda = 2.4$ for LRR.

VI-C2 Segmentation Performance

segmentation errors (%) over all 156 sequences
          PCA     RPCA$_1$   RPCA$_{2,1}$   SR      LRR
  mean    4.56    4.13       3.26           3.89    1.71
  std.    10.80   10.37      9.09           7.70    4.85
  max     49.78   45.83      47.15          32.57   33.33
average run time (seconds) per sequence
          0.2     0.8        0.8            4.2     1.9
TABLE II: Segmentation results (on Hopkins155) of PCA, RPCA$_1$, RPCA$_{2,1}$, SR and LRR.

In this subsection, we show LRR's performance in subspace segmentation with the subspace number given. For comparison, we also list the results of PCA, RPCA$_1$, RPCA$_{2,1}$ and SR (these methods are introduced in Section VI-B). Table II shows that LRR performs better than the PCA and RPCA methods. Here, the advantages of LRR are mainly due to its methodology: LRR directly targets recovering the row space $V_0V_0^T$, which provably determines the segmentation results. In contrast, the PCA and RPCA methods are designed to recover the column space, which is more suitable for dimension reduction. One may have noticed that RPCA$_{2,1}$ outperforms PCA and RPCA$_1$. If we instead use the $\ell_1$ norm to regularize $E$ in (9), the segmentation error is 2.03% (with $\lambda$ optimally determined). These observations illustrate that the errors in this database tend to be sample-specific.

Besides its superiority in segmentation accuracy, another advantage of LRR is that it works well under a wide range of parameter settings, as shown in Fig.8. In contrast, the RPCA methods are sensitive to the parameter $\lambda$. Taking RPCA$_{2,1}$ as an example, it achieves an error rate of 3.26% with its best $\lambda$; however, the error rate increases to 4.5% and 3.7% at nearby values of $\lambda$.

The efficiency (in terms of running time) of LRR is comparable to that of the PCA and RPCA methods. Theoretically, the computational complexity (with regard to $d$ and $n$) of LRR is the same as that of the RPCA methods. LRR costs more computational time because its optimization procedure needs more iterations to converge.

VI-C3 Performance of Estimating the Subspace Number

  # total    # predicted    prediction rate (%)    absolute error
    156          121              77.6                  0.25
influences of the parameter $\tau$
  parameter $\tau$    0.06   0.07   0.08   0.09   0.10   0.11
  prediction rate     66.7   71.2   77.6   75.0   72.4   71.2
  absolute error      0.37   0.30   0.25   0.26   0.29   0.30
TABLE III: Results (on Hopkins155) of estimating the subspace number.

Since there are 156 sequences in total, this database also provides a good benchmark for evaluating the effectiveness of Algorithm 3, which estimates the number of subspaces underlying a collection of data samples. Table III shows the results. By choosing $\tau = 0.08$, LRR correctly predicts the true subspace number of 121 sequences. The absolute error (i.e., $|\hat{k} - k|$) averaged over all sequences is 0.25. These results illustrate that it is hopeful to resolve the problem of estimating the subspace number, which is a challenging model estimation problem.

VI-C4 Comparison with State-of-the-art Methods

          GPCA    RANSAC   MSL    LSA    LLMC
  mean    10.34   9.76     5.06   4.94   4.80
          PCA     LBF      ALC    SCC    SLBF
  mean    4.47    3.72     3.37   2.70   1.35
                                  LRR
          SSC     SC       [51]   [4]    this paper
  mean    1.24    1.20     1.22   0.85   1.59
TABLE IV: Segmentation errors (%) on Hopkins155 (155 sequences).

Notice that previous methods only report results for 155 sequences. After discarding the degenerate sequence, the error rate of LRR is 1.59%, which is comparable to the state-of-the-art methods, as shown in Table IV. The performance of LRR can be further improved by refining the formulation (9), which uses the observed data matrix itself as the dictionary. When the data is corrupted by dense noise (which is usually true in reality), this is certainly not the best choice. In [51] and [42], a non-convex formulation is adopted to learn the original data and its row space simultaneously, using the recovered (unknown) data as the dictionary. This method can achieve an error rate of 1.22%. In [4], it is explained that the issue of choosing the dictionary can be relieved by considering the unobserved, hidden data. Furthermore, it is deduced that the effects of hidden data can be approximately modeled by the following convex formulation:

$$\min_{Z,L,E} \ \|Z\|_* + \|L\|_* + \lambda\|E\|_1 \quad \text{s.t.} \quad X = XZ + LX + E,$$

which intuitively integrates subspace segmentation and feature extraction into a unified framework. This method can achieve an error rate of 0.85%, which outperforms other subspace segmentation algorithms.

While several methods have achieved an error rate below 3% on Hopkins155, the subspace segmentation problem is still far from solved. A long-term difficulty is how to solve the model selection problems, e.g., estimating the parameter $\lambda$ of LRR. Also, it is not trivial to handle more complicated datasets that contain more noise, outliers and corruptions.

VI-D Results on Yale-Caltech

The goal of this test is to identify the 609 non-face outliers and segment the remaining 1204 face images into 38 clusters. The performance of segmentation and outlier detection is evaluated by segmentation accuracy (ACC) and AUC, respectively. When investigating segmentation performance, the affinity matrix is computed from all images, including both the face images and the non-face outliers. However, for the convenience of evaluation, the outliers and the corresponding affinities are removed (according to the ground truth) before using NCut to obtain the segmentation results.

               PCA      RPCA$_1$   RPCA$_{2,1}$   SR       LRR
  ACC (%)      77.15    82.97      83.72          73.17    86.13
  AUC          0.9653   0.9819     0.9863         0.9239   0.9927
  time (sec.)  0.6      60.8       59.2           383.5    152.6
TABLE V: Segmentation accuracy (ACC) and AUC comparison on the Yale-Caltech dataset.

We resize all images to a common small size and form a data matrix whose columns are the 1813 vectorized images. Table V shows the results of PCA, RPCA, SR and LRR. It can be seen that LRR outperforms the PCA and RPCA methods, in terms of both subspace segmentation and outlier detection. These experimental results are consistent with Theorem V.2, which shows that LRR has a stronger performance guarantee than the RPCA methods. Notice that SR is behind the others. (Footnote 7: The results for outlier detection in Table V are obtained by using the strategy of (14). When using the strategy of checking the affinity degree, the results produced by SR are even worse, only achieving an AUC of 0.81 with the best parameters.) This is because the presence or absence of outliers does not necessarily alter the sparsity of the reconstruction coefficients notably, and thus it is hard for SR to handle well the data contaminated by outliers.

Fig. 9: The influence of the parameter $\lambda$ on LRR. These results are collected from the Yale-Caltech dataset, with all images resized to the same size as in Table V.

Fig.9 shows the performance of LRR as the parameter $\lambda$ varies from 0.06 to 0.22. Notice that LRR is more sensitive to $\lambda$ on this dataset than on Hopkins155. This is because the error level of Hopkins155 is quite low (see Table I), whereas the Yale-Caltech dataset contains outliers and corrupted images (see Fig.7).

Fig. 10: Some examples of using LRR to correct the errors in the Yale-Caltech dataset. Left: the original data matrix $X$; Middle: the corrected data $XZ^*$; Right: the error $E^*$.

To visualize LRR's effectiveness in error correction, we create another data matrix by resizing all the images to a larger size. Fig.10 shows some results produced by LRR. It is worth noting that the “error” term $E^*$ can contain “useful” information, e.g., the eyes and salient objects. Here, the principle is to decompose the data matrix into a low-rank part and a column-sparse part, with the low-rank part ($XZ^*$) corresponding to the principal features of the whole dataset, and the sparse part ($E^*$) corresponding to the rare features which cannot be modeled by low-rank subspaces. This implies that it is possible to use LRR to extract discriminative features and salient regions, as done in face recognition [4] and saliency detection [19].

VII Conclusion and Future Work

In this paper we proposed low-rank representation (LRR) to identify the subspace structures of corrupted data. Namely, our goal is to segment the samples into their respective subspaces and correct the possible errors simultaneously. LRR is a generalization of the recently established RPCA methods [7, 16], extending the recovery of corrupted data from a single subspace to multiple subspaces. Also, LRR generalizes the approach of the Shape Interaction Matrix (SIM), giving a way to define an SIM between two different matrices (see Theorem IV.1) and providing a mechanism to recover the true SIM (or row space) from corrupted data. Both theoretical and experimental results show the effectiveness of LRR. However, several problems remain for future work:

  • Significant improvements may be achieved by learning a dictionary $A$, which partially determines the solution of LRR. In order to exactly recover the row space $V_0V_0^T$, Theorem IV.3 implies that the dictionary $A$ must be chosen such that the row space of $X_0$ is contained in the row space of $A$. When the data is only contaminated by outliers, this condition can be obeyed by simply choosing $A = X$. However, this choice cannot ensure the validity of the condition when the data contains other types of errors, e.g., dense noise.

  • The proofs of Theorem V.2 are specific to the case of $A = X$. As a future direction, it is interesting to see whether the presented technique can be extended to general dictionary matrices other than $X$.

  • A critical issue in LRR is how to estimate or select the parameter $\lambda$. For data contaminated by various errors such as noise, outliers and corruptions, the estimation of $\lambda$ is quite challenging.

  • Subspace segmentation should not be the only application of LRR. Actually, LRR has already been used successfully in applications other than segmentation, e.g., saliency detection [19]. In general, the presented LRR method can be extended to a variety of other applications.

Appendix

VII-A Terminologies

In this subsection, we introduce some terminologies used in the paper.

VII-A1 Block-Diagonal Matrix

In this paper, a matrix $M$ is called block-diagonal if it has the form in (1). A matrix which itself is not block-diagonal but can be transformed to be block-diagonal by simply permuting its rows and/or columns is also said to be block-diagonal. In summary, we say that a matrix $M$ is block-diagonal whenever there exist two permutation matrices $P_1$ and $P_2$ such that $P_1MP_2$ is block-diagonal.

VII-A2 Union and Sum of Subspaces

For a collection of $k$ subspaces $\{\mathcal{S}_1, \ldots, \mathcal{S}_k\}$, their union is defined by $\cup_{i=1}^{k}\mathcal{S}_i = \{y : y \in \mathcal{S}_i \text{ for some } i\}$, and their sum is defined by $\sum_{i=1}^{k}\mathcal{S}_i = \{y = \sum_{i=1}^{k} y_i : y_i \in \mathcal{S}_i\}$. If any $y \in \sum_{i=1}^{k}\mathcal{S}_i$ can be uniquely expressed as $y = \sum_{i=1}^{k} y_i$, $y_i \in \mathcal{S}_i$, then the sum is also called the direct sum, denoted as $\oplus_{i=1}^{k}\mathcal{S}_i$.

VII-A3 Independent Subspaces

A collection of $k$ subspaces $\{\mathcal{S}_1, \ldots, \mathcal{S}_k\}$ are independent if and only if $\dim\big(\sum_{i=1}^{k}\mathcal{S}_i\big) = \sum_{i=1}^{k}\dim(\mathcal{S}_i)$ (or equivalently, $\mathcal{S}_i \cap \sum_{j\neq i}\mathcal{S}_j = \{0\}$). When the subspaces are of low rank and the ambient dimension is high, the independence assumption is roughly equal to the pairwise disjointness assumption, i.e., $\mathcal{S}_i \cap \mathcal{S}_j = \{0\}, \ \forall i \neq j$.

VII-A4 Full SVD and Skinny SVD

For an $m\times n$ matrix $M$ (without loss of generality, assuming $m \leq n$), its Singular Value Decomposition (SVD) is defined by $M = U\Sigma V^T$, where $U \in \mathbb{R}^{m\times m}$ and $V \in \mathbb{R}^{n\times n}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{m\times n}$ is a rectangular diagonal matrix whose diagonal entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_m \geq 0$ are the singular values. The SVD defined in this way is also called the full SVD. If we only keep the positive singular values, the reduced form is called the skinny SVD. For a matrix $M$ of rank $r$, its skinny SVD is computed by $M = U_r\Sigma_rV_r^T$, where $\Sigma_r = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ with $\sigma_1, \ldots, \sigma_r$ being the positive singular values. More precisely, $U_r$ and $V_r$ are formed by taking the first $r$ columns of $U$ and $V$, respectively.

VII-A5 Pseudoinverse

For a matrix $M$ with skinny SVD $U_r\Sigma_rV_r^T$, its pseudoinverse is uniquely defined by

$$M^{\dagger} = V_r\Sigma_r^{-1}U_r^T.$$

VII-A6 Column Space and Row Space

For a matrix $M$, its column (resp. row) space is the linear space spanned by its column (resp. row) vectors. Let the skinny SVD of $M$ be $U_r\Sigma_rV_r^T$; then the columns of $U_r$ (resp. $V_r$) form an orthonormal basis of the column (resp. row) space, and the corresponding orthogonal projection is given by $U_rU_r^T$ (resp. $V_rV_r^T$). Since $U_rU_r^T$ (resp. $V_rV_r^T$) is uniquely determined by the column (resp. row) space, sometimes we also use $U_rU_r^T$ (resp. $V_rV_r^T$) to refer to the column (resp. row) space.

VII-A7 Affinity Degree

Let $W$ be a symmetric affinity matrix for a collection of $n$ data samples; the affinity degree of the $i$-th sample is defined by $\#\{j : [W]_{ij} \neq 0, \ j \neq i\}$, i.e., the number of samples connected to the $i$-th sample.

VII-B Proofs

VII-B1 Proof of Theorem IV.1

The proof of Theorem IV.1 is based on the following three lemmas.

Lemma VII.1

Let $U$, $V$ and $M$ be matrices of compatible dimensions. Suppose both $U$ and $V$ have orthonormal columns, i.e., $U^TU = I$ and $V^TV = I$; then we have $\|UMV^T\|_* = \|M\|_*$.

Proof:

Let the full SVD of $M$ be $U_M\Sigma_MV_M^T$; then $UMV^T = (UU_M)\Sigma_M(VV_M)^T$. As $(UU_M)^T(UU_M) = I$ and $(VV_M)^T(VV_M) = I$, $(UU_M)\Sigma_M(VV_M)^T$ is actually an SVD of $UMV^T$. By the definition of the nuclear norm, we have $\|UMV^T\|_* = \mathrm{tr}(\Sigma_M) = \|M\|_*$.

Lemma VII.2

For any four matrices $B$, $C$, $D$ and $F$ of compatible dimensions, we have

$$\left\|\begin{bmatrix} B & C \\ D & F \end{bmatrix}\right\|_* \geq \|B\|_*,$$

where the equality holds if and only if $C = 0$, $D = 0$ and $F = 0$.

Proof:

The proof is simply based on the following fact: for any two matrices $M$ and $N$ of compatible dimensions, we have $\|[M, N]\|_* \geq \|M\|_*$.