A Appendix

# Robust Low-Rank Subspace Segmentation with Semidefinite Guarantees

## Abstract

Recently there is a line of research work proposing to employ Spectral Clustering (SC) to segment (group)1 high-dimensional structural data such as those (approximately) lying on subspaces2 or low-dimensional manifolds. By learning the affinity matrix in the form of sparse reconstruction, techniques proposed in this vein often considerably boost the performance in subspace settings where traditional SC can fail. Despite the success, there are fundamental problems that have been left unsolved: the spectrum property of the learned affinity matrix cannot be gauged in advance, and there is often one ugly symmetrization step that post-processes the affinity for SC input. Hence we advocate to enforce the symmetric positive semidefinite constraint explicitly during learning (Low-Rank Representation with Positive SemiDefinite constraint, or LRR-PSD), and show that factually it can be solved in an exquisite scheme efficiently instead of general-purpose SDP solvers that usually scale up poorly. We provide rigorous mathematical derivations to show that, in its canonical form, LRR-PSD is equivalent to the recently proposed Low-Rank Representation (LRR) scheme [1], and hence offer theoretic and practical insights to both LRR-PSD and LRR, inviting future research. As per the computational cost, our proposal is at most comparable to that of LRR, if not less. We validate our theoretic analysis and optimization scheme by experiments on both synthetic and real data sets.

{IEEEkeywords}

spectral clustering, affinity matrix learning, rank minimization, robust estimation, eigenvalue thresholding

## 1 Introduction

This paper deals with grouping or segmentation of high-dimensional data under subspace settings. The problem is formally defined as follows

###### Problem 1 (Subspace Segmentation).

Given a set of sufficiently dense data vectors , representing a sample . Suppose the data are drawn from a union of subspaces of unknown dimensions respectively, segment all data vectors into their respective subspaces.

In this regard, the vast number of available clustering algorithms, ranging from the most basic k-means method to the most elegant and sophisticated spectral clustering (SC) method, can all be used towards a solution. Nevertheless, there are strong reasons to believe that exploiting the very regularity associated with the data can enhance the clustering performance.

We choose SC as the basic framework for subspace segmentation. SC has been extensively researched (see  [2] for a recent review) and employed for many applications (e.g. image segmentation [3] in computer vision). SC has the remarkable capacity to deal with highly complicated data structures which may easily fail simple clustering methods such as k-means. The excellent performance of SC can be partially explained via its connection to the kernel method which has been extensively studied in machine learning, specially the recent unification of weighted kernel k-means and SC [4]. The implicit data transformation into higher-dimensional spaces is likely to make the clustering task easy for basic clustering algorithms.

Analogous to the freedom to choose the kernel function in kernel methods, SC is flexible enough to admit any similarity measures in the form of affinity matrices as its input. Despite the existence of research work on SC with general affinity matrices that are not positive semidefinite (see e.g., [5]), in practice the Gaussian kernel and the linear kernel are evidently the most commonly employed. Use of these kernels naturally ensures about symmetry and positive semidefiniteness of the affinity matrix. When there are processing steps that cause asymmetry, e.g., construction of nearest-neighbors based affinity matrix, there is normally an additional symmetrization step involved before the subsequent eigen-analysis on the resultant Laplacian matrix in normal SC routines [2].

### 1.1 Prior Work on Subspace Segmentation

The most intuitive way to solve the subspace segmentation problem is perhaps by robust model fitting. In this aspect, classic robust estimation methods such as RANSAC [6] and Expectation Minimization [7] can be employed, based on some assumptions about the data distribution, and possibly also the parametric form (e.g., mixture of Gaussians).

Most customized algorithms on this problem, however, are contributed from researchers in computer vision, to solve the 3D multibody motion segmentation (MMS) problem (see e.g., [8] for the problem statement and review of existing algorithms). In this problem, geometric argument shows that trajectories of same rigid-body motion lie on a subspace of dimension at most . Hence MMS serves as a typical application of subspace segmentation. There are factorization based methods [9], algebraic methods exemplified by the Generalized Principal Component Analysis (GPCA) [10], and local subspace affinity (LSA) [11] to address MMS. They are all directly or indirectly linked to SC methods, and can be considered as different ways to construct the affinity matrix for subsequent SC (for the former is similar to the linear kernel3, and the latter two kernels defined with local subspace distances).

Of special interest to the current investigation is the recent line of work on constructing the affinity matrix by sparsity-induced optimization. Cheng et al [13] ( graph ) and Elhamifar et al [14] (sparse subspace clustering, SSC) have independently proposed to use sparse reconstruction coefficients as similarity measures. To obtain the sparse coefficients, they reconstruct one sample using all the rest samples, while regularizing the coefficient vector by norm to promote sparsity. Hence the problem to solve boils down to the Lasso (i.e., -regularized least square problem, [15]), which has been well studied on theoretic and computational sides (ref e.g., [16]). Most recently Liu et al [1] has proposed to compute the reconstruction collectively, and regularize the rank of the affinity matrix for capturing global data structures. This is made possible by employing the nuclear norm minimization as a surrogate, and they also provide a robust version to resist noise and outliers.

Nuclear norm minimization as a surrogate for rank minimization is a natural generalization of the trace heuristic used for positive semidefinite matrices in several fields, such as control theory [17]. The need for rank minimization has theoretically stemmed from the exploding research efforts in compressed sensing sparkled by the seminal paper [18]. In fact, generalizing from vector sparsity to spectrum sparsity for matrices is natural. The practical driving forces come from applications such as collaborative filtering, sensor localization, to just name a few [19]. From the computational side, there are several cutting-edge customized algorithms for solving the otherwise large-scale SDP problem that is likely to plague most popular off-the-shelf SDP solvers. These algorithms include prominently singular value thresholding [20], accelerated proximal gradient (APG) [21], and augmented Lagrange multiplier (ALM) methods [22] (see [22] for a brief review).

### 1.2 Our Investigation and Contributions

We advocate to learn an affinity matrix that makes a valid kernel directly, i.e., being symmetric positive semidefinite. This is one critical problem the previous sparse-reconstruction and global low-rank minimization approaches has bypassed. Without consideration in this aspect, the empirical behaviors of the learnt affinity matrices are poorly justified. We will focus on the global framework proposed in [1] as the global conditions are easier to gauge and additional constraints on the learnt affinity matrix can be put in directly.

We will constrain the affinity matrix to be symmetric positive semidefinite directly in LRR-PSD. Surprisingly, during analysis of connection with the canonical form of LRR proposed in [1], we find out the two formulations are exactly equivalent, and moreover we can accurately characterize the spectrum of the optimal solution. In addition, we successfully establish the uniqueness of the solution to both LRR and LRR-PSD, and hence correct one critical error reported in [1] stating that the optimal solutions are not unique.

More interestingly, we show that our advocated formulation (LRR-PSD) in its robust form also admits a simple solution like that of LRR as reported in [1], but at a lower computational cost. As a nontrivial byproduct, we also provide a rigorous but elementary proof to nuclear-norm regularized simple least square problem with a positive semidefinite constraint, which complements the elegant closed-form solution to the general form [20].

To sum up, we highlight our contributions in two aspects: 1) we provide a rigorous proof of the equivalence between LRR and LRR-PSD, and establish the uniqueness of the optimal solution. In addition, we offer a sensible characterization of the spectrum for the optimal solution; and 2) we show that Robust LRR-PSD can also be efficiently solved in a scheme similar to that of LRR but with notable difference at a possibly lower cost.

## 2 Robust Low-Rank Subspace Segmentation with Semidefinite Guarantees

We will first set forth the notation used throughout the paper, and introduce necessary analytic tools. The canonical optimization framework of learning a low-rank affinity matrix for subspace segmentation/clustering will be presented next, and the equivalence between LRR-PSD and LRR will be formally established. Taking on the analysis, we briefly discuss about the spectrum in the robust versions of LRR-PSD and LRR, and touches on other noise assumptions. We will then proceed to present the optimization algorithm to tackle LRR-PSD under noisy settings (i.e., Robust LRR-PSD).

### 2.1 Notation and Preliminaries

#### Summary of Notations

We will use bold capital and bold lowercase for matrices and vectors respectively, such as and , and use normal letters for scalars and entries of matrices and vectors, e.g., , (the entry of matrix ). We will consider real vector and matrix spaces exclusively in this paper and use or alike to denote the real spaces of proper dimensionality or geometry.

We are interested in five norms of a matrix . The first three are functions of singular values and they are: 1) the operator norm or induced 2-norm denoted by , which is essentially the largest singular value ; 2) the Frobenius norm, defined as ; and 3) the nuclear norm, or sum of the singular values , assuming . The additional two include: 4) the matrix norm which generalizes the vector norm to the concatenation of matrix columns; and 5) the group norm , which sums up the norms of columns. Besides, the Euclidean inner product between matrices is . This also induces an alternative calculation of the Frobenius norm, .

We will denote the spectrum (the set of eigenvalues) of a square matrix by , for (similarly the collection of singular values for a general matrix ). We denote the set of all real symmetric matrix by , and the corresponding positive semidefinite cone as and and , in which is said to be positive semidefinite and simply designated as . We reiterate the requirement on symmetry here since conventionally definiteness is not defined for asymmetric matrices.

In addition, we all use and to mean taking the diagonal vector of a matrix and reshape a vector into a diagonal matrix, respectively. Other notations such as , manifest themselves literally.

#### Nuclear Norm Minimization and Rank Minimization

We choose to devote this subsection to reviewing more concrete properties about the nuclear norm due to its significance to this paper in particular and to the whole range of work on low-rank minimization in general.

###### Definition 2 (Unitarily Invariant Norms).

A matrix norm is unitarily invariant if for all matrices and all unitary matrices and (i.e. ) of compatible dimension.

Interestingly common encountered unitarily invariant norms are all functions of the singular values, and lie within two general families: 1) Schatten- norms, arising from applying the -norm to the vector of singular values, ; and 2)Ky-Fan norms, representing partial ordered sums of largest singular values, , assuming and all for . Of our interest here is that the nuclear norm and Frobenius norm are Schatten- norm and Schatten- norms, respectively. This fact will be critical for several places of our later argument.

Next we will state one crucial fact about the duality between the nuclear norm and the operator norm. For any norm , its dual norm is defined via the variational characterization  [23]

 ∥X∥D=supY{⟨Y,X⟩|∥Y∥≤1}, (1)

where can always be taken as equality for the supremum to achieve, since the inner product is homogeneous w.r.t. . Then we have a formal statement about the duality

###### Lemma 3 ([24], Proposition 2.1).

The dual norm of the operator norm in is the nuclear norm .

In fact, the duality taken together with the characterization of dual norms implies , which has been used extensively in the analysis of nuclear norm problems.

Our last piece of review touches on the core of the problem, i.e., how rank minimization problems (NP-Hard) could be (conditionally) solved via nuclear norm minimization formulation which is convex. This myth lies with the concept of convex envelope, which means the tightest convex pointwise approximation to a function (tightest convex lower bound). Formally, for any (possibly nonconvex, e.g., the rank function currently under investigation) function , where denotes a given convex set, the convex envelope of is the largest convex function such that for all  [17]. The following lemma relates the rank function to the nuclear norm via convex envelope

###### Lemma 4 ([17], Theorem 1, pp.54 and Sec.5.1.5 for proof).

The convex envelope of on the set is the nuclear norm .

This lemma justifies the heuristic to use the nuclear norm as a surrogate for the rank. Much of recent work, e.g. low-rank matrix completion [19] and Robust Principal Component Analysis (RPCA) [25], proves theoretically under mild conditions, the optimization can be exactly equivalent. We will borrow heavily the idea of this surrogate, and build on the theoretical underpinnings to develop our formulation and analysis of LRR-PSD/LRR for subspace segmentation.

### 2.2 Subspace Segmentation with Clean Data – An Amazing Equivalence

To tackle the subspace segmentation problem, Liu et al [1] have proposed to learn the affinity matrix for SC via solving the following rank minimization problem

 (LRANK)min.rank(Z),s.t.X=XZ. (2)

As a convex surrogate, the rank objective is replaced by the nuclear norm, and hence the formulation

 (LRR)min.∥Z∥∗,s.t.X=XZ. (3)

Instead, we advocate to solve the problem incorporating the positive semidefinite constraint directly to produce a valid kernel directly as argued above

 (LRR-PSD)min.∥Z∥∗,s.t.X=XZ,Z⪰0. (4)

Liu et al [1] has established one important characterization about solution(s) to LRR for and , provided the data have been arranged by their respective groups, i.e., the true segmentation.

###### Theorem 5 ([1], Theorem 3.1).

Assume the data sampling is sufficient such that and the data have been ordered by group. If the subspaces are independent then there exists an optimal solution to problem LRR that is block-diagonal:

 Z∗n×n=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣Z∗10000Z∗20000⋱0000Z∗k⎤⎥ ⎥ ⎥ ⎥ ⎥⎦ (5)

with being an matrix with , .

This observation is critical to good segmentation since affinity matrices with block diagonal structure (for sorted data as stated) favor perfect segmentation as revealed by theoretic analysis of SC algorithms (e.g., refer to [2] for brief exposition). There are, however, discoveries that are equally important yet to make. We will next state somewhat surprising results that we have derived, complementing Theorem 5 and providing critical insights in characterizing the (identical and unique) solution to LRR-PSD and LRR.

###### Theorem 6.

Optimization problem LRR has a unique minimizer . Moreover there exists an orthogonal matrix such that

 Q⊤Z∗Q=[Ir000] (6)

where , and obviously .

Three important corollaries follow immediately from Theorem 6.

###### Corollary 7 (LRR-PSD/LRR Equivalence).

The LRR problem and LRR-PSD problem are exactly equivalent, i.e. with identical unique minimizers that are symmetric positive semidefinite.

###### Proof.

in Theorem 6 naturally obeys LRR-PSD. ∎

###### Corollary 8.

Assume the setting in Theorem 5. The optimal solution to problem LRR and LRR-PSD are block-diagonal as in Eq. (5).

###### Proof.

Follow directly from LRR-PSD/LRR equivalence and Theorem 5. ∎

###### Corollary 9 (LRR-PSD/LRR/LRANK Equivalence).

The optimal rank of in LRANK is the objective value obtained from LRR-PSD/LRR, i.e., .

###### Proof.

Proof of Theorem 6 (later) shows cannot be lower than due to the constraint . is the optimal objective value for nuclear norm of since . ∎

The development of the results in Theorem 6 will testify the beautiful interplay between classic matrix computation and properties of nuclear norms we reviewed above. We present and validate several critical technical results preceding formal presentation of our proof.

###### Lemma 10 ([26], Lemma 7.1.2 on Real Matrices).

If , , and satisfy

 AM=MB,rank(M)=p, (7)

then there exists an orthogonal such that

 Misplaced & (8)

for , and , and of compatible dimensions. Furthermore, .4

Since the proof is critical for subsequent arguments, we reproduce the sketch here for completeness.

###### Proof.

Let

 M=Q[R10],Q∈Rn×n,R1∈Rp×p (9)

be a QR factorization5 of . By substituting this into Eq. (7) and rearranging we arrive at

 [T11T12T21T22][R10]=[R10]B,whereQ⊤AQ=[T11T12T21T22], (10)

with and others of compatible dimension. Since is nonsingular, implying . Moreover, , suggesting and are similar and hence . Lemma 7.1.1 [26] dictates that , which leads to the conclusion. ∎

The next lemma deals with an important inequality of nuclear norms on vertically-partitioned or horizontally-partitioned matrices.

###### Lemma 11 ([27], Adaptation of Theorem 4.4, pp 33-34).

Let be partitioned in the form

 A=[A1A2](resp.A=[A1A2]) (11)

and the sorted singular values of be and those of be for . Then , where the equality holds if and only if .

###### Proof.

The original proof in [27] has shown that , . This is because and , are eigenvalues of (resp. ) and (resp. ), respectively, and (resp. ). Theorem 3.14 [27] dictates that , , where is the smallest eigenvalue of (resp. ). It follows , and hence we have .

We next show stronger results saying that the inequality is strict unless .

() Since , , requiring or amounts to imposing , , which suggests , or (resp. ), identically . But we also have from the above argument (resp. ), or . Taking them together we obtain , implying .

() Simple substitution verifies the equality and also completes the proof. ∎

Based on the above two important lemmas, we derived our main results as follows.

###### Proof.

(of Theorem 6) We first show the claim about the semidefiniteness of , and then proceed to prove the uniqueness.

(Semidefiniteness of ) By , we have . Taking independent columns from (i.e., independent rows from ) and organize them into a submatrix of , we obtain . By Lemma 10 and its proof, we have a QR factorization of and one similarity transform of as, respectively

 M Unknown environment '% (12) T =[UU⊥]⊤Z⊤[UU⊥] =[T11T120T22]=[IrT120T22],

where spans the complementary dual subspace of . Moreover, we have obtained because proof of Lemma 10 suggests . The dimension is obviously determined by rank of , i.e., .

We continue to show that towards minimal , . By the unitary invariance property of nuclear norm, . Noticing that

 Unknown environment '% (13)

and results in two constraints

 Z⊤UR=URandZ⊤U⊥0=U⊥0=0. (14)

Under these constraints, we can always make , or effectively , attaining the minimizer in that

 ∥∥∥[IrT120T22]∥∥∥∗≥∥∥∥[Ir0]∥∥∥∗=∥Ir∥∗=r, (15)

where the first inequality has followed from Lemma 11 and equality is obtained only when . Hence we have shown that , as an optimal solution of LRR.

(Uniqueness of ) Suppose a perturbed version is also a minimizer. So , suggesting or is in (which complements the row space). We have

 H⊤X⊤=0 Extra open brace or missing close brace (16) ⟹H⊤UR=0⟹H⊤U=0,

where the last equality holds because is nonsingular. If , we have

 [UU⊥]⊤Z′⊤[UU⊥] (17) =[T′11T′120T′22]=⎡⎣IrU⊤H⊤U⊥0(U⊥)⊤H⊤U⊥⎤⎦,

where we have substituted the analytic values of and its corresponding as discussed above and the fact . Since (otherwise together with we would obtain ), employing the inequality in Lemma 11 again we see that

 Misplaced & (18)

In other words, the objective is strictly increased unless or , which establishes the uniqueness. and also concludes the proof. ∎

###### Remark 12.

Nonuniqueness of factorization of will not affect the uniqueness of as follows. Suppose we choose another basis for column space of , obviously and must be related by a within-space rotation , i.e., . Hence w.r.t. the new basis we have , as .

### 2.3 Robust Subspace Segmentation with Data Containing Outliers and Noises

To account for noises and outliers, explicit distortion terms can be introduced into the objective and constraint. Hence we obtain the robust version of LRR-PSD and LRR respectively as follows

 min.∥Z∥∗+λ∥E∥ℓ,s.t.X=XZ+E,Z⪰0, (19)
 min.∥Z∥∗+λ∥E∥ℓ,s.t.XZ+E=X. (20)

We have used to mean generic norms. We caution that we cannot in general expect these two versions to be equivalent despite the provable equivalence of LRR-PSD and LRR. Remarkably, the problem has changed much due to the extra variable . Nevertheless, it is still possible to partially gauge the behaviors of the solutions as follows.

Suppose an optimal is somehow achieved (i.e., we assume it is fixed), we are then only concerned with

 min.∥Z∥∗s.t.XZ+E∗=X,(Z⪰0). (21)

Since columns of must be in the column space of , we assume . Then we obtain from the equality constraint . By employing similar process in the proof of Theorem 6, one can easily verify that

 [UU⊥]⊤(Z⊤+δE⊤)[UU⊥] = [TZ⊤11TZ⊤12CTZ⊤22]+[TδE⊤11TδE⊤12−CTδE⊤22] (22)

where the notation is consistent with the proof in Theorem 6. So towards minimizing , we can always have , or , for any . So the rest of spectrum of is determined by , and we have that can have at most nonvanishing eigenvalues, where . Note that has eigenvalues of , so spectrum of will be perturbation of that since the norm of is in general small. This is also confirmed by our numerical experiments in 4.2.2.

Moreover, we have intentionally left the norm for unspecified since it apparently depends on the noise model we assume. The use of assumes the noise is sample-specific. In practice, however, a more natural assumption is uniformly random, i.e., each dimension of every data sample has the same chance of getting corrupted. In this case, the simple will suffice. We demonstrate via experiments 4.3, and show that indeed is more robust in that case.

The above comments about spectrum properties and noise model selection apply to both settings.

### 2.4 Solving Robust LRR-PSD via Eigenvalue Thresholding

The equivalence of LRR-PSD and LRR does not readily translate to the respective robust versions, and hence we need to figure out ways of solving the robust LRR-PSD. Due to the strong connection between these two problems, however, we will still try to employ the Augmented Lagrange Multipler (ALM) method (see e.g., [22]) to tackle this as in  [1].

We first convert the problem into its equivalent form as

 minZ,E,J∥J∥∗+λ∥E∥ℓ,s.t.X=XZ+E,Z=J,Z⪰0, (23)

where we have used to mean generic norms. Forming the partial ALM problem, we have

 minZ,E,J⪰0,Y1,Y2 ∥J∥∗+λ∥E∥ℓ + ⟨Y1,X−XZ−E⟩+⟨Y2,Z−J⟩ + μ2∥X−XZ−E∥2F+μ2∥Z−J∥2F. (24)

We can then follow the inexact ALM routine [22] to update , , , , alternately. While fixing others, how to update depends on the norm . There are a bunch of norms that facilitate closed-form solutions, such as the discussed in [1] and (see e.g., [22]). How to update is the major obstacle to clean up. To be specific, we will be facing problem of this form to update

 M∗=argminM1μ∥M∥∗+12∥M−G∥2F,s.t.M⪰0, (25)

where may or may not be symmetric. We will next show in Theorem 14 that symmetric facilitates a closed-form solution, and generalize this in Theorem 16 which basically states that asymmetric also leads to a closed-form solution. Moreover, the major computational cost lies with eigen-decomposition of a symmetric square matrix, as compared with singular value decomposition of a square matrix of the same size in solving the counterpart in robust LRR.

###### Lemma 13 ([23], Lemma 3.2).

For any block partitioned matrix this inequality holds

 ∥X∥∗≥∥∥∥[A00D]∥∥∥∗=∥A∥∗+∥D∥∗. (26)

Similar inequality also holds for the square of Frobenius norm .

###### Theorem 14.

For any symmetric matrix , the unique closed form solution to the optimization problem

 M∗=argminM1μ∥M∥∗+12∥M−S∥2F,M⪰0, (27)

takes the form

 M∗=QDiag[max(λ−1/μ,0)]Q⊤, (28)

whereby , for , is the spectrum(eigen-) decomposition of and should be understood element-wise. 6.

###### Proof.

Observing that the objective is strictly convex over a convex set, we assert there exists a unique minimizer. The remaining task to single out the minimizer. Symmetric admits a spectrum factorization , where . We set , and hence the optimization in Eq. (27) can be cast into

 ˜M∗=argmin˜M1μ∥˜M∥∗+12∥˜M−Λ∥2F,˜M⪰0. (29)

By the unitary invariance property of the Frobenius norm and the nuclear norm, and the fact that with unitary (orthogonal) , we assert these two optimization problems are exactly equivalent (in the sense that and can be recovered from each other deterministically).

Next we argue that a minimizer must be a diagonal matrix. Let . In fact, for a non-diagonal matrix , we can always restrict it to diagonal elements to get such that by Lemma. 13 and the fact being diagonal. The strict inequality holds since restriction from a non-diagonal matrix to its diagonal elements results in strict decrease in square of the Frobenius norm. So assuming and , the problem reduces to a quadratic program w.r.t.

 {ξ∗i}ni=1=argmin{ξi}ni=11μn∑i=1ξi+12n∑i=1∥ξi−λi∥2,ξi≥0,∀i. (30)

The programming is obviously separable and simple manipulation suggests the unique closed form solution , which concludes the proof. ∎

###### Remark 15.

Note that uniqueness of the solution may not be directly translated from Eq. (29) to (27) since one may argue is not unique in general. There are three causes to the ambiguity: 1) general sign reversal ambiguity of eigenvectors, 2) freedom with eigenvectors corresponding to the zero eigenvalues, and 3) freedom with eigenvectors corresponding to eigenvalues with multiplicity greater than . Noticing that , the sign ambiguity and problems caused by zero-valued eigenvalues are readily removed in view of the form of the summand . For the last problem, assume one repeated eigenvalue has one set of its eigenvectors arranged column-wise in , which essentially spans a -dimensional subspace (and acts as the basis). So this part of contribution to can be written as . Realizing that generating a new set of eigenvectors via linear combination can be accounted for by a rotation to the original basis vectors, namely for in that subspace, we have . Hence the sum is not altered by any cause.

In fact, building on Theorem 14, we can proceed to devise a more general result on any real square matrix as follows.7

###### Theorem 16.

For any square matrix , the unique closed form solution to the optimization problem

 M∗=argminM1μ∥M∥∗+12∥M−P∥2F,M⪰0, (31)

takes the form

 M∗=QDiag[max(λ−1/μ,0)]Q⊤, (32)

whereby , for , is the spectrum(eigen-) decomposition of and should be understood element-wise.

###### Proof.

See Appendix A.2. ∎

The above two theorems (Theorem 14 and 16) have enabled a fast solution to updating . Moreover, since they ensure the symmetry of output irrespective of the symmetry of , we can be assured the alternation optimization process converges to a solution of that satisfies the constraint .

## 3 Complexity Analysis and Scalability

For solving the ALM problems corresponding to robust LRR-PSD and robust LRR, the main computational cost per iteration comes from either eigen-decomposition of a symmetric matrix or SVD of a square matrix of the same size. In numerical linear algebra [26], computing a stable SVD of matrix is to convert it to an symmetric eigen-decomposition problem on an augmented matrix

 ˜X=[0X⊤X0].

Hence from SVD to eigen-decomposition of comparable size, we can expect a constant factor of speedup that depends on the matrix dimension.

Figure 1 provides benchmark results on computational times of SVD and eigen-decomposition on matrices of sizes ranging from very small up to . Tested solvers include these provided in Matlab and these in PROPACK [28]. It is evident that for matrices of the same size, eigen-decomposition is significantly faster in both solver package. We will stick to the built-in functions in Matlab as we find in practice PROPACK is sometimes unstable when solving full problems (it is specialized in solving large and sparse matrices).

## 4 Experiments

In this section, we systematically verify both the theoretic analysis provided before, and related claims.

### 4.1 Experiment Setups

We use two data sets throughout our experiments.
Toy Data (TD). Following setting in [1], independent subspaces are constructed, whose bases are generated by , , where represents a random rotation and a random orthogonal matrix of dimension . So each subspace has a dimension of 4. data vectors are sampled from each subspace by , with being a iid zero mean unit variance Gaussian matrix . Collection of this clean should have rank .
Extended Yale B (EYB). Following setting in [1], frontal face images of classes from the whole Yale B dataset are selected. Each class contains about images, and images are resized to . Raw pixel values are stacked into data vectors of dimension as features. This dataset is an example of heavily corrupted data.

### 4.2 Equivalence of LRR-PSD and LRR

#### Spectrum Verification

Recall the key to establish the equivalence of LRR-PSD and LRR lies with showing that the eigenvalues and singular values of are identical, with of multiplicity equal to the data rank and the rest 0’s (Ref. Theorem 6 and the associated proof). In order to verify this, we use TD without introducing any noise, and hence the data matrix has rank . We simulate the clean settings, i.e., LRR-PSD and LRR by gradually increasing the regularization parameter of the robust versions (19) and (20). Intuitively for large enough , the optimization tends to put and hence approaches the clean settings. Figure 2 presents the results along the regularization path (0.1 1). It is evident during the passing to , the eigenvalue and singular value spectra match each other, and identically produce values of and the rest all . This confirms empirically the correctness of our theoretic analysis.

#### Spectrum Perturbation Under Robust Setting

As we conjectured in Sec. 2.3, in most cases spectrum of the obtained affinity matrix from robust LRR-PSD or robust LRR will be perturbation of the ideal spectrum. Repeated experiments on many settings confirm about this, although we cannot offer a formal explanation to this yet. Here we only produce a visualization (Figure 3) to show how things evolve under different noise level when we set . The noise is added in sample-specific sense, as done in [1], i.e., some samples are chosen to be corrupted while others are kept clean. We do observe some breakdown cases when is very small (not presented in the figure), which can be partially explained by that in that case the effect of nuclear norm regularization is weakened.

### 4.3 Selection of Noise Models

We have argued that the norm selection for the noise term should depend on the knowledge on noise patterns. We are going to compare the noise model with the noise model used in [1].

First we test on TD. Instead of adopting a sample-specific noise assumption, we assume that the corruptions are totally random w.r.t. data dimension and data sample which is more realistic. We add Gaussian noise with zero mean and variance , where is the whole data collection. Percentage of corruption is measured against the total number of entries in . The evolution of SC performance against the percentage of corruption is presented in Figure 4. We can see the obvious better resistance against noises exhibited by the form.

### 4.4 Performance Benchmark: LRR-PSD vs. LRR

We benchmark for the speed of robust LRR-PSD and robust LRR on EYB, and also present the clustering performance as compared to the conventional Gaussian kernel and linear kernel SC, which is obviously missing from [1].

Table 1 presents the accuracy obtained via various affinity matrices for SC, with different setting of PCA pre-processing for noise removal8. By comparison, obviously LRR-PSD and LRR win out and they are relatively robust against the PCA step, partially by virtue of their design to perform corruption removal together with affinity learning. To test the running time, we also include another set where each image in EYB is resized into (Set 1). We denote the original setting Set 2, and use the first classes of which each image resized into to produce Set 3. We report the running time (T), number of iterations (Iter), convergence tolerance (Tol) for each setting. Table 2 presents the results. Interestingly, LRR-DSP always converges with

more iterations but less running time than that of LRR. The benefit of using eigen-decomposition in place of SVD is apparent.

## 5 Summary and Outlook

In pursuit of providing more insights into recent line of research work on sparse-reconstruction based affinity matrix learning for subspace segmentation, we have discovered an important equivalence between the recently proposed LRR and our advocated version LRR-PSD in their canonical forms. This is a critical step towards understanding the behaviors of this family of algorithms. Moreover, we show that our advocated version, in its robust/denoising form, also facilitates a simple solution scheme that is as least as simple as the original optimization of LRR. Our experiments suggest in practice LRR-PSD is more likely to be flexible in solving large-scale problems.

Our current work is far from conclusive. In fact, there are several significant problems remained to be solved. First of all we observed in experiments the robust versions most of the times also produce affinity matrices with only positive eigenvalues, and themselves are very close to symmetric. We have not figured out ways to formally explain or even prove this. Furthermore, similar to the RPCA problem, it is urgent to provide theoretic analysis of the operational conditions of such formulation. From the computational side, SVD or eigen-decomposition on large matrices would finally become prohibitive. It would be useful to figure out ways to speed up nuclear norm optimization problems for practical purposes.

## Acknowledgements

This work is partially supported by project grant NRF2007IDM-IDM002-069 on “Life Spaces” from the IDM Project Office, Media Development Authority of Singapore. We thank Prof. Kim-Chuan Toh, Mathematics Department of the National University of Singapore, for his helpful comments and suggestions to revision of the manuscript.

## Appendix A Appendix

### a.1 Proof of Lemma 13

###### Proof.

Recall the fact that nuclear norm is dual to the spectral norm (Lemma 3), the dual description follows

 ∥X∥∗=sup{⟨[Z11Z12Z21Z22],[ABCD]⟩∣∣∣∥∥∥[Z11Z12Z21Z22]∥∥∥2=1}, (33)

and similarly we also have

 ∥∥∥[A00D]∥∥∥∗ (34) =sup{⟨[Z11Z12Z21Z22],[A00D]⟩∣∣∣∥∥∥[Z11Z12Z21Z22]∥∥∥2=1} =sup{⟨[Z1100Z22],[ABCD]⟩∣∣∣∥∥∥[Z1100Z22]∥∥∥2=1} =∥A∥∗+∥D∥∗.

Since (34) is a supremum over a subset of that in (33), the inequality about the nuclear norm holds. The claim about the square of Frobenuis norm holds trivially from nonnegativeness of any block norm squares contributed to the total norm square. ∎

### a.2 Proof of Theorem 16

###### Proof.

Similarly the program is strictly convex and we expect a unique minimizer. By the semi-definiteness constraint, we are only interested in . Hence , which suggests the objective function can be cast in its equivalent form . Further we observe that

 ∥M−P∥2F+∥M−P⊤∥2F=∥M−(P+P⊤)/2∥2F+C(P) (35)

where only depends on (are hence constants) . Hence we reach an equivalent formation of the original program Eq. (31) as

 M∗=argminM1μ∥M∥∗+12∥M−˜P∥2F,s.t.M⪰0, (36)

with . Solution to Eq. (36) readily follows from Theorem 14. ∎

### Footnotes

1. Throughout the paper, we use segmentation, clustering, and grouping, and their verb forms, interchangeably.
2. We follow [1] and use the term “subspace” to denote both linear subspaces and affine subspaces. There is a trivial conversion between linear subspaces and affine subspaces as mentioned therein.
3. Wei and Lin [12] have concurrently got similar results as produced in Sec.2.2 of the current paper, and also they have identified the closed-form solution of LRR with that of the shape interaction matrix (SIM) in the classic factorization method.
4. We follow the convention in Golub and van Loan [26] and use to denote the set of eigenvalues counting multiplicity. Hence it is not a normal set, and use of set operators here abuses their traditional definitions.
5. Note that QR factorization may not be unique, complement to the freedom in choosing a basis for , which is dual to the column space of .
6. Toh and Yun [21] have shed some light on the results (ref. Remark in their paper) but lack a detailed development and theoretic proof, and our proof is derived independent of their work. Moreover, solution to the general case as stated in the next theorem extends this results.
7. Moreover, using the similar arguments, plus Lemma 11, we are able to produce a nonconstructive proof to the well known results about singular value thresholding [20] without any use of subgradient. We will not pursue in this direction as it is out of the scope of this paper.
8. For SSC we used the implementation provided by the authors of [14] with proper modification to their PCA routine.

### References

1. G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank representation,” in ICML, 2010.
2. U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
3. J. Shi and J. Malik, “Normalized cuts and image segmentation,” PAMI, vol. 22, no. 8, pp. 888–905, 2000.
4. I. Dhillon, Y. Guan, and B. Kulis, “Kernel k-means: spectral clustering and normalized cuts,” in KDD, 2004.
5. M. Meila and W. Pentney, “Clustering by weighted cuts in directed graphs,” pp. 135–144, 2007.
6. M. Fischler and R. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
7. R. Duda, P. Hart, and D. Stork, Pattern classification, 2001.
8. R. Tron and R. Vidal, “A benchmark for the comparison of 3-D motion segmentation algorithms,” in CVPR, 2007.
9. J. Costeira and T. Kanade, “A multibody factorization method for independently moving objects,” IJCV, vol. 29, no. 3, pp. 159–179, 1998.
10. Y. Ma, A. Yang, H. Derksen, and R. Fossum, “Estimation of subspace arrangements with applications in modeling and segmenting mixed data,” SIAM review, vol. 50, no. 3, pp. 413–458, 2008.
11. J. Yan and M. Pollefeys, “A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate,” ECCV, pp. 94–106, 2006.
12. S. Wei and Z. Lin, “Analysis and improvement of low rank representation for subspace segmentation,” Submitted to IEEE Transactions on Signal Processing (Correspondence), 2010.
13. B. Cheng, J. Yang, S. Yan, Y. Fu, and T. Huang, “Learning with -graph for image analysis,” Image Processing, IEEE Transactions on, vol. 19, no. 4, pp. 858–866, April 2010.
14. E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in CVPR, 2009.
15. R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
16. S. Becker, J. Bobin, and E. Candès, “NESTA: A fast and accurate first-order method for sparse recovery,” Submitted. Available from arXiv, 2009.
17. M. Fazel, “Matrix rank minimization with applications,” Elec. Eng. Dept, Stanford University, 2002.
18. E. Candes and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
19. E. Candes and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
20. J. Cai, E. Candes, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” preprint, 2008.
21. K. Toh and S. Yun, “An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems,” preprint, 2009.
22. Z. Lin, M. Chen, L. Wu, and Y. Ma, “The augmented Lagrange multiplier method for exact recovery of a corrupted low-rank matrices,” Mathematical Programming, submitted, 2009.
23. B. Recht, W. Xu, and B. Hassibi, “Necessary and Sufficient Conditions for Success of the Nuclear Norm Heuristic for Rank Minimization,” ArXiv e-prints, Sep. 2008.
24. B. Recht, M. Fazel, and P. Parrilo, “Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization,” Submitted to SIAM Review, 2008.
25. E. Candes, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Submitted to the Journal of the ACM, 2009.
26. G. Golub and C. Van Loan, Matrix computations.   Johns Hopkins Univ. Press, 1996.
27. G. Stewart and J. Sun, Matrix Perturbation Theory.   Academic Press, 1990.
28. R. Larsen, “PROPACK–Software for large and sparse SVD calculations,” Available from http://soi.stanford.edu/~rmunk/PROPACK.