Robust Multi-subspace Analysis Using Novel Column L_{0}-norm Constrained Matrix Factorization


Binghui Wang and Chuang Lin. Binghui Wang is with the Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50010 USA. Chuang Lin is with the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. Binghui Wang and Chuang Lin contributed equally to this work. The preliminary work was done at Dalian University of Technology, China, and was further extended and completed at Iowa State University, USA. Preliminary results of the proposed method have been published in [1].

We study the underlying structure of data (approximately) generated from a union of independent subspaces. Traditional methods learn only one subspace and thus fail to discover the multi-subspace structure, while state-of-the-art methods analyze the multi-subspace structure using the data themselves as the dictionary, which cannot offer an explicit basis to span each subspace and is sensitive to errors due to the indirect representation. Additionally, they also suffer from a high computational complexity, quadratic or cubic in the sample size.

To tackle all these problems, we propose a method, called Matrix Factorization with Column $\ell_0$-norm constraint (MFC), that can simultaneously learn the basis for each subspace, generate a direct sparse representation for each data sample, and remove errors in the data in an efficient way. Furthermore, we develop a first-order alternating direction algorithm, whose computational complexity is linear in the sample size, to stably and effectively handle the nonconvex objective function and nonsmooth $\ell_0$-norm constraint of MFC. Experimental results on both synthetic and real-world datasets demonstrate that besides its superiority over traditional and state-of-the-art methods for subspace clustering, data reconstruction, and error correction, MFC also shows its uniqueness for multi-subspace basis learning and direct sparse representation.

Robust multi-subspace analysis, matrix factorization, multi-subspace basis learning, $\ell_0$-norm sparse representation, alternating direction algorithm, proximal algorithm.

I Introduction

The observed data are extremely high dimensional in this "big data" era. Typically, these data reside in a much lower-dimensional latent subspace, instead of being uniformly distributed in the high-dimensional ambient space. Thus, it is of great importance to reveal the underlying structure of the data, as it helps to reduce the computational cost and enables a compact representation for learning. Such an idea has been successfully applied to various research communities, e.g., dimensionality reduction [2, 3, 4], face recognition [5, 6, 7], metric learning [8], etc.

Subspace methods have been widely used to analyze data, and the linear subspace (a vector space that is a subset of some higher-dimensional vector space) is the most common choice for its simplicity and computational efficiency. Additionally, the linear subspace has shown its effectiveness in modeling real-world problems such as motion segmentation [9, 10], face clustering [11, 12], and handwritten digit recognition [13]. Consequently, subspace analysis has received much attention in the past decades. For instance, principal component analysis (PCA) aims to learn a subspace while retaining the maximal variance of the data. Nonnegative matrix factorization (NMF) [14] is designed to learn both a nonnegative basis and a nonnegative parts-based representation. Robust PCA (RPCA) [15] assumes that the data are approximately drawn from a low-rank subspace while being perturbed by sparse noise. The basic assumption of these methods is a single subspace, which is not the case in many practical applications. A more reasonable way is to consider the data as lying on or near a union of linear subspaces. Unfortunately, the generalization to handle multiple subspaces is quite challenging.

Recently, multi-subspace analysis has attracted increasing interest in visual data analysis [16, 17, 18]. Derived from recent advances in compressive sensing [19, 20], the methods in [21, 22, 23, 24, 25, 26, 27] have incorporated sparse and/or low-rank regularization into their formulations to model the mixture of linear structures for clean data (data points strictly sampled from their respective subspaces) as well as to deal with errors in data, e.g., noise [28], missing entries [19], corruptions [15], and outliers [29]. Among them, sparse subspace clustering (SSC) [21] and low-rank representation (LRR) [22] stand out as the two most popular methods; they formulate the discovery of the multi-subspace structure as finding a sparse or low-rank representation of the data samples using the data themselves as the dictionary (basis). It has been shown in the literature that these methods can achieve a more accurate subspace structure than single subspace analysis methods. However, both SSC and LRR have the following major drawbacks. First, they use the original data as the basis rather than learning the basis explicitly, which is problematic when data contain errors yet are still used for reconstruction or/and clustering. Second, neither sparse representation via $\ell_1$ regularization nor low-rank representation under a global constraint provides a direct description for each data sample. In other words, data samples cannot find the subspace where they lie, and their contained errors cannot be removed. Third, SSC and LRR suffer from a computational complexity that is quadratic and cubic in the sample size, respectively.

Our work: In this paper, to tackle all the above problems, we aim to simultaneously learn the basis for each subspace, directly generate a sparse representation for each data sample, and remove errors in the data in an efficient way; the errors include random corruptions (a fraction of random entries of the data are grossly corrupted) and sample-specific outliers (a fraction of data samples are far away from their respective subspaces). To this end, we propose a novel column $\ell_0$-norm constrained matrix factorization (MFC) method, and develop a first-order alternating direction algorithm to stably and efficiently solve the nonsmooth and nonconvex objective function of MFC.

Specifically, given a collection of data approximately generated from a union of independent subspaces with an equal subspace dimension, we explore their subspace structure from the matrix factorization perspective in the following way. First, we restrict the basis matrix to be orthonormal, which guarantees that the learnt basis indeed span multiple subspaces. Then, we pursue a direct sparse representation of each data sample on only the basis that span its underlying subspace. To achieve this goal, we leverage an $\ell_0$-norm to confine the number of nonzero elements in each column of the representation matrix. Furthermore, to handle different types of errors contained in the data, we incorporate different regularizations.

With the above constraints, the objective function of MFC is nonconvex in the coupled basis and representation matrices, and is nonsmooth due to the $\ell_0$-norm constraint. Thus, it is a very challenging optimization problem that cannot be solved via typical gradient (or subgradient) based methods. To overcome the nonconvexity, we develop a first-order optimization algorithm, motivated by ADMM [30], that splits the multivariable objective function into several univariate subproblems and solves each subproblem one by one. To tackle the subproblem with the nonsmooth $\ell_0$-norm constraint, we design a novel proximal operator, inspired by recent proximal algorithms [31], and solve it efficiently and analytically.

After learning the basis and representation matrix, MFC can accomplish many tasks, e.g., subspace clustering, data reconstruction, error correction, etc. We evaluate MFC on both synthetic data and real-world datasets. Experimental results demonstrate that besides its superiority and efficiency over traditional and state-of-the-art methods for subspace clustering, data reconstruction, and error correction, MFC also demonstrates its uniqueness for multi-subspace basis learning and direct representation learning.

In summary, our key contributions are as follows:

  • We propose a novel MFC method from the matrix factorization perspective to perform multi-subspace analysis. To the best of our knowledge, this is the first work that considers simultaneously learning the basis, generating direct sparse representation, and correcting errors of data generated from multiple subspaces.

  • We develop a first-order alternating direction algorithm to stably and effectively solve the nonconvex objective function of MFC.

  • We design a novel proximal operator to analytically and efficiently solve the nonsmooth $\ell_0$-norm constraint.

  • Experimental results demonstrate that MFC shows high resistance to errors, achieves high subspace clustering performance, learns multi-subspace basis, and generates direct sparse representation.

II Related Work

Matrix factorization and its variants:  Matrix Factorization (MF) has been studied for several decades. Classical methods, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Independent Component Analysis (ICA), are exemplars of low-rank matrix factorizations. Afterwards, to solve specific problems, many MF variants, including Concept MF [32], Maximum Margin MF [33], Convex MF [34], and Online MF [35], have been proposed. What these methods share is that there is no constraint on the sign of the elements of the factorized matrices. In contrast, a paradigm of factorization, called NMF, was first proposed in [14] and has been widely used for its sparse, parts-based, and additive representation. NMF is suitable for many research fields, e.g., data mining, image processing, pattern recognition, and machine learning. Its real-world applications include document clustering [36], image segmentation [37], image recognition [38, 39], speech and audio processing [40], and blind source separation [41], to name a few. One can refer to the comprehensive review of NMF in [42].

Approximate sparsity via $\ell_1$-norm:  Sparsity in classic NMF occurs as a by-product rather than as a designed objective. To this end, researchers have paid much attention to imposing sparsity on NMF explicitly. These methods either penalize [43, 44] or constrain [45] the $\ell_1$-norm of the representation matrix to yield a sparse representation [46]. As a convex relaxation of the $\ell_0$-norm, the exact sparsity measure which renders an NP-hard combinatorial optimization problem, the $\ell_1$-norm is used to improve the computational feasibility and efficiency of sparse representation [47]. However, as NMF is NP-hard itself [48], any algorithm for NMF is suboptimal. Thus, a heuristic $\ell_0$-norm constrained NMF might be as appropriate and efficient as $\ell_1$-norm sparse NMF.

$\ell_0$-norm based sparsity:  Little work has been done using the $\ell_0$-norm directly, the exact sparsity measure which causes an NP-hard combinatorial optimization problem. K-SVD [49] aims to find an overcomplete dictionary for sparse representation; its sparse approximation is based on an $\ell_0$-norm inequality constraint and is solved via greedy orthogonal matching pursuit (OMP) [50]. Peharz and Pernkopf [51] further proposed sparse NMF, which introduces an $\ell_0$-norm to constrain the basis or representation matrix of approximate NMF and is solved via sparse alternating nonnegative least-squares (sANLS). Wang et al. [52] leveraged hierarchical Bayes to model an adaptive sparseness prior, which has a property similar to the $\ell_0$-norm. However, these methods rest on the hidden assumption that the data are generated from a single subspace. In addition, they cannot handle various errors. In this work, we focus on analyzing data (approximately) drawn from multiple subspaces. Different from greedy methods, the $\ell_0$-norm constraint in MFC is solved analytically via our designed proximal operator, which is easy to incorporate into our developed alternating direction algorithm.

III Notations and Problem Definition

III-A Main Notations

In this paper, vectors and matrices are written as bold lowercase and bold uppercase symbols, respectively, e.g., $\mathbf{v}$ and $\mathbf{M}$. The $i$-th entry of a vector $\mathbf{v}$ is $v_i$. For a matrix $\mathbf{M}$, its $(i,j)$-th entry, $j$-th column, and $i$-th row are denoted as $M_{ij}$, $\mathbf{M}_{:,j}$, and $\mathbf{M}_{i,:}$, respectively. The horizontal and vertical concatenations of a set of matrices along the row and column are denoted by $[\mathbf{M}_1, \mathbf{M}_2, \dots]$ and $[\mathbf{M}_1; \mathbf{M}_2; \dots]$, respectively. The superscript $^\top$ stands for the transpose.

For a vector $\mathbf{v}$, its $\ell_2$-norm is $\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$, and its pseudo $\ell_0$-norm is the number of nonzero entries, i.e., $\|\mathbf{v}\|_0 = \#\{i : v_i \neq 0\}$. For a matrix $\mathbf{M}$, its Frobenius norm is $\|\mathbf{M}\|_F = \sqrt{\sum_{ij} M_{ij}^2}$; its $\ell_1$-norm is $\|\mathbf{M}\|_1 = \sum_{ij} |M_{ij}|$; its $\ell_{2,1}$-norm is $\|\mathbf{M}\|_{2,1} = \sum_j \|\mathbf{M}_{:,j}\|_2$; and its $\ell_\infty$-norm is $\|\mathbf{M}\|_\infty = \max_{ij} |M_{ij}|$. The inner product between two matrices $\mathbf{M}$ and $\mathbf{N}$ is $\langle \mathbf{M}, \mathbf{N} \rangle = \mathrm{tr}(\mathbf{M}^\top \mathbf{N})$, where $\mathrm{tr}(\cdot)$ is the trace operator defined on a square matrix.
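As a quick sanity check, the norms defined above can be computed in a few lines of numpy (the variable names are ours):

```python
import numpy as np

v = np.array([3.0, 0.0, -4.0])
l2 = np.linalg.norm(v)            # l2-norm: sqrt(9 + 16) = 5
l0 = np.count_nonzero(v)          # pseudo l0-norm: number of nonzeros = 2

M = np.array([[1.0, -2.0], [0.0, 3.0]])
fro  = np.linalg.norm(M, 'fro')            # sqrt of the sum of squared entries
l1   = np.abs(M).sum()                     # sum of absolute entries
l21  = np.linalg.norm(M, axis=0).sum()     # sum of column l2-norms
linf = np.abs(M).max()                     # largest absolute entry

N = np.eye(2)
inner = np.trace(M.T @ N)                  # <M, N> = tr(M^T N)
```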

We denote the $i$-th subspace as $\mathcal{S}_i$ and a collection of $k$ subspaces as $\{\mathcal{S}_i\}_{i=1}^{k}$.

III-B Problem Definition

We present two definitions related to our problem at first.


  • Union and sum of subspaces:  The union of $k$ subspaces is defined as $\bigcup_{i=1}^{k} \mathcal{S}_i = \{\mathbf{v} : \mathbf{v} \in \mathcal{S}_i \text{ for some } 1 \leq i \leq k\}$. The sum of $k$ subspaces is defined as $\sum_{i=1}^{k} \mathcal{S}_i = \{\mathbf{v}_1 + \cdots + \mathbf{v}_k : \mathbf{v}_i \in \mathcal{S}_i\}$.

  • Independent subspaces:  $k$ subspaces are independent if and only if $\mathcal{S}_i \cap \sum_{j \neq i} \mathcal{S}_j = \{\mathbf{0}\}$ for every $i$. Equivalently, $\dim\big(\sum_{i=1}^{k} \mathcal{S}_i\big) = \sum_{i=1}^{k} \dim(\mathcal{S}_i)$.
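The dimension identity above gives a simple numerical test for independence; a numpy sketch (sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bases of two 2-dim subspaces in R^6; random subspaces are
# independent with probability one.
X1 = rng.standard_normal((6, 2))
X2 = rng.standard_normal((6, 2))

# dim(S1 + S2) is the rank of the stacked bases.
dim_sum = np.linalg.matrix_rank(np.hstack([X1, X2]))
independent = dim_sum == np.linalg.matrix_rank(X1) + np.linalg.matrix_rank(X2)

# A dependent pair: S3 shares one direction with S1, so the
# dimension of the sum falls short of the sum of dimensions.
X3 = np.hstack([X1[:, :1], rng.standard_normal((6, 1))])
dim_sum2 = np.linalg.matrix_rank(np.hstack([X1, X3]))
dependent = dim_sum2 < np.linalg.matrix_rank(X1) + np.linalg.matrix_rank(X3)
```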

With above definitions, our problem is defined as

Problem 1 (Robust Multi-subspace Analysis).

Given a collection of data samples strictly/approximately drawn from a union of independent subspaces with an equal subspace dimension, robust multi-subspace analysis is to learn the basis for each subspace, learn the associated representation for each sample upon the basis, or/and correct the error for each data sample.

Formally, suppose the data samples $\mathbf{D} \in \mathbb{R}^{d \times n}$ are generated from a basis matrix $\mathbf{X}$ that spans $k$ independent subspaces $\{\mathcal{S}_i\}_{i=1}^{k}$, with a corresponding representation matrix $\mathbf{Y}$, or/and contaminated with errors $\mathbf{E}$:

$\mathbf{D} = \mathbf{X}\mathbf{Y} + \mathbf{E}$.  (1)

Then, our purpose is, from the observed data $\mathbf{D}$, to learn the basis $\mathbf{X}_i$ for each subspace $\mathcal{S}_i$, the representation $\mathbf{Y}$ upon the basis $\mathbf{X}$, or/and the error $\mathbf{E}$ such that

$\mathbf{D} = \mathbf{X}\mathbf{Y} + \mathbf{E}$,  (2)

where $\mathbf{D} = [\mathbf{D}_1, \dots, \mathbf{D}_k]$ with $\mathbf{D}_i \in \mathbb{R}^{d \times n_i}$ merely collecting the samples from the $i$-th subspace and $\sum_{i=1}^{k} n_i = n$; $\mathbf{E} = [\mathbf{E}_1, \dots, \mathbf{E}_k]$ with $\mathbf{E}_i$ containing the errors of the samples in $\mathbf{D}_i$; $\mathbf{X} = [\mathbf{X}_1, \dots, \mathbf{X}_k] \in \mathbb{R}^{d \times kr}$ with $\mathbf{X}_i \in \mathbb{R}^{d \times r}$ the basis to span $\mathcal{S}_i$; $\mathbf{Y} \in \mathbb{R}^{kr \times n}$ with $\mathbf{Y}_i$ the representation upon the basis $\mathbf{X}_i$; and $r$ indicates the subspace dimension, $r = \dim(\mathcal{S}_i)$.

In the sequel, we first propose a model to deal with data that are clean, i.e., $\mathbf{E} = \mathbf{0}$. Then we generalize our model to handle data that are contaminated with random corruptions or sample-specific outliers, i.e., $\mathbf{E} \neq \mathbf{0}$.

Iv The Proposed Method

In this section, we first introduce a theorem that inspires us to formulate the problem with clean data. Then, we generalize the model to handle contaminated data. Finally, we present the algorithm to solve our model.

Iv-a Problem Formulation

To start with, we assume that the data samples are strictly generated from $k$ independent subspaces. In doing so, we arrive at the following conclusion:

Theorem 1.

Suppose the $k$ subspaces $\{\mathcal{S}_i\}_{i=1}^{k}$ are independent and $\mathbf{X}_i$ consists of the basis that only spans subspace $\mathcal{S}_i$; then the solution of the representation matrix $\mathbf{Y}$ to Eq.(2) dealing with clean data is block-diagonal:

$\mathbf{Y} = \mathrm{diag}(\mathbf{Y}_1, \mathbf{Y}_2, \dots, \mathbf{Y}_k)$,  (3)

where $\mathbf{Y}_i \in \mathbb{R}^{r \times n_i}$ and $\mathbf{D}_i = \mathbf{X}_i \mathbf{Y}_i$.


Proof: See Appendix A. ∎

Theorem 1 shows the condition under which we can achieve the block-diagonal representation matrix, which is of vital importance for analyzing the multi-subspace structure. That is, if we know the basis that the clean data lie on, then each block of the representation matrix characterizes one subspace. In practice, however, the basis of each subspace is often unknown; thus, the generated representation matrix cannot be guaranteed to be block-diagonal. Fortunately, Theorem 1 offers us a hint for constructing a satisfactory model. In short, inspired by Theorem 1, we expect to simultaneously learn the basis matrix and the representation matrix.

To be more specific, we first impose an orthonormal constraint on the basis matrix to ensure that its columns can indeed form the "basis" to span independent subspaces. In order to generate the block-diagonal representation, we characterize each data sample as a nonnegative combination (we only deal with data with nonnegative entries) of only the basis that span its underlying subspace. To this end, we leverage a column $\ell_0$-norm constraint on the representation matrix and set its number of nonzero elements equal to the subspace dimension.

Formally, the objective function satisfying the above constraints is defined as

$\min_{\mathbf{X},\mathbf{Y}} \|\mathbf{D} - \mathbf{X}\mathbf{Y}\|_F^2$  (4)
s.t. $\mathbf{X}^\top \mathbf{X} = \mathbf{I}$,  (5)
$\mathbf{Y} \geq 0$,  (6)
$\|\mathbf{Y}_{:,j}\|_0 = r, \ \forall j$,  (7)

where Eq.(5) is the orthonormal constraint on $\mathbf{X}$ and $\mathbf{I}$ is a $kr \times kr$ identity matrix; Eq.(6) indicates a nonnegative combination of the orthonormal basis, where "$\geq$" is taken component-wise; and Eq.(7) is the requirement on the number of nonzero coefficients. For writing simplicity, we omit the "$\forall j$" symbol in the subsequent related equations.

In practice, the observed data are often contaminated with errors, e.g., noise, random corruptions, or sample-specific outliers, rendering the data away from their exact underlying subspaces. To handle different types of errors, we adopt different regularizations. Specifically, we generalize the objective function in Eq.(4) to

$\min_{\mathbf{X},\mathbf{Y},\mathbf{E}} \|\mathbf{D} - \mathbf{X}\mathbf{Y} - \mathbf{E}\|_F^2 + \lambda \|\mathbf{E}\|_\ell$, s.t. Eqs.(5)-(7),  (8)

where $\|\cdot\|_\ell$ is some type of matrix norm: for random corruptions it is the $\ell_1$-norm, while for sample-specific outliers it is the $\ell_{2,1}$-norm. $\lambda$ is the regularization parameter that controls the trade-off between the two terms.

By implementing the above two objective functions, we claim that when the data are clean, $\mathbf{Y}$ is block-diagonal after reordering its columns and rows. When the data are contaminated with random corruptions or sample-specific outliers to some extent, our experimental results show that $\mathbf{Y}$ is still approximately block-diagonal.

It is necessary to notice that SSC and LRR can also be boiled down to the matrix factorization framework. They share some similarities with the proposed method, but there exist two major differences. On the one hand, they take advantage of what they refer to as the "self-expressiveness" property and use the data themselves to form the dictionary (basis). However, leveraging the data as the basis fails for data reconstruction when the data contain errors to some degree (see Figure 1). On the other hand, they respectively utilize an indirect $\ell_1$-norm and nuclear norm to obtain the sparse and low-rank representations of the data, which, however, cannot identify the exact basis used to represent each data sample. Thus, their ability for subspace clustering is also limited (see Figure 5).

(a) Random Corruptions
(b) Sample-Specific Outliers
Fig. 1: Reconstruction results of data generated from 3 independent subspaces and contaminated with (a) random corruptions and (b) sample-specific outliers. The first column shows the original contaminated data. The remaining three columns are the results of SSC, LRR, and MFC respectively.

IV-B Problem Optimization

In this part, without loss of generality, we solve the general case in Eq.(8). Notice that we need to handle the nonconvex multivariable objective function, as well as the nonsmooth $\ell_0$-norm constraint. Fortunately, for functions with multiple variables, the ADMM method is often the method of choice [30]. Moreover, proximal algorithms have recently become a popular tool for solving nonsmooth and nonconvex problems. Their basic operation is evaluating the proximal operator of a univariate function, which often admits a closed-form solution or can be solved very quickly [31]. Inspired by such superiority, we develop a first-order alternating direction algorithm to solve Eq.(8). The overall procedure of the algorithm is to first introduce auxiliary variables and quadratic penalties into the objective function, then iteratively minimize the augmented Lagrangian function with respect to each primal variable, and finally update the multipliers. To tackle the subproblem with the nonsmooth $\ell_0$-norm constraint, we design a novel proximal operator to solve it efficiently and analytically.

We begin with introducing an auxiliary variable $\mathbf{V}$ for the variable $\mathbf{Y}$, which exists in both the objective function and the constraints. We reformulate Eq.(8) as

$\min_{\mathbf{X},\mathbf{Y},\mathbf{E},\mathbf{V}} \|\mathbf{D} - \mathbf{X}\mathbf{Y} - \mathbf{E}\|_F^2 + \lambda \|\mathbf{E}\|_\ell$, s.t. $\mathbf{X}^\top \mathbf{X} = \mathbf{I}$, $\mathbf{Y} = \mathbf{V}$, $\mathbf{V} \geq 0$, $\|\mathbf{V}_{:,j}\|_0 = r$.  (9)

Then, the augmented Lagrangian function of Eq.(9) is

$\mathcal{L}(\mathbf{X},\mathbf{Y},\mathbf{E},\mathbf{V},\mathbf{P}) = \|\mathbf{D} - \mathbf{X}\mathbf{Y} - \mathbf{E}\|_F^2 + \lambda \|\mathbf{E}\|_\ell + \langle \mathbf{P}, \mathbf{Y} - \mathbf{V} \rangle + \frac{\mu}{2}\|\mathbf{Y} - \mathbf{V}\|_F^2$, s.t. $\mathbf{X}^\top \mathbf{X} = \mathbf{I}$, $\mathbf{V} \geq 0$, $\|\mathbf{V}_{:,j}\|_0 = r$,  (10)

where $\mathbf{P}$ is the Lagrangian multiplier and $\mu > 0$ is the quadratic penalty parameter. Note that adding the penalty term does not change the optimal solution, since any feasible solution satisfying the constraint $\mathbf{Y} = \mathbf{V}$ in Eq.(10) makes the penalty term vanish.

Finally, our alternating direction algorithm consists of iteratively minimizing Eq.(10) with respect to one of $\{\mathbf{X}, \mathbf{Y}, \mathbf{E}, \mathbf{V}\}$ while fixing the others, and updating the multiplier $\mathbf{P}$. Specifically, we solve for the variables and the multiplier step by step.

Update X: By discarding terms that are irrelevant to $\mathbf{X}$, we rewrite the subproblem with respect to $\mathbf{X}$ as

$\min_{\mathbf{X}} \|\mathbf{D} - \mathbf{X}\mathbf{Y} - \mathbf{E}\|_F^2$, s.t. $\mathbf{X}^\top \mathbf{X} = \mathbf{I}$.  (11)

We denote the singular value decomposition (SVD) of $(\mathbf{D} - \mathbf{E})\mathbf{Y}^\top$ as $\mathbf{U}\boldsymbol{\Sigma}\mathbf{W}^\top$; then from Theorem 2 below, we can achieve the closed-form solution $\mathbf{X} = \mathbf{U}\mathbf{W}^\top$. That is,

Theorem 2.

Let $\mathbf{A}$ and $\mathbf{B}$ be any two matrices of compatible sizes. Denote the SVD of $\mathbf{A}\mathbf{B}^\top$ as $\mathbf{U}\boldsymbol{\Sigma}\mathbf{W}^\top$. Then the orthonormal constrained minimization problem

$\min_{\mathbf{X}} \|\mathbf{A} - \mathbf{X}\mathbf{B}\|_F^2$, s.t. $\mathbf{X}^\top \mathbf{X} = \mathbf{I}$  (12)

has the analytic solution $\mathbf{X} = \mathbf{U}\mathbf{W}^\top$.


Proof: See Appendix B. ∎
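Theorem 2 is the classical orthogonal Procrustes solution. The following numpy sketch (our variable names, illustrative sizes) checks it numerically against random orthonormal competitors:

```python
import numpy as np

def procrustes(A, B):
    """X = U W^T from the SVD of A B^T minimizes ||A - X B||_F^2
    subject to X^T X = I (the statement of Theorem 2)."""
    U, _, Wt = np.linalg.svd(A @ B.T, full_matrices=False)
    return U @ Wt

rng = np.random.default_rng(1)
d, m, n = 8, 3, 50
A = rng.standard_normal((d, n))   # plays the role of D - E
B = rng.standard_normal((m, n))   # plays the role of Y
X = procrustes(A, B)
best = np.linalg.norm(A - X @ B)

# No random orthonormal competitor should beat the Procrustes solution.
ok = all(
    np.linalg.norm(A - np.linalg.qr(rng.standard_normal((d, m)))[0] @ B) >= best - 1e-9
    for _ in range(100)
)
```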

Update Y: The subproblem of $\mathbf{Y}$ is

$\min_{\mathbf{Y}} \|\mathbf{D} - \mathbf{X}\mathbf{Y} - \mathbf{E}\|_F^2 + \langle \mathbf{P}, \mathbf{Y} - \mathbf{V} \rangle + \frac{\mu}{2}\|\mathbf{Y} - \mathbf{V}\|_F^2$.  (13)

Setting the derivative with respect to $\mathbf{Y}$ to zero, rearranging the terms, and using the constraint $\mathbf{X}^\top \mathbf{X} = \mathbf{I}$ yields

$\mathbf{Y} = \frac{2\mathbf{X}^\top(\mathbf{D} - \mathbf{E}) + \mu\mathbf{V} - \mathbf{P}}{2 + \mu}$.  (14)
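A quick numerical check that the Y-update in Eq.(14) satisfies the stationarity condition, under our reconstruction of the augmented Lagrangian terms (numpy; variable names and sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n, mu = 6, 3, 10, 2.0
X, _ = np.linalg.qr(rng.standard_normal((d, m)))     # orthonormal: X^T X = I
D = rng.standard_normal((d, n))
E = np.zeros((d, n))
V = rng.standard_normal((m, n))
P = rng.standard_normal((m, n))

# Closed-form minimizer of ||D - X Y - E||_F^2 + <P, Y - V> + (mu/2)||Y - V||_F^2.
Y = (2.0 * X.T @ (D - E) + mu * V - P) / (2.0 + mu)

# Gradient of the subproblem at Y: 2 X^T (X Y - (D - E)) + P + mu (Y - V).
grad = 2.0 * X.T @ (X @ Y - (D - E)) + P + mu * (Y - V)
```

The gradient vanishes (up to floating-point error), confirming the closed form.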
Update E: The subproblem of $\mathbf{E}$ becomes

$\min_{\mathbf{E}} \|\mathbf{D} - \mathbf{X}\mathbf{Y} - \mathbf{E}\|_F^2 + \lambda \|\mathbf{E}\|_\ell$.  (15)

For sample-specific outliers, we set $\|\mathbf{E}\|_\ell = \|\mathbf{E}\|_{2,1}$. Then the solution to Eq.(15) is equivalent to evaluating the proximal operator of the $\ell_{2,1}$-norm. From [53], $\mathbf{E}$ can be obtained in closed form. Concretely, denote $\mathbf{Q} = \mathbf{D} - \mathbf{X}\mathbf{Y}$; then

$\mathbf{E}_{:,j} = \max\big(1 - \frac{\lambda}{2\|\mathbf{Q}_{:,j}\|_2},\, 0\big)\,\mathbf{Q}_{:,j}$.  (16)

Similarly, for random corruptions, we set $\|\mathbf{E}\|_\ell = \|\mathbf{E}\|_1$. The solution then equals evaluating the proximal operator of the $\ell_1$-norm, and can be achieved via the soft-thresholding operator [54]:

$\mathbf{E}_{ij} = \mathrm{sign}(\mathbf{Q}_{ij})\max\big(|\mathbf{Q}_{ij}| - \frac{\lambda}{2},\, 0\big)$.  (17)
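Both shrinkage rules admit one-line implementations; a numpy sketch (our function names; the threshold is λ/2 because the quadratic data term carries no 1/2 factor):

```python
import numpy as np

def prox_l1(Q, lam):
    """Entrywise soft-thresholding: argmin_E ||Q - E||_F^2 + lam * ||E||_1."""
    t = lam / 2.0
    return np.sign(Q) * np.maximum(np.abs(Q) - t, 0.0)

def prox_l21(Q, lam):
    """Columnwise shrinkage: argmin_E ||Q - E||_F^2 + lam * ||E||_{2,1}."""
    norms = np.linalg.norm(Q, axis=0)
    scale = np.maximum(1.0 - (lam / 2.0) / np.maximum(norms, 1e-12), 0.0)
    return Q * scale

Q = np.array([[3.0, 0.1],
              [4.0, 0.1]])
E1 = prox_l1(Q, 1.0)    # every entry shrinks toward zero by 0.5
E2 = prox_l21(Q, 1.0)   # the small column (norm < 0.5) is zeroed entirely
```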
Update V: The subproblem associated with $\mathbf{V}$ is

$\min_{\mathbf{V}} \langle \mathbf{P}, \mathbf{Y} - \mathbf{V} \rangle + \frac{\mu}{2}\|\mathbf{Y} - \mathbf{V}\|_F^2$, s.t. $\mathbf{V} \geq 0$, $\|\mathbf{V}_{:,j}\|_0 = r$.  (18)
Here, we emphasize that due to the discrete $\ell_0$-norm constraint in Eq.(18), $\mathbf{V}$ cannot be obtained via gradient or subgradient based methods. Benefiting from the aforementioned proximal algorithms [31], we can design a simple yet very efficient algorithm to solve for $\mathbf{V}$ column by column. The details are as follows:

We denote $\mathbf{Z} = \mathbf{Y} + \mathbf{P}/\mu$ and define the operator $\mathcal{T}_r$ for a vector $\mathbf{z}$ as

$[\mathcal{T}_r(\mathbf{z})]_i = z_i$ if $i$ is among the indices of the $r$ largest entries of $\mathbf{z}$, and $[\mathcal{T}_r(\mathbf{z})]_i = 0$ otherwise.

We also define the nonnegative orthant mapping $\Pi_+$ as

$[\Pi_+(\mathbf{z})]_i = \max(z_i, 0)$,

and introduce an indicator function over the set $\mathcal{C} = \{\mathbf{v} : \mathbf{v} \geq 0, \|\mathbf{v}\|_0 = r\}$:

$\iota_{\mathcal{C}}(\mathbf{v}) = 0$ if $\mathbf{v} \in \mathcal{C}$, and $\iota_{\mathcal{C}}(\mathbf{v}) = +\infty$ otherwise.
Then, we have the following theorem:

Theorem 3.

Given the above defined operator $\mathcal{T}_r$, nonnegative orthant mapping $\Pi_+$, and indicator function $\iota_{\mathcal{C}}$, the solution of Eq.(18) equals solving the proximal operator (note that there is a factor 1/2 on the quadratic term in the standard form; here we remove it to simplify the notation) of the indicator function:

$\mathbf{V}_{:,j} = \mathrm{prox}_{\iota_{\mathcal{C}}}(\mathbf{Z}_{:,j}) = \arg\min_{\mathbf{v}} \|\mathbf{v} - \mathbf{Z}_{:,j}\|_2^2 + \iota_{\mathcal{C}}(\mathbf{v})$,

which has an analytical form, i.e., $\mathbf{V}_{:,j} = \mathcal{T}_r(\Pi_+(\mathbf{Z}_{:,j}))$.


Proof: See Appendix C. ∎

Based on Theorem 3, $\mathbf{V}_{:,j}$ can be acquired by selecting the $r$ largest nonnegative entries of $\mathbf{Z}_{:,j}$, with corresponding index set $\Omega_j$. To be specific, for each column $j$, we set

$\mathbf{V}_{ij} = \mathbf{Z}_{ij}$ if $i \in \Omega_j$, and $\mathbf{V}_{ij} = 0$ otherwise.  (19)
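The columnwise selection rule can be sketched in numpy as follows (the helper name `update_V` is ours):

```python
import numpy as np

def update_V(Y, P, mu, r):
    """Project Z = Y + P/mu columnwise onto {v >= 0, ||v||_0 <= r}:
    keep the r largest nonnegative entries, zero out the rest."""
    Z = Y + P / mu
    Zp = np.maximum(Z, 0.0)                  # nonnegative orthant mapping
    V = np.zeros_like(Zp)
    for j in range(Zp.shape[1]):
        idx = np.argsort(Zp[:, j])[-r:]      # indices of the r largest entries
        V[idx, j] = Zp[idx, j]
    return V

Y = np.array([[0.9, -0.2],
              [0.1,  0.8],
              [0.5,  0.7]])
P = np.zeros_like(Y)
V = update_V(Y, P, mu=1.0, r=2)
```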
Update P: Finally, for the multiplier $\mathbf{P}$, based on the dual optimality condition [30], its updating rule is

$\mathbf{P} \leftarrow \mathbf{P} + \mu(\mathbf{Y} - \mathbf{V})$,  (20)

where $\mu \leftarrow \min(\rho\mu, \mu_{\max})$, with pregiven $\rho > 1$ and $\mu_{\max}$.

Algorithm 1 summarizes the algorithmic procedure of our proposed alternating direction algorithm for solving Eq.(10).

After obtaining $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{E}$, MFC can accomplish the following tasks:

  • Subspace Clustering: We apply normalized cut [55] (one can also use K-means clustering on $\mathbf{Y}$; we choose normalized cut here so as to have a fair comparison with SSC and LRR) to the affinity matrix built from $\mathbf{Y}$ to cluster all data samples into their respective subspaces.

  • Data Reconstruction: We treat the product $\mathbf{X}\mathbf{Y}$ as the reconstruction of the data.

  • Error Correction: We regard $\mathbf{E}$ as the error of the data.

  • Representation Learning: We permute the rows and/or columns of $\mathbf{Y}$ to make it (approximately) block-diagonal. Then the block $\mathbf{Y}_i$ indicates the representation for subspace $\mathcal{S}_i$.

  • Basis Learning: We respectively extract the columns of $\mathbf{X}$ whose indices correspond to the row numbers of $\mathbf{Y}_i$ to form the basis of subspace $\mathcal{S}_i$.

0:  Input: Observed data matrix $\mathbf{D}$, subspace dimension $r$.
0:  Output: $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{E}$. Initialize $\mathbf{X}$, $\mathbf{E}$, $\mathbf{V}$; initialize $\mu$, $\rho$, $\mu_{\max}$.
  while not converged do
     Update $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{E}$, $\mathbf{V}$, $\mathbf{P}$ using Eqs.(11)-(20).
     Check the convergence condition:    $\|\mathbf{Y} - \mathbf{V}\|_\infty \leq \epsilon$ or the maximal number of iterations is reached.
  end while
Algorithm 1 Solving the problem Eq.(10)
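For concreteness, the loop above can be sketched compactly in numpy under our notation, with the $\ell_{2,1}$ error term, illustrative parameter defaults, and a simplified stopping rule (fixed iteration budget). This is a sketch of the scheme, not the authors' Matlab implementation:

```python
import numpy as np

def mfc(D, k, r, lam=0.1, mu=1.0, rho=1.1, mu_max=1e6, max_iter=100, seed=0):
    """Alternating direction sketch for
    min ||D - X Y - E||_F^2 + lam * ||E||_{2,1}
    s.t. X^T X = I, Y = V, V >= 0, ||V[:, j]||_0 <= r."""
    d, n = D.shape
    m = k * r
    rng = np.random.default_rng(seed)
    X, _ = np.linalg.qr(rng.standard_normal((d, m)))   # random orthonormal init
    Y = np.abs(rng.standard_normal((m, n)))
    V = Y.copy()
    E = np.zeros((d, n))
    P = np.zeros((m, n))
    for _ in range(max_iter):
        # X-step (Theorem 2): orthogonal Procrustes on (D - E) Y^T.
        U, _, Wt = np.linalg.svd((D - E) @ Y.T, full_matrices=False)
        X = U @ Wt
        # Y-step: closed form from the stationarity condition (X^T X = I).
        Y = (2.0 * X.T @ (D - E) + mu * V - P) / (2.0 + mu)
        # E-step: columnwise l2,1 shrinkage of Q = D - X Y.
        Q = D - X @ Y
        norms = np.maximum(np.linalg.norm(Q, axis=0), 1e-12)
        E = Q * np.maximum(1.0 - (lam / 2.0) / norms, 0.0)
        # V-step: keep the r largest nonnegative entries per column of Y + P/mu.
        Z = np.maximum(Y + P / mu, 0.0)
        V = np.zeros_like(Z)
        for j in range(n):
            idx = np.argsort(Z[:, j])[-r:]
            V[idx, j] = Z[idx, j]
        # Multiplier and penalty updates.
        P = P + mu * (Y - V)
        mu = min(rho * mu, mu_max)
    return X, V, E
```

Each iteration touches only small matrix products and one thin SVD, which is where the per-iteration complexity discussed in the next section comes from.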

V Computational Analysis

The computational complexity of MFC in each iteration can be summarized as follows:

Calculate $\mathbf{X}$:  We first perform a matrix subtraction of size $d \times n$ and a matrix multiplication of sizes $(d \times n)(n \times kr)$, with complexities $O(dn)$ and $O(dnkr)$, respectively. Then, we compute the SVD of a $d \times kr$ matrix and store both sets of singular vectors; the complexity is $O(d^2 kr)$. Finally, we calculate the matrix product of sizes $(d \times d)(d \times kr)$ with complexity $O(d^2 kr)$ to obtain the orthonormal basis $\mathbf{X}$. The total complexity is $O(dnkr + d^2 kr)$.

Compute $\mathbf{Y}$:  We just perform matrix multiplications of sizes $(kr \times d)(d \times n)$ and matrix additions of size $kr \times n$, with the dominant complexity $O(dnkr)$.

Compute $\mathbf{E}$:  The computation consists of matrix multiplication, matrix subtraction, and proximal operator calculation (both the $\ell_1$-norm and $\ell_{2,1}$-norm proximal operators have the same complexity $O(dn)$). The complexities of the parts are $O(dnkr)$, $O(dn)$, and $O(dn)$, and the dominant complexity is $O(dnkr)$.

Compute $\mathbf{V}$:  The complexity of the matrix addition is $O(nkr)$. The nonnegative orthant mapping and the sorting of a $kr$-dim vector respectively need $O(kr)$ and $O(kr\log(kr))$ (in the average case) operations per column. Thus, the dominant complexity is $O(nkr\log(kr))$.

As the multi-subspace dimensionality $kr$ is much smaller than the data dimensionality $d$ and the sample size $n$, i.e., $kr \ll \min(d, n)$, we conclude that the dominant complexity of MFC is $O(dnkr + d^2 kr)$.

TABLE I: Dominant complexity of SSC, LRR, and MFC

SSC: $O(dn^2)$    LRR: $O(dn^2 + n^3)$    MFC: $O(dnkr + d^2 kr)$

Here, we also discuss the computational complexity of SSC and LRR. For SSC, the computational burden is matrix multiplication, with complexity $O(dn^2)$. For LRR, the computational burden lies in two parts, matrix multiplication and matrix SVD, the complexities of which are $O(dn^2)$ and $O(n^3)$, respectively. One should note that: (1) the matrix sizes for matrix multiplication in MFC ($d \times kr$ and $kr \times n$) are smaller than those in SSC and LRR ($d \times n$ and $n \times n$); (2) the matrix size for SVD in MFC ($d \times kr$) is smaller than that in LRR ($n \times n$). The dominant complexities of SSC, LRR, and MFC are listed in Table I.

Comparing MFC with SSC and LRR, we observe that when handling data with high dimensionality, SSC and LRR are relatively faster: the complexity in terms of $d$ is quadratic for MFC and linear for SSC and LRR. Actually, we can leverage random projections, a very effective [56] feature extraction method, as a preprocessing step to reduce the dimensionality, thus largely decreasing the complexity of MFC. In contrast, if we deal with a large sample size, MFC is more efficient than SSC and LRR: the complexities of SSC and LRR are quadratic and cubic in the sample size $n$, while MFC is linear.

In practice, the overall computational complexity is determined not only by the complexity of each iteration but also by the total number of iterations. Although there are no theoretical guarantees, experimental results (see Tables VI-VIII) show that MFC converges within 20-40 iterations, while LRR and SSC need 55-70 and 90-110 iterations, respectively.

VI Experimental Results

In this section, we carry out several experiments on both synthetic data and real-world datasets to test the performance of our proposed MFC method.

Compared methods:  We compare MFC with single subspace learning methods, i.e., PCA, NMF [14], and Sparse NMF (SNMF) [51], and with state-of-the-art multi-subspace learning methods, i.e., SSC and LRR. MFC can leverage different regularizations to handle different types of errors. In the experiments, we mainly focus on random corruptions (i.e., $\|\mathbf{E}\|_1$) and sample-specific outliers (i.e., $\|\mathbf{E}\|_{2,1}$). Accordingly, we denote the two variants of MFC as MFC1 and MFC2, respectively.

For PCA, we preserve 95 percent of the total data variance. For NMF and SNMF, we set their basis number to $kr$, the same as MFC's. For MFC, we randomly generate a matrix of size $d \times kr$ to initialize the basis matrix. The hyperparameters of all compared methods are selected via cross-validation. The stopping criterion for each method is either reaching the maximal number of iterations or the difference of the objective function values between neighboring iterations falling below a small tolerance. All experiments are conducted in Matlab R2010b on a platform with a 3.10 GHz CPU and 16.0 GB RAM.

Evaluation metrics:  We use clustering accuracy, time, and iterations as the metrics to evaluate the performance of all compared methods.

VI-1 Clustering Accuracy

The subspace clustering result is evaluated by comparing the estimated label of each data sample with that provided by the ground truth. In this paper, we use clustering accuracy (ACC) to measure the clustering performance. ACC is defined as

$\mathrm{ACC} = \frac{1}{n}\sum_{i=1}^{n} \delta\big(y_i, \mathrm{map}(\hat{y}_i)\big)$,

where $\hat{y}_i$ and $y_i$ are the estimated label and the ground truth of the $i$-th point; $\delta(x, y) = 1$ if $x = y$, and $\delta(x, y) = 0$ otherwise; $\mathrm{map}(\cdot)$ is the permutation mapping function that maps each estimated label to the equivalent label of the entire data. The best mapping can be efficiently computed by the Kuhn-Munkres algorithm [57].
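The mapping step can be implemented with the Hungarian solver in scipy (`scipy.optimize.linear_sum_assignment`); a sketch with our function name:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC = (1/n) * sum_i delta(y_i, map(yhat_i)), where the best label
    permutation is found by the Kuhn-Munkres (Hungarian) algorithm."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels_t = np.unique(y_true)
    labels_p = np.unique(y_pred)
    # Confusion counts: rows are predicted clusters, columns are true classes.
    C = np.zeros((len(labels_p), len(labels_t)), dtype=int)
    for i, lp in enumerate(labels_p):
        for j, lt in enumerate(labels_t):
            C[i, j] = np.sum((y_pred == lp) & (y_true == lt))
    row, col = linear_sum_assignment(-C)   # maximize the matched counts
    return C[row, col].sum() / len(y_true)

# Labels that agree up to a permutation score a perfect ACC.
acc = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
```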

VI-2 Time and Iterations

We use three indices to quantify the computational efficiency: the total running time $t$, the number of iterations $s$, and the averaged running time per iteration ($t/s$). $t$ indicates the overall efficiency; $s$ reflects the convergence rate to reach local solutions; and $t/s$ shows the efficiency per iteration.

VI-A Results on Synthetic Data

VI-A1 Data reconstruction visualization on 3-dim space

We visualize data reconstruction in the 3-dim ambient space in Figure 1 and compare MFC with SSC [21] and LRR [22]. The data samples are drawn from three 1-dim independent subspaces and disturbed by random corruptions and sample-specific outliers, respectively. Figure 1 shows that SSC and LRR are unable to remove the errors, which validates our aforementioned argument, while MFC fully eliminates them. The reason is that both SSC and LRR use the original contaminated data as the basis, which is unreasonable for data reconstruction. In contrast, MFC learns the basis that span the underlying subspaces the clean data lie in; indeed, MFC recovers the basis of each of the three subspaces, and the errors contained in the data are absorbed into the regularization term.

VI-A2 Results on high-dimensional data

For high-dimensional data analysis, the synthetic data are generated from multiple independent subspaces, each containing the same number of samples. All subspaces have the same dimension and are embedded in a higher-dimensional ambient space. The procedure of generating the above subspaces is similar to that of [13]: the basis of each subspace is computed as the product of a random orthonormal matrix and a random column-orthogonal matrix, and the data points of each subspace are sampled as $\mathbf{D}_i = \mathbf{X}_i \mathbf{Y}_i$, where the elements of $\mathbf{Y}_i$ are drawn from the standard uniform distribution.
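A simplified sketch of this generation procedure (random orthonormal bases via QR rather than the exact rotation-based construction of [13]; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k, r, d, n_per = 3, 4, 100, 50    # illustrative sizes

# Random orthonormal basis for each subspace (QR of a Gaussian matrix);
# random subspaces are independent with probability one.
bases = []
for _ in range(k):
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    bases.append(Q)

# Sample the points: D_i = X_i Y_i with Y_i drawn from the standard
# uniform distribution, then concatenate the blocks.
D = np.hstack([X @ rng.uniform(size=(r, n_per)) for X in bases])
```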

Representation learning and stability:  We exhibit MFC’s ability to learn the representation matrix, i.e., the block-diagonal structure, of data without and with errors, and show its efficiency and stability to obtain the local solution.

The representation matrix of the clean data is shown in Figure 2(a), and the objective function values versus iteration over 10 runs are plotted in Figure 2(b). From Figure 2, we can see that MFC can precisely discover the block-diagonal structure when the data samples are clean. Moreover, our proposed first-order optimization algorithm converges quickly (in about 15 iterations) to a very stable value despite using different random initializations. These observations demonstrate that MFC is both effective and efficient for analyzing the structure of multiple subspaces.

We further consider the cases where the data contain errors. For random corruptions and sample-specific outliers with error ratio (the proportion of data samples contaminated with errors) 0.1, 0.3, and 0.5, the corresponding representation matrices are plotted in Figure 3 and Figure 4, respectively. We observe that even when the data are contaminated with different types of errors at a high level, the representation matrix learnt via MFC is still close to block-diagonal.

Clustering accuracy vs. error ratio:  We compare MFC with all the other methods in terms of subspace clustering to validate its robustness against different types of errors. The performance of all compared methods versus different error ratios, ranging from 0 to 0.8, of random corruptions and sample-specific outliers is depicted in Figures 5(a) and 5(b). It can be seen that the clustering accuracy of MFC is consistently the highest, with SSC and LRR coming next. Surprisingly, even when the ratio is 0.6, MFC still obtains high accuracy in both cases, whereas for LRR and SSC, the accuracies are approximately 70% and 85% for random corruptions, and 60% and 60% for sample-specific outliers, respectively. The performance of PCA, NMF, and SNMF decreases sharply, which validates that single subspace learning methods fail to work when the error ratio increases.

VI-A3 Summary

  • MFC can accurately reconstruct the data and discover the basis that span each subspace;

  • MFC is capable of obtaining the exact block-diagonal representation matrix when data is clean;

  • MFC shows a strong ability to resist errors at a high ratio and is more powerful for subspace clustering than the state-of-the-art LRR and SSC methods, not to mention single subspace learning methods;

  • MFC converges quickly to a very stable solution via our proposed first-order optimization algorithm.

Fig. 2: (a) Representation matrix learnt on clean data. (b) Objective function values vs. iteration.
(a) ratio=0.1
(b) ratio=0.3
(c) ratio=0.5
Fig. 3: Representation matrix learnt on data contaminated by different ratios of random corruptions.
(a) ratio=0.1
(b) ratio=0.3
(c) ratio=0.5
Fig. 4: Representation matrix learnt on data contaminated by different ratios of sample-specific outliers.
Fig. 5: Clustering accuracy vs. ratio of (a) random corruptions and (b) sample-specific outliers in the data.

VI-B Results on Real-World Datasets

In this part, we focus on the following tasks: subspace clustering and its associated computational complexity, subspace basis learning, and subspace recovery. Experiments are carried out on three real-world face datasets, AR, Yale, and Extended Yale B (EYaleB), and on the USPS handwritten digit dataset.

VI-B1 Dataset Description

AR contains over 3,000 images of 126 individuals (70 men and 56 women). The images were taken on two different occasions with different facial expressions, illumination conditions, and occlusions (sunglasses and scarf) under strictly controlled conditions. Raw images are 768x576 pixels with 24 bits of depth. In the experiment, images are resized to 48x48 pixels.

Yale contains 165 grayscale images of 15 individuals. There are 11 images per subject, one per different facial expression or configuration: center-light, w/glasses, happy, left-light, w/no glasses, normal, right-light, sad, sleepy, surprised, and wink. We simply use the cropped images with size 32x32 pixels.

EYaleB consists of 2,414 images of 38 subjects under 9 poses and 64 illumination conditions. The subset used in the experiment has 20 individuals with around 64 near-frontal images under different illuminations per individual. All images are cropped to 32x32 pixels.

USPS contains 9198 handwritten digit images (from 0 to 9) and the image size is 16x16. In the experiment, we choose 100 images of each digit, thus 1000 images in total.

Some sample images from the four datasets are shown in Figure 6. Statistics of the selected subsets are listed in Table II.

        AR      Yale    EYaleB  USPS
K       3       11      20      10
n       15      15      64      100
m       48x48   32x32   32x32   16x16
TABLE II: Statistics of selected subsets of four datasets
Fig. 6: Sample images from (a) AR, (b) USPS, (c) Yale, and (d) EYaleB.

VI-B2 Subspace Dimension Estimation

For MFC, we need to know the subspace dimension of each subject in advance. A simple yet effective way is to calculate the number of (numerically) nonzero singular values of each subject. Figure 7 depicts the singular values of several randomly chosen subjects in the four datasets. The subspace dimensions for AR, Yale, EYaleB, and USPS are approximately 12, 10, 10, and 12, respectively. Note that for real-world datasets, the singular values often decay gradually toward zero, validating that the images are contaminated with errors. The higher dimension of AR reflects a more complicated face distribution than those of Yale and EYaleB. It is necessary to point out that we cannot distinguish which type of error is contained in the images; therefore, we report the results of both MFC variants.
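A minimal sketch of this estimation step follows. The 95% spectral-energy cutoff is a common heuristic, assumed here because the paper's exact thresholding rule for "nonzero" singular values under noise is not stated:

```python
import numpy as np

def estimate_dim(X, energy=0.95):
    """Estimate the subspace dimension of one subject's data matrix X
    (m x n, columns are samples) as the smallest r whose leading singular
    values capture `energy` of the total spectral energy. The 95% energy
    threshold is an illustrative heuristic, not the paper's exact rule."""
    s = np.linalg.svd(X, compute_uv=False)      # descending singular values
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy) + 1)
```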

Fig. 7: Singular values of several randomly chosen subjects in (a) AR, (b) Yale, (c) EYaleB, and (d) USPS.

VI-B3 Subspace Clustering

In this part, we perform subspace clustering on real-world datasets: given face images or handwritten digits of multiple categories, the task is to group them according to their respective subspaces. Our purpose is to verify that it is more advantageous to (1) model the face distribution using multiple subspaces rather than a single one; (2) learn the basis rather than leverage the data themselves as the basis.

Tables III and IV show the clustering performance on Yale and EYaleB, respectively. To study the effect of the category number, we choose K ranging from 2 to 11 on Yale and from 2 to 20 with an interval of 2 on EYaleB. From the results, we have the following observations:

(1) As a whole, MFC performs the best on both datasets. By learning a respective subspace for each subject, together with a direct column ℓ0-norm constraint and an error correction term, MFC is discriminative enough to separate face images of different subjects. Essentially, this is the key factor that makes it outperform single subspace analysis methods. Moreover, by using a column ℓ0-norm constraint to directly control the sparsity, MFC is superior to indirect sparsity penalties and avoids tuning a sparsity hyperparameter;
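The column ℓ0-norm constraint admits a simple closed-form Euclidean projection: keep the s largest-magnitude entries of each column and zero the rest. The paper's full update step may combine this with other terms; the projection alone looks like:

```python
import numpy as np

def project_column_l0(Z, s):
    """Euclidean projection of Z onto {Z : ||z_j||_0 <= s for every
    column j}: keep the s largest-magnitude entries of each column and
    zero the rest. This is the natural projection step for a column
    l0-norm constraint; MFC's full update may differ in detail."""
    P = np.zeros_like(Z)
    idx = np.argsort(-np.abs(Z), axis=0)[:s]    # top-s row indices per column
    cols = np.arange(Z.shape[1])
    P[idx, cols] = Z[idx, cols]
    return P
```

Unlike an ℓ1 penalty, this step enforces the sparsity level s exactly, which is why no sparsity weight needs tuning.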

(2) SSC and LRR can also achieve promising results, as they are multi-subspace learning methods with error correction. However, their use of contaminated face images as the basis is problematic and irrational; the accuracy gap between them and MFC validates this argument;

(3) Errors contained in face images are more likely to be sample-specific outliers, as on both Yale and EYaleB the result of MFC2 is better than that of MFC1. Corroborative evidence is that LRR, which also uses the ℓ2,1 term, is slightly superior to SSC, which uses the ℓ1 term;

(4) A sparsity constraint is important for clustering. The better results of SNMF compared with NMF on both datasets verify the importance of using a sparsity constraint;

(5) Modeling the face image distribution with multiple subspaces is essential. Since PCA, NMF, and SNMF are single subspace learning methods, their performance is worse than that of SSC, LRR, MFC1, and MFC2. When the sample size of each subject is small, as in Yale, the multi-subspace structure is not so obvious and the accuracy gap is not remarkable. However, even when the sample size of each subject is sufficient, as in EYaleB, all of them obtain poor results on the whole. Even worse, they fail to work when K increases to a certain value, say 10, while the remaining methods still work well.

We also perform subspace clustering on USPS with K from 2 to 10. Our aim is to see whether handwritten digits are distributed in “independent subspaces”, as no previous work mentions such an assumption.

Subspaces of handwritten digits are probably not independent. From the results listed in Table V, we notice that PCA, NMF, and SNMF all achieve promising results, even comparable with SSC and LRR for some values of K. Still, our MFC performs the best, but the superiority is not as obvious as on EYaleB. One possible reason is that the subspaces of handwritten digits are not independent in essence. In this sense, modeling the distribution of handwritten digits by disjoint subspaces (or submanifolds) may be more reasonable.

K  PCA  NMF  SNMF SSC  LRR  MFC1 MFC2
2 0.76 0.82 0.82 0.87 0.95 0.95 0.95
3 0.61 0.70 0.69 0.64 0.68 0.70 0.74
4 0.63 0.64 0.68 0.61 0.64 0.66 0.76
5 0.53 0.64 0.67 0.67 0.67 0.65 0.69
6 0.56 0.53 0.58 0.65 0.58 0.63 0.68
7 0.53 0.55 0.56 0.62 0.57 0.58 0.61
8 0.55 0.49 0.57 0.60 0.63 0.56 0.60
9 0.43 0.38 0.49 0.55 0.57 0.51 0.58
10 0.45 0.47 0.54 0.54 0.54 0.51 0.54
11 0.54 0.49 0.54 0.57 0.57 0.50 0.59
Avg. 0.56 0.57 0.61 0.63 0.64 0.63 0.66
TABLE III: Clustering performance on Yale
K  PCA  NMF  SNMF SSC  LRR  MFC1 MFC2
2 0.51 0.80 1.00 0.98 1.00 1.00 1.00
4 0.35 0.59 0.63 0.88 0.86 0.85 0.89
6 0.30 0.72 0.57 0.84 0.82 0.84 0.86
8 0.20 0.37 0.48 0.76 0.74 0.69 0.76
10 0.23 0.45 0.41 0.74 0.71 0.68 0.75
12 0.22 0.37 0.41 0.67 0.70 0.70 0.70
14 0.18 0.36 0.39 0.64 0.68 0.77 0.72
16 0.18 0.37 0.39 0.63 0.67 0.72 0.73
18 0.15 0.35 0.35 0.63 0.68 0.74 0.74
20 0.15 0.37 0.37 0.60 0.66 0.72 0.72
Avg. 0.25 0.47 0.49 0.74 0.75 0.77 0.79
TABLE IV: Clustering performance on EYaleB
K  PCA  NMF  SNMF SSC  LRR  MFC1 MFC2
2 0.92 0.98 0.96 0.99 0.99 1.00 1.00
3 0.80 0.91 0.86 0.95 0.91 0.94 0.94
4 0.63 0.76 0.78 0.85 0.89 0.93 0.87
5 0.67 0.64 0.74 0.80 0.79 0.87 0.88
6 0.57 0.56 0.67 0.80 0.77 0.78 0.76
7 0.56 0.55 0.58 0.68 0.65 0.74 0.64
8 0.56 0.57 0.57 0.54 0.62 0.66 0.69
9 0.54 0.52 0.51 0.46 0.61 0.68 0.63
10 0.54 0.47 0.48 0.53 0.47 0.64 0.58
Avg. 0.64 0.66 0.68 0.73 0.74 0.80 0.78
TABLE V: Clustering performance on USPS

VI-B4 Computational Complexity

Another advantage of MFC over LRR and SSC is its computational efficiency with respect to the sample size. Our aim in this part is therefore to validate that, as the sample size increases, MFC is more efficient than SSC and LRR. As mentioned, we quantify the efficiency of each method using the total running time (T), the number of iterations (I), and the averaged running time per iteration (T/I).

The results on Yale, EYaleB, and USPS are listed in Tables VI, VII, and VIII, respectively. From Table VI, we see that

(1) The running time per iteration of MFC1 or MFC2 is larger than that of SSC and LRR on a dataset with a small sample size. The averaged T/I of MFC1 or MFC2 is approximately 5 and 3 times that of SSC and LRR, respectively. This is because, when the sample size of each subject is small as in Yale, the dimensionality of the data samples (m) dominates the computational complexity in each iteration. Recall that the complexity of MFC is quadratic with respect to m, while that of LRR and SSC is linear.

(2) The number of iterations of MFC is much smaller than those of SSC and LRR. Despite a higher T/I, the averaged iteration count of MFC1 or MFC2 is 24 or 23, much smaller than those of SSC and LRR, which are 93 and 57, respectively. This observation demonstrates MFC’s faster convergence rate than LRR and SSC. Ultimately, the total running times of these methods are similar.

(3) Increasing the sample size makes MFC more advantageous. When the sample size of each subject increases to 64 in EYaleB, the results in Table VII show that the T/I values of SSC and LRR are larger than those of MFC1 or MFC2 on average. This is due to the fact that the computational complexity of MFC is linear with respect to the sample size (n), while that of LRR and SSC is quadratic and cubic, respectively. Similar to the results on Yale, compared with SSC (110) and LRR (65), the averaged I values of MFC1 (31) and MFC2 (30) are much smaller. Consequently, the total running times of MFC1 are just 1/4 and 1/5.7 of those of SSC and LRR, and those of MFC2 are just 1/6 and 1/8.5, respectively. To further highlight the efficiency of MFC, we increase the number of samples per category to 100 in USPS; the related results are displayed in Table VIII. We observe that when the category number is 10, it takes SSC and LRR almost 94 and 255 seconds to complete the clustering task, while both MFC1 and MFC2 consume less than 10 seconds. Again, the smaller I of MFC1 and MFC2 compared with SSC and LRR suggests that they reach a stable solution more quickly.
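The scaling argument can be made concrete with back-of-the-envelope per-iteration flop counts. Constants are dropped and m, n, r (data dimensionality, sample size, total subspace dimension) are illustrative; this is not the paper's exact complexity analysis:

```python
def per_iter_flops(n, m=1024, r=60):
    """Rough per-iteration flop counts (constants dropped). A factor
    update touching an m x n data matrix and an r x n representation is
    linear in n; methods that manipulate an n x n representation matrix
    pay n^2 up to n^3 (e.g., for an SVD). Illustrative only."""
    return {"factor_update": m * n * r,   # linear in n (MFC-style)
            "n_x_n_update": n ** 2 * r,   # quadratic in n
            "n_x_n_svd": n ** 3}          # cubic in n

# growing n by 10x multiplies the three costs by 10, 100, and 1000
```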

     SSC              LRR              MFC1             MFC2
K    T(s) I T/I      T(s) I T/I      T(s) I T/I      T(s) I T/I
2 0.120 92 0.001 0.105 57 0.002 0.191 22 0.009 0.218 24 0.009
3 0.196 99 0.002 0.167 55 0.003 0.281 22 0.013 0.316 22 0.014
4 0.303 91 0.003 0.286 56 0.005 0.416 22 0.019 0.446 22 0.020
5 0.445 96 0.005 0.396 57 0.007 0.592 22 0.027 0.537 22 0.024
6 0.483 89 0.005 0.551 57 0.010 0.727 23 0.032 0.747 23 0.033
7 0.643 89 0.007 0.718 57 0.013 0.905 23 0.039 0.923 23 0.040
8 0.830 90 0.009 0.773 57 0.014 1.149 27 0.043 1.118 24 0.047
9 1.007 92 0.011 1.051 57 0.018 1.339 25 0.054 1.503 24 0.063
10 1.196 93 0.013 1.368 57 0.024 1.630 25 0.065 1.598 24 0.067
11 1.480 94 0.016 1.380 58 0.024 1.760 28 0.063 1.765 24 0.074
Avg. 0.670 93 0.007 0.679 57 0.012 0.899 24 0.036 0.917 23 0.039
TABLE VI: Computational complexity on Yale
     SSC              LRR              MFC1             MFC2
K    T(s) I T/I      T(s) I T/I      T(s) I T/I      T(s) I T/I
2 1.73 107 0.02 1.73 61 0.03 1.14 24 0.05 0.96 24 0.04
4 5.69 108 0.05 6.32 64 0.10 2.70 25 0.11 2.19 24 0.09
6 12.20 109 0.11 14.19 64 0.22 4.95 25 0.20 3.36 24 0.14
8 21.33 110 0.19 25.14 64 0.39 7.62 27 0.28 5.48 32 0.17
10 37.54 110 0.34 40.82 64 0.64 11.17 35 0.32 7.75 32 0.24
12 52.08 111 0.47 67.52 65 1.04 14.32 35 0.41 9.77 32 0.31
14 69.28 111 0.62 109.57 66 1.66 19.09 35 0.54 12.39 32 0.39
16 93.08 111 0.84 142.68 65 2.20 22.96 34 0.67 16.18 32 0.51
18 124.00 111 1.12 185.63 65 2.85 28.12 35 0.80 19.17 32 0.60
20 152.89 111 1.38 263.19 72 3.65 34.02 35 0.97 23.60 32 0.74
Avg. 56.98 110 0.51 85.68 65 1.28 14.61 31 0.44 10.09 30 0.32
TABLE VII: Computational complexity on EYaleB
     SSC              LRR              MFC1             MFC2
K    T(s) I T/I      T(s) I T/I      T(s) I T/I      T(s) I T/I
2 6.28 108 0.06 2.54 61 0.04 1.22 27 0.05 0.91 31 0.03
3 13.21 110 0.12 6.33 63 0.10 1.87 27 0.07 1.38 31 0.04
4 22.07 110 0.20 12.24 63 0.19 2.47 29 0.09 1.94 30 0.06
5 29.18 111 0.26 21.52 63 0.34 3.13 30 0.10 2.66 30 0.08
6 38.06 111 0.34 34.84 64 0.54 4.36 40 0.11 3.76 36 0.10
7 54.53 111 0.49 107.60 65 1.65 5.27 39 0.14 4.36 36 0.12
8 64.16 111 0.58 137.31 65 2.11 6.64 40 0.17 5.64 36 0.16
9 71.58 111 0.64 192.85 65 2.97 8.80 39 0.22 7.12 35 0.20
10 93.97 111 0.85 255.04 65 3.92 9.90 39 0.25 8.48 35 0.24
Avg. 43.67 110 0.39 85.59 64 1.32 4.85 34 0.13 4.03 33 0.12
TABLE VIII: Computational complexity on USPS

VI-B5 Face Basis Learning and Reconstruction on AR

Subspace basis learning aims to learn a representative basis that provides a compact representation for each data sample. Traditional methods assume that the face images of all subjects can be represented by a single lower-dimensional subspace. However, previous work [12] revealed that, under the Lambertian assumption, face images of a subject with a fixed pose and varying illumination lie close to a linear subspace. Thus, for multiple subjects, the reasonable way is to learn, for each subject, its underlying basis. To confirm this argument, we conduct an experiment to visualize the learnt face basis (note that SSC and LRR are unable to learn the basis, so we do not show their results). Moreover, we also display the reconstructed face images and the associated errors. To see the difference from single subspace learning methods, we choose PCA for comparison.

Fig. 8: Some examples of learnt face basis, reconstructed face images, and errors of three subjects in AR. (a) and (b) are representative face basis obtained by PCA and MFC. The five columns in (c) and (d) are raw face images, reconstructed face images by PCA, errors corrected by PCA, reconstructed face images by MFC, and errors corrected by MFC, respectively. (c) shows results on some images with laughter, where the “teeth” are treated as the error. (d) shows results on some images with sunglasses, where the “sunglasses” are treated as the error. We do not show results for SSC and LRR as they are unable to learn the face basis.

We randomly choose three subjects (subjects 1, 2, and 55, each with 15 samples) in AR to learn the basis, which is then used for face reconstruction. Since some face images in AR are grossly corrupted, for instance by sunglasses, we treat them as sample-specific outliers and choose MFC2. Some examples of learnt face basis, reconstructed face images, and errors of PCA and MFC are plotted in Figure 8, from which we have several observations:

(1) MFC can learn the corresponding basis for each subspace. The basis learnt via MFC can be separated into three parts: the basis in each row represents one subject, meaning the face images of one subject lie on its own underlying subspace. In contrast, the basis of PCA are mixed together, i.e., each basis vector contains information from face images across the three subjects;

(2) MFC is effective in reconstructing face images. The reconstructed face images via MFC are very clean, while those via PCA are blurry and untidy.

(3) MFC is able to correct discriminant errors, which can further be used for recognition tasks. The errors are almost completely extracted via MFC. In Figure 8(c), the errors are “teeth”; one reasonable explanation is that the face images in the training set do not show teeth. Similarly, the errors in Figure 8(d) are “sunglasses”. Interestingly, the errors contain discriminant features. In fact, they can be used for facial emotion recognition or object recognition [11]; for instance, we can leverage these errors to detect whether one is laughing or wearing sunglasses. Nevertheless, the errors obtained by PCA cannot provide any useful information.
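One way to operationalize this observation: given the learnt factors, the column-wise energy of the corrected error matrix separates grossly corrupted samples (e.g., sunglasses) from clean ones. The quantile rule below is an illustrative heuristic, not the paper's procedure:

```python
import numpy as np

def flag_outliers(X, U, Z, q=0.9):
    """Given a learnt basis U and representation Z, the corrected error
    E = X - U @ Z isolates what the subspaces cannot explain (teeth,
    sunglasses, ...). Columns whose error energy exceeds the q-quantile
    are flagged as grossly corrupted samples."""
    E = X - U @ Z
    energy = np.linalg.norm(E, axis=0)          # per-sample error energy
    return np.where(energy > np.quantile(energy, q))[0]
```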

VI-B6 Summary

  • MFC (either MFC1 or MFC2) is not only effective for subspace clustering and subspace basis learning, but also computationally efficient.

  • The advantage of MFC over single subspace learning methods is magnified when handling data generated from multiple subspaces.

  • The advantage over SSC and LRR is boosted when handling data containing gross errors and/or data with a large sample size.

VII Conclusions and Future Work

In this paper, we propose a novel method, called Column ℓ0-norm constrained Matrix Factorization (MFC), for robustly analyzing the structure of data generated from multiple categories. By learning the basis with an orthonormal constraint, MFC is able to discover the mixture subspace structure and, with the specific regularization, is robust to different types of errors. Moreover, MFC directly imposes an ℓ0-norm constraint on the representation matrix, which achieves an exact (or approximately) block-diagonal structure when the data are clean (or contain errors). We propose a very efficient first-order alternating direction type algorithm to stably solve the nonconvex and nonsmooth objective function of MFC. Experimental results on synthetic and real-world datasets verify that, besides its superiority and efficiency over traditional and state-of-the-art methods for multi-subspace recovery and clustering, MFC also demonstrates its uniqueness for multi-subspace basis learning and direct representation learning.

Although MFC has these advantages, several problems should be investigated in the future. First, we assume that the multiple subspaces are independent; in practice, however, the data may be generated from disjoint subspaces. In this case, how can we incorporate this constraint or regularization into the model? More challenging still, how can we process data that are generated from nonlinear manifolds instead of linear subspaces? Second, the computational complexity of MFC is linear with respect to the sample size, so its extension to massive datasets will be investigated in future work.


  • [1] B. Wang, R. Liu, C. Lin, and X. Fan, “Matrix factorization with column l0-norm constraint for robust multi-subspace analysis,” in 2015 IEEE ICDM Workshop, 2015, pp. 1189–1195.
  • [2] J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” science, vol. 290, no. 5500, pp. 2319–2323, 2000.
  • [3] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” science, vol. 290, no. 5500, pp. 2323–2326, 2000.
  • [4] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural computation, vol. 15, no. 6, pp. 1373–1396, 2003.
  • [5] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” IEEE TPAMI, vol. 19, no. 7, pp. 711–720, 1997.
  • [6] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang, “Face recognition using laplacianfaces,” IEEE TPAMI, vol. 27, no. 3, pp. 328–340, 2005.
  • [7] B.-H. Wang, C. Lin, X.-F. Zhao, and Z.-M. Lu, “Neighbourhood sensitive preserving embedding for pattern classification,” IET IP, vol. 8, no. 8, pp. 489–497, 2014.
  • [8] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Advances in neural information processing systems, 2005, pp. 1473–1480.
  • [9] S. Rao, R. Tron, R. Vidal, and Y. Ma, “Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories,” IEEE TPAMI, vol. 32, no. 10, pp. 1832–1845, 2010.
  • [10] J. Yan and M. Pollefeys, “A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate,” in ECCV.   Springer, 2006, pp. 94–106.
  • [11] G. Liu and S. Yan, “Latent low-rank representation for subspace segmentation and feature extraction,” in IEEE ICCV, 2011, pp. 1615–1622.
  • [12] R. Basri and D. Jacobs, “Lambertian reflectance and linear subspaces,” IEEE TPAMI, vol. 25, no. 2, pp. 218–233, 2003.
  • [13] K. Tang, R. Liu, Z. Su, and J. Zhang, “Structure-constrained low-rank representation,” IEEE TNNLS, vol. 25, no. 12, pp. 2167–2179, 2014.
  • [14] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
  • [15] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal ACM, vol. 58, no. 3, p. 11, 2011.
  • [16] R. Liu, Z. Lin, and Z. Su, “Learning markov random walks for robust subspace clustering and estimation,” Neural Networks, vol. 59, pp. 1–15, 2014.
  • [17] G. Liu, Z. Lin, X. Tang, and Y. Yu, “Unsupervised object segmentation with a hybrid graph model (hgm),” IEEE TPAMI, vol. 32, no. 5, pp. 910–924, 2010.
  • [18] R. Vidal, Y. Ma, and S. Sastry, “Generalized principal component analysis (GPCA),” IEEE TPAMI, vol. 27, no. 12, pp. 1945–1959, 2005.
  • [19] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” FOUND COMPUT MATH, vol. 9, no. 6, pp. 717–772, 2009.
  • [20] D. L. Donoho, “For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution,” COMMUN PUR APPL MATH, vol. 59, no. 6, pp. 797–829, 2006.
  • [21] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE TPAMI, vol. 35, no. 11, pp. 2765–2781, 2013.
  • [22] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE TPAMI, vol. 35, no. 1, pp. 171–184, 2013.
  • [23] R. Vidal and P. Favaro, “Low rank subspace clustering (LRSC),” PRL, vol. 43, pp. 47–61, 2014.
  • [24] R. Liu, Z. Lin, F. De la Torre, and Z. Su, “Fixed-rank representation for unsupervised visual learning,” in IEEE CVPR, 2012, pp. 598–605.
  • [25] Y. Ni, J. Sun, X. Yuan, S. Yan, and L.-F. Cheong, “Robust low-rank subspace segmentation with semidefinite guarantees,” in IEEE ICDMW, 2010, pp. 1179–1188.
  • [26] C. Peng, Z. Kang, H. Li, and Q. Cheng, “Subspace clustering using log-determinant rank approximation,” in ACM SIGKDD, ser. KDD ’15.   New York, NY, USA: ACM, 2015, pp. 925–934.
  • [27] G. Liu, Q. Liu, and P. Li, “Blessing of dimensionality: Recovering mixture data via dictionary pursuit,” IEEE TPAMI, vol. 39, no. 1, pp. 47–60, 2017.
  • [28] E. J. Candes and Y. Plan, “Matrix completion with noise,” Proc. of IEEE, vol. 98, no. 6, pp. 925–936, 2010.
  • [29] H. Xu, C. Caramanis, and S. Sanghavi, “Robust pca via outlier pursuit,” in NIPS, 2010, pp. 2496–2504.
  • [30] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
  • [31] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in optimization, vol. 1, no. 3, pp. 123–231, 2013.
  • [32] H. Liu, G. Yang, Z. Wu, and D. Cai, “Constrained concept factorization for image representation,” IEEE Trans. Cybernetics, vol. 44, no. 7, pp. 1214–1224, 2014.
  • [33] N. Srebro, J. Rennie, and T. S. Jaakkola, “Maximum-margin matrix factorization,” in NIPS, 2004, pp. 1329–1336.
  • [34] C. Ding, T. Li, M. Jordan et al., “Convex and semi-nonnegative matrix factorizations,” IEEE TPAMI, vol. 32, no. 1, pp. 45–55, 2010.
  • [35] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” JMLR, vol. 11, pp. 19–60, 2010.
  • [36] W. Xu, X. Liu, and Y. Gong, “Document clustering based on non-negative matrix factorization,” in ACM SIGIR.   ACM, 2003, pp. 267–273.
  • [37] R. Sandler and M. Lindenbaum, “Nonnegative matrix factorization with earth mover’s distance metric for image analysis,” IEEE TPAMI, vol. 33, no. 8, pp. 1590–1602, 2011.
  • [38] S. Zafeiriou, A. Tefas, I. Buciu, and I. Pitas, “Exploiting discriminant information in nonnegative matrix factorization with application to frontal face verification,” IEEE TNN, vol. 17, no. 3, pp. 683–695, 2006.
  • [39] B. Wang, M. Pang, C. Lin, and X. Fan, “Graph regularized non-negative matrix factorization with sparse coding,” in ChinaSIP.   IEEE, 2013, pp. 476–480.
  • [40] C. Févotte, N. Bertin, and J.-L. Durrieu, “Nonnegative matrix factorization with the itakura-saito divergence: With application to music analysis,” Neural computation, vol. 21, no. 3, pp. 793–830, 2009.
  • [41] A. Cichocki, R. Zdunek, and S.-i. Amari, “New algorithms for non-negative matrix factorization in applications to blind source separation,” in IEEE ICASSP, vol. 5, 2006, pp. V–V.
  • [42] Y.-X. Wang and Y.-J. Zhang, “Nonnegative matrix factorization: A comprehensive review,” IEEE TKDE, vol. 25, no. 6, pp. 1336–1353, 2013.
  • [43] J. Eggert and E. Körner, “Sparse coding and nmf,” in IEEE IJCNN, vol. 4, 2004, pp. 2529–2533.
  • [44] H. Kim and H. Park, “Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis,” Bioinformatics, vol. 23, no. 12, pp. 1495–1502, 2007.
  • [45] P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” JMLR, vol. 5, pp. 1457–1469, 2004.
  • [46] D. L. Donoho and M. Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization,” PNAS, vol. 100, no. 5, pp. 2197–2202, 2003.
  • [47] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images,” SIAM review, vol. 51, no. 1, pp. 34–81, 2009.
  • [48] S. A. Vavasis, “On the complexity of nonnegative matrix factorization,” SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–1377, 2009.
  • [49] M. Aharon, M. Elad, and A. Bruckstein, “K-svd: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE TSP, vol. 54, no. 11, pp. 4311–4322, 2006.
  • [50] J. Tropp, A. C. Gilbert et al., “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE TIT, vol. 53, no. 12, pp. 4655–4666, 2007.
  • [51] R. Peharz and F. Pernkopf, “Sparse nonnegative matrix factorization with l0-constraints,” Neurocomputing, vol. 80, pp. 38–46, 2012.
  • [52] B. Wang, C. Lin, X. Fan, N. Jiang, and D. Farina, “Hierarchical bayes based adaptive sparsity in gaussian mixture model,” Pattern Recognition Letters, vol. 49, pp. 238–247, 2014.
  • [53] J. Yang, W. Yin, Y. Zhang, and Y. Wang, “A fast algorithm for edge-preserving variational multichannel image restoration,” SIAM Journal on Imaging Sciences, vol. 2, no. 2, pp. 569–592, 2009.
  • [54] K. P. Murphy, Machine learning: a probabilistic perspective.   MIT press, 2012.
  • [55] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE TPAMI, vol. 22, no. 8, pp. 888–905, 2000.
  • [56] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE TPAMI, vol. 31, no. 2, pp. 210–227, 2009.
  • [57] M. D. Plummer and L. Lovász, Matching theory.   Elsevier, 1986.

-a Proof of Theorem 1

Let be an optimal solution of Eq.(2) under the given basis . We can then decompose into two parts, i.e., the diagonal and non-diagonal , where

with and .

Let respectively denote the -th column of . Suppose that . It is easy to check , therefore . However, . As all subspaces are independent, meaning , we must require . Thus , and , . Therefore, the solution to Eq.(3) is block-diagonal.

-B Proof of Theorem 2

We first unfold the objective function:

Using the constraint and the SVD of , we transform the above minimization into the maximization problem

Note that is a constant. As is a diagonal matrix, the maximum is achieved when is also a diagonal matrix with positive diagonal elements. Therefore, we have , and thus .
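The closed form in this proof is the classical orthogonal Procrustes solution. Since the symbols above were lost in extraction, here is a numerical restatement (names M, A, S, B are illustrative): for any matrix with SVD M = A S Bᵀ, the orthogonal Q maximizing tr(QᵀM) is Q = ABᵀ, attaining the sum of the singular values.

```python
import numpy as np

def procrustes(M):
    """Orthogonal Procrustes step: argmax over Q with Q^T Q = I of
    tr(Q^T M) is Q = A B^T, where M = A S B^T is the SVD. This is the
    standard result the proof above follows; the paper's exact symbols
    may differ."""
    A, _, Bt = np.linalg.svd(M, full_matrices=False)
    return A @ Bt
```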

-C Proof of Theorem 3

For any , we first split its elements into the negative and positive parts, with corresponding indices defined as and . Denote and . Then, we have two useful properties:

(1) ;

(2) .

Property (1) is obvious, and (2) can be proved simply:

Based on the above properties, we have

Note that we remove the constant in the fourth equation.

In the last equation, when , . As , so . Therefore, the minimum is achieved if and only if for , resulting in . Finally, we have