Robust Multi-subspace Analysis Using Novel Column $\ell_0$-norm Constrained Matrix Factorization
Abstract
We study the underlying structure of data (approximately) generated from a union of independent subspaces. Traditional methods learn only one subspace, failing to discover the multi-subspace structure, while state-of-the-art methods analyze the multi-subspace structure using the data themselves as the dictionary, which cannot offer an explicit basis to span each subspace and is sensitive to errors via such an indirect representation. Additionally, they suffer from high computational complexity, quadratic or cubic in the sample size.
To tackle all these problems, we propose a method, called Matrix Factorization with Column $\ell_0$-norm constraint (MFC), that can simultaneously learn the basis for each subspace, generate a direct sparse representation for each data sample, and remove errors in the data in an efficient way. Furthermore, we develop a first-order alternating direction algorithm, whose computational complexity is linear in the sample size, to stably and effectively handle the non-convex objective function and non-smooth $\ell_0$-norm constraint of MFC. Experimental results on both synthetic and real-world datasets demonstrate that, besides its superiority over traditional and state-of-the-art methods for subspace clustering, data reconstruction, and error correction, MFC is also unique in its ability to learn multi-subspace bases and generate direct sparse representations.
I Introduction
The observed data are extremely high dimensional in this “big data” era. Typically, these data reside in a much lower-dimensional latent subspace instead of being uniformly distributed in the high-dimensional ambient space. Thus, it is of great importance to reveal the underlying structure of the data, as it helps reduce the computational cost and enables a compact representation for learning. Such an idea has been successfully applied in various research communities, e.g., dimensionality reduction [2, 3, 4], face recognition [5, 6, 7], and metric learning [8].
Subspace methods have been widely used to analyze data, and the linear subspace (a linear subspace is a vector space that is a subset of some other higher-dimensional vector space) is the most common choice for its simplicity and computational efficiency. Additionally, linear subspaces have shown their effectiveness in modeling real-world problems such as motion segmentation [9, 10], face clustering [11, 12], and handwritten digit recognition [13]. Consequently, subspace analysis has received much attention in the past decades. For instance, principal component analysis (PCA) aims to learn a subspace that retains the maximal variance of the data. Nonnegative matrix factorization (NMF) [14] is designed to learn both a nonnegative basis and a nonnegative parts-based representation. Robust PCA (RPCA) [15] assumes that the data are approximately drawn from a low-rank subspace while perturbed by sparse noise. The basic assumption of these methods is a single subspace, which is not the case in many practical applications. A more reasonable way is to consider the data as lying on or near a union of linear subspaces. Unfortunately, the generalization to handle multiple subspaces is quite challenging.
Recently, multi-subspace analysis has attracted increasing interest in visual data analysis [16, 17, 18]. Derived from recent advances in compressive sensing [19, 20], methods including [21, 22, 23, 24, 25, 26, 27] have incorporated sparse and/or low-rank regularization into their formulations to model the mixture of linear structures for clean data (data points strictly sampled from their respective subspaces) as well as to deal with errors in data, e.g., noise [28], missing entries [19], corruptions [15], and outliers [29]. Among them, sparse subspace clustering (SSC) [21] and low-rank representation (LRR) [22] stand out as the two most popular methods; they formulate the discovery of the multi-subspace structure as finding a sparse or low-rank representation of the data samples using the data themselves as the dictionary (basis). It has been shown in the literature that these methods can achieve a more accurate subspace structure than single-subspace analysis methods. However, both SSC and LRR have the following major drawbacks. First, they use the original data as the basis rather than learning the basis explicitly, which is problematic when data contain errors yet are still used for reconstruction and/or clustering. Second, neither sparse representation via regularization nor low-rank representation under a global constraint provides a direct description of each data sample; in other words, data samples cannot find the subspace where they lie, and the errors they contain cannot be removed. Third, SSC and LRR suffer from computational complexity that is quadratic and cubic in the sample size, respectively.
Our work: In this paper, to tackle all the above problems, we aim to simultaneously learn the basis for each subspace, directly generate a sparse representation for each data sample, and remove errors in the data (e.g., random corruptions, where a fraction of random entries of the data are grossly corrupted, and sample-specific outliers, where a fraction of the data samples are far away from their respective subspaces) in an efficient way. To this end, we propose a novel column $\ell_0$-norm constrained matrix factorization (MFC) method, and develop a first-order alternating direction algorithm to stably and efficiently solve the non-smooth and non-convex objective function of MFC.
Specifically, given a collection of data approximately generated from a union of independent subspaces with equal subspace dimension, we explore their subspace structure from the matrix factorization perspective in the following way. First, we restrict the basis matrix to be orthonormal, which guarantees that the learnt basis indeed span multiple subspaces. Then, we pursue a direct sparse representation of each data sample using only the basis that form its underlying subspace. To achieve this goal, we leverage an $\ell_0$ norm to confine the number of nonzero elements of each column of the representation matrix. Furthermore, to handle different types of errors contained in the data, we incorporate different regularizations.
With the above constraints, the objective function of MFC is non-convex in the coupled basis and representation matrices, and non-smooth due to the $\ell_0$ norm constraint. It is thus a very challenging optimization problem that cannot be solved via typical gradient (or subgradient) based methods. To overcome the non-convexity, we develop a first-order optimization algorithm motivated by ADMM [30] to split the multi-variable objective function into several univariate subproblems and solve each subproblem one by one. To tackle the subproblem with the non-smooth $\ell_0$ norm constraint, we design a novel proximal operator, inspired by recent proximal algorithms [31], and solve it efficiently and analytically.
After learning the basis and representation matrices, MFC can accomplish many tasks, e.g., subspace clustering, data reconstruction, and error correction. We evaluate MFC on both synthetic data and real-world datasets. Experimental results demonstrate that, besides its superiority and efficiency over traditional and state-of-the-art methods for subspace clustering, data reconstruction, and error correction, MFC also demonstrates its uniqueness for multi-subspace basis learning and direct representation learning.
In summary, our key contributions are as follows:

We propose a novel MFC method from the matrix factorization perspective to perform multi-subspace analysis. To the best of our knowledge, this is the first work that considers simultaneously learning the basis, generating a direct sparse representation, and correcting errors of data generated from multiple subspaces.

We develop a first-order alternating direction algorithm to stably and effectively solve the non-convex objective function of MFC.

We design a novel proximal operator to analytically and efficiently handle the non-smooth $\ell_0$ norm constraint.

Experimental results demonstrate that MFC shows high resistance to errors, achieves high subspace clustering performance, learns multi-subspace bases, and generates direct sparse representations.
II Related Work
Matrix factorization and its variants: Matrix factorization (MF) has been studied for several decades. Classical methods, such as principal component analysis (PCA), linear discriminant analysis (LDA), and independent component analysis (ICA), are exemplars of low-rank matrix factorization. Afterwards, to solve specific problems, many MF variants have been proposed, including Concept MF [32], Maximum Margin MF [33], Convex MF [34], and Online MF [35]. These methods share in common that there is no constraint on the sign of the elements of the factorized matrices. In contrast, a factorization paradigm called NMF was first proposed in [14] and has been widely used for its sparse, parts-based, and additive representation. NMF is suitable for many research fields, e.g., data mining, image processing, pattern recognition, and machine learning. Its real-world applications include document clustering [36], image segmentation [37], image recognition [38, 39], speech and audio processing [40], and blind source separation [41], to name a few. One can refer to the comprehensive review of NMF in [42].
Approximate sparsity via $\ell_1$ norm: Sparsity in classic NMF occurs as a byproduct rather than a designed objective. To this end, researchers have paid much attention to imposing sparsity on NMF explicitly. These methods either penalize [43, 44] or constrain [45] the $\ell_1$ norm of the representation matrix to yield a sparse representation [46]. As a convex relaxation of the $\ell_0$ norm (the exact sparsity measure, which renders an NP-hard combinatorial optimization problem), the $\ell_1$ norm is used to improve the computational feasibility and efficiency of sparse representation [47]. However, as NMF is NP-hard itself [48], any algorithm for NMF is suboptimal. Thus, a heuristic $\ell_0$ norm constrained NMF might be as appropriate and efficient as $\ell_1$ norm sparse NMF.
$\ell_0$ norm based sparsity: Little work has been done using the $\ell_0$ norm directly, the exact sparsity measure that causes the NP-hard combinatorial optimization problem. K-SVD [49] aims to find an overcomplete dictionary for sparse representation; its sparse approximation is based on an $\ell_0$ norm inequality constraint and is solved via greedy orthogonal matching pursuit (OMP) [50]. Peharz and Pernkopf [51] further proposed sparse NMF, which introduces an $\ell_0$ norm to constrain the basis or representation matrix of approximate NMF and is solved via sparse alternating nonnegative least-squares (sANLS). Wang et al. [52] leveraged hierarchical Bayes to model an adaptive sparseness prior, which has a similar property to an $\ell_0$ norm. However, these methods are under the hidden assumption that data are generated from a single subspace. In addition, they cannot handle various errors. In this work, we focus on analyzing data (approximately) drawn from multiple subspaces. Different from these greedy methods, the $\ell_0$ norm constraint in MFC is solved analytically via our designed proximal operator, which is easy to incorporate into our alternating direction algorithm.
III Notations and Problem Definition
III-A Main Notations
In this paper, vectors and matrices are written as bold lowercase and bold uppercase symbols, respectively. The $i$-th entry of a vector $\mathbf{a}$ is $a_i$. For a matrix $\mathbf{A}$, its $(i,j)$-th entry, $j$-th column, and $i$-th row are denoted as $a_{ij}$, $\mathbf{a}_j$, and $\mathbf{a}^i$, respectively. The horizontal and vertical concatenations of a set of matrices along row and column are denoted by $[\mathbf{A}_1, \dots, \mathbf{A}_K]$ and $[\mathbf{A}_1; \dots; \mathbf{A}_K]$, respectively. The superscript $^T$ stands for the transpose.
For a vector $\mathbf{a}$, its $\ell_2$ norm is $\|\mathbf{a}\|_2 = \sqrt{\sum_i a_i^2}$, and its $\ell_0$ pseudo norm is the number of nonzero entries, i.e., $\|\mathbf{a}\|_0 = \#\{i : a_i \neq 0\}$. For a matrix $\mathbf{A}$, its Frobenius norm is $\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$; its $\ell_1$ norm is $\|\mathbf{A}\|_1 = \sum_{i,j} |a_{ij}|$; its $\ell_{2,1}$ norm is $\|\mathbf{A}\|_{2,1} = \sum_j \|\mathbf{a}_j\|_2$; and its nuclear norm $\|\mathbf{A}\|_*$ is the sum of its singular values. The inner product between two matrices $\mathbf{A}$ and $\mathbf{B}$ is $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{tr}(\mathbf{A}^T\mathbf{B})$, where $\mathrm{tr}(\cdot)$ is the trace operator defined on a square matrix.
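As a concrete illustration, the norms above can be computed in a few lines (a NumPy sketch; the matrix values are arbitrary examples, not from the paper):

```python
import numpy as np

def frob(A):  return np.sqrt((A ** 2).sum())           # Frobenius norm
def l1(A):    return np.abs(A).sum()                   # elementwise l1 norm
def l21(A):   return np.linalg.norm(A, axis=0).sum()   # sum of column l2 norms
def l0(a):    return int(np.count_nonzero(a))          # l0 pseudo-norm of a vector

A = np.array([[3.0, 0.0],
              [4.0, 0.0]])
print(frob(A))      # 5.0
print(l1(A))        # 7.0
print(l21(A))       # 5.0  (columns have l2 norms 5 and 0)
print(l0(A[:, 0]))  # 2
```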
We denote the $k$-th subspace as $\mathcal{S}_k$ and a collection of $K$ subspaces as $\{\mathcal{S}_k\}_{k=1}^{K}$.
III-B Problem Definition
We first present two definitions related to our problem.


Union and sum of subspaces: The union of $K$ subspaces is defined as $\cup_{k=1}^{K}\mathcal{S}_k = \{\mathbf{x} : \mathbf{x} \in \mathcal{S}_k \text{ for some } 1 \le k \le K\}$. The sum of $K$ subspaces is defined as $\sum_{k=1}^{K}\mathcal{S}_k = \{\mathbf{x}_1 + \dots + \mathbf{x}_K : \mathbf{x}_k \in \mathcal{S}_k\}$.

Independent subspaces: $K$ subspaces are independent if and only if $\mathcal{S}_k \cap \sum_{j \neq k}\mathcal{S}_j = \{\mathbf{0}\}$ for every $k$. Equivalently, $\dim\big(\sum_{k=1}^{K}\mathcal{S}_k\big) = \sum_{k=1}^{K}\dim(\mathcal{S}_k)$.
With above definitions, our problem is defined as
Problem 1 (Robust Multisubspace Analysis).
Given a collection of data samples strictly/approximately drawn from a union of independent subspaces with equal subspace dimension, robust multi-subspace analysis is to learn the basis for each subspace, learn the associated representation of each sample upon the basis, and/or correct the error of each data sample.
Formally, suppose $n$ data samples $\mathbf{D} \in \mathbb{R}^{m \times n}$ are generated from a basis matrix $\mathbf{X} \in \mathbb{R}^{m \times Kr}$ that spans $K$ independent subspaces $\{\mathcal{S}_k\}_{k=1}^{K}$, with a corresponding representation matrix $\mathbf{Y} \in \mathbb{R}^{Kr \times n}$, and/or contaminated with errors $\mathbf{E} \in \mathbb{R}^{m \times n}$:

$$\mathbf{D} = \mathbf{X}\mathbf{Y} + \mathbf{E}. \qquad (1)$$

Then, our purpose is, from the observed data $\mathbf{D}$, to learn the basis $\mathbf{X}_k$ for each subspace $\mathcal{S}_k$, the representation $\mathbf{Y}$ upon the basis $\mathbf{X}$, and/or the error $\mathbf{E}$ such that

$$[\mathbf{D}_1, \dots, \mathbf{D}_K] = [\mathbf{X}_1, \dots, \mathbf{X}_K]\,\mathbf{Y} + [\mathbf{E}_1, \dots, \mathbf{E}_K], \qquad (2)$$

where $\mathbf{D} = [\mathbf{D}_1, \dots, \mathbf{D}_K]$ with $\mathbf{D}_k \in \mathbb{R}^{m \times n_k}$ merely collecting the samples from the $k$-th subspace and $\sum_{k=1}^{K} n_k = n$; $\mathbf{E} = [\mathbf{E}_1, \dots, \mathbf{E}_K]$ with $\mathbf{E}_k$ containing the errors of the samples in $\mathbf{D}_k$; $\mathbf{X} = [\mathbf{X}_1, \dots, \mathbf{X}_K]$ with $\mathbf{X}_k \in \mathbb{R}^{m \times r}$ the basis spanning $\mathcal{S}_k$; $\mathbf{Y}$ the representation upon the basis $\mathbf{X}$; and $r$ the subspace dimension, $Kr \le m$.
In the sequel, we first propose a model to deal with data that are clean, i.e., $\mathbf{E} = \mathbf{0}$. Then we generalize our model to handle data that are contaminated with random corruptions or sample-specific outliers, i.e., $\mathbf{E} \neq \mathbf{0}$.
IV The Proposed Method
In this section, we first introduce a theorem that inspires us to formulate the problem with clean data. Then, we generalize the model to handle contaminated data. Finally, we present the algorithm to solve our model.
IV-A Problem Formulation
To start with, we assume that the data samples are strictly generated from $K$ independent subspaces. In doing so, we come to the following conclusion:
Theorem 1.
Suppose the $K$ subspaces $\{\mathcal{S}_k\}_{k=1}^{K}$ are independent and $\mathbf{X}_k$ consists of basis that only span subspace $\mathcal{S}_k$. Then the solution of the representation matrix $\mathbf{Y}$ to Eq. (2) dealing with clean data is block-diagonal:

$$\mathbf{Y} = \begin{bmatrix} \mathbf{Y}_1 & & \\ & \ddots & \\ & & \mathbf{Y}_K \end{bmatrix}, \qquad (3)$$

where $\mathbf{Y}_k \in \mathbb{R}^{r \times n_k}$ and $\sum_{k=1}^{K} n_k = n$.
Proof.
See Appendix A. ∎
Theorem 1 gives the condition under which we can achieve a block-diagonal representation matrix, which is of vital importance for analyzing the multi-subspace structure. That is, if we know the basis of the subspaces the clean data lie in, then each block of the representation matrix characterizes one subspace. In practice, however, the basis of each subspace are often unknown; thus, the generated representation matrix cannot be guaranteed to be block-diagonal. Fortunately, Theorem 1 offers us a hint for constructing a satisfactory model: we expect to simultaneously learn the basis matrix and the representation matrix inspired by Theorem 1.
To be more specific, we first impose an orthonormal constraint on the basis matrix to ensure that its columns can indeed form the “basis” to span independent subspaces. In order to generate the block-diagonal representation, we characterize each data sample as a nonnegative combination (we only deal with data with nonnegative entries) of only the basis that span its underlying subspace. To this end, we place a column-wise $\ell_0$ norm constraint on the representation matrix and set the number of nonzero elements of each column equal to the subspace dimension.
Formally, the objective function satisfying the above constraints is defined as

$$\min_{\mathbf{X}, \mathbf{Y}} \ \|\mathbf{D} - \mathbf{X}\mathbf{Y}\|_F^2 \qquad (4)$$
$$\text{s.t.} \ \ \mathbf{X}^T\mathbf{X} = \mathbf{I}, \qquad (5)$$
$$\mathbf{Y} \ge 0, \qquad (6)$$
$$\|\mathbf{y}_j\|_0 = r, \ j = 1, \dots, n, \qquad (7)$$

where Eq. (5) is the orthonormal constraint on $\mathbf{X}$ and $\mathbf{I}$ is a $Kr \times Kr$ identity matrix; Eq. (6) indicates a nonnegative combination of the orthonormal basis, with “$\ge$” taken componentwise; and Eq. (7) is the requirement on the number of nonzero coefficients. For writing simplicity, we omit these constraints in the subsequent related equations.
In practice, the observed data are often contaminated with errors, e.g., noise, random corruptions, or sample-specific outliers, rendering the data away from their exact underlying subspaces. To handle different types of errors, we adopt different regularizations. Specifically, we generalize the objective function in Eq. (4) to
$$\min_{\mathbf{X}, \mathbf{Y}, \mathbf{E}} \ \|\mathbf{D} - \mathbf{X}\mathbf{Y} - \mathbf{E}\|_F^2 + \lambda \|\mathbf{E}\|_{\ell}, \qquad (8)$$

where $\|\cdot\|_{\ell}$ is some type of matrix norm: for random corruptions, it is the $\ell_1$ norm, while for sample-specific outliers, it is the $\ell_{2,1}$ norm. $\lambda$ is the regularization parameter that controls the balance between the two terms.
With the above two objective functions, we claim that when the data are clean, $\mathbf{Y}$ is block-diagonal after reordering its columns and rows, i.e., it takes the form of Eq. (3). When the data are contaminated with random corruptions or sample-specific outliers to some extent, our experimental results show that $\mathbf{Y}$ is still approximately block-diagonal.
It is worth noticing that SSC and LRR can also be boiled down to the matrix factorization framework. They share some similarities with the proposed method, but there exist two major differences. On one hand, they take advantage of the so-called “self-expressiveness” property and use the data themselves to form the dictionary (basis). However, leveraging the data as the basis fails to reconstruct the data when they contain errors to some degree (see Figure 1). On the other hand, they respectively utilize an indirect $\ell_1$ norm and nuclear norm to obtain sparse and low-rank representations of the data, which, however, cannot identify the exact basis used to represent each data sample. Thus, their ability for subspace clustering is also limited (see Figure 5).


IV-B Problem Optimization
In this part, without loss of generality, we solve the general case in Eq. (8). Notice that we need to handle the non-convex multi-variable objective function as well as the non-smooth $\ell_0$ norm constraint. Fortunately, for functions of multiple variables, the ADMM method is often the method of choice [30]. Moreover, proximal algorithms have recently become a popular tool for solving non-smooth and non-convex problems. Their basic operation is evaluating the proximal operator of a single-variable function, which often admits a closed-form solution or can be solved very quickly [31]. Inspired by such advantages, we develop a first-order alternating direction algorithm to solve Eq. (8). The overall procedure is to first introduce auxiliary variables and quadratic penalties into the objective function, then iteratively minimize the augmented Lagrangian function with respect to each primal variable, and finally update the multipliers. To tackle the subproblem with the non-smooth $\ell_0$ norm constraint, we design a novel proximal operator that solves it efficiently and analytically.
We begin by introducing an auxiliary variable $\mathbf{V}$ for the variable $\mathbf{Y}$, which exists in both the objective function and the constraints. We reformulate Eq. (8) as

$$\min_{\mathbf{X}, \mathbf{Y}, \mathbf{V}, \mathbf{E}} \ \|\mathbf{D} - \mathbf{X}\mathbf{Y} - \mathbf{E}\|_F^2 + \lambda \|\mathbf{E}\|_{\ell} \quad \text{s.t.} \ \ \mathbf{X}^T\mathbf{X} = \mathbf{I}, \ \mathbf{Y} = \mathbf{V}, \ \mathbf{V} \ge 0, \ \|\mathbf{v}_j\|_0 = r. \qquad (9)$$
Then, the augmented Lagrangian function of Eq. (9) is

$$\mathcal{L} = \|\mathbf{D} - \mathbf{X}\mathbf{Y} - \mathbf{E}\|_F^2 + \lambda \|\mathbf{E}\|_{\ell} + \langle \mathbf{P}, \mathbf{Y} - \mathbf{V} \rangle + \frac{\mu}{2}\|\mathbf{Y} - \mathbf{V}\|_F^2, \qquad (10)$$

where $\mathbf{P}$ is the Lagrangian multiplier and $\mu > 0$ is the quadratic penalty parameter. Note that adding the penalty term does not change the optimal solution, since for any feasible solution satisfying the constraint $\mathbf{Y} = \mathbf{V}$ in Eq. (10) the penalty term vanishes.
Finally, our alternating direction algorithm consists of iteratively minimizing Eq. (10) with respect to one of $\{\mathbf{X}, \mathbf{Y}, \mathbf{E}, \mathbf{V}\}$ while fixing the others, and updating the multiplier $\mathbf{P}$. Specifically, we solve for the variables and the multiplier step by step.
Update X: By discarding terms that are irrelevant to $\mathbf{X}$, we rewrite the subproblem with respect to $\mathbf{X}$ as

$$\min_{\mathbf{X}} \ \|(\mathbf{D} - \mathbf{E}) - \mathbf{X}\mathbf{Y}\|_F^2 \quad \text{s.t.} \ \ \mathbf{X}^T\mathbf{X} = \mathbf{I}. \qquad (11)$$

We denote the singular value decomposition (SVD) $(\mathbf{D} - \mathbf{E})\mathbf{Y}^T = \mathbf{U}\boldsymbol{\Sigma}\mathbf{W}^T$; then from Theorem 2 below, we can achieve the closed-form solution of $\mathbf{X}$. That is,

$$\mathbf{X} = \mathbf{U}\mathbf{W}^T. \qquad (12)$$
Theorem 2.
Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{d \times n}$ be any two matrices. Denote the SVD of $\mathbf{A}\mathbf{B}^T$ as $\mathbf{U}\boldsymbol{\Sigma}\mathbf{W}^T$. Then the orthonormal constrained minimization problem
$$\min_{\mathbf{X}} \ \|\mathbf{A} - \mathbf{X}\mathbf{B}\|_F^2 \quad \text{s.t.} \ \ \mathbf{X}^T\mathbf{X} = \mathbf{I}$$
has the analytic solution $\mathbf{X}^* = \mathbf{U}\mathbf{W}^T$.
Proof.
See Appendix B. ∎
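This closed-form solution is the classical orthogonal Procrustes construction and is easy to check numerically (a NumPy sketch; matrix sizes are arbitrary):

```python
import numpy as np

def procrustes(A, B):
    """Solve min_X ||A - X B||_F^2  s.t.  X^T X = I  (as in Theorem 2)."""
    U, _, Wt = np.linalg.svd(A @ B.T, full_matrices=False)
    return U @ Wt

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 10))
B = rng.standard_normal((3, 10))
X = procrustes(A, B)

# X has orthonormal columns:
assert np.allclose(X.T @ X, np.eye(3))
# and it beats a random feasible alternative:
Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))
assert np.linalg.norm(A - X @ B) <= np.linalg.norm(A - Q @ B) + 1e-9
```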
Update Y: The subproblem of $\mathbf{Y}$ is

$$\min_{\mathbf{Y}} \ \|\mathbf{D} - \mathbf{X}\mathbf{Y} - \mathbf{E}\|_F^2 + \langle \mathbf{P}, \mathbf{Y} - \mathbf{V} \rangle + \frac{\mu}{2}\|\mathbf{Y} - \mathbf{V}\|_F^2. \qquad (13)$$

Setting the derivative with respect to $\mathbf{Y}$ to zero, rearranging the terms, and using the constraint $\mathbf{X}^T\mathbf{X} = \mathbf{I}$ yields

$$\mathbf{Y} = \frac{1}{2 + \mu}\big(2\mathbf{X}^T(\mathbf{D} - \mathbf{E}) - \mathbf{P} + \mu\mathbf{V}\big). \qquad (14)$$
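The closed-form Y update can be verified by checking that the gradient of the subproblem vanishes there (a NumPy sketch with arbitrary sizes; the variable names follow the update steps in this section):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, n, mu = 8, 3, 5, 0.7
X, _ = np.linalg.qr(rng.standard_normal((m, d)))   # X^T X = I
D = rng.standard_normal((m, n))
E = np.zeros((m, n))
P = rng.standard_normal((d, n))
V = rng.random((d, n))

# Closed-form update for Y:
Y = (2 * X.T @ (D - E) - P + mu * V) / (2 + mu)

# Gradient of ||D - XY - E||_F^2 + <P, Y - V> + (mu/2)||Y - V||_F^2 at Y:
grad = -2 * X.T @ (D - E - X @ Y) + P + mu * (Y - V)
print(np.abs(grad).max())  # ~0 (machine precision)
```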
Update E: The subproblem of $\mathbf{E}$ becomes

$$\min_{\mathbf{E}} \ \|\mathbf{D} - \mathbf{X}\mathbf{Y} - \mathbf{E}\|_F^2 + \lambda \|\mathbf{E}\|_{\ell}. \qquad (15)$$

For sample-specific outliers, we set $\|\mathbf{E}\|_{\ell} = \|\mathbf{E}\|_{2,1}$. Then the solution to Eq. (15) is equivalent to evaluating the proximal operator of the $\ell_{2,1}$ norm. From [53], $\mathbf{E}$ can be obtained in closed form. Concretely, denote $\mathbf{Q} = \mathbf{D} - \mathbf{X}\mathbf{Y}$; then, column by column,

$$\mathbf{e}_j = \begin{cases} \dfrac{\|\mathbf{q}_j\|_2 - \lambda/2}{\|\mathbf{q}_j\|_2}\,\mathbf{q}_j, & \text{if } \|\mathbf{q}_j\|_2 > \lambda/2, \\ \mathbf{0}, & \text{otherwise}. \end{cases} \qquad (16)$$

Similarly, for random corruptions, we set $\|\mathbf{E}\|_{\ell} = \|\mathbf{E}\|_1$. The solution then equals evaluating the proximal operator of the $\ell_1$ norm and can be achieved via the soft-thresholding operator [54]:

$$\mathbf{E} = \mathrm{sign}(\mathbf{Q}) \odot \max\big(|\mathbf{Q}| - \lambda/2, \, 0\big). \qquad (17)$$
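Both proximal operators are a few lines of array code (a NumPy sketch; the threshold λ/2 assumes the subproblem uses an unscaled squared Frobenius term, as in Eq. (15), and the test matrix is an arbitrary example):

```python
import numpy as np

def prox_l1(Q, lam):
    """argmin_E ||Q - E||_F^2 + lam * ||E||_1  (elementwise soft-thresholding)."""
    return np.sign(Q) * np.maximum(np.abs(Q) - lam / 2.0, 0.0)

def prox_l21(Q, lam):
    """argmin_E ||Q - E||_F^2 + lam * ||E||_{2,1}  (column-wise shrinkage)."""
    norms = np.linalg.norm(Q, axis=0)
    scale = np.maximum(1.0 - lam / (2.0 * np.maximum(norms, 1e-12)), 0.0)
    return Q * scale

Q = np.array([[ 3.0, 0.1],
              [-4.0, 0.1]])
print(prox_l1(Q, 1.0))   # shrinks every entry toward zero by 0.5
print(prox_l21(Q, 1.0))  # keeps column 1 (norm 5), zeroes the tiny column 2
```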
Update V: The subproblem associated with $\mathbf{V}$ is

$$\min_{\mathbf{V}} \ \langle \mathbf{P}, \mathbf{Y} - \mathbf{V} \rangle + \frac{\mu}{2}\|\mathbf{Y} - \mathbf{V}\|_F^2 \quad \text{s.t.} \ \ \mathbf{V} \ge 0, \ \|\mathbf{v}_j\|_0 = r. \qquad (18)$$

Here, we emphasize that due to the discrete $\ell_0$ norm constraint in Eq. (18), $\mathbf{V}$ cannot be obtained via gradient or subgradient based methods. Benefiting from the aforementioned proximal algorithms [31], we can design a simple yet very efficient algorithm to solve for $\mathbf{V}$ column by column. The detail is as follows.
We denote $\mathbf{Z} = \mathbf{Y} + \mathbf{P}/\mu$ and define the operator $\mathcal{T}_r$ for a vector $\mathbf{z}$ as the mapping that keeps the $r$ largest entries of $\mathbf{z}$ and sets the remaining entries to zero.
We also define the nonnegative orthant mapping as $\Pi_{+}(\mathbf{z}) = \max(\mathbf{z}, \mathbf{0})$, taken componentwise,
and introduce an indicator function over the set $\Omega = \{\mathbf{v} : \mathbf{v} \ge 0, \ \|\mathbf{v}\|_0 = r\}$, i.e., $\mathbb{I}_{\Omega}(\mathbf{v}) = 0$ if $\mathbf{v} \in \Omega$ and $+\infty$ otherwise.
Then, we have the following theorem:
Theorem 3.
Given the above defined operator $\mathcal{T}_r$, nonnegative orthant mapping $\Pi_{+}$, and indicator function $\mathbb{I}_{\Omega}$, the solution of Eq. (18) equals evaluating the proximal operator (note that there is a factor 1/2 on the quadratic term in the standard form; here we remove it to simplify the notation) of the indicator function,
$$\mathbf{v}_j = \mathrm{prox}_{\mathbb{I}_{\Omega}}(\mathbf{z}_j) = \arg\min_{\mathbf{v}} \ \mathbb{I}_{\Omega}(\mathbf{v}) + \|\mathbf{v} - \mathbf{z}_j\|_2^2,$$
which has an analytical form, i.e., $\mathbf{v}_j = \mathcal{T}_r(\Pi_{+}(\mathbf{z}_j))$.
Proof.
See Appendix C. ∎
Based on Theorem 3, $\mathbf{v}_j$ can be acquired by selecting the $r$ largest nonnegative entries of $\mathbf{z}_j$, with corresponding index set $\mathcal{I}_j$. To be specific, we set

$$v_{ij} = \begin{cases} \max(z_{ij}, 0), & i \in \mathcal{I}_j, \\ 0, & \text{otherwise}. \end{cases} \qquad (19)$$
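This column-wise projection is straightforward to implement (a NumPy sketch; when a column has fewer than r positive entries, the result simply has fewer than r nonzeros):

```python
import numpy as np

def prox_col_l0(Z, r):
    """Project each column of Z onto {v : v >= 0, ||v||_0 <= r}:
    clip negatives to zero, then keep the r largest entries per column."""
    V = np.maximum(Z, 0.0)                  # nonnegative orthant mapping
    for j in range(V.shape[1]):
        idx = np.argsort(V[:, j])[:-r]      # indices of all but the r largest
        V[idx, j] = 0.0
    return V

Z = np.array([[ 0.9, -1.0],
              [ 0.2,  3.0],
              [-0.5,  2.0]])
print(prox_col_l0(Z, 1))  # one nonzero per column: 0.9 and 3.0
```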
Update P: Finally, for the multiplier $\mathbf{P}$, based on the dual optimality condition [30], its updating rule is

$$\mathbf{P} \leftarrow \mathbf{P} + \mu(\mathbf{Y} - \mathbf{V}), \qquad (20)$$

where $\mu \leftarrow \min(\rho\mu, \mu_{\max})$, with pre-given $\rho > 1$ and $\mu_{\max}$.
Algorithm 1 summarizes the algorithmic procedure of our proposed alternating direction algorithm for solving Eq.(10).
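Putting the five updates together, the alternating direction loop can be sketched as follows (a minimal NumPy sketch of the ℓ1/random-corruption variant; the initialization scheme, default parameters, and function names are our assumptions, not the paper's exact Algorithm 1):

```python
import numpy as np

def soft(Q, t):                       # elementwise soft-thresholding
    return np.sign(Q) * np.maximum(np.abs(Q) - t, 0.0)

def top_r_nonneg(Z, r):               # column-wise nonnegative l0 projection
    V = np.maximum(Z, 0.0)
    for j in range(V.shape[1]):
        V[np.argsort(V[:, j])[:-r], j] = 0.0
    return V

def mfc(D, K, r, lam=0.1, mu=0.01, rho=1.1, mu_max=1e6, iters=200, seed=0):
    """Alternating direction sketch; D, X, Y, E, V, P follow the text."""
    rng = np.random.default_rng(seed)
    m, n = D.shape
    d = K * r
    Y = rng.random((d, n)); V = Y.copy()
    E = np.zeros((m, n)); P = np.zeros((d, n))
    for _ in range(iters):
        U, _, Wt = np.linalg.svd((D - E) @ Y.T, full_matrices=False)
        X = U @ Wt                                        # X-step, Eq. (12)
        Y = (2 * X.T @ (D - E) - P + mu * V) / (2 + mu)   # Y-step, Eq. (14)
        E = soft(D - X @ Y, lam / 2)                      # E-step, l1 errors
        V = top_r_nonneg(Y + P / mu, r)                   # V-step
        P = P + mu * (Y - V)                              # multiplier update
        mu = min(rho * mu, mu_max)
    return X, V, E

rng = np.random.default_rng(1)
D = np.abs(rng.standard_normal((20, 30)))
X, V, E = mfc(D, K=2, r=3, iters=50)
print(X.shape, V.shape)  # (20, 6) (6, 30)
```

The returned V gives each sample's direct sparse representation (at most r nonzeros per column), and XV reconstructs the data.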
After obtaining $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{E}$, MFC can accomplish the following tasks:

Subspace Clustering: We apply normalized cut [55] to the representation matrix $\mathbf{Y}$ (one can also use K-means clustering on $\mathbf{Y}$; we choose normalized cut here so as to have a fair comparison with SSC and LRR) to cluster all data samples into their respective subspaces.

Data Reconstruction: We treat the product $\mathbf{X}\mathbf{Y}$ as the reconstruction of the data.

Error Correction: We regard $\mathbf{E}$ as the error of the data.

Representation Learning: We permute the rows and/or columns of $\mathbf{Y}$ to make it (approximately) block-diagonal. Then the $k$-th block $\mathbf{Y}_k$ indicates the representation for subspace $\mathcal{S}_k$.

Basis Learning: For each $k$, we extract the $r$ columns of $\mathbf{X}$ whose indices correspond to the row numbers of $\mathbf{Y}_k$ to form the basis of subspace $\mathcal{S}_k$.
V Computational Analysis
The computational complexity of MFC in each iteration can be summarized as follows:
Calculate X: We first perform matrix subtraction of size $m \times n$ and matrix multiplication of sizes $(m \times n)(n \times Kr)$, with complexities $O(mn)$ and $O(mnKr)$, respectively. Then, we compute the SVD of an $m \times Kr$ matrix and store both sets of singular vectors, with complexity $O(m^2 Kr)$. Finally, we calculate the matrix product of sizes $(m \times Kr)(Kr \times Kr)$, with complexity $O(m(Kr)^2)$, to obtain the orthonormal basis $\mathbf{X}$. The total complexity is $O(mnKr + m^2 Kr)$.
Compute Y: We just perform matrix multiplication of sizes $(Kr \times m)(m \times n)$ and matrix addition of size $Kr \times n$, with the dominant complexity $O(mnKr)$.
Compute E: The computation consists of matrix multiplication, matrix subtraction, and proximal operator calculation (both the $\ell_1$ and $\ell_{2,1}$ proximal operators have the same complexity $O(mn)$). The complexities of the three parts are $O(mnKr)$, $O(mn)$, and $O(mn)$, and the dominant complexity is $O(mnKr)$.
Compute V: The complexity of matrix addition is $O(Krn)$. The nonnegative orthant mapping and sorting of a $Kr$-dimensional vector respectively need $O(Kr)$ and $O(Kr \log Kr)$ (in the average case) operations per column. Thus, the dominant complexity is $O(nKr \log Kr)$.
As the total dimension of the multiple subspaces is much smaller than the data dimensionality $m$ and sample size $n$, i.e., $Kr \ll \min(m, n)$, we conclude that the dominant complexity of MFC is $O(mnKr + m^2 Kr)$.
Methods:  SSC  |  LRR  |  MFC
Complexity:  $O(mn^2)$  |  $O(mn^2 + n^3)$  |  $O(mnKr + m^2Kr)$
Here, we also discuss the computational complexity of SSC and LRR. For SSC, the computational burden is matrix multiplication, with complexity $O(mn^2)$. For LRR, the computational burden lies in two parts, matrix multiplication and matrix SVD, the complexities of which are $O(mn^2)$ and $O(n^3)$, respectively. One should note that: (1) the matrix sizes for matrix multiplication in MFC ($m \times Kr$ and $Kr \times n$) are smaller than those of SSC and LRR ($m \times n$ and $n \times n$); (2) the matrix size for SVD in MFC ($m \times Kr$) is smaller than that of LRR ($n \times n$). The dominant complexities of SSC, LRR, and MFC are listed in Table I.
Comparing MFC with SSC and LRR, we observe that when handling data with high dimensionality, SSC and LRR are relatively faster: the complexity in terms of $m$ is quadratic for MFC but linear for SSC and LRR. Actually, we can leverage random projections, a very effective feature extraction method [56], as a preprocessing step to reduce the dimensionality, thus largely decreasing the complexity of MFC. In contrast, when dealing with a large sample size, MFC is more efficient than SSC and LRR: the complexities of SSC and LRR are quadratic and cubic in the sample size $n$, while MFC's is linear.
In practice, the overall computational cost is determined not only by the complexity of each iteration but also by the total number of iterations. Although there is no theoretical guarantee, experimental results (see Tables VI–VIII) show that MFC converges within 20–40 iterations, while LRR and SSC need 55–70 and 90–110 iterations, respectively.
VI Experimental Results
In this section, we carry out several experiments on both synthetic data and realworld datasets to test the performance of our proposed MFC method.
Compared methods: We compare MFC with single-subspace learning methods, i.e., PCA, NMF [14], and sparse NMF (SNMF) [51], and with state-of-the-art multi-subspace learning methods, i.e., SSC and LRR. MFC can leverage different regularizations to handle different types of errors. In the experiments, we mainly focus on random corruptions ($\ell_1$ norm) and sample-specific outliers ($\ell_{2,1}$ norm); accordingly, we denote the two MFC variants as MFC1 and MFC2, respectively.
For PCA, we preserve 95 percent of the total data variance. For NMF and SNMF, we set their basis number to $Kr$, the same as MFC's. For MFC, we randomly generate a matrix of size $m \times Kr$ to initialize the basis matrix. The hyperparameters of all compared methods are selected via cross-validation. The stopping criterion for each method is that it either reaches the maximal number of iterations or the objective function values between neighboring iterations differ by less than a predefined tolerance. All experiments are conducted in Matlab R2010b on a platform with a 3.10 GHz CPU and 16.0 GB RAM.
Evaluation metrics: We use clustering accuracy, time, and iterations as the metrics to evaluate the performance of all compared methods.
VI-1 Clustering Accuracy
The subspace clustering result is evaluated by comparing the estimated label of each data sample with that provided by the ground truth. In the paper, we use clustering accuracy (ACC) to measure the clustering performance. ACC is defined as
$$\mathrm{ACC} = \frac{1}{n}\sum_{i=1}^{n} \delta\big(g_i, \mathrm{map}(p_i)\big), \qquad (21)$$

where $p_i$ and $g_i$ are the estimated label and the ground truth of the $i$-th point; $\delta(x, y) = 1$ if $x = y$, and $\delta(x, y) = 0$ otherwise; and $\mathrm{map}(\cdot)$ is the permutation mapping function that maps each estimated cluster label to the equivalent label of the ground truth. The best mapping can be efficiently computed by the Kuhn-Munkres algorithm [57].
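The ACC metric with the Kuhn-Munkres best mapping can be computed, for instance, with SciPy's `linear_sum_assignment` (a sketch; labels are assumed to be integers starting at 0):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(gt, pred):
    """ACC via the Kuhn-Munkres (Hungarian) algorithm: find the label
    permutation that best matches predicted clusters to the ground truth."""
    gt, pred = np.asarray(gt), np.asarray(pred)
    k = int(max(gt.max(), pred.max())) + 1
    cost = np.zeros((k, k), dtype=int)
    for g, p in zip(gt, pred):
        cost[p, g] += 1                       # co-occurrence counts
    row, col = linear_sum_assignment(-cost)   # negate to maximize matches
    return cost[row, col].sum() / gt.size

print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # 1.0
print(clustering_accuracy([0, 0, 1, 1], [0, 1, 0, 1]))              # 0.5
```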
VI-2 Time and Iterations
We use three indices to quantify the computational efficiency, i.e., total running time $T$, number of iterations $N$, and averaged running time per iteration $T/N$. $T$ indicates the overall efficiency; $N$ reflects the convergence rate to reach a local solution; and $T/N$ shows the efficiency per iteration.
VI-A Results on Synthetic Data
VI-A1 Data reconstruction visualization in 3-dim space
We visualize data reconstruction in the 3-dim ambient space in Figure 1 and compare MFC with SSC [21] and LRR [22]. The data samples are drawn from three 1-dim independent subspaces and disturbed by random corruptions and sample-specific outliers, respectively. Figure 1 shows that SSC and LRR are unable to remove the errors, which validates our aforementioned argument, while MFC fully eliminates them. The reason is that both SSC and LRR use the original contaminated data as the basis, which is unreasonable for data reconstruction. In contrast, MFC learns the basis that span the underlying subspaces the clean data lie in; indeed, the learnt basis coincide with those of the three subspaces, and the errors contained in the data are absorbed in the regularization term.
VI-A2 Results on high-dimensional data
For high-dimensional data analysis, the synthetic data are generated from $K$ independent subspaces, each containing $n_k$ samples. All subspaces have the same dimension $r$ and are embedded in an $m$-dimensional ambient space. The procedure for generating the above subspaces is similar to that of [13]: the basis of each subspace are calculated by $\mathbf{X}_{k+1} = \mathbf{T}_k\mathbf{X}_k$, $1 \le k \le K-1$, where $\mathbf{T}_k$ is a random orthonormal matrix and $\mathbf{X}_1$ is a random column-orthogonal matrix. The data points of each subspace are sampled by $\mathbf{D}_k = \mathbf{X}_k\mathbf{Y}_k$, where the elements of $\mathbf{Y}_k$ are drawn from the standard uniform distribution.
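This generation procedure can be mimicked as follows (a NumPy sketch with assumed sizes K=5, r=4, m=100, and 40 samples per subspace, since the paper's exact values are not recoverable from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
K, r, m, n_k = 5, 4, 100, 40          # assumed sizes, not the paper's

# Basis of subspace 1: a random column-orthogonal m x r matrix.
B, _ = np.linalg.qr(rng.standard_normal((m, r)))
bases = [B]
for _ in range(K - 1):
    # Rotate the previous basis by a random orthonormal m x m matrix T.
    T, _ = np.linalg.qr(rng.standard_normal((m, m)))
    bases.append(T @ bases[-1])

# Sample n_k points per subspace with uniform coefficients.
D = np.hstack([Bk @ rng.random((r, n_k)) for Bk in bases])
print(D.shape)  # (100, 200)
```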
Representation learning and stability: We exhibit MFC’s ability to learn the representation matrix, i.e., the blockdiagonal structure, of data without and with errors, and show its efficiency and stability to obtain the local solution.
The representation matrix of clean data and the objective function values versus iterations over 10 runs are shown in Figure 2. From Figure 2, we can see that MFC precisely discovers the block-diagonal structure when the data samples are clean. Moreover, our proposed first-order optimization algorithm converges fast (within about 15 iterations) to a very stable value despite different random initializations. These observations demonstrate that MFC is both effective and efficient for analyzing the multi-subspace structure.
We further consider the cases where the data contain errors. For random corruptions and sample-specific outliers with error ratios (the proportion of data samples contaminated with errors) of 0.1, 0.3, and 0.5, the corresponding representation matrices are plotted in Figure 3 and Figure 4, respectively. We observe that even when the data are contaminated with different types of errors at a high level, the representation matrix learnt via MFC is still close to block-diagonal.
Clustering accuracy vs. error ratio: We compare MFC with all the other methods in terms of subspace clustering to validate its robustness against different types of errors. The performance of all compared methods versus different error ratios, ranging from 0 to 0.8, of random corruptions and sample-specific outliers is depicted in Figures 5(a) and 5(b). It can be seen that the clustering accuracy of MFC is consistently the highest; SSC and LRR come next. Surprisingly, even when the ratio is 0.6, MFC still obtains high accuracy in both cases, whereas for LRR and SSC, the accuracies are approximately 70% and 85% for random corruptions, and 60% and 60% for sample-specific outliers, respectively. The performance of PCA, NMF, and SNMF decreases sharply, which validates that single-subspace learning methods fail to work when the error ratio increases.
VI-A3 Summary

MFC can accurately reconstruct the data and discover the basis that span each subspace;

MFC is capable of obtaining the exact block-diagonal representation matrix when the data are clean;

MFC shows a strong ability to resist errors at a high ratio and is more powerful than the state-of-the-art LRR and SSC methods for subspace clustering, not to mention single-subspace learning methods;

MFC converges fast to a very stable solution via our proposed first-order optimization algorithm.
VI-B Results on Real-World Datasets
In this part, we pay close attention to the following tasks: subspace clustering, with its associated computational complexity; and subspace basis learning and subspace recovery. Experiments are carried out on three real-world face datasets, AR (http://www2.ece.ohiostate.edu/~aleix/ARdatabase.html), Yale (http://vision.ucsd.edu/content/yalefacedatabase), and Extended Yale B (EYaleB, http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html), and on the USPS handwritten digit dataset (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#usps).
VI-B1 Dataset Description
AR contains over 3,000 images of 126 individuals (70 men and 56 women), taken in two different sessions with different facial expressions, illumination conditions, and occlusions (sunglasses and scarf) under strictly controlled conditions. Raw images are of 768x576 pixels with 24-bit depth. In the experiment, images are resized to 48x48 pixels.
Yale contains 165 grayscale images of 15 individuals. There are 11 images per subject, one per different facial expression or configuration: centerlight, w/glasses, happy, leftlight, w/no glasses, normal, rightlight, sad, sleepy, surprised, and wink. We simply use the cropped images with size 32x32 pixels.
EYaleB consists of 2414 images of 38 subjects under 9 poses and 64 illumination conditions. The subset used in the experiment has 20 individuals and around 64 near frontal images under different illuminations per individual. All images are cropped to 32x32 pixels.
USPS contains 9198 handwritten digit images (from 0 to 9) and the image size is 16x16. In the experiment, we choose 100 images of each digit, thus 1000 images in total.
Some sample images from the four datasets are shown in Figure 6. Statistics of the selected subsets are listed in Table II.
  AR  Yale  EYaleB  USPS
K  3  11  20  10
n  15  15  64  100
m  48x48  32x32  32x32  16x16
VI-B2 Subspace Dimension Estimation
For MFC, we need to know the subspace dimension of each subject in advance. A simple yet effective way is to count the significant singular values of each subject. Figure 7 depicts the singular values of several randomly chosen subjects in the four datasets. The subspace dimensions for AR, Yale, EYaleB, and USPS are approximately 12, 10, 10, and 12, respectively. Note that for real-world datasets, the singular values often only gradually approach zero, confirming that the images are contaminated with errors. The higher dimension of AR reflects a more complicated face distribution than Yale and EYaleB. It is necessary to point out that we cannot distinguish which type of error is contained in the images, therefore we report results of both MFC1 and MFC2.
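The estimation rule above (counting the significant singular values of each subject's samples) can be sketched as follows. The energy threshold is our illustrative assumption, since the text does not state the exact cut-off; for clean data the rule reduces to counting nonzero singular values.

```python
import numpy as np

def estimate_subspace_dim(X, energy=0.95):
    """Estimate the subspace dimension of one subject's samples
    (columns of X) as the number of singular values needed to
    capture a given fraction of the spectral energy.

    For contaminated real images the singular-value tail decays
    only gradually, which is why a threshold is needed.
    """
    s = np.linalg.svd(X, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)   # cumulative energy
    return int(np.searchsorted(cum, energy) + 1)
```

Applying this per subject and rounding across subjects gives the per-dataset dimensions reported above.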
VI-B3 Subspace Clustering
In this part, we perform subspace clustering on real-world datasets: given face images or handwritten digits of multiple categories, our task is to group them according to their respective subspaces. Our purpose is to verify that it is more advantageous to (1) model the face distribution using multiple subspaces rather than a single one; (2) learn the basis rather than leverage the data themselves as the basis.
Tables III and IV show the clustering performance on Yale and EYaleB, respectively. To study the effect of the category number, we choose K ranging from 2 to 11 on Yale and from 2 to 20 with an interval of 2 on EYaleB. From the results, we have the following observations:
(1) As a whole, MFC performs the best on both datasets. By learning the respective subspace for each subject, together with a direct column l0-norm constraint and an error correction term, MFC is discriminative enough to separate the face images of different subjects. Essentially, this is the key factor that makes it outperform single-subspace analysis methods. Moreover, by using a column l0-norm constraint to directly control the sparsity, MFC is superior to the indirect l1 penalty and avoids tuning the sparsity hyperparameter;
(2) SSC and LRR can also achieve promising results, as they are multi-subspace learning methods with error correction. However, their use of contaminated face images as the basis is problematic and irrational; the accuracy gap between them and MFC validates this argument;
(3) Errors contained in face images are more likely to be sample-specific outliers: on both Yale and EYaleB, the result of MFC2 (with the l2,1-norm error term) is better than that of MFC1 (with the l1-norm error term). Corroborating evidence is that LRR, which also uses the l2,1-norm term, is slightly superior to SSC, which uses the l1-norm term;
(4) A sparse constraint is important for clustering. The better results of SNMF compared with NMF on both datasets verify the importance of using a sparse constraint;
(5) Modeling the distribution of face images with multiple subspaces is necessary. Since PCA, NMF, and SNMF are single-subspace learning methods, their performance is worse than that of SSC, LRR, MFC1, and MFC2. When the sample size of each subject is small, as in Yale, the multi-subspace structure is not so obvious and the accuracy gap is not remarkable. However, when the sample size of each subject is sufficient, as in EYaleB, all of them obtain poor results on the whole. Even worse, they fail to work when K increases to a certain value, say 10, while the remaining methods still work well.
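The column l0-norm constraint highlighted in observation (1) can be enforced by a simple per-column projection. The sketch below is the standard Euclidean projection onto a column-wise l0 ball, which we assume plays the role of the constraint step in an alternating scheme such as MFC's; the per-column sparsity level r is an illustrative parameter.

```python
import numpy as np

def project_column_l0(V, r):
    """Project each column of V onto {v : ||v||_0 <= r} by keeping
    its r largest-magnitude entries and zeroing the rest. This is
    the exact Euclidean projection for a column-wise l0 constraint."""
    Vp = np.zeros_like(V)
    for j in range(V.shape[1]):
        idx = np.argsort(np.abs(V[:, j]))[-r:]  # r largest magnitudes
        Vp[idx, j] = V[idx, j]
    return Vp
```

Because the sparsity level is set directly through r, no sparsity hyperparameter needs tuning, unlike an indirect l1 penalty.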
We also implement subspace clustering on USPS with K from 2 to 10. Our aim is to see whether handwritten digits are distributed in “independent subspaces”, as no previous work mentions such an assumption.
Subspaces of handwritten digits are probably not independent. From the results listed in Table V, we notice that PCA, NMF, and SNMF all achieve promising results, even comparable with SSC and LRR for some values of K. Still, our MFC performs the best, although the superiority is not as obvious as on EYaleB. One possible reason is that the subspaces of handwritten digits are not independent in essence. In this sense, modeling the distribution of handwritten digits by disjoint subspaces (or submanifolds) may be more reasonable.
K  PCA  NMF  SNMF  SSC  LRR  MFC1  MFC2
2  0.76  0.82  0.82  0.87  0.95  0.95  0.95 
3  0.61  0.70  0.69  0.64  0.68  0.70  0.74 
4  0.63  0.64  0.68  0.61  0.64  0.66  0.76 
5  0.53  0.64  0.67  0.67  0.67  0.65  0.69 
6  0.56  0.53  0.58  0.65  0.58  0.63  0.68 
7  0.53  0.55  0.56  0.62  0.57  0.58  0.61 
8  0.55  0.49  0.57  0.60  0.63  0.56  0.60 
9  0.43  0.38  0.49  0.55  0.57  0.51  0.58 
10  0.45  0.47  0.54  0.54  0.54  0.51  0.54 
11  0.54  0.49  0.54  0.57  0.57  0.50  0.59 
Avg.  0.56  0.57  0.61  0.63  0.64  0.63  0.66 
K  PCA  NMF  SNMF  SSC  LRR  MFC1  MFC2
2  0.51  0.80  1.00  0.98  1.00  1.00  1.00 
4  0.35  0.59  0.63  0.88  0.86  0.85  0.89 
6  0.30  0.72  0.57  0.84  0.82  0.84  0.86 
8  0.20  0.37  0.48  0.76  0.74  0.69  0.76 
10  0.23  0.45  0.41  0.74  0.71  0.68  0.75 
12  0.22  0.37  0.41  0.67  0.70  0.70  0.70 
14  0.18  0.36  0.39  0.64  0.68  0.77  0.72 
16  0.18  0.37  0.39  0.63  0.67  0.72  0.73 
18  0.15  0.35  0.35  0.63  0.68  0.74  0.74 
20  0.15  0.37  0.37  0.60  0.66  0.72  0.72 
Avg.  0.25  0.47  0.49  0.74  0.75  0.77  0.79 
K  PCA  NMF  SNMF  SSC  LRR  MFC1  MFC2
2  0.92  0.98  0.96  0.99  0.99  1.00  1.00 
3  0.80  0.91  0.86  0.95  0.91  0.94  0.94 
4  0.63  0.76  0.78  0.85  0.89  0.93  0.87 
5  0.67  0.64  0.74  0.80  0.79  0.87  0.88 
6  0.57  0.56  0.67  0.80  0.77  0.78  0.76 
7  0.56  0.55  0.58  0.68  0.65  0.74  0.64 
8  0.56  0.57  0.57  0.54  0.62  0.66  0.69 
9  0.54  0.52  0.51  0.46  0.61  0.68  0.63 
10  0.54  0.47  0.48  0.53  0.47  0.64  0.58 
Avg.  0.64  0.66  0.68  0.73  0.74  0.80  0.78 
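The accuracies in Tables III-V are presumably computed by best-matching predicted cluster labels to the ground-truth classes. A brute-force sketch over label permutations (adequate for the small K used here; the Hungarian algorithm would scale to larger K) is:

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(labels_true, labels_pred):
    """Clustering accuracy: fraction of correctly grouped samples
    under the best one-to-one mapping of predicted cluster ids to
    ground-truth classes."""
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    best = 0.0
    for perm in permutations(clusters):
        mapping = dict(zip(perm, classes))       # cluster id -> class
        acc = np.mean([mapping[p] == t
                       for p, t in zip(labels_pred, labels_true)])
        best = max(best, acc)
    return best
```

With this metric, a label permutation of a perfect clustering still scores 1.0, as required for comparing unsupervised methods.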
VI-B4 Computational Complexity
Another advantage of MFC over LRR and SSC is its computational efficiency with respect to the sample size. Our aim in this part is therefore to validate that MFC is more efficient than SSC and LRR as the sample size increases. As mentioned, we quantify the efficiency of each method using the total running time (T), the number of iterations (I), and the averaged running time per iteration (T/I).
The results on Yale, EYaleB, and USPS are listed in Tables VI, VII, VIII, respectively. From Table VI, we see that
(1) The running time per iteration of MFC1 or MFC2 is larger than that of SSC and LRR on a dataset with a small sample size. The averaged T/I of MFC1 or MFC2 is approximately 5 and 3 times that of SSC and LRR, respectively. This is because when the sample size of each subject is small, as in Yale, the dimensionality of the data samples (m) dominates the computational complexity of each iteration. Recall that the complexity of MFC is quadratic with respect to m, while that of LRR and SSC is linear.
(2) The number of iterations of MFC is much smaller than those of SSC and LRR. Despite a higher T/I, the averaged iteration count of MFC1 or MFC2 is 24 or 23, much smaller than those of SSC and LRR, which are 93 and 57, respectively. This observation demonstrates MFC's faster convergence than LRR and SSC. Ultimately, the total running times of these methods are similar.
(3) Increasing the sample size makes MFC more advantageous. When the sample size of each subject increases to 64 in EYaleB, the results in Table VII show that the T/I values of SSC and LRR are larger than those of MFC1 and MFC2 on average. This is due to the fact that the computational complexity of MFC is linear with respect to the sample size, while those of LRR and SSC are quadratic and cubic. Similar to the results on Yale, compared with SSC (110) and LRR (65), the averaged I values of MFC1 (31) and MFC2 (30) are much smaller. Consequently, the total running times of MFC1 and MFC2 are just 1/4 and 1/5.7, and 1/6 and 1/8.5, of those of SSC and LRR, respectively. To further highlight the efficiency of MFC, we increase the sample size of each category to 100 in USPS; the related results are displayed in Table VIII. We observe that when the category number is 10, SSC and LRR take almost 94 and 255 seconds to complete the clustering task, while both MFC1 and MFC2 consume less than 10 seconds. Again, the smaller I of MFC1 and MFC2 compared with SSC and LRR suggests that they reach a stable solution more quickly.
SSC  LRR  MFC1  MFC2
K  T(s)  I  T/I  T(s)  I  T/I  T(s)  I  T/I  T(s)  I  T/I 
2  0.120  92  0.001  0.105  57  0.002  0.191  22  0.009  0.218  24  0.009 
3  0.196  99  0.002  0.167  55  0.003  0.281  22  0.013  0.316  22  0.014 
4  0.303  91  0.003  0.286  56  0.005  0.416  22  0.019  0.446  22  0.020 
5  0.445  96  0.005  0.396  57  0.007  0.592  22  0.027  0.537  22  0.024 
6  0.483  89  0.005  0.551  57  0.010  0.727  23  0.032  0.747  23  0.033 
7  0.643  89  0.007  0.718  57  0.013  0.905  23  0.039  0.923  23  0.040 
8  0.830  90  0.009  0.773  57  0.014  1.149  27  0.043  1.118  24  0.047 
9  1.007  92  0.011  1.051  57  0.018  1.339  25  0.054  1.503  24  0.063 
10  1.196  93  0.013  1.368  57  0.024  1.630  25  0.065  1.598  24  0.067 
11  1.480  94  0.016  1.380  58  0.024  1.760  28  0.063  1.765  24  0.074 
Avg.  0.670  93  0.007  0.679  57  0.012  0.899  24  0.036  0.917  23  0.039 
SSC  LRR  MFC1  MFC2
K  T(s)  I  T/I  T(s)  I  T/I  T(s)  I  T/I  T(s)  I  T/I 
2  1.73  107  0.02  1.73  61  0.03  1.14  24  0.05  0.96  24  0.04 
4  5.69  108  0.05  6.32  64  0.10  2.70  25  0.11  2.19  24  0.09 
6  12.20  109  0.11  14.19  64  0.22  4.95  25  0.20  3.36  24  0.14 
8  21.33  110  0.19  25.14  64  0.39  7.62  27  0.28  5.48  32  0.17 
10  37.54  110  0.34  40.82  64  0.64  11.17  35  0.32  7.75  32  0.24 
12  52.08  111  0.47  67.52  65  1.04  14.32  35  0.41  9.77  32  0.31 
14  69.28  111  0.62  109.57  66  1.66  19.09  35  0.54  12.39  32  0.39 
16  93.08  111  0.84  142.68  65  2.20  22.96  34  0.67  16.18  32  0.51 
18  124.00  111  1.12  185.63  65  2.85  28.12  35  0.80  19.17  32  0.60 
20  152.89  111  1.38  263.19  72  3.65  34.02  35  0.97  23.60  32  0.74 
Avg.  56.98  110  0.51  85.68  65  1.28  14.61  31  0.44  10.09  30  0.32 
SSC  LRR  MFC1  MFC2
K  T(s)  I  T/I  T(s)  I  T/I  T(s)  I  T/I  T(s)  I  T/I 
2  6.28  108  0.06  2.54  61  0.04  1.22  27  0.05  0.91  31  0.03 
3  13.21  110  0.12  6.33  63  0.10  1.87  27  0.07  1.38  31  0.04 
4  22.07  110  0.20  12.24  63  0.19  2.47  29  0.09  1.94  30  0.06 
5  29.18  111  0.26  21.52  63  0.34  3.13  30  0.10  2.66  30  0.08 
6  38.06  111  0.34  34.84  64  0.54  4.36  40  0.11  3.76  36  0.10 
7  54.53  111  0.49  107.60  65  1.65  5.27  39  0.14  4.36  36  0.12 
8  64.16  111  0.58  137.31  65  2.11  6.64  40  0.17  5.64  36  0.16 
9  71.58  111  0.64  192.85  65  2.97  8.80  39  0.22  7.12  35  0.20 
10  93.97  111  0.85  255.04  65  3.92  9.90  39  0.25  8.48  35  0.24 
Avg.  43.67  110  0.39  85.59  64  1.32  4.85  34  0.13  4.03  33  0.12 
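The T, I, and T/I statistics in Tables VI-VIII can be collected with a harness like the following. The stopping rule shown (relative objective change below a tolerance) is a generic assumption; each solver's actual criterion may differ.

```python
import time

def run_with_stats(step, max_iter=200, tol=1e-6):
    """Run an iterative solver and report total running time T,
    iteration count I, and averaged time per iteration T/I.

    `step` performs one iteration and returns the current objective
    value; we stop when its relative change falls below `tol`."""
    t0 = time.perf_counter()
    prev = None
    for it in range(1, max_iter + 1):
        obj = step()
        if prev is not None and abs(prev - obj) <= tol * max(1.0, abs(prev)):
            break
        prev = obj
    T = time.perf_counter() - t0
    return {"T": T, "I": it, "T/I": T / it}
```

Running each method's update under the same harness makes the T, I, and T/I columns directly comparable.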
VI-B5 Face Basis Learning and Reconstruction on AR
Subspace basis learning aims to learn a representative basis that provides a compact representation for each data sample. Traditional methods assume that the face images of all subjects can be represented by a single lower-dimensional subspace. However, previous work [12] revealed that, under the Lambertian assumption, face images of a subject with a fixed pose and varying illumination lie close to a linear subspace. Thus, for multiple subjects, the reasonable way is to learn the underlying basis for each subject. To confirm this argument, we conduct an experiment to visualize the learnt face basis (note that SSC and LRR are unable to learn a basis, thus we do not show their results). Moreover, we also display the reconstructed face images and the associated errors. To see the difference from single-subspace learning methods, we choose PCA for comparison.
We randomly choose three subjects (subjects 1, 2, and 55, each with 15 samples) in AR to learn the basis, which is then used for face reconstruction. Since some face images in AR are grossly corrupted, for instance by wearing sunglasses, we treat them as sample-specific outliers and choose MFC2. Some examples of the learnt face basis, reconstructed face images, and errors of PCA and MFC are plotted in Figure 8. From it, we have several observations:
(1) MFC can learn the corresponding basis for each subspace. The basis learnt via MFC can be separated into three parts: the basis images in each row represent one subject, meaning the face images of one subject lie on their own underlying subspace. In contrast, the basis images of PCA are mixed together, i.e., each basis image contains information from face images across the three subjects;
(2) MFC is effective in reconstructing face images. The face images reconstructed via MFC are very clean, while those via PCA are blurry and untidy;
(3) MFC is able to extract discriminant errors, which can further be used for recognition tasks. The errors are almost exactly extracted by MFC. In Figures 8(c) and (d), the errors are “teeth”: one reasonable explanation is that the face images in the training set do not show teeth. Similarly, the errors in Figures 8(e) and (f) are “sunglasses”. Interestingly, these errors contain discriminant features; in fact, they can be used for facial emotion recognition or object recognition [11]. For instance, we can leverage them to detect whether a person is laughing or wearing sunglasses. In contrast, the errors obtained by PCA do not provide any useful information.
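Observations (2) and (3) amount to splitting each sample into a reconstruction and a residual under a factorization model of the form X ≈ UV + E. A minimal sketch, with a hypothetical quantile rule for flagging grossly corrupted samples, is:

```python
import numpy as np

def reconstruct_and_errors(X, U, V):
    """Given a learnt basis U and representation V, return the
    reconstruction U @ V and the residual error X - U @ V; for face
    data the residual isolates occlusions such as sunglasses."""
    X_hat = U @ V
    return X_hat, X - X_hat

def flag_outliers(E, q=0.5):
    """Flag columns whose l2 error energy exceeds the q-quantile as
    grossly corrupted samples (the quantile rule is illustrative)."""
    energy = np.linalg.norm(E, axis=0)
    return energy > np.quantile(energy, q)
```

The flagged residual columns are exactly the discriminant errors discussed above, e.g. the "sunglasses" component of an occluded face.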
VI-B6 Summary

MFC (either MFC1 or MFC2) is not only effective for subspace clustering and subspace basis learning, but can also be implemented efficiently.

The advantage of MFC over single-subspace learning methods is magnified when handling data generated from multiple subspaces.

The advantage over SSC and LRR is boosted when handling data containing gross errors and/or data with a large sample size.
VII Conclusions and Future Work
In this paper, we propose a novel method, called Column l0-norm constrained Matrix Factorization (MFC), for robustly analyzing the structure of data generated from multiple categories. By learning the basis with an orthonormal constraint, MFC is able to discover the mixture subspace structure and is robust to different types of errors via the corresponding regularization. Moreover, MFC directly imposes an l0-norm constraint on each column of the representation matrix, which achieves an exact (or approximate) block-diagonal structure when the data are clean (or contain errors). We propose a very efficient first-order alternating direction type algorithm to stably solve the nonconvex and nonsmooth objective function of MFC. Experimental results on synthetic data and real-world datasets verify that, besides its superiority and efficiency over traditional and state-of-the-art methods for multi-subspace recovery and clustering, MFC also demonstrates its uniqueness for multi-subspace basis learning and direct representation learning.
Although MFC has these advantages, it also faces several problems that should be investigated in the future. First, we assume that the multiple subspaces are independent; in practice, however, the data may be generated from disjoint subspaces. In this case, how can we incorporate a suitable constraint or regularization into the model? More challenging still, how can we process data generated from nonlinear manifolds instead of linear subspaces? Second, since the computational complexity of MFC is linear with respect to the sample size, its extension to massive datasets will be investigated in future work.
References
 [1] B. Wang, R. Liu, C. Lin, and X. Fan, “Matrix factorization with column l0norm constraint for robust multisubspace analysis,” in 2015 IEEE ICDM Workshop, 2015, pp. 1189–1195.
 [2] J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” science, vol. 290, no. 5500, pp. 2319–2323, 2000.
 [3] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” science, vol. 290, no. 5500, pp. 2323–2326, 2000.
 [4] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural computation, vol. 15, no. 6, pp. 1373–1396, 2003.
 [5] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” IEEE TPAMI, vol. 19, no. 7, pp. 711–720, 1997.
 [6] X. He, S. Yan, Y. Hu, P. Niyogi, and H.J. Zhang, “Face recognition using laplacianfaces,” IEEE TPAMI, vol. 27, no. 3, pp. 328–340, 2005.
 [7] B.H. Wang, C. Lin, X.F. Zhao, and Z.M. Lu, “Neighbourhood sensitive preserving embedding for pattern classification,” IET IP, vol. 8, no. 8, pp. 489–497, 2014.
 [8] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Advances in neural information processing systems, 2005, pp. 1473–1480.
 [9] S. Rao, R. Tron, R. Vidal, and Y. Ma, “Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories,” IEEE TPAMI, vol. 32, no. 10, pp. 1832–1845, 2010.
 [10] J. Yan and M. Pollefeys, “A general framework for motion segmentation: Independent, articulated, rigid, nonrigid, degenerate and nondegenerate,” in ECCV. Springer, 2006, pp. 94–106.
 [11] G. Liu and S. Yan, “Latent lowrank representation for subspace segmentation and feature extraction,” in IEEE ICCV, 2011, pp. 1615–1622.
 [12] R. Basri and D. Jacobs, “Lambertian reflectance and linear subspaces,” IEEE TPAMI, vol. 25, no. 2, pp. 218–233, 2003.
 [13] K. Tang, R. Liu, Z. Su, and J. Zhang, “Structureconstrained lowrank representation,” IEEE TNNLS, vol. 25, no. 12, pp. 2167–2179, 2014.
 [14] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
 [15] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal ACM, vol. 58, no. 3, p. 11, 2011.
 [16] R. Liu, Z. Lin, and Z. Su, “Learning markov random walks for robust subspace clustering and estimation,” Neural Networks, vol. 59, pp. 1–15, 2014.
 [17] G. Liu, Z. Lin, X. Tang, and Y. Yu, “Unsupervised object segmentation with a hybrid graph model (hgm),” IEEE TPAMI, vol. 32, no. 5, pp. 910–924, 2010.
 [18] R. Vidal, Y. Ma, and S. Sastry, “Generalized principal component analysis (GPCA),” IEEE TPAMI, vol. 27, no. 12, pp. 1945–1959, 2005.
 [19] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” FOUND COMPUT MATH, vol. 9, no. 6, pp. 717–772, 2009.
 [20] D. L. Donoho, “For most large underdetermined systems of linear equations the minimal l1norm solution is also the sparsest solution,” COMMUN PUR APPL MATH, vol. 59, no. 6, pp. 797–829, 2006.
 [21] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE TPAMI, vol. 35, no. 11, pp. 2765–2781, 2013.
 [22] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by lowrank representation,” IEEE TPAMI, vol. 35, no. 1, pp. 171–184, 2013.
 [23] R. Vidal and P. Favaro, “Low rank subspace clustering (LRSC),” PRL, vol. 43, pp. 47–61, 2014.
 [24] R. Liu, Z. Lin, F. De la Torre, and Z. Su, “Fixedrank representation for unsupervised visual learning,” in IEEE CVPR, 2012, pp. 598–605.
 [25] Y. Ni, J. Sun, X. Yuan, S. Yan, and L.F. Cheong, “Robust lowrank subspace segmentation with semidefinite guarantees,” in IEEE ICDMW, 2010, pp. 1179–1188.
 [26] C. Peng, Z. Kang, H. Li, and Q. Cheng, “Subspace clustering using logdeterminant rank approximation,” in ACM SIGKDD, ser. KDD ’15. New York, NY, USA: ACM, 2015, pp. 925–934.
 [27] G. Liu, Q. Liu, and P. Li, “Blessing of dimensionality: Recovering mixture data via dictionary pursuit,” IEEE TPAMI, vol. 39, no. 1, pp. 47–60, 2017.
 [28] E. J. Candes and Y. Plan, “Matrix completion with noise,” Proc. of IEEE, vol. 98, no. 6, pp. 925–936, 2010.
 [29] H. Xu, C. Caramanis, and S. Sanghavi, “Robust pca via outlier pursuit,” in NIPS, 2010, pp. 2496–2504.
 [30] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
 [31] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in optimization, vol. 1, no. 3, pp. 123–231, 2013.
 [32] H. Liu, G. Yang, Z. Wu, and D. Cai, “Constrained concept factorization for image representation,” IEEE Trans. Cybernetics, vol. 44, no. 7, pp. 1214–1224, 2014.
 [33] N. Srebro, J. Rennie, and T. S. Jaakkola, “Maximummargin matrix factorization,” in NIPS, 2004, pp. 1329–1336.
 [34] C. Ding, T. Li, M. Jordan et al., “Convex and seminonnegative matrix factorizations,” IEEE TPAMI, vol. 32, no. 1, pp. 45–55, 2010.
 [35] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” JMLR, vol. 11, pp. 19–60, 2010.
 [36] W. Xu, X. Liu, and Y. Gong, “Document clustering based on nonnegative matrix factorization,” in ACM SIGIR. ACM, 2003, pp. 267–273.
 [37] R. Sandler and M. Lindenbaum, “Nonnegative matrix factorization with earth mover’s distance metric for image analysis,” IEEE TPAMI, vol. 33, no. 8, pp. 1590–1602, 2011.
 [38] S. Zafeiriou, A. Tefas, I. Buciu, and I. Pitas, “Exploiting discriminant information in nonnegative matrix factorization with application to frontal face verification,” IEEE TNN, vol. 17, no. 3, pp. 683–695, 2006.
 [39] B. Wang, M. Pang, C. Lin, and X. Fan, “Graph regularized nonnegative matrix factorization with sparse coding,” in ChinaSIP. IEEE, 2013, pp. 476–480.
 [40] C. Févotte, N. Bertin, and J.L. Durrieu, “Nonnegative matrix factorization with the itakurasaito divergence: With application to music analysis,” Neural computation, vol. 21, no. 3, pp. 793–830, 2009.
 [41] A. Cichocki, R. Zdunek, and S.i. Amari, “New algorithms for nonnegative matrix factorization in applications to blind source separation,” in IEEE ICASSP, vol. 5, 2006, pp. V–V.
 [42] Y.X. Wang and Y.J. Zhang, “Nonnegative matrix factorization: A comprehensive review,” IEEE TKDE, vol. 25, no. 6, pp. 1336–1353, 2013.
 [43] J. Eggert and E. Körner, “Sparse coding and nmf,” in IEEE IJCNN, vol. 4, 2004, pp. 2529–2533.
 [44] H. Kim and H. Park, “Sparse nonnegative matrix factorizations via alternating nonnegativityconstrained least squares for microarray data analysis,” Bioinformatics, vol. 23, no. 12, pp. 1495–1502, 2007.
 [45] P. O. Hoyer, “Nonnegative matrix factorization with sparseness constraints,” JMLR, vol. 5, pp. 1457–1469, 2004.
 [46] D. L. Donoho and M. Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization,” PNAS, vol. 100, no. 5, pp. 2197–2202, 2003.
 [47] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images,” SIAM review, vol. 51, no. 1, pp. 34–81, 2009.
 [48] S. A. Vavasis, “On the complexity of nonnegative matrix factorization,” SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–1377, 2009.
 [49] M. Aharon, M. Elad, and A. Bruckstein, “Ksvd: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE TSP, vol. 54, no. 11, pp. 4311–4322, 2006.
 [50] J. Tropp, A. C. Gilbert et al., “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE TIT, vol. 53, no. 12, pp. 4655–4666, 2007.
 [51] R. Peharz and F. Pernkopf, “Sparse nonnegative matrix factorization with l0constraints,” Neurocomputing, vol. 80, pp. 38–46, 2012.
 [52] B. Wang, C. Lin, X. Fan, N. Jiang, and D. Farina, “Hierarchical bayes based adaptive sparsity in gaussian mixture model,” Pattern Recognition Letters, vol. 49, pp. 238–247, 2014.
 [53] J. Yang, W. Yin, Y. Zhang, and Y. Wang, “A fast algorithm for edgepreserving variational multichannel image restoration,” SIAM Journal on Imaging Sciences, vol. 2, no. 2, pp. 569–592, 2009.
 [54] K. P. Murphy, Machine learning: a probabilistic perspective. MIT press, 2012.
 [55] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE TPAMI, vol. 22, no. 8, pp. 888–905, 2000.
 [56] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE TPAMI, vol. 31, no. 2, pp. 210–227, 2009.
 [57] M. D. Plummer and L. Lovász, Matching theory. Elsevier, 1986.
A Proof of Theorem 1
Let be an optimal solution of Eq.(2) under the given basis . We can then decompose into two parts, i.e., the diagonal and nondiagonal , where
with and .
Let respectively denote the th column of . Suppose that . It is easy to check , therefore . However, . As all subspaces are independent, meaning , we must require . Thus , and , . Therefore, the solution to Eq.(3) is block-diagonal.
B Proof of Theorem 2
We first unfold the objective function:
Using the constraint and the SVD of transforms the above minimization into the maximization problem
Note that is a constant. As is a diagonal matrix, the maximum is achieved when is also diagonal with positive diagonal elements. Therefore, we have , and thus .
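The argument above is the classical orthogonal Procrustes solution: assuming the subproblem is min_U ||X - UV||_F^2 subject to U^T U = I, the optimum is U = PQ^T, where PΣQ^T is the (thin) SVD of XV^T. A quick numeric check (the matrices X and V here are random stand-ins, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, n = 8, 3, 20
X = rng.standard_normal((m, n))
V = rng.standard_normal((r, n))

# Closed-form update: U* = P Q^T from the thin SVD of X V^T
P, _, Qt = np.linalg.svd(X @ V.T, full_matrices=False)
U_star = P @ Qt

def loss(U):
    return np.linalg.norm(X - U @ V, "fro")

# U* should beat any random matrix with orthonormal columns
for _ in range(50):
    Q_rand, _ = np.linalg.qr(rng.standard_normal((m, r)))
    assert loss(U_star) <= loss(Q_rand) + 1e-9
```

This is the basis-update step solved in closed form at each iteration of the alternating scheme, which helps explain the small iteration counts observed in the experiments.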
C Proof of Theorem 3
For any , we first split its elements into the negative and positive parts, with corresponding indices defined as and . Denote and . Then, we have two useful properties:
(1) ;
(2) .
Property (1) is obvious, and (2) can be proved simply:
Based on the above properties, we have
Note that we remove the constant in the fourth equation.
In the last equation, when , . As , so . Therefore, the minimum is achieved if and only if for , resulting in . Finally, we have