Deep Subspace Clustering Networks
Abstract
We present a novel deep neural network architecture for unsupervised subspace clustering. This architecture is built upon deep autoencoders, which nonlinearly map the input data into a latent space. Our key idea is to introduce a novel self-expressive layer between the encoder and the decoder to mimic the “self-expressiveness” property that has proven effective in traditional subspace clustering. Being differentiable, our new self-expressive layer provides a simple but effective way to learn pairwise affinities between all data points through a standard backpropagation procedure. Being nonlinear, our neural-network-based method is able to cluster data points having complex (often nonlinear) structures. We further propose pre-training and fine-tuning strategies that let us effectively learn the parameters of our subspace clustering networks. Our experiments show that the proposed method significantly outperforms the state-of-the-art unsupervised subspace clustering methods.
1 Introduction
In this paper, we tackle the problem of subspace clustering vidal2011subspace (), a subfield of unsupervised learning that aims to cluster data points drawn from a union of low-dimensional subspaces. Subspace clustering has become an important problem as it has found various applications in computer vision, e.g., image segmentation yang2008unsupervised (); ma2007segmentation (), motion segmentation kanatani2001motion (); elhamifar2009sparse (), and image clustering ho2003clustering (); elhamifar2013sparse (). For example, under Lambertian reflectance, the face images of one subject obtained with a fixed pose and varying lighting conditions lie in a low-dimensional subspace of dimension close to nine basri2003lambertian (). Therefore, one can employ subspace clustering to group images of multiple subjects according to their respective subjects.
Most recent works on subspace clustering yan2006general (); chen2009spectral (); elhamifar2013sparse (); liu2013robust (); wang2013provable (); lu2012robust (); ji2015shape (); you2016oracle () focus on clustering linear subspaces. However, in practice, the data do not necessarily conform to linear subspace models. For instance, in the example of face image clustering, reflectance is typically non-Lambertian and the pose of the subject often varies. Under these conditions, the face images of one subject instead lie in a nonlinear subspace (or sub-manifold). A few works chen2009kernel (); patel2013latent (); patel2014kernel (); yin2016kernel (); xiao2016robust () have proposed to exploit the kernel trick shawe2004kernel () to address the case of nonlinear subspaces. However, the selection of different kernel types is largely empirical, and there is no clear reason to believe that the implicit feature space corresponding to a predefined kernel is truly well-suited to subspace clustering.
In this paper, by contrast, we introduce a novel deep neural network architecture to learn (in an unsupervised manner) an explicit nonlinear mapping of the data that is well-adapted to subspace clustering. To this end, we build our deep subspace clustering networks (DSCNets) upon deep autoencoders, which nonlinearly map the data points to a latent space through a series of encoder layers. Our key contribution then consists of introducing a novel self-expressive layer, i.e., a fully connected layer without bias and nonlinear activations, at the junction between the encoder and the decoder. This layer encodes the “self-expressiveness” property rao2008motion (); elhamifar2009sparse () of data drawn from a union of subspaces, that is, the fact that each data sample can be represented as a linear combination of other samples in the same subspace. To the best of our knowledge, our approach constitutes the first attempt to directly learn the affinities (through combination coefficients) between all data points within one neural network. Furthermore, we propose effective pre-training and fine-tuning strategies to learn the parameters of our DSCNets in an unsupervised manner and with a limited amount of data.
We extensively evaluate our method on face clustering, using the Extended Yale B lee2005acquiring () and ORL samaria1994parameterisation () datasets, and on general object clustering, using COIL20 coil20 () and COIL100 coil100 (). Our experiments show that our DSCNets significantly outperform the state-of-the-art subspace clustering methods.
2 Related Work
Subspace Clustering.
Over the years, many methods have been developed for linear subspace clustering. In general, these methods consist of two steps: the first and also most crucial one aims to estimate an affinity for every pair of data points to form an affinity matrix; the second step then applies normalized cuts shi2000normalized () or spectral clustering ng2001spectral () using this affinity matrix. The resulting methods can then be roughly divided into three categories vidal2011subspace (): factorization methods SIM (); kanatani2001motion (); vidal2008multiframe (); mo2012semi (); ji2015shape (), higher-order model based methods yan2006general (); chen2009spectral (); Brox_CVPR12 (); purkait2014clustering (), and self-expressiveness based methods elhamifar2009sparse (); liu2010robust (); lu2012robust (); wang2013provable (); ji2014efficient (); feng2014robust (); li2015structured (); you2016oracle (). In essence, factorization methods build the affinity matrix by factorizing the data matrix, and methods based on higher-order models estimate the affinities by exploiting the residuals of local subspace model fitting. Recently, self-expressiveness based methods, which seek to express the data points as a linear combination of other points in the same subspace, have become the most popular ones. These methods build the affinity matrix using the matrix of combination coefficients. Compared to factorization techniques, self-expressiveness based methods are often more robust to noise and outliers when relying on regularization terms to account for data corruptions. They also have the advantage over higher-order model based methods of considering connections between all data points rather than exploiting local models, which are often suboptimal.
To handle situations where data points do not exactly reside in a union of linear subspaces, but rather in nonlinear ones, a few works patel2013latent (); patel2014kernel (); yin2016kernel (); xiao2016robust () have proposed to replace the inner product of the data matrix with a predefined kernel matrix (e.g., polynomial kernel and Gaussian RBF kernel). There is, however, no clear reason why such kernels should correspond to feature spaces that are well-suited to subspace clustering. By contrast, here, we propose to explicitly learn one that is.
Auto-Encoders.
Autoencoders (AEs) can nonlinearly transform data into a latent space. When this latent space has lower dimension than the original one hinton2006reducing (), this can be viewed as a form of nonlinear PCA. An autoencoder typically consists of an encoder and a decoder to define the data reconstruction cost. With the success of deep learning lecun2015deep (), deep (or stacked) AEs have become popular for unsupervised learning. For instance, deep AEs have proven useful for dimensionality reduction hinton2006reducing () and image denoising vincent2010stacked (). Recently, deep AEs have also been used to initialize deep embedding networks for unsupervised clustering xie2016unsupervised (). A convolutional version of deep AEs was also applied to extract hierarchical features and to initialize convolutional neural networks (CNNs) masci2011stacked ().
There has been little work in the literature combining deep learning with subspace clustering. To the best of our knowledge, the only exception is peng2016deep (), which first extracts SIFT lowe2004distinctive () or HOG dalal2005histograms () features from the images and feeds them to a fully connected deep autoencoder with a sparse subspace clustering (SSC) elhamifar2013sparse () prior. The final clustering is then obtained by applying k-means or SSC on the learned autoencoder features. In essence, peng2016deep () can be thought of as a subspace clustering method based on k-means or SSC with deep autoencoder features. Our method significantly differs from peng2016deep () in that our network is designed to directly learn the affinities, thanks to our new self-expressive layer.
3 Deep Subspace Clustering Networks (DSCNets)
Our deep subspace clustering networks leverage deep autoencoders and the self-expressiveness property. Before introducing our networks, we first discuss this property in more detail.
3.1 Self-Expressiveness
Given data points $\{x_i\}_{i=1,\dots,N}$ drawn from multiple linear subspaces $\{S_i\}_{i=1,\dots,K}$, one can express a point in a subspace as a linear combination of other points in the same subspace. In the literature rao2008motion (); elhamifar2009sparse (), this property is called self-expressiveness. If we stack all the points $x_i$ into columns of a data matrix $X$, the self-expressiveness property can be simply represented as one single equation, i.e., $X = XC$, where $C$ is the self-representation coefficient matrix. It has been shown in ji2014efficient () that, under the assumption that the subspaces are independent, by minimizing certain norms of $C$, $C$ is guaranteed to have a block-diagonal structure (up to certain permutations), i.e., $c_{ij} \neq 0$ iff point $x_i$ and point $x_j$ lie in the same subspace. So we can leverage the matrix $C$ to construct the affinity matrix for spectral clustering. Mathematically, this idea is formalized as the optimization problem
$$\min_{C} \|C\|_p \quad \text{s.t.} \quad X = XC, \ (\operatorname{diag}(C) = 0), \tag{1}$$
where $\|\cdot\|_p$ represents an arbitrary matrix norm, and the optional diagonal constraint on $C$ prevents trivial solutions for sparsity inducing norms, such as the $\ell_1$ norm. Various norms for $C$ have been proposed in the literature, e.g., the $\ell_1$ norm in Sparse Subspace Clustering (SSC) elhamifar2009sparse (); elhamifar2013sparse (), the nuclear norm in Low Rank Representation (LRR) liu2010robust (); liu2013robust () and Low Rank Subspace Clustering (LRSC) favaro2011closed (); vidal2014low (), and the Frobenius norm in Least-Squares Regression (LSR) lu2012robust () and Efficient Dense Subspace Clustering (EDSC) ji2014efficient (). To account for data corruptions, the equality constraint in (1) is often relaxed as a regularization term, leading to
$$\min_{C} \|C\|_p + \frac{\lambda}{2} \|X - XC\|_F^2 \quad \text{s.t.} \quad (\operatorname{diag}(C) = 0). \tag{2}$$
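As a rough illustration of the relaxed problem (2), the following NumPy sketch solves the special case of a squared Frobenius norm on the coefficient matrix without the diagonal constraint, for which a closed-form solution exists. The function name, the weight `lam`, and the toy data are assumptions made for this example only; the two subspaces are taken to be mutually orthogonal (a stronger condition than independence) so that the block-diagonal structure is exact.

```python
import numpy as np

def self_expressive_coefficients(X, lam=100.0):
    """Closed-form minimizer of ||C||_F^2 + (lam / 2) * ||X - X C||_F^2.

    X stores one data point per column. The squared Frobenius norm and the
    absence of the diag(C) = 0 constraint are simplifying assumptions.
    """
    gram = X.T @ X
    n = gram.shape[0]
    # Zeroing the gradient 2C - lam * X^T (X - XC) gives
    # (2I + lam * X^T X) C = lam * X^T X.
    return np.linalg.solve(2.0 * np.eye(n) + lam * gram, lam * gram)

rng = np.random.default_rng(0)
# Two mutually orthogonal 2-D subspaces of R^6, five points in each.
basis, _ = np.linalg.qr(rng.standard_normal((6, 4)))
X = np.hstack([basis[:, :2] @ rng.standard_normal((2, 5)),
               basis[:, 2:] @ rng.standard_normal((2, 5))])

C = self_expressive_coefficients(X)
within_block = np.abs(C[:5, :5]).max()   # coefficients inside subspace 1
cross_block = np.abs(C[:5, 5:]).max()    # coefficients across subspaces
```

In this toy setting the cross-subspace coefficients vanish up to round-off, which is the structure that spectral clustering later exploits.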
Unfortunately, the self-expressiveness property only holds for linear subspaces. While kernel based methods patel2013latent (); patel2014kernel (); yin2016kernel (); xiao2016robust () aim to tackle the nonlinear case, it is not clear that predefined kernels yield implicit feature spaces that are well-suited for subspace clustering. In this work, we aim to learn an explicit mapping that makes the subspaces more separable. To this end, and as discussed below, we propose to build our networks upon deep autoencoders.
3.2 Self-Expressive Layer in Deep Auto-Encoders
Our goal is to train a deep autoencoder, such as the one depicted by Figure 1, such that its latent representation is well-suited to subspace clustering. To this end, we introduce a new layer that encodes the notion of self-expressiveness.
Specifically, let $\Theta$ denote the autoencoder parameters, which can be decomposed into encoder parameters $\Theta_e$ and decoder parameters $\Theta_d$. Furthermore, let $Z_{\Theta_e}$ denote the output of the encoder, i.e., the latent representation of the data matrix $X$. To encode self-expressiveness, we introduce a new loss function defined as
$$L(\Theta, C) = \frac{1}{2} \|X - \hat{X}_{\Theta}\|_F^2 + \lambda_1 \|C\|_p + \frac{\lambda_2}{2} \|Z_{\Theta_e} - Z_{\Theta_e} C\|_F^2 \quad \text{s.t.} \quad (\operatorname{diag}(C) = 0), \tag{3}$$
where $\hat{X}_{\Theta}$ represents the data reconstructed by the autoencoder. To minimize (3), we propose to leverage the fact that, as discussed below, $C$ can be thought of as the parameters of an additional network layer, which lets us solve for $\Theta$ and $C$ jointly using backpropagation.¹
¹ Note that one could also alternate minimization between $\Theta$ and $C$. However, since the loss is non-convex, this would not provide better convergence guarantees and would require investigating the influence of the number of steps in the optimization w.r.t. $\Theta$ on the clustering results.
Specifically, consider the self-expressiveness term in (3), $\|Z_{\Theta_e} - Z_{\Theta_e} C\|_F^2$. Since each data point $z_i$ (in the latent space) is approximated by a weighted linear combination of other points $\{z_j\}_{j=1,\dots,N}$ (optionally, $j \neq i$) with weights $c_{ji}$, this linear operation corresponds exactly to a set of linear neurons without nonlinear activations. Therefore, if we take each $z_i$ as a node in the network, we can then represent the self-expressiveness term with a fully-connected linear layer, which we call the self-expressive layer. The weights of the self-expressive layer correspond to the matrix $C$ in (3), which are further used to construct affinities between all data points. Therefore, our self-expressive layer essentially lets us directly learn the affinity matrix via the network. Moreover, minimizing $\|C\|_p$ simply translates to adding a regularizer to the weights of the self-expressive layer. In this work, we consider two kinds of regularizations on $C$: (i) the $\ell_1$ norm, resulting in a network denoted by DSCNetL1; (ii) the $\ell_2$ norm, resulting in a network denoted by DSCNetL2.
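The view of self-expressiveness as a bias-free linear layer can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: latent codes are stored one per column (matching the column-stacking convention used for the data matrix), and the function name and the `zero_diagonal` switch are hypothetical.

```python
import numpy as np

def self_expressive_layer(Z, C, zero_diagonal=True):
    """Bias-free linear layer without activation: returns Z @ C.

    Z: (d, N) latent codes, one data point per column.
    C: (N, N) layer weights, i.e. the self-expression coefficients.
    """
    if zero_diagonal:
        # Optional constraint diag(C) = 0: a point may not explain itself.
        C = C - np.diag(np.diag(C))
    return Z @ C

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 6))
C = rng.standard_normal((6, 6))
Z_tilde = self_expressive_layer(Z, C)

# Column i of the output is a weighted sum of the *other* columns of Z.
i = 2
manual = sum(C[j, i] * Z[:, j] for j in range(6) if j != i)
```

Because the layer is a plain matrix product, its weights receive gradients through backpropagation like those of any other fully connected layer.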
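For notational consistency, let us denote the parameters of the self-expressive layer (which are just the elements of $C$) as $\Theta_s$. As can be seen from Figure 2, we then take the input to the decoder part of our network to be the transformed latent representation $Z_{\Theta_e} \Theta_s$. This lets us rewrite our loss function as
$$\tilde{L}(\tilde{\Theta}) = \frac{1}{2} \|X - \hat{X}_{\tilde{\Theta}}\|_F^2 + \lambda_1 \|\Theta_s\|_p + \frac{\lambda_2}{2} \|Z_{\Theta_e} - Z_{\Theta_e} \Theta_s\|_F^2 \quad \text{s.t.} \quad (\operatorname{diag}(\Theta_s) = 0), \tag{4}$$
where the network parameters $\tilde{\Theta}$ now consist of encoder parameters $\Theta_e$, self-expressive layer parameters $\Theta_s$, and decoder parameters $\Theta_d$, and where the reconstructed data $\hat{X}$ is now a function of $\{\Theta_e, \Theta_s, \Theta_d\}$ rather than just $\{\Theta_e, \Theta_d\}$ in (3).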
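For concreteness, the three terms of the loss (4) can be evaluated as in the sketch below. The function and argument names are hypothetical, the encoder and decoder are abstracted away (their outputs `Z` and `X_hat` are passed in directly), and the "l2" option uses the squared Frobenius norm, which is an assumption of this example.

```python
import numpy as np

def dsc_loss(X, X_hat, Z, theta_s, lam1, lam2, norm="l2"):
    """Reconstruction + regularization + self-expressiveness terms.

    X, X_hat: data and its reconstruction (same shape).
    Z: (d, N) latent codes, one point per column.
    theta_s: (N, N) weights of the self-expressive layer.
    """
    reconstruction = 0.5 * np.sum((X - X_hat) ** 2)
    if norm == "l1":
        regularizer = np.sum(np.abs(theta_s))
    else:
        # Assumption: squared Frobenius norm for the "l2" variant.
        regularizer = np.sum(theta_s ** 2)
    self_expression = 0.5 * lam2 * np.sum((Z - Z @ theta_s) ** 2)
    return reconstruction + lam1 * regularizer + self_expression

X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_hat = np.zeros_like(X)
Z = np.eye(2)
theta_s = np.array([[0.0, 2.0], [2.0, 0.0]])

loss_l1 = dsc_loss(X, X_hat, Z, theta_s, lam1=1.0, lam2=0.5, norm="l1")
loss_l2 = dsc_loss(X, X_hat, Z, theta_s, lam1=1.0, lam2=0.5, norm="l2")
```

In an actual network, `Z` and `X_hat` would be produced by the encoder and the decoder, so that the same scalar drives the updates of all three groups of parameters.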
3.3 Network Architecture
Our network consists of three parts, i.e., stacked encoders, a self-expressive layer, and stacked decoders. The overall network architecture is shown in Figure 2. In this paper, since we focus on image clustering problems, we advocate the use of convolutional autoencoders that have fewer parameters than the fully connected ones and are thus easier to train. Note, however, that fully-connected autoencoders are also compatible with our self-expressive layer. In the convolutional layers, we use kernels with stride 2 in both horizontal and vertical directions, and rectified linear unit (ReLU) krizhevsky2012imagenet () for the nonlinear activations. Given $N$ images to be clustered, we use all the images in a single batch. Each input image is mapped by the convolutional encoder layers to a latent vector (or node) $z_i$, represented as a shaded circle in Figure 2. In the self-expressive layer, the nodes are fully connected using linear weights without bias and nonlinear activations. The latent vectors are then mapped back to the original image space via the deconvolutional decoder layers.
For the $i$-th encoder layer with $n_i$ channels of kernel size $k_i \times k_i$, the number of weight parameters is $k_i^2 n_{i-1} n_i$, with $n_0 = 1$. Since the encoders and decoders have symmetric structures, their total number of parameters is $\sum_i 2 k_i^2 n_{i-1} n_i$ plus the number of bias parameters $\sum_i (n_{i-1} + n_i)$. For $N$ input images, the number of parameters for the self-expressive layer is $N^2$. For example, with the three encoder layers of 10, 20, and 30 channels listed in Table 1, the encoders and decoders together have $7510 + 7481 = 14991$ parameters, whereas the self-expressive layer for the $N = 2432$ images of Extended Yale B has $2432^2 = 5914624$ parameters. Therefore, the network parameters are typically dominated by those of the self-expressive layer.
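The parameter counts can be checked with a short script. The sketch below assumes single-channel input images and kernel sizes of 5, 3, and 3 for the three encoder layers; these kernel sizes are the values that reproduce the per-layer counts listed in Table 1, and the helper names are hypothetical.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of one (de)convolutional layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def count_parameters(kernels, channels, n_images, in_channels=1):
    """Per-layer counts for encoders, mirrored decoders, and the
    self-expressive layer (one weight per pair of images)."""
    dims = [in_channels] + list(channels)
    encoder = [conv_params(k, dims[i], dims[i + 1])
               for i, k in enumerate(kernels)]
    # The decoder mirrors the encoder: layer i maps dims[i + 1] -> dims[i],
    # visited in reverse order.
    decoder = [conv_params(k, dims[i + 1], dims[i])
               for i, k in reversed(list(enumerate(kernels)))]
    return encoder, decoder, n_images ** 2

encoder, decoder, self_expressive = count_parameters(
    kernels=[5, 3, 3], channels=[10, 20, 30], n_images=2432)
autoencoder_total = sum(encoder) + sum(decoder)
```

With these settings the self-expressive layer holds several hundred times more parameters than the encoders and decoders combined.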
3.4 Training Strategy
Since the size of datasets for unsupervised subspace clustering is usually limited (e.g., in the order of thousands of images), our networks remain of a tractable size. However, for the same reason, it also remains difficult to directly train a network with millions of parameters from scratch. To address this, we design the pre-training and fine-tuning strategies described below. Note that this also allows us to avoid the trivial all-zero solution while minimizing the loss (4).
As illustrated in Figure 2, we first pre-train the deep autoencoder without the self-expressive layer on all the data we have. We then use the trained parameters to initialize the encoder and decoder layers of our network. After this, in the fine-tuning stage, we build a big batch using all the data to minimize the loss defined in (4) with a gradient descent method. Specifically, we used Adam kingma2014adam (), an adaptive momentum based gradient descent method, to minimize the loss, with the same learning rate in all our experiments. Since we always use the same batch in each training epoch, our optimization strategy is a deterministic momentum based gradient method rather than a stochastic gradient method. Note also that, since we only have access to images for training and not to cluster labels, our training strategy is unsupervised (or self-supervised).
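To illustrate why full-batch training with Adam is deterministic, the sketch below applies a hand-written Adam update to the self-expressive coefficients alone, keeping a fixed set of latent codes. Everything here is an assumption made for illustration: the learning rate `lr=1e-3`, the weights `lam1` and `lam2`, the squared Frobenius regularizer, and the random stand-in for the latent codes; the encoder and decoder are left out.

```python
import numpy as np

def adam_minimize(grad_fn, theta, lr=1e-3, steps=500,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam on a fixed objective (same full batch at every step)."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias corrections
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 20))   # stand-in latent codes, one per column
lam1, lam2 = 1.0, 10.0             # hypothetical regularization weights

def loss(C):
    return lam1 * np.sum(C ** 2) + 0.5 * lam2 * np.sum((Z - Z @ C) ** 2)

def grad(C):
    return 2.0 * lam1 * C - lam2 * Z.T @ (Z - Z @ C)

C0 = np.zeros((20, 20))
C_first = adam_minimize(grad, C0)
C_second = adam_minimize(grad, C0)  # identical run: no sampling noise
```

Since no mini-batches are sampled, two runs from the same initialization follow exactly the same trajectory.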
Once the network is trained, we can use the parameters of the self-expressive layer to construct an affinity matrix for spectral clustering ng2001spectral (), as illustrated in Figure 3. Although such an affinity matrix could in principle be computed as $|C| + |C^T|$, over the years, researchers in the field have developed many heuristics to improve the resulting matrix. Since there is no globally-accepted solution for this step in the literature, we make use of the heuristics employed by SSC elhamifar2013sparse () and EDSC ji2014efficient (). Due to the lack of space, we refer the reader to the publicly available implementation of SSC and Section 5 of ji2014efficient () for more detail, and to the TensorFlow implementation of our algorithm.²
² https://github.com/panji1990/Deepsubspaceclusteringnetworks
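The final step can be sketched as follows, using the basic symmetrized affinity built from the absolute coefficients rather than the SSC or EDSC heuristics, followed by a minimal normalized spectral clustering. The function names, the farthest-first initialization, and the small k-means loop are assumptions chosen to keep the example self-contained; they are not taken from the released implementation.

```python
import numpy as np

def affinity_from_coefficients(C):
    """Basic symmetric affinity built from the absolute coefficients."""
    return np.abs(C) + np.abs(C).T

def spectral_clustering(W, k, n_iter=50):
    """Normalized spectral clustering with a tiny k-means on the embedding."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    laplacian = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(laplacian)        # eigenvalues in ascending order
    U = vecs[:, :k]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Farthest-first initialization keeps this toy k-means deterministic.
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq_dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq_dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
C = np.zeros((7, 7))
C[:3, :3] = rng.uniform(0.5, 1.0, (3, 3))    # points 0-2: first subspace
C[3:, 3:] = rng.uniform(0.5, 1.0, (4, 4))    # points 3-6: second subspace
np.fill_diagonal(C, 0.0)

labels = spectral_clustering(affinity_from_coefficients(C), k=2)
```

For an ideal block-diagonal coefficient matrix such as this one, the two groups are recovered exactly; the heuristics mentioned above matter when the learned coefficients are noisy.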
4 Experiments
We implemented our method in Python with TensorFlow 1.0 abadi2016tensorflow (), and evaluated it on four standard datasets, i.e., the Extended Yale B and ORL face image datasets, and the COIL20/100 object image datasets. We compare our methods against the following baselines: Low Rank Representation (LRR) liu2013robust (), Low Rank Subspace Clustering (LRSC) vidal2014low (), Sparse Subspace Clustering (SSC) elhamifar2013sparse (), Kernel Sparse Subspace Clustering (KSSC) patel2014kernel (), SSC by Orthogonal Matching Pursuit (SSCOMP) you2016scalable (), Efficient Dense Subspace Clustering (EDSC) ji2014efficient (), SSC with the pre-trained convolutional autoencoder features (AE+SSC), and EDSC with the pre-trained convolutional autoencoder features (AE+EDSC). For all the baselines, we used the source code released by the authors and tuned the parameters by grid search to achieve the best results on each dataset. Since the code for the deep subspace clustering method of peng2016deep () is not publicly available, we are only able to provide a comparison against this approach on Extended Yale B and COIL20, for which the results are provided in peng2016deep (). Note that this comparison already clearly shows the benefits of our approach.
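For all quantitative evaluations, we make use of the clustering error rate, defined as
$$\mathrm{err}\,\% = \frac{\#\ \text{of wrongly clustered points}}{\text{total}\ \#\ \text{of points}} \times 100\%. \tag{5}$$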
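Since cluster indices are arbitrary, computing the error rate requires first matching predicted clusters to ground-truth classes. The sketch below assumes the best one-to-one matching is used and finds it by brute force over permutations, which is only practical for a handful of clusters (an assignment solver would be needed for 38 or 100 classes). The function name is hypothetical.

```python
from itertools import permutations

import numpy as np

def clustering_error(true_labels, pred_labels):
    """Percentage of wrongly clustered points under the best label matching."""
    t = np.unique(np.asarray(true_labels), return_inverse=True)[1]
    p = np.unique(np.asarray(pred_labels), return_inverse=True)[1]
    k = int(max(t.max(), p.max())) + 1
    # counts[a, b] = number of points of true class a placed in cluster b.
    counts = np.zeros((k, k), dtype=int)
    for a, b in zip(t, p):
        counts[a, b] += 1
    # Brute-force search over one-to-one matchings (assumption: small k).
    best = max(sum(int(counts[a, perm[a]]) for a in range(k))
               for perm in permutations(range(k)))
    return 100.0 * (len(t) - best) / len(t)

perfect = clustering_error([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
one_wrong = clustering_error([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0])
```

In the second call, one of the six points ends up in the wrong cluster after matching, giving an error of 100/6 percent.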

4.1 Extended Yale B Dataset
The Extended Yale B dataset lee2005acquiring () is a popular benchmark for subspace clustering. It consists of 38 subjects, each of which is represented with 64 face images acquired under different illumination conditions (see Figure 4(a) for sample images from this dataset). Following the experimental setup of elhamifar2013sparse (), we downsampled the original face images, which makes it computationally feasible for the baselines elhamifar2013sparse (); liu2013robust (). In each experiment, we pick $K \in \{10, 15, 20, 25, 30, 35, 38\}$ subjects (each subject with 64 face images) to test the robustness w.r.t. an increasing number of clusters. Taking all possible combinations of $K$ subjects out of 38 would result in too many experimental trials. To get a manageable size of experiments, we first number the subjects from 1 to 38 and then take all possible $K$ consecutive subjects. For example, in the case of 10 subjects, we take all the images from subjects 1 to 10, 2 to 11, ..., 29 to 38, giving rise to 29 experimental trials.
We experimented with different architectures for the convolutional layers of our network, e.g., different network depths and number of channels. While increasing these values increases the representation power of the network, it also increases the number of network parameters, thus requiring larger training datasets. Since the size of Extended Yale B is quite limited, with only 2432 images, we found having three-layer encoders and decoders with 10, 20, and 30 channels to be a good trade-off for this dataset. The detailed network settings are described in Table 1. In the fine-tuning phase, since the number of epochs required for gradient descent increases as the number of subjects increases, we increased the number of epochs with the number of subjects for both DSCNetL1 and DSCNetL2.
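The trial construction described above amounts to sliding a window of consecutive subject indices over the 38 subjects. A minimal sketch, with a hypothetical helper name:

```python
def consecutive_trials(n_subjects_total, k):
    """All windows of k consecutive subject indices (1-based)."""
    return [list(range(start, start + k))
            for start in range(1, n_subjects_total - k + 2)]

trials = consecutive_trials(38, 10)
n_trials = len(trials)
```

In general this yields 38 - k + 1 trials for k subjects, so the full set of 38 subjects gives a single trial, which is consistent with the identical mean and median reported for that setting.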
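Table 1: Network settings for Extended Yale B.
| layers | encoder1 | encoder2 | encoder3 | self-expressive | decoder1 | decoder2 | decoder3 |
|---|---|---|---|---|---|---|---|
| kernel size | 5×5 | 3×3 | 3×3 | – | 3×3 | 3×3 | 5×5 |
| channels | 10 | 20 | 30 | – | 30 | 20 | 10 |
| parameters | 260 | 1820 | 5430 | 5914624 | 5420 | 1810 | 251 |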
The clustering performance of different methods for different numbers of subjects is provided in Table 2. For the experiments with $K$ subjects, we report the mean and median errors of the $39 - K$ experimental trials. From these results, we can see that the performance of most of the baselines decreases dramatically as the number of subjects increases. By contrast, the performance of our deep subspace clustering methods, DSCNetL1 and DSCNetL2, remains relatively stable w.r.t. the number of clusters. Specifically, our DSCNetL2 achieves a 2.67% error rate for 38 subjects, which is less than a quarter of that of the best-performing baseline, EDSC (11.64%). We also observe that using the pre-trained autoencoder features does not necessarily improve the performance of SSC and EDSC, which confirms the benefits of our joint optimization of all parameters in one network. The result of peng2016deep () on this dataset for 38 subjects, reported in terms of clustering accuracy, corresponds to a clustering error that is worse than that of both our methods, DSCNetL1 and DSCNetL2. We further notice that DSCNetL1 performs slightly worse than DSCNetL2 in the current experimental settings. We conjecture that this is due to the difficulty in optimization introduced by the $\ell_1$ norm, as it is non-differentiable at zero.
Table 2: Clustering error (in %) of different methods on Extended Yale B.
| Subjects | Statistic | LRR | LRSC | SSC | AE+SSC | KSSC | SSCOMP | EDSC | AE+EDSC | DSCNetL1 | DSCNetL2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | Mean | 22.22 | 30.95 | 10.22 | 17.06 | 14.49 | 12.08 | 5.64 | 5.46 | 2.23 | 1.59 |
| 10 | Median | 23.49 | 29.38 | 11.09 | 17.75 | 15.78 | 8.28 | 5.47 | 6.09 | 2.03 | 1.25 |
| 15 | Mean | 23.22 | 31.47 | 13.13 | 18.65 | 16.22 | 14.05 | 7.63 | 6.70 | 2.17 | 1.69 |
| 15 | Median | 23.49 | 31.64 | 13.40 | 17.76 | 17.34 | 14.69 | 6.41 | 5.52 | 2.03 | 1.72 |
| 20 | Mean | 30.23 | 28.76 | 19.75 | 18.23 | 16.55 | 15.16 | 9.30 | 7.67 | 2.17 | 1.73 |
| 20 | Median | 29.30 | 28.91 | 21.17 | 16.80 | 17.34 | 15.23 | 10.31 | 6.56 | 2.11 | 1.80 |
| 25 | Mean | 27.92 | 27.81 | 26.22 | 18.72 | 18.56 | 18.89 | 10.67 | 10.27 | 2.53 | 1.75 |
| 25 | Median | 28.13 | 26.81 | 26.66 | 17.88 | 18.03 | 18.53 | 10.84 | 10.22 | 2.19 | 1.81 |
| 30 | Mean | 37.98 | 30.64 | 28.76 | 19.99 | 20.49 | 20.75 | 11.24 | 11.56 | 2.63 | 2.07 |
| 30 | Median | 36.82 | 30.31 | 28.59 | 20.00 | 20.94 | 20.52 | 11.09 | 10.36 | 2.81 | 2.19 |
| 35 | Mean | 41.85 | 31.35 | 28.55 | 22.13 | 26.07 | 20.29 | 13.10 | 13.28 | 3.09 | 2.65 |
| 35 | Median | 41.81 | 31.74 | 29.04 | 21.74 | 25.92 | 20.18 | 13.10 | 13.21 | 3.10 | 2.64 |
| 38 | Mean | 34.87 | 29.89 | 27.51 | 25.33 | 27.75 | 24.71 | 11.64 | 12.66 | 3.33 | 2.67 |
| 38 | Median | 34.87 | 29.89 | 27.51 | 25.33 | 27.75 | 24.71 | 11.64 | 12.66 | 3.33 | 2.67 |
4.2 ORL Dataset
The ORL dataset samaria1994parameterisation () is composed of 400 human face images, with 40 subjects each having 10 samples. Following cai2007learning (), we downsampled the original face images. For each subject, the images were taken under varying lighting conditions with different facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses) (see Figure 4(b) for sample images). Compared to Extended Yale B, this dataset is more challenging for subspace clustering because (i) the face subspaces have more nonlinearity due to varying facial expressions and details; (ii) the dataset size is much smaller (400 vs. 2432). To design a trainable deep autoencoder on 400 images, we reduced the number of network parameters by decreasing the number of channels in each encoder and decoder layer. The resulting network is specified in Table 3.
Since we already verified the robustness of our method to the number of clusters in the previous experiment, here, we only provide results for clustering all 40 subjects. In this setting, we set and and run epochs for DSCNetL2 and epochs for DSCNetL1 during finetuning. Note that, since the size of this dataset is small, we can even use the whole data as a single batch in pretraining. We found this to be numerically more stable and converge faster than stochastic gradient descent using randomly sampled minibatches.
Figure 5(a) shows the error rates of the different methods, where different colors denote different subspace clustering algorithms and the length of the bars reflects the error rate. Since there are far fewer samples per subject, all competing methods perform worse than on Extended Yale B. Note that both EDSC and SSC achieve moderate clustering improvement by using the features of pre-trained convolutional autoencoders, but their error rates are still around twice as high as those of our methods.
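Table 3: Network settings for ORL.
| layers | encoder1 | encoder2 | encoder3 | self-expressive | decoder1 | decoder2 | decoder3 |
|---|---|---|---|---|---|---|---|
| kernel size | 5×5 | 3×3 | 3×3 | – | 3×3 | 3×3 | 5×5 |
| channels | 5 | 3 | 3 | – | 3 | 3 | 5 |
| parameters | 130 | 138 | 84 | 160000 | 84 | 140 | 126 |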
4.3 COIL20 and COIL100 Datasets
The previous experiments both target face clustering. To show the generality of our algorithm, we also evaluate it on the COIL object image datasets, COIL20 coil20 () and COIL100 coil100 (). COIL20 consists of 1440 gray-scale image samples, distributed over 20 objects such as duck and car model (see sample images in Figure 4(c)). Similarly, COIL100 consists of 7200 images distributed over 100 objects. Each object was placed on a turntable against a black background, and 72 images were taken at pose intervals of 5 degrees. Following cai2011graph (), we downsampled the images. In contrast with the previous human face datasets, in which faces are well aligned and have similar structures, the object images from COIL20 and COIL100 are more diverse, and even samples from the same object differ from each other due to the change of viewing angle. This makes these datasets challenging for subspace clustering techniques. For these datasets, we used shallower networks with one encoder layer, one self-expressive layer, and one decoder layer. For COIL20, we set the number of channels to 15 and the kernel size to $3 \times 3$. For COIL100, we increased the number of channels to 50 and the kernel size to $5 \times 5$. The settings for both networks are provided in Table 4. Note that with these network architectures, the dimension of the latent space representation increases by a factor of 15/4 for COIL20 (as the spatial resolution of each channel shrinks to 1/4 of the input image after convolutions with stride 2, and we have 15 channels) and 50/4 for COIL100. Thus our networks perform dimensionality lifting rather than dimensionality reduction. This, in some sense, is similar to the idea of Hilbert space mapping in kernel methods shawe2004kernel (), but with the difference that, in our case, the mapping is explicit, via the neural network. In our experiments, we found that these shallow, dimension-lifting networks performed better than deep, bottleneck ones on these datasets.
While it is also possible to design deep, dimension-lifting networks, the number of channels has to increase by a factor of 4 after each layer to compensate for the resolution loss. For example, if we want the latent space dimension to increase by a factor of 15/4, we need $15 \times 4 = 60$ channels in the second layer for a 2-layer encoder, $15 \times 4^2 = 240$ channels in the third layer for a 3-layer encoder, and so forth. In the presence of limited data, this increasing number of parameters makes training less reliable. In our fine-tuning stage, we ran 30 epochs (COIL20) / 100 epochs (COIL100) for DSCNetL1 and 40 epochs (COIL20) / 120 epochs (COIL100) for DSCNetL2.
Figure 5(b) and (c) depict the error rates of the different methods on clustering 20 classes for COIL20 and 100 classes for COIL100, respectively. Note that, in both cases, our DSCNetL2 achieves the lowest error rate. In particular, for COIL20, we obtain an error of 5.34%, which is roughly 1/3 of the error rate of the best-performing baseline AE+EDSC. The clustering error reported by peng2016deep () on COIL20 is also much higher than ours.
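The channel arithmetic behind dimension lifting can be verified with exact fractions. The sketch below assumes single-channel input images and stride-2 convolutions that keep one quarter of the spatial positions per layer; the helper names are hypothetical.

```python
from fractions import Fraction

def latent_expansion(last_layer_channels, n_layers, stride=2):
    """Latent dimension divided by input dimension (single-channel input)."""
    return Fraction(last_layer_channels, stride ** (2 * n_layers))

def channels_for_expansion(target, n_layers, stride=2):
    """Channels needed in the last encoder layer to reach a target factor."""
    needed = Fraction(target) * stride ** (2 * n_layers)
    if needed.denominator != 1:
        raise ValueError("target factor not reachable with whole channels")
    return int(needed)

coil20_factor = latent_expansion(15, n_layers=1)     # shallow COIL20 network
coil100_factor = latent_expansion(50, n_layers=1)    # shallow COIL100 network
two_layer = channels_for_expansion(Fraction(15, 4), n_layers=2)
three_layer = channels_for_expansion(Fraction(15, 4), n_layers=3)
```

The required channel count grows fourfold with each additional stride-2 layer, which is why deeper dimension-lifting networks quickly become parameter-heavy.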
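Table 4: Network settings for COIL20 and COIL100.
| dataset | COIL20 | COIL20 | COIL20 | COIL100 | COIL100 | COIL100 |
|---|---|---|---|---|---|---|
| layers | encoder1 | self-expressive | decoder1 | encoder1 | self-expressive | decoder1 |
| kernel size | 3×3 | – | 3×3 | 5×5 | – | 5×5 |
| channels | 15 | – | 15 | 50 | – | 50 |
| parameters | 150 | 2073600 | 136 | 1300 | 51840000 | 1251 |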
5 Conclusion
We have introduced a deep autoencoder framework for subspace clustering by developing a novel self-expressive layer to harness the “self-expressiveness” property of a union of subspaces. Our deep subspace clustering network allows us to directly learn the affinities between all data points through one neural network. Furthermore, we have proposed pre-training and fine-tuning strategies to train our network, demonstrating the ability to handle challenging scenarios with small-size datasets, such as the ORL dataset. Our experiments have demonstrated that our deep subspace clustering methods provide significant improvement over the state-of-the-art subspace clustering solutions in terms of clustering accuracy on several standard datasets.
References
 (1) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
 (2) R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. TPAMI, 25(2):218–233, 2003.
 (3) D. Cai, X. He, J. Han, and T. Huang. Graph regularized nonnegative matrix factorization for data representation. TPAMI, 33(8):1548–1560, 2011.
 (4) D. Cai, X. He, Y. Hu, J. Han, and T. Huang. Learning a spatially smooth subspace for face recognition. In CVPR, pages 1–7. IEEE, 2007.
 (5) G. Chen, S. Atev, and G. Lerman. Kernel spectral curvature clustering (KSCC). In ICCV Workshops, pages 765–772. IEEE, 2009.
 (6) G. Chen and G. Lerman. Spectral curvature clustering (SCC). IJCV, 81(3):317–330, 2009.
 (7) J. Costeira and T. Kanade. A multibody factorization method for independently moving objects. IJCV, 29(3):159–179, 1998.
 (8) N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR 2005, pages 886–893. IEEE, 2005.
 (9) E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR, pages 2790–2797, 2009.
 (10) E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. TPAMI, 35(11):2765–2781, 2013.
 (11) P. Favaro, R. Vidal, and A. Ravichandran. A closed form solution to robust subspace estimation and clustering. In CVPR, pages 1801–1807. IEEE, 2011.
 (12) J. Feng, Z. Lin, H. Xu, and S. Yan. Robust subspace segmentation with block-diagonal prior. In CVPR, pages 3818–3825, 2014.
 (13) G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 (14) J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman. Clustering appearances of objects under varying illumination conditions. In CVPR, volume 1, pages 11–18. IEEE, 2003.
 (15) P. Ji, M. Salzmann, and H. Li. Efficient dense subspace clustering. In WACV, pages 461–468. IEEE, 2014.
 (16) P. Ji, M. Salzmann, and H. Li. Shape interaction matrix revisited and robustified: Efficient subspace clustering with corrupted and incomplete data. In ICCV, pages 4687–4695, 2015.
 (17) K.i. Kanatani. Motion segmentation by subspace separation and model selection. In ICCV, volume 2, pages 586–591. IEEE, 2001.
 (18) D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
 (19) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 (20) Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 (21) K.-C. Lee, J. Ho, and D. J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. TPAMI, 27(5):684–698, 2005.
 (22) C.-G. Li and R. Vidal. Structured sparse subspace clustering: A unified optimization framework. In CVPR, pages 277–286, 2015.
 (23) G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. TPAMI, 35(1):171–184, 2013.
 (24) G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, pages 663–670, 2010.
 (25) D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
 (26) C.-Y. Lu, H. Min, Z.-Q. Zhao, L. Zhu, D.-S. Huang, and S. Yan. Robust and efficient subspace segmentation via least squares regression. In ECCV, pages 347–360. Springer, 2012.
 (27) Y. Ma, H. Derksen, W. Hong, and J. Wright. Segmentation of multivariate mixed data via lossy data coding and compression. TPAMI, 29(9), 2007.
 (28) J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber. Stacked convolutional autoencoders for hierarchical feature extraction. Artificial Neural Networks and Machine Learning–ICANN 2011, pages 52–59, 2011.
 (29) Q. Mo and B. A. Draper. Semi-nonnegative matrix factorization for motion segmentation with missing data. In ECCV, pages 402–415. Springer, 2012.
 (30) S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-100). Technical Report CUCS-006-96, 1996.
 (31) S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-20). Technical Report CUCS-005-96, 1996.
 (32) A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.
 (33) P. Ochs and T. Brox. Higher order motion models and spectral clustering. In CVPR, 2012.
 (34) V. M. Patel, H. Van Nguyen, and R. Vidal. Latent space sparse subspace clustering. In ICCV, pages 225–232, 2013.
 (35) V. M. Patel and R. Vidal. Kernel sparse subspace clustering. In ICIP, pages 2849–2853. IEEE, 2014.
 (36) X. Peng, S. Xiao, J. Feng, W.-Y. Yau, and Z. Yi. Deep subspace clustering with sparsity prior. In IJCAI, 2016.
 (37) P. Purkait, T.-J. Chin, H. Ackermann, and D. Suter. Clustering with hypergraphs: the case for large hyperedges. In ECCV, pages 672–687. Springer, 2014.
 (38) S. R. Rao, R. Tron, R. Vidal, and Y. Ma. Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In CVPR, pages 1–8. IEEE, 2008.
 (39) F. S. Samaria and A. C. Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, pages 138–142. IEEE, 1994.
 (40) J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.
 (41) J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22(8):888–905, 2000.
 (42) R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2011.
 (43) R. Vidal and P. Favaro. Low rank subspace clustering (LRSC). Pattern Recognition Letters, 43:47–61, 2014.
 (44) R. Vidal, R. Tron, and R. Hartley. Multiframe motion segmentation with missing data using PowerFactorization and GPCA. IJCV, 79(1):85–105, 2008.
 (45) P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11(Dec):3371–3408, 2010.
 (46) Y.-X. Wang, H. Xu, and C. Leng. Provable subspace clustering: When LRR meets SSC. In NIPS, pages 64–72, 2013.
 (47) S. Xiao, M. Tan, D. Xu, and Z. Y. Dong. Robust kernel low-rank representation. IEEE Transactions on Neural Networks and Learning Systems, 27(11):2268–2281, 2016.
 (48) J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
 (49) J. Yan and M. Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In ECCV, pages 94–106. Springer, 2006.
 (50) A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry. Unsupervised segmentation of natural images via lossy data compression. CVIU, 110(2):212–225, 2008.
 (51) M. Yin, Y. Guo, J. Gao, Z. He, and S. Xie. Kernel sparse subspace clustering on symmetric positive definite manifolds. In CVPR, pages 5157–5164, 2016.
 (52) C. You, C.-G. Li, D. P. Robinson, and R. Vidal. Oracle based active set algorithm for scalable elastic net subspace clustering. In CVPR, pages 3928–3937, 2016.
 (53) C. You, D. Robinson, and R. Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In CVPR, pages 3918–3927, 2016.