Clustering by Directly Disentangling Latent Space
Abstract
To overcome the high dimensionality of data, learning latent feature representations for clustering has been widely studied recently. However, it remains challenging to learn “cluster-friendly” latent representations due to the unsupervised nature of clustering. In this paper, we propose Disentangling Latent Space Clustering (DLS-Clustering), a new clustering mechanism that directly learns cluster assignments during the disentanglement of the latent space, without constructing a “cluster-friendly” latent representation or applying additional clustering methods. We achieve a bidirectional mapping by enforcing an inference network (i.e., encoder) and the generator of a GAN to form a deterministic encoder-decoder pair with a maximum mean discrepancy (MMD)-based regularization. We utilize a weight-sharing procedure to disentangle the latent space into one-hot discrete latent variables and continuous latent variables. The disentangling process itself performs the clustering operation: the one-hot discrete latent variables can be directly read off as cluster assignments, while the continuous latent variables represent the remaining unspecified factors. Experiments on six benchmark datasets of different types demonstrate that our method outperforms existing state-of-the-art methods. We further show that the latent representations from DLS-Clustering retain the ability to generate diverse and high-quality images, which supports more promising application scenarios.
1 Introduction
As an important unsupervised learning method, clustering has been widely used in many computer vision applications, such as image segmentation [7], visual feature learning [3], and 3D object recognition [43]. Clustering becomes difficult when processing large amounts of semantically rich, high-dimensional data samples [10]. To overcome these challenges, many latent space clustering approaches, such as DEC [46], DCN [47] and ClusterGAN [34], have been proposed. In these methods, the original high-dimensional data is first projected to a low-dimensional latent space, and clustering algorithms, such as K-means [30], are then performed in that latent space.
Most existing latent space clustering methods focus on learning “clustering-friendly” latent representations. To avoid learning arbitrary discriminative representations, their training objectives are usually coupled with a data reconstruction loss or data generation constraints, which make it possible to rebuild or generate the input samples from the latent space. These objectives force the latent space to capture all key factors of variation and similarity that are essential for reconstruction or generation. Consequently, the learned low-dimensional representations are not related solely to clusters, and are therefore not optimal latent representations for clustering.
Furthermore, current latent space clustering methods depend on additional clustering methods (e.g., K-means) to output the final clustering result from the learned latent representations. It is difficult to effectively integrate low-dimensional representation learning and the clustering algorithm. The performance of distance-based clustering algorithms, such as K-means [30], is highly dependent on the selection of proper similarity and distance measures. Although constructing a latent space alleviates the problem of computing distances between high-dimensional data, defining a proper distance in the latent space that yields the best clustering performance is still a challenge.
In this paper, we propose Disentangling Latent Space Clustering (DLS-Clustering), a new type of clustering algorithm that directly obtains the cluster information during the disentanglement of the latent space. The disentangling process partitions the latent space into two parts: one-hot discrete latent variables directly related to categorical cluster information, and continuous latent variables related to other factors of variation. The disentanglement of the latent space performs the clustering operation itself, so no further clustering method is needed. Unlike existing distance-based clustering methods, our method does not need any explicit clustering objective or distance/similarity calculation in the latent space.
To separate the latent space into two completely independent parts and directly obtain clusters, we first couple the inference network and the generator of a GAN to form a deterministic encoder-decoder pair under a maximum mean discrepancy (MMD) regularization [18]. Then, we utilize a weight-sharing strategy, which involves the bidirectional mapping between the latent space and the data space, to separate the latent space into one-hot discrete variables and continuous variables of other factors. Our method integrates the GAN and the deterministic Autoencoder to achieve the disentanglement of the latent space. It includes three different types of regularization: an adversarial density-ratio loss in the data space, an MMD loss on the continuous latent code, and a cross-entropy loss on the discrete latent code. We choose adversarial density-ratio estimation for modeling the data space because it can handle complex distributions, while the MMD-based regularizer is stable to optimize and works well with multivariate normal distributions [41]. Our code and models will be made publicly available after the paper is accepted.
In summary, our contributions are as follows:
(1) We propose a new clustering approach called DLS-Clustering, which directly obtains clusters in a completely unsupervised manner through disentangling the latent space.
(2) We introduce an MMD-based regularization to enforce the inference network and the generator of a standard GAN to form a deterministic encoder-decoder pair.
(3) We define a disentanglement training procedure based on the standard GAN and the inference network, without increasing the number of model parameters or requiring extra inputs. This procedure is also suitable for disentangling other factors of variation.
(4) We evaluate DLS-Clustering on six benchmark datasets of different types. DLS-Clustering achieves superior clustering performance on five of the six datasets and close to the best result on the remaining one.
2 Related works
Latent space clustering. Recently, many latent space clustering methods that leverage advances in deep-neural-network-based unsupervised representation learning [42, 2] have been developed. Several pioneering works propose to utilize an encoding architecture [48, 4, 23, 3] to learn low-dimensional representations. In these methods, pseudo-labels created from hypothetical similarities are used during the optimization process. Because pseudo-labels usually underfit the semantics of real-world datasets, these methods often suffer from the Feature Randomness problem [33]. Most recent latent space clustering methods are based on Autoencoders [46, 8, 20, 47, 49], which make it possible to reconstruct a data sample from a low-dimensional representation. For example, Deep Embedded Clustering (DEC) [46] pretrains an Autoencoder with a reconstruction objective to learn low-dimensional embedded representations; it then discards the decoder and continues to train the encoder for a clustering objective through a well-designed regularizer. IDEC [20] combines the reconstruction and clustering objectives to jointly learn suitable representations while preserving local structure. DCN [47] proposes a joint dimensionality-reduction and K-means clustering approach, in which the low-dimensional representation is obtained via the Autoencoder. Because the learned latent representations are closely tied to the reconstruction objective, these methods still do not achieve the desired clustering results.
Recently, ClusterGAN [34] integrated a GAN with an encoder network for clustering by creating a non-smooth latent space from a mixture of one-hot encoded discrete variables and continuous latent variables. However, the one-hot encoded discrete variables and the continuous latent variables are not completely disentangled in ClusterGAN, so the one-hot encoded discrete variable cannot effectively represent a cluster. To obtain cluster assignments, ClusterGAN still needs to perform additional clustering on all dimensions of the latent space under the discrete-continuous prior distribution.
Disentanglement of latent space. Learning disentangled representations enables us to reveal the factors of variation in the data [2] and provides interpretable semantic latent codes for generative models. Existing disentangling methods can be divided into two types according to the disentanglement level. The first type separates the latent representation into two [32, 21, 52, 36] or three [16] parts, and can be achieved in one step. For example, Mathieu et al. [32] introduce a conditional VAE with adversarial training to disentangle the latent representation into label-relevant factors and the remaining unspecified factors. Y-AE [36] builds on the standard Autoencoder to disentangle implicit and explicit representations. Two-step disentanglement methods based on the Autoencoder [21] or the VAE [52] have also been proposed: the first step extracts label-relevant representations by training a classifier, and the second obtains label-irrelevant representations mainly via a reconstruction loss. All of these methods improve disentanglement by leveraging (partial) label information to minimize a cross-entropy loss. The second type of disentanglement, such as β-VAE [22], FactorVAE [24] and β-TCVAE [5], learns to separate each dimension of the latent space without supervision. These VAE-based frameworks choose the standard Gaussian distribution as the prior, and aim to balance reconstruction quality against latent-code regularization through a stochastic encoder-decoder pair.
Real-world data usually contain discrete factors (e.g., categories) that are difficult to model with continuous variables, so several studies have begun to disentangle the latent representation into discrete and continuous factors of variation, such as JointVAE [12] and InfoGAN [6]. Although most disentanglement learning methods [36, 12, 13] are based on the Autoencoder, especially VAEs [25], VAEs usually cannot achieve high-quality generation in real-world scenarios, which is related to their training objective [15]. InfoGAN [6], an information-theoretic extension of the GAN, achieves disentanglement of the latent code by maximizing the mutual information between the latent code and the generated data. In this paper, the proposed method integrates the Autoencoder and the GAN, and separates the latent variables into two parts without any supervision: the discrete latent variables directly represent clusters, and the continuous latent variables summarize the remaining unspecified factors of variation.
3 Proposed method
We propose an unsupervised learning algorithm to disentangle the latent space into one-hot discrete latent variables z_c and continuous latent variables z_n. For each input, z_c naturally represents the categorical cluster information, and z_n is expected to contain information about other variations. Sampling latent variables from this discrete-continuous mixture, we utilize a generator to map latent variables to the data space and an encoder to project data back to the latent space. To force our model to fully split the latent space into two separate parts, we use the bidirectional mapping networks to perform multiple generating and encoding passes, and jointly train the generator and encoder with a disentangling-specific loss. The instability of adversarial training may be mitigated by recent training improvements [19] and by integrating the GAN with an Autoencoder [1, 26].
3.1 Problem formulation
Given a collection of i.i.d. samples (e.g., images) drawn from the real data distribution, where each sample belongs to the dataset of fixed size, our goal is to learn a general method to project the data to a latent space divided into one-hot discrete latent variables z_c directly related to clusters and the remaining unspecified continuous latent variables z_n. One important challenge is how to encourage as much independence as possible between z_c and z_n. First, this involves a bidirectional mapping between the latent space and the data space, and it is difficult to enforce consistency of the distributions in both spaces: the generated distribution must match the real data distribution, and the encoded posterior must match the prior. Second, it is challenging to learn two separate latent variables without any supervision; existing methods [21, 52, 36] leverage labels to achieve disentanglement of the various factors.
In the following sections, we first describe the GAN that generates the data space from a discrete-continuous prior (Section 3.2). Then, we introduce a deterministic encoder-generator pair for the bidirectional mapping (Section 3.3). After that, we present our disentangling process (Section 3.4). Finally, we describe the objectives of the proposed method (Section 3.5).
3.2 Discrete-continuous prior in DLS-Clustering
We choose a joint distribution of discrete and continuous latent variables as the prior of our GAN model, the same as ClusterGAN [34]. This discrete-continuous prior is helpful for generating structured data in generative models. Using images as an example, distinct identities or attributes of objects are reasonably represented by discrete variables, while other continuous factors, such as style and scale, can be represented by continuous variables. In this work, we split the latent representation into z_c and z_n based on this discrete-continuous prior.
Standard generative adversarial networks [17, 19] consist of two components: the generator G and the discriminator D. G defines a mapping from the latent space to the data space, and D can be considered a mapping from the data space to a real value in [0, 1] that represents the probability of a sample being real. Eq. 1 defines the minimax objective of standard GANs:
$$\min_{G}\max_{D}\;\mathbb{E}_{x\sim P_x}\left[q(D(x))\right]+\mathbb{E}_{z\sim P_z}\left[q(1-D(G(z)))\right]\tag{1}$$
where P_x is the real data distribution, P_z is the prior distribution on the latent space, and the generated samples G(z) follow the model distribution. For the original GAN [17], the function q is chosen as q(x) = log x, and the Wasserstein GAN [19] applies q(x) = x. This adversarial density-ratio estimation [41] enforces the model distribution to match P_x.
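For concreteness, the two players' losses under the original GAN choice q(x) = log x can be sketched as follows. This is a minimal NumPy illustration with toy discriminator outputs; the function names are ours, not the paper's:

```python
import numpy as np

def gan_discriminator_loss(d_real, d_fake):
    # D maximizes E[log D(x)] + E[log(1 - D(G(z)))]; return the loss to minimize.
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def gan_generator_loss(d_fake):
    # G minimizes E[log(1 - D(G(z)))] (minimax form of Eq. 1).
    return np.mean(np.log(1.0 - d_fake))
```

A sharper discriminator (scores near 1 on real data, near 0 on fakes) yields a lower discriminator loss, while the generator loss drops as D is fooled.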
3.3 Deterministic encoder-generator pair
Many previous works, such as ALI [11] and BiGAN [9], combine an inference network (i.e., an encoder) with a GAN to form a bidirectional mapping. However, due to the lack of a consistent mapping between data samples and latent variables, they usually obtain poor reconstruction results. To turn the generator in DLS-Clustering into a good decoder, we need to apply several constraints between the posterior distribution Q(z|x) and the prior distribution P(z). Because the latent variable is z = (z_c, z_n) under the prior P(z), these constraints can be added by penalizing the discrete part and the continuous part separately.
The constraint on the discrete variables can be computed through the inverse network, which first generates a data sample from z = (z_c, z_n) and then encodes it back to the latent variables E(G(z)), as shown in Figure 1. The penalty on the discrete variables is then the cross-entropy loss between the original code z_c and the re-encoded discrete code E_c(G(z)):
$$\mathcal{L}_{z_c}=\mathcal{H}\left(z_c,\,E_c(G(z))\right)\tag{2}$$
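With a softmax output for the re-encoded discrete code, the penalty in Eq. 2 is an ordinary categorical cross-entropy. A sketch (the naming is ours; the paper's exact implementation is not shown here):

```python
import numpy as np

def onehot_cross_entropy(z_c, logits):
    """Cross-entropy between sampled one-hot codes z_c (batch, K) and the
    encoder's logits for the re-encoded discrete variable."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.sum(z_c * log_probs, axis=1))
```

The loss approaches zero when the re-encoded logits confidently recover the sampled one-hot code, and equals log K for a uniform prediction over K clusters.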
The constraint on the continuous variables can be handled within the standard Autoencoder model. As shown in Figure 1, the encoder E maps a real data sample x to the latent variables z_c and z_n. To ensure that the generator can reconstruct the original data from these latent variables, we apply an additional regularizer that encourages the encoded posterior distribution Q(z|x) to match the prior distribution P(z), as in AAE [31] and WAE [40]. The former uses the GAN-based density-ratio trick to estimate the KL divergence [41], while the latter minimizes a distance between distributions based on maximum mean discrepancy (MMD) [18, 28]. For the sake of optimization stability, we choose MMD to quantify the distance between the prior and the posterior. The MMD-based regularizer can be expressed as
$$\mathrm{MMD}_k=\frac{1}{n(n-1)}\sum_{l\neq j}k(z_l,z_j)+\frac{1}{n(n-1)}\sum_{l\neq j}k(\tilde{z}_l,\tilde{z}_j)-\frac{2}{n^2}\sum_{l,j}k(z_l,\tilde{z}_j)\tag{3}$$

where k can be any positive-definite kernel, z_1, …, z_n are sampled from the prior distribution P(z), \tilde{z}_j is sampled from the posterior Q(z|x_j), and x_j is sampled from the real data samples, for j = 1, …, n.
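The estimator in Eq. 3 can be written down directly. A sketch with an RBF kernel (function names and the bandwidth are our choices):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    # Pairwise RBF kernel matrix between rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd(z_prior, z_post, sigma=1.0):
    """Unbiased MMD^2 between prior samples and posterior samples (Eq. 3)."""
    n = len(z_prior)
    k_pp = rbf(z_prior, z_prior, sigma)
    k_qq = rbf(z_post, z_post, sigma)
    k_pq = rbf(z_prior, z_post, sigma)
    off = 1.0 - np.eye(n)                     # keep only the l != j terms
    return ((k_pp * off).sum() + (k_qq * off).sum()) / (n * (n - 1)) \
           - 2.0 * k_pq.mean()
```

The estimator is close to zero (it can be slightly negative, being unbiased) when the two sample sets come from the same distribution, and grows as they separate.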
In DLS-Clustering, the encoding distribution Q(z|x) and the decoding distribution P(x|z) are taken to be deterministic, i.e., they can be replaced by the deterministic mappings E(x) and G(z), respectively. Therefore, we use a mean-squared-error (MSE) criterion as the reconstruction loss, and write the standard Autoencoder loss as
$$\mathcal{L}_{AE}=\mathbb{E}_{x\sim P_x}\left[\left\|x-G(E(x))\right\|_2^2\right]\tag{4}$$
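A minimal sketch of Eq. 4, with toy deterministic linear maps standing in for E and G (the linear maps are purely illustrative, not the paper's networks):

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(8, 3))      # toy deterministic encoder E: 8-dim data -> 3-dim code
W_dec = np.linalg.pinv(W_enc)        # toy generator G: pseudo-inverse decoder

def ae_loss(x):
    z = x @ W_enc                    # z = E(x)
    x_rec = z @ W_dec                # x_rec = G(E(x))
    return np.mean((x - x_rec) ** 2) # Eq. 4: mean squared reconstruction error
```

Data lying in the decoder's range is reconstructed exactly, while generic 8-dimensional inputs incur a positive residual, since the 3-dimensional code is a bottleneck.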
3.4 Disentangled representation
Although the above constraints enforce consistency between the distributions in the latent and data spaces, we impose an additional penalty on the objective to disentangle the latent variables, both to avoid “posterior collapse” and to obtain more informative representations. We utilize weight-sharing generators and encoders to enforce the disentanglement between the discrete and continuous latent variables. In our architecture (Figure 1), all encoders and generators share the same weights, so disentangling the latent variables requires no additional parameters.
In practice, we sample a data sample x from the real data distribution and a latent variable z = (z_c, z_n) from the discrete-continuous prior. The encoder maps x to the latent representations \tilde{z}_c and \tilde{z}_n. To ensure that the two parts are independent, we create a new latent variable \hat{z} = (z_c, \tilde{z}_n) by recombining the sampled discrete code z_c with the encoded continuous code \tilde{z}_n. The generated data samples G(z) and G(\hat{z}) therefore share the identical discrete latent variable z_c. Then G(\hat{z}) is re-encoded to the latent variables (\hat{z}_c, \hat{z}_n). The cross-entropy loss between z_c and \hat{z}_c ensures that the discrete variable is not modified when the continuous variable changes:
$$\mathcal{L}_{c}=\mathcal{H}\left(z_c,\,\hat{z}_c\right)\tag{5}$$
In addition, to ensure that the continuous variable does not contain any information about the discrete variable, an additional regularizer is needed to penalize the continuous latent variable. The generator produces the data sample G(\hat{z}) from the new latent variable \hat{z}, and the encoder recovers the continuous latent variable \hat{z}_n from it. We therefore penalize the deviation between \tilde{z}_n and \hat{z}_n using the MSE loss:
$$\mathcal{L}_{n}=\left\|\tilde{z}_n-\hat{z}_n\right\|_2^2\tag{6}$$
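The recombination step and the two penalties of Eqs. 5 and 6 can be sketched end to end. Everything below, including the stub encoder and generator, is our own illustrative construction:

```python
import numpy as np

K, DN = 3, 4                                   # discrete / continuous dimensions

def encoder(x):
    # Stub: treat the first K entries as discrete logits, the rest as continuous.
    return x[:, :K], x[:, K:]

def generator(z_c, z_n):
    # Stub inverse of the encoder above.
    return np.concatenate([z_c, z_n], axis=1)

def disentangle_losses(x, z_c_prior):
    zc_tilde, zn_tilde = encoder(x)            # encode a real sample
    x_hat = generator(z_c_prior, zn_tilde)     # recombine sampled z_c with encoded z_n
    zc_hat, zn_hat = encoder(x_hat)            # re-encode the generated sample
    # Eq. 5: the discrete code must survive the continuous swap (cross-entropy
    # with a log-softmax over the re-encoded discrete logits).
    log_p = zc_hat - np.log(np.exp(zc_hat).sum(axis=1, keepdims=True))
    l_disc = -np.mean(np.sum(z_c_prior * log_p, axis=1))
    # Eq. 6: the continuous code must be recoverable (MSE).
    l_cont = np.mean((zn_tilde - zn_hat) ** 2)
    return l_disc, l_cont
```

With these stub maps the continuous code is recovered exactly, so the MSE term vanishes; with real shared-weight networks both terms are driven down jointly during training.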
3.5 Objective of DLSClustering
The objective function of our approach combines the above terms:
$$\mathcal{L}=\mathcal{L}_{GAN}+\mathcal{L}_{AE}+\lambda_n\left(\mathrm{MMD}_k+\mathcal{L}_n\right)+\lambda_c\left(\mathcal{L}_{z_c}+\mathcal{L}_c\right)\tag{7}$$
where λ_n and λ_c are regularization coefficients controlling the relative contributions of the different loss terms. Each term of Eq. 7 plays a different role for the three components: the generator G, the discriminator D and the encoder E. Both the GAN term and the Autoencoder term involve G and E and constrain the whole latent variable; the GAN term additionally involves D, which focuses on distinguishing true data samples from fake samples generated by G. The MMD term and the loss of Eq. 6 act on the continuous latent variables, while the losses of Eqs. 2 and 5 act on the discrete latent variables. Together, these loss terms ensure that our algorithm disentangles the whole latent space into cluster information and the remaining unspecified factors. The training procedure of DLS-Clustering jointly updates the parameters of G, D and E, as described in Algorithm 1. In this work, we set λ_n and λ_c empirically to balance the relative importance of the continuous and discrete parts.
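Reading Eq. 7 as a weighted sum gives a one-line helper. The grouping of the coefficients into a continuous part and a discrete part follows the description above, but the exact weighting is our interpretation:

```python
def total_loss(l_gan, l_ae, l_mmd, l_n, l_zc, l_c, lam_n=1.0, lam_c=1.0):
    """Weighted sum of the six losses: the GAN and Autoencoder terms constrain
    the whole latent variable; lam_n weights the continuous-part terms
    (MMD of Eq. 3, MSE of Eq. 6) and lam_c the discrete-part terms
    (cross-entropies of Eqs. 2 and 5)."""
    return l_gan + l_ae + lam_n * (l_mmd + l_n) + lam_c * (l_zc + l_c)
```

In a training loop, each of G, D and E would be updated with the subset of these terms that depends on its parameters.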
4 Experiments
Table 1: Network structures for the different datasets.

Dataset | Dimensions | Layer Type | G1/D4/E4 | G2/D3/E3 | G3/D2/E2 | G4/D1/E1
MNIST | | Conv-Deconv | | | |
Fashion-10 | | Conv-Deconv | | | |
YTF | | Conv-Deconv | | | |
Pendigits | | MLP | 256 | 256 | |
10x_73k | | MLP | 256 | 256 | |
Table 2: Dimensions of the discrete and continuous latent variables.

Dataset | Discrete Dim. | Continuous Dim.
MNIST | 10 | 25
Fashion-10 | 10 | 40
YTF | 41 | 60
Pendigits | 10 | 5
10x_73k | 8 | 30
Table 3: Clustering performance (ACC and NMI) of different methods on five datasets.

Method | MNIST ACC | MNIST NMI | Fashion-10 ACC | Fashion-10 NMI | YTF ACC | YTF NMI | Pendigits ACC | Pendigits NMI | 10x_73k ACC | 10x_73k NMI
K-means [30] | 0.532 | 0.500 | 0.474 | 0.512 | 0.601 | 0.776 | 0.793 | 0.730 | 0.623 | 0.577
NMF [27] | 0.560 | 0.450 | 0.500 | 0.510 | - | - | 0.670 | 0.580 | 0.710 | 0.690
SC [39] | 0.656 | 0.731 | 0.508 | 0.575 | 0.510 | 0.701 | 0.700 | 0.690 | 0.400 | 0.290
AGGLO [50] | 0.640 | 0.650 | 0.550 | 0.570 | - | - | 0.700 | 0.690 | 0.630 | 0.580
DEC [46] | 0.863 | 0.834 | 0.518 | 0.546 | 0.371 | 0.446 | - | - | - | -
DCN [47] | 0.830 | 0.810 | - | - | - | - | 0.720 | 0.690 | - | -
JULE [48] | 0.964 | 0.913 | 0.563 | 0.608 | 0.684 | 0.848 | - | - | - | -
DEPICT [14] | 0.965 | 0.917 | 0.392 | 0.392 | 0.621 | 0.802 | - | - | - | -
SpectralNet [38] | 0.800 | 0.814 | - | - | 0.685 | 0.798 | - | - | - | -
InfoGAN [6] | 0.890 | 0.860 | 0.610 | 0.590 | - | - | 0.720 | 0.730 | 0.620 | 0.580
ClusterGAN [34] | 0.950 | 0.890 | 0.630 | 0.640 | - | - | 0.770 | 0.730 | 0.810 | 0.730
Dual-AE [49] | 0.978 | 0.941 | 0.662 | 0.645 | 0.691 | 0.857 | - | - | - | -
Our Method | 0.975 | 0.936 | 0.693 | 0.669 | 0.721 | 0.790 | 0.847 | 0.803 | 0.905 | 0.820
In this section, we perform a variety of experiments to evaluate the effectiveness of our proposed method.
4.1 Data sets
The clustering experiments are first carried out on five datasets: MNIST, Fashion-MNIST [45], YouTube-Face (YTF) [44], Pendigits and 10x_73k [51]. The first two datasets each contain 70k grayscale images in 10 categories. YTF contains 10k face images belonging to 41 categories. The Pendigits dataset contains time series of coordinates of handwritten digits; it has 10 categories and 10,992 samples, each represented as a 16-dimensional vector. The 10x_73k dataset contains 73,233 samples of single-cell RNA-seq counts from 8 cell types, and the dimension of each sample is 720. We choose these datasets to demonstrate that our method is effective for clustering different types of data.
4.2 Implementation
We implement different neural network structures for G, D and E to handle the different types of data. For the image datasets (MNIST, Fashion-MNIST and YTF), we employ G and D similar to those of DCGAN [37], with conv-deconv layers, batch normalization and leaky ReLU activations with a slope of 0.2. The encoder E uses the same architecture as D except for the last layer. For the Pendigits and 10x_73k datasets, G, D and E are MLPs with 2 hidden layers of 256 units each. Table 1 summarizes the network structures for the different datasets. The model parameters are initialized from a random normal distribution. For the prior distribution of our method, we randomly generate the discrete latent code z_c, which equals one of the elementary one-hot encoded vectors, and sample the continuous latent code z_n from a normal distribution. The sampled latent code is used as the input of G to generate samples. The dimensions of z_c and z_n are shown in Table 2. We implement the MMD loss with an RBF kernel [40] to penalize the posterior distribution Q(z|x). The improved GAN variant with a gradient penalty [19] is used in all experiments. To obtain the cluster assignment, we directly take the argmax over the softmax probabilities of the different clusters. A single setting of the regularization parameters works well across all experiments. We implement all models in Python using the TensorFlow library and train them on one NVIDIA DGX-1 station.
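The sampling of the discrete-continuous prior described above can be sketched as follows (the default σ is a placeholder of ours, since the exact value is not stated here):

```python
import numpy as np

def sample_prior(batch, k, dn, sigma=0.1, rng=None):
    """Draw z_c as one-hot vectors over k clusters and z_n ~ N(0, sigma^2 I)."""
    if rng is None:
        rng = np.random.default_rng()
    z_c = np.eye(k)[rng.integers(0, k, size=batch)]   # one-hot discrete codes
    z_n = sigma * rng.standard_normal((batch, dn))    # continuous codes
    return z_c, z_n
```

For MNIST, for example, Table 2 gives k = 10 and a 25-dimensional z_n; the cluster assignment of a real sample is then simply the argmax of the encoder's softmax output.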
4.3 Evaluation of the DLS-Clustering algorithm
To evaluate clustering results, we report two standard evaluation metrics: clustering accuracy (ACC) and normalized mutual information (NMI). We compare DLS-Clustering with four clustering baselines: K-means [30], non-negative matrix factorization (NMF) [27], spectral clustering (SC) [39] and agglomerative clustering (AGGLO) [50]. We also compare our method with state-of-the-art clustering approaches based on GANs and Autoencoders, respectively. For GAN-based approaches, ClusterGAN [34] is chosen, as it achieves superior clustering performance compared to other GAN models (e.g., InfoGAN and GAN with bp). For Autoencoder-based methods, DEC [46], DCN [47], DEPICT [14] and, especially, the Dual Autoencoder Network (Dual-AE) [49] are used for comparison. In addition, deep spectral clustering (SpectralNet) [38] and joint unsupervised learning (JULE) [48] are included in our comparison.
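For reference, ACC is computed by searching for the best one-to-one mapping between predicted clusters and ground-truth labels, usually via the Hungarian algorithm; the brute-force matching below is a self-contained sketch of ours that is only practical for small numbers of clusters:

```python
import numpy as np
from itertools import permutations

def clustering_acc(y_true, y_pred):
    """Accuracy under the best cluster-to-label permutation (brute force)."""
    k = int(max(y_true.max(), y_pred.max())) + 1
    counts = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1                      # co-occurrence of label t, cluster p
    best = max(sum(counts[i, perm[i]] for i in range(k))
               for perm in permutations(range(k)))
    return best / len(y_true)
```

A clustering that is a pure relabeling of the ground truth scores 1.0, since the metric is invariant to the arbitrary numbering of clusters.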
Table 3 reports the best clustering metrics of the different models over 5 runs. Our method achieves significant performance improvements over the other methods on the Fashion-10, YTF, Pendigits and 10x_73k datasets. In particular, on the 16-dimensional Pendigits dataset, the other learning-based methods all perform worse than K-means, while our method significantly outperforms K-means in both ACC (0.847 vs. 0.793) and NMI (0.803 vs. 0.730). DLS-Clustering achieves the best ACC on the YTF dataset while maintaining a comparable NMI value. On MNIST, DLS-Clustering achieves close-to-best performance on both the ACC and NMI metrics.
4.4 Analysis of continuous latent variables
The superior clustering performance of DLS-Clustering demonstrates that the one-hot discrete latent variables directly represent the category information in the data. To understand the information contained in the continuous latent variables, we first use t-SNE [29] to visualize the continuous latent variables z_n of the MNIST and Fashion-MNIST datasets and compare them with the original data. As shown in Figure 2, we can clearly see category information in the original MNIST (a(1)) and Fashion-MNIST (b(1)) data. Meanwhile, there are no obvious categories in the z_n of MNIST (a(2)) and Fashion-MNIST (b(2)): samples from all categories are well mixed in both datasets. A small group of samples in the right part of a(2) corresponds to “1” images; the reason they are not dispersed may be their low complexity.
Then, we fix the discrete latent variable z_c and generate images belonging to the same cluster by sampling the continuous latent variables. As shown in Figure 3, the diversity of the generated images indicates that the continuous latent variable contains a large number of generative factors beyond the cluster information. To further understand the factors in the continuous latent variable z_n, we vary the value of a single dimension of z_n over [−0.5, 0.5] while fixing the other dimensions and the discrete latent variable z_c. As shown in Figure 4, this change leads to semantic changes in the generated images: for the MNIST data, the varied dimension represents the width of the digits; for the Fashion-MNIST data, it captures the shape of the objects. All of these informative continuous factors are independent of the cluster categories.
These results demonstrate that the learned continuous latent representations from DLSClustering have captured other meaningful generative factors that are not related to clusters. Therefore, the proposed method successfully performs the mapping from the data to the disentangled latent space. The onehot discrete latent variable is directly related to clusters, and the continuous latent variable, which corresponds to the other unspecified generative factors, governs the diversity of generated samples.
4.5 Scalability to large numbers of clusters
Table 4: Clustering results on the COIL-100 dataset.

Method | ACC | NMI | ARI
K-means | 0.668 | 0.836 | 0.574
Our method | 0.822 | 0.911 | 0.764
To further evaluate the scalability of DLS-Clustering to large numbers of clusters, we run it on the multi-view object image dataset COIL-100 [35]. The COIL-100 dataset has 100 clusters and contains 7,200 images. Here, we compare our clustering method with K-means on three standard evaluation metrics: ACC, NMI and the adjusted Rand index (ARI). As shown in Table 4, DLS-Clustering achieves better performance on all three metrics by directly learning the clusters together with 100-dimensional continuous latent representations; in particular, it gains an increase of 0.154 in ACC. We also perform the image generation task on COIL-100 to further verify the generative performance, which involves mapping latent variables to the data space. Figure 5 shows samples generated with fixed one-hot discrete latent variables, which are diverse and realistic; the continuous latent variables represent meaningful factors such as the pose, location and orientation of the objects. Therefore, the disentanglement of the latent space not only provides superior clustering performance but also retains a remarkable ability to generate diverse, high-quality images.
5 Conclusion
In this work, we present DLS-Clustering, a new type of clustering method that directly obtains cluster assignments by disentangling the latent space in an unsupervised fashion. Unlike existing latent space clustering algorithms, our method neither builds a “clustering-friendly” latent space explicitly nor needs an extra clustering operation. Furthermore, our method does not disentangle class-relevant features from class-irrelevant features; its disentanglement is targeted at extracting “cluster information” from the data. Moreover, unlike distance-based clustering algorithms, our method does not depend on any explicit distance calculation in the latent space; the distance between data samples may be implicitly defined by the neural network.
Besides clustering, the generator in our method can also generate diverse and realistic samples. The proposed method can therefore support other applications, including conditional generation based on clusters, cluster-specific image transfer and cross-cluster retrieval. In the future, we will explore better priors for the latent space and further disentanglement of other generative factors.
References
 [1] (2017) CVAEgan: finegrained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2745–2754. Cited by: §3.
 [2] (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §2, §2.
 [3] (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §1, §2.
 [4] (2017) Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5879–5887. Cited by: §2.
 [5] (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §2.
 [6] (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §2, Table 3.
 [7] (2006) Fuzzy cmeans clustering with spatial information for image segmentation. computerized medical imaging and graphics 30 (1), pp. 9–15. Cited by: §1.
 [8] (2016) Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648. Cited by: §2.
 [9] (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782. Cited by: §3.3.
 [10] (2000) Highdimensional data analysis: the curses and blessings of dimensionality. AMS math challenges lecture 1 (2000), pp. 32. Cited by: §1.
 [11] (2016) Adversarially learned inference. arXiv preprint arXiv:1606.00704. Cited by: §3.3.
 [12] (2018) Learning disentangled joint continuous and discrete representations. In Advances in Neural Information Processing Systems, pp. 710–720. Cited by: §2.
 [13] (2018) Structured disentangled representations. arXiv preprint arXiv:1804.02086. Cited by: §2.
 [14] (2017) Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5736–5745. Cited by: §4.3, Table 3.
 [15] (2019) From variational to deterministic autoencoders. arXiv preprint arXiv:1903.12436. Cited by: §2.
 [16] (2018) Imagetoimage translation for crossdomain disentanglement. In Advances in Neural Information Processing Systems, pp. 1287–1298. Cited by: §2.
 [17] (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §3.2, §3.2.
 [18] (2012) A kernel twosample test. Journal of Machine Learning Research 13 (Mar), pp. 723–773. Cited by: §1, §3.3.
 [19] (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §3.2, §3.2, §3, §4.2.
 [20] (2017) Improved deep embedded clustering with local structure preservation. In IJCAI, pp. 1753–1759.
 [21] (2018) A two-step disentanglement method. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 772–780.
 [22] (2017) beta-VAE: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6.
 [23] (2017) Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1558–1567.
 [24] (2018) Disentangling by factorising. arXiv preprint arXiv:1802.05983.
 [25] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 [26] (2015) Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
 [27] (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), pp. 788.
 [28] (2015) Generative moment matching networks. In International Conference on Machine Learning, pp. 1718–1727.
 [29] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
 [30] (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281–297.
 [31] (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
 [32] (2016) Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048.
 [33] (2019) Adversarial deep embedded clustering: on a better trade-off between feature randomness and feature drift. arXiv preprint arXiv:1909.11832.
 [34] (2019) ClusterGAN: latent space clustering in generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4610–4617.
 [35] Columbia Object Image Library (COIL-20).
 [36] (2019) Y-autoencoders: disentangling latent representations via sequential encoding. arXiv preprint arXiv:1907.10949.
 [37] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
 [38] (2018) SpectralNet: spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587.
 [39] (2000) Normalized cuts and image segmentation. Departmental Papers (CIS), pp. 107.
 [40] (2017) Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558.
 [41] (2018) Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069.
 [42] (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11 (Dec), pp. 3371–3408.
 [43] (2019) Dominant set clustering and pooling for multi-view 3D object recognition. arXiv preprint arXiv:1906.01592.
 [44] (2011) Face recognition in unconstrained videos with matched background similarity. IEEE.
 [45] (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
 [46] (2016) Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pp. 478–487.
 [47] (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3861–3870.
 [48] (2016) Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156.
 [49] (2019) Deep spectral clustering using dual autoencoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4066–4075.
 [50] (2012) Graph degree linkage: agglomerative clustering on a directed graph. In European Conference on Computer Vision, pp. 428–441.
 [51] (2017) Massively parallel digital transcriptional profiling of single cells. Nature Communications 8, pp. 14049.
 [52] (2019) Disentangling latent space for VAE by label relevant/irrelevant dimensions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12192–12201.