Clustering by Directly Disentangling Latent Space

Fei Ding   Feng Luo
School of Computing, Clemson University

To overcome the high dimensionality of data, learning latent feature representations for clustering has been widely studied recently. However, it is still challenging to learn “cluster-friendly” latent representations due to the unsupervised fashion of clustering. In this paper, we propose Disentangling Latent Space Clustering (DLS-Clustering), a new clustering mechanism that directly learns cluster assignments during the disentanglement of the latent space, without constructing a “cluster-friendly” latent representation or relying on additional clustering methods. We achieve the bidirectional mapping by enforcing an inference network (i.e., encoder) and the generator of a GAN to form a deterministic encoder-decoder pair with a maximum mean discrepancy (MMD)-based regularization. We utilize a weight-sharing procedure to disentangle the latent space into one-hot discrete latent variables and continuous latent variables. The disentangling process itself performs the clustering operation. Eventually, the one-hot discrete latent variables can be directly read as clusters, and the continuous latent variables represent the remaining unspecified factors. Experiments on six benchmark datasets of different types demonstrate that our method outperforms existing state-of-the-art methods. We further show that the latent representations from DLS-Clustering also maintain the ability to generate diverse and high-quality images, which can support more promising application scenarios.

1 Introduction

As an important unsupervised learning method, clustering has been widely used in many computer vision applications, such as image segmentation [7], visual features learning [3], and 3D object recognition [43]. Clustering becomes difficult when processing large amounts of high-semantic and high-dimensional data samples [10]. In order to overcome these challenges, many latent space clustering approaches such as DEC [46], DCN [47] and ClusterGAN [34], have been proposed. In these latent space clustering methods, the original high-dimensional data is first projected to low-dimensional latent space, then clustering algorithms, such as K-means [30], are performed on the latent space.

Most existing latent space clustering methods focus on learning “clustering-friendly” latent representations. To avoid learning random discriminative representations, their training objectives are usually coupled with a data reconstruction loss or data generation constraints, which allow the input samples to be rebuilt or generated from the latent space. These objectives force the latent space to capture all key factors of variation and similarity that are essential for reconstruction or generation. Therefore, the learned low-dimensional representations are not related to clusters alone, and they are not the optimal latent representations for clustering.

Furthermore, current latent space clustering methods depend on additional clustering methods (e.g., K-means) to output the final clustering result based on the learned latent representations. It is difficult to effectively integrate low-dimensional representation learning with a clustering algorithm. The performance of distance-based clustering algorithms, such as K-means [30], is highly dependent on the selection of proper similarity and distance measures. Although constructing a latent space can alleviate the problem of computing distances between high-dimensional data, defining a proper distance in the latent space to obtain the best clustering performance is still a challenge.

In this paper, we propose Disentangling Latent Space Clustering (DLS-Clustering), a new type of clustering algorithm that directly obtains the cluster information during the disentanglement of the latent space. The disentangling process partitions the latent space into two parts: the one-hot discrete latent variables directly related to categorical cluster information, and the continuous latent variables related to other factors of variation. The disentanglement of the latent space itself performs the clustering operation, and no further clustering method is needed. Unlike existing distance-based clustering methods, our method does not need any explicit clustering objective or distance/similarity calculation in the latent space.

To separate the latent space into two completely independent parts and directly obtain clusters, we first couple the inference network and the generator of a GAN to form a deterministic encoder-decoder pair under the maximum mean discrepancy (MMD) regularization [18]. Then, we utilize a weight-sharing strategy, which involves the bidirectional mapping between the latent space and the data space, to separate the latent space into one-hot discrete variables and continuous variables of other factors. Our method integrates a GAN and a deterministic Autoencoder to achieve the disentanglement of the latent space. It includes three different types of regularization: an adversarial density-ratio loss in the data space, an MMD loss on the continuous latent code and a cross-entropy loss on the discrete latent code. We choose adversarial density-ratio estimation for modeling the data space because it can handle complex distributions. The MMD-based regularizer is stable to optimize and works well with multivariate normal distributions [41]. Our code and models will be made publicly available after the paper is accepted.

In summary, our contributions are as follows:

(1) We propose a new clustering approach called DLS-Clustering, which directly obtains clusters in a completely unsupervised manner by disentangling the latent space.

(2) We introduce an MMD-based regularization to enforce the inference network and the generator of a standard GAN to form a deterministic encoder-decoder pair.

(3) We define a disentanglement training procedure based on the standard GAN and the inference network that neither increases the number of model parameters nor requires extra inputs. This procedure is also suitable for disentangling other factors of variation.

(4) We evaluate DLS-Clustering using six different types of benchmark datasets. DLS-Clustering achieves superior clustering performance on five of the six datasets and a close-to-best result on the remaining one.

Figure 1: The architecture of DLS-Clustering (G: generator, E: encoder, D: discriminator). The latent representations are separated into cluster-relevant latent variables $z_c$ and other factors of variation $z_n$. $z_c$ and $z_n$ are concatenated and fed into $G$ for generation, and $E$ maps the samples (the real sample and the two generated samples) back into the latent space. $D$ is adopted for the adversarial training in the data space. Note that all generators share the same parameters and all encoders share the same parameters.

2 Related works

Latent space clustering. Recently, many latent space clustering methods that leverage advances in deep neural network based unsupervised representation learning [42, 2] have been developed. Several pioneering works propose to utilize an encoding architecture [48, 4, 23, 3] to learn low-dimensional representations. In these methods, pseudo-labels created from some hypothetical similarities are used during the optimization process. Because pseudo-labels usually underfit the semantics of real-world datasets, these methods often suffer from the Feature Randomness problem [33]. Most recent latent space clustering methods are based on Autoencoders [46, 8, 20, 47, 49], which make it possible to reconstruct a data sample from a low-dimensional representation. For example, Deep Embedded Clustering (DEC) [46] proposes to pretrain an Autoencoder with the reconstruction objective to learn low-dimensional embedded representations. Then, it discards the decoder and continues to train the encoder for the clustering objective through a well-designed regularizer. IDEC [20] combines the reconstruction objective and the clustering objective to jointly learn suitable representations while preserving local structure. DCN [47] proposes a joint dimensionality reduction and K-means clustering approach, in which the low-dimensional representation is obtained via the Autoencoder. Because the learned latent representations are closely tied to the reconstruction objective, these methods still do not achieve the desired clustering results.

Recently, ClusterGAN [34] integrated a GAN with an encoder network for clustering by creating a non-smooth latent space from a mixture of one-hot encoded discrete variables and continuous latent variables. However, the one-hot encoded discrete variables and the continuous latent variables are not completely disentangled in ClusterGAN. Thus, the one-hot encoded discrete variables cannot effectively represent clusters. To obtain cluster assignments, ClusterGAN still needs to perform additional clustering on all dimensions of the latent space under the discrete-continuous prior distribution.

Disentanglement of latent space. Learning disentangled representations enables us to reveal the factors of variation in the data [2], and provides interpretable semantic latent codes for generative models. Generally, existing disentangling methods can be divided into two types according to the disentanglement level. The first type of disentanglement involves separating the latent representations into two [32, 21, 52, 36] or three [16] parts. This type of method can be achieved in one step. For example, Mathieu et al. [32] introduce a conditional VAE with adversarial training to disentangle the latent representations into label-relevant factors and the remaining unspecified factors. Y-AE [36] focuses on the standard Autoencoder to achieve the disentanglement of implicit and explicit representations. Meanwhile, two-step disentanglement methods based on the Autoencoder [21] or VAE [52] have also been proposed. In those methods, the first step is to extract the label-relevant representations by training a classifier. Then, they obtain label-irrelevant representations mainly via the reconstruction loss. All of these methods improve the disentanglement results by leveraging (partial) label information to minimize the cross-entropy loss. The second type of disentanglement, such as β-VAE [22], FactorVAE [24] and β-TCVAE [5], learns to separate each dimension in the latent space without supervision. These VAE-based frameworks choose the standard Gaussian distribution as the prior distribution, and they aim to balance the reconstruction quality and the latent code regularization through a stochastic encoder-decoder pair.

Real-world data usually contains several discrete factors (e.g., categories), which are difficult to model with continuous variables. Several studies therefore disentangle the latent representation into discrete and continuous factors of variation, such as JointVAE [12] and InfoGAN [6]. Although most disentanglement learning methods [36, 12, 13] are based on the Autoencoder, especially VAEs [25], VAEs usually cannot achieve high-quality generation in real-world scenarios, which is related to their training objective [15]. Recently, InfoGAN [6], an information-theoretic extension to GAN, reveals the disentanglement of the latent code by maximizing the mutual information between the latent code and the generated data. In this paper, the proposed method integrates the Autoencoder and GAN, and separates the latent variables into two parts without any supervision. The discrete latent variables directly represent clusters, and the continuous latent variables summarize the remaining unspecified factors of variation.

3 Proposed method

We propose an unsupervised learning algorithm to disentangle the latent space into the one-hot discrete latent variables, $z_c$, and the continuous latent variables, $z_n$. For each input, $z_c$ naturally represents the categorical cluster information, while $z_n$ is expected to contain information about other variations. By sampling latent variables from these discrete-continuous mixtures, we utilize a generator $G$ to map these latent variables to the data space and an encoder $E$ to project the data back to the latent space. To force our model to fully split the latent space into two separate parts, we utilize the bidirectional mapping networks to perform multiple generating and encoding passes, and jointly train the generator and encoder with a disentangling-specific loss. The training instability caused by adversarial training may be mitigated by recent training improvements [19] and the integration of GAN and Autoencoder [1, 26].

3.1 Problem formulation

Given a collection of i.i.d. samples $\{x^{i}\}_{i=1}^{N}$ (e.g., images) drawn from the real data distribution $P_x$, where $x^{i}$ is the $i$-th data sample and $N$ is the size of the dataset, our goal is to learn a general method to project the data to the latent space, which is divided into the one-hot discrete latent variables directly related to clusters and the remaining unspecified continuous latent variables. One important challenge in disentangling the latent space is how to encourage independence between $z_c$ and $z_n$ as much as possible. First, this involves a bidirectional mapping between the latent space $\mathcal{Z}$ and the data space $\mathcal{X}$, and it is difficult to enforce distribution consistency in both spaces. Second, it is still challenging to learn two separate latent variables without any supervision. Existing methods [21, 52, 36] leverage labels to achieve disentanglement of various factors.

In the following sections, we first describe the GAN that generates data from a discrete-continuous prior (Section 3.2). Then, we introduce a deterministic encoder-generator pair for bidirectional mapping (Section 3.3). After that, we present our disentangling process (Section 3.4). Finally, we describe the objectives of the proposed method (Section 3.5).

3.2 Discrete-continuous prior in DLS-Clustering

We choose a joint distribution of discrete and continuous latent variables as the prior of our GAN model, as in ClusterGAN [34]. This discrete-continuous prior is helpful for the generation of structured data in generative models. Using images as an example, distinct identities or attributes of objects can be reasonably represented by discrete variables, while other continuous factors, such as style and scale information, can be represented by continuous variables. In this work, we split the latent representations into $z_c$ and $z_n$ based on the discrete-continuous prior.
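As an illustration of such a discrete-continuous prior, the following minimal NumPy sketch draws a one-hot cluster code (written `z_c` here) and a continuous code (`z_n`) and concatenates them into a generator input. The function name `sample_prior`, the uniform choice of cluster, the zero-mean Gaussian for the continuous part and the default `sigma=0.1` are assumptions made for this example only; the dimensions 10 and 25 are the MNIST values from Table 2.

```python
import numpy as np

def sample_prior(batch_size, n_clusters, cont_dim, sigma=0.1, rng=None):
    """Draw (z_c, z_n) from a discrete-continuous prior.

    Assumptions for this sketch: clusters are chosen uniformly at random,
    and the continuous code is zero-mean Gaussian with standard deviation
    `sigma` (an illustrative default, not a value taken from the paper).
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = rng.integers(0, n_clusters, size=batch_size)
    z_c = np.eye(n_clusters)[labels]                 # one-hot rows
    z_n = sigma * rng.standard_normal((batch_size, cont_dim))
    return z_c, z_n, labels

z_c, z_n, labels = sample_prior(4, n_clusters=10, cont_dim=25,
                                rng=np.random.default_rng(0))
z = np.concatenate([z_c, z_n], axis=1)               # generator input, shape (4, 35)
```

Keeping the two parts as separate arrays until the final concatenation makes it easy to swap one part while holding the other fixed.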

The standard generative adversarial networks [17, 19] consist of two components: the generator $G$ and the discriminator $D$. $G$ defines a mapping from the latent space $\mathcal{Z}$ to the data space $\mathcal{X}$, and $D$ can be considered as a mapping from the data space to a real value in $[0, 1]$, which represents the probability of one sample being real. Eq. 1 defines the minimax objective of the standard GANs:

$$\min_{G} \max_{D} \; \mathcal{L}_{GAN} = \mathbb{E}_{x \sim P_x}\left[ q(D(x)) \right] + \mathbb{E}_{z \sim P_z}\left[ q(1 - D(G(z))) \right], \tag{1}$$

where $P_x$ is the real data distribution, $P_z$ is the prior distribution on the latent space, and $P_g$ is the model distribution of the generated sample $G(z)$. For the original GAN [17], the function $q$ is chosen as $q(t) = \log t$, and the Wasserstein GAN [19] applies $q(t) = t$. This adversarial density-ratio estimation [41] enforces $P_g$ to match $P_x$.

3.3 Deterministic encoder-generator pair

Many previous works, such as ALI [11] and BiGAN [9], combine the inference network (i.e., encoder) and GAN to form a bidirectional mapping. However, due to the lack of a consistent mapping between data samples and latent variables, they usually obtain poor reconstruction results. To turn the generator in DLS-Clustering into a good decoder, we need to apply several constraints between the posterior distribution $Q(z \mid x)$ and the prior distribution $P_z$. Because the latent variable is $z = (z_c, z_n)$, these constraints can be added by simply penalizing the discrete variable part and the continuous variable part separately.

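The constraint on the discrete variables can be computed through the inverse network, which involves first generating the data sample $x_g = G(z)$ from $z = (z_c, z_n)$ and then encoding it back to the latent variables $(\hat{z}_c, \hat{z}_n) = E(x_g)$, as shown in Figure 1. Therefore, the penalty on the discrete variables can be defined by the cross-entropy loss between the original input $z_c$ and the recalculated discrete variable $\hat{z}_c$:

$$\mathcal{L}_{CE}^{(1)} = \mathbb{E}_{z \sim P_z}\left[ H(z_c, \hat{z}_c) \right], \tag{2}$$

where $H(\cdot, \cdot)$ denotes the cross-entropy.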


The constraint on the continuous variables can be considered in the standard Autoencoder model. As shown in Figure 1, the encoder $E$ encodes the real data sample $x$ to the latent variables $\tilde{z}_c$ and $\tilde{z}_n$. To ensure that the generator can reconstruct the original data from these latent variables, we apply an additional regularizer to encourage the encoded posterior distribution $Q(z_n \mid x)$ to match the prior distribution $P_{z_n}$, like AAE [31] and WAE [40]. The former uses the GAN-based density-ratio trick to estimate the KL-divergence [41], and the latter minimizes the distance between distributions based on maximum mean discrepancy (MMD) [18, 28]. For the sake of optimization stability, we choose MMD to quantify the distance between the prior distribution $P_{z_n}$ and the posterior $Q(z_n \mid x)$. The regularizer based on MMD can be expressed as

$$\mathcal{L}_{MMD} = \frac{1}{B(B-1)} \sum_{l \neq j} k\left(z_n^{l}, z_n^{j}\right) + \frac{1}{B(B-1)} \sum_{l \neq j} k\left(\tilde{z}_n^{l}, \tilde{z}_n^{j}\right) - \frac{2}{B^{2}} \sum_{l, j} k\left(z_n^{l}, \tilde{z}_n^{j}\right), \tag{3}$$

where $k$ can be any positive definite kernel, $z_n^{l}$ are sampled from the prior distribution $P_{z_n}$, $\tilde{z}_n^{j}$ is sampled from the posterior $Q(z_n \mid x^{j})$, and $x^{j}$ is sampled from the real data samples for $j = 1, \dots, B$.
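A sample-based MMD penalty of this kind can be sketched in a few lines of NumPy. The version below uses an RBF kernel and the usual estimator that excludes the diagonal in the within-sample sums; the function names, the bandwidth value and the requirement that both batches have the same size are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Pairwise RBF kernel values k(a_i, b_j) for row-wise samples."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))

def mmd_penalty(prior_z, posterior_z, bandwidth=1.0):
    """Sample estimate of the squared MMD between two equally sized batches."""
    n = prior_z.shape[0]
    off_diag = 1.0 - np.eye(n)                        # drop the l == j terms
    k_pp = rbf_kernel(prior_z, prior_z, bandwidth)
    k_qq = rbf_kernel(posterior_z, posterior_z, bandwidth)
    k_pq = rbf_kernel(prior_z, posterior_z, bandwidth)
    return ((k_pp * off_diag).sum() / (n * (n - 1))
            + (k_qq * off_diag).sum() / (n * (n - 1))
            - 2.0 * k_pq.sum() / (n * n))

rng = np.random.default_rng(0)
prior = rng.standard_normal((200, 5))
matched = rng.standard_normal((200, 5))               # same distribution as the prior
shifted = rng.standard_normal((200, 5)) + 3.0         # clearly different distribution
print(mmd_penalty(prior, matched), mmd_penalty(prior, shifted))
```

The penalty stays near zero when the encoded codes follow the prior and grows when they drift away from it, which is the behaviour the regularizer relies on.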

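In DLS-Clustering, the encoding distribution $Q(z \mid x)$ and the decoding distribution $P(x \mid z)$ are taken to be deterministic, i.e., $Q(z \mid x)$ and $P(x \mid z)$ can be replaced by $E(x)$ and $G(z)$, respectively. Therefore, we use a mean squared error (MSE) criterion as the reconstruction loss, and write the standard Autoencoder loss as

$$\mathcal{L}_{AE} = \mathbb{E}_{x \sim P_x}\left[ \left\| x - G(E(x)) \right\|_2^2 \right]. \tag{4}$$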


3.4 Disentangled representation

Although the above constraints are applied to enforce consistency between the distributions over the latent space and the data space, we impose an additional penalty on the objective to disentangle the latent variables, in order to avoid “posterior collapse” and obtain more promising representations. We utilize the weight-sharing generator and encoder to enforce the disentanglement between the discrete and continuous latent variables. In our architecture (Figure 1), all encoders and generators share the same weights. Thus, no additional parameters are required to disentangle the latent variables.

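In practice, we sample the data sample $x$ from the real data distribution, and sample the latent variable $z = (z_c, z_n)$ from the discrete-continuous prior. The encoder maps the data sample $x$ to the latent representations $\tilde{z}_c$ and $\tilde{z}_n$. To ensure that $z_c$ and $z_n$ are independent, we create the new latent variable $z' = (z_c, \tilde{z}_n)$ by recombining the variables $z_c$ and $\tilde{z}_n$. Therefore, the generated data samples $x_g = G(z)$ and $x'_g = G(z')$ will have an identical discrete latent variable $z_c$. Then $x'_g$ is re-encoded to the latent variables $(\hat{z}'_c, \hat{z}'_n) = E(x'_g)$. The cross-entropy loss between $z_c$ and $\hat{z}'_c$ ensures that the discrete variable is not modified when the continuous variable changes:

$$\mathcal{L}_{CE}^{(2)} = \mathbb{E}\left[ H(z_c, \hat{z}'_c) \right]. \tag{5}$$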


In addition, to ensure that the continuous variable does not contain any information about the discrete variable, it is also necessary to use an additional regularizer to penalize the continuous latent variable. The generator generates the data sample $x'_g$ from the new latent variable $z' = (z_c, \tilde{z}_n)$, and the encoder recovers the continuous latent variable $\hat{z}'_n$ from $x'_g$. Therefore, we penalize the deviation between $\tilde{z}_n$ and $\hat{z}'_n$ by using the MSE loss:

$$\mathcal{L}_{MSE} = \mathbb{E}\left[ \left\| \tilde{z}_n - \hat{z}'_n \right\|_2^2 \right]. \tag{6}$$
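The recombination step and its two penalties can be traced with a toy example. In the sketch below the generator and encoder are hypothetical stand-ins (an orthogonal linear map and its exact inverse) chosen only so that the code runs without a trained network; the scaling factor applied to the logits is likewise an illustrative choice. With such a perfectly invertible pair both penalties are close to zero, which is the regime the training procedure pushes towards.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, B = 10, 25, 8                        # clusters, continuous dim, batch size

# Hypothetical stand-ins: an orthogonal linear "generator" and its inverse "encoder".
W, _ = np.linalg.qr(rng.standard_normal((K + d, K + d)))

def G(z_c, z_n):
    return np.concatenate([z_c, z_n], axis=1) @ W

def E(x):
    z = x @ W.T
    return z[:, :K], z[:, K:]              # (cluster logits, continuous code)

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(one_hot, logits, eps=1e-12):
    return float(-np.mean(np.sum(one_hot * np.log(softmax(logits) + eps), axis=1)))

z_c = np.eye(K)[rng.integers(0, K, size=B)]    # one-hot codes drawn from the prior
x = rng.standard_normal((B, K + d))            # stand-in for a batch of real data
_, z_n_enc = E(x)                              # continuous code of the real data
x_new = G(z_c, z_n_enc)                        # sample from the recombined code
logits_rec, z_n_rec = E(x_new)                 # re-encode the recombined sample

ce_penalty = cross_entropy(z_c, 10.0 * logits_rec)       # discrete code unchanged?
mse_penalty = float(np.mean((z_n_enc - z_n_rec) ** 2))   # continuous code recovered?
```

In a real model `G` and `E` would be neural networks sharing weights across all passes, and the two penalties would be minimized jointly with the other loss terms.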


3.5 Objective of DLS-Clustering

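The objective function of our approach can be integrated into the following form:

$$\mathcal{L} = \mathcal{L}_{GAN} + \beta_1 \mathcal{L}_{AE} + \beta_2 \mathcal{L}_{MMD} + \beta_3 \mathcal{L}_{MSE} + \beta_4 \mathcal{L}_{CE}^{(1)} + \beta_5 \mathcal{L}_{CE}^{(2)}, \tag{7}$$

where the regularization coefficients $\beta_1, \dots, \beta_5$ control the relative contribution of the different loss terms: the adversarial loss $\mathcal{L}_{GAN}$ (Eq. 1), the reconstruction loss $\mathcal{L}_{AE}$ (Eq. 4), the MMD regularizer $\mathcal{L}_{MMD}$ (Eq. 3), the continuous-code loss $\mathcal{L}_{MSE}$ (Eq. 6), and the two cross-entropy losses $\mathcal{L}_{CE}^{(1)}$ (Eq. 2) and $\mathcal{L}_{CE}^{(2)}$ (Eq. 5). Each term of Eq. 7 plays a different role for the three components: generator $G$, discriminator $D$ and encoder $E$. Both $\mathcal{L}_{GAN}$ and $\mathcal{L}_{AE}$ are related to $z_c$ and $z_n$, and thus constrain the whole latent variable. The $\mathcal{L}_{GAN}$ term is also related to $D$, which focuses on distinguishing the true data samples from the fake samples generated by $G$. $\mathcal{L}_{MMD}$ and $\mathcal{L}_{MSE}$ are related to the continuous latent variables, and $\mathcal{L}_{CE}^{(1)}$ and $\mathcal{L}_{CE}^{(2)}$ are related to the discrete latent variables. Together, these loss terms ensure that our algorithm disentangles the whole latent space into cluster information and the remaining unspecified factors. The training procedure of DLS-Clustering involves jointly updating the parameters of $G$, $D$ and $E$, as described in Algorithm 1. In this work, we set the coefficients empirically to enable a reasonable adjustment of the relative importance of the continuous and discrete parts.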

Input: initial parameters of $G$, $D$ and $E$, the dimension of the latent code, the number of clusters K, the batch size B, the number of critic iterations per end-to-end iteration M, the regularization parameters
Output: The parameters of $G$, $D$ and $E$
Data: Training data set
while not converged do
    for i = 1, …, M do
        Sample a batch of random noise
        Sample a batch of random one-hot vectors
        Sample a batch of training data
        Update the parameters of $D$ with the adversarial loss
    Sample a batch of random noise
    Sample a batch of random one-hot vectors
    Update the parameters of $G$ and $E$ with the objective in Eq. 7
Algorithm 1 The training procedure of DLS-Clustering.

4 Experiments

Dataset Dimensions Layer Type G-1/D-4/E-4 G-2/D-3/E-3 G-3/D-2/E-2 G-4/D-1/E-1
MNIST Conv-Deconv - -
Fashion-10 Conv-Deconv - -
YTF Conv-Deconv
Pendigits MLP 256 256 - -
10x_73k MLP 256 256 - -

Table 1: The structure summary of the generator (G), discriminator (D) and encoder (E) in DLS-Clustering  for different datasets.
Dataset Discrete Dim. Continuous Dim.
MNIST 10 25
Fashion-10 10 40
YTF 41 60
Pendigits 10 5
10x_73k 8 30

Table 2: The dimensions of one-hot discrete latent variables and continuous latent variables in DLS-Clustering  for different datasets. Note that the dimension of one-hot discrete latent variables is equal to the number of clusters.
Method             MNIST          Fashion-10     YTF            Pendigits      10x_73k
                   ACC    NMI     ACC    NMI     ACC    NMI     ACC    NMI     ACC    NMI
K-means [30]       0.532  0.500   0.474  0.512   0.601  0.776   0.793  0.730   0.623  0.577
NMF [27]           0.560  0.450   0.500  0.510   -      -       0.670  0.580   0.710  0.690
SC [39]            0.656  0.731   0.508  0.575   0.510  0.701   0.700  0.690   0.400  0.290
AGGLO [50]         0.640  0.650   0.550  0.570   -      -       0.700  0.690   0.630  0.580
DEC [46]           0.863  0.834   0.518  0.546   0.371  0.446   -      -       -      -
DCN [47]           0.830  0.810   -      -       -      -       0.720  0.690   -      -
JULE [48]          0.964  0.913   0.563  0.608   0.684  0.848   -      -       -      -
DEPICT [14]        0.965  0.917   0.392  0.392   0.621  0.802   -      -       -      -
SpectralNet [38]   0.800  0.814   -      -       0.685  0.798   -      -       -      -
InfoGAN [6]        0.890  0.860   0.610  0.590   -      -       0.720  0.730   0.620  0.580
ClusterGAN [34]    0.950  0.890   0.630  0.640   -      -       0.770  0.730   0.810  0.730
DualAE [49]        0.978  0.941   0.662  0.645   0.691  0.857   -      -       -      -
Our Method         0.975  0.936   0.693  0.669   0.721  0.790   0.847  0.803   0.905  0.820

Table 3: Comparison of clustering algorithms on five benchmark datasets. The results marked by (*) are from the existing sklearn.cluster.KMeans package. The dash marks (-) mean that the source code is not available or that running the released code is not practical. All other results are from [34] and [49].

In this section, we perform a variety of experiments to evaluate the effectiveness of our proposed method.

4.1 Data sets

The clustering experiments first are carried out on five datasets: MNIST, Fashion-MNIST [45], YouTube-Face (YTF) [44], Pendigits and 10x_73k [51]. Both of the first two datasets contain 70k images with 10 categories, and each sample is a grayscale image. YTF contains 10k face images of size , belonging to 41 categories. The Pendigits dataset contains a time series of coordinates of hand-written digits. It has 10 categories and contains 10992 samples, and each sample is represented as a 16-dimensional vector. The 10x_73k dataset contains 73233 data samples of single cell RNA-seq counts of 8 cell types, and the dimension of each sample is 720. We choose these datasets to demonstrate that our method can be effective for clustering different types of data.

4.2 Implementation

We implement different neural network structures for $G$, $D$ and $E$ to handle different types of data. For the image datasets (MNIST, Fashion-MNIST and YTF), we employ $G$ and $D$ similar to those of DCGAN [37], with conv-deconv layers, batch normalization and leaky ReLU activations with a slope of 0.2. $E$ uses the same architecture as $D$ except for the last layer. For the Pendigits and 10x_73k datasets, $G$, $D$ and $E$ are MLPs with 2 hidden layers of 256 hidden units each. Table 1 summarizes the network structures for different datasets. The model parameters are initialized from a random normal distribution. For the prior distribution of our method, we randomly generate the discrete latent code $z_c$, which is equal to one of the elementary one-hot encoded vectors in $\mathbb{R}^{K}$, then sample the continuous latent code $z_n$ from , here . The sampled latent code $z = (z_c, z_n)$ is used as the input of $G$ to generate samples. The dimensions of $z_c$ and $z_n$ are shown in Table 2. We implement the MMD loss with an RBF kernel [40] to penalize the posterior distribution of $z_n$. The improved GAN variant with a gradient penalty [19] is used in all experiments. To obtain the cluster assignment, we directly use the argmax over all softmax probabilities for different clusters. The following regularization parameters work well during all experiments: , , . We implement all the models in Python using the TensorFlow library, and train them on one NVIDIA DGX-1 station.
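The argmax-based cluster assignment described above can be illustrated as follows. This is a minimal NumPy sketch; the function name `assign_clusters` and the example logits are hypothetical and stand in for the discrete part of an encoder output.

```python
import numpy as np

def assign_clusters(cluster_logits):
    """Return hard cluster labels and softmax probabilities, row by row."""
    shifted = cluster_logits - cluster_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)                        # shifted for numerical stability
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# Two hypothetical encoder outputs over three clusters.
logits = np.array([[2.0, 0.1, -1.0],
                   [0.0, 0.5, 3.0]])
labels, probs = assign_clusters(logits)          # labels -> array([0, 2])
```

Because softmax is monotonic, the argmax of the probabilities equals the argmax of the logits; the probabilities are returned only because they can be useful for inspecting assignment confidence.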

4.3 Evaluation of DLS-Clustering algorithm

To evaluate clustering results, we report two standard evaluation metrics: Clustering Purity (ACC) and Normalized Mutual Information (NMI). We compare DLS-Clustering with four clustering baselines: K-means [30], Non-negative Matrix Factorization (NMF) [27], Spectral Clustering (SC) [39] and Agglomerative Clustering (AGGLO) [50]. We also compare our method with state-of-the-art clustering approaches based on GAN and Autoencoder, respectively. For GAN-based approaches, ClusterGAN [34] is chosen as it achieves superior clustering performance compared to other GAN models (e.g., InfoGAN and GAN with bp). For Autoencoder-based methods, DEC [46], DCN [47], DEPICT [14] and, in particular, the Dual Autoencoder Network (DualAE) [49] are used for comparison. In addition, deep spectral clustering (SpectralNet) [38] and joint unsupervised learning (JULE) [48] are also included in our comparison.
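For reference, both metrics can be computed from the contingency table of true labels and predicted clusters. The NumPy sketch below is illustrative only: the function names are hypothetical, and the NMI normalization (the arithmetic mean of the two entropies) is an assumption, since other normalizations are also in common use.

```python
import numpy as np

def contingency(y_true, y_pred):
    """Count matrix with true classes as rows and predicted clusters as columns."""
    _, t = np.unique(np.asarray(y_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(y_pred).ravel(), return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1))
    np.add.at(table, (t, p), 1)
    return table

def purity(y_true, y_pred):
    """Fraction of samples that belong to the majority class of their cluster."""
    table = contingency(y_true, y_pred)
    return table.max(axis=0).sum() / table.sum()

def nmi(y_true, y_pred):
    """Mutual information divided by the mean of the two label entropies."""
    joint = contingency(y_true, y_pred)
    joint = joint / joint.sum()
    p_true, p_pred = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p_true, p_pred)[nz]))
    h_true = -np.sum(p_true * np.log(p_true))
    h_pred = -np.sum(p_pred * np.log(p_pred))
    return mi / ((h_true + h_pred) / 2.0)   # undefined if both labelings are constant

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]                 # same partition, permuted cluster ids
print(purity(y_true, y_pred), nmi(y_true, y_pred))
```

Both metrics are invariant to how cluster indices are numbered, which matters here because the one-hot positions learned without supervision carry no fixed class meaning.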

Table 3 reports the best clustering metrics of different models from 5 runs. Our method achieves significant performance improvements over the other methods on the Fashion-10, YTF, Pendigits and 10x_73k datasets. In particular, for the 16-dimensional Pendigits dataset, the other methods all perform worse than K-means does, while our method significantly outperforms K-means in both ACC (0.847 vs. 0.793) and NMI (0.803 vs. 0.730). DLS-Clustering achieves the best ACC result on the YTF dataset while maintaining a comparable NMI value. For the MNIST dataset, DLS-Clustering achieves close to the best performance on both the ACC and NMI metrics.

4.4 Analysis on continuous latent variables

The superior clustering performance of DLS-Clustering demonstrates that the one-hot discrete latent variables directly represent the category information in the data. To understand the information contained in the continuous latent variables, we first use t-SNE [29] to visualize the continuous latent variable $z_n$ of the MNIST and Fashion-MNIST datasets and compare it to the original data. As shown in Figure 2, we can clearly see category information in the original MNIST (a(1)) and Fashion-MNIST (b(1)) data. Meanwhile, there is no obvious category structure in the $z_n$ of the MNIST (a(2)) and Fashion-MNIST (b(2)) data. Samples of all categories are well mixed in both datasets. A small bulk of samples in the right part of a(2) is a group of “1” images. The reason they are not well mixed may be their low complexity.

Then, we fix the discrete latent variable and generate images belonging to the same clusters by sampling the continuous latent variables. As shown in Figure 3, the diversity of the generated images indicates that the continuous latent variable contains a large number of generative factors other than the cluster information. To further understand the factors in the continuous latent variable $z_n$, we vary the value of a single dimension of $z_n$ over [-0.5, 0.5] while fixing the other dimensions and the discrete latent variable $z_c$. As shown in Figure 4, changing this value leads to semantic changes in the generated images. For the MNIST data, the changed dimension represents the width factor of variation in the digits. For the Fashion-MNIST data, it captures the shape factor of objects. All these informative continuous factors are independent of cluster categories.
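Such a traversal amounts to building a batch of latent codes that differ in exactly one continuous dimension. The sketch below shows one way to construct that batch with NumPy; the function name `traverse_dimension`, the number of steps and the chosen cluster and dimension indices are hypothetical, and the resulting codes would be passed to a trained generator, which is not included here.

```python
import numpy as np

def traverse_dimension(z_c, z_n, dim, low=-0.5, high=0.5, steps=9):
    """Build latent codes that vary one continuous dimension and fix the rest.

    z_c: one-hot cluster code, shape (K,)
    z_n: base continuous code, shape (d,)
    Returns an array of shape (steps, K + d).
    """
    values = np.linspace(low, high, steps)
    z_n_batch = np.tile(z_n, (steps, 1))
    z_n_batch[:, dim] = values                   # only this dimension changes
    z_c_batch = np.tile(z_c, (steps, 1))         # cluster stays fixed
    return np.concatenate([z_c_batch, z_n_batch], axis=1)

# Example: cluster 3 of 10, a 25-dimensional continuous code, varying dimension 4.
codes = traverse_dimension(np.eye(10)[3], np.zeros(25), dim=4)
```

Each row of `codes` would yield one image in a traversal strip such as those shown in Figure 4.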

These results demonstrate that the learned continuous latent representations from DLS-Clustering  have captured other meaningful generative factors that are not related to clusters. Therefore, the proposed method successfully performs the mapping from the data to the disentangled latent space. The one-hot discrete latent variable is directly related to clusters, and the continuous latent variable, which corresponds to the other unspecified generative factors, governs the diversity of generated samples.

Figure 2: The t-SNE visualization of raw data (1) and continuous latent variables (2) on the MNIST (a) and Fashion-MNIST (b) datasets. The bulk of samples in the right part of a(2) is a small group of “1” images.
Figure 3: Samples generated by fixing the discrete latent code from the models trained on MNIST and Fashion-MNIST. Note that the discrete latent variables are directly related to the cluster assignment, while the continuous latent variables correspond to other informative factors.
Figure 4: Samples generated on fixed discrete latent variables from the models trained on MNIST and Fashion-MNIST.

4.5 Scalability of large number of clusters

Method       ACC    NMI    ARI
K-means      0.668  0.836  0.574
Our method   0.822  0.911  0.764

Table 4: The scalability to large number of clusters (K=100) on Coil-100 dataset.

To further evaluate the scalability of DLS-Clustering to large numbers of clusters, we run it on the multi-view object image dataset COIL-100 [35]. The COIL-100 dataset has 100 clusters and contains 7200 images of size . Here, we compare our clustering method with K-means on three standard evaluation metrics: ACC, NMI and Adjusted Rand Index (ARI). As shown in Table 4, DLS-Clustering achieves better performance on all three metrics by directly learning clusters and 100-dimensional continuous latent representations. In particular, DLS-Clustering gains an increase of 0.154 on the ACC metric. We also perform an image generation task on the COIL-100 dataset to further verify the generative performance, which involves mapping latent variables to the data space. Figure 5 shows the samples generated by fixing the one-hot discrete latent variables, which are diverse and realistic. The continuous latent variables represent meaningful factors such as the pose, location and orientation of objects. Therefore, the disentanglement of the latent space not only provides superior clustering performance, but also retains the ability to generate diverse and high-quality images.

Figure 5: The samples generated on fixed discrete latent variables from the models trained on Coil-100 dataset. Each column corresponds to a specific cluster.

5 Conclusion

In this work, we present DLS-Clustering, a new type of clustering method that directly obtains the cluster assignments by disentangling the latent space in an unsupervised fashion. Unlike existing latent space clustering algorithms, our method does not explicitly build a “clustering-friendly” latent space and does not need an extra clustering operation. Furthermore, our method does not disentangle class-relevant features from class-irrelevant features. The disentanglement in our method is targeted at extracting “cluster information” from the data. Moreover, unlike distance-based clustering algorithms, our method does not depend on any explicit distance calculation in the latent space. The distance between data samples may be implicitly defined in the neural network.

Besides clustering, the generator in our method can also generate diverse and realistic samples. The proposed method can also support other applications, including conditional generation based on clusters, cluster-specific image transfer and cross-cluster retrieval. In the future, we will explore better priors for the latent space and more disentanglement of other generative factors.


  • [1] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua (2017) CVAE-GAN: fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2745–2754.
  • [2] Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
  • [3] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149.
  • [4] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan (2017) Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5879–5887.
  • [5] T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620.
  • [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
  • [7] K. Chuang, H. Tzeng, S. Chen, J. Wu, and T. Chen (2006) Fuzzy c-means clustering with spatial information for image segmentation. Computerized Medical Imaging and Graphics 30 (1), pp. 9–15.
  • [8] N. Dilokthanakul, P. A. Mediano, M. Garnelo, M. C. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan (2016) Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
  • [9] J. Donahue, P. Krähenbühl, and T. Darrell (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782.
  • [10] D. L. Donoho et al. (2000) High-dimensional data analysis: the curses and blessings of dimensionality. AMS math challenges lecture 1 (2000), pp. 32.
  • [11] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville (2016) Adversarially learned inference. arXiv preprint arXiv:1606.00704.
  • [12] E. Dupont (2018) Learning disentangled joint continuous and discrete representations. In Advances in Neural Information Processing Systems, pp. 710–720.
  • [13] B. Esmaeili, H. Wu, S. Jain, A. Bozkurt, N. Siddharth, B. Paige, D. H. Brooks, J. Dy, and J. van de Meent (2018) Structured disentangled representations. arXiv preprint arXiv:1804.02086.
  • [14] K. Ghasedi Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang (2017) Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5736–5745.
  • [15] P. Ghosh, M. S. Sajjadi, A. Vergari, M. Black, and B. Schölkopf (2019) From variational to deterministic autoencoders. arXiv preprint arXiv:1903.12436.
  • [16] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio (2018) Image-to-image translation for cross-domain disentanglement. In Advances in Neural Information Processing Systems, pp. 1287–1298.
  • [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [18] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. Journal of Machine Learning Research 13 (Mar), pp. 723–773.
  • [19] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.
  • [20] X. Guo, L. Gao, X. Liu, and J. Yin (2017) Improved deep embedded clustering with local structure preservation. In IJCAI, pp. 1753–1759.
  • [21] N. Hadad, L. Wolf, and M. Shahar (2018) A two-step disentanglement method. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 772–780.
  • [22] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-VAE: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6.
  • [23] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama (2017) Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1558–1567.
  • [24] H. Kim and A. Mnih (2018) Disentangling by factorising. arXiv preprint arXiv:1802.05983.
  • [25] D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • [26] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2015) Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
  • [27] D. D. Lee and H. S. Seung (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), pp. 788.
  • [28] Y. Li, K. Swersky, and R. Zemel (2015) Generative moment matching networks. In International Conference on Machine Learning, pp. 1718–1727.
  • [29] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
  • [30] J. MacQueen et al. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281–297.
  • [31] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
  • [32] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun (2016) Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048.
  • [33] N. Mrabah, M. Bouguessa, and R. Ksantini (2019) Adversarial deep embedded clustering: on a better trade-off between feature randomness and feature drift. arXiv preprint arXiv:1909.11832.
  • [34] S. Mukherjee, H. Asnani, E. Lin, and S. Kannan (2019) ClusterGAN: latent space clustering in generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4610–4617.
  • [35] S. A. Nene, S. K. Nayar, H. Murase, et al. Columbia object image library (COIL-20).
  • [36] M. Patacchiola, P. Fox-Roberts, and E. Rosten (2019) Y-autoencoders: disentangling latent representations via sequential-encoding. arXiv preprint arXiv:1907.10949.
  • [37] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • [38] U. Shaham, K. Stanton, H. Li, B. Nadler, R. Basri, and Y. Kluger (2018) SpectralNet: spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587.
  • [39] J. Shi and J. Malik (2000) Normalized cuts and image segmentation. Departmental Papers (CIS), pp. 107.
  • [40] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf (2017) Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558.
  • [41] M. Tschannen, O. Bachem, and M. Lucic (2018) Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069.
  • [42] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11 (Dec), pp. 3371–3408.
  • [43] C. Wang, M. Pelillo, and K. Siddiqi (2019) Dominant set clustering and pooling for multi-view 3D object recognition. arXiv preprint arXiv:1906.01592.
  • [44] L. Wolf, T. Hassner, and I. Maoz (2011) Face recognition in unconstrained videos with matched background similarity. IEEE.
  • [45] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
  • [46] J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pp. 478–487.
  • [47] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3861–3870.
  • [48] J. Yang, D. Parikh, and D. Batra (2016) Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156.
  • [49] X. Yang, C. Deng, F. Zheng, J. Yan, and W. Liu (2019) Deep spectral clustering using dual autoencoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4066–4075.
  • [50] W. Zhang, X. Wang, D. Zhao, and X. Tang (2012) Graph degree linkage: agglomerative clustering on a directed graph. In European Conference on Computer Vision, pp. 428–441.
  • [51] G. X. Zheng, J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, T. D. Wheeler, G. P. McDermott, J. Zhu, et al. (2017) Massively parallel digital transcriptional profiling of single cells. Nature Communications 8, pp. 14049.
  • [52] Z. Zheng and L. Sun (2019) Disentangling latent space for VAE by label relevant/irrelevant dimensions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12192–12201.