Deep Clustering and Representation Learning that Preserves Geometric Structures
In this paper, we propose a novel framework for Deep Clustering and Representation Learning (DCRL) that preserves the geometric structure of data. In the proposed DCRL framework, manifold clustering is done in the latent space guided by a clustering loss. To overcome the problem that clustering-oriented losses may deteriorate the geometric structure of embeddings in the latent space, an isometric loss is proposed for preserving intra-manifold structure locally and a ranking loss for inter-manifold structure globally. Experimental results on various datasets show that the DCRL framework leads to performances comparable to current state-of-the-art deep clustering algorithms, yet exhibits superior performance for downstream tasks. Our results also demonstrate the importance and effectiveness of the proposed losses in preserving geometric structure in terms of visualization and performance metrics.
Clustering, a fundamental tool for data analysis and visualization, has been an essential research topic in data science and machine learning. Conventional clustering algorithms such as K-Means (MacQueen, 1965), Gaussian Mixture Models (GMM) (Bishop, 2006), and spectral clustering (Shi and Malik, 2000) perform clustering based on distance or similarity. However, handcrafted distance or similarity measures are rarely reliable for large-scale high-dimensional data, making it increasingly challenging to achieve effective clustering. An intuitive solution is to transform the data from the high-dimensional input space to a low-dimensional latent space and then to cluster the data in the latent space. This can be achieved by applying dimensionality reduction techniques such as PCA (Wold et al., 1987), t-SNE (Maaten and Hinton, 2008), and UMAP (McInnes et al., 2018). However, since these methods are not specifically designed for clustering tasks, some of their properties may be contrary to our expectations, e.g., two data points from different manifolds that are close in the input space will be even closer in the latent space derived by UMAP. Therefore, the first question here is: how can we learn a representation that favors clustering?
The two main points of multi-manifold representation learning are (1) preserving the local geometric structure within each manifold and (2) ensuring the discriminability between different manifolds. However, it is challenging to decouple complex cross-over relations and ensure the discriminability between different manifolds, especially in unsupervised situations. One natural strategy is to perform clustering in the input space to get pseudo-labels and then perform representation learning for each manifold. However, in that case, the performance of representation learning depends heavily on the clustering quality, and commonly used clustering algorithms such as K-Means do not work well on high-dimensional data. Thus, the second question here is: how can we cluster data in a way that favors representation learning?
To answer these two questions, some pioneering work has proposed integrating deep clustering and representation learning into a unified framework by defining a clustering-oriented loss. Though promising performance has been demonstrated on various datasets, we observe that a vital factor has been ignored by these works: the defined clustering-oriented loss may deteriorate the geometric structure of the latent space, which in turn hurts the performance of visualization, clustering generalization, and downstream tasks. In this paper, we propose to jointly perform deep clustering and representation learning with geometric structure preservation. Inspired by Xie et al. (2016), the clustering centers are defined as a set of learnable parameters, and we use a clustering loss to simultaneously guide the separation of data points from different manifolds and the learning of the clustering centers. To prevent the clustering loss from deteriorating the latent space, an isometric loss and a ranking loss are proposed to preserve the intra-manifold structure locally and the inter-manifold structure globally. Our experimental results show that our method exhibits far superior performance to counterparts in terms of clustering and representation learning, which demonstrates the importance and effectiveness of preserving geometric structure.
The contributions of this work are summarized as follows:
- Proposing to integrate deep clustering and representation learning into a unified framework with local and global structure preservation.
- Unlike conventional multi-manifold learning algorithms that deal with all point-pair relationships between different manifolds simultaneously, we set the clustering centers as a set of learnable parameters and achieve global structure preservation in a faster, more efficient, and easier-to-optimize manner by applying a ranking loss to the clustering centers.
- Analyzing the contradiction between the two optimization goals of clustering and local structure preservation, and proposing an elegant training strategy to alleviate it.
- The proposed DCRL algorithm outperforms competing algorithms in terms of clustering quality, generalizability to out-of-sample data, and performance in downstream tasks.
2 Related Work
Clustering analysis. As a fundamental tool in machine learning, clustering has been widely applied in various domains. One branch of classical clustering comprises K-Means (MacQueen, 1965) and Gaussian Mixture Models (GMM) (Bishop, 2006), which are fast, easy to understand, and can be applied to a large number of problems. However, limited by the Euclidean measure, their performance on high-dimensional data is often unsatisfactory. Spectral clustering and its variants (such as SC-Ncut (Bishop, 2006)) extend clustering to high-dimensional data by allowing more flexible distance measures. However, limited by the computational cost of the full Laplacian matrix, spectral clustering is challenging to extend to large-scale datasets.
Deep clustering. The success of deep learning has contributed to the growth of deep clustering. One branch of deep clustering performs clustering after learning a representation through existing unsupervised techniques. For example, Tian et al. (2014) use an autoencoder to learn low-dimensional features and then run K-Means to get clustering results (AE+K-Means). Considering the geometric structure of the data, N2D applies UMAP to find the best clusterable manifold of the obtained embedding and then runs K-Means to discover higher-quality clusters (McConville et al., 2019). The other category of algorithms tries to optimize clustering and representation learning jointly. The closest work to ours is Deep Embedding for Clustering (DEC) (Xie et al., 2016), which learns a mapping from the input space to a lower-dimensional latent space by iteratively optimizing a clustering objective. As a modified version of DEC, IDEC claims to preserve the local structure of the data (Guo et al., 2017), but in practice its contribution amounts to adding a reconstruction loss. JULE proposes a recurrent framework for integrating clustering and representation learning into a single model with a weighted triplet loss and optimizes it end-to-end (Yang et al., 2016b). DSC devises a dual autoencoder to embed data into the latent space, and then deep spectral clustering (Shaham et al., 2018) is applied to obtain label assignments (Yang et al., 2019).
Manifold Representation Learning. Isomap, as a representative algorithm of single-manifold learning, aims to capture global nonlinear features and seek an optimal subspace that best preserves the geodesic distance between data points (Tenenbaum et al., 2000). In contrast, some algorithms, such as LLE (Roweis and Saul, 2000), are more concerned with the preservation of local neighborhood information. Combining DNNs with manifold learning, the recently proposed MLDL algorithm preserves local and global geometries by imposing LIS prior constraints (Li et al., 2020). Furthermore, multi-manifold learning is proposed to obtain the intrinsic properties of different manifolds. Yang et al. (2016a) proposed a supervised MMD-Isomap where data points are partitioned into different manifolds according to label information. Similarly, Zhang et al. (2018) proposed a semi-supervised local multi-manifold learning framework, termed SSMM-Isomap, that uses both labeled and unlabeled training samples to jointly learn local neighborhood-preserving features. Most previous work on multi-manifold learning assumes that the labels are known or partially known, which significantly simplifies the problem. For unsupervised multi-manifold learning, it is still very challenging to decouple multiple overlapping manifolds, and that is exactly what this paper aims to explore.
3 Proposed Method
Consider a dataset $X = \{x_i\}_{i=1}^{N}$ with $N$ samples, where each sample $x_i$ is sampled from one of $C$ different manifolds $\{M_c\}_{c=1}^{C}$. Assume that each category in the dataset lies on a compact low-dimensional manifold, and that the number of manifolds $C$ is prior knowledge. Define two nonlinear mappings $z_i = f(x_i; \theta_f)$ and $y_i = g(z_i; \theta_g)$, where $z_i$ is the embedding of $x_i$ in the latent space and $y_i$ is the reconstruction of $x_i$. The $j$-th cluster center is denoted as $\mu_j$, where $\{\mu_j\}_{j=1}^{C}$ is defined as a set of learnable parameters. We aim to find optimal parameters $\theta_f$ and $\{\mu_j\}_{j=1}^{C}$ so that the embedding features $\{z_i\}_{i=1}^{N}$ can achieve clustering with local and global structure preservation. To this end, a denoising autoencoder (Vincent et al., 2010), shown in Fig 1, is first pre-trained in an unsupervised manner to learn an initial latent space. The denoising autoencoder aims to optimize the self-reconstruction loss $L_{AE} = \mathrm{MSE}(x, y)$, where $y$ is reconstructed from $\tilde{x}$, a copy of $x$ with Gaussian noise added, that is, $\tilde{x} = x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Then the autoencoder is fine-tuned by optimizing the clustering-oriented loss $L_{cluster}$ and the structure-oriented losses $L_{iso}$, $L_{rank}$, and $L_{align}$ introduced below. Since clustering should be performed on features of clean data rather than the noised data used by the denoising autoencoder, the clean data is used for fine-tuning.
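The denoising objective described above can be illustrated with a small NumPy sketch. It corrupts a batch with additive Gaussian noise and scores a reconstruction against the clean input, in the spirit of Vincent et al. (2010). The function names, the noise level `sigma`, and the identity map standing in for the autoencoder are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def corrupt(x, sigma, rng):
    """Return a noisy copy of x (additive zero-mean Gaussian noise)."""
    return x + rng.normal(loc=0.0, scale=sigma, size=x.shape)

def reconstruction_loss(x_clean, y):
    """Mean squared error between the clean input and the reconstruction."""
    return float(np.mean((x_clean - y) ** 2))

rng = np.random.default_rng(0)
x = rng.random((4, 8))                 # toy batch: 4 samples, 8 features
x_noisy = corrupt(x, sigma=0.1, rng=rng)

# Hypothetical stand-in for decoder(encoder(x_noisy)): the identity map.
y = x_noisy
loss = reconstruction_loss(x, y)       # positive, because y still carries the noise
```

A real encoder/decoder pair would replace the identity map; the loss is then minimized with respect to the network parameters.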
3.1 Clustering-oriented Loss
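First, the cluster centers $\{\mu_j\}_{j=1}^{C}$ in the latent space are initialized (the initialization method is introduced in Sec 4.1). Then the similarity between the embedded point $z_i$ and the cluster center $\mu_j$ is measured by Student's $t$-distribution:
$$ q_{ij} = \frac{\left(1 + \lVert z_i - \mu_j \rVert^2\right)^{-1}}{\sum_{j'} \left(1 + \lVert z_i - \mu_{j'} \rVert^2\right)^{-1}} $$
The auxiliary target distribution is designed to help manipulate the latent space, defined as:
$$ p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}, \qquad f_j = \sum_{i} q_{ij} $$
where $f_j$ is the normalized cluster frequency, used to balance the size of different clusters. Then the encoder is optimized by the following objective:
$$ L_{cluster} = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}} $$
The gradient of $L_{cluster}$ with respect to each learnable cluster center $\mu_j$ can be computed as:
$$ \frac{\partial L_{cluster}}{\partial \mu_j} = -2 \sum_{i} \left(1 + \lVert z_i - \mu_j \rVert^2\right)^{-1} (p_{ij} - q_{ij}) (z_i - \mu_j) $$
$L_{cluster}$ facilitates the aggregation of data points within the same manifold, while data points from different manifolds are kept away from each other. However, we find that the clustering-oriented loss may deteriorate the geometric structure of the latent space, which hurts the clustering accuracy and leads to meaningless representations. To prevent the deterioration caused by the clustering loss, we introduce an isometry loss and a ranking loss to preserve the local and global structure, respectively.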
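The clustering-oriented loss can be sketched in NumPy as follows. The sketch assumes the DEC formulation (Xie et al., 2016), which this section says it is inspired by: a Student's t kernel with one degree of freedom for the soft assignment, a squared and frequency-normalized target distribution, and a KL divergence between the two. The function names and the small `eps` guard are illustrative choices, not taken from the paper.

```python
import numpy as np

def soft_assign(z, mu):
    """Soft assignment q[i, j] of embedding z[i] to center mu[j] (Student's t kernel)."""
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # squared distances, shape (N, C)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Auxiliary target p: square q, divide by the soft cluster frequency, renormalize rows."""
    f = q.sum(axis=0)                  # soft cluster frequency per cluster
    w = q ** 2 / f
    return w / w.sum(axis=1, keepdims=True)

def clustering_loss(p, q, eps=1e-12):
    """KL(P || Q), summed over all points and clusters."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 2))            # toy embeddings
mu = rng.normal(size=(3, 2))           # toy cluster centers
q = soft_assign(z, mu)
p = target_distribution(q)
loss = clustering_loss(p, q)
```

In training, `p` is usually treated as a fixed target while gradients flow through `q` to the encoder and the centers.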
3.2 Structure-oriented Loss
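Intra-manifold Isometry Loss. The intra-manifold local structure is preserved by optimizing the following objective:
$$ L_{iso} = \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^Z} \left| d_X(x_i, x_j) - d_Z(z_i, z_j) \right| \cdot \mathbb{1}\left( l(x_i) = l(x_j) \right) $$
where $\mathcal{N}_i^Z$ represents the neighborhood of data point $x_i$ in the feature space $Z$, and $k$NN is applied to determine the neighborhood. $d_X$ and $d_Z$ denote the distances in the input space and the latent space, $\mathbb{1}(\cdot)$ is an indicator function, and $l(\cdot)$ is a manifold determination function that returns the manifold where sample $x_i$ is located, that is, $l(x_i) = \arg\max_j p_{ij}$. Then we can derive $C$ manifolds $\{M_c\}_{c=1}^{C}$: $M_c = \{x_i \mid l(x_i) = c\}$. In a nutshell, the loss $L_{iso}$ constrains the isometry within each manifold.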
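One plausible reading of the intra-manifold isometry loss is sketched below in NumPy. It assumes Euclidean distances in both spaces, neighborhoods found by k-nearest neighbors in the latent space, and an absolute distance difference counted only for neighbor pairs assigned to the same manifold. The exact form used by the authors may differ; the names `pairwise_dist`, `isometry_loss`, and `labels` are hypothetical.

```python
import numpy as np

def pairwise_dist(a):
    """Euclidean distance matrix between the rows of a."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def isometry_loss(x, z, labels, k=3):
    """Sum of |d_X - d_Z| over latent-space kNN pairs that share a manifold label."""
    d_x = pairwise_dist(x)
    d_z = pairwise_dist(z)
    loss = 0.0
    for i in range(len(x)):
        order = np.argsort(d_z[i])                       # nearest first, in latent space
        neighbors = [j for j in order if j != i][:k]     # drop the point itself
        for j in neighbors:
            if labels[i] == labels[j]:                   # indicator: same manifold only
                loss += abs(d_x[i, j] - d_z[i, j])
    return float(loss)

rng = np.random.default_rng(0)
# Two well-separated toy manifolds with four points each.
x = np.concatenate([rng.normal(size=(4, 5)), rng.normal(size=(4, 5)) + 10.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

loss_isometric = isometry_loss(x, x, labels)         # identical geometry
loss_scaled = isometry_loss(x, 2.0 * x, labels)      # all distances doubled
```

An embedding that reproduces the input distances incurs no penalty, while a uniformly stretched embedding is penalized in proportion to the neighbor distances.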
Inter-manifold Ranking Loss. The inter-manifold global structure is preserved by optimizing the following objective:
$$ L_{rank} = \sum_{i=1}^{C} \sum_{j=1}^{C} \left| d_Z(\mu_i, \mu_j) - \kappa \cdot d_X(v_i^X, v_j^X) \right| $$
where $\{v_j^X\}_{j=1}^{C}$ is defined as the centers of the different manifolds in the original input space $X$, with $v_j^X = \frac{1}{|M_j|} \sum_{x_i \in M_j} x_i$ ($j = 1, \dots, C$). The parameter $\kappa$ determines the extent to which different manifolds move away from each other. The larger $\kappa$ is, the further away the different manifolds are from each other. The derivation of the gradient of $L_{rank}$ with respect to each learnable cluster center $\mu_j$ is placed in Appendix A.1. In contrast to our approach, conventional methods for dealing with inter-manifold separation typically impose push-away constraints on all data points from different manifolds (Zhang et al., 2018; Yang et al., 2016a), defined as:
$$ L_{sep} = -\sum_{i=1}^{N} \sum_{j=1}^{N} d_Z(z_i, z_j) \cdot \mathbb{1}\left( l(x_i) \neq l(x_j) \right) $$
The main differences between $L_{rank}$ and $L_{sep}$ are as follows: (1) $L_{sep}$ imposes constraints on the embedding points $z_i$, which in turn indirectly affects the network parameters $\theta_f$. In contrast, $L_{rank}$ imposes rank-preservation constraints directly on the learnable parameters $\{\mu_j\}_{j=1}^{C}$, in the form of a regularization term, to control the separation of the clustering centers. (2) $L_{sep}$ involves point-to-point relationships, while $L_{rank}$ involves only cluster-to-cluster relationships, so $L_{rank}$ is easier to optimize, faster to process, and more accurate. (3) The parameter $\kappa$ introduced in $L_{rank}$ allows us to control the extent of separation between manifolds for specific downstream tasks.
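The cluster-to-cluster nature of the ranking loss can be made concrete with a short NumPy sketch. It assumes the loss compares pairwise distances between the learnable centers with scaled pairwise distances between the per-manifold means in the input space, so only a C-by-C matrix is involved instead of an N-by-N one. The names `cluster_means`, `ranking_loss`, and `kappa` are illustrative, and the absolute-difference form is an assumption rather than the paper's confirmed definition.

```python
import numpy as np

def cluster_means(points, labels, n_clusters):
    """Mean of the points assigned to each cluster (assumes no cluster is empty)."""
    labels = np.asarray(labels)
    return np.stack([points[labels == c].mean(axis=0) for c in range(n_clusters)])

def pairwise_dist(a):
    """Euclidean distance matrix between the rows of a."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def ranking_loss(mu, x, labels, kappa=3.0):
    """Sum of |d(mu_i, mu_j) - kappa * d(v_i, v_j)| over all pairs of clusters."""
    v_x = cluster_means(x, labels, len(mu))      # manifold centers in the input space
    return float(np.abs(pairwise_dist(mu) - kappa * pairwise_dist(v_x)).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
labels = np.array([0, 0, 1, 1, 2, 2])
v_x = cluster_means(x, labels, 3)

# Centers whose mutual distances are exactly kappa times the input-space ones.
loss_matched = ranking_loss(3.0 * v_x, x, labels, kappa=3.0)
# Centers at the original scale violate the kappa-scaled target distances.
loss_unscaled = ranking_loss(v_x, x, labels, kappa=3.0)
```

Because the loss touches only the centers, its cost grows with the square of the number of clusters rather than the number of samples.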
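Alignment Loss. Note that the global ranking loss $L_{rank}$ is imposed directly on the learnable parameters $\{\mu_j\}_{j=1}^{C}$, so optimizing $L_{rank}$ will only update $\{\mu_j\}_{j=1}^{C}$ rather than the encoder's parameters $\theta_f$. Thus we need to introduce an auxiliary term $L_{align}$ to align the learnable cluster centers $\{\mu_j\}_{j=1}^{C}$ with the real cluster centers $\{v_j^Z\}_{j=1}^{C}$:
$$ L_{align} = \sum_{j=1}^{C} \lVert \mu_j - v_j^Z \rVert $$
where $\{v_j^Z\}_{j=1}^{C}$ are defined as $v_j^Z = \frac{1}{|M_j|} \sum_{x_i \in M_j} z_i$ ($j = 1, \dots, C$). We place the derivation of the gradient of $L_{align}$ with respect to each learnable cluster center $\mu_j$ in Appendix A.1.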
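A minimal NumPy sketch of the alignment idea follows. It assumes the loss is the sum of Euclidean distances between each learnable center and the mean embedding of the points currently assigned to that cluster. The function names and the choice of an unsquared norm are assumptions for illustration.

```python
import numpy as np

def cluster_means(points, labels, n_clusters):
    """Mean of the points assigned to each cluster (assumes no cluster is empty)."""
    labels = np.asarray(labels)
    return np.stack([points[labels == c].mean(axis=0) for c in range(n_clusters)])

def alignment_loss(mu, z, labels):
    """Sum over clusters of the distance between a learnable center and the real center."""
    v_z = cluster_means(z, labels, len(mu))      # real centers in the latent space
    return float(np.linalg.norm(mu - v_z, axis=1).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 2))
labels = np.array([0, 0, 1, 1, 2, 2])
v_z = cluster_means(z, labels, 3)

loss_aligned = alignment_loss(v_z, z, labels)          # centers already on the means
loss_shifted = alignment_loss(v_z + 1.0, z, labels)    # each center offset by (1, 1)
```

Since the real centers depend on the embeddings, a gradient of this term reaches the encoder, which is the coupling the paragraph above asks for.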
3.3 Training Strategy
The contradiction between clustering and local structure preservation is analyzed from the perspective of force analysis. As shown in Fig 2, we assume that there exists a data point (red point) and its three nearest neighbors (blue points) around a cluster center (gray point). When clustering and local structure preservation are optimized simultaneously, it is very easy to fall into a local optimum, where the data point is in a steady state and the resultant force from its three nearest neighbors is equal in magnitude and opposite in direction to the gravitational force of the cluster. Therefore, the following training strategy is applied to prevent such local optimal solutions.
Alternating Training and Weight Graduality
Alternating Training. To solve the above problem and integrate the goals of clustering and structure preservation into a unified framework, we adopt an alternating training strategy. Within each epoch, we first jointly optimize $L_{cluster}$ and $L_{rank}$ on each mini-batch, with the joint loss defined as
$$ L_1 = L_{cluster} + \alpha L_{rank} $$
where $\alpha$ is the weighting factor that balances the effects of clustering and global rank-preservation. Then, at each epoch, we optimize the isometry loss $L_{iso}$ and $L_{align}$ on the whole dataset, defined as
$$ L_2 = \beta L_{iso} + L_{align} $$
where $\beta$ is the weighting factor for the isometry loss.
Weight Graduality. At different stages of training, we have different expectations for clustering and structure preservation. At the beginning of training, to successfully decouple the overlapping manifolds, we hope that the ranking loss $L_{rank}$ will dominate and the isometry loss $L_{iso}$ will be auxiliary. When the margin between different manifolds is sufficiently pronounced, the weight for $L_{rank}$ can be gradually reduced, while the weight for $L_{iso}$ can be gradually increased, focusing on the preservation of the local isometry. The whole algorithm is summarized in Algorithm 1 in Appendix A.2.
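The alternating schedule described above can be sketched as a plain Python skeleton: within every epoch, each mini-batch takes one step on the joint clustering and ranking objective, and then a single step on the isometry and alignment objective is taken over the whole dataset. The callbacks `batch_step` and `full_step` and the weight functions are hypothetical placeholders; this is not the authors' Algorithm 1, only an outline of the control flow.

```python
def train(n_epochs, batches, batch_step, full_step, alpha_at, beta_at):
    """Alternate mini-batch steps (clustering + ranking) with full-data steps (isometry + alignment)."""
    for epoch in range(n_epochs):
        alpha = alpha_at(epoch)          # weight of the ranking loss at this epoch
        beta = beta_at(epoch)            # weight of the isometry loss at this epoch
        for batch in batches:
            batch_step(batch, alpha)     # one update on the joint clustering/ranking loss
        full_step(beta)                  # one update on the isometry/alignment loss

# Usage with recording callbacks in place of real optimizer steps.
log = []
train(
    n_epochs=2,
    batches=["b0", "b1", "b2"],
    batch_step=lambda batch, alpha: log.append(("batch", batch, alpha)),
    full_step=lambda beta: log.append(("full", beta)),
    alpha_at=lambda epoch: 0.1,          # constant weights, for illustration only
    beta_at=lambda epoch: 1.0,
)
```

Keeping the two objectives in separate steps is what lets their relative weights change over training without the two gradients cancelling inside a single update.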
Three-stage explanation. To explain the training strategy more vividly, the entire training process can be roughly divided into three stages, as shown in Fig 3. At first, four different manifolds overlap each other. At Stage 1, the clustering loss $L_{cluster}$ dominates, so data points within each manifold converge towards the clustering center to form a sphere, and the local structure of the manifolds is destroyed. At Stage 2, the ranking loss $L_{rank}$ dominates, so different manifolds in the latent space move away from each other to increase the manifold margin and enhance the discriminability. At Stage 3, the manifolds gradually recover their original local structure from the spherical shape, with the isometry loss $L_{iso}$ dominating. It is worth noting that all of the above losses coexist rather than act independently at different stages, but the role played by each loss varies due to the alternating training and weight graduality.
4.1 Experimental setups
In this section, the effectiveness of the proposed framework is evaluated in 5 benchmark datasets: USPS
Parameter settings. The encoder is a multilayer perceptron (MLP) with dimensions $d$-500-500-2000-10, where $d$ is the dimension of the input data, and the decoder is its mirror. After pretraining, in order to initialize the learnable clustering centers, t-SNE is applied to further reduce the latent space to 2 dimensions, and then the K-Means algorithm is run to obtain the label assignment for each data point. The centers of each category in the latent space are set as the initial clustering centers $\{\mu_j\}_{j=1}^{C}$. The batch size is set to 256, the number of epochs is set to 300, the number of nearest neighbors $k$ is set to 3, and the parameter $\kappa$ is set to 3 for all datasets. Besides, the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 is used. As described in Sec 3.3.2, weight graduality is applied to train the model. The weight parameter $\alpha$ for $L_{rank}$ decreases linearly from 0.1 to 0 within epochs 0-150. In contrast, the weight parameter $\beta$ for the $L_{iso}$ loss increases linearly from 0 to 1.0 within epochs 0-150. The implementation is based on PyTorch and runs on an NVIDIA V100 GPU.
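The two linear weight schedules quoted above can be written down directly. The sketch below assumes that the weight decaying from 0.1 to 0 belongs to the ranking loss and the weight growing from 0 to 1.0 belongs to the isometry loss, and that both stay at their final value after epoch 150; the second assumption is not stated in the text. The function names are hypothetical.

```python
def linear_ramp(epoch, start, end, ramp_epochs=150):
    """Interpolate linearly from start to end over ramp_epochs, then hold the final value."""
    frac = min(max(epoch, 0), ramp_epochs) / ramp_epochs
    return start + (end - start) * frac

def rank_weight(epoch):
    """Weight of the ranking loss: 0.1 at epoch 0, 0 from epoch 150 onward."""
    return linear_ramp(epoch, 0.1, 0.0)

def iso_weight(epoch):
    """Weight of the isometry loss: 0 at epoch 0, 1.0 from epoch 150 onward."""
    return linear_ramp(epoch, 0.0, 1.0)

# Weights at a few epochs of a 300-epoch run.
schedule = [(e, rank_weight(e), iso_weight(e)) for e in (0, 75, 150, 300)]
```

With these settings the two weights cross over during the first half of training, which matches the shift in emphasis from separating manifolds to restoring local isometry.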
Evaluation Metrics. Two standard evaluation metrics: Accuracy (ACC) and Normalized Mutual Information (NMI) (Xu et al., 2003) are used to evaluate clustering performance. Besides, six evaluation metrics are adopted in this paper to evaluate the performance of representation learning, including Relative Rank Error (RRE), Trustworthiness (Trust), Continuity (Cont), Root Mean Reconstruction Error (RMRE), Locally Geometric Distortion (LGD) and Cluster Rank Accuracy (CRA). Limited by space, their precise definitions are available in Appendix A.4.
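For readers unfamiliar with clustering accuracy, the sketch below assumes the definition commonly used in the deep clustering literature: the fraction of correctly assigned samples under the best one-to-one mapping between predicted clusters and ground-truth classes. For brevity it searches all permutations, which is only practical for a small number of clusters; larger problems are usually solved with the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`. The function name and the toy labels are illustrative.

```python
from itertools import permutations

import numpy as np

def clustering_accuracy(y_true, y_pred, n_clusters):
    """Accuracy under the best one-to-one mapping from predicted clusters to true classes."""
    # counts[c, k] = number of samples placed in predicted cluster c whose true class is k
    counts = np.zeros((n_clusters, n_clusters), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    # Brute-force search over all cluster-to-class assignments.
    best = max(
        sum(counts[c, perm[c]] for c in range(n_clusters))
        for perm in permutations(range(n_clusters))
    )
    return best / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
acc_relabelled = clustering_accuracy(y_true, [1, 1, 2, 2, 0, 0], n_clusters=3)  # same partition, renamed
acc_one_error = clustering_accuracy(y_true, [1, 1, 2, 0, 0, 0], n_clusters=3)   # one sample misplaced
```

Because cluster indices are arbitrary, a pure relabelling of a perfect partition still scores 1.0 under this metric.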
4.2 Evaluation of Clustering
The ACC/NMI metrics of different methods on various datasets are reported in Tab LABEL:tab:1. For comparison methods whose results are not reported on some datasets, we run the released code using the hyperparameters provided in their papers and mark the results with (*). Our method outperforms K-Means and SC-Ncut by a significant margin and surpasses the other six competing DNN-based algorithms on all datasets except MNIST-test. Even on MNIST-test, we still rank second, outperforming the third-ranked method by 1.1%. In particular, we obtain the best performance on the Fashion-MNIST dataset, where our clustering accuracy exceeds the current best method (N2D) by 3.8%. Although our method is inspired by and highly consistent with the design of DEC, it achieves much better clustering results. On MNIST-full, for example, our clustering accuracy is 11.7% and 9.9% higher than that of DEC and IDEC, respectively.
|Algorithms||training samples||testing samples|
Tab 1 demonstrates that a learned DCRL model can generalize well to unseen data with high clustering accuracy. Taking MNIST-full as an example, DCRL was trained using 50,000 training samples and then tested on the remaining 20,000 testing samples using the learned model. In terms of the ACC and NMI metrics, our method achieves the best results on both training and test samples. More importantly, there is hardly any degradation in the performance of our method on the test samples compared to the training samples, while all other methods show a significant drop in performance, e.g., DEC from 84.1% to 74.8%. This demonstrates the importance of geometric structure preservation for good generalizability. The testing visualization in Appendix A.5 shows that DCRL still maintains clear inter-cluster boundaries even on the test samples, which further supports the generalizability of our method.
The visualization of DCRL and several comparison methods is shown in Fig 4 (visualized using UMAP). From the perspective of clustering, our method is much better than the other methods. Among all methods, only DEC, IDEC, and DCRL can hold clear boundaries between different clusters, while the cluster boundaries of the other methods are indistinguishable. Although DEC and IDEC can successfully separate different clusters, they group many data points from different classes into the same cluster. Most importantly, due to the use of the clustering-oriented loss, the embeddings learned by algorithms such as DEC, IDEC, JULE, and DSC (especially DSC) tend to form spheres and disrupt the original topological structure. In contrast, our method overcomes these problems and achieves almost perfect separation between different clusters while preserving the local and global structure. Additionally, the embedding of the latent space during the training process is visualized in Appendix A.6, and it is highly consistent with the three-stage explanation in Sec 3.3.2.
4.3 Evaluation of Representation Learning
Although numerous previous works claim to bring clustering and representation learning into a unified framework, they all, unfortunately, lack an analysis of the effectiveness of the learned representations. In this paper, we compare DCRL with the other five methods in six evaluation metrics on five datasets. (Limited by space, only MNIST-full results are provided in Tab LABEL:tab:3; the complete results are in Appendix A.7.) The results show that DCRL outperforms all other methods, especially in the CRA metric, which is not only the best on all datasets but also reaches 1.0. This means that the “rank” between different manifolds in the latent space is completely preserved, which demonstrates the effectiveness of our global ranking loss. Moreover, statistical analysis is performed to show the extent to which local and global structure is preserved in the latent space for each algorithm. Limited by space, the analysis is placed in Appendix A.8.
Recently, numerous deep clustering algorithms have claimed to obtain meaningful representations; however, they neither analyze nor experimentally verify what “meaningful” means. Therefore, we are interested in whether these methods can indeed learn representations that are useful for downstream tasks. Tab LABEL:tab:4 compares DCRL with the other six methods on five datasets (limited by space, only MNIST-full results are provided in the paper, and the complete results are in Appendix A.9). Four different classifiers, including a linear classifier (Logistic Regression; LR), two nonlinear classifiers (MLP, SVM), and a tree-based classifier (Random Forest Classifier; RFC), are used as downstream tasks, all of which use the default parameters and default implementations in sklearn (Pedregosa et al., 2011) for a fair comparison. The learned representations are frozen and used as input for training. The classification accuracy evaluated on the test set serves as a metric for the effectiveness of the learned representations. On the MNIST-full dataset, our method outperforms all the other methods. Moreover, we surprisingly find that with MLP and RFC as downstream tasks, all methods except DCRL fail to match the accuracy of AE. Notably, the performance of DEC on downstream tasks deteriorates sharply and even shows a large gap to the simplest AE, which once again shows that the clustering-oriented loss may damage the geometric structure of the data.
4.4 Ablation Study
This section evaluates the effects of the loss terms and training strategies in DCRL with five sets of experiments: the model without (A) Structure-oriented Loss (SL); (B) Clustering-oriented Loss (CL); (C) Weight Graduality (WG); (D) Alternating Training (AT), and (E) the full model. Limited by space, only MNIST-full results are provided in the paper, and results for the other four datasets are in Appendix A.10. After analyzing the results, we can conclude: (1) CL is the most important factor for obtaining good clustering; without it, clustering fails, so the corresponding numbers in the table are not very meaningful and are shown in gray. (2) SL not only brings subtle improvements in clustering but also greatly improves the performance of representation learning. (3) Our training strategies (WG and AT) both improve the performance of clustering and representation learning to some extent, especially on metrics such as RRE, Trust, Cont, and CRA.
The proposed DCRL framework imposes clustering-oriented and structure-oriented constraints to optimize the latent space for simultaneously performing clustering and representation learning with local and global structure preservation. Extensive experiments on image and text datasets demonstrate that DCRL is not only comparable to state-of-the-art deep clustering algorithms but also able to learn effective and robust representations, which is beyond the capability of clustering methods that only care about clustering accuracy. Future work will focus on adaptively determining the number of manifolds (clusters) and on extending our work to larger-scale datasets.
- Bishop (2006). Pattern recognition and machine learning. Springer.
- Guo et al. (2017). Improved deep embedded clustering with local structure preservation. In IJCAI, pp. 1753–1759.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- Lewis et al. (2004). RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research 5 (Apr), pp. 361–397.
- Li et al. (2020). Markov-Lipschitz deep learning. arXiv preprint arXiv:2006.08256.
- Maaten and Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
- MacQueen (1965). Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Math., Stat., and Prob, pp. 281.
- McConville et al. (2019). N2D: (not too) deep clustering via clustering the local manifold of an autoencoded embedding. arXiv preprint arXiv:1908.05968.
- McInnes et al. (2018). UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
- Pedregosa et al. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- Roweis and Saul (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326.
- Shaham et al. (2018). SpectralNet: spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587.
- Shi and Malik (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8), pp. 888–905.
- Tenenbaum et al. (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323.
- Tian et al. (2014). Learning deep representations for graph clustering. In AAAI, Vol. 14, pp. 1293–1299.
- Vincent et al. (2010). Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11 (12).
- Wold et al. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2 (1-3), pp. 37–52.
- Xiao et al. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
- Xie et al. (2016). Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pp. 478–487.
- Xu et al. (2003). Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 267–273.
- Yang et al. (2016a). Multi-manifold discriminant Isomap for visualization and classification. Pattern Recognition 55, pp. 215–230.
- Yang et al. (2016b). Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156.
- Yang et al. (2019). Deep spectral clustering using dual autoencoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4066–4075.
- Zhang et al. (2018). Semi-supervised local multi-manifold Isomap by linear embedding for feature extraction. Pattern Recognition 76, pp. 662–678.