Adversarial Deep Embedded Clustering:
on a better trade-off between
Feature Randomness and Feature Drift

Nairouz Mrabah
Department of Computer Science
University of Quebec at Montreal
Montreal, QC, Canada
mrabah.nairouz@courrier.uqam.ca
Mohamed Bouguessa
Department of Computer Science
University of Quebec at Montreal
Montreal, QC, Canada
bouguessa.mohamed@uqam.ca
Riadh Ksantini
Department of Computer Science
University of Windsor
Windsor, ON, Canada
ksantini@uwindsor.ca
Abstract

Clustering using deep autoencoders has been thoroughly investigated in recent years. Current approaches rely on simultaneously learning embedded features and clustering the data points in the latent space. Although numerous deep clustering approaches outperform shallow models on several high-semantic datasets, a critical weakness of such models has been overlooked. In the absence of concrete supervisory signals, the embedded clustering objective function may distort the latent space by learning from unreliable pseudo-labels. Thus, the network can learn non-representative features, which in turn undermines the discriminative ability and yields worse pseudo-labels. In order to alleviate the effect of random discriminative features, modern autoencoder-based clustering papers propose to use the reconstruction loss for pretraining and as a regularizer during the clustering phase. Nevertheless, the clustering-reconstruction trade-off can cause the Feature Drift phenomenon. In this paper, we propose ADEC (Adversarial Deep Embedded Clustering), a novel autoencoder-based clustering model, which addresses a dual problem, namely, Feature Randomness and Feature Drift, using adversarial training. We empirically demonstrate the suitability of our model for handling these problems on benchmark real datasets. Experimental results show that our model outperforms state-of-the-art autoencoder-based clustering methods.

Keywords: Unsupervised Learning, Deep Learning, Clustering, Autoencoders.

1 Introduction

The main focus of clustering is to partition the original data into clusters without using any supervisory signal. During the last decades, a plethora of clustering algorithms has been proposed to overcome three main challenges. The first challenge is the high dimensionality of existing real-world information. For example, a typical image has thousands of pixels. This characteristic makes the clustering task more difficult due to the well-known curse of dimensionality [58]. The large amount of data, or big data as popularized by the public community [36], constitutes the second challenge. Computationally, processing large-scale datasets is generally associated with time and memory overheads. Last but not least, the high-semantic aspect of natural data makes clustering a more challenging task. For example, two images of cats may look nothing like each other at the pixel level, although they belong to the same class. The high-semantic aspect of natural data can be explained by the compositional hierarchy of features [37]. In that respect, it is well known that high-level features are nothing but a combination of lower-level ones. To give an example in the case of visual data, edges are combined to construct motifs, which are the keystone for building objects. The same applies to speech and text datasets.

Classical clustering methods, such as k-means [42], Gaussian Mixture Models [4], DBSCAN [20], and Mean Shift [11], are shallow models. They rely on computing distance-based similarities in the raw data space or in the space where handcrafted features live. However, feature engineering is task-specific. Therefore, it is inappropriate to integrate such a pre-processing task in the pipeline of a general-purpose clustering framework. Moreover, natural data (e.g., images and videos) have high-dimensional and high-semantic aspects. When dealing with such datasets, conventional clustering approaches have limited performance, as the computational time increases considerably. In addition, distance-based metrics computed in the raw data space are unreliable for discovering semantic similarities.

To address the curse of dimensionality, the original high-dimensional data should be projected into a low-dimensional feature space. While abundant literature revolves around unsupervised dimensionality reduction, there are two main families. The first one consists of linear dimensionality reduction methods, such as Principal Component Analysis [48] and Factor Analysis [2]. The second family is based on the assumption that the most pertinent information lies on a low-dimensional manifold (not a linear subspace) [56]. Multi-dimensional scaling [12], Isometric Feature Mapping [56], and Hessian Eigenmapping [18] are among the popular manifold dimensionality reduction techniques. Although the linear and non-linear methods aim to preserve substantial information, they are prone to discriminative information loss, which in turn degrades the clustering performance.

Projected clustering [5] and subspace clustering [71] have gained popularity thanks to their ability to address the problem of high-dimensional data clustering, as they identify relevant dimensions that exhibit the cluster structure. Unlike pure dimensionality reduction techniques, these methods do not ignore the discriminative aspect. Yet, they are only effective when the data meets the linear subspace hypothesis, which is rarely the case for natural data.

In contrast, kernel k-means [13] and spectral clustering [45] map the data to non-linear manifolds. Nevertheless, the transformation capacity of these approaches is limited: they generally underfit the complexity and semanticity of real-world information. Added to that, their computational time usually grows considerably when processing large databases.

The recent advances in unsupervised representation learning [61, 3] based on deep neural networks gave birth to a new family of clustering strategies, known as deep clustering. Multi-layer architectures have become the natural choice when it comes to processing large, high-semantic, and high-dimensional datasets for several reasons. First, backpropagation and Stochastic Gradient Descent (SGD) allow the network weights to be updated cheaply, without the need to loop over the whole dataset at every single iteration. That is why a neural network is well suited for analyzing large data. Second, the compositional nature of the data [37] justifies the need to gradually extract higher-semantic representations from one layer to another using non-linear projections. Finally, the number of neurons in the hidden layers defines the dimensionality of the embedded spaces. Hence, a deep architecture reduces dimensionality if it is designed to have a low-dimensional latent space. In spite of the success of deep learning in many supervised applications, leveraging the power of neural networks for data clustering is still an open problem.

The most prominent deep clustering approaches rely on autoencoders [64, 22, 65, 59, 53]. Some other deep clustering strategies harness an encoding architecture [24, 66, 9, 8, 26, 27] without a decoding network. However, dispensing with the decoder and clustering the data in a latent space using only pseudo-labels can mislead the training process. This is because pseudo-labels are primarily generated based on hypothetical similarities, which generally underfit the semanticity of natural datasets. We call this problem Feature Randomness. The rationale for choosing auto-encoding as the standard deep clustering architecture can be attributed to the limited reliability of pseudo-supervision when used alone. Reconstruction rebuilds the input samples after encoding them in a low-dimensional latent space. The ability to reconstruct a point from a low-dimensional representation suggests that the latent space captures the key factors of variation and similarity. Otherwise, it would be impossible to regenerate the data samples.

Autoencoder-based clustering approaches generally consist of a joint optimization process in which the reconstruction cost is combined with a clustering-specific objective function. Referring to the previous point, retaining the reconstruction cost during the clustering phase helps in reducing Feature Randomness by blocking the encoder from generating random discriminative features. However, regularizing with reconstruction end-to-end can lead to Feature Drift. This problem emerges from the natural trade-off between clustering and reconstruction. Put differently, while latent clustering groups and separates the embedded samples by emphasizing within-cluster similarities and destroying between-cluster similarities, reconstruction is associated with preserving all factors of similarity.

Meanwhile, the Generative Adversarial Network (GAN) [21] has shown great promise in learning complex natural data distributions. It allows synthesizing out-of-sample data points. Besides, it has been shown that GANs can generate images with impressive visual quality [49, 6]. Apart from being a successful generative model, the adversarial training strategy has inspired several modern achievements in unsupervised representation learning [3, 19, 17, 10]. Although GAN does not come with an encoder out of the box, some recent papers have suggested extending the classical GAN framework to permit data encoding in a latent space, where the semantic factors of variation and similarity are better emphasized [19, 17, 43, 57]. Nevertheless, it is still unclear to what extent the features learned by a deep generative model can be useful for downstream discriminative tasks (e.g., classification and clustering).

To address the aforementioned problems, we propose Adversarial Deep Embedded Clustering (ADEC). Our framework eliminates the strong competition between embedded clustering and reconstruction without incurring a Feature Randomness cost. This is done by moving the strong competition outside of a single network, while relying on a discriminator to make sure that the embedded representations preserve the intrinsic data characteristics. Optimizing every objective function in a separate network, based on adversarial training, allows reaching a better trade-off between Feature Randomness and Feature Drift. Experimentation on real benchmark datasets shows the superiority of our framework in terms of accuracy (ACC) and normalized mutual information (NMI). In a nutshell, the key contributions of this paper are:

  • Devising a new pretraining framework based on adversarially constrained interpolation and data transformation.

  • Overcoming the clustering-reconstruction compromise based on an adversarial training strategy.

  • Enhancing state-of-the-art autoencoder-based clustering by alleviating Feature Randomness and Feature Drift.

  • Outperforming modern clustering models in terms of ACC and NMI on real benchmark datasets.

The rest of this paper is organized as follows: Section 2 is devoted to related work. Section 3 presents an analysis of the trade-off between Feature Randomness and Feature Drift. In Section 4, we present our methodology for tackling the Feature Randomness and Feature Drift problems. In Section 5, we show our experimental results. Finally, Section 6 concludes the paper.

2 Related work

To demonstrate the merit of our proposed framework, we provide a critical review of the mainstream approaches related to our work. ADEC comes under the realm of deep clustering strategies. Furthermore, it is part of the concerted effort to combine GANs and autoencoders. To this end, we review the existing deep clustering approaches and the techniques unifying GANs and autoencoders. We also review DEC [64] and IDEC [22], since they constitute our principal baselines.


Figure 1: The different deep clustering categories. The self-supervised loss enforces reasonable general-purpose features and the pseudo-supervised loss is used for clustering the embedded data.

2.1 Deep Clustering

In deep unsupervised learning, there are typically two conceivable options to make up for the absence of supervisory signals. The first option consists of contriving a pretext task that encourages learning general-purpose features. It is better known as self-supervision. In this case, labels are extracted from the input data. The intuition behind self-supervision is that the pretext task cannot be solved efficiently without gaining a semantically high-level grasp of the input data. The obtained features can be used for downstream tasks, such as classification. There exists a wide variety of pretext tasks, among them the vanilla reconstruction, the denoising loss [60], the variational loss [34], the adversarial loss [21], predicting the location of image patches [16], predicting the permutations of a "jigsaw puzzle" [46], predicting missing image patches [47], and predicting image colorization [70]. The second option consists of contriving a pseudo-supervisory signal. Similar to self-supervision, the labeling is available within the data. However, for pseudo-supervision, labels are predicted. Therefore, they are not 100% correct. In this paper, $L_s$ denotes a self-supervised loss and $L_p$ denotes a pseudo-supervised loss.

A possible categorization of deep clustering methods is based on the loss functions used and the way they are combined. According to this criterion, the existing models fall into three main categories. In Figure 1, the framework of each category is illustrated. In the present context, the pseudo-supervised loss stands for the embedded clustering objective function, which can be any one of the typical clustering objective functions, such as the Gaussian Mixture Model (GMM) or k-means. As regards the self-supervised cost, reconstruction is commonly selected.

For the first category, the clustering is directly performed using a pseudo-supervised loss. However, the self-acquired labels are unreliable due to their hypothetical aspect. This can mislead the data grouping by learning non-representative features, which in turn deteriorates the discriminative ability of the model. Concisely, the main weakness of the methods affiliated with this category is Feature Randomness. As part of this category, Yang et al. proposed JULE [66], a deep recurrent framework that performs agglomerative clustering and feature learning together with a unified triplet loss. The whole process is optimized end-to-end. However, one of the prominent downsides of JULE is the run-time overhead due to the recurrent framework. Chang et al. proposed DAC [9], a framework that clusters the data based on pairwise constraints. DAC has a curriculum learning strategy, where only high-confidence training samples are selected. In another line of research, Hu et al. proposed IMSAT [27], which consists of maximizing the mutual information between discrete predictions and their associated data points. The loss function of IMSAT is regularized by a self-augmented training term that penalizes the discrepancy between the initial data points and their geometrically transformed versions. DeepCluster [8] is another framework intimately tied to this category. It alternates between two basic steps. First, the latent representations are clustered by k-means. Then, the obtained clustering assignments are fed to a convolutional neural network as supervisory signals to learn better features. It was mainly applied to large-scale datasets.

As for the second category, the network is pretrained using a self-supervised cost function. Then, the obtained latent features are finetuned by retraining with pseudo-labels. Compared with random initialization, pretraining with a self-supervised loss leads to improved initial features. Nevertheless, there is no correction mechanism to attenuate the harm of noisy labels. Hence, Feature Randomness remains a strong concern for this category. As part of this category, DEC [64] is the first deep clustering framework to follow a pretraining-finetuning strategy.

The third category consists of pretraining with a self-supervised cost function, similar to the second category. However, the finetuning phases are different. In fact, the third category regularizes the pseudo-supervised objective function with a self-supervised one. The advantage of such a strategy is that it offers a mechanism to reduce Feature Randomness. However, combining pseudo-supervision and self-supervision can lead to a strong competition between them. In order to balance the two cost functions, a hyperparameter is required. To give an example, Guo et al. proposed IDEC [22] and Dizaji et al. proposed DEPICT [15]. Both models can be considered extensions of DEC. They regularize the clustering loss with reconstruction during the finetuning phase. Therefore, the decoder is maintained throughout the whole training process. The main difference between them is that DEPICT utilizes a convolutional architecture, whereas IDEC leverages a fully-connected autoencoder. Apart from that, Yang et al. proposed DCN [65]. Compared with IDEC and DEPICT, DCN optimizes a latent k-means objective function. VaDE [32] is another model from this category. Its variational auto-encoding architecture makes it possible to impose a GMM latent distribution. Thanks to the reparameterization trick, VaDE can be optimized using backpropagation. In addition to clustering, VaDE can generate data samples. Even so, it suffers from elevated computational complexity, similar to all other variational frameworks.

2.2 Deep Embedding Clustering (DEC)

DEC [64] has two phases. The pretraining phase learns low-dimensional embedded representations by training the autoencoder with vanilla reconstruction. Then comes the clustering phase. First, the decoder is discarded. After that, the encoder is trained to jointly optimize the embedded representations and the clustering centers. For every training iteration, a soft clustering assignment $q_{ij}$ is computed based on the Student's t-distribution, as shown in (1). $q_{ij}$ represents an assessment of the similarity between the embedded data point $z_i$ and the center $\mu_j$.

$q_{ij} = \dfrac{\left(1 + \| z_i - \mu_j \|^2\right)^{-1}}{\sum_{j'} \left(1 + \| z_i - \mu_{j'} \|^2\right)^{-1}} \qquad (1)$

The DEC loss function (2) is the Kullback-Leibler divergence between the soft clustering assignment distribution $Q$ and an auxiliary target distribution $P$ defined in (3).

$L_c = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \dfrac{p_{ij}}{q_{ij}} \qquad (2)$
$p_{ij} = \dfrac{q_{ij}^2 / \sum_i q_{ij}}{\sum_{j'} \left( q_{ij'}^2 / \sum_i q_{ij'} \right)} \qquad (3)$
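
To make (1)-(3) concrete, the following NumPy sketch computes the soft assignments, the target distribution, and the KL objective; the array shapes and helper names are ours, not taken from the original DEC implementation.

```python
import numpy as np

def soft_assignments(Z, centers):
    # Student's t similarity q_ij between embedded point z_i and center mu_j, eq. (1)
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # shape (n, K)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(Q):
    # sharpened assignments normalized by the soft cluster frequencies sum_i q_ij, eq. (3)
    w = Q ** 2 / Q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def dec_loss(P, Q):
    # KL(P || Q), eq. (2)
    return float(np.sum(P * np.log(P / Q)))

# toy usage: 5 embedded points in R^10, 3 centers
Z = np.random.randn(5, 10)
centers = np.random.randn(3, 10)
Q = soft_assignments(Z, centers)
P = target_distribution(Q)
print(dec_loss(P, Q))
```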

2.3 Improved Deep Embedded Clustering (IDEC)

IDEC [22] has the same pretraining phase as DEC. The main difference between them is that IDEC is finetuned to minimize a joint embedded clustering and reconstruction objective, as described by (4).

$L = L_r + \gamma \, L_c \qquad (4)$

$L_r$ stands for the reconstruction loss and $\gamma$ is in charge of balancing the two costs. The key idea of IDEC is to prevent the clustering loss from corrupting the feature space. However, we argue that combining embedded clustering and vanilla reconstruction gives birth to a strong competition between them (i.e., Feature Drift).
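
For illustration, the joint objective of (4) can be written as a two-term sum; the default value of the balancing coefficient below is a placeholder, not necessarily the one used by IDEC.

```python
import numpy as np

def idec_loss(X, X_rec, P, Q, gamma=0.1):
    # reconstruction term plus the gamma-weighted embedded clustering term of eq. (2)
    reconstruction = float(np.mean(np.sum((X - X_rec) ** 2, axis=1)))
    clustering = float(np.sum(P * np.log(P / Q)))
    return reconstruction + gamma * clustering
```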

2.4 Combining Autoencoders with GANs

Interpolating data samples from the prior distribution, in the latent space of the generator, leads to realistic and semantically explainable variations [49, 6]. As a consequence of the effectiveness of GANs in capturing the semantic factors of variation, many researchers have studied the inverse mapping problem (i.e., projecting the data back into the embedded space) [17, 43, 19, 57]. When coupled with the generator, an encoder can potentially learn to produce high-semantic latent features from the initial data distribution. This can bring an important advancement in solving inverse problems (e.g., image inpainting and noise removal) and downstream discrimination tasks. Two of the most seminal contributions on combining the power of GANs with autoencoders are BiGAN [17] and AAE [43].

Although we use the same architectural components (i.e., encoder, decoder, and discriminator), our framework differs from the previously mentioned ones in several glaring aspects. First, our discriminator operates in the data space. In contrast, the critic of AAE processes samples from the latent space, while the BiGAN framework concatenates a sample from the data space with its projection in the embedded space before feeding it to the discriminator. Second, in BiGAN, the encoder and decoder cannot directly communicate with each other. Therefore, the objective function of this approach does not have an explicit reconstruction cost. However, AAE and our model explicitly minimize the cycle cost. Unlike AAE, where both the encoder and decoder are trained to perform reconstruction, our encoder weights are frozen while optimizing with respect to the cycle loss, in order to avoid drifting the features learned using the clustering loss. Moreover, in BiGAN and AAE, the encoder and decoder are trained jointly in competition with the discriminator network. However, in our framework, each network is trained separately. Furthermore, AAE and BiGAN are standard generative models, so, in order to allow sampling, they enforce the aggregated posterior to match an arbitrary prior. In our case, we do not impose any hypothetical prior distribution, and the adversarial training strategy is introduced to tackle problems related to embedded clustering, that is, Feature Randomness and Feature Drift.

3 Trade-Off Between Feature Randomness and Feature Drift

In this section, we propose a mathematical formalism to characterize Feature Randomness and Feature Drift. We explain the identified problems and we shed light on the trade-off between them.

3.1 Feature Randomness

For a pseudo-supervised loss, the labels used are predicted based on a presumptive similarity metric. It then follows that a portion of the pseudo-labels does not match the real ones. Zhang et al. [69] showed empirically that standard deep neural networks can easily and perfectly fit completely random labels without any considerable time overhead, using the same hyperparameters and architecture as used for training with correct labels. This result suggests that a neural network has sufficient capacity to memorize the whole dataset even when there is little or no correlation between the training samples and their corresponding labels.

We call Feature Randomness the training of a neural network using pseudo-labels. Put differently, Feature Randomness takes place when a significant portion of the true labels is substituted by random ones. At every iteration of a neural network optimization process, Feature Randomness can be characterized by the quantity $C_r$ in (5): the cosine of the angle between the gradient of the pseudo-supervised loss $L_p$ and the gradient of the real supervised loss $L_{sup}$, both taken w.r.t. the network parameters $\theta$. Here, $Y$ denotes the true labels (100% correct) and $\tilde{Y}$ denotes the pseudo-labels (partially correct).

$C_r = \dfrac{\left\langle \nabla_{\theta} L_p(\tilde{Y}) ,\; \nabla_{\theta} L_{sup}(Y) \right\rangle}{\left\| \nabla_{\theta} L_p(\tilde{Y}) \right\| \, \left\| \nabla_{\theta} L_{sup}(Y) \right\|} \qquad (5)$

Training with pseudo-labels deteriorates the generalization capacity of a neural network. In fact, it enforces learning features that emphasize similarities between uncorrelated data points. In order to reduce the harm of Feature Randomness, a possible solution consists of adjusting the gradient of the pseudo-supervised loss with another vector. The gradient of the self-supervised loss $L_s$ is a candidate to be that vector for several reasons. First, it is well known that minimizing $L_s$ generates reasonable general-purpose features. Second, the self-acquired labels for $L_s$ are 100% correct. Hence, minimizing $L_s$ does not contribute to Feature Randomness. Besides, self-supervision can be used to integrate relevant prior knowledge. In Figure 2, we illustrate the role of the self-supervised objective function in adjusting the gradient of the pseudo-supervised one; the pretext labels it relies on are extracted directly from the input data.


Figure 2: Adjusting the gradient of the pseudo-supervised loss by the gradient of a self-supervised loss to better approximate the gradient of the true supervised loss.
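
The quantity in (5) is simply the cosine similarity between two flattened gradient vectors, which can be computed as in the following sketch (the function name is ours):

```python
import numpy as np

def gradient_cosine(grad_a, grad_b, eps=1e-12):
    # cosine of the angle between two gradients flattened into vectors;
    # values close to 1 mean the first update direction approximates the second
    a, b = np.ravel(grad_a), np.ravel(grad_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```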

3.2 Feature Drift

In the context of multi-objective optimization, two objective functions are said to be conflicting if improving one of them degrades the other. In such a case, the optimum should be computed while taking into consideration the trade-offs between the competing objective functions. A solution is called non-dominated when no objective can be improved without degrading another one; in general, multiple non-dominated optima coexist. These optima are considered equally valid in the absence of extra subjective information.

In the case of deep learning, we call Feature Drift the optimization of a neural network's loss function whose secondary component (regularizer) strongly competes with the main one. This phenomenon can lead to a failure of the global learning process: the features learned based on the main cost function can be easily drifted by updating with respect to the secondary loss. To better understand this problem, Figure 3 shows a simplistic illustration of Feature Drift. In Figure 3.a, a linear combination of two strongly competing force vectors is pulling the ball. In Figure 3.b, the ball is pulled by another couple of forces, which are less conflicting. In both cases, the pulling forces are adjusted using a positive balancing coefficient, so the resultant vector is the main force plus the secondary force scaled by this coefficient. In this example, we consider that the pulling forces are constant and the balancing coefficient is variable. After applying the resultant vector, the object is supposed to reach a new position. In both figures, the colored area (delimited by the competing vectors) represents the field of possible positions after applying the resultant vector, and the target solution lies within the green field. It is reasonable to predict that the smaller the area of this field, the easier it is to reach the target solution. For two strongly competing vectors, we observe that varying the balancing coefficient dramatically affects the reached position. Therefore, the choice of the coefficient is quite critical in this case. For the second case, where the competition is less ardent, varying the coefficient has a less important impact on the reached position. Hence, the choice of the coefficient is less crucial than in the first case. We conclude that the importance of a balancing coefficient depends on the level of competition. Most importantly, we observe that the level of competition can be assessed by the cosine of the angle between the two competing vectors. Hence, at every training iteration, we can characterize Feature Drift as follows:

$C_d = \dfrac{\left\langle \nabla_{\theta} L_p ,\; \nabla_{\theta} L_s \right\rangle}{\left\| \nabla_{\theta} L_p \right\| \, \left\| \nabla_{\theta} L_s \right\|} \qquad (6)$

where $\nabla_{\theta} L_p$ and $\nabla_{\theta} L_s$ are the gradient of the pseudo-supervised loss and the gradient of the self-supervised loss, respectively. When the strongly contending vectors are not balanced meticulously, the desired solution will not be reached, even after multiple iterations. Besides, it is of great importance to make unsupervised learning methods less reliant on unpredictable and dataset-specific parameters. The example presented in Figure 3 was selected for its simplicity, since it is difficult to visualize the gradient vectors in a high-dimensional space.

(a) Strong competition between the two pulling forces.
(b) Weak competition between the two pulling forces.
Figure 3: Comparing the impact of strong and weak competition in reaching a target position.
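
The sensitivity argument illustrated in Figure 3 can be reproduced numerically: as the balancing coefficient varies, the direction of the resultant swings far more when the two forces compete strongly. The force values below are arbitrary and only serve as a toy illustration.

```python
import numpy as np

def resultant_angle(f_main, f_secondary, coeff):
    # direction (in degrees) of the balanced resultant f_main + coeff * f_secondary
    r = f_main + coeff * f_secondary
    return float(np.degrees(np.arctan2(r[1], r[0])))

f_main = np.array([1.0, 0.0])
strong = np.array([-0.9, 0.3])   # almost opposite to f_main: strong competition
weak = np.array([0.3, 0.9])      # almost orthogonal to f_main: weak competition
for coeff in (0.1, 0.5, 1.0):
    print(coeff, resultant_angle(f_main, strong, coeff), resultant_angle(f_main, weak, coeff))
# the resultant direction sweeps roughly from 2 to 72 degrees in the strong case,
# but only from about 5 to 35 degrees in the weak case
```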

Several modern deep clustering models [22, 65, 15, 32] jointly perform reconstruction and embedded clustering. In this work, we show empirically that combining them leads to a very strong competition between their gradients, and we provide a mathematical proof of this hypothesis under specific conditions. To this end, we consider the problem of clustering a dataset of points $x_i$, and we let $X$ be the matrix whose row vectors are the $x_i$. We presume that the number of clusters is known. Our operators include a linear encoder, which maps samples from the data space to the latent space, and a linear decoder performing the inverse mapping; two matrices hold the learnable parameters of the encoder and decoder, respectively. We define $z_i$ as the projection of the data point $x_i$ into the latent space and $\hat{x}_i$ as the reconstructed representation of $x_i$. Furthermore, we constrain the encoding matrix to the set of semi-orthogonal matrices, so that its product with its own transpose is the identity matrix. Each cluster $j$ is associated with a centroid $\mu_j$ in the embedded space, computed as the mean of the embedded points assigned to that cluster, and the center of all the embedded points is denoted by $\bar{z}$. We define the reconstruction loss as the squared error between the data points and their reconstructions, and the k-means loss as the sum of squared distances between the embedded points and their associated centroids in the latent space.

For two clusters $j$ and $j'$, let the average distance between them be computed over all pairs of embedded points drawn from the two clusters. If $j$ is equal to $j'$, this distance is called the within-cluster distance; otherwise, it is called the between-cluster distance.

Theorem 1.

Under the specific conditions described above, the loss function can be expressed as follows:

(7)

The proof of Theorem 1 is provided in Appendix A. This theorem shows the implicit competition between a typical clustering loss (k-means) and the reconstruction one. Intuitively, minimizing the clustering loss has two objectives. First, it emphasizes the similarities between data points within the same cluster. Second, it stresses the variations between data points from different clusters. However, minimizing the reconstruction loss aims to preserve all the similarities and variances between every couple of data points, whether or not they belong to the same cluster. Using Theorem 1, we can notice that minimizing the first term of (7) leads to the maximization of between-cluster distances, which forces the clusters to be separable. Added to that, minimizing the second and third terms of (7) minimizes within-cluster variances, which pushes the clusters to be as compact as possible. However, minimizing the reconstruction loss implies minimizing both between-cluster distances and within-cluster variances.

In the general case, increasing the balancing coefficient significantly causes the self-supervised loss to easily win the competition. Thus, any discriminative feature learned in the direction of the pseudo-supervised loss's gradient can be easily drifted by the gradient of the self-supervised loss. On the other hand, decreasing this coefficient significantly leads to Feature Randomness.

4 Adversarial Deep Embedded Clustering

In this section, we describe our proposed framework, ADEC. Our model is designed to address Feature Randomness and Feature Drift. To this end, we consider the problem of clustering a dataset into a known number of clusters. Each cluster $j$ is associated with a centroid $\mu_j$ in the embedded space. Our operators include a deep non-linear encoder, which maps samples from the data space to the latent space, and a deep non-linear decoder performing the inverse mapping; the encoder and decoder have their own sets of learnable parameters. We define the vector $z_i$ to be the projection of a data point $x_i$ into the latent space and $\hat{x}_i$ to be the reconstructed representation of $x_i$.

4.1 Pretraining phase

Following state-of-the-art autoencoder-based clustering approaches, we pretrain the encoder and decoder. In the context of deep clustering, pseudo-labels are the cluster representatives (i.e., the embedded centers). It is important to start the clustering phase with latent features that reflect the data distribution; otherwise, it would be impossible to extract meaningful pseudo-labels. Training a neural network using embedded centers extracted from completely random latent representations leads to excessive Feature Randomness (due to the large number of unreliable pseudo-labels). It is well known that self-supervision makes it possible to learn reliable general-purpose features by solving a pretext task. Therefore, the pretraining phase should consist of minimizing a self-supervised loss.

Previously proposed algorithms, such as DEC and IDEC, rely on a stacked denoising autoencoding strategy [62] for initializing the training weights. In our case, we opted for pretraining the autoencoder using a vanilla reconstruction loss regularized by an adversarially constrained interpolation [3] and data augmentation (e.g., slight random rotation and translation of the input samples) [23]. These techniques are backed up by results showing an important enhancement in learning unsupervised representations for downstream tasks [3, 23]. When pretraining the model, a real number $\alpha$ is randomly sampled to compute $\hat{x}_\alpha$, the reconstruction of a point interpolated, with coefficient $\alpha$, from the latent representations of two samples $x_1$ and $x_2$. The framework of ACAI [3] simulates a game between two adversarial networks. While the critic $d_\omega$, a neural network parameterized by $\omega$, is trained to regress the interpolation coefficient $\alpha$ in (9), the autoencoder aims to fool the critic into considering the generated interpolants as real samples (i.e., making the critic output zero) in (8). The second term in (9) allows the critic to identify non-interpolated inputs. The coefficient $\gamma$ in (9) is randomly selected at every iteration, and $\lambda$ in (8) is responsible for balancing the reconstruction and the regularization. For the sake of simplification, we assume that the data samples have already undergone the random transformations (rotation and translation). The full framework of our pretraining phase is illustrated in Figure 4. To the best of our knowledge, we are the first to propose such a pretraining strategy in the context of deep clustering.

$L_{ae} = \left\| x - \hat{x} \right\|^2 + \lambda \left\| d_{\omega}(\hat{x}_{\alpha}) \right\|^2 \qquad (8)$
$L_{critic} = \left\| d_{\omega}(\hat{x}_{\alpha}) - \alpha \right\|^2 + \left\| d_{\omega}\!\left( \gamma x + (1 - \gamma)\, \hat{x} \right) \right\|^2 \qquad (9)$

Figure 4: The pretraining phase of ADEC.
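
The sketch below outlines one ACAI-style pretraining step in the spirit of (8) and (9); enc, dec, and critic are assumed callables standing for the encoder, decoder, and critic networks, and the values of lam and gamma are placeholders rather than the settings used in our experiments.

```python
import numpy as np

def acai_step(x1, x2, enc, dec, critic, lam=0.5, gamma=0.2):
    # one pretraining step in the spirit of (8) and (9); enc/dec/critic are assumed callables
    alpha = np.random.uniform(0.0, 0.5)                 # sampled interpolation coefficient
    z1, z2 = enc(x1), enc(x2)
    x1_rec = dec(z1)                                    # plain reconstruction of x1
    x_alpha = dec(alpha * z1 + (1.0 - alpha) * z2)      # decoded latent interpolant
    # autoencoder objective: reconstruct x1 and push the critic's output on interpolants to 0
    loss_ae = np.mean((x1 - x1_rec) ** 2) + lam * np.mean(critic(x_alpha) ** 2)
    # critic objective: regress the true alpha on interpolants, output 0 on mixed real inputs
    x_mix = gamma * x1 + (1.0 - gamma) * x1_rec
    loss_critic = np.mean((critic(x_alpha) - alpha) ** 2) + np.mean(critic(x_mix) ** 2)
    return loss_ae, loss_critic
```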

4.2 Clustering phase

For this phase, on top of the pretrained encoder and decoder, we need one additional network. More precisely, we introduce a discriminator, parameterized by its own set of weights. Similar to the standard GAN framework, the discriminator distinguishes real data samples from fake ones. Based on our experimental results, the features learned by minimizing the embedded clustering objective function can be easily drifted while minimizing the reconstruction loss. To inhibit this implicit strong competition from taking place, our strategy transforms the within-network competition into a between-networks one. Therefore, each network is trained independently from the others to avoid the drifting effect.

Training the encoder is the main step in our framework. The clustering loss in (10) is inspired by DEC [64]. It refines the clusters by gradually stressing high-confidence assignments. It is worth noting that our methodology can be applied with a different embedded clustering loss; the choice of the DEC cost can be explained by its simplicity and popularity in the deep clustering community. Unlike DEC, we add a regularization term (the second part of (10)). Our regularization penalizes embedded features that cannot be decoded into realistic data points. This constraint is verified by the discriminator, leading to the rejection of discriminative features that corrupt the clustering space. Hence, we argue that minimizing (10) reduces Feature Randomness.

(10)

As shown by equation (10), our model does not require a balancing hyperparameter. Unlike IDEC, where the balancing hyperparameter is critical and hard to tune due to the strong trade-off, in our case the clustering and regularization terms do not reflect any explicit competition. We provide an experimental study on hyperparameter tuning to validate this hypothesis.
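
As a hedged sketch of the encoder update in (10): the first term is the DEC-style KL divergence and the second term asks the discriminator to judge the decoded embeddings as realistic. The exact functional form of the adversarial term, the numerical epsilon, and the model handles (encoder, decoder, discriminator) are our assumptions, not the paper's exact implementation.

```python
import tensorflow as tf

def soft_assign(z, centers):
    # Student's t similarities between embeddings and cluster centers, as in eq. (1)
    d2 = tf.reduce_sum(tf.square(z[:, None, :] - centers[None, :, :]), axis=-1)
    q = 1.0 / (1.0 + d2)
    return q / tf.reduce_sum(q, axis=1, keepdims=True)

def encoder_objective(x, p, centers, encoder, decoder, discriminator):
    # KL clustering term plus an adversarial term that penalizes embeddings whose
    # decodings the discriminator considers unrealistic (a sketch of eq. (10))
    z = encoder(x)
    q = soft_assign(z, centers)
    kl = tf.reduce_sum(p * tf.math.log(p / q))
    realism = tf.reduce_mean(tf.math.log(discriminator(decoder(z)) + 1e-8))
    return kl - realism    # minimized w.r.t. the encoder weights and the centers only
```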

Unlike DEC, where the decoder is discarded straight away, this network plays a pivotal role in our case. It can be seen as a monitor: it allows investigating the variations of the embedded representations induced by training the encoder. Hence, we argue that the decoder should be trained as well to catch up with the encoder updates. However, training the decoder as in IDEC would drift the discriminative features learned by the encoder. We propose to restrain the backpropagation of the reconstruction loss to the decoder layers, as shown by equation (11). We argue that such a strategy helps in reducing Feature Drift.

(11)
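
One possible way to realize (11) in TensorFlow is to detach the embedding before reconstructing, so that only the decoder receives gradients; this is our reading of the described strategy rather than the authors' code.

```python
import tensorflow as tf

def decoder_objective(x, encoder, decoder):
    # reconstruction restricted to the decoder: the embedding is detached, so minimizing
    # this loss updates the decoder without drifting the encoder's discriminative features
    z = tf.stop_gradient(encoder(x))
    x_rec = decoder(z)
    return tf.reduce_mean(tf.reduce_sum(tf.square(x - x_rec), axis=1))
```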

Unlike DEC and IDEC, we introduce a discriminator as an additional architectural component. As exhibited by equation (12), the discriminator is supposed to differentiate real data points from randomly generated ones. Similar to the decoder, the discriminator should be sufficiently trained before updating the encoder weights at every clustering iteration. An illustration of the clustering phase is given in Figure 5.

(12)
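
The discriminator update of (12) can be sketched as a standard binary real/fake objective; the way the fake samples are produced here (by decoding random latent codes) is an assumption on our part.

```python
import tensorflow as tf

def discriminator_objective(x_real, z_random, decoder, discriminator):
    # score real data points high and decoded random codes low (negated for minimization)
    p_real = discriminator(x_real)
    p_fake = discriminator(decoder(z_random))
    return -tf.reduce_mean(tf.math.log(p_real + 1e-8) + tf.math.log(1.0 - p_fake + 1e-8))
```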

Figure 5: The clustering phase of ADEC.

At the end of the training process, we observe that the output images are smoother. Added to that, they no longer represent a pure reconstruction, even if the decoder is trained for a large number of iterations. This suggests that the encoder has learned to destroy non-discriminative information. Another interesting observation is that the decoder maps images from the same class to the same blurry output image, which suggests that the encoder has learned to collapse within-class variances. These characteristics of our model contrast with the pure reconstruction of IDEC, as illustrated by Figure 6.


Figure 6: First row: MNIST input images; Second row: Output images from IDEC; Third row: Output images from ADEC.

4.3 Optimization

ADEC has five kinds of learnable parameters: the encoder weights, the decoder weights, the discriminator weights, the critic weights, and the embedded centroids. All of them are updated using mini-batch SGD and backpropagation. The losses used in practice are stochastic mini-batch approximations of (10), (11), and (12). The gradients are computed following Theorem 2 and Theorem 3; refer to Appendices B and C for the proofs.

Theorem 2.

The gradient of the loss function (10) w.r.t. the encoded data points is calculated by (13); the expression involves the transposed Jacobian matrix of the decoder and the gradient of the discriminator-based regularization term.

(13)
Theorem 3.

The gradient of (10) w.r.t. the cluster center $\mu_j$ is computed following (14).

(14)

For the clustering phase, we run the optimization for a maximum number of batch iterations, or until the variation of the clustering assignments between two consecutive clustering iterations falls below a convergence threshold. We found empirically that the decoder needs to be trained for a greater number of iterations than the other networks, otherwise it causes instability. Thus, we alternate between training the {Decoder, Encoder, Discriminator} for a fixed number of iterations and training the {Decoder} alone, also for the same number of auxiliary iterations. The target distribution $P$, which is computed from the predicted clustering assignment distribution $Q$, constitutes the support for computing the pseudo-labels. $P$ is updated at a fixed interval based on equations (1) and (3). In practice, we refrain from modifying $P$ at every single step to avoid instability. The predicted label for a data point is calculated based on the following equation:

$\hat{y}_i = \operatorname{arg\,max}_{j} \; q_{ij} \qquad (15)$

Our proposed algorithm is summarized in Algorithm 1.

Input: Input data $X$, number of clusters, learning rate, convergence threshold, maximum number of iterations, number of auxiliary iterations, distribution update interval.
Output: Encoder weights, decoder weights, discriminator weights, embedded centroids.

  Pretrain the autoencoder by minimizing (8) and (9).
  Pretrain the discriminator by maximizing (12).
  Initialize the embedded centroids using k-means.
  for each iteration, up to the maximum number of iterations, do
     if the distribution update interval is reached then
        Update $Q$ and $P$ using (1) and (3).
        Save the last predicted labels.
        Compute the new predicted labels using (15).
        if the fraction of changed labels is below the convergence threshold then
           End training.
        end if
     end if
     Sample a random mini-batch from $X$.
     if the current iteration belongs to an auxiliary phase then
        Update the decoder weights by minimizing (11).
     else
        Compute the stochastic gradients of (10), (11), and (12) on the mini-batch.
        Update the encoder weights and the embedded centroids w.r.t. (10).
        Update the decoder weights w.r.t. (11).
        Update the discriminator weights w.r.t. (12).
     end if
  end for
Algorithm 1 Minibatch stochastic gradient descent training of Adversarial Deep Embedded Clustering.

5 Experiments

An extensive experimental protocol is conducted to validate the suitability of ADEC in tackling Feature Randomness and Feature Drift. In order to perform this, we need to specify the scope of our experiments and the required experimental configurations.

5.1 Scope of experiments

Deep clustering models differ from each other in five substantial aspects, each of which has been shown to have a significant impact on the clustering quality. The first factor is the architecture used. Some studies [8, 31] rely on sophisticated architectures (e.g., ResNet32, AlexNet, and VGG) to cluster very large datasets. Other studies [64, 22] leverage fairly sized architectures to cluster sizeable datasets. Likewise, in this paper, we opted for the same architecture used by [64, 22, 32, 65]. The second factor is the integrated prior knowledge (e.g., invariance of image labels to small linear transformations and symmetries). For instance, previous works [27, 23] have shown that data augmentation based on prior knowledge leads to better clustering results. Inspired by these studies, we apply a similar data augmentation technique for image datasets. The third factor of quality is the learning dynamics. This was the focus of the following studies: (1) deep over-clustering [31], which offers two sub-heads, one for grouping the data into more clusters than required and the second for clustering according to the ground-truth number of clusters; (2) deep adaptive clustering [9], which clusters the easy samples first and then gradually supplies the learning model with more difficult ones; and (3) clustering with a dynamic loss function [44], which gradually changes the cost function according to the clustered samples. Unlike these papers, ADEC does not rely on any specific learning dynamics. Finally, the fourth and fifth factors consist in choosing the self-supervised and pseudo-supervised losses, respectively. Jabi et al. [29] proved that, under mild conditions, several pseudo-supervised objective functions are equivalent to each other.

All the previous deep clustering studies revolve around the five aforementioned axes. The modification of any one of these factors can improve or worsen the effectiveness and efficiency of the studied deep clustering model. The following experimental protocol aims to show that the trade-off between Feature Randomness and Feature Drift, which was neglected by previous studies, is influential in designing deep clustering models. For this reason, our experiments include a comparison where all the other factors of quality are kept identical between ADEC and its baselines.

5.2 Experimental Settings

All experiments are conducted on a server with 4 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz processors, 32 GB of RAM, and an NVIDIA TESLA K80 GPU.

5.2.1 Datasets

We evaluate our approach on six benchmark datasets:

  • MNIST-full [38]: a 10-class database of 70,000 grayscale handwritten digit images of size 28 × 28 each.

  • MNIST-test: a subset of 10,000 test images of the MNIST-full dataset.

  • USPS [28]: a 10-class database of 9,298 grayscale digit images of size 16 × 16 each.

  • Fashion-MNIST [63]: a 10-class database of 70,000 grayscale images of fashion products, of size 28 × 28 each.

  • REUTERS-10K [39]: a 4-class database (corporate/industrial, government/social, markets, and economics) of 10,000 articles. The most frequent words in all articles are selected. Then, for each article, we compute the TF-IDF features using the selected dictionary.

  • Mice Protein [25]: an 8-class database of 1,080 mice samples. The features of this database consist of the expression levels of 77 proteins.

All datasets are normalized before being fed to the clustering models, so that the norm of each data point is approximately equal to 1. For fully-connected models, we flatten the input data if it has more than one dimension.
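
A minimal sketch of this preprocessing, assuming a single global scaling factor is used so that the average squared norm of a data point is close to 1; the exact scaling constant may differ in our implementation.

```python
import numpy as np

def preprocess(X):
    # flatten multi-dimensional inputs for the fully-connected models, then rescale the
    # whole dataset so that the mean squared norm of a data point is approximately 1
    X = X.reshape(len(X), -1).astype(np.float32)
    return X / np.sqrt(np.mean(np.sum(X ** 2, axis=1)))
```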

5.2.2 Baselines

In order to show the effectiveness of the proposed model, ADEC is compared against classical clustering algorithms, subspace clustering algorithms, manifold clustering algorithms, and state-of-the-art deep clustering algorithms. The classical clustering baselines include k-means [42], Gaussian mixture models (GMM) [4], Least Squares Non-negative Matrix Factorization (LSNMF) [40] and agglomerative clustering (AC) [30]. The subspace clustering methods include Scalable Sparse Subspace Clustering by Orthogonal Matching Pursuit (SSC-OMP) [68] and Scalable Elastic Net Subspace Clustering (EnSC) [67]. The other subspace clustering baselines are not efficient enough to deal with 70,000 samples and therefore they are left out. The manifold clustering approaches include normalized-cut spectral clustering (SC) [54] and Kernel (RBF) k-means [52]. Finally, the deep clustering algorithms include DeepCluster [8], JULE [66], SR-k-means[29], DEC [64], IDEC [22], DCN [65], VaDE [32] and DEPICT [15]. Our baselines also cover clustering the embedded data of an autoencoder using k-means and FINCH [51] denoted, respectively, by (AE+k-means) and (AE+FINCH). As a side note, all the fully-connected baselines share the same architecture with ADEC.

5.2.3 Evaluation Metrics

We adopt ACC [7], NMI [55], and the two gradient-cosine measures $C_r$ (5) and $C_d$ (6) for assessing the quality of the compared methods. The first two metrics are widely used to compare deep clustering methods; the use of the third and fourth measures is among the contributions of this work. ACC and NMI lie within the range $[0, 1]$, while $C_r$ and $C_d$ lie within the range $[-1, 1]$; higher values are better. As shown by (16) and (17), ACC and NMI depend on $\hat{y}$ and $y$, where $\hat{y}$ is the vector of predicted labels and $y$ is the ground-truth label vector.

$ACC = \max_{m} \dfrac{\sum_{i=1}^{N} \mathbf{1}\{\, y_i = m(\hat{y}_i) \,\}}{N} \qquad (16)$

The mapping $m$ is selected from the set of all possible one-to-one mappings between the predicted clusters and the ground-truth categories. The best matching can be found using the Hungarian algorithm [35].

$NMI = \dfrac{I(y, \hat{y})}{\max\left( H(y), H(\hat{y}) \right)} \qquad (17)$

$H(\cdot)$ denotes the entropy and $I(\cdot, \cdot)$ is the mutual information.
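
Both metrics can be computed as follows: ACC uses the Hungarian algorithm (SciPy's linear_sum_assignment) to find the best cluster-to-class mapping of (16), while NMI is available in scikit-learn (whose normalization may differ slightly from (17)).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # build the contingency matrix and find the best one-to-one mapping between
    # predicted clusters and ground-truth classes with the Hungarian algorithm
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(-cost)   # maximize the number of matched samples
    return cost[rows, cols].sum() / y_true.size

# NMI: normalized_mutual_info_score(y_true, y_pred)
```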

5.2.4 Implementation

The encoder is a fully-connected network with layer dimensions d–500–500–2000–10, where d is the input dimension. Apart from the bottleneck layer, all layers are activated by ReLU [41]. The decoder is an inverse mapping of the encoding layers, with dimensions 10–2000–500–500–d and ReLU activations except for the last layer. We pretrain the autoencoder in competition with a critic to perform data reconstruction constrained by an adversarially constrained interpolation. The pretraining weights are optimized using Adam [33], and the coefficients λ and γ are set following the ACAI paper [3]. For the clustering stage, the encoder, decoder, and discriminator are trained alternately. The training is stopped before reaching the final iteration if the convergence criterion, parameterized by a threshold on the fraction of changed assignments, is met. We update the encoder, decoder, and discriminator weights using SGD with momentum. All backpropagation updates are performed on random mini-batches of size 256 for both stages (i.e., pretraining and clustering). ADEC is implemented using Python and TensorFlow [1].
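
A Keras sketch of the fully-connected architecture described above; the layer sizes come from the text, while the function and variable names are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_autoencoder(input_dim, latent_dim=10):
    # encoder d-500-500-2000-10 and mirrored decoder 10-2000-500-500-d, with ReLU
    # activations everywhere except the bottleneck and the output layer
    inputs = layers.Input(shape=(input_dim,))
    h = layers.Dense(500, activation="relu")(inputs)
    h = layers.Dense(500, activation="relu")(h)
    h = layers.Dense(2000, activation="relu")(h)
    z = layers.Dense(latent_dim, name="embedding")(h)
    h = layers.Dense(2000, activation="relu")(z)
    h = layers.Dense(500, activation="relu")(h)
    h = layers.Dense(500, activation="relu")(h)
    outputs = layers.Dense(input_dim)(h)
    return Model(inputs, z, name="encoder"), Model(inputs, outputs, name="autoencoder")
```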

5.3 Results

Our experimental protocol has three parts. In the first part, our model is compared with state-of-the-art clustering algorithms. In the second part, we analyse the ability of our model to tackle Feature Randomness and Feature Drift. In the last part, some qualitative results are exhibited. Before showing our results, we establish some useful notations for all the following experiments: "-" indicates OUT_OF_MEMORY; an empty cell denotes the unsuitability of the algorithm to process one-dimensional data, or that its pretraining phase does not support Data Transformation and/or Adversarially Constrained Interpolation; "*" indicates that the evaluated method shares the same pretraining weights, the same architecture, the same learning dynamics, and the same clustering loss as ADEC.

5.3.1 Comparing state-of-the art approaches

Table 1 reports the evaluation of several clustering approaches, including our proposed method, in terms of ACC and NMI. All baseline methods are tuned according to their default settings. First of all, we observe that state-of-the-art subspace clustering algorithms, such as SSC-OMP and EnSC, are generally not suitable for clustering datasets with semantic similarities (e.g., images, text, sounds). In fact, subspace clustering presumes the data to lie in a union of low-dimensional linear subspaces. However, this assumption does not hold for datasets with clusters lying near non-linearly shaped manifolds [50]. Secondly, we observe that the manifold clustering approaches have better ACC and NMI values than the classical approaches on some datasets. For the manifold category, the selection of the non-linear transform is largely empirical; in particular, no kernel space is sufficiently well-suited to effectively cluster any dataset. Thirdly, in most cases, we can see that deep clustering models outperform all the other approaches by a huge margin. This observation confirms the suitability of deep clustering when it comes to clustering high-dimensional datasets. Finally, comparing among the deep clustering approaches, we can observe that our method provides the best results on every dataset. In terms of ACC and NMI, ADEC outperforms its state-of-the-art counterpart DEPICT by 2% and 5%, respectively. It is worth noting that DEPICT is the convolutional version of DEC with some minor modifications. In order to understand the outperformance of our approach, we need to conduct further experiments.


Method MNIST-full MNIST-test USPS Fashion-MNIST REUTERS-10K Mice Protein
ACC NMI ACC NMI ACC NMI ACC NMI ACC NMI ACC NMI
k-means 0.532 0.500 0.546 0.501 0.668 0.627 0.474 0.512 0.522 0.313 0.342 0.252
GMM 0.433 0.366 0.540 0.493 0.551 0.530 0.556 0.557 0.402 0.375 0.139 1.00
LSNMF 0.540 0.455 0.550 0.463 0.575 0.551 0.549 0.523 0.596 0.361 0.497 0.506
AC 0.621 0.682 0.695 0.711 0.683 0.725 0.500 0.564 0.526 0.365 0.294 0.211
SSC-OMP 0.309 0.315 0.413 0.450 0.477 0.503 0.100 0.007 0.402 0.008 0.152 0.078
EnSC 0.111 0.014 0.603 0.591 0.610 0.684 0.629 0.636 0.401 0.014 0.434 0.347
SC 0.656 0.731 0.660 0.704 0.649 0.794 0.508 0.575 0.402 0.375 0.298 0.268
RBF k-means - - 0.560 0.523 0.629 0.631 - - 0.499 0.288 0.363 0.269
AE + k-means 0.807 0.730 0.702 0.617 0.720 0.698 0.585 0.614 0.695 0.475 0.238 0.131
AE + FINCH - - 0.709 0.754 0.704 0.788 - - 0.241 0.414 0.157 0.083
DeepCluster 0.797 0.661 0.854 0.713 0.562 0.540 0.542 0.510
DCN 0.830 0.810 0.802 0.786 0.688 0.683 0.501 0.558 0.422 0.109 0.197 0.051
DEC 0.863 0.834 0.856 0.830 0.762 0.767 0.518 0.546 0.814 0.598 0.184 0.026
IDEC 0.881 0.867 0.846 0.802 0.761 0.785 0.529 0.557 0.790 0.550 0.196 0.037
SR-k-means 0.939 0.866 0.863 0.873 0.901 0.912 0.507 0.548
VaDE 0.945 0.876 0.287 0.287 0.566 0.512 0.578 0.630 0.793 0.521 0.139 1.00
JULE 0.964 0.913 0.961 0.915 0.950 0.913 0.563 0.608
DEPICT 0.965 0.917 0.963 0.915 0.899 0.906 0.392 0.392
ADEC 0.986 0.961 0.985 0.957 0.981 0.948 0.586 0.662 0.821 0.605 0.500 0.604
Table 1: Comparison of the clustering performances in terms of ACC and NMI. The different clustering categories are separated by double horizontal lines. Best method in bold, second best emphasized.

Method MNIST-full MNIST-test USPS Fashion-MNIST REUTERS-10K Mice Protein
ACC NMI ACC NMI ACC NMI ACC NMI ACC NMI ACC NMI
DEC* 0.971 0.929 0.968 0.920 0.963 0.910 0.575 0.589 0.814 0.598 0.267 0.158
IDEC* 0.982 0.952 0.978 0.944 0.980 0.946 0.575 0.631 0.790 0.550 0.188 0.033
ADEC 0.986 0.961 0.985 0.957 0.981 0.948 0.586 0.662 0.821 0.605 0.500 0.604
Table 2: Comparison of the clustering performances of DEC*, IDEC*, and ADEC in terms of ACC and NMI. Best method in bold, second best emphasized.

For a fair comparison with state-of-the-art deep clustering approaches, the baselines need to be reimplemented in a way that neutralizes the factors which are out of the scope of this article. The reimplemented models share the same deep clustering factors (i.e., architecture, learning dynamics, integrated prior knowledge, and clustering loss) as ADEC. From Table 1 and Table 2, we notice a considerable improvement in terms of ACC and NMI for the modified versions of DEC and IDEC, compared to the original ones. More specifically, DEC* outperforms vanilla DEC by a huge margin. Similarly, IDEC* surpasses its standard counterpart significantly. The huge gap between the ordinarily pretrained models (i.e., based on a simple reconstruction) and the modified ones demonstrates the effectiveness of combining Adversarially Constrained Interpolation and Data Transformation as a pretraining strategy. Furthermore, as we can see from Table 2, ADEC outperforms DEC* and IDEC*. This result suggests that ADEC offers a better trade-off between Feature Randomness and Feature Drift. This hypothesis will be further supported in the subsequent experiments.

In Table 3, we report the execution times of different deep clustering methods. The comparison is limited to deep clustering models; we exclude all the other clustering categories due to their less competitive results, as demonstrated by the previous comparison in Table 1. As we can see in Table 3, the run-time of ADEC is significantly higher than the execution times of DEC, IDEC, DCN, and DeepCluster on all datasets. As it stands, these methods are more efficient than ADEC. However, we can also observe that the execution time of our method is on par with the execution times of DEPICT, SR-k-means, and JULE. Interestingly, our algorithm is considerably faster than VaDE on all datasets.


Method MNIST-full MNIST-test USPS Fashion-MNIST REUTERS-10K Mice Protein
DeepCluster 1,375 74 64 1,250 - -
DCN 640 55 49 732 279 40
DEC 693 58 53 2,384 105 22
IDEC 890 349 110 857 97 150
SR-k-means 14,872 1,657 1,655 4,551 - -
VaDE 123,000 15,000 13,000 120,000 105 15
JULE 12,500 3,247 2,540 13,100 - -
DEPICT 9,561 2,320 1,778 8,581 - -
ADEC 10,735 10,013 8,445 10,502 669 1,047
Table 3: Comparison of the execution times (in seconds) of different deep clustering approaches.

For fairness of comparison, and in order to better assess the efficiency of our method, we run our algorithm against the modified versions of DEC and IDEC (as in the previous experiment). Based on Table 3 and Table 4, we can see that the run-times of DEC* and IDEC* are significantly higher than the execution times of vanilla DEC and vanilla IDEC, respectively. Therefore, DEC* and IDEC* are less efficient than DEC and IDEC. This can be explained by the long pretraining phase: the gain achieved by pretraining with an Adversarially Constrained Interpolation comes at the cost of a higher execution time. We also observe that DEC* and IDEC* are slightly faster than ADEC. This is expected and can be imputed to the adversarial training of our algorithm.


Method MNIST-full MNIST-test USPS Fashion-MNIST REUTERS-10K Mice Protein
DEC* 9,667 9,092 7,692 10,840 53 639
IDEC* 9,556 9,160 7,693 9,623 55 646
ADEC 10,735 10,013 8,445 10,502 669 1,047
Table 4: Comparison of the execution times (in seconds) of DEC*, IDEC* and ADEC.

5.3.2 Feature Randomness and Feature Drift

In this section, our experiments aim to show the ability of ADEC to reach a better trade-off between Feature Randomness and Feature Drift. To do so, we perform an ablation of our adversarial mechanism: instead of this mechanism, we regularize the clustering loss with vanilla reconstruction. The obtained model is IDEC*. Then, we compare ADEC with IDEC* in terms of the Feature Randomness measure $C_r$ (5) and the Feature Drift measure $C_d$ (6). As mentioned earlier, both models share the same optimizer, the same pretraining phase, and the same embedded clustering loss. The only difference between them is the regularization technique.

The first experiment examines the impact of our adversarial mechanism in reducing Feature Randomness. In this section, we show results for the MNIST dataset; such results are representative of the general behavior of our approach, and the same conclusion can be drawn on the other datasets. In Figure 7, we draw the evolution of $C_r$ for ADEC and IDEC* during training on MNIST. Based on this figure, we observe that the average value of $C_r$ for ADEC is considerably higher than the one for IDEC*. A higher value means that the gradient of ADEC is a better approximation of the supervised gradient. Hence, this experiment confirms that our adversarial regularization is more suitable for alleviating Feature Randomness than vanilla reconstruction.

(a) ADEC.
(b) IDEC*.
Figure 7: The Feature Randomness measure $C_r$ (5) during training on MNIST.

The second experiment examines the impact of our adversarial mechanism in reducing Feature Drift. In Figure 8, we draw the evolution of $C_d$ for ADEC and IDEC* during training on MNIST. Based on this figure, we observe that the values of $C_d$ for IDEC* are always negative. This result confirms the strong competition between the gradient of the embedded clustering loss and the gradient of the reconstruction loss. Added to that, we observe that the average value of $C_d$ for ADEC is considerably higher than the one for IDEC*. A higher value indicates that the competition between the embedded clustering gradient and the reconstruction gradient is stronger than the competition between the embedded clustering gradient and the adversarial gradient. Hence, this experiment confirms that our adversarial regularization is more suitable for alleviating Feature Drift than vanilla reconstruction.

Figure 8: Evolution of the Feature Drift measure during training on MNIST. (a) ADEC. (b) IDEC*.

The third experiment examines the impact of Feature Drift on the learning curves. In Figure 9, we plot the learning curves of ADEC and IDEC*, in terms of ACC and NMI, during training on MNIST. Based on this figure, we observe that the learning curves of ADEC are not only above those of IDEC*, but also smoother. Zooming in on the learning curves of IDEC*, as illustrated by Figures 11 and 12, reveals noticeable fluctuations, whereas zooming in on the learning curves of ADEC shows a smooth increase in both metrics. The fluctuations observed for IDEC* can be explained by the competition between the reconstruction and the embedded clustering objectives.
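The ACC and NMI values plotted in these learning curves can be computed with standard tools. Below is a minimal, self-contained sketch: unsupervised clustering accuracy uses the Hungarian algorithm to find the best one-to-one mapping between predicted clusters and ground-truth labels, and NMI comes from scikit-learn; the label arrays are toy examples for illustration only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Unsupervised ACC: best one-to-one mapping between clusters and labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)  # maximize the matched counts
    return cost[row, col].sum() / y_true.size

# Toy labels for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(clustering_accuracy(y_true, y_pred))            # 1.0
print(normalized_mutual_info_score(y_true, y_pred))   # 1.0
```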

Figure 9: ACC and NMI during training on MNIST.
Figure 10: Sensitivity analysis of the balancing hyperparameter during training on MNIST.
Figure 11: ACC during training on MNIST. (a) ADEC. (b) IDEC*.
Figure 12: NMI during training on MNIST-test. (a) ADEC. (b) IDEC*.

The fourth experiment examines the impact of Feature Drift on the sensitivity to the balancing hyperparameter. In Figure 10, we plot the learning curves of IDEC* during training on MNIST for different values of this hyperparameter, selected from a predefined set of candidate values. Based on the obtained results, only one of the tested values (0.01) yields an acceptable learning curve; all the other values make the learning curve drop significantly. Hence, IDEC* is very sensitive to the choice of the balancing hyperparameter. This result can be explained by Feature Drift (the strong competition between the gradient of the self-supervised loss and the gradient of the pseudo-supervised loss). In contrast, ADEC does not require any balancing hyperparameter.

5.3.3 Qualitative results

Figure 13: 2D embedding subspace visualization showing the discriminative ability of ADEC. (a) MNIST-full. (b) MNIST-test. (c) USPS. (d) Fashion-MNIST. (e) REUTERS.

In Figure 13, the discriminative ability of ADEC is illustrated by projecting the embedded data into a 2D subspace for different datasets. From this figure, we can see that the projected embedded data points are grouped into well-separated clusters.
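Such 2D views can be produced by projecting the learned embeddings with a dimensionality-reduction method. The sketch below uses t-SNE from scikit-learn on synthetic embeddings and labels; this choice of projection and the toy data are assumptions made for illustration only, and the actual setup for Figure 13 is the one described in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical embedded features and cluster assignments; in practice these
# would be the encoder outputs and predicted clusters of the trained model.
rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 10)) + np.repeat(np.eye(10) * 8.0, 100, axis=0)
labels = np.repeat(np.arange(10), 100)

z_2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(z)

plt.scatter(z_2d[:, 0], z_2d[:, 1], c=labels, s=4, cmap="tab10")
plt.title("2D projection of the embedded space")
plt.savefig("embedding_2d.png", dpi=150)
```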

Figure 14: Each row shows the top 10 high-confidence images from one cluster. (a) MNIST. (b) Fashion-MNIST.

Figure 14 illustrates the top 10 high-confidence images from each cluster for two datasets, MNIST and Fashion-MNIST. In this figure, each row corresponds to a different cluster, and the images in a row are ordered from left to right by decreasing confidence, i.e., by increasing distance to the associated cluster center.
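Selecting the images for such a figure amounts to ranking, within each cluster, the embedded points by their distance to the assigned center and keeping the k closest ones. A minimal sketch follows, with hypothetical embeddings, centers and assignments standing in for the trained model's outputs.

```python
import numpy as np

def top_k_per_cluster(z, centers, assignments, k=10):
    """Indices of the k embedded points closest to their assigned center."""
    top = {}
    for c in range(centers.shape[0]):
        idx = np.where(assignments == c)[0]
        d = np.linalg.norm(z[idx] - centers[c], axis=1)
        top[c] = idx[np.argsort(d)[:k]]  # smallest distance = highest confidence
    return top

# Hypothetical embeddings, centers and assignments for illustration.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 10))
centers = rng.normal(size=(10, 10))
assignments = np.argmin(
    np.linalg.norm(z[:, None, :] - centers[None, :, :], axis=2), axis=1
)
rows = top_k_per_cluster(z, centers, assignments, k=10)
```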

6 Conclusion

In this article, we have proposed ADEC, an Adversarial Deep Embedded Clustering algorithm. Our method regularizes the embedded clustering loss in a way that alleviates Feature Randomness. To overcome Feature Drift, the strong clustering-reconstruction trade-off has been eliminated. Empirical results have shown that ADEC outperforms state-of-the-art autoencoder-based clustering methods in terms of ACC and NMI. Furthermore, our experimental studies have validated that ADEC offers a better trade-off between Feature Drift and Feature Randomness. In ADEC, as in the most relevant deep clustering models, self-supervision and pseudo-supervision are combined linearly; studying other possible combinations and finding theoretical justifications for them is an interesting direction for future work. It would also be worthwhile to extend the scope of this work to datasets with higher semantic content by using more sophisticated architectures (e.g., ResNet-32, AlexNet and VGG).

References

  • [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
  • [2] David J Bartholomew, Fiona Steele, Jane Galbraith, and Irini Moustaki. Analysis of multivariate social science data. Chapman and Hall/CRC, 2008.
  • [3] David Berthelot, Colin Raffel, Aurko Roy, and Ian Goodfellow. Understanding and improving interpolation in autoencoders via an adversarial regularizer. In International Conference on Learning Representations (ICLR), 2019.
  • [4] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.
  • [5] Mohamed Bouguessa and Shengrui Wang. Mining projected clusters in high-dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 21(4):507–522, 2008.
  • [6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.
  • [7] Deng Cai, Xiaofei He, and Jiawei Han. Locally consistent concept factorization for document clustering. IEEE Transactions on Knowledge and Data Engineering, 23(6):902–913, 2011.
  • [8] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018.
  • [9] Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pages 5879–5887, 2017.
  • [10] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
  • [11] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE transactions on pattern analysis and machine intelligence, 17(8):790–799, 1995.
  • [12] Trevor F Cox and Michael AA Cox. Multidimensional scaling. Chapman and hall/CRC, 2000.
  • [13] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 551–556. ACM, 2004.
  • [14] Chris Ding and Xiaofeng He. K-means clustering via principal component analysis. In Proceedings of the twenty-first international conference on Machine learning, page 29. ACM, 2004.
  • [15] Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 5747–5756. IEEE, 2017.
  • [16] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
  • [17] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In International Conference on Learning Representations (ICLR), 2017.
  • [18] David L Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–5596, 2003.
  • [19] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In International Conference on Learning Representations (ICLR), 2017.
  • [20] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
  • [21] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [22] Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Improved deep embedded clustering with local structure preservation. In International Joint Conference on Artificial Intelligence (IJCAI-17), pages 1753–1759, 2017.
  • [23] Xifeng Guo, En Zhu, Xinwang Liu, and Jianping Yin. Deep embedded clustering with data augmentation. In Asian Conference on Machine Learning, pages 550–565, 2018.
  • [24] Philip Haeusser, Johannes Plapp, Vladimir Golkov, Elie Aljalbout, and Daniel Cremers. Associative deep clustering: Training a classification network with no labels. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2018.
  • [25] Clara Higuera, Katheleen J Gardiner, and Krzysztof J Cios. Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome. PloS one, 10(6):e0129126, 2015.
  • [26] Chih-Chung Hsu and Chia-Wen Lin. Cnn-based joint clustering and representation learning with feature drift compensation for large-scale image data. IEEE Transactions on Multimedia, 20(2):421–429, 2018.
  • [27] Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1558–1567. JMLR. org, 2017.
  • [28] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on pattern analysis and machine intelligence, 16(5):550–554, 1994.
  • [29] Mohammed Jabi, Marco Pedersoli, Amar Mitiche, and Ismail Ben Ayed. Deep clustering: On the link between discriminative models and k-means. arXiv preprint arXiv:1810.04246, 2018.
  • [30] Anil K Jain. Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666, 2010.
  • [31] Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653, 2018.
  • [32] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In International Joint Conference on Artificial Intelligence (IJCAI-17), pages 1965–1972, 2017.
  • [33] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [34] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [35] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
  • [36] Doug Laney. 3d data management: Controlling data volume, velocity and variety. META group research note, 6(70):1, 2001.
  • [37] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
  • [38] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • [39] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. Rcv1: A new benchmark collection for text categorization research. Journal of machine learning research, 5(Apr):361–397, 2004.
  • [40] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural computation, 19(10):2756–2779, 2007.
  • [41] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.
  • [42] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
  • [43] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In International Conference on Learning Representations (ICLR), 2016.
  • [44] Nairouz Mrabah, Naimul Mefraz Khan, and Riadh Ksantini. Deep clustering with a dynamic autoencoder. arXiv preprint arXiv:1901.07752, 2019.
  • [45] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849–856, 2002.
  • [46] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
  • [47] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
  • [48] Karl Pearson. Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
  • [49] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016.
  • [50] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500):2323–2326, 2000.
  • [51] Saquib Sarfraz, Vivek Sharma, and Rainer Stiefelhagen. Efficient parameter-free clustering using first neighbor relations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2019.
  • [52] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299–1319, 1998.
  • [53] Sohil Atul Shah and Vladlen Koltun. Deep continuous clustering. arXiv preprint arXiv:1803.01449, 2018.
  • [54] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Departmental Papers (CIS), page 107, 2000.
  • [55] Alexander Strehl and Joydeep Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec):583–617, 2002.
  • [56] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
  • [57] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In International Conference on Learning Representations (ICLR), 2018.
  • [58] Kenneth E Train. Discrete choice methods with simulation. Cambridge university press, 2009.
  • [59] Elad Tzoreff, Olga Kogan, and Yoni Choukroun. Deep discriminative latent space for clustering. arXiv preprint arXiv:1805.10795, 2018.
  • [60] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
  • [61] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408, 2010.
  • [63] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • [64] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pages 478–487, 2016.
  • [65] Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3861–3870. JMLR. org, 2017.
  • [66] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5147–5156, 2016.
  • [67] Chong You, Chun-Guang Li, Daniel P Robinson, and René Vidal. Oracle based active set algorithm for scalable elastic net subspace clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3928–3937, 2016.
  • [68] Chong You, Daniel Robinson, and René Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3918–3927, 2016.
  • [69] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.
  • [70] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European conference on computer vision, pages 649–666. Springer, 2016.
  • [71] Xiaofeng Zhu, Shichao Zhang, Yonggang Li, Jilian Zhang, Lifeng Yang, and Yue Fang. Low-rank sparse subspace for spectral clustering. IEEE Transactions on Knowledge and Data Engineering, 2018.

Appendix A Proof of theorem 1

We start by computing:

And since

Therefore

The function can be written as:

According to [14], the function can be written as:

So we obtain

Appendix B Proof of theorem 2

The loss function can be written as:

Then, we compute and separately.

After substitutions, we have:

Appendix C Proof of theorem 3

and play symmetric roles in the first part of , and the regularization part does not depend on . Therefore,