Adversarial Deep Embedded Clustering: On a Better Trade-Off Between Feature Randomness and Feature Drift
Abstract
Clustering using deep autoencoders has been thoroughly investigated in recent years. Current approaches rely on simultaneously learning embedded features and clustering the data points in the latent space. Although numerous deep clustering approaches outperform shallow models on several high-semantic datasets, a critical weakness of such models has been overlooked. In the absence of concrete supervisory signals, the embedded clustering objective function may distort the latent space by learning from unreliable pseudo-labels. Thus, the network can learn non-representative features, which in turn undermines the discriminative ability, yielding worse pseudo-labels. To alleviate the effect of random discriminative features, modern autoencoder-based clustering papers propose to use the reconstruction loss for pretraining and as a regularizer during the clustering phase. Nevertheless, a clustering-reconstruction trade-off can cause the Feature Drift phenomenon. In this paper, we propose ADEC (Adversarial Deep Embedded Clustering), a novel autoencoder-based clustering model, which addresses a dual problem, namely, Feature Randomness and Feature Drift, using adversarial training. We empirically demonstrate the suitability of our model for handling these problems using real benchmark datasets. Experimental results validate that our model outperforms state-of-the-art autoencoder-based clustering methods.
Unsupervised Learning, Deep Learning, Clustering, Autoencoders.
1 Introduction
The main focus of clustering is to partition the original data into clusters without using any supervisory signal. During the last decades, a plethora of clustering algorithms has been proposed to overcome three main challenges. The first challenge is the high dimensionality of existing real-world information. For example, a typical image has thousands of pixels. This characteristic makes the clustering task more difficult due to the well-known curse of dimensionality [58]. The large amount of data, or big data as popularized by the public community [36], constitutes the second challenge. Computationally, processing large-scale datasets is generally associated with time and memory overheads. Last but not least, the high-semantic aspect of natural data makes clustering a more challenging task. For example, two images of cats may look nothing like each other at the pixel level although they belong to the same class. The high-semantic aspect of natural data can be explained by the compositional hierarchy of features [37]. In that respect, it is well known that high-level features are nothing more than a combination of lower-level ones. To give an example in the case of visual data, edges are combined to construct motifs, which are the keystone for building objects. The same applies to speech and text datasets.
Classical clustering methods, such as k-means [42], Gaussian Mixture Model [4], DBSCAN [20] and Mean Shift [11], are shallow models. They rely on computing distance-based similarities in the raw data space or in the space where the handcrafted features live. However, feature engineering is task-specific. Therefore, it is inappropriate to integrate such a preprocessing task in the pipeline of a general-purpose clustering framework. Moreover, natural data (e.g., images and videos) have high-dimensional and high-semantic aspects. So, when dealing with such datasets, the conventional clustering approaches have limited performance as the computational time increases considerably. In addition, distance-based metrics computed in the raw data space are unreliable for discovering semantic similarities.
To address the curse of dimensionality, the original high-dimensional data should be projected into a low-dimensional feature space. While abundant literature revolves around unsupervised dimensionality reduction, there are two main families. The first one consists of the linear dimensionality reduction methods, such as Principal Component Analysis [48] and Factor Analysis [2]. The second family is based on the assumption that the most pertinent information lies on a low-dimensional manifold (not a linear subspace) [56]. Multidimensional scaling [12], Isometric Feature Mapping [56] and Hessian Eigenmapping [18] are among the popular manifold dimensionality reduction techniques. Although the linear and nonlinear methods aim to preserve substantial information, they are prone to losing discriminative information, which in turn degrades the clustering performance.
Projected clustering [5] and subspace clustering [71] have gained popularity thanks to their ability to address the problem of high-dimensional data clustering, as they identify relevant dimensions that exhibit the cluster structure. Unlike pure dimensionality reduction techniques, these methods do not ignore the discriminative aspect. Yet, they are only effective when the data meets the linear subspace hypothesis, which is rarely the case for natural data.
In contrast, kernel k-means [13] and spectral clustering [45] map the data to nonlinear manifolds. Nevertheless, the transformation capacity of these approaches is limited. They generally underfit the complexity and semantics of real-world information. In addition, their computational time usually grows considerably when processing large databases.
The recent advances in unsupervised representation learning [61, 3] based on deep neural networks gave birth to a new family of clustering strategies, known as deep clustering. The multilayer architecture has become the natural choice when it comes to processing large, high-semantic and high-dimensional datasets for several reasons. First, backpropagation and Stochastic Gradient Descent (SGD) make it possible to update the network weights cheaply, without looping over the whole dataset in every single iteration. That is why a neural network is well suited for analyzing large data. Second, the compositional nature of the data [37] justifies the need to gradually extract higher semantic representations from one layer to another using nonlinear projection. Finally, the number of neurons in the hidden layers defines the dimensionality of the embedded spaces. Hence, a deep architecture can reduce dimensionality if it is designed to have a low-dimensional latent space. In spite of the success of deep learning in many supervised applications, leveraging the power of neural networks in performing data clustering is still an open problem.
The most prominent deep clustering approaches rely on autoencoders [64, 22, 65, 59, 53]. Some other deep clustering strategies harness an encoding architecture [24, 66, 9, 8, 26, 27] without a decoding network. However, dispensing with the decoder and clustering the data in a latent space using only pseudo-labels can mislead the training process. This is because pseudo-labels are primarily generated based on hypothetical similarities, which generally underfit the semantics of natural datasets. We call this problem Feature Randomness. The rationale for choosing autoencoding as the standard deep clustering architecture can be attributed to the limited reliability of pseudo-supervision when used alone. The reconstruction rebuilds the input samples after encoding them in a low-dimensional latent space. The ability to reconstruct a point from a low-dimensional representation suggests that the latent space captures the key factors of variation and similarity. Otherwise, it would be impossible to regenerate the data samples.
Autoencoder-based clustering approaches generally consist of a joint optimization process. The reconstruction cost is combined with a clustering-specific objective function. Referring to the previous point, retaining the reconstruction cost during the clustering phase helps reduce Feature Randomness by blocking the encoder from generating random discriminative features. However, regularizing with the reconstruction end-to-end can lead to Feature Drift. This problem emerges due to the natural trade-off between clustering and reconstruction. Put differently, while latent clustering groups and separates the embedded samples by emphasizing within-cluster similarities and destroying between-cluster similarities, the reconstruction is associated with preserving all factors of similarity.
Meanwhile, the Generative Adversarial Network (GAN) [21] has shown great promise in learning complex natural data distributions. It makes it possible to synthesize out-of-sample data points. Besides, it has been shown that GANs can generate images with impressive visual quality [49, 6]. Apart from being a successful generative model, the adversarial training strategy has inspired several modern achievements in unsupervised representation learning [3, 19, 17, 10]. Although a GAN does not come with an encoder out of the box, some recent papers have suggested extending the classical GAN framework to permit data encoding in a latent space, where the semantic factors of variation and similarity are better emphasized [19, 17, 43, 57]. Nevertheless, it is still unclear to what extent the features learned by a deep generative model can be useful for downstream discriminative tasks (e.g., classification and clustering).
To address the aforementioned problems, we propose Adversarial Deep Embedded Clustering (ADEC). Our framework eliminates the strong competition between embedded clustering and reconstruction without incurring a Feature Randomness cost. This is done by moving the strong competition outside of a single network, while relying on a discriminator to make sure that the embedded representations preserve the intrinsic data characteristics. Optimizing every objective function in a separate network, based on adversarial training, makes it possible to reach a better trade-off between Feature Randomness and Feature Drift. Experiments on real benchmark datasets show the superiority of our framework in terms of accuracy (ACC) and normalized mutual information (NMI). In a nutshell, the key contributions of this paper are:

Devising a new pretraining framework based on adversarially constrained interpolation and data transformation.

Overcoming the clustering-reconstruction compromise based on an adversarial training strategy.

Enhancing state-of-the-art autoencoder-based clustering by alleviating Feature Randomness and Feature Drift.

Outperforming modern clustering models in terms of ACC and NMI on real benchmark datasets.
The rest of this paper is organized as follows. Section 2 is devoted to related work. Section 3 presents an analysis of the trade-off between Feature Randomness and Feature Drift. In Section 4, we present our methodology for tackling the Feature Randomness and Feature Drift problems. In Section 5, we show our experimental results. Finally, Section 6 concludes the paper.
2 Related work
To demonstrate the merit of our proposed framework, we provide a critical review of mainstream approaches related to our work. ADEC comes under the realm of deep clustering strategies. Furthermore, it is deemed to be part of the concerted effort to combine GANs and autoencoders. To this end, we shall review the existing deep clustering approaches and the unifying techniques of GANs and autoencoders. We should also review DEC [64] and IDEC [22] since they constitute our principal baselines.
2.1 Deep Clustering
In deep unsupervised learning, there are typically two conceivable options to make up for the absence of supervisory signals. The first option consists of contriving a pretext task that encourages learning general-purpose features. It is better known as self-supervision. In this case, labels are extracted from the input data. The intuition behind self-supervision is that the pretext task cannot be solved efficiently without gaining a semantically high-level grasp of the input data. The obtained features can then be used for downstream tasks, such as classification. There exists a wide variety of pretext tasks, among them the vanilla reconstruction, the denoising loss [60], the variational loss [34], the adversarial loss [21], predicting the location of image patches [16], predicting the permutations of a "jigsaw puzzle" [46], predicting unpainted image patches [47], and predicting image colorization [70]. The second option consists of contriving a pseudo-supervisory signal. Similar to self-supervision, the labeling is available within the data. However, for pseudo-supervision, labels are predicted. Therefore, they are not 100% correct. In this paper, $L_{self}$ denotes a self-supervised loss and $L_{pseudo}$ denotes a pseudo-supervised loss.
A possible categorization of deep clustering methods is based on the loss functions used and the way they are combined. Based on that, the existing models fall into three main categories. In Figure 1, the framework of each category is illustrated. In this context, the pseudo-supervised loss stands for the embedded clustering objective function, which can be any one of the typical clustering objective functions, such as Gaussian Mixture Model (GMM) or k-means. As regards the self-supervised cost, reconstruction is commonly selected.
For the first category, the clustering is directly performed using a pseudo-supervised loss. However, the self-acquired labels are unreliable due to their hypothetical aspect. This can mislead the data grouping by learning non-representative features, which in turn deteriorates the discriminative ability of the model. In short, the main weakness of the methods in this category is Feature Randomness. As part of this category, Yang et al. proposed JULE [66], a deep recurrent framework that performs agglomerative clustering and feature learning jointly with a unified triplet loss. The whole process is optimized end-to-end. However, one of the prominent downsides of JULE is the runtime overhead due to the recurrent framework. Chang et al. proposed DAC [9], a framework that clusters the data based on pairwise constraints. DAC follows a curriculum learning strategy, where only high-confidence training samples are selected. In another line of research, Hu et al. proposed IMSAT [27], which consists of maximizing the mutual information between discrete predictions and their associated data points. The loss function of IMSAT is regularized by a self-augmented training term that penalizes the discrepancy between the initial data points and their geometrically transformed versions. DeepCluster [8] is another framework closely tied to this category. It alternates between two basic steps. First, the latent representations are clustered by k-means. Then, the obtained clustering assignments are fed to a convolutional neural network as supervisory signals to learn better features. It was mainly applied to large-scale datasets.
As for the second category, the network is pretrained using a self-supervised cost function. Then, the obtained latent features are fine-tuned by retraining using pseudo-labels. Compared with random initialization, pretraining with a self-supervised loss leads to improved initial features. Nevertheless, there is no correction mechanism to attenuate the harm of noisy labels. Hence, Feature Randomness remains a strong concern for this category. As part of this category, DEC [64] is the first deep clustering framework to follow a pretraining and fine-tuning strategy.
The third category consists of pretraining using a self-supervised cost function, similar to the second category. However, their fine-tuning phases are different. In fact, the third category regularizes the pseudo-supervised objective function with a self-supervised one. The advantage of such a strategy is that it offers a mechanism to reduce Feature Randomness. However, combining pseudo-supervision and self-supervision can lead to a strong competition between them. In order to balance the two cost functions, a hyperparameter is required. To give an example, Guo et al. proposed IDEC [22] and Dizaji et al. proposed DEPICT [15]. Both models can be considered as extensions of DEC. They regularize the clustering loss with reconstruction during the fine-tuning phase. Therefore, the decoder is maintained throughout the whole training process. The main difference between them is that DEPICT utilizes a convolutional architecture, whereas IDEC leverages a fully-connected autoencoder. Apart from that, Yang et al. proposed DCN [65]. Unlike IDEC and DEPICT, DCN optimizes a latent k-means objective function. VaDE [32] is another model from this category. Its variational autoencoding architecture makes it possible to impose a GMM latent distribution. Thanks to the reparameterization trick, VaDE can be optimized using backpropagation. In addition to clustering, VaDE can generate data samples. Even so, it suffers from high computational complexity, like all the other variational frameworks.
2.2 Deep Embedded Clustering (DEC)
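DEC [64] has two phases. The pretraining phase learns low-dimensional embedded representations by training the autoencoder with vanilla reconstruction. Then comes the clustering phase. First, the decoder is discarded. After that, the encoder is trained to jointly optimize the embedded representations and the cluster centers. At every training iteration, a soft clustering assignment $q_{ij}$ (1) is computed based on the Student's t-distribution. $q_{ij}$ represents an assessment of the similarity between the embedded data point $z_i$ and the center $\mu_j$.
$$q_{ij} = \frac{\left(1 + \|z_i - \mu_j\|^2\right)^{-1}}{\sum_{j'} \left(1 + \|z_i - \mu_{j'}\|^2\right)^{-1}} \tag{1}$$
The DEC loss function (2) is the Kullback-Leibler divergence between the soft clustering assignment $Q = [q_{ij}]$ and an auxiliary target distribution $P = [p_{ij}]$ (3).
$$L_{DEC} = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{2}$$
$$p_{ij} = \frac{q_{ij}^2 / \sum_{i'} q_{i'j}}{\sum_{j'} \left( q_{ij'}^2 / \sum_{i'} q_{i'j'} \right)} \tag{3}$$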
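To make the soft assignment (1), the target distribution (3) and the KL objective (2) concrete, the following is a minimal NumPy sketch. It follows the standard DEC formulation from [64]; the function names, the `alpha` degree-of-freedom argument (set to 1 by default) and the toy data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def soft_assignment(z, mu, alpha=1.0):
    """Student's t soft assignment q_ij for embeddings z (N, d) and centers mu (K, d)."""
    # Squared Euclidean distance between every embedding and every center: shape (N, K).
    sq_dist = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize over the clusters so that each row is a probability vector.
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Auxiliary target p_ij: square q_ij, divide by the soft cluster frequency, renormalize."""
    weight = q ** 2 / q.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_clustering_loss(p, q, eps=1e-12):
    """KL(P || Q) summed over all points and clusters (eps avoids log(0))."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy data (hypothetical): 6 embedded points, 3 centers, 2-dimensional latent space.
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 2))
mu = rng.normal(size=(3, 2))

q = soft_assignment(z, mu)
p = target_distribution(q)
loss = kl_clustering_loss(p, q)
```

In DEC, `p` is treated as a fixed target while the gradient of `loss` is propagated to the encoder weights and the centers; the sketch only evaluates the three quantities.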
2.3 Improved Deep Embedded Clustering (IDEC)
IDEC [22] has the same pretraining phase as DEC. The main difference between them is that IDEC is fine-tuned to jointly minimize the embedded clustering and reconstruction losses, as described by (4).
$$L_{IDEC} = KL(P \,\|\, Q) + \gamma L_{R} \tag{4}$$
$L_R$ stands for the reconstruction loss and $\gamma$ is in charge of balancing the two costs. The key idea of IDEC is to block the clustering loss from corrupting the feature space. However, we argue that combining embedded clustering and vanilla reconstruction gives birth to a strong competition between them (i.e., Feature Drift).
2.4 Combining Autoencoders with GANs
Interpolating data samples from the prior distribution, in the latent space of the generator, leads to realistic and semantically explainable variations [49, 6]. As a consequence of the effectiveness of GANs in capturing the semantic factors of variation, many researchers have studied the inverse mapping problem (i.e., projecting the data back into the embedded space) [17, 43, 19, 57]. When coupled with the generator, an encoder can potentially learn to produce latent high-semantic features from the initial data distribution. This can bring an important advancement in solving inverse problems (e.g., image inpainting and noise removal) and downstream discrimination tasks. Two of the most seminal contributions on combining the power of GANs with autoencoders are BiGAN [17] and AAE [43].
Although we use the same architectural components (i.e., encoder, decoder and discriminator), our framework differs from the previously mentioned ones in several notable aspects. First, our discriminator operates in the data space. In contrast, the critic of AAE processes samples from the latent space, while the BiGAN framework concatenates a sample from the data space with its projection in the embedded space before feeding it to the discriminator. Second, in BiGAN, the encoder and decoder cannot directly communicate with each other. Therefore, the objective function of this approach does not have an explicit reconstruction cost. In contrast, AAE and our model explicitly minimize the cycle cost. Unlike AAE, where both the encoder and decoder are trained to perform reconstruction, our encoder weights are frozen while optimizing with respect to the cycle loss, in order to avoid drifting the features learned using the clustering loss. Moreover, in BiGAN and AAE, the encoder and decoder are trained jointly in competition with the discriminator network, whereas in our framework, each network is trained separately. Furthermore, AAE and BiGAN are standard generative models. So, in order to allow sampling, they enforce the aggregated posterior to match an arbitrary prior. In our case, we do not impose any hypothetical prior distribution, and the adversarial training strategy is introduced to tackle problems related to embedded clustering, that is, Feature Randomness and Feature Drift.
3 Trade-Off Between Feature Randomness and Feature Drift
In this section, we propose a mathematical formalism to characterize Feature Randomness and Feature Drift. We explain the identified problems and we shed light on the trade-off between them.
3.1 Feature Randomness
For a pseudo-supervised loss, the labels used are predicted based on a presumptive similarity metric. It follows that part of the pseudo-labels mismatch the real ones. Zhang et al. [69] showed empirically that standard deep neural networks can easily and perfectly fit completely random labels without any considerable time overhead, using the same hyperparameters and architecture as used for training with correct labels. This result suggests that a neural network has sufficient capacity to memorize the whole dataset even when there is little or no correlation between the training samples and their corresponding labels.
We call Feature Randomness the training of a neural network using pseudo-labels. Put differently, Feature Randomness takes place when a significant portion of the true labels is substituted by random ones. At every iteration of a neural network optimization process, Feature Randomness can be characterized by (5). $\Delta_{FR}$ is the cosine of the angle between the gradient of the unsupervised (pseudo-supervised) loss $L_{pseudo}$ and the gradient of the real supervised loss $L_{sup}$ w.r.t. the network parameters $w$. $Y$ denotes the true labels (100% correct) and $\hat{Y}$ denotes the pseudo-labels (partially correct).
$$\Delta_{FR} = \cos\left( \frac{\partial L_{pseudo}(\hat{Y})}{\partial w}, \frac{\partial L_{sup}(Y)}{\partial w} \right) \tag{5}$$
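As an illustration of the quantity described around (5), the sketch below measures the cosine between the gradient obtained with true labels and the gradient obtained with partially randomized labels. It is a toy setting under stated assumptions: a linear logistic model stands in for the network, the data are synthetic, and the helper names (`logistic_grad`, `randomness_score`) are hypothetical.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the mean binary cross-entropy of a linear logistic model w.r.t. w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic, linearly separable data (assumption for the sake of the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y_true = (X @ w_true > 0).astype(float)

w = np.zeros(5)                       # current parameters of the toy model
g_true = logistic_grad(w, X, y_true)  # gradient under the true labels

def randomness_score(random_fraction):
    """Cosine between the gradient under partially randomized labels and the true gradient."""
    y_pseudo = y_true.copy()
    n_random = int(random_fraction * len(y_true))
    idx = rng.choice(len(y_true), size=n_random, replace=False)
    y_pseudo[idx] = rng.integers(0, 2, size=n_random)  # replace by random labels
    return cosine(logistic_grad(w, X, y_pseudo), g_true)

scores = {f: randomness_score(f) for f in (0.0, 0.5, 1.0)}
```

With no randomized labels the two gradients coincide and the cosine equals 1; as the fraction of random labels grows, the pseudo-supervised gradient is no longer guaranteed to point in the direction of the supervised one.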
Training with pseudo-labels deteriorates the generalization capacity of a neural network. In fact, it enforces learning features that emphasize similarities between uncorrelated data points. In order to reduce the harm of Feature Randomness, a possible solution consists of adjusting the gradient of the pseudo-supervised loss by another vector. The gradient of the self-supervised loss is a candidate to be that vector for several reasons. First, it is well known that minimizing a self-supervised loss generates reasonable general-purpose features. Second, the self-acquired labels of a self-supervised loss are 100% correct. Hence, minimizing it does not contribute to Feature Randomness. Besides, self-supervision can be used to integrate relevant prior knowledge. In Figure 2, we illustrate the role of the self-supervised objective function in adjusting the gradient of the pseudo-supervised one. In that figure, the labels of the pretext task are referred to as pretext labels.
3.2 Feature Drift
In the context of multi-objective optimization, two objective functions are said to be conflicting if improving one of them degrades the other. In such a case, the optimum should be computed while taking into consideration the trade-offs between the competing objective functions. A solution is called non-dominated when no objective can be improved without degrading at least one of the others. There are generally multiple such optima, and they are considered equally valid in the absence of extra subjective information.
In the case of deep learning, we call Feature Drift the optimization of a neural network's loss function whose secondary component (regularizer) strongly competes with the main one. This phenomenon can lead to a failure of the global learning process. The features learned based on the main cost function can be easily drifted by updating with respect to the secondary loss. To better understand this problem, Figure 3 shows a simplistic illustration of Feature Drift. In Figure 3.a, a linear combination of two strongly competing vectors $\vec{F}_1$ and $\vec{F}_2$ is pulling the ball. In Figure 3.b, the ball is pulled by another couple of forces $\vec{F}_3$ and $\vec{F}_4$, which are less conflicting. In both cases, the pulling forces are adjusted using a positive balancing coefficient $\gamma$. So, the resultant vector is equal to $\vec{F}_1 + \gamma \vec{F}_2$ for the first case and $\vec{F}_3 + \gamma \vec{F}_4$ for the second case. In this example, we consider that the pulling forces $\vec{F}_1$, $\vec{F}_2$, $\vec{F}_3$, and $\vec{F}_4$ are constant and the coefficient $\gamma$ is variable. After applying the resultant vector, the object is supposed to reach a position $P$. In both figures, the colored area (delimited by the competing vectors) represents the field of possible positions after applying the resultant vector. The target solution lies within the green field. It is reasonable to predict that the smaller the area of the field, the easier it is to reach the target solution. For two strongly competing vectors, we observe that the variation of $\gamma$ dramatically affects the reached position. Therefore, the choice of $\gamma$ is quite critical in this case. In the second case, where the competition is less intense, the variation of $\gamma$ has a less important impact on the reached position. Hence, the choice of $\gamma$ is less crucial than in the first case. We conclude that the importance of a balancing coefficient depends on the level of competition. Most importantly, we observe that the level of competition can be assessed by the cosine of the angle between the two competing vectors.
Hence, at every training iteration, we can characterize Feature Drift as follows:
$$\Delta_{FD} = \cos\left( \frac{\partial L_{pseudo}}{\partial w}, \frac{\partial L_{self}}{\partial w} \right) \tag{6}$$
where $\partial L_{pseudo} / \partial w$ and $\partial L_{self} / \partial w$ are the gradient of the pseudo-supervised loss and the gradient of the self-supervised loss, respectively. When the strongly contending vectors are not balanced meticulously, the desired solution would not be reached even after multiple iterations. Besides, it is of great importance to make unsupervised learning methods less reliant on unpredictable and dataset-specific parameters. The example presented in Figure 3 was selected for its simplicity, since it is difficult to visualize the gradient vectors in a high-dimensional space.
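The pulling-forces example of Figure 3 can be reproduced numerically. The sketch below is only an illustration under assumed values: two-dimensional unit vectors, an angle of 170 degrees for the strongly competing pair and 30 degrees for the mildly competing pair, and a balancing coefficient swept over [0.1, 10]. It measures how much the direction of the resultant (main vector plus coefficient times secondary vector) moves when the coefficient varies.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors (the competition level)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def direction_spread(v_main, v_reg, gammas):
    """Range (in degrees) of the angle between v_main and v_main + gamma * v_reg."""
    angles = []
    for gamma in gammas:
        resultant = v_main + gamma * v_reg
        c = np.clip(cosine(resultant, v_main), -1.0, 1.0)
        angles.append(np.degrees(np.arccos(c)))
    return max(angles) - min(angles)

def unit(deg):
    """Unit vector in the plane at the given angle (degrees)."""
    rad = np.radians(deg)
    return np.array([np.cos(rad), np.sin(rad)])

v_main = unit(0.0)      # gradient of the main (pseudo-supervised) loss, assumed
v_strong = unit(170.0)  # strongly competing secondary vector, assumed
v_mild = unit(30.0)     # mildly competing secondary vector, assumed
gammas = np.linspace(0.1, 10.0, 50)

competition_strong = cosine(v_main, v_strong)  # close to -1
competition_mild = cosine(v_main, v_mild)      # close to +1
spread_strong = direction_spread(v_main, v_strong, gammas)
spread_mild = direction_spread(v_main, v_mild, gammas)
```

In this toy setting, the strongly competing pair has a cosine near -1 and the resultant direction sweeps a much wider range of angles than for the mildly competing pair, which matches the observation that the balancing coefficient becomes critical when the competition is strong.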
Several modern deep clustering models [22, 65, 15, 32] jointly perform reconstruction and embedded clustering. In this work, we show empirically that combining them leads to a very strong competition between their gradients. Under specific conditions, we provide a mathematical proof of this hypothesis. For this reason, we consider the problem of clustering a dataset and X the matrix whose row vectors are . We presume that the number of clusters . Our operators include a linear encoder , which maps samples from the data space to the latent space and a linear decoder performing an inverse mapping. The matrices and hold the learnable parameters of the encoder and decoder, respectively. We define the vector as the projection of the data point in the latent space and is the reconstructed representation of . We denote . Furthermore, we constrain to the set of semi-orthogonal matrices. Thus, , where is the identity matrix. Each cluster is associated with a centroid in the embedded space , where , represents the cluster j and is the number of points in . The center of the embedded points is denoted by . We define, , as the reconstruction loss, and , as the k-means loss in the latent space.
Let , be the average distance between two clusters and . If is equal to , this distance is called within-cluster distance, and defined by and . Otherwise, it is called between-cluster distance, and defined by and .
Theorem 1.
Under the specific conditions described above, the loss function can be expressed as follows:
(7) 
where
The proof of Theorem 1 is provided in Appendix A. This theorem shows the implicit competition between a typical clustering loss (k-means) and the reconstruction one. Intuitively, minimizing the clustering loss has two objectives. First, it emphasizes the similarities between data points within the same cluster. Second, it stresses the variations between data points from different clusters. However, minimizing the reconstruction loss aims to preserve all the similarities and variances between every pair of data points, whether or not they belong to the same cluster. Using Theorem 1, we can notice that minimizing the first term of the k-means loss leads to the maximization of between-cluster distances, which forces the clusters to be separable. In addition, minimizing the second and third terms of the k-means loss minimizes within-cluster variances, which pushes the clusters to be as compact as possible. However, minimizing the reconstruction loss implies minimizing both between-cluster distances and within-cluster variances.
In the general case, increasing the balancing coefficient $\gamma$ significantly causes the self-supervised loss to easily win the competition. Thus, any discriminative feature learned in the direction of the pseudo-supervised loss's gradient can be easily drifted by the gradient of the self-supervised loss. On the other hand, decreasing $\gamma$ significantly leads to Feature Randomness.
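The expression of Theorem 1 is not reproduced here. As a related and much simpler sanity check, the sketch below verifies the classical scatter decomposition for a fixed set of embedded points: the total scatter around the global center equals the within-cluster scatter (the k-means loss) plus the between-cluster scatter. This is a standard identity, not the theorem itself, and the toy clusters and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Toy embedded points (assumption): 3 clusters of 20 points each in a 2-d latent space.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 20)
Z = centers[labels] + rng.normal(scale=0.5, size=(60, 2))

mu = Z.mean(axis=0)                # center of all embedded points
total = np.sum((Z - mu) ** 2)      # total scatter

within = 0.0                       # k-means loss: scatter around each centroid
between = 0.0                      # size-weighted scatter of centroids around mu
for j in range(3):
    Zj = Z[labels == j]
    mu_j = Zj.mean(axis=0)
    within += np.sum((Zj - mu_j) ** 2)
    between += len(Zj) * np.sum((mu_j - mu) ** 2)
```

For a fixed total scatter, lowering the within-cluster term is equivalent to raising the between-cluster term, which is the grouping-and-separating behavior attributed to the clustering loss. A learned encoder can change the total scatter as well, so the identity only conveys the intuition.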
4 Adversarial Deep Embedded Clustering
In this section, we describe our proposed framework ADEC. Our model is designed to address Feature Randomness and Feature Drift. To this end, we consider the problem of clustering a dataset into clusters. Each cluster is associated with a centroid in the embedded space , where . Our operators include a deep nonlinear encoder , which maps samples from the data space to the latent space and a deep nonlinear decoder performing an inverse mapping. and represent the learnable parameters of the encoder and decoder, respectively. We define the vector to be the projection of a data point into the latent space and is the reconstructed representation of .
4.1 Pretraining phase
Following state-of-the-art autoencoder-based clustering approaches, we pretrain the encoder and decoder. In the context of deep clustering, pseudo-labels are the cluster representatives (i.e., embedded centers). It is important to start the clustering phase with latent features that reflect the data distribution. Otherwise, it would be impossible to extract meaningful pseudo-labels. Training a neural network using embedded centers extracted from completely random latent representations leads to excessive Feature Randomness (due to the large number of unreliable pseudo-labels). It is well known that self-supervision makes it possible to learn reliable general-purpose features by solving a pretext task. Therefore, the pretraining phase should consist of minimizing a self-supervised loss.
Previously proposed algorithms, such as, DEC and IDEC rely on a stacked denoising selfencoding strategy [62] for initializing the training weights and . In our case, we opted for pretraining the autoencoder using vanilla reconstruction loss regularized by an adversarially constrained interpolation [3] and data augmentation (e.g., slight random rotation and translation of the input samples) [23]. These techniques are backed up by results showing an important enhancement in learning unsupervised representations for downstream tasks [3, 23]. When pretraining the model, a real number is randomly sampled to compute , such that, is the reconstruction of a data point interpolated from the latent representations of and . The framework of ACAI [3] simulates a game competition between two adversarial networks. The autoencoder is trained to generate interpolated points. While the critic , which is a neural network parameterized by , enables to regress the interpolation parameter in (9), the autoencoder aims to fool the critic into considering the generated interpolants as real samples (i.e., outputting ) in (8). The second term in (9) allows the critic to identify noninterpolated inputs. The coefficient , in (9), is randomly selected from at every iteration and , in (8), is responsible for balancing the reconstruction and the regularization. For the sake of simplification, we assume that stands for the data samples after carrying out the random transformations (rotation and translation). The full framework of our pretraining phase is illustrated in Figure 4. To the best of our knowledge, we are the first to propose such a pretraining strategy in the context of deep clustering.
(8) 
(9) 
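The pretraining objective described above can be sketched as follows. This is a minimal NumPy illustration that assumes the loss forms published in the ACAI paper [3]; the single-layer toy networks, the names `encode`, `decode`, `critic` and `acai_losses`, and the default values of `lam` and `gamma` are hypothetical placeholders, not the configuration used in our experiments. Data augmentation is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy networks (single linear maps) standing in for the real
# encoder, decoder and critic; sizes are illustrative only.
d_in, d_z = 16, 4
W_enc = rng.normal(scale=0.1, size=(d_in, d_z))
W_dec = rng.normal(scale=0.1, size=(d_z, d_in))
w_crit = rng.normal(scale=0.1, size=(d_in,))

def encode(x):
    return x @ W_enc

def decode(z):
    return z @ W_dec

def critic(x):
    return x @ w_crit  # one scalar prediction per sample

def acai_losses(x, lam=0.5, gamma=0.2):
    """Autoencoder and critic losses, assuming the ACAI formulation of [3]."""
    n = x.shape[0]
    z1 = encode(x)
    z2 = z1[rng.permutation(n)]                  # partner codes to mix with
    alpha = rng.uniform(0.0, 0.5, size=(n, 1))   # interpolation coefficients
    x_interp = decode(alpha * z1 + (1.0 - alpha) * z2)
    x_rec = decode(z1)

    # Autoencoder: reconstruct the input and push the critic output on
    # interpolants towards 0 (i.e., make interpolants look non-interpolated).
    loss_ae = (np.mean(np.sum((x - x_rec) ** 2, axis=1))
               + lam * np.mean(critic(x_interp) ** 2))

    # Critic: regress alpha on interpolants and output 0 on a blend of the
    # input and its reconstruction (non-interpolated inputs).
    x_mix = gamma * x + (1.0 - gamma) * x_rec
    loss_critic = (np.mean((critic(x_interp) - alpha[:, 0]) ** 2)
                   + np.mean(critic(x_mix) ** 2))
    return loss_ae, loss_critic
```

In an actual training loop, the two losses would be minimized alternately, each with respect to its own set of parameters.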
4.2 Clustering phase
For this phase, on top of the pretrained encoder and decoder, we need one additional network. More precisely, we introduce a Discriminator . Similar to the standard GAN framework, the Discriminator distinguishes real data samples from fake ones. This network is parameterized with . Based on our experimental results, the features learned by minimizing the embedded clustering objective function can easily drift when the reconstruction loss is minimized at the same time. To inhibit this implicit strong competition, our strategy transforms the within-network competition into a between-networks one. Therefore, each network is trained independently from the other ones to avoid the drifting effect.
Training the encoder is the main step in our framework. The clustering loss in (10) is inspired by DEC [64]. It refines the clusters by gradually stressing high-confidence assignments. It is worth noting that our methodology can be applied using a different embedded clustering loss. We chose the DEC cost for its simplicity and popularity in the deep clustering community. Unlike DEC, we add a regularization term (second part of (10)). Our regularization penalizes the generation of embedded features that cannot be decoded into realistic data points. This constraint is verified by the discriminator, which leads to rejecting discriminative features that corrupt the clustering space. Hence, we argue that minimizing (10) helps reduce Feature Randomness.
(10) 
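For reference, the DEC-style clustering term can be sketched as below. This is a minimal NumPy illustration that assumes the standard DEC formulation [64] (Student's t soft assignments, a sharpened target distribution, and a KL divergence between the two); the adversarial regularization term of (10) is deliberately left out, and the function names are hypothetical.

```python
import numpy as np

def soft_assign(z, centers, nu=1.0):
    """Soft assignment of embedded points to centers (Student's t kernel)."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / nu) ** (-(nu + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened targets that stress high-confidence assignments."""
    w = q ** 2 / q.sum(axis=0)      # square, then normalize by cluster frequency
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(p, q, eps=1e-12):
    """KL(P || Q) summed over all points and clusters."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

The targets are treated as fixed pseudo-labels between two refreshes, so gradients flow only through the soft assignments.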
As shown by equation (10), our model does not require a balancing hyperparameter. In IDEC, the balancing hyperparameter is critical and hard to tune due to the strong trade-off; in our case, the clustering and regularization terms do not reflect any explicit competition. We provide an experimental study on hyperparameter tuning to validate this hypothesis.
Unlike DEC, where the decoder is discarded straight away, this network plays a pivotal role in our case. It can be seen as a monitor: it allows us to investigate the variations of the embedded representations induced by training the encoder. Hence, we argue that the decoder should be trained as well to catch up with the encoder updates. However, training the decoder as in IDEC would drift the discriminative features learned by the encoder. We propose to restrict the backpropagation of the reconstruction loss to the decoder layers, as shown by equation (11). We argue that such a strategy helps in reducing Feature Drift.
(11) 
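The idea of restricting the reconstruction gradient to the decoder can be illustrated with a toy example. The sketch below assumes a linear encoder and decoder with a squared-error reconstruction loss; all names and sizes are hypothetical. The embedded codes are treated as constants, so only the decoder weights receive an update.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_z = 8, 3
W_enc = rng.normal(scale=0.3, size=(d_in, d_z))  # encoder weights (never updated here)
W_dec = rng.normal(scale=0.3, size=(d_z, d_in))  # decoder weights (updated)

def decoder_only_step(x, W_enc, W_dec, lr=0.01):
    """One gradient step on the reconstruction loss w.r.t. the decoder only."""
    z = x @ W_enc                 # codes are constants: no gradient reaches W_enc
    err = z @ W_dec - x
    loss = np.mean(np.sum(err ** 2, axis=1))
    grad_W_dec = 2.0 * z.T @ err / x.shape[0]
    return W_dec - lr * grad_W_dec, loss
```

In an automatic-differentiation framework, the same effect is usually obtained by stopping the gradient at the encoder output or by passing only the decoder variables to the optimizer.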
Unlike DEC and IDEC, we introduce a discriminator as an additional architectural component. As exhibited by equation (12), the discriminator is supposed to differentiate real data points from those generated randomly. Similar to the decoder, the discriminator should be sufficiently trained before updating the encoder weights at every clustering iteration. An illustration of the clustering phase is given by Figure 5.
(12) 
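As an illustration only, a discriminator objective of this kind can be written as a binary cross-entropy, assuming the standard GAN formulation [21]; the exact form of (12) may differ. The inputs are hypothetical discriminator outputs in [0, 1] for real samples and for generated ones.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Standard GAN discriminator loss (assumed form, not necessarily (12))."""
    d_real = np.clip(d_real, eps, 1.0 - eps)   # avoid log(0)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))
```

The loss approaches zero when real samples are scored near 1 and generated samples near 0, and equals 2 ln 2 when the discriminator outputs 0.5 everywhere.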
At the end of the training process, we observe that the output images are smoother. In addition, they no longer represent a pure reconstruction, even if the decoder is trained for a very large number of iterations. This suggests that the encoder has learned to destroy non-discriminative information. Another interesting observation is that the decoder maps images from the same class to the same blurry output image. This observation suggests that the encoder has learned to collapse within-class variances. These characteristics of our model contrast with the pure reconstruction of IDEC, as illustrated by Figure 6.
4.3 Optimization
ADEC has five kinds of learnable parameters , , , and . All of them are updated using mini-batch SGD and backpropagation. , and are, respectively, the stochastic mini-batch approximation of , and . The gradients are computed following Theorem 2 and Theorem 3. Refer to Appendices B and C for the proofs.
Theorem 2.
The gradient of the loss function w.r.t. the encoded data points is calculated by (13), where denotes the transposed Jacobian matrix of and is the gradient of .
(13) 
Theorem 3.
The gradient of w.r.t. the cluster center is computed following (14).
(14) 
For the clustering phase, we run the optimization for batch iterations or until the clustering assignment variation between two consecutive clustering iterations is lower than . We found empirically that the decoder needs to be trained for a greater number of iterations than the other networks; otherwise, training becomes unstable. Thus, we alternate between training the {Decoder , Encoder , Discriminator } for number of iterations and training the {Decoder } alone also for auxiliary iterations. The target distribution , which is computed based on the predicted clustering assignment distribution , constitutes the support for computing the pseudo-labels. is updated every iterations based on equations (1) and (3). In practice, we refrain from modifying at every single step to avoid instability. The predicted label for a data point is calculated based on the following equation:
(15) 
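The label prediction and the stopping test described above can be sketched as follows, assuming (as in DEC [64]) that the predicted label is the index of the largest soft assignment and that the assignment variation is the fraction of points whose label changed. The function names and the tolerance argument `tol` are hypothetical.

```python
import numpy as np

def predict_labels(q):
    """Hard label of each point: index of its most probable cluster."""
    return np.argmax(q, axis=1)

def assignment_change(prev_labels, new_labels):
    """Fraction of points whose cluster label changed between two checks."""
    return float(np.mean(prev_labels != new_labels))

def should_stop(prev_labels, new_labels, tol):
    """Convergence test: stop when the assignment variation drops below tol."""
    return assignment_change(prev_labels, new_labels) < tol
```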
Our proposed algorithm is summarized in Algorithm 1.
5 Experiments
An extensive experimental protocol is conducted to validate the suitability of ADEC in tackling Feature Randomness and Feature Drift. To this end, we first specify the scope of our experiments and the required experimental configurations.
5.1 Scope of experiments
Deep clustering models differ from each other in five substantial aspects, each of which has been shown to have a significant impact on the clustering quality. The first factor is the architecture. Some studies [8, 31] rely on sophisticated architectures (e.g., ResNet32, AlexNet and VGG) to cluster very large datasets. Other studies [64, 22] leverage fairly sized architectures to cluster sizeable datasets. Likewise, in this paper, we opted for the same architecture used by [64, 22, 32, 65]. The second factor is the integrated prior knowledge (e.g., invariance of image labels to small linear transformations and symmetries). For instance, previous works [27, 23] have shown that data augmentation based on prior knowledge leads to better clustering results. Inspired by these studies, we apply a similar data augmentation technique for image datasets. The third factor is the learning dynamics. This was the focus of the following studies: (1) deep over-clustering [31], which offers two sub-heads, one for grouping the data into more clusters than required and the other for clustering according to the ground-truth number of clusters; (2) deep adaptive clustering [9], which clusters the easy samples first and then gradually supplies the learning model with more difficult ones; and (3) clustering with a dynamic loss function [44], which gradually changes the cost function according to the clustered samples. Unlike these papers, ADEC does not rely on any specific learning dynamics. Finally, the fourth and fifth factors consist in choosing the self-supervised and pseudo-supervised losses, respectively. Jabi et al. [29] proved that, under mild conditions, several pseudo-supervised objective functions are equivalent to each other.
All the previous deep clustering studies revolve around the five mentioned axes. Modifying any one of these factors can improve or worsen the effectiveness and efficiency of the studied deep clustering model. The following experimental protocol aims to show that the trade-off between Feature Randomness and Feature Drift, which was neglected by previous studies, is influential in designing deep clustering models. For this reason, our experiments should include a comparison where all the other factors of quality are kept identical between ADEC and its baselines.
5.2 Experimental Settings
All experiments are conducted on a server with 4 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz, 32 GB of RAM and an NVIDIA Tesla K80 GPU.
5.2.1 Datasets
We evaluate our approach on six benchmark datasets:

MNIST-full [38]: a 10-class database of grayscale handwritten digit images of size each.

MNIST-test: a subset of images of the MNIST-full dataset.

USPS [28]: a 10-class database of grayscale digit images of size each.

Fashion-MNIST [63]: a 10-class database of grayscale images of size each.

REUTERS-10K [39]: a 4-class database (corporate/industrial, government/social, markets and economics) of articles. The most frequent words in all articles are selected. Then, for each article, we compute the TF-IDF features using the selected dictionary.

Mice Protein [25]: an 8-class database of mice samples. The features of this database consist of the expression levels of 77 proteins.
All datasets are normalized before being fed to the clustering models, so that the norm of each data point is approximately equal to 1. For fully connected models, we flatten the input data if its dimension is greater than one.
5.2.2 Baselines
In order to show the effectiveness of the proposed model, ADEC is compared against classical clustering algorithms, subspace clustering algorithms, manifold clustering algorithms, and state-of-the-art deep clustering algorithms. The classical clustering baselines include k-means [42], Gaussian mixture models (GMM) [4], Least Squares Nonnegative Matrix Factorization (LSNMF) [40] and agglomerative clustering (AC) [30]. The subspace clustering methods include Scalable Sparse Subspace Clustering by Orthogonal Matching Pursuit (SSC-OMP) [68] and Scalable Elastic Net Subspace Clustering (EnSC) [67]. The other subspace clustering baselines are not efficient enough to deal with 70,000 samples and are therefore left out. The manifold clustering approaches include normalized-cut spectral clustering (SC) [54] and kernel (RBF) k-means [52]. Finally, the deep clustering algorithms include DeepCluster [8], JULE [66], SR-k-means [29], DEC [64], IDEC [22], DCN [65], VaDE [32] and DEPICT [15]. Our baselines also cover clustering the embedded data of an autoencoder using k-means and FINCH [51], denoted, respectively, by (AE + k-means) and (AE + FINCH). As a side note, all the fully connected baselines share the same architecture with ADEC.
5.2.3 Evaluation Metrics
We adopt the metrics ACC [7], NMI [55], and for assessing the clustering quality. The first two metrics are widely used to compare deep clustering methods. The third and fourth metrics are among the contributions of this work. ACC and NMI lie within the range and and lie within the range . Higher values are better. As shown by (16) and (17), ACC and NMI depend on and , where is a vector representing the predicted labels and is the ground-truth labels vector.
(16) 
 is selected from the set of all possible permutations mapping the predicted clusters to the ground-truth categories. The best matching can be found using the Hungarian algorithm [35].
(17) 
denotes the entropy and is the mutual information.
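Both metrics can be computed with a few lines of code. The sketch below assumes the definitions commonly used in the deep clustering literature: ACC is the best label-matching accuracy over all cluster-to-class permutations, found with the Hungarian algorithm via SciPy's `linear_sum_assignment`, and NMI is the mutual information normalized by the arithmetic mean of the two entropies. The normalization choice and the function names are assumptions, and labels are expected to be non-negative integers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one cluster-to-class mapping."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    counts = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1                        # contingency table
    rows, cols = linear_sum_assignment(-counts)  # negate to maximize matches
    return counts[rows, cols].sum() / y_true.size

def entropy(labels):
    _, n = np.unique(labels, return_counts=True)
    p = n / n.sum()
    return float(-np.sum(p * np.log(p)))

def mutual_information(a, b):
    a, b = np.asarray(a), np.asarray(b)
    mi = 0.0
    for u in np.unique(a):
        for v in np.unique(b):
            p_uv = np.mean((a == u) & (b == v))
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (np.mean(a == u) * np.mean(b == v)))
    return float(mi)

def nmi(y_true, y_pred):
    """NMI with arithmetic-mean normalization (assumed variant)."""
    denom = 0.5 * (entropy(y_true) + entropy(y_pred))
    return mutual_information(y_true, y_pred) / denom if denom > 0 else 0.0
```

Both scores are invariant to a relabeling of the predicted clusters, which is why a prediction that merely permutes the class indices still scores 1.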
5.2.4 Implementation
The encoder has eight layers of dimensions  500  500  2000  10. Apart from the bottleneck layer, all the other ones are activated by ReLU [41]. The decoder is an inverse mapping of the encoding layers 10  2000  500  500  with ReLU activations except for the last layer. We pretrain the autoencoder in competition with a critic for iterations to perform data reconstruction regularized by an adversarially constrained interpolation. The weights are optimized using Adam [33] with a learning rate equal to . , , and (i.e., hyperparameters specific to Adam) have the respective values , , and . Following the ACAI paper [3], and are set equal to and , respectively. For the clustering stage, the encoder, decoder, and discriminator are trained alternately for . The training is stopped before reaching the final iteration if the convergence criterion is met. This criterion is parameterized by a threshold . We update the encoder, decoder and discriminator weights using SGD with a learning rate and momentum . All backpropagation updates are performed on random batches of size 256 for both stages (i.e., pretraining and clustering). ADEC is implemented using Python and TensorFlow [1].
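The fully connected architecture described above can be sketched in plain NumPy as follows. This is an illustration of the layer layout only: the input dimension `d = 784` (a flattened 28 x 28 image), the weight initialization and the helper names are assumptions made for this example, and no training is performed.

```python
import numpy as np

def init_mlp(dims, rng):
    """Create (weights, bias) pairs for consecutive layer sizes in dims."""
    return [(rng.normal(scale=0.01, size=(a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(x, layers):
    """Apply the layers with ReLU on all but the last one."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(0)
d = 784  # hypothetical input dimension
encoder = init_mlp([d, 500, 500, 2000, 10], rng)   # linear bottleneck of size 10
decoder = init_mlp([10, 2000, 500, 500, d], rng)   # mirrored layout, linear output

x = rng.random((256, d))        # one random batch of size 256
z = forward(x, encoder)         # embedded features
x_rec = forward(z, decoder)     # reconstruction
```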
5.3 Results
Our experimental protocol has three parts. In the first part, our model is compared with state-of-the-art clustering algorithms. In the second part, we analyse the ability of our model to tackle Feature Randomness and Feature Drift. In the last part, some qualitative results are exhibited. Before showing our results, we establish some useful notations. In all the following experiments:  indicates OUT_OF_MEMORY, denotes the unsuitability of the algorithm to process one-dimensional data, indicates that the pretraining phase does not support Data Transform and Adversarially Constrained Interpolation, indicates that the pretraining phase does not support Data Transform, * indicates that the evaluated methods share the same pretraining weights, the same architecture, the same learning dynamics and the same clustering loss with ADEC.
5.3.1 Comparison with state-of-the-art approaches
Table 1 reports the evaluation of several clustering approaches, including our proposed method, in terms of ACC and NMI. All baseline methods are tuned according to their default settings. First of all, we observe that state-of-the-art subspace clustering algorithms, such as SSC-OMP and EnSC, are generally not suitable for clustering datasets with semantic similarities (e.g., images, text, sounds). In fact, subspace clustering presumes the data to lie in a union of low-dimensional linear subspaces. However, this assumption does not hold for datasets with clusters lying near nonlinearly shaped manifolds [50]. Secondly, we observe that the manifold clustering approaches have better ACC and NMI values than the classical approaches on some datasets. In fact, for the manifold category, the selection of the nonlinear transform is largely empirical. In particular, no kernel space is sufficiently well suited to effectively cluster any dataset. Thirdly, in most cases, we can see that deep clustering models outperform all the other approaches by a large margin. This observation confirms the suitability of deep clustering when it comes to clustering high-dimensional datasets. Finally, comparing among the deep clustering approaches, we can observe that our method provides the best results on every dataset. In terms of ACC and NMI, ADEC outperforms its state-of-the-art counterpart DEPICT by 2% and 5%, respectively. It is worth noting that DEPICT is the convolutional version of DEC with some minor modifications. In order to understand why our approach performs better, we need to conduct further experiments.
Method  MNIST-full ACC  MNIST-full NMI  MNIST-test ACC  MNIST-test NMI  USPS ACC  USPS NMI  Fashion-MNIST ACC  Fashion-MNIST NMI  REUTERS-10K ACC  REUTERS-10K NMI  Mice Protein ACC  Mice Protein NMI
k-means  0.532  0.500  0.546  0.501  0.668  0.627  0.474  0.512  0.522  0.313  0.342  0.252
GMM  0.433  0.366  0.540  0.493  0.551  0.530  0.556  0.557  0.402  0.375  0.139  1.00
LSNMF  0.540  0.455  0.550  0.463  0.575  0.551  0.549  0.523  0.596  0.361  0.497  0.506
AC  0.621  0.682  0.695  0.711  0.683  0.725  0.500  0.564  0.526  0.365  0.294  0.211
SSC-OMP  0.309  0.315  0.413  0.450  0.477  0.503  0.100  0.007  0.402  0.008  0.152  0.078
EnSC  0.111  0.014  0.603  0.591  0.610  0.684  0.629  0.636  0.401  0.014  0.434  0.347
SC  0.656  0.731  0.660  0.704  0.649  0.794  0.508  0.575  0.402  0.375  0.298  0.268
RBF k-means      0.560  0.523  0.629  0.631      0.499  0.288  0.363  0.269
AE + k-means  0.807  0.730  0.702  0.617  0.720  0.698  0.585  0.614  0.695  0.475  0.238  0.131
AE + FINCH      0.709  0.754  0.704  0.788      0.241  0.414  0.157  0.083
DeepCluster  0.797  0.661  0.854  0.713  0.562  0.540  0.542  0.510  
DCN  0.830  0.810  0.802  0.786  0.688  0.683  0.501  0.558  0.422  0.109  0.197  0.051
DEC  0.863  0.834  0.856  0.830  0.762  0.767  0.518  0.546  0.814  0.598  0.184  0.026
IDEC  0.881  0.867  0.846  0.802  0.761  0.785  0.529  0.557  0.790  0.550  0.196  0.037
SR-k-means  0.939  0.866  0.863  0.873  0.901  0.912  0.507  0.548  
VaDE  0.945  0.876  0.287  0.287  0.566  0.512  0.578  0.630  0.793  0.521  0.139  1.00
JULE  0.964  0.913  0.961  0.915  0.950  0.913  0.563  0.608  
DEPICT  0.965  0.917  0.963  0.915  0.899  0.906  0.392  0.392  
ADEC  0.986  0.961  0.985  0.957  0.981  0.948  0.586  0.662  0.821  0.605  0.500  0.604
Method  MNIST-full ACC  MNIST-full NMI  MNIST-test ACC  MNIST-test NMI  USPS ACC  USPS NMI  Fashion-MNIST ACC  Fashion-MNIST NMI  REUTERS-10K ACC  REUTERS-10K NMI  Mice Protein ACC  Mice Protein NMI
DEC*  0.971  0.929  0.968  0.920  0.963  0.910  0.575  0.589  0.814  0.598  0.267  0.158 
IDEC*  0.982  0.952  0.978  0.944  0.980  0.946  0.575  0.631  0.790  0.550  0.188  0.033 
ADEC  0.986  0.961  0.985  0.957  0.981  0.948  0.586  0.662  0.821  0.605  0.500  0.604 
For a fair comparison with state-of-the-art deep clustering approaches, the baselines need to be reimplemented in a way that neutralizes the factors outside the scope of this article. The reimplemented models share the same deep clustering factors (i.e., architecture, learning dynamics, integrated prior knowledge and clustering loss) with ADEC. From Table 1 and Table 2, we can notice a considerable improvement in terms of ACC and NMI for the modified versions of DEC and IDEC compared with the original ones. More specifically, DEC* outperforms vanilla DEC by a large margin. Similarly, IDEC* surpasses its standard counterpart significantly. The large gap between the ordinary pretrained models (i.e., based on a simple reconstruction) and the modified ones demonstrates the effectiveness of combining Adversarially Constrained Interpolation and Data Transformation as a pretraining strategy. Furthermore, as we can see from Table 2, ADEC* outperforms DEC* and IDEC*. This result suggests that ADEC offers a better trade-off between Feature Randomness and Feature Drift. This hypothesis will be further supported in the subsequent experiments.
In Table 3, we report the execution time of different deep clustering methods. The comparison is limited to deep clustering models; we exclude all the other clustering categories due to their less competitive results in Table 1. As we can see in Table 3, the runtime of ADEC is significantly higher than the execution times of DEC, IDEC, DCN and DeepCluster on all datasets. As it stands, these methods are more efficient than ADEC. However, we can also observe that the execution time of our method is on par with the execution times of DEPICT, SR-k-means and JULE. Interestingly, our algorithm is considerably faster than VaDE on the four image datasets.
Method  MNIST-full  MNIST-test  USPS  Fashion-MNIST  REUTERS-10K  Mice Protein 

DeepCluster  1,375  74  64  1,250     
DCN  640  55  49  732  279  40 
DEC  693  58  53  2,384  105  22 
IDEC  890  349  110  857  97  150 
SR-k-means  14,872  1,657  1,655  4,551     
VaDE  123,000  15,000  13,000  120,000  105  15 
JULE  12,500  3,247  2,540  13,100     
DEPICT  9,561  2,320  1,778  8,581     
ADEC  10,735  10,013  8,445  10,502  669  1,047 
For fairness of comparison and in order to better assess the efficiency of our method, we run our algorithm against the modified versions of DEC and IDEC (same as the previous experiment). Based on Table 3 and Table 4, we can see that the runtimes of DEC* and IDEC* are significantly higher than the execution times of vanilla DEC and vanilla IDEC, respectively. Therefore, we can conclude that DEC* and IDEC* are less efficient than DEC and IDEC, respectively. This conclusion can be explained by the long pretraining phase. Hence, the gain achieved by pretraining with an Adversarially Constrained Interpolation comes at the cost of a higher execution time. We also observe that DEC* and IDEC* are slightly faster than ADEC*. This is expected and can be attributed to the adversarial training of our algorithm.
Method  MNIST-full  MNIST-test  USPS  Fashion-MNIST  REUTERS-10K  Mice Protein 
DEC*  9,667  9,092  7,692  10,840  53  639 
IDEC*  9,556  9,160  7,693  9,623  55  646 
ADEC  10,735  10,013  8,445  10,502  669  1,047 
5.3.2 Feature Randomness and Feature Drift
In this section, our experiments aim to show the ability of ADEC to reach a better trade-off between Feature Randomness and Feature Drift. Therefore, we perform an ablation of our adversarial mechanism. Instead of this mechanism, we regularize the clustering loss with vanilla reconstruction. The obtained model is IDEC*. Then, we compare ADEC with IDEC* in terms of and . As mentioned earlier, both models share the same optimizer, the same pretraining phase and the same embedded clustering loss. The only difference between them is the regularization technique.
The first experiment examines the impact of our adversarial mechanism in reducing Feature Randomness. In this section, we show results for the MNIST dataset. Such results are representative of the general behavior of our approach, and the same conclusion can be drawn on the other datasets. In Figure 7, we draw the evolution of for ADEC and IDEC* during training on MNIST. Based on this figure, we observe that the average value of for ADEC is considerably higher than the one for IDEC*. A higher value means that the gradient of ADEC is a better approximation to the supervised gradient. Hence, this experiment confirms that our adversarial regularization is more suitable for alleviating Feature Randomness than vanilla reconstruction.
The second experiment examines the impact of our adversarial mechanism in reducing Feature Drift. In Figure 8, we draw the evolution of for ADEC and IDEC* during training on MNIST. Based on this figure, we observe that the values of for IDEC* are always negative. This result confirms the strong competition between the gradient of the embedded clustering loss and the gradient of the reconstruction loss. In addition, we observe that the average value of for ADEC is considerably higher than the one for IDEC*. A higher value indicates that the competition between the embedded clustering gradient and the reconstruction gradient is stronger than the competition between the embedded clustering gradient and the adversarial gradient. Hence, this experiment confirms that our adversarial regularization is more suitable for alleviating Feature Drift than vanilla reconstruction.
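The two indicators compare the directions of gradient vectors. As a minimal sketch, assuming that each indicator is the cosine similarity between two gradients taken with respect to the same weights (the precise definitions are given where the metrics are introduced), the quantity can be computed as follows. The function name and the list-of-arrays input format are hypothetical.

```python
import numpy as np

def gradient_cosine(grads_a, grads_b, eps=1e-12):
    """Cosine similarity between two gradients given as lists of arrays."""
    a = np.concatenate([np.ravel(g) for g in grads_a])
    b = np.concatenate([np.ravel(g) for g in grads_b])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

Under this reading, a value close to 1 means that the two losses pull the weights in the same direction, while a negative value means that they compete with each other.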
The third experiment examines the impact of Feature Drift on the learning curves. In Figure 10, we draw the learning curves of ADEC and IDEC*, in terms of ACC and NMI, during training on MNIST. Based on this figure, we observe that the learning curves of ADEC are not only above the learning curves of IDEC*, but also smoother. Zooming in on the learning curves of IDEC*, as illustrated by Figures 11 and 12, shows noticeable fluctuations. However, zooming in on the learning curves of ADEC shows a smooth increase in both metrics. The observed fluctuations for IDEC* can be explained by the competition between the reconstruction and the embedded clustering.
The fourth experiment examines the impact of Feature Drift on the sensitivity to the balancing hyperparameter . In Figure 10, we draw the learning curves of IDEC*, for different values of , during training on MNIST. In our experiments, is selected from the following set . Based on the obtained results, we observe that only one value of the set ( = 0.01) yields an acceptable learning curve. All the other values make the learning curve drop significantly. Hence, we can conclude that IDEC* is very sensitive to the choice of . This result can be explained by Feature Drift (the strong competition between the gradient of the self-supervised loss and the gradient of the pseudo-supervised loss). In contrast, ADEC does not require any balancing hyperparameter.
5.3.3 Qualitative results
In Figure 13, the discriminative ability of ADEC is illustrated by projecting the data in a 2D latent subspace for different datasets. From this figure, we can see that the projected embedded data points are grouped in well-separated clusters.
Figure 14 illustrates the top 10 high-confidence images from each cluster for two datasets, namely, MNIST and Fashion-MNIST. In this figure, images are inserted in decreasing order from left to right according to their distance to their associated clustering centers. Every single row represents a different cluster.
6 Conclusion
In this article, we have proposed an Adversarial Deep Embedded Clustering algorithm. Our method regularizes the clustering loss in a way that alleviates Feature Randomness. To overcome Feature Drift, the strong clustering-reconstruction trade-off has been eliminated. Empirical results have shown that ADEC outperforms state-of-the-art clustering methods in terms of ACC and NMI. Furthermore, our experimental studies have validated that ADEC offers a better trade-off between Feature Drift and Feature Randomness. For ADEC, similar to the most relevant deep clustering models, self-supervision and pseudo-supervision are combined linearly. It would be interesting to study other possible combinations and to find theoretical justifications. Besides, it would be worthwhile to extend the scope of this work by using a more sophisticated architecture (e.g., ResNet32, AlexNet and VGG) to process higher semantic datasets.
References
 [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 [2] David J Bartholomew, Fiona Steele, Jane Galbraith, and Irini Moustaki. Analysis of multivariate social science data. Chapman and Hall/CRC, 2008.
 [3] David Berthelot, Colin Raffel, Aurko Roy, and Ian Goodfellow. Understanding and improving interpolation in autoencoders via an adversarial regularizer. In International Conference on Learning Representations (ICLR), 2019.
 [4] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
 [5] Mohamed Bouguessa and Shengrui Wang. Mining projected clusters in high-dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 21(4):507–522, 2008.
 [6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.
 [7] Deng Cai, Xiaofei He, and Jiawei Han. Locally consistent concept factorization for document clustering. IEEE Transactions on Knowledge and Data Engineering, 23(6):902–913, 2011.
 [8] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018.
 [9] Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pages 5879–5887, 2017.
 [10] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
 [11] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE transactions on pattern analysis and machine intelligence, 17(8):790–799, 1995.
 [12] Trevor F Cox and Michael AA Cox. Multidimensional scaling. Chapman and Hall/CRC, 2000.
 [13] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 551–556. ACM, 2004.
 [14] Chris Ding and Xiaofeng He. K-means clustering via principal component analysis. In Proceedings of the twenty-first international conference on Machine learning, page 29. ACM, 2004.
 [15] Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 5747–5756. IEEE, 2017.
 [16] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
 [17] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In International Conference on Learning Representations (ICLR), 2017.
 [18] David L Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–5596, 2003.
 [19] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In International Conference on Learning Representations (ICLR), 2017.
 [20] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
 [21] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [22] Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Improved deep embedded clustering with local structure preservation. In International Joint Conference on Artificial Intelligence (IJCAI-17), pages 1753–1759, 2017.
 [23] Xifeng Guo, En Zhu, Xinwang Liu, and Jianping Yin. Deep embedded clustering with data augmentation. In Asian Conference on Machine Learning, pages 550–565, 2018.
 [24] Philip Haeusser, Johannes Plapp, Vladimir Golkov, Elie Aljalbout, and Daniel Cremers. Associative deep clustering: Training a classification network with no labels. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2018.
 [25] Clara Higuera, Katheleen J Gardiner, and Krzysztof J Cios. Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome. PLoS ONE, 10(6):e0129126, 2015.
 [26] Chih-Chung Hsu and Chia-Wen Lin. CNN-based joint clustering and representation learning with feature drift compensation for large-scale image data. IEEE Transactions on Multimedia, 20(2):421–429, 2018.
 [27] Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1558–1567. JMLR.org, 2017.
 [28] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on pattern analysis and machine intelligence, 16(5):550–554, 1994.
 [29] Mohammed Jabi, Marco Pedersoli, Amar Mitiche, and Ismail Ben Ayed. Deep clustering: On the link between discriminative models and k-means. arXiv preprint arXiv:1810.04246, 2018.
 [30] Anil K Jain. Data clustering: 50 years beyond K-means. Pattern recognition letters, 31(8):651–666, 2010.
 [31] Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653, 2018.
 [32] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In International Joint Conference on Artificial Intelligence (IJCAI-17), pages 1965–1972, 2017.
 [33] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [34] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [35] Harold W Kuhn. The Hungarian method for the assignment problem. Naval research logistics quarterly, 2(1–2):83–97, 1955.
 [36] Doug Laney. 3D data management: Controlling data volume, velocity and variety. META group research note, 6(70):1, 2001.
 [37] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
 [38] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
 [39] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of machine learning research, 5(Apr):361–397, 2004.
 [40] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural computation, 19(10):2756–2779, 2007.
 [41] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.
 [42] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
 [43] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In International Conference on Learning Representations (ICLR), 2016.
 [44] Nairouz Mrabah, Naimul Mefraz Khan, and Riadh Ksantini. Deep clustering with a dynamic autoencoder. arXiv preprint arXiv:1901.07752, 2019.
 [45] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849–856, 2002.
 [46] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
 [47] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
 [48] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
 [49] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016.
 [50] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
 [51] Saquib Sarfraz, Vivek Sharma, and Rainer Stiefelhagen. Efficient parameter-free clustering using first neighbor relations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2019.
 [52] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299–1319, 1998.
 [53] Sohil Atul Shah and Vladlen Koltun. Deep continuous clustering. arXiv preprint arXiv:1803.01449, 2018.
 [54] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Departmental Papers (CIS), page 107, 2000.
 [55] Alexander Strehl and Joydeep Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec):583–617, 2002.
 [56] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
 [57] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein autoencoders. In International Conference on Learning Representations (ICLR), 2018.
 [58] Kenneth E Train. Discrete choice methods with simulation. Cambridge university press, 2009.
 [59] Elad Tzoreff, Olga Kogan, and Yoni Choukroun. Deep discriminative latent space for clustering. arXiv preprint arXiv:1805.10795, 2018.
 [60] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
 [61] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408, 2010.
 [62] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408, 2010.
 [63] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
 [64] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pages 478–487, 2016.
 [65] Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3861–3870. JMLR.org, 2017.
 [66] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5147–5156, 2016.
 [67] Chong You, Chun-Guang Li, Daniel P Robinson, and René Vidal. Oracle based active set algorithm for scalable elastic net subspace clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3928–3937, 2016.
 [68] Chong You, Daniel Robinson, and René Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3918–3927, 2016.
 [69] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.
 [70] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European conference on computer vision, pages 649–666. Springer, 2016.
 [71] Xiaofeng Zhu, Shichao Zhang, Yonggang Li, Jilian Zhang, Lifeng Yang, and Yue Fang. Low-rank sparse subspace for spectral clustering. IEEE Transactions on Knowledge and Data Engineering, 2018.
Appendix A Proof of Theorem 1
We start by computing:
And since
Therefore
The function can be written as:
According to [14], the function can be written as:
So we obtain
Appendix B Proof of Theorem 2
The loss function can be written as:
Then, we compute and separately.
After substitutions, we have:
Appendix C Proof of Theorem 3
and play symmetric roles in the first part of , and the regularization part does not depend on . Therefore,