A Framework for Deep Constrained Clustering - Algorithms and Advances
Abstract
The area of constrained clustering has been extensively explored by researchers and used by practitioners. Constrained clustering formulations exist for popular algorithms such as k-means, mixture models, and spectral clustering but have several limitations. A fundamental strength of deep learning is its flexibility, and here we explore a deep learning framework for constrained clustering and in particular examine how it can extend the field of constrained clustering. We show that our framework can not only handle standard together/apart constraints (without the well-documented negative effects reported earlier) generated from labeled side information but also more complex constraints generated from new types of side information such as continuous values and high-level domain knowledge. (Source code available at: http://github.com/blueocean92/deep_constrained_clustering)
Keywords:
Constrained Clustering · Deep Learning · Semi-supervised Clustering · Reproducible Research

1 Introduction
Constrained clustering has a long history in machine learning, with many standard algorithms having been adapted to be constrained [3], including EM [2], k-means [25] and spectral methods [26]. The addition of constraints generated from ground truth labels allows a semi-supervised setting to increase accuracy [25] when measured against the ground truth labeling.
However, there are several limitations in these methods, and one purpose of this paper is to explore how deep learning can advance the field beyond what existing methods offer. In particular, we find that existing non-deep formulations of constrained clustering have the following limitations:

Limited Constraints and Side Information. Constraints are limited to simple together/apart constraints typically generated from labels. In some domains, experts may more naturally give guidance at the cluster level or generate constraints from continuous side information.

Negative Effect of Constraints. For some algorithms, though constraints improve performance when averaged over many constraint sets, individual constraint sets can produce results worse than using no constraints [8]. As practitioners typically have only one constraint set, their use can be "hit or miss".

Assumption of Good Features. A core requirement is that good features or a good similarity function for the complex data already exist.
Since deep learning is naturally scalable and able to find useful representations, we focus on the first and second challenges and experimentally explore the third. Though deep clustering with constraints has many potential benefits for overcoming these limitations, it is not without its challenges. Our major contributions in this paper are summarized as follows:

We propose a deep constrained clustering formulation that can not only encode standard together/apart constraints but also new triplet constraints (which can be generated from continuous side information), instance difficulty constraints, and cluster-level balancing constraints (see section 3).

Deep constrained clustering overcomes a long-standing issue with constrained clustering that we reported earlier in PKDD [8], one with profound practical implications: overcoming the negative effects of individual constraint sets.

We show how the benefits of deep learning, such as scalability and end-to-end learning, translate to our deep constrained clustering formulation. We achieve better clustering results than traditional constrained clustering methods (with features generated from an autoencoder) on challenging datasets (see Table 2).
Our paper is organized as follows. First, we introduce the related work in section 2. We then propose four forms of constraints in section 3 and introduce how to train the clustering network with these constraints in section 4. Then we compare our approach to previous baselines and demonstrate the effectiveness of new types of constraints in section 5. Finally, we discuss future work and conclude in section 6.
2 Related Work
Constrained Clustering. Constrained clustering is an important area, and there is a large body of work showing how side information can improve clustering performance [24, 25, 28, 4, 26]. Here the side information is typically labeled data, used to generate pairwise together/apart constraints that partially reveal the ground truth clustering to help the clustering algorithm. Such constraints are easy to encode in matrices and enforce in procedural algorithms, though not without challenges. In particular, we showed [8] that performance improves with larger constraint sets when averaged over many constraint sets generated from the ground truth labeling. However, for a significant fraction (though not the majority) of these constraint sets, performance is worse than using no constraints. We recreate some of these results in Table 2.
Moreover, side information can exist in forms beyond labels (e.g., continuous data), and domain experts can provide guidance beyond pairwise constraints. Some work in the supervised classification setting [14, 21, 20, 10] seeks alternatives such as relative/triplet guidance, but to our knowledge such information has not been explored in the non-hierarchical clustering setting. Complex constraints for hierarchical clustering have been explored [1, 5], but these are tightly tied to the hierarchical structure (e.g., one instance must be joined with another before a third) and do not directly translate to non-hierarchical (partitional) clustering.
Deep Clustering. Motivated by the success of deep neural networks in supervised learning, unsupervised deep learning approaches are now being explored [27, 13, 30, 11, 22]. Some approaches [30, 22] first learn an encoding suitable for a clustering objective and then apply an external clustering method. Our work builds upon the most direct setting [27, 11], which encodes one self-training objective and finds the clustering allocations for all instances within one neural network.
Deep Clustering with Pairwise Constraints. Most recently, semi-supervised clustering networks with pairwise constraints have been explored: [12] uses pairwise constraints to enforce a small divergence between the assignment probability distributions of similar pairs while increasing the divergence between those of dissimilar pairs. However, this approach does not leverage the unlabeled data and hence requires a lot of labeled data to achieve good results. Fogel et al. proposed an unsupervised clustering network [9] that self-generates pairwise constraints from a mutual KNN graph and extends to semi-supervised clustering by using labeled connections queried from a human. However, this method cannot make out-of-sample predictions and requires user-defined parameters for generating constraints from the mutual KNN graph.
3 Deep Constrained Clustering Framework
Here we outline our proposed framework for deep constrained clustering. Our method of adding constraints to and training deep learning can be used for most deep clustering methods (so long as the network has a unit output indicating the degree of cluster membership), and here we choose the popular deep embedded clustering method (DEC [27]). We sketch this method first for completeness.
3.1 Deep Embedded Clustering
We choose to apply our constraint formulation to the deep embedded clustering method DEC [27], which starts by pretraining an autoencoder and then removes the decoder. The remaining encoder (z_i = f(x_i)) is fine-tuned by optimizing an objective which takes z_i and converts it to a soft allocation vector q_i of length k, where q_ij indicates the degree of belief that instance i belongs to cluster j. Then q_i is self-trained on to produce p_i, a unimodal "hard" allocation vector which allocates the instance primarily to only one cluster. We now overview each step.
Conversion of z_i to Soft Cluster Allocation Vector q_i. Here DEC measures the similarity between an embedded point z_i and the cluster centroid μ_j using Student's t-distribution [18]. Note that α is a constant (we set α = 1) and q_ij is a soft assignment:

q_{ij} = \frac{(1 + \|z_i - \mu_j\|^2/\alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j'} (1 + \|z_i - \mu_{j'}\|^2/\alpha)^{-\frac{\alpha+1}{2}}}   (1)
Conversion of Q to Hard Cluster Assignments P. The above normalized similarities between embedded points and centroids can be considered soft cluster assignments q_ij. However, we desire a target distribution P that better resembles a hard allocation vector; with soft cluster frequencies f_j = \sum_i q_{ij}, p_{ij} is defined as:

p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}   (2)
Loss Function. The algorithm's loss function then minimizes the KL divergence between P and Q as follows. Note this is a form of self-training, as we are teaching the network to produce unimodal cluster allocation vectors.

\ell_{KL} = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}   (3)
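The DEC pipeline in Eqs. (1)-(3) can be sketched in NumPy as below; this is an illustrative re-implementation (not the authors' released code), assuming α = 1 as in DEC:

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    # Eq. (1): Student's t-kernel similarity between embedded points
    # z (n x d) and centroids (k x d), normalized per instance.
    dist2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    # Eq. (2): square q, normalize by soft cluster frequency f_j,
    # then renormalize per instance to get a sharpened target P.
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_clustering_loss(p, q):
    # Eq. (3): KL(P || Q) self-training objective.
    return float((p * np.log(p / q)).sum())
```

In training, P is recomputed periodically from the current Q and held fixed while the network minimizes the KL term.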
The DEC method requires the initial centroids μ used to calculate q_ij to be "representative". The initial centroids are set using k-means clustering. However, there is no guarantee that clustering the autoencoder's embedding yields a good result. We believe that constraints can help overcome this issue, which we test later.
3.2 Different Types of Constraints
To enhance the clustering performance and allow for more types of interaction between humans and clustering models, we propose four types of guidance, namely pairwise constraints, instance difficulty constraints, triplet constraints, and cluster-level cardinality constraints, and give examples of each. Whereas traditional constrained clustering methods put constraints on the final clustering assignments, our proposed approach constrains the soft-assignment vector q. A core challenge when adding constraints is keeping the resultant loss function differentiable so we can derive backpropagation updates.
Pairwise Constraints
Pairwise constraints (must-link and cannot-link) are well studied [3], and we showed they are capable of defining any ground truth set partition [7]. Here we show how these pairwise constraints can be added to a deep learning algorithm. We encode the loss for the must-link constraint set ML as:

\ell_{ML} = -\sum_{(a,b) \in ML} \log \sum_j q_{aj} q_{bj}   (4)
Similarly, the loss for the cannot-link constraint set CL is:

\ell_{CL} = -\sum_{(a,b) \in CL} \log \Big( 1 - \sum_j q_{aj} q_{bj} \Big)   (5)

Intuitively speaking, the must-link loss prefers instances with the same soft assignments, and the cannot-link loss prefers the opposite.
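A NumPy sketch of Eqs. (4)-(5), treating the inner product Σ_j q_aj q_bj as the probability that a pair is assigned to the same cluster:

```python
import numpy as np

def mustlink_loss(q, ml_pairs):
    # Eq. (4): negative log-probability that each must-link pair
    # lands in the same cluster under the soft assignments q.
    a, b = zip(*ml_pairs)
    agree = (q[list(a)] * q[list(b)]).sum(axis=1)
    return float(-np.log(agree).sum())

def cannotlink_loss(q, cl_pairs):
    # Eq. (5): negative log-probability that each cannot-link pair
    # lands in different clusters.
    a, b = zip(*cl_pairs)
    agree = (q[list(a)] * q[list(b)]).sum(axis=1)
    return float(-np.log(1.0 - agree).sum())
```

Both losses are differentiable in q, so they back-propagate through the encoder when q is produced by the network.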
Instance Difficulty Constraints
A challenge with self-training in deep learning is that if the initial centroids are incorrect, the self-training can lead to poor results. Here we use constraints to overcome this by allowing the user to specify which instances are easier to cluster (i.e., they belong strongly to only one cluster) and by ignoring difficult instances (i.e., those that belong strongly to multiple clusters).
We encode user supervision with a constraint vector δ ∈ [-1, 1]^n. Let δ_i be an instance difficulty indicator: δ_i > 0 means instance i is easy to cluster, δ_i = 0 means no difficulty information is provided, and δ_i < 0 means instance i is hard to cluster. The loss function is formulated as:

\ell_I = -\sum_i \delta_i \sum_j q_{ij}^2   (6)
The instance difficulty loss function encourages the easier instances to have sparse (peaked) clustering assignments while preventing the difficult instances from having sparse assignments. The absolute value of δ_i indicates the degree of confidence in the difficulty estimate. This loss helps the training process converge faster on easier instances and increases the model's robustness towards difficult instances.
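A minimal sketch of the difficulty loss, assuming the form in Eq. (6) where positive δ rewards a peaked assignment and negative δ penalizes it:

```python
import numpy as np

def difficulty_loss(q, delta):
    # delta_i in [-1, 1]: > 0 easy (reward peaked q_i, since sum_j q_ij^2
    # approaches 1 for one-hot rows), < 0 hard (penalize peaked q_i),
    # 0 means no difficulty information.
    return float(-(delta * (q ** 2).sum(axis=1)).sum())
```

Under this form, marking a confidently clustered instance as easy lowers the loss, while marking it hard raises it, nudging the network away from committing to difficult instances early.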
Triplet Constraints
Although pairwise constraints are capable of defining any ground truth set partition from labeled data [7], in many domains no labeled side information exists or strong pairwise guidance is not available. Thus we seek triplet constraints, weaker constraints that indicate relationships within a triple of instances. Given an anchor instance a, positive instance p and negative instance n, we say that instance a is more similar to p than to n. The loss function over all triplets can be represented as:

\ell_T = \sum_{(a,p,n)} \max\big( g(q_a, q_n) - g(q_a, q_p) + \theta,\; 0 \big)   (7)
where g(q_a, q_b) = \sum_j q_{aj} q_{bj} and θ > 0. A larger value of g(q_a, q_b) represents a larger similarity between a and b. The variable θ controls the gap distance between positive and negative instances. ℓ_T works by pushing the positive instance's assignment closer to the anchor's assignment and preventing the negative instance's assignment from approaching the anchor's assignment.
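The triplet loss of Eq. (7) can be sketched as follows, with the same pair similarity g used for the pairwise losses (the default margin value here is illustrative, not the paper's tuned setting):

```python
import numpy as np

def triplet_loss(q, triplets, theta=0.1):
    # Eq. (7): hinge on g(anchor, negative) - g(anchor, positive) + theta,
    # where g(a, b) = sum_j q_aj * q_bj is the soft probability that
    # the two instances share a cluster.
    loss = 0.0
    for a, p, n in triplets:
        g_ap = float((q[a] * q[p]).sum())
        g_an = float((q[a] * q[n]).sum())
        loss += max(g_an - g_ap + theta, 0.0)
    return loss
```

A triplet contributes nothing once the positive is already more similar to the anchor than the negative by at least the margin θ.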
Global Size Constraints
Experts may more naturally give guidance at a cluster level. Here we explore clustering size constraints, which require each cluster to be approximately the same size. Denoting the total number of clusters as k and the total number of training instances as n, the global size constraint loss function is:

\ell_G = \sum_{j=1}^{k} \Big( \frac{1}{n} \sum_{i=1}^{n} q_{ij} - \frac{1}{k} \Big)^2   (8)
Our global constraint loss function works by minimizing the distance between the expected cluster size and the actual cluster size. The actual cluster size is calculated by averaging the soft assignments. To guarantee the effectiveness of the global size constraints, we must assume that during mini-batch training the batch size is large enough to estimate the cluster sizes. A similar loss function can be used (see section 3.4) to enforce other cardinality constraints on the cluster composition, such as upper and lower bounds on the number of people with a certain property.
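Eq. (8) reduces to a few lines; the sketch below measures each cluster's expected fraction of instances against the balanced target 1/k:

```python
import numpy as np

def global_size_loss(q):
    # Eq. (8): squared deviation of each cluster's expected fraction
    # (mean of the soft assignments over the batch) from 1/k.
    k = q.shape[1]
    sizes = q.mean(axis=0)
    return float(((sizes - 1.0 / k) ** 2).sum())
```

As noted above, the batch must be large enough for `q.mean(axis=0)` to be a reliable estimate of the cluster sizes.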
3.3 Preventing Trivial Solution
In our framework, the must-link constraints mentioned before can lead to a trivial solution in which all instances are mapped to the same cluster. Previous deep clustering methods [30] have also encountered this problem. To mitigate it, we combine the reconstruction loss with the must-link loss during learning. Denoting the encoding network as f and the decoding network as g, the reconstruction loss for instance x_i is:

\ell_R = \ell\big( g(f(x_i)),\; x_i \big)   (9)

where ℓ is the least-squares loss: ℓ(x, y) = ‖x − y‖².
3.4 Extensions to Highlevel Domain KnowledgeBased Constraints
Although most of our proposed constraints are generated from instance labels or comparisons, our framework can be extended to high-level domain-knowledge-based constraints with minor modifications.
Cardinality Constraints. For example, cardinality constraints [6] allow expressing requirements on the number of instances that satisfy some condition in each cluster. Assume we have n people and want to split them into k dinner-party groups. An example cardinality constraint is to enforce that each party has the same number of males and females. We split the people into two groups, M (males) and F (females), in which M ∪ F = {1, …, n} and M ∩ F = ∅. Then the cardinality constraint can be formulated as:

\ell_{Card} = \sum_{j=1}^{k} \Big( \sum_{i \in M} q_{ij} - \sum_{i \in F} q_{ij} \Big)^2   (10)
For upper-bound and lower-bound based cardinality constraints [6], we use the same setting as previously described, but now the constraint requires that the number of males in each party group range from L to U. We can formulate this as:

\ell_{Bound} = \sum_{j=1}^{k} \Big[ \max\Big(0,\; L - \sum_{i \in M} q_{ij}\Big)^2 + \max\Big(0,\; \sum_{i \in M} q_{ij} - U\Big)^2 \Big]   (11)
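One way to express the two cardinality losses in code, treating the column sums of q over a group as that group's expected count in each cluster (a sketch under the Eq. (10)-(11) formulations above, not the authors' exact implementation):

```python
import numpy as np

def balance_cardinality_loss(q, males, females):
    # Eq. (10): expected count of each group in cluster j is the sum of
    # its members' soft assignments; penalize the squared difference.
    m = q[males].sum(axis=0)
    f = q[females].sum(axis=0)
    return float(((m - f) ** 2).sum())

def bounded_cardinality_loss(q, group, lo, hi):
    # Eq. (11): quadratic penalty whenever a cluster's expected group
    # count falls below lo or above hi.
    c = q[group].sum(axis=0)
    under = np.maximum(0.0, lo - c) ** 2
    over = np.maximum(0.0, c - hi) ** 2
    return float((under + over).sum())
```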
Logical Combinations of Constraints. Apart from cardinality constraints, complex logical constraints can also be used to enhance the expressive power for representing knowledge. For example, if two instances a and b are in the same cluster, then instances c and d must be in different clusters. This can be achieved in our framework, as we can dynamically add a cannot-link constraint on (c, d) once we check the soft assignments of a and b.
Consider a horn-form constraint such as Together(a, b) → Apart(c, d). By forward passing the instances through our deep constrained clustering model, we obtain their soft assignment values. By checking whether the premise is satisfied under these assignments, we can decide whether to enforce the cannot-link loss ℓ_CL on (c, d).
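A small sketch of this dynamic enforcement; the argmax test for "currently in the same cluster" is one possible satisfaction check, an assumption on our part rather than a detail fixed by the text:

```python
import numpy as np

def enforce_horn_constraint(q, premise_pair, conclusion_pair):
    # If the premise pair currently shares the same most likely cluster
    # under q, return the cannot-link loss (Eq. 5) for the conclusion
    # pair; otherwise the constraint is inactive and contributes 0.
    a, b = premise_pair
    if int(np.argmax(q[a])) != int(np.argmax(q[b])):
        return 0.0
    c, d = conclusion_pair
    agree = float((q[c] * q[d]).sum())
    return float(-np.log(1.0 - agree))
```

Because the check is re-run each forward pass, the constraint switches on and off as the soft assignments evolve during training.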
4 Putting It All Together  Efficient Training Strategy
Our training strategy consists of two training branches and effectively has two ways of creating mini-batches for training. For instance-difficulty or global-size constraints, we treat their loss functions as additive losses, so no extra branch needs to be created. For pairwise or triplet constraints we build another output branch and train the whole network in an alternating fashion.
Loss Branch for Instance Constraints. In deep learning it is common to add loss functions defined over the same output units. In the Improved DEC method [11], the clustering loss and reconstruction loss were added together. To this we add the instance difficulty loss ℓ_I. This effectively adds guidance that speeds up training convergence by identifying "easy" instances, and increases the model's robustness by ignoring "difficult" instances. Similarly, we treat the global size constraint loss ℓ_G as an additional additive loss. All instances, whether or not they are part of triplet or pairwise constraints, are trained through this branch, and the mini-batches are created randomly.
Loss Branch for Complex Constraints. Our framework uses more complex loss functions, as they define constraints on pairs and even triples of instances. Thus we create another loss branch containing the pairwise loss or triplet loss to help the network tune an embedding that satisfies these stronger constraints. For each constraint type we create a mini-batch consisting of only those instances having that type of constraint. For each example of a constraint type, we feed the constrained instances through the network, calculate the loss and the resulting weight changes, but do not adjust the weights. We sum the weight adjustments over all constraint examples in the mini-batch and then adjust the weights. Hence our method is an example of batch weight updating, as is standard in deep learning for stability reasons. The whole training procedure is summarized in Algorithm 1.
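The batching scheme behind the two branches can be sketched as below; this is an illustrative schedule builder (function name and signature are ours), not the paper's Algorithm 1:

```python
import numpy as np

def training_schedule(n, pairs, triplets, batch_size, rng):
    # One epoch's worth of mini-batches for the two branches:
    # - random batches over all instances for the additive losses
    #   (clustering + reconstruction + difficulty + global size);
    # - constraint-only batches, one list per constraint type, whose
    #   summed weight updates are applied once per batch.
    instance_batches = np.array_split(rng.permutation(n),
                                      max(1, n // batch_size))
    pair_batches = [pairs[i:i + batch_size]
                    for i in range(0, len(pairs), batch_size)]
    triplet_batches = [triplets[i:i + batch_size]
                       for i in range(0, len(triplets), batch_size)]
    return instance_batches, pair_batches, triplet_batches
```

Training then alternates: one step through the instance branch, then one step through each non-empty constraint branch.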
5 Experiments
All data and code used to perform these experiments are available online (http://github.com/blueocean92/deep_constrained_clustering) to help with reproducibility. In our experiments we aim to address the following questions:

How does our endtoend deep clustering approach using traditional pairwise constraints compare with traditional constrained clustering methods? The latter is given the same autoencoding representation used to initialize our method.

Are the new types of constraints we created for deep clustering methods useful in practice?

Is our end-to-end deep constrained clustering method more robust to the well-known negative effects of constraints we published earlier [8]?
5.1 Datasets
To study the performance and generality of different algorithms, we evaluate the proposed method on two image datasets and one text dataset:
MNIST: Consists of handwritten digits of 28-by-28 pixel size. The digits are centered and size-normalized in our experiments [15].
FASHION-MNIST: A dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28-by-28 grayscale image, associated with a label from 10 classes.
REUTERS-10K: This dataset contains English news stories labeled with a category tree [16]. To be comparable with the previous baselines, we used 4 root categories (corporate/industrial, government/social, markets and economics) as labels and excluded all documents with multiple labels. We randomly sampled a subset of 10,000 examples and computed TF-IDF features on the 2,000 most common words.
5.2 Implementation Details
Basic Deep Clustering Implementation. To be comparable with deep clustering baselines, we set the encoder network as a fully connected multilayer perceptron with dimensions d–500–500–2000–10 for all datasets, where d is the dimension of the input features. The decoder network is a mirror of the encoder. All internal layers are activated by the ReLU [19] nonlinearity. For a fair comparison with baseline methods, we used the same greedy layer-wise pretraining strategy to compute the autoencoder's embedding. To initialize the clustering centroids, we run k-means with 20 restarts and select the best solution. We use the Adam optimizer with an initial learning rate of 0.001 for all experiments. We adopt standard metrics for evaluating clustering performance, which measure how close the found clustering is to the ground truth. Specifically, we employ two metrics: normalized mutual information (NMI) [23, 29] and clustering accuracy (Acc) [29]. In our baseline comparisons we use IDEC [11], a recently published non-constrained improved version of DEC.
Pairwise Constraints Experiments. We randomly select pairs of instances and generate the corresponding pairwise constraints between them. To ensure transitivity, we calculate the transitive closure over all must-linked instances and then generate the entailed cannot-link constraints [7]. Since our loss function for must-link constraints is combined with the reconstruction loss, we use grid search to set the penalty weight for the must-link loss.
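The transitive-closure step can be sketched with a small union-find over the must-link pairs; every pair inside a must-link component becomes a must-link, and each cannot-link propagates to every cross pair between the two linked components:

```python
def transitive_closure(ml, cl, n):
    # Union-find over must-link pairs, then entailment: all pairs within
    # a component are must-links; a cannot-link between components
    # entails cannot-links between all of their member pairs.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in ml:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    ml_closed = {(a, b) for g in groups.values()
                 for a in g for b in g if a < b}
    cl_closed = set()
    for a, b in cl:
        for x in groups[find(a)]:
            for y in groups[find(b)]:
                cl_closed.add((min(x, y), max(x, y)))
    return ml_closed, cl_closed
```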
Instance Difficulty Constraints Experiments. To simulate human-guided instance difficulty constraints, we use k-means as a base learner: we mark all incorrectly clustered instances as difficult (negative δ) and all correctly clustered instances as easy (positive δ). In Figure 1 we give some example difficulty constraints found using this method.
Triplet Constraints Experiments. Triplet constraints can state that instance a is more similar to instance p than to instance n. To simulate human guidance on triplet constraints, we randomly select instances as anchors; for each anchor we randomly select two instances and assign the positive and negative roles based on their similarity to the anchor. The similarity is calculated as the Euclidean distance between the two instances' pretrained embeddings. The pretrained embedding is extracted from our deep clustering network trained with pairwise constraints. Figure 2 shows the generated triplet constraints. The triplet loss margin θ is set through grid search.
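The triplet-generation procedure above can be sketched as follows (the function name and sampling details are our own illustration of the described scheme):

```python
import numpy as np

def make_triplets(embed, n_triplets, rng):
    # For each random anchor, draw two other instances and label the one
    # closer in Euclidean embedding distance as positive, the other as
    # negative, mimicking similarity-based human guidance.
    n = len(embed)
    triplets = []
    for _ in range(n_triplets):
        a, i, j = rng.choice(n, size=3, replace=False)
        di = np.linalg.norm(embed[a] - embed[i])
        dj = np.linalg.norm(embed[a] - embed[j])
        p, neg = (i, j) if di <= dj else (j, i)
        triplets.append((int(a), int(p), int(neg)))
    return triplets
```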
Global Size Constraints Experiments. We apply global size constraints to the MNIST and Fashion datasets, since they satisfy the balanced-size assumption. The total number of clusters is set to 10.
5.3 Experimental Results
Experiments on Instance Difficulty. In Table 1, we report the average test performance of the deep clustering framework without any constraints on the left. In comparison, we report the average test performance of the deep clustering framework with instance difficulty constraints on the right, and we find the model learned with instance difficulty constraints outperforms the baseline method on all datasets. This is to be expected, as we have given the algorithm more information than the baseline method, but it demonstrates our method can make good use of this extra information. What is unexpected is how effectively the constraints speed up the learning process; this will be a focus of future work.
Experiments on pairwise constraints
We randomly generate pairs of constraints, a small fraction of the possible pairwise constraints, for MNIST, Fashion and Reuters.
Recall the DEC method is initialized with autoencoder features. To better understand the contribution of pairwise constraints, we tested our method with both autoencoder features and raw data. As can be seen from Figure 3, the clustering performance improves consistently as the number of constraints increases in both settings. Moreover, with relatively few pairwise constraints the performance on Reuters and MNIST increases significantly, especially in the raw-data setup. We also notice that learning from raw data on Fashion achieves a better result than using the autoencoder's features. This shows that the autoencoder's features may not always be suitable for DEC's clustering objective. Overall, our results show pairwise constraints can help reshape the representation and improve the clustering results.
We also compare the results with recent work [12]: our approach (with autoencoder features) outperforms the best clustering accuracy reported for MNIST for 6, 60 and 600 samples per class. Unfortunately, we cannot make a comparison with Fogel's algorithm [9] due to an issue in their code repository.
[Table 2: Clustering accuracy, NMI and negative ratio for Flexible CSP*, COP-KMeans, MPCKMeans and our method on MNIST, Fashion and Reuters. Negative ratios for our method: MNIST 0%, Fashion 6%, Reuters 0%.]
Negative Effects of Constraints. Our earlier work [8] showed that for traditional constrained clustering algorithms, the addition of constraints helps clustering on average, but many individual constraint sets hurt performance in that results are worse than using no constraints. Here we recreate these results, even when these classic methods use autoencoded representations. In Table 2, we report the average performance with randomly generated pairwise constraints. For each dataset, we randomly generated sets of constraints to test the negative effects of constraints [8]. In each run we fixed the random seed and the initial centroids for the k-means-based methods, and for each method we compared the performance of its constrained version against its unconstrained version. We calculate the negative ratio, which is the fraction of times that the unconstrained version produced better results than the constrained version. As can be seen from the table, our proposed method achieves significant improvements over traditional non-deep constrained clustering algorithms [25, 4, 26].
To understand why our method was robust to variations in constraint sets, we visualized the learned embeddings. Figure 4 shows the embedded representation of a random subset of instances and their corresponding pairwise constraints using t-SNE. Based on Figure 4, we can see the autoencoder's embedding is noisy and many constraints are inconsistent according to our earlier definition [8]. Further, visualizing IDEC's latent embedding, we find the clusters are better separated. However, inconsistent constraints still exist (blue lines across different clusters and red lines within a cluster); these constraints tend to have negative effects on traditional constrained clustering methods. Finally, in our method's results the clusters are well separated, the must-links are well satisfied (blue lines within the same cluster) and the cannot-links are well satisfied (red lines across different clusters). Hence we conclude that end-to-end learning can address the negative effects of constraints by simultaneously learning a representation that is consistent with the constraints and clustering the data. This result has profound practical significance, as practitioners typically have only one constraint set to work with.
Experiments on triplet constraints
We experimented on the MNIST and FASHION datasets. Figure 2 visualizes example triplet constraints (based on embedding similarity); note the positive instances are closer to the anchors than the negative instances. In Figure 5, we show the clustering Acc/NMI improves consistently as the number of constraints increases. Comparing with Figure 3, we find that pairwise constraints bring slightly larger improvements; this is likely because our triplet constraints are generated from a continuous domain and no exact together/apart information is encoded in them. Triplet constraints can be seen as a weaker but more general type of constraint.
Experiments on global size constraints
To test the effectiveness of our proposed global size constraints, we experimented on the MNIST and Fashion training sets, since both have balanced cluster sizes (see Figure 6). Note that the ideal size for each cluster is 6,000 (each dataset has 10 classes and 60,000 training instances); we can see that the blue bars are more evenly distributed and closer to the ideal size.
We also evaluate the clustering performance with global constraints on MNIST and Fashion. Comparing to the baselines in Table 1, we find, interestingly, that performance improved slightly on MNIST but dropped slightly on Fashion.
6 Conclusion and Future Work
The area of constrained partitional clustering has a long history and is widely used, but it is typically limited to simple pairwise together and apart constraints. In this paper, we show that deep clustering can be extended to a variety of fundamentally different constraint types, including instance-level (specifying hardness), cluster-level (specifying cluster sizes) and triplet-level constraints.
Our deep learning formulation was shown to advance the general field of constrained clustering in several ways. Firstly, it achieves better experimental performance than well-known k-means, mixture-model and spectral constrained clustering in both an academic setting and a practical setting (see Table 2).
Importantly, our approach does not suffer from the negative effects of constraints [8] as it learns a representation that simultaneously satisfies the constraints and finds a good clustering. This result is quite useful as a practitioner typically has just one constraint set and our method is far more likely to perform better than using no constraints.
Most importantly, we were able to show that our method achieves all of the above while retaining the benefits of deep learning such as scalability, out-of-sample predictions and end-to-end learning. We found that even though standard non-deep-learning methods were given the same representations of the data used to initialize our method, the deep constrained clustering was able to adapt these representations even further. Future work will explore new types of constraints, using multiple constraint types at once, and extensions to other clustering settings.
Acknowledgements
We acknowledge support for this work from a Google Gift entitled: “Combining Symbolic Reasoning and Deep Learning”.
Footnotes
 email: davidson@cs.ucdavis.edu
References
 Bade, K., Nürnberger, A.: Creating a cluster hierarchy under constraints of a partially known hierarchy. In: SIAM, 2008.
 Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised clustering. In: KDD, 2004.
 Basu, S., Davidson, I., Wagstaff, K.: Constrained Clustering: Advances in Algorithms, Theory, and Applications. CRC Press (2008)
 Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-supervised clustering. In: ICML, 2004.
 Chatziafratis, V., Niazadeh, R., Charikar, M.: Hierarchical clustering with structural constraints. arXiv preprint arXiv:1805.09476 (2018)
 Dao, T.B.H., Vrain, C., Duong, K.C., Davidson, I.: A framework for actionable clustering using constraint programming. In: ECAI, 2016.
 Davidson, I., Ravi, S.: Intractability and clustering with constraints. In: ICML, 2007.
 Davidson, I., Wagstaff, K.L., Basu, S.: Measuring constraint-set utility for partitional clustering algorithms. In: PKDD, 2006.
 Fogel, S., Averbuch-Elor, H., Goldberger, J., Cohen-Or, D.: Clustering-driven deep embedding with pairwise constraints. arXiv preprint arXiv:1803.08457 (2018)
 Gress, A., Davidson, I.: Probabilistic formulations of regression with mixed guidance. In: ICDM, 2016.
 Guo, X., Gao, L., Liu, X., Yin, J.: Improved deep embedded clustering with local structure preservation. In: IJCAI, 2017.
 Hsu, Y.C., Kira, Z.: Neural network-based clustering using pairwise constraints. arXiv preprint arXiv:1511.06321 (2015)
 Jiang, Z., Zheng, Y., Tan, H., Tang, B., Zhou, H.: Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148 (2016)
 Joachims, T.: Optimizing search engines using clickthrough data. In: KDD, 2002.
 LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
 Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A new benchmark collection for text categorization research. In: JMLR, 2004.
 Lu, Z., Carreira-Perpinan, M.A.: Constrained spectral clustering through affinity propagation. In: CVPR, 2008.
 Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605 (2008)
 Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: ICML, 2010.
 Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: CVPR, 2015.
 Schultz, M., Joachims, T.: Learning a distance metric from relative comparisons. In: NIPS, 2004.
 Shaham, U., Stanton, K., Li, H., Nadler, B., Basri, R., Kluger, Y.: SpectralNet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587 (2018)
 Strehl, A., Ghosh, J., Mooney, R.: Impact of similarity measures on web-page clustering. In: Workshop on Artificial Intelligence for Web Search (AAAI 2000). vol. 58, p. 64 (2000)
 Wagstaff, K., Cardie, C.: Clustering with instance-level constraints. In: AAAI, 2000.
 Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S., et al.: Constrained k-means clustering with background knowledge. In: ICML, 2001.
 Wang, X., Davidson, I.: Flexible constrained spectral clustering. In: KDD, 2010.
 Xie, J., Girshick, R., Farhadi, A.: Unsupervised deep embedding for clustering analysis. In: ICML, 2016.
 Xing, E.P., Jordan, M.I., Russell, S.J., Ng, A.Y.: Distance metric learning with application to clustering with side-information. In: NIPS, 2003.
 Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization. In: SIGIR, 2003.
 Yang, B., Fu, X., Sidiropoulos, N.D., Hong, M.: Towards k-means-friendly spaces: Simultaneous deep learning and clustering. arXiv preprint arXiv:1610.04794 (2016)