Semi-supervised cross-entropy clustering with information bottleneck constraint
¹ Jagiellonian University
Łojasiewicza 6, 30-348 Krakow, Poland
² Institute for Communications Engineering
Technical University of Munich
Theresienstr. 90, D-80333 Munich, Germany
Abstract
In this paper, we propose a semi-supervised clustering method, CEC-IB, that models data with a set of Gaussian distributions and that retrieves clusters based on a partial labeling provided by the user (partition-level side information). By combining the ideas from cross-entropy clustering (CEC) with those from the information bottleneck method (IB), our method trades between three conflicting goals: the accuracy with which the data set is modeled, the simplicity of the model, and the consistency of the clustering with side information. Experiments demonstrate that CEC-IB has a performance comparable to Gaussian mixture models (GMM) in a classical semi-supervised scenario, but is faster, more robust to noisy labels, automatically determines the optimal number of clusters, and performs well when not all classes are present in the side information. Moreover, in contrast to other semi-supervised models, it can be successfully applied in discovering natural subgroups if the partition-level side information is derived from the top levels of a hierarchical clustering.
1 Introduction
Clustering is one of the core techniques of machine learning and data analysis, and aims at partitioning data sets based on, e.g., the internal similarity of the resulting clusters. While clustering is an unsupervised technique, one can improve its performance by introducing additional knowledge as side information. This is the field of semi-supervised or constrained clustering.
One classical type of side information in clustering are pairwise constraints: human experts determine whether a given pair of data points belongs to the same (must-link) or to different clusters (cannot-link) [7]. Although this approach has received considerable attention in the last decade, the latest reports [48] suggest that in real-life problems it is difficult to answer whether or not two objects belong to the same group without a deeper knowledge of the data set. This is even more problematic as erroneous pairwise constraints can easily lead to contradictory side information [13].
A possible remedy is to let experts categorize a set of data points rather than specifying pairwise constraints. This partition-level side information was proposed in [5] and recently considered in [27]. The concept is related to the partial labeling applied in semi-supervised classification and assumes that a small portion of the data is labeled. In contrast to semi-supervised classification [49, 45], the number of categories is not limited to the true number of classes; in semi-supervised clustering one may discover several clusters among unlabeled data points. Another advantage of partition-level side information is that, in contrast to pairwise constraints, it does not become self-contradictory if some data points are mislabeled.
In this paper, we introduce a semi-supervised clustering method, CEC-IB, based on partition-level side information. CEC-IB combines Cross-Entropy Clustering (CEC) [42, 40, 39], a model-based clustering technique, with the Information Bottleneck (IB) method [43, 11] to build the smallest model that preserves the side information and provides a good model of the data distribution. In other words, CEC-IB automatically determines the required number of clusters to trade between model complexity, model accuracy, and consistency with the side information.
Consistency with side information is ensured by penalizing solutions in which data points from different categories are put in the same cluster. Since modeling a category by multiple clusters is not penalized, one can apply CEC-IB to obtain a fine clustering even if the human expert categorizes the data into only a few basic groups, see Figure 1. Although this type of side information seems to be a perfect use case for cannot-link constraints, the computational cost of introducing side information to CEC-IB is negligible, while the incorporation of cannot-link constraints into similar Gaussian mixture model (GMM) approaches requires the use of graphical models, which involves a high computational cost. CEC-IB thus combines the flexibility of cannot-link constraints with an efficient implementation.
We summarize the main contributions and the outline of our paper:

- We propose a modified Hartigan algorithm to optimize the CEC-IB cost function (Section 3.4). The algorithm has a complexity that is linear in the number of data points in each iteration, and it usually requires fewer iterations than the expectation-maximization (EM) algorithm used for fitting GMMs (Appendix C).

- We perform extensive experiments demonstrating that CEC-IB is more robust to noisy side information (i.e., miscategorized data points) than state-of-the-art approaches to semi-supervised clustering (Section 5.4). Moreover, CEC-IB performs well when not all categories are present in the side information (Section 5.3), even though the true number of clusters is not specified.

- We perform two case studies: In Section 5.6, a human expert provided partition-level side information about the division of chemical compounds into two basic groups (as in Figure 1); CEC-IB discovers natural chemical subgroups more reliably than other semi-supervised methods, even if some labels are misspecified by the expert (Figure 2). The second case study in Section 5.7 applies CEC-IB to image segmentation.
2 Related work
Clustering has been an important topic in machine learning and data analysis for a long time. Various methods were introduced for splitting data into groups, including model-based, distance-based, spectral, fuzzy, and hierarchical methods (see [18, 1] for a survey).
Adding to this diversity of techniques, a large number of specialized types of clustering have been developed. One example is multi-view clustering, which considers gathering information coming from different domains [19]. As another example, complementary or alternative clustering aims at finding groups which provide a perspective on the data that expands on what can be inferred from previous clusterings [14]. Finally, semi-supervised clustering – the problem investigated in this work – makes use of side information to achieve better clustering results or to provide robustness against noisy side information [7].
The traditional approach to incorporate side information into clustering is based on pairwise constraints. The authors of [4] suggested reducing distances between data points with a must-link constraint and adding a dimension for each cannot-link constraint. After updating all other distances to, e.g., satisfy the triangle inequality, the thus obtained pairwise distance matrix can be used for unsupervised clustering. Kamvar et al. [20] considered a similar procedure, taking the pairwise affinity matrix and setting must-links and cannot-links to predefined maximum and minimum values, respectively. Instead of clustering, they applied eigenvector-based classification taking the labeled data as training set. Another spectral technique, proposed in [46], relies on solving a generalized eigenvalue problem. Qian et al. [34] developed a framework for spectral clustering that allows using side information in the form of pairwise constraints, partial labeling, and grouping information. An information-theoretic cost function, squared mutual information, was proposed for semi-supervised clustering in [10]. Also clustering techniques based on nonnegative matrix or concept factorization can incorporate pairwise constraints as regularizers [28].
As mentioned in the introduction, partition-level side information refers to a partial labeling of the data points that need not necessarily consider all classes – the categories provided as side information may be only a subset of classes, or, as in Figure 1, be of a hierarchical nature. In consequence, clustering with partition-level side information differs significantly from a typical semi-supervised classification task, as the clustering algorithm should detect clusters within categories and/or within unlabeled data points. A recent paper using partition-level side information is [27], where the authors add additional dimensions to feature vectors and propose a modification of k-means to cluster data points. In [5], partition-level side information was used to design a better initialization strategy for k-means. Similarly, partition-level side information was used to propose a semi-supervised version of fuzzy c-means [33, 32]. The authors added a regularization term to the fuzzy c-means cost function that penalizes fuzzy clusterings that are inconsistent with the side information. This technique was later combined with feature selection methods [22]. Finally, partition-level side information can be used in density-based clustering such as DBSCAN. Specifically, in [25] the authors proposed an algorithm that sets the parameter defining the neighborhood radius of a data point based on partial labeling.
GMMs can be easily adapted to make use of partition-level side information by combining the classical unsupervised GMM with a supervised one [2, 49]. This approach can be extended to labels with reliability information [12, 17, 9]. Various statistical and machine learning libraries, such as mixmod [24] or bgmm [8], provide implementations of GMMs with partition-level side information.
Also pairwise constraints can be incorporated into GMMs; dependencies between the hidden cluster indicator variables are then usually modeled by a hidden Markov random field. This procedure was adopted, for example, in [36] to account for cannot-link constraints. Must-link constraints were considered by treating all involved data points as a single data point with a higher weight. The parameters of the GMM, which is used for hard or soft clustering, are obtained by a generalized expectation-maximization procedure that requires simplifications or approximations [29, 6, 23]. An overview of GMM-based methods with pairwise constraints can be found in [30].
In contrast to most GMM approaches, our method does not require knowledge of the correct number of clusters; initialized with any (larger) number, CEC-IB reduces the number of clusters for an optimal trade-off between model accuracy, model complexity (i.e., number of clusters), and consistency with the side information.
Our method is closely related to the information bottleneck method, which focuses on lossy compression of data preserving the information about a stochastically related random variable [43, 37]. Modifications of IB were used in consensus clustering [44] or alternative clustering [14]. The mutual information between data points and their clusters, which describes the cost of (lossy) data compression in IB, is replaced in our model by the cross-entropy – see Section 3.2 for more details. Thus, while IB focuses on model simplicity and consistency with side information, CEC-IB adds model accuracy to the cost.
3 Cross-Entropy Clustering with an Information Bottleneck Constraint
We now pave the way for our CEC-IB method. Since our model is related to CEC, we first review its basics in Section 3.1. For completely labeled data, i.e., for the case where all data points are labeled, we then introduce our CEC-IB model based on ideas from IB in Section 3.2. Section 3.3 extends the analysis to the case where only some data points are labeled. We conclude this section by presenting and analyzing a clustering algorithm that finds a local optimum of our CEC-IB cost function.
3.1 Cross-entropy clustering
CEC is a model-based clustering method that minimizes the empirical cross-entropy between a finite data set X\subset\mathbb{R}^{N} and a parametric mixture of densities [42]. This parametric mixture is a subdensity (i.e., f(x)\geq 0 and \int_{\mathbb{R}^{N}}f(x)\,dx\leq 1) given by
f=\max(p_{1}f_{1},\ldots,p_{k}f_{k}) 
where p_{1} through p_{k} are nonnegative weights summing to one and where f_{1} through f_{k} are densities from the Gaussian family \mathcal{G} of probability distributions on \mathbb{R}^{N}. The empirical crossentropy between X and subdensity f is
H^{\times}(X\|f)=-\frac{1}{|X|}\sum_{x\in X}\log f(x)=-\frac{1}{|X|}\sum_{i=1}^{k}\sum_{x\in Y_{i}}\log(p_{i}f_{i}(x))
where
\mathcal{Y}=\{Y_{1},\ldots,Y_{k}\},\quad Y_{i}:=\{x\in X:\ p_{i}f_{i}(x)=\max_{j}p_{j}f_{j}(x)\}  (1)
is a partition of X induced by the subdensity f. Letting
\mu_{Y_{i}}=\frac{1}{|Y_{i}|}\sum_{x\in Y_{i}}x,\qquad\Sigma_{Y_{i}}=\frac{1}{|Y_{i}|}\sum_{x\in Y_{i}}(x-\mu_{Y_{i}})(x-\mu_{Y_{i}})^{T}
be the sample mean vector and sample covariance matrix of cluster Y_{i}, we show in Appendix A that CEC looks for a clustering \mathcal{Y} that minimizes the following cost:
H^{\times}(X\|f)=H(\mathcal{Y})+\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}H(\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}})),  (2)
where the model complexity is measured by the Shannon entropy of the partition \mathcal{Y},
H(\mathcal{Y}):=-\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}\log\frac{|Y_{i}|}{|X|},
and where the model accuracy, i.e., accuracy of density estimation in cluster Y_{i}, is measured by the differential entropy of the Gaussian density f_{i},
H(\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}}))=\tfrac{N}{2}\ln(2\pi e)+\tfrac{1}{2}\ln\det(\Sigma_{Y_{i}})=\min_{f_{i}\in\mathcal{G}}H^{\times}(Y_{i}\|f_{i}).
The main difference between CEC and GMM-based clustering lies in substituting a mixture density f=p_{1}f_{1}+\cdots+p_{k}f_{k} by a subdensity f=\max(p_{1}f_{1},\ldots,p_{k}f_{k}). This modification makes it possible to obtain a closed-form solution for the mixture parameters given a fixed partition \mathcal{Y}, while for a fixed mixture the partition \mathcal{Y} is given by (1). This suggests a heuristic similar to the k-means method. In consequence, CEC might yield a slightly worse density estimate of the data than GMM, but converges faster (see Section 3.4 and Appendix C), while the experimental results show that the clustering effects are similar.
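To make the cost (2) concrete, the following is a minimal numerical sketch (the helper names are ours, not from the paper) that evaluates the CEC cost of a given hard partition, fitting maximum-likelihood Gaussian parameters per cluster:

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of N(mu, cov) in nats; the mean does not matter."""
    cov = np.atleast_2d(cov)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * cov.shape[0] * np.log(2 * np.pi * np.e) + 0.5 * logdet

def cec_cost(X, assign):
    """CEC cost (2): partition entropy plus weighted Gaussian entropies,
    with ML (mean, covariance) estimators fitted separately per cluster."""
    n = len(X)
    cost = 0.0
    for c in np.unique(assign):
        Y = X[assign == c]
        p = len(Y) / n                   # cluster weight |Y_i| / |X|
        cov = np.cov(Y.T, bias=True)     # ML (biased) sample covariance
        cost += -p * np.log(p) + p * gaussian_entropy(cov)
    return cost
```

For a single cluster the partition-entropy term vanishes (p = 1), and the cost reduces to the Gaussian differential entropy of the whole data set.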
3.2 CEC-IB with completely labeled data
We now introduce CEC-IB for completely labeled data (i.e., all data points are labeled) by combining the ideas from model-based clustering with those from the information bottleneck method. We also show that under some assumptions CEC-IB admits an alternative derivation based on the conditional cross-entropy given the side information.
Definition 3.1.
Let X be a finite data set and let X_{\ell}\subseteq X denote the set of labeled data points. The partition-level side information is a partition \mathcal{Z}=\{Z_{1},\ldots,Z_{m}\} of X_{\ell}, where every Z_{j}\in\mathcal{Z} contains all elements of X_{\ell} with the same label.
To make this definition clear, suppose that \mathcal{X}=\{X_{1},X_{2},\dots,X_{l}\} is the true partition of the data that we want to recover, i.e., we want to obtain \mathcal{Y}=\mathcal{X}. The partition-level side information \mathcal{Z} can take several possible forms, including:

- |\mathcal{Z}|=l, and Z_{j}\subseteq X_{j} for j=1,\dots,l. This is equivalent to the notion of partial labeling in semi-supervised classification.

- |\mathcal{Z}|=m<l and for every j=1,\dots,m there is a different i such that Z_{j}\subseteq X_{i}. This is the case where only some of the true clusters are labeled.

- |\mathcal{Z}|=m<l and there are m disjoint sets I_{j}\subset\{1,\dots,l\} such that Z_{j}\subset\bigcup_{i\in I_{j}}X_{i}. This is the case where the labeling is derived from a higher level of the hierarchical true clustering (cf. Figure 1).
For the remainder of this subsection, we assume that the side information is complete, i.e., that each data point x\in X is labeled with exactly one category. In other words, X_{\ell}=X and \mathcal{Z} is a partition of X. We drop this assumption in Section 3.3, where we consider partial labeling, i.e., X_{\ell}\subseteq X.
Our effort focuses on finding a partition that is consistent with side information:
Definition 3.2.
Let X be a finite data set and let X_{\ell}\subseteq X be the set of labeled data points that is partitioned into \mathcal{Z}=\{Z_{1},\ldots,Z_{m}\}. We say that a partition \mathcal{Y}=\{Y_{1},\dots,Y_{k}\} of X is consistent with \mathcal{Z}, if for every Y_{i} there exists at most one Z_{j} such that Z_{j}\cap Y_{i}\neq\emptyset.
The definition of consistency generalizes the refinement relation between partitions of the same set. If, as in this section, X_{\ell}=X, then \mathcal{Y} is consistent with \mathcal{Z} if and only if \mathcal{Y} is a refinement of \mathcal{Z}. In other words, a clustering \mathcal{Y} is consistent with \mathcal{Z} if every Y_{i}\in\mathcal{Y} contains elements from at most one category Z_{j}\in\mathcal{Z}. Mathematically, for a clustering \mathcal{Y} consistent with \mathcal{Z} we have
\forall Y_{i}\in\mathcal{Y}:\ \exists!\,j^{\prime}=j^{\prime}(i):\ Z_{j}\cap Y_{i}=\begin{cases}Y_{i}&j=j^{\prime}\\ \emptyset&\text{else}.\end{cases}  (3)
Thus, for a consistent clustering \mathcal{Y} the conditional entropy H(\mathcal{Z}\mid\mathcal{Y}) vanishes:
H(\mathcal{Z}\mid\mathcal{Y})=\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}H(\mathcal{Z}\mid Y_{i})=-\sum_{i=1}^{k}\sum_{j=1}^{m}\frac{|Z_{j}\cap Y_{i}|}{|X|}\log\left(\frac{|Z_{j}\cap Y_{i}|}{|Y_{i}|}\right)\stackrel{(a)}{=}-\sum_{i=1}^{k}\frac{|Z_{j^{\prime}(i)}\cap Y_{i}|}{|X|}\log\left(\frac{|Y_{i}|}{|Y_{i}|}\right)=0,
where (a) is due to (3).
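As a sanity check, the consistency condition (3) can also be verified directly from the data: a clustering is consistent exactly when no cluster meets two different categories. A minimal sketch (the helper name is hypothetical):

```python
def is_consistent(clusters, category):
    """Definition 3.2: every cluster may intersect at most one category.
    `clusters` is a list of point-id lists; `category[x]` is None if x
    is unlabeled (unlabeled points never violate consistency)."""
    for Y in clusters:
        labels_in_Y = {category[x] for x in Y if category[x] is not None}
        if len(labels_in_Y) > 1:
            return False
    return True
```

Note that consistency is one-directional: a cluster may contain unlabeled points and points of a single category, but never points of two categories.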
The conditional entropy H(\mathcal{Z}\mid\mathcal{Y}) therefore is a measure of consistency with side information: the smaller the conditional entropy, the higher the consistency. We thus propose the following cost function for CEC-IB in the case of complete side information, i.e., when X_{\ell}=X and \mathcal{Z} is a partition of X:
\mathrm{E}_{\beta}(X,\mathcal{Z};\mathcal{Y}):=H(\mathcal{Y})+\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}H(\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}}))+\beta H(\mathcal{Z}\mid\mathcal{Y}),\text{ where }\beta\geq 0.  (4)
The first two terms are the CEC cost function (2), and the last term H(\mathcal{Z}\mid\mathcal{Y}) penalizes clusterings \mathcal{Y} that are not consistent with the side information \mathcal{Z}. Thus CEC-IB aims at finding the minimal number of clusters needed to model the data set distribution and to preserve the consistency with the side information. The weight parameter \beta trades between these objectives; we analyze rationales for selecting this parameter in Section 4.
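The penalty term is straightforward to compute from co-occurrence counts. A short sketch for the fully labeled case (illustrative names, assuming natural logarithms as elsewhere in the paper):

```python
import math
from collections import Counter

def consistency_penalty(assign, labels):
    """H(Z | Y) for fully labeled data: zero iff every cluster contains
    points of a single category, larger the more categories are mixed."""
    n = len(assign)
    joint = Counter(zip(assign, labels))   # counts |Z_j ∩ Y_i|
    cluster = Counter(assign)              # counts |Y_i|
    return -sum(c / n * math.log(c / cluster[y]) for (y, _), c in joint.items())
```

For example, merging two equally sized categories into one cluster yields a penalty of log 2 nats, while any consistent clustering yields 0.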
Our cost function (4) is intricately connected to the IB and related methods. In the notation of this work, i.e., in terms of partitions rather than random variables, the IB cost function is given as [43]
I(X;\mathcal{Y})-\beta I(\mathcal{Y};\mathcal{Z})=H(\mathcal{Y})-H(\mathcal{Y}\mid X)-\beta H(\mathcal{Z})+\beta H(\mathcal{Z}\mid\mathcal{Y}).
Noticing that H(\mathcal{Z}) does not depend on the clustering \mathcal{Y}, the main difference between IB and CEC-IB is that CEC-IB incorporates a term accounting for the modeling accuracy in each cluster, while IB adds a term related to the “softness” of the clustering: since H(\mathcal{Y}\mid X) is minimized for deterministic, i.e., hard clusterings, IB implicitly encourages soft clusters. A version of IB ensuring deterministic clusters was recently introduced in [41]. The cost function of this method dispenses with the term related to the softness of the clusters, leading to a clustering method minimizing
H(\mathcal{Y})+\beta H(\mathcal{Z}\mid\mathcal{Y}).
Our CEC-IB method can thus be seen as deterministic IB with an additional term accounting for model accuracy. CEC-IB can therefore be considered a model-based version of the information bottleneck method.
We end this subsection by showing that under some assumptions, one can arrive at the CEC-IB cost function (with \beta=1) in a slightly different way, by minimizing the conditional cross-entropy function:
Theorem 3.1.
Let X be a finite data set that is partitioned into \mathcal{Z}=\{Z_{1},\ldots,Z_{m}\}. Minimizing the CEC-IB cost function (4), for \beta=1, is equivalent to minimizing the conditional cross-entropy function:
H^{\times}((X\|f)\mid\mathcal{Z}):=\sum_{j=1}^{m}\frac{|Z_{j}|}{|X|}H^{\times}(Z_{j}\|f_{j}),
where
f_{Z_{j}}:=f_{j}=\max(p_{1}(j)f_{1},\ldots,p_{k}(j)f_{k}) 
is the conditional density of f given the j-th category and where p_{1}(j),\ldots,p_{k}(j) are nonnegative weights summing to one.
The proof of this theorem is given in Appendix B. It essentially states that our cost function ensures that each category Z_{j} is modeled by a parametric mixture of densities that is both simple and accurate. We believe that this view on the problem can lead to the development of a clustering algorithm slightly different from the one presented in this paper.
3.3 CEC-IB with partially labeled data
The previous section assumed that all data points in X were labeled, i.e., the partition-level side information \mathcal{Z}=\{Z_{1},\ldots,Z_{m}\} was a partition of X. In this section, we relax this assumption and assume that only a subset X_{\ell}\subseteq X is labeled. In this case \mathcal{Z} is a partition only of X_{\ell}, and in consequence, the conditional entropy H(\mathcal{Z}\mid\mathcal{Y}) from the previous subsection is undefined.
To deal with this problem, let \mathcal{L}=\{X_{\ell},X\setminus X_{\ell}\} denote the partition of X into labeled and unlabeled data. We decompose the conditional entropy of \mathcal{Z} given partitions \mathcal{Y} and \mathcal{L} as
H(\mathcal{Z}\mid\mathcal{Y},\mathcal{L})=\frac{|X_{\ell}|}{|X|}H(\mathcal{Z}\mid\mathcal{Y},X_{\ell})+\frac{|X\setminus X_{\ell}|}{|X|}H(\mathcal{Z}\mid\mathcal{Y},X\setminus X_{\ell}),  (5)
where
H(\mathcal{Z}\mid\mathcal{Y},X_{\ell})=\sum_{i=1}^{k}\frac{|Y_{i}\cap X_{\ell}|}{|X_{\ell}|}H(\mathcal{Z}\mid Y_{i}\cap X_{\ell})=-\sum_{i=1}^{k}\frac{|Y_{i}\cap X_{\ell}|}{|X_{\ell}|}\sum_{j=1}^{m}\frac{|Y_{i}\cap Z_{j}|}{|Y_{i}\cap X_{\ell}|}\log\frac{|Y_{i}\cap Z_{j}|}{|Y_{i}\cap X_{\ell}|}.
Let us now assume that the partition-level side information is a representative sample of the true categories. In other words, assume that the probability for a category of an unlabeled data point given the cluster equals the empirical probability of this category for labeled data points in this cluster. To formalize this reasoning, we view the partition \mathcal{Z} as a random variable that takes values in \{1,\dots,m\}. Our labeled data set X_{\ell} corresponds to realizations of this random variable, i.e., for every x\in X_{\ell}, the corresponding random variable \mathcal{Z} assumes the value indicated by the labeling. Since the side information was assumed to be representative, the relative fraction of data points in cluster Y_{i} assigned to category Z_{j} gives us an estimate of the true underlying probability; we extrapolate this estimate to unlabeled data points and put
\mathbf{P}(\mathcal{Z}=j\mid Y_{i}\cap(X\setminus X_{\ell}))=\mathbf{P}(\mathcal{Z}=j\mid Y_{i}\cap X_{\ell})=\frac{|Y_{i}\cap Z_{j}|}{|Y_{i}\cap X_{\ell}|}=\mathbf{P}(\mathcal{Z}=j\mid Y_{i}).
Hence, H(\mathcal{Z}\mid Y_{i}\cap(X\setminus X_{\ell}))=H(\mathcal{Z}\mid Y_{i}\cap X_{\ell}) for every Y_{i}, and we get for (5):
H(\mathcal{Z}\mid\mathcal{Y})=H(\mathcal{Z}\mid\mathcal{Y},\mathcal{L})  (6)
=\frac{|X_{\ell}|}{|X|}H(\mathcal{Z}\mid\mathcal{Y},X_{\ell})+\frac{|X\setminus X_{\ell}|}{|X|}H(\mathcal{Z}\mid\mathcal{Y},X\setminus X_{\ell})  (7)
=\frac{|X_{\ell}|}{|X|}\sum_{i=1}^{k}\frac{|Y_{i}\cap X_{\ell}|}{|X_{\ell}|}H(\mathcal{Z}\mid Y_{i}\cap X_{\ell})+\frac{|X\setminus X_{\ell}|}{|X|}\sum_{i=1}^{k}\frac{|Y_{i}\cap(X\setminus X_{\ell})|}{|X\setminus X_{\ell}|}H(\mathcal{Z}\mid Y_{i}\cap X_{\ell})  (8)
=\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}H(\mathcal{Z}\mid Y_{i}\cap X_{\ell}),  (9)
where the first equality follows because the conditional entropy H(\mathcal{Z}\mid\mathcal{Y},\mathcal{L}) does not depend on the partition \mathcal{L}.
With this, we define the CEC-IB cost function for a model with partition-level side information:
Definition 3.3.
(CEC-IB cost function) Let X be a finite data set and let X_{\ell}\subseteq X be the set of labeled data points that is partitioned into \mathcal{Z}=\{Z_{1},\ldots,Z_{m}\}. The cost of clustering X into the partition \mathcal{Y}=\{Y_{1},\ldots,Y_{k}\} for a given parameter \beta\geq 0 equals
\mathrm{E}_{\beta}(X,\mathcal{Z};\mathcal{Y}):=H(\mathcal{Y})+\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}\left(H(\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}}))+\beta H(\mathcal{Z}\mid Y_{i}\cap X_{\ell})\right).  (10)
To shorten the notation we sometimes write \mathrm{E}_{\beta}(\mathcal{Y}) assuming that X and \mathcal{Z} are fixed.
Note that for complete side information, i.e., for X_{\ell}=X, we recover precisely the cost function (4) obtained in the previous subsection.
3.4 Optimization algorithm
The CEC-IB cost function can be optimized similarly to the classical CEC method, using the Hartigan approach [15].
Let X be a finite data set and let X_{\ell}\subseteq X be the set of labeled data points that is partitioned into \mathcal{Z}=\{Z_{1},\ldots,Z_{m}\}. The entire procedure consists of two steps: initialization and iteration. In the initialization step, a partition \mathcal{Y} is created randomly; f_{i}=\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}}) are the Gaussian maximum likelihood estimators on Y_{i}. In the iteration stage, we go over all data points and reassign each of them to the cluster that decreases the CEC-IB cost (10) the most. After each reassignment, the cluster densities f_{i} are reparameterized by the maximum likelihood estimators of the new clusters, and the cardinalities of the categories |Z_{j}\cap Y_{i}| are recalculated. If no cluster membership changes, the method terminates with the partition \mathcal{Y}.
Note that this procedure automatically removes unnecessary clusters thanks to the term H(\mathcal{Y}), which is the cost of cluster identification. If the method is initialized with more clusters than necessary, some clusters will lose data points to other clusters in order to reduce H(\mathcal{Y}), and the corresponding clusters may finally disappear altogether (e.g., once the number of data points contained in such a cluster falls below a predefined threshold).
To describe the algorithm in detail, let us denote the cost of a single cluster Y\subset X by
\mathrm{E}_{\beta}(Y):=\frac{|Y|}{|X|}\left(-\ln\frac{|Y|}{|X|}+H(\mathcal{N}(\mu_{Y},\Sigma_{Y}))+\beta H(\mathcal{Z}\mid Y\cap X_{\ell})\right),  (11)
assuming that X, \mathcal{Z}, and \beta are fixed. Then, for a given partition \mathcal{Y} of X, the minimal value of the CEC-IB cost function equals:
\mathrm{E}_{\beta}(X,\mathcal{Z};\mathcal{Y})=\sum_{i=1}^{k}\mathrm{E}_{\beta}% (Y_{i}). 
Making use of the above notation, the algorithm can be written as follows:
The outlined algorithm is not deterministic and its results depend on the randomly chosen initial partition. Therefore, the algorithm can be restarted multiple times to avoid getting stuck in bad local minima.
One may think that the recalculation of the models and the evaluation of the cost in lines 15 and 18 is computationally complex. Looking at (11), one can see that evaluating the cost for a given cluster requires recomputing the sample mean vector and sample covariance matrix, which, according to [42, Theorem 4.3], has a complexity quadratic in the dimension N of the data set. Computing the determinant of the sample covariance matrix can be done with cubic complexity. Moreover, computing the conditional entropy of \mathcal{Z} given the current cluster Y is linear in the number m of categories; if the selected data point x is not labeled, then there is no cost at all for computing these terms, since they cancel in the difference in line 15. Since in each iteration all data points have to be visited and, for each data point, all clusters have to be tested, one arrives at a computational complexity in the order of \mathcal{O}(nk(N^{3}+m)) per iteration. In comparison, Lloyd’s algorithm for k-means has a complexity of \mathcal{O}(nkN) in each iteration, and the expectation-maximization (EM) algorithm to fit a GMM has a complexity of \mathcal{O}(nkN^{2}) [35, p. 232]. Note, however, that neither classical k-means nor EM is designed to deal with side information; hence, the complexity of semi-supervised algorithms is in general larger. In particular, the addition of cannot-link constraints to GMMs can involve a high computational cost. Moreover, in Appendix C we provide experimental evidence that the proposed Hartigan algorithm converges faster than Lloyd’s algorithm or EM, because the model is reparameterized after each switch.
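The Hartigan-style iteration described in prose above can be sketched as follows. This is an illustrative, unoptimized version under our own naming, not the paper's implementation: it recomputes cluster costs from scratch at every candidate move (rather than using the incremental updates that give the linear per-iteration complexity), initializes with a balanced round-robin partition instead of a random one, and simply rejects moves that would make a cluster degenerate (singular covariance).

```python
import numpy as np

def cluster_cost(idx, X, labels, n, beta):
    """Per-cluster cost (11); labels[i] == -1 marks an unlabeled point."""
    Y = X[idx]
    p = len(Y) / n
    _, logdet = np.linalg.slogdet(np.atleast_2d(np.cov(Y.T, bias=True)))
    h_gauss = 0.5 * X.shape[1] * np.log(2 * np.pi * np.e) + 0.5 * logdet
    lab = labels[idx]
    lab = lab[lab >= 0]                       # keep labeled points only
    h_side = 0.0
    if lab.size:
        q = np.bincount(lab) / lab.size       # empirical category frequencies
        q = q[q > 0]
        h_side = float(-(q * np.log(q)).sum())
    return p * (-np.log(p) + h_gauss + beta * h_side)

def cecib_cost(assign, X, labels, beta):
    """Total cost (10) as the sum of per-cluster costs."""
    n = len(X)
    return sum(cluster_cost(np.where(assign == c)[0], X, labels, n, beta)
               for c in np.unique(assign))

def hartigan_cecib(X, labels, k=4, beta=1.0, max_iter=50):
    assign = np.arange(len(X)) % k            # balanced initial partition
    for _ in range(max_iter):
        changed = False
        for i in range(len(X)):
            start = assign[i]
            candidates = list(np.unique(assign))
            best_c, best = start, cecib_cost(assign, X, labels, beta)
            for c in candidates:
                if c == start:
                    continue
                assign[i] = c                 # tentative reassignment
                cand = cecib_cost(assign, X, labels, beta)
                if np.isfinite(cand) and cand < best - 1e-12:
                    best_c, best = c, cand
            assign[i] = best_c                # keep the best move found
            changed |= best_c != start
        if not changed:                       # no membership changed: done
            break
    return assign
```

Since every accepted move strictly decreases the cost, the final cost never exceeds that of the initial partition, and clusters that lose all their points simply vanish.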
4 Selection of the weight parameter
In this section we discuss the selection of the weight parameter \beta, which trades between model complexity, model accuracy, and consistency with side information. Trivially, for \beta=0 we obtain pure model-based clustering, i.e., the CEC method, while for \beta\to\infty model fitting becomes irrelevant and the obtained clustering is fully consistent with the reference labeling.
Our first theoretical result states that for \beta=1 the algorithm tends to create clusters that are fully consistent with the side information. Before proceeding, we introduce the following definitions:
Definition 4.1.
Let X be a data set and let \mathcal{Y}=\{Y_{1},\dots,Y_{k}\} be a partition of X. Let further \mathcal{Z}=\{Z_{1},\dots,Z_{m}\} be a partition of X_{\ell}\subseteq X. We say that \mathcal{Y} is a coarsening of \mathcal{Z}, if for every Z_{j} there exists a cluster Y_{i} such that Z_{j}\subseteq Y_{i}.
We say that the partition \mathcal{Y} is proportional to \mathcal{Z}, if the fraction of data points in each cluster equals the fraction of labeled data points in this cluster, i.e., if \frac{|Y_{i}|}{|X|}=\sum_{j=1}^{m}\frac{|Z_{j}\cap Y_{i}|}{|X_{\ell}|}=\frac{|Y_{i}\cap X_{\ell}|}{|X_{\ell}|}.
Proportionality is required in the proofs below since it admits applying the chain rule of entropy to H(\mathcal{Z}\mid\mathcal{Y}) even in the case where X_{\ell}\subset X. In other words, if \mathcal{Y} is proportional to \mathcal{Z}, then (see Appendix D for the proof):
H(\mathcal{Z}\mid\mathcal{Y})+H(\mathcal{Y})=H(\mathcal{Z},\mathcal{Y}).
Every coarsening of a proportional partition \mathcal{Y} is proportional. Trivially, if X_{\ell}=X, then every partition \mathcal{Y} is proportional to \mathcal{Z}. Note, however, that for finite data sets with X_{\ell}\subset X, it may happen that no partition \mathcal{Y} is proportional to the side information \mathcal{Z} (e.g., if all but one data point are labeled). Nevertheless, the following theorems remain valid as guidelines for parameter selection.
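The role of proportionality can be checked numerically. In the following hypothetical toy example (eight points, half of them labeled, each cluster containing labeled points in the same proportion as all points), the chain rule H(Z|Y) + H(Y) = H(Z,Y) holds with H(Y) computed over all points, H(Z|Y) computed per (9), and the joint entropy computed over the labeled points only:

```python
import math
from collections import Counter

def H(ps):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# Toy data: clusters over all 8 points; labels (None = unlabeled).
# Proportional: each cluster holds 1/2 of all points and 1/2 of labeled ones.
y = [0, 0, 0, 0, 1, 1, 1, 1]
z = ['a', 'b', None, None, 'a', 'b', None, None]

n = len(y)
labeled = [(yi, zi) for yi, zi in zip(y, z) if zi is not None]
nl = len(labeled)

H_Y = H([c / n for c in Counter(y).values()])           # over all points
H_ZY = H([c / nl for c in Counter(labeled).values()])   # joint, labeled only
H_Z_given_Y = 0.0                                       # per (9)
for yi, cy in Counter(y).items():
    labs = [zi for yj, zi in labeled if yj == yi]
    H_Z_given_Y += (cy / n) * H([labs.count(v) / len(labs) for v in set(labs)])

assert abs(H_Z_given_Y + H_Y - H_ZY) < 1e-12
```

If the clusters intersected the labeled set in different proportions, H(Y) computed over all points would no longer match the labeled-only marginal, and the identity would generally fail.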
Finally, note that consistency as in Definition 3.2 and coarsening as in Definition 4.1 are, loosely speaking, opposites of each other. In fact, if X_{\ell}=X and if \mathcal{Z} is a partition of X, then \mathcal{Y} is a coarsening of \mathcal{Z} if and only if \mathcal{Z} is consistent with \mathcal{Y}. Although we are interested in partitions \mathcal{Y} consistent with \mathcal{Z}, we use the concept of a coarsening to derive results for parameter selection. Moreover, note that a partition \mathcal{Y} can be both consistent with and a coarsening of the side information \mathcal{Z}. This is the case where every Y_{i} contains exactly one Z_{j} and every Z_{j} is contained in exactly one Y_{i} (i.e., \mathcal{Y} has the same number of elements as \mathcal{Z}).
Theorem 4.1.
Let X\subset\mathbb{R}^{N} be a finite data set and X_{\ell}\subseteq X be the set of labeled data points that is partitioned into \mathcal{Z}=\{Z_{1},\dots,Z_{m}\}. Let \mathcal{Y}=\{Y_{1},\dots,Y_{k}\} be a proportional coarsening of \mathcal{Z}, and suppose that the sample covariance matrices \Sigma_{i} of Y_{i} are positive definite.
If \tilde{\mathcal{Y}}=\{\tilde{Y}_{1},\dots,\tilde{Y}_{k^{\prime}}\} is a coarsening of \mathcal{Y}, then
\mathrm{E}_{1}(\tilde{\mathcal{Y}})\geq\mathrm{E}_{1}(\mathcal{Y}).  (12) 
Proof.
See Appendix E. ∎
An immediate consequence of Theorem 4.1 is that, for \beta=1, CEC-IB tends to put elements with different labels in different clusters. Note, however, that there might be partitions \mathcal{Y} consistent with \mathcal{Z} that have an even lower cost (10): since every consistent partition \mathcal{Y} satisfies H(\mathcal{Z}\mid\mathcal{Y})=0, any further refinement of \mathcal{Y} reduces the cost whenever the cost for model complexity, H(\mathcal{Y}), is outweighed by the modeling inaccuracy \sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}H(\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}})).
Theorem 4.1 does not assume that the side information induces a partition of X that fits our intuition of clusters: the Z_{j} need not be connected sets, but could result from, say, a random labeling of the data set X. Then, for \beta=1, splitting X into clusters that are consistent with the side information will be at least as good as creating a single cluster. Interestingly, if the labeling is completely random, any \beta<1 will prevent dividing the data set into clusters:
Remark 4.1.
Suppose a completely random labeling for the setting of Theorem 4.1. More precisely, we assume that the set of labeled data X_{\ell} is divided into \mathcal{Z}=\{Z_{1},\ldots,Z_{m}\} and that the partition \mathcal{Y} is a proportional coarsening of \mathcal{Z}. If sufficiently many data points are labeled, we may assume that the sample covariance matrix \Sigma_{Y_{i}} of Y_{i} is close to the covariance matrix \Sigma_{X} of X, i.e. \Sigma_{Y_{i}}\approx\Sigma_{X}. For any coarsening \tilde{\mathcal{Y}} of \mathcal{Y}, the crossentropies for \mathcal{Y} and \tilde{\mathcal{Y}} are approximately equal:
\sum_{i}\frac{|Y_{i}|}{|X|}H(\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}}))\approx\sum_{j}\frac{|\tilde{Y}_{j}|}{|X|}H(\mathcal{N}(\mu_{\tilde{Y}_{j}},\Sigma_{\tilde{Y}_{j}}))\approx H(\mathcal{N}(\mu_{X},\Sigma_{X})) 
because the sample covariance matrices of \tilde{Y}\in\tilde{\mathcal{Y}} are also close to \Sigma_{X}.
If we compare the remaining parts of cost function (10), then with \beta<1 we obtain:
\displaystyle H(\mathcal{Y})+\beta H(\mathcal{Z}|\mathcal{Y})=(1-\beta)H(\mathcal{Y})+\beta(H(\mathcal{Y})+H(\mathcal{Z}|\mathcal{Y}))\\ \displaystyle=(1-\beta)H(\mathcal{Y})+\beta(H(\tilde{\mathcal{Y}})+H(\mathcal{Z}|\tilde{\mathcal{Y}}))>H(\tilde{\mathcal{Y}})+\beta H(\mathcal{Z}|\tilde{\mathcal{Y}}).  (13) 
The last inequality follows from the fact that H(\mathcal{Y})>H(\tilde{\mathcal{Y}}). Therefore, CECIB with \beta<1 is robust to random labeling.
Our second result is a critical threshold \beta_{0}, above which splitting a given cluster \tilde{Y}_{1} into smaller clusters Y_{1},\ldots,Y_{l} reduces the cost. This threshold \beta_{0} depends on the data set and on the side information. For example, as Remark 4.1 shows, for a completely random labeling we get \beta_{0}=1. To derive the threshold in the general case, we combine the proof of Theorem 4.1 with (13):
Theorem 4.2.
Let X\subset\mathbb{R}^{N} be a finite data set and X_{\ell}\subseteq X be the set of labeled data points that is partitioned into \mathcal{Z}=\{Z_{1},\dots,Z_{m}\}. Let \mathcal{Y}=\{Y_{1},\dots,Y_{k}\} be a proportional coarsening of \mathcal{Z}, and suppose that the sample covariance matrices \Sigma_{i} of Y_{i} are positive definite. Suppose that \tilde{\mathcal{Y}}=\{Y_{1},\dots,Y_{k^{\prime}-1},(Y_{k^{\prime}}\cup\cdots\cup Y_{k})\}, for 1<k^{\prime}<k, is a coarsening of \mathcal{Y}, and let \mu and \Sigma be the sample mean vector and sample covariance matrix of Y_{k^{\prime}}\cup\cdots\cup Y_{k}. Let q_{i}=p_{i} for i=1,\dots,k^{\prime}-1 and q_{k^{\prime}}=\sum_{i=k^{\prime}}^{k}p_{i}. We put
\beta_{0}=1+\frac{\sum_{i=k^{\prime}}^{k}\frac{p_{i}}{2q_{k^{\prime}}}\ln\left(\frac{\det\Sigma_{i}}{\det\Sigma}\right)}{H\left(\frac{p_{k^{\prime}}}{q_{k^{\prime}}},\dots,\frac{p_{k}}{q_{k^{\prime}}}\right)}.  (14) 
If \beta\geq\beta_{0}, then
\mathrm{E}_{\beta}(\tilde{\mathcal{Y}})\geq\mathrm{E}_{\beta}(\mathcal{Y}). 
Proof.
See Appendix F. ∎
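To make the threshold concrete, the following sketch estimates \beta_{0} of (14) from sample clusters. This is a minimal numpy illustration, not part of the paper: the function name and the normalization of the weights within the merged cluster are our own choices.

```python
import numpy as np

def beta_0(clusters):
    """Estimate the critical threshold of Eq. (14) for merging the given
    clusters Y_{k'}, ..., Y_k into one cluster (illustrative sketch)."""
    merged = np.vstack(clusters)
    # p_i / q_{k'}: relative weights of the sub-clusters inside the merged one
    w = np.array([len(Y) for Y in clusters]) / len(merged)

    def logdet_cov(Y):
        # log-determinant of the sample covariance matrix of Y
        return np.linalg.slogdet(np.atleast_2d(np.cov(Y, rowvar=False)))[1]

    ld_merged = logdet_cov(merged)
    numerator = sum(wi / 2 * (logdet_cov(Y) - ld_merged)
                    for wi, Y in zip(w, clusters))
    entropy = -np.sum(w * np.log(w))  # H(p_{k'}/q_{k'}, ..., p_k/q_{k'})
    return 1 + numerator / entropy
```

For a one-dimensional Gaussian sample split at its mean, this evaluates to roughly 0.269, matching Example 4.1 below.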
We now evaluate a practically relevant instance of the above theorem, where the data follows a Gaussian distribution and the partitionlevel side information is “reasonable”:
Example 4.1.
Let X\subset\mathbb{R} be a data set generated by a onedimensional Gaussian distribution f=\mathcal{N}(\mu,\sigma^{2}), and suppose that the data set is large enough such that the sample mean \mu_{X} and sample variance \sigma_{X}^{2} are close to \mu and \sigma^{2}, respectively. A classical unsupervised modelbased clustering technique, such as CEC or GMM, terminates with a single cluster.
Now suppose that Z_{1}\subset(-\infty,\mu) and Z_{2}\subset[\mu,+\infty) are equally sized sets, which suggests that \mathcal{Y}=\{Y_{1},Y_{2}\}=\{(-\infty,\mu)\cap X,[\mu,+\infty)\cap X\} is the expected clustering. Consequently, on one hand, the data distribution indicates that a single cluster should be created while, on the other hand, the side information suggests splitting the data set into two clusters. At the threshold \beta_{0} these two conflicting goals are balanced, while for \beta>\beta_{0} a consistent clustering is obtained.
To calculate the critical value \beta_{0}, let \mathcal{Y}=\{Y_{1},Y_{2}\} be proportional to \mathcal{Z}, and let f_{i}=\mathcal{N}(\mu_{Y_{i}},\sigma_{Y_{i}}^{2}) be the optimal fit for cluster Y_{i}. Since the data in Y_{i} can be well approximated by a truncated Gaussian distribution, we can calculate:
\sigma_{Y_{1}}^{2}\approx\sigma_{Y_{2}}^{2}\approx\sigma^{2}\left(1-\frac{2}{\pi}\right). 
Making use of the previous theorem, \mathrm{E}_{\beta}(\{X\})=\mathrm{E}_{\beta}(\mathcal{Y}) for
\beta=\beta_{0}\approx 1+\frac{\ln\sqrt{1-\frac{2}{\pi}}}{H(\frac{1}{2},\frac{1}{2})}\approx 0.269.  (15) 
Continuing this example, in some cases the side information might be noisy, i.e., data points are labeled wrongly. Consider a labeling \mathcal{Z} that satisfies Z_{1}\subset(-\infty,\mu+c) and Z_{2}\subset[\mu-c,+\infty), for some c>0. In other words, the human experts did not agree on the labeling at the boundary between the clusters. If we choose \mathcal{Y} proportional to this noisy side information \mathcal{Z}, then one has reason to suppose that the sample variances of Y_{1} and Y_{2} are larger than in the noiseless case, hence leading to a larger threshold \beta_{0} according to Theorem 4.2. Setting \beta to a value only slightly higher than the threshold \beta_{0} for the noiseless case thus ensures a partition \mathcal{Y} that is consistent with the noiseless labeling, but that is robust to noise. In summary, one should choose \beta large (i.e., close to 1) if one believes that the side information is correct, but small if one has to expect noisy side information.
5 Experiments
We evaluated our method in classical semisupervised clustering tasks on examples retrieved from the UCI repository [26] and compared its results to stateoftheart semisupervised clustering methods. We evaluated the performance in the case of only a few classes being present in the partitionlevel side information and for noisy labeling, and investigated the influence of the parameter \beta on the clustering results. We furthermore applied CECIB to a data set of chemical compounds [47] to discover subgroups based on partitionlevel side information derived from the top of a cluster hierarchy, and illustrated its performance in an image segmentation task.
5.1 Experimental setting
We considered five related semisupervised clustering methods for comparison. The first is a classical semisupervised classification method that is based on fitting a GMM to the data set while taking the partial labeling into account. Since it is a classification method, it only works if all true classes are present in the categorization \mathcal{Z}. We used the R implementation Rmixmod [24] with default settings; we refer to this method as “mixmod”.
The second method incorporates pairwise constraints as side information for a GMMbased clustering technique [36]. To transfer the partitionlevel side information to pairwise constraints, we went over all pairs of labeled data points in X_{\ell} and generated a mustlink constraint if they were in the same, and a cannotlink constraint if they were in different categories. We used the implementation from one of the authors’ website (http://www.scharp.org/thertz/code.html) and refer to this method as “cGMM”. We ran cGMM in MultiCov mode, i.e., every cluster was characterized by its own covariance matrix.
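The conversion from partitionlevel side information to pairwise constraints can be sketched as follows (a hypothetical helper, not from the paper; `z` maps the indices of labeled points to their categories):

```python
from itertools import combinations

def to_pairwise(z):
    """Turn partitionlevel side information into pairwise constraints:
    same category -> mustlink, different categories -> cannotlink."""
    must_link, cannot_link = [], []
    for i, j in combinations(sorted(z), 2):
        if z[i] == z[j]:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link
```

Note that every labeled point is paired with every other labeled point, so the cannotlink pairs are in general not disjoint; this detail matters for the cGMM results discussed in Section 5.3.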
We also applied an extension of kmeans that accepts partitionlevel side information [27]. The method requires setting a weight parameter \lambda that places weight on the features derived from the side information. The authors suggested \lambda=100, but we found that the method performs more stably for \lambda=100\cdot\mathrm{tr}(\Sigma), i.e., for \lambda proportional to the trace of the sample covariance matrix of the data set X. We refer to this method as “kmeans”.
Moreover, we considered a semisupervised variant of fuzzy cmeans [33, 32], which incorporates partitionlevel side information. We used the Euclidean distance, set the fuzzifier parameter to 2, and chose a tradeoff parameter \alpha=\frac{|X|}{|X_{\ell}|} as suggested by the authors. To obtain a “hard” clustering \mathcal{Y} from a fuzzy partition, we assigned every point to its most probable cluster. This technique will be referred to as “fcmeans”.
Finally, we used a semisupervised version of spectral clustering [34] (referred to as “spec”), which was claimed to achieve stateoftheart performance among spectral algorithms. The method accepts pairwise constraints and operates on the affinity (similarity) matrix of the data set. The authors of [34] suggested setting the similarity between data points x_{i} and x_{j} to e^{-\|x_{i}-x_{j}\|^{2}/(2\rho^{2})}, where \|\cdot\| is the Euclidean distance and where \rho>0 is called the affinity parameter. In order to account for different variances in different dimensions, we used
e^{-\sum_{\ell=1}^{N}\frac{|x_{i}^{(\ell)}-x_{j}^{(\ell)}|^{2}}{2\rho^{2}\sigma^{2}_{(\ell)}}},  (16) 
where x_{i}^{(\ell)} is the value of the \ellth coordinate of x_{i} and where \sigma^{2}_{(\ell)} is the variance of the \ellth coordinate of X. The method can be tuned with two parameters: the affinity parameter \rho and the tradeoff factor \eta. The authors suggest finding the best possible combination of these parameters using a gridsearch strategy. Since we did not allow tuning any parameters of the other methods (including \beta in CECIB), for a fair comparison we decided to fix these two parameters. Specifically, we put \eta=0.7 based on the results reported in [34]. We moreover set \rho=1 because the Euclidean distances are already normalized according to the variances of the respective dimensions and because [34] reports little influence of the selection of \rho. We generated mustlink and cannotlink constraints as we did for cGMM; moreover, the entries of the affinity matrix were set to one for mustlink, and to zero for cannotlink constraints.
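The variance-normalized affinity of (16) can be computed as follows (a numpy sketch under our parameter choice \rho=1; the function name is ours):

```python
import numpy as np

def affinity_matrix(X, rho=1.0):
    """Gaussian affinities with per-coordinate variance normalization, Eq. (16)."""
    var = X.var(axis=0)  # sigma^2_(l): variance of each coordinate of X
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2
    d2 = (diff2 / var).sum(axis=-1)  # variance-normalized squared distances
    return np.exp(-d2 / (2 * rho ** 2))
```

The resulting matrix is symmetric with unit diagonal; the mustlink/cannotlink entries would then be overwritten with one/zero as described above.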
Since CECIB automatically determines an appropriate number of clusters by removing clusters with too few data points, we initialized CECIB with twice the correct number of clusters. In contrast, other methods were run with the correct numbers of clusters. In a semisupervised clustering task with correct labels from all classes, the competing methods can thus be expected to perform better than CECIB.
To better illustrate the effect of the weight parameter \beta, we used two parameterizations of CECIB, with \beta=1 and \beta=\beta_{0}\approx 0.269 given by (15). We refer to these two variants as \textsc{CECIB}_{1} and \textsc{CECIB}_{0}, respectively.
Table 1: UCI data sets used in the experiments.

Data set  # Instances  # Features  # Classes
Ecoli{}^{+}  327  5  5
Glass  214  9  6
Iris  150  4  3
Segmentation{}^{+}  210  5  7
User Modeling  403  5  4
Vertebral  310  6  3
Wine  178  13  3

{}^{+}: PCA was used to reduce the dimensionality of the data set and remove dependent attributes
The similarity between the obtained clusterings and the ground truth was evaluated using Normalized Mutual Information (NMI) [3]. For a reference grouping \mathcal{X} and a clustering \mathcal{Y} it is defined by
\mathrm{NMI}(\mathcal{Y},\mathcal{X})=\frac{2I(\mathcal{Y};\mathcal{X})}{H(\mathcal{Y})+H(\mathcal{X})}. 
Since I(\mathcal{Y};\mathcal{X})\leq\min\{H(\mathcal{Y}),H(\mathcal{X})\}, NMI is bounded from above by 1, which is attained for identical partitions. If \mathcal{Y} and \mathcal{X} contain different numbers of clusters, then NMI is always below 1.
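For reference, NMI can be computed directly from the empirical label frequencies (a self-contained sketch; natural logarithms are used, which cancel in the ratio):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum(c / n * np.log(c / n) for c in Counter(labels).values())

def nmi(y, x):
    """Normalized mutual information between two hard partitions,
    given as per-point label sequences."""
    n = len(y)
    py, px = Counter(y), Counter(x)
    # I(Y;X) = sum over joint cells of p(a,b) log(p(a,b) / (p(a) p(b)))
    mi = sum(c / n * np.log(c * n / (py[a] * px[b]))
             for (a, b), c in Counter(zip(y, x)).items())
    return 2 * mi / (entropy(y) + entropy(x))
```

As the definition requires, identical partitions (up to a relabeling of clusters) give NMI equal to 1, and independent partitions give 0.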
5.2 Semisupervised clustering
We evaluated the proposed method in a classical semisupervised clustering task, in which we aim to recover a reference partition based on a small sample of labeled data.
We used seven UCI data sets, which are summarized in Table 1. The partitionlevel side information was generated by choosing 0%, 10%, 20%, and 30% of the data points and labeling them according to their class. To remove effects from random initializations, we generated 10 different samples of side information for each percentage and averaged the resulting NMI values.
The clustering results presented in Figure 3 show that \textsc{CECIB}_{1} usually achieved a higher NMI than \textsc{CECIB}_{0}: since the partitionlevel side information is noisefree, i.e., agrees with the reference grouping, a larger weight parameter \beta leads to better performance. In general, \textsc{CECIB}_{1} produced results similar to the two other GMMbased techniques, cGMM and mixmod. Notable differences can be observed on the Vertebral data set, where CECIB performed significantly better, and on Iris and User Modeling, where the competing methods gave higher NMI. This is most likely caused by the fact that CECIB failed to determine the correct number of clusters (see Table 2), while the GMM implementations were given this correct number of clusters as side information. As can be seen in Table 2, in all other cases CECIB terminated with a number of clusters very close to the true value. When we initialized CECIB with the correct number of clusters for the Iris data set, we obtained results comparable to those of mixmod and cGMM (see Figure 3(h)).
Table 2: Number of clusters returned by CECIB for different percentages of labeled data.

Data set  # Classes  0%  10%  20%  30%
Ecoli  5  7  6  6  6
Glass  6  5  6  6  6
Iris  3  5  5  5  5
Segmentation  7  8  8  7  8
User  4  7  6  6  6
Vertebral  3  4  4  4  4
Wine  3  3  3  3  3
Observe that kmeans gave slightly lower NMI than fcmeans. Nevertheless, both algorithms performed worse than the GMMbased methods, except on the Ecoli and Segmentation data sets. This difference can be explained by the fact that fcmeans and kmeans are distancebased methods and therefore behave differently from modelbased approaches. Although the performance of spec usually increases with more labeled examples, its results are worse than those of the other methods.
5.3 Few labeled classes
In a (semi)supervised classification task, the model learns classes from a set of labeled data and applies this knowledge to unlabeled data points. More specifically, the classifier cannot assign class labels that were not present in the training set. In contrast, clustering with partitionlevel side information can detect clusters within a labeled category or within the set of unlabeled data points.
In this section, we apply CECIB to a data set for which the partitionlevel side information contains labels of only two classes from the reference grouping. As before, we considered 0%, 10%, 20% and 30% of labeled data. For each of the 10 runs, we randomly selected two classes from the reference grouping that together covered at least 30% of the data and generated the partitionlevel side information from these two categories (the same classes were used for all percentages of side information). It was not possible to run mixmod in this case because this package does not allow using a number of clusters different from the number of categories given in the side information.
Figure 4 shows that CECIB was able to consistently improve its clustering performance with an increasing size of the labeled data set (although the results for 0% of labeled data should be identical to the ones reported in Section 5.2, some minor differences might follow from the random initialization of the methods; see Section 3.4). Surprisingly, cGMM sometimes dropped in performance when adding side information. This effect was already visible in Figure 3, but seems to be more pronounced here. While a deeper analysis of this effect is out of the scope of this work, we believe that it is due to the simplification made in [36] to facilitate applying a generalized EM scheme. This simplification is valid if pairs of points with cannotlink constraints are disjoint, an assumption that is clearly violated by the way we generate cannotlink constraints (see Section 5.1).
In contrast to cGMM, the results of fcmeans and kmeans were far more stable. In most cases, both algorithms improved their performance when given access to more labeled data. Interestingly, spec performed in general better when only two classes were labeled than in the previous experiment where all classes were labeled. In consequence, its results were often comparable to, or sometimes even better than, those of the other methods.
5.4 Noisy side information
In realworld applications, the side information usually comes from human experts, who label training samples. Depending on the expertise of these workers, some part of this side information might be noisy or erroneous. Therefore, the clustering algorithm needs to be robust w.r.t. noisy side information.
To simulate the above scenario, we randomly selected 30% of the data points as side information, as in Section 5.2, and assigned incorrect labels to a fixed percentage of them (0%, 10%, 20%, 30%, 40%, 50% of misspecified labels). All methods were run in the same manner as in the previous experiments.
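The sampling procedure can be sketched as follows (a hypothetical helper; function and argument names are our own):

```python
import numpy as np

def sample_side_info(y_true, frac_labeled, frac_noisy, rng):
    """Label a random fraction of points, then corrupt a fraction of those labels
    by replacing them with a randomly chosen incorrect class."""
    y_true = np.asarray(y_true)
    n = len(y_true)
    labeled = rng.choice(n, size=int(frac_labeled * n), replace=False)
    z = {int(i): int(y_true[i]) for i in labeled}
    classes = np.unique(y_true)
    noisy = rng.choice(labeled, size=int(frac_noisy * len(labeled)), replace=False)
    for i in noisy:
        wrong = classes[classes != y_true[i]]  # any label except the true one
        z[int(i)] = int(rng.choice(wrong))
    return z
```

For a two-class data set, corrupting a label simply flips it, so the number of misspecified labels equals the chosen noise fraction exactly.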
One can see in Figure 5 that \textsc{CECIB}_{0} showed the highest robustness to noisy labels among all competing methods, i.e., its NMI deteriorated the least with increasing noise. Although \textsc{CECIB}_{1} achieved a higher NMI than \textsc{CECIB}_{0} for correctly labeled data (without noise), its performance was usually worse than that of \textsc{CECIB}_{0} when at least 30% of the labels were misspecified. The robustness of mixmod and spec is acceptable; their results vary with the used data set, but on average they cope with incorrect labels comparably to \textsc{CECIB}_{1}. In contrast, cGMM, kmeans and fcmeans are very sensitive to noisy side information. Since their performance falls drastically below the results returned for the strictly unsupervised case, they should not be used if there is a risk of unreliable side information.
5.5 Influence of weight parameter
From Figure 3 it can be seen that \beta=\beta_{0} often seems to be too small to benefit sufficiently from partitionlevel side information, although it provides high robustness to noisy labels. In this experiment, we investigate the dependence between the value of \beta and the size of the labeled data set and the fraction of noisy labels, respectively.
First, we checked the performance of CECIB with different values of \beta in the noiseless case. We randomly selected 10%, 20% and 30% of the data points, respectively, and labeled them according to their true classes. Figure 6 shows that CECIB with \beta=\beta_{0} run on 30% labels performed similarly to CECIB with \beta=1 run on 10% labels. Therefore, we see that the lack of labeled data can be compensated with a larger value of \beta. Moreover, a larger value of \beta makes CECIB benefit more from a larger number of correctly labeled data points.
In the second experiment we investigated the relation between the fraction of noisy side information and the weight parameter. We drew 30% of the data points and labeled 0%, 10%, 30%, and 50% of them incorrectly (the remaining selected data points were labeled with their correct class labels). We see in Figure 7 that a small noise of 10% did not have severe negative effects on the clustering results. In this case NMI was almost always higher than in the unsupervised case (i.e., for \beta=0), even for \beta=1. For 50% of incorrectly labeled data points, increasing \beta has a negative effect on the clustering performance, while \beta=\beta_{0} provided high robustness to the large amount of noisy labels and in most cases performed at least as well as the unsupervised scenario. For the case where 30% of labels were misspecified, choosing \beta<0.6 seems to produce results at least as good as when no side information is available.
5.6 Hierarchy of chemical classes – a case study
Our CECIB cost function only penalizes including elements from different categories into the same cluster. Covering a single category by more than one cluster is not penalized if the cost for model accuracy outweighs the cost for model complexity. In this experiment, we will show that this property is useful in discovering subgroups from side information derived from a cluster hierarchy.
We considered a data set of chemical compounds that act on the 5HT{}_{1A} receptor, one of the proteins responsible for the regulation of the central nervous system [31, 38]. Part of this data set was classified hierarchically by an expert [47], as shown in Figure 8. For an expert it is easier to provide a coarse categorization than a full hierarchical classification, especially if it is not clear how many subgroups exist. Therefore, in some cases, it might be preferable to obtain the hierarchical structure by combining a coarse categorization made by the expert with an automatic clustering algorithm that finds a partition corresponding to the clusters at the bottom of the hierarchy.
We used the KlekotaRoth fingerprint representation of chemical compounds [21], which describes each object by a binary vector, where “1” means presence and “0” denotes absence of a predefined chemical pattern. Since this representation contains 4860 features in total, its direct application to modelbased clustering can lead to singular covariance matrices of clusters. Therefore, PCA was used to reduce its dimension to the 10 most informative components. This data set contains 284 examples in total (see Figure 8 for the cardinalities of particular chemical classes).
We generated partitionlevel side information from the division of the chemical data set into two classes: Piperazines and Alkylamines. We considered 0%, 10%, 20% and 30% of the data points to be labeled and supposed that the human expert assigns incorrect labels with probabilities 0%, 10%, 20%, and 30%, respectively. Based on the results from the previous subsection, we used \beta=0.6 instead of \beta=\beta_{0}, which is denoted by \textsc{CECIB}_{0.6}. Our method was run with 10 initial groups, while the other algorithms used the knowledge of the correct number of clusters. As mentioned in Section 5.3, it is not possible to run mixmod in this case, since the desired number of clusters is larger than the number of categories.
It can be seen from Figure 9 that \textsc{CECIB}_{1} gave the highest score among all methods when the expert always assigned the correct labels, and it was only slightly better than \textsc{CECIB}_{0.6}. In the case of 20% and 30% of misspecified labels, it was slightly better to use \beta=0.6, although the differences were very small. CECIB usually terminated with 6 or 7 groups.
One can observe that GMM with negative constraints was able to use this type of side information effectively. In the noiseless case, its results improved with the number of labeled data points, but not as much as with our method. In the noisy case, however, its performance dropped. It is worth mentioning that the implementation of negative constraints with hidden Markov random fields is computationally very costly, while our method is efficient. kmeans benefited from the side information in the noiseless case, but its performance deteriorated when incorrect labels were introduced. GMM with positive and negative constraints, and fcmeans, were not able to use this type of knowledge to full effect. We observed that the use of negative constraints only has no effect on spec, i.e., its results were almost identical for any number of labeled data points (we observed similar effects for most UCI data sets when we used negative constraints only in the setting of Section 5.3; changing the parametrization of the method did not overcome this behavior). The results of spec with both types of constraints led to some improvements, but its overall performance was quite low. We were unable to find a satisfactory explanation for this behavior.
5.7 Image segmentation
To further illustrate the performance of CECIB, we applied it to an image segmentation task. We chose the picture of a dog from the Berkeley Image Segmentation database (https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/) presented in Figure 10(a) (picture no. 247085, resized to a 70\times 46 resolution) and tried to separate the shape of the dog from the background. As partitionlevel side information, we marked four regions with one of two labels (indicated by white and black colors, see Figure 10(b)). This information was passed to all considered clustering methods. In this example we focus on noiseless side information, thus we put \beta=1 for CECIB.
To transform the image into vector data, we selected a window of size 7\times 7 around each pixel and used it as a feature vector of dimension 147 (3 color intensities for each of 49 pixels). Then, we applied PCA to reduce the dimension of these vectors to the 5 most informative components. In consequence, we obtained a data set with 3220 data points in \mathbb{R}^{5}.
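The feature extraction can be sketched as follows (numpy only; for simplicity this version skips border pixels without a full window, whereas in the experiment all 3220 pixels were used):

```python
import numpy as np

def patch_features(img, win=7, n_components=5):
    """Vectorize win x win color patches around each interior pixel and
    project them onto the leading PCA components."""
    h, w, _ = img.shape
    r = win // 2
    patches = [img[i - r:i + r + 1, j - r:j + r + 1].ravel()
               for i in range(r, h - r) for j in range(r, w - r)]
    F = np.asarray(patches, dtype=float)
    F -= F.mean(axis=0)
    # principal axes from the SVD of the centered feature matrix
    _, _, vt = np.linalg.svd(F, full_matrices=False)
    return F @ vt[:n_components].T
```

Each row of the result is the 5-dimensional representation of one pixel, which is then fed to the clustering methods.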
Figure 11 shows the clustering results when the algorithms were run with two clusters. It can be seen that CECIB, mixmod, cGMM and spec provided reasonable results. Generally though, in all cases the shape of the dog was often mixed with a part of the background. This is not surprising, since 1) CECIB, mixmod and cGMM are “unimodal”, i.e., they try to detect compact groups described by single Gaussians, and 2) kmeans and fcmeans represent clusters by a single central point. Both background and foreground are too complex to be captured by such simple patterns.
In order to take this into account, we first tried to detect a “natural” number of segments. For this purpose, we ran CECIB with 10 initial groups, which were finally reduced to 5 clusters, and used this number for the other algorithms. As shown in the previous experiment on chemical compounds, mustlink constraints cannot help when we have a partial labeling for two coarse classes but are interested in discovering their subgroups. Thus, the partitionlevel side information was only transformed into cannotlink constraints.
The results presented in Figure 12 show that CECIB separated the background from the foreground quite well. Each of these two regions was described by two clusters, while the fifth group was used for detecting the boundary between them. The creation of such an additional group is natural, because the feature vectors were constructed using overlapping windows and the contrast between background and foreground is sharp. One may notice that cGMM and kmeans also allocated one group for the boundary (green colored cluster). Nevertheless, both methods created clusters which mixed elements from the background and foreground (blue and cyan colored clusters). The result returned by fcmeans separated the two main parts of the image, but also contained a lot of small artifacts. As in the chemical example, spec was not able to achieve reasonable results with cannotlink constraints only. Similarly, it was not possible to run mixmod in this case.
6 Conclusion
We introduced a semisupervised clustering method that combines modelbased clustering realized by CEC with the constraint used by the information bottleneck method. The proposed cost function consists of three terms: the first tries to minimize the final number of clusters, the second penalizes the model for being inconsistent with the side information, and the third controls the quality of data modeling. The performance of our method can be tuned by changing a weight parameter that trades between these three conflicting goals. Our method is flexible in the sense that it can be applied to classical semisupervised clustering tasks as well as to tasks in which not all classes appear in the labels or in which subgroups should be discovered based on the labels. For the latter problems, it is difficult or computationally expensive to use existing techniques. Setting the weight parameter appropriately, for which we provide a thorough theoretical analysis, makes our method robust to incorrect labels. We evaluated the performance of our method on several data sets, including a case study on a data set of chemical compounds and an image segmentation task.
Appendix A CrossEntropy Clustering
The empirical crossentropy between the data set X and the parametric mixture f of Gaussian densities is, for a given clustering \mathcal{Y},
\displaystyle H^{\times}(X\|f)  \displaystyle=-\frac{1}{|X|}\sum_{i=1}^{k}\sum_{x\in Y_{i}}\log(p_{i}f_{i}(x))  
\displaystyle=-\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}\left(\log p_{i}+\frac{1}{|Y_{i}|}\sum_{x\in Y_{i}}\log f_{i}(x)\right)  
\displaystyle=-\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}\log p_{i}+\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}H^{\times}(Y_{i}\|f_{i}). 
The first sum is minimized by selecting p_{i}=|Y_{i}|/|X|, in which case it reduces to the entropy of the cluster partition
H(\mathcal{Y}):=-\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}\log\frac{|Y_{i}|}{|X|}. 
For the second sum, recall that the crossentropy of a Gaussian density f=\mathcal{N}(\mu,\Sigma) with mean vector \mu and covariance matrix \Sigma equals:
H^{\times}(X\|f)=\tfrac{N}{2}\ln(2\pi)+\tfrac{1}{2}\|\mu_{X}-\mu\|_{\Sigma}^{2}+\tfrac{1}{2}\mathrm{tr}(\Sigma^{-1}\Sigma_{X})+\tfrac{1}{2}\ln\det(\Sigma), 
where \mu_{X} and \Sigma_{X} are the sample mean vector and sample covariance matrix of X, respectively, and where \|x\|_{\Sigma} is the Mahalanobis norm of x with respect to \Sigma. The density f\in\mathcal{G} minimizing the crossentropy function is f=\mathcal{N}(\mu_{X},\Sigma_{X}), i.e., its mean equals the sample mean of X, and its covariance matrix equals the sample covariance matrix of X [42, Theorem 4.1]. In this case, the crossentropy equals the differential Shannon entropy of \mathcal{N}(\mu_{X},\Sigma_{X}), i.e.,
H^{\times}(X\|\mathcal{N}(\mu_{X},\Sigma_{X}))=\tfrac{N}{2}\ln(2\pi e)+\tfrac{1}{2}\ln\det(\Sigma_{X})=H(\mathcal{N}(\mu_{X},\Sigma_{X})). 
It follows that the second sum is minimized by selecting, for every i, the maximum likelihood estimator of Y_{i}, f_{i}=\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}}) [42, Theorem 4.1, Proposition 4.1].
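Putting the two minimizers together, the unsupervised CEC cost of a clustering can be evaluated as follows (an illustrative numpy sketch; the function name is ours):

```python
import numpy as np

def cec_cost(clusters):
    """H(Y) plus the weighted differential entropies of the Gaussians
    fitted to each cluster by maximum likelihood."""
    n = sum(len(Y) for Y in clusters)
    cost = 0.0
    for Y in clusters:
        p = len(Y) / n  # p_i = |Y_i| / |X|
        cov = np.atleast_2d(np.cov(Y, rowvar=False))
        d = cov.shape[0]
        # differential entropy of N(mu_Y, Sigma_Y)
        h = 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * np.linalg.slogdet(cov)[1]
        cost += -p * np.log(p) + p * h
    return cost
```

For a single cluster, the entropy term H(\mathcal{Y}) vanishes and the cost equals the differential entropy of the fitted Gaussian.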
Appendix B Proof of Theorem 3.1
We consider the CEC cost function (2) separately for each category Z_{j} and define the conditional crossentropy as
H^{\times}((X\|f)|\mathcal{Z})=\sum_{j=1}^{m}\frac{|Z_{j}|}{|X|}H^{\times}(Z_{j}\|f_{j}) 
where
f_{j}=\max(p_{1}(j)f_{1},\ldots,p_{k}(j)f_{k}). 
In other words, we assume a parameterized density model in which the weights p_{i}(j) may depend on the category, while the densities f_{i} may not. Rewriting the above cost yields
\displaystyle H^{\times}((X\|f)|\mathcal{Z})  \displaystyle=-\sum_{j=1}^{m}\frac{|Z_{j}|}{|X|}\sum_{x\in Z_{j}}\frac{1}{|Z_{j}|}\log f_{j}(x)  
\displaystyle=-\frac{1}{|X|}\sum_{j=1}^{m}\sum_{i=1}^{k}\sum_{x\in Z_{j}\cap Y_{i}}\log p_{i}(j)f_{i}(x)  
\displaystyle=-\sum_{j=1}^{m}\sum_{i=1}^{k}\frac{|Z_{j}\cap Y_{i}|}{|X|}\log p_{i}(j)-\frac{1}{|X|}\sum_{i=1}^{k}\sum_{x\in Y_{i}}\log f_{i}(x). 
The second sum is minimized by the maximum likelihood estimates f_{i}=\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}}), while the first term is minimized for p_{i}(j)=\frac{|Z_{j}\cap Y_{i}|}{|Z_{j}|}. We thus get
\displaystyle-\sum_{j=1}^{m}\sum_{i=1}^{k}\frac{|Z_{j}\cap Y_{i}|}{|X|}\log p_{i}(j)  \displaystyle=-\sum_{j=1}^{m}\sum_{i=1}^{k}\frac{|Z_{j}\cap Y_{i}|}{|X|}\log\frac{|Z_{j}\cap Y_{i}|}{|Z_{j}|}  
\displaystyle=H(\mathcal{Y}|\mathcal{Z})  
\displaystyle=H(\mathcal{Z}|\mathcal{Y})+H(\mathcal{Y})-H(\mathcal{Z}) 
by the chain rule of entropy. Since H(\mathcal{Z}) does not depend on the clustering \mathcal{Y}, the minimization of the above conditional crossentropy is equivalent to the minimization of
\displaystyle H(\mathcal{Y})+\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}H(\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}}))+H(\mathcal{Z}|\mathcal{Y}).  (17) 
This is exactly the cost (4) for \beta=1.∎
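The side-information term H(\mathcal{Z}|\mathcal{Y}) of (17) can be estimated from the labeled points alone (a minimal sketch; `z` and `y` are the category and cluster labels of the points in X_{\ell}):

```python
import numpy as np
from collections import Counter

def cond_entropy(z, y):
    """Empirical conditional entropy H(Z|Y) from category labels z
    and cluster labels y of the labeled data points."""
    n = len(y)
    cluster_sizes = Counter(y)
    # H(Z|Y) = -sum p(y,z) log p(z|y), with p(z|y) = |Y cap Z| / |Y|
    return -sum(c / n * np.log(c / cluster_sizes[a])
                for (a, b), c in Counter(zip(y, z)).items())
```

A clustering consistent with the side information makes every cluster pure with respect to the categories, so this term vanishes.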
Appendix C Convergence Speed of Hartigan-Based CEC
Table 3: Average number of iterations until convergence.

Data set      | Hartigan CEC | EM-based GMM | Hartigan k-means | Lloyd k-means
------------- | ------------ | ------------ | ---------------- | -------------
Ecoli         | 6.4          | 18.6         | 3.2              | 10.4
Glass         | 5.5          | 15.7         | 3.3              | 9.7
Iris          | 5.1          | 19.1         | 2.3              | 7.0
Segmentation  | 4.4          | 16.7         | 3.3              | 9.3
User Modeling | 8.5          | 48.2         | 3.9              | 10.9
Vertebral     | 7.6          | 17.8         | 2.3              | 9.0
Wine          | 7.6          | 13.6         | 2.3              | 8.6
We compared the number of iterations that CEC, EM, and k-means required to converge to a local minimum. We used the seven UCI data sets from Table 1 and averaged the results over ten runs; side information was not considered in these experiments. Table 3 shows that the Hartigan heuristic applied to the CEC cost function converges faster than EM does for fitting a GMM. The same holds when comparing the Hartigan algorithm with Lloyd's method applied to k-means. Similar results were obtained in an experimental evaluation [40]. We also found that CEC-IB uses a similar number of iterations as CEC; however, the convergence speed varies with the particular sample of side information, which makes a reliable comparison more difficult.
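For illustration (this is not the original experimental code), the sketch below counts full passes over the data until convergence for Lloyd's algorithm and for a Hartigan-style single-point transfer heuristic on synthetic k-means data; the transfer criterion follows Hartigan and Wong [15], and all function names are ours:

```python
import numpy as np

def lloyd_iterations(X, centers, max_iter=100):
    """Lloyd's k-means: count full passes until assignments stop changing."""
    labels = None
    for it in range(1, max_iter + 1):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        new_labels = d.argmin(1)
        if labels is not None and np.array_equal(new_labels, labels):
            return it
        labels = new_labels
        for j in range(len(centers)):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return max_iter

def hartigan_iterations(X, labels, k, max_iter=100):
    """Hartigan-style k-means: count full sweeps of single-point transfers."""
    labels = labels.copy()
    for it in range(1, max_iter + 1):
        moved = False
        for idx, x in enumerate(X):
            i = labels[idx]
            ni = int((labels == i).sum())
            if ni <= 1:
                continue  # never empty a cluster
            centers = np.array([X[labels == j].mean(0) for j in range(k)])
            sizes = np.array([(labels == j).sum() for j in range(k)])
            # SSE change of removing x from i vs. inserting it into j
            gain = ni / (ni - 1) * ((x - centers[i]) ** 2).sum()
            costs = np.array([sizes[j] / (sizes[j] + 1) * ((x - centers[j]) ** 2).sum()
                              for j in range(k)])
            costs[i] = np.inf
            j_best = int(costs.argmin())
            if costs[j_best] < gain:  # Hartigan-Wong transfer criterion
                labels[idx] = j_best
                moved = True
        if not moved:
            return it
    return max_iter

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (60, 2)) for c in (0.0, 4.0, 8.0)])
k = 3
init_centers = X[rng.choice(len(X), size=k, replace=False)].copy()
init_labels = rng.integers(0, k, size=len(X))
init_labels[:k] = np.arange(k)  # make sure every cluster starts non-empty

n_lloyd = lloyd_iterations(X, init_centers)
n_hartigan = hartigan_iterations(X, init_labels, k)
print(n_hartigan, n_lloyd)
```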
Appendix D Chain Rule for Proportional Partitions
We now show that, for partitions \mathcal{Y} proportional to \mathcal{Z}, the chain rule of entropy can be applied, i.e.,
H(\mathcal{Y})+H(\mathcal{Z}|\mathcal{Y})=H(\mathcal{Z},\mathcal{Y}). 
We have
\displaystyle H(\mathcal{Y})+H(\mathcal{Z}|\mathcal{Y})  
\displaystyle=-\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}\log\frac{|Y_{i}|}{|X|}-\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}\sum_{j=1}^{m}\frac{|Y_{i}\cap Z_{j}|}{|Y_{i}\cap X_{\ell}|}\log\frac{|Y_{i}\cap Z_{j}|}{|Y_{i}\cap X_{\ell}|}  
\displaystyle\stackrel{(a)}{=}-\sum_{i=1}^{k}\frac{|Y_{i}\cap X_{\ell}|}{|X_{\ell}|}\log\frac{|Y_{i}\cap X_{\ell}|}{|X_{\ell}|}-\sum_{i=1}^{k}\frac{|Y_{i}\cap X_{\ell}|}{|X_{\ell}|}\sum_{j=1}^{m}\frac{|Y_{i}\cap Z_{j}|}{|Y_{i}\cap X_{\ell}|}\log\frac{|Y_{i}\cap Z_{j}|}{|Y_{i}\cap X_{\ell}|}  
\displaystyle=-\sum_{i=1}^{k}\sum_{j=1}^{m}\frac{|Y_{i}\cap Z_{j}|}{|X_{\ell}|}\log\frac{|Y_{i}\cap X_{\ell}|}{|X_{\ell}|}-\sum_{i=1}^{k}\sum_{j=1}^{m}\frac{|Y_{i}\cap Z_{j}|}{|X_{\ell}|}\log\frac{|Y_{i}\cap Z_{j}|}{|Y_{i}\cap X_{\ell}|}  
\displaystyle=-\sum_{i=1}^{k}\sum_{j=1}^{m}\frac{|Y_{i}\cap Z_{j}|}{|X_{\ell}|}\log\frac{|Y_{i}\cap Z_{j}|}{|X_{\ell}|}  
\displaystyle=H(\mathcal{Z},\mathcal{Y}) 
where (a) holds because \mathcal{Y} is proportional to \mathcal{Z} and thus \frac{|Y_{i}\cap X_{\ell}|}{|X_{\ell}|}=\frac{|Y_{i}|}{|X|}. In a similar manner it can be shown that
H(\mathcal{Z})+H(\mathcal{Y}|\mathcal{Z})=H(\mathcal{Z},\mathcal{Y}) 
where H(\mathcal{Z})=-\sum_{j=1}^{m}\frac{|Z_{j}|}{|X_{\ell}|}\log\frac{|Z_{j}|}{|X_{\ell}|} and where
H(\mathcal{Y}|\mathcal{Z})=\sum_{j=1}^{m}\frac{|Z_{j}|}{|X_{\ell}|}H(\mathcal{Y}|Z_{j})=-\sum_{j=1}^{m}\sum_{i=1}^{k}\frac{|Y_{i}\cap Z_{j}|}{|X_{\ell}|}\log\frac{|Y_{i}\cap Z_{j}|}{|Z_{j}|}. 
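Both identities are easy to verify from contingency counts. The sketch below (synthetic counts; the names are ours) builds a clustering that is proportional to the side information and checks both chain rules numerically:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# |X| = 100 points: cluster Y_1 has 40, Y_2 has 60. The labeled subset X_l has
# 20 points in the same proportions (8 and 12), so Y is proportional to Z.
joint = np.array([[8, 0],    # counts |Y_i ∩ Z_j| on X_l
                  [4, 8]])
n_lab = joint.sum()
p_Y_full = np.array([40, 60]) / 100     # |Y_i| / |X|
p_Y_lab = joint.sum(axis=1) / n_lab     # |Y_i ∩ X_l| / |X_l|
assert np.allclose(p_Y_full, p_Y_lab)   # proportionality, step (a)

H_Y = entropy(p_Y_full)
H_Z_given_Y = sum(p_Y_lab[i] * entropy(joint[i] / joint[i].sum()) for i in range(2))
H_Z = entropy(joint.sum(axis=0) / n_lab)
H_Y_given_Z = sum(joint[:, j].sum() / n_lab * entropy(joint[:, j] / joint[:, j].sum())
                  for j in range(2))
H_joint = entropy(joint.ravel() / n_lab)

print(H_Y + H_Z_given_Y, H_Z + H_Y_given_Z, H_joint)  # all three agree
```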
Appendix E Proof of Theorem 4.1
Lemma E.1.
Let the data set X\subset\mathbb{R}^{N} be partitioned into two clusters Y_{1} and Y_{2} such that the sample covariance matrix \Sigma_{i} of Y_{i} is positive definite for i=1,2.
Then
H(\mathcal{N}(\mu_{X},\Sigma_{X}))\geq\frac{|Y_{1}|}{|X|}H(\mathcal{N}(\mu_{Y_{1}},\Sigma_{Y_{1}}))+\frac{|Y_{2}|}{|X|}H(\mathcal{N}(\mu_{Y_{2}},\Sigma_{Y_{2}})). 
Proof.
Let p=|Y_{1}|/|X|. By the law of total (co)variance, we have
\displaystyle\Sigma_{X}=p\Sigma_{Y_{1}}+(1-p)\Sigma_{Y_{2}}+\underbrace{p(\mu_{X}-\mu_{Y_{1}})(\mu_{X}-\mu_{Y_{1}})^{T}+(1-p)(\mu_{X}-\mu_{Y_{2}})(\mu_{X}-\mu_{Y_{2}})^{T}}_{=:\tilde{\Sigma}}  (18) 
where \tilde{\Sigma} is the covariance matrix obtained from the sample mean vectors \mu_{Y_{1}} and \mu_{Y_{2}} of Y_{1} and Y_{2}. Consequently, we get
\displaystyle pH(\mathcal{N}(\mu_{Y_{1}},\Sigma_{Y_{1}}))+(1-p)H(\mathcal{N}(\mu_{Y_{2}},\Sigma_{Y_{2}}))  
\displaystyle=\frac{Np}{2}\ln(2\pi e)+\frac{p}{2}\ln(\det\Sigma_{Y_{1}})+\frac{N(1-p)}{2}\ln(2\pi e)+\frac{1-p}{2}\ln(\det\Sigma_{Y_{2}})  
\displaystyle=\frac{N}{2}\ln(2\pi e)+\frac{1}{2}\ln\left((\det\Sigma_{Y_{1}})^{p}(\det\Sigma_{Y_{2}})^{1-p}\right)  
\displaystyle\stackrel{(a)}{\leq}\frac{N}{2}\ln(2\pi e)+\frac{1}{2}\ln(\det(p\Sigma_{Y_{1}}+(1-p)\Sigma_{Y_{2}}))  
\displaystyle\stackrel{(b)}{\leq}\frac{N}{2}\ln(2\pi e)+\frac{1}{2}\ln(\det(p\Sigma_{Y_{1}}+(1-p)\Sigma_{Y_{2}}+\tilde{\Sigma}))  
\displaystyle=\frac{N}{2}\ln(2\pi e)+\frac{1}{2}\ln(\det\Sigma_{X})  
\displaystyle=H(\mathcal{N}(\mu_{X},\Sigma_{X})) 
where (a) follows because \Sigma_{Y_{1}} and \Sigma_{Y_{2}} are positive definite and from [16, Cor. 7.6.8], and where (b) follows because \tilde{\Sigma} is positive semidefinite and from, e.g., [16, Cor. 4.3.12]. ∎
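Both steps of the proof, the decomposition (18) and the resulting entropy inequality, can be verified numerically. The sketch below (synthetic two-cluster data; names ours) uses biased sample covariances, for which (18) holds exactly:

```python
import numpy as np

def gauss_entropy(S):
    """Differential entropy of a Gaussian with covariance S."""
    N = S.shape[0]
    return 0.5 * N * np.log(2 * np.pi * np.e) + 0.5 * np.linalg.slogdet(S)[1]

rng = np.random.default_rng(3)
Y1 = rng.normal(0, 1, (80, 2))
Y2 = rng.normal(3, 2, (120, 2))
X = np.vstack([Y1, Y2])
p = len(Y1) / len(X)

S1 = np.cov(Y1, rowvar=False, bias=True)
S2 = np.cov(Y2, rowvar=False, bias=True)
SX = np.cov(X, rowvar=False, bias=True)
mX, m1, m2 = X.mean(0), Y1.mean(0), Y2.mean(0)

# Law of total covariance, eq. (18); exact for biased sample covariances.
S_between = (p * np.outer(mX - m1, mX - m1)
             + (1 - p) * np.outer(mX - m2, mX - m2))
assert np.allclose(SX, p * S1 + (1 - p) * S2 + S_between)

# Entropy inequality of Lemma E.1.
lhs = gauss_entropy(SX)
rhs = p * gauss_entropy(S1) + (1 - p) * gauss_entropy(S2)
print(lhs >= rhs)  # True
```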
Proof of Theorem 4.1.
Since \mathcal{Y} is proportional to \mathcal{Z} and since the coarsening of a proportional partition is proportional, we can apply the chain rule of entropy (see Appendix D) to get
\displaystyle H(\mathcal{Y})+H(\mathcal{Z}|\mathcal{Y})  \displaystyle=H(\mathcal{Z},\mathcal{Y})  
\displaystyle H(\tilde{\mathcal{Y}})+H(\mathcal{Z}|\tilde{\mathcal{Y}})  \displaystyle=H(\mathcal{Z},\tilde{\mathcal{Y}}). 
We hence get
\displaystyle H(\mathcal{Y})+H(\mathcal{Z}|\mathcal{Y})=H(\mathcal{Z},\mathcal{Y})\stackrel{(a)}{=}H(\mathcal{Z},\mathcal{Y},\tilde{\mathcal{Y}})=H(\mathcal{Z},\tilde{\mathcal{Y}})+H(\mathcal{Y}|\mathcal{Z},\tilde{\mathcal{Y}})\stackrel{(b)}{=}H(\mathcal{Z},\tilde{\mathcal{Y}})=H(\tilde{\mathcal{Y}})+H(\mathcal{Z}|\tilde{\mathcal{Y}})  (19) 
where (a) holds because \tilde{\mathcal{Y}} is a coarsening of \mathcal{Y}, and (b) holds because \mathcal{Y} is a coarsening of \mathcal{Z}. In other words, for proportional coarsenings of \mathcal{Z}, consistency with \mathcal{Z} (measured by the conditional entropy) can be freely traded for model simplicity (measured by entropy).
For the remaining part of the RHS of (12), we write:
\sum_{i=1}^{k}\frac{|Y_{i}|}{|X|}H(\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}}))=\sum_{j=1}^{k^{\prime}}\frac{|\tilde{Y}_{j}|}{|X|}\sum_{i:Y_{i}\subseteq\tilde{Y}_{j}}\frac{|Y_{i}|}{|\tilde{Y}_{j}|}H(\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}})). 
If the inner sums on the RHS consist of at most two terms, i.e., \tilde{Y}_{j}=Y_{i_{1}}\cup Y_{i_{2}} or \tilde{Y}_{j}=Y_{i}, the inequality is established by Lemma E.1. If an inner sum consists of more than two clusters, Lemma E.1 is applied recursively. For example, if \tilde{Y}_{1}=Y_{1}\cup Y_{2}\cup Y_{3},
\displaystyle\sum_{i=1}^{3}\frac{|Y_{i}|}{|\tilde{Y}_{1}|}H(\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}}))\leq\frac{|Y_{1}\cup Y_{2}|}{|\tilde{Y}_{1}|}H(\mathcal{N}(\mu_{Y_{1}\cup Y_{2}},\Sigma_{Y_{1}\cup Y_{2}}))+\frac{|Y_{3}|}{|\tilde{Y}_{1}|}H(\mathcal{N}(\mu_{Y_{3}},\Sigma_{Y_{3}}))\leq H(\mathcal{N}(\mu_{\tilde{Y}_{1}},\Sigma_{\tilde{Y}_{1}})). 
This completes the proof. ∎
Appendix F Proof of Theorem 4.2
Note that, with (13) and (19) (since both \mathcal{Y} and \tilde{\mathcal{Y}} are proportional coarsenings of \mathcal{Z}), we obtain
\displaystyle H(\tilde{\mathcal{Y}})+\beta H(\mathcal{Z}|\tilde{\mathcal{Y}})-H(\mathcal{Y})-\beta H(\mathcal{Z}|\mathcal{Y})  
\displaystyle=H(\tilde{\mathcal{Y}})+\beta H(\mathcal{Z}|\tilde{\mathcal{Y}})-(1-\beta)H(\mathcal{Y})-\beta\left(H(\tilde{\mathcal{Y}})+H(\mathcal{Z}|\tilde{\mathcal{Y}})\right)  
\displaystyle=(1-\beta)\left(H(\tilde{\mathcal{Y}})-H(\mathcal{Y})\right)  
\displaystyle=(\beta-1)H(\mathcal{Y}|\tilde{\mathcal{Y}})=(\beta-1)H\left(\frac{p_{k^{\prime}}}{q_{k^{\prime}}},\dots,\frac{p_{k}}{q_{k^{\prime}}}\right). 
For i=1,\dots,k^{\prime}-1, we have H(\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}}))=H(\mathcal{N}(\mu_{\tilde{Y}_{i}},\Sigma_{\tilde{Y}_{i}})), hence only the last term remains. We obtain with the proof of Theorem 4.1,
\displaystyle H(\mathcal{N}(\mu,\Sigma))-\sum_{i=k^{\prime}}^{k}\frac{p_{i}}{q_{k^{\prime}}}H(\mathcal{N}(\mu_{Y_{i}},\Sigma_{Y_{i}}))=\sum_{i=k^{\prime}}^{k}\frac{p_{i}}{2q_{k^{\prime}}}\ln\left(\frac{\det\Sigma}{\det\Sigma_{Y_{i}}}\right). 
It follows that the two costs in the statement are equal for \beta_{0} such that
(\beta_{0}-1)H\left(\frac{p_{k^{\prime}}}{q_{k^{\prime}}},\dots,\frac{p_{k}}{q_{k^{\prime}}}\right)=\sum_{i=k^{\prime}}^{k}\frac{p_{i}}{2q_{k^{\prime}}}\ln\left(\frac{\det\Sigma_{Y_{i}}}{\det\Sigma}\right) 
from which we get
\beta_{0}=1+\frac{\sum_{i=k^{\prime}}^{k}\frac{p_{i}}{2q_{k^{\prime}}}\ln\left(\frac{\det\Sigma_{Y_{i}}}{\det\Sigma}\right)}{H\left(\frac{p_{k^{\prime}}}{q_{k^{\prime}}},\dots,\frac{p_{k}}{q_{k^{\prime}}}\right)}.  (20) 
∎
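The critical value (20) can be computed directly from the cluster weights and covariance determinants. A small sketch (the example weights and covariances are hypothetical, chosen only for illustration):

```python
import numpy as np

def entropy(w):
    w = np.asarray(w, dtype=float)
    w = w[w > 0]
    return -(w * np.log(w)).sum()

def beta0(p, Sigmas, Sigma_merged):
    """Critical trade-off value of eq. (20). p holds the weights p_{k'},...,p_k
    of the clusters being merged, Sigmas their covariances, and Sigma_merged
    the covariance of the merged cluster."""
    p = np.asarray(p, dtype=float)
    q = p.sum()          # q_{k'}
    w = p / q
    num = sum(w[i] / 2.0 * np.log(np.linalg.det(Sigmas[i]) / np.linalg.det(Sigma_merged))
              for i in range(len(p)))
    return 1.0 + num / entropy(w)

# Hypothetical example: two equally weighted clusters with unit covariance that
# merge into a cluster with larger covariance (the merge inflates the determinant).
b0 = beta0([0.5, 0.5], [np.eye(2), np.eye(2)], 1.5 * np.eye(2))
print(b0)
```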
Acknowledgement
The authors thank Hongfu Liu, Pengjiang Qian, and Daphne Teck Ching Lai for sharing their code implementing semi-supervised versions of k-means, spectral clustering, and fuzzy clustering. We also thank Jacek Tabor for useful discussions and comments.
The work of Marek Śmieja was supported by the National Science Centre (Poland) grant no. 2016/21/D/ST6/00980. The work of Bernhard C. Geiger has been funded by the Erwin Schrödinger Fellowship J 3765 of the Austrian Science Fund and by the German Ministry of Education and Research in the framework of an Alexander von Humboldt Professorship.
References
 [1] Charu C Aggarwal and Chandan K Reddy. Data clustering: algorithms and applications. Chapman and Hall/CRC, 2013.
 [2] Christophe Ambroise, Thierry Denoeux, Gérard Govaert, and Philippe Smets. Learning from an imprecise teacher: probabilistic and evidential approaches. Applied Stochastic Models and Data Analysis, 1:100–105, 2001.
 [3] LNF Ana and Anil K Jain. Robust data clustering. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), volume 2, pages II–128, 2003.
 [4] S. Asafi and D. Cohen-Or. Constraints as features. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1634–1641, Portland, OR, June 2013.
 [5] Sugato Basu. Semi-supervised clustering: Learning with limited user feedback. PhD thesis, The University of Texas at Austin, 2003.
 [6] Sugato Basu, Mikhail Bilenko, and Raymond J Mooney. A probabilistic framework for semi-supervised clustering. In Proc. ACM Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 59–68, Seattle, WA, August 2004.
 [7] Sugato Basu, Ian Davidson, and Kiri Wagstaff. Constrained clustering: Advances in algorithms, theory, and applications. CRC Press, 2008.
 [8] Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn, et al. The R package bgmm: mixture modeling with uncertain knowledge. Journal of Statistical Software, 47(3):31, 2012.
 [9] Charles Bouveyron and Stéphane Girard. Robust supervised classification with mixture models: Learning from data with uncertain labels. Pattern Recognition, 42(11):2649–2658, 2009.
 [10] Daniele Calandriello, Gang Niu, and Masashi Sugiyama. Semi-supervised information-maximization clustering. Neural Networks, 57:103–111, 2014.
 [11] Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6(Jan):165–188, 2005.
 [12] Etienne Côme, Latifa Oukhellou, Thierry Denoeux, and Patrice Aknin. Learning from partially supervised data using mixture models and belief functions. Pattern Recognition, 42(3):334–348, 2009.
 [13] Li Fei-Fei and Pietro Perona. A Bayesian hierarchical model for learning natural scene categories. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 524–531, San Diego, CA, June 2005.
 [14] David Gondek and Thomas Hofmann. Non-redundant data clustering. Knowledge and Information Systems, 12(1):1–24, 2007.
 [15] John A Hartigan and Manchek A Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.
 [16] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 2 edition, 2013.
 [17] Eyke Hüllermeier and Jürgen Beringer. Learning from ambiguously labeled examples. In Proc. Int. Symposium on Intelligent Data Analysis (IDA), pages 168–179, Madrid, Spain, September 2005. Springer.
 [18] Anil K Jain, M Narasimha Murty, and Patrick J Flynn. Data clustering: a review. ACM computing surveys (CSUR), 31(3):264–323, 1999.
 [19] Yizhang Jiang, Fu-Lai Chung, Shitong Wang, Zhaohong Deng, Jun Wang, and Pengjiang Qian. Collaborative fuzzy clustering from multiple weighted views. IEEE Transactions on Cybernetics, 45(4):688–701, 2015.
 [20] S. Kamvar, D. Klein, and C.D. Manning. Spectral learning. In Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI), pages 561–566, Acapulco, Mexico, August 2003.
 [21] Justin Klekota and Frederick P Roth. Chemical substructures that enrich for biological activity. Bioinformatics, 24(21):2518–2525, 2008.
 [22] Daphne Teck Ching Lai and Jonathan M Garibaldi. Improving semi-supervised fuzzy c-means classification of breast cancer data using feature selection. In Proc. IEEE Int. Conf. on Fuzzy Systems (FUZZ-IEEE), pages 1–8, Hyderabad, India, July 2013. IEEE.
 [23] Tilman Lange, Martin HC Law, Anil K Jain, and Joachim M Buhmann. Learning with constrained and unlabelled data. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 731–738, San Diego, CA, June 2005.
 [24] Rémi Lebret, Serge Iovleff, Florent Langrognet, Christophe Biernacki, Gilles Celeux, and Gérard Govaert. Rmixmod: the R package of the model-based unsupervised, supervised and semi-supervised classification mixmod library. Journal of Statistical Software, 67(6):241–270, 2015.
 [25] Levi Lelis and Jörg Sander. Semi-supervised density-based clustering. In Proc. IEEE Int. Conf. on Data Mining, pages 842–847, Miami, Florida, December 2009. IEEE.
 [26] M. Lichman. UCI machine learning repository, 2013.
 [27] Hongfu Liu and Yun Fu. Clustering with partition level side information. In Proc. IEEE Int. Conf. on Data Mining (ICDM), pages 877–882, Atlantic City, NJ, November 2015.
 [28] Mei Lu, Xiang-Jun Zhao, Li Zhang, and Fan-Zhang Li. Semi-supervised concept factorization for document clustering. Information Sciences, 331:86–98, 2016.
 [29] Zhengdong Lu and Todd K. Leen. Semi-supervised learning with penalized probabilistic clustering. In Advances in Neural Information Processing Systems (NIPS), pages 849–856, Vancouver, British Columbia, Canada, December 2005.
 [30] Blaine Nelson and Ira Cohen. Revisiting probabilistic models for clustering with pairwise constraints. In Proc. Int. Conf. on Machine Learning (ICML), pages 673–680, Corvallis, OR, June 2007.
 [31] Berend Olivier, Willem Soudijn, and Ineke van Wijngaarden. The 5-HT1A receptor and its ligands: structure and function. In Progress in Drug Research, volume 52, pages 103–165. 1999.
 [32] Witold Pedrycz, Alberto Amato, Vincenzo Di Lecce, and Vincenzo Piuri. Fuzzy clustering with partial supervision in organization and classification of digital images. IEEE Transactions on Fuzzy Systems, 16(4):1008–1026, 2008.
 [33] Witold Pedrycz and James Waletzky. Fuzzy clustering with partial supervision. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 27(5):787–795, 1997.
 [34] Pengjiang Qian, Yizhang Jiang, Shitong Wang, Kuan-Hao Su, Jun Wang, Lingzhi Hu, and Raymond F Muzic. Affinity and penalty jointly constrained spectral clustering with all-compatibility, flexibility, and robustness. IEEE Transactions on Neural Networks and Learning Systems, 2016. Accepted for publication.
 [35] Richard A. Redner and Homer F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.
 [36] Noam Shental, Aharon Bar-Hillel, Tomer Hertz, and Daphna Weinshall. Computing Gaussian mixture models with EM using equivalence constraints. In Advances in Neural Information Processing Systems (NIPS), pages 465–472, Vancouver, British Columbia, Canada, December 2004.
 [37] Noam Slonim. The information bottleneck: Theory and applications. PhD thesis, Hebrew University of Jerusalem, 2002.
 [38] Marek Śmieja and Dawid Warszycki. Average information content maximization – a new approach for fingerprint hybridization and reduction. PLoS ONE, 11(1):e0146666, 2016.
 [39] P Spurek, J Tabor, and K Byrski. Active function cross-entropy clustering. Expert Systems with Applications, 72:49–66, 2017.
 [40] Przemyslaw Spurek, Konrad Kamieniecki, Jacek Tabor, Krzysztof Misztal, and Marek Śmieja. R package CEC. Neurocomputing, 237:410–413, 2016.
 [41] DJ Strouse and David J. Schwab. The deterministic information bottleneck. In Proc. Conf. on Uncertainty in Artificial Intelligence (UAI), pages 696–705, New York City, NY, June 2016.
 [42] Jacek Tabor and Przemyslaw Spurek. Cross-entropy clustering. Pattern Recognition, 47(9):3046–3059, 2014.
 [43] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proc. Allerton Conf. on Communication, Control, and Computing, pages 368–377, Monticello, IL, September 1999.
 [44] Alexander Topchy, Anil K Jain, and William Punch. Combining multiple weak clusterings. In Proc. IEEE Int. Conf. on Data Mining (ICDM), pages 331–338, Melbourne, Florida, November 2003.
 [45] Enmei Tu, Yaqian Zhang, Lin Zhu, Jie Yang, and Nikola Kasabov. A graph-based semi-supervised k nearest-neighbor method for nonlinear manifold distributed data classification. Information Sciences, 367:673–688, 2016.
 [46] Ziang Wang and Ian Davidson. Flexible constrained spectral clustering. In Proc. ACM Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 563–572, Washington, DC, July 2010.
 [47] D. Warszycki, S. Mordalski, K. Kristiansen, R. Kafel, I. Sylte, Z. Chilmonczyk, and A. J. Bojarski. A linear combination of pharmacophore hypotheses as a new tool in search of new active compounds – an application for 5-HT1A receptor ligands. PloS ONE, 8(12):e84510, 2013.
 [48] Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, and Anil K Jain. Semi-crowdsourced clustering: Generalizing crowd labeling by robust distance metric learning. In Advances in Neural Information Processing Systems (NIPS), pages 1772–1780, Lake Tahoe, December 2012.
 [49] Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1–130, 2009.