Robust Bayesian Cluster Enumeration
Abstract
A major challenge in cluster analysis is that the number of data clusters is mostly unknown and it must be estimated prior to clustering the observed data. In realworld applications, the observed data is often subject to heavy tailed noise and outliers which obscure the true underlying structure of the data. Consequently, estimating the number of clusters becomes challenging. To this end, we derive a robust cluster enumeration criterion by formulating the problem of estimating the number of clusters as maximization of the posterior probability of multivariate candidate models. We utilize Bayes’ theorem and asymptotic approximations to come up with a robust criterion that possesses a closedform expression. Further, we refine the derivation and provide a robust cluster enumeration criterion for the finite sample regime. The robust criteria require an estimate of cluster parameters for each candidate model as an input. Hence, we propose a twostep cluster enumeration algorithm that uses the expectation maximization algorithm to partition the data and estimate cluster parameters prior to the calculation of one of the robust criteria. The performance of the proposed algorithm is tested and compared to existing cluster enumeration methods using numerical and real data experiments.
I Introduction
Cluster analysis is an unsupervised learning task that finds the intrinsic structure in a set of unlabeled data by grouping similar objects into clusters. Cluster analysis plays a crucial role in a wide variety of fields of study, such as social sciences, biology, medical sciences, statistics, machine learning, pattern recognition, and computer vision [1, 2, 3, 4]. A major challenge in cluster analysis is that the number of clusters is usually unknown but it is required to cluster the data. The estimation of the number of clusters, also called cluster enumeration, has attracted the interest of researchers for decades and various methods have been proposed in the literature, see for example [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21] and the reviews in [22, 23, 24, 25, 4]. However, to this day, no single best cluster enumeration method exists.
In realworld applications, the observed data is often subject to heavy tailed noise and outliers [3, 26, 27, 28, 29, 30] which obscure the true underlying structure of the data. Consequently, cluster enumeration becomes even more challenging when either the data is contaminated by a fraction of outliers or there exist deviations from the distributional assumptions. To this end, many robust cluster enumeration methods have been proposed in the literature, see [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 27, 42, 43, 44, 45] and the references therein. A popular approach in robust cluster analysis is to use the Bayesian information criterion (BIC), as derived by Schwarz [46], to estimate the number of data clusters after either removing outliers from the data [32, 33, 34, 31], modeling noise or outliers using an additional component in a mixture modeling framework [35, 36], or exploiting the idea that the presence of outliers causes the distribution of the data to be heavy tailed and, subsequently, modeling the data as a mixture of heavy tailed distributions [37, 38]. For example, modeling the data using a family of distributions [47, 48, 49, 50, 51, 52, 53] provides a principled way of dealing with outliers by giving them less weight in the objective function. The family of distributions is flexible as it contains the heavy tailed Cauchy for and the Gaussian distribution for as special cases. Consequently, we model the clusters using a family of multivariate distributions and derive robust cluster enumeration criteria that account for outliers given that the degree of freedom parameter is sufficiently small.
In statistical model selection, it is known that the original BIC [46, 54] penalizes two structurally different models the same way if they have the same number of unknown parameters [55, 56]. Hence, careful examination of the original BIC is a necessity prior to its application in specific model selection problems [55]. Following this line of argument, we have recently derived the BIC for cluster analysis by formulating cluster enumeration as maximization of the posterior probability of candidate models [20, 21]. In [20] we showed that the BIC derived specifically for cluster enumeration has a different penalty term compared to the original BIC. However, robustness was not considered in [20], where a family of multivariate Gaussian candidate models were used to derive the criterion, which we refer to as .
To the best of our knowledge, this is the first attempt made to derive a robust cluster enumeration criterion by formulating the cluster enumeration problem as maximization of the posterior probability of multivariate candidate models. Under some mild assumptions, we derive a robust Bayesian cluster enumeration criterion, . We show that has a different penalty term compared to the original BIC [46, 54], given that the candidate models in the original BIC are represented by a family of multivariate distributions. Interestingly, for both the data fidelity and the penalty terms depend on the assumed distribution for the data, while for the original BIC changes in the data distribution only affect the data fidelity term. Asymptotically, converges to . As a result, our derivations also provide a justification for the use of the original BIC with multivariate candidate models from a cluster analysis perspective. Further, we refine the derivation of by providing an exact expression for its penalty term. This results in a robust criterion, , which behaves better than in the finite sample regime and converges to in the asymptotic regime.
In general, BIC based cluster enumeration methods require a clustering algorithm that partitions the data according to the number of clusters specified by each candidate model and provides an estimate of cluster parameters. Hence, we apply the expectation maximization (EM) algorithm to partition the data prior to the calculation of an enumeration criterion, resulting in a twostep approach. The proposed algorithm provides a unified framework for the robust estimation of the number of clusters and cluster memberships.
The paper is organized as follows. Section II formulates the cluster enumeration problem and Section III introduces the proposed robust cluster enumeration criterion. Section IV presents the twostep cluster enumeration algorithm. A comparison of different Bayesian cluster enumeration criteria is given in Section V. A performance evaluation and comparison to existing methods using numerical and real data experiments is provided in Section VI. Finally, Section VII contains concluding remarks and highlights future research directions. Notably, a detailed proof is provided in Appendix B.
Notation: Lower and uppercase boldface letters represent column vectors and matrices, respectively; Calligraphic letters denote sets with the exception of which represents the likelihood function; , , and denote the set of real numbers, the set of positive real numbers, and the set of positive integers, respectively; and denote probability mass function and probability density function (pdf), respectively; represents a multivariate distributed random variable with location parameter , scatter matrix , and degree of freedom ; denotes the estimator (or estimate) of the parameter ; iid stands for independent and identically distributed; (A.) denotes an assumption; stands for the natural logarithm; represents the expectation operator; stands for the limit; represents vector or matrix transpose; denotes the determinant when its argument is a matrix and an absolute value when its argument is scalar; represents the Kronecker product; refers to the stacking of the columns of an arbitrary matrix into a long column vector; denotes Landau’s term which tends to a constant as the data size goes to infinity; stands for an dimensional identity matrix; and represent an dimensional all zero and all one matrix, respectively; denotes the cardinality of the set ; represents equality by definition; denotes mathematical equivalence.
Ii Problem Formulation
Let denote the observed data set which can be partitioned into independent, mutually exclusive, and nonempty clusters . Each cluster , for , contains data vectors that are realizations of iid multivariate random variables , where , , and represent the centroid, the scatter matrix, and the degree of freedom of the th cluster, respectively. Let be a family of multivariate candidate models, where and represent the specified minimum and maximum number of clusters, respectively. Each candidate model , for and , represents a partition of into clusters with associated cluster parameter matrix , which lies in a parameter space . Assuming that

the degree of freedom parameter , for , is fixed at some prespecified value,
the parameters of interest reduce to . Our research goal is to estimate the number of clusters in given assuming that

the constraint is satisfied.
Note that, given some mild assumptions are satisfied, we have recently derived a general Bayesian cluster enumeration criterion, which we refer to as [20]. However, since we assume multivariate candidate models, some of the assumptions made in the derivation of require mathematical justification. In the next section, we highlight the specific assumptions that require justification and, in an attempt to keep the article self contained, provide all necessary derivations.
Iii Robust Bayesian Cluster Enumeration Criterion
Our objective is to select a model which is a posteriori most probable among the set of candidate models . Mathematically
(1) 
where is the posterior probability of given , which can be written as
(2) 
where denotes the joint posterior density of and given . According to Bayes’ theorem
(3) 
where denotes a discrete prior on , represents the prior on given , denotes the pdf of the data set , and is the likelihood function. Substituting Eq. (3) into Eq. (2) results in
(4) 
Since
(5) 
we maximize over the family of candidate models for mathematical convenience. Hence, taking the natural logarithm of Eq. (4) results in
(6) 
where is a constant that is independent of and, consequently, has no effect on the maximization of over . As the partitions (clusters) , for , are independent, mutually exclusive, and nonempty, the following holds:
(7)  
(8) 
Substituting Eqs. (7) and (8) into Eq. (6) results in
(9) 
Maximization of Eq. (9) over involves the maximization of the natural logarithm of a multidimensional integral. The multidimensional integral can be solved using either numerical integration or asymptotic approximations that result in a closedform solution. Closedform solutions are known to provide an insight into the model selection problem at hand [55]. Hence, we apply Laplace’s method of integration [55, 56, 57, 20], which makes asymptotic approximations, to simplify the multidimensional integral in Eq. (9).
For ease of notation, Eq. (9) is written as
(10) 
where
(11) 
Given that

, for , has first and secondorder derivatives that are continuous over the parameter space ,

, for , has global maximum at , where is an interior point of , and

the Hessian matrix of , which is given by
where is the cardinality of , is positive definite for ,
can be approximated by its secondorder Taylor series expansion around as follows:
(12) 
where . The first derivative of evaluated at vanishes because of assumption (A.4).
Note that , for , is known to have multiple local maxima [53, 51]. For assumption (A.4) to hold, we have to show that is the global maximum of , for . We know that the global maximizer of , is , where is the true parameter vector. is the maximum likelihood estimator and its derivation and the final expressions are given in Appendix A. In [58], it was proven that
with probability one. As a result, asymptotically, assumption (A.4) holds. Assumption (A.5) follows because is a maximizer of .
Assume that

, for , is continuously differentiable and its firstorder derivatives are bounded on with .
Then, substituting Eq. (12) into Eq. (11) and approximating by its Taylor series expansion yields
(13) 
where HOT denotes higher order terms and is a Gaussian kernel with mean and covariance matrix . Ignoring the higher order terms, Eq. (13) reduces to
(14) 
The second term in Eq. (14) is equivalent to , where is a constant. Consequently, Eq. (14) reduces to
(15) 
given that . Substituting Eq. (15) into Eq. (10) results in
(16) 
where
(17) 
is the Fisher information matrix (FIM) of the data vectors that belong to the th partition.
Iv Proposed Robust Bayesian Cluster Enumeration Algorithm
We propose a robust cluster enumeration algorithm to estimate the number of clusters in the data set . The presented twostep approach utilizes an unsupervised learning algorithm to partition into the number of clusters specified by each candidate model prior to the computation of one of the proposed robust cluster enumeration criteria for that particular model.
Iva Proposed Robust Bayesian Cluster Enumeration Criteria
For each candidate model , let there be a clustering algorithm that partitions into clusters and provides parameter estimates , for . Assume that (A.1)(A.7) are fulfilled.
Theorem 1.
The posterior probability of given can be asymptotically approximated by
(18) 
where represents the number of estimated parameters per cluster and
(19) 
The likelihood function, also called the data fidelity term, is given by
(20) 
where , denotes the gamma function, and is the squared Mahalanobis distance. The second term in the second line of Eq. (18) is referred to as the penalty term.
Proof.
Once is computed for each candidate model , the number of clusters in is estimated as
(21) 
Corollary 1.
When the data size is finite, one can opt to compute , without asymptotic approximations to obtain a more accurate penalty term. In such cases, the posterior probability of given becomes
(22) 
where the expression for is given in Appendix C.
IvB The Expectation Maximization (EM) Algorithm for Mixture Models
We consider maximum likelihood estimation of the parameters of the component mixture of distributions
(23) 
where denotes the variate pdf and . are the mixing coefficients and are assumed to be known or estimated, e.g. using [48]. The mixing coefficients satisfy the constraints for , and .
The EM algorithm is widely used to estimate the parameters of the component mixture of distributions [48, 47, 49, 59]. The EM algorithm contains two basic steps, namely the E step and the M step, which are performed iteratively until a convergence condition is satisfied. The E step computes
(24)  
(25) 
where is the posterior probability that belongs to the th cluster at the th iteration and is the weight given to by the th cluster at the th iteration. Once and are calculated, the M step updates cluster parameters as follows:
(26)  
(27)  
(28) 
Algorithm 1 summarizes the working principle of the proposed robust twostep cluster enumeration approach. Given that the degree of freedom parameter is fixed at some finite value, the computational complexity of Algorithm 1 is the sum of the run times of the two steps. Since the initialization, i.e., the Kmedians algorithm is performed only for a few iterations, the computational complexity of the first step is dominated by the EM algorithm and it is given by for a single candidate model , where is a fixed stopping threshold of the EM algorithm. The computational complexity of is , which is much smaller than the runtime of the EM algorithm and, as a result, it can easily be ignored in the runtime analysis of the proposed algorithm. Hence, the total computational complexity of Algorithm 1 is .
Note that if is used in Algorithm 1 instead of , the computational complexity of the algorithm increases significantly with the increase in the number of features due to the calculation of Eq. (69).
V Comparison of Different Bayesian Cluster Enumeration Criteria
Model selection criteria that are derived by maximizing the posterior probability of candidate models given data are known to have a common form [56, 60, 20] that is consistent with
(29) 
where is the data fidelity term and is the penalty term. The proposed robust cluster enumeration criteria, and , and the original BIC with multivariate candidate models, , [48, 37, 38] have an identical data fidelity term. The difference in these criteria lies in their penalty terms, which are given by
(30)  
(31)  
(32) 
where and are given by Eq. (19) and Eq. (69), respectively. Note that calculates an exact value of the penalty term, while and compute its asymptotic approximation. In the finite sample regime the penalty term of is stronger than the penalty term of , while asymptotically all three criteria have an identical penalty term.
Remark.
When the degree of freedom parameter converges to
(33) 
where is the asymptotic criterion derived in [20] assuming a family of Gaussian candidate models.
Remark.
A modification of the data distribution of the candidate models only affects the data fidelity term of the original BIC [46, 54]. However, given that the BIC is specifically derived for cluster analysis, we showed that both the data fidelity and penalty terms change as the data distribution of the candidate models changes, see Eq. (18) and the expression for in [20].
A related robust cluster enumeration method that uses the original BIC to estimate the number of clusters is the trimmed BIC (TBIC) [32]. The TBIC estimates the number of clusters using Gaussian candidate models after trimming some percentage of the data. In TBIC, the fast trimmed likelihood estimator (FASTTLE) is used to obtain maximum likelihood estimates of cluster parameters. The FASTTLE is computationally expensive since it carries out a trial and a refinement step multiple times, see [32] for details.
Vi Experimental Results
In this section, we compare the performance of the proposed robust twostep algorithm with stateoftheart cluster enumeration methods using numerical and real data experiments. In addition to the methods discussed in Section V, we compare our cluster enumeration algorithm with the gravitational clustering (GC) [41] and the Xmeans [10] algorithms. All experimental results are an average of Monte Carlo runs. The degree of freedom parameter is set to for all methods that have multivariate candidate models. We use the author’s implementation of the gravitational clustering algorithm [41]. For the TBIC, we trim of the data and perform iterations of the trial and refinement steps. For the model selection based methods, the minimum and maximum number of clusters is set to and , where denotes the true number of clusters in the data under consideration.
Via Performance Measures
Performance is assessed in terms of the empirical probability of detection and the mean absolute error (MAE), which are defined as
(34)  
MAE  (35) 
where is the total number of Monte Carlo experiments, is the estimated number of clusters in the th Monte Carlo experiment, and is the indicator function defined as
(36) 
In addition to these two performance measures, we also report the empirical probability of underestimation and overestimation, which are defined as
(37)  
(38) 
respectively.
ViB Numerical Experiments
ViB1 Analysis of the sensitivity of different cluster enumeration methods to outliers
we generate two data sets which contain realizations of dimensional random variables , where , with cluster centroids , , , and covariance matrices
The first data set (Data1), as depicted in Fig. 0(a), replaces a randomly selected data point with an outlier that is generated from a uniform distribution over the range on each variate at each iteration. The sensitivity of different cluster enumeration methods to a single replacement outlier over iterations as a function of the number of data vectors per cluster is displayed in Table I. From the compared methods, our robust criterion has the best performance in terms of both and MAE. Except for and the TBIC, the performance of all methods deteriorates when , for , is small and, notably, performs poorly. This behavior is attributed to the fact that is an asymptotic criterion and in the small sample regime its penalty term becomes weak which results in an increase in the empirical probability of overestimation. and Xmeans are very sensitive to the presence of a single outlier because they model individual clusters as multivariate Gaussian. Xmeans performs even worse than since it uses the Kmeans algorithm to cluster the data, which is ineffective in handling elliptical clusters. An illustrative example of the sensitivity of and to the presence of an outlier is displayed in Fig. 2. Despite the difference in , when the outlier is either in one of the clusters or very close to one of the clusters, both and are able to estimate the correct number of clusters reasonably well. The difference between these two methods arises when the outlier is far away from the bulk of data. While is still able to estimate the correct number of clusters, starts to overestimate the number of clusters.
MAE  
MAE  
MAE  
TBIC [32]  
MAE  
GC [41]  
MAE  
[20]  
MAE  
Xmeans [10]  
MAE 
The second data set (Data2), shown in Fig. 0(b), contains data points in each cluster and replaces a certain percentage of the data set with outliers that are generated from a uniform distribution over the range on each variate. Data2 is generated in a way that no outlier lies inside one of the data clusters. In this manner, we make sure that outliers are points that do not belong to the bulk of data. Fig. 3 shows the empirical probability of detection as a function of the percentage of outliers . GC is able to correctly estimate the number of clusters for . The proposed robust criteria, and , and the original BIC, , behave similarly and are able to estimate the correct number of clusters when . The behavior of these methods is rather intuitive because as the amount of outliers increases, then the methods try to explain the outliers by opening a new cluster. A similar trend is observed for the TBIC even though its curve decays slowly. is able to estimate the correct number of clusters of the time when there are no outliers in the data set. However, even of outliers is enough to drive into overestimating the number of clusters.
ViB2 Impact of the increase in the number of features on the performance of cluster enumeration methods
we generate realizations of the random variables , for , whose cluster centroids and scatter matrices are given by
with , denoting an dimensional all one column vector, and representing an dimensional identity matrix. For this data set, referred to as Data3, the number of features is varied in the range and the number of data points per cluster is set to . Because , Data3 contains realizations of heavy tailed distributions and, as a result, the clusters contain outliers. The empirical probability of detection as a function of the number of features is displayed in Fig. 4. The performance of GC appears to be invariant to the increase in the number of features, while the remaining methods are affected. But, compared to the other cluster enumeration methods, GC is computationally very expensive. outperforms and the TBIC when the number of features is low, while the proposed criterion outperforms both methods in high dimensions. is not computed for this data set because it is computationally expensive and it is not beneficial given the large number of samples.
ViB3 Analysis of the sensitivity of different cluster enumeration methods to cluster overlap
here, we use Data2 with outliers and vary the distance between the second and the third centroid such that the percentage of overlap between the two clusters takes on a value from the set . The empirical probability of detection as a function of the amount of overlap is depicted in Fig. 5. The best performance is achieved by and and, remarkably, both cluster enumeration criteria are able to correctly estimate the number of clusters even when there exists overlap between the two clusters. As expected, when the amount of overlap is , most methods underestimate the number of clusters to two. While it may appear that the enumeration performance of increases for increasing amounts of overlap, in fact groups the two overlapping clusters into one and attempts to explain the outliers by opening a new cluster. A similar trend is observed for Xmeans. GC is inferior to the other robust methods, and experiences an increase in the empirical probability of underestimation.
ViB4 Analysis of the sensitivity of cluster enumeration methods to cluster heterogeneity
we generate realizations of dimensional random variables , where the cluster centroids are selected at random from a uniform distribution in the range in each variate and the scatter matrices are set to for . The data set is generated in a way that there is no overlap between the clusters. The number of data points in the first four clusters is set to , while is allowed to take on values from the set . This data set (Data4) contains multiple outliers since each cluster contains realizations of heavy tailed distributed random variables. The empirical probability of detection as a function of the number of data points in the fifth cluster is shown in Fig. 6. The proposed cluster enumeration methods, and , are able to estimate the correct number of clusters with a high accuracy even when the fifth cluster contains only of the data available in the other clusters. A similar performance is observed for . TBIC and GC are slightly inferior in performance to the other robust cluster enumeration methods. When the number of data points in the fifth cluster increases, all robust methods perform well in estimating the number of clusters. Interestingly, Xmeans outperforms since the considered clusters are all spherical. overestimates the number of clusters and possesses the largest MAE.
ViC Real Data Results
ViC1 Old Faithful geyser data set
Old Faithful is a geyser located in Yellowstone National Park in Wyoming, United States. This data set, depicted in Fig. 6(a), was used in the literature for density estimation [61], time series analysis [62], and cluster analysis [63, 64]. The performance of different cluster enumeration methods on the clean and contaminated versions of the Old Faithful data set is reported in Table II. The contaminated version, shown in Fig. 6(b), is generated by replacing a randomly selected data point with an outlier at each iteration similar to the way Data1 was generated. Most methods are able to estimate the correct number of clusters of the time for the clean version of the Old Faithful data set. Our criteria, and , and are insensitive to the presence of a single replacement outlier, while TBIC exhibits slight sensitivity. In the presence of an outlier, the performance of deteriorates due to an increase in the empirical probability of overestimation. In fact, finds clusters of the time. GC shows the worst performance and possesses the highest MAE.
Next, we replace a certain percentage of the Old Faithful data set with outliers and study the performance of different cluster enumeration methods. The outliers are generated from a uniform distribution over the range