Toward Multi-Diversified Ensemble Clustering of High-Dimensional Data
The emergence of high-dimensional data in various areas has brought new challenges to the ensemble clustering research. To deal with the curse of dimensionality, considerable efforts in ensemble clustering have been made by incorporating various subspace-based techniques. Besides the emphasis on subspaces, rather limited attention has been paid to the potential diversity in similarity/dissimilarity metrics. It remains a surprisingly open problem in ensemble clustering how to create and aggregate a large number of diversified metrics, and furthermore, how to jointly exploit the multi-level diversity in the large number of metrics, subspaces, and clusters, in a unified framework. To tackle this problem, this paper proposes a novel multi-diversified ensemble clustering approach. In particular, we create a large number of diversified metrics by randomizing a scaled exponential similarity kernel, which are then coupled with random subspaces to form a large set of metric-subspace pairs. Based on the similarity matrices derived from these metric-subspace pairs, an ensemble of diversified base clusterings can thereby be constructed. Further, an entropy-based criterion is adopted to explore the cluster-wise diversity in ensembles, based on which the consensus function is therefore presented. Experimental results on twenty high-dimensional datasets have confirmed the superiority of our approach over the state-of-the-art.
1 Introduction
The last decade has witnessed significant progress in the development of the ensemble clustering technique [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], which is characterized by its ability to combine multiple base clusterings into a probably better and more robust consensus clustering. It has recently shown promising advantages in discovering clusters of arbitrary shapes, dealing with noisy data, coping with data from multiple sources, and producing robust clustering results.
In recent years, with high-dimensional data appearing widely in various areas, new challenges have arisen for conventional ensemble clustering algorithms, which often lack the ability to address high-dimensional issues well. Owing to the so-called curse of dimensionality, it is highly desirable but very difficult to find the inherent cluster structure hidden in a huge number of dimensions, especially when high dimensionality is coupled with a very low sample size. Recently, some efforts have been devoted to ensemble clustering of high-dimensional data, which typically exploit different subspace-based (or feature-based) techniques, such as random subspace sampling [15, 9, 16, 17], stratified subspace sampling, and subspace projection, to explore the diversity in high dimensionality. Inherently, these subspace-based techniques select or linearly combine data features into different subsets (i.e., subspaces) by a variety of strategies to seek more perspectives for finding cluster structures.
Besides the issue of subspaces (or features), the choice of similarity/dissimilarity metrics is another crucial factor in dealing with high-dimensional data [19, 20]. The existing ensemble clustering methods typically adopt one or a few preselected metrics, which are often selected implicitly based on the expert’s knowledge or some prior assumptions. However, few, if any, of them have considered the potentially huge benefits and issues hidden in randomized metric spaces. On the one hand, it is very difficult to select or learn an optimal metric for a given dataset without human supervision or implicit assumptions. On the other hand, since different metrics reflect different perspectives on the data, the joint use of a large number of randomized/diversified metrics may reveal huge opportunities hidden in high dimensionality. However, it surprisingly remains an open problem in ensemble clustering how to produce and aggregate a large number of diversified metrics to enhance the consensus performance. Furthermore, starting from the metric diversification problem, another crucial challenge arises as to how to jointly exploit multiple levels of diversity in the large number of metrics, subspaces, and clusters, in a unified ensemble clustering framework.
To tackle the above-mentioned problem, we propose a novel ensemble clustering approach termed multi-diversified ensemble clustering (MDEC) by jointly reconciling large populations of diversified metrics, random subspaces, and weighted clusters. Specifically, we exploit a scaled exponential similarity kernel as the seed kernel, which has advantages in parameter flexibility and neighborhood adaptivity and is randomized to breed a large set of diversified metrics. The set of diversified metrics are coupled with random subspaces to form a large number of metric-subspace pairs, which then contribute to the jointly randomized ensemble generation process where the set of diversified base clusterings are produced with the help of the spectral clustering algorithm. With the clustering ensemble generated, to exploit the cluster-wise diversity in the multiple base clusterings, an entropy-based cluster validity strategy is adopted to evaluate and weight each base cluster by considering the distribution of clusters in the entire ensemble, based on which a new multi-diversified consensus function is therefore proposed (see Section 3 and Fig. 1 for more details). In this paper, we conducted experiments on 20 high-dimensional datasets, including 15 cancer gene expression datasets and 5 image datasets. Extensive experimental results have shown the superiority of our approach against the state-of-the-art ensemble clustering approaches for clustering high-dimensional data.
For clarity, the main contributions of this work are summarized as follows:
This paper for the first time, to the best of our knowledge, shows that the joint use of a large population of randomly diversified metrics can significantly benefit the ensemble clustering of high-dimensional data in an unsupervised manner.
A new metric diversification strategy is proposed by randomizing the scaled exponential similarity kernel with both parameter flexibility and neighborhood adaptivity considered, which is further coupled with random subspace sampling for the jointly randomized generation of base clusterings.
A new ensemble clustering approach termed MDEC is presented, which has the ability of simultaneously exploiting a large population of diversified metrics, random subspaces, and weighted clusters in a unified framework.
Extensive experiments have been conducted on a variety of high-dimensional datasets, which demonstrate the significant advantages of our approach over the state-of-the-art ensemble clustering approaches.
2 Related Work
Due to its ability of combining multiple base clusterings into a probably better and more robust consensus clustering, the ensemble clustering technique has been receiving increasing attention in recent years. Many ensemble clustering algorithms have been developed from different technical perspectives [2, 6, 10, 8, 11, 12, 13, 14, 21, 1, 5, 3, 4, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33], which can be classified into three main categories, namely, the pair-wise co-occurrence based methods, the graph partitioning based methods, and the median partition based methods.
The pair-wise co-occurrence based methods [21, 28, 29] typically construct a co-association matrix by considering the frequency with which two data samples occur in the same cluster among the multiple base clusterings. The co-association matrix is then used as the similarity matrix for the data samples, upon which a clustering algorithm can be performed to obtain the final clustering result. Fred and Jain first introduced the concept of the co-association matrix and proposed the evidence accumulation clustering (EAC) method, which applied a hierarchical agglomerative clustering algorithm on the co-association matrix to build the consensus clustering. To extend the EAC method, Wang et al. took the cluster sizes into consideration and proposed the probability accumulation method. Yi et al. dealt with the uncertain entries in the co-association matrix by first labeling them as unobserved, and then recovering the unobserved entries by the matrix completion technique. Liu et al. proved that the spectral clustering of the co-association matrix is equivalent to a weighted version of $k$-means, and proposed the spectral ensemble clustering (SEC) method to effectively and efficiently obtain the consensus result.
The graph partitioning based methods [8, 30, 31] generally construct a graph model for the ensemble of multiple base clusterings, and then partition the graph into several disjoint subsets to obtain the final clustering result. Strehl and Ghosh solved the ensemble clustering problem by using three graph partitioning based algorithms, namely, cluster-based similarity partitioning algorithm (CSPA), hypergraph partitioning algorithm (HGPA), and meta-clustering algorithm (MCLA). Fern and Brodley formulated a bipartite graph model by treating both clusters and data samples as nodes, and partitioned the graph by the METIS algorithm to obtain the consensus result. Huang et al. dealt with the ensemble clustering problem by sparse graph representation and random walk trajectory analysis, and presented the probability trajectory based graph partitioning (PTGP) method.
The median partition based methods [10, 32, 33] typically formulate the ensemble clustering problem as an optimization problem which aims to find the median partition such that the similarity between the base partitions (i.e., base clusterings) and the median partition is maximized. The median partition problem is NP-hard. To find an approximate solution, Topchy et al. cast the median partition problem as a maximum likelihood problem and solved it by the EM algorithm. Franek and Jiang reduced the ensemble clustering problem to a Euclidean median problem and solved it by the Weiszfeld algorithm. Huang et al. formulated the ensemble clustering problem as a binary linear programming problem and obtained an approximate solution based on the factor graph model and max-product belief propagation.
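To make the co-association idea concrete, the following is a minimal sketch that assumes each base clustering is given as a vector of integer labels. The function name `co_association` and the toy labels are illustrative assumptions and do not come from any of the cited methods.

```python
import numpy as np

def co_association(labelings):
    """Fraction of base clusterings in which each pair of samples shares a cluster.

    labelings: list of 1-D integer label sequences, one per base clustering
    (hypothetical input format chosen for this sketch).
    """
    labelings = [np.asarray(labels) for labels in labelings]
    n = labelings[0].shape[0]
    A = np.zeros((n, n))
    for labels in labelings:
        # 1 where samples i and j carry the same label in this base clustering.
        A += (labels[:, None] == labels[None, :])
    return A / len(labelings)

# Toy ensemble of three base clusterings over four samples.
base = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
A = co_association(base)
```

The resulting matrix is symmetric with a unit diagonal, so it can be passed to any similarity-based clustering algorithm.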
Although significant advances have been made in ensemble clustering research in recent years [2, 6, 10, 8, 11, 12, 13, 14, 21, 1, 3, 4, 5, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33], the existing methods are mostly devised for general-purpose scenarios and lack the ability to appropriately address the clustering problem of high-dimensional data. More recently, some efforts have been made to deal with the curse of dimensionality, where subspace-based (or feature-based) techniques are often exploited. Jing et al. adopted stratified feature sampling to generate a set of subspaces, which are further incorporated into several ensemble clustering algorithms to build the consensus clustering for high-dimensional data. Yu et al. proposed a novel subspace-based ensemble clustering framework termed APCE, which integrates random subspaces, affinity propagation, normalized cut, and five candidate distance metrics. Further, Yu et al. proposed a semi-supervised subspace-based ensemble clustering framework by incorporating random subspaces, constraint propagation, incremental ensemble selection, and normalized cut into the framework. Fern and Brodley exploited random subspace projection to build a set of subspaces, which in fact are obtained by random linear combinations of features (or feature sets). These methods [7, 9, 16, 18] typically exploit the diversity in high dimensionality by various subspace-based techniques, but few of them have fully considered the potentially huge diversity in metric spaces. The existing methods [7, 9, 16, 18] generally use one or a few preselected similarity/dissimilarity metrics, which are selected implicitly based on the expert’s knowledge or some prior assumptions.
Although the APCE method randomly selects a metric out of its five candidate metrics each time, it still fails to go beyond a few metrics to explore the huge potential hidden in a large number of diversified metrics, which may play a crucial role in clustering high-dimensional data. The key challenge here lies in how to create such a large number of highly diversified metrics, and further how to jointly exploit the diversity in the large number of metrics, together with subspace-wise diversity and cluster-wise diversity, to achieve a unified ensemble clustering framework for high-dimensional data.
3 Proposed Framework
This section describes the overall algorithm of the proposed ensemble clustering approach. A brief overview is provided in Section 3.1. The metric diversification process is presented in Section 3.2. The jointly randomized ensemble generation is introduced in Section 3.3. Finally the consensus function is given in Section 3.4.
3.1 Brief Overview
In this paper, we propose a novel multi-diversified ensemble clustering (MDEC) approach (see Fig. 1). First, we create a large number of diversified metrics by randomizing a scaled exponential similarity kernel, and combine the diversified metrics with the random subspaces to form a large set of random metric-subspace pairs. Second, with each random metric-subspace pair, we construct a similarity matrix for the data samples. The spectral clustering algorithm is then performed on these similarity matrices derived from metric-subspace pairs to obtain an ensemble of base clusterings. Third, to exploit the cluster-wise diversity in the ensemble of multiple base clusterings, we adopt an entropy based criterion to evaluate and weight the clusters by considering the distribution of cluster labels in the entire ensemble. With the weighted clusters, the locally weighted co-association matrix is further constructed to serve as a summary of the ensemble. Finally, the spectral clustering algorithm is performed on the locally weighted co-association matrix to obtain the consensus clustering result. It is noteworthy that our approach simultaneously incorporates three levels of diversity, i.e., metric-wise diversity, subspace-wise diversity, and cluster-wise diversity, in a unified framework, which have shown significant advantages in dealing with high-dimensional data when compared to the state-of-the-art ensemble clustering approaches. In the following sections, we will further introduce each step of the proposed approach in detail.
3.2 Diversification of Metrics
The choice of similarity/dissimilarity metrics plays a very crucial role in the field of machine learning and pattern recognition [38, 39, 40, 41, 42]. In particular, unlike the supervised or semi-supervised learning, where the metric learning techniques can be performed to learn the metrics with the help of human supervision or prior assumptions [38, 39, 40, 41, 42], in unsupervised learning it is generally very difficult to choose a proper metric given a task without prior knowledge.
Instead of relying on one or a few manually selected (or learned) metrics, this paper proposes to jointly use a large number of randomly diversified metrics in a unified ensemble clustering framework. Toward this end, we first need to tackle two sub-problems, i.e., how to create a large number of diversified metrics, and how to collectively exploit them in ensemble clustering.
To create the diversified metrics, we take advantage of the kernel trick with randomization incorporated. The kernel similarity metrics have been proved to be a powerful tool for clustering complex data [38, 42], which, however, suffer from the difficulties in selecting proper kernel parameters. The kernel parameters can be learned by some metric learning techniques [38, 42] with supervision or semi-supervision. But without human supervision, it is often extremely difficult to decide proper kernel parameters. This is a critical disadvantage of kernel methods for conventional (unsupervised) applications, which, nevertheless, just becomes an important advantage in our situation where what is highly desired is not the selection of a good kernel similarity metric, but the flexibility to create a large number of diversified ones.
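Specifically, in this paper, we adopt the scaled exponential similarity (SES) kernel as the seed kernel, which will then be randomized to breed a large population of diversified metrics. Let $\mathcal{X} = \{x_1, \dots, x_N\}$ be a set of $N$ data samples, where $x_i \in \mathbb{R}^D$ is the $i$-th sample and $D$ is the number of features. The SES kernel function for samples $x_i$ and $x_j$ is defined as:
$$K(x_i, x_j) = \exp\left(-\frac{d(x_i, x_j)^2}{\mu\,\varepsilon_{ij}}\right),$$
where $\mu$ is a hyperparameter, $\varepsilon_{ij}$ is a scaling term, and $d(x_i, x_j)$ is the Euclidean distance between $x_i$ and $x_j$. Let $N_k(x_i)$ denote the set of the $k$ nearest neighbors of $x_i$. The average distance between $x_i$ and its $k$ nearest neighbors can be computed as
$$\bar{d}(x_i) = \frac{1}{k} \sum_{x_l \in N_k(x_i)} d(x_i, x_l).$$
Then, as suggested for the SES kernel, to simultaneously take into consideration the neighborhood of $x_i$, the neighborhood of $x_j$, and their distance, the scaling term $\varepsilon_{ij}$ is defined as the average of $\bar{d}(x_i)$, $\bar{d}(x_j)$, and $d(x_i, x_j)$. That is,
$$\varepsilon_{ij} = \frac{\bar{d}(x_i) + \bar{d}(x_j) + d(x_i, x_j)}{3}.$$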
The SES kernel is a variant of the Gaussian kernel. It has two free parameters, i.e., the hyperparameter $\mu$ and the number of nearest neighbors $k$. The motivation to adopt the SES kernel as the seed kernel in our approach is two-fold. First, through the scaling term $\varepsilon_{ij}$, which incorporates the $k$-nearest-neighbors’ information, the SES kernel adapts to the neighborhood structure among data samples. Moreover, with each value of $k$ corresponding to a specific neighborhood size, randomizing the parameter $k$ allows multi-scale neighborhood information to be explored to enhance the diversity. Second, the two free parameters $\mu$ and $k$ provide high flexibility for adjusting the influence of the kernel, and randomly perturbing them contributes to the high diversity of the generated metrics.
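As an illustration of the SES kernel described above, the sketch below assumes the Gaussian-type form $\exp(-d^2 / (\mu\,\varepsilon))$ with the scaling term taken as the average of the two mean nearest-neighbor distances and the pairwise distance. The function name `ses_kernel`, the exclusion of a sample from its own neighbor set, and the small numerical floor on the scaling term are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

def ses_kernel(X, mu, k):
    """Scaled exponential similarity matrix for the rows of X.

    Assumes 1 <= k < number of samples.
    """
    # Pairwise Euclidean distances d(x_i, x_j).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Mean distance from each sample to its k nearest neighbours.
    # Assumption: a sample is not its own neighbour, so the first sorted
    # column (the zero self-distance) is skipped.
    d_bar = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    # Scaling term: average of both neighbourhood distances and the pair distance.
    eps = (d_bar[:, None] + d_bar[None, :] + d) / 3.0
    # Assumption: a small floor avoids division by zero for duplicated samples.
    eps = np.maximum(eps, 1e-12)
    return np.exp(-d ** 2 / (mu * eps))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
K = ses_kernel(X, mu=0.5, k=3)
```

Larger values of `mu` flatten the similarities toward 1, while `k` controls how local the scaling term is, which is what makes the pair a convenient handle for diversification.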
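Specifically, we propose to randomly select the two parameters $\mu$ and $k$, respectively, as follows:
$$\mu = \mu_{\min} + \tau_1 (\mu_{\max} - \mu_{\min}),$$
$$k = k_{\min} + \lfloor \tau_2 (k_{\max} - k_{\min}) \rfloor,$$
where $\tau_1 \in [0, 1]$ and $\tau_2 \in [0, 1]$ are two uniform random variables, $\lfloor \cdot \rfloor$ outputs the floor of a real number, and $[\mu_{\min}, \mu_{\max}]$ and $[k_{\min}, k_{\max}]$ are the ranges from which $\mu$ and $k$ are drawn.
Note that our objective is not to find a good pair of parameters $\mu$ and $k$, but to randomize them to yield a large population of diversified metrics. The parameters $\mu$ and $k$ are suggested to be randomly selected in a wide range to enhance the diversity. By performing the random selection $M$ times, a set of $M$ pairs of $\mu$ and $k$ is obtained, which corresponds to $M$ randomized kernel similarity metrics for the dataset $\mathcal{X}$, denoted as
$$\mathcal{K} = \{K^1, \dots, K^M\},$$
where $K^m$ is the SES kernel with parameters $\mu^m$ and $k^m$, which are the $m$-th pair of randomized parameters.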
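The random selection of kernel parameters can be sketched as follows, assuming each parameter is drawn by scaling a uniform random variable on [0, 1) into a user-supplied range and applying the floor for the neighborhood size. The function name `random_ses_params` and the ranges in the usage lines are hypothetical placeholders, not the values used in the experiments.

```python
import numpy as np

def random_ses_params(M, mu_range, k_range, rng):
    """Draw M random (mu, k) pairs from the given ranges."""
    mu_min, mu_max = mu_range
    k_min, k_max = k_range
    tau1 = rng.uniform(0.0, 1.0, size=M)  # uniform random variables for mu
    tau2 = rng.uniform(0.0, 1.0, size=M)  # uniform random variables for k
    mus = mu_min + tau1 * (mu_max - mu_min)
    ks = k_min + np.floor(tau2 * (k_max - k_min)).astype(int)
    return list(zip(mus.tolist(), ks.tolist()))

# Hypothetical ranges, chosen only to exercise the function.
params = random_ses_params(M=5, mu_range=(0.1, 1.0), k_range=(2, 8),
                           rng=np.random.default_rng(0))
```

Each returned pair defines one randomized metric, so the diversity of the metric population is governed by the width of the two ranges.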
3.3 Ensemble Generation by Joint Randomization
In this section, with the set of diversified metrics generated, we proceed to couple diversified metrics with random subspaces for jointly randomized ensemble generation.
Let $\mathcal{F} = \{f_1, \dots, f_D\}$ be the set of features in the dataset $\mathcal{X}$, where $f_i$ denotes the $i$-th feature. A random subspace is a set of a certain number of features that are randomly sampled from the original feature set. The cluster structure of high-dimensional data may be hidden in different feature subspaces as well as in different metric spaces. In this paper, we propose to jointly exploit large populations of diversified metrics and random subspaces. Specifically, we perform random subspace sampling $M$ times to obtain $M$ random subspaces, denoted as $\mathcal{F}^1, \dots, \mathcal{F}^M$, which lead to $M$ component datasets, denoted as $\mathcal{X}^1, \dots, \mathcal{X}^M$. Note that each component dataset has the same number of data samples as the original dataset $\mathcal{X}$, but its feature set only consists of attributes that are randomly sampled from $\mathcal{F}$ with a sampling ratio $r$. Obviously, if $r = 1$, then every subspace is in fact the original feature space, i.e., no sub-sampling actually happens. Here, with the $M$ random subspaces generated, we can couple each of them with a randomly diversified metric (as described in Section 3.2), and thus obtain $M$ random metric-subspace pairs, denoted as
$$\mathcal{P} = \{P^1, \dots, P^M\}, \quad P^m = (K^m, \mathcal{F}^m),$$
where $K^m$ is the $m$-th randomized SES kernel.
In terms of the $m$-th metric-subspace pair $P^m$, the similarity between samples $x_i$ and $x_j$ is computed by first mapping $x_i$ and $x_j$ onto the subspace associated with the component dataset $\mathcal{X}^m$ and then computing their SES kernel similarity with the randomly selected parameters $\mu^m$ and $k^m$. Thus, we can obtain $M$ similarity matrices in terms of the $M$ metric-subspace pairs as follows:
$$\mathcal{S} = \{S^1, \dots, S^M\},$$
where the $m$-th similarity matrix (i.e., $S^m$) is constructed in terms of the $m$-th metric-subspace pair $P^m$, denoted as
$$S^m = \{s_{ij}^m\}_{N \times N},$$
where $s_{ij}^m$ is the $(i,j)$-th entry in $S^m$. Obviously, according to the definition of the SES kernel, it holds that $s_{ij}^m \in (0, 1]$ for any $x_i, x_j \in \mathcal{X}$. If samples $x_i$ and $x_j$ have the same feature values in the subspace associated with $\mathcal{X}^m$, then their similarity reaches its maximum $1$.
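The random subspace sampling step can be sketched as below, assuming features are drawn without replacement and that the subspace size is the floor of the sampling ratio times the number of features (with at least one feature kept). Both the rounding rule and the function name `component_datasets` are assumptions of this sketch.

```python
import numpy as np

def component_datasets(X, M, r, rng):
    """Return M (feature_indices, X restricted to those features) pairs."""
    n_features = X.shape[1]
    # Assumption: subspace size is floor(r * D), but never smaller than 1.
    size = max(1, int(np.floor(r * n_features)))
    components = []
    for _ in range(M):
        # Sample distinct feature indices; sorting keeps the column order stable.
        features = np.sort(rng.choice(n_features, size=size, replace=False))
        components.append((features, X[:, features]))
    return components

X = np.arange(24.0).reshape(4, 6)
components = component_datasets(X, M=3, r=0.5, rng=np.random.default_rng(0))
full = component_datasets(X, M=1, r=1.0, rng=np.random.default_rng(0))
```

Each component dataset would then be paired with one randomized kernel, and the kernel evaluated on the restricted columns yields the corresponding similarity matrix.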
Having constructed $M$ similarity matrices with diversified metric-subspace pairs, we then exploit the spectral clustering algorithm to construct the ensemble of base clusterings. Spectral clustering is a widely used graph partitioning algorithm, which is capable of capturing the global structure in a graph.
Specifically, for the $m$-th similarity matrix $S^m$, we treat each data sample as a graph node and build a similarity graph as follows:
$$G^m = (\mathcal{V}, \mathcal{E}^m),$$
where $\mathcal{V} = \mathcal{X}$ is the node set, and $\mathcal{E}^m$ is the edge set. The edge weights are decided by the similarity matrix $S^m$, i.e., for any $x_i, x_j \in \mathcal{V}$, the weight of the edge between them is the $(i,j)$-th entry $s_{ij}^m$ of $S^m$. Let $n^m$ denote the number of clusters in the $m$-th base clustering. The objective of spectral clustering is to partition the graph $G^m$ into $n^m$ disjoint subsets. To this end, we construct the normalized graph Laplacian as follows:
$$L^m = I - (\mathbf{D}^m)^{-1/2} S^m (\mathbf{D}^m)^{-1/2},$$
where the degree matrix $\mathbf{D}^m$ is a diagonal matrix with its $(i,i)$-th entry defined as the sum of the $i$-th row of $S^m$. The eigenvectors corresponding to the first $n^m$ (i.e., smallest) eigenvalues of $L^m$ are computed and then stacked to form a new matrix $U^m \in \mathbb{R}^{N \times n^m}$, where the $j$-th column of $U^m$ is the eigenvector corresponding to the $j$-th eigenvalue of $L^m$. Thereafter, the matrix $T^m \in \mathbb{R}^{N \times n^m}$ can be obtained from $U^m$ by normalizing the rows to norm 1.
By treating each row of $T^m$ as a data point in $\mathbb{R}^{n^m}$, we can cluster the rows into $n^m$ clusters by $k$-means and thereby obtain the $m$-th base clustering based on the similarity matrix $S^m$. Formally, the $m$-th base clustering is denoted as
$$\pi^m = \{C_1^m, \dots, C_{n^m}^m\},$$
where $C_j^m$ is the $j$-th cluster in $\pi^m$. It is obvious that the clusters in a base clustering cover the entire dataset, i.e., $\bigcup_{j=1}^{n^m} C_j^m = \mathcal{X}$, and two clusters in the same base clustering will not overlap with each other, i.e., $C_i^m \cap C_j^m = \emptyset$ for $i \neq j$.
Finally, based on the $M$ diversified similarity matrices in $\mathcal{S}$, we can construct an ensemble of $M$ base clusterings, denoted as
$$\Pi = \{\pi^1, \dots, \pi^M\},$$
where $\pi^m$ is the $m$-th base clustering in the ensemble $\Pi$.
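The spectral clustering step that turns one similarity matrix into one base clustering can be sketched as follows. This is a minimal sketch that assumes the symmetric normalized Laplacian, strictly positive node degrees, and a small Lloyd-style $k$-means with farthest-point initialization standing in for whatever $k$-means variant the authors used; the function names are hypothetical.

```python
import numpy as np

def _kmeans(T, n_clusters, n_iter=100):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [T[0]]
    for _ in range(1, n_clusters):
        # Next centre: the point farthest from all centres chosen so far.
        dist = np.min([np.sum((T - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(T[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((T[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        # Keep the old centre if a cluster happens to be empty.
        new = np.array([T[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(n_clusters)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def spectral_clustering(S, n_clusters):
    """Partition the similarity graph S into n_clusters groups."""
    deg = S.sum(axis=1)                      # node degrees (row sums)
    inv_sqrt = 1.0 / np.sqrt(deg)            # assumes deg > 0
    L = np.eye(len(S)) - inv_sqrt[:, None] * S * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    U = vecs[:, :n_clusters]                 # eigenvectors of the smallest eigenvalues
    T = U / np.linalg.norm(U, axis=1, keepdims=True)  # rows scaled to unit norm
    return _kmeans(T, n_clusters)

# Two tightly connected groups of three nodes with weak links between them.
S = np.full((6, 6), 0.01)
S[:3, :3] = 1.0
S[3:, 3:] = 1.0
labels = spectral_clustering(S, 2)
```

Running the same routine on each of the similarity matrices, with a possibly different number of clusters each time, would produce the ensemble of base clusterings.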
3.4 Consensus Function
With the ensemble generated, the objective of the consensus function is to combine the set of base clusterings into a probably better and more robust final clustering.
As each base clustering consists of a certain number of clusters, the entire ensemble can also be viewed as a large set of clusters from different base clusterings. To exploit the different reliability of different clusters and incorporate the cluster-wise diversity in the consensus function, here we adopt a local weighting strategy to evaluate and weight the base clusters by jointly considering the distribution of cluster labels in the entire ensemble using an entropic criterion. Formally, we denote the ensemble of clusters as
$$\mathcal{C} = \{C_1, \dots, C_{N_c}\},$$
where $C_i$ is the $i$-th cluster and $N_c$ is the total number of clusters in the ensemble $\Pi$. Note that $N_c = \sum_{m=1}^{M} n^m$, where $n^m$ is the number of clusters in the $m$-th base clustering and $M$ is the number of base clusterings.
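Each cluster is a set of data samples. To estimate the uncertainty of different clusters, the concept of entropy is utilized here. Given a cluster $C_i \in \mathcal{C}$ and a base clustering $\pi^m \in \Pi$, the uncertainty (or entropy) of $C_i$ w.r.t. $\pi^m$ can be computed as
$$H^m(C_i) = -\sum_{j=1}^{n^m} p(C_i, C_j^m) \log_2 p(C_i, C_j^m), \tag{16}$$
$$p(C_i, C_j^m) = \frac{|C_i \cap C_j^m|}{|C_i|},$$
where $C_j^m$ is the $j$-th cluster in $\pi^m$, $n^m$ is the number of clusters in $\pi^m$, and $p(C_i, C_j^m)$ is the proportion of data samples in $C_i$ that also appear in $C_j^m$ (with the convention $0 \log_2 0 = 0$). It is obvious that $p(C_i, C_j^m) \in [0, 1]$, which leads to $H^m(C_i) \in [0, +\infty)$. If and only if all data samples in $C_i$ also occur in the same cluster in $\pi^m$, the uncertainty of $C_i$ w.r.t. $\pi^m$ reaches its minimum 0.
With the uncertainty of a cluster w.r.t. a base clustering given in Eq. (16) and the general assumption that the base clusterings are independent of each other, we can obtain the uncertainty (or entropy) of $C_i$ w.r.t. the entire ensemble $\Pi$ of $M$ base clusterings as follows:
$$H^{\Pi}(C_i) = \sum_{m=1}^{M} H^m(C_i).$$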
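The cluster uncertainty measure can be sketched as below, assuming clusters are represented as sets of sample indices and that the logarithm is taken to base 2 (the base is an assumption of this sketch). The function names `cluster_entropy` and `ensemble_entropy` are hypothetical.

```python
import math

def cluster_entropy(cluster, base_clustering):
    """Entropy of one cluster with respect to one base clustering."""
    cluster = set(cluster)
    h = 0.0
    for other in base_clustering:
        # Proportion of the cluster's samples that fall into this base cluster.
        p = len(cluster & set(other)) / len(cluster)
        if p > 0:  # terms with p == 0 contribute nothing
            h -= p * math.log2(p)
    return h

def ensemble_entropy(cluster, ensemble):
    """Sum of the per-clustering entropies (base clusterings assumed independent)."""
    return sum(cluster_entropy(cluster, pi) for pi in ensemble)

cluster = {0, 1, 2, 3}
pi_1 = [{0, 1, 2, 3}, {4, 5}]    # keeps the cluster intact
pi_2 = [{0, 1}, {2, 3}, {4, 5}]  # splits the cluster evenly in two
h_total = ensemble_entropy(cluster, [pi_1, pi_2])
```

In this toy ensemble the first base clustering contributes zero uncertainty and the even split contributes one bit, so the cluster is considered less reliable than one that every base clustering keeps intact.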
Intuitively, higher uncertainty indicates lower reliability for a cluster, which implies that the base clusterings in the ensemble tend to disagree with the cluster, and accordingly a smaller weight can be associated with it. In particular, we proceed to compute a reliability index from the above-mentioned uncertainty measure, and exploit it as a cluster weighting term in our consensus function. The experimental analysis of the efficacy of the cluster weighting term will also be provided in Section 4.6. Specifically, the ensemble-driven cluster index (ECI) is computed as an indication of the reliability of each cluster in the ensemble, which is defined as follows:
$$\mathrm{ECI}(C_i) = e^{-\frac{H^{\Pi}(C_i)}{M}}.$$
Obviously, for any $C_i \in \mathcal{C}$, it holds that $H^{\Pi}(C_i) \in [0, +\infty)$, then we have $-H^{\Pi}(C_i)/M \in (-\infty, 0]$ and thereby $\mathrm{ECI}(C_i) \in (0, 1]$. Note that a larger value of ECI is associated with a cluster of lower uncertainty (i.e., higher reliability). If and only if the data samples in $C_i$ appear in the same cluster in all of the $M$ base clusterings (i.e., all base clusterings agree that the data samples in $C_i$ should belong to the same cluster), the uncertainty of $C_i$ w.r.t. $\Pi$ reaches its minimum 0 and the ECI of $C_i$ reaches its maximum 1.
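The ECI measure serves as a reliability index for different clusters in the ensemble $\Pi$. By using ECI as a cluster-weighting term, the locally weighted co-association matrix can be obtained as follows:
$$B = \{b_{ij}\}_{N \times N}, \quad b_{ij} = \frac{1}{M} \sum_{m=1}^{M} w_i^m \cdot \delta_{ij}^m,$$
$$w_i^m = \mathrm{ECI}\left(\mathrm{Cls}^m(x_i)\right),$$
$$\delta_{ij}^m = \begin{cases} 1, & \text{if } \mathrm{Cls}^m(x_i) = \mathrm{Cls}^m(x_j), \\ 0, & \text{otherwise}, \end{cases}$$
where $\mathrm{Cls}^m(x_i)$ denotes the cluster in $\pi^m$ that $x_i$ belongs to. Note that $w_i^m$ is the cluster weighting term which weights each cluster according to its ECI value, while $\delta_{ij}^m$ is the pair-wise co-occurrence term that indicates whether two samples occur in the same cluster in a base clustering $\pi^m$.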
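The locally weighted co-association matrix can be sketched as follows, assuming base clusterings are given as integer label vectors and that the reliability index takes the form exp(-H / M), where H is the summed base-2 entropy of a cluster over the M base clusterings. The exact form of the index and the function name `lwca` are assumptions of this sketch.

```python
import numpy as np

def lwca(labelings):
    """Locally weighted co-association matrix from a list of label vectors."""
    labelings = [np.asarray(labels) for labels in labelings]
    M = len(labelings)
    N = labelings[0].shape[0]
    B = np.zeros((N, N))
    for labels in labelings:
        w = np.zeros(N)  # weight of the cluster each sample belongs to
        for c in np.unique(labels):
            mask = labels == c
            H = 0.0
            for other in labelings:
                # Distribution of this cluster's samples over the other clustering.
                counts = np.unique(other[mask], return_counts=True)[1]
                p = counts / mask.sum()
                H -= np.sum(p * np.log2(p))
            w[mask] = np.exp(-H / M)  # assumed reliability index (see lead-in)
        # Weighted co-occurrence: weight times "same cluster" indicator.
        B += w[:, None] * (labels[:, None] == labels[None, :])
    return B / M

B_agree = lwca([[0, 0, 1, 1], [0, 0, 1, 1]])
B_mixed = lwca([[0, 0, 0, 0], [0, 0, 1, 1]])
```

When all base clusterings agree, every weight equals 1 and the matrix reduces to the ordinary co-association matrix; in the mixed case, the coarse cluster that the second clustering splits receives a weight below 1.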
Then, with the data samples treated as graph nodes and the locally weighted co-association matrix used as the similarity matrix, the similarity graph for the consensus function can be constructed as follows:
$$G = (\mathcal{V}, \mathcal{E}),$$
where $\mathcal{V} = \mathcal{X}$ is the node set, and $\mathcal{E}$ is the edge set with the weight $b_{ij}$ for any samples $x_i$ and $x_j$. Thereafter, graph $G$ is partitioned into disjoint subsets by performing the spectral clustering algorithm. By treating each subset of graph nodes as a final cluster, the consensus clustering result can thus be obtained.
For clarity, the overall algorithm of the proposed ensemble clustering approach is summarized in Algorithm 1.
4 Experiments
In this section, we conduct experiments on a variety of real-world high-dimensional datasets to compare the proposed MDEC approach against several state-of-the-art ensemble clustering approaches.
4.1 Datasets and Experimental Setting
We use twenty high-dimensional datasets in the experiments, including fifteen cancer gene expression datasets and five image datasets (see Tables I and II). The fifteen cancer gene expression datasets are from , while the five image datasets (i.e., UMist, Multiple Features, Flowers-17, COIL-20, and Binary Alphadigits) are from , , , , and , respectively. For clarity, in the following, the fifteen cancer gene expression datasets will be respectively abbreviated as GD-1 to GD-15, while the five image datasets will be respectively abbreviated as ID-1 to ID-5 (as shown in Tables I and II).
To produce a large set of diversified metrics, the two kernel parameters in the SES kernel are suggested to be randomized in a wide range. Specifically, in the experiments, the two kernel parameters and are randomly selected in the ranges of and , respectively. To generate the ensemble of base clusterings, the ensemble size and the sampling ratio are used. The number of clusters in each base clustering is randomly selected in the range of . In Sections 4.5 and Section 4.6, we will further evaluate the ensemble clustering performance of our approach with different ensemble sizes and different sampling ratios .
4.2 Evaluation Measures
To evaluate the quality of the clustering result, two widely used evaluation measures are adopted, namely, normalized mutual information (NMI)  and adjusted Rand index (ARI) . Note that greater values of NMI and ARI indicate better clusterings.
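The NMI serves as a sound indication of the shared information between two clusterings. Given the test clustering $\pi'$ and the ground-truth clustering $\pi^G$, the NMI between $\pi'$ and $\pi^G$ is defined as follows:
$$\mathrm{NMI}(\pi', \pi^G) = \frac{\sum_{i=1}^{n'} \sum_{j=1}^{n^G} n_{ij} \log \frac{n_{ij}\, N}{n'_i\, n^G_j}}{\sqrt{\left(\sum_{i=1}^{n'} n'_i \log \frac{n'_i}{N}\right) \left(\sum_{j=1}^{n^G} n^G_j \log \frac{n^G_j}{N}\right)}},$$
where $n'$ and $n^G$ denote the number of clusters in $\pi'$ and $\pi^G$, respectively, $n'_i$ denotes the number of samples in the $i$-th cluster of $\pi'$, $n^G_j$ denotes the number of samples in the $j$-th cluster of $\pi^G$, $n_{ij}$ denotes the number of common samples between cluster $i$ in $\pi'$ and cluster $j$ in $\pi^G$, and $N$ is the total number of samples.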
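The NMI computation can be sketched from the cluster-size counts as below, assuming the common form that normalizes the mutual information by the geometric mean of the two entropies. The function name `nmi` and the convention of returning 0 when either clustering has a single cluster are assumptions of this sketch.

```python
import math
from collections import Counter

def nmi(pred, truth):
    """Normalized mutual information between two label sequences."""
    N = len(pred)
    n_pred = Counter(pred)               # cluster sizes in the test clustering
    n_true = Counter(truth)              # cluster sizes in the ground truth
    n_joint = Counter(zip(pred, truth))  # sizes of pairwise intersections
    num = sum(n * math.log(n * N / (n_pred[i] * n_true[j]))
              for (i, j), n in n_joint.items())
    h_pred = sum(n * math.log(n / N) for n in n_pred.values())
    h_true = sum(n * math.log(n / N) for n in n_true.values())
    denom = math.sqrt(h_pred * h_true)
    # Assumption: define NMI as 0 when a clustering has zero entropy.
    return num / denom if denom > 0 else 0.0

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical up to label names
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # clusterings share no information
```

Because only counts enter the formula, the score is invariant to how the clusters are named.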
The ARI is computed by considering the number of pairs of samples on which two clusterings agree or disagree. Given two clusterings $\pi'$ and $\pi^G$, the ARI between them is defined as follows:
$$\mathrm{ARI}(\pi', \pi^G) = \frac{2 (N_{00} N_{11} - N_{01} N_{10})}{(N_{00} + N_{01})(N_{01} + N_{11}) + (N_{00} + N_{10})(N_{10} + N_{11})},$$
where $N_{11}$ is the number of sample pairs that occur in the same cluster in both $\pi'$ and $\pi^G$, $N_{00}$ is the number of sample pairs that occur in different clusters in both $\pi'$ and $\pi^G$, $N_{10}$ is the number of sample pairs that occur in the same cluster in $\pi'$ but in different clusters in $\pi^G$, and $N_{01}$ is the number of sample pairs that occur in different clusters in $\pi'$ but in the same cluster in $\pi^G$.
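The pair-counting form of the ARI can be sketched as follows, assuming the four pair counts are accumulated by enumerating all sample pairs (quadratic in the number of samples, which is acceptable for a small illustration). The function name `ari` and the value returned in the degenerate zero-denominator case are assumptions of this sketch.

```python
from itertools import combinations

def ari(pred, truth):
    """Adjusted Rand index from the four pair counts."""
    n11 = n00 = n10 = n01 = 0
    for a, b in combinations(range(len(pred)), 2):
        same_pred = pred[a] == pred[b]
        same_true = truth[a] == truth[b]
        if same_pred and same_true:
            n11 += 1   # together in both clusterings
        elif not same_pred and not same_true:
            n00 += 1   # apart in both clusterings
        elif same_pred:
            n10 += 1   # together in pred, apart in truth
        else:
            n01 += 1   # apart in pred, together in truth
    denom = (n00 + n01) * (n01 + n11) + (n00 + n10) * (n10 + n11)
    # Assumption: the denominator vanishes only for identical trivial
    # partitions, which are scored as perfect agreement here.
    return 2.0 * (n00 * n11 - n01 * n10) / denom if denom else 1.0

perfect = ari([0, 0, 1, 1], [1, 1, 0, 0])
crossed = ari([0, 0, 1, 1], [0, 1, 0, 1])
```

Unlike NMI, the ARI can be negative when two clusterings agree less than expected by chance, as in the crossed example.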
4.3 Comparison Against Base Clusterings
In ensemble clustering, it is generally expected that in the ensemble generation phase the base clusterings can be produced with high diversity, while in the consensus phase the consensus clustering can be constructed with improved stability and quality by fusing the base clusterings.
In this section, we evaluate the performance of the generated base clusterings and the final consensus clusterings of the proposed MDEC approach. As illustrated in Fig. 2, on the one hand, the ensemble of base clusterings shows high diversity (typically, with high standard deviations w.r.t. both NMI and ARI) for the benchmark datasets. On the other hand, the consensus clustering results consistently outperform the base clusterings in terms of both overall stability and quality (see Fig. 2). In particular, for the GD-2, GD-4, and GD-6 datasets, the average NMI and ARI scores (over 100 runs) of the consensus clusterings of our approach are more than twice as high as those of the base clusterings.
4.4 Comparison Against Other Ensemble Clustering Methods
In this section, we compare the proposed MDEC approach with ten state-of-the-art ensemble clustering approaches, namely, stratified sampling based cluster-based similarity partitioning algorithm (SSCSPA), stratified sampling based hypergraph partitioning algorithm (SSHGPA), stratified sampling based meta-clustering algorithm (SSMCLA), $k$-means based consensus clustering (KCC), probability trajectory accumulation (PTA), probability trajectory based graph partitioning (PTGP), locally weighted evidence accumulation (LWEA), locally weighted graph partitioning (LWGP), spectral ensemble clustering (SEC), and entropy based consensus clustering (ECC). To compare the performance of different ensemble clustering approaches, we use the number of classes as the cluster number for each test approach, which is a commonly adopted experimental protocol in ensemble clustering [13, 14]. For each benchmark dataset, we run every test approach 100 times and report the average performance and standard deviation in Tables III and IV.
As shown in Table III, in terms of NMI, the proposed MDEC approach exhibits the best performance on eighteen out of the twenty datasets. Although the PTGP approach outperforms our approach on the GD-11 and GD-12 datasets, on all of the other eighteen datasets our approach shows significant advantages in the consensus performance (w.r.t. NMI) over the baseline approaches. Similarly, as shown in Table IV, in terms of ARI, the proposed approach also achieves the best performance on eighteen out of the twenty benchmark datasets, and shows a clear advantage over the baseline approaches.
To provide a summary view across the twenty benchmark datasets, we further show the average score and average rank of different approaches in the last two rows in Tables III and IV, respectively. Note that the average score (across twenty datasets) is computed by taking the average on the NMI (or ARI) scores, while the average rank is obtained by taking the average on the ranking positions, for each approach across all the datasets.
As can be seen in Table III, the proposed approach achieves an average NMI score of across twenty datasets, which is significantly higher than the second best approach (i.e., ECC) whose average NMI score is . In terms of the ranking positions in Table III, the proposed approach obtains an average rank of , while the second best approach (i.e., PTGP) only obtains an average rank of . Similar advantages can also be seen in Table IV. The average ARI score and the average rank of the proposed approach are and , respectively, which significantly outperform the ten baseline ensemble clustering approaches (see Table IV).
4.5 Robustness to Ensemble Sizes
In this section, we evaluate the performances of different ensemble clustering approaches with varying ensemble sizes. Specifically, we run the proposed MDEC approach as well as the baseline approaches on the benchmark datasets with the ensemble size varying from to , and report their average performances over 20 runs in Figures 3 and 4.
As can be seen in Fig. 3, in terms of NMI, the proposed approach yields stably high performance across the twenty benchmark datasets with varying ensemble sizes. Although the PTGP and PTA approaches outperform our approach on the GD-11 dataset, on most of the other datasets our approach achieves the best or nearly the best performance when compared to the baseline approaches. In particular, on the GD-1, GD-2, GD-3, GD-4, GD-6, GD-8, GD-9, GD-10, GD-13, GD-14, GD-15, ID-1, ID-2, ID-3, ID-4, and ID-5 datasets, our approach shows a significant advantage over the baseline approaches with varying ensemble sizes. Similarly, in terms of ARI, our approach also exhibits the best or nearly the best performance on most of the benchmark datasets with varying ensemble sizes (as shown in Fig. 4).
To provide a summary view, Fig. 5 further illustrates the average NMI and ARI scores (across twenty datasets) of the different approaches with varying ensemble sizes. Specifically, Fig. 5(a) is obtained by averaging the twenty sub-figures in Fig. 3, ranging from Fig. 3(a) to Fig. 3(t), while Fig. 5(b) is obtained by averaging the twenty sub-figures in Fig. 4, ranging from Fig. 4(a) to Fig. 4(t). As can be seen in Fig. 5, the proposed MDEC approach achieves significantly better performance (w.r.t. both NMI and ARI) than the baseline ensemble clustering approaches across the twenty benchmark datasets. Even when compared to the second and the third best approaches (i.e., ECC and PTGP, respectively), a clear advantage of the proposed MDEC approach can still be observed (see Fig. 5).
4.6 Influence of Metrics, Subspaces, and Clusters
This paper proposes to jointly exploit large populations of diversified metrics, random subspaces, and weighted clusters in a unified ensemble clustering framework. In this section, we evaluate the influence of the three factors (i.e., diversified metrics, random subspaces, and weighted clusters) in our approach.
First, we compare the diversified metrics with several widely used similarity metrics, i.e., cosine similarity, correlation coefficient, and Spearman correlation coefficient. Besides the proposed MDEC approach, we generate three sub-approaches by replacing the diversified metrics with one of the three conventional similarity metrics. As can be seen in Fig. 6, in terms of NMI, the proposed approach with diversified metrics obtains an average score , whereas the three sub-approaches (with the three conventional similarity metrics) obtain average scores of , , and , respectively. In terms of ARI, the proposed approach with diversified metrics obtains an average score of , which also significantly outperforms the three sub-approaches whose average ARI scores are , , and , respectively. As shown in Fig. 6, the use of diversified metrics in the proposed approach is able to significantly improve the consensus clustering performance.
Second, we evaluate the performance of MDEC with different subspace sampling ratio , which varies from to (see Fig. 7). As illustrated in Fig. 7, moderate values of generally lead to better consensus clustering performance. When the sampling ratio goes from to , the performance declines, which suggests that the use of random subspaces exhibits a positive influence when compared to using the full feature sets (by setting ). At the other extreme, when setting to very small values, e.g., in the range of [0.1, 0.3], the performance also declines, due to the fact that the subspaces generated by a very small sampling ratio may not well represent the underlying distribution of the dataset. Empirically, it is suggested that the sampling ratio be set in the range of , which strikes a balance between diversity and quality.
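The role of the sampling ratio can be illustrated with a minimal random-subspace sketch. It assumes that a subspace is formed by drawing features uniformly without replacement and that the number of drawn features is the ratio times the dimension, rounded to the nearest integer with at least one feature kept; the names `sample_random_subspace`, `project`, and `ratio` are hypothetical.

```python
import random


def sample_random_subspace(n_features, ratio, rng):
    """Return sorted indices of a random feature subset.

    ratio is the fraction of features to keep, in (0, 1]; ratio = 1
    keeps the full feature set.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must lie in (0, 1]")
    n_selected = max(1, round(ratio * n_features))
    # Sample feature indices uniformly without replacement.
    return sorted(rng.sample(range(n_features), n_selected))


def project(data, feature_idx):
    """Restrict every sample (row) to the selected features."""
    return [[row[j] for j in feature_idx] for row in data]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
idx = sample_random_subspace(n_features=4, ratio=0.5, rng=rng)
print(idx, project(data, idx))
```

With a ratio of 1 every draw returns the same full feature set, so the subspaces contribute no diversity, while a very small ratio leaves each subspace with too few features to represent the data; this is the trade-off discussed above.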
Third, we evaluate the performance of our approach with and without the weighted clusters. Note that the performance of our approach without weighted clusters is obtained by setting all cluster weights equal to one. As shown in Fig. 8, in terms of both NMI and ARI, the proposed approach with weighted clusters exhibits consistently better average performance (across twenty datasets) than that without weighted clusters.
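The relation between the weighted and unweighted variants can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the per-cluster weights are taken as given inputs (in MDEC they come from the entropy-based criterion), and the function name and data layout are hypothetical. Passing no weights sets every cluster weight to one, which reduces the matrix to the conventional co-association matrix.

```python
def weighted_co_association(base_labelings, cluster_weights=None):
    """Build an n x n co-association matrix from M base clusterings.

    base_labelings: list of M label lists, each of length n.
    cluster_weights: list of M dicts mapping a cluster label to its weight;
        None means all weights equal one (the unweighted variant).
    Entry (i, j) averages, over the base clusterings, the weight of the
    cluster that contains both samples i and j (zero if they are separated).
    """
    n_base = len(base_labelings)
    n_samples = len(base_labelings[0])
    matrix = [[0.0] * n_samples for _ in range(n_samples)]
    for b, labels in enumerate(base_labelings):
        for i in range(n_samples):
            weight = 1.0 if cluster_weights is None else cluster_weights[b][labels[i]]
            for j in range(n_samples):
                if labels[i] == labels[j]:
                    matrix[i][j] += weight / n_base
    return matrix


# Two toy base clusterings of four samples.
base = [[0, 0, 1, 1], [0, 1, 1, 1]]
unweighted = weighted_co_association(base)
# Hypothetical weights: cluster 1 is judged less reliable in both clusterings.
weighted = weighted_co_association(base, [{0: 1.0, 1: 0.5}, {0: 1.0, 1: 0.2}])
print(unweighted[2][3], weighted[2][3])
```

In the toy example, samples 2 and 3 are grouped together in both base clusterings, so their unweighted co-association is maximal, but it is reduced once the clusters containing them receive low weights.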
As shown in Figures 6 to 8, we have two main observations: 1) the performance of our approach benefits from the use of diversified metrics, random subspaces, and weighted clusters; 2) out of the three beneficial factors, the diversified metrics play the most important role in the consensus clustering performance, with consideration to the approximately of improvement (w.r.t. both NMI and ARI) that they lead to.
4.7 Execution Time
In this section, we evaluate the efficiency of different ensemble clustering approaches and report their execution times on the benchmark datasets in Table V. In general, larger dimensions and larger sample sizes lead to greater computational costs for the ensemble clustering approaches. As can be seen in Table V, the proposed MDEC approach takes less than 1 second on fourteen of the fifteen cancer gene expression datasets. On the five image datasets, the time efficiency of the proposed MDEC approach is also comparable to that of the other ensemble clustering approaches.
In summary, as can be seen in Tables III to V and Figures 3 to 5, the proposed MDEC approach has shown significant advantages in clustering accuracy while exhibiting competitive time efficiency when compared against the state-of-the-art ensemble clustering approaches.
All of the experiments are conducted in MATLAB R2016a 64-bit on a workstation (Windows 10 Enterprise 64-bit, 12 Intel 2.40 GHz processors, 128 GB of RAM).
In this paper, we propose a new ensemble clustering approach termed MDEC, which is capable of jointly exploiting large populations of diversified metrics, random subspaces, and weighted clusters in a unified ensemble clustering framework. Specifically, a large number of diversified metrics are generated by randomizing a scaled exponential similarity kernel. The diversified metrics are then coupled with the random subspaces to form a large set of metric-subspace pairs. Upon the similarity matrices derived from the metric-subspace pairs, the spectral clustering algorithm is performed to construct an ensemble of diversified base clusterings. With the base clusterings generated, an entropy-based cluster validity strategy is utilized to evaluate and weight the clusters with consideration to the distribution of the cluster labels in the entire ensemble. Based on the weighted clusters, the locally weighted co-association matrix is built and then partitioned to obtain the consensus clustering. We have conducted extensive experiments on 20 high-dimensional datasets (including 15 cancer gene expression datasets and 5 image datasets), which demonstrate the clear advantages of our approach over the state-of-the-art ensemble clustering approaches.
This project was supported by NSFC (61602189, 61502543 & 61573387), National Key Research and Development Program of China (2016YFB1001003), Guangdong Natural Science Funds for Distinguished Young Scholar (2016A030306014), and Singapore Ministry of Education Tier-2 Grant (MOE2014-T2-2-023).
-  T. Li and C. Ding, “Weighted consensus clustering,” in Proc. of SIAM International Conference on Data Mining (SDM), 2008, pp. 798–809.
-  N. Iam-On, T. Boongoen, S. Garrett, and C. Price, “A link-based approach to the cluster ensemble problem,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2396–2409, 2011.
-  T. Wang, “CA-Tree: A hierarchical structure for efficient and scalable coassociation-based cluster ensembles,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 3, pp. 686–698, 2011.
-  N. Li and L. J. Latecki, “Clustering aggregation as maximum-weight independent set,” in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 782–790.
-  L. Zheng, T. Li, and C. Ding, “A framework for hierarchical ensemble clustering,” ACM Transactions on Knowledge Discovery from Data, vol. 9, no. 2, pp. 9:1–9:23, 2014.
-  J. Wu, H. Liu, H. Xiong, J. Cao, and J. Chen, “K-means-based consensus clustering: A unified view,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 1, pp. 155–169, 2015.
-  L. Jing, K. Tian, and J. Z. Huang, “Stratified feature sampling method for ensemble clustering of high dimensional data,” Pattern Recognition, vol. 48, no. 11, pp. 3688–3702, 2015.
-  D. Huang, J.-H. Lai, and C.-D. Wang, “Robust ensemble clustering using probability trajectories,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 5, pp. 1312–1326, 2016.
-  Z. Yu, L. Li, J. Liu, J. Zhang, and G. Han, “Adaptive noise immune cluster ensemble using affinity propagation,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 12, pp. 3176–3189, 2015.
-  D. Huang, J.-H. Lai, and C.-D. Wang, “Ensemble clustering using factor graph,” Pattern Recognition, vol. 50, pp. 131–142, 2016.
-  D. Huang, C.-D. Wang, and J.-H. Lai, “Locally weighted ensemble clustering,” IEEE Transactions on Cybernetics, in press, 2017.
-  H. Liu, M. Shao, S. Li, and Y. Fu, “Infinite ensemble clustering,” Data Mining and Knowledge Discovery, in press, 2017.
-  H. Liu, J. Wu, T. Liu, D. Tao, and Y. Fu, “Spectral ensemble clustering via weighted k-means: Theoretical and practical evidence,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 5, pp. 1129–1143, 2017.
-  H. Liu, R. Zhao, H. Fang, F. Cheng, Y. Fu, and Y.-Y. Liu, “Entropy-based consensus clustering for patient stratification,” Bioinformatics, vol. 33, no. 17, pp. 2691–2698, 2017.
-  Z. Yu, H.-S. Wong, and H. Wang, “Graph-based consensus clustering for class discovery from gene expression data,” Bioinformatics, vol. 23, no. 21, pp. 2888–2896, 2007.
-  Z. Yu, P. Luo, J. You, H. S. Wong, H. Leung, S. Wu, J. Zhang, and G. Han, “Incremental semi-supervised clustering ensemble for high dimensional data clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, pp. 701–714, 2016.
-  Z. Yu, Z. Kuang, J. Liu, H. Chen, J. Zhang, J. You, H. S. Wong, and G. Han, “Adaptive ensembling of semi-supervised clustering solutions,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 8, pp. 1577–1590, 2017.
-  X. Z. Fern and C. E. Brodley, “Random projection for high dimensional data clustering: A cluster ensemble approach,” in Proc. of International Conference on Machine Learning (ICML), 2003, pp. 186–193.
-  J. H. Lee, K. T. McDonnell, A. Zelenyuk, D. Imre, and K. Mueller, “A structure-based distance metric for high-dimensional space exploration with multidimensional scaling,” IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 3, pp. 351–364, 2014.
-  C. M. Hsu and M. S. Chen, “On the design and applicability of distance functions in high-dimensional data space,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 4, pp. 523–536, 2009.
-  A. L. N. Fred and A. K. Jain, “Combining multiple clusterings using evidence accumulation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835–850, 2005.
-  J. Wu, H. Liu, H. Xiong, and J. Cao, “A theoretic framework of k-means-based consensus clustering,” in Proc. of International Joint Conference on Artificial Intelligence, 2013.
-  C. Zhong, X. Yue, Z. Zhang, and J. Lei, “A clustering ensemble: Two-level-refined co-association matrix with path-based transformation,” Pattern Recognition, vol. 48, no. 8, pp. 2699–2709, 2015.
-  D. Huang, J.-H. Lai, and C.-D. Wang, “Combining multiple clusterings via crowd agreement estimation and multi-granularity link analysis,” Neurocomputing, vol. 170, pp. 240–250, 2015.
-  Y. Fan, N. Li, C. Li, Z. Ma, L. J. Latecki, and K. Su, “Restart and random walk in local search for maximum vertex weight cliques with evaluations in clustering aggregation,” in Proc. of International Joint Conference on Artificial Intelligence (IJCAI), 2017, pp. 622–630.
-  M. Yousefnezhad, S. J. Huang, and D. Zhang, “WoCE: A framework for clustering ensemble by exploiting the wisdom of crowds theory,” IEEE Transactions on Cybernetics, in press, 2017.
-  D. Huang, J.-H. Lai, C.-D. Wang, and P. C. Yuen, “Ensembling over-segmentations: From weak evidence to strong segmentation,” Neurocomputing, vol. 207, pp. 416–427, 2016.
-  X. Wang, C. Yang, and J. Zhou, “Clustering aggregation by probability accumulation,” Pattern Recognition, vol. 42, no. 5, pp. 668–675, 2009.
-  J. Yi, T. Yang, R. Jin, and A. K. Jain, “Robust ensemble clustering by matrix completion,” in Proc. of IEEE International Conference on Data Mining (ICDM), 2012.
-  A. Strehl and J. Ghosh, “Cluster ensembles: A knowledge reuse framework for combining multiple partitions,” Journal of Machine Learning Research, vol. 3, pp. 583–617, 2003.
-  X. Z. Fern and C. E. Brodley, “Solving cluster ensemble problems by bipartite graph partitioning,” in Proc. of International Conference on Machine Learning (ICML), 2004.
-  A. Topchy, A. K. Jain, and W. Punch, “Clustering ensembles: models of consensus and weak partitions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1866–1881, 2005.
-  L. Franek and X. Jiang, “Ensemble clustering by means of clustering embedding in vector spaces,” Pattern Recognition, vol. 47, no. 2, pp. 833–842, 2014.
-  A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.
-  G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 1998.
-  E. Weiszfeld and F. Plastria, “On the point for which the sum of the distances to n given points is minimum,” Annals of Operations Research, vol. 167, no. 1, pp. 7–41, 2009.
-  F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, 2001.
-  X. Yin, S. Chen, E. Hu, and D. Zhang, “Semi-supervised clustering with metric learning: An adaptive kernel method,” Pattern Recognition, vol. 43, no. 4, pp. 1320–1333, 2010.
-  W. Zhang, Z. Lin, and X. Tang, “Learning semi-Riemannian metrics for semisupervised feature extraction,” IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 4, pp. 600–611, 2011.
-  Q. Wang, P. C. Yuen, and G. Feng, “Semi-supervised metric learning via topology preserving multiple semi-supervised assumptions,” Pattern Recognition, vol. 46, no. 9, pp. 2576–2587, 2013.
-  H. Wang, F. Nie, and H. Huang, “Robust distance metric learning via simultaneous l1-norm minimization and maximization,” in Proc. of International Conference on Machine Learning (ICML), vol. 32, no. 2, 2014, pp. 1836–1844.
-  S. Anand, S. Mittal, O. Tuzel, and P. Meer, “Semi-supervised kernel mean shift clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1201–1215, 2014.
-  B. Wang, A. M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, B. Haibe-Kains, and A. Goldenberg, “Similarity network fusion for aggregating data types on a genomic scale,” Nature Methods, vol. 11, pp. 333–337, 2014.
-  U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
-  M. C. de Souto, I. G. Costa, D. S. de Araujo, T. B. Ludermir, and A. Schliep, “Clustering cancer gene expression data: A comparative study,” BMC Bioinformatics, vol. 9, no. 1, p. 497, 2008.
-  D. B. Graham and N. M. Allinson, “Characterising virtual eigensignatures for general purpose face recognition,” in Face Recognition. Springer, 1998, pp. 446–456.
-  K. Bache and M. Lichman, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
-  M.-E. Nilsback and A. Zisserman, “A visual vocabulary for flower classification,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2006, pp. 1447–1454.
-  S. A. Nene, S. K. Nayar, H. Murase et al., “Columbia object image library (COIL-20),” Technical report CUCS-005-96, 1996.
-  S. Roweis, http://www.cs.nyu.edu/%7eroweis/data.html.
-  N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance,” Journal of Machine Learning Research, vol. 11, no. 11, pp. 2837–2854, 2010.
Dong Huang received the B.S. degree in computer science in 2009 from South China University of Technology, Guangzhou, China. He received the M.Sc. degree in computer science in 2011 and the Ph.D. degree in computer science in 2015, both from Sun Yat-sen University, Guangzhou, China. He joined South China Agricultural University in 2015, where he is currently an Associate Professor with the College of Mathematics and Informatics. Since July 2017, he has been a visiting fellow with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests include data mining and pattern recognition. He is a member of the IEEE.
Chang-Dong Wang received the B.S. degree in applied mathematics in 2008, the M.Sc. degree in computer science in 2010, and the Ph.D. degree in computer science in 2013, all from Sun Yat-sen University, Guangzhou, China. He was a visiting student at the University of Illinois at Chicago from January 2012 to November 2012. He is currently an Associate Professor with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. His current research interests include machine learning and data mining. He has published more than 40 scientific papers in international journals and conferences such as IEEE TPAMI, IEEE TKDE, IEEE TSMC-C, Pattern Recognition, KAIS, Neurocomputing, ICDM and SDM. His ICDM 2010 paper won the Honorable Mention for Best Research Paper Awards. He was awarded 2015 Chinese Association for Artificial Intelligence (CAAI) Outstanding Dissertation. He is a member of the IEEE.
Jian-Huang Lai received the M.Sc. degree in applied mathematics in 1989 and the Ph.D. degree in mathematics in 1999 from Sun Yat-sen University, China. He joined Sun Yat-sen University in 1989 as an Assistant Professor, where he is currently a Professor with the School of Data and Computer Science. His current research interests include the areas of digital image processing, pattern recognition, multimedia communication, wavelet and its applications. He has published more than 200 scientific papers in the international journals and conferences on image processing and pattern recognition, such as IEEE TPAMI, IEEE TKDE, IEEE TNN, IEEE TIP, IEEE TSMC-B, Pattern Recognition, ICCV, CVPR, IJCAI, ICDM and SDM. Prof. Lai serves as a Standing Member of the Image and Graphics Association of China, and also serves as a Standing Director of the Image and Graphics Association of Guangdong. He is a senior member of the IEEE.
Chee-Keong Kwoh received the bachelor’s degree in electrical engineering (first class) and the master’s degree in industrial system engineering from the National University of Singapore in 1987 and 1991, respectively. He received the PhD degree from the Imperial College of Science, Technology and Medicine, University of London, in 1995. He has been with the School of Computer Science and Engineering, Nanyang Technological University (NTU) since 1993. He is the programme director of the MSc in Bioinformatics programme at NTU. His research interests include data mining, soft computing and graph-based inference; applications areas include bioinformatics and biomedical engineering. He has done significant research work in his research areas and has published many quality international conferences and journal papers. He is an editorial board member of the International Journal of Data Mining and Bioinformatics; the Scientific World Journal; Network Modeling and Analysis in Health Informatics and Bioinformatics; Theoretical Biology Insights; and Bioinformation. He has been a guest editor for many journals such as Journal of Mechanics in Medicine and Biology, the International Journal on Biomedical and Pharmaceutical Engineering and others. He has often been invited as an organizing member or referee and reviewer for a number of premier conferences and journals including GIW, IEEE BIBM, RECOMB, PRIB, BIBM, ICDM and iCBBE. He is a member of the Association for Medical and Bio-Informatics, Imperial College Alumni Association of Singapore. He has provided many services to professional bodies in Singapore and was conferred the Public Service Medal by the President of Singapore in 2008.