Combining Multiple Clusterings via Crowd Agreement Estimation and Multi-Granularity Link Analysis
The clustering ensemble technique aims to combine multiple clusterings into a probably better and more robust clustering and has received increasing attention in recent years. Existing clustering ensemble approaches have two main limitations. Firstly, many approaches lack the ability to weight the base clusterings without access to the original data and can be affected significantly by low-quality, or even ill, clusterings. Secondly, they generally focus on the instance level or the cluster level in the ensemble system and fail to integrate multi-granularity cues into a unified model. To address these two limitations, this paper proposes to solve the clustering ensemble problem via crowd agreement estimation and multi-granularity link analysis. We present the normalized crowd agreement index (NCAI) to evaluate the quality of base clusterings in an unsupervised manner and thus weight the base clusterings in accordance with their clustering validity. To explore the relationship between clusters, the source aware connected triple (SACT) similarity is introduced with regard to their common neighbors and the source reliability. Based on NCAI and multi-granularity information collected among base clusterings, clusters, and data instances, we further propose two novel consensus functions, termed weighted evidence accumulation clustering (WEAC) and graph partitioning with multi-granularity link analysis (GP-MGLA), respectively. Experiments conducted on eight real-world datasets demonstrate the effectiveness and robustness of the proposed methods.
Keywords: Clustering ensemble, Clustering aggregation, Weighted evidence accumulation clustering, Graph partitioning with multi-granularity link analysis
1 Introduction
Data clustering is a fundamental and very challenging problem in data mining and machine learning. The purpose is to partition unlabeled data into homogeneous groups, each referred to as a cluster. Data clustering requires a distance metric for evaluating the similarity between data instances, which, without prior knowledge of cluster shapes, is hard to specify. In the past few decades, a large number of clustering algorithms have been developed Xu et al. (1993); Li et al. (2007); Zhang and Zhou (2009); Zhao et al. (2010); Wang and Lai (2011); Li et al. (2011); Wang et al. (2013a, b); Wang and Lai (2013). However, there is no single clustering method which is able to identify all sorts of cluster shapes and structures in data.
For the same dataset, different methods, or even the same method with different initializations or parameter settings, may lead to very different clustering results. It is extremely difficult to decide which method would be the proper one for a given clustering task, let alone how to properly specify the initialization and parameter setting for the chosen method. Each method has its own merits as well as weaknesses. Different clusterings generated by different methods or with varying parameters can provide multiple views of the data. How to combine the information of different clustering results to obtain a better and more robust clustering remains a very challenging problem Jain (2010); Vega-Pons and Ruiz-Shulcloper (2011).
In recent years, many clustering ensemble approaches have been developed, which aim to combine multiple clusterings into a probably better and more robust clustering by utilizing various techniques Strehl and Ghosh (2003); Fern and Brodley (2004); Fred and Jain (2005); Topchy et al. (2005); Hadjitodorov et al. (2006); Li et al. (2007); Iam-On et al. (2008); Domeniconi and Al-Razgan (2009); Wang et al. (2009); Mimaroglu and Erdil (2011); Iam-On et al. (2011); Yi et al. (2012); Franek and Jiang (2014). However, most of the existing methods suffer from two main limitations. Firstly, many of the clustering ensemble approaches lack the ability to weight the base clusterings without access to the original data features, which makes them vulnerable to low-quality (or even ill) clusterings. Secondly, they mainly focus on the instance level or the cluster level in the ensemble system and fail to fuse multi-granularity information into a unified model. In order to address these two limitations, in this paper, we propose a clustering ensemble framework based on crowd agreement estimation and multi-granularity link analysis. By exploring the relationship among the base clusterings, we present a novel clustering validity measure termed normalized crowd agreement index (NCAI), which is able to evaluate the quality of base clusterings in an unsupervised manner and provides information for treating each base clustering accordingly. The source aware connected triple (SACT) similarity is introduced for analyzing the similarity between clusters with regard to their common neighbors and source reliability. Besides the relations between base clusterings and between clusters, we further investigate the linkage between data instances and clusters and incorporate the information from the three levels of granularity in a unified framework. In our previous work Huang et al. (2013), we introduced the consensus function termed graph partitioning with multi-granularity link analysis (GP-MGLA). This paper is a major extension of our previous work on clustering ensemble. In this paper, a more comprehensive literature review and motivation are provided. Besides that, we propose another novel consensus function termed weighted evidence accumulation clustering (WEAC), which is developed from the conventional evidence accumulation clustering (EAC) Fred and Jain (2005) and is capable of dealing with ill clusterings by incorporating the clustering validity cue into the ensemble process. Extensive experiments are further conducted on real-world datasets to evaluate the proposed methods against several baseline clustering ensemble methods.
The remainder of this paper is organized as follows. In Section 2, we review the related work of the clustering ensemble technique. In Section 3, we describe the formulation of the clustering ensemble problem. In Section 4, we present the crowd agreement estimation mechanism. The source aware connected triple (SACT) similarity is introduced in Section 5. In Section 6, we propose two novel consensus functions termed weighted evidence accumulation clustering (WEAC) and graph partitioning with multi-granularity link analysis (GP-MGLA) respectively. The experimental results are reported in Section 7. We conclude this paper in Section 8.
2 Related Work
Clustering ensemble is also known as clustering combination or clustering aggregation, which aims to combine multiple clusterings, each referred to as a base clustering (or an ensemble member), to obtain a so-called consensus clustering. As illustrated in Fig. 1, the clustering ensemble process involves two steps: the first step is to generate multiple clusterings for a given dataset; and the second step is to construct the consensus clustering from the ensemble of base clusterings using different consensus functions.
Given a dataset, the ensemble of base clusterings can be generated by running different clustering algorithms Mimaroglu and Erdil (2011); Yi et al. (2012); Huang et al. (2013), running the same algorithm with different initializations and parameters Fred and Jain (2005); Iam-On et al. (2008); Wang et al. (2009); Iam-On et al. (2011), clustering via sub-sampling the data repeatedly Strehl and Ghosh (2003); Fern and Brodley (2004), or clustering via projecting the data onto different subspaces Strehl and Ghosh (2003); Fern and Brodley (2004); Topchy et al. (2005); Domeniconi and Al-Razgan (2009). Compared to generating base clusterings, how to combine multiple base clusterings, i.e., how to design the consensus function, is much more important and challenging in the clustering ensemble problem.
In the past few years, many consensus functions have been developed to fuse information from multiple clusterings Strehl and Ghosh (2003); Fern and Brodley (2004); Fred and Jain (2005); Topchy et al. (2005); Hadjitodorov et al. (2006); Li et al. (2007); Iam-On et al. (2008); Domeniconi and Al-Razgan (2009); Wang et al. (2009); Mimaroglu and Erdil (2011); Iam-On et al. (2011); Yi et al. (2012); Franek and Jiang (2014). These approaches can be classified into mainly three categories, namely, (i) the median partition based methods Cristofor and Simovici (2002); Topchy et al. (2005); Franek and Jiang (2014), (ii) the pair-wise co-occurrence based methods Fred and Jain (2005); Li et al. (2007); Wang et al. (2009), and (iii) the graph partitioning based methods Strehl and Ghosh (2003); Fern and Brodley (2004); Domeniconi and Al-Razgan (2009).
In the median partition based approaches Cristofor and Simovici (2002); Topchy et al. (2005); Franek and Jiang (2014), the clustering ensemble problem is formulated as an optimization problem, aiming to find the partition/clustering that maximizes the similarity between the partition and the base clusterings over the space of all partitions. The median partition problem is NP-complete Topchy et al. (2005). Instead of finding the optimal solution over the huge space of all possible partitions, Cristofor and Simovici (2002) used a genetic algorithm to obtain an approximate solution, where the clusterings are represented by chromosomes. Topchy et al. (2005) cast the median partition problem as a maximum likelihood problem, whose solution, the consensus clustering, is found using the EM algorithm. Franek and Jiang (2014) reduced the median partition problem to the Euclidean median problem by clustering embedding in vector spaces and found the median vector by the Weiszfeld algorithm Weiszfeld and Plastria (2009). An inverse transformation is then performed to convert the median vector into a clustering, which is taken as the consensus clustering.
The pair-wise co-occurrence based approaches Fred and Jain (2005); Li et al. (2007); Wang et al. (2009) construct the similarity between data instances by considering how many times they occur in the same cluster in the ensemble of base clusterings. Fred and Jain (2005) introduced the evidence accumulation clustering (EAC) method, which uses the co-association matrix to measure the similarity between instances. Hierarchical agglomerative clustering algorithms Jain (2010), e.g., single-link (SL) and average-link (AL), can then be performed on the co-association matrix to obtain the consensus clustering. Li et al. (2007) analyzed the co-association matrix and proposed a novel hierarchical clustering algorithm by utilizing the concept of normalized edges to measure the similarity between clusters. Wang et al. (2009) generalized the EAC method and proposed the probability accumulation method, which takes into consideration the sizes of clusters in the ensemble.
Another category of clustering ensemble is based on graph partitioning Strehl and Ghosh (2003); Fern and Brodley (2004); Domeniconi and Al-Razgan (2009). Strehl and Ghosh (2003) modeled the ensemble of clusterings in a hypergraph structure where the clusters are treated as hyperedges. For partitioning the graph and obtaining the consensus clustering, they further proposed three graph partitioning algorithms, namely, the cluster-based similarity partitioning algorithm (CSPA), the hypergraph-partitioning algorithm (HGPA), and the meta-clustering algorithm (MCLA). Fern and Brodley (2004) formulated the clustering ensemble as a bipartite graph where both the data instances and clusters are represented as graph nodes. An edge between two nodes exists if and only if one of the nodes is a data instance and the other node is the cluster containing it. The consensus clustering is obtained by partitioning the graph into a certain number of disjoint sets of graph nodes.
Many of the existing clustering ensemble approaches implicitly assume that all the base clusterings contribute equally to the ensemble system and can therefore be affected significantly by low-quality or even ill clusterings. In recent years, some efforts have been made to weight the base clusterings with regard to the clustering validity. Vega-Pons et al. (2010) exploited several property validity indexes (PVIs), namely, Variance (VI), Connectivity (CI), Silhouette Width (SI) and Dunn index (DI), to assign a weight to each partition in the ensemble and proposed a new clustering ensemble method based on kernel functions. Vega-Pons et al. (2011) also extended the conventional EAC method by weighting the partitions based on the PVIs. These PVIs need access to the original feature vectors, which are not supposed to be given for the consensus process in the formulation of this work as well as many other clustering ensemble frameworks Fern and Brodley (2004); Fred and Jain (2005); Li et al. (2007); Iam-On et al. (2008); Wang et al. (2009); Mimaroglu and Erdil (2011); Iam-On et al. (2011); Yi et al. (2012); Franek and Jiang (2014). Li and Ding (2008) proposed the weighted consensus clustering (WCC) method, where the weights of the base clusterings are determined via an optimization process based on the nonnegative matrix factorization. This optimization process is computationally expensive when dealing with large datasets. Fern and Lin (2008) proposed a clustering ensemble selection framework which selects a subset of partitions from a large library of partitions. The ensemble selection process in Fern and Lin (2008) can be viewed as weighting the partitions in the ensemble with either 1 or 0, where 1’s indicate the preserved partitions and 0’s indicate the deleted ones. However, the ensemble selection scheme lacks the flexibility of weighting the selected members in accordance with their quality.
3 The Clustering Ensemble Problem
The purpose of a clustering algorithm is to discover the structure of clusters in a given dataset. The clustering result can be either a hard partition or a fuzzy partition of the dataset. The clustering ensemble technique aims to combine multiple partitions to achieve a better partition. In this paper, we focus on combining hard partitions of data.
Let $X = \{x_1, \ldots, x_N\}$ be a dataset, where $x_i$ is the $i$-th data instance and $N$ is the number of instances in $X$. A partition (or clustering) of $X$ is generated by running a clustering algorithm with some specific parameters. Each cluster in a partition consists of a certain number of data instances. Different clusters in the same partition do not intersect with each other, and the union of all clusters in a partition covers the entire dataset.
Let $\pi = \{C_1, \ldots, C_n\}$ be a partition of $X$, where $C_j$ denotes the $j$-th cluster and $n$ is the number of clusters in $\pi$. Then we have $C_j \neq \emptyset$ for all $j$, $C_i \cap C_j = \emptyset$ for all $i \neq j$, and $\bigcup_{j=1}^{n} C_j = X$.
In a clustering ensemble system, each partition is referred to as a base clustering. With the partitions generated by different algorithms or the same algorithm with different parameters and initializations, we can obtain the ensemble of base clusterings, which is denoted as
$$\Pi = \{\pi^1, \ldots, \pi^M\},$$
where $\pi^k$ represents the $k$-th base clustering in $\Pi$ and $M$ is the number of base clusterings. For convenience, the set of all clusters in the ensemble is denoted as $\mathcal{C} = \{C_1, \ldots, C_{N_c}\}$, where $C_i$ is the $i$-th cluster in $\mathcal{C}$. As is defined, it holds that $\mathcal{C} = \bigcup_{k=1}^{M} \pi^k$ and $N_c = \sum_{k=1}^{M} |\pi^k|$.
The multiple partitions of $X$ provide multiple views of the dataset. The problem is to use the information provided by the ensemble of multiple partitions to obtain a final partition $\pi^*$, which is generally referred to as the consensus clustering.
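As a concrete illustration of this formulation, the following minimal Python sketch stores each base clustering as a label vector and recovers its clusters as sets of instance indices. The helper names (`clusters_of`, `is_hard_partition`) and the toy ensemble are hypothetical and only serve to illustrate the hard-partition properties stated above.

```python
from collections import defaultdict

def clusters_of(labels):
    """Turn one label vector into a list of clusters (sets of instance indices)."""
    groups = defaultdict(set)
    for idx, lab in enumerate(labels):
        groups[lab].add(idx)
    return [groups[lab] for lab in sorted(groups)]

def is_hard_partition(clusters, n):
    """Check that clusters are non-empty, pairwise disjoint, and cover 0..n-1."""
    seen = set()
    for c in clusters:
        if not c or seen & c:
            return False
        seen |= c
    return seen == set(range(n))

# Toy ensemble (hypothetical data): 3 base clusterings over 6 instances.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [0, 1, 0, 1, 0, 1],
]

# The set of all clusters in the ensemble, pooled over the base clusterings.
all_clusters = [c for labels in ensemble for c in clusters_of(labels)]
```

Working with label vectors keeps the consensus step independent of the original features, which matches the assumption that only the base clusterings are available.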
4 Crowd Agreement Estimation
In the clustering ensemble system, the base clusterings can be generated using a wide variety of clustering algorithms. Due to the diversity of clustering algorithms and datasets, it is not guaranteed that every base clustering is well constructed. The low-quality clusterings, or even ill clusterings, may affect the consensus process significantly. There is a need to distinguish the poor clusterings from the good ones and treat the base clusterings with regard to their quality. The critical problem here is how to evaluate the quality of the base clusterings without knowing the ground-truth.
Some algorithms have been developed to estimate the clustering quality using different criteria Wu and Chow (2004); Faceli et al. (2009); Li and Latecki (2012). Wu and Chow (2004) proposed a clustering validity index based on inter-cluster and intra-cluster density. Faceli et al. (2009) used the overall deviation and the connectivity to assess the quality of a clustering. The overall deviation of a clustering measures the overall distances between data instances and their corresponding cluster centers. The connectivity measures how often neighboring instances are assigned to the same cluster. Li and Latecki (2012) utilized the average silhouette coefficient to evaluate the quality of a cluster. The silhouette coefficient of a data instance measures how similar that instance is to the instances in its own cluster compared to the instances in the other clusters, whereas the quality of a cluster is estimated by the average of the silhouette coefficients of the instances inside it. These evaluation methods are only applicable to numerical data and need access to the original data features, which are not supposed to be given in the problem formulation of many clustering ensemble approaches Fern and Brodley (2004); Fred and Jain (2005); Li et al. (2007); Iam-On et al. (2008); Wang et al. (2009); Mimaroglu and Erdil (2011); Iam-On et al. (2011); Yi et al. (2012); Franek and Jiang (2014). Rather than utilizing the information of data distribution, in this paper, we view the clustering ensemble as a crowd of individuals and estimate the quality of each individual by consulting the other individuals in the clustering ensemble.
In social and economic science, “the wisdom of the crowd” is the process of taking into consideration the collective opinion of a crowd of individuals rather than a single expert Surowiecki (2004). The ground-truth labeling of a dataset can be viewed as an expert. As the ground-truth is not supposed to be known in unsupervised frameworks, we estimate the quality of a base clustering by collecting information from the crowd of base clusterings. Each base clustering is compared with the other ones and the average opinion of the crowd of individuals is obtained for quality estimation.
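Definition 1. Let $\Pi = \{\pi^1, \ldots, \pi^M\}$ be an ensemble of $M$ base clusterings and $\pi^i$ be the $i$-th base clustering in $\Pi$. The crowd agreement index (CAI) for $\pi^i$ is defined as
$$\mathrm{CAI}(\pi^i) = \frac{1}{M-1} \sum_{j=1,\, j \neq i}^{M} \mathrm{Sim}(\pi^i, \pi^j),$$
where $\mathrm{Sim}(\pi^i, \pi^j)$ denotes the similarity between the two base clusterings $\pi^i$ and $\pi^j$.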
We denote the base clustering that gains the maximum agreement from the crowd as the reference member. Then the reliability of the base clusterings is estimated by comparing their crowd agreement with that of the reference member and the normalized version of crowd agreement index can be computed.
Definition 2. The normalized crowd agreement index of $\pi^i$ is defined as
$$\mathrm{NCAI}(\pi^i) = \frac{\mathrm{CAI}(\pi^i)}{\max_{\pi^j \in \Pi} \mathrm{CAI}(\pi^j)}.$$
The basic idea here is to estimate the quality of a base clustering by collecting opinions from a crowd of diverse individuals. According to Definition 2, for any $\pi^i \in \Pi$, it holds that $\mathrm{NCAI}(\pi^i) \in [0, 1]$. In this paper we use the normalized mutual information (NMI) Strehl and Ghosh (2003) as the similarity measure $\mathrm{Sim}(\cdot, \cdot)$. The greater the NCAI value of a base clustering is, the better its quality is supposed to be.
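The crowd agreement estimation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the CAI is taken as the mean NMI to the other ensemble members, the NCAI divides by the largest CAI, and the function names, the handling of degenerate cases, and the toy ensemble are all hypothetical.

```python
import math
from collections import Counter

def nmi(a, b):
    """NMI between two label vectors, normalized by sqrt(H(a) * H(b))."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(nij / n * math.log(nij * n / (ca[x] * cb[y]))
             for (x, y), nij in cab.items())
    ha = -sum(c / n * math.log(c / n) for c in ca.values())
    hb = -sum(c / n * math.log(c / n) for c in cb.values())
    if ha == 0.0 or hb == 0.0:
        return 0.0  # assumption: a single-cluster labeling agrees with nothing
    return mi / math.sqrt(ha * hb)

def crowd_agreement(ensemble):
    """CAI: average similarity of each base clustering to all the others."""
    m = len(ensemble)
    return [sum(nmi(ensemble[i], ensemble[j]) for j in range(m) if j != i) / (m - 1)
            for i in range(m)]

def normalized_crowd_agreement(ensemble):
    """NCAI: CAI divided by the CAI of the reference (most agreed-upon) member."""
    scores = crowd_agreement(ensemble)
    top = max(scores)
    if top == 0.0:
        return [0.0] * len(scores)  # assumption: no agreement at all in the crowd
    return [s / top for s in scores]

# Toy ensemble (hypothetical): two members agree, the third is an outlier.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
]
weights = normalized_crowd_agreement(ensemble)
```

In this toy case the two agreeing members receive an NCAI of 1, while the outlier receives a much smaller value, so it would be down-weighted without any access to the original features.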
5 Source Aware Connected Triple
In this section, we investigate the relationship among the clusters in the ensemble and introduce the source aware connected triple (SACT) which is able to measure the similarity of two clusters with regard to their common neighbors and the source reliability.
Definition 3. Two clusters $C_i$ and $C_j$ are neighbors if and only if they share some common data instances, i.e., $C_i \cap C_j \neq \emptyset$.
Each cluster is a set of data instances. The Jaccard coefficient Levandowsky and Winter (1971) is often used to measure the similarity between two clusters (or two sets), which is computed as follows:
$$\mathrm{Jaccard}(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|},$$
where $C_i$ and $C_j$ are two clusters and $|\cdot|$ denotes the cardinality of a set. The Jaccard coefficient measures the similarity of two clusters by the instances they share. Therefore the Jaccard coefficient of two different clusters in the same base clustering is always zero. If two clusters intersect, then they are directly related. If two clusters do not intersect but they share a certain number of common neighbors, then they are also related. Iam-On et al. (2011) utilized the information of common neighbors of two clusters to assess their similarity; however, the reliability of these neighbors was not considered.
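A short Python sketch of the Jaccard coefficient and the neighbor relation may help here. Clusters are represented as sets of instance indices; the function names and the toy clusters are hypothetical.

```python
def jaccard(ci, cj):
    """Jaccard coefficient of two clusters given as sets of instance indices."""
    union = ci | cj
    return len(ci & cj) / len(union) if union else 0.0

def are_neighbors(ci, cj):
    """Two clusters are neighbors if they share at least one instance."""
    return len(ci & cj) > 0

# Hypothetical clusters over instances 0..5.
a, b = {0, 1, 2}, {3, 4, 5}   # two clusters of the same base clustering
c = {2, 3}                    # a cluster from another base clustering

direct = jaccard(a, b)        # zero, since a and b cannot intersect
via_c = (jaccard(a, c), jaccard(b, c))
```

Here `a` and `b` have zero Jaccard similarity, yet both are neighbors of `c`, which is exactly the indirect relation that a common-neighbor measure is meant to capture.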
Each base clustering can be viewed as a source of clusters. The overall quality of the clusters in a base clustering is correlated to the quality of the base clustering containing them. In this paper, we estimate the reliability of a cluster by considering the quality of the corresponding base clustering and propose the source aware connected triple (SACT) to measure the similarity of two clusters with regard to their common neighbors and the reliability of these neighbors.
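Definition 4. The SACT coefficient between two clusters $C_i$ and $C_j$ w.r.t. a cluster $C_k$ is defined as
$$\mathrm{SACT}_k(C_i, C_j) = \theta(\pi(C_k)) \cdot \min\{\mathrm{Jaccard}(C_i, C_k),\ \mathrm{Jaccard}(C_j, C_k)\},$$
where $\pi(C_k)$ denotes the base clustering that contains $C_k$ and
$$\theta(\pi^m) = \mathrm{NCAI}(\pi^m)^{\beta} \tag{7}$$
is the influence of the NCAI of the base clustering $\pi^m$.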
According to Definitions 2 and 4, for any $C_i, C_j, C_k \in \mathcal{C}$, it holds that $\mathrm{SACT}_k(C_i, C_j) \in [0, 1]$. The parameter $\beta$ in Eq. (7) adjusts the influence of the NCAI. A greater value of $\beta$ leads to a bigger influence of the NCAI, which means that the difference of NCAI values between high-confidence partitions and low-confidence partitions is enlarged. When $\beta = 0$, the influence of NCAI disappears for all base clusterings, i.e., $\theta(\pi^m) = 1$ for all $\pi^m \in \Pi$.
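The effect of this parameter can be illustrated numerically. The sketch below assumes that the influence takes a power form, i.e., the NCAI raised to the exponent; this form is an assumption chosen because it reproduces the behavior described above (no influence when the exponent is zero, larger gaps as it grows). The function name and the NCAI values are hypothetical.

```python
def ncai_influence(ncai, beta):
    """Assumed power form of the influence: NCAI ** beta, with NCAI in [0, 1]."""
    return ncai ** beta

high, low = 1.0, 0.5  # hypothetical NCAI values of a reliable and a weak clustering

# Ratio of the weak clustering's influence to the reliable one's, per exponent.
ratios = {beta: ncai_influence(low, beta) / ncai_influence(high, beta)
          for beta in (0, 1, 2, 4)}
```

With an exponent of zero both clusterings are treated equally; as the exponent grows, the weak clustering's relative influence shrinks from 0.5 to 0.25 and below.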
Definition 5. The SACT coefficient between two clusters $C_i$ and $C_j$ w.r.t. all the clusters in the ensemble is defined as
$$\mathrm{SACT}(C_i, C_j) = \sum_{k=1}^{N_c} \mathrm{SACT}_k(C_i, C_j). \tag{8}$$
By definition, if $C_k$ is not a common neighbor of $C_i$ and $C_j$, then $\mathrm{SACT}_k(C_i, C_j) = 0$. Thus the SACT coefficient between two clusters w.r.t. all the common neighbors is identical to that w.r.t. all the clusters in the ensemble and can be computed by Eq. (8).
Definition 6. The SACT similarity between two clusters $C_i$ and $C_j$ is defined as
$$\mathrm{Sim}_{\mathrm{SACT}}(C_i, C_j) = \begin{cases} 1, & \text{if } i = j, \\ \dfrac{\mathrm{SACT}(C_i, C_j)}{\max_{p \neq q} \mathrm{SACT}(C_p, C_q)}, & \text{otherwise}. \end{cases} \tag{9}$$
The SACT similarity is computed on the basis of the SACT coefficient. The pair of clusters with the maximum SACT coefficient is adopted as the reference pair of clusters, whose SACT similarity is defined to be 1. The SACT similarity of the other pairs of clusters is computed by comparing their SACT coefficient to that of the reference pair (see Eq. (9)). The SACT similarity between a cluster and itself is set to 1.
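The normalization against the reference pair can be sketched as follows. This is a hedged illustration only: the per-neighbor coefficient is assumed to be the weaker of the two Jaccard links scaled by the reliability of the neighbor's source clustering, and the coefficients are assumed to be summed over all clusters. The function names, the reliability values in `theta`, and the toy clusters are hypothetical.

```python
def jaccard(ci, cj):
    union = ci | cj
    return len(ci & cj) / len(union) if union else 0.0

def sact_similarity(clusters, source, theta):
    """clusters: list of sets; source[k]: index of the base clustering that
    contains cluster k; theta[m]: reliability of base clustering m."""
    n = len(clusters)
    jac = [[jaccard(clusters[i], clusters[j]) for j in range(n)] for i in range(n)]
    coef = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumed form: sum over all clusters k of the weaker Jaccard link,
            # scaled by the source reliability; non-common-neighbors add zero.
            total = sum(theta[source[k]] * min(jac[i][k], jac[j][k])
                        for k in range(n))
            coef[i][j] = coef[j][i] = total
    # The pair with the largest coefficient is the reference pair.
    top = max(coef[i][j] for i in range(n) for j in range(n) if i != j)
    return [[1.0 if i == j else (coef[i][j] / top if top > 0 else 0.0)
             for j in range(n)] for i in range(n)]

# Toy input (hypothetical): clusters 0-1 come from base clustering 0,
# clusters 2-4 from base clustering 1.
clusters = [{0, 1, 2}, {3, 4, 5}, {0, 1}, {2, 3}, {4, 5}]
source = [0, 0, 1, 1, 1]
theta = [1.0, 0.8]
sim = sact_similarity(clusters, source, theta)
```

Clusters 0 and 1 never intersect, but they obtain a positive similarity through their common neighbor `{2, 3}`; the reference pair always has similarity 1 by construction.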
6 Consensus Functions
In this section, we introduce two novel consensus functions which utilize multi-granularity information of the ensemble and are able to deal with ill base clusterings. In the following, we will describe the weighted evidence accumulation clustering (WEAC) method in Section 6.1 and the graph partitioning with multi-granularity link analysis (GP-MGLA) method in Section 6.2.
6.1 Weighted Evidence Accumulation Clustering (WEAC)
In a base clustering, each data instance is assigned to a specific cluster, so any two instances are either in the same cluster or in two different clusters. Without access to the original features, the affinity between two data instances can be assessed by their co-occurrence information in the ensemble of base clusterings.
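Definition 7. Let $\pi^m$ be a base clustering in the clustering ensemble $\Pi$. Let $\mathrm{Cls}^m(x_i)$ be the cluster label of the instance $x_i$ in $\pi^m$. The similarity matrix $A^m = \{a^m_{ij}\}_{N \times N}$ for $\pi^m$ is computed as follows:
$$a^m_{ij} = \begin{cases} 1, & \text{if } \mathrm{Cls}^m(x_i) = \mathrm{Cls}^m(x_j), \\ 0, & \text{otherwise}, \end{cases}$$
for $i, j = 1, \ldots, N$.
For each base clustering, say, $\pi^m$, a similarity matrix $A^m$ is constructed. If instances $x_i$ and $x_j$ occur in the same cluster in $\pi^m$, then $a^m_{ij} = 1$; otherwise $a^m_{ij} = 0$. The similarity matrix contains the pair-wise co-occurrence information of the corresponding base clustering. In the conventional evidence accumulation clustering (EAC) method Fred and Jain (2005), the association matrix is obtained by averaging the similarity matrices of all the base clusterings, that is,
$$A = \frac{1}{M} \sum_{m=1}^{M} A^m.$$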
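The conventional EAC averaging can be sketched in plain Python as follows. The function names and the two-member toy ensemble are hypothetical; each base clustering is given as a label vector.

```python
def similarity_matrix(labels):
    """Binary co-occurrence matrix of one base clustering."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]

def co_association(ensemble):
    """Unweighted EAC matrix: the average of the per-clustering matrices."""
    m, n = len(ensemble), len(ensemble[0])
    acc = [[0.0] * n for _ in range(n)]
    for labels in ensemble:
        s = similarity_matrix(labels)
        for i in range(n):
            for j in range(n):
                acc[i][j] += s[i][j] / m
    return acc

# Toy ensemble (hypothetical): 2 base clusterings over 3 instances.
A = co_association([[0, 0, 1], [0, 1, 1]])
```

Each entry is the fraction of base clusterings that place the two instances in the same cluster, so every member contributes equally regardless of its quality.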
The basic idea of the proposed WEAC method is to construct the association matrix while taking into account the reliability of the base clusterings. We assess the quality of each base clustering with the NCAI measure (as described in Section 4) and assign a weight to each base clustering with regard to its estimated quality.
Definition 8. The weighted co-association matrix is an $N \times N$ matrix $\tilde{A} = \{\tilde{a}_{ij}\}_{N \times N}$ which is computed as follows:
$$\tilde{A} = \sum_{m=1}^{M} w^m A^m,$$
where
$$w^m = \frac{\theta(\pi^m)}{\sum_{k=1}^{M} \theta(\pi^k)}$$
is the weight of the base clustering $\pi^m$.
According to Definitions 7 and 8, for any $i$ and $j$, it holds that $\tilde{a}_{ij} \in [0, 1]$. Thus the labeling information of multiple base clusterings is mapped into a new similarity measure by utilizing pair-wise co-occurrence cues and the reliability assessment of each member. With the weighted co-association matrix constructed, we further apply the agglomerative clustering methods Jain (2010) to obtain the final consensus clustering.
For clarity, the WEAC method is summarized in Algorithm 1.
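As a rough illustration of the pipeline described in this subsection, the following Python sketch builds a weighted co-association matrix and applies a naive average-link agglomeration to it. It is a sketch under stated assumptions rather than the authors' Algorithm 1: the weights are assumed to be reliabilities raised to an exponent and normalized to sum to one, the default exponent of 2.0 is arbitrary, and the reliability values would normally come from the NCAI but are supplied by hand here. All names are hypothetical.

```python
def weighted_co_association(ensemble, reliability, beta=2.0):
    """Weighted co-occurrence matrix; weights are normalized reliability**beta."""
    theta = [r ** beta for r in reliability]
    total = sum(theta)
    weights = [t / total for t in theta]
    n = len(ensemble[0])
    mat = [[0.0] * n for _ in range(n)]
    for labels, w in zip(ensemble, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    mat[i][j] += w
    return mat

def average_link(sim, k):
    """Naive average-link agglomeration on a similarity matrix, down to k groups."""
    groups = [[i] for i in range(len(sim))]
    while len(groups) > k:
        best, pair = -1.0, None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                s = sum(sim[i][j] for i in groups[a] for j in groups[b])
                s /= len(groups[a]) * len(groups[b])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        merged = groups.pop(b)      # b > a, so index a stays valid
        groups[a].extend(merged)
    labels = [0] * len(sim)
    for lab, members in enumerate(groups):
        for i in members:
            labels[i] = lab
    return labels

# Toy ensemble (hypothetical) with hand-picked reliabilities for its members.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
]
W = weighted_co_association(ensemble, reliability=[1.0, 1.0, 0.15])
consensus = average_link(W, k=2)
```

Because the third member carries almost no weight, the consensus follows the two agreeing members. In practice a library routine for hierarchical clustering would replace the naive cubic-time loop used here.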
6.2 Graph Partitioning with Multi-Granularity Link Analysis (GP-MGLA)
There are three levels of granularity in the clustering ensemble, namely, the data instances, the clusters, and the base clusterings. The existing methods mainly focus on the level of data instances and that of clusters and lack the ability to treat the three levels of granularity as a whole system. In this section, we propose a graph based clustering ensemble method termed graph partitioning with multi-granularity link analysis (GP-MGLA). In the proposed GP-MGLA method, we formulate the three levels of granularity in the clustering ensemble into a bipartite graph model, which will be described in the following.
Compared to the previous clustering ensemble methods based on graph partitioning Strehl and Ghosh (2003); Fern and Brodley (2004), the GP-MGLA method is distinguished mainly in two aspects. Firstly, the GP-MGLA method utilizes the crowd agreement estimation mechanism (see Section 4) for exploiting the relationship among base clusterings and evaluating the quality of the base clusterings in an unsupervised manner. Secondly, the links between clusters are integrated into the graph model via the SACT similarity measure.
In our bipartite graph model, both data instances and clusters are treated as graph nodes. There are two types of links in the graph, that is, the links between instances and the cluster containing them and the links between clusters that have common neighbors. To implement the bipartite structure, each cluster is used twice, i.e., for each cluster, there are two different nodes representing it in the bipartite graph.
Formally, we construct the bipartite graph as follows:
$$G = (U, V, E),$$
where $U = X \cup \mathcal{C}$ is the set of nodes including all instances and clusters, $V = \mathcal{C}$ is the set of nodes including all clusters, and $E$ is the set of graph links. The graph $G$ is an undirected graph. There are no links between the nodes in $U$ or between the nodes in $V$. All links are constructed between the nodes in $U$ and those in $V$.
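Let $u_i \in U$ and $v_j \in V$ be two nodes in the graph $G$. If $u_i$ is a data instance and $v_j$ is the cluster containing $u_i$, then a link exists between $u_i$ and $v_j$ and the link between them is weighted with regard to the quality of the base clustering that $v_j$ belongs to. If both $u_i$ and $v_j$ are clusters, then the link between them is constructed via the SACT measure (see Section 5). Formally, the weight of the link between the nodes $u_i$ and $v_j$ is defined as follows:
$$e_{ij} = \begin{cases} \alpha \cdot \theta(\pi(v_j)), & \text{if } u_i \in X,\ v_j \in \mathcal{C},\ \text{and } u_i \in v_j, \\ \mathrm{Sim}_{\mathrm{SACT}}(u_i, v_j), & \text{if } u_i \in \mathcal{C} \text{ and } v_j \in \mathcal{C}, \\ 0, & \text{otherwise}, \end{cases}$$
where $\theta(\pi(v_j))$ is the influence of the NCAI of the base clustering containing $v_j$ (see Section 5) and $\alpha$ is a scale factor for the link weights between instances and clusters (see Section 7.2).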
In the graph $G$, the instances and the clusters are used as nodes and the relationship among them is incorporated into the graph links. The information among the base clusterings is also exploited to provide a reliability measure for the graph links via the crowd agreement estimation. With regard to the bipartite structure of the graph $G$, the Tcut algorithm Li et al. (2012) can be utilized for partitioning the graph into a specific number of disjoint sets of nodes. The data instances in each of these disjoint sets are treated as a cluster and thus the final consensus clustering is obtained. Theoretically, some of these disjoint sets may consist of only clusters and no instances, which would lead to fewer clusters than specified. However, we have never come across this situation in our experiments, probably because the joint force of the links between the instances and the clusters containing them is strong enough to hold at least part of them together. For clarity, we summarize the GP-MGLA method in Algorithm 2.
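The graph construction step can be sketched as a weight matrix between the two node sets. The sketch below is illustrative only and stops before the partitioning step, which would be handled by a bipartite graph partitioning routine such as Tcut. The instance-to-cluster weight is assumed to be a scale factor times the reliability of the cluster's source clustering; the function name, the default scale factor of 0.5, and the cluster similarity values are hypothetical.

```python
def bipartite_weights(n_instances, clusters, source, theta, cluster_sim, alpha=0.5):
    """Return the (N + Nc) x Nc weight matrix between the left node set
    (instances first, then clusters) and the right node set (clusters)."""
    n_clusters = len(clusters)
    rows = []
    for x in range(n_instances):
        # Instance-to-cluster links: scaled reliability of the cluster's source.
        rows.append([alpha * theta[source[j]] if x in clusters[j] else 0.0
                     for j in range(n_clusters)])
    for i in range(n_clusters):
        # Cluster-to-cluster links: a precomputed cluster similarity.
        rows.append([cluster_sim[i][j] for j in range(n_clusters)])
    return rows

# Toy input (hypothetical): clusters 0-1 from base clustering 0, 2-3 from 1.
clusters = [{0, 1}, {2, 3}, {0, 1, 2}, {3}]
source = [0, 0, 1, 1]
theta = [1.0, 0.6]
cluster_sim = [[1.0, 0.2, 0.9, 0.1],
               [0.2, 1.0, 0.5, 0.7],
               [0.9, 0.5, 1.0, 0.0],
               [0.1, 0.7, 0.0, 1.0]]
B = bipartite_weights(4, clusters, source, theta, cluster_sim)
```

Each cluster appears once as a row and once as a column, which mirrors the idea of using every cluster twice to keep the structure bipartite.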
7 Experiments
In this section, we conduct experiments on eight real-world datasets and compare the proposed approaches against several baseline clustering ensemble approaches. The datasets and evaluation criterion are described in Section 7.1. The setting of parameters is discussed in Section 7.2. The construction of base clusterings is introduced in Section 7.3. Then we evaluate the performance of the proposed methods compared to the baseline methods in Section 7.4. The analysis of computational complexity is presented in Section 7.5.
The experiments in this paper are conducted in Matlab 7.14.0.739 (R2012a) 64-bit on a workstation (Windows Server 2008 R2 64-bit, 8 Intel 2.40GHz processors, 96GB of RAM).
7.1 Datasets and Evaluation Criterion
In our experiments, eight real-world datasets from the UCI machine learning repository Bache and Lichman (2013) are used, namely, Breast Cancer, Image Segmentation, Iris, Seeds, Yeast, Wine, Pen Digits, and Letters. The details of the benchmark datasets are given in Table 1.
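To evaluate the quality of the consensus clustering, we utilize the normalized mutual information (NMI) Strehl and Ghosh (2003), which provides an indication of the shared information between two clusterings. Let $\pi'$ be the test clustering and $\pi^G$ the ground-truth clustering. The NMI score of $\pi'$ w.r.t. $\pi^G$ is computed as follows:
$$\mathrm{NMI}(\pi', \pi^G) = \frac{\sum_{i=1}^{n'} \sum_{j=1}^{n^G} n_{ij} \log \frac{n_{ij}\, N}{n'_i\, n^G_j}}{\sqrt{\sum_{i=1}^{n'} n'_i \log \frac{n'_i}{N} \; \sum_{j=1}^{n^G} n^G_j \log \frac{n^G_j}{N}}},$$
where $n'$ is the number of clusters in $\pi'$, $n^G$ is the number of clusters in $\pi^G$, $n'_i$ is the number of instances in the $i$-th cluster of $\pi'$, $n^G_j$ is the number of instances in the $j$-th cluster of $\pi^G$, $n_{ij}$ is the number of common instances shared by cluster $i$ in $\pi'$ and cluster $j$ in $\pi^G$, and $N$ is the number of instances in the dataset.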
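The count-based NMI can be computed directly from two label vectors. The following Python sketch follows the Strehl and Ghosh form with the geometric-mean normalization; the function name is hypothetical, and returning zero when either clustering has a single cluster is a convention assumed here.

```python
import math
from collections import Counter

def nmi_from_counts(test_labels, truth_labels):
    """NMI computed from cluster sizes and pairwise overlap counts."""
    n = len(test_labels)
    n_test = Counter(test_labels)                     # sizes of test clusters
    n_truth = Counter(truth_labels)                   # sizes of ground-truth clusters
    n_joint = Counter(zip(test_labels, truth_labels)) # overlap counts
    numerator = sum(nij * math.log(nij * n / (n_test[i] * n_truth[j]))
                    for (i, j), nij in n_joint.items())
    s_test = sum(c * math.log(c / n) for c in n_test.values())
    s_truth = sum(c * math.log(c / n) for c in n_truth.values())
    denominator = math.sqrt(s_test * s_truth)
    # Assumed convention: a single-cluster labeling yields a score of zero.
    return numerator / denominator if denominator > 0 else 0.0

truth = [0, 0, 0, 1, 1, 1]
score_same = nmi_from_counts([1, 1, 1, 0, 0, 0], truth)   # relabeled copy
score_weak = nmi_from_counts([0, 1, 0, 1, 0, 1], truth)   # weakly related
```

The score is invariant to how cluster labels are named, so a relabeled copy of the ground truth still scores 1, while the weakly related labeling scores about 0.08.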
7.2 Choices of Parameters
There is one parameter in the WEAC method and two parameters and in the GP-MGLA method. The parameter is a scale factor for the link weights between instances and clusters. The parameter adjusts the influence of NCAI for both WEAC and GP-MGLA, where a bigger signals a greater influence of NCAI. We evaluate the performance of the proposed WEAC and GP-MGLA methods with varying parameters on the benchmark datasets. As can be seen in Table 2 and 3, the proposed methods are very stable w.r.t. the varying parameters. Empirically, it is suggested that be set in the interval of and be set in the interval of for the proposed two methods. In the following, the parameters are set that and for all the experiments on all the benchmark datasets.
7.3 Generation of Base Clustering Ensemble
The proposed approaches make no specific assumption about the generation of the ensemble of base clusterings. To evaluate the effectiveness and robustness of the proposed methods over various combinations of base clusterings, we construct a pool of a large number of different base clusterings. Then we run the proposed methods and the baseline methods with the base clusterings randomly chosen from the pool repeatedly.
Four clustering algorithms are used to construct the base clustering pool, namely, $k$-means, rival penalized competitive learning (RPCL) Xu et al. (1993), hierarchical mode association clustering (HMAC) Li et al. (2007), and incremental support vector clustering with outlier detection (OD-ISVC) Huang et al. (2012). To obtain a pool of various base clusterings, we apply the aforementioned clustering algorithms repeatedly with random parameters and initializations on each dataset. The number of clusters for the $k$-means and RPCL methods is randomly chosen in the interval of , where $N$ is the number of instances in the dataset. The HMAC method is a hierarchical clustering method. We choose the hierarchy of clustering randomly for the HMAC method, where each hierarchy corresponds to a clustering with a certain number of clusters. For the OD-ISVC method, the base clusterings are generated with a randomly chosen kernel width parameter and trade-off parameter. In this paper, we apply each of the clustering algorithms for times and thus a pool of different base clusterings is constructed for each dataset.
7.4 Performance Comparison and Analysis
With the base clustering pool constructed (see Section 7.3), the proposed approaches and the baseline approaches are applied to the ensemble of base clusterings which are randomly chosen from the pool. In our experiments, each of the clustering ensemble approaches has no knowledge about how the chosen base clusterings are generated, i.e., by which algorithm and with what parameters they are generated. For each run, an ensemble of base clusterings is randomly constructed and different clustering ensemble approaches are applied to the ensemble. The ensemble size is used in our work. We test the proposed approaches against the baseline approaches by evaluating their performance over a large number of runs, which aims to rule out the factor of “getting lucky sometimes” and provide a fair comparison for their effectiveness and robustness over different combinations of base clusterings.
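The repeated-run protocol described above can be sketched as a short loop. The sketch assumes that base clusterings are drawn from the pool without replacement within a run; `consensus_fn` and `score_fn` are hypothetical callables standing in for a clustering ensemble method and an evaluation measure such as NMI against the ground truth.

```python
import random

def evaluate_over_runs(pool, consensus_fn, score_fn, ensemble_size, n_runs, seed=0):
    """Average the score of a consensus method over many randomly drawn ensembles."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        # Draw one ensemble; the method receives only the label vectors,
        # with no information about how each base clustering was generated.
        ensemble = rng.sample(pool, ensemble_size)
        scores.append(score_fn(consensus_fn(ensemble)))
    return sum(scores) / len(scores)
```

Averaging over many independently drawn ensembles is what reduces the influence of a single favorable draw.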
7.4.1 Comparison with Base Clusterings
In this paper we propose two novel consensus functions, namely, the GP-MGLA method and the WEAC method. GP-MGLA is a graph partitioning based method, whereas WEAC is a pair-wise similarity based method. With the weighted co-association matrix computed by WEAC, we further perform three agglomerative methods, namely, average-link (AL), complete-link (CL), and single-link (SL) to obtain the final consensus clustering, which leads to three sub-methods denoted as WEAC-AL, WEAC-CL, and WEAC-SL respectively.
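To make the pair-wise similarity idea concrete, the following sketch builds a weighted co-association matrix from an ensemble of label vectors. It is an illustration under stated assumptions: the per-clustering `weights` are taken as given (in WEAC they are derived from NCAI as defined earlier in the paper), and normalizing by the sum of the weights is a choice made here for readability, not a restatement of the paper's formula.

```python
def weighted_co_association(ensemble, weights):
    """Return an n x n matrix whose (i, j) entry is the weighted fraction of
    base clusterings that place instances i and j in the same cluster."""
    n = len(ensemble[0])
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(ensemble, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w / total
    return matrix
```

An agglomerative method (AL, CL, or SL) can then be run on this matrix, treating its entries as similarities.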
For each run, an ensemble is generated by randomly drawing base clusterings from the pool. We apply the proposed clustering ensemble methods on different ensembles for each dataset repeatedly. The average performance over 100 runs of our methods compared to the base clusterings is shown in Fig. 2, in which Max(base) denotes the average NMI score of the best base clustering over all ensembles, Min(base) denotes the average NMI score of the worst base clustering over all ensembles, and Avg(base) denotes the average NMI score of all base clusterings over all ensembles. As shown in Fig. 2, the proposed methods are able to produce better and more robust consensus clusterings than the base clusterings. In particular, GP-MGLA and WEAC-AL significantly outperform the base clusterings on the Breast Cancer, Seeds, Wine, and Pen Digits datasets.
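The three reference statistics Max(base), Min(base), and Avg(base) can be computed as in the sketch below. The input layout, a list with one inner list of base-clustering NMI scores per run, is an assumption made for illustration.

```python
def summarize_base_scores(runs):
    """runs[r] holds the NMI scores of the base clusterings in the r-th ensemble.

    Returns (Max(base), Min(base), Avg(base)), each averaged over all ensembles.
    """
    n = len(runs)
    max_base = sum(max(r) for r in runs) / n
    min_base = sum(min(r) for r in runs) / n
    avg_base = sum(sum(r) / len(r) for r in runs) / n
    return max_base, min_base, avg_base
```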
We further compare the consensus clusterings by our methods against each of the base clusterings and calculate the winning percentage. For each run, an ensemble of base clusterings is selected. Then there will be a total of comparisons between the consensus clusterings and the base clusterings over 100 runs. We call it a win if the consensus clustering has a higher NMI score than a base clustering and call it a loss if the consensus clustering has a lower NMI score than a base clustering. Ties count as win and loss. The winning percentage is defined as the number of wins divided by the total number of comparisons. As shown in Table 4, the GP-MGLA method and the WEAC method (associated with AL) outperform most of the base clusterings w.r.t. the best number of clusters on the benchmark datasets. We also compare the consensus clusterings against the base clusterings w.r.t. the same number of clusters, which means that for each comparison the number of clusters of the consensus clustering is set to the same number as that of the base clustering. As shown in Table 5, GP-MGLA and WEAC outperform about two thirds of the base clusterings w.r.t. the same number of clusters on the benchmark datasets.
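The winning percentage can be sketched as a simple count over all consensus-versus-base comparisons. The `tie_credit` parameter is an assumption: the sketch credits a tie as half a win by default, which may or may not match the exact tie rule used in the paper.

```python
def winning_percentage(consensus_scores, base_scores_per_run, tie_credit=0.5):
    """consensus_scores[r] is the NMI of the consensus clustering in run r;
    base_scores_per_run[r] lists the NMI scores of that run's base clusterings."""
    wins = 0.0
    total = 0
    for consensus, bases in zip(consensus_scores, base_scores_per_run):
        for base in bases:
            total += 1
            if consensus > base:
                wins += 1.0
            elif consensus == base:
                wins += tie_credit  # assumed handling of ties
    return wins / total
```

For instance, with two runs of two base clusterings each, one clear win in the first run plus one tie and one win in the second run give 2.5 wins out of 4 comparisons.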
7.4.2 Comparison with Other Clustering Ensemble Methods
We compare the proposed WEAC and GP-MGLA methods against six different clustering ensemble methods, namely, the hybrid bipartite graph formulation (HBGF) Fern and Brodley (2004), the weighted consensus clustering (WCC) Li and Ding (2008), the evidence accumulation clustering (EAC) Fred and Jain (2005), the ensemble clustering by matrix completion (ECMC) Yi et al. (2012), the SimRank similarity based method (SRS) Iam-On et al. (2008), and the weighted connected-triple method (WCT) Iam-On et al. (2011). Since the ECMC method and the WCC method are very time-consuming (see Fig. 4), it is almost infeasible to run ECMC and WCC 100 times on large datasets such as the Pen Digits and Letters datasets, which contain and instances respectively. Therefore, the ECMC and WCC methods are performed on the benchmark datasets except the Pen Digits and Letters datasets. The other baseline methods are performed on all the benchmark datasets.
The EAC, ECMC, SRS, and WCT methods are four pair-wise similarity based methods, each leading to three sub-methods by utilizing three different agglomerative clustering methods, namely, AL, CL, and SL. Then we have 14 baseline methods, that is, HBGF, WCC, EAC-AL, EAC-CL, EAC-SL, ECMC-AL, ECMC-CL, ECMC-SL, SRS-AL, SRS-CL, SRS-SL, WCT-AL, WCT-CL, and WCT-SL. The average performance over 100 runs of the proposed methods and the 14 baseline methods for each dataset is summarized in Tables 6 and 7. For each test method, the number of clusters for the consensus clustering is set to two values respectively, that is, best- and true-. The best- is the number of clusters that leads to the optimal performance for a method on the dataset. The true- is the number of true classes in the dataset. As shown in Tables 6 and 7, the performance of the WEAC-AL method is better and more stable than that of the other pair-wise similarity based methods. The WEAC-AL method achieves the best NMI scores for the Seeds dataset and nearly the best NMI scores for the Iris, Image Segmentation, Yeast, and Wine datasets. Among the test methods, the GP-MGLA method produces overall the best and most stable clustering results on the benchmark datasets.
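The bookkeeping in this comparison can be sketched in a few lines: enumerating the 14 baselines, and reading off the two reported settings for the number of clusters. The helper names and the dictionary layout (number of clusters mapped to average NMI) are hypothetical and serve only as an illustration.

```python
from itertools import product

def baseline_names():
    """Two stand-alone baselines plus four pair-wise methods times three linkages."""
    pairwise = ["EAC", "ECMC", "SRS", "WCT"]
    linkages = ["AL", "CL", "SL"]
    return ["HBGF", "WCC"] + [f"{m}-{l}" for m, l in product(pairwise, linkages)]

def best_and_true_scores(nmi_by_k, true_k):
    """nmi_by_k maps a number of clusters to the average NMI obtained with it.

    Returns (best number of clusters, score at that number, score at true_k).
    """
    best_k = max(nmi_by_k, key=nmi_by_k.get)
    return best_k, nmi_by_k[best_k], nmi_by_k[true_k]
```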
7.4.3 Dealing with Ill Clusterings
In order to evaluate the robustness of our methods to ill clusterings, we add a certain ratio of heavily imbalanced clusterings into the base clustering pool. For example, adding of ill clusterings into the pool means replacing of base clusterings in the pool with heavily imbalanced clusterings. To produce the heavily imbalanced clusterings, we first partition the dataset into clusters via k-means where is randomly chosen in the interval of . Then we merge a proportion of clusters into one, i.e., randomly chosen clusters will be merged into one cluster in the clustering. In our experiments, the values of are randomly selected in the interval of , which leads to heavily imbalanced clusterings. Different ratios of ill base clusterings are added to the pool and then we conduct experiments on the ensemble of randomly chosen base clusterings from the pool.
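The merging step that produces a heavily imbalanced clustering can be sketched as below. The function name `merge_clusters`, the `merge_ratio` argument, and the rounding rule are assumptions for illustration; the actual interval from which the proportion is drawn is the one specified in the paper.

```python
import random

def merge_clusters(labels, merge_ratio, rng):
    """Merge a randomly chosen proportion of the clusters into a single cluster."""
    cluster_ids = sorted(set(labels))
    # Number of clusters to merge: at least two, at most all of them (assumed rounding).
    n_merge = min(len(cluster_ids), max(2, round(merge_ratio * len(cluster_ids))))
    merged = set(rng.sample(cluster_ids, n_merge))
    target = min(merged)  # the merged cluster reuses one of the original ids
    return [target if lab in merged else lab for lab in labels]
```

Starting from 10 equally sized clusters and merging 80% of them leaves 3 clusters, one of which holds 80% of the instances.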
For each ratio of ill base clusterings, we run each of the clustering ensemble methods 100 times and the performance is summarized in Fig. 3. The average-link is used for each of the pair-wise similarity based methods, namely, WEAC, EAC, ECMC, SRS, and WCT. As can be seen in Fig. 3, the proposed WEAC method yields much better performance than the EAC method on the benchmark datasets. On the whole, the proposed GP-MGLA method yields much better and more robust performance than the other clustering ensemble methods with different ratios of ill base clusterings added.
7.5 Computational Complexity
The computation of the NMI measure between two partitions takes time, where is the number of instances in the dataset. The computation of the NCAI measure takes time, where is the number of base clusterings in the ensemble. The computation of the SACT similarity is , where is the number of clusters in the ensemble and is the average number of neighbors connecting to a cluster. As the conventional EAC method Fred and Jain (2005) is , the time complexity of the proposed WEAC method (associated with average-link) is . The Tcut algorithm for bipartite graph partitioning is , where is the number of clusters in the final consensus clustering and is the average number of links connecting to a node in the graph. Then we have the time complexity of the proposed GP-MGLA method as .
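The two cheapest ingredients of this analysis, NMI between two partitions and the agreement index computed from all pairs of base clusterings, can be sketched as follows. This is a hedged illustration: the NMI uses the geometric-mean normalization of Strehl and Ghosh (2003), and the `ncai` function assumes that the index of a base clustering is its average NMI with the other ensemble members, normalized by the maximum over the ensemble. The authoritative definition is the one given earlier in the paper.

```python
import math
from collections import Counter

def nmi(a, b):
    """Normalized mutual information between two label vectors of equal length."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(
        nij / n * math.log(nij * n / (count_a[i] * count_b[j]))
        for (i, j), nij in joint.items()
    )
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # convention assumed here for single-cluster partitions
    return mi / math.sqrt(h_a * h_b)

def ncai(ensemble):
    """Assumed form: average NMI with the other members, divided by the maximum."""
    m = len(ensemble)
    cai = [
        sum(nmi(ensemble[i], ensemble[j]) for j in range(m) if j != i) / (m - 1)
        for i in range(m)
    ]
    top = max(cai)
    return [c / top for c in cai] if top > 0 else cai
```

Each call to `nmi` makes a single pass over the label vectors to build the counts, and `ncai` calls it once for every ordered pair of base clusterings.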
The proposed methods and the baseline methods are applied to the Letters dataset to test the execution time w.r.t. varying data sizes. The time performance of these test methods with varying data sizes is illustrated in Fig. 4. To process the entire Letters dataset with instances, the time costs (in seconds) of WEAC and GP-MGLA are and respectively, whereas the time costs (in seconds) of HBGF, EAC, and WCT are , , and respectively. In the proposed methods, it takes seconds to compute the NCAI for the data size of . Each of the five pair-wise similarity based methods, namely, WEAC, EAC, ECMC, SRS, and WCT, is associated with average-link. As shown in Fig. 4, the ECMC method and the WCC method are the two slowest methods, and the SRS method is the third slowest. The GP-MGLA method is slower than WEAC, EAC, and WCT when the data size is below . However, GP-MGLA shows an advantage in execution time as the data size grows beyond . The proposed GP-MGLA method and the HBGF method are the two fastest methods when the data size is greater than , mainly due to their efficient graph partitioning algorithms.
8 Conclusion
In this paper, we address the clustering ensemble problem using crowd agreement estimation and multi-granularity link analysis. With the clustering ensemble viewed as a crowd, we assess reliability of the individuals inside it by exploiting the so-called wisdom of the crowd. The normalized crowd agreement index is proposed for evaluating the quality of base clusterings in an unsupervised manner. The source aware connected triple similarity is introduced for constructing the link between two clusters with their common neighbors and source reliability taken into consideration. To achieve the final consensus clustering, two novel consensus functions are further presented, termed weighted evidence accumulation clustering (WEAC) and graph partitioning with multi-granularity link analysis (GP-MGLA) respectively. The experiments conducted on eight real-world datasets show the effectiveness and robustness of the proposed clustering ensemble methods.
Acknowledgments
The authors would like to thank the anonymous reviewers for their insightful comments and suggestions which helped enhance this paper significantly. This work was supported by National Science & Technology Pillar Program (No. 2012BAK16B06), NSFC (61173084) and the GuangDong Program (Grant No. 2012A080104005). The work of Chang-Dong Wang was in part sponsored by CCF-Tencent Open Research Fund and Research Training Program of SMIE of Sun Yat-sen University.
References
- Xu et al. (1993) L. Xu, A. Krzyzak, E. Oja, Rival penalized competitive learning for clustering analysis, RBF net, and curve detection, IEEE Transactions on Neural Networks 4 (4) (1993) 636–649.
- Li et al. (2007) J. Li, S. Ray, B. G. Lindsay, A nonparametric statistical approach to clustering via mode identification, Journal of Machine Learning Research 8 (2007) 1687–1723.
- Zhang and Zhou (2009) M.-L. Zhang, Z.-H. Zhou, Multi-instance clustering with applications to multi-instance prediction, Applied Intelligence 31 (1) (2009) 47–68.
- Zhao et al. (2010) F. Zhao, L. Jiao, H. Liu, X. Gao, M. Gong, Spectral clustering with eigenvector selection based on entropy ranking, Neurocomputing 73 (10-12) (2010) 1704–1717.
- Wang and Lai (2011) C.-D. Wang, J.-H. Lai, Energy based competitive learning, Neurocomputing 74 (12-13) (2011) 2265–2275.
- Li et al. (2011) M. Li, X. C. Lian, J. T. Kwok, B. L. Lu, Time and space efficient spectral clustering via column sampling, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’11), 2011.
- Wang et al. (2013a) C.-D. Wang, J.-H. Lai, C. Y. Suen, J.-Y. Zhu, Multi-exemplar affinity propagation, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (9) (2013a) 2223–2237.
- Wang et al. (2013b) C.-D. Wang, J.-H. Lai, D. Huang, W.-S. Zheng, SVStream: A support vector based algorithm for clustering data streams, IEEE Transactions on Knowledge and Data Engineering 25 (6) (2013b) 1410–1424.
- Wang and Lai (2013) C.-D. Wang, J.-H. Lai, Position regularized support vector domain description, Pattern Recognition 46 (3) (2013) 875–884.
- Jain (2010) A. K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31 (8) (2010) 651–666.
- Vega-Pons and Ruiz-Shulcloper (2011) S. Vega-Pons, J. Ruiz-Shulcloper, A survey of clustering ensemble algorithms, International Journal of Pattern Recognition and Artificial Intelligence 25 (3) (2011) 337–372.
- Strehl and Ghosh (2003) A. Strehl, J. Ghosh, Cluster ensembles: A knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research 3 (2003) 583–617.
- Fern and Brodley (2004) X. Z. Fern, C. E. Brodley, Solving cluster ensemble problems by bipartite graph partitioning, in: Proceedings of the International Conference on Machine Learning (ICML’04), 2004.
- Fred and Jain (2005) A. L. N. Fred, A. K. Jain, Combining multiple clusterings using evidence accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (6) (2005) 835–850.
- Topchy et al. (2005) A. Topchy, A. K. Jain, W. Punch, Clustering ensembles: models of consensus and weak partitions, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (12) (2005) 1866–1881.
- Hadjitodorov et al. (2006) S. T. Hadjitodorov, L. I. Kuncheva, L. P. Todorova, Moderate diversity for better cluster ensembles, Information Fusion 7 (3) (2006) 264–275.
- Li et al. (2007) Y. Li, J. Yu, P. Hao, Z. Li, Clustering ensembles based on normalized edges, in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’07), 2007.
- Iam-On et al. (2008) N. Iam-On, T. Boongoen, S. Garrett, Refining pairwise similarity matrix for cluster ensemble problem with cluster relations, in: Proceedings of the International Conference on Discovery Science (ICDS’08), 2008.
- Domeniconi and Al-Razgan (2009) C. Domeniconi, M. Al-Razgan, Weighted cluster ensembles: Methods and analysis, ACM Transactions on Knowledge Discovery from Data 2 (4) (2009) 1–40.
- Wang et al. (2009) X. Wang, C. Yang, J. Zhou, Clustering aggregation by probability accumulation, Pattern Recognition 42 (5) (2009) 668–675.
- Mimaroglu and Erdil (2011) S. Mimaroglu, E. Erdil, Combining multiple clusterings using similarity graph, Pattern Recognition 44 (3) (2011) 694–703.
- Iam-On et al. (2011) N. Iam-On, T. Boongoen, S. Garrett, C. Price, A link-based approach to the cluster ensemble problem, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (12) (2011) 2396–2409.
- Yi et al. (2012) J. Yi, T. Yang, R. Jin, A. K. Jain, Robust ensemble clustering by matrix completion, in: Proceedings of the IEEE International Conference on Data Mining (ICDM’12), 2012.
- Franek and Jiang (2014) L. Franek, X. Jiang, Ensemble clustering by means of clustering embedding in vector spaces, Pattern Recognition 47 (2) (2014) 833–842.
- Huang et al. (2013) D. Huang, J.-H. Lai, C.-D. Wang, Exploiting the wisdom of crowd: A multi-granularity approach to clustering ensemble, in: Proceedings of the International Conference on Intelligence Science and Big Data Engineering (IScIDE’13), 2013.
- Cristofor and Simovici (2002) D. Cristofor, D. Simovici, Finding median partitions using information-theoretical-based genetic algorithms, Journal of Universal Computer Science 8 (2) (2002) 153–172.
- Weiszfeld and Plastria (2009) E. Weiszfeld, F. Plastria, On the point for which the sum of the distances to n given points is minimum, Annals of Operations Research 167 (1) (2009) 7–41.
- Vega-Pons et al. (2010) S. Vega-Pons, J. Correa-Morris, J. Ruiz-Shulcloper, Weighted partition consensus via kernels, Pattern Recognition 43 (8) (2010) 2712–2724.
- Vega-Pons et al. (2011) S. Vega-Pons, J. Ruiz-Shulcloper, A. Guerra-Gandón, Weighted association based methods for the combination of heterogeneous partitions, Pattern Recognition Letters 32 (16) (2011) 2163–2170.
- Li and Ding (2008) T. Li, C. Ding, Weighted consensus clustering, in: Proceedings of the SIAM International Conference on Data Mining (SDM’08), 2008.
- Fern and Lin (2008) X. Z. Fern, W. Lin, Cluster ensemble selection, Statistical Analysis and Data Mining 1 (3) (2008) 128–141.
- Wu and Chow (2004) S. Wu, T. W. S. Chow, Clustering of the self-organizing map using a clustering validity index based on inter-cluster and intra-cluster density, Pattern Recognition 37 (2) (2004) 175–188.
- Faceli et al. (2009) K. Faceli, M. C. P. de Souto, D. S. A. de Araújo, A. C. P. L. F. de Carvalho, Multi-objective clustering ensemble for gene expression data analysis, Neurocomputing 72 (2009) 2763–2774.
- Li and Latecki (2012) N. Li, L. J. Latecki, Clustering aggregation as maximum-weight independent set, in: Advances in Neural Information Processing Systems (NIPS’12), 2012.
- Surowiecki (2004) J. Surowiecki, The wisdom of crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies and nations, Anchor Books, 2004.
- Levandowsky and Winter (1971) M. Levandowsky, D. Winter, Distance between sets, Nature 234 (1971) 34–35.
- Li et al. (2012) Z. Li, X.-M. Wu, S.-F. Chang, Segmentation using superpixels: A bipartite graph partitioning approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’12), 2012.
- Bache and Lichman (2013) K. Bache, M. Lichman, UCI machine learning repository, 2013.
- Huang et al. (2012) D. Huang, J.-H. Lai, C.-D. Wang, Incremental support vector clustering with outlier detection, in: Proceedings of the International Conference on Pattern Recognition (ICPR’12), 2012.