WoCE: A Framework for Clustering Ensemble by Exploiting the Wisdom of Crowds Theory
Abstract
The Wisdom of Crowds (WOC), a theory from social science, has found a new paradigm in computer science. The WOC theory explains that the aggregate decision made by a group is often better than those of its individual members if specific conditions are satisfied. This paper presents a novel framework for unsupervised and semi-supervised cluster ensemble by exploiting the WOC theory. We employ four conditions of the WOC theory, i.e., diversity, independency, decentralization and aggregation, to guide both the construction of individual clustering results and the final combination for the clustering ensemble. Firstly, the independency criterion, realized as a novel mapping of the raw data set, removes the correlation between features in our proposed method. Then, decentralization, as a novel mechanism, generates high-quality individual clustering results. Next, uniformity, as a new diversity metric, evaluates the generated clustering results. Further, a weighted evidence accumulation clustering method is proposed for the final aggregation without using a thresholding procedure. An experimental study on varied data sets demonstrates that the proposed approach achieves superior performance to state-of-the-art methods.
I Introduction
Clustering, the art of discovering meaningful patterns in unlabeled data sets, is one of the main tasks in machine learning. Semi-supervised clustering is a branch of clustering methods that uses prior supervision information, such as labeled data, known data associations, or pairwise constraints, to aid the clustering process. This paper focuses on pairwise constraints, i.e., pairs of instances known to belong to the same cluster (must-link constraints) or different clusters (cannot-link constraints). Pairwise constraints arise naturally in many real tasks and have been widely used in semi-supervised clustering. There is a wide range of issues in clustering methods. For instance, individual clustering algorithms provide different accuracies on a complex data set because they generate the clustering results by optimizing a local or global function instead of the natural relations between data points [1, 2, 3, 4]. As another example, pairwise constraints often result in highly unstable clustering performance, whereas they have the potential to improve clustering accuracy in practice [5, 6].
As a novel solution, cluster ensemble was proposed for achieving a robust and stable final result by combining different individual clustering results [1]. Cluster Ensemble Selection (CES) is a newer approach which combines a subgroup of individual clustering results. It uses consensus metric(s) for evaluating and selecting the ensemble committee in order to improve the accuracy of the final results [7]. Generally, CES contains four components, i.e., generation, evaluation, selection, and combination. Firstly, individual clustering results are generated by using different kinds of clustering algorithms, or by repeating algorithms that can generate different results on each run, such as k-means. Next, consensus metric(s) such as Normalized Mutual Information (NMI) are employed to evaluate the generated results. After that, the evaluated results are selected by a thresholding procedure. Lastly, the final clustering result is obtained by an aggregation mechanism [7, 8, 9, 10, 11].
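The four components above can be pictured as a generic pipeline. The sketch below is illustrative only: the callables `generators`, `evaluate`, and `combine` are hypothetical placeholders standing in for, e.g., repeated k-means runs, an NMI-based score, and a consensus function.

```python
# A minimal CES skeleton; all component names are hypothetical placeholders.
def cluster_ensemble_selection(X, generators, evaluate, threshold, combine):
    partitions = [generate(X) for generate in generators]    # 1. generation
    scores = [evaluate(p, partitions) for p in partitions]   # 2. evaluation
    selected = [p for p, s in zip(partitions, scores)        # 3. selection
                if s >= threshold]                           #    (thresholding)
    return combine(selected)                                 # 4. combination
```

Any concrete CES method then amounts to a particular choice of these four components.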
There are three challenges in the CES arena: the generation strategy, the evaluation metric(s), and the thresholding procedure. As the first challenge, the strategy of generating the individual clustering results can dramatically affect the performance of CES [12, 13, 14, 15, 16]. There are generally two paradigms: some studies [7, 17, 9, 13] separately run each component of the CES (generate all individual results, then evaluate them, etc.), whereas the rest [18, 12] employ a feedback mechanism, which gradually runs each component of the CES (generating the first individual result, then evaluating it, etc.). On the one hand, the feedback mechanism uses the results evaluated at each step to improve the quality of the generated results in the next steps. Therefore, it can usually provide better performance in comparison with the first paradigm [18, 12]. On the other hand, it may not be compatible with many of the classical structures/metrics in ensemble learning. Evaluation is the next challenge. NMI is one of the most prevalent diversity metrics used in CES because 1) NMI is not sensitive to the clusters' indices [18], 2) it can be easily implemented [8, 7], and 3) it has better time complexity in comparison with other classic methods [19, 7, 17, 9]. The main disadvantage of NMI is the symmetric problem. Indeed, it cannot provide an efficient evaluation when the numbers of instances in distinct clusters are highly different. For instance, consider a clustering analysis for partitioning emails into normal or spam groups, where the number of instances in the normal group is significantly greater than the number of data points in the spam group. Alizadeh et al. [9, 17, 18] proved that NMI evaluates the similarity between these two clusters as 1, while the real similarity is near zero. This issue can rapidly decrease the performance of NMI-based CES methods in big data analysis [17, 9, 18, 12].
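One facet of this symmetric behaviour can be seen with a small hand-rolled sketch (assuming the common geometric-mean normalization of NMI; the label vectors are invented toy data): a two-group labeling and its exact complement, which place every point in opposite groups, still receive the maximal score of 1.

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    """Shannon entropy of a labeling."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def mutual_info(a, b):
    """Mutual information between two labelings of the same points."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * log((c / n) / (pa[i] / n * pb[j] / n))
               for (i, j), c in pab.items())

def nmi(a, b):
    """NMI with the geometric-mean normalization."""
    h = sqrt(entropy(a) * entropy(b))
    return mutual_info(a, b) / h if h > 0 else 0.0

# A "normal vs. spam" style split and its exact complement: the two
# labelings assign every point to opposite groups, yet NMI is maximal.
p = [0, 0, 0, 0, 1]
q = [1, 1, 1, 1, 0]
print(round(nmi(p, q), 6))  # 1.0
```

Because the score cannot separate such complementary, highly unbalanced labelings, metrics in the APMM line of work were proposed as replacements.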
Recently, some studies proposed modified versions of the NMI, such as APMM (Alizadeh-Parvin-Moshki-Minaei) [9] and MAX [17], for solving this problem. Their proposed metrics were utilized for evaluating the diversity between a cluster and a partition. Since using the mentioned metrics for evaluating two partitions increases the time complexity, it is critical to propose a new metric which can directly evaluate the diversity between two partitions. The next challenge in CES is thresholding. In practice, it is hard to find the optimal values of the thresholds, and the performance of CES significantly depends on the threshold values [12].
Most ensemble methods (especially in CES) employ (majority) voting systems [7, 8, 18, 12], such as Boosting and Error-Correcting Output Codes (ECOC) in supervised learning [20] or the Evidence Accumulation Clustering (EAC) method in unsupervised learning [19]. Indeed, the CES framework just provides a voting system for selecting the robust and stable individual results. Voting systems were first defined in social science, where they are used for providing democratic societies, fair trials (in the courts), etc. [21]. There is a wide range of theories in social science which can provide an environment for applying an effective voting system, and they can be used to inspire new algorithms in machine learning. The Wisdom of Crowds (WOC) is one of these theories, which explains a robust approach for generating accurate results in a voting system. It simply claims that decisions made by aggregating the information of groups are better than those made by any single group member, if the four specific conditions of this theory are satisfied, i.e., diversity, independency, decentralization, and aggregation [21, 18]. Indeed, we can find many modern concepts in different sciences which used WOC as a fundamental resource, e.g., the Delphi method in management [21], crowdsourcing/crowdfunding in the market [21], crowd computing [22] in computer networks, etc. In computer science, this theory was used for optimizing resources in wireless sensor networks [22]. Further, there is a wide range of studies in supervised learning [23, 24, 25, 26, 27, 28] and unsupervised learning [12, 18] which use the WOC theory for proposing new approaches. These studies validated that the WOC theory usually leads to better performance and higher stability.
For solving the three mentioned problems in CES, this paper shows that the WOC theory matches the target of cluster ensemble well, and thus its four conditions can be employed to guide the design of the individual clusterings as well as the final ensemble. Based on this observation, we propose a robust framework, called Wisdom of Crowds Cluster Ensemble (WoCE), for both unsupervised and semi-supervised cluster ensemble. Our contributions in this paper can be summarized as follows:

Firstly, a new mapping between the WOC observations and the CES problems is presented. Furthermore, a general framework is proposed based on the WOC theory for generating diverse individual results and using the feedback mechanism to select individual clusterings with high independency and quality. This framework is the first WOC-based approach for semi-supervised clustering.

After that, this paper introduces a novel technique, in terms of mathematically independent random variables, for mapping the data to new dimensions based on the natural correlation of the raw data, which can satisfy the independency criterion in the WOC. This mapping can generate independent features, which increase the performance of the individual clustering algorithms.

Then, to satisfy the decentralization criterion in the WOC, this study uses different numbers of clusters in different kinds of clustering algorithms, which can effectively generate high-quality individual clustering results. Moreover, this paper develops a new method for selecting features based on supervision information in the semi-supervised approach.

Next, to satisfy the diversity criterion in the WOC, this study proposes a new diversity metric called uniformity, based on the APMM criterion [9], for directly evaluating the diversity of two partitions.

Lastly, to satisfy the aggregation mechanism in the WOC, this paper proposes Weighted Evidence Accumulation Clustering (WEAC) to obtain the final clustering with a weighted combination of all individual results. While the weight of each individual result in WEAC can be estimated with different metrics, uniformity is used in this paper.
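The aggregation idea behind such a weighted combination can be sketched, in its first step, as a weighted co-association matrix. This is an illustrative reading only: the per-partition quality weights (e.g., uniformity values) are assumed to be given, and the subsequent consensus step (e.g., hierarchical clustering on the matrix) is omitted.

```python
import numpy as np

def weighted_coassociation(partitions, weights):
    """Weighted co-association matrix: entry (i, j) is the weighted
    fraction of partitions that place points i and j in the same cluster."""
    n = len(partitions[0])
    ca = np.zeros((n, n))
    for labels, w in zip(partitions, weights):
        labels = np.asarray(labels)
        # boolean co-membership matrix of this partition, scaled by its weight
        ca += w * (labels[:, None] == labels[None, :])
    return ca / sum(weights)
```

A consensus partition can then be read off this matrix with, e.g., average-linkage hierarchical clustering.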
The rest of this paper is organized as follows: Section II briefly reviews some related works. Section III introduces the proposed WoCE framework. Experimental results are presented in Section IV; finally, Section V presents conclusions and points out some future works.
II Related Works
II-A The Wisdom of Crowds
Francis Galton was a British scientist who introduced the concept of correlation in statistics. In 1906, he went to the annual West of England Fat Stock and Poultry Exhibition, where local farmers and townspeople gathered to estimate and gamble on the quality of each other's cattle, sheep, pigs, etc. Each animal was shown to the crowd, and people wrote their estimates on tickets; the goal of the gambling was to estimate each animal's weight as closely as possible to its real weight. Galton expected that the average of the tickets' values for each animal would be far from the exact answer, because only a few people (local farmers or experts) knew the right answer. He borrowed all 787 tickets recording the estimates of an ox's weight. While the weight of that ox was 1198 pounds, the average of the estimated values on the tickets was 1197! In 1907, he published the 'Vox Populi' paper in the journal Nature, and mentioned that "the result seems more creditable to the trustworthiness of a democratic judgment than might have been expected". In fact, he understood that each ticket contains two components, i.e., information and error. The errors on the tickets cancel each other out, while the information accumulates. This is the main reason that the average of those tickets was remarkably close to the correct answer, and it is the core idea of the wisdom of crowds theory in social science. Further, this theory is comparable with the jury theorem, which was proposed by Condorcet. Supported by a wide range of examples in business, management, economics, social science, mathematics, etc., Surowiecki introduced the wisdom of crowds as a framework for making optimized decisions in 2004. He proposed four criteria for a wise crowd [21]:
Independency: People's opinions are not determined by the opinions of those around them.
Decentralization: People are able to specialize and draw on local knowledge.
Diversity: Each person has private information, even if it is only an eccentric interpretation of the known facts.
Aggregation: Some mechanism exists for turning private judgments into a collective decision.
There are some examples of unwise crowds in Surowiecki's book, e.g., the Columbia shuttle disaster, bubbles in the stock markets, etc. Further, he mentioned three failures of crowd intelligence; in other words, the wisdom of crowds cannot solve these types of problems. The first is called the ant circular mill, which was introduced by William Beebe. An ant mill is an observed phenomenon in which a group of army ants separated from the main foraging party loses the pheromone track and begins to follow one another, forming a continuously rotating circle. The next is called the Needle in a Haystack: in this type of problem, only a few members of a group know the right answer. The last is called random decisions: in this type of problem, the final result is generated completely independently of the members' decisions. Although the wisdom of crowds cannot solve the mentioned problems, it is employed in different fields of science as a novel theory for solving problems. For instance, it is one of the main references for the Delphi method in management, crowdsourcing and crowdfunding in business, and the problem solving theorem and the central limit theorem in mathematics [21].
II-B Cluster Ensemble
Clustering groups data points into clusters so that members of the same cluster are more similar to each other than to members of other clusters. Semi-supervised clustering uses supervision information to aid the clustering process; this paper focuses on pairwise constraint-based semi-supervised methods. Among constraint-based methods, Liu et al. proposed semi-supervised linear discriminant clustering (Semi-LDC) [29]. Wang et al. introduced a new technique utilizing the constrained pairwise data points and their neighbors, denoted as constraint neighborhood projections, which requires fewer labeled data points (constraints) and can naturally deal with constraint conflicts [30]. Chen et al. recently proposed a clustering algorithm based on graph clustering and optimizing an appropriately weighted objective, where larger weights are given to observations with lower uncertainty [31].
As mentioned before, individual clustering algorithms provide predictions with different accuracy rates, and in practice they may fail to provide accurate and stable results. For solving this problem, cluster ensemble showed that better final results can be generated by combining individual clustering results instead of only choosing the best one [1]. The idea that not all partitions are suitable for cooperating to generate the final clustering was proposed in Cluster Ensemble Selection (CES). This method combines a selected group of the best individual clustering results, chosen from the ensemble committee according to consensus metric(s), in order to improve the accuracy of the final results [7].
There is a wide range of studies in unsupervised cluster ensemble (selection). Vega et al. proposed the Weighted Partition Consensus via Kernels (WPCK) method, which analyzes the set of partitions in the cluster ensemble and extracts valuable information that can improve the quality of the combination process [32]. In another study, Vega et al. developed the Weighted Evidence Accumulation (WEA) algorithm, which computes the weighted association matrix as the first step and then applies a hierarchical clustering algorithm for selecting the consensus partition with the highest lifetime criterion. They also introduced the Generalized Kernel Partition Consensus (GKPC) method, which adds an Information Unification step after the generation step in the methodology of the WPCK method [33]. Jia et al. proposed SIM for diversity measurement, which works based on the NMI [11]. Romano et al. proposed Standardized Mutual Information (SMI) for evaluating clustering results [34]. Yu et al. proposed the Hybrid Clustering Solution Selection (HCSS) strategy, which utilizes a weighting function to combine several feature selection techniques for the refinement of clustering solutions in the ensemble [14]. Based on the Normalized Crowd Agreement Index (NCAI) and multi-granularity information collected among individual clusterings, clusters, and data instances, Huang et al. proposed two novel consensus functions, termed weighted evidence accumulation clustering (WEAC) and graph partitioning with multi-granularity link analysis (GP-MGLA) [35]. Jing et al. introduced a component generation approach for producing ensemble components based on stratified feature sampling [16]. Yu et al. adopted affinity propagation (AP) in different subspaces of the data set for generating a set of individual clusterings [13]. Alizadeh et al. analyzed the disadvantages of NMI as a symmetric criterion. They used the APMM and Maximum (MAX) metrics to measure diversity and stability, respectively, and suggested a new method for building a co-association matrix from a subset of the individual clustering results. While the proposed methods can solve the symmetric problem of the NMI method, they can only combine a subset of the clusters of the generated partitions in the reference set [17, 9]. Yousefnezhad et al. proposed the Weighted Spectral Cluster Ensemble (WSCE) method by exploiting the concepts of community detection and graph-based clustering [12].
Gao et al. introduced a graph-based consensus maximization (BGCM) method for combining multiple supervised and unsupervised models. This method consolidates a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints. Since this research used a classification approach for unsupervised learning, it is sensitive to the quality of the supervision information [36]. Huang et al. extended extreme learning machines (ELMs) to both semi-supervised and unsupervised tasks based on manifold regularization [35]. Anand et al. proposed a semi-supervised framework for kernel mean shift clustering (SKMS) that uses only pairwise constraints to guide the clustering procedure. The data are first mapped to a high-dimensional kernel space where the constraints are imposed by a linear transformation of the mapped points, with the initial kernel matrix obtained by minimizing a LogDet divergence-based objective function [5]. Xiong et al. proposed the Neighborhood-based Framework (NBF) method. This method builds on the concept of neighborhoods, which contain "labeled examples" of different clusters according to the pairwise constraints. Furthermore, it expands the neighborhoods by selecting informative points and querying their relationship with the neighborhoods [6].
One of the biggest challenges in the mentioned methods is that they do not use the achieved errors, i.e., false positives and false negatives, for improving the quality of the final aggregation. As mentioned before, the WOC theory uses both information and errors for increasing the performance of the final result: briefly, pieces of information aggregate with each other, while errors cancel each other out. There are several studies based on the WOC theory in supervised learning, e.g., in recollecting ordering information [25], the rank ordering problem [24], estimating the underlying value (e.g., the class) in image processing [26], underwater mine classification with imperfect labels [27], minimum spanning tree problems [28], and classification ensembles [23]. As the first WOC-based unsupervised CES method, Alizadeh et al. proposed the Wisdom of Crowds Cluster Ensemble (WOCCE). They proposed a new strategy of generating, evaluating, selecting, and combining the individual clustering results based on the WOC theory. The main advantages of WOCCE are using a feedback mechanism for managing errors in each iteration and utilizing the A3 metric (the average of APMM) to avoid the NMI symmetric problem. There are also four disadvantages in the WOCCE method. Firstly, WOCCE needs three distinct kinds of threshold values for generating the final clustering result; further, the performance of WOCCE is dramatically sensitive to the values of these thresholds, and finding the optimal threshold values is hard in real applications. Secondly, the concept of the independency criterion in WOCCE was limited to random initialization points in same-type individual clustering algorithms, whereas, based on the independency definition in the WOC, it can be defined in terms of mathematically independent random variables for all kinds of clustering algorithms. Thirdly, the time complexity of A3 is high because it is the average of the APMM over all existing clusters in a partition. Since APMM is technically designed for comparing the similarity between a partition and a cluster, a wide range of common parts are sequentially repeated in the A3 metric. Lastly, WOCCE is only developed for unsupervised learning, while this framework can also be used for semi-supervised learning [18, 12]. Indeed, this paper introduces a new framework for WOC-based CES to solve the mentioned problems in WOCCE.
III The Proposed Method
III-A Definition
Based on the outlines of the WOC theory [21, 23, 18], the conditions for a crowd to be wise are: diversity, independency, decentralization, and aggregation. Baker et al. [23] and Alizadeh et al. [18] redefined the WOC criteria for supervised learning and unsupervised learning, respectively. In these definitions, they used algorithms, data and results instead of people, information and opinions, respectively. The same structure is utilized in this paper to redefine the criteria for proposing a framework covering both unsupervised and semi-supervised methods. So, our definitions of the WOC criteria are listed as follows:
Independency: The data, which is applied to the clustering models, must have the lowest correlation between its features.
Decentralization: Algorithms are able to specialize their results based on local knowledge.
Diversity: Each algorithm has a private result, even if it is only an eccentric interpretation of the known facts.
Aggregation: Some method exists for combining the private results into a collective decision (final result).
As a whole, it can be stated that WoCE produces final results in four stages. Firstly, the mapping function removes the correlation between the features of the raw data set; this mapping function satisfies the independency criterion. Then, for satisfying the decentralization criterion, this paper applies local knowledge, i.e., the given number of clusters and the supervision information, and employs various kinds of individual clustering algorithms. After these steps, the diversity criterion evaluates the probability of accuracy of the generated clustering results. Finally, an effective aggregation method increases the performance of the proposed method. In the rest of this section, the formulation of the proposed method is discussed, and this paper mentions which WOC criterion is satisfied by each part of the formulation. After that, we briefly summarize the whole algorithm procedure.
III-B Independency
Based on the definitions of the WOC theory, people must decide by using independent information. Hence, people can discover novel patterns, which are utilized to solve complex problems such as selecting the best candidate in a presidential election or finding an elusive engineering problem in NASA's shuttle [21, 18]. In the machine learning arena, this concept can be defined in terms of mathematically independent random variables. In fact, independent features are generated by removing the correlation between the features of the raw data. There are various methods for removing the correlation before applying individual clustering techniques, such as Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA), and these validate that removing the correlation dramatically improves the performance of clustering results [20, 12]. Now, this paper defines the independency criterion by utilizing the concept of correlation. In other words, this paper develops a new branch of the mentioned methods in CES for mapping data to different dimensions with less correlation between the features. In the rest of this section, we show how our proposed method transforms the features of the raw data to stable dimensions with less correlation.
Given a set of data examples $X = \{x_1, x_2, \dots, x_n\}$, the corresponding pairwise must-link constraint set $M = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ belong to the same cluster}\}$, and the pairwise cannot-link constraint set $C = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ belong to different clusters}\}$. The simple average of $X$ can be denoted as follows:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$   (1)

where $n$ is the number of instances in $X$, and $x_i$ denotes the $i$-th instance of the data points. Now, this paper denotes $\tilde{x}_i$ as follows:

$\tilde{x}_i = x_i - \bar{x}$   (2)

where $x_i$ is the $i$-th data point, and $\bar{x}$ denotes the simple average of $X$, which is calculated by (1). It is clear that $\tilde{x}$ is zero-mean. In other words, the expected value of $\tilde{x}$ is zero:

$E[\tilde{x}] = 0$   (3)
Further, this paper defines $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_n] \in \mathbb{R}^{d \times n}$, where $d$ and $n$ denote the number of features and data points, respectively. The main goal of this mapping is to minimize the correlation between features. This problem can be reformulated as the projection of the data onto a unit vector $q$:

$a = q^T \tilde{x}$   (4)

If the correlation (covariance) of $\tilde{x}$ is considered as $R = E[\tilde{x} \tilde{x}^T]$, then the correlation (variance) of the projection $a$ will be defined as follows:

$\sigma^2 = E[a^2] = E[(q^T \tilde{x})(\tilde{x}^T q)] = q^T R q$   (5)

Based on the above definition, the expected value of the $j$-th feature of the projection is denoted as follows:

$E[a_j] = q_j^T E[\tilde{x}] = 0$   (6)

where $j$ denotes the index of the feature. In other words, our correlation problem is changed to a variance problem: maximizing the variance probe $\psi(q) = q^T R q$ over the projection directions $q$ will remove the correlation between features. Since the scale of the data after the mapping must remain the same, we assume the following equation:

$\|q\| = 1 \quad (q^T q = 1)$   (7)

So, our problem will be reformulated as the stationarity condition:

$\psi(q + \delta q) = \psi(q)$   (8)
where the symbol $\delta q$ is an abbreviation for 'a small change in $q$'. Expanding $\psi(q + \delta q)$ to first order, the above condition is denoted as follows:

$\delta q^T R q = 0$   (9)

Based on (7) and (8), applying the unit-length constraint to $q + \delta q$ gives:

$\delta q^T q = 0$   (10)

Now, this paper defines the following equation by using (9) and (10):

$\delta q^T R q = \lambda \, \delta q^T q$   (11)

where $\lambda$ is a constant (a Lagrange multiplier). Since (11) must hold for any admissible $\delta q$, the following equation must be satisfied for minimizing the correlation between features:

$R q = \lambda q$   (12)

where $q$ and $\lambda$ denote an eigenvector and an eigenvalue of $R$, respectively. For all features of $\tilde{X}$, the above equation will be denoted as follows:

$R Q = Q \Lambda$   (13)

which is called the eigenstructure equation. In the above equation, $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_d)$ is a diagonal matrix. Based on (7), we can define the following equation:

$Q^T Q = I$   (14)

where $I$ is the identity matrix. The following equation is denoted based on (13) and (14):

$\psi(q_j) = \lambda_j, \quad j = 1, 2, \dots, d$   (15)
where $d$ denotes the number of features in the data $X$. Now, consider that $\Lambda$ is in descending order based on the eigenvalues ($\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$). For an optional feature selection in our unsupervised approach, we can define the following equation instead of (15):

$\psi(q_j) = \lambda_j, \quad j = 1, 2, \dots, m$   (16)

where $m \leq d$ is the number of features which must be selected for generating the results. Algorithm 1 shows the mapping function, which can generate independent features by minimizing the correlation of the data set. For reducing the time complexity, this paper uses an EM algorithm [37] for estimating the eigenvalues/eigenvectors ($\Lambda$ and $Q$) in Algorithm 1. Please see Section A.5 in [37] for more information.
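The mapping of Algorithm 1 can be sketched in a few NumPy lines. This is an illustrative version using a direct eigendecomposition in place of the EM estimation of [37]; the function name is ours.

```python
import numpy as np

def independency_mapping(X, m=None):
    """Map data to uncorrelated features (a sketch of Algorithm 1).

    X: (n, d) array of raw data; m: number of features to keep (<= d).
    """
    Xc = X - X.mean(axis=0)          # zero-mean data, eq. (2)
    R = (Xc.T @ Xc) / len(Xc)        # covariance matrix R, eq. (5)
    lam, Q = np.linalg.eigh(R)       # eigenstructure R Q = Q Lambda, eq. (13)
    order = np.argsort(lam)[::-1]    # descending eigenvalues
    Q = Q[:, order[:m]]              # optional feature selection, eq. (16)
    return Xc @ Q                    # uncorrelated (independent) features
```

On the mapped data, the sample covariance is diagonal, i.e., the new features are uncorrelated.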
III-C Decentralization
In the WOC theory, the decentralization criterion increases the crowd intelligence, the margin of error, and the quality of the final result [21, 18]. In clustering problems, the same concept is the main reason for using the CES approach to improve the quality of the final result; hence, there is a wide range of quality metrics in the previous CES methods [7, 17, 9]. Based on the WOC theory, this paper uses local knowledge for increasing the quality of the individual clustering results. There are two different kinds of local knowledge in CES, i.e., the number of clusters in unsupervised learning and the supervision information in semi-supervised learning. Moreover, employing different kinds of clustering algorithms can significantly help generate more specialized clustering results because they include different kinds of objective functions [18]. Briefly, this paper applies different kinds of clustering algorithms to the mapped data for generating the individual clustering results in both the unsupervised and semi-supervised versions of the proposed method. Further, these algorithms use different numbers of clusters in a range around $k$, where $k$ denotes the number of clusters in the final result. Since this procedure generates all available kinds of patterns as the reference set, it can increase the robustness of the final results. In addition, this paper develops a new feature selection method based on supervision information for improving the performance of the final result. In the rest of this section, we show how this paper uses supervision information for generating common/local knowledge in the semi-supervised approach.
As mentioned before, our proposed method is based on pairwise constraints, i.e., must-links and cannot-links. This paper denotes the must-link constraint set by $M$ and the cannot-link constraint set by $C$. For generating each individual clustering result, this paper defines the Constraint Projection, which is a set of projective vectors $W = [w_1, w_2, \dots, w_r]$, such that $M$ and $C$ are most faithfully preserved in the transformed low-dimensional representations $Y = W^T X$. That is, examples involved in $M$ should be close while examples involved in $C$ should be far apart in the low-dimensional space. Define the objective function as maximizing $J(w)$ with respect to $w^T w = 1$, where:

$J(w) = \frac{1}{2 n_C} \sum_{(x_i, x_j) \in C} \|w^T x_i - w^T x_j\|^2 - \frac{\beta}{2 n_M} \sum_{(x_i, x_j) \in M} \|w^T x_i - w^T x_j\|^2$   (17)

where $n_C$ and $n_M$ denote the cardinalities of $C$ and $M$, respectively, and $\beta$ is a scaling coefficient. The intuition behind (17) is to make the average distance in the low-dimensional space between examples involved in the cannot-link constraints as large as possible, while distances between examples involved in the must-link constraints are as small as possible. Since the distance between examples in the same cluster is typically smaller than that in different clusters, the scaling parameter $\beta$ is added to balance the contributions of the two terms in (17), and its value can be estimated as follows:

$\beta = \frac{\frac{1}{n_C} \sum_{(x_i, x_j) \in C} \|x_i - x_j\|^2}{\frac{1}{n_M} \sum_{(x_i, x_j) \in M} \|x_i - x_j\|^2}$   (18)
We can also reformulate the objective function in (17) in a more convenient way as follows:

$J(w) = w^T (S_C - \beta S_M) w$   (19)

where $S_C$ and $S_M$ are respectively defined as:

$S_C = \frac{1}{2 n_C} \sum_{(x_i, x_j) \in C} (x_i - x_j)(x_i - x_j)^T$   (20)

$S_M = \frac{1}{2 n_M} \sum_{(x_i, x_j) \in M} (x_i - x_j)(x_i - x_j)^T$   (21)

This paper calls $S_C$ and $S_M$ defined in (20) and (21) the cannot-link scatter matrix and the must-link scatter matrix, respectively, which resemble the concepts of the between-cluster scatter matrix and the within-cluster scatter matrix in linear discriminant analysis (LDA) [20]. The difference lies in that the latter uses cluster labels to generate the scatter matrices, while the former uses pairwise constraints. Obviously, the problem expressed by (19) is a typical eigenproblem, and can be efficiently solved by computing the eigenvectors of $S_C - \beta S_M$ corresponding to the positive eigenvalues. In other words, consider $W$ and $\Lambda$ to be the eigenvectors and eigenvalues of $S_C - \beta S_M$, respectively, ordered descendingly by eigenvalue ($\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$). Also, $W_p = [w_1, w_2, \dots, w_p]$, where $p$ marks the position of the last positive eigenvalue ($\lambda_p > 0 \geq \lambda_{p+1}$). Further, the transformed data set is calculated as follows:

$Y = W_p^T X$   (22)
Algorithm 2 illustrates the transformation algorithm for both the unsupervised and semi-supervised approaches. The transformed data is applied to different kinds of individual clustering algorithms for generating the reference set.
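A compact sketch of the constraint projection follows. It is illustrative only: the trace-ratio estimate of $\beta$ is an assumption standing in for eq. (18) (it equals the ratio of average cannot-link to average must-link squared distances), and the function name is ours.

```python
import numpy as np

def constraint_projection(X, must, cannot):
    """Sketch of the semi-supervised constraint projection, eqs. (17)-(22).

    X: (n, d) data; must/cannot: lists of (i, j) index pairs.
    """
    d = X.shape[1]

    def scatter(pairs):
        # scatter matrix of pairwise differences, eqs. (20) and (21)
        S = np.zeros((d, d))
        for i, j in pairs:
            diff = (X[i] - X[j])[:, None]
            S += diff @ diff.T
        return S / (2 * max(len(pairs), 1))

    S_c, S_m = scatter(cannot), scatter(must)
    # hypothetical stand-in for the beta estimate of eq. (18)
    beta = np.trace(S_c) / max(np.trace(S_m), 1e-12)
    lam, W = np.linalg.eigh(S_c - beta * S_m)   # eigenproblem of eq. (19)
    order = np.argsort(lam)[::-1]               # descending eigenvalues
    keep = order[lam[order] > 0]                # positive eigenvalues only
    return X @ W[:, keep]                       # transformed data, eq. (22)
```

The returned representation keeps only directions where cannot-link pairs spread out more than must-link pairs contract, mirroring the sign condition on the eigenvalues.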
III-D Diversity
Indeed, diversity is a common concept in both the WOC theory and the CES methods. For instance, NMI [19] and APMM [9] are two famous methods for calculating diversity in cluster ensemble (selection). Diversity increases the stability of the final results. As mentioned before, NMI has the symmetric problem: the evaluation of the diversity between two clusters always yields the same result when those clusters are complements of each other. This fault occurs when the number of positive clusters in the considered partition of the reference set is greater than 1 [17, 9, 18]. Although some studies proposed alternative methods such as APMM [9] and MAX [17] for solving this problem, their proposed metrics were utilized for evaluating the diversity between a cluster and a partition. As a result, using the mentioned metrics for evaluating the diversity of two partitions increases the time complexity. In the rest of this section, we first explain how NMI and APMM work. Then, we develop a new metric which can directly evaluate the diversity between two partitions.
Indeed, NMI employs three Shannon entropies to evaluate the similarity between two partitions: the entropy of the instances common to the two partitions in the numerator, and the sum of the entropies of each partition in the denominator [17, 9, 19]. Since NMI is normalized, one minus the NMI value can be considered as the diversity between the partitions. As mentioned before, NMI has the symmetry problem. As an alternative, APMM tried to solve this problem for evaluating the similarity between a cluster of one partition and all clusters of another partition [9]. However, some common parts of APMM must be recomputed when calculating the diversity of two partitions, so using APMM for this purpose increases the time complexity. Further, a simple average was used to aggregate the similarity of all clusters of one partition against all clusters of another [9, 18]. This averaging procedure decreases the robustness of the evaluation because it takes the mean of the similarities between all clusters of the two partitions instead of the maximum similarity (minimum diversity) among them. This paper proposes a new greedy method based on the main idea of APMM: it calculates the diversity between two partitions without recomputing common parts, and it also avoids the averaging procedure.
As mentioned in the previous section, individual clustering results are generated by applying different kinds of clustering algorithms to the transformed data. This paper denotes the generated results as a reference set, as follows:
$\mathbb{P} = \{P_1, P_2, \ldots, P_m\}$ (23)
where $m$ denotes the number of individual clustering results and $P_i$ is the $i$-th partition of the generated results. Now, this paper finds the maximum similarity for each partition by considering the number of all instances in that partition versus the number of instances in each cluster of that partition, as follows:
(24) 
where $P_i$ is a partition from the reference set, $C^j_i$ denotes the $j$-th cluster of partition $P_i$, and $|P_i|$ and $|C^j_i|$ denote the cardinalities of $P_i$ and $C^j_i$, respectively. Furthermore, this paper finds the maximum similarity for each partition by considering the number of instances in each cluster of that partition versus the number of all instances in that partition, as follows:
(25) 
where the notations $P_i$, $C^j_i$, $|P_i|$, and $|C^j_i|$ are defined as in the previous equation. Now, this paper determines the following equation as the maximum similarity between a partition and the other partitions in the reference set:
(26) 
where $\mathbb{P}$ is the reference set and $P_i$ is a partition from it. Also, $P_k$ and $C^j_k$ denote the $k$-th partition of the reference set and the $j$-th cluster of partition $P_k$, respectively. Further, $|P_k|$ and $|C^j_k|$ are the cardinalities of $P_k$ and $C^j_k$, respectively. Now, this paper proposes Uniformity as the diversity of partition $P_i$ versus all partitions of the reference set, as follows:
(27) 
where $\mathbb{P}$ is the reference set (ensemble committee) and $P_i$ denotes a partition from it. Uniformity is normalized to $[0, 1]$. As a greedy metric, Uniformity employs a strict strategy for evaluating the diversity between partition $P_i$ and the other partitions of the ensemble committee: it yields a value near zero for a partition with low diversity, and a value near one for a partition with high diversity. In addition, it avoids recomputing common parts, i.e., equations (25) and (26), in each comparison.
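The exact expressions of (24)-(27) did not survive extraction here, so the following is only a structural sketch of the greedy idea: take the maximum similarity against the committee instead of an average, and return one minus it. The `partition_similarity` used below (best-matching cluster-overlap ratio) is a hypothetical stand-in, not the paper's formula.

```python
import numpy as np

def partition_similarity(a, b):
    # Hypothetical similarity: best-matching cluster overlap ratio
    # (intersection over union), for label vectors a, b over the same
    # instances. Always in [0, 1]; equals 1 for identical partitions.
    best = 0.0
    for ca in np.unique(a):
        for cb in np.unique(b):
            overlap = np.sum((a == ca) & (b == cb))
            union = np.sum(a == ca) + np.sum(b == cb) - overlap
            best = max(best, overlap / union)
    return best

def uniformity(i, reference_set):
    # Greedy diversity of partition i versus the committee: one minus the
    # MAXIMUM similarity to any other partition (no averaging).
    sims = [partition_similarity(reference_set[i], p)
            for k, p in enumerate(reference_set) if k != i]
    return 1.0 - max(sims)
```

A partition that duplicates another member of the committee scores zero, regardless of how dissimilar it is to the remaining members, which is the strictness the text describes.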
III-E Aggregation
In CES, thresholding is used to select among the evaluated individual results. A co-association matrix is then generated by applying a consensus function to the selected results. Lastly, the final result is obtained by applying linkage methods to the co-association matrix; these methods build a dendrogram and cut it according to the number of clusters in the result [19, 18]. In recent years, many papers have used EAC as a high-performance consensus function for combining individual results [19, 9, 18, 8, 7]. For each cell of the co-association matrix, EAC divides the number of clusters shared by a pair of objects by the number of partitions in which that pair is simultaneously present.
Figure 1 illustrates the effect of the EAC equation ($C(i,j) = n_{ij}/m_{ij}$) on the shape of the dendrogram, where $n_{ij}$ represents the number of clusters shared by the objects with indices $(i, j)$, and $m_{ij}$ is the number of partitions in which this pair of instances ($i$ and $j$) is simultaneously present. As a matter of fact, EAC considers the weights of all algorithms' results to be the same. Instead of simply counting these indices, this paper uses the following equation, called Weighted EAC (WEAC), for generating the co-association matrix:
$C(i,j) = w \cdot \dfrac{n_{ij}}{m_{ij}}$ (28)
where $n_{ij}$ and $m_{ij}$ are the same as in the EAC equation, and $w$ is the weight of combining the instances. Although this weight can be defined differently in other applications, this paper uses the average Uniformity of the two algorithms, as follows, for combining individual results:
$w = \dfrac{U_a + U_b}{2}$ (29)
where $U_a$ and $U_b$ are the Uniformity values of the two algorithms that generated the combined results. In other words, with this new mechanism an individual result contributes strongly when both algorithms have high Uniformity values, and its effect is near zero when both algorithms have small Uniformity values. As a result, this paper simply suppresses the effect of low-quality individual results through this mechanism instead of selecting results through thresholding procedures. Further, the final co-association matrix, which is symmetric, is generated from (28) as follows:
$\mathcal{C} = \big[\,C(i,j)\,\big]_{n \times n}$ (30)
where $n$ is the number of data points, and $C(i,j)$ denotes the final aggregation for the $i$-th and $j$-th instances.
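As a structural sketch of this aggregation, the snippet below accumulates a weighted co-association matrix under a simplifying assumption: each partition votes with a single weight (its algorithm's Uniformity score), to which the per-pair averaging of (29) reduces when each result carries one weight. The function name and interface are ours, not the paper's code.

```python
import numpy as np

def weac(partitions, weights):
    """Weighted co-association matrix.

    partitions: list of label vectors (one per individual result)
    weights:    Uniformity score of each result's algorithm (assumed)
    """
    n = len(partitions[0])
    num = np.zeros((n, n))   # weighted co-occurrence votes
    den = np.zeros((n, n))   # number of partitions containing each pair
    for labels, w in zip(partitions, weights):
        labels = np.asarray(labels)
        same = labels[:, None] == labels[None, :]  # pair in same cluster?
        num += w * same
        den += 1.0
    C = num / den
    np.fill_diagonal(C, 1.0)
    return C
```

Low-Uniformity results thus fade out of the matrix instead of being removed by a threshold.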
III-F Summarization and Discussion
Algorithm 3 shows the pseudo-code of the proposed method. In this algorithm, distances are measured with the Euclidean metric. The ClusteringAlgorithm function builds the partitions of the individual clustering results, which are discussed in the next section; the Uniformity function evaluates the individual clustering results using (27), and the evaluated results are then added to the reference set. The WEAC function generates the co-association matrix according to (28), and the AverageLinkage function creates the final ensemble according to the average-linkage method [18].
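The overall flow of Algorithm 3 can be sketched as follows, with two stand-ins made explicit: k-means restarts stand in for the 20 algorithms of Table I, and uniform weights stand in for the Uniformity scores; SciPy's average linkage cuts the dendrogram at $k$ clusters. The helper names are placeholders, not the paper's code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

def woce(X, k, n_members=20, seed=0):
    rng = np.random.RandomState(seed)
    # 1) generate individual results (k-means restarts as a stand-in
    #    for the different algorithms of Table I)
    partitions = [KMeans(n_clusters=k, n_init=1,
                         random_state=rng.randint(0, 10**6)).fit_predict(X)
                  for _ in range(n_members)]
    # 2) weight each result (uniform weights as a stand-in for Uniformity)
    weights = np.ones(n_members)
    # 3) weighted co-association matrix (WEAC)
    n = len(X)
    C = np.zeros((n, n))
    for labels, w in zip(partitions, weights):
        C += w * (labels[:, None] == labels[None, :])
    C /= weights.sum()
    # 4) average linkage on the dissimilarity 1 - C, cut into k clusters
    D = squareform(1.0 - C, checks=False)
    Z = linkage(D, method='average')
    return fcluster(Z, t=k, criterion='maxclust')
```

Swapping step 1 for the Table I algorithms and step 2 for the Uniformity metric recovers the structure of Algorithm 3.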
Three points must be discussed before the empirical studies. Firstly, why does this paper choose the WOC as a framework for the cluster ensemble? As mentioned before, the main reasons for using a cluster ensemble are to increase the performance, stability, and robustness of the final results on clustering problems. As already stated, the wisdom of a crowd is superior to that of a few experts: it has been shown [21, 23, 18] that results made by aggregating the information of groups have better performance, stability, and robustness than those made by any single group member if the WOC criteria are satisfied. Therefore, the cluster ensemble and the WOC are the same solution with the same goals in two different sciences, i.e., machine learning and social science, respectively. Next, what are the common concepts between our proposed criteria in the WOC and previous methods in clustering problems? In fact, diversity exists in clustering under the same title; e.g., NMI and APMM are two famous methods for calculating diversity in cluster ensemble (selection). Diversity increases the stability and robustness of the final results. Further, independency corresponds to the correlation concept in learning methods; this correlation can be defined between the features of the raw data. There are some techniques, e.g., Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), for mapping data to new dimensions without any correlation between features. This paper uses a new branch of these techniques to satisfy the independency criterion, which can increase the performance of the final results. In addition, decentralization guarantees that the quality of the final result is optimized: it uses different individual clustering algorithms, with different objective functions, to generate all possible patterns as the reference set of the cluster ensemble problem.
Moreover, an effective aggregation method can combine the final result without a thresholding procedure. The last question is why all four conditions of the WOC must be satisfied in ensemble learning. Based on the previous question, the proposed method can be defined as a CES method that applies a feature mapping in advance. In practice, all clustering analyses have these steps [12, 18, 36]. Therefore, the WOC framework does not add any new stage to the clustering-analysis pipeline; it merely defines a robust and compulsory structure for an ensemble framework in real-world applications.
IV Experiments
The empirical studies are presented in this section. Unsupervised methods are used to find meaningful patterns in unlabeled data sets such as web documents, while semi-supervised methods employ supervision information to generate more robust and stable final results in real-world applications. Since real data sets do not have class labels, there is no direct evaluation method for estimating the performance of unsupervised or semi-supervised methods. Like many previous studies [7, 17, 9, 18, 36, 5], this paper compares the performance of its proposed method with individual clustering methods and cluster ensemble (selection) methods by using standard data sets and their real classes. Moreover, the supervision information is randomly generated from the real class labels. All algorithms were implemented by the authors in MATLAB R2015a (8.5) on a PC with the following specifications: Apple MacBook Pro, Intel Core i7 CPU (4 × 2.4 GHz), 8 GB RAM, OS X 10.11. All results are reported by averaging over 10 independent runs of the algorithms. Table I lists the individual clustering algorithms used to generate the individual clustering results in our proposed method. Further, the number of individual clustering results for the ensemble methods is set to 20 in the reference set.
No.  Method 

1  K-means 
2  Fuzzy C-means 
3  Median K-flats 
4  Gaussian mixture 
5  Subtractive clustering 
6  Single-linkage Euclidean 
7  Single-linkage Hamming 
8  Single-linkage cosine 
9  Average-linkage Euclidean 
10  Average-linkage Hamming 
11  Average-linkage cosine 
12  Complete-linkage Euclidean 
13  Complete-linkage Hamming 
14  Complete-linkage cosine 
15  Ward-linkage Euclidean 
16  Ward-linkage Hamming 
17  Ward-linkage cosine 
18  Spectral using a sparse similarity matrix 
19  Spectral using the Nyström method with orthogonalization 
20  Spectral using the Nyström method without orthogonalization 
IV-A Data Sets
This paper uses three different groups of data sets for generating the experimental results: image data sets, document data sets, and other UCI data sets. Table II lists the properties of these data sets. This paper uses the USPS digits data set, a collection of grayscale images of natural handwritten digits available from [38]. Furthermore, it utilizes ImageNet [39], MNIST, and CIFAR-10 [40] as three image-based data sets that are mostly employed in deep learning studies [40]. As another image-based alternative, this paper uses the Alzheimer's Disease Neuroimaging Initiative (ADNI) data set for 202 subjects. This data set contains Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) images of the human brain in two categories (shown as C1 and C2 in Tables II and III) for recognizing Alzheimer's disease. In the first category, the subjects are partitioned into three groups: Healthy Control (HC), Mild Cognitive Impairment (MCI), and Alzheimer's Disease (AD). In the second category, there are four groups because MCI is partitioned into high- and low-risk groups (HMCI/LMCI). This paper uses all possible forms of this data set, with only MRI features, only PET features, or all MRI and PET features (FUL), in each of the two categories. More information about ADNI-202 is available in [41]. As a document-based data set, 20 Newsgroups is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. Some of the newsgroups are very closely related to each other, while others are highly unrelated; it has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. As two other document-based data sets, Reuters-21578 [42] and Letters [15] are employed in this paper. The remaining standard data sets are from the UCI repository [43].
This paper has chosen data sets that are as diverse as possible in their numbers of true clusters, features, and samples, because this variety better validates the obtained results. The features of the data sets are normalized to a mean of 0 and a variance of 1, i.e., $x' = (x - \mu)/\sigma$.
Data Set  Instances  Features  Class 

20 Newsgroups  26214  18864  20 
ADNI-MRI-C1  202  93  3 
ADNI-MRI-C2  202  93  4 
ADNI-PET-C1  202  93  3 
ADNI-PET-C2  202  93  4 
ADNI-FUL-C1  202  186  3 
ADNI-FUL-C2  202  186  4 
Arcene  900  10000  2 
Bala. Scale  625  4  3 
Brea. Cancer  286  9  2 
Bupa  345  6  2 
CIFAR-10  5000  1024  10 
CNAE-9  1080  857  9 
Galaxy  323  4  7 
Glass  214  10  6 
Half Ring  400  2  2 
ImageNet  5000  400  5 
Ionosphere  351  34  2 
Iris  150  4  3 
Letters  20000  16  26 
MNIST  70000  784  10 
Optdigit  5620  62  10 
Pendigits  10992  16  10 
Reuters-21578  8293  18933  65 
SA Heart  462  9  2 
Sonar  208  60  2 
Statlog  6435  36  7 
USPS  9298  256  10 
Wine  178  13  2 
Yeast  1484  8  10 
IV-B Performance analysis for unsupervised methods
In this section, the performance (accuracy metric [20]) of the unsupervised version of the proposed method (UWoCE) is analyzed. As mentioned before, the algorithms listed in Table I were employed to generate the individual clustering results. Further, the sets of supervision information (must-links and cannot-links) are empty in this section. The final clustering performance was evaluated by relabeling the obtained clusters against the ground-truth labels and then counting the percentage of correctly classified samples [18, 12]. The results of the proposed method are compared with the full ensemble (EAC) [19] as a baseline, and with WPCK [32], GKPC [33], HCSS [14], GPMGLA [15], and WOCCE [18], which are state-of-the-art cluster ensemble (selection) methods. The performance of the full ensemble (EAC) is reported to demonstrate the effect of selecting the best results rather than combining all generated results with each other. In addition, the performance of WPCK, GKPC, HCSS, and GPMGLA is reported for four weighted clustering ensemble methods. To show the effect of Uniformity on the performance of the final results, it is compared with three state-of-the-art diversity-evaluation metrics (A3 [18], SACT [15], and CA [33]). This paper does not use the optional feature selection in this section.
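The relabeling step above is usually solved as an optimal assignment between predicted clusters and ground-truth classes. A common sketch using SciPy's Hungarian solver follows; the exact procedure of [18, 12] may differ, and the function name is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy between cluster labels and class labels."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # contingency table: rows = clusters, cols = classes
    cont = np.array([[np.sum((y_pred == cl) & (y_true == c))
                      for c in classes] for cl in clusters])
    rows, cols = linear_sum_assignment(-cont)  # maximize matched counts
    return cont[rows, cols].sum() / len(y_true)
```

Each cluster is matched to at most one class, so permuting cluster labels never changes the score.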
The experimental results are given in Table III, where the best result for each data set is highlighted. As shown, the results of EAC illustrate the effect of evaluation and selection in cluster ensemble selection methods: since some of the four conditions of the WOC theory are not satisfied by EAC, this method is a good example of an unwise crowd. According to this table, the proposed algorithm (UWoCE) generates better results than the other individual and ensemble algorithms. Even though the proposed method was outperformed by some algorithms on four data sets (ADNI-MRI-C2, SA Heart, Sonar, and Yeast), the majority of the results demonstrate its superior accuracy; moreover, the difference between the performance of the proposed method and the best result on those data sets is lower than 2%. In addition, WOCCE and the proposed method generate more stable results than the other methods, based on the standard deviations. As mentioned before, this is the effect of the WOC framework.
Data Sets  EAC  WPCK  GKPC  HCSS  GPMGLA  WOCCE  UWoCE 

20 Newsgroups  26.19±0.72  27.01±0.93  28.45±1.02  30.62±0.84  35.47±0.91  32.62±0.52  38.23±0.12 
ADNI-MRI-C1  42.19±0.37  41.24±0.97  43.51±1.02  46.61±0.36  49.36±0.7  48.82±0.37  51.15±0.73 
ADNI-MRI-C2  39.52±0.31  39.95±0.61  40.09±0.51  41.32±0.81  40.72±1.25  42.22±0.44  41.23±0.95 
ADNI-PET-C1  40.38±0.52  40.51±0.26  43.79±1.04  45.3±0.49  48.22±0.71  49.19±0.26  51.17±0.98 
ADNI-PET-C2  38.85±0.59  37.51±0.69  36.58±0.72  41.92±1.18  40.68±0.73  39.43±0.79  42.48±0.67 
ADNI-FUL-C1  44.42±0.91  43.84±0.93  46.56±0.49  49.62±0.81  49.27±0.61  48.82±0.41  50.89±0.83 
ADNI-FUL-C2  47.21±0.63  49.71±0.99  51.26±0.64  52.26±0.66  51.92±0.7  49.39±0.63  53.31±0.97 
Arcene  61.79±0.81  63.92±0.81  63.26±1.04  65.54±0.73  66.32±0.91  65.16±0.32  68.13±0.82 
Bala. Scale  54.09±1.75  55.42±0.94  56.04±0.72  57.41±0.56  56.23±0.94  57.88±0.61  60.64±0.58 
Brea. Cancer  90.17±1.24  81.93±1.92  82.43±1.24  65.51±1.91  72.27±1.06  96.92±0.77  97±0.14 
Bupa  51.73±0.99  57.91±0.82  59.09±0.98  58.33±1.32  58.91±0.51  57.02±0.46  60.83±0.12 
CIFAR-10  51.92±1.24  54.1±0.88  55.52±0.79  56.12±0.91  57.82±0.85  59.37±0.52  62.04±0.32 
CNAE-9  72.41±1.09  75.41±0.69  75.53±0.55  80.63±1.41  81.29±0.81  79.2±0.58  84.12±0.44 
Galaxy  33.12±0.52  30.99±1  32.71±0.84  35.71±0.61  34.72±0.96  34.88±0.81  37.18±0.67 
Glass  50.93±0.18  45.01±2.03  46.57±2.97  52.31±0.68  50.62±0.38  51.82±0.92  57±0.78 
Half Ring  77.53±0.21  82.54±0.93  85.41±0.94  90.53±0.67  89.99±1.02  87.2±0.14  98.11±0.31 
ImageNet  23.53±0.81  32.86±0.42  35.32±0.59  35.04±0.93  33.51±0.83  38.14±0.62  41.67±0.7 
Ionosphere  68.12±0.42  66.52±1.1  67.04±0.79  71.23±0.91  70.9±0.99  70.52±0.15  73.67±0.41 
Iris  73.51±0.82  79.92±1.86  80.39±0.83  85.62±0.82  75.31±0.28  92±0.59  96.3±0.62 
Letters  42.82±0.81  48.95±1.34  47.68±0.98  54.32±0.9  52.19±0.49  53.69±0.73  55.83±0.26 
MNIST  52.18±2.76  55.66±1.41  62.46±0.76  59.92±1.41  67.39±0.97  66.21±0.92  69.72±0.71 
Optdigit  65.92±1.2  70.27±0.84  74.67±0.42  78.99±1.02  76.69±0.72  77.16±0.21  80.56±0.69 
Pendigits  52.88±0.92  55.73±0.75  54.08±0.38  62.82±0.81  60.78±0.95  61.68±0.18  64.13±0.42 
Reuters-21578  62.34±0.72  70.24±0.92  71.82±0.78  74.63±0.87  75.29±0.66  68.85±0.32  76.41±0.24 
SA Heart  66.39±1.62  67.38±1.02  66.53±1.26  70.54±0.93  71.42±0.87  73.7±0.46  72.05±0.16 
Sonar  50.48±0.92  53.84±1.01  53.25±0.51  61.82±0.72  59.12±0.83  54.39±0.25  60.06±0.87 
Statlog  52.28±0.91  55.39±0.75  55.26±0.97  57.33±0.91  56.42±0.92  55.77±0.71  59.76±0.5 
USPS  60.49±0.84  59.42±0.78  62.11±0.37  64.92±1.68  63.08±0.59  65.21±0.69  66.01±0.24 
Wine  70.24±0.72  75.62±1.79  81.25±0.93  79.29±0.51  83.16±0.84  71.34±0.55  89.46±0.14 
Yeast  33.81±0.32  36.23±0.61  35.23±0.72  40.25±0.88  42.03±0.91  37.76±0.26  41.12±0.4 
IV-C Performance analysis for semi-supervised methods
The empirical results of the semi-supervised methods are analyzed in this section. Since most semi-supervised cluster ensemble methods [6, 29, 36] use feature selection based on the supervision information, this paper compares the performance of the semi-supervised methods on the high-dimensional and large-scale data sets of Table II, namely 20 Newsgroups, Letters, and Reuters-21578 as document-based data sets; ADNI, CIFAR-10, ImageNet, MNIST, and USPS as image-based data sets; and Arcene, CNAE-9, Optdigit, and Sonar from the UCI repository. This paper does not use the optional feature selection in this section.
In this paper, 1% to 5% of the instances with class labels are randomly selected to generate the supervision information (one half for must-links and one half for cannot-links); e.g., 1% (2620) of the instances are selected in the 20 Newsgroups data set for generating the pairwise constraints, where 655 must-link and 655 cannot-link constraints are generated from the selected instances. In addition, the supervision information applied to the methods is the same in each independent run for all methods. Notably, this paper does not employ all combinations of the randomly selected instances as pairwise constraints; in other words, each randomly selected instance is used only once, to generate a single must-link or cannot-link. There are two reasons for this strategy. Firstly, it provides better diversity among the generated pairwise constraints. Secondly, it better simulates the real application of prior supervision information: in real-world applications there are no class labels, and generating pairwise constraints from all combinations of the randomly selected instances is impossible or expensive [5, 6, 36]. For instance, consider an interactive image search engine that shows two random images to users in each attempt and asks the users to specify whether the images are the same (must-link) or different (cannot-link); the search engine then improves the clustering results based on these limited feedbacks.
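The sampling scheme above, in which each selected instance participates in exactly one constraint, can be sketched as follows. The paper balances the two halves exactly, whereas this sketch simply labels each sampled pair by class agreement; the function name and interface are ours.

```python
import random

def make_constraints(labels, fraction=0.01, seed=0):
    """Pair up a random sample of labeled instances, each used once,
    splitting the pairs into must-links and cannot-links."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_sel = int(len(labels) * fraction)
    n_sel -= n_sel % 2                     # need an even number of instances
    sel = idx[:n_sel]
    must, cannot = [], []
    for i, j in zip(sel[::2], sel[1::2]):  # consecutive pairs, no reuse
        if labels[i] == labels[j]:
            must.append((i, j))
        else:
            cannot.append((i, j))
    return must, cannot
```

Because no instance is reused, the number of constraints is exactly half the number of sampled instances.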
The final clustering performance (accuracy metric [20]) was evaluated by relabeling the obtained clusters against the ground-truth labels and then counting the percentage of correctly classified samples [18]. Figure 2 illustrates the performance of the proposed method (WoCE) in comparison with RP [44], BGCM [36], NBF [6], and SKMS [5]. In this figure, the standard deviations of the results are lower than 1% (over 10 independent runs). This paper reports the performance of RP as a classical method in semi-supervised cluster ensembles, and of BGCM as a novel graph-based approach in semi-supervised clustering; notably, BGCM has both unsupervised and semi-supervised versions, and this paper uses the semi-supervised one. What is more, SKMS is used as a kernel-based method in semi-supervised clustering. Last but not least, the empirical results of the proposed method are compared with NBF as another heuristic method in semi-supervised cluster ensembles. Even though WoCE was outperformed on one data set (Optdigit) by some algorithms, the majority of the results demonstrate superior accuracy for the proposed method. In addition, the clustering performance of some algorithms in certain panels of Fig. 2 (e.g., (k)) becomes worse as the number of pairwise constraints increases. As mentioned before, pairwise constraints often result in highly unstable clustering performance [5, 6]; these panels are good examples of this issue, where some of the previous methods cannot handle the extra supervision information. In fact, the supervision information makes the individual clustering results unstable and significantly reduces the performance of the mentioned methods. In these cases, our proposed method handles the supervision information by employing the WOC theory, i.e., better data representation (Algorithms 1 and 2), robust individual clustering evaluation (the Uniformity metric), and an effective aggregation mechanism (WEAC).
IV-D Optional feature selection in unsupervised method
In this section, the performance of the proposed method is analyzed using the optional feature selection. Since feature selection is automatically applied through the supervision information on the mapped data set in the semi-supervised version of the proposed method, this section analyzes the unsupervised version (UWoCE). High-dimensional data sets, i.e., Arcene, CIFAR-10, MNIST, and USPS, are employed for this analysis. Figure 3 (a) shows the performance of the proposed method as a function of the percentage of selected features on the different data sets; the vertical axis refers to the performance, and the horizontal axis to the percentage of selected features. As depicted in this figure, the optional feature selection can significantly increase the performance of the final results on high-dimensional data sets. Therefore, this paper suggests using the optional feature selection on high-dimensional data sets to handle feature sparsity. Moreover, this experiment is the reason for using high-dimensional and large-scale data sets in the previous section. Two important questions must be discussed here: what is the difference between the mapping function and the optional feature selection, and where can the optional feature selection be employed? Indeed, the mapping function of Algorithm 1 minimizes the correlation between features: it maps the data to a new domain in which the covariance between different features is near zero. Most of the time, this function maps the data to stable dimensions, which can dramatically improve the accuracy of the final results. It can also be formulated as $\bar{X} = W^{\top}X$, where $W$ satisfies the independency criterion in the WOC theory. In a high-dimensional data set, some of the calculated eigenvalues approach zero.
Since these eigenvalues have a trivial effect on the mapping, they can be omitted to reduce the dimensionality of the data set and the time and space complexities of the clustering analysis; indeed, these eigenvalues may reduce the stability of the final results [12]. With the optional feature selection, the mapping can be formulated as $\bar{X} = \tilde{W}^{\top}X$, where $\tilde{W}$ retains only the eigenvectors whose eigenvalues are not near zero. Therefore, employing this optional feature selection on high-dimensional data sets can improve the stability of the mapping as well as the performance of the final result (see Fig. 3 (a)). In addition, this feature selection is best applied based on the fluctuation of the eigenvalues (removing near-zero values) in each problem.
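A minimal sketch of this eigenvalue-based pruning follows, assuming the transform yields eigenvalue/eigenvector pairs of the feature covariance; the tolerance `tol` is a hypothetical knob, standing in for the per-problem threshold the text describes.

```python
import numpy as np

def prune_mapping(X, tol=1e-8):
    """Drop eigenvectors whose eigenvalues are near zero before mapping."""
    Xc = X - X.mean(axis=0)                 # center the features
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    keep = vals > tol * vals.max()          # discard near-zero eigenvalues
    W = vecs[:, keep]
    return Xc @ W, W
```

On a rank-deficient data set (e.g., a duplicated feature), the redundant direction is dropped automatically, reducing both dimensionality and instability.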
IV-E Time complexity analysis
In this section, the runtimes of both the unsupervised and semi-supervised methods are compared on various data sets, i.e., three large-scale data sets (Letters, MNIST, 20 Newsgroups) and two high-dimensional data sets (Arcene, Reuters-21578). Figure 3 (b) illustrates the relationship between the runtime of the mentioned methods and the size of the data sets; the vertical axis refers to the runtime, and the horizontal axis to the algorithms. As mentioned before, all results in this experiment were generated on a PC with the specifications given above. As depicted in this figure, the runtime of the semi-supervised methods (the first five bars) is higher than that of the unsupervised methods, because they need an additional step to apply the supervision information (mostly in the form of an eigenproblem) [36]. Considering the performance of these methods in Table III and Fig. 2, WoCE (the first bar) and UWoCE (the last bar) generate more efficient results than the other clustering methods. Indeed, the proposed method selects features based on the correlations between the data points and the supervision information (in the semi-supervised approach), so the number of calculations for generating the individual clustering results decreases significantly in comparison with other cluster ensemble methods, while the performance of the final results increases.
There are some technical issues that must be discussed here. Firstly, this paper uses an EM algorithm [37] for estimating the eigenvalues/eigenvectors, which can significantly reduce the time complexity of the mapping function in Algorithm 1. Secondly, the size of the transformed matrix (eq. 22) used for applying the supervision information is limited by the number of pairwise constraints, which is very small in comparison with the number of instances; e.g., in the 20 Newsgroups data set, the size of this matrix for 1% randomly sampled pairwise constraints is $2620 \times 2620$, while the instance similarity matrix is $26214 \times 26214$. Notably, most previous studies, such as SKMS and BGCM, directly use the instance similarity matrix for applying the supervision information. Lastly, this paper uses a modified version of EAC for combining the individual clustering results. EAC applies a linkage method to a simple matrix whose size is the number of algorithms times the number of instances ($m \times n$), where $m \ll n$ in practice. By contrast, some previous studies, such as BGCM, utilize graph methods for combining the individual results, where the size of the graph's adjacency matrix is the square of the number of instances ($n \times n$). Based on these technical issues, the proposed method can significantly increase the performance of the final results while maintaining an acceptable runtime.
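The iterative eigen-estimation mentioned first can be sketched with a simple subspace iteration, a stand-in for the EM algorithm of [37], which similarly recovers only the leading eigenvectors without a full decomposition; the function name and iteration count are ours.

```python
import numpy as np

def top_eigenvectors(S, k, iters=200, seed=0):
    """Leading k eigenpairs of a symmetric PSD matrix by subspace iteration."""
    rng = np.random.RandomState(seed)
    Q, _ = np.linalg.qr(rng.randn(S.shape[0], k))
    for _ in range(iters):
        Q, _ = np.linalg.qr(S @ Q)          # repeatedly apply S, re-orthogonalize
    vals = np.diag(Q.T @ S @ Q)             # Rayleigh quotients ~ eigenvalues
    return vals, Q
```

Each iteration costs one matrix product of size $d \times k$, so for small $k$ this avoids the cubic cost of a full eigendecomposition.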
V Conclusion
In this paper, the wisdom of crowds (WOC) theory from social science was mapped to the clustering ensemble arena. The main advantages of this mapping include two new aspects, i.e., independency and decentralization, for estimating the quality of individual clustering results, and a new framework to investigate them. To satisfy the four conditions of the WOC, this paper incorporates a series of novel strategies for producing individual clustering results as well as obtaining the final ensemble result. Specifically, a mapping function is introduced to enforce independency on the individual clustering results; this function minimizes the correlation between features by using the concepts of expected value and covariance. The decentralization criterion transforms the data from high dimension to low dimension based on pairwise constraints, to preserve the quality of the generated individual clustering results. Further, this paper evaluates the diversity of the individual clustering results with a novel metric called Uniformity. At last, weighted EAC is proposed for the final aggregation. To validate the effectiveness of the proposed approach, an extensive experimental study is performed against multiple state-of-the-art methods on various data sets. In the future, we will develop a new version of Uniformity based on the concept of expected value instead of the APMM.
Acknowledgment
We thank the anonymous reviewers for their comments. This work was supported in part by the National Natural Science Foundation of China (61422204, 61473149, and 61503182), the Jiangsu Natural Science Foundation (BK20130034 and BK2015042628), and the NUAA Fundamental Research Funds (NE2013105).
References
 [1] A. Strehl and J. Ghosh, “Cluster ensembles  a knowledge reuse framework for combining multiple partitions,” Journal of Machine Learning Research, vol. 3, pp. 583–617, 2002.
 [2] A. Topchy, A. Jain, and W. Punch, “Combining multiple weak clusterings,” in IEEE International Conference on Data Mining (ICDM’03), 1922 September 2003, pp. 331–338.
 [3] A. Fred and A. Lourenco, “Cluster ensemble methods: from single clusterings to combined solutions,” Computer Intelligence, vol. 126, pp. 3–30, 2008.
 [4] A. K. Jain, A. Topchy, M. Law, and J. Buhmann, “Landscape of clustering algorithms,” in 17th International Conference on Pattern Recognition, 26 August 2004, pp. 23–26.
 [5] S. Anand, S. Mittal, O. Tuzel, and P. Meer, “Semisupervised kernel mean shift clustering,” IEEE Transaction on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 15–2, 2014.
 [6] S. Xiong, J. Azimi, and X. Fern, “Active learning of constraints for semisupervised clustering,” IEEE Transaction on Knowledge and Data Engineering, vol. 24, no. 1, 2014.
 [7] X. Fern and W. Lin, “Cluster ensemble selection,” in SIAM International Conference on Data Mining (SDM’08), 2426 April 2008, pp. 128–141.
 [8] J. Azimi and X. Fern, “Adaptive cluster ensemble selection,” in 21th International joint conference on artificial intelligence (IJCAI09), 1117 July 2009, pp. 992–997.
 [9] H. Alizadeh, B. MinaeiBidgoli, and H. Parvin, “Cluster ensemble selection based on a new cluster stability measure,” Intelligence Data Analysis (IDA), vol. 18, no. 3, pp. 389–40, 2014.
 [10] L. Limin and F. Xiaoping, “A new selective clustering ensemble algorithm,” in 9th IEEE International Conference on e-Business Engineering, 9–11 September 2012, pp. 45–49.
 [11] J. Jia, X. Xiao, and B. Liu, “Similarity-based spectral clustering ensemble selection,” in 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 29–31 May 2012, pp. 1071–1074.
 [12] M. Yousefnezhad and D. Zhang, “Weighted spectral cluster ensemble,” in IEEE International Conference on Data Mining (ICDM’15), 14–17 November 2015.
 [13] Z. Yu, L. Li, J. Liu, J. Zhang, and G. Han, “Adaptive noise immune cluster ensemble using affinity propagation,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 12, pp. 3176–3189, 2015.
 [14] Z. Yu, L. Li, Y. Gao, J. You, J. Liu, H.S. Wong, and G. Han, “Hybrid clustering solution selection strategy,” Pattern Recognition, vol. 47, no. 10, pp. 3362–3375, 2014.
 [15] D. Huang, J.-H. Lai, and C.-D. Wang, “Combining multiple clusterings via crowd agreement estimation and multi-granularity link analysis,” Neurocomputing, vol. 170, pp. 240–250, 2015.
 [16] L. Jing, K. Tian, and J. Z. Huang, “Stratified feature sampling method for ensemble clustering of high dimensional data,” Pattern Recognition, vol. 48, no. 11, pp. 3688–3702, 2015.
 [17] H. Alizadeh, H. Parvin, and S. Parvin, “A framework for cluster ensemble based on a max metric as cluster evaluator,” International Journal of Computer Science, vol. 39, pp. 1–39, 2012.
 [18] H. Alizadeh, M. Yousefnezhad, and B. MinaeiBidgoli, “Wisdom of crowds cluster ensemble,” Intelligent Data Analysis (IDA), vol. 19, no. 3, 2015.
 [19] A. Fred and A. K. Jain, “Combining multiple clusterings using evidence accumulation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 835–850, 2005.
 [20] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley Longman Publishing Co., Inc.; ISBN: 0321321367, 2005.
 [21] J. Surowiecki, The Wisdom of Crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies and nations. Little, Brown; ISBN: 0316861731, 2004.
 [22] D. Yang, G. Xue, X. Fang, and J. Tang, “Crowdsourcing to smartphones: incentive mechanism design for mobile phone sensing,” in MOBICOM’2012: ACM International Conference on Mobile Computing and Networking, 22–26 August 2012.
 [23] L. Baker and D. Ellison, “The wisdom of crowds — ensembles and modules in environmental modeling,” Geoderma, vol. 147, pp. 1–7, 2008.
 [24] B. Miller, P. Hemmer, M. Steyvers, and M. D. Lee, “The wisdom of crowds in rank ordering problems,” in 9th International Conference on Cognitive Modeling (ICCM’09), 24–26 July 2009, pp. 86–91.
 [25] M. Steyvers, M. Lee, B. Miller, and P. Hemmer, “The wisdom of crowds in the recollection of order information,” Advances in Neural Information Processing Systems, vol. 22, pp. 1785–1793, 2009.
 [26] P. Welinder, S. Branson, S. Belongie, and P. Perona, “The multidimensional wisdom of crowds,” in 24th Conference on Neural Information Processing Systems (NIPS), 6–9 December 2010, pp. 1–9.
 [27] D. P. Williams, “Underwater mine classification with imperfect labels,” in 20th International Conference on Pattern Recognition, August 2010, pp. 4157–4161.
 [28] S. K. M. Yi, M. Steyvers, and M. D. Lee, “Wisdom of the crowds in minimum spanning tree problems,” in 32nd Annual Conference of the Cognitive Science Society, 10–13 August 2010, pp. 31 840–31 845.
 [29] C.-L. Liu, W.-H. Hsaio, C.-H. Lee, and F.-S. Gou, “Semi-supervised linear discriminant clustering,” IEEE Transactions on Cybernetics, vol. 44, no. 7, pp. 989–1000, 2014.
 [30] H. Wang, T. Li, T. Li, and Y. Yang, “Constraint neighborhood projections for semi-supervised clustering,” IEEE Transactions on Cybernetics, vol. 44, no. 5, pp. 636–643, 2014.
 [31] Y. Chen, S. H. Lim, and H. Xu, “Weighted graph clustering with non-uniform uncertainties,” in 31st International Conference on Machine Learning (ICML’14), 21–26 June 2014, pp. 1566–1574.
 [32] S. Vega-Pons, J. Correa-Morris, and J. Ruiz-Shulcloper, “Weighted partition consensus via kernels,” Pattern Recognition, vol. 43, no. 8, pp. 2712–2724, 2010.
 [33] S. Vega-Pons, J. Ruiz-Shulcloper, and A. Guerra-Gandón, “Weighted association based methods for the combination of heterogeneous partitions,” Pattern Recognition Letters, vol. 32, no. 16, pp. 2163–2170, 2011.
 [34] S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “Standardized mutual information for clustering comparisons: One step further in adjustment for chance,” in 31st International Conference on Machine Learning (ICML’14), 21–26 June 2014, pp. 1143–1151.
 [35] G. Huang, S. Song, J. N. D. Gupta, and C. Wu, “Semi-supervised and unsupervised extreme learning machines,” IEEE Transactions on Cybernetics, vol. 44, no. 12, pp. 2405–2417, 2014.
 [36] J. Gao, F. Liang, W. Fan, Y. Sun, and J. Han, “A graph-based consensus maximization approach for combining multiple supervised and unsupervised models,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, pp. 15–2, 2013.
 [37] M. E. Tipping and C. M. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural computation, vol. 11, no. 2, pp. 443–482, 1999.
 [38] S. Roweis. (1998) The world-famous Courant Institute of Mathematical Sciences, Computer Science Department, New York University. [Online]. Available: http://cs.nyu.edu/~roweis/data.html
 [39] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 2169–2178.
 [40] S. Azadi, J. Feng, S. Jegelka, and T. Darrell, “Auxiliary image regularization for deep CNNs with noisy labels,” in International Conference on Learning Representations (ICLR’16), 2016.
 [41] C. Zu and D. Zhang, “Label-alignment-based multi-task feature selection for multimodal classification of brain disease,” in 4th NIPS Workshop on Machine Learning and Interpretation in Neuroimaging (MLINI’14), 13 December 2014.
 [42] D. Cai, X. He, and J. Han, “Locally consistent concept factorization for document clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, pp. 902–913, 2011.
 [43] C. Blake, D. J. Newman, S. Hettich, and C. Merz. (1998) UCI repository of machine learning databases. [Online]. Available: http://www.ics.uci.edu/mlearn/MLSummary.html
 [44] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
Muhammad Yousefnezhad received his B.Sc. and M.Sc. degrees in Computer Hardware Engineering and Information Technology (IT), with a focus on Artificial Intelligence, from Mazandaran University of Science and Technology (MUST), Iran, in 2010 and 2013, respectively. He joined the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics as a Research Assistant for his Ph.D. research in 2014. His main research interest is developing machine learning techniques, particularly in the area of human brain decoding.
Sheng-Jun Huang received the B.Sc. and Ph.D. degrees in computer science from Nanjing University, China, in 2008 and 2014, respectively. He is currently an Associate Professor in the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics. His main research interests include machine learning and pattern recognition. He has won the China Computer Federation (CCF) Outstanding Doctoral Dissertation Award in 2015, the Best Poster Award at KDD’12, the Best Student Paper Award at CCDM’11, and the Microsoft Fellowship Award in 2011.
Daoqiang Zhang received the B.Sc. and Ph.D. degrees in computer science from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 1999 and 2004, respectively. He is currently a Professor in the Department of Computer Science and Engineering, Nanjing University of Aeronautics and Astronautics. His current research interests include machine learning, pattern recognition, and biomedical image analysis. In these areas, he has authored or coauthored more than 100 technical papers in refereed international journals and conference proceedings.