WoCE: A Framework for Clustering Ensemble by Exploiting the Wisdom of Crowds Theory


Muhammad Yousefnezhad, Sheng-Jun Huang, and Daoqiang Zhang are with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu, 210016 China (e-mail: myousefnezhad@nuaa.edu.cn; huangsj@nuaa.edu.cn; dqzhang@nuaa.edu.cn).
Abstract

The Wisdom of Crowds (WOC), a theory from social science, has found a new paradigm in computer science. The WOC theory states that the aggregate decision made by a group is often better than those of its individual members if specific conditions are satisfied. This paper presents a novel framework for unsupervised and semi-supervised cluster ensemble by exploiting the WOC theory. We employ the four conditions of the WOC theory, i.e., diversity, independency, decentralization and aggregation, to guide both the construction of individual clustering results and their final combination for clustering ensemble. Firstly, the independency criterion, realized as a novel mapping of the raw data set, removes the correlation between features in our proposed method. Then, a decentralization mechanism generates high-quality individual clustering results. Next, a new diversity metric, called uniformity, evaluates the generated clustering results. Further, a weighted evidence accumulation clustering method is proposed for the final aggregation without any thresholding procedure. An experimental study on varied data sets demonstrates that the proposed approach achieves superior performance to state-of-the-art methods.

semi-supervised clustering; cluster ensemble; pairwise constraints; the wisdom of crowds.

I Introduction

Clustering, the art of discovering meaningful patterns in unlabeled data sets, is one of the main tasks in machine learning. Semi-supervised clustering is a branch of clustering methods that uses prior supervision information, such as labeled data, known data associations, or pairwise constraints, to aid the clustering process. This paper focuses on pairwise constraints, i.e. pairs of instances known to belong to the same cluster (must-link constraints) or to different clusters (cannot-link constraints). Pairwise constraints arise naturally in many real tasks and have been widely used in semi-supervised clustering. There is a wide range of issues in clustering methods. For instance, individual clustering algorithms provide different accuracies on a complex data set because they generate the clustering results by optimizing a local or global objective function rather than capturing the natural relations between data points [1, 2, 3, 4]. As another example, pairwise constraints often result in highly unstable clustering performance, whereas they have the potential to improve clustering accuracy in practice [5, 6].

As a solution, cluster ensemble was proposed for achieving a robust and stable final result by combining different individual clustering results [1]. Cluster Ensemble Selection (CES) is a newer approach which combines a subgroup of the individual clustering results. It uses consensus metric(s) for evaluating and selecting the ensemble committee in order to improve the accuracy of the final results [7]. Generally, CES contains four components, i.e. generation, evaluation, selection, and combination. Firstly, individual clustering results are generated by using different kinds of clustering algorithms or by repeating algorithms that can produce different results in each run, such as k-means. Next, consensus metric(s) such as Normalized Mutual Information (NMI) are employed to evaluate the generated results. After that, the evaluated results are selected by a thresholding procedure. Lastly, the final clustering result is obtained by an aggregation mechanism [7, 8, 9, 10, 11].

There are three challenges in the CES arena, i.e. the strategy of generation, the metric(s) of evaluation, and the thresholding procedure. As the first challenge, the strategy of generating the individual clustering results can dramatically affect the performance of CES [12, 13, 14, 15, 16]. There are generally two paradigms: some studies [7, 17, 9, 13] run each component of the CES separately (generate all individual results, then evaluate them, etc.), whereas other studies [18, 12] employ a feedback mechanism, which runs the components of the CES incrementally (generating the first individual result, then evaluating it, etc.). On the one hand, the feedback mechanism uses the results evaluated at each step to improve the quality of the results generated in the next steps. Therefore, it can usually provide better performance in comparison with the first paradigm [18, 12]. On the other hand, it may not be compatible with many of the classical structures/metrics in ensemble learning. Evaluation is the next challenge. NMI is one of the most prevalent diversity metrics used in CES because 1) NMI is not sensitive to the clusters' indices [18], 2) it can be easily implemented [8, 7], and 3) it has better time complexity in comparison with other classic methods [19, 7, 17, 9]. The main disadvantage of NMI is its symmetry problem. Indeed, it cannot provide an efficient evaluation when the numbers of instances in distinct clusters are highly different. For instance, consider a clustering analysis for partitioning emails into normal or spam groups, where the number of instances in the normal group is significantly greater than the number of data points in the spam group. Alizadeh et al. [9, 17, 18] proved that NMI evaluates the similarity between these two clusters as equal to 1, while the real similarity is near zero. This issue can rapidly decrease the performance of the NMI-based CES methods in big data analysis [17, 9, 18, 12]. Recently, some studies proposed modified versions of NMI, such as APMM (Alizadeh-Parvin-Moshki-Minaei) [9] and MAX [17], for solving this problem. Their proposed methods were designed for evaluating the diversity between a cluster and a partition. Since using the mentioned methods for evaluating two partitions increases the time complexity, it is critical to propose a new metric which can directly evaluate the diversity between two partitions. The next challenge in CES is thresholding. In practice, it is hard to find optimal threshold values, and the performance of the CES significantly depends on these values [12].

Most ensemble methods (especially in CES) employ (majority) voting systems [7, 8, 18, 12], such as Boosting and Error-Correcting Output Codes (ECOC) in supervised learning [20] or the Evidence Accumulation Clustering (EAC) method in unsupervised learning [19]. Indeed, the CES framework essentially provides a voting system for selecting the robust and stable individual results. Voting systems were first studied in social science, where they are used to support democratic societies, fair trials (in the courts), etc. [21]. There is a wide range of theories in social science which can provide an environment for applying an effective voting system, and they can be used to inspire new algorithms in machine learning. The Wisdom of Crowds (WOC) is one of these theories; it describes a robust approach for generating accurate results in a voting system. It simply claims that decisions made by aggregating the information of groups are better than those made by any single group member if the four specific conditions of this theory are satisfied, i.e. diversity, independency, decentralization, and aggregation [21, 18]. Indeed, we can find many modern concepts in different sciences which use WOC as a fundamental resource, e.g. the Delphi method in management [21], crowdsourcing/crowdfunding in the market [21], crowd computing [22] in computer networks, etc. In computer science, this theory was used for optimizing resources in wireless sensor networks [22]. Further, there is a wide range of studies in supervised learning [23, 24, 25, 26, 27, 28] and unsupervised learning [12, 18] that use the WOC theory for proposing new approaches. These studies validated that the WOC theory usually leads to better performance and higher stability.

To address the three aforementioned problems in CES, this paper shows that the WOC theory matches the target of cluster ensemble well, and thus its four conditions can be employed to guide the design of the individual clusterings as well as the final ensemble. Based on this observation, we propose a robust framework, called Wisdom of crowds Cluster Ensemble (WoCE), for both unsupervised and semi-supervised cluster ensemble. Our contributions can be summarized as follows:

  • Firstly, we establish a new mapping between the WOC conditions and the CES problems. Furthermore, a general framework is proposed based on the WOC theory for generating diverse individual results and using a feedback mechanism to select individual clusterings with high independency and quality. This framework is the first WOC-based approach for semi-supervised clustering.

  • After that, this paper introduces a novel technique, formulated in terms of mathematically independent random variables, for mapping the data to new dimensions based on the natural correlation of the raw data, which satisfies the independency criterion in the WOC. This mapping generates independent features, which increases the performance of individual clustering algorithms.

  • Then, to satisfy the decentralization criterion in the WOC, this study uses different numbers of clusters across different kinds of clustering algorithms, which can effectively generate high-quality individual clustering results. Moreover, this paper develops a new method for selecting features based on supervision information in the semi-supervised approach.

  • Next, to satisfy the diversity criterion in the WOC, this study proposes a new diversity metric called uniformity, which is based on the APMM criterion [9], for directly evaluating the diversity between two partitions.

  • Lastly, to satisfy the aggregation mechanism in the WOC, this paper proposes Weighted Evidence Accumulation Clustering (WEAC) to obtain the final clustering as a weighted combination of all individual results. While the weight of each individual result in WEAC can be estimated with different metrics, uniformity is used in this paper.

The rest of this paper is organized as follows: Section II briefly reviews related work. Section III introduces the proposed WoCE framework. Experimental results are presented in Section IV; finally, Section V presents conclusions and points out some future work.

II Related Works

II-A The Wisdom of Crowds

Francis Galton was a British scientist who introduced the correlation concept in statistics. In 1906, he went to the annual West of England Fat Stock and Poultry Exhibition, where local farmers and townspeople gathered to estimate and wager on the quality of each other's cattle, sheep, pigs, etc. Each animal was shown to the crowd, and people wrote their estimates on tickets. The goal of this contest was to estimate the weight of each animal as closely as possible to its real weight. Galton expected the average of the tickets' values for each animal to be far from the exact answer, because only a few people (local farmers or experts) knew the right answer. He borrowed all 787 tickets that recorded the estimates of an ox's weight. While the weight of that ox was 1198 pounds, the average of the estimated values on the tickets was 1197! In 1907, he published the 'Vox Populi' paper in the journal Nature and noted that "the result seems more creditable to the trustworthiness of a democratic judgment than might have been expected". In fact, he understood that each ticket contains two components, i.e. information and error. The errors in the tickets cancel each other out, while the information accumulates. This is the main reason the average of those tickets was so close to the correct answer. This is the core idea of the wisdom of crowds theory in social science. Further, this theory is comparable with the jury theorem, which was proposed by Condorcet. Supported by a wide range of examples in business, management, economics, social science, mathematics, etc., Surowiecki introduced the wisdom of crowds as a framework for making optimized decisions in 2004. He proposed four criteria for a wise crowd [21]:
Independency People’s opinions are not determined by the opinions of those around them.
Decentralization People are able to specialize and draw on local knowledge.
Diversity Each person has private information, even if it is only an eccentric interpretation of the known facts.
Aggregation Some mechanism exists for turning private judgments into a collective decision.

There are some examples of unwise crowds in Surowiecki's book, e.g. the Columbia shuttle disaster, bubbles in the stock markets, etc. Further, he mentioned three failure modes of crowd intelligence; in other words, the wisdom of crowds cannot solve these types of problems. The first is called the ants' circular mill, which was introduced by William Beebe. An ant mill is an observed phenomenon in which a group of army ants separated from the main foraging party loses the pheromone track and begins to follow one another, forming a continuously rotating circle. The next is called Needle in a Haystack: in this type of problem, just a few members of a group know the right answer. The last is called random decisions: in this type of problem, the final result is generated completely independently of the members' decisions. Although the wisdom of crowds cannot solve the mentioned problems, it is employed in different fields of science as a theory for solving problems. For instance, it is one of the main references for the Delphi method in management, crowdsourcing and crowdfunding in business, the problem solving theorem and the central limit theorem in mathematics, etc. [21].

II-B Cluster Ensemble

Clustering groups data points into clusters so that members of the same cluster are more similar to each other than to members of other clusters. Semi-supervised clustering uses supervision information to aid the clustering process. This paper focuses on pairwise-constraint-based semi-supervised methods. As constraint-based methods: Liu et al. proposed semi-supervised linear discriminant clustering (Semi-LDC) [29]. Wang et al. introduced a new technique that utilizes the constrained pairwise data points and their neighbors, denoted constraint neighborhood projections, which requires fewer labeled data points (constraints) and can naturally deal with constraint conflicts [30]. Chen et al. recently proposed a clustering algorithm which is based on graph clustering and optimizing an appropriately weighted objective, where larger weights are given to observations with lower uncertainty [31].

As mentioned before, individual clustering algorithms provide predictions with different accuracy rates. In practice, individual algorithms may fail to provide accurate and stable results. To solve this problem, cluster ensemble showed that better final results can be generated by combining individual clustering results instead of only choosing the best one [1]. The idea that not all partitions are suitable for cooperating to generate the final clustering was proposed in Cluster Ensemble Selection (CES). This method combines a selected group of the best individual clustering results, according to consensus metric(s), from the ensemble committee in order to improve the accuracy of the final results [7].

There is a wide range of studies in unsupervised cluster ensemble (selection). Vega et al. proposed the Weighted Partition Consensus via Kernels (WPCK) method, which analyzes the set of partitions in the cluster ensemble and extracts valuable information that can improve the quality of the combination process [32]. In another study, Vega et al. developed the Weighted Evidence Accumulation (WEA) algorithm, which first computes the weighted association matrix and then applies a hierarchical clustering algorithm, selecting the consensus partition with the highest lifetime criterion. They also introduced the Generalized Kernel Partition Consensus (GKPC) method, which adds an Information Unification step after the generation step in the methodology of the WPCK method [33]. Jia et al. proposed SIM for diversity measurement, which works based on NMI [11]. Romano et al. proposed Standardized Mutual Information (SMI) for evaluating clustering results [34]. Yu et al. proposed the Hybrid Clustering Solution Selection (HCSS) strategy, which utilizes a weighting function to combine several feature selection techniques for the refinement of clustering solutions in the ensemble [14]. Based on the Normalized Crowd Agreement Index (NCAI) and multi-granularity information collected among individual clusterings, clusters, and data instances, Huang et al. proposed two novel consensus functions, termed weighted evidence accumulation clustering (WEAC) and graph partitioning with multi-granularity link analysis (GP-MGLA) [35]. Jing et al. introduced a component generation approach for producing ensemble components based on stratified feature sampling [16]. Yu et al. adopted affinity propagation (AP) in different subspaces of the data set for generating a set of individual clusterings [13]. Alizadeh et al. analyzed the disadvantages of NMI as a symmetric criterion. They used the APMM and Maximum (MAX) metrics to measure diversity and stability, respectively, and suggested a new method for building a co-association matrix from a subset of the individual cluster results. While the proposed methods can solve the symmetry problem of the NMI method, they can only combine a subset of the clusters of the generated partitions in the reference set [17, 9]. Yousefnezhad et al. proposed the Weighted Spectral Cluster Ensemble (WSCE) method by exploiting the concepts of community detection and graph-based clustering [12].

Gao et al. introduced a graph-based consensus maximization (BGCM) method for combining multiple supervised and unsupervised models. This method consolidates a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints. Since this research used a classification approach for unsupervised learning, it is sensitive to the quality of the supervision information [36]. Huang et al. extended extreme learning machines (ELMs) to both semi-supervised and unsupervised tasks based on manifold regularization [35]. Anand et al. proposed a semi-supervised framework for kernel mean shift clustering (SKMS) that uses only pairwise constraints to guide the clustering procedure. The data are first mapped to a high-dimensional kernel space where the constraints are imposed by a linear transformation of the mapped points, and the initial kernel matrix is learned by minimizing a LogDet divergence-based objective function [5]. Xiong et al. proposed the Neighborhood-based Framework (NBF) method. This method builds on the concept of neighborhood, where neighborhoods contain "labeled examples" of different clusters according to the pairwise constraints. Furthermore, it expands the neighborhoods by selecting informative points and querying their relationship with the neighborhoods [6].

One of the biggest challenges of the mentioned methods is that they do not use the incurred errors, i.e. false positives and false negatives, to improve the quality of the final aggregation. As mentioned before, the WOC theory uses both information and errors for increasing the performance of the final result: briefly, information is aggregated, while errors cancel each other out. There are several studies based on the WOC theory in supervised learning, e.g. in recollecting ordering information [25], the rank ordering problem [24], estimating an underlying value (e.g., the class) in image processing [26], underwater mine classification with imperfect labels [27], minimum spanning tree problems [28], and classification ensembles [23]. As the first WOC-based unsupervised CES method, Alizadeh et al. proposed the Wisdom of Crowds Cluster Ensemble (WOCCE) [18]. They proposed a new strategy of generating, evaluating, selecting, and combining the individual clustering results based on the WOC theory. The main advantages of WOCCE are using a feedback mechanism for managing errors in each iteration and utilizing the A3 metric (the average of APMM) to avoid the NMI symmetry problem. There are also four disadvantages of the WOCCE method. Firstly, WOCCE needs three distinct kinds of threshold values for generating the final clustering result. Further, the performance of WOCCE is dramatically sensitive to the values of the mentioned thresholds, and finding the optimal threshold values is hard in real applications. Secondly, the concept of the independency criterion in WOCCE was limited to random initial points in the same type of individual clustering algorithm, whereas, based on the independency definition in WOC, it can be defined in terms of mathematically independent random variables for all kinds of clustering algorithms. Thirdly, the time complexity of A3 is high because it is the average of the APMM over all clusters in a partition. Since APMM is technically designed for comparing the similarity between a partition and a cluster, there is a wide range of common parts that are sequentially repeated in the A3 metric. Lastly, WOCCE was only developed for unsupervised learning, while this framework can also be used for semi-supervised learning [18, 12]. Indeed, this paper introduces a new framework for WOC-based CES that solves the mentioned problems of WOCCE.

III The Proposed Method

III-A Definition

Based on the outlines of the WOC theory [21, 23, 18], the conditions for a crowd to be wise are: diversity, independency, decentralization, and aggregation. Baker et al. [23] and Alizadeh et al. [18] redefined the WOC criteria for supervised learning and unsupervised learning, respectively. They used algorithms, data, and results instead of people, information, and opinions in the mentioned definitions, respectively. The same structure is utilized in this paper to redefine the criteria for a framework covering both unsupervised and semi-supervised methods. Our definitions of the WOC criteria are listed as follows:
Independency: The data, which is applied to the clustering models, must have the lowest correlation between its features.
Decentralization: Algorithms are able to specialize their results based on local knowledge.
Diversity: Each algorithm has a private result, even if it is only an eccentric interpretation of the known facts.
Aggregation: Some method exists for combining the private results into a collective decision (final result).

As a whole, it can be stated that WoCE produces its final results in four stages. Firstly, the mapping function removes the correlation between the features of the raw data set. This mapping function satisfies the independency criterion. Then, to satisfy the decentralization criterion, this paper applies local knowledge, i.e. the given number of clusters and the supervision information, and employs various kinds of individual clustering algorithms. After these steps, the diversity criterion evaluates the generated clustering results. Finally, an effective aggregation method increases the performance of the proposed method. In the rest of this section, the formulation of the proposed method is discussed, and this paper mentions which WOC criterion is satisfied by each part of the formulation. After that, we briefly summarize the whole algorithm procedure.

III-B Independency

Based on the definitions of the WOC theory, people must decide by using independent information. In this way, people can discover novel patterns, which can be utilized to solve complex problems such as selecting the best candidate in a presidential election or finding an unusual engineering problem in a NASA shuttle [21, 18]. In the machine learning arena, this concept can be defined in terms of mathematically independent random variables. In fact, independent features are generated by removing the correlation between the features of the raw data. There are various methods for removing the correlation before applying individual clustering techniques, such as Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA), and it has been shown that removing the correlation dramatically improves the performance of clustering results [20, 12]. Now, this paper defines the independency criterion by utilizing the concept of correlation. In other words, this paper develops a new branch of the mentioned methods in CES for mapping data to different dimensions with less correlation between its features. In the rest of this section, we show how our proposed method transforms the features of the raw data into stable dimensions with less correlation.

Given a set of data examples $X = \{x_1, x_2, \dots, x_n\}$, the corresponding pairwise must-link constraint set $\mathcal{M} = \{(x_i, x_j)\}$, where $x_i$ and $x_j$ belong to the same cluster, and the pairwise cannot-link constraint set $\mathcal{C} = \{(x_i, x_j)\}$, where $x_i$ and $x_j$ belong to different clusters. The simple average of $X$ can be denoted as follows:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$   (1)

where $n$ is the number of instances in $X$, and $x_i$ denotes the $i$-th instance of the data points. Now, this paper denotes the centered data as follows:

$\hat{x}_i = x_i - \bar{x}$   (2)

where $x_i$ is a data point and $\bar{x}$ denotes the simple average of $X$, which is calculated by (1). It is clear that $\hat{x}_i$ is zero-mean. In other words, the expected value of $\hat{x}$ is zero:

$\mathbb{E}[\hat{x}] = 0$   (3)

Further, this paper defines $\hat{X} = [\hat{x}_1, \hat{x}_2, \dots, \hat{x}_n] \in \mathbb{R}^{m \times n}$, where $m$ and $n$ denote the number of features and data points, respectively. The main goal of the mapping is to minimize the correlation between features. Each mapped feature can be formulated as a projection of the centered data:

$y = q^{T}\hat{x}$   (4)

If the correlation (covariance) of $\hat{X}$ is denoted $\Sigma = \mathbb{E}[\hat{x}\hat{x}^{T}]$, then the correlation of $y$ will be defined as follows:

$\mathbb{E}[y^{2}] = \mathbb{E}[(q^{T}\hat{x})(\hat{x}^{T}q)] = q^{T}\Sigma q$   (5)

Based on the above definition, the expected value of the $j$-th feature of $Y$ is denoted as follows:

$\sigma_{j}^{2} = \mathbb{E}[y_{j}^{2}] = q_{j}^{T}\Sigma q_{j}$   (6)

where $j$ denotes the index of the feature of $Y$. In other words, our correlation problem is changed into a variance problem: finding the projection vectors that extremize the variance of $y$ removes the correlation between the features. Since the scale of the data after the mapping must stay the same, we assume the following constraint:

$\|q\| = (q^{T}q)^{1/2} = 1$   (7)

So, our problem will be reformulated as follows:

$\psi(q + \delta q) = \psi(q), \quad \psi(q) = q^{T}\Sigma q$   (8)

where the symbol $\delta q$ is an abbreviation for 'a small change in $q$'. Substituting $\psi(q) = q^{T}\Sigma q$ and ignoring the second-order term in $\delta q$, the above definition becomes:

$\delta q^{T}\Sigma q = 0$   (9)

Based on (7) and (8), the perturbed vector must keep unit length, so we can assume the following:

$\delta q^{T}q = 0$   (10)

Now, this paper defines the following equation by combining (9) and (10):

$\delta q^{T}\Sigma q - \lambda\,\delta q^{T}q = \delta q^{T}\left(\Sigma q - \lambda q\right) = 0$   (11)

where $\lambda$ is a constant. Since (11) must be satisfied for any admissible perturbation $\delta q$, the following equation must hold for minimizing the correlation between features:

$\Sigma q = \lambda q$   (12)

where $q$ and $\lambda$ denote an eigenvector and an eigenvalue of $\Sigma$, respectively. Collecting the eigenvectors in $R = [q_1, q_2, \dots, q_m]$ and the eigenvalues in $\Lambda$, the above equation for all features will be denoted as follows:

$\Sigma R = R\Lambda$   (13)

which is called the eigenstructure equation. In the above equation, $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m)$ is a diagonal matrix. Based on (7), we can define the following equation:

$R^{T}R = I$   (14)

where $I$ is the identity matrix. The mapped data set is then denoted based on (13) and (14) as follows:

$Y = R^{T}\hat{X}$   (15)

where $m$ denotes the number of features in the data $X$. Now, consider that $R$ is sorted in descending order based on the values of $\Lambda$. For an optional feature selection in our unsupervised approach, we can define the following equation instead of (15):

$Y = R_{d}^{T}\hat{X}, \quad R_{d} = [q_1, q_2, \dots, q_d]$   (16)

where $d$ is the number of features which must be selected for generating the results. Algorithm 1 shows the mapping function, which generates independent features by minimizing the correlation of the data set. To reduce the time complexity, this paper uses an EM algorithm [37] for estimating the eigenvalues/eigenvectors ($\Lambda$ and $R$) in Algorithm 1. Please see Section A.5 in [37] for more information.

  Input: Data set $X$; number of features $d$ ($d = 0$ is considered for deactivating the feature selection)
  Output: Mapped data set $Y$
  Method: 1. Calculate the simple average $\bar{x}$ by using (1). 2. Calculate $\hat{X}$ by using (2). 3. Generate $\Sigma = \mathbb{E}[\hat{x}\hat{x}^{T}]$. 4. Calculate the eigenvalues/vectors ($\Lambda$ and $R$) of $\Sigma$ by [37]. 5. Sort $R$ based on descending values of $\Lambda$. 6. if $d$ is not zero ($d \neq 0$) then select the first $d$ eigenvectors $R_d$ and set $Y = R_{d}^{T}\hat{X}$, else $Y = R^{T}\hat{X}$. end if 7. Return $Y$.
Algorithm 1 The Mapping Function
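To make the mapping step concrete, the following Python/NumPy sketch implements the procedure of Algorithm 1 under the notation above. It is a minimal illustration: it uses a direct eigendecomposition instead of the EM estimation of [37], and the function name, row-wise data layout, and toy example are our own assumptions rather than the authors' implementation.

```python
import numpy as np

def mapping_function(X, d=0):
    """Decorrelating mapping sketched from Algorithm 1.

    X : (n_samples, n_features) raw data matrix.
    d : number of features to keep; d = 0 deactivates the feature selection.
    Returns the mapped data Y with (near-)uncorrelated features.
    """
    x_bar = X.mean(axis=0)                      # Eq. (1): simple average
    X_hat = X - x_bar                           # Eq. (2): zero-mean data
    sigma = np.cov(X_hat, rowvar=False)         # covariance of the centered data
    eigvals, eigvecs = np.linalg.eigh(sigma)    # Eq. (13): eigenstructure (direct solver here)
    order = np.argsort(eigvals)[::-1]           # sort eigenvectors by descending eigenvalue
    eigvecs = eigvecs[:, order]
    if d > 0:                                   # Eq. (16): optional feature selection
        eigvecs = eigvecs[:, :d]
    return X_hat @ eigvecs                      # Eq. (15): Y = R^T X_hat in row-wise form

# Example: decorrelate a toy data set and keep the top two dimensions
Y = mapping_function(np.random.rand(100, 5), d=2)
```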

III-C Decentralization

In the WOC theory, the decentralization criterion increases the crowd intelligence and improves the margin of error and the quality of the final result [21, 18]. In clustering problems, the same concept is the main reason for using the CES approach to improve the quality of the final result, and there is a wide range of quality metrics in previous CES methods [7, 17, 9]. Based on the WOC theory, this paper uses local knowledge for increasing the quality of the individual clustering results. There are two different kinds of local knowledge in CES, i.e. the number of clusters in unsupervised learning and the supervision information in semi-supervised learning. Moreover, employing different kinds of clustering algorithms can significantly help to generate more specialized clustering results because they involve different kinds of objective functions [18]. Briefly, this paper applies different kinds of clustering algorithms to the mapped data for generating the individual clustering results in both the unsupervised and semi-supervised versions of the proposed method. Further, these algorithms use different numbers of clusters in a range determined by $k$, where $k$ denotes the number of clusters in the final result. Since this procedure generates all available kinds of patterns in the reference set, it increases the robustness of the final results. In addition, this paper develops a new feature selection method based on the supervision information for improving the performance of the final result. In the rest of this section, we show how this paper uses the supervision information for generating common/local knowledge in the semi-supervised approach.

As mentioned before, our proposed method is based on pairwise constraints, i.e. must-links and cannot-links. This paper denotes the must-link constraint set by $\mathcal{M}$ and the cannot-link constraint set by $\mathcal{C}$. For generating each individual clustering result, this paper defines Constraint Projection, which finds a set of projective vectors $W = [w_1, w_2, \dots, w_p]$ such that $\mathcal{M}$ and $\mathcal{C}$ are most faithfully preserved in the transformed low-dimensional representations $W^{T}y_i$. That is, examples involved in $\mathcal{M}$ should be close, while examples involved in $\mathcal{C}$ should be far apart in the low-dimensional space. Define the objective function as maximizing $J(w)$ with respect to $w^{T}w = 1$, where:

$J(w) = \frac{1}{2|\mathcal{C}|}\sum_{(y_i, y_j)\in\mathcal{C}}\left(w^{T}y_i - w^{T}y_j\right)^{2} - \frac{\beta}{2|\mathcal{M}|}\sum_{(y_i, y_j)\in\mathcal{M}}\left(w^{T}y_i - w^{T}y_j\right)^{2}$   (17)

where $|\mathcal{C}|$ and $|\mathcal{M}|$ denote the cardinalities of $\mathcal{C}$ and $\mathcal{M}$, respectively, and $\beta$ is a scaling coefficient. The intuition behind (17) is to make the average distance in the low-dimensional space between examples involved in cannot-link constraints as large as possible, while keeping the distances between examples involved in must-link constraints as small as possible. Since the distance between examples in the same cluster is typically smaller than that between examples in different clusters, the scaling parameter $\beta$ is added to balance the contributions of the two terms in (17), and its value can be estimated as follows:

(18)

We can also reformulate the objective function in (17) in a more convenient way as follows:

$J(w) = w^{T}\left(S_{\mathcal{C}} - \beta S_{\mathcal{M}}\right)w$   (19)

where $S_{\mathcal{C}}$ and $S_{\mathcal{M}}$ are respectively defined as:

$S_{\mathcal{C}} = \frac{1}{2|\mathcal{C}|}\sum_{(y_i, y_j)\in\mathcal{C}}\left(y_i - y_j\right)\left(y_i - y_j\right)^{T}$   (20)
$S_{\mathcal{M}} = \frac{1}{2|\mathcal{M}|}\sum_{(y_i, y_j)\in\mathcal{M}}\left(y_i - y_j\right)\left(y_i - y_j\right)^{T}$   (21)

This paper calls $S_{\mathcal{C}}$ and $S_{\mathcal{M}}$, defined in (20) and (21), the cannot-link scatter matrix and the must-link scatter matrix, respectively; they resemble the concepts of the between-cluster scatter matrix and the within-cluster scatter matrix in linear discriminant analysis (LDA) [20]. The difference lies in that the latter uses cluster labels to generate the scatter matrices, while the former uses pairwise constraints. Obviously, the problem expressed by (19) is a typical eigen-problem, which can be efficiently solved by computing the eigenvectors of $S_{\mathcal{C}} - \beta S_{\mathcal{M}}$ corresponding to the positive eigenvalues. In other words, consider that $w_1, w_2, \dots, w_m$ and $\gamma_1 \geq \gamma_2 \geq \dots \geq \gamma_m$ are the eigenvectors and eigenvalues of $S_{\mathcal{C}} - \beta S_{\mathcal{M}}$, respectively, sorted in descending order of the eigenvalues. Also, let $W_p = [w_1, \dots, w_p]$, where $p$ is the number of positive eigenvalues ($\gamma_i > 0$ for $i \leq p$). Further, the transformed data set is calculated as follows:

$\hat{Y} = W_{p}^{T}Y$   (22)

Algorithm 2 illustrates the transformation algorithm for both unsupervised and semi-supervised approaches. The transformed data is applied to different kinds of individual clustering algorithms for generating the reference set.

  Input: Data set $X$; must-links $\mathcal{M}$ and cannot-links $\mathcal{C}$ (as supervision information); number of features $d$ (as optional feature selection)
  Output: Mapped data set $\hat{Y}$
  Method: 1. Generate $Y$ by using Algorithm 1 with $X$ and $d$. 2. if $\mathcal{M}$ and $\mathcal{C}$ are empty then return $Y$ end if 3. Generate $\beta$, $S_{\mathcal{C}}$, $S_{\mathcal{M}}$ by using $Y$, (18), (20), and (21). 4. Calculate the eigenvalues and eigenvectors of $S_{\mathcal{C}} - \beta S_{\mathcal{M}}$. 5. Calculate $\hat{Y}$ by using (22) based on $W_p$. 6. Return $\hat{Y}$.
Algorithm 2 The Transformation Algorithm
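As an illustration of Algorithm 2, the following Python sketch builds the two scatter matrices from the constraints and projects the mapped data onto the directions with positive eigenvalues. Since the estimate of the scaling coefficient in (18) is not reproduced above, beta is exposed as an explicit parameter here; the function name and toy example are illustrative assumptions, not the authors' code.

```python
import numpy as np

def constraint_projection(Y, must_links, cannot_links, beta=1.0):
    """Constraint-projection sketch following steps 3-5 of Algorithm 2.

    Y            : (n_samples, n_features) data already mapped by Algorithm 1.
    must_links   : list of index pairs (i, j) known to be in the same cluster.
    cannot_links : list of index pairs (i, j) known to be in different clusters.
    beta         : scaling coefficient balancing the two terms of Eq. (17).
    """
    def scatter(pairs):
        # Eqs. (20)/(21): averaged outer products of pairwise differences
        S = np.zeros((Y.shape[1], Y.shape[1]))
        for i, j in pairs:
            diff = (Y[i] - Y[j]).reshape(-1, 1)
            S += diff @ diff.T
        return S / (2 * len(pairs))

    S_c = scatter(cannot_links)                  # cannot-link scatter matrix
    S_m = scatter(must_links)                    # must-link scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S_c - beta * S_m)
    keep = eigvals > 0                           # keep directions with positive eigenvalues
    order = np.argsort(eigvals[keep])[::-1]
    W = eigvecs[:, keep][:, order]
    return Y @ W                                 # Eq. (22): transformed data set

# Example with a toy mapped data set and two constraints
Y = np.random.randn(100, 3)
Y_hat = constraint_projection(Y, must_links=[(0, 1)], cannot_links=[(0, 2)])
```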

III-D Diversity

Indeed, diversity is a common concept in both the WOC theory and the CES methods. For instance, NMI [19] and APMM [9] are two well-known methods for calculating diversity in cluster ensemble (selection). Diversity increases the stability of the final results. As mentioned before, NMI has a symmetry problem. This problem causes the evaluation of the diversity between two clusters to always give the same result when those clusters are complements of each other. This fault occurs when the number of positive clusters in the considered partition of the reference set is greater than one [17, 9, 18]. Although some studies proposed alternative methods such as APMM [9] and MAX [17] for solving this problem, their proposed methods were designed for evaluating the diversity between a cluster and a partition. As a result, using the mentioned methods for evaluating the diversity of two partitions increases the time complexity. In the rest of this section, we first explain how NMI and APMM work. Then, we develop a new metric which can directly evaluate the diversity between two partitions.

Indeed, NMI employs three different Shannon entropies for evaluating the similarity between two partitions. Since NMI is normalized, $1 - \mathrm{NMI}$ is considered as the diversity between the mentioned partitions. NMI uses the entropy of the instances shared between the two partitions as the numerator, and the sum of the entropies of the two partitions as the denominator [17, 9, 19]. As mentioned before, NMI has a symmetry problem. As an alternative, APMM tried to solve the mentioned problem by evaluating the similarity between a cluster (from one partition) and all clusters of another partition [9]. Since some common parts of APMM must be repeated for calculating the diversity of two partitions, using APMM for evaluating the diversity of two partitions increases the time complexity. Further, a simple average was utilized for calculating the diversity between all clusters of one partition and all clusters of another partition [9, 18]. This averaging procedure decreases the robustness of the achieved evaluation because it takes the mean of the similarities between all clusters of the two partitions instead of calculating the maximum similarity (minimum diversity) among them. This paper proposes a new greedy method based on the main idea of APMM. It can calculate the diversity between two partitions without repeating the common parts, and it also avoids using the averaging procedure.

As mentioned in the previous section, the individual clustering results are generated by applying different kinds of clustering algorithms to the transformed data. This paper denotes the generated results as a reference set as follows:

$P = \{P^{1}, P^{2}, \dots, P^{M}\}$   (23)

where $M$ denotes the number of individual clustering results and $P^{r}$ is the $r$-th partition of the generated results. Now, this paper finds the maximum similarity for each partition by considering the number of all instances in that partition versus the number of instances in each cluster of that partition as follows:

(24)

where $P^{r}$ is a partition from the reference set; $C_{i}^{r}$ denotes the $i$-th cluster of partition $P^{r}$; and $|P^{r}|$ and $|C_{i}^{r}|$ denote the cardinalities of $P^{r}$ and $C_{i}^{r}$, respectively. Furthermore, this paper finds the maximum similarity for each partition by considering the number of instances in each cluster of that partition versus the number of all instances in that partition as follows:

(25)

where the notations $P^{r}$, $C_{i}^{r}$, $|P^{r}|$, and $|C_{i}^{r}|$ are defined as in the previous equation. Now, this paper determines the following equation as the maximum similarity between a partition and the other partitions in the reference set:

(26)

where $P$ and $P^{r}$ are the reference set and a partition from the reference set, respectively. Also, $P^{s}$ and $C_{j}^{s}$ denote the $s$-th partition from the reference set and the $j$-th cluster of partition $P^{s}$, respectively. Further, $|P^{s}|$ and $|C_{j}^{s}|$ are the cardinalities of $P^{s}$ and $C_{j}^{s}$, respectively. Now, this paper proposes the Uniformity of partition $P^{r}$ as its diversity versus all partitions of the reference set as follows:

(27)

where $P$ is the reference set (ensemble committee), and $P^{r}$ denotes a partition from the reference set. Uniformity is normalized to $[0, 1]$. As a greedy metric, Uniformity employs a strict strategy for evaluating the diversity between partition $P^{r}$ and the other partitions of the ensemble committee. In other words, Uniformity yields a value near zero for a partition with low diversity, and a value near one for a partition with high diversity. In addition, it avoids repeating the common parts, i.e. equations (25) and (26), when evaluating the diversity in each comparison.

III-E Aggregation

Thresholding is used for selecting the evaluated individual results in CES. Then, a co-association matrix is generated by applying a consensus function to the selected results. Lastly, the final result is generated by applying linkage methods to the co-association matrix. These methods generate a dendrogram and cut it based on the number of clusters in the final result [19, 18]. In recent years, many papers have used EAC as a high-performance consensus function for combining individual results [19, 9, 18, 8, 7]. To generate each cell of the co-association matrix, EAC divides the number of clusters shared by a pair of objects by the number of partitions in which this pair of objects is simultaneously present.

Fig. 1: In the traditional EAC, $n_{ij}$ represents the number of clusters shared by the objects with indices $(i, j)$, and $m_{ij}$ is the number of partitions in which this pair of instances ($i$ and $j$) is simultaneously present. This method assumes that the weights of all individual clustering results are the same. This paper proposes Weighted EAC, which improves this method by using a weight for each individual clustering result instead of just counting the shared clusters. While the weight can be defined differently in other applications, this paper uses the average of the Uniformity of two partitions as the weight in WEAC.

Figure 1 illustrates the effect of the EAC equation, $C(i, j) = \frac{n_{ij}}{m_{ij}}$, on the shape of the dendrogram, where $n_{ij}$ represents the number of clusters shared by the objects with indices $(i, j)$, and $m_{ij}$ is the number of partitions in which this pair of instances ($i$ and $j$) is simultaneously present. As a matter of fact, EAC assumes that the weights of all algorithms' results are the same. Instead of simply counting these indices, this paper uses the following equation, called Weighted EAC (WEAC), for generating the co-association matrix:

(28)

where $n_{ij}$ and $m_{ij}$ are the same as in the EAC equation, and $w$ is the weight used for combining the instances. Although this weight can be defined differently in other applications, this paper uses the average of the Uniformity values of the two algorithms that produced the combined results:

$w = \frac{U_{a} + U_{b}}{2}$   (29)

where $U_{a}$ and $U_{b}$ are the uniformities of the algorithms that generated the results for indices $i$ and $j$. In other words, as a new mechanism, this paper generates effective results when both algorithms have high Uniformity values, while the effect of the individual results is near zero when both algorithms have small Uniformity values. As a result, this mechanism suppresses the effect of low-quality individual results instead of selecting results through a thresholding procedure. Further, the final co-association matrix, which is a symmetric matrix, is generated by (28) as follows:

$A = \left[C(i, j)\right]_{n \times n}$   (30)

where $n$ is the number of data points, and $C(i, j)$ denotes the final aggregation value for the $i$-th and $j$-th instances.
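Because the exact form of (28) is not reproduced above, the following Python sketch shows one simplified instantiation of the weighted co-association idea: every partition casts co-membership votes that are scaled by a per-result weight (e.g. its Uniformity score), so low-quality members are damped without any thresholding. The function name and the per-partition weighting are our own illustrative assumptions rather than the paper's exact equations (28)-(29).

```python
import numpy as np

def weighted_coassociation(partitions, weights):
    """Simplified WEAC-style weighted co-association matrix.

    partitions : list of label vectors, one per individual clustering result.
    weights    : one weight per partition (e.g. its Uniformity score).
    """
    n = len(partitions[0])
    A = np.zeros((n, n))
    for labels, w in zip(partitions, weights):
        labels = np.asarray(labels)
        same = (labels[:, None] == labels[None, :]).astype(float)  # co-membership votes
        A += w * same
    return A / sum(weights)                                        # normalize into [0, 1]

# Example: three partitions of five points, the last one with a low quality weight
A = weighted_coassociation(
    partitions=[[0, 0, 1, 1, 1], [0, 0, 0, 1, 1], [0, 1, 0, 1, 0]],
    weights=[0.9, 0.8, 0.1],
)
```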

  Input: Data set $X$, must-links $\mathcal{M}$, cannot-links $\mathcal{C}$, number of clusters $k$, number of selected features $d$ (default $d = 0$)
  Output: $P^{*}$ as the partition of the data set into $k$ clusters
  Method: 1. Initialize an empty Reference-Set $P$. 2. Generate $\hat{Y}$ by applying Algorithm 2 to ($X$, $\mathcal{M}$, $\mathcal{C}$, $d$). foreach individual clustering algorithm do 3. $P^{r}$ = Clustering-Algorithm($\hat{Y}$, $k$). 4. diversity = Uniformity($P^{r}$, $P$). 5. Add ($P^{r}$, diversity) to $P$. end foreach 6. $A$ = WEAC($P$) 7. Dendrogram = Average-Linkage($A$) 8. Final-Result = Cluster(Dendrogram, $k$)
Algorithm 3 The WoCE algorithm

III-F Summarization and Discussion

Algorithm 3 shows the pseudo code of the proposed method. In this algorithm, the distances are measured by a Euclidean metric. The Clustering-Algorithm function builds the partitions of the individual clustering results, which will be discussed in the next section; the Uniformity function evaluates the individual clustering results by using (27). Then, the evaluated results are added to the reference set. The WEAC function generates the co-association matrix according to (28). The Average-Linkage function creates the final ensemble according to the Average Linkage method [18].
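To tie the stages together, here is a compact Python sketch of the flow of Algorithm 3, assuming the data have already been transformed by Algorithm 2. It reuses the weighted_coassociation sketch above, substitutes two scikit-learn clusterers for the twenty base algorithms of Table I, and leaves the Uniformity computation of (27) as a user-supplied diversity_fn; it is an illustration of the pipeline, not the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans, AgglomerativeClustering

def woce_sketch(Y_hat, k, diversity_fn, n_members=10):
    """End-to-end sketch: generate, weight, aggregate, and cut the dendrogram.

    diversity_fn(labels, reference) should return a weight in [0, 1];
    the paper uses Uniformity (27), which is not re-implemented here.
    """
    reference, weights = [], []
    for r in range(n_members):
        if r % 2 == 0:                           # alternate between two base algorithms
            labels = KMeans(n_clusters=k, n_init=10, random_state=r).fit_predict(Y_hat)
        else:
            labels = AgglomerativeClustering(n_clusters=k).fit_predict(Y_hat)
        weights.append(diversity_fn(labels, reference))
        reference.append(labels)
    A = weighted_coassociation(reference, weights)   # WEAC-style matrix (previous sketch)
    D = squareform(1.0 - A, checks=False)            # co-association -> condensed distances
    Z = linkage(D, method='average')                 # average-linkage dendrogram
    return fcluster(Z, t=k, criterion='maxclust')    # cut the dendrogram into k clusters
```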

There are three points which must be discussed before this paper starts to explain the empirical studies. Firstly, why does this paper choose the WOC as a framework for cluster ensemble? As mentioned before, the main reasons for using cluster ensemble are increasing the performance, stability, and robustness of the final results on clustering problems. As already stated, the WOC theory claims that the aggregate judgment of a crowd is superior to that of a few experts. In other words, it is proven [21, 23, 18] that results made by aggregating the information of groups have better performance, stability, and robustness than those made by any single group member if the WOC criteria are satisfied. Therefore, the cluster ensemble and the WOC are the same solution with the same goals in two different sciences, i.e. machine learning and social science, respectively. Next, what are the common concepts between our proposed criteria in the WOC and previous methods in clustering problems? In fact, diversity already exists in clustering under the same name, e.g. NMI and APMM are two well-known methods for calculating diversity in cluster ensemble (selection); diversity increases the stability and robustness of the final results. Further, independency refers to the correlation concept in learning methods. This correlation can be defined between the features of the raw data, and there are some techniques, e.g. Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), etc., for mapping data to new dimensions without any correlation between the features. This paper uses a new branch of these techniques for satisfying the independency criterion, which can increase the performance of the final results. In addition, decentralization guarantees that the quality of the final result is optimized. In other words, it uses different individual clustering algorithms, which employ different objective functions, for generating all possible patterns in the reference set of the cluster ensemble problem. Moreover, an effective aggregation method can combine the final result without a thresholding procedure. The last question is why all four conditions of the WOC must be satisfied in ensemble learning. Based on the previous question, the proposed method can be viewed as a CES method that applies a feature mapping in advance. In practice, most clustering analyses include these steps [12, 18, 36]. Therefore, the WOC framework does not add any new stage to the clustering analysis pipeline; it just defines a robust and compulsory structure for an ensemble framework in real-world applications.

IV Experiments

The empirical studies are presented in this section. Unsupervised methods are used to find meaningful patterns in unlabeled data sets such as web documents, while semi-supervised methods employ supervision information for generating more robust and stable final results in real-world applications. Since real data sets do not have class labels, there is no direct method for estimating the performance of unsupervised or semi-supervised methods. Like many previous studies [7, 17, 9, 18, 36, 5], this paper compares the performance of the proposed method with other individual clustering methods and cluster ensemble (selection) methods by using standard data sets and their real classes. Moreover, the supervision information is randomly generated based on the real class labels. In this paper, all algorithms were implemented by the authors in MATLAB R2015a (8.5) and run on a PC with the following specifications: Apple MacBook Pro, CPU = Intel Core i7 (4 x 2.4 GHz), RAM = 8 GB, OS = OS X 10.11. All results are reported by averaging the results of 10 independent runs of the algorithms. Table I lists the individual clustering algorithms used for generating the individual clustering results in our proposed method. Further, the number of individual clustering results in the reference set is set to 20 for the ensemble methods.

No. Method
1 K-means
2 Fuzzy C-means
3 Median K-flats
4 Gaussian mixture
5 Subtractive clustering
6 Single-linkage euclidean
7 Single-linkage hamming
8 Single-linkage cosine
9 Average-linkage euclidean
10 Average-linkage hamming
11 Average-linkage cosine
12 Complete-linkage euclidean
13 Complete-linkage hamming
14 Complete-linkage cosine
15 Ward-linkage euclidean
16 Ward-linkage hamming
17 Ward-linkage cosine
18 Spectral using a sparse similarity matrix
19 Spectral using Nystrom method with orthogonalization
20 Spectral using Nystrom method without orthogonalization
TABLE I: The individual clustering algorithms, which are used for generating individual clustering results

IV-A Data Sets

This paper uses three different groups of data sets for generating the experimental results, i.e. image data sets, document data sets, and other UCI data sets. Table II lists the properties of these data sets. This paper uses the USPS digits data set, which is a collection of gray-scale images of natural handwritten digits and is available from [38]. Furthermore, this paper utilizes ImageNet [39], MNIST, and CIFAR-10 [40] as three image-based data sets, which are mostly employed in deep learning studies [40]. As another alternative among the image-based data sets, this paper uses the Alzheimer's Disease Neuroimaging Initiative (ADNI) data set with 202 subjects. This data set contains Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) images of the human brain in two categories (shown by C1 and C2 in Tables II and III) for recognizing Alzheimer's disease. In the first category, this data set partitions the subjects into three groups: Healthy Control (HC), Mild Cognitive Impairment (MCI), and Alzheimer's Disease (AD). In the second category, there are four groups because the MCI group is partitioned into high-risk and low-risk groups (HMCI/LMCI). This paper uses all possible forms of this data set by using only the MRI features, only the PET features, and all of the MRI and PET features (FUL) in each of the two categories. More information about ADNI-202 is available in [41]. As a document-based data set, the 20 Newsgroups is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. Some of the newsgroups are very closely related to each other, while others are highly unrelated. It has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. As two other document-based data sets, Reuters-21578 [42] and Letters [15] are employed in this paper. The rest of the standard data sets are from the UCI repository [43]. This paper has chosen data sets which are as diverse as possible in their numbers of true clusters, features, and samples because this variety better validates the obtained results. The features of the data sets are normalized to a mean of 0 and variance of 1.

Data Set Instances Features Class
20 Newsgroups 26214 18864 20
ADNI-MRI-C1 202 93 3
ADNI-MRI-C2 202 93 4
ADNI-PET-C1 202 93 3
ADNI-PET-C2 202 93 4
ADNI-FUL-C1 202 186 3
ADNI-FUL-C2 202 186 4
Arcene 900 10000 2
Bala. Scale 625 4 3
Brea. Cancer 286 9 2
Bupa 345 6 2
CIFAR-10 5000 1024 10
CNAE-9 1080 857 9
Galaxy 323 4 7
Glass 214 10 6
Half Ring 400 2 2
ImageNet 5000 400 5
Ionosphere 351 34 2
Iris 150 4 3
Letters 20000 16 26
MNIST 70000 784 10
Optdigit 5620 62 10
Pendigits 10992 16 10
Reuters-21578 8293 18933 65
SA Heart 462 9 2
Sonar 208 60 2
Statlog 6435 36 7
USPS 9298 256 10
Wine 178 13 2
Yeast 1484 8 10
TABLE II: The standard data sets

IV-B Performance analysis for unsupervised methods

In this section, the performance (accuracy metric [20]) of the unsupervised version of the proposed method (UWoCE) is analyzed. As mentioned before, the algorithms listed in Table I were employed for generating the individual clustering results in our proposed method. Further, the sets of supervision information (must-links and cannot-links) are considered null in this section. The final clustering performance was evaluated by re-labeling the obtained clusters against the ground-truth labels and then counting the percentage of correctly classified samples [18, 12]. The results of the proposed method are compared with the full ensemble (EAC) [19] as a baseline, and with WPCK [32], GKPC [33], HCSS [14], GP-MGLA [15], and WOCCE [18], which are state-of-the-art cluster ensemble (selection) methods. The performance of the full ensemble method (EAC) is reported to demonstrate the effect of selecting the best results in comparison with combining all generated results. In addition, the performance of WPCK, GKPC, HCSS, and GP-MGLA is reported to represent four weighted clustering ensemble methods. To represent the effect of Uniformity on the performance of the final results, it is compared with three state-of-the-art metrics in diversity evaluation (A3 [18], SACT [15], and CA [33]). This paper does not use the optional feature selection in this section ($d = 0$).
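For reference, the re-labeling evaluation used throughout this section can be computed by a Hungarian matching between the obtained clusters and the ground-truth classes. The following Python sketch is a minimal implementation of that accuracy metric using SciPy's linear_sum_assignment; it reflects the standard procedure, not the authors' own evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """Accuracy after optimally re-labeling clusters against the ground truth."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(pred_labels)
    # Contingency table: rows = obtained clusters, columns = true classes
    cont = np.array([[np.sum((pred_labels == c) & (true_labels == t))
                      for t in classes] for c in clusters])
    row, col = linear_sum_assignment(-cont)   # maximize the matched counts
    return cont[row, col].sum() / len(true_labels)

# Example: the best matching re-labels three of the four points correctly
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 2]))   # 0.75
```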

The experimental results are given in Table III. In this table, the best result achieved for each data set is highlighted in bold. As depicted in this table, the results of EAC illustrate the effect of evaluation and selection in cluster ensemble selection methods. Since some of the four conditions of the WOC theory do not hold in EAC, this method is a good example of an unwise crowd. According to this table, the proposed algorithm (UWoCE) has generated better results in comparison with the other individual and ensemble algorithms. Even though the proposed method was outperformed by a number of algorithms on four data sets (ADNI-MRI-C2, SA Heart, Sonar, and Yeast), the majority of the results demonstrate the superior accuracy of the proposed method in comparison with the other algorithms. In addition, the difference between the performance of the proposed method and the best result on those four data sets is lower than 2%. Furthermore, WOCCE and the proposed method generate more stable results in comparison with the other methods based on the standard deviations. As mentioned before, this is the effect of the WOC framework.

Data Sets EAC WPCK GKPC HCSS GP-MGLA WOCCE UWoCE
20 Newsgroups 26.19±0.72 27.01±0.93 28.45±1.02 30.62±0.84 35.47±0.91 32.62±0.52 38.23±0.12
ADNI-MRI-C1 42.19±0.37 41.24±0.97 43.51±1.02 46.61±0.36 49.36±0.7 48.82±0.37 51.15±0.73
ADNI-MRI-C2 39.52±0.31 39.95±0.61 40.09±0.51 41.32±0.81 40.72±1.25 42.22±0.44 41.23±0.95
ADNI-PET-C1 40.38±0.52 40.51±0.26 43.79±1.04 45.3±0.49 48.22±0.71 49.19±0.26 51.17±0.98
ADNI-PET-C2 38.85±0.59 37.51±0.69 36.58±0.72 41.92±1.18 40.68±0.73 39.43±0.79 42.48±0.67
ADNI-FUL-C1 44.42±0.91 43.84±0.93 46.56±0.49 49.62±0.81 49.27±0.61 48.82±0.41 50.89±0.83
ADNI-FUL-C2 47.21±0.63 49.71±0.99 51.26±0.64 52.26±0.66 51.92±0.7 49.39±0.63 53.31±0.97
Arcene 61.79±0.813 63.92±0.81 63.26±1.04 65.54±0.73 66.32±0.91 65.16±0.32 68.13±0.82
Bala. Scale 54.09±1.75 55.42±0.94 56.04±0.72 57.41±0.56 56.23±0.94 57.88±0.61 60.64±0.58
Brea. Cancer 90.17±1.24 81.93±1.92 82.43±1.24 65.51±1.91 72.27±1.06 96.92±0.77 97±0.14
Bupa 51.73±0.99 57.91±0.82 59.09±0.98 58.33±1.32 58.91±0.51 57.02±0.46 60.83±0.12
CIFAR-10 51.92±1.24 54.1±0.88 55.52±0.79 56.12±0.91 57.82±0.85 59.37±0.52 62.04±0.32
CNAE-9 72.41±1.09 75.41±0.69 75.53±0.55 80.63±1.41 81.29±0.81 79.2±0.58 84.12±0.44
Galaxy 33.12±0.52 30.99±1 32.71±0.84 35.71±0.61 34.72±0.96 34.88±0.81 37.18±0.67
Glass 50.93±0.18 45.01±2.03 46.57±2.97 52.31±0.68 50.62±0.38 51.82±0.92 57±0.78
Half Ring 77.53±0.21 82.54±0.93 85.41±0.94 90.53±0.67 89.99±1.02 87.2±0.14 98.11±0.31
ImageNet 23.53±0.81 32.86±0.42 35.32±0.59 35.04±0.93 33.51±0.83 38.14±0.62 41.67±0.7
Ionosphere 68.12±0.42 66.52±1.1 67.04±0.79 71.23±0.91 70.9±0.99 70.52±0.15 73.67±0.41
Iris 73.51±0.82 79.92±1.86 80.39±0.83 85.62±0.82 75.31±0.28 92±0.59 96.3±0.62
Letters 42.82±0.81 48.95±1.34 47.68±0.98 54.32±0.9 52.19±0.49 53.69±0.73 55.83±0.26
MNIST 52.18±2.76 55.66±1.41 62.46±0.76 59.92±1.41 67.39±0.97 66.21±0.92 69.72±0.71
Optdigit 65.92±1.2 70.27±0.84 74.67±0.42 78.99±1.02 76.69±0.72 77.16±0.21 80.56±0.69
Pendigits 52.88±0.92 55.73±0.75 54.08±0.38 62.82±0.81 60.78±0.95 61.68±0.18 64.13±0.42
Reuters-21578 62.34±0.72 70.24±0.92 71.82±0.78 74.63±0.87 75.29±0.66 68.85±0.32 76.41±0.24
SA Heart 66.39±1.62 67.38±1.02 66.53±1.26 70.54±0.93 71.42±0.87 73.7±0.46 72.05±0.16
Sonar 50.48±0.92 53.84±1.01 53.25±0.51 61.82±0.72 59.12±0.83 54.39±0.25 60.06±0.87
Statlog 52.28±0.91 55.39±0.75 55.26±0.97 57.33±0.91 56.42±0.92 55.77±0.71 59.76±0.5
USPS 60.49±0.84 59.42±0.78 62.11±0.37 64.92±1.68 63.08±0.59 65.21±0.69 66.01±0.24
Wine 70.24±0.72 75.62±1.79 81.25±0.93 79.29±0.51 83.16±0.84 71.34±0.55 89.46±0.14
Yeast 33.81±0.32 36.23±0.61 35.23±0.72 40.25±0.88 42.03±0.91 37.76±0.26 41.12±0.4
TABLE III: The performance of unsupervised methods

Fig. 2: The performance of semi-supervised methods on (a) 20 Newsgroups, (b) ADNI-FUL-C1, (c) Arcene, (d) CNAE-9, (e) Optdigit, (f) Reuters-21578, (g) Sonar, (h) USPS, (i) CIFAR-10, (j) ImageNet, (k) MNIST, and (l) Letters.

IV-C Performance analysis for semi-supervised methods

The empirical results of the semi-supervised methods are analyzed in this section. Since most of the semi-supervised cluster ensemble methods [6, 29, 36] use feature selection based on the supervision information, this paper compares the performance of the semi-supervised methods on the high-dimensional and large-scale data sets in Table II, i.e. 20 Newsgroups, Letters, and Reuters-21578 as document-based data sets; ADNI, CIFAR-10, ImageNet, MNIST, and USPS as image-based data sets; and Arcene, CNAE-9, Optdigit, and Sonar from the UCI repository. This paper does not use the optional feature selection in this section ($d = 0$).

In this paper, 1% to 5% of the instances with class labels are randomly selected for generating the supervision information (half for must-links and half for cannot-links); e.g. 1% (2620) of the instances are selected in the 20 Newsgroups data set for generating the pairwise constraints, where 655 must-link and 655 cannot-link constraints are generated from the selected instances. In addition, the supervision information applied to the methods is the same for all methods in each independent run. Notably, this paper does not employ all combinations of the randomly selected instances as pairwise constraints (must-links and cannot-links). In other words, each randomly selected instance is used once for generating just one must-link or cannot-link. There are two reasons for this strategy of generating pairwise constraints. Firstly, this strategy provides better diversity among the generated pairwise constraints. Secondly, this strategy better simulates the real application of prior supervision information. Indeed, there are no class labels in real-world applications, and generating the pairwise constraints from all combinations of the randomly selected instances is impossible or expensive [5, 6, 36]. For instance, consider an interactive image search engine which shows two random images to users in each attempt and asks the users to specify whether these images are the same (must-link) or different (cannot-link). The search engine then improves the clustering results based on these limited feedbacks.
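The constraint-generation protocol above can be summarized with a short Python sketch. The pairing heuristic below is one simple reading of the description (a small fraction of labeled instances is drawn, each selected instance appears in exactly one pair, and the pairs are split between must-links and cannot-links); it is not the authors' exact sampler.

```python
import numpy as np

def sample_pairwise_constraints(labels, fraction=0.01, seed=0):
    """Draw a fraction of labeled instances and pair them into constraints."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_selected = int(fraction * len(labels)) // 2 * 2          # even number of instances
    idx = rng.choice(len(labels), size=n_selected, replace=False)
    must, cannot = [], []
    pool = list(idx)
    while len(pool) >= 2:
        i = pool.pop()
        want_must = len(must) <= len(cannot)                    # keep the two sets balanced
        match = next((j for j in pool if (labels[j] == labels[i]) == want_must), None)
        if match is None:                                       # fall back to any remaining point
            match = pool[0]
        pool.remove(match)
        (must if labels[i] == labels[match] else cannot).append((int(i), int(match)))
    return must, cannot

# Example: 1% of a labeled toy set, split between must-links and cannot-links
must, cannot = sample_pairwise_constraints(np.random.randint(0, 3, size=1000), fraction=0.01)
```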

The final clustering performance (accuracy metric [20]) was evaluated by re-labeling the obtained clusters against the ground-truth labels and then counting the percentage of correctly classified samples [18]. Figure 2 illustrates the performance of the proposed method (WoCE) in comparison with RP [44], BGCM [36], NBF [6], and SKMS [5]. In this figure, the standard deviations of the results are lower than 1% (for 10 independent runs). This paper reports the performance of RP as a classical method in semi-supervised cluster ensemble, and the performance of BGCM as a recent graph-based approach in semi-supervised clustering. Notably, BGCM has two versions, i.e. unsupervised and semi-supervised; this paper uses the semi-supervised version in this section. Moreover, this paper uses SKMS as a kernel-based method in semi-supervised clustering. Last but not least, the empirical results of the proposed method are compared with NBF as another heuristic method in semi-supervised cluster ensemble. Even though WoCE was outperformed on one data set (Optdigit) by some algorithms, the majority of the results demonstrate superior accuracy for the proposed method. In addition, the clustering performance of some algorithms in Fig. 2 (k) and (l) becomes worse as the number of pairwise constraints increases. As mentioned before, pairwise constraints often result in highly unstable clustering performance [5, 6]. These figures are good examples of this issue, where some of the previous methods cannot handle the extra supervision information. In fact, the supervision information makes the individual clustering results unstable and significantly reduces the performance of the mentioned methods. In these cases, our proposed method handles the supervision information by employing the WOC theory, i.e. a better data representation (Algorithms 1 and 2), a robust individual clustering evaluation (the Uniformity metric), and an effective aggregation mechanism (WEAC).

IV-D Optional feature selection in the unsupervised method

In this section, the performance of the proposed method is analyzed when the optional feature selection is applied. Since feature selection is automatically performed by applying the supervision information to the mapped data set in the semi-supervised version of the proposed method, this section analyzes the unsupervised version of the proposed method (UWoCE). This paper employs high-dimensional data sets, i.e. Arcene, CIFAR-10, MNIST, and USPS, for this analysis. Figure 3 (a) shows the performance of the proposed method as a function of the percentage of selected features in each data set. The vertical axis refers to the performance, while the horizontal axis refers to the percentage of selected features. As depicted in this figure, the optional feature selection can significantly increase the performance of the final results on high-dimensional data sets. Therefore, this paper recommends using the optional feature selection on high-dimensional data sets to handle feature sparsity. Moreover, this experiment is the reason for using high-dimensional and large-scale data sets in the previous section. Two important questions must be discussed here: what is the difference between the mapping function and the optional feature selection, and where can the optional feature selection be employed? Indeed, the mapping function illustrated in Algorithm 1 minimizes the correlation between features: it maps the data to a new domain in which the covariance between different features is near zero. Most of the time, this function maps the data onto stable dimensions, which can dramatically improve the accuracy of the final results. It can also be formulated as $\hat{X} = U^{\top}X$, where $X$ denotes the (centered) data and $U$ contains the eigenvectors of its covariance matrix, so that the mapped data $\hat{X}$ satisfies the independency criterion in the WOC theory. In a high-dimensional data set, some of the calculated eigenvalues approach zero. Since the corresponding dimensions have a trivial effect on the mapping, they can be omitted to reduce the dimensionality of the data set as well as the time and space complexities of the clustering analysis. In other words, these eigenvalues may reduce the stability of the final results [12]. With the optional feature selection, the mapping can be formulated as $\hat{X} = U_{m}^{\top}X$, where $U_{m}$ keeps only the $m$ eigenvectors associated with non-negligible eigenvalues and $m$ is smaller than the original number of features. Therefore, employing this optional feature selection when analyzing high-dimensional data sets can improve the stability of the mapping as well as the performance of the final result (see Fig. 3 (a)). In addition, this feature selection is best applied according to the distribution of the eigenvalues in each problem, i.e. by removing the near-zero values.
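
Algorithm 1 itself is not reproduced here; the following is only a minimal sketch, assuming the mapping is the eigenvector projection described above, with `keep_fraction` as an illustrative stand-in for the optional feature-selection parameter.

```python
import numpy as np

def independency_mapping(X, keep_fraction=1.0):
    """Project the centered data onto the eigenvectors of its covariance
    matrix so that the mapped features are (near) uncorrelated; with
    keep_fraction < 1, directions with the smallest (near-zero) eigenvalues
    are dropped, mimicking the optional feature selection."""
    Xc = X - X.mean(axis=0)                   # remove the expected value
    cov = np.cov(Xc, rowvar=False)            # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]         # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    m = max(1, int(round(keep_fraction * len(eigvals))))
    return Xc @ eigvecs[:, :m]                # mapped (optionally reduced) data
```

The mapped features then have a (near) diagonal covariance matrix, which is the independency property exploited when generating the individual clustering results.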


Fig. 3: (a) The performance of the UWoCE method using the optional feature selection. (b) The runtime analysis.

IV-E Time complexity analysis

In this section, the runtimes of both the unsupervised and semi-supervised methods are compared on various data sets, i.e. three large-scale data sets (Letters, MNIST, 20 Newsgroups) and two high-dimensional data sets (Arcene, Reuters-21578). Figure 3 (b) illustrates the relationship between the runtime of the mentioned methods and the size of the data sets. The vertical axis refers to the runtime, while the horizontal axis refers to the algorithms. As mentioned before, all of the results in this experiment are generated on a PC with the same specifications. As depicted in this figure, the runtime of the semi-supervised methods (the first five bars) is higher than that of the unsupervised methods because they need an additional step to apply the supervision information (mostly in the form of an eigenproblem) [36]. Considering the performance of these methods in Table III and Fig. 2, WoCE (the first bar) and UWoCE (the last bar) generate more efficient results in comparison with the other clustering methods. Indeed, the proposed method selects the features based on the correlations between data points and the supervision information (in the semi-supervised approach). Thus, the number of calculations for generating the individual clustering results is significantly decreased in comparison with other cluster ensemble methods, while the performance of the final results is significantly increased.

There are some technical issues that must be discussed here. Firstly, this paper uses an EM algorithm [37] for estimating the eigenvalues and eigenvectors, which significantly reduces the time complexity of the mapping function in Algorithm 1. Secondly, the size of the transformed matrix (eq. 22) in the proposed method for applying the supervision information is limited to the number of pairwise constraints, which is far smaller than the number of instances; e.g. in the 20 Newsgroups data set, the matrix built from 1% of randomly sampled pairwise constraints is much smaller than the full instance similarity matrix. Notably, most previous studies, such as SKMS and BGCM, directly used the instance similarity matrix for applying the supervision information. Lastly, this paper uses a modified version of EAC for combining the individual clustering results. EAC applies a linkage method to a simple matrix whose size is the number of algorithms times the number of instances ($m \times n$), where $m \ll n$ in practice. By contrast, some previous studies such as BGCM utilized graph methods for combining the individual results, where the size of the adjacency matrix of the graph is the square of the number of instances ($n \times n$). Based on these technical issues, the proposed method can significantly increase the performance of the final results while maintaining an acceptable runtime.
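
To make the last comparison concrete, the short sketch below contrasts the two memory footprints under the assumption of 8-byte entries; the particular values of $m$ and $n$ are made up for illustration and are not taken from the experiments.

```python
# Illustrative memory-footprint comparison of the two combination strategies
# discussed above; m and n are made-up values, entries assumed to be 8 bytes.
m, n = 20, 100_000                        # base clusterings vs. instances
label_matrix_mb = m * n * 8 / 1e6         # EAC-style m-by-n label matrix
adjacency_gb = n * n * 8 / 1e9            # graph-based n-by-n adjacency matrix
print(f"m x n label matrix : {label_matrix_mb:.1f} MB")   # 16.0 MB
print(f"n x n adjacency    : {adjacency_gb:.1f} GB")      # 80.0 GB
```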

V Conclusion

In this paper, the wisdom of crowds (WOC) theory from social science was mapped to the clustering ensemble arena. The main advantages of this mapping include the addition of two new aspects, i.e., independency and decentralization, for estimating the quality of individual clustering results, and a new framework to investigate them. To satisfy the four conditions of WOC, this paper incorporates a series of novel strategies for producing individual clustering results as well as for obtaining the final ensemble result. Specifically, a mapping function is introduced to enforce the independency criterion on the individual clustering results. This function minimizes the correlation between features by using the concepts of expected value and covariance. The decentralization criterion is proposed for transforming the data from a high-dimensional to a low-dimensional space based on pairwise constraints, preserving the quality of the generated individual clustering results. Further, this paper evaluates the diversity of individual clustering results with a novel metric called uniformity. At last, weighted EAC is proposed for the final aggregation. To validate the effectiveness of the proposed approach, an extensive experimental study was performed by comparing it with multiple state-of-the-art methods on various data sets. In the future, we will develop a new version of uniformity based on the concept of expected value instead of using the APMM.

Acknowledgment

We thank the anonymous reviewers for their comments. This work was supported in part by the National Natural Science Foundation of China (61422204, 61473149, and 61503182), the Jiangsu Natural Science Foundation (BK20130034 and BK2015042628), and the NUAA Fundamental Research Funds (NE2013105).

References

  • [1] A. Strehl and J. Ghosh, “Cluster ensembles - a knowledge reuse framework for combining multiple partitions,” Journal of Machine Learning Research, vol. 3, pp. 583–617, 2002.
  • [2] A. Topchy, A. Jain, and W. Punch, “Combining multiple weak clusterings,” in IEEE International Conference on Data Mining (ICDM’03), 19-22 September 2003, pp. 331–338.
  • [3] A. Fred and A. Lourenco, “Cluster ensemble methods: from single clusterings to combined solutions,” Computer Intelligence, vol. 126, pp. 3–30, 2008.
  • [4] A. K. Jain, A. Topchy, M. Law, and J. Buhmann, “Landscape of clustering algorithms,” in 17th International Conference on Pattern Recognition, 26 August 2004, pp. 23–26.
  • [5] S. Anand, S. Mittal, O. Tuzel, and P. Meer, “Semi-supervised kernel mean shift clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 15–2, 2014.
  • [6] S. Xiong, J. Azimi, and X. Fern, “Active learning of constraints for semi-supervised clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, 2014.
  • [7] X. Fern and W. Lin, “Cluster ensemble selection,” in SIAM International Conference on Data Mining (SDM’08), 24-26 April 2008, pp. 128–141.
  • [8] J. Azimi and X. Fern, “Adaptive cluster ensemble selection,” in 21st International Joint Conference on Artificial Intelligence (IJCAI-09), 11-17 July 2009, pp. 992–997.
  • [9] H. Alizadeh, B. Minaei-Bidgoli, and H. Parvin, “Cluster ensemble selection based on a new cluster stability measure,” Intelligent Data Analysis (IDA), vol. 18, no. 3, pp. 389–40, 2014.
  • [10] L. Limin and F. Xiaoping, “A new selective clustering ensemble algorithm,” in 9th IEEE International Conference on e-Business Engineering, 9-11 September 2012, pp. 45–49.
  • [11] J. Jia, X. Xiao, and B. Liu, “Similarity-based spectral clustering ensemble selection,” in 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 29-31 May 2012, pp. 1071–1074.
  • [12] M. Yousefnezhad and D. Zhang, “Weighted spectral cluster ensemble,” in IEEE International Conference on Data Mining (ICDM’15), 14-17 November 2015.
  • [13] Z. Yu, L. Li, J. Liu, J. Zhang, and G. Han, “Adaptive noise immune cluster ensemble using affinity propagation,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 12, pp. 3176–3189, 2015.
  • [14] Z. Yu, L. Li, Y. Gao, J. You, J. Liu, H.-S. Wong, and G. Han, “Hybrid clustering solution selection strategy,” Pattern Recognition, vol. 47, no. 10, pp. 3362–3375, 2014.
  • [15] D. Huang, J.-H. Lai, and C.-D. Wang, “Combining multiple clusterings via crowd agreement estimation and multi-granularity link analysis,” Neurocomputing, vol. 170, pp. 240–250, 2015.
  • [16] L. Jing, K. Tian, and J. Z. Huang, “Stratified feature sampling method for ensemble clustering of high dimensional data,” Pattern Recognition, vol. 48, no. 11, pp. 3688–3702, 2015.
  • [17] H. Alizadeh, H. Parvin, and S. Parvin, “A framework for cluster ensemble based on a max metric as cluster evaluator,” International Journal of Computer Science, vol. 39, pp. 1–39, 2012.
  • [18] H. Alizadeh, M. Yousefnezhad, and B. Minaei-Bidgoli, “Wisdom of crowds cluster ensemble,” Intelligent Data Analysis (IDA), vol. 19, no. 3, 2015.
  • [19] A. Fred and A. K. Jain, “Combining multiple clusterings using evidence accumulation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 835–850, 2005.
  • [20] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining.   Addison-Wesley Longman Publishing Co., Inc.; ISBN: 0321321367, 2005.
  • [21] J. Surowiecki, The Wisdom of Crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies and nations.   Littles: Brown, ISBN:0-316-86173-1, 2004.
  • [22] D. Yang, G. Xue, X. Fang, and J. Tang, “Crowdsourcing to smartphones: incentive mechanism design for mobile phone sensing,” in MOBICOM’2012: ACM International Conference on Mobile Computing and Networking, 22–26 August 2012.
  • [23] L. Baker and D. Ellison, “The wisdom of crowds — ensembles and modules in environmental modeling,” Geoderma, vol. 147, pp. 1–7, 2008.
  • [24] B. Miller, P. Hemmer, M. Steyvers, and M. D. Lee, “The wisdom of crowds in rank ordering problems,” in 9th International Conference on Cognitive Modeling (ICCM’09), 24-26 July 2009, pp. 86–91.
  • [25] M. Steyvers, M. Lee, B. Miller, and P. Hemmer, “The wisdom of crowds in the recollection of order information,” Advances in Neural Information Processing Systems, vol. 22, pp. 1785–1793, 2009.
  • [26] P. Welinder, S. Branson, S. Belongie, and P. Perona, “The multidimensional wisdom of crowds,” in 24th Conference on Neural Information Processing Systems (NIPS), 6-9 December 2010, pp. 1–9.
  • [27] D. P. Williams, “Underwater mine classification with imperfect labels,” in 20th International Conference on Pattern Recognition, August 2010, pp. 4157–4161.
  • [28] S. K. M. Yi, M. Steyvers, and M. D. Lee, “Wisdom of the crowds in minimum spanning tree problems,” in 32nd Annual Conference of the Cognitive Science Society, 10-13 August 2010, pp. 31 840–31 845.
  • [29] C.-L. Liu, W.-H. Hsaio, C.-H. Lee, and F.-S. Gou, “Semi-supervised linear discriminant clustering,” IEEE Transactions on Cybernetics, vol. 44, no. 7, pp. 989–1000, 2014.
  • [30] H. Wang, T. Li, T. Li, and Y. Yang, “Constraint neighborhood projections for semi-supervised clustering,” IEEE Transactions on Cybernetics, vol. 44, no. 5, pp. 636–643, 2014.
  • [31] Y. Chen, S. H. Lim, and H. Xu, “Weighted graph clustering with non-uniform uncertainties,” in 31st International Conference on Machine Learning (ICML’14), 21-26 June 2014, pp. 1566–1574.
  • [32] S. Vega-Pons, J. Correa-Morris, and J. Ruiz-Shulcloper, “Weighted partition consensus via kernels,” Pattern Recognition, vol. 43, no. 8, pp. 2712–2724, 2010.
  • [33] S. Vega-Pons, J. Ruiz-Shulcloper, and A. Guerra-Gandón, “Weighted association based methods for the combination of heterogeneous partitions,” Pattern Recognition Letters, vol. 32, no. 16, pp. 2163–2170, 2011.
  • [34] S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “Standardized mutual information for clustering comparisons: One step further in adjustment for chance,” in 31st International Conference on Machine Learning (ICML14), 21-26 June 2014, pp. 1143–1151.
  • [35] G. Huang, S. Song, J. N. D. Gupta, and C. Wu, “Semi-supervised and unsupervised extreme learning machines,” IEEE Transactions on Cybernetics, vol. 44, no. 12, pp. 2405–2417, 2014.
  • [36] J. Gao, F. Liang, W. Fan, Y. Sun, and J. Han, “A graph-based consensus maximization approach for combining multiple supervised and unsupervised models,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, pp. 15–2, 2013.
  • [37] M. E. Tipping and C. M. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural computation, vol. 11, no. 2, pp. 443–482, 1999.
  • [38] S. Roweis. (1998) The world-famous courant institute of mathematical sciences, computer science department, new york university. [Online]. Available: http://cs.nyu.edu/∼roweis/data.html
  • [39] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2.   IEEE, 2006, pp. 2169–2178.
  • [40] S. Azadi, J. Feng, S. Jegelka, and T. Darrell, “Auxiliary image regularization for deep cnns with noisy labels,” in International Conference on Learning Representations (ICLR’16), 2016.
  • [41] C. Zu and D. Zhang, “Label-alignment-based multi-task feature selection for multimodal classification of brain disease,” in 4th NIPS Workshop on Machine Learning and Interpretation in Neuroimaging (MLINI’14), 13 December 2014.
  • [42] D. Cai, X. He, and J. Han, “Locally consistent concept factorization for document clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, pp. 902–913, 2011.
  • [43] C. B. D. J. Newman, S. Hettich, and C. Merz. (1998) Uci repository of machine learning databases. [Online]. Available: http://www.ics.uci.edu/mlearn/MLSummary.html
  • [44] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.

Muhammad Yousefnezhad received his B.Sc. and M.Sc. degrees in Computer Hardware Engineering and Information Technology (IT), with a strong focus on Artificial Intelligence, from Mazandaran University of Science and Technology (MUST), Iran, in 2010 and 2013, respectively. He joined the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics as a Research Assistant for his Ph.D. research in 2014. His main research interest is developing machine learning techniques, particularly within the area of human brain decoding.

Sheng-Jun Huang received the B.Sc. and Ph.D. degrees in computer science from Nanjing University, China, in 2008 and 2014, respectively. He is currently an Associate Professor in the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics. His main research interests include machine learning and pattern recognition. He has won the China Computer Federation (CCF) Outstanding Doctoral Dissertation Award in 2015, the Best Poster Award at KDD’12, the Best Student Paper Award at CCDM’11, and the Microsoft Fellowship Award in 2011.

Daoqiang Zhang received the B.Sc. and Ph.D. degrees in computer science from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 1999 and 2004, respectively. He is currently a Professor in the Department of Computer Science and Engineering, Nanjing University of Aeronautics and Astronautics. His current research interests include machine learning, pattern recognition, and biomedical image analysis. In these areas, he has authored or coauthored more than 100 technical papers in the refereed international journals and conference proceedings.
