Cluster matching by permuting cluster labels is important in many clustering contexts such as cluster validation and cluster ensemble techniques. The classic approach is to minimize the Euclidean distance between two cluster solutions, which induces inappropriate stability in certain settings. Therefore, we present the truematch algorithm, which introduces two improvements best explained in the crisp case. First, instead of maximizing the trace of the cluster crosstable, we propose to maximize a $\chi^2$-transformation of this crosstable. Thus, the trace will not be dominated by the cells with the largest counts but by the cells with the most non-random observations, taking into account the marginals. Second, we suggest a probabilistic component in order to break ties and to make the matching algorithm truly random on random data. The truematch algorithm is designed as a building block of the truecluster framework and scales in polynomial time. First simulation results confirm that the truematch algorithm gives more consistent truecluster results for unequal cluster sizes. Free R software is available.
Keywords: Hungarian method, truematch, truecluster, MMCC, CIC, Hornik (2005)
Applying a cluster algorithm to a dataset results in assignments (fuzzy or crisp) of cases to anonymous clusters. In order to interpret these clusters, we often wish to compare them to other classifications, so some heuristic is needed to match one classification to another. With the advent of resampling and ensemble methods in clustering (Gordon and Vichi, 2001; Dimitriadou et al., 2002; Strehl and Ghosh, 2002), the task of matching cluster solutions has become even more important: we need reliable and scalable matching algorithms that perform the task fully automatically.
Consider, for example, the use of bootstrapping or cross-validation for cluster validation as suggested by many authors (Moreau and Jain, 1987; Jain and Moreau, 1988; Tibshirani et al., 2001; Roth et al., 2002; Ben-Hur et al., 2002; Dudoit and Fridlyand, 2002): many cluster solutions are created and agreement between them is evaluated. Some agreement indices do not need explicit cluster matching (Rand, 1971; Hubert and Arabie, 1985), but others can only be applied after cluster solutions have been matched, for example, Cohen’s kappa (1960).
Recently, authors have suggested transferring the idea of bagging (Breiman, 1996) to clustering. Some approaches aggregate cluster centers (Leisch, 1999; Dolnicar and Leisch, 2000; Bakker and Heskes, 2001) or aggregate consensus between pairs of observations (Monti et al., 2003; Dudoit and Fridlyand, 2003, BagClust2 algorithm). Other approaches aggregate cluster assignments and, therefore, require cluster matching, for example, the crisp BagClust1 algorithm of Dudoit and Fridlyand (2003), the combination scheme for fuzzy clustering of Dimitriadou et al. (2002) or truecluster (Oehlschlägel, 2007b).
Truecluster is an algorithmic framework for robust scalable clustering with model selection that combines the idea of bagging with information theoretical model selection along the lines of Akaike (1973, 1974) and Schwarz (1978). In order to calculate its cluster information criterion (CIC), truecluster requires a reliable cluster matching algorithm. The truematch algorithm presented here was designed to play that role. The organization of the paper is as follows: in Section 2, we show an undesirable feature of the standard approach to cluster matching. In Section 3, we present the truematch algorithm. In Section 4, we demonstrate the benefits of the truematch algorithm within the truecluster framework. In Section 5, we use simulation to compare truematch against standard trace maximization matching and in Section 6, we discuss our results.
2 What’s wrong with trace maximization of the matching table
The standard approach to cluster matching is to search for the permutation of cluster labels that minimizes the Euclidean distance to a reference cluster solution. This criterion has been suggested for fuzzy consensus clustering (Gordon and Vichi, 2001; Dimitriadou et al., 2002), as well as for crisp consensus clustering (Strehl and Ghosh, 2002) and crisp cluster bagging (Dudoit and Fridlyand, 2003, BagClust1). In the crisp case, this criterion is simply trace maximization of matching table counts: cross-tabulating class memberships of two solutions and then permuting rows/columns of the matching table until the trace becomes maximal. To our knowledge, cluster publications and software differ in the algorithms used to obtain trace maximization, but do not question the Euclidean criterion per se.
For example, Dimitriadou et al. (2002) suggested a recursive heuristic to approximate trace maximization. It is known that trying all permutations has time complexity $O(K!)$, where $K$ denotes the number of clusters. The Hungarian method improves on this and achieves polynomial time complexity $O(K^3)$. Kuhn (1955) published a pencil and paper version, which was followed by J.R. Munkres’ executable version (Munkres, 1957) and extended to non-square matrices by Bourgeois and Lassalle (1971). For a list of further algorithmic approaches to this so-called linear sum assignment problem or weighted bipartite matching, see Hornik (2005).
However, scalability is not the only quality aspect of a matching algorithm. An important statistical feature of a matching algorithm is the following: if we match two random partitions, the matching algorithm should not systematically align the two partitions. We now show that the classic trace maximization does not generally possess this feature.
Assume a cluster algorithm that claims to identify an outlier in a sample of size $N = 100$ but which actually declares one case as ‘outlying’ at random. Now assume a procedure that draws two bootstrap samples and clusters each of them into 99% ‘normal’ cases and one ‘outlier’. In 1% of such procedures, the outlier picked in the second sample will randomly match the outlier picked in the first sample. In such cases, trace maximization matching will lead to a matching table as shown in Table 3. In the other 99%, there will be no match, which, by trace maximization, gives a matching table like that shown in Table 3. The resulting expected matching table is shown in Table 3.
We can see that under random clustering, we expect 98.02% on the main diagonal, which at first glance looks like a strong (non-random) match. Only after applying standard random correction (Cohen, 1960) is this confirmed to be a pure random match (Cohen’s kappa = 0). However, in a clustering context we have two objections against relying on such random corrections. First, as far as evaluation of cluster agreement is concerned, random corrections such as Cohen’s kappa or Hubert and Arabie’s corrected Rand index do not work properly, because spatial neighbors have an above-random chance of being clustered together in the absence of any cluster structure in the data. Therefore, agreement indices are too optimistic even with random correction. Second, and more importantly, in other contexts such as bagging there is no random correction available at all. If cluster sizes are (very) different, bagging cluster results will suffer because in standard trace maximization big randomly matched cells win over small cells representing non-random matches. Therefore, we are looking for a matching algorithm that does not systematically generate a strong diagonal under random conditions.
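The 98.02% figure can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: it assumes, as a simplification, that both cluster solutions label the same 100 cases, and it uses a brute-force search over permutations in place of the Hungarian method. The function names are hypothetical.

```python
from itertools import permutations

def trace_max_match(table):
    """Column permutation that maximizes the trace of `table` (brute force)."""
    k = len(table)
    return max(permutations(range(k)),
               key=lambda perm: sum(table[i][perm[i]] for i in range(k)))

def diagonal_fraction(table, perm):
    """Share of all observations on the matched diagonal."""
    n = sum(map(sum, table))
    return sum(table[i][perm[i]] for i in range(len(table))) / n

# Rows: first solution (normal, outlier); columns: second solution.
same_outlier = [[99, 0], [0, 1]]  # both runs flag the same case (probability 0.01)
diff_outlier = [[98, 1], [1, 0]]  # the runs flag different cases (probability 0.99)

d_same = diagonal_fraction(same_outlier, trace_max_match(same_outlier))
d_diff = diagonal_fraction(diff_outlier, trace_max_match(diff_outlier))
expected_diag = 0.01 * d_same + 0.99 * d_diff

# Cohen's kappa on the expected table: marginals are 0.99 / 0.01 in both solutions.
chance = 0.99 ** 2 + 0.01 ** 2
kappa = (expected_diag - chance) / (1 - chance)

print(round(expected_diag, 4))  # 0.9802
```

The chance agreement implied by the marginals is also 0.9802, so kappa is zero up to floating-point error, even though the raw diagonal looks like near-perfect agreement.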
3 Truematch algorithm
The problems with standard trace maximization described in the previous section result from focusing on raw counts in a situation with unequal marginal (cluster) probabilities. From other contexts, we know that this is not a good idea. Take the $\chi^2$-test for statistical independence of two categorical variables. It is not based on raw counts. Instead, the matching table of raw counts is transformed to another unit taking the marginals into account. Let $N$ denote the total number of observations, $n_{i\cdot}$ the number of observations in one row, $n_{\cdot j}$ the number of observations in one column and, finally, let $n_{ij}$ denote the number of observations in one cell of the $K$ x $K$ cluster crosstable. The first step in calculating $\chi^2$ is to calculate for each cell the number of expected counts under the assumption of independence:
$$e_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{N} \tag{1}$$
Then, using the expected counts in Equation 1, we transform the matrix of raw counts into a matrix of normalized squared deviations from the null model:
$$\chi^2_{ij} = \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \tag{2}$$
In order to cope with unequal cluster sizes, we suggest basing cluster matching on maximizing the trace of the signed normalized squared deviations
$$r_{ij} = \operatorname{sign}(n_{ij} - e_{ij})\, \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \tag{3}$$
rather than on maximizing the trace of the raw counts $n_{ij}$. And in order to avoid any systematic alignment not based on the data, we add a probabilistic component to the matching algorithm. Consequently we define the truematch algorithm as:
1. Randomly permute rows and columns of the matching table
2. Transform the matching table counts to signed normalized squared deviations using Equation 3
3. Apply a trace maximization algorithm like the Hungarian method to maximize the trace of the transformed table (in fact, the Hungarian method minimizes the negated trace)
4. Order the resulting row/column pairs in descending order, breaking ties at random
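The first three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation in the truecluster package: it assumes a square table, takes the signed deviation to be sign(n - e)(n - e)^2 / e with expected count e under independence, replaces the Hungarian method by a brute-force search (only feasible for small tables), and omits the final ordering step, which changes the order of the pairs but not which row is paired with which column. All function names are hypothetical.

```python
import random
from itertools import permutations

def signed_deviations(table):
    """Signed normalized squared deviations from independence, cell by cell."""
    n = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    dev = []
    for i, r in enumerate(table):
        dev_row = []
        for j, n_ij in enumerate(r):
            e_ij = row[i] * col[j] / n          # expected count under independence
            d = n_ij - e_ij
            # d * abs(d) keeps the sign of the deviation; empty margins give 0
            dev_row.append(d * abs(d) / e_ij if e_ij > 0 else 0.0)
        dev.append(dev_row)
    return dev

def truematch(table, rng=random):
    """Return a dict mapping each row label to its matched column label."""
    k = len(table)
    rows, cols = list(range(k)), list(range(k))
    rng.shuffle(rows)                            # step 1: random permutation,
    rng.shuffle(cols)                            # so ties are broken at random
    shuffled = [[table[i][j] for j in cols] for i in rows]
    dev = signed_deviations(shuffled)            # step 2
    best = max(permutations(range(k)),           # step 3 (brute force, O(K!))
               key=lambda p: sum(dev[i][p[i]] for i in range(k)))
    return {rows[i]: cols[best[i]] for i in range(k)}

# Outlier example: the two runs flag different cases.
print(truematch([[98, 1], [1, 0]]))
```

For the outlier table, the large cell of 98 counts has a slightly negative deviation, so the match pairs each ‘normal’ cluster with the other solution's ‘outlier’ cluster and moves the bulk of the counts off the diagonal.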
If no trace maximization algorithm like the Hungarian method is available, the matching can easily be done using the truematch heuristic similar to the heuristic suggested by Dimitriadou et al. (2002):
1. Calculate signed normalized squared deviations for all remaining cells of the matching table
2. Order all cells descending by signed deviation and then by raw count (breaking ties at random), and denote the first cell as the target cell
3. Match the row of the target cell to the column of the target cell
4. Remove the row and the column of the target cell from the matching table
5. If both the number of remaining rows and the number of remaining columns are at least two, repeat from step 1
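The loop above can be sketched in Python as follows. Several details are assumptions rather than statements from the text: deviations are recomputed from the marginals of the remaining sub-table, the raw cell count is used as the secondary sort key, and a single leftover row and column are paired at the end. The function name is hypothetical.

```python
import random

def truematch_heuristic(table, rng=random):
    """Greedy matching; returns (row, column) pairs in the order they were matched."""
    rows = list(range(len(table)))
    cols = list(range(len(table[0])))
    pairs = []
    while len(rows) >= 2 and len(cols) >= 2:
        # Marginals of the remaining sub-table (assumption: recomputed each round).
        total = sum(table[i][j] for i in rows for j in cols)
        row_sum = {i: sum(table[i][j] for j in cols) for i in rows}
        col_sum = {j: sum(table[i][j] for i in rows) for j in cols}
        cells = []
        for i in rows:
            for j in cols:
                e = row_sum[i] * col_sum[j] / total if total else 0.0
                d = table[i][j] - e
                dev = d * abs(d) / e if e > 0 else 0.0
                # Sort keys: deviation, raw count (assumed), random tie-breaker.
                cells.append((dev, table[i][j], rng.random(), i, j))
        _, _, _, i, j = max(cells)               # the target cell
        pairs.append((i, j))
        rows.remove(i)
        cols.remove(j)
    if len(rows) == 1 and len(cols) == 1:        # assumption: pair the leftovers
        pairs.append((rows[0], cols[0]))
    return pairs

print(sorted(truematch_heuristic([[98, 1], [1, 0]])))  # [(0, 1), (1, 0)]
```

Using a `while` loop rather than recursion means only the original table and the current list of cells are held in memory at any time.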
It is obvious that the truematch algorithm has runtime complexity $O(K^3)$ like the Hungarian method. The truematch heuristic also nicely translates into polynomial runtime. The number of residuals calculated to reduce the matching table from $k$ x $k$ to $(k-1)$ x $(k-1)$ is $k^2$, thus the total number of residuals calculated is
$$\sum_{k=2}^{K} k^2 = \frac{K(K+1)(2K+1)}{6} - 1$$
and, therefore, the truematch heuristic has runtime complexity $O(K^3)$ and memory complexity $O(K^2)$ if the recursive nature of the algorithm is realized using a while-loop. R package truecluster (Oehlschlägel, 2007a) implements the truematch algorithm in matchindex(method = "truematch") and the truematch heuristic in matchindex(method = "tracemax") efficiently through underlying C-code.
Applying the truematch algorithm and the truematch heuristic to the above example gives identical results: as in standard trace maximization matching, we find 1% random matches as in matching table 3, but for the 99% of cases without a random match, truematch generates two versions of matching tables, see Table 4. Both versions have shifted the majority of counts off-diagonal. Due to the probabilistic component in the 2nd step, this leads to the expected matching (Table 5) that has a weak trace. Under truematch, only systematic, non-random matches will result in a strong diagonal.
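To see why the counts move off the diagonal, the non-matching case can be worked through by hand. The following derivation is an illustration that assumes the signed deviation $r_{ij} = \operatorname{sign}(n_{ij} - e_{ij})\,(n_{ij} - e_{ij})^2 / e_{ij}$ with expected counts $e_{ij}$ under independence, applied to a table with cells $n_{11} = 98$, $n_{12} = n_{21} = 1$, $n_{22} = 0$.

```latex
\begin{align*}
e_{11} &= \tfrac{99 \cdot 99}{100} = 98.01, &
e_{12} = e_{21} &= \tfrac{99 \cdot 1}{100} = 0.99, &
e_{22} &= \tfrac{1 \cdot 1}{100} = 0.01 \\
r_{11} &= -\tfrac{(98 - 98.01)^2}{98.01} \approx -1.0 \times 10^{-6}, &
r_{12} = r_{21} &= +\tfrac{(1 - 0.99)^2}{0.99} \approx +1.0 \times 10^{-4}, &
r_{22} &= -\tfrac{(0 - 0.01)^2}{0.01} = -0.01 \\
r_{11} + r_{22} &\approx -0.0100 &
&< &
r_{12} + r_{21} &\approx +0.0002 \\
\mathrm{E}[\text{diagonal fraction}] &= 0.01 \cdot 1.00 + 0.99 \cdot 0.02 = 0.0298
\end{align*}
```

The swapped assignment has the larger trace, so only 2 of the 100 cases remain on the diagonal in the 99% of cases without a random match. Combined with the 1% of cases in which the outliers coincide, the expected diagonal fraction drops to about 3%, compared with 98.02% under raw trace maximization.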
We can quantify the benefit of truematch in this case by comparing expected values of certain agreement indices, cf. Table 6. The Rand index (Rand, 1971) and its random corrected version crand (Hubert and Arabie, 1985) are invariant against row/column permutations and, thus, do not differ. There is also no difference for kappa (Cohen, 1960). However, the big difference lies in the simple non-random-corrected diagonal fraction of observations: while trace maximization misleadingly results in an expected diagonal close to 1, truematch reduces the expectation of this non-random-corrected index close to zero. In the next two sections, we will explore the benefit of truematch in a bagging context, where the main diagonal defines the matching but no random correction is available.
4 The role of truematch in truecluster
The truecluster concept (Oehlschlägel, 2007b) suggests a cluster information criterion (CIC) that evaluates, for each cluster model (for each number of clusters $K$), an $N$ x $K$ vote matrix that aggregates votes over many resamples. The vote matrix is created by the multiple match cluster count (MMCC) algorithm using the truematch algorithm as follows:
1. Create an $N$ x $K$ vote matrix and initialize each cell with zero
2. Take a resample (with replacement) of size $N$ and use a base cluster algorithm to fit the $K$-cluster model to the resample. Then, use a suitable prediction method to determine cluster membership of the out-of-resample cases to get a complete cluster vector with $N$ elements
3. For each row in the vote matrix, add one vote (add 1) to the column corresponding to the cluster membership in the cluster vector
4. Repeat step 2
5. Estimate cluster memberships by row-wise majority count in the vote matrix (breaking ties at random), use the truematch algorithm or heuristic to align the new cluster vector with the estimated memberships, and rename the clusters in the new cluster vector like the corresponding clusters in the estimated memberships
6. For each row in the vote matrix, add one vote (add 1) to the column corresponding to the cluster membership in the renamed cluster vector
7. Repeat from step 4 until some reasonable convergence criterion is reached
8. Divide each cell in the vote matrix by its row sum to get a matrix of estimated cluster membership probabilities
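The voting loop above can be sketched in Python as follows. This is an illustration under several assumptions: `cluster_once` is a hypothetical callable that stands in for resampling, fitting the base cluster algorithm and predicting the out-of-resample cases, and returns one complete cluster vector; a fixed number of rounds replaces the convergence criterion; and the alignment uses a brute-force trace maximization of signed deviations instead of the Hungarian method.

```python
import random
from itertools import permutations

def align(new, ref, k, rng=random):
    """Rename the labels in `new` so that they best match `ref`."""
    n = len(new)
    table = [[0] * k for _ in range(k)]          # rows: ref labels, cols: new labels
    for a, b in zip(ref, new):
        table[a][b] += 1
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]

    def dev(i, j):
        e = row[i] * col[j] / n
        d = table[i][j] - e
        return d * abs(d) / e if e > 0 else 0.0

    perms = list(permutations(range(k)))
    rng.shuffle(perms)                           # random order breaks ties at random
    best = max(perms, key=lambda p: sum(dev(i, p[i]) for i in range(k)))
    rename = {best[i]: i for i in range(k)}      # new label best[i] becomes ref label i
    return [rename[c] for c in new]

def mmcc(cluster_once, n, k, rounds=50, rng=random):
    votes = [[0] * k for _ in range(n)]          # step 1
    c = cluster_once()                           # step 2
    for i in range(n):                           # step 3
        votes[i][c[i]] += 1
    for _ in range(rounds):                      # steps 4 to 7, fixed number of rounds
        c = cluster_once()
        est = [max(range(k), key=lambda j: (votes[i][j], rng.random()))
               for i in range(n)]                # majority count, random tie-break
        c = align(c, est, k, rng)
        for i in range(n):
            votes[i][c[i]] += 1
    return [[v / sum(r) for v in r] for r in votes]   # step 8

# Toy base algorithm: perfect separation, but with arbitrary label switching.
truth = [0] * 50 + [1] * 50

def cluster_once():
    flip = random.random() < 0.5
    return [1 - t if flip else t for t in truth]

probs = mmcc(cluster_once, n=100, k=2, rounds=20)
```

With this toy base algorithm every aligned vote falls into the same column for a given case, so each row of `probs` is crisp. With an unjustified split, the random tie-breaking spreads the votes across columns instead.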
Table 7 summarizes simulations with truecluster versus consensus clustering (100 cases, 10,000 replications; for details see MMCCconcensus.r in R package truecluster (Oehlschlägel, 2007a); the table is sorted and grouped by the magnitude of CIC values). For random data without cluster structure, we would expect a very ‘fuzzy’ probability matrix without clear preferences for any cluster. Furthermore, we would expect CIC to increase for models with more true clusters and to decrease if models try to distinguish more clusters than justified by the data.
Table 7 shows that the MMCC algorithm using truematch delivers on this expectation: CIC increases for justified clusters and declines for unjustified ones, even if unjustified clusters in the model are small. This works because once cluster decisions are unjustified, the truematch algorithm starts distributing its votes randomly across indistinguishable columns of the vote matrix and, thus, ‘fuzzifies’ the estimated membership probabilities. Compare that to consensus clustering (Dimitriadou et al., 2002) based on trace maximization obtained with R package clue (Hornik and Boehm, 2007; Hornik, 2005). Models with unjustified small clusters get CIC values as high as models without the unjustified cluster. This is a consequence of the trace maximization matching, which adds inappropriate stability to the voting. Take, for example, the “random 99:1” model, which is as unjustified as the “random 50:50” model but receives a much higher CIC value. The stability induced by the trace maximization matching results in quite a crisp probability matrix: for each row, we find high probability for one cluster and low probability for the other. If we assign cases to clusters based on the maximum probability per row, all cases are assigned to the same cluster. Such a degenerate matrix is not wrong but unfortunate. If we manually analyze it, we might detect that it actually represents a one-cluster (K=1) model. But if we are after automatic selection of models (number of clusters), it is misleading that the matrix does not represent K=2 but K=1. Analyzing a consensus cluster solution for degeneracies does not really help: the estimated probabilities can be biased even before the matrix formally degenerates.
| MMCC | true K | model K | H | RMC | I | CIC |
|---|---|---|---|---|---|---|
| justified 50 random 49:1 | 2 | 3 | 0.499 | 0.018 | 0.695 | 0.196 |

| consensus | true K | model K | H | RMC | I | CIC |
|---|---|---|---|---|---|---|
| justified 50 random 49:1 | 2 | 3 | 0.071 | 0.011 | 0.965 | 0.895 |

| Term | Meaning |
|---|---|
| true K | true number of clusters |
| model K | model number of clusters |
| RMC | relative model complexity |
| CIC | cluster information criterion (I-H) |
| single 100 | theoretical values for single group (no cluster) |
| random 50:50 | random clustering with 2 equal sized clusters |
| random 99:1 | random clustering with 2 unequal sized clusters |
| random 50:49:1 | random clustering with 3 unequal sized clusters |
| justified 50:50 | justified clustering with 2 equal sized clusters |
| justified 50 random 49:1 | 2 justified clusters, one randomly split unequal sized |
5 Simulation results
In order to systematically investigate the consequences of the different features of truematch versus simple trace maximization matching, we have carried out extensive simulations within the truecluster framework: we assume two clusters, vary their relative size and the reliability of a fictitious clustering algorithm, and compare the results gained via trace maximization versus truematch. We ran two versions of the simulations: in the non-fixed version, the relative size just determines sampling probabilities; in the fixed version, the fictitious clustering algorithm enforces the exact relative size of the two clusters. Details of the simulation are given in Appendix A.
Figure 1 shows information, uncertainty, and their difference (the CIC) for the non-fixed simulations. White areas denote simulation trials where the truecluster algorithm degenerated from a 2-cluster solution to a 1-cluster solution. The most notable difference is the big share of non-converged truecluster solutions using trace maximization, compared to the truematch algorithm. The estimated information, given reliability and skewness, is very similar and reasonable: information is highest for perfect reliability and equal cluster sizes, and is lower when reliability is reduced and/or cluster sizes are skewed.
By contrast, for uncertainty and for the CIC, trace maximization and truematch differ dramatically. Using trace maximization, the uncertainty estimate depends not only on reliability but is also artificially lower for higher skewness. As a consequence, cluster models with unequal cluster sizes get better CIC values than cluster models with equal cluster sizes. Using the truematch algorithm almost entirely avoids this undesirable pattern: the estimated uncertainty depends almost exclusively on reliability, not on skewness. The estimated CIC shows a very reasonable pattern: at high reliability, the CIC is highest for equal sized clusters, conforming with the entropy principle; at low reliability, the CIC is low however skewed the cluster sizes are. Only at very extreme skewness is the CIC biased downwards: clusters that are too small cannot be detected with too small a sample size. Extreme models are non-identifiable and the uncertainty estimate has high variance. Keep in mind that ‘extreme’ skewness corresponds to very few cases at a sample size of 100. The fixed simulations gave similar results (Figure 2).
In summary, trace maximization fails to estimate uncertainty independently of skewness and tends to overestimate the CIC for unequal cluster sizes or fails to converge. This restricts its usefulness for cluster evaluation and bagging. By contrast, the truematch algorithm works at almost any combination of reliability and skewness (with the exception of non-identifiable models, given the sample size).
6 Discussion
We have shown that trace maximization matching fails to behave sufficiently neutrally when matching clusterings. The problem arises generally but is especially important in contexts where random correction is not applicable. As an alternative, we have presented the truematch algorithm and heuristic, both of which probabilistically generate neutral expected matching tables and scale in polynomial time. Our simulations have confirmed that truematch avoids unjustified (expected) matchings induced by unequal cluster sizes. For the simulations done here, the truematch algorithm and the truematch heuristic behave identically. Since the truematch heuristic does not guarantee maximizing the trace of the signed deviations, we expect the truematch algorithm to be superior. However, there is a subtle difference: while the matching of the truematch algorithm depends solely on the signed deviations, the truematch heuristic uses both the signed deviations and the raw counts to select the row/column matches. Therefore, a final decision about an optimal matching algorithm needs more investigation.
Truematch is central to the MMCC algorithm, which creates the basis for the CIC evaluation in the truecluster framework and, thus, contributes to solving the decade-old problem of choosing the optimal number of clusters. Beyond that, cluster bagging, in general, could benefit from using truematch: the resulting $N$ x $K$ matrix is fuzzified rather than degenerate for unjustified cluster splits. This allows for better automated processing of such results. It is an open question whether the truematch algorithm also has advantages for consensus clustering, or whether different usages of cluster ensembles require different matching algorithms.
We would like to thank Dr. Stefan Pilz for reviewing this paper and giving valuable hints for improvement.
Appendix A
In this appendix, we give details concerning the simulations in Section 5. Assume a vector of length 100 with ‘true’ sample group memberships, in which one fraction of the cases is labeled 1 and the complementary fraction is labeled 0. Consider first the matrix of joint probabilities for a case’s true and clustered classification when the cluster algorithm perfectly separates 0 from 1 (perfect reliability).
Consider next the matrix of joint probabilities for a case’s true and clustered classification when the cluster algorithm makes a random guess when separating 0 from 1 (zero reliability).
Finally, consider the matrix of joint probabilities for a case’s true and clustered classification when the cluster algorithm has a given reliability between these extremes.
This matrix yields two conditional probabilities that the clustering algorithm identifies the true class, given the true class, one for each class.
For each value of the relative cluster size and each value of the reliability, we simulate aggregation of 1000 bootstrap samples from the vector of true memberships. For each bootstrap sample, our fictitious cluster algorithm assigns each case to its true class with the corresponding conditional probability and to the other class with the complementary probability. The resulting cluster memberships are matched versus the (current) estimated cluster memberships of the cases in the bootstrap sample. If either of the two membership vectors does not contain two classes, the bootstrap sample is dropped and replaced by another one. Differently from the algorithm in Section 4, we do not predict cluster memberships of the out-of-bag cases. We use the cluster memberships of the bootstrap sample directly instead of a completed cluster vector; consequently, the rows of the vote matrix are not guaranteed to have aggregated an equal number of votes. For all combinations of relative cluster size and reliability (the resulting 99x101 truecluster models), we calculate information, uncertainty, and CIC (Oehlschlägel, 2007b). These values are visualized using color coding, and contour lines are added based on a loess smooth. To create the fixed version, the complete procedure is repeated, additionally enforcing a fixed fraction by moving randomly selected observations in the cluster vector from the too big group to the too small one, analogous to a cluster algorithm that forces certain cluster sizes. The R-code doing the simulation is available in truematch.r in package truecluster (Oehlschlägel, 2007a).
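The fictitious cluster algorithm can be sketched in Python as follows. This is one possible reading of the description, not the code in truematch.r: the symbols `p` (fraction of class 1) and `r` (reliability) are hypothetical names, and the sketch assumes that the joint probability matrix at reliability `r` is the mixture of `r` times the perfect-separation matrix and `1 - r` times an independent random guess with the same marginals.

```python
import random

def correct_probabilities(p, r):
    """P(clustered as true class | true class), for class 1 and for class 0.

    Assumption: joint matrix = r * perfect + (1 - r) * independent guess with
    marginals p and 1 - p. Dividing the diagonal cells by the class
    probabilities gives the two conditional probabilities below.
    """
    return r + (1 - r) * p, r + (1 - r) * (1 - p)

def fictitious_clustering(truth, p, r, rng=random):
    """Assign each case to its true class with the conditional probability above."""
    hit1, hit0 = correct_probabilities(p, r)
    clustered = []
    for t in truth:
        hit = hit1 if t == 1 else hit0
        clustered.append(t if rng.random() < hit else 1 - t)
    return clustered

# One bootstrap sample from a vector of 100 true memberships with 30% ones.
truth = [1] * 30 + [0] * 70
sample = [random.choice(truth) for _ in range(100)]
clustered = fictitious_clustering(sample, p=0.3, r=0.8)
```

Under this reading, `r = 1` reproduces the true memberships exactly, while `r = 0` makes the clustered label independent of the true one.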
References
- Akaike (1973) H. Akaike. Information theory and an extension of the maximum likelihood principle. In B.N. Petrov and F. Csáki, editors, Second International Symposium on Information Theory, pages 267–281, Budapest, 1973. Akadémiai Kiadó. Reprinted in Breakthroughs in Statistics, eds Kotz, S. & Johnson, N.L. (1992), volume I, pp. 599–624. New York: Springer.
- Akaike (1974) H. Akaike. A new look at statistical model identification. IEEE Transactions on Automatic Control, 19:716–723, 1974.
- Bakker and Heskes (2001) Bart Bakker and Tom Heskes. Model clustering and resampling, 2001. URL citeseer.ist.psu.edu/bakker00model.html.
- Ben-Hur et al. (2002) A. Ben-Hur, A. Elisseeff, and I. Guyon. A stability based method for discovering structure in clustered data. Pac Symp Biocomputing, 7:6–17, 2002.
- Bourgeois and Lassalle (1971) François Bourgeois and Jean-Claude Lassalle. An extension of the Munkres algorithm for the assignment problem to rectangular matrices. Communications of the ACM, 14(12):802–804, 1971.
- Breiman (1996) L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
- Cohen (1960) Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46, 1960.
- Dimitriadou et al. (2002) E. Dimitriadou, A. Weingessel, and K. Hornik. A combination scheme for fuzzy clustering. Journal of Pattern Recognition and Artificial Intelligence, 16:901–912, 2002.
- Dolnicar and Leisch (2000) S. Dolnicar and F. Leisch. Behavioural market segmentation using the bagged clustering approach based on binary guest survey data: Exploring and visualizing unobserved heterogeneity. Tourism Analysis, 5(2-4):163–170, 2000.
- Dudoit and Fridlyand (2002) S. Dudoit and J. Fridlyand. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7):research0036.1–0036.21, 2002.
- Dudoit and Fridlyand (2003) S. Dudoit and J. Fridlyand. Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19(9):1090–1099, 2003.
- Gordon and Vichi (2001) A. D. Gordon and M. Vichi. Fuzzy partition models for fitting a set of partitions. Psychometrika, 66:229–248, 2001.
- Hornik (2005) Kurt Hornik. A CLUE for CLUster Ensembles. Journal of Statistical Software, 14(12), September 2005. URL www.jstatsoft.org/v14/i12/.
- Hornik and Boehm (2007) Kurt Hornik and Walter Boehm. clue: Cluster ensembles, 2007. R package version 0.3-11.
- Hubert and Arabie (1985) Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
- Jain and Moreau (1988) A. K. Jain and J. Moreau. Bootstrap techniques in cluster analysis. Pattern Recognition, 20:547–568, 1988.
- Kuhn (1955) H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:225–231, 1955.
- Leisch (1999) Friedrich Leisch. Bagged clustering. Technical Report Working Paper 51, SFB Adaptive Information Systems and Modelling in Economics and Management Science, Vienna University of Economics and Business Administration in cooperation with the University of Vienna, Vienna University of Technology, 1999.
- Monti et al. (2003) Stefano Monti, Pablo Tamayo, Jill Mesirov, and Todd Golub. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52:91–118, 2003.
- Moreau and Jain (1987) J. V. Moreau and A. K. Jain. The bootstrap approach to clustering. In P.A. Devijver and J. Kittler, editors, Pattern Recognition: Theory and Applications, volume 30 of NATO ASI Series F, pages 63–71. Springer, 1987.
- Munkres (1957) J. Munkres. Algorithms for the assignment and transportation problems. J. Siam, 5:32–38, 1957.
- Oehlschlägel (2007a) Jens Oehlschlägel. Truecluster: an algorithmic framework for robust and scalable clustering, 2007a. URL www.truecluster.com. R package version 0.3 (version 1.0 and higher will also be hosted at CRAN.R-project.org).
- Oehlschlägel (2007b) Jens Oehlschlägel. Truecluster: robust scalable clustering with model selection. Submitted to JMLR, 2007b.
- Rand (1971) W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846–850, 1971.
- Roth et al. (2002) Volker Roth, Tilman Lange, Mikio Braun, and Joachim M. Buhmann. A resampling approach to cluster validation. In Wolfgang Härdle and Bernd Rönz, editors, Proceedings in Computational Statistics: 15th Symposium Held in Berlin (COMPSTAT2002), pages 123–128, Heidelberg, 2002. Physica-Verlag.
- Schwarz (1978) G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
- Strehl and Ghosh (2002) A. Strehl and J. Ghosh. Cluster ensembles — a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2002.
- Tibshirani et al. (2001) Robert Tibshirani, Guenther Walther, David Botstein, and Patrick Brown. Cluster validation by prediction strength. Technical report, Stanford University, 2001.