Clustering strategy and method selection

Christian Hennig,
Department of Statistical Science, University College London
Abstract

Note: This paper is a chapter in the forthcoming Handbook of Cluster Analysis, Hennig et al. (2015). For definitions of basic clustering methods and some further methodology, other chapters of the Handbook are referred to. To read this version of the paper without the Handbook, some knowledge of cluster analysis methodology is required.

The aim of this chapter is to provide a framework for all the decisions that are required when carrying out a cluster analysis in practice. A general attitude to clustering is outlined, which connects these decisions closely to the clustering aims in a given application. From this point of view, the chapter then discusses aspects of data processing such as the choice of the representation of the objects to be clustered, dissimilarity design, transformation and standardization of variables. Regarding the choice of the clustering method, it is explored how different methods correspond to different clustering aims. Then an overview of benchmarking studies comparing different clustering methods is given, as well as an outline of theoretical approaches to characterize desiderata for clustering by axioms. Finally, aspects of cluster validation, i.e., the assessment of the quality of a clustering in a given dataset, are discussed, including finding an appropriate number of clusters, testing homogeneity, internal and external cluster validation, assessing clustering stability and data visualization.

1 Introduction

In Hennig et al. (2015), a large number of cluster analysis methods have been introduced, and in any situation in which a clustering is needed, the user is faced with a potentially overwhelming number of options. The current paper is about how the required choices can be made. Milligan (1996) listed seven steps of a cluster analysis that require decisions, namely

1. choosing the objects to be clustered,

2. choosing the measurements/variables,

3. standardization of variables,

4. choosing a (dis-)similarity measure,

5. choosing a clustering method,

6. determining/deciding the number of clusters,

7. interpretation, testing, replication, cluster validation.

I will treat all but the first one (general principles of sampling and experimental design apply), not sticking exactly to this order. The chapter focuses on the general philosophy behind the required choices, what this means in practice, and on some areas of research. This has to be combined with knowledge on clustering methods as given elsewhere in this volume. Some more discussion of the above issues can be found in Milligan (1996) and standard cluster analysis books such as Jain and Dubes (1988); Kaufman and Rousseeuw (1990); Gordon (1999); Everitt et al. (2011).

The point of view taken here, previously outlined in Hennig and Liao (2013) and also shared by other authors (von Luxburg et al. (2012)), is that there is no such thing as a universally “best” clustering method. Different methods should be used for different aims of clustering. The task of selecting a clustering method implies a proper understanding of the meaning of the data, the clustering aim and the available methods, so that a suitable method can be matched to what the application requires. Although many experienced experts in the field, including the authors of the books cited above, agree with this view, there is not much advice in the literature on how the specific requirements of the application can be connected with the available methods. Instead, cluster analysis methods have often been compared on simulated data or data with known classes, in order to find a “best” one disregarding the research context. Such comparisons are of some use, particularly because they reveal, in some cases, that methods may not be up to what they were supposed to do. Still, it would be more useful to have more specific information about what kind of method is connected to what kind of clustering task, defined by clustering aim, required cluster concept, and potential structure in the data.

The present chapter goes through the most essential steps of making the necessary decisions for a cluster analysis. It starts in Section 2 with a discussion of the background, relating the aims of clustering to the cluster concepts that may be of interest in a specific situation. Section 3 looks at the data to be clustered. Often it is useful to pre-process the data before applying a clustering method, by defining new variables, dissimilarity measures, transforming or selecting features. Such operations have an often fundamental impact on the resulting clustering. Note that I will use the term “features” to refer to the variables eventually used for clustering if a cluster analysis method for an “objects times features”-matrix as input is applied, whereas the term “variables” will be used in a more general sense for measurements characterizing the objects used in the clustering process, potentially later to be used as clustering features, or for computing dissimilarity measures or new variables.

Section 4 is on comparing clustering methods. This encompasses the decision which method fits a certain clustering aim, measurement of the quality of clustering methods, benchmark simulation studies, and some theoretical work on characterizing clusterings and clustering methods. In many cases, though, there may not be enough precise information about the clustering aim and cluster concepts of interest, so that the user may not be able to pinpoint exactly what method is needed. Also, it may be discovered that the clustering structure of the data may differ from what was expected in advance, and other methods than initially considered may look promising. Section 5 is about evaluating and comparing outcomes of clustering methods, before the chapter is concluded.

2 Clustering aims and cluster concepts

In various places in the literature it is noted that there is no generally accepted definition of a cluster. This is not surprising, given the many different aims for which clusterings are used. Here are some examples:

• delimitation of species of plants or animals in biology,

• medical classification of diseases,

• discovery and segmentation of settlements and periods in archeology,

• image segmentation and object recognition,

• social stratification,

• market segmentation,

• efficient organization of data bases for search queries.

There are also quite general tasks for which clustering is applied in many subject areas:

• exploratory data analysis looking for “interesting patterns” without prescribing any specific interpretation, potentially creating new research questions and hypotheses,

• information reduction and structuring of sets of entities from any subject area for simplification, more effective communication, or more effective access/action such as complexity reduction for further data analysis,

• investigating the correspondence of a clustering in specific data with other groupings or characteristics, either hypothesized or derived from other data.

Depending on the application, it may differ a lot what is meant by a “cluster”, and this has strong implications for the methodological strategy. Finding an appropriate clustering method means that the cluster definition and methodology have to be adapted to the specific aim of clustering in the application of interest.

A key distinction can be made between “realist” aims of clustering, concerning the discovery of some meaningful real structure corresponding to the clusters, and “constructive” aims, where researchers intend to split up the data into clusters for pragmatic reasons, regardless of whether there is some essential real difference between the resulting groups. This distinction can be roughly connected to the choice of clustering methodology. For example, some clustering criteria such as k-means (Hennig et al. (2015)) produce homogeneous clusters in the sense that all observations are assigned to the closest centroid, and large distances within clusters are heavily penalized. This is useful for a number of constructive clustering aims. On the other hand, k-means does not pay much attention to whether or not the clusters are clearly separated by gaps, and does not tolerate large variance and spread of points within clusters, which can occur in clusters that correspond to real patterns (for example objects in images).

However, the distinction between realist and constructive clustering aims is not as clear cut as it may seem at first sight. Categorization is a very basic human activity that is directly connected with the emergence of language. Whenever human beings speak of real patterns, this can only refer to categories that are aspects of human cognition and can be expressed in language, which can be seen as a pragmatic human construct (Van Mechelen et al. (1993) review cognitive theories of categorization with a view to connecting them to inductive data analysis including clustering). In a related manner, researchers with realist clustering aims should not hope that the data alone can reveal real structure; constructive impact of the researchers is needed to decide what counts as real.

The key issue in realist clustering is how the real structure the researchers are interested in is connected to the available data. This requires subject matter knowledge, but it also requires decisions by the researchers. “Real structure” is often understood as the existence of an unobserved categorical variable the values of which define the “true” clusters. Such an idea is behind the popular use of datasets with given true classes for benchmarking of cluster analysis methods. But neither can it be taken for granted that the known categories are the only existing ones that could qualify as “real clusters”, nor do such categories necessarily correspond to data analytic clusters. For example, male/female is certainly a meaningful categorization of human beings, but there may not even be a significant difference between men and women regarding the results of a certain attitude survey, let alone separated clusters corresponding to sex. Usually the objects represented in the dataset can be partitioned into real categories in many ways. Also, different cluster analysis methods will produce different clusterings, which may more or less well correspond to patterns that are real in potentially different ways. This means that in order to decide about appropriate cluster analysis methodology, researchers need to think about what data analytic characteristics the clusters they are aiming at are supposed to have. I call this the “cluster concept” of interest in a specific study.

The real patterns of interest may be more or less closely connected to the available data. For example, in biological species delimitation, the concept of a species is often defined in terms of interbreeding (there is some controversy about the precise definition, see Hausdorf (2011)). But interbreeding patterns are not usually available as data. Species are nowadays usually delimited by use of genetic data, but in the past, and also occasionally in the present in an exploratory manner, species were seen as the source of a real grouping in phenotype data. In any case, the researchers need some idea about how true distinctions between species are connected to patterns in the data. Regarding genetic data, this means that knowledge needs to be used about what kind of similarity arises from persistent genetic exchange inside a species, and what kind of separation arises between distinct species. There may be subgroups of individuals in a species between which there is little actual interbreeding (because potential interbreeding suffices for species definition), for example between geographically separated groups, and consequently not as much genetic similarity as one would naively expect. Furthermore there are various levels of classification in biology, such as families and genera above and subspecies below the level of species, so that data analytic clusters may be found at several levels, and the researchers may need to specify more precisely how much similarity within and separation between clusters is required for finding species.

Such knowledge needs to be reflected in the cluster analysis method to be chosen. For example, species may be very heterogeneous regarding geographical distribution and size, and therefore a clustering method that implicitly tends to bring forth clusters that are very homogeneous such as k-means or complete linkage is inappropriate.

In some cases, the data are more directly connected to the cluster definition. In species delimitation, there may be interbreeding data, in which case researchers can specify the requirements of a clustering more directly. This may imply graph theoretic clustering methods and a specification of how much connectedness is required within clusters, although such decisions can often not be made precise because of missing information arising from sampling of individuals, missing data etc. On the other hand, the connection between the cluster definition and the data may be less close, as in the case of phenotype data used for delimiting species, in which case the researchers may not have strong information about how the clusters they are interested in are characterized in the data, and some speculation is needed in order to decide what kind of clustering method may produce something useful.

In many situations different groupings can be interpreted as real, depending on the focus of the researchers. Social classes for example can be defined in various ways. Marx made ownership of means of production the major defining characteristic of different classes, but social classes can also be defined by looking at patterns of communication and contact, or occupation, or education, or wealth, or by a mixture of these (Hennig and Liao (2013)). In this case, a major issue for data clustering is the selection of the appropriate variables and measurements, which implicitly defines what kinds of social classes can be found.

The example of social stratification also illustrates that there is a gradual transition rather than a clear cut between realist and constructive clustering aims. According to some views (such as the Marxist one) social classes are an essential and real characteristic of society, but according to other views, in many societies there is no clear delimitation between social classes that could justify calling these classes “real”, despite the existence of real inequality. Social classes can then still be used as a convenient tool for structuring the inequality in such societies.

Regarding constructive clustering aims, it is obvious that researchers need to decide about the desired “cluster concept”, or in other words, about the characteristics that their clusters should have. The discussion above implies that this is also the case for realist clustering aims, for which the required cluster concept needs to be derived from knowledge about the nature of the real clusters, and from a decision of the researchers about their focus of interest if (as is usually the case) the existence of more than one real clustering is conceivable. For constructive clustering, the required cluster concept needs to be connected to the practical use that is intended to be made of the clusters.

Also where the primary clustering aim is constructive, realist aims may still be of interest insofar as if indeed some real grouping structure is clearly manifest in the data, many constructive aims will be served well by having this structure reflected in the clustering. For example, market segmentation may be useful regardless of whether there are really meaningfully separated groups in the data, but it is relevant to find them if they exist.

Here is a list of potential characteristics of clusters that may be desired, and that can be checked using the available data. A number of these are related to the “formal categorization principles” listed in Section 14.2.2.1 of Van Mechelen et al. (1993).

1. Within-cluster dissimilarities should be small.

2. Between-cluster dissimilarities should be large.

3. Clusters should be fitted well by certain homogeneous probability models such as the Gaussian or a uniform distribution on a convex set, or, if appropriate, by linear, time series or spatial process models.

4. Members of a cluster should be well represented by its centroid.

5. The dissimilarity matrix of the data should be well represented by the clustering (i.e., by the ultrametric induced by a dendrogram, or by defining a binary metric “in same cluster/in different clusters”).

6. Clusters should be stable.

7. Clusters should correspond to connected areas in data space with high density.

8. The areas in data space corresponding to clusters should have certain characteristics (such as being convex or linear).

9. It should be possible to characterize the clusters using a small number of variables.

10. Clusters should correspond well to an externally given partition or values of one or more variables that were not used for computing the clustering.

11. Features should be approximately independent within clusters.

12. All clusters should have roughly the same size.

13. The number of clusters should be low.

When trying to measure these characteristics, they have to be made more precise, and in some cases it matters a lot how exactly they are defined. Take no. 1, for example. This may mean that all within-cluster dissimilarities should be small without exception (i.e., the maximum should be small, as required by complete linkage hierarchical clustering), or their average, or a high quantile of them. These requirements may look similar at first sight but are very different regarding the integration of outliers in clusters. Similarly, having large between-cluster dissimilarities (no. 2) may emphasize gaps by looking at the smallest dissimilarities between each two clusters, or it may rather mean that the central areas of the clusters are well distributed in data space. As another example, stability can refer to sampling other data from the same population, to adding “noise”, or to comparing results from different clustering algorithms.
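As a rough illustration of how much the precise definition of no. 1 matters, the following sketch compares the maximum, the mean and a high quantile of the within-cluster dissimilarities for a toy cluster with and without a single outlier. The data, the function name `within_cluster_summaries` and the choice of the 0.9 quantile are assumptions made for this example only.

```python
import math
from itertools import combinations


def within_cluster_summaries(points, quantile=0.9):
    """Return (maximum, mean, empirical quantile) of all pairwise distances.

    `points` are one-dimensional values; distance is the absolute difference.
    """
    dists = sorted(abs(a - b) for a, b in combinations(points, 2))
    # Empirical quantile: smallest sorted value with at least a fraction
    # `quantile` of all pairwise distances at or below it.
    idx = max(0, math.ceil(quantile * len(dists)) - 1)
    return dists[-1], sum(dists) / len(dists), dists[idx]


# Hypothetical toy cluster: 25 evenly spaced values, then the same
# cluster with one far-away observation added.
compact = [0.1 * i for i in range(25)]
with_outlier = compact + [10.0]

max_c, mean_c, q90_c = within_cluster_summaries(compact)
max_o, mean_o, q90_o = within_cluster_summaries(with_outlier)

print("compact:      max=%.2f mean=%.2f q90=%.2f" % (max_c, mean_c, q90_c))
print("with outlier: max=%.2f mean=%.2f q90=%.2f" % (max_o, mean_o, q90_o))
```

In this toy setting the maximum is dominated by the outlier, the mean is moved moderately, and the 0.9 quantile stays within the range of the compact cluster, because fewer than 10% of all pairs involve the outlier.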

Some of these characteristics are in conflict with others in some datasets. Connected areas with high density may include very large distances, and may have undesired (e.g., non-convex or nonlinear) shapes. Representing objects by centroids may bring forth some clusters with little or no gap between them. Having clusters of roughly equal size forces outliers to be integrated in distant clusters, which produces large within-cluster dissimilarities.

Deciding about such characteristics is the key to linking the clustering aim to an appropriate clustering method. For example, if a database of images should be clustered so that users can be shown a single image to represent a cluster, no. 4 is most important. Useful market segments need to be addressed by non-statisticians and should therefore normally be represented by few variables, on which dissimilarities between members should be low. Similar considerations can be made for realist clustering aims, see above.

For choosing a clustering method, it is then necessary to know how the available methods correspond to the required characteristics. Some methods optimize certain characteristics directly (such as k-means for no. 4), and in some further cases experience and research suggest typical behavior (k-means tends to produce clusters of roughly equal size, whereas methods looking for high-density areas may produce clusters of very variable size). See Section 4.1 for more comments on specific methods. Other characteristics such as stability are not involved in the definition of most clustering methods, but can be used to validate clusterings and to compare clusterings from different methods.

The task of choosing a clustering method is made harder by the fact that in many applications more than one of the listed characteristics is relevant. Clusterings may be used for several purposes, or desired characteristics may not be well defined, e.g., in exploratory data analysis, or for realist clustering aims in cases where the connection between the interpretation of the clusters and the data is rather loose. Also, a misguided desire for uniqueness and objectivity makes many researchers reluctant to specify desired characteristics and choose a clustering method accordingly, because they hope that there is a universally optimal method that will just produce “natural” clusters. Probably for such reasons there is currently almost no systematic research investigating the characteristics of methods in terms of the various cluster characteristics.

3 Data processing

The decision about what data to use, including how to choose, transform and standardize variables, and if and how to compute a dissimilarity measure, is an important part of the methodological strategy in cluster analysis. It often has a major impact on the clustering result, and is sometimes more important than the choice of the clustering method.

3.1 Choice of representation

To some extent the data format restricts the choice of clustering methods; there are specialized methods for continuous, ordinal, categorical and mixed type data, dissimilarity data, graphs, time series, spatial data etc. But often data can be represented in different ways. For example, a collection of time series with 100 time points can be represented as points in 100-dimensional Euclidean space, but they can also be represented by autocorrelation parameters of a time series model fitted to them, by wavelet features or some other low dimensional representation, or by dissimilarity measures which may involve some alignment or “time warping”, see Hennig et al. (2015). On the other hand, dissimilarity data can be transformed to Euclidean data using multidimensional scaling (MDS) techniques. This means that the researcher often can choose whether the objects are represented by features, dissimilarities, or in another way, for example by vertices in a graph.
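The effect of the representation can be sketched with three toy time series of 100 time points. The series, the function names and the use of the plain lag-1 sample autocorrelation (as a simple stand-in for the parameters of a fitted time series model) are assumptions made for this illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two series seen as points in R^100."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def lag1_autocorrelation(series):
    """Lag-1 sample autocorrelation, used here as a one-dimensional feature."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1))
    den = sum((v - mean) ** 2 for v in series)
    return num / den


# Hypothetical series: two smooth waves on different levels and one
# rapidly alternating series on the lower level.
smooth_low = [math.sin(0.2 * t) for t in range(100)]
smooth_high = [10.0 + math.sin(0.2 * t) for t in range(100)]
jagged_low = [(-1.0) ** t for t in range(100)]

# Representation 1: raw values in 100-dimensional Euclidean space.
raw_low_high = euclidean(smooth_low, smooth_high)
raw_low_jagged = euclidean(smooth_low, jagged_low)

# Representation 2: a single autocorrelation feature per series.
ac_low = lag1_autocorrelation(smooth_low)
ac_high = lag1_autocorrelation(smooth_high)
ac_jagged = lag1_autocorrelation(jagged_low)

print("raw distances:   ", raw_low_high, raw_low_jagged)
print("autocorrelations:", ac_low, ac_high, ac_jagged)
```

On the raw values the smooth low series is closer to the alternating series, because both share the same level. On the autocorrelation feature the two smooth series are nearly identical. Which of these groupings is appropriate depends on what "belonging together" is supposed to mean in the application.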

Generally, dissimilarity measures are a suitable basis for clustering if the cluster concept is mainly based on the idea that similar objects should be grouped together and dissimilar objects should be in different clusters. Dissimilarity measures can be constructed for most data types. On the other hand, clusters characterized by distributional and geometrical shapes and clusters with potentially high within-cluster variability or skewness are found better with objects characterized by features instead of dissimilarities.

The choice of representation should be guided by the question how objects qualify to belong together in the same cluster. For example, if the data are time series, there are various different possible concepts of “belonging together”. Time series may belong together if their values are similar most of the time, which is appropriate if the plain values play a large role in the assessment of similarity (for example cigarettes smoked per day in research about smoking behavior). A musical melody can be played at different speeds and in different keys, so that two musical melodies may still be assessed as similar despite pitch values being quite different and changes in pitch happening at different times. In other applications, such as particle detection by electrodes, the characteristics of a single event that happens at a certain potentially flexible time point (such as a value going up and then down again) may be important, and having detected such an event, some specific characteristics of it may represent the objects in the most useful manner.

A central issue regarding the representation is the choice of variables that are either used as features to represent the objects or on which a dissimilarity definition is based. Both subject matter and statistical considerations play a role here. From a statistical point of view, a variable could be seen as uninformative if it is strongly correlated with other variables and does not carry information on its own as long as certain other variables are kept, or if it is not connected to any “real” clustering characterized in the data for example by high density regions. Furthermore, in some situations (for example using gene expression data) the number of variables may simply be so large that cluster analysis methods become unstable. There are various automatic methods for variable selection in connection with clustering, see Hennig et al. (2015) for clustering variables at the same time as observations, and Alelyani et al. (2014) for a recent survey. Popular classical methods such as principal component analysis (PCA) and MDS are occasionally used for constructing informative variables. These, however, are based on objective functions (variance, stress) that do not have any relation to clustering, and may therefore miss information that is important for clustering. There are some projection pursuit-type methods that aim at finding low-dimensional representations of the data particularly suitable for clustering (Bolton and Krzanowski (2003); Tyler et al. (2009)).

It is important to realize, though, that the variables involved in clustering define the meaning of the clusters. Changing the variables implies changing the meaning. If the researchers have a clear idea about the meaning of the clusters of interest, it is problematic to select variables in an automatic manner. For example, Hennig and Liao (2013) were interested in socio-economic stratification, for which information on income, savings, education and housing is essential. Even if for example incomes do not show any clear grouping structure, or are correlated strongly with another variable, this does not constitute a valid reason to exclude this variable for constructing a clustering that is meant to reflect a meaningful socio-economic partition of a population. A stratification based on automatically selected variables that cluster in a nicer way may be of exploratory interest, but does not fulfill the aim of the researchers. One could argue that in case of correlation between income and another variable, savings, say, the information from income is retained as long as savings (or a linear combination of them both, as would be generated by PCA) is still used as a feature for clustering. But this is not true, because the fact that the information is shared by two variables that in terms of their meaning are essential for the clustering aim is additional information that should not be lost.

Another issue is that variables can play different roles, which has different implications. For example, a dataset may include spatial coordinates and other variables (e.g., regional data on avalanche risk, or color information in image segmentation). Depending on the role that the spatial information should play, spatial coordinates can be included in the clustering process as features together with the others (which implies that regional similarity will somehow be traded off against similarity regarding the other variables in the clustering process), or they could define constraints (e.g., clusters on the other variables could be constrained to be spatially connected), or they could be ignored for clustering, but could be used afterward to validate the resulting clusters or to analyze their spatial structure. For avalanche risk mapping, for example, one may take the latter approach for detailed maps if spatial information is discretized and there is enough data at each point, but one may want to impose spatial constraints if data is sparser or if the map needs to be coarser because it is used by decision makers instead of hikers.

Often there is a good reason for not choosing the variables automatically from the data, but rather guided by the aim of clustering. In some cases dimension reduction can be achieved by the definition of meaningful new indices summarizing information in certain variables. On the other hand, automatic variable selection may yield interesting clusterings if the aim is mainly exploratory, or if there is no prior information about the importance of the variables and it is suspected that some of them are uninformative “noise”.

3.2 Dissimilarity definition

In order to apply dissimilarity based methods and to measure whether a clustering method groups similar observations together, a formal definition of “dissimilarity” is needed (or “proximity”, which refers to either dissimilarity or similarity, as sometimes used in the literature; their treatment is equivalent and there are a number of transformations between dissimilarity and similarity measures, the simplest and most popular of which probably is “dissimilarity = maximum similarity minus similarity”). In many situations, dissimilarities between objects cannot be measured directly, but have to be constructed from some measurements of variables of the objects of interest. Directly measured dissimilarities occur for example in comparative experiments in psychology and market research.

There is no unique “true” dissimilarity measure for any dataset; the dissimilarity measurement has to depend on the researchers’ concept of what it means to treat two objects as “similar”, and therefore on the clustering aim.

Mathematically, a dissimilarity is a function $d:\mathcal{X}^2\to\mathbb{R}$, $\mathcal{X}$ being the object space, so that $d(x,y)=d(y,x)\ge 0$ and $d(x,x)=0$ for $x,y\in\mathcal{X}$. There is some work on asymmetric dissimilarities (Okada (2000)) and multiway dissimilarities defined between more than two objects (Diatta (2004)). A dissimilarity fulfilling the triangle inequality

$$d(x,y)+d(y,z)\ge d(x,z),\quad x,y,z\in\mathcal{X},$$

is called a “distance” or “metric”. The triangle inequality is connected to Euclidean intuition and therefore seems to be a “natural” requirement, but in some applications it is not appropriate. Hennig and Hausdorf (2006) argue, e.g., that for presence-absence data of species on regions two species A and B are very dissimilar if they are present on two small disjoint areas, but both should be treated as similar to a species C covering a larger area that includes both A and B, if clusters are to be interpreted as species grouped together by palaeoecological processes.
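Whether a given dissimilarity fulfills the triangle inequality on a set of objects can be checked by brute force over all triples. The sketch below uses three hypothetical species A, B and C given as sets of occupied regions, in the spirit of the example above. The Kulczynski dissimilarity appears here only as one known example of a dissimilarity that is not a metric; the text above does not name a specific measure, so this choice is an assumption of the example.

```python
from itertools import permutations


def kulczynski(a, b):
    """Kulczynski dissimilarity between two non-empty sets of regions."""
    common = len(a & b)
    return 1.0 - 0.5 * (common / len(a) + common / len(b))


def jaccard(a, b):
    """Jaccard dissimilarity between two non-empty sets of regions."""
    return 1.0 - len(a & b) / len(a | b)


def triangle_violations(objects, d, tol=1e-12):
    """Return all name triples (x, y, z) with d(x,y) + d(y,z) < d(x,z)."""
    return [
        (x, y, z)
        for x, y, z in permutations(objects, 3)
        if d(objects[x], objects[y]) + d(objects[y], objects[z])
        < d(objects[x], objects[z]) - tol
    ]


# Hypothetical presence data: A and B occupy small disjoint areas,
# C occupies a larger area that includes both.
species = {"A": {1, 2}, "B": {3, 4}, "C": set(range(1, 11))}

print("Kulczynski violations:", triangle_violations(species, kulczynski))
print("Jaccard violations:   ", triangle_violations(species, jaccard))
```

With the Kulczynski dissimilarity, A and B are both fairly similar to C but maximally dissimilar to each other, so the path via C violates the triangle inequality. The Jaccard dissimilarity, which is a metric, treats A and B as quite dissimilar to C and shows no violation.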

A vast number of dissimilarity measures has been proposed, some for rather general purposes, some for more specific applications (dissimilarities between shapes (Veltkamp and Latecki (2006)), melodies (Müllensiefen and Frieler (2007)), geographical species distribution areas (Hennig and Hausdorf (2006)), etc.). Chapter 3 in Everitt et al. (2011) gives a good overview of general purpose dissimilarities. Here are some basic considerations:

Aggregating binary variables.

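If two objects $x_1, x_2$ are represented by $p$ binary variables $x_{ij}$, $i=1,2$, $j=1,\ldots,p$, let $a_{hk}$ be the number of variables $j$ on which $x_{1j}=h$ and $x_{2j}=k$, $h,k\in\{0,1\}$. If all variables are treated in the same way, the most straightforward dissimilarity is the simple matching coefficient,

$$d_{SM}(x_1,x_2)=1-\frac{a_{00}+a_{11}}{p}.$$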

However, often (e.g. in the case of geographical presence-absence data in ecology) common presences are important, whereas common absences are not. This is taken into account by the Jaccard dissimilarity

$$d_J(x_1,x_2)=1-\frac{a_{11}}{a_{11}+a_{10}+a_{01}}.$$

One can worry about whether this gives the object with more presences too much weight in the denominator, and actually more than 30 dissimilarity measures for such data have been proposed (Shi (1993)), prompting much research about their characteristics and how they relate to each other (Gower and Legendre (1986); Warrens (2008)).
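The difference between the simple matching and the Jaccard dissimilarity can be made concrete with a small sketch. The function names and the two presence-absence vectors are hypothetical; the formulas follow the two definitions given above.

```python
def match_counts(x1, x2):
    """Count, for each pair (h, k), the variables with x1 == h and x2 == k."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for u, v in zip(x1, x2):
        counts[(u, v)] += 1
    return counts


def simple_matching(x1, x2):
    """One minus the proportion of variables on which both objects agree."""
    a = match_counts(x1, x2)
    return 1.0 - (a[(0, 0)] + a[(1, 1)]) / len(x1)


def jaccard(x1, x2):
    """Like simple matching, but common absences are ignored."""
    a = match_counts(x1, x2)
    denom = a[(1, 1)] + a[(1, 0)] + a[(0, 1)]
    if denom == 0:
        raise ValueError("Jaccard is undefined if neither object has a presence")
    return 1.0 - a[(1, 1)] / denom


# Hypothetical presence-absence data on 10 regions: each species is
# present in two regions, only one of which is shared.
x1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
x2 = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print("simple matching:", simple_matching(x1, x2))
print("Jaccard:        ", jaccard(x1, x2))
```

Because the seven common absences count as agreement for simple matching but not for Jaccard, the two species look similar under the first measure (about 0.2) and rather dissimilar under the second (about 0.67).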

Aggregating categorical variables.

If there are more than two categories, again the most intuitive way to construct a dissimilarity measure is one minus the relative number of “matches”. In some applications such as population genetics dissimilarity should rather be a non-linear function of matches between genes, and it is also important to think about whether and in what way variables with different numbers of categories or even with more or less uniform distributions should be given different weights because some variables produce matches more easily than others.

Aggregating continuous variables.

The Minkowski ($L_q$)-distance between two objects $x_i, x_j$ on $p$ real-valued variables is

 $$d_{M_q}(x_i,x_j)=\sqrt[q]{\sum_{l=1}^p d_l(x_{il},x_{jl})^q}, \qquad (1)$$

where $d_l(x,y)=|x-y|$. Variable weights $w_l$ can easily be incorporated by multiplying the $d_l$ by $w_l$. Most often, the Euclidean distance $L_2$ and the Manhattan distance $L_1$ are used. Using $L_q$ with larger $q$ gives the variables with larger $d_l$ more weight, i.e., two observations are treated as less similar if there is a very large dissimilarity on one variable and small dissimilarities on the others than if there is about the same medium-sized dissimilarity on all variables, whereas $L_1$ gives all variable-wise contributions implicitly the same weight (note that this does not hold for the Euclidean distance that corresponds to physical distances and is used as default choice in many applications).
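The effect of the exponent can be illustrated with a minimal Python sketch. The three points are hypothetical: one differs from the origin by a medium amount on every variable, the other by the same total amount concentrated on a single variable.

```python
def minkowski(x, y, q=2.0, weights=None):
    """Minkowski distance with variable-wise absolute differences.

    Optional weights multiply the variable-wise differences before
    they are raised to the power q.
    """
    if weights is None:
        weights = [1.0] * len(x)
    total = sum((w * abs(a - b)) ** q for a, b, w in zip(x, y, weights))
    return total ** (1.0 / q)

origin = [0.0, 0.0, 0.0]
spread = [2.0, 2.0, 2.0]  # medium difference on every variable
spiky = [6.0, 0.0, 0.0]   # same total difference on a single variable

for q in (1, 2, 4):
    print(q, minkowski(origin, spread, q), minkowski(origin, spiky, q))
```

With the Manhattan distance both points are equally far from the origin; with larger exponents the point with one large variable-wise difference is treated as increasingly more dissimilar.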

An alternative would be the (squared) Mahalanobis distance,

 $$d_M(x_i,x_j)^2=(x_i-x_j)^T S^{-1}(x_i-x_j), \qquad (2)$$

where $S$ is a scatter matrix such as the sample covariance matrix. This is affine equivariant, i.e., not only rotating the data points in Euclidean space, but also stretching them in any number of directions will not affect the dissimilarity. It will also implicitly aggregate and therefore weight information from strongly correlated variables down (correlation implies that data are “stretched” in the direction of their dependence; the consequence is that “joint information” is only used once). This is desirable if clusters can come in all kinds of elliptical shapes. On the other hand, it means that the weight of the variables is determined by their covariance structure and not by their meaning, which is not always appropriate (see the discussion about variable selection above).
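The invariance against stretching can be verified numerically. The sketch below is restricted to two variables so that the inverse of the sample covariance matrix can be written out explicitly; the data points are hypothetical.

```python
def covariance_2d(points):
    """Sample covariance matrix of 2-d points, returned as (sxx, sxy, syy)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_sq(a, b, cov):
    """Squared Mahalanobis distance using the explicit 2x2 inverse of S."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = a[0] - b[0], a[1] - b[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

points = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (6.0, 7.0)]
stretched = [(10.0 * x, y) for x, y in points]  # rescale the first variable

d_orig = mahalanobis_sq(points[0], points[4], covariance_2d(points))
d_str = mahalanobis_sq(stretched[0], stretched[4], covariance_2d(stretched))
print(d_orig, d_str)  # equal up to rounding error
```

The Euclidean distance between the same two points would change by almost a factor of ten under this rescaling.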

There are many further ways of constructing a dissimilarity measure from several continuous variables, see Everitt et al. (2011), such as the Canberra distance, which emphasizes differences close to zero. It is defined by $d_l(x,y)=\frac{|x-y|}{|x|+|y|}$ and $q=1$ in (1). The Pearson correlation coefficient $\rho$ has been used to construct a dissimilarity measure $d_P(x_i,x_j)=1-\rho(x_i,x_j)$ as well (other transformations are also used). This interprets $x_i$ and $x_j$ as similar if they are positively linearly dependent. This does not mean that their values have to be similar, but rather the values of the variables relative to the other variables. In some applications variables are clustered, which means that variables and objects change their roles; if the variables are the objects to be clustered, $\rho$ in $d_P$ is a proper correlation between variables, which is a typical use of $d_P$.
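Both measures can be sketched in a few lines of Python. The correlation-based dissimilarity is taken here as one minus the Pearson correlation, which is one common choice among several; the handling of 0/0 terms in the Canberra distance and all data values are assumptions made for illustration.

```python
import math

def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|) over the variables."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:  # convention (an assumption): 0/0 terms contribute 0
            total += abs(a - b) / denom
    return total

def pearson_dissimilarity(x, y):
    """One minus the Pearson correlation between the two value profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1 - sxy / math.sqrt(sxx * syy)

# Same absolute difference close to zero and far from zero
print(canberra([0.01], [0.02]), canberra([100.0], [100.01]))

# Same profile at very different levels
low = [1.0, 2.0, 3.0, 4.0]
high = [101.0, 102.0, 103.0, 104.0]
print(pearson_dissimilarity(low, high), canberra(low, high))
```

The two profiles `low` and `high` are as similar as possible according to the correlation-based measure, although their values differ by 100 on every variable.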

Aggregating ordinal variables.

Ordinal variables are characterized by the absence of metric information about the distances between two neighboring categories. They could be treated as categorical variables, but this would ignore available information. On the other hand, it is fairly common practice to use plain Likert codes 1, 2, … and then to use methods for continuous data. Ordinality can be taken into account while still using methods for continuous data by scoring the categories in a way that uses the ordinal information only. Straightforward scores are obtained by ranking (using the midrank for all objects in one category) or normal scores (Conover (1999)), which treat the data as if there were an underlying uniform (ranks) or Gaussian distribution (normal scores). A more sophisticated approach is polytomous item response theory (Ostini and Nering (2006)). Using scores that are determined by the distribution of the data does not guarantee that they appropriately quantify the interpretative distances between categories, and in some situations (e.g., Likert scales in questionnaires where interviewees can see that responses are coded 1, 2, …) this may be reflected better by plain Likert codes. Sometimes there is also a more complex structure in the categories that can be reflected by scoring data in a customized way. For example, in Hennig and Liao (2013), a “housing” variable had levels “owns”, “pays rent” and several levels such as “shared ownership” that could be seen as lying between “owns” and “pays rent” but could not be ordered, which could be reflected by having a distance of 1 between “pays rent” and “owns” and 0.5 between any other pair of categories.
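Midrank and normal scores can be computed as in the following Python sketch. The Likert responses are hypothetical, and the particular normal score convention used (the standard normal quantile of midrank divided by $n+1$) is an assumption; other conventions exist.

```python
from statistics import NormalDist

def midrank_scores(values):
    """Map each category to the average rank of the objects in it."""
    scores = {}
    below = 0  # number of objects in lower categories
    for cat in sorted(set(values)):
        count = values.count(cat)
        # ranks below + 1, ..., below + count share their average
        scores[cat] = below + (count + 1) / 2
        below += count
    return scores

def normal_scores(values):
    """Map each category to a standard normal quantile of its midrank."""
    n = len(values)
    ranks = midrank_scores(values)
    # Assumption: quantile of midrank / (n + 1); other conventions exist.
    return {cat: NormalDist().inv_cdf(r / (n + 1)) for cat, r in ranks.items()}

likert = [1, 1, 1, 1, 2, 2, 3, 4, 4, 5]  # hypothetical responses
print(midrank_scores(likert))
print(normal_scores(likert))
```

Both scorings depend on how many respondents chose each category, so the score differences between neighboring categories are no longer all equal to one as with plain Likert codes.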

Aggregating mixed-type variables and missing values.

If there are variable-wise distances defined, variables of mixed type can be aggregated. A standard way of doing this is the Gower dissimilarity (Gower (1971))

 $$d_G(x_i,x_j)=\frac{\sum_{l=1}^p w_l\delta_{ijl}d_l(x_{il},x_{jl})}{\sum_{l=1}^p w_l\delta_{ijl}},$$

where $w_l$ is a variable weight and $\delta_{ijl}=1$ except if $x_{il}$ or $x_{jl}$ are missing, in which case $\delta_{ijl}=0$. This is a weighted version of $L_1$ and takes into account missing values by just leaving the corresponding variable out and rescaling the others. Gower recommended to use the weight $w_l$ for standardization to $[0,1]$-range (see Section 3.4), but Hennig and Liao (2013) argued that many clustering methods tend to identify gaps in variable distributions with cluster borders, and that this implies that $w_l$ should be used to weight binary and other “very discrete” variables down against continuous variables, because otherwise the former would get an unduly high influence on the clustering. $w_l$ can also be used to weight variables up that have high subject matter importance. The Gower dissimilarity is very general and covers most applications of dissimilarity-based clustering to mixed-type variables. An alternative for missing values is to treat them as a category of their own. For continuous variables one could give missing values a constant dissimilarity to every other value. More references are in Everitt et al. (2011).
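A minimal Python sketch of the Gower dissimilarity for two objects follows. It assumes range-standardized absolute differences for numeric variables and simple matching for categorical ones; the function name, the `kinds` and `ranges` arguments and the example objects are hypothetical.

```python
def gower(xi, xj, kinds, ranges, weights=None):
    """Gower dissimilarity between two objects with mixed-type variables.

    kinds[l] is "num" or "cat"; ranges[l] is the data range of a numeric
    variable (ignored for categorical ones); None marks a missing value.
    """
    p = len(xi)
    if weights is None:
        weights = [1.0] * p
    num = den = 0.0
    for l in range(p):
        if xi[l] is None or xj[l] is None:
            continue  # indicator is 0: the variable is left out
        if kinds[l] == "num":
            d = abs(xi[l] - xj[l]) / ranges[l]
        else:
            d = 0.0 if xi[l] == xj[l] else 1.0
        num += weights[l] * d
        den += weights[l]
    if den == 0:
        raise ValueError("no variable observed on both objects")
    return num / den

kinds = ["num", "cat", "num"]
ranges = [50.0, None, 10.0]
a = [30.0, "owns", 4.0]
b = [40.0, "pays rent", None]  # third variable missing

print(gower(a, b, kinds, ranges))
print(gower(a, b, kinds, ranges, weights=[1.0, 0.5, 1.0]))
```

In the second call the categorical variable is weighted down, which reduces the dissimilarity because the mismatch on the discrete variable no longer dominates.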

In many situations detailed considerations regarding the subject matter will play the most important role regarding the design of a dissimilarity measure. This is particularly the case if the data are more structured than just a collection of variables. Such considerations start with deciding how to represent the objects, as discussed in Section 3.1 and illustrated by the task of time series clustering. The next task is how to aggregate the measurements in an appropriate way. In time series clustering, one consideration is whether some processes that are interpreted to be similar may occur at different and potentially varying speeds, so that flexible alignment (“dynamic time warping”) is required, as may be the case in gesture recognition. See Liao (2005) for further aspects of choosing dissimilarities between time series.
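The idea of flexible alignment can be sketched with the textbook dynamic programming recursion for dynamic time warping. The version below uses absolute differences as local cost and no window constraint, both of which are simplifying assumptions; the series are hypothetical.

```python
def dtw_distance(s, t):
    """Dynamic time warping cost between two numeric sequences."""
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j]: minimal cost of aligning the first i values of s
    # with the first j values of t
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(s[i - 1] - t[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # s advances, t repeats a value
                cost[i][j - 1],      # t advances, s repeats a value
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]  # same shape at half the speed
fast = [0, 1, 2, 1, 0]
other = [2, 1, 0, 1, 2]                # different shape

print(dtw_distance(slow, fast), dtw_distance(slow, other))
```

The two series with the same shape but different speeds have dissimilarity zero, although they do not even have the same length, so a pointwise Euclidean distance would not be defined for them.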

Key issues may differ a lot from one application to the next, so it is difficult to present general rules. There is some research on approximating expert judgments of similarity with functions of the available variables (Gordon (1990); Müllensiefen and Frieler (2007)). Hennig and Hausdorf (2006), who incorporate geographical distance information into a dissimilarity for presence-absence data, list a number of general principles for designing and fine-tuning dissimilarities:

• What should be the basic behavior of the dissimilarity as a function of the existing measurements (when decreasing/increasing etc.)?

• What should be the relative weight of different aspects of the basic behavior? Should some aspects be incorporated in a nonlinear manner (see Section 3.3)?

• Construct exemplary pairs of objects for which it is clear what value the dissimilarity should have, or how it should compare with some other exemplary pairs.

• Construct sequences of pairs of objects in which one aspect changes while others are held constant.

• Whether and how could the dissimilarity measure be disturbed by small changes in the characteristics? What behavior in these situations would be appropriate?

• Which transformations of the variables should leave the dissimilarities unchanged?

• Are there reasons that the dissimilarity measure should be a metric (or have some other particular mathematical properties)?

3.3 Transformation of variables

According to the same philosophy as before, effective distances (as used by a clustering method) on the variables should reflect the “interpretative distance” between objects, and transformations may be required to achieve this. Because there is a large variety of clustering aims, it is difficult to give general principles that can be applied in a straightforward manner, and the issue is best illustrated using examples. Therefore, consider now the variable “savings amount” in socio-economic stratification in Hennig and Liao (2013). Regarding social stratification it makes sense to allow proportionally higher variation within high income and/or high savings clusters; the “interpretative difference” between incomes is rather governed by ratios than by absolute differences. In other words, the difference between two people with yearly incomes of \$2 million and \$4 million, say, should in terms of social strata be treated about equally as the difference between \$20,000 and \$40,000. This suggests a log transformation, which has the positive side effect of taming some outliers in the data. Some people indeed have zero savings, which means that the transformation should actually be $\log(\mbox{savings}+c)$ with a constant $c>0$. The choice of $c$ can have surprisingly strong implications on clustering, because it tunes the size of the “gap” between persons with zero savings and persons with small savings; in the dataset analyzed in Hennig and Liao (2013) there were only a handful of persons with savings below \$100, but more with savings between \$100 and \$500. Clustering methods tend to identify borders between clusters with gaps. A low value for $c$, e.g., $c=1$, creates a rather broad gap, which means that many clustering methods will isolate the zero savings-group regardless of the values of the other variables. However, from the point of view of socio-economic stratification, zero savings are not that special and not essentially different from low savings below a few hundred dollars, and therefore a larger value for $c$ (Hennig and Liao (2013) chose $c=50$) needs to be chosen to allow methods to put such observations together in the same cluster. The reasoning may seem to be very subjective, but actually this is required when attention is paid to the detail, and there is no better justification for any straightforward default choice (e.g., $c=1$).
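How strongly the additive constant in a shifted log transformation tunes the gap can be seen from a small computation. The Python sketch below compares, for two illustrative constants (1 and 50, used here purely as example values), the transformed distance between zero savings and savings of 100 with the transformed distance between savings of 100 and 500.

```python
import math

def log_gap(c, small=100.0):
    """Distance between zero savings and `small` savings after log(x + c)."""
    return math.log(small + c) - math.log(0.0 + c)

def log_step(c, low=100.0, high=500.0):
    """Distance between `low` and `high` savings after log(x + c)."""
    return math.log(high + c) - math.log(low + c)

for c in (1.0, 50.0):
    print(c, round(log_gap(c), 3), round(log_step(c), 3))
```

With the small constant, the gap between zero and 100 is several times as large as the distance between 100 and 500; with the larger constant it is smaller than that distance, so that zero savings are no longer isolated by the transformation alone.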

It is fairly common that “interpretative distances” are nonlinear functions of plain differences. As another example, Hennig and Hausdorf (2006) used geographical distance information in a nonlinear way in a dissimilarity measure for presence-absence data for biological species, because individuals can easily travel shorter distances, whereas what goes on in regions with a long distance between them is rather unrelated, regardless of whether this distance is, say, 2,000 or 4,000 km, the difference between which therefore should rather be scaled down compared to differences between smaller distances.

Whether such transformations are needed depends on the clustering method. For example, a typical distribution of savings amounts is very skew and sometimes the skewness corresponds to the change in interpretative distances along the range of the variable. Fitting a mixture of appropriate skew distributions (see Hennig et al. (2015)) can then have a similar effect as transforming the variable.

3.4 Standardization, weighting and sphering of variables

Standardization of variables is a kind of transformation, but with a different rationale. Instead of governing the effective distance within a variable, it governs the relative weight of variables against each other when aggregating them. Standardization is not needed if a clustering method or dissimilarity is used that is invariant against affine transformations such as Gaussian mixture models allowing for flexible covariance matrices or the Mahalanobis distance. Such methods standardize variables internally, and the following considerations may apply also to the question whether it is a good idea to use such a method.

Standardization of $x_1,\ldots,x_n\in\mathbb{R}^p$ is a special case of the linear transformation

 $$x_i^*=B^{-1}(x_i-\mu),\quad i=1,\ldots,n,$$

where $B$ is an invertible $p\times p$-matrix and $\mu\in\mathbb{R}^p$. Standardizing location by introducing $\mu$ (usually chosen as the mean vector of the data) does not normally have an influence on clustering, but simplifies expressions. “Standardization” refers to using a diagonal matrix of scale statistics (see below) as $B$. For “sphering”, $B=UD^{1/2}U^T$, where $S=UDU^T$ for a scatter matrix $S$, with $U$ being the matrix of eigenvectors and $D$ being the diagonal matrix of eigenvalues.

If the clustering method is not affine invariant (for example $K$-means or dissimilarity-based methods using the Euclidean distance), standardization may have a large impact. For example, if variables are measured on different scales and one variable has values around 1,000 and another one has values between 0 and 1, the first variable will dominate the clustering regardless of what clustering pattern is supported by the second one. Standardization makes clustering invariant against the scales of the variables, and sphering makes clustering invariant against general affine linear transformations.

But standardization and sphering are not always desirable. The effect of sphering is the same as the effect of using the Mahalanobis distance (2), discussed above. If variables use the same measurement scale but have different variances, it depends on the requirements of the application whether standardization is desirable or not. For example, data may come from a questionnaire where respondents were asked to rate several items on a scale between 1 and 10. If for some items almost all respondents picked central values between 4 and 7, this may well indicate that the respondents did not find these items very interesting, and that therefore these items are less informative for clustering compared with other items for which respondents made good use of the full width of the scale. For standard clustering methods that are not affine invariant, the variation within a variable defines its relative impact on the clustering. Leaving the items unstandardized means that an item with little variation would have little impact on clustering, which seems appropriate in this situation, whereas in other applications one may want to allow the variables a standardized influence on clustering regardless of the within-variable variation.

The most popular methods for standardization are

• standardization to $[0,1]$-range,

• standardization to unit variance,

• standardization to a unit value of a robust variance estimator such as interquartile range (IQR) or median absolute deviation (MAD) from the median.

As is the case for most such decisions, the standardization method occasionally makes a substantial difference. The major difference is the treatment of outlying values. Range standardization is vulnerable to outlying values in the sense that an extreme outlier has the effect of squeezing together the other values on that variable, so that any structural information in this variable apart from the outlier will only have a very small influence on the clustering. This is avoided by using a robust variance estimator, which can have another undesired effect. Although outliers on a single variable will not affect other structural information on the same variable so much, for objects for which a single variable has an outlying value, this may dominate the information from all other variables, which can have a big impact in situations with many variables and a moderate number of outlying values in various variables. Variance standardization compromises on the disadvantages of both other approaches as well as on the advantages.
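The different treatment of outliers by the three standardization schemes can be made concrete with a small Python sketch. The data are hypothetical (five regular values and one extreme outlier), and the MAD is used without any consistency factor, which is an assumption.

```python
from statistics import mean, median, stdev

def range_standardize(x):
    """Rescale to the range [0, 1]."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

def variance_standardize(x):
    """Center at the mean and scale to unit variance."""
    m, s = mean(x), stdev(x)
    return [(v - m) / s for v in x]

def mad_standardize(x):
    """Center at the median and scale by the median absolute deviation."""
    med = median(x)
    mad = median([abs(v - med) for v in x])
    return [(v - med) / mad for v in x]

def regular_spread(z):
    """Width of the first five (non-outlying) values after standardization."""
    return max(z[:5]) - min(z[:5])

data = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]  # the last value is an outlier

for f in (range_standardize, variance_standardize, mad_standardize):
    z = f(data)
    print(f.__name__, round(regular_spread(z), 3), round(max(z), 3))
```

Range standardization squeezes the five regular values into a very small interval; MAD standardization keeps their spread but leaves the outlier at an extreme value that can dominate a distance computed over several variables; variance standardization lies in between on both counts.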

If for subject matter reasons some variables are more important than others regardless of the within-variable variation, one could reweight them by multiplying them with constants reflecting the relative importance after having standardized their data-driven impact.

None of the methods discussed up to here takes clustering information into account. A problem here is that if a variable shows a clear separation between clusters, this may introduce large variability, which may imply a large variance, range or IQR/MAD. If variables use the same measurement units and values are comparable, this could be an argument against standardization; if within-cluster variation is low, range-standardization will normally be better than the other schemes (Milligan and Cooper (1988)). The problem is, obviously, that clustering information is not normally available a priori. Art et al. (1982) discuss a method in which there is an initial guess, based on smallest dissimilarities, of which objects belong to the same cluster, from which then a provisional within-cluster covariance matrix is estimated, which is used to sphere the dataset. De Soete (1986) suggests reweighting variables in such a way that an ultrametric is optimally approximated (Hennig et al. (2015)). These methods are compared with classical standardization by Gnanadesikan et al. (1995).

4 Comparison of clustering methods

Different cluster analysis methods can be compared in several different ways. When choosing a method for a specific clustering aim, it is important to know the characteristics of the clustering methods so that they can be matched with the required cluster concept. This is treated in Section 4.1. Section 4.2 reviews some existing studies comparing different clustering methods. Section 4.3 summarizes some theoretical work on desirable properties of clustering methods.

4.1 Relating methods to clustering aims

Following Section 2, the choice of an appropriate clustering method is strongly dependent on the aim of clustering. Here I list some clustering methods treated in this book, and how they relate to the list of potentially desirable cluster characteristics given in Section 2. Completeness cannot be achieved because of space limitations. For definitions of all listed methods, see Hennig et al. (2015).

$K$-means.

The objective function of $K$-means implies that it aims primarily at representing clusters by centroids. The squared Euclidean distance penalizes large distances within clusters strongly, so outliers can have a strong impact and there may be small outlying clusters, although $K$-means generally rather tends to produce clusters of roughly equal size. Distances in all directions from the center are treated in the same way and therefore clusters tend to be spherical ($K$-means is equivalent to ML-estimation in a model where clusters are modeled by spherical Gaussian distributions). $K$-means emphasizes homogeneity rather than separation; it is usually more successful regarding small within-cluster dissimilarities than regarding finding gaps between clusters.
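The impact of an outlier on the objective function can be illustrated by evaluating the within-cluster sum of squares for two competing partitions. The Python sketch below uses hypothetical one-dimensional data with two groups and one outlying point, and only evaluates the criterion; it does not run an optimization algorithm.

```python
def kmeans_objective(clusters):
    """Sum of squared distances of all points to their cluster mean (1-d)."""
    total = 0.0
    for cluster in clusters:
        centroid = sum(cluster) / len(cluster)
        total += sum((x - centroid) ** 2 for x in cluster)
    return total

group_1 = [0.0, 1.0, 2.0, 3.0]
group_2 = [10.0, 11.0, 12.0, 13.0]
outlier = [40.0]

# Two ways of partitioning the data into two clusters
outlier_isolated = [group_1 + group_2, outlier]
groups_separated = [group_1, group_2 + outlier]

print(kmeans_objective(outlier_isolated))
print(kmeans_objective(groups_separated))
```

For two clusters the criterion prefers to merge the two clearly separated groups and to give the outlier a cluster of its own, because a single large squared distance outweighs many moderate ones.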

$K$-medoids

is similar to $K$-means, but it uses unsquared dissimilarities. This means that it may allow larger dissimilarities within clusters and is somewhat more flexible regarding outliers and deviations from the spherical cluster shape.

Hierarchical methods.

A first consideration is whether a full hierarchy of clusters is required (for example because the dissimilarity structure should be approximated by an ultrametric) or whether using a hierarchical method is rather a tool to find a single partition by cutting the hierarchy at some point. If only a single partition is required, hierarchies are not as flexible as some other algorithms for finding an in some sense optimal clustering (this applies, e.g., to comparing Ward’s hierarchical method with good algorithms for the $K$-means objective function as reviewed in Hennig et al. (2015)). Different hierarchical methods produce quite different clusters. Both Single and Complete Linkage are rather too extreme for many applications, although they may be useful in a few specific cases. Single Linkage focuses totally on separation, i.e., keeping the closest points of different clusters apart from each other, and Complete Linkage focuses totally on keeping the largest dissimilarity within a cluster low. Most other hierarchical methods are a compromise between these two extremes.
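The contrast between the two extremes shows up in a single merge decision. In the following Python sketch, a hypothetical one-dimensional dataset has already been grouped into an elongated cluster, a single point and a compact pair, and the question is with which neighbor the single point is merged next.

```python
def single_linkage(c1, c2, d):
    """Smallest dissimilarity between a point of c1 and a point of c2."""
    return min(d(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, d):
    """Largest dissimilarity between a point of c1 and a point of c2."""
    return max(d(a, b) for a in c1 for b in c2)

def absdiff(a, b):
    return abs(a - b)

elongated = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
single_point = [6.2]
compact = [9.0, 9.5]

for linkage in (single_linkage, complete_linkage):
    to_elongated = linkage(elongated, single_point, absdiff)
    to_compact = linkage(single_point, compact, absdiff)
    target = "elongated" if to_elongated < to_compact else "compact"
    print(linkage.__name__, "merges the point with the", target, "cluster")
```

Single Linkage joins the point to the elongated cluster because the nearest neighbor is close, whereas Complete Linkage joins it to the compact pair because this keeps the largest within-cluster dissimilarity small.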

Spectral clustering and graph theoretical methods.

These methods are not governed by straightforward objective functions that attempt to make within-cluster dissimilarities small or between-cluster dissimilarities large. Spectral clustering is connected to Single Linkage in the sense that its “ideal” clusters theoretically correspond to connected components of a graph. However, spectral clustering can be set up in such a way (depending sometimes strongly on tuning decisions such as how the edge weights are computed) that it works in a smoother and more flexible way than Single Linkage, less vulnerable to single points “chaining” clusters. Generally spectral clustering still can produce very flexible cluster shapes and focuses much more on cluster separation than on within-cluster homogeneity when applied to originally Euclidean data in the usual way, i.e., using a strongly concave transformation of the dissimilarities so that the method focuses on the smallest dissimilarities, i.e., the neighborhoods of points, whereas pairs of points with large dissimilarity can still be connected through chains of neighborhoods.

Mixture models.

The distributional assumptions for such models define “prototype clusters”, i.e., the characteristics of the clusters the methods will find. These characteristics can depend strongly on details. For example, the Gaussian mixture model with fully flexible covariance matrices has a much larger flexibility (which often comes with stability issues and may incur quite large within-cluster dissimilarities) than a model in which covariance matrices are assumed to be equal or spherical. Using mixtures of $t$- or very skew distributions will allow observations within clusters that are quite far away from the cluster cores. Generally, the mixture model does not come with implicit conditions that ensure the separation of clusters. Two Gaussian distributions can be so close to each other that their mixture is unimodal. Still, for a large enough dataset, the BIC will separate the two components, which is only beneficial if the clustering aim allows splitting up data subsets that seem rather homogeneous (the idea of merging such mixture components is discussed in Hennig et al. (2015)). This issue is also important to have in mind when fitting mixture models to structural data; slight violations of model assumptions such as linearity may lead to fits by more “clusters” that are not well separated, if the BIC is used to determine the number of mixture components. Standard latent class models for categorical data assume local independence within clusters, which means that clusters can be interpreted in terms of the marginal distributions of the variables, which may be useful but is also restrictive, and allows large within-cluster dissimilarities. The comments here apply for Bayesian approaches as well, which allow the user to “tune” the behavior of the methods through adjustment of the prior distribution, e.g., by penalizing methods with more clusters and parameters in a stronger way. 
This can be a powerful tool for regularization, i.e., penalizing troublesome issues such as zero variances and spurious clusters. On the other hand, such priors may have unwanted implications. For example, the Dirichlet prior implies that a certain non-uniform distribution of cluster sizes is supported.

Clustering time series, functional data and symbolic data.

As was already discussed in Section 3.1, regarding time series and also functions and symbolic data, a major issue to decide is in what sense the sequences of observations should belong together in a cluster, which could mean for example similar values, similar functional shapes (with or without alignment or “time warping”), similar autocorrelation structure, or good approximation by prototype objects. This is what mainly distinguishes the many methods discussed in these chapters.

Density-based methods.

Identifying clusters with areas of high density seems to be very intuitive and directly connected to the term “cluster”. High density areas can have very flexible shapes, but more sophisticated density-based methods do not depend as strongly on one or a few points as Single Linkage, which can be seen as a density-based method. There are a few potential peculiarities to keep in mind. High density areas may vary a lot in size, so they may include very large dissimilarities and there may be much variation in numbers of points per cluster. In different locations in the same dataset, depending on the local density, different density levels may qualify as “high”, and methods looking for high density areas at various resolutions can be useful. Clusters may also be identified with density modes, which occur at potentially very different density levels. Density-based methods usually do not need the number of clusters specified, but rather their resolution, i.e., size of neighborhood (in terms of number of neighbors or radius), grid size or kernel bandwidth. This determines how large gaps in the density have to be in order to be found as cluster borders and is often not easier than specifying the number of clusters. In higher dimensions, it becomes more difficult for clustering algorithms to figure out properly where the density is high or low, and also the sparsity of data in high-dimensional space means that densities tend to be more uniformly low.

4.2 Benchmarking studies

Different clustering methods can be compared based on datasets in which a true clustering is known. There are three basic approaches for this in the literature (see Hennig (2015) for more discussion and some philosophical background regarding the problem of defining the “true” clusters):

1. Real datasets can be used in which there are known classes of some kind (a problem with this is that there is no guarantee that the known “true” classes are the only ones that make sense, or that they even cluster properly at all).

2. Data can be simulated from mixture or fixed partition models where within-cluster distributions are homogeneous, such as the Gaussian or uniform distribution (it depends on the separation of the mixture components whether these can be seen as separated clusters; also such datasets will naturally favor clustering methods that are based on the corresponding model assumptions).

3. Real data can be used for which there is no knowledge of a true clustering.

Measures as introduced in Hennig et al. (2015) such as the adjusted Rand index can then be used in order to compare the results of clustering methods with the true clusterings in the first two approaches. Measuring the quality of the clusterings for the third approach is less straightforward, and this is used less often. Morey et al. (1983), for example, used a dataset of 750 alcohol abusers on some socio-behavioral variables, and measured quality by external validation, i.e. looking at the discrimination of the clusters by some external variables, and by splitting the data into two random subsamples, clustering both, and using nearest centroid allocation for computing a similarity measure of the clustering of the different subsamples. Another approach is to compare dissimilarity data to the ultrametric induced by a hierarchical clustering using the cophenetic correlation, see Hennig et al. (2015), as done by Saracli et al. (2013) for artificial data.
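For reference, the adjusted Rand index in its usual form, based on the contingency table of the two partitions, can be computed as in the following Python sketch. The handling of the degenerate case and the example labelings are assumptions made for illustration.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions given as label lists."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))  # contingency table entries
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case (assumption), e.g. both partitions trivial
    return (sum_cells - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]   # same partition, different labels
one_error = [0, 0, 1, 1, 1, 1]   # one object misclassified
crossed = [0, 1, 0, 1, 0, 1]     # unrelated to the truth

for found in (relabeled, one_error, crossed):
    print(adjusted_rand_index(truth, found))
```

The index does not depend on the labels themselves, equals 1 for identical partitions, and is close to or below zero for partitions that agree no more than expected by chance.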

At first sight it seems to be a very important and promising project to compare clustering methods comprehensively, given the variety of existing approaches that is often confusing for the user. Unfortunately, the variety of clustering aims and cluster concepts and also the variety of possible datasets, both regarding data analytic features such as shape of clusters, number of clusters, separation of clusters, outliers, noise variables, and regarding data formats (Euclidean, ordinal, categorical variables, number of variables, structured data, dissimilarity data of various different kinds) makes such a project a rather unrealistic prospect.

In the 1970s and 1980s, with less methodology already existing, a number of comparative benchmark studies were run on artificial data, usually focusing on standard hierarchical methods and different $K$-means-type algorithms. Some of these (the most comprehensive of which was Milligan (1980)) are summarized in Milligan (1996). As could be expected, results depended heavily on the features of the datasets. Overall, Ward’s hierarchical clustering seemed rather successful and Single Linkage seemed problematic, although at least the first result may be biased to some extent by the data generation processes used in these studies.

More recent studies tend to focus on more specialist issues such as comparing different algorithms for the $K$-means criterion (Brusco and Steinley (2007)), comparing $K$-means with Gaussian mixture models with more general covariance matrix models (Steinley and Brusco (2011); note that the authors show that often $K$-means does rather well even for non-spherical data, but this work is a discussion paper and some discussants highlight situations where this is not the case), or a latent class mixture model and $K$-medoids for categorical data (Anderlucci and Hennig (2014)). Dimitriadou et al. (2004) is an example for a study on data typical for a specific application, namely functional magnetic resonance imaging datasets. The winners of their study are neural gas and $K$-means.

A large number of comparative simulation studies can be found in papers that introduce new clustering methods. However, such studies are often biased in favor of the new method that the author wants to advertise by showing that it is superior to some existing methods. Although such studies potentially contain interesting information about how clustering methods compare, having their huge number and strongly varying quality in mind, the author takes the freedom to cite as a single example Coretto and Hennig (2014), comparing robust clustering methods on Euclidean data with elliptical clusters and outliers.

A very original approach was taken by Jain et al. (2004), who did not attempt to rank clustering methods according to their quality. Instead, they clustered 35 different clustering algorithms into five groups based on their partitions of twelve different datasets. The similarity between the clustering algorithms was measured as the averaged similarity (Rand index) between the partitions obtained on the datasets. Given that different clustering methods serve different aims and may well arrive at different legitimate clusterings on the same data, this seems to be a very appropriate approach. Apart from already mentioned methods, this study includes a number of graph based and spectral clustering algorithms, some methods optimizing objective functions other than $K$-means (CLUTO), and “Chameleon-type” methods, i.e. more recent hierarchical algorithms based on dynamic modeling.

Still, it is fair to say that existing work merely scratches the surface of what could be of potential interest in cluster benchmarking, and there is much potential for more systematic comparison of clustering methods.

4.3 Axioms and theoretical characteristics of clustering methods

Another line of research aims at exploring whether clustering methods fulfill some theoretical desiderata. Jardine and Sibson (1971) listed a number of supposedly “natural” axioms for clustering methods and showed that Single Linkage was the only clustering method fulfilling them. Single Linkage also fulfills eight out of nine of the admissibility criteria given in Fisher and Van Ness (1971), more than any other method compared there (which include standard hierarchical methods and $K$-means). Together with the fact that Single Linkage is known to be problematic in many situations because of chaining phenomena and the possibility to produce very large within-cluster dissimilarities, these results should indeed rather put into question the axiomatic approach than all methods other than Single Linkage. Both these papers motivate their axioms from intuitive considerations, which can be criticized (see, e.g., Kaufman and Rousseeuw (1990)). It turns out that monotonicity axioms are among the most restrictive. Jardine and Sibson (1971) discuss clustering methods that map dissimilarities $d$ to clusterings that can be represented by ultrametrics $u=F(d)$, such as most standard hierarchical clustering methods, and their monotonicity axiom requires $d\le d'\Rightarrow F(d)\le F(d')$. From the point of view of ultrametric representation of a distance this may look harmless, but in fact the axiom restricts the options for partitioning the data at the different levels of the hierarchy quite severely, because it implies that if $d(x,y)$ is increased for two observations $x$ and $y$ that are in the same cluster at some level, neither $x$ nor $y$ nor other points in this cluster can be merged with points in other clusters on a lower level as a result of the modification.

Fisher and Van Ness (1971) use a variant of this criterion, which requires that the resulting clustering does not change, and is therefore applicable to procedures that do not yield ultrametrics. The implications are similarly restrictive. They state explicitly that some admissibility criteria only make sense in certain applications. For example, they define “convex admissibility”, which states that the convex hulls of different clusters do not intersect. This requires the data to come from a linear space and rules out certain arrangements of nonlinear shaped clusters. It is the only criterion in Fisher and Van Ness (1971) that is violated by single linkage. Other admissibility criteria are concerned with a method’s ability to recover certain “strong” clusterings, e.g., where all within-cluster dissimilarities are smaller than all between-cluster dissimilarities.

More recently, there has been some revived interest in the axiomatic characterization of clustering methods. Kleinberg (2002) proved an “impossibility theorem”, stating that there can be no partitioning method fulfilling a set of three conditions claimed to be “natural”, namely scale invariance (multiplying all dissimilarities by a positive constant does not change the partition), richness (any partition of points is a possible outcome of the method; this particularly implies that the number of clusters cannot be fixed) and consistency. The latter condition states that if the dissimilarities are changed in such a way that all within-cluster dissimilarities are made smaller or equal, and all between-cluster dissimilarities are made larger or equal, the clustering remains the same. Like the monotonicity axioms before, this is more restrictive than the author suggests, because the required transformation can be defined in such a way that two or more very homogeneous subsets emerge within a single original cluster, which intuitively suggests that the original cluster should then be split up (a corresponding relaxation of the consistency condition is proposed in the paper and does not lead to an impossibility theorem anymore). Furthermore, Kleinberg (2002) shows that three different versions of deciding where to cut a Single Linkage dendrogram can fulfill any two of the three conditions, which means that these conditions cannot be used to distinguish any other clustering approach from Single Linkage.
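As a minimal sketch of what scale invariance and consistency mean in practice, the following Python code cuts a Single Linkage hierarchy at a fixed number of clusters and checks both conditions on a toy example. The function names, the toy data and the shrink/expand factors are assumptions made for illustration only; with the number of clusters fixed, richness is given up by construction.

```python
from itertools import combinations


def single_linkage(d, k):
    """Merge the two closest clusters (single linkage) until k clusters remain.

    d is a symmetric matrix (list of lists) of dissimilarities.
    """
    clusters = [{i} for i in range(len(d))]
    while len(clusters) > k:
        # Pair of clusters with the smallest minimum between-cluster dissimilarity.
        _, a, b = min(
            (min(d[i][j] for i in clusters[a] for j in clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        clusters[a] |= clusters[b]
        del clusters[b]  # a < b, so the index a stays valid
    return {frozenset(c) for c in clusters}


def consistent_transform(d, partition, shrink=0.5, expand=2.0):
    """Shrink within-cluster and enlarge between-cluster dissimilarities."""
    cluster_of = {i: c for c in partition for i in c}
    n = len(d)
    return [
        [d[i][j] * (shrink if cluster_of[i] == cluster_of[j] else expand) for j in range(n)]
        for i in range(n)
    ]


# Toy data (hypothetical): six points on a line forming two groups.
x = [0, 1, 2, 10, 11, 12]
d = [[abs(a - b) for b in x] for a in x]

base = single_linkage(d, 2)
scaled = single_linkage([[3.7 * v for v in row] for row in d], 2)
transformed = single_linkage(consistent_transform(d, base), 2)
print(base == scaled, base == transformed)
```

The transformed dissimilarities need not satisfy the triangle inequality; the conditions are stated for general dissimilarities.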

Ackerman and Ben-David (2008) respond to Kleinberg’s paper. Instead of using the axioms to characterize clusterings, they suggest using them (plus some others) to characterize cluster quality functions (CQF), and then clusterings could be found by optimizing these functions. Note that a clustering method optimizing a consistent CQF (i.e., a CQF that cannot become worse under the kind of transformation of dissimilarities explained above) does not necessarily yield consistent clusterings, because in a modified dataset other clusterings could look even better. The idea also applies with modified axioms to clustering methods with a fixed number of clusters. Follow-up work studies specific properties of clustering methods with the aim of providing axioms that serve to distinguish clustering methods as suitable for different applications (Ackerman et al. (2010); Ackerman et al. (2012)). A similar approach is taken by Puzicha et al. (2000), who compare a number of clustering criteria based on separability measures averaging between-cluster dissimilarities in different ways according to a set of axioms, some of which are very similar to the above, adding local shift invariance and robustness criteria that formalize that small changes to single dissimilarities can only have limited influence on the criterion.

Correa-Morris (2013) starts from Kleinberg (2002) in a different way and allows clustering methods to be restricted by certain parameters (such as the number of clusters). The axioms apply to clusterings as in Kleinberg (2002), but a number of variants of the consistency requirement are defined, and several clustering methods including Single and Complete Linkage and $k$-means are shown to be scale invariant, rich and consistent in a slightly re-defined sense.

Still, much existing work on axiomatic characterization is concerned with distinguishing “admissible” from “inadmissible” methods, exceptions being Ackerman et al. (2010); Ackerman et al. (2012). This is of limited value in practice, particularly because up to now no method in at least fairly widespread use has been discredited because of being “inadmissible” in such a theoretical sense; in case of negative results, rather the admissibility criteria were put into question. Nevertheless, there is some potential in such research to learn about the clustering methods. Changing the focus from branding methods as generally inadmissible to distinguishing the merits of different approaches seems to be a more promising research direction. A number of other characteristics of clustering methods have been studied theoretically; see, for example, the references on robustness and stability measurement in Hennig et al. (2015).

Ackerman and Ben-David (2009) axiomatize “clusterability” of datasets with a view towards finding computationally simpler algorithms for datasets that are “easy” to cluster, which mainly means that there is strong separation between the clusters.

5 Cluster validation

Cluster validation is about assessing the quality of a clustering on a dataset of interest. Different from Section 4.2, here the focus is on analyzing a real dataset for which the clustering is of real interest, and where no “true” clustering is known with which the clustering to be assessed could be compared (the approaches in Sections 5.2, 5.4 and 5.5 can also be used in benchmarking studies). Quality assessment of a single clustering can be of interest in its own right, but methods for assessing the cluster quality can also be used for comparing different clusterings, be they from different methods, or from the same method but with different input parameters, particularly with different numbers of clusters. Because the latter is a central problem in cluster analysis, some literature uses the term “cluster validation” exclusively for methods to decide about the number of clusters, but here a more general meaning is intended.

In any case cluster validation is an essential step in the cluster analysis process, particularly because most methods do not come with any indication of the quality of the resulting clustering other than the value of the objective function to be optimized, if there is one.

There are several different approaches to cluster validation. Hennig (2005) lists

• use of external information,

• testing for clustering structure,

• internal validation indices,

• stability assessments,

• visual exploration,

• comparison of several different clusterings on the same dataset.

Before going through these, I start with some considerations regarding the decision about the number of clusters.

5.1 The number of clusters

Like the clustering problem as a whole, the problem of deciding the number of clusters is not uniquely defined, and there is no unique “true” number of clusters. Even if the clustering method is chosen, the number of clusters is still ambiguous. The ideal situation for defining the problem properly seems to be if data are assumed to come from a mixture probability model, e.g., a mixture of Gaussians, and every mixture component is identified with a cluster. The problem then seems to boil down to estimating the number of mixture components. To do this consistently is difficult enough (see Hennig et al. (2015)), but unfortunately in reality it is an ill-posed problem. Generally, probability models are not expected to hold precisely in reality. But if the data come from a distribution that is not exactly a Gaussian mixture with finitely many components, a consistent criterion (such as the BIC, see Hennig et al. (2015)) will estimate a number of clusters that diverges to infinity as the sample size grows, because a large dataset can be approximated better with more mixture components. If mixture components are to be interpreted as clusters, normally at least some separation between them is required, which is not guaranteed if their number is estimated consistently.

The decision about which number of clusters is appropriate in a certain application amounts to deciding in some way what granularity is required for the clustering. Ultimately, how strong separation between different clusters is required and a partition into how many clusters is useful in the given situation cannot be decided by the data alone without user input. It is often suggested in the literature that the number of clusters needs to be “known” or otherwise it needs to be estimated from the data. But if it is understood that finding the number of clusters in a certain application needs user input anyway, fixing the number of clusters is often as legitimate a user decision as the user input needed otherwise. There are many supposedly “objective” criteria for finding the best number of clusters (see Hennig et al. (2015)). But it would be more appropriate to say that these criteria, instead of estimating any underlying “true” number of clusters, implicitly define what the best number of clusters is, and the user still needs to decide which definition is appropriate in the given application.

In many situations there are good reasons not to fix the number of clusters but rather to give the data the chance to pick a number that fits its pattern. But the researcher should not be under the illusion that this can be done reliably without having thought thoroughly about what cluster concept is required. Apart from the indices listed in Hennig et al. (2015), also the statistics listed in Section 5.4 can be used, particularly if the researcher has a quantitative idea about, for example, how strong separation between clusters is required.

5.2 Use of external information

Formal and informal external information can be used. Informally, subject matter experts can often decide to what extent a clustering makes sense to them. On one hand, this is certainly not totally reliable, and a clustering that looks surprising to a subject matter expert may even be particularly interesting and could spark new discoveries. On the other hand, the subject matter expert may have good reasons to discard a certain clustering, which often points to the fact that the clustering aim was not well enough specified or understood when choosing a certain clustering method in the first place. If possible, the problem should then be understood in such a way that it can lead to an amendment in the choice of methodology.

For formal external validation, there may be external variables or groupings known that are expected or desired to be related to the clustering. For example, in market segmentation, a clustering may be computed of data that gives preferences of customers for certain products or brands, and in order to make use of these clusters, they should be to some extent homogeneous also regarding other features of the customers such as sex, age, household size etc. This can be explored using techniques such as MANOVA and discriminant analysis for continuous variables, and association measures or tests and measures for comparing clusterings (see Hennig et al. (2015)) for categorical variables and groupings.
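As one hedged illustration of a measure for comparing a clustering with an external grouping, the sketch below computes the adjusted Rand index in plain Python. The text does not single out this index; it is used here as an assumed example, and the label vectors are made up.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same observations."""
    n = len(labels_a)
    # Pairs of observations that are together in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (together_both - expected) / (maximum - expected)


# Hypothetical example: a clustering and an external grouping of eight customers.
clustering = [0, 0, 0, 1, 1, 1, 2, 2]
external = ["a", "a", "a", "b", "b", "b", "c", "c"]
print(adjusted_rand_index(clustering, external))
```

The index equals 1 when the two groupings agree up to relabeling and is close to 0 when the agreement is what would be expected by chance.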

5.3 Testing for clustering structure

In many clustering applications, researchers may want to determine whether there is a “real” clustering in the data that corresponds to an underlying meaningful grouping. Many clustering algorithms deliver a clustering regardless of whether the dataset is “really” clustered. A chapter in Hennig et al. (2015) is about methods to test homogeneity models against clustering alternatives. Note that straightforward models for homogeneity such as the Gaussian or uniform distribution may be too simple to model even some datasets without meaningful clusters. Significant deviations from such homogeneity models may sometimes be due to outliers, skew or nonlinear distributional shapes, or other structure in the data such as temporal or spatial autocorrelation, in which case it is advisable to use more complex null models, see Hennig et al. (2015). In any case it is important that a significant result of a homogeneity test does not necessarily validate every single one of the found clusters. Homogeneity tests have been applied to single clusters or pairs of clusters in order to give more local information about grouping structure, but this is not without problems, see Hennig et al. (2015).
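The logic of testing a homogeneity model against a clustering alternative can be sketched as a Monte Carlo test. In the code below, both the test statistic (the best two-group within-cluster sum of squares relative to the total sum of squares, for one-dimensional data) and the null model (a uniform distribution on the observed range) are assumptions chosen for simplicity; as noted above, such a simple null model may be inadequate for real data.

```python
import random


def sum_of_squares(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_two_split_ratio(xs):
    """Smallest within-cluster sum of squares over all splits of the sorted
    data into two contiguous groups, divided by the total sum of squares."""
    xs = sorted(xs)
    total = sum_of_squares(xs)
    return min(sum_of_squares(xs[:i]) + sum_of_squares(xs[i:]) for i in range(1, len(xs))) / total


def monte_carlo_p_value(xs, n_sim=199, seed=1):
    """Share of uniform null datasets that look at least as clustered as xs."""
    rng = random.Random(seed)
    low, high = min(xs), max(xs)
    observed = best_two_split_ratio(xs)
    null = [
        best_two_split_ratio([rng.uniform(low, high) for _ in xs])
        for _ in range(n_sim)
    ]
    # Small ratios indicate strong two-group structure.
    return (1 + sum(r <= observed for r in null)) / (n_sim + 1)


# Hypothetical data: two tight, well separated groups on the real line.
xs = [0.1 * i for i in range(20)] + [10 + 0.1 * i for i in range(20)]
p_value = monte_carlo_p_value(xs)
print(p_value)
```

A small p-value only speaks against the chosen null model as a whole; it does not validate each of the clusters found.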

5.4 Internal validation indices

A large number of indices have been proposed in the literature for evaluating the quality of a clustering based on the clustered data alone. Such indices are comprehensively discussed in Hennig et al. (2015). Most of them attempt to summarize the clustering quality as a single number, which is somewhat unsatisfactory according to the discussion in Section 2.

Alternatively it is possible to measure relevant aspects of a clustering separately in order to characterize the cluster quality in a multivariate way. Indices measuring several aspects of a clustering are implemented in the R-package “fpc”. Here are some examples:

• measurements of within-cluster homogeneity such as maximum or average within-cluster dissimilarity, within-cluster sum of squares, or the largest within-cluster gap;

• measurements of cluster separation such as the minimum or average dissimilarity between clusters; Hennig (2014) proposes the average minimum dissimilarity to a point from a different cluster of the 10% of observations for which this is smallest;

• measurements of fit such as within-cluster sum of dissimilarities from the centroid or Hubert’s $\Gamma$-type measures, see Hennig et al. (2015);

• measurements of homogeneity of different clusters, e.g., the entropy of the cluster sizes or the coefficient of variation of cluster-wise average distances to the nearest neighbor;

• measurements of similarity between the empirical within-cluster distribution and distributional shapes of interest, such as the Gaussian or uniform distribution.
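A few of the statistics listed above can be computed separately as in the following sketch. This is not the implementation in “fpc”; the function name, the use of Euclidean distance and the toy data are assumptions for illustration.

```python
from itertools import combinations
from math import dist, log


def cluster_statistics(points, labels):
    """Report several aspects of a clustering separately rather than as one index."""
    pairs = list(combinations(range(len(points)), 2))
    within = [dist(points[i], points[j]) for i, j in pairs if labels[i] == labels[j]]
    between = [dist(points[i], points[j]) for i, j in pairs if labels[i] != labels[j]]
    sizes = {}
    for label in labels:
        sizes[label] = sizes.get(label, 0) + 1
    n = len(labels)
    return {
        "average_within": sum(within) / len(within),  # homogeneity
        "max_within": max(within),                    # homogeneity, worst case
        "min_separation": min(between),               # separation
        "size_entropy": -sum(s / n * log(s / n) for s in sizes.values()),
    }


# Hypothetical data: two small groups in the plane.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
stats = cluster_statistics(points, labels)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Keeping the statistics apart allows the user to weigh homogeneity, separation and balance of cluster sizes according to the clustering aim.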

5.5 Stability assessment

Stability is an important aspect of clustering quality. Certainly a clustering does not warrant a strong interpretation if it changes strongly under slight changes of the data. Although there is theoretical work on clustering stability (see Hennig et al. (2015)), this gives very limited information about to what extent a specific clustering on a specific dataset is stable.

Given a dataset, stability can be explored by generating artificial variants of the data and exploring how much the clustering changes. This is treated in Hennig et al. (2015). Standard resampling approaches are nonparametric bootstrap, subsampling and splitting of the dataset. Alternatively, observations may be “jittered” or additional observations such as outliers added, although the latter approaches require a model for adding or changing observations.

Three aspects should be kept in mind. Firstly, often parts of the dataset are clearly clustered and other parts are not, so some clusters of a clustering may be stable while others are not. Secondly, stability is not enough to ensure the quality or meaningfulness of a clustering. A big enough dataset from a homogeneous distribution may allow a very stable clustering; for example, 2-means will partition data from a uniform distribution on a two-dimensional rectangle in which one side is twice as long as the other in a very stable manner, with only a few ambiguities along the borderline of the two clusters. Thirdly, in some applications in which data are clustered for organizational reasons such as information reduction, stability is not of much interest.
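The rectangle example can be explored with a small simulation. The sketch below draws uniform data on a 2-by-1 rectangle, runs a basic 2-means, and measures for each cluster the average Jaccard similarity to the best matching cluster in nonparametric bootstrap samples. The deterministic initialization, sample size, number of bootstrap replicates and seeds are assumptions; this is a simplified illustration, not a reference implementation of any published stability method.

```python
import random
from math import dist


def two_means(points, n_iter=50):
    """Lloyd's algorithm for two clusters, started from the points with the
    smallest and largest first coordinate (an arbitrary deterministic choice)."""
    centers = [min(points), max(points)]
    labels = None
    for _ in range(n_iter):
        new_labels = [0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1 for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in (0, 1):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


def jaccard(a, b):
    return len(a & b) / len(a | b)


def bootstrap_stability(points, n_boot=20, seed=0):
    """Mean Jaccard similarity of each original cluster to its best match
    in clusterings of bootstrap samples (compared on the points drawn)."""
    rng = random.Random(seed)
    n = len(points)
    labels = two_means(points)
    original = [{i for i in range(n) if labels[i] == c} for c in (0, 1)]
    scores = [[], []]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boot_labels = two_means([points[i] for i in idx])
        boot = [{i for i, l in zip(idx, boot_labels) if l == c} for c in (0, 1)]
        drawn = set(idx)
        for c in (0, 1):
            scores[c].append(max(jaccard(original[c] & drawn, b) for b in boot))
    return [sum(s) / len(s) for s in scores]


# Homogeneous data: uniform on a rectangle with sides 2 and 1.
data_rng = random.Random(42)
points = [(data_rng.uniform(0, 2), data_rng.uniform(0, 1)) for _ in range(300)]
stability = bootstrap_stability(points)
print([round(s, 2) for s in stability])
```

High values here show that stability alone does not indicate a meaningful grouping, since the data were generated without any cluster structure.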

5.6 Visual exploration

The term “cluster” has an intuitive visual meaning to most people, and also in the literature about cluster analysis visual displays are a major device to introduce and illustrate the clustering problem. Many of the potentially desired features of clusterings such as separation between clusters, high density within clusters, and distributional shapes can be explored graphically in a more holistic (if subjective) way than by looking at index values. Standard visualization techniques such as scatterplots, heatplots and mosaic plots for categorical data as well as interactive and dynamic graphics can be used both to find and to validate clusters, see, e.g., Theus and Urbanek (2008), Cook and Swayne (1999). For cluster validation, one would normally distinguish the clusters using different colors and glyphs. Most people’s intuition for clusters is strongly connected to the low-dimensional Euclidean space, and therefore methods that project data into a low-dimensional Euclidean space such as PCA are popular and useful. A chapter in Hennig et al. (2015) illustrates the use of PCA and a number of other techniques for cluster visualization with a focus on network-based techniques and visualization of curve clustering. There are also specialized projection techniques for visualizing the separation between clusters in a given clustering (Hennig (2004)) and for finding clusters (Bolton and Krzanowski (2003); Tyler et al. (2009)). Hennig (2005) proposes looking, for every single cluster, at plots that show its separation from the remainder of the dataset, as well as at projection pursuit plots for the data of a single cluster on its own to detect deviations from homogeneity. Such plots can also be applied to more general data formats if a dissimilarity measure exists by use of MDS. The implementation of MDS in the “GGvis” package allows dynamic and interactive exploration of the data and of the parameters of the MDS (Buja et al. (2008)). Anderlucci and Hennig (2014) apply MDS to visualize clusters in categorical data.

A number of visualization methods have been developed specifically for clustering, of which dendrograms (see Hennig et al. (2015)) are probably most widespread. Dendrograms are also frequently used for ordering observations in heatplots. Due to their ability to visualize high-dimensional information and dissimilarity matrices without projecting on a lower-dimensional space, heatplots are often used for such data. Their use depends heavily on the order of the observations. For use in cluster validation it is desirable to plot observations in the same cluster together, which is achieved by the use of dendrograms for ordering the observations. However, it would also be desirable to order observations within clusters in such a way that the transition between clusters is as smooth as possible, so that clusters that are not well separated can be detected. This is treated by Hahsler and Hornik (2011).

Kaufman and Rousseeuw (1990) introduced the silhouette plot based on the silhouette width (see Hennig et al. (2015)), which shows how well observations are separated from neighboring clusters. In Jörnsten (2004) this is compared with plots based on the within-cluster data depth. Leisch (2010) introduces another alternative to the silhouette width based on centroids along with further plots to explore how clusters are concentrated around cluster centroids.
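For orientation, the silhouette width of an observation compares its average dissimilarity $a$ to the other members of its own cluster with the smallest average dissimilarity $b$ to the members of another cluster, as $(b - a) / \max(a, b)$. The following sketch computes these values and prints a crude text version of a silhouette plot; Euclidean distance, the convention of assigning 0 to singleton clusters, and the toy data are assumptions made for illustration.

```python
from math import dist


def silhouette_widths(points, labels):
    """Silhouette width of every observation (requires at least two clusters)."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        widths.append((b - a) / denominator if denominator > 0 else 0.0)
    return widths


# Hypothetical data: four points on a line in two clusters.
points = [(0,), (1,), (10,), (11,)]
labels = [0, 0, 1, 1]
widths = silhouette_widths(points, labels)

# Text silhouette plot: one bar per observation, grouped by cluster.
for label in sorted(set(labels)):
    for w in sorted((w for w, l in zip(widths, labels) if l == label), reverse=True):
        print(f"cluster {label} | {'#' * max(0, round(20 * w))} {w:.2f}")
```

Values near 1 indicate observations that are much closer to their own cluster than to the nearest other cluster, whereas values near 0 or below point to observations between clusters.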

5.7 Different clusterings on the same dataset

The similarity between different clusterings on the same dataset can be measured using the ideas in the corresponding chapter of Hennig et al. (2015). Running different cluster analyses on the same dataset and analyzing to what extent the results differ can be seen as an alternative approach to find out whether and which clusters in the dataset are stable and meaningful. Some care is required regarding the choice of clustering methods and the interpretation of results. If certain characteristics of a clustering are important in a certain application and others are not, it is more important that the chosen cluster analysis method delivers a good result in this respect than that its results coincide largely with the results of a less appropriate method. So if methods are chosen that are too different from each other, some of them may just be inappropriate for the given problem and no importance should be attached to their results. On the other hand, if too similar methods are chosen (such as Ward’s method and $k$-means), the fact that clusters are similar does not tell the user too much about their quality. Looking at the similarity of different clusterings on the same data is useful mainly for two reasons:

• Several different methods may seem appropriate for the clustering aim, either because the aim is imprecise, or because heterogeneous and potentially conflicting characteristics of the clustering are desired.

• Some fine-tuning is required (such as neighborhood sizes in density-based clustering, variable weighting in the dissimilarity, or prior specifications in Bayesian clustering), and it is of interest to explore how sensitive the clustering solution is to such tuning, particularly because the precise values of tuning constants are hardly fully determined by background knowledge.

6 Conclusions

In this paper, the decisions required for carrying out a cluster analysis are discussed, connecting them closely to the clustering aims in a specific application. The paper is intended to serve as a general guideline for clustering and for choosing the appropriate methodology from the many approaches on offer in Hennig et al. (2015).

Acknowledgement: This work was supported by EPSRC grant EP/K033972/1.

References

• Ackerman and Ben-David (2008) Ackerman, M. and S. Ben-David (2008). Measures of clustering quality: A working set of axioms for clustering. Advances in Neural Information Processing Systems (NIPS) 22, 121–128.
• Ackerman and Ben-David (2009) Ackerman, M. and S. Ben-David (2009). Clusterability: A theoretical study. Journal of Machine Learning Research, Proc. 12th International Conference on Artificial Intelligence (AISTAT) 5, 1–8.
• Ackerman et al. (2012) Ackerman, M., S. Ben-David, S. Branzei, and D. Loker (2012). Weighted clustering. In Proc. 26th AAAI Conference on Artificial Intelligence, pp. 858–863.
• Ackerman et al. (2010) Ackerman, M., S. Ben-David, and D. Loker (2010). Towards property-based classification of clustering paradigms. In Advances in Neural Information Processing Systems (NIPS), pp. 10–18.
• Alelyani et al. (2014) Alelyani, S., J. Tang, and H. Liu (2014). Feature selection for clustering: A review. In C. C. Aggarwal and C. K. Reddy (Eds.), Data Clustering: Algorithms and Applications, pp. 29–60. CRC Press, Boca Raton (FL).
• Anderlucci and Hennig (2014) Anderlucci, L. and C. Hennig (2014). Clustering of categorical data: a comparison of a model-based and a distance-based approach. Communications in Statistics - Theory and Methods 43, 704–721.
• Art et al. (1982) Art, D., R. Gnanadesikan, and J. R. Kettenring (1982). Data-based metrics for cluster analysis. Utilitas Mathematica 21A, 75–99.
• Bolton and Krzanowski (2003) Bolton, R. J. and W. J. Krzanowski (2003). Projection pursuit clustering for exploratory data analysis. Journal of Computational and Graphical Statistics 12, 121–142.
• Brusco and Steinley (2007) Brusco, M. J. and D. Steinley (2007). A comparison of heuristic procedures for minimum within-cluster sums of squares partitioning. Psychometrika 72, 583–600.
• Buja et al. (2008) Buja, A., D. F. Swayne, M. L. Littman, N. Dean, H. Hofmann, and L. Chen (2008). Data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics 17, 444–472.
• Conover (1999) Conover, W. J. (1999). Practical Nonparametric Statistics (3rd ed.). Wiley, New York.
• Cook and Swayne (1999) Cook, D. and D. F. Swayne (1999). Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi. Springer, New York.
• Coretto and Hennig (2014) Coretto, P. and C. Hennig (2014). Robust improper maximum likelihood: Tuning, computation, and a comparison with other methods for robust Gaussian clustering. arXiv:1406.0808 (submitted).
• Correa-Morris (2013) Correa-Morris, J. (2013). An indication of unification for different clustering approaches. Pattern Recognition 46, 2548–2561.
• De Soete (1986) De Soete, G. (1986). Optimal variable weighting for ultrametric and additive tree clustering. Quality and Quantity 20, 169–180.
• Diatta (2004) Diatta, J. (2004). Concept extensions and weak clusters associated with multiway dissimilarity measures. In P. Eklund (Ed.), Concept Lattices, pp. 236–243. Springer, New York.
• Dimitriadou et al. (2004) Dimitriadou, E., M. Barth, C. Windischberger, K. Hornik, and E. Moser (2004). A quantitative comparison of functional MRI cluster analysis. Artificial Intelligence in Medicine 31, 57–71.
• Everitt et al. (2011) Everitt, B. S., S. Landau, M. Leese, and D. Stahl (2011). Cluster Analysis, 5th ed. Wiley, New York.
• Fisher and Van Ness (1971) Fisher, L. and J. Van Ness (1971). Admissible clustering procedures. Biometrika 58, 91–104.
• Gnanadesikan et al. (1995) Gnanadesikan, R., J. R. Kettenring, and S. L. Tsao (1995). Weighting and selection of variables. Journal of Classification 12, 113–136.
• Gordon (1990) Gordon, A. D. (1990). Constructing dissimilarity measures. Journal of Classification 7, 257–269.
• Gordon (1999) Gordon, A. D. (1999). Classification, 2nd ed. CRC Press, Boca Raton (FL).
• Gower (1971) Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics 27, 857–874.
• Gower and Legendre (1986) Gower, J. C. and P. Legendre (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification 3, 5–48.
• Hahsler and Hornik (2011) Hahsler, M. and K. Hornik (2011). Dissimilarity plots: A visual exploration tool for partitional clustering. Journal of Computational and Graphical Statistics 20, 335–354.
• Hausdorf (2011) Hausdorf, B. (2011). Progress toward a general species concept. Evolution 65, 923–931.
• Hennig (2004) Hennig, C. (2004). Asymmetric linear dimension reduction for classification. Journal of Computational and Graphical Statistics 13, 930–945.
• Hennig (2005) Hennig, C. (2005). A method for visual cluster validation. In C. Weihs and W. Gaul (Eds.), Classification - The Ubiquitous Challenge, pp. 153–160. Springer, Berlin.
• Hennig (2014) Hennig, C. (2014). How many bee species? a case study in determining the number of clusters. In M. Spiliopoulou, L. Schmidt-Thieme, and R. Janning (Eds.), Data Analysis, Machine Learning and Knowledge Discovery, pp. 41–50. Springer, Berlin.
• Hennig (2015) Hennig, C. (2015). What are the true clusters? arXiv:1502.02555 (submitted).
• Hennig and Hausdorf (2006) Hennig, C. and B. Hausdorf (2006). Design of dissimilarity measures: a new dissimilarity measure between species distribution ranges. In V. Batagelj, H.-H. Bock, A. Ferligoj, and A. Ziberna (Eds.), Data Science and Classification, pp. 29–38. Springer, Berlin.
• Hennig and Liao (2013) Hennig, C. and T. F. Liao (2013). Comparing latent class and dissimilarity based clustering for mixed type variables with application to social stratification (with discussion). Journal of the Royal Statistical Society, Series C 62, 309–369.
• Hennig et al. (2015) Hennig, C., M. Meila, F. Murtagh, and R. Rocci (Eds.) (2015). Handbook of Cluster Analysis. Chapman & Hall/CRC (forthcoming).
• Jain and Dubes (1988) Jain, A. K. and R. C. Dubes (1988). Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (NJ).
• Jain et al. (2004) Jain, A. K., A. Topchy, M. H. C. Law, and J. M. Buhmann (2004). Landscape of clustering algorithms. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR04) , Vol. 1, pp. 260–263. IEEE Computer Society Washington, DC.
• Jardine and Sibson (1971) Jardine, N. and R. Sibson (1971). Mathematical Taxonomy. Wiley, London and New York.
• Jörnsten (2004) Jörnsten, R. (2004). Clustering and classification based on the l1 data depth. Journal of Multivariate Analysis 90, 67–89.
• Kaufman and Rousseeuw (1990) Kaufman, L. and P. Rousseeuw (1990). Finding Groups in Data. Wiley.
• Kleinberg (2002) Kleinberg, J. (2002). An impossibility theorem for clustering. Advances in Neural Information Processing Systems (NIPS) 15, 463–470.
• Leisch (2010) Leisch, F. (2010). Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing 20, 457–469.
• Liao (2005) Liao, T. W. (2005). Clustering of time series data—a survey. Pattern Recognition 38, 1857–1874.
• Milligan (1980) Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika 45, 325–342.
• Milligan (1996) Milligan, G. W. (1996). Clustering validation: results and implications for applied analyses. In P. Arabie, L. J. Hubert, and G. De Soete (Eds.), Clustering and Classification, pp. 341–375. World Scientific, Singapore.
• Milligan and Cooper (1988) Milligan, G. W. and M. C. Cooper (1988). A study of standardization of variables in cluster analysis. Journal of Classification 5(2), 181–204.
• Morey et al. (1983) Morey, L. C., R. K. Blashfield, and H. A. Skinner (1983). A comparison of cluster analysis techniques within a sequential validation framework. Multivariate Behavioral Research 18, 309–329.
• Müllensiefen and Frieler (2007) Müllensiefen, D. and K. Frieler (2007). Modelling expert’s notions of melodic similarity. Musicae Scientiae Discussion Forum 4A, 183–210.
• Okada (2000) Okada, A. (2000). An asymmetric cluster analysis study of car switching data. In W. Gaul, O. Opitz, and M. Schader (Eds.), Data Analysis: Scientific Modeling and Practical Application, pp. 495–504. Springer, Berlin.
• Ostini and Nering (2006) Ostini, R. and M. J. Nering (2006). Polytomous Item Response Theory Models. Sage, Thousand Oaks (CA).
• Puzicha et al. (2000) Puzicha, J., T. Hofmann, and J. Buhmann (2000). Theory of proximity based clustering: Structure detection by optimization. Pattern Recognition 33, 617–634.
• Saracli et al. (2013) Saracli, S., N. Dogan, and I. Dogan (2013). Comparison of hierarchical cluster analysis methods by cophenetic correlation. Journal of Inequalities and Applications (electronic publication) 203.
• Shi (1993) Shi, G. R. (1993). Multivariate data analysis in palaeoecology and palaeobiogeography-a review. Palaeogeography, Palaeoclimatology, Palaeoecology 105, 199–234.
• Steinley and Brusco (2011) Steinley, D. and M. J. Brusco (2011). Evaluating the performance of model-based clustering: Recommendations and cautions. Psychological Methods 16, 63–79.
• Theus and Urbanek (2008) Theus, M. and S. Urbanek (2008). Interactive Graphics for Data Analysis: Principles and Examples. CRC Press, Boca Raton, FL.
• Tyler et al. (2009) Tyler, D. E., F. Critchley, L. Dümbgen, and H. Oja (2009). Invariant co-ordinate selection (with discussion). Journal of the Royal Statistical Society, Series B 71, 549–592.
• Van Mechelen et al. (1993) Van Mechelen, I., J. Hampton, R. S. Michalski, and P. Theuns (1993). Categories and Concepts - Theoretical Views and Inductive Data Analysis. Academic Press, London.
• Veltkamp and Latecki (2006) Veltkamp, R. C. and L. J. Latecki (2006). Properties and performance of shape similarity measures. In V. Batagelj, H.-H. Bock, A. Ferligoj, and A. Ziberna (Eds.), Data Science and Classification, pp. 47–56. Springer, Berlin.
• von Luxburg et al. (2012) von Luxburg, U., R. Williamson, and I. Guyon (2012). Clustering: Science or art? JMLR Workshop and Conference Proceedings 27, 65–79.
• Warrens (2008) Warrens, M. J. (2008). On similarity coefficients for 2x2 tables and correction for chance. Psychometrika 73, 487–502.