Evaluation Metrics for Unsupervised Learning Algorithms
Abstract
Determining the quality of the results obtained by clustering techniques is a key issue in unsupervised machine learning. Many authors have discussed the desirable features of good clustering algorithms. However, Jon Kleinberg established an impossibility theorem for clustering. As a consequence, a wealth of studies have proposed techniques to evaluate the quality of clustering results depending on the characteristics of the clustering problem and the algorithmic technique employed to cluster data.
I Introduction
Machine learning techniques are usually classified into supervised and unsupervised techniques. Supervised machine learning starts from prior knowledge of the desired result in the form of labeled data sets, which guide the training process, whereas unsupervised machine learning works directly on unlabeled data. In the absence of labels to orient the learning process, these labels must be “discovered” by the learning algorithm.[cord_unsupervised_2008]
In this technical report, we discuss the desirable features of good clustering results, recall Kleinberg’s impossibility theorem for clustering, and describe a taxonomy of evaluation criteria for unsupervised machine learning. We also survey many of the evaluation metrics that have been proposed in the literature. We end our report by describing the techniques that can be used to adjust the parameters of clustering algorithms, i.e. their hyperparameters.
II Formal Limitations of Clustering
From an intuitive point of view, the clustering problem has a very clear goal; namely, properly clustering a set of unlabeled data. Despite its intuitive appeal, the notion of “cluster” cannot be precisely defined, hence the wide range of clustering algorithms that have been proposed.[EstivillCastro2002]
II-A Desirable Features of Clustering
Jon Kleinberg proposes three axioms that describe the properties a clustering function should exhibit in order to be considered “good”, independently of the algorithm used to find the solution. These axioms are scale invariance, richness, and consistency [kleinberg_impossibility_2002], which are explained in more detail below.
A clustering function $f$ is defined on a set $S$ of $n \geq 2$ points and the distances between pairs of points, and it returns a partition $\Gamma$ of $S$. The set of points is $S = \{1, 2, \ldots, n\}$ and the distance between points is given by the distance function $d(i, j)$, where $i, j \in S$. The distance function measures the dissimilarity between pairs of points. For instance, the Euclidean, Manhattan, Chebyshev, and Mahalanobis distances can be used, among many others. Alternatively, a similarity function might also be used.
II-A1 Scale Invariance
The first of Kleinberg’s axioms states that $f(d) = f(\alpha \cdot d)$ for any distance function $d$ and any scaling factor $\alpha > 0$.[kleinberg_impossibility_2002]
This simple axiom indicates that a clustering algorithm should not modify its results when all distances between points are scaled by the same constant factor $\alpha$.
II-A2 Richness
A clustering process is considered to be rich when every partition of $S$ is a possible result of the clustering process. If we use $\mathrm{Range}(f)$ to denote the set of all partitions $\Gamma$ such that $f(d) = \Gamma$ for some distance function $d$, then $\mathrm{Range}(f)$ is equal to the set of all partitions of $S$.[kleinberg_impossibility_2002]
This means that the clustering function must be flexible enough to produce any arbitrary partition/clustering of the input data set.
II-A3 Consistency
Let $d$ and $d'$ be two distance functions. If, for every pair $(i, j)$ belonging to the same cluster of $f(d)$, $d'(i, j) \leq d(i, j)$, and for every pair belonging to different clusters of $f(d)$, $d'(i, j) \geq d(i, j)$, then $f(d') = f(d)$.[kleinberg_impossibility_2002]
A clustering process is “consistent” when the clustering results do not change if the distances within clusters decrease and/or the distances between clusters increase.
II-B An Impossibility Theorem for Clustering
Given the above three axioms, Kleinberg proves the following theorem: For every $n \geq 2$, there is no clustering function $f$ that satisfies scale invariance, richness, and consistency.[kleinberg_impossibility_2002]
Determining a “good” clustering is not a trivial problem. It is impossible for any clustering procedure to satisfy all three axioms. Practical clustering algorithms must trade off the desirable features of clustering results.
Since the three axioms cannot hold simultaneously, clustering algorithms can be designed to violate one of the axioms while satisfying the other two. Kleinberg illustrates this point by describing three variants of single-link clustering (an agglomerative hierarchical clustering algorithm): [kleinberg_impossibility_2002]

$k$-cluster stopping condition: Stop merging clusters when we have $k$ clusters (violates the richness axiom, since the algorithm would never return a number of clusters different from $k$).

Distance-$r$ stopping condition: Stop merging clusters when the nearest pair of clusters are farther apart than $r$ (violates scale invariance, given that every cluster will contain a single instance when the scaling factor $\alpha$ is large, whereas a single cluster will contain all the data when $\alpha \to 0$).

Scale-$\epsilon$ stopping condition: Stop merging clusters when the nearest pair of clusters are farther apart than a fraction $\epsilon$ of the maximum pairwise distance (scale invariance is satisfied, yet consistency is violated).
Clustering algorithms can often satisfy the properties of scale invariance and consistency by relaxing their richness (e.g. whenever the number of clusters is established beforehand). As we have seen, some algorithms can even be customized to satisfy two out of three axioms by relaxing the third one (e.g. single linkage with different stopping criteria).
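The effect of the stopping conditions can be checked on a toy example. The following sketch is only an illustration, not Kleinberg's formulation: the function `single_link`, the `stop` callback signature, and the five points on a line are assumptions made here. It runs single-link clustering with a 2-cluster stopping condition and with a distance-2 stopping condition, before and after multiplying all distances by 10.

```python
from itertools import combinations

def single_link(points, dist, stop):
    """Agglomerative single-link clustering over point indices.

    `stop(num_clusters, gap)` is called before each merge with the current
    number of clusters and the distance between the two nearest clusters;
    merging ends as soon as it returns True.
    """
    def gap(pair):
        a, b = pair
        # Single-link distance: closest pair of points across both clusters.
        return min(dist(points[i], points[j]) for i in a for j in b)

    clusters = [frozenset([i]) for i in range(len(points))]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=gap)
        if stop(len(clusters), gap((a, b))):
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return set(clusters)

def scaled_distance(alpha):
    """Distance on the real line, multiplied by the scaling factor alpha."""
    return lambda p, q: alpha * abs(p - q)

points = [0.0, 1.0, 5.0, 6.0, 20.0]           # hypothetical 1-D data

def k_stop(num_clusters, gap):                # k-cluster condition, k = 2
    return num_clusters <= 2

def r_stop(num_clusters, gap):                # distance-r condition, r = 2
    return gap > 2.0

k_plain = single_link(points, scaled_distance(1), k_stop)
k_scaled = single_link(points, scaled_distance(10), k_stop)
r_plain = single_link(points, scaled_distance(1), r_stop)
r_scaled = single_link(points, scaled_distance(10), r_stop)

print("k-cluster result unchanged by scaling:", k_plain == k_scaled)
print("distance-r result unchanged by scaling:", r_plain == r_scaled)
```

With the k-cluster condition the partition is the same at both scales, but it can never contain anything other than two clusters (no richness). With the distance-r condition the partition depends on the scale of the distances (no scale invariance).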
III Methods for Cluster Evaluation
Evaluating the results of a clustering algorithm is a very important part of the process of clustering data. In supervised learning, “the evaluation of the resulting classification model is an integral part of the process of developing a classification model and there are well-accepted evaluation measures and procedures” [tan_introduction_2005]. In unsupervised learning, because of its very nature, cluster evaluation, also known as cluster validation, is not as well-developed.[tan_introduction_2005]
In clustering problems, it is not easy to determine the quality of a clustering algorithm. This gives rise to multiple evaluation techniques. Quite often, the evaluation process has a notable peculiarity: the way the measurement is performed depends on the algorithm used to obtain the clustering results.
When analyzing clustering results, several aspects must be taken into account for the validation of the algorithm results[tan_introduction_2005]:

Determining the clustering tendency in the data (i.e. whether nonrandom structure really exists).

Determining the correct number of clusters.

Assessing the quality of the clustering results without external information.

Comparing the results obtained with external information.

Comparing two sets of clusters to determine which one is better.
The first three issues are addressed by internal or unsupervised validation, because there is no use of external information. The fourth issue is resolved by external or supervised validation. Finally, the last issue can be addressed by both supervised and unsupervised validation techniques [tan_introduction_2005].
Gan et al. [gan_data_2007] propose a taxonomy of evaluation techniques that comprises both internal and external validation approaches (see Figure 1).
III-A Null Hypothesis Testing
One of the desirable characteristics of a clustering process is to show whether data exhibits some tendency to form actual clusters. From a statistical point of view, a feasible approach consists of testing whether data exhibits random behavior or not [halkidi_clustering_2001]. In this context, null hypothesis testing can be used: a null hypothesis is assumed to be true until evidence suggests otherwise. In this case, the null hypothesis is the randomness of data and, when the null hypothesis is rejected, we assume that the data is significantly unlikely to be random [gan_data_2007].
One of the difficulties of null hypothesis testing in this context is determining the statistical distribution under which the randomness hypothesis can be rejected. Jain and Dubes propose three alternatives [jain_algorithms_1988]:

Random graph hypothesis: All proximity matrices of order $n \times n$ are equally likely.

Random label hypothesis: All permutations of labels on $n$ objects are equally likely.

Random position hypothesis: All sets of $n$ locations in some region of an $m$-dimensional space are equally likely.
Statistical techniques such as Monte Carlo analysis and bootstrapping can be used to determine the clustering tendency in data [jain_algorithms_1988].
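As a rough illustration of a Monte Carlo test under the random position hypothesis, the sketch below compares an observed statistic with its distribution over uniformly random point sets. Several choices here are assumptions made for the example and are not prescribed by the cited works: the test statistic (mean nearest-neighbor distance), the sampling region (the bounding box of the data), the number of trials, and the function names.

```python
import math
import random

def mean_nn_distance(points):
    """Average distance from each point to its nearest neighbor."""
    return sum(
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ) / len(points)

def monte_carlo_p_value(points, trials=199, seed=0):
    """Estimate how often uniformly random 2-D data looks at least as
    'clumped' (small mean nearest-neighbor distance) as the observed data."""
    rng = random.Random(seed)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    observed = mean_nn_distance(points)
    hits = 0
    for _ in range(trials):
        # Random position hypothesis: same number of points, uniform in the box.
        sample = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
                  for _ in points]
        if mean_nn_distance(sample) <= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

# Hypothetical data: two tight groups at opposite corners of the unit square.
clustered = [(0.00, 0.00), (0.01, 0.00), (0.00, 0.01), (0.01, 0.01), (0.005, 0.005),
             (1.00, 1.00), (0.99, 1.00), (1.00, 0.99), (0.99, 0.99), (0.995, 0.995)]
p_value = monte_carlo_p_value(clustered)
print("estimated p-value:", p_value)
```

A small p-value suggests rejecting the randomness hypothesis, i.e. the data shows some clustering tendency with respect to the chosen statistic.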
IV Internal Validation
Internal validation methods make it possible to establish the quality of the clustering structure without having access to external information (i.e. they are based on the information provided by data used as input to the clustering algorithm).
In general, two types of internal validation metrics can be combined: cohesion and separation measures. Cohesion evaluates how close the elements of the same cluster are to each other, while separation measures quantify the level of separation between clusters (see Figure 2). These measures are also known as internal indices because they are computed from the input data without any external information [tan_introduction_2005]. Internal indices are usually employed in conjunction with two clustering algorithm families: hierarchical clustering algorithms and partitional algorithms [gan_data_2007].
Internal validation is used when there is no additional information available. In most cases, the particular metrics used by the evaluation methods are the same metrics that the clustering algorithm tries to optimize, which can be counterproductive in determining the quality of a clustering algorithm and deliver unfair validation results. On the other hand, in the absence of other sources of information, these metrics allow different algorithms to be compared under the same evaluation criterion [aggarwal_data_2015], yet care must be taken not to report biased results.
Internal evaluation methods are commonly classified according to the type of clustering algorithm they are used with. For partitional algorithms, metrics based on the proximity matrix, as well as metrics of cohesion and separation, such as the silhouette coefficient, are often used. For hierarchical algorithms, the cophenetic coefficient is the most common (see Figure 3).
IV-A Partitional Methods
Several of the measures employed by internal cluster validation methods are based on the concepts of cohesion and separation (see Figure 2). In general, the internal validation value of a set of $K$ clusters can be decomposed as the weighted sum of the validation values for each cluster [tan_introduction_2005]:
$$\text{overall validity} = \sum_{i=1}^{K} w_i \, \text{validity}(C_i)$$
This measure of validity can be cohesion, separation, or some combination of both. Quite often, the weights $w_i$ that appear in the previous expression correspond to cluster size.
The individual measures of cohesion and separation are defined as follows [tan_introduction_2005]:
$$\text{cohesion}(C_i) = \sum_{x \in C_i,\, y \in C_i} \text{proximity}(x, y)$$
$$\text{separation}(C_i, C_j) = \sum_{x \in C_i,\, y \in C_j} \text{proximity}(x, y)$$
Cohesion is measured within a cluster (an intra-cluster metric), whereas separation is measured between clusters (an inter-cluster metric). Both are based on a proximity function that determines how similar a pair of examples are (similarity, dissimilarity and distance functions can be used). These metrics can also be defined for prototype-based clustering techniques, where proximity is measured from data examples to cluster centroids or medoids.
It should be noted that the cohesion metric defined above is equivalent to the cluster SSE (Sum of Squared Errors), also known as SSW (Sum of Squared Errors Within Cluster), when the proximity function is the squared Euclidean distance [tan_introduction_2005]:
$$SSE(C_i) = \sum_{x \in C_i} d(c_i, x)^2 = \frac{1}{2 m_i} \sum_{x \in C_i} \sum_{y \in C_i} d(x, y)^2$$
where $x$ is an example in the cluster, $c_i$ is a cluster representative (e.g. its centroid) and $m_i$ is the number of examples in the cluster $C_i$.
When using the SSE metric, small values indicate a good cluster quality. Obviously, this metric is minimized in those clusters that were built from SSE-optimization-based algorithms such as k-means, but is clearly suboptimal for clusters detected using other techniques, such as density-based algorithms (e.g. DBSCAN) [aggarwal_data_2015].
Likewise, we can maximize the distance between clusters using a separation metric. This approach leads to the between group sum of squares, or SSB [tan_introduction_2005]:
$$SSB = \sum_{i=1}^{K} m_i \, d(c_i, c)^2$$
where $c_i$ is the mean of the cluster $C_i$, $m_i$ is its number of examples, and $c$ is the overall mean [tan_introduction_2005]. Unlike the SSE metric, a good cluster quality is indicated by high SSB values. As before, SSB is biased in favor of algorithms based on maximizing the separation distances among cluster centroids [aggarwal_data_2015].
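A minimal sketch of both quantities, assuming squared Euclidean distance and centroids as cluster representatives, might look as follows. The helper names and the four-point data set are illustrative assumptions. The last line checks the well-known decomposition of the total sum of squares into its within-cluster and between-cluster parts.

```python
def centroid(points):
    """Component-wise mean of a list of equally sized tuples."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sse(cluster):
    """Cohesion of one cluster: squared distances to its centroid."""
    c = centroid(cluster)
    return sum(sq_dist(x, c) for x in cluster)

def ssb(clusters):
    """Separation: size-weighted squared distances from each cluster
    centroid to the overall mean."""
    overall = centroid([x for cluster in clusters for x in cluster])
    return sum(len(cl) * sq_dist(centroid(cl), overall) for cl in clusters)

# Hypothetical clustering of four 2-D points into two clusters.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]

ssw = sum(sse(cl) for cl in clusters)       # total within-cluster SSE
ssb_value = ssb(clusters)
total = sse([x for cl in clusters for x in cl])

print("SSW =", ssw, " SSB =", ssb_value, " SSW + SSB =", ssw + ssb_value)
print("total sum of squares =", total)
```

For this data set the within-cluster sum is small compared with the between-cluster sum, which is the pattern expected from a compact and well-separated clustering.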
As mentioned above, a clustering is considered to be good when it has a high separation between clusters and a high cohesion within clusters [handl_computational_2005]. Instead of dealing with separate metrics for cohesion and separation, there are several metrics that try to quantify the level of separation and cohesion in a single measure [zhao_cluster_2012]:

The Calinski-Harabasz coefficient, CH, also known as the variance ratio criterion, is a measure based on the internal dispersion of clusters and the dispersion between clusters. We would choose the number of clusters $K$ that maximizes the CH value [calinski_dendrite_1974]:
$$CH = \frac{SSB / (K - 1)}{SSW / (n - K)}$$
where $n$ is the number of examples, $SSB$ is the between group sum of squares, and $SSW$ is the total within-cluster sum of squared errors, i.e. the sum of the SSE of the $K$ clusters.

The Dunn index is the ratio of the smallest distance between data from different clusters and the largest distance between data within the same cluster. Again, this ratio should be maximized [dunn_wellseparated_1974]:
$$DI = \frac{\min_{i \neq j} \, \min_{x \in C_i,\, y \in C_j} d(x, y)}{\max_{k} \, \max_{x, y \in C_k} d(x, y)}$$

The Xie-Beni score was designed for fuzzy clustering, but it can be applied to hard clustering. Like the previous coefficients, it is a ratio whose numerator estimates the level of compaction of the data within the same cluster and whose denominator estimates the level of separation of the data from different clusters, so that low values are preferable [xie_validity_1991]:
$$XB = \frac{SSW}{n \cdot \min_{i \neq j} d(c_i, c_j)^2}$$

The Ball-Hall index is a dispersion measure based on the quadratic distances of the cluster points with respect to their centroid [ball_isodata_1965]:
$$BH = \frac{SSW}{K}$$

The Hartigan index is based on the logarithmic relationship between the sum of squares within the cluster and the sum of squares between clusters [hartigan_clustering_1975]:
$$\text{Hartigan} = \log \frac{SSB}{SSW}$$

The Xu coefficient takes into account the dimensionality $D$ of the data, the number $n$ of data examples, and the sum of squared errors from the $K$ clusters [xu_bayesian_1997]:
$$Xu = D \log_2 \left( \sqrt{\frac{SSW}{D \, n^2}} \right) + \log K$$
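Two of the indices in the list above, the Calinski-Harabasz coefficient and the Dunn index, can be sketched in a few lines. This is an illustrative implementation under stated assumptions (Euclidean distance, centroids as cluster representatives, at least two clusters, and at least one cluster with more than one example). The function names and the toy data are not taken from the cited works.

```python
import math
from itertools import combinations

def centroid(points):
    """Component-wise mean of a list of equally sized tuples."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def calinski_harabasz(clusters):
    """Variance ratio: between-cluster over within-cluster dispersion,
    each divided by its degrees of freedom."""
    n = sum(len(cl) for cl in clusters)
    k = len(clusters)
    overall = centroid([x for cl in clusters for x in cl])
    ssw = 0.0
    ssb = 0.0
    for cl in clusters:
        c = centroid(cl)
        ssw += sum(math.dist(x, c) ** 2 for x in cl)
        ssb += len(cl) * math.dist(c, overall) ** 2
    return (ssb / (k - 1)) / (ssw / (n - k))

def dunn(clusters):
    """Smallest distance between examples of different clusters divided by
    the largest distance between examples of the same cluster."""
    separation = min(math.dist(x, y)
                     for a, b in combinations(clusters, 2)
                     for x in a for y in b)
    diameter = max(math.dist(x, y)
                   for cl in clusters
                   for x, y in combinations(cl, 2))
    return separation / diameter

# Hypothetical clustering of four 2-D points into two clusters.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
ch = calinski_harabasz(clusters)
dunn_value = dunn(clusters)
print("CH =", ch, " Dunn =", dunn_value)
```

Both values grow as clusters become more compact and farther apart, which is why both indices are maximized when comparing alternative clusterings of the same data.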

The silhouette coefficient is the most common way to combine the metrics of cohesion and separation in a single measure. Computing the silhouette coefficient at a particular point consists of the following three steps [tan_introduction_2005]:

For each example $i$, the average distance to all the other examples in the same cluster, $C_a$, is computed:
$$a(i) = \frac{1}{|C_a| - 1} \sum_{j \in C_a,\, j \neq i} d(i, j)$$

For each example $i$, the minimum average distance between the example and the examples contained in each cluster not containing the analyzed example is computed:
$$b(i) = \min_{C_b \neq C_a} \frac{1}{|C_b|} \sum_{j \in C_b} d(i, j)$$

For each example $i$, the silhouette coefficient is determined by the following expression:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
The silhouette coefficient is defined in the interval $[-1, 1]$ for each example in our data set. The global silhouette coefficient is just the average of the particular silhouette coefficients for each of the $n$ examples:
$$S = \frac{1}{n} \sum_{i=1}^{n} s(i)$$
Unlike other combined measures, the silhouette coefficient provides us with a simple framework for interpretation. Positive values indicate a high separation between clusters. Negative values are an indication that the clusters are mixed with each other (i.e. an indication of overlapping clusters). When the silhouette coefficient is zero, it is an indication that the data are uniformly distributed throughout the Euclidean space [aggarwal_data_2015].
Unfortunately, one of the main drawbacks of the silhouette coefficient is its high computational complexity, $O(n^2)$, which makes it impractical when dealing with huge data sets [celebi_unsupervised_2016].
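The three steps can be sketched directly from their description. The code below is an illustration rather than a reference implementation. It assumes Euclidean distance and at least two clusters, and it assigns a silhouette of zero to examples in singleton clusters, which is a common convention but an assumption here. The nested loop over all pairs of examples makes the quadratic cost visible.

```python
import math

def silhouette(points, labels):
    """Return the per-example silhouette values and their global average."""
    scores = []
    for i, p in enumerate(points):
        # Accumulate the distances from p to every other example, per cluster.
        sums, counts = {}, {}
        for j, q in enumerate(points):
            if j != i:
                sums[labels[j]] = sums.get(labels[j], 0.0) + math.dist(p, q)
                counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:            # singleton cluster (assumed convention)
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]                                # step 1
        b = min(sums[c] / counts[c] for c in counts if c != own)   # step 2
        scores.append((b - a) / max(a, b))                         # step 3
    return scores, sum(scores) / len(scores)

# Hypothetical 1-D data with two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good_scores, good_avg = silhouette(points, [0, 0, 1, 1])
bad_scores, bad_avg = silhouette(points, [0, 1, 0, 1])
print("well-separated labeling:", good_avg)
print("mixed-up labeling:", bad_avg)
```

The labeling that follows the two groups obtains a global value close to 1, whereas the labeling that mixes them obtains a negative value.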

Despite their widespread use, cohesion and separation metrics are not the only validation method available for partitional clustering techniques. In fact, cohesion and separation metrics do not perform well when it comes to analyzing results obtained by algorithms based on density analysis.
Given the proximity (or similarity) matrix of a data set and the clustering obtained by a clustering algorithm, we can compare the actual proximity matrix to an ideal version of the proximity matrix based on the provided clustering. Reordering rows and columns so that all examples of the same cluster appear together, the ideal proximity matrix has a block diagonal structure. High correlation between the actual and ideal proximity matrices indicates that examples in the same cluster are close to each other, although it might not be a good measure for density-based clusters [tan_introduction_2005].
Unfortunately, the mere construction of the whole proximity matrix is computationally expensive, $O(n^2)$ for $n$ examples, and this validation method cannot be used without sampling for large data sets.
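A possible sketch of this comparison is shown below. The ideal matrix has a 1 for every pair of examples in the same cluster and a 0 otherwise. The min-max conversion from distances to similarities, the function names, and the toy data are assumptions made for the example, and the code expects at least two distinct distance values.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def proximity_correlation(points, labels):
    """Correlation between actual and ideal similarities over all pairs."""
    pairs = list(combinations(range(len(points)), 2))
    dists = [math.dist(points[i], points[j]) for i, j in pairs]
    lo, hi = min(dists), max(dists)
    # Turn distances into similarities in [0, 1] (assumed transformation).
    actual = [1 - (d - lo) / (hi - lo) for d in dists]
    # Ideal similarity: 1 inside a cluster, 0 across clusters.
    ideal = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    return pearson(actual, ideal)

points = [(0.0,), (1.0,), (10.0,), (11.0,)]      # hypothetical 1-D data
good = proximity_correlation(points, [0, 0, 1, 1])
bad = proximity_correlation(points, [0, 1, 0, 1])
print("labeling that follows the groups:", good)
print("labeling that mixes the groups:", bad)
```

Only the upper triangle of the matrices is used, since both are symmetric and their diagonals carry no information.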
IV-B Hierarchical Methods
The clustering validation methods discussed in the previous section were devised for partitional clustering algorithms. Several internal validation techniques have also been proposed and tested with hierarchical clustering algorithms. As you can expect, these evaluation metrics obtain better results when using hierarchical algorithms such as the single link agglomerative clustering algorithm, SLINK [gan_data_2007].
IV-B1 Cophenetic Correlation Coefficient (CPCC)
The cophenetic distance between two examples is the proximity at which an agglomerative hierarchical clustering algorithm puts the examples in the same cluster for the first time [tan_introduction_2005]. Looking at the associated dendrogram, it corresponds to the height at which the branches corresponding to the two examples are merged.
The cophenetic correlation coefficient (CPCC) is a metric used to evaluate the results of a hierarchical clustering algorithm [gan_data_2007]. This correlation coefficient was proposed by Sokal and Rohlf in 1962 [sokal_comparison_1962] as the correlation between the entries of the cophenetic matrix $P_c$, containing cophenetic distances, and the proximity matrix $P$, containing the original proximities.
The cophenetic matrix is defined for pairs of examples as the level of proximity between the examples in the dendrogram (i.e. the level of proximity at which both examples are assigned to the same cluster). The cophenetic correlation coefficient is then defined as [gan_data_2007]
$$CPCC = \frac{\frac{1}{M} \sum_{i<j} d_{ij} \, c_{ij} - \mu_P \, \mu_C}{\sqrt{\left( \frac{1}{M} \sum_{i<j} d_{ij}^2 - \mu_P^2 \right) \left( \frac{1}{M} \sum_{i<j} c_{ij}^2 - \mu_C^2 \right)}}$$
where $d_{ij}$ is the distance between the example pair $(i, j)$, $c_{ij}$ is their cophenetic distance, and $M = n(n-1)/2$ is the number of pairs among $n$ examples. The correlation coefficient also includes the average $\mu_P$ of the distances in the proximity matrix and the average $\mu_C$ of the cophenetic distances in the cophenetic matrix, which can be computed as follows:
$$\mu_P = \frac{1}{M} \sum_{i<j} d_{ij}, \qquad \mu_C = \frac{1}{M} \sum_{i<j} c_{ij}$$
The cophenetic correlation coefficient, like any other correlation coefficient, is a value in the interval $[-1, 1]$, the same range as the silhouette coefficient. High CPCC values indicate a high level of similarity between the two matrices [gan_data_2007], an indication that the clustering algorithm has been able to identify the underlying structure of its input data.
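The following sketch computes cophenetic distances for a small data set and then their correlation with the original distances. It assumes single-link agglomerative clustering and Euclidean distance, and the function names are illustrative. The cophenetic distance of a pair is recorded at the merge that first puts both examples in the same cluster.

```python
import math
from itertools import combinations

def single_link_cophenetic(points):
    """Return the pairwise distances and the cophenetic distances produced
    by single-link agglomerative clustering, both keyed by (i, j), i < j."""
    n = len(points)
    d = {(i, j): math.dist(points[i], points[j])
         for i, j in combinations(range(n), 2)}

    def gap(pair):
        a, b = pair
        return min(d[min(i, j), max(i, j)] for i in a for j in b)

    clusters = [(i,) for i in range(n)]
    coph = {}
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=gap)
        height = gap((a, b))                 # dendrogram height of this merge
        for i in a:
            for j in b:
                coph[min(i, j), max(i, j)] = height
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return d, coph

def cpcc(d, coph):
    """Correlation between original and cophenetic distances."""
    pairs = sorted(d)
    m = len(pairs)
    mu_p = sum(d[p] for p in pairs) / m
    mu_c = sum(coph[p] for p in pairs) / m
    num = sum(d[p] * coph[p] for p in pairs) / m - mu_p * mu_c
    den = math.sqrt((sum(d[p] ** 2 for p in pairs) / m - mu_p ** 2)
                    * (sum(coph[p] ** 2 for p in pairs) / m - mu_c ** 2))
    return num / den

points = [(0.0,), (1.0,), (10.0,), (11.0,)]      # hypothetical 1-D data
d, coph = single_link_cophenetic(points)
value = cpcc(d, coph)
print("cophenetic distances:", coph)
print("CPCC =", value)
```

For this data set the dendrogram preserves the original distances almost perfectly, so the coefficient is close to 1.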
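IV-B2 Hubert Statistic
The Hubert statistic is similar to the cophenetic correlation coefficient. First, concordance and discordance are defined for pairs of example pairs, where $d_{ij}$ denotes the distance between the examples $i$ and $j$ and $c_{ij}$ their cophenetic distance.
A pair $((i, j), (k, l))$ is concordant when ($d_{ij} < d_{kl}$ and $c_{ij} < c_{kl}$) or ($d_{ij} > d_{kl}$ and $c_{ij} > c_{kl}$). Likewise, a pair is said to be discordant when ($d_{ij} < d_{kl}$ and $c_{ij} > c_{kl}$) or ($d_{ij} > d_{kl}$ and $c_{ij} < c_{kl}$). Therefore, a pair is neither concordant nor discordant if $d_{ij} = d_{kl}$ or $c_{ij} = c_{kl}$.
Let $S_+$ and $S_-$ be the number of concordant and discordant pairs, respectively. Then, the Hubert coefficient is defined as [theodoridis_pattern_2003]:
$$\Gamma = \frac{S_+ - S_-}{S_+ + S_-}$$
Like the cophenetic coefficient, the Hubert statistic is between $-1$ and $1$. Like CPCC, it has been mainly used to compare the results of two hierarchical clustering algorithms. A higher Hubert value corresponds to a better clustering of data.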
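Counting concordant and discordant pairs is straightforward once the two lists of values are aligned. In the sketch below, which is an illustration with assumed names and toy values, both lists enumerate the example pairs in the same order: the first holds their distances and the second their cophenetic distances. The statistic is undefined when every comparison is tied, a case the sketch does not handle.

```python
from itertools import combinations

def hubert_gamma(d, c):
    """(concordant - discordant) / (concordant + discordant), where d[p] and
    c[p] are the distance and the cophenetic distance of example pair p."""
    concordant = discordant = 0
    for p, q in combinations(range(len(d)), 2):
        sign = (d[p] - d[q]) * (c[p] - c[q])
        if sign > 0:          # both orderings agree
            concordant += 1
        elif sign < 0:        # the orderings disagree
            discordant += 1
        # sign == 0: a tie, neither concordant nor discordant
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical values for the six pairs of a four-example data set.
distances = [1.0, 10.0, 11.0, 9.0, 10.0, 1.0]
cophenetic = [1.0, 9.0, 9.0, 9.0, 9.0, 1.0]
reversed_cophenetic = [9.0, 1.0, 1.0, 1.0, 1.0, 9.0]

agreeing = hubert_gamma(distances, cophenetic)
opposed = hubert_gamma(distances, reversed_cophenetic)
print("orderings agree:", agreeing)
print("orderings are reversed:", opposed)
```

Because only the ordering of the values matters, the statistic is insensitive to any monotonic rescaling of the distances.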
V External Validation
External validation methods can be associated with a supervised learning problem. External validation proceeds by incorporating additional information in the clustering validation process, i.e. external class labels for the training examples. Since unsupervised learning techniques are primarily used when such information is not available, external validation methods are not used on most clustering problems. However, they can still be applied when external information is available and also when synthetic data is generated from a real data set.
Like internal validation methods, it is also possible to classify external metrics depending on the algorithmic approach of the clustering technique used to solve a particular clustering problem. A more rational classification of external validation methods is shown in Figure 4 [cord_machine_2008]. According to this taxonomy, different external validation metrics can be used to compare two sets of clusters, the first one obtained by the clustering algorithm under evaluation and the second one provided by an external source.
We want to compare the result of a clustering algorithm, $C$, to a potentially different partition of the data, $P$, which might represent the expert knowledge of the analyst (their experience or intuition), prior knowledge of the data in the form of class labels, the results obtained by another clustering algorithm, or simply a grouping considered to be “correct” [gan_data_2007].
In order to carry out this analysis, a contingency matrix must be built to evaluate the clusters detected by the algorithm. This contingency matrix contains four terms:

$TP$: The number of data pairs found in the same cluster, both in $C$ and in $P$.

$FP$: The number of data pairs found in the same cluster in $C$ but in different clusters in $P$.

$FN$: The number of data pairs found in different clusters in $C$ but in the same cluster in $P$.

$TN$: The number of data pairs found in different clusters, both in $C$ and in $P$.
From these four indicators, we can easily obtain:

The number of pairs found in the same cluster in $C$: $m_1 = TP + FP$.

The number of pairs found in the same cluster in $P$: $m_2 = TP + FN$.
Obviously, the total number of pairs must be $M = TP + FP + FN + TN = n(n-1)/2$, where $n$ is the number of examples.
V-A Matching Sets
The first family of external validation methods that can be used to compare two partitions of data consists of those methods that identify the relationship between each cluster detected in $C$ and its natural correspondence to the classes in the reference result defined by $P$.
Several measures can be defined to measure the similarity between the clusters in $C$, obtained by the clustering algorithm, and the clusters in $P$, corresponding to our prior (external) knowledge [aggarwal_data_2016]:

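Precision counts the true positives, i.e. how many examples are properly classified within the same cluster [perry_machine_1955]:
$$Pr = \frac{p_{ij}}{p_j}$$

Recall evaluates the percentage of elements that are properly included in the same cluster:
$$R = \frac{p_{ij}}{p_i}$$

The F-measure combines precision and recall in a single metric, their weighted harmonic mean:
$$F_\beta = \frac{(\beta^2 + 1) \cdot Pr \cdot R}{\beta^2 \cdot Pr + R}$$
Quite often, precision and recall are evenly combined with an unweighted harmonic mean ($\beta = 1$):
$$F_1 = \frac{2 \cdot Pr \cdot R}{Pr + R}$$

Purity evaluates whether each cluster contains only examples from the same class:
$$Pu = \sum_{j} p_j \, \max_{i} \frac{p_{ij}}{p_j}$$
In the expressions above, $p_{ij} = n_{ij}/n$, $p_i = n_i/n$, and $p_j = n_j/n$, where $n$ is the number of examples, $n_{ij}$ is the number of examples belonging to the class $i$ found in the cluster $j$, and $n_i$ ($n_j$) is the number of examples in the class $i$ (cluster $j$).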
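The matching-set measures can be derived from a contingency table of class and cluster labels. The sketch below is illustrative: it takes precision as the fraction of a cluster that belongs to a class and recall as the fraction of a class that falls in a cluster, which is one common reading of these definitions. The function name and the six labeled examples are assumptions.

```python
from collections import Counter

def matching_metrics(classes, clusters, beta=1.0):
    """Precision, recall and F-measure for every (class, cluster) pair that
    occurs in the data, plus the overall purity of the clustering."""
    n = len(classes)
    n_ij = Counter(zip(classes, clusters))   # examples of class i in cluster j
    n_i = Counter(classes)                   # class sizes
    n_j = Counter(clusters)                  # cluster sizes
    precision = {(i, j): v / n_j[j] for (i, j), v in n_ij.items()}
    recall = {(i, j): v / n_i[i] for (i, j), v in n_ij.items()}
    f_measure = {
        key: (beta ** 2 + 1) * precision[key] * recall[key]
             / (beta ** 2 * precision[key] + recall[key])
        for key in n_ij
    }
    # Purity: each cluster contributes the size of its majority class.
    purity = sum(max(v for (i, c), v in n_ij.items() if c == j)
                 for j in n_j) / n
    return precision, recall, f_measure, purity

# Hypothetical reference classes and detected clusters for six examples.
classes = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
precision, recall, f_measure, purity = matching_metrics(classes, clusters)
print("precision:", precision)
print("recall:", recall)
print("F1:", f_measure)
print("purity:", purity)
```

Cluster 0 is pure but recovers only part of class "a", so its precision is high and its recall is lower, while cluster 1 shows the opposite pattern for class "b".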
V-B Peer-to-peer Correlation
A second family of measures for external validation is based on the correlation between pairs, i.e. they seek to measure the similarity between two partitions under equal conditions, such as the results of a grouping process for the same set obtained by means of two different methods, $C$ and $P$. It is assumed that the examples that are in the same cluster in $C$ should be in the same class in $P$, and vice versa [tan_introduction_2005].
Some metrics based on measuring the correlation between pairs are the following:

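The Jaccard coefficient assesses the similarity of the detected clusters $C$ to the provided partition $P$:
$$J = \frac{TP}{TP + FP + FN}$$

The Rand coefficient is similar to the Jaccard coefficient, yet it is measured against the total data set (equivalent to accuracy in a supervised machine learning setting):
$$Rand = \frac{TP + TN}{M}$$

The Fowlkes and Mallows coefficient computes the similarity between the clusters found by the algorithm with respect to the independent markers:
$$FM = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}$$

We can also resort to the Hubert statistical coefficient in this context:
$$\Gamma = \frac{M \cdot TP - m_1 m_2}{\sqrt{m_1 m_2 (M - m_1)(M - m_2)}}$$
In these expressions, $TP$ and $TN$ count the pairs of examples on which both partitions agree (same cluster in both, or different clusters in both), $FP$ and $FN$ count the pairs on which they disagree, $m_1 = TP + FP$, $m_2 = TP + FN$, and $M$ is the total number of pairs. These pair counts can be obtained from the contingency counts: as before, $n_{ij}$ is the number of examples belonging to the class $i$ found in the cluster $j$, whereas $n_i$ ($n_j$) is the number of examples in the class $i$ (cluster $j$), so that $TP = \sum_{i,j} \binom{n_{ij}}{2}$, $m_1 = \sum_{j} \binom{n_j}{2}$, and $m_2 = \sum_{i} \binom{n_i}{2}$.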
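A direct way to obtain the pair-based indices is to enumerate all pairs of examples and classify each pair according to whether the two partitions put its members together. The sketch below is illustrative, with assumed function names and toy labels, and it enumerates pairs explicitly, so it is only practical for small data sets. It also assumes that neither partition is trivial, since the Hubert coefficient would otherwise divide by zero.

```python
import math
from itertools import combinations

def pair_counts(clusters, reference):
    """Count pairs that are together in both partitions (tp), only in the
    clustering (fp), only in the reference (fn), or in neither (tn)."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = reference[i] == reference[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def pair_indices(clusters, reference):
    tp, fp, fn, tn = pair_counts(clusters, reference)
    m = tp + fp + fn + tn          # total number of pairs
    m1, m2 = tp + fp, tp + fn      # pairs together in each partition
    return {
        "jaccard": tp / (tp + fp + fn),
        "rand": (tp + tn) / m,
        "fowlkes_mallows": tp / math.sqrt(m1 * m2),
        "hubert": (m * tp - m1 * m2)
                  / math.sqrt(m1 * m2 * (m - m1) * (m - m2)),
    }

# Hypothetical detected clusters and reference classes for six examples.
clusters = [0, 0, 1, 1, 1, 1]
reference = ["a", "a", "a", "b", "b", "b"]
counts = pair_counts(clusters, reference)
indices = pair_indices(clusters, reference)
print("TP, FP, FN, TN =", counts)
print(indices)
```

The Fowlkes and Mallows value is computed as the count of pairs together in both partitions divided by the geometric mean of the two together-counts, which is algebraically the same as the geometric mean of pairwise precision and recall.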
V-C Measures Based on Information Theory
A third family of external cluster validation metrics is based on Information Theory concepts, such as the existing uncertainty in the prediction of the natural classes provided by the partition $P$. This family includes basic measures such as entropy and mutual information, as well as their respective normalized variants.

Entropy is a reciprocal measure of purity that allows us to measure the degree of disorder in the clustering results:
$$H = -\sum_{j} p_j \sum_{i} \frac{p_{ij}}{p_j} \log \frac{p_{ij}}{p_j}$$

Mutual information allows us to measure the reduction in uncertainty about the clustering results given knowledge of the prior partition:
$$MI = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{p_i \, p_j}$$
As always, $p_{ij} = n_{ij}/n$, $p_i = n_i/n$, and $p_j = n_j/n$.
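Both measures can be computed from the relative frequencies of the (class, cluster) combinations. The sketch below is illustrative. It uses base-2 logarithms, so results are in bits, and it takes entropy to be the size-weighted average entropy of the class distribution inside each cluster. Both choices, as well as the names and toy labels, are assumptions made for the example.

```python
import math
from collections import Counter

def entropy_and_mutual_information(classes, clusters):
    n = len(classes)
    p_ij = {key: v / n for key, v in Counter(zip(classes, clusters)).items()}
    p_i = {key: v / n for key, v in Counter(classes).items()}     # classes
    p_j = {key: v / n for key, v in Counter(clusters).items()}    # clusters
    # Weighted entropy of the class distribution within each cluster.
    entropy = -sum(p * math.log2(p / p_j[j]) for (i, j), p in p_ij.items())
    # Mutual information between class labels and cluster labels.
    mutual_information = sum(p * math.log2(p / (p_i[i] * p_j[j]))
                             for (i, j), p in p_ij.items())
    return entropy, mutual_information

# Hypothetical labels: one clustering matches the classes, the other does not.
classes = ["a", "a", "b", "b"]
matching = entropy_and_mutual_information(classes, [0, 0, 1, 1])
unrelated = entropy_and_mutual_information(classes, [0, 1, 0, 1])
print("clusters match the classes (entropy, MI):", matching)
print("clusters unrelated to the classes (entropy, MI):", unrelated)
```

Only combinations that actually occur are stored, so terms with a zero frequency, which contribute nothing to either sum, never reach the logarithm.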
VI Hyperparameter Tuning
Internal and external validation metrics are used once the clustering algorithm has been applied to the available data set. However, the clustering algorithm itself has its own parameters. Adjusting those parameters, also known as hyperparameters in the machine learning literature, can help us obtain very different clustering results.
When using unsupervised machine learning techniques, several issues affect their effectiveness. Even though external validation metrics can help us evaluate whether the obtained clusters closely match the underlying categories in the training data, which the clustering algorithm tries to identify without externally provided class labels, those metrics cannot address other issues such as the right number of clusters for our current data set. For instance, in the case of hierarchical clustering techniques, we are certainly interested in determining the best level at which we can cut our dendrogram.
Any clustering algorithm has a set of parameters $\Theta$, which might or might not include the number of clusters $K$. Hyperparameter tuning tries to determine, among the different possible values of the parameters in $\Theta$, which set of parameter values is the most suitable for our particular clustering problem.
We could proceed in the following way [halkidi_clustering_2001]:

When the algorithm does not include the number of clusters among its parameters ($K \notin \Theta$), we run the algorithm with different values for its parameters so that we can determine the largest range of values for which the number of detected clusters $K$ remains constant. Later, we choose as parameter values the values in the middle of this range.

When the algorithm parameters include the desired number of clusters ($K \in \Theta$), we run the algorithm for a range of values for $K$ between $K_{min}$ and $K_{max}$. For each value of $K$, we run the algorithm multiple times using different sets of values for the remaining parameters (i.e. starting from different initial conditions) and choose the value that optimizes our desired validation metric, which might be internal or external depending on our particular clustering problem.
When we just want to determine the “right” number of clusters, $K$, plotting the validation results for different values of $K$ can sometimes show a relevant change in the validation metric, commonly referred to as a “knee” or “elbow” [theodoridis_pattern_2003].
However, in practice, the number of parameters in $\Theta$ might be large, so we cannot test every possible combination of parameter values in a systematic way. Hyperparameter tuning can then be seen as a combinatorial optimization problem. Fortunately, we can resort to automated tuning strategies that facilitate our search [Bergstra:2012:RSH:2188385.2188395]. Among the tuning strategies at our disposal, we could include the following ones [berzal_redes_2019]:

Grid search is based on a systematic exploration of the hyperparameter space, on which we define a grid and test one parameter combination in each cell of the grid. If we have $p$ parameters, the systematic exploration of a $p$-dimensional space might require an exponential number of parameter configurations, which is often infeasible in practice.

Random search chooses parameter configurations at random. Even though our search is not exhaustive, when using random search, we hope that some parameter combinations will lead us to promising regions in our search space. The rationale behind random search is that, quite often, local changes in the parameter values do not produce significant changes in the algorithm output, so that a systematic exploration might not be really necessary, even when feasible (usually, it is not feasible anyway).

Smart search techniques try to optimize the problem of searching for hyperparameter values. Different strategies can be implemented, such as Bayesian optimization using Gaussian processes and evolutionary optimization using genetic algorithms or evolution strategies.
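The first two strategies in the list above can be sketched generically. In the code below, `toy_score` is a hypothetical stand-in for "run the clustering algorithm with these parameter values and compute a validation metric", and the parameter names `eps` and `min_pts` are only illustrative. Nothing here is tied to a particular clustering library.

```python
import random
from itertools import product

def grid_search(score, grid):
    """Evaluate every combination of the listed values (one per grid cell)
    and return the best-scoring configuration."""
    names = list(grid)
    candidates = (dict(zip(names, values)) for values in product(*grid.values()))
    return max(candidates, key=score)

def random_search(score, ranges, trials, seed=0):
    """Evaluate `trials` configurations drawn uniformly at random from the
    given (low, high) ranges and return the best-scoring one."""
    rng = random.Random(seed)
    candidates = [{name: rng.uniform(low, high)
                   for name, (low, high) in ranges.items()}
                  for _ in range(trials)]
    return max(candidates, key=score)

def toy_score(params):
    """Hypothetical validation metric with its optimum at eps=0.3, min_pts=5."""
    return -((params["eps"] - 0.3) ** 2 + (params["min_pts"] - 5) ** 2)

# 3 values per parameter and 2 parameters: 3 ** 2 = 9 configurations.
best_grid = grid_search(toy_score, {"eps": [0.1, 0.3, 0.5], "min_pts": [3, 5, 7]})
best_random = random_search(toy_score, {"eps": (0.0, 1.0), "min_pts": (1.0, 10.0)},
                            trials=50)
print("grid search:", best_grid)
print("random search:", best_random)
```

The number of grid configurations is the product of the number of values per parameter, which is what makes grid search grow exponentially with the number of parameters, whereas the cost of random search is fixed by the number of trials.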
VII Conclusion
Determining the quality of the results provided by a clustering algorithm is not an easy problem. Kleinberg defined three properties any clustering algorithm should try to satisfy (the axioms of scale invariance, richness, and consistency) and proved an impossibility theorem that shows that no clustering algorithm can simultaneously satisfy all of them.
A wide range of metrics have been proposed in the literature to quantify the quality of clustering results:

Internal validation metrics do not require external information. These metrics focus on measuring cluster cohesion and separation, on the statistical analysis of the proximity matrix, or on the study of the dendrogram generated by hierarchical clustering algorithms.

External validation metrics resort to externally provided information to evaluate the quality of the clustering results. A large number of external validation metrics are at our disposal, ranging from matching sets to peer-to-peer correlation and information-theoretic indices.
External validation metrics are also useful when comparing the results provided by different clustering algorithms (or the same algorithm with different sets of parameter values).
Finding the best configuration for the parameters of an algorithm is known as hyperparameter tuning. This process is often necessary, for instance, to determine the optimal number of clusters for a particular clustering problem.