Spectral Overlap and a Comparison of Parameter-Free, Dimensionality Reduction Quality Metrics
Nonlinear dimensionality reduction methods are a popular tool for data scientists and researchers to visualize complex, high dimensional data. However, while these methods continue to improve and grow in number, it is often difficult to evaluate the quality of a visualization due to a variety of factors such as lack of information about the intrinsic dimension of the data and additional tuning required for many evaluation metrics. In this paper, we seek to provide a systematic comparison of dimensionality reduction quality metrics using datasets where we know the ground truth manifold. We utilize each metric for hyperparameter optimization in popular dimensionality reduction methods used for visualization and provide quantitative metrics to objectively compare visualizations to their original manifold. In our results, we find a few methods that appear to consistently do well and propose the best performer as a benchmark for evaluating dimensionality reduction based visualizations.
In a variety of modern applications, researchers find dimensionality reduction to be a beneficial way of visualizing high dimensional data. Some examples of successful applications include HIV analysis  and analyzing gene data  while some researchers use dimensionality reduction to showcase algorithmic output as opposed to the raw data , . But while dimensionality reduction for visualization is often regarded as a helpful way to explore data, a quantitative measure of the low dimensional output’s similarity to its high dimensional input is infrequently reported.
The task of dimensionality reduction for a data set can be specified as follows. The researcher begins with a set of N observations in a high dimensional space, which can be represented as a matrix X. Next, a target dimension is chosen, which could be any dimension smaller than the original in order to reduce the number of unfavorable properties that come with high dimensional spaces  or can be set to 2 or 3 as a means to visualize the high dimensional data. The dimensionality reduction algorithm is then applied to map X to a low dimensional set Y while seeking to maintain as much of the original structure from X in Y. This problem is difficult in real applications for a variety of reasons, such as lack of information about the effective dimension or the geometry of the manifold on which the data lives .
One of the challenges for researchers using tunable nonlinear dimensionality reduction algorithms is determining which performance metric in the literature is best suited for their application. The first difficulty in the current literature is the prevalence of quality metrics that require additional tuning from the user. Without a proper prior, this task can be difficult because different choices of metric parameter values result in different optimal choices for dimensionality reduction hyperparameters. Improper tuning of a quality metric can therefore lead to misleading visualizations. The second difficulty is the lack of low dimensional exploration of quality metric performance to ensure that a given metric is useful for hyperparameter optimization. In the experiments section, we compare these metrics on simpler cases in low dimensions to explore each metric’s ability to act as a performance measure when we can visualize the ground truth for the low dimensional local structure.
As the primary target of dimensionality reduction evaluation literature, nonlinear dimensionality reduction methods continue to grow in number and have succeeded in a variety of applications. The following methods are just a few of the algorithms that we use in our experiments to assess the quality metrics. An early method is Sammon mapping  where squared differences between high and low dimensional pairwise distances are scaled by the Euclidean distance in the original space . Another very popular method, t-SNE , uses pairwise Euclidean distances to generate conditional probabilities in the high and low dimensional spaces and then seeks to minimize the KL divergence between the distribution induced by a Gaussian in the high dimensional space and the distribution induced by a Student t-distribution with 1 degree of freedom in the low dimensional space. Also,  propose their method “local multidimensional scaling” which utilizes a combination of MDS and concepts from “force-directed” graph drawing. Our emphasis is especially on methods such as local MDS and t-SNE, which both have hyperparameters that can dramatically change the visualization and mislead a user into believing that there is additional underlying structure or that the classes are very easily separable.
The contributions of the paper are as follows. We provide a joint comparison of recent “parameter-free” quality metrics and promising nonlinear dimensionality reduction algorithms for visualization. Based on our findings, we recommend  and the Spectral Overlap quality metric developed in this paper.
The remainder of the paper is outlined as follows. In Section 2, we introduce existing challenges and past approaches. In Section 3, we discuss the quality metric literature and some of the pitfalls. In Section 4, we introduce the Spectral Overlap quality metric. Finally, in Section 5, we run experiments to evaluate popular and promising nonlinear dimensionality reduction algorithms for visualization using quality metrics that do not require the researcher to specify parameters such as the number of relevant nearest neighbors.
In some of the more basic dimensionality reduction algorithms for visualization such as multidimensional scaling and PCA, there are no additional tunable parameters to influence the output Y, and interpreting the result is straightforward. However, these linear techniques are limited and suffer from various pitfalls such as focusing heavily on large pairwise distances and therefore compromising the small pairwise distances which make up the local structure of the data . In addition, some of the most successful applications of dimensionality reduction in areas such as computer vision, natural language processing, and audio signal processing have made tremendous strides through the use of highly nonlinear models. Both observations suggest that there is potentially more to gain by using nonlinear methods which focus on local structure in the input data.
But, since these methods are nonlinear and many do not develop an explicit mapping, additional concerns arise including the stability of visualizations and how well the newly generated output dimension represents the original high dimensional space. Interestingly,  tackle this by posing dimensionality reduction as an optimization problem and learn mapping functions which could potentially act as one way to visualize the stability of the dimensionality reduction process.
Another area of difficulty for these nonlinear dimensionality reduction methods is that they are often favored for high dimensional tasks where there is no known equivalent to a “ground truth”. In regression tasks, we have a “ground truth” response and in classification tasks, we have . Since the field of dimensionality reduction does not have a universal and similarly objective method for comparison, there is a reliance on the researcher to provide the remaining assumptions to correctly map some high dimensional X to low dimensional Y. We can see that this is the case in  and  where much of the analysis is qualitative and the evaluation is primarily based on an algorithm’s ability to separate like-classes into separate clusters or groups. However, no numerical measurement is used to bolster the qualitative claims and provide formal comparison with respect to the original input dataset. This can be a bit concerning if it turns out that a visualization was spurious or has little to no relation to the original high dimensional data. For cases where the data may be used downstream in tasks like regression or classification, it might be reasonable to have some measure of similarity to the high dimensional space in order to better gauge if observed group separability or low dimensional structure is robust to new data.
A demonstration conducted by  expresses the same concern by displaying how changing just one of t-SNE’s tuning parameters can result in a variety of possibly misleading visualizations. The authors discuss how, when one is using t-SNE, cluster sizes, distances between clusters, and interesting geometry in the visualization might be random or inconsistent with the original input data. Furthermore, they show how smaller values of t-SNE’s “perplexity” hyperparameter, which acts as a smooth analogue of the number of nearest neighbors in a kNN graph, can result in different visualizations after each run of the algorithm.
Ultimately, as we see in Section 3, the approach we propose is similar to the intuition behind the heuristic  which is to maintain as much local structure as possible but with a secondary consideration for global structure.
The beginning of dimensionality reduction quality metrics dates back to the 1960s with methods such as Kruskal’s Stress Measure and Sammon Stress. Since then, numerous methods have been developed to evaluate how well an algorithm replicates high dimensional structure in a lower dimensional space  with an added emphasis on local structure. Many of these methods, such as trustworthiness and continuity , local continuity meta-criterion , and mean relative rank errors , depend on an additional tunable parameter that defines performance by an algorithm’s ability to maintain the same nearest neighbors in some fashion. While a subjective choice of this parameter might not matter much for a small number of data points, it can become unclear whether even a carefully chosen value is reasonable.
Cognizant of this concern, many authors such as , , and , have tried evaluating projections qualitatively by plotting the curve created by evaluating tunable metrics for multiple values of the parameter. Some complications with this approach can be seen in  if one reviews Figure 5, which plots dimensionality reduction performance on the Swiss roll dataset, and Figure 6, which plots dimensionality reduction performance on 1000 images drawn from the MNIST dataset. We see an immediate local maximum in the performance curve followed by a decline in values before increasing once again for large values of the parameter. CCA using geodesic distances appears to be the best in Figure 5, but t-SNE using Euclidean distance is labeled the best performer based on their rule of judging performance by the best “local” score. The first few neighbors appear to be kept very well, but that same level of performance is not reached again until the parameter is much greater than the location of the first local maximum. This phenomenon occurs when there is an overemphasis on local neighborhoods such as in cases in . In the examples with small perplexity, t-SNE forces immediate neighbors to be close but results in tiny clusters that have very little resemblance to the original dataset.
In addition,  propose the co-ranking matrix, which is a more comprehensive means of evaluating the dimensionality reduction process. The co-ranking matrix is an (N-1)-by-(N-1) matrix that is a joint histogram of the ranks, with observations above the diagonal called “extrusions” and those below the diagonal called “intrusions”. If used as an evaluation tool, one can identify hard intrusions and extrusions based on a choice of the parameter that determines the nearest neighbors. Lee and Verleysen also show how T&C, MRRE, and LCMC are similar to penalizing different portions of the co-ranking matrix and go on to offer the co-ranking matrix as a framework for future development of dimensionality reduction quality metrics. Some of the insights gained by viewing dimensionality reduction performance through this lens include allowing the user to identify less harmful, small intrusions/extrusions which may come with noise flattening, and large amplitude extrusions which characterize a tearing in the manifold.
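To make the structure of the co-ranking matrix concrete, the following is a minimal sketch in plain Python. It assumes Euclidean distances, breaks distance ties by point index, and uses rows for input-space ranks and columns for output-space ranks; the function names are hypothetical and are not taken from any package used in this paper.

```python
import math


def rank_matrix(points):
    """ranks[i][j] = rank of point j among the neighbors of point i (1 = nearest)."""
    n = len(points)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        # Sort the other points by Euclidean distance; ties are broken by index.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (math.dist(points[i], points[j]), j))
        for r, j in enumerate(order, start=1):
            ranks[i][j] = r
    return ranks


def coranking_matrix(high, low):
    """(N-1)-by-(N-1) joint histogram of input-space and output-space ranks."""
    n = len(high)
    rho, r = rank_matrix(high), rank_matrix(low)
    q = [[0] * (n - 1) for _ in range(n - 1)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Row index: rank in the input space; column index: rank in the output space.
                q[rho[i][j] - 1][r[i][j] - 1] += 1
    return q


def intrusions_extrusions(q):
    """Count pairs below the diagonal (intrusions) and above it (extrusions)."""
    m = len(q)
    intrusions = sum(q[k][l] for k in range(m) for l in range(m) if k > l)
    extrusions = sum(q[k][l] for k in range(m) for l in range(m) if k < l)
    return intrusions, extrusions
```

When the output preserves every neighbor rank, all of the mass sits on the diagonal and both counts are zero.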
While tunable metrics do provide substantial information about the relationship between high and low dimensional space, quality metrics without tunable parameters require fewer assumptions on the part of the user and can therefore offer more consistency in the quality of the output.  propose using a metric that measures preservation of distance orderings and have offered Spearman’s rho as one of the early measures for determining the preservation of topology. Some newer methods have since appeared, such as “entropy” and “mutual information”, which were proposed by  and treat the dimensionality reduction process as a communication channel model which transfers data points from high to low dimensional space.
4 Spectral Overlap
Building on this understanding of nonlinear dimensionality reduction quality metrics, we propose the following method which we call “Spectral Overlap”. The intuition here is that we want to penalize any lack of overlap in every KNN graph for . This provides equal weight across all neighbors.
We specify “Spectral Overlap” as follows: Let be the KNN graph in input space with parameter and be the KNN graph in output space with parameter . We begin by calculating the overlap penalty:
The overlap measure above captures the mismatch in high and low dimensional KNN graphs. From here, we scale by a normalizing constant to characterize the decay in performance as one increases the number of data points. This yields the quality metric:
If there is a tear in the manifold, more nearest-neighbor relationships will not be upheld and the penalty will be larger. Additionally, this also penalizes cases where one group can potentially occlude another due to the limits of the dimensionality reduction algorithm. We find that for linear dimensionality reduction algorithms, this can happen often such as in the case of the clusters data set we explore in Section 5.
Ultimately, due to the limit of metric spaces, spectral overlap’s objective is to measure the preservation of nearest neighbor relationships and cannot properly address a lack of transitivity in cases such as word embeddings . Thus, we alternatively recommend representing high dimensional data in multiple maps if the transitivity of data points is not guaranteed.
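The idea of comparing KNN graphs across all neighborhood sizes can be sketched as follows. This is only an illustration under stated assumptions, not the exact formula of this section: it assumes the mismatch for a given k is the number of directed input-space KNN edges that are absent from the output-space KNN graph, that k runs from 1 to N-1, and that the normalizing constant is the total number of edges compared. The function names are hypothetical.

```python
import math


def neighbor_orders(points):
    """For each point, the indices of all other points sorted from nearest to farthest."""
    n = len(points)
    return [sorted((j for j in range(n) if j != i),
                   key=lambda j: (math.dist(points[i], points[j]), j))
            for i in range(n)]


def spectral_overlap_sketch(high, low):
    """Normalized KNN-graph mismatch summed over k = 1..N-1 (0 means full overlap)."""
    n = len(high)
    hi_order, lo_order = neighbor_orders(high), neighbor_orders(low)
    penalty, total = 0, 0
    for k in range(1, n):
        for i in range(n):
            hi_nb = set(hi_order[i][:k])   # k nearest neighbors in input space
            lo_nb = set(lo_order[i][:k])   # k nearest neighbors in output space
            penalty += len(hi_nb - lo_nb)  # input-space edges lost in the output
            total += k                     # assumed normalizer: edges compared
    return penalty / total
```

Under these assumptions the score is 0 when every neighbor ordering is preserved and grows toward 1 as more neighbor relationships are broken, for example by a tear in the manifold.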
To evaluate some of these more objective measures of local structure performance, we evaluate four methods on six data sets using three publicly available CRAN packages and a package we developed for local multidimensional scaling. We apply t-SNE from the R package Rtsne , UMAP from umap , Sammon Mapping from stats , and local MDS from our package lmds . Datasets such as the synthetic Swiss roll dataset  and the popular MNIST dataset  are common choices for comparing nonlinear dimensionality reduction visualizations. We instead propose a series of datasets where we know confounding attributes such as the intrinsic dimension and local structure, and we use a Bayes error metric that complements each specific data generating process to evaluate the performance of the more general quality metrics that have been proposed.
The six datasets are: (1) Two Lines, (2) Trefoil Knot, (3) Three Gaussians, (4) Noisy Circles, (5) Curved X’s, and (6) High Dimensional Clusters. Datasets (1) and (2) are directly from  and (3) is inspired by the three cluster dataset except we increase the variance for one cluster. Dataset (4) comes from  and (5) is inspired by the parabola example in . Finally, we add (6) which is a set of separable clusters on the corners of a 4-dimensional hyper-cube.
Since we have the data generating process for each dataset, we begin by generating 250 data points for each 2D dataset and 800 data points for the high dimensional cluster data set from the same random seed. Next, for each nonlinear dimensionality reduction algorithm, we pre-specify a range of values for each of the algorithm’s hyperparameters and create a grid of all combinations, which is searched once for each quality metric. The list of hyperparameters for each method is provided in the lmds GitHub repository.
For each algorithm, we then set 5 different randomized seeds and evaluate each algorithm with a given group of hyperparameters from its corresponding grid using each of the different quality metrics. We then select the best scoring hyperparameter configurations for each algorithm and quality metric and re-run those algorithms with the same seed. These resulting visualizations are then representative of good performance for each nonlinear dimensionality reduction algorithm’s parameters with respect to each quality metric.
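The search procedure described above can be sketched generically as follows. The sketch assumes that lower metric values are better and that the embedding function accepts a `seed` keyword plus its hyperparameters as keywords; `embed`, `quality`, and `grid_search` are hypothetical names used only for illustration.

```python
import itertools


def grid_search(embed, quality, data, grid, seeds):
    """Return (best_score, best_params, best_seed), where lower scores are better.

    embed(data, seed=..., **params) -> low dimensional output
    quality(data, output)           -> scalar score
    grid                            -> dict mapping hyperparameter name to candidate values
    """
    names = sorted(grid)
    best = None
    # Enumerate every combination of hyperparameter values in the grid.
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        for seed in seeds:
            output = embed(data, seed=seed, **params)
            score = quality(data, output)
            if best is None or score < best[0]:
                best = (score, params, seed)
    return best
```

The returned seed and hyperparameters can then be passed back to the same embedding function to regenerate the best scoring visualization for a given algorithm and quality metric.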
Finally, we evaluate each algorithm-quality metric pair for each data set using Bayes error metrics that are appropriate for the data generating processes. For the 2-dimensional data sets, since we are projecting back to the same dimension, we are able to use the Procrustes Distance between the input data and output visualization where . The Procrustes Distance  compares two data sets of the same dimension and provides the L2 distance after scaling, rotating, and shifting one of the data sets. This provides a means for measuring how well an algorithm can exactly replicate the original manifold if we are mapping to the intrinsic dimension of the data and know its topology exactly. Next, for the high dimensional clusters data set, we use the accuracy of the first nearest neighbor in the low dimensional output Y when classifying the class we assigned in high dimensional space . Once again, this is only relevant given that we have specifically devised a case where the groups are completely separable with no overlap. In both cases, the Bayes error metrics are hypothetical and complement the specific data generating process. However, by identifying these base cases, we can compare more general quality metrics such as those in Appendix C to get a better sense of the robustness of said metrics prior to using them in more complex cases.
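As an illustration of the first nearest neighbor criterion, a minimal sketch is given below. It assumes a leave-one-out evaluation with Euclidean distances in the output space and ties broken by index; the function name is hypothetical.

```python
import math


def one_nn_accuracy(embedding, labels):
    """Leave-one-out accuracy of a first-nearest-neighbor classifier in the output space."""
    n = len(embedding)
    correct = 0
    for i in range(n):
        # Nearest other point in the low dimensional output.
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: (math.dist(embedding[i], embedding[j]), j))
        correct += labels[nearest] == labels[i]
    return correct / n
```

An accuracy of 1.0 means every point's nearest neighbor in the visualization carries the same class that was assigned in the high dimensional space.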
5.3 Results & Discussion
Results from our simulations are shown in Table 1. Since scale varies across datasets, we rank the performance of each metric on each algorithm-dataset pair from 1 to the number of metrics, including the “Bayes error” metric. Average performance is broken down by dimension first, and then the performance for the 2-dimensional and higher dimensional cases is averaged to generate a final score.
Based on our simulations, we see that and Spectral Overlap tend to do the best in reducing the defined Bayes error metrics for each dataset and algorithm. This is then followed by Local Error and then Entropy. We see that which is based on a heuristic for tunable quality metrics and Mutual Information perform the worst.
For the 2-dimensional data sets, we see that on average, across these metrics, Sammon mapping does the best followed by local MDS, then t-SNE, and finally UMAP. Local MDS, UMAP, and t-SNE had the most difficulty with the Curved X’s dataset, with higher Procrustes distances on average than on the other datasets. The next worst dataset was the 3 Gaussians dataset. For a more visual exploration of the results, we provide visualizations of the original datasets as well as the best performers for each algorithm-quality metric pair in the supplementary material.
However, on the multi-dimensional cluster data set, we see that Sammon mapping performs the worst and that more nonlinear methods tend to outperform across almost all quality metrics. Across most metrics, t-SNE appears to have the most robust performance with UMAP following closely. We see that performs the best followed by Local Error and Spectral Overlap. Perhaps one of the most interesting aspects of these results can be seen in the visualization in Figure 1. While has good numerical performance based on our choice of Bayes error, we can see that this Bayes error is prone to tearing of the manifold. and Spectral Overlap sought to maintain the relationship such that clusters were relatively equidistant from each other. While these did not perform as well, we can see that they did a better job of preserving what one could conceptualize as the original manifold.
|Metric||2D Avg. Rank||High Dim Avg. Rank||Overall Avg.|
Based on the outcomes, we see that is predisposed to certain edge cases that can mislead a user to think he or she has found a high quality visualization. As stated in Section 3, the problem appears in the case where an algorithm puts too much weight on maintaining a small number of nearest neighbors but subsequently compromises the remaining neighbor rankings. We see that, on average, and Spectral Overlap perform well in both the low and high dimensional data sets which leads us to believe for now that these are good proxies for quality of output dimension .
Ultimately, this demonstration shows how tuning nonlinear dimensionality reduction methods using various meta-criteria performs in the most ideal case, when we know (1) the intrinsic dimension, (2) the true local structure or any topological structure, and (3) the true global structure.
We present a simple exploration of some of the existing quality metrics that researchers can use without a prior with some of the leading nonlinear dimensionality reduction algorithms. By creating data generating processes that allow performance to be compared to a reasonable “Bayes error” metric, we are able to provide the beginnings of a more rigorous study for dimensionality reduction quality metrics. Based on our observations, we find that metrics based on maximizing the overlap in nearest neighbors such as and Spectral Overlap tend to have robust performance for 2 dimensional and higher dimensional cases. As shown in the case of LMDS and the Clusters data set, using these methods can provide potentially informative visualizations without tearing the manifold. Finally, we provide an R wrapper for local MDS which is available for download here.
For future work, we look to explore a few avenues. First, by defining a unifying metric to go by, this comparison acts as a first step to a fuller study of the existing nonlinear dimensionality reduction algorithms. We also intend to further investigate the realm of dimensionality reduction quality metrics in order to define more intuitive or analytically promising measures of local structure and then global structure. Finally, we aim to develop a method that more directly uses the and Spectral Overlap quality metric and will potentially become a competitor to leading methods.
Appendix A Results
|Algorithm||Data||Entropy||Mutual Info||Local Error||Spectral Error||Spearman||1-NN|
|Method||Avg. Procrust. Dist.|
|Algorithm||Data||Entropy||Mutual Info||Local Error||Spectral Error||Spearman||Procrustes|
Appendix B Data Generating Process
Below we outline the mathematical formulation for each of the data generating processes:
The 2 Lines dataset from  is generated as two parallel line segments. This can be done as such:
The 3 Gaussians dataset inspired by  is generated as 3 separable clusters with 2 closer to each other and 1 with large variance. We formulate the data as such:
The trefoil knot also comes from  and we add a small amount of noise . The data generating process is:
where our final input matrix is .
The Curved X’s data set is inspired by  who show how using linear methods does not often guarantee that the nearest neighbor relationships are kept and that nonlinear methods can do a better job by focusing on local structure. This data set is a slight change and consists of two curved X’s which are specified as such:
where our final input matrix is .
The Noisy Circles data set comes from  and is one circle drawn inside a larger circle. We also add noise and the data generating process is therefore as such:
High Dimensional Clusters
For the high dimensional clusters data set, we place clusters on the corners of a 4-dimensional hyper-cube as such:
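A data set of this kind could be generated along the following lines. The sketch assumes one cluster on each of the 16 corners of the unit 4-dimensional hyper-cube, 50 points per cluster (for 800 points in total), and bounded uniform jitter so that clusters cannot overlap; the cluster count, jitter distribution, and all parameter names are assumptions made for illustration.

```python
import itertools
import random


def hypercube_clusters(points_per_corner=50, dim=4, side=1.0, jitter=0.2, seed=0):
    """Separable clusters centered on the corners of a `dim`-dimensional hyper-cube."""
    rng = random.Random(seed)
    data, labels = [], []
    corners = itertools.product((0.0, side), repeat=dim)
    for label, corner in enumerate(corners):
        for _ in range(points_per_corner):
            # Uniform jitter with half-width below side / 2 keeps clusters disjoint.
            data.append(tuple(c + rng.uniform(-jitter, jitter) for c in corner))
            labels.append(label)
    return data, labels
```

With the default arguments this yields 800 labeled points in four dimensions, and the labels can serve as the known classes for the first nearest neighbor accuracy.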
Appendix C Quality Metrics
Below we outline the formulation for each of the quality metrics:
and Quality Metrics
which counts the number of points that remained in the same local neighborhood defined by . This can also be conceptualized as penalizing the number of mismatched ranks when comparing ranked Euclidean distances in high and low dimensional space. One can tweak the number of nearest neighbors or as mentioned in , removing the tuning element. Next, the LCMC is calculated as:
And finally, we achieve our choice of via:
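As a point of reference, and assuming that the quantities in this subsection follow the standard scale-independent rank-based criteria, they are commonly written as below. Here $Q_{kl}$ denotes an entry of the co-ranking matrix, $K$ the neighborhood size, and $N$ the number of data points; the symbol names are assumptions and may differ from the notation intended in this appendix.

```latex
\begin{align}
Q_{NX}(K) &= \frac{1}{KN} \sum_{k=1}^{K} \sum_{l=1}^{K} Q_{kl}, \\
\mathrm{LCMC}(K) &= Q_{NX}(K) - \frac{K}{N-1}, \\
K_{\max} &= \arg\max_{K} \, \mathrm{LCMC}(K), \\
Q_{\mathrm{local}} &= \frac{1}{K_{\max}} \sum_{K=1}^{K_{\max}} Q_{NX}(K).
\end{align}
```

Under this reading, the neighborhood size is chosen by the data through $K_{\max}$ rather than by the user, which is what removes the tuning element.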
Entropy and Mutual Information
Both methods come from  and first require the user to calculate the co-ranking matrix as follows:
for all N data points and
Next, the joint probability distribution is specified as follows:
The Entropy quality metric is:
The Mutual Information metric is:
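A minimal sketch of how such information-theoretic quantities can be computed from a co-ranking matrix is given below. It assumes that the joint distribution is the co-ranking matrix divided by its total count and that the standard Shannon definitions with natural logarithms apply; the exact formulas in the cited work may differ, and the function names are hypothetical.

```python
import math


def joint_distribution(q):
    """Normalize a co-ranking matrix so that its entries sum to one."""
    total = sum(sum(row) for row in q)
    return [[v / total for v in row] for row in q]


def joint_entropy(p):
    """Shannon entropy of the joint distribution (zero entries contribute nothing)."""
    return -sum(v * math.log(v) for row in p for v in row if v > 0)


def mutual_information(p):
    """Mutual information between input-space ranks (rows) and output-space ranks (columns)."""
    row_marginal = [sum(row) for row in p]
    col_marginal = [sum(col) for col in zip(*p)]
    return sum(v * math.log(v / (row_marginal[k] * col_marginal[l]))
               for k, row in enumerate(p)
               for l, v in enumerate(row) if v > 0)
```

Under these definitions, a perfectly rank-preserving output concentrates the distribution on the diagonal, which gives the lowest joint entropy and the highest mutual information for a given number of points.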
Similar to Kruskal Stress, we devise a meta criterion that focuses more on getting the correct immediate errors by linearly weighting lower ranked neighbors higher than higher ranked neighbors. The result is a sum over the cumulative sum of the squared distance errors. More succinctly, we can think of this as linearly weighting the squared distance errors as:
where and are the Euclidean distance in high and low dimensional space and is the neighborhood for a given point defined by the th nearest neighbors.
Let be the KNN graph in input space with parameter and be the KNN graph in output space with parameter . The metric is calculated as:
The intuition here is that we want to have overlap in every KNN graph for . This provides more weight on more immediate neighbors and results in a light penalty if pairwise neighbor relations are off by one or two ranks. However, if there is a drastic tear in the manifold, this will more heavily penalize terms that should be a nearest neighbor to a given point but are very far from said given point.
The Spearman rank correlation coefficient is calculated on all values of and for one coefficient.
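The computation can be sketched as follows, assuming the coefficient is taken over all unordered pairwise Euclidean distances and that tied distances receive their average rank; the function names are hypothetical. The sketch does not handle the degenerate case where all distances in one space are equal.

```python
import itertools
import math


def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def pairwise_distances(points):
    """Euclidean distances over all unordered pairs of points."""
    return [math.dist(a, b) for a, b in itertools.combinations(points, 2)]


def spearman_rho(high, low):
    """Pearson correlation between the ranks of input and output pairwise distances."""
    rx = average_ranks(pairwise_distances(high))
    ry = average_ranks(pairwise_distances(low))
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Because only the ordering of distances enters the coefficient, any output that rescales the input uniformly receives a value of 1.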
From , the objective is the l2 norm on the rotated, shifted, and scaled image with respect to the reference image .
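For the 2-dimensional case, the distance can be sketched with a closed-form rotation. The sketch assumes that both point sets are centered and rescaled to unit Frobenius norm, that only rotations (no reflections) are allowed, and that the points are in corresponding order; this is one common variant and may differ in detail from the cited definition. The function name is hypothetical.

```python
import math


def procrustes_distance_2d(reference, target):
    """L2 distance after shifting, scaling, and rotating `target` onto `reference`."""

    def standardize(points):
        n = len(points)
        cx = sum(p[0] for p in points) / n
        cy = sum(p[1] for p in points) / n
        centered = [(p[0] - cx, p[1] - cy) for p in points]       # shift
        norm = math.sqrt(sum(x * x + y * y for x, y in centered))
        return [(x / norm, y / norm) for x, y in centered]        # scale

    a, b = standardize(reference), standardize(target)
    # Closed-form angle that maximizes the alignment between a and the rotated b.
    num = sum(bx * ay - by * ax for (ax, ay), (bx, by) in zip(a, b))
    den = sum(bx * ax + by * ay for (ax, ay), (bx, by) in zip(a, b))
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    rotated = [(c * bx - s * by, s * bx + c * by) for bx, by in b]
    return math.sqrt(sum((ax - rx) ** 2 + (ay - ry) ** 2
                         for (ax, ay), (rx, ry) in zip(a, rotated)))
```

A value near zero indicates that the output reproduces the input configuration up to translation, uniform scaling, and rotation.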
From , one metric that can be used when the true groups are known is the accuracy of the first nearest neighbor classifier. We can calculate this quickly via the following process. Let be the column-wise rank of each element excluding elements where which have default distance . Let each class of total classes have elements in .
where blkdiag corresponds to a block diagonal matrix of diagonal blocks of size such that or the total number of points in the input matrix .
We then calculate as such:
-  (2013) Assessment of dimensionality reduction based on communication channel model; application to immersive information visualization. In 2013 IEEE International Conference on Big Data, pp. 1–6.
-  (2006) Autoencoder networks for HIV classification. Current Science, pp. 1467–1473.
-  (1995) An index of topological preservation for feature extraction. Pattern Recognition 28 (3), pp. 381–391.
-  (2012) A general framework for dimensionality-reducing data visualization mapping. Neural Computation 24 (3), pp. 771–804.
-  (2009) Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. Journal of the American Statistical Association 104 (485), pp. 209–219.
-  (2006) Insights for human ether-a-go-go-related gene potassium channel inhibition using recursive partitioning and Kohonen and Sammon mapping techniques. Journal of Medicinal Chemistry 49 (17), pp. 5059–5071.
-  (2014) A methodology to compare dimensionality reduction algorithms in terms of loss of quality. Information Sciences 270, pp. 1–27.
-  (2009) The elements of statistical learning. New York: Springer.
-  (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
-  (1998) Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 28 (1), pp. 39–54.
-  (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
-  (2007) Nonlinear dimensionality reduction. Springer Science & Business Media.
-  (2009) Quality assessment of dimensionality reduction: rank-based criteria. Neurocomputing 72 (7-9), pp. 1431–1443.
-  (2010) Scale-independent quality criteria for dimensionality reduction. Pattern Recognition Letters 31 (14), pp. 2248–2257.
-  (2008) Rank-based quality assessment of nonlinear dimensionality reduction. In ESANN, pp. 49–54.
-  (2017) A new method for performance analysis in nonlinear dimensionality reduction. arXiv preprint arXiv:1711.06252.
-  (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
-  (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
-  (1969) A nonlinear mapping for data structure analysis. IEEE Transactions on Computers 100 (5), pp. 401–409.
-  (2018) Comparing different clustering algorithms on toy datasets.
-  (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323.
-  (2012) Visualizing non-metric similarities in multiple maps. Machine Learning 87 (1), pp. 33–55.
-  (2009) Dimensionality reduction: a comparative.
-  (2006) Local multidimensional scaling. Neural Networks 19 (6-7), pp. 889–899.
-  (2016) How to use t-SNE effectively. Distill.
-  (2018) Graph convolutional neural networks for web-scale recommender systems. arXiv preprint arXiv:1806.01973.