A Topological Approach to Spectral Clustering
We propose a clustering algorithm which, for input, takes data assumed to be sampled from a uniform distribution supported on a metric space , and outputs a clustering of the data based on a topological estimate of the connected components of . The algorithm works by choosing a weighted graph on the samples from a natural one-parameter family of graphs using an error based on the heat operator on the graphs. The estimated connected components of are identified as the support of the eigenfunctions of the heat operator with eigenvalue , which allows the algorithm to work without requiring the number of expected clusters as input.
The analysis of complex, high-dimensional data is one of the major research challenges in contemporary computer science and statistics. In recent years, geometric and topological approaches to data analysis have begun to yield important insights into the structure of complex data sets (see, for instance, [Belkin_Niyogi_2003] for an example of spectral geometry applied to dimension reduction, and [Ghrist_2008], [Carlsson_2009] for surveys on homological methods of data analysis and visualization). The common point of departure of these methods is the assumption that data in high-dimensional spaces is often concentrated around a low-dimensional manifold or other topological space. In this note, we begin from the assumption that the data comes from a uniform distribution supported on a topologically disconnected space, and that clusters in the data reflect this lack of topological connectivity.
Geometric techniques for data analysis have concentrated on approximating the geometry of the data as a step toward non-linear dimension reduction. Once the dimension is reduced, standard statistical techniques are then used to analyze the data in the lower-dimensional space. Methods in this class include ISOMAP [Tenenbaum_de_Sliva_Langford_2000], Locally Linear Embedding [Roweis_Saul_2000], Hessian Eigenmaps [Donoho_Grimes_2003], Laplacian Eigenmaps [Belkin_Niyogi_2003], and Diffusion Maps [Coifman_Lafon_2006]. Most of these techniques build a weighted graph to approximate the Laplace-Beltrami operator on a manifold, or else a related Markov chain on a graph, and then in practice use the eigenvalues and eigenvectors of an approximation to the Laplacian to reduce the dimension of the data and then use a -means clustering algorithm to classify the data [Shen_et_al_2014], [Sanchez-Garcia_et_al_2014]. In [Chazal_Guibas_2013], we find a more topologically-oriented approach, in which persistent homology is used to help identify high-density regions of a distribution function.
In this article, we give an algorithm that directly uses the topological information in the Laplacian and heat operator, and which also demonstrates the utility of considering clustering as a problem of estimating the number of connected components of a distribution whose support is disconnected. While the topological aspect of the clustering problem is generally acknowledged in the topological data analysis community, this is, to the best of our knowledge, the first completely data-driven clustering algorithm that explicitly exploits this point of view. Second, the algorithm produces both the number of clusters and the clusters themselves, unlike popular algorithms such as -means clustering, in which the number of clusters is required as input. The algorithm we present also gives, we believe, the first use of cross-validation in a non-commutative context. While this appears formally like standard cross-validation, we find ourselves outside the standard context of commutative probability, and the usual proofs of convergence no longer apply. Rigorous justification of this cross-validation technique is an important topic for future study, but we do not address this here. Finally, we give the first algorithm for automatically choosing the bandwidth of the kernel function used in approximations to Laplace-Beltrami operators.
2 The Heat Operator on a Family of Graphs
We will see in Section 3 that, for data sampled from a uniform distribution supported on a disconnected manifold, the clustering problem reduces to identifying the -eigenfunctions of a certain graph Laplacian , or, equivalently, the -eigenfunctions of the the heat operator at some .
We now describe how to construct the family of graph Laplacians from which we will choose . First, let be the points sampled from . Given a subset , We define the matrix to be
where is a function which satisfies
for some non-negative function
as a distribution as .
In our applications, will also have compact support. We will sometimes use to denote .
We likewise define the corresponding heat operators by
3 Topology and Clustering
We suppose that our data is sampled from a uniform distribution which is supported on a disconnected subspace embedded in a topological space . Our task is then to estimate the number of connected components of from the samples and to decide which points belong to which connected component. Given the sample points from , we generate a one-parameter family of weighted graphs with vertices at , with Laplace operator on , and we appeal to several basic facts of Hodge theory (see, for instnace, [Jost_2011], Chapter 3).
First, we recall that the Laplace-Beltrami operator may be defined by . We first recall that the dimension of the kernel of the Laplace-Beltrami operator is equal to the -th real Betti number of , which gives the number of connected components of . Also, since , the functions in the kernel are constant on each connected component (also Lemma 3.3.5 in [Jost_2011]). The following proposition now follows easily.
Let be a smooth Riemannian manifold, and let denote the -th Betti number of . Let be a basis of the kernel of the Laplace-Beltrami operator of , and define the map by
Then the image of consists of exactly points in , and the image of each connected component of is a single point.
First, we know from [Jost_2011], Section 3.3 that each basis function is constant on each connected component of , and it follows that each connected component is sent to a single point. It only remains to show that no two connected components are sent to the same point. Consider the matrix defined by . Since the are linearly independent, this is a matrix of full rank. Suppose now that there are two connected components, and , whose image under is the same point . Then two rows of are the same, and the , a contradiction. Therefore, all of the connected components of are sent to different points in . ∎
We note, too, that Proposition 1 also hold for graphs and the graph Laplacian instead of manifolds, with a nearly identical proof.
4 Cross-Validation and Bandwidth Selection
When working with approximations to the Laplace-Beltrami operator, the properties of the approximation depend on the particular choice of kernel and bandwidth used. In practice, this parameter is generally chosen on an ad hoc basis, or else based on an asymptotic formula with unknown constants, as, for instance, in [Coifman_Lafon_2006]. In these methods, the resulting graphs are connected for any given bandwidth, and the exact dependence of the clustering on the bandwidth is unclear. In our case, we use a compactly supported kernel function, so the bandwidth directly dictates the connected components of the graph, and it becomes critical for us to have a systematic method for choosing this parameter.
In this article, we make this choice using a variant of the so-called "elbow method", which is often used to estimate the number of clusters in k-means clustering or the number of eigenvalues to use in principal component analysis. The idea in this method is the following. When is a manifold, would ideally like to choose a bandwidth which minimizes an error functional of the form , where denotes the Hilbert-Schmidt norm. Let the operator be defined by
where is a uniform measure on the manifold . Let , and let , and note the and are the natural bias and variance terms for this problem, respectively. Now, by the triangle inequality, and we estimate using the cross validation sum
(As mentioned in the introduction, it is actually not obvious that this is an appropriate estimate for , given the norm and the non-commutativity in the problem. We do not discuss this further here, however, but we will return to this question in future work.) The bias is difficult to estimate, but we may nonetheless attempt to choose the bandwidth such that increasing beyond produces diminishing returns with respect to decreasing the variance, i.e. to find such that
for . We further estimate this to be the value which is the largest value of such that
where is the diameter of the data set.
5 Compensating for estimation error
The results in Section 3 tell us that the points in each connected component of our graphs should be sent to exactly the same point in by the map . In practice, however, there are small amounts of estimation error in the algorithms for computing eigenvalues and eigenvectors, and this must be accounted for when constructing the final clustering. We do this with a modified version of Gaussian elimination on the matrix formed by the eigenvectors, which we now describe.
First, note that the -th entry in the eigenvector is the value of the eigenfunction evaluated on the point . Let be the matrix defined by
We give a modified Gaussian elimination algorithm in Algorithm 1. For what follows, let denote the number of points in our sample.
Note that the algorithm, if there was no estimation error, would send each point in the sample to one of the vectors in the standard basis of . Now, however, even given some numerical error, we are able to cluster the sample points according to how close are to each of the vectors .
6 Algorithm and Experiments
We now give the complete algorithm and the results of some numerical experiements.
For our experiments, the functions used are
where are balls centered at of radius , and is the standard Lebesgue measure in the ambient Euclidean space.
The following figures summarize the output of this algorithm on a data set of 500 points sampled with a small amount of Guassian noise from three circles embedded in . The horizontal circle has radius , and the other two have radius and are centered on a random point of the horizontal circle. The standard deviation of the noise was taken to be . The Figure 6.1 shows the curve giving the variance estimate, and figure 6.2 shows the image of the eigenfunctions, after passing through the modified Gaussian elimination procedure. Finally, Figure 6.3 shows the clustering given by the algorithm, with the different clusters colored in red, green, and blue.
We thank Robert Adler, Jacob Abernathy, and Bertrand Michel for many useful discussions and comments. We would additionally like to thank the organizers of the CIMAT Differential Geometry Seminar, the AUS-ICMS Meeting, the XXIst Oporto Meeting on Topology, Geometry, and Physics, and the Toposys Network for the opportunity to present early versions of this work in their conferences and seminars. This research was supported in part by FP7-ICT- 318493-STREP \printbibliography