Consistent Manifold Representation for Topological Data Analysis
For data sampled from an arbitrary density on a manifold embedded in Euclidean space, we introduce the Continuous k-Nearest Neighbors (CkNN) graph construction. We prove that CkNN is the unique unweighted construction that is consistent with the connected components of the underlying manifold in the limit of large data, for compact Riemannian manifolds and a large class of non-compact manifolds. More precisely, we show that CkNN is geometrically consistent in the sense that the unnormalized graph Laplacian converges to the Laplace-Beltrami operator, spectrally as well as pointwise. We demonstrate that CkNN produces a single graph that captures all topological features simultaneously, in contrast to persistent homology, which represents each homology generator at a separate scale. As applications we derive a new fast clustering algorithm and a method to identify patterns in natural images topologically. Finally, we conjecture that CkNN is topologically consistent, meaning that the homology of the Vietoris-Rips complex (implied by the graph Laplacian) converges to the homology of the underlying manifold (implied by the Laplace-de Rham operators) in the limit of large data.
keywords:topological data analysis, Laplace-de Rham operator, manifold learning, spectral clustering, geometric prior
Building a discrete representation of a manifold from a finite data set is a fundamental problem in machine learning. Particular interest pertains to the case where a set of data points in a possibly high-dimensional Euclidean space is assumed to lie on a relatively low-dimensional manifold. The field of topological data analysis (TDA) concerns the extraction of topological invariants such as homology from discrete measurements.
Currently, there are two major methodologies for representing manifolds from data sets. One approach is an outgrowth of Kernel PCA scholkopf1998nonlinear (), using graphs with weighted edges formed by localized kernels to produce an operator that converges to the Laplace-Beltrami operator of the manifold. These methods include versions of diffusion maps belkin2003laplacian (); diffusion (); localk (); BH14 () that reconstruct the geometry of the manifold with respect to a desired metric. Convergence of the weighted graph to the Laplace-Beltrami operator in the large data limit is called consistency of the graph construction. Unfortunately, while such constructions implicitly contain all topological information about the manifold, it is not yet clear how to use a weighted graph to build a simplicial complex from which simple information like the Betti numbers can be extracted.
A second approach, known as persistent homology carlsson2009topology (); edels2010 (); ghrist2008 (), produces a series of unweighted graphs that reconstructs topology one scale at a time, tracking homology generators as a scale parameter is varied. The great advantage of an unweighted graph is that the connection between the graph and a simplicial complex is immediate, since the Vietoris-Rips construction can be used to build an abstract simplicial complex from the graph. However, the persistent homology approach customarily creates a family of graphs, of which none is guaranteed to contain all topological information. The goal of a consistent theory is not possible since there is not a single unified homology in the large data limit.
In this article we propose replacing persistent homology with consistent homology in data analysis applications. In other words, our goal is to show that it is possible to construct a single unweighted graph from which the underlying manifold’s topological information can be extracted. We introduce a specific graph construction from a set of data points, called continuous k-nearest neighbors (CkNN), that achieves this goal for any compact Riemannian manifold. Theorem 1 states that the CkNN is the unique unweighted graph construction for which the (unnormalized) graph Laplacian converges spectrally to a Laplace-Beltrami operator on the manifold in the large data limit. Furthermore, this is true even when the manifold is not compact, with some mild added conditions.
The proof of consistency for the CkNN graph construction is carried out in Section 6 for both weighted and unweighted graphs. There we complete the theory of graphs constructed from variable bandwidth kernels, computing for the first time the bias and variance of both pointwise and spectral estimators. This analysis reveals the surprising fact that the optimal bandwidth for spectral estimation is significantly smaller than the optimal choice for pointwise estimation (see Fig. 12). This is crucial because existing statistical estimates SingerEstimate (); BH14 () imply very different parameter choices that are not optimal for the spectral convergence desired in most applications. Moreover, requiring the spectral variance to be finite allows us to specify which geometries are accessible on non-compact manifolds. Finally, combining our statistical estimates with the theory of von2008consistency (), we provide the first proofs of spectral convergence for graph Laplacians on non-compact manifolds. Details on the relationship of our new results to previous work are given in Section 6.
As mentioned above, the reason for focusing on unweighted graphs is their relative simplicity for topological investigation. There are many weighted graph constructions that converge to the Laplace-Beltrami operator with respect to various geometries belkin2003laplacian (); diffusion (); localk (); BH14 (). Although these methods are very powerful, they are not convenient for extracting topological information. For example, to determine the zero homology from a weighted graph requires numerically estimating the dimension of the zero eigenspace of the graph Laplacian. Alternatively, in an unweighted graph construction, one can apply the depth first search algorithm to determine the zero homology, which is significantly more efficient than the eigensolver approach. Secondly, determination of the number of zero eigenvalues requires setting a nuisance parameter as a numerical threshold. For higher-order homology generators, the problem is even worse, as weighted graphs require the construction of the Laplace-de Rham operators which act on differential forms. (We note that the -th Laplace-de Rham operator acts on functions (-forms) and is called the Laplace-Beltrami operator.) In contrast, the unweighted graph construction allows the manifold to be studied using topological data analysis methods that are based on simplicial homology (e.g. computed from the Vietoris-Rips complex).
We introduce the CkNN in Section 2, and demonstrate its advantages in topological data analysis by considering a simple but illustrative example. In Section 3 we show that consistent spectral estimation of the Laplace-de Rham operators is the key to consistent estimation of topological features. In particular, the key to estimating the connected components of a manifold is spectral estimation of the Laplace-Beltrami operator. In Section 4, these results are used to show that the CkNN is the unique unweighted graph construction which yields a consistent geometry via the Laplace-Beltrami operator on functions. These results guarantee consistency of the connected components (clustering) and we conjecture consistency of the higher order homology. We give several examples that demonstrate the consistency of the CkNN construction in Section 5, including a fast and consistent clustering algorithm that allows more general sampling densities than existing theories. Theoretical results are given in Section 6. We conclude in Section 7 by discussing the relationship of CkNN to classical persistence. In this article, we focus on applications to TDA, but the theoretical results will be of independent interest to those studying the geometry as well as topology of data.
2 Continuous scaling for unweighted graphs
We begin by describing the CkNN graph construction and comparing it to other approaches. Then we discuss the main issues of this article as applied to a simple example of data points arranged into three rectangles with nonuniform sampling.
2.1 Continuous k-Nearest Neighbors
Our goal is to create an unweighted, undirected graph from a point set with interpoint distances given by a metric . Since the data points naturally form the vertices of a graph representation, for each pair of points we only need to decide whether or not to connect these points with an edge. There are two standard approaches for constructing the graph:
Fixed -balls: For a fixed , connect the points if .
k-Nearest Neighbors (kNN): For a fixed integer , connect the points if either or where are the -th nearest neighbors of respectively.
The fixed -balls choice works best when the data is uniformly distributed on the manifold, whereas the kNN approach adapts to the local sampling density of points. However, we will see that even when answering the simplest topological questions, both standard approaches have severe limitations. For example, when clustering a data set into connected components, they may underconnect one part of the graph and overestimate the number of components, while overconnecting another part of the graph and bridging parts of the data set that should not be connected. Despite these drawbacks, the simplicity of these two graph constructions has led to their widespread use in manifold learning and topological data analysis methods carlsson2009topology ().
Our main point is that a less discrete version of kNN sidesteps these problems, and can be proved to lead to a consistent theory in the large data limit. Define the Continuous k-Nearest Neighbors (CkNN) graph construction by
CkNN: Connect the points if
where the parameter is allowed to vary continuously. Of course, the discrete nature of the (finite) data set implies that the graph will change at only finitely many values of . The continuous parameter has two uses. First, it allows asymptotic analysis of the graph Laplacian in terms of , where we interpret the CkNN graph construction as a kernel method. Second, it allows the parameter to be fixed for each data set, which allows us to interpret as a local density estimate.
The CkNN construction is closely related to the “self-tuning” kernel introduced in ZP () for the purposes of spectral clustering, which was defined as
The kernel (1) leads to a weighted graph, but replacing the exponential kernel with the indicator function
and introducing the continuous parameter yields the CkNN unweighted graph construction. The limiting operator of the graph Laplacian based on the kernels (1) and (2) was first analyzed pointwise in Ting2010 (); BH14 (). In Sec. 6 we provide the first complete analysis of the spectral convergence of these graph Laplacians, along with the bias and variance of the spectral estimates.
The CkNN is an instance of a broader class of multi-scale graph constructions:
Multi-scale: Connect the points if
where defines the local scaling near the point . In Section 4 we will show that
is the unique multi-scale graph construction that yields a consistent limiting geometry, where is the sampling density and is the intrinsic dimension of the data.
Note that if we are given a data set, neither nor may be known beforehand. Fortunately, for points on a manifold embedded in Euclidean space, the kNN itself provides a very simple density estimator, which for sufficiently small approximately satisfies
where is the -th nearest neighbor of and is the dimension of the underlying manifold loftsgaarden65 (). Although more sophisticated kernel density estimators could be used (see for example ScottVBK ()), a significant advantage of (4) is that it implicitly incorporates the exponent without the need to explicitly estimate the intrinsic dimension of the underlying manifold.
In the next section, we demonstrate the advantages of the CkNN on a simple example before turning to the theory of consistent topological estimation.
2.2 Example: Representing non-uniform data
In this section we start with a very simple example where both the fixed -ball and simple kNN graph constructions fail to identify the correct topology. Fig. 1 shows a simple “cartoon” of a non-uniformity that is common in real data, and reveals the weakness of standard graph constructions. All the data points lie in one of the three rectangular connected components outlined in Fig. 1(a). The left and middle components are densely sampled and the right component is more sparsely sampled. Consider the radius indicated by the circles around the data points in Fig. 1(a). At this radius, the points in the sparse component are not connected to any other points in that component. This is too small for the connectivity of the sparse component to be realized, but at the same time is too large to distinguish the two densely sampled components. A graph built by connecting all points within the radius , shown in Fig. 1(b), would find many spurious components in the sparse region while simultaneously improperly connecting the two dense components. We are left with a serious failure: The graph cannot be tuned, with any fixed , to identify the “correct” three boxes as components.
The kNN approach to local scaling is to replace the fixed approach with the establishment of edges between each point and its -nearest neighbors. While this is an improvement, in Fig. 2 we show that it still fails to reconstitute the topology even for the very simple data set considered in Fig. 1. Notice that in Fig. 2(a) the graph built based on the nearest neighbor () leaves all regions disconnected, while using two nearest neighbors () incorrectly bridges the sparse region with a dense region, as shown in Fig. 2(b). Of course, using kNN with will have the same problem as . Fig. 2 shows that simple kNN is not a panacea for nonuniformly sampled data.
Finally, we demonstrate the CkNN graph construction in Fig. 3. An edge is added between points and when . We denote the th-nearest neighbor of (resp., ) by (resp., ). The coloring in Fig. 3(a) exhibits the varying density between boxes. Assigning edges according to CkNN with and , shown in Fig. 3(b), yields an unweighted graph whose connected components reflect the manifold in the large data limit. Theorem 1 guarantees the existence of such a , that yields an unweighted CkNN graph with correct topology.
Although we have focused on the connected components of the point set, the CkNN graph in Fig. 3 fully triangulates all regions, which implies that the -homology is correctly identified as trivial. Clearly, the graph constructions in Figures 1 and 2 are very far from identifying the correct -homology.
To complete our analysis of the three-box example, we compare CkNN to a further alternative. Two crucial features of the CkNN graph construction are (1) symmetry in and which implies an undirected graph construction, and (2) introduction of the continuous parameter which allows to be fixed so that is an estimator of . There are many alternative ways of combining the local scaling function with the continuous parameter . Our detailed theoretical analysis in Sec. 6 shows that the geometric average is consistent, but it does not discount all alternatives.
For example, we briefly consider the much less common ‘AND’ construction for kNN, where points are connected when (as opposed to standard kNN which uses the ). Intuitively, the advantage of the ‘AND’ construction is that it will not incorrectly connect dense regions to sparse regions because it takes the smaller of the two kNN distances. However, on a non-compact domain, shown in Fig. 4, this construction does not identify the correct homology whereas the CkNN does. We conclude the ‘AND’ version of kNN is not generally superior to CkNN. Moreover, our analysis in Sec. 6 does not apply, due to the fact that the and functions are not differentiable, making their analysis more difficult than the geometric average used by CkNN.
2.3 Multiscale Homology
In Section 4 we will see that the geometry represented by the CkNN graph construction captures the true topology with relatively little data by implicitly choosing a geometry which is adapted to the sampling measure. The CkNN construction yields a natural multi-scale geometry, which is assumed to be very smooth in regions of low density and can have finer features in regions of dense sampling. Since all geometries yield the same topology, this multi-scale geometry is a natural choice for studying the topology of the underlying manifold, and this advantage is magnified for small data sets. In Fig. 5 we demonstrate the effect of this geometry on the persistent homology for a small data set with multiple scales. Following that, in Fig. 6 we show how the CkNN graph construction can capture the homology even for a non-compact manifold.
To form the data set in Fig. 5 we sampled 60 uniformly random points on a large annulus in the plane centered at with radii in and another 60 uniformly random points on a much smaller annulus centered at with radii in . Together, these 120 points form a “figure eight” with a sparsely sampled large hole and a densely sampled small hole as shown in Fig. 5(b)(d). We then used the JavaPlex package Javaplex () to compute the persistence diagram for the VR complex based on the standard -ball graph construction, shown in Fig. 5(a). Note that two major generators of are found along with some “topological noise”, and the generators do not exist for a common .
Since JavaPlex can build the VR complex persistence diagram for any distance matrix, we could easily compute the CkNN persistence diagram, shown in Fig. 5(c), by using the ‘distance’ matrix , and we used to form the matrix. Notice how the CkNN captures both scales simultaneously, giving a multi-scale representation, whereas the standard -ball graph captures only one scale at a time. Fig. 5(d) shows the edges for , at which the graph captures the correct topology in all dimensions. In addition, the CkNN construction is more efficient, requiring only 934 edges to form a single connected component, whereas the -ball construction requires 2306 edges.
The CkNN construction has a marked advantage over the fixed- construction for non-compact data. To form the data set in Fig. 6(a) we sampled 150 points from a 2-dimensional Gaussian distribution and then removed the points of radius between , leaving 120 points lying on two connected components with a single non-contractible hole. In this case the standard -ball persistence does not even capture the correct 0-homology for the manifold, due to the decreasing density near the outlying points, as shown in Fig. 6(c-d). Furthermore, since the true manifold is non-compact, there is no reason to expect the -ball construction to converge to the correct topology even in the limit of large data. In fact, as the amount of data is increased, the outlying points will become increasingly spaced out, leading to worse performance for the fixed -ball construction, even for the homology. In contrast, the CkNN construction is able to capture all the correct topological features for a large range of values, as shown in Fig. 6(e-f).
3 Manifold topology from graph topology
In this section we delineate our notion of graph consistency. Assume that the vertices of the graph are data points that lie on a manifold embedded in Euclidean space. Our goal is to access the true topology of the underlying manifold using an abstract simplicial complex on the finite data set. The Vietoris-Rips complex is constructed inductively, first adding a triangle whenever all the faces are in the graph and then adding a higher order simplex whenever all the faces of the simplex are included. We say that a graph construction on these vertices is topologically consistent if the homology computed from the VR complex converges to the homology of the underlying manifold in the limit of large data.
Topological consistency has been shown directly for the -ball graph construction on compact manifolds without boundary VRcomplex (). An in depth analysis in SmaleWeinberger () shows that for compact manifolds the probability of obtaining the correct homology from a sampling set can be bounded in terms of the curvature and nearness to self-intersection of the manifold. The results of VRcomplex (); SmaleWeinberger () are analogous to a result of diffusion (), which shows that the graph Laplacian associated to -ball graph construction converges to the Laplace-Beltrami operator on compact manifolds when the data points are sampled uniformly. In fact, diffusion () proves pointwise convergence on compact manifolds for any smooth sampling density, but their construction requires a weighted graph in general. For graph Laplacian methods (including the results developed here), the dependence on the curvature and nearness to self-intersection appears in the bias term of the estimator as shown in diffusion (); heinthesis ().
In the special case of uniform sampling, the theory of diffusion () is equivalent to the unweighted -ball construction (using the kernel ). In Sec. 6, combined with a result of von2008consistency (), we prove spectral convergence of this graph Laplacian to the Laplace-Beltrami operator, which we refer to as spectral consistency of the graph Laplacian. We should note that the result of von2008consistency () requires several extra conditions and in Sec. 6 we prove that these conditions are superfluous. In particular, for a manifold without boundary or with a smooth boundary, the CkNN graph construction the spectral convergence is guaranteed. This is the bottom arrow in the following diagram:
The left vertical arrow corresponds to the fact that the graph Laplacian can trivially be used to reconstruct the entire graph since the non-diagonal entries are simply the negative of the adjacency matrix. Since the graph determines the entire VR complex, the graph Laplacian completely determines the VR homology of the graph. This connection is explicitly spectral for the zero homology, since the zero-homology of a graph corresponds exactly to the zero-eigenspace of the graph Laplacian von2007tutorial ().
The right vertical arrow follows from the fact that from the Laplace-Beltrami operator, it is possible to analytically reconstruct the metric on a Riemannian manifold which completely determines the homology of the manifold. To make this connection explicit, in coordinates we can compute the Riemannian metric by
The Riemannian metric lifts to an inner product on forms which defines the Hodge inner products
on differential forms. From the Hodge inner products, we define the codifferential operators as the formal adjoint of the exterior derivative on -forms. Finally, we can define the Laplace-de Rham operator on differential -forms by , and the Hodge theorem laplacianBook () states that the kernel of is isomorphic to the -th de Rham cohomology group and the -th singular cohomology group, . For closed manifolds without boundary, Poincare duality relates the homology to the cohomology . In general, cohomology is considered a stronger invariant.
To summarize the above discussion, just as the discrete Laplacian completely determines the graph VR homology, the Laplace-Beltrami operator completely determines the manifold homology. In perfect analogy to the discrete case, the zero-homology of the manifold corresponds to the zero-eigenspace of the Laplace-Beltrami operator. This connection is explicitly spectral, and generalizes to higher-order homology groups via the Laplace-de Rham operators on differential forms.
By establishing spectral convergence of the graph Laplacian to the Laplace-Beltrami operator (the bottom horizontal arrow) we complete the first step of the connection between the graph homology and the limiting manifold homology (the top horizontal arrow). This is the first step towards establishing topological consistency on non-compact manifolds with arbitrary sampling. Moreover, this crucial first step also strongly suggests the consistency of the higher-order homology. However, for the higher-order homology, the consistency is not yet explicitly spectral, so the top horizontal arrow in the above diagram remains a conjecture for the higher-order homology. Establishing this connection via explicit spectral convergence requires defining a discrete analog of differential forms and discrete analogs of the higher-order Laplace-de Rham operators and then showing spectral convergence to the corresponding operators on the manifold. One promising construction is the discrete exterior calculus DEC1 (); DEC2 () but no consistency results have been shown yet.
The goal of this paper is to provide a general unweighted graph construction that is spectrally consistent (and therefore consistent on connected components) for any smooth sampling density. Additionally, we wish to avoid the restriction to compact manifolds. Using more complicated weighted graph constructions, recent results show that for a large class of non-compact manifolds heinthesis () with smooth sampling densities that are allowed to be arbitrarily close to zero BH14 (), the graph Laplacian can converge pointwise to the Laplace-Beltrami operator. These results require weighted graph constructions due to several normalizations which are meant to remove the influence of the sampling density on the limiting operator. In Sec. 6 we analyze a construction similar to BH14 (), but without using any normalizations. We extend the results of BH14 () to include spectral convergence using von2008consistency ().
In the next section we show that continuous k-nearest neighbors (CkNN) construction is the unique unweighted graph construction that yields a consistent unweighted graph Laplacian for any smooth sampling density on manifolds of the class defined in heinthesis (), including many non-compact manifolds.
4 The unique consistent unweighted graph construction
Consider a data set of independent samples from a probability distribution that is supported on a -dimensional manifold embedded in Euclidean space. For a smooth function on , we will consider the CkNN graph construction, where two data points and are connected by an edge if
This construction leads to the weight matrix whose th entry is if and have an edge in common, and otherwise. Let be the diagonal matrix of row sums of , and define the “unnormalized” graph Laplacian . In Sec. 6 we show that for an appropriate factor depending only on and , in the limit of large data, converges both pointwise and spectrally to the operator defined by
where is the positive definite Laplace-Beltrami operator and is the gradient, both with respect to the Riemannian metric inherited from the ambient space. In fact, pointwise convergence follows from a theorem of Ting2010 (), and the pointwise bias and a high probability estimate of the variance was first computed in BH14 (). Both of these results follow for a larger class of kernels than we consider here.
Although the operator in (6) appears complicated, we now show that it is still a Laplace-Beltrami operator on the same manifold , but with respect to a different metric. A conformal change of metric corresponds to a new Riemannian metric , where , and which has Laplace-Beltrami operator
Eliminating in the two equations results in , which implies as the only choice that makes the operator (6) equal to a Laplace-Beltrami operator . The new metric is . Computing the volume form of the new metric we find
which is precisely the sampling measure. Moreover, the volume form is exactly consistent with the discrete inner product
This consistency is crucial since the discrete spectrum of are the minimizers of the functional
where is a constant (see Theorem 3 in the Sec. 6). If the Hilbert space norm implied by were not the volume form of the Riemannian metric , then the eigenvectors of would not minimize the correct functional in the limit of large data. This shows why it is important that the Hilbert space norm is consistent with Laplace-Beltrami operator estimated by .
Another advantage of the geometry concerns the spectral convergence of to shown in Theorem 7 in Sec. 6, which requires the spectrum to be discrete. Assuming a smooth boundary, the Laplace-Beltrami operator on any manifold with finite volume will have a discrete spectrum cianchi2011 (). This insures that spectral convergence always holds for the Riemannian metric since the volume form is , and therefore the volume of the manifold is exactly
Since all geometries on a manifold have the same topology, this shows once again that the metric is extremely natural for topological investigations since spectral convergence is guaranteed, and spectral convergence is crucial to determining the homology.
The fact that is the unique solution to (8) along with the spectral consistency implies the following result.
Theorem 1 (Unique consistent geometry).
Consider data sampled from a Riemannian manifold, not necessarily compact, having either a smooth boundary or no boundary. Among unweighted graph constructions (5), is the unique choice which yields a consistent geometry in the sense that the unnormalized graph Laplacian converges spectrally to a Laplace-Beltrami operator. Thus, CkNN is the unique graph construction (5) that yields a consistent clustering in the limit of large data.
Based on the discussion in the previous section, and in particular the fact that the Laplace-Beltrami operator determines the entire cohomology of the manifold, we propose the following conjecture:
Conjecture 0 (Unique consistent topology).
CkNN is the unique graph construction (5) with an associated VR complex that is topologically consistent in the limit of large data.
As mentioned in the previous section, completing the proof of this conjecture would require explicit estimation of the Laplace-de Rham operators, , and corresponding spectral convergence proofs. In this paper we establish this fact for the Laplace-Beltrami operator, , which is the key first step and strongly supports the conjecture.
The above theorem and conjecture have practical ramifications since (as shown in the previous section) a consistent graph construction will have the same limiting topology as the underlying manifold. In Examples 4 and 5 we will empirically illustrate the consistency of the CkNN choice as well as the failure of alternative constructions.
The unnormalized graph Laplacian is not the only graph Laplacian. With the same notation as above, the “normalized”, or “random-walk” graph Laplacian is often defined as , and has the limiting operator
(see for example Ting2010 (); BH14 (); the constant is different from the unnormalized case). Note that again is the unique choice leading to a Laplace-Beltrami operator. This choice implies that to leading order , so the corresponding Hilbert space has norm . This implies spectral consistency since is equivalent to , which is related to the functional
Therefore, for the choice , both the unnormalized and normalized graph Laplacians are consistent with the same underlying Laplace-Beltrami operator with respect to the metric .
The consistency of graph Laplacian constructions in terms of spectral convergence to operators was explored in a very general context in von2008consistency (). Their goal was to justify spectral clustering algorithms that use the first nontrivial eigenfunction of the graph Laplacian to segment data sets. In Sec. 6 we apply the theory of von2008consistency () to show spectral convergence of kernel-based graph Laplacians to Laplace-Beltrami operators for a large class of manifolds and geometries. A significant restriction of the theory of von2008consistency () is that the desired eigenvalue must be isolated from the essential spectrum, which includes the range of the degree function. This limitation appears to suggest superior convergence for the normalized Laplacian, where the range of the degree function is always a single value. In Sec. 6 we show that the range of the degree function is for all the kernel based constructions of belkin2003laplacian (); diffusion (); heinthesis (); BH14 (); localk () including the unnormalized graph Laplacians used here. This removes a significant barrier to spectral convergence for a large class of graph Laplacian constructions, and for this class the normalized and unnormalized graph Laplacians have the same spectral convergence properties. Finally, we show that (assuming a smooth boundary or no boundary) the geometry implied by the CkNN graph construction has a Laplace-Beltrami operator with a discrete spectrum, even when the manifold is not compact. These results allow us to make a broad statement of spectral convergence, especially for the special geometry implied by the CkNN graph construction.
We emphasize that we are not using either graph Laplacian directly for computations. Instead, we are using the convergence of the graph Laplacian to show convergence of the graph connected components to those of the underlying manifold. Since this consistency holds for an unweighted graph construction, we can make use of more computationally efficient methods to find the topology of the graph, such as depth-first search to compute the zero-level homology. More generally we compute the higher order homology from the VR complex of the graph, which our conjecture suggests should converge to that of the underlying manifold. A wider class of geometries are accessible via weighted graph constructions (see for example diffusion (); BH14 (); localk ()), but fast algorithms for analyzing graph topology only apply to unweighted graphs.
5 Further applications to Topological Data Analysis
The fundamental idea of extracting topology from a point cloud by building unweighted graphs relies on determining what is considered an edge in the graph as a function of a parameter, and then considering the graph as a VR complex. In the -ball, kNN and CkNN procedures, edges are added as a parameter is increased, from no edges for extremely small values of the parameter to full connectivity for sufficiently large values. From this point of view, the procedures differ mainly by the order in which the edges are added.
Classical persistence orders the addition of possible edges by , whereas the CkNN orders the edges by . More generally, a multi-scale graph construction with bandwidth function orders the edges by . Our claim is that CkNN gives an order that allows graph consistency to be proved. In addition, we have seen in Figs. 5 and 6 that the CkNN ordering is more efficient. In this section we show further examples illustrating this fact. We will quantify the persistence or stability of a feature by the percentage of edges (out of the total possible in the given ordering) for which the feature persists. This measure is an objective way to compare different orderings of the possible edges.
The consistent homology approach differs from the persistent homology approach by using a single graph construction to simultaneously represent all the topological features of the underlying manifold. This requires selecting the parameter which determines the CkNN graph. The asymptotically optimal choice of in terms of the number of data points is derived in Sec. 6, however the constants depend on the geometry of the unknown manifold. There are many existing methods of tuning for learning the geometry of data epsilontuning (); BH14 (). As a practical method of tuning for topological data analysis, we can use the classical persistence diagram to find the longest range of values such that all of the homological generators do not change. In a sense we are using the classical persistence diagram in reverse, looking for a single value of where all the homology classes are stable. In Examples 4 and 5 below we validate this approach by showing that the percentage of edges which capture the true homology is longest when using ordering defined by which is equivalent to the CkNN.
5.1 A fast graph-based clustering algorithm
We consider the problem of identifying the connected components of a manifold from a data set using the connected components of the CkNN graph construction. While clustering connected components is generally less difficult than segmenting a connected domain, outliers can easily confuse many clustering algorithms. Many rigorous methods, including any results based on existing kernel methods heinHighDensity2 (); ZP (); von2008consistency (); heinCuts1 (), require the sampling density to be bounded away from zero. In other words, rigorous clustering algorithms require the underlying manifold to be compact. A common work-around for this problem is the estimate the density of the data points and then remove points of low density. However, this leaves the removed points unclustered highdensity1rigourous (); highdensity2 (); highdensity3 (); heinHighDensity1 (). We have shown that the CkNN method is applicable to a wide class of non-compact manifolds, and in particular the connected components of the CkNN graph will converge to the connected components of the underlying manifold.
Here we use the CkNN to put an ordering on the potential edges of the graph. While the full persistent homology of a large data set can be computationally very expensive, the 0-homology is easily accessible using fast graph theoretic algorithms. First, the connected components of a graph can be quickly identified by the depth-first search algorithm. Second, unlike the other homology classes, the 0-homology is monotonic; as edges are added, the number of connected components can only decrease or stay the same. This monotonicity allows us to easily identify the entire 0-homology -sequence by only finding the transitions, meaning the numbers of edges where the -th Betti number changes. We can quickly identify these transitions using a binary search algorithm as outlined below.
Algorithm 1. Fast Binary Search Clustering.
Inputs: Ordering of the possible edges (pairs of points), number of clusters .
Outputs: Number of edges such that the graph has components and adding an edge yields compnonent.
Initialize the endpoints and
Build a graph using the first edges from the ordering
Use depth-first search to find the number of components
If set otherwise set
When the goal is to find all of the transition points, Algorithm 1 can easily be improved by storing all the numbers of clusters from previous computations and using these to find the best available left and right endpoints for the binary search.
In Fig. 7, we illustrate the use of Algorithm 1 on a point set consisting of the union of three spiral-shaped subsets with nonuniform sampling. In fact, the density of points falls off exponentially in the radial direction. Fig. 7(a) shows the original set, and panel (b) shows the number of components as a function of the proportion of edges. When the number of edges is between one and two percent of the possible pairs of points, the persistence diagram in (b) detects three components, shown in (c) along with the edges needed. A three-dimensional version of three spiral-shaped subsets is depicted in Fig. 7(d)-(f), with similar results.
In the next two examples, we illustrate the theoretical result that the choice of in the bandwidth function is optimal, where is the dimension of the data.
We begin with the zero-order homology. We will demonstrate empirically that the choice maximizes the persistence of the correct clustering, for . Consider data sampled from an -dimensional Gaussian distribution with a gap of radial width centered at radius . (The dependence on the dimension is necessary to insure that there are two connected components for small data sets.) The radial gap separates into two connected components, the compact interior -ball, and the non-compact shell extending to infinity with density decaying exponentially to zero.
Given a data set sampled from this density, we can construct a graph using the multi-scale graph construction which connects two points if . Since the true density is known, we consider the bandwidth functions for . For each value of we used the fast clustering algorithm to identify the minimum and maximum numbers of edges which would identify the correct clusters. We measured the persistence of the correct clustering as the difference between the minimum and maximum numbers of edges which identified the correct clusters divided by the total number of possible edges . We then repeated this experiment for 500 random samples of the distribution and averaged the persistence of the correct clustering for each value of and the results are shown in Fig. 8. Notice that for each dimension the persistence has a distinctive peak centered near which indicates that the true clustering is the most persistent using the multi-scale graph construction that is equivalent to the CkNN.
Next, we examine the discovery of the full homology for a 1-dimensional and 2-dimensional example using Javaplex Javaplex (). To obtain a one-dimensional example with two connected components we generated a set of points from a standard Gaussian on the -axis with points with removed, and then mapped these points into the plane via . The embedding induces a loop, and so there is non-trivial 1-homology in the 1-dimensional example. The correct homology for this example has Betti numbers and , which is exactly the same as the true homology for the 2-dimensional cut Gaussian from Fig. 8, which will be our 2-dimensional example. In Fig. 9 we show the persistence of the correct homology in terms of the percentage of edges as a function of the parameter that defines the multi-scale graph construction. As with the clustering example, the results clearly show that the correct homology is most persistent when is near which corresponds to the CkNN graph construction.
5.2 Identifying patterns in images with homology
In this section we consider the identification of periodic patterns or textures from image data. We take a topological approach to the problem, and attempt to classify the orbifold (the quotient of the plane by the group of symmetries) by its topological signature. Note that to achieve this, we will not need to learn the symmetry group, but will directly analyze the orbifold by processing the point cloud of small pixel subimages of the complete image in without regard to the original location of the subimages.
In Fig. 10(a) we show four simple patterns that can be distinguished by homology. To make the problem more difficult, the patterns are corrupted by a ‘brightness’ gradient which makes identifying the correct homology difficult. From left to right the patterns are: First, stripes have a single periodicity so that ; second, a pattern that is periodic in both the vertical and horizontal directions, implying ; third, a checkerboard pattern also has only two periodicities , but they have different periods than the previous pattern; fourth, a hexagonal pattern has 3 periodicities so that . To see the three periodicities in the fourth pattern, notice that the pattern repeats when moving right two blocks, or down three blocks, or right one block and down two blocks; each of these periodicities yields a distinct homology class.
To identify the pattern in each image, we cut each full image into 9-by-9 sub-images, yielding 121 points in . In Fig. 10(b) we show the results of applying the fixed -ball graph construction to each of the four sets of sub-images. In order to choose we used JavaPlex Javaplex () to compute the persistent homology and then chose the region with the longest persistence (meaning the region of where the homology went the longest without changing). In the title of each plot we show first two betti numbers for the graph constructed with this value of , we also show the length of the persistence in terms of the percentage of edges for which the homology is unchanged. In Fig. 10(c) we repeated this experiment using the CkNN construction, choosing from the region with the longest unchanging homology. In this set of examples, the CkNN construction is more efficient, and finds the correct orbifold homology.
When the ‘brightness’ gradient is removed, both the fixed -ball and CkNN constructions identify the correct homology in the most persistent region. However, the ‘brightness’ gradient means that the patterns do not exactly meet (see the leftmost panels in Figs. 10(b,c)). For the simple stripe pattern, the -ball construction can still bridge the gap and identify the correct homology; however, for the more complex patterns, the -ball construction finds many spurious homology classes which obscure the correct homology.
We applied the CkNN graph construction to identify patterns in real images of zebra stripes and fish scales in Figure 11. The images shown in Fig. 11 were taken from larger images Zebra (); Fish (). In order to analyze the subimage spaces we first decimated the images to reduce the resolution, (by a factor of 2 for the stripes and factor of 25 for the scales) in each case to yield a pixel image. We then formed the set of all 23-pixel by 23-pixel subimages shifting by two pixels in the vertical and horizontal directions to obtain 136 subimages, considered as points in . We built the rescaled distance matrix using . In Fig. 11(b,e) we show the persistent homology of the VR complex associated to the respective distance matrices in terms of the parameter of the CkNN. Using this diagram, we chose the maximal interval of for which the homology was stable. In Fig. 11(c,f) we show the CkNN graph for chosen from this region along with the correct Betti numbers.
6 Convergence of Graph Laplacians
We approach the problem of consistent graph representations of manifolds by assuming we have data points which are sampled from a smooth probability distribution defined on the manifold. We view this assumption as establishing a “geometric prior” for the problem. Our main goal is to approximate the Laplace-Beltrami operator on the manifold, independent of the sampling distribution, and using as little data as possible. This is a natural extension of ideas developed by belkin2003laplacian () and Coifman and collaborators diffusion (); nadler2005diffusion (); diffcoords (); nadler2008diffusion (); NadlerCluster ().
The proof of consistency for any graph construction has three parts, two statistical and one analytic: (1) showing that the (discrete) graph Laplacian is a consistent statistical estimator of a (continuous) integral operator (either pointwise or spectrally), (2) showing that this estimator has finite variance, and (3) an asymptotic expansion of the integral operator that reveals the Laplace-Beltrami operator as the leading order term. The theory of convergence of kernel weighted graph Laplacians to their continuous counterparts was initiated with belkin2003laplacian () which proved parts (1) and (3) for uniform sampling on compact manifolds, and part (2) was later completed in SingerEstimate (). Parts (1) and (3) were then extended to non-uniform sampling in diffusion (), and part (2) was completed in BH14 (). The extension to noncompact manifolds was similarly divided. First, heinthesis () provided the proof of parts (1) and (3), and introduced the necessary additional geometric assumptions which were required for the asymptotic analysis on non-compact manifolds. However, BH14 () showed that the pointwise errors on non-compact manifolds could be unbounded and so additional restrictions had to be imposed on the kernel used to construct the graph Laplacian. In order to construct the desired operators on non-compact, BH14 () showed that variable bandwidth kernels were required (part (1) for variable bandwidth kernels was previously achieved in Ting2010 ()). In all of this previous work, parts (1) and (2) are always proven pointwise, despite the fact that most applications require spectral convergence.
The geometric prior assumes that the set of points that have positive sampling density is a smooth manifold where is a smooth sampling density. Some weak assumptions on the manifold are required. The theory of belkin2003laplacian (); diffusion () assumes that the manifold is compact, implying that the density must be bounded away from zero on . The theory of heinthesis (); Hein1 (); hein2 () showed that this assumption could be relaxed, and together with the statistical analysis in BH14 () allows a large class of noncompact manifolds with densities that are not bounded below. In this article we require to have injectivity radius bounded below and curvature bounded above; these technical assumptions hold for all compact manifolds and were introduced to allow application to noncompact manifolds in heinthesis ().
There are several algorithms for estimating the Laplace-Beltrami operator, however a particularly powerful construction in BH14 () is currently the only estimator which allows the sampling density to be arbitrarily close to zero. Since we are interested in addressing the problem of non-uniform sampling, we will apply the result of BH14 () for variable bandwidth kernels. However, the method of BH14 () used a special weighted graph Laplacian in order to approximate the Laplace-Beltrami operator. The weighted graph Laplacian uses special normalizations which were first introduced in the diffusion maps algorithm diffusion () in order to reverse the effect of the sampling density. The goal of this paper is to use an unweighted graph Laplacian to approximate the Laplace-Beltrami operator, since this allows us to compute the topology of the graph using fast combinatorial algorithms. In fact, the unweighted graph Laplacian will converge to the Laplace-Beltrami operator of the embedded manifold in the special case of uniform sampling. The power of our new graph construction is that we can recover a Laplace-Beltrami operator for the manifold from an unweighted graph Laplacian, even when the sampling is not uniform.
Let represent the sampling density of a data set . We combine a global scaling parameter with a local scaling function to define the combined bandwidth . Then consider the symmetric variable bandwidth kernel
for any shape function that has exponential decay. For a function and any point we can form a Monte-Carlo estimate of the integral operator
is the positive definite Laplace-Beltrami operator, and is the gradient operator, both with respect to the Riemannian metric that inherits from the ambient space . (Note that in BH14 () the expansion (12) is written with respect to the negative definite Laplace Beltrami operator, whereas here we consider the positive definite version.) In the right-hand side of (12), all functions are evaluated at . The function depends on the shape function and the curvature of the manifold at .
6.1 Pointwise and integrated bias and variance
The standard graph Laplacian construction starts with a symmetric affinity matrix whose entry quantifies similarity between nodes and . We assume in the following that from (10), which includes the CkNN construction as a special case. Define the diagonal normalization matrix as the row sums of , i.e. . The unnormalized graph Laplacian matrix is then
We interpret this matrix as the discrete analog of an operator. Given a function , by forming a vector the matrix vector product is
so that is also a vector that represents a function on the manifold. Theorem 3 below makes this connection rigorous by showing that for an appropriate factor that depends on and , the vector is a consistent statistical estimator of the differential operator
where all functions are evaluated at the point . The last equality in (15) follows from the definition of and applying the product rules for positive definite Laplacian and the gradient . Theorem 3 shows that is a pointwise consistent statistical estimator of a differential operator. In fact consistency was first shown in Ting2010 () and the bias of this estimator was first computed in BH14 (). Here we include the variance of this estimator as well.
Theorem 3 (Pointwise bias and variance of ).
Let be an -dimensional Riemannian manifold embedded in , let be a smooth density function, and define as in (13). For independent samples of , and we have
where and are constants.
Before proving Theorem 3, we comment on the constants such as and , which depend on the shape of the kernel function. It was originally shown in diffusion () that the expansion (12) holds for any kernel with exponential decay at infinity. A common kernel used in geometric applications is the Gaussian , due to its smoothness. For topological applications, we are interested in building an unnormalized graph, whose construction corresponds to a kernel defined by the indicator function . The sharp cutoff function enforces the fact that each pair of points is either connected or not. (The indicator kernel satisfies (12) since it has compact support and thus has exponential decay.)
In the table below we give formulas for all the constants which appear in the results, along with their values for the Gaussian and indicator functions (note that is the unit ball in ).
It turns out that the variance of the statistical estimators considered below are all proportional to the constant . As a function of the intrinsic dimension , the constant decreases exponentially for the Gaussian kernel. On the other hand, since the volume of a unit ball decays like for large , the constant increases exponentially for the indicator kernel.
Proof of Theorem 3. Notice that the term is zero and can be left out of the summation, this allows us to consider the expectation of only over the terms which are identically distributed. From (12) the -th entry of the vector has expected value
We can also compute the variance of this estimator defined as
Since are independent we have
Notice that for the last term we have
this term will be higher order so we summarize it as . Computing the remaining term we find the variance to be
Since is also a local kernel with moments and the above asymptotic expansions apply and we find
Theorem 3 is a complete description of the pointwise convergence of the discrete operator . While the expansion (16) shows that the matrix approximates the operator , it does not tell us how the eigenvectors of approximate the eigenfunctions of . That relationship is the subject of Theorem 4 below.
Since is a symmetric matrix, we can interpret the eigenvectors as the sequential orthogonal minimizers of the functional
By dividing the numerator and denominator by , we can interpret the inner product as an integral over the manifold since
It is easy to see that the above estimator has variance . Similarly, in Theorem 4 we will interpret as approximating an integral over the manifold.
Theorem 4 (Integrated bias and variance of ).
Let be an -dimensional Riemannian manifold embedded in and let be a smooth density function. For independent samples of , and we have
So assuming the inner products are finite (for example if is a compact manifold) we have .
By definition we have
where the term is zero and so the sum is over the terms where . Since are sampled according to the density we have
We can now compute the variance of the estimator which estimates . To find the variance we need to compute
Notice that when are all distinct, by independence we can rewrite these terms of (24) as
The constant accounts for the fact that of the total terms in (24), only terms have distinct indices. Since and , we next consider the terms where either or but not both. Using the symmetry of , by changing index names we can rewrite all four combinations as so that these terms of (24) can be written as