Empirical geodesic graphs and CAT(k) metrics for data analysis
A methodology is developed for data analysis based on empirically constructed geodesic metric spaces. For a probability distribution, the length along a path between two points can be defined as the amount of probability mass accumulated along the path. The geodesic, then, is the shortest such path and defines a geodesic metric. Such metrics are transformed in a number of ways to produce parametrised families of geodesic metric spaces, empirical versions of which allow computation of intrinsic means and associated measures of dispersion. These reveal properties of the data, based on geometry, such as those that are difficult to see from the raw Euclidean distances. Examples include clustering and classification. For certain parameter ranges, the spaces become CAT(0) spaces and the intrinsic means are unique. In one case, a minimal spanning tree of a graph based on the data becomes CAT(0). In another, a so-called “metric cone” construction allows extension to CAT(k) spaces. It is shown how to empirically tune the parameters of the metrics, making it possible to apply them to a number of real cases.
Key words and phrases:intrinsic mean, extrinsic mean, CAT(0), curvature, metric cone, cluster analysis, non-parametric analysis
There have been many development in statistics in which geometry, represented by an extension from Euclidean to more general spaces, has proved fundamental. Thus, reproducing kernel Hilbert spaces is a part of Gaussian process methods, Sobolev spaces are used for , and Besov spaces, for wavelets. Differential geometry, Riemannian manifolds, curvature and geodesics are at the core of information geometry, shape analysis and manifold learning.
This paper is concerned with CAT(0), and, more generally, CAT(k). Although related to Riemannian manifolds ( denotes the Riemannian curvature) these metric spaces have a less rigid structure and are qualitatively different from the spaces mentioned above. Trees are among the first examples in which CAT(0) properties have been used in statistics and bioinformatics [billera-2001]: the set of trees (the tree space) becomes a geodesic metric space (to be defined) when for each tree, the edge lengths are allocated to entries in a vector of sufficient dimension to capture all the tree structures. Then, the metric is the Euclidean geodesic distance on a simplicial complex called a “fan” which is induced by the tree structure: the geodesics are the shortest piecewise linear paths in the fan. Such spaces have already been shown to be CAT(0) [billera-2001] by Gromov’s theory [gromov-1987].
For a random variable on a metric space endowed with a metric the general intrinsic mean is defined by
The empirical intrinsic mean based on data , sometimes called the Fréchet mean, is defined as
For Euclidean space, , the sample mean. In general, is not necessarily convex and the means, , are not unique. Figure 1 shows that the curvature can affect the property of . In particular, for so-called CAT(0) spaces, which (trivially) include Euclidean spaces, the intrinsic means are unique.
Even when the mean is not unique, the function can yield useful information, for example about clustering. We can also define second-order quantities:
A key concept in the study of these issues is that the metrics are global geodesic metrics, that is metrics based on the shortest path between points measured by integration along a path with respect to a local metric. The interplay between the global and the local will concern us to a considerable extent.
For some intuition, consider a circle with arc length as distance and the uniform measure. The Fréchet intrinsic mean is the whole circle. We can say, informally, that the non-uniqueness arises because the space is not CAT(0). Although the curvature is zero in the Riemannian geometry sense, there are other considerations related to “diameter”, which prevents it from having the CAT(0) property [hotz-2013]. If we use the embedding Euclidean distance rather than the arc distance, we have the extrinsic mean, but again, we obtain the whole perimeter. See [bhattacharya-2012] for further discussion on the intrinsic and extrinsic means. The empirical case is also problematic. For example, two data points on a diameter, with arc length as distance, give intrinsic means on the perpendicular diameter. Simple geometry shows that the empirical extrinsic mean is the whole circle. We also cover the more general CAT() spaces giving some new results related to “diameter” as well as conditions for the uniqueness of intrinsic means not requiring the spaces to be CAT(0).
The general form of depends, here, on three parameters, and it can be written in compact form:
where the function and the construction of are given below. Once we have introduced this new class of metrics, variety of statistics can be generalised; intrinsic mean, variance, clustering (based on local minima of ). For classification problems, we can select an appropriate metric by cross-validation. Theoretically, we have the opportunity to apply well studies areas of geometry compared with methods based on selection of a good “loss function.” In Table 1, we summarise such generalised statistics.
There are many ways to transform one metric into another, regardless of whether they are geodesic metrics. A straightforward way is to use a concave function such that given a metric , the new metric is . This is plausible if we use non-convex , which are useful, as will be explained, in clustering and classification. Such concave maps are often interpreted as loss functions, but we will consider them in terms of a change of metric which may lead to selection using geometric concepts. This is particularly true for the construction of in this paper.
The basic definition and construction from a geodesic metric space to the special geodesics based on accumulation of density are given in the next section, together with the definition of a CAT(0) space. In Section 3, we first show that means and medians in simple one-dimensional statistics can be placed into our framework. Because geodesics themselves are one-dimensional paths, this should provide some essential motivation. The -metric is obtained by a local dilation. Our computational shortcut is to use empirical graphs, whose vertices are data points.
We will need, therefore, to define empirical geodesics. We start with a natural geodesic defined via a probability density function in which the distance along a path is the amount of density “accumulated” along that path. Then, an empirical version is defined whenever a density is estimated. We might have based the paper on kernel density estimates; instead, we have adopted a very geometric construction based on objects such as the the Delaunay complex and its skeleton graph.
In Section 4, the metric is introduced. It is based on a function derived from a geodesic metric via shrinking, pointwise, to an abstract origin; that is to say an abstract cone is attached. The smaller the value of , the closer to the origin (apex). The geometry of the two dimensional case drives our understanding of the general case because of the one-dimensional nature of geodesics. We prove that, for finite , the embedding cone is CAT() with smaller than the original space. We cover the more general CAT() spaces, giving some new results related to “diameter” , in Section 5, including conditions for the uniqueness of intrinsic means not requiring the spaces to be CAT(0).
Section 6 provides a summary of the effect of changing and . After some discussion on the selection of and in Section 7, Section 8 covers some examples.
2. Geodesics, intrinsic mean and extrinsic mean
The fundamental object in this paper is a geodesic metric space. This is defined in two stages. First, define a metric space with base space and metric . Sometimes, will be a Euclidean space of dimension , containing the data points, but it may also be some special object such as a graph or manifold. Second, define the length of a (rectilinear) path between two points and the geodesic connecting and as the shortest such path. This minimal length defines a metric , and the space endowed with the geodesic metric is called the geodesic metric space, .
The interplay between and will be critical for this paper, and, as mentioned, we will have a number of ways of constructing .
For data points in , the empirical intrinsic (Fréchet) mean is
There are occasions when it is useful to represent as a sub-manifold of a larger space (such as Euclidean space) with its own metric . We can then talk about the extrinsic mean:
Typically, the intrinsic mean is used as an alternative when the geodesic distance, is hard to compute. The difficultly in considering the intrinsic mean in is that it may not lie in the original base space . This leads to a third possibility, which is to project it back to , in some way, as an approximation to the intrinsic mean (which may be hard to compute). We will discuss this again in Section 4.
2.1. CAT(0) spaces
CAT(0) spaces, which correspond to non-positive curvature Riemannian spaces, are important here because their intrinsic means are unique. The CAT(0) property is as follows. Take any three points in a geodesic metric space and consider the “geodesic triangle” of the points based on the geodesic segments connecting them. Construct a triangle in Euclidean 2-space with vertices , called the comparison triangle, whose Euclidean distances, , are the same as the corresponding geodesic distances just described: , etc. On the geodesic triangle select a point on the geodesic edge between and and find the point on the edge of the Euclidean triangle such that . Then the CAT(0) condition is that for all and all choices of :
For a CAT(0) space (i) there is a unique geodesic between any two points, (ii) the space is contractible, in the topological sense, to a point and (iii) the intrinsic mean in terms of the geodesic distance is unique. CAT() spaces, a generalization of CAT(0) spaces, are explained in Section 5.
2.2. Geodesic metrics on distributions
Let be a -dimensional Euclidean random variable absolutely continuous with respect to the Lebesgue measure, with density . Let be a parametrised integrable path between two points in , which is rectifiable with respect to the Lebesgue measure. Let
with appropriate modification in the non-differentiable case, be the local element of length along . The weighted distance along is
The geodesic distance is
Here we consider a random variable on Euclidean space but this can be generalized for Riemannian manifolds and even for singular spaces with a density with respect to a base measure naturally defined by the metric.
From the geodesic distances on distributions we shall follow three main directions:
transform the geodesic metrics in various ways with parameters to obtain a wide class of metrics,
discover (locally) CAT(0) and CAT() spaces for certain ranges of the parameters,
apply empirical versions of the metrics based on an empirical graph whose nodes are the data points.
There is an important distinction between global transformations applied to the whole distance between points and local transformations applied to dilate the distance element.
3. The metric and minimal spanning trees
The general metric is a dilation of the original distance and what we have referred to as a local metric. It is obtained by transforming the density in (1). Thus for between and ,
Changing essentially changes the local curvature. Roughly speaking, when is more negative (positive), the curvature is more negative (positive).
In the next subsection, we look at the one-dimensional case. Although this case is elementary, good intuition is obtained by rewriting the standard version in terms of a geodesic metric.
3.1. One-dimensional means and medians
Assume that is a continuous univariate random variable with probability density function and cumulative distribution function (CDF) . The mean achieves . Here we are using the Euclidean distance: .
The median is defined by . On a geometric basis, we can say that achieves , where we use a metric that measures the amount of probability between and :
Carrying out the calculations:
which achieves a minimum of at , as expected.
Now let us consider the sample version. Let be the order statistics, which we assume are distinct. One of the first exercises in statistics is to show that minimises , with respect to .
For the median, first consider using the first of the two approaches with the empirical CDF . We obtain various definitions depending on our definition of and , or just using convention. Using the metric approach the natural metric is to take
with the standard definition of . Applied to distinct data points this is equal to . For an arbitrary
where . Then, is achieved at when is odd and at when is even.
Another approach for the median would be to take a piecewise linear approximation to which is equivalent to having a density that is proportional to in the interval . Then, the metric is
and is achieved at when is odd and at , when is even. We can think of this last result in another way. Consider the points as points in , take the empirical mean of the points and transform them back with the CDF corresponding to , namely, the piecewise linear approximation.
The idea of weighting intervals should provide intuition when we extend the intervals to edges on a graph, because edges are one-dimensional.
3.2. The metric for graphs
There are a number of options to produce an empirical version of the metric, based on the data. One such option would be to produce a smooth empirical density followed by numerical integration and optimization to compute the geodesics. We prefer a much simpler method based on a graph constructed from the data. All geodesic computation is then restricted to the graph. We list some candidates (see appendix LABEL:graphs for a description of the second and third candidates listed below):
the complete graph with vertices at the data points and all edges,
the edge graph (1-skeleton) of the Delaunay simplicial complex with vertices at the data points,
the Gabriel graph with vertices at the data points.
The discussion below applies to the complete graph or any connected sub-graph. For any such graph, define a version of the distance just for edges,
where is the Euclidean distance from to . This can be explained by making a transformation
We refer to this as edge regularization. We then apply in the usual way to obtain
The new “length” of each edge is obtained by integrating this “density” along the edge. In this sense, also plays the role of density estimation. Although we need a regularization with respect to the dimension for density estimation [kendall-1963], we manage the regularization by rescaling the parameter . Note that gives the unit length and restores the original length.
Now we consider only the set of edges of the graph as a metric space with the metric defined by the geodesic:
where the infimum is taken over all (connected) paths between and . Note that even if can estimate the local density well, it does not follow that the metric can be approximate by the metric since edge lengths on each path are not independent. It is suggested that further theoretical work is necessary. Here we will admit as an approximation of .
If the graph is not a complete Euclidean graph with weights equal to the Euclidean lengths of the edges, some edges may not be in any edge geodesics between any pair of vertices.
For an edge-weighted graph with weights on the graph, , which is the union of all the edge geodesics between all pairs of vertices, is called the geodesic sub-graph (or geodesic graph) of .
We will see how the geodesic sub-graphs transform as the value of changes.
We make an important general position assumption that the set of values are distinct, that is there are no ties. This is an additional general position assumption to that given for the Delaunay cells ( see appendix LABEL:graphs). We order the values using only a single suffix for simplicity: where . For , this induces the values:
Now, consider the geodesics as . Recall that a circuit in a graph is a connected path that begins and ends in some vertex and an elementary circuit is a circuit that visits a vertex no more than once. Consider an edge that has the following property which we call : it is in an elementary circuit of the graph in which all other edges have smaller values of namely
Then, the path (within the circuit) from to not containing the edge has length smaller than when is sufficiently negative:
From this argument, we see that for sufficiently large as approaches , every edge having property Q is removed from the geodesic sub-graph, and we obtain a tree.
Let us summarize this algorithm, which applies to a general edge-weighted graph with distinct edges. We refer to this algorithm as the backwards algorithm. It clearly gives a tree.
Let and label the edges in increasing order of their weights.
Starting with edge , remove if it is in a cycle otherwise continue to .
(General step) Continue downwards at each stage removing an edge if it is in a cycle of the remaining subgraph.
Stop if no more edges can be removed using step 3.
There is a natural forwards algorithm that also yields a tree as follows.
Let and label the edges in increasing order of their weights.
Starting with , add an edge if adding it does not create a cycle.
(General step) Continue adding an edge at each step provided that the addition does not create a cycle.
Stop if no more edges can be added.
We have the following.
Given a connected edge-weighted graph , the backward and forward algorithms yield the same tree, which we call .
Proof. Let and be the trees generated by the backward and forward algorithms respectively. Any edge of not in cannot be in , because it is in a circuit of edges with lower weights than itself, by the forward construction. This is sufficient because each tree has the same number of edges. In fact, the tree can be constructed by the simple rule: remove all edges that are in a circuit with “smaller” edges; the order in which the edges are removed is irrelevant. ∎
Although there are several implementations of forward and backward minimal spanning tree algorithms, their importance here is to give intuition for the construction of geodesics as .
We now show that the tree is the minimal spanning tree in a strong sense.
Let be a graph with and distinct edge weights . There is a unique spanning tree whose ordered weights have the following minimal property. If are the ordered weights of any other spanning tree, then
with strict inequality for at least one . Moreover is given by the forward (or backward) algorithm.
Proof. The proof is obtained by contradiction. Let be the tree constructed by the forward algorithm. Let be another spanning tree with ordered weights and suppose that for some with , we have . Let the edges of and be and , respectively. Since , we must have . In the sequence , let and ; then, we must have . By the nature of the forward algorithm, new vertices are used only when an edge is included by the algorithm: all unused edges form circuits and therefore use edges of the tree constructed up to that point. Thus, the subgraph of with edges has exactly vertices; this value is attained on the addition of the last edge . However, , which is a subgraph of , attains vertices at with which is a contradiction. ∎
The ordering property in Theorem 3.3 is known from the theory of stochastic ordering: if the empirical CDFs of the distances and are and , respectively, then with strict inequality for at least one value (in fact, over at least one non-empty interval). Thus, not only but also for any non-decreasing function .
For sufficiently negative , the tree itself, that is the tree as a metric space with metric , is a CAT(0) space. We need to extend the metric somewhat so that it applies to the edges, in addition to the nodes. Thus, for any two points on the tree, define
where the integral is taken along the (unique) path on the tree and when line element is in edge in .
There is an such that for any , the geodesic sub-graph becomes the minimal spanning tree endowed with the metric and, therefore, becomes a CAT(0) space.
We see that for sufficiently negative , every geodesic defined with the metric lies in the tree . In fact, although we started with a general connected graph, any graph for which the edges can be mapped into a Euclidean interval gives a CAT(0) tree using this construction.
There are some well-known algorithms and considerable literature on minimal spanning trees. Remarkably, the minimal spanning tree for a complete Euclidean graph is the same as the minimal spanning tree for the Delaunay graph and is therefore a subgraph of the Delaunay graph.
Let be a complete edge-weighted graph whose weights are distinct and monotonically related to those of the complete Euclidean graph of a point set in dimensions. Then, , , the Gabriel graph and the Delaunay graph of the point set all have the same and unique minimal spanning tree.
Versions of this result are known in two dimensions [aurenhammer-1991], but the authors had some difficulty in finding a concise proof in the literature. We present a proof in appendix LABEL:proof-mst.
Thus, any algorithm for finding the minimal spanning tree of a graph will give the minimal spanning tree of the Delaunay graph, without having to find the full Delaunay complex for any Euclidean space into which the points can be embedded. Further, as seen previously, the minimal spanning tree has the strong minimal property in the sense of Theorem 3.5.
3.3. The double -chain
We may study the geometry as increases away from . Following Theorem 3.3, we are interested in two cases: the Euclidean Delaunay graph and the complete Euclidean graph. In both cases, we consider the metric.
Let be an edge-weighted graph with distinct weights and let be its geodesic subgraph; then,
Proof. This follows from the consideration of geodesics. An edge in is not in
if it is not a geodesic. In this case, there is an
alternative path from to such that
However, this inequality is preserved if is decreased, so that is increased. Thus an edge absent from
is absent from . ∎
3.4. and CAT()
Let be a geodesic disk of radius centred at . Define the maximum radius of the disk centred at as being CAT(), that is
See Section 5 for the definition of CAT(). If is a metric graph, does not include a cycle whose length is less than
Consider a rescaling of such that the shortest (longest) edge length is 1, and denote it as for ().
Because the -chain is increasing for , each cycle in is removed one by one as decreases. Furthermore, each cycle length increases as decreases because, by the rescaling, every edge length is greater than 1 and it increases as decreases. This gives the decreasing property of for . We can prove the result for similarly.∎
By the theorem, becomes “more CAT()” for a smaller . Because rescaling of the graph does not affect the uniqueness of the intrinsic mean, tends to have a unique mean for a smaller .
3.5. Geodesic graphs in 2-d with different
Figures 2 (a)-(f) are geodesic graphs with different values of for 50 samples of the standard 2-d Normal distribution. We give two cases in which we decrease : the Delaunay graph in Figure 2 and the complete graph in Figure 3. By the time the cases are indistinguishable and have the same minimal spanning geodesic graph for large negative values of , as expected.
This is predictable from Theorem 3.5 and 3.6 and gives an important practical strategy: when the dimension is high and is small, use the complete graph rather than the Delaunay graph because the former requires computational cost only proportional to , whereas the computational cost of the latter is [deBerg-2008] [cignoi-1998].
4. The metric
The CAT(0) property of Euclidean space implies that we do not obtain multiple local minima of even for multi-modal distributions. However, an appropriate concave transformation of the metric can modify the base data space making it less CAT(0). We introduce the metric via a transformation as a candidate.
For any geodesic metric space with metric and a parameter , we can define the metric
Since is a concave function on (see Figure 4), becomes a metric but not necessarily a geodesic metric. We can express this conveniently as , where is the Heaviside function,
It is easiest to consider the case that is the Euclidean distance on the real line. As for small values of , the metric behaves like , and as , behaves like rescaled to . For Euclidean distances greater than , returns a constant distance of unity. The metric has the effect of downsizing large distances to unity. Because, as will soon be seen, can be recognized as a geodesic metric of a cone embedding , we refer to the mean
as the -extrinsic mean.
We consider a scheme in which the real line, or part of it, is mapped into the unit circle, where it can be represented by an angle . In the unit disk, , a point is represented by polar coordinates: .
We consider the top half of the unit disk, for which , namely
We will now give a rule for travelling from a point in to the point P:.
If , travel in a straight line to P. This is the geodesic for the Euclidean metric
If , travel first in straight line to the origin and then in a straight line from to P.
We now apply a similar rule to points and .
If , we take the Euclidean distance
If , we take , indicating a route to the origin and out to the other point.
Now, consider two points at and on the circumference of the circle; then, in case (D1) above, the distance between them is
When , we obtain the value 2.
Now, keeping fixed, let us rescale the distance inside the argument. In terms of the two points just described, we place the first point at P so that and scale the second with a factor . Then, without changing the notation, and removing the factor of 2, we have . Then, implies that we may use the Euclidean distance for a wider interval. Case (D2) above corresponds to the distance before rescaling and the new version achieves the value 1. This corresponds precisely to the metric of this section.
4.1. The -extrinsic mean: one dimension
The extrinsic mean keeps the mean within the original data space but uses the metric. We will see below that the metric involves, even in the general case, an extension of the original space by a single dimension. This distinguishes it from the embeddings for extrinsic means used in the literature, which involve high-dimensional Euclidean space, for example the case of the tree space.
Controlling , as will be seen below, controls the value of when the embedding space is considered as a CAT() space. We have an indirect link between clustering and CAT() spaces. As decreases while the embedding space becomes more CAT(0) ( decreasing) the original space becomes less CAT(0). This demonstrates, we believe, the importance of the CAT() property in geodesic-based clustering.
In Euclidean space, the standard Euclidean distance dose not exhibit multiple local means because the space is trivially CAT(0). However, by using the -metric with a sufficiently small , the space can have multiple local means, as shown in Figure 5.
4.2. The general case: metric cone
The above construction is a special case of a general construction that applies to any geodesic metric space and hence to those in this paper. Let be a geodesic metric space with a metric . A metric cone with is a cone with a metric
for any .
The intuitive explanation is as follows. See Figure 7. Let be a subset with the geodesic metric on . Thus, and are the same as a set but endowing different metrics. Then is a rescaling of the metric on by . For any , their projections give two points , respectively. For a geodesic between and , consider a cone spanned by . This cone can be isometrically embedded into an “extended unit circular sector”, i.e. a covering of the unit disk corresponding to . Then and are also mapped into the extended unit circular sector; the distance for corresponds to the case (D2) of a disk if we set and . This corresponds to the length of the blue line path in Figure 7 (b1) and (b2). For further details on metric cones, refer to [deza-deza-2009].
The following result indicates that the metric cone space preserves the CAT(0) property of the original space and the smaller values of continue this process. See section LABEL:sec:versus for more details.
If is a CAT(0) space, the metric cone is also CAT(0) for every .
If is CAT(0), is also CAT(0) for .
If is CAT() for , becomes CAT(0) for .
The proof is given in appendix LABEL:proof-cone.
5. Cat() spaces, curvature, diameter and existence of means
A CAT() space for is a geodesic metric space satisfying the following CAT() condition. Take any geodesic triangle whose perimeter is less than and select any point on a geodesic . Let be a comparison triangle, which has the same edge length as , on a surface with a constant curvature , i.e. a sphere with radius for , a plane for and a hyperbolic space for . Set on satisfying . Then the CAT() condition is that for all and all choices of , . Thus every CAT() space is a CAT() space for . Every metric graph is CAT() for if and only if there is no loop shorter than because every metric tree is CAT(0) and therefore CAT() for .
Let be a geodesic metric space and fix it throughout this section. The diameter of a subset is defined as the length of the longest geodesic in . We define classes , and as follows.
: the class of subsets such that the geodesic distance function is strictly convex on for each . Here, “convex” means geodesic convex, i.e. a function on is convex iff for every geodesic on , is convex with respect to .
for : the class of the subsets such that for any probability measure whose support is in and non-empty, the intrinsic -mean exists uniquely. We refer to as .
: the class of subsets such that for every pair , the geodesic between and is unique.
for any .