
# Power Weighted Shortest Paths for Unsupervised Learning

Daniel Mckenzie (corresponding author), mckenzie@math.uga.edu, Department of Mathematics, University of Georgia

Steven Damelin, damelin@umich.edu, Department of Mathematics, University of Michigan

###### Abstract

We study the use of power weighted shortest path distance functions for clustering high dimensional Euclidean data, under the assumption that the data is drawn from a collection of disjoint low dimensional manifolds. We argue, theoretically and experimentally, that this leads to higher clustering accuracy. We also present a fast algorithm for computing these distances.

Keywords: unsupervised learning, clustering, shortest path distance, manifold hypothesis.

## 1 Introduction

Clustering high dimensional data is an increasingly important problem in contemporary machine learning, and is often referred to as unsupervised learning. Throughout, we shall assume that our data is presented as a subset of a Euclidean space, $X \subset \mathbb{R}^D$, although our results easily extend to more general metric spaces. Loosely speaking, by clustering we mean partitioning $X$ into subsets, or clusters, such that data points in the same subset are more “similar” than data points in different subsets. Clearly, the notion of similarity is context dependent. Although there exist algorithms that operate on the data directly, for example $k$-means, many modern algorithms proceed by first representing the data as a weighted graph $G$ whose vertices correspond to the points of $X$ and whose edge weights $W_{ij}$ represent the similarity between $x_i$ and $x_j$, and then using a graph clustering algorithm on $G$. Spectral clustering [NJW02] is an archetypal example of such an approach. Constructing $G$ requires a choice of distance function $d$. Ideally, one should choose $d$ such that points in the same cluster are close with respect to $d$, while points in different clusters remain distant. Thus, the choice of distance function should reflect, in some way, our assumptions about the data and the notion of similarity we would like the clusters to reflect.

A common assumption, frequently referred to as the manifold hypothesis (see, for example, [FMN16]), posits that each cluster $X_a$ is sampled from a latent data manifold $M_a$. Many types of data sets are known or suspected to satisfy this hypothesis, for example motion segmentation [AHKS19], images of faces or objects taken from different angles or under different lighting [BJ03, HYL03], or handwritten digits [TDSL00]. It is also usually assumed that the dimension of each $M_a$ is much lower than the ambient dimension $D$. Although it can be shown that taking $d$ to be the Euclidean distance can be successful for such data [AC11], data-driven distance functions have been increasingly favored [CL06, BRS11, CMS17, LMM17].

Once $d$ has been chosen, $G$ can be constructed. A common choice [NJW02, ZMP05] is to use some variant of a Gaussian kernel, whereby $W_{ij} = \exp(-d(x_i, x_j)^2/\sigma^2)$ where $\sigma$ is a user defined parameter. However, this is unsuitable for large data sets, as the resulting similarity matrix is dense. Hence in this case, a common choice is to use a $k$ nearest neighbors ($k$-NN) graph, for some choice of $k$, constructed as:

$$W_{ij} = \begin{cases} 1 & \text{if } x_i \text{ is among the } k \text{ nearest neighbors of } x_j \\ 0 & \text{otherwise} \end{cases}$$
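The $k$-NN construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `knn_weight_matrix` is hypothetical, neighbors are found by brute force with the Euclidean distance, and ties are broken by index.

```python
import math

def knn_weight_matrix(points, k):
    """Return W with W[i][j] = 1 if points[i] is among the k nearest
    neighbours of points[j] (Euclidean distance), and 0 otherwise."""
    n = len(points)
    W = [[0] * n for _ in range(n)]
    for j in range(n):
        # Rank all other points by distance to points[j]; ties broken by index.
        others = sorted(
            (i for i in range(n) if i != j),
            key=lambda i: (math.dist(points[i], points[j]), i),
        )
        for i in others[:k]:
            W[i][j] = 1
    return W

# Three collinear points at positions 0, 1 and 3 on the x-axis.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
W = knn_weight_matrix(points, k=1)
```

In this small example the matrix is not symmetric: the point at 1 is the nearest neighbor of the point at 3, but not conversely, which is why §5 works with a directed $k$-NN graph.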

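In this article we consider taking $d$ to be a power weighted shortest path distance. A path $\gamma$ from $x_\alpha$ to $x_\beta$ is any ordered subset $\{x_{i_1}, \ldots, x_{i_m}\} \subset X$, which we regard as describing the path $x_\alpha \to x_{i_1} \to \cdots \to x_{i_m} \to x_\beta$ in the complete graph with vertices given by $X$. Writing $x_{i_0} = x_\alpha$ and $x_{i_{m+1}} = x_\beta$, for any $1 \le p < \infty$ the power weighted shortest path distance ($p$-wspd) from $x_\alpha$ to $x_\beta$ is

$$d^{(p)}_X(x_\alpha, x_\beta) := \left(\min_{\gamma} \sum_{j=0}^{m} \|x_{i_{j+1}} - x_{i_j}\|^p\right)^{1/p} \tag{1}$$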

where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^D$. We note that the use of shortest path distances in clustering data sets is not new (see the discussion in §1.1), but has typically been hindered by high computational cost, as computing $d^{(p)}_X(x_\alpha, x_\beta)$ for all $x_\alpha, x_\beta \in X$ is equivalent to the well known all pairs shortest path (APSP) problem on a complete weighted graph on $n = |X|$ vertices. Solving this problem using Dijkstra’s algorithm naively requires operations, which is clearly infeasible for large data sets. In this paper we provide a way around this computational barrier, and also contribute to the theoretical analysis of $p$-wspd’s. Specifically, our contributions are:

1. We prove theoretically that, asymptotically, $p$-wspd’s behave as expected for data satisfying the manifold hypothesis. That is, we show that as the number of data points tends to infinity the distance between points in different clusters remains bounded away from zero while the distance between points in the same cluster tends to zero, and we quantify the rate at which this happens.

2. We show how $p$-wspd’s can be thought of as interpolants between the Euclidean distance and the longest leg path distance (see [LMM17]), which we shall abbreviate to LLPD.

3. We introduce a novel modified version of Dijkstra’s algorithm that computes the nearest neighbors, with respect to any -wspd or the LLPD, of any in in time, where is the cost of a Euclidean nearest-neighbor query. Hence one can construct a -wspd k-NN graph in . As we typically have , i.e. or , this means that constructing a -wspd k-NN graph requires only marginally more time than constructing a Euclidean -NN graph (which requires ).

4. We verify experimentally that using a $p$-wspd in lieu of the Euclidean distance results in an appreciable increase in clustering accuracy, at the cost of a small increase in run time, for a wide range of real and synthetic data sets.

### 1.1 Prior work

The idea of using $p$-wspd’s for clustering was proposed in [VB03], and further explored in [OS05]. More generally, shortest path distances are a core part of ISOMAP [TDSL00], although we emphasize that in ISOMAP not all paths through $X$ are considered: first a $k$-NN graph is computed from $X$, and only paths in this graph are admissible. Finally, several recent papers consider the use of LLPD for clustering, for example [LMM17].

The works most similar to ours are [BRS11] and [CMS17]. In [BRS11] -wspd’s are proposed for semi-supervised learning, and a fast Dijkstra-style algorithm is provided. However their set-up is fundamentally different to ours, as their focus is on finding, for every , its nearest neighbor, with respect to a -wspd, in some set of labeled data points . Moreover, they only consider semi-supervised methods, specifically nearest-neighbor classification, and do not provide any quantitative results on the asymptotic behaviour of the lengths of power weighted shortest paths. In [CMS17] the -wspd with is studied, and some interesting connections between this distance and the nearest-neighbor geodesic distance are discovered. However, the applications of this distance to clustering is not explicitly explored.

On the computational side, we are unaware of any prior mention of Algorithm 2 in the literature, although similar algorithms, which solve slightly different problems, are presented in [HP16], [MJB17] and [BRS11]. As mentioned above, the algorithm presented in [BRS11] is concerned with finding, for all , the nearest neighbor of with respect to a -wspd in a set of labeled data and has run time . It does not seem possible to extend it to finding path nearest neighbors. The algorithm of [HP16] is formulated for any weighted graph (i.e. not just graphs obtained from data sets ) and as such is not well-adapted to the problem at hand. In particular, it has run time . Because the distance graph obtained from is implicitly complete, and this results in a run time proportional to , which is infeasible for large data sets. Finally, the algorithm presented in [MJB17], although adapted to the situation of distance graphs of data sets, actually solves a slightly different problem. Specifically they consider finding the -wspd nearest neighbors of each in a Euclidean nearest neighbors graph of . As such, it is not clear whether the set of nearest neighbors produced by their algorithm are truly the -wspd nearest neighbors in .

Finally we mention that our approach is “one at a time”, whereas the other three algorithms mentioned are “all at once”. That is, our algorithm takes a single point $x_\alpha \in X$ as input and outputs the $p$-wspd nearest neighbors of $x_\alpha$. This can then be iterated to find the $p$-wspd nearest neighbors of all $x_\alpha \in X$. In contrast, “all at once” algorithms directly return the sets of nearest neighbors for each $x_\alpha \in X$. Thus it is possible our algorithm will have applications in other scenarios where the $p$-wspd nearest neighbors of only some small subset of points of $X$ are required.

## 2 Preliminaries

### 2.1 Notation

Throughout this paper, $D$ shall denote the ambient dimension, while $X$ will denote a fixed, finite set of distinct points in $\mathbb{R}^D$. We shall denote the Euclidean (i.e. $\ell_2$) norm on $\mathbb{R}^D$ as $\|\cdot\|$. For any finite set $S$, by $|S|$ we shall mean its cardinality. For any positive integer $\ell$, by $[\ell]$ we mean the set $\{1, \ldots, \ell\}$.

### 2.2 Data Model

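We consider data sets $X = X_1 \cup X_2 \cup \cdots \cup X_\ell$ consisting naturally of $\ell$ clusters, which are a priori unknown. Let $n_a := |X_a|$. We posit that for each $a \in [\ell]$ there is a smooth, compact, embedded manifold $M_a \subset \mathbb{R}^D$ with $X_a$ sampled from $M_a$ according to a continuous probability density function $\mu_a$ supported on $M_a$. Let $g_a$ denote the restriction of the Euclidean metric to $M_a$; then $(M_a, g_a)$ is a compact Riemannian manifold. For any $x, y \in M_a$ let

$$\mathrm{dist}_a(x, y) := \inf_{\gamma} \int_0^1 \sqrt{g_a(\gamma'(t), \gamma'(t))}\, dt$$

denote the metric induced by $g_a$, where the infimum is over all piecewise smooth curves $\gamma : [0,1] \to M_a$ with $\gamma(0) = x$ and $\gamma(1) = y$. Define the diameter of $M_a$ to be the supremum over all distances between points in $M_a$:

$$\mathrm{diam}(M_a) := \sup_{x, y \in M_a} \mathrm{dist}_a(x, y)$$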

Since each $M_a$ is compact this supremum is in fact a maximum and is finite. We assume that the data manifolds are fairly well separated, that is,

$$\mathrm{dist}(M_a, M_b) = \min_{x \in M_a,\, y \in M_b} \|x - y\| \ge \delta > 0 \quad \text{for all } 1 \le a < b \le \ell. \tag{2}$$

Note that frequently (for example, in [AC11]), this model is extended to allow for noisy sampling, whereby for some $\tau > 0$, $X_a$ is sampled from the tube $B(M_a, \tau)$ defined as:

$$B(M_a, \tau) = \left\{x \in \mathbb{R}^D : \min_{y \in M_a} \|x - y\|_2 \le \tau\right\},$$

but we leave this extension to future work.

### 2.3 Power Weighted Shortest Path Distances

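For any distinct pair $x_\alpha, x_\beta \in X$ and any path $\gamma = x_\alpha \to x_{i_1} \to \cdots \to x_{i_m} \to x_\beta$ define the $p$-weighted length of $\gamma$ to be:

$$L^{(p)}(\gamma) := \left(\sum_{j=0}^{m} \|x_{i_{j+1}} - x_{i_j}\|^p\right)^{1/p} \tag{3}$$

where by convention we declare $x_{i_0} = x_\alpha$ and $x_{i_{m+1}} = x_\beta$. As in (1) we define the $p$-weighted shortest path distance, or $p$-wspd, from $x_\alpha$ to $x_\beta$ through $X$ to be the minimum length over all such paths:

$$d^{(p)}_X(x_\alpha, x_\beta) := \min\left\{L^{(p)}(\gamma) : \gamma \text{ a path from } x_\alpha \text{ to } x_\beta \text{ through } X\right\} \tag{4}$$

Note that $d^{(p)}_X$ will depend on the power weighting $p$ and the data set $X$. As several authors [HDH16, CMS17, AVL12] have noted, the distance $d^{(p)}_X$ is density-dependent, so that if $x_\alpha$ and $x_\beta$ are contained in a region of high density (i.e. a cluster) the path distance $d^{(p)}_X(x_\alpha, x_\beta)$ will likely be shorter than the Euclidean distance $\|x_\alpha - x_\beta\|$ (as long as $p > 1$).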
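To make definition (4) concrete, the following brute-force sketch computes all pairwise $p$-wspd’s for a tiny data set. It assumes a Floyd-Warshall recursion on the complete graph with edge weights $\|x_i - x_j\|^p$ (cubic cost, so for illustration only); the name `pwspd_matrix` is hypothetical and this is not the fast algorithm of §5.

```python
import math

def pwspd_matrix(points, p):
    """All-pairs p-wspd, computed by Floyd-Warshall on the complete graph
    whose edge weights are Euclidean distances raised to the power p."""
    n = len(points)
    # cost[i][j] is the smallest sum of p-th powers of leg lengths found so far.
    cost = [[math.dist(points[i], points[j]) ** p for j in range(n)]
            for i in range(n)]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if cost[i][m] + cost[m][j] < cost[i][j]:
                    cost[i][j] = cost[i][m] + cost[m][j]
    # Take the p-th root to recover the distance of equation (4).
    return [[c ** (1.0 / p) for c in row] for row in cost]

# Three equally spaced collinear points.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
d1 = pwspd_matrix(points, p=1)
d2 = pwspd_matrix(points, p=2)
```

For the two end points the Euclidean distance is $2$, whereas for $p = 2$ the path through the middle point has length $(1^2 + 1^2)^{1/2} = \sqrt{2}$, illustrating how intermediate data points shorten the distance when $p > 1$.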

#### 2.3.1 Longest-Leg Path Distance

Another common path-based distance, analyzed in connection with spectral clustering in [FB03, CY08, LMM17], is the longest-leg path distance (LLPD), which we shall denote as $d^{(\infty)}_X$ (the choice of this notation should become clear shortly). It is defined as the minimum, over all paths from $x_\alpha$ to $x_\beta$ through $X$, of the maximum distance between consecutive points in the path (i.e. legs). Before formally defining $d^{(\infty)}_X$, define, for any path $\gamma$ from $x_\alpha$ to $x_\beta$ through $X$, the longest-leg length of $\gamma$ as:

$$L^{(\infty)}(\gamma) = \max_{j = 0, \ldots, m} \|x_{i_{j+1}} - x_{i_j}\|$$

where again we are using the convention that $x_{i_0} = x_\alpha$ and $x_{i_{m+1}} = x_\beta$. Now, in analogy with (4):

$$d^{(\infty)}_X(x_\alpha, x_\beta) = \min\left\{L^{(\infty)}(\gamma) : \gamma \text{ a path from } x_\alpha \text{ to } x_\beta \text{ through } X\right\} \tag{5}$$
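Definition (5) can be illustrated with a small brute-force sketch. It assumes a minimax variant of the Floyd-Warshall recursion, in which a path is scored by its longest leg rather than by a sum; the name `llpd_matrix` is hypothetical and the cubic cost makes this suitable for toy examples only.

```python
import math

def llpd_matrix(points):
    """All-pairs longest-leg path distance via a minimax Floyd-Warshall:
    a path is scored by its longest leg, and we minimise over paths."""
    n = len(points)
    leg = [[math.dist(points[i], points[j]) for j in range(n)]
           for i in range(n)]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                # Longest leg of the best i -> m path followed by the best m -> j path.
                via = max(leg[i][m], leg[m][j])
                if via < leg[i][j]:
                    leg[i][j] = via
    return leg

# Collinear points at 0, 1, 3 and 3.5: the longest unavoidable gap is 2.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.5, 0.0)]
llpd = llpd_matrix(points)
```

Here the LLPD between the two end points is $2$, the length of the gap between the points at $1$ and $3$, even though their Euclidean distance is $3.5$.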

## 3 p-wspd’s in the Multi-Manifold Setting

One of the most useful aspects of $p$-wspd’s, when applied to clustering problems, is that they tend to “squeeze” points in the same cluster together, while (hopefully) keeping points in different clusters separated. Here we make this more precise. Specifically we show that for any $p > 1$, if the data comes from the model described in §2.2, then:

• the minimal distance $\epsilon_2$ between points in different clusters satisfies $\epsilon_2 \ge \delta$ (see Lemma 3.1);

• the maximal distance between points in the same cluster tends to zero as the number of points per cluster grows (see Theorem 3.5). In fact, we can specify the rate at which this quantity goes to zero.

Note that in this section it is sometimes necessary to enlarge our definition of $p$-wspd to allow for paths between points $x, y$ that are not necessarily in $X$ (and points that are not in $X$ shall be denoted without a subscript). Thus $d^{(p)}_X(x, y)$ is technically defined as, using the notation of §2.3, $d^{(p)}_{X \cup \{x, y\}}(x, y)$.

### 3.1 Paths between points in different clusters

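Here we prove that $p$-wspd’s maintain a separation between points in different clusters.

###### Lemma 3.1.

Let $\epsilon_2$ denote the minimal distance between points in different clusters. That is:

$$\epsilon_2 := \min_{\substack{a, b \in [\ell] \\ a \ne b}} \; \min_{\substack{x_\alpha \in X_a \\ x_\beta \in X_b}} d^{(p)}_X(x_\alpha, x_\beta)$$

Then $\epsilon_2 \ge \delta$ with $\delta$ as defined in (2).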

###### Proof.

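For any $x_\alpha \in X_a$ and $x_\beta \in X_b$ with $a \ne b$, let $\gamma = x_\alpha \to x_{i_1} \to \cdots \to x_{i_m} \to x_\beta$ be any path from $x_\alpha$ to $x_\beta$ through $X$, where again we are using the convention that $x_{i_0} = x_\alpha$ and $x_{i_{m+1}} = x_\beta$. Because $x_\alpha \in X_a$ and $x_\beta \notin X_a$, there must exist (at least one) $j^*$ such that $x_{i_{j^*}} \in X_a$ while $x_{i_{j^*+1}} \in X_c$ for some $c \ne a$. By the assumptions on the generative model, $x_{i_{j^*}} \in M_a$ and $x_{i_{j^*+1}} \in M_c$ and so:

$$\|x_{i_{j^*+1}} - x_{i_{j^*}}\|^p \ge \left(\mathrm{dist}(M_a, M_c)\right)^p \ge \delta^p$$

thus:

$$L^{(p)}(\gamma) = \left(\sum_{j=0}^{m} \|x_{i_{j+1}} - x_{i_j}\|^p\right)^{1/p} \ge \left(\|x_{i_{j^*+1}} - x_{i_{j^*}}\|^p\right)^{1/p} \ge \delta$$

Because this holds for all such $\gamma$:

$$d^{(p)}_X(x_\alpha, x_\beta) := \min_{\gamma}\left\{L^{(p)}(\gamma)\right\} \ge \delta$$

and because this holds for all such $x_\alpha$ and $x_\beta$:

$$\min_{x_\alpha \in X_a,\, x_\beta \in X_b} d^{(p)}_X(x_\alpha, x_\beta) \ge \delta$$

Finally, this holds for all $a \ne b$, yielding the lemma. ∎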
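Lemma 3.1 is easy to check numerically on synthetic data. The sketch below is an illustration under stated assumptions: the two "manifolds" are parallel unit segments at Euclidean distance `delta`, the sample sizes and random seed are arbitrary, and `pwspd_matrix` is a hypothetical brute-force helper (Floyd-Warshall), not the algorithm of §5.

```python
import math
import random

def pwspd_matrix(points, p):
    """Brute-force all-pairs p-wspd (Floyd-Warshall on powered distances)."""
    n = len(points)
    cost = [[math.dist(points[i], points[j]) ** p for j in range(n)]
            for i in range(n)]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if cost[i][m] + cost[m][j] < cost[i][j]:
                    cost[i][j] = cost[i][m] + cost[m][j]
    return [[c ** (1.0 / p) for c in row] for row in cost]

random.seed(0)
delta = 1.0   # separation between the two segments
n_a = 15      # points per cluster (arbitrary choice)
cluster_a = [(random.random(), 0.0) for _ in range(n_a)]    # segment y = 0
cluster_b = [(random.random(), delta) for _ in range(n_a)]  # segment y = delta
points = cluster_a + cluster_b
d = pwspd_matrix(points, p=2)

# Smallest p-wspd between the clusters, and largest p-wspd within a cluster.
min_between = min(d[i][j] for i in range(n_a) for j in range(n_a, 2 * n_a))
max_within = max(
    max(d[i][j] for i in range(n_a) for j in range(n_a)),
    max(d[i][j] for i in range(n_a, 2 * n_a) for j in range(n_a, 2 * n_a)),
)
```

Every path between the clusters must contain a leg of length at least `delta`, so `min_between` cannot drop below `delta` however many points are sampled, while `max_within` is free to shrink as the sampling becomes denser.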

### 3.2 Asymptotic Limits of power weighted shortest paths

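For all $x_\alpha, x_\beta \in X_a$, define $d^{(p)}_{X_a}(x_\alpha, x_\beta)$ as the minimum $p$-weighted length of paths from $x_\alpha$ to $x_\beta$ through $X_a$ (i.e. we are excluding paths that may pass through points in $X \setminus X_a$). Because $X_a \subset X$, it follows that $d^{(p)}_X(x_\alpha, x_\beta) \le d^{(p)}_{X_a}(x_\alpha, x_\beta)$. (More generally, the reader is invited to check that for any $Y \subset X$ containing $x_\alpha$ and $x_\beta$ we have that $d^{(p)}_X(x_\alpha, x_\beta) \le d^{(p)}_Y(x_\alpha, x_\beta)$.) In this section we address the asymptotic behaviour of $d^{(p)}_{X_a}$. Here is where we make critical use of the main theorem of [HDH16], which we state as Theorem 3.2. Recall that $\mu_a$ is the probability density function with respect to which $X_a$ is sampled from $M_a$, and that by assumption . Define the following power-weighted geodesic distance on $M_a$: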

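where here the infimum is over all piecewise smooth paths $\gamma : [0,1] \to M_a$ with $\gamma(0) = x$ and $\gamma(1) = y$. As in §2.2, for the Riemannian manifold $(M_a, g_a)$ let $\mathrm{dist}_a(x, y)$ denote the geodesic distance from $x$ to $y$ on $M_a$ with respect to $g_a$.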

In order to bound $d^{(p)}_{X_a}$ we study an auxiliary shortest path distance $d^{(p)}_{M_a}$. This distance will also be defined as a minimum over $p$-weighted path lengths, but instead of measuring the length of the legs using the Euclidean distance $\|\cdot\|$, we measure them with respect to the intrinsic metric $\mathrm{dist}_a$:

$$d^{(p)}_{M_a}(x, y) := \min_{\gamma} \left(\sum_{j=0}^{m} \mathrm{dist}_a(x_{i_{j+1}}, x_{i_j})^p\right)^{1/p} \tag{7}$$

where again the minimum is over all paths $\gamma$ from $x$ to $y$ through $X_a$.

###### Theorem 3.2 (Theorem 1 in [HDH16]).

(There is a slight notational discrepancy here. What is called $d^{(p)}_X(x_\alpha, x_\beta)$ in [HDH16] is our $\left(d^{(p)}_{M_a}(x_\alpha, x_\beta)\right)^p$.)

Let be a compact Riemannian manifold, and assume that is drawn from with continuous probability distribution satisfying . Let . For all , let Then for any fixed :

 (8)

where is a constant depending only on and , but not on . Note that the dependency on is contained in the term.

###### Corollary 3.3.

With assumptions as in Theorem 3.2,

$$\max_{x_\alpha, x_\beta \in X_a} \left(d^{(p)}_{M_a}(x_\alpha, x_\beta)\right) \le C_a\, n_a^{(1-p)/(p d_a)}$$

with probability , where is a constant depending on and but not on .

###### Proof.

From Theorem 3.2 there are two cases to consider:

1. , or

2. .

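In the first case, the one leg path $x_\alpha \to x_\beta$ is a path through $X_a$, hence:

$$d^{(p)}_{M_a}(x_\alpha, x_\beta) \le \left(\mathrm{dist}_a(x_\alpha, x_\beta)^p\right)^{1/p} = \mathrm{dist}_a(x_\alpha, x_\beta)$$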

and so with probability

For the second case, recall that . By assumption . From the definition of (see (6))

Because is compact and embedded, its diameter (see §2.2) is finite, and . Because :

Hence from Theorem 3.2 and Equation (10), with probability :

where now . Combining equations (9) and (11) and redefining proves the corollary. ∎

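Finally, we observe that the Euclidean path distance is never larger than the intrinsic path distance:

###### Lemma 3.4.

For any $a \in [\ell]$, and for all $x, y \in M_a$,

$$d^{(p)}_{X_a}(x, y) \le d^{(p)}_{M_a}(x, y).$$

###### Proof.

Observe that for any $x, y \in M_a$, $\|x - y\| \le \mathrm{dist}_a(x, y)$, since the straight segment is the shortest curve in $\mathbb{R}^D$ joining $x$ and $y$. It follows that for any path $\gamma$ through $X_a$:

$$\sum_{j=0}^{m} \|x_{i_{j+1}} - x_{i_j}\|^p \le \sum_{j=0}^{m} \mathrm{dist}_a(x_{i_{j+1}}, x_{i_j})^p$$

and so:

$$\begin{aligned} \left(d^{(p)}_{X_a}(x, y)\right)^p &= \min_{\gamma}\left\{\sum_{j=0}^{m} \|x_{i_{j+1}} - x_{i_j}\|^p\right\} \\ &\le \min_{\gamma}\left\{\sum_{j=0}^{m} \mathrm{dist}_a(x_{i_{j+1}}, x_{i_j})^p\right\} = \left(d^{(p)}_{M_a}(x, y)\right)^p \end{aligned}$$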

whence the result follows. ∎

### 3.3 Main Result

###### Theorem 3.5.

Let and and define to be the maximal distance between points in the same cluster:

Then with probability approaching as . In particular, for all , almost surely.

###### Proof.

For any , Lemma 3.4 and Corollary 3.3 give us that:

with probability , where the first inequality follows from the fact that . Taking the maximum over all yields:

Where the constant now depends on the geometry of the (in particular their dimension and diameter) and the probability distributions but not on the number of points per cluster. Finally, observe that for we indeed have that as

## 4 Relating p-wspd’s to the LLPD

Here we verify that the $p$-wspd $d^{(p)}_X$ can be thought of as interpolating between the Euclidean distance and the longest-leg path distance.

###### Lemma 4.1.

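For any fixed $X$, we have that $\lim_{p \to \infty} d^{(p)}_X(x_\alpha, x_\beta) = d^{(\infty)}_X(x_\alpha, x_\beta)$ for all $x_\alpha, x_\beta \in X$.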

###### Proof.

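First observe that for any fixed path $\gamma$ from $x_\alpha$ to $x_\beta$ we have that $\lim_{p \to \infty} L^{(p)}(\gamma) = L^{(\infty)}(\gamma)$. To see this, let us suppose that $L^{(\infty)}(\gamma) = \|x_{i_{j^*}} - x_{i_{j^*+1}}\|$. That is, $x_{i_{j^*}} \to x_{i_{j^*+1}}$ is the longest leg in $\gamma$. Then for any $p$:

$$\begin{aligned} L^{(p)}(\gamma) &:= \left(\sum_{j=0}^{m} \|x_{i_{j+1}} - x_{i_j}\|^p\right)^{1/p} \\ &= \|x_{i_{j^*}} - x_{i_{j^*+1}}\| \left(1 + \sum_{\substack{j=0 \\ j \ne j^*}}^{m} \left(\frac{\|x_{i_{j+1}} - x_{i_j}\|}{\|x_{i_{j^*}} - x_{i_{j^*+1}}\|}\right)^p\right)^{1/p} \\ &\le \|x_{i_{j^*}} - x_{i_{j^*+1}}\| \,(m+1)^{1/p} \end{aligned}$$

because $\gamma$ has $m + 1$ legs and each ratio is at most one. As $p \to \infty$, $(m+1)^{1/p} \to 1$, and since also $L^{(p)}(\gamma) \ge L^{(\infty)}(\gamma)$, the claim follows. Because the operation of taking a minimum over the finitely many paths through $X$ is continuous, we get that:

$$\lim_{p \to \infty} d^{(p)}_X(x_\alpha, x_\beta) = \lim_{p \to \infty} \min_{\gamma} L^{(p)}(\gamma) = \min_{\gamma} \lim_{p \to \infty} L^{(p)}(\gamma) = \min_{\gamma} L^{(\infty)}(\gamma) = d^{(\infty)}_X(x_\alpha, x_\beta)$$

∎

###### Lemma 4.2.

For all $x_\alpha, x_\beta \in X$, $d^{(1)}_X(x_\alpha, x_\beta) = \|x_\alpha - x_\beta\|$.

###### Proof.

$d^{(1)}_X(x_\alpha, x_\beta)$ is defined as a minimum over all paths from $x_\alpha$ to $x_\beta$ through $X$, and in particular the one hop path $\gamma_{\alpha \to \beta} = x_\alpha \to x_\beta$ is such a path. We claim it is the shortest such path, as for any other path $\gamma = x_\alpha \to x_{i_1} \to \cdots \to x_{i_m} \to x_\beta$, by repeated applications of the triangle inequality:

$$\begin{aligned} L^{(1)}(\gamma_{\alpha \to \beta}) = \|x_\alpha - x_\beta\| &= \left\|x_\alpha - \sum_{j=1}^{m} (x_{i_j} - x_{i_j}) - x_\beta\right\| = \left\|\sum_{j=0}^{m} (x_{i_j} - x_{i_{j+1}})\right\| \\ &\le \sum_{j=0}^{m} \|x_{i_j} - x_{i_{j+1}}\| = L^{(1)}(\gamma) \end{aligned}$$

∎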
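The interpolation described in this section can be observed numerically. The sketch below rests on stated assumptions: `pwspd_matrix` is a hypothetical brute-force helper (Floyd-Warshall on powered distances), and the four collinear points and the list of exponents are arbitrary choices for illustration.

```python
import math

def pwspd_matrix(points, p):
    """Brute-force all-pairs p-wspd (Floyd-Warshall on powered distances)."""
    n = len(points)
    cost = [[math.dist(points[i], points[j]) ** p for j in range(n)]
            for i in range(n)]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if cost[i][m] + cost[m][j] < cost[i][j]:
                    cost[i][j] = cost[i][m] + cost[m][j]
    return [[c ** (1.0 / p) for c in row] for row in cost]

# Collinear points at 0, 1, 3 and 3.5. Between the end points the Euclidean
# distance is 3.5 and the longest-leg path distance is 2 (the gap from 1 to 3).
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.5, 0.0)]
exponents = [1, 2, 4, 8, 64]
distances = [pwspd_matrix(points, p)[0][3] for p in exponents]
```

The first entry of `distances` is the Euclidean distance, as in Lemma 4.2, and the entries decrease towards the LLPD value as $p$ grows, as in Lemma 4.1.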

## 5 A Fast Algorithm for p-wspd Nearest Neighbors

In this section we start from a more general perspective. Let be a weighted graph. We shall require all weights to be positive, and will represent the edge-weight function. If we fix a numbering of the vertices then is a matrix and represents the weight of the edge (and if there is no such edge). Occasionally, it will prove more convenient to not fix an ordering of the vertices, in which case will represent the weight of the edge (and again if there is no such edge). By we shall mean the path from to in through . Here, this is only valid if are all edges in . In analogy with §2.3 we maintain the convention that for such a path , and . Define the length of as the sum of all its edge weights:

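$$L(\gamma) := \sum_{i=0}^{m} W(w_i, w_{i+1})$$

and similarly define the longest-leg length of $\gamma$ as:

$$L^{(\infty)}(\gamma) = \max_{i = 0, \ldots, m} W(w_i, w_{i+1})$$

For any pair of vertices $u, v$ define the shortest path distance as:

$$d_G(u, v) = \min\left\{L(\gamma) : \gamma \text{ a path from } u \text{ to } v\right\}$$

and analogously define the longest-leg path distance as:

$$d^{(\infty)}_G(u, v) = \min\left\{L^{(\infty)}(\gamma) : \gamma \text{ a path from } u \text{ to } v\right\}$$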

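Let us relate this to the discussion in previous sections. For any set of data points $X \subset \mathbb{R}^D$ and any power weighting $p \ge 1$ one can form a graph $G$ on $n = |X|$ vertices, one vertex $v_i$ for each $x_i \in X$, with edge weights $W_{ij} = \|x_i - x_j\|^p$. Then:

$$d_G(v_i, v_j) = \left(d^{(p)}_X(x_i, x_j)\right)^p$$

Note that here the graph $G$ is complete.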

###### Definition 5.1.

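Let $N_{k,G}(v)$ denote the set of $k$ nearest neighbors of $v$. That is, $|N_{k,G}(v)| = k$ with $W(v, u) \le W(v, w)$ for all $u \in N_{k,G}(v)$ and all $w \notin N_{k,G}(v) \cup \{v\}$.

###### Definition 5.2.

For any graph $G$, define a directed $k$-Nearest Neighbors graph $G^{(k)}$ with directed edges $(u, v)$ if $v \in N_{k,G}(u)$.

In practice we do not compute the entire edge set of $G^{(k)}$, but rather just compute the sets $N_{k,G}(v)$ as it becomes necessary.

###### Definition 5.3.

Let $N^{d_G}_{k,G}(v)$ denote the set of $k$ vertices which are closest to $v$ in the shortest-path distance $d_G$. That is, $|N^{d_G}_{k,G}(v)| = k$ and $d_G(v, u) \le d_G(v, w)$ for all $u \in N^{d_G}_{k,G}(v)$ and all $w \notin N^{d_G}_{k,G}(v)$. By convention, we take $v$ to be in $N^{d_G}_{k,G}(v)$ as $d_G(v, v) = 0$.

We have not specified how to break ties in the definition of $N_{k,G}(v)$ or $N^{d_G}_{k,G}(v)$. For the results of this section to hold, any method will suffice, as long as we use the same method in both cases. To simplify the exposition, we shall assume henceforth that all distances are distinct.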

First let us briefly review how Dijkstra’s algorithm works. The following implementation is as in [CLRS09]. The min-priority queue operations and extractMin have their standard definitions (see, for example Chpt. 6 of [CLRS09]). For any vertex and any subset , we shall also use the shorthand to denote the process of initializing a min-priority queue with and for all .

Note that the list that is outputted contains all the shortest path distances from the source . That is, once is popped in step , is the shortest path distance from to . The key observation is the following:

###### Lemma 5.4.

Suppose that all weights are non-negative: . If is the -th vertex to be removed from at step , then is the -th closest vertex to .

###### Proof.

See, for example, the discussion in [CLRS09]. ∎
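Lemma 5.4 can be seen in a short Python sketch of Dijkstra’s algorithm that stops after a prescribed number of removals from the queue. The assumptions are as follows: the function name `k_closest_vertices` and the toy graph are hypothetical, the graph is given as a dictionary of adjacency dictionaries, and Python’s `heapq` is used with stale entries skipped on removal in place of a decrease-key operation, which differs from the textbook implementation.

```python
import heapq

def k_closest_vertices(adj, source, k):
    """Return the first k vertices removed from the queue by Dijkstra's
    algorithm, as (vertex, distance) pairs in order of removal.
    adj maps each vertex to a dict {neighbour: positive edge weight}."""
    best = {source: 0.0}
    removed = []
    finished = set()
    heap = [(0.0, source)]
    while heap and len(removed) < k:
        dist_u, u = heapq.heappop(heap)
        if u in finished:
            continue  # stale queue entry left over from an earlier update
        finished.add(u)
        removed.append((u, dist_u))
        for v, weight in adj[u].items():
            candidate = dist_u + weight
            if v not in finished and candidate < best.get(v, float("inf")):
                best[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return removed

# Toy undirected graph; the shortest path from "s" to "b" goes through "a".
adj = {
    "s": {"a": 1, "b": 4},
    "a": {"s": 1, "b": 2, "c": 6},
    "b": {"s": 4, "a": 2, "c": 1},
    "c": {"a": 6, "b": 1},
}
closest = k_closest_vertices(adj, "s", 3)
```

The vertices come off the queue in order of increasing shortest-path distance from the source, so stopping after three removals leaves `"c"` unvisited.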

It follows that, if one is only interested in finding the $k$ nearest neighbors of the source vertex in the path distance $d_G$, one need only iterate through the ‘while’ loop $k$ times. There is a further inefficiency, which was also highlighted in [BRS11]. The ‘for’ loop 6–10 iterates over all neighbors of the vertex just removed from the queue. The graphs we are interested in are, implicitly, fully connected, hence this for loop iterates over all other vertices at each step. We fix this with the following observation:

###### Lemma 5.5.

For any graph $G$, let $G^{(k)}$ denote its $k$-Nearest-Neighbor graph (see Definition 5.2). Then:

$$N^{d_G}_{k,G}(v) = N^{d_{G^{(k)}}}_{k,G^{(k)}}(v) \quad \text{for all } v$$

Note that in the directed graph $G^{(k)}$, we consider only paths that traverse each edge in the ‘correct’ direction.

Concretely: the path-nearest-neighbors in $G$ are the same as the path-nearest neighbors in $G^{(k)}$, hence one can find $N^{d_G}_{k,G}(v)$ by running a Dijkstra-style algorithm on $G^{(k)}$, instead of $G$. As each vertex in $G^{(k)}$ has a small number of neighbors (precisely $k$), this alleviates the computational burden of the ‘for’ loop 6–10 highlighted above.
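The statement of Lemma 5.5 can be checked empirically on random data. The sketch below is a simplified illustration and should not be read as the paper’s Algorithm 2: all function names are hypothetical, a brute-force Euclidean $k$-NN query stands in for a fast nearest-neighbor data structure, and the reference answer is obtained by Floyd-Warshall on the complete graph.

```python
import heapq
import math
import random

def euclidean_knn(points, i, k):
    """Indices of the k Euclidean nearest neighbours of points[i], excluding i.
    Brute force; a stand-in for a fast nearest-neighbour query."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    return order[:k]

def path_knn(points, source, k, p):
    """First k vertices removed from the queue by Dijkstra on the directed
    k-NN graph with edge weights ||x_i - x_j||^p (source included)."""
    best = {source: 0.0}
    removed = []
    finished = set()
    heap = [(0.0, source)]
    while heap and len(removed) < k:
        cost_u, u = heapq.heappop(heap)
        if u in finished:
            continue  # stale queue entry
        finished.add(u)
        removed.append(u)
        # Neighbour sets are computed lazily, only for removed vertices.
        for v in euclidean_knn(points, u, k):
            candidate = cost_u + math.dist(points[u], points[v]) ** p
            if v not in finished and candidate < best.get(v, math.inf):
                best[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return removed

def path_knn_brute_force(points, source, k, p):
    """Reference answer: Floyd-Warshall on the complete graph."""
    n = len(points)
    cost = [[math.dist(points[i], points[j]) ** p for j in range(n)]
            for i in range(n)]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if cost[i][m] + cost[m][j] < cost[i][j]:
                    cost[i][j] = cost[i][m] + cost[m][j]
    return sorted(range(n), key=lambda j: cost[source][j])[:k]

random.seed(1)
points = [(random.random(), random.random()) for _ in range(40)]
fast = path_knn(points, source=0, k=5, p=2)
slow = path_knn_brute_force(points, source=0, k=5, p=2)
```

Only the vertices actually removed from the queue trigger a nearest-neighbor query, so the truncated search issues at most $k$ such queries per source vertex, each returning $k$ candidates.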

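Before proving this lemma, let us explain why it may seem counterintuitive. If $u \in N^{d_G}_{k,G}(v)$ there is a path $\gamma$ from $v$ to $u$ that is short (at least shorter than the shortest paths to all $u' \notin N^{d_G}_{k,G}(v)$). In forming $G^{(k)}$ from $G$, one deletes a lot of edges. Thus it is not clear that $\gamma$ is still a path in $G^{(k)}$ (some of its edges may now be ‘missing’). Hence it would seem possible that $u$ is now far away from $v$ in the shortest-path distance in $G^{(k)}$. The lemma asserts that this cannot be the case.

###### Proof.

Since the sets $N^{d_G}_{k,G}(v)$ and $N^{d_{G^{(k)}}}_{k,G^{(k)}}(v)$ have the same cardinality (i.e. $k$), to prove equality it suffices to prove one containment. We shall show that $N^{d_G}_{k,G}(v) \subset N^{d_{G^{(k)}}}_{k,G^{(k)}}(v)$. Consider any $u \in N^{d_G}_{k,G}(v)$. Let $\gamma = v \to w_1 \to \cdots \to w_m \to u$ be the shortest path from $v$ to $u$ in $G$. That is, $L(\gamma) = d_G(v, u)$.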

We claim that $\gamma$ is a path in $G^{(k)}$. If this is not the case, then there is an edge $(w_i, w_{i+1})$ that is in $\gamma$ but is not an edge in $G^{(k)}$ (we again adopt the convention that $w_0 = v$ and $w_{m+1} = u$). By the construction of $G^{(k)}$ this implies that there are vertices