
# Pruning nearest neighbor cluster trees

## Abstract

Nearest neighbor (k-NN) graphs are widely used in machine learning and data mining applications, and our aim is to better understand what they reveal about the cluster structure of the unknown underlying distribution of points. Moreover, is it possible to identify spurious structures that might arise due to sampling variability?

Our first contribution is a statistical analysis that reveals how certain subgraphs of a k-NN graph form a consistent estimator of the cluster tree of the underlying distribution of points. Our second and perhaps most important contribution is the following finite sample guarantee. We carefully work out the tradeoff between aggressive and conservative pruning and are able to guarantee the removal of all spurious cluster structures at all levels of the tree while at the same time guaranteeing the recovery of salient clusters. This is the first such finite sample result in the context of clustering.

## 1 Introduction

In this work, we consider the nearest neighbor (k-NN) graph where each sample point is linked to its k nearest neighbors. These graphs are widely used in machine learning and data mining applications, and interestingly there is still much to understand about their expressiveness. In particular we would like to better understand what such a graph on a finite sample of points might reveal about the cluster structure of the underlying distribution of points. More importantly we are interested in whether one can identify spurious structures that are artifacts of sampling variability, i.e. spurious structures that are not representative of the true cluster structure of the distribution.

Our first contribution is in exposing more of the richness of -NN graphs. Let be a -NN graph over an -sample from a distribution with density . Previous work Maier et al. (2009) has shown that the connected components (CC) of a given level set of can be approximated by the CCs of some subgraph of , provided the level set satisfies certain boundary conditions. However it remained unclear whether or when all level sets of might satisfy these conditions, in other words, whether the CCs of any level set can be recovered. We show under mild assumptions on that CCs of any level set can be recovered by subgraphs of for sufficiently large. Interestingly, these subgraphs are obtained in a rather simple way: just remove points from the graph in decreasing order of their -NN radius (distance to the ’th nearest neighbor), and we obtain a nested hierarchy of subgraphs which approximates the cluster tree of , i.e. the nested hierarchy formed by the level sets of (see Figure 1, also Section 2.1).

Our second, and perhaps more important contribution is in providing the first concrete approach in the context of clustering that guarantees the pruning of all spurious cluster structures at any tree level. We carefully work out the tradeoff between pruning “aggressively” (and potentially removing important clusters) and pruning “conservatively” (with the risk of keeping spurious clusters) and derive tuning settings that require no knowledge of the underlying distribution beyond an upper bound on . We can thus guarantee in a finite sample setting that (a) all clusters remaining at any level of the pruned tree correspond to CCs of some level set of , i.e. all spurious clusters are pruned away, and (b) salient clusters are still discovered, where the degree of saliency depends on the sample size . We can show furthermore that the pruned tree remains a consistent estimator of the underlying cluster tree, i.e. the CCs of any level set of are recovered for sufficiently large . Interestingly, the pruning procedure is not tied to the -NN method, but is based on a simple intuition that can be applied to other cluster tree methods (see Section 3).

Our results rely on a central “connectedness” lemma (Section 5.2) that identifies which CCs of remain connected in the empirical tree. This is done by analyzing the way in which k-NN radii vary along a path in a dense region.

### 1.1 Related work

Recovering the cluster tree of the underlying density is a clean formalism of hierarchical clustering proposed in 1981 by J. A. Hartigan (Hartigan, 1981). Hartigan showed in the same seminal paper that the single-linkage algorithm is a consistent estimator of the cluster tree for densities on . For it is known that the empirical cluster tree of a consistent density estimate is a consistent estimator of the underlying cluster tree (see e.g. Wong & Lane (1983)); unfortunately, there is no known algorithm for computing this empirical tree. Nonetheless, the idea has led to the development of interesting heuristics based on first estimating density, then approximating the cluster tree of the density estimate in high dimension Wong & Lane (1983); Stuetzle & Nugent (2010).

Many other related works, such as Rigollet & Vert (2009); Singh et al. (2009); Maier et al. (2009); Rinaldo & Wasserman (2010), consider the task of recovering the CCs of a single level set, the closest to the present work being Maier et al. (2009), which uses a k-NN graph for level set estimation. As previously discussed, however, level set estimation never led to a consistent estimator of the cluster tree, since these results typically impose technical requirements on the level set being recovered but do not work out how or when these requirements might be satisfied by all level sets of a distribution.

A recent insightful paper of Chaudhuri & Dasgupta (2010) presents the first provably consistent algorithm for estimating the cluster tree. At each level of the empirical cluster tree, they retain only those samples whose -NN radii are below a scale parameter which indexes the level; CCs at this level are then discovered by building an -neighborhood graph on the retained samples. This is similar to an earlier generalization of single-linkage by Wishart (1969) which however was given without a convergence analysis. The -NN tree studied here differs in that, at an equivalent level , points are connected to the subset of their -nearest neighbors retained at that level. One practical appeal of our method is its simplicity: we need only remove points from an initial -NN graph to obtain the various levels of the empirical cluster tree.

Chaudhuri & Dasgupta (2010) provides finite sample results for a particular setting of . In contrast our finite sample results are given for a wide range of values of , namely for . In both cases the finite sample results establish natural separation conditions under which the CCs of level sets are recovered (see Theorem 1). The result of Chaudhuri & Dasgupta (2010) however allows the possibility that some empirical clusters are just artifacts of sampling variability. We provide a simple pruning procedure that ensures that clusters discovered empirically at any level correspond to true clusters at some level of the underlying cluster tree. Note that this can be trivially guaranteed by returning a single cluster at all levels, so we additionally guarantee that the algorithm discovers salient modes of the density, where the saliency depends on empirical quantities (see Theorem 2).

A recent archived paper Rinaldo et al. (2010) also treats the problem of false clusters in cluster tree estimation, but the result is not algorithmic as they only consider the cluster tree of an empirical density estimate, and do not provide a way to compute this cluster tree.

There exist many pruning heuristics in the literature which typically consist of removing small clusters Maier et al. (2009); Stuetzle & Nugent (2010) using some form of thresholding. The difficulty with these approaches is in how to define small without making strong assumptions on the unknown underlying distribution, or on the tree level being pruned (levels correspond to different resolutions or cluster sizes). Moreover, even the assumption that spurious clusters must be small does not necessarily hold. Consider for example a cluster made up of two large regions connected by a thin bridge of low mass; the two large regions can easily appear as two separate clusters in a finite sample. Some more sophisticated methods such as Stuetzle & Nugent (2009) do not rely on cluster size for pruning; instead, they return confidence values for the empirical clusters based on various notions of cluster stability. Unfortunately, they do not provide finite sample guarantees. Our pruning guarantees the removal of all spurious clusters, large and small (see Figure 2); we make no assumption on the shape of clusters beyond a smoothness assumption on the density; we provide a simple tuning parameter whose setting requires just an upper bound on the density.

## 2 Preliminaries

Assume the finite dataset is drawn i.i.d. from a distribution over with density function .

We start with some simple definitions related to -NN operations. All balls, unless otherwise specified, denote closed balls in .

For , let denote the radius of the smallest ball centered at containing points from . Also, let denote the radius of the smallest ball centered at of -mass .

###### Definition 2 (k-NN and mutual k-NN graphs).

The -NN graph is that whose vertices are the points in , and where is connected to iff or for some . The mutual -NN graph is that where is connected to iff and .
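Definition 2 can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: it assumes that adjacency is a ball-membership test in which each point's k-NN radius is scaled by a parameter `theta`, with "or" for the k-NN graph and "and" for the mutual k-NN graph. The names `knn_graph_edges`, `radii`, `theta` and `mutual` are hypothetical.

```python
import math

def knn_graph_edges(points, radii, theta=1.0, mutual=False):
    """Edges (i, j), i < j, of a k-NN-style graph.

    Assumed rule: i and j are adjacent when j lies in the ball of radius
    theta * radii[i] around i, or i lies in the ball of radius
    theta * radii[j] around j. With mutual=True both tests must hold.
    """
    edges = []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(points[i], points[j])
            in_ball_i = dist <= theta * radii[i]
            in_ball_j = dist <= theta * radii[j]
            linked = (in_ball_i and in_ball_j) if mutual else (in_ball_i or in_ball_j)
            if linked:
                edges.append((i, j))
    return edges
```

Under this reading, every edge of the mutual graph is also an edge of the k-NN graph, so the mutual graph is the sparser of the two.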

### 2.1 Cluster tree

###### Definition 3 (Connectedness).

We say is connected if for every there exists a continuous function where and . is called a path in between and .

The cluster tree of will be denoted , where are the CCs of the level set . Notice that forms a (infinite) tree hierarchy where for any two components , either or one is a descendant of the other, i.e or .

## 3 Algorithm

###### Definition 4 (k-NN density estimate).

Define the density estimate at as :

$$f_n(x) \doteq \frac{k}{n \cdot \mathrm{vol}\left(B(x, r_{k,n}(x))\right)} = \frac{k}{n \cdot v_d\, r_{k,n}^d(x)},$$

where is the volume of the unit ball in .
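The estimate of Definition 4 is straightforward to compute at the sample points. The sketch below is illustrative only: it uses brute-force distances, and it assumes that the k-NN radius of a sample point is measured to its k-th nearest other sample (the point itself is not counted). The function names are hypothetical.

```python
import math

def knn_radii(points, k):
    """Distance from each sample to its k-th nearest *other* sample (assumed convention)."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii

def unit_ball_volume(d):
    """Volume of the unit Euclidean ball in dimension d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def knn_density(points, k):
    """k-NN density estimate k / (n * v_d * r^d) at every sample point."""
    n = len(points)
    d = len(points[0])
    v_d = unit_ball_volume(d)
    # Duplicate points would give a zero radius; this sketch does not handle that case.
    return [k / (n * v_d * r ** d) for r in knn_radii(points, k)]
```

Since the estimate is decreasing in the k-NN radius, ordering points by decreasing radius is the same as ordering them by increasing estimated density.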

Let be the -NN or mutual -NN graph. For define as the subgraph of containing only vertices in and corresponding edges. The CCs of form a tree: let and be two such CCs, either or one is a descendant of the other, i.e. is a subgraph of or vice versa. To simplify notation, we let the set denote the empirical cluster tree before pruning.
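One level of the empirical cluster tree can be sketched as below. The code assumes that the level-lambda subgraph keeps exactly the vertices whose density estimate is at least `lam`, together with the edges among them; the function name and the toy inputs are hypothetical.

```python
def level_components(density, edges, lam):
    """Connected components of the subgraph on vertices with density >= lam."""
    keep = {i for i, f in enumerate(density) if f >= lam}
    adj = {i: [] for i in keep}
    for i, j in edges:
        if i in keep and j in keep:
            adj[i].append(j)
            adj[j].append(i)
    seen, comps = set(), []
    for start in sorted(keep):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(frozenset(comp))
    return comps

# Toy example: a path graph 0-1-2-3 whose vertex 1 has low estimated density.
density = [0.9, 0.2, 0.8, 0.7]
edges = [(0, 1), (1, 2), (2, 3)]
low = level_components(density, edges, 0.1)
high = level_components(density, edges, 0.5)
```

In the toy example every component at the higher level is contained in a component at the lower level, which is the nesting that makes the components a tree.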

### Pruning

The pruning procedure (Algorithm 1) consists of simple lookups: it reconnects CCs at level if they are part of the same CC at level where the tuning parameter controls how aggressively we prune. We show its behavior on a finite sample in Figure 2.

The intuition behind the procedure is the following. Suppose are disconnected at some level in the empirical tree before pruning. However, they ought to be connected, i.e. their vertices belong to the same CC at the highest level where they are all contained in the underlying cluster tree. Then, key sample points from that would have kept them connected are missing at level in the empirical tree. These key points have values lower than , but probably not much lower. By looking down to a lower level near we find that are connected and thus detect the situation. Notice that this intuition is not tied to the -NN cluster tree but can be applied to any other cluster tree procedure. All that is required is that all points from (as discussed above) be connected at some level in the tree close to .

It is not hard to see that the CCs of the pruned subgraphs still form a tree. We will hence denote the pruned empirical tree by .
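The lookup described above can be sketched as follows. This is not Algorithm 1 itself: it assumes that the lower level inspected is `lam - eps` for a tuning parameter `eps >= 0`, and that level subgraphs keep vertices with density estimate at least the level. All names are hypothetical.

```python
def components(vertices, edges):
    """Connected components of the subgraph induced on `vertices`."""
    adj = {v: [] for v in vertices}
    for i, j in edges:
        if i in adj and j in adj:
            adj[i].append(j)
            adj[j].append(i)
    seen, comps = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(frozenset(comp))
    return comps

def pruned_components(density, edges, lam, eps):
    """Merge components at level lam that share a component at level lam - eps."""
    upper = components([i for i, f in enumerate(density) if f >= lam], edges)
    lower = components([i for i, f in enumerate(density) if f >= lam - eps], edges)
    owner = {v: idx for idx, comp in enumerate(lower) for v in comp}
    merged = {}
    for comp in upper:
        # Every vertex kept at level lam is also kept at the lower level.
        merged.setdefault(owner[min(comp)], set()).update(comp)
    return [frozenset(c) for c in merged.values()]

# Toy example: vertex 1 is a low-density bridge between {0} and {2, 3}.
density = [0.9, 0.2, 0.8, 0.7]
edges = [(0, 1), (1, 2), (2, 3)]
conservative = pruned_components(density, edges, 0.5, 0.2)
aggressive = pruned_components(density, edges, 0.5, 0.4)
```

With the smaller `eps` the bridge vertex is still missing at the lower level and the two clusters stay separate; with the larger `eps` the bridge is found and they are reconnected. Only the vertices already present at level `lam` are reported in the merged cluster.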

## 4 Results Overview

We make the following assumptions on the density .

1. , .

2. is Hoelder-continuous, i.e. there exists such that for all ,

$$|f(x) - f(x')| \le L \|x - x'\|^{\alpha}.$$

Theorem 1 below is a finite sample result that establishes conditions under which samples from a connected subset of remain connected in the empirical cluster tree, and samples from two disconnected subsets of remain disconnected even after pruning. Essentially, for sufficiently large, points from connected subsets remain connected below some level. Also, provided is not too large, disjoint subsets and which are separated by a large enough region of low density (relative to , and ), remain disconnected above some level.

We require the following two definitions.

###### Definition 5 (Envelope of A⊂Rd).

Let and for , define:

###### Definition 6 ((ϵ,r)-separated sets ).

are -separated if there exists a separating set such that every path in between and intersects , and

$$\sup_{x \in S_{+r}} f(x) \le \inf_{x \in A \cup A'} f(x) - \epsilon.$$

###### Theorem 1.

Suppose satisfies (A.1) and (A.2). Let be the -NN or mutual -NN graph. Let and define . There exist and such that, for

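$$C \left(\max\{1, \sqrt{2}/\theta\}\right)^{d} d \ln(n/\delta) \le k \le C' \left(F \sqrt{\ln(n/\delta)}\right)^{2(\alpha+d)/(3\alpha+d)} n^{2\alpha/(3\alpha+d)} \tag{1}$$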

the following holds with probability at least simultaneously for subsets of .

1. Let be a connected subset of , and let . All points in belong to the same CC of .

2. Let and be two disjoint subsets of , and define . Recall that is the tuning parameter. Suppose and are -separated for and . Then and are disconnected in .

Theorem 1 above, although written in terms of , applies also to by just setting . The theorem implies consistency of both pruned and unpruned -NN trees under mild additional conditions. Some such conditions are illustrated in the corollary below. A nice practical aspect of the pruning procedure is that consistency is obtained for a wide range of settings of and as functions of .

###### Corollary 1 (Consistency).

Suppose that satisfies (A.1) and (A.2) and that, in addition, is supported on a compact set, and for any , there are finitely many components in . Assume that, as , and while satisfies (1).

For any , let denote the smallest component of containing . Fix . We have .

###### Proof.

Let and be separate components of . The assumptions ensure that all paths between and traverse a compact set satisfying (see Lemma 14 of Chaudhuri & Dasgupta (2010)). Let and . By uniform continuity of , there exists such that for , is small enough so that . Also, there exists such that for , , in other words .

Since is finite, there exists such that for , all pairs have a suitable -separating set . Thus by Theorem 1, for , with probability at least , , and are fully contained in and are disjoint. They are thus disjoint at any higher level, so and are also disjoint.

The above holds for all , so the statement follows. ∎

While Theorem 1 establishes that a connected set remains connected below some level, it does not guarantee against parts of becoming disconnected at higher levels, creating spurious clusters. Note that the removal of spurious clusters can be trivially guaranteed by just letting the parameter be very large, but the ability of the algorithm to discover true clusters is then necessarily affected. We are interested in how to set in order to guarantee the removal of spurious clusters while still recovering important ones.

Theorem 2 guarantees that, by setting as (recall from Theorem 1), separate CCs of the empirical cluster tree correspond to actual clusters of the (unknown) underlying distribution, i.e. all spurious clusters are removed. The setting of only requires an upper bound on the density. Note that, under such a setting, consistency is maintained per Corollary 1, and in light of Theorem 1 (b), we can expect that interesting clusters are discovered. In particular the following salient modes of are discovered.

###### Definition 7 ((ϵ,r)-salient mode).

An -salient mode is a leaf node of the cluster tree which has an ancestor (possibly itself) satisfying:

1. is the ancestor of a single leaf of , namely .

2. is large: .

3. is sufficiently separated from other components at its level: let ; and are -separated.

Notice that, under the assumptions of Corollary 1, every mode of is -salient for sufficiently large and .

###### Theorem 2 (Pruning guarantees).

Let . Under the assumptions of Theorem 1, the following holds with probability at least .

1. Suppose the tuning parameter . Consider two disjoint CCs and at the same level in . Let be the union of vertices of and , and define . The vertices of and those of are in separate CCs of .

2. Let and . There exists a map from the set of -salient modes to the leaves of the empirical tree .

The behavior of both the -NN and mutual -NN tree, as guaranteed in Theorem 2, is illustrated in Figure 3.

## 5 Analysis

Theorem 1 follows from lemmas 3 and 6 below. These two lemmas depend on the events described by lemmas 1, 2 and 4 which happen with a combined probability of at least for a confidence parameter .

Theorem 2 follows from lemmas 5 and 7 below. These two lemmas also depend on the events described by lemmas 1, 2 and 4 which happen with a combined probability of at least .

### 5.1 Maintaining Separation

In this section we establish conditions under which points from two disconnected subsets of remain disconnected in the empirical tree, even after pruning.

The following is an important lemma which establishes the estimation error of relative to on the sample . Interestingly, although of independent interest, we could not find this sort of finite sample statement in the literature on k-NN, at least not under our assumptions. The proof, presented as supplement in the appendix, is a bit involved and starts with some intuition from an asymptotic analysis of Devroye & Wagner (1977) combined with a form of the Chernoff bound found in Angluin & Valiant (1979).

###### Lemma 1.

Suppose satisfies (A.1) and (A.2). There exists such that for , for and

$$121 \ln(2n/\delta) \le k \le C \left(F \sqrt{\ln(2n/\delta)}\right)^{2(\alpha+d)/(3\alpha+d)} n^{2\alpha/(3\alpha+d)},$$

we have with probability at least that

The next lemma bounds in terms of , and hence, in terms of the density at . The proof is provided as supplement in the appendix.

###### Lemma 2.

Suppose satisfies (A.1) and (A.2). Fix and let .

1. Let . We have , . If in addition , it follows that .

2. Suppose . We have

$$\forall x \in L_{\lambda}, \quad r_k(x) \le \min\left\{ 2^{-3/d}\, r,\ \left(\frac{2k}{v_d\, n f(x)}\right)^{1/d} \right\}.$$

For , if in addition , we have with probability at least that for all

$$2^{-3/d}\, r_k(X_i) \le r_{k,n}(X_i) \le 2^{3/d}\, r_k(X_i).$$

The main separation lemma is next. It says that if and are separated by a sufficiently large low density region, then they remain separated in the empirical tree.

###### Lemma 3 (Separation).

Suppose satisfies (A.1) and (A.2). Let be the -NN or mutual -NN graph. Define , and let . There exists such that, for

$$192 \ln(2n/\delta) \le k \le C \left(F \sqrt{\ln(n/\delta)}\right)^{2(\alpha+d)/(3\alpha+d)} n^{2\alpha/(3\alpha+d)},$$

the following holds with probability at least simultaneously for any two disjoint subsets of .

Let . If and are -separated for and , then and are disconnected in and therefore in .

###### Proof.

Applying Lemma 1, it’s immediate that, with probability at least , all points of any are in and lower levels, and no point from is in or higher levels. Thus any path between and in must have an edge through the center of a ball . This edge must therefore have length greater than . We just need to show that no such edge exists in .

Let be the set of points (vertices) in . By Lemma 1, . Given the density assumption on , so and . Now, given the range of , Lemma 2 holds for the level set . It follows that with probability at least (uniform over any such choice of since the event is a function of ),

$$\max_{X_i \in V} r_{k,n}(X_i) \le 2^{3/d} \max_{X_i \in V} r_k(X_i) \le \frac{2r}{\theta}.$$

Thus, edge lengths in are at most . ∎

#### Identifying Modes

As a corollary to Lemma 3, we can guarantee in Lemma 5 that certain salient modes are recovered by the empirical cluster tree. For this to happen, we require in Definition 7 (ii) that an -salient mode is contained in a sufficiently large set so that we sample points near the mode.

We start with the following VC lemma establishing conditions under which subsets of contain samples from .

###### Lemma 4 (Lemma 5.1 of Bousquet et al. (2004)).

Suppose is a class of subsets of . Let denote the -shatter coefficient of . Let denote the empirical distribution over samples drawn i.i.d from . For , with probability at least ,

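$$\sup_{A \in \mathcal{C}} \frac{\mathcal{F}(A) - \mathcal{F}_n(A)}{\sqrt{\mathcal{F}(A)}} \le 2 \sqrt{\frac{\log S_{\mathcal{C}}(2n) + \log(4/\delta)}{n}}.$$

###### Lemma 5 (Modes).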

Suppose satisfies (A.1) and (A.2). Let be the -NN or mutual -NN graph. Let . There exist and such that, for

$$C d \ln(n/\delta) \le k \le C' \left(F \sqrt{\ln(n/\delta)}\right)^{2(\alpha+d)/(3\alpha+d)} n^{2\alpha/(3\alpha+d)}$$

the following holds with probability at least . Let and . There exists a map from the set of -salient modes to the leaves of the empirical tree .

###### Proof.

First, with probability at least , for any -salient mode , there are samples in from the containing set (as defined in Definition 7). To arrive at this we apply Lemma 4 for the class of all possible balls , (for this class ). We have with probability at least that for all , whenever

$$\mathcal{F}(B) \ge \frac{C d \ln(n/\delta)}{n} > 4\, \frac{(d+1) \log(2n) + \log(4/\delta)}{n},$$

where is appropriately chosen to satisfy the last inequality. Now, from the definition of , there exists such that , while we have , implying that .

As a consequence of the above argument, there is a finite number of -salient modes since each contributes some points to the final sample . We can therefore arrange them as so that for , we have where . An injective map can now be constructed iteratively as follows.

Starting with , we have by Lemma 3 that, with probability at least , is disconnected in from all . Let be the union of those CCs of containing points from . We’ve already established that contains no point from any . For , also contains no point from any . This is because, again by Lemma 3, is disconnected in from , therefore disconnected from since all CCs in remain connected at lower levels. Now, since is disconnected from all , we can just map to any leaf rooted in , being the unique image of such a leaf. ∎

### 5.2 Maintaining Connectedness

In this section we show that sample points from a connected subset of remain connected in the empirical cluster tree before pruning (therefore also after pruning).

Similar to Chaudhuri & Dasgupta (2010), for any two points we uncover a path in near a path in that connects the two. The path in (the dashed path depicted below) consists of a sequence of sample points from balls centered on the path in (the solid path depicted below). The intuition is that is a high density route near which we can find enough sample points to connect and .

The balls centered on must be chosen sufficiently small and consecutively close so that consecutive terms are adjacent in . In Chaudhuri & Dasgupta (2010), points are adjacent (at any particular level) whenever they are less than some scale apart; one can therefore choose balls of the same radius and consecutively close. In our particular case, no single scale determines adjacency. Adjacency is determined by the various nearest-neighbor radii and this creates a multiscale effect that complicates the analysis. One way to handle (and effectively get rid of) this multiscale effect is to choose balls on of the same radius corresponding to the smallest possible nearest-neighbor radius in (restricted to ). However, in order to get samples in such small balls one would need rather large sample size , so the idea results in weak bounds. We instead use an inductive argument which keeps track of the various scales, the intuition being that nearest-neighbor-radii have to change slowly along the path from to .

###### Lemma 6 (Connectedness).

Suppose satisfies (A.1) and (A.2). Let be the -NN or mutual -NN graph. Define and let . There exist and such that, for

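$$C \left(\max\{1, \sqrt{2}/\theta\}\right)^{d} d \ln(n/\delta) \le k \le C' \left(F \sqrt{\ln(n/\delta)}\right)^{2(\alpha+d)/(3\alpha+d)} n^{2\alpha/(3\alpha+d)},$$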

the following holds with probability at least simultaneously for all connected subsets of .

Let . All points in belong to the same CC of , therefore of .

###### Proof.

First, let and be large enough for lemmas 1 and 2 to hold. Define . By Lemma 2 (a), we have that for any . Applying Lemma 1, it follows that with probability at least (uniform over choices of ), all points of are in . We will show that is connected in possibly through points in .

In particular, any are connected through a sequence built according to the following procedure. Let be a path in between and . Define .

Starting at (), set if , and we’re done, otherwise:
Let be the point in farthest along the path from , i.e. is highest in the set. Define the half-ball

$$H(y_i) \doteq \left\{ z : \|z - y_i\| < \tau 2^{-18/d}\, r_{k,n}(x_i),\ (z - y_i) \cdot (x_i - y_i) \ge 0 \right\}.$$

Pick in , and continue.

The rest of the argument will proceed inductively as follows. First, assume that and that exists. This is necessarily the case for . Assume . We will show that exists, is also in , and is adjacent to in . It will follow that must exist (if the process does not end) and is distinct from . We’ll then argue that the process must also end.

To see that exists (under the aforementioned assumptions), we apply Lemma 4 for the class of all possible half-balls centered at (for this class ). We have with probability at least that for all , whenever

$$\mathcal{F}(H(y)) \ge \frac{C_0\, d \ln(n/\delta)}{n} > \frac{(8d+4) \log(2n) + 4 \log(4/\delta)}{n},$$

where is appropriately chosen to satisfy the last inequality. We next show satisfies the first inequality.

We first apply Lemma 2 on (this inclusion was established earlier). We have with probability at least (uniform over all ) that for , . Thus, for all ,

$$\|z - x_i\| \le 2 \cdot \tau 2^{-9/d}\, r_{k,n}(x_i) \le 2 \cdot \tau 2^{-9/d}\, r \le 2r, \tag{2}$$

implying by the same Lemma 2 that . Now, from Lemma 1, . We can thus write

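$$\begin{aligned}
\mathcal{F}(H(y_i)) &\ge \frac{1}{4}\, \mathrm{vol}\left(B(y_i, \tau 2^{-18/d}\, r_{k,n}(x_i))\right) f(x_i) \\
&= \tau^d 2^{-20}\, \mathrm{vol}\left(B(x_i, r_{k,n}(x_i))\right) f(x_i) \\
&\ge \tau^d 2^{-21}\, \mathrm{vol}\left(B(x_i, r_{k,n}(x_i))\right) f_n(x_i) \\
&= \tau^d 2^{-21}\, \frac{k}{n} \ge \frac{C_0\, d \ln(n/\delta)}{n}, \quad \text{for } C \ge 2^{21} C_0.
\end{aligned}$$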

Therefore there is a point in . In addition since it is within of .

Next we establish that there is an edge between and in . To this end we relate to by first relating to . Remember that for we have so that for any we have . Also recall that we always have (see (2)), implying . We then have

$$v_d\, r_k^d(x_i) \cdot \frac{1}{2} f(x_i) \le \frac{k}{n} \le v_d\, r_k^d(x_{i+1}) \cdot 2 f(x_{i+1}) \le v_d\, r_k^d(x_{i+1}) \cdot 4 f(x_i),$$

where for the first two inequalities we used the fact that both balls and have the same mass . It follows that

$$r_{k,n}(x_{i+1}) \ge 2^{-3/d}\, r_k(x_{i+1}) \ge 2^{-6/d}\, r_k(x_i) \ge 2^{-9/d}\, r_{k,n}(x_i), \tag{3}$$

implying . We then get

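$$\begin{aligned}
\|x_i - x_{i+1}\|^2 &= \|x_i - y_i\|^2 + \|x_{i+1} - y_i\|^2 - 2\,(x_i - y_i) \cdot (x_{i+1} - y_i) \\
&\le \|x_i - y_i\|^2 + \|x_{i+1} - y_i\|^2 \\
&\le 2\tau^2 \cdot \min\left\{ r_{k,n}^2(x_i),\ r_{k,n}^2(x_{i+1}) \right\} \\
&\le \theta^2 \min\left\{ r_{k,n}^2(x_i),\ r_{k,n}^2(x_{i+1}) \right\},
\end{aligned}$$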

meaning and are adjacent in .

Finally we argue that must exist. By (3) above we have

$$\|x_{i+1} - y_i\| < \tau 2^{-18/d}\, r_{k,n}(x_i) \le \tau 2^{-9/d}\, r_{k,n}(x_{i+1}),$$

in other words the ball contains in its interior. It follows by continuity of that there is a point in this ball further along the path from than . Thus, recursively all ’s must be distinct, implying that all ’s must be distinct. Since all ’s belong to the finite sample the process must eventually terminate. ∎

#### Pruning of Spurious Branches

As a corollary to Lemma 6 we can guarantee in Lemma 7 that the pruning procedure will remove all spurious branchings, and hence, all spurious clusters.

###### Lemma 7 (Pruning).

Let . Under the assumptions of Lemma 6, the following holds with probability at least , provided .

Consider two disjoint CCs and at the same level in . Let be the union of vertices of and , and define