Local angles and dimension estimation from data on manifolds

Mateo Díaz, Center for Applied Mathematics, 657 Frank H.T. Rhodes Hall, Cornell University, Ithaca, NY 14853. md825@cornell.edu

Adolfo J. Quiroz, Departamento de Matemáticas, Universidad de los Andes, Carrera 1 No. 18a 10, Edificio H, Primer Piso, 111711 Bogotá, Colombia. aj.quiroz1079@uniandes.edu.co

Mauricio Velasco, Departamento de Matemáticas, Universidad de los Andes, Carrera 1 No. 18a 10, Edificio H, Primer Piso, 111711 Bogotá, Colombia. mvelasco@uniandes.edu.co
Abstract.

For data living in a manifold $M \subseteq \mathbb{R}^D$ and a point $p \in M$, we consider a statistic $U_n$ which estimates the variance of the angle between pairs of vectors $X_i - p$ and $X_j - p$, for data points $X_i$, $X_j$ near $p$, and evaluate this statistic as a tool for estimation of the intrinsic dimension of $M$ at $p$. Consistency of the local dimension estimator is established and the asymptotic distribution of $U_n$ is found under minimal regularity assumptions. Performance of the proposed methodology is compared against state-of-the-art methods on simulated data.

Key words and phrases:
dimension estimation, local $U$-statistics, angle variance, manifold learning
2010 Mathematics Subject Classification:
62G05, 62H10, 62H30

1. Introduction

Understanding complex data sets often involves dimensionality reduction. This is particularly necessary in the analysis of images, and when dealing with genetic or text data. Such data sets are usually presented as collections of vectors in $\mathbb{R}^D$, and it often happens that there are non-linear dependencies among the components of these data vectors. In more geometric terms, these non-linear dependencies amount to saying that the vectors lie on a submanifold $M \subseteq \mathbb{R}^D$ whose dimension $d$ is typically much smaller than $D$. The expression manifold learning has been coined in the literature for the process of finding properties of $M$ from the data points.

Several authors in the artificial intelligence literature have argued for the convenience of having methods to find or approximate these low-dimensional manifolds [3, 13, 27, 29, 31, 33]. Procedures for achieving this kind of low dimensional representation are called manifold projection methods. Two fairly successful such methods are the Isomap of Tenenbaum, de Silva and Langford [33] and the Locally Linear Embedding method of Roweis and Saul [27]. For these and other manifold projection procedures, a key initial ingredient is a precise estimate of the integer $d$, ideally obtained at low computational cost.

The problem of estimating $d$ has been the focus of much work in statistics, starting from the pioneering work of Grassberger-Procaccia [14]. Most of the recent dimension identification procedures appearing in the literature are either related to graph theoretic ideas [6, 7, 35, 23, 24] or to nearest neighbor distances [25, 20, 10, 21]. A key contribution of the latter group is the work of Levina and Bickel [20], who propose a “maximum likelihood” estimator of intrinsic dimension. To describe it, let $T_j(x_i)$ be the distance from the sample point $x_i$ to its $j$-th nearest neighbor in the sample (with respect to the euclidean distance in the ambient space $\mathbb{R}^D$). Levina and Bickel show that, asymptotically, the expected value of the statistic

(1)  $\hat{m}_k(x_i) = \left(\dfrac{1}{k-1}\displaystyle\sum_{j=1}^{k-1} \log \dfrac{T_k(x_i)}{T_j(x_i)}\right)^{-1}$

coincides with the intrinsic dimension of the data. As a result, they propose the corresponding sample average as an estimator of dimension. Asymptotic properties of this statistic have been obtained in the literature (see [24, Theorem 2.1]) allowing for the construction of confidence intervals. Both the asymptotic expected value and the asymptotic distribution are independent of the underlying density from which the sample points are drawn and thus lead to a truly non-parametric estimation of dimension.
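As a concrete point of reference for later comparisons, here is a short sketch of how the estimator in (1) can be computed with numpy. The code is ours and only illustrative: the function names and the default choice of k are assumptions, not part of [20].

```python
import numpy as np

def levina_bickel_local(X, x, k=10):
    """Levina-Bickel local dimension estimate at a query point x.

    X : (n, D) array of sample points; x : (D,) query point;
    k : number of nearest neighbors used (illustrative default).
    """
    dists = np.sort(np.linalg.norm(X - x, axis=1))
    dists = dists[dists > 0]          # drop x itself if it belongs to X
    T = dists[:k]                     # T_1(x) <= ... <= T_k(x)
    # (1): inverse of the average log-ratio of the k-th to the j-th distance
    return 1.0 / np.mean(np.log(T[k - 1] / T[: k - 1]))

def levina_bickel(X, k=10):
    """Sample average of the local estimates, as proposed in [20]."""
    return np.mean([levina_bickel_local(X, X[i], k) for i in range(len(X))])
```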

In addition to distances, Ceruti et al. propose in [9] that angles should be incorporated into dimension estimators. This proposal, named DANCo, combines the idea of norm concentration of nearest neighbors with the idea of angle concentration for pairs of points on the unit sphere $S^{d-1}$.

The resulting dimension identification procedure is relatively involved. The method combines two ideas. On one hand, it uses the Kullback-Leibler divergence to measure the distance between the estimated probability density function (pdf) of the normalized nearest neighbor distance for the data considered and the corresponding pdf of the distance from the center of a $d$-dimensional unit ball to its nearest neighbor under uniform sampling. On the other hand, it uses a concentration result due to Södergren [30] for angles corresponding to independent pairs of points on a sphere.

The main contribution of this article is a new and simple dimension identification procedure based solely on angle concentration. We define a $U$-statistic which averages squared angle deviations over all pairs of vectors in a nearest neighbor ball of a fixed point, and we determine its asymptotic distribution. In the basic version of our proposed method there is no need to calibrate distributions; moreover, our statistic is a $U$-statistic among dependent pairs of data points, and it is well known that these offer fast convergence to their mean and asymptotic distribution.

Our method has been called ANOVA in the literature (the term was coined by Breiding, Kalisnik, Sturmfels and Weinstein in [5] when describing an earlier preliminary version of this article), given that the $U$-statistic used, to be defined below, is an estimator of the variance of the angle between pairs of vectors among uniformly chosen points in the sphere $S^{d-1}$. Our main results are the consistency of the proposed method of estimation (Proposition 3.8) and the description of the (suitably normalized) asymptotic distribution of the statistic considered (Theorem 3.6), a result that is very useful in the construction of asymptotic confidence intervals in dimension estimation. We describe our proposed method in Section 2 and provide its theoretical justification in Section 3. Sections 4 and 5 discuss the details of our implementation of the dimension identification procedure together with some empirical improvements; they also contain the results of performance evaluations on simulated examples, including comparisons with current state-of-the-art methods.

2. A $U$-statistic for dimension identification

2.1. Description of the statistic

Suppose our data form an i.i.d. sample $X_1, \ldots, X_n$ from a distribution on $\mathbb{R}^D$ with support on a Riemannian manifold $M$ of dimension $d$. Given a point $p \in M$, the question to be addressed is to determine the dimension $d$ of the tangent space of $M$ at $p$ using only information from sample points near $p$ (we want to allow for the value of $d$ to depend on the point $p$ and for $M$ to be disconnected).

The simplest version of our dimension identification procedure is described by the following steps:

  1. For an appropriate value of the constant $\kappa$, to be specified below, let $k_n = \lceil \kappa \ln n \rceil$. Assume, relabeling the sample if necessary, that $X_1, \ldots, X_{k_n}$ are the $k_n$ nearest neighbors of $p$ in the sample, according to the euclidean distance in $\mathbb{R}^D$.

  2. Define the angle-variance $U$-statistic, $U_n$, by the formula

    (2)  $U_n = \dbinom{k_n}{2}^{-1} \displaystyle\sum_{1 \le i < j \le k_n} \left( \arccos\!\left( \dfrac{\langle X_i - p,\, X_j - p \rangle}{\|X_i - p\|\,\|X_j - p\|} \right) - \dfrac{\pi}{2} \right)^2,$

    where $\langle \cdot, \cdot \rangle$ denotes the dot product on $\mathbb{R}^D$.

  3. Estimate the unknown dimension as $\hat{d}_n$, equal to the integer $m$ such that $\beta_m$ is closest to $U_n$, for a sufficiently large sample size $n$, where $\beta_m$ is the quantity defined by

    (3)  $\beta_m = \operatorname{Var}(\theta_m) = \mathbb{E}\!\left[\left(\theta_m - \dfrac{\pi}{2}\right)^2\right],$

    the variance of the angle $\theta_m$ between two independent random vectors uniformly distributed on the unit sphere $S^{m-1} \subset \mathbb{R}^m$.

The key idea of our estimator goes as follows: for large $n$ and the chosen value of $k_n$, the nearest neighbors of $p$ in the data set behave as uniform data on a small ball around $p$ in the embedded tangent space of $M$ at this point, and the corresponding unit vectors, $(X_i - p)/\|X_i - p\|$, are nearly uniform on the unit sphere of the tangent space, $S^{d-1}$. For uniform data on $S^{d-1}$, the expected angle between two random vectors is always $\pi/2$ (regardless of $d$), but the variance of this angle decreases rapidly with $d$. Formula (3) gives the value $\beta_d$ of this variance for every dimension $d$. Since our results below show that the $U$-statistic $U_n$ will converge in probability to $\beta_d$ for $d$ the actual dimension of $M$ at $p$, estimating $d$ by choosing the integer $m$ such that $\beta_m$ is closest to $U_n$ will be consistent. An additional fact that helps in this convergence is that the variance of $U_n$, which depends on the fourth moment of the angles, also converges rapidly to zero.
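To make steps 1-3 concrete, here is a small numerical sketch (ours, not the implementation evaluated in Sections 4 and 5). The neighborhood size k, the candidate dimension range, and the Monte Carlo approximation of $\beta_m$ are illustrative assumptions; the code computes the statistic in (2) over the k nearest neighbors of a query point p and returns the candidate dimension whose $\beta_m$ is closest.

```python
import numpy as np

def beta(m, n_mc=200_000, rng=np.random.default_rng(0)):
    """Monte Carlo approximation of beta_m in (3): the variance of the angle
    between two independent uniform vectors on the unit sphere S^(m-1)."""
    u = rng.standard_normal((n_mc, m))
    v = rng.standard_normal((n_mc, m))
    cos = np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    return np.mean((theta - np.pi / 2) ** 2)

def angle_variance_statistic(X, p, k):
    """U_n in (2): average of (angle(X_i - p, X_j - p) - pi/2)^2 over pairs
    taken from the k nearest neighbors of p."""
    V = X - p
    V = V[np.linalg.norm(V, axis=1) > 0]               # drop p itself if present
    V = V[np.argsort(np.linalg.norm(V, axis=1))[:k]]   # k nearest neighbors
    V = V / np.linalg.norm(V, axis=1, keepdims=True)   # unit vectors
    cos = np.clip(V @ V.T, -1.0, 1.0)                  # pairwise cosines
    i, j = np.triu_indices(len(V), 1)                  # pairs with i < j
    return np.mean((np.arccos(cos[i, j]) - np.pi / 2) ** 2)

def estimate_dimension(X, p, k, max_dim=20):
    """Basic estimator: the integer m whose beta_m is closest to the statistic."""
    u_n = angle_variance_statistic(X, p, k)
    dims = np.arange(1, max_dim + 1)
    return int(dims[np.argmin([abs(u_n - beta(m)) for m in dims])])
```

In practice one would precompute and cache the reference values $\beta_m$ (or use a closed form, as in Lemma 2.1 below) rather than re-estimating them for every query point.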

The following subsection establishes useful facts about angles between random points on the unit sphere of $\mathbb{R}^d$ and, in particular, about moments of the function

(4)  $h(u, v) = \left( \arccos\left(\langle u, v \rangle\right) - \dfrac{\pi}{2} \right)^2,$

when computed on data uniformly distributed on $S^{d-1}$. Section 3, building on Subsection 2.2, develops the theoretical results that serve as basis for the use of $U_n$ on manifolds.

2.2. Angle-variance statistics for pairs of uniform points on $S^{d-1}$

Lemma 2.1 (Angles between uniform vectors).

Let $U$ and $V$ be two independent vectors with the uniform distribution on the unit sphere $S^{d-1}$ and let $\theta$ be the angle between them. The following statements hold:

  1. The distribution of $\theta$ is given by the density $f_\theta(t) = \dfrac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma\!\left(\frac{d-1}{2}\right)}\,\sin^{d-2}(t)$, for $t \in [0, \pi]$.

  2. The moment generating function of the angle, denoted below by $M$, admits an explicit closed form, with different expressions according to whether $d$ is even or odd.

  3. In particular, $\mathbb{E}(\theta) = \pi/2$ for all $d$, and $\operatorname{Var}(\theta) = \beta_d$, where $\beta_d$ is the quantity appearing in (3).

  4. The variance of the centered squared angle $(\theta - \pi/2)^2$ also admits an explicit closed form.

Proof.
  1. By rotational invariance we may fix one of the vectors and measure $\theta$ as the polar angle of the other, passing to polar coordinates on $S^{d-1}$. The probability that $\theta \le t$ is precisely the fraction of the surface area of the sphere defined by the inequality that the polar angle is at most $t$. Since the surface element of the sphere is proportional to $\sin^{d-2}(\phi)\,d\phi$ (times the surface element of the equatorial sphere $S^{d-2}$), the probability is given by

    $\Pr(\theta \le t) \;=\; \dfrac{\int_0^t \sin^{d-2}(\phi)\, d\phi}{\int_0^{\pi} \sin^{d-2}(\phi)\, d\phi} \;=\; \dfrac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma\!\left(\frac{d-1}{2}\right)} \int_0^t \sin^{d-2}(\phi)\, d\phi,$

    as claimed.

  2. We begin with a claim

    Claim 2.2.

    Let be a -function and define

    where Then, the following recursion formula holds

    Proof.

    Using integration by parts one can show a recursive formula for and conclude that

    Applying integration by parts twice and using our formula for gives the result. ∎

    In particular if we take we get

    As a result we obtain the stated closed formula for the moment generating function.

  3. All densities are, like the sine function, symmetric around $\pi/2$, and the first statement follows. To ease the computations we introduce the cumulant generating function $K(s) = \ln M(s)$. Then it is immediate that the first two derivatives of $K$ at zero give the mean and the variance of the angle. We consider two cases, $d$ even and $d$ odd; first let us assume that $d$ is even. Then, we write the cumulant generating function as

    After some routine algebra we obtain the stated values of the mean and the variance, which gives the result for the even case. The odd case follows from an analogous argument.

  4. Let $\mu_k$ be the $k$-th moment of the random variable $\theta - \pi/2$, i.e. $\mu_k = \mathbb{E}\!\left[(\theta - \pi/2)^k\right]$. Since $\operatorname{Var}(Y) = \mathbb{E}(Y^2) - (\mathbb{E}\,Y)^2$ for any square-integrable random variable $Y$, applying this to $Y = (\theta - \pi/2)^2$ gives

    (5)  $\operatorname{Var}\!\left((\theta - \pi/2)^2\right) = \mu_4 - \mu_2^2.$

    Again, consider two cases: $d$ even and $d$ odd. Suppose $d$ is even; just as before, we calculate $\mu_2$ and $\mu_4$ and substitute both into (5), which yields the claim. A similar argument can be applied to the odd case.

At first glance the formulas for $\beta_d$ and for the variance of $(\theta - \pi/2)^2$ might seem a little complicated. In order to derive our results we need explicit decay rates in terms of the dimension. The following claim gives us an easy way to interpret these quantities.

Claim 2.3.

The following bounds hold for $\beta_d$ and for the variance of $(\theta - \pi/2)^2$:

  • (6)
  • (7)

    moreover the upper bound for holds for all

Proof.

We distinguish two cases according to whether $d$ is even or odd. If $d$ is even, we can define by the equality and compute

Since this series consists of monotonically decreasing terms and we conclude that

as claimed. On the other hand, notice that the other term concerning the variance can be written as

which again can be bounded by

Then, we get

where the first inequality follows as above. The case in which $d$ is odd is proven similarly. ∎
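The quantities in Lemma 2.1 and Claim 2.3 are easy to probe numerically. The following sketch (ours; the sample sizes and the dimensions tested are arbitrary illustrations) draws pairs of uniform points on the sphere and checks that the mean angle stays at $\pi/2$ while its variance shrinks, roughly like $1/d$, as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def angle_moments(d, n_pairs=100_000):
    """Monte Carlo estimates of E[theta] and Var[theta] for the angle between
    two independent uniform vectors on the unit sphere S^(d-1)."""
    u = rng.standard_normal((n_pairs, d))
    v = rng.standard_normal((n_pairs, d))
    cos = np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    return theta.mean(), theta.var()

for d in (2, 3, 5, 10, 20, 50):
    mean, var = angle_moments(d)
    # The mean stays near pi/2 for every d, while d * var stays of order one,
    # in line with the rapid decrease of the variance discussed above.
    print(f"d = {d:3d}   mean = {mean:.4f}   var = {var:.5f}   d * var = {d * var:.3f}")
```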

3. Theoretical foundations

3.1. Statement of results

In this subsection we state the theoretical results that serve as basis for the proposed methodology. Proofs are given in the following subsection. The setting is the following: an i.i.d. sample $X_1, \ldots, X_n$ is available from a distribution on $\mathbb{R}^D$. Additionally, we have access to a distinguished point $p$, and near this point the data live on a Riemannian manifold $M$ of dimension $d$. Furthermore, at $p$ the distribution has a Lipschitz continuous non-vanishing density function, with respect to the volume measure on $M$. Without loss of generality, we assume that $p = 0$. Then, we have

Proposition 3.1 (Behavior of nearest neighbors).

For a positive constant $\kappa$, define $k_n = \lceil \kappa \ln n \rceil$ and let $r_n$ be the euclidean distance in $\mathbb{R}^D$ from $0$ to its $(k_n+1)$-st nearest neighbor in the sample $X_1, \ldots, X_n$. Define $B(0, r)$ to be the open ball of radius $r$ around $0$ in $\mathbb{R}^D$. Then, the following holds true:

  1. For any sufficiently large $\kappa$, we have that, with probability one, for $n$ large enough ($n \ge n_0$, for some $n_0$ depending on the actual sample), $r_n \le \delta_n$, where $\delta_n$ is a deterministic sequence, of order $(\ln n / n)^{1/d}$, that only depends on the distribution near $0$ and on $n$.

  2. Conditionally on the value of $r_n$, the $k_n$ nearest neighbors of $0$ in the sample have the same distribution as an independent sample of size $k_n$ from the distribution with density equal to the normalized restriction of the sampling density to $B(0, r_n)$.

In what follows, with a slight abuse of notation, we will write $X_1, \ldots, X_{k_n}$ to denote the nearest neighbors of $0$ in the sample and assume that these follow the distribution with the density of Proposition 3.1. Let $\pi$ be the orthogonal projection onto the (embedded) tangent space to $M$ at $0$. For a nonzero $v$, let $\hat{v} = v/\|v\|$, and set $Y_i = \widehat{\pi(X_i)}$. Each $Y_i$ takes values in the $(d-1)$-dimensional unit sphere of the tangent space of $M$ at $0$.

Our first Lemma bounds the difference between the inner products of the normalized data vectors and those of their normalized projections, in terms of the lengths of the projections. In this Lemma, the random nature of the $X_i$ is irrelevant.

Lemma 3.2 (Basic projection distance bounds).

For any :

  1. The cosine of the angle between $X_i$ and $X_j$ is close to that between $\pi(X_i)$ and $\pi(X_j)$. More precisely,

    for some , whenever for .

Using Lemma 3.2, we can establish the following approximation. Let $X_1, \ldots, X_{k_n}$ be the $k_n$-nearest-neighbors from the sample to $0$ in $M$. Define $\pi$ and $Y_i = \widehat{\pi(X_i)}$ as above and let $\tilde{U}_n$ be given by the formula

$\tilde{U}_n = \dbinom{k_n}{2}^{-1} \displaystyle\sum_{1 \le i < j \le k_n} \left( \arccos\left( \langle Y_i, Y_j \rangle \right) - \dfrac{\pi}{2} \right)^2.$

Proposition 3.3 (Approximating the statistic via its tangent analogue).

For $U_n$ and $\tilde{U}_n$ as above, we have

  1. The sequence $U_n - \tilde{U}_n$ converges to $0$ in probability as $n \to \infty$.

  2. $\mathbb{E}\,|U_n - \tilde{U}_n| \to 0$ as $n \to \infty$.

When $X$ comes from the distribution producing the sample, but is restricted to fall very close to 0, the distribution of $\pi(X)$ will be nearly uniform in a ball centered at 0 in the tangent space. This will allow us to establish a coupling between the normalized projection and a variable uniformly distributed on the unit sphere of the tangent space, an approximation that leads to the asymptotic distribution of $U_n$. Some geometric notation must be introduced to describe these results. Since, near 0, $M$ is a Riemannian submanifold of dimension $d$, it inherits, from the euclidean inner product in $\mathbb{R}^D$, a smoothly varying inner product on its tangent spaces, obtained by pulling back the euclidean metric along the inclusion of $M$ into $\mathbb{R}^D$. This metric determines a differential $d$-form which, in local coordinates for $M$, is the Riemannian volume form. This differential form endows $M$ with a volume measure. We say that a random variable $X$ on $M$ has density $f$ if the distribution of $X$ satisfies $\Pr(X \in A) = \int_A f \, d\mathrm{vol}_M$ for all Borel sets $A$ in $M$.

If $X$ is a random variable taking values on $M$ with density $f$ and $\varepsilon$ is a positive real number, let $X_\varepsilon$ be a random variable with distribution given by the normalized restriction of the distribution of $X$ to $B(0, \varepsilon)$, that is:

$\Pr(X_\varepsilon \in A) = \dfrac{\Pr\left(X \in A \cap B(0, \varepsilon)\right)}{\Pr\left(X \in B(0, \varepsilon)\right)}.$

Define $Z_\varepsilon = \pi(X_\varepsilon)$. The following geometric Lemma will be used for relating the densities of $X_\varepsilon$ and $Z_\varepsilon$.

Lemma 3.4 (Tangent space approximations).

The following statements hold for all sufficiently small and in .

  1. The map is a diffeomorphism. Let be its inverse.

  2. The inclusion holds and moreover where denotes the Lebesgue measure on .

  3. The following equality holds:

Let $W$ be a random variable uniformly distributed in a ball of radius $\varepsilon$ around 0 in the tangent space, and note that $\widehat{W} = W/\|W\|$ is uniformly distributed on the unit sphere of the tangent space, regardless of the value of $\varepsilon$. Our next Lemma shows that, under weak hypotheses, there is a coupling between $\widehat{\pi(X_\varepsilon)}$ and $\widehat{W}$ which concentrates on the diagonal as $\varepsilon$ decreases.

Lemma 3.5 (Coupling).

Let $\varepsilon$ denote a small positive number. With $X_\varepsilon$ and $\widehat{W}$ as above, if the density of the sample distribution on $M$, near 0, is locally Lipschitz continuous and nonvanishing at 0, then the following hold:

  1. There exists a coupling and a constant such that for all sufficiently small .

  2. There exists a coupling such that for all sufficiently small .

The previous Lemma leads to the asymptotic distribution of the statistic .

Theorem 3.6 (Local Limit Theorem for angle-variance).

Let $k_n = \lceil \kappa \ln n \rceil$ for $\kappa$ the constant of the proof of Proposition 3.1 and assume $X_1, \ldots, X_{k_n}$ are the $k_n$ nearest neighbors to $0$ in the sample, with respect to the euclidean distance in $\mathbb{R}^D$. If the density of the sample distribution near $0$ is, as assumed above, Lipschitz continuous and non-vanishing, then the following statements hold:

  1. The equality holds and

  2. The quantity $k_n\,(U_n - \beta_d)$ converges, in distribution, to that of $\sum_{i=1}^{\infty} \lambda_i\,(\chi_i^2 - 1)$, where the $\chi_i^2$ are i.i.d. chi-squared random variables with one degree of freedom and the $\lambda_i$ are the eigenvalues of the operator on $L^2(S^{d-1})$ defined by

    $(Ag)(u) = \displaystyle\int_{S^{d-1}} \left( h(u, v) - \beta_d \right) g(v)\, d\sigma(v)$

    for $g \in L^2(S^{d-1})$, where $h$ is the function in (4) and $\sigma$ denotes the uniform measure on $S^{d-1}$.

This limit theorem is obtained by the various approximation steps given in the preliminary results, together with the classical Central Limit Theorem for degenerate $U$-statistics, as described in Chapter 5 of [28]. Depending on the relative values of the $\lambda_i$'s appearing in the statement of the Theorem, it could happen that the limiting distribution just obtained approaches a Gaussian distribution as the dimension increases (this would happen if the $\lambda_i$ were such that Lindeberg's condition holds).

Although a theoretical study of the $\lambda_i$'s is left for future work, we conjecture that as $d$ increases the limiting distribution converges to a Gaussian distribution. Numerical experiments seem to support our conjecture; see Figure 1.

Figure 1. QQ-plots (panels (a), (b), (c)). As established in the proof of Theorem 3.6, the limiting distribution is in fact the asymptotic distribution of the normalized statistic defined in (14). For this figure, we generate samples of that variable in three different dimensions (one per panel) and compare the quantiles of the sample distribution against the standard normal quantiles.
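The conjecture can also be probed by direct simulation. The sketch below is ours and is not the experiment behind Figure 1: it standardizes the simulated statistic empirically instead of using the exact normalization of Theorem 3.6, and the dimensions, neighborhood size, and number of repetitions are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def angle_variance_on_sphere(d, k):
    """The statistic of (2) computed on k i.i.d. uniform points of S^(d-1)."""
    v = rng.standard_normal((k, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = np.clip(v @ v.T, -1.0, 1.0)
    i, j = np.triu_indices(k, 1)
    return np.mean((np.arccos(cos[i, j]) - np.pi / 2) ** 2)

def standardized_sample(d, k=50, reps=2000):
    """Empirically standardized replicates of the statistic, for a QQ comparison."""
    u = np.array([angle_variance_on_sphere(d, k) for _ in range(reps)])
    return np.sort((u - u.mean()) / u.std())

for d in (3, 10, 40):                        # illustrative dimensions
    z = standardized_sample(d)
    probs = (np.arange(1, len(z) + 1) - 0.5) / len(z)
    q_norm = stats.norm.ppf(probs)           # reference standard normal quantiles
    # If the conjecture holds, larger d should bring the empirical quantiles
    # closer to the normal ones.
    print(f"d = {d:2d}   max |q_emp - q_norm| = {np.max(np.abs(z - q_norm)):.3f}")
```

Plotting z against q_norm reproduces a QQ-plot in the spirit of Figure 1.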

In order to get a consistency result for our basic local dimension estimator, we will use the following fact.

Corollary 3.7 (Variance convergence).

Under the conditions stated at the beginning of this subsection,

for $d$ equal to the dimension of $M$ near 0.

Recall from Section 2 that our basic procedure estimates the dimension as $\hat{d}_n$, equal to the integer $m$ such that $\beta_m$ is closest to $U_n$. This procedure is consistent, as stated next.

Proposition 3.8 (Consistency of basic dimension estimator).

As in Section 2, write $\hat{d}_n$ for the basic estimator described above. Let $d$ be the true dimension of $M$ in a neighborhood of $0$. Then, in the setting of the present section, $\Pr(\hat{d}_n = d) \to 1$ as $n \to \infty$.
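Proposition 3.8 can be observed at work on synthetic data. The following sketch is ours: it assumes the estimate_dimension function from the sketch in Section 2 is in scope, and the embedded manifold, sample sizes, and neighborhood growth rule are all illustrative choices rather than prescriptions from the theory.

```python
import numpy as np

rng = np.random.default_rng(3)

def sphere_sample(n, d, D):
    """n points uniform on a d-dimensional unit sphere S^d, isometrically
    embedded in the first d+1 coordinates of R^D (intrinsic dimension d)."""
    s = rng.standard_normal((n, d + 1))
    s /= np.linalg.norm(s, axis=1, keepdims=True)
    X = np.zeros((n, D))
    X[:, : d + 1] = s
    return X

d_true, D = 5, 20
p = sphere_sample(1, d_true, D)[0]             # a fixed point on the manifold
for n in (500, 2000, 8000):
    X = np.vstack([p, sphere_sample(n, d_true, D)])
    k = max(10, int(5 * np.log(n)))            # neighborhood growing like log n
    # As n grows, the local estimate at p should concentrate on d_true = 5.
    print(n, k, estimate_dimension(X, p, k))
```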

3.2. Proofs


Proof of Proposition 3.1.
Let us recall a probability bound for the Binomial distribution. For an integer-valued random variable $N$ with Binomial$(n, q)$ distribution and expected value $\mu = nq$, one of the Chernoff-Okamoto inequalities (see Section 1 in [16]) states that, for $t \le \mu$, $\Pr(N \le t) \le \exp\!\left(-\dfrac{(\mu - t)^2}{2\mu}\right)$. Specializing $t$, we get

(8)

For fixed and small enough $\delta > 0$, let $B_\delta$ denote the ball of radius $\delta$ around $0$. For a random vector $X$ with the distribution of our sample, by our assumptions on $M$ and on the density near 0, we have that $\Pr(X \in B_\delta)$ is bounded below by a positive constant multiple of $\omega_d\, \delta^d$, where $\omega_d$ is the volume (Lebesgue measure) of the unit ball in $\mathbb{R}^d$. Let $N_\delta$ denote the number of sample points that fall in $B_\delta$. We have that $N_\delta$ follows a Binomial$(n, \Pr(X \in B_\delta))$ distribution. We choose $\delta_n$ such that the expected count $n \Pr(X \in B_{\delta_n})$ is a constant multiple of $\ln n$, for a constant to be specified in a moment. Then, by (8), we get

(9)

Pick any sufficiently large value of the constant. For this choice, the bound in (9) will add to a finite value when summed over $n$. By the Borel-Cantelli Lemma, the corresponding inequality will hold for all $n$ sufficiently large. It follows that the $k_n$ nearest neighbors of 0 in the sample will fall in $B_{\delta_n}$ for every sufficiently large $n$ and the chosen value of $\delta_n$, which is of order $(\ln n / n)^{1/d}$. The proof of the first part of the Proposition ends by renaming the constants involved.
The statement of the second part of Proposition 3.1 is intuitive and has been used in the literature without proof. Luckily, Kaufmann and Reiss [18] provide a formal proof of this type of result in a very general setting. In particular, (ii) of Proposition 3.1 holds by formula (6) of [18].

Proof of Lemma 3.2
By an orthogonal change of coordinates, we can assume that the tangent space at $0$ is spanned by the first $d$ basis vectors in $\mathbb{R}^D$. The projection $\pi$, restricted to $M$, is a differentiable function whose derivative at $0$ is the identity. By the implicit function theorem we can conclude that there exists an $\varepsilon > 0$ such that this restriction is a diffeomorphism near $0$ and that $M$ admits, near $0$, a chart of the form

(10)

where , and such that for and . As a result, the euclidean distance between a point of near and the tangent space at is given, in the local coordinates , by

We will prove that there exists a constant such that, for all sufficiently small and all with the inequality holds. Applying Taylor’s Theorem at to the differentiable function we conclude, since and , that the constant and linear terms vanish from the expansion. This proves the claim because . Assume is a constant which satisfies . Thus

(11)

On the other hand,

(12)

Altogether,

where the first inequality is just the triangle inequality and the second one follows from (11) and (12). The third item in the Lemma follows immediately from the triangle inequality and the second item by adding and subtracting .

Remark 3.9.

The quadratic term of the expansion is the second fundamental form of $M$ and, therefore, the constant can be chosen to be the largest sectional curvature of $M$ at $0$.

Before proving Proposition 3.3, we need a Lemma on the behavior of the $\arccos$ function.

Lemma 3.10.

Suppose that and let be sufficiently small (for our purposes it suffices to have ). Then,

Proof.

Assume first that both and are positive. We have

Using that the integrand in the last expression is increasing in and by the change of variables , we get

since for . From this last bound, the result follows in this case by integration. The argument for the case in which both and are negative is identical, by symmetry. In the case , both and fall in a fixed interval () where the derivative of is bounded and the result follows easily. ∎

Proof of Proposition 3.3
To prove part (1), putting together Proposition 3.1 and Lemma 3.2 we have

From this, it follows easily that

and, using Lemma 3.10 we get

The bound is preserved by the application of the function $t \mapsto (t - \pi/2)^2$ (since this function is locally Lipschitz) and by taking averages over all pairs, and we get

(13)

The result follows by observing that, for the value of $k_n$ considered,

To prove part (2), notice that from part (1) it is immediate that $|U_n - \tilde{U}_n|$ converges to zero in probability, which implies that $\mathbb{E}\,|U_n - \tilde{U}_n| \to 0$, since $|U_n - \tilde{U}_n|$ is a bounded random variable.

Proof of Lemma 3.4
Recall, from the proof of Lemma 3.2, that for , small enough, the projection is a diffeomorphism and that admits, near a chart (inverse) of the form given in (10) and satisfying that and such that for and . Also from that proof, recall that there exists a constant such that for small enough and all with , we have , where is the distance between a point and its projection on . It follows that the image contains a ball of radius such that and therefore

proving part of the Lemma, since the volume of the first and last term differ by at most . For part note that . Since are the inner products are