Iterated Diffusion Maps for Feature Identification
Recently, the theory of diffusion maps was extended to a large class of local kernels with exponential decay, which were shown to represent various Riemannian geometries on a data set sampled from a manifold embedded in Euclidean space. Moreover, local kernels were used to represent a diffeomorphism between a data set and a feature of interest using an anisotropic kernel function, defined by a covariance matrix based on the local derivatives. In this paper, we generalize the theory of local kernels to represent degenerate mappings where the intrinsic dimension of the data set is higher than the intrinsic dimension of the feature space. First, we present a rigorous method with asymptotic error bounds for estimating the derivative of the feature map from the training data set and feature values. We then derive scaling laws for the singular values of the local linear structure of the data, which allow the identification of the tangent space and improved estimation of the intrinsic dimension of the manifold and the bandwidth parameter of the diffusion maps algorithm. Using these numerical tools, our approach to feature identification is to iterate the diffusion map with appropriately chosen local kernels that emphasize the features of interest. We interpret the iterated diffusion map (IDM) as a discrete approximation to an intrinsic geometric flow which smoothly changes the geometry of the data space to emphasize the feature of interest. When the data lies on a manifold which is a product of the feature manifold with an irrelevant manifold, we show that the IDM converges to the quotient manifold, which is isometric to the feature manifold, thereby eliminating the irrelevant dimensions. We will also demonstrate empirically that if we apply the IDM to features which are not a quotient of the data manifold, the algorithm identifies an intrinsically lower-dimensional embedding of the data which better represents the features.
Keywords: diffusion maps, local kernel, iterated diffusion map, dimensionality reduction, feature identification
Often, for high-dimensional data and especially for data lying on a nonlinear subspace of Euclidean space, the variables of interest do not lie in the directions of largest variance and this makes them difficult to identify. In this paper we consider the supervised learning problem, where we have a training data set, along with the values of the features of interest for this training data. Throughout this manuscript we will assume that the training data set consists of data points which are near a manifold embedded in an -dimensional Euclidean space; we refer to as the ‘data space’ or ‘data manifold’. We also assume that we have a set of feature values corresponding to each training data point, and these feature values are assumed to lie near a manifold embedded in an -dimensional Euclidean space; we refer to as the ‘feature space’ or ‘feature manifold’. We assume that the feature manifold is intrinsically lower-dimensional than the data manifold, so intuitively the data manifold contains information which is irrelevant to the feature, and we refer to this information broadly as the ‘irrelevant variables’ or the ‘irrelevant space’. In some contexts we will be able to identify the irrelevant space explicitly, for example the data manifold may simply be a product manifold of the feature manifold with an irrelevant manifold. However, more complex relationships between the data manifold, feature manifold, and irrelevant variables are possible. We will think of the feature space as arising from a function defined on the data space, and our goal is to represent this function. However, this will only be possible in some contexts such as the product manifold described above. More generally, our goal is to find a mapping from the data space to an intrinsically lower-dimensional space which contains all the information of the feature space but is, as far as possible, independent of the irrelevant variables.
Recently a method for formally representing a diffeomorphism between manifolds using discrete data sets sampled from the manifolds was introduced in LK (). In particular, a diffeomorphism is represented using a local kernel to pull back the Riemannian metric from one manifold onto the other. With respect to the intrinsic geometry of the local kernel, the manifolds are isometric, and the isometry can be represented by a linear map between the eigenfunctions of the respective Laplacian operators. In this paper, we consider the more difficult case when the manifolds are not diffeomorphic, so that one manifold may even be higher dimensional than the other. This represents the scenario described above, where some of the variables of a high-dimensional data set may be irrelevant to the features of interest.
The challenge of having irrelevant variables is that it violates the fundamental assumption of differential geometry, namely that it is local. This is because data points which differ only in the irrelevant variables will be far away in the data space and yet have the same feature values. This fundamental issue is independent of the amount of data available and is illustrated in Figure 1. Namely, if the feature of interest is the radius of an annulus, then points on opposite sides of the annulus are closely related with respect to this feature of interest. Conversely, points which are far away in the feature space may appear relatively close in data space; this can occur when many of the irrelevant variables are very close. Of course, in the limit of large data, points being close in data space implies that they are close in feature values. However, a large number of irrelevant variables can easily overwhelm any finite data set due to the curse-of-dimensionality. Intuitively, the presence of irrelevant variables makes it difficult to determine the true neighbors.
Intuitively our goal is to determine the true neighbors of every point in the data space, meaning the points in the data space which have similar feature values regardless of the values of the irrelevant variables. The key difficulty is that we need a method which can be extended to new data points, since the goal of representing the map is to be able to apply this map to new points in data space. In order to learn the map we will assume that we have a training data set for which the feature values are known. Of course, for the training data set, we could easily find the true neighbors of the training data points by using the known feature values. However, finding the neighbors using the feature values cannot be used for determining the true neighbors of a new data point, since the goal is to determine the feature values of the new data point. Instead, we propose an iterative mapping which smoothly distorts the data space in a way that contracts in the directions of the irrelevant variables and expands in the directions of the features. Crucially, this iterative mapping can be smoothly extended to new data points for which the feature values are unknown, allowing us to extend the feature map to these new data points.
In this paper, we introduce the Iterated Diffusion Map (IDM) which is an iterative method that smoothly distorts the data space in a way that contracts in the directions of the irrelevant variables and expands in the directions of the features. We illustrate our method for the purposes of intuitive explanation in Figure 1. In this example, the original data set is an annulus in the plane, but the variable of importance (represented by color in the top images) is the radial component of the annulus, meaning that the angular component is an irrelevant dimension of the manifold. The initial neighborhood is simply a Euclidean ball in the plane as shown in the bottom row of images. In the subsequent images we apply the diffusion map multiple times to evolve the data set in a way that biases it towards the desired feature. Notice that as the geometry evolves, the notion of neighborhood evolves. In the bottom right image, we see that after four iterations of the diffusion map, the notion of neighbor has grown to include any points which have the same radius, independent of their angle. Moreover, after four iterations, we see that points that were initially very close neighbors, namely points that have the same angle but slightly different radii, are no longer neighbors. So after applying the IDM, the notion of neighbor becomes very sensitive to the feature (radius) and independent of the irrelevant variable (angle).
As we will explain in Section 4.2, the example in Figure 1 has a particularly nice structure, namely the full data set is a product space of the desired feature with the irrelevant variables. When this structure is present we will be able to interpret the iterated diffusion map as a discretization of a geometric evolution which contracts the irrelevant variables to zero, thereby reconstructing the quotient map. When the product space structure is not present, the iterated diffusion map recovers a more complex structure which is not yet theoretically understood. In Section 4.3 we will give several simple examples of both product spaces and non-product spaces that illustrate the current theory and its limitations. Naturally, if one wishes to understand every feature of a data set, there is no advantage to the iterated diffusion map. However, often we can identify desirable features in training data sets, and the IDM finds an extendable map to an intrinsically lower dimensional space which better represents the desired features.
As we will see below, the construction of the IDM requires several tools. In Section 2.1, we will show that iterating the standard diffusion map of diffusion () has no effect (after the first application of the diffusion map, subsequent applications will approximate the identity map when appropriately scaled). We will see that the isotropic kernel used in the standard diffusion map yields a canonical isometric embedding of the manifold. In Section 2.2, we review how local kernels can change the geometry of the manifold and obtain an isometric embedding with respect to the new geometry. Local kernels are a broad generalization of the isotropic kernels used in diffusion () and were shown in LK () to be capable of representing any geometry on a manifold.
To construct a local kernel that emphasizes the feature directions, we will need to estimate the derivative of the feature map. In Section 3, we give a rigorous method for estimating this derivative, including asymptotic error bounds, based on a weighted local linear regression. Moreover, in Section 3.1, by applying this weighted local linear regression from the manifold to itself we derive scaling laws for the singular values of the local linear structure near a point on the manifold described by the data. The scaling laws of the singular values yield a robust method of identifying the tangent space of the manifold near a point. Finally, these scaling laws allow us to devise more robust criteria for determining the intrinsic dimension of the manifold, as well as the local bandwidth parameter of the diffusion maps algorithm.
With the tools of Sections 2 and 3, our goal is to use these estimates to construct a new geometry on the data set that emphasizes the feature of interest. In Section 4 we will see that naively forming an anisotropic kernel (following LK ()) using a rank-deficient matrix will not satisfy the assumptions that define a local kernel. So, in order to represent the feature map, we cannot directly apply the local kernels theory of LK (), which only applies to diffeomorphisms. Instead, in Section 4 we introduce the IDM as a discrete approximation of a geometric flow that contracts the irrelevant variables and expands the feature variables on the manifold. In Section 4.2 we show that when the data manifold is the product of the feature manifold with irrelevant variables, this geometric flow will recover the quotient map from the data manifold to the feature manifold. In Section 4.3 we give several numerical examples demonstrating the IDM, and we also include the IDM numerical algorithm in Appendix A. We close the paper with a short summary, highlighting the advantages and limitations.
In this section, we review recent key results that are relevant to the method developed in this paper. First, we remind the reader that, up to a scalar factor, a diffusion map is an isometric embedding of the manifold represented by a data set. Second, we briefly review the recently developed method for representing a diffeomorphism between manifolds LK (), which we will use as a building block.
2.1 The Diffusion Map as an Isometric Embedding
A natural distance that respects the nonlinear structure of the data is the geodesic distance, which can be approximated as the shortest path distance. However, the shortest path distance is very sensitive to small perturbations of a data set. A more robust metric that also respects the nonlinear structure of the data is the diffusion distance which is defined as an average over all paths. This metric can be approximated by the Euclidean distance of the data points in the embedded space constructed by the diffusion maps algorithm diffusion ().
For a Riemannian manifold with associated heat kernel , we can define the diffusion distance as,
for where is the volume form on associated to the Riemannian metric which corresponds to . We can write the heat kernel as where is the Dirac delta function and is the (negative definite) Laplace-Beltrami operator on with eigenfunctions , where . Using the Plancherel equality the diffusion distance becomes,
where the term is zero since is constant. Defining the diffusion map by,
for sufficiently large, the diffusion distance is well approximated by the Euclidean distance in the diffusion coordinates, . The key to making this idea practical is the algorithm of diffusion () which uses the data set sampled from a manifold to construct a sparse graph Laplacian that approximates the Laplace-Beltrami operator, on .
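In code, the pipeline just described (kernel construction, density normalization, eigendecomposition, embedding) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the kernel constants and the alpha = 1 density normalization follow one common convention, and the names `diffusion_map`, `epsilon`, and `n_coords` are illustrative.

```python
import numpy as np

def diffusion_map(X, epsilon, n_coords=2, t=1.0):
    """Minimal diffusion maps sketch: returns approximate eigenvalues of the
    heat semigroup and the first n_coords nontrivial diffusion coordinates."""
    # isotropic Gaussian kernel on pairwise squared distances
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / (4 * epsilon))
    # alpha = 1 density normalization removes the sampling-density bias
    q = K.sum(axis=1)
    K = K / np.outer(q, q)
    # symmetric conjugate of the Markov matrix gives a stable eigenproblem
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    vals, vecs = vals[::-1], vecs[:, ::-1]          # sort descending
    phi = vecs / np.sqrt(d)[:, None]                # right eigenvectors of the Markov matrix
    # diffusion coordinates: eigenvalues (to the power t) times eigenfunctions,
    # skipping the trivial constant eigenfunction
    coords = (vals[1:n_coords + 1] ** t) * phi[:, 1:n_coords + 1]
    return vals, coords
```

On points sampled uniformly from a circle, the two leading nontrivial coordinates recover a circle embedding, and the top eigenvalue is 1 up to numerical error.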
Of course, the dimension of the diffusion coordinates will depend on the parameter , which is intuitively a kind of coarsening parameter. For small we have,
where is the geodesic distance and is the intrinsic dimension of (see for example laplacianBook ()). The function is the first term in the heat kernel expansion. In laplacianBook () the following formula is derived for ,
where is the exponential map based at , and the expansion follows from noting that is a smooth map with first derivative equal to the identity at and second derivative orthogonal to the tangent plane. Using the expansion of , for sufficiently small, we have the following expansion of the heat kernel,
and below we will bound the error by the worst case of the intrinsic dimension, namely . Using the heat kernel expansion (1), we can expand the diffusion distance as,
Based on (2.1) we define the rescaled diffusion map by,
and the rescaled diffusion distance, , which is approximated by,
so that for the rescaled diffusion distance approximates the geodesic distance.
Notice that the rescaled diffusion distance only approximates the geodesic distance when the geodesic distance is small. In particular, the diffusion distance should not be thought of as an approximate geodesic distance, as is clearly shown in Figure 2 below. For small distances, the rescaled diffusion distance and the geodesic distance also agree very closely with the Euclidean distance in any isometric embedding, as was shown in diffusion (). In fact, the diffusion map provides a canonical embedding of the manifold up to a rotation in the following sense: For any isometric image of where is an isometry, the diffusion map embeddings of and with the same parameter will differ by at most an orthogonal linear map. This is because the eigenfunctions of the Laplacian depend only on the geometry of the manifold, which is preserved by an isometric map, and the eigenfunctions corresponding to repeated eigenvalues may differ only by an orthogonal transformation. Moreover, the fact that the rescaled diffusion map preserves small geodesic distances implies that the rescaled diffusion map is approximately an isometric embedding (this was shown previously in DMembed () which provides more detailed bounds). In particular, this implies that, for small and large, the Laplace-Beltrami operator on is very close to the Laplace-Beltrami operator on. One consequence of this fact is that if we iterate the rescaled diffusion map, the results should not change (up to a rotation).
To demonstrate these facts numerically, we generated points equally spaced on a unit circle in . We applied the diffusion maps algorithm to estimate the eigenvalues and eigenfunctions of the Laplace-Beltrami operator on the unit circle. We should emphasize that it is crucial to correctly normalize the eigenfunctions using a kernel density estimate , that is, we require,
where is the volume form on inherited from the ambient space. See BH14VB () for details on the Monte-Carlo integral above and a natural density estimate implicit to the diffusion maps construction. We then evaluated the rescaled diffusion map for and compared the resulting diffusion distances to the Euclidean distances and the geodesic distances in Figure 2. Notice that for distances less than all the distances agree as shown above. We also show the spectra of the heat kernels, , which are the weights of the various eigenfunctions in the diffusion map embedding. Notice that for large, the spectrum decays much faster, so fewer eigenfunctions are required for the diffusion distance to be well approximated by the Euclidean distance in the diffusion mapped coordinates. Finally, for , we performed an ‘iterated’ diffusion map by computing the (rescaled) diffusion map of the data set , in effect finding . We then compared the eigenfunctions from the first diffusion map with those of the iterated map. Due to the symmetry of the unit circle, the eigenfunctions corresponding to repeated eigenvalues differed by an orthogonal linear map (meaning a phase shift in this case). After removing the phase shift, the eigenfunctions are compared in Figure 2.
Since the diffusion map differs from the rescaled diffusion map by a scalar factor, the eigenfunctions from iterating the standard diffusion map will also agree. The only purpose of the rescaled diffusion map is to exactly recover the local distances in the data set, and thereby to also find the same eigenvalues (since rescaling the manifold will change the spectrum of the Laplacian). Finally, if the standard diffusion map is used, the nuisance parameter will have to be retuned in order to iterate the diffusion map, since the diffusion distances will be scaled differently than the original distances. We emphasize that the goal of this section is to show that iterating the standard diffusion map algorithm is not a useful method. However, in LK () it was shown that a generalization of diffusion maps to local kernels can be used to construct the Laplace-Beltrami operator with respect to a different metric. In the remainder of the paper we will see that when the new metric is induced by a feature map on the data set, iterating the diffusion map has a nontrivial effect which can be beneficial.
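This invariance is easy to check numerically. The sketch below (with a compact hypothetical helper `dm`) embeds uniformly spaced points on a circle, rescales the embedding back to unit radius as a stand-in for the rescaled diffusion map so that the same bandwidth remains appropriate, and applies the diffusion map again; the spectra agree, and the leading eigenfunctions differ only by an orthogonal map.

```python
import numpy as np

def dm(X, eps):
    """Compact diffusion map: eigenvalues (descending) and eigenfunctions."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / (4 * eps))
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    return vals[::-1], (vecs / np.sqrt(d)[:, None])[:, ::-1]

theta = np.linspace(0, 2 * np.pi, 300, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]
vals1, phi1 = dm(X, 0.05)

# embed with the eigenfunction pair of the repeated leading eigenvalue, then
# rescale the embedding to unit radius (a stand-in for the rescaled diffusion
# map) so the same bandwidth can be reused on the second pass
Y = phi1[:, 1:3]
Y = Y / np.linalg.norm(Y, axis=1).mean()
vals2, phi2 = dm(Y, 0.05)

# the two sets of eigenfunctions agree up to an orthogonal map within the
# repeated eigenspace (a rotation of the circle)
Q, *_ = np.linalg.lstsq(phi1[:, 1:3], phi2[:, 1:3], rcond=None)
residual = np.linalg.norm(phi1[:, 1:3] @ Q - phi2[:, 1:3])
```

Without the rescaling step, the bandwidth would have to be retuned on the second pass, as remarked above.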
2.2 Local Kernels and the Pullback Geometry
The connection between kernel functions and geometry was introduced by Belkin and Niyogi in BN () and generalized by Coifman and Lafon in diffusion (). Assuming that a data set is sampled from a density supported on a -dimensional manifold , summing a function approximates the integral where is the volume form on inherited from the ambient space . The central insight of BN (); diffusion () is that by choosing a kernel function which has exponential decay, the integral is localized to the tangent space of the manifold.
The theory of BN (); diffusion () was recently generalized in LK () to a wide class of kernels called local kernels which are assumed only to have decay that can be bounded above by an exponentially decaying function of distance. The results of LK () generalized an early result of SingerAnisotropy2008 () to a much wider class of kernels and connected these early results to their natural geometric interpretations. In this paper we will use the following prototypical example of a local kernel, since it was shown in LK () that every operator which can be obtained with a local kernel can also be obtained with a prototypical kernel. Let be a matrix valued function on the manifold such that each is a symmetric positive definite matrix. Define the prototypical kernel with covariance (and first moment of zero) by
in the sense that in the limit of large data and as we have and . Notice that the matrix valued function acts on the ambient space, whereas the tensor in the limiting operator is only defined on the tangent planes of . As shown in LK (), only the projection of onto the tangent space will influence the operator . Thus, we introduce the linear map which acts as the identity on the tangent plane as a subspace of and sends all vectors originating at which are orthogonal to to zero. The map projects the ambient space onto the tangent space so that is a matrix and we define .
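A direct way to evaluate such a kernel numerically is sketched below; the constant in the exponent is one common convention and may differ from the paper's normalization, and the function names are illustrative. With the covariance equal to the identity at every point, the kernel reduces to the isotropic Gaussian kernel of the standard diffusion map.

```python
import numpy as np

def prototypical_kernel(X, A, epsilon):
    """Prototypical local kernel with state-dependent covariance:
    K[i, j] = exp( -(x_j - x_i)^T A(x_i)^{-1} (x_j - x_i) / (4 * epsilon) ).

    X : (N, n) data points in the ambient space.
    A : (N, n, n) symmetric positive definite matrices, one per point.
    The covariance acts on ambient vectors; only its projection onto the
    tangent space affects the limiting operator.
    """
    N = X.shape[0]
    Ainv = np.linalg.inv(A)
    K = np.empty((N, N))
    for i in range(N):
        diff = X - X[i]
        # quadratic form (x_j - x_i)^T A(x_i)^{-1} (x_j - x_i) for every j
        K[i] = np.exp(-np.einsum('nj,jk,nk->n', diff, Ainv[i], diff) / (4 * epsilon))
    return K

def symmetrized(K):
    """A symmetric kernel built from the (generally nonsymmetric) local kernel."""
    return 0.5 * (K + K.T)
```
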
Given data sampled from a -dimensional manifold embedded in Euclidean space the manifold naturally inherits a Riemannian metric, , from the ambient space. The standard Diffusion Maps algorithm uses an isotropic kernel (where the covariance matrix is a multiple of the identity matrix) to estimate the Laplace-Beltrami operator corresponding to the metric . It was shown in LK () that local kernels such as (3) can be used to approximate the Laplace-Beltrami operator corresponding to a new Riemannian metric , where is a Riemannian metric for a diffeomorphism that satisfies . Formally, we summarize this result as follows:
Theorem 2.1 (Pullback geometry of local kernels, with nonuniform sampling).
Let be a Riemannian manifold and let be sampled according to any smooth density on . Let be a diffeomorphism and let and . For the local kernel in (3), define the symmetric kernel . Then for any smooth function on ,
Theorem 2.1 follows directly from Theorem 4.7 of LK (). This result was used by LK () to represent a diffeomorphism between two manifolds. We assume we are given a training data set sampled from the data manifold along with the true feature values, , where lie on . When is a diffeomorphism, we can use a local kernel to pull back the Riemannian metric from onto via the correspondence between the data sets. With this metric on , the two manifolds are isometric, which implies that the Laplacians ( on and on ) have the same eigenvalues, and that the associated eigenfunctions of any eigenvalue are related by an orthogonal transformation laplacianBook ().
In Section 3 we will give a rigorous method to approximate using the training data. With this approximation, numerically we evaluate the local kernel
By Theorem 2.1, using the kernel (5), we approximate the Laplacian on . Simultaneously, using the standard diffusion maps algorithm (with ) we approximate the Laplacian on . Since and are isometric, the eigenvalues of and will be the same and the corresponding eigenfunctions will be related by an orthogonal transformation. By taking sufficiently many eigenfunctions and on the respective manifolds, the eigenfunctions can be considered coordinates of embeddings and . We can now project the diffeomorphism into these coordinates as,
where is linear and can be estimated using linear least squares. Finally, to extend the diffeomorphism to a new data point we need only extend the map to this new data point using standard methods such as the Nyström extension.
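The two computational pieces, the least-squares fit between eigenfunction coordinates and the Nyström extension, can be sketched as follows. The helper names are hypothetical, and the Nyström formula is written for a plain Markov-normalized Gaussian kernel; the paper's construction would use the corresponding local kernel.

```python
import numpy as np

def fit_linear_map(Phi, Psi):
    """Least-squares estimate of the linear map A with Psi ~= Phi @ A, where
    Phi and Psi hold eigenfunction coordinates, one row per training point."""
    A, *_ = np.linalg.lstsq(Phi, Psi, rcond=None)
    return A

def nystrom_extend(x_new, X, Phi, lam, epsilon):
    """Nystrom extension of eigenfunctions to a new point:
    phi_k(x) = lam_k^{-1} * sum_j P(x, x_j) phi_k(x_j),
    where P(x, .) is the Markov-normalized Gaussian kernel row at x."""
    k = np.exp(-((X - x_new) ** 2).sum(axis=1) / (4 * epsilon))
    p = k / k.sum()
    return (p @ Phi) / lam
```

At a training point the extension reproduces the stored eigenvector entries exactly, since the discrete eigenvectors satisfy the same averaging identity.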
Notice that the key to the existence of the linear map is that the diffeomorphism induces a new metric on that is isometric to the metric on . In Section 4 we will make use of this theorem for identifying a feature in that is relevant to the data in , even when is not a diffeomorphism but simply a mapping. However, we will first give rigorous results in Section 3 for approximating the tangent plane and the derivative from data.
3 Tangent Spaces and Derivatives
Section 2.2 shows that to build a global map between data sets, we need to estimate the local linear maps between the tangent spaces and at each point . Notice that is a matrix, where is the intrinsic dimension of and is the intrinsic dimension of . However, it will be more natural to represent as a map between the ambient spaces and . Recall that is a matrix-valued function which projects from the ambient space onto the tangent space such that . We introduce the notation for the matrix-valued function given by the projection from onto the tangent space such that . With this notation,
In practice we will estimate , however, when used to construct a local kernel as in Section 2.2 only will influence the intrinsic geometry defined by the kernel.
In this section we improve and make rigorous a method originally introduced in LK () that estimates the local linear maps from data using a weighted regression. To estimate , we take the nearest neighbors of and use the correspondence to find and the neighbors . Note that may not be the nearest neighbors of due to the distortion of the geometry introduced by ; although if is a diffeomorphism (as in LK ()) the local distortion will be very small. In LK () they construct the weighted vectors
and define to be the matrix which minimizes . Intuitively, the exponential weight is used to localize the vectors; otherwise the linear least squares problem would try to preserve the longest vectors , which do not represent the tangent space well. This method of localization was used in LK () for estimating , and it is also closely related to a method of determining the tangent space of a manifold which was introduced in singerWu (). Using the foundational theory developed in diffusion () we will now make this method of finding tangent spaces and derivatives rigorous.
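A minimal version of this weighted regression is sketched below. The weight `exp(-||x_j - x_i||^2 / (2*epsilon))` is one common convention (the paper's constant may differ), and the neighbor count `k` and bandwidth `epsilon` are tuning parameters.

```python
import numpy as np

def estimate_derivative(X, Y, i, epsilon, k):
    """Weighted local linear regression estimate of the derivative of the
    map x -> y at the training point x_i: find the matrix A minimizing
    sum_j || A s_j - t_j ||^2, where s_j and t_j are exponentially weighted
    difference vectors in data space and feature space respectively."""
    d2 = ((X - X[i]) ** 2).sum(axis=1)
    nbrs = np.argsort(d2)[1:k + 1]                 # k nearest neighbors of x_i
    w = np.exp(-d2[nbrs] / (2 * epsilon))          # localizing weights
    S = w[:, None] * (X[nbrs] - X[i])              # weighted vectors s_j
    T = w[:, None] * (Y[nbrs] - Y[i])              # weighted vectors t_j
    A, *_ = np.linalg.lstsq(S, T, rcond=None)      # minimizes ||S A - T||_F
    return A.T                                     # maps ambient data space to feature space
```

For a globally linear map the regression recovers the matrix exactly, which is a useful sanity check.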
Let be samples from with density and where . Define to be a matrix with columns and let be a matrix with columns , where
with as in (6) and .
Following Appendix B of diffusion (), let with sufficiently small so that there is a unique geodesic with and . Let be a basis for the tangent space and define the projection of the geodesic onto the tangent plane by . Locally, we can parameterize the manifold using a function so that . We now use the Taylor expansion , where and is orthogonal to the tangent space. Combining the previous lines yields,
which implies that and . From Equation (B.2) in diffusion (), we have . For and we have,
This shows that taking the inner product with vectors in the neighborhood of , vectors in the tangent space are of order- and vectors in the orthogonal complement are of order-.
Let be discrete data points sampled from . Recall from diffusion () we have,
where the continuous integral is a result of taking the Monte-Carlo limit over data sampled from the sampling density with respect to the volume form that inherits from the ambient space. The restriction of the integral to the tangent plane was shown in diffusion () and follows from the exponential decay of the integrand, and we also use the fact from diffusion () that . Finally, the change of variables in (3) drops all the odd-order terms due to the symmetry of the kernel.
Recall that was the matrix with columns and is the matrix with columns . For any vectors and we have,
where is the Hessian operator and the last equality follows from using the exponential decay of the integrand to restrict the integral to the tangent plane (see diffusion () for details). For and we reduce (3) to,
On the other hand, for and we reduce (3) to,
where we have used the expansion and we define
Finally, it is easy to see that for and all the terms will be polynomials of degree 3 in the coordinates of , and since these terms are all odd, by the symmetry of the domain of integration we have . Together with (3) and (11), the proof is complete. ∎
We note that the above proof can easily be generalized to kernels of the form for having exponential decay by following diffusion (). In the remainder of this section, we will discuss the consequences of this result in more detail. In particular, we shall see that the scaling law established in this theorem provides systematic methods to identify tangent spaces, estimate the derivative , as well as to estimate the kernel bandwidth parameter , which is crucial for accurate numerical approximation.
3.1 Identifying Tangent Spaces with the Singular Value Decomposition
The first method of leveraging Theorem 3.2 is with the singular value decomposition (SVD). Intuitively, the singular vectors will naturally be sorted into tangent vectors, with singular values of order , and orthogonal vectors, with singular values of order . To see this we state the following corollary to Theorem 3.2.
Let be samples from with density . Define to be a matrix with columns , where is defined as in Theorem 3.2. Then,
The proof follows from Theorem 3.2 with so that and . Note that the Hessian in the definition of in (12) is with respect to the coordinates , so in general is not necessarily zero. In fact, by repeating the argument in the derivation of (12) one can show that,
where is a projection operator that is identity in the directions orthogonal to and maps all vectors originating at to zero when they are in . ∎
Recall that is the projection onto the tangent space at viewed as a subspace of . Corollary 3.3 suggests that if , then , whereas for we find . This shows that if is a singular vector, the associated singular value,
will either be order- if is in the tangent space, or order- if is orthogonal to the tangent space. Since the singular value decomposition of finds which maximizes , when is well tuned the first singular values will all be order- and the remaining singular values will be order-. This fact gives us a way to identify the tangent vectors of the manifold by defining the scaling law, , of a singular value, , to be the exponential power such that . When then the associated singular vector is a tangent vector and when then the associated singular vector is orthogonal to .
For discrete data, this power law will change as a function of the bandwidth parameter . Numerically, we can estimate this power law by computing for discrete values and then approximating,
We now demonstrate this numerically by sampling 10000 points from a uniform grid and mapping them onto a torus embedded in by . We chose a point and constructed the weighted vectors for where . For each value of we compute the three singular values of and then we compute the scaling laws for each singular value. These scaling laws are shown in Figure 3. We selected the optimal value of using the method that we will describe in Section 3.3, which are highlighted by a solid dot in the scaling law curves, and we plot the associated singular vectors in Figure 3.
To demonstrate the robustness of this methodology to small noise in the ambient space, we repeated the experiment adding a three-dimensional Gaussian random perturbation with mean zero and variance to each point. In the noisy case, the theoretical scaling laws are obtained for a much smaller range of values of as shown in Figure 3. In fact, when analyzed at a small scale () all three singular values have scaling law , which represents the three-dimensional nature of the manifold after the addition of the noise. However, the scaling laws also capture the approximate two-dimensional structure, as shown by the scaling law of the third singular vector being very close to for . This suggests that the scaling laws are robust for perturbations of magnitude less than ; however, the singular vectors are more sensitive, as shown by the slight tilt in the tangent plane defined by the first two singular vectors in Figure 3.
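The scaling-law estimate can be illustrated on a circle, a simpler example than the torus used in the figure. In the convention below, where the weighted vectors are normalized by the root of the summed squared weights, the tangent singular value scales like eps^(1/2) and the normal (curvature) singular value like eps; the paper's exact exponents depend on its weight normalization, so treat these values as illustrative of the method rather than of the theorem's constants.

```python
import numpy as np

theta = np.linspace(0, 2 * np.pi, 20000, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]            # unit circle in the plane
x0 = X[0]                                          # base point (1, 0)

def singular_values(eps):
    """Singular values of the weighted difference vectors at x0, normalized
    by the root of the summed squared weights."""
    d2 = ((X - x0) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * eps))
    S = w[:, None] * (X - x0) / np.sqrt((w ** 2).sum())
    return np.linalg.svd(S, compute_uv=False)

# estimate the scaling law l(sigma) = d log(sigma) / d log(eps) by finite
# differences on a logarithmic grid of bandwidths
epss = np.logspace(-5, -2, 16)
sv = np.array([singular_values(e) for e in epss])  # columns: tangent, normal
slopes = np.gradient(np.log(sv), np.log(epss), axis=0)
mid = len(epss) // 2
tangent_slope, normal_slope = slopes[mid]
```

In the middle of the bandwidth range the two slopes separate cleanly, which is how the tangent direction is distinguished from the curvature direction.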
3.2 Estimating Derivatives with Linear Regression
We now return to the problem of estimating the derivative of a nonlinear mapping where we assume that we know the values of on our training data set . As mentioned above, the approach of LK () was to use a linear regression to estimate as the matrix which minimizes . Using the theory developed in Section 3.1 we can now rigorously justify this approach. Notice that the linear regression minimizes the error by setting (where the additional factor of from Theorem 3.2 cancels, making this equivalent to the approach of LK ()).
Theorem 3.2 suggested that a simple method of estimating the derivative is with the correlation matrix . Numerically, we found that a better estimate of is given by the linear regression , and we also analyze this construction. From Corollary 3.3 we have , which implies that in the limit of large data,