Vector Diffusion Maps and the Connection Laplacian


A. Singer Department of Mathematics and PACM, Princeton University, Fine Hall, Washington Road, Princeton NJ 08544-1000 USA, email: amits@math.princeton.edu    H.-T. Wu Department of Mathematics, Princeton University, Fine Hall, Washington Road, Princeton NJ 08544-1000 USA, email: hauwu@math.princeton.edu
Abstract

We introduce vector diffusion maps (VDM), a new mathematical framework for organizing and analyzing massive high dimensional data sets, images and shapes. VDM is a mathematical and algorithmic generalization of diffusion maps and other non-linear dimensionality reduction methods, such as LLE, ISOMAP and Laplacian eigenmaps. While existing methods are either directly or indirectly related to the heat kernel for functions over the data, VDM is based on the heat kernel for vector fields. VDM provides tools for organizing complex data sets, embedding them in a low dimensional space, and interpolating and regressing vector fields over the data. In particular, it equips the data with a metric, which we refer to as the vector diffusion distance. In the manifold learning setup, where the data set is distributed on (or near) a low dimensional manifold embedded in $\mathbb{R}^p$, we prove the relation between VDM and the connection-Laplacian operator for vector fields over the manifold.

Dedicated to the Memory of Partha Niyogi

Key words. Dimensionality reduction, vector fields, heat kernel, parallel transport, local principal component analysis, alignment.

1 Introduction

A popular way to describe the affinities between data points is using a weighted graph, whose vertices correspond to the data points, whose edges connect data points with large enough affinities, and whose weights quantify those affinities. In the past decade we have witnessed the emergence of non-linear dimensionality reduction methods, such as locally linear embedding (LLE) [33], ISOMAP [39], Hessian LLE [12], Laplacian eigenmaps [2] and diffusion maps [9]. These methods use the local affinities in the weighted graph to learn its global features. They provide invaluable tools for organizing complex networks and data sets, embedding them in a low dimensional space, and studying and regressing functions over graphs. Inspired by recent developments in the mathematical theory of cryo-electron microscopy [36, 18] and synchronization [34, 10], in this paper we demonstrate that in many applications the representation of the data set can be vastly improved by attaching to every edge of the graph not only a weight but also a linear orthogonal transformation (see Figure 1.1).

Figure 1.1: In VDM, the relationships between data points are represented as a weighted graph, where the weights $w_{ij}$ are accompanied by linear orthogonal transformations $O_{ij}$.
Figure 1.2: An example of a weighted graph with orthogonal transformations: $x_i$ and $x_j$ are two different images of the digit one, corresponding to nodes $i$ and $j$ in the graph. $O_{ij}$ is the rotation matrix that rotationally aligns $x_j$ with $x_i$, and $w_{ij}$ is some measure for the affinity between the two images when they are optimally aligned. The affinity $w_{ij}$ is large, because the images $x_i$ and $x_j$ are actually the same. On the other hand, $x_k$ is an image of the digit two, and the discrepancy between $x_i$ and $x_k$ is large even when these images are optimally aligned. As a result, the affinity $w_{ik}$ would be small, perhaps so small that there is no edge in the graph connecting nodes $i$ and $k$. The matrix $O_{ik}$ is clearly not as meaningful as $O_{ij}$. If there is no edge between $i$ and $k$, then $O_{ik}$ is not represented in the weighted graph.

Consider, for example, a data set of images, or small patches extracted from images (see, e.g., [27, 8]). While weights are usually derived from the pairwise comparison of the images in their original representation, we instead associate the weight $w_{ij}$ with the similarity between image $i$ and image $j$ when they are optimally rotationally aligned. The dissimilarity between images when they are optimally rotationally aligned is sometimes called the rotationally invariant distance [31]. We further define the linear transformation $O_{ij}$ as the orthogonal transformation that registers the two images (see Figure 1.2). Similarly, for data sets consisting of three-dimensional shapes, $O_{ij}$ encodes the optimal orthogonal registration transformation. In the case of manifold learning, the linear transformations are constructed using local principal component analysis (PCA) and alignment, as discussed in Section 2.
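To make the rotational alignment concrete, the following Python sketch (an editorial illustration, not part of the original paper) computes a rotationally invariant distance between two images by a brute-force search over a grid of in-plane rotation angles; the function name, the angular grid, and the toy images are our own assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def rotationally_invariant_distance(img_i, img_j, n_angles=360):
    """Illustrative sketch: align img_j to img_i over a grid of in-plane
    rotations and return (best distance, best angle in degrees)."""
    best_dist, best_angle = np.inf, 0.0
    for angle in np.linspace(0.0, 360.0, n_angles, endpoint=False):
        rotated = rotate(img_j, angle, reshape=False, mode="nearest")
        dist = np.linalg.norm(img_i - rotated)
        if dist < best_dist:
            best_dist, best_angle = dist, angle
    return best_dist, best_angle

# Toy usage: two noisy copies of the same pattern, one rotated by 30 degrees.
rng = np.random.default_rng(0)
base = np.zeros((64, 64)); base[20:44, 30:34] = 1.0   # a vertical bar
img_i = base + 0.01 * rng.standard_normal(base.shape)
img_j = rotate(base, 30, reshape=False, mode="nearest")
d_ij, theta_ij = rotationally_invariant_distance(img_i, img_j)
# theta_ij should be close to 330 degrees (i.e., -30), and d_ij should be
# much smaller than the unaligned distance between img_i and img_j.
```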

While diffusion maps and other non-linear dimensionality reduction methods are either directly or indirectly related to the heat kernel for functions over the data, our VDM framework is based on the heat kernel for vector fields. We construct this kernel from the weighted graph and the orthogonal transformations. Through the spectral decomposition of this kernel, VDM defines an embedding of the data in a Hilbert space. In particular, it defines a metric for the data, that is, distances between data points that we call vector diffusion distances. For some applications, the vector diffusion metric is more meaningful than currently used metrics, since it takes into account the linear transformations, and as a result, it provides a better organization of the data. In the manifold learning setup, we prove a convergence theorem illuminating the relation between VDM and the connection-Laplacian operator for vector fields over the manifold.

The paper is organized in the following way: In Section 2 we describe the manifold learning setup and a procedure to extract the orthogonal transformations $O_{ij}$ from a point cloud scattered in a high dimensional Euclidean space using local PCA and alignment. In Section 3 we specify the vector diffusion mapping of the data set into a finite dimensional Hilbert space. At the heart of the vector diffusion mapping construction lies a certain symmetric matrix that can be normalized in slightly different ways. Different normalizations lead to different embeddings, as discussed in Section 4. These normalizations resemble the normalizations of the graph Laplacian in spectral graph theory and spectral clustering algorithms. In the manifold learning setup, it is known that when the point cloud is uniformly sampled from a low dimensional Riemannian manifold, the normalized graph Laplacian approximates the Laplace-Beltrami operator for scalar functions. In Section 5 we formulate a similar result, stated as Theorem LABEL:summary, for the convergence of the appropriately normalized vector diffusion mapping matrix to the connection-Laplacian operator for vector fields. (One of the main considerations in the way this paper is presented was to make it as accessible as possible, also to readers who are not familiar with differential geometry. Although the connection-Laplacian is essential to the understanding of the mathematical framework that underlies VDM, and differential geometry is extensively used in Appendix LABEL:proof for the proof of Theorem LABEL:summary, we do not assume knowledge of differential geometry in the sections that detail the algorithmic framework, except for some parts of Section LABEL:gdd. The concepts of differential geometry that are required for achieving basic familiarity with the connection-Laplacian are explained in Appendix LABEL:setupbackground.) The proof of Theorem LABEL:summary appears in Appendix LABEL:proof. We verified Theorem LABEL:summary numerically for spheres of different dimensions, as reported in Section LABEL:numerical and Appendix LABEL:app-A. We also used other surfaces to perform numerical comparisons between the vector diffusion distance, the diffusion distance, and the geodesic distance. In Section LABEL:extrapolation we briefly discuss out-of-sample extrapolation of vector fields via the Nyström extension scheme. The role played by the heat kernel of the connection-Laplacian is discussed in Section LABEL:gdd. We use the well known short time asymptotic expansion of the heat kernel to show the relationship between vector diffusion distances and geodesic distances for nearby points. In Section LABEL:applications we briefly discuss the application of VDM to cryo-electron microscopy, as a prototypical multi-reference rotational alignment problem. We conclude in Section LABEL:sec:summary with a summary followed by a discussion of some other possible applications and extensions of the mathematical framework.

2 Data sampled from a Riemannian manifold

One of the main objectives in the analysis of a high dimensional large data set is to learn its geometric and topological structure. Even though the data itself is parameterized as a point cloud in a high dimensional ambient space $\mathbb{R}^p$, the correlation between parameters often suggests the popular “manifold assumption” that the data points are distributed on (or near) a low dimensional Riemannian manifold $\mathcal{M}^d$ embedded in $\mathbb{R}^p$, where $d$ is the dimension of the manifold and $d \ll p$. Suppose that the point cloud consists of $n$ data points $x_1, x_2, \ldots, x_n$ that are viewed as points in $\mathbb{R}^p$ but are restricted to the manifold. We now describe how the orthogonal transformations $O_{ij}$ can be constructed from the point cloud using local PCA and alignment.

Local PCA. For every data point $x_i$ we estimate a basis for the tangent plane to the manifold at $x_i$ using the following procedure, which we refer to as local PCA. We fix a scale parameter $\epsilon_{PCA} > 0$ and define the neighbors of $x_i$ inside a ball of radius $\sqrt{\epsilon_{PCA}}$ centered at $x_i$:

$$N_{x_i,\epsilon_{PCA}} = \left\{\, x_j \,:\, 0 < \|x_j - x_i\|_{\mathbb{R}^p} < \sqrt{\epsilon_{PCA}} \,\right\}.$$

Denote the number of neighboring points of $x_i$ by $N_i$ (since $N_i$ depends on $\epsilon_{PCA}$ it should be denoted $N_{i,\epsilon_{PCA}}$, but since $\epsilon_{PCA}$ is kept fixed it is suppressed from the notation, a convention that we use except for cases in which confusion may arise), that is, $N_i = |N_{x_i,\epsilon_{PCA}}|$, and denote the neighbors of $x_i$ by $x_{i_1}, x_{i_2}, \ldots, x_{i_{N_i}}$. We assume that $\epsilon_{PCA}$ is large enough so that each point has sufficiently many neighbors, but at the same time small enough so that the neighbors remain close to $x_i$. In Theorem LABEL:localpcatheorem we show how a satisfactory choice of $\epsilon_{PCA}$ scales with the number of points $n$ and the intrinsic dimension $d$; a slightly larger choice is possible if the manifold does not have a boundary.

Observe that the neighboring points $x_{i_1}, \ldots, x_{i_{N_i}}$ are located near the tangent plane to the manifold at $x_i$, where deviations are possible either due to curvature or due to neighboring data points that lie slightly off the manifold. Define $X_i$ to be a $p \times N_i$ matrix whose $j$'th column is the vector $x_{i_j} - x_i$, that is,

$$X_i = \left[\, x_{i_1} - x_i \;\; x_{i_2} - x_i \;\; \cdots \;\; x_{i_{N_i}} - x_i \,\right].$$

In other words, $X_i$ is the data matrix of the neighbors shifted to be centered at the point $x_i$. Notice that while it is more common in PCA to shift the data by the mean $\mu_i = \frac{1}{N_i}\sum_{j=1}^{N_i} x_{i_j}$, here we shift the data by $x_i$. Shifting the data by $\mu_i$ is also possible for all practical purposes, but it has the slight disadvantage of complicating the proof for the convergence of the local PCA step (see Appendix LABEL:localpcatheorem).

Let $K_{PCA}$ be a positive monotonic decreasing function with support on the interval $[0,1]$, for example, the Epanechnikov kernel $K_{PCA}(u) = (1-u^2)\,\chi_{[0,1]}(u)$, where $\chi$ is the indicator function. (In fact, $K_{PCA}$ can be chosen in a more general fashion; for example, monotonicity is not required for all theoretical purposes. However, in practice, a monotonic decreasing $K_{PCA}$ leads to a better behavior of the PCA step.) Let $D_i$ be an $N_i \times N_i$ diagonal matrix with

$$D_i(j,j) = \sqrt{K_{PCA}\!\left(\frac{\|x_i - x_{i_j}\|}{\sqrt{\epsilon_{PCA}}}\right)}, \qquad j = 1, \ldots, N_i.$$

Define the $p \times N_i$ matrix $B_i$ as

$$B_i = X_i D_i.$$

That is, the $j$'th column of $B_i$ is the vector $x_{i_j} - x_i$ scaled by $\sqrt{K_{PCA}\!\left(\|x_i - x_{i_j}\|/\sqrt{\epsilon_{PCA}}\right)}$. The purpose of the scaling is to give more emphasis to nearby points over far away points (recall that $K_{PCA}$ is monotonic decreasing). We denote the singular values of $B_i$ by $\sigma_{i,1} \ge \sigma_{i,2} \ge \cdots \ge \sigma_{i,N_i}$.

In many cases, the intrinsic dimension $d$ is not known in advance and needs to be estimated directly from the data. If the neighboring points in $N_{x_i,\epsilon_{PCA}}$ are located exactly on the tangent plane at $x_i$, then $\operatorname{rank}(X_i) = d$, and there are only $d$ non-vanishing singular values (i.e., $\sigma_{i,d+1} = \cdots = \sigma_{i,N_i} = 0$). In such a case, the dimension can be estimated as the number of non-zero singular values. In practice, however, due to the curvature effect, there may be more than $d$ non-zero singular values. A common practice is to estimate the dimension as the number of singular values that account for a high enough percentage of the variability of the data. That is, one sets a threshold $\gamma$ between 0 and 1 (usually closer to 1 than to 0), and estimates the local dimension as the smallest integer $\hat d_i$ for which

$$\frac{\sum_{j=1}^{\hat d_i} \sigma_{i,j}^2}{\sum_{j=1}^{N_i} \sigma_{i,j}^2} > \gamma.$$

For example, setting $\gamma$ close to 1 means that the first $\hat d_i$ singular values account for at least a fraction $\gamma$ of the variability of the data, while the first $\hat d_i - 1$ singular values account for less than $\gamma$. We refer to the smallest such integer $\hat d_i$ as the estimated local dimension of $\mathcal{M}$ at $x_i$. One possible way to estimate the dimension of the manifold would be to use the mean of the estimated local dimensions $\hat d_1, \ldots, \hat d_n$, that is, $\frac{1}{n}\sum_{i=1}^{n}\hat d_i$ (and then round it to the closest integer). The mean estimator minimizes the sum of squared errors $\sum_{i=1}^{n}(d - \hat d_i)^2$. We instead estimate the intrinsic dimension of the manifold by the median value of all the $\hat d_i$'s, that is, we define the estimator $\hat d$ for the intrinsic dimension $d$ as

$$\hat d = \operatorname{median}\{\hat d_1, \hat d_2, \ldots, \hat d_n\}.$$

The median has the property that it minimizes the sum of absolute errors $\sum_{i=1}^{n}|d - \hat d_i|$. As such, estimating the intrinsic dimension by the median is more robust to outliers compared to the mean estimator. In all subsequent steps of the algorithm we use the median estimator $\hat d$, but in order to facilitate the notation we write $d$ instead of $\hat d$.
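The median-based dimension estimator is straightforward to implement. The sketch below is a minimal illustration (the helper names and the threshold value $\gamma = 0.9$ are our own, hypothetical choices); it takes the per-point singular values $\sigma_{i,j}$ produced by local PCA and returns the local estimates $\hat d_i$ together with their median.

```python
import numpy as np

def estimate_local_dim(singular_values, gamma):
    """Smallest d_i such that the first d_i squared singular values account
    for at least a fraction gamma of the total variability."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    return int(np.searchsorted(cumulative, gamma) + 1)

def estimate_intrinsic_dim(all_singular_values, gamma=0.9):
    """Median of the local dimension estimates (robust to outliers)."""
    local_dims = [estimate_local_dim(s, gamma) for s in all_singular_values]
    return int(np.median(local_dims)), local_dims

# Toy usage: three points whose local spectra suggest a 2-dimensional manifold.
spectra = [np.array([3.0, 2.5, 0.2]),
           np.array([2.8, 2.6, 0.3]),
           np.array([3.1, 0.1, 0.05])]
d_hat, dims = estimate_intrinsic_dim(spectra, gamma=0.9)   # d_hat == 2
```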

Suppose that the singular value decomposition (SVD) of $B_i$ is given by

$$B_i = U_i \Sigma_i V_i^T.$$

The columns of the matrix $U_i$ are orthonormal and are known as the left singular vectors

$$u_{i,1}, u_{i,2}, \ldots, u_{i,N_i}.$$

We define the $p \times d$ matrix $O_i$ by the first $d$ left singular vectors (corresponding to the $d$ largest singular values):

$$O_i = \left[\, u_{i,1} \;\; u_{i,2} \;\; \cdots \;\; u_{i,d} \,\right]. \qquad (2.1)$$

The columns of $O_i$ are orthonormal, i.e., $O_i^T O_i = I_{d\times d}$. The columns of $O_i$ represent an orthonormal basis to a $d$-dimensional subspace of $\mathbb{R}^p$. This basis is a numerical approximation to an orthonormal basis of the tangent plane $T_{x_i}\mathcal{M}$. The order of the approximation (as a function of $\epsilon_{PCA}$ and $n$) is established later, using the fact that the columns of $O_i$ are also the eigenvectors (corresponding to the $d$ largest eigenvalues) of the $p \times p$ covariance matrix $\Xi_i$ given by

$$\Xi_i = B_i B_i^T = \sum_{j=1}^{N_i} K_{PCA}\!\left(\frac{\|x_i - x_{i_j}\|}{\sqrt{\epsilon_{PCA}}}\right)(x_{i_j} - x_i)(x_{i_j} - x_i)^T. \qquad (2.2)$$

Since $K_{PCA}$ is supported on the interval $[0,1]$, the covariance matrix $\Xi_i$ can also be represented as a sum over all data points:

$$\Xi_i = \sum_{j=1}^{n} K_{PCA}\!\left(\frac{\|x_i - x_j\|}{\sqrt{\epsilon_{PCA}}}\right)(x_j - x_i)(x_j - x_i)^T. \qquad (2.3)$$

We emphasize that the $p \times p$ covariance matrix $\Xi_i$ is never actually formed due to its excessive storage requirements, and all computations are performed with the $p \times N_i$ matrix $B_i$. We remark that it is also possible to estimate the intrinsic dimension $d$ and the basis $O_i$ using the multiscale PCA algorithm [28], which uses several different values of $\epsilon_{PCA}$ for a given $x_i$, but here we try to keep our approach as simple as possible while still being able to later prove convergence theorems.
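A minimal local PCA sketch follows, assuming the point cloud is stored as an $n \times p$ array; the function name, the use of the Epanechnikov kernel, and the toy example on a circle are our own illustrative choices, while the computation itself follows the steps above ($X_i$, $D_i$, $B_i = X_i D_i$, thin SVD).

```python
import numpy as np

def local_pca_basis(X, i, eps_pca, d):
    """Return a p x d orthonormal basis O_i approximating T_{x_i}M,
    together with the singular values of B_i.

    X       : (n, p) array of data points.
    i       : index of the reference point x_i.
    eps_pca : local PCA scale; neighbors lie within radius sqrt(eps_pca).
    d       : (estimated) intrinsic dimension of the manifold.
    """
    x_i = X[i]
    dists = np.linalg.norm(X - x_i, axis=1)
    nbrs = np.where((dists > 0) & (dists < np.sqrt(eps_pca)))[0]
    if len(nbrs) < d:
        raise ValueError("eps_pca too small: fewer than d neighbors")
    # Columns of X_i are the neighbors shifted to be centered at x_i.
    X_i = (X[nbrs] - x_i).T                          # p x N_i
    u = dists[nbrs] / np.sqrt(eps_pca)
    D_i = np.sqrt(np.maximum(1.0 - u**2, 0.0))       # sqrt of Epanechnikov weights
    B_i = X_i * D_i[np.newaxis, :]                   # B_i = X_i D_i
    # Thin SVD; the first d left singular vectors span the estimated tangent plane.
    U, s, _ = np.linalg.svd(B_i, full_matrices=False)
    return U[:, :d], s

# Toy usage: points on the unit circle embedded in R^3 (d = 1, p = 3).
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 500)
X = np.c_[np.cos(theta), np.sin(theta), np.zeros_like(theta)]
O_0, sing_vals = local_pca_basis(X, i=0, eps_pca=0.01, d=1)
# O_0 should be nearly tangent to the circle at X[0], i.e. close to
# +/- (-sin(theta[0]), cos(theta[0]), 0).
```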

Alignment. Suppose $x_i$ and $x_j$ are two nearby points whose Euclidean distance satisfies $\|x_i - x_j\| < \sqrt{\epsilon}$, where $\sqrt{\epsilon}$ is a scale parameter different from the scale parameter $\sqrt{\epsilon_{PCA}}$. In fact, $\epsilon$ is chosen much larger than $\epsilon_{PCA}$; the precise rates, which differ for manifolds with and without boundary, are specified later. In any case, $\epsilon$ is small enough so that the tangent spaces $T_{x_i}\mathcal{M}$ and $T_{x_j}\mathcal{M}$ are also close (in the sense that their Grassmannian distance, given approximately by the operator norm $\|O_i O_i^T - O_j O_j^T\|$, is small). Therefore, the column spaces of $O_i$ and $O_j$ are almost the same. If the subspaces were exactly the same, then the matrices $O_i$ and $O_j$ would differ by a $d \times d$ orthogonal transformation $O_{ij}$ satisfying $O_i O_{ij} = O_j$, or equivalently $O_{ij} = O_i^T O_j$. In that case, $O_{ij}$ is the matrix representation of the operator that transports vectors from $T_{x_j}\mathcal{M}$ to $T_{x_i}\mathcal{M}$, where the tangent planes are viewed as copies of $\mathbb{R}^d$. The subspaces, however, are usually not exactly the same, due to curvature. As a result, the matrix $O_i^T O_j$ is not necessarily orthogonal, and we define $O_{ij}$ as its closest orthogonal matrix, i.e.,

$$O_{ij} := \operatorname*{argmin}_{O \in O(d)} \|O - O_i^T O_j\|_{HS}, \qquad (2.4)$$

where $\|\cdot\|_{HS}$ is the Hilbert-Schmidt norm (given by $\|A\|_{HS}^2 = \operatorname{Tr}(AA^T)$ for any real matrix $A$) and $O(d)$ is the set of $d \times d$ orthogonal matrices. This minimization problem has a simple solution [13, 25, 21, 1] via the SVD of $O_i^T O_j$. Specifically, if

$$O_i^T O_j = U \Sigma V^T$$

is the SVD of $O_i^T O_j$, then $O_{ij}$ is given by

$$O_{ij} = U V^T.$$
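In code, the closest orthogonal matrix of (2.4) is one SVD away. The following sketch (with hypothetical helper names) aligns two bases and checks that the result is orthogonal.

```python
import numpy as np

def align_bases(O_i, O_j):
    """Closest d x d orthogonal matrix to O_i^T O_j in the Hilbert-Schmidt norm.

    O_i, O_j : p x d matrices with orthonormal columns (local PCA bases).
    Returns O_ij, which approximates parallel transport from T_{x_j}M to T_{x_i}M.
    """
    U, _, Vt = np.linalg.svd(O_i.T @ O_j)
    return U @ Vt

# Toy usage: two nearby tangent bases of the unit sphere in R^3.
def sphere_basis(theta, phi):
    # An orthonormal basis of the tangent plane at the point (theta, phi).
    e_theta = np.array([np.cos(theta) * np.cos(phi),
                        np.cos(theta) * np.sin(phi),
                        -np.sin(theta)])
    e_phi = np.array([-np.sin(phi), np.cos(phi), 0.0])
    return np.c_[e_theta, e_phi]

O_i = sphere_basis(1.0, 0.5)
O_j = sphere_basis(1.02, 0.53)
O_ij = align_bases(O_i, O_j)
assert np.allclose(O_ij @ O_ij.T, np.eye(2), atol=1e-12)   # O_ij is orthogonal
```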

Figure 2.1: The orthonormal basis of the tangent plane $T_{x_i}\mathcal{M}$ is determined by local PCA using data points inside a Euclidean ball of radius $\sqrt{\epsilon_{PCA}}$ centered at $x_i$. The bases for $T_{x_i}\mathcal{M}$ and $T_{x_j}\mathcal{M}$ are optimally aligned by an orthogonal transformation $O_{ij}$ that can be viewed as a mapping from $T_{x_j}\mathcal{M}$ to $T_{x_i}\mathcal{M}$.

We refer to the process of finding the optimal orthogonal transformation between bases as alignment. Later we show that the matrix $O_{ij}$ is an approximation to the parallel transport operator (see Appendix LABEL:setupbackground) from $T_{x_j}\mathcal{M}$ to $T_{x_i}\mathcal{M}$ whenever $x_i$ and $x_j$ are nearby.

Note that not all bases are aligned; only the bases of nearby points are aligned. We set $E$ to be the edge set of the undirected graph over $n$ vertices that correspond to the data points, where an edge between $i$ and $j$ exists iff their corresponding bases are aligned by the algorithm, or equivalently, iff $0 < \|x_i - x_j\| < \sqrt{\epsilon}$ (we do not align a basis with itself, so the edge set does not contain self loops of the form $(i,i)$). The weights $w_{ij}$ are defined using a kernel function $K$ as

$$w_{ij} = K\!\left(\frac{\|x_i - x_j\|}{\sqrt{\epsilon}}\right), \qquad (2.5)$$

where we assume that $K$ is supported on the interval $[0,1]$. For example, the Gaussian kernel $K(u) = e^{-u^2}\chi_{[0,1]}(u)$ leads to weights of the form $w_{ij} = \exp\!\left(-\|x_i - x_j\|^2/\epsilon\right)$ for $0 < \|x_i - x_j\| < \sqrt{\epsilon}$, and $w_{ij} = 0$ otherwise. We emphasize that the kernel $K$ used for the definition of the weights may be different from the kernel $K_{PCA}$ used in the previous step of local PCA. (Notice also that the weights are only a function of the Euclidean distance between data points; another possibility, which we do not consider in this paper, is to also include the Grassmannian distance between the tangent-plane bases in the definition of the weights.)
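A sketch of the weight computation (2.5) with the truncated Gaussian kernel mentioned above; the helper name is ours, and the dense pairwise-distance computation is only meant for small examples.

```python
import numpy as np

def affinity_weights(X, eps):
    """Symmetric n x n weight matrix w_ij = exp(-||x_i - x_j||^2 / eps)
    for 0 < ||x_i - x_j|| < sqrt(eps), and 0 otherwise (no self-loops)."""
    diffs = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    sq_dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    W = np.exp(-sq_dists / eps)
    W[sq_dists >= eps] = 0.0       # kernel supported on [0, 1]
    np.fill_diagonal(W, 0.0)       # the edge set contains no self-loops
    return W

# Toy usage on a small point cloud.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3))
W = affinity_weights(X, eps=0.5)
```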

3 Vector diffusion mapping

We construct the following $nd \times nd$ matrix $S$:

$$S_{ij} = \begin{cases} w_{ij}\,O_{ij} & (i,j) \in E, \\ 0_{d\times d} & (i,j) \notin E. \end{cases} \qquad (3.1)$$

That is, $S$ is a block matrix, with $n \times n$ blocks, each of which is of size $d \times d$. Each block is either a $d \times d$ orthogonal transformation $O_{ij}$ multiplied by the scalar weight $w_{ij}$, or a $d \times d$ zero matrix (since the edge set contains no self-loops, $S_{ii} = 0_{d\times d}$ and $w_{ii} = 0$). The matrix $S$ is symmetric since $O_{ij}^T = O_{ji}$ and $w_{ij} = w_{ji}$, and its overall size is $nd \times nd$. We define a block diagonal matrix $D$ of the same size, where the diagonal blocks are $d \times d$ scalar matrices given by

$$D_{ii} = \deg(i)\, I_{d\times d}, \qquad (3.2)$$

and

$$\deg(i) := \sum_{j:(i,j)\in E} w_{ij} \qquad (3.3)$$

is the weighted degree of node $i$. The matrix $D^{-1}S$ can be applied to vectors $v$ of length $nd$, which we regard as $n$ vectors of length $d$, such that $v(i)$ is a vector in $\mathbb{R}^d$ viewed as a vector in $T_{x_i}\mathcal{M}$. The matrix $D^{-1}S$ is an averaging operator for vector fields, since

$$(D^{-1}Sv)(i) = \frac{1}{\deg(i)}\sum_{j:(i,j)\in E} w_{ij}\, O_{ij}\, v(j). \qquad (3.4)$$

This implies that the operator $D^{-1}S$ transports vectors from the tangent spaces $T_{x_j}\mathcal{M}$ (for points $x_j$ that are nearby to $x_i$) to $T_{x_i}\mathcal{M}$ and then averages the transported vectors in $T_{x_i}\mathcal{M}$.
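Putting the pieces together, the sketch below assembles the $nd \times nd$ block matrix $S$ of (3.1) and the degrees of (3.3), and applies the averaging operator $D^{-1}S$ of (3.4) to a vector field stored as an $n \times d$ array. It reuses the hypothetical helpers local_pca_basis, align_bases and affinity_weights from the earlier sketches, and it assumes every point has at least one neighbor within radius $\sqrt{\epsilon}$.

```python
import numpy as np

def build_S_and_D(X, eps, eps_pca, d):
    """Dense sketch of S (nd x nd) and the degree vector deg (length n)."""
    n = X.shape[0]
    W = affinity_weights(X, eps)                              # earlier sketch
    bases = [local_pca_basis(X, i, eps_pca, d)[0] for i in range(n)]
    S = np.zeros((n * d, n * d))
    for i in range(n):
        for j in range(i + 1, n):
            if W[i, j] > 0:
                O_ij = align_bases(bases[i], bases[j])
                S[i*d:(i+1)*d, j*d:(j+1)*d] = W[i, j] * O_ij
                S[j*d:(j+1)*d, i*d:(i+1)*d] = W[i, j] * O_ij.T   # O_ji = O_ij^T
    deg = W.sum(axis=1)              # assumes no isolated points (deg > 0)
    return S, deg, bases

def average_vector_field(S, deg, v, d):
    """Apply the averaging operator D^{-1} S to a vector field v (n x d array)."""
    out = S @ v.reshape(-1)          # transport and sum the neighboring vectors
    return out.reshape(-1, d) / deg[:, np.newaxis]
```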

Notice that diffusion maps and other non-linear dimensionality reduction methods make use of the weight matrix $W = (w_{ij})_{i,j=1}^{n}$, but not of the transformations $O_{ij}$. In diffusion maps, the weights are used to define a discrete random walk over the graph, where the transition probability $a(i,j)$ in a single time step from node $i$ to node $j$ is given by

$$a(i,j) = \frac{w_{ij}}{\deg(i)}. \qquad (3.5)$$

The Markov transition matrix $A$ can be written as

$$A = D^{-1}W, \qquad (3.6)$$

where, with a slight abuse of notation, $D$ here denotes the $n \times n$ diagonal matrix with

$$D(i,i) = \deg(i). \qquad (3.7)$$

While $A$ is the Markov transition probability matrix for a single time step, $A^t$ is the transition matrix for $t$ steps. In particular, $A^t(i,j)$ sums the probabilities of all paths of length $t$ that start at $i$ and end at $j$. Coifman and Lafon [9, 26] showed that $A^t$ can be used to define an inner product in a Hilbert space. Specifically, the matrix $A$ is similar to the symmetric matrix $D^{-1/2}WD^{-1/2}$ through $A = D^{-1/2}\left(D^{-1/2}WD^{-1/2}\right)D^{1/2}$. It follows that $A$ has a complete set of real eigenvalues and eigenvectors, $\{\mu_l\}_{l=1}^{n}$ and $\{\phi_l\}_{l=1}^{n}$, respectively, satisfying $A\phi_l = \mu_l\phi_l$. Their diffusion mapping $\Phi_t$ is given by

$$\Phi_t(i) = \left(\mu_1^t\,\phi_1(i),\ \mu_2^t\,\phi_2(i),\ \ldots,\ \mu_n^t\,\phi_n(i)\right), \qquad (3.8)$$

where $\phi_l(i)$ is the $i$'th entry of the eigenvector $\phi_l$. The mapping $\Phi_t$ satisfies

$$\langle \Phi_t(i), \Phi_t(j)\rangle = \sum_{l=1}^{n}\mu_l^{2t}\,\phi_l(i)\,\phi_l(j), \qquad (3.9)$$

where $\langle\cdot,\cdot\rangle$ is the usual dot product over Euclidean space. The metric associated to this inner product is known as the diffusion distance. The diffusion distance $d_{DM,t}(i,j)$ between $i$ and $j$ is given by

$$d_{DM,t}^2(i,j) = \|\Phi_t(i) - \Phi_t(j)\|^2 = \sum_{l=1}^{n}\mu_l^{2t}\left(\phi_l(i) - \phi_l(j)\right)^2. \qquad (3.10)$$

Thus, the diffusion distance between $i$ and $j$ is the weighted-$\ell_2$ proximity between the probability clouds of random walkers starting at $i$ and at $j$ after $t$ steps.
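For comparison with VDM below, here is a compact sketch of the scalar diffusion map (3.8) and diffusion distance (3.10) built from the weight matrix $W$ alone; eigenvector normalization conventions vary in the literature, so the sketch is meant only to convey the structure of the computation.

```python
import numpy as np

def diffusion_map(W, t):
    """Return the n x n array whose i-th row is Phi_t(i), computed via the
    symmetric conjugate D^{-1/2} W D^{-1/2} of the Markov matrix A = D^{-1} W.
    t is assumed to be a positive integer."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    M = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # symmetric, same spectrum as A
    mu, psi = np.linalg.eigh(M)
    order = np.argsort(-np.abs(mu))                     # decreasing magnitude
    mu, psi = mu[order], psi[:, order]
    phi = d_inv_sqrt[:, None] * psi                     # right eigenvectors of A
    return (mu ** t)[None, :] * phi                     # Phi_t(i) = (mu_l^t phi_l(i))_l

def diffusion_distance(Phi_t, i, j):
    return np.linalg.norm(Phi_t[i] - Phi_t[j])
```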

In the VDM framework, we define the affinity between $i$ and $j$ by considering all paths of length $t$ connecting them, but instead of just summing the weights of all paths, we sum the transformations. A path of length $t$ from $j$ to $i$ is some sequence of vertices $j_0, j_1, \ldots, j_t$ with $j_0 = j$ and $j_t = i$, and its corresponding orthogonal transformation is obtained by multiplying the orthogonal transformations along the path in the following order:

$$O_{j_t j_{t-1}}\, O_{j_{t-1} j_{t-2}} \cdots O_{j_1 j_0}. \qquad (3.11)$$

Every path from $j$ to $i$ may therefore result in a different transformation. This is analogous to the parallel transport operator from differential geometry, which depends on the path connecting two points whenever the manifold has curvature (e.g., the sphere). Thus, when adding transformations of different paths, cancellations may happen. We would like to define the affinity between $i$ and $j$ as the consistency between these transformations, with higher affinity expressing more agreement among the transformations that are being averaged. To quantify this affinity, we consider again the matrix $D^{-1}S$, which is similar to the symmetric matrix

$$\tilde S = D^{-1/2} S D^{-1/2} \qquad (3.12)$$

through $D^{-1}S = D^{-1/2}\tilde S D^{1/2}$, and we define the affinity between $i$ and $j$ as $\|\tilde S^{2t}(i,j)\|_{HS}^2$, that is, as the squared HS norm of the $d \times d$ block $\tilde S^{2t}(i,j)$, which takes into account all paths of length $2t$ connecting $i$ and $j$, where $t$ is a positive integer. In a sense, $\|\tilde S^{2t}(i,j)\|_{HS}^2$ measures not only the number of paths of length $2t$ connecting $i$ and $j$ but also the amount of agreement between their transformations. That is, for a fixed number of paths, $\|\tilde S^{2t}(i,j)\|_{HS}^2$ is larger when the path transformations are in agreement, and is smaller when they differ.

Since $\tilde S$ is symmetric, it has a complete set of eigenvectors $v_1, v_2, \ldots, v_{nd}$ and real eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_{nd}$. We order the eigenvalues in decreasing order of magnitude, $|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_{nd}|$. The spectral decompositions of $\tilde S$ and $\tilde S^{2t}$ are given blockwise by

$$\tilde S(i,j) = \sum_{l=1}^{nd}\lambda_l\, v_l(i)\, v_l(j)^T \quad\text{and}\quad \tilde S^{2t}(i,j) = \sum_{l=1}^{nd}\lambda_l^{2t}\, v_l(i)\, v_l(j)^T, \qquad (3.13)$$

where $v_l(i) \in \mathbb{R}^d$ denotes the $i$'th $d$-dimensional block of the eigenvector $v_l$, for $i = 1, \ldots, n$ and $l = 1, \ldots, nd$. The HS norm of $\tilde S^{2t}(i,j)$ is calculated using the trace:

$$\left\|\tilde S^{2t}(i,j)\right\|_{HS}^2 = \operatorname{Tr}\!\left[\tilde S^{2t}(i,j)\,\tilde S^{2t}(i,j)^T\right] = \sum_{l,r=1}^{nd}(\lambda_l\lambda_r)^{2t}\,\langle v_l(i), v_r(i)\rangle\,\langle v_l(j), v_r(j)\rangle. \qquad (3.14)$$

It follows that the affinity $\|\tilde S^{2t}(i,j)\|_{HS}^2$ is an inner product for the finite dimensional Hilbert space $\mathbb{R}^{(nd)^2}$ via the mapping $V_t$:

$$V_t : i \mapsto \left((\lambda_l\lambda_r)^{t}\,\langle v_l(i), v_r(i)\rangle\right)_{l,r=1}^{nd}. \qquad (3.15)$$

That is,

$$\langle V_t(i), V_t(j)\rangle = \left\|\tilde S^{2t}(i,j)\right\|_{HS}^2. \qquad (3.16)$$

Note that in the manifold learning setup, the embedding $i \mapsto V_t(i)$ is invariant to the choice of basis for $T_{x_i}\mathcal{M}$ because the dot products $\langle v_l(i), v_r(i)\rangle$ are invariant to orthogonal transformations. We refer to $V_t$ as the vector diffusion mapping.
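Computationally, the vector diffusion mapping amounts to an eigendecomposition of $\tilde S = D^{-1/2}SD^{-1/2}$. The sketch below (our own arrangement, with hypothetical helper names) flattens the matrix of products $(\lambda_l\lambda_r)^t\langle v_l(i), v_r(i)\rangle$ into $V_t(i)$ and, in the commented check, verifies numerically that $\langle V_t(i), V_t(j)\rangle$ reproduces the affinity $\|\tilde S^{2t}(i,j)\|_{HS}^2$ of (3.16).

```python
import numpy as np

def vector_diffusion_map(S, deg, d, t, m=None):
    """Return V so that V[i] is the flattened vector diffusion mapping V_t(i),
    using the top m eigenpairs of S_tilde = D^{-1/2} S D^{-1/2} (all nd if m is None).
    t is assumed to be a positive integer."""
    n = len(deg)
    D_inv_sqrt = np.repeat(1.0 / np.sqrt(deg), d)          # diagonal of D^{-1/2}
    S_tilde = D_inv_sqrt[:, None] * S * D_inv_sqrt[None, :]
    lam, v = np.linalg.eigh(S_tilde)
    order = np.argsort(-np.abs(lam))                       # decreasing |lambda|
    lam, v = lam[order], v[:, order]
    if m is not None:
        lam, v = lam[:m], v[:, :m]
    blocks = v.reshape(n, d, -1)                           # blocks[i] = [v_1(i), ..., v_m(i)]
    V = np.empty((n, lam.size * lam.size))
    for i in range(n):
        G = blocks[i].T @ blocks[i]                        # Gram matrix <v_l(i), v_r(i)>
        V[i] = ((np.outer(lam, lam) ** t) * G).ravel()
    return V, lam, S_tilde

# Consistency check (for m = None): <V_t(i), V_t(j)> equals ||S_tilde^{2t}(i,j)||_HS^2.
# V, lam, S_tilde = vector_diffusion_map(S, deg, d, t=2)
# P = np.linalg.matrix_power(S_tilde, 4)                   # 2t = 4
# i, j = 0, 1
# assert np.isclose(V[i] @ V[j],
#                   np.linalg.norm(P[i*d:(i+1)*d, j*d:(j+1)*d], 'fro') ** 2)
```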

From the symmetry of the dot products, $\langle v_l(i), v_r(i)\rangle = \langle v_r(i), v_l(i)\rangle$, it is clear that $\|\tilde S^{2t}(i,j)\|_{HS}^2$ is also an inner product for the finite dimensional Hilbert space $\mathbb{R}^{nd(nd+1)/2}$ corresponding to the mapping

$$i \mapsto \left(c_{lr}\,(\lambda_l\lambda_r)^{t}\,\langle v_l(i), v_r(i)\rangle\right)_{1\le l\le r\le nd},$$

where

$$c_{lr} = \begin{cases} 1 & l = r, \\ \sqrt{2} & l < r. \end{cases}$$

We define the symmetric vector diffusion distance $d_{VDM,t}(i,j)$ between nodes $i$ and $j$ as

$$d_{VDM,t}^2(i,j) = \langle V_t(i), V_t(i)\rangle + \langle V_t(j), V_t(j)\rangle - 2\langle V_t(i), V_t(j)\rangle. \qquad (3.17)$$

The matrices $D + S$ and $D - S$ are positive semidefinite due to the following identity:

$$v^T(D \pm S)v = \frac{1}{2}\sum_{(i,j)\in E} w_{ij}\,\|v(i) \pm O_{ij}v(j)\|^2 \ \ge\ 0 \qquad (3.18)$$

for any $v \in \mathbb{R}^{nd}$ (the sum is over ordered pairs). As a consequence, all eigenvalues $\lambda_l$ of $\tilde S = D^{-1/2}SD^{-1/2}$ reside in the interval $[-1,1]$. In particular, for large enough $t$, most terms of the form $(\lambda_l\lambda_r)^{2t}$ in (3.14) are close to 0, and $\|\tilde S^{2t}(i,j)\|_{HS}^2$ can be well approximated by using only the few largest eigenvalues and their corresponding eigenvectors. This lends itself to an efficient approximation of the vector diffusion distances of (3.17), and it is not necessary to raise the matrix $\tilde S$ to its $2t$ power (which usually results in dense matrices). Thus, for any $\delta > 0$, we define the truncated vector diffusion mapping $V_t^\delta$ that embeds the data set in $\mathbb{R}^{m^2}$ (or equivalently, but more efficiently, in $\mathbb{R}^{m(m+1)/2}$) using the eigenvectors $v_1, \ldots, v_m$ as

$$V_t^\delta : i \mapsto \left((\lambda_l\lambda_r)^{t}\,\langle v_l(i), v_r(i)\rangle\right)_{l,r=1}^{m}, \qquad (3.19)$$

where $m$ is the largest integer for which $(|\lambda_m|/|\lambda_1|)^{2t} > \delta$ and $(|\lambda_{m+1}|/|\lambda_1|)^{2t} \le \delta$.

We remark that we define $V_t^\delta$ through $\tilde S^{2t}$ rather than through $\tilde S^{t}$, because we cannot guarantee that in general all eigenvalues of $\tilde S$ are non-negative. In Section LABEL:gdd we show that in the continuous setup of the manifold learning problem all eigenvalues are non-negative. We anticipate that for most practical applications that correspond to the manifold assumption, all negative eigenvalues (if any) would be small in magnitude, falling below the truncation threshold. In such cases, one can use any real $t > 0$ for the truncated vector diffusion map $V_t^\delta$.
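The truncation rule can be implemented on top of the previous sketch; the helper name and the sample values of $t$ and $\delta$ are our own.

```python
import numpy as np

def truncation_rank(lam, t, delta):
    """Largest m such that (|lambda_m| / |lambda_1|)^(2t) > delta,
    for eigenvalues lam already sorted by decreasing magnitude."""
    ratios = (np.abs(lam) / np.abs(lam[0])) ** (2 * t)
    return int(np.count_nonzero(ratios > delta))

# Usage with the vector_diffusion_map sketch above:
# _, lam, _ = vector_diffusion_map(S, deg, d, t=2)
# m = truncation_rank(lam, t=2, delta=0.1)
# V_trunc, _, _ = vector_diffusion_map(S, deg, d, t=2, m=m)
```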

4 Normalized Vector Diffusion Mappings

It is also possible to obtain slightly different vector diffusion mappings by using different normalizations of the matrix $S$. These normalizations are similar to the ones used in the diffusion map framework [9]. For example, notice that the vectors

$$\tilde v_l = D^{-1/2} v_l, \qquad l = 1, \ldots, nd, \qquad (4.1)$$

are the right eigenvectors of $D^{-1}S$, that is, $(D^{-1}S)\tilde v_l = \lambda_l \tilde v_l$. We can thus define another vector diffusion mapping, denoted $\tilde V_t$, as

$$\tilde V_t : i \mapsto \left((\lambda_l\lambda_r)^{t}\,\langle \tilde v_l(i), \tilde v_r(i)\rangle\right)_{l,r=1}^{nd}. \qquad (4.2)$$

From (4.1) it follows that $\tilde V_t$ and $V_t$ satisfy the relations

$$\tilde V_t(i) = \frac{V_t(i)}{\deg(i)} \qquad (4.3)$$

and

$$\langle \tilde V_t(i), \tilde V_t(j)\rangle = \frac{\left\|\tilde S^{2t}(i,j)\right\|_{HS}^2}{\deg(i)\deg(j)}. \qquad (4.4)$$

As a result,

$$\langle \tilde V_t(i), \tilde V_t(j)\rangle = \frac{\left\|(D^{-1}S)^{2t}(i,j)\right\|_{HS}^2}{\deg(j)^2}. \qquad (4.5)$$

In other words, the Hilbert-Schmidt norm of the block $(D^{-1}S)^{2t}(i,j)$ leads to an embedding of the data set in a Hilbert space only upon proper normalization by the vertex degrees (similar to the normalization by the vertex degrees in (3.8) and (3.10) for the diffusion map). We define the associated vector diffusion distances as

$$\tilde d_{VDM,t}^2(i,j) = \left\|\tilde V_t(i) - \tilde V_t(j)\right\|^2. \qquad (4.6)$$

By (4.3), the two families of distances are related through the vertex degrees: $\tilde d_{VDM,t}(i,j) = \left\|\frac{V_t(i)}{\deg(i)} - \frac{V_t(j)}{\deg(j)}\right\|$.

We comment that the normalized mappings $\frac{V_t(i)}{\|V_t(i)\|}$ and $\frac{\tilde V_t(i)}{\|\tilde V_t(i)\|}$, which map the data points to the unit sphere, are equivalent, that is,

$$\frac{V_t(i)}{\|V_t(i)\|} = \frac{\tilde V_t(i)}{\|\tilde V_t(i)\|}. \qquad (4.7)$$

This means that the angles between pairs of embedded points are the same for both mappings. For the diffusion map, it has been observed that in some cases the distances $\left\|\frac{\Phi_t(i)}{\|\Phi_t(i)\|} - \frac{\Phi_t(j)}{\|\Phi_t(j)\|}\right\|$ are more meaningful than the distances $\|\Phi_t(i) - \Phi_t(j)\|$ (see, for example, [17]). This may also suggest the usage of the distances $\left\|\frac{V_t(i)}{\|V_t(i)\|} - \frac{V_t(j)}{\|V_t(j)\|}\right\|$ in the VDM framework.

Another important family of normalized diffusion mappings is obtained by the following procedure. Suppose $0 \le \alpha \le 1$, and define the symmetric matrices $W_\alpha$ (of size $n \times n$) and $S_\alpha$ (of size $nd \times nd$) as

$$W_\alpha = D^{-\alpha} W D^{-\alpha} \qquad (4.8)$$

and

$$S_\alpha = D^{-\alpha} S D^{-\alpha}, \qquad (4.9)$$

where in (4.8) $D$ denotes the $n \times n$ diagonal degree matrix of (3.7) and in (4.9) it denotes the $nd \times nd$ block diagonal matrix of (3.2). We define the weighted degrees corresponding to $W_\alpha$ by

$$\deg_\alpha(i) = \sum_{j:(i,j)\in E} W_\alpha(i,j),$$

the $n \times n$ diagonal matrix $D_\alpha$ as

$$D_\alpha(i,i) = \deg_\alpha(i), \qquad (4.10)$$

and the $nd \times nd$ block diagonal matrix $D_\alpha$ (with blocks of size $d \times d$) as

$$D_\alpha(i,i) = \deg_\alpha(i)\, I_{d\times d}. \qquad (4.11)$$

We can then use the matrices $S_\alpha$ and $D_\alpha$ (instead of $S$ and $D$) to define the vector diffusion mappings $V_{\alpha,t}$ and $\tilde V_{\alpha,t}$. Notice that for $\alpha = 0$ we have $W_0 = W$ and $S_0 = S$, so that $V_{0,t} = V_t$ and $\tilde V_{0,t} = \tilde V_t$. The case $\alpha = 1$ turns out to be especially important, as discussed in the next section.
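A sketch of the $\alpha$-normalization: divide by $\deg(i)^{\alpha}\deg(j)^{\alpha}$, recompute the degrees, and feed $S_\alpha$ and $\deg_\alpha$ into the same pipeline; the helper name is ours, and $\alpha = 1$ is the density-invariant choice discussed in the next section.

```python
import numpy as np

def alpha_normalize(W, S, d, alpha=1.0):
    """Return S_alpha, the new degrees deg_alpha, and W_alpha, following (4.8)-(4.11).
    Assumes W has no isolated vertices (all degrees positive)."""
    deg = W.sum(axis=1)
    scale = deg ** (-alpha)
    W_alpha = scale[:, None] * W * scale[None, :]               # D^-alpha W D^-alpha
    scale_blocks = np.repeat(scale, d)
    S_alpha = scale_blocks[:, None] * S * scale_blocks[None, :] # D^-alpha S D^-alpha
    deg_alpha = W_alpha.sum(axis=1)                             # weighted degrees of W_alpha
    return S_alpha, deg_alpha, W_alpha

# Usage with the earlier sketches:
# S1, deg1, _ = alpha_normalize(W, S, d, alpha=1.0)
# V1, lam1, _ = vector_diffusion_map(S1, deg1, d, t=2)
```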

5 Convergence to the connection-Laplacian

For diffusion maps, the discrete random walk over the data points converges to a continuous diffusion process over the manifold $\mathcal{M}$ in the limit $n \to \infty$ and $\epsilon \to 0$. This convergence can be stated in terms of the normalized graph Laplacian $L$ given by

$$L = D^{-1}W - I.$$

In the case where the data points are sampled independently from the uniform distribution over $\mathcal{M}$, the graph Laplacian converges pointwise to the Laplace-Beltrami operator, as stated in the following proposition [26, 3, 35, 20]: if $f$ is a smooth function on $\mathcal{M}$, then with high probability

$$\frac{1}{\epsilon}\sum_{j=1}^{n} L_{ij}\, f(x_j) = \frac{m_2}{2m_0}\,\Delta f(x_i) + O(\epsilon) + O\!\left(\frac{1}{n^{1/2}\epsilon^{1/2+d/4}}\right), \qquad (5.1)$$

where $\Delta$ is the Laplace-Beltrami operator on $\mathcal{M}$ and $m_0, m_2$ are the zeroth and second moments of the kernel $K$. The error consists of two terms: a bias term, which is $O(\epsilon)$, and a variance term that decreases as $n \to \infty$ but also depends on $\epsilon$. Balancing the two terms may lead to an optimal choice of the parameter $\epsilon$ as a function of the number of points $n$. In the case of uniform sampling, Belkin and Niyogi [4] have shown that the eigenvectors of the graph Laplacian converge to the eigenfunctions of the Laplace-Beltrami operator on the manifold, which is stronger than the pointwise convergence given in (5.1).
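As a quick toy illustration of this type of convergence (our own check, not one of the paper's experiments), one can sample the unit circle uniformly, build the normalized graph Laplacian from truncated Gaussian weights, and verify that applying it to $f(\theta) = \cos\theta$ yields approximately a negative multiple of $\cos\theta$, consistent with $\Delta f = -f$ on the circle; the proportionality constant depends on the kernel moments.

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps = 2000, 0.01
theta = rng.uniform(0, 2 * np.pi, n)
X = np.c_[np.cos(theta), np.sin(theta)]          # uniform samples of S^1 in R^2

# Truncated Gaussian weights (same form as in Section 2).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.where((sq > 0) & (sq < eps), np.exp(-sq / eps), 0.0)
deg = W.sum(axis=1)

f = np.cos(theta)
Lf = (W @ f) / deg - f                           # (D^{-1} W - I) f
ratio = (Lf / eps) / (-f)                        # should be roughly constant
# np.median(ratio) approximates the kernel-dependent constant multiplying the
# Laplace-Beltrami operator; the spread of `ratio` shrinks as n grows and eps
# decreases at an appropriate rate.
```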

In the case where the data points are independently sampled from a probability density function $p$ whose support is a $d$-dimensional manifold $\mathcal{M}$ and which satisfies some mild conditions, the graph Laplacian converges pointwise to a Fokker-Planck operator, as stated in the following proposition [26, 3, 35, 20]: if $f$ is a smooth function on $\mathcal{M}$, then with high probability

$$\frac{1}{\epsilon}\sum_{j=1}^{n} L_{ij}\, f(x_j) = \frac{m_2}{2m_0}\Big(\Delta f(x_i) - 2\nabla U(x_i)\cdot\nabla f(x_i)\Big) + O(\epsilon) + O\!\left(\frac{1}{n^{1/2}\epsilon^{1/2+d/4}}\right), \qquad (5.2)$$

where the potential term is given by $U = -\log p$. The error is interpreted in the same way as in the uniform sampling case. In [9] it is shown that it is possible to recover the Laplace-Beltrami operator also for non-uniform sampling processes using $W_1$ and $D_1$ (that correspond to $\alpha = 1$ in (4.8) and (4.10)). The matrix $\frac{1}{\epsilon}\left(D_1^{-1}W_1 - I\right)$ converges to the Laplace-Beltrami operator independently of the sampling density function $p$.

For VDM, we prove in Appendix LABEL:proof the following theorem, Theorem LABEL:summary, which states that the matrix $\frac{1}{\epsilon}\left(D_\alpha^{-1}S_\alpha - I\right)$, where $0 \le \alpha \le 1$, converges to the connection-Laplacian operator (defined via the covariant derivative; see Appendix LABEL:setupbackground and [32]) plus some potential terms depending on the sampling density $p$ and on $\alpha$. In particular, for $\alpha = 1$, the matrix $\frac{1}{\epsilon}\left(D_1^{-1}S_1 - I\right)$ converges to the connection-Laplacian operator, without any additional potential terms. Using the terminology of spectral graph theory, it may thus be appropriate to call $I - D^{-1}S$ the connection-Laplacian of the graph.

The main content of Theorem LABEL:summary specifies the way in which VDM generalizes diffusion maps: while the diffusion mapping is based on the heat kernel and the Laplace-Beltrami operator over scalar functions, VDM is based on the heat kernel and the connection-Laplacian over vector fields. While for diffusion maps the computed eigenvectors are discrete approximations of the eigenfunctions of the Laplace-Beltrami operator, for VDM the $l$-th eigenvector $v_l$ of $\tilde S$ is a discrete approximation of the $l$-th eigen-vector-field $X_l$ of the connection-Laplacian $\nabla^2$ over $\mathcal{M}$, which satisfies $\nabla^2 X_l = -\lambda_l X_l$ for some $\lambda_l \ge 0$.

In the formulation of Theorem LABEL:summary, as well as in the remainder of the paper, we slightly change the notation used so far: we denote the observed data points in $\mathbb{R}^p$ by $\iota(x_1), \ldots, \iota(x_n)$, where $\iota : \mathcal{M} \to \mathbb{R}^p$ is the embedding of the Riemannian manifold $\mathcal{M}$ in $\mathbb{R}^p$ and $x_1, \ldots, x_n$ are viewed as points of $\mathcal{M}$. Furthermore, we denote by $\iota_* T_{x_i}\mathcal{M}$ the $d$-dimensional subspace of $\mathbb{R}^p$ obtained as the image of the tangent plane $T_{x_i}\mathcal{M}$ under the embedding.