A regression approach for explaining manifold embedding coordinates
Abstract
Manifold embedding algorithms map high dimensional data, down to coordinates in a much lower dimensional space. One of the aims of the dimension reduction is to find the intrinsic coordinates that describe the data manifold. However, the coordinates returned by the embedding algorithm are abstract coordinates. Finding their physical, domain related meaning is not formalized and left to the domain experts. This paper studies the problem of recovering the domainspecific meaning of the new low dimensional representation in a semiautomatic, principled fashion. We propose a method to explain embedding coordinates on a manifold as nonlinear compositions of functions from a userdefined dictionary. We show that this problem can be set up as a sparse linear Group Lasso recovery problem, find sufficient recovery conditions, and demonstrate its effectiveness on data.
Manifold learning algorithms, also known as embedding algorithms, map data from high dimensional, possibly infinite dimensional spaces down to coordinates in a much lower dimensional space. In the sciences, one of the goals of dimension reduction is the discovery of descriptors of the data generating process. Both linear dimension reduction algorithms like Principal Component Analysis (PCA) and nonlinear algorithms such as Diffusion Maps [CL06] are used in applications from genetics to chemistry to uncover collective coordinates describing large scale properties of the interrogated system. For example, in chemistry, a common objective is to discover socalled reaction coordinates describing evolution of molecular configurations [CNO00, NC17] .
For example, Figure 1 shows the toluene molecule (), consisting of atoms. By Molecular Dynamics (MD) simulation, configurations of this molecule are generated in dimensions. They are then mapped into dimensions by a manifold learning algorithm. The figure shows that this configuration space is well approximated by a onedimensional manifold, and this manifold is parametrized by a geometric quantity, the torsion of the bond connecting the methyl group with the benzene group. Torsions are dihedral angles in tetrahedra with vertices at atoms of the molecule. In a molecule, any two atoms which are not too distant interact; hence the toluene molecule has many more interactions than the graph edges presented in Figure 1. Our problem is to select the torsion that explains the onedimensional manifold of toluene configurations out of the many torsions to be considered.
Finding the meaning behind the coordinates output by an embedding algorithm through comparison to problemrelevant covariates is usually done via visual inspection by human experts. This paper introduces a regression method for automatically establishing such relationships between the abstract coordinates output by a manifold learning algorithm and functions of the data that have meaning in the domain of the problem.
The next section defines the problem formally, Section 2 presents useful background in manifold estimation, while Sections 3, 4 and 5 develop our method. Section 3 is of independent interest as it introduces Functional Lasso (FLasso), a novel method for sparsely representing a function as a nonlinear composition of functions from a dictionary. Section 4 introduces a new differentialgeometric method for estimating the gradient of a function defined on a manifold. These two methods are combined to obtain our main algorithm, ManifoldLasso in Section 5. The relationship to previous work is discussed in Section 6. Section 8 presents experiments with FLasso and ManifoldLasso, respectively. Recovery results are presented in Section 7.
1 Problem formulation, assumptions and challenges
We make the standard assumption that the observed data are sampled i.i.d. from a smooth manifold ^{1}^{1}1The reader is referred to [Lee03] for the definitions of the differential geometric terms used in this paper. of intrinsic dimension smoothly embedded in by the inclusion map. In this paper, we will call smooth any function or manifold of class at least . We assume that the intrinsic dimension of is known; for example, by having been estimated previously by one method in [KvL15]. The manifold is a Riemannian manifold with Riemannian metric inherited from the ambient space . Furthermore, we assume the existence of a smooth embedding map , where typically . We call the coordinates in this dimensional ambient space the embedding coordinates; let . In practice, the mapping of the data onto represents the output of an embedding algorithm, and we only have access to and via and its image .
In addition, we are given a dictionary of userdefined and domainrelated smooth functions . Our goal is to express the embedding coordinate functions in terms of functions in .
More precisely, we assume that with a smooth function of variables, defined on a open subset of containing the range of . Let , and . The problem is to discover the set such that . We call the functional support of , or the explanation for the manifold in terms of . For instance, in the toluene example, the functions in are all the possible torsions in the molecule, , and is the explanation for the 1dimensional manifold traced by the configurations.
Indeterminacies
In differential geometric terms, the explanation is strongly related to finding a parametrization of . Hence, . Since the function given by the embedding algorithm is not unique, the function cannot be uniquely determined. Moreover, whenever where is a smooth monotonic function, the support may not be unique either.
In Section 7 we give sufficient and necessary conditions under which can be recovered uniquely; intuitively, they consist of functional independencies between the functions in . For instance, it is sufficient to assume that that the dictionary is a functionally independent set, i.e. there is no that can be obtained as a smooth function of other functions in .
Hence, in this paper we resolve the indeterminacies w.r.t. the support , and we limit our scope to recovering . We leave to future work the problem of recovering information on how the functions combine to explain .
2 Background on manifold estimation: neighborhood graph, Laplacian and estimating tangent subspaces
Manifold learning and intrinsic geometry
Suppose we observe data points that are sampled from a smooth dimensional submanifold . The task of manifold learning is to provide a diffeomorphism where . The Whitney Embedding Theorem [Lee03] guarantees the existence of a map satisfying this property with . Hence, a good manifold learner will identify a smooth map with .
The neighborhood graph and kernel matrix
The neighborhood graph is a data structure that associates to each data point its set of neighbors , where is a neighborhood radius parameter. The neighborhood relation is symmetric, and determines an undirected graph with nodes represented by the data points .
Closely related to the neighborhood graph are the local position matrices , local embedding coordinate matrices , and the kernel matrix whose elements are
(1) 
Typically, the radius and the bandwidth parameter are related by with a small constant greater than 1. This ensures that is close to its limit when while remaining sparse, with sparsity structure induced by the neighborhood graph. Rows of this matrix will be denoted to emphasize that when a particular row is passed to an algorithm, only values need to be passed. The neighborhood graph, local position matrices, and kernel matrix play crucial roles in our manifold estimation tasks.
Estimating tangent spaces in the ambient space
The tangent subspace at point in the data can be estimated by Weighted Local Principal Component Analysis, as described in the LocalPCA algorithm. The output of this algorithm is an orthogonal matrix , representing a basis for . For this algorithm and others we define the SVD algorithm of a symmetrical matrix as outputting , where and are the largest eigenvalues and their eigenvectors, respectively, and denote a column vector of ones of length by . We denote by .
The renormalized graph Laplacian
The renormalized graph Laplacian, also known as the sample Laplacian, or Diffusion Maps Laplacian , constructed by Laplacian , converges to the manifold Laplace operator ; [CL06] shows that this estimator is unbiased w.r.t. the sampling density on (see also [HAvL05, HAvL07, THJ10]). The Laplacian is a sparse matrix; row of contains nonzeros only for . Thus, as for , rows of this matrix will be denoted . Again, sparsity pattern is given by the neighborhood graph; construction of the neighborhood graph is therefore the computationally expensive component of this algorithm.
The principal eigenvectors of the Laplacian (or alternatively of the matrix of Algorithm Laplacian), corresponding to its smallest eigenvalues, are sometimes used as embedding coordinates of the data; the embedding obtained is known as the Diffusion Map [CL06] or the Laplacian Eigenmap [BN02] of . We use this embedding approach for convenience, but in general, any algorithm which asymptotically generates a smooth embedding is acceptable.
The pushforward Riemannian metric
Geometric quantities such as angles and lengths of vectors in the tangent bundle , distances along curves in , etc., are captured by Riemannian geometry. We assume that is a Riemannian manifold, with the metric induced from . Furthermore, we associate with a Riemannian metric which preserves the geometry of . This metric is called the pushforward Riemannian metric and is defined by
(2) 
In the above, maps vectors from to , and is the Euclidean scalar product.
For each , the associated pushforward Riemannian metric expressed in the coordinates of , is a symmetric, semipositive definite matrix of rank . The scalar product takes the form .
The matrices can be estimated by the algorithm RMetric of [PM13]. The algorithm uses only local information, and thus can be run efficiently. For notational simplicity, we refer to the embedding coordinates of the points in a neighborhood of a point as .
3 Flasso: sparse nonlinear functional support recovery
Our approach for explaining manifold coordinates is predicated on the following general method for decomposing a realvalued function over a given dictionary . The main idea is that when , their differentials at any point are in the linear relationship . This is expressed more precisely as follows.
Proposition 1 (Leibniz Rule)
Let with . All maps are assumed to be smooth. Then at every point,
(3) 
or, in matrix notation
Given samples , dictionary , and function , we would like to discover that is a function of without knowing . The usefulness of this decomposition is that even if is a nonlinear function of , its gradient is a linear combination of the gradients at every point. That is, proposition 1 transforms the functional, nonlinear sparse recovery problem into a set of linear sparse recovery problems, one for each data point.
Formulating the problem as Group Lasso
Let and , for . These vectors in could be estimated, or available analytically. Then, by (3) we construct the following linear model
(4) 
In the above, is the vector . If there exists some such that holds, then nonzero s are estimations of for and zeros indicate . Hence, in , only elements are nonzero. The term is added to account for noise or model misspecification.
The key characteristic of the functional support that we leverage is that the same set of elements will be nonzero for all ’s. Denote by the vector . Since finding the set for which , given and is underdetermined when for example , we use Group Lasso [YL06] regularization in order to zero out as many vectors as possible. In particular, each group has size . Thus, the support recovery can be reformulated as the globally convex optimization problem
(5) 
with a regularization parameter. This framework underscores why we normalize gradients by ; else we would favor dictionary functions that are scaled by large constants, since their coefficients would be smaller. Note that normalization of is not necessary since it rescales all estimated partials the same. We call problem (5) Functional Lasso (FLasso). Experiments on synthetic data in Section 8 illustrate the effectiveness of this algorithm.
Multidimensional FLasso
The FLasso extends easily to vector valued functions . To promote finding a common support for all embedding coordinates, groups will overlap the embedding coordinates. Let
(6) 
and
(7) 
Then the FLasso objective function becomes
(8) 
In the above, is the vector of regression coefficients corresponding to the effect of function , which form group in the multivariate group Lasso problem. The size of each group is .
4 Explaining manifold coordinates with FLasso on the tangent bundle
To explain the coordinates with the FLasso method of Section 3, we need (i) to extend FLasso to functions defined on a manifold , and (ii) to reliably estimate their differentials in an appropriate coordinate system.
These tasks require bringing various “gradients” values into the same coordinate system. For function defined on a neighborhood of in , we denote the usual gradient . Similarly for a function defined on a neighborhood of in , the gradient is denoted . To define the gradient of a function on a manifold, we recall that the differential of a scalar function on at point is a linear functional . The gradient of at , denoted is defined by for any . To distinguish between gradients in the tangent bundles of , respectively , we denote the former by and the latter by .
In this section, we will be concerned with the coordinate functions seen as functions on , and with the gradients and of the dictionary functions. It is easy to see that for a choice of basis in ,
(9) 
4.1 FLasso for vectorvalued functions on
We start from the FLasso objective function for dimensional vector valued functions (8). Since differentials of functions on a manifold operate on the tangent bundle , the definitions of and from (FLasso) are replaced with the values of the respective gradients in . For the moment we assume that these are given. In Section 4.2 we will concern ourselves with estimating the parameters of from the data, while is given in (12) below.
For each , we fix an orthogonal basis in , and denote the orthogonal matrix representing the respective basis vectors by . Hence, any vector in can be expressed as by its coordinates in the chosen basis, or as . For any vector in , represents the coordinates of its projection onto ; in particular, if , . These elementary identities are reflected in (9) above. Let
(10) 
and
(11) 
In the above, the columns of are the coordinates of in the chosen basis of ; is the vector of regression coefficients corresponding to the effect of function , which form group in the multivariate group Lasso problem. The size of each group is . By comparing (6) and (7) with (10) and (11) we see, first, the change in the expression of due to projection. We also note that in the definition of the normalization is omitted. For the present, we will assume that the dictionary functions have been appropriately normalized; how to do this is deferred to Section 5.
To obtain the matrices , we project the gradients of the dictionary functions onto the tangent bundle, i.e.
(12) 
The FLasso objective function is
(13) 
an expression identical to (8), but in which denote the quantities in (10) and (12). To note that is invariant to the change of basis . Inded, let be a different basis, with a unitary matrix. Then, , , and . The next section is concerned with estimating the gradients from the data.
4.2 Estimating the coordinate gradients by pullback
Since is implicitly determined by a manifold embedding algorithm, the gradients of are in general not directly available, and is known only through its values at the data points. We could estimate these gradients naively from differences between neighboring points, but we choose a differential geometric method inspired by [LSW09] and [PM13], that pulls back the differentials of from to , thus enabling them to be compared in the same coordinate system with the derivatives .
If the range of were , then the gradient of a coordinate function would be trivially equal to the th basis vector, and . If , by projecting the rows of onto the tangent subspace , we get . We then pull back the resulting vectors to to get , as defined in (10).
Pulling back into
If the embedding induced by were an isometry, the estimation of could be performed by LocalPCA, and the pullback following [PM13]. Here we do not assume that is isometric. The method we introduce exploits the fact that, even when is not an isometry, the distortion induced by can be estimated and corrected [PM13]. More precisely, we make use of the RMetric algorithm described in Section refsec:background to estimate for each , the associated pushforward Riemannian metric, expressed in the coordinates of by , a symmetric, semipositive definite matrix of rank . Additionally, since the theoretical rank of equals , the principal eigenvectors of represent (an estimate of) an orthonormal basis of .
We now apply (2) to the set of vectors for . Let
(14) 
and as defined in the previous section. Then, (2) implies
(15) 
The equality is approximate to first order because the operator above is a first order approximation to the logarithmic map ; the error tends to 0 with the neighborhood radius [dC92]. If is full rank , for , we can obtain by leastsquares as
(16) 
This equation recovers in the local basis of .
If we desired to obtain in the global coordinates, it suffices to express the columns of as vectors in (or equivalently to apply the appropriate linear transformation to ).
Algorithm PullBackDPhi summarizes the above steps. For the estimation of , the Laplacian must be available. Note also that in obtaining (15) explicit projection on is not necessary, because any component orthogonal to this subspace is in the null space of .
5 The full ManifoldLasso algorithm
We are now ready to combine the algorithms of Section 2 with the results of Sections 3 and 4 into the main algorithm of this paper. Algorithm ManifoldLasso takes as input data sampled near an unknown manifold , a dictionary of functions defined on (or alternatively on an open subset of the ambient space that contains ) and an embedding in . The output of ManifoldLasso is a set of indices in , representing the functions that explain .
Computation
The first two steps of ManifoldLasso are construction of the neighborhood graph, and estimation of the Laplacian . As shown in Section 2, is a sparse matrix, hence RMetric can be run efficiently by only passing values corresponding to one neighborhood at a time. Note that in our examples and experiments, Diffusion Maps is our chosen embedding algorithm, so the neighborhoods and Laplacian are already available, though in general this is not the case.
The second part of the algorithm estimates the gradients and constructs matrices . The gradient estimation runtime is where is the number of edges in the neighborhood graph, using Cholesky decompositionbased solvers. Finally, the last step is a call to the GroupLasso, which estimates the support of . The computation time of each iteration in GroupLasso is . For large data sets, one can perform the “for” loop over a subset of the original data while retaining the geometric information from the full data set. This replaces the in the computation time with the smaller factor .
Normalization
In Group Lasso, the columns of the matrix corresponding to each group are rescaled to have unit Euclidean norm. This ensures that the FLasso algorithm will be invariant to rescaling of dictionary functions by a constant. Since any multiplication of by a nonzero constant, simultaneously with dividing its corresponding by the same constant leaves the reconstruction error of all ’s invariant, but affects the norm , rescaling will favor the dictionary functions that are scaled by large constants. Therefore, the relative scaling of the dictionary functions can influence the support recovered.
When the dictionary functions are defined on , but not outside . In this case we follow the standard Group Lasso recipe. We calculate the normalizing constant
(17) 
The above is the finite sample version of , integrated w.r.t. the data density on . Then we set which has the same effect as the normalization for FLasso in Section 3.
In the case when the dictionary functions are defined on a neighborhood around in , we compute the normalizing constant with respect to , that is
(18) 
Then, once again, we set . Doing this favors the dictionary functions whose gradients are parallel to the manifold , and penalizes the ’s which have large gradient components perpendicular to .
Tuning
As the tuning parameter is increased, the cardinality of the support decreases. Tuning parameters are often selected by crossvalidation in Lassotype problems, but this is not possible in our unsupervised setting. We base our choice of on matching the cardinality of the support to . The recovery results proved in Section 7.2 state that for a single chart, functionally independent dictionary functions suffice.
In manifolds which cannot be represented by a single chart, the situation is more complicated, and it depends on the topology of the codomains of the functions . For example, in the toluene data presented in Section 8, the manifold is a circle. This cannot be covered by a single chart, but it can be parametrized by a single function in the dictionary, the angle of the bond torsion in Figure 5. Another case when is the case when no single dictionary function can parametrize the manifold. When is not a functionally independent set, it is possible that the parametrization of by is not unique. Hence, in general, the support size may be equal to or larger.
5.1 Variants and extensions
The approach utilized here can be extended in several interesting ways. First, our current approach explains the embedding coordinates produced by a particular embedding algorithm. However, the same approach can be used to directly explain the tangent subspace of , independently of any embedding.
Second, one could set up FLasso problems that explain a single coordinate function. In general, manifold coordinates do not have individual meaning, so it will not be always possible to find a good explanation for a single . However, Figure 6 shows that for the ethanol molecule, whose manifold is a torus, there exists a canonical system of coordinates in which each coordinate is explained by one torsion.
Third, manifold codomains of can be used, eliminating certain issues of continuity and chartcounting.
Finally, when the gradients of the dictionary functions are not analytically available, they can also be estimated from data.
6 Related work
To our knowledge, ours is the first solution to estimating a function as a nonlinear sparse combination of functions in a dictionary. Below we cite some of the closest related work.
Group Lasso has been widely used in sparse regression and variable selection. In [OPV14] the Functional Sparse Shrinkage and (FuSSO) is introduced to solve similar problem in Euclidean space setting. FuSSO uses Group Lasso to recover as a sparse linear combination of functions from an infinite dictionary given implictly by the decomposition over the Fourier basis. More importantly, this method imposes an additive model between the response and the functional covariates, which is not true in our method.
Gradient Learning [YX12]is also a similar work trying to recover nonzero partial derivatives via Group Lasso type regression as methods of variable selection and dimension reduction. Their work does not have functional covariate like us as input. Moreover, the goal of [YX12] is not explaining the coordinates but instead predicting a response given the covariates.
The Active Subspace method of [CDW14] uses the information in the gradient to discover a subspace of maximum variation of . This subspace is given by the principal subspace of the matrix , where is a weighting function averaging over a finite or infinite set of points in the domain of . While this method uses the gradient information, it can only find a global subspace, which would not be adequate for function composition, or for functions defined on nonlinear manifolds.
The work of [BPK16] is similar to ours in that it uses a dictionary. The goal is to identify the functional equations of nonlinear dynamical systems by regressing the time derivatives of the state variables on a subset of functions in the dictionary, with a sparsity inducing penalty. The recovered functional equation is linear in the dictionary functions, hence any nonlinearity in the state variables must be explicitly included in the dictionar. On the other hand, when the functional equation can be expressed as a sum of dictionary functions, then the system is completely identified.
With respect to parametrizing manifolds, the early work of [SR03, TR02] (and references therein) proposes parametrizing the manifold by finite mixtures of local linear models, aligned in a way that provides global coordinates, in a way reminiscent of LTSA[ZZ04].
A point of view different from ours views a set of eigenvectors of the LaplaceBeltrami operator as a parametrization of . Hence, the Diffusion Maps coordinates could be considered such a parametrization [CL06, CLL05, Gea12]. With [MN17] it was shown that principal curves and surfaces can provide an approximate manifold parametrization. Our work differs from these in two ways (1) first, obviously, the explanations we obtain are endowed with the physical meaning of the domain specific dictionaries, (2) less obviously, descriptors like principal curves or Laplacian eigenfunctions are generally still nonparametric (i.e exist in infinite dimensional function spaces), while the parametrizations by dictionaries we obtain (e.g the torsions) are in finite dimensional spaces. [DTCK18] tackles a related problem: choosing among the infinitely many Laplacian eigenfunctions which provide a dimensional parametrization of the manifold; the approach is to solve a set of Local Linear Embedding [RS00] problems, each aiming to represent an eigenfunction as a combination of the preceding ones.
7 Uniqueness and recovery results
7.1 Functional dependency and uniqueness of representation
Here we study the conditions under which is uniquely represented over a dictionary that contains . Not surprisingly, we will show that these are functional (in)dependency conditions on the dictionary.
We first study when a group of functions on an open can be represented with a subset of functionally independent functions. The following lemma implies that if a group of nonfullrank smooth functions has a constant rank in a neighborhood, then it can be showed that locally we can choose a subset of these functions such that the others can be smoothly represented by them. This is a direct result from the constant rank theorem.
Lemma 2 (Remark 2 after Zorich[Zor04] Theorem 2 in Section 8.6.2)
Let be a mapping defined in an neighborhood of a point . Suppose and the rank of the mapping is at every point of a neighborhood and . Moreover assume that the principal minor of order of the matrix is not zero. Then in some neighborhood of there exist functions such that
(19) 
Applying this lemma we can construct a local representation of a subset in . The following classical result in differential geometry lets us expand the above lemma beyond local to global.
We start with a definition. A smooth partition of unity subordiante to is an indexed family of smooth functions with the following properties:

for all and all ;

supp for each ;

Every has a neighborhood that intersects supp for only finitely many values of ;

for all .
Lemma 3 (John 2013[Lee03] Theorem 2.23)
Suppose is a smooth manifold, and is any indexed open cover of . Then there exists a smooth partition of unity subordinate to
Now we state our main results.
Theorem 4
Assume and are defined as in Section 3, with are functions in open set and let be another subset of functions. Then there exists a mapping on such that iff
(20) 
Proof If is not full rank on , we replace with a subset of so that is full rank globally on . The existence of such subsets are guaranteed by stepwise eliminating functions in , which will result in a zero matrix in r.h.s if such a subset does not exist and thus leads to a contradiction. Since , . Let
(21) 
and denote the l.h.s. matrix in (20). When the rank of equals the rank of , according to Lemma 2, there exists some neighborhood of and functions
(22) 
Here we should notice that is defined only on a neighborhood of . We denote such a neighborhood by and then since this holds for every , therefore we can find an open cover of the original open set . Since each open set in is a manifold, the classical result of partition of unity in Lemma 3 holds, that admits a smooth partition of unity subordinate to the cover . We denote this partition of unity by .
Hence we can define
(23) 
where the functions in are taken without repetition. For each , the product is on the neighborhood . According to the properties of partition of unity, this is a locally finite sum, which means that we do not have to deal with the problem of convergence. Also this will be a function.
Therefore, globally in we have
(24) 
Now we prove the converse implication. If , then there is , so that . Pick such that ; such an must exist because otherwise it will be in . By the theorem’s assumption, . This implies that is in for any . But this is impossible at .
A direct corollary of this theorem is that in a single chart scenario, we can use exactly functionally independent functions in the dictionary to give the explanation.
Corollary 5
Let defined as before. is a smooth manifold with dimension embedded in . Suppose that is also an embedding of and has a decomposition for each . If there is a diffeomorphism , then there is a subset of functions with such that for some function we can find on each
Proof Since is a diffeomorphism, we can write uniquely for each . And also . Then the original decomposition of on each is actually
(25) 
and is a mapping . Now we choose such that is full rank and . From
(26) 
and the previous theorem, we know that there exists a function such that on each . Finally let then
(27) 
holds for each . Now we determine the number of functions in . On one hand we can select so that is full rank, hence . On the other hand, since is an embedding, which means that is also an dimension manifold, we could consider another diffeomorphism such that
(28) 
is and onetoone.Therefore the Jacobian of l.h.s. is of rank , which should be less than the rank of each of the r.h.s. differentials. Therefore . To combine these results we have .
We say that itself is functionally independent when that every is functionally independent of . Corollary 5 states that can be