Nonparametric Bayesian Regression on Manifolds via Brownian Motion1footnote 11footnote 1This work was supported by NSF awards DMS-09-56072 and DMS-14-18386 and the University of Minnesota Doctoral Dissertation Fellowship Program.

Nonparametric Bayesian Regression on Manifolds via Brownian Motion111This work was supported by NSF awards DMS-09-56072 and DMS-14-18386 and the University of Minnesota Doctoral Dissertation Fellowship Program.

Xu Wang
wang1591@umn.edu
Dept. of Mathematics
University of Minnesota
Minneapolis, MN 55455
lerman@umn.edu
Dept. of Mathematics
University of Minnesota
Minneapolis, MN 55455
Abstract

This paper proposes a novel framework for manifold-valued regression and establishes its consistency as well as its contraction rate. It assumes a predictor with values in the interval and response with values in a compact Riemannian manifold . This setting is useful for applications such as modeling dynamic scenes or shape deformations, where the visual scene or the deformed objects can be modeled by a manifold. The proposed framework is nonparametric and uses the heat kernel (and its associated Brownian motion) on manifolds as an averaging procedure. It directly generalizes the use of the Gaussian kernel (as a natural model of additive noise) in vector-valued regression problems. In order to avoid explicit dependence on estimates of the heat kernel, we follow a Bayesian setting, where Brownian motion on induces a prior distribution on the space of continuous functions . For the case of discretized Brownian motion, we establish the consistency of the posterior distribution in terms of the distances for any . Most importantly, we establish contraction rate of order for any fixed , where is the number of observations. For the continuous Brownian motion we establish weak consistency.

1 Introduction

In many applications of regression analysis, the response variables lie in Riemannian manifolds. For example, in directional statistics [20, 12, 11] the response variables take values in the sphere or the group of rotations. Applications of directional statistics include crystallography [22], altitude determination for navigation and guidance control [30], testing procedure for Gene Ontology cellular component categories [27], visual invariance studies [21] and geostatics [34]. Other modern applications of regression give rise to different types of manifold-valued responses. In the regression problem of estimating shape deformations of the brain over time (e.g., for studying brain development, aging or diseases), the response variables lie in the space of shapes [10, 24, 17, 3, 25, 9]. In the analysis of landmarks [16] the response variables lie in the Lie group of diffeomorphisms.

The quantitative analysis of regression with manifold-valued responses (which we refer to as manifold-valued regression) is still in early stages and is significantly less developed than statistical analysis of vector-valued regression with manifold-valued predictors [1, 8, 26, 5, 28, 36, 7]. A main obstacle for advancing the analysis of manifold-valued regression is that there is no linear structure in general Riemannian manifolds and thus no direct method for averaging responses. Parametric methods for regression problems with manifold-valued responses [10, 17, 21, 13, 16] directly generalize the linear or polynomial real-valued regressions to geodesic or Riemannian polynomial manifold-valued regression. Nevertheless, the geodesic or Riemannian polynomial assumption on the underlying function is often too restrictive and for many applications non-parametric models are required. To address this issue, Hein [15] and Bhattacharya [3] proposed kernel-smoothing estimators, where in [15] the predictors and responses take values in manifolds and in [3] the predictors and responses take values in compact metric spaces with special kernels. Hein [15] proved convergence of the risk function to a minimal risk (w.p. 1; conditioned on the predictor) and Bhattacharya [3] established consistency of the joint density function of the predictors and the responses. However, the rate of contraction (that is, the rate at which the posterior distribution contracts to a distribution with respect to the underlying regression function) of any previously proposed manifold-valued regression estimator was not established. To the best of our knowledge, rate of contraction was only established when both the predictor and response variables are real [32] and this work does not seem to extend to manifold-valued regression.

The main goal of this paper is to establish the rate of contraction of a natural estimator for manifold-valued regression (with real-valued predictors). This estimator is proposed here for the first time.

1.1 Setting for Regression with Manifold-Valued Responses

We assume that the predictor takes values in and the response takes values in a compact -dimensional Riemannian manifold . We denote the Riemannian measure on by ( is the volume form). We also assume an underlying function , which relates between the predictor variables and response variables by determining a density function , so that

 x|t∼pf0(t)(x). (1)

We find it natural to define

 pf0(t)(x)=pσ2(f0(t),x), (2)

where denotes the heat kernel on centered at and evaluated at time . Equivalently, is the transition probability of Brownian motion on (with the measure ) from to at time . We note that controls the variance of the distribution of and as , the distribution of approaches . In the special case where :

and this implies the common model: .

We also assume a distribution of , whose support equals , though its exact form is irrelevant in the analysis. At last, we assume i.i.d. observations with the joint distribution and the density function

 pn0=n∏i=1p(ti)pσ2(f0(ti),xi). (3)

The aim of the regression problem is to estimate among all functions in given the observations .

For simplicity, we denote throughout the rest of the paper

 P:=C([0,1],M).

1.2 Bayesian Perspective: Prior and Posterior Distributions Based on the Brownian Motion

Since the set of functions includes Brownian paths, the heat kernel, which expresses the Brownian transition probability, can be used to form a prior distribution on . For the sake of clarity, we need to distinguish between two different ways of using the heat kernel in this paper. The first one applies the heat kernel with , and (see e.g., Section 1.1), where the time (or variance) parameter quantifies the “noise” in w.r.t. the underlying function . The second one uses the heat kernel with and , where the time parameter inversely characterizes the “smoothness” of the path between and . The smaller , the smoother the path between and (since smaller makes it less probable for to get further away from ). Using the heat kernel , we define in Section 1.2.1 a continuous Brownian motion (BM) prior distribution and in Section 1.2.2 a discretized BM prior distribution. Section 1.2.3 then defines posterior distributions in terms of the prior distributions and the given observations of the setting.

1.2.1 The Continuous BM Prior on P

We note that a function can be identified as a parametrized path in . Let’s assume that is a starting point of this path, that is . We denote . Corollary 2.19 of [2] implies that there exists a unique probability measure on such that for any , , and open subsets , the following identify is satisfied

 Wx(f∈Px | f(t1)∈U1,…,f(tn)∈Un)=∫U1×…×Unptn−tn−1(xn,xn−1)⋯pt2−t1(x2,x1)pt1(x1,x)dμ(x1)⋯dμ(xn). (4)

We define the conditional prior distribution of given by . We assume that the distribution of is and thus obtain that the prior distribution of is .

1.2.2 The Discretized BM Prior P

The continuous BM prior often does not have a density function. We discuss here a special case of discretized BM, where the density function of the prior is well-defined. For such that is an integer, we define as the set of piecewise geodesic functions from to , where for each , , the interval is mapped to the geodesic curve from to . Each function in is determined by its values at . Let the distribution of be uniform w.r.t. the Riemannian measure and let the transition probability from to be given by the heat kernel . Then the density function (w.r.t. ) of the discretized BM prior on can be specified as follows:

 πh(f)=1μ(M)1/h∏k=1ph(f(kh−h),f(kh)).

The corresponding distribution is denoted by .

Throughout the paper we assume a sequence with and with some abuse of notation denote by the sequence of discretized BM priors defined above with . By construction, is supported on . Since , can also be considered as a set of priors on .

1.2.3 Posterior Distributions

Given observations drawn according to the setting of Section 1.1, the posterior distribution of has the density function

 Π(f∈A|{(ti,xi)}ni=1) ∝∫f∈An∏i=1p(ti,xi|f)dΠ(f) (5) =∫f∈An∏i=1pf(ti)(xi)p(ti)dΠ(f),

where the equality in (5) follows by applying (1) and (2) to the estimator of .

1.3 Main Theorems: Posterior Consistency and Rate of Contraction

We establish the posterior consistency for the discretized and continuous BM priors respectively. That is, we show that as approaches infinity, the posterior distributions contract with high probability to the distribution (recall that is the underlying function in ). Furthermore, for the discretized BM we study the rate of contraction of the posterior distribution. The theorem for the discretized BM is formulated in Section 1.3.1 and the one for the continuous BM (with weaker convergence) in Section 1.3.2.

1.3.1 Posterior Consistency and Rate of Contraction for Discretized BM

Theorem 1.1 below formulates the rate of contraction of the posterior distribution of the discretized BM with respect to the metric on , where . This metric, , is defined as follows:

 dq(f1,f2)=(∫t∈[0,1]distM(f1(t),f2(t))qp(t)dt)1/q, (6)

where denotes the geodesic distance on and is the pdf for the predictor .

Theorem 1.1.

Assume a regression setting with a predictor variable , whose pdf is strictly positive on , a response variable in a compact finite-dimensional Riemannian manifold and an underlying and unknown Lipschitz function , which relates between and according to (1) and (2). Assume an arbitrarily fixed and for , let be the sidelength of the set and let denote the sequence of discretized BM priors on . Then there exists an absolute constant and a fixed constant depending only on the positive minimum value of on , the volume of and the Riemannian metric of 222More precisely, the dependence of the constant (which is later defined in (15)) on the Riemannian metric., such that contracts to according to the rate . More precisely, for any

 Πn(f:dq(f,f0)≥A0ϵn|{(ti,xi)}ni=1)→0

in -probability (see (3)) as .

The proof of Theorem 1.1 appears in Section 2 and utilizes a general strategy for establishing contraction according to [14]. The significance of the theorem is in properly determining the sidelength parameter (as a function of ). Practical application of the discretized BM prior can suffer from underfitting or overfitting as a result of too small or too large choice of respectively. Theorem 1.1 implies that for observations, should be picked as to achieve a contraction rate of for any fixed .

1.3.2 Posterior Consistency for Continuous BM

We show here that the posterior distribution is weakly consistent. In order to clearly specify the weak convergence, it is natural to identify functions in with density functions of observations. Let denote the set of densities from which the observations are drawn. Assuming a fixed variance , a function can be identified with a density function as follows:

 Φ:f⟶pf(t,x):=pσ2(f(t),x)p(t). (7)

Therefore, induces a prior on the set , which is again denoted by with some abuse of notation. For the simplicity of analysis, we assume here that is known. Section 4.1 discusses the modification needed when is unknown.

For the underlying function , we define its weak neighborhood of radius by

Theorem 1.2 states the weak posterior consistency of the continuous BM prior . It is proved later in Section 3.

Theorem 1.2.

If is a compact Riemannian manifold and if the true underlying function of the regression model is Lipschitz continuous, then the posterior distribution is weakly consistent. In other words, for any ,

 Π(Nϵ(f0)|{(ti,xi)}ni=1)⟶1

almost surely w.r.t. the true probability measure (defined in (3)) as .

1.4 Main Contributions of This Work

The first contribution of this paper is the proposal of a natural model for manifold-valued regression (with real-valued predictors). Indeed, the heat kernel on the Riemannian manifold gives rise to an averaging process, which generalizes basic averages of vector-valued regression. In particular, the heat kernel on is the same as the Gaussian kernel (applied to the difference of and ), which is widely used in regression when (due to an additive Gaussian noise model). The Bayesian setting is natural for the proposed model, since it uses the discretized or continuous Brownian motion on as a prior distribution of and it does not directly use the heat kernel. It is not hard to simulate the Brownian motion, but tight estimates of the heat kernel for general are hard.

The second and main contribution of this work is the derivation of the contraction rate of the posterior distribution for the discretized Brownian motion. To the best of our knowledge the rate of contraction was only established before for regression with real-valued predictors and responses. For this case, van Zanten [32] established contraction rate for the posterior distribution of samples under the -norm, where . His analysis does not seem to extend to our setting. It is unclear to us if this stronger contraction rate also applies to the general case of manifold-valued regression (see discussion in Section 6.3).

The third contribution is the consistency result for the continuous Brownian motion. The only other consistency result for manifold-valued regression we are aware of is by Bhattacharya [3]. It suggests a general nonparametric Bayesian kernel-based framework for modeling the conditional distribution , where the predictor and response take values in metric spaces with kernels. Under a suitable assumption on the kernels, [3] established the posterior consistency for the conditional distribution w.r.t. the norm (see [3, Proposition 13.1]). We remark that [3] applies to responses and predictors in Riemannian manifolds (where the corresponding metric kernels are the heat kernels). However, both the conditional distribution (of given ) and the prior distribution are different than the ones proposed here. It is unclear how to obtain a rate of contraction for [3].

The last contribution is the implication of a new numerical procedure for manifold-valued regression, which is based on simulating a Brownian motion on . The flexibility of the shapes of the sample paths of the Brownian motion is advantageous over state-of-the-art geodesic regression methods. Real applications often do not give rise to geodesics and thus the nonparametric regression method is less likely to suffer from underfitting. Another nonparametric approach is kernel regression [15, 3]. In Section 5, we compare between kernel regression and Brownian motion regression (our method) for a particular example, which is easy to visualize.

1.5 Organization of the Rest of the Paper

The paper is organized as follows. Theorems 1.1 and 1.2 are proved in Sections 2 and 3 respectively. Section 4 extends the framework to the cases where is unknown and is supported on a subset of . Section 5 demonstrates the performance of the proposed procedure on a particular example, which is easy to visualize, and compares it to kernel regression [15, 3].

2 Proof of Theorem 1.1

Our proof utilizes Theorem 2.1 of [14, page 4]. The latter theorem establishes the contraction rate for a sequence of priors over the set of joint densities of the predictor and response under some conditions on and the covering number of . We thus conclude Theorem 1.1 by establishing these conditions.

We use the following distance on the space with an arbitrarily fixed :

 dq,D(p1,p2)=12∥p1−p2∥qfor p1,p2∈D.

The regression framework is formulated in terms of the space (see Section 1.1, in particular, the mapping of to in (7)) and the metric on (see (6)). We also use the metric on , which is defined by

 d∞(f1,f2)=maxt∈[0,1]distM(f1(t),f2(t)), (8)

The proof is organized as follows. Section 2.1 shows that under the mapping (7) of to , is bounded from below by (and above by ). Therefore, the posterior contraction w.r.t.  implies the posterior contraction w.r.t. . Then, Sections 2.2-2.4 show that if the sidelengths and a constant are chosen properly, then the priors and the sieve of functions (defined later in (21)) satisfy conditions (2.2)-(2.4) respectively in Theorem 2.1 of [14]. The posterior contraction of is then concluded.

2.1 Relations between dq,D, dq and d∞

We formulate and prove the following lemma, which relates between , and . It is later used as follows: The first inequality of (9) deduces convergence in from convergence in . The second inequality of (9) is used in finding the covering number of the space .

Lemma 2.1.

If , and for all , then there exists two constants depending only on , and the Riemannian manifold such that for any with corresponding densities , in (via (7))

 C0dq(f1,f2)≤dq,D(pf1,pf2)≤C1d∞(f1,f2). (9)
Proof.

For , we define the function

 F(x1,x2,y)=|pσ2(x1,y)−pσ2(x2,y)|distM(x1,x2). (10)

We note that the first inequality of (9) is true if there exists a constant such that

 ∫y∈MF(x1,x2,y)qdμ(y)≥Cq0mq−1p,∀x1≠x2∈M. (11)

Since is compact and is infinitely differentiable, for any , there exists such that

 ∣∣∣F(x1,x2,y)−∣∣∣∂pσ2(x1,y)∂v12∣∣∣∣∣∣≤ϵ,∀x1,x2,y∈M,distM(x1,x2)≤δ, (12)

where is the unit vector of the geodesic connecting and . Since the heat kernel is not constant and due to the compactness of the space of unit tangent vectors, there exists such that

 ∫y∈M∣∣∣∂pσ2(x1,y)∂v12∣∣∣dμ(y)≥C′0,∀x1∈M,v12∈Tx1M,∥v12∥=1. (13)

Inequalities (12), (13) and the Schwarz inequality imply that

 ∫y∈MF(x1,x2,y)qdμ(y)≥CI:=(C′0−ϵμ(M)μ(M))q,∀x1,x2,y∈M,distM(x1,x2)≤δ. (14)

If we pick small enough (with its in (12)), is a positive number. On the other hand, if the pair satisfies that , we show that for some constant ,

 ∫y∈MF(x1,x2,y)qdμ(y)≥CII,∀x1,x2,y∈M,distM(x1,x2)≥δ. (15)

Since the set is compact, the existence of is guaranteed if we can show that

 ∫y∈MF(x1,x2,y)qdμ(y)>0,∀x1,x2,y∈M,distM(x1,x2)≥δ,

which can further be reduced to showing that given any pair ,

 ∃y∈M,pσ2(x1,y)≠pσ2(x2,y). (16)

We prove (16) by contradiction. If (16) is not true, then

 pσ2(x1,y)=pσ2(x2,y),∀y∈M. (17)

If we plug and respectively in (17), and use the symmetry of the heat kernel, to get , which means that

 pσ2(x1,x2)=√pσ2(x1,x1)pσ2(x2,x2). (18)

On the other hand,

 pσ2(x1,x2) =∫z∈Mpσ2/2(x1,z)pσ2/2(z,x2)dμ(z) (19) ≤√∫z∈Mpσ2/2(x1,z)2dμ(z)∫z∈Mpσ2/2(z,x2)2dμ(z) =√pσ2(x1,x1)pσ2(x2,x2).

In view of (18) the Cauchy-Schwartz inequality used in (19) is an equality and consequently

 pσ2/2(x1,z)=pσ2/2(x2,z),∀z∈M.

Applying the same argument iteratively, we conclude that for any ,

 pσ2/2m(x1,z)=pσ2/2m(x2,z),∀z∈M.

However, as , but . This is a contradiction. Inequality (16) and thus (15) are proved. We conclude from (14) and (15), the first inequality of (9) with .

Next, we establish the second inequality of (9). Theorem 4.1.4 in [18, page 105] states that is infinitely differentiable in both variables and . In particular, its first partial derivatives are continuous. Furthermore, the fact that is compact implies that the first partial derivatives are bounded. That is, there exists such that

 ∣∣∣∂pσ2(x,y)∂x∣∣∣≤CM,∣∣∣∂pσ2(x,y)∂y∣∣∣≤CM.

Consequently,

 |pσ2(x1,y)−pσ2(x2,y)|≤CMdistM(x1,x2). (20)

Applying (20) and then bounding by and by , we conclude (12) with as follows:

 dq,D(pf1,pf2) =(∬|pσ2(f1(t),y)p(t)−pσ2(f2(t),y)p(t)|qdμ(y)dt)1/q ≤(∬CqMdistM(f1(t),f2(t))qp(t)qdμ(y)dt)1/q ≤CMM(q−1)/qpμ(M)d∞(f1,f2).

Remark 2.2.

We note that when , the constants in Lemma 2.1 are independent of . In particular, in this case the condition is not needed.

2.2 Verification of Inequality 2.2 of [14]

We estimate the covering numbers of special subsets of and . The final estimate verifies inequality 2.2 of [14]. We start with some notation and definitions that also include these special subsets of and .

For and , let

 ∥f∥α:=maxt1,t2∈[0,1]distM(f(t1),f(t2))|t1−t2|α

and

 Pα:={f∈P|∥f∥α<∞}.

For a sequence increasing to infinity we define the sieve of functions

 Pn,α={f∈Pα|∥f∥α≤Mn}. (21)

This induces a sieve of densities of by the map (7). For and a metric space with the metric , we denote by the -covering number of , which is the minimal number of balls of radius needed to cover .

In the rest of the section we estimate the covering numbers of the sets , and . We assume a decreasing sequence approaching zero. Section 2.2.1 upper bounds for an arbitrary such sequence . Section 2.2.2 upper bounds for arbitrary sequences and as above. At last, Section 2.2.3 upper bounds for sequences and satisfying an additional condition (see (37) below). It verifies inequality 2.2 of [14].

2.2.1 Covering Numbers of M

For any , we construct an -net on the -dimensional compact Riemannian manifold . Let be the diameter of . That is,

 D(M)=maxx,y∈MdistM(x,y).

The Nash embedding theorem [23] and Whitney embedding theorem [35] imply that there exists an isometric map

 E:M⟶R2D.

Since , the image is contained in an hypercube with side length . We partition this as a regular grid with grid spacing in each direction. Since each point in has distance less than to some grid vertex, the set of grid vertices, , is an -net of . Thus the -covering number of can be bounded as follows:

 N(ϵn,HC,distR2D)≤(2D(M)ϵn/√2D)2D. (22)

Next, we construct an -net of using the -net of . To begin with, we show in Lemma 2.3 that the Riemannian distance and the Euclidean distance are equivalent locally under an isometric embedding.

Lemma 2.3.

Let be a compact Riemannian manifold and be an isometric embedding to . Then for any fixed constant , there exists a constant such that with ,

 |distM(x,y)−distR2D(E(x),E(y))|
Proof.

Suppose this is not true. Then there exists a sequence of such that and

 |distM(xn,yn)−distR2D(E(xn),E(yn))|≥CdistR2D(E(xn),E(yn)). (23)

Since is compact, there is a subsequence, denoted again by , and a point such that . By picking an orthonormal basis of the tangent space and using the exponential map , one has normal coordinates

 Φ:RD≡TzM⊃Bz(0,r)⟶M

where is the -ball centered the origin on . Let be the logarithm map at and be the Euclidean distance on . Let and . Applying Lemma 12 in [33, page 24] for ,

 |distM(xn,yn)−distI(xn,yn)|

Let be the composition of with ,

 f:RD⊃Bz(0,r)⟶R2D.

We note that and . The Tyler series of is

 f(yn)−f(xn)=(∇f(xn))T(yn−xn)+12(yn−xn)T(∇2f(xn))(yn−xn)+⋯.

This implies that

 ∥f(yn)−f(xn)∥2=∥(∇f(xn))T(yn−xn)∥2+O(∥yn−xn∥22). (25)

On the one hand, since is an isometric embedding, the linear map

 ∇f(0):RD⟶R2D

preserves the Euclidean distance. On the other hand, the smoothness of implies that has bounded derivatives. Thus,

 ∇f(xn)=∇f(0)+O(∥xn∥2). (26)

Then, (25) and (26) and the triangle inequality imply that

 ∥f(yn)−f(xn)∥2=∥yn−xn∥2+O(∥xn∥2∥yn−xn∥2+∥yn−xn∥22).

In other words,

 |distI(xn,yn)−distR2D(E(xn),E(yn))|≤O(∥xn∥2+∥yn−xn∥2)distI(xn,yn). (27)

By (24) and (27),

 |distM(xn,yn)−distR2D(E(xn),E(yn))|

where . Moreover, by (27),

 distI(xn,yn)<(1−O(∥xn∥2+∥yn−xn∥2))−1distR2D(E(xn),E(yn)).

Therefore, if , then

 |distM(xn,yn)−distR2D(E(xn),E(yn))|

We note that as since and this contradicts assumption (23).

Now, we construct an -net of from .

Lemma 2.4.

Let There exists a constant such that is an -net of when . Consequently,

 N(ϵn,M,distM)≤N(ϵn/3,HC,distR2D).
Proof.

Suppose where is the constant in Lemma 2.3 with . For any point , let be the vertex in that is closest to w.r.t. . Then, by definition, . Let . To prove the lemma, it is sufficient to show that

 distM(x,z)<ϵn. (28)

Since , Lemma 2.3 states that

 |distM(x,z)−distR2D(E(x),E(z))|<13distR2D(E(x),E(z)). (29)

Inequality (29) implies (28) and thus the lemma as follows:

 distM(x,z)<43distR2D(E(x),E(z))<8ϵn/9.

<