
## Abstract

A theorem of McCann [13] shows that for any two absolutely continuous probability measures on $\mathbb{R}^d$ there exists a monotone transformation sending one probability measure to the other. A consequence of this theorem, relevant to statistics, is that density estimation can be recast in terms of transformations. In particular, one can fix any absolutely continuous probability measure, call it $P$, and then reparameterize the whole class of absolutely continuous probability measures as monotone transformations from $P$. In this paper we utilize this reparameterization of densities, as monotone transformations from some fixed $P$, to construct semiparametric and nonparametric density estimates. We focus our attention on classes of transformations, developed in the image processing and computational anatomy literature, which are smooth, invertible and which have attractive computational properties. The techniques developed for this class of transformations allow us to show that a penalized maximum likelihood estimate (PMLE) of a smooth transformation from $P$ exists and has a finite dimensional characterization, similar to those results found in the spline literature. These results are derived using an Euler-Lagrange characterization of the PMLE, which also establishes a surprising connection to a generalization of Stein’s lemma for characterizing the normal distribution.

A general spline representation for nonparametric and semiparametric density estimates using diffeomorphisms.

Ethan Anderes and Marc Coram

Department of Statistics, University of California, Davis, CA 95616, USA

e-mail: anderes@stat.ucdavis.edu

Health Research and Policy Department, Stanford University, Palo Alto, CA 94304, USA


Key words and phrases: Density estimation, Euler-Lagrange, penalized maximum likelihood, diffeomorphism.

MSC 2010 Subject Classification: 62G07

## 1 Introduction

A theorem of McCann [13] shows that for any two absolutely continuous probability measures on $\mathbb{R}^d$ there exists a monotone transformation sending one probability measure to the other. A consequence of this theorem, relevant to statistics, is that density estimation can be recast in terms of transformations. In particular, one can fix any absolutely continuous probability measure, call it $P$, and then reparameterize the whole class of absolutely continuous probability measures as monotone transformations from $P$. The advantage of this new viewpoint is the flexibility in choosing the target measure $P$, which can encode prior information on the shape of the true sampling distribution. For example, if it is known that the data are nearly Gaussian, then choosing a Gaussian target $P$ along with a strong penalty on transformations that are far from the identity allows one to construct penalized maximum likelihood estimates which effectively shrink the resulting nonparametric estimate in the direction of the Gaussian target $P$. Moreover, when there is no knowledge about the true sampling measure, one can simply choose any absolutely continuous $P$ and still construct a completely nonparametric density estimate.

In this paper we utilize this reparameterization of densities, as monotone transformations from some fixed target $P$, to construct semiparametric and nonparametric density estimates. We focus our attention on classes of transformations which are smooth, invertible and which have attractive computational properties. Formally, we model our data as being generated by some diffeomorphism $\phi$ of some fixed but known probability distribution $P$ on $\Omega\subseteq\mathbb{R}^d$:

$$X_1,\ldots,X_n \stackrel{iid}{\sim} P\circ\phi. \tag{1}$$

The notation $P\circ\phi$ is taken to mean that the probability of $X\in A$ is given by $P(\phi(A))$, where $\phi(A)\equiv\{\phi(x)\colon x\in A\}$. An important observation is that the model (1) implies that the transformed data $\phi(X_1),\ldots,\phi(X_n)$ are distributed according to $P$. Therefore, one can imagine estimating $\phi$ by attempting to “deform” the data by a transformation which satisfies

$$\phi(X_1),\ldots,\phi(X_n) \stackrel{iid}{\sim} P.$$
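To make the change of variables concrete, here is a minimal one-dimensional sketch (ours, not the authors' code; the target $p$ and map $\phi$ below are illustrative assumptions). Any smooth increasing $\phi$ pulls the fixed target density back to a valid density $p(\phi(x))\,|\phi'(x)|$.

```python
import numpy as np

# Sketch of the reparameterization X ~ P∘φ in one dimension (illustrative
# choices only; neither p nor φ is prescribed by the paper).
# If φ(X) ~ P with density p, the change of variables gives X the density
#     f(x) = p(φ(x)) · |φ'(x)|.

def target_density(y):
    """Standard normal target density p."""
    return np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)

def phi(x):
    """A hypothetical monotone (hence invertible) transformation."""
    return x + 0.3 * np.tanh(x)

def phi_prime(x):
    """Derivative of phi: 1 + 0.3·sech²(x) > 0, so phi is increasing."""
    return 1 + 0.3 / np.cosh(x)**2

def pullback_density(x):
    """Density of X under the model φ(X) ~ P."""
    return target_density(phi(x)) * np.abs(phi_prime(x))

# The pullback is again a probability density: it integrates to one.
x = np.linspace(-10.0, 10.0, 200001)
mass = np.sum(pullback_density(x)) * (x[1] - x[0])
```

Varying $\phi$ traces out different sampling densities while the target stays fixed, which is exactly the reparameterization the estimation scheme exploits.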

One of the main difficulties when working with such a model is the invertibility condition on the maps $\phi$. The nonlinearity of this condition makes constructing rich classes of invertible maps, and optimizing over such classes, difficult. One of the early attempts at circumventing such difficulties, found in [2], utilized the class of quasi-conformal maps to generate penalized maximum likelihood estimates of $\phi$. However, these tools were only developed for dimension two, with no clear generalization for higher dimension. In this paper we adapt the powerful tools developed by Grenander, Miller, Younes, Trouvé and co-authors in the image processing and computational anatomy literature (see [24] and the references therein) and apply them to the estimation of $\phi$ from data $X_1,\ldots,X_n$. Indeed, these techniques allow us to show that a penalized maximum likelihood estimate of $\phi$ exists and that the solution has a finite dimensional characterization, similar to those results found in the spline literature (see [23]).

We start the paper in Section 2 with an overview of using the dynamics of time varying vector field flows to generate rich classes of diffeomorphisms. Then in Sections 3 and 4 we define our penalized maximum likelihood estimate (PMLE) of the transformation and prove not only existence, but also establish a finite dimensional characterization which is key for numerically computing the resulting density estimate. In Section 5 we notice a surprising connection between the Euler-Lagrange equation for the PMLE and a generalization of Stein’s lemma for characterizing the normal distribution (see [19]). In Sections 6 and 7 we give examples of our new density estimate, first as a nonparametric density estimate and second as a semiparametric density estimate where a finite dimensional model is used for the target probability measure $P$. We finish the paper with an appendix which contains some technical details used for the proofs of the existence of the PMLE and for the finite dimensional characterization.

## 2 A rich class of diffeomorphisms

In this section we give a brief overview of the technique of using the dynamics of time varying vector field flows to generate rich classes of diffeomorphisms. These time varying flows have been utilized with spectacular success in the fields of computational anatomy and image processing (see [1, 4, 5, 6, 7, 10, 11, 14, 15, 16, 20, 21, 22, 24, 25], and references therein). The tools developed in this body of work have been shown to be very powerful for developing algorithms which optimize a diverse range of objective functions defined over classes of diffeomorphisms. It is these tools that we adapt for density estimation and statistics.

A map $\phi$ is said to be a diffeomorphism of the open set $\Omega\subseteq\mathbb{R}^d$ if $\phi$ is one-to-one, maps $\Omega$ onto $\Omega$, and both $\phi$ and $\phi^{-1}$ are continuously differentiable. In what follows we generate classes of diffeomorphisms by time varying vector field flows. In particular, let $v$ be a time varying vector field in $\mathbb{R}^d$, where $t\in[0,1]$ denotes ‘time’, so that for each $t$, $v_t$ is a function mapping $\Omega$ into $\mathbb{R}^d$. Under mild smoothness conditions there exists a unique class of diffeomorphisms of $\Omega$, denoted $\phi^v_t$, which satisfy the following ordinary differential equation

$$\partial_t \phi^v_t(x) = v_t\big(\phi^v_t(x)\big) \tag{2}$$

with boundary condition $\phi^v_0(x)=x$, for all $x\in\Omega$ (see Theorem 1 below). The interpretation of these flows is that $\phi^v_t(x)$ represents the position of a particle at time $t$, which originated from location $x$ at time $0$ and flowed according to the instantaneous velocity given by $v$. It will be convenient to consider the diffeomorphism that maps time $s$ to some other time $t$; this will be denoted $\phi^v_{st}$.
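The flow equation (2) is easy to simulate. The following sketch (our illustration, using a plain forward-Euler scheme rather than anything from the paper) integrates a time varying field from $t=0$ to $t=1$:

```python
import numpy as np

# Forward-Euler integration of the flow ∂t φ_t(x) = v_t(φ_t(x)), φ_0(x) = x.
# v is any callable v(t, x); smoothness of v governs smoothness of φ_1.

def flow(x0, v, n_steps=1000):
    """Approximate φ_1(x0) by stepping the ODE on a uniform time grid."""
    phi = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        phi = phi + dt * v(i * dt, phi)
    return phi

# Sanity check: a constant field v ≡ c flows every point to x + c,
# i.e. φ_1 is a translation (a particularly simple diffeomorphism).
pts = np.array([-1.0, 0.0, 2.5])
shifted = flow(pts, lambda t, x: np.full_like(x, 0.7))
```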

For the remainder of the paper we will assume that at each time $t$, $v_t$ is a member of a Hilbert space $V$ of vector fields mapping $\Omega$ into $\mathbb{R}^d$, with inner product denoted by $\langle\cdot,\cdot\rangle_V$ and norm by $\|\cdot\|_V$. Indeed, how one chooses the Hilbert space $V$ will determine the smoothness properties of the resulting class of deformations $\phi^v_t$. Once the Hilbert space $V$ is fixed we can define the following set of time varying vector fields.

###### Definition 1.

Let $V[0,1]$ denote the space of measurable functions $v\colon t\mapsto v_t$ such that $v_t\in V$ for all $t\in[0,1]$ and $\int_0^1\|v_t\|^2_V\,dt<\infty$.

One clear advantage of this class is that it can be endowed with a Hilbert space inner product if $V$ is a Hilbert space. Indeed, $V[0,1]$ is a Hilbert space with inner product defined by $\langle u,v\rangle_{V[0,1]}=\int_0^1\langle u_t,v_t\rangle_V\,dt$ (see Proposition 1 in the Appendix or Proposition 8.17 in [24]). For the remainder of the paper we typically use $u$ or $v$ to denote elements of $V[0,1]$, and $u_t$ or $v_t$ to denote the corresponding elements of $V$ at any fixed time $t$. An important theorem found in [24] relates the smoothness of $V$ to the smoothness of the resulting diffeomorphism. Before we state the theorem, some definitions will be prudent. The Hilbert space $V$ is said to be continuously embedded in another normed space $H$ (denoted $V\hookrightarrow H$) if $V\subset H$ and there exists a constant $c>0$ such that

$$\|v\|_H \le c\,\|v\|_V$$

for all $v\in V$, where $\|\cdot\|_H$ denotes the norm in $H$. Also we let $C^p_0(\Omega,\mathbb{R}^d)$ denote the subset of functions in $C^p(\Omega,\mathbb{R}^d)$ whose partial derivatives of order $p$ or less all have continuous extensions to zero at the boundary $\partial\Omega$.

###### Theorem 1 ([24], [10]).

If $V\hookrightarrow C^1_0(\Omega,\mathbb{R}^d)$, then for any $v\in V[0,1]$ there exists a unique class of diffeomorphisms $\phi^v_t$ which satisfy (2) and $\phi^v_0(x)=x$ for all $x\in\Omega$.

To derive our finite dimensional characterization we will make the additional assumption that $V$ is a reproducing kernel Hilbert space of vector fields. This will guarantee the existence of a reproducing kernel $R(x,y)$ which can be used to compute the evaluation functional. In particular, $R$ has the property that for any $x\in\Omega$ and $\alpha\in\mathbb{R}^d$ the identity $\langle R(\cdot,x)\alpha, v\rangle_V = \alpha\cdot v(x)$ holds for all $v\in V$. To simplify the following computations we will only work with kernels of the form $R(x,y)=\rho(x-y)\,I_d$, where $\rho$ is a positive definite function and $I_d$ is the $d$-by-$d$ identity matrix.

Now the class of diffeomorphisms we consider in this paper corresponds to the set of all time varying vector field flows evaluated at time $t=1$: $\{\phi^v_1 : v\in V[0,1]\}$. The class will be completely specified by the reproducing kernel $R$, which has the flexibility to control the smoothness of the resulting maps through Theorem 1.
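As a concrete instance of the kernel convention $R(x,y)=\rho(x-y)\,I_d$, the sketch below (our example; the Gaussian choice of $\rho$ and all inputs are assumptions, not prescribed by the text) builds a vector field in the span of the kernel and checks positive definiteness of the Gram matrix of $\rho$:

```python
import numpy as np

# With R(x, y) = ρ(x − y)·I_d for a scalar positive definite ρ, a vector
# field in the kernel's span is v(x) = Σ_k η_k ρ(x − κ_k), with knots κ_k
# and coefficient vectors η_k in R^d.

def rho(x, y, sigma=1.0):
    """Scalar positive definite function ρ(x − y): a Gaussian bump (assumed)."""
    d2 = np.sum((x - y) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def kernel_field(x, knots, eta, sigma=1.0):
    """Evaluate v(x) = Σ_k η_k ρ(x − κ_k) at the rows of x (shape (m, d))."""
    w = rho(x[:, None, :], knots[None, :, :], sigma)   # (m, N) kernel weights
    return w @ eta                                     # (m, d) vector field

rng = np.random.default_rng(0)
knots = rng.normal(size=(5, 2))                        # hypothetical knots in R^2
eta = rng.normal(size=(5, 2))                          # hypothetical coefficients
G = rho(knots[:, None, :], knots[None, :, :])          # Gram matrix of ρ
eigs = np.linalg.eigvalsh(G)                           # all positive for distinct knots
```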

## 3 Penalized maximum likelihood estimation

In this section we construct a penalized maximum likelihood estimate (PMLE) of $\phi^v_1$ given the data $X_1,\ldots,X_n\stackrel{iid}{\sim}P\circ\phi^v_1$, where $\phi^v_t$ satisfies (2) and $v\in V[0,1]$. Under mild assumptions on $V$ and the density of $P$ we prove the existence of a PMLE of $\phi^v_1$, thereby obtaining an estimate of the true sampling distribution.

The target probability measure $P$ is assumed known with a bounded density with respect to Lebesgue measure on $\Omega$. Therefore, by writing the density of $P$ as $e^{H(x)}$ for some function $H$, the probability measure $P\circ\phi^v_1$ has density given by

$$dP\circ\phi^v_1(x) = \det\big(D\phi^v_1(x)\big)\,e^{H\circ\phi^v_1(x)}\,dx$$

where $\det(D\phi^v_1(x))$ is defined as the determinant of the Jacobian of $\phi^v_1$ evaluated at $x$ (always positive by the orientation preserving nature of the flow). Since $\phi^v_1$ ranges over an infinite dimensional space of diffeomorphisms, the likelihood will typically be unbounded as $v$ ranges in $V[0,1]$. The natural solution is to regularize the log likelihood using the corresponding Hilbert space norm on $V[0,1]$ with a multiplicative tuning factor $\lambda>0$. The penalized log-likelihood (scaled by $n^{-1}$) for the unknown vector field flow $v$ given data $X_1,\ldots,X_n$ is then given by

$$E_\lambda(v) \equiv \frac{1}{n}\sum_{k=1}^n \Big(\log\det\big[D\phi^v_1(X_k)\big] + H\circ\phi^v_1(X_k)\Big) - \frac{\lambda}{2}\int_0^1 \|v_t\|^2_V\,dt. \tag{3}$$

The estimated vector field $\hat v$ is chosen to be any element of $V[0,1]$ which maximizes $E_\lambda$ over $V[0,1]$. The following claim establishes that such a $\hat v$ exists.

###### Claim 1.

Let $V$ be a Hilbert space which is continuously embedded in $C^2_0(\Omega,\mathbb{R}^d)$, where $\Omega$ is a bounded open subset of $\mathbb{R}^d$. Suppose $e^H$ is a bounded and continuous density on $\Omega$. Then there exists a time varying vector field $\hat v\in V[0,1]$ such that

$$E_\lambda(\hat v) = \sup_{v\in V[0,1]} E_\lambda(v). \tag{4}$$
###### Proof.

We first establish $\sup_{v\in V[0,1]}E_\lambda(v)<\infty$ by splitting the energy into three parts

$$E_\lambda(v) = \underbrace{\frac{1}{n}\sum_{k=1}^n \log\det D\phi^v_1(X_k)}_{:=E_1(v)} + \underbrace{\frac{1}{n}\sum_{k=1}^n H\circ\phi^v_1(X_k)}_{:=E_2(v)} \;\underbrace{-\,\frac{\lambda}{2}\int_0^1 \|v_t\|^2_V\,dt}_{:=E_3(v)} \tag{5}$$

Notice that each term is well defined and finite whenever $v\in V[0,1]$, since the assumption $V\hookrightarrow C^2_0(\Omega,\mathbb{R}^d)$ is sufficient for Theorem 8.7 in [24] to apply to the class $V[0,1]$. In particular, for any $v\in V[0,1]$ there exists a unique class of diffeomorphisms of $\Omega$, $\phi^v_t$, which satisfies (2) (also see Theorem 2.5 in [10]). The term $E_2(v)$ is clearly bounded from above since $H$ is bounded above by assumption. For the remaining two terms notice that the determinant of the Jacobian is given by $\det D\phi^v_1(x)=\exp\big(\int_0^1 \operatorname{div} v_t(\phi^v_t(x))\,dt\big)$ (by equation (26) in the Appendix). Therefore

$$\begin{aligned} E_1(v)+E_3(v) &= \frac{1}{n}\sum_{k=1}^n \int_0^1\Big(\operatorname{div} v_t(\phi^v_t(X_k)) - \frac{\lambda}{2}\|v_t\|^2_V\Big)\,dt \\ &\le \frac{1}{n}\sum_{k=1}^n \int_0^1\Big(\sup_{x\in\Omega}|\operatorname{div} v_t(x)| - \frac{\lambda}{2}\|v_t\|^2_V\Big)\,dt \\ &\le \int_0^1\Big(c\,\|v_t\|_V - \frac{\lambda}{2}\|v_t\|^2_V\Big)\,dt, \quad\text{by the assumption } V\hookrightarrow C^2_0(\Omega,\mathbb{R}^d) \\ &\le \frac{c^2}{2\lambda} < \infty. \end{aligned}$$

Now let $v^m\in V[0,1]$ be any maximizing sequence that satisfies $E_\lambda(v^m)\to\sup_{v\in V[0,1]}E_\lambda(v)$. Since the supremum is finite we can construct the sequence so that there exists an $M<\infty$ such that $\|v^m\|_{V[0,1]}\le M$ for all $m$. Since the sequence is bounded and closed finite balls in $V[0,1]$ are weakly compact (by [10]), we may extract a subsequence from $v^m$ (relabeled by $m$) which weakly converges to a $\hat v\in V[0,1]$. In particular, $\langle v^m,w\rangle_{V[0,1]}\to\langle\hat v,w\rangle_{V[0,1]}$ for all $w\in V[0,1]$. Furthermore we have lower semicontinuity of the norm

$$\liminf_{m\to\infty} \|v^m\|^2_{V[0,1]} \ge \|\hat v\|^2_{V[0,1]}. \tag{6}$$

Now by Theorem 3.1 in [10] we have that $\phi^{v^m}_t(x)\to\phi^{\hat v}_t(x)$ uniformly in $(t,x)$ as $m\to\infty$. This allows us to show that $\log\det D\phi^{v^m}_1(x)\to\log\det D\phi^{\hat v}_1(x)$ for every $x\in\Omega$. To see why, one can use similar reasoning as in [7]. First write

$$\big|\log\det D\phi^{v^m}_1(x) - \log\det D\phi^{\hat v}_1(x)\big| = \Big|\int_0^1 \operatorname{div} v^m_t(\phi^{v^m}_t(x)) - \operatorname{div}\hat v_t(\phi^{\hat v}_t(x))\,dt\Big| \le I + II$$

where the first term satisfies

$$\begin{aligned} I &\equiv \Big|\int_0^1 \operatorname{div} v^m_t(\phi^{v^m}_t(x)) - \operatorname{div} v^m_t(\phi^{\hat v}_t(x))\,dt\Big| \\ &\le \int_0^1 \|\operatorname{div} v^m_t\|_{1,\infty}\,\big|\phi^{v^m}_t(x)-\phi^{\hat v}_t(x)\big|\,dt \\ &\le \int_0^1 c\,\|v^m_t\|_V\,\big|\phi^{v^m}_t(x)-\phi^{\hat v}_t(x)\big|\,dt, \quad\text{since } V\hookrightarrow C^2_0(\Omega,\mathbb{R}^d) \\ &\to 0, \quad\text{since } \|v^m\|_{V[0,1]}\le M \text{ for all } m. \end{aligned}$$

For the second term notice that the map sending $u\in V[0,1]$ to $\int_0^1 \operatorname{div} u_t(\phi^{\hat v}_t(x))\,dt$ is a bounded linear functional on $V[0,1]$ (using the fact that $V\hookrightarrow C^2_0(\Omega,\mathbb{R}^d)$). By the Riesz representation theorem there exists a $w^{\hat v}\in V[0,1]$ such that $\int_0^1 \operatorname{div} u_t(\phi^{\hat v}_t(x))\,dt = \langle u, w^{\hat v}\rangle_{V[0,1]}$ for all $u\in V[0,1]$. Therefore

$$\begin{aligned} II &\equiv \Big|\int_0^1 \operatorname{div} v^m_t(\phi^{\hat v}_t(x)) - \operatorname{div}\hat v_t(\phi^{\hat v}_t(x))\,dt\Big| \\ &= \big|\langle v^m-\hat v,\, w^{\hat v}\rangle_{V[0,1]}\big| \to 0, \quad\text{by weak convergence.} \end{aligned}$$

Combining the results for $I$ and $II$ we can conclude that $\log\det D\phi^{v^m}_1(x)\to\log\det D\phi^{\hat v}_1(x)$ for every $x\in\Omega$.

To finish the proof notice that

$$\begin{aligned} \sup_{v\in V[0,1]} E_\lambda(v) &= \lim_{m\to\infty} E_\lambda(v^m) = \limsup_{m\to\infty} E_\lambda(v^m) \\ &= \frac{1}{n}\sum_{k=1}^n \log\det D\phi^{\hat v}_1(X_k) + \frac{1}{n}\sum_{k=1}^n H\circ\phi^{\hat v}_1(X_k) - \frac{\lambda}{2}\liminf_{m\to\infty}\int_0^1\|v^m_t\|^2_V\,dt \\ &\le \frac{1}{n}\sum_{k=1}^n \log\det D\phi^{\hat v}_1(X_k) + \frac{1}{n}\sum_{k=1}^n H\circ\phi^{\hat v}_1(X_k) - \frac{\lambda}{2}\int_0^1\|\hat v_t\|^2_V\,dt, \quad\text{by (6)} \\ &= E_\lambda(\hat v). \qquad\blacksquare \end{aligned}$$

One of the important facts about any vector field flow $\hat v$ which maximizes $E_\lambda$ is that the resulting estimated transformation is a geodesic (or minimum energy) flow with respect to the vector field norm $\|\cdot\|_{V[0,1]}$. To see this, first notice that the parameterization of the time $t$ maps $\phi^v_t$ by vector fields $v$ is a many-to-one parameterization. In other words, there exist pairs of vector fields $u\neq v$ such that $\phi^u_1=\phi^v_1$. Notice, however, that the log-likelihood term in $E_\lambda$ only depends on $\phi^v_1$. This implies that any maximizer $\hat v$ of $E_\lambda$ must simultaneously minimize the penalty $\int_0^1\|v_t\|^2_V\,dt$ over the class of all $v$ which have the same terminal value, i.e. $\phi^v_1=\phi^{\hat v}_1$. Consequently, the PMLE must be a geodesic flow. An important consequence is that geodesic flows are completely determined by the initial vector field $\hat v_0$. This will become particularly important in the next section, where the initial velocity field will be completely parameterized by $n$ coefficient vectors.
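Before moving to the Euler-Lagrange characterization, the objective (3) can be sanity-checked numerically. The toy computation below is our own sketch, not the paper's code: it assumes a Gaussian target ($H(y)=-y^2/2-\log\sqrt{2\pi}$), a made-up time-constant field, and a surrogate penalty standing in for the unspecified $V$-norm. It accumulates $\log\det D\phi^v_1(x)=\int_0^1 \operatorname{div} v_t(\phi^v_t(x))\,dt$ along an Euler flow.

```python
import numpy as np

# One-dimensional toy evaluation of the penalized log-likelihood (3).
# Hypothetical field v(x) = a·tanh(x); its divergence in 1-d is v'(x).

def H(y):
    """Log density of the assumed standard normal target."""
    return -0.5 * y**2 - 0.5 * np.log(2 * np.pi)

def penalized_loglik(X, a, lam, n_steps=2000):
    v = lambda x: a * np.tanh(x)          # the (assumed) field
    dv = lambda x: a / np.cosh(x)**2      # its divergence
    phi = X.astype(float).copy()
    logdet = np.zeros_like(phi)
    dt = 1.0 / n_steps
    for _ in range(n_steps):              # Euler flow, accumulating ∫ div v dt
        logdet += dt * dv(phi)
        phi += dt * v(phi)
    penalty = a**2                        # surrogate for ∫‖v_t‖²_V dt (assumption)
    return np.mean(logdet + H(phi)) - 0.5 * lam * penalty

X = np.array([-1.2, 0.4, 2.0, -0.3])
e0 = penalized_loglik(X, a=0.0, lam=1.0)  # a = 0: identity flow, no penalty
```

With `a = 0` the flow is the identity, so `e0` reduces to the plain average log target density of the data.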

## 4 Spline representation from Euler-Lagrange

In this section we work under the additional assumption that $V$ is a reproducing kernel Hilbert space. This assumption allows one to derive the Euler-Lagrange equation for any maximizer $\hat v$ which satisfies (4). This leads to a finite dimensional characterization of $\hat v$ which parallels those results found in the spline literature for function estimation.

###### Claim 2.

Let $V$ be a reproducing kernel Hilbert space, with kernel $R$, continuously embedded in $C^2_0(\Omega,\mathbb{R}^d)$, where $\Omega$ is a bounded open subset of $\mathbb{R}^d$. Suppose $e^H$ is a bounded, continuously differentiable density on $\Omega$. Then any time varying vector field $\hat v$ which satisfies (4) also satisfies the following Euler-Lagrange equation:

$$\hat v_t(x) = \frac{1}{\lambda n}\sum_{k=1}^n \beta^T_{k,t}\,R(x,X_{k,t}) + \frac{1}{\lambda n}\sum_{k=1}^n \nabla_y R(x,y)\Big|_{y=X_{k,t}} \tag{7}$$

where $X_{k,t}\equiv\phi^{\hat v}_t(X_k)$ and $\beta^T_{k,t}\equiv\nabla H(X_{k,1})\,D\phi^{\hat v}_{t1}(X_{k,t}) + \nabla\log\det D\phi^{\hat v}_{t1}(X_{k,t})$.

###### Proof.

Let $h\in V[0,1]$ and decompose $E_\lambda$ as in (5). Notice first that if $\hat v\in V[0,1]$ and $h\in V[0,1]$ then $\hat v+\epsilon h\in V[0,1]$. Therefore $E_3$ is differentiable with respect to $\epsilon$ with derivative given by

$$\partial_\epsilon E_3(\hat v+\epsilon h)\Big|_{\epsilon=0} = \int_0^1 \langle h_t,\,\lambda\hat v_t\rangle_V\,dt. \tag{8}$$

In addition, Theorem 8.10 of [24] implies that $\epsilon\mapsto\phi^{\hat v+\epsilon h}_1$ is differentiable at $\epsilon=0$. Now, the assumption $V\hookrightarrow C^2_0(\Omega,\mathbb{R}^d)$ combined with equation (31), in the Appendix, gives

$$\begin{aligned} \partial_\epsilon E_2(\hat v+\epsilon h)\Big|_{\epsilon=0} &= -\frac{1}{n}\sum_{k=1}^n \nabla H\big(\phi^{\hat v}_1(X_k)\big)\cdot \partial_\epsilon \phi^{\hat v+\epsilon h}_1(X_k)\Big|_{\epsilon=0} \\ &= -\frac{1}{n}\sum_{k=1}^n \nabla H\big(\phi^{\hat v}_1(X_k)\big)\cdot \int_0^1 \big\{D\phi^{\hat v}_{u1}\,h_u\big\}\circ\phi^{\hat v}_u(X_k)\,du \\ &= -\frac{1}{n}\sum_{k=1}^n \int_0^1 \big\{\nabla H(X_{k,1})\,D\phi^{\hat v}_{u1}(X_{k,u})\big\}\cdot h_u(X_{k,u})\,du \\ &= \int_0^1 \Big\langle h_u(\cdot),\,-\frac{1}{n}\sum_{k=1}^n \big\{\nabla H(X_{k,1})\,D\phi^{\hat v}_{u1}(X_{k,u})\big\}^T R(\cdot,X_{k,u})\Big\rangle_V du \end{aligned} \tag{9}$$

Finally, Proposition 3, from the Appendix, implies $E_1$ is differentiable at $\epsilon=0$ with derivative given by

$$\begin{aligned} \partial_\epsilon E_1(\hat v+\epsilon h)\Big|_{\epsilon=0} &= -\frac{1}{n}\sum_{k=1}^n \partial_\epsilon \log\det D\phi^{\hat v+\epsilon h}_1(X_k)\Big|_{\epsilon=0} \\ &= -\frac{1}{n}\sum_{k=1}^n \int_0^1 \big[h_u\cdot\nabla\log\det D\phi^{\hat v}_{u1} + \operatorname{div} h_u\big]\circ\phi^{\hat v}_u(X_k)\,du \\ &= \int_0^1 \Big\langle h_u(\cdot),\,-\frac{1}{n}\sum_{k=1}^n \Big(\big\{\nabla\log\det D\phi^{\hat v}_{u1}(X_{k,u})\big\}^T R(\cdot,X_{k,u}) + \nabla_y R(\cdot,y)\Big|_{y=X_{k,u}}\Big)\Big\rangle_V du \end{aligned} \tag{10}$$

Remark: the above equation requires differentiating $\log\det D\phi^{\hat v}_{u1}$, which is justified by the assumption $V\hookrightarrow C^2_0(\Omega,\mathbb{R}^d)$ (see [3]). Now from (8), (9) and (10), the energy $E_\lambda$ is differentiable with respect to $\epsilon$ at $\epsilon=0$ and

$$0 = \partial_\epsilon E_\lambda(\hat v+\epsilon h)\Big|_{\epsilon=0} = \langle E^{\hat v}, h\rangle_{V[0,1]} \tag{11}$$

where

$$E^{\hat v}_t = \lambda\hat v_t - \frac{1}{n}\sum_{k=1}^n \beta^T_{k,t}\,R(\cdot,X_{k,t}) - \frac{1}{n}\sum_{k=1}^n \nabla_y R(\cdot,y)\Big|_{y=X_{k,t}} \tag{12}$$

with $\beta^T_{k,t}\equiv\nabla H(X_{k,1})\,D\phi^{\hat v}_{t1}(X_{k,t}) + \nabla\log\det D\phi^{\hat v}_{t1}(X_{k,t})$. Since $h\in V[0,1]$ was arbitrary, equation (11) implies $E^{\hat v}=0$, which then gives (7). Remark: we are using the fact that the zero function in a reproducing kernel space is point-wise zero since the evaluation functionals are bounded. ∎

There are a few things to note here. First, the Euler-Lagrange equation (7) only implicitly characterizes $\hat v$, since $\hat v$ appears on both sides of the equality ($\beta_{k,t}$ and $X_{k,t}$ also depend on $\hat v$). Regardless, (7) is useful since it implies that $\hat v_t$ must lie within a known finite dimensional subspace of $V$. In particular, as discussed at the end of Section 3, the estimate $\hat v$ is completely characterized by its value at time $t=0$, i.e. $\hat v_0$ (by the geodesic nature of $\hat v$). Restricting equation (7) to $t=0$ one obtains

$$\hat v_0(x) = \frac{1}{\lambda n}\sum_{k=1}^n \beta^T_{k,0}\,R(x,X_k) + \frac{1}{\lambda n}\sum_{k=1}^n \nabla_y R(x,y)\Big|_{y=X_k}. \tag{13}$$

Simply stated, $\hat v_0$ has a finite dimensional spline characterization with spline knots set at the observations $X_1,\ldots,X_n$. Therefore to recover $\hat v_0$ one simply needs to find the vectors $\beta_{1,0},\ldots,\beta_{n,0}$ which satisfy the following fixed point equation

$$\beta^T_{k,0} = \nabla H\big(\phi^{\hat v}_1(X_k)\big)\,D\phi^{\hat v}_1(X_k) + \nabla\log\det D\phi^{\hat v}_1(X_k) \tag{14}$$

for all $k=1,\ldots,n$.
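In code, the spline form (13) with a Gaussian scalar kernel looks as follows (our sketch: the kernel choice and the placeholder $\beta$ vectors are assumptions; in practice the $\beta_{k,0}$ must solve the fixed point equation (14)):

```python
import numpy as np

# Evaluate the spline field (13): with R(x, y) = ρ(x − y)·I_d,
#   v̂₀(x) = (1/λn) Σ_k ρ(x − X_k) β_k + (1/λn) Σ_k ∇_y ρ(x − y)|_{y=X_k}.
# The knots are exactly the observations X_k.

def rho(x, y, sigma=1.0):
    """Assumed Gaussian scalar kernel ρ(x − y)."""
    return np.exp(-np.sum((x - y) ** 2, axis=-1) / (2 * sigma**2))

def grad_y_rho(x, y, sigma=1.0):
    """∇_y ρ(x − y) = (x − y)/σ² · ρ(x − y)."""
    return (x - y) / sigma**2 * rho(x, y, sigma)[..., None]

def v0(x, X, beta, lam, sigma=1.0):
    """Evaluate (13) at the rows of x (shape (m, d))."""
    n = X.shape[0]
    w = rho(x[:, None, :], X[None, :, :], sigma)               # (m, n) weights
    smooth = w @ beta                                          # β-weighted kernels
    attract = grad_y_rho(x[:, None, :], X[None, :, :], sigma).sum(axis=1)
    return (smooth + attract) / (lam * n)

X = np.array([[0.0, 0.0], [1.0, -1.0], [-0.5, 2.0]])           # "observations"
beta = np.zeros((3, 2))                                        # placeholder β_{k,0}
field = v0(X, X, beta, lam=1.0)
```

The field decays with the kernel away from the data, so far from the observations the estimated flow is nearly the identity.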

## 5 Connection to Stein’s Method

The Euler-Lagrange equation given in (7) has a surprising connection with a generalization of Stein’s lemma for characterizing the normal distribution (see [19]). The main connection is that the Euler-Lagrange equation for the PMLE $\hat v$, simplified at initial time $t=0$ and terminal time $t=1$, can be reinterpreted as an empirical version of a generalization of Stein’s lemma. This is interesting in its own right; moreover, the connection may also bear theoretical fruit for deriving asymptotic estimation bounds on the nonparametric and semiparametric estimates derived from $\hat v$. In this section we make this connection explicit with the goal of motivating and explaining the Euler-Lagrange equation for $\hat v$ derived above.

To relate $\hat v$ at $t=0$ with Stein’s lemma, and more generally Stein’s method for distributional approximation, first notice that (14) implies the coefficients $\beta_{k,0}$, from the implicit equation (13) for $\hat v_0$, satisfy $\beta^T_{k,0}=\nabla\log\hat f(X_k)$, where $\hat f$ is the estimated density of $X$ obtained by pulling back the target measure $P$ with the estimated diffeomorphism $\phi^{\hat v}_1$. Now by computing the inner product of both sides of the Euler-Lagrange equation (13) with any vector field $u\in V$ and applying the reproducing property of the kernel one derives

$$\lambda\,\langle\hat v_0,u\rangle_V = \mathbb{E}_n\big\{\nabla\log\hat f(X)\cdot u(X) + \operatorname{div} u(X)\big\}. \tag{15}$$

where $\mathbb{E}_n$ denotes expectation with respect to the empirical measure generated by the data: $\mathbb{E}_n\,g=\frac{1}{n}\sum_{k=1}^n g(X_k)$. To relate with Stein, first let $\mathbb{E}$ denote expectation with respect to the population density $f$ given in our basic model (1). Notice that a generalization of Stein’s lemma shows that if the densities $f$ and $\hat f$ give rise to the same probability measure then

$$0 = \mathbb{E}\big\{\nabla\log\hat f(X)\cdot u(X) + \operatorname{div} u(X)\big\} \tag{16}$$

for all $u$ in a large class of test functions (see Proposition 4 in [19]). For example, a simple consequence of Lemma 2 in [18] is that when $\hat f$ is the density of a $d$ dimensional Gaussian distribution, requiring (16) to hold for every bounded $u$ with bounded gradient forces $f$ and $\hat f$ to generate the same probability measure. Stein’s method, on the other hand, generally refers to a technique for bounding the distance between two probability measures using bounds on departures from a characterizing equation, such as (16) for example (see [8] for an exposition). The bounds typically take the form

$$\sup_{h\in\mathcal H}\Big|\int \big(h f - h \hat f\,\big)\Big| \le \sup_{u\in\mathcal U}\big|\mathbb{E}\{\nabla\log\hat f(X)\cdot u(X) + \operatorname{div} u(X)\}\big| \tag{17}$$

where $\mathcal H$ and $\mathcal U$ are two classes of functions related through a set of differential equations. In our case, applying Hölder’s inequality to the Euler-Lagrange equation (15) gives a bound on the right hand side of (17) in terms of a regularization measurement on the PMLE and an empirical process error:

$$\sup_{u\in\mathcal U}\big|\mathbb{E}\{\nabla\log\hat f(X)\cdot u(X)+\operatorname{div} u(X)\}\big| \le \underbrace{\lambda\,\|\hat v_0\|_V\,\sup_{u\in\mathcal U}\|u\|_V}_{\text{regularization at }t=0} + \underbrace{\sup_{u\in\mathcal U}\big|(\mathbb{E}-\mathbb{E}_n)\,\nu_{\hat f,u}\big|}_{\text{empirical process error}} \tag{18}$$

where $\nu_{\hat f,u}(x)\equiv\nabla\log\hat f(x)\cdot u(x)+\operatorname{div} u(x)$. This makes it clear that theoretical control of the PMLE estimate $\hat v_0$ at time $t=0$, using the Euler-Lagrange equation characterization (13), allows asymptotic control of the distance between the estimated density $\hat f$ and the true density $f$.
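The characterizing equation (16) is easy to probe by simulation. The sketch below (our illustration; the test function $u=\tanh$ is an arbitrary choice, not the paper's) checks the classical one-dimensional case: for the standard normal, $\nabla\log f(x)=-x$, so $\mathbb{E}\{-X\,u(X)+u'(X)\}=0$.

```python
import numpy as np

# Monte Carlo check of Stein's identity for the standard normal:
#     E{ -X·u(X) + u'(X) } = 0   when X ~ N(0, 1),
# which is (16) with ∇ log f(x) = −x and div u = u' in one dimension.

rng = np.random.default_rng(42)
X = rng.standard_normal(200_000)

u = np.tanh(X)                       # arbitrary smooth bounded test function
du = 1.0 / np.cosh(X) ** 2           # its derivative
stein_stat = np.mean(-X * u + du)    # should vanish up to O(n^{-1/2}) noise
```

A nonzero value of this statistic (beyond sampling noise) witnesses a discrepancy between the sampling law and the density used in the score term, which is exactly how (17)–(18) turn the Euler-Lagrange equation into a distance bound.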

At terminal time $t=1$, there is a similar connection with Stein’s lemma. In contrast to time $t=0$, which quantifies the distance between the estimated and population densities $\hat f$ and $f$, time $t=1$ quantifies the distance between $P$ (the target measure) and the push forward of the true population distribution through the estimated map $\phi^{\hat v}_1$. To make the connection, one follows the same line of argument as above to find that for any $u\in V$

$$\lambda\,\langle\hat v_1,u\rangle_V = \mathbb{E}^{\hat v}_n\big\{\nabla H(X)\cdot u(X) + \operatorname{div} u(X)\big\} \tag{19}$$

where $\mathbb{E}^{\hat v}_n$ denotes expectation with respect to the empirical measure of $\phi^{\hat v}_1(X_1),\ldots,\phi^{\hat v}_1(X_n)$, which is simply the push forward of the empirical measure of the data through the estimated map $\phi^{\hat v}_1$. Now the analog to (18) becomes

$$\sup_{u\in\mathcal U}\big|\mathbb{E}^{\hat v}\{\nabla H(X)\cdot u(X)+\operatorname{div} u(X)\}\big| \le \underbrace{\lambda\,\|\hat v_1\|_V\,\sup_{u\in\mathcal U}\|u\|_V}_{\text{regularization at }t=1} + \sup_{u\in\mathcal U}\big|(\mathbb{E}^{\hat v}-\mathbb{E}^{\hat v}_n)\,\gamma_u\big| \tag{20}$$

where $\gamma_u(x)\equiv\nabla H(x)\cdot u(x)+\operatorname{div} u(x)$ and $\mathbb{E}^{\hat v}$ denotes expectation with respect to the push forward of the population density $f$ through the estimated map $\phi^{\hat v}_1$. Since the target measure $P$ is assumed to have density $e^H$, this bounds the distributional distance between $P$ and the push forward of the true sampling distribution through $\phi^{\hat v}_1$.

## 6 Nonparametric example

In this section we utilize the finite dimensional characterization of the PMLE at time $t=0$, given in (13), to construct nonparametric density estimates of the form $\hat f(x)=\det\big(D\phi^{\hat v}_1(x)\big)\,e^{H\circ\phi^{\hat v}_1(x)}$ from iid samples $X_1,\ldots,X_n$. As was discussed in the introduction, so long as the target measure $P$ is absolutely continuous, the model (1) encompasses all absolutely continuous sampling measures. Since the class of diffeomorphisms is nonparametric, the estimate $\hat f$ is inherently nonparametric regardless of the choice of target probability measure $P$ (with density $e^H$). In effect, the choice of target specifies a shrinkage direction for the nonparametric estimate: larger values of $\lambda$ shrink further toward the target $P$. In this section we illustrate the nonparametric nature of the density estimate $\hat f$, whereas the next section explores semiparametric estimation with parametric models on the target $P$. One key feature of our methodology is the use of the Euler-Lagrange equation (7) as a stopping criterion for a gradient based optimization algorithm for constructing $\hat f$. In fact, to avoid computational challenges associated with generating geodesics with initial velocities given by (13), we consider a finite dimensional subclass of vector fields which have geodesics that are amenable to computation (and for which gradients are easy to compute). The key is that we use the Euler-Lagrange identity (7) to measure the richness of the subclass, within the larger infinite dimensional Hilbert space $V[0,1]$, thereby allowing a dynamic choice of the approximating dimension for a target resolution level.

Claim 2 shows that the PMLE vector field $\hat v_t$ obeys a parametric form determined up to the identification of the coefficient functions $t\mapsto\beta_{k,t}$ as $t$ ranges in $[0,1]$. Moreover, the whole path of coefficients is determined from the initial values $\beta_{k,0}$ by the geodesic nature of $\hat v$. In this way, we are free to optimize over the vectors $\beta_{k,0}$ using equation (13), and are guaranteed that the global maximum, over the full infinite dimensional space $V[0,1]$, has this form. Unfortunately, deriving geodesic maps with this type of initial velocity field is challenging. To circumvent this difficulty we choose an approximating subclass of vector fields at time $t=0$ which are parametrized by the selection of knots $\kappa_1,\ldots,\kappa_N$ and initial momentum row vectors $\eta^T_1,\ldots,\eta^T_N$ and have the form:

$$v_0(x) = \sum_{k=1}^N \eta^T_k\,R(x,\kappa_k). \tag{21}$$

The knots