A Jacobian inequality for gradient maps on the sphere and its application to directional statistics

# A Jacobian inequality for gradient maps on the sphere and its application to directional statistics

Tomonari SEI
###### Abstract

In the field of optimal transport theory, an optimal map is known to be a gradient map of a potential function satisfying cost-convexity. In this paper, the Jacobian determinant of a gradient map is shown to be log-concave with respect to a convex combination of the potential functions when the underlying manifold is the sphere and the cost function is the distance squared. The proof uses the non-negative cross-curvature property of the sphere recently established by Kim and McCann, and Figalli and Rifford. As an application to statistics, a new family of probability densities on the sphere is defined in terms of cost-convex functions. The log-concave property of the likelihood function follows from the inequality.

## 1 Introduction

In recent years, the theory of optimal transport has been actively studied. In particular, properties of the optimal transport map on Riemannian manifolds are well established. The existence and uniqueness theorem for the optimal transport map on Riemannian manifolds was proved by McCann (2001); this result extended the pioneering work of Brenier (1991) for the Euclidean case. He showed that optimal transport is given by the gradient map of a so-called cost-convex function. On the other hand, for statistical data analysis on Euclidean space, it is useful to consider convex combinations of convex functions in order to construct various probability density functions (Sei (2006), Sei (2009)). In this paper, we show that when the underlying space is the sphere, the convex combination of cost-convex functions is actually cost-convex (Lemma 1) and the Jacobian determinant of the resultant gradient map is log-concave with respect to the convex combination (Theorem 1). This result is an extension of the Jacobian interpolation inequality shown by Cordero-Erausquin et al. (2001). We refer to our Jacobian inequality as the Jacobian inequality throughout this paper, for simplicity.

Our result is related to the regularity theory of optimal transport maps. Here we consider some recent studies in this field. Ma et al. (2005) showed that regularity of the transport map for general cost functions on Euclidean space is assured if a geometrical quantity called the cost-sectional curvature is positive. Conversely, Loeper (2005) showed that non-negativity of the cost-sectional curvature is necessary for regularity. He also showed that non-negativity of the cost-sectional curvature implies non-negativity of the usual sectional curvature if the cost function is the squared distance on a Riemannian manifold. However, the converse does not hold (Kim (2007)). Comprehensive assessment on the theory of optimal transport has been published (Villani (2009)). A relevant concept is the cross-curvature (Kim and McCann (2007)). Kim and McCann (2008) and Figalli and Rifford (2009) independently showed that the sphere has almost positive cross-curvature. In general, the cost-sectional curvature is non-negative if the cross-curvature is non-negative. In the present paper, we use the non-negative cross-curvature property of the sphere to prove our main results.

We show that our Jacobian inequality opens several doors for applications to directional statistics. In this field, a family of probability densities is used to analyze given directional data, such as locations on the earth. For example, a test on the directional character of given data is constructed via families of probability density functions on the sphere. Directional statistics has a long history since Fisher (1953) and a comprehensive text on this subject has been published (Mardia and Jupp (2000)).

We define a probability density function on the sphere by the gradient maps of cost-convex functions. Although, in the context of optimal transport, one usually considers push-forward of probability densities, we construct a family of densities by means of pull-back of probability densities. This follows from the fact that a pull-back density has an explicit expression for the likelihood function needed for statistical analysis. The density function does not need any special functions such as the modified Bessel function, which usually appear in directional statistics. Furthermore, the Jacobian inequality implies that the likelihood function is log-concave with respect to the statistical parameters. This property is reasonable for computation of the maximum likelihood estimator. We propose more specific models and show graphical images of each probability density. In terms of analysis of real data, we present the result of density estimation for some astronomical data.

This paper is organized as follows. In Section 2, we present basic notation and state our main theorem. In Section 3, we construct a family of probability density functions on the sphere and apply them to directional statistics. All mathematical proofs of the main theorem and lemmas are given in Section 4. Finally we present a discussion in Section 5.

## 2 Main theorem

Let be the -dimensional unit sphere. The tangent space at is denoted by . The geodesic distance (arc length) between and in is denoted by . The cost function is . If one uses Euclidean coordinates in to express , then , where the range of is .

The -transform of a function is defined by

 ϕc(y) = supx∈Sn{−c(x,y)−ϕ(x)}. (1)

The function is said to be cost-convex, or -convex, if . Examples of -convex functions will be given in Section 3. By compactness of , a function is -convex if and only if for any there exists some (not necessarily unique) such that .

The image of the exponential map of at , denoted by , is the end point of the geodesic starting at with the initial vector . More explicitly, if one uses Euclidean coordinates in to express and , the exponential map is written as , where denotes the Euclidean norm of the vector . The exponential map is a diffeomorphism from to , where is the antipodal point of .

The following lemma is a consequence of the non-negative cross-curvature property of the sphere established by Kim and McCann (2008) and Figalli and Rifford (2009). See Section 4 for a proof.

###### Lemma 1 (Convex combination of c-convex functions).

If and are -convex, then for each the function of is also -convex.

###### Remark 1.

Figalli et al. (2009) showed Lemma 1 simultaneously and independently from us. Indeed, they showed more general result, in that the convexity of the space of -convex functions is necessary and sufficient condition for the non-negative cross-curvature property (see Theorem 3.2 in Figalli et al. (2009)).

We define as long as is differentiable at , where is the gradient operator. Following Delanoë and Loeper (2006), we call the gradient map associated with the potential function . The map is differentiable at if and has its Hessian at . It is known that any -convex on any compact Riemannian manifold is Lipschitz and therefore differentiable almost everywhere. Furthermore, has a Hessian almost everywhere in the Alexandrov sense, and therefore is differentiable almost everywhere (see McCann (2001) and Cordero-Erausquin et al. (2001)). These technical facts on differentiability are important for the theory of optimal transport. However, we will not need them because, for statistical applications, we can assume from the beginning that is differentiable except at a finite set of points (see Section 3).

For any -convex functions and , by Lemma 1, the convex combination is -convex. We define an interpolation of gradient maps by

 Ft(x) = Gϕt(x) = expx(∇ϕt(x)),t∈[0,1].

Assume that for each , and has its Hessian at . Then it is easy to see that, for any , and has its Hessian defined at . We define the Jacobian determinant with respect to any orthonormal basis on and with suitable orientations.

The following theorem is our main result.

###### Theorem 1 (Jacobian inequality).

Let and be two -convex functions. Let be a point in such that, for each , and has its Hessian defined at . Then the Jacobian determinant defined above is log-concave with respect to . It is equivalent to the inequality

 logJt(x) ≥ (1−t)logJ0(x)+tlogJ1(x),t∈[0,1].

We refer to the above inequality as the Jacobian inequality in this paper.

###### Remark 2.

This theorem is an extension of the result obtained by Cordero-Erausquin et al. (2001). They showed a similar inequality under the additional assumption that , as a corollary of a stronger inequality related to the geometric-arithmetic inequality. It is not known whether the stronger one holds for our case (see also Remark 5). ∎

## 3 Application to directional statistics

### 3.1 Probability densities induced by gradient maps

In Sei (2006) and Sei (2009), the author proposed a family of probability density functions in terms of gradient maps on Euclidean space, where a probability density is constructed as a pull-back of some fixed measure (typically Gaussian) pulled by a gradient map. The notion can be directly extended to probability density functions on the sphere.

For statistical application, we will consider only -convex functions such that the gradient map is an isomorphism on and has its Hessian defined everywhere except at a finite set of points. We define some related terminology.

###### Definition 1 (Wrapping potential function).

We say that a function is a wrapping potential function if is -convex, has its Hessian defined everywhere except for a finite set of points and is an isomorphism on . Let be the set of all wrapping potential functions.

We have the following lemma.

###### Lemma 2.

If and are in , then the interpolation () is also in .

We construct a probability density function for each . Let be a random variable on distributed uniformly. Then, since is bijective, we can define a random variable on by . The probability density function of with respect to the uniform measure is , where the symbol Jac refers to the Jacobian determinant. In other words, we define by the pull-back measure of the uniform measure pulled by the gradient map .

At this point, we describe the exact sampling method of the probability density function . A sampling procedure is important if one needs to calculate expectations by the Monte Carlo method. From the definition, it is clear that the random variable with a uniformly random variable on has density . Hence if we can generate and solve the equation effectively, we obtain a random sample . Indeed, is quite easily generated, for example, by normalization of a standard Gaussian sample in . To solve , it is sufficient to find the unique minimizer of the function with respect to since the following lemma holds.

###### Lemma 3.

[Lemma 7 of McCann (2001)] If is -convex and is defined at , then the unique minimizer of with respect to is .

Thus our task is to solve the (deterministic) minimization problem. Although the minimization problem of is not convex in the usual sense, the objective function has no local minimum, by -convexity. Hence the problem is efficiently solved by generic optimization packages. An example of sampling is illustrated in Figure 1.

We consider a finite-dimensional set of probability densities on the sphere. In statistics, a finite-dimensional set of probability densities is called a statistical model. An unknown parameter that parameterizes the density functions is estimated from observed data points . One of the most important estimators is the maximum likelihood estimator that maximizes the likelihood function with respect to .

We construct a new statistical model using -convex functions. Recall that the set of wrapping potential functions is a convex space (Lemma 2). We can consider a finite-dimensional subspace as follows. Let for . Define

 ϕθ(x) = p∑i=1θiϕ(i)(x),

where ranges over a convex subset of such that for any . By Lemma 2 and the elementary fact that , we can use the simplex as . Let be the probability density function induced by , that is,

 p(x|θ) = Jac(Gϕθ(x)). (2)

We call the family (2) the spherical gradient model.

The maximum likelihood estimator for the spherical gradient model (2) is reasonably computed by the following corollary of Theorem 1.

###### Corollary 1.

Define by (2). Then, for any data points , the likelihood function is log-concave with respect to .

###### Remark 3.

As an anonymous referee pointed out, in Euclidean space, there are results on convexity along generalized geodesics (Chapter 9 of Ambrosio et al. (2005); see also Villani (2009)). Here a generalized geodesic is defined by the set of measures pushed forward by the gradient maps . On the other hand, we consider pull-back measures in this paper as, for example, (2) indicates.

### 3.3 Examples

We give some examples of the spherical gradient model (2). Recall denotes the length between and on . All the examples are combinations of rotationally symmetric functions , where and . The -th derivative of is denoted by . The following lemma is fundamental.

###### Lemma 4.

Assume that and for almost all . Then for each the function of is in .

Let be the set of functions on that satisfy the assumption in Lemma 4. Choose pairs from . Then we can define the spherical gradient model (2) with

 ϕθ(x) = p∑i=1θifi(d(x,zi))θ=(θi)pi=1∈Θ, (3)

where is a convex subset of such that for all .

###### Remark 4.

If , the resultant density is a function of for some . In directional statistics, such a probability density function is called rotationally symmetric. ∎

We briefly touch on known distributions on the sphere in statistics. A very well-known distribution on the sphere is the von Mises-Fisher distribution defined by

 p(x|μ) = (|μ|2)(n+1)/21Γ((n+1)/2)I(n+1)/2−1(|μ|)exp(μ⊤x) (4)

in Euclidean coordinates of , where and denotes the modified Bessel function of the first kind and order . A more general distribution is the Fisher-Bingham distribution defined by

 p(x|μ,A) = 1a(μ,A)exp(μ⊤x+x⊤Ax), (5)

where is a normalizing factor to ensure that . See Mardia and Jupp (2000) for details.

We return to our spherical gradient model (2) with (3). The following explicit formula due to a general expression (11) is useful for practical implementation:

 p(x|θ) = (sin|vθ|/|vθ|)n−1det(xx⊤+Hθ+p∑i=1θiKi), vθ = −p∑i=1θif′i(αi)ei,αi = cos−1(x⊤zi),ei = zi−xcosαisinαi, Hθ = eθe⊤θ+αθcosαθsinαθ(I−xx⊤−eθe⊤θ),eθ = vθ/|vθ|,αθ = |vθ|, Ki = f′′i(αi)eie⊤i+f′i(αi)cosαisinαi(I−xx⊤−eie⊤i),

where Euclidean coordinates in are used. We remark that the above formula needs no special function, unlike the von Mises-Fisher distribution (4) or the Fisher-Bingham distribution (5).

We give examples of pairs . Recall that is the set of all wrapping potential functions.

###### Example 1 (Linear potential).

Let for all . We use Euclidean coordinates in to express . Then is in as long as . We deduce that a potential function is in if . The parameter determines the direction and magnitude of concentration. That is, the resultant density function takes larger values at when is closer to and is larger, where the negative sign of is needed because our model is defined by the pull-back measure. We call the linear potential and the resultant statistical model the linear-potential model. This model is rotationally-symmetric (see Remark 4). An example is given in Figure 2 (a).

Consider for and for . Then the potential can be written as

 ϕθ(x) = p1∑i=1θix⊤zi+p∑i=p1+1θi4{2(x⊤zi)2−1}.

Let and . Let denote the trace norm of defined by the sum of absolute eigenvalues of . This is actually a norm because . Then we deduce that a potential function

 ϕμ,A(x) = x⊤μ+12x⊤Ax (6)

is in if satisfies . We call the model the quadratic-potential model. Various numerical examples of the quadratic potential model are given in Figure 2. Note that the representation of includes redundancy because . It will be tractable if one sets . However, in general this restriction strictly reduces the size of the set. For example, the matrix has norm but the trace-adjusted one has .

###### Example 3 (High-frequency potential).

As a generalization of the above examples, we consider for a positive integer . If and are given, we obtain a potential

 ϕθ(x) = p∑i=1θik−2icos(kid(x,zi)). (7)

We call this model the high-frequency model. Various numerical examples of the high-frequency model are given in Figure 3. The density function used in Figure 1 belongs to this class.

### 3.4 An actual data set

Here we give a brief analysis of some astronomical data. The data consist of the locations of 188 stars of magnitude brighter than or equal to 3.0. The data is available from the Bright Star Catalog (5th Revised Ed.) distributed from the Astronomical Data Center. We simply compare the quadratic model and the null model (uniform distribution) by using Akaike’s Information Criterion (AIC). In general, AIC for a statistical model is defined by the sum of times the maximum log-likelihood and times the parameter dimension. It is recommended to select the statistical model minimizing AIC from a set of candidates. See Akaike (1974) for details of AIC.

The estimated parameter for the quadratic model is

 ^μ = (0.010,0.017,0.091)⊤and^A = 0.173^z1^z⊤1−0.250^z2^z⊤2,

where and . The maximum log-likelihood is 12.5. Since the number of unknown parameters is 8, AIC is . On the other hand, the likelihood of the null model (uniform distribution) is zero and AIC is also zero. Therefore, we select the quadratic model from the two candidates. Figure 4 shows the observed data and the estimated density.

## 4 Proofs

### 4.1 Proofs of Lemma 1

We use the following lemma due to Proposition 6 of McCann (2001). The lemma can also proved by direct calculation for the sphere. Recall .

###### Lemma 5 (Inverse of the exponential map).

Let and assume . Then , where denotes the gradient operator with respect to .

We first recall the cross-curvature non-negativity of the sphere (Kim and McCann (2008), Figalli and Rifford (2009)). For simplicity, the definitions below are specialized for the sphere. For a given triplet with and , the curve

 {expz((1−t)exp−1z(x)+texp−1z(y))∣t∈[0,1]}

is called a -segment connecting and with respect to . We denote the -segment by in this paper. For given with , let and be smooth curves such that and . We assume that either or . Note that only one of the two curves is assumed to be a -segment. Then the cross-curvature is well defined by

 S(x,y)(ξ,η) = −d2ds2d2dt2c(σs,τt)∣∣∣s=0,t=0,

where and . For a given quadruplet , the sliding mountain is defined by a function

 t ↦ c(z,[y0,y1]t(z))−c(x,[y0,y1]t(z)). (8)

We use the following fact proved by Kim and McCann (2008) and Figalli and Rifford (2009).

###### Lemma 6 (Cross-curvature non-negativity).

For the sphere, the cross-curvature is non-negative for any with .

Although the following lemma is essentially due to Kim and McCann (2008), we derive it from Lemma 6 for completeness.

###### Lemma 7 (Time-convex-sliding-mountain).

Let be a point in and let and be two points in different from the antipodal point of . Then for any the sliding-mountain (8) is convex with respect to .

###### Proof.

Denote the sliding-mountain (8) by . Fix and denote for simplicity. We first assume is not the antipodal point of and prove . Let and . Note that is a -segment with respect to . Then from Lemma 6, we have

 −d2ds2d2du2c(σs,τu)∣∣∣u=0 ≥ 0 (9)

for each . On the other hand, by Lemma 5,

 ddsc(σs,τu)∣∣∣s=0 = −⟨ξ,exp−1z(τu)⟩ = −⟨ξ,(1−u)exp−1z(y)+uexp−1z(y1)⟩,

where and denotes the inner product on . We obtain

 ddsd2du2c(σs,τu)∣∣∣s=0,u=0 = 0. (10)

Integrating both sides of (9) with respect to twice and using (10), we have

 {d2du2c(z,τu)−d2du2c(x,τu)}∣∣∣u=0 ≥ 0.

Since , we have . Next we assume that is the antipodal point of . By assumption, is not the antipodal point of . By direct calculation, we have

 lims→t+0df(s)ds−lims→t−0df(s)ds = 2π∣∣∣d[y0,y1]t(z)dt∣∣∣ ≥ 0.

Therefore is convex over . ∎

Now we use Lemma 7 to prove Lemma 1. For any point , we denote the antipodal point of by . Since and are -convex, there exist functions and such that (). Then we have

 ϕt(x) = (1−t)ϕ0(x)+tϕ1(x) = supy0{−(1−t)c(x,y0)−(1−t)ϕc0(y0)}+supy1{−tc(x,y1)−tϕc1(y1)}

where the last equality follows from continuity of and (). Now we consider a -segment and denote it by for simplicity. From Lemma 7, we have

 −(1−t)c(x,y0)−tc(x,y1) = supz≠y′0,y′1{−c(x,yt(z))+c(z,yt(z))−(1−t)c(z,y0)−tc(z,y1)},

where the supremum of the right hand side is attained at . Hence

 ϕt(x) = supy0,y1≠x′supz≠y′0,y′1{−c(x,yt(z))+c(z,yt(z))−(1−t)c(z,y0)−tc(z,y1) −(1−t)ϕc0(y0)−tϕc1(y1)} = supw{−c(x,w)−ξ(w)},

where is defined by an infimum convolution

 ξ(w) := inf(y0,y1,z)|y0,y1≠x′,yt(z)=w{−c(z,w)+(1−t)c(z,y0)+tc(z,y1) +(1−t)ϕc0(y0)+tϕc1(y1)}.

Since is written in the form of a -transform, it is -convex. This proves Lemma 1.

### 4.2 Proof of Theorem 1

For each -convex function , let be the set of points such that and has its Hessian defined at . If is a wrapping potential function (Definition 1), then consists only of a finite set of points.

The following lemma is essentially proved in Delanoë and Loeper (2006).

###### Lemma 8.

If is -convex, then except for at most one . Furthermore, if for some , then for any .

###### Proof.

Let be -convex. Assume that there exists such that . In general, any -convex function on a compact Riemannian manifold is Lipschitz continuous with Lipschitz constant less than or equal to the diameter of the manifold (Lemma 2 of McCann (2001)). Since the diameter of the sphere is , we have . Hence is the antipodal point of . We now prove that for all . We use 2-monotonicity of the gradient map:

 d2(x,Gϕ(x))+d2(y,Gϕ(y)) ≤ d2(x,Gϕ(y))+d2(y,Gϕ(x)),

where is the distance between and . The above inequality follows from Lemma 3. Let , and . Then we have . By the triangle inequality with respect to the triangle , we have . Therefore

 0 ≥ π2+c2−a2−b2 ≥ π2+(a+b−π)2−a2−b2 = 2(π−a)(π−b).

This implies or ; equivalently, or . Hence we have for any . Then for any from the definition of . ∎

We proceed to the proof of Theorem 1. Fix two -convex functions and and let for . Let . Then it is easy to see that for any . Recall that the gradient map of is denoted by . Note that is a -segment . We prepare some notation to represent an explicit formula of the Jacobian determinant of . Let be the Jacobian determinant of the exponential map at , i.e. , where the determinant is calculated with respect to any orthonormal bases. Denote the Hessian operator at by and let . Then, by Cordero-Erausquin et al. (2001), the Jacobian determinant of is

 Jt(x) = σt(x)det(Ht(x)+Hessxϕt). (11)
###### Lemma 9.

Let . The matrix-valued function is concave with respect to :

 Ht(x) ⪰ (1−t)H0(x)+tH1(x)for\ any t∈[0,1],

where means that is non-negative definite.

###### Proof.

Since is a -segment , Lemma 7 implies that

 c(w,Ft(x))−c(x,Ft(x)) ≥ (1−t){c(w,F0(x))−c(x,F0(x))}+t{c(w,F1(x))−c(x,F1(x))}

for all . By taking the Hessian with respect to at , we obtain

 Hesswc(w,Ft(x))|w=x ⪰ (1−t)Hesswc(w,F0(x))|w=x+tHesswc(w,F1(x))|w=x.

This means . ∎

###### Lemma 10 (Jacobian-ratio inequality).

Let . Then the following inequality holds:

 (Jt(x)σt(x))1/n ≥ (1−t)(J0(x)σ0(x))1/n+t(J1(x)σ1(x))1/n. (12)
###### Proof.

By the formula (11), it is sufficient to prove that is concave with respect to . Indeed, by Lemma 9 and the geometric-arithmetic inequality on , we obtain

 det1/n(Ht+Hessxϕt) ≥ det1/n{(1−t)H0+tH1+Hessxϕt} = det1/n{(1−t)(H0+Hessxϕ0)+t(H1+Hessxϕ1)} ≥ (1−t)det1/n(H0+Hessxϕ0)+tdet1/n(H1+Hessxϕ1).

Hence is concave. ∎

###### Remark 5.

If , the inequality (12) is similar to the Jacobian inequality, due to Cordero-Erausquin et al. (2001). They showed that if ,

 Jt(x)1/n ≥ (1−t)v1−t(F1(x),x)1/n+t[vt(x,F1(x))]1/nJ1(x)1/n, (13)

where denotes the volume distortion coefficient (see Cordero-Erausquin et al. 2001 for details). The inequality (13) is crucial to prove a Brunn-Minkowskii-type inequality on manifolds. However, since the inequality (13) is only established for the special case , it is not sufficient for our statistical application. Unfortunately, (13) is not implied from (12). In fact, if , then and , and the inequality (12) reduces to

 Jt(x)1/n ≥ (1−t)σt(x)1/n+t(σt(x)σ1(x))1/nJ1(x)1/n.

This inequality is weaker than (13) because and . ∎

###### Lemma 11.

For any , is concave with respect to .