Exponentially concave functions and a new information geometry

Soumik Pal (soumikpal@gmail.com), University of Washington, Seattle, Washington 98195, USA

Ting-Kam Leonard Wong (tkleonardwong@gmail.com), University of Southern California, 406H Kaprielian Hall, Los Angeles, California 90089, USA

Abstract

A function is exponentially concave if its exponential is concave. We consider exponentially concave functions on the unit simplex. In a previous paper we showed that gradient maps of exponentially concave functions provide solutions to a Monge-Kantorovich optimal transport problem and give a better gradient approximation than those of ordinary concave functions. The approximation error, called L-divergence, is different from the usual Bregman divergence. Using tools of information geometry and optimal transport, we show that L-divergence induces a new information geometry on the simplex consisting of a Riemannian metric and a pair of dually coupled affine connections which defines two kinds of geodesics. We show that the induced geometry is dually projectively flat but not flat. Nevertheless, we prove an analogue of the celebrated generalized Pythagorean theorem from classical information geometry. On the other hand, we consider displacement interpolation under a Lagrangian integral action that is consistent with the optimal transport problem and show that the action minimizing curves are dual geodesics. The Pythagorean theorem is also shown to have an interesting application of determining the optimal trading frequency in stochastic portfolio theory.

arXiv:1605.05819

This research is partially supported by NSF grants DMS-1308340 and DMS-1612483.

MSC subject classifications: Primary 60E05; secondary 52A41.

Keywords: information geometry, optimal transport, exponential concavity, L-divergence, generalized Pythagorean theorem, functionally generated portfolio, stochastic portfolio theory.

1 Introduction

Definition 1.1 (Exponential concavity).

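Let $\Omega \subset \mathbb{R}^d$ be convex. We say that a function $\varphi : \Omega \to \mathbb{R} \cup \{-\infty\}$ is exponentially concave if $e^{\varphi}$ is concave. (By convention we set $e^{-\infty} = 0$.)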

Throughout this paper we let $\Delta_n$ be the open unit simplex

 $$\Delta_n = \Big\{ p = (p_1, \ldots, p_n) \in \mathbb{R}^n : p_i > 0, \ \sum_{i=1}^n p_i = 1 \Big\}, \tag{1.1}$$

regarded as the collection of strictly positive probability distributions on a set with $n$ elements. This is due to the applications we have in mind, although many generalizations are possible. An interesting property of exponentially concave functions is that their gradient maps give a better first-order approximation than those of ordinary concave functions. In [45], we introduced the concept of L-divergence. Let $\varphi$ be a differentiable exponentially concave function on $\Delta_n$. For $p, q \in \Delta_n$, concavity of $e^{\varphi}$ implies that

 $$\varphi(p) + \log\big(1 + \nabla\varphi(p) \cdot (q - p)\big) \ge \varphi(q), \tag{1.2}$$

where $\nabla\varphi$ is the Euclidean gradient. Clearly this approximation is sharper than the linear approximation of $\varphi$ itself. The L-divergence of $\varphi$ is the error in this approximation:

 $$T(q \mid p) := \log\big(1 + \nabla\varphi(p) \cdot (q - p)\big) - \big(\varphi(q) - \varphi(p)\big) \ge 0. \tag{1.3}$$

The extra concavity of exponentially concave functions has found several recent applications in analysis, probability and optimization. For example, in [22], the equivalence of entropic curvature-dimension conditions and Bochner’s inequality on metric measure spaces is established using the notion of $(K, N)$-convexity. When $K = 0$ and $N = 1$, the negative of a $(K, N)$-convex function is exponentially concave. Better gradient approximation has also led to better algorithms in optimization and machine learning such as those in [29, 30, 35], although the authors tend to replace the logarithmic term in (1.2) by a quadratic approximation.

One of our primary applications in mind is related to stochastic portfolio theory. In [25, 23], the author considers the gradient map of an exponentially concave function as a map from $\Delta_n$ to its closure $\overline{\Delta}_n$. The following restatement can be found in [45, Proposition 6]. Let $\varphi$ be a differentiable exponentially concave function on $\Delta_n$. For $p \in \Delta_n$, define $\pi(p) = (\pi_1(p), \ldots, \pi_n(p))$ by

 $$\pi_i(p) = p_i \big(1 + \nabla\varphi(p) \cdot (e(i) - p)\big), \quad i = 1, \ldots, n, \tag{1.4}$$

where $e(i)$ is the $i$th standard basis vector of $\mathbb{R}^n$. Then it can be shown that $\pi(p) \in \overline{\Delta}_n$. In keeping with standard definitions in the subject we will call this map the portfolio map. In this vein also see articles [6, 24, 26, 31, 42, 49, 53, 54].
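As an illustration of (1.4), the following minimal Python sketch evaluates the portfolio map from a user-supplied Euclidean gradient. The helper names `portfolio_map` and `grad_cross_entropy` and the sample weights are illustrative assumptions, not notation from the paper. The test function is the cross entropy $\varphi(p) = \sum_i w_i \log p_i$ discussed later in this section, for which (1.4) should return the constant weights $w$.

```python
def portfolio_map(grad_phi, p):
    """Evaluate (1.4): pi_i(p) = p_i * (1 + grad_phi(p) . (e(i) - p))."""
    g = grad_phi(p)
    g_dot_p = sum(gi * pi for gi, pi in zip(g, p))
    # grad_phi(p) . e(i) is simply the i-th partial derivative g[i].
    return [pi * (1.0 + gi - g_dot_p) for gi, pi in zip(g, p)]

def grad_cross_entropy(w):
    """Gradient of phi(p) = sum_i w_i * log(p_i); w is an assumed weight vector."""
    return lambda p: [wi / pi for wi, pi in zip(w, p)]

w = [0.2, 0.3, 0.5]   # hypothetical weights summing to one
p = [0.5, 0.3, 0.2]   # a point of the open simplex
pi = portfolio_map(grad_cross_entropy(w), p)
print(pi, sum(pi))
```

The weights sum to one because $\sum_i p_i \nabla\varphi(p) \cdot (e(i) - p) = 0$ for any gradient, so the sketch can be reused with other differentiable exponentially concave functions by swapping the gradient.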

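The L-divergence of $\varphi$ should be distinguished from the Bregman divergence of $\varphi$ defined by

 $$D(q \mid p) := \nabla\varphi(p) \cdot (q - p) - \big(\varphi(q) - \varphi(p)\big). \tag{1.5}$$

Bregman divergence was introduced in [12] and is widely applied in statistics and optimization. To see the difference consider two fundamental examples. For $p, q \in \Delta_n$, the Kullback-Leibler divergence (also known as relative entropy) is given by

 $$H(q \mid p) = \sum_{i=1}^n q_i \log\frac{q_i}{p_i}.$$

It can be shown that the relative entropy is the Bregman divergence of the Shannon entropy $H(p) = -\sum_{i=1}^n p_i \log p_i$. On the other hand, fix $\pi \in \Delta_n$ and consider the cross entropy $\varphi(p) = \sum_{i=1}^n \pi_i \log p_i$. This is an exponentially concave function of $p$ whose associated portfolio map (1.4) is constant: $\pi(p) \equiv \pi$. The corresponding L-divergence is given by

 $$T(q \mid p) = \log\Big(\sum_{i=1}^n \pi_i \frac{q_i}{p_i}\Big) - \sum_{i=1}^n \pi_i \log\frac{q_i}{p_i}. \tag{1.6}$$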

This quantity is sometimes referred to as the free energy in statistical physics. In finance it is called the diversification return in [9, 14, 21, 52], the excess growth rate in [25, 27, 44], the rebalancing premium in [10], and the volatility return in [28].
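To make the comparison between (1.3) and (1.5) concrete, here is a small Python sketch that evaluates both divergences for the cross entropy with an assumed weight vector `w`. The function names and sample points are hypothetical. For this choice, (1.3) reduces to $\log(\sum_i w_i q_i/p_i) - \sum_i w_i \log(q_i/p_i)$, which the sketch uses as a cross-check.

```python
import math

def l_divergence(phi, grad_phi, q, p):
    """L-divergence (1.3): log(1 + grad_phi(p).(q - p)) - (phi(q) - phi(p))."""
    g = grad_phi(p)
    inner = sum(gi * (qi - pi) for gi, qi, pi in zip(g, q, p))
    return math.log1p(inner) - (phi(q) - phi(p))

def bregman_divergence(phi, grad_phi, q, p):
    """Bregman divergence (1.5): grad_phi(p).(q - p) - (phi(q) - phi(p))."""
    g = grad_phi(p)
    inner = sum(gi * (qi - pi) for gi, qi, pi in zip(g, q, p))
    return inner - (phi(q) - phi(p))

w = [0.2, 0.3, 0.5]  # assumed constant weights

def phi(p):
    return sum(wi * math.log(pi) for wi, pi in zip(w, p))

def grad_phi(p):
    return [wi / pi for wi, pi in zip(w, p)]

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
T = l_divergence(phi, grad_phi, q, p)
D = bregman_divergence(phi, grad_phi, q, p)
closed_form = (math.log(sum(wi * qi / pi for wi, qi, pi in zip(w, q, p)))
               - sum(wi * math.log(qi / pi) for wi, qi, pi in zip(w, q, p)))
print(T, D, closed_form)
```

Since $x \ge \log(1 + x)$, the Bregman divergence of an exponentially concave function always dominates its L-divergence, which is the sense in which the logarithmic approximation is sharper.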

In [45] we introduced a Monge-Kantorovich optimal transport problem on $\mathbb{R}^{n-1}$ which can be solved using exponentially concave functions on the unit simplex. The cost function is defined for $\theta, \phi \in \mathbb{R}^{n-1}$ by

 $$c(\theta, \phi) := \psi(\theta - \phi), \quad \text{where} \quad \psi(x) := \log\Big(1 + \sum_{i=1}^{n-1} e^{x_i}\Big) \tag{1.7}$$

is strictly convex on $\mathbb{R}^{n-1}$. We will recall the details of this transport problem in Section 2.1 and its relationship to exponentially concave functions. It suffices to say for now that, given a pair of Borel probability measures $P$ and $Q$ on $\mathbb{R}^{n-1}$, the optimal coupling of the two with respect to the above cost can be expressed in terms of the portfolio map of an exponentially concave function on the simplex. A related cost function appears in [41] in the completely different context of finding polytopes with given geometric data. It also appears to be related to the study of moment measures as introduced in [15] (see page 3836 in particular).

1.1 Our contributions

In this paper we show that information geometry provides an elegant geometric structure underlying exponential concavity, L-divergence and the optimal transport problem. Here is a motivating question which is the starting point of this work. Suppose $\varphi$ is an exponentially concave function on $\Delta_n$ with its associated L-divergence $T(\cdot \mid \cdot)$. Can we geometrically characterize triplets $(p, q, r)$ such that $T(q \mid p) + T(r \mid q) \le T(r \mid p)$? The answer to this question determines the optimal frequency of rebalancing the portfolio generated by $\varphi$ (see Section 5.4). Also see Section 3.3 for a transport interpretation of this inequality.

Using tools of information geometry, we show that exponentially concave functions on $\Delta_n$ and their L-divergences induce a new geometric structure on the simplex regarded as a smooth manifold of probability distributions. Let $\varphi$ be an exponentially concave function on $\Delta_n$. We only require that $\varphi$ is smooth and the Euclidean Hessian of $e^{\varphi}$ is strictly negative definite everywhere (see Assumption 2.5). The induced geometric structure consists of a Riemannian metric $g$ and a dual pair of affine connections $\nabla$ and $\nabla^*$. These connections define via parallel transports two kinds of geodesic curves on $\Delta_n$ called primal and dual geodesics. Interestingly, the duality in this geometry goes hand in hand with the duality in the related Monge-Kantorovich optimal transport problem, and this work is the first which exploits this connection. We summarize our main results as follows. First we give the answer to the motivating question.

Theorem 1.2 (Generalized Pythagorean theorem).

Given $(p, q, r) \in (\Delta_n)^3$, consider the dual geodesic joining $q$ and $p$ and the primal geodesic joining $q$ and $r$. Consider the Riemannian angle between the geodesics at $q$. (See Proposition 4.4 which expresses the Riemannian metric as a normalized Euclidean Hessian of $e^{\varphi}$.) Then the difference

 $$T(q \mid p) + T(r \mid q) - T(r \mid p) \tag{1.8}$$

is positive, zero or negative depending on whether the angle is less than, equal to, or greater than 90 degrees (see Figure 1).

We also prove other remarkable properties of the geodesics: (i) There exist explicit coordinate systems under which the primal and dual geodesics are time changes of Euclidean straight lines (Theorem 5.1). In other words, the new geometry is dually projectively flat. In particular, the primal geodesics are Euclidean straight lines up to time reparameterization. Moreover, the primal and dual connections have constant sectional curvature with respect to the Riemannian metric, and thus satisfy an Einstein condition (Corollary 4.10). The primal and dual geodesics can also be constructed as time changes of Riemannian gradient flows for the functions and (Theorem 5.5). This is remarkable because while the geodesic equations depend only on the local properties of near , the gradient flows are global as they involve the derivatives of and . Indeed, this relation is known only for limited families of divergence including Bregman divergence and -divergence [5].

As shown in [2, Chapter 1], the generalized Pythagorean theorem holds for any Bregman divergence which induces a dually flat geometry. We will prove that the resulting geometries from L-divergences are not flat for (Theorem 4.9). While some extensions of the generalized Pythagorean theorem hold in certain non-flat spaces (see for example [2, Theorem 4.5]), they involve some extra terms. To the best of our knowledge Theorem 1.2 is the first exact Pythagorean theorem that holds in a geometry which is not dually flat. The difference (1.8) can also be given an optimal transport interpretation (Section 3.3).

(ii) We extend the static transport problem (1.7) to a time-dependent transport problem with a corresponding convex Lagrangian action. In Theorem 6.2 we show that the action minimizing curves are the (reparametrized) dual geodesics which, in addition, satisfy the intermediate time optimality condition. This allows for a consistent displacement interpolation formulation between probability measures on the unit simplex. Previously, such studies focused almost exclusively on the Wasserstein spaces corresponding to the cost functions (here is a metric on a Polish space with suitable properties and ). Displacement interpolation and the related concept of displacement convexity were introduced in [38] and in the thesis [37]. These ideas have grown to be immensely important in classical Wasserstein transport with fundamental implications in geometry, physics, probability and PDE. See [51, Chapter 7] for a thorough discussion. Our Lagrangian, although convex, is not superlinear, and, therefore, is not covered by the standard theory. However, we expect it to lead to many equally remarkable properties.

These results suggest plenty of problems for further research. Generalizing Theorem 1.2 to more than three points is of interest in stochastic portfolio theory. Displacement interpolation has become an extremely important topic in optimal transport theory. Extensions to Riemannian manifolds, done in [16], have led to new functional inequalities. In another vein, [34] defines Ricci curvatures on metric measure spaces in terms of displacement interpolation and displacement convexity. We expect that the displacement interpolation in this paper will lead to a new Otto calculus ([51, Chapter 15]) and related PDEs (such as Hamilton-Jacobi equations). It appears that Bregman divergence and L-divergence are only two of an entire family of divergences with special properties and corresponding optimal transport problems. For example, see [43] which extends the optimal transport problem (1.7) via the cumulant generating function of a general probability distribution. We also believe that this new information geometry will be useful in dynamic optimization problems where the objective function is multiplicative in time. Finally, it is naturally of interest to study exponential concavity on general convex domains.

1.2 Related literature

We have mentioned the L-divergence and the Bregman divergence. In general, a divergence on a set (usually a manifold of probability distributions) is a non-negative function such that if and only if . Divergences are not metrics in general since they may be asymmetric and may not satisfy the triangle inequality. Apart from Bregman divergence, many families of divergences (such as -divergence and -divergence) have been applied in information theory, statistics and other areas; see the survey [7] for a catalog of these divergences. Among these divergences, Bregman divergence plays a special role because it induces a dually flat geometry on the underlying space. First studied in the context of exponential families in statistical inference by [40], it gave rise to information geometry – the geometric study of manifolds of probability distributions. Furthermore, Bregman divergence enjoys properties such as the generalized Pythagorean theorem and projection theorem which led to numerous applications. See [2, 3, 13, 32, 39] for introductions to this beautiful theory. The related concept of dual affine connection is also useful in affine differential geometry (see [17, 33, 47]). In [36] dually projectively flat manifolds are characterized in terms of the Bartlett tensors and conformal flatness. Here we identify a new and important class of examples and show that they have concrete applications.

This work is motivated by our study in mathematical finance. Recently optimal transport has been applied to financial problems such as robust asset pricing; see, for example, [1, 8, 18]. This line of work has a somewhat different flavor than ours although they share the same goal: development of model-free mathematical finance. Portfolios generated by exponentially concave functions generate profit due to fluctuations of a sequence in representing the stock market. This idea is sometimes called volatility harvesting and leads naturally to the transport problem (1.7), as shown in [45]. In this philosophy, our work can be interpreted as developing a notion of model-free volatility.

1.3 Outline of the paper

In the next section we recall the optimal transport problem formulated in [45] using the exponential coordinate system. Its relation with functionally generated portfolios is also reviewed. In Section 3 we relate exponential concavity with $c$-concavity and give a transport-motivated definition of L-divergence. Here duality plays a crucial role. After reviewing the basic concepts of information geometry, we derive in Section 4 the geometric structure induced by an exponentially concave function. The properties of this new geometry are then studied in Section 5. In particular, we characterize the primal and dual geodesics and prove the generalized Pythagorean theorem, which has an interesting application in mathematical finance. Finally, in Section 6 we apply the geometric structure to construct a displacement interpolation for the associated optimal transport problem. Some technical and computational details are gathered in the Appendix.

2 Optimal transport and portfolio maps

In this section we recall the optimal transport problem in [45] using the exponential coordinate system. We also review the definition of functionally generated portfolio and explain how it relates to the transport problem.

2.1 Exponential coordinate system

The exponential coordinate system defines a global coordinate system on $\Delta_n$ regarded as an $(n-1)$-dimensional smooth manifold [2, Section 2.2].

Definition 2.1 (Exponential coordinate system).

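The exponential coordinate $\theta = (\theta_1, \ldots, \theta_{n-1})$ of $p \in \Delta_n$ is given by

 $$\theta_i = \log\frac{p_i}{p_n}, \quad i = 1, \ldots, n-1. \tag{2.1}$$

We denote this map by $p \mapsto \theta(p)$. By convention we set $\theta_n := 0$. The inverse transformation is given by

 $$p_i = p_i(\theta) = e^{\theta_i - \psi(\theta)}, \quad 1 \le i \le n, \tag{2.2}$$

where $\psi(\theta) = \log\big(1 + \sum_{i=1}^{n-1} e^{\theta_i}\big)$ as defined in (1.7).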
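The change of coordinates in Definition 2.1 can be sketched in a few lines of Python. The function names below are illustrative assumptions; the code represents $\theta$ as a list of length $n-1$ and uses the convention that the $n$th exponential coordinate is zero.

```python
import math

def psi(x):
    """psi(x) = log(1 + sum_i exp(x_i)), as in (1.7)."""
    return math.log(1.0 + sum(math.exp(xi) for xi in x))

def to_exponential(p):
    """Map p in the open simplex to theta in R^(n-1) via (2.1)."""
    return [math.log(pi / p[-1]) for pi in p[:-1]]

def from_exponential(theta):
    """Inverse map (2.2); the last entry corresponds to theta_n = 0."""
    norm = psi(theta)
    return [math.exp(t - norm) for t in theta] + [math.exp(-norm)]

p = [0.5, 0.3, 0.2]
theta = to_exponential(p)
p_back = from_exponential(theta)
print(theta, p_back)
```

The round trip recovers `p` up to floating-point error, and `from_exponential` returns a point of the open simplex for any real input vector.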

The exponential coordinate system is the first of several coordinate systems we will introduce on the simplex. By changing coordinate systems, any function on can be expressed as a function on and vice versa. Explicitly, a function on can be expressed in exponential coordinates by . To simplify the notations, we simply write or depending on the coordinate system used. For example, if is the cross entropy where , then .

2.2 The transport problem

We refer the reader to the books [4, 51] for introductions to optimal transport and its interplay with analysis, probability and geometry. Let $X = Y = \mathbb{R}^{n-1}$ be equipped with the standard Euclidean metric and topology. Let $P$ and $Q$ be Borel probability measures on $X$ and $Y$ respectively. By a coupling of $(P, Q)$ we mean a Borel probability measure $R$ on $X \times Y$ whose marginals are $P$ and $Q$ respectively. Let $\Pi(P, Q)$ be the set of all couplings of $(P, Q)$. This set is always non-empty as it contains the product measure $P \otimes Q$.

Given $P$ and $Q$ we consider the Monge-Kantorovich optimal transport problem with cost $c$ defined by (1.7):

 $$\inf_{R \in \Pi(P, Q)} \mathbb{E}\big[c(\theta, \phi)\big]. \tag{2.3}$$

Here the expectation is taken under the probability measure under which the random element has distribution . If an optimal coupling takes the form for some measurable map , we say that is a Monge transport map.

In general, we may consider the optimal transport problem (2.3) with replaced by a general cost function denoted by and , are general Polish spaces. The classical example is where and is a power of the underlying metric: (especially and ). For these costs rich and delicate theories have been developed on Euclidean spaces, Riemannian manifolds and geodesic metric measure spaces. However, we consider the cost function defined by (1.7).

Remark 2.2.

The cost function

 $$\tilde{c}(\theta, \phi) := \log\Big(\frac{1}{n} + \frac{1}{n}\sum_{i=1}^{n-1} e^{\theta_i - \phi_i}\Big) - \sum_{i=1}^{n-1} \frac{1}{n}(\theta_i - \phi_i)$$

differs from $c$ by a constant and a linear term, which play no role in optimal transport. Thus we may consider $\tilde{c}$ instead. The advantage of $\tilde{c}$ is that it is non-negative and, by Jensen’s inequality, equals zero if and only if $\theta = \phi$. To be consistent with the notations in [45] we will use the cost function $c$ in this paper.

Definition 2.3 (c-cyclical monotonicity).

A non-empty subset $A \subset X \times Y$ is $c$-cyclical monotone if and only if it satisfies the following property. For any finite collection $\{(\theta_j, \phi_j)\}_{j=1}^m$ in $A$ and any permutation $\sigma$ of the set $\{1, \ldots, m\}$, we have

 $$\sum_{j=1}^m c(\theta_j, \phi_j) \le \sum_{j=1}^m c(\theta_j, \phi_{\sigma(j)}). \tag{2.4}$$

It is well known that $c$-cyclical monotonicity is, under mild technical conditions, a necessary and sufficient criterion for optimality in the general optimal transport problem (see [4, Chapter 1]). In particular, a coupling $R$ of $(P, Q)$ is optimal if and only if the support of $R$ is $c$-cyclical monotone.
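For a finite set of pairs, condition (2.4) can be checked by brute force. The sketch below is only an illustration: the function names, the tolerance and the two sample couplings are assumptions, and the factorial cost of enumerating permutations limits it to very small sets. The first coupling lies on the graph of a translation (the kind of map that arises from constant-weighted portfolios later in the paper), while the second is a decreasing coupling in dimension one.

```python
import itertools
import math

def cost(theta, phi):
    """c(theta, phi) = psi(theta - phi) with psi from (1.7)."""
    return math.log(1.0 + sum(math.exp(t - f) for t, f in zip(theta, phi)))

def is_c_cyclically_monotone(pairs, tol=1e-12):
    """Brute-force check of (2.4) over all permutations of a finite set of pairs."""
    m = len(pairs)
    identity_cost = sum(cost(th, ph) for th, ph in pairs)
    for sigma in itertools.permutations(range(m)):
        permuted_cost = sum(cost(pairs[j][0], pairs[sigma[j]][1]) for j in range(m))
        if identity_cost > permuted_cost + tol:
            return False
    return True

# Assumed example 1: pairs on the graph of a translation phi = theta - shift.
shift = [math.log(0.2 / 0.5), math.log(0.3 / 0.5)]
thetas = [[0.0, 0.0], [1.0, -0.5], [-0.7, 2.0]]
translated = [(th, [t - s for t, s in zip(th, shift)]) for th in thetas]

# Assumed example 2 (n = 2): a decreasing coupling, which swapping improves.
crossed = [([0.0], [1.0]), ([1.0], [0.0])]

monotone_ok = is_c_cyclically_monotone(translated)
crossed_ok = is_c_cyclically_monotone(crossed)
print(monotone_ok, crossed_ok)
```

Permutations that fix some indices cover all sub-collections, so checking every permutation of the full set is enough for a finite set.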

2.3 Functionally generated portfolio

At this point it is convenient to introduce the concept of functionally generated portfolio. Although it is possible to present the theory without reference to finance-motivated concepts, we stress that the portfolio map gives an additional structure to the transport problem not found in other cases. Also, the main examples of the theory as well as the key quantities (such as the induced Riemannian metric) are best expressed in terms of portfolios. Mathematically, the portfolio can be regarded as a normalized gradient of . In Section 5.4 we apply our information geometry to functionally generated portfolios.

Functionally generated portfolio was introduced in [23] and the following refined definition is taken from [45].

Definition 2.4 (Functionally generated portfolio).

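By a portfolio map we mean a function $\pi : \Delta_n \to \overline{\Delta}_n$. Let $\varphi : \Delta_n \to \mathbb{R}$ be exponentially concave. We say that a portfolio map $\pi$ is generated by $\varphi$ if for any $p, q \in \Delta_n$ we have

 $$\sum_{i=1}^n \pi_i(p) \frac{q_i}{p_i} \ge e^{\varphi(q) - \varphi(p)}. \tag{2.5}$$

We call $\varphi$ the log generating function of $\pi$ and $e^{\varphi}$ (which is positive and concave) the generating function. It is known that $\varphi$ is unique (for a given $\pi$) up to an additive constant. If $\varphi$ is differentiable, then $\pi$ is necessarily given by (1.4).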

Throughout this paper we impose the following regularity conditions on the exponentially concave function .

Assumption 2.5 (Regularity conditions).
1. The function $\varphi$ is smooth (i.e., infinitely differentiable) on $\Delta_n$.

2. The (Euclidean) Hessian of $e^{\varphi}$ is strictly negative definite everywhere on $\Delta_n$. In particular, $e^{\varphi}$ is strictly concave. Moreover, it can be shown that the function $\pi$ defined by (1.4) maps $\Delta_n$ into $\Delta_n$.

Let us discuss these conditions briefly. Differentiability is needed to define differential geometric structures on $\Delta_n$ in terms of the derivatives of the L-divergence. Our theory requires the L-divergence to be three times continuously differentiable, and for convenience we simply assume that $\varphi$ is smooth. Strict concavity guarantees that the L-divergence is non-degenerate, i.e., $T(q \mid p) = 0$ only if $p = q$, and strict negative definiteness of the Hessian implies that the induced Riemannian metric is non-degenerate.

Henceforth we let be an exponentially concave function satisfying Assumption 2.5 and let given by (1.4) be the portfolio map generated by . The cost function always refers to the one defined in (1.7), and a general cost function is denoted by . Using (1.4), it can be shown that the L-divergence (1.3) of can be expressed in the form

 $$T(q \mid p) = \log\Big(\sum_{i=1}^n \pi_i(p) \frac{q_i}{p_i}\Big) - \big(\varphi(q) - \varphi(p)\big). \tag{2.6}$$

Now we give several examples of functionally generated portfolios and their log generating functions. Many more examples can be found in [25, Chapter 3]. In particular, the constant-weighted portfolios play a special role and will be taken as the basic example of the theory.

Example 2.6 (Examples of functionally generated portfolios).
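1. (Market portfolio) The identity map $\pi(p) = p$ is generated by the constant function $\varphi \equiv 0$. (Here Assumption 2.5 does not hold.)

2. (Constant-weighted portfolio) The constant map $\pi(p) \equiv \pi \in \Delta_n$ is generated by the cross-entropy $\varphi(p) = \sum_{i=1}^n \pi_i \log p_i$. The special case $\pi = (1/n, \ldots, 1/n)$ is called the equal-weighted portfolio.

3. (Diversity-weighted portfolio) Let $\lambda \in (0, 1)$ be a fixed parameter. Consider the portfolio map defined by

 $$\pi_i(p) = \frac{p_i^{\lambda}}{\sum_{j=1}^n p_j^{\lambda}}, \quad i = 1, \ldots, n.$$

It can be shown that the generating function is $e^{\varphi(p)} = \big(\sum_{j=1}^n p_j^{\lambda}\big)^{1/\lambda}$.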

4. (Convex combinations) It is known that the set of functionally generated portfolios is convex. Indeed, let be generated by and be generated by . Then for , the portfolio map is generated by . Its generating function is then the geometric mean . This fact was used in [53, 54] to formulate and study nonparametric estimation of functionally generated portfolio.
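The defining inequality (2.5) is easy to probe numerically. The following Python sketch does so for the diversity-weighted portfolio, taking $\varphi(p) = \frac{1}{\lambda}\log\sum_j p_j^{\lambda}$ as its log generating function (the unweighted case of (2.8)) and assuming $0 < \lambda < 1$. The function names and sample points are hypothetical, and a finite check of this kind is of course no substitute for a proof.

```python
import math

def diversity_portfolio(p, lam):
    """Weights pi_i(p) = p_i**lam / sum_j p_j**lam."""
    powered = [pi ** lam for pi in p]
    total = sum(powered)
    return [x / total for x in powered]

def log_generating(p, lam):
    """phi(p) = (1/lam) * log(sum_j p_j**lam)."""
    return math.log(sum(pi ** lam for pi in p)) / lam

def generation_gap(p, q, lam):
    """Left side minus right side of (2.5); non-negative when pi is generated by phi."""
    weights = diversity_portfolio(p, lam)
    lhs = sum(w * qi / pi for w, qi, pi in zip(weights, q, p))
    rhs = math.exp(log_generating(q, lam) - log_generating(p, lam))
    return lhs - rhs

lam = 0.5  # assumed parameter in (0, 1)
points = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6], [0.1, 0.8, 0.1], [1 / 3, 1 / 3, 1 / 3]]
gaps = [generation_gap(p, q, lam) for p in points for q in points]
print(min(gaps))
```

The gap vanishes when $q = p$ and is otherwise positive on these samples; by (2.6), its logarithmic counterpart is exactly the L-divergence $T(q \mid p)$.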

The following result is taken from [45].

Proposition 2.7.

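For any portfolio map $\pi$ the following statements are equivalent.

1. There exists an exponentially concave function $\varphi$ on $\Delta_n$ which generates $\pi$ in the sense of (2.5).

2. The portfolio map $\pi$ is multiplicatively cyclical monotone (MCM) in the following sense: for any sequence $\{\mu(t)\}_{t=0}^{m+1}$ in $\Delta_n$ satisfying $\mu(m+1) = \mu(0)$, we have

 $$\prod_{t=0}^{m} \Big(\sum_{i=1}^n \pi_i(\mu(t)) \frac{\mu_i(t+1)}{\mu_i(t)}\Big) \ge 1.$$

3. Define a map $\theta \mapsto \phi$ by

 $$\phi_i = \theta_i - \log\frac{\pi_i(\theta)}{\pi_n(\theta)}, \quad i = 1, \ldots, n-1. \tag{2.7}$$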

Here is regarded as a function of the exponential coordinate. In words, we define in such a way that the exponential coordinate of is . Then the graph of this map is -cyclical monotone.

Using this result, we showed in [45] how the optimal transport problem (2.3) can be solved in terms of functionally generated portfolios. Here is a simple but interesting explicit example which is a direct generalization of the one-dimensional case treated in [45, Section 4].

Example 2.8 (Product of Gaussian distributions).

In the transport problem (2.3), let be a product of one-dimensional Gaussian distributions:

 $$P = \bigotimes_{i=1}^{n-1} N(a_i, \sigma_i^2),$$

where and . Also let

 $$Q = \bigotimes_{i=1}^{n-1} N\big(b_i, (1-\lambda)^2 \sigma_i^2\big),$$

where and . Then the optimal transport map for the measures and is given by the map (2.7), where the portfolio map is the following variant of the diversity-weighted portfolio discussed in Example 2.6(iii):

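 $$\pi_i(p) = \frac{w_i p_i^{\lambda}}{\sum_{j=1}^n w_j p_j^{\lambda}}, \qquad \varphi(p) = \frac{1}{\lambda} \log\Big(\sum_{j=1}^n w_j p_j^{\lambda}\Big), \tag{2.8}$$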

where the coefficients are chosen such that for all .
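To see what the map (2.7) looks like for the portfolio (2.8), the following Python sketch evaluates it at a sample point and compares it with the affine expression $\phi_i = (1-\lambda)\theta_i - \log(w_i/w_n)$, which follows from $\pi_i/\pi_n = (w_i/w_n)(p_i/p_n)^{\lambda}$. The function names and the numerical values of `w`, `lam` and `theta` are illustrative assumptions.

```python
import math

def from_exponential(theta):
    """Inverse exponential coordinates (2.2), with theta_n = 0."""
    norm = math.log(1.0 + sum(math.exp(t) for t in theta))
    return [math.exp(t - norm) for t in theta] + [math.exp(-norm)]

def weighted_diversity(p, w, lam):
    """Portfolio weights from (2.8)."""
    terms = [wi * pi ** lam for wi, pi in zip(w, p)]
    total = sum(terms)
    return [x / total for x in terms]

def transport_map(theta, w, lam):
    """The map (2.7): phi_i = theta_i - log(pi_i / pi_n), i = 1, ..., n-1."""
    pi = weighted_diversity(from_exponential(theta), w, lam)
    return [t - math.log(pi_i / pi[-1]) for t, pi_i in zip(theta, pi)]

w = [2.0, 0.5, 1.0]   # assumed positive coefficients
lam = 0.4             # assumed parameter in (0, 1)
theta = [0.3, -1.2]
phi = transport_map(theta, w, lam)
expected = [(1 - lam) * t - math.log(wi / w[-1]) for t, wi in zip(theta, w)]
print(phi, expected)
```

Because the map is affine in each coordinate separately, it sends a product of one-dimensional Gaussian distributions to another such product.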

3 Optimal transport and duality

3.1 c-concavity and duality

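Now we make use of the notion of $c$-concavity in optimal transport theory. The definitions we use are standard and can be found in [4, Chapter 1]. Again, $c$ refers to our cost function (1.7). Also recall that $X$ and $Y$ are the underlying spaces of the variables $\theta$ and $\phi$ respectively. For a function $f$ on $X$ we define its $c$-transform by

 $$f^*(\phi) := \inf_{\theta \in X} \big(c(\theta, \phi) - f(\theta)\big), \quad \phi \in Y.$$

Similarly, the $c$-transform of a function $g$ on $Y$ is defined by

 $$g^*(\theta) := \inf_{\phi \in Y} \big(c(\theta, \phi) - g(\phi)\big), \quad \theta \in X.$$

We say that $f$ is $c$-concave if there exists $g$ such that $f = g^*$ (similar for $c$-concave functions on $Y$). A function $f$ (on $X$ or $Y$) is $c$-concave if and only if $f^{**} = f$.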

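If $f$ is $c$-concave, its $c$-superdifferential is defined by

 $$\partial^c f := \{(\theta, \phi) \in X \times Y : f(\theta) + f^*(\phi) = c(\theta, \phi)\}. \tag{3.1}$$

For $\theta \in X$ we define $\partial^c f(\theta) := \{\phi \in Y : (\theta, \phi) \in \partial^c f\}$. If this set is a singleton $\{\phi\}$, we call $\phi$ the $c$-supergradient of $f$ at $\theta$ and write $\phi = \nabla^c f(\theta)$. Similar definitions hold for a $c$-concave function $g$ on $Y$.

Let $f$ be $c$-concave. By definition of $f^*$, we have

 $$f(\theta) + f^*(\phi) \le c(\theta, \phi) \tag{3.2}$$

for every pair $(\theta, \phi) \in X \times Y$, and equality holds in (3.2) if and only if $(\theta, \phi) \in \partial^c f$. This is a generalized version of Fenchel’s identity (see [46, Section 12]) and will be used frequently in this paper.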

Our first lemma relates exponential concavity on with -concavity on and on . Note that the cost function is asymmetric, and -concavity on is equivalent to -concavity on after a change of variable.

Lemma 3.1 (Exponential concavity and c-concavity).

For the following statements are equivalent.

1. $\varphi$ is exponentially concave on $\Delta_n$.

2. The function $f$ defined by

 $$f(\theta) = \varphi(p(\theta)) + \psi(\theta)$$

is $c$-concave on $X$.

3. The function $g$ defined by

 $$g(\phi) = \varphi(p(-\phi)) + \psi(-\phi),$$

where is the exponential coordinate, is -concave on .

Proof.

We prove the implication (i) (ii) and the others can be proved similarly. Suppose (i) holds and consider the non-negative concave function on . By [46, Theorem 10.3], we can extend continuously up to , the closure of in . We further extend to the affine hull of in by setting for . The extended function is then a closed concave function on . By convex duality (see [46, Theorem 12.1]), there exists a family of affine functions on such that

 $$\Phi(p) = \inf_{\ell \in \mathcal{C}} \ell(p), \quad p \in \Delta_n. \tag{3.3}$$

Since is non-negative on , each is non-negative on . Replacing by the sequence , , we may assume without loss of generality that each is strictly positive on . We parameterize each in the form where are positive constants. (Note that an extra constant term is not required since .) Writing , and switching to exponential coordinates, we have

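 $$\begin{aligned} \log \ell(p) &= \log\Big(\sum_{i=1}^n a_i p_i\Big) = \log\Big(1 + \sum_{i=1}^{n-1} \frac{a_i}{a_n} \frac{p_i}{p_n}\Big) + \log p_n + \log a_n \\ &= \log\Big(1 + \sum_{i=1}^{n-1} e^{\theta_i - \phi_i}\Big) - \psi(\theta) + \log a_n = c(\theta, \phi) - \psi(\theta) + \log a_n. \end{aligned}$$

It follows from (3.3) that

 $$f(\theta) = \varphi(\theta) + \psi(\theta) = \inf_{\ell \in \mathcal{C}} \big(c(\theta, \phi) + \log a_n\big). \tag{3.4}$$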

Define by setting

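 $$h(\phi) = \inf\Big\{ -\log a_n : \exists\, \ell(p) = \sum_{i=1}^n a_i p_i \in \mathcal{C} \ \text{s.t.}\ \phi_i = -\log\frac{a_i}{a_n} \ \forall\, i \Big\},$$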

where the infimum of the empty set is . From (3.4), we have

 $$f(\theta) = \inf_{\phi \in Y} \big(c(\theta, \phi) - h(\phi)\big) = h^*(\theta)$$

which shows that is -concave on . ∎

The following is the -concave analogue of the classical Legendre transformation [46]. Its proof is standard but lengthy and will be given in the Appendix.

Theorem 3.2 (c-Legendre transformation).

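Let $\varphi$ be an exponentially concave function satisfying Assumption 2.5, and let $\pi$, defined by (1.4), be the portfolio map generated by $\varphi$. Given $\varphi$, consider the $c$-concave function

 $$f(\theta) := \varphi(\theta) + \psi(\theta) \tag{3.5}$$

defined on $X$ via the exponential coordinate system.

1. The $c$-supergradient of $f$ is given by (2.7), i.e.,

 $$\nabla^c f(\theta) = \Big(\theta_i - \log\frac{\pi_i(\theta)}{\pi_n(\theta)}\Big)_{1 \le i \le n-1}. \tag{3.6}$$

Moreover, the map $\nabla^c f$ is injective.

2. Let $Y'$ be the range of $\nabla^c f$. Then the $c$-supergradient of $f^*$ is given on $Y'$ by

 $$\nabla^c f^*(\phi) = (\nabla^c f)^{-1}(\phi), \quad \phi \in Y'.$$

In fact, the map $\nabla^c f$ is a diffeomorphism from $X$ to $Y'$ whose inverse is $\nabla^c f^*$. Also, the function $f^*$ is smooth on the open set $Y'$.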

Although $Y'$ is in general a strict subset of $Y$, by Theorem 3.2 the dual variable $\phi$ defines a global coordinate system of the manifold $\Delta_n$. In Theorem 5.1 we will use another coordinate system on $\Delta_n$ called the dual Euclidean coordinate system. Thus we have four coordinate systems on $\Delta_n$: Euclidean, primal, dual and dual Euclidean (see Definition 3.3). In the following we will frequently switch between coordinate systems to facilitate computations. To avoid confusion, let us state once and for all the conventions used. We let $\varphi$ and $\pi$ be given.

Definition 3.3 (Coordinate systems).

For the unit simplex $\Delta_n$ (defined by (1.1)) we call the identity map

 $$p = (p_1, \ldots, p_n), \quad p_i > 0, \quad \sum_{i=1}^n p_i = 1$$

the (primal) Euclidean coordinate system with range $\Delta_n$. We let

 $$\theta = \theta(p) = \Big(\log\frac{p_1}{p_n}, \ldots, \log\frac{p_{n-1}}{p_n}\Big)$$

be the primal (exponential) coordinate system with range $X$ and

 $$\phi = \phi(p) := \nabla^c f(\theta)$$

be the dual (exponential) coordinate system with range $Y'$. The dual Euclidean coordinate system is defined by the composition

 $$p^* = p^*(p) := p(-\phi(p)).$$

See Figure 2 for an illustration. From now on , , and always represent the same point of . In particular, unless otherwise specified and are dual to each other in the sense that . By convention we let for any .

Notation 3.4 (Switching coordinate systems).

We identify the spaces , and using the coordinate systems in Definition 3.3. If is a function on any one of these spaces, we write depending on the coordinate system used.

We also record a useful fact. A formula analogous to the first statement is derived in [49].

Lemma 3.5.

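For $1 \le i \le n-1$, we have

 $$\frac{\partial}{\partial \theta_i} f(\theta) = \pi_i(\theta), \quad \theta \in X, \qquad \frac{\partial}{\partial \phi_i} f^*(\phi) = -\pi_i(\phi), \quad \phi \in Y'.$$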
Proof.

The first statement is derived in the proof of Theorem 3.2. The second statement can be proved by differentiating (Fenchel’s identity). ∎
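The first identity of Lemma 3.5 can be sanity-checked with finite differences. The Python sketch below does this for the diversity-weighted portfolio, using $\varphi(p) = \frac{1}{\lambda}\log\sum_j p_j^{\lambda}$ and $f = \varphi + \psi$ as in (3.5). The function names, the step size `h` and the sample point are assumptions made for illustration.

```python
import math

def psi(x):
    """psi(x) = log(1 + sum_i exp(x_i)), as in (1.7)."""
    return math.log(1.0 + sum(math.exp(xi) for xi in x))

def from_exponential(theta):
    """Inverse exponential coordinates (2.2), with theta_n = 0."""
    norm = psi(theta)
    return [math.exp(t - norm) for t in theta] + [math.exp(-norm)]

def log_generating(p, lam):
    """phi(p) = (1/lam) * log(sum_j p_j**lam)."""
    return math.log(sum(pi ** lam for pi in p)) / lam

def diversity_portfolio(p, lam):
    powered = [pi ** lam for pi in p]
    total = sum(powered)
    return [x / total for x in powered]

def f(theta, lam):
    """f(theta) = phi(p(theta)) + psi(theta), as in (3.5)."""
    return log_generating(from_exponential(theta), lam) + psi(theta)

lam, theta, h = 0.5, [0.4, -0.9], 1e-6
pi = diversity_portfolio(from_exponential(theta), lam)
numeric = []
for i in range(len(theta)):
    up, down = list(theta), list(theta)
    up[i] += h
    down[i] -= h
    # Central difference approximation of the i-th partial derivative of f.
    numeric.append((f(up, lam) - f(down, lam)) / (2 * h))
print(numeric, pi[:-1])
```

The central differences agree with the first $n-1$ portfolio weights to within the discretization error.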

3.2 c-divergence

By duality, we show that a pair of natural divergences on $\Delta_n$ can be defined for the $c$-concave functions $f$ and $f^*$. Moreover, they coincide with L-divergence. Clearly we can consider cost functions other than $c$. When the cost is the squared Euclidean distance, the analogue of Definition 3.6 below gives the classical Bregman divergence. This covers both L-divergence and Bregman divergence under the same framework. To the best of our knowledge these definitions, which depend crucially on the interplay between transport and divergence, are new. We will use the triple representation $(p, \theta, \phi)$ for each point in $\Delta_n$.

Definition 3.6 (c-divergence).

Consider the $c$-concave function $f$ defined by (3.5) and its $c$-transform $f^*$.

1. The $c$-divergence of $f$ is defined by

 $$D(p \mid p') = c(\theta, \phi') - c(\theta', \phi') - \big(f(\theta) - f(\theta')\big), \quad p, p' \in \Delta_n. \tag{3.7}$$

2. The $c$-divergence of $f^*$ is defined by

 $$D^*(p \mid p') = c(\theta', \phi) - c(\theta', \phi') - \big(f^*(\phi) - f^*(\phi')\big), \quad p, p' \in \Delta_n. \tag{3.8}$$

From Fenchel’s identity (3.2) we see that $D$ and $D^*$ are non-negative and non-degenerate, i.e., they vanish only on the diagonal of $\Delta_n \times \Delta_n$. The following is a generalization of the self-dual expression of Bregman divergence (see [2, Theorem 1.1]).

Proposition 3.7 (Self-dual expressions).

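We have

 $$D(p \mid p') = c(\theta, \phi') - f(\theta) - f^*(\phi'), \tag{3.9}$$

 $$D^*(p \mid p') = c(\theta', \phi) - f^*(\phi) - f(\theta'). \tag{3.10}$$

In particular, for $p, p' \in \Delta_n$ we have $D(p \mid p') = D^*(p' \mid p)$.

Proof.

To prove (3.9), we use the Fenchel identity $f(\theta') + f^*(\phi') = c(\theta', \phi')$. Starting from (3.7), we have

 $$D(p \mid p') = c(\theta, \phi') - c(\theta', \phi') - \big(f(\theta) - f(\theta')\big) = c(\theta, \phi') - f(\theta) - f^*(\phi').$$

The proof of (3.10) is similar. ∎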

Now we show that L-divergence is a $c$-divergence where $c(\theta, \phi) = \psi(\theta - \phi)$.

Theorem 3.8 (L-divergence as c-divergence).

The $c$-divergence of $f$ is the L-divergence of $\varphi$. Namely, for $p, p' \in \Delta_n$ we have

 $$D(p \mid p') = T(p \mid p').$$
Proof.

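Using the primal-dual relation (3.6), we have

 $$\psi(\theta - \phi') = \log\Big(\sum_{i=1}^n e^{\theta_i - \theta'_i + \log\frac{\pi_i(\theta')}{\pi_n(\theta')}}\Big) = \log\Big(\pi(p') \cdot \frac{p}{p'}\Big) - \log\Big(\pi_n(p') \frac{p_n}{p'_n}\Big),$$

where $\pi(p') \cdot \frac{p}{p'} = \sum_{i=1}^n \pi_i(p') \frac{p_i}{p'_i}$. Next, by Fenchel’s identity (see (3.2)), we have

 $$f^*(\phi') = \psi(\theta' - \phi') - f(\theta') = \psi(\theta' - \phi') - \varphi(\theta') - \psi(\theta').$$

Using these identities and (2.6), we compute

 $$\begin{aligned} D(p \mid p') &= \psi(\theta - \phi') - f(\theta) - f^*(\phi') \\ &= \log\Big(\pi(p') \cdot \frac{p}{p'}\Big) - \log\Big(\pi_n(p') \frac{p_n}{p'_n}\Big) - \big(\varphi(\theta) + \psi(\theta)\big) - \big(\psi(\theta' - \phi') - \varphi(\theta') - \psi(\theta')\big) \\ &= \log\Big(\pi(p') \cdot \frac{p}{p'}\Big) - \big(\varphi(\theta) - \varphi(\theta')\big) = T(p \mid p'). \end{aligned}$$

∎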
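The identity of Theorem 3.8 can be checked numerically. The Python sketch below evaluates the $c$-divergence through (3.7), with the dual coordinate obtained from (2.7), and the L-divergence through (2.6), for the diversity-weighted portfolio with $\varphi(p) = \frac{1}{\lambda}\log\sum_j p_j^{\lambda}$. All function names and sample values are assumptions made for illustration.

```python
import math

def psi(x):
    return math.log(1.0 + sum(math.exp(xi) for xi in x))

def to_exponential(p):
    return [math.log(pi / p[-1]) for pi in p[:-1]]

def portfolio(p, lam):
    powered = [pi ** lam for pi in p]
    total = sum(powered)
    return [x / total for x in powered]

def log_generating(p, lam):
    return math.log(sum(pi ** lam for pi in p)) / lam

def f(p, lam):
    """f = phi + psi evaluated at the point with Euclidean coordinates p, see (3.5)."""
    return log_generating(p, lam) + psi(to_exponential(p))

def dual_coordinate(p, lam):
    """phi = theta - log(pi_i / pi_n), i.e. the map (2.7)."""
    pi = portfolio(p, lam)
    return [t - math.log(pi_i / pi[-1]) for t, pi_i in zip(to_exponential(p), pi)]

def cost(theta, phi):
    return psi([t - s for t, s in zip(theta, phi)])

def c_divergence(p, p_prime, lam):
    """D(p | p') from (3.7)."""
    theta, theta_prime = to_exponential(p), to_exponential(p_prime)
    phi_prime = dual_coordinate(p_prime, lam)
    return (cost(theta, phi_prime) - cost(theta_prime, phi_prime)
            - (f(p, lam) - f(p_prime, lam)))

def l_divergence(p, p_prime, lam):
    """T(p | p') from (2.6)."""
    pi = portfolio(p_prime, lam)
    ratio = sum(w * a / b for w, a, b in zip(pi, p, p_prime))
    return math.log(ratio) - (log_generating(p, lam) - log_generating(p_prime, lam))

lam = 0.5
p, p_prime = [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]
D = c_divergence(p, p_prime, lam)
T = l_divergence(p, p_prime, lam)
print(D, T)
```

Both quantities agree up to rounding and are positive for distinct points, as expected from non-degeneracy.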

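For computations it is convenient to express $T(p \mid p')$ solely in terms of either the primal or dual coordinates. We omit the details of the computations.

Lemma 3.9 (Coordinate representations).

For $p, p' \in \Delta_n$ we have

 $$T(p \mid p') = \log\Big(\sum_{\ell=1}^n \pi_\ell(\theta') e^{\theta_\ell - \theta'_\ell}\Big) - \big(f(\theta) - f(\theta')\big), \qquad T(p \mid p') = \log\Big(\sum_{\ell=1}^n \pi_\ell(\phi) e^{\phi_\ell - \phi'_\ell}\Big) - \big(f^*(\phi') - f^*(\phi)\big).$$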

3.3 Transport interpretation of the generalized Pythagorean theorem

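Using Theorem 3.8 we give an interesting transport interpretation of the expression (1.8) in the generalized Pythagorean theorem (Theorem 1.2). Let $p, q, r \in \Delta_n$ be given. Let $(\theta^{(1)}, \phi^{(1)})$, $(\theta^{(2)}, \phi^{(2)})$ and $(\theta^{(3)}, \phi^{(3)})$ be the primal and dual coordinates of $p$, $q$ and $r$ respectively. By Proposition 2.7, the coupling $\{(\theta^{(k)}, \phi^{(k)})\}_{k=1,2,3}$ is $c$-cyclically monotone. Hence coupling $\theta^{(k)}$ with $\phi^{(k)}$ is optimal.

Consider two (suboptimal) perturbations of the optimal coupling:

(i) (Cyclical perturbation) Couple $\theta^{(1)}$ with $\phi^{(3)}$, $\theta^{(2)}$ with $\phi^{(1)}$, and $\theta^{(3)}$ with $\phi^{(2)}$. The associated cost is

\[
c(\theta^{(1)}, \phi^{(3)}) + c(\theta^{(2)}, \phi^{(1)}) + c(\theta^{(3)}, \phi^{(2)}).
\]

(ii) (Transposition) Couple $\theta^{(1)}$ with $\phi^{(3)}$, $\theta^{(3)}$ with $\phi^{(1)}$, and keep the coupling $(\theta^{(2)}, \phi^{(2)})$. The associated cost is

\[
c(\theta^{(1)}, \phi^{(3)}) + c(\theta^{(3)}, \phi^{(1)}) + c(\theta^{(2)}, \phi^{(2)}).
\]

Now we ask which perturbation has lower cost. The difference (i) $-$ (ii) is

\[
c(\theta^{(2)}, \phi^{(1)}) + c(\theta^{(3)}, \phi^{(2)}) - c(\theta^{(3)}, \phi^{(1)}) - c(\theta^{(2)}, \phi^{(2)}).
\]

By Theorem 3.8 and the self-dual expression (3.9), the terms involving $f$ and $f^*$ cancel, and this is nothing but the difference $T(q \mid p) + T(r \mid q) - T(r \mid p)$. Thus the generalized Pythagorean theorem gives an information geometric characterization of the relative costs of the two perturbations.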
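The cost comparison can be illustrated numerically. This is a sketch under assumptions: the cost is $c(\theta, \phi) = \psi(\theta - \phi)$ with $\psi(x) = \log(1 + \sum_{i<n} e^{x_i})$, the portfolio is diversity-weighted with parameter $\lambda$, and the three points are arbitrary choices labelled `p`, `q`, `r`. Function names are hypothetical.

```python
import numpy as np

LAM = 0.4  # assumed diversity parameter

def to_theta(p):
    return np.log(p[:-1] / p[-1])

def psi(x):
    return np.log1p(np.exp(x).sum())

def portfolio(p):
    w = p ** LAM
    return w / w.sum()

def phi_fn(p):
    return np.log((p ** LAM).sum()) / LAM

def to_phi(p):
    pi = portfolio(p)
    return to_theta(p) - np.log(pi[:-1] / pi[-1])

def cost(a, b):
    # c(theta_a, phi_b) = psi(theta_a - phi_b)
    return psi(to_theta(a) - to_phi(b))

def T(a, b):
    # L-divergence T(a | b)
    return np.log(np.dot(portfolio(b), a / b)) - (phi_fn(a) - phi_fn(b))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.6, 0.1, 0.3])
r = np.array([0.3, 0.4, 0.3])

optimal = cost(p, p) + cost(q, q) + cost(r, r)
cyclical = cost(p, r) + cost(q, p) + cost(r, q)       # perturbation (i)
transposition = cost(p, r) + cost(r, p) + cost(q, q)  # perturbation (ii)

cost_gap = cyclical - transposition
pythagorean = T(q, p) + T(r, q) - T(r, p)
print(cost_gap, pythagorean)
```

The sign of the printed quantity tells which of the two perturbations is cheaper for this triple; both perturbations cost at least as much as the identity coupling.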

3.4 Examples

We consider the portfolios in Example 2.6.

Example 3.10 (Constant-weighted portfolio).

Let $\pi(p) \equiv \pi \in \Delta_n$ be a constant-weighted portfolio. Then $\varphi(p) = \sum_{i=1}^n \pi_i \log p_i$ is the cross entropy and we have

\[
f(\theta) = \varphi(\theta) + \psi(\theta) = \sum_{i=1}^{n-1} \pi_i \theta_i,
\]

which is an affine function on $\mathbb{R}^{n-1}$. Its $c$-transform is also affine. Indeed, we have

\[
f^*(\phi) = \sum_{i=1}^{n-1} \pi_i (-\phi_i) + H(\pi),
\]

where $H(\pi) = -\sum_{i=1}^n \pi_i \log \pi_i$ is the Shannon entropy of $\pi$. For this reason we say that the constant-weighted portfolios are self-dual. The transport map in this case is given by a translation: $\phi_i = \theta_i - \log(\pi_i / \pi_n)$. Its L-divergence (1.6) is given in primal coordinates (see Lemma 3.9) by

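\[
T(p \mid p') = \log\left( \sum_{\ell=1}^n \pi_\ell e^{\theta_\ell - \theta'_\ell} \right) - \sum_{\ell=1}^n \pi_\ell (\theta_\ell - \theta'_\ell),
\]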

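which is translation invariant. This property is equivalent to the following numéraire invariance property [44, Lemma 3.2]: for any $w_1, \ldots, w_n > 0$, we have $T(\tilde{p} \mid \tilde{p}') = T(p \mid p')$ under the mapping

\[
p \mapsto \tilde{p} = \left( \frac{w_i p_i}{w_1 p_1 + \cdots + w_n p_n} \right)_{1 \le i \le n}.
\]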

In fact, it is not difficult to show that this property characterizes the constant-weighted portfolios among L-divergences of exponentially concave functions. Also see [44, Proposition 4.6] for a chain rule analogous to that of relative entropy.
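The numéraire invariance is easy to test numerically. The sketch below assumes the constant-weighted L-divergence in simplex form, $T(p \mid p') = \log(\pi \cdot p/p') - \sum_\ell \pi_\ell \log(p_\ell / p'_\ell)$, with arbitrary illustrative weights; the function names are hypothetical.

```python
import numpy as np

PI = np.array([0.5, 0.3, 0.2])  # illustrative constant portfolio weights

def T_const(p, q):
    # L-divergence T(p | q) of the constant-weighted portfolio PI
    ratio = p / q
    return np.log(np.dot(PI, ratio)) - np.dot(PI, np.log(ratio))

def change_numeraire(p, w):
    # p -> (w_i p_i / sum_j w_j p_j)_i
    x = w * p
    return x / x.sum()

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.6, 0.1, 0.3])
w = np.array([2.0, 0.5, 7.0])   # arbitrary positive weights

before = T_const(p, q)
after = T_const(change_numeraire(p, w), change_numeraire(q, w))
print(before, after)
```

The normalizing constants of the two transformed points enter the logarithm and the weighted sum with the same coefficient, so they cancel; this is the translation invariance in the coordinates $\theta$.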

Example 3.11 (Diversity-weighted portfolio).

We have $\pi_i(p) = p_i^\lambda / \sum_{j=1}^n p_j^\lambda$. Since

\[
\log \frac{\pi_i(\theta)}{\pi_n(\theta)} = \lambda \theta_i,
\]

the map $\theta \mapsto \phi$ is a scaling: $\phi = (1 - \lambda)\theta$. For the generalized diversity-weighted portfolio in Example 2.8 the transport map is the composition of a scaling and a translation.
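A short numerical check of the scaling property, as a sketch under assumptions: the diversity-weighted portfolio is taken to be $\pi_i(p) = p_i^\lambda / \sum_j p_j^\lambda$, the primal coordinates are $\theta_i = \log(p_i/p_n)$, and the dual coordinates are $\phi_i = \theta_i - \log(\pi_i/\pi_n)$. Names are hypothetical.

```python
import numpy as np

LAM = 0.4  # assumed diversity parameter

def to_theta(p):
    return np.log(p[:-1] / p[-1])

def portfolio(p):
    w = p ** LAM
    return w / w.sum()

def to_phi(p):
    # dual coordinates phi_i = theta_i - log(pi_i / pi_n)
    pi = portfolio(p)
    return to_theta(p) - np.log(pi[:-1] / pi[-1])

p = np.array([0.2, 0.3, 0.5])
theta = to_theta(p)
phi = to_phi(p)
print(phi, (1 - LAM) * theta)
```

Both printed vectors should coincide, since $\log(\pi_i/\pi_n) = \lambda\theta_i$ for this portfolio.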

4 Geometric structure induced by L-divergence

In this section we derive the geometric structure induced by a given L-divergence $T(\cdot \mid \cdot)$. As always we impose the regularity conditions in Assumption 2.5. Using the primal and dual coordinate systems (Definition 3.3), we compute explicitly the Riemannian metric $g$, the primal connection $\nabla$ (not to be confused with the Euclidean gradient) and the dual connection $\nabla^*$. We call $(g, \nabla, \nabla^*)$ the induced geometric structure. An important fact in information geometry is that the Levi-Civita connection is not necessarily the right one to use. Nevertheless, by duality the Levi-Civita connection is always the average $\frac{1}{2}(\nabla + \nabla^*)$.

4.1 Preliminaries

For differential geometric concepts such as Riemannian metric and affine connection we refer the reader to [2, Chapter 5], whose notation is consistent with ours. For computational convenience we define the geometric structure in terms of coordinate representations. The geometric structure is determined by the L-divergence and is independent of the choice of coordinates; for intrinsic formulations we refer the reader to [13, Chapter 11]. The following definition (which makes sense for a general divergence on a manifold) is taken from [2, Section 6.2].

Definition 4.1 (Induced geometric structure).

Given a coordinate system $\xi = (\xi_1, \ldots, \xi_{n-1})$ of $\Delta_n$, the coefficients of the geometric structure $(g, \nabla, \nabla^*)$ are given as follows.

1. The Riemannian metric is given by

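\[
g_{ij}(\xi) = -\left. \frac{\partial}{\partial \xi_i} \frac{\partial}{\partial \xi'_j} T(\xi \mid \xi') \right|_{\xi = \xi'}, \quad i, j = 1, \ldots, n-1. \tag{4.1}
\]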

By Assumption 2.5 the matrix $(g_{ij}(\xi))$ is strictly positive definite. The Riemannian inner product and length are denoted by $\langle \cdot, \cdot \rangle$ and $\| \cdot \|$ respectively.

2. The primal connection is given by

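\[
\Gamma_{ijk}(\xi) = -\left. \frac{\partial}{\partial \xi_i} \frac{\partial}{\partial \xi_j} \frac{\partial}{\partial \xi'_k} T(\xi \mid \xi') \right|_{\xi = \xi'}, \quad i, j, k = 1, \ldots, n-1. \tag{4.2}
\]

3. The dual connection is given by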

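\[
\Gamma^*_{ijk}(\xi) = -\left. \frac{\partial}{\partial \xi_k} \frac{\partial}{\partial \xi'_i} \frac{\partial}{\partial \xi'_j} T(\xi \mid \xi') \right|_{\xi = \xi'}, \quad i, j, k = 1, \ldots, n-1. \tag{4.3}
\]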

For a general divergence the above definitions were first introduced in [19, 20]. If we define the dual divergence by $T^*(p \mid p') := T(p' \mid p)$, the dual connection of $T$ is the primal connection of $T^*$. The primal and dual connections are dual to each other with respect to the Riemannian metric (see [2, Theorem 6.2]). While any divergence induces a geometric structure, it may not enjoy nice properties. For the geometric structure induced by a Bregman divergence, it can be shown that the Riemann-Christoffel curvatures of the primal and dual connections both vanish. Thus we say that the induced geometry is dually flat [2, Chapter 1]. We will show that L-divergence gives rise to a different geometry with many interesting properties.
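To make formula (4.1) concrete, the metric can be approximated by a central finite difference of the divergence. The sketch below is an illustration under assumptions, not the paper's computation: it uses the constant-weighted L-divergence in primal coordinates (with $\theta_n = 0$), arbitrary weights, and a step size chosen by hand. The closed form $g_{ij} = \pi_i \delta_{ij} - \pi_i \pi_j$ used for comparison is derived here as the Hessian of the log-sum-exp term and is not quoted from the text.

```python
import numpy as np

PI = np.array([0.5, 0.3, 0.2])  # illustrative constant portfolio weights

def T(xi, xi_prime):
    # constant-weighted L-divergence in primal coordinates; the n-th
    # coordinate difference is zero by convention
    d = np.append(xi - xi_prime, 0.0)
    return np.log(np.dot(PI, np.exp(d))) - np.dot(PI, d)

def metric(xi, h=1e-3):
    # g_ij = -(d/dxi_i)(d/dxi'_j) T(xi | xi') at xi = xi',
    # approximated by a central mixed finite difference
    m = len(xi)
    E = np.eye(m) * h
    g = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            mixed = (T(xi + E[i], xi + E[j]) - T(xi + E[i], xi - E[j])
                     - T(xi - E[i], xi + E[j]) + T(xi - E[i], xi - E[j]))
            g[i, j] = -mixed / (4 * h * h)
    return g

xi = np.array([0.3, -0.7])
g = metric(xi)
closed_form = np.diag(PI[:-1]) - np.outer(PI[:-1], PI[:-1])
print(g)
print(closed_form)
```

For this example the metric does not depend on the base point, because the divergence depends on $\theta$ and $\theta'$ only through their difference. The same finite-difference pattern extends to (4.2) and (4.3) with one extra derivative, at the cost of more function evaluations and more sensitivity to the step size.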

4.2 Notations

We begin by clarifying the notations. Following our convention (see Notation 3.4), we write