A note on the quasiconvex Jensen divergences and the quasiconvex Bregman divergences derived thereof

E-mail: Frank.Nielsen@acm.org. Web: https://FrankNielsen.github.io/
Sony Computer Science Laboratories Inc., Tokyo, Japan
Sony Computer Science Laboratories, Paris, France
Abstract

We first introduce the class of strictly quasiconvex and strictly quasiconcave Jensen divergences, which are oriented (asymmetric) distances, and study some of their properties. We then define the strictly quasiconvex Bregman divergences as the limit case of scaled and skewed quasiconvex Jensen divergences, and report a simple closed-form formula which shows that these divergences are only pseudo-divergences at countably many inflection points of the generators. To remedy this problem, we propose the $\delta$-averaged quasiconvex Bregman divergences, which integrate the pseudo-divergences over a small neighborhood in order to obtain a proper divergence. The formula of the $\delta$-averaged quasiconvex Bregman divergences extends even to non-differentiable strictly quasiconvex generators. These quasiconvex Bregman divergences between distinct elements always have one orientation finite while the other orientation is infinite. We show that these quasiconvex Bregman divergences can also be interpreted as limit cases of generalized skewed Jensen divergences with respect to comparative convexity by using power means. Finally, we illustrate how these quasiconvex Bregman divergences naturally appear as equivalent divergences for the Kullback-Leibler divergences between probability densities belonging to the same parametric family of distributions with nested supports.

Keywords: oriented forward and reverse distances, Jensen divergence, Bregman divergence, quasiconvexity, inflection points, comparative convexity, power means, nested densities.

1 Introduction, motivation, and contributions

A dissimilarity $D(O_1:O_2)$ is a measure of the deviation of an object $O_2$ from a reference object $O_1$ (i.e., $D_{O_1}(O_2)$) which satisfies the following two basic properties:

Non-negativity. $D(O_1:O_2)\geq 0$ for all $O_1, O_2$.

Law of the indiscernibles. $D(O_1:O_2)=0$ if and only if $O_1=O_2$.

In other words, a dissimilarity satisfies $D(O_1:O_2)\geq 0$ with equality if and only if $O_1=O_2$. A pseudo-dissimilarity is a measure of deviation for which the non-negativity property holds but not necessarily the law of the indiscernibles [30]. The objects can be vectors, probability distributions, random variables, strings, graphs, etc. In general, a dissimilarity may not be symmetric, i.e., we may have $D(O_1:O_2)\neq D(O_2:O_1)$. In that case, the dissimilarity is said to be oriented, and we consider the following two reference orientations of the dissimilarity: the forward ordinary dissimilarity $D(O_1:O_2)$ and its associated reverse dissimilarity $D^r(O_1:O_2):=D(O_2:O_1)$. Notice that we used the ’:’ notation instead of the comma delimiter ’,’ between the dissimilarity arguments to emphasize that the dissimilarity may be asymmetric. In the literature, a dissimilarity is also commonly called a divergence [3], although several additional meanings may be associated with this term, like a dissimilarity between probability distributions instead of vectors (e.g., the Kullback-Leibler divergence [12] in information theory) or like a notion of smoothness (e.g., a contrast function in information geometry [3]). A dissimilarity may also be loosely called a distance, although in some contexts this may convey to mathematicians the additional notion of a dissimilarity satisfying the metric axioms (non-negativity, law of the indiscernibles, symmetry and triangle inequality).

The Bregman divergences [10, 9] were introduced in operations research, and are widely used nowadays in machine learning and information sciences. For a strictly convex and smooth generator $F$, called the Bregman generator, we define the corresponding Bregman divergence between parameter vectors $\theta$ and $\theta'$ as:

$$B_F(\theta:\theta') = F(\theta)-F(\theta')-(\theta-\theta')^\top\nabla F(\theta'). \tag{1}$$

Bregman divergences are always finite, and generalize many common distances [5], including the Kullback-Leibler (KL) divergence and the squared Euclidean and Mahalanobis distances. Furthermore, the KL divergence between two probability densities belonging to the same exponential family [6, 5] amounts to a reverse Bregman divergence between the corresponding parameters when setting the Bregman generator to be the cumulant function of the exponential family [4]. Moreover, a bijection between regular exponential families [6] and the so-called class of “regular Bregman divergences” was reported in [5] and used for learning statistical mixtures, showing that the expectation-maximization algorithm is equivalent to a Bregman soft clustering algorithm. Bregman divergences have been extended to many non-vector data types like matrix arguments [32] or functional arguments [16].
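As a quick illustration of Eq. 1, the following minimal Python sketch evaluates a Bregman divergence from a generator and its gradient. The helper names (`bregman`, `sq_norm`, `neg_entropy`) and the test vectors are our own illustrative assumptions, not part of the original note. The sketch checks two classic instances: the generator $F(x)=\sum_i x_i^2$ yields the squared Euclidean distance, and the negative Shannon entropy yields the KL divergence when both arguments are normalized.

```python
import math

def bregman(F, grad_F, theta, theta_p):
    """B_F(theta : theta') = F(theta) - F(theta') - <theta - theta', grad F(theta')>."""
    inner = sum((a - b) * g for a, b, g in zip(theta, theta_p, grad_F(theta_p)))
    return F(theta) - F(theta_p) - inner

# Hypothetical generator 1: F(x) = sum_i x_i^2 (squared Euclidean distance).
def sq_norm(x):
    return sum(v * v for v in x)

def sq_norm_grad(x):
    return [2.0 * v for v in x]

# Hypothetical generator 2: negative Shannon entropy F(x) = sum_i x_i log x_i.
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)

def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]

p = [0.2, 0.3, 0.5]   # both vectors sum to one
q = [0.4, 0.4, 0.2]

d_euclid = bregman(sq_norm, sq_norm_grad, p, q)
d_kl = bregman(neg_entropy, neg_entropy_grad, p, q)
kl_direct = sum(a * math.log(a / b) for a, b in zip(p, q))

print(d_euclid, d_kl, kl_direct)
```

The two printed KL values should agree up to rounding, since the extra term $\sum_i (p_i-q_i)$ vanishes for normalized vectors.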

In this note, we define the notion of Jensen divergences [26] for strictly quasiconvex or strictly quasiconcave generators, and the induced notion of Bregman divergences. We term them quasiconvex Bregman divergences (and omit the prefix ’strictly’ for the sake of brevity). We then establish a connection between the KL divergence between parametric families of densities with nested supports and these quasiconvex Bregman divergences.

We summarize our main contributions as follows:

• By using quasiconvex generators instead of convex generators, we define the skewed quasiconvex Jensen divergences (Definition 1) and the quasiconvex Bregman divergences derived thereof (Definition 3 and Theorem 1). The quasiconvex Bregman divergences turn out to be only pseudo-divergences at inflection points of the generator. Since this happens only at countably many points, we still loosely call them quasiconvex Bregman divergences. We can also integrate the quasiconvex Bregman (pseudo-)divergence over a small neighborhood and obtain a $\delta$-averaged quasiconvex Bregman divergence in §3.2. The $\delta$-averaged quasiconvex Bregman divergences are also well-defined for strictly quasiconvex but not differentiable generators. Quasiconvex Bregman divergences between distinct parameters always have one orientation finite while the other one evaluates to infinity.

• We show that quasiconvex Jensen divergences and quasiconvex Bregman divergences can be reinterpreted as generalized Jensen and Bregman divergences with comparative convexity [25, 29] using power means in the limit case (§2.3 and §3.3).

• We exhibit some parametric families of probability distributions with strictly nested supports such that the Kullback-Leibler divergences between them amount to equivalent quasiconvex Bregman divergences (§4).

The paper is organized as follows: Section 2 defines the quasiconvex and quasiconcave difference distances by analogy to Jensen difference distances [34, 26], studies some of their properties, and shows how to obtain them as generalized Jensen divergences [29] obtained from comparative convexity using power means. Hence their name: quasiconvex Jensen divergences. When the generator is quasilinear instead of quasiconvex, we call them quasilinear Jensen divergences. We then define the quasiconvex Bregman divergences in §3 as limit cases of scaled and skewed quasiconvex Jensen divergences, and report a closed-form formula which highlights the fact that one orientation of the distance is always finite while the other one is always infinite (for divergences between distinct elements). Since the quasiconvex Bregman divergences are only pseudo-divergences at inflection points, we define the $\delta$-averaged quasiconvex Bregman divergences in §3.2. We also recover the formula by taking the limit case of power mean Bregman divergences that were introduced using comparative convexity [29].

In §4, we consider the problem of finding parametric families of probability distributions for which the Kullback-Leibler divergence amounts to a quasiconvex Bregman divergence. We illustrate one example showing that nested supports of the densities ensure the property of having one orientation finite while the other one is infinite. Finally, §5 concludes this note and hints at application perspectives of these quasiconvex Bregman divergences, including flat and hierarchical clustering.

2 Divergences based on inequality gaps of quasiconvex or quasiconcave generators

2.1 Quasiconvex and quasiconcave difference dissimilarities

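In this work, a divergence or distance $D(\theta:\theta')$ refers to a dissimilarity such that $D(\theta:\theta')\geq 0$ with equality iff $\theta=\theta'$. A pseudo-divergence or pseudo-distance only satisfies the non-negativity property but not necessarily the law of the indiscernibles of the dissimilarities.

Consider a function $Q:\Theta\subset\mathbb{R}^D\to\mathbb{R}$ which satisfies the following “Jensen-type” inequality [8] for any $\alpha\in(0,1)$:

$$Q((\theta\theta')_\alpha) < \max\{Q(\theta),Q(\theta')\},\quad \theta\neq\theta'\in\Theta\subset\mathbb{R}^D, \tag{2}$$

where $(\theta\theta')_\alpha:=(1-\alpha)\theta+\alpha\theta'$ denotes the weighted linear interpolation of $\theta$ with $\theta'$, and $\Theta$ the parameter space. The function $Q$ is said to be strictly quasiconvex [17, 7, 33, 8] as it relaxes the strict convexity inequality:

$$Q((\theta\theta')_\alpha) < (1-\alpha)Q(\theta)+\alpha Q(\theta') \leq \max\{Q(\theta),Q(\theta')\}. \tag{3}$$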

Let $\mathcal{Q}$ denote the space of such strictly quasiconvex real-valued functions, and let $\mathcal{C}$ denote the space of strictly convex functions. We have $\mathcal{C}\subset\mathcal{Q}$: Any strictly convex function or any strictly increasing function is quasiconvex, but not necessarily the converse: Some examples of quasiconvex functions which are not convex are $Q(\theta)=\theta^3$, $Q(\theta)=\sqrt{\theta}$, $Q(\theta)=\log\theta$, etc. Decreasing and then increasing functions are quasiconvex but may not necessarily be smooth. Some concave functions like $Q(\theta)=\log\theta$ are quasiconvex. The sum of quasiconvex functions is not necessarily quasiconvex. In the same spirit that function convexity can be reduced to set convexity via the epigraph representation of the function, a function $Q$ is quasiconvex if the level set $\{\theta\in\Theta : Q(\theta)\leq\lambda\}$ is (set) convex for all $\lambda\in\mathbb{R}$. When $Q$ is univariate, a quasiconvex function is also commonly called unimodal (i.e., a decreasing and then increasing function). Thus a multivariate quasiconvex function can be characterized as being unimodal along each line of its domain. Figure 1 displays some examples of quasiconvex functions with one function that fails to be quasiconvex. Notice that strictly monotonic functions, which are both strictly quasiconvex and strictly quasiconcave, are termed strictly quasilinear. The ceil function is an example of a quasilinear function (idem for the floor function). Other examples are the linear fractional functions $Q(\theta)=\frac{a^\top\theta+b}{c^\top\theta+d}$, which are quasilinear functions on the domain $\{\theta : c^\top\theta+d>0\}$. We denote by $\mathcal{L}$ the set of strictly quasilinear functions, and by $\mathcal{H}$ the set of strictly quasiconcave functions.

Definition 1 (Quasiconvex difference distance)

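The quasiconvex difference distance (or qcvx distance for short) for $\alpha\in(0,1)$ is defined as the inequality difference gap of Eq. 2:

\begin{align}
{}^{\mathrm{qcvx}}J^{\alpha}_{Q}(\theta:\theta') &:= \max\{Q(\theta),Q(\theta')\}-Q((\theta\theta')_\alpha)\geq 0, \tag{4}\\
&= \max\{Q(\theta),Q(\theta')\}-Q((1-\alpha)\theta+\alpha\theta'). \tag{5}
\end{align}

By definition, the quasiconvex difference distance is a dissimilarity satisfying ${}^{\mathrm{qcvx}}J^{\alpha}_{Q}(\theta:\theta')=0$ iff $\theta=\theta'$ when the generator $Q$ is strictly quasiconvex (see Eq. 2).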
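A minimal Python sketch of Definition 1 for a univariate generator may help fix ideas. The function name `qcvx_jensen` and the choice of the cubic generator are illustrative assumptions on our side; any strictly quasiconvex callable could be passed instead.

```python
def qcvx_jensen(Q, theta, theta_p, alpha=0.5):
    """Quasiconvex difference distance of Eq. 4:
    max{Q(theta), Q(theta')} - Q((1 - alpha) * theta + alpha * theta')."""
    interpolated = (1.0 - alpha) * theta + alpha * theta_p
    return max(Q(theta), Q(theta_p)) - Q(interpolated)

def cube(t):
    # Strictly increasing, hence strictly quasiconvex, but not convex on the real line.
    return t ** 3

print(qcvx_jensen(cube, -1.0, 2.0))              # alpha = 1/2
print(qcvx_jensen(cube, -1.0, 2.0, alpha=0.25))  # skewed interpolation
print(qcvx_jensen(cube, 1.5, 1.5))               # identical arguments
```

The gap is strictly positive for distinct arguments and vanishes for identical ones, as expected from the strict inequality of Eq. 2.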

Remark 1

Notice that we could also have defined a log-ratio gap [30] as a dissimilarity:

$${}^{\mathrm{qcvx}}JL^{\alpha}_{Q}(\theta:\theta') := -\log\left(\frac{Q((\theta\theta')_\alpha)}{\max\{Q(\theta),Q(\theta')\}}\right). \tag{6}$$

However, in that case we should have required the extra condition that the generator does not vanish in the domain, i.e., $Q(\theta)\neq 0$ for any $\theta\in\Theta$.

Property 1

Let and , and define . Functions are quasiconvex, and .

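Similarly, we can characterize a strictly quasiconcave real-valued function $H$ by the following inequality for $\alpha\in(0,1)$:

$$H((\theta\theta')_\alpha) > \min\{H(\theta),H(\theta')\},\quad \theta\neq\theta'\in\Theta\subset\mathbb{R}^D. \tag{7}$$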

This allows one to define the quasiconcave difference distance (or qccv distance for short):

Definition 2 (Quasiconcave difference distance)

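For a strictly quasiconcave function $H$ and $\alpha\in(0,1)$, we define the quasiconcave distance as:

\begin{align}
{}^{\mathrm{qccv}}J^{\alpha}_{H}(\theta:\theta') &:= H((\theta\theta')_\alpha)-\min\{H(\theta),H(\theta')\}, \tag{8}\\
&= H((1-\alpha)\theta+\alpha\theta')-\min\{H(\theta),H(\theta')\}. \tag{9}
\end{align}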

Similarly, we have for and .

Now, observe that for any $a,b\in\mathbb{R}$, we have $\min\{a,b\}=-\max\{-a,-b\}$ (or equivalently $\max\{a,b\}=-\min\{-a,-b\}$). Indeed, $a\leq b$ iff $-a\geq -b$. Thus the following identity holds:

Property 2

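A quasiconcave difference distance with quasiconcave generator $H$ is equivalent to a quasiconvex difference distance for the quasiconvex generator $-H$:

$${}^{\mathrm{qccv}}J^{\alpha}_{H}(\theta:\theta') = {}^{\mathrm{qcvx}}J^{\alpha}_{-H}(\theta:\theta'),\qquad {}^{\mathrm{qcvx}}J^{\alpha}_{Q}(\theta:\theta') = {}^{\mathrm{qccv}}J^{\alpha}_{-Q}(\theta:\theta'). \tag{10}$$

Proof.

\begin{align}
{}^{\mathrm{qccv}}J^{\alpha}_{H}(\theta:\theta') &= H((\theta\theta')_\alpha)-\min\{H(\theta),H(\theta')\}, \tag{11}\\
&= \max\{-H(\theta),-H(\theta')\}-(-H((\theta\theta')_\alpha)), \tag{12}\\
&= {}^{\mathrm{qcvx}}J^{\alpha}_{-H}(\theta:\theta'). \tag{13}
\end{align}

Therefore, we consider without loss of generality quasiconvex difference distances in the remainder.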
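The sign-flip identity of Property 2 can be checked numerically with a short Python sketch. The function names and the decreasing cubic generator `H` below are our own illustrative choices.

```python
def qcvx_jensen(Q, t, tp, alpha=0.5):
    """max{Q(t), Q(t')} - Q((1 - alpha) t + alpha t')  (Eq. 4)."""
    return max(Q(t), Q(tp)) - Q((1.0 - alpha) * t + alpha * tp)

def qccv_jensen(H, t, tp, alpha=0.5):
    """H((1 - alpha) t + alpha t') - min{H(t), H(t')}  (Eq. 8)."""
    return H((1.0 - alpha) * t + alpha * tp) - min(H(t), H(tp))

def H(t):
    # Strictly decreasing, hence strictly quasiconcave (and also quasiconvex).
    return -(t ** 3)

def neg_H(t):
    return -H(t)

pairs = [(-1.0, 2.0), (0.5, 3.0), (2.0, -0.25)]
for t, tp in pairs:
    lhs = qccv_jensen(H, t, tp, alpha=0.3)
    rhs = qcvx_jensen(neg_H, t, tp, alpha=0.3)
    print(t, tp, lhs, rhs)
```

Both columns coincide, which is why only the quasiconvex case needs to be treated.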

2.2 Relationship of quasiconvex difference distances with Jensen difference distances

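Since for any $a,b\in\mathbb{R}$, we have $\max\{a,b\}=\frac{a+b}{2}+\frac{1}{2}|a-b|$, and $\min\{a,b\}=\frac{a+b}{2}-\frac{1}{2}|a-b|$, we can rewrite Eq. 4 to get

\begin{align}
{}^{\mathrm{qcvx}}J^{\alpha}_{Q}(\theta:\theta') &= \frac{Q(\theta)+Q(\theta')}{2}+\frac{1}{2}\left|Q(\theta)-Q(\theta')\right|-Q((\theta\theta')_\alpha), \tag{14}\\
&= {}^{e}J^{\alpha}_{Q}(\theta,\theta')+\left(\alpha-\frac{1}{2}\right)\left(Q(\theta)-Q(\theta')\right)+\frac{1}{2}\left|Q(\theta)-Q(\theta')\right|, \tag{15}
\end{align}

where

$${}^{e}J^{\alpha}_{Q}(\theta,\theta') := (Q(\theta)Q(\theta'))_\alpha-Q((\theta\theta')_\alpha),\qquad (Q(\theta)Q(\theta'))_\alpha:=(1-\alpha)Q(\theta)+\alpha Q(\theta'), \tag{16}$$

is called the extended Jensen divergence, a Jensen-type divergence extended to quasiconvex generators instead of ordinary convex generators.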

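Property 3 (Upper bounding the extended Jensen divergence by ${}^{\mathrm{qcvx}}J^{\alpha}_{Q}$)

We have:

$${}^{e}J^{\alpha}_{Q}(\theta:\theta') \leq {}^{\mathrm{qcvx}}J^{\alpha}_{Q}(\theta:\theta') \tag{17}$$

since $(Q(\theta)Q(\theta'))_\alpha\leq\max\{Q(\theta),Q(\theta')\}$. In particular, when $Q=F$ is strictly convex, we have $0\leq J^{\alpha}_{F}(\theta:\theta')\leq {}^{\mathrm{qcvx}}J^{\alpha}_{F}(\theta:\theta')$.

Notice that ${}^{e}J^{\alpha}_{Q}\geq 0$ when $Q$ is strictly convex, but ${}^{e}J^{\alpha}_{Q}$ may be negative when $Q$ is only quasiconvex. For example, $Q(\theta)=\log\theta$ is a quasiconvex and concave function, and therefore ${}^{e}J^{\alpha}_{Q}\leq 0$.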

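When $\alpha=\frac{1}{2}$, we get the following identity:

Property 4 (Regularization of extended Jensen divergences)

\begin{align}
{}^{\mathrm{qcvx}}J_{Q}(\theta:\theta') &= \frac{Q(\theta)+Q(\theta')}{2}+\frac{1}{2}|Q(\theta)-Q(\theta')|-Q\left(\frac{\theta+\theta'}{2}\right), \tag{18}\\
&= {}^{e}J_{Q}(\theta,\theta')+\frac{1}{2}|Q(\theta)-Q(\theta')|, \tag{19}
\end{align}

where

$${}^{e}J_{Q}(\theta,\theta') := \frac{Q(\theta)+Q(\theta')}{2}-Q\left(\frac{\theta+\theta'}{2}\right), \tag{20}$$

is an extension of the Jensen divergence [11, 34] to a quasiconvex generator $Q$.

Thus when the generator is convex, we can interpret the quasiconvex divergence as a regularization of the ordinary Jensen divergence by the term $\frac{1}{2}|Q(\theta)-Q(\theta')|$. When the generator is not convex, beware that ${}^{e}J_{Q}$ may be negative but we always have ${}^{\mathrm{qcvx}}J_{Q}\geq 0$.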
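The decomposition of Eqs. 18 to 20 can be verified numerically. In the Python sketch below, the function names are ours and the logarithm is chosen only as one convenient generator that is quasiconvex and concave: its extended Jensen divergence is negative, while the quasiconvex Jensen divergence stays positive.

```python
import math

def extended_jensen(Q, t, tp):
    """eJ_Q(t, t') = (Q(t) + Q(t')) / 2 - Q((t + t') / 2)  (Eq. 20)."""
    return 0.5 * (Q(t) + Q(tp)) - Q(0.5 * (t + tp))

def qcvx_jensen(Q, t, tp):
    """qcvxJ_Q(t : t') = max{Q(t), Q(t')} - Q((t + t') / 2)."""
    return max(Q(t), Q(tp)) - Q(0.5 * (t + tp))

t, tp = 1.0, 4.0
ej = extended_jensen(math.log, t, tp)
qj = qcvx_jensen(math.log, t, tp)
regularizer = 0.5 * abs(math.log(t) - math.log(tp))

print(ej)                 # negative: log is concave
print(qj)                 # positive
print(ej + regularizer)   # matches qj (Eq. 19)
```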

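Similarly, when the generator $H$ is strictly quasiconcave, we rewrite the quasiconcave difference distance as

\begin{align}
{}^{\mathrm{qccv}}J_{H}(\theta:\theta') &= H\left(\frac{\theta+\theta'}{2}\right)-\frac{H(\theta)+H(\theta')}{2}+\frac{1}{2}|H(\theta)-H(\theta')|, \tag{21}\\
&= {}^{e}J_{-H}(\theta,\theta')+\frac{1}{2}|H(\theta)-H(\theta')|. \tag{22}
\end{align}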

2.3 Quasiconvex difference distances: The viewpoint of comparative convexity

In [29], a generalization of the skewed Jensen divergences with respect to comparative convexity [25] is obtained using a pair of weighted means. A mean $M(x,y)$ between two reals $x$ and $y$ belonging to an interval $I\subset\mathbb{R}$ is a bivariate function $M:I\times I\to I$ such that

$$\min\{x,y\}\leq M(x,y)\leq\max\{x,y\}. \tag{23}$$

That is, a mean satisfies the in-betweenness property (see [25], p. 328). A weighted mean $M_\alpha$ for $\alpha\in[0,1]$ can always be built from a mean $M$ by using the dyadic expansion of real numbers, see [25].

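Consider two weighted means $M_\alpha$ and $N_\alpha$.

A function $F$ is said to be $(M,N)$-convex iff:

$$N_\alpha(F(\theta),F(\theta'))\geq F(M_\alpha(\theta,\theta')),\quad \theta,\theta'\in\Theta. \tag{24}$$

We recover the ordinary convexity when $M=N=A$, where $A_\alpha(x,y)=(1-\alpha)x+\alpha y$ is the weighted arithmetic mean.

We can define the $\alpha$-skewed $(M,N)$-Jensen divergence as:

$$J^{M,N}_{F,\alpha}(\theta:\theta') := N_\alpha(F(\theta),F(\theta'))-F(M_\alpha(\theta,\theta')). \tag{25}$$

By definition, $J^{M,N}_{F,\alpha}(\theta:\theta')\geq 0$ when $F$ is a $(M,N)$-strictly convex function.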

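A quasi-arithmetic mean [25] is defined for a continuous strictly increasing function $f$ as:

$$M_f(p,q) := f^{-1}\left(\frac{f(p)+f(q)}{2}\right). \tag{26}$$

These quasi-arithmetic means are also called Kolmogorov-Nagumo-de Finetti means [21, 24, 13]. Without loss of generality, we assume strictly increasing functions instead of monotonic functions since $M_{-f}=M_f$. By choosing $f(x)=x$, $f(x)=\log x$, or $f(x)=\frac{1}{x}$, we recover the Pythagorean arithmetic, geometric, and harmonic means, respectively.

Now, consider the family of power means for $x,y>0$ and $\delta\in\mathbb{R}$:

$$P_0(x,y) := \sqrt{xy},\qquad P_\delta(x,y) := \left(\frac{x^\delta+y^\delta}{2}\right)^{\frac{1}{\delta}},\ \delta\neq 0. \tag{27}$$

These means fall in the class of quasi-arithmetic means obtained for $f_\delta(x)=x^\delta$ for $\delta\neq 0$ with $f_0(x)=\log x$, and include in the limit cases the maximum and minimum values: $\lim_{\delta\to+\infty}P_\delta(x,y)=\max\{x,y\}$ and $\lim_{\delta\to-\infty}P_\delta(x,y)=\min\{x,y\}$.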

The power mean Jensen divergence [29] is defined as a special case of the $(M,N)$-Jensen divergence by:

$$J^{P_\delta}_{F}(\theta:\theta') := J^{A,P_\delta}_{F}(\theta:\theta') = P_\delta(F(\theta),F(\theta'))-F((\theta\theta')_\alpha), \tag{28}$$

for a strictly convex generator $F$.

Let us now observe that the quasiconvex difference distance is a limit case of power mean Jensen divergences:

Property 5 (${}^{\mathrm{qcvx}}J_{Q}$ as a limit case of power mean Jensen divergences)

We have

$${}^{\mathrm{qcvx}}J_{Q}(\theta:\theta') = \lim_{\delta\to\infty}J^{P_\delta}_{Q}(\theta:\theta'). \tag{29}$$

Notice that a strictly quasiconvex function is interpreted as a $(A,\max)$-strictly convex function in comparative convexity, a limit case of $(A,P_\delta)$-convexity. From now on, we term the quasiconvex difference distance the quasiconvex Jensen divergence.
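The limit behavior of the power means and of Property 5 can be observed numerically. The Python sketch below is only an illustration under our own assumptions: the function names are hypothetical, the generator is the square root on positive reals (positive-valued, so that the power means are well defined), and a moderately large exponent stands in for the limit.

```python
import math

def power_mean(x, y, delta):
    """Power mean P_delta(x, y) of Eq. 27 for positive reals x and y."""
    if delta == 0:
        return math.sqrt(x * y)
    return ((x ** delta + y ** delta) / 2.0) ** (1.0 / delta)

def power_mean_jensen(Q, t, tp, delta, alpha=0.5):
    """P_delta(Q(t), Q(t')) - Q((1 - alpha) t + alpha t'), in the spirit of Eq. 28."""
    return power_mean(Q(t), Q(tp), delta) - Q((1.0 - alpha) * t + alpha * tp)

def qcvx_jensen(Q, t, tp, alpha=0.5):
    return max(Q(t), Q(tp)) - Q((1.0 - alpha) * t + alpha * tp)

# Arithmetic, geometric and harmonic means, then near-max and near-min regimes.
for delta in (1, 0, -1, 200, -200):
    print(delta, power_mean(2.0, 8.0, delta))

# Property 5: the power mean Jensen divergence approaches the quasiconvex one.
for delta in (1, 10, 200):
    print(delta, power_mean_jensen(math.sqrt, 1.0, 9.0, delta),
          qcvx_jensen(math.sqrt, 1.0, 9.0))
```

Exponents much larger than a few hundred may overflow in double precision with this naive formula; a log-sum-exp style evaluation would be needed in that regime.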

3 Bregman divergences for quasiconvex generators

3.1 Quasiconvex Bregman divergences as limit cases of quasiconvex Jensen divergences

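Recall that for a strictly quasiconvex generator $Q$, we define the $\alpha$-skewed quasiconvex distance for $\alpha\in(0,1)$ as

$${}^{\mathrm{qcvx}}J^{\alpha}_{Q}(\theta:\theta') := \max\{Q(\theta),Q(\theta')\}-Q((\theta\theta')_\alpha). \tag{30}$$

We have

$${}^{\mathrm{qcvx}}J^{\alpha}_{Q}(\theta:\theta')\geq 0, \tag{31}$$

with equality if and only if $\theta=\theta'$. Notice that we do not require smoothness [19] of $Q$, and that ${}^{\mathrm{qcvx}}J_{Q}:={}^{\mathrm{qcvx}}J^{1/2}_{Q}$ is symmetric. For an asymmetric divergence $D(\theta:\theta')$, denote by $D^r(\theta:\theta'):=D(\theta':\theta)$ the reverse divergence.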

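Bregman divergences [5] can be interpreted as limit cases of scaled and skewed Jensen divergences [37, 26]:

\begin{align}
\lim_{\alpha\to 1^-}\frac{1}{\alpha(1-\alpha)}J^{\alpha}_{F}(\theta:\theta') &= B_F(\theta:\theta'), \tag{32}\\
\lim_{\alpha\to 0^+}\frac{1}{\alpha(1-\alpha)}J^{\alpha}_{F}(\theta:\theta') &= B^r_F(\theta:\theta')=B_F(\theta':\theta). \tag{33}
\end{align}

By analogy, let us define the following divergence:

Definition 3 (Quasiconvex Bregman pseudo-divergence)

For a strictly quasiconvex generator $Q$, we define the quasiconvex Bregman pseudo-divergence as

$${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta') := \lim_{\alpha\to 1^-}\frac{1}{\alpha(1-\alpha)}\,{}^{\mathrm{qcvx}}J^{\alpha}_{Q}(\theta:\theta'). \tag{34}$$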

As it will be shown below, we get only a pseudo-divergence in the limit case.

Theorem 1 (Formula for the quasiconvex Bregman pseudo-divergence)

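For a strictly quasiconvex and differentiable generator $Q$, the quasiconvex Bregman pseudo-divergence is

$${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta') = \begin{cases} -(\theta-\theta')^\top\nabla Q(\theta') & \text{if } Q(\theta)\leq Q(\theta'),\\ +\infty & \text{otherwise (i.e., } Q(\theta)>Q(\theta')\text{)}. \end{cases} \tag{35}$$

Proof. By definition, we have

$${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta') = \lim_{\alpha\to 1^-}\frac{1}{\alpha(1-\alpha)}\left(\max\{Q(\theta),Q(\theta')\}-Q((\theta\theta')_\alpha)\right).$$

Applying a first-order Taylor expansion to $Q((\theta\theta')_\alpha)=Q(\theta'+(1-\alpha)(\theta-\theta'))$ at $\theta'$, we get

$$Q((\theta\theta')_\alpha) \simeq_{\alpha\to 1} Q(\theta')+(1-\alpha)(\theta-\theta')^\top\nabla Q(\theta'). \tag{36}$$

Thus we have

$${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta') = \lim_{\alpha\to 1^-}\frac{1}{\alpha(1-\alpha)}\left(\max\{Q(\theta),Q(\theta')\}-Q(\theta')-(1-\alpha)(\theta-\theta')^\top\nabla Q(\theta')\right). \tag{37}$$

Consider the following two cases:

• Case $Q(\theta)\leq Q(\theta')$: That is, $\max\{Q(\theta),Q(\theta')\}=Q(\theta')$. Then it follows that

\begin{align}
{}^{\mathrm{qcvx}}B_{Q}(\theta:\theta') &= \lim_{\alpha\to 1^-}\frac{1}{\alpha(1-\alpha)}\left(-(1-\alpha)(\theta-\theta')^\top\nabla Q(\theta')\right), \tag{38}\\
&= -(\theta-\theta')^\top\nabla Q(\theta'). \tag{39}
\end{align}

• Case $Q(\theta)>Q(\theta')$: That is, $\max\{Q(\theta),Q(\theta')\}=Q(\theta)$. Then we have

$${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta') = \lim_{\alpha\to 1^-}\frac{1}{\alpha(1-\alpha)}\left(Q(\theta)-Q(\theta')-(1-\alpha)(\theta-\theta')^\top\nabla Q(\theta')\right).$$

We have that $Q(\theta)-Q(\theta')$ is finite and different from $0$ when $Q(\theta)>Q(\theta')$, so that $\frac{Q(\theta)-Q(\theta')}{\alpha(1-\alpha)}\to+\infty$ when $\alpha\to 1^-$, and therefore ${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta')=+\infty$.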

Let us now prove the axiom of non-negativity and disprove the law of the indiscernibles at inflection points for the quasiconvex Bregman pseudo-divergences.

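• Law of the indiscernibles: Clearly, ${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta)=0$ for all $\theta\in\Theta$. So consider $\theta\neq\theta'$, and ${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta')$ for $Q(\theta)\leq Q(\theta')$. It is enough to consider the 1D case, by considering the divergence restricted to the line passing through $\theta$ and $\theta'$ and intersected by the domain $\Theta$. We may have countably many inflection points $\theta'$ for which $Q'(\theta')=0$. At those inflection points, we may find $\theta\neq\theta'$ such that ${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta')=0$. Thus the quasiconvex Bregman divergence does not satisfy the law of the indiscernibles. Figure 2 displays an example of such a quasiconvex function with a few inflection points.

For example, consider the strictly quasiconvex generator $Q(\theta)=\theta^3$, with $\theta<0$ and $\theta'=0$. We have:

$${}^{\mathrm{qcvx}}J^{\alpha}_{Q}(\theta:\theta') = \max\{Q(\theta),Q(\theta')\}-Q((1-\alpha)\theta+\alpha\theta') = -(1-\alpha)^3\theta^3>0. \tag{40}$$

Defining the corresponding quasiconvex Bregman divergence by taking the limit of the scaled quasiconvex Jensen divergence yields

$${}^{\mathrm{qcvx}}B_{Q}(\theta:0) = \lim_{\alpha\to 1^-}\frac{1}{\alpha(1-\alpha)}\left(-(1-\alpha)^3\theta^3\right) = \lim_{\alpha\to 1^-}-\frac{(1-\alpha)^2}{\alpha}\theta^3 = 0. \tag{41}$$

Thus the quasiconvex Bregman divergence is only a pseudo-divergence at countably many inflection points. Section 3.2 will overcome this problem by introducing the $\delta$-averaged quasiconvex Bregman divergence.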
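The closed-form formula of Theorem 1 and the failure of the law of the indiscernibles at an inflection point can both be observed numerically. The Python sketch below is a univariate illustration under our own assumptions: the function names are hypothetical, the generator is the cubic $Q(\theta)=\theta^3$, and a value of $\alpha$ close to 1 stands in for the limit.

```python
import math

def scaled_qcvx_jensen(Q, t, tp, alpha):
    """(max{Q(t), Q(t')} - Q((1 - alpha) t + alpha t')) / (alpha (1 - alpha))."""
    interpolated = (1.0 - alpha) * t + alpha * tp
    return (max(Q(t), Q(tp)) - Q(interpolated)) / (alpha * (1.0 - alpha))

def qcvx_bregman(Q, dQ, t, tp):
    """Univariate closed form of Theorem 1 (Eq. 35)."""
    if Q(t) <= Q(tp):
        return -(t - tp) * dQ(tp)
    return math.inf

def cube(t):
    return t ** 3

def cube_prime(t):
    return 3.0 * t ** 2

alpha = 1.0 - 1e-6  # numerical proxy for the limit alpha -> 1^-

# Finite orientation: Q(1) <= Q(2).
print(scaled_qcvx_jensen(cube, 1.0, 2.0, alpha), qcvx_bregman(cube, cube_prime, 1.0, 2.0))
# Infinite orientation: Q(2) > Q(1); the scaled gap blows up as alpha -> 1.
print(scaled_qcvx_jensen(cube, 2.0, 1.0, alpha), qcvx_bregman(cube, cube_prime, 2.0, 1.0))
# Inflection point theta' = 0: the pseudo-divergence vanishes although theta != theta'.
print(scaled_qcvx_jensen(cube, -1.0, 0.0, alpha), qcvx_bregman(cube, cube_prime, -1.0, 0.0))
```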

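• Non-negativity follows from a classic theorem of quasiconvex analysis which reports a first-order condition for a function to be quasiconvex: A differentiable function $Q$ is quasiconvex iff the following property holds (see Theorem 21.14 of [35] and §3.4.3 of [8]):

$$Q(\theta')\geq Q(\theta)\Rightarrow\nabla Q(\theta')^\top(\theta-\theta')\leq 0. \tag{42}$$

That is equivalent to $-(\theta-\theta')^\top\nabla Q(\theta')\geq 0$ or ${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta')\geq 0$.

(This first-order condition is analogous to the classic second-order condition for a twice differentiable function to be strictly convex: To have its Hessian positive-definite (Alexandrov’s theorem). Similarly, the first-order condition for convexity states that a differentiable function $F$ with convex domain is convex iff $F(\theta)\geq F(\theta')+(\theta-\theta')^\top\nabla F(\theta')$, from which we recover the Bregman divergence: $B_F(\theta:\theta')=F(\theta)-F(\theta')-(\theta-\theta')^\top\nabla F(\theta')\geq 0$.)

Notice that when $Q=F$ is strictly convex and differentiable, then the property also follows from the non-negativity of the corresponding Bregman divergence $B_F$ and $F(\theta')\geq F(\theta)$:

\begin{align}
F(\theta)-F(\theta')-(\theta-\theta')^\top\nabla F(\theta') &\geq 0, \tag{43}\\
\underbrace{-(\theta-\theta')^\top\nabla F(\theta')}_{{}^{\mathrm{qcvx}}B_{F}(\theta:\theta')} &\geq F(\theta')-F(\theta)\geq 0. \tag{44}
\end{align}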

Notice that when . Figure 3 illustrates the quasiconvex Bregman divergence for a strictly quasiconvex generator which is strictly concave and has no inflection point.

An interesting property is that if for then necessarily , and vice-versa (when both parameters are not at inflection points). The forward and reverse quasiconvex Bregman pseudo-divergences are both finite only when and then we have or when one parameter is an inflection point.

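Moreover, we have the following decomposition for a quasiconvex function $Q$:

$${}^{e}B_{Q}(\theta:\theta') = Q(\theta)-Q(\theta')+{}^{\mathrm{qcvx}}B_{Q}(\theta:\theta'), \tag{45}$$

when $Q(\theta)\leq Q(\theta')$, where ${}^{e}B_{Q}(\theta:\theta'):=Q(\theta)-Q(\theta')-(\theta-\theta')^\top\nabla Q(\theta')$ stands for the extended Bregman divergence, i.e., the Bregman divergence extended to a quasiconvex generator.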

Remark 2 (Separability/non-separability of generators and divergences)

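When the $D$-dimensional generator $Q$ is separable, i.e., $Q(\theta)=\sum_{i=1}^{D}Q_i(\theta_i)$ where $\theta=(\theta_1,\ldots,\theta_D)$ and the $Q_i$’s are differentiable and quasiconvex univariate functions, the quasiconvex Bregman divergence rewrites as

$${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta') = \begin{cases} -\sum_{i=1}^{D}(\theta_i-\theta'_i)\,Q'_i(\theta'_i) & \text{if } Q(\theta)\leq Q(\theta'),\\ +\infty & \text{otherwise } (Q(\theta)>Q(\theta')). \end{cases} \tag{46}$$

Notice that the condition for the quasiconvex Bregman divergence to be infinite is $Q(\theta)>Q(\theta')$, and not that there exists one index $i$ such that $Q_i(\theta_i)>Q_i(\theta'_i)$. Thus, in general, we have ${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta')\neq\sum_{i=1}^{D}{}^{\mathrm{qcvx}}B_{Q_i}(\theta_i:\theta'_i)$. This contrasts with Bregman divergences, for which the separability of the generator yields the separability of the divergence: $B_F(\theta:\theta')=\sum_{i=1}^{D}B_{F_i}(\theta_i:\theta'_i)$.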

3.2 The δ-averaged quasiconvex Bregman divergence

We shall overcome the problem of indiscernibility for quasiconvex Bregman pseudo-divergences, which read in the univariate case:

$${}^{\mathrm{qcvx}}B_{Q}(\theta:\theta') = (\theta'-\theta)\,Q'(\theta')\quad\text{for } Q(\theta')\geq Q(\theta). \tag{47}$$

Since the number of inflection points is at most countable for a strictly quasiconvex generator $Q$, the function $Q'$ can only vanish on a set of null measure. We propose to integrate over a neighborhood of the parameters to obtain a strictly positive divergence when $\theta\neq\theta'$.

Given a prescribed parameter $\delta$, we introduce the $\delta$-averaged quasiconvex Bregman divergence via the following definition:

$${}^{\mathrm{qcvx}}B^{\delta}_{Q}(\theta,\theta') := \frac{1}{\delta}\int_0^{\delta}{}^{\mathrm{qcvx}}B_{Q}(\theta+u:\theta'+u)\,\mathrm{d}u. \tag{48}$$

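Choosing $\delta$ to be a strictly positive multiple of $\theta'-\theta$ ensures that this integral is always finite when $Q(\theta)\leq Q(\theta')$, since $Q(\theta+u)\leq Q(\theta'+u)$ for $u\in[0,\delta]$, where $[0,\delta]$ denotes the interval with endpoints $0$ and $\delta$.

We now prove this claim. For all $u\in[0,\delta]$ with $u\neq 0$, the point $\theta'$ lies strictly between $\theta$ and $\theta'+u$, so that

$$Q(\theta') < \max\{Q(\theta),Q(\theta'+u)\} = Q(\theta'+u),$$

where the maximum cannot be $Q(\theta)$ since $Q(\theta)\leq Q(\theta')$. Similarly, $\theta+u$ lies between $\theta$ and $\theta'$, or between $\theta'$ and $\theta'+u$. In the first case, if $\theta+u\neq\theta'$ we have

$$Q(\theta+u) < \max\{Q(\theta),Q(\theta')\} = Q(\theta') < Q(\theta'+u).$$

In the second case, $\theta+u$ lies strictly between $\theta'$ and $\theta'+u$ (because $\theta\neq\theta'$), and we obtain

$$Q(\theta+u) < \max\{Q(\theta'),Q(\theta'+u)\} = Q(\theta'+u),$$

proving the claim.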

By construction, this $\delta$-averaged quasiconvex Bregman divergence now satisfies the law of the indiscernibles.

When $Q$ is differentiable, we obtain:

$${}^{\mathrm{qcvx}}B^{\delta}_{Q}(\theta,\theta') = \frac{1}{\delta}\int_0^{\delta}(\theta'-\theta)\,Q'(\theta'+u)\,\mathrm{d}u = (\theta'-\theta)\left(\frac{Q(\theta'+\delta)-Q(\theta')}{\delta}\right). \tag{49}$$

We note that the right-hand side of (49) can also serve as the definition of the divergence, even when the strictly quasiconvex function $Q$ is not differentiable. This motivates us to introduce the next definition, where we now denote by $\delta$ the positive ratio between the parameter $\delta$ used above and $\theta'-\theta$.

Definition 4 (δ-averaged quasiconvex Bregman divergence)

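For a prescribed $\delta>0$ and a strictly quasiconvex generator $Q$ not necessarily differentiable, the $\delta$-averaged quasiconvex Bregman divergence is defined by

$${}^{\mathrm{qcvx}}B^{\delta}_{Q}(\theta:\theta') := \begin{cases} \dfrac{Q(\theta'+\delta(\theta'-\theta))-Q(\theta')}{\delta} & \text{if } Q(\theta)\leq Q(\theta'),\\ +\infty & \text{otherwise}. \end{cases} \tag{50}$$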

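Let us report some examples of $\delta$-averaged quasiconvex Bregman divergences:

• $Q(\theta)=\theta$.

$${}^{\mathrm{qcvx}}B^{\delta}_{Q}(\theta:\theta') = \frac{(1+\delta)\theta'-\delta\theta-\theta'}{\delta} = \theta'-\theta,$$

when $\theta\leq\theta'$, or $+\infty$ otherwise.

• $Q(\theta)=\theta^2$.

$${}^{\mathrm{qcvx}}B^{\delta}_{Q}(\theta:\theta') = 2\theta'(\theta'-\theta)+\delta(\theta'-\theta)^2,$$

when $\theta^2\leq\theta'^2$, or $+\infty$ otherwise.

• $Q(\theta)=\theta^3$.

$${}^{\mathrm{qcvx}}B^{\delta}_{Q}(\theta:\theta') = 3\theta'^2(\theta'-\theta)+3\theta'\delta(\theta'-\theta)^2+\delta^2(\theta'-\theta)^3,$$

when $\theta\leq\theta'$, or $+\infty$ otherwise. At the inflection point $\theta'=0$, we now have

$${}^{\mathrm{qcvx}}B^{\delta}_{Q}(\theta:0) = -\delta^2\theta^3>0\quad\forall\theta<0.$$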
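The three closed-form examples above can be cross-checked with a few lines of Python. The sketch assumes that the $\delta$-averaged divergence takes the finite-difference form $\left(Q(\theta'+\delta(\theta'-\theta))-Q(\theta')\right)/\delta$ when $Q(\theta)\leq Q(\theta')$ and $+\infty$ otherwise, which is the form consistent with these examples; the function name and the test values are our own.

```python
import math

def averaged_qcvx_bregman(Q, t, tp, delta):
    """Assumed delta-averaged quasiconvex Bregman divergence (univariate case).

    Finite-difference form (Q(t' + delta (t' - t)) - Q(t')) / delta when
    Q(t) <= Q(t'), and +infinity otherwise. No derivative of Q is needed.
    """
    if Q(t) > Q(tp):
        return math.inf
    return (Q(tp + delta * (tp - t)) - Q(tp)) / delta

t, tp, delta = 1.0, 3.0, 0.5
d = tp - t

linear = averaged_qcvx_bregman(lambda x: x, t, tp, delta)
square = averaged_qcvx_bregman(lambda x: x ** 2, t, tp, delta)
cubic = averaged_qcvx_bregman(lambda x: x ** 3, t, tp, delta)

print(linear, d)
print(square, 2 * tp * d + delta * d ** 2)
print(cubic, 3 * tp ** 2 * d + 3 * tp * delta * d ** 2 + delta ** 2 * d ** 3)

# Inflection point theta' = 0 of the cubic: the averaged divergence is now positive.
at_inflection = averaged_qcvx_bregman(lambda x: x ** 3, -2.0, 0.0, delta)
print(at_inflection, -delta ** 2 * (-2.0) ** 3)
```

Each printed pair should coincide, and the reverse orientation (swapping the two arguments in the first three calls) returns infinity.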

3.3 Quasiconvex Bregman divergences as limit cases of power mean Bregman divergences

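For the sake of simplicity, consider scalar divergences below. In [29], the $(M,N)$-Bregman divergence is defined as the limit case:

$$B^{M,N}_{F}(p:q) = \lim_{\alpha\to 1^-}\frac{1}{\alpha(1-\alpha)}J^{M,N}_{F,\alpha}(p:q) = \lim_{\alpha\to 1^-}\frac{1}{\alpha(1-\alpha)}\left(N_\alpha(F(p),F(q))-F(M_\alpha(p,q))\right). \tag{51}$$

In particular, the univariate power mean Bregman divergences are obtained by taking the power means, yielding the following formula:

$$B^{\delta_1,\delta_2}_{F}(p:q) = \frac{F^{\delta_2}(p)-F^{\delta_2}(q)}{\delta_2F^{\delta_2-1}(q)}-\frac{p^{\delta_1}-q^{\delta_1}}{\delta_1q^{\delta_1-1}}F'(q). \tag{52}$$

Let $\delta_1=1$ and $\delta_2=r$. Then we get the subfamily of $r$-power Bregman divergences (here the superscript $r$ denotes the power, not the reverse divergence):

\begin{align}
B^{r}_{F}(\theta:\theta') &= \frac{F^{r}(\theta)-F^{r}(\theta')}{rF^{r-1}(\theta')}-(\theta-\theta')F'(\theta'), \tag{53}\\
&= \frac{F^{r}(\theta)}{rF^{r-1}(\theta')}-\frac{F(\theta')}{r}-(\theta-\theta')F'(\theta'). \tag{54}
\end{align}

In Eq. 54, when $F(\theta)>F(\theta')>0$ then we have $\lim_{r\to+\infty}B^{r}_{F}(\theta:\theta')=+\infty$ since $\frac{F^{r}(\theta)}{rF^{r-1}(\theta')}=\frac{F(\theta')}{r}\left(\frac{F(\theta)}{F(\theta')}\right)^{r}$ diverges. Otherwise $\lim_{r\to+\infty}B^{r}_{F}(\theta:\theta')=-(\theta-\theta')F'(\theta')$ since $\frac{F^{r}(\theta)}{rF^{r-1}(\theta')}\to 0$ (because $\frac{F(\theta)}{F(\theta')}\leq 1$).

When $r\to+\infty$, the power mean operator tends to the maximum operator: $\lim_{r\to+\infty}P_r(x,y)=\max\{x,y\}$, and the $r$-power Bregman divergence tends to the quasiconvex Bregman pseudo-divergence.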
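The convergence of the $r$-power Bregman divergence towards the quasiconvex Bregman pseudo-divergence can be illustrated numerically. In the Python sketch below, the function name, the positive generator $F(\theta)=\theta^2+1$ and the finite exponent $r=200$ are our own illustrative assumptions. The first term of Eq. 53 is evaluated in the algebraically equivalent form $\frac{F(\theta')}{r}\left(\left(\frac{F(\theta)}{F(\theta')}\right)^{r}-1\right)$, which requires $F>0$.

```python
def power_bregman(F, dF, t, tp, r):
    """r-power Bregman divergence of Eq. 53 for a positive-valued generator F.

    The first term (F(t)^r - F(t')^r) / (r F(t')^(r-1)) is computed as
    (F(t') / r) * ((F(t) / F(t'))^r - 1), an equivalent form for F > 0.
    """
    ratio = F(t) / F(tp)
    return (F(tp) / r) * (ratio ** r - 1.0) - (t - tp) * dF(tp)

def F(t):
    return t * t + 1.0   # hypothetical strictly convex, positive generator

def dF(t):
    return 2.0 * t

# r = 1 recovers the ordinary Bregman divergence, here (t - t')^2.
print(power_bregman(F, dF, 1.0, 2.0, 1))
# F(1) <= F(2): the value approaches -(t - t') F'(t') = 4 as r grows.
print(power_bregman(F, dF, 1.0, 2.0, 200))
# F(2) > F(1): the value grows without bound as r grows.
print(power_bregman(F, dF, 2.0, 1.0, 200))
```

Much larger values of $r$ would overflow double precision in the diverging orientation, so the exponent is kept moderate here.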

3.4 Some illustrating examples of quasiconvex Bregman divergences

We concisely report two univariate quasiconvex scalar Bregman divergences:

• For with , we have

 qcvxJαQ(θ:θ