The Bregman chord divergence

Abstract

Distances are fundamental primitives whose choice significantly impacts the performance of algorithms in machine learning and signal processing. However, selecting the most appropriate distance for a given task is a difficult endeavor. Instead of testing one by one the entries of an ever-expanding dictionary of ad hoc distances, one would rather consider parametric classes of distances that are exhaustively characterized by axioms derived from first principles. Bregman divergences are such a class. However, fine-tuning a Bregman divergence is delicate since it requires smoothly adjusting a functional generator. In this work, we propose an extension of Bregman divergences called the Bregman chord divergences. This new class of distances does not require gradient calculations, uses two scalar parameters that can be easily tailored in applications, and asymptotically generalizes Bregman divergences.

Keywords: Bregman divergence, Jensen divergence, skewed divergence, clustering, information fusion.

1 Introduction

Dissimilarities (or distances) are at the heart of many signal processing tasks [13, 6], and the performance of algorithms solving those tasks heavily depends on the chosen distances. A dissimilarity $D(O_1:O_2)$ between two objects $O_1$ and $O_2$ belonging to a space $\mathcal{O}$ (e.g., vectors, matrices, probability densities, random variables, etc.) is a function $D$ such that $D(O_1:O_2)\geq 0$ with equality if and only if $O_1=O_2$. Since a dissimilarity may not be symmetric (i.e., an oriented dissimilarity with $D(O_1:O_2)\neq D(O_2:O_1)$), we emphasize this fact using the notation ’:’ (see footnote 1). The reverse dissimilarity or dual dissimilarity is defined by

 $$D^*(O_1:O_2) := D(O_2:O_1), \tag{1}$$

and satisfies the involutive property: $(D^*)^*=D$. When a symmetric dissimilarity further satisfies the triangle inequality

 $$D(O_1,O_2)+D(O_2,O_3)\geq D(O_1,O_3),\quad \forall O_1,O_2,O_3\in\mathcal{O}, \tag{2}$$

it is called a metric distance.

Historically, many ad hoc distances have been proposed and empirically benchmarked on different tasks in order to improve state-of-the-art performance. However, finding the most appropriate distance for a given task is often a difficult endeavor. Thus principled classes of distances (see footnote 2) have been proposed and studied. Among those generic classes of distances, three main types have emerged:

• The Bregman divergences [7, 5] defined for a strictly convex and differentiable generator $F$:

 $$B_F(\theta_1:\theta_2) := F(\theta_1)-F(\theta_2)-(\theta_1-\theta_2)^\top\nabla F(\theta_2), \tag{3}$$

measure the dissimilarity between parameters $\theta_1,\theta_2\in\Theta$. We use the term “divergence” (rooted in information geometry) instead of distance to emphasize the smoothness property (see footnote 3) of the distance. The dual Bregman divergence $B_F^*$ is obtained from the Bregman divergence induced by the Legendre convex conjugate $F^*$:

 $$B_F^*(\theta_1:\theta_2) := B_F(\theta_2:\theta_1) = B_{F^*}(\nabla F(\theta_1):\nabla F(\theta_2)), \tag{4}$$

where the Legendre-Fenchel transformation is defined by:

 $$F^*(\eta) = \sup_{\theta\in\Theta}\{\theta^\top\eta-F(\theta)\}. \tag{5}$$
• The Csiszár $f$-divergences [1, 11] defined for a convex generator $f$ satisfying $f(1)=0$:

 $$I_f[p_1:p_2] := \int_{\mathcal{X}} p_1(x)\, f\!\left(\frac{p_2(x)}{p_1(x)}\right)\mathrm{d}\mu(x), \tag{6}$$

measure the dissimilarity between probability densities $p_1$ and $p_2$ that are absolutely continuous with respect to a base measure $\mu$ (defined on a support $\mathcal{X}$). A scalar divergence is a divergence acting on scalar parameters, i.e., a 1D divergence. A separable divergence is a divergence that can be written as a sum of elementary scalar divergences. The $f$-divergences are separable divergences since we have:

 $$I_f[p:q] = \int i_f[p(x):q(x)]\,\mathrm{d}\mu(x), \tag{7}$$

with the scalar $f$-divergence $i_f[a:b] := a\, f(b/a)$.

The dual $f$-divergence is obtained for the generator $f^\diamond(u) := u\, f(1/u)$ (diamond $f$-generator) as follows:

 $$I_f^*[p:q] := I_f[q:p] = I_{f^\diamond}[p:q]. \tag{8}$$

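We may $J$-symmetrize (see footnote 4) an $f$-divergence by defining its generator $f^\circ$:

 $$J_f[p:q] = \frac{1}{2}\left(I_f[p:q]+I_f[q:p]\right) \tag{9}$$
 $$= I_{f^\circ}[p,q], \tag{10}$$

with

 $$f^\circ(u) := \frac{1}{2}\left(f(u)+f^\diamond(u)\right).$$

Alternatively, we may JS-symmetrize (see footnote 5) the $f$-divergence by using the following generator $f^\bullet$, obtained by substituting $u=q(x)/p(x)$ in Eq. 6:

 $$\mathrm{JS}_f[p:q] := \frac{1}{2}\left(I_f\!\left[p:\frac{p+q}{2}\right]+I_f\!\left[q:\frac{p+q}{2}\right]\right) \tag{11}$$
 $$= I_{f^\bullet}[p,q], \tag{12}$$
 $$f^\bullet(u) := \frac{1}{2}\left(f\!\left(\frac{1+u}{2}\right)+u\, f\!\left(\frac{1+u}{2u}\right)\right). \tag{13}$$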
• The Burbea-Rao divergences, also called Jensen divergences because they rely on Jensen's inequality for a strictly convex function $F$:

 $$J_F(\theta_1,\theta_2) := \frac{F(\theta_1)+F(\theta_2)}{2}-F\!\left(\frac{\theta_1+\theta_2}{2}\right)\geq 0. \tag{14}$$

We note in passing that Bregman divergences can be extended to strictly convex and non-differentiable generators as well [18, 27].
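As a rough numerical illustration of the three classes above, the following Python sketch evaluates a scalar Bregman divergence (Eq. 3), a discrete Csiszár divergence (Eq. 6, assuming the counting measure as base measure), and a Jensen divergence (Eq. 14). The function names and the example generators are illustrative choices of this sketch, not part of the original work.

```python
import math

def bregman(F, dF, t1, t2):
    """Scalar Bregman divergence B_F(t1:t2) of Eq. 3 (dF is the derivative of F)."""
    return F(t1) - F(t2) - (t1 - t2) * dF(t2)

def csiszar(f, p, q):
    """Discrete f-divergence I_f[p:q] = sum_i p_i f(q_i / p_i), assuming positive entries."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

def jensen(F, t1, t2):
    """Jensen (Burbea-Rao) divergence J_F(t1, t2) of Eq. 14."""
    return (F(t1) + F(t2)) / 2.0 - F((t1 + t2) / 2.0)

# Example generators (assumptions of this sketch).
def F(x):
    return x * math.log(x) - x      # strictly convex on x > 0

def dF(x):
    return math.log(x)              # derivative of F

def f_kl(u):
    return -math.log(u)             # convex with f(1) = 0; yields the KL divergence

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print("I_f[p:q]  =", csiszar(f_kl, p, q), "direct KL =", kl_direct)
print("B_F(2:3)  =", bregman(F, dF, 2.0, 3.0))
print("J_F(2, 3) =", jensen(F, 2.0, 3.0))
```

With the generator $f(u)=-\log u$, the discrete Csiszár divergence coincides with the Kullback-Leibler divergence, which the sketch checks against a direct summation.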

These three fundamental classes of distances are not mutually exclusive, and their pairwise intersections (e.g., the distances that are both Bregman divergences and Csiszár $f$-divergences) have been studied in [26, 2, 16]. The ’:’ notation between arguments of distances emphasizes the potential asymmetry of distances (oriented distances with $D(\theta_1:\theta_2)\neq D(\theta_2:\theta_1)$), and the brackets surrounding distance arguments indicate that it is a statistical distance between probability densities, and not a distance between parameters. Using these notations, we express the Kullback-Leibler (KL) distance as:

 $$\mathrm{KL}[p_1:p_2] := \int p_1(x)\log\frac{p_1(x)}{p_2(x)}\,\mathrm{d}\mu(x). \tag{15}$$

The KL distance between two members $p_{\theta_1}$ and $p_{\theta_2}$ of a parametric family $\mathcal{F}$ of distributions amounts to a parameter divergence:

 $$\mathrm{KL}_{\mathcal{F}}(\theta_1:\theta_2) := \mathrm{KL}[p_{\theta_1}:p_{\theta_2}]. \tag{16}$$

For example, the KL statistical distance between two probability densities belonging to the same exponential family or the same mixture family amounts to a (parameter) Bregman divergence [3, 25]. When $p_1$ and $p_2$ are finite discrete distributions of the $d$-dimensional probability simplex $\Delta_d$, we have $\mathrm{KL}_{\Delta_d}(p_1:p_2)=\mathrm{KL}[p_1:p_2]$. This explains why we can sometimes loosely handle distances between discrete distributions as both a parameter distance and a statistical distance. For example, the KL distance between two discrete distributions is a Bregman divergence $B_F$ for $F(x)=\sum_i x_i\log x_i$ (Shannon negentropy) with $x\in\Delta_d$. Extending $\Delta_d$ to positive measures, this Bregman divergence yields the extended KL distance: $\mathrm{eKL}[p:q]=\sum_i\left(p_i\log\frac{p_i}{q_i}+q_i-p_i\right)$.
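The link between the KL distance and the Bregman divergence of the Shannon negentropy can be checked numerically. The sketch below assumes the multivariate form of Eq. 3 with the generator $F(x)=\sum_i x_i\log x_i$; the helper names and the test vectors are hypothetical.

```python
import math

def negentropy(x):
    """Shannon negentropy F(x) = sum_i x_i log x_i (positive entries assumed)."""
    return sum(xi * math.log(xi) for xi in x)

def grad_negentropy(x):
    """Gradient of the Shannon negentropy: (log x_i + 1)_i."""
    return [math.log(xi) + 1.0 for xi in x]

def bregman(F, gradF, t1, t2):
    """Multivariate Bregman divergence B_F(t1:t2) of Eq. 3."""
    g = gradF(t2)
    inner = sum((a - b) * gi for a, b, gi in zip(t1, t2, g))
    return F(t1) - F(t2) - inner

def kl(p, q):
    """Discrete Kullback-Leibler distance between normalized distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def extended_kl(p, q):
    """Extended KL distance between positive (unnormalized) measures."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

p, q = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]   # points of the probability simplex
m1, m2 = [1.0, 2.0, 3.0], [2.0, 1.0, 4.0] # positive measures

print(bregman(negentropy, grad_negentropy, p, q), kl(p, q))
print(bregman(negentropy, grad_negentropy, m1, m2), extended_kl(m1, m2))
```

On the simplex the extra term $\sum_i (q_i-p_i)$ vanishes, so the same Bregman divergence reduces to the ordinary KL distance.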

Whenever using a functionally parameterized distance in applications, we need to choose the most appropriate functional generator, ideally from first principles [12, 4, 3]. For example, Non-negative Matrix Factorization (NMF) for audio source separation or music transcription from the signal power spectrogram can be done by selecting the Itakura-Saito divergence (a Bregman divergence for the Burg negentropy $F(x)=-\sum_i\log x_i$) that satisfies the requirement of being scale invariant: $B_F(\lambda\theta:\lambda\theta')=B_F(\theta:\theta')$ for any $\lambda>0$. When no such first principles can be easily stated for a task, we are left with choosing a generator manually or by cross-validation. Notice that a convex combination of Csiszár generators is a Csiszár generator (and likewise for Bregman generators): $\sum_i\lambda_i I_{f_i}=I_{\sum_i\lambda_i f_i}$ for $\lambda$ belonging to the standard simplex. Thus in practice, we could choose a base of generators and learn the best distance weighting (by analogy to feature weighting). However, in doing so, we are left with the problem of choosing the base generators, and moreover we need to sum up different distances, which raises the problem of properly adding distance units. Thus in applications, it is often preferable to consider a smooth family of generators parameterized by scalars (e.g., $\alpha$-divergences or $\beta$-divergences), and then finely tune these scalars.
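The two properties used in this paragraph, scale invariance of the Itakura-Saito divergence and linearity of Bregman divergences in their generator, can be illustrated in the scalar case. The sketch below assumes the scalar Burg generator $F(x)=-\log x$; the helper names (`mix`, `itakura_saito`) are hypothetical.

```python
import math

def bregman(F, dF, t1, t2):
    """Scalar Bregman divergence B_F(t1:t2)."""
    return F(t1) - F(t2) - (t1 - t2) * dF(t2)

def burg(x):
    return -math.log(x)          # scalar Burg negentropy

def d_burg(x):
    return -1.0 / x

def quad(x):
    return x * x

def d_quad(x):
    return 2.0 * x

def itakura_saito(x, y):
    """Closed form of the scalar Itakura-Saito divergence."""
    return x / y - math.log(x / y) - 1.0

def mix(G1, G2, w):
    """Convex combination w*G1 + (1-w)*G2 of two scalar functions."""
    return lambda x: w * G1(x) + (1.0 - w) * G2(x)

x, y, lam, w = 2.0, 5.0, 3.0, 0.25
print(bregman(burg, d_burg, x, y), itakura_saito(x, y))
print(bregman(burg, d_burg, lam * x, lam * y))   # unchanged by rescaling

# Divergence of the mixed generator versus mixture of the divergences.
lhs = bregman(mix(burg, quad, w), mix(d_burg, d_quad, w), x, y)
rhs = w * bregman(burg, d_burg, x, y) + (1.0 - w) * bregman(quad, d_quad, x, y)
print(lhs, rhs)
```

The last comparison also shows the "distance units" issue raised above: the squared Euclidean term is not scale invariant, so the mixture loses the invariance of its Itakura-Saito component.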

In this work, we propose a novel class of distances, termed Bregman chord divergences. Bregman chord divergences are parameterized by two scalar parameters, which makes them easy to fine-tune in applications, and they asymptotically match the ordinary Bregman divergences.

The paper is organized as follows: In §2, we describe the skewed Jensen divergence, show how to biskew any distance by using two scalars, and report on the Jensen chord divergence. In §3, we introduce the univariate Bregman chord divergence, and in §4 we extend its definition to the multivariate case. Finally, we conclude in §5.

2 Geometric design of skewed divergences from graph plots

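We can geometrically design divergences from convexity gap properties of the plot of the generator. For example, the Jensen divergence $J_F(\theta_1,\theta_2)$ of Eq. 14 is visualized as the ordinate (vertical) gap between the midpoint of the line segment $[(\theta_1,F(\theta_1));(\theta_2,F(\theta_2))]$ and the point $\left(\frac{\theta_1+\theta_2}{2},F\!\left(\frac{\theta_1+\theta_2}{2}\right)\right)$. The non-negativity property of the Jensen divergence follows from Jensen's midpoint convex inequality. Instead of taking the midpoint $\frac{\theta_1+\theta_2}{2}$, we can take any interior point $(\theta_1\theta_2)_\alpha := (1-\alpha)\theta_1+\alpha\theta_2$, and get the skewed $\alpha$-Jensen divergence (for any $\alpha\in(0,1)$):

 $$J_F^{\alpha}(\theta_1:\theta_2) := (F(\theta_1)F(\theta_2))_\alpha-F((\theta_1\theta_2)_\alpha)\geq 0. \tag{17}$$

A remarkable fact is that the scaled $\alpha$-Jensen divergence $\frac{1}{\alpha}J_F^{\alpha}(\theta_1:\theta_2)$ tends asymptotically to the reverse Bregman divergence $B_F(\theta_2:\theta_1)$ when $\alpha\to 0$, see [29, 23]. Notice that the Jensen divergences can be interpreted as Jensen-Shannon-type symmetrization of Bregman divergences: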

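 $$J_F(\theta_1:\theta_2) = \frac{1}{2}\left(B_F\!\left(\theta_1:\frac{\theta_1+\theta_2}{2}\right)+B_F\!\left(\theta_2:\frac{\theta_1+\theta_2}{2}\right)\right), \tag{18}$$

and more generally, we have the skewed Jensen-Bregman divergences:

 $$\mathrm{JB}_F^{\alpha}(\theta:\theta') := (1-\alpha)B_F(\theta:(\theta\theta')_\alpha)+\alpha B_F(\theta':(\theta\theta')_\alpha). \tag{19}$$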
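A small numerical sketch may help relate Eqs. 17 and 19. It assumes the interpolation convention $(ab)_w=(1-w)a+wb$ and the scalar generator $F=\exp$ (an arbitrary choice of this sketch). It checks that the skewed Jensen-Bregman divergence coincides with the skewed Jensen divergence, since the gradient terms cancel at the interpolated point, and that the scaled $\alpha$-Jensen divergence approaches the reverse Bregman divergence for small $\alpha$.

```python
import math

def interp(a, b, w):
    """(a b)_w := (1 - w) a + w b."""
    return (1.0 - w) * a + w * b

def bregman(F, dF, t1, t2):
    """Scalar Bregman divergence B_F(t1:t2)."""
    return F(t1) - F(t2) - (t1 - t2) * dF(t2)

def skew_jensen(F, t1, t2, alpha):
    """Skewed alpha-Jensen divergence of Eq. 17."""
    return interp(F(t1), F(t2), alpha) - F(interp(t1, t2, alpha))

def skew_jensen_bregman(F, dF, t1, t2, alpha):
    """Skewed Jensen-Bregman divergence of Eq. 19."""
    m = interp(t1, t2, alpha)
    return (1.0 - alpha) * bregman(F, dF, t1, m) + alpha * bregman(F, dF, t2, m)

F, dF = math.exp, math.exp      # example generator (assumption of this sketch)
t1, t2 = 0.3, 1.1

print(skew_jensen(F, t1, t2, 0.3), skew_jensen_bregman(F, dF, t1, t2, 0.3))

eps = 1e-5                      # small skew parameter
print(skew_jensen(F, t1, t2, eps) / eps, bregman(F, dF, t2, t1))
```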

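By measuring the ordinate gap between two non-crossing upper and lower chords anchored at the generator graph plot, we can extend the $\alpha$-Jensen divergences to a tri-parametric family of Jensen chord divergences:

 $$J_F^{\alpha,\beta,\gamma}(\theta:\theta') := (F(\theta)F(\theta'))_\gamma-\left(F((\theta\theta')_\alpha)F((\theta\theta')_\beta)\right)_{\frac{\gamma-\alpha}{\beta-\alpha}}, \tag{20}$$

with $\alpha,\beta\in[0,1]$ and $\gamma\in[\alpha,\beta]$. The $\alpha$-Jensen divergence is recovered when $\alpha=\beta=\gamma$.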
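The Jensen chord divergence of Eq. 20 can be sketched in a few lines. The code assumes $\alpha<\beta$ (so that the weight $\frac{\gamma-\alpha}{\beta-\alpha}$ is well defined) and uses $F=\exp$ as an arbitrary example generator; the function names are hypothetical.

```python
import math

def interp(a, b, w):
    """(a b)_w := (1 - w) a + w b."""
    return (1.0 - w) * a + w * b

def skew_jensen(F, t1, t2, alpha):
    """Skewed alpha-Jensen divergence (Eq. 17)."""
    return interp(F(t1), F(t2), alpha) - F(interp(t1, t2, alpha))

def jensen_chord(F, t1, t2, alpha, beta, gamma):
    """Jensen chord divergence (Eq. 20): gap at (t1 t2)_gamma between the upper
    chord [t1, t2] and the lower chord anchored at (t1 t2)_alpha, (t1 t2)_beta.
    Requires alpha != beta."""
    upper = interp(F(t1), F(t2), gamma)
    w = (gamma - alpha) / (beta - alpha)
    lower = interp(F(interp(t1, t2, alpha)), F(interp(t1, t2, beta)), w)
    return upper - lower

F = math.exp
print(jensen_chord(F, 0.0, 2.0, 0.2, 0.9, 0.5))   # generic case
print(jensen_chord(F, 0.0, 2.0, 0.0, 1.0, 0.5))   # both chords coincide
# With gamma = alpha the lower chord is evaluated at its first anchor point,
# so the gap equals the skewed alpha-Jensen divergence.
print(jensen_chord(F, 0.0, 2.0, 0.3, 0.8, 0.3), skew_jensen(F, 0.0, 2.0, 0.3))
```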

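For any given distance $D(\theta_1:\theta_2)$ (with convex parameter space $\Theta$), we can biskew the distance by considering two scalars $\gamma,\delta\in\mathbb{R}$ (with $\delta\neq\gamma$) as:

 $$D_{\gamma,\delta}(\theta_1:\theta_2) := D((\theta_1\theta_2)_\gamma:(\theta_1\theta_2)_\delta). \tag{21}$$

Clearly, $(\theta_1\theta_2)_\gamma=(\theta_1\theta_2)_\delta$ iff $(\delta-\gamma)(\theta_1-\theta_2)=0$. That is, if (i) $\theta_1=\theta_2$ or if (ii) $\delta=\gamma$. Since by definition $\delta\neq\gamma$, we have $D_{\gamma,\delta}(\theta_1:\theta_2)=0$ iff $\theta_1=\theta_2$. Notice that both $(\theta_1\theta_2)_\gamma$ and $(\theta_1\theta_2)_\delta$ should belong to the parameter space $\Theta$. A sufficient condition is to ensure that $\gamma,\delta\in[0,1]$ so that both $(\theta_1\theta_2)_\gamma\in\Theta$ and $(\theta_1\theta_2)_\delta\in\Theta$. When $\Theta=\mathbb{R}^d$, we may further consider any $\gamma,\delta\in\mathbb{R}$.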
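The biskewing construction of Eq. 21 is a simple higher-order transformation of a distance. The sketch below assumes scalar parameters and uses the squared Euclidean distance as base distance; `biskew` and `sq_euclid` are hypothetical names.

```python
def interp(a, b, w):
    """(a b)_w := (1 - w) a + w b."""
    return (1.0 - w) * a + w * b

def biskew(D, gamma, delta):
    """Return the biskewed distance D_{gamma,delta} of Eq. 21 (gamma != delta)."""
    if gamma == delta:
        raise ValueError("gamma and delta must differ")
    def D_gd(t1, t2):
        return D(interp(t1, t2, gamma), interp(t1, t2, delta))
    return D_gd

def sq_euclid(a, b):
    """Squared Euclidean distance between scalars (example base distance)."""
    return (a - b) ** 2

D_gd = biskew(sq_euclid, 0.25, 0.75)
print(D_gd(1.0, 3.0))   # distance between the interpolated points 1.5 and 2.5
print(D_gd(2.0, 2.0))   # vanishes when both arguments coincide
```

For this base distance the biskewed distance equals $(\delta-\gamma)^2(\theta_1-\theta_2)^2$, in line with the identity $(\theta_1\theta_2)_\gamma-(\theta_1\theta_2)_\delta=(\delta-\gamma)(\theta_1-\theta_2)$.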

3 The scalar Bregman chord divergence

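Let $F:\Theta\to\mathbb{R}$ be a univariate Bregman generator with open convex domain $\Theta$, and denote by $\mathcal{F}=\{(\theta,F(\theta)) : \theta\in\Theta\}$ its graph. Let us rewrite the ordinary univariate Bregman divergence of Eq. 3 as follows:

 $$B_F(\theta_1:\theta_2) = F(\theta_1)-T_{\theta_2}(\theta_1), \tag{22}$$

where $T_\theta$ denotes the equation of the tangent line of $F$ at $\theta$:

 $$T_\theta(\omega) := F(\theta)+(\omega-\theta)F'(\theta). \tag{23}$$

Let $\mathcal{T}_\theta$ denote the graph of that tangent line. Line $\mathcal{T}_\theta$ is tangent to curve $\mathcal{F}$ at point $(\theta,F(\theta))$. Graphically speaking, the Bregman divergence is interpreted as the ordinate gap (vertical gap) between the point $(\theta_1,F(\theta_1))$ of $\mathcal{F}$ and the point $(\theta_1,T_{\theta_2}(\theta_1))$ of $\mathcal{T}_{\theta_2}$, as depicted in Figure 2.

Figure 2: Illustration of the univariate Bregman divergence as the ordinate gap (‘vertical’ gap) evaluated at $\theta_1$ between the graph plot $\mathcal{F}$ and the tangent line $T_{\theta_2}$ to $F$ at $\theta_2$.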

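Now let us observe that we may relax the tangent line to a chord line (or secant) passing through the points $((\theta_1\theta_2)_\alpha,F((\theta_1\theta_2)_\alpha))$ and $((\theta_1\theta_2)_\beta,F((\theta_1\theta_2)_\beta))$ for $\alpha,\beta\in(0,1]$ with $\alpha\neq\beta$ (with corresponding Cartesian equation $C_F^{(\theta_1\theta_2)_\alpha,(\theta_1\theta_2)_\beta}$), and still get a non-negative vertical gap between $(\theta_1,F(\theta_1))$ and the chord line evaluated at $\theta_1$ (because any line intersects the graph of a strictly convex function in at most two points). By construction, this vertical gap is smaller than the gap measured by the ordinary Bregman divergence. This yields the Bregman chord divergence ($\alpha\in(0,1]$, $\beta\in(0,1]$, $\alpha\neq\beta$):

 $$B_F^{\alpha,\beta}(\theta_1:\theta_2) := F(\theta_1)-C_F^{(\theta_1\theta_2)_\alpha,(\theta_1\theta_2)_\beta}(\theta_1)\leq B_F(\theta_1:\theta_2), \tag{24}$$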

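illustrated in Figure 3. By expanding the chord equation and simplifying, we get the formula:

 $$B_F^{\alpha,\beta}(\theta_1:\theta_2) = F(\theta_1)-\Delta_F^{\alpha,\beta}(\theta_1,\theta_2)\left(\theta_1-(\theta_1\theta_2)_\alpha\right)-F((\theta_1\theta_2)_\alpha)$$
 $$= F(\theta_1)-F((\theta_1\theta_2)_\alpha)+\frac{\alpha\left\{F((\theta_1\theta_2)_\beta)-F((\theta_1\theta_2)_\alpha)\right\}}{\beta-\alpha},$$

where

 $$\Delta_F^{\alpha,\beta}(\theta_1,\theta_2) := \frac{F((\theta_1\theta_2)_\alpha)-F((\theta_1\theta_2)_\beta)}{(\theta_1\theta_2)_\alpha-(\theta_1\theta_2)_\beta},$$

is the slope of the chord. The second equality holds since $(\theta_1\theta_2)_\alpha-(\theta_1\theta_2)_\beta=(\beta-\alpha)(\theta_1-\theta_2)$ and $\theta_1-(\theta_1\theta_2)_\alpha=\alpha(\theta_1-\theta_2)$.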

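Notice the symmetry:

 $$B_F^{\alpha,\beta}(\theta_1:\theta_2) = B_F^{\beta,\alpha}(\theta_1:\theta_2).$$

We have asymptotically:

 $$\lim_{\alpha\to 1,\,\beta\to 1} B_F^{\alpha,\beta}(\theta_1:\theta_2) = B_F(\theta_1:\theta_2).$$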
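The geometric definition of Eq. 24 translates directly into code: build the chord through the two graph points, evaluate it at $\theta_1$, and subtract from $F(\theta_1)$. The following sketch works in the scalar case and assumes $\alpha\neq\beta$ and $\theta_1\neq\theta_2$; the generators and function names are illustrative choices.

```python
import math

def interp(a, b, w):
    """(a b)_w := (1 - w) a + w b."""
    return (1.0 - w) * a + w * b

def bregman(F, dF, t1, t2):
    """Ordinary scalar Bregman divergence (requires the derivative dF)."""
    return F(t1) - F(t2) - (t1 - t2) * dF(t2)

def chord_divergence(F, t1, t2, alpha, beta):
    """Scalar Bregman chord divergence: gap at t1 between F and the chord
    through the graph points at (t1 t2)_alpha and (t1 t2)_beta.
    Requires alpha != beta and t1 != t2; no derivative is needed."""
    a, b = interp(t1, t2, alpha), interp(t1, t2, beta)
    slope = (F(a) - F(b)) / (a - b)
    chord_at_t1 = F(a) + slope * (t1 - a)
    return F(t1) - chord_at_t1

def quad(x):
    return x * x

def d_quad(x):
    return 2.0 * x

print(chord_divergence(quad, 0.0, 1.0, 0.5, 1.0))          # chord gap
print(bregman(quad, d_quad, 0.0, 1.0))                     # ordinary Bregman gap
print(chord_divergence(quad, 0.0, 1.0, 1.0 - 1e-6, 1.0))   # approaches the Bregman gap
print(chord_divergence(math.exp, 0.2, 1.4, 0.3, 0.8),
      chord_divergence(math.exp, 0.2, 1.4, 0.8, 0.3))      # symmetric in (alpha, beta)
```

For the quadratic generator with $\theta_1=0$ and $\theta_2=1$, the chord gap works out to $\alpha\beta$, which makes the lower bound and the limit as $\alpha,\beta\to 1$ easy to see by hand.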

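When $\alpha=\beta$ (taken as the limit $\beta\to\alpha$), the Bregman chord divergences yield a subfamily of Bregman tangent divergences: $B_F^{\alpha}(\theta_1:\theta_2) := \lim_{\beta\to\alpha}B_F^{\alpha,\beta}(\theta_1:\theta_2)$. We consider the tangent line at $(\theta_1\theta_2)_\alpha$ and measure the ordinate gap at $\theta_1$ between the function plot and this tangent line:

 $$B_F^{\alpha}(\theta_1:\theta_2) := F(\theta_1)-F((\theta_1\theta_2)_\alpha)-(\theta_1-(\theta_1\theta_2)_\alpha)^\top\nabla F((\theta_1\theta_2)_\alpha) \tag{25}$$
 $$= F(\theta_1)-F((\theta_1\theta_2)_\alpha)-\alpha(\theta_1-\theta_2)^\top\nabla F((\theta_1\theta_2)_\alpha),$$

for $\alpha\in(0,1]$. The ordinary Bregman divergence is recovered when $\alpha=1$. Notice that the mean value theorem yields $\Delta_F^{\alpha,\beta}(\theta_1,\theta_2)=F'(\xi)$ for some $\xi$ between $(\theta_1\theta_2)_\alpha$ and $(\theta_1\theta_2)_\beta$. Thus $B_F^{\alpha,\beta}(\theta_1:\theta_2)=F(\theta_1)-F((\theta_1\theta_2)_\alpha)-(\theta_1-(\theta_1\theta_2)_\alpha)F'(\xi)$ for such a $\xi$. Letting $\beta=1$ and $\alpha=1-\epsilon$ (for small values of $\epsilon>0$), we can approximate the ordinary Bregman divergence by the Bregman chord divergence without computing the gradient: $B_F(\theta_1:\theta_2)\simeq B_F^{1-\epsilon,1}(\theta_1:\theta_2)$.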
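To get a feel for the quality of the gradient-free approximation, the sketch below compares the chord divergence with $\beta=1$ and $\alpha=1-\epsilon$ against the ordinary Bregman divergence, and the tangent divergence of Eq. 25 against a chord with nearly equal parameters. It assumes the scalar generator $F(x)=x\log x-x$ (a Shannon-type choice made for this sketch) and hypothetical function names.

```python
import math

def interp(a, b, w):
    """(a b)_w := (1 - w) a + w b."""
    return (1.0 - w) * a + w * b

def bregman(F, dF, t1, t2):
    """Ordinary scalar Bregman divergence."""
    return F(t1) - F(t2) - (t1 - t2) * dF(t2)

def tangent_divergence(F, dF, t1, t2, alpha):
    """Scalar Bregman tangent divergence of Eq. 25."""
    a = interp(t1, t2, alpha)
    return F(t1) - F(a) - (t1 - a) * dF(a)

def chord_divergence(F, t1, t2, alpha, beta):
    """Scalar Bregman chord divergence (gap between F and the chord at t1)."""
    a, b = interp(t1, t2, alpha), interp(t1, t2, beta)
    slope = (F(a) - F(b)) / (a - b)
    return F(t1) - (F(a) + slope * (t1 - a))

def F(x):
    return x * math.log(x) - x

def dF(x):
    return math.log(x)

t1, t2 = 2.0, 3.0
exact = bregman(F, dF, t1, t2)
for eps in (1e-1, 1e-2, 1e-3):
    approx = chord_divergence(F, t1, t2, 1.0 - eps, 1.0)   # uses F only
    print(eps, approx, exact - approx)                     # error shrinks with eps
```

In this example the approximation error is positive, consistent with the chord divergence being a lower bound, and it decreases roughly linearly with $\epsilon$.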

Figure 4 displays some snapshots of an interactive demo program that illustrates the impact of $\alpha$ and $\beta$ for defining the Bregman chord divergences for the quadratic and Shannon generators.

Figure 4: The univariate Bregman chord divergences and Bregman tangent divergences for the quadratic and Shannon information generators.

4 The multivariate Bregman chord divergence

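When the generator is separable, i.e., $F(x)=\sum_i F_i(x_i)$ for univariate generators $F_i$, we easily extend the Bregman chord divergence as: $B_F^{\alpha,\beta}(\theta:\theta')=\sum_i B_{F_i}^{\alpha,\beta}(\theta_i:\theta_i')$. Otherwise, we have to carefully define the notion of “slope” for the multivariate case. An example of such a non-separable multivariate generator is the Legendre dual of the Shannon negentropy: the log-sum-exp function.

Given a multivariate (non-separable) Bregman generator $F(\theta)$ with domain $\Theta\subseteq\mathbb{R}^d$ and two prescribed distinct parameters $\theta_1$ and $\theta_2$, consider the following univariate function, for $\lambda$ such that $(1-\lambda)\theta_1+\lambda\theta_2\in\Theta$:

 $$F_{\theta_1,\theta_2}(\lambda) := F((1-\lambda)\theta_1+\lambda\theta_2) = F(\theta_1+\lambda(\theta_2-\theta_1)), \tag{26}$$

with $F_{\theta_1,\theta_2}(0)=F(\theta_1)$ and $F_{\theta_1,\theta_2}(1)=F(\theta_2)$.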

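The functions $\{F_{\theta_1,\theta_2}\}_{\theta_1\neq\theta_2}$ are strictly convex and univariate Bregman generators.

Proof.

To prove the strict convexity of a univariate function $G$, we need to show that for any $\alpha\in(0,1)$ and any $x\neq y$, we have

 $$G((1-\alpha)x+\alpha y) < (1-\alpha)G(x)+\alpha G(y).$$

For $G=F_{\theta_1,\theta_2}$ and $\lambda_1\neq\lambda_2$, the strict convexity of $F$ yields:

 $$F_{\theta_1,\theta_2}((1-\alpha)\lambda_1+\alpha\lambda_2) = F(\theta_1+((1-\alpha)\lambda_1+\alpha\lambda_2)(\theta_2-\theta_1))$$
 $$= F((1-\alpha)(\lambda_1(\theta_2-\theta_1)+\theta_1)+\alpha(\lambda_2(\theta_2-\theta_1)+\theta_1))$$
 $$< (1-\alpha)F(\lambda_1(\theta_2-\theta_1)+\theta_1)+\alpha F(\lambda_2(\theta_2-\theta_1)+\theta_1)$$
 $$= (1-\alpha)F_{\theta_1,\theta_2}(\lambda_1)+\alpha F_{\theta_1,\theta_2}(\lambda_2).$$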

Then we define the multivariate Bregman chord divergence by applying the definition of the univariate Bregman chord divergence of Eq. 24 on these families of univariate Bregman generators:

 $$B_F^{\alpha,\beta}(\theta_1:\theta_2) := B_{F_{\theta_1,\theta_2}}^{\alpha,\beta}(0:1). \tag{27}$$

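Since $(0\,1)_\alpha=\alpha$ and $(0\,1)_\beta=\beta$, we get:

 $$B_F^{\alpha,\beta}(\theta_1:\theta_2) = F_{\theta_1,\theta_2}(0)-F_{\theta_1,\theta_2}(\alpha)+\frac{\alpha\left(F_{\theta_1,\theta_2}(\beta)-F_{\theta_1,\theta_2}(\alpha)\right)}{\beta-\alpha}$$
 $$= F(\theta_1)-F((\theta_1\theta_2)_\alpha)+\frac{\alpha\left(F((\theta_1\theta_2)_\beta)-F((\theta_1\theta_2)_\alpha)\right)}{\beta-\alpha},$$

in accordance with the univariate definition of Eq. 24. Since $(\theta_1\theta_2)_\beta=(\theta_1\theta_2)_\alpha+(\beta-\alpha)(\theta_2-\theta_1)$, we have the first-order Taylor expansion:

 $$F((\theta_1\theta_2)_\beta) \simeq_{\beta\simeq\alpha} F((\theta_1\theta_2)_\alpha)+(\beta-\alpha)(\theta_2-\theta_1)^\top\nabla F((\theta_1\theta_2)_\alpha).$$

Therefore, we have:

 $$\frac{\alpha\left(F((\theta_1\theta_2)_\beta)-F((\theta_1\theta_2)_\alpha)\right)}{\beta-\alpha} \simeq \alpha(\theta_2-\theta_1)^\top\nabla F((\theta_1\theta_2)_\alpha) = -\alpha(\theta_1-\theta_2)^\top\nabla F((\theta_1\theta_2)_\alpha).$$

This proves that $\lim_{\beta\to\alpha}B_F^{\alpha,\beta}(\theta_1:\theta_2)=B_F^{\alpha}(\theta_1:\theta_2)$.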
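The multivariate construction of Eqs. 26 and 27 can be sketched by restricting the generator to the line through $\theta_1$ and $\theta_2$ and reusing the univariate chord gap at $(0:1)$. The example generator $F(\theta)=\log\left(1+\sum_i e^{\theta_i}\right)$ is one strictly convex log-sum-exp variant, chosen here as an assumption of the sketch; all function names are hypothetical.

```python
import math

def interp_vec(t1, t2, lam):
    """Point (1 - lam) t1 + lam t2 on the line through t1 and t2."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(t1, t2)]

def chord_divergence_1d(G, x1, x2, alpha, beta):
    """Univariate chord gap at x1 for the chord anchored at (x1 x2)_alpha, (x1 x2)_beta."""
    a = (1.0 - alpha) * x1 + alpha * x2
    b = (1.0 - beta) * x1 + beta * x2
    slope = (G(a) - G(b)) / (a - b)
    return G(x1) - (G(a) + slope * (x1 - a))

def chord_divergence(F, t1, t2, alpha, beta):
    """Multivariate Bregman chord divergence via the line restriction (Eqs. 26-27)."""
    def line(lam):
        return F(interp_vec(t1, t2, lam))
    return chord_divergence_1d(line, 0.0, 1.0, alpha, beta)

def lse1(theta):
    """Example non-separable generator: log(1 + sum_i exp(theta_i))."""
    return math.log(1.0 + sum(math.exp(t) for t in theta))

def grad_lse1(theta):
    z = 1.0 + sum(math.exp(t) for t in theta)
    return [math.exp(t) / z for t in theta]

def bregman(F, gradF, t1, t2):
    """Multivariate Bregman divergence (uses the gradient, for comparison only)."""
    g = gradF(t2)
    return F(t1) - F(t2) - sum((a - b) * gi for a, b, gi in zip(t1, t2, g))

t1, t2 = [0.5, -1.0], [1.5, 0.3]
exact = bregman(lse1, grad_lse1, t1, t2)
approx = chord_divergence(lse1, t1, t2, 1.0 - 1e-4, 1.0)   # gradient-free
print(exact, approx, chord_divergence(lse1, t1, t2, 0.3, 0.8))
```

Only evaluations of $F$ along the segment are needed, so the cost does not depend on having a closed-form gradient.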

Notice that the Bregman chord divergence does not require computing the gradient $\nabla F$. The “slope term” in the definition is reminiscent of the $q$-derivative (quantum/discrete derivatives). However, the $q$-derivatives are defined with respect to a single reference point while the chord definition requires two reference points.

5 Conclusion and perspectives

We geometrically designed a new class of distances using two scalar parameters, termed the Bregman chord divergence, and its one-parametric subfamily, the Bregman tangent divergences, which includes the ordinary Bregman divergence. This generalization allows one to easily fine-tune Bregman divergences in applications by smoothly adjusting one or two (scalar) knobs. Moreover, by choosing $\alpha=1-\epsilon$ and $\beta=1$ for small $\epsilon>0$, the Bregman chord divergence closely lower bounds the Bregman divergence without requiring the computation of the gradient (a different gradient-free approximation is $\frac{1}{\epsilon}J_F^{\epsilon}(\theta_2:\theta_1)$). We expect that this new class of distances brings further improvements in signal processing and information fusion applications (e.g., by tuning the scalar parameters). While the Bregman chord divergence defines an ordinate gap on the exterior of the epigraph, the Jensen chord divergence defines the gap inside the epigraph of the generator. In future work, the (curved) information-geometric structure induced by the Bregman chord divergences shall be investigated from the viewpoint of gauge theory and in contrast with the dually flat structures of Bregman manifolds.

Java™ source code is available for reproducible research (see footnote 6).

Acknowledgments

We express our thanks to Gaëtan Hadjeres (Sony CSL, Paris) for his careful proofreading and feedback.

Footnotes

1. In information theory, the double bar notation ’||’ has been used to avoid confusion with the comma ’,’ notation, used for example in the joint entropy $H(X,Y)$.
2. Here, we use the word distance to mean a dissimilarity (or a distortion), not necessarily a metric distance. A distance satisfies $D(\theta_1:\theta_2)\geq 0$ with equality iff $\theta_1=\theta_2$.
3. A metric distance is not smooth at its calling arguments.
4. By analogy to the Jeffreys divergence that is the symmetrized Kullback-Leibler divergence.
5. By analogy to the Jensen-Shannon divergence (JS).
6. https://franknielsen.github.io/~nielsen/BregmanChordDivergence/

References

1. Syed Mumtaz Ali and Samuel D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pages 131–142, 1966.
2. Shun-ichi Amari. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Transactions on Information Theory, 55(11):4925–4931, 2009.
3. Shun-ichi Amari. Information geometry and its applications. Springer, 2016.
4. Arindam Banerjee, Xin Guo, and Hui Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7):2664–2669, 2005.
5. Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of machine learning research, 6(Oct):1705–1749, 2005.
6. Michèle Basseville. Divergence measures for statistical data processing: An annotated bibliography. Signal Processing, 93(4):621–633, 2013.
7. Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR computational mathematics and mathematical physics, 7(3):200–217, 1967.
8. Jacob Burbea and C. Rao. On the convexity of some divergence measures based on entropy functions. IEEE Transactions on Information Theory, 28(3):489–495, 1982.
9. Andrzej Cichocki, Hyekyoung Lee, Yong-Deok Kim, and Seungjin Choi. Non-negative matrix factorization with α-divergence. Pattern Recognition Letters, 29(9):1433–1440, 2008.
10. Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
11. Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
12. Imre Csiszár. Why least squares and maximum entropy? an axiomatic approach to inference for linear inverse problems. The annals of statistics, 19(4):2032–2066, 1991.
13. Michel Marie Deza and Elena Deza. Encyclopedia of distances. In Encyclopedia of Distances, pages 1–583. Springer, 2009.
14. Cédric Févotte. Majorization-minimization algorithm for smooth Itakura-Saito nonnegative matrix factorization. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 1980–1983. IEEE, 2011.
15. Johan Ludwig William Valdemar Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta mathematica, 30(1):175–193, 1906.
16. Jiantao Jiao, Thomas A Courtade, Albert No, Kartik Venkat, and Tsachy Weissman. Information measures: the curious case of the binary alphabet. IEEE Transactions on Information Theory, 60(12):7616–7626, 2014.
17. Victor Kac and Pokman Cheung. Quantum calculus. Springer Science & Business Media, 2001.
18. Krzysztof C. Kiwiel. Proximal minimization methods with generalized Bregman functions. SIAM journal on control and optimization, 35(4):1142–1168, 1997.
19. Mihoko Minami and Shinto Eguchi. Robust blind source separation by beta divergence. Neural computation, 14(8):1859–1886, 2002.
20. Dharmendra S. Modha and W. Scott Spangler. Feature weighting in k-means clustering. Machine learning, 52(3):217–237, 2003.
21. Jan Naudts and Jun Zhang. Rho–tau embedding and gauge freedom in information geometry. Information Geometry, pages 1–37, 2018.
22. Frank Nielsen. The chord gap divergence and a generalization of the Bhattacharyya distance. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2276–2280, April 2018.
23. Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.
24. Frank Nielsen and Richard Nock. Skew Jensen-Bregman voronoi diagrams. In Transactions on Computational Science XIV, pages 102–128. Springer, 2011.
25. Frank Nielsen and Richard Nock. On the geometry of mixtures of prescribed distributions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2861–2865. IEEE, 2018.
26. M. C. Pardo and Igor Vajda. About distances of discrete distributions satisfying the data processing theorem of information theory. IEEE transactions on information theory, 43(4):1288–1293, 1997.
27. Matus Telgarsky and Sanjoy Dasgupta. Agglomerative Bregman clustering. In Proceedings of the 29th International Conference on International Conference on Machine Learning, pages 1011–1018. Omnipress, 2012.
28. Murat Üney, Jérémie Houssineau, Emmanuel Delande, Simon J. Julier, and Daniel E. Clark. Fusion of finite set distributions: Pointwise consistency and global cardinality. CoRR, abs/1802.06220, 2018.
29. Jun Zhang. Divergence function, duality, and convex analysis. Neural Computation, 16(1):159–195, 2004.