The Bregman chord divergence
Abstract
Distances are fundamental primitives whose choice significantly impacts the performance of algorithms in machine learning and signal processing. However, selecting the most appropriate distance for a given task is a difficult endeavor. Instead of testing one by one the entries of an ever-expanding dictionary of ad hoc distances, one rather prefers to consider parametric classes of distances that are exhaustively characterized by axioms derived from first principles. Bregman divergences are such a class. However, fine-tuning a Bregman divergence is delicate since it requires smoothly adjusting a functional generator. In this work, we propose an extension of Bregman divergences called the Bregman chord divergences. This new class of distances does not require gradient calculations, uses two scalar parameters that can be easily tailored in applications, and asymptotically recovers the ordinary Bregman divergences.
Keywords: Bregman divergence, Jensen divergence, skewed divergence, clustering, information fusion.
1 Introduction
Dissimilarities (or distances) are at the heart of many signal processing tasks [13, 6],
and the performance of algorithms solving those tasks heavily depends on the chosen distances.
A dissimilarity between two objects $x$ and $y$ belonging to a space $\mathcal{X}$ (e.g., vectors, matrices, probability densities, random variables, etc.) is a function $D : \mathcal{X}\times\mathcal{X}\to[0,\infty)$
such that $D(x,y)\geq 0$ with equality if and only if $x=y$.
Since a dissimilarity may not be symmetric (i.e., an oriented dissimilarity with $D(x,y)\neq D(y,x)$), we emphasize this
fact using the notation
$D(x:y).$ (1)
The reverse dissimilarity $D^*(x:y) := D(y:x)$ is also a dissimilarity, and satisfies the involutive property: $(D^*)^* = D$. When a symmetric dissimilarity further satisfies the triangular inequality
$D(x,z) \leq D(x,y) + D(y,z), \quad \forall x,y,z\in\mathcal{X},$ (2)
it is called a metric distance.
Historically, many ad hoc distances have been proposed and empirically benchmarked on different tasks in order to improve the state-of-the-art performance. However, getting the most appropriate distance for a given task is often an endeavor.
Thus principled classes of distances that are exhaustively characterized by axioms derived from first principles are preferred. Let us mention three such fundamental classes of distances:
The Bregman divergences [7, 5] defined for a strictly convex and differentiable generator $F : \Theta\to\mathbb{R}$:
$B_F(\theta_1:\theta_2) := F(\theta_1) - F(\theta_2) - (\theta_1-\theta_2)^\top \nabla F(\theta_2)$ (3)
measure the dissimilarity between parameters $\theta_1,\theta_2\in\Theta$. We use the term “divergence” (rooted in information geometry [3]) instead of distance to emphasize the smoothness property^{3} of the distance. The dual Bregman divergence is obtained from the Bregman divergence induced by the Legendre convex conjugate:
$B_F^*(\theta_1:\theta_2) := B_F(\theta_2:\theta_1) = B_{F^*}\big(\nabla F(\theta_1):\nabla F(\theta_2)\big),$ (4)
where the Legendre–Fenchel transformation is defined by:
$F^*(\eta) := \sup_{\theta\in\Theta}\left\{\theta^\top\eta - F(\theta)\right\}.$ (5)
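As a quick numerical sanity check of Eq. 4 (the exponential generator below is an illustrative choice of ours, not one used in this paper), we can verify the Legendre duality $B_F(\theta_1:\theta_2) = B_{F^*}(\nabla F(\theta_2):\nabla F(\theta_1))$:

```python
import math

def bregman(F, gradF, t1, t2):
    """Ordinary univariate Bregman divergence B_F(t1 : t2)."""
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

# Generator F(x) = exp(x) with gradient F'(x) = exp(x);
# its Legendre conjugate is F*(y) = y log y - y with (F*)'(y) = log y.
F      = math.exp
gradF  = math.exp
Fstar  = lambda y: y * math.log(y) - y
gradFs = math.log

t1, t2 = 0.3, 1.7
eta1, eta2 = gradF(t1), gradF(t2)   # dual parameters eta = F'(theta)

# Duality: B_F(t1 : t2) = B_{F*}(eta2 : eta1)
lhs = bregman(F, gradF, t1, t2)
rhs = bregman(Fstar, gradFs, eta2, eta1)
print(lhs, rhs)  # both values agree
```

This check works for any strictly convex differentiable generator whose conjugate is known in closed form.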
The Csiszár $f$-divergences [1, 11] defined for a convex generator $f$ satisfying $f(1)=0$:
$I_f[p:q] := \int_{\mathcal{X}} p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) \mathrm{d}\mu(x)$ (6)
measure the dissimilarity between probability densities $p$ and $q$ that are absolutely continuous with respect to a base measure $\mu$ (defined on a support $\mathcal{X}$). A scalar divergence is a divergence acting on scalar parameters, i.e., a 1D divergence. A separable divergence is a divergence that can be written as a sum of elementary scalar divergences. The $f$-divergences are separable divergences since we have:
$I_f[p:q] = \int_{\mathcal{X}} i_f\big(p(x):q(x)\big)\, \mathrm{d}\mu(x),$ (7)
with the scalar divergence $i_f(a:b) := a f\!\left(\frac{b}{a}\right)$.
The dual divergence $I_f^*[p:q] := I_f[q:p]$ is obtained for the generator $f^\diamond(u) := u f\!\left(\frac{1}{u}\right)$ (diamond generator) as follows:
$I_{f^\diamond}[p:q] = I_f[q:p].$ (8)
We may symmetrize^{4} a divergence by defining its generator $s := \frac{1}{2}\left(f + f^\diamond\right)$:
$I_s[p:q] = \frac{1}{2}\left(I_f[p:q] + I_f[q:p]\right)$ (9)
$= I_s[q:p],$ (10)
with $s^\diamond = s$.
Alternatively, we may JS-symmetrize^{5} the divergence by using the following generator $g$:
$g(u) := \frac{1}{2}\left(f\!\left(\frac{1+u}{2}\right) + u\, f\!\left(\frac{1+u}{2u}\right)\right),$ (11)
so that
$I_g[p:q] = \frac{1}{2}\left(I_f\!\left[p:\frac{p+q}{2}\right] + I_f\!\left[q:\frac{p+q}{2}\right]\right)$ (12)
$= I_g[q:p].$ (13)
The Jensen divergences (also called Burbea–Rao divergences [8]) defined for a strictly convex generator $F$:
$J_F(\theta_1,\theta_2) := \frac{F(\theta_1) + F(\theta_2)}{2} - F\!\left(\frac{\theta_1+\theta_2}{2}\right) \geq 0$ (14)
measure the dissimilarity between parameters $\theta_1$ and $\theta_2$.
These three fundamental classes of distances are not mutually exclusive, and their pairwise intersections (e.g., the divergences belonging to both the Csiszár and the Bregman classes) have been studied in [26, 2, 16]. The ’:’ notation between arguments of distances emphasizes the potential asymmetry of distances (oriented distances with $D(\theta_1:\theta_2)\neq D(\theta_2:\theta_1)$), and the square brackets surrounding distance arguments indicate that it is a statistical distance between probability densities, and not a distance between parameters. Using these notations, we express the Kullback–Leibler distance [10] (KL) as:
$\mathrm{KL}[p:q] := \int_{\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}\, \mathrm{d}\mu(x).$ (15)
The KL distance between two members $p_{\theta_1}$ and $p_{\theta_2}$ of a parametric family $\mathcal{F} = \{p_\theta\}_{\theta\in\Theta}$ of distributions amounts to a parameter divergence:
$\mathrm{KL}_{\mathcal{F}}(\theta_1:\theta_2) := \mathrm{KL}[p_{\theta_1}:p_{\theta_2}].$ (16)
For example, the KL statistical distance between two probability densities belonging to the same exponential family or the same mixture family amounts to a (parameter) Bregman divergence [3, 25]. When $p$ and $q$ are finite discrete distributions of the $d$-dimensional probability simplex $\Delta_d$, we have $\mathrm{KL}[p:q] = \sum_{i=1}^d p_i \log\frac{p_i}{q_i}$. This explains why we can sometimes loosely handle distances between discrete distributions as both parameter distances and statistical distances. For example, the KL distance between two discrete distributions is a Bregman divergence $B_F$ for $F(x) = \sum_{i=1}^d x_i \log x_i$ (Shannon negentropy) for $x\in\Delta_d$. Extending to positive measures $p$ and $q$, this Bregman divergence yields the extended KL distance: $\mathrm{KL}^+(p:q) = \sum_{i=1}^d p_i\log\frac{p_i}{q_i} + q_i - p_i$.
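To make the correspondence concrete, here is a small numerical sketch (the vectors are our own illustrative choice) checking that the Bregman divergence induced by the Shannon negentropy coincides with the (extended) KL distance:

```python
import math

def shannon_negentropy(x):
    """Shannon negentropy F(x) = sum_i x_i log x_i (extended to positive measures)."""
    return sum(xi * math.log(xi) for xi in x)

def grad_neg(x):
    """Gradient of the Shannon negentropy: log x_i + 1."""
    return [math.log(xi) + 1.0 for xi in x]

def bregman(F, gradF, p, q):
    """Multivariate Bregman divergence B_F(p : q)."""
    g = gradF(q)
    return F(p) - F(q) - sum((pi - qi) * gi for pi, qi, gi in zip(p, q, g))

def ekl(p, q):
    """Extended Kullback-Leibler distance between positive measures."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

p = [0.2, 0.5, 0.3]   # probability vectors (the q_i - p_i terms then cancel)
q = [0.4, 0.4, 0.2]
print(bregman(shannon_negentropy, grad_neg, p, q), ekl(p, q))  # equal values
```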
Whenever using a functionally parameterized distance in applications, we need to choose the most appropriate functional generator, ideally from first principles [12, 4, 3]. For example, Nonnegative Matrix Factorization (NMF) for audio source separation or music transcription from the signal power spectrogram can be done by selecting the Itakura–Saito divergence [14] (a Bregman divergence for the Burg negentropy $F(x) = -\log x$) that satisfies the requirement of being scale-invariant: $D(\lambda p:\lambda q) = D(p:q)$ for any $\lambda > 0$. When no such first principles can be easily stated for a task [12], we are left with choosing a generator manually or by cross-validation. Notice that a convex combination of Csiszár generators is again a Csiszár generator (and similarly for Bregman generators): $f_w := \sum_i w_i f_i$ for a weight vector $w$ belonging to the standard simplex. Thus in practice, we could choose a base of generators and learn the best distance weighting (by analogy to feature weighting [20]). However, in doing so, we are left with the problem of choosing the base generators, and moreover we need to sum up different distances: this raises the problem of properly adding distance units! Thus in applications, it is often preferable to consider a smooth family of generators parameterized by scalars (e.g., the $\alpha$-divergences [9] or the $\beta$-divergences [19], etc.), and then finely tune these scalars.
In this work, we propose a novel class of distances, termed Bregman chord divergences. Bregman chord divergences are parameterized by two scalar parameters, which makes them easy to fine-tune in applications, and they match the ordinary Bregman divergences asymptotically.
The paper is organized as follows: In §2, we describe the skewed Jensen divergence, show how to biskew any distance by using two scalars, and report on the Jensen chord divergence. In §3, we first introduce the univariate Bregman chord divergence, and then extend its definition to the multivariate case in §4. Finally, we conclude in §5.
2 Geometric design of skewed divergences from graph plots
We can geometrically design divergences from convexity gap properties of the plot of the generator. For example, the Jensen divergence $J_F(\theta_1,\theta_2)$ of Eq. 14 is visualized as the ordinate (vertical) gap between the midpoint of the line segment joining $(\theta_1, F(\theta_1))$ to $(\theta_2, F(\theta_2))$ and the point $\left(\frac{\theta_1+\theta_2}{2}, F\!\left(\frac{\theta_1+\theta_2}{2}\right)\right)$. The nonnegativity property of the Jensen divergence follows from Jensen’s midpoint convex inequality [15]. Instead of taking the midpoint, we can take any interior point $(\theta_1\theta_2)_\alpha := (1-\alpha)\theta_1 + \alpha\theta_2$, and get the skewed $\alpha$-Jensen divergence (for any $\alpha\in(0,1)$):
$J_F^\alpha(\theta_1:\theta_2) := (1-\alpha)F(\theta_1) + \alpha F(\theta_2) - F\big((\theta_1\theta_2)_\alpha\big).$ (17)
A remarkable fact is that the scaled $\alpha$-Jensen divergence $\frac{1}{\alpha(1-\alpha)} J_F^\alpha(\theta_1:\theta_2)$ tends asymptotically to the reverse Bregman divergence $B_F(\theta_2:\theta_1)$ when $\alpha\to 0$, see [29, 23]. Notice that the Jensen divergences can be interpreted as Jensen–Shannon-type symmetrization [24] of Bregman divergences:
$J_F(\theta_1;\theta_2) = \frac{1}{2}\left(B_F\!\left(\theta_1:\frac{\theta_1+\theta_2}{2}\right) + B_F\!\left(\theta_2:\frac{\theta_1+\theta_2}{2}\right)\right),$ (18)
and more generally, we have the skewed Jensen–Bregman divergences:
$\mathrm{JB}_F^\alpha(\theta_1:\theta_2) := (1-\alpha)\, B_F\big(\theta_1:(\theta_1\theta_2)_\alpha\big) + \alpha\, B_F\big(\theta_2:(\theta_1\theta_2)_\alpha\big) = J_F^\alpha(\theta_1:\theta_2).$ (19)
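The identity of Eq. 19 can be checked numerically; the quadratic generator below is an arbitrary illustrative choice (any strictly convex, differentiable generator works):

```python
def F(t):        # quadratic generator, for which B_F(t1 : t2) = (t1 - t2)^2
    return t * t

def gradF(t):
    return 2.0 * t

def bregman(t1, t2):
    """Univariate Bregman divergence B_F(t1 : t2)."""
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def jensen_skew(t1, t2, a):
    """Skewed Jensen divergence J_F^a(t1 : t2)."""
    m = (1 - a) * t1 + a * t2
    return (1 - a) * F(t1) + a * F(t2) - F(m)

def jensen_bregman_skew(t1, t2, a):
    """Skewed Jensen-Bregman divergence: weighted Bregman divergences to (t1 t2)_a."""
    m = (1 - a) * t1 + a * t2
    return (1 - a) * bregman(t1, m) + a * bregman(t2, m)

t1, t2, a = -1.5, 2.0, 0.3
print(jensen_skew(t1, t2, a), jensen_bregman_skew(t1, t2, a))  # identical values
```

The gradient terms cancel because the weighted displacements to the interpolated point $(\theta_1\theta_2)_\alpha$ sum to zero, which is exactly the algebraic content of Eq. 19.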
By measuring the ordinate gap between two non-crossing upper and lower chords anchored at the generator graph plot, we can extend the Jensen divergences to a tri-parametric family of Jensen chord divergences [22]:
$J_F^{\alpha,\beta,\gamma}(\theta_1:\theta_2) := \big(F(\theta_1) F(\theta_2)\big)_\gamma - \left(F\big((\theta_1\theta_2)_\alpha\big)\, F\big((\theta_1\theta_2)_\beta\big)\right)_{\frac{\gamma-\alpha}{\beta-\alpha}},$ (20)
with $\alpha,\beta\in[0,1]$ and $\gamma\in[\alpha,\beta]$. The Jensen divergence is recovered when $\alpha,\beta\to\gamma=\frac{1}{2}$.
For any given distance $D$ (with convex parameter space $\Theta$), we can biskew the distance by considering two scalars $\gamma$ and $\delta$ (with $\gamma\neq\delta$) as:
$D_{\gamma,\delta}(\theta_1:\theta_2) := D\big((\theta_1\theta_2)_\gamma : (\theta_1\theta_2)_\delta\big).$ (21)
Clearly, $D\big((\theta_1\theta_2)_\gamma:(\theta_1\theta_2)_\delta\big) = 0$ iff $(\theta_1\theta_2)_\gamma = (\theta_1\theta_2)_\delta$. That is, if (i) $\theta_1 = \theta_2$ or if (ii) $\gamma = \delta$. Since by definition $\gamma\neq\delta$, we have $D_{\gamma,\delta}(\theta_1:\theta_2) = 0$ iff $\theta_1 = \theta_2$. Notice that both $(\theta_1\theta_2)_\gamma$ and $(\theta_1\theta_2)_\delta$ should belong to the parameter space $\Theta$. A sufficient condition is to ensure that $\gamma,\delta\in[0,1]$ so that both $(\theta_1\theta_2)_\gamma\in\Theta$ and $(\theta_1\theta_2)_\delta\in\Theta$. When $\Theta = \mathbb{R}^d$, we may further consider any $\gamma,\delta\in\mathbb{R}$.
3 The scalar Bregman chord divergence
Let $F : \Theta\subset\mathbb{R}\to\mathbb{R}$ be a univariate Bregman generator with open convex domain $\Theta$, and denote by $\mathcal{F} = \{(\theta, F(\theta)) : \theta\in\Theta\}$ its graph. Let us rewrite the ordinary univariate Bregman divergence [7] of Eq. 3 as follows:
$B_F(\theta_1:\theta_2) = F(\theta_1) - T_{\theta_2}(\theta_1),$ (22)
where $y = T_{\theta_2}(\theta)$ denotes the equation of the tangent line of $F$ at $\theta_2$:
$T_{\theta_2}(\theta) := F(\theta_2) + (\theta-\theta_2)\, F'(\theta_2).$ (23)
Let $\mathcal{T}_{\theta_2}$ denote the graph of that tangent line. Line $\mathcal{T}_{\theta_2}$ is tangent to curve $\mathcal{F}$ at point $(\theta_2, F(\theta_2))$. Graphically speaking, the Bregman divergence is interpreted as the ordinate (vertical) gap between the point $(\theta_1, F(\theta_1))\in\mathcal{F}$ and the point $(\theta_1, T_{\theta_2}(\theta_1))\in\mathcal{T}_{\theta_2}$, as depicted in Figure 2.
Now let us observe that we may relax the tangent line to a chord line (or secant) passing through the points $\big((\theta_1\theta_2)_\gamma, F((\theta_1\theta_2)_\gamma)\big)$ and $\big((\theta_1\theta_2)_\delta, F((\theta_1\theta_2)_\delta)\big)$ for $\gamma,\delta\in(0,1]$ with $\gamma\neq\delta$ (with corresponding Cartesian equation $y = C_{(\theta_1\theta_2)_\gamma,(\theta_1\theta_2)_\delta}(\theta)$), and still get a nonnegative vertical gap between $(\theta_1, F(\theta_1))$ and $\big(\theta_1, C_{(\theta_1\theta_2)_\gamma,(\theta_1\theta_2)_\delta}(\theta_1)\big)$ (because any line intersects the graph of a strictly convex function in at most two points). By construction, this vertical gap is smaller than the gap measured by the ordinary Bregman divergence. This yields the Bregman chord divergence (for $\gamma,\delta\in(0,1]$, $\gamma\neq\delta$):
$B_F^{\gamma,\delta}(\theta_1:\theta_2) := F(\theta_1) - C_{(\theta_1\theta_2)_\gamma,(\theta_1\theta_2)_\delta}(\theta_1),$ (24)
illustrated in Figure 3. By expanding the chord equation and rearranging, we get the closed-form formula:
$B_F^{\gamma,\delta}(\theta_1:\theta_2) = F(\theta_1) - F\big((\theta_1\theta_2)_\gamma\big) - \gamma(\theta_1-\theta_2)\, \Delta_F\big((\theta_1\theta_2)_\gamma, (\theta_1\theta_2)_\delta\big) = F(\theta_1) - F\big((\theta_1\theta_2)_\gamma\big) + \frac{\gamma}{\delta-\gamma}\left(F\big((\theta_1\theta_2)_\delta\big) - F\big((\theta_1\theta_2)_\gamma\big)\right),$
where
$\Delta_F(a,b) := \frac{F(b)-F(a)}{b-a}$
is the slope of the chord; here we used $(\theta_1\theta_2)_\delta - (\theta_1\theta_2)_\gamma = (\delta-\gamma)(\theta_2-\theta_1)$ and $\theta_1 - (\theta_1\theta_2)_\gamma = \gamma(\theta_1-\theta_2)$.
Notice the symmetry:
$B_F^{\gamma,\delta}(\theta_1:\theta_2) = B_F^{\delta,\gamma}(\theta_1:\theta_2).$
We have asymptotically:
$\lim_{\gamma,\delta\to 1} B_F^{\gamma,\delta}(\theta_1:\theta_2) = B_F(\theta_1:\theta_2).$
In the limit case $\delta\to\gamma=\alpha$, the Bregman chord divergences yield a subfamily of Bregman tangent divergences: $T_F^\alpha(\theta_1:\theta_2) := \lim_{\delta\to\alpha} B_F^{\alpha,\delta}(\theta_1:\theta_2)$. We consider the tangent line at $(\theta_1\theta_2)_\alpha$ and measure the ordinate gap at $\theta_1$ between the function plot and this tangent line:
$T_F^\alpha(\theta_1:\theta_2) := F(\theta_1) - F\big((\theta_1\theta_2)_\alpha\big) - \big(\theta_1 - (\theta_1\theta_2)_\alpha\big)\, F'\big((\theta_1\theta_2)_\alpha\big),$ (25)
for $\alpha\in(0,1]$. The ordinary Bregman divergence is recovered when $\alpha = 1$: $T_F^1 = B_F$. Notice that the mean value theorem yields $\Delta_F\big((\theta_1\theta_2)_\gamma, (\theta_1\theta_2)_\delta\big) = F'\big((\theta_1\theta_2)_\alpha\big)$ for some $\alpha\in(\gamma,\delta)$ (assuming $\gamma<\delta$). Thus $B_F^{\gamma,\delta}(\theta_1:\theta_2) = T_F^\alpha(\theta_1:\theta_2)$ for some $\alpha\in(\gamma,\delta)$. Letting $\delta = 1$ and $\gamma = 1-\epsilon$ (for small values of $\epsilon > 0$), we can approximate the ordinary Bregman divergence by the Bregman chord divergence without requiring to compute the gradient: $B_F(\theta_1:\theta_2) \simeq B_F^{1-\epsilon,1}(\theta_1:\theta_2)$.
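A minimal implementation sketch of the univariate Bregman chord divergence (the Shannon-type generator $F(t) = t\log t$ is our illustrative choice), illustrating both the lower-bound property and the gradient-free approximation with $\gamma = 1-\epsilon$, $\delta = 1$:

```python
import math

def F(t):                      # Shannon-type generator F(t) = t log t on (0, inf)
    return t * math.log(t)

def gradF(t):
    return math.log(t) + 1.0

def bregman(t1, t2):
    """Ordinary Bregman divergence B_F(t1 : t2) (requires the derivative F')."""
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def bregman_chord(t1, t2, g, d):
    """Bregman chord divergence B_F^{g,d}(t1 : t2): ordinate gap at t1 between
    the graph of F and the chord through the points at (t1 t2)_g and (t1 t2)_d."""
    pg = (1 - g) * t1 + g * t2
    pd = (1 - d) * t1 + d * t2
    slope = (F(pd) - F(pg)) / (pd - pg)        # chord slope, no gradient needed
    return F(t1) - (F(pg) + (t1 - pg) * slope)

t1, t2 = 0.2, 0.9
full  = bregman(t1, t2)
chord = bregman_chord(t1, t2, 0.4, 0.8)
# 0 <= chord divergence <= ordinary Bregman divergence:
print(0.0 <= chord <= full)                              # True
# With g = 1 - eps, d = 1, the chord divergence approximates B_F without F':
print(abs(bregman_chord(t1, t2, 1 - 1e-6, 1.0) - full))  # tiny
```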
Figure 4 displays some snapshots of an interactive demo program that illustrates the impact of $\gamma$ and $\delta$ when defining the Bregman chord divergences for the quadratic and Shannon generators.
[Figure 4: (a) Bregman chord and (b) Bregman tangent divergences for the quadratic generator; (c) Bregman chord and (d) Bregman tangent divergences for the Shannon generator.]
4 The multivariate Bregman chord divergence
When the generator is separable [3], i.e., $F(\theta) = \sum_{i=1}^d F_i(\theta_i)$ for univariate generators $F_i$, we easily extend the Bregman chord divergence as: $B_F^{\gamma,\delta}(\theta_1:\theta_2) = \sum_{i=1}^d B_{F_i}^{\gamma,\delta}(\theta_{1,i}:\theta_{2,i})$. Otherwise, we have to carefully define the notion of “slope” for the multivariate case. An example of such a non-separable multivariate generator is the Legendre dual of the Shannon negentropy: the log-sum-exp function $F(\theta) = \log\left(1 + \sum_{i=1}^d e^{\theta_i}\right)$.
Given a multivariate (non-separable) Bregman generator $F(\theta)$ with $\theta\in\Theta\subseteq\mathbb{R}^d$ and two prescribed distinct parameters $\theta_1$ and $\theta_2$, consider the following univariate function, for $u\in[0,1]$:
$F^{\theta_1,\theta_2}(u) := F\big(\theta_1 + u(\theta_2-\theta_1)\big) = F\big((\theta_1\theta_2)_u\big),$ (26)
with $F^{\theta_1,\theta_2}(0) = F(\theta_1)$ and $F^{\theta_1,\theta_2}(1) = F(\theta_2)$.
The functions $F^{\theta_1,\theta_2}$ are strictly convex and are univariate Bregman generators.
Proof.
To prove the strict convexity of the univariate function $F^{\theta_1,\theta_2}$, we need to show that for any $u_1\neq u_2$ and any $\lambda\in(0,1)$, we have $F^{\theta_1,\theta_2}\big(\lambda u_1 + (1-\lambda)u_2\big) < \lambda F^{\theta_1,\theta_2}(u_1) + (1-\lambda) F^{\theta_1,\theta_2}(u_2)$. Since $(\theta_1\theta_2)_{\lambda u_1 + (1-\lambda)u_2} = \lambda (\theta_1\theta_2)_{u_1} + (1-\lambda)(\theta_1\theta_2)_{u_2}$, this follows from the strict convexity of the multivariate generator $F$; differentiability of $F^{\theta_1,\theta_2}$ follows from that of $F$. ∎
Then we define the multivariate Bregman chord divergence by applying the definition of the univariate Bregman chord divergence of Eq. 24 on these families of univariate Bregman generators:
$B_F^{\gamma,\delta}(\theta_1:\theta_2) := B_{F^{\theta_1,\theta_2}}^{\gamma,\delta}(0:1).$ (27)
Since $(01)_\gamma = \gamma$ and $(01)_\delta = \delta$, we get:
$B_F^{\gamma,\delta}(\theta_1:\theta_2) = F(\theta_1) - F\big((\theta_1\theta_2)_\gamma\big) + \frac{\gamma}{\delta-\gamma}\left(F\big((\theta_1\theta_2)_\delta\big) - F\big((\theta_1\theta_2)_\gamma\big)\right),$
in accordance with the univariate case. Since $(\theta_1\theta_2)_\delta = (\theta_1\theta_2)_\gamma + (\delta-\gamma)(\theta_2-\theta_1)$, we have the first-order Taylor expansion (as $\delta\to\gamma$):
$F\big((\theta_1\theta_2)_\delta\big) \simeq F\big((\theta_1\theta_2)_\gamma\big) + (\delta-\gamma)(\theta_2-\theta_1)^\top \nabla F\big((\theta_1\theta_2)_\gamma\big).$
Therefore, we have:
$B_F^{\gamma,\delta}(\theta_1:\theta_2) \simeq F(\theta_1) - F\big((\theta_1\theta_2)_\gamma\big) + \gamma(\theta_2-\theta_1)^\top \nabla F\big((\theta_1\theta_2)_\gamma\big).$
This proves that $\lim_{\delta\to\gamma} B_F^{\gamma,\delta}(\theta_1:\theta_2) = T_F^\gamma(\theta_1:\theta_2)$.
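A sketch of the multivariate case using the closed-form expression above (the log-sum-exp-type cumulant below is our illustrative non-separable choice):

```python
import math

def F(theta):
    """Non-separable multivariate generator: log-sum-exp-type cumulant
    F(theta) = log(1 + sum_i exp(theta_i))."""
    return math.log(1.0 + sum(math.exp(t) for t in theta))

def gradF(theta):
    z = 1.0 + sum(math.exp(t) for t in theta)
    return [math.exp(t) / z for t in theta]

def mix(t1, t2, a):
    """Interpolated parameter (t1 t2)_a = (1 - a) t1 + a t2 (component-wise)."""
    return [(1 - a) * x + a * y for x, y in zip(t1, t2)]

def bregman(t1, t2):
    g = gradF(t2)
    return F(t1) - F(t2) - sum((x - y) * gi for x, y, gi in zip(t1, t2, g))

def bregman_chord(t1, t2, g, d):
    """Multivariate Bregman chord divergence via the closed form
    B_F^{g,d} = F(t1) - F((t1 t2)_g) + g/(d - g) * (F((t1 t2)_d) - F((t1 t2)_g))."""
    Fg, Fd = F(mix(t1, t2, g)), F(mix(t1, t2, d))
    return F(t1) - Fg + (g / (d - g)) * (Fd - Fg)

t1, t2 = [0.5, -1.0, 2.0], [1.5, 0.5, -0.5]
full   = bregman(t1, t2)
approx = bregman_chord(t1, t2, 1 - 1e-6, 1.0)   # gradient-free approximation
print(0.0 <= bregman_chord(t1, t2, 0.3, 0.7) <= full)  # True
print(abs(approx - full))                               # tiny
```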
Notice that the Bregman chord divergence does not require computing the gradient $\nabla F$. The “slope term” in the definition is reminiscent of the $q$-derivatives [17] (quantum/discrete derivatives). However, the $q$-derivatives [17] are defined with respect to a single reference point while the chord definition requires two reference points.
5 Conclusion and perspectives
We geometrically designed a new class of distances using two scalar parameters, termed the Bregman chord divergence, and its one-parameter subfamily, the Bregman tangent divergences, which includes the ordinary Bregman divergence. This generalization allows one to easily fine-tune Bregman divergences in applications by smoothly adjusting one or two (scalar) knobs. Moreover, by choosing $\gamma = 1-\epsilon$ and $\delta = 1$ for small $\epsilon > 0$, the Bregman chord divergence closely lower bounds the Bregman divergence without requiring to compute the gradient (a different gradient-free approximation is the scaled skew Jensen divergence $\frac{1}{\alpha(1-\alpha)} J_F^\alpha$ for small $\alpha$). We expect that this new class of distances brings further improvements in signal processing and information fusion applications [28] (e.g., by tuning $\gamma$ or $\delta$). While the Bregman chord divergence defines an ordinate gap on the exterior of the epigraph, the Jensen chord divergence [22] defines the gap inside the epigraph of the generator. In future work, the information-geometric structure induced by the Bregman chord divergences (expectedly curved) shall be investigated from the viewpoint of gauge theory [21] and in contrast with the dually flat structures of Bregman manifolds [3].
Java™ Source code is available for reproducible research.
Acknowledgments
We express our thanks to Gaëtan Hadjeres (Sony CSL, Paris) for his careful proofreading and feedback.
Footnotes
 In information theory [10], the double bar notation ’$\|$’ has been used to avoid confusion with the comma ’,’ notation, used for example in the joint entropy $H(X,Y)$.
 Here, we use the word distance to mean a dissimilarity (or a distortion), not necessarily a metric distance [13]. A distance $D(x:y)$ satisfies $D(x:y)\geq 0$ with equality iff $x=y$.
 A metric distance is in general not smooth at coinciding arguments (e.g., $\theta_1\mapsto|\theta_1-\theta_2|$ is not differentiable at $\theta_1=\theta_2$).
 By analogy to the Jeffreys divergence that is the symmetrized KullbackLeibler divergence.
 By analogy to the JensenShannon divergence (JS).
 https://franknielsen.github.io/~nielsen/BregmanChordDivergence/
References
 Syed Mumtaz Ali and Samuel D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pages 131–142, 1966.
 Shun-ichi Amari. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Transactions on Information Theory, 55(11):4925–4931, 2009.
 Shun-ichi Amari. Information geometry and its applications. Springer, 2016.
 Arindam Banerjee, Xin Guo, and Hui Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7):2664–2669, 2005.
 Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of machine learning research, 6(Oct):1705–1749, 2005.
 Michèle Basseville. Divergence measures for statistical data processing: An annotated bibliography. Signal Processing, 93(4):621–633, 2013.
 Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
 Jacob Burbea and C. Rao. On the convexity of some divergence measures based on entropy functions. IEEE Transactions on Information Theory, 28(3):489–495, 1982.
 Andrzej Cichocki, Hyekyoung Lee, Yong-Deok Kim, and Seungjin Choi. Non-negative matrix factorization with α-divergence. Pattern Recognition Letters, 29(9):1433–1440, 2008.
 Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
 Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
 Imre Csiszár. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. The Annals of Statistics, 19(4):2032–2066, 1991.
 Michel Marie Deza and Elena Deza. Encyclopedia of distances. In Encyclopedia of Distances, pages 1–583. Springer, 2009.
 Cédric Févotte. Majorization-minimization algorithm for smooth Itakura-Saito nonnegative matrix factorization. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 1980–1983. IEEE, 2011.
 Johan Ludwig William Valdemar Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta mathematica, 30(1):175–193, 1906.
 Jiantao Jiao, Thomas A Courtade, Albert No, Kartik Venkat, and Tsachy Weissman. Information measures: the curious case of the binary alphabet. IEEE Transactions on Information Theory, 60(12):7616–7626, 2014.
 Victor Kac and Pokman Cheung. Quantum calculus. Springer Science & Business Media, 2001.
 Krzysztof C. Kiwiel. Proximal minimization methods with generalized Bregman functions. SIAM journal on control and optimization, 35(4):1142–1168, 1997.
 Minami Mihoko and Shinto Eguchi. Robust blind source separation by beta divergence. Neural computation, 14(8):1859–1886, 2002.
 Dharmendra S. Modha and W. Scott Spangler. Feature weighting in k-means clustering. Machine Learning, 52(3):217–237, 2003.
 Jan Naudts and Jun Zhang. Rho–tau embedding and gauge freedom in information geometry. Information Geometry, pages 1–37, 2018.
 Frank Nielsen. The chord gap divergence and a generalization of the Bhattacharyya distance. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2276–2280, April 2018.
 Frank Nielsen and Sylvain Boltz. The BurbeaRao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.
 Frank Nielsen and Richard Nock. Skew Jensen-Bregman Voronoi diagrams. In Transactions on Computational Science XIV, pages 102–128. Springer, 2011.
 Frank Nielsen and Richard Nock. On the geometry of mixtures of prescribed distributions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2861–2865. IEEE, 2018.
 M. C. Pardo and Igor Vajda. About distances of discrete distributions satisfying the data processing theorem of information theory. IEEE Transactions on Information Theory, 43(4):1288–1293, 1997.
 Matus Telgarsky and Sanjoy Dasgupta. Agglomerative Bregman clustering. In Proceedings of the 29th International Conference on International Conference on Machine Learning, pages 1011–1018. Omnipress, 2012.
 Murat Üney, Jérémie Houssineau, Emmanuel Delande, Simon J. Julier, and Daniel E. Clark. Fusion of finite set distributions: Pointwise consistency and global cardinality. CoRR, abs/1802.06220, 2018.
 Jun Zhang. Divergence function, duality, and convex analysis. Neural Computation, 16(1):159–195, 2004.