# Bregman-divergence-guided Legendre exponential dispersion model with finite cumulants (K-LED)

Hyenkyun Woo School of Liberal Arts, Korea University of Technology and Education, hyenkyun@koreatech.ac.kr, hyenkyun@gmail.com
###### Abstract

The exponential dispersion model is a useful framework in machine learning and statistics. Thanks to the additive structure of the model, its parameters, including the mean, can be estimated without difficulty. However, tight conditions on the cumulant function, such as analyticity, strict convexity, and steepness, restrict the class of exponential dispersion models. In this work, we present a relaxed exponential dispersion model, K-LED (Legendre exponential dispersion model with K cumulants). The cumulant function of the proposed model is a convex function of Legendre type having continuous partial derivatives of K-th order on the interior of a convex domain. Most of the K-LED models are developed via a Bregman-divergence-guided log-concave density function with coercivity shape constraints. The main advantage of the proposed model is that the first cumulant (or the mean parameter space) of the 1-LED model is easily computed through the extended global optimum property of Bregman divergence. An extended normal distribution is introduced as an example of 1-LED based on the Tweedie distribution. On top of that, we present 2-LED satisfying the mean-variance relation of the quasi-likelihood function. There is an equivalence between a subclass of quasi-likelihood functions and a regular 2-LED model, of which the canonical parameter space is open. A typical example is a regular 2-LED model with power variance function, i.e., the variance is proportional to a power of the mean of the observations. This model is equivalent to a subclass of β-divergence (or a subclass of quasi-likelihood functions with power variance function). Furthermore, a new parameterized K-LED model is proposed. The cumulant function of this model is the convex extended logistic loss function, which is generated by extended log and exp functions. The proposed model includes the Bernoulli distribution and the Poisson distribution, depending on the selection of parameters of the convex extended logistic loss function.

Exponential dispersion model, Generalized linear model, Exponential families, Log-concave density function, Bregman divergence, Tweedie distribution, Convex function of Legendre type, Quasi-likelihood function, Extended logistic loss function, Extended exponential function, Extended logarithmic function

## I Introduction

Various probability distributions, such as the normal, Poisson, gamma, and Bernoulli distributions, can be formulated as exponential families [4, 11, 29] with sufficient statistics by virtue of the Fisher-Neyman factorization theorem [22]. As a consequence of the additive structure of the exponential families, it is easy to estimate parameters of these distributions, such as the mean and variance. Numerous applications of the exponential families are introduced in [3, 26, 31, 34, 46]. For instance, [3] introduces a mixture model with regular exponential families, which have an equivalence with a subclass of Bregman divergence [10]. Although the exponential families have a useful additive structure, the class is restricted by strong assumptions on the cumulant function, expressed as shape constraints of the distribution, such as analyticity, strict convexity, and steepness. Recently, the log-concave density estimation method has been introduced [13, 14]. This is a typical non-parametric estimation method with a simple coercivity shape constraint, and it therefore leads to relatively accurate density estimates in lower-dimensional spaces. See [14, 43] for more details and related applications.

In this work, we are interested in relaxing the parameterized shape constraints of the exponential dispersion model [21], i.e., natural exponential families with an additional dispersion parameter. Inspired by [25], we propose a relaxed exponential dispersion model whose cumulant function is a K-times continuously differentiable convex function of Legendre type (K-LED: Legendre exponential dispersion model with K cumulants). The proposed K-LED model is established through the parameterized log-concave density function based on Bregman divergence associated with a convex function of Legendre type (or Legendre). The main advantage of the proposed model is that, by the extended global optimum property of Bregman divergence [1, 3], the parameterized log-concave density function developed via Bregman divergence associated with Legendre becomes the 1-LED model, which has a well-defined first cumulant (or mean parameter space). In Section III, we study in detail the construction of the 1-LED model based on the parameterized log-concave density function. For more details on the various properties of Bregman divergence, β-divergence, and related equivalences in machine learning, including classification, see [16, 36, 37]. For clustering (or segmentation) with Bregman divergence or generalized divergence, see [3, 27, 33, 34, 40].

Some probability distributions satisfy special relations between the mean and the variance, such as the quadratic variance function [30] and the power variance function [5, 21, 44]. Let $\mu$ and $\mathrm{var}(b)$ be the mean and variance of observations $b$. Then only six probability distributions (normal, Poisson, gamma, binomial, negative binomial, and generalized hyperbolic secant) among exponential dispersion models have a quadratic variance function, i.e., a variance that is a quadratic polynomial of the mean. See also [35] for the generalized quadratic variance, known as finitely generated cumulants via a polynomial recurrence relation between the first and the second cumulant. Although it is not a standard probability distribution with an analytic cumulant generating function, there is a relaxed (quasi-)probability distribution defined only by the mean and variance, with a quasi-likelihood function

$$Q(b;\mu) = -\int_{\mu}^{b} \frac{b - x}{\sigma^{2} V(x)}\, dx \tag{1}$$

where $\sigma^2$ is a dispersion parameter and $V$ is a unit variance function satisfying the mean-variance relation $\mathrm{var}(b) = \sigma^2 V(\mu)$ [29, 47]. Instead of the mean-variance relation, the 2-LED model is constructed by using a relation between the first and the second cumulant. The equivalence between a subclass of quasi-likelihood functions and the regular 2-LED model satisfying the mean-variance relation is studied in Section IV. See also [25] for more details. A typical example of 2-LED is the Tweedie distribution [44], which has the power variance function:

$$\mathrm{var}(b) = \sigma^{2} \mu^{2-\beta} \tag{2}$$

This distribution includes various probability distributions, such as the normal distribution ($\beta = 2$), Poisson distribution ($\beta = 1$), compound Poisson-gamma distribution ($\beta \in (0,1)$), gamma distribution ($\beta = 0$), and inverse Gaussian distribution ($\beta = -1$). Note that the inverse Gaussian distribution is a non-regular exponential dispersion model having a non-open canonical parameter space. Interestingly, on the boundary of the canonical parameter space, this distribution becomes the Levy distribution, which does not have a corresponding mean parameter space [4, 21]. Thus, the structure of the Tweedie distribution is rather complicated. Besides, because the analyticity of the cumulant generating function (or moment generating function) and the requirement (2) must hold at the same time, the classic Tweedie distribution is not an exponential dispersion model when $\beta \in (1,2)$ [5]. These strict constraints are relaxed in the proposed K-LED model with (2).
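The power variance function (2) can be checked numerically on familiar members of the family. The following is a minimal sketch, assuming the usual identification that follows from (2): the Poisson variance equals the mean (so $\beta = 1$ with $\sigma^2 = 1$), the gamma variance is proportional to the squared mean ($\beta = 0$), and the normal variance does not depend on the mean ($\beta = 2$). The helper name `power_variance`, the sample size, and the parameter values are illustrative assumptions, not part of the original text.

```python
import numpy as np

def power_variance(mu, beta, sigma2=1.0):
    """Power variance function (2): var(b) = sigma^2 * mu^(2 - beta)."""
    return sigma2 * mu ** (2.0 - beta)

rng = np.random.default_rng(0)
n, mu, sigma2 = 200_000, 3.0, 0.5

# Each entry: (samples, beta, dispersion sigma^2); all three have mean mu.
samples = {
    # Poisson: beta = 1 with sigma^2 = 1, so the variance equals the mean.
    "poisson": (rng.poisson(lam=mu, size=n), 1.0, 1.0),
    # Gamma: beta = 0; shape 1/sigma^2 and scale mu*sigma^2 give mean mu
    # and variance sigma^2 * mu^2.
    "gamma": (rng.gamma(shape=1.0 / sigma2, scale=mu * sigma2, size=n), 0.0, sigma2),
    # Normal: beta = 2; the variance sigma^2 does not depend on the mean.
    "normal": (rng.normal(loc=mu, scale=np.sqrt(sigma2), size=n), 2.0, sigma2),
}

report = {}
for name, (draws, beta, s2) in samples.items():
    report[name] = (draws.var(), power_variance(mu, beta, s2))
    print(f"{name:8s} empirical var = {report[name][0]:.3f}, "
          f"sigma^2 * mu^(2-beta) = {report[name][1]:.3f}")
```

The empirical variances should agree with $\sigma^2\mu^{2-\beta}$ up to sampling error; the sketch says nothing about the range of $\beta$ for which no classic Tweedie model exists.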

Concerning β-divergence (or $\beta$ in (2) of the Tweedie distribution), it is not easy to directly use the Tweedie distribution for the estimation of $\beta$, since it is not defined for all $\beta$. Recently, [16] proposed an augmented exponential dispersion model (EDA). By means of an additional augmentation function, the domain of EDA is moved away from the boundary of the domain of the classic Tweedie distribution, and thus it is possible to estimate $\beta$ in a more natural way. However, the domain of EDA is limited to the positive region, which reduces the applicability of the model. The β-divergence with has several interesting applications. A typical one is that β-divergence in this region is used as a robustified Kullback-Leibler divergence [7, 17]. It gives a robust distance between two probability distributions. For instance, it was used for a robust spatial filter of noisy EEG data [42]. For more details on the robustness of β-divergence, see [7, 12, 17, 49]. Moreover, in our previous works, this region is used for cutting-edge classification models: (1) H-Logitron [50], which has a high-order hinge loss with a stabilizer, and (2) the Bregman-Tweedie classification model [51], which is developed from Bregman-Tweedie divergence (see also (38) in the Appendix). This classification model is an unbounded extended logistic loss function, including the unhinge loss function [45]. Besides, the convex extended logistic loss function, which lies between the logistic loss and the exponential loss, is an analytic convex function of Legendre type and thus can be used as a cumulant function of the K-LED model, which connects the Bernoulli distribution and the Poisson distribution. The details are studied in Section IV-C. Last but not least, the extended logistic loss function is composed of the extended elementary functions, that is, the extended exponential function and the extended logarithmic function. For more details on these functions and related applications in machine learning, see [49, 50, 51] and the Appendix.

The article is organized as follows. Section II summarizes various properties of Legendre (or a convex function of Legendre type) and Bregman divergence associated with Legendre, which are essential ingredients in the following Sections. Section III introduces the K-LED model, i.e., the Legendre exponential dispersion model with K cumulants. This model is developed from Bregman divergence associated with Legendre. The proposed Bregman-divergence-guided K-LED model inherently has the first cumulant, and thus it has the corresponding mean parameter space. For more details on the fundamental structure of exponential families, including the exponential dispersion model, see [4, 11, 46]. Section IV studies the connection between the 2-LED model and the quasi-likelihood function based on the mean-variance relation $\mathrm{var}(b) = \sigma^2 V(\mu)$. We also introduce the 2-LED model with power variance function (2), and the K-LED model whose cumulant function is the convex extended logistic loss function. We give our conclusions in Section V.

### I-A Notation

Let where and . , , , and . is a set of integer and . From [49], is classified as

Integration, multiplication, and division are performed component-wise. is all convex combinations of the elements of a set .

For a function , means that has continuous partial derivatives of -th order on a convex set . Let be a lower semicontinuous, convex, and proper function. Then the domain of is defined as . This is known as the effective domain [38]. In this work, we always assume that is a convex set, irrespective of convexity of . Note that is the interior of and is the interior of relative to its affine hull, the smallest affine set including . Hence, the relative interior coincides with when the affine hull of is . For this reason, we assume , unless otherwise stated. is the closure of . is the boundary of . As observed in [23], the convexity of can be extended to by using the extended-valued real number system . That is, : where or any other convex set for various purpose. If is not convex then we use the extended-valued real number system . See [50] for arithmetical operations in . Let be the expectation of observations . For simplicity, we use .

## II Preliminaries

This Section introduces some useful properties of a convex function of Legendre type, the corresponding Bregman divergence, and log-concave density functions. For more details on these, see [2, 4, 8, 23, 38] and references therein.

### II-A A convex function of Legendre type

###### Theorem II.1.

Let be lower semicontinuous, convex, and proper function on . Then satisfies the following relation

$$\operatorname{int}(\operatorname{dom} f) \subseteq \operatorname{dom} \partial f \subseteq \operatorname{dom} f \tag{3}$$

where is a convex set and thus is also convex [38, Th 6.2]. Note that where is a subgradient of at .

As noticed in [38], is not necessarily convex, though and are convex. Now, we define a convex function of Legendre type [8, 38].

###### Definition II.2.

Let be lower semicontinuous, convex, and proper function on . Then is a convex function of Legendre type (or Legendre), if the following conditions are satisfied.

• and

• is strictly convex on

• (steepness) and

$$\lim_{t \downarrow 0} \langle \nabla f(x + t(y - x)),\, y - x \rangle = -\infty \tag{4}$$

For simplicity, let us denote a class of convex functions of Legendre type as

 Ln={f:domf⊆Rn→R|f is Legendre}

Here, (4) is known as the steepness condition in statistics [4, 11]. The following theorem [8, 38] is useful when we characterize the Legendre exponential dispersion model with K cumulants (K-LED).

###### Theorem II.3.

if and only if , where is the conjugate function of . The corresponding gradient

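$$\nabla f : \operatorname{int}(\operatorname{dom} f) \to \operatorname{int}(\operatorname{dom} f^{*}) : x \mapsto \nabla f(x) \tag{5}$$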

is a topological isomorphism with inverse mapping $(\nabla f)^{-1} = \nabla f^{*}$.
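To make the gradient map (5) and the steepness condition (4) concrete, the sketch below uses one specific Legendre function chosen purely for illustration: $f(x) = x\log x - x$ on $[0,\infty)$, whose conjugate is $f^*(\theta) = e^{\theta}$ on the whole real line. This choice is an assumption of the example; the function names are hypothetical. The code checks that $\nabla f^*$ inverts $\nabla f$, evaluates the conjugate by a brute-force supremum, and shows the gradient diverging to $-\infty$ at the boundary point $0$.

```python
import numpy as np

def f(x):
    """Illustrative Legendre function f(x) = x log x - x on (0, inf)."""
    return x * np.log(x) - x

def grad_f(x):
    return np.log(x)

def f_conj(theta):
    """Conjugate f*(theta) = sup_x (x*theta - f(x)) = exp(theta)."""
    return np.exp(theta)

def grad_f_conj(theta):
    return np.exp(theta)

# Round trip: grad f* undoes grad f on the interior of dom f.
x = np.linspace(0.1, 5.0, 50)
roundtrip = grad_f_conj(grad_f(x))

# Brute-force conjugate at one point, compared with the closed form.
theta0 = 0.7
grid = np.linspace(1e-6, 50.0, 500_001)
numeric_conj = np.max(grid * theta0 - f(grid))

# Steepness: the gradient decreases without bound as x approaches 0.
steep = grad_f(np.array([1e-2, 1e-4, 1e-8]))

print(np.max(np.abs(roundtrip - x)), numeric_conj, f_conj(theta0), steep)
```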

The coercivity of Legendre is useful when we characterize the Tweedie distribution [5, 21, 44] and log-concave density functions [13].

###### Theorem II.4.

Let , then the following are equivalent:

1. is coercive, i.e.,

2. There exists such that , for all

For the proof of Theorem II.4, see [4, Th. 6.1] and [9, Prop. 14.16].

###### Definition II.5.

Let and , where and is an appropriate continuous Lebesgue (or discrete counting) measure on . Then is a log-concave (probability) density function.

See [4, 9, 11, 23, 38] for other useful properties of convex functions and their applications in statistics.

### II-B Bregman divergence associated with Legendre

Consider Bregman divergence associated with :

$$D_f(x|y) = f(x) - f(y) - \langle x - y, \nabla f(y) \rangle \tag{6}$$

where and . It also is formulated with the conjugate function as . As observed in information geometry [1, 2], Bregman divergence (6) is related to the canonical divergence. Actually, it includes various divergences; (1) Itakura-Saito divergence with . This divergence is induced from gamma distribution [19, 48]. (2) Generalized Kullback-Leibler divergence (I-divergence) with . This generalized distance is induced from Poisson distribution [18, 40]. (3) -distance with . This distance can be easily derived from normal distribution.
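The three special cases listed above can be reproduced directly from definition (6). The sketch below is a minimal illustration, assuming the standard textbook generators $f(x) = -\log x$ (Itakura-Saito), $f(x) = x\log x - x$ (generalized Kullback-Leibler), and $f(x) = \tfrac{1}{2}x^2$ (half squared Euclidean distance), applied component-wise and summed. These generator choices and all names are assumptions of the example and may differ from the normalization used in the text.

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """Bregman divergence (6), summed over components for a separable f."""
    return float(np.sum(f(x) - f(y) - (x - y) * grad_f(y)))

# (generator f, gradient of f); assumed standard choices.
generators = {
    "itakura_saito": (lambda x: -np.log(x), lambda x: -1.0 / x),
    "i_divergence": (lambda x: x * np.log(x) - x, lambda x: np.log(x)),
    "half_squared_l2": (lambda x: 0.5 * x**2, lambda x: x),
}

# Closed forms obtained by expanding (6) by hand for each generator.
closed_forms = {
    "itakura_saito": lambda x, y: float(np.sum(x / y - np.log(x / y) - 1.0)),
    "i_divergence": lambda x, y: float(np.sum(x * np.log(x / y) - x + y)),
    "half_squared_l2": lambda x, y: float(0.5 * np.sum((x - y) ** 2)),
}

x = np.array([0.5, 1.0, 2.0])
y = np.array([1.5, 0.7, 2.5])
values = {name: bregman(f, g, x, y) for name, (f, g) in generators.items()}

for name, value in values.items():
    print(f"{name:16s} {value:.6f}  (closed form {closed_forms[name](x, y):.6f})")
```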

In addition, we summarize several useful properties of Bregman divergence associated with Legendre. See [8, 40] for more details.

###### Theorem II.6.

Let and . Then Bregman divergence associated with satisfies the following properties.

1. is strictly convex with respect to on .

2. is coercive with respect to , for all .

3. is coercive with respect to , for all if and only if is open.

4. if and only if    where

5. For all ,

6. (Global optimum property [3]) Let and , then for all , we have .

The global optimum property in Theorem II.6 (6) holds irrespective of the convexity of Bregman divergence in its second variable. However, if Bregman divergence has an additional regularization term on the second variable, this property is no longer satisfied. See [27, 34, 39, 49] for more details on the regularized Bregman divergence and its applications in image processing. The following canonical divergence (a reformulation of Bregman divergence associated with Legendre) is helpful for characterizing the Bregman-divergence-based probability distribution and its relation to the K-LED model.
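The global optimum property can be observed numerically: the average Bregman divergence from a sample to a point $y$ is minimized when $y$ is the sample mean. The sketch below is an illustration under assumptions of our own choosing: it uses the Itakura-Saito divergence (generator $f(x) = -\log x$), which is not convex in its second variable, gamma-distributed synthetic data, and a plain grid search instead of an optimizer.

```python
import numpy as np

rng = np.random.default_rng(1)
b = rng.gamma(shape=2.0, scale=1.5, size=1000)  # synthetic positive data

def itakura_saito(b, y):
    """Bregman divergence D_f(b|y) for the assumed generator f(x) = -log x."""
    return b / y - np.log(b / y) - 1.0

# Average divergence to each candidate y on a grid with step 0.001.
grid = np.linspace(0.5, 8.0, 7501)
expected_div = itakura_saito(b[None, :], grid[:, None]).mean(axis=1)
y_best = grid[np.argmin(expected_div)]

print(f"grid minimizer = {y_best:.3f}, sample mean = {b.mean():.3f}")
```

The grid minimizer should coincide with the sample mean up to the grid resolution, even though the objective is not convex in $y$.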

###### Theorem II.7 (Extended global optimum property [1]).

Let , , and . Then the following canonical divergence

$$d_f(x;\theta) := D_f(x|\nabla f^{*}(\theta)) = f(x) + f^{*}(\theta) - \langle x, \theta \rangle \tag{7}$$

is strictly convex with respect to the canonical parameter . Consider the minimization problem:

$$\hat{\theta} = \arg\min_{\theta \in \operatorname{int}(\operatorname{dom} f^{*})} d_f(x;\theta)$$

The solution of the above minimization problem exists within an extended-valued real number system

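$$\hat{\theta} = \begin{cases} \nabla f(x) & \text{if } x \in \operatorname{int}(\operatorname{dom} f) \\ \pm\infty & \text{if } x \in \operatorname{dom} f \setminus \operatorname{int}(\operatorname{dom} f) \end{cases}$$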

Additionally, the global optimum property in Theorem II.6 (6) is extended to .

###### Proof.

Let then it is trivial that . Consider . Let where and as . Then . Thus, as , and (or ). Regarding the global optimum property, from , we have where for all and for all .
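As a worked illustration of the two cases in Theorem II.7, consider the Legendre function $f(x) = x\log x - x$ on $[0,\infty)$ with $f(0) = 0$, whose conjugate is $f^{*}(\theta) = e^{\theta}$. This particular $f$ is our own choice for the example and is not taken from the text. For an interior point $x > 0$, the canonical divergence (7) and its stationarity condition read

$$d_f(x;\theta) = x\log x - x + e^{\theta} - x\theta, \qquad \frac{\partial}{\partial\theta} d_f(x;\theta) = e^{\theta} - x = 0 \;\Longrightarrow\; \hat{\theta} = \log x = \nabla f(x).$$

For the boundary point $x = 0$, we instead obtain

$$d_f(0;\theta) = e^{\theta}, \qquad \inf_{\theta \in \mathbb{R}} d_f(0;\theta) = 0 \ \text{ approached only as } \theta \to -\infty,$$

so the minimizer exists only in the extended-valued sense, $\hat{\theta} = -\infty$, which agrees with $\lim_{x \downarrow 0} \nabla f(x) = \lim_{x \downarrow 0} \log x = -\infty$.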

Theorem II.7 is useful when we analyze the Tweedie distribution, such as the zero-inflated compound Poisson-gamma distribution ($\beta \in (0,1)$) and the inverse Gaussian distribution ($\beta = -1$). The following example shows that it is possible to build a parameterized log-concave density function with a mean parameter space which is induced from Bregman divergence associated with Legendre.

###### Example II.8 (Bregman-divergence-guided log-concave density function).

Assume that observations , , , , and . From Theorem II.6 (2), it is easy to check the coercivity of Bregman divergence associated with Legendre for all . Hence, we have the corresponding log-concave density function (see Definition II.5).

$$p_f(b;\theta) = \exp(-d_f(b;\theta))\, p_0(b) = \exp(\langle b, \theta \rangle - f^{*}(\theta))\, p_1(b) \tag{8}$$

where is a base measure with an appropriate satisfying . Consider the corresponding likelihood function and a minimization problem with the negative log-likelihood function:

$$\hat{\theta}_{avg} = \arg\min_{\theta \in \operatorname{dom} f^{*}} \sum_{i=1}^{M} d_f(b_i;\theta) = M d_f(b_{avg};\theta) + h(b) \tag{9}$$

where and . Due to Theorem II.7, the solution of (9) becomes

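$$\hat{\theta}_{avg} = \begin{cases} \nabla f(b_{avg}) & \text{if } b_{avg} \in \Omega \\ \pm\infty & \text{if } b_{avg} \in \Omega^{c} \end{cases}$$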

Since , we have a bijective map

$$\nabla f : \Omega \cup \Omega^{c} \to \Omega^{*} \cup \{\pm\infty\}$$

where is the mean parameter space and is the canonical parameter space. See also [1]. In fact, (8) becomes a natural exponential family with the mean parameter space if is a cumulant function.
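The estimator in Example II.8 can be traced numerically. The sketch below is an illustration under explicit assumptions: the generator is $f(x) = x\log x - x$ with $f^{*}(\theta) = e^{\theta}$ (a Poisson-type choice of ours), the data are synthetic counts, $b_{avg}$ is taken to be the sample average, and the remainder term is taken as $h(b) = \sum_i f(b_i) - M f(b_{avg})$, which is what a direct expansion of the sum in (9) gives. The code verifies the decomposition in (9) on a grid of $\theta$ values and compares the grid minimizer with $\nabla f(b_{avg}) = \log b_{avg}$.

```python
import numpy as np

def f(x):
    """f(x) = x log x - x with the convention 0 log 0 = 0."""
    x = np.asarray(x, dtype=float)
    safe = np.where(x > 0, x, 1.0)          # avoids log(0) warnings
    return np.where(x > 0, x * np.log(safe), 0.0) - x

def f_conj(theta):
    return np.exp(theta)

def grad_f(x):
    return np.log(x)

def d_f(b, theta):
    """Canonical divergence (7): f(b) + f*(theta) - b*theta."""
    return f(b) + f_conj(theta) - b * theta

rng = np.random.default_rng(2)
b = rng.poisson(lam=4.0, size=500).astype(float)
M, b_avg = b.size, b.mean()
h = float(f(b).sum() - M * f(b_avg))        # assumed form of h(b)

thetas = np.linspace(-1.0, 3.0, 4001)       # grid step 0.001
total = np.array([d_f(b, t).sum() for t in thetas])
decomposed = M * d_f(b_avg, thetas) + h

theta_hat_grid = thetas[np.argmin(total)]
theta_hat = grad_f(b_avg)
print(f"grid minimizer = {theta_hat_grid:.3f}, grad f(b_avg) = {theta_hat:.3f}")
```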

### II-C β-divergence and quasi-likelihood function with power variance function

Compared to Bregman divergence associated with Legendre, β-divergence [7, 17] is less structured and is tightly connected to the quasi-likelihood function (1) with power variance function (2). Formally, β-divergence is defined as

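$$D_{\beta}(b|u) = \int_{u}^{b} x^{\beta-2}(b - x)\, dx = \begin{cases} \dfrac{b}{u} - \ln\!\left(\dfrac{b}{u}\right) - 1, & \text{if } \beta = 0 \\[1ex] b \ln\!\left(\dfrac{b}{u}\right) - (b - u), & \text{if } \beta = 1 \\[1ex] \dfrac{b}{\beta-1}\left(b^{\beta-1} - u^{\beta-1}\right) - \dfrac{1}{\beta}\left(b^{\beta} - u^{\beta}\right), & \text{otherwise} \end{cases} \tag{10}$$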

where is the domain of β-divergence. As observed in [49], non-negativity of the range of β-divergence is guaranteed under the assumption . Note that the mathematical formulation of the quasi-likelihood function with power variance function (2) is equal to that of β-divergence, i.e., where . However, due to an additional condition on the variance function , the equivalence does not always hold on the domain of β-divergence. As observed in [49], β-divergence can be reformulated as Bregman-beta divergence (39) under restriction of the domain to where is defined in (40). In fact, the regular Tweedie distribution [21, 44] is developed based on Bregman-beta divergence (39). The details are dealt with in Section IV-A.
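The closed form in (10) can be compared against a direct numerical evaluation of the defining integral. The sketch below is a minimal check, assuming positive scalar arguments $b, u > 0$ and a simple midpoint rule; the function names and the test values are illustrative assumptions.

```python
import numpy as np

def beta_divergence(b, u, beta):
    """Closed form of the beta-divergence (10) for scalars b, u > 0."""
    if beta == 0:
        return b / u - np.log(b / u) - 1.0
    if beta == 1:
        return b * np.log(b / u) - (b - u)
    return (b / (beta - 1.0)) * (b ** (beta - 1.0) - u ** (beta - 1.0)) \
        - (b ** beta - u ** beta) / beta

def beta_divergence_numeric(b, u, beta, n=200_000):
    """Midpoint-rule evaluation of the integral of x^(beta-2) (b - x) from u to b."""
    edges = np.linspace(u, b, n + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    return float(np.sum(mid ** (beta - 2.0) * (b - mid)) * (b - u) / n)

betas = [-1.0, 0.0, 0.5, 1.0, 1.5, 2.0]
pairs = [(2.0, 3.0), (3.0, 2.0)]
table = {
    (b, u, beta): (beta_divergence(b, u, beta), beta_divergence_numeric(b, u, beta))
    for (b, u) in pairs for beta in betas
}
for key, (closed, numeric) in table.items():
    print(key, f"closed = {closed:.8f}, numeric = {numeric:.8f}")
```

For $\beta = 2$ the closed form reduces to $\tfrac{1}{2}(b - u)^2$, which offers a quick sanity check of the "otherwise" branch.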

## III K-LED: Legendre exponential dispersion model with K cumulants and K≥1

This Section presents the K-LED model, Legendre exponential dispersion model with K cumulants, derived from Bregman divergence associated with Legendre in (7).

$$p_{\psi}(b;\theta) = \exp(\langle b, \theta \rangle - \psi(\theta))\, p_1(b) \tag{11}$$

where is an observation (or a random vector) and is a canonical parameter. For all , if (11) is uniquely determined then it is full. Note that it is regular if is open, and non-regular if is not open. Here, is a set of random vectors and . The minimal condition of (11) means that is not constant for any non-zero . When is replaced by a sufficient statistic , (11) becomes the traditional exponential families. For simplicity, we only consider exponential dispersion model [21] (i.e., natural exponential families with an additional dispersion parameter).

From , we have

$$\psi(\theta) = \log \int_{B} \exp(\langle b, \theta \rangle)\, p_1(b)\, \nu(db), \tag{12}$$

where is an appropriate continuous Lebesgue (or discrete counting) measure depending on . Note that in (12) is a cumulant function (or log-partition function) of and analytic on its interior of the domain. Additionally, under the minimality condition of (11), it is not difficult to show that is strictly convex [11, 46]. The main advantage of (11) is that we can easily obtain mean (first cumulant), covariance (second cumulant), or even higher order cumulants of observations from the cumulant generating function where and is the moment generating function. For instance, and , where . In this way, a cumulant generating function uniquely determines a probability distribution within minimal natural exponential families [4]. Due to and the additional condition  [3], is known as the mean parameter space and is known as the canonical parameter space  [1, 46]. As described in [3, Theorem 3] and [4, Theorem 9.1, 9.2], the mean parameter space and a set of observations satisfies the following condition

 int(domψ∗)=int(B)andB⊆domψ∗⊆B (13)

where . In this work, we assume that (13) is always true, unless otherwise stated. Note that, from (3), we have . If then we can not obtain unique mean and variance on the set . Therefore, the condition is highly demanded. Actually, this is achieved through the steepness condition (4) of . However, a function in is not always analytic on its interior of the domain, and thus it may not become a cumulant function of minimal natural exponential families. For instance, consider a convex function of Legendre type. The domain of this function is and for all . However, we have (= or ). There is domain inconsistency depending on the order of differentiability.

As observed in Example II.8, Bregman-divergence-guided log-concave density function naturally has the first cumulant or the mean parameter space. Hence, if the conjugate of a base function of Bregman divergence satisfies mean-variance relation, such as power variance function (2), then we have a log-concave density function with the first and the second cumulants. In this way, we can build up the relaxed exponential dispersion model with finite cumulants.

###### Definition III.1 (K-LED).

Let be an observation with (13). Consider with and . Then Legendre exponential dispersion model with K cumulants (K-LED) is defined as

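$$p_{\psi}(b;\theta,\sigma^{2}) = \exp\!\left(\frac{\langle b, \theta \rangle - \psi(\theta)}{\sigma^{2}}\right) p_1(b,\sigma^{2}) \tag{14}$$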

where is a dispersion parameter, is the canonical parameter space, and is the mean parameter space. is a base measure satisfying . Note that is regular, if is open and non-regular, if is not open. In addition, is a cumulant generating function up to K-th cumulant. Here, , for all .

###### Remark III.2.
1. If is uniquely determined for all then (14) is the full K-LED model for a given .

2. In case of , the K-LED model (14) is degenerate at .

3. Let then (14) can be reformulated as . This becomes the additive K-LED model corresponding to the classic additive exponential dispersion model [21]. Additionally, let in (14) be a constant, , and . Then, we get a density function in natural exponential families:

$$p_{\psi_1}(b;\theta_1) = \exp(\langle b, \theta_1 \rangle - \psi_1(\theta_1))\, \bar{p}_1(b). \tag{15}$$

Then the mean and variance of the K-LED model (14) are given as and , where . In this way, (15) is known as a density function of the scaled exponential families. See [20, 26] for more details on the scaled exponential families for Dirichlet process mixture model where is regularized to control accuracy of the density estimation.
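The role of the dispersion parameter in (14) can be illustrated numerically. For an exponential dispersion model of the form (14), the standard identities are that the mean equals $\nabla\psi(\theta)$ and the variance equals $\sigma^2\nabla^2\psi(\theta)$. The sketch below checks these identities under assumptions of our own: a gamma-type cumulant function $\psi(\theta) = -\log(-\theta)$ for $\theta < 0$, a base measure proportional to $b^{1/\sigma^2 - 1}$ on $b > 0$, finite differences for the derivatives of $\psi$, and a Riemann sum for the moments. None of these specific choices is taken from the text.

```python
import numpy as np

def psi(theta):
    """Assumed gamma-type cumulant function, defined for theta < 0."""
    return -np.log(-theta)

theta, sigma2 = -0.5, 0.25

# First and second derivatives of psi by central finite differences.
step = 1e-4
mean_from_psi = (psi(theta + step) - psi(theta - step)) / (2.0 * step)
var_from_psi = sigma2 * (psi(theta + step) - 2.0 * psi(theta) + psi(theta - step)) / step**2

# Density (14) up to normalization, with the assumed base measure b^(1/sigma2 - 1).
b = np.linspace(1e-6, 60.0, 600_001)
weights = np.exp((b * theta - psi(theta)) / sigma2) * b ** (1.0 / sigma2 - 1.0)
weights /= weights.sum()                     # normalize on the grid

mean_quad = float(np.sum(b * weights))
var_quad = float(np.sum((b - mean_quad) ** 2 * weights))

print(f"mean: quadrature {mean_quad:.5f} vs grad psi {mean_from_psi:.5f}")
print(f"var:  quadrature {var_quad:.5f} vs sigma^2 * hess psi {var_from_psi:.5f}")
```

With these values the model is a gamma distribution with mean $-1/\theta = 2$ and variance $\sigma^2/\theta^2 = 1$, so both pairs of numbers should agree closely.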

As commented in Section II-C and [49], there is a partial equivalence between a subclass of β-divergence and the quasi-likelihood function with power variance function (2). Hence, it is natural to consider the relation between Bregman divergence associated with Legendre and the proposed K-LED model (14). In the case of K-LED, the minimum requirement is the existence of the first cumulant, i.e., the mean parameter space. As noticed in Example II.8, this requirement can be satisfied by Bregman divergence associated with Legendre.

###### Theorem III.3.

Let , , and . Assume that and are open. Then is a parameterized log-concave density function with the first cumulant for all . Here, is a constant and is a base measure satisfying . That is, is the -LED model.

###### Proof.

By Theorem II.6 (2), it is easy to check coercivity of with respect to . Thus, is a log-concave density function with an appropriate base measure satisfying (see Theorem II.4 (4)). Regarding the existence of the first cumulant, from Theorem II.7, we have

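$$\begin{aligned} \nabla\psi^{*}(E(b)) &= \arg\min_{\theta \in \operatorname{dom}\psi} E\big(d_{\psi^{*}}(b;\theta)\big) \\ &= \arg\min_{\theta \in \operatorname{dom}\psi} d_{\psi^{*}}(E(b);\theta) + h(b) \end{aligned} \tag{16}$$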

where . Hence, for any mean value , there is always corresponding unique canonical parameter .

Typical examples of Theorem III.3 are the gamma and normal distributions. For gamma distribution, and . For normal distribution, . An extension of the normal distribution, which has a constant unit variance function, is not straightforward. In the following example, we introduce an extended normal distribution via power variance function (2). This distribution is induced from Bregman-beta divergence  (39) or Bregman-Tweedie divergence  (38), which were introduced in [49, 51].

###### Example III.4 (Extended normal distribution).

Let and . Then where and . Therefore, we have . Consider Bregman-beta divergence

$$d_{\Phi}(b;\theta) = \Phi(b) + \Psi(\theta) - \langle b, \theta \rangle$$

From , and are coercive and thus is a log-concave density function. From Theorem III.3, we have an -LED model

$$p_{\Psi}(b;\theta,\sigma^{2}) = \exp\!\left(-d_{\Phi}(b;\theta)/\sigma^{2}\right) p_0(b,\sigma^{2})$$

with the first cumulant function where , and a base measure satisfying . When , we obtain the famous normal distribution with . Hence, the -LED model is an extended normal distribution satisfying power variance function (2). When , this distribution does not have the classic cumulant generating function [5]. Actually, if then and thus is the -LED model. However, if then with depending on . Since , as commented in Remark III.2 (3), is a degenerate -LED model. See Figure 1 for this extended normal distribution.

Now, consider the case that or are not open. In other words, we consider the regular K-LED model whose non-trivial mean parameter space is not open, and the non-regular K-LED model, whose canonical parameter space is not open. A typical example of the regular K-LED model is the Bernoulli distribution, which has discrete random variables. That is, and (see (13)).

###### Theorem III.5.

Let with (13), , , and be a constant. (a) Assume that is open and is not open. Then is a parameterized log-concave density function with the first cumulant in an extended-valued real number system. Here, is a base measure satisfying . Hence, is the -LED model. (b) Assume that is not open and is open. Then is a parameterized log-concave density function with the first cumulant only when . That is, is a non-full -LED model with . Here, is a base measure satisfying .

###### Proof.

(a) In Theorem III.3, it is treated the case and is open. Consider and let . Since is open, it is not difficult to see that is a log-concave density function for all even is not open (see Theorem II.6 (2)). Now, we only need to show the existence of a function between the mean parameter space and the canonical parameter space in an extended-valued real number system. Let us consider (16): From Theorem II.7, when , we have . It means that, due to the steepness condition of Legendre, there is a sequence such that as . Therefore, we have and thus exists within an extended-valued real number system for all . In consequence, is a cumulant function of the log-concave density function having the first cumulant, i.e., is in the -LED model.

(b) Consider with and . In Theorem III.3, the case is treated. Let . As done in (a), there is a sequence with such that satisfying Therefore, we have where and . That is, in an extended-valued real number system. However, from [9, Th. 14.17], when , is not coercive in terms of . Therefore, is a log-concave density function and a non-full -LED model only when .

###### Example III.6.

Consider Bernoulli distribution. Let and , then we have Bregman divergence associated with Legendre : where and . Thus, we have Bregman-divergence-guided log-concave density function where . Thus, the corresponding maximum likelihood estimation is

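$$\max_{\theta} \prod_{i=1}^{M} p(b_i;\theta,1) = p(b) \exp\!\left(\sum_{i=1}^{M} b_i \log\mu + (1 - b_i)\log(1 - \mu)\right) = p(b) \exp\!\left(\Big\langle \sum_{i=1}^{M} b_i, \theta \Big\rangle - M\psi^{*}(\theta)\right).$$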

where . Then