Bregman-divergence-guided Legendre exponential dispersion model with finite cumulants (K-LED)
Abstract
The exponential dispersion model is a useful framework in machine learning and statistics. Thanks to the additive structure of the model, parameters such as the mean can be estimated without difficulty. However, tight conditions on the cumulant function, such as analyticity, strict convexity, and steepness, reduce the class of exponential dispersion models. In this work, we present a relaxed exponential dispersion model, K-LED (Legendre exponential dispersion model with K cumulants). The cumulant function of the proposed model is a convex function of Legendre type having continuous partial derivatives of Kth order on the interior of a convex domain. Most K-LED models are developed via a Bregman-divergence-guided log-concave density function with a coercivity shape constraint. The main advantage of the proposed model is that the first cumulant (or the mean parameter space) of the LED model is easily computed through the extended global optimum property of Bregman divergence. An extended normal distribution is introduced as an example of a 1-LED model based on the Tweedie distribution. On top of that, we present an LED model satisfying the mean-variance relation of a quasi-likelihood function. There is an equivalence between a subclass of quasi-likelihood functions and a regular LED model, of which the canonical parameter space is open. A typical example is a regular LED model with a power variance function, i.e., the variance is proportional to a power of the mean of the observations. This model is equivalent to a subclass of beta-divergence (or a subclass of quasi-likelihood functions with power variance functions). Furthermore, a new parameterized K-LED model is proposed. The cumulant function of this model is the convex extended logistic loss function, which is generated by extended log and exp functions. The proposed model includes the Bernoulli distribution and the Poisson distribution, depending on the selection of parameters of the convex extended logistic loss function.
I Introduction
Various probability distributions, such as the normal, Poisson, gamma, and Bernoulli distributions, are formulated as exponential families [4, 11, 29] with sufficient statistics by virtue of the Fisher-Neyman factorization theorem [22]. As a consequence of the additive structure of exponential families, it is easy to estimate parameters of probability distributions, such as the mean and variance. Numerous applications of exponential families are introduced in [3, 26, 31, 34, 46]. For instance, [3] introduces a mixture model with regular exponential families which is equivalent to a subclass of Bregman divergence [10]. Though exponential families have a useful additive structure, their class is restricted due to strong assumptions on the cumulant function in terms of shape constraints of the distribution, such as analyticity, strict convexity, and steepness. Recently, a log-concave density estimation method was introduced [13, 14]. This method is a typical nonparametric estimation method with a simple coercivity shape constraint, and thus leads to relatively accurate density estimation results in a lower-dimensional space. See [14, 43] for more details and related applications.
In this work, we are interested in a relaxation of the parameterized shape constraints of the exponential dispersion model [21], i.e., natural exponential families with an additional dispersion parameter. Inspired by [25], we propose a relaxed exponential dispersion model whose cumulant function is a K-times continuously differentiable convex function of Legendre type (K-LED: Legendre exponential dispersion model with K cumulants). The proposed K-LED model is established through a parameterized log-concave density function based on Bregman divergence associated with a convex function of Legendre type (or Legendre, for short). The main advantage of the proposed model is that, by the extended global optimum property of Bregman divergence [1, 3], the parameterized log-concave density function, which is developed via Bregman divergence associated with Legendre, becomes the LED model having a well-defined first cumulant (or mean parameter space). In Section III, we study in detail the construction of the LED model based on the parameterized log-concave density function. For more details on the various properties of Bregman divergence, beta-divergence, and various related equivalences in machine learning, including classification, see [16, 36, 37]. For clustering (or segmentation) with Bregman divergence or generalized divergence, see [3, 27, 33, 34, 40].
There are probability distributions having special conditions between mean and variance, such as the quadratic variance function [30] and the power variance function [5, 21, 44]. Let $\mu$ and $V(\mu)$ be the mean and variance of observations; then only six probability distributions (normal, Poisson, gamma, binomial, negative binomial, and generalized hyperbolic secant) of exponential dispersion models have the quadratic variance function $V(\mu) = v_2\mu^2 + v_1\mu + v_0$, where the $v_i$ are constants. See also [35] for the generalized quadratic variance, known as finitely generated cumulants via a polynomial recurrence relation between the first and the second cumulant. Although it is not a standard probability distribution having an analytic cumulant generating function, there is a relaxed (quasi-)probability distribution defined only by mean and variance, with a quasi-likelihood function
(1) $Q(\mu; y) = \int_{y}^{\mu} \frac{y - t}{\sigma^2 V(t)}\, dt,$
where $\sigma^2$ is a dispersion parameter and $V$ is a unit variance function satisfying the mean-variance relation $\mathrm{Var}(y) = \sigma^2 V(\mu)$ [29, 47]. Instead of the mean-variance relation, the LED model is constructed by using a relation between the first and the second cumulant. The equivalence between a subclass of quasi-likelihood functions and the regular LED model satisfying the mean-variance relation is studied in Section IV. See also [25] for more details. A typical example of an LED model is the Tweedie distribution [44], which has the power variance function:
(2) $V(\mu) = \mu^p.$
This distribution includes various probability distributions, such as the normal distribution ($p=0$), the Poisson distribution ($p=1$), the compound Poisson-gamma distribution ($1<p<2$), the gamma distribution ($p=2$), and the inverse Gaussian distribution ($p=3$). Note that the inverse Gaussian distribution is a non-regular exponential dispersion model having a non-open canonical parameter space. Interestingly, on the boundary of the canonical parameter space, this distribution becomes the Lévy distribution, which does not have a corresponding mean parameter space [4, 21]. Thus, the structure of the Tweedie distribution is rather complicated. Besides, because of the simultaneous requirements of analyticity of the cumulant generating function (or moment generating function) and of (2), the classic Tweedie distribution is not an exponential dispersion model when $0 < p < 1$ [5]. These strict constraints are relaxed in the proposed K-LED model with (2).
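As a quick numerical sanity check of the power variance relation, the following sketch differentiates the standard cumulant functions of three Tweedie members (normal, Poisson, gamma) by finite differences and verifies $V(\mu) = \mu^p$; the helper name `derivs` and the chosen evaluation points are illustrative, not taken from the paper.

```python
import math

# Cumulant functions kappa(theta) of classic Tweedie members (standard forms):
#   p = 0 (normal):  kappa(theta) = theta**2 / 2
#   p = 1 (Poisson): kappa(theta) = exp(theta)
#   p = 2 (gamma):   kappa(theta) = -log(-theta), theta < 0
# Mean mu = kappa'(theta); variance function V(mu) = kappa''(theta) = mu**p.
def derivs(kappa, theta, h=1e-5):
    """First and second derivatives of kappa at theta by central differences."""
    d1 = (kappa(theta + h) - kappa(theta - h)) / (2 * h)
    d2 = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h**2
    return d1, d2

cases = [
    (0, lambda t: t**2 / 2, 0.7),        # normal:  V(mu) = 1
    (1, math.exp, 0.3),                  # Poisson: V(mu) = mu
    (2, lambda t: -math.log(-t), -2.0),  # gamma:   V(mu) = mu**2
]
for p, kappa, theta in cases:
    mu, V = derivs(kappa, theta)
    assert abs(V - mu**p) < 1e-4, (p, mu, V)
```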
Concerning beta-divergence (or the exponent $p$ in (2) of the Tweedie distribution), it is not easy to use the Tweedie distribution directly for estimation, since it is not defined for all exponents. Recently, [16] proposed an augmented exponential dispersion model (EDA). By an additional augmentation function, the domain of EDA is moved away from the boundary of the domain of the classic Tweedie distribution, and thus estimation is possible in a more natural way. However, the domain of EDA is limited to the positive region, and thus the applicability of the model is reduced. Beta-divergence on this region has several interesting applications. A typical one is its use as a robustified Kullback-Leibler divergence [7, 17], which gives a robust distance between two probability distributions. For instance, it was used for a robust spatial filter of noisy EEG data [42]. For more details on the robustness of beta-divergence, see [7, 12, 17, 49]. Moreover, in our previous works, this region is used for cutting-edge classification models: (1) the H-Logitron [50], which has a high-order hinge loss with a stabilizer; (2) the Bregman-Tweedie classification model [51], which is developed from Bregman-Tweedie divergence (see also (38) in the Appendix). This classification model uses an unbounded extended logistic loss function, including the unhinged loss function [45]. Besides, the convex extended logistic loss function, which lies between the logistic loss and the exponential loss, is an analytic convex function of Legendre type and thus can be used as a cumulant function of the K-LED model, connecting the Bernoulli distribution and the Poisson distribution. The details are studied in Section IV-C. Last but not least, the extended logistic loss function is composed of the extended elementary functions, that is, the extended exponential function and the extended logarithmic function. For more details on these functions and related applications in machine learning, see [49, 50, 51] and the Appendix.
The article is organized as follows. Section II summarizes various properties of Legendre functions (convex functions of Legendre type) and of Bregman divergence associated with Legendre, which are essential ingredients in the following sections. Section III introduces the K-LED model, i.e., the Legendre exponential dispersion model with K cumulants. This model is developed by Bregman divergence associated with Legendre. The proposed Bregman-divergence-guided K-LED model inherently has the first cumulant, and thus it has the corresponding mean parameter space. For more details on the fundamental structure of exponential families, including the exponential dispersion model, see [4, 11, 46]. Section IV studies the connection between the LED model and the quasi-likelihood function based on the mean-variance relation. Also, we introduce the LED model with power variance function (2) and the K-LED model whose cumulant function is the convex extended logistic loss function. We give our conclusions in Section V.
I-A Notation
Let where and . , , , and . is a set of integer and . From [49], is classified as
Integration, multiplication, and division are performed componentwise. The convex hull of a set is the set of all convex combinations of its elements.
For a function , means that has continuous partial derivatives of th order on a convex set . Let be a lower semicontinuous, convex, and proper function. Then the domain of is defined as . This is known as the effective domain [38]. In this work, we always assume that is a convex set, irrespective of convexity of . Note that is the interior of and is the interior of relative to its affine hull, the smallest affine set including . Hence, the relative interior coincides with when the affine hull of is . For this reason, we assume , unless otherwise stated. is the closure of . is the boundary of . As observed in [23], the convexity of can be extended to by using the extendedvalued real number system . That is, : where or any other convex set for various purpose. If is not convex then we use the extendedvalued real number system . See [50] for arithmetical operations in . Let be the expectation of observations . For simplicity, we use .
II Preliminaries
This Section introduces some useful properties of a convex function of Legendre type, the corresponding Bregman divergence, and log-concave density functions. For more details, see [2, 4, 8, 23, 38] and references therein.
II-A A convex function of Legendre type
Theorem II.1.
Let $\Phi$ be a lower semicontinuous, convex, and proper function. Then $\Phi$ satisfies the following relation
(3) $\mathrm{ri}(\mathrm{dom}\,\Phi) \subseteq \mathrm{dom}\,\partial\Phi \subseteq \mathrm{dom}\,\Phi,$
where $\mathrm{dom}\,\Phi$ is a convex set and thus $\mathrm{ri}(\mathrm{dom}\,\Phi)$ is also convex [38, Th 6.2]. Note that $\mathrm{dom}\,\partial\Phi = \{x : \partial\Phi(x) \neq \emptyset\}$, where $\partial\Phi(x)$ is the subgradient of $\Phi$ at $x$.
As noticed in [38], $\mathrm{dom}\,\partial\Phi$ is not necessarily convex, though $\mathrm{dom}\,\Phi$ and $\mathrm{ri}(\mathrm{dom}\,\Phi)$ are convex. Now, we define a convex function of Legendre type [8, 38].
Definition II.2.
Let $\Phi$ be a lower semicontinuous, convex, and proper function. Then $\Phi$ is a convex function of Legendre type (or Legendre), if the following conditions are satisfied:
(1) $\mathrm{int}(\mathrm{dom}\,\Phi) \neq \emptyset$ and $\Phi$ is differentiable on $\mathrm{int}(\mathrm{dom}\,\Phi)$;
(2) $\Phi$ is strictly convex on $\mathrm{int}(\mathrm{dom}\,\Phi)$;
(3) (steepness) for every sequence $\{x_i\} \subset \mathrm{int}(\mathrm{dom}\,\Phi)$ converging to a boundary point of $\mathrm{dom}\,\Phi$,
(4) $\lim_{i \to \infty} \|\nabla\Phi(x_i)\| = +\infty.$
For simplicity, let us denote a class of convex functions of Legendre type as
Here, (4) is known as the steepness condition in statistics [4, 11]. The following Theorem [8, 38] is useful when we characterize the Legendre exponential dispersion model with K cumulants (K-LED).
Theorem II.3.
$\Phi$ is a convex function of Legendre type if and only if $\Phi^*$ is, where $\Phi^*$ is the conjugate function of $\Phi$. The corresponding gradient
(5) $\nabla\Phi : \mathrm{int}(\mathrm{dom}\,\Phi) \to \mathrm{int}(\mathrm{dom}\,\Phi^*)$
is a topological isomorphism with inverse mapping $(\nabla\Phi)^{-1} = \nabla\Phi^*$.
The coercivity of Legendre functions is useful when we characterize the Tweedie distribution [5, 21, 44] and log-concave density functions [13].
Theorem II.4.
Let $\Phi$ be a lower semicontinuous, convex, and proper function; then the following are equivalent:
(1) $\Phi$ is coercive, i.e., $\lim_{\|x\| \to \infty} \Phi(x) = +\infty$;
(2) the sublevel sets $\{x : \Phi(x) \le \alpha\}$ are bounded for all $\alpha \in \mathbb{R}$;
(3) there exist $c > 0$ and $b \in \mathbb{R}$ such that $\Phi(x) \ge c\|x\| + b$, for all $x$;
(4) $0 \in \mathrm{int}(\mathrm{dom}\,\Phi^*)$.
Definition II.5.
Let $\Phi$ be coercive and $p(x) = \exp(-\Phi(x))\,\nu(x)$, where $\int p = 1$ and $\nu$ is an appropriate continuous Lebesgue (or discrete counting) measure on the domain. Then $p$ is a log-concave (probability) density function.
II-B Bregman divergence associated with Legendre
Consider Bregman divergence associated with $\Phi$:
(6) $D_\Phi(x, y) = \Phi(x) - \Phi(y) - \langle \nabla\Phi(y),\, x - y \rangle,$
where $x \in \mathrm{dom}\,\Phi$ and $y \in \mathrm{int}(\mathrm{dom}\,\Phi)$. It can also be formulated with the conjugate function as $D_\Phi(x, y) = \Phi(x) + \Phi^*(\nabla\Phi(y)) - \langle x, \nabla\Phi(y) \rangle$. As observed in information geometry [1, 2], Bregman divergence (6) is related to the canonical divergence. Actually, it includes various divergences: (1) the Itakura-Saito divergence, with $\Phi(x) = -\log x$; this divergence is induced from the gamma distribution [19, 48]. (2) The generalized Kullback-Leibler divergence (I-divergence), with $\Phi(x) = x\log x - x$; this generalized distance is induced from the Poisson distribution [18, 40]. (3) The squared $\ell_2$ distance, with $\Phi(x) = \frac{1}{2}\|x\|^2$; this distance can be easily derived from the normal distribution.
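The three special cases above can be checked numerically. The sketch below evaluates the Bregman divergence for the three generating functions and compares it against the closed-form Itakura-Saito, generalized KL, and squared-distance expressions; the function names are illustrative.

```python
import math

def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) * (x - y)

x, y = 3.0, 2.0

# phi(t) = t^2 / 2  ->  half the squared Euclidean distance
d_sq = bregman(lambda t: t * t / 2, lambda t: t, x, y)
assert abs(d_sq - (x - y) ** 2 / 2) < 1e-12

# phi(t) = t*log(t) - t  ->  generalized Kullback-Leibler (I-)divergence
d_kl = bregman(lambda t: t * math.log(t) - t, math.log, x, y)
assert abs(d_kl - (x * math.log(x / y) - x + y)) < 1e-12

# phi(t) = -log(t)  ->  Itakura-Saito divergence
d_is = bregman(lambda t: -math.log(t), lambda t: -1 / t, x, y)
assert abs(d_is - (x / y - math.log(x / y) - 1)) < 1e-12
```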
In addition, we summarize several useful properties of Bregman divergence associated with Legendre. See [8, 40] for more details.
Theorem II.6.
Let $\Phi$ be a convex function of Legendre type and $D_\Phi$ the associated Bregman divergence. Then $D_\Phi$ satisfies the following properties:
(1) $D_\Phi(x, y)$ is strictly convex with respect to $x$ on $\mathrm{dom}\,\Phi$.
(2) $D_\Phi(x, y)$ is coercive with respect to $x$, for all $y \in \mathrm{int}(\mathrm{dom}\,\Phi)$.
(3) $D_\Phi(x, y)$ is coercive with respect to $y$, for all $x \in \mathrm{int}(\mathrm{dom}\,\Phi)$, if and only if $\mathrm{dom}\,\Phi^*$ is open.
(4) $D_\Phi(x, y) = 0$ if and only if $x = y$, where $x, y \in \mathrm{int}(\mathrm{dom}\,\Phi)$.
(5) For all $x, y \in \mathrm{int}(\mathrm{dom}\,\Phi)$, $D_\Phi(x, y) = D_{\Phi^*}(\nabla\Phi(y), \nabla\Phi(x))$.
(6) (Global optimum property [3]) Let $x_1, \dots, x_N \in \mathrm{dom}\,\Phi$ and $\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i$; then, for all $y \in \mathrm{int}(\mathrm{dom}\,\Phi)$, we have $\sum_{i=1}^N D_\Phi(x_i, \bar{x}) \le \sum_{i=1}^N D_\Phi(x_i, y)$.
The global optimum property in Theorem II.6 (6) is satisfied irrespective of the convexity of Bregman divergence in its second variable. However, if Bregman divergence has an additional regularization term on the second variable, this property is not satisfied anymore. See [27, 34, 39, 49] for more details on regularized Bregman divergence and its applications in image processing. The following canonical divergence (a reformulated Bregman divergence associated with Legendre) is helpful for the characterization of Bregman-divergence-based probability distributions and their relation with the K-LED model.
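A minimal illustration of the global optimum property: for several choices of $\Phi$, a grid search over the second argument of the summed Bregman divergence recovers the sample mean. The data and grid are arbitrary test values, not from the paper.

```python
import math

def bregman(phi, dphi, x, y):
    """Bregman divergence with generator phi and its derivative dphi."""
    return phi(x) - phi(y) - dphi(y) * (x - y)

data = [0.5, 1.2, 2.0, 3.3]
mean = sum(data) / len(data)  # 1.75

# For each Legendre generator, the minimizer of y -> sum_i D_phi(x_i, y)
# is the sample mean, regardless of which phi is used.
for phi, dphi in [
    (lambda t: t * t / 2, lambda t: t),         # squared loss
    (lambda t: t * math.log(t) - t, math.log),  # generalized KL
    (lambda t: -math.log(t), lambda t: -1 / t), # Itakura-Saito
]:
    objective = lambda y: sum(bregman(phi, dphi, x, y) for x in data)
    grid = [0.1 + 0.001 * k for k in range(4000)]  # grid search over y
    best = min(grid, key=objective)
    assert abs(best - mean) < 1e-2, (best, mean)
```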
Theorem II.7 (Extended global optimum property [1]).
Let $\Phi$ be a convex function of Legendre type, $x \in \mathrm{dom}\,\Phi$, and $\theta \in \mathrm{dom}\,\Phi^*$. Then the following canonical divergence
(7) $\tilde{D}_\Phi(x, \theta) = \Phi(x) + \Phi^*(\theta) - \langle x, \theta \rangle$
is strictly convex with respect to the canonical parameter $\theta$. Consider the minimization problem:
$\min_{\theta \in \mathrm{dom}\,\Phi^*} \tilde{D}_\Phi(x, \theta).$
The solution of the above minimization problem exists within an extended-valued real number system.
Additionally, the global optimum property in Theorem II.6 (6) is extended to this setting.
Proof.
Let then it is trivial that . Consider . Let where and as . Then . Thus, as , and (or ). Regarding the global optimum property, from , we have where for all and for all .
Theorem II.7 is useful when we analyze the Tweedie distribution, such as the zero-inflated compound Poisson-gamma distribution ($1 < p < 2$) and the inverse Gaussian distribution ($p = 3$). The following example shows that it is possible to build up a parameterized log-concave density function with a mean parameter space which is induced from Bregman divergence associated with Legendre.
Example II.8 (Bregman-divergence-guided log-concave density function).
Assume that the observations $x_1, \dots, x_N \in \mathrm{dom}\,\Phi$ and $\mu \in \mathrm{int}(\mathrm{dom}\,\Phi)$. From Theorem II.6 (2), it is easy to check the coercivity of Bregman divergence associated with Legendre for all $\mu$. Hence, we have the corresponding log-concave density function (see Definition II.5)
(8) $p_\Phi(x \mid \mu) = \exp(-D_\Phi(x, \mu))\, \nu_b(x),$
where $\nu_b$ is a base measure with an appropriate normalization satisfying $\int \exp(-D_\Phi(x, \mu))\, d\nu_b(x) = 1$. Consider the corresponding likelihood function and a minimization problem with the negative log-likelihood function:
(9) $\hat{\mu} = \operatorname*{arg\,min}_{\mu \in \mathrm{int}(\mathrm{dom}\,\Phi)} \frac{1}{N} \sum_{i=1}^{N} D_\Phi(x_i, \mu),$
where $x_1, \dots, x_N$ are the observations and $N$ is the sample size. Due to Theorem II.7 (and the global optimum property of Theorem II.6 (6)), the solution of (9) becomes the sample mean $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$.
Since $\Phi$ is Legendre, we have a bijective map
$\nabla\Phi : \mathrm{int}(\mathrm{dom}\,\Phi) \to \mathrm{int}(\mathrm{dom}\,\Phi^*),$
where $\mathrm{int}(\mathrm{dom}\,\Phi)$ is the mean parameter space and $\mathrm{int}(\mathrm{dom}\,\Phi^*)$ is the canonical parameter space. See also [1]. In fact, (8) becomes a natural exponential family with the mean parameter space if $\Phi^*$ is a cumulant function.
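The bijection between mean and canonical parameters can be illustrated with the Poisson case, where the mean-side Legendre function $\mu\log\mu - \mu$ gives $\nabla\Phi = \log$ and $\nabla\Phi^* = \exp$; the helper names below are ours.

```python
import math

# Poisson case: mean-side Legendre function phi(mu) = mu*log(mu) - mu,
# so the canonical parameter is theta = phi'(mu) = log(mu),
# with inverse mean map mu = (phi*)'(theta) = exp(theta).
def to_canonical(mu):
    return math.log(mu)

def to_mean(theta):
    return math.exp(theta)

# The gradient maps are mutually inverse on the interiors of the domains.
for mu in [0.25, 1.0, 3.7]:
    assert abs(to_mean(to_canonical(mu)) - mu) < 1e-12

# The conjugate pair (phi, phi*) = (mu*log(mu) - mu, exp(theta)) satisfies
# the Fenchel identity phi(mu) + phi*(theta) = mu * theta at theta = log(mu).
mu = 2.0
theta = to_canonical(mu)
phi = mu * math.log(mu) - mu
phi_star = math.exp(theta)
assert abs(phi + phi_star - mu * theta) < 1e-12
```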
II-C Beta-divergence and quasi-likelihood function with power variance function
Compared to Bregman divergence associated with Legendre, beta-divergence [7, 17] is less structured and is tightly connected to the quasi-likelihood function (1) with power variance function (2). Formally, beta-divergence is defined as
(10) $d_\beta(x, y) = \frac{1}{\beta(\beta - 1)}\left( x^\beta + (\beta - 1)\, y^\beta - \beta\, x\, y^{\beta - 1} \right)$
on the domain of beta-divergence. As observed in [49], nonnegativity of the range of beta-divergence is guaranteed under an appropriate domain assumption. Note that the mathematical formulation of the quasi-likelihood function with power variance function (2) is equal to that of beta-divergence, i.e., $d_\beta(y, \mu) = -\sigma^2 Q(\mu; y)$ with $\beta = 2 - p$. However, due to an additional condition on the variance function $V$, the equivalence is not always true on the domain of beta-divergence. As observed in [49], beta-divergence can be reformulated to Bregman-beta divergence (39) under a restriction of the domain, as defined in (40). In fact, the regular Tweedie distribution [21, 44] is developed based on Bregman-beta divergence (39). The details are dealt with in Section IV-A.
III K-LED: Legendre exponential dispersion model with K cumulants
This Section presents the K-LED model, the Legendre exponential dispersion model with K cumulants, derived from the Bregman divergence associated with Legendre in (7).
Let us start with natural exponential families [4, 21, 46]:
(11) $p_\Phi(x \mid \theta) = \exp\left( \langle x, \theta \rangle - \Phi(\theta) \right) \nu_b(x),$
where $x$ is an observation (or a random vector) and $\theta \in \Theta$ is a canonical parameter. If (11) is uniquely determined for all $\theta \in \Theta$, then it is full. Note that it is regular if $\Theta$ is open, and non-regular if $\Theta$ is not open. Here, $\Omega$ denotes the set of random vectors and $\nu_b$ is a base measure on $\Omega$. The minimality condition of (11) means that $\langle x, \theta \rangle$ is not constant for any nonzero $\theta$. When $x$ is replaced by a sufficient statistic $T(x)$, (11) becomes the traditional exponential family. For simplicity, we only consider the exponential dispersion model [21] (i.e., natural exponential families with an additional dispersion parameter).
From the normalization $\int p_\Phi(x \mid \theta)\, d\nu_b(x) = 1$, we have
(12) $\Phi(\theta) = \log \int_\Omega \exp(\langle x, \theta \rangle)\, d\nu_b(x),$
where $\nu_b$ is an appropriate continuous Lebesgue (or discrete counting) measure depending on $\Omega$. Note that $\Phi$ in (12) is a cumulant function (or log-partition function) and is analytic on the interior of its domain. Additionally, under the minimality condition of (11), it is not difficult to show that $\Phi$ is strictly convex [11, 46]. The main advantage of (11) is that we can easily obtain the mean (first cumulant), covariance (second cumulant), or even higher-order cumulants of observations from the cumulant generating function $K_x(t) = \Phi(\theta + t) - \Phi(\theta)$, where $\exp(K_x(t))$ is the moment generating function. For instance, $E[x] = \nabla\Phi(\theta)$ and $\mathrm{Cov}(x) = \nabla^2\Phi(\theta)$. In this way, a cumulant generating function uniquely determines a probability distribution within minimal natural exponential families [4]. Due to $E[x] = \nabla\Phi(\theta)$ and an additional condition [3], $\mathrm{int}(\mathrm{dom}\,\Phi^*)$ is known as the mean parameter space and $\mathrm{int}(\mathrm{dom}\,\Phi)$ is known as the canonical parameter space [1, 46]. As described in [3, Theorem 3] and [4, Theorem 9.1, 9.2], the mean parameter space and the set of observations satisfy the following condition
(13) $\mathrm{int}(\mathrm{co}(\Omega)) \subseteq M_\Phi \subseteq \mathrm{co}(\Omega),$
where $M_\Phi$ denotes the mean parameter space and $\mathrm{co}(\Omega)$ is the convex hull of $\Omega$. In this work, we assume that (13) is always true, unless otherwise stated. If the mean parameter space does not cover the interior of the convex hull of the observations, then we cannot obtain a unique mean and variance on the set of observations; therefore, condition (13) is essential. Actually, it is achieved through the steepness condition (4) of $\Phi$. However, a convex function of Legendre type is not always analytic on the interior of its domain, and thus it may not become a cumulant function of minimal natural exponential families; for a finitely differentiable convex function of Legendre type, there is a domain inconsistency depending on the order of differentiability.
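As a concrete instance of extracting cumulants from the cumulant generating function, the sketch below uses the Poisson log-partition function $\Phi(\theta) = e^\theta$ and finite differences to confirm that the first and second cumulants both equal the rate $e^\theta$.

```python
import math

# Poisson family: log-partition function Phi(theta) = exp(theta), so the
# cumulant generating function is K(t) = Phi(theta + t) - Phi(theta) and
# every cumulant equals the rate lambda = exp(theta).
theta = 0.4
lam = math.exp(theta)

def K(t):
    return math.exp(theta + t) - math.exp(theta)

h = 1e-4
first = (K(h) - K(-h)) / (2 * h)              # mean (1st cumulant)
second = (K(h) - 2 * K(0.0) + K(-h)) / h**2   # variance (2nd cumulant)

assert abs(first - lam) < 1e-6
assert abs(second - lam) < 1e-4
```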
As observed in Example II.8, the Bregman-divergence-guided log-concave density function naturally has the first cumulant, i.e., the mean parameter space. Hence, if the conjugate of the base function of Bregman divergence satisfies a mean-variance relation, such as the power variance function (2), then we have a log-concave density function with the first and the second cumulants. In this way, we can build up a relaxed exponential dispersion model with finite cumulants.
Definition III.1 (K-LED).
Let $x$ be an observation with (13). Consider a convex function of Legendre type $\Phi$ having continuous partial derivatives of Kth order on the interior of its domain. Then the Legendre exponential dispersion model with K cumulants (K-LED) is defined as
(14) $p_\Phi(x \mid \theta, \sigma) = \exp\left( \frac{\langle x, \theta \rangle - \Phi(\theta)}{\sigma^2} \right) \nu_b(x),$
where $\sigma^2$ is a dispersion parameter, $\Theta$ is the canonical parameter space, and $M_\Phi$ is the mean parameter space. $\nu_b$ is a base measure satisfying the normalization condition. Note that (14) is regular if $\Theta$ is open, and non-regular if $\Theta$ is not open. In addition, $\Phi$ is a cumulant generating function up to the Kth cumulant.
Remark III.2.

(1) If (14) is uniquely determined for all canonical parameters, then it is the full K-LED model for a given dispersion.
(2) In certain boundary cases of the parameters, the K-LED model (14) is degenerate.
(3) Let $z = x/\sigma^2$; then (14) can be reformulated in terms of $z$. This becomes the additive K-LED model corresponding to the classic additive exponential dispersion model [21]. Additionally, let $\sigma$ in (14) be a constant. Then, we get a density function in natural exponential families:
(15) $p_\Phi(x \mid \theta) = \exp\left( \frac{\langle x, \theta \rangle - \Phi(\theta)}{\sigma^2} \right) \nu_b(x).$
Then the mean and variance of the K-LED model (14) are given as $E[x] = \nabla\Phi(\theta)$ and $\mathrm{Cov}(x) = \sigma^2 \nabla^2\Phi(\theta)$. In this way, (15) is known as a density function of the scaled exponential families. See [20, 26] for more details on the scaled exponential families for the Dirichlet process mixture model, where $\sigma$ is regularized to control the accuracy of the density estimation.
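The mean and variance formulas of the remark can be verified by quadrature in the simplest case: with $\Phi(\theta) = \theta^2/2$, a Gaussian base measure, and dispersion $\sigma^2$, the dispersion-model density is the normal density $N(\theta, \sigma^2)$. The grid and base-measure choice below are our own assumptions for the sketch.

```python
import math

theta, sig2 = 1.3, 0.8  # canonical parameter and dispersion

def density(x):
    """exp((x*theta - Phi(theta)) / sigma^2) times a Gaussian base measure."""
    base = math.exp(-x * x / (2 * sig2)) / math.sqrt(2 * math.pi * sig2)
    return math.exp((x * theta - theta**2 / 2) / sig2) * base

# Riemann sums over a wide, fine grid
w = 0.001
xs = [-10 + w * k for k in range(20001)]
mass = sum(density(x) for x in xs) * w
mean = sum(x * density(x) for x in xs) * w
var = sum((x - mean) ** 2 * density(x) for x in xs) * w

assert abs(mass - 1.0) < 1e-6        # proper density
assert abs(mean - theta) < 1e-6      # mean = Phi'(theta) = theta
assert abs(var - sig2) < 1e-6        # variance = sigma^2 * Phi''(theta)
```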
As commented in Section II-C and [49], there is a partial equivalence between a subclass of beta-divergence and the quasi-likelihood function with power variance function (2). Hence, it is natural to consider the relation between Bregman divergence associated with Legendre and the proposed K-LED model (14). In the case of the K-LED model, the minimum requirement is the existence of the first cumulant, i.e., the mean parameter space. As noticed in Example II.8, this can be satisfied by Bregman divergence associated with Legendre.
Theorem III.3.
Let $\Phi$ be a convex function of Legendre type and assume that $\mathrm{dom}\,\Phi$ and $\mathrm{dom}\,\Phi^*$ are open. Then $p_\Phi(x \mid \mu) = \exp(-D_\Phi(x, \mu))\, \nu_b(x)$ is a parameterized log-concave density function with the first cumulant, for all $\mu \in \mathrm{int}(\mathrm{dom}\,\Phi)$. Here, the dispersion parameter is a constant and $\nu_b$ is a base measure satisfying the normalization condition. That is, $p_\Phi$ is the LED model.
Proof.
By Theorem II.6 (2), it is easy to check the coercivity of $D_\Phi(x, \mu)$ with respect to $x$. Thus, $p_\Phi(x \mid \mu)$ is a log-concave density function with an appropriate base measure (see Theorem II.4 (4)). Regarding the existence of the first cumulant, from Theorem II.7, we have
(16) $\hat{\theta} = \operatorname*{arg\,min}_{\theta \in \mathrm{dom}\,\Phi^*} \tilde{D}_\Phi(\mu, \theta) = \nabla\Phi(\mu),$
where $\mu \in \mathrm{int}(\mathrm{dom}\,\Phi)$. Hence, for any mean value $\mu$, there is always a corresponding unique canonical parameter $\hat{\theta}$.
Typical examples of Theorem III.3 are the gamma and normal distributions. For the gamma distribution, $\Phi(x) = -\log x$, which generates the Itakura-Saito divergence. For the normal distribution, $\Phi(x) = \frac{1}{2}\|x\|^2$. An extension of the normal distribution, which has a constant unit variance function, is not straightforward. In the following example, we introduce an extended normal distribution via the power variance function (2). This distribution is induced from Bregman-beta divergence (39) or Bregman-Tweedie divergence (38), which were introduced in [49, 51].
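To illustrate how the Itakura-Saito generating function yields a gamma-type LED model, the sketch below checks that $\exp(-D_{IS}(x,\mu)/\sigma^2)$, taken with base measure $dx/x$, is proportional in $x$ to a gamma density with shape $1/\sigma^2$ and mean $\mu$; the shape/rate bookkeeping here is a standard assumption, not taken from the paper.

```python
import math

def d_is(x, mu):
    """Itakura-Saito divergence."""
    return x / mu - math.log(x / mu) - 1

mu, sig2 = 2.0, 0.5
a = 1 / sig2  # assumed gamma shape parameter

def unnormalized(x):
    return math.exp(-d_is(x, mu) / sig2)

def gamma_pdf(x):
    """Gamma density with shape a and mean mu (rate a / mu)."""
    rate = a / mu
    return rate**a * x**(a - 1) * math.exp(-rate * x) / math.gamma(a)

# With base measure dx / x, the ratio to the gamma density is constant in x.
ratios = [unnormalized(x) / (gamma_pdf(x) * x) for x in (0.5, 1.0, 2.0, 4.0)]
assert max(ratios) - min(ratios) < 1e-9 * max(ratios)
```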
Example III.4 (Extended normal distribution).
Let and . Then where and . Therefore, we have . Consider Bregman-beta divergence
From , and are coercive, and thus is a log-concave density function. From Theorem III.3, we have an LED model
with the first cumulant function where , and a base measure satisfying . When , we obtain the famous normal distribution with . Hence, the LED model is an extended normal distribution satisfying power variance function (2). When , this distribution does not have the classic cumulant generating function [5]. Actually, if then and thus is the LED model. However, if then with depending on . Since , as commented in Remark III.2 (3), is a degenerate LED model. See Figure 1 for this extended normal distribution.
Now, consider the case where the mean parameter space or the canonical parameter space is not open. In other words, we consider the regular K-LED model whose nontrivial mean parameter space is not open, and the non-regular K-LED model whose canonical parameter space is not open. A typical example of the regular K-LED model is the Bernoulli distribution, which has discrete random variables; its set of observations and mean parameter space satisfy (13).
Theorem III.5.
Let $x$ be an observation with (13), and let the dispersion parameter be a constant. (a) Assume that the canonical parameter space is open and the mean parameter space is not open. Then $p_\Phi$ is a parameterized log-concave density function whose first cumulant exists in an extended-valued real number system. Here, $\nu_b$ is a base measure satisfying the normalization condition. Hence, $p_\Phi$ is the LED model. (b) Assume that the canonical parameter space is not open and the mean parameter space is open. Then $p_\Phi$ is a parameterized log-concave density function with the first cumulant only on the interior of the canonical parameter space. That is, $p_\Phi$ is a non-full LED model. Here, $\nu_b$ is a base measure satisfying the normalization condition.
Proof.
(a) In Theorem III.3, we treated the case where both parameter spaces are open. Consider the case where the mean parameter space is not open. Since the canonical parameter space is open, it is not difficult to see that $p_\Phi$ is a log-concave density function for all parameters, even though the mean parameter space is not open (see Theorem II.6 (2)). Now, we only need to show the existence of a map between the mean parameter space and the canonical parameter space in an extended-valued real number system. Let us consider (16): from Theorem II.7, due to the steepness condition of Legendre, there is a sequence of canonical parameters diverging as the corresponding mean values approach the boundary of the mean parameter space. Therefore, the limit exists within an extended-valued real number system for all boundary mean values. Consequently, $\Phi$ is a cumulant function of the log-concave density function having the first cumulant, i.e., $p_\Phi$ is in the LED model.
(b) Consider the case where the canonical parameter space is not open while the mean parameter space is open; in Theorem III.3, the case where both are open was treated. As done in (a), there is a sequence of canonical parameters approaching the boundary of the canonical parameter space whose corresponding mean values converge in an extended-valued real number system. However, from [9, Th. 14.17], on the boundary the relevant conjugate function is not coercive. Therefore, $p_\Phi$ is a log-concave density function, and a non-full LED model, only on the interior of the canonical parameter space.
Example III.6.
Consider the Bernoulli distribution. Let and ; then we have the Bregman divergence associated with Legendre: where and . Thus, we have the Bregman-divergence-guided log-concave density function, where . The corresponding maximum likelihood estimation is
where . Then
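The Bernoulli ingredients of Example III.6 can be sketched numerically: the mean-side Legendre function $\mu\log\mu + (1-\mu)\log(1-\mu)$, its conjugate cumulant function $\log(1+e^\theta)$, the logit/sigmoid parameter maps, and the identity of the Bregman divergence with the Bernoulli KL divergence. Helper names are ours.

```python
import math

# Mean-side Legendre function (negative Bernoulli entropy) on (0, 1)
def phi(mu):
    return mu * math.log(mu) + (1 - mu) * math.log(1 - mu)

# Conjugate cumulant function Phi(theta) = log(1 + exp(theta))
def Phi(theta):
    return math.log(1 + math.exp(theta))

# Canonical parameter theta = phi'(mu) = logit(mu); mean map is the sigmoid
def logit(mu):
    return math.log(mu / (1 - mu))

def sigmoid(theta):
    return 1 / (1 + math.exp(-theta))

mu = 0.3
theta = logit(mu)
assert abs(sigmoid(theta) - mu) < 1e-12                 # inverse gradient maps
assert abs(phi(mu) + Phi(theta) - mu * theta) < 1e-12   # Fenchel identity

# Bregman divergence D_phi equals the KL divergence between Bernoulli laws
def d_phi(x, y):
    return phi(x) - phi(y) - logit(y) * (x - y)

x, y = 0.2, 0.6
kl = x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))
assert abs(d_phi(x, y) - kl) < 1e-12
```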