Density Estimation in Infinite Dimensional Exponential Families
In this paper, we consider an infinite dimensional exponential family of probability densities, which are parametrized by functions in a reproducing kernel Hilbert space , and show it to be quite rich in the sense that a broad class of densities on can be approximated arbitrarily well in Kullback-Leibler (KL) divergence by elements in . Motivated by this approximation property, the paper addresses the question of estimating an unknown density through an element in . Standard techniques like maximum likelihood estimation (MLE) or pseudo MLE (based on the method of sieves), which are based on minimizing the KL divergence between and , do not yield practically useful estimators because of their inability to efficiently handle the log-partition function. We propose an estimator based on minimizing the Fisher divergence, between and , which involves solving a simple finite-dimensional linear system. When , we show that the proposed estimator is consistent, and provide a convergence rate of in Fisher divergence under the smoothness assumption that for some , where is a certain Hilbert-Schmidt operator on and denotes the image of . We also investigate the misspecified case of and show that as , and provide a rate for this convergence under a similar smoothness condition as above. Through numerical simulations we demonstrate that the proposed estimator outperforms the non-parametric kernel density estimator, and that the advantage of the proposed estimator grows as increases.
Keywords: density estimation, exponential family, Fisher divergence, kernel density estimator, maximum likelihood, interpolation space, inverse problem, reproducing kernel Hilbert space, Tikhonov regularization, score matching.
Exponential families are among the most important classes of parametric models studied in statistics, and include many common distributions such as the normal, exponential, gamma, and Poisson. In its “natural form”, the family generated by a probability density (defined over ) and sufficient statistic, is defined as
where is the cumulant generating function (also called the log-partition function), is the natural parameter space and is a finite-dimensional vector called the natural parameter. Exponential families have a number of properties that make them extremely useful for statistical analysis (see Brown, 1986 for more details).
where the function space is defined as
being the cumulant generating function, and a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950) with as its reproducing kernel. While various generalizations are possible for different choices of (e.g., an Orlicz space as in Pistone and Sempi, 1995), the connection of to the natural exponential family in (1) is particularly enlightening when is an RKHS. This is due to the reproducing property of the kernel, , through which takes the role of the sufficient statistic. In fact, it can be shown (see Section 3 and Example 3 for more details) that every is generated by induced by a finite dimensional RKHS , and therefore the family with being an infinite dimensional RKHS is a natural infinite dimensional generalization of . Furthermore, this generalization is particularly interesting as in contrast to , it can be shown that is a rich class of densities (depending on the choice of and therefore ) that can approximate a broad class of probability densities arbitrarily well (see Propositions 3, 5 and Corollary 3). This generalization is not only of theoretical interest, but also has implications for statistical and machine learning applications. For example, in Bayesian non-parametric density estimation, the densities in are chosen as prior distributions on a collection of probability densities (e.g., see van der Vaart and van Zanten, 2008). has also found applications in nonparametric hypothesis testing (Gretton et al., 2012; Fukumizu et al., 2008) and dimensionality reduction (Fukumizu et al., 2004, 2009) through the mean and covariance operators, which are obtained as the first and second Fréchet derivatives of (see Fukumizu, 2009, Section 1.2.3). Recently, the infinite dimensional exponential family, has been used to develop a gradient-free adaptive MCMC algorithm based on Hamiltonian Monte Carlo (Strathmann et al., 2015) and also has been used in the context of learning the structure of graphical models (Sun et al., 2015).
Motivated by the richness of the infinite dimensional generalization and its statistical applications, it is of interest to model densities by , and therefore the goal of this paper is to estimate unknown densities by elements in when is an infinite dimensional RKHS. Formally, given i.i.d. random samples drawn from an unknown density , the goal is to estimate through . Throughout the paper, we refer to case of as well-specified, in contrast to the misspecified case where . The setting is useful because is a rich class of densities that can approximate a broad class of probability densities arbitrarily well, hence it may be widely used in place of non-parametric density estimation methods (e.g., kernel density estimation (KDE)). In fact, through numerical simulations, we show in Section 6 that estimating through performs better than KDE, and that the advantage of the proposed estimator grows with increasing dimensionality.
In the finite-dimensional case where , estimating through maximum likelihood (ML) leads to solving elegant likelihood equations (Brown, 1986, Chapter 5). However, in the infinite dimensional case (assuming ), as in many non-parametric estimation methods, a straightforward extension of maximum likelihood estimation (MLE) suffers from the problem of ill-posedness (Fukumizu, 2009, Section 1.3.1). To address this problem, Fukumizu (2009) proposed a method of sieves involving pseudo-MLE by restricting the infinite dimensional manifold to a series of finite-dimensional submanifolds, which enlarge as the sample size increases, i.e., is the density estimator with
where and is a sequence of finite-dimensional subspaces of such that for all . While the consistency of is proved in Kullback-Leibler (KL) divergence (Fukumizu, 2009, Theorem 6), the method suffers from many drawbacks that are both theoretical and computational in nature. On the theoretical front, the consistency in Fukumizu (2009, Theorem 6) is established by assuming a decay rate on the eigenvalues of the covariance operator (see (A-2) and the discussion in Section 1.4 of Fukumizu (2009) for details), which is usually difficult to check in practice. Moreover, it is not clear which classes of RKHS should be used to obtain a consistent estimator (Fukumizu, 2009, (A-1)) and the paper does not provide any discussion about the convergence rates. On the practical side, the estimator is not attractive as it can be quite difficult to construct the sequence that satisfies the assumptions in Fukumizu (2009, Theorem 6). In fact, the impracticality of the estimator, is accentuated by the difficulty in efficiently handling (though it can be approximated by numerical integration).
A related work was carried out by Barron and Sheu (1991)—also see references therein—where the goal is to estimate a density, by approximating its logarithm as an expansion in terms of basis functions, such as polynomials, splines or trigonometric series. Similar to Fukumizu (2009), Barron and Sheu proposed the ML estimator , where
and is the linear space of dimension spanned by the chosen basis functions. Under the assumption that has square-integrable derivatives up to order , they showed that with for each of the approximating families, where is the KL divergence between and . Similar work was carried out by Gu and Qiu (1993), who assumed that lies in an RKHS, and proposed an estimator based on penalized MLE, with consistency and rates established in Jensen-Shannon divergence. Though these results are theoretically interesting, these estimators are obtained via a procedure similar to that in Fukumizu (2009), and therefore suffers from the practical drawbacks discussed above.
The discussion so far shows that the MLE approach to learning results in estimators that are of limited practical interest. To alleviate this, one can treat the problem of estimating in a completely non-parametric fashion by using KDE, which is well-studied (Tsybakov, 2009, Chapter 1) and easy to implement. This approach ignores the structure of , however, and is known to perform poorly for moderate to large (Wasserman, 2006, Section 6.5) (see also Section 6 of this paper).
1.1 Score Matching and Fisher Divergence
To counter the disadvantages of KDE and pseudo/penalized-MLE, in this paper, we propose to use the score matching method introduced by Hyvärinen (2005, 2007). While MLE is based on minimizing the KL divergence, the score matching method involves minimizing the Fisher divergence (also called the Fisher information distance; see Definition 1.13 in Johnson (2004)) between two continuously differentiable densities, and on an open set , given as
where with . Fisher divergence is closely related to the KL divergence through de Bruijn’s identity (Johnson, 2004, Appendix C) and it can be shown that where , , denotes the convolution, and denotes a normal distribution on with mean zero and diagonal covariance with (see Proposition B.1 for a precise statement; also see Theorem 1 in Lyu, 2009). Moreover, convergence in Fisher divergence is a stronger form of convergence than that in KL, total variation and Hellinger distances (see Lemmas E.2 & E.3 in Johnson, 2004 and Corollary 5.1 in Ley and Swan, 2013).
To understand the advantages associated with the score matching method, let us consider the problem of density estimation where the data generating distribution (say ) belongs to in (1). In other words, given random samples drawn i.i.d. from , the goal is to estimate as , and use as an estimator of . While the MLE approach is well-studied and enjoys nice statistical properties in asymptopia (i.e., asymptotically unbiased, efficient, and normally distributed), the computation of can be intractable in many situations as discussed above. In particular, this is the case for where for all , , and the functional form of is known (as a function of and ); yet we do not know how to easily compute , which is often analytically intractable. In this setting (which is exactly the setting of this paper), assuming to be differentiable (w.r.t. ), and , in (3) reduces to
through integration by parts (see Hyvärinen, 2005, Theorem 1), under appropriate regularity conditions on and for all . Here . The main advantage of the objective in (3) (and also (4)) is that when it is applied to the situation discussed above where , is independent of , and an estimate of can be obtained by simply minimizing the empirical counterpart of , given by
Since is also independent of , may be easily computable, unlike the MLE. We would like to highlight that while the score matching approach may have computational advantages over MLE, it only estimates up to the scaling factor , and therefore requires the approximation or computation of through numerical integration to estimate . Note that this issue (of computing through numerical integration) exists even with MLE, but not with KDE. In score matching, however, numerical integration is needed only once, while MLE would typically require a functional form of the log-partition function which is approximated through numerical integration at every step of an iterative optimization algorithm (for example, see (2)), thus leading to major computational savings. An important application that does not require the computation of is in finding modes of the distribution, which has recently become very popular in image processing (Comaniciu and Meer, 2002), and has already been investigated in the score matching framework (Sasaki et al., 2014). Similarly, in sampling methods such as sequential Monte Carlo (Doucet et al., 2001), it is often the case that the evaluation of unnormalized densities is sufficient to calculate required importance weights.
(i) We present an estimate of
in the well-specified case
through the minimization of Fisher divergence,
in Section 4. First, we show that estimating
the score matching method reduces to estimating by solving a simple finite-dimensional linear system
(Theorems 4 and 4).
Hyvärinen (2007) obtained a similar result for where the estimator is obtained by solving
linear system, which in the case of Gaussian family matches the
MLE (Hyvärinen, 2005). The estimator obtained
in the infinite dimensional case is not a simple extension of its
finite-dimensional counterpart, however, as the former requires an
regularizer (we use ) to make the
We would like to highlight that to the best of our knowledge, the proposed estimator is the
first practically computable estimator of with consistency guarantees (see below).
(ii) In contrast to Hyvärinen (2007) where no guarantees on consistency or convergence rates are provided for the density estimator in , we establish in Theorem 4.1 the consistency and rates of convergence for the proposed estimator of , and use these to prove consistency and rates of convergence for the corresponding plug-in estimator of (Theorems 4.1 and B.2), even when is infinite dimensional. Furthermore, while the estimator of (and therefore ) is obtained by minimizing the Fisher divergence, the resultant density estimator is also shown to be consistent in KL divergence (and therefore in Hellinger and total-variation distances) and we provide convergence rates in all these distances.
Formally, we show that the proposed estimator is converges as
if for some , where denotes the range or image of an operator , , and is a Hilbert-Schmidt operator on (see Theorem 4) with being the reproducing kernel and denoting the tensor product. When is a finite-dimensional RKHS, we show that the estimator enjoys parametric rates of convergence, i.e.,
Note that the convergence rates are obtained under a
non-classical smoothness assumption on , namely that it lies in the image
of certain fractional power of , which reduces to a more classical
assumption if we choose to be a Matérn kernel (see
Section 2 for its definition), as it induces a Sobolev space.
In Section 4.2, we discuss in detail the smoothness assumption on
for the Gaussian (Example 4.2) and
Matérn (Example 4.2) kernels.
Another interesting point to observe is that unlike
in the classical function estimation methods (e.g., kernel density estimation and regression), the rates presented above for the
proposed estimator tend to saturate for
( w.r.t. ), with the best rate attained
at ( w.r.t. ), which means the smoothness of
is not fully captured
by the estimator. Such a saturation behavior is well-studied in the inverse
literature (Engl et al., 1996) where it has been attributed to the choice
of regularizer. In Section 4.3, we discuss alternative
using ideas from Bauer et al. (2007),
which covers non-parametric least squares regression: we show that for
appropriately chosen regularizers, the above mentioned rates hold for any
, and do not saturate for the aforementioned ranges of (see
(iii) In Section 5, we study the problem of density estimation in the misspecified setting, i.e., , which is not addressed in Hyvärinen (2007) and Fukumizu (2009). Using a more sophisticated analysis than in the well-specified case, we show in Theorem 5 that as . Under an appropriate smoothness assumption on (see the statement of Theorem 5 for details), we show that as along with a rate for this convergence, even though . However, unlike in the well-specified case, where the consistency is obtained not only in but also in other distances, we obtain convergence only in for the misspecified case. Note that while Barron and Sheu (1991) considered the estimation of in the misspecified setting, the results are restricted to the approximating families consisting of polynomials, splines, or trigonometric series. Our results are more general, as they hold for abstract RKHSs.
(iv) In Section 6, we present preliminary numerical results comparing the proposed estimator with KDE in estimating a Gaussian and mixture of Gaussians, with the goal of empirically evaluating performance as gets large for a fixed sample size. In these two estimation problems, we show that the proposed estimator outperforms KDE, and the advantage grows as increases. Inspired by this preliminary empirical investigation, our proposed estimator (or computationally efficient approximations) has been used by Strathmann et al. (2015) in a gradient-free adaptive MCMC sampler, and by Sun et al. (2015) for graphical model structure learning. These applications demonstrate the practicality and performance of the proposed estimator.
Finally, we would like to make clear that our principal goal is not to construct density estimators that improve uniformly upon KDE, but to provide a novel flexible modeling technique for approximating an unknown density by a rich parametric family of densities, with the parameter being infinite dimensional, in contrast to the classical approach of finite dimensional approximation.
2 Definitions & Notation
We introduce the notation used throughout the paper. Define . For and , and . For , we write if for some positive universal constant . For a topological space , (resp. ) denotes the space of all continuous (resp. bounded continuous) functions on . For a locally compact Hausdorff space , is said to vanish at infinity if for every the set is compact. The class of all continuous on which vanish at infinity is denoted as . For open , denotes the space of continuously differentiable functions on . For , denotes the supremum norm of . denotes the set of all finite Borel measures on . For , denotes the Banach space of -power () -integrable functions. For , we will use for if is a Lebesgue measure on . For , denotes the -norm of for and we denote it as if and is the Lebesgue measure. The convolution of two measurable functions and on is defined as
provided the integral exists for all . The Fourier transform of is defined as
where denotes the imaginary unit .
In the following, for the sake of completeness and simplicity, we present definitions restricted to Hilbert spaces. Let and be abstract Hilbert spaces. A map is called a linear operator if it satisfies and for all and , where . A linear operator is said to be bounded, i.e., the image of under is bounded if and only if there exists a constant such that for all we have , where . In this case, the operator norm of is defined as . Define be the space of bounded linear operators from to . is said to be compact if is a compact subset in . The adjoint operator of is defined by . is called self-adjoint if and is called positive if for all . is called an eigenvalue of if there exists an such that and such an is called the eigenvector of and . For compact, positive, self-adjoint , , is called a fractional power of and is the square root of , which we write as . An operator is Hilbert-Schmidt if where is an arbitrary orthonormal basis of separable Hilbert space . is said to be of trace class if . For and , is an element of the tensor product space which can also be seen as an operator from to as for any . denotes the range space (or image) of .
A real-valued symmetric function is called a positive definite (pd) kernel if, for all , and all , we have . A function is a reproducing kernel of the Hilbert space of functions if and only if (i) , and (ii) , , hold. If such a exists, then is called a reproducing kernel Hilbert space. Since , it is easy to show that every reproducing kernel (r.k.) is symmetric and positive definite. Some examples of kernels that appear throughout the paper are: Gaussian kernel, that induces the following Gaussian RKHS,
the inverse multiquadric kernel, and the Matérn kernel, that induces the Sobolev space, ,
where is the Gamma function, and is the modified Bessel function of the third kind of order ( controls the smoothness of ).
For any real-valued function defined on open , is said to be -times continuously differentiable if for with , exists. A kernel is said to be -times continuously differentiable if exists and is continuous for all with where . Corollary 4.36 in Steinwart and Christmann (2008) and Theorem 1 in Zhou (2008) state that if exists and is continuous, then with and for every , we have and .
Given two probability densities, and on , the Kullback-Leibler divergence (KL) and Hellinger distance () are defined as and respectively. We refer to as the total variation (TV) distance between and .
3 Approximation of Densities by
In this section, we first show that every finite dimensional exponential family, is generated by the family induced by a finite dimensional RKHS, which naturally leads to the infinite dimensional generalization of when is an infinite dimensional RKHS. Next, we investigate the approximation properties of in Proposition 3 and Corollary 3 when is an infinite dimensional RKHS.
Let us consider a -parameter exponential family, with sufficient statistic and construct a Hilbert space, . It is easy to verify that induced by is exactly the same as since any can be written as for some . In fact, by defining the inner product between and as , it follows that is an RKHS with the r.k. since . Based on this equivalence between and induced by a finite dimensional RKHS, it is therefore clear that induced by a infinite dimensional RKHS is a strict generalization to with playing the role of a sufficient statistic.
The following are some popular examples of probability distributions that belong to . Here we show the corresponding RKHSs that generate these distributions. In some of these examples, we choose and ignore the fact that is a probability distribution as assumed in the definition of .
Exponential: , .
Normal: , .
Beta: , .
Gamma: , .
Inverse Gaussian: , .
Poisson: , , .
Binomial: , , .
While Example 3 shows that all popular probability distributions are contained in for an appropriate choice of finite-dimensional , it is of interest to understand the richness of (i.e., what class of distributions can be approximated arbitrarily well by ?) when is an infinite dimensional RKHS. This is addressed by the following result, which is proved in Section 8.1.
Define where is locally compact Hausdorff. Suppose and
Then is dense in w.r.t. Kullback-Leibler divergence, total variation ( norm) and Hellinger distances. In addition, if for some , then is also dense in w.r.t. norm.
A sufficient condition for to be locally compact Hausdorff is that it is either open or closed. Condition (5) is equivalent to being -universal (Sriperumbudur et al., 2011, p. 2396). If where , then (5) can be shown to be equivalent to (Sriperumbudur et al., 2011, Proposition 5). Examples of kernels that satisfy the conditions in Proposition 3 include the Gaussian, Matérn and inverse multiquadrics. In fact, any compactly supported non-zero satisfies the assumptions in Proposition 3 as (Sriperumbudur et al., 2010, Corollary 10). Though is still a parametric family of densities indexed by a Banach space (here ), the following corollary (proved in Section 8.2) to Proposition 3 shows that a broad class of continuous densities are contained in and therefore can be approximated arbitrarily well in norm (), Hellinger distance, and KL divergence by .
Let be a probability density such that for all , where is locally compact Hausdorff. Suppose there exists a constant such that for any , that satisfies for any with . Define
Suppose and (5) holds. Then is dense in w.r.t. KL divergence, TV and Hellinger distances. Moreover, if for some , then is also dense in w.r.t. norm.
By choosing to be compact and to be a uniform distribution on , Corollary 3 reduces to an easily interpretable result that any continuous density on can be approximated arbitrarily well by densities in in KL, Hellinger and () distances.
Similar to the results so far, an approximation result for can also be obtained w.r.t. Fisher divergence (see Proposition 5). Since this result is heavily based on the notions and results developed in Section 5, we defer its presentation until that section. Briefly, this result states that if is sufficiently rich (i.e., dense in an appropriate class of functions), then any with can be approximated arbitrarily well by elements in w.r.t. Fisher divergence, where .
4 Density Estimation in : Well-specified Case
In this section, we present our score matching estimator for an unknown density (well-specified case) from i.i.d. random samples drawn from it. This involves choosing the minimizer of the (empirical) Fisher divergence between and as the estimator, which we show in Theorem 4 to be obtained by solving a simple finite-dimensional linear system. In contrast, we would like to remind the reader that the MLE is infeasible in practice due to the difficulty in handling . The consistency and convergence rates of and the plug-in estimator are provided in Section 4.1 (see Theorems 4.1 and 4.1). Before we proceed, we list the assumptions on , and that we need in our analysis.
is a non-empty open subset of with a piecewise smooth boundary , where denotes the closure of .
is continuously extendible to . is twice continuously differentiable on with continuous extension of to for .
for and as , for all .
(-Integrability) For some and , and where
(i) being a subset of along with being continuous ensures that
is separable (Steinwart and Christmann, 2008, Lemma 4.33). The twice differentiability of ensures that every is twice continuously
differentiable (Steinwart and Christmann, 2008, Corollary 4.36). (C) ensures that in (3) is equivalent to the one in (4) through
integration by parts on (see Corollary 7.6.2 in Duistermaat and Kolk, 2004 for integration by parts on bounded subsets of which can be extended to unbounded
through a truncation and limiting argument) for densities in . In particular, (C) ensures that
for all and , which will be critical to prove
the representation in Theorem 4(ii), upon which rest of the results depend. The decay condition in (C) can be weakened
to as ,
for all if is a (possibly unbounded) box where .
(ii) When , the first condition in (D) ensures that for any . The other two conditions ensure the validity of the alternate representation for in (4) which will be useful in constructing estimators of (see Theorem 4). Examples of kernels that satisfy (D) are the Gaussian, Matérn (with ), and inverse multiquadric kernels, for which it is easy to show that there exists that satisfies (D).
(iii) (Identifiability) The above list of assumptions do not include the identifiability condition that ensures if and only if . It is clear that if constant functions are included in , i.e., , then for any . On the other hand, it can be shown that if and , then . A sufficient condition for is . We do not explicitly impose the identifiability condition as a part of our blanket assumptions because the assumptions under which consistency and rates are obtained in Theorem 4.1 automatically ensure identifiability.
Under these assumptions, the following result—proved in Section 8.3—shows that the problem of estimating through the minimization of Fisher divergence reduces to the problem of estimating through a weighted least squares minimization in (see parts (i) and (ii)). This motivates the minimization of the regularized empirical weighted least squares (see part (iv)) to obtain an estimator of , which is then used to construct the plug-in estimate of .
Suppose (A)–(D) hold with .
for all . In addition, the following
(i) For all ,
where , is a trace-class positive operator with