Direct estimation of density functionals using a polynomial basis

# Direct estimation of density functionals using a polynomial basis

Alan Wisler, Visar Berisha, Andreas Spanias, Alfred O. Hero This research was supported in part by Office of Naval Research grants N000141410722 (Berisha) and N000141712826 (Berisha) and Army Research Office grant W911NF-15-1- 0479 (Hero).
###### Abstract

A number of fundamental quantities in statistical signal processing and information theory can be expressed as integral functions of two probability density functions. Such quantities are called density functionals as they map density functions onto the real line. For example, information divergence functions measure the dissimilarity between two probability density functions and are useful in a number of applications. Typically, estimating these quantities requires complete knowledge of the underlying distribution followed by multi-dimensional integration. Existing methods make parametric assumptions about the data distribution or use non-parametric density estimation followed by high-dimensional integration. In this paper, we propose a new alternative. We introduce the concept of “data-driven basis functions” - functions of distributions whose value we can estimate given only samples from the underlying distributions without requiring distribution fitting or direct integration. We derive a new data-driven complete basis that is similar to the deterministic Bernstein polynomial basis and develop two methods for performing basis expansions of functionals of two distributions. We also show that the new basis set allows us to approximate functions of distributions as closely as desired. Finally, we evaluate the methodology by developing data driven estimators for the Kullback-Leibler divergences and the Hellinger distance and by constructing empirical estimates of tight bounds on the Bayes error rate.

Divergence estimation, direct estimation, nearest neighbor graphs, Bernstein polynomial

## I Introduction

Information divergence measures play a central role in the fields of machine learning and information theory. Information divergence functions, functionals that map density functions to IR, have been used in many signal processing applications involving classification [1], segmentation [2], source separation [3], clustering [4], and other domains. In machine learning, a sub-class of these divergences known as -divergences [5], are widely used as surrogate loss functions since they form convex upper bounds on the non-convex 0-1 loss [6].

Although these measures prove useful in a variety of applications, the task of estimating them from multivariate probability distributions using finite sample data can pose a significant challenge. It is especially challenging for continuous distributions, which is the focus of this paper. For algorithms and theory for estimating information measures for discrete distributions the reader is referred to [7, 8, 9, 10, 11, 12]. For the continuous case treated here, there are three general classes of methods for estimating divergence [13]: 1) parametric methods, 2) non-parametric methods based on density estimation, and 3) non-parametric methods based on direct (or graph-based) estimation. Parametric methods are the most common choice for estimation, and typically offer good convergence rates () when an accurate parametric model is selected. The fundamental limitation of parametric methods, is that an accurate parametric model is rarely available in real world problems and using an inaccurate parametric model can heavily bias the final estimate. As an alternative, when no parametric form is known, non-parametric density estimates such as kernel density estimation [14], histogram estimation [15], or -nearest neighbor (-NN) density estimation [16] are used to characterize the distribution. While these methods are quite powerful in certain scenarios, they are generally high variance, sensitive to outliers, and scale poorly with dimension [13].

An alternative to these two classes of methods is direct (or graph-based) estimation, which exploits the asymptotic properties of minimal graphs in order to directly estimate distribution functionals without ever estimating the underlying distributions themselves. These methods have been used to estimate density functionals such as entropy [17, 18, 19], the information divergence [13], and the -divergence [20]. This class of methods can have faster asymptotic convergence rates [13] and are often simpler to implement than plug-in methods which may have many tuning parameters such as kernel width or histogram bin size. Direct estimators are often inspired by an asymptotic property of a graph-theoretic quantity that can be scaled or modified in order to generate an estimator for a given information-theoretic measure. This approach is customized to a specific form of the density functional and is sometimes difficult to generalize. The estimator for the divergence is an example of this approach [20]. This is in contrast with plug-in estimators, where the same general approach can be used to estimate any distribution functional. Estimators based on influence functions attempt to bridge the gap between the two types of approaches [21]. In that work the authors present a recipe for estimating any smooth functional by using a Von Mises expansion - the analog of the Taylor expansion for distributions; however that approach still requires that part of the data be used for density estimation.

In this paper, we provide a general approach for estimating a wide range of distribution functionals. We propose decomposing the functional onto a complete set of “data-driven” basis functions; where the term “data-driven” means that the basis is determined directly from the data and does not involve distribution fitting or direct integration. We show that a broad class of distribution functionals can be approximated as closely as desired through linear combinations of our proposed basis, where the weights of the basis expansion are determined through convex optimization. Our approach offers a powerful alternative for estimating information-theoretic distribution functionals. We demonstrate the flexibility of the approach by constructing empirical estimators of bounds on the Bayes error rate using the same basis set.

The remainder of the paper is organized as follows. In the next section, we review the literature in this area. In Section II we provide a detailed description of the problem this paper attempts to solve and establish some of the basic mathematical notation used throughout this paper. In Section III we introduce a set of graph-theoretic basis functions and prove that a wide range of information theoretic quantities can be represented by a linear combination of functions in this set. In sections IV and V, we explore the limitations of the proposed methodology in the finite sample regime and propose two alternate fitting routines to identify weights to map these basis functions to quantities of interest. In section VI we empirically investigate how the proposed method can be used to estimate popular divergence measures (the KL-divergence, the Hellinger distance, the -divergence), and we compare its performance to various parametric and non-parametric alternatives. In Section VII, we show how the method can be extended to form tighter bounds on the Bayes error rate for binary classification problems. Section VIII offers some concluding remarks.

### I-a Related Work

A natural method for non-parametric estimation of continuous distribution functionals involves histogram binning followed by plug-in estimation [22, 23]. When the bin-size is adjusted as a function of the number of available samples per bin, this histogram plug-in method is known as Grenander’s method of sieves and it enjoys attractive non-parametric convergence rates [24, 25]. While these methods may work well for small data dimension (), their complexity becomes prohibitive for larger dimensions. Recent work has focused on non-parametric plug-in estimators that are more practical in higher dimensions. These approaches generally only estimate the PDF for the values of samples in the reference data, then calculate the expected value across the sample data in place of numerical integration [26, 27, 28]. To circumvent the slow convergence associated with these approaches, ensemble methods [27] and methods based on influence functions [21] have been proposed which are capable of achieving the parametric rate MSE convergence if the underlying densities meet certain smoothness conditions [27]. Alternatively, estimates of divergence functions that rely on estimates of the likelihood ratio instead of density estimation have been proposed for estimating the -divergence and the -divergence [6, 29, 30, 31, 32]. These methods estimate the likelihood ratio of the two density functions and plug that value into the divergence functions. Other approaches that bypass density estimation are the convex optimization approach of [29] to estimate -divergences and the -NN graph and minimal spanning tree approaches to estimating Henze-Penrose divergence [33, 34, 20].

Similarly, the approach we propose in this paper bypasses density estimation through a polynomial basis expansion where the basis coefficients are determined through a convex optimization criterion. This provides added flexibility and allows us to easily estimate a large class of distribution functionals and to establish empirical estimates of bounds on the Bayes error rate. Conceptually, this approach is similar to prior work in estimating the entropy of discrete distributions using polynomial approximations [7, 8, 9, 11].

Bounds on optimal performance are a key component in the statistical signal processing literature. For classification problems, it is often desirable to bound the Bayes error rate (BER) - the minimum achievable error in classification problems. The well-known Chernoff upper bound on the probability of error has been used in a number of statistical signal processing applications [35]. It motivated the Chernoff -divergence [13]. The Bhattacharyya distance, a special case of the Chernoff -divergence for , upper and lower bounds the BER [36, 37]. Beyond the bounds on the BER based on divergence measures, a number of other bounds exist based on other functionals of the distributions [38, 39]. For estimation problems, the Fisher information matrix (FIM) bounds the variance of the optimal unbiased estimator (through it’s relationship with the CRLB). The authors have also previously introduced the divergence, a non-parametric -divergence, and showed that it provides provably tighter bounds on the BER than the BC bound [20]. They extended this work to estimation of the Fisher information in [40].

Our data-driven basis, consisting of Bernstein polynomials, can be used to estimate functionals of distributions and to estimate bounds on Bayes optimal classification performance. Bernstein polynomials of a different form have been used for density estimation [41, 42, 43, 44, 45, 46]. In contrast to this work, our methods do not rely on density estimation.

## Ii Problem Setup

In this section, we will set up the problem and establish the notation that will be used throughout the rest of the paper. We are given a set of data containing instances, where each instance is represented by a -dimensional feature vector and a binary label . Suppose that this data is sampled from underlying distribution, , where

 fx(x)=p0f0(x)+p1f1(x) (1)

is made up of the two conditional class distributions and for classes and , with prior probabilities and respectively. If the priors aren’t explicitly known, they can be easily estimated from the sample data by measuring the ratio of samples drawn from each class. As a simple application of Bayes theorem, we can define the posterior likelihood of class 1, , evaluated at a point , as

 η(x∗) =P[y=1|x=x∗]=P[y=1]fx(x∗|y=1)fx(x∗) (2) =p1f1(x∗)fx(x∗)=p1f1(x∗)p0f0(x∗)+p1f1(x∗)

We can similarly define the posterior probability for class 0 as

 P[y=0|x=x∗]=p0f0(x∗)p0f0(x∗)+p1f1(x∗), (3)

and since is binary,

 P[y=0|x=x∗]=1−η(x∗). (4)

To simplify the notation, we remove the dependence of on from portions of the analysis that follow.

Suppose that we wish to estimate some functional of distributions and , which can be expressed in the following form

 G(f0,f1)=∫g(η(x))fx(x)dx. (5)

Throughout the rest of this paper, we will refer to the in (5) as the posterior mapping function. Many functionals in machine learning and information theory, such as -divergences and loss functions, can be expressed this way. Consider the family of -divergences as an example. They are defined as

 Dϕ(f0,f1)=∫ϕ(f0(x)f1(x))f1(x)dx, (6)

where is a convex or concave function unique to the given -divergence. By substituting

 f0(x)f1(x)=p1(1−η(x))p0η(x) (7)

and

 f1(x)=η(x)p1fx(x) (8)

we can redefine (6) as

 Dϕ(f0,f1)=∫ϕ(p1(1−η(x))p0η(x))η(x)p1fx(x)dx. (9)

Thus any -divergence can be presented in the form outlined in (5) simply by defining the posterior mapping function as

 g(η)=ηp1ϕ(p1(1−η)p0η). (10)

We propose a procedure for estimating these types of divergence functionals which bypasses density estimation. We do this by representing the functional in terms of the asymptotic limit of a linear combination of graph-theoretic basis functions.

Suppose that there exists a set of basis functions that can be similarly expressed as

 Hi(f0,f1)=∫hi(η(x))fx(x)dx. (11)

If we assume that there exists a set of coefficients such that

 g(η)≈k∑i=0wihi(η), (12)

then consequently

 G(f0,f1)≈^G(f0,f1)=k∑i=0wiHi(η), (13)

where the sense of approximation is that the norm of the difference between the right and left hand sides is small. In the following section, we will introduce a set of basis functions that have the desired properties.

## Iii Graph-theoretic Basis Functions

Consider the dataset previously defined. Suppose we select an arbitrary instance from and examine it along with the set of its nearest neighbors in . We can define the neighborhood set , as the union of and its nearest neighbors. Using this we define as the number of instances in the neighborhood set which are drawn from class 1, or alternatively, the sum of across all points in the neighborhood

 Φk(x∗)=∑i:xi∈Nk(x∗)yi. (14)

Figure 3 provides a simple illustration to help explain how is calculated. Calculating is similar to how nearest neighbor classifiers make decisions, but with two important differences:

1. The base instance is considered in the neighborhood indistinguishably from other instances in .

2. Where traditional -NN classifiers are concerned only with identifying the majority, we are interested in the exact number of instances drawn from each class.

In essence tells us something about the probability that for instances on or near . Since we are more concerned with the dataset as a whole than the local characteristics in , we define the statistic to be the fraction of instances , for which , . If we define the indicator function as

 Ir,k(x)={1Φk(x)=r0otherwise, (15)

then this test statistic can be represented by

 ρr,k,N(X)=1N∑x∈XIr,k(x). (16)

The function is simply the proportion of -NN neighborhoods that contain exactly points from class . This statistic has a number of desirable qualities. We show that this statistic asymptotically converges to a function of the underlying distributions that can be described in the form outlined in (5). The following is proven in Appendix A.

###### Theorem 1.

As the number of samples () approaches infinity,

 limN→∞ρr,k,N(X)L2→∫(kr)ηr(x)(1−η(x))k−rfx(x)dx

whenever .

We propose to use the asymptotic form of defined in Theorem 1 as a basis function for estimating functionals of the form (5),

 Hr,k(f0,f1) =limN→∞ρr,k,N(X) (17)

where

 hr,k(η)=(kr)ηr(1−η)k−r. (18)

The function (18) is the Bernstein basis polynomial of degree [47]. Bernstein’s proof of the Weierstrass Approximation Theorem [48] asserts that any continuous function can be uniformly approximated on to any desired accuracy by a linear combination of functions in (18) of the form

 g(η)≈k∑r=0g(rk)hr,k(η). (19)

Combining this result with Theorem 1, we can show that a linear combination of this basis can be used to estimate any function of the form (5) .

###### Theorem 2.

For any that can be expressed in the form

 G(f0,f1)=∫g(η(x))fx(x)dx,

where is continuous on , the approximation

 ^Gk,N(X)=k∑r=0g(rk)ρr,k,N(X) (20)

satisfies

 limk→∞limN→∞k/N→0E[(^Gk,N(X)−G(f0,f1))2]=0. (21)

Theorem 2 provides an asymptotically consistent method of estimating a variety of information-theoretic functions that makes no assumptions on the underlying distributions and can be calculated without having to perform density estimation. Throughout the rest of the paper we will refer to the weights in the approximation specified by (20) in Theorem 2 as the Bernstein weights. We next turn to the finite sample properties of the estimator .

## Iv Finite Sample Considerations

The previous Section investigated the asymptotic properties of linear combinations of the proposed set of empirically estimable basis functions. The asymptotic consistency of the proposed method is valuable, however in real world scenarios, data is inherently a finite resource, and as a result the efficacy of this method is heavily dependent on its convergence characteristics in the finite sample regime. In this section, we will take a detailed look into how restricting both and affects our ability to estimate functions of two distributions. To do this, it is necessary to first break down the different sources of error in the proposed methodology.

### Iv-a Estimation vs. Approximation Error

The goal of this paper is to empirically estimate the functional of the two underlying distributions and using a linear combination of directly estimable basis functions

 ^Gk,N(X)=k∑r=0wr^Hr,k,N(X). (22)

We divide the error of this estimate into two types, the approximation error () and the estimation error ():

 eT =G(f0,f1)−^Gk,N(X) (23) =G(f0,f1)−^Gk(f0,f1)=:eA+^Gk(f0,f1)−^Gk,N(X)=:eest.

This allows us to separate the error in estimating the basis functions from error in fitting to the posterior mapping function. Understanding the trade-off between these two error types will be particularly useful in Section V, where we explore different methods of fitting weights to the desired density functionals.

### Iv-B Considerations for finite k

A finite sample also implies a finite and impacts the approximation error. Let us consider the Bernstein weighting scheme introduced in (20) for the scenario where the size of the basis set () is restricted. Consider the following example problem.

Example: Suppose that we wish to estimate the function

 g(η)=(31)η(1−η)2 (24)

using the basis set . Because , there exists a set of weights such that

 3∑r=0wrβr,3(η)=g(η), (25)

however, using the Bernstein weighting scheme in (20) yields

 ^g(η)= 3∑r=0g(r3)βr,3(η) (26) = g(03)β0,3(η)+g(13)β1,3(η)+g(23)β2,3(η) +g(33)β3,3(η) = 43η(1−η)2+23η2(1−η) ≠ g(η).

It is clear from this example that the Bernstein weighting procedure do not always provide ideal weights when is restricted. Based on these results, we are motivated to explore alternative weighting procedures in order to improve the performance of this method for the finite sample case. In the following Section, we will introduce a method of finding better weights using convex optimization.

Regardless of how weights are assigned to the approximation, selection of remains an important factor affecting performance. In order to satisfy Theorem 2, should be functionally dependent on , such that approaches infinity, both and , however this still provides a great degree of freedom in the selection of .

In general, there are two major competing factors that must be considered when selecting . The first is that the Weierstrass approximation theorem can exactly represent any posterior mapping function as a linear combination of the proposed basis set only as . This provides motivation for choosing a large -value to ensure the best possible fit of . The second factor is that the asymptotic characteristics of are dependent on all points in . Moreover the regime for which Theorem 2 holds, requires that , so we are motivated to select such that . This means that the selection of must achieve a compromise in the trade-off between the approximation and estimation errors, since larger values of will increase the amount of finite sample error made in estimating the individual basis functions, while lower values of may inhibit our ability to accurately model the desired function in the asymptotic regime.

To illustrate how our ability to estimate the desired set of basis functions varies with , we calculate the estimated and asymptotic values of for and on data drawn from distributions and , where and , and plot the results in Figure 4. Estimates are calculated at 3 different sample sizes () and each estimate shown in Figure 4 has been averaged across 500 Monte Carlo trials. While we can estimate the basis set for either -value with a high degree of accuracy given enough samples, the estimates for are noticeably more accurate. In fact, we are able to do about as well with samples for as we are with samples for .

## V Optimization Criteria for Fitting Density Functionals

In this section, we propose a convex optimization criterion to identify appropriate weights for fitting information-theoretic functions when and are restricted. Inherently, our goal is to minimize the total error, defined in (23), however minimizing this quantity directly isn’t feasible since the value of is unknown. To circumvent this challenge we focus on developing a criterion to minimize the approximation error. We initially develop an optimization criterion that assumes the posterior is uniformly distributed, then propose an alternate method which incorporates an estimate of the posterior density function in order to more accurately model the approximation error.

### V-a Uniform Optimization Criteria

Recall that the approximation error can be represented as

 eA=∫ϵ(η(x))fx(x)dx, (27)

where

 ϵ(η)=g(η)−k∑r=0wrhr,k(η). (28)

Since solving (27) requires high-dimensional integration and knowledge of the underlying distributions. However, because is a function of , is implicitly a function of as well, and by the law of the unconscious statistician [49],

 eA=E[ϵ(η)]=∫ϵ(η)fη(η)dη, (29)

where is the probability density function of the random variable . Rewriting the error in this form simplifies the region of integration to a well defined space (since ) and circumvents the high dimensionality of . While this eliminates some of the challenges in calculating the error it also creates new ones stemming from the fact that is unknown and difficult to estimate due to its implicit dependency on and . The task of estimating will be explored in detail in Section V-B, however for the time being we will bypass this challenge and simply attempt to minimize

 e∗A=∫|ϵ(η)|2dη. (30)

It is worth noting that if is uniformly distributed

 e∗A=E[|ϵ(η)|2]≥e2A. (31)

While there exists an analytical solution for identifying the weights which minimize (30)[50], we use a convex optimization procedure that allows us to also account for the estimation error. If we define a discretized set of posterior values , where , a procedure to identify weights that minimize (30) can be defined as

 w0,...,wk=argminw0,...,wk1~N~N∑i=1∣∣g(~ηi)−k∑r=0wrhr,k(~ηi)∣∣2. (32)

.

To illustrate the effectiveness of this method, we consider the example problem of trying to estimate the Hellinger distance (a problem we will further explore in Section VI). If we assume both classes have equal prior probability (), then the posterior mapping function for the squared Hellinger distance is

 g(η)=(√η−√1−η)2. (33)

This function is estimated using this convex weighting procedure as well as the previously described Bernstein weighting procedure, and we compare how well each method models the desired function for values of varying from to . The performances of each method is evaluated by the following formula

 MSE(^g,g)=~N∑i=1∣∣g(~ηi)−^g(~ηi)∣∣2, (34)

and the results are presented in Figure 5 for a range of values varying from to . This experiment shows that the proposed convex fitting procedure is able to approximate the Hellinger posterior mapping function far more accurately than the Bernstein approximation. This improvement isn’t surprising since the proposed method directly minimizes the MSE whereas the Bernstein approximation guarantees consistency only as , but it still helps to illustrate the potential for improvement in the Bernstein weights that exists for smaller .

The expression (34) does not take into account finite sample errors that lead to noisy estimates of the basis functions and thus does not directly reflect our ability to estimate with a finite sample. The consequences of this could be quite significant. When using a similar approach for entropy estimation, Paninski observed that the good approximation properties were a result of “large oscillating coefficients” which magnify the variance of the corresponding estimator[11]. Additionally, it does not account for the possibility that is distributed non-uniformly.

To empirically examine the finite sample properties of the two approaches, we repeat the previous experiment, this time estimating the basis functions empirically on samples of data drawn from distributions and . We generate samples (500 samples per class) in each of the 500 iterations of a Monte Carlo simulation, and evaluate the MSE as

 MSE(G,^G)=1NMCNMC∑i=1[G(f0,f1)−^G(f0,f1)]2, (35)

where represents the number of Monte Carlo iterations. Since we know that the estimation error is scaled by the magnitude of the weights, we also evaluate a modified fitting routine which augments (32) with a regularization term to penalize solutions with large weights,

 w0,...,wk= (36) argminw0,...,wk1~N~N∑i=1∣∣g(~ηi)−k∑r=0wrhr,k(~ηi)∣∣2+λkk∑r=0w2r,

where represents a tuning parameter which controls the importance assigned to minimization of the approximation error relative to the estimation error. Intuitively, higher values will make sense for smaller data sets to control the variance of the estimator. We set for all experiments conducted in this paper. The results of this experiment are shown in Figure 6. We immediately see the necessity of the regularization term, as without it the error becomes extremely large for a range of values. More generally, the inclusion of the regularization term improves the performance at every value in this experiment. In comparing the Bernstein weights with the convex (regularized) weights, we find that 1) the performance of the convex method is less dependent on the selection of and 2) the convex weights generally perform better at lower values of , while the Bernstein weights outperform at higher . While the peak performance of the Bernstein method is higher than the convex method, there exists no good method of selecting a priori in order to reliably achieve this performance. In contrast, the convex method with regularization is less sensitive to the value of selected.

### V-B Density-weighted Optimization Criteria

In the optimization criteria introduced in the previous section we implicitly make the assumption that the distribution of the random variable is uniformly distributed. In this section we will investigate a data-driven estimator for . However, before we proceed it is important to clarify what this distribution actually is.

We initially introduced as the posterior likelihood function for class 1, which we showed in (2) can be represented as a function of the underlying distributions. When this function’s input is a known point , represents the probability that given that . However, if the input is a random variable , then is also a random variable, which is distributed according to .

Figure 10 illustrates for two univariate normal distributions and . Figure (a)a displays the two class distributions across , Figure (b)b displays as a function of , and Figure (c)c displays as a function of . We see from this illustration that while, is close to or across most of the region of that is displayed, remains close to in the regions where is greatest. As a results is roughly bell-shaped, and the likelihood of existing at the extremities (close to or ) is relatively low. Because the probability density in these regions is low, the accuracy of our fit in these regions is less important, and can be given less weight in the fitting routine. Figure 14 repeats this illustration for two well seperated normal distributions. In this case the distribution of is such that is most dense towards the extremities, and therefore they should be given more weight in the fitting routine. Side by side, these two figures present an interesting contrast. Despite the fact that the set of density functions and look quite similar in the two scenarios, the minor difference in separation significantly alters the distribution of .

In practice the underlying distributions and are unknown, and as a result, is unknown as well. However, if we were able to sample at , a more direct method of minimizing would be to solve

 w0,...,wk= (37) argminw0,...,wk~N∑i=1∣∣g(~ηi)−k∑r=0wrhr,k(~ηi)∣∣2^fη(~ηi)Δ~η+λkk∑r=0w2r,

where .

Below we show that , the statistic previously defined in Theorem 1 can be used to sample the PDF of . This result is stated in Theorem 3.

###### Theorem 3.

As and in a linked manner such that and

 kρr,k,N(X)→fη(η∗).

Theorem 3 is useful as it provides a method of sampling that doesn’t depend on estimates of the underlying density functions and . Using this result, we can estimate the density of the posterior at point as

 ^fη(~ηi)=~kiρ~ri,~ki,N(X) (38)

where and are integers selected such that for . Sampling at exactly may not always be possible since the maximum value of is limited by the sample size, and determines the resolution of the sampling scheme. Even if it is possible, it may not be desirable to recalculate for different values of because of the computational burden. To overcome these problems we can design our approach such that we utilize the same set of test statistics in the estimation of the posterior distribution as are used in the estimation of the basis set. One way to do this is to assign the set of discretized posteriors

 ~η=[0,1k,2k,...,1] (39)

so that it is straightforward to calculate from the known values of . An alternate approach is to leave unchanged and interpolate to determine its value at the desired points. The advantage of this approach is that it doesn’t constrain how is sampled. Throughout the rest of this paper, we will employ the latter method and solve for by linearly interpolating between its values on the discretized set (39).

Because this density-weighted fitting routine more directly minimizes the approximation error of the final estimate, we expect it to generally outperform the uniform method, particularly in cases where the density of the posterior is highly non-uniform and where the desired posterior mapping function is difficult to model using the proposed basis set. This hypothesis will be verified in Sections VI and VII, when we empirically evaluate our methods with real data. This improvement comes at a computational cost since the weights are now data-dependent, they must be calculated online, whereas for the uniform method they can be calculated offline and stored. Solving for the weights can be be implemented with time complexity [51]. Current state-of-the-art algorithms for -NN graphs can be implemented with time complexity [52]. As a result the complexity for the Convex (uniform) method is assuming that the weights are available. The complexity for the Convex (density weighted) method is since the weights must be determined for each new dataset.

## Vi Divergence Estimation

In this section, we show how the proposed method can be applied to estimating three -divergences, the Hellinger distance, the KL-divergence, and the -divergence, directly from data.

### Vi-a Hellinger Distance

The Hellinger distance squared is an -divergence used to quantify the dissimilarity between two probability distributions and is calculated by

 H2(f0,f1)=12∫(√f0(x)−√f1(x))2dx. (40)

Using the approach proposed in Section V, we can estimate by fitting weights to the posterior mapping function

 gH(η)=12(√ηp1−√1−ηp0)2. (41)

To evaluate the efficacy of this method, we conduct four different experiments in which we attempt to estimate the Hellinger distance between two distributions from finite sample data. In the first three experiments, both distributions are normally distributed according to , where

 Σd=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣σ1,1σ1,2…σ1,dσ2,1σ2,2…σ2,d⋮⋮⋱⋮σd,1σd,2…σd,d⎤⎥ ⎥ ⎥ ⎥ ⎥⎦, (42)

for . The first experiment evaluates the most basic case where both Gaussians have spherical covariance. The second experiment considers the case where there exists a strong fixed dependency between adjacent dimensions by using an elliptical covariance structure. The third experiment evaluates the case where this dependency between adjacent dimensions is now dependent on which class the data is drawn from. In the fourth experiment, we return to linearly independent dimensions and consider the case where one of the distributions isn’t normally distributed, but instead uniformly distributed within a -dimensional hypercube

 f(x)={1(2β)dx∈[μ−β,μ+β]d0otherwise. (43)

A detailed description of the distribution types and parameter setting used in each of the four experiments is presented in Table I. In addition to using the proposed method, we also estimate the Hellinger distance using two different plug-in estimators, one based on a parametric density estimator that assumes the data is normally distributed and one based on a -NN density estimate of the underlying distributions. The proposed method is implemented using and , while setting . To calculate the -NN estimate, we use the universal divergence estimation approach described in [53] and implemented in the ITE toolbox [54]. This method allows us to fix and still achieve an asymptotically consistent estimate of the divergence.

In each of the first three experiments, the parametric model shows the highest rate of convergence as expected, although in experiment two it is slightly outperformed at smaller sample sizes by the proposed method. In the fourth experiment, when the assumption of Gaussianity in no longer holds, the parametric solution is significantly biased and as a result, is outperformed by both non-parametric methods at higher sample sizes (). Relative to the -NN plug-in estimator, the proposed method performs slightly worse in experiment 1, slightly better in experiment 2 and significantly better in experiments 3 and 4, with the results being relatively consistent across the various sample sizes. The improvement in performance shown in experiments 3 and 4 suggests that the proposed method offers the greatest benefit when there exists differences in the shapes of the two underlying distributions. Though the density-weighted procedure consistently outperformed the uniform method, the observed improvement was relatively minor in these experiments.

### Vi-B Kullback Leibler Divergence

The Kullback Leibler (KL) divergence [55], also sometimes referred to as the KL risk or relative entropy, is an asymmetric measure of divergence between two probability density functions. Using our regular notation the KL-divergence can be calculated by

 dKL(f0||f1)=∫∞−∞f0(x)log(f0(x)f1(x))dx. (44)

While the KL-divergence has the same general purpose as the Hellinger distance, that is to measure the dissimilarity between two probability density functions, it also has several key difference. Firstly since the KL-divergence is asymmetric it doesn’t technically qualify as a distance function and isn’t necessarily equal to . This also means that and will have different posterior mapping functions. We can define the posterior mapping function for as

 g0KL(η)=1−ηp0log(p1(1−η)p0η) (45)

and the posterior mapping function for as

 g1KL(η)=ηp1log(p0ηp1(1−η)) (46)

such that

 dKL(fi||f|i−1|)=∫∞−∞giKL(η(x))(p0f0(x)+p1f1(x))dx. (47)

It is worth noting that when

 g1KL(η)=g0KL(1−η) (48)

thus is a reflection of . One challenge presented in modeling the KL-divergence is that the posterior mapping functions are difficult to model due to discontinuities at the end points, since and is undefined though

 limη→1−g0KL(η)=0. (49)

Due to their symmetry has the same problem at the opposite endpoints. To handle this we simply select our discretized set of posteriors such that . For the experiments in this Section, we set

 ~η1,~η2,~η3,...,~η100,~η101=ϵ,0.01,0.02,...,0.99,1−ϵ (50)

where . Using this modified set of discretized posteriors, we repeat the set of four experiments described in Section VI-A to evaluate the proposed methods ability to estimate the KL-divergence. The results of this experiment are displayed in Figure 16. Like the estimates of the Hellinger distance, the parametric method generally yielded the best performance in the first three experiments, but suffered from a large asymptotic bias in experiment 4. The proposed method once again significantly outperformed the -NN plug-in estimator in experiments 3 and 4, however the results in experiments 1 and 2 are slightly more nuanced due to the significant difference in performance between the two optimization criteria in these trials. In both of these trials the density-weighted criteria significantly outperforms the uniform method at all sample sizes. In experiment 1 the -NN plug-in estimator outperforms the regular plug-in estimator at all sample sizes, but is outperformed by the density-weighted method for . In experiment 2 the -NN plug-in estimator consistently outperforms both proposed methods, however the improvement over the density-weighted method is marginal.

### Vi-C Dp-Divergence

The -divergence is an -divergence defined by

 Dp0(f0,f1)=14p0p1[∫(p0f0(x)−p1f1(x))2p0f0(x)+p1f1(x)dx−(p0−p1)2]. (51)

The -divergence has the unique property of being directly estimable from data using minimum spanning trees [20]. Because of this property, it has been used to form non-parametric estimates of the Fisher information [40] as well as upper bounds on the Bayes error rate in a range of classification problems [20, 56]. The posterior mapping function for the -divergence can be defined as

 gDp(η)=(2η−1)2−(2p0−1)24p0(1−p0) (52)

which simplifies to when . We once again repeat the experiments described in Section VI-A to evaluate the proposed methods ability to estimate the -divergence. This experiment provides the unique opportunity to compare the proposed method to a more traditional direct estimation procedure. The results of this experiment are displayed in Figure 17.

As in the previous experiments, the parametric estimate generally performed the best in the first three experiments, but suffered from a large asymptotic bias in experiment 4. The proposed methods perform better than the MST-based estimation in experiments 1 and 2 but worse in experiments 3 and 4. The relative performance of each method in this experiment was largely consistent across the various sample sizes, though the proposed method seems to be converging slightly faster than the MST method in experiment 4 and could possibly exceed its performance given enough samples. Unlike in the previous experiments, we found no difference in performance between the uniform optimization criteria and the density-weighted criteria in this experiment. This is due to the fact that is a polynomial and can be perfectly represented by the proposed basis set, even when is truncated. Since we are able to achieve a solution where , the density weighting has no impact on the final results.

## Vii Fitting Bounds on Performance

The optimization criteria in the proposed fitting routines gives us the ability to not only approximate information-theoretic functions, but to bound them as well. This is especially useful for forming bounds on the Bayes Error Rate (BER). The Bayes error rate represents the optimal classification performance that is achievable for a given pair of class distributions and with prior probabilities and respectively and can be calculated by

 ϵBayes=∫p0f0(x)≤p1f1(x)p0f0(x)dx  +∫p1f1(x)≤p0f0(x)p1f1(x)dx. (53)

In essence the BER measures the intrinsic difficulty of a particular classification problem based on the data. A thorough understanding of the BER of a particular problem can help design optimal classifiers. Because of the challenges associated with estimating the BER, much of the literature has focused on generating bounds on the BER [37, 20], which are generally formulated in terms of some measure of divergence between the two class distributions. One such bound, the well-known Bhattacharya bound, is given by [37]

 12−12√1−BC2(f0,f1)≤ϵBayes≤12BC(f0,f1), (54)

where

 BC(f0,f1)=1−H2(f0,f1). (55)

While the Hellinger distance here can be estimated via any of the methods discussed in Section VI-A, parametric estimates are most common. Alternatively [20] introduced the bounds

 12−12√D12(f0,f1)≤ϵBayes≤12−12D12(f0,f1) (56)

where

 D12(f0,f1)=1−2∫f0(x)f1(x)f0(x)+f1(x)dx. (57)

These bounds have the advantage of being provably tighter than the Bhattacharyya bounds [20]. Furthermore since represents a particular case of the -divergence, which is estimable directly from data, these bounds bypass the need for density estimation much like the approaches proposed in this paper. While these bounds are significantly tighter than the Bhattacharyya bounds, they still leave room for improvement. Avi-Itzhak proposed arbitrarily tight bounds on the BER in [39], however these bounds require density estimation to be employed in practical problems. In this section, we will use a modified version of the previously described fitting routine in order to investigate how tightly we are able to bound the BER using a linear combination of directly estimable basis functions.

Using the fitting routine described in (36) to bound the BER, requires that we define appropriately for estimation of the BER, and constrain our fit such that

 k∑r=0wrhr,k(~ηi)≥g(~ηi)∀~ηi. (58)

We can express (53) as

 ϵBayes (59) =∫min[1−η(x),η(x)]fx(x)dx

so . Incorporating these changes within the regularized fitting routine described in (36) yields

 w0,...,wk= (60) argminw0,...,wk1~N~N∑i=1∣∣g(~ηi)−k∑r=0wrhr,k(~ηi)∣∣2+λkk∑r=0w2r subject to k∑r=0wrhr,k(~ηi)≥g(~ηi)∀~ηi.

Figure 18 compares the theoretical values of each of these bounds as a function of . These results indicate that the proposed method offers much tighter theoretical bounds than the other two methods, however this bound is based on the asymptotic properties of the proposed basis set and doesn’t consider the limitations of a finite sample estimate.

We evaluate the finite-sample performance of this method by calculating each of the described bounds across the four experiments described in Table I. Figure 19 displays the ground truth value of the BER, along with the theoretical and estimated values for each of the three bounds (Bhattacharyya, , and convex) across sample sizes ranging from 100 to 10000. The Bhattacharyya bound is calculated based on a parametric plug-in estimator which assumes both class distributions to be normally distributed. The bound is calculated from the Friedman-Rafsky test statistic using the approach described in [20]. Finally the convex bound is calculated as a linear combination of the proposed directly estimable basis functions using weights optimized according to (60). The results of this experiment are largely consistent across the four experiments, the convex method yields the tightest bound, followed by the bound, and finally the Bhattacharyya bound. Except for the Bhattacharyya bound in experiment 4, which is estimated parametrically, all of the bounds appear to converge to their asymptotic solution. While the convex bound generally offers a slightly slower convergence rate than the other two solutions, it remains tighter than the other two bounds across all sample sizes.

In order to further validate this bound we repeat one of the experiments conducted in [20] by evaluating the proposed bound along with the Mahalanobis bound, the Bhattacharyya bound, and the bound on two 8-dimensional Gaussian data sets described in [57]. The mean and standard deviations of and for the two data sets are described in Table II, and all dimensions are independent. These data sets allow us to analyze the tightness and validity of the bounds in a higher dimensional setting. For this experiment the sample size was fixed at and only the empirical value of each of the bounds was evaluated. Table III displays the mean and standard deviation of each bound calculated across 500 Monte Carlo iterations for each of the two data sets. In both data sets the convex method provides the tightest bounds on the BER.

## Viii Conclusion

This paper introduces a novel method for estimating density functionals which utilizes a set of directly estimable basis functions. The most appealing feature of the proposed method is its flexibility. Where previous methods of direct estimation are generally only applicable to a specific quantity, we show that the basis set can be used to generate an asymptotically consistent estimate of a broad class of density functionals, including all