Ensemble estimators for multivariate
The problem of estimation of density functionals like entropy and mutual information has received much attention in the statistics and information theory communities. A large class of estimators of functionals of the probability density suffer from the curse of dimensionality, wherein the mean squared error (MSE) decays increasingly slowly as a function of the sample size as the dimension of the samples increases. In particular, the rate is often glacially slow of order , where is a rate parameter. Examples of such estimators include kernel density estimators, -nearest neighbor (-NN) density estimators, -NN entropy estimators, intrinsic dimension estimators and other examples. In this paper, we propose a weighted affine combination of an ensemble of such estimators, where optimal weights can be chosen such that the weighted estimator converges at a much faster dimension invariant rate of . Furthermore, we show that these optimal weights can be determined by solving a convex optimization problem which can be performed offline and does not require training data. We illustrate the superior performance of our weighted estimator for two important applications: (i) estimating the Panter-Dite distortion-rate factor and (ii) estimating the Shannon entropy for testing the probability distribution of a random sample.
Non-linear functionals of probability densities of the form arise in applications of information theory, machine learning, signal processing and statistical estimation. Important examples of such functionals include Shannon and Rényi entropy, and the quadratic functional . In these applications, the functional of interest often must be estimated empirically from sample realizations of the underlying densities.
Functional estimation has received significant attention in the mathematical statistics community. However, estimators of functionals of multivariate probability densities suffer from mean square error (MSE) rates which typically decrease with dimension of the sample as , where is the number of samples and is a positive rate parameter. Examples of such estimators include kernel density estimators , -nearest neighbor (-NN) density estimators , -NN entropy functional estimators [9, 17, 13], intrinsic dimension estimators , divergence estimators , and mutual information estimators. This slow convergence is due to the curse of dimensionality. In this paper, we introduce a simple affine combination of an ensemble of such slowly convergent estimators and show that the weights in this combination can be chosen to significantly improve the rate of MSE convergence of the weighted estimator. In fact our ensemble averaging method can improve MSE convergence to the parametric rate .
Specifically, for -dimensional data, it has been observed that the variance of estimators of functionals decays as while the bias decays as . To accelerate the slow rate of convergence of the bias in high dimensions, we propose a weighted ensemble estimator for ensembles of estimators that satisfy conditions (2.1) and (2.2) defined in Sec. II below. Optimal weights, which serve to lower the bias of the ensemble estimator to , can be determined by solving a convex optimization problem. Remarkably, this optimization problem does not involve any density-dependent parameters and can therefore be performed offline. This then ensures MSE convergence of the weighted estimator at the parametric rate of .
1.1 Related work
When the density is times differentiable, certain estimators of functionals of the form , proposed by Birge and Massart , Laurent  and Giné and Mason , can achieve the parametric MSE convergence rate of . The key ideas in [2, 11, 5] are: (i) estimation of quadratic functionals with MSE convergence rate ; (ii) use of kernel density estimators with kernels that satisfy the following symmetry constraints:
for ; and finally (iii) truncating the kernel density estimate so that it is bounded away from . By using these ideas, the estimators proposed by [2, 11, 5] are able to achieve parametric convergence rates.
In contrast, the estimators proposed in this paper require additional higher order smoothness conditions on the density, i. e. the density must be times differentiable. However, our estimators are much simpler to implement in contrast to the estimators proposed in [2, 11, 5]. In particular, the estimators in [2, 11, 5] require separately estimating quadratic functionals of the form , and using truncated kernel density estimators with symmetric kernels (1.1), conditions that are not required in this paper. Our estimator is a simple affine combination of an ensemble of estimators, where the ensemble satisfies conditions and . Such an ensemble can be trivial to implement. For instance, in this paper we show that simple uniform kernel plug-in estimators (3.3) satisfy conditions and .
Ensemble based methods have been previously proposed in the context of classification. For example, in both boosting  and multiple kernel learning  algorithms, lower complexity weak learners are combined to produce classifiers with higher accuracy. Our work differs from these methods in several ways. First and foremost, our proposed method performs estimation rather than classification. An important consequence of this is that the weights we use are data independent, while the weights in boosting and multiple kernel learning must be estimated from training data since they depend on the unknown distribution.
The remainder of the paper is organized as follows. We formally describe the weighted ensemble estimator for a general ensemble of estimators in Section 2, and specify conditions and on the ensemble that ensure that the ensemble estimator has a faster rate of MSE convergence. Under the assumption that conditions and are satisfied, we provide an MSE optimal set of weights as the solution to a convex optimization(2.3). Next, we shift the focus to entropy estimation in Section 3, propose an ensemble of simple uniform kernel plug-in entropy estimators, and show that this ensemble satisfies conditions and . Subsequently, we apply the ensemble estimator theory in Section 2 to the problem of entropy estimation using this ensemble of kernel plug-in estimators. We present simulation results in Section 4 that illustrate the superior performance of this ensemble entropy estimator in the context of (i) estimation of the Panter-Dite distortion-rate factor  and (ii) testing the probability distribution of a random sample. We conclude the paper in Section 5.
We will use bold face type to indicate random variables and random vectors and regular type face for constants. We denote the statistical expectation operator by the symbol and the conditional expectation given random variable using the notation . We also define the variance operator as and the covariance operator as . We denote the bias of an estimator by .
2 Ensemble estimators
Let denote a set of parameter values. For a parameterized ensemble of estimators of , define the weighted ensemble estimator with respect to weights as
where the weights satisfy . This latter sum-to-one condition guarantees that is asymptotically unbiased if the component estimators are asymptotically unbiased. Let this ensemble of estimators satisfy the following two conditions:
The bias is given by
where are constants that depend on the underlying density, is a finite index set with cardinality , and , and are basis functions that depend only on the estimator parameter .
The variance is given by
For an ensemble of estimators , assume that the conditions and hold. Then, there exists a weight vector such that
This weight vector can be found by solving the following convex optimization problem:
where is the basis defined in (2.1).
The bias of the ensemble estimator is given by
Denote the covariance matrix of by . Let . Observe that by (2.2) and the Cauchy-Schwarz inequality, the entries of are . The variance of the weighted estimator can then be bounded as follows:
We seek a weight vector that (i) ensures that the bias of the weighted estimator is and (ii) has low norm in order to limit the contribution of the variance, and the higher order bias terms of the weighted estimator. To this end, let be the solution to the convex optimization problem defined in (2.3). The solution is the solution of
where and are defined below. Let be the vector of ones: ; and let , for each be given by . Define , and .
Since , the system of equations is guaranteed to have at least one solution (assuming linear independence of the rows ). The minimum squared norm is then given by
For any fixed dimension and fixed number of estimators in the ensemble independent of sample size , the value of is also independent of . Stated mathematically, for any fixed dimension and fixed number of estimators independent of sample size . This concludes the proof.
3 Application to estimation of functionals of a density
Our focus is the estimation of general non-linear functionals of -dimensional multivariate densities with known finite support , where has the form
for some smooth function . Let denote the boundary of . Assume that i.i.d realizations are available from the density .
3.1 Plug-in estimators of entropy
The truncated uniform kernel density estimator is defined below. For any positive real number , define the distance to be: . Define the truncated kernel region for each to be , and the volume of the truncated uniform kernel to be . Note that when the smallest distance from to is greater than , . Let denote the number of samples falling in : . The truncated uniform kernel density estimator is defined as
The plug-in estimator of the density functional is constructed using a data splitting approach as follows. The data is randomly subdivided into two parts and of and points respectively. In the first stage, we form the kernel density estimate at the points using the realizations . Subsequently, we use the samples to approximate the functional and obtain the plug-in estimator:
Also define a standard kernel density estimator , which is identical to except that the volume is always set to the untruncated value . Define
The estimator is identical to the estimator of Györfi and van der Meulen . Observe that the implementation of , unlike , does not require knowledge about the support of the density.
We make a number of technical assumptions that will allow us to obtain tight MSE convergence rates for the kernel density estimators defined above. : Assume that for some rate constant , and assume that , and are linearly related through the proportionality constant with: , and . : Let the density be uniformly bounded away from and upper bounded on the set , i.e., there exist constants , such that . : Assume that the density has continuous partial derivatives of order in the interior of the set , and that these derivatives are upper bounded. : Assume that the function has partial derivatives w.r.t. the argument , where satisfies the condition . Denote the -th partial derivative of wrt by . : Assume that the absolute value of the functional and its partial derivatives are strictly upper bounded in the range for all . : Let and . Let be a positive function satisfying the condition . For some fixed , define and . Assume that the conditions
are satisfied by and , for some constants , , and .
These assumptions are comparable to other rigorous treatments of entropy estimation. The assumption is equivalent to choosing the bandwidth of the kernel to be a fractional power of the sample size . The rest of the above assumptions can be divided into two categories: (i) assumptions on the density , and (ii) assumptions on the functional . The assumptions on the smoothness, boundedness away from and of the density are similar to the assumptions made by other estimators of entropy as listed in Section II, . The assumptions on the functional ensure that is sufficiently smooth and that the estimator is bounded. These assumptions on the functional are readily satisfied by the common functionals that are of interest in literature: Shannon and Rényi entropy, where is the indicator function, and the quadratic functional .
3.1.2 Analysis of MSE
Under the assumptions stated above, we have shown the following in the Appendix:
The biases of the plug-in estimators are given by
where , and are constants that depend on and .
The variances of the plug-in estimators are identical up to leading terms, and are given by
where and are constants that depend on and .
3.1.3 Optimal MSE rate
From Theorem 2, observe that the conditions and are necessary for the estimators and to be unbiased. Likewise from Theorem 3, the conditions and are necessary for the variance of the estimator to converge to . Below, we optimize the choice of bandwidth for minimum MSE, and also show that the optimal MSE rate is invariant to the choice of .
Optimal choice of
Minimizing the MSE over is equivalent to minimizing the square of the bias over . The optimal choice of is given by
and the bias evaluated at is .
Observe that the MSE of and are dominated by the squared bias as contrasted to the variance . This implies that the asymptotic MSE rate of convergence is invariant to the selected proportionality constant .
In view of (a) and (b) above, the optimal MSE for the estimators and is therefore achieved for the choice of , and is given by . Our goal is to reduce the estimator MSE to . We do so by applying the method of weighted ensembles described in Section 2.
3.2 Weighted ensemble entropy estimator
For a positive integer , choose to be positive real numbers. Define the mapping and let . Define the weighted ensemble estimator
From Theorems 2 and 3, we see that the biases of the ensemble of estimators satisfy (2.1) when we set and . Furthermore, the general form of the variance of follows (2.2) because . This implies that we can use the weighted ensemble estimator to estimate entropy at convergence rate by setting equal to the optimal weight given by (2.3).
We illustrate the superior performance of the proposed weighted ensemble estimator for two applications: (i) estimation of the Panter-Dite rate distortion factor, and (ii) estimation of entropy to test for randomness of a random sample.
For finite direct use of Theorem 1 can lead to excessively high variance. This is because forcing the condition (2.3) that is too strong and, in fact, not necessary. The careful reader may notice that to obtain MSE convergence rate in Theorem 1 it is sufficient that be of order . Therefore, in practice we determine the optimal weights according to the optimization:
The optimization (4.1) is also convex. Note that, as contrasted to (2.3), the norm of the weight vector is bounded instead of being minimized. By relaxing the constraints in (2.3) to the softer constraints in (4.1), the upper bound on can be reduced from the value obtained by solving (2.3). This results in a more favorable trade-off between bias and variance for moderate sample sizes. In our experiments, we find that setting yields good MSE performance. Note that as , we must have for in order to keep finite, thus recovering the strict constraints in (2.3).
For fixed sample size and dimension , observe that increasing increases the number of degrees of freedom in the convex problem (4.1), and therefore will result in a smaller value of and in turn improved estimator performance. In our simulations, we choose to be equally spaced values between and , ie the are uniformly spaced as
with scale and range parameters and respectively. We limit to 50 because we find that the gains beyond are negligible. The reason for this diminishing return is a direct result of the increasing similarity among the entries in , which translates to increasingly similar basis functions .
4.1 Panter-Dite factor estimation
For a -dimensional source with underlying density , the Panter-Dite distortion-rate function  for a -dimensional vector quantizer with levels of quantization is given by The Panter-Dite factor corresponds to the functional with . The Panter-Dite factor is directly related to the Rényi -entropy, for which several other estimators have been proposed [7, 3, 14, 12].
In our simulations we compare six different choices of functional estimators - the three estimators previously introduced: (i) the standard kernel plug-in estimator , (ii) the boundary truncated plug-in estimator and (iii) the weighted estimator with optimal weight given by (4.1), and in addition the following popular entropy estimators: (iv) histogram plug-in estimator , (v) -nearest neighbor (-NN) entropy estimator  and (vi) entropic -NN graph estimator [3, 14]. For both and , we select the bandwidth parameter as a function of according to the optimal proportionality and .
We choose to be the dimensional mixture density ; where , is a -dimensional Beta density with parameters , is a -dimensional uniform density and the mixing ratio is . The reason we choose the beta-uniform mixture for our experiments is because it trivially satisfies all the assumptions on the density listed in Section 3.1, including the assumptions of finite support and strict boundedness away from 0 on the support. The true value of the Panter-Dite factor for the beta-uniform mixture is calculated using numerical integration methods via the ’Mathematica’ software (http://www.wolfram.com/mathematica/). Numerical integration is used because evaluating the entropy in closed form for the beta-uniform mixture is not tractable.
The MSE values for each of the six estimators are calculated by averaging the squared error , over Monte-Carlo trials, where each corresponds to an independent instance of the estimator.
4.1.1 Variation of MSE with sample size
The MSE results of the different estimators are shown in Fig. 1(a) as a function of sample size , for fixed dimension . It is clear from the figure that the proposed ensemble estimator has significantly faster rate of convergence while the MSE of the rest of the estimators, including the truncated kernel plug-in estimator, have similar, slow rates of convergence. It is therefore clear that the proposed optimal ensemble averaging significantly accelerates the MSE convergence rate.
4.1.2 Variation of MSE with dimension
For fixed sample size and fixed number of estimators , it can be seen that increases monotonically with . This follows from the fact that the number of constraints in the convex problem 4.1 is equal to and each of the basis functions monotonically approaches as grows, . This in turn implies that for a fixed sample size and number of estimators , the overall MSE of the ensemble estimator should increase monotonically with the dimension .
The MSE results of the different estimators are shown in Fig. 1(b) as a function of dimension , for fixed sample size . For the standard kernel plug-in estimator and truncated kernel plug-in estimator, the MSE increases rapidly with as expected. The MSE of the histogram and -NN estimators increase at a similar rate, indicating that these estimators suffer from the curse of dimensionality as well. On the other hand, the MSE of the weighted estimator also increases with the dimension as predicted, but at a slower rate. Also observe that the MSE of the weighted estimator is smaller than the MSE of the other estimators for all dimensions .
4.2 Distribution testing
In this section, we illustrate the weighted ensemble estimator for non-parametric estimation of Shannon differential entropy. The Shannon differential entropy is given by where . The improved accuracy of the weighted ensemble estimator is demonstrated in the context of hypothesis testing using estimated entropy as a statistic to test for the underlying probability distribution of a random sample. Specifically, the samples under the null and alternate hypotheses and are drawn from the probability distribution , described in Section IV.A, with fixed , and two sets of values of under the null and alternate hypothesis, versus .
First, we fix and . The density under the null hypothesis has greater curvature relative to and therefore has smaller entropy. Five hundred (500) experiments are performed under each hypothesis with each experiment consisting of 1000 samples drawn from the corresponding distribution. The true entropy and estimates , and obtained from each instance of samples are shown in Fig. 2(a) for the 1000 experiments. This figure suggests that the ensemble weighted estimator provides better discrimination ability by suppressing the bias, at the cost of some additional variance.
To demonstrate that the weighted estimator provides better discrimination, we plot the histogram envelope of the entropy estimates using standard kernel plug-in estimator, truncated kernel plug-in estimator and the weighted estimator for the cases corresponding to the hypothesis (color coded blue) and (color coded red) in Fig. 2(b). Furthermore, we quantitatively measure the discriminative ability of the different estimators using the deflection statistic where and (respectively and ) are the sample mean and standard deviation of the entropy estimates. The deflection statistic was found to be , and for the standard kernel plug-in estimator, truncated kernel plug-in estimator and the weighted estimator respectively. The receiver operating curves (ROC) for this entropy-based test using the three different estimators are shown in Fig. 3(a). The corresponding areas under the ROC curves (AUC) are given by , and .
In our final experiment, we fix and set , perform 500 experiments each under the null and alternate hypotheses with samples of size 5000, and plot the AUC as varies from to in Fig. 3(b). For comparison, we also plot the AUC for the Neyman-Pearson likelihood ratio test. The Neyman-Pearson likelihood ratio test, unlike the Shannon entropy based tests, is an omniscient test that assumes knowledge of both the underlying beta-uniform mixture parametric model of the density and the parameter values , and , under the null and alternate hypothesis respectively. Figure 4 shows that the weighted estimator uniformly and significantly outperforms the individual plug-in estimators and comes closest to the performance of the omniscient Neyman-Pearson likelihood test. The relatively superior performance of the Neyman-Pearson likelihood test is due to the fact that the weighted estimator is a nonparametric estimator that has marginally higher variance (proportional to ) as compared to the underlying parametric model for which the Neyman-Pearson test statistic provides the most powerful test.
We have proposed a new estimator of functionals of a multivariate density based on weighted ensembles of kernel density estimators. For ensembles of estimators that satisfy general conditions on bias and variance as specified by (2.1) and (2.2) respectively, the weight optimized ensemble estimator has parametric MSE convergence rate that can be much faster than the rate of convergence of any of the individual estimators in the ensemble. The optimal weights are determined as a solution to a convex optimization problem that can be performed offline and does not require training data. We illustrated this estimator for uniform kernel plug-in estimators and demonstrated the superior performance of the weighted ensemble entropy estimator for (i) estimation of the Panter-Dite factor and (ii) non-parametric hypothesis testing.
Several extensions of the framework of this paper are being pursued: (i) using -nearest neighbor (-NN) estimators in place of kernel estimators; (ii) extending the framework to the case where support is not known, but for which conditions and hold; (iii) using ensemble estimators for estimation of other functionals of probability densities including divergence, mutual information and intrinsic dimension; and (iv) using an norm in place of the norm in the weight optimization algorithm (2.3) so as to introduce sparsity into the weighted ensemble.
This work was partially supported by (i) ARO grant W911NF-12-1-0443 and (ii) NIH grant 2P01CA087634-06A2.
Outline of appendix
Appendix A Moment properties of boundary compensated uniform kernel density estimates
Throughout this section, we assume without loss of generality that the support . Observe that is a binomial random variable with parameters and . The probability mass function of the binomial random variable is given by
Define the error function of the truncated uniform kernel density,
Also define the error function of the standard uniform kernel density,
and note that when , .
a.1 Taylor series expansion of coverage
For any , the coverage function can be represented by using a order Taylor series expansion of about as follows. Because the density has continuous partial derivatives of order in , for any ,
where are functions which depend on and the unknown density . This implies that the expectation of the density estimate is given by
a.2 Concentration inequalities for uniform kernel density estimator
Because is a binomial random variable, standard Chernoff inequalities can be applied to obtain concentration bounds on . In particular, for ,
Let denote the event , where for some fixed . Then, for ,
where satisfies the condition for any . Also observe that under the event ,
a.3 Bounds on uniform kernel density estimator
Let be an Euclidean ball of radius centered at . Let be a Lebesgue point of , i.e., an for which
Because is an density, we know that almost all satisfy the above property. Now, fix and find such that
For small values of , and therefore
This implies that under the event defined in the previous subsection,
Let denote the event that . Let denote the event and denote . Then conditioned on the event
and conditioned on the event
Observe that , , and form a disjoint partition of the event space.
Let be an arbitrary function with partial derivatives wrt and . Let denote i.i.d realizations of the density . Then,
where are functionals of and .
To analyze the bias, first extend the density function as follows. In particular, extend the definition of to the domain while ensuring that the extended function is differentiable times on this extended domain. Let be the natural un-truncated ball. Let . Define the function . For any , using this extended definition,
where are only functions of the unknown density . Also define