Asymptotic properties of parallel Bayesian kernel density estimators
In this article we perform an asymptotic analysis of the Bayesian parallel kernel density estimators introduced by Neiswanger, Wang and Xing. We derive the asymptotic expansion of the mean integrated squared error for the full data posterior estimator and investigate the properties of asymptotically optimal bandwidth parameters. Our analysis demonstrates that partitioning data into subsets requires a non-trivial choice of bandwidth parameters that optimizes the estimation error.
- 1 Introduction
- 2 Notation and hypotheses
- 3 Asymptotic analysis of MISE for the non-normalized estimator
- 4 Asymptotic analysis of MISE for the full data posterior density estimator
- 5.1 Bandwidth optimization for a symmetric case
- 5.2 Bandwidth optimization for normal subset posterior densities
- 5.3 Bandwidth optimization for gamma distributed subset posterior densities
- 5.4 Numerical experiments with normal subset posterior densities
- 5.5 Numerical experiments with gamma distributed subset posterior densities
- 6 Appendix
Recent developments in data science and analytics research have produced an abundance of data sets too large to be analyzed in their entirety. As the size of data sets increases, the time required for processing rises significantly. An effective solution to this problem is to perform statistical analysis of large data sets with the use of parallel computing. The prevalence of parallel processing of large data sets has motivated a surge in research on parallel statistical algorithms.
One approach is to divide data sets into smaller subsets and analyze the subsets on separate machines using parallel Markov chain Monte Carlo (MCMC) methods [15, 19, 26]. These methods, however, require communication between machines for the generation of each sample. Communication costs in modern computer networks dwarf the speed-up achieved by parallel processing, and therefore algorithms that require extensive communication between machines are ineffective; see Scott.
To address these issues, numerous alternative communication-free parallel MCMC methods have been developed for Bayesian analysis of big data. These methods partition data into subsets, perform independent Bayesian MCMC analysis on each subset, and combine the subset posterior samples to estimate the full data posterior; see [24, 20, 18].
Neiswanger, Wang and Xing introduced a parallel kernel density estimator that first approximates each subset posterior density; the full data posterior is then estimated by multiplying the subset posterior estimators together,

$$\hat{p}(\theta \mid x) \propto \prod_{m=1}^{M} \hat{p}_m(\theta \mid x_m). \qquad (1.1)$$

Here $\theta$ is the model parameter, $x = x_1 \cup \dots \cup x_M$ is the full data set partitioned into $M$ disjoint independent subsets, and

$$\hat{p}_m(\theta \mid x_m) = \frac{1}{T h_m} \sum_{t=1}^{T} K\!\left(\frac{\theta - \theta_{m,t}}{h_m}\right) \qquad (1.2)$$

is the subset posterior kernel density estimator built from the subset samples $\theta_{m,1}, \dots, \theta_{m,T}$, with $h_m$ a kernel bandwidth parameter.
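As an illustration, the product construction above can be sketched in a few lines (a minimal sketch with a Gaussian kernel and grid-based renormalization; the function names and the explicit renormalization step are ours, not part of the original algorithm):

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal-rule integral of samples y over grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def gaussian_kernel(u):
    """Standard normal kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def subset_kde(grid, samples, h):
    """Subset posterior KDE on a grid: (1/(T h)) * sum_t K((x - sample_t)/h)."""
    u = (grid[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * h)

def parallel_posterior_estimate(grid, subset_samples, bandwidths):
    """Multiply the subset KDEs, then renormalize on the grid."""
    prod = np.ones_like(grid)
    for samples, h in zip(subset_samples, bandwidths):
        prod *= subset_kde(grid, samples, h)
    return prod / trapezoid(prod, grid)
```

Note that each subset is processed independently, so the subset KDEs can be evaluated on separate machines and only the grid values need to be communicated.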
The authors show that the estimator (1.1) is asymptotically exact and develop a sampling algorithm that generates samples from the distribution approximating the full data estimator. Similar sampling algorithms were presented and investigated in Dunson and Wang and in Scott [24, 25]. It has been noted that these algorithms do not perform well for posteriors of non-Gaussian shape and are sensitive to the choice of the kernel parameters; see [18, 24, 21].
The highlighted issues indicate that a proper choice of the bandwidth can greatly benefit the accuracy of the estimation as well as of the sampling algorithms. Moreover, properly chosen bandwidth parameters improve the accuracy of the estimation without incurring additional computational cost.
In the present article, we are concerned with an asymptotic analysis of parallel Bayesian kernel density estimators of the form (1.1). In particular, we are interested in the asymptotic representation of the mean integrated squared error (MISE) for the non-normalized estimator and the density estimator, as well as the properties of the optimal kernel bandwidth vector as the number of samples tends to infinity; these issues were left open in the original work.
We also propose a universal iterative algorithm, based on the derived asymptotic expansions, that locates optimal parameters without adopting any assumptions on the underlying probability densities.
Kernel density estimators for the single-subset case have been studied extensively in the past five decades. Asymptotic properties of the mean integrated squared error for the estimator (1.1) with a single subset, which takes the form (1.2), were studied by Rosenblatt, Parzen and Epanechnikov. In particular, for sufficiently smooth probability densities Parzen derived the asymptotic expansion for the mean integrated squared error

$$\mathrm{MISE}(h) = \frac{R(K)}{Th} + \frac{h^4}{4}\,\mu_2(K)^2\, R(f'') + o\!\left(\frac{1}{Th} + h^4\right), \qquad (1.3)$$

with $R(g) = \int g(x)^2\,dx$ and $\mu_2(K) = \int x^2 K(x)\,dx$, and obtained a formula for the asymptotically optimal bandwidth parameter

$$h_{\mathrm{opt}} = \left(\frac{R(K)}{\mu_2(K)^2\, R(f'')\, T}\right)^{1/5}, \qquad (1.4)$$

which minimizes the leading terms in the expansion.
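For concreteness, the optimal-bandwidth formula (1.4) is easy to evaluate numerically. The sketch below (our own helper, worked through for the Gaussian kernel and a standard normal target density) shows that for this particular pair the formula reduces to the familiar $(4/(3T))^{1/5}$ rule:

```python
import numpy as np

def amise_optimal_bandwidth(T, roughness_K, mu2_K, roughness_f2):
    """h_opt = (R(K) / (mu2(K)^2 R(f'') T))^(1/5), the minimizer of the
    leading AMISE terms R(K)/(T h) + (h^4 / 4) mu2(K)^2 R(f'')."""
    return (roughness_K / (mu2_K**2 * roughness_f2 * T)) ** 0.2

# Gaussian kernel: R(K) = 1 / (2 sqrt(pi)), mu2(K) = 1.
R_K = 1.0 / (2.0 * np.sqrt(np.pi))
# Standard normal target density: R(f'') = 3 / (8 sqrt(pi)).
R_f2 = 3.0 / (8.0 * np.sqrt(np.pi))
h_star = amise_optimal_bandwidth(1000, R_K, 1.0, R_f2)
# for this kernel/density pair the formula reduces to (4 / (3 T))^(1/5)
```

In practice $R(f'')$ involves the unknown density, which is precisely why data-driven bandwidth selection is needed.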
Non-differentiable or discontinuous probability density functions lead to different asymptotic estimates for MISE. It has been shown by van Eeden that the optimal bandwidth parameter and the rate of convergence of the mean integrated squared error depend directly on the regularity of the probability density.
In the case of multivariate distributions, the complexity of the asymptotic analysis depends on the form of the bandwidth matrix $H$. In the simplest case, one can assume that $H = hI$, where $h$ is a scalar; see Silverman, Simonoff and Epanechnikov. Another approach is to consider a diagonal bandwidth matrix $H = \operatorname{diag}(h_1, \dots, h_d)$, with $h_i$ being a bandwidth parameter for each dimension $i$. The most general formulation assumes that $H$ is a full matrix, which allows one to encode correlations between the components of the data; see Duong and Hazelton, and Wand and Jones.
In the present work, motivated by the ideas of [10, 6, 31, 8], we focus on the one-dimensional setting with the bandwidth vector $h = (h_1, \dots, h_M)$ and carry out the asymptotic analysis of the mean integrated squared error for both the parallel non-normalized estimator

$$\tilde{f}_h(x) = \prod_{m=1}^{M} \hat{f}_m(x)$$

and the full data set posterior density estimator

$$\hat{f}_h(x) = \frac{\tilde{f}_h(x)}{\int \tilde{f}_h(s)\,ds}.$$
In Theorem 3.3, under appropriate conditions on the regularity of the probability density, we derive the expression for the asymptotically leading part of MISE for the non-normalized estimator. The leading part turns out to be in agreement with the leading part for the single-subset case, but in the multi-subset case it contains novel terms that take into account the relationship between the subset posterior densities.
We then perform a similar analysis for the mean integrated squared error of the full data set posterior density estimator. The presence of the random normalizing constant (the integral of the non-normalized estimator) introduces major difficulties in the analysis of MISE because the normalized estimator may in general have an infinite second moment, in which case MISE is not defined. This may occur when the subset estimators (on some events) decay too quickly and the sets where each subset estimator carries most of its 'mass' have little common intersection, which potentially leads to large values of the normalizing ratio. To guarantee finiteness one must impose appropriate conditions on the density and kernel. In this article, however, we take another approach: we replace the mean integrated squared error by an asymptotically equivalent distance functional.
We show that the new functional is always well-defined and that it is asymptotically equivalent to MISE when restricted to events whose probability tends to one as the total number of samples grows to infinity.
We then analyze the functional by carrying out the same program as for the MISE of the non-normalized estimator. In Theorem 4.6 we derive the expression for the asymptotically leading part of the distance functional for the full data set posterior density estimator. The asymptotically optimal bandwidth vector for the full data posterior is then defined to be a minimizer of this leading part.
We then compute the minimizing bandwidth in explicit form for two special cases. In the examples presented here we consider subset posterior densities of normal and gamma distributions; see (5.2), (5.4), and (5.7). In the two examples the optimizing bandwidth vectors differ significantly and depend, as expected, directly on the full data set density, which is typically unknown. For that reason we propose an iterative algorithm for locating optimal bandwidth parameters based on the asymptotic expansions we derived; see Algorithm 1.
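The general idea of iterating between an expansion-based bandwidth formula and a pilot estimate of the unknown density can be illustrated with a plug-in style fixed-point iteration. The sketch below is our own single-subset, Gaussian-kernel variant of that idea, not the paper's Algorithm 1:

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal-rule integral of samples y over grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def roughness_of_second_derivative(samples, h, grid):
    """Estimate R(f'') = int f''(x)^2 dx from a pilot KDE;
    the Gaussian kernel satisfies K''(u) = (u^2 - 1) K(u)."""
    u = (grid[:, None] - samples[None, :]) / h
    f2 = ((u**2 - 1.0) * gauss(u)).sum(axis=1) / (len(samples) * h**3)
    return trapezoid(f2**2, grid)

def plugin_bandwidth(samples, n_iter=5):
    """Fixed-point iteration: plug the current R(f'') estimate into the
    AMISE-optimal formula h = (R(K) / (mu2(K)^2 R(f'') T))^(1/5)."""
    R_K = 1.0 / (2.0 * np.sqrt(np.pi))  # roughness of the Gaussian kernel
    grid = np.linspace(samples.min() - 3.0, samples.max() + 3.0, 400)
    h = samples.std() * len(samples) ** -0.2  # pilot bandwidth
    for _ in range(n_iter):
        h = (R_K / (roughness_of_second_derivative(samples, h, grid)
                    * len(samples))) ** 0.2
    return h
```

The design point is that each iteration replaces the unknown roughness $R(f'')$ with an estimate computed from the current bandwidth, so no parametric assumption on the density is needed.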
Our analysis demonstrates that partitioning data into subsets affects the optimality condition for the bandwidth parameters. It also indicates that the bandwidth vector

$$h^* = (h_1^*, \dots, h_M^*),$$

which minimizes the 'componentwise' mean integrated squared error, where each $h_m^*$ is the optimal bandwidth parameter for the corresponding subset estimator given by (1.4), is suboptimal for both the non-normalized and the full posterior density estimators whenever $M > 1$.
This observation highlights the fact that the choice of optimal parameters for parallel kernel density estimators (suitable for parallelizing data analysis) must differ from the theoretical choice suggested in the case of processing on a single machine. We must also note that the increased values of MISE resulting from a suboptimal bandwidth parameter are compounded in the case of parallel processing. This further underscores the importance of a proper choice of bandwidth, especially since it comes at no additional computational cost.
The paper is arranged as follows. In Section 2 we set the notation and hypotheses that form the foundation of the analysis. In Section 3 we derive an asymptotic expansion for the MISE of the non-normalized estimator and derive formulas for the leading parts of the bias and the variance, which are central to the analysis performed in subsequent sections. In Section 4 we perform the analysis of the distance functional for the full data set posterior density. In Section 5 we compute explicit expressions for optimal bandwidth parameters in several special cases and conduct numerical experiments. Finally, in the appendix we provide supplementary lemmas and theorems employed in Section 3 and Section 4.
2 Notation and hypotheses
For the convenience of the reader we collect in this section all hypotheses and results relevant to our analysis and present the notation that is utilized throughout the article.
Motivated by the form of the posterior density in Neiswanger et al., we consider a probability density function of the form

$$f(x) = \frac{\prod_{m=1}^{M} f_m(x)}{\int \prod_{m=1}^{M} f_m(s)\,ds}.$$

Here $f_m$ is a probability density function for each $m = 1, \dots, M$.
We consider the estimator of $f$ in the form

$$\hat{f}_h(x) = \frac{\prod_{m=1}^{M} \hat{f}_m(x)}{\int \prod_{m=1}^{M} \hat{f}_m(s)\,ds},$$

and, for each $m$, $\hat{f}_m$ is the kernel density estimator of the probability density $f_m$ that has the form

$$\hat{f}_m(x) = \frac{1}{T h_m} \sum_{t=1}^{T} K\!\left(\frac{x - X_{m,t}}{h_m}\right).$$

Here $X_{m,1}, \dots, X_{m,T}$ are independent identically distributed random variables with density $f_m$, $K$ is a kernel density function, and $h_m > 0$ is a bandwidth parameter.
The mean integrated squared error of an estimator $\hat{g}$ of a function $g$, used both for the estimator of the non-normalized product and for the estimator of the full posterior density, is defined by

$$\mathrm{MISE}(\hat{g}) = \mathbb{E}\!\left[\int \bigl(\hat{g}(x) - g(x)\bigr)^2\,dx\right].$$

We also use the following convention for the bias and variance of estimators:

$$\mathrm{Bias}[\hat{g}(x)] = \mathbb{E}[\hat{g}(x)] - g(x), \qquad \mathrm{Var}[\hat{g}(x)] = \mathbb{E}\bigl[\hat{g}(x)^2\bigr] - \bigl(\mathbb{E}[\hat{g}(x)]\bigr)^2.$$
We assume that the kernel density function and the probability density functions satisfy the following hypotheses:
The kernel $K$ is positive, bounded, normalized, and its first moment is zero, that is,

$$K > 0, \qquad \sup K < \infty, \qquad \int K(x)\,dx = 1, \qquad \int x\,K(x)\,dx = 0.$$
For each $m$, the density $f_m$ is sufficiently smooth and there exists a constant such that the required derivative bounds hold.
For each $m$, the density $f_m$ and its derivatives are integrable; that is, there is a constant bounding the corresponding integrals.
The bandwidth parameters satisfy $h_m \to 0$ and $T h_m \to \infty$ for all $m$.
We also define the non-normalized product $\tilde{f} = \prod_{m=1}^{M} f_m$ and note that $f = \tilde{f} / \int \tilde{f}(s)\,ds$.
3 Asymptotic analysis of MISE for the non-normalized estimator
We start with the observation that MISE can be expressed via the combination of bias and variance:

$$\mathrm{MISE}(\hat{g}) = \int \mathrm{Bias}[\hat{g}(x)]^2\,dx + \int \mathrm{Var}[\hat{g}(x)]\,dx. \qquad (3.1)$$
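This bias-variance decomposition can be checked numerically by replicating an estimator many times (a self-contained sketch with a single Gaussian-kernel KDE and a standard normal target; the identity holds exactly for the empirical mean and variance of the replications):

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal-rule integral of samples y over grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

rng = np.random.default_rng(1)
grid = np.linspace(-4.0, 4.0, 201)
true_f = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)  # target density

def kde_on_grid(samples, h):
    u = (grid[:, None] - samples[None, :]) / h
    return (np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)).sum(axis=1) / (len(samples) * h)

# replicate the estimator to approximate the expectations pointwise
reps = np.stack([kde_on_grid(rng.normal(0.0, 1.0, 100), h=0.4) for _ in range(300)])
mise = trapezoid(((reps - true_f) ** 2).mean(axis=0), grid)  # E int (f_hat - f)^2
bias2 = trapezoid((reps.mean(axis=0) - true_f) ** 2, grid)   # int Bias^2
var = trapezoid(reps.var(axis=0), grid)                      # int Var
# mise equals bias2 + var (exactly, up to floating-point rounding)
```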
In what follows we analyze the bias, then the variance, and conclude with a section in which we derive the formula for the optimal bandwidth vector.
3.1 Bias expansion
Using the fact that the estimators $\hat{f}_1, \dots, \hat{f}_M$ are independent, we obtain
To simplify the notation in (3.2) we shall employ multiindex notation. Let $\alpha = (\alpha_1, \dots, \alpha_M)$ denote the corresponding multiindex. Then the above formula can be rewritten as follows
Using this decomposition, we prove the following lemma
Here, for each $m$, the error term in the bias approximation is the one from (6.2). We now compute bounds for these error terms.
To simplify the derivations we separate the terms in (3.8) into two groups: terms containing at least one factor of the error terms and terms free of them. We define the sets
Here $\chi$ denotes the characteristic function. Consequently, the error term can be written as follows
for some appropriate constants. Since each one of the products below contains at least two such factors, a constant must exist so that
which proves the second estimate in (3.5).
Using the estimates obtained above, we conclude
Finally, (ii) follows from the Cauchy-Schwarz inequality applied to
3.2 Variance expansion
We next obtain an asymptotic formula for the variance of the non-normalized estimator. For the proof of the lemma, we perform the following preliminary calculation
According to (3.16) we have
Here, for each $m$, the approximation error of the variance is the one from (6.11). In a fashion similar to the previous proof, we separate the terms in (3.19): we single out the leading-order terms, the terms containing at least one error factor, the cross terms, and the higher-order terms.
3.3 AMISE formula and optimal bandwidth vector
With the lemmas above we can derive the decomposition of MISE into leading-order terms and higher-order terms.
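While the leading part is derived analytically, the dependence of MISE on the bandwidth vector can also be probed by brute force, which is a useful sanity check for the expansions. The toy experiment below (our own, not the paper's method) estimates the MISE of the non-normalized product estimator for two standard normal subset densities over a symmetric grid of bandwidths:

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal-rule integral of samples y over grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(grid, samples, h):
    u = (grid[:, None] - samples[None, :]) / h
    return gauss(u).sum(axis=1) / (len(samples) * h)

def mc_mise(h1, h2, reps=40, T=100, seed=0):
    """Monte Carlo estimate of the MISE of the non-normalized product
    estimator for two standard normal subset densities."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-4.0, 4.0, 161)
    target = (np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)) ** 2  # f1 * f2
    errs = []
    for _ in range(reps):
        est = kde(grid, rng.normal(0.0, 1.0, T), h1) * \
              kde(grid, rng.normal(0.0, 1.0, T), h2)
        errs.append(trapezoid((est - target) ** 2, grid))
    return float(np.mean(errs))

# scan a small symmetric bandwidth grid and pick the empirical minimizer
hs = np.linspace(0.15, 0.75, 7)
mises = {float(h): mc_mise(h, h) for h in hs}
h_best = min(mises, key=mises.get)
```

Such a scan is far too expensive to be a practical selector, but it gives an independent reference point against which an AMISE-based bandwidth can be compared.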