Asymptotic properties of parallel Bayesian kernel density estimators
Abstract
In this article we perform an asymptotic analysis of Bayesian parallel kernel density estimators introduced by Neiswanger, Wang and Xing [20]. We derive the asymptotic expansion of the mean integrated squared error for the full data posterior estimator and investigate the properties of asymptotically optimal bandwidth parameters. Our analysis demonstrates that partitioning data into subsets requires a nontrivial choice of bandwidth parameters that optimizes the estimation error.
Contents
 1 Introduction
 2 Notation and hypotheses
 3 Asymptotic analysis of MISE for
 4 Asymptotic analysis of MISE for
 5 Examples
 6 Appendix
1 Introduction
Recent developments in data science and analytics research have produced an abundance of large data sets that are too large to be analyzed in their entirety. As the size of data sets increases, the time required for processing rises significantly. An effective solution to this problem is to perform statistical analysis of large data sets with the use of parallel computing. The prevalence of parallel processing of large data sets motivated a surge in research on parallel statistical algorithms.
One approach is to divide data sets into smaller subsets, and analyze the subsets on separate machines using parallel Markov chain Monte Carlo (MCMC) methods [15, 19, 26]. These methods, however, require communication between machines for generation of each sample. Communication costs in modern computer networks dwarf the speed up achieved by parallel processing and therefore algorithms that require extensive communications between machined are ineffective; see Scott [25].
To address these issues, numerous alternative communicationfree parallel MCMC methods have been developed for Bayesian analysis of big data. These methods partition data into subsets, perform independent Bayesian MCMC analysis on each subset, and combine the subset posterior samples to estimate the full data posterior; see [24, 20, 18].
Neiswanger, Wang and Xing [20] introduced a parallel kernel density estimator that first approximates each subset posterior density; the full data posterior is then estimated by multiplying the subset posterior estimators together,
(1.1) 
Here is the model parameter, is the full data set partitioned into disjoint independent subsets, and
(1.2) 
is the subset posterior kernel density estimator, with a kernel bandwidth parameter.
The authors of [20] show that the estimator (1.1) is asymptotically exact and develop a sampling algorithm that generates samples from the distribution approximating the full data estimator. Similar sampling algorithms were presented and investigated in Dunson and Wang [21] and Scott [24, 25]. It has been noted that these algorithms do not perform well for posteriors that have nonGaussian shape and are sensitive to the choice of the kernel parameters; see [18, 24, 21].
The highlighted issues indicate that the proper choice of the bandwidth can greatly benefit the accuracy of the estimation as well as sampling algorithms. Moreover, properly chosen bandwidth parameters will improve accuracy of the estimation without incurring additional computational cost.
In the present article, we are concerned with an asymptotic analysis of the parallel Bayesian kernel density estimators of the form (1.1). In particular, we are interested in the asymptotic representation of the mean integrated squared error (MISE) for the nonnormalized estimator and the density estimator as well as the properties of the optimal kernel bandwidth vector parameter as ; the issues left open in [20].
We also propose a universal iterative algorithm based on the derived asymptotic expansions that locates optimal parameters without adopting any assumptions on the underlying probability densities.
The kernel density estimators for the case have been studied extensively in the past five decades. Asymptotic properties of the mean integrated squared error for the estimator (1.1) with and , which takes the form (1.2), were studied by Rosenblatt [8], Parzen [10] and Epanechnikov [9]. In particular, for sufficiently smooth probability densities Parzen [10] derived the asymptotic expansion for the mean integrated squared error
(1.3) 
with and , and obtained a formula for the asymptotically optimal bandwidth parameter
(1.4) 
which minimizes the leading terms in the expansion.
The case of nondifferentiable or discontinuous probability density functions has been shown to possess different asymptotic estimates for MISE. It has been shown by van Eden [7] that the optimal bandwidth parameter and the rate of convergence of the mean integrated squared error depend directly on the regularity of the probability density .
In the case of multivariate distributions, , the complexity of the asymptotic analysis depends on the form of the bandwidth matrix . In the simplest case, one can assume that , where is a scalar; see Silverman [27], Simonoff [28] and Epanechnikov [9]. Another approach is to consider the bandwidth matrix of the form , with being a bandwidth parameter for each dimension . The most general formulation assumes that is a matrix, which allows one to encode correlations between components of ; see Duong and Hazelton[6], and Wand and Jones [31].
In the present work, motivated by the ideas of [10, 6, 31, 8] we focus on the case and and do the asymptotic analysis of the mean integrated squared error for both the parallel nonnormalized estimator
and the full data set posterior density estimator
as
In Theorem 3.3, under appropriate condition on the regularity of the probability density, we derive the expression for , the asymptotically leading part of MISE for the estimator . The leading part turns out to be in agreement with the leading part for the case , but in the multisubset case, , the leading part contains novel terms that take into account the relationship between subset posterior densities .
We then perform a similar analysis for the mean square error of the full data set posterior density estimator . The presence of the normalizing constant
introduces major difficulties in the analysis of MISE because may in general have an infinite second moment in which case is not defined. This may occur when the estimators (on some events) decay too quickly in variable and the sets of with the most ‘mass’ for each have little common intersection, which potentially leads to large values of . To make sure that one must impose appropriate conditions on the density and kernel . In this article, however, we take another approach. Instead, we replace the mean integrated squared error by an asymptotically equivalent distance functional denoted by
We show that the new functional is always welldefined and that it is asymptotically equivalent to MISE when restricted to events whose probability tends to one as the total number of samples .
We then do the analysis of the functional by carrying out the same program as for the MISE of the estimator . In Theorem 4.6 we derive the expression for , the asymptotically leading part of the for the full data set posterior density estimator . The asymptotically optimal bandwidth parameter for the full data set posterior is then defined to be a minimizer
We then compute minimizing bandwidth in explicit form for two special cases. In the examples presented here we consider subset posterior densities of normal and gamma distributions; see (5.2), (5.4), and (5.7). In the two examples the optimizing bandwidth vectors differ significantly and depend, as expected, directly on the full data set density which is typically unknown. For that reason we propose an iterative algorithm for locating optimal bandwidth parameters based on asymptotic expansion we derived; see Algorithm 1.
Our analysis demonstrates that partitioning data into sets affects the optimality condition of parameter . It also indicates that the bandwidth vector
which minimizes the ‘componentwise’ mean integrated squared error
where is the optimal bandwidth parameter for the estimator given by (1.4), is suboptimal for both estimators and whenever .
This observation highlights the fact that the choice of optimal parameters for parallel kernel density estimators (suitable for parallelizing data analysis) must differ from the theoretical choice suggested in case of processing on a single machine. We must also note, that the increased values of MISE resulted from choosing a suboptimal bandwidth parameter get compounded in case of parallel processing. This further necessitates the importance of a proper choice of bandwidth, especially if it comes at no additional computational costs.
The paper is arranged as follows. In Section 2 we set notation and hypotheses that form the foundation of the analysis. In Section 3 we derive an asymptotic expansion for MISE of the nonnormalized estimator as well as derive formulas for leading parts of and , which are central to the analysis performed in subsequent sections. In Section 4 we perform the analysis of for the full data set posterior density. In Section 5 we compute explicit expressions for optimal bandwidth parameters for several special cases and conduct numerical experiments. Finally, in the appendix we provide supplementary lemmas and theorems employed in Section 3 and Section 4.
2 Notation and hypotheses
For the convenience of the reader we collect in this section all hypotheses and results relevant to our analysis and present the notation that is utilized throughout the article.

Motivated by the form of the posterior density at Neiswanger et al. [20] we consider the probability density function of the form
(2.1) Here is a probability density function for each .

We consider the estimator of in the form
(H2a) and for each is the kernel density estimator of the probability density that has the form
(H2b) Here are independent identically distributed random variables, is a kernel density function, and is a bandwidth parameter.
The mean integrated squared error of the estimator of the nonnormalized product as well as for the estimator of the full posterior density is defined by
(2.2)  
where we use the notation and . We also use the following convention for the bias and variance of estimators
(2.3)  
We assume that the kernel density function and probability densities functions satisfy the following hypotheses:

is positive, bounded, normalized, and its first moment is zero, that is
(2.4) 
For each
(2.5) 
For each , and density there exists a constant such that
(2.6) 
For each and the density and its derivatives are integrable, that is, there is a constant so that
(2.7) 
Functions
satisfy for all
(2.8) We also define and note that .
3 Asymptotic analysis of Mise for
We start with the observation that MISE can be expressed via the combination of bias and variance
(3.1)  
In what follows we do the analysis of the bias, then that of variance and conclude with the section where we derive the formula for the optimal bandwidth vector.
3.1 Bias expansion
Using the fact that , are independent, we obtain
(3.2)  
To simplify notation in (3.2) we shall employ the multiindex notation. Let be the multiindex with
Then the above formula can rewritten as follows
(3.3)  
Using this decomposition, we prove the following lemma
Lemma 3.1.
Proof.
According to (3.3) and (6.2) we have
Here is the error in bias approximation for each from (6.2). We are computing bounds for
(3.8)  
To simplify the derivations we separate the terms in (3.8) into two groups: terms with at least one multiple of and terms free of . We define the sets
(3.9) 
and functions
(3.10) 
Here is the characteristic function. Consequently, the error term can be written as follows
(3.11)  
Assuming that is bounded, (H5) and (6.2), we can conclude that there is a constant so that
Using (H5) and (6.2), we conclude that the first term is bounded, and there is a constant so that
(3.12) 
The next sum in (3.11) contains terms are bounded due to (H5):
For some appropriate constants . Since each one of the products below has at least two terms with for some , a constant must exist, so that
(3.13) 
The inequalities (3.12) and (3.13) imply the first inequality in (3.5):
(3.14) 
3.2 Variance expansion
We next obtain an asymptotic formula for the variance of . For the proof of the lemma, we perform the following preliminary calculation
(3.16)  
Lemma 3.2.
Proof.
According to (3.16) we have
(3.19)  
Here, is the approximation error of variance of each from (6.11). In a fashion similar to the previous proof, we separate the terms in (3.19). We single out the leading order terms, the terms with at least one multiple of , the terms with multiples of and the terms of the order .
3.3 Amise formula and optimal bandwidth vector
With the lemmas above we can derive the decomposition of into leading order terms and higher order terms.