# Asymptotic properties of parallel Bayesian kernel density estimators

Alexey Miroshnikov Department of Mathematics, University of California, Los Angeles, amiroshn@gmail.com    Evgeny Savelev Department of Mathematics, Virginia Polytechnic Institute and State University, savelev@vt.edu
###### Abstract

In this article we perform an asymptotic analysis of Bayesian parallel kernel density estimators introduced by Neiswanger, Wang and Xing [20]. We derive the asymptotic expansion of the mean integrated squared error for the full data posterior estimator and investigate the properties of asymptotically optimal bandwidth parameters. Our analysis demonstrates that partitioning data into subsets requires a non-trivial choice of bandwidth parameters that optimizes the estimation error.

## 1 Introduction

Recent developments in data science and analytics research have produced an abundance of large data sets that are too large to be analyzed in their entirety. As the size of data sets increases, the time required for processing rises significantly. An effective solution to this problem is to perform statistical analysis of large data sets with the use of parallel computing. The prevalence of parallel processing of large data sets motivated a surge in research on parallel statistical algorithms.

One approach is to divide data sets into smaller subsets, and analyze the subsets on separate machines using parallel Markov chain Monte Carlo (MCMC) methods [15, 19, 26]. These methods, however, require communication between machines for the generation of each sample. Communication costs in modern computer networks dwarf the speed-up achieved by parallel processing, and therefore algorithms that require extensive communication between machines are ineffective; see Scott [25].

To address these issues, numerous alternative communication-free parallel MCMC methods have been developed for Bayesian analysis of big data. These methods partition data into subsets, perform independent Bayesian MCMC analysis on each subset, and combine the subset posterior samples to estimate the full data posterior; see [24, 20, 18].

Neiswanger, Wang and Xing [20] introduced a parallel kernel density estimator that first approximates each subset posterior density; the full data posterior is then estimated by multiplying the subset posterior estimators together,

$$\hat p(x\,|\,y)\propto\hat p^*(x\,|\,y):=\hat p_1(x\,|\,y_1)\cdot\hat p_2(x\,|\,y_2)\cdots\hat p_M(x\,|\,y_M). \qquad (1.1)$$

Here $x$ is the model parameter, $y=y_1\cup y_2\cup\dots\cup y_M$ is the full data set partitioned into $M$ disjoint independent subsets, and

$$\hat p_m(x\,|\,y_m)=\frac{1}{N_m h_m}\sum_{i=1}^{N_m}K\!\left(\frac{x-X_{m,i}}{h_m}\right) \qquad (1.2)$$

is the subset posterior kernel density estimator, with $K$ a kernel density function and $h_m$ a kernel bandwidth parameter.
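To make the construction concrete, the following minimal Python sketch builds the subset estimators (1.2) with a Gaussian kernel and multiplies them as in (1.1). All function names and the toy data are illustrative, not part of the paper.

```python
import numpy as np

def subset_kde(x, samples, h):
    """Kernel density estimate (1.2) with a Gaussian kernel K."""
    u = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def product_estimator(x, subset_samples, bandwidths):
    """Non-normalized product estimator (1.1): multiply subset KDEs pointwise."""
    p = np.ones_like(x)
    for samples, h in zip(subset_samples, bandwidths):
        p *= subset_kde(x, samples, h)
    return p

# toy example: M = 2 subset posteriors, both close to N(0, 1)
rng = np.random.default_rng(0)
subsets = [rng.normal(0.0, 1.0, 500), rng.normal(0.0, 1.0, 500)]
grid = np.linspace(-4.0, 4.0, 201)
dx = grid[1] - grid[0]
p_star = product_estimator(grid, subsets, [0.3, 0.3])  # un-normalized estimate
p_hat = p_star / (p_star.sum() * dx)                   # normalize on the grid
```

The normalization step mirrors the (typically intractable) normalizing constant discussed below: it is only available here because we evaluate on a fixed grid.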

The authors of [20] show that the estimator (1.1) is asymptotically exact and develop a sampling algorithm that generates samples from the distribution approximating the full data estimator. Similar sampling algorithms were presented and investigated in Dunson and Wang [21] and Scott [24, 25]. It has been noted that these algorithms do not perform well for posteriors that have non-Gaussian shape and are sensitive to the choice of the kernel parameters; see [18, 24, 21].

The highlighted issues indicate that the proper choice of the bandwidth can greatly benefit the accuracy of the estimation as well as sampling algorithms. Moreover, properly chosen bandwidth parameters will improve accuracy of the estimation without incurring additional computational cost.

In the present article, we are concerned with an asymptotic analysis of the parallel Bayesian kernel density estimators of the form (1.1). In particular, we are interested in the asymptotic representation of the mean integrated squared error (MISE) for the non-normalized estimator $\hat p^*$ and the density estimator $\hat p$, as well as the properties of the optimal kernel bandwidth vector parameter as $N\to\infty$; these issues were left open in [20].

We also propose a universal iterative algorithm based on the derived asymptotic expansions that locates optimal parameters without adopting any assumptions on the underlying probability densities.

The kernel density estimators for the case $M=1$ have been studied extensively in the past five decades. Asymptotic properties of the mean integrated squared error for the estimator (1.1) with $M=1$ and $h=h_1$, which takes the form (1.2), were studied by Rosenblatt [8], Parzen [10] and Epanechnikov [9]. In particular, for sufficiently smooth probability densities Parzen [10] derived the asymptotic expansion of the mean integrated squared error

$$\mathrm{MISE}[p,\hat p,N,h]=\frac{h^4k_2^2}{4}\int_{\mathbb R}\big(p''(x)\big)^2\,dx+\frac{1}{nh}\int_{\mathbb R}K^2(t)\,dt+o\!\left(\frac{1}{nh}+h^4\right), \qquad (1.3)$$

with $n=N_1$, $h=h_1$ and $k_2=\int_{\mathbb R}t^2K(t)\,dt$, and obtained a formula for the asymptotically optimal bandwidth parameter

$$h^{\mathrm{opt}}_{M=1}=n^{-1/5}\,k_2^{-2/5}\left(\int_{\mathbb R}K^2(t)\,dt\right)^{1/5}\left(\int_{\mathbb R}\big(p''(x)\big)^2\,dx\right)^{-1/5} \qquad (1.4)$$

which minimizes the leading terms in the expansion.
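The quantities in (1.4) are simple one-dimensional integrals, so the optimal bandwidth can be evaluated numerically. The sketch below is illustrative (grid-based quadrature, Gaussian kernel assumed); for a standard normal target it reproduces the classical constant $(4/3)^{1/5}n^{-1/5}$.

```python
import numpy as np

# quadrature grid for the one-dimensional integrals in (1.4)
t = np.linspace(-10.0, 10.0, 20001)
dt = t[1] - t[0]

def h_opt_m1(n, K_vals, d2p_vals):
    """Evaluate (1.4): h = n^{-1/5} k2^{-2/5} (int K^2)^{1/5} (int (p'')^2)^{-1/5}."""
    k2 = np.sum(t**2 * K_vals) * dt          # second moment of the kernel
    RK = np.sum(K_vals**2) * dt              # roughness of the kernel
    Rp2 = np.sum(d2p_vals**2) * dt           # roughness of p''
    return n**-0.2 * k2**-0.4 * RK**0.2 * Rp2**-0.2

gauss = np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel / N(0,1) density
d2gauss = (t**2 - 1.0) * gauss                       # second derivative of N(0,1) density

h = h_opt_m1(1000, gauss, d2gauss)  # close to (4/3)**0.2 * 1000**-0.2
```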

The case of non-differentiable or discontinuous probability density functions has been shown to possess different asymptotic estimates for MISE. It has been shown by van Eeden [7] that the optimal bandwidth parameter and the rate of convergence of the mean integrated squared error depend directly on the regularity of the probability density $p$.

In the case of multivariate distributions, $x\in\mathbb R^d$, the complexity of the asymptotic analysis depends on the form of the bandwidth matrix $H$. In the simplest case, one can assume that $H=h^2I$, where $h$ is a scalar; see Silverman [27], Simonoff [28] and Epanechnikov [9]. Another approach is to consider the bandwidth matrix of the form $H=\mathrm{diag}(h_1^2,\dots,h_d^2)$, with $h_i$ being a bandwidth parameter for each dimension $i$. The most general formulation assumes that $H$ is a symmetric positive-definite $d\times d$ matrix, which allows one to encode correlations between components of $x$; see Duong and Hazelton [6], and Wand and Jones [31].

In the present work, motivated by the ideas of [10, 6, 31, 8], we focus on the case $x\in\mathbb R$ and $M\ge 1$, and carry out the asymptotic analysis of the mean integrated squared error for both the parallel non-normalized estimator

$$\mathrm{MISE}[\hat p^*,p^*;N,h]=\mathbb E\int_{\mathbb R}\big\{p^*(x\,|\,y)-\hat p^*(x\,|\,y)\big\}^2\,dx$$

and the full data set posterior density estimator

$$\mathrm{MISE}[\hat p,p;N,h]=\mathbb E\int_{\mathbb R}\big\{p(x\,|\,y)-\hat p(x\,|\,y)\big\}^2\,dx$$

as

$$N=(N_1,N_2,\dots,N_M)\to\infty,\quad h=(h_1,h_2,\dots,h_M)\to 0\quad\text{and}\quad (N\cdot h)^{-1}\to 0.$$

In Theorem 3.3, under appropriate conditions on the regularity of the probability density, we derive the expression for $\mathrm{AMISE}[p^*,\hat p^*,N,h]$, the asymptotically leading part of MISE for the estimator $\hat p^*$. The leading part turns out to be in agreement with the leading part for the case $M=1$, but in the multi-subset case, $M\ge 2$, it contains novel terms that take into account the relationship between the subset posterior densities $p_m$.

We then perform a similar analysis for the mean integrated squared error of the full data set posterior density estimator $\hat p$. The presence of the normalizing constant of the product $\hat p^*$ introduces major difficulties in the analysis of MISE because the normalized estimator $\hat p$ may in general have an infinite second moment, in which case MISE is not defined. This may occur when the estimators $\hat p_m$ (on some events) decay too quickly in the variable $x$ and the sets of $x$ with the most 'mass' for each $\hat p_m$ have little common intersection, which potentially leads to large values of the inverse normalizing constant. To make sure that MISE is finite one must impose appropriate conditions on the density $p$ and kernel $K$. In this article, however, we take another approach. Instead, we replace the mean integrated squared error by an asymptotically equivalent distance functional denoted by

$$\overline{\mathrm{MISE}}[\hat p,p;N,h].$$

We show that the new functional is always well-defined and that it is asymptotically equivalent to MISE when restricted to events whose probability tends to one as the total number of samples tends to infinity.

We then analyze the functional $\overline{\mathrm{MISE}}$ by carrying out the same program as for the MISE of the estimator $\hat p^*$. In Theorem 4.6 we derive the expression for $\overline{\mathrm{AMISE}}[p,\hat p;N,h]$, the asymptotically leading part of the functional $\overline{\mathrm{MISE}}$ for the full data set posterior density estimator $\hat p$. The asymptotically optimal bandwidth parameter for the full data set posterior is then defined to be a minimizer

$$h^{\mathrm{opt}}=\operatorname*{arg\,min}_{h\in\mathbb R^M_+}\;\overline{\mathrm{AMISE}}[p,\hat p;N,h].$$

We then compute the minimizing bandwidth in explicit form for two special cases. In the examples presented here we consider subset posterior densities of normal and gamma distributions; see (5.2), (5.4), and (5.7). In the two examples the optimizing bandwidth vectors differ significantly and depend, as expected, directly on the full data set density, which is typically unknown. For that reason we propose an iterative algorithm for locating optimal bandwidth parameters based on the asymptotic expansions we derived; see Algorithm 1.

Our analysis demonstrates that partitioning data into $M$ subsets affects the optimality condition for the parameter $h$. It also indicates that the bandwidth vector

$$h^{\mathrm{opt}}_0=\big(h^{\mathrm{opt}}_{1,M=1},\,h^{\mathrm{opt}}_{2,M=1},\,\dots,\,h^{\mathrm{opt}}_{M,M=1}\big),$$

which minimizes the 'componentwise' mean integrated squared error

$$\sum_{i=1}^{M}\mathrm{MISE}[\hat p_i,p_i,N_i,h_i],$$

where $h^{\mathrm{opt}}_{i,M=1}$ is the optimal bandwidth parameter for the estimator $\hat p_i$ given by (1.4), is suboptimal for both estimators $\hat p^*$ and $\hat p$ whenever $M\ge 2$.

This observation highlights the fact that the choice of optimal parameters for parallel kernel density estimators (suitable for parallelizing data analysis) must differ from the theoretical choice suggested in the case of processing on a single machine. We must also note that the increased values of MISE resulting from a suboptimal bandwidth parameter are compounded in the case of parallel processing. This further underscores the importance of a proper choice of bandwidth, especially when it comes at no additional computational cost.

The paper is arranged as follows. In Section 2 we set the notation and hypotheses that form the foundation of the analysis. In Section 3 we derive an asymptotic expansion for the MISE of the non-normalized estimator $\hat p^*$ and derive formulas for the leading parts of its bias and variance, which are central to the analysis performed in subsequent sections. In Section 4 we perform the analysis of $\overline{\mathrm{MISE}}$ for the full data set posterior density. In Section 5 we compute explicit expressions for optimal bandwidth parameters for several special cases and conduct numerical experiments. Finally, in the appendix we provide supplementary lemmas and theorems employed in Section 3 and Section 4.

## 2 Notation and hypotheses

For the convenience of the reader we collect in this section all hypotheses and results relevant to our analysis and present the notation that is utilized throughout the article.

1. Motivated by the form of the posterior density in Neiswanger et al. [20] we consider the probability density function of the form

$$p(x)\propto p^*(x)\quad\text{where}\quad p^*(x):=\prod_{m=1}^{M}p_m(x). \qquad (2.1)$$

Here $p_m$ is a probability density function for each $m\in\{1,\dots,M\}$.

2. We consider the estimator of $p$ in the form

$$\hat p(x)\propto\hat p^*(x)\quad\text{where}\quad \hat p^*(x):=\prod_{m=1}^{M}\hat p_m(x) \qquad (\text{H2-a})$$

and for each $m$, $\hat p_m$ is the kernel density estimator of the probability density $p_m$ that has the form

$$\hat p_m(x)=\frac{1}{N_m h_m}\sum_{i=1}^{N_m}K\!\left(\frac{x-X_{m,i}}{h_m}\right). \qquad (\text{H2-b})$$

Here $X_{m,1},\dots,X_{m,N_m}$ are independent identically distributed random variables with density $p_m$, $K$ is a kernel density function, and $h_m>0$ is a bandwidth parameter.

The mean integrated squared error of the estimator $\hat p^*$ of the non-normalized product $p^*$ as well as of the estimator $\hat p$ of the full posterior density $p$ is defined by

$$\begin{aligned}\mathrm{MISE}[p^*,\hat p^*,N,h]&=\mathrm{MISE}[p^*,\hat p^*(x)]:=\mathbb E\int_{\mathbb R}\big(\hat p^*(x)-p^*(x)\big)^2\,dx,\\ \mathrm{MISE}[p,\hat p,N,h]&=\mathrm{MISE}[p,\hat p(x)]:=\mathbb E\int_{\mathbb R}\big(\hat p(x)-p(x)\big)^2\,dx,\end{aligned} \qquad (2.2)$$

where we use the notation $N=(N_1,\dots,N_M)$ and $h=(h_1,\dots,h_M)$. We also use the following convention for the bias and variance of the estimators:

$$\begin{aligned}\mathrm{bias}[\hat p(x)]&=\mathbb E[\hat p(x)]-p(x),\\ \mathrm{bias}[\hat p^*(x)]&=\mathbb E[\hat p^*(x)]-p^*(x),\\ \mathrm{bias}[\hat p_m(x)]&=\mathbb E[\hat p_m(x)]-p_m(x),\quad m\in\{1,\dots,M\}.\end{aligned} \qquad (2.3)$$

We assume that the kernel density function $K$ and the probability density functions $p_m$ satisfy the following hypotheses:

1. (H3) $K$ is positive, bounded, normalized, and its first moment is zero, that is,

$$0\le K(t)\le C,\quad \int_{\mathbb R}K(t)\,dt=1,\quad \int_{\mathbb R}tK(t)\,dt=0,\quad \int_{\mathbb R}K^2(t)\,dt<\infty. \qquad (2.4)$$
2. (H4) For each $s\in\{1,2,3\}$,

$$k_s=\int_{\mathbb R}|t|^s K(t)\,dt<\infty. \qquad (2.5)$$
3. (H5) For each $m\in\{1,\dots,M\}$ and $s\in\{0,1,2,3\}$, the density $p_m$ is $s$-times differentiable and there exists a constant $C$ such that

$$|p^{(s)}_m(x)|\le C<\infty. \qquad (2.6)$$
4. (H6) For each $m\in\{1,\dots,M\}$ and $s\in\{0,1,2,3\}$, the density $p_m$ and its derivatives are integrable, that is, there is a constant $C$ so that

$$\int_{\mathbb R}|p^{(s)}_m(x)|\,dx\le C<\infty. \qquad (2.7)$$
5. (H7) The functions

$$\begin{aligned}N(n)&=\{N_1(n),N_2(n),\dots,N_M(n)\}:\mathbb N\to\mathbb N^M,\\ h(n)&=\{h_1(n),h_2(n),\dots,h_M(n)\}:\mathbb N\to\mathbb R^M_{++}\end{aligned}$$

satisfy, for all $i\in\{1,\dots,M\}$ and $n\in\mathbb N$,

$$D_1\le\frac{N_i(n)}{n}\le D_2\quad\text{for some }0<D_1\le D_2<\infty.$$

We also define $\|N\|=\min_{1\le m\le M}N_m$ and $\|h\|=\max_{1\le m\le M}h_m$, and note that $\|N\|\to\infty$ and $\|h\|\to 0$ as $n\to\infty$.

## 3 Asymptotic analysis of MISE for $\hat p^*$

We start with the observation that MISE can be expressed via the combination of bias and variance

$$\begin{aligned}\mathrm{MISE}[p^*,\hat p^*]&=\mathbb E\int_{\mathbb R}\big(\hat p^*(x)-p^*(x)\big)^2\,dx\\&=\int_{\mathbb R}\big(\mathrm{bias}[\hat p^*(x)]\big)^2\,dx+\int_{\mathbb R}\mathbb V[\hat p^*(x)]\,dx.\end{aligned} \qquad (3.1)$$

In what follows we analyze the bias, then the variance, and conclude with a subsection in which we derive the formula for the optimal bandwidth vector.
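The decomposition (3.1) can also be checked empirically: for sample moments the identity $\mathrm{MISE}=\int\mathrm{bias}^2+\int\mathbb V$ holds exactly. Below is a small Monte Carlo sketch for a single Gaussian-kernel KDE (the case $M=1$); all numerical choices are illustrative.

```python
import numpy as np

# Monte Carlo check of MISE = int bias^2 dx + int Var dx, cf. (3.1),
# for one Gaussian-kernel KDE of the N(0,1) density (M = 1).
rng = np.random.default_rng(1)
grid = np.linspace(-5.0, 5.0, 401)
dx = grid[1] - grid[0]
p = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)  # true density on the grid

def kde(samples, h):
    u = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(samples) * h * np.sqrt(2.0 * np.pi))

n, h, reps = 200, 0.4, 400
estimates = np.stack([kde(rng.normal(size=n), h) for _ in range(reps)])

mise = ((estimates - p) ** 2).mean(axis=0).sum() * dx        # E int (p_hat - p)^2
bias_sq = ((estimates.mean(axis=0) - p) ** 2).sum() * dx     # int bias^2
var = estimates.var(axis=0).sum() * dx                       # int Var
# mise agrees with bias_sq + var up to floating-point rounding
```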

### 3.1 Bias expansion

Using the fact that $\hat p_1,\dots,\hat p_M$ are independent, we obtain

$$\begin{aligned}\mathrm{bias}[\hat p^*(x)]&=\mathbb E[\hat p^*(x)]-p^*(x)\\&=\prod_{m=1}^{M}\mathbb E[\hat p_m(x)]-\prod_{m=1}^{M}p_m(x)\\&=\prod_{m=1}^{M}\big(\mathrm{bias}[\hat p_m(x)]+p_m(x)\big)-\prod_{m=1}^{M}p_m(x).\end{aligned} \qquad (3.2)$$

To simplify notation in (3.2) we shall employ the multiindex notation. Let $\alpha$ be the multiindex with

$$\alpha=(\alpha_1,\alpha_2,\dots,\alpha_M),\qquad \alpha_m\in\{0,1\}.$$

Then the above formula can be rewritten as follows:

$$\begin{aligned}\mathrm{bias}[\hat p^*(x)]&=\sum_{1\le|\alpha|\le M}\,\prod_{m=1}^{M}\big(\mathrm{bias}[\hat p_m(x)]\big)^{\alpha_m}\big(p_m(x)\big)^{1-\alpha_m}\\&=\sum_{m=1}^{M}\Big[\mathrm{bias}[\hat p_m(x)]\prod_{\substack{k=1\\k\ne m}}^{M}p_k(x)\Big]+\sum_{2\le|\alpha|\le M}\,\prod_{m=1}^{M}\big(\mathrm{bias}[\hat p_m(x)]\big)^{\alpha_m}\big(p_m(x)\big)^{1-\alpha_m}.\end{aligned} \qquad (3.3)$$
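Since (3.3) is a purely algebraic rearrangement of the product expansion, it can be sanity-checked numerically. The short script below (illustrative; $M=3$ with random positive values standing in for the biases and densities at a fixed $x$) verifies the identity.

```python
import itertools
import math
import random

# Check: prod(b_m + p_m) - prod(p_m) equals the sum over all nonzero
# multiindices alpha in {0,1}^M of prod(b_m^{alpha_m} p_m^{1-alpha_m}).
random.seed(0)
M = 3
b = [random.uniform(0.1, 1.0) for _ in range(M)]  # stands in for bias[p_hat_m](x)
p = [random.uniform(0.1, 1.0) for _ in range(M)]  # stands in for p_m(x)

lhs = math.prod(bm + pm for bm, pm in zip(b, p)) - math.prod(p)
rhs = sum(
    math.prod(b[m] ** a[m] * p[m] ** (1 - a[m]) for m in range(M))
    for a in itertools.product((0, 1), repeat=M)
    if sum(a) >= 1
)
# lhs and rhs agree to floating-point precision
```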

Using this decomposition, we prove the following lemma

###### Lemma 3.1.

Suppose hypotheses (H3)-(H6) hold. Then

• (i) The bias can be expressed as

$$\mathrm{bias}[\hat p^*(x)]=\frac{k_2}{2}\sum_{m=1}^{M}\Big[h_m^2\,p''_m(x)\prod_{\substack{k=1\\k\ne m}}^{M}p_k(x)\Big]+E_b(x;h), \qquad (3.4)$$

where the error term satisfies the bounds

$$\begin{aligned}|E_b(x;h)|&\le E_\infty\|h\|^3,\quad\forall x\in\mathbb R,\\ \int_{\mathbb R}|E_b(x;h)|\,dx&\le E_1\|h\|^3,\\ \int_{\mathbb R}|E_b(x;h)|^2\,dx&\le E_2\|h\|^6.\end{aligned} \qquad (3.5)$$

• (ii) The square-integrated bias satisfies

$$\int_{\mathbb R}\mathrm{bias}^2[\hat p^*(x)]\,dx=\frac{k_2^2}{4}\int_{\mathbb R}\Big[\sum_{m=1}^{M}h_m^2\,p''_m(x)\prod_{\substack{k=1\\k\ne m}}^{M}p_k(x)\Big]^2 dx+E_b(h)<\infty \qquad (3.6)$$

with the error term satisfying

$$|E_b(h)|\le C\|h\|^5, \qquad (3.7)$$

where the constant $C$ is independent of $N$ and $h$.

###### Proof.

According to (3.3) and (6.2) we have

$$\begin{aligned}\mathrm{bias}[\hat p^*(x)]&=\frac{k_2}{2}\sum_{m=1}^{M}\Big[h_m^2\,p''_m(x)\prod_{\substack{k=1\\k\ne m}}^{M}p_k(x)\Big]+\sum_{m=1}^{M}\Big[E_{b,m}\prod_{\substack{k=1\\k\ne m}}^{M}p_k(x)\Big]\\&\quad+\sum_{2\le|\alpha|\le M}\,\prod_{m=1}^{M}\Big(\frac{h_m^2 k_2}{2}p''_m(x)+E_{b,m}\Big)^{\alpha_m}\big(p_m(x)\big)^{1-\alpha_m}.\end{aligned}$$

Here $E_{b,m}$ is the error in the bias approximation of each $\hat p_m$ from (6.2). We now compute bounds for

$$E_b(x;h)=\sum_{m=1}^{M}\Big[E_{b,m}\prod_{\substack{k=1\\k\ne m}}^{M}p_k(x)\Big]+\sum_{2\le|\alpha|\le M}\,\prod_{m=1}^{M}\Big(\frac{h_m^2 k_2}{2}p''_m(x)+E_{b,m}\Big)^{\alpha_m}\big(p_m(x)\big)^{1-\alpha_m}. \qquad (3.8)$$

To simplify the derivations we separate the terms in (3.8) into two groups: terms with at least one factor of $E_{b,m}$ and terms free of $E_{b,m}$. We define the sets

$$A_m=\big\{\alpha=(\alpha_j)_{j=1}^{M}\ :\ \alpha_m=0\ \text{and}\ 1\le|\alpha|\le M-1\big\} \qquad (3.9)$$

and the functions

$$P_m(x)=\prod_{\substack{k=1\\k\ne m}}^{M}p_k(x)+\sum_{\alpha\in A_m}\Big[\prod_{\substack{j=1\\j\ne m}}^{M}\Big(\frac{h_j^2 k_2}{2}p''_j(x)+\mathds 1_{\{j>m\}}E_{b,j}\Big)^{\alpha_j}\big(p_j(x)\big)^{1-\alpha_j}\Big]. \qquad (3.10)$$

Here $\mathds 1_{\{j>m\}}$ is the characteristic function. Consequently, the error term can be written as follows:

$$E_b(x;h)=\sum_{m=1}^{M}\big[E_{b,m}P_m(x)\big]+\sum_{2\le|\alpha|\le M}\,\prod_{m=1}^{M}\Big(\frac{h_m^2 k_2}{2}p''_m(x)\Big)^{\alpha_m}\big(p_m(x)\big)^{1-\alpha_m}. \qquad (3.11)$$

Assuming that $\|h\|$ is bounded, and using (H5) and (6.2), we can conclude that there is a constant $C_P$ so that

$$|P_m(x)|\le C_P\quad\text{for any }x\in\mathbb R\text{ and }1\le m\le M.$$

Using (H5) and (6.2), we conclude that the first term is bounded, and there is a constant $C$ so that

$$\sum_{m=1}^{M}|E_{b,m}P_m(x)|\le C\sum_{m=1}^{M}\Big(\frac{k_3 h_m^3}{6}\Big)\le CM\|h\|^3\frac{k_3}{6}. \qquad (3.12)$$

The next sum in (3.11) contains terms that are bounded due to (H5):

$$\Big|\frac{h_m^2 k_2}{2}p''_m(x)\Big|\le\|h\|^2 C\frac{k_2}{2}\quad\text{and}\quad |p_m(x)|\le C$$

for some appropriate constant $C$. Since each one of the products below has at least two factors $h_m^2$ for some $m$, a constant $C_Q$ must exist so that

$$\Big|\sum_{2\le|\alpha|\le M}\,\prod_{m=1}^{M}\Big(\frac{h_m^2 k_2}{2}p''_m(x)\Big)^{\alpha_m}\big(p_m(x)\big)^{1-\alpha_m}\Big|\le C_Q\|h\|^4\frac{k_2^2}{4}. \qquad (3.13)$$

The inequalities (3.12) and (3.13) imply the first inequality in (3.5):

$$|E_b(x;h)|\le CM\|h\|^3\frac{k_3}{6}+\|h\|^4\frac{k_2^2}{4}C_Q. \qquad (3.14)$$

The integrability follows from conditions (H5), (H6), the expansion (3.11), and the second formula in (6.4):

$$\int_{\mathbb R}|E_b(x;h)|\,dx\le C\Big(\frac{k_3\|h\|^3}{6}+\|h\|^4\frac{k_2^2}{4}\Big), \qquad (3.15)$$

which proves the second estimate in (3.5). Using the estimates obtained above, we conclude

$$\int_{\mathbb R}|E_b(x;h)|^2\,dx\le\sup_{\mathbb R}|E_b(x;h)|\cdot\int_{\mathbb R}|E_b(x;h)|\,dx\le E_\infty E_1\|h\|^6.$$

Finally, (ii) follows from the Cauchy–Schwarz inequality applied to

$$\begin{aligned}\mathrm{bias}^2[\hat p^*(x)]&=\frac{k_2^2}{4}\Big[\sum_{m=1}^{M}h_m^2\,p''_m(x)\prod_{\substack{k=1\\k\ne m}}^{M}p_k(x)\Big]^2\\&\quad+E_b(x;h)\,k_2\Big[\sum_{m=1}^{M}h_m^2\,p''_m(x)\prod_{\substack{k=1\\k\ne m}}^{M}p_k(x)\Big]+E_b^2(x;h),\end{aligned}$$

which leads directly to (3.6) and (3.7). ∎

### 3.2 Variance expansion

We next obtain an asymptotic formula for the variance of $\hat p^*$. For the proof of the lemma, we perform the following preliminary calculation:

$$\begin{aligned}\mathbb V[\hat p^*(x)]&=\mathbb E[(\hat p^*(x))^2]-\big(\mathbb E[\hat p^*(x)]\big)^2=\prod_{m=1}^{M}\mathbb E[\hat p_m^2]-\prod_{m=1}^{M}\mathbb E^2[\hat p_m]\\&=\prod_{m=1}^{M}\Big(\mathbb V[\hat p_m]+\big(p_m+\mathrm{bias}[\hat p_m]\big)^2\Big)-\prod_{m=1}^{M}\big(p_m+\mathrm{bias}[\hat p_m]\big)^2\\&=\sum_{1\le|\alpha|\le M}\,\prod_{m=1}^{M}\big(\mathbb V[\hat p_m]\big)^{\alpha_m}\big(p_m+\mathrm{bias}[\hat p_m]\big)^{2(1-\alpha_m)}.\end{aligned} \qquad (3.16)$$
###### Lemma 3.2.

Let hypotheses (H3)-(H7) hold. Then

• The variance of $\hat p^*$ is given by

$$\mathbb V[\hat p^*(x)]=\Big(\sum_{m=1}^{M}\Big[\frac{p_m(x)}{N_m h_m}\prod_{\substack{k=1\\k\ne m}}^{M}p_k^2(x)\Big]\Big)\int_{\mathbb R}K^2(t)\,dt+E_V(x;N,h),\quad x\in\mathbb R, \qquad (3.17)$$

where the error term satisfies the bound

$$|E_V(N,h)|:=\Big|\int_{\mathbb R}E_V(x)\,dx\Big|=o\Big(\frac{1}{\|N\|\,\|h\|}\Big). \qquad (3.18)$$
###### Proof.

According to (3.16) we have

$$\begin{aligned}\mathbb V[\hat p^*(x)]&=\sum_{1\le|\alpha|\le M}\,\prod_{m=1}^{M}\Big(\frac{p_m(x)}{N_m h_m}\int_{\mathbb R}K^2(t)\,dt+E_{V,m}\Big)^{\alpha_m}\big(p_m+\mathrm{bias}[\hat p_m]\big)^{2(1-\alpha_m)}\\&=\sum_{1\le|\alpha|\le M}\,\prod_{m=1}^{M}\Big(\frac{p_m(x)}{N_m h_m}\int_{\mathbb R}K^2(t)\,dt+E_{V,m}\Big)^{\alpha_m}\Big(p_m^2+2p_m\,\mathrm{bias}[\hat p_m]+\mathrm{bias}^2[\hat p_m]\Big)^{1-\alpha_m}.\end{aligned} \qquad (3.19)$$

Here $E_{V,m}$ is the approximation error of the variance of each $\hat p_m$ from (6.11). In a fashion similar to the previous proof, we separate the terms in (3.19). We single out the leading order terms, the terms with at least one factor of $E_{V,m}$, the terms with factors of $\mathrm{bias}[\hat p_m]$, and the terms of order $\frac{1}{\|N\|^2\|h\|^2}$.

We define sets

$$\begin{aligned}A^0_m&=\big\{\alpha=(\alpha_j)_{j=1}^{M}\ :\ \alpha_m=0\ \text{and}\ 0\le|\alpha|\le M-1\big\},\\ B^1_m&=\big\{\alpha=(\alpha_j)_{j=1}^{M}\ :\ \alpha_m=0\ \text{and}\ |\alpha|=1\big\}\end{aligned} \qquad (3.20)$$

and functions

$$\begin{aligned}P^0_m(x)&=\sum_{\alpha\in A^0_m}\Big[\prod_{\substack{j=1\\j\ne m}}^{M}\Big(\frac{p_j(x)}{N_j h_j}\int_{\mathbb R}K^2(t)\,dt+\mathds 1_{\{j>m\}}E_{V,j}\Big)^{\alpha_j}\big(\mathbb E^2[\hat p_j]\big)^{1-\alpha_j}\Big],\\ Q^1_m(x)&=\sum_{\alpha\in B^1_m}\Big[\prod_{\substack{j=1\\j\ne m}}^{M}\Big(\frac{p_j(x)}{N_j h_j}\int_{\mathbb R}K^2(t)\,dt\Big)^{\alpha_j}\big(\mathbb E^2[\hat p_j]\big)^{1-\alpha_j}\Big].\end{aligned} \qquad (3.21)$$

The variance expansion can be rewritten as

$$\begin{aligned}\mathbb V[\hat p^*(x)]&=\sum_{1\le|\alpha|\le M}\,\prod_{m=1}^{M}\Big(\frac{p_m(x)}{N_m h_m}\int_{\mathbb R}K^2(t)\,dt\Big)^{\alpha_m}\Big(p_m^2+2p_m\,\mathrm{bias}[\hat p_m]+\mathrm{bias}^2[\hat p_m]\Big)^{1-\alpha_m}\\&\quad+\sum_{m=1}^{M}E_{V,m}P^0_m(x)\\&=\sum_{m=1}^{M}\Big(\frac{p_m(x)}{N_m h_m}\int_{\mathbb R}K^2(t)\,dt\Big)\prod_{\substack{j=1\\j\ne m}}^{M}p_j^2(x)\\&\quad+\sum_{m=1}^{M}\mathrm{bias}[\hat p_m]\big(2p_m(x)+\mathrm{bias}[\hat p_m]\big)Q^1_m(x)\\&\quad+\sum_{2\le|\alpha|\le M}\,\prod_{m=1}^{M}\Big(\frac{p_m(x)}{N_m h_m}\int_{\mathbb R}K^2(t)\,dt\Big)^{\alpha_m}\big(\mathbb E^2[\hat p_m]\big)^{1-\alpha_m}\\&\quad+\sum_{m=1}^{M}E_{V,m}P^0_m(x).\end{aligned} \qquad (3.22)$$

Based on the definitions of the functions $P^0_m$ and $Q^1_m$ and hypotheses (H5), (H6) and (H7), we can conclude that there are constants $C_E$, $C_P$, $C_Q$ so that

$$\mathbb E[\hat p_m]\le C_E,\qquad |P^0_m(x)|\le C_P,\qquad |Q^1_m(x)|\le\frac{C_Q}{\|N\|\,\|h\|}.$$

Therefore

$$\begin{aligned}\int_{\mathbb R}|E_V(x)|\,dx&\le\sum_{m=1}^{M}\frac{C\big(2+\|h\|^2\frac{k_2}{2}\big)C_Q}{\|N\|\,\|h\|}\int_{\mathbb R}\big|\mathrm{bias}[\hat p_m]\big|\,dx\\&\quad+\frac{1}{\|N\|^2\|h\|^2}\sum_{2\le|\alpha|\le M}\Big(\frac{1}{\|N\|^2\|h\|^2}\Big)^{|\alpha|-2}C_E^{\,M-|\alpha|}\\&\quad+\frac{M\,C_P}{\|N\|}.\end{aligned} \qquad (3.23)$$

This leads directly to (3.18). ∎

### 3.3 AMISE formula and optimal bandwidth vector

With the lemmas above we can derive the decomposition of $\mathrm{MISE}[p^*,\hat p^*]$ into leading-order terms and higher-order terms.

###### Theorem 3.3.

Let hypotheses (H3)-(H7) hold. Then MISE can be represented as

$$\mathrm{MISE}[p^*,\hat p^*,N,h]=\mathrm{AMISE}[p^*,\hat p^*,N,h]+E(N,h) \qquad (3.24)$$