# On Bootstrap Averaging Empirical Bayes Estimators

*This version: July 3, 2019*

SHONOSUKE SUGASAWA
Risk Analysis Research Center, The Institute of Statistical Mathematics

Abstract.  Parametric empirical Bayes (EB) estimators have been widely used in a variety of fields, including small area estimation and disease mapping. Since the EB estimator is constructed by plugging an estimator of the parameters of the prior distribution into the Bayes estimator, it might perform poorly if that parameter estimator is unstable. This can happen when the number of areas is small or moderate. This paper suggests a bootstrap averaging approach, known as "bagging" in the machine learning literature, to improve the performance of EB estimators. We consider two typical hierarchical models, the two-stage normal hierarchical model and the Poisson-gamma model, and compare the proposed method with the classical parametric EB method through simulation and empirical studies.

Key words: Bagging; Hierarchical model; Mean squared error; Poisson-gamma model

## 1 Introduction

The parametric empirical Bayes (EB) estimator (Morris, 1983) is known to be a useful method for producing reliable estimates of multidimensional parameters. The technique is widely used in a variety of fields such as small area estimation (Rao and Molina, 2015) and disease mapping (Lawson, 2013). Let $\theta_1,\dots,\theta_m$ be the parameters of interest, and let $y_1,\dots,y_m$ be independent observations generated from the distributions $f_i(y_i\mid\theta_i)$. To carry out empirical Bayes estimation, it is assumed that the parameters $\theta_i$ independently follow the distribution $g(\theta_i;\phi)$, where $\phi$ is a vector of unknown parameters. We therefore obtain the two-stage model:

$$
y_i \mid \theta_i \sim f_i(y_i \mid \theta_i), \qquad \theta_i \sim g(\theta_i; \phi), \qquad i = 1, \dots, m, \tag{1}
$$

which are independent for $i = 1, \dots, m$. Under this setting, the posterior distribution of $\theta_i$ is given by

$$
\pi(\theta_i \mid y_i; \phi) = \frac{f_i(y_i \mid \theta_i)\, g(\theta_i; \phi)}{\int f_i(y_i \mid \theta_i)\, g(\theta_i; \phi)\, d\theta_i}, \qquad i = 1, \dots, m.
$$

The Bayes estimator of $\theta_i$ under squared error loss is the conditional expectation (posterior mean) of $\theta_i$ given $y_i$, that is,

$$
\tilde{\theta}_i \equiv E[\theta_i \mid y_i; \phi] = \frac{\int \theta_i f_i(y_i \mid \theta_i)\, g(\theta_i; \phi)\, d\theta_i}{\int f_i(y_i \mid \theta_i)\, g(\theta_i; \phi)\, d\theta_i}, \qquad i = 1, \dots, m. \tag{2}
$$

However, the Bayes estimator depends on the unknown model parameters $\phi$, which can be estimated from the marginal distribution of all the data $y_1, \dots, y_m$, given by

$$
L(\phi) = \prod_{i=1}^m \int f_i(y_i \mid \theta_i)\, g(\theta_i; \phi)\, d\theta_i.
$$

Using this marginal likelihood, one can immediately define the maximum likelihood (ML) estimator $\hat{\phi}$ as the maximizer of $L(\phi)$. Based on the estimator $\hat{\phi}$, we obtain the empirical Bayes (EB) estimator of $\theta_i$ as $\tilde{\theta}_i(y_i; \hat{\phi})$.

The variability of the EB estimator can be measured by the integrated mean squared error (MSE) $E[(\tilde{\theta}_i(y_i;\hat{\phi}) - \theta_i)^2]$, where the expectation is taken with respect to the $\theta_i$'s and $y_i$'s following the model (1). Since $\tilde{\theta}_i(y_i;\phi)$ is the conditional expectation given in (2), the MSE can be decomposed as $M_{1i} + M_{2i}$, with $M_{1i} = E[(\tilde{\theta}_i(y_i;\phi) - \theta_i)^2]$ and $M_{2i} = E[(\tilde{\theta}_i(y_i;\hat{\phi}) - \tilde{\theta}_i(y_i;\phi))^2]$. The first term is not affected by the estimation of $\phi$, whereas the second term reflects the variability of the ML estimator $\hat{\phi}$, so the second term becomes negligibly small when $m$ is large. In many applications, however, $m$ may be small or moderate, in which case the contribution of the second term to the MSE cannot be ignored. Hence, the EB estimator might perform poorly depending on the ML estimator $\hat{\phi}$. To overcome this problem, we propose to use the bootstrap averaging technique, known as "bagging" (Breiman, 1996) in the machine learning literature. This method produces many estimators based on bootstrap samples and averages them to produce a stable estimator. We adapt the bagging method to EB estimation to improve the performance of EB estimators under small or moderate $m$.

This paper is organized as follows. In Section 2, we consider the mean squared error of EB estimators and propose a bootstrap averaging empirical Bayes (BEB) estimator that decreases the mean squared error. In Sections 3 and 4, we apply the BEB estimator to the well-known two-stage normal hierarchical model and the Poisson-gamma model, respectively, and compare the performance of the BEB and EB estimators through simulation and empirical studies. In Section 5, we provide conclusions and discussion.

## 2 Bootstrap Averaging Empirical Bayes Estimators

As noted in the previous section, the performance of the EB estimator depends on the variability of the estimator $\hat{\phi}$, which cannot be ignored when $m$ is not large. To reduce the variability of the empirical Bayes estimator $\tilde{\theta}_i(y_i; \hat{\phi})$, we propose to average many empirical Bayes estimators computed with bootstrap estimates of $\phi$ rather than computing one empirical Bayes estimator from the observations $y_1, \dots, y_m$. Specifically, letting $y_1^{(b)}, \dots, y_m^{(b)}$ be the $b$-th bootstrap sample of the original observations, we define $\hat{\phi}^{(b)}$ to be the estimator of $\phi$ based on the $b$-th bootstrap sample. Then the bagging empirical Bayes (BEB) estimator is given by

$$
\hat{\theta}_i^{\rm Boot} = \frac{1}{B} \sum_{b=1}^B \tilde{\theta}_i(y_i, \hat{\phi}^{(b)}). \tag{3}
$$

Similarly to Breiman (1996), we note that

$$
\begin{aligned}
\frac{1}{B} \sum_{b=1}^B \{\tilde{\theta}_i(y_i, \hat{\phi}^{(b)}) - \theta_i\}^2
&= \frac{1}{B} \sum_{b=1}^B \tilde{\theta}_i(y_i, \hat{\phi}^{(b)})^2 - 2\hat{\theta}_i^{\rm Boot}\theta_i + \theta_i^2 \\
&\ge \Big\{\frac{1}{B} \sum_{b=1}^B \tilde{\theta}_i(y_i, \hat{\phi}^{(b)})\Big\}^2 - 2\hat{\theta}_i^{\rm Boot}\theta_i + \theta_i^2
= (\hat{\theta}_i^{\rm Boot} - \theta_i)^2.
\end{aligned}
$$

By taking expectation with respect to the model (1), we have

$$
\frac{1}{B} \sum_{b=1}^B E\big[\{\tilde{\theta}_i(y_i, \hat{\phi}^{(b)}) - \theta_i\}^2\big] \ge E\big[(\hat{\theta}_i^{\rm Boot} - \theta_i)^2\big],
$$

which means that the integrated MSE of the BEB estimator (3) is smaller than the bootstrap average of the integrated MSEs of the EB estimators. Hence, the BEB estimator is expected to perform better than the EB estimator. The amount of improvement depends on

$$
\frac{1}{B} \sum_{b=1}^B \tilde{\theta}_i(y_i, \hat{\phi}^{(b)})^2 - \Big\{\frac{1}{B} \sum_{b=1}^B \tilde{\theta}_i(y_i, \hat{\phi}^{(b)})\Big\}^2 = \frac{1}{B} \sum_{b=1}^B \{\tilde{\theta}_i(y_i, \hat{\phi}^{(b)}) - \hat{\theta}_i^{\rm Boot}\}^2,
$$

which is the bootstrap variance of the EB estimator; it vanishes as $m \to \infty$ but is not negligible when $m$ is not large. Therefore, when $m$ is small or moderate, the BEB estimator would improve on the EB estimator. In the subsequent sections, we investigate the performance of the BEB estimator compared with the EB estimator in widely used hierarchical models.
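As a concrete sketch, the BEB recipe in (3) can be written in a few lines of code. The snippet below is only an illustration, not the paper's implementation: the helper names `estimate_phi` and `bayes_estimator` are hypothetical placeholders for a model-specific ML routine and the Bayes estimator (2), and a nonparametric bootstrap (resampling areas with replacement) is assumed.

```python
import numpy as np

def beb_estimates(y, estimate_phi, bayes_estimator, B=100, rng=None):
    """Bagging EB estimator (3): average the EB estimates computed
    from B bootstrap estimates of phi.

    `estimate_phi(sample)` returns hat{phi} from a bootstrap sample;
    `bayes_estimator(y, phi)` returns tilde{theta}_i(y_i; phi) for all i.
    Both are user-supplied (hypothetical names, model-specific).
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    m = len(y)
    boot_sum = np.zeros(m)
    for _ in range(B):
        idx = rng.integers(0, m, size=m)        # resample areas with replacement
        phi_b = estimate_phi(y[idx])            # hat{phi}^{(b)}
        boot_sum += bayes_estimator(y, phi_b)   # tilde{theta}_i(y_i, hat{phi}^{(b)})
    return boot_sum / B                         # hat{theta}_i^{Boot}
```

Note that the Bayes estimator is always evaluated at the original $y_i$; only the parameter estimate varies across bootstrap replications.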

## 3 Two-stage normal hierarchical model

### 3.1 Model description

We first consider the two-stage normal hierarchical model to demonstrate the proposed bagging procedure. The two-stage normal hierarchical model is described as

$$
y_i \mid \theta_i \sim N(\theta_i, D_i), \qquad \theta_i \sim N(x_i^t\beta, A), \qquad i = 1, \dots, m, \tag{4}
$$

where $D_i$ is a known sampling variance, $x_i$ and $\beta$ are a vector of covariates and a vector of regression coefficients, respectively, and $A$ is an unknown variance. Let $\phi = (\beta^t, A)^t$ be the vector of unknown parameters. The model (4) is known as the Fay-Herriot model (Fay and Herriot, 1979) in the context of small area estimation.

Under the model (4), the Bayes estimator of $\theta_i$ is

$$
\tilde{\theta}_i(y_i; \phi) = x_i^t\beta + \frac{A}{A + D_i}(y_i - x_i^t\beta).
$$

Concerning the estimation of the unknown parameter $\phi$, we consider the maximum likelihood estimator for simplicity. Since $y_i \sim N(x_i^t\beta, A + D_i)$ marginally under the model (4), the maximum likelihood estimator is defined as the minimizer of the function:

$$
Q(\phi) = \sum_{i=1}^m \log(A + D_i) + \sum_{i=1}^m \frac{(y_i - x_i^t\beta)^2}{A + D_i}.
$$

While several other estimation methods are available, we restrict attention to the maximum likelihood estimator for presentational simplicity. Using the maximum likelihood estimator $\hat{\phi}$, we obtain the EB estimator of $\theta_i$ as $\tilde{\theta}_i(y_i; \hat{\phi})$.
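As an illustration of this ML step, one can profile $Q(\phi)$ over $A$: for any fixed $A$, the regression part has a closed-form weighted least squares solution. The sketch below (an assumption-laden illustration, not the paper's code) does this for the intercept-only case $x_i^t\beta = \beta_0$ and returns the EB estimates.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fh_eb(y, D):
    """EB estimates under the Fay-Herriot model (4) without covariates:
    y_i | th_i ~ N(th_i, D_i), th_i ~ N(beta0, A).  For each candidate A,
    beta0 has a closed-form GLS solution, so we profile Q(phi) over A
    and minimize numerically (illustrative sketch only)."""
    y, D = np.asarray(y, float), np.asarray(D, float)

    def profile_Q(A):
        w = 1.0 / (A + D)                  # GLS weights
        beta0 = np.sum(w * y) / np.sum(w)  # profiled beta0 for this A
        return np.sum(np.log(A + D)) + np.sum(w * (y - beta0) ** 2)

    res = minimize_scalar(profile_Q, bounds=(0.0, 10 * y.var()),
                          method="bounded")
    A_hat = res.x
    w = 1.0 / (A_hat + D)
    beta0_hat = np.sum(w * y) / np.sum(w)
    # Bayes estimator: shrink y_i toward beta0 by the factor A/(A + D_i)
    shrink = A_hat / (A_hat + D)
    return beta0_hat + shrink * (y - beta0_hat), (beta0_hat, A_hat)
```

Feeding `fh_eb` as the parameter-estimation and Bayes-estimation steps of the bootstrap loop of Section 2 then yields the BEB estimates for this model.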

### 3.2 Simulation study

We here evaluate the performance of the BEB estimator together with the EB estimator under the normal hierarchical model (4) without covariates, namely $x_i^t\beta = \beta_0$. We considered several values of the number of areas $m$. For each $m$, we set the sampling variances $D_i$ as equally spaced points between fixed lower and upper values. Concerning the true parameter values, we fixed $\beta_0$ and used four settings of $A$. The simulated data were generated from the model (4) in each iteration, and the EB and BEB estimates of $\theta_i$ were computed. Based on $R$ simulation runs, we calculated the simulated mean squared error (MSE) defined as

$$
{\rm MSE} = \frac{1}{mR} \sum_{i=1}^m \sum_{r=1}^R (\hat{\theta}_i^{(r)} - \theta_i^{(r)})^2, \tag{5}
$$

where $\hat{\theta}_i^{(r)}$ is the BEB or EB estimate and $\theta_i^{(r)}$ is the true value of $\theta_i$ in the $r$-th iteration.
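The criterion (5) is just the average squared error over all $m$ areas and $R$ simulation runs; a minimal helper (illustrative only) might look like:

```python
import numpy as np

def simulated_mse(est, truth):
    """Simulated MSE (5): `est` and `truth` are (R, m) arrays of
    estimates and true theta values over R simulation runs."""
    est, truth = np.asarray(est, float), np.asarray(truth, float)
    return np.mean((est - truth) ** 2)
```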

In Figure 1, we present the simulated MSE of the EB estimator as well as of the three BEB estimators using different numbers of bootstrap samples $B$ under various settings of $m$ and $A$. It is observed that the BEB estimator performs better than the EB estimator on the whole. In particular, the improvement is greater when $A$ is small compared with $D_i$, which often arises in practice. Moreover, as $m$ gets larger, the MSE differences get smaller, since the variability of estimating $\phi$ vanishes when $m$ is sufficiently large. We also found that the ML estimator of $A$ often produces zero estimates when $A$ is small, in which case the EB estimator is known to perform poorly. However, the BEB estimator can avoid this problem, since the BEB estimator aggregates bootstrap estimators and at least some bootstrap estimates should be non-zero. In fact, on investigating the cases where the ML estimator produced zero estimates of $A$, some bootstrap estimates of $A$ were away from zero. This would be one of the reasons why the BEB estimator performs better than the EB estimator in this setting.

### 3.3 Example: corn data

We next illustrate the performance of the BEB estimator by using data on corn and soybean production in 12 Iowa counties, which have been used as an example in the context of small area estimation. Specifically, we use the area-level data set given in Table 6 of Dass et al. (2012), and we focus only on corn production for simplicity. The data set consists of $m = 12$ areas with sample sizes ranging from 3 to 5 in each area, and the survey estimate of corn production $y_i$, its sampling variance $D_i$, and the satellite data of corn $x_i$ as the covariate are observed in each area. We considered the following hierarchical model:

$$
y_i \mid \theta_i \sim N(\theta_i, D_i), \qquad \theta_i \sim N(\beta_0 + \beta_1 x_i, A), \qquad i = 1, \dots, m, \tag{6}
$$

where $\beta_0$, $\beta_1$, and $A$ are unknown parameters. For this data set, we computed the BEB as well as the EB estimates, using $B$ bootstrap samples for computing the BEB estimator. In Figure 2, we present the histogram of the bootstrap estimates used in the BEB estimates together with the maximum likelihood (ML) estimates used in the EB estimator. We can observe that the bootstrap estimates vary considerably over the bootstrap samples. Moreover, in Table 1, we show the BEB and EB estimates of $\theta_i$; the BEB estimator produces estimates different from the EB estimator, since the number of areas is only $m = 12$.

## 4 Poisson-gamma model

### 4.1 Setup

The Poisson-gamma model (Clayton and Kaldor, 1987) is described as

$$
z_i \mid \theta_i \sim {\rm Po}(n_i\theta_i), \qquad \theta_i \sim \Gamma(\nu m_i, \nu), \qquad i = 1, \dots, m, \tag{7}
$$

where $m_i = \exp(x_i^t\beta)$, $x_i$ and $\beta$ are a vector of covariates and a vector of regression coefficients, respectively, and $\nu$ is an unknown scale parameter. This model, considered in Clayton and Kaldor (1987), is used as a standard method in disease mapping. Let $\phi = (\beta^t, \nu)^t$ be the vector of unknown parameters.

Under the model (7), the Bayes estimator of $\theta_i$ is given by

$$
\tilde{\theta}_i(z_i; \phi) = \frac{z_i + \nu m_i}{n_i + \nu}.
$$

Since the Bayes estimator depends on the unknown $\phi$, we need to replace $\phi$ by its estimator. Noting that the gamma prior of $\theta_i$ is a conjugate prior for the mean parameter of the Poisson distribution, the marginal distribution of $z_i$ is the negative binomial distribution with probability function:

$$
f_m(z_i; \phi) = \frac{\Gamma(z_i + \nu m_i)}{\Gamma(z_i + 1)\,\Gamma(\nu m_i)} \Big(\frac{n_i}{n_i + \nu}\Big)^{z_i} \Big(\frac{\nu}{n_i + \nu}\Big)^{\nu m_i}.
$$

Then the maximum likelihood estimator of $\phi$ is defined as the maximizer of $\prod_{i=1}^m f_m(z_i; \phi)$, which enables us to obtain the empirical Bayes estimator $\tilde{\theta}_i(z_i; \hat{\phi})$.
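As a sketch of this estimation step, the snippet below (an illustration under stated assumptions, not the paper's code) maximizes the negative binomial marginal likelihood for the covariate-free case $m_i = \mu$ and plugs the result into the Bayes estimator; the optimization is done on the log scale to keep $\mu$ and $\nu$ positive.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def pg_eb(z, n):
    """EB estimates under the Poisson-gamma model (7) without covariates
    (m_i = mu): maximize the negative binomial marginal likelihood over
    phi = (mu, nu), then plug into (z_i + nu*mu)/(n_i + nu).  Sketch only."""
    z, n = np.asarray(z, float), np.asarray(n, float)

    def neg_loglik(log_phi):
        mu, nu = np.exp(log_phi)          # optimize on the log scale
        a = nu * mu                       # gamma shape parameter
        ll = (gammaln(z + a) - gammaln(z + 1) - gammaln(a)
              + z * np.log(n / (n + nu)) + a * np.log(nu / (n + nu)))
        return -np.sum(ll)

    # start from the pooled rate z_total / n_total and nu = 1
    res = minimize(neg_loglik, x0=np.log([z.sum() / n.sum(), 1.0]),
                   method="Nelder-Mead")
    mu_hat, nu_hat = np.exp(res.x)
    return (z + nu_hat * mu_hat) / (n + nu_hat), (mu_hat, nu_hat)
```

As in Section 3, wrapping this routine in the bootstrap loop of Section 2 gives the corresponding BEB estimates.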

### 4.2 Simulation study

We next evaluated the performances of the BEB estimator under the Poisson-gamma model without covariates, described as

$$
z_i \mid \theta_i \sim {\rm Po}(n_i\theta_i), \qquad \theta_i \sim \Gamma(\nu\mu, \nu), \qquad i = 1, \dots, m, \tag{8}
$$

where we set $\mu$ to a fixed value and considered several values of $\nu$. Note that $\nu$ is a scale parameter and ${\rm Var}(\theta_i) = \mu/\nu$, so the random effect variance is a decreasing function of $\nu$. Regarding the number of areas, we considered several values of $m$. For each $m$, we set $n_i$ as rounded integers of equally spaced numbers between fixed lower and upper values. Similarly to Section 3.2, using (5) with $R$ simulation runs, we calculated the MSE of the BEB estimator as well as of the EB estimator of $\theta_i$. The results, presented in Figure 3, show that the BEB estimator tends to perform better than the EB estimator. In particular, the amount of improvement is greater when $m$ is not large, as expected. Moreover, the MSE difference tends to become larger as $\nu$ gets larger, which corresponds to the case where the random effect variance gets smaller. This is consistent with the results for the normal model given in Section 3.2.

### 4.3 Example: Scottish lip cancer

We applied the BEB and EB methods to the famous Scottish lip cancer data, covering the 6 years from 1975 to 1980 in each county of Scotland. For each county, the observed and expected numbers of cases are available, denoted by $z_i$ and $n_i$, respectively. Moreover, the proportion of the population employed in agriculture, fishing, or forestry is available for each county, so we used it as a covariate ${\rm AFF}_i$, following Wakefield (2007). For each area $i = 1, \dots, m$, we consider the Poisson-gamma model:

$$
z_i \mid \theta_i \sim {\rm Po}(n_i\theta_i), \qquad \theta_i \sim \Gamma(\nu \exp(\beta_0 + \beta_1 {\rm AFF}_i), \nu), \tag{9}
$$

where $\theta_i$ is the true risk of lip cancer in the $i$-th area, and $\beta_0$, $\beta_1$, and $\nu$ are unknown parameters. For this data set, we computed the BEB as well as the EB estimates of $\theta_i$, using $B$ bootstrap samples for computing the BEB estimator. In Figure 4, we present the quantiles of the bootstrap estimates used in the BEB estimates and the maximum likelihood (ML) estimates used in the EB estimators. We can observe that the bootstrap estimates vary over the bootstrap samples, while the variability seems small compared with Figure 2. This might come from the fact that the number of areas in this case is much larger than in the corn data of Section 3.3. Finally, in Figure 5, we show the scatter plot of the percent relative difference between the BEB and EB estimates against the expected number of cases $n_i$. Figure 5 shows that the differences get larger as $n_i$ gets smaller, since the direct estimator of $\theta_i$ is shrunk more strongly toward the regression mean in areas with small $n_i$.

## 5 Conclusion and Discussion

We have proposed the use of bootstrap averaging, known as "bagging" in the machine learning context, for improving the performance of empirical Bayes (EB) estimators. We focused on two models used extensively in practice: the two-stage normal hierarchical model and the Poisson-gamma model. In both models, the simulation studies revealed that the bootstrap averaging EB (BEB) estimator performs better than the EB estimator.

In this paper, we considered typical area-level models as applications of the BEB estimator. However, the BEB method could be extended to more general settings, for example, generalized linear mixed models. A detailed comparison in such models is left to future study.

## References

• [1] Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
• [2] Clayton, D. and Kaldor, J. (1987). Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics, 43, 671-681.
• [3] Dass, S. C., Maiti, T., Ren, H. and Sinha, S. (2012). Confidence interval estimation of small area parameters shrinking both means and variances. Survey Methodology, 38, 173-187.
• [4] Fay, R. and Herriot, R. (1979). Estimates of income for small places: An application of James-Stein procedures to census data. Journal of the American Statistical Association, 74, 341-353.
• [5] Lawson, A. B. (2013). Bayesian Disease Mapping: Hierarchical Modeling in Spatial Epidemiology, 2nd Edition. Chapman and Hall/CRC Press.
• [6] Morris, C. N. (1983). Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association, 78, 47-65.
• [7] Rao, J. N. K. and Molina, I. (2015). Small Area Estimation, 2nd Edition. Wiley.
• [8] Wakefield, J. (2007). Disease mapping and spatial regression with count data. Biostatistics, 8, 158-183.