Split Sampling:Expectations, Normalisation and Rare Events

# Split Sampling: Expectations, Normalisation and Rare Events

John R. Birge, Changgee Chang, and Nicholas G. Polson
Booth School of Business111 E-mail: john.birge@chicagobooth.edu, changgee@uchicago.edu, ngp@chicagobooth.edu. We would like to thank the participants at the conference in honor of Pietrio Muliere at Bocconi, September 13-15, 2012 for their helpful comments. The authors’ work was supported by the University of Chicago Booth School of Business.
First Draft: August 2012
This Draft: October 2013
###### Abstract

In this paper we develop a methodology that we call split sampling methods to estimate high dimensional expectations and rare event probabilities. Split sampling uses an auxiliary variable MCMC simulation and expresses the expectation of interest as an integrated set of rare event probabilities. We derive our estimator from a Rao-Blackwellised estimate of a marginal auxiliary variable distribution. We illustrate our method with two applications. First, we compute a shortest network path rare event probability and compare our method to estimation to a cross entropy approach. Then, we compute a normalisation constant of a high dimensional mixture of Gaussians and compare our estimate to one based on nested sampling. We discuss the relationship between our method and other alternatives such as the product of conditional probability estimator and importance sampling. The methods developed here are available in the R package: SplitSampling.

Keywords: Rare Events, Cross Entropy, Product Estimator, Slice Sampling, MCMC, Importance Sampling, Serial Tempering, Annealing, Adaptive MCMC, Wang-Landau, Nested Sampling, Bridge and Path Sampling.

## 1 Introduction

In this paper we develop a methodology we refer to as split sampling methods to provide more precise estimates for high dimensional expectations and rare event probabilities. We show that more precise estimators can be achieved by splitting the expectation of interest into a number of easier-to-estimate normalisation constants and then integrating those estimates to produce an estimate of the full expectation. To do this, we employ an auxiliary variable MCMC approach with a family of splitting functions and a weighting function on the conditional distribution of the auxiliary random variable. We allow for an adaptive MCMC approach to specify our weighting function. We relate our method to the product estimator (Diaconis and Holmes, 1994, Fishman, 1994) which splits the rare event probability into a set of relatively larger conditional probabilities which are easier to estimate and to nested sampling (Skilling, 2006) for the estimation of expectations. Other variance reduction techniques, such as control variates will provide further efficiency gains (see Dellaportas and Kontoyiannis, 2012, Mira et al, 2012).

There are two related approaches in the literature. One approach is for estimating expectations is nested sampling (Skilling, 2006) which sequentially estimates the quantiles of the likelihood function under the prior. Other normalisation methods include bridge and path sampling (Meng and Wong, 1996, Gelman and Meng, 1998), generalized versions of the Wang-Landau algorithm (Wang and Landau, 2001), and the TPA algorithm of Huber and Schott (2010). Serial tempering (Geyer, 2010) and linked importance sampling (Neal, 2005) provide ratios of normalisation constants for a discrete set of unnormalised densities. The second approach is cross entropy (Rubinstein and Glynn, 2009, Asmussen et al, 2012) which sequentially constructs an optimal variance-reducing importance function for calculating rare event probabilities. We show that our adaptively chosen weighted MCMC algorithm can provide efficiency gains over cross entropy methods.

The problem of interest is to calculate an expectation of interest, , where is a likelihood and is a prior measure. For rare event probabilities, , and the splitting functions are specified by level sets in the likelihood, namely . This occurs naturally in the rare event simulation literature and in nested sampling. To develop an efficient estimator, we define a joint split sampling split distribution, , on and an auxiliary variable that tilts the conditional distribution with a weighting function, . We provide a default setting for the weight function to match the sampling properties of the product estimator and of nested sampling. We also allow for the possibility of adaptively learning a weight function from the MCMC output. As with other ergodic Monte Carlo methods, we assume that the researcher can construct a fast MCMC algorithm for sampling our joint distribution.

The rest of the paper is outlined as follows. Section 2 details our split sampling methodology. A key identity “splits” the expectation of interest into an integrated set of rare event normalisation constants. MCMC then provides an estimator of the marginal distribution of the auxiliary variable, which in turn provides our overall estimator. We provide a number of guidelines for specifying our weight function in both discrete and continuous settings. Section 3 describes the relationship with the product estimator and nested sampling methods. In both cases, we provide a theorem that provides a default choice of weight function to match the sampling behavior of the product estimator and nested sampling. Section 4 applies our methodology to a shortest path rare event probability and to the calculation of a normalisation constant for a spike-and-slab mixture of Gaussians. We illustrate the efficiency gains of split sampling over crude Monte Carlo, the product estimator and the cross entropy method. Finally, Section 5 concludes with directions for future research.

## 2 Split Sampling

We now introduce the notation to characterize the estimation problems and to develop our method. The central problem is to calculate an expectation of a non-negative functional of interest, which we denote as , under a -dimensional probability distribution . We write this expectation as:

 Z=Eπ(L(x))=∫XL(x)π(x)dx.

The corresponding rare event probability is given by

 Z(m)=Eπ(Lm(x))=P(L(x)>m)

We interpret as a “prior” distribution, as a likelihood, and as an auxiliary variable. Given a family of splitting functions, , their normalisation constants are, . For rare events, where is large. Here we assume that is continuous with respect to the prior and .

Split sampling works as follows. We define the set of “tilted” distributions and corresponding normalisation constants by

 π(x|m)=Lm(x)π(x)Z(m)whereZ(m)=∫XLm(x)π(x)dx.

For rare events, and corresponds to conditioning on level sets of . The ’s correspond to “rare” event probabilities

 Z(m)=∫XLm(x)π(x)dx=∫L(x)>mπ(x)dx=Pπ(L(x)>m).

For the functional , the expectation of interest, , is an integration of the rare event probabilities. Using Fubini and writing we have the key identity

 Z=∫XL(x)π(x)dx=∫X{∫L(x)>mπ(x)dm}dx=∫∞0Z(m)dm.

We have “split” the computation of into a set of easier to compute normalisation constants . We will simultaneously provide an estimator and .

To do this, we further introduce a weight function on the auxiliary variable, , and the cumulative weight . The joint split sampling density, , is defined with as

 πSS(x,m)=πSS(x|m)⋅ω(m)Z(m)ZW,

where . The marginals on and are

 πSS(m)=ω(m)Z(m)ZWandπSS(x)=Ω(L(x))π(x)ZW (1)

where .

The key feature of our split sampling MCMC draws, , are that they provide an efficient Rao-Blackwellised estimate of the marginal without the knowledge of the ’s. We now show how to estimate and .

With , the joint splitting density is

 π(x,m)=π(m)π(x|M=m)=ω(m)Z(m)ZWI(L(x)>m)π(x)Z(m)0

The conditional posterior of the auxiliary index given is

 π(m|x)=ω(m)I(m

The density is proportional to the weight function on the interval . Slice sampling corresponds to and uniform sampling on . This would lead to direct draws from the posterior distribution and the resultant estimator would be the Harmonic mean. Our approach will weight towards regions of smaller values to provide an efficient estimator of all the rare event probabilities and hence of .

The marginal density estimator of is

 ˆπSS(m) =1NN∑i=1π(m|x(i))=1NN∑i=1ω(m)I{L(x(i))>m}Ω(L(x(i))),0

This is a re-weighted version of the initial weights . The function will be used to adaptively re-balance the initial weights in our adaptive version of the algorithm, see Section 4.

We now derive estimators for and by exploiting a Rao-Blackwellised estimator for the marginal density, . From (1), we have and so an estimate of is given by

 ˆZ(m)Z(0)=ω(m)−1ˆπSS(m)ω(0)−1ˆπSS(0).

With , this provides a new estimator , where , given by

 ˆZ(m)=ϕN(m)ϕN(0)=∑i:L(x(i))>mΩ(L(x(i)))−1∑Ni=1Ω(L(x(i)))−1.

To find , we use the summation-integral counterpart to Fubini and the fact that to yield

 ∫∞0∑i:L(x(i))>mΩ(L(x(i)))−1dm=N∑i=1Ω(L(x(i)))−1L(x(i)).

Therefore, we have our estimator

 ˆZ=N∑i=1Ω(L(x(i)))−1∑Ni=1Ω(L(x(i)))−1L(x(i)).

We now describe our split sampling algorithm.

Algorithm: Split Sampling

• Draw samples by iterating and

• Estimate the marginal distribution, , via

 ˆπSS(m)=1NN∑i=1ω(m)Lm(x(i))∫M0ω(m)Lm(x(i))dm.
• Estimate the individual normalisation constants, , via

 ˆZ(m)=∑i:L(x(i))>mΩ(L(x(i)))−1∑Ni=1Ω(L(x(i)))−1. (2)
• Compute a new estimate, , via

 ˆZ=N∑i=1Ω(L(x(i)))−1∑Ni=1Ω(L(x(i)))−1L(x(i)). (3)

A practical use of the algorithm will involve a discrete grid . We write the rare event probabilities and the weights . The marginal probabilities are estimated by Rao-Blackwellization as

 ˆπt=1NN∑i=1π(mt|x(i))=1NN∑i=1ωtI{L(x(i))>mt}∑Ts=0ωsI{L(x(i))>ms},0≤t≤T.

With , the estimator is for . In the next sections we provide a default choice of weights to match the sampling behaviour of the product estimator and nested sampling together with an adaptive MCMC scheme for estimating the weights. First, we turn to convergence issues.

### 2.1 Convergence and Monte Carlo standard errors

Roberts and Rosenthal (1997), Mira and Tierney (2002) and Hobert et al (2002) who provide general conditions for geometric ergodicity of slice sampling. Geometric ergodicity will imply a Monte Carlo CLT for calculating asymptotic distributions and standard errors. Our chain is geometrically ergodic if is bounded, and there exists an such that is nonincreasing on for some where

 G(m)=mα−1+1∂Z(m)/∂m.

Then we can apply a central limit theorem to ergodic averages of the functional

 g(x,m)=Ω(L(x))−1I(L(x)>m)

which yields the condition

 ∫Θg(x,m)2+ϵπSS(x)dx=∫L(x)≥mΩ(L(x))−(1+ϵ)π(dx)<∞

We have a central limit theorem for at any , where

 √NZW{ˆZ(m)−Z(m)}\lx@stackrelD⇒N(0,σ2Z(m))

This argument also works at as long as .

### 2.2 Importance Sampling

The standard Monte Carlo estimate of is where , with draws possibly obtained via MCMC. This is too inaccurate for rare event probabilities. Von Neumann’s original view of importance sampling was as a variance reduction to improve this estimator. By viewing the calculation of an expectation as a problem of normalising a posterior distribution, we can write

 πL(x)=L(x)π(x)/ZwhereZ=∫XL(x)π(x)dx.

Importance sampling uses a blanket to compute

 Z=∫XL(x)π(x)g(x)g(x)dx≈ˆZIS=1NN∑i=1L(x(i))π(x(i))g(x(i))wherex(i)∼g(x).

Picking to be the posterior distribution leads to the estimator with zero variance. While impractical, this suggests finding a class of importance blankets that are adaptive and depend on can exhibit good Monte Carlo properties.

Split sampling specifies a class of importance sampling blankets, indexed by , by

The estimator in (2) can be viewed as an importance sampling estimator where we average over the splitting set with . Similarly we can express as an importance sampling estimator as in (3) which uses a proposal distribution proportional to .

## 3 Comparison with the Product Estimator and Nested Sampling

### 3.1 Product Estimator

A standard approach to calculating the rare event probability is the product estimator. We set for some and introduce a discrete grid of -values starting at . The conditional probability estimator writes

 Z(m)=ZT=T∏t=1P(L(x)>mt|L(x)>mt−1)=T∏t=1P(L(x)>mt)P(L(x)>mt−1)

or equivalently .

Variance reduction is achieved by splitting into pieces of larger magnitude which are relatively easier to estimate. With , we estimate

Given independent samples from the tilted distributions for each , we have

 ˆZT=T∏t=11NN∑i=1Lmt(x(i)t)andVar(ˆZT)Z2T=T∏t=1(σ2tμ2t+1)−1

with mean and variance . The product estimator, as well as the cross-entropy estimator, relies on a set of independent samples drawn in a sequential fashion. Split sampling, on the other hand, uses a fast MCMC and ergodic averaging to provide an estimate . The Monte Carlo variation can be determined from the output of the chain. Controlling the Monte Carlo error of this estimator is straightforward due to independent samples with relative mean squared error, see Fishman (1994) and Garvels et al (2002),

#### 3.1.1 Matching the Product Estimator Sampling Distribution

We can now compare the product estimator with split sampling. Suppose and let be the grid points of for where . By construction, are the tail -quantile of , and we have .

There are two versions of the product estimator. First, the standard product estimator has the long-run sampling distribution

 πPE(x)=1TT∑t=1πt−1(x)=1TT∑t=1π(x)I(L(x)>mt−1)Zt−1.

The second product estimator includes samples from the previous level generation that were above the threshold. This has the long-run sampling distribution

 πPEI(x) ∝π(x)+T∑t=2Zt−2−Zt−1Zt−2πt−1(x) =π(x)+T∑t=2Zt−2−Zt−1Zt−2Zt−1π(x)I(L(x)>mt−1).

We call this the product estimator with inclusion.

###### Theorem 1 (Product Estimator)

The two product estimators and the split sampling are related as follows. The standard product estimator corresponds to the (discrete) split sampling with . The product estimator with inclusion is equivalent to split sampling with cumulative weights

 Ωt=t∑i=0ωi=1/Zt.

Split sampling with discrete knots , weights and has sampling distribution

 πSS(x) ∝T∑t=1ωt−1π(x)I(L(x)>mt−1) ∝T−1∑t=1Ωt−1π(x)I(mt−1

Therefore, this is equivalent to the product estimator if and only if, for ,

 ωt=1Zt, and Ωt=t∑i=11Zi.

As are the tail -quantile of , we have and

 Ωt=ρ(ρ−t−1)1−ρ.

For the product estimator with inclusion, the equivalent split sampling weights are

 Ωt=1+t∑i=1Zi−1−ZiZi−1Zi=1Zt=ρ−t.

### 3.2 Nested Sampling

We now provide a comparison with nested sampling. We provide a specification of so that the sampling distribution of matches that of nested sampling. We can adaptively determine the values from our MCMC output. Let be the -quantile of the likelihood function under the prior . Then, nested sampling expresses

 Z=∫10QL(q)dq=∫∞0QL(1−e−Y)e−YdY.

where . This integral can be approximated by quadrature

 Z≈∞∑i=1(e−i−1N−e−iN)QL(1−e−iN).

The larger is, the more accurate the approximation, and so this suggests the following estimator

 ˆZ=n−N∑i=1(e−i−1N−e−iN)Li+e−n−NNNN∑i=1Ln−N+i,

where are the simulated -quantiles of likelihoods with respect to the prior. Here, is the total number of samples. Nested Sampling is a sequential simulation procedure for finding the -quantiles, , by sampling .

Brewer et al (2011) propose a diffuse nested sampling approach to determine the levels . Both nested and diffuse nested sampling are product estimator approaches. The quantiles are chosen so that each level occupies times as much prior mass as the previous level . Diffuse nested sampling achieves this by sequentially sampling from a mixture importance sampler where the weights are exponential for some . MCMC methods are used to traverse this mixture distribution with a random walk step for the index that steps up or down a level with equal probability. A new level is added using the -quantile of the likelihood draws. Using diffuse nested sampling allows some chance of the samples’ escaping to lowered constrained levels and to explore the space more freely. One caveat is that a large contribution can come from values of near the origin and we have to find many levels to obtain an accurate approximation.

Murray et al (2006) provide single and multiple sample versions of nested sampling algorithms. If is known, we sample as follows:

1. Set .

2. Generate samples from and sort .

3. Repeat while :

1. Set and .

2. Generate and set .

3. Sort ’s and set .

4. Set and stop.

If is not known, replace step 3 with:

1. Repeat while :

#### 3.2.1 Matching the Nested Sampling Distribution

We now choose to match the sampling properties of nested sampling. The main difference between split and nested sampling is that in split sampling we specify a weight function for and sample from the full mixture distribution, rather than employing a sequential approach for grid selection which requires a termination rule. Another difference is that split sampling estimator does not need to know the ordered ’s.

We can now match the sampling distributions of split and nested sampling (see, Skilling, 2006). The expected number of samples less than is . Assume for a while. Since are independent standard uniforms and

 −logZ(Lk)=−k∑i=1logZ(Li)Z(Li−1)for% k≥1,

the distribution of the number of samples less than is same as the number of arrivals before of a Poisson process with rate . For general , it only changes the arrival rate of the Poisson process into .

###### Theorem 2 (Nested Sampling)

If we pick the weights such that

 Ω(m)=1/Z(m)

Then the sampling distributions of split and nested sampling match.

The sampling distribution of the nested sampling for finite is hard to calculate, but we can observe limiting results. As , if , then we have

 ZNS(m)=PNS(L(x)>m)=1+limn,N→∞NnlogZ(m)=1+λlogZ(m).

Split sampling has marginal density of given by

 πSS(x)=Ω(L(x))π(x)ZWwithΩ(m)=∫m0ω(s)ds.

The tail distribution function is then

 ZSS(m) =∫∞0∫XI{L(x)>m}ω(s)Z(s)ZWI{L(x)>s}π(x)Z(s)dxds =∫∞0ω(s)Z(m∨s)dsZW=Z(m)Ω(m)+∫∞mω(s)Z(s)dsZW.

We now find the importance splitting density that matches the nested sampling distributional properties in the sense that . Since

 Z′NS(m)=λZ′(m)Z(m)andZ′SS(m)=Z′(m)Ω(m)ZW,

where . We therefore set .

Since is unknown, we are not ready to begin sampling. As a remedy we propose approximating . We estimate ’s for certain grid points and interpolate by, for example, a piecewise exponentially increasing function. As we assume no information on a priori, we have to start with a single grid point and build more grid points as sampling goes. When we have collected enough samples higher than the current top grid point, we add a new grid point and adjust the approximated function. For that purpose, we introduce a condition that lets us monitor the number of visits, , to the current top level of the likelihood before we construct a new level. Specifically, we run

1. Set , , , and .

2. While , set

1. Draws , and set .

2. Obtain if or otherwise, where .

3. Repeat (a) and (b) until we have visits to level .

4. Choose the -quantile of likelihoods of level as .

5. Set and .

Under the condition , the chain will visit each level roughly uniformly. However, it may take a long time to reach the top level, and the uncertainty in may act like a hurdle for visiting upper levels. With these concerns, it is desirable to favor upper levels by replacing step (d) with

1. Set .

We call the boosting factor as increases the preference for the upper levels. This reduces the search time and ensures the time complexity to be . To further expedite this procedure, we may put more weight on the top level by substituting (d) with the step:

1. Set and .

For example, if , the chain spends half of the time on the top level and the other half backtracking the other levels.

Once we identify all levels, our split sampling algorithm runs:

1. Set and for each .

2. While , set .

1. Draw with and , and set .

2. For each with , update .

3. Update and set .

From a practical perspective, it is critical to have nonzero initial values on . If we start with , the early and are unstable, and the whole procedure can become abortive. Since, though not very accurate, is a reasonable initial estimate, we use them for the initial values of and . reflects the degree of dependence on those initial values.

Another point to make is even if the initial are not as accurate as needed to guarantee good mixing of our MCMC iterations, we dynamically refine as in step 2(c). The beauty of this algorithm is that this update makes the chain self-balanced. When is larger than it should be, or is smaller, the chain visits level more often. Thus increasing and decreases , which helps converge more quickly to the true value.

At first sight, the time complexity appears to be since steps 2(b) and 2(c) cost operations. However, if the values are chosen so that are exponentially decreasing, the work can be done in time. The updates needed at steps 2(b) and 2(c) are only for the last several ’s since the increment becomes negligible very quickly relative to as decreases.

## 4 Choice of cumulative weights: ω(m) and Ω(m)

The previous subsection assumed that is fixed. The “correction factor” in the construction of needs to be estimated as accurately as possible. To do this we will use an adaptive choice of the weight function, and use convergence results from the adaptive MCMC literature.

A common initialisation is to set which leads to draws from the posterior. Then, . This leads to an estimate of the marginal, , given by the measure

 μN(m)={1NN∑i=1I{L(x(i))>m}Ω(L(x(i)))}ω(m).

The density estimate will be zero for by construction. We also have a set of estimates where which can be used to re-balance to weights .

Given an initial run of the algorithm, we can re-proportion the prior weight function to regions we have not visited frequently enough. To accomplish this, let be a desired target distribution for , for example a uniform measure. Then re-balance the weights inversely proportional to the visitation probabilities and set the new weights by

This will only adjust our weights in the region where . As the algorithm proceeds we will sample regions of higher likelihood values and further adaptive our weight function.

Other choices for weights are available. For example, in many normalisation problems will be exponential in due to a Laplace approximation argument. This suggests taking an exponential weighting for some . In this case, we have

 Ω(m)=∫m∧M0ω(s)ds=eκ(m∧M)−1.

The marginal distribution is

 πSS(x)=(eκ(L(x)∧M)−1)π(x)∫∞0κeκsZ(s)ds.

We can also specify to deal with the possibility that the chain might not have visited all states by setting a threshold which corresponds to the maximum allowable increase in the log-prior weights. This leads to a re-balancing rule

 ω⋆(m)ω(m)=min{maxmμN(m)μN(m),eωmax},

where we have also re-normalised the value of the largest state to one.

When is available, we set and for . To initialise , we use the harmonic mean for and an exponential interpolation for . Drawing , we have

 Z−1M=EπM(L−1M(x))toestimateˆω(M)=1NN∑i=1L−1M(x(i)).

The harmonic mean estimator (Raftery et al, 2007) is known to have poor Monte Carlo error variance properties (Polson, 2006, Wolpert and Schmidler, 2012) although we are estimating and not its inverse.

We can extend this insight to a fully adaptive update rule for , similar to stochastic approximation schemes. Define a sequence of decreasing positive step sizes with . A practical recommendation is where , see e.g. Sato and Ishii (2000). Another approach is to wait until a “flat histogram” (FH) condition holds:

 maxm∈{mt}\vlineμN(m)−ϕ(m)\vline

for a pre-specified tolerance threshold, . The measure tracks our current estimate of the marginal auxiliary variable distribution. The Rao-Blackwellised estimate further reduces variance.

The empirical measure can be used to update as the chain progresses. Let denote the points at which will be decreased according to its schedule. Then an update rule which guarantees convergence is to set

 logωκN(m)←logωκN−1(m)+γκN(μκN(m)−ϕ(m)).

Jacob and Ryder (2012) show that if is only updated on a sequence of values which correspond to times that a “flat-histogram” criterion is satisfied, then convergence ensues and the FH criteria is achieved in finite time. After updating , we re-set the counting measure and continue. Other adaptive MCMC convergence methods are available in Atchade and Liu (2010), Liang et al (2007), and Zhou and Wong (2008). Bornn et al (2012) provides a parallelisable algorithm for further efficiency gains. Peskun (1973) provides theoretical results on optimal MCMC chains to minimise the variance of MCMC functionals.

One desirable Monte Carlo property for an estimator is a bounded coefficient of variation. For simple functions, and , mixture importance functions achieve such a goal, see Iyengar (1991) and Adler et al (2008). Madras and Piccioni (1999, section 4) hint at the efficiency properties of dynamically selected mixture importance blankets. Gramacy et al (2010) propose the use of importance tempering. Johansen et al (2006) use logit annealing implemented via a sequential particle filtering algorithm.

### 4.1 Choosing a Discrete Cooling Schedule

We suggest a simple, sequential, empirical approach to selecting a “cooling schedule” in our approach. Specifically, set , then given we sample . We order the realisations of the criteria function and set equal to the -quantile of the samples. This provides a sequential approach to solving

A number of authors have proposed “optimal” choices of , which implicitly defines a cooling schedule, , for . L’Ecuyer et al (2006) and Amrein and Kunsch (2011) propose and , respectively. Huber and Schott (2010) define a well-balanced schedule as one that satisfies . They show that such a choice leads to fast algorithms. The difficulty is in finding the right order of magnitude of and the associated schedule that ensures that each slice is not exponentially small. For rare events, we sample until and then set