Monte Carlo convergence of rival samplers

# Monte Carlo convergence of rival samplers

It is often necessary to make sampling-based statistical inference about many probability distributions in parallel. Given a finite computational resource, this article addresses how to optimally divide sampling effort between the samplers of the different distributions. Formally approaching this decision problem requires both the specification of an error criterion to assess how well each group of samples represent their underlying distribution, and a loss function to combine the errors into an overall performance score. For the first part, a new Monte Carlo divergence error criterion based on Jensen-Shannon divergence is proposed. Using results from information theory, approximations are derived for estimating this criterion for each target based on a single run, enabling adaptive sample size choices to be made during sampling.

\includeversion

unblinded\excludeversionblinded {unblinded}

KEY WORDS:Sample sizes; Jensen-Shannon divergence; transdimensional Markov chains

## 1 Introduction

Let be a sequence of random samples obtained from an unknown probability distribution . The corresponding random measure from samples is the Monte Carlo estimator of ,

 ^Πn(B)=1nn∑i=1δXi(B),  B⊆X, (1)

where is the support of . The random measure (1) is a maximum likelihood estimator of and is consistent: for all -measurable sets ,

Sometimes estimating the entire distribution is of intrinsic inferential interest. In other cases, this may be desirable if there are no limits on the functionals of which might be of future interest. Alternatively, the random sampling might be an intermediary update of a sequential Monte Carlo sampler no for which it is desirable that the samples represent the current target distribution well at each step (Del Moral et al., 2006).

Pointwise Monte Carlo errors are inadequate for capturing the overall rate of convergence of the realised empirical measure to . This consideration is particularly relevant if is an infinite mixture of distributions of unbounded dimension: In this case it becomes necessary to specify a degenerate, fixed dimension function of interest before Monte Carlo error can be assessed (Sisson and Fan, 2007). This necessity is potentially undesirable, since the assessment of convergence will vary depending on which function is selected and that choice might be somewhat arbitrary.

The work presented here considers sampling multiple target distributions in parallel. This scenario is frequently encountered in real-time data processing, where streams of data pertaining to different statistical processes are collected and analysed in fixed time-window updates. Decisions on how much sampling effort to allocate to each target will be made sequentially, based on the apparent relative complexity of the targets, as higher-dimensional, more complex targets intuitively need more samples to be well represented. The complexities of the targets will not be known a priori, but can be estimated from the samples which have been obtained so far. As a consequence, the size of the sample drawn from any particular target distribution will be a realisation of a random variable, , determined during sampling by a random stopping rule governed by the history of samples drawn from that target and those obtained from the other targets.

To extend the applicability of Monte Carlo error to entire probability measures, the following question is considered: If a new sample of random size were drawn from , how different to might the new empirical measure be? If repeatedly drawing samples in this way led to relatively similar empirical measures, this suggests that the target is relatively well represented by samples; whereas if the resulting empirical measures were very different, then there would be a stronger desire to obtain a (stochastically) larger number of samples. To formally address this question, a new Monte Carlo divergence error is proposed to measure the expected distance between an empirical measure and its target.

Correctly balancing sample sizes is a non-trivial problem. Apparently sensible, but ad hoc, allocation strategies can lead to extremely poor performance, much worse than simply assigning the same number of samples to each target. Here, a sample-based estimate of the proposed Monte Carlo divergence error of an empirical measure is derived; these errors are combined across samplers through a loss function, leading to a fully-principled, sequential sample allocation strategy.

Section 2 formally defines and justifies Monte Carlo divergence as an error criterion. Section 3 examines two different loss functions for combining sampler errors into a single performance score. Section 4 introduces some alternative sample size selection strategies; some are derived by adapting related ideas in the existing literature, and some are ad hoc. The collection of strategies are compared on univariate and variable dimension target distributions in Section 5 before a brief discussion in Section 6.

## 2 Monitoring Monte Carlo convergence

In this section the rate of convergence of the empirical distribution to the target will be assessed by information theoretic criteria. In information theory, it is common practice to discretise distributions of any continuous random variables (see Paninski, 2003). Without this discretisation (or some alternative smoothing) the intersection of any two separately generated sets of samples would be empty, and distribution-free comparisons of their empirical measures would be rendered meaningless: For example, the Kullback-Leibler divergence between two independent realisations of (1) will be always be infinite.

When a target distribution relates to that of a continuous random variable, a common discretisation of both the empirical measure and notionally the target will be performed. For the rest of the article, both and should be regarded as suitably discretised approximations to the true distributions when the underlying variables are continuous. When there are multiple distributions, the same discretisation will be used for all distributions. For univariate problems a large but finite grid with fixed spacing will be used to partition into bins; for mixture problems with unbounded dimension, the same strategy will be used for each component of each dimension, implying an infinite number of bins. Later in Section 3.4, consideration will be given to how the number of bins for each dimension should be chosen.

### 2.1 Monte Carlo divergence error

For a discrete target probability distribution , let be the estimator (1) for a prospective sample of size to be drawn from , and let be the same estimator when the sample size is a random stopping time.

For , the Monte Carlo divergence error of the estimator will be defined as

 eKL,n=H(π)−E{H(^Πn)}, (2)

where is Shannon’s entropy function; recall that if is a probability mass function, Note that is the maximum likelihood estimator of . The Monte Carlo divergence error has a direct interpretation: it is the expected Kullback-Leibler divergence of the empirical distribution of a sample of size from the target , and therefore provides a natural measure of the adequacy of for estimating .

The Monte Carlo divergence error of the estimator when is a random stopping time is defined as the expectation of with respect to the stopping rule, or equivalently

 eKL=H(π)−E{H(^Π)}, (3)

where the expectation in (3) is now with respect to both and the stopping rule. This more general definition of Monte Carlo divergence error should be interpreted as the expected Kullback-Leibler divergence of the empirical distribution of a sample of random size from the target .

To provide a sampling based justification for this definition of Monte Carlo divergence error, for consider the empirical distribution estimates which would be obtained from independent repetitions of sampling from , where the sample size of each run is a random stopping time from the same rule.

The Jensen-Shannon divergence (Lin and Wong, 1990; Lin, 1991) of ,

 JSD(^π1,…,^πM)=H(1MM∑j=1^πj)−1MM∑j=1H(^πj), (4)

measures the variability in these distribution estimates by calculating their average Kullback-Leibler divergence from the closest dominating measure, which is their average. The Jensen-Shannon divergence is a popular quantification of the difference between distributions, and its square root has the properties of a metric on distributions (Endres and Schindelin, 2003).

Just as Monte Carlo variance is the limit of the sample variance of sample means as , the Monte Carlo divergence error defined in (3) is easily seen to be the limit of (4) as : By the strong law of large numbers, and the expected entropy of a Monte Carlo distribution estimate from one of the runs. It follows that (4) is a biased but consistent estimate of (3).

Finally, it should be noted that there is a second interpretation of the Monte Carlo divergence error: is also the negative bias of the maximum likelihood estimator of the entropy of given a random sample. In the next section it will be shown that this alternative interpretation is very useful, since it leads to a mechanism for estimating from a single sample.

### 2.2 Estimating Monte Carlo divergence error

While is the maximum likelihood estimator of , it is known to be a negatively biased since is a concave function (Miller, 1955). Various approximate bias corrections for have been proposed in the information theory literature, and these correction terms can serve here as approximately unbiased estimates of . Furthermore, note that any unbiased estimate of is also an unbiased estimate of , the error under the random stopping rule.

Given a sample of size , the popular Miller-Madow method estimates the negative bias of the maximum likelihood estimate to be , where is the number of nonempty bins in . This estimate proves to be too crude for the purpose here of estimating , since any two distributions with the same number of represented bins would be estimated to have the same divergence, regardless of how uniform the corresponding bin probabilities might be.

An improvement on the Miller-Madow estimate was provided by Grassberger (1988, 2003),

 ^eKL,n=1nK∑i=1ϕ(ni), (5)

where is the number of samples in the th nonempty bin, such that , and where is the digamma function. In this work, (5) will provide an approximately unbiased estimate of , the expected Kullback-Leibler divergence of the empirical distribution from the target, based on a single run of the sampler.

#### Efficient calculation

Calculation of (5) during sampling can be updated at each iteration very quickly, using the following equations. Let be the bin in which the th observation falls. Then

 ^eKL,n=(n−1)^eKL,n−1+Δ1ϕ(ni′−1)n, (6)

where the forward difference operator .

Besides estimating the current Monte Carlo divergence error of a distribution estimate after samples from , it will also be of interest to estimate the expected reduction in error that would be achieved from obtaining one more sample,

 δKL,n=eKL,n+1−eKL,n. (7)

To estimate this quantity it is now necessary to assume that samples are drawn approximately independently (perhaps via thinned MCMC), and that the probability of the new sample falling into the th bin is approximated by the empirical, maximum likelihood estimate . Then the expected reduction in error from a further sample can be estimated as

 ^δKL,n = 1nK∑i=11∑j=0nji(n−ni)1−j{ϕ(ni)n−ϕ(ni+j)n+1} (8) = 1n(n+1)K∑i=1(ni+1)ϕ(ni)−niϕ(ni+1).

Again considering this calculation iteratively, if the th observation falls into bin then

 ^δKL,n=(n−1)n^δKL,n−1−ni′Δ2ϕ(ni′−1)n(n+1), (9)

where the second forward difference .

#### Alternative formulations of bias estimation

It should be noted that further refinements (additive terms) to the bias estimate of (5) are provided by Grassberger (2003), such as

 ϕ(ni)=ni{log(ni)−ψ(ni)}−−1nini+1. (10)

However, these additional terms, which arise as part of a second order approximation of an integral, are unstable, oscillating between positive and negative values. In this context, without careful treatment such terms can incorrectly suggest that the expected error might very slightly increase by taking further samples, which in practice is not true but would make obtaining further samples seem undesirable. Furthermore, due to their oscillating sign, these terms do not affect the overall drift of the function, which will be the quantity of longer term interest when deciding whether more sampling effort should be afforded.

## 3 Rival samplers

Consider probability distributions . Suppose random samples are to be drawn from a sampler for each , and that the empirical distributions of the samples will eventually serve as approximations of the corresponding target distributions.

Given a fixed computational resource, which might simply correspond to a final total number of random samples , the decision problem to be addressed is how best to divide those samples between the samplers. That is, how sample sizes for the targets should be chosen subject to the constraint . The samplers can be viewed as rivals to one another for the same fixed computational resource.

Without this or a similar constraint, the problem would be ill-posed: for all , should be chosen to be as large as possible, since Monte Carlo errors are monotonically decreasing with sample size. A constraint is required for any sample size choice to be practically meaningful. In contrast, results which establish a minimum sample size for which errors should fall within a (typically arbitrary) desired level of precision are theoretically interesting, but are perhaps best viewed in the reverse direction in this context; given the inevitable usage of the maximum allowable computation time, understanding the level of error this limit implies.

The default choice is for equal samples sizes, , but such an approach disregards any differences in the complexities of the target distributions, which in general could be arbitrarily different. The aim of this work is to adaptively determine how much sampling effort should be afforded to each sampler. The preceding section established a general method for assessing the error of each sampler. The choice of how to balance sample sizes between the samplers will be made according to a loss function for combining those errors.

### 3.1 Loss functions for Monte Carlo errors

Suppose that the decision to assign of the total samples to the sampler of implies a Monte Carlo error level for that target. The specific definition of this Monte Carlo error can be left open; for example, this might be the usual Monte Carlo error of a point estimate; or if interest lies in summarising the whole distribution, the Monte Carlo divergence error criterion (3).

Two natural alternative loss functions for combining the individual errors into an overall performance error are considered here. One possibility is that utility could be derived from controlling the maximum error of the samplers, suggesting an (expected) loss function

 Lmax(n(1),…,n(m))=maxj∈{1,…,m}e(j)n(j). (11)

This form of loss function could be applicable in financial trading, for example, where exposure to the worst loss could be unlimited. Alternatively it might be important to control the average error across the samplers, suggesting a different loss function of the form

 Lave(n(1),…,n(m))=1mm∑j=1e(j)n(j). (12)

This form could be applicable in portfolio trading, where exposure to loss is spread across the composite stocks. To illustrate the difference between these two loss functions, consider estimating the means of two distributions with known variances , , with error measured by the Monte Carlo variance of the sample means; the optimal ratio of sample sizes, , would be given by the ratio of the variances under , and by the ratio of the standard deviations under . Therefore care should be exercised in specifying the required form of loss function. Other choices, or indeed linear combinations of these two losses, could be examined.

### 3.2 Sequential allocation of samples

Away from the stylised example of the previous section, in general it is more likely that little will be known a priori about the target distributions being sampled. Instead, the aim will be to dynamically decide, during sampling, which samplers should be afforded more sampling effort, conditional on the information learned so far about the targets. A sequential decision approach is taken. Having taken samples, with of these allocated to the th sampler, the decision problem is to choose from which sampler to draw the th sample, such that the chosen loss function of the estimated Monte Carlo errors of the samplers is minimised.

In this sequential setting, the operational difference between the loss functions and becomes clearer. If the aim is to minimise , then the optimal decision for allocating one more sample is to allocate it to the sampler with the highest estimated error,

 argmaxj ^e(j)n(j), (13)

since error is a decreasing function of sample size. Alternatively, if minimising then the new sample should be allocated to the sampler for which the estimated decrease in error is highest,

 argmaxj ^δ(j)n(j), (14)

since this will minimise the overall expected sum.

These sequential decision rules are myopic, looking only one step ahead. There are three reasons why this is preferred; first, considering optimal sequences of future allocations leads to a combinatorial explosion unsuitable for a method intended for optimising the use of a fixed computational resource; second, the final number of samples may even be unknown; third, the estimated error or expected change in error under the Monte Carlo divergence criterion (5), (8) can be updated very efficiently via (6) and (9): after one more sample, only one bin count for one sampler will have changed.

Any sequential sampling allocation scheme which depends on the outcomes of the random draws will imply a random stopping rule for the number of samples eventually allocated to each sampler. This adds an extra complication, since some stopping rules will introduce bias into Monte Carlo estimates (Mendo and Hernando, 2006). Here this bias arises if the first samples taken from a target distribution have a particularly low estimated Monte Carlo error, as this will cause the other rival samplers to share all of the remaining samples; without corrective action, this phenomenon causes Monte Carlo estimators to be biased towards estimates of this character.

When the Monte Carlo error is the divergence measure (3), low error estimates correspond to low entropy empirical measures, which can spuriously arise if the first random samples happen to fall into the same bin. Therefore, to eradicate this bias, a minimum number of samples is recommended for each target distribution, to prevent degenerate sample sizes. To examine stability, these minima can be chosen in increasing steps until the resulting samples sizes converge. For the examples in Section 5, due to the relatively fine grid used for binning samples it was enough to set to obtain convergence.

### 3.3 Algorithm: Rival sampling

The full algorithm for sequential sampling from rival target distributions to minimise estimated loss is now presented. For samplers of target distributions , let be the minimum number of samples that should be drawn from . Let be the chosen loss function for combining Monte Carlo errors across the samplers.

The algorithm proceeds as follows:

1. Initialise— For :

1. Draw samples from and calculate , the binned empirical estimate of assuming bins in each dimension; let be the number of non-empty bins in , and be the corresponding bin counts; set .

2. If : calculate the divergence estimate for the th sampler, , using (5);

else if : calculate the estimated increment in divergence for the th sampler, , using (8).

2. Iterate— Until the available computational resource is exhausted:

1. If : set ;

else if : set .

2. Sample one new observation from . Set . Let be the bin into which the new observation falls. Set . If bin was previously empty, set .

3. Update or using (6) or (9) respectively.

### 3.4 Choosing a bin width for discretisation

The algorithm of Section 3.3 requires a method of discretising samples from continuous distributions. (For simplicity, a fixed bin width can be assumed for each dimension of a multivariate distribution.) The following observations offer some insight for what makes a good bin width in this context. In the limit of the bin width going to zero, the binned empirical distribution after independent draws would have non-empty bins each containing one observation. Although the identity of those bins would vary across samplers, these empirical distributions would be indistinguishable in terms of both entropy and (3); so each sampler would be allocated the same sample size. In the opposite limit of the bin width becoming arbitrarily large, all samples of the same dimension would fall into the same bin. For fixed dimension problems, this would mean all sample sizes would be equal, and otherwise in transdimensional problems the strategies would simplify to working with marginal distributions of the dimension, which reduces the potential diversity of sample sizes. So a good bin width would lie well within these two extremes, ideally maximising the resulting differences in sample sizes. That is, a good bin width should distinguish well the varying complexity of the different targets. Further to these observations, the next section suggests a novel maximum likelihood approach for determining an optimal number of bins, which could be deployed adaptively or using the initial samples drawn from each target.

#### Maximum likelihood bin width estimation for Bayesian histograms

Consider a regular histogram of equal width bins on the interval , and let be the bin probabilities. The Bayesian formulation of this histogram (Leonard, 1973) treats the probabilities as unknown, and a conjugate Dirichlet prior distribution based on a Lebesgue base measure with confidence level suggests . For samples, the marginal likelihood of observing bin counts under this model is

 Γ{α(b−a)}/[Γ{α(b−a)+n}Γ{α(b−a)/K}K{(b−a)/K}n]K∏i=1Γ{α(b−a)/K+ni}. (15)

Using standard optimisation techniques, identifying the pair that jointly maximise (15) suggests that serves as a good number of bins for a regular histogram of the observed data.

## 4 Alternative strategies

To calibrate the performance of the proposed method, some variations of the strategy for selecting sample sizes are considered. This section considers some alternative measures of the Monte Carlo error of a sampler, to be used in place of the divergence estimates or in the algorithm of Section 3.3.

### 4.1 Chi-squared goodness of fit statistic

In the context of particle filters, Fox (2003) proposed a method for choosing the number of samples required from a single sampler to guarantee that, under a chi square approximation, with a desired probability the Kullback-Leibler divergence between the binned empirical and true distributions does not exceed a certain threshold . This was achieved by noting an identity between times this divergence and the likelihood ratio statistic for testing the true distribution against the empirical distribution, assuming the true distribution had the same number of bins, , as the observed empirical distribution. Since the likelihood ratio statistic should approximately follow a chi-squared distribution with degrees of freedom, this suggested a sample size of

 n=χ2K−1,1−δ/(2ε), (16)

where is the quantile of that distribution.

Adapting this idea to the algorithm of Section 3.3 simply requires a rearrangement of (16) to give the approximate error as a function of sample size,

 ε=χ2K−1,1−δ/(2n). (17)

This error estimate can be substituted directly into the algorithm in place of the Monte Carlo divergence error estimate to provide an alternative scheme for choosing sample sizes when using loss function . The same (arbitrary) value of must be used for each rival sampler, and here this was specified as although the results are robust to different choices.

By the central limit theorem, the chi-squared distribution quantiles grow near-linearly with the degrees of freedom parameter for (Fisher, 1959), so it should be noted that (17), which depends only on the number of bins, has much similarity, and almost equivalence, with the Miller-Madow estimate of entropy error cited in Section 2.2. By the reasoning given in Section 2.2, use of this error function should show some similarity in performance with the proposed method, but be less robust to distinguishing differences in distributions beyond the number of non-empty bins.

Recall from Section 3.2 that the sequential allocation strategy for minimising the loss function requires an estimate of the expected reduction in error which would be achieved from obtaining another observation from a sampler. Since this error criterion depends entirely upon the number of non-empty bins , in this case an estimate is required for the probability of the new observation falling into a new bin. A simple empirical estimate of the probability of falling into a new bin is provided by the proportion of samples after the first one that have fallen into new bins, given by . Note that this estimate will naturally carry positive bias, since discovery of new bins should decrease over time, and so a sliding window of this quantity might be more appropriate in some contexts.

### 4.2 Reference points

As a convergence diagnostic for transdimensional samplers, Sisson and Fan (2007) proposed running replicate sampling chains for the same target distribution, and comparing the variability across the chains of the empirical distributions of a distance-based function of interest. The method requires that the target be a probability distribution for a point process, and maps multidimensional sampled tuples of events from to a fixed-dimension space. Specifically, a set of reference points are chosen, and for any sampled tuple of events the distance from each reference point to the closest event in the tuple is calculated. Thus is summarised by a -dimensional distribution, where is the number of reference points in .

One example considered in Sisson and Fan (2007) is a Bayesian continuous-time changepoint analysis of a changing regression model with an unknown number of changepoint locations. A variation of this example is analysed in Section 5.1.2 in this article, where instead the analysis will be for the canonical problem of detecting changes in the piecewise constant intensity function of an inhomogeneous Poisson process (see Raftery and Akman, 1986, and the subsequent literature). For Poisson process data, there are two natural functions of interest which could be evaluated at each reference point. The first is the distance to the nearest changepoint, the second is the intensity level. Both will be considered in Section 5.1.2. Note that in Sisson and Fan (2007), reference points are selected from random components from an initial sample from the single target distribution. Here, since there are multiple target distributions, a grid of one hundred uniformly spaced points across the domain is used.

The convergence diagnostic of Sisson and Fan (2007) did not formally provide a method for calibrating error or selecting sample size. Here, to compare the performance of the proposed sample size algorithm of Section 3.3, the sum across the reference points of the Monte Carlo variances of either of these functions of interest is used as the error criterion in the algorithm.

To demonstrate the value of the sophisticated sample size selection strategies given above, two simple strategies which have similar motivation but are otherwise ad hoc are included in the numerical comparisons of Section 5. These strategies are now briefly explained.

#### Extent

The extent of a distribution is the exponential of its entropy, and was introduced as a measure of spread by Campbell (1966). A simple strategy might be to choose sample size proportional to the estimated squared extent of , . Note that the Gaussian distribution , has an extent which is directly proportional to the standard deviation , and so in the univariate Gaussian example which will be considered in Section 5.1.1, this sample allocation strategy will be approximately equivalent to the optimal strategy when minimising the maximum Monte Carlo error of the sample means (cf. Section 3.1).

#### Jensen-Shannon divergence

Robert and Casella (2005) present a class of convergence tests for monitoring the stationarity of the output of a sampler from a single run which operate by splitting the current sample in two and quantifying the difference between the empirical distributions of the first half of the sample , and the second half of the sample . For univariate samplers the Kolmogorov-Smirnov test, for example, is used to obtain a p-value as a measure of evidence that the second half of the sample is different from the first, and hence neither half is adequately representative of the target. The test statistics which are used condition on the sample size, and so the sole purpose of these procedures is to investigate how well the sampler is mixing and exploring the target distribution.

To adapt these ideas to the current context, any mixing issues can first be discounted by splitting the sample for each target in half by allocating the samples into two groups alternately, so that the distribution of, say, can be compared with the distribution of . This method of splitting up the sample is also computationally much simpler in a streaming context, as incrementing the sample size does not change the required groupings of the existing samples. Let and be the respective empirical distributions of these two subsamples.

A crude variation on using the Monte Carlo divergence error criteria of (5) is to estimate the error of the sampler by the Jensen-Shannon divergence of and ,

 ^eJSD,n=JSD(^πoddn,^πevenn)=H(^πn)−H(^πoddn)+H(^πevenn)2. (18)

If sufficiently many samples have been taken for to be a good representation of the target distribution, then both halves of the sample should also provide reasonable approximations of the target and therefore have low divergence between one another.

As in Section 2.2.1, calculation of (18) during sampling can be updated at each iteration very quickly. Let be the bin in which the th observation falls. Then, for example, updating the first term of (18) simply requires

 H(^πn)=H(^πn−1)+log(n)−log(n−1)−ni′log(ni′)+(ni′−1)log(ni′−1). (19)

## 5 Examples

The methodology from this article is demonstrated on three different data problems. The first two examples assume only two or three data processes respectively, to allow a detailed examination of how the allocation strategies differ. Then finally a larger scale example with 400 data processes is considered, derived from the IEEE VAST 2008 Challenge concerning communication network anomaly detection.

### 5.1 Small scale examples

Two straightforward, synthetic examples are now considered. The first is a univariate problem of fixed dimension with two Gaussian target distributions, and the second is a transdimensional problem of unbounded dimension, concerning the changepoints in the piecewise constant intensity functions of three inhomogeneous Poisson processes. In both examples, it is assumed that a priori nothing is known about the target distributions and that computational limitations determine that only a fixed total number of samples can be obtained from them overall, which will correspond to an average of samples per target distribution.

Both loss functions from Section 3.1 are considered, measuring either the maximum error or average error across the target samplers. For each loss function, the following sample size allocation strategies are considered:

1. “Fixed” — The default strategy, samples are obtained from each sampler.

2. Dynamically, aiming to minimise the expected loss, with sampling error estimated using the following methods:

1. “Grassberger” — Monte Carlo divergence error estimation from Section 2.2;

2. “Fox” — the goodness of fit statistic of Section 4.1;

3. “Sisson” (only for the transdimensional example) — the Monte Carlo variances of one of the two candidate fixed dimension functions from Section 4.2 evaluated at 100 equally spaced reference points (denoted “Sisson-i”, for the intensity function, “Sisson-n” for the distance to nearest changepoint function);

4. “Extent” and “JSD” — two ad hoc criteria from Section 4.3.

Each sample size allocation strategy is evaluated over a large number of replications , where or respectively in the two examples.

Good performance of a sample allocation strategy is measured by the chosen loss function when applied to the realised Monte Carlo divergence error for each sampler. Good estimates of the true values of are obtained by calculating the Jensen-Shannon divergence of the Monte Carlo empirical distributions obtained from the runs (cf. Section 2.1).

Note that in all simulations, the same random number generating seeds are used for all strategies, so that all strategies are making decisions based on exactly the same samples.

#### Univariate target distributions

In the first example, a total of 100,000 samples are drawn from two Gaussian distributions, where one Gaussian has twice the standard deviation of the other: , Note that if these two distributions were considered on different scales they would be equivalent; but when losses in estimating the distributions are measured on the same scale, then they are not equivalent. For discretising the distributions, the following bins were used: , , , , , , . This corresponds to an interior range of plus or minus five times the largest of the standard deviations of the two targets, divided into 100 evenly spaced bins, along with two extra bins for the extreme tails. Results are robust to allowing wider ranges or more bins, but are omitted from presentation. For further validation, a simple experiment was conducted using the method from Section 3.4.1 on : 100,000 samples were simulated from each of and , leading to estimates and respectively, suggesting 100 as a good number of bins for fitting these densities.

The varying sample sizes obtained from each target from the 1 million simulations using each of the sample allocation strategies listed above and the loss function are shown in Fig. 1. Tables 1 and 2 show the mean sample sizes and the implied Monte Carlo divergence error for each target distribution using each of the sample allocation strategies listed above; the two tables correspond to the two choices of loss function for combining errors.

Table 1 gives the results under the loss function (11) which calculates the maximal error across samplers. Optimal performance would imply approximately equal Monte Carlo divergence errors for the two targets, and the proposed strategy based on Grassberger’s entropy bias estimate is by far the closest to achieving this objective. Interestingly, note that under this best strategy, the average sample sizes are almost exactly in the ratio 1:2, the same ratio as the true standard deviations of the target distributions. Recall from Section 3.1, that such a ratio is optimal in another sense, minimising the average Monte Carlo errors of the sample means. This contrast highlights the importance of carefully specifying the desired error criterion as well as the correct loss function.

One of the two ad hoc strategies based on calculating the Jensen-Shannon divergence between the two halves of the sample is only slightly outperformed by the goodness of fit method; however, note in Figure 1 the much higher variance of the sample sizes with the JSD method, which is indicative of an unreliable strategy. The other ad hoc method which takes sample sizes proportional to the extent of the empirical distributions is seen to overcompensate for the higher variance of the second Gaussian, and performs worse than even the default equal sample size strategy. For that strategy, note the sample sizes are almost exactly in the ratio 1:4, the same ratio as the true variances of the target distributions. Recall from Section 4.3.1 that such a strategy is approximately equivalent to minimising the maximum Monte Carlo errors of the sample means, which was noted in Section 3.1 to imply such an allocation ratio.

Table 2 gives the results under the loss function (12) which calculates the average error across samplers. The Monte Carlo divergence strategy based on Grassberger’s entropy bias estimate performs best, although the the goodness of fit method also performs very well here. The contrasting sample sizes between the loss functions and for all dynamic allocation strategies are noteworthy, as remarked in Section 3.1.

#### Transdimensional mixture target distributions

For a more complex example, simulated data were generated from three inhomogeneous Poisson processes on with different piecewise constant intensity functions. In each case, prior beliefs for the intensity functions were specified by a homogeneous Poisson process prior distribution on the number and locations of the changepoints and independent, conjugate gamma priors on the intensity levels. The three rival target distributions for inference are the Bayesian marginal posterior distributions on the number and locations of the changepoints for each of the three processes.

Each of the three simulated Poisson processes had two changepoints, located at and . The intensity levels of the three processes were respectively: , , , so the processes differed only through magnitudes of intensity changes. To make the target distributions closer and therefore make the inferential problem harder, in each case the prior expectation for the number of changepoints was set to 1. For illustration of the differences in complexity of the resulting posterior distributions for the changepoints of the three processes, large sample estimates of the true, discretised posterior distributions are shown in Fig. 2, based upon one trillion reversible jump Markov chain Monte Carlo samples (Green, 1995). Note that the different target distributions place different levels of mass on the number of changepoints, and therefore on the dimension of the problem. In all cases there is insufficient information to strongly detect both changepoints, and so much of the mass of the posterior distributions is localised at a single changepoint at , the midpoint of the two true changepoints. Additionally, Fig. 3 shows the posterior variance of two functions of interest identified in Section 4.2 for : the distance to the nearest changepoint, and the intensity level.

To determine sample sizes, reversible jump Markov chain Monte Carlo simulation was used to sample from the marginal posterior distributions of the changepoints, and the chains were thinned with only one in every fifty iterations retained to give approximately independent samples. To discretise the distributions, the interval was divided into 50 equally sized bins; while for a single dimension this would be fewer bins than were used in the previous section, here the bins are applied to each dimension of a mixture model of unbounded dimension, meaning that actually a very large number of bins are visited; computational storage issues can begin to arise when using an even larger number of bins, simply through storing the frequency counts of the samples.

Fig. 4 shows the distributions of sample sizes obtained from a selection of the strategies over repetitions, and Tables 3 and 4 show results from the different strategies examined for these more complex transdimensional samplers. Performance is similar to the previous section, with the Grassberger entropy bias correction method performing best.

For this transdimensional sampling example, it also makes sense to consider the fixed dimension function of interest methods of Sisson and Fan (2007), using the mean intensity function of the Poisson process or the distance to nearest changepoint, each evaluated at 100 equally spaced grid points on . The Monte Carlo variances used in these strategies estimate the variances displayed in the plots of Fig. 3 at the reference points, divided by the current sample size. The performance of these fixed dimensional strategies is particularly poor under the loss function . Importantly, it should also be noted that the sample sizes and performance vary considerably depending upon which of the two arbitrary functions of the reference points are used.

### 5.2 IEEE VAST 2008 Challenge Data

This final example now illustrates how the method performs in the presence of a much larger number of target distributions, in the context of network security. The IEEE VAST 2008 Challenge data are synthetic but realistically generated records of mobile phone calls for a small community of 400 individuals over a ten day period. The data can be obtained from www.cs.umd.edu/hcil/VASTchallenge08. The aim of the original challenge was to find anomalous behaviour within this social network, which might be indicative of malicious coordinated activity.

One approach to this problem is to monitor the call patterns of individual users and detect any changes from their normal behaviour, with the idea that a smaller subset of anomalous individuals will then be investigated for community structure. In particular, this approach has been shown to be effective with these data when monitoring the event process of incoming call times for each individual (Heard et al., 2010). After correcting for diurnal effects on normal behaviour, this approach can be reduced to a changepoint analysis of the intensities of 400 Poisson processes of the same character as Section 5.1.2. For the focus of this article, it is of interest to see how such an approach could be made more feasible in real time by allocating computational resource between these 400 processes more efficiently.

Fig. 5 shows the contrasting performance between an equal computational allocation of one million Markov chain Monte Carlo samples to each process against the variable sample size approach using Grassberger’s entropy bias estimate, for the same total computational resource of 400 millions samples and using the loss function . The left hand plot shows the distribution of sample sizes for the individual processes over repetitions, using 5,000 initial samples and an average allocation of one million samples for each posterior target; the dashed line represents the fixed sample size strategy. The sample sizes vary enormously across individuals. However, for each individual the variability between runs is much lower, showing that the method is robust in performance. The right hand plot shows the resulting Monte Carlo divergence errors of the estimated distributions from the targets. Ideal performance under would have each of these errors approximately equal, and the variable sample size method gets much closer to this ideal. The circled case in the right hand plot indicates the process which has the highest error when using a fixed sample size, and this corresponds to the same individual process that always gets the highest sample size allocation under the adaptive sample size strategy in the left hand plot. This individual has a very changeable calling pattern, suggesting several possible changepoints: no calls in the first five days, then two calls one hour apart, then another two days break, and then four calls each day for the remainder of the period.

## 6 Discussion

It was remarked in the review paper of Sisson (2005) on transdimensional samplers that “a more rigorous default assessment of sampler convergence” than the existing technology is required, and this has remained an open problem. This article is a first step towards establishing such a default method from a decision theoretic perspective, proposing a framework and methodology which are rigorously motivated and fully general in their applicability to all distributional settings.

Note that when the samplers induce autocorrelation, which is commonplace with Metropolis-Hastings (MH) Markov chain Monte Carlo simulation, then the decision rule (14) for becomes more complicated since independence was assumed in the derivation of (8). If one or more of the samplers has very high serial autocorrelation, then drawing additional samples from those targets will become less attractive under , as with high probability very little extra information will be obtained from the next draw. It is still possible to proceed in this setting by adapting (8) to admit autocorrelation; for example, the rejection rate of the Markov chain could be used to approximate the probability of observing the same bin as the last sample, and otherwise draws could be assumed to be more realistically drawn from the target. However, for reasons of brevity this is not pursued further in this work, and of course the efficacy would depend entirely on the specifics of the MH/other sampler. Importantly, this issue should not be seen as a decisive limitation of the proposed methodology when using , since although thinning was used in the Markov chain Monte Carlo examples of Sections 5.1.2 and 5.2 to obtain the next sample for use in calculating the convergence criteria, this would not prevent the full sample from being retained and utilised without thinning for the actual inference problem. The amount of thinning could be varied between samplers if appropriate, and this could be counterbalanced by weighting the errors in (12) accordingly.

Another related problem which could be considered is that of importance sampling. If samples cannot be obtained directly from the target but instead from some importance distribution with the same support, then it would be useful to understand how these error estimates and sample size strategies can be extended to the case where the empirical distribution of the samples has associated weights. In addressing the revised question of how large an importance sample should be, there should be an interesting trade-off between the inherent complexity of the target distributions, which has been the subject of this article, and how well the importance distributions match those targets.

### References

1. Campbell, L. L. (1966). Exponential entropy as a measure of extent of a distribution. Probability Theory and Related Fields 5, 217–225.
2. Del Moral, P., A. Doucet, and A. Jasra (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 68, 411–436.
3. Endres, D. and J. Schindelin (2003). A new metric for probability distributions. Information Theory, IEEE Transactions on 49(7), 1858 – 1860.
4. Fisher, R. (1959). Statistical methods and scientific inference. Oliver and Boyd.
5. Fox, D. (2003). Adapting the sample size in particle filters through KLD-sampling. International Journal of Robotics Research 22, 985–1003.
6. Grassberger, P. (1988). Finite sample corrections to entropy and dimension estimates. Physics Letters A 128(6â7), 369 – 373.
7. Grassberger, P. (2003). Entropy Estimates from Insufficient Samplings. ArXiv Physics e-prints.
8. Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.
9. Heard, N. A., D. J. Weston, K. Platanioti, and D. J. Hand (2010). Bayesian anomaly detection methods for social networks. Annals of Applied Statistics 4, 645–662.
10. Leonard, T. (1973). A bayesian method for histograms. Biometrika 60(2), 297–308.
11. Lin, J. (1991). Divergence measures based on the Shannon entropy. Information Theory, IEEE Transactions on 37(1), 145 –151.
12. Lin, J. and S. K. M. Wong (1990). A new directed divergence measure and its characterization. International Journal of General Systems 17(1), 73–81.
13. Mendo, L. and J. Hernando (2006). A simple sequential stopping rule for Monte Carlo simulation. Communications, IEEE Transactions on 54(2), 231 – 241.
14. Miller, G. A. (1955). Note on the bias of information estimates. Information Theory in Psychology Problems and Methods IIB II, 95–100.
15. Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation 15, 1191â1253.
16. Raftery, A. E. and V. E. Akman (1986). Bayesian Analysis of a Poisson Process with a Change-Point. Biometrika 73(1), 85–89.
17. Robert, C. P. and G. Casella (2005). Monte Carlo Statistical Methods (Springer Texts in Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.
18. Sisson, S. and Y. Fan (2007). A distance-based diagnostic for trans-dimensional Markov chains. Statistics and Computing 17, 357–367.
19. Sisson, S. A. (2005). Transdimensional markov chains. Journal of the American Statistical Association 100(471), 1077–1089.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters