
# Splitting for Rare Event Simulation: A Large Deviation Approach to Design and Analysis

## Abstract

Particle splitting methods are considered for the estimation of rare events. The probability of interest is that a Markov process first enters a set B before another set A, and it is assumed that this probability satisfies a large deviation scaling. A notion of subsolution is defined for the related calculus of variations problem, and two main results are proved under mild conditions. The first is that the number of particles generated by the algorithm grows subexponentially if and only if a certain scalar multiple of the importance function is a subsolution. The second is that, under the same condition, the variance of the algorithm is characterized (asymptotically) in terms of the subsolution. The design of asymptotically optimal schemes is discussed, and numerical examples are presented.

## 1 Introduction

The numerical estimation of probabilities of rare events is a difficult problem. There are many potential applications in operations research and engineering, insurance, finance, chemistry, biology, and elsewhere, and many papers (and by now even a few books) have proposed numerical schemes for particular settings and applications. Because the quantity of interest is very small, standard Monte Carlo simulation requires an enormous number of samples for the variance of the resulting estimate to be comparable to the unknown probability. It quickly becomes unusable, and more efficient alternatives are sought.
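
As a back-of-the-envelope illustration of this cost (the helper below is ours, not from the paper), the number of samples standard Monte Carlo needs for a fixed relative accuracy grows like 1/p:

```python
import math

def mc_samples_needed(p, rel_err):
    """Smallest N such that std(mean of N Bernoulli(p) samples) <= rel_err * p.

    Solves sqrt(p * (1 - p) / N) <= rel_err * p for N."""
    return math.ceil((1.0 - p) / (p * rel_err ** 2))

# For a 10% relative error the required sample size scales like 100 / p:
for p in (1e-3, 1e-6, 1e-9):
    print(f"p = {p:.0e}: need about {mc_samples_needed(p, 0.1):.2e} samples")
```

For a probability of 10^-9, roughly 10^11 samples are needed just to reach 10% relative error, which is what drives the search for variance reduction.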

The two most widely considered alternatives are those based on change-of-measure techniques and those based on branching processes. The former is usually called importance sampling, and the latter is often referred to as multi-level splitting. While good results on a variety of problem formulations have been reported for both methods, it is also true that both methods can produce inaccurate and misleading results. The design issue is critical, and one can argue that the proper theoretical tools for the design of importance sampling and splitting algorithms were simply not available for complicated models and problem formulations. [An alternative approach based on interacting particles has also been suggested [delgar]. However, we are unaware of any analysis of the performance of these schemes as the probability of interest becomes small.]

Suppose that the probability of interest takes the form p = θ(G), where G is a subset of some reasonably regular space (e.g., a Polish space S) and θ is a probability measure. In ordinary Monte Carlo one generates a number of independent and identically distributed (iid) samples {X_i} from θ, and then estimates p using the sample mean of 1_G(X_i). In the case of importance sampling, one uses an alternative sampling distribution ν, generates iid samples {X_i} from ν, and then estimates p via the sample mean of 1_G(X_i)(dθ/dν)(X_i). The Radon–Nikodym derivative dθ/dν guarantees that the estimate is unbiased. The goal is to choose ν so that the individual samples cluster tightly around p, thereby reducing the variance. However, for complicated process models or events G the selection of a good measure ν may not be simple. The papers [glakou, glawan] show how certain standard heuristic methods based on ideas from large deviations can produce very poor results. The difficulty is due to points in G with low probability under ν for which dθ/dν is very large. The aforementioned large deviation heuristic does not properly account for the contribution of these points to the variance of the estimate, and it is not hard to find examples where the corresponding importance sampling estimator is much worse than even ordinary Monte Carlo. The estimates exhibit very inaccurate and/or unstable behavior, though the instability may not be evident from numerical data until massive amounts have been generated.

The most discussed application of splitting type schemes is to first entrance probabilities, and to continue the discussion we specialize to that case. Thus the sample is a path of a stationary stochastic process {X_i} (which for simplicity is taken to be Markovian), and G is the set of trajectories that first enter a set B prior to entering a set A. More precisely, for disjoint A and B and initial condition x,

$$p = p(x) = P\{X_j \in B,\ X_i \notin A,\ i \in \{0,\dots,j\},\ j < \infty \mid X_0 = x\}.$$

In the most simple version of splitting, the state space is partitioned according to certain nested sets C_0 ⊃ C_1 ⊃ ⋯ ⊃ C_J, with the starting point x ∈ C_0 and B ⊂ C_J. These sets are often defined as level sets of a particular function V, which is commonly called an importance function. Particles are generated and killed off according to the following rules. A single particle is started at x. Generation of particles (splitting) occurs whenever an existing particle reaches a threshold or level C_j for the first time. At that time, a (possibly random) number of new particles are placed at the location of entrance into C_j. The future evolutions of these particles are independent of each other (and all other particles), and follow the law of {X_i}. Particles are killed if they enter A before B. Attached to each particle is a weight. Whenever a particle splits, the weight of each descendant equals that of the parent times a discount factor. A random tree is thereby produced, with each leaf corresponding to a particle that has either reached B or been killed. A random variable (roughly analogous to a single sample from the importance sampling approach) is defined as the sum of the weights of all particles that make it to B. The rule that updates the weights when a particle splits is chosen so that the expected value of this random variable is p. This numerical experiment is independently repeated a number of times, and the sample mean is again used to estimate p.
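
To make the particle bookkeeping concrete, here is a minimal sketch of this scheme for a one-dimensional random walk with negative drift, estimating the probability of hitting a level b before a level a. The walk, the thresholds, and the fixed offspring number 2 (with discount factor 1/2) are illustrative assumptions, not the paper's general algorithm.

```python
import random

def splitting_sample(rng, x0=0.0, a=-2.0, b=2.0, drift=-0.15, step=0.5,
                     n_levels=4, offspring=2):
    """One splitting sample for P(walk hits b before a | X_0 = x0).

    Levels are equally spaced thresholds between x0 and b.  When a particle
    first crosses a threshold it splits into `offspring` copies, each with
    the parent's weight times 1/offspring (the unbiasedness rule).  Particles
    that fall to a are killed.  Returns the sum of weights reaching b.
    """
    thresholds = [x0 + (b - x0) * (k + 1) / n_levels for k in range(n_levels)]
    particles = [(x0, 1.0, 0)]  # (position, weight, index of next threshold)
    total = 0.0
    while particles:
        x, w, k = particles.pop()
        while a < x < thresholds[k]:
            x += drift + step * (1 if rng.random() < 0.5 else -1)
        if x <= a:
            continue                 # killed on reaching A
        if k == n_levels - 1:
            total += w               # reached the target level B
        else:
            for _ in range(offspring):   # split at the newly reached level
                particles.append((x, w / offspring, k + 1))
    return total

rng = random.Random(1)
est = sum(splitting_sample(rng) for _ in range(2000)) / 2000
```

Because the per-split discount 1/offspring exactly offsets the branching, the average of independent samples is an unbiased estimate of the hitting probability.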

There are two potential sources of poor behavior in the splitting algorithm. The first and most troubling is that the number of particles may be large. For example, the number could be comparable to e^{cn} for some c > 0. In settings where a large deviation characterization of p is available, the number of levels itself usually grows with the large deviation parameter, and so the number of particles could grow exponentially. We will refer to this as instability of the algorithm. For obvious computational reasons, instability is something to be avoided. The other source of poor behavior is analogous to that of importance sampling (and ordinary Monte Carlo), which is high relative variance of the estimate. If the weighting rule leads to high variation of the weights of particles that make it to B, or if too many simulations produce no particles that make it to B (in which case a zero is averaged into the sample mean), then high relative variance is likely. Note, however, that this problem has a bounded potential for mischief, since the weights cannot be larger than one. No such bound holds for the Radon–Nikodym derivative of importance sampling.

When the probability of interest can be approximated via large deviations, the rate of decay is described in terms of a variational problem, such as a calculus of variations or optimal control problem. It is well known that problems of this sort are closely related to a family of nonlinear partial differential equations (PDE) known as Hamilton-Jacobi-Bellman (HJB) equations. In a pair of recent papers [dupsezwan, dupwan5], it was shown how subsolutions of the HJB equations associated with a variety of rare event problems could be used to construct and rigorously analyze efficient importance sampling schemes. In fact, the subsolution property turns out to be in some sense necessary and sufficient, in that efficient schemes can be shown to imply the existence of an associated subsolution.

The purpose of the present paper is to show that in certain circumstances a remarkably similar result holds for splitting algorithms. More precisely, we will show the following under relatively mild conditions.

• A necessary and sufficient condition for the stability of the splitting scheme associated to a given importance function is that a certain scalar multiple of the importance function be a subsolution of the related HJB equation. The multiplier is the ratio of the logarithm of the expected number of offspring for each split and the gap between the levels.

• If the subsolution property is satisfied, then the variance of the splitting scheme decays exponentially with a rate defined in terms of the value of the subsolution at a certain point.

• As in the case of importance sampling, when a subsolution has the maximum possible value at this point (which is the value of the corresponding solution), the scheme is in some sense asymptotically optimal.

These results are significant for several reasons. The most obvious is that a splitting algorithm is probably not useful if it is not stable, and the subsolution property provides a way of checking stability. A second is that good, suboptimal schemes can be constructed and compared via the subsolutions framework. A third reason is that for interesting classes of problems it is possible to construct subsolutions that correspond to asymptotically optimal algorithms (see [dupsezwan, dupwan5]). Subsolutions can be much easier to construct than solutions. In this context it is worth noting that the type of subsolution required for splitting (a viscosity subsolution [barcap, fleson1]) is less restrictive than the type of subsolution required for importance sampling. Further remarks on this point will be given in Section 5.

An outline of the paper is as follows. In the next section we describe the probabilities to be approximated, state assumptions, and formulate the splitting algorithm. This section also presents a closely related algorithm that will be used in the analysis. Section 3 studies the stability problem, and Section 4 shows how to bound the variance of an estimator in terms of the related subsolution. The results of Sections 3 and 4 can be phrased directly in terms of the solution to the calculus of variations problem that is related to the large deviation asymptotics. However, for the purposes of practical construction of importance functions the characterization via subsolutions of a PDE is more useful. These issues are discussed in Section 5, and examples and numerical results are presented in the concluding Section 6.

Acknowledgment. Our interest in the parallels between importance sampling and multi-level splitting was stimulated by a talk given by P.T. de Boer at the RESIM conference in Bamberg, Germany [deb2].

## 2 Problem Formulation

### 2.1 Problem Setting and Large Deviation Properties

A domain D ⊂ ℝ^d is given, and also a sequence of discrete time, stationary, D-valued Markov processes {X^n_i, i = 0, 1, …}. Disjoint sets A and B are given, and we set τ^n ≐ inf{i : X^n_i ∈ A ∪ B}. The probability of interest is then

$$p^n(x^n) \doteq P\big\{X^n_{\tau^n} \in B \mid X^n_0 = x^n\big\}.$$

The varying initial conditions x^n are used for greater generality, but also because initial conditions for the prelimit processes may be restricted to some subset of D. The analogous continuous time framework can also be used, with analogous assumptions and results. For a given point x ∈ D, we make the following large deviation-type assumption.

###### Condition 1

For any sequence x^n → x,

$$\lim_{n\to\infty} -\frac{1}{n}\,\log p^n(x^n) = W(x),$$

where W is the solution to a control problem of the form

$$\inf \int_0^t L(\phi(s), \dot\phi(s))\,ds.$$

Here L : D × ℝ^d → [0, ∞], and the infimum is taken over all absolutely continuous functions φ with φ(0) = x, φ(s) ∈ D ∖ (A ∪ B) for all s < t, and φ(t) ∈ B for some t < ∞.

###### Remark 2

The assumption that {X^n_i} be Markovian is not necessary for the proofs to follow.  For example, it could be the case that X^n is the first component of a Markov process (e.g., so-called Markov-modulated processes).  In such a case it is enough that the analogous large deviation limit hold uniformly in all possible initial conditions of the remaining components, and indeed the proofs given below carry over with only notational changes. This can be further weakened; e.g., it is enough that the estimates hold uniformly with sufficiently high probability in the conditioning data.  However, the construction of subsolutions will be more difficult, since the PDE discussed in Section 5 is no longer available in explicit form. See [dupwan5] for further discussion of this point.

It is useful to say a few words on how one can verify conditions like Condition 1 from existing large deviation results. Similar but slightly different assumptions will be made in various places in the sequel, and in all cases analogous remarks will apply.

For discrete time processes one often finds process-level large deviation properties phrased in terms of a continuous time interpolation X^n(t), with X^n(i/n) = X^n_i and X^n(t) defined by piecewise linear interpolation for t not of the form i/n. In precise terms, process-level large deviation asymptotics hold for {X^n} if the following upper and lower bounds hold for each T < ∞ and any sequence of initial conditions x^n with x^n → x. Define

$$I^T_x(\phi) = \int_0^T L(\phi(s), \dot\phi(s))\,ds$$

if φ is absolutely continuous with φ(0) = x, and I^T_x(φ) = ∞ otherwise. If F is any closed subset of C([0,T]:D), then the upper bound

$$\limsup_{n\to\infty} \frac{1}{n}\,\log P\{X^n \in F \mid X^n(0) = x^n\} \le -\inf_{\phi \in F} I^T_x(\phi)$$

holds, and if O is any open subset of C([0,T]:D), then the lower bound

$$\liminf_{n\to\infty} \frac{1}{n}\,\log P\{X^n \in O \mid X^n(0) = x^n\} \ge -\inf_{\phi \in O} I^T_x(\phi)$$

holds. It is also usual to assume that for each fixed T < ∞ and any M < ∞, the set

$$\{\phi \in C([0,T]:D) : I^T_x(\phi) \le M\}$$

is compact in C([0,T]:D). The zero-cost trajectories (i.e., paths φ for which I^T_x(φ) = 0) are particularly significant, in that all other paths are in some sense exponentially unlikely.

With regard to Condition 1, two different types of additional conditions beyond the sample path large deviation principle are required. One is a condition that allows a reduction to large deviation properties over a finite time interval. For example, suppose that there is T < ∞ such that any path that enters neither A nor B before time T has cost exceeding W(x). In this case, the contribution to p^n(x^n) from sample paths that take longer than time T is negligible, and can be ignored. This allows an application of the finite time large deviation principle. Now let G be the set of trajectories that enter B at some time in [0,T] without having previously entered A. By the first condition, the asymptotic rates of decay of p^n(x^n) and P{X^n ∈ G | X^n(0) = x^n} are the same. The second type of condition is to impose enough regularity on the sets A and B and the rate function that the infima over the interior and closure of G are the same. These points are discussed at length in the literature on large deviations [frewen].

###### Example 3

Assume the following conditions: L is lower semicontinuous; L(y, ·) is convex for each y; L is uniformly superlinear; for each y there is a unique point b(y) for which L(y, b(y)) = 0; b is Lipschitz continuous, and all solutions to φ̇ = b(φ) are attracted to a point x_0 ∈ A°, with A open. Let D be a bounded domain that contains A and B, and assume ⟨b(y), n(y)⟩ < 0 for y ∈ ∂D, where n(y) is the outward normal to D at y. Suppose that the cost to go from x to any point of ∂D is at least W(x). Then W as described above exists.

### 2.2 The Splitting Algorithm

In order to define a splitting algorithm we need to choose an importance function V : D → [0, ∞) and a level size Δ > 0. We will require that V be continuous and that V(y) = 0 for all y ∈ B. Later on we will relate V to the value function W, and discuss why subsolutions to the PDE that is satisfied by W are closely related to natural candidates for the importance function.

To simplify the presentation, we consider only splitting mechanisms with an a priori bound J < ∞ on the maximum number of offspring. The restriction is convenient for the analysis, and as we will see is without loss of generality. The set of (deterministic) splitting mechanisms will be indexed by j ∈ {1, …, J}. Given that mechanism j has been selected, r(j) particles (with r(j) ∈ {0, 1, …, J}) are generated and weights w_1(j), …, w_{r(j)}(j) are assigned to the particles. Note that we do not assume r(j) ≥ 1, so that a particle may produce no offspring. The class of all splitting mechanisms (i.e., including randomized mechanisms) is identified with the set of all probability distributions q = (q_1, …, q_J) on {1, …, J}.

Associated with V are the level sets

$$L_z = \{y \in D : V(y) \le z\}.$$

A key technical condition we use is the following. In the condition, E_{x^n} denotes expected value given X^n_0 = x^n.

###### Condition 4

Let z < V(x) be given and define σ^n ≐ inf{i : X^n_i ∈ L_z}. Then

$$\liminf_{n\to\infty} -\frac{1}{n}\,\log E_{x^n}\Big[ 1_{\{X^n_{\sigma^n} \in L_z\}} \big( p^n(X^n_{\sigma^n}) \big)^2 \Big] \ge W(x) + \inf_{y \in \partial L_z} W(y).$$

Under the conditions discussed after Condition 1 which allow one to consider bounded time intervals, Condition 4 follows from the Markov property and the large deviation upper bound.

We also define the collections of sets L^n_k ≐ L_{kΔ/n}, k = 0, 1, …. Define the level function l^n by l^n(y) ≐ ⌈nV(y)/Δ⌉. The location of the starting point corresponds to level l^n(x^n), and level 0 indicates entry into the target set B. The splitting algorithm associated with a particular distribution q will now be defined. Although the algorithm depends on n, V and Δ, to minimize notational clutter these dependencies are not explicitly denoted.

Splitting Algorithm (SA)

Variables:
N^n_k: number of particles in generation k
X^n_{k,j}: position of particle j in generation k
w^n_{k,j}: weight of particle j in generation k
Initialization Step:
N^n_0 = 1, X^n_{0,1} = x^n, w^n_{0,1} = 1
for k = 1, …, l^n(x^n)
N^n_k = 0
end
Main Algorithm:
for k = 0, …, l^n(x^n) − 1
if N^n_k = 0 then N^n_{k+1} = 0
else
for j = 1, …, N^n_k
generate a single sample of a process with the same law as X^n and initial condition X^n_{k,j},
stopped at the first time it enters A or the level set L^n_{l^n(x^n)−k−1}
let Y denote the stopping location
Splitting Step begin
if Y ∈ L^n_{l^n(x^n)−k−1}
let M be an independent sample from the law q
for i = 1, …, r(M)
N^n_{k+1} = N^n_{k+1} + 1, X^n_{k+1, N^n_{k+1}} = Y, w^n_{k+1, N^n_{k+1}} = w^n_{k,j} · w_i(M)
end
end
Splitting Step end
end
end
end
Construction of a sample:
once all the generations have been calculated we form the quantity
s^n{SA} = Σ_{j=1}^{N^n_{l^n(x^n)}} w^n_{l^n(x^n), j}.

It should be noted that generation 1 consists of all of the particles that make it to set L^n_{l^n(x^n)−1}, and more generally generation k consists of all particles that make it to set L^n_{l^n(x^n)−k}. We also define generation 0 to be the initial particle.

An estimate of p^n(x^n) is formed by averaging a number of independent samples of s^n{SA}. Observe that once generation k + 1 has been calculated, the information about generations 0 through k can be discarded, and so there is no need to keep all the data in memory until completion of the algorithm. Also note that in practice there is no need to split upon entering B.

We first need to find conditions under which this splitting algorithm gives an unbiased estimator of p^n(x^n). To simplify this and other calculations we introduce an auxiliary algorithm titled Splitting Algorithm Fully Branching (SFB). The essential difference between the two is that the process dynamics are redefined in A to make it absorbing, and that splitting continues even after a particle enters A. When the estimate is constructed we only count the particles which are in B in the last generation, so that the two estimates have the same distribution. The SFB algorithm is more convenient for purposes of analysis, because we do not need to distinguish those particles which have entered A from those which still have a chance to enter B. Of course this algorithm would be terrible from a practical perspective, since the total number of particles is certain to grow exponentially. However, the algorithm is used only for the purposes of theoretical analysis, and the number of particles is not a concern. Overbars are used to distinguish this algorithm from the previous one.

Splitting Algorithm Fully Branching (SFB)

Variables:
N̄^n_k: number of particles in generation k
X̄^n_{k,j}: position of particle j in generation k
w̄^n_{k,j}: weight of particle j in generation k
Initialization Step:
N̄^n_0 = 1, X̄^n_{0,1} = x^n, w̄^n_{0,1} = 1
for k = 1, …, l^n(x^n)
N̄^n_k = 0
end
Main Algorithm:
for k = 0, …, l^n(x^n) − 1
for j = 1, …, N̄^n_k
generate a single sample of a process with the same law as X^n (with the dynamics made absorbing in A) and initial condition X̄^n_{k,j},
stopped at the first time it is absorbed in A or enters the level set L^n_{l^n(x^n)−k−1}
let Ȳ denote the stopping location
Splitting Step begin
let M be an independent sample from the law q
for i = 1, …, r(M)
N̄^n_{k+1} = N̄^n_{k+1} + 1, X̄^n_{k+1, N̄^n_{k+1}} = Ȳ, w̄^n_{k+1, N̄^n_{k+1}} = w̄^n_{k,j} · w_i(M)
end
Splitting Step end
end
end
Construction of a sample:
once all the generations have been calculated we form the quantity
s^n{SFB} = Σ_{j=1}^{N̄^n_{l^n(x^n)}} 1{X̄^n_{l^n(x^n), j} ∈ B} w̄^n_{l^n(x^n), j}.
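
For intuition, the fully branching variant can be sketched in the same toy setting used earlier (a one-dimensional walk hitting b before a; the walk, thresholds, and offspring number are illustrative assumptions, not the paper's general algorithm). Particles absorbed below a keep splitting, and only the particles that end at or above b are counted:

```python
import random

def sfb_sample(rng, x0=0.0, a=-2.0, b=2.0, drift=-0.15, step=0.5,
               n_levels=4, offspring=2):
    """One fully branching (SFB-style) sample: every particle splits at every
    generation, absorbed or not, and only particles ending in B are counted."""
    thresholds = [x0 + (b - x0) * (k + 1) / n_levels for k in range(n_levels)]
    gen = [(x0, 1.0)]
    for k in range(n_levels):
        nxt = []
        for x, w in gen:
            # run until the particle reaches the next level or is absorbed;
            # a particle already absorbed (x <= a) fails the condition and stays
            while a < x < thresholds[k]:
                x += drift + step * (1 if rng.random() < 0.5 else -1)
            for _ in range(offspring):   # every particle splits, absorbed or not
                nxt.append((x, w / offspring))
        gen = nxt
    # only particles that ended in B are counted
    return sum(w for x, w in gen if x >= b)

rng = random.Random(2)
est_sfb = sum(sfb_sample(rng) for _ in range(1000)) / 1000
```

The sample has the same distribution as the killed version, but the particle count here is deterministic (offspring^n_levels per sample), which is exactly why the variant is useful for analysis and useless in practice.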

Since the distributions of the two estimates coincide

$$E_{x^n}\Bigg[ \sum_{j=1}^{N^n_{l^n(x^n)}} w^n_{l^n(x^n),j} \Bigg] = E_{x^n}\Bigg[ \sum_{j=1}^{\bar N^n_{l^n(x^n)}} 1_{\{\bar X^n_{l^n(x^n),j} \in B\}}\, \bar w^n_{l^n(x^n),j} \Bigg].$$

Because of this, the SFB algorithm can be used to prove the following.

###### Lemma 5

An estimator based on independent copies of s^n{SA} is unbiased if and only if

$$E\Bigg[ \sum_{i=1}^{r(M)} w_i(M) \Bigg] = 1.$$

Proof. It suffices to prove

$$E_{x^n}\Bigg[ \sum_{j=1}^{\bar N^n_{l^n(x^n)}} 1_{\{\bar X^n_{l^n(x^n),j} \in B\}}\, \bar w^n_{l^n(x^n),j} \Bigg] = p^n(x^n).$$

We will use a particular construction of the SFB algorithm that is useful here and elsewhere in the paper. Recall that with this algorithm every particle splits at every generation. Hence the random number of particles associated with each splitting can be generated prior to the generation of any trajectories that will determine particle locations. As a consequence, the total number of particles present at the last generation can be calculated, as can the weight that will be assigned to each particle in this final generation, prior to the assignment of a trajectory to the particle. Once the weights have been assigned, the trajectories of all the particles can be constructed in terms of random variables that are independent of those used to construct the weights. Since the probability that any such trajectory makes it to B prior to hitting A is p^n(x^n),

$$E_{x^n}\Bigg[ \sum_{j=1}^{\bar N^n_{l^n(x^n)}} 1_{\{\bar X^n_{l^n(x^n),j} \in B\}}\, \bar w^n_{l^n(x^n),j} \Bigg] = p^n(x^n)\, E_{x^n}\Bigg[ \sum_{j=1}^{\bar N^n_{l^n(x^n)}} \bar w^n_{l^n(x^n),j} \Bigg].$$

A simple proof by induction, using the independence of the splitting from particle to particle, shows that

$$E_{x^n}\Bigg[ \sum_{j=1}^{\bar N^n_{l^n(x^n)}} \bar w^n_{l^n(x^n),j} \Bigg] = \Bigg( E\Bigg[ \sum_{i=1}^{r(M)} w_i(M) \Bigg] \Bigg)^{l^n(x^n)}.$$

For the rest of the paper we restrict attention to splitting mechanisms that are unbiased.
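
The unbiasedness condition of Lemma 5 is easy to check numerically for a given mechanism. Below is a hypothetical mechanism (all numbers made up for illustration) with J = 2, offspring counts r(1) = 1 and r(2) = 3, and the constant weights 1/E r(M):

```python
# Hypothetical splitting mechanism: mechanism j is chosen with probability
# q[j] and produces r[j] offspring; each offspring gets the constant weight
# 1 / E r(M), so the expected total offspring weight per split is 1.
r = {1: 1, 2: 3}
q = {1: 0.5, 2: 0.5}

mean_offspring = sum(r[j] * q[j] for j in r)   # E r(M)
w = 1.0 / mean_offspring                       # constant weight per offspring

# Lemma 5's condition: E[ sum_{i=1}^{r(M)} w_i(M) ] = 1
lhs = sum(q[j] * r[j] * w for j in r)
assert abs(lhs - 1.0) < 1e-12
```

Any weighting with E[Σ_{i≤r(M)} w_i(M)] = 1 works; the constant choice is the one used repeatedly in the proofs below.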

## 3 Stability

Now let an importance function V, level Δ > 0, and splitting mechanism (r, w, q) be given. Define

$$J(x,y) \doteq \inf_{\phi,\,t\,:\,\phi(0)=x,\ \phi(t)=y} \int_0^t L(\phi(s), \dot\phi(s))\,ds, \tag{1}$$

where the infimum is over absolutely continuous functions φ. A function W̄ will be called a subsolution if W̄(x) − W̄(y) ≤ J(x,y) for all x, y ∈ D and if W̄(y) ≤ 0 for all y ∈ B. In Section 5 we will discuss conditions under which W̄ can be identified as a viscosity subsolution of an associated PDE. Recall that a splitting algorithm is called stable if the total number of particles ever used grows subexponentially as n → ∞. For a given splitting algorithm define

$$\bar W(x) = \frac{\log E\,r(M)}{\Delta}\, V(x). \tag{2}$$

In this section we show that, loosely speaking, a splitting algorithm is stable if and only if W̄ is a subsolution.
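
The subsolution property can be checked numerically on a grid. Everything below (the cost J, the importance function V, the state space, and the parameters) is an illustrative assumption, not an example from the paper; the point is only the mechanics of the test:

```python
import itertools
import math

def is_subsolution(Wbar, J, points, B, tol=1e-9):
    """Grid check of the subsolution property of Section 3:
    Wbar(x) - Wbar(y) <= J(x, y) for all grid points x, y, and
    Wbar(y) <= 0 for y in the target set B."""
    pairs_ok = all(Wbar(x) - Wbar(y) <= J(x, y) + tol
                   for x, y in itertools.product(points, repeat=2))
    target_ok = all(Wbar(y) <= tol for y in B)
    return pairs_ok and target_ok

# Made-up one-dimensional setting: state space [0, 1], target B = {1},
# cost J(x, y) = c * max(y - x, 0), affine V vanishing on B.
c = 2.0
J = lambda x, y: c * max(y - x, 0.0)
V = lambda y: 1.0 - y
delta, mean_offspring = 0.5, 2.0            # level gap Delta and E r(M)
Wbar = lambda y: (math.log(mean_offspring) / delta) * V(y)

grid = [k / 10 for k in range(11)]
stable = is_subsolution(Wbar, J, grid, B=[1.0])   # True: 2 * log 2 < c
```

With E r(M) = 2 and Δ = 0.5 the multiplier is 2 log 2 ≈ 1.39 < c, so W̄ is a subsolution; raising E r(M) to 4 makes the multiplier 2 log 4 ≈ 2.77 > c and the same check fails, matching the stability dichotomy proved below.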

A construction that will simplify some of the proofs is to replace a given splitting mechanism by one for which all the weights are constant. Thus (r, w, q) is replaced by (r, w̄, q), where for each j ∈ {1, …, J} and i ∈ {1, …, r(j)},

$$[\bar w_i(j)]^{-1} = E\,r(M) = \sum_{j=1}^{J} r(j)\, q_j.$$

The new splitting mechanism is also unbiased, and the distribution of the number of particles at each stage is the same as that of (r, w, q).

To establish the instability result we make a very mild assumption: a large deviation lower bound holds for the probability that a small open ball is hit prior to reaching A. This assumption can be expected to hold under conditions which guarantee Condition 1.

###### Proposition 6

Consider an importance function V, level Δ > 0, and splitting mechanism (r, w, q), and define W̄ by (2). Suppose there exists y ∈ D such that V(y) > 0 and

$$\bar W(x) - \bar W(y) > J(x,y). \tag{3}$$

Assume that V is continuous at y. Let p̃^n(x^n) be the probability that X^n enters the ball of radius δ about y before entering A, given X^n_0 = x^n, and assume

$$\liminf_{n\to\infty} \frac{1}{n}\,\log \tilde p^n(x^n) \ge -\inf_{z\,:\,|z-y| < \delta} J(x,z).$$

Then the corresponding splitting algorithm is not stable.

Proof. It is enough to prove the instability of the algorithm that uses (r, w̄, q). Since J(x,y) ≥ 0, we have W̄(x) > W̄(y). From the definition of W̄ in (2) and from (3), there exist ε > 0 and δ > 0 such that for all z with |z − y| < δ,

$$\big[V(x) - V(z)\big]\, \frac{\log E\,r(M)}{\Delta} > J(x,z) + \varepsilon.$$

Let S ≐ {z : |z − y| ≤ δ}. By taking δ smaller if necessary, we can guarantee that S ∩ (A ∪ B) = ∅ and V(z) > 0 for all z ∈ S.

Suppose one were to consider the problem of estimating p̃^n(x^n). One could use the same splitting mechanism and level sets, and even the same random variables, except that one would stop on hitting S or A rather than B or A, and the last stage would correspond to some number m^n ≤ l^n(x^n). Of course, since V is positive on S it will no longer work as an importance function, but there is a function that induces exactly the same level sets as V and which can serve as an importance function for this problem. See Figure 2. The two problems can be coupled, in that exactly the same random variables can be used to construct both the splitting mechanisms and the particle trajectories up until the particles in the p̃^n(x^n) problem enter S.

If a particular particle has not been trapped in A prior to entering S, then that particle would also not yet be trapped in A in the corresponding scheme used to estimate p^n(x^n). Let Ñ^n_{m^n} denote the number of particles that make it to S in the SA used to approximate p̃^n(x^n), and let N^n_{m^n} be the corresponding number in the SA that approximates p^n(x^n). Then N^n_{m^n} ≥ Ñ^n_{m^n}.

Using the SFB variant in the same way that it was used in the proof of Lemma 5, and using the fact that the mechanism (r, w̄, q) assigns constant weights,

$$\tilde p^n(x^n) = E_{x^n}\Bigg[ \sum_{j=1}^{\tilde N^n_{m^n}} \tilde w^n_{m^n,j} \Bigg] = E_{x^n}\Bigg[ \sum_{j=1}^{\tilde N^n_{m^n}} [E\,r(M)]^{-m^n} \Bigg] = [E\,r(M)]^{-m^n}\, E_{x^n}\big[\tilde N^n_{m^n}\big].$$

We now use the lower bound on p̃^n(x^n) and the fact that m^n ≥ n[V(x) − sup_{z∈S} V(z)]/Δ:

$$\begin{aligned} \liminf_{n\to\infty} \frac{1}{n}\,\log E_{x^n}\big[\tilde N^n_{m^n}\big] &= \liminf_{n\to\infty} \frac{1}{n}\,\log\Big( \tilde p^n(x^n)\, [E\,r(M)]^{\,m^n} \Big) \\ &\ge -\inf_{z\in S} J(x,z) + \Big[V(x) - \sup_{z\in S} V(z)\Big]\, \frac{\log E\,r(M)}{\Delta} \\ &\ge \varepsilon. \end{aligned}$$

It follows that

$$\liminf_{n\to\infty} \frac{1}{n}\,\log E_{x^n}\big[N^n_{l^n(x^n)}\big] \ge \varepsilon > 0,$$

which completes the proof.

The next proposition considers stability. Here we will make a mild assumption concerning a large deviation upper bound, which can also be expected to hold under conditions which guarantee Condition 1.

###### Proposition 7

Consider an importance function V, level Δ > 0, and splitting mechanism (r, w, q), and define W̄ by (2). Suppose that

$$\bar W(x) - \bar W(y) \le J(x,y)$$

for all x, y ∈ D, and that W̄(y) ≤ 0 for all y ∈ B. Consider any a ∈ (0, V(x)), let p̃^n(x^n) be the probability that X^n enters the level set L_a before entering A (given X^n_0 = x^n), and assume

$$\limsup_{n\to\infty} \frac{1}{n}\,\log \tilde p^n(x^n) \le -\inf_{z \in L_a} J(x,z).$$

Then the corresponding splitting algorithm is stable.

Proof. For each n let k^n be the value in {0, 1, …, l^n(x^n)} that maximizes E_{x^n}[N^n_k], and set v^n ≐ k^n/n.  Since {v^n} is bounded, along some subsequence (again denoted by n) we have v^n → v. Using the usual argument by contradiction, it is enough to prove

$$\limsup_{n\to\infty} \frac{1}{n}\,\log E_{x^n}\big[N^n_{k^n}\big] \le 0$$

along this subsequence. First suppose that v = 0. Given ε > 0, choose n_0 such that v^n ≤ ε for all n ≥ n_0. Then E_{x^n}[N^n_{k^n}] ≤ [E r(M)]^{k^n}, and so the limsup above is at most ε log E r(M). Since ε > 0 is arbitrary, this case is complete.

Now assume v > 0 and let ε ∈ (0, v) be given. Suppose one were to consider the problem of estimating p̃^n(x^n) as defined in the statement of the proposition, with a = V(x) − Δ(v − ε). We again use the same splitting mechanism and level sets, except that we now stop on hitting L_a or A. An importance function with these level sets can be found by adding a constant to V. We again couple the processes.  Observe that entry into L_a for the p̃^n(x^n) problem corresponds to entry into a level set of the p^n(x^n) problem indexed by some m^n such that (l^n(x) − m^n + 1)/n → v − ε as n → ∞. Note that particles generated upon entry into the set L_a correspond to generation l^n(x) − m^n + 1 of the algorithm. The final generation of the algorithm used to estimate p̃^n(x^n) is generation l^n(x) − m^n + 1, corresponding to the particles generated upon reaching L_a. Observe that every particle in the algorithm used to estimate p^n(x^n) that is not trapped in A by stage l^n(x) − m^n + 1 will make it to the target set in the algorithm used to estimate p̃^n(x^n). Hence Ñ^n_{l^n(x)−m^n+1} ≥ N^n_{l^n(x)−m^n+1}, where Ñ^n_{l^n(x)−m^n+1} denotes the number of such particles for the SA used to estimate p̃^n(x^n).

We again use the SFB variant in the same way that it was used in the proof of Lemma 5, together with the splitting mechanism (r, w̄, q), to obtain

$$\tilde p^n(x^n) = [E\,r(M)]^{-(l^n(x)-m^n+1)}\, E_{x^n}\big[\tilde N^n_{l^n(x)-m^n+1}\big].$$

It follows from the definition of m^n that (l^n(x) − m^n + 1)/n → v − ε. Using the upper bound on p̃^n(x^n),

$$\begin{aligned} \limsup_{n\to\infty} \frac{1}{n}\,\log E_{x^n}\big[\tilde N^n_{l^n(x)-m^n+1}\big] &= \limsup_{n\to\infty} \frac{1}{n}\,\log\Big( \tilde p^n(x^n)\, [E\,r(M)]^{\,l^n(x)-m^n+1} \Big) \\ &\le -\inf_{z \in L_{V(x)-\Delta(v-\varepsilon)}} J(x,z) + (v-\varepsilon)\,\log E\,r(M) \\ &\le \sup_{z \in L_{V(x)-\Delta(v-\varepsilon)}} \Big( \frac{V(x)-V(z)}{\Delta}\,\log E\,r(M) - J(x,z) \Big) \\ &\le 0. \end{aligned}$$

For n sufficiently large we have k^n ≥ l^n(x) − m^n + 1, and hence E_{x^n}[N^n_{k^n}] ≤ [E r(M)]^{k^n − (l^n(x)−m^n+1)} E_{x^n}[Ñ^n_{l^n(x)−m^n+1}]. It follows that

$$\limsup_{n\to\infty} \frac{1}{n}\,\log E_{x^n}\big[N^n_{k^n}\big] \le \varepsilon\,\log E\,r(M),$$

and since ε > 0 is arbitrary the proof is complete.

## 4 Asymptotic Performance

Since the sample s^n{SA} has mean p^n(x^n), any estimator constructed as an average of independent copies of s^n{SA} is unbiased and has variance proportional to var(s^n{SA}).  Once the mean is fixed, the minimization of var(s^n{SA}) among splitting algorithms is equivalent to the minimization of the second moment E_{x^n}[s^n{SA}]^2.  It is of course very difficult to find the minimizer in this problem.  When a large deviation scaling holds, a useful alternative is to maximize the rate of decay of the second moment, i.e., to maximize

$$\liminf_{n\to\infty} -\frac{1}{n}\,\log E_{x^n}\big[s^n\{SA\}\big]^2 = \liminf_{n\to\infty} -\frac{1}{n}\,\log E_{x^n}\Bigg[ \sum_{j=1}^{N^n_{l^n(x^n)}} w^n_{l^n(x^n),j} \Bigg]^2.$$

By Jensen's inequality the best possible rate is 2W(x):

 liminfn→∞−1nlogExn[sn{SA}]2≥liminfn→