Subset Simulation Method for Rare Event Estimation: An Introduction
Abstract
This paper provides a detailed introductory description of Subset Simulation, an advanced stochastic simulation method for estimation of small probabilities of rare failure events. A simple and intuitive derivation of the method is given along with the discussion on its implementation. The method is illustrated with several easytounderstand examples. For demonstration purposes, the MATLAB code for the considered examples is provided. The reader is assumed to be familiar only with elementary probability theory and statistics.
I Introduction
Subset Simulation (SS) is an efficient and elegant method for simulating rare events and estimating the corresponding small tail probabilities. The method was originally developed by SiuKui Au and James Beck in the already classical paper AuBeck for estimation of structural reliability of complex civil engineering systems such as tall buildings and bridges at risk from earthquakes. The method turned out to be so powerful and general that over the last decade, SS has been successfully applied to reliability problems in geotechnical, aerospace, fire, and nuclear engineering. Moreover, the idea of SS proved to be useful not only in reliability analysis but also in other problems associated with general engineering systems, such as sensitivity analysis, design optimization, and uncertainty quantification. As of October 2013, according to the Web of Knowledge database, the original SS paper AuBeck received more than 250 citations that indicates the high impact of the Subset Simulation method on the engineering research community.
Subset Simulation is essentially based on two different ideas: conceptual and technical. The conceptual idea is to decompose the rare event into a sequence of progressively “lessrare” nested events,
(1) 
where is a relatively frequent event. For example, suppose that represents the event of getting exactly heads when flipping a fair coin times. If is large, then is a rare event. To decompose into a sequence (1), let us define to be the event of getting exactly heads in the first flips, where . The smaller , the less rare the corresponding event ; and — getting heads in the first flip — is relatively frequent.
Given a sequence of subsets (1), the small probability of the rare event can then be represented as a product of larger probabilities as follows:
(2) 
where denotes the conditional probability of event given the occurrence of event , for . In the coin example, , all conditional probabilities , and the probability of the rare event .
Unlike the coin example, in real applications, it is often not obvious how to decompose the rare event into a sequence (1) and how to compute all conditional probabilities in (2). In Subset Simulation, the “sequencing” of the rare event is done adaptively as the algorithm proceeds. This is achieved by employing Markov chain Monte Carlo, an advanced simulation technique, which constitutes the second – technical – idea behind SS. Finally, all conditional probabilities are automatically obtained as a byproduct of the adaptive sequencing.
The main goals of this paper are: (a) to provide a detailed exposition of Subset Simulation at an introductory level; (b) to give a simple derivation of the method and discuss its implementation; and (c) to illustrate SS with intuitive examples. Although the scope of SS is much wider, in this paper the method is described in the context of engineering reliability estimation, the problem SS was originally developed for in AuBeck.
The rest of the paper is organized as follows. Section II describes the engineering reliability problem and explains why this problem is computationally challenging. Section III discusses how the direct Monte Carlo method can be used for engineering reliability estimation and why it is often inefficient. In Section IV, a necessary preprocessing step which is often used by many reliability methods is briefly discussed. Section V is the core of the paper, where the SS method is explained. Illustrative examples are considered in Section LABEL:Examples. For demonstration purposes, the MATLAB code for the considered examples is provided in Section LABEL:Appendix. Section LABEL:Conclusions concludes the paper with a brief summary.
Ii Engineering Reliability Problem
One of the most important and computationally challenging problems in reliability engineering is to estimate the probability of failure for a system, that is, the probability of unacceptable system performance. The behavior of the system can be described by a response variable , which may represent, for example, the roof displacement or the largest interstory drift. The response variable depends on input variables , also called basic variables, which may represent geometry, material properties, and loads,
(3) 
where is called the performance function. The performance of the system is measured by comparison of the response with a specified critical value : if , then the system is safe; if , then the system has failed. This failure criterion allows to define the failure domain in the input space as follows:
(4) 
In other words, the failure domain is a set of values of input variables that lead to unacceptance system performance, namely, to the exceedance of some prescribed critical threshold , which may represent the maximum permissible roof displacement, maximum permissible interstory drift, etc.
Engineering systems are complex systems, where complexity, in particular, means that the information about the system (its geometric and material properties) and its environment (loads) is never complete. Therefore, there are always uncertainties in the values of input variables . To account for these uncertainties, the input variables are modeled as random variables whose marginal distributions are usually obtained from test data, expert opinion, or from literature. Let denote the join probability density function (PDF) for . The randomness in the input variables is propagated through (3) into the response variable , which makes the failure event also random. The engineering reliability problem is then to compute the probability of failure , given by the following expression:
(5) 
The behavior of complex systems, such as tall buildings and bridges, is represented by a complex model (3). In this context, complexity means that the performance function , which defines the integration region in (5), is not explicitly known. The evaluation of for any is often timeconsuming and usually done by the finite element method (FEM), one of the most important numerical tools for computation of the response of engineering systems. Thus, it is usually impossible to evaluate the integral in (5) analytically because the integration region, the failure domain , is not known explicitly.
Moreover, traditional numerical integration is also generally not applicable. In this approach, the dimensional input space is partitioned into a union of disjoint hypercubes, . For each hypercube , a “representative” point is chosen inside that hypercube, . The integral in (5) is then approximated by the following sum:
(6) 
where denotes the volume of and summation is taken over all failure points . Since it is not known in advance whether a given point is a failure point or not (the failure domain is not known explicitly), to compute the sum in (6), the failure criterion (4) must be checked for all . Therefore, the approximation (6) becomes
(7) 
where stands for the indicator function, i.e.,
(8) 
If denotes the number of intervals each dimension of the input space is partitioned into, then the total number of terms in (7) is . Therefore, the computational effort of numerical integration grows exponentially with the number of dimensions . In engineering reliability problems, the dimension of the input space is typically very large (e.g., when the stochastic load time history is discretized in time). For example, is not unusual in the reliability literature. This makes numerical integration computationally infeasible.
Over the past few decades, many different methods for solving the engineering reliability problem (5) have been developed. In general, the proposed reliability methods can be classified into three categories, namely:

Analytic methods are based on the Taylorseries expansion of the performance function, e.g. the FirstOrder Reliability Method (FORM) and the SecondOrder Reliability Method (SORM) Ditlevsen; Madsen; Melchers.

Surrogate methods are based on a functional surrogate of the performance function, e.g. the Response Surface Method (RSM) Faravelli; Schueller; Bucher, Neural Networks Papadrakakis, Support Vector Machines Hurtado, and other methods Hurtado_book.

Monte Carlo simulation methods, among which are Importance Sampling Engelund, Importance Sampling using Elementary Events AuBeck2, Radialbased Importance Sampling Grooteman, Adaptive Linked Importance Sampling ALIS, Directional Simulation Ditlevsen, Line Sampling LineSampling, Auxiliary Domain Method ADM, Horseracing Simulation Zuev, and Subset Simulation AuBeck.
Subset Simulation is thus a reliability method which is based on (advanced) Monte Carlo simulation.
Iii The Direct Monte Carlo Method
The Monte Carlo method, referred in this paper as Direct Monte Carlo (DMC), is a statistical sampling technique that have been originally developed by Stan Ulam, John von Neumann, Nick Metropolis (who actually suggested the name “Monte Carlo” Metropolis), and their collaborators for solving the problem of neutron diffusion and other problems in mathematical physics MonteCarlo. From a mathematical point of view, DMC allows to estimate the expected value of a quantity of interest. More specifically, suppose the goal is to evaluate , that is an expectation of a function with respect to the PDF ,
(9) 
The idea behind DMC is a straightforward application of the law of large numbers that states that if are i.i.d. (independent and identically distributed) from the PDF , then the empirical average converges to the true value as goes to . Therefore, if the number of samples is large enough, then can be accurately estimated by the corresponding empirical average:
(10) 
The relevance of DMC to the reliability problem (5) follows from a simple observation that the failure probability can be written as an expectation of the indicator function (8), namely,
(11) 
where denotes the entire input space. Therefore, the failure probability can be estimated using the DMC method (10) as follows:
(12) 
where are i.i.d. samples from .
The DMC estimate of is thus just the ratio of the total number of failure samples , i.e., samples that produce system failure according to the system model, to the total number of samples, . Note that is an unbiased random estimate of the failure probability, that is, on average, equals to . Mathematically, this means that . Indeed, using the fact that and (11),
(13) 
The main advantage of DMC over numerical integration is that its accuracy does not depend on the dimension of the input space. In reliability analysis, the standard measure of accuracy of an unbiased estimate of the failure probability is its coefficient of variation (c.o.v.) , which is defined as the ratio of the standard deviation to the expected value of , i.e., , where denotes the variance. The smaller the c.o.v. , the more accurate the estimate is. It is straightforward to calculate the variance of the DMC estimate:
(14) 
Here, the identity was used. Using (13) and (14), the c.o.v. of the DMC estimate can be calculated:
(15) 
This result shows that depends only on the failure probability and the total number of samples , and does not depend on the dimension of the input space. Therefore, unlike numerical integration, the DMC method does not suffer from the “curse of dimensionality”, i.e. from an exponential increase in volume associated with adding extra dimensions, and is able to handle problems of high dimension.
Nevertheless, DMC has a serious drawback: it is inefficient in estimating small failure probabilities. For typical engineering reliability problems, the failure probability is very small, . In other words, the system is usually assumed to be designed properly, so that its failure is a rare event. In the reliability literature, have been considered. If is very small, then it follows from (15) that
(16) 
This means that the number of samples needed to achieve an acceptable level of accuracy is inverse proportional to , and therefore very large, . For example, if and the c.o.v. of is desirable, then samples are required. Note, however, that each evaluation of , , in (12) requires a system analysis to be performed to check whether the sample is a failure sample. As it has been already mentioned in Section II, the computation effort for the system analysis, i.e., computation of the performance function , is significant (usually involves the FEM method). As a result, the DMC method becomes excessively costly and practically inapplicable for reliability analysis. This deficiency of DMC has motivated research to develop more advanced simulation algorithms for efficient estimation of small failure probabilities in high dimensions.
Remark 1.
It is important to highlight, however, that even though DMC cannot be routinely used for reliability problems (too expensive), it is a very robust method, and it is often used as a check on other reliability methods.
Iv Preprocessing: Transformation of Input Variables
Many reliability methods, including Subset Simulation, assume that the input variables are independent. This assumption, however, is not a limitation, since in simulation one always starts from independent variables to generate dependent “physical” variables. Furthermore, for convenience, it is often assumed that are i.i.d. Gaussian. If this is not the case, a “preprocessing” step that transforms to i.i.d. Gaussian variables must be undertaken. The transformation form to can be performed in several ways depending on the available information about the input variables. In the simplest case, when are independent Gaussians, , where and are respectively the mean and variance of , the necessary transformation is standardization:
(17) 
In other cases, more general techniques should be used, such as the Rosenblatt transformation Rosenblatt and the Nataf transformation Nataf. To avoid introduction of additional notation, hereinafter, it is assumed without loss of generality that the vector has been already transformed and it follows the standard multivariate Gaussian distribution,
(18) 
where denotes the standard Gaussian PDF,
(19) 
V The Subset Simulation Method
Unlike Direct Monte Carlo, where all computational resources are directly spent on sampling the input space, , and computing the values of the performance function , Subset Simulation first “probes” the input space by generating a relatively small number of i.i.d samples , , and computing the corresponding system responses . Here, the subscript 0 indicates the stage of the algorithm. Since is a rare event and is relatively small, it is very likely that none of the samples belongs to , that is for all . Nevertheless, these Monte Carlo samples contain some useful information about the failure domain that can be utilized. To keep the notation simple, assume that are arranged in the decreasing order, i.e. (it is always possible to achieve this by renumbering if needed). Then, and are, respectively, the closest to failure and the safest samples among , since and are the largest and the smallest responses. In general, the smaller , the closer to failure the sample is. This is shown schematically in Fig. 1.
Let be any number such that is integer. By analogy with (4), define the first intermediate failure domain as follows:
(20) 
where
(21) 
In other words, is the set of inputs that lead to the exceedance of the relaxed threshold . Note that by construction, samples belong to , while do not. As a consequence, the Direct Monte Carlo estimate for the probability of which is based on samples is automatically equal to ,
(22) 
The value is often used in the literature, which makes a relatively frequent event. Fig. 2 illustrates the definition of .
The first intermediate failure domain can be viewed as a (very rough) conservative approximation to the target failure domain . Since , the failure probability can be written as a product:
(23) 
where is the conditional probability of given . Therefore, in view of (22), the problem of estimating is reduced to estimating the conditional probability .
In the next stage, instead of generating samples in the whole input space (like in DMC), the SS algorithm aims to populate . Specifically, the goal is to generate samples from the conditional distribution
(24) 
First of all, note that samples not only belong to , but are also distributed according to . To generate the remaining samples from , which, in general, is not a trivial task, Subset Simulation uses the socalled Modified Metropolis algorithm (MMA). MMA belongs to the class of Markov chain Monte Carlo (MCMC ) algorithms Liu; Robert, which are techniques for sampling from complex probability distributions that cannot be sampled directly, at least not efficiently. MMA is based on the original Metropolis algorithm MA and specifically tailored for sampling from the conditional distributions of the form (24).
v.1 Modified Metropolis algorithm
Let be a sample from the conditional distribution . The Modified Metropolis algorithm generates another sample from as follows:

Generate a “candidate” sample :
For each coordinate
Sample , where , called the proposal distribution, is a univariate PDF for centered at with the symmetry property . For example, the proposal distribution can be a Gaussian PDF with mean and variance ,
(25) or it can be a uniform distribution over , for some .

Compute the acceptance ratio
(26) 
Define the coordinate of the candidate sample by accepting or rejecting ,
(27)


Accept or reject the candidate sample by setting
(28)
The Modified Metropolis algorithm is schematically illustrated in Fig. LABEL:Malg.