Earthquake Forecasting Based on Data Assimilation:
Sequential Monte Carlo Methods for Renewal Processes
1 Swiss Seismological Service, Institute of Geophysics, ETH Zurich, Switzerland. (email@example.com)
2 Department of Atmospheric and Oceanic Science, Center for Scientific Computation and Mathematical Modeling, Institute for Physical and Scientific Technology, University of Maryland, College Park, USA.
3 Department of Management, Technology and Economics, and Department of Earth Sciences, ETH Zurich, Switzerland.
In meteorology, engineering and computer sciences, data assimilation is routinely employed as the optimal way to combine noisy observations with prior model information for obtaining better estimates of a state, and thus better forecasts, than can be achieved by ignoring data uncertainties. Earthquake forecasting, too, suffers from measurement errors and partial model information and may thus gain significantly from data assimilation. We present perhaps the first fully implementable data assimilation method for earthquake forecasts generated by a point-process model of seismicity. We test the method on a synthetic and pedagogical example of a renewal process observed in noise, which is relevant to the seismic gap hypothesis, models of characteristic earthquakes and to recurrence statistics of large quakes inferred from paleoseismic data records. To address the non-Gaussian statistics of earthquakes, we use sequential Monte Carlo methods, a set of flexible simulation-based methods for recursively estimating arbitrary posterior distributions. We perform extensive numerical simulations to demonstrate the feasibility and benefits of forecasting earthquakes based on data assimilation. In particular, we show that the forecasts based on the Optimal Sampling Importance Resampling (OSIR) particle filter are significantly better than those of a benchmark forecast that ignores uncertainties in the observed event times. We use the marginal data likelihood, a measure of the explanatory power of a model in the presence of data errors, to estimate parameters and compare models.
- 1 Introduction
- 2 Data Assimilation and Probabilistic Earthquake Forecasting
3 Method: Sequential Monte Carlo Methods for Renewal Processes
- 3.1 Bayesian Data Assimilation of State-Space or Hidden Markov Models (HMMs)
- 3.2 Renewal Processes as Forecast Models
- 3.3 Sequential Monte Carlo Methods
- 3.4 Particle Filters and their Numerical Algorithms
- 3.5 Parameter Estimation
4 Numerical Experiments and Results
- 4.1 Experiment Design
- 4.2 Particle Filtering
- 4.3 Parameter Estimation
- 5 Conclusion
In dynamical meteorology, the primary purpose of data assimilation has been to estimate and forecast as accurately as possible the state of atmospheric flow, using all available appropriate information [Talagrand, 1997]. Recent advanced methods of data assimilation attempt to include the effects of uncertainties explicitly in the estimation by taking probabilistic approaches. Kalnay  defines data assimilation as a statistical combination of observations and short-range forecasts. According to Wikle and Berliner , data assimilation is an approach for fusing data (observations) with prior knowledge (e.g., mathematical representations of physical laws or model output) to obtain an estimate of the distribution of the true state of a process. To perform data assimilation, three components are required: (i) a statistical model for observations (i.e., a data or measurement model) and (ii) an a priori statistical model for the state process (i.e., a state or process model), which may be obtained through a physical model of the time-evolving system, and (iii) a method to effectively merge the information from (i) and (ii).
Both data and model are affected by uncertainty, due to measurement and model errors and/or stochastic model elements, leading to uncertain state estimates that can be described by probability distributions. Data assimilation is therefore a Bayesian estimation problem: the prior is given by model output (a forecast from the past) and the likelihood by the measurement error distribution of the data. The posterior provides the best estimate of the true state and serves as initial condition for a new forecast. The essence of data assimilation is to inform uncertain data through the model, or, equivalently, to correct the model using the data. The cycle of predicting the next state and updating, or correcting this forecast given the next observation, constitutes sequential data assimilation (see [Daley, 1991; Ghil and Malanotte-Rizzoli, 1991; Ide et al., 1997; Talagrand, 1997; Kalnay, 2003] for introductions to data assimilation and [Tarantola, 1987; Miller et al., 1999; Pham, 2001; Wikle and Berliner, 2007] for a Bayesian perspective).
Although data assimilation is increasingly popular in meteorology, climatology, oceanography, computer sciences, engineering and finance, only a few partial attempts, reviewed in section 2.1, have been made within the statistical seismology community to use the concept for seismic and fault activity forecasts. But earthquake forecasting suffers from the same issues encountered in other areas of forecasting: measurement uncertainties in the observed data and incomplete, partial prior information from model forecasts. Thus, basing earthquake forecasting on data assimilation may provide significant benefits, some of which we discuss in section 2.2. Indeed, there is a growing perception in the seismological and geophysical community that data assimilation should be used in earthquake forecasts. For instance, at the inaugural meeting of the ACES international cooperation on earthquake simulations (http://www.aces.org.au/), it was concluded that there is a strong need for a gradual evolution from the well-developed concept of geophysical data inversion to the emerging one of data assimilation, for which no generally accepted method exists in the domain of earthquake modeling.
There are perhaps two major challenges for developing data assimilation methods for earthquake forecasts: Firstly, seismicity is often modeled by point-processes, and secondly, earthquake statistics are far from Gaussian. Neither are typical assumptions in practical data assimilation methods. To explain the first issue, data assimilation is often cast in terms of discrete-time state-space models, or Hidden Markov models (HMMs), reflecting the underlying physics-based stochastic differential equations [Daley, 1991; Kalnay, 2003; Künsch, 2001; Cappé et al., 2005; Doucet et al., 2001]. An HMM is, loosely speaking, a Markov chain observed in noise [Doucet et al., 2001; Durbin and Koopman, 2001; Künsch, 2001; Robert and Casella, 2004; Cappé et al., 2005, 2007]: An HMM consists of an unobserved Markov (state) process and an associated, conditionally independent observation process (both process being potentially nonlinear/non-Gaussian; see section 3.1 for precise definitions). The Kalman filter is an archetypical assimilation method for such a model [Kalman, 1960; Kalman and Bucy, 1961]. In contrast, earthquake catalogs have many features which make them uniquely distinct from the forecast targets in other disciplines. Indeed, seismicity models are usually stochastic point processes [Daley and Vere-Jones, 2003; Vere-Jones, 1970, 1995; Kagan and Knopoff, 1987; Ogata, 1998; Helmstetter and Sornette, 2002; Rhoades and Evison, 2004; Kagan and Jackson, 2000], which are completely different from the noisy differential or finite difference equations decorated by noise of standard data assimilation methods. There seems to exist little statistical work that extends the idea of data assimilation or state filtering to point processes, which model the stochastic point-wise space-time occurrence of events along with their marks.
The second challenge, that of non-Gaussian probability distributions, has been solved to some extent by recent Monte Carlo methods, especially for models with a small number of dimensions [Liu, 2001; Doucet et al., 2001; Robert and Casella, 2004]. In particular, Sequential Monte Carlo (SMC) methods, a set of simulation-based methods for recursively estimating arbitrary posterior distributions, provide a flexible, convenient and (relatively) computationally-inexpensive method for assimilating non-Gaussian data distributions into nonlinear/non-Gaussian models [Doucet et al., 2001; Durbin and Koopman, 2001; Künsch, 2001; Robert and Casella, 2004; Cappé et al., 2005, 2007]. Also called particle filters, SMC filters have been particularly successful at low-dimensional filtering problems for the family of HMMs or state-space models.The Kalman-Lévy filter [Sornette and Ide, 2001] provides an analytic solution extending the Kalman filter for Lévy-law and power-law distributed model errors and data uncertainties. We present an overview of SMC methods in sections 3.3 and 3.4.
The main purpose of this article is to develop an implementable method for forecasting earthquakes based on data assimilation. We test this sequential method on a pedagogical and synthetic example of a simulated catalog of “observed” occurrence times of earthquakes, which are not the “true” event times because of observational errors. We specifically use a point-process as our model of seismicity. To estimate arbitrary posterior distributions of the “true” event times, we use the SMC methods we just mentioned.
Our technique offers a step towards the goal of developing a “brick-by-brick” approach to earthquake predictability [Jordan, 2006; Jackson, 1996; Kagan, 1999], given the enormous difficulties in identifying reliable precursors to impending large earthquakes [Geller, 1997; Geller et al., 1997; Kagan, 1997]. With suitable adaptations and extensions, our approach should find its natural habitat in the general testing framework developed within the Regional Earthquake Likelihood Models (RELM) Working Group [Field, 2007; Schorlemmer et al., 2007, 2009] and the international Collaboratory for the Study of Earthquake Predictability (CSEP) [Jordan, 2006; Zechar et al., 2009], in which forecast-generating models are tested in a transparent, controlled, reproducible and fully prospective manner.
The importance of data uncertainties, forecast specification and evaluation for earthquake predictability experiments were highlighted by several recent studies. Werner and Sornette  showed that measurement errors in magnitudes have serious, adverse effects on short-term forecasts performed using a general class of models of clustered seismicity, including two of the most popular models, the Short Term Earthquake Probabilities (STEP) model [Gerstenberger et al., 2005] and the Epidemic-Type Aftershock Sequence (ETAS) model [Ogata, 1988]. Moreover, Werner and Sornette  showed that the RELM evaluation tests are not appropriate for the broadened forecast distributions that arise from taking into account uncertainties in data, and recommended that forecasts should be replaced by a full distribution. Schorlemmer et al.  confirmed and supported this recommendation after examining first results from the five-year (i.e. long-term) RELM forecast competition. The methods used in this article for evaluating point-process forecasts when the observations are noisy provide an alternative to the current forecast evaluation method used in RELM and CSEP.
Data and parameter uncertainties also play a crucial role in the ongoing debate about the relevance of the seismic gap hypothesis [McCann et al., 1979; Nishenko, 1991; Kagan and Jackson, 1991, 1995; Rong et al., 2003; McGuire, 2008], of models of characteristic earthquakes [Wesnousky, 1994; Bakun et al., 2005; Scholz, 2002; Kagan, 1993], and of recurrence statistics of earthquakes on a particular fault segment inferred from paleoseismic data records [Bakun et al., 2005; Davis et al., 1989; Rhoades et al., 1994; Ogata, 1999, 2002; Sykes and Menke, 2006; Parsons, 2008]. The data are often modeled using renewal processes, and studies investigating data and parameter uncertainty confirmed that any model inference or forecast must take into account uncertainties [Davis et al., 1989; Rhoades et al., 1994; Ogata, 1999, 2002; Sykes and Menke, 2006; Parsons, 2008].
In this article, we focus on the class of renewal processes as models of seismicity. On the one hand, renewal processes are extensively used to model paleoseismic data records, characteristic earthquakes and seismic gaps, as mentioned above. On the other hand, renewal processes are the point-process analog of Markov chains, thereby enabling us to use sequential Monte Carlo methods developed for state-space models. In other words, renewal processes are the simplest class of point process models relevant to statistical seismology. By developing rigorously a data assimilation procedure for renewal processes, we aim at providing the building blocks for more complicated models. In addition to the obvious relevance to earthquake forecasts, we hope to generate interest among statisticians to tackle the general problem of state filtering for point processes, for which the Markovian state-space model framework seems too restrictive.
The article is structured as follows. Section 2 provides a brief literature review of data assimilation in connection with statistical seismology and points out potential benefits of data assimilation to earthquake forecasting. Section 3 introduces the methods we believe are relevant in the seismicity context, Section 3.1 provides the notation and basic Bayesian estimation problem we propose to solve for renewal processes. Section 3.2 defines renewal processes, which serve as our forecast models. Section 3.3 explains the basics of Sequential Monte Carlo methods. In section 3.4, we describe three particular SMC filters (often called particle filters) in order of sophistication. To perform model inference, we must estimate parameters, which is described in section 3.5. Section 4 describes numerical experiments to demonstrate how earthquake forecasting based on data assimilation can be implemented for a particular renewal process, where inter-event times are lognormally distributed. Section 4.1 describes the set-up of the simulations: we use a lognormal renewal process of which only noisy occurrence times can be observed. In section 4.2 we use particle filters to estimate the actual occurrence times, demonstrating that the particle filters improve substantially on a forecasting method that ignores the presence of data uncertainties. In section 4.3, we show that parameter estimation via maximum (marginal) likelihood is feasible. We conclude in section 5.
2 Data Assimilation and Probabilistic Earthquake Forecasting
2.1 Literature on Probabilistic Earthquake Forecasting and Data Assimilation
The general concepts of data assimilation or Hidden Markov models (HMMs) state inference are relatively new to earthquake seismology and statistical earthquake modeling. The few studies that are related can be separated into three categories. (i) Varini [2005, 2008] studied a HMM of seismicity, in which the (unobserved) state could be in one of three different states (a Poisson process state, an ETAS process state and a stress-release process state) and the observational data were modeled according to one of the three processes. Varini did not consider measurement uncertainties of the data. (ii) Grant and Gould  proposed data formats and standards for the assimilation of uncertain paleoseismic data into earthquake simulators. Aalsburg et al.  assimilated uncertain paleoseismic data into “Virtual California”, a fixed-geometry earthquake simulator of large earthquakes: model runs are accepted or rejected depending on whether simulated earthquakes agree with the paleoseismic record. (iii) Rhoades et al.  calculated seismic hazard on single fault segments by averaging the hazard function of a renewal process over parameter and data uncertainties, achieved by sampling over many parameter and data samples. Ogata  presented a Bayesian approach to parameter and model inference on uncertain paleoseismic records, closely related to our approach. Data uncertainties were represented with either a uniform or a triangular distribution. To compute the integrals, Ogata seems to have used numerical integration, a process that becomes increasingly difficult as the number of events increases, in contrast to the particle filters that we use below. Sykes and Menke  assumed Gaussian data errors and uncorrelated recurrence intervals, also providing a maximum likelihood estimation procedure for the parameters of a lognormal process based on a Monte Carlo integration approach. Parsons  provided a simple but inefficient Monte Carlo method for estimating parameters of renewal processes from paleoseismic catalogs.
2.2 Why base earthquake forecasting on data assimilation?
The brief review of the literature indicates that the potential of data assimilation for earthquake forecasting has not been exploited. Some of the potential benefits are summarized here:
Earthquake forecasting under observational uncertainties: The current surge in earthquake predictability experiments [Field, 2007; Jordan, 2006; Schorlemmer et al., 2007, 2009; Zechar et al., 2009] provides strong motivational grounds for developing earthquake forecasting methods that are robust with respect to observational uncertainties in earthquake catalogs.
Seismic hazard calculations under observational uncertainties: On a practical level, the inclusion of uncertainties for seismic hazard calculations may provide better scientific support for decision-making for the insurance industry, disaster agencies and risk mitigation strategies.
Scientific hypothesis testing under observational uncertainties: Observational uncertainties may bias conclusions regarding scientific hypotheses. The only way to accurately provide confidence limits to test our ideas about earthquakes is to take uncertainties into account – data assimilation provides the ideal vehicle for such a systematic treatment.
Model inference: Data assimilation can be used as a framework for likelihood-based model testing and development, fully accounting for uncertainties.
Near-Real Time Forecasting: Once new observations become available, data assimilation provides a vehicle for correcting an existing forecast without having to re-calibrate and re-initialize the model on the entire data set. For complex, high-dimensional models, this reduced computational burden may allow near-real-time forecasts.
Estimating physical quantities of physics-based models from seismicity: In its general formulation as a state and parameter estimation problem, data assimilation may also be viewed as a method for estimating physical quantities (“states”) and model parameters, directly related to physics-based models, such as rate-and-state friction and Coulomb stress-change models.
Optimally integrating different types of data for earthquake forecasts: The models submitted to the five-year forecast competition of RELM contain a variety of data from which forecasts are generated. In the future, the coupled integration of several types of different data is highly desirable. Numerical weather prediction has a long history of integrating such different types of data – statistical seismology may be able to adapt these methods.
Developing statistical theory and methodology for inference of point processes under observational data uncertainties: The theory of point processes has so far largely focused on exact data. The development of the statistical theory and practical methodology for taking into account noisy observations is therefore interesting for applications beyond earthquake forecasting.
3 Method: Sequential Monte Carlo Methods for Renewal Processes
3.1 Bayesian Data Assimilation of State-Space or Hidden Markov Models (HMMs)
In this section, we state the general problem of Bayesian data assimilation that will be solved for specific model and observation assumptions in section 4. The presentation borrows from [Doucet et al., 2000, 2001] and [Arulampalam et al., 2002]; see also [Künsch, 2001; Robert and Casella, 2004; Cappé et al., 2005, 2007; Wikle and Berliner, 2007] and references therein.
We restrict ourselves to signals modeled as Hidden Markov Models (HMMs), i.e. Markovian, nonlinear, non-Gaussian state-space models. The unobserved signal (the hidden states) is modeled as a Markov process ( may be a vector). The initial state has initial distribution . The transition from to is governed by a Markov transition probability distribution . The observations are assumed to be conditionally independent given the process and of conditional distribution (the observations may also be vectors, in general of different dimension than the state). The model can be summarized by
|Conditional data likelihood:||(3)|
We denote and . The problem statement is then as follows: the aim is to estimate sequentially in time the posterior distribution . We may also be interested in estimating the marginal distribution , also known as the filtering distribution, and the marginal complete data likelihood , which we will use for parameter estimation.
At any time , the posterior distribution is given by Bayes’ theorem
A recursive or sequential formula can be derived from (i) the Markov property of the state process and (ii) the independence of observations given the state:
where is given by
The marginal distribution also satisfies the following recursion:
Expressions (7) and (8) are the essential steps in sequential data assimilation. Using the last update (the posterior, also often called analysis) as initial condition, the Chapman-Kolmogorov (prediction) equation (7) is used to forecast the state at the next time step. When observations become available, they are assimilated into the model forecast by the update equation (8). This cycle constitutes sequential data assimilation of state-space or Hidden Markov models. The problem appears in other research fields under different guises, e.g. Bayesian, optimal, nonlinear or stochastic filtering, or online inference and learning [Doucet et al., 2001; Cappé et al., 2005].
In general, there may be unknown parameters in the model forecast distribution that need to be estimated. We assume that the parameters of the conditional data likelihood are known, since they should be characterized by the measurement process and its associated uncertainties. Several parameter estimation techniques exist; we will focus on maximizing the marginal complete data likelihood, the denominator in Bayes’ theorem:
Equation (9) provides a measure of how successfully a particular model is explaining the data. The marginal complete data likelihood is the analog of the traditional likelihood function, but generalized to noisy observational data. This, in turn, implies that different models may be compared and tested for their consistency with observed data, while explicitly acknowledging data uncertainties. In other words, (earthquake) forecasts may be evaluated based on this measure.
Only in very special cases are the prediction and update equations (7) and (8) amenable to analytical solutions. In the case of a linear Gaussian state-space model, the widespread Kalman filter [Kalman, 1960; Kalman and Bucy, 1961] calculates exactly the posterior distributions. Much of filtering theory and data assimilation has been concerned with identifying useful, suitable and computationally inexpensive filters for a variety of particular problems. For instance, the extended Kalman filter performs a local tangent linearization of nonlinear model and observation operators for nonlinear problems. The Kalman-Lévy filter [Sornette and Ide, 2001] generalizes the Kalman filter to Lévy-law and power-law distributed model and data uncertainties. In other cases, numerical integration may be possible, or approximate grid-based methods, e.g. HMM filters, may be convenient. The ensemble Kalman filter [Evensen, 1994] is a Monte Carlo approach to the nonlinear extension of the Kalman filter by introducing an ensemble of particles with equal weights, each evolved individually, to approximate distributions. The general, nonlinear, non-Gaussian, sequential Bayesian estimation problem, however, seems best solved with sequential Monte Carlo methods whenever the model’s dimensionality is small.
3.2 Renewal Processes as Forecast Models
Data assimilation is an iterative method that involves two steps, forecast (7) and analysis (8), in each cycle. To formulate the data assimilation problem for earthquakes, we use a renewal point process as the model in the forecast. Renewal point processes are characterized by intervals between successive events that are identically and independently distributed according to a probability density function that defines the process [Daley and Vere-Jones, 2004]. Examples of such a probability density function (d.f.) include the lognormal d.f., the exponential d.f., the gamma d.f., the Brownian passage time d.f. and the Weibull d.f. The time of the next event in a renewal process depends solely on the time of the last event:
where is the interval between events. The time of the event corresponds to the model state in data assimilation. Renewal point processes provide prior information for the analysis, which we will discuss next.
3.3 Sequential Monte Carlo Methods
Earthquake statistics clearly violate Gaussian approximations in terms of their temporal, spatial and magnitude occurrences, so much so that approximate algorithms based on local Gaussian approximations (e.g. the extended Kalman filter) are unlikely to produce good results. Furthermore, the continuous state space of seismicity rules out methods in which that space is assumed to be discrete (such as grid-based methods). This leaves us with numerical integration techniques and Monte Carlo methods. The former are numerically accurate but computationally expensive in problems with medium to high dimensionality.
Sequential Monte Carlo (SMC) methods bridge the gap between these cost-intensive methods and the methods based on Gaussian approximations. They are a set of simulation-based methods that provide a flexible alternative to computing posterior distributions. They are applicable in very general settings, parallelisable and often relatively easy to implement. Early methods were developed in the 70s, but only with the advent of cheap computational power in the mid 90s did they become a widespread tool. Since then, SMC methods have been applied in target tracking, financial analysis, diagnostic measures of fit, missing data problems, communications and audio engineering, population biology, neuroscience, and many more. SMC methods are also known under the names of particle filters, bootstrap filters, condensation, Monte Carlo filters, interacting particle approximations and survival of the fittest. Good introductions can be found in [Arulampalam et al., 2002; Cappé et al., 2005, 2007; Doucet et al., 2000, 2001; Künsch, 2001; Liu, 2001; Liu and Chen, 1998] and in Chapter 6 of De Freitas .
Sequential Monte Carlo filters use the techniques of Monte Carlo sampling, of (sequential) importance sampling and of resampling, which we describe briefly below before defining three particular particle filters.
3.3.1 Monte Carlo Sampling
In Monte Carlo (MC) simulation [Liu, 2001; Robert and Casella, 2004], a set of weighted “particles” (samples) are drawn identically and independently from a distribution, say, a posterior . Then, an empirical estimate of the distribution is given by
where denotes the Dirac mass located at . The essential idea of Monte Carlo sampling is to convert an integral into a discrete sum. One is often interested in some function of the posterior distribution, say its expectation, covariance, marginal or another distribution. Estimates of such functions can be obtained from
This estimate is unbiased. If the posterior variance of is finite, say , then the variance of is equal to . From the law of large numbers,
where a.s. denotes almost sure convergence. That is, the probability that the estimate converges to the “true” value equals one in the limit of infinite number of particles. Furthermore, if the posterior variance , then a central limit theorem holds:
where denotes convergence in distribution and is the normal (Gaussian) distribution with mean zero and variance . The advantage of this perfect Monte Carlo method is therefore that the rate of convergence of the MC estimate is independent of the dimension of the integrand. This stands in contrast to any deterministic numerical integration method, whose rate of convergence decreases with the dimensionality of the integrand.
Unfortunately, because the posterior distribution is usually highly complex, multi-dimensional and only known up to a normalizing constant, it is often impossible to sample directly from the posterior. One very successful solution for generating samples from such distributions is Markov Chain Monte Carlo (MCMC). Its key idea is to generate samples from a proposal distribution, different from the posterior, and then to cause the proposal samples to migrate, so that their final distribution is the target distribution. The migration of the samples is caused by the transition probabilities of a Markov chain (see, e.g., Appendix D of De Freitas ). However, MCMC are iterative algorithms unsuited to sequential estimation problems and will not be pursued here. Rather, SMC methods primarily rely on a sequential version of importance sampling.
3.3.2 Importance Sampling (IS)
Importance Sampling (IS) introduced the idea of generating samples from a known, easy-to-sample probability density function (pdf) , called the importance density or proposal density, and then “correcting” the weights of each sample so that the weighted samples approximate the desired density. As long as the support of the proposal density includes the support of the target density, one can make use of the substitution
to obtain the identity
where is known as the importance weight
Therefore, if one can generate independently and identically distributed samples from the importance density , a Monte Carlo estimate of is given by
where the normalized importance weights are given by
For finite , the estimate is biased, as it is the ratio of two estimates. However, it is possible to obtain asymptotic almost sure convergence and a central limit theorem provided (i) the importance density support contains the posterior density support, and (ii) the expectations of the weights and exist and are finite [Geweke, 1989; De Freitas, 1999].
Thus, the posterior density function can be approximated arbitrarily well by the point-mass estimate
3.3.3 Sequential Importance Sampling (SIS)
In its simplest form, IS is not adequate for sequential estimation. Whenever new data become available, one needs to recompute the importance weights over the entire state sequence. Sequential Importance Sampling (SIS) modifies IS so that it becomes possible to compute an estimate of the posterior without modifying the past simulated trajectories. It requires that the importance density at time admits as marginal distribution at time the importance function :
after iterating, one obtains:
Assuming that the state evolves according to a Markov process and that the observations are conditionally independent given the states, one can obtain
where the normalization is provided by . Equation (24) provides a mechanism for sequentially updating the importance weights. In summary, SIS provides a method to approximate the posterior density function (20) (or some function thereof) sequentially in time without having to draw samples directly from the posterior. All that is required is (i) sampling from the importance density and evaluating it up to some constant, (ii) evaluating the likelihood up to some proportionality constant, (iii) evaluating the forecast up to some constant, and (iv) normalizing the importance weights via . The SIS thus makes sequential Bayesian estimation feasible.
3.3.4 Choice of the Importance Density and Resampling
The problem encountered by the SIS method is that, as increases, the distribution of the importance weights becomes more and more skewed. For instance, if the support of the importance density is broader than the posterior density, then some particles will have their weights set to zero in the update stage. But even if the supports coincide exactly, many particles will over time decrease in weight so that after a few time steps, only a few lucky survivors have significant weights, while a large computational effort is spent on propagating unimportant particles. It has been shown that the variance of the weights can only increase over time, thus it is impossible to overcome the degeneracy problem [Kong et al., 1994]. Two solutions exist to minimize this problem: (i) a good choice of the importance density, and (ii) resampling.
Importance Density: The optimal importance density is given by:
because it can be proven to minimize the variance of the importance weights (see Kong et al.  and Chapter 6 of De Freitas ). However, using the optimal importance density requires the ability to sample from and to evaluate the integral over the new state [Arulampalam et al., 2002; Doucet et al., 2001; De Freitas, 1999]. In many situations, this is impossible or very difficult, prompting the use of other importance densities. Perhaps the simplest and most common choice for the importance density is given by the prior:
which, although resulting in a higher variance of the Monte Carlo estimator, is usually easy to implement. Many other choices are possible, including additional MCMC steps to sample from the importance density and bridging densities and progressive corrections to herd the particles to the important part of the state space [Arulampalam et al., 2002; Doucet et al., 2001; Liu, 2001].
Resampling: Even the optimal importance density will lead to this “degeneracy” of the particles (few important ones and many unimportant ones). One therefore introduces an additional selection or resampling step, in which particles with little weight are eliminated and new particles are sampled in the important regions of the posterior. De Freitas  and Arulampalam et al.  provide an overview of different resampling methods.
Resampling introduces its own problems. Since particles are sampled from discrete approximations to density functions, the particles with high weights are statistically selected many times. This leads to a loss of diversity among the particles as the resultant sample will contain many repeated points. This is known as “sample impoverishment” [Arulampalam et al., 2002] and is severe when the model forecast is very narrow or deterministic. There are various methods to deal with this problem, including sophisticated methods that recalculate past states and weights via a recursion and MCMC methods, the Resample-Move algorithm and the Regularized Particle Filter (RPF). These filters will not be necessary here because of the broad and highly stochastic model forecast.
Because of the additional problems introduced by resampling, it makes sense to resample only when the variance of the weights has decreased appreciably. A suitable measure of degeneracy of an algorithm is the effective sample size introduced by Liu and Chen  and defined by
where is referred to as the true weight. This may not be available, but an estimate can be obtained as the inverse of the so-called Participation Ratio [Mézard, 1987] (or Herfindahl index [Polakoff, 1981; Lovett, 1988]):
Thus, resampling can be applied when falls below a certain threshold .
3.4 Particle Filters and their Numerical Algorithms
In this section, we define three particle filters, characterized by particular choices for the importance density and the resampling strategy. The presentation and the pseudo-codes closely follow Arulampalam et al. . More on particular particle filters can be found in [Arulampalam et al., 2002; De Freitas, 1999; Doucet et al., 2000, 2001; Liu, 2001; Cappé et al., 2005, 2007]. The filters are:
The Simple Sequential Importance Sampling (SSIS) particle filter: The simplest particle filter, it uses the prior given by equation (26) as the (sub-optimal) importance density and does not include a resampling step.
The Optimal Sequential Importance Sampling (OSIS) particle filter: This filter improves on the SSIS by using the optimal importance sampling density (25), but does not include resampling.
The Optimal Sequential Importance Resampling (OSIR) particle filter: This filter improves on the SIS filters by including a resampling step to counteract the degeneracy of particles. The importance density may either be the prior or the optimal importance density.
In all particle filters, the prior is obtained by random draw for individual particles using the forecast model, i.e. the renewal point process defined by (10).
3.4.1 Simple Sequential Importance Sampling (SSIS) Filter
The Simple SIS (SSIS) particle filter is characterized by a lack of resampling and by choosing the prior as the importance density:
where is simply the likelihood and the weights are normalized by
This filter is simple and easy to implement. However, if the likelihood has a much narrower support than the importance density, then the weights of many particles will be set to zero so that only few active particles are left to approximate the posterior. Depending on the overlap of the support of the two density functions, this particle filter may quickly degenerate in quality.
3.4.2 Optimal Sequential Importance Sampling (OSIS) Filter
The Optimal Simple SIS (OSIS) improves on the SSIS by using the optimal sampling density:
Weights are normalized as in equation (31). As was already mentioned, the optimal density suffers from two difficulties: (i) generating samples from the posterior (32), and (ii) calculating the integral in (33).
3.4.3 Optimal Sampling Importance Resampling (OSIR) Filter
To counter the inevitable problem of particle degeneracy, we can use resampling to generate a new set of particles from the (discrete) posterior. Setting the importance density equal to the optimal importance density as in the OSIS particle filter described above, we recover the Optimal Sampling Importance Resampling (OSIR) algorithm given by Algorithm 3.
In the literature, the OSIR filter is usually an implementation with the prior as the (suboptimal) importance density. Such a filter is called the “bootstrap” filter by Doucet et al. .
There are many methods to resample from the posterior (see Doucet et al.  or Chapter 6 of De Freitas  for a discussion of methods, and Arulampalam et al.  for a brief overview). The basic idea is to eliminate particles that have small weights and to concentrate on particles with large weights. It involves generating a new set of particles and associated weights by resampling (with replacement) times from an approximate discrete representation of the posterior. The resulting sample is an independently and identically distributed sample so that the weights are reset to . The method of choice of Arulampalam et al.  is systematic resampling since “it is easy to implement, takes time and minimizes the Monte Carlo variation.” Its operation is described in Algorithm 4, where is the uniform distribution on the interval . For each resampled particle , this resampling algorithm also stores the index of its parents, which is denoted . This is unnecessary and can easily be suppressed, but may be useful in some situations.
While there are of course many more particle filters, each suited to particular applications, we have here presented the standard algorithms. For more advanced particle filters, see for instance [Arulampalam et al., 2002; De Freitas, 1999; Doucet et al., 2000] and references therein. The particle filters described above will be used below for seismicity models based on point processes.
3.5 Parameter Estimation
Parameter estimation techniques within sequential Monte Carlo methods are discussed by, e.g., [Doucet et al., 2001; Künsch, 2001; Andrieu et al., 2004; Cappé et al., 2005, 2007]. The methods are either online-sequential or offline-batch methods. For simplicity, we will restrict this section to one particular technique, based on the offline or batch technique of maximizing (an MC estimate of) the complete marginal data likelihood defined in equation (9). The presentation follows Doucet et al. .
We assume that both the Markov transition kernel and the conditional data likelihood, defined by equations (2) and (3), respectively, also depend on an unknown, static parameter vector . Moreover, we assume the marginal likelihood admits a sequential formulation:
where the individual predictive likelihoods are defined as
These can be estimated from the weighted particles as
where are the unnormalized weights at the time step. Expression (24) is used to go from the second to the third approximate equality.
The log-likelihood is therefore given by
Maximizing the sum of the normalized weights given by expression (39) with respect to the parameter set results in the maximum likelihood estimator :
Doucet et al. ; Andrieu et al. ; Cappé et al. [2005, 2007] and Olsson and Rydén  consider the estimator’s statistical properties. To find the maximum of the log-likelihood in equation (40), one may use the standard optimization algorithms, such as gradient-based approaches, the expectation-maximization algorithm, or random search algorithms such as simulated annealing, genetic algorithms, etc (see, e.g., Sambridge and Mosegaard ). In our parameter estimation experiments (see section 4.3), we chose a combination of a coarse direct grid-search method and a pattern search method to refine the coarse estimate [Hooke and Jeeves, 1961; Torczon, 1997; Lewis and Torczon, 1999].
When the number of particles is small, the log-likelihood function will suffer from Monte Carlo noise. Techniques exist to smooth the function in order to find the maximum in these situations (see references above). We were able to perform large particle ensemble computations ( particles) in our numerical experiments below, so that the Monte Carlo noise was significantly reduced and we did not need to smooth the likelihood function.
4 Numerical Experiments and Results
In this section, we present a simple, pedagogical example of earthquake forecasting based on data assimilation. Our model is the 1-dimensional, temporal lognormal renewal process (section 4.1.1): the simplest point process, which nevertheless draws much interest in earthquake seismology and seismic hazard, as mentioned above. We assume the process is observed in noise, i.e. the unobservable true occurrence times are perturbed by (additive) identically and independently distributed noise (section 4.1.2). The aim of this section is to show an example of how data assimilation provides better forecasts, as measured by the likelihood gain, than a forecast (“the benchmark”) which ignores the data errors (assumes the observed times are the true times). We will compare the performance of the three particle filters defined in section 3.4 against each other, and measure their skill against the benchmark (section 4.2). Finally, in section 4.3 we will use maximum likelihood estimation to obtain parameter estimates using both the particle filters and the benchmark. The results in this section thereby demonstrate that data assimilation can help make earthquake forecasting and validating robust with respect to observational data errors.
4.1 Experiment Design
4.1.1 The Forecast Model: Lognormal Renewal Process
Motivated by its relevance to paleoseismology, seismic hazard and the characteristic earthquake debate, we use a lognormal renewal process as our forecast model of intervals between subsequent earthquakes (section 3.2):
where the parameters and may need to be estimated. In the notation of section 3, using a physically meaningful for the state variable instead of , the lognormal distribution of the intervals is the transition kernel of the HMM defined in equation (2):
For our pedagogical example, we set the parameters to
Figure 1 shows the lognormal distribution (solid black curve) with these parameter values.
4.1.2 The Observations: Noisy Occurrence Times
We suppose that the -th observed occurrence time is a noisy perturbation of the “true” occurrence time :
where is an additive noise term distributed according to some distribution . For our numerical experiments below, we choose for simplicity the uniform distribution:
where is the Heaviside step function. Substituting gives the density (conditional likelihood) of the data given the true occurrence time, defined by equation (3):
We set the parameter to
Figure 1 shows the lognormal distribution (the model; solid black curve) and the error distribution (the conditional likelihood; dashed rectangular magenta line) for the parameters defined by expressions (43) and (47).
4.1.3 Initial Condition and Observation Period
We assume for simplicity that the period over which the point process is observed begins with an event at . We further assume that the true and observed occurrence times of this first event coincide, so that our initial condition is a delta function at . This assumption can be relaxed: Ogata  provided the relevant equations. One can even predict backwards in time to estimate the time of the last event before the observation period began.
We assume that the observation period ends with the last observed event . Again, an open interval after the last observed event can be added if needed [Ogata, 1999].
4.1.4 Simulation Procedure
In this entirely simulated example, we begin by generating the “true” (unobservable) process. We generate random samples from the lognormal distribution given by equation (42) to obtain the sequence of true event times . Next, we simulate the observed process by generating random samples from the uniform conditional likelihood given by equation (46) to obtain the sequence of observed event times .
To perform the particle filtering, we initialize particles at the exactly known . To forecast , we propagate each particle through the model kernel (42). Given the observation and the model forecast, we use one of the particle filters described in section 3.4 to obtain the analysis/posterior of . The weighted particle approximation of the posterior is then used to forecast according to equation (7). This cycle is repeated until the posteriors of all occurrence times are computed.
4.1.5 Optimal Importance Density
Our simple choice of the uniform conditional likelihood given by equation (46) allows us to use the optimal importance density defined in equation (32) to sample from the posterior. As stated in section 3.4.2, we need to overcome two difficulties in order to use the optimal importance density: (i) to sample from expression (32), and (ii) to calculate the integral in (33). To sample from (32), we generate a uniformly distributed random variable in the bounded interval defined by the quantiles that correspond to and . Then, we invert the prior cumulative distribution function to find a sample . In other words, for each particle , we generate samples from the lognormal distribution , but only in the interval as permitted by the boxcar likelihood function. Secondly, the integral can be transformed into error functions, since we need to integrate the lognormal distribution over a bounded interval defined by the boxcar likelihood. The error functions, in turn, can be easily calculated by standard numerical functions.
4.2 Particle Filtering
This section presents examples of the forecast and posterior distributions using a large number of particles (). We also compare the performance of the three particle filters SSIS, OSIS, and OSIR, defined in section 3.4, against each other and against the benchmark, which entirely neglects the presence of data uncertainties. We assume in this section that the parameters are known.
Figure 2 shows an example of the forecast distributions of the benchmark and of the OSIR particle filter. The benchmark assumes that the observed events correspond to the previous “true” events without data errors. Therefore, the forecast distribution of the next occurrence time is simply the lognormal distribution, as opposed to the particle filter forecasts, which are broadened due to the integration of the lognormal distribution over the uncertainty in the last occurrence time. As a result, the benchmark forecast is artificially sharper than the filter forecasts. In some cases, the sharper peak may lead to higher likelihoods – but the benchmark will pay heavily in the case when the observed event is in the tails of its sharp forecast. In those cases, the broaden particle filter forecasts will more than recuperate. Section 4.2.4 compares the likelihood scores and gains of the benchmark and the particle filters.
Figure 2 also shows Bayes’ theorem at work using the OSIR filter: The weighted particle approximation of the forecast (dashed blue) is combined with the conditional likelihood (dash-dotted magenta) to generate an estimate of the posterior of the true occurrence time (solid red). The posterior (or analysis) is the best possible estimate of the true occurrence time, given a priori model information and the data. The posterior is used as the initial condition for the next forecast.
Note that the posterior’s support is limited by the boxcar data likelihood function. However, because of the finite width of the kernel we chose to represent the weighted particles, the posterior seems to extend just beyond the sharp edges. More sophisticated kernels should be able to solve this minor visualization issue.
4.2.3 Comparison of Particle Filters
As discussed in section 3.3.4, particle filters without resampling will eventually collapse, because of a highly skewed distribution of weights: a few particles will carry most of the weight while the majority carry little or no weight. Figure 3 shows the evolution of the estimated effective sample size (), the measure of the skewness of the weight distribution defined in equation (28), for a simulation of 100 events, for each of the three filters.
Since the SSIS uses the model prior as sampling density, many of the particles fall outside the boxcar likelihood and their weights are set to zero, according to equation (30). Thus, fewer and fewer particles are used to represent the posterior, resulting in a deteriorating representation of the posterior distributions. Figure 4 shows kernel density visualizations of the weighted particle approximations of the posterior distributions of the first 5 unobservable “true” occurrence times. The SSIS shows signs of departures from the OSIS and the OSIR quickly, until the SSIS filter finally “dies” after event 17 (see Figure 5). Nevertheless, the SSIS offers the advantage of a very simple filter, which can be used effectively for a few time steps.
In the OSIS filter, particles are propagated using the optimal importance density, e.g., equation (33), so that their weights cannot be set to zero as in the SSIS filter. Nevertheless, the increasingly skewed distribution of the particle weights leads to poor representations of the posteriors. Figure 5 shows first signs of problems after event 16: spikes in the distribution belie the presence of “heavy-weight” particles. Figure 6 shows the posteriors of the true occurrence times for events 500 to 505. A few heavy-weights completely dominate the weight distribution, resulting in spikes in the posterior. Since the particles never “die”, the OSIS filter survives indefinitely, but the posterior representations progressively worsen as measured by the effective sample size in Figure 3. Nevertheless, the OSIS filter, through its use of the optimal importance density, is a significant improvement on the SSIS filter and may be very useful for small data sets.
The OSIR filter re-samples the particles to equalize the particle weights whenever the estimated effective sample size falls below a certain threshold, i.e., , as described in section 3.4.3. In our experiments, we chose to be one third of the particle number . The effective sample size, and thus the filter, can therefore be rejuvenated (see Figure 3) to continue to provide good approximations of the posteriors (see Figures 3 to 7). Figure 7 shows the posteriors for events 1000 to 1005.
Theoretically, the posteriors of the three filters should coincide as the number of particles approaches infinity. We have already mentioned the different weight distributions as one reason for the visible differences for a finite number of particles. Another source of differences lies in the inherent Monte Carlo noise, i.e. fluctuations due to the pseudo-random numbers used to generate samples.
4.2.4 Comparison against the Benchmark: Likelihood Scores and Likelihood Gain
To measure the improvement of our “earthquake” forecasts based on data assimilation over the naive benchmark, which ignores data uncertainties, we use the log-likelihood score and other common, likelihood-based measures of earthquake forecasts based on point processes [Daley and Vere-Jones, 2004; Harte and Vere-Jones, 2005; Helmstetter et al., 2006; Kagan, 2007]. However, we extend the measures by taking into account uncertainties (see also [Ogata, 1999]), as suggested by Doucet et al. ; Andrieu et al. ; Cappé et al.  for HMM estimation problems. In particular, we employ the marginal log-likelihood of the data, defined by equation (9), which reflects both the model forecast and the conditional likelihood function (the measurement process). This marginal likelihood is nothing but the denominator in Bayes’ theorem (4), which judges how well the data are explained, assuming both a model forecast and a measurement process.
For the particle filters, the marginal log-likelihood of the data can be approximated by equation (39). The benchmark effectively assumes that the measurement process is perfect, such that the box-car conditional likelihood is replaced by a Dirac function . The benchmark log-likelihood score is thus simply obtained by using the lognormal density function and plugging in the observed occurrence times. Since this is a stochastic prediction problem, it is also of interest to compare these forecasts to the ideal case of having access to the “true” occurrence times. For the “true” process, the log-likelihood score is obtained by using the lognormal distribution and plugging in the “true” event times, again replacing the conditional likelihoods by Dirac functions. Since this will only give the score for one particular realization, we also calculate the average log-likelihood score per event, given by the negative entropy of the lognormal distribution, which is available analytically. In this section, we assume the parameters are known exactly.
Figure 8 shows the evolution of the cumulative (complete) marginal log-likelihood of the data using the OSIS and OSIR particle filters and the benchmark for a simulation of 100 events (since the SSIS filter collapses quickly, we discard it henceforth). The benchmark clearly performs much worse than the particle filters, i.e. the observed data can be explained much better with the particle filters than with the benchmark. This simple observation demonstrates a major potential benefit of data assimilation techniques for earthquake forecasting: Taking realistic uncertainties into account produces better forecasts and explains the data better than ignoring observational errors.
All the particle filters should produce the same log-likelihood scores, at least theoretically as the number of particles approaches infinity. However, as discussed above, the SSIS filter quickly collapses, while the OSIS filter suffers from a strongly skewed distribution of particle weights. For this reason, and because of random Monte Carlo noise, there exists a small difference between the OSIS and the OSIR scores.
To understand how much better the particle filters perform than the benchmark, we also include the log-likelihood scores of the “true” event times, assuming access to the past “true” times. The process being stochastic, each realization will give a different score. The average log-likelihood score of the “true” process, on the other hand, is given by the negative entropy of the lognormal distribution using the “true” occurrence times. No forecast method could perform better (on average) than the “true” process. For the particular realization in Figure 8, the particle filters obtain scores almost half as negative as the benchmark.
To investigate the average performance improvement, we simulated 100 realizations of a 100-event point process. We calculated the mean of the log-likelihood scores at each event index, as shown in Figure 9. Fluctuations are now mostly smoothed out. The mean “true” likelihood scores now match exactly the negative entropy predictions. The OSIR performs slightly better than the OSIS, for the reasons stated above, and significantly better than the benchmark. Since parameters are fixed, the likelihood ratio test would strongly favor the particle filters.
To measure the quality of a point process forecast with respect to a reference forecast, we employ several common measures. The individual probability gain measures the ratio of the likelihood of the observed event under a specific forecast over the likelihood of the same event under a reference forecast :
The individual probability gain measures how much better the event is explained by our particle filter forecast over the naive benchmark forecast. corresponds to no improvement. Since usually log-likelihood scores are used rather than likelihood values, it is common to use the (individual) log-likelihood ratio or log-likelihood gain, defined by:
where is the marginal log-likelihood of event . Now corresponds to no improvement.
The (cumulative) probability gain per earthquake of the proposed forecast with respect to a reference forecast is defined as [Kagan and Knopoff, 1977; Daley and Vere-Jones, 2003, 2004; Harte and Vere-Jones, 2005; Helmstetter et al., 2006; Kagan, 2007]:
where and are the cumulative marginal log-likelihood scores of the proposed model and a reference model, respectively, for the considered events. This measure quantifies the cumulative improvement due to the proposed forecast over a reference forecast. The measure is motivated by its expression as the geometric average of the individual conditionally independent probability gains:
where the product over all events specifies the joint probability density of the entire process under a specific model.
Finally, the (algebraic) average log-likelihood ratio or average log-likelihood gain for events is defined by
In our experiments, the benchmark is the reference forecast, i.e. we directly measure any improvement of the particle filters over the benchmark.
For a 100-event point-process simulation, we first calculated the individual probability gains for each event , as shown in Figure 10. The individual gains fluctuate wildly, from about 0.4 to . In other words, there are many events that are better forecast by the benchmark than by the OSIR particle filter (), but there are some events for which the particle filter outperforms the benchmark by a factor of . The (cumulative) probability gain per earthquake is , i.e. in a geometric average sense, the OSIR particle filter performs better by a factor of 1.6.
The seemingly surprising occurrence of forecasts can be explained by the fact that the benchmark forecasts are sharper than the particle filter forecasts, since the benchmark does not take into account the uncertainty in the last occurrence time (compare the forecasts in Figure 2). As a result, if the next observed event actually falls into the center of the benchmark forecast, the likelihood of the data is higher under the benchmark forecast than under the broadened particle filter forecast. Thus, frequently, the benchmark produces higher likelihood scores than the filters. However, precisely because the benchmark forecast does not take into account data errors, the forecasts are overly optimistic. When the observed events fall outside of this artificially narrow window, the particle filter performs better than the benchmark, and sometimes immensely better. Such surprises for the benchmark are reflected in the very large individual likelihood gains of up to .
To illuminate the performance of the OSIR particle filter further, we simulated a 10,000-event point-process and calculated, for each event , the individual log-likelihood scores and of the OSIR particle filter and the benchmark, respectively, and their log-likelihood ratios . Figure 11 displays the empirical cumulative distribution functions of the OSIR particle filter’s scores (red dashed) and of the benchmark’s scores (purple solid). For comparison, we also show the distribution of log-likelihood scores obtained by the using the “true” process (black solid), and by two other distributions, explained below. The log-likelihood distribution of the “true” process has consistently the highest scores, up to statistical fluctuations, as expected. The log-likelihood scores of the OSIR particle filter, however, are not consistently better than the benchmark (as already seen in Figure 10). Rather, the highest scores of the benchmark are higher than those of the particle filter. These values correspond to those events that occur right in the middle of the overly optimistic, overly sharp forecast of the benchmark, thus resulting in a higher score compared with the broadened OSIR particle filter forecast. However, the scores of the benchmark quickly become worse than the particle filter’s, and indeed the lowest scores are orders of magnitude lower than the particle filter’s. The body and tail of the distributions, emphasized in Figure 11 by the semi-logarithmic axes, show the particle filter’s advantage: the benchmark sometimes produces terrible forecasts, for which it pays with a poor score. At the same time, the individual particle filter’s scores always remain relatively close to the scores of the “true” process.
We found it helpful to include two other distributions of log-likelihood scores in Figure 11. To produce the first distribution, labeled LL(lognormal noise), we simulated lognormally distributed random variables , and then perturbed each by an additive, uniformly distributed error with the same distribution as the observational uncertainty that perturbs the lognormal point process. We then used these random variables and calculated their likelihood using the original lognormal function. For the second distribution, labeled LL(lognormal noise), we perturbed twice with the same noise and again calculated their log-likelihood under a lognormal function. The point is to show that the log-likelihood scores of the benchmark naturally come from the fact that we assume a lognormal function in the calculation of the likelihood scores, but that the random variables we observe are not actually lognormally distributed. In fact, the benchmark makes 2 mistakes: (i) the origin point (start of the interval) is a perturbed version of the last true occurrence time, and (ii) the observed next event is again a perturbed version of the next true occurrence time. Thus, the interval is perturbed twice. In other words, the simulated LL(lognormal +2*noise) should correspond exactly to the log-likelihood distribution of the benchmark (up to statistical fluctuations).