Bayesian Uncertainty Management in Temporal Dependence of Extremes
Abstract
Both marginal and dependence features must be described when modelling the extremes of a stationary time series. There are standard approaches to marginal modelling, but long and shortrange dependence of extremes may both appear. In applications, an assumption of longrange independence often seems reasonable, but shortrange dependence, i.e., the clustering of extremes, needs attention. The extremal index is a natural limiting measure of clustering, but for wide classes of dependent processes, including all stationary Gaussian processes, it cannot distinguish dependent processes from independent processes with . Eastoe and Tawn (2012) exploit methods from multivariate extremes to treat the subasymptotic extremal dependence structure of stationary time series, covering both and , through the introduction of a thresholdbased extremal index. Inference for their dependence models uses an inefficient stepwise procedure that has various weaknesses and has no reliable assessment of uncertainty. We overcome these issues using a Bayesian semiparametric approach. Simulations and the analysis of a UK daily river flow time series show that the new approach provides improved efficiency for estimating properties of functionals of clusters.
Keywords: Asymptotic independence; Bayesian semiparametrics; Conditional extremes; Dirichlet process; Extreme value theory; Extremogram; Risk analysis; Thresholdbased extremal index
1 Introduction
Extreme value theory provides an asymptotically justified framework for the statistical modelling of rare events. In the univariate case with independent variables there is a broadlyused framework involving modelling exceedances of a high threshold by a generalised Pareto distribution (Coles, 2001). For extremes of stationary univariate time series, standard procedures use marginal extreme value modelling but consideration of the dependence structure between the variables is essential when assessing risk due to clusters of extremes. Leadbetter et al. (1983) and Hsing et al. (1988) described the roles of long and shortrange dependence on extremes of stationary time series. Typically an assumption of independence at long range is reasonable, with Ledford and Tawn (2003) giving diagnostic methods for testing this. In practice independent clusters are often identified using the runs method (Smith and Weissman, 1994), which deems successive exceedances to be in separate clusters if they are separated by at least consecutive nonexceedances of the threshold; Ferro and Segers (2003) provide automatic methods for the selection of .
Shortrange dependence has the most important practical implications, since it leads to local temporal clustering of extreme values. Leadbetter (1983) and O’Brien (1987) provide different asymptotic characterisations of the clustering though the extremal index , the former with being the limiting mean cluster size. The case corresponds to there being no more clustering than if the series were independent, and decreasing corresponds to increased clustering.
Although the extremal index is a natural limiting measure for such clustering there is a broad class of dependent processes with , including all stationary Gaussian processes. Thus the extremal index cannot distinguish the clustering properties of this class of dependent processes from those of white noise. Furthermore, many other functionals of clusters of extremes may be of interest (Segers, 2003), so for practical modelling of the clusters of extreme values above a high threshold, the restriction to is a major weakness.
Smith et al. (1997b), Ledford and Tawn (2003), Eastoe and Tawn (2012) and Winter and Tawn (2016) draw on methodology for multivariate extremes to provide models for univariate clustering, and thereby enable the properties of a wide range of cluster functionals to be estimated. The focus of this paper is the improvement of inference techniques for the most general model for these cases, namely the semiparametric conditional extremes model of Heffernan and Tawn (2004). Our ideas can be applied to any cluster functional, but we focus here primarily on the thresholdbased extremal index introduced by Ledford and Tawn (2003).
Given the strong connections between multivariate extreme value and clustering modelling, here and in Section 3 we present the developments of the model in parallel for the two situations. Examples of applications for multivariate cases include assessing the risk of joint occurrence of extreme river flows or sealevels at different locations (Keef et al., 2009; Asadi et al., 2015), the concurrent high levels of different pollutants at the same location (Heffernan and Tawn, 2004), and simultaneous stock market crashes (Poon et al., 2003). For the time series case, applications include assessing heatwave risks (Reich et al., 2014; Winter and Tawn, 2016), modelling of extreme rainfall events (Süveges and Davison, 2012) and wind gusts (Fawcett and Walshaw, 2006).
For the stationary time series , Ledford and Tawn (2003) define the thresholdbased extremal index
(1.1) 
where is large, which is the key measure of shortrange clustering of extreme values, with being the mean cluster size when clusters of exceedances of the threshold are defined via the runs method of Smith and Weissman (1994) with run length . Furthermore converges to the extremal index as and appropriately (O’Brien, 1987; Kratz and Rootzén, 1997). Many studies have focused on estimating the limit (Ferro and Segers, 2003; Süveges, 2007; Robert, 2013), but in applications, all these consider a finite level as an approximation to the limit . This is equivalent to assuming to be constant above , which is generally not the case in applications (see Figure 1). Additionally Eastoe and Tawn (2012) find that is fundamental to modelling the distributions of both cluster maxima, i.e., peaks over threshold, and block maxima, e.g., annual maxima. See Section 4 for more on the relevance of for time series extremes.
When considering asymptotically motivated models for the joint distribution of given that for the estimation of , it is helpful to have a simple characterisation of extremal dependence. The standard pairwise measure of extremal dependence for is
(1.2) 
with the cases and respectively termed asymptotic dependence and asymptotic independence at lag . A plot of against has been termed the extremogram (Davis and Mikosh, 2009), by analogy with the correlogram of time series analysis. When for all , the extremogram fails to distinguish between different levels of asymptotic independence, but the rate of convergence to zero of determines the key characteristics of the tail of the joint distribution (Ledford and Tawn, 1996). Ledford and Tawn (2003) propose using such a measure at each time lag when the variables are asymptotically independent. An alternative is to combine both approaches by studying a thresholdbased version of ,
(1.3) 
for a range of large values of .
In Section 2, we review classical multivariate extreme models, which all entail , for some high threshold , and often even . Instead we consider the conditional formulation of Heffernan and Tawn (2004) that has been subsequently studied more theoretically by Heffernan and Resnick (2007), Das and Resnick (2011), and Mitra and Resnick (2013). This class of models covers and changing with large () through modelling dependence within the asymptotic independence class. This model gives estimates of that can be constant or vary with , . This additional flexibility comes at a price: inference is required for up to parameters, and for an arbitrary dimensional distribution .
The asymptotic arguments for the Heffernan–Tawn model are given in Section 3, for an dimensional variable with Laplace marginal distributions. Suppose that the monotone increasing transformation transforms to , with (), so that has Laplace marginal distributions. In applications the Heffernan–Tawn model corresponds to a multivariate regression with
(1.4) 
where here, and subsequently, the arithmetic is to be understood componentwise, with , , , , and an dimensional random variable with ; has zero mean and unit variance for all marginal variables, with . We require that (1.4) holds for all , where is a high threshold on the Laplace marginal scale. The parameters determine the conditional mean and variance through
Thus can be estimated by multivariate regression. The complication for inference is that the error distribution is in general unknown and arbitrary, apart from the first two moment properties mentioned above. One exception to this is when and , in which case the Heffernan–Tawn model reduces to known asymptoticallydependent models with directly related to , as detailed in Section 3.
Heffernan and Tawn (2004) and Eastoe and Tawn (2012) used a stepwise inference procedure, estimating under a working assumption that are independent normal variables. After obtaining parameter estimates, they estimated nonparametrically using the empirical joint distribution of the observed standardised multivariate residuals, i.e., values of for . There are weaknesses in this approach, which loses efficiency in the estimation of and and in subsequent inferences due to the generally incorrect working assumption of normality. Moreover, as noted by Peng and Qi (2004), the empirical estimation of leads to poor estimation of the upper tail of the conditional distribution of , so it would be preferable to have a better, yet general, estimator of . Furthermore, the uncertainty of the parameter estimation is unaccountedfor in the estimation of and of cluster functionals such as .
Cheng et al. (2014) proposed a Bayesian approach to estimating the Heffernan–Tawn model in a single stage, but their estimation procedure involves changing the structure of the model and adding a noise term in (1.4), thereby allowing the likelihood term to be split appropriately. They also need strong prior information extracted from the stepwise inference method in order to get valid estimates for the model parameters, so this procedure does not really tackle the loss of efficiency of the stepwise estimation procedure.
We propose to overcome these weaknesses by using Bayesian semiparametric inference to estimate the model parameters and the distribution , simultaneously performing the entire fitting procedure for the dependence model. This gives a new model for , namely, a mixture of Gaussian distributions, which provides estimates of the conditional distribution of beyond the range of the current estimator and which in theory provides an arbitrary good approximation for (Marron and Wand, 1992). The Bayesian approach also provides a coherent framework for fitting a parsimonious parametric model; joint estimation of the model parameters enables the imposition of structure between them. For example, in multivariate problems the context may suggest that different components of may be identical. In the context of time series extremes, for firstorder Markov models, it can be shown that and involve at most two unknown parameters (Papastathopoulos et al., 2016). Furthermore, when the are known to be asymptotically dependent on , this method provides a new approach to modelling.
We show the practical importance of the new approach by applying it to the daily mean flow time series of the River Ray at Grendon Underwood in northwest London, with observations from October to December . We consider only flows from October to March, as this period typically contains the largest flows and forms an approximately stationary series. An empirical estimate of , with , is shown in Figure 1 with bootstrapderived confidence intervals. A major weakness with this estimate is that it cannot be evaluated beyond the range of the data, so a model is needed to evaluate for larger . We select our modelling threshold to be the empirical marginal quantile of the data. Using the methods in Eastoe and Tawn (2012) we have an estimate of for all using the stepwise estimation method. As seen in Figure 1, this estimate converges to as but we have no reliable method for deriving confidence intervals. Figure 1 also shows posterior median estimates and credibility intervals obtained using our Bayesian semiparametric method. These show broad agreement with both of the other estimates within the range of the data, but with tighter uncertainty intervals and statistically significant differences in extrapolation of , indicating that the new method has the potential to offer marked improvement for estimating and other cluster functionals.
The paper is structured as follows. We first briefly present the standard approaches to multivariate extremes in Section 2. We introduce the conditional model of interest in a multivariate framework in Section 3, followed by a section about modelling of dependent time series. Section 5 explains the Bayesian semiparametric inference procedure, which is used in Section 6 to illustrate the efficiency gains of this new inference method on simulated data. In Section 7 we fit our model to the River Ray flow data and show its ability to estimate functionals of time series clusters other than the thresholdbased index.
2 Multivariate setup and classical models
Both multivariate and time series extremes involve estimating the probability of events that may never yet have been observed. Suppose that is an dimensional variable with joint distribution function and marginal distribution functions . We need to estimate the probability , where is an extreme set, i.e., a set such that for all , at least one component of is extreme. To do this we must model for all , where is an extreme set that contains . Let be the subset of for which component is largest on a quantile scale, i.e.,
and let , so that we can write
(2.1) 
Thus estimates of marginal and dependence features are required for estimating the conditional probabilities in the sum and marginal distributions determine the second terms of the products in the sum.
Although our approach applies to any form of set , to focus our arguments we restrict ourselves to identical marginal distributions with upper endpoint and we set
so and we estimate only the conditional probability term in (2.1), i.e., , as defined in (1.1).
Early approaches to modelling the conditional distribution appearing in (1.1) assumed that lies in the domain of attraction of a multivariate extreme value distribution (Coles and Tawn, 1994; de Haan and de Ronde, 1998) and applied these asymptotic models above a high threshold. Unlike in the univariate case, there is no finite parametrisation of the dependence structure; it can only be restricted to functions of a distribution on the unit simplex with (), where . Both parametric and nonparametric inference for this class of models has been proposed. Numerous parametric models are available (Kotz and Nadarajah, 2000, Ch. 3; Cooley et al., 2010; Ballani and Schlather, 2011). Nonparametric estimation is also widely studied, mostly based on empirical estimators (de Haan and de Ronde, 1998; Hall and Tadjvidi, 2000; Einmahl et al., 2001; Einmahl and Segers, 2009).
A major weakness of these early methods is that for these models either or is independent for all , whereas there are distributions, such as the multivariate Gaussian, with but dependent. If for any , these models give estimates of as , where . This class of models is not flexible enough to cover distributions that are dependent at finite levels, but asymptotically independent for all pairs of variables.
3 Thresholdbased model for conditional probabilities
3.1 Heffernan–Tawn model
In order to provide a model characterising conditional probabilities, such as those which describe the clustering behaviour, we need a model for multivariate extreme values, and in particular we require a model for the joint distribution of given that . In this section we suppose that the marginal distributions of are not identical, and that is in the domain of attraction of a generalised Pareto distribution, i.e., there exists a function such that as , , conditional on , converges to a generalised Pareto variable with unit scale parameter and shape parameter .
The joint distribution is modelled via the marginal distributions and a copula for the dependence structure. To study the conditional behaviour of extremes, the copula formulation is most transparent when expressed with Laplace marginals. Let denote the transformation of the marginal distribution of to the Laplace scale, i.e.,
the specification of the is discussed in Section 3.2. Assume there exist dimensional functions and for which
(3.1) 
where all marginal distributions of are nondegenerate and . Hereafter we write the standardised , or residual, as
Under assumption (3.1), the rescaled conditioning variable is asymptotically conditionally independent of the residual given , as . That is,
(3.2) 
where is the generalised Pareto distribution survivor function (4.1) with scale and shape parameters and is the limit distribution of the residuals.
Equation (3.1) can be illustrated through the particular case when is a centred multivariate Gaussian distribution with correlation matrix elements , , . In this case we can derive and , and is a centred multivariate Gaussian distribution with variances and correlation matrix elements
Heffernan and Tawn (2004) and Keef et al. (2013) showed that under broad conditions, the component functions of and can be modelled by
In terms of the dependence structure, and reflect the flexibility of the model. It turns out that are asymptotically dependent only if , , and then
with the th marginal survivor function of ; if , then are asymptotically independent, with positive extremal dependence if , negative extremal dependence if , and with extremal nearindependence if and .
We set , as when all the conditional quantiles for converge to the same value as increases, which is unlikely in most environmental contexts. If the conditioning threshold is high enough that the conditional probability on the left of (3.1) is close to its limit, then the Heffernan–Tawn model can be stated as
(3.3) 
where is independent of , and can be any distribution with nondegenerate margins.
3.2 Existing inference procedure
We now outline the approach to inference suggested by Heffernan and Tawn (2004). Consider a vector whose marginal distributions each lie in the domain of attraction of a generalised Pareto distribution. We estimate them using the semiparametric estimator of Coles and Tawn (1994),
where is the empirical marginal distribution function of . Here and are maximum likelihood estimates based on all exceedances of , ignoring any dependence; their variances can be evaluated by a sandwich procedure (Fawcett and Walshaw, 2007) or by a block bootstrap. The margins are transformed to the Laplace scale through the transformation .
Estimation of the probability of any extreme set of interest involves inference for model (3.3) with the estimators of parameters of the dependence model assumed independent of the parameter estimators of the marginal distribution. This assumption has been found not to be restrictive in other copula inference contexts (Genest et al., 1995). Heffernan and Tawn (2004) proposed a stepwise inference procedure for estimating the extremal dependence structure, based on the working assumption that the residual variables are independent and Gaussian with means and variances . This assumption allows likelihood inference based on the assumed marginal densities,
The first step of their procedure consists of a likelihood maximisation performed separately for each , giving estimates of , and the nuisance parameters and . Additional constraints, arising from results of Keef et al. (2013), lead to the likelihood function being zero for certain combinations of parameters . Thus the maximisation is over a subset of for these two parameters. These constraints ensure that the conditional quantiles of are ordered in a decreasing sequence for all large under fitted models corresponding to positive asymptotic dependence, asymptotic independence and negative asymptotic dependence respectively. For details of these constraints see Keef et al. (2013), who show that imposing these additional constraints improves inference of the conditional extremes model. Given model (3.3) and the estimates and , the second step of the estimation procedure involves multivariate residuals for each data point, using the relation
and hence constructing the joint empirical distribution function .
An estimator for for any extreme set is obtained as follows: sample independent replicates of conditional on from a generalised Pareto distribution with threshold ; independently sample from the joint empirical distribution function ; compute
where vector arithmetic is to be understood componentwise and is a componentwise backtransformation to the original scale; then the estimator for is
where is the indicator function.
In the rest of the paper, we are interested in estimating the conditional probability , corresponding to , for which the Heffernan–Tawn model provides a characterisation. Under the assumption that the limit (3.1) approximately holds for some subasymptotic , Eastoe and Tawn (2012) obtain
(3.4) 
where is the generalised Pareto density for threshold , with scale parameter and shape parameter , and is an dimensional vector with elements
(3.5) 
A Monte Carlo approximation to the integral (3.4) gives the estimator
where is given by expression (3.5) with and replaced by estimates. Monte Carlo variability can be reduced by using the same pseudorandom sequence when generating samples for different values of . Eastoe and Tawn (2012) use a bootstrap method to get confidence bounds for , but de Carvalho and Ramos (2012) found this to be unreliable.
Four main weaknesses of this inference procedure justify developing a more comprehensive approach. First, the working assumption needed for the likelihood maximisation is that the residuals have independent Gaussian distributions, and it is hard to quantify how this affects inference. Second, ignoring the variability of , estimated in the first step leads to underestimation of the uncertainty in the estimate for the residual distribution and hence also for . Third, the empirical estimation of restricts estimates of extremal conditional probabilities, as simulated values provide no extrapolation over observed values of . Fourth, the inability to impose natural constraints on and leads to inefficiency.
4 Modelling dependence in time
Consider a stationary time series satisfying appropriate longrange dependence properties and with marginal distribution . The thresholdbased extremal index summarises the key extremal dependence in time series. In the blockmaxima context, the distribution of the block maximum at a level is approximately for large , and (O’Brien, 1987; Kratz and Rootzén, 1997). The associated independent series , having the same marginal distribution as but independent observations, has with distribution function . So , with accounting for the dependence.
The most popular approach to dealing with shortrange dependent in such series is the peaks over threshold (POT) approach formalised by Davison and Smith (1990). This approach consists of selecting a high threshold , identifying independent clusters of exceedances of , picking the maximum of each cluster, and then fitting to these cluster maxima the generalised Pareto distribution
(4.1) 
The limiting results are used as an approximation for data at subasymptotic levels with limit distribution (4.1) taken as exact above a selected value of .
Alternatives to the POT approach include modelling the series of all exceedances, for example using a Markov chain (Smith et al., 1997a; Winter and Tawn, 2016), but they depend heavily on the validity of the underlying modelling assumptions and so may be inappropriate.
Eastoe and Tawn (2012) consider the thresholdbased extremal index as part of a model for the distribution of cluster maxima. Specifically they show that, for a given high threshold , the cluster maxima, defined by the runs method with runlength , have approximate distribution function
(4.2) 
where the parameters and determine the marginal distribution of the original series, and . Eastoe and Tawn (2012) show how using the information in can improve over the POT approach. Distribution (4.2) reduces to the generalised Pareto model asymptotically as , and more generally when for all . When estimates of vary appreciably above , this equality condition for provides a diagnostic for situations where the POT method is inappropriate.
In our approach, when a Markov property can reasonably be assumed for a time series, the and have a structure that we want to exploit. Papastathopoulos et al. (2016) and Kulik and Soulier (2015) characterise the form of and under very weak assumptions. If the conditions needed for the Heffernan–Tawn simplification — and — hold, then for positively associated first order Markov processes, Papastathopoulos et al. (2016) show that either , or or for some and . The first case corresponds to asymptotic dependence at all time lags, and the other two to different forms of decaying dependence under asymptotic independence. If follows an asymptotically dependent Markov process, then no parameters need be estimated, rather than . If the process is wellapproximated by an asymptotically independent Markov process then either of the last two cases applies, and the number of parameters in the parametric component of the model reduces from to or . In the case of a Gaussian AR(1) process with standard Gaussian margins
Heffernan and Tawn (2004) get normalising parameters and , ; the distribution is a centred multivariate Gaussian with variances and correlation matrix elements
See Papastathopoulos et al. (2016) for many more examples of first order Markov processes and their resulting forms for , and .
5 Bayesian Semiparametrics
5.1 Overview
Since the dimensional residual distribution in the Heffernan–Tawn model (3.3) is unknown, the approach described in Section 3.2 uses the joint empirical distribution function, which cannot model the tails of the conditional distribution of in (3.3). Our proposed Bayesian approach instead takes to be a mixture of a potentially infinite number of multivariate Gaussian distributions through the use of a Dirichlet process. This approach can model any and capture its tails, and has the major benefit of allowing joint estimation of , and .
Below we introduce Dirichlet processes and describe an approach to approximate Monte Carlo sampling from them. We then describe Bayesian semiparametric inference and the specification of prior distributions, and discuss implementation issues. Throughout we assume that we have observations from the distribution of , or equivalently from if and were known.
5.2 Dirichlet process mixtures for the residual distribution
Consider a bivariate problem, ; with known and . If we are to estimate the distribution function of this is equivalent to estimating the distribution function of the univariate random variable . If is estimated nonparametrically, its prior must be a distribution over a space of distributions. In this context, a widely used prior is the Dirichlet process (Ferguson, 1973) mixture. A simple model structure takes to be a mixture of an unknown number of distributions having parameters , , so that the Dirichlet process boils down to a distribution on the space of mixture distributions for . If , then the distribution of is the Dirichlet process , where is the centre distribution and the concentration parameter (Hjort et al., 2010).
The definition of a Dirichlet process states that for any , and any finite measurable partition of the space of the ,
The interpretation of the Dirichlet process parameters stems from the properties
so the prior is closer to its mean and less variable for large values of . A constructive characterisation of the Dirichlet process is the stickbreaking representation (Sethuraman, 1994)
(5.1) 
where denotes a distribution concentrated on , and are independent, distributed, and independent of the random weights , which satisfy . The stickbreaking process takes its name from the computation of the weights: define as the breaks independently sampled from a distribution. The weights are then
(5.2) 
Ishwaran and Zarepour (2000) use formulation (5.1) to express the Dirichlet process in terms of the random variables and . They also introduce index variables that describe the components of the mixture in which the observations lie, giving a stickbreaking representation in terms of the index variables rather than the random variables .
A key step in deriving a posterior distribution is the truncation of the sum in the stickbreaking representation, i.e., replacing the infinite sum in (5.1) by a sum up to . This is achieved by imposing in (5.2). The accuracy of the stickbreaking approximation improves exponentially fast in terms of the error (Ishwaran and James, 2001), as
where is the marginal density
For example, the error is smaller than when truncating the stickbreaking sum representation at as in our real data analysis (Section 7) for which and .
Taking into account the transformations discussed above, a simple model for involving the Dirichlet process prior is
(5.3) 
where is the matrix with rows , and , with for some suitably large . To lighten the notation we write instead of in what follows. Taking the () to be normal distributions with means and variances leads to being a matrix, with rows . Model (5.3) is made more flexible by adding a hyperprior for the concentration parameter .
5.3 Multivariate semiparametric setting
We now specify the features of our algorithm, finally yielding model (5.3). We must add a further element to (5.3): covariates, i.e., the parametric part of the Heffernan–Tawn model (3.3), to recognise that and are unknown. This is achieved using a covariatedependent Dirichlet process, and it can be formulated in terms of the truncated stickbreaking representation as
(5.4) 
so that a single output of the stickbreaking procedure gives rise to a whole family of distributions indexed by . Our data are the observations from dimensional variables , given is large. We assume that, conditional on , has a mixture of multivariate normal distributions, , where the mean vector and the covariance matrix of the th normal component depend on the value of . For parsimony, the variance matrix is taken to be diagonal with diagonal elements , as the mixture structure is considered flexible enough to capture the dependence between the elements of .
5.4 Implementation issues
The semiparametric multivariate Bayesian model (5.3) has an added benefit of allowing us to structure the parametric component of the model. Assuming to be a first order Markov process yields different structures discussed in Section 4. For example, the different forms of decaying dependence in the class of asymptotic independence can be modelled by setting the priors as and , independently. The appropriate structure can be determined using standard diagnostics, and if adopted in the modelling will lead to substantially improved efficiency. Imposing continuous priors on and induces a restriction to the class of asymptotically independent series, but both parameters can be arbitrary close to the boundaries of their support, ensuring that the behaviour of and the extremal structure of dependence of the series are not affected at the high levels of interest. A reversible jump procedure (Green, 1995) could be added to the current algorithm in order to enable and to have prior masses on the support boundaries to ensure positive posterior probability of asymptotic dependence, see Coles and Pauli (2002) for an example of this type of construction.
The shape and scale for the prior variances of the components are taken to be to make the model prefer numerous components with smaller variances to a few dispersed components. The posterior distribution for depends on the logarithm of the last weight in the truncation (cf. Appendix A) and can be numerically unstable, so a vague gamma prior truncated at small values is needed to ensure convergence. Conjugacy of the prior densities allows analytical calculation of the posterior distributions for all parameters in model (5.3) except and , for which a Metropolis–Hastings step is needed. We use a regional adaptive scheme in Roberts and Rosenthal (2009) to avoid the choice of specific proposal variances. The posterior densities are mainly derived from Ishwaran and James (2002), and are given in Appendix A.
As noticed by Porteus et al. (2006), the Gibbs sampler used for model (5.3) leads to a clustering bias, because the weights do not satisfy the weak ordering . Papaspiliopoulos and Roberts (2008) suggested two different label switching moves to improve the mixing of the algorithm. Components cannot be simply swapped, as this would change the joint distribution of the weights. Label switching is not to be understood in its exact sense within this framework: if a switch between two mixture components is proposed and accepted, then only their means and variances are swapped and the index variables of the data points belonging to these components are renumbered accordingly. We use this approach and adapt it to our semiparametric framework.
The results presented in Sections 6 and 7 are promising, but two aspects would benefit from improvement. Bayesian semiparametric inference provides a valuable approach to uncertainty in the Heffernan–Tawn model, but the procedure is not fully Bayesian, since the marginal distribution is fitted using maximum likelihood estimation. With further work we could include the fit for the marginal distribution within the fit for the dependence structure, but we would have to account for the temporal dependence between observations in order not to introduce bias. The second possibility for improvement pertains to the sampling of and in (3.3): the special cases corresponding to the boundaries of their support should correspond to Dirac masses, so reversible jump Markov chain Monte Carlo sampling (Green, 1995) could be used.
6 Simulation study
6.1 Bivariate data
We start by showing how the Bayesian semiparametric approach to inference can improve over the stepwise approach in a bivariate setting. The working assumption of Gaussianity for the residual variable is key to the stepwise process, and if it fails badly then the stepwise approach may perform poorly relative to the Bayesian semiparametric approach. To illustrate this we take to have a bimodal density, either a mixture of Gaussian densities, or a mixture of Laplace densities. As the former is a special case of the structure of the mixture components in the dependent Dirichlet process, we may expect a clear improvement in that case, but it is less clear what to expect in the latter case.
We generated data directly from the Heffernan–Tawn model with parameters subject to , for large , as follows:

Simulate as ;

Independently simulate from the required mixture model;

Let .
We selected the mixture for the bimodal distribution of such that the simulated data are split into two clear clusters for large (see left panel of Figure 2 for an example).
We simulated data sets each with points, roughly twice the number of exceedances available in our river flow application, and fitted the conditional model using the stepwise and the Bayesian semiparametric approaches. We compare the methods through the relative efficiency, measured as the ratio of the root mean squared error (RMSE) for the Bayesian approach to the RMSE of the stepwise approach. The estimators we consider in order to compute the efficiency are the mode, the mean, and the median of the posterior distribution of and for the Bayesian approach and the maximum likelihood estimators of and for the stepwise approach.
The benefits of the Bayesian semiparametric approach are clearly found, with similar relative efficiencies whether is simulated with a Gaussian or Laplace mixture. The relative efficiency is broadly for and in the range for depending on which of the three summary measures of the posterior distribution is chosen. The posterior number of components in the mixture is concentrated around and , so the Bayesian semiparametric approach seems to capture the distribution of well. Figure 2 shows the joint sampling distribution of the estimators of based on the two inference methods. The contours are similar, but suggest that the Bayesian approach estimates the true more precisely.
Of key importance is the practical implication of this improvement, which is more naturally measured in terms of improved performance for estimating the thresholdbased extremal index. Specifically we estimate , which requires accurate estimation of the distribution of as well as of and . The relative efficiency for is computed, where the true value for is obtained from a huge simulation from the true model. The relative efficiency varies over , with values of and for the and quantiles, suggesting slight improvements within the range of the data. The relative efficiency reduces to at the quantile, suggesting that the real benefits in the Bayesian semiparametric approach arise when we extrapolate.
6.2 AR(1) process
We now compare the performances of the empirical, the stepwise and the Bayesian semiparametric inference procedures in estimating the thresholdbased extremal index of a stationary time series. The data are generated from a firstorder Markov process with Gaussian copula and exponential margins. This is equivalent to having a standard Gaussian AR(1) process and using the probability integral transform to obtain exponential marginal distributions. In Gaussian margins this process has lag autocorrelation , where we consider the set of for the true value of . For each of these values of , the process is asymptotically independent, with extremal index , but it exhibits dependence at any subasymptotic threshold when . The true value for is evaluated by computing the ratio of multivariate normal integrals
(6.1) 
using the methods of Genz and Bretz (2009) and Genz et al. (2014). The use of exponential margins ensures that the GPD marginal model is exact for all thresholds, so any bias in the estimation of can be attributed to inference for the dependence structure. A similar approach was taken by Eastoe and Tawn (2012).
The three methods are applied to data sets of length , approximately the length of the winter flow data studied in Section 7. This procedure is repeated for each value of in the range . The empirical method simply estimates each of the probabilities in expression (6.1) empirically. Often called the runs estimate (Smith and Weissman, 1994), this is not defined beyond the largest value of the sample, whereas the other two methods do not suffer this weakness. In each case the marginal threshold for the modelling and inference is fixed at the empirical quantile of each series. Unlike the stepwise procedure, the Bayesian semiparametric approach enables us to constrain , and this allows us to exploit our knowledge of the Markovian structure of the process to impose the constraints on the and discussed in Section 4, thus reducing the number of parameters from to or .
We estimate for a range of high quantiles and for declustering runlengths and . Table 1 shows the ratios of RMSEs of the posterior median of from the Bayesian semiparametric approach and the empirical and the stepwise estimators in the particular case when . The Bayesian semiparametric estimator is always superior to the empirical estimator, with the advantage improving as increases. For the stepwise approach the results are similar to those in Section 6.1: the two estimators are similar at low levels but the Bayesian semiparametric estimator performs better at higher levels. Figure 3 summarises the results for all values of , showing a major improvement of our method over the stepwise approach for negative autocorrelation and short runlength, with increased gain at higher levels. In order to assess the effectiveness of imposing the Markovian structure in the Bayesian semiparametric approach, we also fitted the simulated time series with unconstrained and in the case . The efficiency of the unconstrained approach only declines relative to the constrained approach at high quantiles. For example the in the bottom right of Table 1 increases to .
m=1  m=4  
Level  Empirical  Stepwise  Empirical  Stepwise 
98%  88  101  88  100 
99%  68  92  71  98 
99.99%  –  59  –  57 
We expect the Bayesian approach to gain accuracy in terms of frequentist coverage of , as it fits the data in one stage and thus provides a better measure of uncertainty. To assess this we considered bootstrap confidence intervals for the stepwise method and credible intervals for the Bayesian method, both of the type . Here is the th quantile of the distribution of the estimator, considering bootstrap estimates for the former and posterior samples for the latter. Using the same simulated data sets as earlier in this section, we computed the proportion of times that the true value of would fall in these confidence or credible intervals, for a range of confidence levels from to , different runlengths , and several levels for . The coverage performance is summarised in Figure 4, which shows the difference between the calculated and the nominal coverage . Zero coverage error means perfect uncertainty assessment; positive and negative errors mean onesided over and undercoverage respectively. The stepwise approach overestimates coverage for both levels of and all . At relatively low levels of , the gain in coverage accuracy by the Bayesian approach is remarkable in particular at midcoverage levels, but it shows no marked improvement for larger .
7 Data analysis
River flooding can badly damage properties and have huge insurance costs. Largescale floods in the UK since the year have caused insurance losses of £ billion, and more than £ million is spent each year on flood defences. Modelling the dependence of extreme waterlevels is key to accurate prediction of flood risk.
Our application uses daily flows of the River Ray at Grendon Underwood, northwest of London, for the winters from to . We assume stationarity of the series over the winter months. River flows in this catchment are typically shortrange dependent: after heavy rainfall the flow can reach high values before decreasing gradually as the river returns to its baseflow regime. We thus expect the flow to be dependent at extreme levels and at small lags, so a small run length is required. For illustration we take , , with the former being the more appropriate.
Standard graphical methods (Coles, 2001) were used to choose the empirical quantile as the marginal threshold. A sensitivity analysis on a range of thresholds gave results similar to those below. The Heffernan–Tawn model was then fitted to the data transformed to Laplace margins, with as the empirical quantile and , 7. A higher threshold was selected for the dependence modelling to ensure the independence of and in approximation (3.3).
We first investigate the asymptotic structure of the data at different lags. We use , , whose limit (cf. Section 1) either measures the degree of association within the asymptotic dependence class when or indicates asymptotic independence when . A Monte Carlo integration similar to that used for estimating the posterior distribution of is applied to get the posterior distribution of for selected values of and depicted in Figure 5, where the posterior densities are summarised using highest density regions (Hyndman, 1996). Convergence of to at all lags is supported by the model. As expected we observe a monotone decay in extremal dependence over time lag. The flexibility of the conditional model is well illustrated here, as the procedure establishes positive dependence at any finite level but anticipates asymptotic independence of successive daily flows.