Probability-Matching Predictors for Extreme Extremes
A location- and scale-invariant predictor is constructed which exhibits good probability matching for extreme predictions outside the span of data drawn from a variety of (stationary) general distributions. It is constructed via the three-parameter Generalized Pareto Distribution (GPD). The predictor is designed to provide matching probability exactly for the GPD in both the extreme heavy-tailed limit and the extreme bounded-tail limit , whilst giving a good approximation to probability matching at all intermediate values of the tail parameter . The predictor is valid even for small sample sizes , even as small as .
The main purpose of this paper is to present the somewhat lengthy derivations which draw heavily on the theory of hypergeometric functions, particularly the Lauricella functions. Whilst the construction is inspired by the Bayesian approach to the prediction problem, it considers the case of vague prior information about both parameters and model, and all derivations are undertaken using sampling theory.
This paper presents a novel approach to extrapolation beyond the span of historical data, one of the basic problems of inference. The notion of say the “1 in 10,000 year event” exists in common parlance, even though the philosophical interpretation of the phrase differs in detail between Bayesian and non-Bayesian schools. Often the analyst has only a limited set of historical data, say of the order of 100 years, but is asked to make predictions regarding possible future events (at say the “1 in 10,000 year level”) which are substantially greater than historical experience. The paper concerns the issue of whether there are rational methods for tackling this philosophically-fraught problem, and the method developed here - whilst only in its preliminary stages of development - is presented as a potential candidate which may be worthy of further exploration. Although the following theory is created around properties of the Generalised Pareto Distribution (GPD) and its tail parameter , the final predictor created is non-parametric. Its usefulness or otherwise remains to be determined, but nevertheless it has some remarkable properties.
We consider here only the univariate case, with the data consisting of a set of discrete real-valued “events” drawn from a stationary distribution with unknown parameters . The approach presented focuses directly on prediction rather than estimation, and the key concept is that of probability matching. The phrase has been given precise definitions elsewhere (e.g. Datta and Mukerjee (2004); Sweeting (2008)) and refers to cases when coverage probabilities of Bayesian and frequentist approaches coincide. The use of the phrase here is similar, but for the predictors in question it is not clear that a Bayesian prior exists, thus the phrase will be applied here directly to the predictor. Its usage here is an attempt to formalise the loose notion that a prediction of the “1 in T” level event should actually deliver what it appears to promise: that the probability that the next event will exceed is indeed . Of course, this loose notion needs to be formalised, particularly since Bayesians will disagree with non-Bayesians about even what is meant by “probability”.
In the frequentist approach, the analyst typically uses the data to construct a point estimate of the actual unknown parameters of the assumed model, and then uses the tail of the family member to select the value , via . Thus is an estimate of the actual (but unknown) value that has an exceedance probability for the actual (but unknown) distribution from which the data was drawn. Confidence intervals may then be constructed around that estimate at some chosen confidence level, and a designer or decision maker may perhaps choose to use the upper confidence level as the basis for the “1-in-T level” event. Note that there is thus a mixture of notions of probability here: the designer may aim for the 1 in 10,000 event but may only have a 95% confidence in the result.
In the Bayesian approach, the analyst combines the data with prior knowledge about the parameters to construct the posterior distribution on the parameters, this being a representation of the analyst’s updated beliefs about the parameters of the model. Rather than simply choosing the single family member with the parameter value set to the point estimate which has the greatest posterior belief value, the Bayesian takes account of uncertainty in knowledge of the parameters by constructing the predictive distribution, this being the analyst’s updated beliefs about the possible value of the next data element . The predictive distribution is, loosely speaking, the integral of all probabilities of all possibilities. The Bayesian candidate for the “1 in T” level is that value above which lies a fraction of the analyst’s beliefs about . That is, extreme value predictions at some given return level are set at the value above which the tail integral of the predictive distribution equals the desired exceedance probability (see Coles and Tawn (2005), for example). Despite the appealing rationality of the procedure, it is not without difficulties. For predictions well outside the span of the data, the tails of the predictive distribution may be strongly influenced by the prior beliefs about model parameters, and this can be problematic when prior knowledge - particularly regarding the tail parameters - is vague. Moreover, standard numerical integration techniques such as MCMC can require excessively long run times to explore the tail regions outside the data sufficiently often for the analyst to be confident that the numerically-generated distributions have converged sufficiently.
The problem of predicting the “1 in T” level event has solutions in various restricted cases. For example, the Bayesian predictor based on some proper prior matches probability at the target level in the sampling sense, sampling parameters from and then sampling the data (and ) from the chosen distribution . The two parameter () location-scale families provide other ready examples. For example, if the data is sampled from some unknown member of some known family of location-scale distributions with , then any statistic of the form (where is location-scale invariant and is scale invariant) is a probability-matching predictor at some level . Specifically, the predictor that arises from the Bayesian approach using the (improper) prior matches probability at the designed-for level (see McRobie (2004)). (Priors using other powers of give probability performance which is location- and scale-invariant, but not at the designed-for level).
Although the predictor has the pleasing property of delivering the required level performance no matter what the parameter values actually are, one obvious short-coming that limits its usefulness is the requirement for complete prior knowledge of the model. That is, the functional form of the two-parameter family of distributions must be known a priori. The compass of the procedure would thus be somewhat expanded if it could be extended to cover three-parameter location-scale-shape families of the form with . Since both the Generalized Pareto and Generalised Extreme Value distributions (GPD, GEV) can be expressed in this form, the possibility may then exist that the machinery of Extreme Value Theory (e.g. Embrechts et al. (1999)) could also be invoked in order to apply the predictor to data sets drawn from more general distributions. Loosely speaking, since - under suitable conditions - the upper order statistics of samples drawn from more general distributions have the GPD as their limiting distribution, then a probability-matching predictor for the GPD may have wider application.
The question thus arises as to whether probability-matching predictors can be constructed in the general three-parameter (, , ) case. If so, there are the further questions as to whether there is a corresponding prior, and what form that prior might be.
This paper endeavours to construct a probability-matching predictor for the three-parameter GPD. By taking a sampling - rather than a Bayesian - approach, predictors are constructed such that probability matching is exact in the both the extreme heavy-tailed () and extreme bounded-tail ()) limits. At finite values of , probability is only approximately matched, but the degree of approximation is very good. Moreover the predictor applies to samples as small as , and works remarkably well for predictions which lie far outside the span of the data. Finally, when applied to small data sets sampled from non-GPD distributions, out-of-sample predictions match probability to a remarkably close approximation.
2 Sampling Distributions of the Normalised Data
Suppose data points are sampled from a Generalised Pareto Distribution (GPD) with distribution function
with unknown parameters . Let the ordered data be , such that . (Note that the indexing of the ordered data is from the lowest order statistic, and that this is in the opposite direction to that adopted in McRobie (2013b) and McRobie (2013a)).
The aim is to construct a predictor such that, for any chosen return level , there is a probability that the next data point will exceed . That is, we desire
irrespective of the values of the parameters of the distribution from which the data was sampled.
We first normalise the data to lie within the unit interval via the statistics with
The normalised data is location- and scale-independent, in that for any and any . The normalisation is simply a linear mapping of the data onto the interval , the data minimum mapping to zero and the data maximum mapping to 1.
We are interested in extrapolating to possible large future extremes which lie outside the span of previous data. The next data point might not exceed the data maximum , but we shall be most interested in those cases when it does. We shall thus denote the next data point as . We will likewise focus on constructing predictions in that region beyond the data maximum.
The next data point and the prediction may be normalised via
Given the parameters , the sampling density of the statistics may, via the elementary integrations of Appendix 1, be expressed in terms of a Lauricella function . Since the statistics are location- and scale-independent, the sampling density retains a parameter dependence only through the shape parameter .
Writing for brevity, then in the region of the heavy-tailed () GPDs, writing , we obtain from Appendix 1 that
where is a vector of ones, and .
For the bounded-tail () GPDs, writing , we similarly obtain
3 The Problem Statement
Since all dependence on the location and scale parameters has been removed by the normalisation, the only parameter of remaining interest is the tail parameter .
Given the parameters , the probability that the next (normalised) event will exceed some given function of any data is given by
where and the integral is over all admissible normalised data , and the integrand is defined via
In the heavy-tailed region (), for the case of interest where , the elementary integrations of Appendix 2 reveal the integrand of equation 7 to be
where the -th element of is for , the notation being extended here to include the two further points and .
For the bounded-tail region (), the corresponding integrand is
4 The Heavy-Tailed Limit
We consider first those predictors which guarantee to match probability at the level for those distributions in the limit of extremely heavy tails . For large (i.e. small ), the Lauricella function in Eqn. (9) approaches unity due to the argument in its first slot. The conditional exceedance integrand of Eqn. (9) thus simplifies to
Since any function of the statistics is a predictor, there is an almost limitless variety to the possible functional forms that our predictor might take.
To progress, we consider predictors taking a power law form:
Integrating (11) over the domain and setting the result equal to the desired exceedance probability leads to the constraint
For , this constraint requires
For larger samples and for a given , the constraint equation (13) defines an -dimensional space of possible exponents for our power-law predictor. One obvious solution sets each term in the left-hand product of equation (13) equal to the same value, , giving
An alternative choice is one that gives predictions that are in some sense small, and the expected value of can be minimised by choosing
For simplicity, we might instead choose to make all the the same (). For any sample size and any desired return level , the exponents of the power-law predictor can be obtained numerically (e.g. by a simple bisection method) to determine that value of which satisfies the constraint equation. Values of so determined are given in Table 1.
The performance of the above predictors when played against samples drawn from distributions with various shape parameters are shown in Figure 1. All give the desired performance in the extremely heavy-tailed limit (at the right of the diagram), but over-predict elsewhere.
The predictors in this heavy-tailed limit we shall denote by , this being the (scaled) excess of the prediction above the data maximum.
5 The Bounded-Tail Limit
In this section a constraint equation is constructed for the exponents of a power law predictor which will deliver exact probability matching in the extreme limit of large negative . Such a constraint is somewhat more difficult to find. The derivation in Appendix 3 shows that a predictor can be constructed of the form where
and the exponents of each must satisfy the constraint
For there is a single solution
which is reciprocally related to the exponent of the corresponding extreme heavy-tailed case.
Again, for larger sample sizes and given , the constraint defines an -dimensional manifold of possible exponents, and various ad hoc schemes can be readily devised that satisfy the constraint. For further progress, we consider only the scheme which sets all exponents equal to the same value for some given . Since the constraint equation is a polynomial of degree in , this can again be solved numerically for at any , for example by using a simple bisection method. Values of for , 7, 15 and 31 are given in Table 2.
The performance of the resulting predictor is illustrated in Fig. 2 for , 7 and 15, showing that exceedance is delivered in the extreme bounded-tail limit , but typically under-predicting for less extreme tail parameters.
6 Predictors at specific intermediate values of
We now possess predictors and which match probability in the two extreme limits. Individual predictors can readily be constructed at any specific known between these two limits. For example at the GPDs where and (the exponential and the uniform) any number of location- and scale-invariant predictors can be constructed which match probability there, and a simple example is provided by the Bayesian predictors.
For (exponential), the Bayesian predictor is
and for (uniform) it is simply the constant
The performance of these two predictors over a range of shape parameters is illustrated in Fig. 3. It can be seen that they do indeed deliver the required prediction performance at the shape parameter for which they are designed, but the performance deviates rapidly away from the desired level at nearby shape parameters. This illustrates the danger of designers assuming that data is say exponentially distributed and predicting accordingly, for if the data is actually drawn from a nearby GPD the predictions might be extremely optimistic, and the designer should not be so surprised when the design is soon exceeded.
Note that the exponential predictor does not predict the data maximum () for the return level . It can even predict values below the data maximum (e.g. for , , which can be as low as ). This differs from the strategy that will be adopted in this paper. Here the aim is to create more general predictors and the decision has thus been made that these should all pass through the only presently known (albeit trivial) example of a universal probability-matching extreme value predictor (namely the data maximum for the prediction, corresponding to ).
In the absence of a known non-informative prior on the shape parameter, a number strategies could be adopted for constructing predictors which attempt to match probability across the full range of . In this paper, the approach will be to interpolate between the two extreme predictors and .
7 Predictions across all shape parameters
The two extreme predictors proposed thus far each define a surface above the -dimensional simplex of all possible normalised data . If there is a general predictor which matches probability at all , then one might expect its prediction surface to approach for small and for near unity, these being the regions where data tends to congregate in the respective extreme limits.
A large number of interpolation schemes were considered but for brevity, only one such scheme is presented here.
7.1 An interpolated predictor
First we pre-condition the two extreme predictors and such that they give better probability matching over wider ranges of than the extreme limits for which they have been designed, and then we interpolate.
Pre-multiplying and by some power of the geometric means , of , respectively, can improve their performance over wider ranges of whilst matching probability in their respective limits. That is, moderated predictors can be constructed of the form
where the exponents and are chosen by numerical experiments such that and give reasonable probability matching over most of their respective and ranges (see Fig. 4). Candidate values for and determined on the basis of such numerical experiments are shown in Table 3.
A combined predictor can then be constructed via simple linear interpolation between the two moderated prediction surfaces. The interpolation chosen is based on the “elemental estimators” of McRobie (2013b). These are a family of simple location- and scale-invariant estimators based on log-spacings of the data, and they were shown to be absolutely unbiased estimators of the shape parameter of the GPD. The specific estimator used here is the one which gives equal weight to each elemental estimator.
The interpolation functions are
The resulting total predictor is
All elements of this predictor have some analytical justification except for the two numerically-determined pre-conditioning exponents and . Prediction at the return level from any data sample of size can thus be computed by the above formula, and requires knowledge of only two numbers, and , determined by numerical experiments.
The performance of the interpolated predictor is shown in Fig. 5 for samples of size , 7, 15 and 31. It can be seen that probability is matched to a good approximation in all cases across the full range of , even for extrapolations to return levels far beyond the span of the data. The probability matching is almost exact in the lower half of each figure, which corresponds to prediction factors of up to 100 - i.e. a predictor with a return level up to can be constructed for a sample of size 3, and up to for a sample of size 31. Such large extrapolations beyond the data are often demanded in engineering: design for the 10,000 year event often being required from around 100 years of historical data.
A typical extrapolation by the interpolated predictor is illustrated in Fig. 6. Extrapolation is from a sample of size drawn from a GPD with (i.e. an exponential distribution). The data (dots, bottom left) is plotted against the empirical return levels (both logarithmic), and the predictions are plotted likewise against the corresponding return level aimed for. The actual quantiles are also plotted (dashed), together with the Bayes predictions (knowing that the data is drawn from an exponential). The predictions of the interpolated predictor typically exceed the actual quantile by a considerable margin, reflecting the uncertainty in the actual value of . Loosely speaking, although the data may suggest that there is a high likelihood that the underlying distribution is indeed exponential, the interpolated predictor recognises that there remains an appreciable chance that the data may have been drawn from a heavy-tailed GPD with a value of .
This inherent bias towards larger predictions must not be confused with risk aversion. The actual quantiles (dashed lines) can only be drawn here because the underlying parameters are known. In practice the parameters will be unknown, and the prediction algorithm has been designed to allow for the possibility that the data may have come from any GPD, including those with heavier tails.
8 Application to other distributions
The interpolated predictor is now applied to samples drawn from a variety of distributions outside the GPD family. The results are shown in Fig. 7, where the delivered return interval is plotted against that designed for.
The version of was used. This is applicable to any sample of size , by using only the upper data points for prediction. The performance is illustrated for samples drawn from uniform, normal, one- and two-sided Cauchy and two variants of the Burr distribution. Both axes of Fig. 7 plot , with as target on the -axis and as-delivered on the -axis. The right-most points, with ordinate , thus correspond to extrapolations beyond the data by THREE orders of magnitude - i.e. to the level from a sample of size , using just the largest 7 data points thereof.
In all cases, as becomes significantly larger than , the probability is matched to increasingly better degrees of approximation. This accords with the general expectation, since the upper quantiles will approach GPDs as .
In all cases the location- and scale-parameters were randomised (by picking from normal distributions) for each of the samples drawn. However, given that the predictor guarantees location-scale invariance this should not (and did not) affect the performance.
The uniform (upper left) delivers good probability matching at all , since the uniforms lie within the GPD class.
The normal (upper right) is the first example outside the GPD class. As may be expected, the predictor based on the upper order statistics delivers poor performance for a sample of size (because the upper 7 data points are the full data set, and these will have a two-sided normal distribution, and are thus blatantly far from the GPD, and are in no sense extreme values). However, the probability matching improves dramatically as the sample size increases even moderately beyond .
For the two Cauchy examples (centre, Fig. 7) the performance is good in the one-sided case for all , but requires higher in the two-sided case. Again, this follows expectations. It highlights the fact that the problem lies with attempting to extrapolate from non-extreme data, much of which lies below the mode in the two-side case. The good results for the one-sided distribution, even when , illustrate that the famously heavy-tails of the Cauchy present little problem.
The results for the Burr distributions (lower figures), taken from the family
again illustrate that the predictor works well with heavy-tailed distributions. The lower right figure shows the performance over games played against Burr distributions wherein the parameters and (as well as and ) are drawn from normal distributions. The good performance obtained when playing against such a large stock of distributions with randomised parameters gives some credence to the claim that has properties approaching that of a universal probability-matching extreme value predictor.
Good performance has also been found to be delivered for numerous other distributions, but it should be noted that there are cases where probabilities do not match well. These include Weibull distributions with small shape parameters, and beta distributions where the second parameter is small. “Two population” distributions can also be readily constructed for which the predictor performs poorly. These could be said to be of the “black swan” variety, a simple archetype being 99% uniform over with the remaining 1% uniform over , with . For moderate , most samples contain no information about the existence of the second population above , the unknown unknowns. Extreme predictions are thus mostly based on data from whilst next events with high return period are values in . However, for any extrapolation proposal there will obviously be bad cases, and what is surprising about the proposed predictor is how widely and how often it does work (particularly given how small the sample sizes are for which it does work).
The intuition that a predictor, designed to match probability within the GPD family, might transfer some of its potentially-desirable properties across to more general distributions appears to have been borne out. Although the candidate predictor was only approximately probability-matching across the whole range of GPDs, the precision with which return levels could be delivered, even from small data sets drawn from non-GPD distributions, is remarkable.
Finally, it should be emphasised that it is not the intention of this paper to encourage the extrapolation from three data points to the level. Rather, suspecting that many current methods of extrapolation are inherently optimistic, the paper has endeavoured to put forward a novel alternative for criticism and/or further development.
Appendix 1: Densities of the normalised data
Let raw data points be sampled from a GPD with distribution function
These functions exist over the appropriate domains for and for . The case of reduces to the exponential case.
The parameters are assumed unknown, and we consider first the case , defining . This corresponds to the heavy-tailed case.
We define the ordered data , such that .
All possible samples of raw data form an N-dimensional data space. The possible sorted data samples cover only a semi-infinite prism-shaped subset of a similar N-dimensional space. The density over that prism of the ordered data space is identical to that over an equivalent region of the unordered data space multiplied by a factor , this being the number of such prisms required to make up the full space.
The density over the prism-shaped space of ordered data is thus
Although the parameters are unknown, we may define
It follows that and over the domain of the ’s, their density is
We define location- and scale-independent statistics via
This is the normalised data . It is ordered and lies in the unit interval ( ). Trivially, we can also measure from the opposite end of the interval, defining
We also define
Writing the left and right end points of the unit interval as , and , respectively then together the transformations
have Jacobian and the density in the new variables is
For fixed, the variable is bounded below at . It is removed by integration, using
before removal of via the integration
This is a standard Euler integral representation of a Lauricella function (Exton, 1978) leading to the desired density
where , and is a vector of length with each element being .
The Lauricella function may be expressed using one of its Euler transformations (Exton, 1976) to give the alternative expression
The advantage of this representation is that for sufficiently small , the leading Lauricella parameter is small such that the Lauricella function is close to unity.
For the bounded-tail case, we define with , and a similar derivation leads to
where . It should be noted that this is not simply the expression with the substitution since the bounded nature of the tails leads to different limits in the various integrations.
Appendix 2. Derivation of
is the (average) probability performance delivered at some given tail parameter by any (normalised) predictor which is a function of the (normalised) data . The rest of this paper concerns itself with attempting to construct such a function such that the integral 42 is independent of .
The functional form of the integrand can be derived from first principles in a manner akin to the derivations of Appendix 1 or, equivalently, via appropriate integration of the Appendix 1 results, as here.
Consider an ordered sample of size drawn from a GPD with shape parameter . Let the normalised data be . This normalised data is related to the normalised data plus an extra normalised data point via for to .
For (), the density of is obtained from Eqn. (38) as
The divisor must be introduced since we are considering only that fraction of cases where the next data point exceeds the historical data maximum.
Substituting (a transformation with Jacobian ) leads to the joint density for and of
where the vectors extend from to .
Using one of the standard Lauricella transforms (Exton, 1976) this can be written
This now needs to be integrated from to to obtain the tail probability.
It follows from the Euler integral representation of the Lauricella function that
This leads to
A further Lauricella transform leads to the final form
with for to (where and ) .
A similar derivation, omitted for brevity, for the bounded-tail case with negative ( positive), leads to
where for to (with ). Again, this is not simply the expression with , owing to the bounded nature of the tails changing various integration limits.
Appendix 3: The constraint on the power law exponents in the bounded-tail case
We derive forms for power law predictors for the extreme bounded-tail case (where and is small and positive).
In the case, a constraint on the power-law exponents was derived directly from the small behaviour of the Lauricella form of of Eqn. 9. Unfortunately, a similar procedure does not seem possible via of Eqn. 10 (or any of its Lauricella transformations or asymptotic forms) in the bounded-tail case. Instead, it is necessary to return to first principles, inserting the small approximation en route.
For a sample of size drawn from a GPD in the bounded-tail case, the density of the ordered, normalised data is given by a Lauricella transformation of Eqn. LABEL:Frechdens as
with , , and for to (where ).
Consider now drawing an additional data point . This might not exceed the sample maximum, but we restrict attention to the fraction of cases when it does. The density of the ordered, normalised data in this region is thus given by Eqn. 51, with and . That is,
with . The leading in the denominator of Eqn 52 accounts for the fact we only consider those cases where the next data point exceeds the sample maximum.
The Jacobian of the transformation gives