Automated Threshold Selection for Extreme Value Analysis via Goodness-of-Fit Tests with Application to Batched Return Level Mapping Research partially supported by NSF grant DMS 1521730, University of Connecticut Research Excellence Program, and Environment Canada. The authors thank Prof. Vartan Choulakian for the discussion and insight on approximating the tails of the null distribution of the Anderson–Darling and Cramér–von Mises statistics for generalized Pareto distributed data.

Automated Threshold Selection for Extreme Value Analysis via Goodness-of-Fit Tests with Application to Batched Return Level Mapping thanks: Research partially supported by NSF grant DMS 1521730, University of Connecticut Research Excellence Program, and Environment Canada. The authors thank Prof. Vartan Choulakian for the discussion and insight on approximating the tails of the null distribution of the Anderson–Darling and Cramér–von Mises statistics for generalized Pareto distributed data.

Brian Bader Department of Statistics, University of Connecticut Jun Yan Department of Statistics, University of Connecticut Xuebin Zhang Environment and Climate Change Canada

Threshold selection is a critical issue for extreme value analysis with threshold-based approaches. Under suitable conditions, exceedances over a high threshold have been shown to follow the generalized Pareto distribution (GPD) asymptotically. In practice, however, the threshold must be chosen. If the chosen threshold is too low, the GPD approximation may not hold and bias can occur. If the threshold is chosen too high, reduced sample size increases the variance of parameter estimates. To process batch analyses, commonly used selection methods such as graphical diagnosis are subjective and cannot be automated, while computational methods may not be feasible. We propose to test a set of thresholds through the goodness-of-fit of the GPD for the exceedances, and select the lowest one, above which the data provides adequate fit to the GPD. Previous attempts in this setting are not valid due to the special feature that the multiple tests are done in an ordered fashion. We apply two recently available stopping rules that control the false discovery rate or familywise error rate to ordered goodness-of-fit tests to automate threshold selection. Various model specification tests such as the Cramér–von Mises, Anderson–Darling, Moran’s, and a score test are investigated. The performance of the method is assessed in a large scale simulation study that mimics practical return level estimation. This procedure was repeated at hundreds of sites in the western US to generate return level maps of extreme precipitation.

Key Words: batch analysis, exceedance diagnostic, specification test, stopping rule

1 Introduction

Extreme value analysis has wide applications in a variety of fields, such as hydrology (e.g., Katz et al. 2002) and climatology (e.g., Kharin et al. 2007, 2013; Wang and Zhang 2008). Return levels, the levels of a measure of interest that is expected to be exceeded on average once every certain period of time (return period), are a major goal of inference in these fields. Commonly used in extreme value analysis, threshold-based methods involve modeling data exceeding a suitably chosen high threshold with the generalized Pareto distribution (GPD) (Balkema and De Haan 1974; Pickands 1975). Choice of the threshold is critical in obtaining accurate estimates of model parameters and return levels. The threshold should be chosen high enough for the excesses to be well approximated by the GPD to minimize bias, but not so high to substantially increase the variance of the estimator due to reduction in the sample size (the number of exceedances).

Although it is widely accepted in the statistics community that the threshold-based approach, in particular peaks-over-threshold (POT), use data more efficiently than the block maxima method (e.g., Caires 2009), it is much less used than the block maxima method in some fields such as climatology. The main issue is the need to conduct the analyses over many locations, sometimes over hundreds of thousands of locations (e.g., Kharin et al. 2007, 2013), and there is a lack of efficient procedures that can automatically select the threshold in each analysis. For example, to make a return level map of annual maximum daily precipitation for three west coastal US states of California, Oregon, and Washington alone, one needs to repeat the estimation procedure including threshold selection, at each of the hundreds of stations. For the whole US, thousands of sites need to be processed. A graphical based diagnosis is clearly impractical. It is desirable to have an intuitive automated threshold selection procedure in order to use POT in analysis.

Many threshold selection methods are available in the literature; see Scarrott and MacDonald (2012) and Caeiro and Gomes (2016) for recent reviews. Among them, graphical diagnosis methods are the most popular. The mean residual life (MRL) plot (Davison and Smith 1990) uses the fact that, if follows a GPD, then for , the MRL , if existing, is linear in . The threshold is chosen to be the smallest such that the sample MRL is approximately linear above this point. Parameter stability plots check whether the esimates of GPD parameters, possibly transformed to be comparable across different thresholds, are stable above some level of threshold. Drees et al. (2000) suggested the Hill plot, which plots the Hill estimator of the shape parameter based on the top order statistics against . Many variants of the Hill plot have been proposed (Scarrott and MacDonald 2012, Section 4). The threshold is the th smallest order statistic beyond which the parameter estimates are deemed to be stable. The usual fit diagnostics such as quantile plots, return level plots, and probability plots can be helpful too, as demonstrated in Coles (2001). Graphical diagnostics give close inspection of the data, but they are quite subjective and disagreement on a particular threshold can occur even among experts in the field; see, for example, a convincing demonstration of unclear choices using the Fort Collins precipitation data in Scarrott and MacDonald (2012).

Other selection methods can be grouped into various categories. One is based on the asymptotic results about estimators of properties of the tail distribution. The threshold is selected by minimizing the asymptotic mean squared error (MSE) of the estimator of, for example, tail index (Beirlant et al. 1996), tail probabilities (Hall and Weissman 1997), or extreme quantiles (Ferreira et al. 2003). Theoretically sound as these methods are, their finite sample properties are not well understood. Some require second order assumptions, and computational (bootstrap) based estimators can require tuning parameters (Danielsson et al. 2001a) or may not be satisfactory for small samples (Ferreira et al. 2003). Irregardless, such resampling methods may be quite time-consuming in an analysis involving many locations.

A second category of methods are based on goodness-of-fit of the GPD, where the threshold is selected as the lowest level above which the GPD provides adequate fit to the exceedances (e.g., Davison and Smith 1990; Dupuis 1999; Choulakian and Stephens 2001; Northrop and Coleman 2014). Goodness-of-fit tests are simple to understand and perform, but the multiple testing issue for a sequence of tests in an ordered fashion have not been addressed to the best of our knowledge. Methods in the third category are based on mixtures of a GPD for the tail and another distribution for the “bulk” joined at the threshold (e.g., MacDonald et al. 2011; Wadsworth and Tawn 2012). Treating the threshold as a parameter to estimate, these methods can account for the uncertainty from threshold selection in inferences. However, there is little known about the asymptotic properties of this setup and how to ensure that the bulk and tail models are robust to one another in the case of misspecification.

Some automated procedures have been proposed. The simple naive method is a priori or fixed threshold selection based on expertise on the subject matter at hand. Various rules of thumb have been suggested; for example, select the top 10% of the data (e.g., DuMouchel 1983), or the top square root of the sample size (e.g., Ferreira et al. 2003). Such one rule for all is not ideal in climate applications where high heterogeneity in data properties is the norm. The proportion of the number of rain days can be very different from wet tropical climates to dry subtropical climates; therefore the number of exceedance over the same time period can be very different across different climates. Additionally, the probability distribution of daily precipitation can also be different in different climates, affecting the speed the tails converge to the GPD (Raoult and Worms 2003). Goodness-of-fit test based procedures can be automated to select the lowest one in a sequence of thresholds, at which the goodness-of-fittest is not rejected (e.g., Choulakian and Stephens 2001). The error control, however, is challenging because of the ordered nature of the hypotheses, and the usual methods from multiple testing such as false discovery rate (e.g., Benjamini 2010a, b) cannot be directly applied.

We propose an automated threshold selection procedure based on a sequence of goodness-of-fit tests with error control for ordered, multiple testing. The very recently developed stopping rules for ordered hypotheses in G’Sell et al. (2015) are adapted to control the familywise error rate (FWER), the probability of at least one type I error in the whole family of tests (Shaffer 1995), or the false discovery rate (FDR), the expected proportion of incorrectly rejected null hypotheses among all rejections (Benjamini and Hochberg 1995; Benjamini and Yekutieli 2001). They are applied to four goodness-of-fit tests at each candidate threshold, including the Cramér–Von Mises test, Anderson–Darling test, Rao’s score test, and Moran’s test. For the first two tests, the asymptotic null distribution of the testing statistic is unwieldy (Choulakian and Stephens 2001). Parametric bootstrap puts bounds on the approximate p-values which would invalidate the stopping rules. We propose a fast approximation based on the results of Choulakian and Stephens (2001) to facilitate the application. The performance of the procedures are investigated in a large scale simulation study, and recommendations are made. The procedure is applied to annual maximum daily precipitation return level mapping for three west coastal states of the US. Interesting findings are revealed from different stopping rules. The automated threshold selection procedure has applications in various fields, especially when batch processing of massive datasets is needed.

The outline of the paper is as follows. Section 2 presents the generalized Pareto model, its theoretical justification, and how to apply the automated sequential threshold testing procedure. Section 3 introduces the tests proposed to be used in the automated testing procedure. A simulation study demonstrates the power of the tests for a fixed threshold under various misspecification settings and it is found that the Anderson–Darling test is most powerful in the vast majority of cases. A large scale simulation study in Section 4 demonstrates the error control and performance of the stopping rules for multiple ordered hypotheses, both under the null GPD and a plausible alternative distribution. In Section 5, we return to our motivating application and derive return levels for extreme precipitation at hundreds of west coastal US stations to demonstrate the usage of our method. A final discussion is given in Section 6.

2 Automated Sequential Testing Procedure

Threshold methods for extreme value analysis are based on that, under general regularity conditions, the only possible non-degenerate limiting distribution of properly rescaled exceedances of a threshold is the GPD as  (e.g., Pickands 1975). The GPD has cumulative distribution function


where , is a shape parameter, and is a threshold-dependent scale parameter. The GPD also has the property that for some threshold , the excesses follow a GPD with the same shape parameter, but a modified scale .

Let be a random sample of size . If is sufficiently high, the exceedances for all such that are approximately a random sample from a GPD. The question is to find the lowest threshold such that the GPD fits the sample of exceedances over this threshold adequately. Our solution is through a sequence of goodness-of-fit tests (e.g., Choulakian and Stephens 2001) or model specification tests (e.g., Northrop and Coleman 2014) for the GPD to the exceedances over each candidate threshold in an increasing order. The multiple testing issues in this special ordered setting are handled by the most recent stopping rules in G’Sell et al. (2015).

Consider a fixed set of candidate thresholds . For each threshold, there will be excesses, . The sequence of null hypotheses can be stated as

: The distribution of the exceedances above follows the GPD.

For a fixed , many tests are available for this . An automated procedure can begin with and continue until some threshold provides an acceptance of (Choulakian and Stephens 2001; Thompson et al. 2009). The problem, however, is that unless the test has high power, an acceptance may happen at a low threshold by chance and, thus, the data above the chosen threshold is contaminated. One could also begin at the threshold and descend until a rejection occurs, but this would result in an increased type I error rate. The multiple testing problem obviously needs to be addressed, but the issue here is especially challenging because these tests are ordered; if is rejected, then will be rejected for all . Despite the extensive literature on multiple testing and the more recent developments on FDR control and its variants (e.g., Benjamini and Hochberg 1995; Benjamini and Yekutieli 2001; Benjamini 2010a, b), no definitive procedure has been available for error control in ordered tests until the recent work of G’Sell et al. (2015).

We adapt the stopping rules of G’Sell et al. (2015) to the sequential testing of (ordered) null hypotheses . Let be the corresponding p-values of the hypotheses. G’Sell et al. (2015) transform the sequence of p-values to a monotone sequence and then apply the original method of Benjamini and Hochberg (1995) on the monotone sequence. Two rejection rules are constructed, each of which returns a cutoff such that are rejected. If no exists, then no rejection is made. The first is called ForwardStop,


and the second is called StrongStop,


where is a pre-specified level.

Under the assumption of independence among the tests, both rules were shown to control the FDR at level . In our setting, stopping at implies that goodness-of-fit of the GPD to the exceedances at the first thresholds is rejected. In other words, the first set of null hypotheses are rejected. At each , ForwardStop is a transformed average of the previous and current p-values, while StrongStop only accounts for the current and subsequent p-values. StrongStop provides an even stronger guarantee of error control; that is that the FWER is controlled at level . The tradeoff for stronger error control is reduced power to reject. Thus, the StrongStop rule tends to select a lower threshold than the ForwardStop rule. This is expected since higher thresholds are more likely to approximate the GPD well, and thus provide higher p-values. In this sense, ForwardStop could be thought of as more conservative (i.e., stopping at higher threshold by rejecting more thresholds).

The stopping rules, combined with the sequential hypothesis testing, provide an automated selection procedure — all that is needed are the levels of desired control for the ForwardStop and StrongStop procedures, and a set of thresholds. A caveat is that the p-values of the sequential tests here are dependent, unlike the setup of G’Sell et al. (2015). Nonetheless, the stopping rules may still provide some reasonable error control as their counter parts in the non-sequential multiple testing scenario (Benjamini and Yekutieli 2001; Blanchard and Roquain 2009). A simulation study is carried out in Section 4 to assess the empirical properties of the two rules.

3 The Tests

The automated procedure can be applied with any valid test for each hypothesis corresponding to threshold . Four existing goodness-of-fit tests that can be used are presented. Because the stopping rules are based on transformed p-values, it is desirable to have testing statistics whose p-values can be accurately measured; bootstrap based tests that put a lower bound on the p-values (1 divided by the bootstrap sample size) may lead to premature stopping. For the remainder of this section, the superscript is dropped. We consider the goodness-of-fit of GPD to a sample of size of exceedances above a fixed threshold .

3.1 Anderson–Darling and Cramér–von Mises Tests

The Anderson–Darling and the Cramér–von Mises tests for the GPD have been studied in detail (Choulakian and Stephens 2001). Let be the maximum likelihood estimator (MLE) of under from the the observed exceedances. Make the probability integral transformation based on , as in (1), for the order statistics of the exceedances . The Anderson–Darling statistic is

The Cramér–von Mises statistic is

The asymptotic distributions of and are unwieldy, both being sum of weighted chi-squared variables with one degree of freedom with weight found from the eigenvalues of an integral equation (Choulakian and Stephens 2001, Section 6). The distributions depend only on the estimate of . The tests are often applied by referring to a table of a few upper tail percentiles of the asympotic distributions (Choulakian and Stephens 2001, Table 2), or through bootstrap. In either case, the p-values are truncated by a lower bound. Such truncation of a smaller p-value to a larger one can be proven to weaken the stopping rules given in (2) and (3). In order to apply these tests in the automated sequential setting, more accurate p-values for the tests are needed.

We provide two remedies to the table in Choulakian and Stephens (2001). First, for values in the range of , which is applicable for most applications, we enlarge the table to a much finer resolution through a pre-set Monte Carlo method. For each value from to incremented by , 2,000,000 replicates of and are generated with sample size to approximate their asymptotic distributions. A grid of upper percentiles from 0.999 to 0.001 for each value is produced and saved in a table for future fast reference. Therefore, if and the test statistic falls in the range of the table, the p-value is computed via log-linear interpolation.

The second remedy is for observed test statistics that are greater than that found in the table (implied p-value less than 0.001). As Choulakian pointed out (in a personal communication), the tails of the asymptotic distributions are exponential, which can be confirmed using the available tail values in the table. For a given , regressing on the upper tail percentiles in the table, for example, from 0.05 to 0.001, gives a linear model that can be extrapolated to approximate the p-value of statistics outside of the range of the table. This approximation of extremely small p-values help reduce loss of power in the stopping rules.

The two remedies make the two tests very fast and are applicable for most applications with . For values outside of , although slow, one can use parametric bootstrap to obtain the distribution of the test statistic, understanding that the p-value has a lower bound. The methods are implemented in R package eva (Bader and Yan 2015).

3.2 Moran’s Test

Moran’s goodness-of-fit test is a byproduct of the maximum product spacing (MPS) estimation for estimating the GPD parameters. MPS is a general method that allows efficient parameter estimation in non-regular cases where the MLE fails or the information matrix does not exist (Cheng and Amin 1983). It is based on the fact that if the exceedances are indeed from the hypothesized distribution, their probability integral transformation would behave like a random sample from the standard uniform distribution. Consider the ordered exceedances . Define the spacings as

for with and . The MPS estimators then found by minimizing

As demonstrated in Wong and Li (2006), the MPS method is especially useful for GPD estimation in the non-regular cases of Smith (1985). In cases where the MLE exists, the MPS estimator may have an advantage of being numerical more stable for small samples, and have the same properties as the MLE asymptotically.

The objective function evaluated at the MPS estimator is Moran’s statistic (Moran 1953). Cheng and Stephens (1989) showed that, under the null hypothesis, Moran’s statistic is normally distributed and when properly center and scaled, has an asymptotic chi-square approximation. Define

where is Euler’s constant. Moran’s goodness-of-fit test statistic is

where and . Under the null hypothesis asymptotically, follows a chi-square distribution with degrees of freedom. Wong and Li (2006) show that for GPD data, the test empirically holds its size for samples as small as ten.

3.3 Rao’s Score Test

Northrop and Coleman (2014) considered a piecewise constant specification of the shape parameter as the alternative hypothesis. For a fixed threshold , a set of higher thresholds are specified to generate intervals on the support space. That is, for the set of thresholds , the shape parameter is given a piecewise representation


The null hypothesis is tested as . Rao’s score test has the advantage that only restricted MLE under is needed, in contrast to the asymptotically equivalent likelihood ratio test or Wald test. The testing statistic is

where is the score function and is the fisher information matrix, both evaluated at the restricted MLE . Given that  (Smith 1985), the asymptotic null distribution of is .

Northrop and Coleman (2014) tested suitable thresholds in an ascending manner, increasing the initial threshold and continuing the testing of . They suggested two possibilities for automation. First, stop testing as soon as an acceptance occurs, but the p-values are not guaranteed to be non-decreasing for higher starting thresholds. Second, stop as soon as all p-values for testing at subsequent higher thresholds are above a certain significance level. The error control under multiple, ordered testing were not addressed.

3.4 A Power Study

The power of the four goodness-of-fit tests are examined in an individual, non-sequential testing framework. The data generating schemes in Choulakian and Stephens (2001) are used, some of which are very difficult to distinguish from the GPD:

  • Gamma with shape 2 and scale 1.

  • Standard lognormal distribution (mean 0 and scale 1 on log scale).

  • Weibull with scale 1 and shape 0.75.

  • Weibull with scale 1 and shape 1.25.

  • 50/50 mixture of GPD() and GPD().

  • 50/50 mixture of GPD() and GPD().

  • 50/50 mixture of GPD() and GPD().

Finally, the GPD() was also used to check the sizes. Four sample sizes were considered: 50, 100, 200, 400. For each scenario, 10,000 samples are generated. The four tests were applied to each sample, with a rejection recorded if the p-value is below 0.05. For the score test, a set of thresholds were set according to the deciles of the generated data.

Sample Size 50 100
Test Score Moran AD CVM Score Moran AD CVM
Gamma(2, 1) 7.6 9.7 47.4 43.5 8.0 14.3 64.7 59.7
LogNormal 6.0 5.9 13.3 8.6 5.4 8.2 28.3 23.4
Weibull(0.75) 11.5 7.8 55.1 23.5 12.1 9.5 65.1 39.4
Weibull(1.25) 6.6 11.3 29.1 27.3 5.6 12.5 20.8 19.2
GPDMix(0.4, 0.4) 11.4 7.5 19.2 9.9 16.0 8.6 24.3 20.4
GPDMix(0, 0.4) 7.8 6.0 6.5 5.9 7.4 5.9 9.6 5.6
GPDMix(0.25, 0.25) 8.1 6.5 6.0 7.4 8.9 6.5 11.1 8.4
GPD(1, 0.25) 6.9 5.5 6.7 6.1 5.0 5.2 5.2 5.2
Sample Size 200 400
Test Score Moran AD CVM Score Moran AD CVM
Gamma(2, 1) 15.4 23.3 95.3 93.1 36.5 42.2 100.0 100.0
LogNormal 5.7 11.9 69.3 59.7 7.9 19.1 97.8 95.0
Weibull(0.75) 16.5 10.7 84.8 66.4 32.1 14.6 98.2 93.0
Weibull(1.25) 8.0 14.7 40.9 36.7 15.5 19.2 79.8 74.6
GPDMix(0.4, 0.4) 31.5 9.7 45.1 44.0 63.8 11.9 79.9 80.2
GPDMix(0, 0.4) 8.9 6.5 8.8 7.3 12.3 6.2 10.8 10.3
GPDMix(0.25, 0.25) 13.9 6.7 16.6 14.8 26.2 7.9 33.0 32.4
GPD(1, 0.25) 5.3 5.8 7.2 5.2 4.7 5.3 5.8 4.7
Table 1: Empirical rejection rates of four goodness-of-fit tests for GPD under various data generation schemes with nominal size 0.05. GPDMix(a, b) refers to a 50/50 mixture of GPD(1, a) and GPD(1, b).

The rejection rates are summarized in Table 1. Samples in which the MLE failed were removed, which accounts for roughly 10.8% of the Weibull samples with shape 1.25 and sample size 400, and around 10.7% for the Gamma distribution with sample size 400. Decreasing the sample size in these cases actually decreases the percentage of failed MLE samples. This may be due to the shape of these two distributions, which progressively become more distinct from the GPD as their shape parameters increase. In the other distribution cases, no setting resulted in more than a 0.3% failure rate. As expected, all tests appear to hold their sizes, and their powers all increase with sample size. The mixture of two GPDs is the hardest to detect. For the GPD mixture of shape parameters 0 and 0.4, quantile matching between a single large sample of generated data and the fitted GP distribution shows a high degree of similarity. In the vast majority of cases, the Anderson–Darling test appears to have the highest power, followed by the Cramér–von Mises test. Between the two, the Anderson–Darling statistic is a modification of the Cramér–von Mises statistic giving more weight to observations in the tail of the distribution, which explains the edge of the former.

4 Simulation Study of the Automated Procedures

The empirical properties of the two stopping rules for the Anderson–Darling test are investigated in simulation studies. To check the empirical FWER of the StrongStop rule, data only need to be generated under the null hypothesis. For , , , and , 10,000 GPD samples were generated in each setting of these parameters. Ten thresholds are tested by locating ten percentiles, 5 to 50 by 5. Since the data is generated from the GPD, data above each threshold is also distributed as GP, with an adjusted scale parameter. Using the StrongStop procedure and with no adjustment, the observed FWER is compared to the expected rates for each setting at various nominal levels. At a given nominal level and setting of the parameters, the observed FWER is calculated as the number of samples with a rejection of at any of the thresholds, divided by the total number of samples. The results of this study can been seen in Figure 1.

Figure 1: Observed FWER for the Anderson–Darling test (using StrongStop and no adjustment) versus expected FWER at various nominal levels. The 45 degree line indicates agreement between the observed and expected rates under H0.

It can be seen that the observed FWER under using the StrongStop procedure is nearly in agreement with the expected rate at most nominal levels for the Anderson–Darling test (observed rate is always within 5% of the expected error rate). However, using the naive, unadjusted stopping rule (simply rejecting for any p-value less than the nominal level), the observed FWER is generally 2–3 times the expected rate for all sample sizes.

It is of interest to investigate the performance of the ForwardStop and StrongStop in selecting a threshold under misspecification. To check the ability of ForwardStop and StrongStop to control the false discover rate (FDR), data need to be generated under misspecification. Consider the situation where data is generated from a 50/50 mixture of distributions. Data between zero and five are generated from a Beta distribution with , and scaled such that the support is on zero to five. Data above five is generated from the GPD(, , ). Choosing misspecification in this way ensures that the mixture distribution is at least continuous at the meeting point. See Figure 2 for a visual assessment.

Figure 2: Plot of the (scaled) density of the mixture distribution used to simulate misspecification of . Vertical line indicates the continuity point of the two underlying distributions.

A total of observations are generated, with from the Beta distribution and from the GP distribution. 1000 datasets are simulated and 50 thresholds are tested, starting by using all observations and removing the 15 lowest observations each time until the last threshold uses just 250 observations. In this way, the correct threshold is given when the lowest observations are removed.

The results are presented here. The correct threshold is the 34th, since and . Rejection of less than 33 tests is problematic since it allows contaminated data to be accepted. Thus, a conservative selection criteria is desirable. Rejection of more than 33 tests is okay, although some non-contaminated data is being thrown away. By its nature, the StrongStop procedure has less power-to-reject than ForwardStop, since it has a stricter error control (FWER versus FDR).

Figure 3 displays the frequency distribution for threshold choice using the Anderson–Darling test at the 5% nominal level for the three stopping rules. There is a clear hierarchy in terms of power-to-reject between the ForwardStop, StrongStop, and no adjustment procedures. StrongStop, as expected, provides the least power-to-reject, with all 1000 simulations selecting a threshold below the correct one. ForwardStop is more powerful than the no adjustment procedure and on average selects a higher threshold. The median number of thresholds rejected is 33, 29, and 22 for ForwardStop, no adjustment, and StrongStop respectively.

Figure 3: Percent frequency of the number of rejections for the Anderson–Darling test and different error control procedures, at the 5% nominal level. The correct number of rejections is 34. This is for the simulation setting under sequential misspecification described in Section 4.

The observed versus expected FDR using ForwardStop and the observed versus expected FWER using StrongStop using the Anderson–Darling test for the data generated under misspecification can be seen in Figure 4. There appears to be reasonable agreement between the expected and observed FDR rates using ForwardStop, while StrongStop has observed FWER rates well below the expected rates.

Figure 4: Observed FDR (using ForwardStop) and observed FWER (using StrongStop) versus expected FDR and FWER respectively using the Anderson–Darling test, at various nominal levels. This is for the sequential simulation setting under misspecification described in Section 4. The 45 degree line indicates agreement between the observed and expected rates.

To further evaluate the performance of the combination of tests and stopping rules, it is of interest to know how well the data chosen above each threshold can estimate parameters of interest. Two such parameters are the shape and return level. The year return level (e.g., Coles 2001, Section 4.3.3) is given by


for a given threshold , where is the number of observations per year, and is the rate, or proportion of the data exceeding . The rate parameter has a natural estimator simply given by the number of exceeding observations divided by the total number of observations. A confidence interval for can easily be found using the delta method, or preferably and used here, profile likelihood. The ‘true’ return levels are found by letting , , and treating the data as the tail of some larger dataset, which allows computation of the rate for each threshold.

For ease of presentation, focus will be on the performance of the Anderson–Darling test in conjunction with the three stopping rules. In each of the 1000 simulated datasets (seen in Figure 2), the threshold selected for each of the three stopping rules is used to determine the bias, squared error, and confidence interval coverage (binary true/false) for the shape parameter and 50, 100, 250, and 500 year return levels. An average of the bias, squared error, and coverage across the 1000 simulations is taken, for each stopping rule and parameter value. For each parameter and statistic of interest (mean bias, mean squared error (MSE), and mean coverage), a relative percentage is calculated for the three stopping rules. A visual assessment of this analysis is provided in Figure 5.

Figure 5: Average performance comparison of the three stopping rules in the simulation study under misspecification, using the Anderson–Darling test for various parameters. Shown are the relative frequencies of the average value of each metric (bias, squared error, and coverage) for each stopping rule and parameter of interest.

It is clear from the result here that ForwardStop is the most preferable stopping rule in application. For all parameters, it has the smallest average bias (by a proportion of 2-1) and highest coverage rate. In addition, the MSE is comparable or smaller than the unadjusted procedure in all cases. Arguably, StrongStop has the worst performance, obtaining the highest average bias and MSE, and the lowest coverage rates for all parameters. Replacing the Anderson–Darling test with the other tests provides similar results (not shown here).

5 Return Level Mapping of Extreme Precipitation

One particularly useful application of the automated threshold selection method is generating an accurate return level map of extreme precipitation in the three western US coastal states of California, Oregon, and Washington. The automated procedure described in Section 2 provides a method to quickly obtain an accurate map without the need for visual diagnostics at each site. Return level maps are of great interest to hydrologists (Katz et al. 2002) and can be used for risk management and land planning (Blanchet and Lehning 2010; Lateltin and Bonnard 1999).

Daily precipitation data is available for tens of thousands of surface sites around the world via the Global Historical Climatology Network (GHCN). A description of the data network can be found in Menne et al. (2012). After an initial screening to remove sites with less than 50 years of available data, there are 720 remaining sites across the three chosen coastal states.

As the annual maximum daily amount of precipitation mainly occurs in winter, only the winter season (November - March) observations are used in modeling. A set of thresholds for each site are chosen based on the data percentiles; for each site the set of thresholds is generated by taking the 75th to 97th percentiles in increments of 2, then in increments of 0.1 from the 97th to 99.5th percentile. This results in 37 thresholds to test at each site. If consecutive percentiles result in the same threshold due to ties in data, only one is used to guarantee uniqueness of thresholds and thus reducing the total number of thresholds tested at that site.

As discussed in the beginning of Section 2, modeling the exceedances above a threshold with generalized Pareto requires the exceedances to be independent. This is not always the case with daily precipitation data; a large and persistent synoptic system can bring large storms successively. The extremal index is a measure of the clustering of the underlying process at extreme levels. Roughly speaking, it is equal to (limiting mean cluster size). It can take values from 0 to 1, with independent series exhibiting a value of exactly 1. To get a sense for the properties of series in this dataset, the extremal index is calculated for each site using the 75th percentile as the threshold via the R package texmex (Southworth and Heffernan 2013). In summary, the median estimated extremal index for all 720 sites is 0.9905 and 97% of the sites have an extremal index above 0.9. Thus, we do not do any declustering on the data. A more thorough description of this process can be found in Ferro and Segers (2003) and Heffernan and Southworth (2012).

The Anderson–Darling test is used to test the set of thresholds at each of the 720 sites, following the procedure outlined in Section 3.1. This is arguably the most powerful test out of the four examined in Section 3.4. Three stopping rules are used — ForwardStop, StrongStop, and with no adjustment, which proceeds in an ascending manner until an acceptance occurs. Figure 6 shows the distribution of chosen percentiles for the 720 sites using each of the three stopping rules.

Figure 6: Distribution of chosen percentiles (thresholds) for the 720 western US coastal sites, as selected by each stopping rule. Note that this does not include sites where all thresholds were rejected by the stopping rule.

As expected, ForwardStop is the most conservative, rejecting all thresholds at 348 sites, with the unadjusted procedure rejecting 63, and StrongStop only rejecting all thresholds at 3 sites. Figure 7 shows the geographic representation of sites in which all thresholds are rejected. Note that there is a pattern of rejections by ForwardStop, particularly in the eastern portion of Washington and Oregon, and the Great Valley of California. This may be attributed to the climate differences in these regions – rejected sites had a smaller number of average rain days than non-rejected sites (30 vs. 34), as well as a smaller magnitude of precipitation (an average of 0.62cm vs. 1.29cm). The highly selective feature of the ForwardStop procedure is desired as it suggests not to fit GPD at even the highest threshold at these sites, a guard that is not available from those unconditionally applied, one-for-all rules.

Figure 7: Map of US west coast sites for which all thresholds were rejected (black) and for which a threshold was selected (grey), by stopping rule.

For the sites at which a threshold was selected for both the ForwardStop and StrongStop rules, the return level estimates based on the chosen threshold for each stopping rule can be compared. The result of this comparison can be seen in Figure 8. For a smaller return period (50 years), the agreement between estimates for the two stopping rules is quite high. This is a nice confirmation to have confidence in the analysis, however it is slightly misleading in that it does not contain the sites rejected by ForwardStop.

Figure 8: Comparison of return level estimates based on the chosen threshold for ForwardStop vs. StrongStop. The 45 degree line indicates agreement between the two estimates. This is only for the sites in which both stopping rules did not reject all thresholds.

The end result of this automated batch analysis provides a map of return level estimates. The 50, 100, and 250 year map of return level estimates from threshold selection using ForwardStop and the Anderson–Darling test can be seen in Figure 9. To provide an estimate at any particular location in this area, some form of interpolation can be applied.

Figure 9: Map of US west coast sites with 50, 100, and 250 year return level estimates via threshold selection with ForwardStop and the Anderson–Darling test.

6 Discussion

We propose an intuitive and comprehensive methodology for automated threshold selection in the peaks over threshold approach. In addition, it is not as computationally intensive as some competing resampling or bootstrap procedures (Danielsson et al. 2001b; Ferreira et al. 2003). Automation and efficiency is required when prediction using the peaks over threshold approach is desired at a large number of sites. This is achieved through sequentially testing a set of thresholds for goodness-of-fit to the generalized Pareto distribution (GPD). Previous instances of this sequential testing procedure did not account for the multiple testing issue. We apply two recently developed stopping rules (G’Sell et al. 2015) ForwardStop and StrongStop, that control the false discovery rate and familywise error rate, respectively in the setting of (independent) ordered, sequential testing. It is a novel application of them to the threshold selection problem. There is a slight caveat in our setting, that the tests are not independent, but it is shown via simulation that these stopping rules still provide reasonable error control here.

Four tests are compared in terms of power to detect departures from the GPD at a single, fixed threshold and it is found that the Anderson–Darling test has the most power in various non-null settings. Choulakian and Stephens (2001) derived the asymptotic null distribution of the Anderson–Darling test statistic. However this requires solving an integral equation. Our contribution, with some advice from the author, provides an approximate, but accurate and computationally efficient version of this test. To investigate the performance of the stopping rules in conjunction with the Anderson–Darling test, a large scale simulation study was conducted. Data is generated from a plausible distribution – misspecified below a certain threshold and generated from the null GPD above. In each replicate, the bias, coverage, and squared error is recorded for the stopping threshold of each stopping rule for various parameters. The results of this simulation suggest that the ForwardStop procedure has on average the best performance using the aforementioned metrics. The entire methodology is applied to daily precipitation data at hundreds of sites in three U.S. west coast states, with the goal of creating a return level map.

Temporal or covariate varying thresholds, as discussed in Roth et al. (2012) and Northrop and Jonathan (2011), is an obvious extension to this work. However, one particular complication that arises is performing model selection (i.e. what covariates to include), while concurrently testing for goodness-of-fit to various thresholds. It is clear that threshold selection will be dependent on the choice of model. Another possible extension involves testing for overall goodness-of-fit across sites (one test statistic). In this way, a fixed or quantile regression based threshold may be predetermined and then tested simultaneously across sites. In this setup both spatial and temporal dependence need to be taken into account. Handling this requires some care due to censoring (e.g., Dey and Yan 2016, Section 2.5.2). In other words, it is not straightforward to capture the temporal dependence as exceedances across sites are not guaranteed to occur at the same points in time.


  • Bader and Yan (2015) Bader, B. and Yan, J. (2015), eva: Extreme Value Analysis with Goodness-of-Fit Testing, R package version 0.1.2.
  • Balkema and De Haan (1974) Balkema, A. A. and De Haan, L. (1974), “Residual Life Time at Great Age,” The Annals of Probability, 2, 792–804.
  • Beirlant et al. (1996) Beirlant, J., Vynckier, P., and Teugels, J. L. (1996), “Tail Index Estimation, Pareto Quantile Plots, and Regression Diagnostics,” Journal of the American Statistical Association, 91, 1659–1667.
  • Benjamini (2010a) Benjamini, Y. (2010a), “Discovering the False Discovery Rate,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72, 405–416.
  • Benjamini (2010b) — (2010b), ‘‘Simultaneous and Selective Inference: Current Successes and Future Challenges,” Biometrical Journal, 52, 708–721.
  • Benjamini and Hochberg (1995) Benjamini, Y. and Hochberg, Y. (1995), “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing,” Journal of the Royal Statistical Society. Series B, 57, 289–300.
  • Benjamini and Yekutieli (2001) Benjamini, Y. and Yekutieli, D. (2001), “The Control of the False Discovery Rate in Multiple Testing under Dependency,” The Annals of Statistics, 29, 1165–1188.
  • Blanchard and Roquain (2009) Blanchard, G. and Roquain, É. (2009), “Adaptive False Discovery Rate Control under Independence and Dependence,” The Journal of Machine Learning Research, 10, 2837–2871.
  • Blanchet and Lehning (2010) Blanchet, J. and Lehning, M. (2010), “Mapping snow depth return levels: smooth spatial modeling versus station interpolation,” Hydrology and Earth System Sciences, 14, 2527–2544.
  • Caeiro and Gomes (2016) Caeiro, F. and Gomes, M. I. (2016), ‘‘Threshold Selection in Extreme Value Analysis,” Extreme Value Modeling and Risk Analysis: Methods and Applications, 69–82.
  • Caires (2009) Caires, S. (2009), “A comparative simulation study of the annual maxima and the peaks-over-threshold methods.” Tech. rep., SBW-Belastingen: subproject ‘Statistics’. Deltares Report 1200264-002.
  • Cheng and Amin (1983) Cheng, R. C. H. and Amin, N. A. K. (1983), “Estimating parameters in continuous univariate distributions with a shifted origin,” Journal of the Royal Statistical Society. Series B (Methodological), 45, 394–403.
  • Cheng and Stephens (1989) Cheng, R. C. H. and Stephens, M. A. (1989), “A goodness-of-fit test using Moran’s statistic with estimated parameters,” Biometrika, 76, 385–392.
  • Choulakian and Stephens (2001) Choulakian, V. and Stephens, M. A. (2001), “Goodness-of-Fit Tests for the Generalized Pareto Distribution,” Technometrics, 43, 478–484.
  • Coles (2001) Coles, S. (2001), An Introduction to Statistical Modeling of Extreme Values, Springer, 1st ed.
  • Danielsson et al. (2001a) Danielsson, J., de Haan, L., Peng, L., and de Vries, C. G. (2001a), “Using a Bootstrap Method to Choose the Sample Fraction in Tail Index Estimation,” Journal of Multivariate Analysis, 76, 226–248.
  • Danielsson et al. (2001b) — (2001b), “Using a Bootstrap Method to Choose the Sample Fraction in Tail Index Estimation,” Journal of Multivariate Analysis, 76, 226–248.
  • Davison and Smith (1990) Davison, A. C. and Smith, R. L. (1990), “Models for Exceedances Over High Thresholds,” Journal of the Royal Statistical Society. Series B (Methodological), 52, 393–442.
  • Dey and Yan (2016) Dey, D. K. and Yan, J. (eds.) (2016), Extreme Value Modeling and Risk Analysis: Methods and Applications, CRC Press.
  • Drees et al. (2000) Drees, H., De Haan, L., and Resnick, S. (2000), ‘‘How to Make a Hill Plot,” The Annals of Statistics, 28, 254–274.
  • DuMouchel (1983) DuMouchel, W. H. (1983), “Estimating the Stable Index in Order to Measure Tail Thickness: A Critique,” The Annals of Statistics, 11, 1019–1031.
  • Dupuis (1999) Dupuis, D. (1999), “Exceedances over High Thresholds: A Guide to Threshold Selection,” Extremes, 1, 251–261.
  • Ferreira et al. (2003) Ferreira, A., de Haan, L., and Peng, L. (2003), “On Optimising the Estimation of High Quantiles of a Probability Distribution,” Statistics, 37, 401–434.
  • Ferro and Segers (2003) Ferro, C. A. and Segers, J. (2003), “Inference for Clusters of Extreme Values,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65, 545–556.
  • G’Sell et al. (2015) G’Sell, M. G., Wager, S., Chouldechova, A., and Tibshirani, R. (2015), “Sequential Selection Procedures and False Discovery Rate Control,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), forthcoming.
  • Hall and Weissman (1997) Hall, P. and Weissman, I. (1997), “On the Estimation of Extreme Tail Probabilities,” The Annals of Statistics, 25, 1311–1326.
  • Heffernan and Southworth (2012) Heffernan, J. E. and Southworth, H. (2012), “Extreme value modelling of dependent series using R,” .
  • Katz et al. (2002) Katz, R. W., Parlange, M. B., and Naveau, P. (2002), “Statistics of extremes in hydrology,” Advances in Water Resources, 25, 1287–1304.
  • Kharin et al. (2013) Kharin, V. V., Zwiers, F., Zhang, X., and Wehner, M. (2013), “Changes in temperature and precipitation extremes in the CMIP5 ensemble,” Climatic Change, 119, 345–357.
  • Kharin et al. (2007) Kharin, V. V., Zwiers, F. W., Zhang, X., and Hegerl, G. C. (2007), “Changes in temperature and precipitation extremes in the IPCC ensemble of global coupled model simulations,” Journal of Climate, 20, 1419–1444.
  • Lateltin and Bonnard (1999) Lateltin, O. and Bonnard, C. (1999), “Hazard assessment and land-use planning in Switzerland for snow avalanches, floods and landslides.” Tech. rep., World Meteorological Organization.
  • MacDonald et al. (2011) MacDonald, A., Scarrott, C. J., Lee, D., Darlow, B., Reale, M., and Russell, G. (2011), “A Flexible Extreme Value Mixture Model,” Computational Statistics & Data Analysis, 55, 2137–2157.
  • Menne et al. (2012) Menne, M. J., Durre, I., Vose, R. S., Gleason, B. E., and Houston, T. G. (2012), “An overview of the global historical climatology network-daily database,” Journal of Atmospheric and Oceanic Technology, 29, 897–910.
  • Moran (1953) Moran, P. A. P. (1953), “The Random Division of an Interval-Part II,” Journal of the Royal Statistical Society. Series B (Methodological), 15, 77–80.
  • Northrop and Coleman (2014) Northrop, P. J. and Coleman, C. L. (2014), “Improved threshold diagnostic plots for extreme value analyses,” Extremes, 17, 289–303.
  • Northrop and Jonathan (2011) Northrop, P. J. and Jonathan, P. (2011), “Threshold modelling of spatially dependent non-stationary extremes with application to hurricane-induced wave heights,” Environmetrics, 22, 799–809.
  • Pickands (1975) Pickands, III, J. (1975), “Statistical Inference Using Extreme Order Statistics,” The Annals of Statistics, 3, 119–131.
  • Raoult and Worms (2003) Raoult, J.-P. and Worms, R. (2003), “Rate of Convergence for the Generalized Pareto Approximation of the Excesses,” Advances in Applied Probability, 35, 1007–1027.
  • Roth et al. (2012) Roth, M., Buishand, T., Jongbloed, G., Klein Tank, A., and Zanten, v. J. (2012), ‘‘A regional peaks-over-threshold model in a nonstationary climate,” Water Resources Research, 48.
  • Scarrott and MacDonald (2012) Scarrott, C. and MacDonald, A. (2012), “A Review of Extreme Value Threshold Estimation and Uncertainty Quantification,” REVSTAT–Statistical Journal, 10, 33–60.
  • Shaffer (1995) Shaffer, J. P. (1995), “Multiple Hypothesis Testing,” Annual Review of Psychology, 46, 561–584.
  • Smith (1985) Smith, R. L. (1985), “Maximum Likelihood Estimation in a Class of Nonregular Cases,” Biometrika, 72, 67–90.
  • Southworth and Heffernan (2013) Southworth, H. and Heffernan, J. E. (2013), texmex: Statistical Modelling of Extreme Values, R package version 2.1.
  • Thompson et al. (2009) Thompson, P., Cai, Y., Reeve, D., and Stander, J. (2009), “Automated threshold selection methods for extreme wave analysis,” Coastal Engineering, 56, 1013–1021.
  • Wadsworth and Tawn (2012) Wadsworth, J. L. and Tawn, J. A. (2012), “Likelihood-Based Procedures for Threshold Diagnostics and Uncertainty in Extreme Value Modelling,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74, 543–567.
  • Wang and Zhang (2008) Wang, J. and Zhang, X. (2008), “Downscaling and projection of winter extreme daily precipitation over North America,” Journal of Climate, 21, 923–937.
  • Wong and Li (2006) Wong, T. S. T. and Li, W. K. (2006), “A note on the estimation of extreme value distributions using maximum product of spacings,” in Time Series and Related Topics, Institute of Mathematical Statistics, pp. 272–283.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description