Maximum likelihood estimators based on the block maxima method

Maximum likelihood estimators based on the block maxima method

Clément Dombry
Univ. Bourgogne Franche-Comté,
Laboratoire de Mathématiques de Besançon,
UMR CNRS 6623, 16 route de Gray, 25030 Besançon cedex, France.
Ana Ferreira
Instituto Superior Técnico,
Universidade de Lisboa,
Av. Rovisco Pais 1049-001 Lisboa, Portugal.

The extreme value index is a fundamental parameter in univariate Extreme Value Theory (EVT). It captures the tail behavior of a distribution and is central in the extrapolation beyond observed data. Among other semi-parametric methods (such as the popular Hill’s estimator), the Block Maxima (BM) and Peaks-Over-Threshold (POT) methods are widely used for assessing the extreme value index and related normalizing constants. We provide asymptotic theory for the maximum likelihood estimators (MLE) based on the BM method. Our main result is the asymptotic normality of the MLE with a non-trivial bias depending on the extreme value index and on the so-called second order parameter. Our approach combines asymptotic expansions of the likelihood process and of the empirical quantile process of block maxima. The results permit to complete the comparison of most common semi-parametric estimators in EVT (MLE and probability weighted moment estimators based on the POT or BM methods) through their asymptotic variances, biases and optimal mean square errors.

Key words: asymptotic normality, block maxima method, extreme value index, maximum likelihood estimator, peaks-over-threshold method, probability weighted moment estimator.

AMS Subject classification: 62G32, 62G20, 62G30.

1 Introduction

The Block Maxima (BM) method, also known as Annual Maxima, after Gumbel [11] is a fundamental method in Extreme Value Theory and has been widely used. The method is justified under the Maximum Domain of Attraction (MDA) condition: for an independent and identically distributed (i.i.d.) sample with distribution function , if the linearly normalized partial maxima converges in distribution, then the limit must be a Generalized Extreme Value (GEV) distribution. In practice, one rarely exactly knows but the MDA condition holds for most common continuous distributions.

In the BM method, the initial sample is divided into blocks of the same size and the MDA condition ensures that the block maxima are approximately GEV distributed. The method is commonly used in hydrology and other environmental applications or in insurance and finance when analysing extremes - see e.g. the monographs by Embrechts et al. [9], Coles [5], Beirlant et al. [2], de Haan and Ferreira [6] and references therein.

The GEV is a three parameter distribution, with the usual location and scale parameters, and the extreme value index being the main parameter as it characterizes the heaviness of the tail. Several estimation methods have been proposed, including the classical maximum likelihood (ML) and probability weighted moments (PWM) estimators (Hosking et al. [13]). The asymptotic study of these estimators has been established for a sample from the GEV distribution and asymptotic normality holds with null bias and explicit variance (Prescott and Walden [15], Hosking et al. [13], Smith [17], Bücher and Segers [4]). The theory is made quite difficult and technical by the fact that the support of the GEV is varying with respect to its parameters. Regularity in quadratic mean of the GEV model has been proven only recently by Bücher and Segers [4] and we provide here a different and somewhat simpler proof (cf. Proposition 4.1).

However, in applications, the sample block maxima are only approximately GEV so that the classical parametric theory suffers from model misspecification. In this paper, we intend to fill this gap for ML estimators (MLE), by showing asymptotic normality under a flexible second order condition (a refinement of the MDA condition). Depending on the asymptotic block size, a non trivial bias may appear in the limit for which we provide an exact expression. Recently Ferreira and de Haan [10] showed asymptotic normality of the PWM estimators under the same conditions. They derived a uniform expansion for the empirical quantile of block maxima that is a crucial tool in our approach as well. Indeed, the MLE can be seen as a maximizer of the so-called likelihood process. Expressing the likelihood process in terms of this empirical quantile process, we are able to derive an expansion of the likelihood process that implies the asymptotic normality of the MLE. This derivation is again made quite technical by the fact that the support of the GEV is varying. Note that the asymptotic normality for the MLE of a Fréchet distribution based on the block maxima of a stationary heavy-tailed time series has been obtained by Bücher and Segers [3]. There the issue of parameter dependent supports is avoided but time dependence has to be dealt with. Besides, the ideas underlying their proof are quite different.

The asymptotic normality result in the present paper brings novel results to the theoretical comparison of the main semi-parametric estimation procedures in EVT. On the one hand it permits to compare BM and Peaks-over-Threshold (POT) methods (see e.g. Balkema and de Haan [1], Pickands [14]), the latter being another fundamental method in EVT and concurrent with BM. We discuss and compare the four different approaches – MLE/PWM estimators in the BM/POT approaches – based on exact theoretical formulas for asymptotic variances, biases and optimal mean square errors depending on the extreme value index and the second order parameter. It turns out that MLE under BM has minimal asymptotic variance among all combinations MLE/PWM and BM/POT but, on the other hand it has some significant asymptotic bias. When analysing the asymptotic optimal mean square error that balances variance and bias, the most efficient combination turns out to be MLE under POT (e.g. Drees, Ferreira and de Haan 2004). It turns out that the optimal sample size is larger for POT-MLE than for BM-MLE, giving a theoretical justification to the heuristic that POT allows for a better use of the data than BM.

The outline of the paper is as follows: In Section 2 we present the main theoretical conditions and results including Theorem 2.2 giving the asymptotic normality of the MLE. In Section 3 we present a comparative study of asymptotic variances and biases, optimal asymptotic mean square errors and optimal samples sizes among all combinations MLE/PWM and BM/POT. In Section 4 we state additional theoretical statements, including the local asymptotic normality of MLE under the fully parametric GEV model, and provide all the proofs. Finally, Appendix A gathers some formulas for the information matrix and for the bias of BM-MLE and Appendix B provides useful bounds for the derivatives of the likelihood function that are necessary for the main proofs.

2 Asymptotic behaviour of MLE

2.1 Framework and notations

The GEV distribution with index is defined by

and the corresponding log-likelihood by


For , the formula is interpreted as . The three parameter model with index , location and scale is defined by the log-likelihood


A distribution is said to belong to the max-domain of attraction of the extreme value distribution , denoted by , if there exist normalizing sequences and such that

The main aim of the BM method is to estimate the extreme value index as well as the normalizing constants and . The set-up is the following. Consider independent and identically distributed (i.i.d.) random variables with common distribution function . Divide the sequence into blocks of length and define the -th block maximum by


For each , the variables are i.i.d. with distribution function and by the max-domain of attraction condition


This suggests that the distribution of is approximately a GEV distribution with parameters . The method consists in pretending that the sample follows exactly the GEV distribution and in maximizing the GEV log-likelihood so as to compute the MLE. A particular feature of the method is that the model is clearly misspecified since the GEV distribution appears as the limit distribution of the block maxima as the block size tends to while in practice we have to use a finite block size. As seen afterwards, we quantify the misspecification thanks to the so-called second order condition that implies an asymptotic expansion of the empirical quantile process with a non trivial bias term. When plugging this expansion in the ML equations, we obtain a bias term for the likelihood process as well as for the MLE.

The (misspecified) log-likelihood of the -sample is


We say that an estimator is a MLE if it solves the score equations

which we write shortly in vectorial notation


A main purpose of this paper is to study the existence and asymptotic normality of the MLE under the following conditions:

  • First order condition:

    Note that the also called first order condition (2.4) is equivalent to

    with . W.l.g., we can take in Equation 2.4, what we shall assume in the following.

  • Second order condition: for some positive function and some positive or negative function with ,


    with . Note that necessarily and is regularly varying with index .

  • Asymptotic growth for the number of blocks and the block size :


2.2 Main results

Before considering the MLE, we focus on the asymptotic properties of the likelihood and score processes. For the purpose of asymptotic we introduce the local parameter :


Set . The local log-likelihood process at is given by


and, the local score process by


Clearly, the score equation (2.6) rewrites in this new variable as .

In the following, denote the quantile function of the extreme value distribution , i.e.

Proposition 2.1.

Assume conditions (2.7) and (2.8). Let be a sequence of positive numbers verifying, as ,


Let be the ball of center and radius . Then, uniformly for ,


with the Fisher information matrix


As a consequence, the local log-likelihood process is strictly concave on with high probability.

Remark 2.1.

The conditions in Proposition 2.1 are sufficient for consistency of MLE, see Dombry [7]. In particular implies , the later required for consistency. When , condition (2.13) implies that (2.14) holds for , .

Our main result is the following Theorem establishing the asymptotic behavior of the local likelihood process and from which the existence and asymptotic normality of MLE will be deduced.

Theorem 2.1.

Assume conditions (2.7) and (2.8). Then, the local likelihood process satisfies uniformly for in compact sets




i.e., is asymptotically Gaussian with variance equal to the information matrix and mean depending on the second order condition (2.7) through


and on the asymptotic block size through from (2.8).

Remark 2.2.

Explicit formulas for the Fisher information matrix have been given by Prescott and Walden [15] (see also Beirlant et al. [2] page 169). The vector given by the integral representation (2.19) can also be computed explicitly. Formulas are provided in Appendix A.

Remark 2.3.

Equation (2.8) requires that both the number of blocks and the block size go to infinity with a relative rate measured by the second order scaling function and a parameter . When , the bias term disappears in (2.18); this corresponds to the situation where grows to infinity very quickly with respect to so that the block size is large enough and the GEV approximation (2.4) is very good.

Existence and asymptotic normality of the MLE can be deduced from Theorem 2.1, mainly by the argmax theorem. The concavity property stated in Proposition 2.1 plays an important role in the proof of existence and uniqueness.

Theorem 2.2.

Assume conditions (2.7) and (2.8).

  1. There exists a sequence of estimators , , such that



  2. If , ,are two sequences of estimators satisfying


    then and are equal with high probability, i.e.

Remark 2.4.

An interesting by-product of the strict concavity stated in Proposition 2.1 is the convergence of numerical procedures for the computation of the MLE that are implemented in software. The Newton-Raphson algorithm is commonly used to solve numerically the score equation (2.6). Strict concavity of the objective function on a large neighbourhood of the solution ensures convergence of the algorithm with high probability as soon as the initial value belongs to this neighbourhood.

3 Theoretical comparisons: BM vs POT and MLE vs PWM

The POT method uses observations above some high threshold or top order statistic and the underlying approximate model is the Generalized Pareto distribution (Balkema and de Haan [1], Pickands [14]). Estimators of the shape parameter , as well as location and scale parameters have been proposed and widely studied, including MLE and PWM (Hosking and Wallis [12]). For their asymptotic properties - under basically the same conditions as under BM in Theorem 2.2 - we refer to de Haan and Ferreira [6]. Asymptotic normality of PWM estimators under BM has been established only recently by Ferreira and de Haan [10] and a comparison of PWM estimators under BM and POT has been carried out. The aim of the present section is to include our new asymptotic results for MLE estimators under BM, completing the picture in the comparison of the four different cases BM/POT and MLE/PWM.

Recall that asymptotic normality of MLE (resp. PWM estimator) holds for (resp. ). The number of selected observations corresponds to the number of blocks in BM and of selected top order statistics in POT. Similarly as in Ferreira and de Haan [10], our comparative study is restricted to the range where second order conditions for BM and POT are comparable (cf. Drees et al. [8] or Ferreira and de Haan [10]). In the following we compare MLE/PWM under BM/POT methods through their: (i) asymptotic variances (VAR), (ii) asymptotic biases (BIAS), (iii) optimal asymptotic mean square errors (AMSE) and optimal number of observations minimizing AMSE ().

(i) Asymptotic variances

The asymptotic variance depends on only and is plotted in Figure 1 where straight lines stand for MLE and dashed lines for PWM estimators. Among all four different cases, BM-MLE is the one with the smallest variance within its range. Moreover, for both estimators, BM has the lowest variance indicating that BM is preferable to POT when variance is concerned.

Figure 1: Asymptotic variances of estimators of the extreme value index . The straight lines corresponds to the MLE under BM and POT while the dashed lines correspond to PWM under BM and POT.

(ii) Asymptotic biases

The asymptotic biases depend on and and are shown in Figures 23: POT-MLE is the one with the smallest bias also in absolute value when compared to BM-MLE, contrary to what was observed for variance. This is in agreement with what has been observed when comparing BM-PWM and POT-PWM, also shown in Figures 23 already analysed in Ferreira and de Haan [10]. There is again the indication that POT method is favourable to BM when concerning bias.

Figure 2: Asymptotic bias of estimators of the extreme value index : blue color for BM and orange for POT.
(a) MLE
(b) PWM
Figure 3: Ratios of asymptotic bias: .

(iii) Optimal asymptotic MSEs and optimal number of observations

Another way to compare the estimators that combines both variance and bias information is through mean square error. One can compare these for the optimal number of observations i.e., that value for which the asymptotic mean square error (AMSE) is minimal. Similarly as in Ferreira and de Haan [10], under the conditions of Theorem 2.2, we have

with a decreasing and regularly varying function such that . It follows in particular that the optimal is different but of the same order for both estimators and methods. As for the optimal AMSE, we have

When considering ratios of optimal AMSE (or optimal number of selected observations), the regularly varying function cancels out and the asymptotic ratio does not depend on but only on and .

Figure 4 shows the contour plots of the ratio for MLE and PWM estimators. It is surprising to see a reverse behaviour in both cases: in the range of parameters considered, POT is preferable when MLE are considered, while BM is mostly preferable for PWM estimators.

(a) MLE
(b) PWM
Figure 4: Contour plot of ratios of optimal AMSE: .

In Figure 5 are shown for comparing optimal AMSE among all combinations. The green surface corresponds to MLE-POT that has always the minimal optimal AMSE in the range of parameters considered. Finally, Figure 6 reports for MLE the asymptotic ratio of optimal numbers of selected observations, that is . We can see that the optimal number of observations is larger for POT, which is in agreement with the PWM case considered in previous studies.

Figure 5: Comparison of optimal AMSE: the lowest green surface corresponds to MLE-POT. Figure 6: Asymptotic ratio of optimal sizes: .

4 Main proofs

We start by introducing some material that will be useful for the proofs. More technical material is still postponed to Appendices.

4.1 Local asymptotic normality of the GEV model

If the observations are exactly distributed, then the choice of constants


ensures that the normalised block maxima are i.i.d. with distribution . The issue of model misspecification is irrelevant in that particular case.

In this simple i.i.d. setting, a key property in the theory of ML estimation is differentiability in quadratic mean (see e.g. van der Vaart [18, Chapter 7]). A statistical model defined by the family of densities is called differentiable in quadratic mean at the point if there exists a measurable function called the score function such that

The following Proposition corresponds to Proposition 3.2 in Bücher and Segers [4]. We provide a slightly different proof in the case .

Proposition 4.1.

The three parameter GEV model with log-likelihood defined in Equation (2.2) is differentiable in quadratic mean at if and only if . The score function is then given by .

Proof of Proposition 4.1.

The density of the 3-parameter GEV model is given by

if and otherwise. In the case , the function

is continuously differentiable for every and the information matrix is well defined and continuous (see Appendix A or Beirlant et al. [2] page 169). Differentiability in quadratic mean of the GEV model follows by a straightforward application of Lemma 7.6 in Van der Vaart [18].

In the case , the function is not differentiable at points such that . Going back to the definition of differentiability in quadratic mean, we need to show that


This is credible because for all , the relation


For further reference, we note also that, for ,


A rigorous proof of (4.2) is given below. Since , we have for in a neighbourhood of so that the density vanishes outside with the right endpoint of the distribution . We also introduce . For all , the function is twice continuously differentiable, whence Taylor formula entails

for some . Together with Equation (4.3), the formula and Proposition B.1, this yields the upper bound

for all and small enough. This entails


It remains to estimate the contribution of the integral between and . Recall that vanishes for . We have

The first and second integral converge to as because and . The third integral converges also to because and so that the score is square integrable (its covariance matrix is ). We deduce


Equations (4.4) and (4.5) imply (4.2).

The fact that differentiability in quadratic mean does not hold when is proved in Bücher and Segers [4] Appendix C. They observe that for ,

which rules out differentiability in quadratic mean. We omit further details here. ∎

Differentiability in quadratic mean implies that the score function is centered with finite variance equal to the information matrix, i.e.


Another important consequence of differentiability in quadratic mean is the local asymptotic normality property of the local score process. The following Corollary follows from Proposition 4.1 by a direct application of Theorem 7.2 in Van der Vaart [18].

Corollary 4.1.

Assume that with and that the constants , are given by (4.1). Then the local log-likelihood process (2.10) satisfies


Note the similarity between Theorem 2.1 and Corollary 4.1. In Theorem 2.1 however, the is uniform on compact sets and the model misspecification results in a bias term for the asymptotic distribution .

4.2 The empirical quantile process associated to BM

The starting point of the proof of Proposition 2.1 and Theorem 2.1 is to rewrite the local log-likelihood process (2.10) in terms of the (normalized) empirical quantile process


where are the order statistics of the block maxima sample defined by (2.3) and denotes the smallest integer larger than or equal to . The local log-likelihood process (2.10) can be rewritten as


Convergence (2.4) ensures the convergence of the empirical quantile process to the “true” quantile function defined in (2.12). The following expansion of the empirical quantile process is taken from Ferreira and de Haan [10], Theorem 2.1.

Proposition 4.2.

Assume conditions (2.7) and (2.8). For a specific choice of the second order auxiliary functions and in (2.7),


where , , denotes an appropriate sequence of standard Brownian bridges and the remainder term satisfies, for ,


uniformly for .

Remark 4.1.

For Proposition 4.2, the auxiliary functions and have to be specially chosen for establishing uniform second order regular variation bounds refining (2.7), see Lemma 4.2 in [10]. However, this choice is useful for the proofs only and is irrelevant for the statements of the main results in Section 2.2.

The following Proposition provides useful technical bounds for the proof of the main results.

Proposition 4.3.

Assume conditions (2.7) and (2.8). Then, as ,


uniformly for and as in Proposition 2.1.

For the proof of Proposition 4.3, we need the following Lemma.

Lemma 4.1.

Let be the order statistics of i.i.d random variables with standard Fréchet distribution. Then,

where the term is uniform for .

Proof of Lemma 4.1.

An equivalent statement is, with standard uniform,

We use Shorack and Wellner [16] (inequality 1 on p.419): for some


Relation (4.11) implies, for ,

Both sides are bounded for . Relation (4.12) implies, for ,

Both sides are bounded for . ∎

Proof of Proposition 4.3.

Let be a unit Fréchet random variable, i.e. with distribution function , , and be the order statistics from the associated i.i.d. sample of size , . Note that . From Lemma 4.2 in [10],

is bounded (above and below) by

for each provided and are large enough. Hence,

is bounded (above and below) by,