
# Large deviations for method-of-quantiles estimators of one-dimensional parameters†

†The support of Gruppo Nazionale per l’Analisi Matematica, la Probabilità e le loro Applicazioni (GNAMPA) of the Istituto Nazionale di Alta Matematica (INdAM) is acknowledged.

Valeria Bignozzi, Dipartimento di Statistica e Metodi Quantitativi, Università di Milano Bicocca, Via Bicocca degli Arcimboldi 8, I-20126 Milano, Italia; e-mail: valeria.bignozzi@unimib.it

Claudio Macci, Dipartimento di Matematica, Università di Roma Tor Vergata, Via della Ricerca Scientifica, I-00133 Roma, Italia; e-mail: macci@mat.uniroma2.it

Lea Petrella, Dipartimento di Metodi e Modelli per l’Economia, il Territorio e la Finanza, Sapienza Università di Roma, Via del Castro Laurenziano 9, I-00161 Roma, Italia; e-mail: lea.petrella@uniroma1.it
###### Abstract

We consider method-of-quantiles estimators of unknown parameters, namely the analogue of method-of-moments estimators obtained by matching empirical and theoretical quantiles at some probability level $\lambda\in(0,1)$. The aim is to present large deviation results for these estimators as the sample size tends to infinity. We study in detail several examples; for specific models we discuss the choice of the optimal value of $\lambda$ and we compare the convergence of the method-of-quantiles and method-of-moments estimators.

AMS Subject Classification: 60F10; 62F10; 62F12.
Keywords: location parameter; methods of moments; order statistics; scale parameter; skewness parameter.

## 1 Introduction

Estimation of parameters of statistical or econometric models is one of the main concerns in the parametric inference framework. When the probability law is specified (up to unknown parameters), the main tool to solve this problem is the Maximum Likelihood (ML) technique; on the other hand, whenever the assumption of a particular distribution is too restrictive, different solutions may be considered. For instance the Method of Moments (MM) and the Generalized Method of Moments (GMM) provide valuable alternative procedures; in fact the application of these methods only requires the knowledge of some moments.

A different approach is to consider the Method of Quantiles (MQ), that is the analogue of MM with quantiles; MQ estimators are obtained by matching the empirical percentiles with their theoretical counterparts at one or more probability levels. Inference via quantiles goes back to [1], where the authors consider an estimation problem for a three-parameter log-normal distribution; their approach consists in minimizing a suitable distance between the theoretical and empirical quantiles, see for instance [12]. Subsequent papers deal with the estimation of parameters of extreme value (see [8] and [11]), logistic (see [9]) and Weibull (see [10]) distributions. A more recent reference is [2], where several other distributions are studied. We also recall [6], where the authors consider an indirect inference method based on the simulation of theoretical quantiles, or a function of them, when they are not available in closed form. In [15], an iterative procedure based on ordinary least-squares estimation is proposed to compute MQ estimators; such estimators can be easily modified by adding a LASSO penalty term if a sparse representation is desired, or by restricting the matching to a given range of quantiles in order to match a part of the target distribution. Quantiles and empirical quantiles represent a key tool also in quantitative risk management, where they are studied under the name of Value-at-Risk (see for instance [13]).

In our opinion, MQ estimators deserve a deeper investigation because of several advantages. They allow one to estimate parameters when the moments are not available and they are invariant with respect to increasing transformations; moreover they have fewer computational problems, and they behave better when distributions are heavy-tailed or when their supports vary with the parameters.

The aim of this paper is to present large deviation results for MQ estimators (as the sample size tends to infinity) for statistical models with a one-dimensional unknown parameter $\theta\in\Theta$, where the parameter space $\Theta$ is a subset of the real line; thus we match empirical and theoretical quantiles at one probability level $\lambda\in(0,1)$. The theory of large deviations is a collection of techniques which gives an asymptotic computation of small probabilities on an exponential scale (see e.g. [4] as a reference on this topic). Several examples of statistical models are considered throughout the paper, and some particular distributions are studied in detail. For most of the examples considered, we are able to find an explicit expression for the rate function which governs the large deviation principle of the MQ estimators and, when possible, our investigation provides the optimal $\lambda$ that guarantees a faster convergence to the true parameter (see Definition 3.1). Further we compare MQ and MM estimators in terms of the local behavior of the rate functions around the true value $\theta_0$ of the parameter, in the spirit of Remark 2.1. Which one of the estimators behaves better strictly depends on the type of parameter we have to estimate and varies across distributions. However, we provide explicit examples (apart from the obvious ones where the MM estimators are not available) where MQ estimators are preferable.

We conclude with the outline of the paper. In Section 2 we recall some preliminaries. Sections 3 and 4 are devoted to the results for MQ and MM estimators, respectively. In Section 5 we present examples for different kinds of parameters (e.g. scale, location, skewness), and for each example specific distributions are discussed in Section 6.

## 2 Preliminaries

In this section we present some preliminaries on large deviations and we provide a rigorous definition of the MQ estimators studied in this paper (see Definition 2.1 below).

### 2.1 Large deviations

We start with the concept of large deviation principle (LDP for short). A sequence of random variables $\{W_n:n\geq1\}$ taking values in a topological space $\mathcal{W}$ satisfies the LDP with rate function $I$ if $I:\mathcal{W}\to[0,\infty]$ is a lower semi-continuous function,

$$\liminf_{n\to\infty}\frac{1}{n}\log P(W_n\in O)\geq-\inf_{w\in O}I(w)\quad\text{for all open sets }O$$

and

$$\limsup_{n\to\infty}\frac{1}{n}\log P(W_n\in C)\leq-\inf_{w\in C}I(w)\quad\text{for all closed sets }C.$$

We also recall that a rate function is said to be good if all its level sets are compact.

###### Remark 2.1 (Local comparison between rate functions around the unique common zero).

It is known that, if a rate function $I$ uniquely vanishes at some $w_0\in\mathcal{W}$, then the sequence of random variables $\{W_n:n\geq1\}$ converges weakly to $w_0$. Moreover, if we have two rate functions $I_1$ and $I_2$ which uniquely vanish at the same point $w_0$, and if $I_1(w)>I_2(w)$ in a neighborhood of $w_0$ (except at $w_0$ itself), then any sequence which satisfies the LDP with rate function $I_1$ converges to $w_0$ faster than any sequence which satisfies the LDP with rate function $I_2$.

We also recall a recent large deviation result on order statistics of i.i.d. random variables (see Proposition 2.1 below) which plays a crucial role in this paper. We start with the following condition.

###### Condition 2.1.

Let $\{X_n:n\geq1\}$ be a sequence of i.i.d. real valued random variables with distribution function $F$, and assume that $F$ is continuous and strictly increasing on $(\alpha,\omega)$, where $-\infty\leq\alpha<\omega\leq\infty$. Moreover let $\{k_n:n\geq1\}$ be such that $k_n\in\{1,\ldots,n\}$ for all $n\geq1$ and $k_n/n\to\lambda\in(0,1)$.

We introduce the following notation: for all $n\geq1$, $X_{1:n}\leq\cdots\leq X_{n:n}$ are the order statistics of the sample $X_1,\ldots,X_n$; for $p,q\in(0,1)$ we set

$$H(p|q):=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q},\tag{1}$$

that is the relative entropy of the Bernoulli distribution with parameter $p$ with respect to the Bernoulli distribution with parameter $q$.
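As a quick numerical companion (our sketch, not part of the paper), the relative entropy in (1) can be coded directly; the helper name `bernoulli_relative_entropy` is ours.

```python
import math

def bernoulli_relative_entropy(p: float, q: float) -> float:
    """H(p|q): relative entropy of the Bernoulli(p) law with respect to Bernoulli(q)."""
    if not (0.0 < p < 1.0 and 0.0 < q < 1.0):
        raise ValueError("p and q must lie in (0, 1)")
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

# H(p|q) vanishes if and only if p == q, and is strictly positive otherwise.
print(bernoulli_relative_entropy(0.3, 0.3))        # 0.0
print(bernoulli_relative_entropy(0.3, 0.5) > 0.0)  # True
```

This strict positivity away from $p=q$ is what makes the rate functions below vanish only at the true value of the parameter.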

###### Proposition 2.1 (Theorem 3.2 in [7] for λ∈(0,1)).

Assume that Condition 2.1 holds. Then $\{X_{k_n:n}:n\geq1\}$ satisfies the LDP with good rate function $I_{\lambda,F}$ defined by

$$I_{\lambda,F}(x):=\begin{cases}H(\lambda|F(x))&\text{for }x\in(\alpha,\omega)\\\infty&\text{otherwise.}\end{cases}$$
###### Remark 2.2 ($I''_{\lambda,F}(F^{-1}(\lambda))$ as the inverse of an asymptotic variance).

Theorem 7.1(c) in [3] states that, under suitable conditions, $\sqrt{n}\,(X_{k_n:n}-F^{-1}(\lambda))$ converges weakly to the centered Normal distribution with variance $\frac{\lambda(1-\lambda)}{(F'(F^{-1}(\lambda)))^2}$. Then, if we assume that $F$ is twice differentiable, with some computations we can check that $I''_{\lambda,F}(F^{-1}(\lambda))=\frac{(F'(F^{-1}(\lambda)))^2}{\lambda(1-\lambda)}$.

A more general formulation of Proposition 2.1 could be given also for $\lambda\in\{0,1\}$ but, in view of the applications presented in this paper, we prefer to consider a restricted version of the result with $\lambda\in(0,1)$ only. This restriction allows us to have the goodness of the rate function (see Remark 1 in [7]), which is needed to apply the contraction principle (see e.g. Theorem 4.2.1 in [4]).

### 2.2 MQ estimators

Here we present a rigorous definition of MQ estimators. In view of this, the next Condition 2.2 plays a crucial role.

###### Condition 2.2.

Let $\{F_\theta:\theta\in\Theta\}$ be a family of distribution functions, where $\Theta\subseteq\mathbb{R}$, and assume that, for all $\theta\in\Theta$, $F_\theta$ satisfies the same hypotheses of the distribution function $F$ in Condition 2.1, for some $-\infty\leq\alpha_\theta<\omega_\theta\leq\infty$. Moreover, for $\lambda\in(0,1)$, consider the function $F^{-1}_{(\bullet)}(\lambda)$, where $[F^{-1}_{(\bullet)}(\lambda)]:\Theta\to\mathbb{R}$, defined by

$$[F^{-1}_{(\bullet)}(\lambda)](\theta):=F^{-1}_\theta(\lambda).$$

We assume that, for all $x$, the equation $F^{-1}_\theta(\lambda)=x$ admits a unique solution (with respect to $\theta$), which will be denoted by $(F^{-1}_{(\bullet)}(\lambda))^{-1}(x)$.

Now we are ready to present the definition.

###### Definition 2.1.

Assume that Condition 2.2 holds. Then $\{(F^{-1}_{(\bullet)}(\lambda))^{-1}(X_{\lceil\lambda n\rceil:n}):n\geq1\}$ is a sequence of MQ estimators (for the level $\lambda$).
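To make Definition 2.1 concrete, here is a small simulation sketch (ours, with a hypothetical helper name) for the exponential scale family $F_\theta(x)=1-e^{-x/\theta}$: since $F^{-1}_\theta(\lambda)=-\theta\log(1-\lambda)$, the MQ estimator is the order statistic $X_{k_n:n}$ divided by $-\log(1-\lambda)$.

```python
import math
import random

def mq_estimate_exponential_scale(sample, lam):
    """MQ estimator of the scale theta for F_theta(x) = 1 - exp(-x/theta):
    solve F_theta^{-1}(lam) = -theta*log(1-lam) = X_{k_n:n} for theta."""
    xs = sorted(sample)
    k = max(1, round(lam * len(xs)))  # k_n with k_n / n -> lam
    return xs[k - 1] / (-math.log(1.0 - lam))

random.seed(0)
theta0 = 2.0
sample = [random.expovariate(1.0 / theta0) for _ in range(20000)]
theta_hat = mq_estimate_exponential_scale(sample, lam=0.5)
print(theta_hat)  # close to the true scale theta0 = 2.0
```

The estimator is consistent because the empirical $\lambda$-quantile converges to $F^{-1}_{\theta_0}(\lambda)$, and inverting $\theta\mapsto F^{-1}_\theta(\lambda)$ is continuous here.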

Proposition 3.1 below provides the LDP for the sequence of estimators in Definition 2.1 (as the sample size goes to infinity) when the true value of the parameter is $\theta_0\in\Theta$. Actually we give a more general formulation in terms of $\{(F^{-1}_{(\bullet)}(\lambda))^{-1}(X_{k_n:n}):n\geq1\}$, where $\{k_n:n\geq1\}$ is a sequence as in Condition 2.1.

## 3 Results for MQ estimators

In this section we prove the LDP for the sequence of estimators in Definition 2.1. Moreover we discuss some properties of the rate function; in particular Proposition 3.2 (combined with Remark 2.1 above) leads us to a concept of optimal $\lambda$, presented in Definition 3.1 below.

We start with our main result and, in view of this, we present the following notation:

$$h_{\lambda,\theta_0}(\theta):=F_{\theta_0}(F^{-1}_\theta(\lambda))\quad(\text{for }F^{-1}_\theta(\lambda)\in(\alpha_{\theta_0},\omega_{\theta_0})).\tag{2}$$
###### Proposition 3.1 (LD for MQ estimators).

Assume that $\{k_n:n\geq1\}$ is as in Condition 2.1 and that Condition 2.2 holds. Moreover assume that, for some $\theta_0\in\Theta$, $\{X_n:n\geq1\}$ are i.i.d. random variables with distribution function $F_{\theta_0}$. Then, if the restriction of $(F^{-1}_{(\bullet)}(\lambda))^{-1}$ on $(\alpha_{\theta_0},\omega_{\theta_0})$ is continuous, $\{(F^{-1}_{(\bullet)}(\lambda))^{-1}(X_{k_n:n}):n\geq1\}$ satisfies the LDP with good rate function $I_{\lambda,\theta_0}$ defined by

$$I_{\lambda,\theta_0}(\theta):=\begin{cases}\lambda\log\dfrac{\lambda}{h_{\lambda,\theta_0}(\theta)}+(1-\lambda)\log\dfrac{1-\lambda}{1-h_{\lambda,\theta_0}(\theta)}&\text{for }\theta\in\Theta\text{ such that }F^{-1}_\theta(\lambda)\in(\alpha_{\theta_0},\omega_{\theta_0})\\\infty&\text{otherwise,}\end{cases}$$

where $h_{\lambda,\theta_0}$ is defined by (2).

###### Proof.

Since the restriction of $(F^{-1}_{(\bullet)}(\lambda))^{-1}$ on $(\alpha_{\theta_0},\omega_{\theta_0})$ is continuous, a straightforward application of the contraction principle yields the LDP of $\{(F^{-1}_{(\bullet)}(\lambda))^{-1}(X_{k_n:n}):n\geq1\}$ with good rate function $I_{\lambda,\theta_0}$ defined by

$$I_{\lambda,\theta_0}(\theta):=\inf\{I_{\lambda,F_{\theta_0}}(x):x\in(\alpha_{\theta_0},\omega_{\theta_0}),\,(F^{-1}_{(\bullet)}(\lambda))^{-1}(x)=\theta\},$$

where $I_{\lambda,F_{\theta_0}}$ is the good rate function in Proposition 2.1, namely the good rate function defined by $I_{\lambda,F_{\theta_0}}(x)=H(\lambda|F_{\theta_0}(x))$ for $x\in(\alpha_{\theta_0},\omega_{\theta_0})$. Moreover the set in the infimum has at most one element, namely

$$\{x\in(\alpha_{\theta_0},\omega_{\theta_0}):(F^{-1}_{(\bullet)}(\lambda))^{-1}(x)=\theta\}=\begin{cases}\{F^{-1}_\theta(\lambda)\}&\text{for }\theta\in\Theta\text{ such that }F^{-1}_\theta(\lambda)\in(\alpha_{\theta_0},\omega_{\theta_0})\\\emptyset&\text{otherwise;}\end{cases}$$

thus we have $I_{\lambda,\theta_0}(\theta)=H(\lambda|F_{\theta_0}(F^{-1}_\theta(\lambda)))$ for $\theta\in\Theta$ such that $F^{-1}_\theta(\lambda)\in(\alpha_{\theta_0},\omega_{\theta_0})$, and $I_{\lambda,\theta_0}(\theta)=\infty$ otherwise. The proof is completed by taking into account the definition of the function $H(\cdot|\cdot)$ in (1). ∎

###### Remark 3.1 (Rate function invariance with respect to increasing transformations).

Let $\{F_\theta:\theta\in\Theta\}$ be a family of distribution functions as in Condition 2.2 and assume that there exists an interval $(\alpha,\omega)$ such that $(\alpha_\theta,\omega_\theta)=(\alpha,\omega)$ for all $\theta\in\Theta$. Moreover let $\psi:(\alpha,\omega)\to\mathbb{R}$ be a strictly increasing function. Then, if we consider the MQ estimators based on the sequence $\{\psi(X_n):n\geq1\}$ instead of $\{X_n:n\geq1\}$, we can consider an adapted version of Proposition 3.1 with $F_\theta\circ\psi^{-1}$ in place of $F_\theta$, $(\psi(\alpha),\psi(\omega))$ in place of $(\alpha,\omega)$ and, as stated in Property 1.5.16 in [5], $\psi\circ F^{-1}_\theta$ in place of $F^{-1}_\theta$. The LDP provided by this adapted version of Proposition 3.1 is governed by the rate function $I_{\lambda,\theta_0;\psi}$ defined by

$$I_{\lambda,\theta_0;\psi}(\theta):=\begin{cases}H(\lambda\,|\,F_{\theta_0}\circ\psi^{-1}(\psi\circ F^{-1}_\theta(\lambda)))&\text{for }\theta\in\Theta\text{ such that }\psi\circ F^{-1}_\theta(\lambda)\in(\psi(\alpha),\psi(\omega))\\\infty&\text{otherwise}\end{cases}$$

instead of the rate function

$$I_{\lambda,\theta_0}(\theta):=\begin{cases}H(\lambda\,|\,F_{\theta_0}(F^{-1}_\theta(\lambda)))&\text{for }\theta\in\Theta\text{ such that }F^{-1}_\theta(\lambda)\in(\alpha,\omega)\\\infty&\text{otherwise.}\end{cases}$$

One can easily realize that $I_{\lambda,\theta_0;\psi}$ and $I_{\lambda,\theta_0}$ coincide.

By taking into account the rate function in Proposition 3.1, it would be interesting to compare two rate functions $I_{\lambda_1,\theta_0}$ and $I_{\lambda_2,\theta_0}$ in the spirit of Remark 2.1 for a given pair $\lambda_1,\lambda_2\in(0,1)$; namely it would be interesting to have a strict inequality between $I_{\lambda_1,\theta_0}(\theta)$ and $I_{\lambda_2,\theta_0}(\theta)$ in a neighborhood of $\theta_0$ (except at $\theta_0$ itself).

Thus, if both rate functions are twice differentiable, $I_{\lambda_1,\theta_0}$ is locally larger (resp. smaller) than $I_{\lambda_2,\theta_0}$ around $\theta_0$ if we have $I''_{\lambda_1,\theta_0}(\theta_0)>I''_{\lambda_2,\theta_0}(\theta_0)$ (resp. $I''_{\lambda_1,\theta_0}(\theta_0)<I''_{\lambda_2,\theta_0}(\theta_0)$). So it is natural to give an expression of $I''_{\lambda,\theta_0}(\theta_0)$ under suitable hypotheses.

###### Proposition 3.2 (An expression for $I''_{\lambda,\theta_0}(\theta_0)$).

Let $I_{\lambda,\theta_0}$ be the rate function in Proposition 3.1. Assume that the function $h_{\lambda,\theta_0}$ in (2) is twice differentiable. Then

$$I''_{\lambda,\theta_0}(\theta_0)=\frac{(h'_{\lambda,\theta_0}(\theta_0))^2}{\lambda(1-\lambda)}=\frac{\{F'_{\theta_0}(F^{-1}_{\theta_0}(\lambda))\}^2}{\lambda(1-\lambda)}\left(\frac{d}{d\theta}F^{-1}_\theta(\lambda)\Big|_{\theta=\theta_0}\right)^2,$$

where $h_{\lambda,\theta_0}$ is defined by (2).

###### Proof.

One can easily check that

$$h_{\lambda,\theta_0}(\theta_0)=\lambda\quad\text{and}\quad h'_{\lambda,\theta_0}(\theta_0)=F'_{\theta_0}(F^{-1}_{\theta_0}(\lambda))\cdot\frac{d}{d\theta}F^{-1}_\theta(\lambda)\Big|_{\theta=\theta_0}.$$

Moreover after some computations we get

$$I'_{\lambda,\theta_0}(\theta)=h'_{\lambda,\theta_0}(\theta)\left(\frac{1-\lambda}{1-h_{\lambda,\theta_0}(\theta)}-\frac{\lambda}{h_{\lambda,\theta_0}(\theta)}\right)$$

and

$$I''_{\lambda,\theta_0}(\theta)=h''_{\lambda,\theta_0}(\theta)\left(\frac{1-\lambda}{1-h_{\lambda,\theta_0}(\theta)}-\frac{\lambda}{h_{\lambda,\theta_0}(\theta)}\right)+(h'_{\lambda,\theta_0}(\theta))^2\left(\frac{\lambda}{h^2_{\lambda,\theta_0}(\theta)}+\frac{1-\lambda}{(1-h_{\lambda,\theta_0}(\theta))^2}\right).$$

Thus $I'_{\lambda,\theta_0}(\theta_0)=0$ and $I''_{\lambda,\theta_0}(\theta_0)=(h'_{\lambda,\theta_0}(\theta_0))^2\left(\frac{\lambda}{\lambda^2}+\frac{1-\lambda}{(1-\lambda)^2}\right)=\frac{(h'_{\lambda,\theta_0}(\theta_0))^2}{\lambda(1-\lambda)}$. The proof is completed by taking into account the expression of $h'_{\lambda,\theta_0}(\theta_0)$ above. ∎
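The formula $I''_{\lambda,\theta_0}(\theta_0)=(h'_{\lambda,\theta_0}(\theta_0))^2/(\lambda(1-\lambda))$ can also be checked numerically; the sketch below (ours, not from the paper) uses the exponential scale family $F_\theta(x)=1-e^{-x/\theta}$, for which $h_{\lambda,\theta_0}(\theta)=1-(1-\lambda)^{\theta/\theta_0}$ and $h'_{\lambda,\theta_0}(\theta_0)=-(1-\lambda)\log(1-\lambda)/\theta_0$.

```python
import math

lam, theta0 = 0.3, 2.0

def h(theta):
    # h_{lam,theta0}(theta) = F_{theta0}(F_theta^{-1}(lam)) = 1 - (1-lam)^(theta/theta0)
    return 1.0 - (1.0 - lam) ** (theta / theta0)

def I(theta):
    # rate function of Proposition 3.1
    q = h(theta)
    return lam * math.log(lam / q) + (1.0 - lam) * math.log((1.0 - lam) / (1.0 - q))

# closed form from Proposition 3.2
h_prime = -(1.0 - lam) * math.log(1.0 - lam) / theta0
closed_form = h_prime ** 2 / (lam * (1.0 - lam))

# central second difference of I at theta0
eps = 1e-4
finite_diff = (I(theta0 + eps) - 2.0 * I(theta0) + I(theta0 - eps)) / eps ** 2
print(closed_form, finite_diff)  # the two values agree to several digits
```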

Finally, by taking into account Proposition 3.2 (and what we said before it), it is natural to consider the following

###### Definition 3.1.

A value $\lambda^*\in(0,1)$ is said to be optimal if it maximizes $I''_{\lambda,\theta_0}(\theta_0)$, namely if we have $I''_{\lambda^*,\theta_0}(\theta_0)=\max_{\lambda\in(0,1)}I''_{\lambda,\theta_0}(\theta_0)$.

## 4 Results for MM estimators

The aim of this section is to present a version of the above results for MM estimators, namely the LDP and an expression of $J''_{\theta_0}(\theta_0)$, where $J_{\theta_0}$ is the rate function which governs the LDP of MM estimators. In particular, when we compare MM and MQ estimators in terms of speed of convergence by referring to Remark 2.1, the value $J''_{\theta_0}(\theta_0)$ will be compared with $I''_{\lambda,\theta_0}(\theta_0)$ in Proposition 3.2.

We start with the following condition which allows us to define the MM estimators.

###### Condition 4.1.

Let $\{F_\theta:\theta\in\Theta\}$ be a family of distribution functions as in Condition 2.2, and consider the function $\mu:\Theta\to\mathbb{R}$ defined by

$$\mu(\theta):=\int_{\alpha_\theta}^{\omega_\theta}x\,dF_\theta(x).$$

We assume that, for all $x$, the equation $\mu(\theta)=x$ admits a unique solution (with respect to $\theta$), which will be denoted by $\mu^{-1}(x)$.

From now on, in connection with this condition, we introduce the following function:

$$\Lambda^*_\theta(x):=\sup_{\gamma\in\mathbb{R}}\{\gamma x-\Lambda_\theta(\gamma)\},\quad\text{where }\Lambda_\theta(\gamma):=\log\int_{\alpha_\theta}^{\omega_\theta}e^{\gamma x}\,dF_\theta(x).\tag{3}$$

It is well-known that, if $\{X_n:n\geq1\}$ are i.i.d. random variables with distribution function $F_\theta$, and if we set $\bar X_n:=\frac{X_1+\cdots+X_n}{n}$ for all $n\geq1$, then $\{\bar X_n:n\geq1\}$ satisfies the LDP with rate function $\Lambda^*_\theta$ in (3) by Cramér's Theorem on $\mathbb{R}$ (see e.g. Theorem 2.2.3 in [4]).
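As an illustration of (3) (our sketch, not from the paper): for the exponential law with mean $\theta$ one has $\Lambda_\theta(\gamma)=-\log(1-\theta\gamma)$ for $\gamma<1/\theta$, and the Legendre transform has the closed form $\Lambda^*_\theta(x)=x/\theta-1-\log(x/\theta)$ for $x>0$; a crude grid search over $\gamma$ reproduces it.

```python
import math

theta = 2.0  # exponential law with mean theta

def Lambda(gamma):
    # Lambda_theta(gamma) = log E[e^{gamma X_1}] = -log(1 - theta*gamma), gamma < 1/theta
    return -math.log(1.0 - theta * gamma)

def Lambda_star_numeric(x, n_grid=200000):
    # crude Legendre transform: supremum of gamma*x - Lambda(gamma) over a fine grid
    lo, hi = -10.0, 1.0 / theta - 1e-9
    step = (hi - lo) / n_grid
    return max((lo + i * step) * x - Lambda(lo + i * step) for i in range(n_grid + 1))

def Lambda_star_closed(x):
    # closed form for the exponential law: x/theta - 1 - log(x/theta), x > 0
    return x / theta - 1.0 - math.log(x / theta)

print(Lambda_star_numeric(5.0))  # matches the closed form below to several digits
print(Lambda_star_closed(5.0))
```

Note that $\Lambda^*_\theta$ vanishes only at the mean $x=\theta$, consistently with the law of large numbers.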

Then we have the following result.

###### Proposition 4.1 (LD for MM estimators).

Assume that Condition 4.1 holds. Moreover assume that, for some $\theta_0\in\Theta$, $\{X_n:n\geq1\}$ are i.i.d. random variables with distribution function $F_{\theta_0}$.
(i) If $\mu^{-1}(x)=c_0+c_1x$ for some $c_0,c_1\in\mathbb{R}$ such that $c_1\neq0$, then $\{\mu^{-1}(\bar X_n):n\geq1\}$ satisfies the LDP with rate function $J_{\theta_0}$ defined by

$$J_{\theta_0}(\theta):=\begin{cases}\Lambda^*_{\theta_0}(\mu(\theta))&\text{for }\theta\in\Theta\text{ such that }\mu(\theta)\in(\alpha_{\theta_0},\omega_{\theta_0})\\\infty&\text{otherwise.}\end{cases}$$

(ii) If the restriction of $\mu^{-1}$ on $(\alpha_{\theta_0},\omega_{\theta_0})$ is continuous and if $\Lambda^*_{\theta_0}$ is a good rate function, then the same LDP holds and $J_{\theta_0}$ is a good rate function.

###### Proof.

(i) In this case $\mu^{-1}(\bar X_n)=c_0+c_1\bar X_n$ and $\{\mu^{-1}(\bar X_n):n\geq1\}$ is again a sequence of empirical means of i.i.d. random variables. Then the LDP still holds by Cramér's Theorem on $\mathbb{R}$, and the rate function $J_{\theta_0}$ is defined by

$$J_{\theta_0}(\theta):=\sup_{\gamma\in\mathbb{R}}\{\gamma\theta-\Lambda_{\theta_0}(c_1\gamma)-\gamma c_0\},$$

which yields

$$J_{\theta_0}(\theta)=\sup_{\gamma\in\mathbb{R}}\left\{c_1\gamma\,\frac{\theta-c_0}{c_1}-\Lambda_{\theta_0}(c_1\gamma)\right\}=\sup_{\gamma\in\mathbb{R}}\{c_1\gamma\,\mu(\theta)-\Lambda_{\theta_0}(c_1\gamma)\}=\Lambda^*_{\theta_0}(\mu(\theta)),$$

as desired.
(ii) Since the restriction of the function $\mu^{-1}$ on $(\alpha_{\theta_0},\omega_{\theta_0})$ is continuous and $\Lambda^*_{\theta_0}$ is a good rate function, a straightforward application of the contraction principle yields the LDP of $\{\mu^{-1}(\bar X_n):n\geq1\}$ with good rate function $J_{\theta_0}$ defined by

$$J_{\theta_0}(\theta):=\inf\{\Lambda^*_{\theta_0}(x):x\in(\alpha_{\theta_0},\omega_{\theta_0}),\,\mu^{-1}(x)=\theta\}.$$

Moreover the set in the infimum has at most one element, namely

$$\{x\in(\alpha_{\theta_0},\omega_{\theta_0}):\mu^{-1}(x)=\theta\}=\begin{cases}\{\mu(\theta)\}&\text{for }\theta\in\Theta\text{ such that }\mu(\theta)\in(\alpha_{\theta_0},\omega_{\theta_0})\\\emptyset&\text{otherwise;}\end{cases}$$

thus we have $J_{\theta_0}(\theta)=\Lambda^*_{\theta_0}(\mu(\theta))$ for $\theta\in\Theta$ such that $\mu(\theta)\in(\alpha_{\theta_0},\omega_{\theta_0})$, and $J_{\theta_0}(\theta)=\infty$ otherwise. ∎

Now, in the spirit of Remark 2.1, it would be interesting to have a local strict inequality between the rate function $I_{\lambda,\theta_0}$ in Proposition 3.1 for MQ estimators (for some $\lambda\in(0,1)$) and the rate function $J_{\theta_0}$ in Proposition 4.1 for MM estimators.

Then we can repeat the same arguments which led us to Proposition 3.2. Namely, if both rate functions $J_{\theta_0}$ and $I_{\lambda,\theta_0}$ (for some $\lambda\in(0,1)$) are twice differentiable, $J_{\theta_0}$ is locally larger (resp. smaller) than $I_{\lambda,\theta_0}$ around $\theta_0$ if $J''_{\theta_0}(\theta_0)>I''_{\lambda,\theta_0}(\theta_0)$ (resp. $J''_{\theta_0}(\theta_0)<I''_{\lambda,\theta_0}(\theta_0)$). So it is natural to give an expression of $J''_{\theta_0}(\theta_0)$ under suitable hypotheses.

###### Proposition 4.2 (An expression for $J''_{\theta_0}(\theta_0)$).

Let $J_{\theta_0}$ be the rate function in Proposition 4.1. Assume that, for all $\theta\in\Theta$, the function $\Lambda_\theta$ in (3) is finite in a neighborhood of the origin and that $\mu$ is twice differentiable. Then $J''_{\theta_0}(\theta_0)=\frac{(\mu'(\theta_0))^2}{\sigma^2(\theta_0)}$, where $\sigma^2(\theta)$ is the variance function.

###### Proof.

One can easily check that

$$J'_{\theta_0}(\theta)=(\Lambda^*_{\theta_0})'(\mu(\theta))\,\mu'(\theta)\quad\text{and}\quad J''_{\theta_0}(\theta)=(\Lambda^*_{\theta_0})''(\mu(\theta))(\mu'(\theta))^2+\mu''(\theta)(\Lambda^*_{\theta_0})'(\mu(\theta)).$$

Then we can conclude by noting that $(\Lambda^*_{\theta_0})'(\mu(\theta_0))=0$ and $(\Lambda^*_{\theta_0})''(\mu(\theta_0))=\frac{1}{\sigma^2(\theta_0)}$. ∎
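For a concrete comparison in the spirit of Remark 2.1 (our numerical sketch, not from the paper), take the exponential scale family $F_\theta(x)=1-e^{-x/\theta}$: here $\mu(\theta)=\theta$ and $\sigma^2(\theta)=\theta^2$, so Proposition 4.2 gives $J''_{\theta_0}(\theta_0)=1/\theta_0^2$, while Proposition 3.2 gives $I''_{\lambda,\theta_0}(\theta_0)=(1-\lambda)\log^2(1-\lambda)/(\lambda\theta_0^2)$.

```python
import math

theta0 = 2.0

# Exponential scale family F_theta(x) = 1 - exp(-x/theta):
# mu(theta) = theta and sigma^2(theta) = theta^2, so Proposition 4.2 gives
J_second = 1.0 / theta0 ** 2  # (mu'(theta0))^2 / sigma^2(theta0)

def I_second(lam):
    # Proposition 3.2 for this family, with h'(theta0) = -(1-lam)*log(1-lam)/theta0
    h_prime = -(1.0 - lam) * math.log(1.0 - lam) / theta0
    return h_prime ** 2 / (lam * (1.0 - lam))

# For this family the MM curvature exceeds the MQ curvature at every level lam.
best_mq = max(I_second(k / 1000.0) for k in range(1, 1000))
print(J_second, best_mq)
```

So, for this particular family, the MM estimator converges faster than the MQ estimator whatever the level $\lambda$; the comparison can go the other way for other models, as discussed in Section 6.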

###### Remark 4.1 (On the functions $\Lambda_\theta$ and $\Lambda^*_\theta$ in (3)).

The function $\Lambda_\theta$ is finite in a neighborhood of the origin when we deal with empirical means (of i.i.d. random variables) with a light-tailed distribution. Typically $\Lambda^*_\theta$ is a good rate function only in this case.

## 5 Examples

The aim of this section is to present several examples of statistical models with unknown parameter $\theta\in\Theta$, where $\Theta\subseteq\mathbb{R}$; in all the examples we deal with one-dimensional parameters, assuming all the others to be known.

Let us briefly introduce the examples presented below. We investigate distributions with a scale parameter in Example 1, with a location parameter in Example 2, and with a skewness parameter in Example 3. We remark that in Example 3 we use the epsilon-Skew-Normal distribution defined in [14]; this choice is motivated by the availability of an explicit expression of the inverse of the distribution function, giving us the possibility of obtaining explicit formulas. Moreover we present Example 4 with Pareto distributions, which allows us to give a concrete illustration of the content of Remark 3.1. In all these statistical models the intervals $(\alpha_\theta,\omega_\theta)$ do not depend on $\theta\in\Theta$ and we simply write $(\alpha,\omega)$. Finally we present Example 5 where the right endpoint $\omega_\theta$ depends on $\theta$; namely $\theta$ is a right-endpoint parameter for this example.

In all examples (except Example 4) we give a formula for $I''_{\lambda,\theta_0}(\theta_0)$ (as a consequence of Proposition 3.2), which will be used for the local comparisons between rate functions (in the spirit of Remark 2.1) analyzed in Section 6.

In what follows we say that a distribution function $G$ on $\mathbb{R}$ has the symmetry property if it is the distribution function of a symmetric random variable, i.e. if $G(x)=1-G(-x)$ for all $x\in\mathbb{R}$. In such a case we have $G^{-1}(\lambda)=-G^{-1}(1-\lambda)$ for all $\lambda\in(0,1)$.

###### Example 1 (Statistical model with a scale parameter $\theta\in\Theta:=(0,\infty)$).

Let $\{F_\theta:\theta\in\Theta\}$ be defined by

$$F_\theta(x):=G\left(\frac{x}{\theta}\right)\quad\text{for }x\in(\alpha,\omega),$$

where $G$ is a strictly increasing distribution function on $(\alpha,\omega)=(0,\infty)$ or $(\alpha,\omega)=(-\infty,\infty)$. Then

$$F^{-1}_\theta(\lambda)=\theta G^{-1}(\lambda)\quad\text{and}\quad h_{\lambda,\theta_0}(\theta)=F_{\theta_0}(F^{-1}_\theta(\lambda))=G\left(\frac{\theta}{\theta_0}\cdot G^{-1}(\lambda)\right);$$

it is important to remark that, when $(\alpha,\omega)=(-\infty,\infty)$, the value of $\lambda$ such that $G^{-1}(\lambda)=0$ (which yields $F^{-1}_\theta(\lambda)=0$ for all $\theta\in\Theta$) is not allowed. Now we give a list of some specific examples studied in this paper.

For the case $(\alpha,\omega)=(0,\infty)$ we consider the Weibull distribution:

$$G(x):=1-\exp(-x^\rho)\ (\text{where }\rho>0)\quad\text{and}\quad G^{-1}(\lambda)=(-\log(1-\lambda))^{1/\rho}.\tag{4}$$

We also give some specific examples where $(\alpha,\omega)=(-\infty,\infty)$ and, in each case, $\eta\in\mathbb{R}$ is a known location parameter (and the not-allowed value of $\lambda$ depends on $\eta$): the Normal distribution

$$G(x):=\Phi(x-\eta)\quad\text{and}\quad G^{-1}(\lambda)=\eta+\Phi^{-1}(\lambda),\tag{5}$$

where $\Phi$ is the standard Normal distribution function; the Cauchy distribution

$$G(x):=\frac{1}{\pi}\left(\arctan(x-\eta)+\frac{\pi}{2}\right)\quad\text{and}\quad G^{-1}(\lambda)=\eta+\tan\left(\left(\lambda-\frac{1}{2}\right)\pi\right);\tag{6}$$

the logistic distribution

$$G(x):=\frac{1}{1+e^{-(x-\eta)}}\quad\text{and}\quad G^{-1}(\lambda)=\eta-\log\left(\frac{1}{\lambda}-1\right);\tag{7}$$

the Gumbel distribution

$$G(x):=\exp(-e^{-(x-\eta)})\quad\text{and}\quad G^{-1}(\lambda)=\eta-\log(-\log\lambda).\tag{8}$$

If $G$ is twice differentiable we have

$$I''_{\lambda,\theta_0}(\theta_0)=\frac{\{G'(G^{-1}(\lambda))G^{-1}(\lambda)\}^2}{\lambda(1-\lambda)\theta_0^2}\tag{9}$$

by Proposition 3.2; so, if it is possible to find an optimal $\lambda$, such a value does not depend on $\theta_0$ (on the contrary it could depend on the known location parameter $\eta$, as we shall see in Section 6). Moreover one can check that $I''_{\lambda,\theta_0}(\theta_0)=0$ if we consider the not-allowed value of $\lambda$ (when $(\alpha,\omega)=(-\infty,\infty)$) because $G^{-1}(\lambda)=0$, and that $I''_{\lambda,\theta_0}(\theta_0)=I''_{1-\lambda,\theta_0}(\theta_0)$ (for all $\lambda\in(0,1)$) if $G$ is symmetric, as happens, for instance, in (5), (6) and (7) with $\eta=0$.
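For instance, in the exponential case ($\rho=1$ in (4)) formula (9) reduces to $I''_{\lambda,\theta_0}(\theta_0)=(1-\lambda)\log^2(1-\lambda)/(\lambda\theta_0^2)$, and the optimal $\lambda$ can be located by a grid search (our sketch, not from the paper):

```python
import math

def curvature(lam):
    # theta0^2 * I''_{lam,theta0}(theta0) for G(x) = 1 - exp(-x), cf. (9):
    # G'(G^{-1}(lam)) = 1 - lam and G^{-1}(lam) = -log(1 - lam)
    g = (1.0 - lam) * (-math.log(1.0 - lam))
    return g ** 2 / (lam * (1.0 - lam))

grid = [k / 100000.0 for k in range(1, 100000)]
lam_star = max(grid, key=curvature)
print(lam_star)  # roughly 0.797
```

The maximizer solves $\log(1-\lambda)=-2\lambda$, giving $\lambda^*\approx0.797$: for an exponential scale parameter, a fairly high quantile is the most informative one.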

###### Example 2 (Statistical model with a location parameter $\theta\in\Theta:=(-\infty,\infty)$).

Let $\{F_\theta:\theta\in\Theta\}$ be defined by

$$F_\theta(x):=G(x-\theta)\quad\text{for }x\in(\alpha,\omega)=(-\infty,\infty),$$

where $G$ is a strictly increasing distribution function on $(-\infty,\infty)$. Then

$$F^{-1}_\theta(\lambda)=\theta+G^{-1}(\lambda)\quad\text{and}\quad h_{\lambda,\theta_0}(\theta)=F_{\theta_0}(F^{-1}_\theta(\lambda))=G(\theta+G^{-1}(\lambda)-\theta_0).$$

We give some specific examples studied in this paper and, in each case, $s>0$ is a known scale parameter: the Normal distribution

$$G(x):=\Phi\left(\frac{x}{s}\right)\quad\text{and}\quad G^{-1}(\lambda)=s\cdot\Phi^{-1}(\lambda);\tag{10}$$

the Cauchy distribution

$$G(x):=\frac{1}{\pi}\left(\arctan\frac{x}{s}+\frac{\pi}{2}\right)\quad\text{and}\quad G^{-1}(\lambda)=s\cdot\tan\left(\left(\lambda-\frac{1}{2}\right)\pi\right);\tag{11}$$

the logistic distribution

$$G(x):=\frac{1}{1+e^{-x/s}}\quad\text{and}\quad G^{-1}(\lambda)=-s\cdot\log\left(\frac{1}{\lambda}-1\right);\tag{12}$$

the Gumbel distribution

$$G(x):=\exp(-e^{-x/s})\quad\text{and}\quad G^{-1}(\lambda)=-s\cdot\log(-\log\lambda).\tag{13}$$

If $G$ is twice differentiable we have

$$I''_{\lambda,\theta_0}(\theta_0)=\frac{\{G'(G^{-1}(\lambda))\}^2}{\lambda(1-\lambda)}\tag{14}$$

by Proposition 3.2; so, if it is possible to find an optimal $\lambda$, such a value does not depend on $\theta_0$ and on the known scale parameter $s$. Moreover one can check that $I''_{\lambda,\theta_0}(\theta_0)=I''_{1-\lambda,\theta_0}(\theta_0)$ (for all $\lambda\in(0,1)$) if $G$ has the symmetry property (as happens for $G$ in (10), (11) and (12), and not for $G$ in (13)).
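For the Normal location family (10), formula (14) becomes $I''_{\lambda,\theta_0}(\theta_0)=\{\varphi(\Phi^{-1}(\lambda))/s\}^2/(\lambda(1-\lambda))$, where $\varphi$ is the standard Normal density; the sketch below (ours, with a plain bisection inverse for $\Phi$) confirms numerically that the optimal level is the median $\lambda=1/2$.

```python
import math

def norm_cdf(x):
    # standard Normal distribution function via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_quantile(p, lo=-10.0, hi=10.0):
    # plain bisection inverse of the standard Normal distribution function
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def curvature(lam, s=1.0):
    # formula (14) for the Normal location family: G'(G^{-1}(lam)) = phi(Phi^{-1}(lam))/s
    z = norm_quantile(lam)
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (phi / s) ** 2 / (lam * (1.0 - lam))

grid = [k / 1000.0 for k in range(1, 1000)]
lam_star = max(grid, key=curvature)
print(lam_star)  # 0.5: for this symmetric family the median is optimal
```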

###### Example 3 (Statistical model with a skewness parameter $\theta\in\Theta:=(-1,1)$).

Let $\{F_\theta:\theta\in\Theta\}$ be defined by

$$F_\theta(x):=\begin{cases}(1+\theta)\,G\!\left(\dfrac{x}{1+\theta}\right)&\text{for }x\leq0\\\theta+(1-\theta)\,G\!\left(\dfrac{x}{1-\theta}\right)&\text{for }x>0,\end{cases}\quad\text{with }x\in(\alpha,\omega)=(-\infty,\infty),$$

where $G$ is a strictly increasing distribution function on $(-\infty,\infty)$ with the symmetry property. Then

$$F^{-1}_\theta(\lambda)=\begin{cases}(1+\theta)\,G^{-1}\!\left(\dfrac{\lambda}{1+\theta}\right)&\text{for }\lambda\in\left(0,\dfrac{1+\theta}{2}\right]\\(1-\theta)\,G^{-1}\!\left(\dfrac{\lambda-\theta}{1-\theta}\right)&\text{for }\lambda\in\left(\dfrac{1+\theta}{2},1\right)\end{cases}$$

and, since $F^{-1}_\theta(\lambda)\leq0$ exactly in the first branch,

$$h_{\lambda,\theta_0}(\theta)=F_{\theta_0}(F^{-1}_\theta(\lambda))=\begin{cases}(1+\theta_0)\,G\!\left(\dfrac{1+\theta}{1+\theta_0}\,G^{-1}\!\left(\dfrac{\lambda}{1+\theta}\right)\right)&\text{for }\lambda\in\left(0,\dfrac{1+\theta}{2}\right]\\\theta_0+(1-\theta_0)\,G\!\left(\dfrac{1-\theta}{1-\theta_0}\,G^{-1}\!\left(\dfrac{\lambda-\theta}{1-\theta}\right)\right)&\text{for }\lambda\in\left(\dfrac{1+\theta}{2},1\right).\end{cases}$$
We can consider the same specific examples presented in Example 2, i.e. the functions $G$ in (10), (11) and (12) for some known scale parameter $s>0$.

If $G$ is twice differentiable, one can compute

$$h'_{\lambda,\theta_0}(\theta_0)=\begin{cases}G'\!\left(G^{-1}\!\left(\dfrac{\lambda}{1+\theta_0}\right)\right)G^{-1}\!\left(\dfrac{\lambda}{1+\theta_0}\right)-\dfrac{\lambda}{1+\theta_0}&\text{for }\lambda\in\left(0,\dfrac{1+\theta_0}{2}\right]\\-G'\!\left(G^{-1}\!\left(\dfrac{\lambda-\theta_0}{1-\theta_0}\right)\right)G^{-1}\!\left(\dfrac{\lambda-\theta_0}{1-\theta_0}\right)+\dfrac{\lambda-1}{1-\theta_0}&\text{for }\lambda\in\left(\dfrac{1+\theta_0}{2},1\right)\end{cases}$$

and then, by Proposition 3.2,

$$I''_{\lambda,\theta_0}(\theta_0)=\begin{cases}\dfrac{1}{\lambda(1-\lambda)}\left[G'\!\left(G^{-1}\!\left(\dfrac{\lambda}{1+\theta_0}\right)\right)G^{-1}\!\left(\dfrac{\lambda}{1+\theta_0}\right)-\dfrac{\lambda}{1+\theta_0}\right]^2&\text{for }\lambda\in\left(0,\dfrac{1+\theta_0}{2}\right]\\\dfrac{1}{\lambda(1-\lambda)}\left[-G'\!\left(G^{-1}\!\left(\dfrac{\lambda-\theta_0}{1-\theta_0}\right)\right)G^{-1}\!\left(\dfrac{\lambda-\theta_0}{1-\theta_0}\right)+\dfrac{\lambda-1}{1-\theta_0}\right]^2&\text{for }\lambda\in\left(\dfrac{1+\theta_0}{2},1\right);\tag{15}\end{cases}$$

so one can expect that, if it is possible to find an optimal $\lambda$, such a value depends on $\theta_0$ (this is what happens in Section 6). Moreover one can check that $I''_{\lambda,\theta_0}(\theta_0)$ in (15) does not depend on the known scale parameter $s$, and that $I''_{\lambda,\theta_0}(\theta_0)>0$ (for all $\lambda\in(0,1)$).

###### Example 4 (Statistical model with Pareto distributions with $\theta\in\Theta:=(0,\infty)$).

Let $\{F_\theta:\theta\in\Theta\}$ be defined by

$$F_\theta(x):=1-x^{-1/\theta}\quad\text{for }x\in(\alpha,\omega)=(1,\infty).$$

Then

$$F^{-1}_\theta(\lambda)=e^{-\theta\log(1-\lambda)}=(1-\lambda)^{-\theta}\quad\text{and}\quad h_{\lambda,\theta_0}(\theta)=F_{\theta_0}(F^{-1}_\theta(\lambda))=1-(1-\lambda)^{\theta/\theta_0}.\tag{16}$$

We remark that, if we consider Example 1 with $G$ as in (4) with $\rho=1$, namely

$$\tilde F_\theta(x)=1-e^{-x/\theta}\quad\text{for }x\in(\tilde\alpha,\tilde\omega)=(0,\infty),$$

we can refer to Remark 3.1 with

$$\psi(x):=e^x\quad\text{for }x\in(\tilde\alpha,\tilde\omega)=(0,\infty)$$

(note that $(\psi(\tilde\alpha),\psi(\tilde\omega))=(1,\infty)=(\alpha,\omega)$). Then, as pointed out in Remark 3.1, $I_{\lambda,\theta_0;\psi}$ and $I_{\lambda,\theta_0}$ coincide; in fact $h_{\lambda,\theta_0}$ in (16) meets the analogous expression in Example 1 with $G$ as in (4) with $\rho=1$.