
# Efficient estimation of functionals in nonparametric boundary models

We are grateful for very helpful comments and questions by the referees. Financial support by the DFG through Research Unit FOR 1735 *Structural Inference in Statistics: Adaptation and Efficiency* is acknowledged.

Markus Reiß, Institute of Mathematics, Humboldt-Universität zu Berlin, mreiss@mathematik.hu-berlin.de
Leonie Selk, Institute of Mathematics, Universität Hamburg, leonie.selk@math.uni-hamburg.de
###### Abstract

For nonparametric regression with one-sided errors and a boundary curve model for Poisson point processes we consider the problem of efficient estimation for linear functionals. The minimax optimal rate is obtained by an unbiased estimation method which nevertheless depends on a Hölder condition or monotonicity assumption for the underlying regression or boundary function.

We first construct a simple blockwise estimator and then build up a nonparametric maximum-likelihood approach for exponential noise variables and the point process model. In that approach also non-asymptotic efficiency is obtained (UMVU: uniformly minimum variance among all unbiased estimators). The proofs rely essentially on martingale stopping arguments for counting processes and the point process geometry. The estimators are easily computable and a small simulation study confirms their applicability.

Key words and Phrases: frontier estimation, support estimation, Poisson point process, sufficiency, completeness, UMVU, nonparametric MLE, shape constraint, monotone boundary, optional stopping.
AMS subject classification: 62G08, 62G15, 62G32, 62M30, 60G55

## 1 Introduction

For regression models

$$Y_i = g(i/n) + \varepsilon_i,\quad i=1,\ldots,n, \tag{1.1}$$

the estimation of linear functionals of the regression function $g$ is well understood if the errors $\varepsilon_i$ are uncorrelated with mean zero and variance $\sigma^2$. Then the discrete functionals

$$\vartheta^{(n)} = \frac{1}{n}\sum_{i=1}^n g(i/n)\,w(i/n) \quad\text{for some function } w\colon[0,1]\to\mathbb{R} \tag{1.2}$$

can be estimated by the plug-in version $\hat\vartheta^{(n)} = \frac{1}{n}\sum_{i=1}^n Y_i\,w(i/n)$ without bias and with variance $\frac{\sigma^2}{n^2}\sum_{i=1}^n w(i/n)^2$. By the Gauß-Markov theorem $\hat\vartheta^{(n)}$ has minimal variance among all linear and unbiased estimators. In the Gaussian case $\hat\vartheta^{(n)}$ is even UMVU (uniformly of minimum variance among all unbiased estimators). In the corresponding continuous-time signal-in-white-noise model $dY_t = g(t)\,dt + \sigma n^{-1/2}\,dW_t$, $t\in[0,1]$, with a Brownian motion $W$ and some $\sigma>0$, the plug-in estimator $\hat\vartheta = \int_0^1 w(t)\,dY_t$ is equally an unbiased estimator of

$$\vartheta = \int_0^1 g(t)w(t)\,dt \quad\text{for some } w\in L^2([0,1]) \tag{1.3}$$

of variance $\sigma^2 n^{-1}\|w\|_{L^2}^2$. By the Riesz representation theorem, we can thus estimate any linear $L^2$-continuous functional of $g$ with parametric rate $n^{-1/2}$.

In certain applications, however, the function $g$ is determined as the boundary or frontier function of the observations, which can be modeled equivalently by one-sided errors $\varepsilon_i \geqslant 0$. The prototypical case is that the $\varepsilon_i$ are i.i.d. non-negative and for some $\lambda > 0$

$$P(\varepsilon_i \leqslant x) = \lambda x + O(x^2) \quad\text{as } x\downarrow 0, \tag{1.4}$$

e.g. $\varepsilon_i \sim \mathrm{Exp}(\lambda)$. In that case the parametric rate for the location model (i.e. assuming $g$ to be constant) is $n^{-1}$, much faster than in the regular case. These irregular statistical models have also found considerable theoretical interest, e.g. in the recent work by ?. A rate-optimal estimator is given by the extreme value statistic $\min_i Y_i$. For the nonparametric problem of estimating the function $g$ in $L^2$-loss, the optimal rate is $n^{-\beta/(\beta+1)}$ for $g$ in a Hölder ball of regularity $\beta\in(0,1]$ and radius $R>0$:

$$g \in C^\beta(R) = \big\{f\colon[0,1]\to\mathbb{R} \,\big|\, \forall x,y\in[0,1]: |f(y)-f(x)| \leqslant R|y-x|^\beta\big\}. \tag{1.5}$$

This is achieved by a local polynomial estimator as in the regular case, see e.g. ? for a construction and a survey of the large literature on that topic. A plug-in estimator $\int_0^1 \hat g(t)w(t)\,dt$ with optimal bandwidth $h$ to estimate $\vartheta$ in (1.3) can achieve at best the rate $n^{-2\beta/(2\beta+1)}$. This is due to a pointwise bias of order $h^\beta$ and a pointwise variance of order $(nh)^{-2}$, which after integration and by using independence results in a total mean squared error of order $h^{2\beta} + n^{-2}h^{-1}$ for the plug-in estimator. The standardised rate $n^{-2\beta/(2\beta+1)}$ is not optimal and for $\beta < 1/2$ even slower than in the regular case. At the heart of the problem is the usual nonparametric bias bound, which cannot be improved by averaging.
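The plug-in rate follows from a standard trade-off; a short derivation, using the bias and variance orders discussed above for a bandwidth-$h$ estimator:

```latex
\mathrm{MSE}(h) \asymp h^{2\beta} + n^{-2}h^{-1}, \qquad
h^{2\beta} \asymp n^{-2}h^{-1} \;\iff\; h^{*} \asymp n^{-2/(2\beta+1)}, \qquad
\mathrm{MSE}(h^{*}) \asymp n^{-4\beta/(2\beta+1)},
```

so the plug-in estimator attains at best the rate $n^{-2\beta/(2\beta+1)}$ in root mean squared error, which is slower than the regular parametric rate $n^{-1/2}$ exactly when $\beta < 1/2$.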

Here we show that the optimal estimation rate for $\vartheta$ under one-sided errors is $n^{-(2\beta+1)/(2\beta+2)}$ for $g\in C^\beta(R)$. The improvement over the plug-in estimator is achieved by an unbiased estimation procedure. The bias is exactly zero for the case of exponentially distributed errors and it is asymptotically negligible under (1.4) for $\beta > 1/2$. Compared with standard nonparametric results it is remarkable that an unbiased estimator can be constructed whose rate is nevertheless worse than the parametric rate ($n^{-1}$ in this case). The risk bound comes from a trade-off between two terms in the variance instead of the usual bias-variance trade-off.

As for mean regression with the signal-in-white-noise model, also for one-sided errors an analogous continuous-time model is most useful in exhibiting the main statistical structure. It is given by observing a Poisson point process (PPP) on $[0,1]\times\mathbb{R}$ of intensity

$$\lambda_g(x,y) = n\,\mathbf{1}(y \geqslant g(x)),\quad x\in[0,1],\ y\in\mathbb{R}, \tag{1.6}$$

see e.g. ? or ? for point process properties and Figure 1 below for an illustration. For sufficiently regular $g$ this model can be shown to be asymptotically equivalent to the regression-type model (1.1) with error distribution as in (1.4), cf. ?. At the same time, this serves as a canonical model for support boundary estimation from i.i.d. observations. For instance, ? propose projection based estimators for $g$ in this model class and derive convergence rates as well as limit distributions, already relying on bias reduction techniques. Also ? use it as an agnostic model for limit order books in financial markets.

We first develop the methods in the fundamental PPP model for $\vartheta$ from (1.3) and then transfer them explicitly to the discrete model (1.1). By a blocking technique $\vartheta$ can be estimated without bias and at the minimax optimal rate. The method is then extended to the one-sided regression setting. Using Lepski’s method we are then able to provide also an adaptive estimator, that is an estimator which does not rely on the smoothness parameters $\beta$ and $R$ and still attains the minimax rate up to a logarithmic factor. In a second step we can even construct an estimator of $\vartheta$ which is UMVU. This non-asymptotic efficiency result is based on a nonparametric maximum-likelihood approach, where the maximum-likelihood estimator (MLE) is not only explicit, but also forms a sufficient and complete statistic. In parallel with Gaussian mean regression we thus have the UMVU-property of the estimator, but its asymptotic rate is worse than for parametric location estimation. Still, we are able to prove its asymptotic normality and to provide a self-normalising CLT such that asymptotic inference is feasible. The MLE approach equally works for the class of monotone functions.

The regression-type model (1.1) with one-sided errors and the PPP model (1.6) have a similar structure as models considered for density support estimation or image boundary recovery problems. Let us review briefly the literature on functional estimation for these statistical models. Many asymptotic results for the expected area of the convex hull of i.i.d. observations are based on the classical results by ?. Based on these results, the ideas of the present paper have been used by ? to construct a UMVU estimator for the volume of a convex body. For image recovery problems ? describe already the rate obtained for this functional. The upper bound is based on a localisation step and loses a logarithmic factor. By threefold sample splitting ? has constructed an estimator achieving this rate exactly for the related density support area estimation. An interesting linear programming approach is proposed by ?. Yet, these estimators are analysed asymptotically and lack the non-asymptotic unbiasedness and UMVU property we have found here. Many other estimators are concerned with the estimation of the density support set or the regression-type function itself, not of the area or other functionals; let us mention the work by ? for connections to classification problems. Specifically, a nonparametric MLE approach under monotonicity has been developed by ? for the asymptotically exact risk in estimating the density support set in Hausdorff distance. In Gaussian mean regression a nonparametric MLE over regular function classes is equivalent to a least-squares approach with roughness penalty, leading e.g. to smoothing splines. Under shape constraints the MLE is a well studied object, see e.g. ?, but usually results are derived asymptotically.

In the next section we shall develop a simple block-wise estimator. Based on optional stopping for an intrinsic martingale we prove that it is unbiased under the PPP model and under exponential noise in the regression-type model. For more general regression noise the required compensation cannot be achieved exactly, but it comes close to the model with corresponding exponential noise. The third part of that section presents the adaptive estimator, while the final part presents the lower bound implying that the rate is indeed optimal. The nonparametric MLE approach is presented in Section 3, first for the class of Hölder functions, then for monotone functions. The derivation of the completeness of the nonparametric MLE and the stopping arguments for the intrinsic martingale are intriguing. For the MLE under Lipschitz conditions we obtain central limit theorems which allow for feasible confidence sets.

In Section 4 we discuss some implications of the results, in particular concerning estimating coefficients in a projection estimator approach. Extensions and limitations are mentioned and a small simulation study shows that the estimators are numerically feasible and have satisfying finite-sample properties. Most proofs are instructive and reveal some beautiful interplay between statistics, probability and geometry such that in the Appendix we only provide some technical lemmata (some of independent interest) and the more involved proofs of the adaptive rate and the CLT. The notation follows the usual conventions. We write $A \lesssim B$ or $A = O(B)$ to say that $A$ is bounded by a constant multiple of $B$, and $A \thicksim B$ for $A \lesssim B$ as well as $B \lesssim A$. Moreover, $A \ll B$ means $A = o(B)$ and $a_+$ stands for $\max(a,0)$.

## 2 Simple rate-optimal estimation

### 2.1 Block-wise estimation in the PPP model

Let $(X_j, Y_j)_{j\geqslant 1}$ denote the observations of the Poisson point process (PPP) with intensity (1.6). We shall estimate $\vartheta$ from (1.3) without any bias. To grasp the main idea, suppose that $g\in C^\beta(R)$ holds and that we know a deterministic function $\bar g$ with the property $\bar g \geqslant g$ (pointwise). Then the number of PPP observations below the graph of $\bar g$ is Poisson-distributed with intensity equal to $n$ times the area between $\bar g$ and $g$:

$$\sum_{j\geqslant 1}\mathbf{1}\big(Y_j \leqslant \bar g(X_j)\big) \sim \mathrm{Poiss}\Big(n\int_0^1(\bar g - g)(x)\,dx\Big).$$

This yields an unbiased pseudo-estimator $\bar\vartheta$ of $\vartheta = \int_0^1 g(x)\,dx$ (take $w\equiv 1$ for simplicity of exposition):

$$\bar\vartheta := \int_0^1\bar g(x)\,dx - \frac{1}{n}\sum_{i\geqslant 1}\mathbf{1}\big(Y_i\leqslant\bar g(X_i)\big) \quad\Rightarrow\quad E[\bar\vartheta] = \int_0^1\big(\bar g-(\bar g-g)\big)(x)\,dx = \vartheta.$$

The larger the area between the graphs, the larger is the Poisson parameter and thus the variance of $\bar\vartheta$.
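The pseudo-estimator is easy to check by simulation; a minimal sketch, where the boundary $g$, the known upper bound $\bar g = g + 0.1$ and all numerical choices are hypothetical illustrations only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical boundary and a known deterministic upper bound g_bar >= g.
g = lambda x: 0.5 + 0.2 * np.sin(2 * np.pi * x)   # integral over [0,1] is 0.5
g_bar = lambda x: g(x) + 0.1

def simulate_ppp(n, g, y_max, rng):
    """Draw a PPP on [0,1] x [0, y_max] with intensity n * 1(y >= g(x)):
    simulate a homogeneous PPP of intensity n on the strip and thin it."""
    N = rng.poisson(n * y_max)
    x = rng.uniform(0.0, 1.0, N)
    y = rng.uniform(0.0, y_max, N)
    keep = y >= g(x)
    return x[keep], y[keep]

def pseudo_estimator(x, y, g_bar, n, grid=2048):
    """Unbiased pseudo-estimator: integral of g_bar minus (1/n) times the
    number of observed points lying below the graph of g_bar."""
    t = (np.arange(grid) + 0.5) / grid            # midpoint rule for the integral
    return np.mean(g_bar(t)) - np.sum(y <= g_bar(x)) / n
```

Averaging the estimator over repeated simulations reproduces $\vartheta = \int_0^1 g = 0.5$ up to Monte Carlo error, in line with the unbiasedness displayed above.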

Now, we shall define an empirical substitute for $\bar g$, which by stopping time arguments keeps the unbiasedness, but is nevertheless sufficiently close to $g$. We partition $[0,1]$ in subintervals $I_k$ of length $h$ with $h^{-1}\in\mathbb{N}$ and note that the block-wise minimum $Y_k^* := \min\{Y_i \mid X_i\in I_k\}$ satisfies $Y_k^* \geqslant \min_{x\in I_k} g(x)$. By the Hölder property of $g$ we conclude that $Y_k^* + Rh^\beta \geqslant g(x)$ holds for all $x\in I_k$ and thus $Y_k^* + Rh^\beta$ is a local upper bound for $g$, see also Figure 1. We thus estimate the functional locally on these blocks by

$$\hat\vartheta_k := \big(Y_k^* + Rh^\beta\big)\bar w_k - \frac{1}{nh}\sum_{i\geqslant 1}\mathbf{1}\big(X_i\in I_k,\, Y_i\leqslant Y_k^* + Rh^\beta\big)\,w(X_i),$$

where $\bar w_k := h^{-1}\int_{I_k} w(x)\,dx$ and the true local parameter is $\vartheta_k := h^{-1}\int_{I_k} g(x)w(x)\,dx$.
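A minimal simulation of this blockwise construction, for the weight $w\equiv 1$ and a hypothetical Lipschitz boundary ($\beta=1$, $R=1.3$); per block the estimator adds $(Y^*_k+Rh^\beta)h$ and subtracts $1/n$ times the number of points below that level:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Lipschitz boundary (beta = 1, Hölder constant <= R = 1.3).
g = lambda x: 0.5 + 0.2 * np.sin(2 * np.pi * x)   # integral over [0,1] is 0.5
R, beta = 1.3, 1.0

def simulate_ppp(n, g, y_max, rng):
    """PPP on [0,1] x [0, y_max] with intensity n * 1(y >= g(x)), by thinning."""
    N = rng.poisson(n * y_max)
    x = rng.uniform(0.0, 1.0, N)
    y = rng.uniform(0.0, y_max, N)
    keep = y >= g(x)
    return x[keep], y[keep]

def block_estimator(x, y, n, h, R, beta, y_max=1.5):
    """Blockwise estimator for w == 1: on each block I_k the random level
    Y*_k + R h^beta replaces the unknown deterministic upper bound."""
    est = 0.0
    for k in range(int(round(1.0 / h))):
        in_blk = (x >= k * h) & (x < (k + 1) * h)
        y_star = y[in_blk].min() if in_blk.any() else y_max   # empty-block guard
        tau = y_star + R * h ** beta
        est += tau * h - np.sum(y[in_blk] <= tau) / n
    return est
```

Over repeated simulations the Monte Carlo mean matches $\int_0^1 g = 0.5$, illustrating the unbiasedness established in Theorem 2.1 below.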

###### 2.1 Theorem.

The estimator $\hat\vartheta^{\mathrm{block}} := h\sum_{k=0}^{h^{-1}-1}\hat\vartheta_k$ satisfies for $g\in C^\beta(R)$

$$E\big[\hat\vartheta^{\mathrm{block}}\big] = \vartheta,\qquad \mathrm{Var}\big(\hat\vartheta^{\mathrm{block}}\big) \leqslant \big(2Rh^\beta + (nh)^{-1}\big)\,n^{-1}\|w\|_{L^2}^2.$$

In particular, the asymptotically optimal block size $h = (2\beta Rn)^{-1/(\beta+1)}$ yields

$$\limsup_{n\to\infty}\ \sup_{g\in C^\beta(R)}\ n^{(2\beta+1)/(\beta+1)}\,\mathrm{Var}\big(\hat\vartheta^{\mathrm{block}}\big) \leqslant \frac{\beta+1}{\beta}\,(2\beta R)^{1/(\beta+1)}\,\|w\|_{L^2}^2.$$
###### Proof.

Let us study the weighted counting process

$$N(t) := \sum_{i\geqslant 1}\mathbf{1}\big(X_i\in I_k,\, Y_i\leqslant t\big)\,w(X_i),\quad t\in\mathbb{R}.$$

The pure counting process $t\mapsto\#\{i \mid X_i\in I_k,\, Y_i\leqslant t\}$ is a point process in $t$ with deterministic intensity $n\int_{I_k}\mathbf{1}(t\geqslant g(x))\,dx$. Hence, $N$ is a process with independent increments satisfying (e.g. via Prop. 2.32 in ?)

$$E[N(t)] = \int_{I_k} n\big(t-g(x)\big)_+\,w(x)\,dx,\qquad \mathrm{Var}(N(t)) = \int_{I_k} n\big(t-g(x)\big)_+\,w(x)^2\,dx.$$

In particular, $M(t) := N(t) - \int_{I_k} n(t-g(x))_+\,w(x)\,dx$ is a càdlàg martingale with respect to the filtration

$$\mathcal{F}_t = \sigma\big((X_i,Y_i)\mathbf{1}(Y_i\leqslant t),\, i\geqslant 1\big),\quad t\in\mathbb{R}, \tag{2.1}$$

with mean zero and predictable quadratic variation $\langle M\rangle_t = \int_{I_k} n(t-g(x))_+\,w(x)^2\,dx$.

Now note that $\tau := Y_k^* + Rh^\beta$ is an $(\mathcal{F}_t)$-stopping time with

$$P(\tau\geqslant t) = \exp\Big(-n\int_{I_k}\big(t - Rh^\beta - g(x)\big)_+\,dx\Big) \leqslant \exp\Big(-nh\big(t - \max_{x\in I_k}g(x) - Rh^\beta\big)\Big) \tag{2.2}$$

for $t \geqslant \max_{x\in I_k}g(x) + Rh^\beta$. In particular, $\tau$ has finite expectation and Lemma 5.1 on optional stopping yields

$$E[M(\tau)] = 0 \quad\Rightarrow\quad E[N(\tau)] = n\int_{I_k}E\big[(\tau-g(x))_+\big]\,w(x)\,dx$$

and

$$\mathrm{Var}(M(\tau)) = E\big[\langle M\rangle_\tau\big] = n\int_{I_k}E\big[(\tau-g(x))_+\big]\,w(x)^2\,dx.$$

Noting $\tau \geqslant g(x)$ for all $x\in I_k$, we have

$$E\big[(\tau-g(x))_+\big] = E[Y_k^*] + Rh^\beta - g(x)\quad\text{for all } x\in I_k.$$

The identity

$$\hat\vartheta_k = \tau\bar w_k - \frac{1}{nh}N(\tau) = \vartheta_k - \frac{1}{nh}M(\tau)$$

implies $E[\hat\vartheta_k] = \vartheta_k$ and

$$\mathrm{Var}(\hat\vartheta_k) = \frac{1}{n^2h^2}\mathrm{Var}(M(\tau)) = \frac{1}{nh^2}\int_{I_k}E\big[Y_k^* + Rh^\beta - g(x)\big]\,w(x)^2\,dx.$$

A rough universal bound, using that $Y_k^*$ is stochastically smaller than the minimum in $I_k$ of a PPP with intensity $n\mathbf{1}(y\geqslant\max_{x\in I_k}g(x))$, yields with a random variable $E\sim\mathrm{Exp}(nh)$

$$E[Y_k^*] \leqslant \max_{x\in I_k}g(x) + E[E] \leqslant g(x) + Rh^\beta + (nh)^{-1}\quad\text{for all } x\in I_k.$$

This implies

$$\mathrm{Var}(\hat\vartheta_k) \leqslant \frac{2Rh^\beta + (nh)^{-1}}{nh^2}\int_{I_k}w(x)^2\,dx.$$

We conclude for the final estimator $\hat\vartheta^{\mathrm{block}} = h\sum_k\hat\vartheta_k$ by the independence of the blockwise estimators $(\hat\vartheta_k)_k$ that

$$E\big[\hat\vartheta^{\mathrm{block}}\big] = \vartheta,\qquad \mathrm{Var}\big(\hat\vartheta^{\mathrm{block}}\big) \leqslant \big(2Rh^\beta + (nh)^{-1}\big)\,n^{-1}\int_0^1 w(x)^2\,dx.$$

Finally, insertion of the asymptotically optimal $h = (2\beta Rn)^{-1/(\beta+1)}$ yields the variance bound. ∎
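For completeness, the optimal block size balances the two terms in the variance bound; writing $f(h) = 2Rh^\beta + (nh)^{-1}$ for the bracket,

```latex
f'(h) = 2\beta R h^{\beta-1} - n^{-1}h^{-2} = 0
  \;\iff\; h^{*} = (2\beta R n)^{-1/(\beta+1)},
\qquad
f(h^{*}) = \Big(1+\tfrac{1}{\beta}\Big)(2\beta R)^{1/(\beta+1)}\, n^{-\beta/(\beta+1)},
```

so that $n^{(2\beta+1)/(\beta+1)}\cdot n^{-1}f(h^{*}) = \frac{\beta+1}{\beta}(2\beta R)^{1/(\beta+1)}$, which is exactly the constant appearing in the limsup bound of the theorem.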

### 2.2 Blockwise estimation in the regression-type model

We consider the equi-distant regression model (1.1) where the $\varepsilon_i$ are i.i.d. satisfying (1.4). The primary example will be $\varepsilon_i\sim\mathrm{Exp}(\lambda)$, but any distribution on $[0,\infty)$ with a Lipschitz continuous density at zero taking the value $\lambda>0$ will be covered, as soon as some loose tail bound at infinity holds.

Since the observation design is discrete, our parameter of interest becomes $\vartheta^{(n)}$ from (1.2). In analogy with the PPP case we build an estimator for the local parameter $\vartheta_k := (nh)^{-1}\sum_{i\in\tilde I_k} g(i/n)w(i/n)$ on each block $\tilde I_k$ of $nh$ consecutive indices, where $nh\in\mathbb{N}$:

$$\tilde\vartheta_k := \frac{1}{nh}\sum_{i\in\tilde I_k}\Big(Y_i\wedge\big(Y_k^* + Rh^\beta\big) - \lambda^{-1}\mathbf{1}\big(Y_i\leqslant Y_k^* + Rh^\beta\big)\Big)\,w(i/n). \tag{2.3}$$

Here, $Y_k^* := \min_{i\in\tilde I_k}Y_i$ is again the minimal observation on each block. In contrast to the PPP-estimator, the observations enter through the minimum of $Y_i$ and the empirical upper bound $Y_k^* + Rh^\beta$ for $g$ on $\tilde I_k$, which for the rate-optimal choice of $h$, however, has negligible impact. We obtain the following result, where $\|w\|_p := (\frac{1}{n}\sum_{i=1}^n|w(i/n)|^p)^{1/p}$ denotes the standardised $\ell^p$-norm.
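A minimal simulation of the exponential-error case, in which the estimator is exactly unbiased by the theorem below; the boundary function and all parameter values are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical Lipschitz boundary (beta = 1, constant <= R = 1.3), Exp(1) errors.
g = lambda t: 0.5 + 0.2 * np.sin(2 * np.pi * t)
lam, R, beta = 1.0, 1.3, 1.0

def tilde_theta(Y, lam, h, R, beta):
    """Blockwise estimator (2.3) summed over all blocks, weight w == 1:
    each observation is capped at Y*_k + R h^beta and the indicator term
    compensates the capping (exactly so for Exp(lam) errors)."""
    n = len(Y)
    m = int(round(n * h))                  # observations per block
    est = 0.0
    for start in range(0, n, m):
        blk = Y[start:start + m]
        tau = blk.min() + R * h ** beta    # empirical local upper bound for g
        est += np.sum(np.minimum(blk, tau) - (blk <= tau) / lam) / n
    return est
```

Averaging over repeated samples reproduces $\vartheta^{(n)} = \frac{1}{n}\sum_i g(i/n)$ up to Monte Carlo error.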

###### 2.2 Theorem.

Let the i.i.d. error variables satisfy (1.4) as well as a moment bound of some finite order. For $g\in C^\beta(R)$ the estimator $\tilde\vartheta^{\mathrm{block}}_n := h\sum_{k=0}^{h^{-1}-1}\tilde\vartheta_k$ satisfies, uniformly in $g\in C^\beta(R)$,

$$\big|E\big[\tilde\vartheta^{\mathrm{block}}_n - \vartheta^{(n)}\big]\big| \lesssim \big(Rh^\beta + (nh)^{-1}\big)^2\,\|w\|_1,\qquad \mathrm{Var}\big(\tilde\vartheta^{\mathrm{block}}_n\big) \lesssim h\big(Rh^\beta + (nh)^{-1}\big)^2\,\|w\|_2^2.$$

In particular, uniformly over $g\in C^\beta(R)$ with $\beta > 1/2$, we obtain with the rate-optimal block size $h \thicksim (Rn)^{-1/(\beta+1)}$

$$\big(E\big[\tilde\vartheta^{\mathrm{block}}_n - \vartheta^{(n)}\big]\big)^2 = o\big(\mathrm{Var}(\tilde\vartheta^{\mathrm{block}}_n)\big),\qquad \mathrm{Var}\big(\tilde\vartheta^{\mathrm{block}}_n\big) \lesssim R^{1/(\beta+1)}\,n^{-(2\beta+1)/(\beta+1)}\,\|w\|_2^2.$$

In the case $\varepsilon_i\sim\mathrm{Exp}(\lambda)$ we have, for any block size $h$, the more precise result

$$E\big[\tilde\vartheta^{\mathrm{block}}_n\big] = \vartheta^{(n)},\qquad \mathrm{Var}\big(\tilde\vartheta^{\mathrm{block}}_n\big) \leqslant \big(2Rh^\beta + (n\lambda h)^{-1}\big)\,(n\lambda)^{-1}\,\|w\|_2^2.$$
###### 2.3 Remark.

The result and proof for $\varepsilon_i\sim\mathrm{Exp}(\lambda)$ are exactly as in the PPP model. For other distributions of $\varepsilon_i$ the estimator is only asymptotically unbiased, but for Hölder regularity $\beta > 1/2$ the bias is negligible with respect to the stochastic error. A side remark is that for strong asymptotic equivalence with the PPP model in Le Cam’s sense the necessary minimal regularity is higher, see ?.

###### Proof.

Fix a block with index $k$ and consider for $t\in\mathbb{R}$

$$M(t) := \sum_{i\in\tilde I_k}\Big(\mathbf{1}(Y_i\leqslant t) + \log\bar F_\varepsilon\big(Y_i\wedge t - g(i/n)\big)\Big)\,w(i/n). \tag{2.4}$$

With respect to the filtration $\mathcal{F}_t = \sigma\big(Y_i\mathbf{1}(Y_i\leqslant t),\, i\in\tilde I_k\big)$, $M$ defines a martingale with mean zero and quadratic variation $[M]_t = \sum_{i\in\tilde I_k}\mathbf{1}(Y_i\leqslant t)\,w(i/n)^2$: just note that the compensator of $\mathbf{1}(Y_i\leqslant t)$ equals the integrated hazard function $-\log\bar F_\varepsilon(Y_i\wedge t - g(i/n))$. Moreover, $\tau := Y_k^* + Rh^\beta$ is a stopping time with respect to $(\mathcal{F}_t)_{t\in\mathbb{R}}$. From the representation

$$\tilde\vartheta_k - \vartheta_k = \frac{1}{n\lambda h}\Big(\sum_{i\in\tilde I_k}G_\varepsilon\big(Y_i\wedge\tau - g(i/n)\big)\,w(i/n) - M(\tau)\Big) \quad\text{with } G_\varepsilon(z) := \lambda z + \log\bar F_\varepsilon(z) \tag{2.5}$$

and the stopping Lemma 5.1, in combination with $E[M(\tau)] = 0$ due to the moment bound from Lemma 5.2 below, we conclude

$$E\big[\tilde\vartheta_k - \vartheta_k\big] = \frac{1}{n\lambda h}\sum_{i\in\tilde I_k}E\big[G_\varepsilon\big(Y_i\wedge\tau - g(i/n)\big)\big]\,w(i/n).$$

In the case $\varepsilon_i\sim\mathrm{Exp}(\lambda)$ we have $\bar F_\varepsilon(z) = e^{-\lambda z}$, that is $G_\varepsilon(z) = \lambda z + \log e^{-\lambda z} = 0$ for $z\geqslant 0$, and the estimator is unbiased.

By Assumption (1.4), there is some $\delta > 0$ such that $|G_\varepsilon(z)| \lesssim z^2$ holds for $|\log\bar F_\varepsilon(z)| \leqslant \delta$. By Lemma 5.2, the Cauchy-Schwarz inequality, monotonicity of $\bar F_\varepsilon$ and $Y_i\wedge\tau - g(i/n) \leqslant \tau - g(i/n)$ this leads to

$$\big|E[\tilde\vartheta_k - \vartheta_k]\big| \lesssim \sum_{i\in\tilde I_k}\frac{E\big[(\tau-g(i/n))^2\big] + E\big[\log\big(\bar F_\varepsilon(Y_i-g(i/n))\big)^2\big]^{1/2}\,P\big(|\log\big(\bar F_\varepsilon(\tau-g(i/n))\big)| > \delta\big)^{1/2}}{nh}\,|w(i/n)|$$
$$\lesssim \sum_{i\in\tilde I_k}\frac{E\big[(\tau-g(i/n))^2\big] + E\big[\log\big(\bar F_\varepsilon(\varepsilon_i)\big)^2\big]^{1/2}\,P\big(\bar F_\varepsilon\big(\min_{i\in\tilde I_k}\varepsilon_i + 2Rh^\beta\big) < e^{-\delta}\big)^{1/2}}{nh}\,|w(i/n)|.$$

With $h\to 0$ and $nh\to\infty$ the second term in the numerator converges geometrically fast to zero and the assertion for the bias of $\tilde\vartheta^{\mathrm{block}}_n$ follows.

For the variance bound we use (2.5), $\mathrm{Var}(X+Y) \leqslant 2(\mathrm{Var}(X) + \mathrm{Var}(Y))$, and the stopping Lemma 5.1 to obtain in analogy with the bias part:

$$\mathrm{Var}(\tilde\vartheta_k) = \frac{1}{(n\lambda h)^2}\mathrm{Var}\Big(\sum_{i\in\tilde I_k}G_\varepsilon\big(Y_i\wedge\tau - g(i/n)\big)\,w(i/n) - M(\tau)\Big)$$
$$\leqslant \frac{2}{(n\lambda h)^2}\sum_{i\in\tilde I_k}\Big(E\Big[nh\,G_\varepsilon\big(Y_i\wedge\tau - g(i/n)\big)^2 + \big|\log\bar F_\varepsilon\big(Y_i\wedge\tau - g(i/n)\big)\big|\Big]\,w(i/n)^2$$
$$\qquad + E\big[\tau - g(i/n)\big] + E\big[|\log\bar F_\varepsilon(\varepsilon_i)|^2\big]^{1/2}\,P\big(\bar F_\varepsilon\big(\min_{i\in\tilde I_k}\varepsilon_i + 2Rh^\beta\big) < e^{-\delta}\big)^{1/2}\Big).$$

For the global estimator $\tilde\vartheta^{\mathrm{block}}_n = h\sum_k\tilde\vartheta_k$ we infer $\mathrm{Var}(\tilde\vartheta^{\mathrm{block}}_n) = h^2\sum_k\mathrm{Var}(\tilde\vartheta_k)$ by independence of the $(\tilde\vartheta_k)_k$. It remains to insert the rate-optimal choice of $h$ and to note that $(nh)^{-1} \lesssim Rh^\beta$ holds for $h \thicksim (Rn)^{-1/(\beta+1)}$.

Finally, in the case $\varepsilon_i\sim\mathrm{Exp}(\lambda)$ we have $G_\varepsilon \equiv 0$ and $|\log\bar F_\varepsilon(z)| = \lambda z$. Consequently,

$$\mathrm{Var}(\tilde\vartheta_k) \leqslant \sum_{i\in\tilde I_k}\frac{E\big[Y_k^* + Rh^\beta - g(i/n)\big]}{\lambda(nh)^2}\,w(i/n)^2 \leqslant \frac{2Rh^\beta + (n\lambda h)^{-1}}{\lambda(nh)^2}\sum_{i\in\tilde I_k}w(i/n)^2,$$

which by independence gives the asserted bound for $\mathrm{Var}(\tilde\vartheta^{\mathrm{block}}_n)$. ∎

### 2.3 Adaptive choice of the block size

We now address the question of choosing the block size in a data-driven way, not assuming the regularity parameters $\beta$ and $R$ to be known. We apply Lepski’s method (?) and treat the general regression-type model (1.1). The main technical work is devoted to obtaining explicit critical values in Proposition 5.4 of the appendix. To this end, the critical values are defined via the compensator of an exponential counting process and are thus themselves again stochastic. While an explicit non-asymptotic risk analysis is clearly possible, we focus here on the asymptotic risk, showing that by the versatility of Lepski’s method rate-optimal adaptive estimation up to logarithmic factors is indeed possible in our non-regular situation.

For a choice of block sizes $h_0 > h_1 > \cdots > h_M$ with $nh_m\in\mathbb{N}$ consider the corresponding blockwise estimators

$$\tilde\vartheta^{\mathrm{block}}_{n,h_m} = \frac{1}{n}\sum_{k=0}^{h_m^{-1}-1}\sum_{i\in\tilde I_{k,h_m}}\Big(Y_i\wedge\big(Y^*_{k,h_m} + (nh_m)^{-1}\big) - \lambda^{-1}\mathbf{1}\big(Y_i\leqslant Y^*_{k,h_m} + (nh_m)^{-1}\big)\Big)\,w(i/n),$$

where the subscript $h_m$ marks all quantities depending on the block size. Remark that the intercept $Rh^\beta$ in (2.3) has been replaced by the asymptotically balanced size $(nh_m)^{-1}$, which does not depend on the unknown $\beta$ and $R$. Among $h_0,\ldots,h_M$ we select the block size adaptively via

$$\hat h := \inf\Big\{h_m \,\Big|\, \exists\, m'\leqslant m:\ \big|\tilde\vartheta^{\mathrm{block}}_{n,h_{m'}} - \tilde\vartheta^{\mathrm{block}}_{n,h_{m+1}}\big| > \kappa_{m+1} + \kappa_{m'}\Big\}\wedge h_M$$

with critical values (here $k_i$ denotes the block with $i\in\tilde I_{k_i,h_m}$)

$$\kappa_m = \sum_{i=1}^n\mathbf{1}\big(Y_i\leqslant Y^*_{k_i,h_m} + (nh_m)^{-1}\big)\frac{H_{\sqrt{c\log n}}\big(h_m^{1/2}w(i/n)\big)}{n\lambda h_m^{1/2}} + \frac{(Cc\log n)^2\|w\|_1}{(nh_m)^2\lambda} + \frac{\sqrt{c\log n}}{2n\lambda h_m^{1/2}}.$$

Here, the function $H$ and the constant $C$, with a property specified in the Appendix, are used, and the constant $c > 0$ is specified below. Asymptotically, the critical values $\kappa_m$ tend to zero as $n\to\infty$, and $\lambda$ may be replaced by an estimate in the case of a differentiable density of $\varepsilon_i$ around zero. Then the proof in the Appendix yields the following risk bounds.
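The selection rule can be sketched generically. The sketch below takes the ordered estimates (from the largest block size $h_0$ to the smallest $h_M$) and the critical values as given, and stops at the first level whose successor deviates too much from an earlier estimate; this is one natural reading of the infimum in the definition of $\hat h$:

```python
def lepski_select(estimates, kappas):
    """Lepski-type selector: return the index m of the chosen block size.
    estimates[m] is the blockwise estimate for block size h_m, kappas[m]
    the corresponding critical value; h_0 > h_1 > ... > h_M."""
    M = len(estimates) - 1
    for m in range(M):
        for mp in range(m + 1):
            # significant deviation of the next finer estimate from an
            # earlier (coarser) one: stop and keep the current level m
            if abs(estimates[mp] - estimates[m + 1]) > kappas[m + 1] + kappas[mp]:
                return m
    return M          # no violation found: keep the smallest block size h_M
```

With mutually consistent estimates the smallest block size is kept; the first significant deviation freezes the selection at the preceding level.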

###### 2.4 Theorem.

Assume (1.4), a finite moment bound for $\varepsilon_i$, and that $w$ is bounded. The adaptive estimator $\tilde\vartheta^{\mathrm{block}}_n := \tilde\vartheta^{\mathrm{block}}_{n,\hat h}$ satisfies, for $c$ and $n$ sufficiently large,

$$E\big[\big(\tilde\vartheta^{\mathrm{block}}_n - \vartheta^{(n)}\big)^2\,\mathbf{1}(\hat h < h^{*})\big] \lesssim \ \ldots$$

Choosing the grid of block sizes and the constants as above, the adaptive estimator exhibits for $g\in C^\beta(R)$ the asymptotic rate

$$E\big[\big(\tilde\vartheta^{\mathrm{block}}_n - \vartheta^{(n)}\big)^2\big] \lesssim (\log n)^2\,n^{-(2\beta+1)/(\beta+1)} + (\log n)^4\,n^{-4\beta/(\beta+1)}.$$

In particular, for $\beta \geqslant 1/2$ the estimator achieves the minimax optimal rate up to a logarithmic factor.

###### 2.5 Remark.

The oracle-type block size $h^{*}$ is the largest block size among $h_0,\ldots,h_M$ such that $\tilde\vartheta^{\mathrm{block}}_{n,h^{*}}$ remains unbiased, except for the distribution bias induced in the case of non-exponential errors. As the proof shows, in the case $\varepsilon_i\sim\mathrm{Exp}(\lambda)$ not only the critical values, but also the bounds simplify. We then obtain the minimax optimal rate up to a log-factor for any $\beta\in(0,1]$.

More elaborate arguments in the proof could give a smaller exponent for the logarithmic factor, but it is quite likely that some logarithmic factor has to be paid necessarily for adaptation, cf. ? for a related result. Similarly, the hypotheses that $w$ and $g$ are uniformly bounded are certainly not necessary, but permit more concise and transparent bounds.

### 2.4 Rate optimality

We prove that the rate $n^{-(2\beta+1)/(2\beta+2)}$ is optimal in a minimax sense over $C^\beta(R)$. The proof is conducted for the PPP model; the regression case with errors satisfying (1.4) can be treated in the same way.

###### 2.6 Theorem.

For estimating $\vartheta = \int_0^1 g(t)w(t)\,dt$, $w\in L^2([0,1])$, over the parameter class $C^\beta(R)$, $\beta\in(0,1]$, $R>0$, the following asymptotic lower bound holds:

$$\liminf_{n\to\infty}\ \inf_{\hat\vartheta_n}\ \sup_{g\in C^\beta(R)}\ R^{-1/(\beta+1)}\,n^{(2\beta+1)/(\beta+1)}\,\|w\|_{L^2}^{-2}\,E_g\big[(\hat\vartheta_n - \vartheta)^2\big] > 0.$$

The infimum extends over all estimators