Minimax and minimax adaptive estimation in multiplicative regression: locally bayesian approach

# Minimax and minimax adaptive estimation in multiplicative regression: locally bayesian approach

\fnmsM. \snmChichignoud label=e1]chichign@cmi.univ-mrs.fr [ Université Aix-Marseille 1 Université Aix-Marseille 1
LATP, 39 rue Joliot Curie, 13453 Marseille cedex 13, FRANCE,
E-mail:
###### Abstract

The paper deals with the non-parametric estimation in the regression with the multiplicative noise. Using the local polynomial fitting and the bayesian approach, we construct the minimax on isotropic Hölder class estimator. Next applying Lepski’s method, we propose the estimator which is optimally adaptive over the collection of isotropic Hölder classes. To prove the optimality of the proposed procedure we establish, in particular, the exponential inequality for the deviation of locally bayesian estimator from the parameter to be estimated. These theoretical results are illustrated by simulation study.

[
\kwd
\startlocaldefs\endlocaldefs\runtitle

Locally bayesian approach

{aug}

class=AMS] \kwd62G08 \kwd62G20

local bayesian fitting \kwdmultiplicative regression \kwdadaptive bandwidth selector \kwdLepski’s method \kwdoptimality criterion

## 1 Introduction

Let statistical experiment be generated by the couples of observations where satisfies the equation

 Yi=f(Xi)×Ui,i=1,…,n. (1.1)

Here is unknown function and we are interested in estimating at a given point from observation .

The random variables (noise) are supposed to be independent and uniformly distributed on .

The design points are deterministic and without loss of generality we will assume that

 Xi∈{1/n1/d,2/n1/d,…,1}d,i=1,…,n.

Along the paper the unknown function is supposed to be smooth, in particular, it belongs to the Hölder ball of functions (see Definition 1 below). Here is the smoothness of , is the sum of upper bounds of and its partial derivatives and is Lipschitz constant.

Moreover, we will consider only the functions separated away from zero by some positive constant. Thus, from now on we will suppose that there exists such that , where

 Hd(β,L,M,A)={g∈Hd(β,L,M):infx∈[0,1]dg(x)≥A}.

#### Motivation.

The theoretical interest to the multiplicative regression model (1.1) with discontinuous noise is dictated by the following fact. The typical approach to the study of the models with multiplicative noise consists in their transformation into the model with an additive noise and in the application, after that, the linear smoothing technique, based on standard methods like kernel smoothing, local polynomials etc. Let us illustrate the latter approach by the consideration of one of the most popular non-parametric model namely multiplicative gaussian regression

 Yi=σ(Xi)ξi,i=1,…,n. (1.2)

Here are i.i.d. standard gaussian random variables and the goal is to estimate the variance .

Putting and one can transform the model (1.2) into the heteroscedastic additive regression:

 Y′i=σ2(Xi)+σ2(Xi)ηi,i=1,…,n,

where, obviously, . Applying any of the linear methods mentioned above to the estimation of one can construct an estimator whose estimation accuracy is given by and which is optimal in minimax sense (See Definition 2). The latter result is proved under assumptions on which are similar to the assumption imposed on the function . In particular, denotes the regularity of the function . The same result can be obtained for any noise variables with known, continuously differentiable density, possessing sufficiently many moments.

The situation changes dramatically when one considers the noise with discontinuous distribution density. Although, the transformation of the original multiplicative model to the additive one is still possible, in particular, the model (1.1) can be rewritten as

 Y′i=f(Xi)+f(Xi)ηi,Y′i=2Yi,ηi=2ui−1,i=1,…,n,

the linear methods are not optimal anymore. As it is proved in Theorem 2.1 the optimal accuracy is given by . To achieve this rate the non-linear estimation procedure, based on locally bayesian approach, is proposed in Section 2.

Another interesting feature is the selection from given family of estimators (see , ). Such selections are used for construction of data-driven (adaptive) procedures. In this context, several approaches to the selection from the family of linear estimators were recently proposed, see for instance , ,  and the references therein. However, these methods are heavily based on the linearity property. As we already mentioned the locally bayesian estimators are non-linear and in Section 3 we propose the selection rule from this family. It requires, in particular, to develop new non-asymptotical exponential inequalities, which may have an independent interest.

Besides the theoretical interest, the multiplicative regression model is applied in various domains, in particular, in the image processing, for example, in so-called nonparametric frontier model (see , ) can be considered as the particular case of the model (1.1). Indeed, the reconstruction of the regression function can be viewed as the estimation of a production set . Indeed, , and, therefore, the estimation of is reduced to finding the upper boundary of . In this context, one can also cite  dealing with the estimation of function’s support. It is worth to mention that although nonparametric estimation in the latter models is studied, the problem of adaptive estimation was not considered in the literature.

#### Minimax estimation.

The first part of the paper is devoted to the minimax over estimation. This means, in particular, that the parameters and are supposed to be known a priori. We find the minimax rate of convergence (1.3) on and propose the estimator being optimal in minimax sense (see Definition 2). Our first result (Theorem 2.1) in this direction consists in establishing a lower bound for maximal risk on . We show that for any the minimax rate of convergence is bounded from below by the sequence

 φn(β)=n−ββ+d. (1.3)

Next, we propose the minimax estimator, i.e. the estimator attaining the normalizing sequence (1.3). To construct the minimax estimator we use so-called locally bayesian estimation construction which consists in the following. Let

 Vh(y)=d⨂j=1[yj−h/2,yj+h/2],

be the neighborhood around such that , where is a given scalar. Fix an integer number and let

 Db=b∑m=0(m+d−1d−1).

Let , we define the local polynomial

 ft(x) = ∑p∈Pbtp(x−yh)pIVh(y)(x),x∈Rd,t=(tp:p∈Pb), (1.4)

where for and denotes the indicator function. The local polynomial can be viewed as an approximation of the regression function inside of the neighborhood and the number of coefficients of this polynomial. Introduce the following subset of

 Θ(A,M)={t∈RDb:2t0,...,0−∥t∥1≥A,∥t∥1≤M}, (1.5)

where is -norm on . can be viewed as the set of coefficients such that for all and for all in the neighbourhood . Consider the pseudo likelihood ratio

 Lh(t,Y(n))=∏i:Xi∈Vh(y)[ft(Xi)]−1I[0,ft(Xi)](Yi),t∈Θ(A,M).

Set also

 πh(t)=∫Θ(A,M)∥t−u∥1Lh(u,Y(n))du,t∈Θ(A,M). (1.6)

Let be the solution of the following minimization problem:

 ^θ(h)=argmint∈Θ(A,M)πh(t). (1.7)

The locally bayesian estimator of is defined now as Note that this local approach allows to estimate successive derivatives of function . In this paper, only the estimation of at a given point is studied.

We note that similar locally parametric approach based on maximum likelihood estimators was recently proposed in  and  for regular statistical models. But when the density of observations is discontinuous, the bayesian approach outperforms the maximum likelihood estimator. This phenomenon is well known in parametric estimation (see ). Moreover, the establishing of statistical properties of bayesian estimators requires typically much weaker assumptions than whose used for analysis of maximum likelihood estimators.

As we see our construction contains an extra-parameter to be chosen. To make this choice we use quite standard arguments. First, we note that in view of the definition of (below in Definition 1), we have

 ∃θ=θ(f,y,h)∈[−M,M]Db:supx∈Vh(y)∣∣f(x)−fθ(x)∣∣≤Ldhβ.

Remark that if , then . Thus, if is chosen sufficiently small, our original model (1.1) is well approximated inside of by the “parametric” model

 Yi=fθ(Xi)×Ui,i=1,…,nhd,nhd∈N∗

in which the bayesian estimator is rate-optimal (See Theorem 2.2).

It is worth mentioning that the analysis of the deviation of from is not simple. Namely here requirements are used. This assumption, which seems not to be necessary, allows us to make the presentation of basic ideas clear and to simplify routine computations (see also Remark 1).

Finally, is chosen as the solution of the following minimization problem

 Ldhβ+1/nhd→minh (1.8)

and we show that corresponding estimator is minimax for on if (see Theorem 2.2). Since the parameter can be chosen in arbitrary way, the proposed estimator is minimax for any given value of the parameter .

We remark that in regular statistical models, where linear methods are usually optimal, the choice of the bandwidth is due to the relation

 Ldhβ+1/√nhd→minh,

with the solution . This explains that the improvement of the rate of convergence, compared to , in the model with the discontinuous density.

The second part of the paper is devoted to the adaptive minimax estimation over collection of isotropic functional classes in the model (1.1). At our knowledge, the problem of adaptive estimation in the multiplicative regression with the noise, having discontinuous density, is not studied in the literature.

Well-known drawback of minimax approach is the dependence of the minimax estimator on the parameters describing functional class on which the maximal risk is determined. In particular, the locally bayesian estimator depends obviously on the parameters and via the solution of the minimization problem (1.7). Moreover optimally chosen in view of (1.8) depends explicitly on and . To overcome this drawback the minimax adaptive approach was proposed (see , , ). The first question arising in the adaptation (reduced to the problem at hand) can be formulated as follows.

Does there exist an estimator which would be minimax on simultaneously for all values of and belonging to some given subset of ?

In section 3, we show that the answer to this question is negative, that is typical for the estimation of the function at a given point (see , , ). This answer can be reformulated in the following manner: the family of rates of convergence is unattainable for the problem under consideration.

Thus, we need to find another family of normalizations for maximal risk which would be attainable and, moreover, optimal in view of some criterion of optimality. Nowadays, the most developed criterion of optimality is due to Klutchnikoff .

We show that the family of normalizations, being optimal in view of this criterion, is

 ϕn(β)=(ρn(β)n)ββ+d,ρn(β)=1+ln(φn(β)φn(b)), (1.9)

whenever The factor can be considered as price to pay for adaptation (see ).

The most important step in proving the optimality of the family (1.9) is to find an estimator, called adaptive, which attains the optimal family of normalizations. Obviously, we seek an estimator whose construction is parameter-free, i.e. independent of and . In order to explain our estimation procedure let us make several remarks.

First we note that the role of the constants and in the construction of the minimax estimator is quite different. Indeed, the constants are used in order to determine the set needed for the construction of the locally bayesian estimator, see (1.6) and (1.7). However, this set does not depend on the localization parameter , in other words, the quantities and are not involved in the selection of optimal size of the local neighborhood given by (1.8). Contrary to that, the constants are used for the derivation of the optimal size of the local neighborhood (1.8), but they are not involved in the construction of the collection of locally bayesian estimators

Next remark explains how to replace the unknown quantities and in the definition of . Our first simple observation consists in the following: the estimator remains minimax if we replace in (1.6) and (1.7) by with any and . It follows from obvious inclusion The next observation is less trivial and it follows from Proposition 1. Put and define for any function

 A(f)=infx∈Vhmax(y)f(x),M(f)=b∑m=0∑p1+…+pd=m∣∣ ∣∣∂mf(y)∂xp11⋯∂xpdd∣∣ ∣∣. (1.10)

The following agreement will be used in the sequel: if the function and be such that does not exist we will put formally in the definition of .

It remains to note that contrary to the quantities and the functionals and can be consistently estimated from the observation (1.1) and let and be the corresponding estimators. The idea now is to determine the collection of locally bayesian estimators by replacing in (1.6) and (1.7) by the random parameter set which is defined as follows.

 ^Θ=Θ(^A/2,4^M)={t∈RDb:2t0,...,0−∥t∥1≥2−1^A,∥t∥1≤4^M}.

In this context it is important to emphasize that the estimators and are built from the same observation which is used for the construction of the family .

Contrary to all saying above, the constants and cannot be estimated consistently. In order to select an “optimal” estimator from the family we use general adaptation scheme due to Lepski , . To the best of our knowledge it is the first time when this method is applied in the statistical model with multiplicative noise and discontinuous distribution. Moreover, except already mentioned papers  and , Lepski’s procedure is typically applied to the selection from the collection of linear estimators (kernel estimators, locally polynomial estimator, etc.). In the present paper we apply this method to very complicated family of nonlinear estimators, obtained by the use of bayesian approach on the random parameter set. It required, in particular, to establish the exponential inequality for the deviation of locally bayesian estimator from the parameter to be estimated (Proposition 1). It generalizes the inequality proved for the parametric model (see  Chapter 1, Section 5), this result seems to be new.

#### Simulations.

In the present paper we adopt the local parametric approximation to a purely non parametric model. As it proved, this strategy leads to the theoretically optimal statistical decisions. But the minimax as well as the minimax adaptive approach are asymptotical and it seems natural to check how proposed estimators work for reasonable sample size. In the simulation study, we test the bayesian estimator in the parametric and nonparametric cases. We show that the adaptive estimator approaches the oracle estimator. The oracle estimator is selected from the family under the hypothesis f that is known. We show that the bayesian estimator performs well starting with .

This paper is organized as follows. In Section 2 we present the results concerning minimax estimation and Section 3 is devoted to the adaptive estimation. The simulations are given in Section 4. The proofs of main results are proved in Section 5 (upper bounds) and section 6 (lower bounds). Auxiliary lemmas are postponed to Appendix (Section 7) contains the proofs of technical results.

## 2 Minimax estimation on isotropic Hölder class

In this section we present several results concerning minimax estimation. First, we establish lower bound for minimax risk defined on for any and . For any we denote and .

###### Definition 1.

Fix , and and let be the largest integer strictly less than . The isotropic Hölder class is the set of functions having on all partial derivatives of order and such that

 ∑0≤|p|≤⌊β⌋supx∈[0,1]d∣∣ ∣∣∂|p|f(x)∂xp11⋯∂xpdd∣∣ ∣∣ ≤ M, ∣∣ ∣∣f(x)−∑0≤|p|≤⌊β⌋∂|p|f(y)∂yp11⋯∂ypddd∏j=1(xj−yj)pjpj!∣∣ ∣∣ ≤ Ld∑j=1|xj−yj|β,

where and are the th components of and .

This definition implies that if (defined in the beginning of this paper), then and , where and are defined in (1.10).

#### Maximal and minimax risk on Hd(β,L,M,A).

To measure the performance of estimation procedures on we will use minimax approach.

Let be the mathematical expectation with respect to the probability law of the observation satisfying (1.1). We define first the maximal risk on corresponding to the estimation of the function at a given point .

Let be an arbitrary estimator built from the observation . Let

The quantity is called maximal risk of the estimator on and the minimax risk on is defined as

where is taken over the set of all estimators.

###### Definition 2.

The normalizing sequence is called minimax rate of convergence (MRT) and the estimator is called minimax (asymptotically minimax) if

 liminfn→∞ψ−qnRn,q[^f,Hd(β,L,M,A)] > 0; limsupn→∞ψ−qnRn,q[^f,Hd(β,L,M,A)] < ∞.
###### Theorem 2.1.

For any , , , and

 liminfn→∞φ−qn(β)Rn,q[Hd(β,L,M,A)]>0,φn(β)=n−ββ+d.
###### Remark 1.

The obtained result shows that on the minimax rate of convergence cannot be faster than . In view of the obvious inclusion the minimax rate of convergence on an isotropic Hölder class is also bounded from below by .

The next theorem shows how to construct the minimax estimator basing on locally bayesian approach. Put and let is given by (1.5), (1.6) and (1.7) with

###### Theorem 2.2.

Let , and be fixed. Then there exists the constant such that for any satisfying

 φ−qn(β)Rn,q[¯f¯h(y),Hd(β,L,M,A)]≤C∗,∀q≥1.

The explicit form of is given in the proof.

###### Remark 2.

We deduce from Theorems 2.1 and 2.2 that the estimator is minimax on .

## 3 Adaptive estimation on isotropic Hölder classes

This section is devoted to the adaptive estimation over the collection of the classes . We will not impose any restriction on possible values of , but we will assume that , where , as previously, is an arbitrary a priori chosen integer.

We start with formulating the result showing that there is no optimally adaptive estimator (here we follow the terminology introduced in , ). It means that there is no an estimator which would be minimax simultaneously for several values of parameter even if all other parameters and are supposed to be fixed. This result does not require any restriction on as well.

###### Theorem 3.1.

For any such that , for any and any ,

 liminfn→∞inf~f[φ−qn(β1)Rn,q(~f,Hd(β1,L,M,A)) +φ−qn(β2)Rn,q(~f,Hd(β2,L,M,A))]=+∞,

where is taken over all possible estimators.

The assertion of Theorem 3.1 can be considerably specified if . To do that we will need the following definition. Let be a given family of normalizations.

###### Definition 3.

The family is called admissible if there exist an estimator such that for some and

 limsupn→∞ψ−qn(β)Rn,q(^f,Hd(β,L,M,A))<∞,∀β∈(0,b]. (3.1)

The estimator satisfying (3.1) is called -attainable. The estimator is called -adaptive if (3.1) holds for any and .

Note that the result proved in Theorem 3.1 means that the family of rates of convergence is not admissible. Denote by the following family of normalizations:

 ϕn(β)=(ρn(β)n)ββ+d,ρn(β)=1+ln(φn(β)φn(b)),β∈(0,b].

We remark that and for any .

###### Theorem 3.2.

Let be an arbitrary admissible family of normalizations.
I. For any such that , there exists an admissible family for which

 limn→∞υn(α)ψ−1n(α)=0.

II. If there exists such that

 limn→∞ψn(γ)ϕ−1n(γ)=0, (3.2)

then necessarily

 (a) limn→∞ψn(β)ϕ−1n(β)>0,∀β∈(0,γ); (b) limn→∞[ψn(γ)ϕn(γ)][ϕn(β)ψn(β)]=0,∀β∈(γ,b].

Several remarks are in order.

We note that if the family of normalizations is admissible, i.e. one can construct -attainable estimator, then is in an optimal family of normalizations in view of Kluchnikoff criterion . It follows from the second assertion of the theorem. We note however that a -attainable estimator may depend on and , and, therefore, this estimator have only theoretical interest. In the next section we construct -adaptive estimator, which is, by its definition, fully parameter-free. Moreover, this estimator obviously proves that is admissible, and, therefore, optimal as it was mentioned above.

As it was already mentioned in Introduction the construction of our estimation procedure consists of several steps. First, we determine the set , built from observation, which is used after that in order to define the family of locally bayesian estimators. Next, based on Lepski’s method (see  and ), we propose data-driven selection from this family.

First step: Determination of parameter set. Put and let be the solution of the following minimization problem.

 inft∈RDbn∑i:Xi∈Vmax(y)[2Yi−tK⊤(Xi−yhmax)]2,Vmax(y)=Vhmax(y),

where the -dimensional vector and the sign below means the transposition. Thus, is the local least squared estimator and its explicit expression is given by

where and is the design matrix. Put

 ~δp=p1!...pd!h−|p|max~θp,|p|≤b.

Introduce the following quantities

 ^A=~δ0…,0,^M=∥∥~δ∥∥1, (3.3)

and define the random parameter set as follows.

 ^Θ={t∈RDb:2t0,...,0−∥t∥1≥2−1^A,∥t∥1≤4^M}. (3.4)

Second step: Collection of locally bayesian estimators. Put

 ^πh(t) = ∫^Θ∥t−u∥1Lh(u,Y(n))du; (3.5) ^θ∗(h) = argmint∈^Θ^πh(t). (3.6)

The family of locally bayesian estimator is defined now as follows.

 ^F={^fh(y)=^θ∗0,…,0(h),h∈(0,hmax]}. (3.7)

Third step: Data-driven selection from the collection . Put

 hk=2−khmax,k=0,…,kn,

where is smallest integer such that . Set

 ^F∗={^f(k)(y)=^θ∗0,…,0(hk),k=0,…,kn}.

We put , where is selected from in accordance with the rule:

 ^k=inf{k=¯¯¯¯¯¯¯¯¯¯¯0,kn:∣∣^f(k)(y)−^f(l)(y)∣∣≤^MSn(l),l=¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯k+1,kn}. (3.8)

Here we have used the following notations.

 Sn(l)=432D3b(32qd+16)λ−1n(hl)⎡⎣1+lln2n(hl)d⎤⎦,l=0,1,…,kn,

and is the smallest eigenvalue of the matrix

 Mnh(y)=1nhdn∑i=1K⊤(Xi−yh)K(Xi−yh)IVh(y)(Xi), (3.9)

which is completely determined by the design points and by the number of observations. We will prove that there exists a nonnegative real , such that for any and any (see Lemma 2).

###### Theorem 3.3.

Let an integer number be fixed. Then for any , and

###### Remark 3.

The assertion of the theorem means that the proposed estimator is -adaptive. It implies in particular that the family of normalizations is admissible. This, together with Theorem 3.2 allows us to state the optimality of in view of Kluchnikoff criterion (see ).

## 4 Simulation study

We will consider the case . The data are simulated accordingly to the model (1.1), where we use the following functions (Figure 1).

Here , and

To construct the family of estimators we use the linear approximation (), i.e. within the neighbourhoods of the given size , the locally bayesian estimator has the form

 ^fh(x)=^θ0+^θ1x,x∈[0,1].

We define the ideal (oracle) value of the parameter as the minimizer of the risk:

 ~h=arginfh∈[1/n,1]Ef∣∣^fh(y)−f(y)∣∣.

To compute it we apply Monte-Carlo simulations (10000 repetitions). Our first objective is to compare the risk provided by the ”oracle” estimator and whose provided by the adaptive estimator from Section 3. Figure 2 shows the deviation of the adaptive estimator from the function to be estimated. In several points, for example in , we remark so-called over-smoothing phenomenon, inherent to any adaptive estimator.

We compute the risks of the oracle and the adaptive estimator in 100 points of the interval . The next tabular presents the mean value of the ratio oracle risk/adaptive risk calculated for the functions and .

Figure 4 presents the ”oracle risk/adaptive risk” ratio as the function of the number of observations .

We consider the function (figure 5), which is linear inside the neighborhood of size around point and simulate observations in accordance with the model (1.1). Using only the observations corresponding to the interval we construct the bayesian estimator .

It is important to emphasize that this estimator is efficient  since the model is parametric. Our objective now is to compare the risk of our adaptive estimator with the risk provided by the estimator . We also try to understand how far is the localization parameter , inherent to the construction of our adaptive estimator, from the true value . We compute the risk of each estimator via Monte-Carlo method with repetitions. For each repetition the procedure select the adaptive bandwidth . We confirm once again the over-smoothing phenomenon since

 h(j)^k∼0.1405>h∗=0.1250,j=1,...,10000.

Note however that the adaptive procedure selects the neighborhood of the size which is quite close to the true one. We also compute the risks of both estimators: “bayesian risk”=0.0206 and “adaptive risk”=0.0308. We conclude that the estimation accuracy provided by our adaptive procedure is quite satisfactory.

## 5 Proofs of main results: upper bounds

Let be the following subinterval of .

 Hn=⎡⎢ ⎢⎣(b+1)∨(lnn)1(d+d2)n1/d,(1lnn)1b+d⎤⎥ ⎥⎦. (5.1)

Later on we will consider only the values of belonging to . We start with establishing the exponential inequality for the deviation of locally bayesian estimator from . The corresponding inequality is the basic technical to allowing to prove minimax and minimax adaptive results.

### 5.1 Exponential Inequality

Introduce the following notations. For any , put , where and

 ωp=∂|p|f(y)∂yp11⋯∂ypddh|p|p1!...pd!,p∈Pb. (5.2)

Remind the agreement which we follow in the present paper: if the function and vector are such that does not exist we put .

Let , given by (1.4), be the local polynomial approximation of inside and let be the corresponding approximation error, i.e.

 bh=supx∈Vh(y)∣∣fω(x)−f(x)∣∣. (5.3)

If , one could remark that by definition of in (5.2) and in Definition 2. Put also

 Nh=bh×nhd,E(h)=exp{(1+6D2b)Nh6A(f)D2b}. (5.4)

Introduce the random events and and put where and are defined in (3.3), Section 3.

Recall that (see Section 3) is the smallest eigenvalue of the matrix

 Mnh(y)=1nhdn∑i=1K⊤(Xi−yh)K(Xi−yh)IVh(y)(Xi),

and is the -dimensional vector of the monomials .

###### Proposition 1.

For any and any such that and , then