Minimax and minimax adaptive estimation in multiplicative regression: locally bayesian approach

Abstract

This paper deals with nonparametric estimation in regression with multiplicative noise. Using local polynomial fitting and the bayesian approach, we construct an estimator that is minimax on an isotropic Hölder class. Next, applying Lepski’s method, we propose an estimator which is optimally adaptive over a collection of isotropic Hölder classes. To prove the optimality of the proposed procedure we establish, in particular, an exponential inequality for the deviation of the locally bayesian estimator from the parameter to be estimated. These theoretical results are illustrated by a simulation study.

1 Introduction

Let the statistical experiment be generated by the pairs of observations where satisfies the equation

Here is unknown function and we are interested in estimating at a given point from observation .

The random variables (noise) are supposed to be independent and uniformly distributed on .
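The observation scheme can be simulated in a few lines. This is only a minimal sketch under stated assumptions: dimension one, an equispaced deterministic design on [0, 1], and noise uniform on [0, 1] (the interval is an assumption made here for illustration). The function `simulate_multiplicative` and the test function `f` are hypothetical names introduced for this example.

```python
import numpy as np

def simulate_multiplicative(f, n, rng):
    """Draw (X_i, Y_i) with Y_i = f(X_i) * U_i on an equispaced design in [0, 1]."""
    x = np.arange(1, n + 1) / n            # deterministic design (assumed equispaced)
    u = rng.uniform(0.0, 1.0, size=n)      # noise, assumed uniform on [0, 1]
    return x, f(x) * u

rng = np.random.default_rng(0)
# Hypothetical regression function, bounded away from zero.
f = lambda t: 1.0 + 0.5 * np.sin(2.0 * np.pi * t)
x, y = simulate_multiplicative(f, 500, rng)
```

Under these assumptions every observation lies between 0 and the value of the regression function at its design point, so the function is the upper boundary of the data cloud.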

The design points are deterministic and without loss of generality we will assume that

Throughout the paper the unknown function is supposed to be smooth; in particular, it belongs to the Hölder ball of functions (see Definition ? below). Here is the smoothness of , is the sum of upper bounds of and its partial derivatives and is the Lipschitz constant.

Moreover, we will consider only the functions separated away from zero by some positive constant. Thus, from now on we will suppose that there exists such that , where

Motivation. The theoretical interest in the multiplicative regression model (Equation 1) with discontinuous noise is dictated by the following fact. The typical approach to models with multiplicative noise consists in transforming them into a model with additive noise and then applying a linear smoothing technique based on standard methods such as kernel smoothing, local polynomials, etc. Let us illustrate the latter approach with one of the most popular nonparametric models, namely multiplicative gaussian regression

Here are i.i.d. standard gaussian random variables and the goal is to estimate the variance .

Putting and one can transform the model (Equation 2) into the heteroscedastic additive regression:

where, obviously, . Applying any of the linear methods mentioned above to the estimation of one can construct an estimator whose estimation accuracy is given by and which is optimal in minimax sense (See Definition ?). The latter result is proved under assumptions on which are similar to the assumption imposed on the function . In particular, denotes the regularity of the function . The same result can be obtained for any noise variables with known, continuously differentiable density, possessing sufficiently many moments.
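The squaring transformation can be checked numerically. Since a standard gaussian variable has second moment equal to one, the conditional mean of the squared observation equals the squared scale function, so any linear smoother applied to the squared data estimates it. The sketch below uses a box-kernel average as one such linear smoother; the scale function `sigma`, the bandwidth and the helper name `variance_by_squaring` are hypothetical choices made for illustration.

```python
import numpy as np

def variance_by_squaring(x, y, x0, h):
    """Box-kernel average of Y^2 around x0, a linear estimate of sigma^2(x0)."""
    mask = np.abs(x - x0) <= h
    if not mask.any():
        raise ValueError("no design points in the window")
    return float(np.mean(y[mask] ** 2))

rng = np.random.default_rng(1)
n = 5000
x = np.arange(1, n + 1) / n
sigma = lambda t: 1.0 + t                    # hypothetical scale function
y = sigma(x) * rng.standard_normal(n)        # multiplicative gaussian model
est = variance_by_squaring(x, y, x0=0.5, h=0.1)
truth = sigma(0.5) ** 2                      # equals 2.25
```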

The situation changes dramatically when one considers the noise with discontinuous distribution density. Although, the transformation of the original multiplicative model to the additive one is still possible, in particular, the model (Equation 1) can be rewritten as

the linear methods are no longer optimal. As is proved in Theorem ?, the optimal accuracy is given by . To achieve this rate, a nonlinear estimation procedure based on the locally bayesian approach is proposed in Section 2.
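The loss of optimality of linear methods is already visible in the simplest parametric instance, where the regression function is a constant. The sketch below assumes noise uniform on [0, 1] and compares a moment (linear) estimator, twice the sample mean, with a bias-corrected sample maximum. The latter is a simple nonlinear stand-in chosen for illustration and is not the locally bayesian estimator of this paper; the function name and default values are hypothetical.

```python
import numpy as np

def compare_parametric_rates(theta=2.0, n=400, reps=2000, seed=0):
    """Monte-Carlo RMSE of a linear and a nonlinear estimator of a constant theta."""
    rng = np.random.default_rng(seed)
    y = theta * rng.uniform(0.0, 1.0, size=(reps, n))   # Y_i = theta * U_i
    est_mean = 2.0 * y.mean(axis=1)                     # linear: E[Y] = theta / 2
    est_max = (n + 1) / n * y.max(axis=1)               # nonlinear: corrected maximum
    rmse = lambda e: float(np.sqrt(np.mean((e - theta) ** 2)))
    return rmse(est_mean), rmse(est_max)

rmse_mean, rmse_max = compare_parametric_rates()
```

In this toy setting the error of the linear estimator decays like the inverse square root of the sample size, while the error of the maximum-based estimator decays like the inverse of the sample size.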

Another interesting feature is the selection from a given family of estimators (see [2], [4]). Such selections are used for the construction of data-driven (adaptive) procedures. In this context, several approaches to the selection from a family of linear estimators were recently proposed, see for instance [4], [5], [8] and the references therein. However, these methods rely heavily on the linearity property. As already mentioned, the locally bayesian estimators are nonlinear, and in Section 3 we propose a selection rule for this family. This requires, in particular, developing new non-asymptotic exponential inequalities, which may be of independent interest.

Besides its theoretical interest, the multiplicative regression model is applied in various domains, in particular in image processing. For example, the so-called nonparametric frontier model (see [1], [19]) can be considered as a particular case of the model (Equation 1). Indeed, the reconstruction of the regression function can be viewed as the estimation of a production set . Indeed, , and, therefore, the estimation of is reduced to finding the upper boundary of . In this context, one can also cite [11], which deals with the estimation of a function’s support. It is worth mentioning that although nonparametric estimation in the latter models has been studied, the problem of adaptive estimation has not been considered in the literature.

Minimax estimation. The first part of the paper is devoted to the minimax over estimation. This means, in particular, that the parameters and are supposed to be known a priori. We find the minimax rate of convergence (Equation 3) on and propose the estimator being optimal in minimax sense (see Definition ?). Our first result (Theorem ?) in this direction consists in establishing a lower bound for maximal risk on . We show that for any the minimax rate of convergence is bounded from below by the sequence

Next, we propose the minimax estimator, i.e. the estimator attaining the normalizing sequence (Equation 3). To construct the minimax estimator we use so-called locally bayesian estimation construction which consists in the following. Let

be the neighborhood around such that , where is a given scalar. Fix an integer number and let

Let , we define the local polynomial

where for and denotes the indicator function. The local polynomial can be viewed as an approximation of the regression function inside of the neighborhood and the number of coefficients of this polynomial. Introduce the following subset of

where is -norm on . can be viewed as the set of coefficients such that for all and for all in the neighbourhood . Consider the pseudo likelihood ratio

Set also

Let be the solution of the following minimization problem:

The locally bayesian estimator of is now defined as Note that this local approach also allows one to estimate successive derivatives of the function . In this paper, only the estimation of at a given point is studied.
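A minimal numerical sketch of the construction above in its simplest instance may help fix ideas. It assumes dimension one, a local polynomial of degree zero (a constant), noise uniform on [0, 1], and an integrated loss equal to the absolute error; all of these are assumptions made for illustration. The name `local_bayes_constant` and the bounds `A` and `M` of the parameter set are hypothetical. Under the uniform assumption, the pseudo-likelihood of a constant t is proportional to t^(-m) for t above the largest local observation and zero elsewhere, where m is the number of observations in the neighborhood.

```python
import numpy as np

def local_bayes_constant(y_local, A, M, grid_size=1001):
    """Minimize over tau the integral of |tau - t| times the pseudo-likelihood on [A, M]."""
    y_local = np.asarray(y_local, dtype=float)
    if not (0.0 < A < M) or y_local.max() > M:
        raise ValueError("need 0 < A < M and all observations below M")
    m = y_local.size
    t = np.linspace(A, M, grid_size)
    # Log pseudo-likelihood: -m * log(t) where t >= max(Y), minus infinity elsewhere.
    log_lik = np.where(t >= y_local.max(), -m * np.log(t), -np.inf)
    w = np.exp(log_lik - log_lik.max())      # rescaled to avoid numerical underflow
    w /= w.sum()
    # Integrated absolute loss of each candidate tau, with tau on the same grid.
    risk = np.abs(t[:, None] - t[None, :]) @ w
    return float(t[np.argmin(risk)])

rng = np.random.default_rng(2)
y_local = 2.0 * rng.uniform(0.0, 1.0, size=200)   # constant function equal to 2
theta_hat = local_bayes_constant(y_local, A=0.5, M=5.0)
```

The grid-based minimization is a deliberately naive design choice: it mirrors the "integrate, then minimize" structure of the definition rather than exploiting any closed form.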

We note that a similar locally parametric approach based on maximum likelihood estimators was recently proposed in [9] and [18] for regular statistical models. But when the density of observations is discontinuous, the bayesian approach outperforms the maximum likelihood estimator. This phenomenon is well known in parametric estimation (see [6]). Moreover, establishing the statistical properties of bayesian estimators typically requires much weaker assumptions than those used for the analysis of maximum likelihood estimators.

As we see our construction contains an extra-parameter to be chosen. To make this choice we use quite standard arguments. First, we note that in view of the definition of (below in Definition ?), we have

Remark that if , then . Thus, if is chosen sufficiently small, our original model (Equation 1) is well approximated inside of by the “parametric” model

in which the bayesian estimator is rate-optimal (See Theorem ?).

It is worth mentioning that the analysis of the deviation of from is not simple. Namely here requirements are used. This assumption, which seems not to be necessary, allows us to make the presentation of basic ideas clear and to simplify routine computations (see also Remark ?).

Finally, is chosen as the solution of the following minimization problem

and we show that corresponding estimator is minimax for on if (see Theorem ?). Since the parameter can be chosen in arbitrary way, the proposed estimator is minimax for any given value of the parameter .

We remark that in regular statistical models, where linear methods are usually optimal, the choice of the bandwidth is due to the relation

with the solution . This explains the improvement of the rate of convergence, compared to , in the model with discontinuous density.
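The bandwidth choices discussed above balance an approximation term against a stochastic term. The following worked computation is a sketch with constants ignored, where $\beta$, $d$, $n$ and $h$ denote the smoothness, the dimension, the number of observations and the bandwidth. The regular case is the standard balance; the order $(nh^{d})^{-1}$ of the stochastic term in the discontinuous case is an assumption made here for illustration.

```latex
% Regular model: the stochastic term is of order (n h^d)^{-1/2}.
h^{\beta} \asymp (n h^{d})^{-1/2}
\;\Longrightarrow\;
h \asymp n^{-\frac{1}{2\beta+d}},
\qquad
h^{\beta} \asymp n^{-\frac{\beta}{2\beta+d}}.

% Discontinuous density: stochastic term assumed of order (n h^d)^{-1}.
h^{\beta} \asymp (n h^{d})^{-1}
\;\Longrightarrow\;
h \asymp n^{-\frac{1}{\beta+d}},
\qquad
h^{\beta} \asymp n^{-\frac{\beta}{\beta+d}}.
```

Since $\beta/(\beta+d) > \beta/(2\beta+d)$, the second balance yields a faster rate for every smoothness and dimension.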

Adaptive estimation. The second part of the paper is devoted to adaptive minimax estimation over a collection of isotropic functional classes in the model (Equation 1). To our knowledge, the problem of adaptive estimation in multiplicative regression with noise having a discontinuous density has not been studied in the literature.

A well-known drawback of the minimax approach is the dependence of the minimax estimator on the parameters describing the functional class on which the maximal risk is determined. In particular, the locally bayesian estimator obviously depends on the parameters and via the solution of the minimization problem (Equation 7). Moreover, optimally chosen in view of (Equation 8) depends explicitly on and . To overcome this drawback the minimax adaptive approach was proposed (see [12], [13], [16]). The first question arising in adaptation (reduced to the problem at hand) can be formulated as follows.

Does there exist an estimator which would be minimax on simultaneously for all values of and belonging to some given subset of ?

In Section 3, we show that the answer to this question is negative, which is typical for the estimation of a function at a given point (see [15], [20], [21]). This answer can be reformulated in the following manner: the family of rates of convergence is unattainable for the problem under consideration.

Thus, we need to find another family of normalizations for maximal risk which would be attainable and, moreover, optimal in view of some criterion of optimality. Nowadays, the most developed criterion of optimality is due to Klutchnikoff [10].

We show that the family of normalizations, being optimal in view of this criterion, is

whenever The factor can be considered as the price to pay for adaptation (see [13]).

The most important step in proving the optimality of the family (Equation 9) is to find an estimator, called adaptive, which attains the optimal family of normalizations. Obviously, we seek an estimator whose construction is parameter-free, i.e. independent of and . In order to explain our estimation procedure let us make several remarks.

First we note that the role of the constants and in the construction of the minimax estimator is quite different. Indeed, the constants are used in order to determine the set needed for the construction of the locally bayesian estimator, see (Equation 6) and (Equation 7). However, this set does not depend on the localization parameter , in other words, the quantities and are not involved in the selection of optimal size of the local neighborhood given by (Equation 8). Contrary to that, the constants are used for the derivation of the optimal size of the local neighborhood (Equation 8), but they are not involved in the construction of the collection of locally bayesian estimators

Next remark explains how to replace the unknown quantities and in the definition of . Our first simple observation consists in the following: the estimator remains minimax if we replace in (Equation 6) and (Equation 7) by with any and . It follows from obvious inclusion The next observation is less trivial and it follows from Proposition ?. Put and define for any function

The following convention will be used in the sequel: if the function and are such that does not exist, we formally put in the definition of .

It remains to note that contrary to the quantities and the functionals and can be consistently estimated from the observation (Equation 1) and let and be the corresponding estimators. The idea now is to determine the collection of locally bayesian estimators by replacing in (Equation 6) and (Equation 7) by the random parameter set which is defined as follows.

In this context it is important to emphasize that the estimators and are built from the same observation which is used for the construction of the family .

Contrary to what was said above, the constants and cannot be estimated consistently. In order to select an “optimal” estimator from the family we use the general adaptation scheme due to Lepski [12], [14]. To the best of our knowledge, this is the first time that this method is applied in a statistical model with multiplicative noise and a discontinuous distribution. Moreover, except for the already mentioned papers [9] and [18], Lepski’s procedure is typically applied to the selection from a collection of linear estimators (kernel estimators, locally polynomial estimators, etc.). In the present paper we apply this method to a very complicated family of nonlinear estimators, obtained by the use of the bayesian approach on the random parameter set. This required, in particular, establishing an exponential inequality for the deviation of the locally bayesian estimator from the parameter to be estimated (Proposition ?). It generalizes the inequality proved for the parametric model (see [6], Chapter 1, Section 5); this result seems to be new.

Simulations. In the present paper we adopt the local parametric approximation to a purely nonparametric model. As is proved, this strategy leads to theoretically optimal statistical decisions. But the minimax as well as the minimax adaptive approaches are asymptotic, and it seems natural to check how the proposed estimators work for reasonable sample sizes. In the simulation study, we test the bayesian estimator in the parametric and nonparametric cases. We show that the adaptive estimator approaches the oracle estimator. The oracle estimator is selected from the family under the hypothesis that f is known. We show that the bayesian estimator performs well starting with .

This paper is organized as follows. In Section 2 we present the results concerning minimax estimation, and Section 3 is devoted to adaptive estimation. The simulations are given in Section 4. The proofs of the main results are given in Section 5 (upper bounds) and Section 6 (lower bounds). Auxiliary lemmas and the proofs of technical results are postponed to the Appendix (Section 7).

2 Minimax estimation on isotropic Hölder class

In this section we present several results concerning minimax estimation. First, we establish a lower bound for the minimax risk defined on for any and . For any we denote and .

This definition implies that if (defined in the beginning of this paper), then and , where and are defined in (Equation 10).

Maximal and minimax risk on . To measure the performance of estimation procedures on we will use minimax approach.

Let be the mathematical expectation with respect to the probability law of the observation satisfying (Equation 1). We define first the maximal risk on corresponding to the estimation of the function at a given point .

Let be an arbitrary estimator built from the observation . Let

The quantity is called maximal risk of the estimator on and the minimax risk on is defined as

where is taken over the set of all estimators.

The next theorem shows how to construct the minimax estimator based on the locally bayesian approach. Put and let be given by (Equation 5), (Equation 6) and (Equation 7) with

3 Adaptive estimation on isotropic Hölder classes

This section is devoted to the adaptive estimation over the collection of the classes . We will not impose any restriction on possible values of , but we will assume that , where , as previously, is an arbitrary a priori chosen integer.

We start by formulating the result showing that there is no optimally adaptive estimator (here we follow the terminology introduced in [13], [14]). It means that there is no estimator which would be minimax simultaneously for several values of the parameter even if all other parameters and are supposed to be fixed. This result does not require any restriction on either.

The assertion of Theorem ? can be considerably specified if . To do that we will need the following definition. Let be a given family of normalizations.

Note that the result proved in Theorem ? means that the family of rates of convergence is not admissible. Denote by the following family of normalizations:

We remark that and for any .

Several remarks are in order.

We note that if the family of normalizations is admissible, i.e. one can construct a -attainable estimator, then is an optimal family of normalizations in view of the Klutchnikoff criterion [10]. This follows from the second assertion of the theorem. We note however that a -attainable estimator may depend on and , and, therefore, such an estimator has only theoretical interest. In the next section we construct a -adaptive estimator, which is, by its definition, fully parameter-free. Moreover, this estimator obviously proves that is admissible, and, therefore, optimal as mentioned above.

Construction of -adaptive estimator. As already mentioned in the Introduction, the construction of our estimation procedure consists of several steps. First, we determine the set , built from the observation, which is then used to define the family of locally bayesian estimators. Next, based on Lepski’s method (see [13] and [16]), we propose a data-driven selection from this family.
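Before the formal construction, the selection step can be sketched generically. The code below is a common textbook form of a Lepski-type rule, not necessarily the exact rule of this paper: estimators are indexed from the largest bandwidth to the smallest, and the rule keeps the first index whose estimate stays within a prescribed threshold of every estimate built with a smaller bandwidth. The function name `lepski_select` and the numerical values of the estimates and thresholds are hypothetical.

```python
def lepski_select(estimates, thresholds):
    """Return the first index k with |est[k] - est[l]| <= thresholds[l] for all l > k.

    Index 0 corresponds to the largest bandwidth; thresholds[l] bounds the
    stochastic error of the l-th estimator and grows as the bandwidth shrinks.
    """
    if len(estimates) != len(thresholds) or len(estimates) == 0:
        raise ValueError("estimates and thresholds must be non-empty and of equal length")
    last = len(estimates) - 1
    for k in range(last):
        if all(abs(estimates[k] - estimates[l]) <= thresholds[l]
               for l in range(k + 1, last + 1)):
            return k
    return last   # the smallest bandwidth has nothing finer to be compared with

# Hypothetical values: the two largest bandwidths are visibly biased.
estimates = [1.9, 1.5, 1.05, 1.0, 0.98]
thresholds = [0.01, 0.02, 0.05, 0.1, 0.2]
k_hat = lepski_select(estimates, thresholds)
```

With these numbers the rule rejects the two largest bandwidths and selects the third one, whose estimate is compatible with both finer estimates.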

Put and let be the solution of the following minimization problem.

where is the -dimensional vector and the sign below denotes transposition. Thus, is the local least squares estimator and its explicit expression is given by

where and is the design matrix. Put

Introduce the following quantities

and define the random parameter set as follows.

Put

The family of locally bayesian estimators is defined now as follows.

Put

where is the smallest integer such that . Set

We put , where is selected from in accordance with the rule:

Here we have used the following notations.

and is the smallest eigenvalue of the matrix

which is completely determined by the design points and by the number of observations. We will prove that there exists a nonnegative real , such that for any and any (see Lemma ?).
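Two ingredients of the construction above, the local least squares fit and the smallest eigenvalue of a matrix built from the design points, can be sketched as follows. This is an illustration in dimension one under stated assumptions: the monomials are taken in the rescaled variable (x - x0) / h, and the matrix is normalized by the number of design points in the window. Both conventions, as well as the function names, are hypothetical and may differ from the definitions used in this paper.

```python
import numpy as np

def local_design(x, x0, h, degree):
    """Select the window around x0 and build the matrix of rescaled monomials."""
    mask = np.abs(x - x0) <= h
    z = (x[mask] - x0) / h
    return mask, np.vander(z, degree + 1, increasing=True)   # columns 1, z, ..., z^degree

def local_least_squares(x, y, x0, h, degree=1):
    """Coefficients of the local polynomial least squares fit in the window."""
    mask, Z = local_design(x, x0, h, degree)
    coef, *_ = np.linalg.lstsq(Z, y[mask], rcond=None)
    return coef

def smallest_gram_eigenvalue(x, x0, h, degree=1):
    """Smallest eigenvalue of the normalized Gram matrix of the local design."""
    mask, Z = local_design(x, x0, h, degree)
    gram = Z.T @ Z / mask.sum()
    return float(np.linalg.eigvalsh(gram)[0])   # eigvalsh returns ascending eigenvalues

n = 1000
x = np.arange(1, n + 1) / n
y = 2.0 + 3.0 * (x - 0.5)                 # noise-free line, for checking only
coef = local_least_squares(x, y, x0=0.5, h=0.1)
lam = smallest_gram_eigenvalue(x, x0=0.5, h=0.1)
```

The eigenvalue depends only on the design points and the window, not on the observations, which is the property used in the text.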

4 Simulation study

We will consider the case . The data are simulated according to the model (Equation 1), where we use the following functions (Figure 1).

Here , and

To construct the family of estimators we use the linear approximation (), i.e. within the neighbourhoods of the given size , the locally bayesian estimator has the form

We define the ideal (oracle) value of the parameter as the minimizer of the risk:

To compute it we apply Monte-Carlo simulations (10000 repetitions). Our first objective is to compare the risk provided by the “oracle” estimator with that provided by the adaptive estimator from Section 3. Figure 2 shows the deviation of the adaptive estimator from the function to be estimated. At several points, for example at , we observe the so-called over-smoothing phenomenon, inherent to any adaptive estimator.
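The Monte-Carlo computation of an oracle bandwidth can be sketched as follows. This is an illustration only: the noise is assumed uniform on [0, 1], the risk is taken to be the mean absolute error, and the estimator is a bias-corrected local maximum used as a simple stand-in for the locally bayesian estimator with linear approximation. The test function, the bandwidth grid, the number of repetitions and all function names are hypothetical.

```python
import numpy as np

def local_max_estimator(x, y, x0, h):
    """Bias-corrected maximum of the observations in the window (stand-in estimator)."""
    mask = np.abs(x - x0) <= h
    m = int(mask.sum())
    if m == 0:
        raise ValueError("no design points in the window")
    return (m + 1) / m * y[mask].max()

def oracle_bandwidth(f, x0, n, bandwidths, reps=300, seed=0):
    """Bandwidth minimizing the Monte-Carlo risk at x0, computed with f known."""
    rng = np.random.default_rng(seed)
    x = np.arange(1, n + 1) / n
    fx = f(x)
    errors = np.zeros((reps, len(bandwidths)))
    for r in range(reps):
        y = fx * rng.uniform(0.0, 1.0, size=n)        # fresh sample from the model
        for j, h in enumerate(bandwidths):
            errors[r, j] = abs(local_max_estimator(x, y, x0, h) - f(x0))
    risks = errors.mean(axis=0)
    return bandwidths[int(np.argmin(risks))], risks

f = lambda t: 1.0 + 0.5 * np.sin(2.0 * np.pi * t)     # hypothetical test function
bandwidths = [0.02, 0.05, 0.1, 0.2, 0.4]
h_oracle, risks = oracle_bandwidth(f, x0=0.5, n=200, bandwidths=bandwidths)
```

The oracle is not a statistical procedure, since it uses the unknown function; it only serves as a benchmark for the adaptive choice.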

Oracle-adaptive ratio. We compute the risks of the oracle and the adaptive estimator at 100 points of the interval . The following table presents the mean value of the ratio oracle risk/adaptive risk calculated for the functions and .

Figure 3 presents the “oracle risk/adaptive risk” ratio as the function of the number of observations .

Adaptation versus parametric estimation. We consider the function (Figure 4), which is linear inside the neighborhood of size around point and simulate observations in accordance with the model (Equation 1). Using only the observations corresponding to the interval we construct the bayesian estimator .

It is important to emphasize that this estimator is efficient [6] since the model is parametric. Our objective now is to compare the risk of our adaptive estimator with the risk provided by the estimator . We also try to understand how far the localization parameter , inherent to the construction of our adaptive estimator, is from the true value . We compute the risk of each estimator via the Monte-Carlo method with repetitions. For each repetition the procedure selects the adaptive bandwidth . We confirm once again the over-smoothing phenomenon since

Note however that the adaptive procedure selects the neighborhood of the size which is quite close to the true one. We also compute the risks of both estimators: “bayesian risk” = 0.0206 and “adaptive risk” = 0.0308. We conclude that the estimation accuracy provided by our adaptive procedure is quite satisfactory.

5 Proofs of main results: upper bounds

Let be the following subinterval of .

Later on we will consider only the values of belonging to . We start by establishing the exponential inequality for the deviation of the locally bayesian estimator from . This inequality is the basic technical tool allowing us to prove the minimax and minimax adaptive results.

5.1 Exponential Inequality

Introduce the following notations. For any , put , where and

Recall the convention which we follow in the present paper: if the function and vector are such that does not exist, we put .

Let , given by (Equation 4), be the local polynomial approximation of inside and let be the corresponding approximation error, i.e.

If , one could remark that by definition of in (Equation 18) and in Definition ?. Put also

Introduce the random events and and put where and are defined in (Equation 11), Section 3.

Recall that (see Section 3) is the smallest eigenvalue of the matrix

and is the -dimensional vector of the monomials .

The next proposition provides us with an upper bound for the risk of a locally bayesian estimator.

5.2 Proof of Proposition

Before starting the proof, let us briefly discuss its ingredients.

Discussion. I. First, the obvious inclusion (recall that minimizes defined in (Equation 13))

allows us to reduce the study of the deviation of from to the study of the behaviour of .

II. We note that is the integral functional of the pseudo-likelihood . As a consequence, the behaviour of is completely determined by this process. Following [6] (Chapter 1, Section 5, Theorem 5.2), where similar problems were studied under a parametric model assumption, we introduce the stochastic process

defined on

Here, the vector is defined as follows.

where is the coefficients of Taylor polynomial defined in (Equation 18). The definition of implies obviously

As it was noted in [6] (Chapter 1, Section 5, Theorem 5.2) the following properties of the process are essential for the study of :

• Hölder continuity of its trajectories;

• the rate of its decay at infinity.

The exact statements are formulated in Lemma ? below.

III. As was shown in [6] (Chapter 1, Section 5, Theorem 5.2) in the parametric situation, the above-mentioned properties yield the desired properties of the process

where the set is defined in (Equation 12). The exact statements are given in Assertions 1 and 2. The latter process is important in view of the following inclusion

Auxiliary Lemma. First, we note that in view of (Equation 21), the event is always realized, because . Hence, can be rewritten

Proof of Proposition . Define for any and

where , is defined such that: for any , (for more details, see Lemma ?) and

. Suppose that Assertions 1 and 2 are proved. Then, in view of Assertion ?, choosing we get

Using the Tchebychev inequality, we have in view of the last inequality

The assertion of Proposition ? follows now from the last inequality, Assertion ? and the definitions of and the function .

. Now, we will prove Assertion ?. The definition of and implies

Some remarks are in order. First, it is easily seen that . Therefore, if the event holds then Recall also that minimizes defined in (Equation 13) and, therefore, the following inclusion holds since