Minimax and minimax adaptive estimation in multiplicative regression: locally bayesian approach
The paper deals with non-parametric estimation in regression with multiplicative noise. Using local polynomial fitting and the bayesian approach, we construct an estimator that is minimax on an isotropic Hölder class. Next, applying Lepski’s method, we propose an estimator which is optimally adaptive over the collection of isotropic Hölder classes. To prove the optimality of the proposed procedure we establish, in particular, an exponential inequality for the deviation of the locally bayesian estimator from the parameter to be estimated. These theoretical results are illustrated by a simulation study.
Let the statistical experiment be generated by the pairs of observations where satisfies the equation
Here is an unknown function and we are interested in estimating at a given point from the observation .
The random variables (noise) are supposed to be independent and uniformly distributed on .
The design points are deterministic and without loss of generality we will assume that
Throughout the paper the unknown function is supposed to be smooth; in particular, it belongs to the Hölder ball of functions (see Definition ? below). Here is the smoothness of , is the sum of upper bounds of and its partial derivatives and is the Lipschitz constant.
Moreover, we will consider only the functions separated away from zero by some positive constant. Thus, from now on we will suppose that there exists such that , where
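The observation scheme above can be sketched numerically. This is a minimal illustration under stated assumptions: dimension one, a regular design X_i = i/n, and noise uniform on [0, 1] (the interval is elided in the text above). The test function `f` is hypothetical and chosen only because it is bounded away from zero.

```python
import numpy as np

def simulate_multiplicative(f, n, rng):
    """Draw (X_i, Y_i), i = 1..n, with Y_i = f(X_i) * U_i.

    Assumptions (not fixed by the text): d = 1, regular design
    X_i = i/n, and U_i i.i.d. uniform on [0, 1].
    """
    x = np.arange(1, n + 1) / n
    u = rng.uniform(0.0, 1.0, size=n)
    return x, f(x) * u

rng = np.random.default_rng(0)
f = lambda t: 2.0 + np.sin(2 * np.pi * t)   # hypothetical test function, f >= 1 > 0
x, y = simulate_multiplicative(f, 500, rng)
```

Under these assumptions every observation lies below the curve, so the graph of the regression function is the upper boundary of the data cloud.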
Motivation. The theoretical interest in the multiplicative regression model (Equation 1) with discontinuous noise is dictated by the following fact. The typical approach to the study of models with multiplicative noise consists in transforming them into a model with additive noise and then applying a linear smoothing technique based on standard methods such as kernel smoothing, local polynomials, etc. Let us illustrate the latter approach by considering one of the most popular non-parametric models, namely multiplicative gaussian regression
Here are i.i.d. standard gaussian random variables and the goal is to estimate the variance .
Putting and one can transform the model (Equation 2) into the heteroscedastic additive regression:
where, obviously, . Applying any of the linear methods mentioned above to the estimation of one can construct an estimator whose estimation accuracy is given by and which is optimal in minimax sense (See Definition ?). The latter result is proved under assumptions on which are similar to the assumption imposed on the function . In particular, denotes the regularity of the function . The same result can be obtained for any noise variables with known, continuously differentiable density, possessing sufficiently many moments.
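The transformation to an additive model can be checked numerically. The sketch below assumes the transform is the logarithm of the squared observation (the exact formula is elided above), so that ln(Y^2) = ln(sigma^2) + ln(xi^2). It uses the known fact that E ln(xi^2) = -(gamma + ln 2) for a standard gaussian xi, where gamma is the Euler constant. The function `sigma` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = np.arange(1, n + 1) / n
sigma = 1.0 + 0.5 * np.cos(2 * np.pi * x)    # hypothetical sigma(x) >= 0.5
y = sigma * rng.standard_normal(n)           # multiplicative gaussian model

# Assumed transform: ln(Y^2) = ln(sigma^2) + ln(xi^2).
z = np.log(y ** 2)

# Additive noise ln(xi^2) has mean -(gamma + ln 2), roughly -1.27.
shift = -(np.euler_gamma + np.log(2.0))
resid = z - np.log(sigma ** 2)
print(resid.mean(), shift)
```

After subtracting `shift`, the transformed noise is centered, and any linear smoother applied to `z - shift` targets ln(sigma^2).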
The situation changes dramatically when one considers noise with a discontinuous distribution density. Although the transformation of the original multiplicative model to an additive one is still possible, in particular, the model (Equation 1) can be rewritten as
the linear methods are no longer optimal. As is proved in Theorem ?, the optimal accuracy is given by . To achieve this rate, a non-linear estimation procedure based on the locally bayesian approach is proposed in Section 2.
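The loss of optimality of linear methods is already visible in the simplest parametric case. The sketch below is a classical illustration, not the procedure of this paper: it assumes a constant regression function theta and noise uniform on [0, 1], and compares a linear (moment) estimator with the bias-corrected sample maximum.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, reps = 3.0, 2000      # hypothetical constant value of f

def rmse(n):
    """Root mean squared errors of two estimators of theta from n observations."""
    y = theta * rng.uniform(size=(reps, n))
    linear = 2.0 * y.mean(axis=1)              # linear in Y, error of order n^(-1/2)
    extreme = (n + 1) / n * y.max(axis=1)      # non-linear, error of order n^(-1)
    err = lambda est: float(np.sqrt(np.mean((est - theta) ** 2)))
    return err(linear), err(extreme)

lin_100, ext_100 = rmse(100)
lin_1000, ext_1000 = rmse(1000)
print(lin_100, ext_100, lin_1000, ext_1000)
```

Multiplying the sample size by ten divides the error of the linear estimator by about three and that of the non-linear one by about ten, which is the parametric counterpart of the rate improvement discussed above.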
Another interesting feature is the selection from a given family of estimators (see , ). Such selections are used for the construction of data-driven (adaptive) procedures. In this context, several approaches to the selection from a family of linear estimators were recently proposed, see for instance , ,  and the references therein. However, these methods rely heavily on the linearity property. As we already mentioned, the locally bayesian estimators are non-linear, and in Section 3 we propose a selection rule for this family. This requires, in particular, the development of new non-asymptotic exponential inequalities, which may be of independent interest.
Besides the theoretical interest, the multiplicative regression model is applied in various domains, in particular in image processing. For example, the so-called nonparametric frontier model (see , ) can be considered as a particular case of the model (Equation 1). Indeed, the reconstruction of the regression function can be viewed as the estimation of a production set . Indeed, , and, therefore, the estimation of is reduced to finding the upper boundary of . In this context, one can also cite  dealing with the estimation of a function’s support. It is worth mentioning that although nonparametric estimation in the latter models has been studied, the problem of adaptive estimation has not been considered in the literature.
Minimax estimation. The first part of the paper is devoted to the minimax over estimation. This means, in particular, that the parameters and are supposed to be known a priori. We find the minimax rate of convergence (Equation 3) on and propose the estimator being optimal in minimax sense (see Definition ?). Our first result (Theorem ?) in this direction consists in establishing a lower bound for maximal risk on . We show that for any the minimax rate of convergence is bounded from below by the sequence
Next, we propose the minimax estimator, i.e. the estimator attaining the normalizing sequence (Equation 3). To construct the minimax estimator we use so-called locally bayesian estimation construction which consists in the following. Let
be the neighborhood around such that , where is a given scalar. Fix an integer number and let
Let , we define the local polynomial
where for and denotes the indicator function. The local polynomial can be viewed as an approximation of the regression function inside of the neighborhood and the number of coefficients of this polynomial. Introduce the following subset of
where is -norm on . can be viewed as the set of coefficients such that for all and for all in the neighbourhood . Consider the pseudo likelihood ratio
Let be the solution of the following minimization problem:
The locally bayesian estimator of is defined now as Note that this local approach allows to estimate successive derivatives of function . In this paper, only the estimation of at a given point is studied.
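To make the construction concrete, here is a minimal sketch of the simplest case, a local polynomial of degree zero in dimension one. It assumes noise uniform on [0, 1], a window of width h centred at the point of interest, and known bounds A and M on the function; the name `local_constant_bayes` is hypothetical. Under these assumptions the pseudo likelihood of a constant t is proportional to t^(-m) on [max Y_i, M], where m is the number of points in the window, and the minimizer of the integrated absolute loss is the median of this normalized pseudo likelihood, which has a closed form.

```python
import numpy as np

def local_constant_bayes(x, y, x0, h, A, M):
    """Median of the normalized pseudo likelihood for a locally constant fit."""
    yw = y[np.abs(x - x0) <= h / 2]      # observations inside the window
    m = yw.size
    a = max(A, yw.max()) if m else A     # pseudo likelihood vanishes below max Y_i
    if a >= M:                           # data incompatible with the bound M
        return M
    if m == 0:
        return 0.5 * (A + M)             # flat weight on [A, M]
    if m == 1:
        return float(np.sqrt(a * M))     # weight proportional to 1/t
    k = m - 1
    # Solve a^(-k) - t^(-k) = (a^(-k) - M^(-k)) / 2 for t, in a stable form.
    return a * (0.5 * (1.0 + (a / M) ** k)) ** (-1.0 / k)

rng = np.random.default_rng(3)
n = 2000
x = np.arange(1, n + 1) / n
y = 2.0 * rng.uniform(size=n)            # constant f = 2, for illustration only
est = local_constant_bayes(x, y, x0=0.5, h=0.2, A=0.5, M=5.0)
print(est)
```

For higher polynomial degrees the integral over the parameter set no longer reduces to a one-dimensional median and would have to be computed numerically.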
We note that a similar locally parametric approach based on maximum likelihood estimators was recently proposed in  and  for regular statistical models. But when the density of observations is discontinuous, the bayesian approach outperforms the maximum likelihood estimator. This phenomenon is well known in parametric estimation (see ). Moreover, establishing statistical properties of bayesian estimators typically requires much weaker assumptions than those used for the analysis of maximum likelihood estimators.
As we see our construction contains an extra-parameter to be chosen. To make this choice we use quite standard arguments. First, we note that in view of the definition of (below in Definition ?), we have
Remark that if , then . Thus, if is chosen sufficiently small, our original model (Equation 1) is well approximated inside of by the “parametric” model
in which the bayesian estimator is rate-optimal (See Theorem ?).
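The advantage of the bayesian estimator over maximum likelihood in a parametric model with discontinuous density can be observed by simulation. The sketch assumes a constant parameter theta, noise uniform on [0, 1], a flat weight on a hypothetical interval [A, M] and the absolute loss. The maximum likelihood estimator is then the sample maximum, and the bayesian estimator is the median of the normalized likelihood.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 2.0, 100, 10000     # hypothetical values
A, M = 0.5, 10.0                     # assumed a priori bounds on theta

ymax = (theta * rng.uniform(size=(reps, n))).max(axis=1)   # MLE in each sample

# Median of the density proportional to t^(-n) on [max(A, ymax), M].
a = np.maximum(A, ymax)
k = n - 1
bayes = a * (0.5 * (1.0 + (a / M) ** k)) ** (-1.0 / k)

risk_mle = float(np.mean(np.abs(ymax - theta)))
risk_bayes = float(np.mean(np.abs(bayes - theta)))
print(risk_mle, risk_bayes)
```

Both estimators converge at the rate 1/n here, but the sample maximum always underestimates theta, while the bayesian estimator shifts it upward and achieves a smaller constant in the risk.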
It is worth mentioning that the analysis of the deviation of from is not simple. Namely here requirements are used. This assumption, which seems not to be necessary, allows us to make the presentation of basic ideas clear and to simplify routine computations (see also Remark ?).
Finally, is chosen as the solution of the following minimization problem
and we show that corresponding estimator is minimax for on if (see Theorem ?). Since the parameter can be chosen in arbitrary way, the proposed estimator is minimax for any given value of the parameter .
We remark that in regular statistical models, where linear methods are usually optimal, the choice of the bandwidth is due to the relation
with the solution . This explains that the improvement of the rate of convergence, compared to , in the model with the discontinuous density.
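The two bandwidth choices can be compared numerically. The balance equations are elided above, so the sketch below assumes their standard forms: the approximation error L h^beta is balanced against 1/(n h^d) in the model with discontinuous density, and against (n h^d)^(-1/2) in regular models. The function name `bandwidths` is hypothetical.

```python
def bandwidths(n, beta, L, d=1):
    """Bandwidths solving the two assumed balance equations."""
    h_disc = (L * n) ** (-1.0 / (beta + d))           # L h^beta = 1 / (n h^d)
    h_reg = (L ** 2 * n) ** (-1.0 / (2 * beta + d))   # L h^beta = (n h^d)^(-1/2)
    return h_disc, h_reg

n, beta, L, d = 10_000, 2.0, 1.0, 1
h_disc, h_reg = bandwidths(n, beta, L, d)
rate_disc = L * h_disc ** beta      # of order n^(-beta / (beta + d))
rate_reg = L * h_reg ** beta        # of order n^(-beta / (2 beta + d))
print(h_disc, rate_disc, h_reg, rate_reg)
```

Under these assumptions the discontinuous case selects a smaller neighbourhood and yields a smaller normalization, which is the improvement of the rate mentioned above.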
Adaptive estimation. The second part of the paper is devoted to adaptive minimax estimation over a collection of isotropic functional classes in the model (Equation 1). To our knowledge, the problem of adaptive estimation in multiplicative regression with noise having a discontinuous density has not been studied in the literature.
A well-known drawback of the minimax approach is the dependence of the minimax estimator on the parameters describing the functional class on which the maximal risk is determined. In particular, the locally bayesian estimator depends obviously on the parameters and via the solution of the minimization problem (Equation 7). Moreover optimally chosen in view of (Equation 8) depends explicitly on and . To overcome this drawback the minimax adaptive approach was proposed (see , , ). The first question arising in the adaptation (reduced to the problem at hand) can be formulated as follows.
Does there exist an estimator which would be minimax on simultaneously for all values of and belonging to some given subset of ?
In Section 3, we show that the answer to this question is negative, which is typical for the estimation of a function at a given point (see , , ). This answer can be reformulated in the following manner: the family of rates of convergence is unattainable for the problem under consideration.
Thus, we need to find another family of normalizations for maximal risk which would be attainable and, moreover, optimal in view of some criterion of optimality. Nowadays, the most developed criterion of optimality is due to Klutchnikoff .
We show that the family of normalizations, being optimal in view of this criterion, is
whenever The factor can be considered as the price to pay for adaptation (see ).
The most important step in proving the optimality of the family (Equation 9) is to find an estimator, called adaptive, which attains the optimal family of normalizations. Obviously, we seek an estimator whose construction is parameter-free, i.e. independent of and . In order to explain our estimation procedure let us make several remarks.
First we note that the role of the constants and in the construction of the minimax estimator is quite different. Indeed, the constants are used in order to determine the set needed for the construction of the locally bayesian estimator, see (Equation 6) and (Equation 7). However, this set does not depend on the localization parameter , in other words, the quantities and are not involved in the selection of optimal size of the local neighborhood given by (Equation 8). Contrary to that, the constants are used for the derivation of the optimal size of the local neighborhood (Equation 8), but they are not involved in the construction of the collection of locally bayesian estimators
Next remark explains how to replace the unknown quantities and in the definition of . Our first simple observation consists in the following: the estimator remains minimax if we replace in (Equation 6) and (Equation 7) by with any and . It follows from obvious inclusion The next observation is less trivial and it follows from Proposition ?. Put and define for any function
The following agreement will be used in the sequel: if the function and be such that does not exist we will put formally in the definition of .
It remains to note that contrary to the quantities and the functionals and can be consistently estimated from the observation (Equation 1) and let and be the corresponding estimators. The idea now is to determine the collection of locally bayesian estimators by replacing in (Equation 6) and (Equation 7) by the random parameter set which is defined as follows.
In this context it is important to emphasize that the estimators and are built from the same observation which is used for the construction of the family .
In contrast to the above, the constants and cannot be estimated consistently. In order to select an “optimal” estimator from the family we use the general adaptation scheme due to Lepski , . To the best of our knowledge, this is the first time this method is applied to a statistical model with multiplicative noise and discontinuous distribution. Moreover, except for the already mentioned papers  and , Lepski’s procedure is typically applied to the selection from a collection of linear estimators (kernel estimators, locally polynomial estimators, etc.). In the present paper we apply this method to a very complicated family of nonlinear estimators, obtained by the use of the bayesian approach on the random parameter set. This requires, in particular, establishing an exponential inequality for the deviation of the locally bayesian estimator from the parameter to be estimated (Proposition ?). It generalizes the inequality proved for the parametric model (see  Chapter 1, Section 5); this result seems to be new.
Simulations. In the present paper we adopt a local parametric approximation to a purely nonparametric model. As is proved, this strategy leads to theoretically optimal statistical decisions. But the minimax as well as the minimax adaptive approach are asymptotic, and it seems natural to check how the proposed estimators work for reasonable sample sizes. In the simulation study, we test the bayesian estimator in the parametric and nonparametric cases. We show that the adaptive estimator approaches the oracle estimator. The oracle estimator is selected from the family under the hypothesis that f is known. We show that the bayesian estimator performs well starting with .
This paper is organized as follows. In Section 2 we present the results concerning minimax estimation, and Section 3 is devoted to adaptive estimation. The simulations are given in Section 4. The main results are proved in Section 5 (upper bounds) and Section 6 (lower bounds). The Appendix (Section 7) contains the proofs of auxiliary lemmas and technical results.
2 Minimax estimation on isotropic Hölder class
In this section we present several results concerning minimax estimation. First, we establish a lower bound for the minimax risk defined on for any and . For any we denote and .
This definition implies that if (defined in the beginning of this paper), then and , where and are defined in (Equation 10).
Maximal and minimax risk on . To measure the performance of estimation procedures on we will use minimax approach.
Let be the mathematical expectation with respect to the probability law of the observation satisfying (Equation 1). We define first the maximal risk on corresponding to the estimation of the function at a given point .
Let be an arbitrary estimator built from the observation . Let
The quantity is called maximal risk of the estimator on and the minimax risk on is defined as
where is taken over the set of all estimators.
3 Adaptive estimation on isotropic Hölder classes
This section is devoted to the adaptive estimation over the collection of the classes . We will not impose any restriction on possible values of , but we will assume that , where , as previously, is an arbitrary a priori chosen integer.
We start with formulating the result showing that there is no optimally adaptive estimator (here we follow the terminology introduced in , ). It means that there is no estimator which would be minimax simultaneously for several values of the parameter even if all other parameters and are supposed to be fixed. This result does not require any restriction on either.
The assertion of Theorem ? can be considerably specified if . To do that we will need the following definition. Let be a given family of normalizations.
Note that the result proved in Theorem ? means that the family of rates of convergence is not admissible. Denote by the following family of normalizations:
We remark that and for any .
Several remarks are in order.
We note that if the family of normalizations is admissible, i.e. one can construct a -attainable estimator, then is an optimal family of normalizations in view of the Klutchnikoff criterion . It follows from the second assertion of the theorem. We note however that a -attainable estimator may depend on and , and, therefore, this estimator has only theoretical interest. In the next section we construct a -adaptive estimator, which is, by its definition, fully parameter-free. Moreover, this estimator obviously proves that is admissible, and, therefore, optimal as was mentioned above.
The assertions of Theorem ? allow us to give a rather simple interpretation of the Klutchnikoff criterion. Indeed, the first assertion, which is easily deduced from Theorem ?, shows that any admissible family of normalizations can be improved by another admissible family at any given point except maybe one. In particular, this concerns the family if it is admissible. On the other hand, the second assertion of the theorem shows that there is no admissible family which would outperform the family at two points. Moreover, in view of , a -adaptive (attainable) estimator, if it exists, has the same precision on , , as any -adaptive (attainable) estimator whenever satisfies ( ?). Additionally, implies that the gain in precision provided by a -adaptive (attainable) estimator on leads automatically to much larger losses on for any with respect to the precision provided by a -adaptive (attainable) estimator. We conclude that a -adaptive (attainable) estimator outperforms any -adaptive (attainable) estimator whenever satisfies ( ?). It remains to note that any admissible family not satisfying ( ?) is asymptotically equivalent to .
Construction of -adaptive estimator. As was already mentioned in the Introduction, the construction of our estimation procedure consists of several steps. First, we determine the set , built from the observation, which is then used to define the family of locally bayesian estimators. Next, based on Lepski’s method (see  and ), we propose a data-driven selection from this family.
Put and let be the solution of the following minimization problem.
where the -dimensional vector and the sign below means the transposition. Thus, is the local least squares estimator and its explicit expression is given by
where and is the design matrix. Put
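A local least squares fit of this kind can be sketched as follows. This is an illustration under assumptions that are not confirmed by the text: dimension one, noise uniform on [0, 1] (so that 2 Y_i has expectation f(X_i) and is used as the response), rescaled monomials ((x - x0)/h)^p as regressors, and a Gram matrix normalized by the number of points in the window. The name `local_least_squares` is hypothetical.

```python
import numpy as np

def local_least_squares(x, y, x0, h, degree):
    """Local polynomial least squares fit around x0 with window width h."""
    inside = np.abs(x - x0) <= h / 2
    z = (x[inside] - x0) / h                          # rescaled design points
    K = np.vander(z, degree + 1, increasing=True)     # rows (1, z, ..., z^degree)
    # Assumed response 2 * Y_i, unbiased for f(X_i) under uniform [0, 1] noise.
    coef, *_ = np.linalg.lstsq(K, 2.0 * y[inside], rcond=None)
    # Smallest eigenvalue of the normalized Gram matrix of the design.
    lam_min = float(np.linalg.eigvalsh(K.T @ K / inside.sum()).min())
    return coef, lam_min

rng = np.random.default_rng(6)
n = 5000
x = np.arange(1, n + 1) / n
y = (2.0 + x) * rng.uniform(size=n)       # hypothetical linear f(x) = 2 + x
coef, lam_min = local_least_squares(x, y, x0=0.5, h=0.4, degree=1)
print(coef, lam_min)
```

The first coefficient estimates the function at the point of interest. A strictly positive smallest eigenvalue indicates that the local design is non-degenerate, so the fit is uniquely determined.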
Introduce the following quantities
and define the random parameter set as follows.
The family of locally bayesian estimators is now defined as follows.
where is the smallest integer such that . Set
We put , where is selected from in accordance with the rule:
Here we have used the following notations.
and is the smallest eigenvalue of the matrix
which is completely determined by the design points and by the number of observations. We will prove that there exists a nonnegative real , such that for any and any (see Lemma ?).
Here , and
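The selection step follows the general logic of Lepski's method, which can be sketched generically. The code below is not the exact rule of this section: the thresholds used in the paper are elided above, so `thresholds` and the constant `c` are hypothetical inputs. Estimators are indexed by increasing bandwidth, and the rule keeps the largest bandwidth whose estimator agrees, up to the thresholds, with all estimators built on smaller bandwidths.

```python
import numpy as np

def lepski_select(estimates, thresholds, c=1.0):
    """Index of the largest bandwidth accepted by a Lepski-type comparison.

    estimates[k]  : estimator computed with the k-th bandwidth (increasing in k)
    thresholds[k] : bound on the stochastic error of estimates[k]
    """
    estimates = np.asarray(estimates, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    admissible = [
        k for k in range(len(estimates))
        if np.all(np.abs(estimates[k] - estimates[:k]) <= c * thresholds[:k])
    ]
    return max(admissible)    # k = 0 is always admissible

# Toy values: the last estimator is contaminated by a large bias.
k_hat = lepski_select([2.00, 2.01, 2.02, 2.40], [0.10, 0.05, 0.02, 0.01])
print(k_hat)
```

The rule never compares an estimator with the unknown function, only with estimators of smaller bias, which is what makes the selection fully data-driven.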
To construct the family of estimators we use the linear approximation (), i.e. within the neighbourhoods of the given size , the locally bayesian estimator has the form
We define the ideal (oracle) value of the parameter as the minimizer of the risk:
To compute it we apply Monte-Carlo simulations (10000 repetitions). Our first objective is to compare the risk provided by the “oracle” estimator with that provided by the adaptive estimator from Section 3. Figure 2 shows the deviation of the adaptive estimator from the function to be estimated. At several points, for example at , we observe the so-called over-smoothing phenomenon, inherent to any adaptive estimator.
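The Monte-Carlo computation of an oracle bandwidth can be sketched as follows. For brevity the sketch uses a locally constant bayesian estimator rather than the linear approximation used in this section, and it assumes noise uniform on [0, 1], the absolute loss, and hypothetical bounds A and M. The test function, the bandwidth grid and the number of repetitions are also illustrative.

```python
import numpy as np

def bayes_const(yw, A, M):
    """Median of the normalized pseudo likelihood for a locally constant fit."""
    m = yw.size
    a = max(A, yw.max()) if m else A
    if a >= M:
        return M
    if m == 0:
        return 0.5 * (A + M)
    if m == 1:
        return float(np.sqrt(a * M))
    k = m - 1
    return a * (0.5 * (1.0 + (a / M) ** k)) ** (-1.0 / k)

def oracle_bandwidth(f, x0, n, grid, reps, rng, A, M):
    """Bandwidth of the grid minimizing the Monte-Carlo risk at x0 (f known)."""
    x = np.arange(1, n + 1) / n
    risks = np.zeros(len(grid))
    for _ in range(reps):
        y = f(x) * rng.uniform(size=n)                 # fresh sample
        for j, h in enumerate(grid):
            est = bayes_const(y[np.abs(x - x0) <= h / 2], A, M)
            risks[j] += abs(est - f(x0)) / reps        # average absolute error
    return grid[int(np.argmin(risks))], risks

f = lambda t: 2.0 + 4.0 * np.abs(t - 0.5)              # hypothetical test function
grid = [0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
h_star, risks = oracle_bandwidth(f, 0.5, 500, grid, 200,
                                 np.random.default_rng(5), A=0.5, M=10.0)
print(h_star, risks)
```

Small bandwidths leave too few observations in the window and large ones pick up values of the function far from the point of interest, so the estimated risk is minimized at an intermediate bandwidth.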
Oracle-adaptive ratio. We compute the risks of the oracle and the adaptive estimator at 100 points of the interval . The following table presents the mean value of the ratio oracle risk/adaptive risk calculated for the functions and .
|function||adaptive risk||oracle-adaptive ratio||adaptive risk||oracle-adaptive ratio|
Figure 3 presents the “oracle risk/adaptive risk” ratio as the function of the number of observations .
Adaptation versus parametric estimation. We consider the function (Figure 4), which is linear inside the neighborhood of size around point and simulate observations in accordance with the model (Equation 1). Using only the observations corresponding to the interval we construct the bayesian estimator .
It is important to emphasize that this estimator is efficient  since the model is parametric. Our objective now is to compare the risk of our adaptive estimator with the risk provided by the estimator . We also try to understand how far the localization parameter , inherent to the construction of our adaptive estimator, is from the true value . We compute the risk of each estimator via the Monte-Carlo method with repetitions. For each repetition the procedure selects the adaptive bandwidth . We confirm once again the over-smoothing phenomenon since
Note however that the adaptive procedure selects the neighborhood of the size which is quite close to the true one. We also compute the risks of both estimators: “bayesian risk” = 0.0206 and “adaptive risk” = 0.0308. We conclude that the estimation accuracy provided by our adaptive procedure is quite satisfactory.
5 Proofs of main results: upper bounds
Let be the following subinterval of .
Later on we will consider only the values of belonging to . We start with establishing the exponential inequality for the deviation of the locally bayesian estimator from . This inequality is the basic technical tool allowing us to prove the minimax and minimax adaptive results.
Introduce the following notations. For any , put , where and
Recall the convention which we follow in the present paper: if the function and vector are such that does not exist we put .
Let , given by (Equation 4), be the local polynomial approximation of inside and let be the corresponding approximation error, i.e.
If , one could remark that by definition of in (Equation 18) and in Definition ?. Put also
Recall that (see Section 3) is the smallest eigenvalue of the matrix
and is the -dimensional vector of the monomials .
The next proposition provides us with an upper bound for the risk of a locally bayesian estimator.
5.2 Proof of Proposition
Before starting the proof, let us briefly discuss its ingredients.
Discussion. I. First, the obvious inclusion (recall that minimizes defined in (Equation 13))
allows us to reduce the study of the deviation of from to the study of the behaviour of .
II. We note that is the integral functional of the pseudo-likelihood . As a consequence, the behaviour of is completely determined by this process. Following  (Chapter 1, Section 5, Theorem 5.2), where similar problems were studied under the parametric model assumption, we introduce the stochastic process
Here, the vector is defined as follows.
where is the vector of coefficients of the Taylor polynomial defined in (Equation 18). The definition of implies obviously
As was noted in  (Chapter 1, Section 5, Theorem 5.2), the following properties of the process are essential for the study of :
Hölder continuity of its trajectories;
the rate of its decay at infinity.
The exact statements are formulated in Lemma ? below.
III. As was shown in  (Chapter 1, Section 5, Theorem 5.2), in the parametric situation the above-mentioned properties yield the desired properties of the process
where the set is defined in (Equation 12). The exact statements are given in Assertions 1 and 2. The latter process is important in view of the following inclusion
Auxiliary Lemma. First, we note that in view of (Equation 21), the event is always realized, because . Hence, can be rewritten
Proof of Proposition . Define for any and
where , is defined such that: for any , (for more details, see Lemma ?) and
. Suppose that Assertions 1 and 2 are proved. Then, in view of Assertion ?, choosing we get
Using the Tchebychev inequality, we have in view of the last inequality
The assertion of Proposition ? follows now from the last inequality, Assertion ? and the definitions of and the function .
. Now, we will prove Assertion ?. The definition of and implies
Some remarks are in order. First, it is easily seen that . Therefore, if the event holds then Recall also that minimizes defined in (Equation 13) and, therefore, the following inclusion holds since