# On Bayesian “Testimation” and Its Application to Wavelet Thresholding

## Abstract

We consider the problem of estimating the unknown response function in the Gaussian white noise model. We first utilize the recently developed Bayesian maximum a posteriori “testimation” procedure of Abramovich et al. (2007) for recovering an unknown high-dimensional Gaussian mean vector. The existing results for its upper error bounds over various sparse $\ell_p$-balls are extended to more general cases. We show that, for a properly chosen prior on the number of non-zero entries of the mean vector, the corresponding adaptive estimator is asymptotically minimax in a wide range of sparse and dense $\ell_p$-balls.

The proposed procedure is then applied in a wavelet context to derive adaptive global and level-wise wavelet estimators of the unknown response function in the Gaussian white noise model. These estimators are then proven to be, respectively, asymptotically near-minimax and minimax in a wide range of Besov balls. These results are also extended to the estimation of derivatives of the response function.

Simulated examples are conducted to illustrate the performance of the proposed level-wise wavelet estimator in finite sample situations, and to compare it with several existing counterparts.

## 1 Introduction

We consider the problem of estimating the unknown response function $f$ in the Gaussian white noise model, where one observes the Gaussian process $Y(t)$ governed by

$$dY(t) = f(t)\,dt + \epsilon\,dW(t), \qquad t \in [0,1]. \tag{1}$$

The noise parameter $\epsilon$ is assumed to be known, $W$ is a standard Wiener process, and $f$ is the unknown response function. Under some smoothness constraints on $f$, such a model is asymptotically equivalent in the Le Cam sense to the standard nonparametric regression setting (Brown & Low, 1996).

For consistent estimation, it is well known that $f$ should possess some smoothness properties. We assume that $f$ belongs to a Besov ball $B^s_{p,q}(M)$ of a radius $M > 0$, where $1 \le p, q \le \infty$ and $s > \max(0, 1/p - 1/2)$. The latter restriction ensures that the corresponding Besov spaces are embedded in $L^2[0,1]$. The parameter $s$ measures the degree of smoothness, while $p$ and $q$ specify the type of norm used to measure the smoothness. Besov classes contain various traditional smoothness spaces, such as Hölder and Sobolev spaces, as special cases. However, they also include different types of spatially inhomogeneous functions (Meyer, 1992).

The fact that wavelet series constitute unconditional bases for Besov spaces has led to the wide use of wavelet-based procedures for estimating the unknown response in the Gaussian white noise model (Equation 1). The standard wavelet approach to the estimation of $f$ is based on finding the empirical wavelet coefficients of the data and denoising them, usually by some type of thresholding rule. Transforming them back to the function space then yields the resulting estimate. The main statistical challenge in such an approach is a proper choice of the thresholding rule. A variety of wavelet thresholds stemming from different ideas has been proposed in the literature during the last decade, e.g., the universal threshold (Donoho & Johnstone, 1994a), Stein’s unbiased risk estimation threshold (Donoho & Johnstone, 1995), the false discovery rate threshold (Abramovich & Benjamini, 1996), the cross-validation threshold (Nason, 1996), the Bayes threshold (Abramovich et al., 1998) and the empirical Bayes threshold (Johnstone & Silverman, 2005).

Abramovich & Benjamini (1996) demonstrated that thresholding can be viewed as a multiple hypothesis testing procedure, where one first simultaneously tests the wavelet coefficients of the unknown response function for significance. The coefficients concluded to be significant are then estimated by the corresponding empirical wavelet coefficients of the data, while the non-significant ones are discarded. Such a “testimation” procedure evidently mimics a hard thresholding rule. Various choices for adjustment to multiplicity at the testing step lead to different thresholds. In particular, the universal threshold of Donoho & Johnstone (1994a) and the false discovery rate threshold of Abramovich & Benjamini (1996) fall within such a framework, corresponding to Bonferroni and false discovery rate multiplicity corrections, respectively.

In this paper, we proceed along the lines of “testimation” approach, where we utilize the recently developed maximum a posteriori Bayesian multiple testing procedure of Abramovich & Angelini (2006). Their hierarchical prior model is based on imposing a prior distribution on the number of false null hypotheses. Abramovich et al. (2007) applied this approach to estimating a high-dimensional Gaussian mean vector and showed its minimax optimality where the unknown mean vector was assumed to be sparse.

We first extend the results of Abramovich et al. (2007) to more general settings. Consider the problem of estimating an unknown high-dimensional Gaussian mean vector $\mu = (\mu_1, \dots, \mu_n)$, where one observes $y_1, \dots, y_n$ governed by

$$y_i = \mu_i + \sigma z_i, \qquad i = 1, \dots, n. \tag{2}$$

The variance $\sigma^2$, which may depend on $n$, is assumed to be known, the $z_i$ are independent $N(0,1)$ random variables, and the unknown mean vector $\mu$ is assumed to lie in a strong $\ell_p$-ball of normalized radius $\eta_n$. Abramovich et al. (2007) considered the Gaussian sequence model (Equation 2) with $\sigma = 1$ and derived upper error bounds for the quadratic risk of an adaptive Bayesian maximum a posteriori estimator of $\mu$ in the sparse case, where $\eta_n \to 0$ as $n \to \infty$. We extend their results to all combinations of $p$ and $\eta_n$, and to a variance $\sigma^2$ in (Equation 2) that may depend on $n$. We show, in particular, that for a properly chosen prior distribution on the number of non-zero entries of $\mu$, the corresponding estimator is, up to a constant factor, asymptotically minimax for almost all $\ell_p$-balls, including both sparse and dense cases.

We then apply the proposed approach to wavelet thresholding estimation in the Gaussian white noise model (Equation 1). We show that, under mild conditions on the prior distribution on the number of non-zero wavelet coefficients, the resulting global wavelet estimator of the response function attains, up to a logarithmic factor, the minimax convergence rates simultaneously over the entire range of Besov balls. Furthermore, we demonstrate that estimating the wavelet coefficients at each resolution level separately allows one to remove the extra logarithmic factor. Moreover, the procedure can be extended to the estimation of derivatives of the response function. These results, in some sense, complement the adaptively minimax empirical Bayes estimators of Johnstone & Silverman (2005).

## 2 Estimation in the Gaussian sequence model

### 2.1 Bayesian maximum a posteriori estimation procedure

We start with reviewing the Bayesian maximum a posteriori estimation procedure for the Gaussian sequence model (Equation 2) developed by Abramovich et al. (2007).

For this model, consider the multiple hypothesis testing problem, where we wish to simultaneously test

$$H_{0i}: \mu_i = 0 \quad \text{versus} \quad H_{1i}: \mu_i \neq 0, \qquad i = 1, \dots, n.$$

A configuration of true and false null hypotheses is uniquely defined by the indicator vector $x = (x_1, \dots, x_n)$, where $x_i = I(\mu_i \neq 0)$ and $I(A)$ denotes the indicator function of the set $A$. Let $k$ be the number of non-zero entries of $x$, i.e., $k = \sum_{i=1}^n x_i$. Assume some prior distribution $\pi_n(k)$ on $k$ with $\pi_n(k) > 0$, $k = 0, \dots, n$. For a given $k$, all the corresponding $\binom{n}{k}$ different vectors $x$ are assumed to be equally likely a priori, that is, conditionally on $k$,

$$P\left(x \,\middle|\, \textstyle\sum_{i=1}^n x_i = k\right) = \binom{n}{k}^{-1}.$$

Naturally, $\mu_i \mid x_i \sim (1 - x_i)\,\delta_0 + x_i\,N(0, \tau^2)$, where $\delta_0$ is a probability atom at zero. To complete the prior specification, we assume that, conditionally on $x$, the $\mu_i$ are independent.

For the proposed hierarchical prior, the posterior probability of a given vector $x$ with $k$ non-zero entries is

$$P(x \mid y) \propto \frac{\pi_n(k)}{\binom{n}{k}} \prod_{i:\, x_i = 1} B_i^{-1}, \tag{3}$$

where the Bayes factor $B_i$ of $H_{0i}$ is

$$B_i = \sqrt{1 + \gamma}\, \exp\left\{-\frac{\gamma\, y_i^2}{2\sigma^2(1 + \gamma)}\right\}, \tag{4}$$

and $\gamma = \tau^2/\sigma^2$ is the variance ratio (Abramovich & Angelini, 2006).

Given the posterior distribution $P(x \mid y)$, we apply the maximum a posteriori rule to choose the most likely indicator vector. Generally, to find the posterior mode, one should look through all $2^n$ possible sequences of zeroes and ones. However, for the proposed model, the number of candidates for the mode is, in fact, reduced to $n + 1$ only. Indeed, let $\hat x(k)$ be a maximizer of (Equation 3) for a fixed $k$; it indicates the most plausible vector with $k$ non-zero entries. From (Equation 3), it follows immediately that $\hat x(k)$ has ones at the entries corresponding to the $k$ smallest Bayes factors and zeroes otherwise. Due to the monotonicity of $B_i$ in $|y_i|$ in (Equation 4), this is equivalent to placing the ones at the $k$ largest $|y_i|$. The proposed Bayesian multiple testing procedure then leads to finding $\hat k$ that maximizes

for some constant or, equivalently, minimizes

where $|y|_{(1)} \ge \dots \ge |y|_{(n)}$ denote the descending order statistics of $|y_1|, \dots, |y_n|$. The null hypotheses corresponding to the $\hat k$ largest $|y_i|$ are rejected. The resulting Bayesian estimation yields a hard thresholding rule with threshold $|y|_{(\hat k)}$, i.e.,

$$\hat\mu_i = y_i\, I\left(|y_i| \ge |y|_{(\hat k)}\right), \qquad i = 1, \dots, n. \tag{5}$$

If $\hat k = 0$, all the $y_i$, $i = 1, \dots, n$, are thresholded and $\hat\mu = 0$.
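To make the search over the $n+1$ candidate indicator vectors concrete, the following sketch implements the MAP testimation rule. It is a minimal illustration under the assumption that the non-zero means receive the conjugate $N(0, \gamma\sigma^2)$ prior, so that the Bayes factors take the standard Gaussian form; the function names and the prior passed in are ours, not the authors’.

```python
import math

import numpy as np


def log_binom(n, k):
    # log of the binomial coefficient C(n, k) via log-gamma
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def map_testimation(y, sigma, gamma, log_prior):
    """MAP 'testimation' sketch: choose the number k of non-zero means
    maximizing the log-posterior over the n+1 candidate indicator vectors,
    then keep the k largest observations in absolute value (hard thresholding)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # log(1/B_i): evidence against H_0i under the assumed conjugate
    # N(0, gamma * sigma^2) prior on non-zero means
    log_inv_bf = -0.5 * np.log1p(gamma) + gamma * y**2 / (2 * sigma**2 * (1 + gamma))
    order = np.argsort(-np.abs(y))              # indices, largest |y_i| first
    csum = np.concatenate(([0.0], np.cumsum(log_inv_bf[order])))
    # objective(k) = log pi_n(k) - log C(n,k) + sum of k largest log(1/B_i)
    obj = [log_prior(k) - log_binom(n, k) + csum[k] for k in range(n + 1)]
    k_hat = int(np.argmax(obj))
    mu_hat = np.zeros(n)
    mu_hat[order[:k_hat]] = y[order[:k_hat]]    # hard thresholding
    return mu_hat, k_hat
```

For a vector with two large entries against a background of small ones, the rule keeps exactly the two large observations and zeroes out the rest.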

From a frequentist view, the above estimator in (Equation 5) is evidently a penalized likelihood estimator with the complexity penalty

In this sense, it can also be considered within the framework of Birgé & Massart (2001). We will discuss these relations in more detail in the following section.

### 2.2 Upper error bounds

Abramovich et al. (2007, Theorem 6) obtained upper error bounds for the $\ell_2$-risk of (Equation 5) in the Gaussian sequence model (Equation 2) over sparse $\ell_p$-balls, where the normalized radius $\eta_n \to 0$ as $n \to \infty$. We now extend these results to more general settings.

Fix a prior distribution $\pi_n$ on the number of non-zero entries of $\mu$, and let $\gamma = \tau^2/\sigma^2$ be the variance ratio.

The proof of Proposition ? is given in the Appendix. As in Abramovich et al. (2007), analogous results can be obtained for other types of balls, e.g., weak $\ell_p$-balls and $\ell_0$-balls, with the necessary changes in the proofs (Petsa, 2009, Chapter 3).

Since the prior assumptions in Proposition ? do not depend on the parameters $p$ and $\eta_n$ of the $\ell_p$-ball, the estimator (Equation 5) is inherently adaptive. The condition on $\pi_n$ guarantees that its risk is always bounded by an order of $n\sigma^2$, corresponding to the risk of the maximum likelihood estimator $\hat\mu = y$ in the Gaussian sequence model (Equation 2).

The following corollary of Proposition ? essentially defines dense and sparse zones, and, for small normalized radii, a super-sparse zone, of different behavior of the quadratic risk of the proposed estimator (Equation 5). To evaluate its accuracy, we also compare the resulting risks with the corresponding minimax risks that can be found, e.g., in Donoho & Johnstone (1994b). In what follows, all asymptotic relations are understood as $n \to \infty$.

One can easily verify that all three conditions of Corollary ? are satisfied, for example, by the truncated geometric prior $\pi_n(k) \propto q^k$, $k = 0, \dots, n$, where $0 < q < 1$. On the other hand, no binomial prior can “kill three birds with one stone”: the conditions corresponding to the different zones place conflicting requirements on the binomial success probability.
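As a small illustration, the truncated geometric prior can be computed directly; the proportionality $\pi_n(k) \propto q^k$, $k = 0, \dots, n$, is our assumed reading of the prior’s form, with the normalizing constant summing the finite geometric series.

```python
import numpy as np


def truncated_geometric(n, q):
    """Truncated geometric prior on k = 0..n: pi_n(k) proportional to q^k
    (assumed form), normalized to sum to one over the finite support."""
    k = np.arange(n + 1)
    weights = q ** k
    return weights / weights.sum()


prior = truncated_geometric(10, 0.5)
```

The mass decays geometrically, so the prior favors sparse configurations while still placing positive probability on every $k$.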

The impact of Corollary ? is that, up to a constant multiplier, the proposed estimator (Equation 5) is adaptively minimax for almost all $\ell_p$-balls, except those with very small normalized radii. Hence, while the optimality of most of the existing threshold estimators, e.g., the universal, Stein’s unbiased risk and false discovery rate thresholds, has been established only over various sparse settings, the Bayesian estimator (Equation 5) is appropriate for both sparse and dense cases. To the best of our knowledge, such a wide adaptivity range can be compared only with the penalized likelihood estimators of Birgé & Massart (2001) and the empirical Bayes threshold estimators of Johnstone & Silverman (2004b, 2005); see Petsa (2009, Chapter 3) for more details.

In fact, as we have mentioned, there are interesting asymptotic relationships between the Bayesian estimator (Equation 5) and the penalized likelihood estimator of Birgé & Massart (2001) that may explain their similar behavior. For estimating the normal mean vector in (Equation 2) within $\ell_p$-balls, Birgé & Massart (2001) considered a penalized likelihood estimator with the specific complexity penalty

where for fixed and (Birgé & Massart, 2001, Section 6.3). For large and , this penalty is approximately of the following form:

for some positive constants; see also Lemma ? in the Appendix. Thus, within this range, the penalty in (Equation 7)-(Equation 8) behaves similarly to a particular case of the penalty in (Equation 6) corresponding to a geometric-type prior. This prior satisfies the second condition on the prior of Corollary ?. Such a Bayesian interpretation can also be helpful in providing some intuition for the penalty, which was motivated in Birgé & Massart (2001) mostly for technical reasons.

Furthermore, in sparse cases, under the conditions on the prior of Corollary ?, both penalties are of the same so-called $2k\log(n/k)$-type. Such penalties have appeared within different frameworks in a series of recent works on estimation and model selection (Foster & Stine, 1999; George & Foster, 2000; Birgé & Massart, 2001; Abramovich et al., 2006; Abramovich et al., 2007).

## 3 Bayesian maximum a posteriori wavelet estimation in the Gaussian white noise model

### 3.1 General algorithm

In this section we apply the results of Section 2 on estimation in the Gaussian sequence model (Equation 2) to wavelet estimation of the unknown response function in the Gaussian white noise model (Equation 1).

Given a compactly supported scaling function $\phi$ of regularity $r$ and the corresponding mother wavelet $\psi$, one can generate an orthonormal wavelet basis on the unit interval from a finite number of scaling functions $\phi_{j_0 k}$ at a primary resolution level $j_0$ and wavelets $\psi_{jk}$ at resolution levels $j \ge j_0$ and scales $k = 0, \dots, 2^j - 1$ (Cohen et al., 1993; Johnstone & Silverman, 2004a). For clarity of exposition, we use the same notation for interior and edge wavelets.

Then, $f$ is expanded in the orthonormal wavelet series on $[0,1]$ as

$$f(t) = \sum_{k=0}^{2^{j_0}-1} c_{j_0 k}\,\phi_{j_0 k}(t) + \sum_{j=j_0}^{\infty} \sum_{k=0}^{2^j-1} d_{jk}\,\psi_{jk}(t),$$

where $c_{j_0 k} = \int_0^1 f(t)\,\phi_{j_0 k}(t)\,dt$ and $d_{jk} = \int_0^1 f(t)\,\psi_{jk}(t)\,dt$. In the wavelet domain, the Gaussian white noise model (Equation 1) becomes

$$Y_{jk} = d_{jk} + \epsilon\, z_{jk},$$

where the empirical wavelet coefficients are given by $Y_{jk} = \int_0^1 \psi_{jk}(t)\,dY(t)$ and the $z_{jk}$ are independent $N(0,1)$ random variables.

Define $J = \log_2 \epsilon^{-2}$, assumed for simplicity to be an integer. Estimate the wavelet coefficients at different resolution levels by the following scheme:

estimate the scaling coefficients $c_{j_0 k}$ by the corresponding empirical coefficients $\hat c_{j_0 k}$;

apply the Bayesian estimation procedure of Abramovich et al. (2007) described in Section 2 to estimate the $d_{jk}$ at resolution levels $j_0 \le j \le J - 1$ by the corresponding $\hat d_{jk}$;

set $\hat d_{jk} = 0$ for $j \ge J$.

The resulting wavelet estimator of $f$ is then defined as

$$\hat f(t) = \sum_{k=0}^{2^{j_0}-1} \hat c_{j_0 k}\,\phi_{j_0 k}(t) + \sum_{j=j_0}^{J-1} \sum_{k=0}^{2^j-1} \hat d_{jk}\,\psi_{jk}(t). \tag{9}$$
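The three-step scheme above can be sketched as follows, using a hand-rolled orthonormal Haar transform to stay self-contained. A simple universal-style hard threshold stands in for the MAP testimation step at each level, so this illustrates the level-wise structure of the algorithm rather than the full Bayesian rule.

```python
import numpy as np


def haar_dwt(x):
    """Orthonormal Haar decomposition of x (length a power of two).
    Returns coarse scaling coefficients and detail levels, coarsest first."""
    details = []
    a = np.asarray(x, dtype=float)
    while len(a) > 1:
        details.append((a[0::2] - a[1::2]) / np.sqrt(2))
        a = (a[0::2] + a[1::2]) / np.sqrt(2)
    return a, details[::-1]


def haar_idwt(a, details):
    """Inverse of haar_dwt: rebuild the signal from coarse to fine."""
    for d in details:
        up = np.empty(2 * len(a))
        up[0::2] = (a + d) / np.sqrt(2)
        up[1::2] = (a - d) / np.sqrt(2)
        a = up
    return a


def levelwise_estimate(y, sigma):
    """Keep scaling coefficients; hard-threshold each detail level separately.
    The universal-style threshold here is a stand-in for the MAP step."""
    a, details = haar_dwt(y)
    denoised = []
    for d in details:
        thr = sigma * np.sqrt(2 * np.log(max(len(d), 2)))
        denoised.append(np.where(np.abs(d) > thr, d, 0.0))
    return haar_idwt(a, denoised)
```

The transform is orthonormal, so thresholding in the coefficient domain and inverting recovers the signal exactly when nothing is thresholded away.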

Theorem ? below shows that, under mild conditions on the prior, the resulting global wavelet estimator (Equation 9) of $f$, where the estimation procedure is applied to the entire set of wavelet coefficients at all resolution levels $j_0 \le j \le J - 1$, attains, up to a logarithmic factor, the minimax convergence rates over the whole range of Besov classes. Furthermore, Theorem ? demonstrates that performing the estimation procedure at each resolution level separately allows one to remove the extra logarithmic factor. Moreover, a level-wise version of (Equation 9) allows one to estimate the derivatives of $f$ at optimal convergence rates as well.

### 3.2 Global wavelet estimator

The number of wavelet coefficients at all resolution levels $j_0 \le j \le J - 1$ is $n = 2^J - 2^{j_0}$. Let $\pi_n(k)$, $k = 0, \dots, n$, be a prior distribution on the number of non-zero wavelet coefficients of $f$ at these levels, and let the prior variance of the non-zero coefficients at the $j$th resolution level be $\tau_j^2$; the corresponding level-wise variance ratios are $\gamma_j = \tau_j^2/\epsilon^2$.

It is well known (Donoho & Johnstone, 1998) that, as $\epsilon \to 0$, the minimax convergence rate for the $L^2$-risk of estimating the unknown response function $f$ in the model (Equation 1) over Besov balls $B^s_{p,q}(M)$, where $s > \max(0, 1/p - 1/2)$ and $1 \le p, q \le \infty$, is given by

The proof of Theorem ? is based on the relationship between the smoothness conditions on functions within Besov spaces and the conditions on their wavelet coefficients. Namely, if $f \in B^s_{p,q}(M)$, then the sequence of its wavelet coefficients belongs to a weak $\ell_r$-ball, $r = 2/(2s+1)$, of a radius $cM$, where the constant $c$ depends only on the chosen wavelet basis (Donoho, 1993, Lemma 2). One can then apply the corresponding results of Abramovich et al. (2007) for estimation over weak $\ell_r$-balls. Details of the proof of Theorem ? are given in the Appendix.

The resulting global wavelet estimator does not rely on knowledge of the parameters $s$, $p$, $q$ and $M$ of a specific Besov ball and is, therefore, inherently adaptive. Theorem ? establishes the upper bound for its $L^2$-risk and shows that the resulting adaptive global wavelet estimator is asymptotically near-optimal within the entire range of Besov balls. In fact, the additional logarithmic factor in ( ?) is the unavoidable minimal price for adaptivity for any global wavelet threshold estimator (Donoho et al., 1995; Cai, 1999), and in this sense the upper bound for the convergence rates in ( ?) is sharp. To remove this logarithmic factor, one should consider level-wise thresholding.

### 3.3 Level-wise wavelet estimator

Consider now the level-wise version of the wavelet estimator (Equation 9), where the estimation procedure is applied separately at each resolution level. The number of wavelet coefficients at the $j$th resolution level is $2^j$. Let $\pi_j(k)$, $k = 0, \dots, 2^j$, be the prior distribution on the number of non-zero wavelet coefficients at the $j$th level, and let $\tau_j^2$ be their level-wise prior variance; the corresponding level-wise variance ratios are $\gamma_j = \tau_j^2/\epsilon^2$.

For $f \in B^s_{p,q}(M)$, the sequence of its wavelet coefficients at the $j$th resolution level belongs to a strong $\ell_p$-ball whose normalized radius depends on $j$ (Meyer, 1992, Section 6.10). The conditions on the priors $\pi_j$ in Theorem ? ensure that all four statements of Proposition ? simultaneously hold at all resolution levels, and one can exploit any of them at each level. This is necessary for the adaptivity of the resulting level-wise wavelet estimator (Equation 9).

As mentioned in Section 2.2, all three conditions of Theorem ? hold, for example, for the truncated geometric priors $\pi_j(k) \propto q_j^k$, where the $q_j$ are bounded away from zero and one.

It turns out that requiring a slightly more stringent condition on the priors $\pi_j$ allows one also to estimate the derivatives of $f$ by the corresponding derivatives of its level-wise wavelet estimator at the optimal convergence rates. Such plug-in estimation of derivatives is, in fact, along the lines of the vaguelette-wavelet decomposition approach of Abramovich & Silverman (1998).

Recall that, as $\epsilon \to 0$, the minimax convergence rate for the $L^2$-risk of estimating the $m$th derivative $f^{(m)}$ of the unknown response function in the model (Equation 1) over Besov balls $B^s_{p,q}(M)$, where $s > m + \max(0, 1/p - 1/2)$ and $1 \le p, q \le \infty$, is given by

(Donoho et al., 1997; Johnstone & Silverman, 2005).

The following Theorem ? is a generalization of Theorem ? for simultaneous level-wise wavelet estimation of a function and its derivatives.

Theorem ? is evidently a particular case of Theorem ? corresponding to $m = 0$. Theorem ? shows that the same proposed adaptive level-wise wavelet estimator (Equation 9) is simultaneously optimal for estimating a function and an entire range of its derivatives. This range is the same as that for the empirical Bayes shrinkage and threshold estimators appearing in Theorem 1 of Johnstone & Silverman (2005). The proof of Theorem ? is given in the Appendix.

## 4 Numerical Study

### 4.1 Preamble

In this section, we present a simulation study to illustrate the performance of the developed level-wise wavelet estimator (Equation 9) and compare it with three empirical Bayes wavelet estimators: the posterior mean and the posterior median of Johnstone & Silverman (2005), and the Bayes Factor of Pensky & Sapatinas (2007); and two other estimators: the block wavelet estimator NeighBlock of Cai & Silverman (2001) and the complex-valued wavelet hard thresholding estimator of Barber & Nason (2004). All the above Bayesian estimators and the block wavelet estimator are asymptotically minimax in a wide range of Besov balls. Although no such theoretical results have been established so far for the complex-valued wavelet estimator, it has performed well in simulations (Barber & Nason, 2004).

In practice, one typically deals with discrete data of a sample size $n$, and the sampled-data analog of the Gaussian white noise model (Equation 1) is the standard nonparametric regression model

$$y_i = f(t_i) + \sigma z_i, \qquad t_i = i/n, \quad i = 1, \dots, n,$$

where the $z_i$ are independent $N(0,1)$ random variables. The corresponding global and level-wise Bayesian maximum a posteriori wavelet estimation procedures then use the empirical wavelet coefficients obtained by the discrete wavelet transform of the data. However, utilizing the machinery of Johnstone & Silverman (2004a, 2005) for the development of appropriate boundary-corrected wavelet bases, one can show that discretization does not affect the order of magnitude of the accuracy of the resulting wavelet estimates (Johnstone & Silverman, 2004a, 2005; Petsa, 2009, Chapter 3).

The computational algorithms were performed using the WaveLab and EbayesThresh software. The entire study was carried out using the Matlab programming environment.

### 4.2 Estimation of parameters

To apply the proposed level-wise wavelet estimator (Equation 9), one should specify the priors $\pi_j$, the noise variance $\sigma^2$ and the prior variances $\tau_j^2$ or, equivalently, the variance ratios $\gamma_j$. We used the truncated geometric priors discussed in Section 3.3. Since the resulting parameters $q_j$ and $\gamma_j$ are rarely known a priori in practice, they should be estimated from the data in the spirit of empirical Bayes.

The unknown $\sigma$ was robustly estimated by the median of the absolute values of the empirical wavelet coefficients at the finest resolution level, divided by 0.6745, as suggested by Donoho & Johnstone (1994a) and usually applied in practice. For a given $\hat\sigma$, we then estimate $q_j$ and $\gamma_j$ by the conditional likelihood approach of Clyde & George (1999).
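The robust noise estimate just described amounts to one line of code; the convention of dividing the median absolute coefficient by 0.6745 (the upper quartile of the standard normal) follows the text.

```python
import numpy as np


def mad_sigma(finest_level_coeffs):
    """Robust noise-level estimate of Donoho & Johnstone (1994a):
    median absolute empirical wavelet coefficient at the finest
    resolution level, divided by 0.6745."""
    return float(np.median(np.abs(finest_level_coeffs)) / 0.6745)
```

Since the finest-level coefficients of a smooth signal are essentially pure noise, the estimate is close to the true standard deviation and is insensitive to a few large signal coefficients.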

Consider the prior model described in Section 2.1. The corresponding marginal likelihood of the observed empirical wavelet coefficients at the $j$th resolution level is then given by a mixture over all indicator vectors. Instead of maximizing this marginal likelihood directly with respect to $q_j$ and $\gamma_j$, regard the indicator vector as a latent variable and consider the corresponding log-likelihood for the augmented data, i.e.,

where $C$ is a constant. The EM algorithm iteratively alternates between computing, at the E-step, the expectation of the log-likelihood in (Equation 10) with respect to the distribution of the latent indicator vector given the data, evaluated at the current parameter estimates, and then updating the parameters at the M-step by maximizing it with respect to $q_j$ and $\gamma_j$. However, for a general prior distribution, and for the truncated geometric prior in particular, the EM algorithm does not yield analytic expressions at the E-step. Instead, we apply the conditional likelihood estimation approach originated by George & Foster (2000) and adapted to the wavelet estimation context by Clyde & George (1999). The approach is based on evaluating the augmented log-likelihood (Equation 10) at the mode of the indicator vector at the E-step, rather than at its mean as in the original EM algorithm (Abramovich & Angelini, 2006).

For a fixed number $k$ of non-zero entries, it is evident from (Equation 10) that the most likely indicator vector has ones at the entries corresponding to the $k$ largest empirical coefficients in absolute value and zeroes otherwise. For the given $k$, maximizing (Equation 10) with respect to $\gamma_j$ after some algebra yields a closed-form estimate. To simplify maximization with respect to $q_j$, we approximate the truncated geometric distribution in (Equation 10) by its non-truncated counterpart. This approximation does not strongly affect the results, especially at sufficiently high resolution levels, and yields an analytic solution for $\hat q_j$. It is then straightforward to find the $\hat k$ that maximizes (Equation 10) together with the corresponding $\hat\gamma_j$ and $\hat q_j$. The conditional likelihood approach therefore results in rapidly computable, closed-form estimates for $q_j$ and $\gamma_j$.

### 4.3 Simulation study

We now present and discuss the results of the simulation study. For all three empirical Bayes wavelet estimators, we used the double-exponential prior, whose parameters were estimated level-by-level by marginal likelihood maximization, as described in Johnstone & Silverman (2005). The prior parameters of the proposed level-wise wavelet estimator (Equation 9) were estimated by the conditional likelihood maximization procedure described in Section 4.2 above. For the block wavelet estimator, the lengths of the blocks and the thresholds were selected as suggested by Cai & Silverman (2001). Finally, for all competing methods, $\sigma$ was estimated by the median of the absolute values of the empirical wavelet coefficients at the finest resolution level divided by 0.6745, as discussed in Section 4.2.

In the simulation study, we evaluated the above six wavelet estimators on a series of test functions. We present here the results for the now-standard Bumps, Blocks, Doppler and Heavisine functions of Donoho & Johnstone (1994a), and for the Wave (Marron et al., 1998; Antoniadis et al., 2001) and Peak (Angelini et al., 2003) functions defined, respectively, as

and

See Figure ? for Wave and Peak test functions.

For each test function, 100 samples were generated by adding independent Gaussian noise to 256, 512 and 1024 equally spaced points on [0,1]. The value of the root signal-to-noise ratio was taken to be 3, 5 and 7, corresponding respectively to high, moderate and low noise levels. The goodness-of-fit of an estimator of $f$ in a single replication was measured by its mean squared error.

For brevity, we report the results for a single sample size, using the compactly supported mother wavelet Coiflet 3 (Daubechies, 1992, p. 258) and, for the complex-valued wavelet estimator, the Lawton mother wavelet (Lawton, 1993). The primary resolution level was kept fixed throughout. Different choices of sample sizes and wavelet functions yielded results broadly similar in magnitude.

The sample distributions of the mean squared errors over replications for the different wavelet estimators in the conducted simulation study were typically asymmetrical and affected by outliers. Therefore, we preferred the sample medians of the mean squared errors to the means as a gauge of the estimators’ goodness-of-fit. Thus, for each wavelet estimator, test function and noise level, we calculated the sample median of the mean squared errors over all 100 replications. To quantify the comparison between the competing wavelet estimators over the various test functions and noise levels, for each model we found the best wavelet estimator among the six, i.e., the one achieving the minimum median mean squared error. We then evaluated the relative median mean squared error of each estimator, defined as the ratio between the minimum and the estimator’s median mean squared errors; see Table ?.
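The comparison criterion just described can be sketched directly; the estimator names and MSE values below are illustrative, not the study’s.

```python
import numpy as np


def relative_median_mse(mse_by_estimator):
    """mse_by_estimator: dict mapping estimator name -> per-replication MSEs.
    Returns each estimator's relative median MSE: the minimum median MSE
    across estimators divided by that estimator's own median MSE, so the
    best estimator scores exactly 1."""
    medians = {name: float(np.median(v)) for name, v in mse_by_estimator.items()}
    best = min(medians.values())
    return {name: best / m for name, m in medians.items()}


scores = relative_median_mse({
    "A": np.array([1.0, 2.0, 3.0]),   # median 2.0 (best)
    "B": np.array([2.0, 4.0, 6.0]),   # median 4.0
})
```

Using medians rather than means makes the criterion robust to the outlying replications noted in the text.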

As expected, Table ? shows that there is no uniformly best wavelet estimator. Each one has its own favorite and challenging cases, and its relative performance strongly depends on the specific test function. Thus, the complex-valued estimator indeed demonstrates excellent results for Donoho & Johnstone’s functions, as reported in Barber & Nason (2004), but is much less successful for Peak and Wave. The block estimator is the best for Peak and Doppler but the worst for Blocks and Bumps. The proposed Bayesian estimator (Equation 9) outperforms the others for Wave but is less efficient for Donoho & Johnstone’s (1994a) examples. Interestingly, the relative performance of the estimators is much less sensitive to the noise level: for each of the test functions, the corresponding best estimator is essentially the same at all noise levels.

The minimal relative median mean squared error of an estimator over all cases reflects its inefficiency at the most challenging combination of test function and noise level, and can be viewed as a natural measure of its robustness. In this sense, the posterior mean estimator is the most robust, although it is not the winner in any particular case.

| signal | RSNR | MAP | BF | Postmed | Postmean | Block | CW |
|---|---|---|---|---|---|---|---|
| Peak | 3 | 0.8697 | 0.1763 | 0.8279 | 0.6589 | 1 | 0.5795 |
| | 5 | 0.7772 | 0.1497 | 0.7864 | 0.6525 | 1 | 0.6234 |
| | 7 | 0.8033 | 0.186 | 0.8501 | 0.6958 | 1 | 0.6979 |
| Wave | 3 | 1 | 0.5614 | 0.9841 | 0.9103 | 0.4570 | 0.9189 |
| | 5 | 0.9841 | 0.4603 | 1 | 0.9165 | 0.6072 | 0.8265 |
| | 7 | 1 | 0.6241 | 0.9900 | 0.9303 | 0.7498 | 0.7793 |
| Bumps | 3 | 0.5968 | 0.6254 | 0.6814 | 0.7569 | 0.4769 | 1 |
| | 5 | 0.5221 | 0.5641 | 0.5893 | 0.6671 | 0.4788 | 1 |
| | 7 | 0.5132 | 0.5537 | 0.5707 | 0.6420 | 0.5202 | 1 |
| Blocks | 3 | 0.6595 | 0.6807 | 0.8815 | 0.9500 | 0.5606 | 1 |
| | 5 | 0.6875 | 0.727 | 0.8541 | 0.9065 | 0.4416 | 1 |
| | 7 | 0.6921 | 0.7134 | 0.7806 | 0.8535 | 0.4288 | 1 |
| Doppler | 3 | 0.7214 | 0.611 | 0.8277 | 0.8709 | 0.9878 | 1 |
| | 5 | 0.6962 | 0.6739 | 0.8116 | 0.8583 | 1 | 0.9119 |
| | 7 | 0.7655 | 0.7122 | 0.8236 | 0.883 | 1 | 0.9382 |
| HeaviSine | 3 | 0.7523 | 0.3566 | 0.9333 | 0.9154 | 0.8406 | 1 |
| | 5 | 0.6640 | 0.3764 | 0.8622 | 0.8427 | 0.5796 | 1 |
| | 7 | 0.6931 | 0.3505 | 0.8298 | 0.8424 | 0.5028 | 1 |

As a “by-product”, we also compared the different thresholding estimators in terms of sparsity, measured by the average percentage of non-zero wavelet coefficients remaining after thresholding; see Table ?. The posterior mean estimator was not included in this comparison since it is a non-linear shrinkage rather than a thresholding estimator. Similarly to the previous results on goodness-of-fit, the relative sparsity strongly depends on the test function. However, except for the Doppler example, the estimator (Equation 9) is consistently the sparsest among the Bayesian estimators.

| signal | RSNR | MAP | BF | Postmed | Block | CW |
|---|---|---|---|---|---|---|
| Peak | 3 | 7.57 | 14.87 | 8.92 | 1.64 | 1.74 |
| | 5 | 5.62 | 15.89 | 8.39 | 1.61 | 1.93 |
| | 7 | 4.26 | 12.93 | 7.81 | 1.63 | 2.05 |
| Wave | 3 | 11.39 | 18.81 | 12.79 | 5.21 | 5.52 |
| | 5 | 11.51 | 20.19 | 12.52 | 6.30 | 6.28 |
| | 7 | 10.38 | 20.84 | 13.07 | 6.30 | 7.01 |
| Bumps | 3 | 10.63 | 12.13 | 10.86 | 16.63 | 12.23 |
| | 5 | 11.17 | 12.60 | 12.45 | 21.13 | 14.52 |
| | 7 | 12.65 | 13.90 | 13.80 | 23.70 | 16.03 |
| Blocks | 3 | 17.10 | 15.32 | 11.62 | 12.27 | 8.39 |
| | 5 | 10.39 | 12.20 | 11.72 | 18.03 | 12.07 |
| | 7 | 11.12 | 12.87 | 12.63 | 22.47 | 14.20 |
| Doppler | 3 | 11.46 | 14.73 | 8.58 | 5.69 | 5.13 |
| | 5 | 7.24 | 9.23 | 6.52 | 6.63 | 6.60 |
| | 7 | 6.42 | 9.58 | 6.57 | 7.27 | 7.86 |
| HeaviSine | 3 | 6.35 | 19.27 | 10.75 | 1.98 | 2.17 |
| | 5 | 8.11 | 19.73 | 10.62 | 3.17 | 2.69 |
| | 7 | 10.87 | 18.52 | 12.55 | 4.00 | 3.39 |
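The sparsity measure used in the comparison above is simply the share of coefficients surviving thresholding, which can be computed as follows (function name ours):

```python
import numpy as np


def sparsity_percent(coeffs):
    """Percentage of wavelet coefficients left non-zero after thresholding."""
    coeffs = np.asarray(coeffs)
    return 100.0 * np.count_nonzero(coeffs) / coeffs.size
```

Lower values indicate a sparser, more aggressively thresholded representation.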

Beyond the theoretical justification, the presented numerical results show that the proposed estimator performs well in finite sample settings and can, therefore, be viewed as a contribution to the list of useful wavelet-based function estimation tools.

## Acknowledgments

Felix Abramovich and Vadim Grinshtein were supported by the Israel Science Foundation grant ISF-248/08. The authors would like to thank the Editor, the Associate Editor and the three anonymous referees for helpful comments on improvements to this paper. Valuable remarks of Bernice Oberman are gratefully acknowledged.

## 6 Appendix

Throughout the proofs, we use $C$ to denote a generic positive constant, not necessarily the same each time it is used, even within a single equation.

### 6.1 Proof of Proposition

We start the proof of the proposition with the following lemma that establishes the bounds for binomial coefficients:

This lemma generalizes Lemma A.1 of Abramovich et al. (2007), where an upper bound similar to that in ( ?) was obtained in a special case.

Proof of Lemma ?. The lower bound for the binomial coefficient in ( ?) is obvious and has been shown in Lemma A.1 of Abramovich et al. (2007). To prove the upper bound in ( ?), note that, using Stirling’s formula, one has

for all and .

Note that for all . In particular, for it implies and, therefore, that together with (Equation 11) completes the proof of ( ?).

The second statement ( ?) of the lemma is an immediate consequence of ( ?) for . This completes the proof of Lemma ?.
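The flavor of the lemma can be checked numerically against the standard inequalities $(n/k)^k \le \binom{n}{k} \le (en/k)^k$, which the bounds in ( ?) resemble; the exact constants in the lemma may differ, so this is only a sanity check of the familiar form.

```python
import math


def check_binom_bounds(n, k):
    """Verify the standard bounds (n/k)^k <= C(n,k) <= (e*n/k)^k
    for 1 <= k <= n (assumed, textbook form of the inequality)."""
    c = math.comb(n, k)
    return (n / k) ** k <= c <= (math.e * n / k) ** k
```

The bounds hold for every admissible pair, including the extreme cases $k = 1$ and $k = n$.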

We now return to the proof of Proposition ? and consider separately all four cases covered by the proposition. The proof will exploit the general results of Abramovich et al. (2007) on the upper error bounds for the $\ell_2$-risk of the estimator (Equation 5), adapting them to the case where the variance in the Gaussian sequence model may depend on $n$.

Case 1. Under the stated condition on the prior, the definition (Equation 5) of the estimator and (Equation 6) immediately imply the required bound. Thus,

Case 2. Applying Corollary 1 of Abramovich et al. (2007) yields

where the exact expressions for the factors involved are given in Theorem 2 of Birgé & Massart (2001); see the proof of Theorem 1 of Abramovich et al. (2007). In particular, under the assumptions of the proposition on the boundedness of the variance ratio, these factors are also bounded from above. In this case, the least favorable sequence that maximizes the risk over the ball has all its entries of equal magnitude, and one then has

Case 3. This is essentially a sparse case considered in Abramovich et al. (2007) and its proof is a direct consequence of their Theorem 6.

Case 4. The proof for this case is similar to that of Case 2, except that here the least favorable sequences that maximize the risk over the ball are permutations of a single spike. Repeating the arguments used in the proof of Case 2, under the requirements of the proposition on the boundedness of the variance ratio, we then get, as $n \to \infty$,

for all admissible $p$. This completes the proof of Proposition ?.

### 6.2 Proof of Theorem

Let $R_j$ be the contribution of the $j$th resolution level to the $L^2$-risk of the global wavelet estimator (Equation 9). Due to the Parseval relation, the total risk is the sum of the level-wise contributions. The scaling coefficients are not thresholded, and their contribution is therefore negligible as $\epsilon \to 0$. At very high resolution levels $j \ge J$, all wavelet coefficients are set to zero and, therefore, as $\epsilon \to 0$,

where (Johnstone & Silverman, 2005).

Consider now the levels $j_0 \le j \le J - 1$. The set of wavelet coefficients of a function $f \in B^s_{p,q}(M)$ lies within a weak $\ell_r$-ball, $r = 2/(2s+1)$, of a radius $cM$, where the constant $c$ depends only on the chosen wavelet basis (Donoho, 1993, Lemma 2). The corresponding normalized radius tends to zero as $\epsilon \to 0$.

Under the conditions of the theorem, one can then apply Theorem 6 of Abramovich et al. (2007) for to get

as . This completes the proof of Theorem ?.

### 6.3 Proof of Theorem

Let $R_j$ now be the $L^2$-risk of the level-wise version of the wavelet estimator (Equation 9) at the $j$th resolution level; a bound used below is due to Johnstone & Silverman (2005, Section 5.6).

For any $f \in B^s_{p,q}(M)$, the sequence of its wavelet coefficients at the $j$th resolution level belongs to a strong $\ell_p$-ball whose normalized radius depends on $j$ (Meyer, 1992, Section 6.10).

Define

For sufficiently large , . Note that for and for with obvious modifications for . Consider the following cases:

1. Scaling coefficients: . Similarly to the global wavelet estimator, for a fixed primary resolution level , as .

2. Coarse resolution levels: . Applying the first statement of Proposition ? for each level one has

as .

3. Middle and high resolution levels: . Consider separately the cases (a) and (b) .

(a) . Under the conditions of the theorem, the second statement of Proposition ? at the th resolution level yields

and, hence, as ,

(b) . Let be the largest integer for which . One can easily verify that .

By monotonicity arguments, this holds for all middle resolution levels. One can then apply the third statement of Proposition ? to get, after some algebra,

as .

At high resolution levels , , and the fourth statement of Proposition ? implies

Hence, for and , one has

where evidently as . From the definition of ,