On Bayesian “Testimation” and Its Application to Wavelet Thresholding
We consider the problem of estimating the unknown response function in the Gaussian white noise model. We first utilize the recently developed Bayesian maximum a posteriori “testimation” procedure of Abramovich et al. (2007) for recovering an unknown high-dimensional Gaussian mean vector. The existing results for its upper error bounds over various sparse -balls are extended to more general cases. We show that, for a properly chosen prior on the number of non-zero entries of the mean vector, the corresponding adaptive estimator is asymptotically minimax in a wide range of sparse and dense -balls.
The proposed procedure is then applied in a wavelet context to derive adaptive global and level-wise wavelet estimators of the unknown response function in the Gaussian white noise model. These estimators are then proven to be, respectively, asymptotically near-minimax and minimax in a wide range of Besov balls. These results are also extended to the estimation of derivatives of the response function.
Simulated examples are conducted to illustrate the performance of the proposed level-wise wavelet estimator in finite sample situations, and to compare it with several existing counterparts.
We consider the problem of estimating the unknown response function in the Gaussian white noise model, where one observes Gaussian processes governed by
The noise parameter is assumed to be known, is a standard Wiener process, and is the unknown response function. Under some smoothness constraints on , such a model is asymptotically equivalent in Le Cam sense to the standard nonparametric regression setting (Brown & Low, 1996).
In a consistent estimation theory, it is well-known that should possess some smoothness properties. We assume that belongs to a Besov ball of a radius , where and . The latter restriction ensures that the corresponding Besov spaces are embedded in . The parameter measures the degree of smoothness while and specify the type of norm used to measure the smoothness. Besov classes contain various traditional smoothness spaces such as Hölder and Sobolev spaces as special cases. However, they also include different types of spatially inhomogeneous functions (Meyer, 1992).
The fact that wavelet series constitute unconditional bases for Besov spaces has caused various wavelet-based estimation procedures to be widely used for estimating the unknown response in the Gaussian white noise model (1). The standard wavelet approach for the estimation of is based on finding the empirical wavelet coefficients of the data and denoising them, usually by some type of a thresholding rule. Transforming them back to the function space then yields the resulting estimate. The main statistical challenge in such an approach is a proper choice of a thresholding rule. A series of various wavelet thresholds originated by different ideas has been proposed in the literature during the last decade, e.g., the universal threshold (Donoho & Johnstone, 1994a), Stein’s unbiased risk estimation threshold (Donoho & Johnstone, 1995), the false discovery rate threshold (Abramovich & Benjamini, 1996), the cross-validation threshold (Nason, 1996), the Bayes threshold (Abramovich et al., 1998) and the empirical Bayes threshold (Johnstone & Silverman, 2005).
Abramovich & Benjamini (1996) demonstrated that thresholding can be viewed as a multiple hypothesis testing procedure, where one first simultaneously tests the wavelet coefficients of the unknown response function, for significance. The coefficients concluded to be significant are then estimated by the corresponding empirical wavelet coefficients of the data, while the non-significant ones are discarded. Such a “testimation” procedure evidently mimics a hard thresholding rule. Various choices for adjustment to multiplicity on the testing step lead to different thresholds. In particular, the universal threshold of Donoho & Johnstone (1994a) and the false discovery rate threshold of Abramovich & Benjamini (1996) fall within such a framework corresponding to Bonferroni and false discovery rate multiplicity corrections, respectively.
In this paper, we proceed along the lines of “testimation” approach, where we utilize the recently developed maximum a posteriori Bayesian multiple testing procedure of Abramovich & Angelini (2006). Their hierarchical prior model is based on imposing a prior distribution on the number of false null hypotheses. Abramovich et al. (2007) applied this approach to estimating a high-dimensional Gaussian mean vector and showed its minimax optimality where the unknown mean vector was assumed to be sparse.
We first extend the results of Abramovich et al. (2007) to more general settings. Consider the problem of estimating an unknown high-dimensional Gaussian mean vector, where one observes governed by
The variance , that may depend on , is assumed to be known, are independent random variables, and the unknown mean vector is assumed to lie in a strong -ball , , of a normalized radius , that is, , where . Abramovich et al. (2007) considered the Gaussian sequence model (2) with and derived upper error bounds for the quadratic risk of an adaptive Bayesian maximum a posteriori estimator of in the sparse case, where and as . We extend their results for all combinations of and and for the variance in (2) that may depend on . We show, in particular, that for a properly chosen prior distribution on the number of non-zero entries of , the corresponding estimator, up to a constant factor, is asymptotically minimax for almost all -balls including both sparse and dense cases.
We then apply the proposed approach to the wavelet thresholding estimation in the Gaussian white noise model (1). We show that, under mild conditions on the prior distribution on the number of non-zero wavelet coefficients, the resulting global wavelet estimator of , up to a logarithmic factor, attains the minimax convergence rates simultaneously over the entire range of Besov balls. Furthermore, we demonstrate that estimating wavelet coefficients at each resolution level separately, allows one to remove the extra logarithmic factor. Moreover, the procedure can also be extended to the estimation of derivatives of . These results, in some sense, complement the adaptively minimax empirical Bayes estimators of Johnstone & Silverman (2005).
2 Estimation in the Gaussian sequence model
2.1 Bayesian maximum a posteriori estimation procedure
We start with reviewing the Bayesian maximum a posteriori estimation procedure for the Gaussian sequence model (2) developed by Abramovich et al. (2007).
For this model, consider the multiple hypothesis testing problem, where we wish to simultaneously test
A configuration of true and false null hypotheses is uniquely defined by the indicator vector , where and denotes the indicator function of the set . Let be the number of non-zero , i.e., . Assume some prior distribution on with . For a given , all the corresponding different vectors are assumed to be equally likely a priori, that is, conditionally on ,
Naturally, , where is a probability atom at zero. To complete the prior specification, we assume that .
For the proposed hierarchical prior, the posterior probability of a given vector with non-zero entries is
where the Bayes factor of is
and is the variance ratio (Abramovich & Angelini, 2006).
Given the posterior distribution , we apply the maximum a posteriori rule to choose the most likely indicator vector. Generally, to find the posterior mode of , one should look through all possible sequences of zeroes and ones. However, for the proposed model, the number of candidates for a mode is, in fact, reduced to only. Indeed, let be a maximizer of (3) for a fixed that indicates the most plausible vector with non-zero entries. From (3), it follows immediately that at the entries corresponding to the smallest Bayes factors and zeroes otherwise. Due to the monotonicity of in in (4), it is equivalent to setting for the largest and zeroes for others. The proposed Bayesian multiple testing procedure then leads to finding that maximizes
for some constant or, equivalently, minimizes
where . The null hypotheses corresponding to are rejected. The resulting Bayesian estimation yields a hard thresholding with a threshold , i.e.,
If , then all , , are thresholded and .
From a frequentist view, the above estimator in (5) is evidently a penalized likelihood estimator with the complexity penalty
In this sense, it can be also considered within the framework of Birgé & Massart (2001). We will discuss these relations in the following section in more details.
2.2 Upper error bounds
Abramovich et al. (2007, Theorem 6) obtained upper error bounds for the -risk of (5) in the Gaussian sequence model (2) for sparse -balls, where and as . We extend now these results to more general settings.
Fix a prior distribution , on the number of non-zero entries of , and let be the variance ratio.
Let . Assume that for some . Then, as ,
Let . Assume that there exists such that for some . Then, as ,
Let . Assume for all , where , and for some . Then, as ,
for all .
Let . Assume that there exists such that for some . Then, as ,
for all .
The proof of Proposition 1 is given in the Appendix. Similar to Abramovich et al. (2007), analogous results can be obtained for other types of balls, e.g., weak -balls, , and -balls, with necessary changes in the proofs (Petsa, 2009, Chapter 3).
Since the prior assumptions in Proposition 1 do not depend on the parameters and of the -ball, the estimator (5) is inherently adaptive. The condition on guarantees that its risk is always bounded by an order of , corresponding to the risk of the maximum likelihood estimator, , in the Gaussian sequence model (2).
The following corollary of Proposition 1 essentially defines dense and sparse zones for , and dense, sparse and super-sparse zones for of different behavior for the quadratic risk of the proposed estimator (5). To evaluate its accuracy, we also compare the resulting risks with the corresponding minimax risks that can be found, e.g., in Donoho & Johnstone (1994b). In what follows denotes as .
for some and ;
for all , where or if is known, and for some ;
for some .
Then, as , depending on and , one has:
Case 1. Let . Then,
Case 2. Let . Then,
Case 3. Let . Then,
Case 4. Let . Then,
For one can easily verify that all three conditions of Corollary 1 are satisfied, for example, for the truncated geometric prior , where . On the other hand, for any , no binomial prior can “kill three birds with one stone”. The requirement necessarily implies as . However, to satisfy , one needs .
The impact of Corollary 1 is that, up to a constant multiplier, the proposed estimator (5) is adaptively minimax for almost all -balls, , except those with very small normalized radiuses, where . Hence, while the optimality of most the existing threshold estimators, e.g., universal, Stein’s unbiased risk, false discovey rate, has been established only over various sparse settings, the Bayesian estimator (5) is appropriate for both sparse and dense cases. To the best of our knowledge, such a wide adaptivity range can be compared only with the penalized likelihood estimators of Birgé & Massart (2001) and the empirical Bayes threshold estimators of Johnstone & Silverman (2004b, 2005); see Petsa (2009, Chapter 3) for more details.
In fact, as we have mentioned, there are interesting asymptotic relationships between the Bayesian estimator (5) and the penalized likelihood estimator of Birgé & Massart (2001) that may explain their similar behavior. For estimating the normal mean vector in (2) within -balls, Birgé & Massart (2001) considered a penalized likelihood estimator with a specific complexity penalty
where for fixed and (Birgé & Massart, 2001, Section 6.3). For large and , this penalty is approximately of the following form:
for some positive constants ; see also Lemma 1 in the Appendix. Thus, within this range, in (7)-(8) behaves in a way similar to a particular case of the penalty in (6) corresponding to the geometric type prior . This prior satisfies the second condition on of Corollary 1. Such a Bayesian interpretation can also be helpful in providing some intuition behind the penalty motivated in Birgé & Massart (2001) mostly due technical reasons. In addition, under the conditions of Corollary 1, .
Furthermore, for sparse cases, where , under the conditions on the prior of Corollary 1, both penalties and are of the same so-called -type penalties of the form where and is negligible relative to . Such type of penalties has appeared within different frameworks in a series of recent works on estimation and model selection (Foster & Stine, 1999; George & Foster, 2000; Birgé & Massart, 2001; Abramovich et al., 2006; Abramovich et al., 2007).
3 Bayesian maximum a posteriori wavelet estimation in the Gaussian white noise model
3.1 General algorithm
Given a compactly supported scaling function of regularity and the corresponding mother wavelet , one can generate an orthonormal wavelet basis on the unit interval from a finite number of scaling functions at a primary resolution level and wavelets at resolution levels and scales (Cohen et al., 1993; Johnstone & Silverman, 2004a). For clarity of exposition, we use the same notation for interior and edge wavelets, and in what follows denote by .
Then, is expanded in the orthonormal wavelet series on as
where . In the wavelet domain, the Gaussian white noise model (1) becomes
where the empirical wavelet coefficients are given by and are independent random variables.
Define . Estimate wavelet coefficients at different resolution levels by the following scheme:
apply the Bayesian estimation procedure of Abramovich et al. (2007) described in Section 2 to estimate at resolution levels by the corresponding ;
The resulting wavelet estimator of is then defined as
Theorem 1 below shows that, under mild conditions on the prior , the resulting global wavelet estimator (9) of , where the estimation procedure is applied to the entire set of wavelet coefficients at all resolution levels , up to a logarithmic factor, attains the minimax convergence rates over the whole range of Besov classes. Furthermore, Theorem 2 demonstrates that performing the estimation procedure at each resolution level separately allows one to remove the extra logarithmic factor. Moreover, a level-wise version of (9) allows one to estimate the derivatives of at optimal convergence rates as well.
3.2 Global wavelet estimator
The number of wavelet coefficients at all resolution levels up to is for large . Let , be a prior distribution on the number of non-zero wavelet coefficients of at all resolution levels , and let the prior variance of non-zero coefficients at the th resolution level be ; the corresponding level-wise variance ratios are .
It is well-known (Donoho & Johnstone, 1998) that, as , the minimax convergence rate for the -risk of estimating the unknown response function in the model (1) over Besov balls , where and , is given by
Let be a mother wavelet of regularity and let be the corresponding global wavelet estimator (9) of in the Gaussian white noise model (1), where , , and . Assume that there exist positive constants and such that for all . Let the prior satisfy for all or, for a shorter range if is known. Then, as ,
The proof of Theorem 1 is based on the relationship between the smoothness conditions on functions within Besov spaces and the conditions on their wavelet coefficients. Namely, if , then the sequence of its wavelet coefficients belongs to a weak -ball of a radius , where the constant depends only on a chosen wavelet basis (Donoho, 1993, Lemma 2). One can then apply the corresponding results of Abramovich et al. (2007) for estimation over weak -balls. Details of the proof of Theorem 1 are given in the Appendix.
The resulting global wavelet estimator does not rely on the knowledge of the parameters , , and of a specific Besov ball and it is, therefore, inherently adaptive. Theorem 1 establishes the upper bound for its -risk and shows that the resulting adaptive global wavelet estimator is asymptotically near-optimal within the entire range of Besov balls. In fact, the additional logarithmic factor in (10) is the unavoidable minimal price for adaptivity for any global wavelet threshold estimator (Donoho et al., 1995; Cai, 1999), and in this sense, the upper bound for the convergence rates in (10) is sharp. To remove this logarithmic factor one should consider level-wise thresholding.
3.3 Level-wise wavelet estimator
Consider now the level-wise version of the wavelet estimator (9), where estimation is applied separately at each resolution level . The number of wavelet coefficients at the th resolution level is . Let , be the prior distribution on the number of non-zero wavelet coefficients, and let be their level-wise prior variance, ; the corresponding level-wise variance ratios are .
Let be a mother wavelet of regularity and let be the corresponding level-wise wavelet estimator (9) of in the Gaussian white noise model (1), where , , and . Assume that there exist positive constants and such that for all . Let the priors satisfy the following conditions for all :
for some ;
for all , where and for some constant , and the function was defined in Proposition 1;
for some .
Then, as ,
For , the sequence of its wavelet coefficients at the th resolution level belongs to , where for some (Meyer, 1992, Section 6.10). The conditions on the prior in Theorem 2 ensure that all the four statements of the Proposition 1 simultaneously hold at all resolution levels with , and one can exploit any of them at each resolution level. It is necessary for adaptivity of the resulting level-wise wavelet estimator (9).
It turns out that requiring a slightly more stringent condition on , allows one also to estimate derivatives of by the corresponding derivatives of its level-wise wavelet estimator at the optimal convergence rates. Such a plug-in estimation of by is, in fact, along the lines of the vaguelette-wavelet decomposition approach of Abramovich & Silverman (1998).
Recall that, as , the minimax convergence rate for the -risk of estimating an th derivative of the unknown response function in the model (1) over Besov balls , where , and , is given by
(Donoho et al., 1997; Johnstone and Silverman, 2005).
Let be a mother wavelet of regularity and let be the level-wise wavelet estimator (9) of in the Gaussian white noise model (1), where , , and . Assume that there exist positive constants and such that for all . Let the priors satisfy the following conditions for all :
for some and ;
for all , where and for some constant , and the function was defined in Proposition 1;
for some .
Then, for all th derivatives of , where and , as ,
Theorem 2 is evidently a particular case of Theorem 3 corresponding to the case , for in the condition on . Theorem 3 shows that the same proposed adaptive level-wise wavelet estimator (9) is simultaneously optimal for estimating a function and an entire range of its derivatives. This range is the same as that for the empirical Bayes shrinkage and threshold estimators appearing in Theorem 1 of Johnstone & Silverman (2005). The proof of Theorem 3 is given in the Appendix.
4 Numerical Study
In this section, we present a simulation study to illustrate the performance of the developed level-wise wavelet estimator (9) and compare it with three empirical Bayes wavelet estimators: the posterior mean and the posterior median of Johnstone & Silverman (2005), and the Bayes Factor of Pensky & Sapatinas (2007); and two other estimators: the block wavelet estimator NeighBlock of Cai & Silverman (2001) and the complex-valued wavelet hard thresholding estimator of Barber & Nason (2004). All the above Bayesian estimators and the block wavelet estimator are asymptotically minimax in a wide range of Besov balls. Although no such theoretical results have been established so far for the complex-valued wavelet estimator, it has performed well in simulations (Barber & Nason, 2004).
In practice, one typically deals with discrete data of a sample size and the sampled data analog of the Gaussian white noise model (1) is the standard nonparametric regression model
where are independent random variables. The corresponding global and level-wise Bayesian maximum a posteriori wavelet estimation procedures then use the empirical wavelet coefficients obtained by the discrete wavelet transforms of the data. However, utilizing the machinery of Johnstone & Silverman (2004a, 2005) for development of appropriate boundary-corrected wavelet bases, one can show that discretization does not affect the order of magnitude of the accuracy of the resulting wavelet estimates (Johnstone & Silverman, 2004a, 2005; Petsa, 2009, Chapter 3).
The computational algorithms were performed using the WaveLab and EbayesThresh software. The entire study was carried out using the Matlab programming environment.
4.2 Estimation of parameters
To apply the proposed level-wise wavelet estimator (9) one should specify the priors , the noise variance and the prior variances or, equivalently, the variance ratios . We used the truncated geometric priors discussed in Section 3.3. Since the parameters and are rarely known a priori in practice, they should be estimated from the data in the spirit of empirical Bayes.
The unknown was robustly estimated by the median of the absolute deviation of the empirical wavelet coefficients at the finest resolution level , divided by 0.6745 as suggested by Donoho & Johnstone (1994a), and usually applied in practice. For a given , we then estimate and by the conditional likelihood approach of Clyde & George (1999).
Consider the prior model described in Section 2.1. The corresponding marginal likelihood of the observed empirical wavelet coefficients, say , at the th resolution level is then given by
where and are indicator vectors. Instead of direct maximization of with respect to and , regard the indicator vector as a latent variable and consider the corresponding log-likelihood for the augmented data , i.e.,
where is a constant. The EM-algorithm iteratively alternates between computation of the expectation of in (11) with respect to the distribution of given evaluated using the current estimates for the parameters’ values at the E-step, and updating then the parameters by maximizing it with respect to and at the M-step. However, for a general prior distribution and for the truncated geometric prior, in particular, the EM-algorithm does not allow one to achieve analytic expressions on the E-step. Instead, we apply the conditional likelihood estimation approach originated by George & Foster (2000) and adapted to the wavelet estimation context by Clyde & George (1999). The approach is based on evaluating the augmented log-likelihood (11) at the mode for the indicator vector at the E-step rather than using the mean as in the original EM-algorithm (Abramovich & Angelini, 2006).
For a fixed number of its non-zero entries, it is evident from (11) that the most likely vector is for the largest and zero otherwise. For the given , maximizing (11) with respect to after some algebra yields . To simplify maximization with respect to , approximate the truncated geometric distribution in (11) by a non-truncated one. This approximation does not strongly affect the results, especially at sufficiently high resolution levels, and allows one to obtain analytic solutions for , i.e., . It is now straightforward to find that maximizes (11) together with the corresponding and . The above conditional likelihood approach results therefore in rapidly computable estimates for and in closed forms.
4.3 Simulation study
We now present and discuss the results of the simulation study. For all three empirical Bayes wavelet estimators, we used the Double-exponential prior, where the corresponding prior parameters were estimated level-by-level by marginal likelihood maximization, as described in Johnstone & Silverman (2005). The prior parameters for the proposed level-wise wavelet estimator (9) were estimated by conditional likelihood maximization procedure described in Section 4.2 above. For the block wavelet estimator, the lengths of the blocks and the thresholds were selected as suggested by Cai & Silverman (2001). Finally, for all competing methods, was estimated by the median of the absolute value of the empirical wavelet coefficients at the finest resolution level divided by 0.6745 as discussed in Section 4.2.
In the simulation study, we evaluated the above six wavelet estimators for a series of test functions. We present here the results for the nowadays standard Bumps, Blocks, Doppler and Heavisine functions of Donoho & Johnstone (1994a), and Wave (Marron et al., 1998; Antoniadis et al., 2001) and Peak (Angelini et al., 2003) functions defined, respectively, as
See Figure 4.1 for Wave and Peak test functions.
For each test function, samples were generated by adding independent Gaussian noise to 256, 512 and 1024 equally spaced points on [0,1]. The value of the root signal-to-noise ratio was taken to be 3, 5 and 7 corresponding respectively to high, moderate and low noise levels. The goodness-of-fit for an estimator of in a single replication was measured by its mean squared error.
For brevity, we report the results only for using the compactly supported mother wavelet Coiflet 3 (Daubechies, 1992, p.258) and the Lawton mother wavelet (Lawton, 1993) for the complex-valued wavelet estimator. The primary resolution level was . Different choices of sample sizes and wavelet functions basically yielded similar results in magnitude.
The sample distributions of mean squared errors over replications for different wavelet estimators in the conducted simulation study were typically asymmetrical and affected by outliers. Therefore, we preferred the sampled medians of mean squared errors rather than means to gauge the estimators’ goodness-of-fit. Thus, for each wavelet estimator, test function and noise level, we calculated the sample median of mean squared errors over all 100 replications. To quantify the comparison between the competing wavelet estimators over various test functions and noise levels, for each model we found the best wavelet estimator among the six, i.e., the one achieving the minimum median mean squared error. We then evaluated the relative median mean squared error of each estimator defined as the ratio between the minimum and the estimator’s median mean squared errors; see Table 4.1.
As expected, Table 4.1 shows that there is no uniformly best wavelet estimator. Each one has its own favorite and challenging cases, and its relative performance strongly depends on a specific test function. Thus, the complex-valued estimator indeed demonstrates excellent results for Donoho & Johnstone’s functions as it has been reported in Barber & Nason (2004), but is much less successful for Peak and Wave. The block estimator is the best for the Peak and Doppler but the worst for Blocks and Bumps. The proposed Bayesian estimator (9) outperforms others for Wave but is less efficient for Donoho & Johnstone’s (1994a) examples. Interestingly, the relative performance of the estimators is much less sensitive to the noise level. For each of the test functions, the corresponding best estimator is essentially the same for all noise levels.
The minimal relative median of mean squared errors of an estimator over all cases reflects its inefficiency at the most challenging combination of a test function and noise level, and can be viewed as a natural measure of its robustness. In this sense, the posterior mean estimator is the most robust although it is not the winner in any particular case.
As a “by-product”, we also compared different thresholding estimators in terms of sparsity measured by the average percentage of non-zero wavelet coefficients which remained after thresholding; see Table 4.2. The posterior mean estimator was not included in this comparison since it is a non-linear shrinkage but not a thresholding estimator. Similar to the previous results on the goodness-of-fit, the relative sparsity strongly depends on the test function. However, except for the Doppler example, the estimator (9) is consistently the most sparse among Bayesian estimators.
Apart from providing a theoretical justification, the presented numerical results show that the proposed estimator demonstrates good performance in finite sample settings and can, therefore, be viewed as a contribution to the list of useful wavelet-based function estimation tools.
Felix Abramovich and Vadim Grinshtein were supported by the Israel Science Foundation grant ISF-248/08. The authors would like to thank the Editor, the Associate Editor and the three anonymous referees for helpful comments on improvements to this paper. Valuable remarks of Bernice Oberman are gratefully acknowledged.
Throughout the proofs we use to denote a generic positive constant, not necessarily the same each time it is used, even within a single equation.
5.1 Proof of Proposition 1
We start the proof of the proposition with the following lemma that establishes the bounds for binomial coefficients:
For all and ,
In particular, for ,
This lemma generalizes Lemma A.1 of Abramovich et al. (2007), where the upper bound similar to that in (12) was obtained for .
Proof of Lemma 1. The obvious lower bound for the binomial coefficient in (12) has been shown in Lemma A.1 of Abramovich et al. (2007). To prove the upper bound in (12), note that using Stirling’s formula one has
for all and .
Note that for all