On the usefulness of Meyer wavelets for deconvolution and density estimation
The aim of this paper is to show the usefulness of Meyer wavelets for the classical problem of density estimation and for density deconvolution from noisy observations. With such wavelets, the computation of the empirical wavelet coefficients relies on the fast Fourier transform and on the fact that Meyer wavelets are band-limited functions. This makes such estimators very simple to compute and avoids the problem of evaluating wavelets at non-dyadic points, which is the main drawback of classical wavelet-based density estimators. Our approach is based on term-by-term thresholding of the empirical wavelet coefficients, with random thresholds depending on an estimate of the variance of each coefficient. Such estimators are shown to achieve the same performance as an oracle estimator, up to a logarithmic term. They also achieve near-minimax rates of convergence over a large class of Besov spaces. A simulation study illustrates the good finite-sample performance of the estimator for both direct density estimation and density deconvolution.
Keywords: Density estimation, Deconvolution, Inverse problem, Wavelet thresholding, Random thresholds, Oracle inequalities, Adaptive estimation, Besov space, Minimax rates of convergence.
AMS classifications: Primary 62G07; secondary 42C40, 41A29
Institut de Mathématiques de Toulouse, Université de Toulouse et CNRS (UMR 5219), Jeremie.Bigot@math.univ-toulouse.fr
We gratefully acknowledge Yves Rozenholc for providing the Matlab code to compute the model selection estimator.
Density estimation is a well-known problem in statistics that has been thoroughly studied. It consists of estimating an unknown probability density function from an independent and identically distributed (iid) sample of random variables for , with representing the sample size. Wavelet decomposition is known to be a powerful method for nonparametric estimation, see e.g. ? (?). The advantage of wavelet methods is their ability to estimate spatially inhomogeneous functions. They can be used to estimate functions in Besov spaces with optimal rates of convergence, and have therefore received special attention in the literature over the last two decades, in particular for density estimation; see e.g. ? (?) and ? (?) for a detailed review of the subject.
For a given scaling function and mother wavelet , the scaling and wavelet coefficients are usually estimated as ? (?)
where , , and denotes the usual coarse level of resolution. A hard-thresholding estimator of then takes the form
where is an appropriate frequency cut-off, and where the ’s are appropriate thresholds (positive numbers) that are possibly level-dependent. However, in practice the computation of the coefficients (1.1) requires some numerical approximation by interpolation, as one typically only knows the values of the functions and at dyadic points. Various approaches have been used to approximate the coefficients in (1.1) numerically. For instance, ? (?), ? (?) use a binning method followed by the standard discrete wavelet transform, while the algorithm proposed in ? (?) is based on an approximation of the scaling coefficients at a sufficiently small level of resolution.
In this paper, we propose to avoid such numerical approximation schemes altogether. This is achieved by using Meyer wavelets, which are band-limited functions. Indeed, using such wavelets and thanks to Plancherel’s identity, the empirical coefficients can be easily computed from the Fourier transform of the data . Such an approach therefore takes full advantage of the fast Fourier transform and of existing fast algorithms for Meyer wavelet decomposition developed by ? (?).
As Meyer wavelets are band-limited functions, they have been recently used for deconvolution problems in nonparametric regression by ? (?), ? (?), and for density deconvolution by ? (?, ?, ?). Moreover, the use of such wavelets leads to fast algorithms, see ? (?) and ? (?).
Density deconvolution is the problem of estimating the function when the observation of the random variable is contaminated by an independent additive noise. In this case, the observations at hand are a sample of variables such that
where are iid variables with unknown density , and are iid variables with known density , which represents some additive noise independent of the ’s. Density estimation from a noisy sample is of fundamental importance in many practical situations, and applications can be found in communication theory ? (?), experimental physics (e.g. ?) and econometrics ? (?). In this setting, the density of the observed variables is the convolution of the density with the density of the additive noise. Hence, the problem of estimating relates to nonparametric deconvolution methods, a widely studied inverse problem in statistics and signal processing. However, the indirect observation of the data leads to different optimality properties, for instance in terms of rates of convergence, from those of the direct problem of density estimation without an additive error. Standard techniques recently studied for density deconvolution include model selection ? (?), kernel smoothing ? (?), spline deconvolution ? (?), spectral cut-off ? (?) and wavelet thresholding ? (?, ?, ?), to name but a few. Again, Meyer wavelets can be used to easily compute estimators of the wavelet coefficients of from the Fourier coefficients of the noise density, without any numerical approximation scheme.
The second contribution of this paper is the use of random thresholds . Classically, the thresholds used in wavelet density estimation are deterministic, but such thresholds may be too large in practice, as they do not take into account the fact that the variance of each empirical wavelet coefficient depends on its location . The use of random thresholds has recently been proposed in ? (?) in the context of density estimation, and in ? (?) for Poisson intensity estimation. However, the estimation procedure in ? (?) and ? (?) differs from the one we propose, since it is based on biorthogonal bases and on the use of the Haar basis to compute the wavelet coefficients. The use of data-based thresholds exploiting the variance structure of the empirical wavelet coefficients is also proposed in ? (?), but the theoretical properties of the resulting algorithms are not studied. In this paper, we show that using Meyer wavelets allows one to easily compute an estimate of an upper bound on the variance of the ’s, which is then used to compute random thresholds .
A third contribution of this paper is that the resulting hard-thresholding estimators are shown to attain the same performance (up to logarithmic terms) as an ideal estimator, called an oracle since its computation depends on unknown quantities such as the variance of the ’s or the magnitude of the true wavelet coefficients. Oracle inequalities are an active research area in nonparametric statistics (see e.g. ? (?), ? (?) for detailed expositions) which has recently gained popularity. Deriving an oracle inequality amounts to bounding the risk of a statistical procedure by the performance of an ideal estimator which represents the best model for the function to recover. Oracle inequalities are currently used in many different contexts in statistics. They were introduced in ? (?) for nonparametric regression with wavelets, then used by ? (?) for inverse problems in the white noise model, by ? (?), ? (?), ? (?), ? (?) for density estimation problems, and by ? (?) for estimating the intensity of a Poisson process, to name but a few.
The rest of this paper is organized as follows. In Section 2, we provide some background on Meyer wavelets and define the corresponding wavelet estimators with random thresholds for both direct density estimation and density deconvolution. In Section 3, we explain how the performance of such estimators can be compared to that of an oracle estimator, which leads to non-asymptotic oracle inequalities. Asymptotic properties of the estimators are then studied in Section 4. Depending on the problem at hand, these estimators are shown to achieve near-minimax rates of convergence over a large class of Besov spaces. Finally, a simulation study is proposed in Section 5 to evaluate the numerical performance of our estimators and to compare them with other procedures existing in the literature. The proofs of the main theorems are gathered in a technical Appendix.
2 Density estimation with Meyer wavelets
In what follows, it will be assumed that the density function of the ’s has a compact support included in . Of course, assuming that the support of is included in would not hold in many practical applications; this assumption is mainly made for mathematical convenience, to simplify the presentation of the estimator. If the range of the data is outside , one can simply rescale and center them so that they fall into , and then apply the inverse transformation to the estimated density.
2.1 Wavelet decomposition and the periodized Meyer wavelet basis
Let and denote the Meyer scaling and wavelet functions, respectively (see ? (?) for further details). They are constructed from a scaling function with Fourier transform
where is a smooth function chosen as a polynomial of degree 3 in our simulations. The scaling and wavelet functions at scale are defined by
As in ? (?), one can then define the periodized Meyer wavelet basis of (the space of square-integrable functions on ) by periodizing the functions , i.e.
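The construction above can be sketched numerically. Below is a small illustration of the Meyer scaling function on the Fourier side, under two assumptions that are ours and not taken from the text: the common normalization with support in [-4π/3, 4π/3], and the degree-3 auxiliary polynomial ν(x) = x²(3 − 2x), which satisfies ν(0) = 0, ν(1) = 1 and ν(x) + ν(1 − x) = 1.

```python
import numpy as np

# Degree-3 auxiliary polynomial (one common choice; an assumption here).
def nu(x):
    return x**2 * (3.0 - 2.0 * x)

def phi_hat(xi):
    """Fourier transform of the Meyer scaling function (common normalization)."""
    axi = np.abs(xi)
    out = np.zeros_like(axi, dtype=float)
    out[axi <= 2 * np.pi / 3] = 1.0
    mid = (axi > 2 * np.pi / 3) & (axi <= 4 * np.pi / 3)
    out[mid] = np.cos(0.5 * np.pi * nu(3 * axi[mid] / (2 * np.pi) - 1.0))
    return out

xi = np.linspace(-2 * np.pi, 2 * np.pi, 2001)
support = xi[phi_hat(xi) > 0]
# phi_hat is band-limited: it vanishes outside [-4*pi/3, 4*pi/3]
print(support.min(), support.max())
```

Band-limitedness is exactly what makes the Fourier-domain computation of the wavelet coefficients in the next subsections exact rather than approximate.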
For any function of , its wavelet decomposition can be written as:
where , and denotes the usual coarse level of resolution. Moreover, the norm of is given by
Meyer wavelets can be used to efficiently compute the coefficients and by using the Fourier transform. Indeed, let and denote by the Fourier coefficients of , supposed to be a function in . By Plancherel’s identity, we obtain that
where denote the Fourier coefficients of and . As Meyer wavelets are band-limited, is a finite subset of with (see ? (?)).
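The key identity can be illustrated with plain FFTs. The sketch below checks Plancherel's identity on [0, 1] for two arbitrary 1-periodic functions (toy stand-ins, not actual Meyer wavelets): the time-domain inner product matches the sum of products of Fourier coefficients.

```python
import numpy as np

N = 1024
t = np.arange(N) / N
f = np.exp(np.cos(2 * np.pi * t))   # a smooth 1-periodic "density-like" function
g = np.sin(4 * np.pi * t) + 0.3     # stand-in for a (real-valued) wavelet

# Riemann approximation of the inner product <f, g> on [0, 1]
inner_time = np.mean(f * g)

# Fourier coefficients via the FFT (normalized so that the 0-th one is the mean)
fh = np.fft.fft(f) / N
gh = np.fft.fft(g) / N
inner_freq = np.real(np.sum(fh * np.conj(gh)))

print(inner_time, inner_freq)
```

For band-limited wavelets the frequency-domain sum involves only finitely many nonzero terms, which is what makes the computation exact and fast.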
2.2 The case of direct density estimation
Based on a sample , an unbiased estimator of is given by which yields an unbiased estimator of given by
The coefficients (2.2) can therefore be easily calculated by combining the fast Fourier transform of the data with the fast algorithm of ? (?), which relies on the fact that the first sum in (2.2) only involves a finite number of terms. Equation (2.2) also shows that using Meyer wavelets, which are band-limited functions, avoids the use of numerical schemes to approximate the coefficients (1.1), as one typically only knows the values of the functions and at dyadic points. We define the estimators of the scaling coefficients analogously, with instead of and instead of .
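As a minimal illustration of the first step, the empirical Fourier coefficients of the unknown density can be computed directly from the sample. The sketch below uses a uniform toy sample, for which the true coefficients are known (the 0-th coefficient is 1 and all others vanish).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
X = rng.uniform(0.0, 1.0, size=n)   # toy sample: Uniform density on [0, 1]

# c_hat_l = (1/n) * sum_k exp(-2*pi*i*l*X_k) is an unbiased estimate of the
# l-th Fourier coefficient of the density, since E[exp(-2*pi*i*l*X)] equals
# the integral of exp(-2*pi*i*l*x) against the density.
ls = np.arange(-8, 9)               # a small band of frequencies
c_hat = np.exp(-2j * np.pi * np.outer(ls, X)).mean(axis=1)

print(np.abs(c_hat[ls == 0]), np.abs(c_hat[ls != 0]).max())
```

These empirical Fourier coefficients are then the input to the fast Meyer wavelet transform.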
2.3 The case of density deconvolution
Consider now the problem (1.3) of density deconvolution. Denote by the Fourier coefficients of , and by the Fourier coefficients of the error density . Since and , it follows by independence that . This equality and Plancherel’s identity (2.1) therefore imply that an unbiased estimator of is given by (assuming that for all )
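A sketch of this correction step, with an illustrative centered Laplace noise of scale a (our choice; for this noise the Fourier coefficients 1/(1 + 4π²a²l²) are known in closed form) and a Beta(2, 2) toy target density:

```python
import numpy as np

rng = np.random.default_rng(1)
n, a = 50000, 0.05
X = rng.beta(2.0, 2.0, size=n)            # toy target density on [0, 1]
eps = rng.laplace(0.0, a, size=n)         # known noise density (illustrative)
Y = X + eps                                # the observed noisy sample

# Fourier coefficients of the Laplace noise density (closed form)
ls = np.arange(1, 6)
gamma = 1.0 / (1.0 + (2 * np.pi * a * ls) ** 2)

# Deconvolution: the coefficients of the density of Y factor as those of X
# times gamma_l, so dividing the empirical coefficients of Y by gamma_l gives
# unbiased estimates of the coefficients of the density of X.
c_noisy = np.exp(-2j * np.pi * np.outer(ls, Y)).mean(axis=1)
c_deconv = c_noisy / gamma
# Oracle check against the (unavailable in practice) clean sample
c_clean = np.exp(-2j * np.pi * np.outer(ls, X)).mean(axis=1)
print(np.abs(c_deconv - c_clean).max())
```

Note that the division by small gamma_l amplifies the stochastic error, which is exactly the ill-posedness discussed next.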
It is well known that the difficulty of the deconvolution problem is quantified by the smoothness of the error density . The so-called ill-posedness of such inverse problems depends on how fast the Fourier coefficients tend to zero. Depending on the decay of these coefficients, the estimation of will be more or less accurate. In this paper, we consider the case where the ’s have a polynomial decay, which is usually referred to as ordinary smooth deconvolution (see e.g. ? (?)):
The Fourier coefficients of have a polynomial decay, which means that there exist a real and two positive constants such that for all , .
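A canonical example satisfying this assumption (our illustration, written in our own notation since the paper's symbols are elided here) is centered Laplace noise, for which the decay is polynomial of degree 2:

```latex
% Illustrative example: centered Laplace noise with scale a > 0.
\[
  f_\varepsilon(x) \;=\; \frac{1}{2a}\, e^{-|x|/a},
  \qquad
  \gamma_\ell \;=\; \mathbb{E}\bigl[e^{-2 i \pi \ell \varepsilon}\bigr]
  \;=\; \frac{1}{1 + 4\pi^2 a^2 \ell^2},
\]
% so |gamma_ell| is bounded above and below by constant multiples of
% |ell|^{-2}: the polynomial decay assumption holds with degree 2.
```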
The rates of convergence that can be expected from a wavelet estimator depend on such smoothness assumptions and are well-studied in the literature, and we refer to ? (?, ?) for further details.
Finally, note that to simplify the presentation, we prefer to use the same notation and for both problems of direct density estimation and density deconvolution.
2.4 Thresholding of the empirical wavelet coefficients
Based on estimates of the scaling and wavelet coefficients, a linear wavelet estimator of is of the form . For an appropriate choice of , one can show that achieves optimal rates of convergence among the class of linear estimators. Typically, if belongs to a Sobolev space with smoothness order , then for direct density estimation the choice yields optimal rates of convergence for the quadratic risk; see ? (?), ? (?) for further details. However, this choice is not adaptive because it depends on the unknown smoothness of . It is well known that adaptivity can be obtained by using nonlinear estimators based on appropriate thresholding of the estimated wavelet coefficients. A nonlinear hard-thresholding estimator is defined by
where the ’s are appropriate thresholds (positive numbers). Various choices for and the threshold have been proposed. In the case of direct density estimation, ? (?) recommended taking the level-dependent threshold and . For density deconvolution, one possible calibration in ordinary smooth deconvolution is and ; see ? (?). The choices and have also been considered in ? (?) and ? (?), respectively.
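For concreteness, the hard-thresholding rule itself is the following generic keep-or-kill operation (a sketch, independent of how the thresholds are calibrated):

```python
import numpy as np

def hard_threshold(beta_hat, s):
    """Keep each coefficient whose magnitude exceeds s; set the rest to zero."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    return np.where(np.abs(beta_hat) > s, beta_hat, 0.0)

kept = hard_threshold([0.9, -0.2, 0.41, -1.3], 0.4)
print(kept)
```

Unlike soft thresholding, surviving coefficients are kept unchanged rather than shrunk toward zero.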
However, such thresholds may be too large in practice as they do not take into account the fact that the variance of each empirical wavelet coefficient depends on its location . Consider the problem of density deconvolution and let us denote the variance of by
Let us also denote by the function defined for by
By definition, it follows that and thus . Hence, a simple upper bound for is . Then, simple algebra shows that in the case of density deconvolution
with and . An unbiased estimator of is thus given by
Similar computations can be made for the case of direct density estimation, with instead of , instead of , and instead of in the above equations (2.5) and (2.6). Note that can also be written as , but its calculation is obtained from the Fourier coefficients and not from , whose computation at non-dyadic points requires numerical approximation. Alternatively, one could also use an estimate of the variance given by
instead of the upper bound . However, this does not significantly change our results, since and are very close for sufficiently large . Moreover, oracle inequalities are simpler to derive using an estimated upper bound for .
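The logic of the upper bound can be sketched generically (this is not the paper's exact Fourier-domain formula, which we do not reproduce here): if the empirical coefficient is a sample mean of iid terms, then its variance is at most the second moment of those terms divided by the sample size, and that bound is estimated by the empirical second moment divided by the sample size.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
# Stand-in for the per-observation wavelet terms; here E[w] = 0 and E[w^2] = 1
w = np.sqrt(2.0) * np.sin(2 * np.pi * rng.uniform(size=n))

beta_hat = w.mean()                 # empirical coefficient (a sample mean)
v_upper_hat = np.mean(w**2) / n     # estimated upper bound on Var(beta_hat)
v_exact = w.var() / n               # estimated exact variance, for comparison
print(beta_hat, v_upper_hat, v_exact)
```

The bound is exact up to the squared mean of the terms, hence tight when the true coefficient is small, which is the regime where thresholding decisions matter.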
A thresholding rule is usually chosen by controlling the probability of deviation of from the true wavelet coefficient . From Lemma 5.3 (see the Appendix) one has that for any positive , where
which would suggest to take a threshold of the form
where is a tuning parameter. Thinking of the classical universal threshold, one would take . However, this choice can be too conservative, and the results of this paper show that it is possible to take smaller than 1. Moreover, it is shown that the choice of this tuning parameter depends on the highest resolution level and, in the case of density deconvolution, on the degree of ill-posedness. Throughout the paper, we discuss the choice of and finally propose data-based values for its calibration.
Obviously is an ideal threshold as is unknown. Based on Lemma 5.4 (see the Appendix) which gives a control on the probability of deviation of from , we propose to use the following random thresholds
Again, for direct density estimation, we take the same thresholds with instead of to compute in (2.7). The above choice for resembles the universal threshold proposed by ? (?) in the context of nonparametric regression with homoscedastic variance . Here we exploit the fact that, in the context of density estimation, the variance of a wavelet coefficient depends on its location and has to be estimated.
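The shape of such a variance-dependent threshold can be sketched as follows. The exact additive correction terms of (2.8) are not reproduced here; the constant `gamma` plays the role of the tuning parameter discussed in the text, and the value 0.5 is only a placeholder.

```python
import numpy as np

def random_threshold(v_hat, n, gamma=0.5):
    """Universal-type threshold built from the estimated variance bound v_hat
    of each coefficient (additive correction terms omitted)."""
    v_hat = np.asarray(v_hat, dtype=float)
    return np.sqrt(2.0 * gamma * v_hat * np.log(n))

n = 10000
v_hat = np.array([1e-4, 4e-4, 2.5e-3])   # estimated variance bounds per coefficient
s = random_threshold(v_hat, n)
print(s)
```

Coefficients with larger estimated variance receive larger thresholds, unlike a single deterministic threshold applied uniformly across locations.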
The additive terms and will allow us to derive oracle inequalities. Indeed, in Section 3, we compare the quadratic risk of to the risk of the following oracle estimator
Note that is an ideal estimator that cannot be computed in practice, as it depends on the unknown coefficients of and the unknown variance terms . However, we shall use it as a benchmark to assess the quality of our estimator. The quadratic risk of is
where . Equation (2.10) shows that we retrieve the classical formula for the quadratic risk of an oracle estimator given in ? (?) except that the variance term is not constant as in standard nonparametric regression with homoscedastic variance.
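The oracle rule and its risk are easy to state in code: a coefficient is kept exactly when its squared value exceeds its variance, so each coefficient contributes the minimum of the two quantities to the quadratic risk, here with coefficient-dependent variances (toy numbers for illustration):

```python
import numpy as np

theta = np.array([0.8, 0.05, 0.3, 0.01])     # true wavelet coefficients
sigma2 = np.array([0.01, 0.01, 0.2, 0.001])  # their (heteroscedastic) variances

keep = theta**2 > sigma2                     # the oracle's ideal keep/kill rule
oracle_risk = np.sum(np.minimum(theta**2, sigma2))
print(keep, oracle_risk)
```

This is the classical Donoho-Johnstone oracle risk formula, except that the variance varies from coefficient to coefficient instead of being constant.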
Data-driven thresholds based on an estimation of the variance of the empirical wavelet coefficients have already been proposed in ? (?) in the context of density estimation. However, the estimation procedure in ? (?) is different from ours since it is based on biorthogonal bases and on the use of the Haar basis to compute the wavelet coefficients. Note that our choice for is similar to the random threshold in ? (?), but the addition of the terms and will allow us to compare the performances of with those of the oracle . The addition of similar deterministic and stochastic terms is also proposed in ? (?) to derive oracle inequalities in the context of Poisson intensity estimation.
3 Oracle inequalities
To derive oracle inequalities, we need a further smoothness assumption on the error density :
There exists a constant and a real such that the density satisfies for all .
Obviously, the above condition for is not very restrictive as is by definition an integrable function on . For a bounded function we denote by its supremum norm. Let us also define the following class of densities
3.1 The case of direct density estimation
The following theorem states that for appropriate choices of and the tuning parameter , then the estimator behaves essentially as the oracle up to logarithmic terms.
Assume that with . Let and be some fixed constants. For any , define to be the integer such that , and to be the integer such that and suppose that and are such that . Assume that , and take the random thresholds given by equation (2.8). Then, the estimator satisfies the following oracle inequality
where and are two positive constants not depending on and , and such that .
The above inequality (3.1) shows that the performance of mimics that of the oracle in terms of quadratic risk, see equation (2.10), up to a logarithmic term. The additive term depends on two hyperparameters and which are used to control the effect of the choice of the highest resolution level and the tuning parameter on the performance of the estimator. Moreover, the above inequality suggests that the performance of deteriorates as tends to . Thinking of the classical universal threshold, one would like to take . However, if one sets and , the additive term is bounded by , which is a rate typically faster than the decay of the oracle risk (2.10) when belongs to a Sobolev or a Besov space. Hence, Theorem 3.1 shows that if one chooses with , then it is possible to take smaller than 1. Choosing such hyperparameters carefully is of fundamental importance, and a detailed study is therefore proposed in Section 5 to validate the results of Theorem 3.1 and to analyze the risk of as a function of the resolution level and the tuning constant .
Classically, the level is chosen as the integer such that (see e.g. ? (?)). The results of this paper show that one can use a smaller level, of the order . For , the price to pay is a slightly slower rate for the additive term in the oracle inequality (3.1), which is of the order instead of the rate classically obtained for the additive term when deriving oracle inequalities. Note that a similar result in the context of Poisson intensity estimation is also given in ? (?). In particular, ? (?) obtain an additive term of the order , but to derive such a result their proof relies heavily on the fact that the wavelet basis they use (the Haar basis) satisfies , which is not the case for Meyer wavelets.
3.2 The case of density deconvolution
Consider now the problem of density deconvolution under Assumption 2.1 of ordinary smooth deconvolution.
Assume that with , and that satisfies Assumption 2.1 and Assumption 3.1. Let and be some fixed constants. For any , define to be the integer such that , and to be the integer such that , and suppose that and are such that . Assume that , and take the random thresholds given by equation (2.8). Then, the estimator satisfies the following oracle inequality
where and are two positive constants not depending on and , and such that .
Hence, the above theorem shows that, in the case of density deconvolution, the estimator also behaves like an oracle estimator up to logarithmic terms, and that its performance tends to deteriorate as tends to . Comments similar to those given for Theorem 3.1 can be made. If one chooses , then the additive term is of the order , and has to be greater than . Hence, this again shows that one can take a value of smaller than 1. However, the value of is typically larger for deconvolution than in the direct case, as it is controlled by the degree of ill-posedness.
In deconvolution problems, the high-frequency cut-off is usually related to the ill-posedness of the inverse problem, and is a typical choice for various estimators proposed in the literature (see e.g. ? (?, ?, ?)). It is a standard fact that a smaller should be used for ill-posed inverse problems than in the direct case. Again, we have introduced the hyperparameters and to control the effect of the choice of the highest resolution level . Understanding the influence of these hyperparameters on the quality of the estimator is a fundamental issue, and a detailed simulation study is thus proposed in Section 5 to validate the results of Theorem 3.2.
4 Asymptotic properties and near-minimax optimality
It is well known that Besov spaces for periodic functions in can be characterized in terms of wavelet coefficients (see e.g. ? (?)). Let denote the usual smoothness parameter, then for the Meyer wavelet basis and for a Besov ball of radius with , one has that for
with the respective sums above replaced by maxima if or .
The condition that is imposed to ensure that is a subspace of , and we shall restrict ourselves to this case in this paper. Besov spaces allow for more variability in local smoothness than the usual Hölder or Sobolev spaces. For instance, a real function on that is piecewise continuous, but for which each piece is locally in , can be an element of with , despite the possibility of discontinuities at the transition from one piece to the next. Note that if is not an integer, then is equivalent to a Sobolev ball of order , and that the space with contains piecewise smooth functions with local irregularities such as discontinuities. Finally, let us introduce the following space of densities
The following theorems show that for either the problem of direct density estimation or density deconvolution, the estimator is asymptotically near-optimal (in the minimax sense) up to a logarithmic factor over a wide range of Besov balls.
4.1 The case of direct density estimation
For and and let us define
Then the following result holds.
Assume that the conditions of Theorem 3.1 hold with and . Assume that with , and . Then, as
Minimax rates of convergence for density estimation over Besov spaces have been studied in detail by ? (?). Hence, Theorem 4.1 shows that is an adaptive estimator which converges to at the minimax rate, up to a logarithmic factor, for the problem of direct density estimation over . The extra logarithmic factor is usually called the price to pay for adaptivity to the unknown smoothness .
4.2 The case of density deconvolution
Consider now the problem of density deconvolution under the Assumption 2.1 of ordinary smooth deconvolution.
Assume that the conditions of Theorem 3.2 hold and . Assume that with , and . If then as
and if then as
Theorem 4.2 shows that there are two different rates of convergence depending on whether or . These two conditions are referred to as the dense case, when the worst functions to estimate are spread uniformly over , and the sparse case, when the hardest functions to estimate have only one non-vanishing wavelet coefficient. This change in the rate of convergence, usually referred to as an elbow effect, has been studied in detail in nonparametric deconvolution problems in the white noise model by ? (?) and also ? (?). Theorem 4.2 shows that the rate of convergence of corresponds to minimax rates (up to a logarithmic term) that have been obtained in related deconvolution problems, either for density estimation or for nonparametric regression in the white noise model (see e.g. ? (?, ?, ?, ?), and references therein).
5 Simulations
Simulations use the Wavelab wavelet toolbox for Matlab (?) and the fast algorithm for Meyer wavelet decomposition developed by ? (?). As in the simulation study of ? (?), four test densities are considered: the Uniform distribution , the Exponential distribution , the Laplace distribution , and the MixtGauss distribution (a mixture of two Gaussian variables) with , and . These four densities, displayed in Figure 1, exhibit different types of smoothness: the Uniform density is a piecewise constant function with two jumps, the Exponential density is a piecewise smooth function with a single jump, the Laplace density is a continuous function with a cusp at , whereas the MixtGauss density is infinitely differentiable.
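For reference, samplers for the four types of test densities can be sketched as follows. The paper's exact parameter values are elided above, so the scales, locations and mixture weights below are illustrative stand-ins only.

```python
import numpy as np

def sample(density, n, rng):
    """Draw n points from one of the four test-density families (toy parameters)."""
    if density == "Uniform":
        return rng.uniform(0.0, 1.0, n)
    if density == "Exponential":
        return rng.exponential(0.25, n)          # jump at 0, smooth afterwards
    if density == "Laplace":
        return rng.laplace(0.5, 0.1, n)          # cusp at the location 0.5
    if density == "MixtGauss":
        comp = rng.random(n) < 0.5               # equal-weight two-component mixture
        return np.where(comp, rng.normal(0.35, 0.06, n), rng.normal(0.65, 0.06, n))
    raise ValueError(density)

rng = np.random.default_rng(3)
samples = {name: sample(name, 2000, rng) for name in
           ["Uniform", "Exponential", "Laplace", "MixtGauss"]}
print({name: round(s.mean(), 2) for name, s in samples.items()})
```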
5.1 Direct density estimation
Assume that an iid sample of variables is drawn from one of the test densities shown in Figure 1. The empirical Fourier coefficients are computed for , where is a dyadic integer with chosen such that . The ’s are then used as input to the efficient algorithm of ? (?) to compute the empirical scaling and wavelet coefficients . This algorithm only requires operations to compute the empirical wavelet coefficients from a sample of Fourier coefficients of size . According to Theorem 3.1, one can take . Two hyperparameters then essentially control the quality of the estimator: the high-frequency cut-off parameter and the constant used in the definition of the thresholds given by equation (2.8). If the level is such that for some , , then one must take . Increasing deteriorates the rate of convergence of the additive term in the oracle inequality (3.1), so one can choose to set . Following the choice made for the asymptotic study of in Section 4, one can take , and according to Theorem 3.1 one should then take . However, it is not clear whether a value of lower than 1/2 would degrade or improve the quality of the estimator .
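The pipeline just described can be sketched end to end, with one substitution made explicit: a plain truncated Fourier series stands in for the Meyer wavelet step (the actual method pushes the empirical Fourier coefficients through the fast Meyer wavelet transform and thresholds the wavelet coefficients). The cut-off and threshold values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
# Toy sample from a cusped density, clipped to [0, 1]
X = np.clip(rng.laplace(0.5, 0.08, n), 0.0, 1.0)

# Step 1: empirical Fourier coefficients up to a frequency cut-off
L = 64
ls = np.arange(-L, L + 1)
c_hat = np.exp(-2j * np.pi * np.outer(ls, X)).mean(axis=1)

# Step 2: keep-or-kill thresholding (universal-type threshold, illustrative)
s = np.sqrt(2.0 * np.log(n) / n)
c_thr = np.where(np.abs(c_hat) > s, c_hat, 0.0)

# Step 3: reconstruct the density estimate on a grid of [0, 1)
grid = np.arange(512) / 512.0
f_hat = np.real(np.exp(2j * np.pi * np.outer(grid, ls)) @ c_thr)
print(round(f_hat.max(), 2), round(f_hat.mean(), 6))
```

The estimate integrates to one (the 0-th coefficient always survives) and peaks near the cusp; the wavelet version differs only in the basis used between steps 1 and 2.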
Alternatively, assume that is given. Then, another possibility for choosing is to take which is the smallest constant that satisfies , and then take . Combining the above arguments, we finally suggest to take
To give an idea of the quality of , a typical example of estimation with , and is given in Figure 2. Another possibility for choosing a smaller value for is to take , and then to choose