Extreme Value Analysis of Empirical Frame Coefficients and Implications for Denoising by Soft-Thresholding
Denoising by frame thresholding is one of the most basic and efficient methods for recovering a discrete signal or image from data that are corrupted by additive Gaussian white noise. The basic idea is to select a frame of analyzing elements that separates the data in few large coefficients due to the signal and many small coefficients mainly due to the noise . Removing all data coefficients being in magnitude below a certain threshold yields a reconstruction of the original signal. In order to properly balance the amount of noise to be removed and the relevant signal features to be kept, a precise understanding of the statistical properties of thresholding is important. For that purpose we derive the asymptotic distribution of for a wide class of redundant frames . Based on our theoretical results we give a rationale for universal extreme value thresholding techniques yielding asymptotically sharp confidence regions and smoothness estimates corresponding to prescribed significance levels. The results cover many frames used in imaging and signal recovery applications, such as redundant wavelet systems, curvelet frames, or unions of bases. We show that ‘generically’ a standard Gumbel law results as it is known from the case of orthonormal wavelet bases. However, for specific highly redundant frames other limiting laws may occur. We indeed verify that the translation invariant wavelet transform shows a different asymptotic behaviour.
Keywords. Denoising, thresholding estimation, extreme value analysis, Gumbel distribution, Berman’s inequality, wavelet thresholding, curvelet thresholding, translation invariant wavelets, extreme value threshold, frames, redundant dictionaries.
We consider the problem of estimating a -signal or image from noisy observations
Here are independent normally distributed random variables (the noise), is the level of discretization, and is the variance of the data (the noise level). The signal is assumed to be a discrete approximation of some underlying continuous domain signal obtained by discretizing a function . One may think of the entries of as point samples on an equidistant grid. However, in some situations it may be more realistic to consider other discretization models. Area samples, for example, are more appropriate in many imaging applications. In this paper we will not pursue this topic further, because most of the presented results do not crucially depend on the particular discretization model as long as can be associated with a function (some kind of abstract interpolation) which, in a suitable way, tends to as .
The aim of denoising is to estimate the unknown signal from the data . The particular estimation procedure we will analyze in detail is soft-thresholding in frames and overcomplete dictionaries. We stress, however, that a similar analysis also applies to different thresholding methods, such as block thresholding techniques (as considered, for example, in [7, 8, 14, 37]).
1.1 Wavelet Soft-Thresholding
In order to motivate our results for thresholding for general frames we start by one dimensional wavelet soft-thresholding. For that purpose, let denote an orthonormal wavelet basis of , where is the number of data points and
the index set of the wavelet basis. Wavelet soft-thresholding is by now a standard method for signal and image denoising (see, for example, [1, 18, 25, 26, 28, 38, 39, 46, 53] for surveys and some original references). It consists of the following three basic steps:
- [label=0), topsep=0em]
Compute all empirical wavelet coefficients of the given noisy data with respect to the considered orthonormal wavelet basis.
For some threshold , depending on the noise level and the number of data points, apply the nonlinear soft-thresholding function
to each wavelet coefficient of the data. The resulting thresholded coefficients are then considered as estimates for the wavelet coefficients of . Notice, that the soft-thresholding function can be written in the compact form . Further, it sets all coefficients being in magnitude smaller than to zero and shrinks the remaining coefficients towards zero by the value .
The desired estimate for the signal is then defined by the wavelet series of the thresholded empirical coefficients ,
Every step in the above procedure can be computed in operation counts and hence the overall procedure of wavelet soft-thresholding is linear in the number of unknown parameters (see [16, 26, 46]). It is thus not only conceptually simple but also allows for fast numerical implementation. Even simple linear spectral denoising techniques using the FFT algorithm have a numerical complexity of floating point operations. Besides these practical advantages, wavelet soft-thresholding also obeys certain theoretical optimality properties. It yields to an almost optimal mean square error simultaneously over a wide range of function spaces (including Sobolev and Besov spaces) and, at the same time, has a smoothing effect with respect to any of the norms in these spaces. Hence soft-thresholding automatically adapts to the unknown smoothness of the desired signal [25, 27].
Any particular choice of the thresholding parameter is a tradeoff between signal approximation and noise reduction: A large threshold removes much of the noise but also removes parts of the signal. Hence a reasonable threshold choice should be as small as possible under the side constrained that a significant amount of the noise is removed. The smaller the actual threshold is taken, the more emphasis is given on signal representation and the less emphasis on noise reduction. A commonly used threshold is the so called universal threshold as proposed in the seminal work , where the following result is shown.
Theorem 1.1 (Denoising property of wavelet soft-thresholding ).
Suppose that are consistent with an underlying orthonormal wavelet basis on having -times continuously differentiable elements and vanishing moments, that , for , denote point samples of a function and that are constructed by (1.3) with the universal threshold . Then, there exists a special smooth interpolation of producing a function . Further, there are universal constants with as , such that for any Besov space which embeds continuously into (hence ) and for which is an unconditional basis (hence ),
for constants depending on and but neither on nor on .
Theorem 1.1 states that the estimate is, with probability tending to one, simultaneously as smooth as for all smoothness spaces . This result can be derived from the denoising property (see [25, 39, 40] and also Section 3.2)
For an orthonormal basis, the noise coefficients are independently distributed. Hence Equation (1.5) is a consequence from standard extreme value results for independent normally distributed random vectors [21, 44]. Extreme value theory also implies the limiting Gumbel law
uniformly in . This even allows to exactly characterize all thresholding sequences yielding a denoising property like (1.5) with in place of .
In the case that a redundant frame is considered instead of an orthonormal wavelet basis, then the empirical coefficients are no more linear independent and limiting result like (1.6) are much harder to come up with. In this paper we verify that a similar distributional result as in (1.6) holds for a wide class of redundant frames with replaced by the number of frame elements. This class is shown to include non-orthogonal wavelets, curvelet frames and unions of bases (see Theorems 4.4, 4.7 and 4.12). Roughly speaking, the reason is, that the redundancy is of these frames weak enough that it asymptotically vanishes in a statistical sense and the system behaves as an independent system. However, we also an important example (in the form of the translational wavelet system; see Theorem 4.9) which shows that highly redundant systems may show a different asymptotic behaviour.
Our work is motivated by the well known observation that the universal threshold sigma often is found to be too large in applications, hence including too few coefficients into the final estimator (see [2, 27, 46]). This recently has initiated further research on refined thresholding methods and we would like to shed some light on this phenomenon for a large class of frame systems by providing a refined asymptotics as in (1.6) in addition to results of the type (1.5). We also provide a selective review on current thresholding methodology where we focus on the link between statistical extreme value theory and thresholding techniques.
1.2 Frame Soft-Thresholding: Main Results
For any , let denote a frame of , where is a finite index set, that consists of normalized frame elements (that is, holds for all ) and has frame bounds (compare Section 2.1). Our main results concerning thresholding estimation in the frame will hold for asymptotically stable frames, which are defined as follows.
Definition 1.2 (Asymptotically stable frames).
We say that a family of frames with normalized frame elements is asymptotically stable, if the following assertions hold true:
For some , as .
The upper frame bounds are uniformly bounded, i.e., .
The following Theorem 1.3 is the key to most results of this paper. It states, that after proper normalization the distribution of converges to the Gumbel distribution – provided that the frames are asymptotically stable.
Theorem 1.3 (Limiting distribution for asymptotically stable frames).
Assume that is an asymptotically stable family of frames, and let be a sequence of random vectors in with independent -distributed entries. Then, for every ,
The function is known as the Gumbel distribution.
See Section 3.1. ∎
In the case that are orthonormal bases, results similar to the one of Theorem 1.3 follow from standard extreme value results (see, for example, [21, 44]) and are well known in the wavelet community (see, for example, [25, 39, 46]). However, neither the convergence of the maxima including absolute values (which is the actually relevant case) nor the use of redundant systems are covered by these results. In Section 4 we shall verify that many redundant frames, such as non-orthogonal wavelet frames, curvelets and unions of bases, are asymptotically stable and hence the limiting Gumbel law of Theorem 1.3 can be applied. Based on this limiting extreme value distribution we provide an exact characterisation of all thresholds satisfying a denoising property similar to the one of (1.5) for general frame thresholding; see Section 3 for details.
Suppose that is a sequence in converging to , let satisfy and let denote the wavelet soft-thresholding estimate with the threshold
According to Theorem 1.3, the probability that is contained in tends to as . Hence the sets define asymptotically sharp confidence regions around the given data for any significance level ; see Section 3.3 for details.
The proof of Theorem 1.3 relies on new extreme value results for dependent chi-square distributed random variables (with one degree of freedom) which we establish in Section A. In the field of statistical extreme value theory, the following definition is common.
Definition 1.4 (Gumbel type).
A sequence of real valued random variables is said to be of Gumbel type (or to be of extreme value typ I), if there are real valued normalizing sequences and , such that the limit as holds point-wise for all (and therefore uniformly).
Using the notion just introduced, Theorem 1.3 states that is of Gumbel type, with normalizing sequences and , where
As shown in Theorem 3.3, the maxima of without taking absolute values are also of Gumbel type. We emphasize, however, that the corresponding normalizing sequences differ from those required for the maxima with absolute values. Indeed, behaves as the maximum of (opposed to ) independent standard normally distributed random variables; compare with Remark A.6. The different fluctuation behaviour of the maxima with and without absolute values is not resembled by Equation (1.5), which is exactly the same for the maxima with and without absolute values. Only in a refined distributional limit (1.7) this difference becomes visible. Moreover, in the case that the frames are redundant, no result similar to Theorem 1.3 is known at all.
Asymptotical stability typically fails for frames without an underlying infinite dimensional frame. A prototype for such a family is the dyadic translation invariant wavelet transform (see Section 4.1.4). In this case, the redundancy of the translation invariant wavelet system increases boundlessly with increasing , which implies that the corresponding upper frame bounds tend to infinity as . We indeed prove the following counterexample if Condition 2 in Definition 1.2 fails to hold.
Theorem 1.5 (Tighter bound for translation invariant wavelet systems).
Suppose that is a discrete translation invariant wavelet system with unit norm elements generated by a mother wavelet that is continuously differentiable, and let be a sequence of random vectors in with independent -distributed entries. Then, for some constant and all we have
Theorem 1.5 shows that the maximum of the translation invariant wavelet coefficients is strictly smaller (in a distributional sense; see Section 4.1.4) than the maximum of an asymptotically stable frame with elements and therefore the result of Theorem 1.3 does not hold for a translation invariant wavelet system. Moreover, Theorem 1.5 shows that there exists a thresholding sequence being strictly smaller than yields asymptotic smoothness; see Section 4.1.4 for details. This also reveals the necessity of a detailed extreme value analysis of the empirical noise coefficients in the case of redundant frames.
In the following Section 2 we introduce some notation used throughout this paper. In particular, we define the soft-thresholding estimator in redundant frames. The core part of this paper is Section 3, where we proof the asymptotic distribution of the frame coefficients claimed in Theorem 1.3. This result is then applied to define extreme value based thresholding rules and corresponding sharp confidence regions. Moreover, in this section we show that the resulting thresholding estimators satisfy both, oracle inequalities for the mean square error and smoothness estimates for a wide class of smoothness measures. Our proofs require new facts from statistical extreme value theory for the maxima of absolute values of dependent normal random variables that we derive in Section A. Finally, in section in Section 4 discuss in detail several examples, including, non orthogonal frames, biorthogonal wavelets, curvelets and unions of bases in detail
2 Thresholding in Redundant Frames
For the following recall the model (1.1) and write for the noise vector in (1.1). We assume throughout that the variance of the noise is given. Fast and efficient methods for estimating the variance are well known (see, for example, [6, 22, 34, 36] for and  for ).
Throughout this paper all estimates for the signal are based on thresholding the coefficients of the given data with respect to prescribed frames of analyzing elements.
In the sequel denotes a frame of , with being a finite index set. Hence there exist constants , such that
(Here is the Euclidean norm on and the corresponding inner product.) The largest and smallest numbers and , respectively, that satisfy (2.1) are referred to as frame bounds. Notice that in a finite dimensional setting any family that spans the whole space is a frame.
We further denote by the operator that maps the signal to the analyzing coefficients with respect to the given frame,
The mapping is named the analysis operator, its adjoint the synthesis operator, and the frame operator corresponding to .
The frame property (2.1) implies that the frame operator is an invertible linear mapping. Hence, for any , the elements
are well defined and the family is again a frame of . It is called the dual frame and has frame bounds .
Finally, we denote by the pseudoinverse of the analysis operator . Due to linearity and the definitions of the pseudoinverse and the dual frame elements, we have the identities
In particular, the mapping is the synthesis operator corresponding to the dual frame. Equation (2.2) provides a simple representation of the given signal in terms of its analyzing coefficients. This serves as basis of thresholding estimators defined and studied in the following subsection. For further details on frames see, for example, [15, 46].
Remark 2.1 (Thresholding in a subspace).
It is not essential at all, that is a frame of the whole image space . In fact, in typical thresholding applications, such as in wavelet denoising, the space naturally decomposes into a low resolution space having small fixed dimension and a detail space having large dimension that increases with . The soft-thresholding procedure is then only applied to the signal part in the detail space and hence it is sufficient to assume that is a frame therein. In order to avoid unessential technical complication we present our results for the case of frames of the whole image space. In the concrete applications presented in Section 4 the thresholding will indeed only be performed in some subspace; all results carry over to such a situation in a straightforward manner.
2.2 Thresholding Estimation
By applying to both sides of (1.1), the original denoising problem in the signal space is transferred into the denoising problem
in the possibly higher dimensional coefficient space . Here and in the following we denote by
the coefficients of the data and the signal with respect to the given frame. The following elementary Lemma 2.2 states that the noise term in (2.3) is again a centered normal vector but has possibly non-vanishing covariances. Indeed it implies that the entries of are not uncorrelated and hence not independent, unless is an orthogonal basis.
Lemma 2.2 (Covariance matrix).
Let be a random vector in the image space with independent -distributed entries. Then is a centered normal vector in and the covariance matrix of has entries .
As the sum of normal random variables with zero mean, the random variables are again normally distributed with zero mean. In particular, we have . Hence the claim follows from the linearity of the expectation value and the independence of . ∎
Recall the soft-thresholding function defined by Equation (1.2). The thresholding estimators we consider apply to each coefficient of in (2.3) to define an estimator for the parameter . In order to get an estimate for the signal one must map the coefficient estimate back to the original signal domain. This is usually implemented by applying the dual synthesis operator (compare with Equation (2.2)).
Definition 2.3 (Frame thresholding).
The soft-thresholding estimator for using the threshold is defined by
The soft-thresholding estimator for with respect to the frame using the threshold is defined by
Hence the frame soft-thresholding estimator is simply the composition of analysis with , component-wise thresholding, and dual synthesis with .
If is an overcomplete frame, then has infinitely many left-inverses, and the pseudoinverse used in Definition 2.3 is a particular one. In principle one could use other left inverses for defining the soft thresholding estimator (2.6). Since, in general, is outside the range of , the use of a different left inverse will result in a different estimator. The special choice has the advantage that for many frames used in practical applications, the dual synthesis operator is known explicitly and, more importantly, that fast algorithms are available for its computation (typically algorithms using only or even floating point operations ).
Remark 2.4 (Thresholding variations).
Instead of the soft thresholding function several other nonlinear thresholding methods have been proposed and used. Prominent examples are the hard thresholding function and the nonnegative garrote of [5, 30]. Strictly taken, the smoothness estimates derived in Section 3.5 only hold for thresholding functions satisfying the shrinkage property for all . This property is, for example, not satisfied by the nonnegative garrote. In this case, however, similar estimates may be derived under additional assumptions on the signals of interest. Other prominent denoising techniques are based on block-thresholding (see, for example, [7, 8, 13, 35]). In this case, the derivation of sharp smoothness estimates requires extreme value results for dependent -distributed random variables (with more than one degree of freedom). Such an extreme value analysis seems possible but is beyond the scope of this paper. Our given results can be seen as the first step towards a more general theory.
Remark 2.5 (Multiple selections).
Be aware, that we allow certain elements to be contained more than once in the frame . Hence we may have . Such multiple selections often arises naturally for frames that are the union of several bases having some elements in common. A standard example is the wavelet cycle spinning procedure of , where the underlying frame is the union of several shifted orthonormal wavelet bases (see Section 4.1.3). Multiple selections of frame elements also affect the pseudoinverse and finally the soft-thresholding estimator. Hence, if and denote two frames composed by the same frame elements, , but having different cardinalities , then the soft-thresholding estimators corresponding to these frames differ from each other.
2.3 Rationale Behind Thresholding Estimation
We conclude this section by commenting on the underlying rationale behind thresholding estimation and situations where it is expected to produce good results.
The basic assumption underlying thresholding estimation is that the frame operator separates the data into large coefficients due to the signal and small coefficients mainly due to the noise. For additive noise models both issues can be studied separately. In this case, one requires that for some threshold (which finally judges between signal and noise) the following two conditions are satisfied:
- [label=(0), topsep=0.0em]
Coherence between signal and frame: The signal is well represented by few large coefficients having .
Incoherence between noise and frame: With high probability, all noise coefficients with respect to the frame satisfy .
In the following sections we shall see, that Item 2 can be analyzed in a unified way for asymptotically stable frames. Item 1, however, is more an approximation issue rather than an estimation issue. Given a frame, it, of course, cannot be satisfied for every . The choice of a ‘good frame’ depends on the specific application at hand and in particular on the type of signals that are expected. The better the signals of interest are represented by a few but large frame coefficients, the better the denoising result will be. The richer the analyzing family is, the more signals can be expected to be recovered properly. The price to pay must be, of course, a higher computational cost.
The following two simple examples demonstrate how the use of redundant frames may significantly improve the performance of the thresholding estimation.
Example 2.6 (Thresholding in the sine basis).
We consider the discrete signal defined by , which is a superposition of two sine waves having frequencies and , respectively, and amplitudes . The left image in Figure 2.1 shows the signal and the noisy data obtained by adding Gaussian white noise of variance equal to one to the signal. Apparently, there seems little hope to recover from the data in the original signal domain. Almost like a miracle, the situation changes drastically after computing the coefficients with respect to the sine basis . Now, the signal and the noise are clearly separated as can be seen from the right image in Figure 2.1. Obviously we will get an almost perfect reconstruction by simply removing all coefficients below a proper threshold.
Example 2.7 (Thresholding in a redundant sine frame).
The signal in Example 2.6 is a combination of sine waves with integer frequencies covered by the sine frame. However, in practical application the signal may also have non-integer frequencies. In order to investigate this issue, we now consider the signal having frequencies and (hence is a slight perturbation of the frequency considered in Example 2.6). The new signal is not a sparse linear combination of elements of the sine basis. As a matter of fact, the energy of the first sine wave is spread over many coefficients and thus submerges in the noise. Indeed, as can be seen from the left image in Figure 2.2, the low frequency coefficient disappears. However, by taking the two times redundant frame instead of the sine basis, the coefficient due to frequency appears again in the transformed domain. Moreover, as can be seen from Figure 2.3 the reconstruction by thresholding the coefficients with respect to the overcomplete sine frame is almost perfect, whereas the reconstruction by thresholding the basis coefficients is useless.
In Examples 2.6 and 2.7 the threshold choice is not a very delicate issue since the signal and the noise are separated very clearly in the transformed domain. Indeed as can be seen from the right plots in Figures 2.1 and 2.2 there is a quite wide range of thresholds that would yield an almost noise free estimate close to the original signal. However, if the signal also contains important coefficients of moderate size, then the choice of a good threshold is crucial and difficult. This is typically the case for image denoising using wavelets or curvelet frames: Natural images are approximately sparse in these frames but almost never strictly sparse. The particular threshold choice now will always be a tradeoff between noise removal and signal representation and becomes a delicate issue. In order to develop rationale threshold choices, a precise understanding of the distribution of is helpful. This is the subject of our following considerations.
3 Extreme Value Analysis of Frame Thresholding
Now we turn back to the denoising problem (1.1). After application of the analysis operator corresponding to the normalized frame our aim is to estimate the vector from given noisy coefficients (compare with Equation (2.3))
Here is the transformed noise vector which is normally distributed, has zero mean and covariance matrix ; see Lemma 2.2. In this section we shall analyze in detail the component-wise soft-thresholding estimator defined by (2.5). We will start by computing the extreme value distribution of claimed in Theorem 1.3. Based on the limiting law will then introduce extreme value thresholding techniques that will be shown to provide asymptotically sharp confidence regions.
3.1 Proof of Theorem 1.3
The main aim of this subsection is to verify Theorem 1.3, which states that the distribution of the maxima of the noise coefficients are of Gumbel type with explicitly given normalization constants. The proof of Theorem 1.3 will be a consequence of Lemmas 3.1 and 3.2 to be derived in the following. The main Lemma 3.1 relies itself on a new extreme value result established in Section A.
Let be a sequence of normal random vectors in with covariance matrices having ones in the diagonal. Assume additionally, that the following holds:
For every , as .
For some , as .
This will be done by splitting the sum into three parts and showing that each of them tends to zero as . For that purpose, let be any small number, let be as in Condition 2 and define
We further write with
It remains to verify that any of the terms converges to zero as .
Since any is a normal random vector with zero mean and unit variance, we have for any index pair , which yields the inequality . By Condition 2 we have which shows that as .
To estimate the second sum , recall that by definition of the set , we have for any pair of indices . Moreover, recall that by Condition 1 we further have . Hence we obtain
Since by assumption , the inequality holds which implies that we have as .
It remains to estimate the final sum . The Cauchy-Schwarz inequality, Condition 3, and the estimate yield
Now, by assumption the inequality holds and hence we have . This implies that also tends to zero as .
In summary, we have verified that as for every . Hence their sum converges to zero, too. The claimed distributional convergence results now follows from Theorem A.8 and concludes the proof. ∎
We next state a simple auxiliary Lemma that bounds the number of inner products being bounded away from zero.
For any let be a family of normalized vectors in , such that the upper frame bounds are uniformly bounded. Then, for every , we have
To verify (3.1) it is sufficient to find, for every given , some constant such that
To show (3.2) we assume to the contrary that there is some such that for all there exists some and some such that the set contains more then elements. By Assumption we have the equality for all . Together with Assumption 2 this implies
Since the last estimate should hold for all and we have by assumption, this obviously gives is a contradiction. ∎
Proof of Theorem 1.3.
We conclude this subsection by stating an extreme value result for the maximum without the absolute values. Although we do not need this result further, the distributional limit is interesting in its own.
Theorem 3.3 (Limiting Gumbel law without absolute values).
Assume that the frames are asymptotically stable and let be a sequence of random vectors in having independent -distributed entries. Then, the random sequence of the maxima is of Gumbel type with normalization constants
3.2 Universal Threshold: Qualitative Denoising Property
In the case that the family is an orthonormal basis it is well known that the thresholding sequence satisfies the asymptotic denoising property (see, for example, [25, 39, 46] and also Section 1.1 in the introduction), that is,
Equation (3.5) implies that the estimates obtained with the threshold are, with probabilities tending to one, at least as smooth as . Hence the relation (3.5) is often used as theoretical justification for using the universal threshold choice originally proposed by Donoho and Johnstone (see [25, 26]). The following Proposition 3.4 states that the same denoising property indeed holds true for any normalized frame. Actually it proves much more: First, we verify (3.5) for a wide class of thresholds including the Donoho-Johnstone threshold. Second, we show that this class in fact includes all thresholds that satisfy the denoising property (3.5) – provided that the frames are asymptotically stable. Our results can be seen as a generalization and a refinement of [25, Theorem 4.1] from the basis case to the possibly redundant frame case.
Proposition 3.4 (Thresholds yielding the denoising property).
Assume that are frames of having normalized frame elements and analysis operators , and let be a sequence of noise vectors in with independent -distributed entries.
If is asymptotically stable, then a sequence of thresholds satisfies (3.5) if and only if it has the form
for some sequence with .
2 Now let be any sequence of frames that is not necessarily asymptotically stable. Further, let be a sequence of random vectors with independent -distributed entries. Since is a random vector with possibly dependent -distributed entries, Lemma A.9 implies that
According to Proposition 3.4, any sequence with defines a sequence of thresholds (3.6) that satisfies the asymptotic denoising property. In particular, by taking the thresholds in (3.6) reduce to the universal threshold of Donoho and Johnstone. Proposition 3.4 further shows that the asymptotic relation alone is not sufficient for the denoising property (3.5) to hold and that second order approximations have to be considered. One may call a thresholding sequence smaller than , if for . The smaller the thresholding sequence is taken, the slower the convergence of will be, and hence this just yields a different compromise between noise reduction and signal approximation.
3.3 Extreme Value Threshold: Sharp Confidence Regions
For the following notice that the soft-thresholding estimate with thresholding parameter is an element of the -ball