Bayesian wavelet denoising with the caravan prior
Abstract.
According to both domain expertise knowledge and empirical evidence, wavelet coefficients of real signals typically exhibit clustering patterns, in that they contain connected regions of coefficients of similar magnitude (large or small). A wavelet denoising approach that takes into account such features of the signal may in practice outperform other, more vanilla methods, both in terms of the estimation error and visual appearance of the estimates. Motivated by this observation, we present a Bayesian approach to wavelet denoising, where dependencies between neighbouring wavelet coefficients are a priori modelled via a Markov chainbased prior, that we term the caravan prior. Posterior computations in our method are performed via the Gibbs sampler. Using representative synthetic and real data examples, we conduct a detailed comparison of our approach with a benchmark empirical Bayes denoising method (due to Johnstone and Silverman). We show that the caravan prior fares well and is therefore a useful addition to the wavelet denoising toolbox.
Key words and phrases:
Caravan prior; Discrete Wavelet Transform; Gamma Markov chain; Gibbs sampler; Regression; Wavelet denoising2000 Mathematics Subject Classification:
Primary: 62F151. Introduction
1.1. Setup
Let be an unknown function observed on a regularly spaced grid of points in the regression model
(1) 
where , and the noise level is unknown. A popular approach to inference in this model relies on an application of the Discrete Wavelet Transform (DWT) to the data , resulting in the normal means model
(2) 
where are the empirical wavelet coefficients, is the parameter vector of interest formed of the wavelet coefficients of , and are unobservable stochastic disturbances (we provide more details in Section 2). The observations are then denoised using one of the many possible techniques, yielding upon inversion of the wavelet transform the estimates of .
A rationale for a wavelet approach to regression consists in the following (see, e.g., Donoho and Johnstone (1994)): DWT typically ‘sparsifies’ the signal , in that many wavelet coefficients ’s are zero, or nearly so. Since the wavelet decomposition preserves the norm of the signal (Percival and Walden (2000), equation (95d)), this implies that the transformed signal will contain some large coefficients, and a contrast with small coefficients will typically be sharper than in the original signal (cf. Percival and Walden (2000), Section 10.1). On the other hand, due to the orthogonality property of DWT, the noise in the original observations gets spread out ‘uniformly’ in the transformed observations , in that one still has . Hence a small absolute magnitude of an observation is likely to be an indicator of the fact that the corresponding is zero (exactly, or nearly), whereas a large value of likely means that it predominantly consists of the signal . This forms the basis of various wavelet thresholding or shrinkage methods, that produce estimates of ’s by thresholding or shrinking small ’s to zero as containing pure noise, and keeping large ’s (exactly or largely) unchanged (Percival and Walden (2000), Section 10.2). A waveletbased approach to nonparametric regression leads to excellent practical results due to spatial adaptation properties of wavelets (see Donoho and Johnstone (1994)). However, there are situations when other estimators are preferable. This can happen for signals that are better representable in bases other than the wavelet basis (e.g., ‘frequency domain’ signals such as the sinusoid).
1.2. Related work
Within the Bayesian paradigm, the notion of sparsity can be naturally modelled through imposing a sparsityinducing prior distribution on the coefficients . There are two main possibilities to that end. The first is based on discrete mixtures, that model the signal via a combination of a point mass at zero and an absolutely continuous component elsewhere. The corresponding prior is often referred to as the spikeandslab prior (see, e.g., Mitchell and Beauchamp (1988)). In the second approach, absolutely continuous shrinkage priors are used instead; these put a mass around zero and also exhibit heavy tails (see, e.g., Tipping (2001) or Carvalho et al. (2010)). While the former approach leads to a correct representation of sparse estimation problems by placing a point mass at zero, truly sparse solutions are not possibly with the latter; in case they are desired, they require a further device, e.g. some form of thresholding. Nevertheless, point estimates of zero coefficients are still strongly shrunk to zero with shrinkage priors. Also, shrinkage priors are attractive computationally and have been demonstrated to perform well in various circumstances.
Several Bayesian approaches to wavelet denoising are discussed in Percival and Walden (2000), pp. 412–415 and 426–428. However, the method that gained the greatest acclaim in the wavelet denoising context is the empirical Bayes method of Johnstone and Silverman (2005b), which we will refer to as EBayes. EBayes relies on the spikeandslab prior, with its hyperparameters optimised by maximising the marginal likelihood, see Johnstone and Silverman (2004). The simulation studies in Johnstone and Silverman (2005b) demonstrate overall excellent performance of EBayes in wavelet denoising, and Bayesian point estimates resulting from it possess a natural shrinkage property. In fact, the coefficients ’s can even be estimated exactly as zero if the posterior median is used as a point estimate, and in that case the solution to the estimation problem is truly sparse. We thus consider EBayes as a benchmark in this article. This is in line with earlier works using EBayes as a benchmark in the sparse normal means model, see, e.g., Carvalho et al. (2010) and Polson and Scott (2011), who study the horseshoe prior.
1.3. Structured sparsity
It has been observed in the literature that with DWT the sparsification of the signal occurs in a structured manner. By this we mean that nonzero wavelet coefficients tend cluster instead of being scattered in a completely random fashion across the signal ; see, e.g., Section 10.8 in Percival and Walden (2000), or Appendix A, where we have collected several relevant quotes from the literature. Here we illustrate the phenomenon on a simple but representative example (cf. Cai and Silverman (2001)). Consider Figure 1, where we plotted the wavelet coefficients computed from values of the Bumps function (see Donoho and Johnstone (1995)). Donoho and Johnstone view the Bumps function as a stylised example of a spectrum arising, e.g., in nuclear magnetic resonance (NMR) spectroscopy. It is seen from the plot that when arranged according to levels of DWT, the wavelet coefficients with large absolute magnitudes occur in clusters, namely approximately at those locations where the function undergoes abrupt changes. Additionally, many coefficients are quite small or zero.
Given that wavelet coefficients typically exhibit structures beyond ‘mere sparsity’, it appears natural to incorporate in inferential procedures some of their additional features. A closely related question that one may ask is: Does ignoring possible local structures in the signal produce scientifically satisfactory answers? In that respect, the domain expert knowledge in, e.g., audio signal processing indicates that a failure to account for the structure of the signal in denoising applications may result in unacceptable solutions for a human ear; cf. Remark 3 below. Likewise, Donoho (1995) stresses importance of reducing the extent of undesirable noiseinduced structures like ‘ripples’, ‘blips’ and oscillations in the inferred signal, citing geophysical and astronomical studies, where such effects may lead to interpretational difficulties. Somewhat disappointingly, frequentist estimation methods that account for clustering of nonzero wavelet coefficients via block thresholding, such as NeighBlock and NeighCoeff of Cai and Silverman (2001), have been shown to perform worse in practice than EBayes, that does not assume anything beyond sparsity.
In this paper, we propose a Bayesian wavelet denoising method that accounts for existence of structures in wavelet coefficients. We compare it to EBayes and show via simulation and real data examples that our estimator, we baptised the caravan estimator, measures up well to EBayes, often beating it substantially as far as estimation accuracy is concerned (in terms of the squared estimation error). Nevertheless, the caravan estimator does not achieve a uniform improvement (i.e. over all simulation scenarios) upon EBayes.
1.4. Organisation
The rest of the paper is organised as follows: in Section 2 we introduce in detail the statistical problem and our Bayesian methodology to tackle it. Section 3 studies the performance of our method on synthetic data examples and compare it to the main alternative: EBayes. Sections 4 deals with real data examples. Section 5 summarises our findings and outlines directions for future research. In Appendix A a small compendium of quotes from the literature, illustrating some of the points we made in this paper, is presented, Appendix B gives details of the Gibbs sampler we use to evaluate the posterior, while Appendix C and D contain further details on our simulation study.
1.5. Notation
denotes the normal distribution with mean and variance . is the exponential distribution with rate parameter , whose density is for . is the gamma distribution with shape parameter and rate parameter , whose density is
where is the gamma function. The inverse gamma distribution with shape parameter and scale parameter is denoted by . Its density is
In conformance with standard Bayesian notation, we often denote random variables with lowercase letters, such as , and write the corresponding density as . Conditioning of on is denoted by , with standing for the conditional density of given .
2. Methodology
In this section we provide a detailed description of our Bayesian methodology for wavelet denoising.
2.1. Discrete wavelet transform
DWT is an orthogonal transformation applied on a finite dyadic sequence of numbers (that the data length is a dyadic number, say, is a restriction, although there are some ad hoc ways to deal with it; see, e.g., pp. 141–145 in Percival and Walden (2000)). Starting with the data , DWT can be conveniently described through successive applications of special low and highpass filters and (referred to as quadrature mirror filters) in combination with dyadic decimation or downsampling steps; jointly, these constitute the socalled pyramid algorithm. Care has to be exercised when computing DWT coefficients at the boundaries; we use periodic boundary conditions throughout. Define to be the original data , and let be the downsampling operator. Percival and Walden (2000) use odd decimation, retaining oddindexed entries of a given sequence; thus for , say, . This is a matter of convention, and the even decimation would have been an equally valid choice. The scaling coefficients at level are , whereas the wavelet or detail coefficients are given by (the notation stands for circular convolution of with , and similarly for ). Then one proceeds inductively: with and being already defined, one sets and . The process can be either brought to completion, the final processed level being , or stopped at level ; in this last case one talks about a partial DWT (a partial DWT does not require to be a dyadic number: it is enough to have that is an integer multiple of ). For a fixed , and are vectors of length , and can be enumerated as and , respectively, for . The scaling coefficients can be thought of as corresponding to a lowfrequency component of the signal , whereas the wavelet coefficients to the highfrequency components. When stacked together, and constitute an orthogonal transform of the data ; the latter can be easily recovered via the inverse pyramid algorithm. Both DWT and its inverse can be evaluated efficiently in multiplications. Conceptually, the wavelet detail coefficients can be associated with changes in at the scale , i.e., loosely speaking, with differences of averages formed of successive values in . On the other hand, is associated with changes in at scale and higher; in fact, if , is a (rescaled) sample mean of .
Let be a matrix corresponding to DWT applied on data . Then the vector of wavelet and scaling coefficients can be obtained as (analysis equation), and furthermore, due to orthogonality of , (synthesis equation). It holds that
where the matrices , , and are obtained by partitioning into submatrices with the number of rows commensurate with ; cf. Percival and Walden (2000), Sections 4.1 and 4.7. The dimensional vectors are called wavelet details, whereas is referred to as the th level wavelet smooth. Together, and define a multiresolution analysis (MRA) of , which can be synthesised back from these components by a simple addition. The detail corresponds to the portion of synthesis attributable to scale , whereas the smooth can be viewed as a smoothed version of and is associated with changes at scale and higher. See Figure 2 for an illustration of MRA for the Bumps function.
For a detailed exposition of wavelet transforms, the reader may consult any of the numerous reference works on the topic, e.g. Percival and Walden (2000).
2.2. Statistical model
In our regression context, upon applying DWT on the observations , one obtains empirical wavelet coefficients arranged according to levels . Remember also that the signal wavelet coefficients are denoted by .
Following a standard approach in wavelet denoising (originally proposed in Donoho and Johnstone (1994)), we will estimate the error standard deviation by the median absolute deviation (MAD) computed from the finest () level of DWT of the data, i.e. the empirical wavelet coefficients . Intuition underlying this estimate is that the majority of wavelet coefficients of the signal at level will be zero, so that ’s are mostly pure noise; a few outlier nonzero entries will not affect adversely a robust estimate of the error standard deviation such as the MAD. The estimate will be denoted by . In principle, upon equipping with a prior, it is also possible to take a fully Bayesian approach to estimating this parameter. However, as can be seen below, our proposal is simpler, since it allows to infer our primary objects of interest, the wavelet coefficients , level by level in DWT. This is convenient, e.g. because different levels of DWT are expected to have different sparsity degrees, or because such a subdivision of the inference problem into smaller subtasks may speed up the algorithm we propose below. Once we have estimated the wavelet coefficients, we also need the scaling coefficients at level in order to invert DWT and obtain an estimate of the original signal . Following Donoho and Johnstone (1994), it is common to use empirical scaling coefficients computed from the data to that end. Thereby the portion in attributable to a ‘coarse’ scale is automatically classified as signal (Percival and Walden (2000), p. 418). Estimation of scaling coefficients via empirical scaling coefficients admits a Bayesian interpretation: assuming scaling coefficients are a priori independent and equipped with a vague prior, , their posteriors are again normal (conditional on the data and the error variance ), with means equal to empirical scaling coefficients.
The statistical model for level wavelet coefficients of the data is a Gaussian sequence model from (2),
where in order to ease our notation, we have replaced the double index in (2) with a single index (since stays fixed), and have also set . Furthermore, we implicitly assume the wavelet coefficients have been realigned so as to approximately correspond to the output of a zerophase filter (admittedly, though, the statistical impact of the latter was not particularly noticeable in the simulation examples we considered). A filter is called zerophase, if its transfer function is realvalued at Fourier frequencies. This allows to associate with ’s the physically meaningful time scale of the original data (see pp. 108–110 in Percival and Walden (2000)). For Daubechies’ filters (which are the ones used in the present work), a proper alignment can be achieved by circularly shifting the output of a filtering step by a specified amount (depending on the filter and the transform level), as discussed on pp. 146–147 in Percival and Walden (2000).
The likelihood of the data in parameters (with an estimate plugged in instead of ) is
2.3. Prior
Fix hyperparameters , , and assume that a priori
(3) 
The hyperparameters will form an inverse gamma Markov chain, defined as follows (see Cemgil and Dikmen (2007)): fix hyperparameters , let be a sequence of latent variables, and consider a Markov chain
(4) 
with the initial and transition distributions
This definition induces a dependence structure in , and ensures a degree of continuity in the absolute magnitudes of ’s. In fact, as explained in Cemgil and Dikmen (2007), the variables are positively correlated. Thus, e.g., a large value of is likely to go paired with a large value of , which by (3) increases the likelihood of a similar pairing between the absolute magnitudes of and (the latent variables are used to achieve positive correlation between ’s, while retaining computational tractability of the approach; see Cemgil and Dikmen (2007)). In Figure 3 we display one realisation of the sequence from (4). We do not imply that real life signals follow an inverse gamma chain, but simply that the latter provides a computationally convenient means for encoding possible dependencies present in the wavelet coefficients. The hyperparameter controls the amount of smoothing in the gamma chain, with small values corresponding to less smoothing; we assume . For a statistical use of inverse gamma chains outside the sparsity context see, e.g., Gugushvili et al. (2018a), Gugushvili et al. (2018b) and Gugushvili et al. (2018c).
Remark 1.
Note that our construction proceeds via creating dependence between absolute magnitudes of the coefficients . A glance at Figure 1 shows that for stylised signals capturing typical features of real signals, large positive coefficients may very well cluster with large negative coefficients, and in that sense our approach is natural. In fact, a similar pattern can be observed in real signals as well, such as the electrocardiogram data in Figure 127 in Percival and Walden (2000), but there it would have been a stretch of imagination to pretend the observations are noisefree.
The parameters are local shrinkage parameters: each acts individually on , and a small value of encourages shrinkage of towards zero. A different perspective is that this entails modelling the scale parameters with a distribution which has heavier tails than the normal distribution. By linking via a global shrinkage parameter , we introduce a global control on the sparsity level of the sequence . Specifically, we assume
with conditionally independent of other parameters in the model, given . In turn, the hyperparameter is equipped with an independent prior.
By the Markov property and the various independence assumptions we made, the joint prior on , , , , and factorises as
Given the sequential nature of the definition of our prior, we term it the caravan prior, see Figure 4 for a visualisation.
Remark 2.
Our construction of the Markov chain prior is inspired by the inverse gamma Markov chain in Cemgil and Dikmen (2007). However, it is different from the approach there, in that we also employ local shrinkage parameters linked through the global shrinkage hyperparameter . The two sequences and moderate or enhance each other’s effects, and in a way our approach stands halfway between Cemgil and Dikmen (2007) and the more conventional Bayesian approaches to wavelet denoising proposed in the statistical literature. The parameter of the Markov chain prior fulfils a double role: on one hand it governs strength of dependence between realisations of the coefficients ’s; on the other hand, it affects their absolute magnitudes. A large results in a priori strongly dependent ’s, but also encourages them to take large values. The parameters give an additional handle to control absolute magnitudes of ’s, by being decoupled from the dependence structure.
A further important difference of our work from the line of research in Cemgil and Dikmen (2007) and Dikmen and Cemgil (2010) consists in the fact that ours concentrates on the onedimensional wavelet transform, whereas theirs deals with transforms relevant in audio signal processing (e.g., the modified discrete cosine transform, or the Gabor transform). We provide a detailed simulation study of our approach in Section 3, the results and conclusions of which cannot be directly read off Cemgil and Dikmen (2007) and Dikmen and Cemgil (2010). Importantly, we benchmark denoising results against the EBayes method. In short, EBayes a priori postulates that the coefficients where is a heavy tailed density. A Laplace density with scale parameter compares well to other possible choices. The method proceeds by estimating hyperparameters, here and , by maximising the marginal likelihood and then computing empirical Bayes estimates of (using the estimated hyperparameters). This constitutes a straightforward and numerically stable procedure.
Remark 3.
There is a risk with our approach to oversmooth when denoising the signal (oversmoothing here refers to oversmoothing in the wavelet coefficient domain: the estimated curve itself might be actually wigglier). However, as noted in Wolfe et al. (2004b), “by favouring connected regions of nonzero coefficients, it becomes much less probable that an isolated peak in the noise spectrum will mistakenly be modelled as a signal component. As with more heuristic noise suppression schemes, a tradeoff between oversmoothing of the extracted signal and the production of musical noise artefacts is observed (Wolfe and Godsill, 2003).” Donoho and Johnstone (1994) highlight the importance of a visual assessment of the fit after denoising, and recognise existence of “divergence between the usual numerical and visual assessments of quality of fit.”
Remark 4.
The idea of postulating an a priori dependence between coefficients of a sparse signal has already appeared in the statistical literature. Thus, e.g., in the audio signal processing context, Wolfe et al. (2004a) model their parameters with the spikeandslab prior
and impose a Markovian structure on the binary sequence ; independent inverse gamma priors are assigned to the variances . This is different from our approach inasmuch as the spikeandslab prior is different from the shrinkage prior.
We also mention the fact that there is a substantial body of the signal and image processing and compression literature (too numerous to be discussed here), where dependence among wavelet coefficients is exploited in some way. See, e.g., Crouse et al. (1998) and references therein (this paper a priori models wavelet coefficients as discrete mixtures with a hidden state variable, and assumes the hidden states form a Markov chain).
2.4. Gibbs sampler
The posterior for our approach is obtained from the likelihood in Subsection 2.2 and the prior in Subsection 2.3. The posterior inference can be performed via the Gibbs sampler. In fact, as stated in Lemma 1 in Appendix B, all the full conditional distributions in our model, except those of the shrinkage parameters and , belong to standard unimodal families and are easy to sample from. The parameters and can be sampled using MetropoliswithinGibbs steps, as explained in Appendix B.
3. Synthetic data examples
In this section we investigate performance of the caravan prior via representative simulation examples.
3.1. Generalities
We implemented the caravan method in Julia (see Bezanson et al. (2017)). The code is available under Gugushvili et al. (2018). For wavelet transforms we used the wavelets package in R, see Aldrich (2013) (at the moment of writing this paper, the native Julia package for the wavelet transform is still under development), while the plots were produced with the ggplot2 package, see Wickham (2009). Simulations were performed on a Macbook Air with GHz Intel Core i5 processor and GB MHz DDR3 memory, running macOS High Sierra (version ).
Given its excellent behaviour and overall superiority over various competitors, EBayes was employed for benchmarking the caravan estimator. EBayes is implemented in the EbayesThresh package in R, see Johnstone and Silverman (2005a). We used it with settings similar to those in Johnstone and Silverman (2005a) and Johnstone and Silverman (2005b); in particular, an absolutely continuous part of the spikeandslab prior assigned to wavelet coefficients was the Laplace prior with a scale parameter estimated by empirical Bayes, and the posterior mean and median were employed as point estimates. The wavelet transform fed to EBayes was computed via the waveslim package, see Whitcher (2015) (DWT computed by both the wavelets and waveslim packages is identical. However, EbayesThresh does not support the wavelets package; on the other hand, the latter has some functionalities we found useful). Point estimates for the caravan method were the posterior mean and median. Markov chains for the caravan method were ran for iterations ( iterations for the Blocks and HeaviSine signals, see below), with the first third of the samples discarded as a burnin. No thinning was used, but this is of course a possibility. The MetropoliswithinGibbs steps of the caravan method were scaled to ensure acceptance rates in the range of . Hyperparameters used for the caravan prior are given in Appendix C.
Our strategy for generating noisy signals was: Sample a given function on a uniform dyadic grid of points , and add i.i.d. noise to the resulting values. Next, DWT was performed on the noisy data to yield the model (2). The noise standard deviation was set to , with standing for the sample standard deviation. We used two values for the signaltonoise ratio: low and high . Finally, for DWT we used the filter; this choice is often reasonable in practice, see p. 136 in Percival and Walden (2000). The number of levels of the DWT decomposition was . The quality of estimation results with DWT in fact depends on an appropriate choice of the filter, as well as the number of denoised levels of the transform; some practical guidelines for such choices are given in Section 4.11 in Percival and Walden (2000). A mechanical approach to choices such as these cannot be recommended.
As the criterion to assess performance of various wavelet denoising methods, we employed the squared error for an estimate of , that we averaged over replicate simulation runs.
3.2. Test functions
The test functions we considered were the classical test functions named Bumps, Blocks, Doppler and HeaviSine (see Donoho and Johnstone (1995)), that reproduce stylised features of signals encountered in various applications; all the expressions are collected in Appendix D. In comparison to the original definitions, we rescaled the test functions, so that the signal in each case had the standard deviation . We plot the (rescaled) functions in Figure 5.
3.3. Standard discrete wavelet transform
We report estimation errors for the DWT (averaged over independent simulation runs) in Tables 1 and 2. It is seen from the tables that the caravan method does substantially better than EBayes for the Bumps and Doppler signals. EBayes is slightly better for the HeaviSine signal in the low noise setting, while the caravan estimator does a slightly better job in the high noise setting; however, the differences are quite small in both cases. The results are also equally split for the Blocks signal: the caravan estimator is better in the low noise setting, while EBayes in the high noise setting. Overall performance of the caravan prior is arguably superior to that of EBayes; even in those cases when the latter has a smaller estimation error, it never manages to beat the caravan estimator by a wide margin. In terms of computational time, denoising a single data set with the caravan method takes ca. minutes (when the Gibbs sampler is run for iterations), which is reasonable on its own terms; EBayes is substantially faster, though.
Low noise  High noise  

Method  bmp  blk  bmp  blk 
Caravan (mean)  3.90  3.52  20.99  19.41 
Caravan (median)  3.92  3.57  21.33  20.25 
EBayes (mean)  4.85  3.78  22.80  18.76 
EBayes (median)  5.59  4.26  25.89  20.55 
Low noise  High noise  

Method  dpl  heavi  dpl  heavi 
Caravan (mean)  1.78  1.22  8.37  3.99 
Caravan (median)  1.80  1.26  8.69  4.15 
EBayes (mean)  2.88  1.20  11.95  4.28 
EBayes (median)  3.26  1.19  13.04  4.04 
It is instructive to display estimation results in one simulation run for the Doppler signal (). See Figure 6 for the noisy signal and denoising results. The caravan estimate manages to pick up the high frequency oscillations of the signal in a neighbourhood of zero noticeably better than EBayes does. This is especially apparent from the plot of absolute deviations of both estimates from the Doppler function, and constitutes a remarkable achievement.
To highlight one advantage of the caravan estimator over EBayes, we considered the following simulation experiment: in the setting, we artificially increased measurement errors for two data points of the Bumps function in places where it is flat, in fact zero; the indices of the points were and . Denoising results are reported in Figure 7. It is seen from the plots that among the two methods, caravan visually fares the best, in that it is the least affected by spurious peaks in the reconstructed curve due to unusually large noise on two observations. In that respect it is instructive to compare, e.g., the level wavelet coefficients for EBayes, caravan estimate, Bumps function, and noisy data; see Figure 8. As seen from that figure, two purely noiseaffected empirical wavelet coefficients pass the EBayes shrinkage virtually unscathed, while they are dealt a serious blow by the caravan method.
Remark 5.
In relative terms, in comparison to EBayes, the Blocks and HeaviSine functions are the most difficult to denoise with the caravan prior. Both functions are characterised by presence of discontinuities, which is a plausible reason for a somewhat worse performance of the caravan prior in these examples. In our experience, withinlevel dependence of wavelet coefficients, that characterises the caravan prior, appears to work less successfully when estimating the signal in a neighbourhood of a discontinuity point; conversely, in some simulation runs the caravan method was able to pick up discontinuities in a signal better than EBayes, but was then unable to perform denoising as well as EBayes did in those regions where the signal was smooth. A better handling of signals with discontinuities via the caravan prior would require additional modelling of intrascale dependence of wavelet coefficients (this refers to the fact that large or small values of wavelet coefficients tend to propagate across different levels of the transform, see Section in Percival and Walden (2000); for a visualisation, see, e.g., Figure 1), which, however, lies outside the scope of the present paper.
Remark 6.
In our experience, it is advisable to use longer Markov chain runs with the caravan prior in order to avoid visually unpleasant squiggles in denoised curves, which in reality are solely due to the fact that the chains were not run long enough and have not reached stationarity. Hence our decision to run the chains for or even iterations (the latter is likely to be excessive in many scenarios). Giving concrete recommendations in the present context is a difficult task, as convergence of the chains depends on factors like the nature of the underlying signal, the number of observations and the signaltonoise ratio. As one natural check, however, one can produce trace and autocorrelation plots for the hyperparameters , as well as for some of the coefficients ’s. See Appendix E for such plots for the Doppler signal denoising that we considered above in Figure 6.
An advantage of the caravan prior is the relative simplicity of the update formulae in the Gibbs sampler (see Appendix B). However, this simplicity comes at a price: at each step of the sampler, only one parameter can be updated at a time, which slows down the mixing of the Markov chain for the full posterior, that is defined on a rather highdimensional parameter space. Potentially, this may have repercussions on scalability of the method when applied on large data. See also the relevant remarks in Cemgil et al. (2007) on a related Markov chain prior.
3.4. Maximal overlap discrete wavelet transform
It has been demonstrated in, among others, Coifman and Donoho (1995), that using the translationinvariant discrete wavelet transform for signal denoising instead of the standard DWT often leads to better practical results, either in terms of the squared error, or visually. Unlike the standard DWT, for a data sequence of length , each level of the translationinvariant transform contains wavelet coefficients, since it does not use downsampling. We specifically restrict our attention to the maximal overlap discrete wavelet transform (MODWT), see, e.g., Chapter 5 in Percival and Walden (2000).
MODWT is highly redundant and nonorthogonal. When the data size is a dyadic number, coefficients of DWT can be extracted from those of MODWT by a suitable scaling and downsampling. Furthermore, one can extract from MODWT the coefficients of DWTs of all possible cyclic shifts of the data; see Comments and Extensions to Section 5.4 in Percival and Walden (2000), p. 174. Computational complexity of MODWT and its inverse (due to its redundancy, MODWT has no unique inverse; the one we have in mind is given in Percival and Walden (2000), and on an abstract level can be described in terms of the MoorePenrose inverse, cf. p. 167 there), when evaluated via the pyramid algorithm, is multiplications, which is somewhat slower than that for DWT, but still fast (in fact as fast as the Fast Fourier Transform). Unlike DWT, that requires the number of observations be a dyadic number, no such assumption is needed for MODWT. In theory, the number of MODWT levels can be arbitrarily large (unlike DWT); however, if is a dyadic integer, MODWT yields no extra information beyond the level , which hence can be taken as a maximal decomposition level for MODWT. See Figure 9 for a visualisation of MODWT for the Bumps function.
Because of a lack of orthogonality, when applied on noisy data, the MODWT empirical wavelet coefficients will be statistically dependent. On the other hand, MODWT allows one to mitigate sensitive dependence of the standard DWT on the starting position of the data sequence (which is entirely due to downsampling used in DWT). In fact, the MODWTbased denoising essentially performs averaging of results over all possible cyclic shifts of the data (here ‘all possible’ means shifts by units), that may allow a better reconstruction of the essential features of the signal and reduce noiseinduced artefacts. See Percival and Walden (2000), Comments and Extensions to Section 5 (pp. 429–431), for a succinct description of statistical applications of MODWT.
When performing comparison of EBayes and caravan estimates, we used the settings similar to those in Section 3. In particular, the sample size was . We employed the filter and the periodic boundary conditions. The number of levels of MODWT was . Some guidelines on practicalities such as these are given in Section 5.11 in Percival and Walden (2000). Finally, separately for each level of MODWT, we estimated the error standard deviation by the MAD estimate computed from the empirical wavelet coefficients of that level. It should be clear such estimates cannot be expected to lead to necessarily good results in all cases, if only because the sparsity degree of MODWT (or DWT) coefficients typically decreases for coarser levels of the transform, whereas the nonzero coefficients tend to become larger (cf. also the remarks on p. 450 in Percival and Walden (2000)). Hence our decision to denoise only levels of MODWT coefficients.
Remark 7.
In the case the sample size is a dyadic number, by simple algebra that relies on the fact that DWT coefficients are rescaled and downsampled MODWT coefficients (see Percival and Walden (2000), equations (96d) and (169a), and page 152), an estimate of the error variance can be derived as . Here can be obtained via MAD applied on the first level of MODWT. However, at the moment of writing this paper such an option is not envisioned for EBayes in the EBayesThresh package, which is a primary reason why we did not employ it in our comparison.
Estimation results on the same synthetic data as in Subsection 2.1 are reported in Tables 3 and 4. A comparison with Tables 1 and 2 (that displayed the results for DWT) shows that MODWT substantially improves estimation accuracy of both the caravan and EBayes methods, except for the HeaviSine signal. The caravan method does better than EBayes for the Bumps, Blocks and Doppler signals. The results are indecisive for the HeaviSine function, with either method better than another in different noise settings. Overall performance of the caravan prior is superior to that of EBayes. In terms of computational time, denoising a single data set with caravan method takes ca. minutes, which is noticeably slower than EBayes.
Remark 8.
The fact that in some scenarios MODWT denoising performs worse than DWT denoising does not contradict earlier simulation studies in Coifman and Donoho (1995) and Johnstone and Silverman (2005b): DWT and translationinvariant DWT there differ in details from the implementations used by us (that are based on Percival and Walden (2000)). Most importantly, we use a different error variance estimator in the MODWT case.
Low noise  High noise  

Method  bmp  blk  bmp  blk 
Caravan (mean)  3.22  2.91  15.62  16.21 
Caravan (median)  3.20  2.88  15.28  16.85 
EBayes (mean)  3.57  3.03  17.34  17.68 
EBayes (median)  3.88  3.19  18.45  19.35 
Low noise  High noise  

Method  dpl  heavi  dpl  heavi 
Caravan (mean)  1.52  1.16  7.50  5.14 
Caravan (median)  1.47  1.12  7.26  4.85 
EBayes (mean)  2.00  1.19  9.31  4.53 
EBayes (median)  2.05  1.19  9.48  4.35 
4. Nuclear magnetic resonance data
In this section we apply our denoising methodology on the nuclear magnetic resonance (NMR) spectrum, that constitutes a standard test data set for wavelet denoising algorithms.^{1}^{1}1We downloaded the data from Donald B. Percival’s website at http://faculty.washington.edu/dbp/s530/ (accessed on 28 June 2018). There are observations in total, that we display in the top panel of Figure 10. We followed Section in Percival and Walden (2000), and used the filter to compute DWT. Percival and Walden denoise levels of the transform; an MRA plot of the data set, see Figure 11, suggests that denoising levels of the transform might be enough. A plot of the DWT coefficients, see Figure 12, indicates that there are some small wavelet coefficients present at level too, but we opted to leave the levels as such.
In visualising denoising results, we used posterior medians as our point estimates (we produced larger plots to clearly highlight differences between the estimates). The Markov chain for the caravan prior was run for iterations, with the first third of samples dropped as a burnin. Both caravan and EBayes estimates remove a substantial amount of noise from the data, see Figure 10. However, visually the caravan reconstruction appears to be more regular than EBayes. One established way to measure efficacy of a denoising procedure in this context is to determine which of the methods better maintains the peaks of the curve; these peaks contain important information on the tissue from which the sample arose. We can compare the heights of the highest peak, cf. Johnstone and Silverman (2005b), p. 1719, and Percival and Walden (2000), p. 430. In that respect, the caravan estimate yielded the peak height , while EBayes the peak height . The latter method was hence worse than its competitor (to put things in perspective, the original noisy data had the peak height ).
We also applied MODWT for denoising (with levels), cf. Percival and Walden (2000), Comments and Extensions to Section 10.5. The results are reported in Figure 13. Both methods are even more successful in removing the noise. Concerning the highest peak, with the peak height , the caravan estimate marginally outperformed EBayes, that yielded the peak height . Note also how the second sharp peak to the left of the highest peak is much lower in the EBayes estimate, unlike in the caravan estimate. On the other hand, the caravan estimate shows some small squiggles near and , that are absent in the EBayes estimate; this is similar to the hard thresholding estimate (with universal threshold; see, e.g., Percival and Walden (2000), pp. 400–402) in Figure of Percival and Walden (2000). We reproduce that plot in the bottom panel of Figure 13; note the appearance of an additional squiggle near . Finally, a wavelike behaviour of both estimates over the time interval is due to our decision to denoise only levels of the transform. These waves can be largely flattened out by denoising a level MODWT, but that would have diminished even further the heights of the sharp peaks.
Summarasing, each method appears to have its own advantages on this challenging real data set.
5. Discussion
In this paper we studied a Bayesian approach to wavelet denoising via a prior relying on the inverse gamma Markov chain (cf. Cemgil and Dikmen (2007)). Various types of Markov chain priors have been used for denoising purposes in several references, but to the best of our knowledge, our paper is the first thorough comparative study of the performance of this kind of a prior. In particular, we benchmarked our method against a popular empirical Bayes procedure of Johnstone and Silverman (2005b).
Our method, which we call the caravan, strikes a good balance between conceptual simplicity and computational feasibility. Specifically, the posterior inference can be performed via a straightforward version of the Gibbs sampler. In the synthetic data examples that we considered, the method measures up well to EBayes, often substantially outperforming it in terms of the squared estimation error. The improvement brought by the caravan method comes thanks to the fact that it takes into account some of the local structures empirically observed in wavelet coefficients of real life signals. However, the caravan method does not achieve a uniform improvement (i.e. over all simulation scenarios) upon EBayes, which can be taken as indication of a general excellence of the latter, rather than of a failure of the former. In particular, in our simulations the caravan prior seemed to be somewhat worse than EBayes at handling signals with jump discontinuities; on the other hand, there are many real life signals where presence of jumps is not to be expected.
On purely visual grounds, the caravan estimator appeared to be less prone to display artefacts in its reconstructions that are due to unusually large noise peaks. As far as the computational time is concerned, since the caravan estimator is evaluated via an MCMC algorithm (Gibbs sampler), its computation is slower than that of EBayes, although the method is still reasonably fast. However, we should also point out the fact that in some cases the conceptual simplicity of our Gibbs sampler can turn out to be a disadvantage. This is due to the fact that in each iteration of our Gibbs sampler, only one parameter is updated at a time. As a result, when denoising a large data set, exploration of the highdimensional posterior surface and convergence of the resulting Markov chain to stationarity might be slow. This would necessitate the use of long Markov chain runs, and hence imply an increase in the computational time.
We believe that our contribution adds a valuable Bayesian technique to the wavelet, or more generally the nonparametric regression toolbox. Our hope is that the present contribution provides sufficient motivation for further study of the caravan method, a task that we ourselves plan to address in subsequent research. A natural question in this context, that we do not address in the present work, is: what about asymptotic statistical theory for the caravan prior? Such work in the spirit of Ghosal and van der Vaart (2017) has been done for the horseshoe prior in van der Pas et al. (2014) and van der Pas et al. (2017). This is a problem we would very much like to study in another work.
Appendix A
In this appendix we present quotes from the image and signal processing and statistics literature, evidencing awareness of the need to model explicitly the local structure of the signal in wavelet denoising applications.

“Wavelets are known for their excellent compression and localization properties. In very many cases of interest, information about a function is essentially contained in a relatively small number of large coefficients. Figure displays the wavelet coefficients of the wellknown test function Bumps (Donoho and Johnstone, 1994). It shows that large coefficients come as groups; they cluster around the areas where the function changes significantly.
This example illustrates the motivation for our methods – a coefficient is more likely to contain signal if neighbouring coefficients do also. Therefore when the observations are contaminated with noise, estimation accuracy might be improved by incorporating information on neighbouring coefficients.” (Cai and Silverman (2001)).

“The use of priors that can capture the dependence between the coefficients of the representation is a more delicate problem which involves expert a priori knowledge. …Many empirical studies have concluded that the wavelet coefficients (even if the transformation is maximally decimated) of natural images are strongly dependent.” (Moulines (2004)).

“It turns out that termbyterm sparsity is usually not enough to obtain stateoftheart results both for denoising and inverse problems involving natural images. Indeed, wavelet coefficients of images are not only sparse, they typically exhibit local dependencies among neighboring coefficients. Geometric features (edges, textures) are poorly sparsified by isotropic multiscale decompositions and create such dependencies.” (Peyré and Fadili (2011)).
Appendix B
b.1. Full conditionals
Lemma 1.
Define
The following facts hold:

The full conditionals for , , are

The full conditionals for , , are

The full conditional for is

The full conditional for is

The full conditionals for , , are

The full conditional for is

The full conditional for is
Hence, up to an additive constant independent of , the logarithm of the full conditional is proportional to

The full conditional for is
Hence, up to an additive constant independent of , the logarithm of the full conditional is proportional to
The proof of the lemma is lengthy but elementary, and is omitted. The Gibbs sampler cycles through the above update formulae to generate approximate samples from the posterior.
b.2. MetropoliswithinGibbs for updating and
Here we outline the MetropoliswithinGibbs step to update the hyperparameters and within the Gibbs sampler described in the previous subsection. We consider the case of , and note that can be treated in the same manner. Reparametrise as , and observe that if is the full conditional of , the full conditional of is . It is enough to sample : samples can then be obtained by simple exponentiation. Given a current value , we propose a move
where is a tuning parameter and ; this is a Gaussian random walk proposal. The acceptance probability is computed as
The move is accepted, if , where is independent of ; otherwise the chain stays in . The acceptance rate is controlled by the parameter .
Appendix C
Here we give the hyperparameters for the caravan prior used in our synthetic data examples:
These can be viewed as noninformative. We note that different choices may be appropriate in settings other than those considered by us.
The scaling parameters for the MetropolisHastings steps within our Gibbs sampler were set to and , where was the number of coefficients in a given level of the wavelet transform ( corresponds to the hyperparameter , while to the hyperparameter ). The constants and can vary per case of the underlying signal: we used the values and .
Appendix D
Here we recall definitions of the test functions we used in our synthetic data examples.
d.1. Bumps
The Bumps function, also introduced as a motivating example in Subsection 1.3, is given by
where and
d.2. Blocks
The Blocks function is defined as
where , are as for the Bumps function, and
According to Donoho and Johnstone (1995), the Blocks function caricatures the acoustic impedance of layered medium in geophysics, as well as onedimensional profiles along images in certain image processing applications.
d.3. Doppler
The Doppler function is a sinusoid with a changing amplitude and frequency,
d.4. HeaviSine
The HeaviSine function is a sinusoid with two jumps,
Appendix E
Here we present several trace, autocorrelation and density plots of posterior samples for denoising the Doppler signal, that we studied in Section 3 (see in particular Figure 6 there).
We display these plots in Figures 14 and 15 for wavelet coefficients from level of DWT (this choice is arbitrary). Plots were produced with the ggmcmc package in R, see Fernándezi Marín (2016). The figures suggest that the chains are rather wellbehaved and mixing.
On the other hand, the chains for the global hyperparameters and (again corresponding to level of DWT) are mixing somewhat slower, see Figures 16 and 17. Large amounts of data and long chain runs are required for precise learning of the hyperparameters and . To improve mixing of these chains, one might consider updating and via the Metropolisadjusted Langevin algorithm rather than the random walk MetropolisHastings algorithm. However, it appears that even with somewhat less accurate knowledge of and , wavelet coefficients can still be inferred efficiently. Finally, we note that mixing of Markov chains for coarser levels of DWT is in general faster than for finer levels (plots not reported here).
References
 Aldrich (2013) Aldrich, E. (2013). wavelets: A package of functions for computing wavelet filters, wavelet transforms and multiresolution analyses. R package version 0.30.
 Bezanson et al. (2017) Bezanson, J., Edelman, A., Karpinski, S. and Shah, V. B. (2017). Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98.
 Cai and Silverman (2001) Cai, T. T. and Silverman, B. W. (2001). Incorporating information on neighbouring coefficients into wavelet estimation. Sankhyā Ser. B 63(2), 127–148. Special issue on wavelets.
 Carvalho et al. (2010) Carvalho, C. M., Polson, N. G. and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika 97(2), 465–480.
 Cemgil and Dikmen (2007) Cemgil, A. T. and Dikmen, O. (2007). Conjugate gamma Markov random fields for modelling nonstationary sources. In M. E. Davies, C. J. James, S. A. Abdallah and M. D. Plumbley, eds., Independent Component Analysis and Signal Separation: 7th International Conference, ICA 2007, London, UK, September 9–12, 2007. Proceedings, pp. 697–705. Springer Berlin Heidelberg, Berlin, Heidelberg.
 Cemgil et al. (2007) Cemgil, A. T., Févotte, C. and Godsill, S. J. (2007). Variational and stochastic inference for Bayesian source separation. Digit. Signal Process. 17(5), 891–913. Special Issue on Bayesian Source Separation.
 Coifman and Donoho (1995) Coifman, R. R. and Donoho, D. L. (1995). Translationinvariant denoising. In A. Antoniadis and G. Oppenheim, eds., Wavelets and Statistics, pp. 125–150. Springer New York, New York, NY.
 Crouse et al. (1998) Crouse, M. S., Nowak, R. D. and Baraniuk, R. G. (1998). Waveletbased statistical signal processing using hidden Markov models. IEEE Trans. Signal Process. 46(4), 886–902.
 Dikmen and Cemgil (2010) Dikmen, O. and Cemgil, A. T. (2010). Gamma Markov random fields for audio source modeling. IEEE Trans. Audio, Speech, Language Process. 18(3), 589–601.
 Donoho (1995) Donoho, D. L. (1995). Denoising by softthresholding. IEEE Trans. Inf. Theory 41(3), 613–627.
 Donoho and Johnstone (1994) Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455.
 Donoho and Johnstone (1995) Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90(432), 1200–1224.
 Fernándezi Marín (2016) Fernándezi Marín, X. (2016). ggmcmc: Analysis of MCMC samples and Bayesian inference. J. Stat. Softw. 70(9), 1–20.
 Ghosal and van der Vaart (2017) Ghosal, S. and van der Vaart, A. (2017). Fundamentals of nonparametric Bayesian inference, volume 44 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.
 Gugushvili et al. (2018) Gugushvili, S., van der Meulen, F., Schauer, M. and Spreij, P. (2018). Code accompanying the paper “Bayesian wavelet denoising with the caravan prior”. Zenodo, 10.5281/zenodo.1460346.
 Gugushvili et al. (2018a) Gugushvili, S., van der Meulen, F., Schauer, M. and Spreij, P. (2018a). Fast and scalable nonparametric Bayesian inference for Poisson point processes. ArXiv, 1804.03616.
 Gugushvili et al. (2018b) Gugushvili, S., van der Meulen, F., Schauer, M. and Spreij, P. (2018b). Nonparametric Bayesian volatility estimation. ArXiv, 1801.09956.
 Gugushvili et al. (2018c) Gugushvili, S., van der Meulen, F., Schauer, M. and Spreij, P. (2018c). Nonparametric Bayesian volatility learning under microstructure noise. ArXiv, 1805.05606.
 Johnstone and Silverman (2005a) Johnstone, I. and Silverman, B. (2005a). EbayesThresh: R programs for empirical Bayes thresholding. J. Stat. Softw. 12(8), 1–38.
 Johnstone and Silverman (2004) Johnstone, I. M. and Silverman, B. W. (2004). Needles and straw in haystacks: empirical Bayes estimates of possibly sparse sequences. Ann. Statist. 32(4), 1594–1649.
 Johnstone and Silverman (2005b) Johnstone, I. M. and Silverman, B. W. (2005b). Empirical Bayes selection of wavelet thresholds. Ann. Statist. 33(4), 1700–1752.
 Mitchell and Beauchamp (1988) Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. J. Amer. Statist. Assoc. 83(404), 1023–1036. With comments by James Berger and C. L. Mallows and with a reply by the authors.
 Moulines (2004) Moulines, E. (2004). Discussion on the meeting on ‘Statistical approaches to inverse problems’. J. R. Statist. Soc. B 66(3), 628–630.
 Percival and Walden (2000) Percival, D. B. and Walden, A. T. (2000). Wavelet methods for time series analysis, volume 4 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.
 Peyré and Fadili (2011) Peyré, G. and Fadili, J. (2011). Group sparsity with overlapping partition functions. In 2011 19th European Signal Processing Conference, pp. 303–307.
 Polson and Scott (2011) Polson, N. G. and Scott, J. G. (2011). Shrink globally, act locally: sparse Bayesian regularization and prediction. In Bayesian statistics 9, pp. 501–538. Oxford Univ. Press, Oxford. With discussions by Bertrand Clark, C. Severinski, Merlise A. Clyde, Robert L. Wolpert, Jim E. Griffin, Philiip J. Brown, Chris Hans, Luis R. Pericchi, Christian P. Robert and Julyan Arbel.
 Tipping (2001) Tipping, M. E. (2001). Sparse Bayesian learning and the Relevance Vector Machine. J. Mach. Learn. Res. 1, 211–244.
 van der Pas et al. (2017) van der Pas, S., Szabó, B. and van der Vaart, A. (2017). Uncertainty quantification for the horseshoe (with discussion). Bayesian Anal. 12(4), 1221–1274. With a rejoinder by the authors.
 van der Pas et al. (2014) van der Pas, S. L., Kleijn, B. J. K. and van der Vaart, A. W. (2014). The horseshoe estimator: posterior concentration around nearly black vectors. Electron. J. Stat. 8(2), 2585–2618.
 Whitcher (2015) Whitcher, B. (2015). waveslim: Basic wavelet routines for one, two and threedimensional signal processing. R package version 1.7.5.
 Wickham (2009) Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. SpringerVerlag New York.
 Wolfe et al. (2004a) Wolfe, P. J., Godsill, S. J. and Ng, W.J. (2004a). Bayesian variable selection and regularization for timefrequency surface estimation. J. R. Stat. Soc. Ser. B Stat. Methodol. 66(3), 575–589.
 Wolfe et al. (2004b) Wolfe, P. J., Godsill, S. J. and Ng, W.J. (2004b). Discussion on the meeting on ‘Statistical approaches to inverse problems’. J. R. Statist. Soc. B 66(3), 646–648.