
# Asymptotically exact data augmentation: models, properties and algorithms

## Abstract

Data augmentation, by the introduction of auxiliary variables, has become an ubiquitous technique to improve convergence properties, simplify the implementation or reduce the computational time of inference methods such as Markov chain Monte Carlo ones. Nonetheless, introducing appropriate auxiliary variables while preserving the initial target probability distribution and offering a computationally efficient inference cannot be conducted in a systematic way. To deal with such issues, this paper studies a unified framework, coined asymptotically exact data augmentation (AXDA), which encompasses both well-established and more recent approximate augmented models. In a broader perspective, this paper shows that AXDA models can benefit from interesting statistical properties and yield efficient inference algorithms. The pillar of such models is a Dirac delta-converging sequence which ensures the recovery of the initial target density in a limiting case. In non-asymptotic settings, the quality of the proposed approximation is assessed with several theoretical results. The latter are illustrated on standard statistical problems. Supplementary materials including computer code for this paper are available online.

Keywords: Approximation, auxiliary variables, divide-and-conquer, Bayesian inference, robustness.

## 1 Introduction

Starting at least from the 1950s with the seminal paper of Hartley (1958) on the expectation-maximization (EM) algorithm, introducing auxiliary variables has been a widely adopted strategy to derive iterative algorithms able to deal with possibly complicated inference problems. Indeed, whether originating from statistical physics (Swendsen and Wang 1987) or from the broad statistical community (Dempster et al. 1977), auxiliary (also called latent) variables have been used to improve (Duane et al. 1987; Edwards and Sokal 1988; Marnissi et al. 2018) and/or simplify (Tanner and Wong 1987; Doucet et al. 2002) inference methods, such as maximum likelihood (ML) estimation or simulation-based approaches. Insightful reviews of these methods were conducted by Besag and Green (1993); van Dyk and Meng (2001); Tanner and Wong (2010). Among many others, slice sampling and half-quadratic (HQ) methods are archetypal instances of such auxiliary variable-based methods. By introducing auxiliary variables, these methods appear to be an interesting alternative when sampling cannot be performed directly from a target distribution. Nonetheless, the superiority of simulation-based algorithms based on data augmentation (DA) over classical Markov chain Monte Carlo (MCMC) methods without DA is not obvious, as pointed out by Polson (1996); Damien et al. (1999). DA methods have been found to be slower than single-site update approaches in some cases (Hurn 1997), and some improvements have been derived to cope with these problems, such as partial decoupling (Higdon 1998) or the introduction of a working parameter (Meng and van Dyk 1997). Moreover, DA techniques are often designed on a case-by-case basis (Geman and Reynolds 1992; Albert and Chib 1993; Geman and Yang 1995; Polson et al. 2013) and cannot be applied in general scenarios due to the absence of exact DA schemes yielding an efficient inference at low computational cost.

Similarly to approximate Bayesian computation (ABC) methods, which circumvent intractable likelihoods (Beaumont et al. 2002; Sisson et al. 2018b), these limitations can be tackled by considering approximate DA schemes that become exact asymptotically. For instance, inspired by the variable splitting technique used in the alternating direction method of multipliers (ADMM) (Boyd et al. 2011), Vono et al. (2019) and Rendell et al. (2018) recently and independently proposed a novel and broad Bayesian inference framework that circumvents the limitations of exact DA approaches. By introducing a collection of instrumental (also called splitting) variables, the aforementioned authors considered the inference from an approximate probability distribution which can be simpler, more efficient and distributed over multiple computational workers (e.g., machines or cores).

This paper aims to investigate in depth a broad framework, coined asymptotically exact data augmentation (AXDA), which encompasses previously proposed special instances such as the approximate models used in Vono et al. (2019); Rendell et al. (2018), among others. More precisely, Section 2 details how such models can be built in a quasi-systematic and simple way, which is highly appreciable compared to the case-by-case search for computationally efficient DA schemes. In Section 3, we revisit some already-proposed special instances of AXDA models in order to show the potential benefits of AXDA on specific examples and to exhibit interesting properties which are generally inherited by AXDA approaches. In Section 4, we assess quantitatively the bias of AXDA models with non-asymptotic theoretical results by considering Wasserstein and total variation distances. Then, Section 5 illustrates the previous theoretical results and the benefits of the proposed methodology on several statistical problems. In order to facilitate the use of AXDA, we eventually point out that the supplementary material includes a dedicated section presenting how such models can be instantiated to perform efficient inference through classical simulation-based, variational Bayes (VB), optimization or expectation-maximization (EM) methods. The proofs are also given in the supplementary material.

## 2 Asymptotically exact data augmentation

This section introduces AXDA schemes that aim to circumvent the main issue of exact DA: the art (van Dyk and Meng 2001) of finding the exact DA scheme associated to a statistical model, and its inference limitations. For the sake of simplicity, with a slight abuse of notation, we shall use the same notation for a probability distribution and its associated probability density function (pdf).

### 2.1 Motivations

In this paper, we are interested in performing the inference of a variable of interest θ ∈ Θ, where Θ ⊆ ℝ^d is a closed convex set and d ≥ 1, by relying on a probability distribution with density written as

 π(θ) ∝ exp(−f(θ)), (1)

where the potential f, taking values in ℝ, is such that (1) defines a proper, bounded and continuous probability distribution. For the sake of generality, note that π in (1) may describe various quantities. First, with a slight abuse of notation, π may simply refer to a pdf associated to the random variable θ, e.g., its prior distribution or its posterior distribution given a set of observations denoted by y. Depending on the problem, we also allow π to stand for a likelihood function. We will work under this convention and write explicitly the form of π when required. For the sake of simplicity and clarity, only the case where π is a pdf associated to θ will be detailed in this section. The application of the proposed methodology to a likelihood function is very similar and can be retrieved by a straightforward derivation.

We consider situations where direct inference from (1) is difficult because it is intractable or computationally prohibitive. To overcome these issues, an option is to rely on exact DA, which introduces some auxiliary variables stacked into a vector z ∈ Z and defines a new density π(θ, z), simpler to handle, such that

 ∫_Z π(θ, z) dz = π(θ). (2)

Much research has been devoted to these models in order to simplify an inference task or to improve the convergence properties of direct inference approaches (e.g., slice sampling and HQ methods introduced in Section 1). Nonetheless, these approaches have several limitations. Indeed, finding a convenient form for the augmented density in order to satisfy (2) while leading to efficient algorithms generally requires substantial know-how and can even be impossible in some cases (Geman and Yang 1995). For instance, the mixture representation of a binomial likelihood function based on the Polya-Gamma distribution has been used to derive a promising Gibbs sampler for logistic regression problems (Polson et al. 2013). Nonetheless, even though this algorithm has been proved to be uniformly ergodic by Choi and Hobert (2013), the corresponding ergodicity constant depends exponentially on the number of observations and on the dimension of the regression coefficients vector.

To tackle these limitations, we propose to relax the constraint (2) and consider an approximate DA model. This will permit the choice of an augmented density with more flexibility, fix the issues associated to the initial model and make inference more efficient in some cases. To this purpose, Section 2.2 presents the so-called AXDA framework, which embeds approximate DA models controlled by a positive scalar parameter ρ. These models become asymptotically exact as ρ tends towards 0. Of course, some assumptions will be required on the approximate augmented density to guarantee a good approximation. The quality of this approximation will be assessed in Section 4 with non-asymptotic theoretical results.

### 2.2 Model

Instead of searching for an exact data augmentation scheme (2), some auxiliary variables can be introduced in order to define an approximate but asymptotically exact probability distribution. One possibility is to introduce an augmented distribution depending on a parameter ρ > 0 and such that the associated marginal density, defined by

 πρ(θ) = ∫_Z πρ(θ, z) dz, (3)

satisfies the following property.

###### Property 1.

For all θ ∈ Θ, πρ(θ) → π(θ) as ρ → 0.

By applying Scheffé’s lemma (Scheffé 1947), this property yields the convergence in total variation as detailed in the following corollary.

###### Corollary 1.

Under Property 1, ∥πρ − π∥_TV → 0 as ρ → 0.

A natural question is: how to choose the augmented density in (3) such that Property 1 is met? In this paper, we assume that Z = Θ and investigate AXDA schemes associated to an initial density (1) and defined by the approximate augmented density

 πρ(θ, z) = π(z) κρ(z; θ), (4)

where κρ is such that (4) defines a proper density.

###### Remark 1.

When π stands for a product of n densities, that is π(θ) = ∏_{i=1}^n π_i(θ), the proposed approximate model can naturally be generalized to πρ(θ, z_{1:n}) = ∏_{i=1}^n π_i(z_i) κρ(z_i; θ). Such a generalization will for instance be considered in Sections 3.1 and 3.2.

A sufficient condition for Property 1 to be satisfied is that the sequence (κρ(·; θ))_{ρ>0} weakly converges towards the Dirac delta function concentrated at θ as ρ → 0 (Aguirregabiria et al. 2002).

One of the aims of introducing the proposed model (4) is to avoid a case-by-case search for an appropriate augmented approach. Hence, although there might exist other marginal densities satisfying Property 1, we restrict our analysis to (4) where κρ satisfies the sufficient weak convergence condition detailed above. In the sequel, we will call AXDA any approach based on (4) and satisfying these properties. The following sections detail two possible ways to build such a sequence.

#### AXDA using standard kernels

One possibility to construct such a sequence is to consider a kernel, that is, a positive function φ such that ∫ φ(u) du = 1. Based on the latter, we define for all ρ > 0, κρ(z; θ) = ρ^{−d} φ((z − θ)/ρ) (Dang and Ehrhardt 2012). Table 1 lists some classical examples of symmetric kernels which are not necessarily compactly supported. For the sake of simplicity, we only define univariate versions of them, but they can obviously be generalized to higher dimensions. Figure 1 illustrates these kernels. Some of them have for instance been used in ABC approaches (Sisson et al. 2018b).
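As an illustration, the marginal πρ in (3) built from a Gaussian kernel can be evaluated by numerical quadrature and compared with the initial density π. The snippet below is a minimal sketch (not from the paper), taking π to be a standard normal density; the sup-norm gap shrinks as ρ decreases, in line with Property 1.

```python
import numpy as np

def normal_pdf(x, mean=0.0, sd=1.0):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def axda_marginal(theta_grid, rho, z_grid):
    """pi_rho(theta) = int pi(z) kappa_rho(z; theta) dz, by Riemann-sum quadrature."""
    dz = z_grid[1] - z_grid[0]
    pi_z = normal_pdf(z_grid)                       # initial density: pi = N(0, 1)
    return np.array([np.sum(pi_z * normal_pdf(z_grid, mean=th, sd=rho)) * dz
                     for th in theta_grid])

z = np.linspace(-10.0, 10.0, 20001)                 # fine grid for the latent variable z
theta = np.linspace(-3.0, 3.0, 61)
sup_err = {rho: float(np.max(np.abs(axda_marginal(theta, rho, z) - normal_pdf(theta))))
           for rho in (0.5, 0.1, 0.01)}             # sup-norm gap |pi_rho - pi|
```

Here the exact marginal is available (a N(0, 1 + ρ²) density, since the model amounts to Gaussian convolution), which makes this toy case convenient for checking the quadrature.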

#### AXDA using divergences

Beyond the kernels listed in Table 1, but motivated by the same idea of measuring the discrepancy between the latent variable z and the initial one θ, another general strategy to derive κρ such that it weakly converges towards the Dirac delta function is to build on divergence functions widely used in the optimization literature (Ben-Tal et al. 2001; Beck and Teboulle 2003; Duchi et al. 2012; Krichene et al. 2015). Indeed, if we define for all ρ > 0, κρ(z; θ) ∝ exp(−d(z, θ)/ρ), where d is a strictly convex function w.r.t. z admitting a unique minimizer at z = θ, then under mild differentiability assumptions on d, one can show that κρ satisfies the desired weak convergence property (Fellows et al. 2019, Theorem 1). Univariate examples of such potentials are listed in Table 2 and involve specific instances of Bregman divergences, such as the logistic loss and the Kullback-Leibler divergence, see Definition 1.

###### Definition 1 (Bregman divergence).

Let ψ be a continuously differentiable and strictly convex function defined on a closed convex set. The Bregman divergence dψ associated to ψ is defined by

 dψ(z, θ) = ψ(z) − ψ(θ) − ∇ψ(θ)ᵀ(z − θ). (5)

Eventually, let us point out that when κρ belongs to the exponential family, its associated potential function can be expressed, up to an additional term, as a Bregman divergence between the auxiliary variable z and its conditional expectation under κρ (Banerjee et al. 2005, Theorem 4).
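Definition 1 translates directly into code. The sketch below (illustrative, scalar case) evaluates dψ for two classical choices of ψ: the squared potential ψ(x) = x², which recovers the squared Euclidean distance, and ψ(x) = x log x, which yields the generalized Kullback-Leibler divergence.

```python
import math

def bregman(psi, grad_psi, z, theta):
    """d_psi(z, theta) = psi(z) - psi(theta) - grad_psi(theta) * (z - theta), scalar case."""
    return psi(z) - psi(theta) - grad_psi(theta) * (z - theta)

# psi(x) = x^2  ->  d_psi(z, theta) = (z - theta)^2
sq = bregman(lambda x: x * x, lambda x: 2.0 * x, 3.0, 1.0)

# psi(x) = x log x  ->  d_psi(z, theta) = z log(z / theta) - z + theta (generalized KL)
kl = bregman(lambda x: x * math.log(x), lambda x: math.log(x) + 1.0, 2.0, 1.0)
```

By strict convexity, dψ(z, θ) ≥ 0 with equality exactly at z = θ, which is what makes exp(−dψ(z, θ)/ρ) concentrate around θ as ρ → 0.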

## 3 Benefits of AXDA by revisiting existing models

Before providing theoretical guarantees for AXDA models, this section proposes to review some important state-of-the-art works from the AXDA perspective described in Section 2. We do not pretend to give new insights about these approaches. We rather use them to illustrate potential benefits that can be gained by resorting to the proposed framework. For the sake of clarity, these benefits are directly highlighted in the titles of the following sections before being discussed therein.

### 3.1 Tractable posterior inference

This first section illustrates how an AXDA approach can alleviate the intractability of an initial posterior distribution and significantly simplify the computations.

To this purpose, we consider the case where the posterior distribution is intractable. Such a model for instance appears when the target involves a constraint on some set (Liechty et al. 2009), admits a non-standard potential function such as the total variation norm (Chambolle et al. 2010; Pereyra 2016; Vono et al. 2019) or yields complicated conditional posterior distributions (Holmes and Mallick 2003). To simplify the inference, the aforementioned authors have considered special instances of AXDA by relying on an additional level involving latent variables z, leading to a hierarchical Bayesian model. In these cases, AXDA has been invoked in order to move a difficulty to the conditional posterior of z, where it can be dealt with more easily by using standard inference algorithms; see Section 3 in the supplementary material for more details. The following example, derived from Holmes and Mallick (2003), illustrates this idea.

###### Example 1.

Let y = (y₁, …, y_n) be a set of n observations and X a design matrix filled with covariates. We consider a generalized non-linear model which writes

 yi | θ ∼ p(yi | g^{−1}(h(xi, θ)), σ²), ∀i ∈ [n], (6)
 θ ∼ N(θ | 0_d, ν² I_d), (7)

where p belongs to the exponential family with mean and variance defined through g^{−1}(h(xi, θ)), where g is a link function. As in classical regression problems, we are interested in inferring the regression coefficients θ. In the sequel, we set the non-parametric model h to be

 h(xi, θ) = ∑_{j=1}^k θ_j B(xi, k_j), (8)

where B is a non-linear function of xi (e.g., regression splines) and k_j is the knot location of the j-th basis. The difficulty here is the non-linearity of h which, combined with the non-Gaussian likelihood, rules out the use of efficient simulation schemes to sample from the posterior of θ. In order to mitigate this issue, Holmes and Mallick (2003) proposed to rely on an additional level, which boils down to considering the approximate model (4). More specifically, the aforementioned authors treated the non-linear predictor h(xi, θ) as a Gaussian random latent variable, which leads to the approximate model

 yi | zi ∼ p(yi | g^{−1}(zi), σ²), ∀i ∈ [n], (9)
 zi | θ ∼ N(zi | h(xi, θ), ρ²), ∀i ∈ [n], (10)
 θ ∼ N(θ | 0_d, ν² I_d). (11)

Here, AXDA has been applied only to the likelihood function, with κρ chosen as the univariate normal distribution (10), leading to a smoothed likelihood function. The main advantage of relying on such a model is that the conditional posterior distribution of θ is now a multivariate normal distribution. In addition, by moving the difficulty induced by the non-Gaussian likelihood to the conditional posterior of z, we are now dealing with a generalized linear model where standard techniques can be applied (Albert and Chib 1993; Polson et al. 2013).
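To make the Gaussian conditional explicit, one can write the predictor in matrix form h(xi, θ) = (Bθ)_i for a basis matrix B (a hypothetical name introduced here for illustration). Combining the levels (10) and (11), the conditional posterior of θ given z is Gaussian with precision BᵀB/ρ² + I/ν²; a minimal numpy sketch:

```python
import numpy as np

def theta_conditional(B, z, rho, nu):
    """Mean and covariance of theta | z under (10)-(11) with linear predictor h = B theta:
    Sigma = (B^T B / rho^2 + I / nu^2)^{-1},  mu = Sigma B^T z / rho^2."""
    d = B.shape[1]
    prec = B.T @ B / rho**2 + np.eye(d) / nu**2   # posterior precision matrix
    cov = np.linalg.inv(prec)
    mean = cov @ (B.T @ z) / rho**2
    return mean, cov

# sanity check: with B = I and rho = nu = 1, theta | z ~ N(z / 2, I / 2)
B = np.eye(3)
mean, cov = theta_conditional(B, np.array([1.0, -2.0, 0.5]), rho=1.0, nu=1.0)
```

This is the step that becomes "free" under the AXDA construction: whatever the original likelihood, the θ-update is a standard Gaussian draw.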

Beyond the widely-used Gaussian choice for κρ (Holmes and Mallick 2003; Liechty et al. 2009; Barbos et al. 2017; Vono et al. 2019), more general AXDA approaches can be built by taking inspiration from these works. To this purpose, we recommend to adaptively set κρ w.r.t. the prior and likelihood at stake. For instance, when a Poisson likelihood function and a complex prior distribution on its intensity are considered, one option for the potential of κρ (see Section 2.2.2) would be an Itakura-Saito divergence, since it preserves the positivity constraint and yields the well-known Gamma-Poisson model (Canny 2004).

### 3.2 Distributed inference

When data are stored on multiple machines and/or one is interested in preserving their privacy, this section illustrates how AXDA can be used to perform distributed computations.

Let us consider observed data y = (y₁, …, y_n), where xi stands for the covariates associated to observation yi, which are distributed among B nodes within a cluster. By adopting a prior ν(θ) and by assuming that the likelihood can be factorized w.r.t. the nodes, the posterior distribution of the variable of interest θ writes

 π(θ | y, X) ∝ ν(θ) ∏_{b=1}^B ∏_{i ∈ node b} exp(−fi(yi; h(xi, θ))). (12)

Such models classically appear in statistical machine learning when generalized linear models (GLMs) (Dobson and Barnett 2008) are considered. In these cases, h(xi, θ) = xiᵀθ. An archetypal example is the logistic regression problem. The latter assumes that the observed binary variables yi follow a Bernoulli distribution with parameter σ(xiᵀθ), where σ is the logistic link. This leads to the posterior

 π(θ | y, X) ∝ ν(θ) ∏_{i=1}^n exp(−fi(yi; xiᵀθ)), (13)

where fi(yi; xiᵀθ) = log(1 + exp(xiᵀθ)) − yi xiᵀθ. The likelihood in (13) can be re-written as in (12) by simply gathering the indices i associated to data which belong to the same node b.

Due to the distributed environment, sampling efficiently from (12) is challenging, and a lot of "divide-and-conquer" approaches have been proposed in the past few years to cope with this issue (Wang and Dunson 2013; Scott et al. 2016). These methods run independent Markov chains on each node and then combine the outputs of these local chains to obtain an approximation of the posterior of interest (12). Nonetheless, the averaging schemes used to combine the local chains might lead to poor approximations when the posterior is high-dimensional and non-Gaussian, see Rendell et al. (2018) for a comprehensive review.

Instead, considering a special instance of AXDA circumvents the previously mentioned drawbacks by introducing local auxiliary variables zi on each node such that

 πρ(θ, z | y, X) ∝ ν(θ) ∏_{b=1}^B ∏_{i ∈ node b} exp(−fi(yi; zi)) κρ(zi; xiᵀθ). (14)

Let us choose fi to be log-concave w.r.t. zi and κρ such that it stands for a conjugate prior for exp(−fi(yi; zi)). The posterior distribution of the auxiliary variables zi conditionally to θ only depends on the data available at a given node. Based on this nice property, the joint posterior can be sampled efficiently with a Gibbs sampler. Indeed, the conditional distribution of each zi being univariate and log-concave, sampling from it can be done efficiently and in a distributed manner with (adaptive) rejection sampling (Gilks and Wild 1992). On the other hand, the choice of κρ leads to a standard conditional posterior for θ which can be sampled with off-the-shelf techniques. We emphasize that the benefits described in this section for Monte Carlo sampling also hold when one wants to use other types of algorithms (e.g., expectation-maximization or variational Bayes), see Section 3 in the supplementary material.
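For the logistic case, a compact (single-process) sketch of such a Gibbs sampler is given below, under the illustrative assumptions κρ(zi; xiᵀθ) = N(zi | xiᵀθ, ρ²) and ν(θ) = N(0, ν²I); all names and settings are ours, not the paper's. Each zi is drawn by rejection sampling from its log-concave conditional: propose from the Gaussian factor and accept with probability exp(−fi), which is at most 1 since the logistic loss is non-negative. The θ-update is a multivariate normal draw.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_loss(y, z):
    # f_i(y; z) = log(1 + e^z) - y z  >= 0 for y in {0, 1}
    return np.logaddexp(0.0, z) - y * z

def sample_z(y, loc, rho):
    """Rejection sampling from p(z) proportional to exp(-f(y; z)) N(z | loc, rho^2)."""
    while True:
        z = rng.normal(loc, rho)
        if rng.random() < np.exp(-logistic_loss(y, z)):
            return z

def gibbs_axda_logistic(X, y, rho=1.0, nu=10.0, n_iter=200):
    n, d = X.shape
    theta = np.zeros(d)
    cov = np.linalg.inv(X.T @ X / rho**2 + np.eye(d) / nu**2)  # theta | z covariance
    chol = np.linalg.cholesky(cov)
    chain = []
    for _ in range(n_iter):
        # local, embarrassingly parallel updates of the splitting variables
        z = np.array([sample_z(y[i], X[i] @ theta, rho) for i in range(n)])
        mean = cov @ (X.T @ z) / rho**2
        theta = mean + chol @ rng.standard_normal(d)           # Gaussian theta-update
        chain.append(theta)
    return np.array(chain)

X = rng.standard_normal((50, 2))
y = (rng.random(50) < 1.0 / (1.0 + np.exp(-X @ np.array([1.5, -1.0])))).astype(float)
chain = gibbs_axda_logistic(X, y)
```

In a true cluster deployment, the loop over i would run on the node holding (yi, xi); only the zi (or local sufficient statistics) travel to the master performing the θ-update, which is what enables the privacy-aware, distributed setting described above.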

### 3.3 Robust inference

By noting that classical robust hierarchical models fall into the proposed framework, this section shows that AXDA is also a relevant strategy to cope with model misspecification by modeling additional sources of uncertainty.

Considering a well-chosen demarginalization procedure is known to yield robustness properties in some cases (Robert and Casella 2004). Some approaches took advantage of this idea in order to build robust hierarchical Bayesian models w.r.t. possible outliers in the data. For instance, such models can be built by allowing each observation to be randomly drawn from a local statistical model, as described in the recent review of Wang and Blei (2018). This “localization” idea is illustrated in Figure 2.

Many of these models can be viewed as particular instances of AXDA. Indeed, assume that the n data points are independently and identically distributed (i.i.d.), defining the likelihood function

 π(y | θ) = ∏_{i=1}^n π(yi | θ), (15)

where θ is a common parameter. Applying AXDA as described in Section 2 by introducing n auxiliary variables stacked into the vector z_{1:n} leads to the augmented likelihood

 πρ(y, z_{1:n} | θ) = ∏_{i=1}^n π(yi | zi) κρ(zi; θ). (16)

The statistical model defined by (16) implies a hierarchical Bayesian model similar to the localized one depicted in Figure 2(b) and corresponds in general to an approximation of the initial one, see Example 2.

###### Example 2.

Robust logistic regression – Assume that for all i ∈ [n], yi | θ ∼ B(σ(xiᵀθ)), where B stands for the Bernoulli distribution, σ for the sigmoid function, xiᵀ for the i-th row of the design matrix and θ for the regression coefficients vector to infer. Then, as proposed by Wang and Blei (2018), one can robustify the inference by assuming that each observation yi is drawn from a local and independent model associated to an auxiliary parameter zi. In this case, yi | zi ∼ B(σ(zi)) with κρ(zi; θ) = N(zi | xiᵀθ, ρ²).

Beyond the convenient Gaussian prior advocated by Wang and Blei (2018), the choice of κρ through its potential, see Section 2.2.2, can be motivated by robust loss functions used in the statistical machine learning literature, such as the absolute or Huber losses (She and Owen 2011). In Bayesian linear inverse problems considered in the signal processing community, it is classical to approximate a complicated forward physical model in order to yield tractable computations. If the latter approximation can be written as a perturbation of the exact forward operator, then introducing a latent variable associated to the model output allows the modeling approximation to be taken into consideration. In those cases, one can set κρ to be the distribution of the modeling error, which could be adjusted thanks to some expertise.

### 3.4 Inheriting sophisticated inference schemes from ABC

Finally, this section shows that AXDA models, by sharing strong connections with ABC, might inherit sophisticated algorithms to sample from (4).

ABC stands for a family of methods that make it possible to cope with intractable likelihoods by sampling from them instead of evaluating them. In a nutshell, if one's goal is to infer a parameter based on a posterior of interest, the simplest ABC rejection sampler is as follows. At each iteration, draw a candidate θ' from the prior, generate pseudo-observations z from the likelihood given this candidate, and accept θ' if d(z, y) ≤ ε, where y is the observations vector, d a discrepancy measure and ε a tolerance threshold. Many more sophisticated ABC samplers have been derived. We refer the interested reader to the recent review by Sisson et al. (2018a) for more information about ABC methods.
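The rejection sampler just described fits in a few lines. As a toy, illustrative setting (not from the paper), take a N(0, 1) prior on θ, a N(θ, 1) likelihood, the sample mean as summary statistic and tolerance ε; the accepted draws approximate the posterior, which is Gaussian and known in closed form here.

```python
import numpy as np

rng = np.random.default_rng(1)

def abc_rejection(y, n_draws, eps):
    """Keep prior draws whose pseudo-data have a sample mean within eps of the data's."""
    accepted = []
    y_bar = y.mean()
    for _ in range(n_draws):
        theta = rng.standard_normal()               # candidate from the N(0, 1) prior
        z = rng.normal(theta, 1.0, size=y.size)     # pseudo-observations given theta
        if abs(z.mean() - y_bar) <= eps:
            accepted.append(theta)
    return np.array(accepted)

y = rng.normal(1.0, 1.0, size=20)                   # data generated with theta* = 1
post = abc_rejection(y, n_draws=20000, eps=0.1)
# exact posterior mean is n * y_bar / (n + 1); the ABC estimate is close for small eps
```

The tolerance plays exactly the role of ρ in AXDA: shrinking ε reduces the bias but also the acceptance rate, which is the trade-off the noisy-ABC connection below formalizes.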

Among the huge literature on ABC (also called likelihood-free) methods, the noisy ABC approaches proposed and motivated by Fearnhead and Prangle (2012) and Wilkinson (2013) are strongly related to AXDA. Indeed, comparing only the underlying models, AXDA with observation splitting is equivalent to noisy ABC. To see this, let π(y | θ) stand for an intractable likelihood. Noisy ABC replaces the exact inference based on π(y | θ) by considering the pseudo-likelihood with density

 πρ(y | θ) ≜ ∫ πρ(y, z | θ) dz = ∫ π(z | θ) κρ(z; y) dz. (17)

This density has exactly the same form as the one defined in (4), except that noisy ABC splits the observations y instead of the parameter of interest θ. Capitalizing on this equivalence, also pointed out by Rendell et al. (2018), one can derive efficient algorithms for AXDA from the ABC framework. For instance, Rendell et al. (2018) recently built on the works of Beaumont et al. (2002); Del Moral et al. (2012) in the ABC context to propose a bias correction approach and a sequential Monte Carlo (SMC) algorithm avoiding the tuning of the tolerance parameter ρ. Obviously, many other inspirations from ABC can be considered, such as the parallel tempering approach of Baragatti et al. (2013) among others, to make the inference from an AXDA model more flexible and efficient.

## 4 Theoretical guarantees

By building on existing approaches, Section 3 showed that AXDA can be used in quite general and different settings depending on one's motivations. In order to further promote the use of such approximate augmented models, this section goes beyond the empirical bias analysis performed by previous works and provides quantitative bounds on the error between the initial and the approximate model. More precisely, for a fixed tolerance parameter ρ, non-asymptotic results on the error associated to densities, potentials and credibility regions are derived. We will assume throughout this section that Θ = ℝ^d. The proofs of the results of this section can be found in the supplementary material.

### 4.1 Results for standard kernels

In this section, we consider the case where κρ is defined from a kernel φ, see Section 2.2. Under this model, the following results hold.

###### Proposition 1.

Let ρ > 0. The marginal with density πρ in (3) has the following properties.

1. Let π stand for a pdf associated to the random variable θ and assume that the symmetric kernel φ has finite first two moments. Then, the expectation and variance under πρ are given by

 E_{πρ}(θ) = E_π(θ), (18)
 var_{πρ}(θ) = var_π(θ) + var_{κρ}(θ). (19)
2. supp(πρ) = cl(supp(π) + supp(κρ)), where cl(S) is the closure of the set S. The notation supp(h) refers to the support of a function h.

3. If both π and κρ are log-concave, then πρ is log-concave.

4. If φ is infinitely differentiable and each of its derivatives is bounded, then πρ is infinitely differentiable w.r.t. θ.

Proposition 1 permits several conclusions to be drawn about the inference based on πρ. Firstly, the infinite differentiability of πρ (Property iv)) implies that it stands for a smooth approximation of π, see Figure 6. Secondly, Property i) of Proposition 1 is reassuring regarding the inference task. Indeed, if π stands for a prior distribution, then considering the approximation πρ simply corresponds to a more diffuse prior knowledge around the same expected value, see Section 5.2. Thus, more weight will be given to the likelihood if a posterior distribution is derived with this prior. On the other hand, if π stands for a likelihood, then considering the approximation yields the opposite behavior: the likelihood becomes less informative w.r.t. the prior. This idea is directly related to the robust hierarchical Bayesian models discussed in Section 3.3.
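Property i) is easy to check by simulation: under the model (4) with a symmetric kernel, sampling from the marginal πρ amounts to drawing z ∼ π and then perturbing it with the kernel noise. The illustrative sketch below uses π = N(1, 2²) and a Gaussian kernel of standard deviation ρ, so the mean is preserved while the variances add.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.7
n = 400_000

z = rng.normal(1.0, 2.0, size=n)                  # z ~ pi = N(1, 2^2)
theta = z + rho * rng.standard_normal(n)          # theta | z ~ N(z, rho^2): draws from pi_rho

mean_gap = abs(float(theta.mean()) - 1.0)         # (18): the means coincide
var_gap = abs(float(theta.var()) - (4.0 + rho**2))  # (19): the variances add
```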

We now provide quantitative bounds on the approximation implied by considering the marginal πρ instead of π. For p ≥ 1, we define the p-Wasserstein distance between π and πρ by

 W_p(π, πρ) = (min_μ {∫_{ℝ^d} ∫_{ℝ^d} ∥θ − z∥₂^p dμ(z, θ); μ ∈ Γ(πρ, π)})^{1/p}, (20)

where Γ(πρ, π) is the set of probability distributions on ℝ^d × ℝ^d with marginals πρ and π w.r.t. z and θ, respectively. Under mild assumptions on the kernel φ, Proposition 2 gives a simple and practical upper bound on (20).

###### Proposition 2.

Assume that π in (3) stands for a pdf associated to the variable θ. Let p ≥ 1 such that m_p ≜ (∫ ∥u∥₂^p φ(u) du)^{1/p} < ∞. Then, we have

 W_p(π, πρ) ≤ ρ m_p. (21)

Note that (21) holds without additional assumptions on the initial density π, such as infinite differentiability. If the latter is assumed w.r.t. the parameter of interest θ, then one can estimate the bias with a Taylor expansion of π, similarly to bias analyses in ABC, see Sisson et al. (2018b). Table 3 gives closed-form expressions of m_p for the multivariate generalizations of the kernels listed in Table 1. One can note that the constant m_p has the same dependence w.r.t. the dimension d for the considered standard kernels. Hence, in high-dimensional scenarios, the approximation quality will be more affected by an inappropriate value of the tolerance parameter ρ than by the choice of φ. In Section 5, we illustrate Proposition 2 with numerical experiments.
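Proposition 2 can be checked numerically. For a standard Gaussian kernel in one dimension, m₁ = E|u| = √(2/π); the illustrative sketch below estimates W₁(π, πρ) from sorted samples (the quantile coupling, which is optimal in 1D) and compares it to the bound ρ m₁.

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho = 200_000, 0.5
m1 = np.sqrt(2.0 / np.pi)                 # E|u| for a standard normal kernel

x = rng.standard_normal(n)                # samples from pi = N(0, 1)
y = x + rho * rng.standard_normal(n)      # samples from pi_rho (z ~ pi, theta = z + rho u)

# empirical 1-Wasserstein distance via the sorted (quantile) coupling
w1 = float(np.mean(np.abs(np.sort(x) - np.sort(y))))
# Proposition 2: w1 <= rho * m1
```

In this toy case the bound is loose (both densities are centered normals, so the optimal coupling moves little mass), which is consistent with (21) being a worst-case, assumption-light guarantee.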

### 4.2 Pointwise bias for Bregman divergences

In complement to Section 4.1, where κρ was built using kernels, we now analyze the bias induced by considering πρ when κρ is derived from a Bregman divergence (see Definition 1), that is

 κρ(z; θ) ∝ exp(−dψ(z, θ)/ρ). (22)

Under infinite differentiability assumptions on both π and dψ, one can show that the pointwise bias |πρ(θ) − π(θ)| is of the order of ρ when ρ is sufficiently small, see Proposition 3.

###### Proposition 3.

Assume that π is infinitely differentiable on ℝ^d and so is dψ w.r.t. its first argument. Let θ ∈ ℝ^d such that the Hessian matrix of π and the Hessian matrix associated to dψ(·, θ) exist and are continuous at θ. Then,

 πρ(θ) = π(θ) + O(ρ), as ρ → 0. (23)

Note that when dψ(z, θ) = ∥z − θ∥₂²/2, κρ stands for a Gaussian smoothing kernel, see Section 4.1. In that case, we have the sanity check that the dependence w.r.t. ρ of the bias between πρ and π in (23) is the same as the one derived by Sisson et al. (2018b) when interpreting κρ as a kernel.
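The linear rate in (23) can be illustrated numerically in this Gaussian special case, dψ(z, θ) = ∥z − θ∥²/2, for which κρ is a normal density of variance ρ and the smoothed version of a standard normal π is available in closed form, πρ = N(0, 1 + ρ). The pointwise bias at θ = 0, divided by ρ, then stabilizes at a constant as ρ → 0 (illustrative sketch):

```python
import math

def bias_at_zero(rho):
    """|pi_rho(0) - pi(0)| with pi = N(0, 1) and Gaussian kappa_rho of variance rho,
    so that pi_rho = N(0, 1 + rho) in closed form."""
    pi0 = 1.0 / math.sqrt(2.0 * math.pi)
    return pi0 * (1.0 - 1.0 / math.sqrt(1.0 + rho))

ratios = [bias_at_zero(r) / r for r in (0.1, 0.01, 0.001)]
# the ratios approach the constant pi(0)/2, i.e. the bias is O(rho)
```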

### 4.3 A detailed non-asymptotic analysis for Gaussian smoothing

The previous sections gave quantitative approximation results for a large class of densities built either via a kernel or a Bregman divergence. In this section, we provide complementary results by restricting our analysis to the case

 κρ(z;θ)=N(z|θ,ρ2Id). (24)

This particular yet convenient assumption will allow us to complement and sharpen the results of Section 4.1 by deriving quantitative bounds which take into account the regularity properties of the potential f. Furthermore, these bounds can be extended to a sum of potential functions and used to assess the bias associated to both log-densities and credibility regions. This analysis is also motivated by the fact that the Gaussian smoothing case has been widely advocated in the literature since it generally leads to simple inference steps (Holmes and Mallick 2003; Giovannelli 2008; Liechty et al. 2009; Dümbgen and Rufibach 2009), and can be related to both the ADMM in optimization (Boyd et al. 2011; Vono et al. 2019) and the approximation involved in proximal MCMC methods (Pereyra 2016; Durmus et al. 2018; Salim et al. 2019). Unfortunately, a straightforward generalization of the proof techniques used in the sequel does not give informative upper bounds for smoothing associated to other Bregman divergences.

#### Assumptions

To derive non-asymptotic bounds between quantities related to πρ defined in (3) and π in (1), some complementary assumptions on the potential f will be required. They are detailed hereafter. For simplicity and with a slight abuse of notation, we also denote here by f the potential associated to (1) when π stands for a likelihood.

1. f is L_f-Lipschitz w.r.t. θ, that is, there exists L_f > 0 such that for all θ, θ' ∈ ℝ^d, |f(θ) − f(θ')| ≤ L_f ∥θ − θ'∥₂. When π is a likelihood, it is further assumed that L_f is independent of y.

2. f is continuously differentiable and has an M_f-Lipschitz continuous gradient w.r.t. θ, that is, there exists M_f > 0 such that for all θ, θ' ∈ ℝ^d, ∥∇f(θ) − ∇f(θ')∥₂ ≤ M_f ∥θ − θ'∥₂.

3. f is convex, that is, for every θ, θ' ∈ ℝ^d and t ∈ [0, 1], f(tθ + (1 − t)θ') ≤ t f(θ) + (1 − t) f(θ').

4. .

The first three assumptions on the potential f stand for standard regularity assumptions in the optimization literature and cover a large class of functions (Beck and Teboulle 2009; Bolte et al. 2014). In the broad statistical community, such assumptions have been used by Durmus et al. (2018) to derive non-asymptotic bounds on the total variation distance between probability distributions, and stand for sufficient conditions to have a strong solution to the overdamped Langevin stochastic differential equation (Durmus and Moulines 2017).

Under the previous assumptions (not used all at once), non-asymptotic upper bounds on the total variation distance between and are derived in Section 4.3.2. Then, Sections 4.3.3 and 4.3.4 take advantage of this bound to state theoretical properties on the potential functions and credibility regions.

#### Non-asymptotic bounds on the total variation distance

In this section, we make additional regularity assumptions on the potential in order to show quantitative results depending explicitly on regularity constants associated to . Two different cases will be considered, namely Lipschitz potentials, and differentiable, gradient-Lipschitz and convex ones.

Lipschitz potential – When the potential function is assumed to be Lipschitz continuous but not necessarily continuously differentiable, the following result holds.

###### Theorem 1.

Let a potential function $f$ satisfy Assumption 1. Then,

 $\left\|\pi_\rho - \pi\right\|_{\mathrm{TV}} \leq 1 - \Delta_d(\rho)$, (25)

where

 $\Delta_d(\rho) = \dfrac{D_{-d}(L_f\rho)}{D_{-d}(-L_f\rho)}$. (26)

The function $D_{-d}$ is a parabolic cylinder function defined for all $z \in \mathbb{R}$ and $d > 0$ by

 $D_{-d}(z) = \dfrac{\exp(-z^2/4)}{\Gamma(d)}\displaystyle\int_0^{+\infty} e^{-xz - x^2/2}\, x^{d-1}\,\mathrm{d}x$. (27)

As expected from Corollary 1, note that this bound tends towards zero when $\rho \to 0$. Additionally, this bound depends on few quantities that can be computed, bounded or approximated in real applications: the dimension $d$ of the problem, the Lipschitz constant $L_f$ associated to the potential $f$ and the tolerance parameter $\rho$. In the limiting case $\rho \to 0$, the following equivalent function for the upper bound derived in (25) holds.
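For illustration purposes, the bound (25)-(27) can be evaluated numerically. A minimal Python sketch (the function and variable names are ours) computes the parabolic cylinder function by quadrature of the integral representation (27):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def D_minus_d(z, d):
    """Parabolic cylinder function D_{-d}(z) via the integral representation (27)."""
    val, _ = quad(lambda x: x**(d - 1) * np.exp(-x * z - x**2 / 2), 0, np.inf)
    return np.exp(-z**2 / 4 - gammaln(d)) * val

def tv_bound_lipschitz(rho, L_f, d):
    """Upper bound (25) on the TV distance for an L_f-Lipschitz potential."""
    return 1.0 - D_minus_d(L_f * rho, d) / D_minus_d(-L_f * rho, d)
```

For small $\rho$ this numerical bound can be compared against the closed-form equivalent given in Corollary 2 below.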

###### Corollary 2.

In the limiting case $\rho \to 0$, we have:

 $\left\|\pi_\rho - \pi\right\|_{\mathrm{TV}} \leq 2\sqrt{2}\,\rho L_f\,\dfrac{\Gamma\left(\frac{d+1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)} + o(\rho)$, (28)

where $o(\rho)$ is such that $o(\rho)/\rho \to 0$ as $\rho \to 0$.

Under some regularity conditions (here Lipschitz continuity) on the potential function $f$, Proposition 2 states that the approximation error grows at most linearly w.r.t. the parameter $\rho$ and w.r.t. the Lipschitz constant $L_f$ when $\rho$ is sufficiently small. Moreover, using Stirling-like approximations of the Gamma function ratio when $d$ is large, the equivalence relation (28) gives a mild dependence of order $\sqrt{d}$ on the dimensionality of the problem. Potential functions verifying the hypothesis of Theorem 1 are common in machine learning and signal/image processing problems, see Section 5.3. As an archetypal example, the sparsity promoting potential function defined for all $\theta \in \mathbb{R}^d$ by $f(\theta) = \tau\|\theta\|_1$ with $\tau > 0$ is Lipschitz continuous with Lipschitz constant $L_f = \tau\sqrt{d}$ and satisfies the assumptions of Theorem 1 and Proposition 2. In this case, the dependence of (28) is linear w.r.t. $d$ when $d$ is large and $\rho$ is small. Note also that continuously differentiable functions on a compact set are Lipschitz continuous.
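The Lipschitz constant of the $\ell_1$ example above can be checked empirically. A short sketch (our own illustration) estimates the largest difference quotient of $f(\theta) = \tau\|\theta\|_1$ on random pairs, which by the Cauchy-Schwarz inequality never exceeds $\tau\sqrt{d}$:

```python
import numpy as np

def empirical_lipschitz_ratio(tau, d, n_pairs=1000, seed=0):
    """Largest observed |f(x) - f(y)| / ||x - y||_2 for f(x) = tau * ||x||_1.

    The theoretical Lipschitz constant w.r.t. the l2 norm is L_f = tau * sqrt(d)."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(n_pairs):
        x, y = rng.standard_normal(d), rng.standard_normal(d)
        worst = max(worst,
                    tau * abs(np.abs(x).sum() - np.abs(y).sum()) / np.linalg.norm(x - y))
    return worst
```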

Convex and gradient-Lipschitz potential – We now show a complementary result by assuming $f$ to be convex and continuously differentiable with a Lipschitz-continuous gradient.

###### Theorem 2.

Let a potential function $f$ satisfy Assumptions 2, 3 and 4. Then, when $\kappa_\rho(\cdot\,;\theta)$ stands for the pdf associated to $\mathcal{N}(\theta, \rho^2 I_d)$, we have:

 $\left\|\pi_\rho - \pi\right\|_{\mathrm{TV}} \leq 1 - \dfrac{1}{\left(1 + 2\rho^2 M_f\right)^{d/2}}\left(1 - \dfrac{\rho^4 M_f^2}{1 + 2\rho^2 M_f}\right)$. (29)

In the limiting case $\rho \to 0$, the upper bound in (29) has a simpler expression, as shown hereafter.

###### Corollary 3.

In the limiting case $\rho \to 0$, we have:

 $\left\|\pi_\rho - \pi\right\|_{\mathrm{TV}} \leq \rho^2 d M_f + o(\rho^2)$. (30)

Note that the dependences w.r.t. both $\rho$ and $d$ in Corollaries 2 and 3 are similar to the ones found by Nesterov and Spokoiny (2017) for optimization purposes.

Figure 3 gives the behavior of the upper bounds in (25) and (29) for several values of the dimension $d$ and as a function of $\rho$ in log-log scale. The linear (resp. quadratic) relation between this upper bound and $\rho$ shown in (28) (resp. (30)) is clearly observed for small values of $\rho$. Nonetheless, these upper bounds are not a silver bullet. Indeed, as expected, for a fixed value of the parameter $\rho$, the approximation error increases as the dimension grows. Thus, these bounds suffer from the curse of dimensionality and become non-informative in high dimension if $\rho$ is not sufficiently small.
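The bound (29) and its small-$\rho$ expansion (30) are cheap to evaluate; a minimal sketch (the function names are ours) allows one to reproduce this kind of behavior plot:

```python
import numpy as np

def tv_bound_smooth(rho, M_f, d):
    """Upper bound (29) for a convex potential with M_f-Lipschitz gradient."""
    a = 1.0 + 2.0 * rho**2 * M_f
    return 1.0 - a**(-d / 2) * (1.0 - rho**4 * M_f**2 / a)

def tv_bound_smooth_small_rho(rho, M_f, d):
    """Leading term rho^2 * d * M_f of the expansion (30)."""
    return rho**2 * d * M_f
```

For small $\rho$, the two functions agree, while for a fixed $\rho$ the bound (29) degrades as $d$ grows, as discussed above.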

Theorems 1 and 2 are easily extended to the case where the initial density is expressed as a product of several terms. If $\pi$ stands for the pdf associated to the variable $\theta$, this boils down to considering

 $\pi(\theta) = \displaystyle\prod_{j=1}^J \pi_j(\theta) \propto \exp\left(-\sum_{j=1}^J f_j(\theta)\right)$, (31)

where for all $j \in [J]$, $\pi_j(\theta) \propto \exp(-f_j(\theta))$, and a natural generalization of AXDA applied to each factor $\pi_j$, which writes

 $\pi_\rho(\theta, z_{1:J}) = \displaystyle\prod_{j=1}^J \pi_j(z_j)\,\kappa_{\rho_j}(z_j;\theta) \propto \exp\left(-\sum_{j=1}^J \left[f_j(z_j) + \frac{1}{2\rho_j^2}\left\|z_j - \theta\right\|_2^2\right]\right)$. (32)

Under this product form, we have the following corollaries.

###### Corollary 4.

For all $j \in [J]$, let $f_j$ satisfy Assumption 1. Then,

 $\left\|\pi_\rho - \pi\right\|_{\mathrm{TV}} \leq 1 - \displaystyle\prod_{j=1}^J \Delta_d^{(j)}(\rho_j)$, (33)

where $\Delta_d^{(j)}(\rho_j) = D_{-d}(L_{f_j}\rho_j)/D_{-d}(-L_{f_j}\rho_j)$.

###### Corollary 5.

For all $j \in [J]$, let $f_j$ satisfy Assumptions 2, 3 and 4. Then,

 $\left\|\pi_\rho - \pi\right\|_{\mathrm{TV}} \leq 1 - \displaystyle\prod_{j=1}^J \frac{1}{\left(1 + 2\rho_j^2 M_{f_j}\right)^{d/2}}\left(1 - \frac{\rho_j^4 M_{f_j}^2}{1 + 2\rho_j^2 M_{f_j}}\right)$. (34)

#### Uniform bounds on potentials

From an optimization point of view, it is quite common to consider potential functions associated to densities. For such applications, we give hereafter a quantitative uniform bound on the difference between the potential functions associated to $\pi_\rho$ and $\pi$. Similarly to the definition of the potential function $f$ in (1), we define the potential function associated to the approximate marginal $\pi_\rho$ in (3), for all $\theta \in \mathbb{R}^d$, by

 $f_\rho(\theta) = -\log\displaystyle\int_{\mathbb{R}^d} \exp(-f(z))\,\kappa_\rho(z;\theta)\,\mathrm{d}z$. (35)

By considering a Gaussian smoothing kernel $\kappa_\rho(z;\theta) = \mathcal{N}(z; \theta, \rho^2 I_d)$, the potential becomes

 $f_\rho(\theta) = -\log\displaystyle\int_{\mathbb{R}^d} \exp\left(-f(z) - \frac{1}{2\rho^2}\left\|z - \theta\right\|_2^2\right)\mathrm{d}z + \frac{d}{2}\log(2\pi\rho^2)$. (36)

Note that $f_\rho$ appears as a regularized version of $f$.

###### Proposition 4.

Let $f$ satisfy Assumption 1. Then, for all $\theta \in \mathbb{R}^d$,

 $L_\rho \leq f_\rho(\theta) - f(\theta) \leq U_\rho$, (37)

with

 $L_\rho = \log N_\rho - \log D_{-d}(-L_f\rho)$, (38)
 $U_\rho = \log N_\rho - \log D_{-d}(L_f\rho)$, (39)

and

 $N_\rho = \dfrac{2^{d/2-1}\,\Gamma(d/2)}{\Gamma(d)}\exp\left(L_f^2\rho^2/4\right)$. (40)

It is easily observed that these bounds are informative in the limiting case $\rho \to 0$ since they both tend towards zero.
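These bounds can be checked numerically in one dimension. The sketch below (our own illustration) evaluates $f_\rho$ in (36) by quadrature for the Laplacian potential $f(\theta) = \tau|\theta|$, whose Lipschitz constant is $L_f = \tau$, and compares the gap $f_\rho - f$ with $L_\rho$ and $U_\rho$:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def D_minus_d(z, d):
    """Parabolic cylinder function D_{-d}(z) via the integral representation (27)."""
    val, _ = quad(lambda x: x**(d - 1) * np.exp(-x * z - x**2 / 2), 0, np.inf)
    return np.exp(-z**2 / 4 - gammaln(d)) * val

def f_rho(theta, tau, rho):
    """Smoothed potential (36) for f(z) = tau*|z| in dimension d = 1."""
    val, _ = quad(lambda z: np.exp(-tau * abs(z) - (z - theta)**2 / (2 * rho**2)),
                  theta - 12 * rho, theta + 12 * rho)
    return -np.log(val) + 0.5 * np.log(2 * np.pi * rho**2)

def bounds_prop4(tau, rho, d=1):
    """L_rho and U_rho from (38)-(40), with L_f = tau."""
    logN = (d / 2 - 1) * np.log(2.0) + gammaln(d / 2) - gammaln(d) + (tau * rho)**2 / 4
    return (logN - np.log(D_minus_d(-tau * rho, d)),
            logN - np.log(D_minus_d(tau * rho, d)))
```

Sweeping $\theta$ over a grid, one can verify that the gap always lies inside $[L_\rho, U_\rho]$ and that both endpoints shrink as $\rho \to 0$.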

#### Uniform bounds on credibility regions

When $\pi$ stands for the density associated to a posterior distribution, one advantage of Bayesian analysis is its ability to derive the underlying probability distribution of the variable of interest and thereby to provide credibility information under this distribution. This uncertainty information is particularly relevant and essential for real-world applications. Since the marginal $\pi_\rho$ stands for an approximation of the original target distribution $\pi$, it is important to control the credibility regions under $\pi_\rho$ w.r.t. those drawn under $\pi$. The control in total variation distance given by Theorem 1 is already a good indication. However, it is possible to quantify more precisely the difference between the credible regions (Robert 2001) with confidence level $1 - \alpha$ ($\alpha \in (0,1)$) under $\pi_\rho$ and $\pi$, as stated below.

###### Proposition 5.

Let $\pi$ be a posterior distribution associated to a potential $f$ such that Assumption 1 is verified. Let $C_\rho^\alpha$ be an arbitrary $(1-\alpha)$-credibility region under $\pi_\rho$, that is such that $\int_{C_\rho^\alpha} \pi_\rho(\theta)\,\mathrm{d}\theta = 1 - \alpha$. Then,

 $(1-\alpha)\,\dfrac{N_\rho}{D_{-d}(-L_f\rho)} \leq \displaystyle\int_{C_\rho^\alpha} \pi(\theta)\,\mathrm{d}\theta \leq \min\left(1,\ (1-\alpha)\,\dfrac{N_\rho}{D_{-d}(L_f\rho)}\right)$, (41)

where $N_\rho$ is defined in (40).

Proposition 5 states that the coverage of $C_\rho^\alpha$ under $\pi$ can be determined for a fixed value of $\rho$. Thus, it is even possible to obtain a theoretical comprehensive description of $\pi_\rho$ w.r.t. the initial target density $\pi$ before conducting an AXDA-based inference. The bounds in (41) permit one to choose a parameter $\rho$ in order to ensure a prescribed coverage property. The behavior of these bounds w.r.t. $\rho$ is the same as in Section 4.3.2, i.e., a linear behavior w.r.t. $\rho$ when this parameter is sufficiently small.
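Numerically, the bracket (41) is straightforward to evaluate. A minimal sketch (function names ours), reusing the integral representation (27):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def D_minus_d(z, d):
    """Parabolic cylinder function D_{-d}(z) via the integral representation (27)."""
    val, _ = quad(lambda x: x**(d - 1) * np.exp(-x * z - x**2 / 2), 0, np.inf)
    return np.exp(-z**2 / 4 - gammaln(d)) * val

def coverage_bracket(alpha, rho, L_f, d):
    """Lower/upper coverage under pi of a (1-alpha)-credibility region of pi_rho, cf. (41)."""
    # log N_rho from (40), computed in log-scale for numerical stability
    logN = (d / 2 - 1) * np.log(2.0) + gammaln(d / 2) - gammaln(d) + (L_f * rho)**2 / 4
    N = np.exp(logN)
    lo = (1 - alpha) * N / D_minus_d(-L_f * rho, d)
    hi = min(1.0, (1 - alpha) * N / D_minus_d(L_f * rho, d))
    return lo, hi
```

As expected, the bracket collapses onto the nominal level $1 - \alpha$ when $\rho \to 0$, which can serve as a practical rule to calibrate $\rho$ for a prescribed coverage.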

## 5 Numerical illustrations

This section illustrates the quantitative results shown in Sections 4.1 and 4.3 on four different examples which classically appear in statistical signal processing and machine learning. As shown in Table 3, the bias induced by considering $\pi_\rho$ is mostly driven by the value of the tolerance parameter $\rho$ rather than by the choice of the kernel $\kappa_\rho$. Hence, for simplicity, most of the numerical illustrations hereafter consider the case where $\kappa_\rho$ is a Gaussian smoothing kernel.

### 5.1 Multivariate Gaussian example

We start by performing a sanity check with the simple case where $\pi$ stands for a multivariate Gaussian density, that is

 $\pi(\theta) = \mathcal{N}(\theta \mid \mu, \Sigma)$, (42)

where $\Sigma$ is assumed to be positive definite. If $\kappa_\rho(\cdot\,;\theta)$ is taken to be a Gaussian density with mean $\theta$ and covariance matrix $\rho^2 I_d$, then one can show that

 $\pi_\rho(\theta) = \mathcal{N}(\theta \mid \mu, \Sigma + \rho^2 I_d)$. (43)

In particular, let us consider the univariate setting, that is $d = 1$ and $\Sigma = \sigma^2$. In this case, the variance under $\pi_\rho$ is $\sigma^2 + \rho^2$ and simply corresponds to the variance under $\pi$ inflated by a factor $1 + \rho^2/\sigma^2$. Therefore, the approximation will be reasonable if $\rho$ is sufficiently small, see Figure 4. In this figure, we also show the approximation induced by considering a uniform kernel (see Table 2) instead of a Gaussian one. The smoothing via the uniform kernel performs slightly better than Gaussian smoothing due to its lower variance. In both cases, the approximation is reasonable for small $\rho$ although $\pi_\rho$, built with a uniform kernel, no longer belongs to the Gaussian family.
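The identity (43) can be checked by simulation in the univariate setting, and the resulting total variation distance compared with the bound of Theorem 2, since the Gaussian potential is convex with gradient-Lipschitz constant $M_f = 1/\sigma^2$. A short sketch (numerical settings are ours):

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(0)
mu, sigma, rho = 2.0, 1.0, 0.3

# Hierarchical sampling from the AXDA model: z ~ N(mu, sigma^2), theta | z ~ N(z, rho^2),
# so that the marginal of theta is (43), i.e. N(mu, sigma^2 + rho^2).
z = mu + sigma * rng.standard_normal(200_000)
theta = z + rho * rng.standard_normal(200_000)

def npdf(x, m, s2):
    # univariate Gaussian density with mean m and variance s2
    return np.exp(-(x - m)**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

# Exact TV distance between pi and pi_rho, and the bound (29) with d = 1, M_f = 1/sigma^2.
tv, _ = quad(lambda x: 0.5 * abs(npdf(x, mu, sigma**2) - npdf(x, mu, sigma**2 + rho**2)),
             -10, 14)
a = 1 + 2 * rho**2 / sigma**2
tv_bound = 1 - a**-0.5 * (1 - rho**4 / sigma**4 / a)
```

The empirical variance of `theta` matches the inflated variance $\sigma^2 + \rho^2$, and the exact total variation distance remains below the bound, as predicted.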

In order to illustrate the proposed upper bounds on both 2-Wasserstein and total variation distances, we consider a covariance matrix $\Sigma$ which stands for a squared exponential matrix commonly used in applications involving Gaussian processes (Higdon 2007) and which writes

 $\Sigma_{ij} = 2\exp\left(-\dfrac{(s_i - s_j)^2}{2a^2}\right) + 10^{-6}\,\delta_{ij}, \quad \forall i,j \in [d]$, (44)

where $s_1, \ldots, s_d$ are regularly spaced scalars, $a > 0$ is a length-scale parameter, and $\delta_{ij} = 1$ if $i = j$ and zero otherwise.

Figure 5 shows the behavior of the quantitative bounds derived in Proposition 2 and Theorem 2. The Gaussian case allows one to compute the 2-Wasserstein distance exactly by noting that $\pi$ and $\pi_\rho$ are Gaussian with the same mean. On the other hand, the total variation distance has been estimated by using a Monte Carlo approximation. One can note that the general upper bound on the 2-Wasserstein distance is quite conservative for small $\rho$ since it does not catch the behavior in $\rho^2$ when $\rho$ is small. This is essentially due to the fact that this bound only assumes a finite moment property and does not require any regularity assumptions on $\pi$ such as differentiability or strong convexity of its potential. On the contrary, the bound on the total variation distance, derived under stronger assumptions, manages to achieve a rate of the order $\rho^2$ for small $\rho$.
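For this Gaussian example, the 2-Wasserstein distance between (42) and (43) has a closed form because $\Sigma$ and $\Sigma + \rho^2 I_d$ share the same eigenvectors. A sketch building the squared exponential matrix (44) under assumed values $d = 50$, $s_i$ equally spaced on $[0, 1]$ and $a = 0.1$ (these numerical settings are ours, the values used in the paper being unspecified here):

```python
import numpy as np

d, a, rho = 50, 0.1, 0.1
s = np.linspace(0, 1, d)
# squared exponential covariance (44) with a small jitter on the diagonal
Sigma = 2 * np.exp(-(s[:, None] - s[None, :])**2 / (2 * a**2)) + 1e-6 * np.eye(d)

# Since Sigma and Sigma + rho^2*I commute, W2 reduces to a sum over eigenvalues:
# W2^2 = sum_i (sqrt(lambda_i + rho^2) - sqrt(lambda_i))^2.
lam = np.linalg.eigvalsh(Sigma)
w2 = np.sqrt(np.sum((np.sqrt(lam + rho**2) - np.sqrt(lam))**2))
```

Since $\sqrt{\lambda + \rho^2} - \sqrt{\lambda} \leq \rho$, this exact distance always lies below $\rho\sqrt{d}$, which illustrates why a moment-only bound of that order is conservative here.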

### 5.2 Sparse linear regression

We study here a generalized version of the least absolute shrinkage and selection operator (lasso) regression problem analyzed by Park and Casella (2008). We assume a standard linear regression problem where centered observations $y \in \mathbb{R}^n$ are related to the unknown parameters $\theta \in \mathbb{R}^d$ via the model $y = X\theta + \varepsilon$, where $X \in \mathbb{R}^{n \times d}$ stands for a known standardized design matrix and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$. By considering a generalized Laplacian prior distribution for $\theta$, the target posterior distribution has density, for all $\theta \in \mathbb{R}^d$,

 $\pi(\theta \mid y) \propto \exp\left(-\dfrac{1}{2\sigma^2}\left\|y - X\theta\right\|_2^2 - \tau\left\|B\theta\right\|_1\right)$, (45)

where $\tau > 0$ and $B \in \mathbb{R}^{k \times d}$ stands for an arbitrary matrix acting on $\theta$. The choice of such a prior may promote a form of sparsity (lasso). For instance, this matrix might stand for a difference operator of a given order (Bredies et al. 2010) which is highly used in signal and image processing problems. As an archetypal example, the case where $B$ is the first-order difference operator leads to the well-known total variation regularization function (Chambolle et al. 2010) used to recover piecewise constant signals.

Note that because of the presence of the matrix $B$, finding an exact data augmentation leading to an efficient sampling scheme is not possible in the general case $B \neq I_d$. Instead, an AXDA model makes the posterior sampling task possible. Indeed, by regularizing the prior with a Gaussian term, the joint density writes

 $\pi_\rho(\theta, z \mid y) \propto \exp\left(-\dfrac{1}{2\sigma^2}\left\|y - X\theta\right\|_2^2 - \tau\left\|z\right\|_1 - \dfrac{1}{2\rho^2}\left\|B\theta - z\right\|_2^2\right)$. (46)

By resorting to a Gibbs algorithm to sample from (46), one can now use a simple data augmentation scheme (Park and Casella 2008) to sample from the conditional distribution of $z$. On the other hand, sampling from the conditional distribution of $\theta$, which is a multivariate Gaussian distribution, can be undertaken efficiently with state-of-the-art approaches (Papandreou and Yuille 2010; Barbos et al. 2017; Marnissi et al. 2018).
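A minimal sketch of such a Gibbs sampler is given below. Instead of the Park and Casella scale-mixture step mentioned above, this sketch (our own choice; all names and numerical settings are ours) draws each $z_i$ exactly from its two-sided truncated-Gaussian representation, and samples the Gaussian $\theta$-conditional via a Cholesky factor of its precision matrix:

```python
import numpy as np
from scipy.special import erfcx
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def sample_z_given_theta(m, tau, rho):
    """Exact draw from p(z_i) ∝ exp(-tau*|z_i| - (z_i - m_i)^2 / (2 rho^2)) for each i.

    The density is a two-component mixture of truncated Gaussians; the mixture
    weights follow from the same Gaussian integrals as in the smoothed prior,
    written with erfcx(x) = exp(x^2)*erfc(x) for stability.  For |m_i| much
    larger than ~25*rho this can still overflow; a log-domain variant would
    then be needed."""
    m = np.asarray(m, dtype=float)
    b = (tau * rho**2 - m) / (np.sqrt(2) * rho)
    c = (tau * rho**2 + m) / (np.sqrt(2) * rho)
    p_pos = erfcx(b) / (erfcx(b) + erfcx(c))     # probability of the z_i >= 0 branch
    pos = rng.random(m.shape) < p_pos
    z = np.empty_like(m)
    if pos.any():                                # N(m - tau*rho^2, rho^2) on [0, inf)
        mu = m[pos] - tau * rho**2
        z[pos] = truncnorm.rvs(-mu / rho, np.inf, loc=mu, scale=rho, random_state=rng)
    if (~pos).any():                             # N(m + tau*rho^2, rho^2) on (-inf, 0]
        mu = m[~pos] + tau * rho**2
        z[~pos] = truncnorm.rvs(-np.inf, -mu / rho, loc=mu, scale=rho, random_state=rng)
    return z

def gibbs_axda_lasso(y, X, B, sigma, tau, rho, n_iter=200):
    """Gibbs sampler alternating between the theta- and z-conditionals of (46)."""
    d = X.shape[1]
    Q = X.T @ X / sigma**2 + B.T @ B / rho**2    # precision of theta | z, y
    L = np.linalg.cholesky(Q)
    theta = np.zeros(d)
    chain = np.empty((n_iter, d))
    for t in range(n_iter):
        z = sample_z_given_theta(B @ theta, tau, rho)
        mean = np.linalg.solve(Q, X.T @ y / sigma**2 + B.T @ z / rho**2)
        theta = mean + np.linalg.solve(L.T, rng.standard_normal(d))  # N(mean, Q^{-1})
        chain[t] = theta
    return chain
```

The $\theta$-step exploits the fact that, for a fixed precision matrix $Q = LL^T$, the draw $\mathrm{mean} + L^{-T}\xi$ with $\xi \sim \mathcal{N}(0, I_d)$ has covariance $Q^{-1}$.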

In this specific case, the potential $g_\rho$ associated to the smoothed prior distribution (see (36)) has a closed-form expression given, for all $\theta \in \mathbb{R}^d$, by

 $g_\rho(\theta) = \dfrac{k}{2}\log(2\pi\rho^2) - \log\displaystyle\prod_{i=1}^k \int_{\mathbb{R}} \exp\left(-\tau|z_i| - \frac{1}{2\rho^2}\left(b_i^T\theta - z_i\right)^2\right)\mathrm{d}z_i$ (47)
 $\qquad\;\; = \dfrac{k}{2}\log(2\pi\rho^2) - \log\displaystyle\prod_{i=1}^k a_i(\theta)\left[\exp\left(b_i(\theta)^2\right)\left\{1 - \operatorname{erf}\left(b_i(\theta)\right)\right\} + \exp\left(c_i(\theta)^2\right)\left\{1 - \operatorname{erf}\left(c_i(\theta)\right)\right\}\right]$, (48)

with $a_i(\theta) = \rho\sqrt{\pi/2}\,\exp\left(-(b_i^T\theta)^2/(2\rho^2)\right)$, $b_i(\theta) = (\tau\rho^2 - b_i^T\theta)/(\sqrt{2}\rho)$, $c_i(\theta) = (\tau\rho^2 + b_i^T\theta)/(\sqrt{2}\rho)$, and $b_i^T$ standing for the $i$-th row of $B$. Note that in more general cases where $g_\rho$ has no closed form, one can estimate it by a Monte Carlo approximation.
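The closed form (48) can be cross-checked against direct quadrature of (47). A sketch (function names ours), using the stable identity $\operatorname{erfcx}(x) = \exp(x^2)\operatorname{erfc}(x)$ for the bracketed terms:

```python
import numpy as np
from scipy.special import erfcx
from scipy.integrate import quad

def g_rho_closed(m, tau, rho):
    """Potential (48) evaluated at the projections m_i = b_i^T theta (m is a vector)."""
    m = np.atleast_1d(np.asarray(m, dtype=float))
    k = m.size
    b = (tau * rho**2 - m) / (np.sqrt(2) * rho)
    c = (tau * rho**2 + m) / (np.sqrt(2) * rho)
    # each integral in (47) equals rho*sqrt(pi/2)*exp(-m^2/(2 rho^2))*(erfcx(b)+erfcx(c))
    log_int = (np.log(rho * np.sqrt(np.pi / 2)) - m**2 / (2 * rho**2)
               + np.log(erfcx(b) + erfcx(c)))
    return k / 2 * np.log(2 * np.pi * rho**2) - log_int.sum()

def g_rho_quad(m, tau, rho):
    """Direct quadrature of (47) for a single scalar projection m (k = 1)."""
    val, _ = quad(lambda z: np.exp(-tau * abs(z) - (m - z)**2 / (2 * rho**2)), -10, 10)
    return 0.5 * np.log(2 * np.pi * rho**2) - np.log(val)
```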

Figure 6 shows the behavior of the regularized potential $g_\rho$ defined in (48) for several values of the parameter $\rho$, along with the associated smoothed prior and posterior distributions. For simplicity and pedagogical reasons, the univariate case corresponding to $d = k = 1$ and $B = 1$ has been considered. The regularization parameter $\tau$ has been set to a fixed value. The contours of the shaded area correspond to the lower and upper bounds derived in (37). The potential $g_\rho$ is a smooth approximation of the potential associated to the initial prior, as expected, see Property iv) in Proposition 1. Note that the inequalities derived in (37) are verified. Although this approximation seems similar to the Moreau-Yosida regularization of a non-smooth potential function (Combettes and Pesquet 2011), the rationale behind this approximation is different. Indeed, the Moreau-Yosida envelope stands for a particular instance of the infimal convolution between two convex functions (an initial potential and a quadratic one). On the other hand, $g_\rho$ is the potential associated to a smoothed density obtained by convolution with a Gaussian kernel. In addition, the third row of Figure 6 shows the form of the posterior of $\theta$ defined in (46) for three values of $\rho$ and derived from the smoothed prior distributions shown in Figure 6. For sufficiently small values of $\rho$, the marginal $\pi_\rho$ stands for a quite accurate approximation of the original target $\pi$.