
# Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm

Alain Durmus (alain.durmus@telecom-paristech.fr), LTCI, Télécom ParisTech & CNRS, 46 rue Barrault, 75634 Paris Cedex 13, France.

Éric Moulines (eric.moulines@polytechnique.edu), Centre de Mathématiques Appliquées, UMR 7641, École Polytechnique, route de Saclay, 91128 Palaiseau Cedex, France.
###### Abstract

In this paper, we study a method to sample from a target distribution π over ℝᵈ having a positive density with respect to the Lebesgue measure, known up to a normalisation factor. This method is based on the Euler discretization of the overdamped Langevin stochastic differential equation associated with π. For both constant and decreasing step sizes in the Euler discretization, we obtain non-asymptotic bounds for the convergence to the target distribution π in total variation distance. A particular attention is paid to the dependency on the dimension d, to demonstrate the applicability of this method in the high-dimensional setting. These bounds improve and extend the results of [dalalyan:2014].


AMS 2000 subject classifications: primary 65C05, 60F05, 62L10; secondary 65C40, 60J05, 93E35.

Keywords: total variation distance, Langevin diffusion, Markov Chain Monte Carlo, Metropolis Adjusted Langevin Algorithm, rate of convergence.

## 1 Introduction

Sampling distributions over high-dimensional state spaces is a problem which has recently attracted a lot of research efforts in computational statistics and machine learning (see [cotter:roberts:stuart:white:2013] and [andrieu:defreitas:doucet:jordan:2003] for details); applications include Bayesian non-parametrics, Bayesian inverse problems and aggregation of estimators. All these problems boil down to sampling a target distribution π having a density w.r.t. the Lebesgue measure on ℝᵈ, known up to a normalisation factor, of the form x ↦ e^{−U(x)}/∫_{ℝᵈ} e^{−U(y)} dy, where U is continuously differentiable. We consider a sampling method based on the Euler discretization of the overdamped Langevin stochastic differential equation (SDE)

 dYt = −∇U(Yt)dt + √2 dBt, (1)

where (Bt)t≥0 is a d-dimensional Brownian motion. It is well known that the Markov semigroup associated with the Langevin diffusion is reversible w.r.t. π. Under suitable conditions, the convergence to π takes place at a geometric rate. Precise quantitative estimates of the rate of convergence with explicit dependency on the dimension d of the state space have recently been obtained using either functional inequalities such as Poincaré and log-Sobolev inequalities (see [bakry:cattiaux:guillin:2008, cattiaux:guillin:2009, bakry:gentil:ledoux:2014]) or coupling techniques (see [eberle:2015]). The Euler-Maruyama discretization scheme associated to the Langevin diffusion yields the discrete-time Markov chain (Xk)k≥0 given by

 Xk+1 = Xk − γk+1∇U(Xk) + √(2γk+1) Zk+1, (2)

where (Zk)k≥1 is an i.i.d. sequence of standard Gaussian d-dimensional random vectors and (γk)k≥1 is a sequence of step sizes, which can either be held constant or be chosen to decrease to 0. The idea of using the Markov chain (Xk)k≥0 to sample approximately from the target π was first introduced in the physics literature by [parisi:1981] and popularised in the computational statistics community by [grenander:1983] and [grenander:miller:1994]. It has been studied in depth by [roberts:tweedie:1996], who proposed to use a Metropolis-Hastings step at each iteration to enforce reversibility w.r.t. π, leading to the Metropolis Adjusted Langevin Algorithm (MALA). They coined the term unadjusted Langevin algorithm (ULA) for the case where the Metropolis-Hastings step is skipped.
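For concreteness, recursion (2) can be implemented in a few lines. The sketch below is an illustration, not part of the paper: it targets a standard one-dimensional Gaussian, i.e. U(x) = x²/2 and ∇U(x) = x, with an arbitrary constant step size.

```python
import random

def ula(grad_U, x0, gamma, n_iter, rng):
    """Unadjusted Langevin algorithm, recursion (2), with constant step size."""
    xs = []
    x = x0
    for _ in range(n_iter):
        # X_{k+1} = X_k - gamma * grad U(X_k) + sqrt(2 * gamma) * Z_{k+1}
        x = x - gamma * grad_U(x) + (2.0 * gamma) ** 0.5 * rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs

# Illustrative target: pi = N(0, 1), so U(x) = x^2 / 2 and grad U(x) = x.
rng = random.Random(0)
samples = ula(lambda x: x, x0=5.0, gamma=0.1, n_iter=20000, rng=rng)
kept = samples[2000:]  # discard a burn-in period
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / len(kept)
```

For this Gaussian target the update is an AR(1) recursion whose stationary law is Gaussian with variance 1/(1 − γ/2) ≈ 1.053 for γ = 0.1, not variance 1: the chain targets a biased stationary distribution πγ, which is exactly the discretization error quantified in the sequel.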

The purpose of this paper is to study the convergence of the ULA algorithm. The emphasis is put on non-asymptotic computable bounds; we pay particular attention to the way these bounds scale with the dimension d and with the constants characterizing the smoothness and curvature of the potential U. Our study covers both constant and decreasing step sizes, and we analyse both the "finite horizon" setting (where the total number of simulations is specified before running the algorithm) and the "any-time" setting (where the algorithm can be stopped after any iteration).

When the step size is constant, under appropriate conditions (see [roberts:tweedie:1996]), the Markov chain (Xk)k≥0 is V-uniformly geometrically ergodic with a stationary distribution πγ. With few exceptions, the stationary distribution πγ is different from the target π. If the step size γ is small enough, then the stationary distribution of this chain is in some sense close to π. We provide non-asymptotic bounds on the V-total variation distance between πγ and π, with explicit dependence on the step size γ and the dimension d. Our results complete and extend the recent works by [dalalyan:tsybakov:2012] and [dalalyan:2014].

When (γk)k≥1 decreases to zero, (Xk)k≥0 is a non-homogeneous Markov chain. If in addition Σk γk = +∞, we show that the marginal distribution of this non-homogeneous chain converges, under some mild additional conditions, to the target distribution π, and we provide explicit bounds for this convergence. Compared to the related works by [lamberton:pages:2002], [lamberton:pages:2003], [lemaire:2005] and [lemaire:menozzi:2010], we establish not only the weak convergence of the weighted empirical measure of the path to the target distribution, but a much stronger convergence in total variation, similarly to [dalalyan:2014], where the strongly log-concave case is considered.
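The two requirements on a decreasing schedule, γk → 0 and Γp = Σ_{k≤p} γk → +∞, are satisfied for instance by polynomial schedules γk = γ1 k^{−a} with a ∈ (0, 1]. The snippet below (with illustrative constants, not values taken from the paper) checks both properties numerically.

```python
def step_sizes(gamma1, a, n):
    """Polynomially decreasing schedule gamma_k = gamma1 * k**(-a), k = 1..n."""
    return [gamma1 * k ** (-a) for k in range(1, n + 1)]

gammas = step_sizes(gamma1=0.5, a=0.5, n=100000)
Gamma_n = sum(gammas)  # partial sum Gamma_n = gamma_1 + ... + gamma_n
# gamma_k -> 0 while Gamma_n grows like 2 * gamma1 * sqrt(n) -> infinity,
# so the marginal law of the inhomogeneous chain can converge to pi itself.
```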

The paper is organized as follows. In Section 2, the main convergence results are stated under abstract assumptions. We then specialize in Section 3 these results to different classes of densities. The proofs are gathered in Section 4. Some general convergence results for diffusions based on reflection coupling, which are of independent interest, are stated in Section 5.

### Notations and conventions

B(ℝᵈ) denotes the Borel σ-field of ℝᵈ and F(ℝᵈ) the set of all Borel measurable functions on ℝᵈ. For f ∈ F(ℝᵈ), set ∥f∥∞ = sup_{x∈ℝᵈ} |f(x)|. Denote by M(ℝᵈ) the space of finite signed measures on B(ℝᵈ) and M₀(ℝᵈ) = {μ ∈ M(ℝᵈ) | μ(ℝᵈ) = 0}. For μ ∈ M(ℝᵈ) and f a μ-integrable function, denote by μ(f) the integral of f w.r.t. μ. Let V : ℝᵈ → [1, +∞) be a measurable function. For f ∈ F(ℝᵈ), the V-norm of f is given by ∥f∥V = sup_{x∈ℝᵈ} |f(x)|/V(x). For μ ∈ M₀(ℝᵈ), the V-total variation distance of μ is defined as

 ∥μ∥V=supf∈F(Rd),∥f∥V≤1∣∣∣∫Rdf(x)dμ(x)∣∣∣.

If V ≡ 1, then ∥·∥V is the total variation norm, denoted by ∥·∥TV.

For p ≥ 1, denote by Lᵖ(π) the set of measurable functions f such that π(|f|ᵖ) < +∞. For f ∈ L²(π), the variance of f under π is denoted by Varπ(f). For all functions f : ℝᵈ → ℝ₊ such that f log(f) ∈ L¹(π), the entropy of f with respect to π is defined by

 Entπ(f)=∫Rdf(x)log(f(x))dπ(x).

Let μ and ν be two probability measures on B(ℝᵈ). If μ ≪ ν, we denote by dμ/dν the Radon-Nikodym derivative of μ w.r.t. ν. Denote for all x, y ∈ ℝᵈ by ⟨x, y⟩ the scalar product of x and y and by ∥x∥ the Euclidean norm of x. For k ≥ 0, denote by Cᵏ(ℝᵈ) the set of k-times continuously differentiable functions f : ℝᵈ → ℝ. For f ∈ C²(ℝᵈ), denote by ∇f the gradient of f and by Δf the Laplacian of f. For all x ∈ ℝᵈ and M > 0, we denote by B(x, M) the ball centered at x of radius M. Denote for M > 0 the oscillation of a function f ∈ F(ℝᵈ) in the ball B(0, M) by osc_{B(0,M)}(f) = sup_{B(0,M)} f − inf_{B(0,M)} f, and the oscillation of a bounded function f on ℝᵈ by osc(f) = sup_{ℝᵈ} f − inf_{ℝᵈ} f. In the sequel, we take the convention that Σ_{k=p}^{n} = 0 and Π_{k=p}^{n} = 1 for n, p ∈ ℕ with n < p.

## 2 General conditions for the convergence of ULA

In this section, we derive a bound on the convergence of ULA to the target distribution π when the Langevin diffusion is geometrically ergodic and the Markov kernel associated with the EM discretization satisfies a Foster-Lyapunov drift inequality.

Consider the following assumption on the potential U:

###### L 1.

The function U is continuously differentiable on ℝᵈ and gradient Lipschitz, i.e. there exists L ≥ 0 such that for all x, y ∈ ℝᵈ,

 ∥∇U(x)−∇U(y)∥≤L∥x−y∥.

Under L 1, by [ikeda:watanabe:1989, Theorem 2.4-3.1], for every initial point x ∈ ℝᵈ there exists a unique strong solution (Yt(x))t≥0 to the Langevin SDE (1). Define for all t ≥ 0, x ∈ ℝᵈ and A ∈ B(ℝᵈ), Pt(x, A) = P(Yt(x) ∈ A). The semigroup (Pt)t≥0 is reversible w.r.t. π, and hence admits π as its (unique) invariant distribution. In this section, we consider the case where (Pt)t≥0 is geometrically ergodic, i.e. there exists κ ∈ (0, 1) such that for any initial distribution μ0 and all t ≥ 0,

 ∥μ0Pt − π∥TV ≤ C(μ0) κ^t, (3)

for some constant C(μ0) < +∞. Denote by AL the generator associated with the semigroup (Pt)t≥0, given for all f ∈ C²(ℝᵈ) by

 ALf=−⟨∇U,∇f⟩+Δf.

A twice continuously differentiable function V : ℝᵈ → [1, +∞) is a Lyapunov function for the generator AL if there exist θ > 0, β ≥ 0 and a set E ∈ B(ℝᵈ) such that

 ALV≤−θV+β\mathbbm1E. (4)

By [roberts:tweedie:1996, Theorem 2.2], if the set E in (4) is a non-empty compact set, then the Langevin diffusion is geometrically ergodic.

Consider now the EM discretization (2) of the diffusion (1). Let (γk)k≥1 be a sequence of positive and nonincreasing step sizes, and for n, p ∈ ℕ with n ≤ p, denote by

 Γn,p=p∑k=nγk,Γn=Γ1,n. (5)

For γ > 0, consider the Markov kernel Rγ given for all A ∈ B(ℝᵈ) and x ∈ ℝᵈ by

 Rγ(x,A)=∫A(4\uppiγ)−d/2exp(−(4γ)−1∥y−x+γ∇U(x)∥2)dy.

The discretized Langevin diffusion (Xk)k≥0 given in (2) is a time-inhomogeneous Markov chain: for k ≥ 0 and A ∈ B(ℝᵈ), P(Xk+1 ∈ A | X0, …, Xk) = Rγk+1(Xk, A). Define, for n, p ≥ 1 with n ≤ p,

 Qn,pγ=Rγn⋯Rγp,Qnγ=Q1,nγ,

with the convention that, for n > p, Qn,pγ is the identity operator. Under L 1, the Markov kernel Rγ is strongly Feller, irreducible and strongly aperiodic. We will say that a measurable function V : ℝᵈ → [1, +∞) satisfies a Foster-Lyapunov drift condition for Rγ if there exist constants λ ∈ (0, 1), c ≥ 0 and γ̄ > 0 such that, for all γ ∈ (0, γ̄],

 RγV ≤ λ^γ V + γc. (6)

The particular form of (6) reflects how the mixing rate of the Markov chain depends upon the step size γ. If γ = 0, then λ^γ = 1 and R0(x, ·) = δx for all x ∈ ℝᵈ: a Markov chain with transition kernel R0 is not mixing. Intuitively, as γ gets larger, it is expected that the mixing of Rγ increases. If Rγ satisfies (6) for some γ > 0, then Rγ admits a unique stationary distribution πγ. We use (6) to control quantitatively the moments of the time-inhomogeneous chain. The types of bounds which are needed are summarised in the following elementary lemma.
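As a sanity check of the form (6) (an illustration with hand-picked constants, not values derived in the paper), take the Gaussian potential U(x) = x²/2 and V(x) = 1 + x². Then Rγ(x, ·) is Gaussian with mean (1 − γ)x and variance 2γ, so RγV(x) = 1 + 2γ + (1 − γ)²x² in closed form, and (6) holds with λ = e⁻² and c = 4 for all γ ∈ (0, 1]:

```python
import math

def R_gamma_V(x, gamma):
    # For U(x) = x^2 / 2, R_gamma(x, .) = N((1 - gamma) * x, 2 * gamma), so
    # E[1 + X_1^2 | X_0 = x] = 1 + 2 * gamma + (1 - gamma)**2 * x**2.
    return 1.0 + 2.0 * gamma + (1.0 - gamma) ** 2 * x ** 2

lam, c = math.exp(-2.0), 4.0  # hand-picked constants for this example
ok = all(
    R_gamma_V(x, g) <= lam ** g * (1.0 + x ** 2) + g * c + 1e-12
    for x in [k / 2.0 for k in range(-20, 21)]   # grid of states
    for g in [j / 100.0 for j in range(1, 101)]  # step sizes in (0, 1]
)
```

The inequality holds because (1 − γ)² ≤ e^{−2γ} = λ^γ and 1 + 2γ ≤ e^{−2γ} + 4γ for γ ∈ [0, 1].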

###### Lemma \thelemma.

Let V : ℝᵈ → [1, +∞) be a measurable function. Assume that for all γ ∈ (0, γ̄] and x ∈ ℝᵈ, (6) holds for some constants λ ∈ (0, 1) and c ≥ 0. Let (γk)k≥1 be a sequence of nonincreasing step sizes such that γ1 ≤ γ̄. Then for all n ≥ 0 and x ∈ ℝᵈ, QnγV(x) ≤ F(λ, Γn, c, γ1, V(x)), where

 F(λ, a, c, γ, w) = λ^a w + c(−λ^γ log(λ))^{−1}. (7)
###### Proof.

The proof is postponed to Section 4.1. ∎

Note that Section 2 implies that supn≥0 QnγV(x) ≤ G(λ, c, γ1, V(x)), where

 G(λ, c, γ, w) = w + c(−λ^γ log(λ))^{−1}. (8)

We give below the main ingredients which are needed to obtain a quantitative bound on ∥δxQpγ − π∥TV for all x ∈ ℝᵈ. This quantity is decomposed as follows: for all n, p ≥ 1 with n ≤ p,

 ∥δxQpγ−π∥TV≤∥δxQnγQn+1,pγ−δxQnγPΓn+1,p∥TV+∥δxQnγPΓn+1,p−π∥TV. (9)

To control the first term on the right hand side, we use a method introduced in [dalalyan:tsybakov:2012] and elaborated in [dalalyan:2014]. The second term is bounded using the convergence of the semi-group to , see (3).

###### Proposition \theproposition.

Assume that L 1 and (3) hold. Let (γk)k≥1 be a sequence of nonnegative step sizes. Then for all x ∈ ℝᵈ, p ≥ 1 and n ∈ {0, …, p − 1},

 ∥δxQpγ − π∥TV ≤ 2^{−1/2} L (∑_{k=n}^{p−1} {(γ_{k+1}^3/3) A(γ, x) + d γ_{k+1}^2})^{1/2} + C(δxQnγ) κ^{Γ_{n+1,p}}, (10)

where κ and C(·) are defined in (3) and

 A(γ, x) = sup_{k≥0} ∫_{ℝᵈ} ∥∇U(z)∥² Qkγ(x, dz). (11)
###### Proof.

The proof follows the same lines as [dalalyan:2014, Lemma 2] but is given for completeness. For y ∈ ℝᵈ, let C = C(ℝ₊, ℝᵈ) be the space of continuous functions on ℝ₊ taking values in ℝᵈ. For all n ≤ p, denote by μ^y_{n,p} and μ̄^y_{n,p} the laws on C of the Langevin diffusion (Yt) and of the continuously-interpolated Euler discretization (Ȳt), both started at y at time Γn. Denote by (Yt, Ȳt)t≥Γn the unique strong solution started at (y, y) at time Γn of the time-inhomogeneous diffusion defined for t ≥ Γn by

 dYt = −∇U(Yt)dt + √2 dBt,
 dȲt = −∇̄U(Ȳ, t)dt + √2 dBt, (12)

where, for any continuous function w ∈ C and t ≥ Γn,

 ∇̄U(w, t) = ∑_{k=n}^{∞} ∇U(w_{Γk}) 𝟙_{[Γk, Γk+1)}(t). (13)

Girsanov’s Theorem [karatzas:shreve:1991, Theorem 5.1, Corollary 5.16, Chapter 3] shows that μ^y_{n,p} and μ̄^y_{n,p} are mutually absolutely continuous and, in addition, μ̄^y_{n,p}-almost surely,

 dμ^y_{n,p}/dμ̄^y_{n,p}(Ȳ) = exp(2^{−1/2} ∫_{Γn}^{Γp} ⟨∇̄U(Ȳ, s) − ∇U(Ȳs), dBs⟩ − 4^{−1} ∫_{Γn}^{Γp} ∥∇̄U(Ȳ, s) − ∇U(Ȳs)∥² ds). (14)

Under L 1, (14) implies, for all y ∈ ℝᵈ,

 KL(μ^y_{n,p} | μ̄^y_{n,p}) ≤ 4^{−1} ∫_{Γn}^{Γp} E[∥∇U(Ȳs(y)) − ∇̄U(Ȳ(y), s)∥²] ds
 ≤ 4^{−1} ∑_{k=n}^{p−1} ∫_{Γk}^{Γk+1} E[∥∇U(Ȳs(y)) − ∇U(Ȳ_{Γk}(y))∥²] ds
 ≤ 4^{−1} L² ∑_{k=n}^{p−1} {(γ_{k+1}^3/3) ∫_{ℝᵈ} ∥∇U(z)∥² Q_{n+1,k}^γ(y, dz) + d γ_{k+1}^2}. (15)

By the Pinsker inequality, ∥μ^y_{n,p} − μ̄^y_{n,p}∥TV ≤ {2 KL(μ^y_{n,p} | μ̄^y_{n,p})}^{1/2}. The proof is concluded by combining this inequality, (15) and (3) in (9). ∎

In the sequel, depending on the conditions on the potential U and the techniques of proof, for any given x, A(γ, x) will admit two kinds of upper bounds: either a bound which is uniform in the step sizes, or a bound which grows with the number of iterations. In both cases, as shown below, it is possible to choose n as a function of p so that limp→+∞ ∥δxQpγ − π∥TV = 0 under appropriate conditions on the sequence of step sizes.

###### Proposition \theproposition.

Assume that L 1 and (3) hold. Let (γk)k≥1 be a nonincreasing sequence satisfying limk→+∞ γk = 0 and limp→+∞ Γp = +∞. Then limp→+∞ ∥δxQpγ − π∥TV = 0 for any x ∈ ℝᵈ for which one of the two following conditions holds:

1. supγ∈(0,γ1] A(γ, x) < +∞, where A is defined in (11), and limsupn→+∞ C(δxQnγ)/(−log(γn)) < +∞.

2. supγ∈(0,γ1] A(γ, x) < +∞ and supk≥1 log(C(δxQkγ))/Γk < +∞.

###### Proof.
1. There exists p0 such that for all p ≥ p0, κ^{Γ1,p} ≤ γ1. Therefore, we can define for all p ≥ p0,

 n(p) def= min{k ∈ {0, …, p−1} | κ^{Γ_{k+1,p}} > γ_{k+1}}. (16)

We first show that limp→+∞ n(p) = +∞. The proof goes by contradiction: otherwise, we could extract a bounded subsequence (n(pj))j≥0. For such a sequence, γ_{n(pj)+1} is bounded away from 0, but limj→+∞ κ^{Γ_{n(pj)+1,pj}} = 0, which yields a contradiction. The definition of n(p) implies that κ^{Γ_{n(p),p}} ≤ γ_{n(p)}, showing that

 limsup_{p→+∞} C(δxQ^{n(p)}γ) κ^{Γ_{n(p),p}} ≤ limsup_{p→+∞} [C(δxQ^{n(p)}γ)/(−log(γ_{n(p)}))] × limsup_{p→+∞} {γ_{n(p)}(−log(γ_{n(p)}))} = 0.

On the other hand, since (γk)k≥1 is nonincreasing, for any ℓ ≥ 2,

 ∑_{k=n(p)+1}^{p} γ_k^ℓ ≤ γ_{n(p)+1}^{ℓ−1} Γ_{n(p)+1,p} ≤ γ_{n(p)+1}^{ℓ−1} log(γ_{n(p)+1})/log(κ).

The proof then follows from (10) with n = n(p).

2. For all p ≥ 1, define n(p) = max{k ∈ {1, …, p} | Γk ≤ γ1 log(Γp)}. Note that since limp→+∞ Γp = +∞, we have limp→+∞ n(p) = +∞. Using limk→+∞ γk = 0 and the fact that (γk)k≥1 is a nonincreasing sequence, we get for all ℓ ≥ 2,

 limp→+∞p∑k=n(p)γℓk=0,

which shows that the first term on the right side of (10) goes to 0 as p goes to infinity. As for the second term, using that Γ_{n(p)} ≤ γ1 log(Γp) and log(κ) < 0, we get

 C(δxQ^{n(p)}γ) κ^{Γ_{n(p),p}} ≤ exp(log(κ)Γp + [{log(C(δxQ^{n(p)}γ))/Γ_{n(p)}}₊ − log(κ)]Γ_{n(p)})
 ≤ exp(log(κ)Γp + [sup_{k≥1}{log(C(δxQkγ))/Γk}₊ − log(κ)]γ1 log(Γp)).

Using limp→+∞ Γp = +∞ and limp→+∞ Γp/log(Γp) = +∞, we have limp→+∞ C(δxQ^{n(p)}γ)κ^{Γ_{n(p),p}} = 0, which concludes the proof.

Using (10), we can also assess the convergence of the algorithm for constant step sizes, γk = γ for all k ≥ 1. Two different kinds of results can be derived. First, for a given precision ε > 0, we can optimize the step size γ to minimize the number of iterations required to achieve ∥δxRγp − π∥TV ≤ ε. Second, if the total number of iterations p is fixed, we may determine the step size γ which minimizes the bound on ∥δxRγp − π∥TV.

###### Lemma \thelemma.

Assume that (10) holds. Assume in addition that there exists γ̄ > 0 such that C(δxQnγ) ≤ C̄(x) and A(γ, x) ≤ Ā(x) for all γ ∈ (0, γ̄] and n ≥ 0, where C and A are defined in (3) and (11) respectively. Then for all ε > 0, we get ∥δxRγp − π∥TV ≤ ε if

 p > Tγ^{−1}  and  γ ≤ {(−d + √(d² + (2/3)Ā(x)ε²(L²T)^{−1}))/(2Ā(x)/3)} ∧ γ̄, (17)

where

 T=(log{¯C(x)}−log(ε/2))/(−log(κ)).
###### Proof.

For p > Tγ^{−1}, set n = p − ⌈Tγ^{−1}⌉. Then using the stated bounds Ā(x) and C̄(x) in (10) concludes the proof. ∎

Note that the upper bound on γ defined in (17) is at most ε²(2dL²T)^{−1} ∧ γ̄. The dependency of T on the dimension d will be addressed in Section 3.
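For illustration, the prescription (17) can be evaluated numerically. All constants below (d, L, κ, C̄(x), Ā(x), γ̄) are placeholder values, not quantities derived in the paper; the code only checks that the resulting pair (γ, p) makes the constant-step right-hand side of (10) fall below ε.

```python
import math

# Illustrative constants (placeholders, not derived from the paper).
d, L, kappa, C_bar, A_bar, gamma_max, eps = 2, 1.0, 0.9, 10.0, 5.0, 1.0, 0.5

# T from the lemma: running for time T makes the second term of (10) <= eps / 2.
T = (math.log(C_bar) - math.log(eps / 2.0)) / (-math.log(kappa))

# Largest step size allowed by (17) (positive root of the quadratic in gamma).
gamma_star = min(
    (-d + math.sqrt(d ** 2 + (2.0 / 3.0) * A_bar * eps ** 2 / (L ** 2 * T)))
    / (2.0 * A_bar / 3.0),
    gamma_max,
)
gamma = 0.5 * gamma_star            # stay strictly below the threshold
p = math.ceil(T / gamma) + 1        # number of ULA iterations

# Right-hand side of (10) with constant step size and n = 0.
discretization = (L / math.sqrt(2.0)) * math.sqrt(
    p * ((gamma ** 3 / 3.0) * A_bar + d * gamma ** 2)
)
mixing = C_bar * kappa ** (p * gamma)
bound = discretization + mixing
```

Expanding the square root to first order recovers the scaling noted above: the admissible step size is of order ε²/d, so the required number of iterations p ≳ T/γ grows like d/ε² up to logarithmic factors.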

###### Lemma \thelemma.

Assume that L 1 and (3) hold. In addition, assume that there exist Ā(x) and C̄n(x), n ≥ 0, such that A(γ, x) ≤ Ā(x) and C(δxQnγ) ≤ C̄n(x). For all x ∈ ℝᵈ and all p > n, if γ = log(p − n)/{(−log(κ))(p − n)}, then

 ∥δxRγp − π∥TV ≤ (p − n)^{−1/2} {C̄n(x)(p − n)^{−1/2} + log(p − n)(d + Ā(x) log(p − n)(p − n)^{−1})^{1/2}}.
###### Proof.

The proof is a straightforward calculation using (10). ∎

To get quantitative bounds for the total variation distance, it is therefore required to get bounds on κ and C(μ0) in (3), and to control A(γ, x). We will consider in the sequel two different approaches to get (3), one based on functional inequalities, the other on coupling techniques. We will also consider increasingly stringent assumptions on the potential U. Whereas we will always obtain the same type of exponential bounds, the dependency of the constants on the dimension d will be markedly different. In the worst case, the dependency is exponential. It is polynomial when U is convex.

## 3 Practical conditions for geometric ergodicity of the Langevin diffusion and their consequences for ULA

### 3.1 Superexponential densities

Assume first that the potential U is superexponential outside a ball. This is a rather weak assumption (we do not assume convexity here).

###### H 1.

The potential U is twice continuously differentiable and there exist ρ > 0, α > 0 and Mρ ≥ 0 such that for all x ∈ ℝᵈ with ∥x∥ ≥ Mρ, ⟨∇U(x), x/∥x∥⟩ ≥ ρ∥x∥^α.

The price to pay will be constants which are exponential in the dimension d. Under H 1, the potential U is unbounded off compact sets. Since U is continuous, it has a global minimizer x⋆, which is a point at which ∇U(x⋆) = 0. Without loss of generality, it is assumed that U(x⋆) = 0.

###### Lemma \thelemma.

Assume L 1 and H 1. Then for all x ∈ ℝᵈ,

 U(x) ≥ ρ∥x − x⋆∥^α/(α+1) − aα  with  aα = ρMρ^α/(α+1) + Mρ²L/2. (18)
###### Proof.

The elementary proof is postponed to Section 4.2. ∎

Following [roberts:tweedie:1996, Theorem 2.3], we first establish a drift condition for the diffusion.

###### Proposition \theproposition.

Assume L 1 and H 1. For any ς ∈ (0, 1), the drift condition (4) is satisfied with the Lyapunov function Vς = exp(ςU) and explicit constants θς > 0, βς ≥ 0 and a compact set Eς. Moreover, there exist constants Cς ≥ 0 and υς > 0 such that for all t ≥ 0 and probability measures μ0 and ν0 on B(ℝᵈ) satisfying μ0(Vς) + ν0(Vς) < +∞,

 ∥μ0Pt − ν0Pt∥Vς ≤ Cς e^{−υς t} ∥μ0 − ν0∥Vς,  ∥μ0Pt − π∥Vς ≤ Cς e^{−υς t} μ0(Vς).
###### Proof.

The proof, adapted from [roberts:tweedie:1996, Theorem 2.3] and [meyn:tweedie:1993:3, Theorem 6.1], is postponed to Section 4.3. ∎

Under H 1, explicit expressions for Cς and υς have been developed in the literature, but these estimates are in general very conservative. We now turn to establishing (6) for the Euler discretization.

###### Proposition \theproposition.

Assume L 1 and H 1. Let γ̄ > 0. For all γ ∈ (0, γ̄] and x ∈ ℝᵈ, Rγ satisfies the drift condition (6) with the Lyapunov function V = V1/2 and explicit constants λ ∈ (0, 1) and c ≥ 0.

###### Proof.

The proof is postponed to Section 4.4. ∎

###### Theorem 1.

Assume L 1 and H 1. Let (γk)k≥1 be a nonincreasing sequence with γ1 ≤ γ̄. Then, for all x ∈ ℝᵈ, p ≥ 1 and n ∈ {0, …, p − 1}, (10) holds with A(γ, x) ≤ ∥∇U∥²_{V1/2} G(λ, c, γ1, V(x)) and

 C(δxQnγ)≤C1/2F(λ,Γ1,n,c,γ1,V(x)), (19)

where C1/2, υ1/2 are given by Section 3.1, F by (7), λ, c, γ̄ by Section 3.1, G by (8), and aα by (18).

###### Proof.

The proof is postponed to Section 4.5. ∎

Equation 19 implies that supn≥0 C(δxQnγ) < +∞ for all x ∈ ℝᵈ, so Section 2-1 shows that limp→+∞ ∥δxQpγ − π∥TV = 0 for all x ∈ ℝᵈ, provided that limk→+∞ γk = 0 and limp→+∞ Γp = +∞. In addition, for the case of constant step size, γk = γ for all k ≥ 1, the two lemmas of Section 2 can be applied.

Let γ ∈ (0, γ̄]. By Section 3.1, Rγ is a contraction operator on the space of finite signed measures μ ∈ M(ℝᵈ) with μ(ℝᵈ) = 0, endowed with the norm ∥·∥V1/2. It is therefore possible to control ∥δxRγp − π∥V1/2. To simplify the notations, we limit our discussion to constant step sizes.

###### Theorem 2.

Assume L 1 and H 1. Then, for all γ ∈ (0, γ̄], x ∈ ℝᵈ and p ≥ 1, we have

 ∥δxRγp − π∥V1/2 ≤ C1/4 κ_γ^p V1/2(x) + B(γ, V(x)), (20)

where C1/4, κγ are defined in Section 3.1, λ, c in Section 3.1, G in (8), and

 B²(γ, v) = L² max(1, C1/4²)(1 + γ)(1 − κ)^{−2}(2G(λ, c, γ, v) + β^{1/2}/θ^{1/2}) × (γd + 3^{−1}γ²∥∇U∥²_{V1/2} G(λ, c, γ, v)).

Moreover, Rγ has a unique invariant distribution πγ and

 ∥π−πγ∥V1/2≤B(γ,1).
###### Proof.

The proof of (20) is postponed to Section 4.6. The bound for ∥π − πγ∥V1/2 is an easy consequence of (20): by Section 3.2 and [meyn:tweedie:2009, Theorem 16.0.1], Rγ is V1/2-uniformly ergodic: limp→+∞ ∥δxRγp − πγ∥V1/2 = 0 for all x ∈ ℝᵈ. Finally, (20) shows that for all x ∈ ℝᵈ,

 ∥π−πγ∥V1/2≤limp→+∞{∥δxRpγ−π∥V1/2+∥δxRpγ−πγ∥V1/2}≤B(γ,V(x)).

Taking the minimum over x ∈ ℝᵈ concludes the proof. ∎

Note that Theorem 2 implies that there exists a constant C, which does not depend on γ, such that ∥π − πγ∥V1/2 ≤ Cγ^{1/2}.
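The γ^{1/2} rate concerns the V1/2-norm uniformly over a class of functions; for a single smooth test function the bias is typically of order γ, as the remark below recalls. For the Gaussian potential U(x) = x²/2 (an illustration outside the paper's derivations), πγ can be computed exactly: it is Gaussian with variance 2γ/(1 − (1 − γ)²) = (1 − γ/2)^{−1}, so the bias of the second moment is γ/2 + O(γ²):

```python
def second_moment_bias(gamma):
    # ULA update for U(x) = x^2 / 2 is the AR(1) recursion
    # X_{k+1} = (1 - gamma) X_k + sqrt(2 gamma) Z, whose stationary
    # variance is 2 gamma / (1 - (1 - gamma)^2) = 1 / (1 - gamma / 2).
    return 2.0 * gamma / (1.0 - (1.0 - gamma) ** 2) - 1.0  # pi has moment 1

biases = {g: second_moment_bias(g) for g in (0.1, 0.01, 0.001)}
# The bias expands as gamma / 2 + gamma^2 / 4 + ...: halving gamma halves
# the bias, the O(gamma) behaviour of [talay:tubaro:1991] for a smooth
# test function, faster than the O(gamma^{1/2}) norm bound of Theorem 2.
```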

###### Remark \theremark.

It is shown in [talay:tubaro:1991, Theorem 4] that for f ∈ C^∞(ℝᵈ) with polynomial growth, |π(f) − πγ(f)| ≤ Cf γ for some constant Cf, provided that U satisfies L 1 and H 1. Our result does not match this bound, since ∥π − πγ∥V1/2 = O(γ^{1/2}). However, the bound is uniform over the class of measurable functions f satisfying |f(x)| ≤ V1/2(x) for all x ∈ ℝᵈ. Obtaining such uniform bounds in total variation is important in Bayesian inference, for example to compute high posterior density credible regions. Our result also strengthens and completes [mattingly:stuart:higham:2002, Corollary 7.5], which states that, under H 1 with α = 1, for any measurable function ϕ satisfying for all x, y ∈ ℝᵈ,

 |ϕ(x)−ϕ(y)|≤C∥x−y∥{1+∥x∥k+∥y∥k},

for some , , for some constants and