Global Non-convex Optimization with Discretized Diffusions


Abstract

An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems. We show that this property holds for any suitably smooth diffusion and that different diffusions are suitable for optimizing different classes of convex and non-convex functions. This allows us to design diffusions suitable for globally optimizing convex and non-convex functions not covered by the existing Langevin theory. Our non-asymptotic analysis delivers computable optimization and integration error bounds based on easily accessed properties of the objective and chosen diffusion. Central to our approach are new explicit Stein factor bounds on the solutions of Poisson equations. We complement these results with improved optimization guarantees for targets other than the standard Gibbs measure.

Murat A. Erdogdu (1,2) erdogdu@cs.toronto.edu
Lester Mackey (3) lmackey@microsoft.com
Ohad Shamir (4) ohad.shamir@weizmann.ac.il
(1) University of Toronto, (2) Vector Institute, (3) Microsoft Research, (4) Weizmann Institute of Science

32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

1 Introduction

Consider the unconstrained and possibly non-convex optimization problem

$$\operatorname*{minimize}_{x \in \mathbb{R}^d} \; f(x).$$

Recent studies have shown that the Langevin algorithm – in which an appropriately scaled isotropic Gaussian vector is added to a gradient descent update – globally optimizes $f$ whenever the objective is dissipative ($\langle \nabla f(x), x\rangle \ge \alpha\|x\|_2^2 - \beta$ for $\alpha > 0$) with a Lipschitz gradient Gelfand and Mitter (1991); Raginsky et al. (2017); Xu et al. (2017). Remarkably, these globally optimized objectives need not be convex and can even be multimodal. The intuition behind the success of the Langevin algorithm is that the stochastic optimization method approximately tracks the continuous-time Langevin diffusion, which admits the Gibbs measure – a distribution defined by $p_\gamma(x) \propto \exp(-\gamma f(x))$ – as its invariant distribution. Here, $\gamma > 0$ is an inverse temperature parameter, and when $\gamma$ is large, the Gibbs measure concentrates around its modes. As a result, for large values of $\gamma$, a rapidly mixing Langevin algorithm will be close to a global minimum of $f$. In this case, rapid mixing is ensured by the Lipschitz gradient and dissipativity. Due to its simplicity, efficiency, and well-understood theoretical properties, the Langevin algorithm and its derivatives have found numerous applications in machine learning (see, e.g., Welling and Teh, 2011; Dalalyan, 2017a).

In this paper, we prove an analogous global optimization property for the Euler discretization of any smooth and dissipative diffusion and show that different diffusions are suitable for solving different classes of convex and non-convex problems. Our non-asymptotic analysis, based on a multidimensional version of Stein’s method, establishes explicit bounds on both integration and optimization error. Our contributions can be summarized as follows:

• For any function $f$, we provide explicit bounds on the numerical integration error of discretized dissipative diffusions. Our bounds depend only on simple properties of the diffusion’s coefficients and Stein factors, i.e., bounds on the derivatives of the associated Poisson equation solution.

• For pseudo-Lipschitz $f$, we derive explicit first- through fourth-order Stein factor bounds for every fast-coupling diffusion with smooth coefficients. Since our bounds depend on Wasserstein coupling rates, we provide user-friendly, broadly applicable tools for computing these rates. The resulting computable integration error bounds recover the known Markov chain Monte Carlo convergence rates of the Langevin algorithm in both convex and non-convex settings but apply more broadly.

• We introduce new explicit bounds on the expected suboptimality of sampling from a diffusion. Together with our integration error bounds, these yield computable and convergent bounds on global optimization error. We demonstrate that improved optimization guarantees can be obtained by targeting distributions other than the standard Gibbs measure.

• We show that different diffusions are appropriate for different objectives and detail concrete examples of global non-convex optimization enabled by our framework but not covered by the existing Langevin theory. For example, while the Langevin diffusion is particularly appropriate for dissipative and hence quadratic-growth $f$ Raginsky et al. (2017); Xu et al. (2017), we show alternative diffusions are appropriate for “heavy-tailed” $f$ with subquadratic or sublinear growth.

We emphasize that, while past work has assumed the existence of finite Stein factors Chen et al. (2015); Xu et al. (2017), focused on deriving convergence rates with inexplicit constants Mattingly et al. (2010); Vollmer et al. (2016); Xu et al. (2017), or concentrated singularly on the Langevin diffusion Dalalyan (2017a); Durmus et al. (2017); Xu et al. (2017); Raginsky et al. (2017), the goals of this work are to provide the reader with tools to (a) check the appropriateness of a given diffusion for optimizing a given objective and (b) compute explicit optimization and integration error bounds based on easily accessed properties of the objective and chosen diffusion.

The rest of the paper is organized as follows. Section 1.1 surveys related work. Section 2 provides an introduction to diffusions and their use in optimization and reviews our notation. Section 3 provides explicit bounds on integration error in terms of Stein factors and on Stein factors in terms of simple properties of $f$ and the diffusion. In Section 4, we provide explicit bounds on optimization error by targeting Gibbs and non-Gibbs invariant measures and discuss how to obtain better optimization error using non-Gibbs invariant measures. We give concrete examples of applying these tools to non-convex optimization problems in Section 5 and conclude in Section 6.

1.1 Related work

The Euler discretization of the Langevin diffusion is commonly termed the Langevin algorithm and has been studied extensively in the context of sampling from a log concave distribution. Non-asymptotic integration error bounds for the Langevin algorithm are studied in Dalalyan and Tsybakov (2012); Dalalyan (2017b); Durmus et al. (2017); Dwivedi et al. (2018). A representative bound follows from combining the ergodicity of the diffusion with a discretization error analysis and yields $\epsilon$ error in $O(1/\epsilon^2)$ steps for the strongly log concave case and $O(1/\epsilon^4)$ steps for the general log concave case Dalalyan (2017b); Durmus et al. (2017).

Our work is motivated by a line of research that uses the Langevin algorithm to globally optimize non-convex functions. Gelfand and Mitter (1991) established the global convergence of an appropriate variant of the algorithm, and Raginsky et al. (2017) subsequently used optimal transport theory to prove optimization and integration error bounds. For example, Raginsky et al. (2017) provides an integration error bound of $\epsilon$ after $\mathrm{poly}(\epsilon^{-1}, d, \lambda_*^{-1})$ steps under the quadratic-growth assumptions of dissipativity and a Lipschitz gradient; the estimate involves the inverse spectral gap parameter $\lambda_*^{-1}$, a quantity that is often unknown and sometimes exponential in both inverse temperature and dimension. In this work, we accommodate “heavy-tailed” objectives that grow subquadratically and trade the often unknown and hence inexplicit spectral gap parameter of Raginsky et al. (2017) for the more user-friendly distant dissipativity condition (Prop. 3.4), which provides a straightforward and explicit certification of fast coupling and hence the fast mixing of a diffusion. For distantly dissipative diffusions, the size of our error bounds is driven primarily by a computable distance parameter; in the Langevin setting, an analogous quantity is studied in place of the spectral gap in the contemporaneous work of Cheng et al. (2018).

Cheng et al. (2018) provide integration error bounds for sampling with the overdamped Langevin algorithm under a distant strong convexity assumption (a special case of distant dissipativity). The authors build on the results of Durmus et al. (2017); Eberle (2016) and establish $\epsilon$ error in $O(1/\epsilon^2)$ steps. We consider general distantly dissipative diffusions and establish an integration error of $\epsilon$ in $O(1/\epsilon^2)$ steps under mild assumptions on the objective function and smoothness of the diffusion.

Vollmer et al. (2016) used the solution of the Poisson equation in their analysis of stochastic Langevin gradient descent, invoking the bounds of Pardoux and Veretennikov (2001, Thms. 1 and 2) to obtain Stein factors. However, Thms. 1 and 2 of Pardoux and Veretennikov (2001) yield only inexplicit constants and require bounded diffusion coefficients, a strong assumption violated by the examples treated in Section 5. Chen et al. (2015) considered a broader range of diffusions but assumed, without verification, that Stein factors and Markov chain moments were universally bounded by constants independent of all problem parameters. One of our principal contributions is a careful enumeration of the dependencies of these Stein factors and Markov chain moments on the objective and the candidate diffusion. Our convergence analysis builds on the arguments of Mattingly et al. (2010); Gorham et al. (2016), and our Stein factor bounds rely on distant and uniform dissipativity conditions for $L_1$-Wasserstein rate decay Eberle (2016); Gorham et al. (2016) and the smoothing effect of the Markov semigroup Cerrai (2001); Gorham et al. (2016). Our Stein factor results significantly generalize the existing bounds of Gorham et al. (2016) by accommodating pseudo-Lipschitz objectives and quadratic growth in the covariance coefficient and deriving the first four Stein factors explicitly.

2 Optimization with Discretized Diffusions: Preliminaries

Consider a target objective function $f : \mathbb{R}^d \to \mathbb{R}$. Our goal is to carry out unconstrained minimization of $f$ with the aid of a candidate diffusion defined by the stochastic differential equation (SDE)

$$dZ_t^z = b(Z_t^z)\,dt + \sigma(Z_t^z)\,dB_t \quad\text{with}\quad Z_0^z = z.$$

Here, $(B_t)_{t \ge 0}$ is an $l$-dimensional Wiener process, and $b : \mathbb{R}^d \to \mathbb{R}^d$ and $\sigma : \mathbb{R}^d \to \mathbb{R}^{d \times l}$ represent the drift and the diffusion coefficients, respectively. The diffusion $Z_t^z$ starts at a point $z \in \mathbb{R}^d$ and, under the conditions of Section 3, admits a limiting invariant distribution $P$ with (Lebesgue) density $p$. To encourage sampling near the minima of $f$, we would like to choose $p$ so that the maximizers of $p$ correspond to minimizers of $f$. Fortunately, under mild conditions, one can construct a diffusion with target invariant distribution $P$ (see, e.g., Ma et al., 2015; Gorham et al., 2016, Thm. 2) by selecting the drift coefficient

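$$b(x) = \frac{1}{2p(x)}\,\langle \nabla,\; p(x)\,(a(x) + c(x)) \rangle,$$

where $a(x) \triangleq \sigma(x)\sigma(x)^\top$ is the covariance coefficient, $c(x) = -c(x)^\top \in \mathbb{R}^{d \times d}$ is the skew-symmetric stream coefficient, and $\langle \nabla, m(x) \rangle = \sum_j e_j \sum_k \frac{\partial m_{jk}(x)}{\partial x_k}$ denotes the divergence operator with $\{e_j\}_j$ as the standard basis of $\mathbb{R}^d$. As an illustration, consider the (overdamped) Langevin diffusion for the Gibbs measure with inverse temperature $\gamma > 0$ and density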

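$$p_\gamma(x) \propto \exp(-\gamma f(x)) \qquad (2.1)$$

associated with our objective $f$. Inserting $\sigma(x) = \sqrt{2/\gamma}\, I$ and $c(x) = 0$ into the drift formula above, we obtain

$$b_j(x) = \frac{1}{2p_\gamma(x)}\,\langle \nabla,\; p_\gamma(x)(a(x) + c(x)) \rangle_j = \frac{1}{\gamma p_\gamma(x)} \sum_k \frac{\partial\, p_\gamma(x) I_{jk}}{\partial x_k} = \frac{1}{\gamma p_\gamma(x)} \frac{\partial p_\gamma(x)}{\partial x_j} = -\frac{\partial f(x)}{\partial x_j},$$

which reduces to $b = -\nabla f$. We emphasize that the choice of the Gibbs measure is arbitrary, and we will consider other measures that yield superior guarantees for certain minimization problems.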
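As a quick numerical sanity check of this derivation, the following sketch evaluates the drift formula by central finite differences in one dimension and compares it with $-f'(x)$. The double-well objective, the value of `gamma`, and the helper name `drift_from_density` are illustrative assumptions rather than part of the paper, and the stream coefficient is taken to be zero.

```python
import math

def drift_from_density(log_p, a, x, h=1e-5):
    """1-D drift b(x) = (1 / (2 p(x))) * d/dx [p(x) a(x)] with c = 0.

    Works with an unnormalized log-density because the normalizing
    constant cancels in the ratio p(t) / p(x).
    """
    def scaled_pa(t):
        return math.exp(log_p(t) - log_p(x)) * a(t)   # p(t) a(t) / p(x)
    # Central finite difference of t -> p(t) a(t) / p(x) at t = x.
    return 0.5 * (scaled_pa(x + h) - scaled_pa(x - h)) / (2.0 * h)

gamma = 2.0                                   # hypothetical inverse temperature
f = lambda x: 0.25 * x**4 - 0.5 * x**2        # hypothetical double-well objective
f_prime = lambda x: x**3 - x
log_p = lambda x: -gamma * f(x)               # Gibbs density, up to a constant
a = lambda x: 2.0 / gamma                     # Langevin: a = sigma^2 = 2 / gamma

x = 0.7
b_numeric = drift_from_density(log_p, a, x)   # drift from the density formula
b_langevin = -f_prime(x)                      # drift predicted by b = -grad f
```

The two values should agree up to finite-difference error. Replacing `a` with a non-constant function gives a drift that still targets the same density but is no longer a plain negative gradient.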

In practice, the diffusion defined by the SDE above cannot be simulated in continuous time and is instead approximated by a discrete-time numerical integrator. We will show that a particular discretization, the Euler method, can be used as a global optimization algorithm for various families of convex and non-convex $f$. The Euler method is the most commonly used discretization technique due to its explicit form and simplicity; however, our analysis can be generalized to other numerical integrators as well. For $m = 0, 1, \dots$, the Euler discretization of the SDE corresponds to the Markov chain updates

$$X_{m+1} = X_m + \eta\, b(X_m) + \sqrt{\eta}\, \sigma(X_m) W_m,$$

where $\eta > 0$ is the step size, and $W_m \sim \mathcal{N}(0, I)$ is an isotropic Gaussian vector that is independent from $X_m$. This update rule defines a Markov chain which typically has an invariant measure that is different from the invariant measure of the continuous-time diffusion. However, when the step size $\eta$ is sufficiently small, the difference between the two invariant measures becomes small and can be quantitatively characterized (see, e.g., Mattingly et al., 2002). Our optimization algorithm is simply to evaluate the function $f$ at each Markov chain iterate $X_m$ and report the point with the smallest function value.
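The procedure just described (run the Euler chain, evaluate the objective at every iterate, keep the best point) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `euler_minimize`, the toy quadratic objective, and the parameter values are hypothetical, and the Langevin coefficients $b = -\nabla f$, $\sigma = \sqrt{2/\gamma}\, I$ are used only as one possible instance.

```python
import math
import random

def euler_minimize(f, b, sigma, x0, eta, num_steps, seed=0):
    """Run X_{m+1} = X_m + eta * b(X_m) + sqrt(eta) * sigma(X_m) W_m and
    return the iterate among X_1, ..., X_M with the smallest value of f."""
    rng = random.Random(seed)
    x = list(x0)
    best_x, best_val = None, math.inf
    for _ in range(num_steps):
        drift = b(x)
        s = sigma(x)                                   # d x l matrix as a list of rows
        w = [rng.gauss(0.0, 1.0) for _ in range(len(s[0]))]
        x = [x[j] + eta * drift[j]
             + math.sqrt(eta) * sum(s[j][k] * w[k] for k in range(len(w)))
             for j in range(len(x))]
        val = f(x)
        if val < best_val:                             # track the best iterate so far
            best_x, best_val = list(x), val
    return best_x, best_val

gamma = 10.0                                           # hypothetical inverse temperature

def f(x):                                              # hypothetical toy objective
    return 0.5 * sum(v * v for v in x)

def b(x):                                              # Langevin drift: -grad f
    return [-v for v in x]

def sigma(x):                                          # Langevin: sqrt(2 / gamma) * I
    c = math.sqrt(2.0 / gamma)
    return [[c if j == k else 0.0 for k in range(len(x))] for j in range(len(x))]

best_x, best_val = euler_minimize(f, b, sigma, [3.0, 3.0], eta=0.1, num_steps=500)
```

Any other pair of coefficient functions with the same signatures can be passed in place of the Langevin choice; the loop itself does not depend on it.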

Denoting by $p(f)$ the expectation of $f$ under the density $p$ – i.e., $p(f) = \mathbb{E}_{Z \sim p}[f(Z)]$ – we decompose the optimization error after $M$ steps of our Markov chain into two components,

$$\min_{m=1,\dots,M} \mathbb{E}[f(X_m)] - \min_x f(x) \;\le\; \underbrace{\frac{1}{M}\sum_{m=1}^{M} \mathbb{E}[f(X_m)] - p(f)}_{\text{integration error}} \;+\; \underbrace{p(f) - \min_x f(x)}_{\text{expected suboptimality}},$$

and bound each term on the right-hand side separately. The integration error – which captures both the short-term non-stationarity of the chain and the long-term bias due to discretization – is the subject of Section 3; we develop explicit bounds using techniques that build upon Mattingly et al. (2010); Gorham et al. (2016). The expected suboptimality quantifies how well exact samples from $p$ minimize $f$ on average. In Section 4, we extend the Gibbs measure Langevin diffusion bound of Raginsky et al. (2017) to more general invariant measures and associated diffusions and demonstrate the benefits of targeting non-Gibbs measures.

Notation

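We say a function $g$ is pseudo-Lipschitz continuous of order $n$ if it satisfies

$$|g(x) - g(y)| \le \tilde\mu_{1,n}(g)\,(1 + \|x\|_2^n + \|y\|_2^n)\,\|x - y\|_2 \quad\text{for all } x, y \in \mathbb{R}^d,$$

where $\|\cdot\|_2$ denotes the Euclidean norm, and $\tilde\mu_{1,n}(g)$ is the smallest constant satisfying the inequality above. This assumption, which relaxes the more stringent Lipschitz assumption, allows $g$ to exhibit polynomial growth of order $n + 1$. For example, $g(x) = x^2$ is not Lipschitz but satisfies the inequality with $n = 1$ and $\tilde\mu_{1,1}(g) \le 1$. In all of our examples of interest, $n \le 1$. For operator and Frobenius norms $\|\cdot\|_{\mathrm{op}}$ and $\|\cdot\|_F$, we use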

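$$\phi_1(g) = \sup_{x, y \in \mathbb{R}^d,\, x \ne y} \frac{\|g(x) - g(y)\|_F}{\|x - y\|_2}, \quad \mu_0(g) = \sup_{x \in \mathbb{R}^d} \|g(x)\|_{\mathrm{op}}, \quad\text{and}\quad \mu_i(g) = \sup_{x, y \in \mathbb{R}^d,\, x \ne y} \frac{\|\nabla^{i-1} g(x) - \nabla^{i-1} g(y)\|_{\mathrm{op}}}{\|x - y\|_2}$$

for the $i$-th order Lipschitz coefficients of a sufficiently differentiable function $g$. We denote the degree $n$ polynomial coefficient of the $i$-th derivative of $g$ by $\tilde\pi_{i,n}(g)$, i.e.,

$$\|\nabla^i g(x)\|_{\mathrm{op}} \le \tilde\pi_{i,n}(g)\,(1 + \|x\|_2^n) \quad\text{where}\quad \tilde\pi_{i,n}(g) = \sup_{x \in \mathbb{R}^d} \frac{\|\nabla^i g(x)\|_{\mathrm{op}}}{1 + \|x\|_2^n}.$$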
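To make the pseudo-Lipschitz definition concrete, the sketch below checks the example $g(x) = x^2$ numerically on random scalar pairs. The helper name `pseudo_lipschitz_ratio`, the sampling range, and the sample size are assumptions made for illustration; the form of the inequality checked is the order-$n$ bound with weight $1 + |x|^n + |y|^n$.

```python
import random

def pseudo_lipschitz_ratio(g, x, y, n):
    """|g(x) - g(y)| / ((1 + |x|^n + |y|^n) |x - y|) for scalars x != y."""
    return abs(g(x) - g(y)) / ((1.0 + abs(x) ** n + abs(y) ** n) * abs(x - y))

rng = random.Random(0)
g = lambda x: x * x
pairs = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(1000)]

# Order n = 1: the ratio equals |x + y| / (1 + |x| + |y|), so it stays below 1.
worst_order_1 = max(pseudo_lipschitz_ratio(g, x, y, 1) for x, y in pairs)

# Order n = 0 reduces to a plain Lipschitz ratio (up to the constant 3),
# which equals |x + y| / 3 and grows with the sampling range.
worst_order_0 = max(pseudo_lipschitz_ratio(g, x, y, 0) for x, y in pairs)
```

Widening the sampling range leaves `worst_order_1` below one while `worst_order_0` keeps growing, which is the sense in which $x^2$ is pseudo-Lipschitz of order one but not Lipschitz.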

3 Explicit Bounds on Integration Error

We develop our explicit bounds on integration error in three steps. In Theorem 3.1, we bound integration error in terms of the polynomial growth and dissipativity of diffusion coefficients (Conditions 1 and 2) and Stein factor bounds on the derivatives of solutions to the diffusion’s Poisson equation (Condition 3). Condition 3 is a common assumption in the literature but is typically not verified. To address this shortcoming, Theorem 3.2 shows that any smooth, fast-coupling diffusion admits finite Stein factors expressed in terms of diffusion coupling rates (Condition 4). Finally, in Section 3.1, we provide user-friendly tools for explicitly bounding those diffusion coupling rates. We begin with our conditions.

Condition 1 (Polynomial growth of coefficients).

For some $r \in \{1, 2\}$ and all $x \in \mathbb{R}^d$, the drift and the diffusion coefficients of the diffusion SDE satisfy the following growth condition

$$\|b(x)\|_2 \le \frac{\lambda_b}{4}(1 + \|x\|_2), \quad \|\sigma(x)\|_F \le \frac{\lambda_\sigma}{4}(1 + \|x\|_2), \quad\text{and}\quad \|\sigma\sigma^\top(x)\|_{\mathrm{op}} \le \frac{\lambda_a}{4}(1 + \|x\|_2^r).$$

The existence and uniqueness of the solution to the diffusion SDE is guaranteed under Condition 1 (Khasminskii, 2011, Thm 3.5). The cases $r = 1$ and $r = 2$ correspond to linear and quadratic growth of $\|\sigma\sigma^\top(x)\|_{\mathrm{op}}$, and we will explore examples of both settings in Section 5. As we will see in each result to follow, the quadratic growth case is far more delicate.

Condition 2 (Dissipativity).

For $\alpha, \beta > 0$, the diffusion satisfies the dissipativity condition

$$\mathcal{A}\|x\|_2^2 \le -\alpha\|x\|_2^2 + \beta \quad\text{for}\quad \mathcal{A}g(x) \triangleq \langle b(x), \nabla g(x) \rangle + \tfrac{1}{2}\langle \sigma(x)\sigma(x)^\top, \nabla^2 g(x) \rangle.$$

$\mathcal{A}$ is the generator of the diffusion with coefficients $b$ and $\sigma$, and $\mathcal{A}\|x\|_2^2 = 2\langle b(x), x \rangle + \|\sigma(x)\|_F^2$.

Dissipativity is a standard assumption that ensures that the diffusion does not diverge but rather travels inward when far from the origin Mattingly et al. (2002). Notably, a linear growth bound on $\|\sigma(x)\|_F$ and a quadratic growth bound on $\|\sigma\sigma^\top(x)\|_{\mathrm{op}}$ follow directly from the linear growth of $\|b(x)\|_2$ and Condition 2. However, in many examples, tighter growth constants can be obtained by inspection.
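The dissipativity condition can be checked by hand for simple diffusions. The sketch below does this for one illustrative case, using the identity that the generator applied to $\|x\|_2^2$ equals $2\langle b(x), x\rangle + \|\sigma(x)\|_F^2$ (the gradient of $\|x\|_2^2$ is $2x$ and its Hessian is $2I$). The quadratic objective $f(x) = \|x\|_2^2/2$, the Langevin coefficients, and the values of `gamma` and `d` are assumptions chosen so that the condition holds with equality for $\alpha = 2$ and $\beta = 2d/\gamma$.

```python
import math

def generator_sq_norm(b, sigma, x):
    """A||x||^2 = 2 <b(x), x> + ||sigma(x)||_F^2."""
    drift = b(x)
    s = sigma(x)
    inner = sum(bj * xj for bj, xj in zip(drift, x))
    frob_sq = sum(v * v for row in s for v in row)
    return 2.0 * inner + frob_sq

gamma, d = 5.0, 3                        # hypothetical temperature and dimension

def b(x):                                # Langevin drift for f(x) = ||x||^2 / 2
    return [-v for v in x]

def sigma(x):                            # Langevin diffusion coefficient
    c = math.sqrt(2.0 / gamma)
    return [[c if j == k else 0.0 for k in range(d)] for j in range(d)]

alpha, beta = 2.0, 2.0 * d / gamma       # candidate dissipativity constants
x = [1.5, -0.3, 2.0]
lhs = generator_sq_norm(b, sigma, x)                 # A||x||^2
rhs = -alpha * sum(v * v for v in x) + beta          # -alpha ||x||^2 + beta
```

For objectives that are not quadratic, the same function can be evaluated on a grid or random sample of points to search for constants $\alpha$ and $\beta$ that make the inequality hold.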

Our final condition concerns the solution of the Poisson equation (also known as the Stein equation in the Stein’s method literature) associated with our candidate diffusion.

Condition 3 (Finite Stein factors).

The function $u_f$ solves the Poisson equation with the generator $\mathcal{A}$ of Condition 2,

$$f - p(f) = \mathcal{A}u_f,$$

is pseudo-Lipschitz of order $n$ with constant $\zeta_1$, and has $i$-th order derivative with degree-$n$ polynomial growth for $i = 2, 3, 4$, i.e.,

$$\|\nabla^i u_f(x)\|_{\mathrm{op}} \le \zeta_i\,(1 + \|x\|_2^n) \quad\text{for } i \in \{2, 3, 4\} \text{ and all } x \in \mathbb{R}^d.$$

In other words, $\tilde\mu_{1,n}(u_f) \le \zeta_1$, and $\tilde\pi_{i,n}(u_f) \le \zeta_i$ for $i = 2, 3, 4$ with $\max_i \zeta_i < \infty$.

The coefficients $\zeta_i$ govern the regularity of the Poisson equation solution $u_f$ and are termed Stein factors in the Stein’s method literature. Although variants of Condition 3 have been assumed in previous work (Chen et al., 2015; Vollmer et al., 2016), we emphasize that this assumption is not easily verified, and frequently only empirical evidence is provided as justification for the assumption Chen et al. (2015). We will ultimately derive explicit expressions for the Stein factors $\zeta_i$ for a wide variety of diffusions and functions $f$, but first we will use these factors to bound the integration error of our discretized diffusion.

Theorem 3.1 (Integration error of discretized diffusions).

Let Conditions 3, 2 and 1 hold for some . For any even integer111In a typical example where is bounded by a quadratic polynomial, we have and . We also remind the reader that the double factorial is of order . and a step size satisfying ,

where

 c1=6ζ1,        c2=116[2ζ2λ2b+ζ3λbλ2σ+ζ4(1+3n−1)λ4σ], c3=148[ζ3λ3b+ζ4λ4b(1+3n−1)+4ζ41.5nn4(λ4b+n2eλ4σ)(λnb+n!!λnσ)],

This integration error bound, proved in Appendix A, is $O(\frac{1}{\eta M} + \eta)$, since the higher-order term can be combined with the dominant term of order $\eta$ when $\eta < 1$. We observe that one needs $O(\epsilon^{-2})$ steps to reach a tolerance of $\epsilon$. Theorem 3.1 seemingly makes no assumptions on the objective function $f$, but in fact the dependence on $f$ is present in the growth parameters, the Stein factors, and the polynomial degree of the Poisson equation solution. For example, we will show in Theorem 3.2 that this polynomial degree is upper bounded by that of the objective function $f$. To characterize the function classes covered by Theorem 3.1, we next turn to dissecting the Stein factors.
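The step-count claim follows from balancing the two terms of the bound. As a simplified illustration, suppose the bound is summarized as $C(\frac{1}{\eta M} + \eta)$ for a single lumped constant $C$; this summary, the constant `C`, and the function name `steps_for_tolerance` are assumptions made here, not quantities defined in the paper. Choosing $\eta = \epsilon/(2C)$ and $M = \lceil 4C^2/\epsilon^2 \rceil$ then makes each term at most $\epsilon/2$.

```python
import math

def steps_for_tolerance(C, eps):
    """Pick (eta, M) so that C * (1 / (eta * M) + eta) <= eps.

    eta = eps / (2C) makes C * eta = eps / 2, and
    M = ceil(4 C^2 / eps^2) makes C / (eta * M) <= eps / 2.
    """
    if not 0 < eps < 2 * C:
        raise ValueError("need 0 < eps < 2C so that the step size is below 1")
    eta = eps / (2.0 * C)
    M = math.ceil(4.0 * C * C / (eps * eps))
    return eta, M

C = 3.0                                   # hypothetical lumped constant
eta, M = steps_for_tolerance(C, eps=0.01)
bound = C * (1.0 / (eta * M) + eta)       # value of the simplified bound
```

The number of steps scales as $\epsilon^{-2}$ while the step size scales as $\epsilon$, so halving the tolerance quadruples the required number of iterations.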

While verifying Conditions 1 and 2 for a given diffusion is often straightforward, it is not immediately clear how one might verify Condition 3. As our second principal contribution, we derive explicit values for the Stein factors $\zeta_i$ for any smooth and dissipative diffusion exhibiting fast $L_1$-Wasserstein decay:

Condition 4 (Wasserstein rate).

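The diffusion $Z_t^x$ has $L_p$-Wasserstein rate $\varrho_p : \mathbb{R}_{\ge 0} \to \mathbb{R}$ if

$$\inf_{\text{couplings } (Z_t^x, Z_t^y)} \mathbb{E}\big[\|Z_t^x - Z_t^y\|_2^p\big]^{1/p} \le \varrho_p(t)\,\|x - y\|_2 \quad\text{for all } x, y \in \mathbb{R}^d \text{ and } t \ge 0,$$

where the infimum is taken over all couplings between $Z_t^x$ and $Z_t^y$. We further define the relative rates

$$\tilde\varrho_1(t) = \log(\varrho_2(t)/\varrho_1(t)) \quad\text{and}\quad \tilde\varrho_2(t) = \log(\varrho_1(t)/[\varrho_1(0)\varrho_2(t)]) \,/\, \log(\varrho_1(t)/\varrho_1(0)).$$
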
Theorem 3.2 (Finite Stein factors from Wasserstein decay).

Assume that Conditions 1, 2 and 4 hold and that $f$ is pseudo-Lipschitz continuous of order $n$ with, for $i = 2, 3, 4$, at most degree-$n$ polynomial growth of its $i$-th order derivatives. Then, the Stein factors in Condition 3 are given as

$$\zeta_i = \tau_i + \xi_i \int_0^\infty \varrho_1(t)\,\omega_r(t + i - 2)\,dt \quad\text{for}\quad i = 1, 2, 3, 4, \quad\text{where,}$$

with , , and

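$$\tau_1 = 0 \quad\text{and}\quad \tau_i = \tilde\mu_{1,n}(f)\,\tilde\pi_{2:i,n}(f)\,\tilde\nu_{1:i}(b)\,\tilde\nu_{1:i}(\sigma)\,\kappa_r(6n) \quad\text{for}\quad i = 2, 3, 4,$$
$$\xi_1 = \tilde\mu_{1,n}(f) \quad\text{and}\quad \xi_i = \tilde\mu_{1,n}(f)\,\tilde\nu_{1:i}(b)\,\tilde\nu_{1:i}(\sigma)\,\tilde\nu_{0:i-2}(\sigma^{-1})\,\varrho_1(0)\,\omega_r(1)\,\kappa_r(6n)^{i-1} \quad\text{for}\quad i = 2, 3, 4,$$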

where is as in Theorem 3.1, , and for a function , denotes an upper bound on its derivatives of order through .

A more detailed version of the above theorem is given as Theorem C.6 in Appendix C along with its proof. We emphasize that, to provide finite Stein factors, Theorem 3.2 only requires $L_1$-Wasserstein decay and allows the $L_2$-Wasserstein rate to grow. An integrable Wasserstein rate is an indication that a diffusion mixes quickly to its stationary distribution. Hence, Theorem 3.2 suggests that, for a given $f$, one should select a diffusion that mixes quickly to a stationary measure, like the Gibbs measure (2.1), with modes at the minimizers of $f$. We explore user-friendly conditions implying fast Wasserstein decay in Section 3.1 and detailed examples deploying these tools in Section 5. Crucially for the “heavy-tailed” examples given in Section 5, Theorem 3.2 allows for an unbounded diffusion coefficient $\sigma$, unlike the classic results of Pardoux and Veretennikov (2001).

3.1 Sufficient conditions for Wasserstein decay

A simple condition that leads to exponential $L_1$ and $L_2$-Wasserstein decay is uniform dissipativity (3.1). The next result from Wang (2016) (see also Cattiaux and Guillin, 2014, Sec. 1; Gorham et al., 2016, Thm. 10) makes the relationship precise.

Proposition 3.3 (Wasserstein decay from uniform dissipativity (Wang, 2016, Thm. 2.5)).

A diffusion with drift and diffusion coefficients $b$ and $\sigma$ has $L_p$-Wasserstein rate $\varrho_p(t) = e^{-kt/2}$ if, for all $x, y \in \mathbb{R}^d$,

$$2\langle b(x) - b(y), x - y \rangle + \|\sigma(x) - \sigma(y)\|_F^2 + (p - 2)\|\sigma(x) - \sigma(y)\|_{\mathrm{op}}^2 \le -k\|x - y\|_2^2. \qquad (3.1)$$

In the Gibbs measure Langevin case, where $b = -\nabla f$ and $\sigma \equiv \sqrt{2/\gamma}\, I$, uniform dissipativity is equivalent to the strong convexity of $f$. As we will see in Section 5, the extra degree of freedom in the diffusion coefficient $\sigma$ will allow us to treat non-convex and non-strongly convex functions $f$.
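A small simulation can illustrate how uniform dissipativity translates into contraction. The sketch below runs two Euler-Langevin chains for the strongly convex scalar objective $f(x) = h x^2/2$ driven by the same Gaussian noise (a synchronous coupling), so the noise cancels in their difference. It assumes that the exponential rate takes the form $e^{-kt/2}$ with $k = 2h$, which is what (3.1) gives for a constant diffusion coefficient; the values of `h`, `gamma`, the step size, and the function name are hypothetical.

```python
import math
import random

def coupled_langevin_distance(h, gamma, x0, y0, eta, num_steps, seed=0):
    """Run two Euler-Langevin chains for f(x) = h * x^2 / 2 with shared noise.

    Returns |X_m - Y_m| after num_steps steps (synchronous coupling).
    """
    rng = random.Random(seed)
    x, y = x0, y0
    noise_scale = math.sqrt(2.0 * eta / gamma)
    for _ in range(num_steps):
        w = rng.gauss(0.0, 1.0)            # the same Gaussian drives both chains
        x = x - eta * h * x + noise_scale * w
        y = y - eta * h * y + noise_scale * w
    return abs(x - y)

h, gamma = 1.5, 4.0                        # hypothetical curvature and temperature
eta, num_steps = 0.01, 400
k = 2.0 * h                                # constant in (3.1) for constant sigma
t = eta * num_steps                        # elapsed continuous time
dist = coupled_langevin_distance(h, gamma, x0=2.0, y0=-1.0,
                                 eta=eta, num_steps=num_steps)
rate_bound = math.exp(-k * t / 2.0) * abs(2.0 - (-1.0))
```

Here the distance between the chains shrinks by the factor $(1 - \eta h)$ at every step, which is at most $e^{-\eta h}$, so `dist` stays below `rate_bound`. With a non-convex objective the shared-noise difference need not contract, which is one motivation for the more general distant dissipativity condition.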

A more general condition leading to exponential $L_1$-Wasserstein decay is the distant dissipativity condition (3.2). The following result of Gorham et al. (2016) builds upon the pioneering analyses of Eberle (2016, Cor. 2) and Wang (2016, Thm. 2.6) to provide explicit Wasserstein decay.

Proposition 3.4 (Wasserstein decay from distant dissipativity (Gorham et al., 2016, Cor. 4.2)).

A diffusion with drift and diffusion coefficients and satisfying and

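$$\frac{\langle b(x) - b(y), x - y \rangle}{s^2\|x - y\|_2^2/2} + \frac{\|\tilde\sigma(x) - \tilde\sigma(y)\|_F^2}{s^2\|x - y\|_2^2} - \frac{\|(\tilde\sigma(x) - \tilde\sigma(y))^\top (x - y)\|_2^2}{s^2\|x - y\|_2^4} \le \begin{cases} -K & \text{if } \|x - y\|_2 > R \\ L & \text{if } \|x - y\|_2 \le R \end{cases} \qquad (3.2)$$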

for , , and has Wasserstein rate for

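$$s^2 k^{-1} \le \begin{cases} \frac{e - 1}{2} R^2 + e\sqrt{8K^{-1}}\,R + 4K^{-1} & \text{if } LR^2 \le 8 \\ 8\sqrt{2\pi}\,R^{-1}L^{-1/2}(L^{-1} + K^{-1})\exp\!\big(\tfrac{LR^2}{8}\big) + 32R^{-2}K^{-2} & \text{if } LR^2 > 8. \end{cases} \qquad (3.3)$$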

Conveniently, both uniform and distant dissipativity imply our dissipativity condition, Condition 2. The Prop. 3.4 rates feature the distance-dependent parameter . In the pre-conditioned Langevin Gibbs setting ( and constant) when is the negative log likelihood of a multimodal Gaussian mixture, in creftype 3.2 represents the maximum distance between modes (Gorham et al., 2016). When is relatively small, the convergence of the diffusion towards its stationary distribution is rapid, and the non-uniformity parameter is small; when is relatively large, the parameter grows exponentially in , as would be expected due to infrequent diffusion transitions between modes.
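The growth of the rate constant with the distance parameter can be seen by evaluating the right-hand side of (3.3) directly. The sketch below is a hedged reading of that display, with the first branch taken as $\frac{e-1}{2}R^2 + e\sqrt{8/K}\,R + 4/K$; the function name and the sample values of $R$, $L$, and $K$ are hypothetical.

```python
import math

def distant_dissipativity_bound(R, L, K):
    """Upper bound on s^2 / k from (3.3), given R >= 0, L >= 0 and K > 0."""
    if L * R * R <= 8.0:
        return ((math.e - 1.0) / 2.0 * R * R
                + math.e * math.sqrt(8.0 / K) * R
                + 4.0 / K)
    # In this branch L * R^2 > 8, so both L and R are strictly positive.
    return (8.0 * math.sqrt(2.0 * math.pi) / (R * math.sqrt(L))
            * (1.0 / L + 1.0 / K) * math.exp(L * R * R / 8.0)
            + 32.0 / (R * R * K * K))

L, K = 1.0, 1.0                                   # hypothetical constants
bounds = {R: distant_dissipativity_bound(R, L, K) for R in (0.0, 1.0, 5.0, 10.0)}
```

For small $R$ the bound grows only quadratically, whereas once $LR^2$ exceeds 8 the factor $\exp(LR^2/8)$ dominates, matching the qualitative picture of rare transitions between well-separated modes.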

Our next result, proved in Appendix E, provides a user-friendly set of sufficient conditions for verifying distant dissipativity and hence exponential Wasserstein decay in practice.

Proposition 3.5 (User-friendly Wasserstein decay).

Fix any diffusion and skew-symmetric stream coefficients and satisfying for , , and . If

$$-\frac{\langle m(x)\nabla f(x) - m(y)\nabla f(y), x - y \rangle}{\|x - y\|_2^2} \le \begin{cases} -K_m & \text{if } \|x - y\|_2 > R_m \\ L_m & \text{if } \|x - y\|_2 \le R_m, \end{cases}$$

holds for , , then, for any inverse temperature , the diffusion with drift and diffusion coefficients and has stationary density and satisfies creftype 3.2 with , , , and .

4 Explicit Bounds on Optimization Error

To convert our integration error bounds into bounds on optimization error, we now turn our attention to bounding the expected suboptimality term of our optimization error decomposition. To characterize the expected suboptimality of sampling from a measure with modes matching the minima of $f$, we generalize a result due to Raginsky et al. (2017). The original result (Raginsky et al., 2017, Prop. 3.4) was designed to analyze the Gibbs measure (2.1) and demanded that $\log p_\gamma$ be smooth, in the sense that $\mu_2(\log p_\gamma) < \infty$. Our next proposition, proved in Appendix D, is designed for more general measures $p$ and importantly relaxes the smoothness requirements on $\log p$.

Proposition 4.1 (Expected suboptimality: Sampling yields near-optima).

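Suppose $p$ is the stationary density of an $(\alpha, \beta)$-dissipative diffusion (Condition 2) with global maximizer $x^*$. Fix $C > 0$ and $\theta > 0$. If $\log p(x^*) - \log p(x) \le C\|x - x^*\|_2^{2\theta}$ for all $x$, then

$$-p(\log p) + \log p(x^*) \le \frac{d}{2\theta}\log\Big(\frac{2C}{d}\Big) + \frac{d}{2}\log\Big(\frac{e\beta}{\alpha}\Big).$$

If $p$ takes the generalized Gibbs form $p_{\gamma,\theta}(x) \propto \exp(-\gamma (f(x) - f(x^*))^\theta)$ for $\gamma > 0$, we have

$$p_{\gamma,\theta}(f) - f(x^*) \le \sqrt[\theta]{\frac{d}{2\gamma}\Big(\frac{1}{\theta}\log\Big(\frac{2\gamma}{d}\Big) + \log\Big(\frac{e\beta\mu_2(f)}{2\alpha}\Big)\Big)}.$$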

When $\theta = 1$, $p_{\gamma,\theta}$ is the Gibbs measure, and the generalized Gibbs bound above exactly recovers (Raginsky et al., 2017, Prop. 3.4). The generalized Gibbs measures with $\theta > 1$ allow for improved dependence on the inverse temperature in some regimes. Note however that, for $\theta \ne 1$, the distributions $p_{\gamma,\theta}$ also require knowledge of the optimal value $f(x^*)$. In certain practical settings, such as neural network optimization, it is common to have $f(x^*) = 0$. When $f(x^*)$ is unknown, a similar analysis can be carried out by replacing $f(x^*)$ with an estimate, and the bound still holds up to a controllable error factor.
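As a worked check of the generalized Gibbs bound in the case $\theta = 1$, the sketch below evaluates its right-hand side, read as the $\theta$-th root of $\frac{d}{2\gamma}\big(\frac{1}{\theta}\log\frac{2\gamma}{d} + \log\frac{e\beta\mu_2(f)}{2\alpha}\big)$, for the toy objective $f(x) = \|x\|_2^2/2$ with the Langevin diffusion. The constants $\alpha = 2$, $\beta = 2d/\gamma$, and $\mu_2(f) = 1$ are our own calculation for this toy case, and the function name is hypothetical. Since the Gibbs measure is then Gaussian with covariance $I/\gamma$, the exact expected suboptimality is $d/(2\gamma)$.

```python
import math

def generalized_gibbs_bound(d, gamma, theta, alpha, beta, mu2):
    """theta-th root of (d / (2 gamma)) * (log(2 gamma / d) / theta
    + log(e * beta * mu2 / (2 alpha)))."""
    inner = (math.log(2.0 * gamma / d) / theta
             + math.log(math.e * beta * mu2 / (2.0 * alpha)))
    if inner < 0:
        raise ValueError("the bracketed term must be nonnegative")
    return (d / (2.0 * gamma) * inner) ** (1.0 / theta)

d, gamma = 2, 100.0                               # hypothetical dimension and temperature
alpha, beta, mu2 = 2.0, 2.0 * d / gamma, 1.0      # constants for f(x) = ||x||^2 / 2
bound = generalized_gibbs_bound(d, gamma, theta=1.0,
                                alpha=alpha, beta=beta, mu2=mu2)
exact = d / (2.0 * gamma)                         # E[f] under N(0, I / gamma)
```

In this example the bracketed term simplifies to $\log e = 1$, so the bound coincides with the exact value. For other objectives the bound is generally loose.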

By combining Prop. 4.1 with Theorem 3.1, we obtain a complete bound controlling the global optimization error of the best Markov chain iterate.

Corollary 4.2 (Optimization error of discretized diffusions).

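Instantiate the assumptions and notation of Theorem 3.1 and Prop. 4.1. If the diffusion has the generalized Gibbs stationary density $p_{\gamma,\theta}$, then

$$\min_{m=1,\dots,M} \mathbb{E}[f(X_m)] - f(x^*) \;\le\; \big[\text{integration error bound of Theorem 3.1}\big] \;+\; \sqrt[\theta]{\frac{d}{2\gamma}\Big(\frac{1}{\theta}\log\Big(\frac{2\gamma}{d}\Big) + \log\Big(\frac{e\beta\mu_2(f)}{2\alpha}\Big)\Big)}.$$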

Finally, we demonstrate that, for quadratic functions, the generalized Gibbs expected suboptimality bound LABEL:eq:gen-gibbs-subopt can be further refined to remove the dependence.

Proposition 4.3 (Expected suboptimality bound: quadratic $f$).

Let for a positive semidefinite and . Then for with , and for each positive integer , we have

The bound (LABEL:eq:non-gibs-bound) applies to any with level set (i.e., ) volume proportional to .

5 Applications to Non-convex Optimization

We next provide detailed examples of verifying that a given diffusion is appropriate for optimizing a given objective, using either uniform dissipativity (Prop. 3.3) or our user-friendly distant dissipativity conditions (Prop. 3.5). When the Gibbs measure Langevin diffusion is used, our results yield global optimization when is strongly convex (condition creftype 3.1 with and ) or has strongly convex tails (condition LABEL:eqn:distant-drift with ). To highlight the value of non-constant diffusion coefficients, we will focus on “heavy-tailed” examples that are not covered by the Langevin theory.

5.1 A simple example with sublinear growth

We begin with a pedagogical example of selecting an appropriate diffusion and verifying our global optimization conditions. Fix and consider , a simple non-convex objective which exhibits sublinear growth in and hence does not satisfy dissipativity (Condition 2) when paired with the Gibbs measure Langevin diffusion (). To target the Gibbs measure creftype 2.1 with inverse temperature , we choose the diffusion with coefficients and for and . This choice satisfies Condition 1 with , and with respect to and Condition 2 with and . In fact, this diffusion satisfies uniform dissipativity,

 2⟨bγ(x)−bγ(y),x−y⟩+∥σγ(x)−σγ(y)∥2F,

yielding and -Wasserstein rates by Prop. 3.3 and the relative rate . Hence, the -th Stein factor in Theorem 3.2 satisfies . This implies that the coefficients in Corollary 4.2 scale with and the final optimization error bound LABEL:eq:gen-gibbs-opt-error can be made of order by choosing the inverse temperature , the step size , and the number of iterations .

5.2 Non-convex learning with linear growth

Next consider the canonical learning problem of regularized loss minimization with

 f(x)=L(x)+R(x)

for , a datapoint-specific loss function, the -th datapoint covariate vector, and a regularizer with concave satisfying and for , some , and all . Our aim is to select diffusion and stream coefficients that satisfy the Wasserstein decay preconditions of Prop. 3.5. To achieve this, we set and choose with so that the regularization component of the drift is one-sided Lipschitz, i.e.,

$$-\langle a(x)\nabla R(x) - a(y)\nabla R(y), x - y \rangle \le -K_a\|x - y\|_2^2 \quad\text{for some}\quad K_a > 0.$$

We then show that from Prop. 3.5 is bounded and that, for suitable loss choices, is bounded and Lipschitz so that LABEL:eqn:distant-drift holds with and sufficiently large.

Fix any , let , and define for all . We choose so that and LABEL:eqn:one-sided-lipschitz holds with . Our constraints on ensure that is positive definite, that , and that and have at most linear and quadratic growth respectively, in satisfaction of Condition 1. Moreover,

 ∇⟨∇,a(x)⟩=I((d−1)g1(r2)r2+2g′1(r2))+2xx⊤r2((d−1)(g′1(r2)−g1(r2)r2)+2r2g′′1(r2)), and

so that . For any , we have

 ∇~σ(s0)(x)[v]=(I⟨x,v⟩r+xv⊤r−2xx⊤r3⟨x,v⟩)√gs0(r2)−√1−s0r+2xx⊤r3⟨x,v⟩rg′s0(r2)√gs0(r2)

for each , so, as , for .

Finally, to satisfy LABEL:eqn:distant-drift, it suffices to verify that is bounded and Lipschitz. For example, in the case of a ridge regularizer, for , the coefficient , and it suffices to check that is Lipschitz with Lipschitz gradient. This strongly convex regularizer satisfies our assumptions, but strong convexity is by no means necessary. Consider instead the pseudo-Huber function, , popularized in computer vision (Hartley and Zisserman, 2004). This convex but non-strongly convex regularizer satisfies all of our criteria and yields a diffusion with . Moreover, since and , is bounded and Lipschitz whenever and for some . Hence, Prop. 3.5 guarantees exponential Wasserstein decay for a variety of non-convex based on datapoint outcomes , including the sigmoid ( for or for (Bartlett et al., 2006), the Student’s t negative log likelihood (), and the Blake-Zisserman ((Hartley and Zisserman, 2004). The reader can verify that all of these examples also satisfy the remaining global optimization pre-conditions of Theorems 3.2 and 4.2. In contrast, these linear-growth examples do not satisfy dissipativity (Condition 2) when paired with the Gibbs measure Langevin diffusion.

6 Conclusion

In this paper, we showed that the Euler discretization of any smooth and dissipative diffusion can be used for global non-convex optimization. We established non-asymptotic bounds on global optimization error and integration error with convergence governed by Stein factors obtained from the solution of the Poisson equation. We further provided explicit bounds on Stein factors for large classes of convex and non-convex objective functions, based on computable properties of the objective and the diffusion. Using this flexibility, we designed suitable diffusions for optimizing non-convex functions not covered by the existing Langevin theory. We also demonstrated that targeting distributions other than the Gibbs measure can give rise to improved optimization guarantees.

References

• Bartlett et al. [2006] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.
• Bismut [1984] J.-M. Bismut. Large deviation and Malliavin calculus. Progress in Mathematics, 45, 1984.
• Cattiaux and Guillin [2014] P. Cattiaux and A. Guillin. Semi log-concave Markov diffusions. In Séminaire de Probabilités XLVI, volume 2123 of Lecture Notes in Math., pages 231–292. Springer, Cham, 2014.
• Cerrai [2001] S. Cerrai. Second order PDE’s in finite and infinite dimension: a probabilistic approach, volume 1762. Springer Science & Business Media, 2001.
• Chen et al. [2015] C. Chen, N. Ding, and L. Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pages 2278–2286, 2015.
• Cheng et al. [2018] X. Cheng, N. S. Chatterji, Y. Abbasi-Yadkori, P. L. Bartlett, and M. I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648, 2018.
• Dalalyan [2017a] A. S. Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. arXiv preprint arXiv:1704.04752, 2017a.
• Dalalyan [2017b] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017b.
• Dalalyan and Tsybakov [2012] A. S. Dalalyan and A. Tsybakov. Sparse regression learning by aggregation and Langevin Monte-Carlo. Journal of Computer and System Sciences, 78:1423–1443, 2012.
• Durmus et al. [2017] A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
• Dwivedi et al. [2018] R. Dwivedi, Y. Chen, M. J. Wainwright, and B. Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! arXiv preprint arXiv:1801.02309, 2018.
• Eberle [2016] A. Eberle. Reflection couplings and contraction rates for diffusions. Probability Theory and Related Fields, 166(3-4):851–886, 2016.
• Elworthy and Li [1994] K. D. Elworthy and X.-M. Li. Formulae for the derivatives of heat semigroups. Journal of Functional Analysis, 125(1):252–286, 1994.
• Gelfand and Mitter [1991] S. B. Gelfand and S. K. Mitter. Recursive stochastic algorithms for global optimization in R^d. SIAM Journal on Control and Optimization, 29(5):999–1018, 1991.
• Gorham et al. [2016] J. Gorham, A. B. Duncan, S. J. Vollmer, and L. Mackey. Measuring sample quality with diffusions. arXiv preprint arXiv:1611.06972, 2016.
• Gradshteyn and Ryzhik [2014] I. S. Gradshteyn and I. M. Ryzhik. Table of integrals, series, and products. Academic Press, 2014.
• Hartley and Zisserman [2004] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
• Jameson [2013] G. J. O. Jameson. Inequalities for gamma function ratios. The American Mathematical Monthly, 120(10):936–940, 2013. ISSN 00029890, 19300972.
• Khasminskii [2011] R. Khasminskii. Stochastic stability of differential equations, volume 66. Springer Science & Business Media, 2011.
• Ma et al. [2015] Y.-A. Ma, T. Chen, and E. Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
• Mathai and Provost [1992] A. Mathai and S. Provost. Quadratic forms in random variables: Theory and applications. 1992.
• Mattingly et al. [2002] J. C. Mattingly, A. M. Stuart, and D. J. Higham. Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise. Stochastic Processes and their Applications, 101(2):185–232, 2002.
• Mattingly et al. [2010] J. C. Mattingly, A. M. Stuart, and M. V. Tretyakov. Convergence of numerical time-averaging and stationary measures via Poisson equations. SIAM Journal on Numerical Analysis, 48(2):552–577, 2010.
• Øksendal [2003] B. Øksendal. Stochastic differential equations. pages 65–84. Springer, 2003.
• Pardoux and Veretennikov [2001] E. Pardoux and A. Veretennikov. On the Poisson equation and diffusion approximation. I. Ann. Probab., pages 1061–1085, 2001.
• Raginsky et al. [2017] M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017.
• Vollmer et al. [2016] S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. Exploration of the (non-) asymptotic bias and variance of stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17:1–48, 2016.
• Wang [2016] F. Wang. Exponential Contraction in Wasserstein Distances for Diffusion Semigroups with Negative Curvature. arXiv:1608.04471, Mar. 2016.
• Welling and Teh [2011] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
• Xu et al. [2017] P. Xu, J. Chen, and Q. Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. arXiv preprint arXiv:1707.06618, 2017.

Appendix A Proof of Theorem 3.1: Integration error of discretized diffusions

Proof of Theorem 3.1.

Denoting by and using the integral form of Taylor’s theorem on around the previous iterate , and taking expectations, we obtain

 + +

The first term on the right hand side can be written as

 = =

where in the last step, we used the fact that $W_m$ is independent of $X_m$ and that the odd moments of $W_m$ are zero. Similarly, for the second and third terms, we obtain respectively

and

Combining these with (LABEL:eq:poisson-eq), we can rewrite (LABEL:eq:poisson-taylor) as

 + + +

Finally, dividing each term by , averaging over , and using triangle inequalities, we reach the following bound

For the first term on the right hand side, using Condition 3 and Lemma A.2, we can write

where we used Young’s inequality in the second step, and Lemma A.2 in the last step.

The second term in the above inequality can be bounded by

where in the last step, we used Lemma A.2.

Similarly, the third and the fourth terms in the inequality (LABEL:eq:taylor-bound) can be bounded by

and

For the last term, we write

We first bound the expectation in the above integral

 $= A + \tau^{n} B$,   where

Using Condition 1, Lemma F.1 and , we obtain

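$$
\begin{aligned}
A &\le \;\le \\
B &\le 8 \cdot 3^{n-1}\,\eta^{2+n/2}\,\mathbb{E}\Big[\eta^{2+n/2}\,\|b(X_m)\|_2^{n+4} + \eta^{2}\,\|b(X_m)\|_2^{4}\,\|\sigma(X_m)W_m\|_2^{n} \\
&\qquad\qquad + 3\,\|b(X_m)\|_2^{n}\,\|\sigma(X_m)\|_F^{4} + \|\sigma(X_m)W_m\|_2^{n+4}\Big], \\
&\le \;\le
\end{aligned}
$$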

Plugging this in (LABEL:eq:last-ineq-integral-part), we obtain

Therefore, the last term in (LABEL:eq:taylor-bound) can be bounded by

Using Lemma A.2 and

$$
\int_0^1 (1-\tau)^3 \tau^{n}\, d\tau \le \frac{6}{n^4} \quad \text{and} \quad \int_0^1 (1-\tau)^3\, d\tau = \frac{1}{4},
$$

the right hand side of (LABEL:eq:last-term-final-2) can be bounded by

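For completeness, the two integral facts used in this step can be checked directly. The following short derivation is a sketch that only assumes $n \ge 1$ is an integer; it identifies the first integral as a Beta integral and evaluates the second by elementary integration.

```latex
\begin{align*}
\int_0^1 (1-\tau)^3 \tau^{n}\, d\tau
  &= B(n+1, 4)
   = \frac{\Gamma(n+1)\,\Gamma(4)}{\Gamma(n+5)}
   = \frac{n!\, 3!}{(n+4)!} \\
  &= \frac{6}{(n+1)(n+2)(n+3)(n+4)}
   \le \frac{6}{n^4}, \\
\int_0^1 (1-\tau)^3\, d\tau
  &= \left[ -\frac{(1-\tau)^4}{4} \right]_0^1
   = \frac{1}{4}.
\end{align*}
```

The inequality in the first line holds because each of the four factors in the denominator is at least $n$.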
Combining the above bounds in (LABEL:eq:bound-first-term), (LABEL:eq:bound-second-term), (LABEL:eq:bound-third-term), (LABEL:eq:bound-fourth-term) and (LABEL:eq:last-term-final-3) and applying them to (LABEL:eq:taylor-bound), we reach the final bound

where

 c1= 6ζ1, c2= 116[ζ22λ2b+ζ3λbλ2σ+ζ4(1+3n−1)λ4σ], c3= 148[ζ3λ3b+ζ4λ4b(1+3n−1)+ζ441.5nn4(λ4b+n2eλ4σ)(λnb+n!!λn