Logsmooth Gradient Concentration and Tighter Runtimes for Metropolized Hamiltonian Monte Carlo


1 Introduction

Sampling from a high-dimensional logconcave distribution is a fundamental computational task, central to a variety of applications in disciplines such as statistics, machine learning, operations research, and theoretical computer science [AdD+03, BGJ+11]. The important problem of designing efficient samplers has received significant recent attention, due to the widespread use of Bayesian methods in modern large-scale machine learning algorithms [BAR12], and the fact that directly computing posterior densities in these methods is often intractable. Correspondingly, recent research efforts have focused on giving convergence guarantees for Markov Chain Monte Carlo (MCMC) methods, for sampling from a variety of desirable structured distributions arising in both theory and practice.

The specific problem we address in this paper is determining the mixing time of the well-studied Metropolized Hamiltonian Monte Carlo (HMC) algorithm when sampling from a target distribution whose log-density is smooth and strongly concave. Indeed, as it is the default sampler implementation in a variety of popular packages [ABA16, CGH+17], understanding Metropolized HMC is of high practical import. Moreover, the specific setting we study, where the target distribution has a density proportional to $e^{-f(x)}$ for a function $f$ with quadratic upper and lower bounds, is commonplace in applications arising from multivariate Gaussians, logistic regression models, and structured mixture models [DCW+18]. This setting is also of great theoretical interest because of its connection to a well-understood setting in convex optimization [NES03], where matching upper and lower bounds have long been known. Similar guarantees are much less well-understood in sampling settings, and exploring the connection is an active research area (e.g. [MCJ+18, TAL19] and references therein). Throughout the introduction, we refer to this setting as the “condition number regime” for logconcave sampling: without a finite condition number, black-box sampling guarantees exist, but they typically have a large dimension dependence in the mixing time [VEM05].

1.1 Previous work

Many algorithms have been proposed for sampling from logconcave distributions, mainly falling into two categories: zeroth-order methods and first-order methods. Zeroth-order methods use only information about the density of the distribution, querying the value of $f$ to inform the algorithm's trajectory. Well-studied zeroth-order methods include the ball walk [LS93, LS90], hit-and-run [BRS93, LOV99, LV06], and the Metropolized random walk [MT96, RT96b]. While possibly preferable in some cases due to weaker assumptions on both the class of target distributions and the oracle access to the density, these methods typically have a large polynomial dependence on the dimension, and do not exploit the additional benefits afforded by having a finite condition number. For distributions with such additional structure, sampling algorithms may wish to exploit it.

First-order methods have access to the gradient of $f$, in addition to its value, at a query point. This class of methods usually involves simulating a continuous Markov process whose stationary distribution is exactly the target distribution. The Langevin dynamics (LD) [GM91], Hamiltonian Monte Carlo (HMC) [KRA40, NEA11], and underdamped Langevin dynamics (ULD) [DR18] are among the most well-studied continuous-time Markov processes which converge to a specified logconcave measure, and sampling algorithms based on first-order information typically model their trajectories on one of these processes. To simulate a random process in discrete time, one approach is to choose a small enough step size so that the behavior of the discrete Markov process does not differ too much from that of the original Markov process over a small time interval. This discretization strategy is typical of sampling algorithms with a polynomial dependence on $1/\epsilon$, where $\epsilon$ is the target total variation distance to the stationary distribution [CCB+18, DM19, MMW+19, MS17, LSV18, CV19, SL19]. However, in high-accuracy regimes (small $\epsilon$), bounding the error incurred by the discretization is typically not enough, leading to prohibitively large runtimes.

On top of the discretization, one can further apply a Metropolis-Hastings filter to adjust the stationary distribution of the Markov process, so that the target distribution is attained in the long run. The non-asymptotic behavior of Metropolized variants of the Langevin dynamics and HMC has been studied in a large number of recent works [RT96a, RT96b, PST12, BH13, XSL+14, DCW+18, CDW+19]. Indeed, the standard discretizations of these two methods are identical, as was observed in prior work (see Appendix A); we will refer to them both as Metropolized HMC. The works which inspired this study in particular were [DCW+18, CDW+19], which showed that the mixing time of Metropolized HMC is bounded by roughly $\kappa d + \kappa^{1.5}\sqrt{d}$, with logarithmic dependence on the target accuracy $\epsilon$, where $\kappa$ is the condition number of the negative log-density. Moreover, the work [SL19] gives an algorithm with a $d^{1/3}$ dependence in the $\mathrm{poly}(1/\epsilon)$ runtime regime, which is the current best dependence on the dimension.

Under the plausible assumption that a gap exists between the complexity of sampling and optimization in the logconcave setting, it is reasonable to believe that a linear dependence on $\kappa$ is necessary. More specifically, it is well-known that gradient-based optimization algorithms require at least $\Omega(\sqrt{\kappa})$ queries to an oracle providing first-order information [BUB15]; for the worst-case instance, a quadratic in the graph Laplacian of a length-$n$ path, there is a corresponding quadratic gap with sampling, as a random walk sampling a uniform point on the path mixes in roughly $\kappa$ iterations. We believe understanding the tight dependence of the mixing time of popular sampling algorithms on natural parameters such as the condition number is fundamental to the development of the field of sampling, just as characterizing the tight complexity of convex optimization algorithms has resulted in rapid recent progress in that area, by giving researchers goalposts in algorithm design. To that end, this work addresses the following question.
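To make the worst-case instance concrete, the following sketch (our illustration, not from the paper) builds the length-$n$ path-graph Laplacian quadratic and checks numerically that its condition number, restricted to the mean-zero subspace where it is nondegenerate, grows as roughly $n^2$: about $\sqrt{\kappa} \approx n$ first-order queries suffice for optimization, while a random walk on the path needs roughly $\kappa \approx n^2$ steps to mix.

```python
import numpy as np

n = 200
# Laplacian of the length-n path graph: tridiagonal, degrees on the diagonal,
# -1 on the off-diagonals; f(x) = x^T L_P x / 2 is the worst-case quadratic.
L_P = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L_P[0, 0] = L_P[-1, -1] = 1.0

eigs = np.linalg.eigvalsh(L_P)
lam2, lam_max = eigs[1], eigs[-1]  # eigs[0] = 0 (the all-ones direction)
kappa = lam_max / lam2             # condition number on the mean-zero subspace
print(f"kappa = {kappa:.0f}, n^2 = {n**2}, sqrt(kappa) = {kappa**0.5:.0f}")
```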

Question 1.1.

What is the mixing time of the Metropolized HMC algorithm?

We give a comparison of (selected recent) prior work in Table 1; for a more complete discussion, we refer the reader to the excellent discussion in [DCW+18, CDW+19]. We note that for the last two rows, the dependence on the target accuracy $\epsilon$ is logarithmic, and the notion of mixing is in total variation distance, a much stronger notion than the Wasserstein metric used in all other runtimes listed. We omit logarithmic factors for simplicity. We remark that several works obtain different rates under stronger assumptions on the log-density $f$, such as higher-order smoothness (e.g. a Lipschitz Hessian) or moment bounds; as this work studies the basic condition number setting with no additional assumptions, we omit comparison to runtimes of this type.

Algorithm | Mixing Time | Metric
Langevin Diffusion [DM16] | $\tilde{O}(\kappa^2 d / \epsilon^2)$ | $\mathcal{W}_2$
High-Order Langevin Diffusion [MMW+19] | | $\mathcal{W}_2$
HMC 1 (Collocation Method) [LSV18] | | $\mathcal{W}_2$
HMC 2 (Collocation Method) [CV19] | | $\mathcal{W}_2$
ULD 1 (Euler Method) [CCB+18] | $\tilde{O}(\kappa^2 \sqrt{d} / \epsilon)$ | $\mathcal{W}_2$
ULD 2 (Euler Method) [DR18] | $\tilde{O}(\kappa^{1.5} \sqrt{d} / \epsilon)$ | $\mathcal{W}_2$
ULD 3 (Random Midpoint Method) [SL19] | $\tilde{O}(\kappa^{7/6} d^{1/6} / \epsilon^{1/3} + \kappa d^{1/3} / \epsilon^{2/3})$ | $\mathcal{W}_2$
Metropolized HMC & MALA [CDW+19] | $\tilde{O}(\kappa d + \kappa^{1.5} \sqrt{d})$ | TV
Metropolized HMC & MALA (This paper) | $\tilde{O}(\kappa d)$ | TV
Table 1: Mixing times for algorithms in the condition number regime of logconcave sampling.

1.2 Our contribution

Towards improving our understanding of Question 1.1, we show that there is an algorithm which runs Metropolized HMC (defined in Algorithm 1) for $\tilde{O}(\kappa d)$ iterations, where $\tilde{O}$ hides polylogarithmic factors in $\kappa$, $d$, and $1/\epsilon$, for sampling from a density proportional to $e^{-f(x)}$ on $\mathbb{R}^d$, where $f$ has condition number $\kappa$, and produces a point from a distribution with total variation at most $\epsilon$ away from the target density, for any $\epsilon > 0$. This is the first mixing-time guarantee for any algorithm accessing only first-order information about the log-density which attains a linear dependence on the condition number $\kappa$, without any additional smoothness assumptions (i.e. bounds on higher-order derivatives). Moreover, our dependence on the dimension $d$ matches the state-of-the-art prior work [DCW+18, CDW+19], and our algorithm does not require a warm start, as it explicitly bounds the runtime dependence on the warmness of a known starting distribution.

In Section 5, we also give preliminary evidence that for the equivalent first-order sampling methods of Metropolized HMC and Metropolis-adjusted Langevin dynamics, a number of iterations scaling linearly in $\kappa$ is necessary, even for sampling from a quadratic. In particular, we show that if the step size is not sufficiently small, the chain can only move with exponentially small probability, and we combine this with a lower bound from [CV19] on the mixing time of continuous-time HMC run with small step sizes.

The starting point of our analysis is the mixing time analysis framework for the HMC algorithm in [DCW+18, CDW+19]. However, we introduce several technical modifications to overcome barriers in their work to obtain our improved mixing time bound, which we now discuss. We hope these tools may be of broader interest to both the community studying first-order sampling methods in the smooth, strongly logconcave regime, and sampling researchers in general.

Gradient concentration

How large is the norm of the gradient at a “typical” point drawn from the density proportional to $e^{-f(x)}$? It has been observed in a variety of recent works studying sampling algorithms [LSV18, SL19, VW19] that the average gradient norm of a point drawn from the target density is bounded by $\sqrt{Ld}$, where $L$ is the smoothness parameter of the function and $d$ is the ambient dimension; this observation has been used in obtaining state-of-the-art sampling algorithms in the $\mathrm{poly}(1/\epsilon)$ runtime regime. However, for runtimes with a $\log(1/\epsilon)$ dependence, this guarantee is not good enough, as it must hold for all points in a set of substantially larger measure than guaranteed by e.g. Markov's inequality. The weaker high-probability guarantee that the gradient norm is bounded by $\sqrt{L\kappa d}$ follows directly from sub-Gaussian concentration of the point $x$ around the minimizer, and a Lipschitz guarantee on the gradient norm. Indeed, this weaker bound is the bottleneck term in the analysis of [DCW+18, CDW+19], and prevents a faster algorithm when $\kappa \geq d$. Can we improve upon the average-case guarantee more generally when the log density is smooth?

For quadratic $f$, it is easy to see that the average gradient norm bound can be converted into a high-probability guarantee. We show that a similar concentration guarantee holds for all logsmooth, strongly logconcave densities, which is the starting point of our improved mixing time bound. Our concentration proof follows straightforwardly from a Hessian-weighted variant of the well-known Poincaré inequality, combined with a reduction due to Herbst, as explored in [LED99].
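The quadratic case is easy to check empirically. In the following sketch (our illustration; the standard Gaussian is the case $f(x) = \frac{1}{2}\|x\|_2^2$, so $L = 1$ and $\nabla f(x) = x$), the mean of $\|\nabla f(x)\|_2$ grows as $\sqrt{Ld}$ while its upper fluctuations stay $O(1)$ across dimensions, exactly the dimension-free concentration described above.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000):
    # pi = N(0, I_d), i.e. f(x) = ||x||^2/2 with L = mu = 1 and grad f(x) = x.
    g = np.linalg.norm(rng.standard_normal((100_000, d)), axis=1)
    print(f"d={d:5d}  sqrt(Ld)={d**0.5:7.2f}  mean={g.mean():7.2f}  "
          f"q999-mean={np.quantile(g, 0.999) - g.mean():5.2f}")
```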

Mixing time analysis

The study of Markov chains producing iterates $x_0, x_1, x_2, \ldots$, where the transition $x_k \to x_{k+1}$ is described by an algorithm whose steady-state is a stationary distribution $\pi$, and $x_0$ is drawn from an initial distribution $\pi_0$, primarily focuses on characterizing the rate at which the distribution of iterates of the chain approaches $\pi$. To obtain a mixing time bound, i.e. a bound on the number of iterations needed for our algorithm to obtain a distribution within total variation distance $\epsilon$ of the stationary $\pi$, we follow the general framework of bounding the conductance of the random walk defined by Metropolized HMC, initiated in a variety of works on Markov chain mixing times (e.g. [SJ89, LS93]). In particular, [LS93] showed how to use the generalized notion of $s$-conductance to account for a small-probability “bad” region with poor random walk behavior. In our work, the “good” region $\Omega$ will be the set of points whose gradient has small norm. However, our mixing time analysis requires several modifications from prior work to overcome subtle technical issues.

Average conductance. As in prior work [CDW+19], because of the exponential warmness of the starting distribution used, we require extensions of the theory of average conductance [LK99] to obtain a milder dependence on the warmness, i.e. doubly logarithmic rather than singly logarithmic, to prevent an additional dimension dependence in the mixing time. The paper [CDW+19] obtained this improved dependence on the warmness by generalizing the analysis of [GMT06] to continuous-time walks and restrictions to high-probability regions. This analysis becomes problematic in our setting, as our region $\Omega$ may be nonconvex, and the restriction of a strongly logconcave function to a nonconvex set is possibly not even logconcave. This causes difficulties when bounding standard conductance notions which may depend on sets of small measure, because these sets may behave poorly under restriction to $\Omega$ (e.g. in the proof of Lemma C.9).

Blocking conductance. To mitigate the difficulty of poor small-set conductance due to the nonconvexity of $\Omega$, we use the blocking conductance analysis of [KLM06], which averages conductance bounds of sets with measure equal to specified values in a range lower-bounded by roughly the inverse warmness. In our case, this is potentially problematic, as the set where our concentration result guarantees that the norm of the gradient is not much larger than its mean has measure too small to bound the behavior of sets of the size required by the quality of the warm start. However, we show that, perhaps surprisingly, the analysis of the blocking conductance is not bottlenecked by the worse quality of the gradient concentration required. In particular, the runtime of [DCW+18, CDW+19] resulted from the statement that with probability at least $1 - \delta$, the gradient norm is bounded by roughly $\sqrt{L\kappa d}$. We are able to sharpen this via Corollary 3.3 to roughly $\sqrt{Ld \log(1/\delta)}$, trading off a $\sqrt{\kappa}$ for a $\sqrt{\log(1/\delta)}$, which is sufficient for our tighter runtime.

Boosting to high accuracy. Finally, the blocking conductance analysis of [KLM06] entails an algorithmic modification. In particular, letting $\pi_k$ be the density after running $k$ steps of the Markov chain from $\pi_0$, the analysis of [KLM06] guarantees that the average density $\bar{\pi}_k = \frac{1}{k}\sum_{j \in [k]} \pi_j$ converges to $\pi$ at a rate roughly $1/k$, with a factor depending on the average conductance. In our case, we can show that in roughly $\kappa d$ iterations of Algorithm 1, the distance $\|\bar{\pi}_k - \pi\|_{TV}$ is bounded by a constant. However, as the analysis requires averaging against a potentially poor starting distribution, it is not straightforward to obtain a rate of convergence with $\log(1/\epsilon)$ dependence for potentially small values of $\epsilon$, rather than the $\mathrm{poly}(1/\epsilon)$ dependence typical of $1/k$ rates. Moreover, it is unclear in our setting how to apply standard arguments [AD86, LW95], which convert mixing time guarantees for obtaining a constant total variation distance into guarantees for total variation distance $\epsilon$ with overhead logarithmic in $1/\epsilon$, because the definition of mixing time used is a worst-case notion over all starting points. We propose an alternative reduction based on mixing-time guarantees over arbitrary starting distributions of a specified warmness, which we use to boost our constant-accuracy mixing-time guarantee (see Appendix C.4 for a more formal treatment). While it is simple and inspired by classical coupling-based reduction arguments, to the best of our knowledge this reduction is new in the literature, and may be of independent interest.

1.3 Discussion

While we obtain an improved upper bound on the runtime of Metropolized HMC, many questions remain unanswered which we believe are exciting to explore in the space of sampling algorithms and complexity. We state two here which we believe are natural.

Question 1.2.

Can we obtain a matching lower bound, or an improved upper bound, on the complexity of the Metropolized HMC algorithm in terms of the dependence on the dimension $d$?

Question 1.3.

Can we obtain improved bounds (upper or lower) for the complexity of first-order sampling algorithms in general, in the condition number regime? For example, are $\Omega(\kappa)$ iterations always necessary, implying a separation from the optimization setting?

2 Preliminaries

2.1 Notation

We denote the set $\{1, 2, \ldots, n\}$ by $[n]$. For $S \subseteq \mathbb{R}^d$, $S^c$ is its complement $\mathbb{R}^d \setminus S$. $\|\cdot\|_2$ is the Euclidean norm on $\mathbb{R}^d$. Differentiable $f$ is $\mu$-strongly convex and $L$-smooth if for all $x, y \in \mathbb{R}^d$,

$$\frac{\mu}{2}\|y - x\|_2^2 \leq f(y) - f(x) - \langle \nabla f(x), y - x \rangle \leq \frac{L}{2}\|y - x\|_2^2.$$

It is well-known that smoothness is equivalent to having a Lipschitz gradient, i.e. $\|\nabla f(x) - \nabla f(y)\|_2 \leq L\|x - y\|_2$, and when $f$ is twice-differentiable, smoothness and strong convexity imply

$$\mu I \preceq \nabla^2 f(x) \preceq L I$$

everywhere, where $I$ is the identity and $\preceq$ is the Loewner order. In this paper, the function $f$ will always be differentiable, $L$-smooth, and $\mu$-strongly convex, with minimizer $x^\star$. We let $\kappa := L/\mu$ be the condition number of $f$. We define the Hamiltonian of $f$ by

$$H(x, v) := f(x) + \frac{1}{2}\|v\|_2^2.$$

$\mathcal{N}(\bar{x}, \Sigma)$ is the Gaussian density centered at a point $\bar{x}$ with covariance matrix $\Sigma$. For measurable $A \subseteq \mathbb{R}^d$ and a distribution $\pi$, we write

$$\pi(A) := \int_A \pi(x)\,dx.$$

We fix the definition of the distribution density $\pi(x) \propto \exp(-f(x))$, where $f$ has the properties above.

The marginal in the first argument of the density on $\mathbb{R}^d \times \mathbb{R}^d$ proportional to $\exp(-H(x, v))$ is $\pi$; we overload $\pi$ to mean this density when clear from context. For distributions $\mu$, $\nu$ on $\mathbb{R}^d$, the total variation distance is

$$\|\mu - \nu\|_{TV} := \sup_{A \subseteq \mathbb{R}^d} |\mu(A) - \nu(A)|.$$

We say that a distribution $\mu$ is $\beta$-warm with respect to another distribution $\pi$ if

$$\sup_{x \in \mathbb{R}^d} \frac{\mu(x)}{\pi(x)} \leq \beta.$$

Finally, we define the expectation and variance with respect to a distribution $\pi$ in the usual way:

$$\mathbb{E}_\pi[g] := \int g(x)\,\pi(x)\,dx, \qquad \mathrm{Var}_\pi[g] := \mathbb{E}_\pi[g^2] - \mathbb{E}_\pi[g]^2.$$
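As a concrete instance of the condition number regime (our illustration; the synthetic data and the ridge parameter `lam` are arbitrary choices, not from the paper), the negative log-density of a ridge-regularized logistic regression satisfies the Hessian bounds above: each logistic term contributes a positive semidefinite Hessian, bounded in aggregate by $\frac{1}{4}\|X\|_2^2 I$, while the regularizer contributes $\lambda I$ from below.

```python
import numpy as np

# f(theta) = sum_i log(1 + exp(-y_i <x_i, theta>)) + (lam/2) ||theta||_2^2.
# Its Hessian is X^T D X + lam * I with D diagonal and 0 <= D <= I/4, so
# mu >= lam and L <= ||X||_2^2 / 4 + lam.
rng = np.random.default_rng(0)
n, d, lam = 500, 20, 1.0
X = rng.standard_normal((n, d))
L = np.linalg.norm(X, 2) ** 2 / 4 + lam  # smoothness upper bound
mu = lam                                 # strong convexity lower bound
print(f"kappa <= L/mu = {L / mu:.1f}")
```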

2.2 Algorithm

We state the Metropolized HMC algorithm to be analyzed throughout the remainder of this paper. We remark that it may be thought of as a symplectic discretization of the continuous-time Hamiltonian dynamics for $H(x, v)$,

$$\frac{dx_t}{dt} = v_t, \qquad \frac{dv_t}{dt} = -\nabla f(x_t).$$

The HMC process can be thought of as a dual velocity $v$ accumulating the gradient of the primal point $x$, with the primal point being guided by the velocity, similar to the classical mirror descent algorithm. The algorithm resamples $v$ each timestep to attain the correct stationary distribution.

Input: Initial point $x_0$, step size $\eta$, iteration count $N$.
Output: Sequence $\{x_k\}_{k \in [N]}$.

1:for $k = 0, 1, \ldots, N - 1$ do
2:     Draw $v_k \sim \mathcal{N}(0, I)$.
3:     $(x', v') \leftarrow \textsc{Leapfrog}(x_k, v_k)$.
4:     Draw $u$ uniformly in $[0, 1]$.
5:     if $u \leq \exp\left(H(x_k, v_k) - H(x', v')\right)$ then
6:         $x_{k+1} \leftarrow x'$.
7:     else
8:         $x_{k+1} \leftarrow x_k$.
9:     end if
10:end for
Algorithm 1 Metropolized Hamiltonian Monte Carlo: HMC

Input: Points $(x, v)$, step size $\eta$.
Output: Points $(x', v')$.

1:$v_{1/2} \leftarrow v - \frac{\eta}{2}\nabla f(x)$.
2:$x' \leftarrow x + \eta v_{1/2}$.
3:$v' \leftarrow v_{1/2} - \frac{\eta}{2}\nabla f(x')$.
Algorithm 2 Leapfrog: Leapfrog
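For readers who prefer code, the following is a minimal NumPy sketch of Algorithms 1 and 2 (the function names and the Gaussian test case at the end are our own illustration, not part of the paper):

```python
import numpy as np

def leapfrog(x, v, eta, grad_f):
    """One step of Algorithm 2."""
    v_half = v - (eta / 2) * grad_f(x)
    x_new = x + eta * v_half
    v_new = v_half - (eta / 2) * grad_f(x_new)
    return x_new, v_new

def hmc(x0, eta, n_iters, f, grad_f, rng):
    """Algorithm 1: velocity refreshed every iteration, one leapfrog step
    per proposal, Metropolis filter on the change in the Hamiltonian."""
    H = lambda x, v: f(x) + 0.5 * v @ v
    x, xs = x0, [x0]
    for _ in range(n_iters):
        v = rng.standard_normal(x.shape)
        x_new, v_new = leapfrog(x, v, eta, grad_f)
        # Accept with probability min(1, exp(H(x, v) - H(x', v'))).
        if rng.random() <= np.exp(min(0.0, H(x, v) - H(x_new, v_new))):
            x = x_new
        xs.append(x)
    return np.array(xs)

# Illustration: sample from pi ∝ exp(-||x||^2 / 2) in d = 10 dimensions.
rng = np.random.default_rng(0)
samples = hmc(np.zeros(10), 0.4, 5000, lambda x: 0.5 * x @ x, lambda x: x, rng)
print(samples[1000:].std())  # per-coordinate standard deviation approaches 1
```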

From a point $x$, we define $\mathcal{P}_x$ to be the distribution of the proposed point $x'$ after one step of Algorithm 1 starting from $x$. Similarly, $\mathcal{T}_x$ is the distribution of $x_{k+1}$ starting at $x_k = x$, i.e. after the accept-reject step. Algorithm 1 uses the subprocedure $\textsc{Leapfrog}$, which enjoys the following property.

Lemma 2.1.

If $(x', v') = \textsc{Leapfrog}(x, v)$, then $(x, -v) = \textsc{Leapfrog}(x', -v')$.

Proof.

Recall that $(x', v') = \textsc{Leapfrog}(x, v)$ implies

$$v_{1/2} = v - \frac{\eta}{2}\nabla f(x), \qquad x' = x + \eta v_{1/2}, \qquad v' = v_{1/2} - \frac{\eta}{2}\nabla f(x').$$

Reversing these definitions yields the claim. ∎
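A short numerical check of Lemma 2.1 (our illustration; the cubic gradient is an arbitrary strongly convex test case): running the integrator from $(x', -v')$ returns $(x, -v)$ exactly, up to floating point.

```python
import numpy as np

def leapfrog(x, v, eta, grad_f):
    v_half = v - (eta / 2) * grad_f(x)
    x_new = x + eta * v_half
    return x_new, v_half - (eta / 2) * grad_f(x_new)

rng = np.random.default_rng(1)
grad_f = lambda x: x + 0.1 * x**3          # gradient of a strongly convex f
x, v = rng.standard_normal(5), rng.standard_normal(5)
x1, v1 = leapfrog(x, v, 0.2, grad_f)
x2, v2 = leapfrog(x1, -v1, 0.2, grad_f)    # rerun with negated velocity
print(np.allclose(x2, x), np.allclose(v2, -v))   # True True
```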

Corollary 2.2.

The density $\pi \propto \exp(-f(x))$ is a stationary distribution for the Markov chain defined by Algorithm 1.

Proof.

We show that for $z = (x, v)$, the density $\pi(z) \propto \exp(-H(z))$ is the stationary distribution on $\mathbb{R}^d \times \mathbb{R}^d$; correctness then follows from $\pi(z)$ having the correct marginal in $x$. Stationarity follows if and only if detailed balance,

$$\pi(z)\,\mathcal{T}_z(z') = \pi(z')\,\mathcal{T}_{z'}(z),$$

holds for all pairs $z$, $z'$, where we overload the definition of $\mathcal{T}_z$ to be the transition distribution from a point $z$. By the standard proof of correctness for the Metropolis-Hastings correction, i.e. choosing an acceptance probability proportional to

$$\min\left(1,\ \frac{\pi(z')\,\mathcal{P}_{z'}(z)}{\pi(z)\,\mathcal{P}_z(z')}\right),$$

it suffices to show that the proposal is symmetric, i.e. $\mathcal{P}_z(z') = \mathcal{P}_{z'}(z)$. Note that $\textsc{Leapfrog}$ is a deterministic proposal, and uniquely maps $(x, v)$ to a point $(x', v')$. Moreover, by symmetry of $\exp(-\frac{1}{2}\|v\|_2^2)$ in the sign of $v$, each iteration of Algorithm 1 is equivalent to drawing $v$, negating it, and then running $\textsc{Leapfrog}$. Correctness for this equivalent algorithm follows by Lemma 2.1. ∎

3 Gradient concentration

In this section, we give a bound on how well the norm of the gradient concentrates when $f$ is smooth and $x \sim \pi \propto \exp(-f(x))$. First, we recall the following “Hessian-weighted” variant of the Poincaré inequality, which first appeared in [BL76].

Theorem 3.1 (Hessian Poincaré).

For distribution density $\pi \propto \exp(-f(x))$ with twice-differentiable, strictly convex $f$, and continuously differentiable function $g$ with bounded variance with respect to $\pi$,

$$\mathrm{Var}_\pi[g] \leq \mathbb{E}_\pi\left[\nabla g^\top \left(\nabla^2 f\right)^{-1} \nabla g\right].$$
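As a quick one-dimensional sanity check of Theorem 3.1 (our illustration; the quartic density and cubic test function are arbitrary choices), numerical integration confirms the variance bound:

```python
import numpy as np
from scipy.integrate import quad

# pi(x) ∝ exp(-f(x)) with f(x) = x^2/2 + x^4/4, so f''(x) = 1 + 3x^2 >= 1.
f = lambda x: x**2 / 2 + x**4 / 4
fpp = lambda x: 1 + 3 * x**2
g, gp = lambda x: x**3, lambda x: 3 * x**2   # test function and its derivative

dens = lambda x: np.exp(-f(x))
Z = quad(dens, -np.inf, np.inf)[0]
E = lambda h: quad(lambda x: h(x) * dens(x), -np.inf, np.inf)[0] / Z

var_g = E(lambda x: g(x) ** 2) - E(g) ** 2
bound = E(lambda x: gp(x) ** 2 / fpp(x))     # Hessian-weighted upper bound
print(var_g <= bound, round(var_g, 4), round(bound, 4))
```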

An immediate corollary of Theorem 3.1 is that the Poincaré constant of a $\mu$-strongly logconcave distribution is at most $1/\mu$. While our concentration bound does not appear to have been previously stated in the literature, it can be viewed as a simple application of an argument of Herbst, which reduces concentration to an isoperimetric inequality such as Theorem 3.1; an exposition of this technique can be found in [LED99]. We now state the concentration result.

Theorem 3.2 (Gradient norm concentration).

If twice-differentiable $f$ is $L$-smooth and strongly convex, then for $x \sim \pi \propto \exp(-f(x))$ and all $t \geq 0$,

$$\Pr_{x \sim \pi}\left[\|\nabla f(x)\|_2 \geq \mathbb{E}_\pi\left[\|\nabla f(x)\|_2\right] + t\sqrt{L}\right] \leq C e^{-t}, \text{ for a universal constant } C.$$

Proof.

Let $g(x) := \|\nabla f(x)\|_2$, and let $G(\lambda) := \mathbb{E}_\pi[\exp(\lambda g(x))]$. We remark that $g$ is continuously differentiable, and the existence of its variance follows from strong logconcavity of $\pi$ and Lipschitzness of the gradient of $f$. Then, Theorem 3.1 implies

$$\mathrm{Var}_\pi\left[e^{\lambda g/2}\right] \leq \frac{\lambda^2}{4}\,\mathbb{E}_\pi\left[e^{\lambda g}\,\nabla g^\top \left(\nabla^2 f\right)^{-1} \nabla g\right] \leq \frac{\lambda^2 L}{4}\,G(\lambda).$$

In the last inequality we used smoothness: since $\nabla g = \frac{\nabla^2 f\,\nabla f}{\|\nabla f\|_2}$, we have $\nabla g^\top (\nabla^2 f)^{-1}\nabla g = \frac{\nabla f^\top \nabla^2 f\,\nabla f}{\|\nabla f\|_2^2} \leq L$. Letting $c_\lambda := \frac{\lambda^2 L}{4}$, for $\lambda < 2/\sqrt{L}$,

$$G(\lambda) = \mathrm{Var}_\pi\left[e^{\lambda g/2}\right] + G\left(\frac{\lambda}{2}\right)^2 \leq c_\lambda\,G(\lambda) + G\left(\frac{\lambda}{2}\right)^2 \implies G(\lambda) \leq \left(1 - c_\lambda\right)^{-1} G\left(\frac{\lambda}{2}\right)^2.$$

Using this recursively, we have

$$G(\lambda) \leq \prod_{k=0}^{K-1}\left(1 - \frac{\lambda^2 L}{4^{k+1}}\right)^{-2^k} G\left(\frac{\lambda}{2^K}\right)^{2^K}.$$

There are two things to estimate on the right hand side. First, for $\lambda \leq 1/\sqrt{L}$, the product converges as $K \to \infty$ and is bounded by a universal constant $C$. Second, letting $K \to \infty$, [BL97] showed that

$$\lim_{K \to \infty} G\left(\frac{\lambda}{2^K}\right)^{2^K} = \exp\left(\lambda\,\mathbb{E}_\pi[g]\right).$$

For completeness, we show this in Appendix E. Altogether, we have that for all $0 \leq \lambda \leq 1/\sqrt{L}$,

$$G(\lambda) \leq C\exp\left(\lambda\,\mathbb{E}_\pi[g]\right).$$

By Markov's inequality on the exponential, we thus conclude that

$$\Pr\left[g(x) \geq \mathbb{E}_\pi[g] + t\right] \leq e^{-\lambda\left(\mathbb{E}_\pi[g] + t\right)}\,G(\lambda) \leq C e^{-\lambda t}.$$

Finally, letting $\lambda = 1/\sqrt{L}$ and rescaling $t \mapsto t\sqrt{L}$ yields the claim. ∎

As an immediate corollary, we obtain the following.

Corollary 3.3.

If twice-differentiable $f$ is $L$-smooth and $\mu$-strongly convex, then for $x \sim \pi \propto \exp(-f(x))$ and all $\delta \in (0, 1)$,

$$\Pr_{x \sim \pi}\left[\|\nabla f(x)\|_2 \geq \sqrt{Ld} + \sqrt{L}\,\log\frac{C}{\delta}\right] \leq \delta,$$

where $C$ is the universal constant of Theorem 3.2.

Proof.

It suffices to show that

$$\mathbb{E}_\pi\left[\|\nabla f(x)\|_2\right] \leq \sqrt{Ld}. \quad (1)$$

We adapt a proof from [VW19]. Observe that because

$$\nabla \cdot \left(e^{-f(x)}\,\nabla f(x)\right) = e^{-f(x)}\left(\Delta f(x) - \|\nabla f(x)\|_2^2\right),$$

where $\nabla\cdot$ is divergence and $\Delta$ is the Laplacian operator, integrating both sides and noting that the boundary term vanishes,

$$\mathbb{E}_\pi\left[\|\nabla f(x)\|_2^2\right] = \mathbb{E}_\pi\left[\Delta f(x)\right] \leq Ld.$$

In the last inequality, we used smoothness of $f$, which implies $\Delta f \leq Ld$ pointwise. (1) then follows from concavity of the square root. ∎

We remark that for densities where a log-Sobolev variant of the inequality in Theorem 3.1 holds, we can sharpen the bound in Corollary 3.3 to roughly $\sqrt{Ld} + \sqrt{L\log(1/\delta)}$; we provide details in Appendix B. This sharpening is desirable for reasons related to the warmness of starting distributions for sampling from $\pi$, as will become clear in Section 4. However, the “Hessian log-Sobolev” inequality is strictly stronger than Theorem 3.1, and does not hold for general strongly logconcave distributions [BL00]. Correspondingly, the concentration arguments derivable from Poincaré inequalities appear to be weaker [LED99]; we find exploring the tightness of Corollary 3.3 to be an interesting open question.

4 Mixing time bounds via blocking conductance

We first give a well-known bound on the warmness of an initial distribution.

Lemma 4.1 (Initial warmness).

For $\mu_0 := \mathcal{N}(x^\star, \frac{1}{L}I)$, where $f$ is $L$-smooth and $\mu$-strongly convex with minimizer $x^\star$, $\mu_0$ is a $\kappa^{d/2}$-warm distribution with respect to $\pi$.

Proof.

By smoothness and strong convexity, and the density of a Gaussian distribution,

$$\mu_0(x) = \left(\frac{L}{2\pi}\right)^{d/2} \exp\left(-\frac{L}{2}\|x - x^\star\|_2^2\right) \leq \left(\frac{L}{2\pi}\right)^{d/2} \exp\left(-\left(f(x) - f(x^\star)\right)\right),$$

and

$$\pi(x) = \frac{\exp(-f(x))}{\int \exp(-f(y))\,dy} \geq \left(\frac{\mu}{2\pi}\right)^{d/2} \exp\left(-\left(f(x) - f(x^\star)\right)\right).$$

In the last inequality we normalized by $\int \exp(-f(y))\,dy \leq \exp(-f(x^\star))\left(\frac{2\pi}{\mu}\right)^{d/2}$, using strong convexity. Combining these bounds yields the result, as $\mu_0(x)/\pi(x) \leq (L/\mu)^{d/2} = \kappa^{d/2}$. ∎

Let $\pi_k$ be the density of $x_k$ after running $k$ steps of Algorithm 1, where $x_0$ is drawn from $\mu_0$. Moreover, let $\bar{\pi}_k := \frac{1}{k}\sum_{j \in [k]} \pi_j$ be the average density over the first $k$ iterations. In Section 4.1, we will show how to use the blocking conductance framework of [KLM06] to obtain a bound on the number of iterations required to obtain a constant-accuracy distribution. We then show in Section 4.2 that we can boost this guarantee to obtain total variation distance $\epsilon$ for arbitrary $\epsilon$ with logarithmic overhead, resulting in our main mixing time claim, Theorem 4.8.

4.1 Constant-accuracy mixing

We state the results required to prove a mixing-time bound for constant levels of total variation distance from the stationary measure $\pi$. All proofs are deferred to Appendix C. The first result is a restatement of the main result of [KLM06], modified for our purposes; recall that $\bar{\pi}_k$ is an average over the distributions $\pi_j$ for $j \in [k]$. Finally, we define $\Phi(S)$ to be the probability that one step of the walk, started at a random point of a set $S$, leaves the set.

Theorem 4.2 (Blocking conductance mixing bound).

Suppose the starting distribution $\mu_0$ is $\beta$-warm with respect to $\pi$. Moreover, suppose that for some $B$, and for all sets $S$ with $\pi(S) \in [\frac{1}{\beta}, B]$, we have a bound

(2)

for a decreasing function $\psi$ on the range $[\frac{1}{\beta}, B]$, extended by $\psi(t) := \psi(B)$ for $t \in (B, \frac{1}{2}]$. Then,

At a high level, the mixing time bound requires us to choose a threshold inversely proportional to the warmness, and to bound the average value of the function $\psi$ over the corresponding range, where $\psi(t)$ serves as an indicator of how “bottlenecked” sets of measure exactly equal to $t$ are.

Next, by using a logarithmic isoperimetric inequality from [CDW+19], we show in the following lemma that we can bound $\psi(t)$ when $t$ is in an appropriate range.

Lemma 4.3.

Suppose that for a set $\Omega$ with $\pi(\Omega)$ sufficiently large, all $x \in \Omega$ satisfy

(3)

Then, if $\pi$ is $\mu$-strongly logconcave, and the step size is appropriately chosen, for all $t$ in the required range,

For a more formal statement and proof, see Lemma C.9. Note that in particular the lower end of the range required by Theorem 4.2 is inversely proportional to the warmness, which causes the gradient norm bound obtained over the high-probability set to lose roughly a $\sqrt{d\log\kappa}$ factor. To this end, for a fixed positive $\delta$, denote

(4)

In Appendix D, we show the following.

Lemma 4.4.

For an appropriate choice of step size $\eta$, and all $x \in \Omega$, with high probability over the draw of $v$,

By combining these pieces, we are able to obtain an algorithm which mixes to constant total variation distance from the stationary distribution in $\tilde{O}(\kappa d)$ iterations.

Theorem 4.5.

From any $\beta$-warm initial distribution $\mu_0$, running Algorithm 1 for $N$ iterations, where $N$ is a uniformly random integer in a range whose endpoints are universal-constant multiples of the iteration bound above, returns a point from a distribution $\bar{\pi}_N$ such that $\|\bar{\pi}_N - \pi\|_{TV}$ is at most a sufficiently small constant.

Proof.

Throughout this proof, let the parameter $\delta$ in the definition (4) be 1. Note that the random iteration count $N$ in the theorem statement precisely implements the averaged distribution $\bar{\pi}_N$ of Theorem 4.2. Moreover, for the set $\Omega$ in (4), the probability that a sample from $\pi$ is not in $\Omega$ is bounded via Corollary 3.3 by

(5)

For our choice of parameters, the hypothesis of Lemma 4.3 is satisfied for all set measures in the required range. Thus, we can apply Lemma 4.3 and conclude that (2) holds for the function

Next, note that this function attains its maximum within the required range, so the conditions of Theorem 4.2 apply, such that

Thus, by choosing the iteration count to be a sufficiently large constant multiple, the guarantee follows. ∎

Corollary 4.6.

Let $\epsilon \in (0, 1)$. From any $\frac{\beta}{\epsilon}$-warm initial distribution $\mu_0$, running Algorithm 1 for $N$ iterations, where $N$ is a uniformly random integer in a range as in Theorem 4.5, returns a point from a distribution $\bar{\pi}_N$ such that $\|\bar{\pi}_N - \pi\|_{TV}$ is at most a sufficiently small constant.

Proof.

It suffices to replace all occurrences of $\beta$ with $\frac{\beta}{\epsilon}$ in the proof of Theorem 4.5, and note that we accounted for the additional dependence on $\epsilon$ in the warmness requirement in the definition (4). ∎

4.2 High-accuracy mixing

We now state a general framework for turning guarantees such as Theorem 4.5 into an $\epsilon$-accuracy mixing bound, with overhead logarithmic in $1/\epsilon$. We defer a more specific statement and proof to Appendix C.4.

Lemma 4.7.

Suppose there is a Markov chain with transitions given by $\mathcal{T}$, and some nonnegative integer $T$, such that for every $\mu$ which is a $\beta$-warm distribution with respect to $\pi$,

(6)

Then, if $\mu_0$ is a $\beta$-warm start, and $k$ is a sufficiently large multiple of $\log\frac{1}{\epsilon}$,

$$\left\|\mu_0 \mathcal{T}^{kT} - \pi\right\|_{TV} \leq \epsilon.$$

At a high level, the proof technique is to couple points according to the total variation bound between the current distribution and $\pi$ every $T$ iterations, for as long as the total variation distance remains at least $\epsilon$. This in turn allows us to bound the warmness of the “conditional distribution” of uncoupled points, using the fact that the total variation bound measures the size of the set of uncoupled points, and to apply the guarantee (6) iteratively. We can now state our main claim.

Theorem 4.8 (Mixing of Hamiltonian Monte Carlo).

There is an algorithm initialized from a point drawn from $\mu_0 = \mathcal{N}(x^\star, \frac{1}{L}I)$, which iterates Algorithm 1 for

$$\tilde{O}\left(\kappa d \log\frac{1}{\epsilon}\right)$$

iterations, where $\tilde{O}$ hides polylogarithmic factors in $\kappa$ and $d$, and produces a point from a distribution $\pi_{\mathrm{alg}}$ such that $\|\pi_{\mathrm{alg}} - \pi\|_{TV} \leq \epsilon$.

Proof.

Define a Markov chain with transitions $\mathcal{T}$, whose one-step distribution from an initial point is given by running the algorithm of Corollary 4.6. Note that each step of the Markov chain with transitions $\mathcal{T}$ requires $\tilde{O}(\kappa d)$ iterations of Algorithm 1, and the averaging step is easily implementable by sampling a random stopping time uniformly at random. Moreover, the Markov chain with transitions $\mathcal{T}$ satisfies (6), by the guarantees of Corollary 4.6 combined with the concentration bound of Corollary 3.3. Thus, by running $O(\log\frac{1}{\epsilon})$ iterations of this Markov chain, we obtain the required guarantee via Lemma 4.7. ∎
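The epoch structure of this proof is straightforward to express in code. The sketch below (our illustration) reuses the `hmc` routine from the sketch in Section 2.2, and the $[T, 2T]$ range for the random stopping time is a placeholder for the range specified in Theorem 4.5:

```python
import numpy as np

def boosted_hmc(x0, eta, T, n_epochs, f, grad_f, rng):
    """Run n_epochs epochs; each epoch performs a uniformly random number of
    HMC steps (implementing the averaged distribution by sampling a random
    stopping time), and its endpoint seeds the next epoch."""
    x = x0
    for _ in range(n_epochs):
        n_steps = int(rng.integers(T, 2 * T + 1))
        x = hmc(x, eta, n_steps, f, grad_f, rng)[-1]
    return x
```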

5 Step-size lower bound

We make progress towards showing that standard Metropolized first-order methods (i.e. Algorithm 1, which we recall is equivalent to the Metropolis-adjusted Langevin dynamics) have mixing times with at least a linear dependence on $\kappa$. Formally, consider the $d$-dimensional quadratic

$$f(x) = \frac{L}{2}\sum_{i \in [d-1]} x_i^2 + \frac{\mu}{2}x_d^2,$$

where $x_i$ is the $i$-th coordinate of $x$, and $L \geq \mu > 0$. This choice of $f$ is $\mu$-strongly convex and $L$-smooth. We show that running Algorithm 1 with too large a step size, from a random point drawn from the stationary distribution, will have an exponentially small accept probability. Moreover, [CV19] shows that even the continuous-time HMC algorithm with step size $\eta$ requires $\Omega(\frac{1}{\eta^2\mu})$ iterations to converge.

Theorem 5.1 (Theorem 4 of [CV19]).

For $L$-smooth, $\mu$-strongly convex $f$, the relaxation time of ideal HMC for sampling from the density proportional to $\exp(-f(x))$ with step size $\eta$ is $\Omega(\frac{1}{\eta^2\mu})$.

Lemma 5.2.

In one step of Algorithm 1, suppose the starting point $x$ is drawn from the stationary distribution $\pi$. For any step size $\eta$ above a sufficiently large threshold, for sufficiently large $d$, and with high probability over the randomness of $x$ and $v$, the accept probability is exponentially small in $d$.

Proof.

Recalling $x' = x + \eta v - \frac{\eta^2}{2}\nabla f(x)$ and $v' = v - \frac{\eta}{2}\left(\nabla f(x) + \nabla f(x')\right)$, we compute the change in the Hamiltonian, $H(x', v') - H(x, v)$.

Let $x_i$ be the $i$-th coordinate of $x$. For $i \in [d-1]$, the coordinate has curvature $L$, and its contribution to the energy error is bounded using our assumption on the size of $\eta$. Then, letting $\tilde{x}$ be the vector containing the first $d-1$ entries of the vector $x$,

(7)

By standard tail bounds on the chi-squared distribution (Lemma 1 of [LM00]), with probability at least a constant, all of the following hold:

Plugging these bounds into (7), noting the growth of the dominant term, and using that the third term is nonnegative under our assumption on $\eta$, we see that for sufficiently large $d$, the probability the chain can move is exponentially small. ∎
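The collapse of the accept probability is easy to observe numerically. The sketch below (our illustration) uses our reading of the construction above, with $d - 1$ coordinates of curvature $L = 1$ and one of curvature $\mu$, starts the chain at stationarity, and estimates the average accept probability for a few step sizes; the particular values of $\eta$ are arbitrary and are not the threshold of Lemma 5.2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L_s, mu = 10_000, 1.0, 1e-3
s = np.full(d, L_s); s[-1] = mu                  # per-coordinate curvatures
H = lambda x, v: 0.5 * ((s * x**2).sum() + (v**2).sum())

for eta in (0.1, 0.5, 1.0):
    acc = []
    for _ in range(500):
        x = rng.standard_normal(d) / np.sqrt(s)  # x ~ pi (stationary start)
        v = rng.standard_normal(d)
        x2 = x + eta * v - (eta**2 / 2) * s * x  # leapfrog; grad f(x) = s * x
        v2 = v - (eta / 2) * s * (x + x2)
        acc.append(np.exp(min(0.0, H(x, v) - H(x2, v2))))
    print(f"eta = {eta:.1f}: mean accept probability = {np.mean(acc):.4f}")
```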

Appendix A Equivalence of HMC and Metropolis-adjusted Langevin dynamics

We briefly remark on the equivalence of Metropolized HMC and the Metropolis-adjusted Langevin dynamics algorithm (MALA), a well-studied algorithm since its introduction in [BES94]. This equivalence was also commented on in [CDW+19]. The algorithm can be seen as a filtered discretization of the continuous-time Langevin dynamics,

$$dx_t = -\nabla f(x_t)\,dt + \sqrt{2}\,dW_t,$$

where $W_t$ is Brownian motion. In short, the Metropolized HMC proposal with step size $\eta$ is

$$x' = x_k - \frac{\eta^2}{2}\nabla f(x_k) + \eta v_k, \qquad v_k \sim \mathcal{N}(0, I).$$

Similarly, the MALA proposal with step size $h$ is

$$x' = x_k - h\nabla f(x_k) + \sqrt{2h}\,\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I).$$

It is clear that in HMC the distribution of $x'$ is

$$\mathcal{N}\left(x_k - \frac{\eta^2}{2}\nabla f(x_k),\ \eta^2 I\right),$$

so it suffices to show that for $h = \frac{\eta^2}{2}$, the MALA proposal distribution $\mathcal{N}\left(x_k - h\nabla f(x_k),\ 2hI\right)$ coincides with this. Indeed, substituting $h = \frac{\eta^2}{2}$, the right hand side simplifies to $\mathcal{N}\left(x_k - \frac{\eta^2}{2}\nabla f(x_k),\ \eta^2 I\right)$, as desired; a direct computation shows that the Metropolis-Hastings acceptance probabilities of the two methods then agree as well.
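A quick numerical confirmation of the proposal equivalence (our illustration; the quadratic $f$ is an arbitrary test case):

```python
import numpy as np

rng = np.random.default_rng(0)
grad_f = lambda x: x                    # f(x) = ||x||^2 / 2, for illustration
x, eta = np.array([1.0, -2.0]), 0.3
h = eta**2 / 2                          # the claimed correspondence

hmc_prop = x - (eta**2 / 2) * grad_f(x) + eta * rng.standard_normal((10**6, 2))
mala_prop = x - h * grad_f(x) + np.sqrt(2 * h) * rng.standard_normal((10**6, 2))
print(hmc_prop.mean(0), mala_prop.mean(0))  # both ~ x - h * grad_f(x)
print(hmc_prop.std(0), mala_prop.std(0))    # both ~ sqrt(2h) = eta
```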