Variance reduction for Random Coordinate Descent-Langevin Monte Carlo

Variance reduction for Random Coordinate Descent-Langevin Monte Carlo

Abstract

Sampling from a log-concave distribution function is one core problem that has wide applications in Bayesian statistics and machine learning. While most gradient free methods have slow convergence rate, the Langevin Monte Carlo (LMC) that provides fast convergence requires the computation of gradients. In practice one uses finite-differencing approximations as surrogates, and the method is expensive in high-dimensions.

A natural strategy to reduce computational cost in each iteration is to utilize random gradient approximations, such as random coordinate descent (RCD) or simultaneous perturbation stochastic approximation (SPSA). We show by a counter-example that blindly applying RCD does not achieve the goal in the most general setting. The high variance induced by the randomness means a larger number of iterations are needed, and this balances out the saving in each iteration.

We then introduce a new variance reduction approach, termed Randomized Coordinates Averaging Descent (RCAD), and incorporate it with both overdamped and underdamped LMC. The methods are termed RCAD-O-LMC and RCAD-U-LMC respectively. The methods still sit in the random gradient approximation framework, and thus the computational cost in each iteration is low. However, by employing RCAD, the variance is reduced, so the methods converge within the same number of iterations as the classical overdamped and underdamped LMC Dalalyan and Karagulyan (2019); Cheng et al. (2018); Dalalyan and Riou-Durand (2018). This leads to a computational saving overall.

1 Introduction

Monte Carlo Sampling is one of the core problems in Bayesian statistics, data assimilation Reich (2011), and machine learning Andrieu et al. (2003), with wide applications in atmospheric science Fabian (1981), petroleum engineering Nagarajan et al. (2007), remote sensing Li and Newton (2019) and epidemiology Li et al. (2020) in the form of inverse problems Martin et al. (2012), volume computation Vempala (2010), and bandit optimization Russo et al. (2018).

Let be a convex function that is -gradient Lipschitz and -strongly convex in . Define the target probability density function , then is a log-concave function. To sample from the probability distribution induced by amounts to finding an (or a list of ) that can be regarded as i.i.d. (independent and identically distributed) drawn from the distribution.

There is vast literature on sampling, and proposed methods fall into a few different categories. Markov chain Monte Carlo (MCMC) Roberts and Rosenthal (2004) composes a big class of methods, including Metropolis-Hasting based MCMC (MH-MCMC) Metropolis et al. (1953); Hastings (1970), Gibbs samplers (Geman and Geman, 1984; Casella and George, 1992), Hamiltonian Monte Carlo (Neal, 1993; Duane et al., 1987), Langevin dynamics based methods J. et al. (1978) (including both the overdamped Langevin Parisi (1981); Roberts and Tweedie (1996); Dalalyan (2017) and underdamped Langevin Chen et al. (2014); Ma et al. (2015) Monte Carlo), and some kind of combination (such as MALA) Roberts and Tweedie (1996); Roberts and Stramer (2002); Dwivedi et al. (2018); Bou-Rabee et al. (2010). Importance sampling and sequential Monte Carlo Geweke (1989); Neal (2001); Del Moral et al. (2006) framework and ensemble type methods Reich (2011); Garbuno-Inigo et al. (2020); Iglesias et al. (2013); Ding and Li (2019a, b); Ding et al. (2020) are also popular.

Different MCMC methods are implemented differently, but they share the essence, that is to develop a Markovian transition kernel whose invariant measure is the target distribution, so that after many rounds of iteration, the invariant measure is achieved. If the design of the transition kernel does not involve or sense the local behavior of , the convergence is slow Hobert and Robert (2004); Hobert et al. (2001); Roberts and Tweedit (1996); Mengersen and Tweedie (1996).

The Langevin Monte Carlo methods, both the overdamped or underdamped, can be viewed as special kinds of MCMC that involve the computation of . The idea is to find stochastic differential equations (SDEs) whose equilibrium-in-time is the target distribution. These SDEs are typically driven by , and the Overdamped or Underdamped Langevin Monte Carlo (O/U-LMC) can be viewed as the discrete-in-time (such as Euler-Maruyama discretization) version of the Langevin dynamics (SDEs). Since leads the dynamics, fast converge is expected Dalalyan and Karagulyan (2019); Cheng et al. (2018); Dalalyan and Riou-Durand (2018).

However, is typically not available. In particular, if is obtained from inverse problems with an underlying governing differential equation describing the dynamics, as seen in the remote sensing and epidemiology examples above, the explicit formula for is unknown. When this happens, one usually needs to compute all partial derivatives, one by one, either by employing automatic differentiation Baydin et al. (2018), or by surrogating with the finite-difference approximations for every direction . This leads to a cost that is roughly times the number of required iterations. In high dimension, , the numerical cost is high. Therefore, how to sample with a small number of finite differencing approximations with a cost relaxed on , becomes rather crucial.

There are methods proposed to achieve gradient-free property, such as Importance Sampling (IS), Ensemble Kalman methods, random walks methods, and various finite difference approximations to surrogate the gradient. However, IS Geweke (1989); Doucet et al. (2001a, b) has high variance of the weight terms and it leads to wasteful sampling; ensemble Kalman methods Evensen (2006); Bergemann and Reich (2010); Reich (2011); Garbuno-Inigo et al. (2020) usually require Gaussianity assumption Ding and Li (2019a, b); random walk methods such that Metropolized random walk (MRW) Mengersen and Tweedie (1996); Roberts and Tweedie (1996); Roberts and Tweedit (1996), Ball Walk Lovász and Simonovits (1990); Dyer et al. (1991); Lovász and Simonovits (1993) and the Hit-and-run algorithm Bélisle et al. (1993); Kannan et al. (1995); Lovász and Vempala (2006) cannot guarantee fast convergence Vempala (2005); and to our best knowledge, modification of LMC with derivatives replaced by its finite difference approximation Meeds et al. (2015) or Kernel Hilbert space Strathmann et al. (2015) are not yet equipped with theoretical non-asymptotic analysis.

1.1 Contribution

We work under the O/U-LMC framework, and we look for methods that produce i.i.d. samples with only a small number of gradient computation. To this end, the contribution of the paper is twofolded.

We first examine a natural strategy to reduce the cost by adopting randomized coordinate descent (RCD) Nesterov (2012); Wright (2015), a random directional gradient approximation. This method replaces finite difference approximations in directions, by in a randomly selected direction. Presumably this reduces the cost in each iteration by folds, and hopefully the total cost. However, in this article we will show that this is not the case in the general setting. We will provide a counter-example: the high variance induced by the random direction selection process brings up the numerical error, and thus more iterations are needed to achieve the preset error tolerance. This in the end leads to no improvement in terms of the computational cost.

We then propose a variance reduction method to improve the application of RCD to LMC. We call the method Randomized Coordinates Averaging Descent Overdamped/Underdamped LMC (or RCAD-O/U-LMC). The methods start with a fully accurate gradient (up to a discretization error) in the first round of iteration, and in the subsequent iterations they only update the gradient evaluation in one randomly selected direction. Since the methods preserve some information about the gradient along the evolution, the variance is reduced. We prove the new methods converge as fast as the classical O/U-LMC Dalalyan and Karagulyan (2019); Cheng et al. (2018); Dalalyan and Riou-Durand (2018), meaning the preset error tolerance is achieved in the same number of iterations. But since they require only directional derivative per iteration instead of , the overall cost is reduced. We summarize the advantage over the classical O-LMC and U-LMC in Table 1 (assuming computing the full gradient costs times of one partial derivative). The dependence on the conditioning of is omitted in the table, but will be discussed in detail in Section 5.

In some sense, the new methods share some similarity with SAGA Defazio et al. (2014), a modification of SAG (stochastic average gradient) Schmidt et al. (2013). These are two methods designed for reducing variance in the stochastic gradient descent (SGD) framework where the cost function has the form of . Similar approaches are also found in SG-MCMC (stochastic-gradient Markov chain Monte Carlo (SG-MCMC)) Ma et al. (2015); Chen et al. (2014); Gao et al. (2018); Betancourt (2015); Betancourt et al. (2017); Zou et al. (2019); Chatterji et al. (2018). In their cases, variance reduction is introduced in the selection of . In our case, the cost function is a simple convex function, but the gradient can be viewed as and the variance reduction is introduced in the selection of .

There are other variance reduction methods, such as SVRG Johnson and Zhang (2013) and CV-ULD Baker et al. (2017); Chatterji et al. (2018). We leave the discussion to future research.

Algorithm Number of iterations Number of evaluations
O-LMCDalalyan and Karagulyan (2019)
U-LMCCheng et al. (2018); Dalalyan and Riou-Durand (2018)
RCAD-O-LMC
RCAD-U-LMC
Table 1: Number of iterations and directional derivative evaluations of to achieve -accuracy. is the dimension. . If , then , where is a constant independent of . For the overdamped cases, we assume the Lipschitz continuity for the hessian term. Without this assumption, RCAD-O-LMC still outperforms O-LMC, as will be discussed in Section 5.

1.2 Organization

In Section 2, we discuss the essential ingredients of our methods: the random coordinate descent (RCD) method, the overdamped and underdamped Langevin dynamics and the associated Monte Carlo methods (O-LMC and U-LMC). In Section 3, we unify the notations and assumptions used in our methods. In Section 4, we discuss the vanilla RCD applied to LMC and present a counter-example to show it is not effective if used blindly. In Section 5, we introduce our new methods RCAD-O/U-LMC and present the results on convergence and numerical cost. We demonstrate numerical evidence in Section 6. Proofs are rather technical and are all left to appendices.

2 Essential ingredients

2.1 Random coordinate descent (RCD)

When explicit formula for is not available, one needs to compute the partial derivatives for all directions. One straightforward way is to use finite difference: where is the -th unit direction. Given enough smoothness, the introduced error is . For approximating the entire , such finite differencing evaluations are required, and it is expensive in the high dimensional setting when . The cost is similarly bad if one uses automatic differentiation.

Ideally one can take one random direction and computes the derivative in that direction only, and hopefully this random directional derivative reveals some information of the entire gradient . This approach is used in both RCD Wright (2015); Richtarik and Takac (2011); Nesterov (2012) and SPSA Gerencsér (1997); Kleinman et al. (1999). Both methods, instead of calculating the full gradient, randomly pick one direction and use the directional derivative as a surrogate of . More specifically, RCD computes the derivative in one random unit direction and approximates:

(1)

where is randomly drawn from (see the distribution of drawing in Richtarik and Takac (2011)). This approximations is consistent in the expectation sense because

Here is to take expectation.

2.2 Overdamped Langevin dynamics and O-LMC

The O-LMC method is derived from the following Langevin dynamics:

(2)

The SDE characterizes the trajectory of . The forcing term and the Brownian motion term compete: the former drives to the minimum of and the latter provides small oscillations. The initial data is a random variable drawn from a given distribution induced by . Denote the probability density function of , it is a well-known result that satisfies the following Fokker-Planck equation:

(3)

and furthermore, converges to the target density function exponentially fast in time Markowich and Villani (1999).

The overdamped Langevin Monte Carlo (O-LMC), as a sampling method, can be viewed as a discrete-in-time version of the SDE (2). A standard Euler-Maruyama method applied on the equation gives:

(4)

where is i.i.d. drawn from with being the identity matrix of size . Since (4) approximates (2), the density of , denoted as , converges to as , up to a discretization error. It was proved in Dalalyan and Karagulyan (2019) that the convergence to is achieved within iterations if hessian of is Lipschitz. If hessian of is not Lipschitz, the number of iterations increases to . In many real applications, the gradient of is not available and some approximation is used, introducing another layer of numerical error. In Dalalyan and Karagulyan (2019), the authors did discuss the effect of such error, but they assumed the error has bounded variance.

2.3 Underdamped Langevin dynamics and U-LMC

The underdamped Langevin dynamics Chen et al. (2014) is characterized by the following SDE:

(5)

where is a parameter to be tuned. Denote the probability density function of , then satisfies the Fokker-Planck equation

and under mild conditions, it converges to , making the marginal density function for the target  (Villani, 2006; Dolbeault et al., 2009).

The underdamped Langevin Monte Carlo algorithm, U-LMC, can be viewed as a numerical solver to (5). In each step, we sample new particles , where is a Gaussian random vector determined by with the following expectation and covariance:

(6)

We here used the notation to denote the expectation, and to denote the covariance of and . If , we abbreviate it to . The scheme can be interpreted as sampling from the following dynamics in each time interval:

U-LMC does demonstrate faster convergence rate Cheng et al. (2018); Dalalyan and Riou-Durand (2018) than O-LMC. Without the assumption on the hessian of being Lipschitz, the number of iteration is to achieve accuracy. The faster convergence on the discrete level could be explained by the better discretization solver instead of faster convergence of the underlying SDEs. Indeed, without the Lipschitz continuity on the hessian term, the discretizing of (5) produces numerical error. In contrast, the discretization error of (4) is . A third-order discretization was discussed for (5) in Mou et al. (2019), further enhancing the numerical accuracy. Similar to O-LMC, the method needs to numerically approximate . This induces another layer of error, and also requires times of evaluation of .

3 Notations

3.1 Assumption

We make some standard assumptions on :

Assumption 3.1.

The function is -strongly convex and has an -Lipschitz gradient:

  • Convex, meaning for any :

    (7)
  • Gradient is Lipschitz, meaning for any :

    (8)

If is second-order differentiable, these assumptions together mean where is the hessian of . We also define condition number of as

(9)

We will express our results in terms of and . Furthermore, for some results we assume Lipschitz condition of the hessian too:

Assumption 3.2.

The function is second-order differentiable and the hessian of is H-Lipschitz, meaning for any :

(10)

3.2 Wasserstein distance

The Wasserstein distance is a classical quantity that evaluates the distance between two probability measures:

where is the set of distribution of whose marginal distributions, for and respectively, are and . These distributions are called the couplings of and . Here and can be either probability measures themselves or the measures induced by probability density functions and . In this paper we mainly study .

4 Direct application of RCD in LMC, a negative result

We study if RCD can be blindly applied to U-LMC for reducing numerical complexity. This is to replace in the updating formula (4) for U-LMC by the random directional derivative surrogates (1). The resulting algorithms are presented as Algorithm 2 in Appendix A.1.

RCD was introduced in optimization. In Richtarik and Takac (2011), the authors show that despite RCD computes only , instead of directional derivatives in each iteration, the number of iteration needed for achieving -accuracy is , as compared to when the full-gradient is used (suppose Lipschitz coefficient in each direction is at the same order with the total Lipschitz constant). The gain on the cost is mostly reflected by the conditioning of the objective function . This means there are counter-examples for which RCD cannot save compared with ordinary gradient descent. We emphasize that there are of course also plenty examples for which RCD significantly outperforms when is special conditioning structures Richtarik and Takac (2011); Nesterov (2012); Wright (2015). In this article we would like to investigate the general lower-bound situations.

The story is the same for sampling. There are examples that show directly applying the vanilla RCD to U-LMC fails to outperform the classical U-LMC. One example is the following: We assume

where satisfies for all . Denote the sample computed through Algorithm 2 (underdamped) with stepsize . Let be extremely small and the finite differencing error is negligible, and denote the probability density function of , then we can show cannot converge too fast.

Theorem 4.1.

For the example above, choose , there exists uniform nonzero constant such that if satisfy

then

(11)

where , and is the probability density function of -th iteration of RCD-U-LMC.

The proof is found in Section A.2. We note the second term in (11) is rather big. The smallness comes from , the stepsize, and it needs be small enough to balance out the influence from . This puts strong restriction on . Indeed, to have -accuracy, , we need both terms smaller than , and this term suggests that at least. And when combined with restriction from the first term, we arrive at the conclusion that at least iterations are needed, and thus finite differencing approximation are required. The dependence is , and is exactly the same as that in U-LMC, meaning RCD-U-LMC brings no computational advantage over U-LMC in terms of the dependence on the dimension of the problem.

We emphasize that that large second term, as shown in the proof, especially in Section A.2 equation (30), is induced exactly due to the high variance in the gradient approximation. This triggers our investigation into variance reduction techniques.

5 Random direction approximation with variance reduction on O/U-LMC, two positive results

The direct application of RCD induces high variance and thus high error. It leads to many more rounds of iterations for convergence, gaining no numerical saving in the end. In this section we propose RCAD-O/U-LMC with RCAD reducing variance in the framework of RCD. We will prove that while the numerical cost per iteration is reduced by -folds, the number of required iteration is mostly unchanged, and thus the total cost is reduced.

5.1 Algorithm

The key idea is to compute one accurate gradient at the very beginning in iteration No. , and to preserve this information along the iteration to prevent possible high variance. The algorithms for RCAD-O-LMC and RCAD-U-LMC are both presented in Algorithm 1, based on overdamped and underdamped Langevin dynamics respectively. Potentially the same strategy can be combined with SPSA, which we leave to future investigation.

In the methods, an accurate gradient (up to a finite-differencing error) is used in the first step, denoted by , and in the subsequent iterations, only one directional derivative of gets computed and updated in .

Preparation:
1. Input: (space stepsize); (time stepsize); (parameter); (dimension); (stopping index) and .
2. Initial: (overdamped): i.i.d. sampled from a initial distribution induced by and calculate :
(12)
(underdamped): i.i.d. sampled from a initial distribution induced by and calculate as in (12).
Run: For 1. Draw a random number uniformly from . 2. Calculate and flux by letting for and
(13)
3. (overdamped): Draw from :
(14)
(underdamped): Sample where is a Gaussian random variable with expectation and covariance defined in (6), replacing by .
end
Output: .
Algorithm 1 Randomized Coordinate Averaging Decent O/U-LMC (RCAD-O/U-LMC)

5.2 Convergence and numerical cost analysis

We now discuss the convergence of RCAD-O-LMC and RCAD-U-LMC, and compare the results with the classical O-LMC and U-LMC methods Dalalyan and Karagulyan (2019); Cheng et al. (2018). We emphasize that these two papers indeed discuss the numerical error in approximating the gradients, but they both require the variance of error being bounded, which is not the case here. One related work is Chatterji et al. (2018), where the authors construct the Lyapunov function to study the convergence of SG-MCMC. Our proof for the convergence of RCAD-O-LMC is inspired by its technicalities. In Cheng et al. (2018); Chatterji et al. (2018), a contraction map is used for U-LMC, but such map cannot be directly applied to our situation because the variance depends on the entire trajectory of samples. Furthermore, the history of the trajectory is reflected in each iteration, deeming the process to be non-Markovian. We need to re-engineer the iteration formula accordingly for tracing the error propagation.

Convergence for RCAD-O-LMC

For RCAD-O-LMC, we have the following theorem:

Theorem 5.1.

Suppose satisfies Assumption 3.1-3.2 and satisfy

(15)

Then , the Wasserstein distance between , the probability density function of the sample derived from Algorithm 1 (overdamped), and , the target density function, satisfies

(16)

Here , .

See proof in Appendix B. The theorem gives us the strategy of designing stopping criterion: to achieve -accuracy, meaning to have , we can choose to set both terms in (16) less than , and it leads to:

and

This means the cost, also the number of evaluations, is .

Note that the theorem here requires both Assumptions 3.1 and 3.2. We can relax the second assumption. If so, the numerical cost of degrades to , whereas the cost of O-LMC is . Our strategy still outperforms. The proof is the same, and we omit it from the paper.

Convergence for RCAD-U-LMC

For RCAD-U-LMC, we have the following theorem.

Theorem 5.2.

Assume satisfies Assumption 3.1, and set , then there exists a uniformly constant such that if satisfy

(17)

then , the Wasserstein distance between the distribution of the sample , derived from Algorithm 1 (underdamped), and distribution induced by (whose marginal density in is ) decays as:

(18)

See proof in Appendix  C. To achieve -accuracy, meaning to have , we can choose all terms in (18) less than . This gives:

and thus the stopping index needs to be:

This means evaluations of .

6 Numerical result

We demonstrate numerical evidence in this section. We first note that it is extremely difficult to compute the Wasserstein distance between two probability measures in high dimensional problems, especially when they are represented by a number of samples. The numerical result below evaluates a weaker measure:

(19)

where is the test function. are different samples iterate till -th step, and is the target distribution.

In the first example, our target distribution is with , and in the second example we use

For both example, we sample the initial particles according to . We run both RCD-O/U-LMC and RCAD-O/U-LMC using particles and test MSE error with in both examples. In Figure 1 and Figure 2 respectively we show the error with respect to different stepsizes. In all the computation, is big enough. The improvement of adding variance reduction technique is obvious in both examples.

(a)
(b)
Figure 1: Example 1. Decay of Error of O-LMC (left) and U-LMC (right) with and without RCAD.
(a)
(b)
Figure 2: Example 2. Decay of Error of O-LMC (left) and U-LMC (right) with and without RCAD.

7 Conclusion and future work

To our best knowledge, this is the first work that discusses both the negative and positive aspects of applying random gradient approximation, mainly RCD type, to LMC, in both overdamped and underdamped situations without and with variance reduction. Without variance reduction we show the RCD-LMC has the same numerical cost as the classical LMC, and with variance reduction, the numerical cost is reduced in both overdamped and underdamped cases.

There are a few future directions that we would like to pursue. 1. Our method, in its current version, is blind to the structure of . The only assumptions are reflected on the Lipschitz bounds. In Richtarik and Takac (2011); Nesterov (2012); Fercoq and Richtarik (2015) the authors, in studying optimization problems, propose to choose random directions according to the Lipschitz constant in each direction. The idea could potentially be incorporated in our framework to enhance the sampling strategy. 2. Our algorithms are designed based on reducing variance in the RCD framework. Potentially one can also apply variance reduction methods to improve SPSA-LMC. There are also other variance reduction methods that one could explore.

8 Broader Impact

The result provides theoretical guarantee to the application of random coordinate descent to Langevin Monte Carlo, when variance reduction technique is used to reduce the cost. It has potential application to inverse problems emerging from atmospheric science, remote sensing, and epidemiology. This work does not present any foreseeable societal consequence.

{ack}

Both authors acknowledge generous support from NSF-DMS 1750488, NSF-TRIPODS 1740707, Wisconsin Data Science Initiative, and Wisconsin Alumni Research Foundation.

Appendix A Algorithms and Results of RCD-LMC

a.1 Algorithm

We apply RCD as surrogates of the gradient in O/U-LMC. This amounts to replacing the gradient terms in (4) using the approximation (1). The new methods are presented in Algorithm 2, termed RCD-O/U-LMC.

Preparation:
1. Input: (space step); (time step); (parameter); (dimension); (stopping index) and .
2. Initial: (overdamped): i.i.d. sampled from a initial distribution induced by . (underdamped): i.i.d. sampled from the initial distribution induced by .
Run: For 1. Finite difference: calculate flux approximation by RCD:
(20)
with uniformly drawn from . 2. (overdamped): Draw from :
(21)
(underdamped): Sample where is a Gaussian random variable with expectation and covariance defined in (6), replacing by .
end
Output: .
Algorithm 2 RCD-overdamped(underdamped) Langevin Monte Carlo

a.2 A counter-example

In this section, we prove Theorem 4.1.

Fisrt, we define , and denote the probability density of and the probability density of if is distributed according to density function . From [12], we have:

(22)

and thus

(23)
Proof of Theorem 4.1.

Throughout the proof, we drop the superscript “″ to have a concise notation. According to (23), it suffices to find a lower bound for . We first notice

(24)

where takes all randomness into account. This implies to prove (11), it suffices to find a lower bound for second moment of . Indeed, in the end, we will show that

(25)

and thus

proving the statement of the theorem since . To show (25), we first note, by direct calculation:

(26)

then we divide the proof into several steps:

  • First step: Priori estimation

    According to (26), use convergence result of Algorithm 2 ([21] Theorem 4.1), we have for any

    Similar to (24), we have

    which implies

    (27)

    for any .

    Finally, use (27), we can obtain

    (28)
  • Second step: Iteration formula of .

    By the special structure of , we can calculate the second moment explicitly. Since can be written as

    in each step of RCD-U-LMC, according to Algorithm 2, for each and , we have

    (29)

    where is a random variable defined as

    and satisfies

    (30)

    for each . Furthermore,

    (31)

    Now, since , we can replace and by their Taylor expansion:

    (32)

    where are negative constants depends on and satisfy

    Plug (32) into (29), we have

    (33)

    The last three equalities in (33) implies

    Then, we can calculate the iteration formula for and :