# Stochastic Gradient Monomial Gamma Sampler

## Abstract

Recent advances in stochastic gradient techniques have made it possible to estimate posterior distributions from large datasets via Markov Chain Monte Carlo (MCMC). However, when the target posterior is multimodal, mixing performance is often poor. This results in inadequate exploration of the posterior distribution. A framework is proposed to improve the sampling efficiency of stochastic gradient MCMC, based on Hamiltonian Monte Carlo. A generalized kinetic function is leveraged, delivering superior stationary mixing, especially for multimodal distributions. Techniques are also discussed to overcome the practical issues introduced by this generalization. It is shown that the proposed approach is better at exploring complex multimodal posterior distributions, as demonstrated on multiple applications and in comparison with other stochastic gradient MCMC methods.

## 1Introduction

The development of increasingly sophisticated Bayesian models in modern machine learning has accentuated the need for efficient generation of asymptotically exact samples from complex posterior distributions. Markov Chain Monte Carlo (MCMC) is an important framework for drawing samples from a target density function. MCMC sampling typically aims to estimate a desired expectation in terms of a collection of samples, avoiding the need to compute intractable integrals. The Metropolis algorithm [27] was originally proposed to tackle this task. Despite great success, this method is based on *random walk* exploration, which often leads to inefficient posterior sampling (with a finite number of samples). Alternatively, exploration of a target distribution can be guided using *proposals* inspired by *Hamiltonian dynamics*, leading to Hamiltonian Monte Carlo (HMC) [15]. Aided by gradient information, HMC is able to move efficiently in parameter space, thus greatly improving exploration. However, the emergence of big datasets poses a new challenge for HMC, as evaluation of gradients on whole datasets becomes computationally demanding, if not prohibitive, in many cases.

To scale HMC methods to big data, recent advances in Stochastic Gradient MCMC (SG-MCMC) have subsampled the dataset into *minibatches* in each iteration, to decrease computational burden [36]. Stochastic Gradient Langevin Dynamics (SGLD) [36] was first proposed to generate approximate samples from a posterior distribution using minibatches. Since then, research has focused on leveraging the minibatch idea while also providing theoretical guarantees. For instance, [32] showed that by appropriately injecting noise while using a stepsize-decay scheme, SGLD is able to converge asymptotically to the desired posterior. Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) [12] extended SGLD with auxiliary momentum variables, akin to HMC, and introduced a *friction* term to counteract the stochastic noise due to subsampling. However, exact estimation of such noise is needed to guarantee a correct SGHMC sampler. To alleviate this issue, the Stochastic Gradient Nosé-Hoover Thermostat (SGNHT) [14] algorithm introduced so-called *thermostat* variables to adaptively estimate stochastic noise via a thermal-equilibrium condition.

One standing challenge of SG-MCMC methods is inefficiency when exploring complex multimodal distributions. This limitation is commonly found in latent variable models with a multi-layer structure. Inefficiency is manifested because sampling algorithms have difficulties moving across modes, while traveling along the surface of the distribution. As a result, it may take a very large number of iterations (posterior samples) to cover more than one mode, greatly limiting scalability.

We investigate strategies for improving mixing in SG-MCMC. We propose the Stochastic Gradient Monomial Gamma Thermostat (SGMGT), building upon the Monomial Gamma Sampler (MGS) proposed by [37]. They showed that a generalized kinetic function typically improves the stationary mixing efficiency of HMC, especially when the target distribution has multiple modes. However, this advantage comes with numerical difficulties, and convergence problems due to poor initialization. By defining a *smooth* version of this generalized kinetic function, we can leverage its mixing efficiency, while satisfying the required conditions for stationarity of the corresponding stochastic process, as well as alleviating numerical difficulties arising from differentiability issues. To ameliorate the convergence issues, we further introduce ) a sampler with an underlying elliptic stochastic differential equation system and ) a resampling scheme for auxiliary variables (momentum and thermostats) with theoretical guarantees. The result is an elegant framework to improve stationary mixing performance on existing SG-MCMC algorithms augmented with auxiliary variables.

## 2Preliminaries

**Hamiltonian Monte Carlo** Suppose we are interested in sampling from a posterior distribution represented as , where denotes model parameters and represents data points. Assuming i.i.d. data, the *potential energy function* denotes the negative log posterior density, up to a normalizing constant, *i.e.*, . For simplicity, in the following we omit the conditioning of in , and write . In HMC, the posterior density is augmented with an auxiliary momentum random variable ; is independent of , and typically has a marginal Gaussian distribution with zero-mean and covariance matrix . The joint distribution is written as , where is the total energy (or *Hamiltonian*) and is the standard (Gaussian) *kinetic energy function*, and is the mass matrix. HMC leverages Hamiltonian dynamics, driven by the following differential equations

where is the system’s time index.

The total Hamiltonian is preserved under perfect simulation, *i.e.*, by solving exactly. However, closed-form solutions for and are often intractable, thus numerical integrators such as the *leap-frog* method are utilized to generate approximate samples of [29]. This leads to the following update scheme:

where is the *stepsize*.

**Monomial Gamma HMC** In the Monomial Gamma Hamiltonian Monte Carlo (MGHMC) [37] algorithm, the following *generalized* kinetic function is employed as a substitute for the Gaussian kinetics of standard HMC:

where denotes the element-wise power operation, is the monomial parameter. Note that when , recovers the standard (Gaussian) kinetics. For general , the update equations are identical to , except for

[37] proved in the univariate case that MGHMC can yield better mixing performance when the sampler reaches its stationary distribution, under perfect dynamic simulation, *i.e.,* infinitesimal stepsize in the limit and adequate (finite) simulation stepsize. Additionally, it was shown that for multimodal distributions sampled via MGMHC, the probability of getting trapped in a single mode goes to zero, as .

However, these theoretical advantages are accompanied by two practical issues: ) the numerical difficulties accentuate dramatically as increases, due to the lack of differentiability of for , and ) convergence is slow with poor initialization. For example, in and , if is far away from the mode(s) of the distribution, will be large, causing the updated momentum to blow up. This renders the change of , *i.e.,* , to be arbitrarily small for large , thus slowing convergence.

**Stochastic Gradient MCMC** SG-MCMC is desirable when the dataset, , is too large to evaluate the potential using all samples. The idea behind SG-MCMC is to replace with an unbiased *stochastic likelihood*, , evaluated from a subset of data (termed a minibatch)

where is a random subset of of size . SG-MCMC algorithms are typically driven by a continuous-time Markov stochastic process of the form [10]

where denotes the parameters of the *augmented* system, *e.g.*, and , and are referred as *drift* and *diffusion* vectors, respectively, and denotes a standard Wiener process.

In SGHMC [12], the resulting stochastic dynamic process is governed by the following Stochastic Differential Equations (SDEs) (with ):

where , is a function of , and is a function of . is modeled as , where and is the discretization stepsize. is an estimator of , is a user-specified *diffusion* factor and is the identity matrix. [12] set for simplicity. The reasoning is that the injected noise will dominate as ( remains as a constant), whereas goes to zero. Unfortunately, the covariance function, , of the stochastic noise, , is difficult to estimate in practice.

Recently, SGNHT [14] considered incorporating additional auxiliary variables (thermostats). The resulting SDEs correspond to

where represents the Hadamard (element-wise) product and are thermostat variables. Note that the diffusion factor, , is decoupled in , thus can adaptively fit to the unknown noise from the stochastic gradient .

## 3Stochastic Gradient Monomial Gamma Sampler

We now consider ) a more efficient (generalized) kinetic function, ) adapting the proposed kinetics to satisfy stationary requirements and alleviate numerical difficulties, ) incorporating an additional first-order stochastic process to and ) stochastic resampling of the momentum and thermostats to lessen convergence issues.

**Generalized kinetics** The statistical physics literature traditionally considers a quadratic form of the kinetics function, and a Gaussian distribution for the thermostats in , when analyzing the dynamic system of a canonical ensemble [34]. Inspired by this, one typical assumption in previous SG-MCMC work is that the marginal distribution for the momentum and thermostat is Gaussian [14]. However, this assumption, while convenient, does not necessarily guarantee an optimal sampler.

In recent work, [23] extended the standard (Newtonian) kinetics to a more general form inspired by relativity theory. By bounding the momentum, their *relativistic Monte Carlo* can lessen the problem associated with large potential gradients, , thus resulting in a more robust alternative to standard HMC. Further, [37] demonstrated that adopting non-Gaussian kinetics delivers better mixing and reduces sampling autocorrelation, especially for cases where the posterior distribution has multiple modes.

These ideas motivate a more general framework to characterize SG-MCMC, with potentially non-Gaussian kinetics and thermostats. As a relaxation of SGNHT [14], we consider a Hamiltonian system defined in a more general form

where and are any valid potential functions, inherently implying that and , define valid probability density functions.

We first consider the SDEs of SGNHT with generalized kinetics . The system can be obtained by generalizing (with identity mass matrix for simplicity) in with arbitrary , thus

However, if we set as in with , the dynamics governing the SDEs in will often fail to converge. This is because the sufficient condition to guarantee that the Itô process governed by the SDEs in converge to a stationary distribution generally requires the Fokker-Planck equation to hold [30]. Further, the existence and uniqueness of the solutions to the Fokker-Planck equation require Lipschitz continuity of drift and diffusion vectors in [6]. Unfortunately, this is not the case for the drift vectors in when , as is non-differentiable at the origin, *i.e.*, .

**Softened kinetics** The above limitation can be avoided by using a *softened* kinetic function . However, to keep the performance benefits from the original *stiff* kinetics, we must ensure that has the same tail behavior. We propose that for , the softened kinetics are (for clarity we consider 1D case, however higher dimensions still apply)

where is a *softening* parameter. Note that is (infinitely) differentiable for any and asymptotically approaches the stiff kinetics as . A comparison between stiff kinetics, , and softened kinetics is shown in Figure 1, for different values of . Discussion and formulation of the softened kinetics for arbitrary (and ) are provided in the Supplementary Material (SM).

To generate samples of the momentum variable, , from the density with softened kinetics, which is proportional to , we use a coordinate-wise rejection sampling, *i.e.*, the proposed for the -th dimension is rejected with probability .

In practice, setting to a relatively large value would still make the gradient ill-posed close to , thus causing high *integration error* when simulating the Hamiltonian dynamics. Conversely, setting to a small value will cause a *high approximation* error w.r.t. the original , thus resulting in a less efficient sampler. Consequently, has to be determined empirically as a trade-off between integration and approximation errors.

**Additional First Order Dynamics** Inspired by [24], we consider adding Brownian motion to and in , with variances and , respectively, while maintaining the stochastic process (asymptotically) converging to the correct marginal distribution of . Specifically, we consider the following SDEs:

The variances control the Brownian motion for , respectively, and denotes a rescaling factor for the friction term of momentum updates. The additional terms and can be understood as first-order Langevin dynamics [36]. The variance term, , controls the contribution of to the update of w.r.t. . This is analogous to the hyperparameter balancing and in the SGD-with-momentum algorithm [31]. Derivation details for and in , as well as other values of , are provided in the SM.

The following theorem, proven in the SM, shows that under regularity conditions, the SDEs in lead to posterior samples from the invariant joint distribution , yielding the desired marginal distribution w.r.t. as .

The reasoning behind increasing *stochasticity* in the SDEs is two-fold. First, the additional Langevin dynamics are crucial to SG-MCMC with generalized kinetics for large . For instance, for , the update for from is . When and is large, will be close to zero, thus (the next sample) will be close to , *i.e.*, the sampler moves arbitrarily slow. As discussed by [37], this can happen when moves to a region where the gradient takes a large absolute value, *e.g.*, near the low-density regions in a light-tailed distribution. Fortunately, the additional Langevin dynamics in , , compensate for the weak updating signal from , by an immediate gradient signal . Additionally, when becomes small, will become large. As a result, these two updating signals and compensate each other, thereby delivering a stable updating scheme. Likewise, the immediate gradient in can provide complementary updating signal for the thermostat variables, , to offset the weak deterministic update , when is large.

Second, has noise components on all parameters, , making the corresponding SDEs *elliptic*. From a theoretical perspective, ellipticity/hypoellipticity are necessary conditions to guarantee existence of bounded solutions for a particular partial differential equation related to the diffusion’s *infinitesimal generator*, which lies in the core of most recent SG-MCMC theory [32]. Ellipticity is characterized by a noise process (Brownian motion) covering all components of the system, via the diffusion, , in . This means is block diagonal, thus a positive definite matrix [26]. In a typical hypoelliptic case, the noise process is imposed on a subset of . However, hypoellipticity also requires the noise to be able to spread through the system via the drift term, , which may not be true for general . For instance, in , and is block diagonal with entries , *i.e.*, and are not explicitly influenced by the noise process, , thus hypoellipticity cannot be guaranteed.

To the authors’ knowledge, for existing SG-MCMC algorithms, only SGLD where , satisfies the ellipticity property, while other algorithms such as SGHMC and SGNHT assume hypoellipticity, thus their corresponding are not positive definite.

One caveat of is that if and are too large, the updates will be dominated by first-order dynamics, thus losing the convergence benefits from second-order dynamics [12]. In practice, and are problem-specific, thus need to be tuned, *e.g.* , by cross-validation.

**Stochastic resampling** When generating samples from the stochastic process in , we resample momentum and thermostats from their marginal distribution with a fixed frequency, instead of every iteration from their conditionals. Since the momentum and thermostats are drawn from the independent marginals of stationary distribution , it can be shown that reconstructing the stochastic process with the solution of the SDEs will still leave the stochastic process invariant to the target stationary distribution [7].

To simplify the discussion, consider a stochastic process of a particle as in with fixed . As show in Figure 2, suppose the initial value of is far from the maximum *a posteriori* value. The dynamics governed by will stochastically move along the Hamiltonian contour. The total Hamiltonian energy level is affected by the joint effect of the stochastic diffusion and momentum refraction (*i.e.*, -), which changes continuously over time.

From previous discussions, moving on a high Hamiltonian contour when is less efficient because the absolute value of the momentum, , will get increasingly large, slowing down the movement of . Resampling of momentum according to its marginal will enable the sampler to immediately move to a lower Hamiltonian energy level.

At the burn-in stage, this *momentum-accumulation/energy-drop* cycle seen in Figure 2(bottom) via resampling momentum can happen several times, until equilibrium is found. In practice, the resulting energy level is often much lower than initially, thereby delivering a more efficient and accurate dynamic updating.

The frequency of resampling from the marginal of the stationary distribution can have a direct impact on the mixing performance. Setting the frequency too high will result in a random-walk behavior. Conversely, with a low frequency resampling, the random-walk behavior is suppressed at a cost of fewer jumps between trajectories associated with different energy levels. It is advisable to increase the resampling frequency if the sampler is initialized on low-density (*e.g.* light-tailed) region.

The resampling step on and plays a role that is similar to adding a Langevin component to , in the sense that both improve convergence for . However, these two strategies (resampling and Langevin) are fundamentally different. We empirically observe that resampling is most helpful during burn-in, while the additional Langevin-style updates are more helpful with mixing during stationary sampling.

**Sgmgt** The specifications described above constitute an SG-MCMC method for the SDEs in , which we call Stochastic Gradient Monomial Gamma Thermostat (SGMGT). We denote the SG-MCMC method with additional Brownian motion on and in as SGMGT-D (Diagonal), *i.e.*, with and . The complete update scheme, with Euler integrator, for SGMGT is presented in the SM. Note that with , SGMGT-D recovers SGHMC as in [12]. Moreover, when , it becomes SGNHT as in [14].

We note that SGMGT-D improves upon SGNHT in three respects: () we introduce generalized kinetics, which provably yield lower autocorrelations than standard HMC, especially in multimodal cases; () the additional stochastic noise on thermostat variables yields more efficient mixing; () we use stochastic resampling to allow for faster interchange between different energy levels, thus alleviating sampling *stickiness*.

To the authors’ knowledge, despite existing analysis for Langevin Monte Carlo [8], rigorous analysis and comparison of the mixing performance of general SG-MCMC is very difficult, thus not yet established. Toward understanding the mixing performance of SGMGT-D, we argue that as the minibatch size increases, and the contribution of the diffusion in decreases, the SGMGT-D will approach MGHMC, in which case, a large will result in high stationary mixing performance, especially when sampling multimodal distribution, as theoretically shown by [37]. Although our experiments support our intuition, a more formal theoretical justification is needed. We leave this as interesting future work.

We observe empirically that when increasing the value of , SGMGT-D may not always achieve superior mixing performance. One possible reason for this is a larger value of induces “stiffer” behavior of at , which typically requires a higher level of softening, thus higher rejection rates during the rejection sampling step. Also, when the dimensionality of is higher, the rejection rate of the rejection sampling step increases (proportional to ). In such cases, the efficiency of the sampler decreases with large . For these reasons, we limit our experiments to .

We clearly have more hyperparameters than SGNHT. In practice, we fix , , and set the resampling frequency , which provides robust performance. Thus, only two additional hyperparameters are employed ( and ) compared to SGNHT, and these parameters require further tuning. We use either validation or a hold-out set in our experiments.

**More accurate numerical integrators** Using a first-order Euler integrator to approximate the solution of the continuous-time SDEs in , leads to errors in the approximate samples [11]. Alternatively, we can use the symmetric splitting scheme of [11] to reduce the order of the approximate error to . Details of the splitting used in this work are provided in the SM.

**Convergence properties** The SGMGT framework, as an instance of SG-MCMC, enjoys the same convergence properties of general SG-MCMC algorithms studied in [10]. It’s worth to mention that on challenging problems the posterior may not be densely sampled to yield ideal posterior computation, and the asymptotic theory is being used as a useful heuristic. Specifically, it is of interest to quantify how fast the sample average, , converges to the true posterior average, , for , where is number of iterations. Here we make the same assumptions of [10], and further assume that a first-order Euler integrator and a fixed stepsize are used.

This proposition indicates that with larger number of iterations and smaller step sizes, smaller bias and MSE bounds can be achieved. We note that these bounds have similar rates compared to other SG-MCMC algorithms such as SGLD, however, as we demonstrate below in the experiments, SGMGT and SGMGT-D usually converge faster than existing SG-MCMC methods.

In addition, for stochastic resampling, we can extend Proposition 2 to the following complementary results:

Proofs of Lemma 1 and Lemma 2 are provided in the SM. The proposed SGMGT framework has a strong connection with second-order stochastic optimization methods, leading to a sampling scheme with minibatches with similar mixing performance as slice sampling [28]. We discuss the details of this in the SM.

## 4Experiments

### 4.1Multiple-well Synthetic Potential

We first evaluate the mixing efficiency of SGMGT and SGMGT-D for a synthetic problem, to generate samples from a complex multimodal distribution. The distribution is shown in Figure 3(left). See SM for the definition of its potential energy. The modes are almost isolated with a low-density region connecting each other. Consequently, traversing between modes is difficult. In order to simulate the noise of the gradient estimates, we set , similar to [14], where .

We compare SGNHT with SGMGT and SGMGT-D with monomial parameter and fix . For all three algorithms, we try a number of hyperparameter settings, *e.g.*, stepsize , , and the soft parameter , and present the best results in Figure 3. Standard SGNHT fails to escape from one of the modes of the distribution. For SGMGT with , the generated samples reached 3 modes. For SGMGT-D with , the sampler identified all 5 modes. In Figure 3(right), SGMGT-D adequately moves across different modes and yields rapid mixing performance, unlike SGMGT which exhibits stickier behavior.

AUROC () | A (15) | G (25) | H (14) | P(8) | R (7) | C (87) |
---|---|---|---|---|---|---|

SGNHT | 0.89 | 0.75 | 0.90 | 0.86 | 0.95 | 0.65 |

SGMGT(a=1) | 0.92 | 0.78 | 0.91 | 0.86 | 0.87 | 0.70 |

SGMGT-D(a=1) | 0.95 | 0.86 | 0.95 | 0.93 |
0.98 |
0.73 |

SGMGT(a=2) | 0.93 | 0.79 | 0.93 | 0.88 | 0.86 | 0.62 |

SGMGT-D(a=2) | 0.95 |
0.90 |
0.95 |
0.90 | 0.97 | 0.69 |

ESS () | A (15) | G (25) | H (14) | P(8) | R (7) | C (87) |

SGNHT | 869 | 941 | 1911 | 2077 | 1761 | 1873 |

SGMGT-D(a=1) | 3147 |
2131 |
2448 | 4244 |
1494 | 3605 |

SGMGT-D(a=2) | 2700 | 1989 | 2768 |
3430 | 2265 |
2969 |

### 4.2Bayesian Logistic Regression

We evaluated the mixing efficiency and accuracy of SGMGT and SGMGT-D using Bayesian logistic regression (BLR) on 6 real-world datasets from the UCI repository [1]: German credit (G), Australian credit (A), Pima Indian (P), Heart (H), Ripley (R) and Caravan (C). The data dimensionality ranges from 7 to 87, and total observations vary between 250 to 5822. Gaussian priors are imposed on the regression coefficients. We set the minibatch size to 16. Other hyperparameters are provided in the SM. For each experiment, we draw 5000 iterations with 1000 burn-in samples.

Results in terms of median Effective Sample Size (ESS) and prediction accuracies as Area Under Receiver Operating Characteristic (AUROC) are summarized in Table 1. All the results are averages over 5 independent runs with random initialization. In general, SGMGT-D performs better than SGMGT. For higher-dimensional dataset Cavaran, the performance of decreases significantly, indicating numerical difficulties. The performance gap between SGMGT and SGMGT-D with or is usually larger than the gap between SGNHT (). Presumably when is greater than 1, SGMGT-D has better convergence.

### 4.3Latent Dirichlet Allocation

We also test our methods on Latent Dirichlet Allocation (LDA) [3]. Details of LDA and our implementation are provided in the SM. We use the ICML dataset [9], which contains 765 documents corresponding to abstracts of ICML proceedings from 2007 to 2011. After stopword removal, we obtain a vocabulary size of 1918 and about 44K words. We use 80% of the documents for training and the remaining 20% for testing. The number of topics is set to 30, resulting in 57,540 parameters. We use a symmetric Dirichlet prior with concentration . All experiments are based on 5000 MCMC samples with 1000 burn-in rounds. We set the minibatch size to 16. Other hyperparameter settings are provided in the SM.

Table ? shows the test perplexities for SGMGT and SGMGT-D for different stepsizes. For each method we highlight the best perplexity. The SGMGT-D with outperforms other methods, however SGMGT with fails to achieve a comparable result with SGMGT with , probably because a good initialization is hard to achieve for a high-dimensional distribution.

### 4.4Discriminative RBM

We applied our SGMGT to the Discriminative Restricted Boltzmann Machine (DRBM) [21] on MNIST data. We choose DRBM instead of RBM because it provides explicit stochastic gradient formulas.

We evaluated our methods empirically and compare them with SGNHT. We use one hidden layer with 500 units. For each method we performed 1500 iterations with 200 burn-in samples. The minibatch size is set to 100. Details of the hyperparameter settings for SGMGT and SGMGT-D are provided in the SM. As shown in Figure 4(right-most 3 panels), we observe that SGMGT-D with yields a superior mixing performance. For SGMGT-D with , the posterior samples demonstrated both rapid local mixing, and long-range movement. In contrast, SGLD seems trapped into a local mode after around 500 iterations.

Figure 4(left) shows that SGMGT-D with delivers the fastest convergence with the highest test accuracy, 0.976. The SGMGT-D improves over SGMGT, while performance of SGMGT-D seems to increase with a large value of . We observed that the stochastic resampling played a crucial role for SGMGT, as removing the resampling step resulted in a large drop in testing performance and mixing efficiency.

### 4.5Recurrent Neural Network

We test our framework on Recurrent Neural Networks (RNNs) for sequence modeling [17]. We consider two tasks: (*i*) polyphonic music prediction; and (*ii*) word-level language modeling, detailed below. Additional details of the experiment are provided in the SM.

**Polyphonic music prediction** We use four datasets: Piano-midi.de (Piano), Nottingham (Nott), MuseData (Muse) and JSB chorales (JSB) [5]. Each of these are represented as a collection of 88-dimensional binary sequences, that span the whole range of piano from A0 to C8.

We use a one-layer LSTM [19] model, and set the number of hidden units to 200. The total number of parameters is around 200K. Each model is trained for 100 epochs. We perform early stopping, while selecting the stepsize and other hyperparameters by monitoring the performance on validation sets. Updates are performed using minibatches from one sequence.

**Language modeling** The Penn Treebank (PTB) corpus [25] is used for word-level language modeling. We adopt the standard split (929K training words, 73K validation words, and 82K test words). The vocabulary size is 10K. We train a two-layer LSTM model on this dataset. The total number of parameters is approximately 6M. Each LSTM layer contains 200 units.

Results are shown in Table ?. The best log-likelihood results on the test set are achieved by using SGMGT-D with either or (depending on the dataset). To compare with optimization-based methods, we also include results for SGD [4] and RMSprop [33]. A more comprehensive comparison is provided in the SM.

Learning curves for Nott and PTB datasets are shown in Figure 5. We omit the SGLD results since they are not comparable with other methods. For both datasets, we observe that SGMGT-D delivers fastest convergence. The best negative log-likelihood is achieved by SGMGT-D . The difference between and is small, though SGMGT-D with seems to decrease slightly faster after 20 epochs for PTB data.

We also observe that the SGMGT with seems suboptimal compared with SGMGT with and SGNHT. We hypothesize that numerical difficulties hinder the success of SGMGT with , especially in higher-dimensional cases, and without the additional Langevin components of SGMGT-D.

## 5Conclusions

We improve upon existing SG-MCMC methods with several generalizations. We employed a more-general kinetic function, which we have shown to have better mixing efficiency, especially for multimodal distributions. Since practical use of the generalized kinetics is limited by convergence issues during burn-in, we injected additional Langevin dynamics and incorporated a stochastic resampling step to obtain generalized SDEs that alleviate the convergence issues. Possible areas of future research include designing an algorithm in a slice-sampling fashion, which maintains the invariant distribution by leveraging the connections between HMC and slice sampling [37]. In addition, it is desirable to design an algorithm that can adaptively choose the monomial parameter , thereby achieving better mixing while automatically avoiding numerical difficulties.

## Acknowledgments

This research was supported by ARO, DARPA, DOE, NGA, ONR and NSF.

### References

**UCI machine learning repository, 2013.**

Bache, Kevin and Lichman, Moshe.**The fundamental incompatibility of Hamiltonian Monte Carlo and data subsampling.**

Betancourt, MJ. ArXiv**Latent Dirichlet allocation.**

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. JMLR**Large-scale machine learning with stochastic gradient descent.**

Bottou, Léon. In*COMPSTAT*, 2010.**Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription.**

Boulanger-Lewandowski, Nicolas, Bengio, Yoshua, and Vincent, Pascal. In*ICML*, 2012.**Existence and uniqueness of solutions to fokker–planck type equations with irregular coefficients.**

Bris, C Le and Lions, P-L. Communications in Partial Differential Equations**Mimicking an itô process by a solution of a stochastic differential equation.**

Brunick, Gerard, Shreve, Steven, et al. The Annals of Applied Probability**Finite-time analysis of projected langevin monte carlo.**

Bubeck, Sebastien, Eldan, Ronen, and Lehec, Joseph. In*NIPS*, 2015.**Dependent normalized random measures.**

Chen, Changyou, Rao, Vinayak, Buntine, Wray, and Whye Teh, Yee. In*ICML*, 2013.**On the convergence of stochastic gradient mcmc algorithms with high-order integrators.**

Chen, Changyou, Ding, Nan, and Carin, Lawrence. In*NIPS*, pp. 2278–2286, 2015.**Bridging the gap between stochastic gradient mcmc and stochastic optimization.**

Chen, Changyou, Carlson, David, Gan, Zhe, Li, Chunyuan, and Carin, Lawrence. In*AISTATS*, 2016.**Stochastic gradient hamiltonian monte carlo.**

Chen, Tianqi, Fox, Emily B, and Guestrin, Carlos. ArXiv**Theoretical guarantees for approximate sampling from smooth and log-concave densities.**

Dalalyan, Arnak S. Journal of the Royal Statistical Society: Series B (Statistical Methodology)**Bayesian sampling using stochastic gradient thermostats.**

Ding, Nan, Fang, Y, Babbush, R, Chen, C, Skeel, R. D, and Neven, H. In*NIPS*, 2014.**Hybrid Monte Carlo.**

Duane, Simon, Kennedy, Anthony D, Pendleton, Brian J, and Roweth, Duncan. Physics letters B**Approximate slice sampling for bayesian posterior inference.**

DuBois, Christopher, Balan, Anoop Korattikara, Welling, Max, and Smyth, Padhraic. In*AISTATS*, pp. 185–193, 2014.**Scalable bayesian learning of recurrent neural networks for language modeling.**

Gan, Zhe, Li, Chunyuan, Chen, Changyou, Pu, Yunchen, Su, Qinliang, and Carin, Lawrence. In*ACL*, 2017.**Markov chain monte carlo lecture notes, 2005.**

Geyer, C. J.**Long short-term memory.**

Hochreiter, Sepp and Schmidhuber, Jürgen. Neural computation**Accelerating diffusions.**

Hwang, Chii-Ruey, Hwang-Ma, Shu-Yin, Sheu, Shuenn-Jyi, et al. The Annals of Applied Probability**Classification using discriminative restricted boltzmann machines.**

Larochelle, H. and Bengio, Y. In*ICML*, 2008.**High-order stochastic gradient thermostats for bayesian learning of deep models.**

Li, Chunyuan, Chen, Changyou, Fan, Kai, and Carin, Lawrence. In*AAAI*, 2016.**Relativistic monte carlo.**

Lu, Xiaoyu, Perrone, Valerio, Hasenclever, Leonard, Teh, Yee Whye, and Vollmer, Sebastian J. arXiv**A complete recipe for stochastic gradient mcmc.**

Ma, Yi-An, Chen, Tianqi, and Fox, Emily. In*NIPS*, pp. 2917–2925, 2015.**Building a large annotated corpus of english: The penn treebank.**

Marcus, Mitchell P, Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Computational linguistics**Construction of numerical time-average and stationary measures via Poisson equations.**

Mattingly, J. C., Stuart, A. M., and Tretyakov, M. V. SIAM J. NUMER. ANAL.**Equation of state calculations by fast computing machines.**

Metropolis, Nicholas, Rosenbluth, Arianna W, Rosenbluth, Marshall N, Teller, Augusta H, and Teller, Edward. The journal of chemical physics**Slice sampling.**

Neal, Radford M. Annals of statistics**MCMC using Hamiltonian dynamics.**

Neal, Radford M. Handbook of Markov Chain Monte Carlo**Fokker-planck equation.**

Risken, Hannes. In*The Fokker-Planck Equation*, pp. 63–95. Springer, 1984.**Learning representations by back-propagating errors.**

Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Cognitive modeling**Consistency and fluctuations for stochastic gradient langevin dynamics.**

Teh, Yee Whye, Thiéry, Alexandre, and Vollmer, Sebastian. ArXiv**Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.**

Tieleman, Tijmen and Hinton, Geoffrey. COURSERA: Neural networks for machine learningStatistical mechanics: theory and molecular simulation

Tuckerman, Mark. .**Exploration of the (non-) asymptotic bias and variance of stochastic gradient langevin dynamics.**

Vollmer, Sebastian J, Zygalakis, Konstantinos C, and Teh, Yee Whye. Journal of Machine Learning Research**Bayesian learning via stochastic gradient langevin dynamics.**

Welling, Max and Teh, Yee W. In*ICML*, 2011.**Towards unifying hamiltonian monte carlo and slice sampling.**

Zhang, Yizhe, Wang, Xiangyu, Chen, Changyou, Henao, Ricardo, Fan, Kai, and Carin, Lawrence. In*NIPS*, 2016.