Semi-Separable Hamiltonian Monte Carlo
for Inference in Bayesian Hierarchical Models
Sampling from hierarchical Bayesian models is often difficult for MCMC methods, because of the strong correlations between the model parameters and the hyperparameters. Recent Riemannian manifold Hamiltonian Monte Carlo (RMHMC) methods have significant potential advantages in this setting, but are computationally expensive. We introduce a new RMHMC method, which we call semi-separable Hamiltonian Monte Carlo, which uses a specially designed mass matrix that allows the joint Hamiltonian over model parameters and hyperparameters to decompose into two simpler Hamiltonians. This structure is exploited by a new integrator which we call the alternating blockwise leapfrog algorithm. The resulting method can mix faster than simpler Gibbs sampling while being simpler and more efficient than previous instances of RMHMC.
Semi-Separable Hamiltonian Monte Carlo
for Inference in Bayesian Hierarchical Models
Yichuan Zhang School of Informatics University of Edinburgh Y.Zhangfirstname.lastname@example.org Charles Sutton School of Informatics University of Edinburgh email@example.com
Bayesian statistics provides a natural way to manage model complexity and control overfitting, with modern problems involving complicated models with a large number of parameters. One of the most powerful advantages of the Bayesian approach is hierarchical modeling, which allows partial pooling across a group of datasets, allowing groups with little data to borrow information from similar groups with larger amounts of data. However, such models pose problems for Markov chain Monte Carlo (MCMC) methods, because the joint posterior distribution is often pathological due to strong correlations between the model parameters and the hyperparameters . For example, one of the most powerful MCMC methods is Hamiltonian Monte Carlo (HMC). However, for hierarchical models even the mixing speed of HMC can be unsatisfactory in practice, as has been noted several times in the literature [3, 4, 11]. Riemannian manifold Hamiltonian Monte Carlo (RMHMC)  is a recent extension of HMC that aims to efficiently sample from challenging posterior distributions by exploiting local geometric properties of the distribution of interest. However, it is computationally too expensive to be applicable to large scale problems.
In this work, we propose a simplified RMHMC method, called Semi-Separable Hamiltonian Monte Carlo (SSHMC), in which the joint Hamiltonian over parameters and hyperparameters has special structure, which we call semi-separability, that allows it to be decomposed into two simpler, separable Hamiltonians. This condition allows for a new efficient algorithm which we call the alternating blockwise leapfrog algorithm. Compared to Gibbs sampling, SSHMC can make significantly larger moves in hyperparameter space due to shared terms between the two simple Hamiltonians. Compared to previous RMHMC methods, SSHMC yields simpler and more computationally efficient samplers for many practical Bayesian models.
2 Hierarchical Bayesian Models
Let be a collection of data groups where th data group is a collection of iid observations and their inputs . We assume the data follows a parametric distribution , where is the model parameter for group . The parameters are assumed to be drawn from a prior , where is the hyperparameter with prior distribution . The joint posterior over model parameters and hyperparameters is then
This hierarchical Bayesian model is popular because the parameters for each group are coupled, allowing the groups to share statistical strength. However, this property causes difficulties when approximating the posterior distribution. In the posterior, the model parameters and hyperparameters are strongly correlated. In particular, usually controls the variance of to promote partial pooling, so the variance of depends strongly on This causes difficulties for many MCMC methods, such as the Gibbs sampler and HMC. An illustrative example of pathological structure in hierarchical models is the Gaussian funnel distribution . Its density function is defined as , where is the vector of low-level parameters and is the variance hyperparameters. The pathological correlation between and is illustrated by Figure 1.
3 Hamiltonian Monte Carlo on Posterior Manifold
Hamiltonian Monte Carlo (HMC) is a gradient-based MCMC method with auxiliary variables. To generate samples from a target density , HMC constructs an ergodic Markov chain with the invariant distribution , where is an auxiliary variable. The most common choice of is a Gaussian distribution with precision matrix . Given the current sample , the transition kernel of the HMC chain includes three steps: first sample , second propose a new sample by simulating the Hamiltonian dynamics and finally accept the proposed sample with probability , otherwise leave unchanged. The last step is a Metropolis-Hastings (MH) correction. Define . The Hamiltonian dynamics is defined by the differential equations , where is called the position and is called the momentum.
It is easy to see that , which is called the energy preservation property [11, 10]. In physics, is known as the Hamiltonian energy, and is decomposed into the sum of the potential energy and the kinetic energy . The most used discretized simulation in HMC is the leapfrog algorithm, which is given by the recursion
where is the step size of discretized simulation time. After steps from the current sample , the new sample is proposed as the last point . In Hamiltonian dynamics, the matrix is called the mass matrix. If is constant w.r.t. , then and are independent in . In this case we say that is a separable Hamiltonian. In particular, we use the term standard HMC to refer to HMC using the identity matrix as . Although HMC methods often outperform other popular MCMC methods, they may mix slowly if there are strong correlations between variables in the target distribution. Neal  showed that HMC can mix faster if is not the identity matrix. Intuitively, such a acts like a preconditioner. However, if the curvature of varies greatly, a global preconditioner can be inadequate.
For this reason, recent work, notably that on Riemannian manifold HMC (RMHMC) , has considered non-separable Hamiltonian methods, in which varies with position , so that and are no longer independent in . The resulting Hamiltonian is called a non-separable Hamiltonian. For example, for Bayesian inference problems, Girolami et al.  proposed using the Fisher Information Matrix (FIM) of , which is the metric tensor of posterior manifold. However, for a non-separable Hamiltonian, the simple leapfrog dynamics (2a)-(2c) do not yield a valid MCMC method, as they are no longer reversible. Simulation of general non-separable systems requires the generalized leapfrog integrator (GLI) , which requires computing higher order derivatives to solve a system of non-linear differential equations. The computational cost of GLI in general is where is the number of parameters, which is prohibitive for large .
In hierarchical models, there are two ways to sample the posterior using HMC. One way is to sample the joint posterior directly. The other way is to sample the conditional and , simulating from each conditional distribution using HMC. This strategy is called HMC within Gibbs . In either case, HMC chains tend to mix slowly in hyperparameter space, because the huge variation of potential energy across different hyperparameter values can easily overwhelm the kinetic energy in separable HMC . Hierarchical models also pose a challenge to RMHMC, if we want to sample the model parameters and hyperparameters jointly. In particular, the closed-form FIM of the joint posterior is usually unavailable. Due to this problem, even sampling some toy models like the Gaussian funnel using RMHMC becomes challenging. Betancourt  proposed a new metric that uses a transformed Hessian matrix of , and Betancourt and Girolami  demonstrate the power of this method for efficiently sampling hyperparameters of hierarchical models on some simple benchmarks like Gaussian funnel. However, the transformation requires computing eigendecomposition of the Hessian matrix, which is infeasible in high dimensions.
Because of these technical difficulties, RMHMC for hierarchical models is usually used within a block Gibbs sampling scheme, alternating between and . This RMHMC within Gibbs strategy is useful because the simulation of the non-separable dynamics for the conditional distributions may have much lower computational cost than that for the joint one. However, as we have discussed, in hierarchical models these variables tend be very strongly correlated, and it is well-known that Gibbs samplers mix slowly in such cases . So, the Gibbs scheme limits the true power of RMHMC.
4 Semi-Separable Hamiltonian Monte Carlo
In this section we propose a non-separable HMC method that does not have the limitations of Gibbs sampling and that scales to relatively high dimensions, based on a novel property that we will call semi-separability. We introduce new HMC methods that rely on semi-separable Hamiltonians, which we call semi-separable Hamiltonian Monte Carlo (SSHMC).
4.1 Semi-Separable Hamiltonian
In this section, we define the semi-separable Hamiltonian system. Our target distribution will be the posterior of a hierarchical model (1), where and . Let be the vector of momentum variables corresponding to and respectively. The non-separable Hamiltonian is defined as
where the potential energy is and the kinetic energy is which includes the normalization term . The mass matrix can be an arbitrary p.d. matrix. For example, previous work on RMHMC  has chosen to be FIM of the joint posterior , resulting in an HMC method that requires time. This limits applications of RMHMC to large scale problems.
To attack these computational challenges, we introduce restrictions on the mass matrix to enable efficient simulation. In particular, we restrict to have the form
where and are the precision matrices of and , respectively. Importantly, we restrict to be independent of and to be independent of . If has these properties, we call the resulting Hamiltonian a semi-separable Hamiltonian. A semi-separable Hamiltonian is still in general non-separable, as the two random vectors and are not independent.
The semi-separability property has important computational advantages. First, because is block diagonal, the cost of matrix operations reduces from to . Second, and more important, substituting the restricted mass matrix into (3) results in the potential and kinetic energy:
If we fix or , the non-separable Hamiltonian (3) can be seen as a separable Hamiltonian plus some constant terms. In particular, define the notation
Then, considering as fixed, the non-separable Hamiltonian in (3) is different from the following separable Hamiltonian
only by some constant terms that do not depend on . What this means is that any update to that leaves invariant leaves the joint Hamiltonian invariant as well. An example is the leapfrog dynamics on , where is considered the potential energy, and the kinetic energy.
Similarly, if are fixed, then differs from the following separable Hamiltonian
only by terms that are constant with respect to
Notice that and are coupled by the terms and . Each of these terms appears in the kinetic energy of one of the separable Hamiltonians, but in the potential energy of the other one. We call these terms auxiliary potentials because they are potential energy terms introduced by the auxiliary variables. These auxiliary potentials are key to our method (see Section 4.3).
4.2 Alternating block-wise leapfrog algorithm
Now we introduce an efficient SSHMC method that exploits the semi-separability property. As described in the previous section, any update to that leaves invariant also leaves the joint Hamiltonian invariant, as does any update to that leaves invariant. So a natural idea is simply to alternate between simulating the Hamiltonian dynamics for and that for . Crucially, even though the total Hamiltonian is not separable in general, both and are separable. Therefore when simulating and , the simple leapfrog method can be used, and the more complex GLI method is not required.
We call this method the alternating block-wise leapfrog algorithm (ABLA), shown in Algorithm 1. In this figure the function “leapfrog” returns the result of the leapfrog dynamics (2a)-(2c) for the given starting point, Hamiltonian, and step size. We call each iteration of the loop from an ALBA step. For simplicity, we have shown one leapfrog step for and for each ALBA step, but in practice it is useful to use multiple leapfrog steps per ALBA step. ABLA has discretization error due to the leapfrog discretization, so the MH correction is required. If it is possible to simulate and exactly, then is preserved exactly and there is no need for MH correction.
To show that the SSHMC method by ALBA preserves the distribution , we also need to show the ABLA is a time-reversible and volume-preserving transformation in the joint space of . Let where and . Obviously, any reversible and volume-preserving transformation in a subspace of is also reversible and volume-preserving in . It is easy to see that each leapfrog step in the ABLA algorithm is reversible and volume-preserving in either or . One more property of integrator of interest is symplecticity. Because each leapfrog integrator is symplectic in the subspace of , they are also symplectic in . Then because ABLA is a composition of symplectic leapfrog integrators, and the composition of symplectic transformations is symplectic, we know ABLA is symplectic.
We emphasize that ABLA is actually not a discretized simulation of the semi-separable Hamiltonian system , that is, if starting at a point in the joint space, we run the exact Hamiltonian dynamics for for a length of time , the resulting point will not be the same as that returned by ALBA at time even if the discretized time step is infinitely small. For example, ALBA simulates with step size and with step size where , when that preserves .
4.3 Connection to other methods
Although the SSHMC method may seem similar to RMHMC within Gibbs (RMHMCWG), SSHMC is actually very different. The difference is in the last two terms of (7) and (10); if these are omitted from SSHMC and the Hamiltonians for , then we obtain HMC within Gibbs. Particularly important among these two terms is the auxiliary potential, because it allows each of the separable Hamiltonian systems to borrow energy from the other one. For example, if the previous leapfrog step increases the kinetic energy in , then, in the next leapfrog step for , we see that will have greater potential energy , because the auxiliary potential is shared. That allows the leapfrog step to accommodate a larger change of using . So, the chain will mix faster in . By the symmetry of and , the auxiliary potential will also accelerate the mixing in .
Another way to see this is that the dynamics in RMHMCWG for preserves the distribution but not the joint . That is because the Gibbs sampler does not take into account the effect of on . In other words, the Gibbs step has the stationary distribution rather than The difference between the two is the auxiliary potential. In contrast, the SSHMC methods preserve the Hamiltonian of .
4.4 Choice of mass matrix
The choice of and in SSHMC is usually similar to RMHMCWG. If the Hessian matrix of is independent of and always p.d., it is natural to define as the inverse of the Hessian matrix. However, for some popular models, e.g., logistic regression, the Hessian matrix of the likelihood function depends on the parameters . In this case, one can use any approximate Hessian , like the Hessian at the mode, and define , where is the Hessian of the prior distribution. Such a rough approximation is usually good enough to improve the mixing speed, because the main difficulty is the correlation between model parameters and hyperparameters.
In general, because the computational bottleneck in HMC and SSHMC is computing the gradient of the target distribution, both methods have the same computational complexity , where is the cost of computing the gradient and is the total number of leapfrog steps per iteration. However, in practice we find it very beneficial to use multiple steps in each blockwise leapfrog update in ALBA; this can cause SSHMC to require more time than HMC. Also, depending on the mass matrix , the cost of leapfrog a step in ABLA may be different from those in standard HMC. For some choices of , the leapfrog step in ABLA can be even faster than one leapfrog step of HMC. For example, in many models the computational bottleneck is the gradient , is the normalization in prior. Recall that is a function of . If , will be canceled out, avoiding computation of . One example is using in Gaussian funnel distribution aforementioned in Section 2. A potential problem of such is that the curvature of the likelihood function is ignored. But when the data in each group is sparse and the parameters are strongly correlated, this can give nearly optimal mixing speed and make SSHMC much faster.
In general, any choice of and that would be valid for separable HMC with Gibbs is also valid for SSHMC.
5 Experimental Results
In this section, we compare the performance of SSHMC with the standard HMC and RMHMC within Gibbs  on four benchmark models.111Our use of a Gibbs scheme for RMHMC follows standard practice . The step size of all methods are manually tuned so that the acceptance rate is around -. The number of leapfrog steps are tuned for each method using preliminary runs. The implementation of RMHMC we used is from . The running time is wall-clock time measured after burn-in. The performance is evaluated by the minimum Effective Sample Size (ESS) over all dimensions (see ). When considering the different computational complexity of methods, our main efficiency metric is time normalized ESS.
|HMC with diagonal constant mass||SSHMC (semi-separable mass)|
|time(s)||min ESS(, )||min ESS/s (, )||MSE(, )|
|HMC||5.16||(302.97, 26.30)||(58.64, 5.09)||(2.28, 1.34)|
|RMHMC(Gibbs)||2.7||(2490.98, 8.93)||(895.15, 3.21)||(1.95, 1.33)|
|SSHMC||37.35||(3868.79, 1541.67)||(103.57, 41.27)||(0.04, 0.02)|
|running time(s)||ESS (min, med, max)||ESS||min ESS/s|
|HMC||378||(2.05, 3.68, 4.79)||815||2.15|
|RMHMC(Gibbs)||411||(0.8, 4.08, 4.99)||271||0.6|
|SSHMC||385.82||(2.5, 3.42, 4.27)||2266||5.83|
|time (s)||ESS (min, med, max)||ESS||min ESS/s|
|HMC||162||(1.6, 2.2, 5.2)||(50, 50, 128)||0.31|
|RMHMC(Gibbs)||183||(12.1, 18.4, 33.5)||(385, 163, 411)||0.89|
|SSHMC||883||(78.4, 98.9, 120.7)||(4434, 1706, 1390)||1.57|
5.1 Demonstration on Gaussian Funnel
We demonstrate SSHMC by sampling the Gaussian Funnel (GF) defined in Section 2. We consider dimensional low-level parameters and 1 hyperparameter . RMHMC within Gibbs on GF has block diagonal mass matrix defined as and . We use the same mass matrix in SSHMC, because it is semi-separable. We use 2 leapfrog steps for low-level parameters and 1 leapfrog step for the hyperparameter in ABLA and the same leapfrog step size for the two separable Hamiltonians.
We generate 5000 samples from each method after 1000 burn-in iterations. The ESS per second (ESS/s) and mean squared error (MSE) of the sample estimated mean and variance of the hyperparameter are given in Table 1. Notice that RMHMC is much more efficient for the low-level variable because of the adaptive mass matrix with hyperparameter. Figure 1 illustrates a dramatic difference between HMC and SSHMC. It is clear that HMC suffers from oscillation of the hyperparameter in a narrow region. That is because the kinetic energy limits the change of hyperparameters [11, 3]. In contrast, SSHMC has much wider energy variation and the trajectory spans a larger range of hyperparameter . The energy variation of SSHMC is similar to the RMHMC with Soft-Abs metric (RMHMC-Soft-Abs) reported in , an instance of general RMHMC without Gibbs. But compared with , each ABLA step is about 100 times faster than each generalized leapfrog step and SSHMC can generate around 2.5 times more effective samples per second than RMHMC-Soft-Abs. Although RHMC within Gibbs has better ESS/s on the low level variables, its estimation of the mean and variance is biased, indicating that the chain has not yet mixed. More important, Table 1 shows that the samples generated by SSHMC give nearly unbiased estimates of the mean and variance of the hyperparameter, which neither of the other methods are able to do.
5.2 Hierarchical Bayesian Logistic Regression
In this experiment, we consider hierarchical Bayesian logistic regression with exponential prior for the variance hyperparameter , that is
where is the logistic function and is the th data points the th group. We use the Statlog (German credit) dataset from . This dataset includes 1000 data points and each data has categorical features and numeric features. Bayesian logistic regression on this dataset has been considered as a benchmark for HMC [7, 8], but the previous work uses only one group in their experiments. To make the problem more interesting, we partition the dataset into groups according to the feature Purpose. The size of group varies from 9 to 285. There are model parameters ( parameters for each group) and hyperparameter.
We consider the reparameterization of the hyperparameter . For RMHMC within Gibbs, the mass matrix for group is where is the Fisher Information matrix for model parameter and constant mass . In each iteration of the Gibbs sampler, each is sampled from by RMHMC using 6 generalized leapfrog steps and is sampled using 6 leapfrog steps. For SSHMC, and the same constant mass .
The results are shown in Table 2. SSHMC again has much higher ESS/s than the other methods.
5.3 Stochastic Volatility
A stochastic volatility model we consider is studied in , in which the latent volatilities are modeled by an auto-regressive AR(1) process such that the observations are with latent variable . We consider the distributions , and . The joint probability is defined as
where the prior , and . The FIM of depends on the hyperparameters but not , but the FIM of depends on . For RMHMC within Gibbs we consider FIM as the metric tensor following . For SSHMC, we define as inverse Hessian of , but as an identity matrix. In each ABLA step, we use 5 leapfrog steps for updates of and 2 leapfrog steps for updates of the hyperparameters, so that the running time of SSHMC is about 7 times that of standard HMC.
We generate 20000 samples using each method after 10000 burn-in samples. The histogram of hyperparameters is shown in Figure 2. It is clear that all methods approximately converge to the same distribution. But from Table 3, we see that SSHMC generates almost two times as many ESS/s as RMHMC within Gibbs.
5.4 Log-Gaussian Cox Point Process
The log-Gaussian Cox Point Process (LGCPP) is another popular testing benchmark [5, 7, 14]. We follow the experimental setting of Girolami and Calderhead . The observations are counts at the location , on a regular spatial grid, which are conditionally independent given a latent intensity process with mean , where , , and . is assigned by a Gaussian process prior, with mean function and covariance function where is the Euclidean distance between and . The log joint probability is given by We consider a grid that has 1024 latent variables. Each latent variable corresponds to a single observation .
We consider RMHMC within Gibbs with FIM of the conditional posteriors. See  for the definition of FIM. The generalized leapfrog steps are required for updating , but only the leapfrog steps are required for updating . Each Gibbs iteration takes 20 leapfrog steps for and 1 general leapfrog step for . In SSHMC, we use and . In each ABLA step, the update of takes 2 leapfrog steps and the update of takes 1 leapfrog step. Each SSHMC transition takes 10 ALBA steps. We do not consider HMC on LGCPP, because it mixes extremely slowly for hyperparameters.
The results of ESS are given in Table 4. The mean of sampled latent variable and the histogram of sampled hyperparameters are given in Figure 3. It is clear that the samples of RMHMC and SSHMC are consistent, so both methods are mixing well. However, SSHMC generates about six times as many effective samples per hour as RMHMC within Gibbs.
|time(h)||ESS (min, med, max)||ESS()||min ESS/h|
|SSHMC||2.6||(7.8, 30, 39)||(2101, 270)||103.8|
|RMHMC(Gibbs)||2.64||(1, 29, 38.3)||(200, 46)||16|
We have presented Semi-Separable Hamiltonian Monte Carlo (SSHMC), a new version of Riemannian manifold Hamiltonian Monte Carlo (RMHMC) that aims to retain the flexibility of RMHMC for difficult Bayesian sampling problems, while achieving greater simplicity and lower computational complexity. We tested SSHMC on several different hierarchical models, and on all the models we considered, SSHMC outperforms both HMC and RMHMC within Gibbs in terms of number of effective samples produced in a fixed amount of computation time. Future work could consider other choices of mass matrix within the semi-separable framework, or the use of SSHMC within discrete models, following previous work in discrete HMC [15, 12].
- Bache and Lichman  K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
- Betancourt  M. J. Betancourt. A General Metric for Riemannian Manifold Hamiltonian Monte Carlo. ArXiv e-prints, Dec. 2012.
- Betancourt and Girolami  M. J. Betancourt and M. Girolami. Hamiltonian Monte Carlo for Hierarchical Models. ArXiv e-prints, Dec. 2013.
- Choo  K. Choo. Learning hyperparameters for neural network models using Hamiltonian dynamics. PhD thesis, Citeseer, 2000.
- Christensen et al.  O. F. Christensen, G. O. Roberts, and J. S. Rosenthal. Scaling limits for the transient phase of local Metropolis–Hastings algorithms. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):253–268, 2005.
- Geyer  C. J. Geyer. Practical Markov Chain Monte Carlo. Statistical Science, pages 473–483, 1992.
- Girolami and Calderhead  M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011. ISSN 1467-9868. doi: 10.1111/j.1467-9868.2010.00765.x. URL http://dx.doi.org/10.1111/j.1467-9868.2010.00765.x.
- Hoffman and Gelman [In press] M. D. Hoffman and A. Gelman. The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, In press.
- Kim et al.  S. Kim, N. Shephard, and S. Chib. Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies, 65(3):361–393, 1998.
- Leimkuhler and Reich  B. Leimkuhler and S. Reich. Simulating Hamiltonian dynamics, volume 14. Cambridge University Press, 2004.
- Neal  R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–162, 2011.
- Pakman and Paninski  A. Pakman and L. Paninski. Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. In Advances in Neural Information Processing Systems 26, pages 2490–2498. 2013.
- Robert and Casella  C. P. Robert and G. Casella. Monte Carlo statistical methods, volume 319. Citeseer, 2004.
- Wang et al.  Z. Wang, S. Mohamed, and N. de Freitas. Adaptive Hamiltonian and Riemann manifold Monte Carlo samplers. In International Conference on Machine Learning (ICML), pages 1462–1470, 2013. URL http://jmlr.org/proceedings/papers/v28/wang13e.pdf. JMLR W&CP 28 (3): 1462–1470, 2013.
- Zhang et al.  Y. Zhang, C. Sutton, A. Storkey, and Z. Ghahramani. Continuous relaxations for discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems (NIPS), 2012.