The Hamming Ball Sampler
We introduce the Hamming Ball Sampler, a novel Markov Chain Monte Carlo algorithm for efficient inference in statistical models involving high-dimensional discrete state spaces. The sampling scheme uses an auxiliary variable construction that adaptively truncates the model space, allowing iterative exploration of the full model space in polynomial time. The approach generalizes conventional Gibbs sampling schemes for discrete spaces and can be considered a Big Data-enabled MCMC algorithm that provides an intuitive means for user-controlled balance between statistical efficiency and computational tractability. We illustrate the generic utility of our sampling algorithm through application to a range of statistical models.
Statistical inference of high-dimensional discrete-valued vectors or matrices underpins many problems across a variety of applications, including language modelling, genetics and image analysis. Bayesian approaches for such models typically rely on the use of Markov Chain Monte Carlo (MCMC) algorithms to simulate from the posterior distribution over these objects. The effective use of such techniques requires the specification of a suitable proposal distribution that allows the MCMC algorithm to fully explore the discrete state space whilst maintaining sampling efficiency. While there have been intense efforts to design optimal proposal distributions for continuous state spaces, generic approaches for high-dimensional discrete state models have received comparatively less attention; examples include the classic Swendsen-Wang algorithm for Ising/Potts models and more recent Sequential Monte Carlo methods.
In this paper we consider Bayesian inference using MCMC for an unobserved latent discrete-valued sequence or matrix, where each element takes values in a finite discrete set, given observations. We will assume that the observations are conditionally independent given the latent states and model parameters so that the joint distribution factorizes as We further assume that the posterior distribution has a complex dependence structure so that standard MCMC schemes, such as a (Metropolis-within) Gibbs Sampler, using
or a marginal Metropolis-Hastings sampler over based on
are both intractable because exhaustive summation over the entire state space of has exponential complexity.
A popular and tractable alternative is to employ block-conditional (Metropolis-within) Gibbs sampling in which subsets of are updated conditional on other elements being fixed using
where denotes the elements excluding those in . Typical block structures might be rows/columns of , when it is a matrix, or sub-blocks when is a vector. Whilst block-conditional sampling approaches are often convenient (they may be of closed form, allowing for Gibbs sampling without resort to Metropolis-Hastings steps), in high dimensions major alterations to the configuration of may be difficult to achieve, as this must be done via a succession of small (possibly low-probability) incremental changes. Conditional sampling may lead to an inability to escape from local modes in the posterior distribution, particularly if the elements of exhibit strong correlations with each other and with the model parameters.
To address these problems, we propose a novel and generic MCMC sampling procedure for high-dimensional discrete-state models, named the “Hamming Ball Sampler”. This sampling algorithm employs auxiliary variables that allow iterative sampling from slices of the model space. Marginalization within these model slices is computationally feasible and, by using sufficiently large slices, it is also possible to make significant changes to the configuration of . The proposed sampling algorithm spans a spectrum of procedures that contains the marginal and block-conditional Gibbs sampling strategies as extremes. At the same time, it allows the user to express many more novel schemes and to select the one that best balances statistical efficiency and computational tractability. In this sense, the Hamming Ball Sampler is an example of a Big Data-enabled MCMC algorithm.
We demonstrate the utility of the sampling procedure with three different statistical models where exhaustive enumeration is impossible for realistic data sets and illustrate the considerable benefits over standard sampling approaches.
The Hamming Ball Sampler considers an augmented joint probability model that can be factorized as where the extra factor is a conditional distribution over an auxiliary variable which lives in the same space and has the same dimensions as . The conditional distribution is chosen to be a uniform distribution over a neighborhood set centered at , where denotes the indicator function and the normalizing constant is the cardinality of .
The neighborhood set will be referred to as a Hamming Ball since it is defined through Hamming distances so that Here, denotes the Hamming distance and the pairs denote non-overlapping subsets of corresponding entries in such that and . Also, the parameter denotes the maximal distance, or radius, of each individual Hamming Ball set. For instance, these pairs can correspond to different matrix columns so that will be the -th column of and the corresponding column of . Hence, the Hamming Ball would consist of all matrices whose columns differ from the corresponding columns of in at most elements.
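As a concrete illustration of this definition, the ball around a single binary column can be enumerated directly. The following Python sketch (the function name and example values are our own, not from the paper) flips every subset of up to m positions of the center:

```python
from itertools import combinations

def hamming_ball(x, m):
    """All binary vectors within Hamming distance m of the center x."""
    ball = []
    for r in range(m + 1):                       # distance 0, 1, ..., m
        for flips in combinations(range(len(x)), r):
            y = list(x)
            for i in flips:
                y[i] = 1 - y[i]                  # flip the chosen positions
            ball.append(tuple(y))
    return ball

# Radius-2 ball around a length-5 binary column:
# |B| = C(5,0) + C(5,1) + C(5,2) = 1 + 5 + 10 = 16 members.
ball = hamming_ball((0, 1, 0, 1, 1), 2)
```

For a matrix, the full Hamming Ball is simply the Cartesian product of such per-column balls, which is what makes exhaustive enumeration inside the ball tractable column by column.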
1.2 Gibbs sampling
The principle behind the Hamming Ball Sampler is that the use of Gibbs sampling for the augmented joint probability distribution admits the target posterior distribution as a by-product (since marginalization over recovers the target distribution). Specifically, the Hamming Ball Sampler alternates between the steps:
The update of can be implemented as two conditional (Gibbs) updates:
Or, alternatively, as a joint update via a Metropolis-Hastings accept-reject step that draws a new from the proposal distribution and accepts it with probability where is a proposal distribution over the model parameters.
1.3 Restricted state space
Crucially, the restricted state space defined by the Hamming Ball, that has been injected into the model via the auxiliary factor , means that the conditional distribution can be tractably computed as
where is the normalizing constant found by exhaustive summation over all admissible matrices inside the Hamming Ball . Through careful selection of , the cardinality of will be considerably less than the cardinality of so that exhaustive enumeration of all elements inside the Hamming Ball would be computationally feasible.
Overall, the proposed construction uses the auxiliary variable to define a slice of the model given by . Sampling of is performed within this sliced part of the model through . At each iteration, this model slice moves randomly via the re-sampling of in step (6), which simply sets to a random element from . This re-sampling step provides the random exploration necessary to ensure that the overall sampling scheme is ergodic. The amount of exploration depends on the radius, so that can differ from the current state of the chain, say , in at most elements, i.e. the maximum Hamming distance between and is . Similarly, the subsequent step of drawing the new state, say , is such that can differ from in at most elements, and overall it can differ from the previous state in at most elements. From these observations we can conclude that a necessary condition for the algorithm to be ergodic is that . Figure 1A graphically illustrates the workings of the Hamming Ball Sampler.
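To make the two alternating steps concrete, here is a minimal Python sketch of one Hamming Ball iteration for a toy unstructured model with a single binary block. The target `log_post`, the helper names and the bimodal example are illustrative assumptions, not the paper's implementation:

```python
import math
import random
from itertools import combinations

def ball(x, m):
    """All binary vectors within Hamming distance m of x."""
    out = []
    for r in range(m + 1):
        for flips in combinations(range(len(x)), r):
            y = list(x)
            for i in flips:
                y[i] = 1 - y[i]
            out.append(tuple(y))
    return out

def hamming_ball_step(x, m, log_post, rng):
    # Step 1: draw the auxiliary variable U uniformly from the ball around X.
    u = rng.choice(ball(x, m))
    # Step 2: draw a new X from the posterior restricted to the ball around U;
    # the normalizing constant is an exhaustive sum inside the ball only.
    cand = ball(u, m)
    logs = [log_post(c) for c in cand]
    mx = max(logs)
    wts = [math.exp(l - mx) for l in logs]
    r = rng.random() * sum(wts)
    acc = 0.0
    for c, w in zip(cand, wts):
        acc += w
        if acc >= r:
            return c
    return cand[-1]

# Toy bimodal target: mass concentrated on all-zeros and all-ones.
def log_post(x):
    return 5.0 if len(set(x)) == 1 else 0.0

rng = random.Random(0)
x = (0, 0, 0, 0)
visited = set()
for _ in range(300):
    x = hamming_ball_step(x, 2, log_post, rng)
    visited.add(x)
```

With radius m = 2 a single iteration can move up to 2m = 4 positions, so the chain hops between the two modes even though they differ in every element.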
1.4 Selection of blocks
The application of the Hamming Ball Sampler requires the selection of the subsets or blocks . This selection will depend on the conditional dependencies specified by the statistical model underlying the problem to be addressed. For some problems, such as the tumor deconvolution mixture model considered later, there may exist a natural choice for these subsets (e.g. columns of a matrix) that can lead to efficient implementations. For instance, under a suitable selection of blocks the posterior conditional could be fully factorized, or have a simple Markov dependence structure (as for the factorial hidden Markov model example considered later), so that exact simulation of would be feasible. In contrast, for unstructured models, where is just a large pool of fully dependent discrete variables (stored as a -dimensional vector), we can divide the variables into randomly chosen blocks of equal length . In such cases, exact simulation from may not be feasible and instead we can use the Hamming Ball operation to sequentially sample each block. More precisely, this variant of the algorithm can be based on the iteration (6),(8)-(9), with the only difference that steps (6) and (9) are now split into sequential conditional steps,
This scheme can be thought of as a block Hamming Ball Sampler which incorporates standard block Gibbs sampling (see iterations (4)-(5)) as a special case obtained when the radius is equal to the block size . In a purely block Hamming Ball scheme we will have and in general the parameters can be used to jointly control algorithmic performance (see SI: Fig. 1). This scheme is illustrated in Figure 1B and used later in the regression application and discussed in further detail in SI B.1: Blocking strategies.
1.5 Computational complexity
To find the time complexity of the Hamming Ball Sampler we assume for simplicity that either factorizes across the blocks or we use the block scheme in (11). Then, for blocks of size , the computational complexity of the Hamming Ball Sampler scales with the Hamming radius , block size and according to where Contrast this with the block Gibbs Sampler, which has computational complexity of (where ) and is only applicable for small values of the block size . On the other hand, Hamming Ball sampling is more flexible since it allows the use of much larger block sizes by controlling the computational cost through both and . An ideal choice of the Hamming radius is since, as discussed previously, it is possible to change elements per Hamming Ball sampling update, so with it is possible to update all elements in each block in a single update. In the Results section we shall see this outcome empirically in actual applications; however, for large-scale problems, it may not be practical to use Hamming distances beyond . Our simulations will show that even in these circumstances the Hamming Ball Sampler can still be advantageous by providing the flexibility to balance statistical and computational efficiency.
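The cost driver here is the volume of a single block's Hamming Ball, which for S-ary blocks of length L and radius m is the standard sum of binomials. A small Python check (illustrative names; L, S and m are example values) makes the saving over full enumeration explicit:

```python
from math import comb

def ball_cardinality(S, L, m):
    """Number of S-ary length-L blocks within Hamming distance m of a center:
    sum over j = 0..m of C(L, j) * (S-1)^j (choose j positions, then a
    different symbol at each)."""
    return sum(comb(L, j) * (S - 1) ** j for j in range(m + 1))

# Binary blocks (S=2) of length L=10: block Gibbs with full enumeration
# must score 2**10 = 1024 configurations per block, while a radius m=2
# Hamming Ball scores only 1 + 10 + 45 = 56.
saving = ball_cardinality(2, 10, 10) / ball_cardinality(2, 10, 2)
```

Setting m equal to the block length recovers full enumeration, which is how the block Gibbs Sampler arises as the extreme case.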
A simple generalization of the algorithm is obtained by allowing block-wise varying maximal Hamming distances. If we assume a varying radius for the individual Hamming Balls, then the conditional distribution over becomes uniform on the generalized Hamming Ball where denotes the set of maximal distances for each subset of variables. Furthermore, we could allow , at each iteration, to be randomly drawn from a distribution (see SI B.2: Randomness over the Hamming Ball radius for further discussion).
Alternative auxiliary conditional distributions are also possible. For instance, a more general auxiliary distribution can have the form , with , which for is non-uniform and places more probability mass towards the center (see SI B.3: Non-uniform auxiliary Hamming Ball distributions for further discussion).
We now illustrate the utility of our Hamming Ball sampling scheme through its application to three statistical models that involve high-dimensional discrete state spaces. We motivate the selection of each model through real scientific problems.
2.1 Tumor deconvolution through mixture modelling
Tumor samples are genetically heterogeneous and typically contain an unknown number of distinct cell sub-populations. Current DNA sequencing technologies ordinarily produce data that come from an aggregation of these sub-populations; thus, in order to gain insight into the latent genetic architecture, statistical modelling must be applied to deconvolve the data and identify the constituent cell populations and their mutation profiles.
In order to tackle this problem,  and  adopt a statistical framework in which the set of unobserved mutation profiles can be described as a binary matrix , where denotes absence/presence of a mutation in the -th cell type for the -th mutation, and parameters denote the proportion of the tumor attributed to each of the cell types. The observed sequence data depends on through the mutant allele frequencies, and we would like to simulate from the posterior distribution and explore the possible configurations of the mutation matrix that are most compatible with the observed data. (In  and , statistical inference was conducted using deterministic approaches, not Monte Carlo sampling, with large numbers of random initializations used to overcome local modes in the posterior. As our interest is in full posterior characterization, we do not compare with the point estimates given by these methods.)
We compared three posterior sampling approaches: (i) a conventional block Gibbs sampling strategy that proceeds by conditionally updating one column at a time with the remaining columns and the weights fixed; (ii) a Hamming Ball-based sampling scheme where we define the Hamming Ball as the set of all matrices such that each column is at most bits different from the corresponding column of the auxiliary matrix; and (iii) a fully marginalized sampling strategy where was marginalized through exhaustive summation over all column configurations (note that this corresponds to the Hamming Ball Sampler with ). Our data example was chosen to be sufficiently small that the fully marginalized sampler was practical. We refer to Materials and Methods for detailed derivations and implementations of the following data examples.
We considered a simulated data example, illustrated in Figure 2A, where the observed sequence data are generated so that they can be equally explained by two different latent genetic architectures. This is an interesting example because one configuration corresponds to a linear phylogenetic relationship between cell types and the other to a branched phylogeny, and these represent fundamentally different evolutionary pathways. For this example, we would expect an efficient sampler to identify both configurations and to move freely between the two during the simulation, revealing the existence of two alternative physical explanations for the observed data.
Figures 2B and C display trace plots of the largest component weight and the relative computational times for the three sampling schemes (see also SI: Figs. 2-3). All Hamming Ball samplers were effective at identifying both modes, but the efficiency of the mode switching depends on the Hamming Ball radius . This effectiveness can be attributed to the fact that the Hamming Ball schemes can jointly propose to change up to bits across all columns of the current . Furthermore, conditional updates of can be made by marginalizing over a range of mutation matrices. For , the efficiency of the Hamming Ball Sampler is therefore close to that of the fully marginalized sampling procedure () but more computationally tractable if the number of mutations is large and exhaustive enumeration is impractical. The conditional updates employed by the block Gibbs Sampler require significantly less computational effort, but the approach is prone to becoming trapped in a single posterior mode, and our simulations show that it failed to identify the mode corresponding to the linear phylogeny structure (). In real applications this could lead to incorrect scientific conclusions; we illustrate these potential impacts on a real cancer data set in SI: C.2. Tumor data analysis.
2.2 Sparse linear regression analysis
In this section we consider variable selection problems with sparse linear regression models. In this case, the high-dimensional discrete-valued object is a binary vector whose entries are 1 when the corresponding covariate in the design matrix is associated with the response (and zero otherwise) and consists of the regression and noise parameters (which can be analytically marginalized out in our chosen set-up, see Materials and Methods). The interest lies in the posterior distribution which would inform us as to which covariates are most important for defining the observed response variable. Typically is assumed to be sparse so that only a few covariates contribute to the explanation of the observations. These sparse linear regression models can arise in problems such as expression quantitative trait loci (eQTL) analysis which is concerned with the association of genotype measurements, typically single nucleotide polymorphisms (SNPs), with phenotype observations (see e.g.  for a review).
We began by simulating a regression dataset with responses and covariates in which there were two relevant covariates that fully explain the data while the remainder were noisy, redundant inputs. These two covariates were chosen to be perfect confounders so that only one of them is needed to explain the observed responses. As a consequence, this creates a challenging model exploration problem, as only two out of the possible models represent the truth. We applied a range of MCMC sampling schemes to sample from the posterior distribution over this massive space of possible models, including three block Gibbs Samplers (BG1-3), which conditionally update blocks of elements of of size respectively, and three block Hamming Ball Samplers (HB1-3), which use blocks of size and consider Hamming Ball radii of respectively within each block. Note that the block Gibbs Samplers are special cases of the block Hamming Ball Sampler where .
Figure 3 compares the relative performance of the various sampling schemes. The trace plots show the running marginal posterior inclusion probabilities of the two relevant variables, which converge to the expected value of 0.5 with the Hamming Ball Samplers but not with the block Gibbs Samplers. This indicates that the Hamming Ball schemes were mixing well: they identified the two relevant variables and frequently switched between their inclusion. In contrast, the block Gibbs Samplers exhibited strong correlation effects (stickiness) that impaired their efficiency.
For such a high-dimensional problem, the performance of the simplest Hamming Ball Sampler (HB1) was particularly outstanding, as it used the least CPU time and achieved a lower integrated autocorrelation time than BG1 and BG2. This performance can be explained by the fact that the Hamming Ball sampling schemes can handle a large block of variables at a time without requiring exhaustive enumeration of all possible latent variable combinations within each block. This provides an important computational saving for sparse problems, where most combinations will have low probability, and explains why the HB1 sampler was particularly effective for this example. In SI: D.3. Real eQTL analysis we demonstrate the utility of this methodology for a real eQTL example involving 10,000 covariates.
2.3 Factorial hidden Markov models
We finally address the application of the Hamming Ball sampling scheme to the factorial hidden Markov model (FHMM) . The FHMM is an extension of the hidden Markov model (HMM)  in which multiple independent hidden chains run in parallel and cooperatively generate the observed data. The latent matrix in this case is a -valued discrete matrix whose rows correspond to hidden Markov chains of length . Posterior inference in FHMMs is extremely challenging since it concerns the computation of , which comprises a fully dependent distribution in the space of the possible configurations. This is an extraordinarily large space for even small values of and .
Applications of FHMMs therefore frequently rely upon variational approximations or block Gibbs sampling, which alternates between sampling a small set of rows of conditional on the remaining rows. These Gibbs sampling schemes can easily become trapped in local modes due to the conditioning structure, which means that major structural changes to are unlikely to be proposed. Joint posterior updates can be achieved by applying the forward-filtering-backward-sampling (FF-BS) algorithm , with exhaustive enumeration, to simulate a sample from the posterior in time. However, although the use of FF-BS is quite feasible for even very large HMMs, it is only practical for very small values of and in FHMMs.
We consider an additive FHMM with binary hidden chains which models the observation at time according to where is a parameter vector that describes the contribution of the -th feature when generating (given that ) and is Gaussian noise, i.e. ; see Materials and Methods. Such a model is useful for applications such as energy disaggregation, where the observed total electricity power at a time instant is the sum of the individual powers of all devices that are “on” at that time. We generated a simulated sequence of length with hidden chains (simulation details can be found in SI: E Factorial hidden Markov models). For all sampling schemes, we treat the feature contributions as known parameters and conduct inference on , which models the presence/absence of these features, together with the noise variance parameter . We applied three block Gibbs Samplers (BG1-3) and three Hamming Ball-based sampling schemes (HB1-3). As in the tumor deconvolution example, the Hamming Ball is defined as the set of all matrices such that each column is at most bits different from the corresponding column of the auxiliary matrix . We then applied FF-BS within the Hamming Ball to sample from the constrained posterior distribution .
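The generative side of this model is straightforward to sketch. The following Python snippet is our own illustrative simulator (hypothetical parameter values, sticky transition probability and function name; not the paper's simulation set-up) for an additive FHMM with binary chains:

```python
import random

def simulate_additive_fhmm(W, T, p_stay=0.95, noise_sd=0.1, seed=0):
    """Simulate an additive FHMM: K independent 'sticky' binary chains,
    where y_t is the sum of the weights W[k] of the chains that are on
    at time t, plus Gaussian noise."""
    rng = random.Random(seed)
    K = len(W)
    # Initial states drawn uniformly; each chain then stays in its state
    # with probability p_stay and flips otherwise.
    X = [[1 if rng.random() < 0.5 else 0] for _ in range(K)]
    for t in range(1, T):
        for k in range(K):
            prev = X[k][-1]
            X[k].append(prev if rng.random() < p_stay else 1 - prev)
    y = [sum(W[k] * X[k][t] for k in range(K)) + rng.gauss(0.0, noise_sd)
         for t in range(T)]
    return X, y

# Example: three devices with distinct (illustrative) power signatures.
X, y = simulate_additive_fhmm([1.0, 2.5, 4.0], T=200)
```

Because the chains contribute additively, distinct on/off combinations can produce near-identical totals, which is exactly what makes the posterior over multimodal and hard for conditional samplers.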
Figure 4 shows the utility of the different sampling schemes applied to the FHMM. Clearly, the Hamming Ball Samplers, and particularly the schemes HB2-3, are able to escape from local modes of the posterior distribution and sample values for that have much higher posterior probability than the values sampled by the block Gibbs Samplers. The configurations of sampled by the HB2-3 schemes are also close to the true that generated the data. Further, the latter algorithms were able to correctly infer the level of the noise variance that generated the data, while the block Gibbs Samplers and HB1 inferred larger noise variances, wrongly explaining some true signal variation as noise. Finally, we can conclude that HB2 is the scheme that best balances computational time and sampling efficiency since, while its CPU time is similar to that of the most expensive block Gibbs Samplers (BG2-3), it allocates computational resources very differently from the BG algorithms, resulting in significantly improved sampling efficiency. In SI: E.3. Energy disaggregation we demonstrate the utility of Hamming Ball sampling on a real energy disaggregation data sequence of length .
The Hamming Ball Sampler provides a generic sampling scheme for statistical models involving high-dimensional discrete latent state spaces that generalizes and extends conventional block Gibbs sampling approaches. In our investigations, we have applied the Hamming Ball sampling scheme to three different statistical models and shown benefits over standard Gibbs samplers. Importantly, the Hamming Ball Sampler gives the statistical investigator control over the balance between statistical efficiency and computational tractability through an intuitive mechanism, namely the specification of the Hamming Ball radius and the block design strategy, which is important for Big Data applications. Yet, we have also demonstrated that in many problems, basic Hamming Ball samplers () that are computationally inexpensive can still give relatively good performance compared to standard block Gibbs sampling alternatives.
Throughout, we have presented basic, unrefined Hamming Ball Sampler implementations. In actual applications, the proposed methodology should not be seen as a single universal method for speeding up MCMC but as a novel addition to the toolbox that is currently available to us. For example, the computations performed within each Hamming Ball update are often trivially parallelizable, which would allow the user to take advantage of any special hardware for parallel computation, such as graphics processing units [9, 10]. In addition, Hamming Ball sampling updates can also be used alongside standard Gibbs sampling updates, as well as within parallel tempering schemes in Evolutionary Monte Carlo algorithms .
Finally, we believe the ideas presented here can have applications in many areas not yet explored, such as Bayesian nonparametrics (e.g. in the Indian Buffet Process) and Markov Random Fields. Further investigations are being conducted to develop the methodology for these statistical models.
4 Materials and Methods
4.1 Tumor deconvolution with mixture modelling
Here we describe the statistical model underlying the tumor deconvolution example. We assume that the data consist of pairs of read counts, where corresponds to the number of sequence reads supporting the variant allele at the -th locus and is the total number of sequence reads covering the mutation site. The distribution of the variant allele reads given the total read count follows a Binomial distribution, where the variant allele frequency is given by and is a sequence read error rate and . The parameter is a vector denoting the proportion of the observed data sample attributed to each of the tumor sub-populations, whose genotypes are given by a binary matrix . We specify the prior probabilities over as and . This hierarchical representation is equivalent to a marginal prior distribution which induces sparsity, forcing values of to be close to zero and allowing automatic model selection for the number of tumor sub-populations. We further specify the prior probabilities over as and . Further details of posterior inference for this model and data simulation are given in SI: C Tumor deconvolution with mixture modelling.
4.2 Sparse linear regression analysis
Here, we provide a description of the sparse linear regression model used in the examples. Suppose we have a dataset where is the observed response and is the vector of the corresponding covariates. We can collectively store all responses in a vector (assumed to be normalized to have zero mean) and the covariates in a design matrix . We further assume that of the total covariates there exists a small unknown subset of relevant covariates that generate the response. This is represented by an -dimensional unobserved binary vector that indicates the relevant covariates and follows an independent Bernoulli prior distribution, where is assigned a conjugate Beta prior, , and are hyperparameters. Given , a Gaussian linear regression model takes the form where is the design matrix, with , having columns corresponding to , and is the respective vector of regression coefficients. The regression coefficients and the noise variance are assigned a conjugate normal-inverse-gamma prior where are hyperparameters. Notice that the particular choice for the covariance matrix , where is a scalar hyperparameter, corresponds to the so-called -prior . Based on the above form of the prior distributions we can analytically marginalize out the parameters and obtain the marginalized joint density : where , and denotes the Gamma function. The hyperparameters of the prior were set to fixed values as follows. The hyperparameters of were set to and , which leads to a vague prior. The scalar hyperparameter for the -prior was chosen as in . Finally, the hyperparameters of the Beta prior, , were set to the values and , which favor sparse configurations for the vector . Further details of inference for this model and data simulation are given in SI: D Sparse linear regression analysis.
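As a hedged illustration of this marginalization, the following Python sketch evaluates a g-prior log marginal likelihood up to an additive constant. It assumes the vague limit of the inverse-gamma prior and is restricted to models with at most one covariate so that the linear algebra stays scalar; the function name, the value of g and the simulated data are all our own illustrative choices, not the paper's set-up:

```python
import math
import random

def g_prior_log_marginal(y, x=None, g=100.0):
    """Log marginal likelihood, up to an additive constant, of a linear
    model under Zellner's g-prior with a vague noise prior, for models
    with at most one covariate (x=None gives the null model)."""
    n = len(y)
    yty = sum(v * v for v in y)
    if x is None:
        q, fit = 0, 0.0
    else:
        xty = sum(a * b for a, b in zip(x, y))
        xtx = sum(a * a for a in x)
        q, fit = 1, xty * xty / xtx   # y'X(X'X)^{-1}X'y for one column
    # Shrunken residual sum of squares under the g-prior.
    rss = yty - (g / (1.0 + g)) * fit
    return -0.5 * q * math.log(1.0 + g) - 0.5 * n * math.log(rss)

# Illustrative data: one truly relevant covariate, one pure-noise covariate.
rng = random.Random(1)
x_true = [rng.gauss(0, 1) for _ in range(40)]
x_junk = [rng.gauss(0, 1) for _ in range(40)]
y = [2.0 * v + rng.gauss(0, 0.1) for v in x_true]
y_bar = sum(y) / len(y)
y = [v - y_bar for v in y]        # center the response, as in the model set-up
```

Scoring models this way is exactly the quantity a (block) Hamming Ball update must evaluate for every candidate configuration inside the ball.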
4.3 Factorial hidden Markov models
In a typical setting of modelling with FHMMs, the observed sequence is generated through binary hidden sequences represented by a binary matrix . The interpretation of this binary matrix is that each row encodes the presence or absence of a single feature across the observed sequence, while each column represents the different features that are active when generating the observation . Different rows of correspond to independent Markov chains following and , where the initial state is drawn from a Bernoulli distribution with parameter . Each data point is generated conditional on through a likelihood model parametrized by . For the additive FHMM this likelihood model takes the form where are the parameters. The whole set of model parameters determines the joint probability density over , which is written as The Hamming Ball algorithm follows precisely the iteration in (6)-(7). Step (7), when implemented as two separate Gibbs steps, requires sampling from the posterior conditional expressed as where we used that . Given that each is restricted to take values in the neighborhood of , exact sampling from the above distribution can be done using the FF-BS algorithm in time. If we implement step (7) using the M-H joint update, then we need to evaluate the normalizing constant of the distribution, which is obtained by the forward pass of the FF-BS algorithm in time. Further details of posterior inference for this model and data simulation are given in SI: E Factorial hidden Markov models.
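The FF-BS recursion invoked here can be sketched for a single binary chain. This Python version is illustrative (it runs over the full two-state space rather than a Hamming Ball-restricted one, and all names are our own): the forward pass computes normalized filtered probabilities, and the backward pass samples a state path from the exact conditionals:

```python
import math
import random

def ffbs_binary_chain(y, trans, emis_loglik, seed=0):
    """Forward-filtering backward-sampling for one binary hidden chain.
    trans[a][b] = p(x_t = b | x_{t-1} = a);
    emis_loglik(t, s) = log p(y_t | x_t = s)."""
    rng = random.Random(seed)
    T = len(y)
    # Forward pass: alpha[t][s] = p(x_t = s | y_{1:t}) (normalized each step;
    # the discarded normalizers multiply to the marginal likelihood).
    a0 = [0.5 * math.exp(emis_loglik(0, s)) for s in (0, 1)]
    z = sum(a0)
    alpha = [[v / z for v in a0]]
    for t in range(1, T):
        a = [math.exp(emis_loglik(t, s))
             * sum(alpha[t - 1][r] * trans[r][s] for r in (0, 1))
             for s in (0, 1)]
        z = sum(a)
        alpha.append([v / z for v in a])
    # Backward pass: sample x_T from the filter, then each x_t given x_{t+1}.
    x = [0] * T
    x[T - 1] = 1 if rng.random() < alpha[T - 1][1] else 0
    for t in range(T - 2, -1, -1):
        w1 = alpha[t][1] * trans[1][x[t + 1]]
        w0 = alpha[t][0] * trans[0][x[t + 1]]
        x[t] = 1 if rng.random() < w1 / (w0 + w1) else 0
    return x, alpha

# Illustrative check: near-deterministic emissions recover the intended path.
target = [1, 0, 0, 1, 1, 0]
x, alpha = ffbs_binary_chain(
    [0.0] * 6,
    [[0.5, 0.5], [0.5, 0.5]],
    lambda t, s: 0.0 if s == target[t] else -50.0,
)
```

In the Hamming Ball setting the same recursion runs with each column's state space cut down to the ball around the corresponding auxiliary column, which is what reduces the per-sweep cost.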
C.Y. is supported by a UK Medical Research Council New Investigator Research Grant (Ref. No. MR/L001411/1), the Wellcome Trust Core Award Grant Number 090532/Z/09/Z and the John Fell Oxford University Press (OUP) Research Fund. MKT acknowledges support from “Research Funding at AUEB for Excellence and Extroversion, Action 1: 2012-2014".
-  Robert H Swendsen and Jian-Sheng Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58(2):86–88, 1987.
-  Christian Schäfer and Nicolas Chopin. Sequential Monte Carlo on large binary sampling spaces. Statistics and Computing, 23(2):163–184, 2013.
-  Habil Zare, Junfeng Wang, Alex Hu, Kris Weber, Josh Smith, Debbie Nickerson, ChaoZhong Song, Daniela Witten, C Anthony Blau, and William Stafford Noble. Inferring clonal composition from multiple sections of a breast cancer. PLoS Computational Biology, 10(7):e1003703, 2014.
-  Yanxun Xu, Peter Müller, Yuan Yuan, Kamalakar Gulukota, and Yuan Ji. MAD Bayes for Tumor Heterogeneity–Feature Allocation with Exponential Family Sampling. Journal of the American Statistical Association, 2015.
-  Robert B O’Hara, Mikko J Sillanpää, et al. A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85–117, 2009.
-  Zoubin Ghahramani and Michael I. Jordan. Factorial Hidden Markov Models. Mach. Learn., 29(2-3):245–273, November 1997.
-  Lawrence Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
-  Steven L. Scott. Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. Journal of the American Statistical Association, 97:337–351, 2002.
-  Marc A Suchard, Quanli Wang, Cliburn Chan, Jacob Frelinger, Andrew Cron, and Mike West. Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures. Journal of Computational and Graphical Statistics, 19(2):419–438, 2010.
-  Anthony Lee, Christopher Yau, Michael B Giles, Arnaud Doucet, and Christopher C Holmes. On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Journal of Computational and Graphical Statistics, 19(4):769–789, 2010.
-  Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo. CRC press, 2011.
-  Arnold Zellner. On assessing prior distributions and Bayesian regression analysis with -prior distributions. In Bayesian Inference and Decision Techniques-Essays in Honour of Bruno de Finetti, pages 233–243. 1986.
-  Leonard Bottolo and Sylvia Richardson. Evolutionary stochastic search for Bayesian model exploration. Bayesian Analysis, 5(3):583–618, 09 2010.