We present a nonparametric prior over reversible Markov chains. We use completely random measures, specifically gamma processes, to construct a countably infinite graph with weighted edges. By enforcing symmetry to make the edges undirected we define a prior over random walks on graphs that results in a reversible Markov chain. The resulting prior over infinite transition matrices is closely related to the hierarchical Dirichlet process but enforces reversibility. A reinforcement scheme has recently been proposed with similar properties, but the de Finetti measure is not well characterised. We take the alternative approach of explicitly constructing the mixing measure, which allows more straightforward and efficient inference at the cost of no longer having a closed-form predictive distribution. We use our process to construct a reversible infinite HMM which we apply to two real datasets, one from epigenomics and one from ion channel recording.
Consider a sequence of states sampled from a reversible Markov chain. A Markov chain is said to be reversible if the probability of the chain is the same observed either forwards or backwards in time. Reversibility is a realistic assumption in various settings. For instance, reversible Markov chains are appropriate for modelling time-reversible dynamics in physical systems, such as the conformational transitions of a macromolecule at fixed temperature or chemical dynamics in protein folding. In these settings, the system transitions between hidden states over time, emitting a sequence of observations. Our aim is to recover the process and the hidden state sequence that gave rise to the observed data. To do this we define a prior over reversible Markov chains.
There is a close connection between reversible Markov chains and random walks on graphs. More specifically, a random walk on a weighted undirected graph produces a reversible Markov chain. In a random walk on a graph, the traveller jumps to the next node (state) with probability proportional to the corresponding edge weight. The aim now is to put a prior over the unknown transition matrix (analogously, the weights) that guides the walk. Much research has gone into connecting random walks on graphs to reversible Markov chains, with seminal works by Diaconis and Freedman (1980) and Diaconis and Coppersmith (1986). The latter assumes an edge reinforced random walk (ERRW) where the edge weight is increased by one each time an edge is crossed. The process is defined for a finite state space and, in the limit, gives weights that are distributed according to an explicitly characterised mixing measure, which acts as a conjugate prior for the reinforcement process. In the more recent work of Bacallado et al. (2013), the authors define a three-parameter random walk with reinforcement which generalizes the linearly edge reinforced random walk to countably infinite spaces. However, a closed form for the prior (mixing measure) is lacking and inference in this model is challenging. In this work, we assume a countably infinite state space and take the alternative approach of explicitly constructing the prior over the transition matrix. Inference can then be done using relatively straightforward Markov chain Monte Carlo methods. We use the resulting reversible Markov chain as the hidden sequence in a hidden Markov model, whose utility we validate on two real-world datasets.
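The claim that a random walk on a weighted undirected graph yields a reversible Markov chain can be checked directly on a small example. The sketch below uses an arbitrary symmetric weight matrix; the stationary distribution is proportional to the node degrees and satisfies detailed balance.

```python
import numpy as np

# Small numerical illustration: a random walk on a weighted undirected
# graph (symmetric weight matrix W) is a reversible Markov chain, with
# stationary distribution proportional to the degrees d_i = sum_j W_ij.
# The weights below are arbitrary illustrative values.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])

d = W.sum(axis=1)          # node degrees
P = W / d[:, None]         # transition matrix: P_ij = W_ij / d_i
pi = d / d.sum()           # candidate stationary distribution

# Detailed balance: pi_i * P_ij == pi_j * P_ji for all i, j.
flow = pi[:, None] * P
assert np.allclose(flow, flow.T)

# Consequently pi is invariant: pi P = pi.
assert np.allclose(pi @ P, pi)
```

The detailed-balance check works because `pi_i * P_ij = W_ij / sum(W)`, which is symmetric exactly when `W` is.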
The paper is organized as follows. In Section 2, we briefly provide some background on the Gamma process which is central to our model definition. In Section 3, we describe the process proposed in this manuscript. We discuss its theoretical properties in Section 4 and provide a de Finetti representation for the process in Section 5. The finite version of the model and its HMM extension appear in Sections 6 and 7 and inference, performed via a Gibbs sampler, is described in Section 8. In Section 9 we study our model’s performance on real datasets and in Section 10 we conclude our work and provide a short discussion about future directions.
2 The Gamma Process
To facilitate understanding, we briefly review the gamma process over a space $\Theta$. A realization $G$ is a positive measure on the space $\Theta$, which can be represented as a countable weighted sum of atoms. Each atom at $\theta_k \in \Theta$ has a corresponding weight $w_k > 0$. The atoms and weights are distributed according to a Poisson process over the product space $[0, \infty) \times \Theta$ with intensity measure
$$\nu(dw, d\theta) = \alpha w^{-1} e^{-\alpha w}\, dw\, H(d\theta), \qquad (1)$$
where $H$ is the base measure and $\alpha$ is the concentration parameter. $\nu$ is known as the Lévy measure of the gamma process, and because of this representation the gamma process is a Lévy process. In this paper, we assume that the base measure is the Lebesgue measure. A sample from this Poisson process will yield a countably infinite collection of atoms, since $\nu([0, \infty) \times \Theta) = \infty$. We assume $H$ is diffuse (non-atomic) and so we can write
$$G = \sum_{k=1}^{\infty} w_k \delta_{\theta_k}. \qquad (2)$$
Intuitively, $G(A)$ sums up the weights $w_k$ of the atoms with $\theta_k \in A$. It can be shown that the distribution of $G(A)$, where $A \subseteq \Theta$ is measurable, is $\mathrm{Gamma}(\alpha H(A), \alpha)$, hence the name of the process. The gamma process is a completely random measure (Kingman, 1967) and as such, for any disjoint and measurable partition $A_1, \ldots, A_n$ of $\Theta$, the random variables $G(A_1), \ldots, G(A_n)$ are mutually independent gamma variables.
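The complete-randomness and additivity properties can be illustrated by simulation. The sketch below assumes the parametrisation $G(A) \sim \mathrm{Gamma}(\alpha H(A), \alpha)$ with $H$ the Lebesgue measure on $[0, 1]$, so that $\mathbb{E}[G(A)] = H(A)$ and $\mathrm{Var}[G(A)] = H(A)/\alpha$.

```python
import numpy as np

# Additivity of the gamma process under G(A) ~ Gamma(alpha*H(A), alpha),
# with H the Lebesgue measure on [0, 1]: masses on disjoint cells are
# independent gammas and their sum matches a direct draw of G([0, 1]).
rng = np.random.default_rng(0)
alpha = 2.0
n = 200_000

# Independent masses on a disjoint partition of [0, 1].
g_left = rng.gamma(shape=alpha * 0.5, scale=1.0 / alpha, size=n)
g_right = rng.gamma(shape=alpha * 0.5, scale=1.0 / alpha, size=n)
g_total = g_left + g_right   # G([0, 1]) = G([0, 0.5)) + G([0.5, 1])

# Direct draws of G([0, 1]) for comparison.
g_direct = rng.gamma(shape=alpha * 1.0, scale=1.0 / alpha, size=n)

print(g_total.mean(), g_direct.mean())   # both near H([0,1]) = 1
print(g_total.var(), g_direct.var())     # both near H([0,1])/alpha = 0.5
```

Because gamma variables with a common rate add their shapes, the two histograms agree in distribution, not just in moments.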
3 Model Description
Given a measurable space $(\Theta, \mathcal{F})$, with $\Theta$ a set and $\mathcal{F}$ a $\sigma$-algebra of subsets of $\Theta$, our aim is to construct a model, a sample from which will give rise to a reversible Markov chain of states $s_1, s_2, \ldots$. At each time point $t$ the chain is at a state denoted by $s_t$. We require that the set of states is countable. We construct the prior by deploying the gamma process in a hierarchical fashion; we use a gamma process to sample the states and, given these states, we construct the transition matrix by sampling from another gamma process.
More carefully, let $G$ be a gamma process on $\Theta$, with concentration parameter $\alpha$ and base measure $H$, as given by Equation 1. A sample from this process corresponds to the set of atoms $\{\theta_k\}_{k=1}^{\infty}$ and their associated weights $\{w_k\}_{k=1}^{\infty}$ as in Equation 2. Note that by construction the cardinality of the set of atoms is countably infinite and there is a one-to-one mapping of the atoms in $G$ to the set of natural numbers $\mathbb{N}$. We define a new gamma process $W$ on the product space $\Theta \times \Theta$, with concentration parameter $\gamma$ and atomic base measure
$$H' = G \times G, \quad \text{i.e.} \quad H'(\{(\theta_i, \theta_j)\}) = w_i w_j, \qquad (3)$$
where $\{\theta_k\}_{k=1}^{\infty}$ is the support of $G$. The base measure $H'$ is atomic and as such assigns non-zero mass to atoms on the product space $\Theta \times \Theta$. Since $H'$ is discrete a.s., $W$ will also be discrete, so we can write
$$W = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} J_{ij}\, \delta_{(\theta_i, \theta_j)}, \qquad (4)$$
where, from the definition of the gamma process with fixed points of discontinuity $\{(\theta_i, \theta_j)\}$, we have
$$J_{ij} \sim \mathrm{Gamma}(\gamma w_i w_j,\ \gamma), \qquad (5)$$
where $\gamma w_i w_j$ is the shape and $\gamma$ the rate of the gamma distribution. To avoid notation overload, we also use $J$ to represent the weight matrix $(J_{ij})$, which, when normalised per row, gives the transition matrix $P$ such that $P_{ij}$ is the probability of transitioning to state $j$ given that the chain is in state $i$ currently. The transition matrix is stochastic, that is, its entries are all non-negative and $\sum_j P_{ij} = 1$ for all $i$. By the additive property of the gamma process, each row of the weight matrix is still a sample from a gamma process on the restricted space $\{\theta_i\} \times \Theta$ with base measure $w_i G$, so
$$J_{i\cdot} = \sum_{j=1}^{\infty} J_{ij}\, \delta_{\theta_j} \sim \Gamma\mathrm{P}(\gamma,\ w_i G). \qquad (6)$$
To generate the sequence $s_1, s_2, \ldots$, we draw an initial state $s_1 \sim \bar{G}$, where $\bar{G}$ is the normalised random measure derived from $G$, i.e. $\bar{G} = G / G(\Theta)$, and sample the transitions as follows:
$$P(s_{t+1} = j \mid s_t = i) = \frac{J_{ij}}{\sum_k J_{ik}}. \qquad (7)$$
The process can be thought of as a weighted random walk on a graph, with vertex set $V = \{\theta_k\}_{k=1}^{\infty}$ and edge set $E = V \times V$. $J$ is a function that assigns a non-negative weight to each edge in the graph.
The above random walk has not yet yielded a reversible Markov chain. To achieve reversibility we modify (5) so that $J$ is symmetric, i.e.
$$J_{ij} = J_{ji} \quad \text{for all } i, j. \qquad (8)$$
In practice, only the entries $J_{ij}$ with $i \leq j$ are sampled from (5) and the remainder are set by symmetry.
Note that the function $J$ is now symmetric and the new $W$ defined using the symmetric $J$ is no longer a draw from a completely random measure, because of the dependency induced by this symmetry. However, each row is still a draw from a completely random measure. The resulting transition matrix $P$ is a sample from the prior, the construction of which was just described.
We note here that the choice of the shape value for each weight $J_{ij}$ need not be restricted to the product $w_i w_j$ of the corresponding base weights. Depending on the dataset at hand, the choice might vary. We call the proposed model the Symmetric Hierarchical Gamma Process and use the acronym SHGP. A graphical representation of the model is presented in Figure 1(a).
Relation to Hierarchical Dirichlet process
The construction of the proposed SHGP prior closely relates to the hierarchical Dirichlet process (Teh et al., 2006). Both processes use random measures in a hierarchical way: the HDP uses the Dirichlet process, while the SHGP uses the gamma process, as seen in Table 1, where $J_{i\cdot}$ refers to the $i$-th row of the weight matrix. Moreover, when used for infinite hidden Markov models (Beal et al., 2003), both processes put a prior on the transition matrix, but in a different fashion; the HDP directly defines a prior over the transition matrix, while the SHGP puts a prior on the weight matrix $J$, the per-row normalisation of which gives the transition matrix. As such, the SHGP allows for a direct treatment of the weights: the symmetry in Equation 8 is imposed and reversibility arises.
Relation to Hierarchical Gamma process
Also of interest is the relation of the SHGP to the simpler hierarchical gamma process (HGP), also shown in Table 1. Both processes use the gamma process in a hierarchical way. The HGP does not assume symmetry in the weights, but this could easily be imposed. However, the random variables $J_{ij}$ and $J_{ji}$ are sampled from gamma distributions with different shape parameters. More specifically, in the HGP each row is a draw from a gamma process with base measure $G$, so that
$$J_{ij} \sim \mathrm{Gamma}(\gamma w_j,\ \gamma) \quad \text{while} \quad J_{ji} \sim \mathrm{Gamma}(\gamma w_i,\ \gamma).$$
As seen in Equation 3, in the SHGP the base weights of both nodes $i$ and $j$ contribute to the edge weight $J_{ij}$, as opposed to the HGP where only one of the base weights influences the shape. As already stated earlier in this section, this is a modelling choice that depends on whether or not a contribution from both nodes is desired. More details about the relation amongst the SHGP, HGP and HDP can be found in the supplementary material.
4 Theoretical Properties
In this section, we describe important theoretical properties of the Markov chain induced by a sample from the SHGP. Since the induced Markov chain lives on a countably infinite space, the standard theory of Markov chains on countable state spaces applies.
In order to prove that the induced Markov chain is reversible, it is sufficient to prove that detailed balance holds, that is,
$$\pi(i) P_{ij} = \pi(j) P_{ji} \quad \text{for all } i, j,$$
where $\pi$ is the probability distribution defined by $\pi(i) = \sum_j J_{ij} / \sum_{k,l} J_{kl}$ and $P$ is the stochastic transition matrix induced by the rows of the symmetrised $J$. Writing $r_i = \sum_j J_{ij}$ and $Z = \sum_k r_k$, we have
$$\pi(i) P_{ij} = \frac{r_i}{Z} \cdot \frac{J_{ij}}{r_i} = \frac{J_{ij}}{Z} = \frac{J_{ji}}{Z} = \pi(j) P_{ji},$$
as a result of (8). As a straightforward corollary, $\pi$ is the invariant measure of the chain.
Is the normalization constant per row finite?
When defining the transition matrix $P$, it is crucial that the sum of each row of the weight matrix is almost surely (a.s.) finite, since this ensures that the normalization is a well-defined operation. In other words, we want to ensure that $\sum_j J_{ij} < \infty$ a.s. for every row $i$ of the weight matrix. To start with, the sum of the base weights converges a.s., that is,
$$\sum_{k=1}^{\infty} |w_k| < \infty \quad \text{a.s.}, \qquad (12)$$
if the well-known condition on the Lévy measure,
$$\int_{[0,\infty) \times \Theta} \min(1, |w|)\, \nu(dw, d\theta) < \infty,$$
holds. For a gamma process, where $\nu(dw, d\theta) = \alpha w^{-1} e^{-\alpha w}\, dw\, H(d\theta)$, it is easy to prove that the above condition holds and as such the sum in (12) converges. Since the weights are defined over the space $[0, \infty)$, we ensure that $w_k \geq 0$ always. Consequently, we can drop the absolute value notation and simply write $\sum_k w_k < \infty$. Moreover, since the measure over the weights is continuous, $w_k > 0$ for all $k$ and $\sum_k w_k > 0$ a.s.
The sum of row $i$ of the weight matrix is $\sum_j J_{ij}$. Each element of this sum is a gamma distributed variable, $J_{ij} \sim \mathrm{Gamma}(\gamma w_i w_j, \gamma)$. Recall here that the variables $J_{ij}$ and $J_{ji}$ are sampled from the same gamma distribution. This, along with the property that the sum of independent gamma variables with the same rate parameter is a gamma variable with shape equal to the sum of the individual shapes and the same rate, gives, marginally,
$$\sum_j J_{ij} \sim \mathrm{Gamma}\Big(\gamma w_i \sum_j w_j,\ \gamma\Big).$$
Since we have ensured that $\sum_j w_j < \infty$ a.s., the sum $\sum_j J_{ij}$ is finite a.s., ensuring that the normalization of every row of the weight matrix is a well-defined operation.
Irreducibility and aperiodicity.
A Markov chain is irreducible if it is possible to get from any state to any other state in a finite number of steps with positive probability. In other words, when a Markov chain is irreducible, the sample path (the state sequence) cannot get trapped in smaller subsets of the state space. That is, for any two states $i, j$ there exists an integer $n \geq 1$ such that $P^n_{ij} > 0$. It is easy to see that the proposed generative process produces an irreducible Markov chain almost surely. Looking at the weight matrix, we see that the elements $J_{ij}$ are gamma distributed variables with support $(0, \infty)$, and thus are positive almost surely. This, along with the existence (and finiteness) of the normalization constant, shows that the probability of moving from one state to any other in one step is always positive and the chain is irreducible. Let $\mathcal{T}(i) = \{n \geq 1 : P^n_{ii} > 0\}$ be the set of times when it is possible for the chain to return to the starting state $i$. The period of state $i$ is defined to be the greatest common divisor of $\mathcal{T}(i)$. For an irreducible chain, the period of the chain is defined to be the period common to all its states. We note that the transition matrix is strictly positive and as a result the chain can reach any state in one step. This implies that all states have period 1 and the chain is aperiodic.
We showed that the generated Markov chain has an invariant probability distribution $\pi$. A state $i$ is positive recurrent if the time to return to state $i$, given that the chain started in state $i$, has finite first moment, that is, $\mathbb{E}[\tau_i \mid s_0 = i] < \infty$, where $\tau_i$ is the time (after time 0) until reaching state $i$. An irreducible Markov chain with transition matrix $P$ is positive recurrent if and only if there exists a probability distribution $\pi$ on the state space such that $\pi = \pi P$ [Theorem 21.12, Levin et al. (2006)]. As such, the generated Markov chain is positive recurrent. Irreducibility, aperiodicity and positive recurrence ensure that the invariant distribution is unique and, for all $i$,
$$\lim_{n \to \infty} \big\| P^n_{i\cdot} - \pi \big\|_{TV} = 0, \qquad (15)$$
where $\|\cdot\|_{TV}$ denotes the total variation distance between the two distributions. Equation (15) describes the convergence of the chain as $n \to \infty$ and states that every row of the $n$-step transition matrix converges to the stationary distribution. In other words, the invariant distribution is also the limiting distribution of the chain.
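The convergence statement in Equation (15) can be checked numerically on a small truncated example: with a symmetric weight matrix (arbitrary illustrative values below), every row of $P^n$ approaches $\pi$, the normalised row sums.

```python
import numpy as np

# Rows of P^n converge to the stationary distribution pi, which for a
# symmetric weight matrix J is proportional to the row sums of J.
J = np.array([[0.5, 2.0, 1.0],
              [2.0, 0.2, 3.0],
              [1.0, 3.0, 0.7]])
assert np.allclose(J, J.T)

r = J.sum(axis=1)
P = J / r[:, None]
pi = r / r.sum()

Pn = np.linalg.matrix_power(P, 100)
tv = 0.5 * np.abs(Pn - pi).sum(axis=1)   # total variation per row
assert np.all(tv < 1e-8)                 # every row has converged to pi
```

For this small, strictly positive chain the mixing is fast, so 100 steps are ample; the rate is governed by the second-largest eigenvalue modulus of `P`.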
5 de Finetti Representation
Diaconis and Freedman (1980) defined a type of exchangeability for Markov chains, known as Markov exchangeability, which is defined for sequences in a countable space as follows:
A process on a countable space is Markov exchangeable if the probability of observing a path $s_1, \ldots, s_n$ is only a function of the initial state $s_1$ and the transition counts $n_{ij} = |\{t : s_t = i,\ s_{t+1} = j\}|$ for all $i, j$.
In other words, a sequence is Markov exchangeable if the initial state and the transition counts are sufficient statistics. Intuitively, this means that two different state sequences are equiprobable under the joint distribution, if they begin with the same value and preserve the transition counts between unique values. They also proved the following
Theorem 1 (Diaconis and Freedman, 1980)
A process is Markov exchangeable and returns to every state visited infinitely often (recurrent), if and only if it is a mixture of recurrent Markov chains.
In the previous sections, we defined a prior over transition matrices using a hierarchy of gamma processes. We also proved that the induced Markov chains (given the transition matrix sampled from the prior) are recurrent. The use of the prior lets us write the state sequence as a mixture of recurrent Markov chains, and using Theorem 1 we can state that the sequence generated by the proposed process, defined on a countably infinite space, is Markov exchangeable and recurrent.
For some measure $\mu$ on $\mathbb{N} \times \mathcal{M}$, where $\mathcal{M}$ is the space of stochastic matrices on $\mathbb{N} \times \mathbb{N}$, the distribution of the sequence $s_1, s_2, \ldots$ can be represented as
$$p(s_1, \ldots, s_n) = \int_{\mathbb{N} \times \mathcal{M}} \mathbb{1}[s_1 = s] \prod_{t=1}^{n-1} P_{s_t s_{t+1}}\, \mu(ds, dP). \qquad (16)$$
Equation (16) shows the de Finetti representation of the proposed process. The de Finetti measure $\mu$ is the distribution over the product of the state space and the space of stochastic matrices $\mathcal{M}$.
6 Finite Model
The inference simplifies considerably if we consider the finite state model which gives the countably infinite state model in the limit. More carefully, we assume that we have a finite number of states $K$ and we prove that, as $K \to \infty$, the model converges in distribution to the countably infinite model.
The infinite divisibility property of the gamma process on $\Theta$ states that for each $K \in \mathbb{N}$ there exists a sequence of i.i.d. random variables $w_1^{(K)}, \ldots, w_K^{(K)}$ such that
$$G(\Theta) \stackrel{d}{=} \sum_{k=1}^{K} w_k^{(K)}, \qquad (17)$$
where $\stackrel{d}{=}$ denotes equality in distribution. Due to the additive property of the gamma distribution, for any finite, disjoint and measurable partition $A_1, \ldots, A_K$ of $\Theta$ such that $\bigcup_{k=1}^{K} A_k = \Theta$, the variable $G(\Theta)$ with law $\mathrm{Gamma}(\alpha H(\Theta), \alpha)$ can be written as the sum of gamma distributed variables $G(A_1), \ldots, G(A_K)$, each one following the law $\mathrm{Gamma}(\alpha H(A_k), \alpha)$. The additive property of $H$ ensures that $\sum_{k=1}^{K} H(A_k) = H(\Theta)$ and as such the shape parameter of the gamma distribution of $G(\Theta)$ will be equal to the sum of the individual shapes $\alpha H(A_k)$. As $K \to \infty$ we recover the infinite case and Equation (17) holds. For simplicity, we assume that $H(A_k) = H(\Theta)/K$ for all $k$.
By restricting the process to the finite case, we facilitate inference without compromising the properties of the model, since $K$ can always be chosen sufficiently large. Putting everything together, the generative process in the finite case is as follows:
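A minimal sketch of the finite generative process as we read it, under illustrative assumptions (here the base measure gives each of the $K$ cells unit mass, and all concentration values are arbitrary): base weights from a gamma prior, symmetric edge weights with shape proportional to the product of base weights, then a random walk on the row-normalised matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4          # truncation level (number of states)
alpha = 2.0    # concentration of the base gamma process
gamma_ = 1.0   # concentration of the edge-level gamma process
T = 20         # length of the generated chain

# Base weights: finite approximation of the gamma process, assuming a
# partition with H(A_k) = 1 per cell, so w_k ~ Gamma(alpha, rate=alpha).
w = rng.gamma(shape=alpha, scale=1.0 / alpha, size=K)

# Symmetric edge weights: sample the upper triangle with shape
# gamma_ * w_i * w_j and rate gamma_, then mirror it (Equation 8).
J = np.zeros((K, K))
for i in range(K):
    for j in range(i, K):
        J[i, j] = rng.gamma(shape=gamma_ * w[i] * w[j], scale=1.0 / gamma_)
        J[j, i] = J[i, j]

# Row-normalise to get the transition matrix and run the chain.
P = J / J.sum(axis=1, keepdims=True)
s = [rng.choice(K, p=w / w.sum())]       # initial state from normalised G
for _ in range(T - 1):
    s.append(rng.choice(K, p=P[s[-1]]))
```

Because only the upper triangle of `J` is sampled, the resulting chain satisfies detailed balance with respect to the normalised row sums of `J`.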
7 The SHGP Hidden Markov model
In typical sequential data analysis we are more interested in using a Markov chain as the hidden state sequence in a hidden Markov model (HMM), rather than viewing the states as observations themselves. This allows a broad range of data types to be modelled: the examples we demonstrate here include univariate continuous and multivariate count data. Thus we use the SHGP to construct a hidden Markov model. Consider a sequence of observations $y_1, \ldots, y_T$, which we assume to be independent conditioned on the latent state sequence $s_1, \ldots, s_T$. For simplicity consider the finite SHGP, and a parametric family of observation models $f(y \mid \phi)$; then
$$p(y_1, \ldots, y_T \mid s_1, \ldots, s_T, \Phi) = \prod_{t=1}^{T} f(y_t \mid \phi_{s_t}),$$
where $\Phi = \{\phi_k\}_{k=1}^{K}$ are the state emission parameters. In the case of multinomial outputs, $\phi_k$ is a probability vector, the concatenation of which is known as the emission matrix. The SHGP gives the prior over the hidden state sequence, as shown in Figure 1(b). We present multinomial, Poisson and Gaussian observation models, the details of which are given in the supplementary material.
8 Inference
As with many other Bayesian models, exact inference is intractable, so we employ Markov chain Monte Carlo (MCMC); by iterating the steps below we achieve posterior inference over the latent variables of the model, as seen in Figure 1(b). A detailed description of the sampling steps is provided in the supplementary material. The sampler iterates as follows:
Sampling the concentration parameters, $\alpha$ and $\gamma$.
We use slice sampling (Neal, 2003) to infer the parameters $\alpha$ and $\gamma$, using gamma priors $\mathrm{Gamma}(a_\alpha, b_\alpha)$ and $\mathrm{Gamma}(a_\gamma, b_\gamma)$, where $(a_\alpha, b_\alpha)$ and $(a_\gamma, b_\gamma)$ are the pairs of shape and rate parameters for $\alpha$ and $\gamma$ respectively.
Sampling the weight vector, $w$.
The vector $w = (w_1, \ldots, w_K)$ contains the base weights of the corresponding random measure $G$. We use slice sampling to sample each weight $w_k$.
Sampling the weight matrix, $J$.
The weight matrix contains the edge weights $J_{ij}$. We use hybrid (Hamiltonian) Monte Carlo (Neal, 2011) to sample all the elements of the matrix at once, instead of sampling each element at a time using slice sampling. In the reversible case, only the entries $J_{ij}$ with $i \leq j$ are sampled, because of the symmetry in $J$. We also show results using NUTS (Hoffman and Gelman, 2011), although our results suggest this gives similar performance to HMC in this setting.
Sampling the state sequence, $s_{1:T}$.
We use the forward-backward algorithm (Scott, 2002) to sample the latent state sequence given the current values of all other variables in the model. This dynamic programming algorithm efficiently computes the posteriors over all the hidden state variables $s_1, \ldots, s_T$.
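A compact sketch of forward-filtering backward-sampling for a finite-state HMM. The names `pi0`, `P` and `lik` are illustrative: a $K$-vector initial distribution, a $K \times K$ transition matrix, and a $T \times K$ matrix of per-state observation likelihoods.

```python
import numpy as np

def ffbs(pi0, P, lik, rng):
    """Forward-filtering backward-sampling: draw s_{1:T} ~ p(s | y).

    pi0: (K,) initial state distribution
    P:   (K, K) transition matrix
    lik: (T, K) observation likelihoods lik[t, k] = f(y_t | phi_k)
    """
    T, K = lik.shape
    alpha = np.zeros((T, K))
    # Forward pass: filtered distributions p(s_t | y_{1:t}), normalised
    # at each step for numerical stability.
    alpha[0] = pi0 * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ P) * lik[t]
        alpha[t] /= alpha[t].sum()
    # Backward pass: sample s_T from alpha[T-1], then each s_t given
    # s_{t+1} with probability proportional to alpha[t, i] * P[i, s_{t+1}].
    s = np.zeros(T, dtype=int)
    s[T - 1] = rng.choice(K, p=alpha[T - 1])
    for t in range(T - 2, -1, -1):
        probs = alpha[t] * P[:, s[t + 1]]
        s[t] = rng.choice(K, p=probs / probs.sum())
    return s

# Tiny usage example with K = 2 states and T = 5 observations.
rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.2, 0.8]])
pi0 = np.array([0.5, 0.5])
lik = rng.random((5, 2))
s = ffbs(pi0, P, lik, rng)
```

Sampling the whole sequence jointly in this way avoids the slow mixing of one-state-at-a-time Gibbs updates.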
Sampling the emission matrix, $\Phi$.
The posterior over the emission matrix is
$$p(\Phi \mid Y, s_{1:T}) \propto p(\Phi) \prod_{t=1}^{T} f(y_t \mid \phi_{s_t}).$$
The explicit form of the posterior depends on the observation model for $Y$, that is, multinomial, Poisson or Gaussian. In all cases, due to conjugacy, the emission matrix is sampled exactly.
9 Experiments
In this section we evaluate the SHGP by running the SHGP hidden Markov model on two real-world datasets. The datasets are especially chosen such that the underlying systems are reversible. For completeness, we also ran the SHGP assuming non-reversibility, by not imposing symmetry on the inferred weight matrix $J$. Moreover, we compare the SHGP to the infinite HMM (iHMM), which learns a transition matrix for the hidden state sequence and does not account for reversibility. For the iHMM we use the beam sampler (Van Gael et al., 2008).
A principled way to evaluate a generative model is by its ability to predict missing data values given some observations. For the SHGP we collect $M$ samples from the posterior and estimate the predictive distribution of a missing entry in the dataset as the average of the predictive distributions for each of the collected samples. For the experiments we ran, we used two different likelihoods, a Poisson and a Gaussian. For the Poisson model the approximate predictive distribution is
$$p(y^* \mid Y) \approx \frac{1}{M} \sum_{m=1}^{M} \mathrm{Poisson}\big(y^* \mid \lambda^{(m)}_{s^*}\big),$$
while for the Gaussian it is
$$p(y^* \mid Y) \approx \frac{1}{M} \sum_{m=1}^{M} \mathcal{N}\big(y^* \mid \mu^{(m)}_{s^*}, \sigma^{2,(m)}_{s^*}\big),$$
where $s^*$ is the sampled latent state at the missing position under the $m$-th sample and $\lambda, \mu, \sigma^2$ are the corresponding emission parameters.
The supplementary material provides a detailed description of the likelihood models.
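The posterior-predictive averaging for the Poisson model can be sketched as follows; the rates and state assignments below are hypothetical stand-ins for samples collected by the Gibbs sampler.

```python
import math

# Monte Carlo estimate of the predictive distribution of a held-out count
# y*, averaging Poisson pmfs over posterior samples. Each sample carries
# per-state rates and the sampled state at the missing position
# (illustrative values, not real sampler output).
samples = [
    {"rates": [0.5, 4.0], "state": 1},
    {"rates": [0.7, 3.5], "state": 1},
    {"rates": [0.6, 4.2], "state": 0},
]

def poisson_pmf(y, lam):
    return math.exp(-lam) * lam ** y / math.factorial(y)

def predictive(y_star, samples):
    # p(y* | Y) is approximated by (1/M) sum_m Poisson(y* | lambda_{s*}^(m)).
    return sum(poisson_pmf(y_star, s["rates"][s["state"]])
               for s in samples) / len(samples)

probs = [predictive(y, samples) for y in range(20)]
assert abs(sum(probs) - 1.0) < 1e-3   # mass concentrated on small counts
```

The Gaussian case is identical except that the Poisson pmf is replaced by a normal density with the sampled per-state mean and variance.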
9.1 ChIP-seq epigenetic marks
For this experiment we used ChIP-seq (chromatin immunoprecipitation sequencing) data, representing histone modifications and transcription factor binding in human neural crest cell lines (see Park (2009) for a review). ChIP-seq is a method to identify the sites in a DNA sequence where specific proteins are bound. The workflow of ChIP-seq is: 1) DNA is extracted from cells, 2) the proteins of interest (POI) and DNA are chemically bound ("cross-linked"), 3) the DNA is fragmented using sonication, 4) an appropriate antibody is used to filter out the DNA fragments bound to the POI using immunoprecipitation, 5) the POI is removed from the DNA, 6) the DNA is sequenced. The reads are finally mapped to a known reference sequence. Note that reversibility is reasonable here because, although individual genes have direction, the genome as a whole has no particular direction.
The resulting observed sequence is a matrix of counts $Y$, where $Y_{tp}$ represents how many reads for POI $p$ mapped to bin $t$, where a bin is a 100bp region of the genome (different size bins could be used, but 100bp is commonplace). A small section of our full dataset $Y$, along with the identifiers of the POIs, is shown in Figure 2.
ChIP-seq results are presented in Table 2. We ran 10 repeats, each time holding out a different portion of the data and using a different random initialisation. The likelihood model used here is Poisson and the task was to predict the held-out values in $Y$. We see that in terms of predictive performance the reversible SHGP-HMM outperforms both the non-reversible version of the model and the iHMM trained using beam sampling. The "emission" matrix of Poisson rates is shown in Figure 3, where we identify expected states, known as enhancers and promoters, based on their activity levels for the different markers (POIs).
|Model|Algorithm|Train error|Test error|Train log likelihood|Test log likelihood|
9.2 Single ion channel recordings
Patch clamp recordings are a well-established experimental method to measure conformational changes in ion channels, proteins embedded in the lipid membranes of cells (such as the cell surface membrane) which control the flow of chemicals such as neurotransmitters across the membrane. These changes are accompanied by changes in electrical potential which can be measured. HMMs have been used to analyse these recordings for many years (Becker et al., 1994), but have usually ignored the prior knowledge that the underlying physical system has time-reversible dynamics. We incorporate this information using the SHGP-HMM, analysing a 1 MHz recording from the state-of-the-art method of Rosenstein et al. (2013) of a single alamethicin channel. We subsample this time series by a factor of 100 to obtain a 10 kHz recording, which we log transform and normalise. A small segment of the recording, along with the fitted SHGP-HMM, is shown in Figure 6. Grey regions represent artificial missingness used to test the predictive performance of the models, as shown in Table 3. Here we see that the reversible version of the SHGP-HMM outperforms both the non-reversible version and the iHMM using the beam sampler, showing the advantage of using the prior knowledge that the transition matrix should be reversible. More specifically, the reversible model performs slightly better in terms of test error than the non-reversible model, although this difference is not quite significant based on a paired t-test. In terms of test log likelihood, however, the reversible version of the model does perform significantly better. The use of HMC or NUTS does not significantly impact the results in this case. The iHMM using the beam sampler does significantly worse in terms of train and test error, but not significantly better than the non-reversible HGP-HMM.
The SHGP-HMM typically uses only a few states for this dataset. A typical sample of the state sequence is shown in Figure 4 and the observation models for each state are illustrated in Figure 5. Comparing the histogram of currents to the learnt observation models, we see that some of the less common high-energy states are blurred into one, which could possibly be alleviated by more careful selection of priors. An additional difficulty of ion channel recordings is that the current level for a particular state tends to drift over time, a characteristic not currently supported by our model.
|Model|Algorithm|Train error|Test error|Train log likelihood|Test log likelihood|
10 Conclusion
Reversibility is a property met in various datasets, especially in ion channel recordings. In this paper, we have introduced a hierarchical nonparametric model, the SHGP, which gives rise to reversible Markov chains. We have used the SHGP to construct a hidden Markov model, allowing a broad range of data types to be modelled. Our experimental results on two different datasets suggest that accounting for reversibility intrinsically in the SHGP results in gains in empirical performance compared to non-reversible models. An interesting direction for future work would be to apply the SHGP to MCMC itself: in this setting, the second eigenvalue of the learnt transition matrix could be used as a measure of the mixing performance of the MCMC chain.
- Bacallado, S., Favaro, S., and Trippa, L. (2013). Bayesian nonparametric analysis of reversible Markov chains. The Annals of Statistics, 41(2):870–896.
- Beal, M. J., Ghahramani, Z., and Rasmussen, C. E. (2003). The infinite hidden Markov model. Advances in Neural Information Processing Systems 14, pages 577–584.
- Becker, J. D., Honerkamp, J., Hirsch, J., Fröbe, U., Schlatter, E., and Greger, R. (1994). Analysing ion channels with hidden Markov models. Pflügers Archiv, 426(3-4):328–332.
- Diaconis, P. and Coppersmith, D. (1986). Random walks with reinforcement. Unpublished manuscript.
- Diaconis, P. and Freedman, D. (1980). De Finetti’s theorem for Markov chains. The Annals of Probability, 8(1):115–130.
- Hoffman, M. D. and Gelman, A. (2011). The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. arXiv preprint arXiv:1111.4246.
- Kingman, J. F. C. (1967). Completely random measures. Pacific J. Math, 21(1):59–78.
- Levin, D. A., Peres, Y., and Wilmer, E. L. (2006). Markov chains and mixing times. American Mathematical Society.
- Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo. CRC Press.
- Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31(3):705–741.
- Park, P. J. (2009). ChIP-seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 10(10):669–680.
- Rada-Iglesias, A., Bajpai, R., Swigut, T., Brugmann, S. A., Flynn, R. A., and Wysocka, J. (2010). A unique chromatin signature uncovers early developmental enhancers in humans. Nature, 470(7333):279–283.
- Rosenstein, J. K., Ramakrishnan, S., Roseman, J., and Shepard, K. (2013). Single ion channel recordings with CMOS-anchored lipid membranes. Nano letters.
- Scott, S. L. (2002). Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association, 97:337–351.
- Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
- Van Gael, J., Saatci, Y., Teh, Y. W., and Ghahramani, Z. (2008). Beam sampling for the infinite hidden markov model. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 1088–1095, New York, NY, USA. ACM.