Variational Inference for Sparse and Undirected Models


Undirected graphical models are applied in genomics, protein structure prediction, and neuroscience to identify sparse interactions that underlie discrete data. Although Bayesian methods for inference would be favorable in these contexts, they are rarely used because they require doubly intractable Monte Carlo sampling. Here, we develop a framework for scalable Bayesian inference of discrete undirected models based on two new methods. The first is Persistent VI, an algorithm for variational inference of discrete undirected models that avoids doubly intractable MCMC and approximations of the partition function. The second is Fadeout, a reparameterization approach for variational inference under sparsity-inducing priors that captures a posteriori correlations between parameters and hyperparameters with noncentered parameterizations. We find that, together, these methods for variational inference substantially improve learning of sparse undirected graphical models in simulated and real problems from physics and biology.


1 Introduction

Hierarchical priors that favor sparsity have been a central development in modern statistics and machine learning, and find widespread use for variable selection in biology, engineering, and economics. Among the most widely used and successful approaches for inference of sparse models has been ℓ1 regularization, which, after its introduction in the context of linear models with the LASSO Tibshirani (1996), has become the standard tool for both directed and undirected models alike Murphy (2012).

Despite its success, however, ℓ1 regularization is a pragmatic compromise. As the closest convex approximation of the idealized ℓ0 norm, ℓ1 regularization cannot model the hypothesis of sparsity as well as some Bayesian alternatives Tipping (2001). Two Bayesian approaches stand out as more accurate models of sparsity than ℓ1. The first, the spike and slab Mitchell & Beauchamp (1988), introduces discrete latent variables that directly model the presence or absence of each parameter. This discrete approach is the most direct and accurate representation of a sparsity hypothesis Mohamed et al. (2012), but the discrete latent space that it imposes is often computationally intractable for models where Bayesian inference is difficult.

The second approach to Bayesian sparsity uses the scale mixtures of normals Andrews & Mallows (1974), a family of distributions that arise from integrating a zero-mean Gaussian over an unknown variance as

    p(θ) = ∫ N(θ; 0, σ²) p(σ²) dσ²    (1)
Scale-mixtures of normals can approximate the discrete spike and slab prior by mixing both large and small values of the variance σ². The implicit prior of ℓ1 regularization, the Laplacian, is a member of the scale mixture family that results from an exponentially distributed variance σ². Thus, mixing densities with subexponential tails and more mass near the origin more accurately model sparsity than ℓ1, and are the basis for approaches often referred to as “Sparse Bayesian Learning” Tipping (2001). Both the Student-t of Automatic Relevance Determination (ARD) MacKay et al. (1994) and the Horseshoe prior Carvalho et al. (2010) incorporate these properties.
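These mixing densities behave very differently in simulation. The sketch below (illustrative, not from the paper) draws parameters from the Laplacian and Horseshoe priors via their scale-mixture representations and compares how much mass each places near zero and far out in the tails:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Laplacian as a scale mixture: sigma^2 ~ Exponential, theta | sigma^2 ~ N(0, sigma^2)
sigma2 = rng.exponential(scale=2.0, size=n)        # exponential mixing density
laplace_draws = rng.normal(0.0, np.sqrt(sigma2))

# Horseshoe: sigma ~ Half-Cauchy(0, 1), theta | sigma ~ N(0, sigma^2)
sigma = np.abs(rng.standard_cauchy(n))             # half-Cauchy mixing density
horseshoe_draws = rng.normal(0.0, sigma)

# The Horseshoe's mixing density has more mass near zero AND heavier tails,
# so its draws concentrate near the origin while still permitting large values.
frac_small_hs = (np.abs(horseshoe_draws) < 0.1).mean()
frac_small_lap = (np.abs(laplace_draws) < 0.1).mean()
```

With these settings the Horseshoe places visibly more draws inside (−0.1, 0.1) and beyond ±10 than the Laplacian, which is exactly the subexponential-tail behavior described above.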

Applying these favorable, Bayesian approaches to sparsity has been particularly challenging for discrete, undirected models like Boltzmann Machines. Undirected models possess a representational advantage of capturing ‘collective phenomena’ with no directions of causality, but their likelihoods require an intractable normalizing constant Murray & Ghahramani (2004). For a fully observed Boltzmann Machine with spins x, the distribution is

    p(x | h, J) = exp( ∑ᵢ hᵢxᵢ + ∑ᵢ<ⱼ Jᵢⱼxᵢxⱼ ) / Z(h, J)    (2)

where the partition function Z(h, J) depends on the couplings. Whenever a new set of couplings is considered during inference, the partition function and corresponding density must be reevaluated. This requirement for an intractable calculation embedded within already-intractable nonconjugate inference has led some to term Bayesian learning of undirected graphical models “doubly intractable” Murray et al. (2006). When all patterns of discrete spike and slab sparsity are added on top of this, we might call this problem “triply intractable” (Figure 1). Triple-intractability does not mean that this problem is impossible, but it will typically require expensive approaches based on MCMC-within-MCMC Chen & Welling (2012).

Figure 1: Bayesian inference for discrete undirected graphical models with sparse priors is triply intractable, as the space of possible models spans: (i) all possible sparsity patterns, each of which possesses its own (ii) parameter space, for which every distinct set of parameters has its own (iii) intractable normalizing constant.

Here we present an alternative to MCMC-based approaches for learning undirected models with sparse priors based on stochastic variational inference Hoffman et al. (2013). We combine three ideas: (i) stochastic gradient variational Bayes Kingma & Welling (2014); Rezende et al. (2014); Titsias & Lázaro-Gredilla (2014), (ii) persistent Markov chains Younes (1989), and (iii) a noncentered parameterization of scale-mixture priors, to inherit the benefits of hierarchical Bayesian sparsity in an efficient variational framework. We make the following contributions:

  • We extend stochastic variational inference to undirected models with intractable normalizing constants by developing a learning algorithm based on persistent Markov chains, which we call Persistent Variational Inference (PVI) (Section 2).

  • We introduce a reparameterization approach for variational inference under sparsity-inducing scale-mixture priors (e.g. the Laplacian, ARD, and the Horseshoe) that significantly improves approximation quality by capturing scale uncertainty (Section 3). When combined with Gaussian stochastic variational inference, we call this Fadeout.

  • We demonstrate how a Bayesian approach for learning sparse undirected graphical models with PVI and Fadeout yields significantly improved inferences in both synthetic and real applications from physics and biology (Section 4).

2 Persistent Variational Inference

Background: Learning in undirected models Undirected graphical models, also known as Markov Random Fields, can be written in log-linear form as

    p(x | θ) = exp( ∑ᵢ θᵢ fᵢ(x) ) / Z(θ)    (3)

where i indexes a set of features fᵢ(x) and the partition function Z(θ) normalizes the distribution Koller & Friedman (2009). Maximum Likelihood inference selects parameters θ that maximize the probability of data by ascending the gradient of the (averaged) log likelihood

    ∂/∂θᵢ (1/N) ∑ₙ log p(x⁽ⁿ⁾ | θ) = E_data[ fᵢ(x) ] − E_{p(x|θ)}[ fᵢ(x) ]    (4)

The first term in the gradient is a data-dependent average of feature fᵢ over the data, while the second term is a data-independent average of feature fᵢ over the model distribution p(x | θ) that often requires sampling Murphy (2012).
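For a system small enough to enumerate, both averages in this gradient can be computed exactly, which makes the moment-matching structure of (4) concrete. A sketch for a 3-spin Ising-type model with pairwise features (all names illustrative):

```python
import itertools
import numpy as np

# Toy 3-spin model with pairwise features f_ij(x) = x_i * x_j.
# Small enough to enumerate all 2^3 states, so both expectations in the
# gradient  E_data[f] - E_model[f]  are exact.
states = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)

def pairwise_features(x):
    # f(x) = (x0*x1, x0*x2, x1*x2)
    return np.array([x[0] * x[1], x[0] * x[2], x[1] * x[2]])

def log_likelihood_grad(J, data):
    feats = np.array([pairwise_features(s) for s in states])           # (8, 3)
    logits = feats @ J
    p = np.exp(logits - logits.max())
    p /= p.sum()                                                       # model distribution
    model_avg = p @ feats                                              # E_model[f]
    data_avg = np.mean([pairwise_features(x) for x in data], axis=0)   # E_data[f]
    return data_avg - model_avg

# At J = 0 the model is uniform, so E_model[f] = 0 and the gradient
# equals the empirical feature average.
data = states[[0, 7, 7, 0]]        # two all-down and two all-up configurations
grad = log_likelihood_grad(np.zeros(3), data)
```

Since the all-down and all-up configurations both have every pairwise product equal to +1, the gradient at J = 0 is (1, 1, 1), pushing all couplings toward ferromagnetic values.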

Bayesian learning for undirected models is confounded by the partition function Z(θ). Given the data D = {x⁽¹⁾, …, x⁽ᴺ⁾}, a prior p(θ), and the log potentials ∑ᵢ θᵢ fᵢ(x), the posterior distribution of the parameters is

    p(θ | D) = p(θ) ∏ₙ p(x⁽ⁿ⁾ | θ) / p(D)    (5)

which contains an intractable partition function within the already-intractable evidence term p(D). As a result, most algorithms for Bayesian learning of undirected models require either doubly-intractable MCMC and/or approximations of the likelihood p(x | θ).

Figure 2: Variational inference for sparse priors with noncentered reparameterizations. Several sparsity-inducing priors such as the Laplacian, Student-t, and Horseshoe (shown here) can be derived as scale-mixture priors in which each model parameter θ is drawn from a zero-mean Gaussian with random variance σ² (top row). The dependency of θ on σ gives rise to a strongly curved “funnel” distribution (blue, top left and right) that is poorly modeled by a factorized variational distribution (not shown). A noncentered reparameterization with θ = σ·θ̃ trades independence of θ and σ in the likelihood (blue, top center) for independence in the prior (blue, bottom left), allowing a factorized variational distribution over noncentered parameters (black contours, bottom right) to implicitly capture the a priori correlations between θ and σ (black contours, top right). As a result, the variational distribution can more accurately model the bottom of the “funnel”, which corresponds to sparse estimates.

A tractable estimator for the ELBO of undirected models Here we consider how to approximate the intractable posterior in (5) without approximating the partition function or the likelihood by using variational inference. Variational inference recasts inference as an optimization problem of finding a variational distribution q(θ; φ) that is closest to p(θ | D) as measured by KL divergence Jordan et al. (1999). This can be accomplished by maximizing the Evidence Lower BOund

    L(φ) = E_{q(θ;φ)}[ log p(D | θ) + log p(θ) − log q(θ; φ) ]    (6)
For scalability, we would like to optimize the ELBO with methods that can leverage Monte Carlo estimators of the gradient ∇_φ L(φ). One possible strategy would be to develop an estimator based on the score function Ranganath et al. (2014) with a Monte-Carlo approximation of

    ∇_φ L(φ) = E_{q(θ;φ)}[ ( log p(D | θ) + log p(θ) − log q(θ; φ) ) ∇_φ log q(θ; φ) ]    (7)

Naively substituting the likelihood (3) in the score function estimator (7) nests the intractable log partition function within the average over q(θ; φ), making this an untenable (and extremely high variance) approach to inference with undirected models.

We can avoid the need for a score-function estimator with the ‘reparameterization trick’ Kingma & Welling (2014); Rezende et al. (2014); Titsias & Lázaro-Gredilla (2014) that has been incredibly useful for directed models. Consider a variational approximation q(θ; φ) that is a fully factorized (mean field) Gaussian with means μ and log standard deviations log σ. The ELBO expectations under q can be rewritten as expectations wrt an independent noise source ε ∼ N(0, I), where θ = μ + σ ⊙ ε. Then the gradients are

    ∇_μ L = E_ε[ ∇_θ ( log p(D | θ) + log p(θ) ) ]
    ∇_{log σ} L = E_ε[ ∇_θ ( log p(D | θ) + log p(θ) ) ⊙ σ ⊙ ε ] + 1    (8)

Because these expectations require only the gradient of the likelihood ∇_θ log p(D | θ), the gradient for the undirected model (4) can be substituted to form a nested expectation for ∇_φ L. This can then be used as a Monte Carlo gradient estimator by sampling ε ∼ N(0, I).
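A minimal sketch of this reparameterized gradient estimator, using a stand-in differentiable log likelihood in place of the undirected model (for PVI the inner gradient would instead come from persistent Gibbs chains; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Reparameterization-trick estimates of the gradients of E_q[ log p(theta) ]
# for a factorized Gaussian q(theta) = N(mu, exp(2 * log_sigma)).
# Stand-in likelihood: log p(theta) = -0.5 * theta^2, whose gradient is -theta.
def grad_log_p(theta):
    return -theta

def reparam_grad(mu, log_sigma, n_samples=200_000):
    eps = rng.standard_normal((n_samples, mu.size))
    theta = mu + np.exp(log_sigma) * eps                   # theta = mu + sigma * eps
    g = grad_log_p(theta)
    grad_mu = g.mean(axis=0)                               # E[ dlogp/dtheta ]
    grad_ls = (g * eps * np.exp(log_sigma)).mean(axis=0)   # chain rule through sigma
    return grad_mu, grad_ls

mu = np.array([0.5, -1.0])
log_sigma = np.array([0.0, 0.0])
grad_mu, grad_ls = reparam_grad(mu, log_sigma)
# Analytically: grad_mu = -mu, and grad_ls = -sigma^2 = -1 here
# (the entropy term's +1 per dimension is omitted in this sketch).
```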

Persistent gradient estimation In Stochastic Maximum Likelihood estimation for undirected models, the intractable gradients of (4) are estimated by sampling from the model distribution p(x | θ). Although sampling-based approaches are slow, they can be made considerably more efficient by running a set of Markov chains in parallel with state that persists between iterations Younes (1989). Persistent state maintains the Markov chains near their equilibrium distributions, which means that they can quickly re-equilibrate after perturbations to the parameters during learning.

We propose variational inference in undirected models based on persistent gradient estimation of the ELBO gradient, and refer to this as Persistent Variational Inference (PVI) (Algorithm in Appendix). Following the notation of PCD-k Tieleman (2008), PVI-k refers to using k sweeps of Gibbs sampling with persistent Markov chains between iterations. This approach is generally compatible with any estimators of the ELBO that are based on the gradient of the log likelihood, several examples of which are explained in Kingma & Welling (2014); Rezende et al. (2014); Titsias & Lázaro-Gredilla (2014).
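The persistent Gibbs machinery underlying PVI can be sketched for a small Ising-type model as follows (sizes and names are illustrative; the chain state survives between parameter updates so only a few sweeps are needed to re-equilibrate after each small parameter change):

```python
import numpy as np

rng = np.random.default_rng(2)

N_SPINS, N_CHAINS = 5, 100
chains = rng.choice([-1.0, 1.0], size=(N_CHAINS, N_SPINS))   # persistent state

def gibbs_sweep(chains, J, h):
    # One Gibbs sweep: P(x_i = +1 | rest) = sigmoid(2 * (h_i + sum_j J_ij x_j))
    for i in range(N_SPINS):
        field = h[i] + chains @ J[i]          # J has zero diagonal: no self-coupling
        p_up = 1.0 / (1.0 + np.exp(-2.0 * field))
        chains[:, i] = np.where(rng.random(N_CHAINS) < p_up, 1.0, -1.0)
    return chains

def model_moments(chains, J, h, k=3):
    # PVI-k: k persistent sweeps, then estimate E_model[x_i x_j] from the chains
    for _ in range(k):
        chains = gibbs_sweep(chains, J, h)
    return chains, (chains.T @ chains) / len(chains)

J = np.zeros((N_SPINS, N_SPINS))
h = np.zeros(N_SPINS)
chains, second_moments = model_moments(chains, J, h)
```

In a full PVI iteration these model moments would be re-estimated for each sample θ drawn from the variational distribution, exactly where the gradient (4) needs E_model[f].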

Behavior of the solution for Gaussian q When the variational approximation is a fully factorized Gaussian and the prior is flat, p(θ) ∝ 1, the solution to ∇_φ L = 0 will satisfy

    E_data[ fᵢ(x) ] = E_{q(θ;φ)}[ E_{p(x|θ)}[ fᵢ(x) ] ]    (9)

where the right-hand side averages the features over an extended system of the original undirected model in which the parameters fluctuate according to the variational distribution. This bridges to the Maximum Likelihood solution as the variances shrink to zero and the sample size grows, while accounting for uncertainty in the parameters at finite sample sizes, with each variational standard deviation scaling as the inverse of the ‘sensitivity’ of the likelihood to that parameter.

3 Fadeout

Prior               Hyperprior
Gaussian (ℓ2)       σ constant
Laplacian (ℓ1)      σ² ∼ Exponential
Student-t (ARD)     σ² ∼ Inverse-Gamma
Table 1: Common priors as scale-mixtures of normal distributions

3.1 Noncentered Parameterizations of Hierarchical Priors

Hierarchical models are powerful because they impose a priori correlations between latent variables that reflect problem-specific knowledge. For scale-mixture priors that promote sparsity, these correlations come in the form of scale uncertainty. Instead of assuming that the scale of a parameter θ in a model is known a priori, we posit that it is normally distributed with a randomly distributed variance σ². The joint prior p(θ, σ) gives rise to a strongly curved ‘funnel’ shape (Figure 2) that illustrates a simple but profound principle about hierarchical models: as the hyperparameter σ decreases and the prior accepts a smaller range of values for θ, normalization increases the probability density at the origin, favoring sparsity. This normalization-induced sharpening has been called a Bayesian Occam’s Razor MacKay (2003).

While normalization-induced sharpening gives rise to sparsity, these extreme correlations are a disaster for mean-field variational inference. Even if a tremendous amount of probability mass is concentrated at the base of the funnel, an uncorrelated mean-field approximation will yield estimates near the top. The result is a potentially non-sparse estimate from a very sparse prior.

The strong coupling of hierarchical funnels also plagues exact methods based on MCMC with slow mixing, but the statistics community has found that these geometric pathologies can be effectively managed by transformations. Many models can be rewritten in a noncentered form where the parameters and hyperparameters are a priori independent Papaspiliopoulos et al. (2007); Betancourt & Girolami (2013). For the scale-mixtures of normals, this change of variables is

    θ = σ · θ̃,   θ̃ ∼ N(0, 1),   σ ∼ p(σ)
  Require: Global variational parameters (for the log scales log σ)
  Require: Local variational parameters (for the noncentered weights θ̃)
  Require: Hyperprior gradient ∇ log p(σ)
  Require: Likelihood gradient ∇_θ log p(D | θ)
  // Sample from variational distribution
  ε ∼ N(0, I) for each of log σ and θ̃
  // Centered global parameters: σ = exp(log σ)
  // Noncentered local parameters: θ = σ ⊙ θ̃
Algorithm 1 Computing ∇_φ L for Fadeout

Then the joint prior factorizes as p(θ̃, σ) = p(θ̃)p(σ) while preserving the marginal p(θ). In noncentered form, the joint prior is independent and well approximated by a mean-field Gaussian, while the likelihood will be variably correlated depending on the strength of the data (Figure 2). In this sense, centered parameterizations (CP) and noncentered parameterizations (NCP) are usually framed as favorable in strong and weak data regimes, respectively.
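The effect of this change of variables is easy to verify by simulation. The sketch below (illustrative) draws from the same funnel prior in centered and noncentered coordinates, and confirms that the marginals over θ agree while the noncentered coordinates are exactly independent:

```python
import numpy as np

rng = np.random.default_rng(3)

# Centered vs. noncentered draws from a "funnel" prior:
#   centered:    sigma ~ p(sigma),  theta | sigma ~ N(0, sigma^2)
#   noncentered: sigma ~ p(sigma),  theta_tilde ~ N(0, 1),  theta = sigma * theta_tilde
# Both produce the same marginal over theta, but in the noncentered
# coordinates the sampled variables (theta_tilde, sigma) are independent,
# which is what a mean-field Gaussian variational distribution can represent.
n = 200_000
sigma = np.abs(rng.standard_cauchy(n))           # Half-Cauchy scales

theta_centered = rng.normal(0.0, sigma)          # correlated with sigma
theta_tilde = rng.standard_normal(n)             # independent of sigma
theta_noncentered = sigma * theta_tilde          # same marginal as theta_centered

# Correlation between the sampled noncentered coordinates is ~0 by construction.
corr_nc = np.corrcoef(theta_tilde, np.log(sigma))[0, 1]
```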

We propose the use of non-centered parameterizations of scale-mixture priors for mean-field Gaussian variational inference. For convenience, we call this Fadeout (see next section). Fadeout can be easily implemented by either (i) using the chain rule to derive the gradient of the Evidence Lower BOund (ELBO) (Algorithm 1) or, for differentiable models, (ii) rewriting models in noncentered form and using automatic differentiation tools such as Stan Kucukelbir et al. (2017) or autograd for ADVI. The only two requirements of the user are the gradient of the likelihood function and a choice of a global hyperprior, several options for which are presented in Table 1.

Estimators for the centered posterior. Fadeout optimizes a mean-field Gaussian variational distribution over the noncentered parameters (θ̃, log σ). As an estimator for the centered parameters, we use the mean-field property to compute the centered posterior mean as ⟨θ⟩ = ⟨σ⟩⟨θ̃⟩, where ⟨σ⟩ follows from the lognormal mean of the variational distribution over log σ.

Figure 3: An undirected model with a scale mixture prior (factor graph on left) can be given a priori independence of the latent variables by a noncentered parameterization (factor graph on right). This is advantageous for mean-field variational inference that imposes a posteriori independence.
Figure 4: Inverse Ising. Combining Persistent VI with a noncentered Horseshoe prior (Half-Cauchy hyperprior) attains lower error on simulated Ising systems than standard methods for point estimation including: Pseudolikelihood (PL) with ℓ1 or decimation regularization Schmidt (2010); Aurell & Ekeberg (2012); Decelle & Ricci-Tersenghi (2014), Minimum Probability Flow (MPF) Sohl-Dickstein et al. (2011), and Persistent Contrastive Divergence (PCD) Tieleman (2008). For the spin glass, error bars are two logarithmic standard deviations across 5 simulated systems.
Figure 5: Synthetic protein. For reconstructing interactions in a synthetic 20-letter spin-glass, a hierarchical Bayesian approach based on Persistent VI and a noncentered group Horseshoe prior (Half-Cauchy hyperprior) identifies true interactions with more accuracy and less shrinkage than Group ℓ1. Each pair (i, j) is shown as the norm of the factor coupling the amino acid at position i to the amino acid at position j.

3.2 Connection to Dropout

Dropout regularizes neural networks by perturbing hidden units in a directed network with multiplicative Bernoulli or Gaussian noise Srivastava et al. (2014). Although it was originally framed as a heuristic, Dropout has been subsequently interpreted as variational inference under at least two different schemes Gal & Ghahramani (2016); Kingma et al. (2015). Here, we interpret Fadeout the reverse way, where we introduced it as variational inference and now notice that it looks similar to lognormal Dropout. If we take the uncertainty in log σ as low and clamp the other variational parameters, the gradient estimator for Fadeout is:

    ∇_{θ̃} L ≈ ∇_{θ̃} [ log p(D | σ ⊙ θ̃) − ½‖θ̃‖² ],   σ ∼ LogNormal
This is the gradient estimator for a lognormal version of Dropout with an ℓ2 weight penalty on the noncentered parameters. At each sample from the variational distribution, Fadeout introduces scale noise rather than the Bernoulli noise of Dropout. The connection to Dropout would seem to follow naturally from the common interpretation of scale mixtures as continuous relaxations of spike and slab priors Engelhardt & Adams (2014) and the idea that Dropout can be related to variational spike and slab inference Louizos (2015).

4 Experiments

4.1 Physics: Inferring Spin Models

Ising model The Ising model is a prototypical undirected model for binary systems that includes both pairwise interactions and (potentially) sitewise biases. It can be seen as the fully observed case of the Boltzmann machine, and is typically parameterized with signed spins xᵢ ∈ {−1, +1} and a likelihood given by

    p(x | h, J) = exp( ∑ᵢ hᵢxᵢ + ∑ᵢ<ⱼ Jᵢⱼxᵢxⱼ ) / Z(h, J)
Originally proposed as a minimal model of how long range order arises in magnets, it continues to find application in physics and biology as a model for phase transitions and quenched disorder in spin glasses Nishimori (2001) and collective firing patterns in neural spike trains Schneidman et al. (2006); Shlens et al. (2006).

Hierarchical sparsity prior One appealing feature of the Ising model is that it allows a sparse set of underlying couplings to give rise to long-range, distributed correlations across a system. Since many physical systems are thought to be dominated by a small number of relevant interactions, ℓ1 regularization has been a favored approach for inferring Ising models. Here, we examine how a more accurate model of sparsity based on the Horseshoe prior (Figure 3) can improve inferences in these systems. Each coupling and bias parameter is given its own scale parameter, and these are in turn tied together under a global Half-Cauchy prior for the scales (Figure 3, Appendix).

Simulated datasets We generated synthetic couplings for two kinds of Ising systems: (i) a slightly sub-critical cubic ferromagnet (a fixed ferromagnetic coupling for neighboring spins) and (ii) a Sherrington-Kirkpatrick spin glass diluted on an Erdős–Rényi random graph with average degree 2. We sampled synthetic data for each system with the Swendsen-Wang algorithm (Appendix) Swendsen & Wang (1987).

Results On both the ferromagnet and the spin glass, we found that Persistent VI with a noncentered Horseshoe prior (Fadeout) gave estimates with systematically lower reconstruction error of the couplings (Figure 4) versus a variety of standard methods in the field (Appendix).

4.2 Biology: Reconstructing 3D Contacts in Proteins from Sequence Variation

Figure 6: Unsupervised protein contact prediction. When inferring a pairwise undirected model for protein sequences in the SH3 domain family, hierarchical Bayesian approaches based on Persistent VI and noncentered scale mixture priors (Half-Cauchy for Group Horseshoe and Exponential for a Multivariate Laplace) identify local interactions that are close in 3D structure without tuning parameters. When group ℓ1-regularized maximum Pseudolikelihood estimation is tuned to give the same largest effect size as the Multivariate Laplace, the hierarchical approaches based on Persistent VI are more predictive of 3D proximity (right).

Potts model The Potts model generalizes the Ising model to non-binary categorical data. The factor graph is the same (Figure 3), except each spin xᵢ can adopt one of q different categories and each pairwise factor Jᵢⱼ is a q × q matrix, as

    p(x | h, J) = exp( ∑ᵢ hᵢ(xᵢ) + ∑ᵢ<ⱼ Jᵢⱼ(xᵢ, xⱼ) ) / Z(h, J)
The Potts model has recently generated considerable excitement in biology, where it has been used to infer 3D contacts in biological molecules solely from patterns of correlated mutations in the sequences that encode them Marks et al. (2011); Morcos et al. (2011). These contacts have been sufficient to predict the 3D structures of proteins, protein complexes, and RNAs Marks et al. (2012).

Group sparsity Each pairwise factor Jᵢⱼ in a Potts model contains q² parameters capturing all possible joint configurations of xᵢ and xⱼ. One natural way to enforce sparsity in a Potts model is at the level of each group. This can be accomplished by introducing a single scale parameter for all of the (noncentered) parameters within each group. We adopt this with the same Half-Cauchy hyperprior as the Ising problem, giving the same factor graph (Figure 3) now corresponding to a Group Horseshoe prior Hernández-Lobato et al. (2013). In the real protein experiment, we also consider an exponential hyperprior, which corresponds to a Multivariate Laplace distribution Eltoft et al. (2006) over the groups.

Synthetic protein data We first investigated the performance of Persistent VI with group sparsity on a synthetic protein experiment. We constructed a synthetic Potts spin glass with a topology inspired by biological macromolecules. We generated synthetic parameters based on contacts in a simulated polymer and sampled 2000 sequences by Gibbs sampling (Appendix).

Results for a synthetic protein We inferred couplings with 400 of the sampled sequences using PVI with group sparsity and two standard methods of the field: ℓ1 and Group ℓ1 regularized maximum pseudolikelihood (Appendix). PVI with a noncentered Horseshoe yielded more accurate (Figure 5, right), less shrunk (Figure 5, left) estimates of interactions that were more predictive of the 1600 remaining test sequences (Table 2). The ability to generalize well to new sequences will likely be important to the related problem of predicting mutation effects with unsupervised models of sequence variation Hopf et al. (2017); Figliuzzi et al. (2015).

Method                  Avg. test log-PL    Runtime (s)
PL, ℓ1 (5xCV)           −67.3               375
PL, Group ℓ1 (5xCV)     −59.6               303
PVI-3, Half-Cauchy      −54.2               585
Table 2: Average log-pseudolikelihood for test sequences.

Results for natural sequence variation We applied the hierarchical Bayesian model from the protein simulation to model across-species amino acid covariation in the SH3 domain family (Figure 6). Transitioning from simulated to real protein data is particularly challenging for Bayesian methods because available sequence data are highly non-independent due to a shared evolutionary history. We developed a new method for estimating the effective sample size (Appendix) which, when combined with standard sequence reweighting techniques, yielded a reweighted effective sample size of 1,012 from 10,209 sequences.

The hierarchical Bayesian approach gave highly localized, sparse estimates of interactions compared to the two predominant methods in the field, ℓ1 and group ℓ1 regularized pseudolikelihood (Figure 6). When compared to solved 3D structures for SH3 (Appendix), we found that the inferred interactions were considerably more accurate at predicting amino acids close in structure. Importantly, the hierarchical Bayesian approach accomplished this inference of strong, accurate interactions without a need to prespecify hyperparameters such as the penalty strength for ℓ1 or group ℓ1 regularization. This is particularly important for natural biological sequences because the non-independence of samples limits the utility of cross validation for setting hyperparameters.

5 Related work

5.1 Variational Inference

One strategy for improving variational inference is to introduce correlations in the variational distribution by geometric transformations. This can be made particularly powerful by using backpropagation to learn compositions of transformations that capture the geometry of complex posteriors Rezende & Mohamed (2015); Tran et al. (2016). Noncentered parameterizations of models may be complementary to these approaches by enabling more efficient representations of correlations between parameters and hyperparameters.

Most related to this work, Louizos et al. (2017); Ghosh & Doshi-Velez (2017) show how variational inference with noncentered scale-mixture priors can be useful for Bayesian learning of neural networks, and how group sparsity can act as a form of automatic compression and model selection.

5.2 Maximum Entropy

Much of the work on inference of undirected graphical models has gone under the name of the Maximum Entropy method in physics and neuroscience, which can be equivalently formulated as maximum likelihood in an exponential family MacKay (2003). From this maximum likelihood interpretation, regularized maximum entropy modeling (MaxEnt) corresponds to the disfavored “integrate-out” approach to inference in hierarchical models MacKay (1996) that will introduce significant biases to inferred parameters Macke et al. (2011). One solution to this bias was foreshadowed by methods for estimating entropy and Mutual Information, which used hierarchical priors to integrate over a large range of possible model complexities Nemenman et al. (2002); Archer et al. (2013). These hierarchical approaches are favorable because in traditional MAP estimation any top level parameters that are fixed before inference (e.g. a global pseudocount) introduce strong constraints on allowed model complexity. The improvements from PVI and Fadeout may be seen as extending this hierarchical approach to full systems of discrete variables.

6 Conclusion

We introduced a framework for scalable Bayesian sparsity for undirected graphical models composed of two methods. The first is an extension of stochastic variational inference to work with undirected graphical models that uses persistent gradient estimation to bypass estimating partition functions. The second is a variational approach designed to match the geometry of hierarchical, sparsity-promoting priors. We found that, when combined, these two methods give substantially improved inferences of undirected graphical models on both simulated and real systems from physics and computational biology.


Acknowledgments

We thank David Duvenaud, Finale Doshi-Velez, Miriam Huntley, Chris Sander, and members of the Marks lab for helpful comments and discussions. JBI was supported by a NSF Graduate Research Fellowship DGE1144152 and DSM by NIH grant 1R01-GM106303. Portions of this work were conducted on the Orchestra HPC Cluster at Harvard Medical School.

Appendix A Appendix I: PVI algorithm

  Require: Model. Undirected model defined by features f(x) on x
  Require: Data. Expectations of the features E_data[f(x)] and sample size N
  Require: Prior. Prior gradient ∇_θ log p(θ)
  Require: Number of Gibbs sweeps k, Markov chains M, variational samples S
  Require: Initial variational parameters μ, log σ
  // Initialize parameters and Markov chains
  while not converged do
     // Estimate ∇_φ L with S samples from the variational distribution
     for s = 1, …, S do
        θ⁽ˢ⁾ = μ + σ ⊙ ε⁽ˢ⁾, where ε⁽ˢ⁾ ∼ N(0, I)
        // Estimate model-dependent expectations E_{p(x|θ⁽ˢ⁾)}[f(x)] from the persistent chains
        for m = 1, …, M do
           for sweep = 1, …, k do
              Update chain m by one sweep of Gibbs sampling under θ⁽ˢ⁾
           end for
        end for
        // Compute stochastic gradient
     end for
     // Update parameters μ, log σ with Robbins-Monro sequence
  end while
Algorithm 2 Persistent Variational Inference (PVI-k) with Gaussian q

See Algorithm 2.

Appendix B Appendix II: Experiments

B.1 Spin Models

We generated two synthetic systems. The first system was ferromagnetic (all couplings non-negative) with 64 spins, where neighboring spins i, j have a nonzero interaction if adjacent on a periodic lattice. This coupling strength equates to being slightly above the critical temperature, meaning the system will be highly correlated despite the underlying interactions being only nearest-neighbor.

The second system was a diluted Sherrington-Kirkpatrick spin glass Sherrington & Kirkpatrick (1975); Aurell & Ekeberg (2012) with 100 spins. The couplings in this model were defined by Erdős–Rényi random graphs Erdős & Rényi (1960) with non-zero edge weights drawn from a zero-mean Gaussian whose variance is set by the average degree. We generated 5 random systems where the average degree was 2. Across all of the systems, we used Swendsen-Wang sampling Swendsen & Wang (1987) to sample synthetic data and checked that the sampling was sufficient to eliminate autocorrelation in the data.

For inference, we tested both ℓ1-regularized deterministic approaches as well as a variational approach based on Persistent VI. The ℓ1-regularized approaches included Pseudolikelihood (PL) Aurell & Ekeberg (2012), Minimum Probability Flow (MPF) Sohl-Dickstein et al. (2011), and Persistent Contrastive Divergence (PCD) Tieleman (2008). Additionally, we tested the proposed alternative to regularization, Pseudolikelihood Decimation Decelle & Ricci-Tersenghi (2014).

For ℓ1-regularized Pseudolikelihood and Minimum Probability Flow, we selected the regularization hyperparameter using 10-fold cross-validation over 10 logarithmically spaced values. We performed ℓ1 regularization of the deterministic objectives using optimizers from Schmidt (2010), and chose the corresponding hyperparameter for PCD + ℓ1 based on the optimal cross-validated value that was selected for ℓ1-regularized Pseudolikelihood.

For the hierarchical model inferred with Persistent VI, we placed a separate noncentered Horseshoe prior over the fields and couplings, in accordance with the (centered) hierarchy

    hᵢ ∼ N(0, λᵢ²τ²),   Jᵢⱼ ∼ N(0, λᵢⱼ²τ²),   λᵢ, λᵢⱼ ∼ C⁺(0, 1),   τ ∼ C⁺(0, 1)

where C⁺(0, 1) is the standard Half-Cauchy distribution. We then used PVI-3 with 100 persistent Markov chains and performed stochastic gradient descent using Adam Kingma & Ba (2014) with default momentum and a learning rate that linearly decayed from 0.01 to 0 over the course of the iterations.

B.2 Synthetic Protein Data

We constructed a synthetic Potts spin glass with sparse interactions chosen to reflect an underlying 3D structure. After forming a contact topology from a random packed polymer, we generated synthetic group-Student-t distributed sitewise bias vectors and Gaussian distributed coupling matrices to mirror the strong site-bias and weak-coupling regime of proteins. Since this system is highly frustrated, we thinned long runs of Gibbs sampling to 2000 sequences that exhibited no autocorrelation.

Given 400 of the 2000 synthetic sequences, we inferred ℓ1 and group ℓ1-regularized MAP estimates under a pseudolikelihood approximation with 5-fold cross validation to choose hyperparameters from 6 candidate values. We also ran PVI-10 with 40 persistent Markov chains and 5000 iterations of stochastic gradient descent with Adam Kingma & Ba (2014). We note that the current standards of the field are based on ℓ1 and Group ℓ1 regularized Pseudolikelihood Balakrishnan et al. (2011); Ekeberg et al. (2013).

B.3 Real Protein Data

Sample reweighting

Natural protein sequences share a common evolutionary history that introduces significant redundancy and correlation between related sequences. Treating them as independent data is biased by both the overrepresentation of certain sequences due to the evolutionary process (phylogeny) and the human sampling process (biased sequencing of particular species). Thus, we follow a standard practice of correcting the overrepresentation of sequences by a sample-reweighting approach Ekeberg et al. (2013).

Sequence reweighting. If we were to treat all data as independent, then every sample would receive unit weight in the log likelihood. To correct for the over- and underrepresentation of certain sequences, we estimate relative sequence weights using a common inverse neighborhood density-based approach from the field Ekeberg et al. (2013). We set the relative weight of each sequence proportional to the inverse of the number of neighboring sequences that differ by a normalized Hamming distance of less than a given threshold, for which we use the established value from the field.
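The reweighting scheme can be sketched in a few lines; the threshold below is illustrative rather than the value from the paper:

```python
import numpy as np

# Inverse neighborhood-density sequence weights: each sequence gets weight
# 1 / (number of sequences within normalized Hamming distance < THETA).
THETA = 0.2  # illustrative threshold, not necessarily the paper's value

def sequence_weights(seqs, theta=THETA):
    seqs = np.asarray(seqs)
    # pairwise normalized Hamming distances
    dist = (seqs[:, None, :] != seqs[None, :, :]).mean(axis=2)
    neighbors = (dist < theta).sum(axis=1)   # counts include the sequence itself
    return 1.0 / neighbors

# Three near-identical sequences plus one distinct sequence: the redundant
# cluster is down-weighted, and the effective sample size is the weight sum.
seqs = [[0, 1, 2, 3, 0, 1, 2, 3, 0, 1],
        [0, 1, 2, 3, 0, 1, 2, 3, 0, 1],
        [0, 1, 2, 3, 0, 1, 2, 3, 0, 2],
        [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]]
weights = sequence_weights(seqs)
n_eff = weights.sum()   # cluster of 3 contributes 1, singleton contributes 1
```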

Effective sample size estimation. We propose a new definition for an effective sample size of correlated discrete data and derive an algorithm for estimating it from count data. The estimator is based on the assumption that, in limited-data regimes for sparsely coupled systems, the sample Mutual Information between random variables is dominated by random, coincidental correlations rather than actual correlations due to underlying interactions. This is consistent with classic results on the bias of information quantities in limited-data regimes, known as the “Miller–Madow bias” Miller (1955); Paninski (2003). If we can define a null model for how such coincidental correlations would arise for a given random sample of size N, then we define N_eff as the sample size that matches the expected null MI to the observed MI.

$$\mathbb{E}\left[\mathrm{MI}_{\mathrm{null}}(N_{\mathrm{eff}})\right] = \mathbb{E}\left[\mathrm{MI}_{\mathrm{data}}\right]$$
The expectation on the right is given by the average sample Mutual Information in the data, while the expectation on the left is specific to a null model for the Mutual Information at sample size N. Given a noisy estimator for the null Mutual Information, we solve for N_eff by matching the expectations with Robbins–Monro stochastic optimization.
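The Robbins–Monro matching step can be sketched as follows. This is an illustrative implementation under our own assumptions (names, step-size schedule, and the log-N parameterization are not from the paper); `null_mi_sampler` stands in for one noisy run of the null-MI sampler.

```python
import numpy as np

def match_neff(null_mi_sampler, target_mi, n0=100.0, iters=2000, seed=0):
    """Robbins-Monro sketch for solving E[MI_null(N)] = target_mi.

    null_mi_sampler(N, rng) returns one noisy draw of the null MI at
    sample size N. Because the null MI shrinks as N grows, we increase
    N whenever a draw exceeds the target. We optimize log N to keep N
    positive.
    """
    rng = np.random.default_rng(seed)
    log_n = np.log(n0)
    for t in range(1, iters + 1):
        draw = null_mi_sampler(np.exp(log_n), rng)
        # Relative mismatch, scaled by a Robbins-Monro 1/t step size:
        # if the null MI is still above the target, N was too small.
        log_n += (draw - target_mi) / (target_mi * t)
    return np.exp(log_n)
```

With a noise-free toy sampler whose null MI decays as 1/N, the iteration converges to the sample size at which the null MI equals the target.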

To define the null model of mutual information, we treat every variable as independent categorical counts drawn from a Dirichlet-Multinomial hierarchy with a log-uniform hyperprior over the (symmetric) concentration parameter α.

Given observed frequencies f_i and f_j for a pair of positions i and j, together with a candidate sample size N, we (i) use Bayes’ theorem to sample underlying distributions p_i, p_j that could have produced the observed frequencies, (ii) generate N samples from the null joint distribution p_i × p_j, and (iii) compute the sample Mutual Information of this synthetic count data (Algorithm 3).

We also experimented with using both MAP and posterior-mean estimators as plug-in approximations for the latent distributions, but found that each of these was a biased estimator of the true sample size in simulation. Posterior mode estimates generally underestimated the null entropy (too rough), while the posterior mean overestimated it (too smooth). It seems reasonable that point estimates would behave this way, since they do not account for the uncertainty in the null distributions that is signaled by the roughness of the frequency data.

We note that this estimator will become invalid as the data become strong, since the assumption that Mutual Information is dominated by sampling noise will break down. However, for the real protein data that we examined, we found that this effective sample size correction was critical for Bayesian methods such as Persistent VI to be able to set the top-level hyperparameters (the sparsity levels) from the data.

  Require: Sample size N
  Require: Observed frequencies f_i, f_j
  Sample positions i, j
  Set count data n_i = N f_i, n_j = N f_j
  Sample concentration parameter α with a numerical CDF
  Sample null distributions p_i, p_j from Dirichlet posteriors
  Sample N joint count data from the categorical joint distribution p_i × p_j
  Compute sample frequencies of the joint count data
  Compute the sample Mutual Information
Algorithm 3 Sample the null mutual information as a function of sample size
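The procedure can be illustrated with a rough Python sketch of Algorithm 3 under the assumptions stated above. The names (`sample_null_mi`, `fa`, `fb`) are ours, and the log-uniform hyperprior over α is discretized onto a grid for the numerical-CDF step.

```python
import numpy as np
from math import lgamma

def sample_null_mi(N, fa, fb, alphas=None, rng=None):
    """One draw of the null Mutual Information at sample size N (sketch).

    fa, fb: observed marginal frequency vectors at two positions.
    A symmetric Dirichlet concentration alpha is drawn from a
    discretized log-uniform posterior given the counts, the underlying
    marginals pa, pb are drawn from their Dirichlet posteriors, and N
    joint samples are drawn from the independent null pa x pb.
    """
    rng = np.random.default_rng() if rng is None else rng
    alphas = np.logspace(-2, 2, 50) if alphas is None else alphas
    N = int(round(N))
    na, nb = N * np.asarray(fa, float), N * np.asarray(fb, float)

    # Dirichlet-multinomial marginal likelihood of counts at one position.
    def log_dm(counts, a):
        K = len(counts)
        return (lgamma(K * a) - lgamma(N + K * a)
                + sum(lgamma(c + a) for c in counts) - K * lgamma(a))

    # Posterior over alpha on a log-spaced grid (log-uniform prior),
    # sampled from the normalized discrete distribution.
    logp = np.array([log_dm(na, a) + log_dm(nb, a) for a in alphas])
    p = np.exp(logp - logp.max())
    alpha = rng.choice(alphas, p=p / p.sum())

    # Posterior draws of the underlying marginals.
    pa = rng.dirichlet(na + alpha)
    pb = rng.dirichlet(nb + alpha)

    # N joint samples from the independent null, then plug-in MI.
    ia = rng.choice(len(pa), size=N, p=pa)
    ib = rng.choice(len(pb), size=N, p=pb)
    joint = np.zeros((len(pa), len(pb)))
    np.add.at(joint, (ia, ib), 1.0)
    joint /= N
    outer = joint.sum(1)[:, None] * joint.sum(0)[None, :]
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / outer[nz])))
```

Averaging many such draws at a candidate N gives the noisy estimate of the expected null MI used in the Robbins–Monro matching step.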

Inference and results

Alignment Our sequence alignment was based on the Pfam 27.0 family PF00018, which we subsequently processed to remove all sequences with more than 25% gaps.

Indels Natural sequences contain insertions and deletions that are coded by ‘gaps’ in alignments. We treated these as a 21st character (in addition to the 20 amino acids) and fit a 21-state Potts model. We acknowledge that, while this is standard practice in the field, it is a strong independence approximation because all of the gaps within a single deletion are perfectly correlated.

Inference We used 10,000 iterations of PVI-10 with 10 variational samples per iteration and 40 persistent Gibbs chains.

Comparison to 3D structure We collected the 3D structures of SH3 domains referenced by PF00018 (Pfam 27.0) and computed minimum atom distances between all pairs of positions in the Pfam alignment. For each pair of positions (i, j), we used the median of the distances across all structures to summarize the “typical” minimum atom distance between i and j.
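The median summary over structures can be sketched as follows, assuming the per-structure distance maps have already been computed and aligned (the function name and NaN convention for unresolved positions are our own):

```python
import numpy as np

def median_min_distances(dist_per_structure):
    """Summarize pairwise minimum-atom distances across structures (sketch).

    dist_per_structure: (S, L, L) array of minimum atom distances
    between aligned positions in S structures, with NaN where a
    position is unresolved. Returns the (L, L) per-pair median,
    ignoring missing entries.
    """
    return np.nanmedian(np.asarray(dist_per_structure, float), axis=0)
```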


  1. We exclude biases for simplicity.
  2. This is also a type of noncentered parameterization, but of the variational distribution rather than the posterior.
  3. Depending on the details of the MCMC and the community, these approaches are known as Boltzmann Learning, Stochastic Maximum Likelihood, or Persistent Contrastive Divergence Tieleman (2008).
  4. The ∘ operator denotes an element-wise product.
  5. Although “weak data” may seem unrepresentative of typical problems in machine learning, it is important to remember that a sufficiently large and expressive model can make most data weak.
  7. The term is optional in the sense that including it corresponds to averaging over the hyperparameters, whereas discarding it corresponds to optimizing the hyperparameters (Empirical Bayes). We included it for all experiments.
  8. Rather than attempting to explain Dropout, the intent is to lend intuition about noncentered scale-mixture VI.
  9. To see this, note that ℓ1-regularized MAP estimation is equivalent to integrating out a zero-mean Gaussian prior with unknown, exponentially-distributed variance.
  10. We find this effective sample size to mirror natural protein families (unpublished)
  11. , no decay


  1. Andrews, David F and Mallows, Colin L. Scale mixtures of normal distributions. Journal of the Royal Statistical Society. Series B (Methodological), pp. 99–102, 1974.
  2. Archer, Evan, Park, Il Memming, and Pillow, Jonathan W. Bayesian and quasi-bayesian estimators for mutual information from discrete data. Entropy, 15(5):1738–1755, 2013.
  3. Aurell, Erik and Ekeberg, Magnus. Inverse ising inference using all the data. Physical review letters, 108(9):090201, 2012.
  4. Balakrishnan, Sivaraman, Kamisetty, Hetunandan, Carbonell, Jaime G, Lee, Su-In, and Langmead, Christopher James. Learning generative models for protein fold families. Proteins: Structure, Function, and Bioinformatics, 79(4):1061–1078, 2011.
  5. Betancourt, MJ and Girolami, Mark. Hamiltonian monte carlo for hierarchical models. arXiv preprint arXiv:1312.0906, 2013.
  6. Carvalho, Carlos M, Polson, Nicholas G, and Scott, James G. The horseshoe estimator for sparse signals. Biometrika, pp. asq017, 2010.
  7. Chen, Yutian and Welling, Max. Bayesian structure learning for markov random fields with a spike and slab prior. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pp. 174–184. AUAI Press, 2012.
  8. Decelle, Aurélien and Ricci-Tersenghi, Federico. Pseudolikelihood decimation algorithm improving the inference of the interaction network in a general class of ising models. Physical review letters, 112(7):070603, 2014.
  9. Ekeberg, Magnus, Lövkvist, Cecilia, Lan, Yueheng, Weigt, Martin, and Aurell, Erik. Improved contact prediction in proteins: using pseudolikelihoods to infer potts models. Physical Review E, 87(1):012707, 2013.
  10. Eltoft, Torbjørn, Kim, Taesu, and Lee, Te-Won. On the multivariate laplace distribution. IEEE Signal Processing Letters, 13(5):300–303, 2006.
  11. Engelhardt, Barbara E and Adams, Ryan P. Bayesian structured sparsity from gaussian fields. arXiv preprint arXiv:1407.2235, 2014.
  12. Erdős, Paul and Rényi, A. On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci, 5:17–61, 1960.
  13. Figliuzzi, Matteo, Jacquier, Hervé, Schug, Alexander, Tenaillon, Oliver, and Weigt, Martin. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase tem-1. Molecular biology and evolution, pp. msv211, 2015.
  14. Gal, Yarin and Ghahramani, Zoubin. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1050–1059, 2016.
  15. Ghosh, Soumya and Doshi-Velez, Finale. Model selection in bayesian neural networks via horseshoe priors. arXiv preprint arXiv:1705.10388, 2017.
  16. Hernández-Lobato, Daniel, Hernández-Lobato, José Miguel, and Dupont, Pierre. Generalized spike-and-slab priors for bayesian group feature selection using expectation propagation. Journal of Machine Learning Research, 14(1):1891–1945, 2013.
  17. Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  18. Hopf, Thomas A, Ingraham, John B, Poelwijk, Frank J, Schärfe, Charlotta PI, Springer, Michael, Sander, Chris, and Marks, Debora S. Mutation effects predicted from sequence co-variation. Nature biotechnology, 35(2):128–135, 2017.
  19. Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  20. Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  21. Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
  22. Kingma, DP, Salimans, T, and Welling, M. Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems, 28:2575–2583, 2015.
  23. Koller, Daphne and Friedman, Nir. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  24. Kucukelbir, Alp, Tran, Dustin, Ranganath, Rajesh, Gelman, Andrew, and Blei, David M. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017.
  25. Louizos, Christos. Smart regularization of deep architectures. Master’s thesis, University of Amsterdam, 2015.
  26. Louizos, Christos, Ullrich, Karen, and Welling, Max. Bayesian compression for deep learning. arXiv preprint arXiv:1705.08665, 2017.
  27. MacKay, David JC. Hyperparameters: Optimize, or integrate out? In Maximum entropy and bayesian methods, pp. 43–59. Springer, 1996.
  28. MacKay, David JC. Information theory, inference and learning algorithms. Cambridge university press, 2003.
  29. MacKay, David JC et al. Bayesian nonlinear modeling for the prediction competition. ASHRAE transactions, 100(2):1053–1062, 1994.
  30. Macke, Jakob H, Murray, Iain, and Latham, Peter E. How biased are maximum entropy models? In Advances in Neural Information Processing Systems, pp. 2034–2042, 2011.
  31. Marks, Debora S, Colwell, Lucy J, Sheridan, Robert, Hopf, Thomas A, Pagnani, Andrea, Zecchina, Riccardo, and Sander, Chris. Protein 3d structure computed from evolutionary sequence variation. PloS one, 6(12):e28766, 2011.
  32. Marks, Debora S, Hopf, Thomas A, and Sander, Chris. Protein structure prediction from sequence variation. Nature biotechnology, 30(11):1072–1080, 2012.
  33. Miller, George A. Note on the bias of information estimates. Information theory in psychology: Problems and methods, 2(95):100, 1955.
  34. Mitchell, Toby J and Beauchamp, John J. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
  35. Mohamed, Shakir, Ghahramani, Zoubin, and Heller, Katherine A. Bayesian and l1 approaches for sparse unsupervised learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 751–758, 2012.
  36. Morcos, Faruck, Pagnani, Andrea, Lunt, Bryan, Bertolino, Arianna, Marks, Debora S, Sander, Chris, Zecchina, Riccardo, Onuchic, José N, Hwa, Terence, and Weigt, Martin. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences, 108(49):E1293–E1301, 2011.
  37. Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.
  38. Murray, Iain and Ghahramani, Zoubin. Bayesian learning in undirected graphical models: approximate mcmc algorithms. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pp. 392–399. AUAI Press, 2004.
  39. Murray, Iain, Ghahramani, Zoubin, and MacKay, David JC. Mcmc for doubly-intractable distributions. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pp. 359–366. AUAI Press, 2006.
  40. Nemenman, Ilya, Shafee, Fariel, and Bialek, William. Entropy and inference, revisited. Advances in neural information processing systems, 1:471–478, 2002.
  41. Nishimori, Hidetoshi. Statistical physics of spin glasses and information processing: an introduction. Number 111. Clarendon Press, 2001.
  42. Paninski, Liam. Estimation of entropy and mutual information. Neural computation, 15(6):1191–1253, 2003.
  43. Papaspiliopoulos, Omiros, Roberts, Gareth O, and Sköld, Martin. A general framework for the parametrization of hierarchical models. Statistical Science, pp. 59–73, 2007.
  44. Ranganath, Rajesh, Gerrish, Sean, and Blei, David. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp. 814–822, 2014.
  45. Rezende, Danilo and Mohamed, Shakir. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1530–1538, 2015.
  46. Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of The 31st International Conference on Machine Learning, pp. 1278–1286, 2014.
  47. Schmidt, Mark. Graphical model structure learning with l1-regularization. PhD thesis, University of British Columbia, Vancouver, 2010.
  48. Schneidman, Elad, Berry, Michael J, Segev, Ronen, and Bialek, William. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007–1012, 2006.
  49. Sherrington, David and Kirkpatrick, Scott. Solvable model of a spin-glass. Physical review letters, 35(26):1792, 1975.
  50. Shlens, Jonathon, Field, Greg D, Gauthier, Jeffrey L, Grivich, Matthew I, Petrusca, Dumitru, Sher, Alexander, Litke, Alan M, and Chichilnisky, EJ. The structure of multi-neuron firing patterns in primate retina. The Journal of neuroscience, 26(32):8254–8266, 2006.
  51. Sohl-Dickstein, Jascha, Battaglino, Peter B, and DeWeese, Michael R. New method for parameter estimation in probabilistic models: minimum probability flow. Physical review letters, 107(22):220601, 2011.
  52. Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  53. Swendsen, Robert H and Wang, Jian-Sheng. Nonuniversal critical dynamics in monte carlo simulations. Physical review letters, 58(2):86, 1987.
  54. Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
  55. Tieleman, Tijmen. Training restricted boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pp. 1064–1071. ACM, 2008.
  56. Tipping, Michael E. Sparse bayesian learning and the relevance vector machine. The journal of machine learning research, 1:211–244, 2001.
  57. Titsias, Michalis and Lázaro-Gredilla, Miguel. Doubly stochastic variational bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1971–1979, 2014.
  58. Tran, Dustin, Ranganath, Rajesh, and Blei, David M. The variational gaussian process. In Proceedings of the International Conference on Learning Representations, 2016.
  59. Younes, Laurent. Parametric inference for imperfectly observed gibbsian fields. Probability theory and related fields, 82(4):625–645, 1989.