Variational Inference for Sparse and Undirected Models
Abstract
Undirected graphical models are applied in genomics, protein structure prediction, and neuroscience to identify sparse interactions that underlie discrete data. Although Bayesian methods for inference would be favorable in these contexts, they are rarely used because they require doubly intractable Monte Carlo sampling. Here, we develop a framework for scalable Bayesian inference of discrete undirected models based on two new methods. The first is Persistent VI, an algorithm for variational inference of discrete undirected models that avoids doubly intractable MCMC and approximations of the partition function. The second is Fadeout, a reparameterization approach for variational inference under sparsityinducing priors that captures a posteriori correlations between parameters and hyperparameters with noncentered parameterizations. We find that, together, these methods for variational inference substantially improve learning of sparse undirected graphical models in simulated and real problems from physics and biology.
1 Introduction
Hierarchical priors that favor sparsity have been a central development in modern statistics and machine learning, and find widespread use for variable selection in biology, engineering, and economics. Among the most widely used and successful approaches for inference of sparse models has been regularization, which, after introduction in the context of linear models with the LASSO Tibshirani (1996), has become the standard tool for both directed and undirected models alike Murphy (2012).
Despite its success, however, is a pragmatic compromise. As the closest convex approximation of the idealized norm, regularization cannot model the hypothesis of sparsity as well as some Bayesian alternatives Tipping (2001). Two Bayesian approaches stand out as more accurate models of sparsity than . The first, the spike and slab Mitchell & Beauchamp (1988), introduces discrete latent variables that directly model the presence or absence of each parameter. This discrete approach is the most direct and accurate representation of a sparsity hypothesis Mohamed et al. (2012), but the discrete latent space that it imposes is often computationally intractable for models where Bayesian inference is difficult.
The second approach to Bayesian sparsity uses the scale mixtures of normals Andrews & Mallows (1974), a family of distributions that arise from integrating a zero meanGaussian over an unknown variance as
(1) 
Scalemixtures of normals can approximate the discrete spike and slab prior by mixing both large and small values of the variance . The implicit prior of regularization, the Laplacian, is a member of the scale mixture family that results from an exponentially distributed variance . Thus, mixing densities with subexponential tails and more mass near the origin more accurately model sparsity than and are the basis for approaches often referred to as “Sparse Bayesian Learning” Tipping (2001). Both the Student of Automatic Relevance Determination (ARD) MacKay et al. (1994) and the Horseshoe prior Carvalho et al. (2010) incorporate these properties.
Applying these favorable, Bayesian approaches to sparsity has been particularly challenging for discrete, undirected models like Boltzmann Machines. Undirected models possess a representational advantage of capturing ‘collective phenomena’ with no directions of causality, but their likelihoods require an intractable normalizing constant Murray & Ghahramani (2004). For a fully observed Boltzmann Machine with the distribution
(2) 
where the partition function depends on the couplings. Whenever a new set of couplings are considered during inference, the partition function and corresponding density must be reevaluated. This requirement for an an intractable calculation embedded within alreadyintractable nonconjugate inference has led some to term Bayesian learning of undirected graphical models “doubly intractable” Murray et al. (2006). When all patterns of discrete spike and slab sparsity are added on top of this, we might call this problem “triply intractable” (Figure 1). Tripleintractability does not mean that this problem is impossible, but it will typically require expensive approaches based on MCMCwithinMCMC Chen & Welling (2012).
Here we present an alternative to MCMCbased approaches for learning undirected models with sparse priors based on stochastic variational inference Hoffman et al. (2013). We combine three ideas: (i) stochastic gradient variational Bayes Kingma & Welling (2014); Rezende et al. (2014); Titsias & LázaroGredilla (2014)

We extend stochastic variational inference to undirected models with intractable normalizing constants by developing a learning algorithm based on persistent Markov chains, which we call Persistent Variational Inference (PVI) (Section 2).

We introduce a reparameterization approach for variational inference under sparsityinducing scalemixture priors (e.g. the Laplacian, ARD, and the Horseshoe) that significantly improves approximation quality by capturing scale uncertainty (Section 3). When combined with Gaussian stochastic variational inference, we call this Fadeout.

We demonstrate how a Bayesian approach for learning sparse undirected graphical models with PVI and Fadeout yields significantly improved inferences of both synthetic and real applications in physics and biology (Section 4).
2 Persistent Variational Inference
Background: Learning in undirected models Undirected graphical models, also known as Markov Random Fields, can be written in loglinear form as
(3) 
where indexes a set of features and the partition function normalizes the distribution Koller & Friedman (2009). Maximum Likelihood inference selects parameters that maximize the probability of data by ascending the gradient of the (averaged) log likelihood
(4) 
The first term in the gradient is a datadependent average of feature over , while the second term is a dataindependent average of feature over the model distribution that often requires sampling Murphy (2012)
Bayesian learning for undirected models is confounded by the partition function . Given the data , a prior , and the log potentials , the posterior distribution of the parameters is
(5) 
which contains an intractable partition function within the alreadyintractable evidence term. As a result, most algorithms for Bayesian learning of undirected models require either doublyintractable MCMC and/or approximations of the likelihood .
A tractable estimator for ELBO of undirected models Here we consider how to approximate the intractable posterior in (5) without approximating the partition function or the likelihood by using variational inference. Variational inference recasts inference with as an optimization problem of finding a variational distribution that is closest to as measured by KL divergence Jordan et al. (1999). This can be accomplished by maximizing the Evidence Lower BOund
(6) 
For scalability, we would like to optimize the ELBO with methods that can leverage Monte Carlo estimators of the gradient . One possible strategy for this would be would be to develop an estimator based on the score function Ranganath et al. (2014) with a MonteCarlo approximation of
(7) 
Naively substituting the likelihood (3) in the score function estimator (7) nests the intractable log partition function within the average over , making this an untenable (and extremely high variance) approach to inference with undirected models.
We can avoid the need for a scorefunction estimator with the ‘reparameterization trick’ Kingma & Welling (2014); Rezende et al. (2014); Titsias & LázaroGredilla (2014) that has been incredibly useful for directed models. Consider a variational approximation that is a fully factorized (mean field) Gaussian with means and log standard deviations . The ELBO expectations under can be rewritten as expectations wrt an independent noise source where
(8)  
(9) 
Because these expectations require only the gradient of the likelihood , the gradient for the undirected model (4) can be substituted to form a nested expectation for . This can then be used as a Monte Carlo gradient estimator by sampling .
Persistent gradient estimation In Stochastic Maximum Likelihood estimation for undirected models, the intractable gradients of (4) are estimated by sampling . Although samplingbased approaches are slow, they can be made considerably more efficient by running a set of Markov chains in parallel with state that persists between iterations Younes (1989). Persistent state maintains the Markov chains near their equilibrium distributions, which means that they can quickly reequilibrate after perturbations to the parameters during learning.
We propose variational inference in undirected models based on persistent gradient estimation of and refer to this as Persistent Variational Inference (PVI) (Algorithm in Appendix). Following the notation of PCD Tieleman (2008), PVI refers to using sweeps of Gibbs sampling with persistent Markov chains between iterations. This approach is generally compatible with any estimators of ELBO that are based on the gradient of the log likelihood, several examples of which are explained in Kingma & Welling (2014); Rezende et al. (2014); Titsias & LázaroGredilla (2014).
Behavior of the solution for Gaussian When the variational approximation is a fully factorized Gaussian and the prior is flat , the solution to will satisfy
(10) 
where is an extended system of the original undirected model in which the parameters fluctuate according to the variational distribution. This bridges to the Maximum Likelihood solution as and , while accounting for uncertainty in the parameters at finite sample sizes with the inverse of ‘sensitivity’ .
3 Fadeout
Prior  Hyperprior  

Gaussian ()  constant  
Laplacian ()  
Student (ARD)  
Horseshoe 
3.1 Noncentered Parameterizations of Hierarchical Priors
Hierarchical models are powerful because they impose a priori correlations between latent variables that reflect problemspecific knowledge. For scalemixture priors that promote sparsity, these correlations come in the form of scale uncertainty. Instead of assuming that the scale of a parameter in a model is known a priori, we posit that it is normally distributed with a randomly distributed variance . The joint prior gives rise to a strongly curved ‘funnel’ shape (Figure 2) that illustrates a simple but profound principle about hierarchical models: as the hyperparameter decreases and the prior accepts a smaller range of values for , normalization increases the probability density at the origin, favoring sparsity. This normalizationinduced sharpening has been called called a Bayesian Occam’s Razor MacKay (2003).
While normalizationinduced sharpening gives rise to sparsity, these extreme correlations are a disaster for meanfield variational inference. Even if a tremendous amount of probability mass is concentrated at the base of the funnel, an uncorrelated meanfield approximation will yield estimates near the top. The result is a potentially nonsparse estimate from a verysparse prior.
The strong coupling of hierarchical funnels also plagues exact methods based on MCMC with slow mixing, but the statistics community has found that these geometry pathologies can be effectively managed by transformations. Many models can be rewritten in a noncentered form where the parameters and hyperparmeters are a priori independen Papaspiliopoulos et al. (2007); Betancourt & Girolami (2013). For the scalemixtures of normals, this change of variables is
(11) 
Then while preserving . In noncentered form, the joint prior is independent and well approximated by a meanfield Gaussian, while the likelihood will be variably correlated depending on the strength of the data (Figure 2). In this sense, centered parameterizations (CP) and noncentered parameterizations (NCP) are usually framed as favorable in strong and weak data regimes, respectively.
We propose the use of noncentered parameterizations of scalemixture priors for meanfield Gaussian variational inference. For convenience, we like to call this Fadeout (see next section). Fadeout can be easily implemented by either (i) using the chain rule to derive the gradient of the Evidence Lower BOund (ELBO) (Algorithm 1) or, for differentiable models, (ii) rewriting models in noncentered form and using automatic differentiation tools such as Stan Kucukelbir et al. (2017) or autograd
Estimators for the centered posterior. Fadeout optimizes a meanfield Gaussian variational distribution over the noncentered parameters . As an estimator for the centered parameters, we use the meanfield property to compute the centered posterior mean as , giving
(12) 
3.2 Connection to Dropout
Dropout regularizes neural networks by perturbing hidden units in a directed network with multiplicative Bernoulli or Gaussian noise Srivastava et al. (2014). Although it was originally framed as a heuristic, Dropout has been subsequently interpreted as variational inference under at least two different schemes Gal & Ghahramani (2016); Kingma et al. (2015). Here, we interpret Fadeout the reverse way, where we introduced it as variational inference and now notice that it looks similar to lognormal Dropout.
This is the gradient estimator for a lognormal version of Dropout with an weight penalty of . At each sample from the variational distribution, Fadeout introduces scale noise rather than the Bernoulli noise of Dropout. The connection to Dropout would seem to follow naturally from the common interpretation of scale mixtures as continuous relaxations of spike and slab priors Engelhardt & Adams (2014) and the idea that Dropout can be related to variational spike and slab inference Louizos (2015).
4 Experiments
4.1 Physics: Inferring Spin Models
Ising model The Ising model is a prototypical undirected model for binary systems that includes both pairwise interactions and (potentially) sitewise biases. It can be seen as the fully observed case of the Boltzmann machine, and is typically parameterized with signed spins and a likelihood given by
(13) 
Originally proposed as a minimal model of how long range order arises in magnets, it continues to find application in physics and biology as a model for phase transitions and quenched disorder in spin glasses Nishimori (2001) and collective firing patterns in neural spike trains Schneidman et al. (2006); Shlens et al. (2006).
Hierarchical sparsity prior One appealing feature of the Ising model is that it allows a sparse set of underlying couplings to give rise to longrange, distributed correlations across a system. Since many physical systems are thought to be dominated by a small number of relevant interactions, regularization has been a favored approach for inferring Ising models. Here, we examine how a more accurate model of sparsity based on the Horseshoe prior (Figure 3) can improve inferences in these systems. Each coupling and bias parameter is given its own scale parameter which are in turn tied under a global HalfCauchy prior for the scales (Figure 3, Appendix).
Simulated datasets We generated synthetic couplings for two kinds of Ising systems: (i) a slightly subcritical cubic ferromagnet ( for neighboring spins) and (ii) a SherringtonKirkpatrick spin glass diluted on an ErdösRenyi random graph with average degree 2. We sampled synthetic data for each system with the SwendsenWang algorithm (Appendix) Swendsen & Wang (1987).
Results On both the ferromagnet and the spin glass, we found that Persistent VI with a noncentered Horseshoe prior (Fadeout) gave estimates with systematically lower reconstruction error of the couplings (Figure 4) versus a variety of standard methods in the field (Appendix).
4.2 Biology: Reconstructing 3D Contacts in Proteins from Sequence Variation
Potts model The Potts model generalizes the Ising model to nonbinary categorical data. The factor graph is the same (Figure 3), except each spin can adopt different categories with and each is a matrix as
(14) 
The Potts model has recently generated considerable excitement in biology, where it has been used to infer 3D contacts in biological molecules solely from patterns of correlated mutations in the sequences that encode them Marks et al. (2011); Morcos et al. (2011). These contacts are have been sufficient to predict the 3D structures of proteins, protein complexes, and RNAs Marks et al. (2012).
Group sparsity Each pairwise factor in a Potts model contains parameters capturing all possible joint configurations of and . One natural way to enforce sparsity in a Potts model is at the level of each group. This can be accomplished by introducing a single scale parameter for all zscores . We adopt this with the same HalfCauchy hyperprior as the Ising problem, giving the same factor graph (Figure 3) now corresponding to a Group Horseshoe prior HernándezLobato et al. (2013). In the real protein experiment, we also consider an exponential hyperprior, which corresponds to a Multivariate Laplace distribution Eltoft et al. (2006) over the groups.
Synthetic protein data We first investigated the performance of Persistent VI with group sparsity on a synthetic protein experiment. We constructed a synthetic Potts spin glass with a topology inspired by biological macromolecules. We generated synthetic parameters based on contacts in a simulated polymer and sampled 2000 sequences with steps of Gibbs sampling (Appendix).
Results for a synthetic protein We inferred couplings with 400 of the sampled sequences using PVI with group sparsity and two standard methods of the field: and Group regularized maximum pseudolikelihood (Appendix). PVI with a noncentered Horseshoe yielded more accurate (Figure 5, right), less shrunk (Figure 5, left) estimates of interactions that were more predictive of the 1600 remaining test sequences (Table 2). The ability to generalize well to new sequences will likely be important to the related problem of predicting mutation effects with unsupervised models of sequence variation Hopf et al. (2017); Figliuzzi et al. (2015).
Method  Runtime (s)  

PL, (5xCV)  67.3  375 
PL, Group (5xCV)  59.6  303 
PVI3, HalfCauchy  54.2  585 
Results for natural sequence variation We applied the hierarchical Bayesian model from the protein simulation to model acrossspecies amino acid covariation in the SH3 domain family (Figure 6). Transitioning from simulated to real protein data is particularly challenging for Bayesian methods because available sequence data are highly nonindependent due to a shared evolutionary history. We developed a new method for estimating the effective sample size (Appendix) which, when combined standard sequence reweighting techniques, yielded a reweighted effective sample size of 1,012 from 10,209 sequences.
The hierarchical Bayesian approach gave highly localized, sparse estimates of interactions compared to the two predominant methods in the field, and group regularized pseudolikelihood (Figure 6). When compared to solved 3D structures for SH3 (Appendix), we found that the inferred interactions were considerably more accurate at predicting amino acids close in structure. Importantly, the hierarchical Bayesian approach accomplished this inference of strong, accurate interactions without a need to prespecify hyperparameters such as for or regularization. This is particularly important for natural biological sequences because the nonindependence of samples limits the utility of cross validation for setting hyperparameters.
5 Related work
5.1 Variational Inference
One strategy for improving variational inference is to introduce correlations in variational distribution by geometric transformations. This can be made particularly powerful by using backpropagation to learn compositions of transformations that capture the geometry of complex posteriors Rezende & Mohamed (2015); Tran et al. (2016). Noncentered parameterizations of models may be complementary to these approaches by enabling more efficient representations of correlations between parameters and hyperparameters.
5.2 Maximum Entropy
Much of the work on inference of undirected graphical models has gone under the name of the Maximum Entropy method in physics and neuroscience, which can be equivalently formulated as maximum likelihood in an exponential family MacKay (2003). From this maximum likelihood interpretation, regularizedmaximum entropy modeling (MaxEnt) corresponds to the disfavored “integrateout” approach to inference in hierarchical models
6 Conclusion
We introduced a framework for scalable Bayesian sparsity for undirected graphical models composed of two methods. The first is an extension of stochastic variational inference to work with undirected graphical models that uses persistent gradient estimation to bypass estimating partition functions. The second is a variational approach designed to match the geometry of hierarchical, sparsitypromoting priors. We found that, when combined, these two methods give substantially improved inferences of undirected graphical models on both simulated and real systems from physics and computational biology.
Acknowledgements
We thank David Duvenaud, Finale DoshiVelez, Miriam Huntley, Chris Sander, and members of the Marks lab for helpful comments and discussions. JBI was supported by a NSF Graduate Research Fellowship DGE1144152 and DSM by NIH grant 1R01GM106303. Portions of this work were conducted on the Orchestra HPC Cluster at Harvard Medical School.
Appendix A Appendix I: PVI algorithm
See Algorithm 2.
Appendix B Appendix II: Experiments
b.1 Spin Models
We generated two synthetic systems. The first system was ferromagnetic (all ) with 64 spins, where neighboring spins , have a nonzero interaction of if adjacent on a periodic lattice. This coupling strength equates to being slightly above the critical temperature, meaning the system will be highly correlated despite the underlying interactions being only nearestneighbor.
The second system was a diluted SherringtonKirkpatrick spin glass Sherrington & Kirkpatrick (1975); Aurell & Ekeberg (2012) with 100 spins. The couplings in this model were defined by ErdősRenyi random graphs Erdős & Rényi (1960) with nonzero edge weights distributed as where is the average degree. We generated 5 random systems where the average degree was . Across all of the systems, we used SwendsenWang sampling Swendsen & Wang (1987) to sample synthetic data and checked that the sampling was sufficient to eliminate autocorrelation in the data.
For inference, we tested both regularized deterministic approaches as well as a variational approach based on Persistent VI. The regularized approaches included Pseudolikelihood, (PL) Aurell & Ekeberg (2012), Minimum Probability Flow (MPF) SohlDickstein et al. (2011), and Persistent Contrastive Divergence (PCD) Tieleman (2008). Additionally, we tested the proposed alternative regularization method of Pseudolikelihood Decimation Decelle & RicciTersenghi (2014).
For regularized Pseudolikelihood and Minimum Probability Flow, we selected the hyperparameter using 10fold crossvalidation over 10 logarithmically spaced values on the interval . We performed regularization of the deterministic objectives using optimizers from Schmidt (2010), and chose the corresponding hyperparameter for PCD + based on the optimal crossvalidated value of that was selected for regularized Pseudolikelihood.
For the hierarchical model inferred with Persisent VI, we placed a separate noncentered Horseshoe prior over the fields and couplings, in accodance with the (centered) hierarchy
where is the standard HalfCauchy distribution. We then used PVI3 with 100 persistent Markov chains and performed stochastic gradient descent using Adam Kingma & Ba (2014) with default momentum and a learning rate that linearly decayed from 0.01 to 0 over iterations.
b.2 Synthetic Protein Data
We constructed a synthetic Potts spin glass with sparse interactions chosen to reflect an underlying 3D structure. After forming a contact topology from a random packed polymer, we generated synthetic groupStudentt distributed sitewise bias vectors (each ) and Gaussian distributed coupling matrices (each ) to mirror the strong sitebias and weakcoupling regime of proteins. Since this system is highly frustrated, we thinned sweeps of Gibbs sampling to 2000 sequences that exhibited no autocorrelation.
Given 400 of the 2000 synthetic sequences
b.3 Real Protein Data
Sample reweighting
Natural protein sequences share a common evolutionary history that introduces significant redundancy and correlation between related sequences. Treating them as independent data is biased by both the overrepresentation of certain sequences due to the evolutionary process (phylogeny) or the human sampling process (biased sequencing of particular species). Thus, we follow a standard practice of correcting the overrepresentation of sequences by a samplereweighting approach Ekeberg et al. (2013).
Sequence reweighting. If we were to treat all data as independent, then every sample would receive unit weight in the log likelihood. To correct for the over and underrepresentation of certain sequences, we estimate relative sequence weights using a common inverse neighborhood density based approach from the field Ekeberg et al. (2013). We set the relative weight of each sequence proportional to the inverse number of neighboring sequences that differ by a normalized Hamming distance of less than . We use the established value of .
Effective sample size estimation. We propose a new definition for an effective sample size of correlated discrete data and derive an algorithm for estimating it from count data. The estimator is based on the assumption that in limited data regimes for sparsely coupled systems, the sample Mutual Information between random variables is dominated by random, coincidental correlations rather than actual correlations due to underlying interactions. This is consistent with classic results on the bias of information quantities in limited data regimes known as “Miller Maddow bias” Miller (1955); Paninski (2003). If we can define a null model for how such coincidental correlations would arise for a given random sample of size , then we define as the sample size that matches the expected null MI to the observed MI.
(15) 
The expectation on the right is given by the average sample Mutual Information in the data, while the expectation on the left will be specific to a null model for Mutual Information . Given a noisy estimator for , we solve for by matching the expectations with RobbinsMonro stochastic optimization.
To define the null model of mutual information we treat every variable as independent categorical counts that were drawn from a DirichletMultinomial hierarchy with a loguniform hyperprior over the (symmetric) concentration parameter .
Given observed frequencies and for letters and together with a candidate sample size , we (i) use Bayes’ theorem to sample underlying distributions , that produced the observed frequencies, (ii) generate samples from the null joint distribution , and (iii) compute the sample Mutual Information of this synthetic count data (Algorithm 3).
We also experimented with using both MAP and posterior mean estimators as plugin approximations , for the latent distributions, but found that each of these were biased estimators of the true sample size in simulation. Posterior mode estimates generally underestimated the null entropy ( too rough) while the posterior mean overestimated the entropy ( too smooth). It seems reasonable that this would be the behavior of point estimates that do not account for the uncertainty in the null distributions that is signaled by the roughness of the frequency data.
We note that this estimator will become invalid as the data become strong, since the assumption that Mutual Information is dominated by sampling noise will break down. However, for the real protein data that we examined, we found that this approach for effective sample size correction was critical for Bayesian methods such as Peristent VI to be able to set the top level hyperparameters (the sparsity levels) from the data.
Inference and results
Alignment Our sequence alignment was based on the Pfam 27.0 family PF00018, which we subsequently processed to remove all sequences with more than 25% gaps.
Indels Natural sequences contain insertions and deletions that are coded by ‘gaps’ in alignments. We treated these as a 21st character (in addition to amino acids) and fit a state Potts model. We acknowledge that, while this may be standard practice in the field, it is a strong independence approximation because all of the gaps in deletions are perfectly correlated.
Inference We used 10,000 iterations of PVI10 with 10 variational samples per iteration and 40 persistent Gibbs chains.
Comparison to 3D structure We collected about 3D structures of SH3 domains referenced on PF00018 (Pfam 27.0) and computed minimum atom distances between all positions in the Pfam alignment. For each pair , we used the median of distances across all structures to summarize the “typical” minimum atom distance between and .
Footnotes
 We exclude biases for simplicity.
 This is also a type of noncentered parameterization, but of the variational distribution rather than the posterior.
 Depending on the details of the MCMC and the community these approaches are known as Boltzmann Learning, Stochastic Maximum Likelihood, or Persistent Contrastive Divergence Tieleman (2008).
 The operator is an elementwise product.
 Although “weak data” may seem unrepresentative of typical problems in machine learning, it is important to remember that a sufficiently large and expressive model can make most data weak.
 github.com/HIPS/autograd
 The term is optional in the sense that including it corresponds to averaging over the hyperparameters, whereas discarding it corresponds to optimizing the hyperparameters (Empirical Bayes). We included it for all experiments.
 Rather than attempting to explain Dropout, the intent is to lend intuition about noncentered scalemixture VI.
 To see this, note that regularized MAP estimation is equivalent to integrating out a zeromean Gaussian prior with unknown, exponentiallydistributed variance
 We find this effective sample size to mirror natural protein families (unpublished)
 , no decay
References
 Andrews, David F and Mallows, Colin L. Scale mixtures of normal distributions. Journal of the Royal Statistical Society. Series B (Methodological), pp. 99–102, 1974.
 Archer, Evan, Park, Il Memming, and Pillow, Jonathan W. Bayesian and quasibayesian estimators for mutual information from discrete data. Entropy, 15(5):1738–1755, 2013.
 Aurell, Erik and Ekeberg, Magnus. Inverse ising inference using all the data. Physical review letters, 108(9):090201, 2012.
 Balakrishnan, Sivaraman, Kamisetty, Hetunandan, Carbonell, Jaime G, Lee, SuIn, and Langmead, Christopher James. Learning generative models for protein fold families. Proteins: Structure, Function, and Bioinformatics, 79(4):1061–1078, 2011.
 Betancourt, MJ and Girolami, Mark. Hamiltonian monte carlo for hierarchical models. arXiv preprint arXiv:1312.0906, 2013.
 Carvalho, Carlos M, Polson, Nicholas G, and Scott, James G. The horseshoe estimator for sparse signals. Biometrika, pp. asq017, 2010.
 Chen, Yutian and Welling, Max. Bayesian structure learning for markov random fields with a spike and slab prior. In Proceedings of the TwentyEighth Conference on Uncertainty in Artificial Intelligence, pp. 174–184. AUAI Press, 2012.
 Decelle, Aurélien and RicciTersenghi, Federico. Pseudolikelihood decimation algorithm improving the inference of the interaction network in a general class of ising models. Physical review letters, 112(7):070603, 2014.
 Ekeberg, Magnus, Lövkvist, Cecilia, Lan, Yueheng, Weigt, Martin, and Aurell, Erik. Improved contact prediction in proteins: using pseudolikelihoods to infer potts models. Physical Review E, 87(1):012707, 2013.
 Eltoft, Torbjørn, Kim, Taesu, and Lee, TeWon. On the multivariate laplace distribution. IEEE Signal Processing Letters, 13(5):300–303, 2006.
 Engelhardt, Barbara E and Adams, Ryan P. Bayesian structured sparsity from gaussian fields. arXiv preprint arXiv:1407.2235, 2014.
 Erdős, Paul and Rényi, A. On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci, 5:17–61, 1960.
 Figliuzzi, Matteo, Jacquier, Hervé, Schug, Alexander, Tenaillon, Oliver, and Weigt, Martin. Coevolutionary landscape inference and the contextdependence of mutations in betalactamase tem1. Molecular biology and evolution, pp. msv211, 2015.
 Gal, Yarin and Ghahramani, Zoubin. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1050–1059, 2016.
 Ghosh, Soumya and DoshiVelez, Finale. Model selection in bayesian neural networks via horseshoe priors. arXiv preprint arXiv:1705.10388, 2017.
 HernándezLobato, Daniel, HernándezLobato, José Miguel, and Dupont, Pierre. Generalized spikeandslab priors for bayesian group feature selection using expectation propagation. Journal of Machine Learning Research, 14(1):1891–1945, 2013.
 Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 Hopf, Thomas A, Ingraham, John B, Poelwijk, Frank J, Schärfe, Charlotta PI, Springer, Michael, Sander, Chris, and Marks, Debora S. Mutation effects predicted from sequence covariation. Nature biotechnology, 35(2):128–135, 2017.
 Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
 Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma, Diederik P and Welling, Max. Autoencoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
 Kingma, DP, Salimans, T, and Welling, M. Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems, 28:2575–2583, 2015.
 Koller, Daphne and Friedman, Nir. Probabilistic graphical models: principles and techniques. MIT press, 2009.
 Kucukelbir, Alp, Tran, Dustin, Ranganath, Rajesh, Gelman, Andrew, and Blei, David M. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017.
 Louizos, Christos. Smart regularization of deep architectures. Master’s thesis, University of Amsterdam, 2015.
 Louizos, Christos, Ullrich, Karen, and Welling, Max. Bayesian compression for deep learning. arXiv preprint arXiv:1705.08665, 2017.
 MacKay, David JC. Hyperparameters: Optimize, or integrate out? In Maximum entropy and bayesian methods, pp. 43–59. Springer, 1996.
 MacKay, David JC. Information theory, inference and learning algorithms. Cambridge university press, 2003.
 MacKay, David JC et al. Bayesian nonlinear modeling for the prediction competition. ASHRAE transactions, 100(2):1053–1062, 1994.
 Macke, Jakob H, Murray, Iain, and Latham, Peter E. How biased are maximum entropy models? In Advances in Neural Information Processing Systems, pp. 2034–2042, 2011.
 Marks, Debora S, Colwell, Lucy J, Sheridan, Robert, Hopf, Thomas A, Pagnani, Andrea, Zecchina, Riccardo, and Sander, Chris. Protein 3d structure computed from evolutionary sequence variation. PloS one, 6(12):e28766, 2011.
 Marks, Debora S, Hopf, Thomas A, and Sander, Chris. Protein structure prediction from sequence variation. Nature biotechnology, 30(11):1072–1080, 2012.
 Miller, George A. Note on the bias of information estimates. Information theory in psychology: Problems and methods, 2(95):100, 1955.
 Mitchell, Toby J and Beauchamp, John J. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
 Mohamed, Shakir, Ghahramani, Zoubin, and Heller, Katherine A. Bayesian and l1 approaches for sparse unsupervised learning. In Proceedings of the 29th International Conference on Machine Learning (ICML12), pp. 751–758, 2012.
 Morcos, Faruck, Pagnani, Andrea, Lunt, Bryan, Bertolino, Arianna, Marks, Debora S, Sander, Chris, Zecchina, Riccardo, Onuchic, José N, Hwa, Terence, and Weigt, Martin. Directcoupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences, 108(49):E1293–E1301, 2011.
 Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.
 Murray, Iain and Ghahramani, Zoubin. Bayesian learning in undirected graphical models: approximate mcmc algorithms. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pp. 392–399. AUAI Press, 2004.
 Murray, Iain, Ghahramani, Zoubin, and MacKay, David JC. Mcmc for doublyintractable distributions. In Proceedings of the TwentySecond Conference on Uncertainty in Artificial Intelligence, pp. 359–366. AUAI Press, 2006.
 Nemenman, Ilya, Shafee, Fariel, and Bialek, William. Entropy and inference, revisited. Advances in neural information processing systems, 1:471–478, 2002.
 Nishimori, Hidetoshi. Statistical physics of spin glasses and information processing: an introduction. Number 111. Clarendon Press, 2001.
 Paninski, Liam. Estimation of entropy and mutual information. Neural computation, 15(6):1191–1253, 2003.
 Papaspiliopoulos, Omiros, Roberts, Gareth O, and Sköld, Martin. A general framework for the parametrization of hierarchical models. Statistical Science, pp. 59–73, 2007.
 Ranganath, Rajesh, Gerrish, Sean, and Blei, David. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp. 814–822, 2014.
 Rezende, Danilo and Mohamed, Shakir. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1530–1538, 2015.
 Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of The 31st International Conference on Machine Learning, pp. 1278–1286, 2014.
 Schmidt, Mark. Graphical model structure learning with l1regularization. PhD thesis, UNIVERSITY OF BRITISH COLUMBIA (Vancouver, 2010.
 Schneidman, Elad, Berry, Michael J, Segev, Ronen, and Bialek, William. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007–1012, 2006.
 Sherrington, David and Kirkpatrick, Scott. Solvable model of a spinglass. Physical review letters, 35(26):1792, 1975.
 Shlens, Jonathon, Field, Greg D, Gauthier, Jeffrey L, Grivich, Matthew I, Petrusca, Dumitru, Sher, Alexander, Litke, Alan M, and Chichilnisky, EJ. The structure of multineuron firing patterns in primate retina. The Journal of neuroscience, 26(32):8254–8266, 2006.
 SohlDickstein, Jascha, Battaglino, Peter B, and DeWeese, Michael R. New method for parameter estimation in probabilistic models: minimum probability flow. Physical review letters, 107(22):220601, 2011.
 Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Swendsen, Robert H and Wang, JianSheng. Nonuniversal critical dynamics in monte carlo simulations. Physical review letters, 58(2):86, 1987.
 Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
 Tieleman, Tijmen. Training restricted boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pp. 1064–1071. ACM, 2008.
 Tipping, Michael E. Sparse bayesian learning and the relevance vector machine. The journal of machine learning research, 1:211–244, 2001.
 Titsias, Michalis and LázaroGredilla, Miguel. Doubly stochastic variational bayes for nonconjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML14), pp. 1971–1979, 2014.
 Tran, Dustin, Ranganath, Rajesh, and Blei, David M. The variational gaussian process. In Proceedings of the International Conference on Learning Representations, 2016.
 Younes, Laurent. Parametric inference for imperfectly observed gibbsian fields. Probability theory and related fields, 82(4):625–645, 1989.