Inverse statistical problems: from the inverse Ising problem to data science


H. Chau Nguyen Max-Planck-Institut für Physik komplexer Systeme, Nöthnitzer Str. 38, D-01187 Dresden, Germany    Riccardo Zecchina Bocconi University, via Roentgen 1, 20136 Milano, Italy and Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino    Johannes Berg Institute for Theoretical Physics, University of Cologne, Zülpicher Straße 77, 50937 Cologne, Germany
Abstract

Inverse problems in statistical physics are motivated by the challenges of ‘big data’ in different fields, in particular high-throughput experiments in biology. In inverse problems, the usual procedure of statistical physics needs to be reversed: Instead of calculating observables on the basis of model parameters, we seek to infer parameters of a model based on observations. In this review, we focus on the inverse Ising problem and closely related problems, namely how to infer the coupling strengths between spins given observed spin correlations, magnetisations, or other data. We review applications of the inverse Ising problem, including the reconstruction of neural connections, protein structure determination, and the inference of gene regulatory networks. For the inverse Ising problem in equilibrium, a number of controlled and uncontrolled approximate solutions have been developed in the statistical mechanics community. A particularly strong method, pseudolikelihood, stems from statistics. We also review the inverse Ising problem in the non-equilibrium case, where the model parameters must be reconstructed based on non-equilibrium statistics. (Citation: Inverse statistical problems: from the inverse Ising problem to data science, H.C. Nguyen, R. Zecchina and J. Berg, Advances in Physics, 66 (3), 197-261 (2017).)


I Introduction and applications

The primary goal of statistical physics is to derive observable quantities from microscopic laws governing the constituents of a system. In the example of the Ising model, the starting point is a model describing interactions between elementary magnets (spins); the goal is to derive observables such as spin magnetisations and correlations.

In an inverse problem, the starting point is observations of some system whose microscopic parameters are unknown and to be discovered. In the inverse Ising problem, the interactions between spins are not known to us, but we want to learn them from measurements of magnetisations, correlations, or other observables. In general, the goal is to infer the parameters describing a system (for instance, its Hamiltonian) from extant data. To this end, the relationship between microscopic laws and observables needs to be inverted.

In the last two decades, inverse statistical problems have arisen in different contexts, sparking interest in the statistical physics community in taking the path from model parameters to observables in reverse. The areas where inverse statistical problems have arisen are characterized by (i) microscopic scales becoming experimentally accessible and (ii) sufficient data storage capabilities being available. In particular, the biological sciences have generated several inverse statistical problems, including the reconstruction of neural and gene regulatory networks and the determination of the three-dimensional structure of proteins. Technological progress is likely to open up further fields of research to inverse statistical analysis, a development that is currently described by the label ‘big data’.

In physics, inverse statistical problems also arise when we need to design a many-body system with particular desired properties. Examples are finding the potentials that result in a particular single-particle distribution kunkin1969inverse (); chayes1984a (), interaction parameters in a binary alloy that yield the observed correlations maysenholder1987determination (), the potentials between atoms that lead to specific crystal lattices zhang2013a (), or the parameters of a Hamiltonian that lead to a particular density matrix changlani2015 (). In the context of soft matter, a question is how to design a many-body system that will self-assemble into a particular spatial configuration or has particular bulk properties rechtsman2006a (); torquato2009a (). In biophysics, we may want to design a protein that folds into a specified three-dimensional shape kuhlman2003a (). For RNA, even molecules with more than one stable target structure are possible flamm2001a (). As a model of such design problems, distasio2013a (); marcotte2013a () study how to find the parameters of an Ising Hamiltonian with a prescribed ground state.

In all these examples, ‘spin’ variables describe microscopic degrees of freedom particular to a given system, for instance, the states of neurons in a neural network. The simplest description of these degrees of freedom in terms of random binary variables then leads to Ising-type spins. In the simplest non-trivial scenario, correlations between the ‘spins’ are generated by pairwise couplings between the spins, leading to an Ising model with unknown parameters (couplings between the spins and magnetic fields acting on the spins). In many cases of interest, the couplings between spins will not all be positive, as is the case in a model of a ferromagnet. Nor will the couplings conform to a regular lattice embedded in some finite-dimensional space.

For a concrete example, we look at a system of N binary variables (Ising spins) s_i with s_i = ±1 for i = 1, …, N. These spins are coupled by pairwise couplings J_ij and are subject to external magnetic fields h_i.

p(\mathbf{s}) = \frac{1}{Z} \exp\left\{ \sum_{i<j} J_{ij} s_i s_j + \sum_i h_i s_i \right\}        (1)

is the Boltzmann equilibrium distribution, where we have subsumed temperature into the couplings and fields (we will discuss this choice in section II.0.1). The Hamiltonian

H(\mathbf{s}) = -\sum_{i<j} J_{ij} s_i s_j - \sum_i h_i s_i        (2)

specifies the energy of the spin system as a function of the microscopic spin variables, local fields, and pairwise couplings. The inverse Ising problem is the determination of the couplings J_ij and local fields h_i, given a set of observed spin configurations. Depending on the particular nature of the system at hand, the restriction to binary variables or pairwise interactions may need to be lifted, or the functional form of the Hamiltonian may be altogether different from the Ising Hamiltonian with pairwise interactions (2). For non-equilibrium systems, the steady state is not even described by a Boltzmann distribution with a known Hamiltonian. However, the basic idea remains the same across different types of inverse statistical problems: even when the frequencies of spin configurations may be under-sampled, the data may be sufficient to infer at least some parameters of a model.

The distribution (1) is well known not only as the equilibrium distribution of the Ising model. It is also the form of the distribution which maximizes the (Gibbs) entropy

S[p] = -\sum_{\mathbf{s}} p(\mathbf{s}) \ln p(\mathbf{s})        (3)

under the constraint that p(\mathbf{s}) is normalized and has particular first and second moments, that is, magnetisations and correlations. We will discuss in section II.1.3 how this distribution emerges as the ‘least biased distribution’ of a binary random variable with prescribed first and second moments jaynes1957a (). The practical problem is then again an inverse one: to find the couplings and local fields such that the first and second moments observed under the Boltzmann distribution (1) match the mean values of s_i and s_i s_j in the data. In settings where third moments differ significantly from the prediction of (1) based on the first two moments, one may need to construct the distribution of maximum entropy given the first three moments, leading to three-spin interactions in the exponent of (1).
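To make the forward direction of this moment matching concrete, the following minimal Python sketch computes the magnetisations and pair correlations of the distribution (1) by brute-force enumeration of all 2^N spin configurations; the couplings and fields used here are arbitrary illustrative values, and the enumeration is of course only feasible for small N.

import itertools
import numpy as np

def model_moments(J, h):
    """Magnetisations and pair correlations of the Boltzmann distribution (1),
    computed by enumerating all 2^N spin configurations.
    J is assumed symmetric with zero diagonal; h is the vector of fields."""
    N = len(h)
    Z, m, chi = 0.0, np.zeros(N), np.zeros((N, N))
    for conf in itertools.product([-1, 1], repeat=N):
        s = np.array(conf, dtype=float)
        w = np.exp(0.5 * s @ J @ s + h @ s)   # Boltzmann weight; 0.5 because J is symmetric
        Z += w
        m += w * s
        chi += w * np.outer(s, s)
    return m / Z, chi / Z

# illustrative parameters for N = 3 spins
J = np.array([[0.0, 0.5, -0.2],
              [0.5, 0.0, 0.3],
              [-0.2, 0.3, 0.0]])
h = np.array([0.1, -0.1, 0.0])
m, chi = model_moments(J, h)

The inverse problem asks for the couplings and fields that make these model moments coincide with the moments estimated from data.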

Determining the parameters of a distribution such as (1) is always a many-body problem: changing a single coupling generally affects correlations between many spin variables, and conversely a change in the correlation between two variables can change the values of many inferred couplings. The interplay between model parameters and observables is captured by a statistical mechanics of inverse problems, where the phase space consists of quantities normally considered as fixed model parameters (couplings, fields). The observables, such as spin correlations and magnetisations, on the other hand, are taken to be fixed, as they are specified by empirical observations. Such a perspective is not new to statistical physics; the analysis of neural networks in the seventies and eighties of the last century led to a statistical mechanics of learning watkin1993a (); hertz199a (); engel2001a (), where the phase space is defined by the set of possible rules linking the input into a machine with an output. The set of all rules compatible with a given set of input/output relations then defines a statistical ensemble. In inverse statistical problems, however, there are generally no explicit rules linking input and output, only data with different types of correlations or other observations, which are to be accounted for by a statistical model.

Inverse statistical problems fall into the broader realm of statistical inference mackay2003 (); bishop2006 (), which seeks to determine the properties of a probability distribution underlying some observed data. The problem of inferring the parameters of a distribution such as (1) is known under different names in different communities; also emphasis and language differ subtly across communities.

  • In statistics, an inverse problem is the inference of model parameters from data. In our case, the problem is the inference of the parameters of the Ising model from observed spin configurations. A particular subproblem is the inference of the graph formed by the non-zero couplings of the Ising model, termed graphical model selection or reconstruction. In the specific context of statistical models on graphs (graphical models), the term inference describes the calculation of the marginal distribution of one or several variables. (A marginal distribution describes the statistics of one or several particular variables in a many-variable distribution, for example, p_i(s_i) = \sum_{\{s_j\}_{j \neq i}} p(s_1, \ldots, s_N).)

  • In machine learning, a frequent task is to train an artificial neural network with symmetric couplings such that magnetisations and correlations of the artificial neurons match the corresponding values in the data. This is a special case of what is called Boltzmann machine learning; the general case also considers so-called hidden units, whose values are unobserved Ackley1985a ().

  • In statistical physics, much effort has been directed towards estimating the parameters of the Ising model given observed values of the magnetisation and two-point correlations. As we will see in section II.1, this is a hard problem from an algorithmic point of view. Recently, threshold phenomena arising in inference problems have attracted much interest from the statistical physics community, due to the link between phase transitions and the boundaries separating different regimes of inference problems, for instance solvable and unsolvable problems, or easy and hard ones mezard2009information (); ZdeborovaKrzakala2015 ().

The common theme across different applications and communities is the inference of model parameters given observed data or desired properties. In this review we will focus on a prototype inverse statistical problem: the inverse Ising problem and its close relatives. Many of the approaches developed for this problem are also readily extended to more general scenarios. We will start with a discussion of applications of the inverse Ising problem and related approaches in biology, specifically the reconstruction of neural and genetic networks, the determination of three-dimensional protein structures, the inference of fitness landscapes, the bacterial responses to combinations of antibiotics, and flocking dynamics. We will find that these applications define two distinct settings of the inverse Ising problem: equilibrium and non-equilibrium. Part II of this review treats the inverse Ising problem in an equilibrium setting, where the couplings between spins are symmetric. Detailed balance holds and results from equilibrium statistical physics can be used. This setting arises naturally within the context of maximum entropy models, which seek to describe the observed statistics of configurations with a simplified effective model capturing, for instance, collective effects. We introduce the basics of statistical inference and maximum entropy modelling, discuss the thermodynamics of the inverse Ising problem, and review different approaches to solve the inverse Ising problem, pointing out their connections and comparing the resulting parameter reconstructions. Part III of this review considers asymmetric coupling matrices, where in the absence of detailed balance couplings can be reconstructed from time series, from data on perturbations of the system, or from detailed knowledge of the non-equilibrium steady state.

We now turn to applications of the inverse Ising problem, which mostly lie in the analysis of high-throughput data from biology. One aim of inverse statistical modelling is to find the parameters of a microscopic model to describe this data. A more ambitious aim is achieved when the parameters of the model are informative about the processes which produced the data, that is, when some of the mechanisms underlying the data can be inferred. The data consists of large-scale measurements of the degrees of freedom of some system. In the language of statistical physics these describe the micro-states of a system: states of neurons, particular sequences of DNA or proteins, or the concentration levels of RNA. We briefly introduce some of the experimental background of these measurements, so their potential and the limitations can be appreciated. The models are simple models of the microscopic degrees of freedom. In the spirit of statistical physics, these models are simple enough so the parameters can be computed given the data, yet sufficiently complex to reproduce some of the statistical interdependences of the observed microscopic degrees of freedom. The simplest case, consisting of binary degrees of freedom with unknown pairwise couplings between them, leads to the inverse Ising problem, although we will also discuss several extensions.

i.1 Modelling neural firing patterns and the reconstruction of neural connections

Neurons can exchange information by generating discrete electrical pulses, termed spikes, that travel down nerve fibres. Neurons can emit these spikes at different rates: a neuron emitting spikes at a high rate is said to be ‘active’ or ‘firing’; a neuron emitting spikes at a low rate or not at all is said to be ‘inactive’ or ‘silent’. The measurement of the activity of single neurons has a long history starting in 1953 with the development of micro-electrodes for recording dowben1953a (). Multi-electrodes were developed subsequently, allowing multiple neural signals to be recorded simultaneously and independently over long time periods spira2013a (); obien2014a (). Such data present the intriguing possibility of seeing elementary brain function emerge from the interplay of a large number of neurons.

However, even when vast quantities of data are available, the different configurations of a system are still under-sampled in most cases. For instance, consider N neurons, each of which can be either active (firing) or inactive (silent). Given that the firing patterns of thousands of neurons can be recorded simultaneously schwarz2014a (), the number of observations will generally be far less than the total number 2^N of possible neural configurations. For this reason, a straightforward statistical description that seeks to determine directly the frequency with which each configuration appears will likely fail.

On the other hand, a feasible starting point is a simple distribution, whose parameters can be determined from the data. For a set of binary variables, this might be a distribution with pairwise interactions between the variables. In Schneidman2006a (), Bialek and collaborators applied such a statistical model to neural recordings. Dividing time into small intervals of duration Δt induces a binary representation of neural data, where each neuron either spikes during a given interval (s_i = +1) or it does not (s_i = -1). The joint statistics observed in a population of neurons in the retina of a salamander was modelled by an Ising model (1) with magnetic fields and pairwise symmetric couplings. Rather than describing the dynamics of neural spikes, this model describes the correlated firing of different neurons over the course of the experiment. The symmetric couplings in (1) describe statistical dependencies, not physical connections. The synaptic connections between neurons, on the other hand, are generally not symmetric.

In this context, the distribution (1) can be viewed as the form of the maximum entropy distribution over neural states, given the observed one- and two-point correlations Schneidman2006a (). In Shlens2006a (); Cocco2009a (), a good match was found between the statistics of three neurons predicted by (1) and the firing patterns of the same neurons in the data. This means that the model with pairwise couplings provides a statistical description of the empirical data, one that can even be used to make predictions. Similar results were obtained also from cortical cells in cell cultures tang2008a ().

The mapping from the neural data to a spin model rests on dividing time into discrete bins of duration Δt. A different choice of this interval would lead to different spin configurations; in particular, changing Δt affects the magnetisation of all spins by altering the number of intervals in which a neuron fires. In roudinirenberglatham2009a (), Roudi, Nirenberg and Latham show that the pairwise model (1) provides a good description of the underlying spin statistics (generated by neural spike trains), provided N\bar{\nu}\,\Delta t \ll 1, where \bar{\nu} is the average firing rate of the neurons and N their number. Increasing the bin size beyond this regime leads to an increase in bins where multiple neurons fire; as a result, couplings beyond the pairwise couplings in (1) can become important.
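As an illustration of this binning step, the following Python sketch converts hypothetical lists of spike times into binary spin configurations for a given bin width; the spike-time arrays, recording interval, and bin width are illustrative inputs, not data from the experiments cited above.

import numpy as np

def bin_spikes(spike_times, t_start, t_end, dt):
    """Map spike trains to Ising spins: a neuron is assigned +1 in a time bin of
    width dt if it emits at least one spike in that bin, and -1 otherwise.
    spike_times is a list with one array of spike times per neuron."""
    n_bins = int((t_end - t_start) / dt)
    N = len(spike_times)
    s = -np.ones((n_bins, N), dtype=int)
    for i, times in enumerate(spike_times):
        idx = ((np.asarray(times, dtype=float) - t_start) / dt).astype(int)
        idx = idx[(idx >= 0) & (idx < n_bins)]
        s[idx, i] = 1
    return s  # each row is one sampled spin configuration

# hypothetical recording: three neurons, one second of data, 20 ms bins
spikes = [np.array([0.011, 0.35, 0.36]), np.array([0.50]), np.array([])]
configs = bin_spikes(spikes, t_start=0.0, t_end=1.0, dt=0.02)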

As a minimal model of neural correlations, the statistics (1) has been extended in several ways. Tkačik et al. tkavcik2010 () and Granot-Atedgi et al. granot2013a () consider stimulus-dependent magnetic fields, that is, fields which depend on the stimulus presented to the experimental subject at a particular time of the experiment. Ohiorhenuan et al. look at stimulus-dependent couplings ohiorhenuan2010a (). When the number of recorded neurons increases, limitations of the pairwise model (1) become apparent, which have been addressed by adding additional terms coupling triplets, etc., of spins in the exponent of the Boltzmann measure (1) ganmor2011a ().

The statistics (1) serves as a description of the empirical data: the couplings between spins in the Hamiltonian (2) do not describe physical connections between the neurons. The determination of the network of neural connections from the observed neural activities is thus a different question. Simoncelli and collaborators paninski2004a (); pillow2008a () and Cocco, Leibler, and Monasson Cocco2009a () use an integrate-and-fire model burkitt2006review () to infer how the neurons are interconnected on the basis of time series of spikes in all neurons. In such a model, the membrane potential V_i of neuron i obeys the dynamics

\frac{dV_i}{dt} = \sum_j J_{ij} \sum_k K(t - t_{j,k}) + I_i - g\, V_i + \xi_i(t)        (4)

where the first term on the right-hand side encodes the synaptic connections J_ij and a memory kernel K; t_{j,k} specifies the time at which neuron j emitted its k-th spike. The remaining terms describe a background current, voltage leakage, and white noise. Finding the synaptic connections that best describe a large set of neural spike trains is a computational challenge; paninski2006a (); Cocco2009a () develop an approximation based on maximum likelihood, see section II.1. A related approach based on point processes and generalized linear models (GLM) is presented in truccolo2005point (). We will discuss this problem of inferring the network of connections in the context of the non-equilibrium models in section III.
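For readers who want to see such a forward model in action, the following Python sketch integrates a simple version of these dynamics; the exponential kernel, the threshold-and-reset rule, and all parameter values are illustrative assumptions rather than the specific choices of the works cited above.

import numpy as np

def simulate_lif(J, I, g, T, dt=1e-3, tau_s=5e-3, v_th=1.0, v_reset=0.0,
                 sigma=0.1, seed=0):
    """Euler integration of coupled leaky integrate-and-fire neurons with the
    structure of (4): synaptic input (couplings J filtered through an exponential
    memory kernel), background currents I, leak -g*V and white noise.
    Returns a list of spike times for each neuron."""
    rng = np.random.default_rng(seed)
    I = np.asarray(I, dtype=float)
    N = len(I)
    V = np.zeros(N)
    syn = np.zeros(N)                  # filtered spike train of each neuron
    spikes = [[] for _ in range(N)]
    for step in range(int(T / dt)):
        syn -= dt * syn / tau_s        # exponential decay of the memory kernel
        V += dt * (J @ syn + I - g * V) + sigma * np.sqrt(dt) * rng.standard_normal(N)
        fired = V >= v_th
        for i in np.where(fired)[0]:
            spikes[i].append(step * dt)
        syn[fired] += 1.0 / tau_s      # each spike adds one kernel to the trace
        V[fired] = v_reset
    return spikes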

Neural recordings give the firing patterns of several neurons over time. These neurons may have connections between them, but they also receive signals from neural cells whose activity is not recorded macke2011a (). In tyrcha2013 (), the effect of connections between neurons is disentangled from correlations arising from shared non-stationary input. This raises the possibility that the correlations described by the pairwise model (1) in Schneidman2006a () and related works originate from a confounding factor (input from neurons other than those whose signals are measured), rather than from connections between the recorded neurons kulkarni2007common ().

i.2 Reconstruction of gene regulatory networks

The central dogma of molecular biology is this: Proteins are macromolecules consisting of long chains of amino acids. The particular sequence of a protein is encoded in DNA, a double-stranded helix of complementary nucleotides. Specific parts of DNA, the genes, are transcribed by polymerases, producing a single-stranded copy called m(essenger)RNA, which is translated by ribosomes, usually multiple times, to produce proteins.

The process of producing protein molecules from the DNA template by transcription and translation is called gene expression. The expression of a gene is tightly controlled to ensure that the right amounts of proteins are produced at the right time. One important control mechanism is transcription factors, proteins which affect the expression of a gene (or several) by binding to DNA near the transcription start site of that gene. This part of DNA is called the regulatory region of a gene. A target gene of a transcription factor may in turn encode another transcription factor, leading to a cascade of regulatory events. To add further complications, the binding of multiple transcription factors in the regulatory region of a gene leads to combinatorial control exerted by several transcription factors on the expression of a gene buchlergerlandhwa2003a (). Can the regulatory connections between genes be inferred from data on gene expression, that is, can we learn the identity of transcription factors and their targets?

Over the last decades, the simultaneous measurement of expression levels of all genes has become routine. At the centre of this development are two distinct technological advances to measure mRNA levels. The first is microarrays, consisting of thousands of short DNA sequences, called probes, grafted to the surface of a small chip. After converting the mRNA in a sample to DNA by reverse transcription, cleaving that DNA into short segments, and fluorescently labelling the resulting DNA segments, fluorescent DNA can bind to its complementary sequence on the chip. (Reverse transcription converts mRNA to DNA, a process which requires a so-called reverse transcriptase as an enzyme.) The amount of fluorescent DNA bound to a particular probe depends on the amount of mRNA originally present in the sample. The relative amount of mRNA from a particular gene can then be inferred from the fluorescence signal at the corresponding probes hoheisel2006a (). A limitation of microarrays is the large amount of mRNA required: The mRNA sample is taken from a population of cells. As a result, cell-to-cell fluctuations of mRNA concentrations are averaged over. To obtain time series, populations of cells synchronized to approximately the same stage in the cell cycle are used gasch2000a ().

The second way to measure gene expression levels is also based on reverse transcription of mRNA, followed by high-throughput sequencing of the resulting DNA segments. Then the relative mRNA levels follow directly from counts of sequence reads wang2009a (). Recently, expression levels even in single cells have been measured in this way wills2013a (). In combination with barcoding (adding short DNA markers to identify individual cells), cells can have their expression profiled individually in a single sequencing run macosko2015a (). Such data may allow, for instance, the analysis of the response of target genes to fluctuations in the concentration of transcription factors. However, due to the destructive nature of single-cell sequencing, there may never be single-cell data that give time series of genome-wide expression levels.

Unfortunately, cellular concentrations of proteins are much harder to measure than mRNA levels. As a result, much of the literature focuses on mRNA levels, neglecting the regulation of translation. Advances in protein mass-spectrometry picotti2013a () may lead to data on both mRNA and protein concentrations. This data would pose the additional challenge of inferring two separate levels of gene regulation: gene transcription from DNA to mRNA and translation from mRNA to proteins.

As in the case of neural data discussed in the preceding section, gene expression data presents two distinct challenges: (i) finding a statistical description of the data in terms of suitable observables and (ii) inferring the underlying regulatory connections. Both these problems have been addressed extensively in the machine learning and quantitative biology communities. Clustering of gene expression data into sets of genes with correlated expression levels has been used to detect regulatory relationships. A model-based approach to the reconstruction of regulatory connections is Boolean networks. Boolean networks assign binary states to each gene (gene expression on/off), and the state of a gene at a given time depends on the state of all genes at a previous time through a set of logical functions assigned to each gene. See Dhaeseleer2000a () for a review of clustering and network inference, and hickman2009a () for a review of Boolean network inference.

A statistical description that has also yielded insight into regulatory connections is Bayesian networks. A Bayesian network is a probabilistic model describing a set of random variables (expression levels) through conditional dependencies described by a directed acyclic graph. Learning both the structure of the graph and the statistical dependencies is a hard computational problem, but such models can capture strong signals in the data that are often associated with a regulatory connection. In principle, causal relationships (like the regulatory connections) can be inferred, in particular if the regulatory network contains no cycles. For reviews, see friedman2004a (); koller2009a (). Both Boolean and Bayesian networks have been applied to measurements of the response of expression levels to external perturbations of the regulatory network or of expression levels, see ideker2000a (); peer2001a (). A full review of these methods is beyond the scope of this article; instead we focus on approaches related to the inverse Ising problem.

For a statistical description of gene expression levels, lezon2006 () applied a model with pairwise couplings

p(x_1, \ldots, x_N) = \frac{1}{Z} \exp\left\{ \frac{1}{2} \sum_{ij} J_{ij} x_i x_j + \sum_i h_i x_i \right\}        (5)

fitted to gene expression levels x_i. The standard definition of expression levels is log-values of fluorescence signals with the mean value for each gene subtracted. Since (5) is a multi-variate Gaussian distribution, the matrix of couplings J must be negative definite. These couplings can be inferred simply by inverting the matrix of variances and covariances of expression levels. In lezon2006 (), the resulting couplings were then used to identify hub genes which regulate many targets. The same approach was used in locasale2009 () to analyse the cellular signalling networks mediated by the phosphorylation of specific sites on different proteins. Again, the distribution (5) can be viewed as a maximum entropy distribution for continuous variables with prescribed first and second moments. This approach is also linked to the concept of partial correlations in statistics baba2004a (); krumsiek2011a ().
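In this Gaussian case the inverse problem has a closed-form solution: the coupling matrix is (minus) the inverse of the empirical covariance matrix. A minimal Python sketch, where the expression matrix X is a hypothetical input of shape (samples, genes):

import numpy as np

def gaussian_couplings(X):
    """Couplings of the Gaussian maximum entropy model (5) from expression data.
    X: array of shape (M, N) holding M samples of N (log-)expression levels.
    Returns J = -inverse(C), where C is the empirical covariance matrix; the
    off-diagonal entries J_ij are the pairwise couplings."""
    C = np.cov(X, rowvar=False)
    return -np.linalg.inv(C)

In practice the covariance matrix may be ill-conditioned when the number of samples is small compared to the number of genes, in which case some form of regularisation is needed before inversion.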

Again the maximum-entropy distribution (5) has symmetric couplings between expression levels, whereas the network of regulatory interactions is intrinsically asymmetric. One way to infer the regulatory connections is to use time series sima2009a (). Bailly2010b () uses expression levels measured at different times to infer the regulatory connections, based on a minimal model of expression dynamics with asymmetric regulatory connections between pairs of genes. In this model, expression levels at successive time intervals obey

x_i(t+1) = \mathrm{sign}\left( \sum_j w_{ij}\, x_j(t) - \theta_i \right)        (6)

where θ_i is a threshold. The regulatory connections w_ij are taken to be discrete, with the values -1, +1, 0 denoting repression, activation and no regulation of gene i by the product of gene j. The matrix of connections is then inferred based on Bayes theorem (see section II.1) and an iterative algorithm for estimating marginal probabilities (message passing, see section II.1.11).
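The following Python sketch simulates threshold dynamics of this general type; the particular update rule, the discrete coupling values and the initial condition are illustrative assumptions, not the precise model of Bailly2010b ().

import numpy as np

def threshold_dynamics(w, theta, x0, n_steps):
    """Deterministic threshold dynamics: the state of gene i at the next time
    step is +1 (expressed) if the summed regulatory input exceeds the threshold
    theta_i, and -1 otherwise. The connections w have entries in {-1, 0, +1}."""
    x = np.array(x0, dtype=int)
    trajectory = [x.copy()]
    for _ in range(n_steps):
        x = np.where(w @ x > theta, 1, -1)
        trajectory.append(x.copy())
    return np.array(trajectory)

# hypothetical three-gene cascade: gene 0 activates gene 1, gene 1 represses gene 2
w = np.array([[0, 0, 0],
              [1, 0, 0],
              [0, -1, 0]])
traj = threshold_dynamics(w, theta=np.zeros(3), x0=[1, -1, 1], n_steps=5)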

A second approach that can provide information on regulatory connections is the use of perturbations tegner2003a (). An example is gene knockdowns, where the expression of one or more genes is reduced, by introducing small interfering RNA (siRNA) molecules into the cell dorsett2004a () or by other techniques. siRNA molecules can be introduced into cells from the outside; after various processing steps they lead to the cleavage of mRNA with a complementary sequence, which is then no longer available for translation. If that mRNA translates to a transcription factor, all targets of that transcription factor will be upregulated or downregulated (depending on whether the transcription factor acted as a repressor or an activator, respectively). Knowing the responses of gene expression levels to a sufficient number of such perturbations allows the inference of regulatory connections. molinelli2013a () considers a model of gene expression dynamics based on continuous expression levels x_i evolving deterministically under a rate equation with two terms: the first term describes how the expression level of gene j affects the rate of gene expression of gene i via the regulatory connection w_ij, the second term describes mRNA degradation. The stationary points of this model shift in response to perturbations of expression levels of particular genes (for instance through knockdowns), and these changes depend on the regulatory connections. In molinelli2013a (), the regulatory connections are inferred from perturbation data, again using belief propagation.

i.3 Protein structure determination

Tremendous efforts have been made to determine the three-dimensional structure of proteins. A linear amino acid chain folds into a convoluted shape, the folded protein, thus bringing amino acids into close physical proximity that are separated by a long distance along the linear sequence.

Due to the number of proteins (several thousand per organism) and the length of individual proteins (hundreds of amino acid residues), protein structure determination is a vast undertaking. However, the rewards are also substantial. The three-dimensional structure of a protein determines its physical and chemical properties, and how it interacts with other cellular components: broadly, the shape of a protein determines many aspects of its function. Protein structure determination relies on crystallizing proteins and analysing the X-ray diffraction pattern of the resulting solid. Given the experimental effort required, the determination of a protein’s structure from its sequence alone has been a key challenge to computational biology for several decades dill2012a (); deJuan2013a (). The computational approach models the forces between amino acids in order to find the low-energy structure a protein in solution will fold into. Depending on the level of detail, this approach requires extensive computational resources.

Figure 1: Correlations and couplings in protein structure determination. Both figures show the three-dimensional structure of a particular part (region 2) of the protein SigmaE of E. coli, as determined by X-ray diffraction. This protein, or rather a protein of similar sequence and presumably similar structure, occurs in many other bacterial species as well. In figure B, lines indicate pairs of sequence positions whose amino acids are highly correlated across different bacteria: for each pair of sequence positions at least amino acids apart, the mutual information of pairwise frequency counts of amino acids was calculated, and the most correlated pairs are shown here. Such pairs that also turn out to be close in the three-dimensional structure are shown in red, those whose distance exceeds are shown in green. We see about as many highly correlated sequence pairs that are proximal to one another as correlated pairs that are further apart. By contrast, in figure A, lines show sequence pairs that are strongly coupled in the Potts model (7), whose model parameters are inferred from the correlations. The fraction of false contact predictions (green lines) is reduced considerably. The figures are taken from Morcos2011a ().

An attractive alternative enlists evolutionary information: Suppose that we have at our disposal amino acid sequences of a protein as it appears in different related species (so-called orthologs). While the sequences are not identical across species, they preserve to some degree the three-dimensional shape of the protein. Suppose a specific pair of amino acids interacts strongly with each other and brings together parts of the protein that are distal on the linear sequence. Replacing this pair with another, equally strongly interacting pair of amino acids would change the sequence, but leave the structure unchanged. For this reason, we expect sequence differences across species to reflect the structure of the protein. Specifically, we expect correlations of amino acids in positions that are proximal to each other in the three-dimensional structure. In turn, the correlations observed between amino acids at different positions might allow us to infer which pairs of amino acids are proximal to each other in three dimensions (the so-called contact map). The use of such genomic information has recently led to predictions of the three-dimensional structure of many protein families inaccessible to other methods ovchinnikov2017 (), for a review see coccoetal2017 ().

Figure 2: Protein contact maps predicted from evolutionary correlations. The two figures show contact maps for the ELAV4 protein (left) and the RAS protein (right). The x- and y-axes correspond to sequence positions along the linear chain of amino acids. Pairs of sequence positions whose amino acids are in close proximity in the folded protein are indicated in grey (experimental data). Pairs of sequence positions with highly correlated amino acids are shown in blue (mutual information, bottom triangle). Pairs of sequence positions with high direct information (8) calculated from (7) are shown in red. The coincidence of red and grey points shows excellent agreement between predictions from direct information with the experimentally determined structure of the protein. The figure is taken from marks2011a ().

Early work looked at the correlations as a measure of proximity gobel1994 (); lockless1999a (); socolich2005 (); halabi2009a (). However, correlations are transitive: if amino acids at sequence sites i and j are correlated due to proximity in the folded state, and sites j and k are correlated for some reason, sites i and k will also exhibit correlations, which need not stem from proximity. This problem is addressed by an inverse approach aimed at finding the set of pairwise couplings that lead to the observed correlations or sequences Weigt2009a (); burger2010 (); Morcos2011a (); marks2011a (); sulkowska2012 (); dago2012 (); hopf2012a (); cocco2013a (); ekeberg2013 (). Since each sequence position can be taken up by one of 20 amino acids or a gap in the sequence alignment, there are 21 × 21 correlations at each pair of sequence positions. In Weigt2009a (); Morcos2011a () a statistical model with pairwise interactions is formulated, based on the Hamiltonian

H(a_1, \ldots, a_L) = -\sum_{i<j} J_{ij}(a_i, a_j) - \sum_i h_i(a_i)        (7)

This Hamiltonian depends on spin variables a_i, one for each sequence position i = 1, …, L. Each spin variable can take on one of q = 21 values, describing the 20 possible amino acids at that sequence position as well as the possibility of a gap (corresponding to an extra amino acid inserted in a particular position in the sequences of other organisms). Each pair of amino acids (a_i, a_j) at sequence positions i and j contributes a term -J_{ij}(a_i, a_j) to the energy. The inverse problem is to find the couplings J_{ij}(A, B) for each pair of sequence positions i, j and pair of amino acids A, B, as well as the fields h_i(A), such that the amino acid frequencies and correlations observed across species are reproduced. The sequence positions with strong pairwise couplings are then predicted to be proximal in the protein structure. A simple measure of the coupling between sequence positions i and j is the matrix norm (Frobenius norm) F_{ij} = \sqrt{\sum_{A,B} J_{ij}(A,B)^2}. The so-called direct information Weigt2009a () is an alternative measure based on information theory. A two-site model is defined with p^{\mathrm{dir}}_{ij}(A, B) \propto \exp\{ J_{ij}(A, B) + \tilde{h}_i(A) + \tilde{h}_j(B) \}, where the auxiliary fields \tilde{h} are chosen such that the marginals of this two-site model reproduce the single-site amino acid frequencies f_i(A) observed in the data. Direct information is the mutual information between the two-site model and a model without correlations between the amino acids

DI_{ij} = \sum_{A,B} p^{\mathrm{dir}}_{ij}(A, B) \ln \frac{p^{\mathrm{dir}}_{ij}(A, B)}{f_i(A)\, f_j(B)}        (8)
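As an illustration of the simpler of these two scores, the Python sketch below ranks pairs of sequence positions by the Frobenius norm of their coupling matrices; the coupling tensor J (of shape L x L x q x q) is assumed to have been produced by whatever inference method is used, and the minimum sequence separation is an illustrative choice.

import numpy as np

def frobenius_scores(J):
    """Contact scores from a Potts coupling tensor: the score of positions i, j
    is the Frobenius norm of the q x q matrix J[i, j] of couplings J_ij(A, B)."""
    F = np.sqrt(np.einsum('ijab,ijab->ij', J, J))
    np.fill_diagonal(F, 0.0)
    return F

def top_contact_predictions(F, n_pairs, min_separation=5):
    """The n_pairs highest-scoring position pairs, excluding pairs that are
    close along the sequence (trivially correlated neighbours)."""
    L = F.shape[0]
    pairs = [(F[i, j], i, j) for i in range(L) for j in range(i + min_separation, L)]
    pairs.sort(reverse=True)
    return [(i, j) for _, i, j in pairs[:n_pairs]]

In practice, such raw scores are usually post-processed (for example by an average-product correction) before pairs are ranked.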

The Boltzmann distribution resulting from (7) can be viewed as the maximum entropy distribution with one- and two-point correlations between amino acids in different sequence positions determined by the data. There is no reason to exclude higher order terms in the Hamiltonian (7) describing interactions between triplets of sequence positions, although the introduction of such terms may lead to overfitting. Also, fitting the Boltzmann distribution (7) to sequence data uses no prior information on protein structures; for this reason it is called an unsupervised method. Recently, neural network models trained on sequence data and protein structures (supervised learning) have been very successful in predicting new structures jones2015metapsicov (); wang2017accurate ().

The maximum entropy approach to structure analysis is not limited to evolutionary data. In zhangwolynes2015a () Zhang and Wolynes analyse chromosome conformation capture experiments and use the observed frequency of contacts between different parts of a chromosome in a maximum entropy approach to predict the structure and topology of the chromosomes.

i.4 Fitness landscape inference

The concept of fitness lies at the core of evolutionary biology. Fitness quantifies the average reproductive success (number of offspring) of an organism with a particular genotype, i.e., a particular DNA sequence. The dependence of fitness on the genotype can be visualized as a fitness landscape in a high-dimensional space, where fitness specifies the height of the landscape. As the number of possible sequences grows exponentially with their length, the fitness landscape requires in principle an exponentially large number of parameters to specify, and in turn those parameters need an exponentially growing amount of data to infer.

A suitable model system for the inference of a fitness landscape is HIV proteins, due to the large number of sequences stored in clinical databases and the relative ease of generating mutants and measuring the resulting fitness. In a series of papers, Chakraborty and co-workers proposed a fitness model for the so-called Gag protein family (group-specific antigen) of the HIV virus dahirel2011a (); ferguson2013a (); shekhar2013a (); mann2014a (). The model is based on pairwise interactions between amino acids. Retaining only the information whether the amino acid at sequence position i is mutated with respect to a reference sequence or not (encoded in a binary variable s_i = ±1), Chakraborty and co-workers suggest a minimal model for the fitness landscape given by the Ising Hamiltonian (2). Again, one can view the landscape (2) as generating the maximum entropy distribution constrained by the observed one- and two-point correlations.

Adding a constant to (2) in order to make fitness (expected number of offspring) non-negative does not alter the resulting statistics. The inverse problem is to infer the couplings J_ij and fields h_i from frequencies of amino acids and pairs of amino acids in particular sequence positions observed in HIV sequence data. Of course it is not clear from the outset that a model with only pairwise interactions can describe the empirical fitness landscape. As a test of this approach, ferguson2013a () compares the prediction of (2) for specific mutants to the results of independent experimental measurements of fitness.
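Once fields and couplings have been inferred, the fitness of any mutant (up to the additive constant mentioned above) can be read off from the Ising energy. A minimal Python sketch with hypothetical parameters:

import numpy as np

def predicted_fitness(s, J, h):
    """Fitness predicted by the landscape (2), up to an additive constant:
    minus the Ising energy of the binary mutation pattern s (entries +/-1).
    J is assumed symmetric with zero diagonal."""
    return 0.5 * s @ J @ s + h @ s

# hypothetical parameters for a three-site fragment
J = np.array([[0.0, -0.8, 0.1],
              [-0.8, 0.0, 0.0],
              [0.1, 0.0, 0.0]])
h = np.array([0.5, 0.2, -0.3])
reference = np.array([1, 1, 1])
double_mutant = np.array([-1, -1, 1])
print(predicted_fitness(double_mutant, J, h) - predicted_fitness(reference, J, h))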

Statistical models of sequences described by pairwise interactions may be useful to model a wide range of protein families with different functions hopf2015a (), and have been used in other contexts as well. Santolini, Mora, and Hakim model the statistics of sequences binding transcription factors using (7), with each spin taking one of four states to characterize the nucleotides  santolini2014a (). A similar model is used in mora2010a () to model the sequence diversity of the so-called IgM protein, an antibody which plays a key role in the early immune response. The model with pairwise interactions predicts non-trivial three-point correlations which compare well with those found in the data, see figure 3.

Figure 3: Three-point correlations in amino acid sequences and their prediction from a model with pairwise interactions. Mora et al. mora2010a () look at the so-called D-region in the IgM protein. The D-region plays an important role in immune response. The frequencies at which given triplets of consecutive amino acids occur were compiled and normalized with respect to the prediction of a model with independent sites; they are plotted against the prediction from a model with pairwise interactions like (2). The figure is taken from mora2010a ().

i.5 Combinatorial antibiotic treatment

Antibiotics are chemical compounds which kill specific bacteria or inhibit their growth walsh2000a (); kohanski2010a (). Mutations in the bacterial DNA can lead to resistance against a particular antibiotic, which is a major hazard to public health wise1998a (); liu2015a (). One strategy to slow down or eliminate the emergence of resistance is to use a combination of antibiotics either simultaneously or in rotation walsh2000a (); kohanski2010a (). The key problem of this approach is to find combinations of compounds which are particularly effective against a particular strain of bacteria. Trying out all combinations experimentally is prohibitively expensive. Wood et al. use an inverse statistical approach to predict the effect of combinations of several antibiotics from data on the effect of pairs of antibiotics wood2012a (). The available antibiotics are labelled a = 1, …, K; in wood2012a () a distribution p(x_1, …, x_K) over continuous variables is constructed, such that ⟨x_a⟩ gives the bacterial growth rate when antibiotic a is administered, ⟨x_a x_b⟩ gives the growth rate when both a and b are given, and so on for higher moments. Choosing this distribution to be a multi-variate Gaussian results in simple relationships between the different moments, which lead to predictions of the response to drug combinations that are borne out well by experiment wood2012a ().

i.6 Interactions between species and between individuals

Species exist in various ecological relationships. For instance individuals of one species hunt and eat individuals of another species. Another example is microorganisms whose growth can be influenced, both positively and negatively, by the metabolic output of other microorganisms. Such relationships form a dense web of ecological interactions between species. Co-culturing and perturbation experiments (for instance species removal) lead to data which may allow the inference of these networks faust2012a (); hekstra2013 ().

Interactions between organisms exist also at the level of individuals, for instance when birds form a flock, or fish form a school. This emergent collective behaviour is thought to have evolved to minimize the exposure of individuals to predators. In bialek2012a (); bialek2014a (), a model with pairwise interactions between the velocities of birds in a flock is constructed. Individual birds labelled i = 1, …, N move with speed v_i in a direction specified by the unit vector \vec{s}_i. The statistics of these directions is modelled by a distribution

p(\vec{s}_1, \ldots, \vec{s}_N) = \frac{1}{Z} \exp\left\{ \sum_{i<j} J_{ij}\, \vec{s}_i \cdot \vec{s}_j \right\}        (9)

where the couplings J_ij between the spins need to be inferred from the experimentally observed correlations between normalized velocities. This model can be viewed as the maximum-entropy distribution constrained by pairwise correlations between normalized velocities. From the point of view of statistical physics it describes a disordered system of Heisenberg spins. As birds frequently change their neighbours in flight, the couplings are not constant in time and it makes sense to consider couplings that depend on the distance between two individuals bialek2012a (). An alternative is to apply the maximum entropy principle to entire trajectories cavagna2014a ().

i.7 Financial markets

Market participants exchange commodities, shares in companies, currencies, or other goods and services, usually for money. The changes in prices of such goods are often correlated, as the demand for different goods can be influenced by the same events. In bury2013a (); bury2013b (), Bury uses a spin model with pairwise interactions to analyse stock market data. Shares in different companies are described by binary spin variables, where spin s_i = +1 indicates ‘bullish’ conditions for shares in company i, with prices going up at a particular time, and s_i = -1 implies ‘bearish’ conditions with decreasing prices. Couplings J_ij describe how price changes in shares of company j affect changes in the price of shares in company i, or how prices are affected jointly by external events. Bury fit stock market data to this spin model, and found clusters in the resulting matrix of couplings bury2013a (). These clusters correspond to different industries whose companies are traded on the market. In borysov2015a (), a similar analysis finds that heavy tails in the distribution of inferred couplings are linked to such clusters. Slonim et al. identified clusters in stocks using an information-based metric of stock prices slonim2005a ().
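A minimal Python sketch of the first step of such an analysis, mapping a hypothetical matrix of prices to spin configurations whose sample correlations and magnetisations then serve as input to the inverse Ising problem:

import numpy as np

def prices_to_spins(prices):
    """Map a (T, N) array of prices of N stocks at T successive times to spins:
    +1 ('bullish') if the price of a stock increased over an interval,
    -1 ('bearish') otherwise. Returns an array of shape (T-1, N)."""
    log_returns = np.diff(np.log(prices), axis=0)
    return np.where(log_returns > 0, 1, -1)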

II Equilibrium reconstruction

The applications discussed above can be classified according to the symmetry of pairwise couplings: In network reconstruction, couplings between spins are generally asymmetric, in maximum entropy models they are symmetric. A stochastic dynamics based on symmetric couplings entails detailed balance, leading to a steady state described by the Boltzmann distribution krapivsky2010kinetic (), whereas asymmetric couplings lead to a non-equilibrium steady state. This distinction shapes the structure of this review: In this section, we discuss the inverse Ising problem in equilibrium, in section III we turn to non-equilibrium scenarios.

ii.0.1 Definition of the problem

We consider the Ising model with N binary spin variables s_i = ±1. Pairwise couplings (or coupling strengths) J_ij encode pairwise interactions between the spin variables, and local magnetic fields h_i act on individual spins. The energy of a spin configuration \mathbf{s} = (s_1, …, s_N) is specified by the Hamiltonian

H(\mathbf{s}) = -\sum_{i<j} J_{ij} s_i s_j - \sum_i h_i s_i        (10)

The equilibrium statistics of the Ising model is described by the Boltzmann distribution

p(\mathbf{s}) = \frac{1}{Z} \exp\{ -H(\mathbf{s}) \}        (11)

where we have subsumed the temperature into couplings and fields: The statistics of spins under the Boltzmann distribution depends on couplings, magnetic fields, and temperature T only through the products βJ_ij and βh_i, with β = 1/(k_B T). As a result, only the products βJ_ij and βh_i can be inferred and we set β to one without loss of generality. The energy specified by the Hamiltonian (2) or its generalisation (7) is thus a dimensionless quantity.

Z denotes the partition function

Z = \sum_{\mathbf{s}} \exp\{ -H(\mathbf{s}) \}        (12)

In such a statistical description of the Ising model, each spin is represented by a random variable. Throughout, we denote a random spin variable by S_i, and a particular realisation of that random variable by s_i. This distinction will become particularly useful in the context of non-equilibrium reconstruction in section III. The expectation values of spin variables and their functions are then denoted

\langle f(\mathbf{S}) \rangle = \sum_{\mathbf{s}} p(\mathbf{s})\, f(\mathbf{s})        (13)

where f is some function mapping a spin configuration to a number. Examples are the equilibrium magnetizations m_i = ⟨S_i⟩ or the pair correlations χ_ij = ⟨S_i S_j⟩. In statistics, the latter observable is called the pair average. We are also interested in the connected correlation C_ij = ⟨S_i S_j⟩ - ⟨S_i⟩⟨S_j⟩, which in statistics is known as the covariance.

The equilibrium statistics of the Ising problem (11) is fully determined by the couplings J_ij between spins and the magnetic fields h_i acting on the spins. Collectively, couplings and magnetic fields are the parameters of the Ising problem. The forward Ising problem is to compute statistical observables such as the magnetizations and correlations under the Boltzmann distribution (11); the couplings and fields are taken as given. The inverse Ising problem works in the reverse direction: The couplings and fields are unknown and are to be determined from observations of the spins. The equilibrium inverse Ising problem is to infer these parameters from spin configurations sampled independently from the Boltzmann distribution. We denote such a data set of M samples by \mathbf{s}^{(m)} with m = 1, …, M. (This usage of the term ‘sample’ appears to differ from how it is used in the statistical mechanics of disordered systems, where a sample often refers to a random choice of the model parameters, not spin configurations. However, it is in line with the inverse nature of the problem: From the point of view of the statistical mechanics of disordered systems, in an inverse statistical problem the ‘phase space variables’ are couplings and magnetic fields to be inferred, and the ‘quenched disorder’ is the spin configurations sampled from the Boltzmann distribution.)

Generally, neither the values of couplings nor the graph structure formed by non-zero couplings is known. Unlike in many instances of the forward problem, the couplings often do not conform to a regular, finite-dimensional lattice; there is no sense of spatial distance between spins. Instead, the couplings might be described by a fully connected graph, with all pairs of spins coupling to each other, generally all with different values of the couplings J_ij. Alternatively, most of the couplings might be zero, and the non-zero entries of the coupling matrix might define a structure that is (at least locally) treelike. The graph formed by the couplings might also be highly heterogeneous, with a few highly connected nodes with many non-zero couplings and many spins coupling only to a few other spins. These distinctions can affect how well specific inference methods perform, a point we will revisit in section II.3, which compares the quality of different methods in different situations.

ii.1 Maximum likelihood

The inverse Ising problem is a problem of statistical inference mackay2003 (); bishop2006 (). At the heart of many methods to reconstruct the parameters of the Ising model is the maximum likelihood framework, which we discuss here.

Suppose a set of observations D = {x^{(1)}, …, x^{(M)}} is drawn from a statistical model p(x | θ). In the case of the Ising model, each observation would be a spin configuration \mathbf{s}^{(m)}. While the functional form of this model may be known a priori, the parameter θ is unknown to us and needs to be inferred from the observed data. Of course, with a finite amount of data, one cannot hope to determine the parameter exactly. The so-called maximum likelihood estimator

\theta^{\ast} = \mathrm{argmax}_{\theta} \prod_{m=1}^{M} p(x^{(m)} \,|\, \theta)        (14)

has a number of attractive properties Cramer1961a (): In the limit of a large number of samples, θ* converges in probability to the value being estimated. This property is termed consistency. Also for large sample sizes, there is no consistent estimator with a smaller mean-squared error. For a finite number of samples, the maximum likelihood estimator may however be biased, that is, the mean of θ* over many realisations of the samples does not equal θ (although the difference vanishes with the sample size). The term likelihood refers to p(D | θ) = \prod_m p(x^{(m)} | θ) viewed as a function of the parameter θ at constant values of the data D. The same function at constant θ gives the probability of observing the data D.
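As a minimal worked example of (14), consider estimating the bias of a coin from M tosses: the likelihood of a head-probability θ given k observed heads is θ^k (1-θ)^(M-k), and its maximum is the observed frequency k/M. The short Python sketch below verifies this numerically for hypothetical data.

import numpy as np

def coin_log_likelihood(theta, k, M):
    """Log-likelihood of a head-probability theta given k heads in M tosses."""
    return k * np.log(theta) + (M - k) * np.log(1.0 - theta)

k, M = 62, 100                                  # hypothetical data
thetas = np.linspace(1e-3, 1.0 - 1e-3, 9999)
theta_ml = thetas[np.argmax(coin_log_likelihood(thetas, k, M))]
print(theta_ml, k / M)                          # both close to 0.62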

The maximum likelihood estimator (14) can also be derived using Bayes theorem mackay2003 (); bishop2006 (). In Bayesian inference, one introduces a probability distribution p(θ) over the unknown parameter θ. This prior distribution describes our knowledge prior to receiving the data. Upon accounting for additional information from the data, our knowledge is described by the posterior distribution given by Bayes theorem

p(\theta \,|\, D) = \frac{p(D \,|\, \theta)\, p(\theta)}{p(D)} .

For the case where θ is a priori uniformly distributed (describing a scenario where we have no prior knowledge of the parameter value), the posterior probability distribution of the parameter conditioned on the observations is proportional to the likelihood p(D | θ). (This line of argument only works if the parameter space is bounded, so that a uniform prior can be defined.) Then the parameter value maximizing the posterior probability density is given by the maximum likelihood estimator (14). Maximizing the logarithm of the likelihood function, termed the log-likelihood function, leads to the same parameter estimate, because the logarithm is a strictly monotonic function. As the likelihood scales exponentially with the number of samples, the log-likelihood is more convenient to use. (This is simply the convenience of not having to deal with very small numbers: the logarithm is not linked to the quenched average considered in the statistical mechanics of disordered systems; there is no average involved and the likelihood depends on both the model parameters and the data.)

We now apply the principle of maximum likelihood to the inverse Ising problem. Assuming that the M configurations in the data set were sampled independently from the Boltzmann distribution (1), the log-likelihood of the model parameters given the observed configurations is derived easily. The expression

L_D(J, h) = \frac{1}{M} \sum_{m=1}^{M} \ln p(\mathbf{s}^{(m)} \,|\, J, h) = \sum_{i<j} J_{ij} \langle s_i s_j \rangle_D + \sum_i h_i \langle s_i \rangle_D - \ln Z(J, h)        (II.1)

gives the log-likelihood per sample, a quantity of order zero in M since the likelihood scales exponentially with the number of samples. The sample averages of spin variables and their functions are defined by

\langle f(\mathbf{s}) \rangle_D = \frac{1}{M} \sum_{m=1}^{M} f(\mathbf{s}^{(m)})        (17)

Beyond the parameters of the Ising model, the log-likelihood (II.1) depends only on the correlations ⟨s_i s_j⟩_D between pairs of spins observed in the data and the magnetizations ⟨s_i⟩_D. To determine the maximum-likelihood estimates of the model parameters we thus only need the pair correlations and magnetizations observed in the sample (sample averages); at least in principle, further observables are superfluous. In the language of statistics, these sets of sample averages provide sufficient statistics to determine the model parameters.
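A short Python sketch of how these sufficient statistics are estimated from a data set of sampled configurations (one configuration with entries +/-1 per row):

import numpy as np

def sample_statistics(samples):
    """Sample magnetizations, pair correlations and connected correlations
    from an (M, N) array of spin configurations with entries +/-1."""
    samples = np.asarray(samples, dtype=float)
    m = samples.mean(axis=0)                      # magnetizations
    chi = samples.T @ samples / len(samples)      # pair correlations
    C = chi - np.outer(m, m)                      # connected correlations (covariances)
    return m, chi, C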

The log-likelihood (II.1) has a physical interpretation: The first two terms are the sample average of the (negative of the) energy; the last term, -\ln Z(J, h), adds the free energy. Thus the log-likelihood is the (negative of the) entropy of the Ising system, based on the sample estimate of the energy. We will further discuss this connection in section II.1.5.

A second interpretation of the log-likelihood is based on the difference between the Boltzmann distribution (11) and the empirical distribution of the data in the sample, denoted p_D(\mathbf{s}) = \frac{1}{M} \sum_m \delta_{\mathbf{s}, \mathbf{s}^{(m)}}. The difference between two probability distributions q and p can be quantified by the Kullback–Leibler (KL) divergence

D_{\mathrm{KL}}(q \,\|\, p) = \sum_{\mathbf{x}} q(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})}        (18)

which is non-negative and reaches zero only when the two distributions are identical Cover2006a (). The KL divergence between the empirical distribution and the Boltzmann distribution is

D_{\mathrm{KL}}(p_D \,\|\, p) = -L_D(J, h) + \sum_{\mathbf{s}} p_D(\mathbf{s}) \ln p_D(\mathbf{s}) .

The second term (the negative empirical entropy) is independent of the model parameters; the best match between the Boltzmann distribution and the empirical distribution (minimal KL divergence) is thus achieved when the likelihood (II.1) is maximal.

Above, we derived the principle of maximum likelihood (14) from Bayes theorem under the assumption that the model parameter is sampled from a uniform prior distribution. Suppose we had the prior information that the parameter was taken from some non-uniform distribution; the posterior distribution would then acquire an additional dependence on the parameter. In the case of the inverse Ising problem, prior information might for example describe the sparsity of the coupling matrix, with a suitable prior that assigns small probabilities to large entries in the coupling matrix. The resulting (log) posterior is

\ln p(J, h \,|\, D) = M\, L_D(J, h) - \lambda \sum_{i<j} |J_{ij}|        (20)

up to terms that do not depend on the model parameters. The maximum of the posterior is now no longer achieved by maximizing the likelihood alone, but involves a second term that penalizes coupling matrices with large entries. Maximizing the posterior with respect to the parameters no longer makes the Boltzmann distribution as similar to the empirical distribution as possible, but strikes a balance between making these distributions similar while avoiding large values of the couplings. In the context of inference, the second term is called a regularisation term. Different regularisation terms have been used, including the absolute-value term in (20) as well as a penalty on the squared values of couplings (called ℓ1- and ℓ2-regularisers, respectively). One standard way to determine the value of the regularisation coefficient λ is to cross-validate with a part of the data that is initially withheld, that is, to probe (as a function of λ) how well the model can predict aspects of the data not yet used to infer the model parameters Hastie2009a ().

II.1.1 Exact maximization of the likelihood

The maximum likelihood estimate of couplings and magnetic fields

(\mathbf{J}^*, \mathbf{h}^*) = \mathrm{argmax}_{\mathbf{J}, \mathbf{h}} \, L_D(\mathbf{J}, \mathbf{h})        (21)

has a simple interpretation. Since $\ln Z(\mathbf{J}, \mathbf{h})$ serves as a generating function for expectation values under the Boltzmann distribution, we have

\frac{\partial L_D}{\partial J_{ij}} = \langle s_i s_j \rangle_D - \langle s_i s_j \rangle , \qquad \frac{\partial L_D}{\partial h_i} = \langle s_i \rangle_D - \langle s_i \rangle .        (22)

At the maximum of the log-likelihood these derivatives are zero; the maximum-likelihood estimate of the parameters is reached when the expectation values of pair correlations and magnetizations under the Boltzmann statistics match their sample averages

\langle s_i s_j \rangle = \langle s_i s_j \rangle_D , \qquad \langle s_i \rangle = \langle s_i \rangle_D .        (23)

The log-likelihood (16) turns out to be a concave function of the model parameters, see section II.1.2. Thus, in principle, it can be maximized by a convex optimization algorithm. One particular way to reach the maximum of the likelihood is a gradient-ascent algorithm called Boltzmann machine learning Ackley1985a (). At each step of the algorithm, fields and couplings are updated according to

J_{ij} \to J_{ij} + \eta \left( \langle s_i s_j \rangle_D - \langle s_i s_j \rangle \right) ,        (24)
h_i \to h_i + \eta \left( \langle s_i \rangle_D - \langle s_i \rangle \right) .        (25)

The parameter $\eta > 0$ is the learning rate of the algorithm, which has (23) as its fixed point.

In order to calculate the Boltzmann expectation values $\langle s_i s_j \rangle$ and $\langle s_i \rangle$ entering these update equations, one needs to perform thermal averages of the form (13) over all $2^N$ configurations, which is infeasible for all but the smallest system sizes. Analogously, when maximizing the log-likelihood (16) directly, the partition function is a sum over $2^N$ terms. Moreover, the expectation values or the partition function need to be evaluated many times during an iterative search for the solution of (23) or the maximum of the likelihood. As a result, numerical sampling techniques such as Monte Carlo sampling are also cumbersome, but have been used for moderate system sizes broderick2007faster (). Habeck proposes a Monte Carlo sampler that draws model parameters from the posterior distribution habeck2014 (). A recent algorithm uses information contained in the shape of the likelihood maximum to speed up convergence ferrari2015a (). An important development in machine learning has led to the so-called restricted Boltzmann machines, where couplings form a symmetric and bipartite graph: variables fall into two classes, termed ‘visible’ and ‘hidden’, with couplings never linking variables of the same class. This allows fast learning algorithms fischer2012introduction () at the expense of additional hidden variables.
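For a handful of spins, the Boltzmann averages can be computed by exact enumeration and the updates (24)-(25) iterated directly. The sketch below is purely illustrative (it assumes the sufficient_statistics helper defined above and spins taking values $\pm 1$) and implements this brute-force variant of Boltzmann machine learning.

```python
import itertools
import numpy as np

def boltzmann_averages(J, h):
    """Exact <s_i> and <s_i s_j> by enumerating all 2^N configurations."""
    N = len(h)
    configs = np.array(list(itertools.product([-1, 1], repeat=N)), dtype=float)
    # H(s) = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i, with J symmetric and zero diagonal
    energies = -0.5 * np.einsum('ki,ij,kj->k', configs, J, configs) - configs @ h
    weights = np.exp(-energies)
    weights /= weights.sum()
    m = weights @ configs                                       # <s_i>
    chi = np.einsum('k,ki,kj->ij', weights, configs, configs)   # <s_i s_j>
    return m, chi

def boltzmann_machine_learning(m_D, chi_D, eta=0.1, steps=2000):
    """Gradient ascent on the log-likelihood, following eqs. (24)-(25)."""
    N = len(m_D)
    J, h = np.zeros((N, N)), np.zeros(N)
    for _ in range(steps):
        m, chi = boltzmann_averages(J, h)
        J += eta * (chi_D - chi)
        np.fill_diagonal(J, 0.0)          # no self-couplings
        h += eta * (m_D - m)
    return J, h
```

On data generated from a small Ising model, this iteration converges to couplings and fields satisfying the moment-matching conditions (23); for larger systems the exact enumeration has to be replaced by Monte Carlo estimates of the averages.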

We stress that the difficulty of maximizing the likelihood is associated with the restriction of our input to the first two moments (magnetisations and correlations) of the data. On the one hand, this restriction is natural, as the likelihood only depends on these two moments. On the other hand, computationally efficient methods have been developed that effectively use correlations in the data beyond the first two moments. An important example is pseudolikelihood, which we will discuss in section II.2. Other learning techniques that sidestep the computation of the partition function include score matching hyvarinen2005estimation () and minimum probability flow sohl2011new (). Also, when the number of samples is small (compared to the number of spins), the likelihood need no longer be the best quantity to optimize.

II.1.2 Uniqueness of the solution

We will show that the log-likelihood (16) is a strictly concave function of the model parameters (couplings and magnetic fields). As the space of parameters is convex, the maximum of the log-likelihood is unique.

We use the shorthands $\boldsymbol{\lambda} = \{\lambda_k\}$ for the model parameters and $\{\phi_k(\mathbf{s})\}$ for the observables coupling to them (for the Ising model, the $\lambda_k$ are the couplings $J_{ij}$ and fields $h_i$, and the corresponding $\phi_k$ are the products $s_i s_j$ and the single spins $s_i$) and write the Boltzmann distribution as

p(\mathbf{s}) = \frac{1}{Z(\boldsymbol{\lambda})} \exp\Big\{ \sum_k \lambda_k \phi_k(\mathbf{s}) \Big\} .        (26)

For such a general class of exponential distributions Hastie2009a (), the second derivatives of the log-likelihood with respect to the parameters obey

-\frac{\partial^2 L_D}{\partial \lambda_k \, \partial \lambda_l} = \langle \phi_k \phi_l \rangle - \langle \phi_k \rangle \langle \phi_l \rangle .        (27)

The right-hand side of (27) is the covariance matrix of the observables $\phi_k$, which is non-negative (has no negative eigenvalues) since $\sum_{kl} x_k \left( \langle \phi_k \phi_l \rangle - \langle \phi_k \rangle \langle \phi_l \rangle \right) x_l = \big\langle \big( \sum_k x_k (\phi_k - \langle \phi_k \rangle) \big)^2 \big\rangle \geq 0$ for all $\mathbf{x}$; the log-likelihood is therefore concave. If no non-trivial linear combination of the observables has vanishing fluctuations, the covariance matrix is even positive definite and the log-likelihood is strictly concave. For the inverse Ising problem, there are indeed no non-trivial linear combinations of the spin variables and pairs of spin variables that do not fluctuate under the Boltzmann measure, unless some of the couplings or fields are infinite. As a result, the maximum of the likelihood, if it exists, is unique. However, the maximum may lie at infinite values of some of the parameters (for instance, when the samples contain only positive values of a particular spin, the maximum-likelihood value of the corresponding magnetic field is infinite). These divergences can be avoided by introducing a regularisation term, see section II.1.

II.1.3 Maximum entropy modelling

The Boltzmann distribution in general and the Ising Hamiltonian (2) in particular can be derived from information theory and the principle of maximum entropy. This principle has been invoked in neural modelling Schneidman2006a (), protein structure determination Weigt2009a (), and DNA sequence analysis mora2010a (). In this section, we discuss the statistical basis of Shannon’s entropy, the principle of maximum entropy, and their application to inverse statistical modelling.

Consider $N$ distinguishable balls, each to be placed in a box with $q$ compartments. The number of ways of placing the balls such that $N_k$ balls are in the $k$th compartment ($k = 1, \ldots, q$) is

W(N_1, \ldots, N_q) = \frac{N!}{N_1! \, N_2! \cdots N_q!} ,        (28)

with $\sum_{k=1}^{q} N_k = N$. For large $N$, we write $N_k = N p_k$ and exploit Stirling’s formula $\ln N! \simeq N \ln N - N$, yielding the Gibbs entropy

\frac{1}{N} \ln W = -\sum_{k=1}^{q} p_k \ln p_k .        (29)

This combinatorial result forms the basis of equilibrium statistical physics in the classic treatment due to Gibbs and can be found in standard textbooks. In the context of statistical physics, each of the $q$ compartments corresponds to a microstate of a system, and each microstate $k$ is associated with an energy $E_k$. The balls in the compartments describe a set of $N$ copies of the system, a so-called ensemble of replicas. The replicas may exchange energy with each other, while the ensemble of replicas itself is isolated and has a fixed total energy (and possibly other conserved quantities). In this way, the replicas can be thought of as providing a heat bath for each other. If we assume that each state of the ensemble of replicas with a given total energy is equally likely, the statistics of the occupation numbers $N_k$ is dominated by a sharp maximum of $W$ as a function of the $N_k$, subject to the constraints $\sum_k N_k = N$ and $\sum_k N_k E_k = E_{\mathrm{tot}}$. Using Lagrange multipliers to maximize (29) subject to these constraints yields the Boltzmann distribution jaynes1957a ().

This seminal line of argument can also be used to derive Shannon’s information entropy (3). The argument is due to Wallis and is recounted in Jaynes1989a (). Suppose we want to find a probability distribution $\{p_k\}$ compatible with a certain constraint, for instance a specific expectation value known to within some small margin of error. Consider $N$ independent casts of a fair die with $q$ faces. We denote the number of times outcome $k$ is realized in these $N$ throws by $N_k$. The probability of a particular set $\{N_k\}$ is

P(N_1, \ldots, N_q) = q^{-N} \frac{N!}{N_1! \, N_2! \cdots N_q!} .        (30)

In the limit of large $N$, the logarithm of this probability is $\frac{1}{N} \ln P \simeq -\ln q - \sum_k p_k \ln p_k$, with $p_k = N_k / N$.

Each set of $N$ casts defines one instance of the $\{N_k\}$. In most instances, the constraint will not be satisfied. For those (potentially rare) instances obeying the constraint, we can ask what are the most likely values of $N_k$, and correspondingly of $p_k = N_k/N$. Maximising Shannon’s information entropy $-\sum_k p_k \ln p_k$ subject to the constraint and the normalisation $\sum_k p_k = 1$ gives the so-called maximum-entropy estimate of $\{p_k\}$. If the underlying set of probabilities (the die with $q$ faces) differs from the uniform distribution, so outcome $k$ occurs with probability $q_k$, it is not the entropy but the relative entropy $-\sum_k p_k \ln (p_k / q_k)$ that is to be maximised. Up to a sign, this is the Kullback-Leibler divergence (18) between $\{p_k\}$ and $\{q_k\}$.

The maximum-entropy estimate can be used to approximate an unknown probability distribution that is under-sampled. Suppose data is sampled repeatedly from some unknown probability distribution. With a sufficient number of samples, the distribution can easily be determined from frequency counts. Often this is not feasible; if the number of samples is small compared to the number of possible outcomes, the frequency counts fluctuate strongly from one set of samples to the next. This situation appears naturally when the number of possible outcomes grows exponentially with the size of the system, see e.g. section I.1. Nevertheless, the data may be sufficient to pin down one or several expectation values. The maximum-entropy estimate has been proposed as the most unbiased estimate of the unknown probability distribution compatible with the observed expectation values Jaynes1989a (). For a discussion of the different ways to justify the maximum-entropy principle, and for derivations based on robust estimation, see tikochinsky1984alternative ().
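As a toy illustration of the maximum-entropy estimate, one can maximise the entropy of a distribution over the six faces of a die subject to a prescribed mean. The sketch below is purely illustrative (the constraint value 4.5 is arbitrary); it uses scipy’s constrained optimiser and recovers the familiar exponential (Gibbs-like) weights $p_k \propto e^{\lambda k}$, with the Lagrange multiplier $\lambda$ fixed by the mean constraint.

```python
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)
target_mean = 4.5   # example constraint <k> = 4.5

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},             # normalisation
    {"type": "eq", "fun": lambda p: p @ faces - target_mean},   # mean constraint
]
p0 = np.full(6, 1.0 / 6.0)
result = minimize(neg_entropy, p0, method="SLSQP",
                  constraints=constraints, bounds=[(0, 1)] * 6)
p_maxent = result.x   # approximately proportional to exp(lambda * k)
```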

Many applications of the maximum-entropy estimate are in image analysis and spectral analysis gull1984a (); for reviews in physics and biology see banavar2010a (); bialek2007rediscovering (); Mora2011b (), and for critical discussions see aurell2016maximum (); vannimwegen2016inferring ().

The connection between maximum entropy and the inverse Ising problem is simple: for a set of binary variables, the distribution with given first and second moments that maximizes the information entropy is the Boltzmann distribution (1) with the Ising Hamiltonian (2). We use Lagrange multipliers to maximize the information entropy (3) subject to the normalization condition $\sum_{\mathbf{s}} p(\mathbf{s}) = 1$ and the constraints that the first and second moments (magnetisations and pair correlations) of $p(\mathbf{s})$ equal $\langle s_i \rangle_D$ and $\langle s_i s_j \rangle_D$. Setting the derivatives of

-\sum_{\mathbf{s}} p(\mathbf{s}) \ln p(\mathbf{s}) + \Lambda \Big( \sum_{\mathbf{s}} p(\mathbf{s}) - 1 \Big) + \sum_i h_i \Big( \sum_{\mathbf{s}} p(\mathbf{s})\, s_i - \langle s_i \rangle_D \Big) + \sum_{i<j} J_{ij} \Big( \sum_{\mathbf{s}} p(\mathbf{s})\, s_i s_j - \langle s_i s_j \rangle_D \Big)        (31)

with respect to $p(\mathbf{s})$ to zero yields the Ising model (1). The Lagrange multipliers $J_{ij}$ and $h_i$ need to be chosen such that the first and second moments (magnetisations and correlations) of the data are reproduced, and they can be interpreted as couplings between spins and magnetic fields.

While this principle appears to provide a statistical foundation to the model (1), there is no a priori reason to disregard empirical data beyond the first two moments. Instead, the pairwise couplings result from the particular choice of making the probability distribution match the first two moments of the data. The reasons for this step may be different in different applications.

  • Moments beyond the first and second may be poorly determined by the data. Conversely, with an increasing number of samples, the determination of higher order correlations and hence interactions between triplets of spin variables etc. becomes viable.

  • The data may actually be generated by an equilibrium model with (at most) pairwise interactions between spin variables. This need not be obvious from observed correlations of any order, but can be tested by comparing three-point correlations predicted by a model with pairwise couplings to the corresponding correlations in the data. Examples are found in sequence analysis, where population dynamics leads to an equilibrium steady state BergWillmannLaessig2004 (); SellaHirsh2005 () and the energy can often be approximated by pairwise couplings mora2010a (); santolini2014a (). For a review see stein2015a (). Surprisingly, also in neural data (not generated by an equilibrium model), three-point correlations are predicted well by a model with pairwise interactions tkacik2009spin (); tkavcik2014a ().

  • A model of binary variables interacting via higher-order coupling terms can sometimes be approximated surprisingly well by pairwise interactions. This seems to be the case when the couplings are dense, so that each variable appears in several coupling terms merchan2015a ().

  • Often one seeks to describe a subset of variables from a larger set of variables, for instance when only the variables in the subset can be observed. The subset of variables is characterized by effective interactions which stem from interactions between variables in the subset, and from interactions with the other variables. If the subset is sufficiently small, the resulting statistics is often described by a model with pairwise couplings roudinirenberglatham2009a ().

  • The true probability distribution underlying some data may be too complicated to calculate in practice. A more modest goal then is to describe the data using an effective statistical model such as (1), which is tractable and allows the derivation of bounds on the entropy or the free energy. Examples are the description of neural data and gene expression data using the Ising model with symmetric couplings (see I.1 and I.2).

  • There are also useful models which are computationally tractable but do not maximize the entropy. An example is the use of Gaussian models to generate artificial spike trains with prescribed pair correlations amari2003synchronous (); macke2009generating (); a minimal sketch of this idea follows below.
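The following sketch illustrates the idea behind such Gaussian-based generators (a simplified variant of the dichotomized-Gaussian approach of macke2009generating ()): correlated Gaussian variables are thresholded to produce binary spike patterns. The latent covariance and thresholds are used directly here rather than being calibrated to reproduce target binary moments exactly, so the resulting pair correlations only approximate the prescribed ones; all parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm

def thresholded_gaussian_spikes(mean_rates, latent_cov, n_samples, seed=0):
    """Generate binary (0/1) spike patterns by thresholding correlated Gaussians.

    mean_rates: desired firing probabilities per neuron (set the thresholds).
    latent_cov: covariance of the latent Gaussian; its off-diagonal entries
                control, approximately, the pairwise spike correlations.
    """
    rng = np.random.default_rng(seed)
    thresholds = norm.ppf(1.0 - np.asarray(mean_rates))   # P(z > threshold) = rate
    z = rng.multivariate_normal(np.zeros(len(mean_rates)), latent_cov, size=n_samples)
    return (z > thresholds).astype(int)

# illustrative usage: three neurons with weak positive latent correlations
cov = np.array([[1.0, 0.3, 0.1],
                [0.3, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
spikes = thresholded_gaussian_spikes([0.2, 0.3, 0.25], cov, n_samples=10000)
```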

II.1.4 Information-theoretic bounds on graphical model reconstruction

A particular facet of the inverse Ising problem is graphical model selection. Consider the Ising problem on a graph: a set of nodes connected by edges, with each node carrying a spin variable. Couplings between node pairs connected by an edge are non-zero; couplings between unconnected node pairs are zero. The graphical model selection problem is to recover the underlying graph (and usually also the values of the couplings) from data sampled independently from the Boltzmann distribution. Given a particular number of samples, one can ask with which probability a given method reconstructs the graph correctly (the reconstruction fluctuates between different realisations of the samples). Notably, there are also universal limits on graphical model selection that are independent of any particular method.

In santhanam2012information (), Santhanam and Wainwright derive information-theoretic limits to graphical model selection. A key result is the dependence of the required number of samples on the smallest and the largest coupling,

J_{\min} = \min_{(ij):\, J_{ij} \neq 0} |J_{ij}| , \qquad J_{\max} = \max_{(ij)} |J_{ij}| ,        (32)

and on the maximum node connectivity (number of neighbours on the graph) $d$. Reconstruction of the graph, by any method, is impossible if fewer than

(33)

samples are available (the precise statement is of a probabilistic nature, see santhanam2012information ()). If the maximum connectivity $d$ grows with the system size, this result implies that the required number of samples grows at least exponentially in $d$, $M \gtrsim e^{c d}$ with some constant $c$ santhanam2012information (). The derivation of this and other results is based on Fano’s inequality (Fano’s lemma) Cover2006a (), which gives a lower bound on the probability of error of a classification function (such as the mapping from samples to the graph underlying these samples).

II.1.5 Thermodynamics of the inverse Ising problem

Calculations in statistical physics are greatly simplified by introducing thermodynamic potentials. In this section, we will discuss the method of thermodynamic potentials for the inverse Ising problem. It turns out that the maximum likelihood estimation of the fields and couplings is simply a transformation of the thermodynamic potentials.

Recall that the thermodynamic potential most useful for the forward problem, where couplings and magnetic fields are given, is the Helmholtz free energy $F(\mathbf{J}, \mathbf{h}) = -\ln Z(\mathbf{J}, \mathbf{h})$. Derivatives of this free energy give the magnetizations, correlations, and other observables. The thermodynamic potential most useful for the inverse problem, where the pair correlations $\chi_{ij}$ and magnetizations $m_i$ are given, is the Legendre transform of the Helmholtz free energy with respect to both couplings and fields Sessak2009a (); Cocco2011a (); Cocco2012a ()

S(\{\chi_{ij}\}, \{m_i\}) = \min_{\mathbf{J}, \mathbf{h}} \Big[ \ln Z(\mathbf{J}, \mathbf{h}) - \sum_{i<j} J_{ij} \chi_{ij} - \sum_i h_i m_i \Big] .        (34)

This thermodynamic potential is readily recognised as the entropy function; evaluated at the sample correlations and magnetisations it gives, up to a sign, the maximum of the log-likelihood (16) over the model parameters. The transformation (34) thus provides a link between inference via maximum likelihood and the statistical physics of the Ising model as described by its Helmholtz free energy. The couplings and the fields are found by differentiation,

J_{ij} = -\frac{\partial S}{\partial \chi_{ij}} , \qquad h_i = -\frac{\partial S}{\partial m_i} ,        (35)

where the derivatives are evaluated at the sample correlations and magnetizations. These relationships follow from the inverse transformation of (34)

\ln Z(\mathbf{J}, \mathbf{h}) = \max_{\{\chi_{ij}\}, \{m_i\}} \Big[ S(\{\chi_{ij}\}, \{m_i\}) + \sum_{i<j} J_{ij} \chi_{ij} + \sum_i h_i m_i \Big]        (36)

by setting the derivatives of the term in square brackets with respect to $\chi_{ij}$ and $m_i$ to zero.

In practice, performing the Legendre transformation with respect to both the couplings and the fields is often not necessary; derivatives of the Helmholtz free energy with respect to the couplings can also be generated by differentiating with respect to the fields, e.g.,

\frac{\partial \ln Z}{\partial J_{ij}} = \langle s_i s_j \rangle = \frac{\partial^2 \ln Z}{\partial h_i \, \partial h_j} + \frac{\partial \ln Z}{\partial h_i} \frac{\partial \ln Z}{\partial h_j} .        (37)

The thermodynamics of the inverse problem can thus be reduced to a single Legendre transform of the Helmholtz free energy, yielding the Gibbs free energy

G(\mathbf{J}, \{m_i\}) = \max_{\mathbf{h}} \Big[ \sum_i h_i m_i - \ln Z(\mathbf{J}, \mathbf{h}) \Big] .        (38)

The magnetic fields are given by the first derivative of the Gibbs free energy

h_i = \frac{\partial G}{\partial m_i} .        (39)

To infer the couplings, we consider the second derivatives of the Gibbs potential, which give

\frac{\partial^2 G}{\partial m_i \, \partial m_j} = (C^{-1})_{ij} ,        (40)

where $C$ is the matrix of connected correlations $C_{ij} = \langle s_i s_j \rangle - \langle s_i \rangle \langle s_j \rangle$. Equation (40) follows from the inverse function theorem,

\sum_k \frac{\partial h_i}{\partial m_k} \frac{\partial m_k}{\partial h_j} = \delta_{ij} ,        (41)

and linear response theory

\frac{\partial m_i}{\partial h_j} = \langle s_i s_j \rangle - \langle s_i \rangle \langle s_j \rangle = C_{ij} ,        (42)

which links the susceptibility of the magnetization $m_i$ to a small change in the magnetic field $h_j$ with the connected correlation $C_{ij}$ stanley1987introduction ().

The result (40) turns out to be central to many methods for the inverse Ising problem. The left-hand side of this expression is a function of the magnetisations and the couplings. If the Gibbs free energy (38) can be evaluated or approximated, (40) can be solved to yield the couplings. Similarly, (39) with the estimated couplings and the sample magnetisations gives the magnetic fields, completing the reconstruction of the parameters of the Ising model.
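The linear-response relation (42) is easy to verify numerically for a small system. The sketch below is illustrative only; it reuses the boltzmann_averages helper from the sketch above and compares the connected correlation matrix with a finite-difference estimate of $\partial m_i / \partial h_j$.

```python
import numpy as np

def connected_correlations(J, h):
    """Connected correlation matrix C_ij = <s_i s_j> - <s_i><s_j>."""
    m, chi = boltzmann_averages(J, h)
    return chi - np.outer(m, m)

def susceptibility_finite_difference(J, h, eps=1e-5):
    """Numerical estimate of dm_i/dh_j, to be compared with C_ij, eq. (42)."""
    N = len(h)
    dm_dh = np.zeros((N, N))
    for j in range(N):
        h_plus, h_minus = h.copy(), h.copy()
        h_plus[j] += eps
        h_minus[j] -= eps
        m_plus, _ = boltzmann_averages(J, h_plus)
        m_minus, _ = boltzmann_averages(J, h_minus)
        dm_dh[:, j] = (m_plus - m_minus) / (2 * eps)
    return dm_dh

# for any small J and h the two matrices agree up to the finite-difference error:
# np.allclose(connected_correlations(J, h), susceptibility_finite_difference(J, h))
```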

II.1.6 Variational principles

For most systems, neither the free energy nor the other thermodynamic potentials can be evaluated exactly. However, there are many approximation schemes for the Helmholtz free energy AMF (), which lead to approximations for the entropy and the Gibbs free energy. Direct approximation schemes for the entropy and the Gibbs free energy have also been formulated within the context of the inverse Ising problem. The key idea behind most of these approximations is a variational principle.

The variational principle for the free energy is

F(\mathbf{J}, \mathbf{h}) = \min_q \mathcal{F}[q] , \qquad \mathcal{F}[q] = \langle H \rangle_q - S[q] ,        (43)

where $q(\mathbf{s})$ denotes a probability distribution over spin configurations, $\langle H \rangle_q = \sum_{\mathbf{s}} q(\mathbf{s}) H(\mathbf{s})$ and $S[q] = -\sum_{\mathbf{s}} q(\mathbf{s}) \ln q(\mathbf{s})$, and the minimisation is taken over all distributions $q$. This principle finds its origin in information theory: for an arbitrary trial distribution $q$, the Kullback-Leibler divergence (18) between $q$ and the Boltzmann distribution $p$ is non-negative and vanishes if and only if $q = p$ Cover2006a (). One then arrives directly at (43) by rewriting $D_{KL}(q \| p) = \mathcal{F}[q] - F(\mathbf{J}, \mathbf{h})$.

We will refer to $\mathcal{F}[q]$ as the functional Helmholtz free energy; it is also called the non-equilibrium free energy in the context of non-equilibrium statistical physics parrondo2015a (). Another term in use is ‘Gibbs free energy’ AMF (), a name we have reserved for the thermodynamic potential (38).

So far, nothing has been gained, as the minimum is taken over all possible distributions $q$, including the Boltzmann distribution itself. A practical approximation arises when a constraint is put on $q$, leading to a restricted family of trial distributions. Often the minimisation can then be carried out over that family, yielding an upper bound to the Helmholtz free energy AMF ().

In the context of the inverse problem, it is useful to derive variational principles for the other thermodynamic potentials as well. Using the definition of the Gibbs potential (38) and the variational principle for the Helmholtz potential (43), we obtain

G(\mathbf{J}, \{m_i\}) = \max_{\mathbf{h}} \min_q \Big[ \mathcal{F}[q] + \sum_i h_i m_i \Big] .        (44)

By means of Lagrange multipliers it is easy to show that this double extremum can be obtained by a single conditional minimisation,

G(\mathbf{J}, \{m_i\}) = \min_{q \in \mathcal{Q}_{\mathbf{m}}} \mathcal{G}[q] , \qquad \mathcal{G}[q] = -\sum_{i<j} J_{ij} \langle s_i s_j \rangle_q - S[q] ,        (45)

where the set $\mathcal{Q}_{\mathbf{m}}$ denotes all distributions $q$ with given magnetisations $\langle s_i \rangle_q = m_i$ AMF (). We will refer to the functional $\mathcal{G}[q]$ as the functional Gibbs free energy defined on $\mathcal{Q}_{\mathbf{m}}$.

Similarly, the variational principle can be applied to the entropy function $S(\{\chi_{ij}\}, \{m_i\})$, leading once again to a close relationship between statistical modelling and thermodynamics. The entropy (34) is found to be

S(\{\chi_{ij}\}, \{m_i\}) = \max_{q \in \mathcal{Q}_{\mathbf{m}, \boldsymbol{\chi}}} S[q] ,        (46)

where $\mathcal{Q}_{\mathbf{m}, \boldsymbol{\chi}}$ denotes the set of distributions $q$ with $\langle s_i \rangle_q = m_i$ and $\langle s_i s_j \rangle_q = \chi_{ij}$. This is nothing but the maximum-entropy principle Jaynes1989a (): the variational principle identifies the distribution with the maximum information entropy subject to the constraints on magnetisations and spin correlations, which are set equal to their sample averages (see the section on maximum entropy modelling above).

II.1.7 Mean-field theory

As a first demonstration of the variational principle, we derive the mean-field theory for the inverse Ising problem. The starting point is an ansatz for the Boltzmann distribution (11) which factorises over the sites stanley1987introduction (); AMF (); tanaka2000a ()

q^{MF}(\mathbf{s}) = \prod_i \frac{1 + m_i s_i}{2} ,        (47)

thus making the different spin variables statistically independent of one another. The parameters $m_i$ of this ansatz are the spin magnetizations, $\langle s_i \rangle_{q^{MF}} = m_i$; each spin has a magnetisation resulting from the effective magnetic field acting on that spin. This effective field arises from its local magnetic field $h_i$, as well as from the couplings to the other spins. The ‘mean field’ that gives mean-field theory its name is the average of this effective field over typical configurations.

Using the mean-field ansatz, we now estimate the Gibbs free energy. Within the mean-field ansatz, the minimisation of the variational Gibbs potential (45) is trivial: there is only a single mean-field distribution (47) that satisfies the constraint that the spins have magnetisations $m_i$, namely the one whose parameters equal those magnetisations. We can thus directly write down the mean-field Gibbs free energy

G^{MF}(\mathbf{J}, \{m_i\}) = -\sum_{i<j} J_{ij} m_i m_j + \sum_i \Big[ \frac{1+m_i}{2} \ln \frac{1+m_i}{2} + \frac{1-m_i}{2} \ln \frac{1-m_i}{2} \Big] .        (48)

The equation for the couplings follows from the second derivatives of $G^{MF}$, cf. equation (40),

(C^{-1})_{ij} = \frac{\partial^2 G^{MF}}{\partial m_i \, \partial m_j} = -J_{ij} \quad (i \neq j) , \qquad \text{i.e.} \qquad J_{ij}^{MF} = -(C^{-1})_{ij} .        (49)

Similarly, the reconstruction of the magnetic fields follows from the first derivative of $G^{MF}$ with respect to $m_i$, cf. equation (39),

h_i^{MF} = \tanh^{-1}(m_i) - \sum_j J_{ij} m_j .        (50)

This result establishes a simple relationship between the observed connected correlations and the couplings between spins in terms of the inverse of the correlation matrix. The matrix inverse in (49) is of course much simpler to compute than the maximum of the likelihood (16) and takes only a polynomial number of steps: Gauß-Jordan elimination for inverting an $N \times N$ matrix requires $\mathcal{O}(N^3)$ operations, compared to the exponentially large number of steps needed to compute the partition function or its derivatives.
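In practice, mean-field reconstruction thus amounts to inverting the sample connected-correlation matrix. A minimal sketch (illustrative only; it assumes the sample magnetisations and pair correlations computed earlier, and it discards the diagonal of the inferred coupling matrix, as discussed below):

```python
import numpy as np

def mean_field_reconstruction(m_D, chi_D):
    """Naive mean-field inference: J = -(C^{-1}) off-diagonal (eq. 49),
    fields from eq. (50)."""
    C = chi_D - np.outer(m_D, m_D)        # connected correlations
    J = -np.linalg.inv(C)
    np.fill_diagonal(J, 0.0)              # diagonal entries carry no physical meaning
    h = np.arctanh(m_D) - J @ m_D         # eq. (50)
    return J, h
```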

The standard route to mean-field reconstruction proceeds somewhat differently, namely via the Helmholtz free energy rather than the Gibbs free energy. Early work addressing the inverse Ising problem within the mean-field approximation was performed by Peterson and Anderson peterson1987mean (). In Kappen1998b (), Kappen and Rodríguez construct the functional Helmholtz free energy $\mathcal{F}[q^{MF}]$ given by (43) under the mean-field ansatz (47). $\mathcal{F}[q^{MF}]$ is then minimised with respect to the parameters of the mean-field ansatz by setting its derivatives with respect to $m_i$ to zero. This yields equations which determine the values of the magnetization parameters that minimize the KL divergence between the mean-field ansatz and the Boltzmann distribution, namely the well-known self-consistent equations

m_i = \tanh\Big( h_i + \sum_j J_{ij} m_j \Big) .        (51)

Using (51) as an approximation for the equilibrium magnetisations, one can derive the so-called linear-response approximation for the connected correlation function $C_{ij}$. Taking derivatives of the self-consistent equations (51) with respect to the local fields gives

C_{ij} = \frac{\partial m_i}{\partial h_j} = (1 - m_i^2) \Big( \delta_{ij} + \sum_k J_{ik} C_{kj} \Big) ,        (52)

where we have used the fact that the diagonal terms of the coupling matrix are zero. Identifying the resulting connected correlations with the sample correlations leads to a system of linear equations to be solved for the couplings Kappen1998b (). However, to obtain (52), we have used that the diagonal elements $J_{ii}$ are zero. With these constraints, the system of equations (52) becomes over-determined, and in general there is no solution. Although different procedures have been suggested to overcome this problem Kappen1998b (); Ricci2012a (); JaquinRancon2016 (), there seems to be no canonical way out of this dilemma. The most common approach is to ignore the constraints on the diagonal elements altogether and invert equation (52) to get

(C^{-1})_{ij} = \frac{\delta_{ij}}{1 - m_i^2} - J_{ij} .        (53)

This result agrees with the reconstruction (49) via the Gibbs free energy, except for the non-zero diagonal couplings, which bear no physical meaning and are to be ignored. No diagonal couplings arise in the approach based on the Gibbs free energy, since equation (40) with $i = j$ does not involve any unknown couplings $J_{ii}$.

II.1.8 The Onsager term and TAP reconstruction

The variational estimate of the Gibbs free energy (48) can be improved further. In 1977, Thouless, Anderson, and Palmer (TAP) advocated adding a term to the Gibbs free energy

G^{TAP}(\mathbf{J}, \{m_i\}) = G^{MF}(\mathbf{J}, \{m_i\}) - \frac{1}{2} \sum_{i<j} J_{ij}^2 (1 - m_i^2)(1 - m_j^2) .        (54)

This term can be interpreted as describing the effect of the fluctuations of a spin variable on the magnetisation of that same spin, via their impact on the neighbouring spins Thouless1977a (). It is called the Onsager term; we will derive it in section II.1.13 in the context of a systematic expansion around the mean-field ansatz. For the forward problem, adding this term modifies the self-consistent equation (51) to the so-called TAP equation

m_i = \tanh\Big( h_i + \sum_j J_{ij} m_j - m_i \sum_j J_{ij}^2 (1 - m_j^2) \Big) .        (55)

In the inverse problem, the TAP free energy (54) gives an equation for the couplings based on (40),

(C^{-1})_{ij} = \frac{\partial^2 G^{TAP}}{\partial m_i \, \partial m_j} = -J_{ij} - 2 J_{ij}^2 m_i m_j \quad (i \neq j) .        (56)

Solving this quadratic equation gives the TAP reconstruction Kappen1998b (); tanaka1998mean ()

J_{ij}^{TAP} = \frac{\sqrt{1 - 8 m_i m_j (C^{-1})_{ij}} - 1}{4 m_i m_j} ,        (57)

where we have chosen the solution that coincides with the mean-field reconstruction when the magnetisations are zero. The magnetic fields can again be found by differentiating the Gibbs free energy

h_i^{TAP} = \tanh^{-1}(m_i) - \sum_j J_{ij} m_j + m_i \sum_j J_{ij}^2 (1 - m_j^2) .        (58)
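A direct implementation of the TAP reconstruction (57)-(58) differs from the mean-field sketch above only by the quadratic correction. The following sketch is again illustrative; where the product $m_i m_j$ is close to zero it falls back to the mean-field value, which is the corresponding limit of (57).

```python
import numpy as np

def tap_reconstruction(m_D, chi_D, tiny=1e-8):
    """TAP inference of couplings, eq. (57), and fields, eq. (58)."""
    C = chi_D - np.outer(m_D, m_D)
    Cinv = np.linalg.inv(C)
    mm = np.outer(m_D, m_D)
    J_mf = -Cinv                                     # mean-field value, eq. (49)
    with np.errstate(divide="ignore", invalid="ignore"):
        # a negative argument of the square root signals that TAP breaks down
        J_tap = (np.sqrt(1.0 - 8.0 * mm * Cinv) - 1.0) / (4.0 * mm)
    J = np.where(np.abs(mm) > tiny, J_tap, J_mf)     # limit m_i m_j -> 0 gives J_mf
    np.fill_diagonal(J, 0.0)
    h = np.arctanh(m_D) - J @ m_D + m_D * (J**2 @ (1.0 - m_D**2))
    return J, h
```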

II.1.9 Couplings without a loop: mapping to the minimum spanning tree problem

The computational hardness of implementing Boltzmann machine learning (24) comes from the difficulty of computing correlations under the Boltzmann measure, which can require a computational time that scales exponentially with the system size. This scaling originates from the presence of loops in the graph of couplings between the spins. Graphs for which correlations can be computed efficiently are the acyclic graphs, or trees, so it comes as no surprise that the first efficient method to solve the inverse Ising problem was developed for trees, already in 1968. This was done by Chow and Liu Chow1968a () in the context of a product approximation to a multivariate probability distribution. While the method itself can be used as a crude approximation for models with loops or as a reference point for more advanced methods, the exact result by Chow and Liu is of basic interest in itself, as it provides a mapping of the inverse Ising problem with couplings forming a tree onto a minimum spanning tree (MST) problem. The MST problem is a classic problem of combinatorial optimisation, for which many efficient algorithms exist. This section on Chow and Liu’s result also provides some of the background needed in section II.1.10 on the Bethe ansatz and section II.1.11 on belief propagation.
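To preview the algorithmic content of Chow and Liu’s result, derived below: the optimal tree maximises the sum of empirical mutual informations over its edges, so it can be found by running a standard spanning-tree algorithm with mutual-information weights. A minimal sketch using scipy (illustrative only; it assumes $\pm 1$ data as above and estimates the mutual information from the empirical one- and two-spin marginals):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def pairwise_mutual_information(samples, pseudocount=1e-9):
    """Empirical mutual information I_ij for all pairs of +/-1 spin variables."""
    samples = np.asarray(samples)
    M, N = samples.shape
    I = np.zeros((N, N))
    edges = [-1.5, 0.0, 1.5]                         # bins for spin values -1 and +1
    for i in range(N):
        for j in range(i + 1, N):
            pij, _, _ = np.histogram2d(samples[:, i], samples[:, j], bins=[edges, edges])
            pij = pij / M + pseudocount
            pi, pj = pij.sum(axis=1), pij.sum(axis=0)
            I[i, j] = I[j, i] = np.sum(pij * np.log(pij / np.outer(pi, pj)))
    return I

def chow_liu_tree(samples):
    """Edges of the maximum mutual-information spanning tree (Chow-Liu)."""
    I = pairwise_mutual_information(samples)
    # minimum_spanning_tree minimises the total weight, so negate the informations
    tree = minimum_spanning_tree(-I).tocoo()
    return list(zip(tree.row, tree.col))
```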

We consider an Ising model whose pairwise couplings form a tree. The graph of couplings may consist of several parts that are not connected to each other (in any number of steps along connected node pairs), or it may form one single connected tree, but it contains no loops. We denote the set of nodes (vertices), each associated with a spin, by $V$ and the set of edges (couplings between nodes) by $E$. It is straightforward to show that in this case the Boltzmann distribution of the Ising model can be written in the pairwise factorised form

p(\mathbf{s}) = \prod_{(ij) \in E} p_{ij}(s_i, s_j) \prod_{i \in V} p_i(s_i)^{\,1 - |\partial i|} .        (59)

Here $\partial i$ denotes the set of neighbours of node $i$, so $|\partial i|$ is the number of nodes $i$ couples to. The distributions $p_i(s_i)$ and $p_{ij}(s_i, s_j)$ denote the one-point and two-point marginals of $p(\mathbf{s})$.

The KL divergence (18) between the empirical distribution $p_D$ and $p$ is given by

D_{KL}(p_D \| p) = \sum_{\mathbf{s}} p_D(\mathbf{s}) \ln p_D(\mathbf{s}) - \sum_{(ij) \in E} \sum_{s_i, s_j} p_D^{ij}(s_i, s_j) \ln p_{ij}(s_i, s_j) - \sum_{i \in V} (1 - |\partial i|) \sum_{s_i} p_D^{i}(s_i) \ln p_i(s_i) ,        (60)

where $p_D^{i}$ and $p_D^{ij}$ denote the one- and two-point marginals of the empirical distribution. For a given tree, it is straightforward to show that the KL divergence is minimized when the marginals $p_i$ and $p_{ij}$ equal the empirical marginals $p_D^{i}$ and $p_D^{ij}$. This gives

D_{KL}(p_D \| p) = -S[p_D] + \sum_{i \in V} S_i - \sum_{(ij) \in E} I_{ij} ,        (61)

where