Efficiently learning Ising models on arbitrary graphs
We consider the problem of reconstructing the graph underlying an Ising model from i.i.d. samples. Over the last fifteen years this problem has been of significant interest in the statistics, machine learning, and statistical physics communities, and much of the effort has been directed towards finding algorithms with low computational cost for various restricted classes of models. Nevertheless, for learning Ising models on general graphs with nodes of degree at most , it is not known whether or not it is possible to improve upon the computation needed to exhaustively search over all possible neighborhoods for each node.
In this paper we show that a simple greedy procedure allows to learn the structure of an Ising model on an arbitrary bounded-degree graph in time on the order of . We make no assumptions on the parameters except what is necessary for identifiability of the model, and in particular the results hold at low-temperatures as well as for highly non-uniform models. The proof rests on a new structural property of Ising models: we show that for any node there exists at least one neighbor with which it has a high mutual information. This structural property may be of independent interest.
Undirected graphical models, or Markov random fields, are a general and powerful framework for reasoning about high dimensional distributions and are at the core of modern statistical inference. The joint probability distribution specified by such a model factorizes according to an underlying graph, and the absence of edges encodes conditional independence . The graph structure captures the computational aspect inherent in tasks of statistical inference including computing marginals, maximum a posteriori assignments, the partition function, or sampling from the distribution. In addition to their statistical relevance, such computations on graphical models include as special cases many combinatorial optimization and counting problems.
The Ising model is a Markov random field having binary variables and pairwise potential functions. The Ising model has a long and celebrated history starting with its introduction by statistical physicists as a model for spin systems in order to understand the phenomena of phase transition [29, 12]. It has since been used across a wide spectrum of application domains including finance, social networks, computer vision, biology, and signal processing. The understanding of the computational tractability of inference (computing the partition function and sampling) has recently seen significant progress [32, 66, 57, 58, 59, 27].
The inverse problem of learning models from data is equally important. Once the underlying graph is known it is relatively easy to estimate the parameters, hence the focus is largely on the task of structure learning, i.e., estimating the graph. Study of this problem was initiated by Chow and Liu in their seminal 1968 paper , which gave a greedy algorithm for learning tree-structured Markov random fields with runtime on the order of for graphs on nodes. They showed that the maximum likelihood graph is a maximum-weight spanning tree, where each edge has weight equal to the mutual information of the variables at its endpoints. The maximum-likelihood tree can thus be found by a greedy algorithm, for example using Kruskal’s or Prim’s algorithms, and the running-time of the Chow-Liu algorithm is dominated by the time required to compute the mutual information between all pairs of nodes in the graph.
For graphs with loops the problem is much more challenging for two reasons: a node and its neighbor can be marginally independent due to indirect path effects, and moreover, this difficulty is compounded by presence of long-range correlations in the model, in which case distant nodes can be more correlated than nearby nodes. As discussed next in Subsection 1.1, a basic first-order question has remained unanswered: it is not known if it is possible to learn the structure of Ising models on general graphs with nodes of degree at most in time less than . This is roughly the time required to exhaustively search over all possible neighborhoods of a node and for each such candidate neighborhood test whether or not the implied conditional independence holds .
In this paper we show that despite these challenges, a simple greedy algorithm allows to reconstruct arbitrary Ising models on nodes of maximum degree in time , where the notation hides a factor as well as a constant depending (doubly-exponentially) on . Exponential dependence on is unavoidable as the number of samples must itself increase exponentially in , but doubly-exponential dependence on is probably suboptimal. The implication of our result is that learning Ising models on arbitrary graphs of bounded degree has essentially the same computational complexity as learning such models on tree graphs. The proof rests on a new basic property of Ising models: we show that for any node, there exists at least one neighbor with which it has high mutual information, even conditional on any subset of nodes.
1.1 Complexity of graphical model learning
A number of papers, including , , and , have suggested to find each node’s neighborhood by exhaustively searching over candidate neighborhoods and checking conditional independence. For graphical models on nodes of maximum degree , such a search takes time (at least) on the order of . As grows, the computational cost becomes prohibitive, and much work has focused on trying to find algorithms with lower complexity. Writing algorithm runtime in the form , for high-dimensional (large ) models the exponent is of primary importance. We will think of efficient algorithms as having an exponent that is bounded by a constant independent of .
Previous works proposing efficient algorithms either restrict the graph structure or the nature of the interactions between variables. Chow and Liu  made a model restriction of the first type, assuming that the graph is a tree; generalizations include to polytrees , hypertrees , tree mixtures , and others. Among the many possible assumptions of the second type, the correlation decay property (CDP) is distinguished: until the recent paper , all existing efficient algorithms required the CDP . Informally, a graphical model is said to have the CDP if any two variables and are asymptotically independent as the graph distance between and increases. The CDP is known to hold for a number of pairwise graphical models in the so-called high-temperature regime, including Ising, hard-core lattice gas, Potts (multinomial), and others (see the survey article  as well as, e.g., [20, 21, 53, 66, 26, 6]).
It was first observed in  that it is possible to efficiently learn (in time ) models with (exponential) decay of correlations, under the additional assumption that neighboring variables have correlation bounded away from zero (as is true for the ferromagnetic Ising model in the high temperature regime). A variety of other papers including [46, 50, 4] give alternative algorithms, but also require the CDP to guarantee efficiency. Structure learning algorithms that do not explicitly require the CDP are based on convex optimization, such as Ravikumar, Wainwright, and Lafferty’s  approach using -regularized node-wise logistic regression. This algorithm has complexity ; while it is shown to work under certain incoherence conditions that seem distinct from the CDP, Bento and Montanari  established through a careful analysis that the algorithm fails for simple families of ferromagnetic Ising models without the CDP. Other convex optimization-based algorithms (e.g., [37, 30, 31]) assume similar incoherence conditions that are difficult to interpret in terms of model parameters, and likely also require the CDP.
It is noteworthy that most computationally efficient sampling algorithms (which happen to be based on MCMC) require a notion of temporal mixing, and this is closely related to spatial mixing or a version of the CDP (see, e.g., [61, 22, 39, 66]). The conclusion is that under a class of mixing conditions (informally speaking), we can both generate samples efficiently as well as learn graphical models efficiently from i.i.d. samples. For antiferromagnetic Ising models on general bounded degree graphs, one has the striking converse statement that generating samples becomes intractable (NP-hard) precisely at the point where the CDP no longer holds .
Because all known efficient algorithms required the CDP, and because the Ising model exhibits dramatically different macroscopic behavior with versus without the CDP (and this determines computational tractability of sampling), Bento and Montanari  posed the question of whether or not the CDP is necessary for tractable structure learning. A partial answer was given in , by demonstrating that a family of antiferromagnetic Ising models on general graphs can be learned efficiently despite strongly violating the CDP. Thus any relationship between the complexity of sampling (or computing the partition function) and the problem of structure learning from i.i.d. samples seems tenuous, and this is corroborated by the results of this paper. In contrast, the recent papers  and  demonstrate an algorithmic connection between parameter estimation from sufficient statistics for a model on a known graph and computation of the partition function.
We prove that the graph structure of an arbitrary Ising model on a bounded degree graph can be learned efficiently. Before discussing the algorithm, we first state the main conceptual contribution of the paper, namely identifying a new structural property of Ising models. Given the structural result the algorithm is almost obvious, and indeed it can be interpreted as a generalization of Chow and Liu’s 1968 greedy algorithm for learning trees to models on arbitrary graphs.
Proposition 1.1 (Structural property – informal).
Let be a graph on nodes of maximum degree , and consider an Ising model on . Then for any node there exists a neighbor such that the mutual information between and is at least some constant that is independent of .
The property as stated is actually a consequence of Proposition 5.3: we allow to condition on an arbitrary set of nodes, and instead of mutual information, we use a certain conditional influence measure. As shown in Section 5, this influence provides a lower bound on the mutual information.
Our algorithmic result is stated in the following theorem.
Theorem 1.2 (Algorithm performance – informal).
Consider an Ising model on an arbitrary graph on nodes of maximum degree at most . Given samples from the model, where the constant is a function of the range of interaction strengths and , it is possible to learn the underlying graph in time
We remark that the factor in the sample complexity and runtime of our algorithm depends doubly-exponentially on , in contrast to the necessary exponential dependence discussed in Subsection 2.2. While the focus of this paper is not on sample complexity, our algorithm does obtain (with a suboptimal constant) the optimal logarithmic dependence on .
Our algorithm is described in Section 4 and a more detailed statement of the theorem (with full dependence on constants) is given as Theorem 4.1. The algorithm is extremely simple and quite natural: in order to find the neighborhood of a node , it greedily (according to a measure of conditional influence) adds nodes one-by-one to form a constant-size superset of the neighborhood (pseudo-neighborhood). The idea of adding spurious nodes to form a superset is not new, but it is worth emphasizing the difference relative to algorithms that attempt to add only correct nodes.
It is also crucial that the algorithm only adds—and does not remove—nodes in creating the pseudo-neighborhood. The reason is that in models with long-range correlations, a non-neighbor with high influence on (or high mutual information) contains a lot of information about many other non-neighbors, so conditioning on effectively eliminates many non-neighbors from consideration. This allows us to use a potential argument, whereby each added node reduces the conditional entropy of the node by some constant, and this bounds the size of the pseudo-neighborhood. The pseudo-neighborhood can then be easily pruned to remove non-neighbors.
We next mention a few connections to other work and then in Section 2 define the Ising model and structure learning problem. Section 3 introduces the notion of influence we use and states a lemma showing that empirical estimates are close to the population averages. Section 4 presents our algorithm with performance guarantee stated in Theorem 4.1. Section 5 contains proofs of correctness and runtime, as well as a statement of our structural result, Proposition 5.3. The proposition is proved in Sections 6 and 7. Finally, Section 8 contains the proof of the lemma from Section 3 and Section 9 discusses possible extensions.
1.3 Other related work
Since the 1980’s, Hinton and others have studied the problem of learning Ising models under the name of learning Boltzmann machines [2, 28, 63]. Most approaches for learning Boltzmann machines do not assume a sparse underlying graph and attempt to find parameters directly, using gradient optimization methods. Ising models are used to model neuronal networks and protein interaction networks, and the problem of learning Ising models has been studied in this context [55, 15, 44, 65]. The problem of learning Ising models has also been of interest in the statistical physics community, where it is known as the inverse Ising problem. A variety of non-rigorous methods have been proposed, including based on truncation of expansions relating couplings to mean parameters or entropies [56, 14, 52, 38], message-passing [41, 51, 5], and others .
Structure learning of graphical models has been studied in the statistics and machine learning communities as a problem in high-dimensional statistics. Broadly, in high-dimensional statistics one wishes to estimate a high-dimensional object from samples, where the number of samples is far less than the dimensionality of the parameter space. To solve this ill-determined problem requires that there be some sparse underlying combinatorial structure. In our case the graph underlying the Markov random field is a sparse graph. As discussed in , optimization of regularized objective functions has been a popular approach to many problems in high-dimensional statistics including sparse linear regression, low-rank matrix completion, inferring rankings, as well as the problem of learning graphical models (both binary and Gaussian) [48, 40, 24, 49]. Such general methodology based on optimizing likelihood or (pseudo-likelihood) has failed thus-far to learn Ising models on general graphs, and one of the messages of our work is that a tailored approach is often necessary.
The theoretical computer science community has made progress on learning a variety of high-dimensional probability distributions from samples, including mixtures of Gaussians [42, 7]. But there is a more intriguing connection to work on learning function classes. In an Ising model, the conditional distribution of a node given its neighbors is specified by a logistic function, which is a soft version of a linear threshold function (LTF). Thus our algorithm effectively learns soft LTFs over a complicated joint distribution. Arguments based on boolean Fourier analysis have played a major role in learning boolean functions over uniformly random examples and also over product distributions , but due to the complicated joint dependencies in an Ising model, it is not obvious how to apply Fourier analysis in our setting. Our structural result, nevertheless, is at a high level analogous to the statement that LTFs have non-trivial total degree-one Fourier mass (see, e.g., ). Other recent works learn LTFs over non-product distributions, including log-concave  and sub-Gaussian or sub-exponential . These assumptions are badly violated by Ising models in the low-temperature regime (with long-range correlations), but the bounded graph degree assumption makes our soft LTFs depend on few variables, and this makes learning tractable.
2.1 Ising model
We consider the Ising model on a graph with . The notation is used to denote the set of neighbors of node , and the degree of each node is assumed to be bounded by . To each node is associated a binary random variable (spin) . Each configuration of spins (’’ and ’’ are used as shorthand for ) is assigned probability according to the probability mass function
Here is the log-partition function or normalizing constant. The distribution is parameterized by the vector , consisting of edge couplings and node-wise external fields. The edge couplings are assumed to satisfy for all for some constants and the external fields are assumed to satisfy for all The bounds are necessary for model identifiability, and as shown in  and discussed briefly in the next subsection, must appear in the sample complexity.
We can alternatively think of , with if . For a graph , let
be the set of valid parameter vectors corresponding to .
The distribution specified in is a Markov random field, and an implication is that each node is conditionally independent of all other nodes given the values of its neighbors. The conditional probability of given the states of all the other nodes can thus be written as:
A useful property of bounded degree models is that the conditional probability of a spin is always bounded away from and . The proof of this statement is immediate from (2) by conditioning on the neighbors of and using the tower property of conditional expectation.
Lemma 2.1 (Conditional randomness).
For any node , subset , and any configuration ,
This quantity is denoted by and appears throughout the paper.
2.2 Graphical model learning
Denote the set of all graphs on nodes of degree at most by . For some graph and parameters , one observes configurations sampled independently from the Ising model (1). A structure learning algorithm is a (possibly randomized) map
taking samples to a graph . The statistical performance of a structure learning algorithm will be measured using the zero-one loss, meaning that the exact underlying graph must be learned. The risk, or expected loss, under some vector of parameters corresponding to a graph is given by the probability of reconstruction error
and for given , the maximum risk is
Our goal is to find an algorithm with maximum risk (probability of error) tending to zero as , using the fewest possible number of samples . This notion of performance is rather stringent, but also robust, being worst-case over the entire class of graphs and parameters . A lower bound on the number of samples necessary was obtained by Santhanam and Wainwright in Theorem 1 of :
This means in particular that exponential dependence of the sample complexity (and hence runtime) on the quantity is unavoidable.
3 Measuring the influence of a variable
Our algorithm uses a certain conditional influence of a variable on another variable. For nodes , subset of nodes and configuration , define
We also use a quantity we call the average conditional influence, which is obtained by performing a weighted average of over random configurations :
The weights are given by the function
By the Markov property (2), the influence is zero for non-neighbors conditional on neighbors, that is, for any ,
This implies the same statement for . Our structural result, Proposition 5.3, shows that is bounded below for at least one neighbor if does not already contain the neighborhood . Thus computing allows to determine if or not, and our algorithm given in Section 4 is based on this idea.
Other works using a “conditional independence test”, for example [4, 67], use a similar measure of influence amounting to . We do not take the minimum over configurations , as there is no guarantee that is nonzero for any neighbor : each can be marginally independent of conditional on some configuration .
The empirical conditional influence replaces the probability measure by the empirical measure:
where for ,
Like before we define an averaged version (with average taken according to the empirical measure):
It will be necessary that these empirical influences are sufficiently accurate. Let denote the event that empirical influences with conditioning set up to size are accurate to within an additive :
Recall the notation and suppose . If the number of samples is at least , then .
We now describe the structure learning algorithm, which learns the neighborhood of an arbitrary individual node . Algorithm LearnNbhd takes as input the node whose neighborhood we wish to learn as well as the data and a threshold parameter . The first step is to construct a superset (which we call a pseudo-neighborhood) of the neighborhood . This is accomplished by greedily adding to the node with highest conditional influence , until there are no nodes with . (To simplify the description we set if .) At each step the conditional influences are computed with respect to the current set .
As will become apparent in Subsection 5.1, inclusion of non-neighbors is important, as it allows to use a potential argument to show that the constructed pseudo-neighborhood is not too large. Concretely, by definition of the algorithm and a simple lemma relating influence to mutual information, adding a node to the pseudo-neighborhood always reduces by at least the conditional entropy of the variable whose neighborhood we are trying to find. The entropy is non-negative and was initially at most one (since is binary), so this bounds the size of .
The correctness of the algorithm relies on Proposition 5.3 of Subsection 5.2, which shows that there is always at least one neighbor with influence above a constant, and we set equal to this constant. This implies that the algorithm does not terminate before all the neighbors are added. Finally, after construction of the pseudo-neighborhood, the algorithm removes those nodes with low average influence. Proposition 5.3 is again used to show that no neighbors are removed.
Let and . Let and define
Suppose we observe samples , for
where denote numerical constants. Then with probability at least , the structure learning algorithm LearnNbhd returns the correct neighborhood for all and the runtime for each of the nodes is (for a numerical constant C)
As stated, the algorithm has probability of both returning the correct neighborhoods for all nodes and having the claimed runtime. Obviously, the algorithm can be terminated if the runtime exceeds the stated value, giving a deterministic guarantee on runtime.
5 Algorithm correctness
5.1 Entropy increment argument and run-time bound
In this subsection we bound the size of the pseudo-neighborhood constructed, but make no guarantee that it actually contains the true neighborhood. We use several standard information-theoretic quantities including entropy, Kullback-Leibler divergence, and mutual information. The relevant definitions can be found in any information theory textbook (such as ). The following lemma gives a lower bound on the conditional mutual information of each added node.
Assume that event holds and suppose that LearnNbhd added node to the pseudo-neighborhood of after having added . Then the conditional mutual information .
We can now argue that the number of nodes added in the pseudo-neighborhood step is not too large by using a potential argument. The bound is stated in terms of the quantity
If event holds, then at the end of the pseudo-neighborhood construction step the set has cardinality at most .
Before proving the lemma, let us quickly see how it justifies the runtime claimed in Theorem 4.1. Each maximization step in line 2 of LearnNbhd takes time , so we get a cost of for the pseudo-neighborhood step, and this dominates the runtime.
Proof of Lemma 5.2.
Consider the sequence of nodes added, , with . Then
(a) is by non-negativity of conditional entropy and the definition of mutual information, , (b) follows by the chain rule for mutual information, and (c) is by Lemma 5.1. Since is strictly larger than , must satisfy the bound stated in the lemma. ∎
Proof of Lemma 5.1.
Let consist of the first nodes already added to the pseudo-neighborhood. Our goal is to show that the next node has mutual information .
We use a shorthand for the (random) conditional measure: and similarly , with similar definitions for any combination of ’’ and ’’. Thus we can write
Now for any ,
(a) is by Jensen’s inequality applied to the (concave) square root function, (b) Pinsker’s inequality, (c) the definition of total variation distance, (d) Lemma 2.1, (e) is by definition of and the fact that the conditioning set has cardinality .
Finally, by definition of the algorithm, node is only added to if , so the previous displayed equation implies that as claimed. ∎
5.2 Key structural result and algorithm correctness
Let be a graph of maximum degree , and consider an Ising model (1) on with vector of parameters . For any node , if such that , then there exists a node with .
We now show that the pseudo-neighborhood contains the true neighborhood.
If event holds, then for any , the pseudo-neighborhood constructed by LearnNbhd contains the true neighborhood .
Consider the same setup as Corollary 5.4. After the pruning step, .
By Corollary 5.4, the pseudo-neighborhood contains , hence Equation (3) states that for non-neighbors . By Lemma 5.2, , and by definition of the event (with our choice ), we have for all non-neighbors , and hence these are discarded. Conversely, by Proposition 5.3, , so no neighbors are discarded. ∎
All the ingredients are in place to finish the proof of Theorem 4.1.
Fix and , and let consist of the neighbors of not in . Assume that is nonempty or there is nothing to prove. Let and recall the definition We will prove that for any assignment ,
Averaging with respect to and applying the triangle inequality gives
As a consequence there exists some such that
and this proves the Proposition.
We proceed with showing (4). Let consist of nodes in adjacent to and be the effective external field at when we include the effect due to . Using the notation for the conditional measure , and , we let
Suppose . Conditioning on the values of the remaining neighbors of ,
The expectation is over the random vector with law , and in particular . The notation is understood to mean if and if .
It is helpful to rescale and shift the variable in (6) so that it takes values . To this end, define the quantities
(The function was defined at the beginning of Section 3.) Arithmetic manipulations lead to
Multiplying (6) by and using the identities in the last display gives
Summing the last displayed quantity over gives
7 Technical lemma
The goal of this section is to justify Equation (7), which appeared in the proof of Proposition 5.3. We start with an observation that will be used in the proof. Recall that is a random vector equal in distribution to conditioned on . Due to the “conditional randomness” Lemma 2.1, has probability at least of taking each value in , where . Hence we can decompose the probability mass function of as
with for all . We will be concerned with the random variable
and the decomposition (8) will allow us to obtain anti-concentration for from anti-concentration for sums of i.i.d. uniform random variables.
The following result of Erdös on the Littlewood-Offord problem shows anti-concentration for weighted sums of i.i.d. uniform random variables. (It can be found, e.g., as Corollary 7.4 in  and is a simple consequence of Sperner’s Lemma).
Lemma 7.1 (Erdös ).
Let be real numbers with for all . Let be an open interval of length . If is uniformly distributed on , then
We can use the decomposition (8) and Lemma 7.1 to draw the following conclusion. (It is possible to show this directly, but this approach seems clearer.) We mention that the only place the lower bound on the coupling strengths appears is through the following lemma.
Let . Consider the random variable defined in (9), and let . Then
Decompose the probability mass function as discussed at the beginning of this section in (8). Let be the total mass assigned to the uniform part. Let where is uniformly distributed and let where . The variable can be represented as a mixture distribution: if we define , then
and we can think of obtaining by choosing either or with probabilities or . Let . Lemma 7.1 implies that for any , and hence . Denote the probability that lies to the left of and inside , respectively, as
so that . Thinking about placing the probability mass to minimize subject to fixed and justifies the inequality
Using and performing arithmetic manipulations leads to
At least one of the two terms on the left-hand side is larger than the average, which is at least , and in either case