1 Introduction
Abstract

We consider the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on -regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an -constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes and maximum neighborhood sizes are allowed to grow as a function of the number of observations . Our main results provide sufficient conditions on the triple for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. Under certain assumptions on the population Fisher information matrix, we prove that consistent neighborhood selection can be obtained for sample sizes , with the error decaying as for some constant . If these same assumptions are imposed directly on the sample matrices, we show that samples are sufficient.

High-Dimensional Graphical Model Selection Using -Regularized Logistic Regression


Pradeep Ravikumar Martin J. Wainwright
pradeepr@stat.berkeley.edu wainwrig@stat.berkeley.edu


Department of Statistics, and Department of EECS
UC Berkeley, Berkeley, CA 94720



John D. Lafferty
lafferty@cs.cmu.edu


Departments of Computer Science and Machine Learning
Carnegie Mellon University
Pittsburgh, PA 15213



Technical Report, Department of Statistics, UC Berkekley
April 19, 2008




Keywords: Graphical models, Markov random fields, structure learning, -regularization, model selection, convex risk minimization, high-dimensional asymptotics, concentration.

1 Introduction

Undirected graphical models, also known as Markov random fields (MRFs), are used in a variety of domains, including artificial intelligence, natural language processing, image analysis, statistical physics, and spatial statistics, among others. A Markov random field (MRF) is specified by an undirected graph , with vertex set and edge set . The structure of this graph encodes certain conditional independence assumptions among subsets of the -dimensional discrete random variable , where variable is associated with vertex . A fundamental problem is the graphical model selection problem: given a set of samples from a Markov random field, estimate the structure of the underlying graph. The sample complexity of such an estimator is the minimal number of samples , as a function of the graph size and possibly other parameters such as the maximum node degree , required for the probability of correct identification of the graph to converge to one. Another important property of any model selection procedure is its computational complexity.

Due to both its importance and difficulty, structure learning in random fields has attracted considerable attention. The absence of an edge in a graphical model encodes a conditional independence assumption. Constraint-based approaches (Spirtes et al., 2000) estimate these conditional independencies from the data using hypothesis testing, and then determine a graph that most closely represents those independencies. Each graph represents a model class of graphical models; learning a graph then is a model class selection problem. Score-based approaches combine a metric for the complexity of the graph, with a goodness of fit measure of the graph to the data (for instance, log-likelihood of the maximum likelihood parameters given the graph), to obtain a score for each graph. The score is then used together with a search procedure that generates candidate graph structures to be scored. The number of graph structures grows super-exponentially, however, and Chickering (1995) shows that this problem is in general NP-hard.

A complication for undirected graphical models is that typical score metrics involve the normalization constant (also called the partition function) associated with the Markov random field, which is intractable (#P) to compute for general undirected models. The space of candidate structures in scoring based approaches is thus typically restricted to either directed models—Bayesian networks—or simple undirected graph classes such as trees (Chow and Liu, 1968), polytrees (Dasgupta, 1999) and hypertrees (Srebro, 2003). Abbeel et al. (2006) propose a method for learning factor graphs based on local conditional entropies and thresholding, and analyze its behavior in terms of Kullback-Leibler divergence between the fitted and true models. They obtain a sample complexity that grows logarithmically in the graph size , but the computational complexity grows at least as quickly as , where is the maximum neighborhood size in the graphical model. This order of complexity arises from the fact that for each node, there are possible neighborhoods of size for a graph with vertices. Csiszár and Talata (2006) show consistency of a method that uses pseudo-likelihood and a modification of the BIC criterion, but this also involves a prohibitively expensive search.

In work subsequent to the initial conference version of this work (Wainwright et al., 2007), other researchers have also studied the problem of model selection in discrete Markov random fields. For the special case of bounded degree models, Bresler et al. (2008) describe a simple search-based method, and prove under relatively mild assumptions that it can recover the graph structure with samples. However, in the absence of additional restrictions, the computational complexity of the method is . Santhanam and Wainwright (2008) analyze the information-theoretic limits of graphical model selection, providing both upper and lower bounds on various model selection procedures, but these methods also have prohibitive computational cost.

The main contribution of this paper is to present and analyze the computational and sample complexity of a simple method for graphical model selection. Our analysis is high-dimensional in nature, meaning that both the model dimension as well as the maximum neighborhood size may tend to infinity as a function of the size . Our main result shows that under mild assumptions, consistent neighborhood selection is possible with sample complexity and computational complexity , when applied to any graph with vertices and maximum degree . The basic approach is straightforward: it involves performing -regularized logistic regression of each variable on the remaining variables, and then using the sparsity pattern of the regression vector to infer the underlying neighborhood structure.

The technique of regularization for estimation of sparse models or signals has a long history in many fields; for instance, see Tropp (2006) for one survey. A surge of recent work has shown that -regularization can lead to practical algorithms with strong theoretical guarantees (e.g., Candes and Tao (2006), Donoho and Elad (2003), Meinshausen and Bühlmann (2006), Ng (2004), Tropp (2006), Wainwright (2006), Zhao and Yu (2007)). Despite the well-known computational intractability of discrete MRFs, our method is computationally efficient; it involves neither computing the normalization constant (or partition function) associated with the Markov random field, nor a combinatorial search through the space of graph structures. Rather, it requires only the solution of standard convex programs, with an overall computational complexity of order (Koh et al., 2007), and is thus well-suited to high dimensional problems. Conceptually, like the work of Meinshausen and Bühlmann (2006) on covariance selection in Gaussian graphical models, our approach can be understood as using a type of pseudo-likelihood, based on the local conditional likelihood at each node. In contrast to the Gaussian case, where the exact maximum likelihood estimate can be computed exactly in polynomial time, this use of a surrogate loss function is essential for discrete Markov random fields, given the intractability of computing the exact likelihood.

The remainder of this paper is organized as follows. We begin in Section 2 with background on discrete graphical models, the model selection problem, and logistic regression. In Section 3, we state our main result, develop some of its consequences, and provide a high-level outline of the proof. Section 4 is devoted to proving a result under stronger assumptions on the sample Fisher information matrix itself, whereas Section 5 provides concentration results linking the population matrices to the sample versions. In Section 6, we provide some experimental results to illustrate the practical performance of our method, and the close agreement between theory and practice, and we conclude in Section 7.

Notation:

For the convenience of the reader, we summarize here notation to be used throughout the paper. We use the following standard notation for asymptotics: we write if for some constant , and if for some constant . The notation means that and . Given a vector and parameter , we use to denote the usual norm. Given a matrix and parameter , we use to denote the induced matrix-operator norm (viewed as a mapping from ); see Horn and Johnson (1985). Two examples of particular importance in this paper are the spectral norm , corresponding to the maximal singular value of , and the matrix norm, given by . We make use of the bound , for any square matrix .

2 Background and problem formulation

We begin by providing some background on Markov random fields, defining the problem of graphical model selection, and describing our method based on neighborhood logistic regression.

2.1 Markov random fields

Given an undirected graph with vertex set and edge set , a Markov random field (MRF) consists of random vector , where the random variable is associated with vertex . The random vector is said to be pairwise Markov with respect to the graph if its probability distribution factorizes as , where each is a mapping from pairs to the real line. An important special case is the Ising model, in which for each vertex , and for some parameter , so that the distribution takes the form

(1)

The partition function ensures that the distribution sums to one. The Ising model has proven useful in many domains, including statistical physics, where it describes the behavior of gases or magnets, in computer vision for image segmentation, and in social network analysis.

2.2 Graphical model selection

Suppose that we are given a collection of samples, where each -dimensional vector is drawn in an i.i.d. manner from a distribution of the form (1). It is convenient to view the parameter vector as a -dimensional vector, indexed by pairs of distinct vertices, but non-zero if and only if the vertex pair belongs to the unknown edge set of the underlying graph . The goal of graphical model selection is to infer the edge set of the graphical model defining the probability distribution that generates the samples. In this paper, we study the slightly stronger criterion of signed edge recovery: in particular, given a graphical model with parameter , we define the edge sign vector

(2)

Note that the weaker graphical model selection problem amounts to recovering the vector of absolute values.

The classical notion of statistical consistency applies to the limiting behavior of an estimation procedure as the sample size goes to infinity, with the model size itself remaining fixed. In many contemporary applications of graphical models (e.g., gene microarrays, social networks etc.), the model dimension is comparable or larger than the sample size , so that the relevance of such “fixed ” asymptotics is doubtful. Accordingly, the goal of this paper is to develop the broader notion of high-dimensional consistency, in which both the model dimension and the sample size are allowed to increase, and we study the scaling conditions under which consistent model selection is achievable.

More precisely, we consider sequences of graphical model selection problems, indexed by the sample size , number of vertices , and maximum node degree . We assume that the sample size goes to infinity, and both the problem dimension and may also scale as a function of . The setting of fixed or is covered as a special case. Let be an estimator of the signed edge pattern , based on the samples. Our goal is to establish sufficient conditions on the scaling of the triple such that our proposed estimator is consistent in the sense that

We sometimes call this property sparsistency, as a shorthand for consistency of the sparsity pattern of the parameter .

2.3 Neighborhood-based logistic regression

Note that recovering the signed edge vector of an undirected graph is equivalent to recovering, for each vertex , its neighborhood set , along with the correct signs for all . To capture both the neighborhood structure and sign pattern, we define the signed neighborhood set

(3)

The next step is to observe that this signed neighborhood set can be recovered from the sign-sparsity pattern of the -dimensional subvector of parameters

associated with vertex . In order to estimate this vector , we consider the structure of the conditional distribution of given the other variables . A simple calculation shows that under the model (1), this conditional distribution takes the form

(4)

Thus, the variable can be viewed as the response variable in a logistic regression in which all of the other variables play the role of the covariates.

With this set-up, our method for estimating the sign-sparsity pattern of the regression vector (and hence the neighborhood structure ) is based on computing the -regularized logistic regression of on the other variables . Explicitly, given a set of i.i.d. samples , this regularized regression problem is a convex program, of the form

, where is a regularization parameter, to be specified by the user, and

(5)

is the rescaled negative log likelihood. (The rescaling factor in this definition is for later theoretical convenience.) Following some algebraic manipulation, the regularized negative log likelihood can be written as

(6)

where

(7)

is a rescaled logistic loss, and are empirical moments. Note the objective function (6) is convex but not differentiable, due to the presence of the -regularizer. By Lagrangian duality, the problem (6) can be re-cast as a constrained problem over the ball . Consequently, by the Weierstrass theorem, the minimum over is always achieved.

Accordingly, let be an element of the minimizing set of problem (6). Although need not be unique in general since the problem (6) need not be strictly convex, our analysis shows that in the regime of interest, this minimizer is indeed unique. We use to estimate the signed neighborhood according to

(8)

We say that the full graph is estimated consistently, written as the event , if for all .

3 Method and theoretical guarantees

Our main result concerns conditions on the sample size relative to the parameters of the graphical model—more specifically, the number of nodes and maximum node degree —that ensure that the collection of signed neighborhood estimates (8), one for each node of the graph, agree with the true neighborhoods, so that the full graph is estimated consistently. In this section, we begin by stating the assumptions that underlie our main result, and then give a precise statement of the main result. We then provide a high-level overview of the key steps involved in its proof, deferring detail to later sections. Our analysis proceeds by first establishing sufficient conditions for correct signed neighborhood recovery—that is, —for some fixed node . By showing that this neighborhood consistency is achieved at exponentially fast rates, we can then use a union bound over all nodes of the graph to conclude that consistent graph selection is also achieved.

3.1 Assumptions

Success of our method requires certain assumptions on the structure of the logistic regression problem. These assumptions are stated in terms of the Hessian of the likelihood function , as evaluated at the true model parameter . More specifically, for any fixed node , this Hessian is a matrix of the form

(9)

For future reference, we calculate the explicit expression

(10)

where

(11)

is the variance function. Note that the matrix is the Fisher information matrix associated with the local conditional probability distribution. Intuitively, it serves as the counterpart for discrete graphical models of the covariance matrix of Gaussian graphical models, and indeed our assumptions are analogous to those imposed in previous work on the Lasso for Gaussian linear regression (Meinshausen and Bühlmann, 2006, Tropp, 2006, Zhao and Yu, 2007).

In the following we write simply for the matrix , where the reference node should be understood implicitly. Moreover, we use to denote the subset of indices associated with edges of , and to denote its complement. We use to denote the sub-matrix of indexed by . With this notation, we state our assumptions:

[A1] Dependency condition:

The subset of the Fisher information matrix corresponding to the relevant covariates has bounded eigenvalues: there exists a constant such that

(12)

Moreover, we require that . These conditions ensure that the relevant covariates do not become overly dependent. (As stated earlier, we have suppressed notational dependence on ; thus this condition is assumed to hold for all .)

[A2] Incoherence condition:

Our next assumption captures the intuition that the large number of irrelevant covariates (i.e., non-neighbors of node ) cannot exert an overly strong effect on the subset of relevant covariates (i.e., neighbors of node ). To formalize this intuition, we require the existence of an such that

(13)

3.2 Statement of main result

We are now ready to state our main result on the performance of neighborhood logistic regression for graphical model selection. Naturally, the limits of model selection are determined by the minimum value over the parameters for pairs included in the edge set of the true graph. Accordingly, we define the parameter

(14)

With this definition, we have the following

Theorem 1.

Consider a sequence of graphs such that conditions and are satisfied by the population Fisher information matrices . If the sample size satisfies

(15)

for some constant , and the minimum value decays no faster than , then for the regularization sequence , the estimated graph obtained by neighborhood logistic regression satisfies

(16)

for some constant .

Remarks:

For model selection in graphical models, one is typically interested in node degrees that remain bounded (e.g., ), or that grow only weakly with graph size (say ). In such cases, the growth condition (15) allows the number of observations to be substantially smaller than the graph size, i.e., the “large , small ” regime. In particular, the graph size can grow exponentially with the number of observations (i.e, for some .

In terms of the choice of regularization, the sequence needs to satisfy the following conditions:

Under the growth condition (15), the choice suffices as long as decays no faster than .

The analysis required to prove Theorem 1 can be divided naturally into two parts. First, in Section 4, we prove a result (stated as Proposition 1) for “fixed design” matrices. More precisely, we show that if the dependence (A1) mutual incoherence (A2) conditions hold for the sample Fisher information matrix

(17)

then the growth condition (15) and choice of from Theorem 1 are sufficient to ensure that the graph is recovered with high probability. Interestingly, our analysis shows that if the conditions are imposed directly on the sample Fisher information matrices and , then the weaker growth condition suffices for asymptotically exact graph recovery.

The second part of the analysis, provided in Section 5, is devoted to showing that under the specified growth conditions (A3), imposing incoherence and dependence assumptions on the population version of the Fisher information guarantees (with high probability) that analogous conditions hold for the sample quantities . While it follows immediately from the law of large numbers that the empirical Fisher information converges to the population version for any fixed subset , the delicacy is that we require controlling this convergence over subsets of increasing size. The analysis therefore requires some large-deviations bounds, so as to provide exponential control on the rates of convergence.

3.3 Primal-dual witness for graph recovery

At a high-level, at the core of our proof lies the notion of a primal-dual witness. In particular, we explicitly construct an optimal primal-dual pair, namely, a primal solution , along with an associated subgradient vector (which can be interpreted as a dual solution), such that the Karush-Kuhn-Tucker (KKT) conditions associated with the convex program (6) are satisfied. Moreover, we show that under the stated assumptions on , the primal-dual pair can be constructed such that they act as a witness—that is, a certificate guaranteeing that the method correctly recovers the graph structure.

Let us write the convex program (6) in the form

(18)

where

(19)

is the negative log likelihood associated with the logistic regression model. The KKT conditions associated with this model can be expressed as follows

(20)

where the dual or subgradient vector satisfies the properties

(21)

One way to understand this vector is as a subgradient, meaning an element of the subdifferential of the -norm (see Rockafellar, 1970). An alternative interpretation is based on the constrained equivalent to problem (18), involving the constraint . This -constraint is equivalent to the family of constraints , where the vector ranges over all possible sign vectors. In this formulation, the optimal dual vector is simply the conic combination

(22)

where is the Lagrange multiplier associated with the constraint .

The KKT conditions (20) and (21) must be satisfied by any optimal pair to the convex program (18). In order for this primal-dual pair to correctly specify the graph structure, we require furthermore that the following properties are satisfied:

(23a)
(23b)

We now construct our witness pair as follows. First, we set as the minimizer of the partial penalized likelihood,

(24)

and set . We then set so that condition (23b) holds. Finally, we obtain from equation (20) by plugging in the values of and . Thus, our construction satisfies conditions (23b) and (20). The remainder of the analysis consists of showing that our conditions on imply that, with high-probability, the remaining conditions (23a) and (21) are satisfied.

This strategy is justified by the following lemma, which provides sufficient conditions for shared sparsity and uniqueness of the optimal solution:

Lemma 1 (Shared sparsity and uniqueness).

If there exists an optimal primal solution with associated optimal dual vector such that , then any optimal primal solution must have . Moreover, if the Hessian sub-matrix , then is the unique optimal solution.

Proof.

By Lagrangian duality, the penalized problem (18) can be written as an equivalent constrained optimization problem over the ball , for some constant . Since the Lagrange multiplier associated with this constraint—namely, —is strictly positive, the constraint is active at any optimal solution, so that is constant across all optimal solutions. Consider the representation of as the convex combination (22) of sign vectors , where the weights are non-negative and sum to one. Since is an optimal vector of Lagrange multipliers for the optimal primal solution , it follows (Bertsekas, 1995) that any other optimal primal solution must minimize the associated Lagrangian (i.e., satisfy equation (20)), and moreover must satisfy the complementary slackness conditions for all sign vectors . But these conditions imply that , which cannot occur if for some index for which . We thus conclude that for all optimal primal solutions.

Finally, given that all optimal solutions satisfy , we may consider the restricted optimization problem subject to this set of constraints. If the principal submatrix of the Hessian is positive definite, then this sub-problem is strictly convex, so that the optimal solution must be unique. ∎

In our primal-dual witness proof, we exploit this lemma by constructing a primal-dual pair such that . Moreover, under the conditions of Theorem 1, we prove that the sub-matrix of the sample Fisher information matrix is strictly positive definite with high probability, so that the primal solution is guaranteed to be unique.

4 Analysis under sample Fisher matrix assumptions

We begin by establishing model selection consistency when assumptions are imposed directly on the sample Fisher matrix , as opposed to on the population matrix , as in Theorem 1. In particular, we define the “good event”

(25)

We then state the following

Proposition 1 (Consistency for fixed design).

If for a suitably large constant , and the minimum value decays no faster than , then for the regularization sequence , the estimated graph obtained by neighborhood logistic regression satisfies

(26)

Loosely stated, this result guarantees that if the sample Fisher information matrix is “good”, then the conditional probability of successful graph recovery converges to zero at the specified rate. The remainder of this section is devoted to the proof of Proposition 1.

4.1 Key technical results

We begin with statements of some key technical lemmas that are central to our main argument, with their proofs deferred to Appendix A. The central object is the following expansion, obtained by re-writing the zero-subgradient condition as

(27)

where we have introduced the short-hand notation for the -vector

For future reference, note that . Next, applying the mean-value theorem coordinate-wise to the expansion (27) yields

(28)

where the remainder term takes the form

(29)

with a parameter vector on the line between and , and with denoting the th row of the matrix. The following lemma addresses the behavior of the term in this expansion:

Lemma 2.

If , then for the specified mutual incoherence parameter , we have

(30)

See Appendix A.1 for the proof of this claim.

The following lemma establishes that the sub-vector is an -consistent estimate of the true sub-vector :

Lemma 3 (-consistency of primal subvector).

If , then as , we have

(31)

See Appendix A.2 for the proof of this claim.

Our final technical lemma provides control on the the remainder term (29):

Lemma 4.

If and is sufficiently small, then for mutual incoherence parameter , we have

(32)

See Appendix A.3 for the proof of this claim.

4.2 Proof of Proposition 1

Using these lemmas, we can now complete the proof of Proposition 1. Recalling our shorthand , we re-write condition (28) in block form as:

(33a)
(33b)

Since the matrix is invertible by assumption, the conditions (33) can be re-written as

(34)

Rearranging yields the condition

(35)

Strict dual feasibility

We now demonstrate that for the dual sub-vector defined by equation (35), we have . Using the triangle inequality and the mutual incoherence bound (13), we have that

(37)

Next, applying Lemmas 2 and 4, we have

with probability converging to one.

Correct sign recovery:

We next show that our primal sub-vector defined by equation (24) satisfies sign consistency, meaning, . In order to establish this, it suffices to show that

where we recall the notation . From Lemma 3, we have , so that

(38)
(39)

Since decays no faster than , the right-hand side is upper bounded by , which can be made smaller than by choosing sufficiently small, as asserted in Proposition 1.

5 Uniform convergence of sample information matrices

In this section, we complete the proof of Theorem 1 by showing that if the dependency () and incoherence () assumptions are imposed on the population Fisher information matrix then under the specified scaling of , analogous bounds hold for the sample Fisher information matrices with probability converging to one. These results are not immediate consequences of classical random matrix theory (e.g., Davidson and Szarek (2001)), since the elements of are highly dependent.

Recall the definitions

(40)

where denotes the population expectation, and denotes the empirical expectation, and the variance function was defined previously equation (11). The following lemma asserts the eigenvalue bounds in Assumption hold with high probability for sample covariance matrices:

Lemma 5.

Suppose that assumption holds for the population matrix and . For any and some fixed constants and , we have

(41a)
(41b)

The following result is the analog for the incoherence assumption (), showing that the scaling of given in Theorem 1 guarantees that population incoherence implies sample incoherence.

Lemma 6.

If the population covariance satisfies a mutual incoherence condition (13) with parameter as in Assumption , then the sample matrix satisfies an analogous version, with high probability, in the sense that

(42)

Proofs of these two lemmas are provided in the following sections. Before proceeding, we begin by taking note of a simple bound to be used repeatedly throughout our arguments. By definition of the matrices and (see equations (17) and (10)), the element of the difference matrix can be written as an i.i.d. sum of the form , where each is zero-mean and bounded (in particular, ). By the Azuma-Hoeffding bound (Hoeffding, 1963), for any indices and for any , we have

(43)

So as to simplify notation, throughout this section, we use to denote a universal positive constant, independent of . Note that the precise value and meaning of may differ from line to line.

5.1 Proof of Lemma 5

By the Courant-Fischer variational representation (Horn and Johnson, 1985), we have

where is a unit-norm minimal eigenvector of . Therefore, we have