High-dimensional Ising model selection using {\ell_{1}}-regularized logistic regression

# High-dimensional Ising model selection using ℓ1-regularized logistic regression

[ [    [ [    [ [ University of California, Berkeley, University of California, Berkeley and Carnegie Mellon University P. Ravikumar
M. J. Wainwright
Department of Statistics
University of California
Berkeley, California 94720
USA
J. D. Lafferty
Computer Science Department
and Machine Learning Department
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213
USA
\smonth10 \syear2008\smonth1 \syear2009
\smonth10 \syear2008\smonth1 \syear2009
\smonth10 \syear2008\smonth1 \syear2009
###### Abstract

We consider the problem of estimating the graph associated with a binary Ising Markov random field. We describe a method based on -regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an -constraint. The method is analyzed under high-dimensional scaling in which both the number of nodes and maximum neighborhood size are allowed to grow as a function of the number of observations . Our main results provide sufficient conditions on the triple and the model parameters for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. With coherence conditions imposed on the population Fisher information matrix, we prove that consistent neighborhood selection can be obtained for sample sizes with exponentially decaying error. When these same conditions are imposed directly on the sample matrices, we show that a reduced sample size of suffices for the method to estimate neighborhoods consistently. Although this paper focuses on the binary graphical models, we indicate how a generalization of the method of the paper would apply to general discrete Markov random fields.

[
\kwd
\doi

10.1214/09-AOS691 \volume38 \issue3 2010 \firstpage1287 \lastpage1319

\runtitle

High-dimensional Ising model selection

{aug}

and B]\fnmsJohn D. \snmLafferty\thanksreft1label=e3]lafferty@cs.cmu.edu

\thankstext

t1Supported in part by NSF Grants IIS-0427206 and CCF-0625879.

\thankstext

t2Supported in part by a Siebel Scholarship.

\thankstext

t3Supported in part by NSF Grants DMS-06-05165 and CCF-0545862.

class=AMS] \kwd[Primary ]62F12 \kwd[; secondary ]68T99. Graphical models \kwdMarkov random fields \kwdstructure learning \kwd-regularization \kwdmodel selection \kwdconvex risk minimization \kwdhigh-dimensional asymptotics.

## 1 Introduction

Undirected graphical models, also known as Markov random fields, are used in a variety of domains, including statistical physics Ising25 (), natural language processing Manning99 (), image analysis Woods78 (), Hassner80 (), Cross83 () and spatial statistics Ripley81 (), among others. A Markov random field (MRF) is specified by an undirected graph with vertex set and edge set . The structure of this graph encodes certain conditional independence assumptions among subsets of the -dimensional discrete random variable where variable is associated with vertex . One important problem for such models is to estimate the underlying graph from independent and identically distributed samples drawn from the distribution specified by some Markov random field. As a concrete illustration, for binary random variables, each vector-valued sample might correspond to the votes of a set of politicians on a particular bill, and estimating the graph structure amounts to detecting statistical dependencies in these voting patterns (see Banerjee, Ghaoui and d’Asprémont BanGhaAsp08 () for further discussion of this example).

Due to both its importance and difficulty, the problem of structure learning for discrete graphical models has attracted considerable attention. The absence of an edge in a graphical model encodes a conditional independence assumption. Constraint-based approaches spirtes00 () estimate these conditional independencies from the data using hypothesis testing and then determine a graph that most closely represents those independencies. Each graph represents a model class of graphical models; learning a graph then is a model class selection problem. Score-based approaches combine a metric for the complexity of the graph with a measure of the goodness of fit of the graph to the data; for instance, log-likelihood of the maximum likelihood parameters given the graph, to obtain a score for each graph. The score is used together with a search procedure that generates candidate graph structures to be scored. The number of graph structures grows super-exponentially, however, and Chickering chickering95 () shows that this problem is in general NP-hard.

A complication for undirected graphical models involving discrete random variables is that typical score metrics involve the partition function or cumulant function associated with the Markov random field. For general undirected MRFs, calculation of this partition function is computationally intractable Welsh93 (). The space of candidate structures in scoring based approaches is thus typically restricted to either directed graphical models dasgupta99 () or to simple sub-classes of undirected graphical models such as those based on trees chowliu68 () and hypertrees srebro03 (). Abbeel, Koller and Ng AbbKolNg06 () propose a method for learning factor graphs based on local conditional entropies and thresholding and analyze its behavior in terms of Kullback–Leibler divergence between the fitted and true models. They obtain a sample complexity that grows logarithmically in the number of vertices , but the computational complexity grows at least as quickly as where is the maximum neighborhood size in the graphical model. This order of complexity arises from the fact that for each node, there are possible neighborhoods of size for a graph with vertices. Csiszár and Talata Csiszar06 () show consistency of a method that uses pseudo-likelihood and a modification of the BIC criterion, but this also involves a prohibitively expensive search.

The main contribution of this paper is a careful analysis of the computational and statistical efficiency of a simple method for graphical model selection. The basic approach is straightforward: it involves performing -regularized logistic regression of each variable on the remaining variables, and then using the sparsity pattern of the regression vector to infer the underlying neighborhood structure. Our analysis is high-dimensional in nature, meaning that both the model dimension as well as the maximum neighborhood size may tend to infinity as a function of the size . Our main result shows that under mild assumptions on the population Fisher information matrix, consistent neighborhood selection is possible using samples and computational complexity . We also show that when the same assumptions are imposed directly on the sample matrices, samples suffice for consistent neighborhood selection with the same computational complexity. We focus in this paper on binary Ising models, but indicate in Section 7 a generalization of the method applicable to general discrete Markov random fields.

The technique of -regularization for estimation of sparse models or signals has a long history in many fields (for instance, see Tropp06 () for one survey). A surge of recent work has shown that -regularization can lead to practical algorithms with strong theoretical guarantees (e.g., CandesTao06 (), DonohoElad03 (), Meinshausen06 (), ng04 (), Tropp06 (), Wainwright06new (), ZhaoYu06 ()). Despite the well-known computational intractability of computing marginals and likelihoods for discrete MRFs Welsh93 (), our method is computationally efficient; it involves neither computing the normalization constant (or partition function) associated with the Markov random field nor a combinatorial search through the space of graph structures. Rather, it requires only the solution of standard convex programs with an overall computational complexity of order and is thus well suited to high-dimensional problems KohKimBoy07 (). Conceptually, like the work of Meinshausen and Bühlmann Meinshausen06 () on covariance selection in Gaussian graphical models, our approach can be understood as using a type of pseudo-likelihood based on the local conditional likelihood at each node. In contrast to the Gaussian case, where the exact maximum likelihood estimate can be computed exactly in polynomial time, this use of a surrogate loss function is essential for discrete Markov random fields given the intractability of computing the exact likelihood Welsh93 ().

Portions of this work were initially reported in a conference publication WaiRavLaf06 (), with the weaker result that samples suffice for consistent Ising model selection. Since the appearance of that paper, other researchers have also studied the problem of model selection in discrete Markov random fields. For the special case of bounded degree models, Bresler, Mossel and Sly Bresler08 () describe a simple search-based method, and prove under relatively mild assumptions that it can recover the graph structure with samples. However, in the absence of additional restrictions, the computational complexity of the method is . In other work, Santhanam and Wainwright SanWai08 () analyze the information-theoretic limits of graphical model selection, providing both upper and lower bounds on various model selection procedures, but these methods also have prohibitive computational costs.

The remainder of this paper is organized as follows. We begin in Section 2 with background on discrete graphical models, the model selection problem and logistic regression. In Section 3, we state our main result, develop some of its consequences and provide a high-level outline of the proof. Section 4 is devoted to proving a result under stronger assumptions on the sample Fisher information matrix whereas Section 5 provides concentration results linking the population matrices to the sample versions. In Section 6, we provide some experimental results that illustrate the practical performance of our method and the close agreement between theory and practice. Section 7 discusses an extension to more general Markov random fields, and we conclude in Section 8.

### Notation

For the convenience of the reader, we summarize here notation to be used throughout the paper. We use the following standard notation for asymptotics: we write if for some constant , and if for some constant . The notation means that and . Given a vector and parameter , we use to denote the usual norm. Given a matrix and parameter , we use to denote the induced matrix-operator norm with viewed as a mapping from (see Horn and Johnson Horn85 ()). Two examples of particular importance in this paper are the spectral norm , corresponding to the maximal singular value of , and the matrix norm, given by . We make use of the bound for any symmetric matrix .

## 2 Background and problem formulation

We begin by providing some background on Markov random fields, defining the problem of graphical model selection and describing our method based on neighborhood logistic regression.

### 2.1 Pairwise Markov random fields

Let denote a random vector with each variable taking values in a corresponding set . Say we are given an undirected graph with vertex set and edge set , so that each random variable is associated with a vertex . The pairwise Markov random field associated with the graph over the random vector is the family of distributions of which factorize as where for each edge , is a mapping from pairs to the real line. For models involving discrete random variables, the pairwise assumption involves no loss of generality since any Markov random field with higher-order interactions can be converted (by introducing additional variables) to an equivalent Markov random field with purely pairwise interactions (see Wainwright and Jordan WaiJor03Monster () for details of this procedure).

#### Ising model

In this paper, we focus on the special case of the Ising model in which for each vertex , and for some parameter , so that the distribution takes the form

 Pθ∗(x)=1Z(θ∗)exp{∑(s,t)∈Eθ∗stxsxt}. (1)

The partition function ensures that the distribution sums to one. This model is used in many applications of spatial statistics such as modeling the behavior of gases or magnets in statistical physics Ising25 (), building statistical models in computer vision Geman84 () and social network analysis.

### 2.2 Graphical model selection

Suppose that we are given a collection of samples where each -dimensional vector is drawn in an i.i.d. manner from a distribution of the form (1) for parameter vector and graph over the variables. It is convenient to view the parameter vector as a -dimensional vector, indexed by pairs of distinct vertices but nonzero if and only if the vertex pair belongs to the unknown edge set of the underlying graph . The goal of graphical model selection is to infer the edge set . In this paper, we study the slightly stronger criterion of signed edge recovery; in particular, given a graphical model with parameter , we define the edge sign vector

Here the sign function takes value if , value if and , otherwise. Note that the weaker graphical model selection problem amounts to recovering the vector of absolute values.

The classical notion of statistical consistency applies to the limiting behavior of an estimation procedure as the sample size goes to infinity with the model size itself remaining fixed. In many contemporary applications of graphical models—among them gene microarray data and social network analysis—the model dimension is comparable to or larger than the sample size , so that the relevance of such “fixed ” asymptotics is limited. With this motivation, our analysis in this paper is of the high-dimensional nature, in which both the model dimension and the sample size are allowed to increase, and we study the scalings under which consistent model selection is achievable.

More precisely, we consider sequences of graphical model selection problems, indexed by the sample size , number of vertices and maximum node degree . We assume that the sample size goes to infinity, and both the problem dimension and may also scale as a function of . The setting of fixed or is covered as a special case. Let be an estimator of the signed edge pattern based on the samples. Our goal is to establish sufficient conditions on the scaling of the triple such that our proposed estimator is consistent in the sense that

 P[ˆEn=E∗]→1as n→+∞.

We sometimes call this property sparsistency, as a shorthand for consistency of the sparsity pattern of the parameter .

### 2.3 Neighborhood-based logistic regression

Recovering the signed edge vector of an undirected graph is equivalent to recovering, for each vertex , its neighborhood set along with the correct signs for all . To capture both the neighborhood structure and sign pattern, we define the product set of “signed vertices” as . We use the shorthand “” for elements . We then define the signed neighborhood set as

 (3)

Here the sign function has an unambiguous definition, since for all . Observe that this signed neighborhood set can be recovered from the sign-sparsity pattern of the -dimensional subvector of parameters

 θ∗∖r:={θ∗ru,u∈V∖r},

associated with vertex . In order to estimate this vector , we consider the structure of the conditional distribution of given the other variables . A simple calculation shows that under the model (1), this conditional distribution takes the form

 Pθ∗(xr∣x∖r)=exp(2xr∑t∈V∖rθ∗rtxt)exp(2xr∑t∈V∖rθ∗rtxt)+1. (4)

Thus the variable can be viewed as the response variable in a logistic regression in which all of the other variables play the role of the covariates.

With this set-up, our method for estimating the sign-sparsity pattern of the regression vector and hence the neighborhood structure is based on computing an -regularized logistic regression of on the other variables . Explicitly, given , a set of i.i.d. samples, this regularized regression problem is a convex program of the form

 (5)

where

 ℓ(θ;Xn1):=−1nn∑i=1logPθ(x(i)r∣x(i)∖r) (6)

is the rescaled negative log likelihood (the rescaling factor in this definition is for later theoretical convenience) and is a regularization parameter, to be specified by the user. For notational convenience, we will also use as notation for this regularization parameter suppressing the potential dependence on and .

Following some algebraic manipulation, the regularized negative log likelihood can be written as

 minθ∖r∈Rp−1{1nn∑i=1f(θ;x(i))−∑u∈V∖rθruˆμru+λn∥θ∖r∥1}, (7)

where

 f(θ;x):=log{exp(∑t∈V∖rθrtxt)+exp(−∑t∈V∖rθrtxt)} (8)

is a rescaled logistic loss, and are empirical moments. Note the objective function (7) is convex but not differentiable, due to the presence of the -regularizer. By Lagrangian duality, the problem (7) can be re-cast as a constrained problem over the ball . Consequently, by the Weierstrass theorem, the minimum over is always achieved.

Accordingly, let be an element of the minimizing set of problem (7). Although need not be unique in general since the problem (7) need not be strictly convex, our analysis shows that in the regime of interest, this minimizer is indeed unique. We use to estimate the signed neighborhood according to

 ˆN±(r):={sign(ˆθru)u∣u∈V∖r,ˆθsu≠0}. (9)

We say that the full graph is estimated consistently, written as the event , if every signed neighborhood is recovered—that is, for all .

## 3 Method and theoretical guarantees

Our main result concerns conditions on the sample size relative to the parameters of the graphical model—more specifically, the number of nodes and maximum node degree —that ensure that the collection of signed neighborhood estimates (9), one for each node of the graph, agree with the true neighborhoods so that the full graph is estimated consistently. In this section, we begin by stating the assumptions that underlie our analysis, and then give a precise statement of the main result. We then provide a high-level overview of the key steps involved in its proof, deferring details to later sections. Our analysis proceeds by first establishing sufficient conditions for correct signed neighborhood recovery—that is, —for some fixed node . By showing that this neighborhood consistency is achieved at sufficiently fast rates, we can then use a union bound over all nodes of the graph to conclude that consistent graph selection is also achieved.

### 3.1 Assumptions

Success of our method requires certain assumptions on the structure of the logistic regression problem. These assumptions are stated in terms of the Hessian of the likelihood function as evaluated at the true model parameter . More specifically, for any fixed node , this Hessian is a matrix of the form

 Q∗r:=Eθ∗{∇2logPθ∗[Xr∣X∖r]}. (10)

For future reference, this is given as the explicit expression

 Q∗r=Eθ∗[η(X;θ∗)X∖rXT∖r], (11)

where

 η(u;θ):=4exp(2ur∑t∈V∖rθrtut)(exp(2ur∑t∈V∖rθrtut)+1)2 (12)

is the variance function. Note that the matrix is the Fisher information matrix associated with the local conditional probability distribution. Intuitively, it serves as the counterpart for discrete graphical models of the covariance matrix of Gaussian graphical models, and indeed our assumptions are analogous to those imposed in previous work on the Lasso for Gaussian linear regression Meinshausen06 (), Tropp06 (), ZhaoYu06 ().

In the following we write simply for the matrix where the reference node should be understood implicitly. Moreover, we use to denote the subset of indices associated with edges of , and to denote its complement. We use to denote the sub-matrix of indexed by . With this notation, we state our assumptions:

#### (A1) Dependency condition

The subset of the Fisher information matrix corresponding to the relevant covariates has bounded eigenvalues; that is, there exists a constant such that

 Λmin(Q∗SS)≥Cmin. (13)

Moreover, we require that . These conditions ensure that the relevant covariates do not become overly dependent. (As stated earlier, we have suppressed notational dependence on ; thus these conditions are assumed to hold for each .)

#### (A2) Incoherence condition

Our next assumption captures the intuition that the large number of irrelevant covariates (i.e., nonneighbors of node ) cannot exert an overly strong effect on the subset of relevant covariates (i.e., neighbors of node ). To formalize this intuition, we require the existence of an such that

 |||Q∗ScS(Q∗SS)−1|||∞≤1−α. (14)

### 3.2 Statement of main result

We are now ready to state our main result on the performance of neighborhood logistic regression for graphical model selection. Naturally, the limits of model selection are determined by the minimum value over the parameters for pairs included in the edge set of the true graph. Accordingly, we define the parameter

 θ∗min=min(r,t)∈E|θ∗rt|. (15)

With this definition, we have the following:

###### Theorem 1

Consider an Ising graphical model with parameter vector and associated edge set such that conditions (A1) and (A2) are satisfied by the population Fisher information matrix , and let be a set of i.i.d. samples from the model specified by . Suppose that the regularization parameter is selected to satisfy

 λn≥16(2−α)α√logpn. (16)

Then there exist positive constants and , independent of , such that if

 n>Ld3logp, (17)

then the following properties hold with probability at least .

1. [(a)]

2. For each node , the -regularized logistic regression (5), given data , has a unique solution, and so uniquely specifies a signed neighborhood .

3. For each , the estimated signed neighborhood correctly excludes all edges not in the true neighborhood. Moreover, it correctly includes all edges for which .

The theorem not only specifies sufficient conditions but also the probability with which the method recovers the true signed edge-set. This probability decays exponentially as a function of which leads naturally to the following corollary on model selection consistency of the method for a sequence of Ising models specified by .

###### Corollary 1

Consider a sequence of Ising models with graph edge sets and parameters ; each of which satisfies conditions (A1) and (A2). For each , let be a set of i.i.d. samples from the model specified by , and suppose that satisfies the scaling condition (17) of Theorem 1. Suppose further that the sequence of regularization parameters satisfies condition (16) and

 λ2nn→∞ (18)

and the minimum parameter weights satisfy

 min(r,t)∈E∗n∣∣θ∗(n,p,d)(r,t)∣∣≥10Cmin√dλn (19)

for sufficiently large . Then the method is model selection consistent so that if is the graph structure estimated by the method given data , then as .

#### Remarks

(a) It is worth noting that the scaling condition (17) on allows for graphs and sample sizes in the “large , small ” regime (meaning ), as long as the degrees are bounded, or grow at a sufficiently slow rate. In particular, one set of sufficient conditions are the scalings

 d=O(nc1)andp=O(enc2),3c1+c2<1,

for some constants . Under these scalings, note that we have , so that condition (17) holds.

A bit more generally, note that in the regime , the growth condition (17) requires that that . However, in many practical applications of graphical models (e.g., image analysis, social networks), one is interested in node degrees that remain bounded or grow sub-linearly in the graph size so that this condition is not unreasonable.

(b) Loosely stated, the theorem requires that the edge weights are not too close to zero (in absolute value) for the method to estimate the true graph. In particular, conditions (16) and (19) imply that the minimum edge weight is required to scale as

 θ∗min=Ω(√dlogpn).

Note that in the classical fixed case, this reduces to the familiar scaling requirement of .

(c) In the high-dimensional setting (for , a choice of the regularization parameter satisfying both conditions (16) and (18) is, for example,

 λn=16(2−α)α√logpn

for which the probability of incorrect model selection decays at rate for some constant . In the classical setting (fixed ), this choice can be modified to .

The analysis required to prove Theorem 1 can be divided naturally into two parts. First, in Section 4, we prove a result (stated as Proposition 1) for “fixed design” matrices. More precisely, we show that if the dependence condition (A1) and the mutual incoherence condition (A2) hold for the sample Fisher information matrix

 Qn:=ˆE[η(X;θ∗)X∖rXT∖r]=1nn∑i=1η(x(i);θ∗)x(i)∖r(x(i)∖r)T, (20)

then the growth condition (17) and choice of from Theorem 1 are sufficient to ensure that the graph is recovered with high probability.

The second part of the analysis, provided in Section 5, is devoted to showing that under the specified growth condition (17), imposing incoherence and dependence assumptions on the population version of the Fisher information guarantees (with high probability) that analogous conditions hold for the sample quantities . On one hand, it follows immediately from the law of large numbers that the empirical Fisher information converges to the population version for any fixed subset . However, in the current setting, the added delicacy is that we are required to control this convergence over subsets of increasing size. Our proof therefore requires some large-deviation analysis for random matrices with dependent elements so as to provide exponential control on the rates of convergence.

### 3.3 Primal-dual witness for graph recovery

At the core of our proof lies the notion of a primal-dual witness used in previous work on the Lasso Wainwright06new (). In particular, our proof involves the explicit construction of an optimal primal-dual pair—namely, a primal solution along with an associated subgradient vector (which can be interpreted as a dual solution), such that the sub-gradient optimality conditions associated with the convex program (7) are satisfied. Moreover, we show that under the stated assumptions on , the primal-dual pair can be constructed such that they act as a witness—that is, a certificate guaranteeing that the method correctly recovers the graph structure.

For the convex program (7), the zero sub-gradient optimality conditions Rockafellar () take the form

 ∇ℓ(ˆθ)+λnˆz=0, (21)

where the dual or subgradient vector must satisfy the properties

 ˆzrt=sign(ˆθrt)if ˆθi≠0and|ˆzrt|≤1otherwise. (22)

By convexity, a pair is a primal-dual optimal solution to the convex program and its dual if and only if the two conditions (21) and (22) are satisfied. Of primary interest to us is the property that such an optimal primal-dual pair correctly specifies the signed neighborhood of node ; the necessary and sufficient conditions for such correctness are {subequation}

 sign(ˆzrt) = sign(θ∗rt)∀(r,t)∈S:={(r,t)∈E}and (23) ˆθru = 0for all (r,u)∈Sc:=E∖S. (24)

The -regularized logistic regression problem (7) is convex; however, for , it need not be strictly convex, so that there may be multiple optimal solutions. The following lemma, proved in Appendix A, provides sufficient conditions for shared sparsity among optimal solutions, as well as uniqueness of the optimal solution:

###### Lemma 1

Suppose that there exists an optimal primal solution with associated optimal dual vector such that . Then any optimal primal solution must have . Moreover, if the Hessian sub-matrix is strictly positive definite, then is the unique optimal solution.

Based on this lemma, we construct a primal-dual witness with the following steps.

1. [(a)]

2. First, we set as the minimizer of the partial penalized likelihood

 ˆθS=argmin(θS,0)∈Rp−1{ℓ(θ;Xn1)+λn∥θS∥1} (25)

and set .

3. Second, we set so that condition (24) holds.

4. In the third step, we obtain from (21) by substituting in the values of and . Thus our construction satisfies conditions (24) and (21).

5. The final and most challenging step consists of showing that the stated scalings of imply that, with high-probability, the remaining conditions (23) and (22) are satisfied.

Our analysis in step (d) guarantees that with high probability. Moreover, under the conditions of Theorem 1, we prove that the sub-matrix of the sample Fisher information matrix is strictly positive definite with high probability so that by Lemma 1, the primal solution is guaranteed to be unique.

It should be noted that, since is unknown, the primal-dual witness method is not a practical algorithm that could ever be implemented to solve -regularized logistic regression. Rather, it is a proof technique that allows us to establish sign correctness of the unique optimal solution.

## 4 Analysis under sample Fisher matrix assumptions

We begin by establishing model selection consistency when assumptions are imposed directly on the sample Fisher matrix , as opposed to on the population matrix , as in Theorem 1. In particular, recalling the definition (20) of the sample Fisher information matrix , we define the “good event,”

 M(Xn1):={Xn1∈{−1,+1}n×p∣Qn satisfies (A1) and (A2)}. (26)

As in the statement of Theorem 1, the quantities and refer to constants independent of . With this notation, we have the following:

###### Proposition 1 ((Fixed design))

If the event holds, the sample size satisfies , and the regularization parameter is chosen such that , then with probability at least , the following properties hold.

(a) For each node , the -regularized logistic regression has a unique solution, and so uniquely specifies a signed neighborhood .

(b) For each , the estimated signed neighborhood vector correctly excludes all edges not in the true neighborhood. Moreover, it correctly includes all edges with .

Loosely stated, this result guarantees that if the sample Fisher information matrix is “good,” then the conditional probability of successful graph recovery converges to zero at the specified rate. The remainder of this section is devoted to the proof of Proposition 1.

### 4.1 Key technical results

We begin with statements of some key technical lemmas that are central to our main argument with their proofs deferred to Appendix B. The central object is the following expansion obtained by re-writing the zero-subgradient condition as

 ∇ℓ(ˆθ;Xn1)−∇ℓ(θ∗;Xn1)=Wn−λnˆz, (27)

where we have introduced the short-hand notation for the -dimensional score function,

 Wn:=−1nn∑i=1x(i)∖r{x(i)r−exp(∑t∈V∖rθ∗rtx(i)t)−exp(−∑t∈V∖rθ∗rtx(i)t)exp(∑t∈V∖rθ∗rtx(i)t)+exp(−∑t∈V∖rθ∗rtx(i)t)}.

For future reference, note that . Next, applying the mean-value theorem coordinate-wise to the expansion (27) yields

 ∇2ℓ(θ∗;Xn1)[ˆθ−θ∗]=Wn−λnˆz+Rn, (28)

where the remainder term takes the form

 Rnj=[∇2ℓ(\lx@overaccentset\vspace∗−2pt−θ(j);Xn1)−∇2ℓ(θ∗;Xn1)]Tj(ˆθ−θ∗) (29)

with a parameter vector on the line between and , and with denoting the th row of the matrix. The following lemma addresses the behavior of the term in this expansion:

###### Lemma 2

For the specified mutual incoherence parameter , we have

 P(2−αλn∥Wn∥∞≥α4)≤2exp(−α2λ2n128(2−α)2n+log(p)), (30)

which converges to zero at rate as long as .

See Appendix B.1 for the proof of this claim.

The following lemma establishes that the sub-vector is an -consistent estimate of the true sub-vector :

###### Lemma 3 ((ℓ2-consistency of primal subvector))

If and, then

 ∥ˆθS−θS∥2≤5Cmin√dλn. (31)

See Appendix B.2 for the proof of this claim.

Our final technical lemma provides control on the remainder term (29).

###### Lemma 4

If and , then

 ∥Rn∥∞λn≤25DmaxC2minλnd≤α4(2−α).

See Appendix B.3 for the proof of this claim.

### 4.2 Proof of Proposition 1

Using these lemmas, the proof of Proposition 1 is straightforward. Consider the choice of the regularization parameter, . This choice satisfies the condition of Lemma 2, so that we may conclude that with probability greater than , we have

 ∥Wn∥∞≤α2−αλ4≤λ4

using the fact that . The remaining two conditions that we need to apply the technical lemmas concern upper bounds on the quantity . In particular, for a sample size satisfying , we have

 λnd = 16(2−α)α√logpnd ≤ 16C2min100Dmaxα(2−α) < C2min10Dmax

so that the conditions of both Lemmas 3 and 4 are satisfied.

We can now proceed to the proof of Proposition 1. Recalling our shorthand and the fact that we have set in our primal-dual construction, we can re-write condition (28) in block form as {subequation}

 QnScS[ˆθS−θ∗S] = WnSc−λnˆzSc+RnSc, (32) QnSS[ˆθS−θ∗S] = WnS−λnˆzS+RnS. (33)

Since the matrix is invertible by assumption, the conditions (4.2) can be re-written as

 QnScS(QnSS)−1[WnS−λnˆzS+RnS]=WnSc−λnˆzSc+RnSc. (34)

Rearranging yields the condition,

 [WnSc−RnSc]−QnScS(QnSS)−1[WnS−RnS]+λnQnScS(QnSS)−1ˆzS=λnˆzSc. (35)

#### Strict dual feasibility

We now demonstrate that for the dual sub-vector defined by (35), we have . Using the triangle inequality and the mutual incoherence bound (14), we have that

 ∥ˆzSc∥∞ ≤ +∥RnSc∥∞λn+∥WnSc∥∞λn ≤ (1−α)+(2−α)[∥Rn∥∞λn+∥Wn∥∞λn]. (37)

Next, applying Lemmas 2 and 4, we have

 ∥ˆzSc∥∞≤(1−α)+α4+α4=1−α2

with probability converging to one.

#### Correct sign recovery

We next show that our primal sub-vector defined by (25) satisfies sign consistency, meaning that . In order to do so, it suffices to show that

 ∥θS−θ∗S∥∞≤θ∗min2

recalling the notation . From Lemma 3, we have so that

 2θ∗min∥θS−θ∗S∥∞ ≤ 2θ∗min∥θS−θ∗S∥2 ≤ 2θ∗min5Cmin√dλn,

which is less than one as long as .

## 5 Uniform convergence of sample information matrices

In this section we complete the proof of Theorem 1 by showing that if the dependency (A1) and incoherence (A2) assumptions are imposed on the population Fisher information matrix then under the specified scaling of , analogous bounds hold for the sample Fisher information matrices with probability converging to one. These results are not immediate consequences of classical random matrix theory (e.g., DavSza01 ()) since the elements of are highly dependent. Recall the definitions

 Q∗:=Eθ∗[η(X;θ∗)X∖rXT∖r]andQn:=ˆE[η(X;θ∗)X∖rXT∖r], (38)

where denotes the population expectation, and denotes the empirical expectation, and the variance function was defined previously in (12). The following lemma asserts that the eigenvalue bounds in assumption (A1) hold with high probability for sample covariance matrices:

###### Lemma 5

Suppose that assumption (A1) holds for the population matrix and . For any and some fixed constants and , we have {subequation}

 P[Λmax[1nn∑i=1x(i)∖r(x(i)∖r)T]≥Dmax+δ] ≤ 2exp(−Aδ2nd2+Blog(d)), (39) P[Λmin(QnSS)≤Cmin−δ] ≤ 2exp(−Aδ2nd2+Blog(d)). (40)

The following result is the analog for the incoherence assumption (A2) showing that the scaling of given in Theorem 1 guarantees that population incoherence implies sample incoherence.

###### Lemma 6

If the population covariance satisfies a mutual incoherence condition (14) with parameter as in assumption (A2), then the sample matrix satisfies an analogous version, with high probability in the sense that

 P[|||QnScS(QnSS)−1|||∞≥1−α2]≤exp(−Knd3+log(p)). (41)

Proofs of these two lemmas are provided in the following sections. Before proceeding, we take note of a simple bound to be used repeatedly throughout our arguments. By definition of the matrices and [see (20) and (11)], the th element of the difference matrix can be written as an i.i.d. sum of the form where each is zero-mean and bounded (in particular, ). By the Azuma–Hoeffding bound Hoeffding63 (), for any indices and for any , we have

 P[(Zjk)2≥ε2]=P[∣∣∣1nn∑i=1Z(i)jk∣∣∣≥ε]≤2exp(−ε2n32). (42)

So as to simplify notation, throughout this section, we use to denote a universal positive constant, independent of . Note that the precise value and meaning of may differ from line to line.

### 5.1 Proof of Lemma 5

By the Courant–Fischer variational representation Horn85 (), we have

 Λmin(QSS) = min∥x∥2=1xTQSSx = min∥x∥2=1{xTQnSSx+xT(QSS−QnSS)x} ≤ y