An Oracle Approach for Interaction Neighborhood Estimation in Random Fields

# An Oracle Approach for Interaction Neighborhood Estimation in Random Fields

\fnmsMatthieu \snmLerasle\thanksreft1label=e1]lerasle@gmail.com [    \fnmsDaniel, Y. \snmTakahashi\thanksreft2label=e2]takahashiyd@gmail.com [ Instituto de Matemática e Estatística
Universidade de São Paulo
Caixa Postal 66281
05315-970 São Paulo, Brasil
Instituto de Matemática e Estatística
Universidade de São Paulo
Caixa Postal 66281
05315-970 São Paulo, Brasil
currently at Princeton University
Department of Psychology and Neuroscience
Green Hall, Princeton, NJ 08648
###### Abstract

We consider the problem of interaction neighborhood estimation from the partial observation of a finite number of realizations of a random field. We introduce a model selection rule to choose estimators of conditional probabilities among natural candidates. Our main result is an oracle inequality satisfied by the resulting estimator. We use then this selection rule in a two-step procedure to evaluate the interacting neighborhoods. The selection rule selects a small prior set of possible interacting points and a cutting step remove from this prior set the irrelevant points.
We also prove that the Ising models satisfy the assumptions of the main theorems, without restrictions on the temperature, on the structure of the interacting graph or on the range of the interactions. It provides therefore a large class of applications for our results. We give a computationally efficient procedure in these models. We finally show the practical efficiency of our approach in a simulation study.

\runtitle

Interaction Neighborhood Estimation

and \thankstextt1Supported by FAPESP grant 2009/09494-0.

\thankstext

t2Supported by FAPESP grant 2008/08171-0.

## 1 Introduction

Graphical models, also known as random fields, are used in a variety of domains, including computer vision [4, 21], image processing [9], neuroscience [19], and as a general model in spatial statistics [18]. The main motivation for our work comes from neuroscience where the advancement of multichannel and optical technology enabled the scientists to study not only a unit of neurons per time, but tens to thousands of neurons simultaneously [20]. The very important question now in neuroscience is to understand how the neurons in this ensemble interact with each other and how this is related to the animal behavior [19, 8]. This question turns out to be hard for three reasons at least. First, the experimenter has always only access to a small part of the neural system. Moreover, there is no really good model for population of neurons in spite of the good models available for single neurons. Finally, strong long range interactions exist [15]. Our work tries to overcome some of these difficulties as will be shown.

A random field can be specified by a discrete set of sites , possibly infinite, a finite alphabet of spins , and a probability measure on the set of configurations . One of the objects of interest are the one-point specification probabilities, defined for all sites in and all configurations in by a regular version of the conditional probability

 P(x(i)|x(j),j∈G/{i}).

From a statistical point of view, two problems are of natural interest.

Interaction neighborhood identification problem (INI):

The INI problem is to identify, for all sites in , the minimal subset of necessary to describe the specification probabilities in site (see Sections 2 and 3 for details). is called the interaction neighborhood of and the points in are said to interact with . is not necessarily finite but only a finite subset of sites is observed. The observation set is a sample , where are i.i.d with common law . The question is then to recover from , for all in , the sets .

Oracle neighborhood problem (ON):

The ON problem is to identify, for all in , a set , such that the estimation of the conditional probabilities by the empirical conditional probabilities has a minimal risk (see Sections 2 and 3 for details). is then said to satisfy an oracle inequality and it is also called oracle. We look for oracles among the subsets of and we consider the -distance between conditional probabilities to measure the risk of the estimators. An oracle is in general smaller than because it should balance approximation properties and parsimony.

The literature has mainly been focused in the INI problem, see [3, 7, 10, 11, 17] for examples. It requires in general strong assumptions on to be solved. For example, the -penalization procedure proposed in [17] requires an incoherence assumption on the interaction neighborhoods that is very restrictive, as shown by [3]. Moreover, it is assumed in [3, 7, 17] that is finite and that all the sites are observed, i.e. . Csiszar and Talata [10] consider the case when but assume a uniform bound on the cardinality of . The procedure proposed in [11] holds for infinite graph with each site having infinite neighborhoods, but requires that the main interactions belong to a known neighborhood of of order . Moreover, the result is proved in the Ising model only when the interaction is sufficiently weak.

The first goal of this paper is to show that the ON problem can be solved without any of these hypotheses. We introduce in Section 3.2 a model selection criterion to choose a model and prove that it is an oracle in Theorem 3.2. This result does not require any assumption on the structure of the interaction neighborhoods inside or outside .

The second objective is to show that a selection rule provides also a useful tool to handle the INI problem. We introduce the following two steps procedure. First, we select, for all sites in , a small subset of with the model selection rule. We prove that this set contains the main interacting points inside with large probability. Following the idea introduced in [11], we use then a test to remove from the points of . The new test can be applied to all neighborhoods that are smaller than and that contain the main interaction points in . It requires less restrictive assumptions on the interactions outside and on the measure than the one of [11]. For example, it works in the Ising models without restrictions on the temperature parameter. Furthermore, the two-step method let us look for the interacting points inside all the observation set (of order for some ), and not only inside a prior subset (smaller than ) of .

All the results hold under a key assumption H1 that is not classical, but that is satisfied by Ising models, see Theorem 4.5. We obtain then a large class of models, widely used in practice, where our methods are efficient. We also provide for this model a computationally efficient version of our main algorithms.

The paper is organized as follows. In Section 2, we introduce notations and assumptions used all along the paper. Section 3 gives the main results, in a general framework. Section 4 shows the application to Ising models and Section 5 presents a large simulation study where the problem of the practical calibration of some parameters is adressed. Section 6 is a discussion of the results with a comparison to existing papers. Section 7 gives the proofs of the main theorems and some technical results are recalled in an appendix in Section 9.

## 2 Notations and Main Assumptions

Let be a discrete set of sites, possibly infinite, be the binary alphabet of spins, and be a probability measure on the set of configurations . More generally, for all subsets of , let be the set of configurations on . In what follows, the triplet will be called a random field. For all in , for all , for all in , let and for all probability measures on , let

 Qi|V(x)=Q(x(i)|x(V/{i}))

be a regular version of the conditional probability. All along the paper, we will use the convention that, if is a finite set, a probability measure on and is a configuration such that , then .
For all in and all in , let be the configuration such that for all and We say that there is a pairwise interaction from to if there exists in such that For all subsets of , for all probability measures on , let

 ωVi,j(Q)=supx∈X(G){Qi|V(x)−Qi|V(xj)}.

With the above notations, there is a pairwise interaction from to if and only if . Our second task in this paper is to recover the set of sites having a pairwise interaction with . This definition differs in general from the one suggested in introduction. However, it is easy to check that they coincide in the Ising models defined in Section 4.
Let be i.i.d. with common law . Let be a finite subset of of observed sites, with cardinality . The observation set is then . Let be the empirical measure on defined for all configurations in by

 ˆP(x)=1nn∑i=11{Xi(G)=x(G)}.

For all real valued functions defined on , let . For all subsets of , the -risk of is defined by . This risk is naturally decomposed into two terms. From the triangular inequality, we have

 ∥∥ˆPi|V−Pi|G∥∥∞≤∥∥ˆPi|V−Pi|V∥∥∞+∥∥Pi|V−Pi|G∥∥∞.

We call variance term the random term and bias term the deterministic one .
Let us finally present our general assumptions on the measure . In the following and are positive constants. The two first assumptions are classical and will only be used to discuss the main results.

NN: (Non-Nullness) For all in , .

CA:(Continuity) For all growing sequences of subsets of such that , for all in ,

 limn→∞∥∥Pi|Vn−Pi|G∥∥∞=0.

The following last assumption is very important for the model selection criterion to work. It is satisfied for example by a generalized form of the Ising model as we will see in Section 4.

H1: For all finite subsets of ,

 κmin∥∥Pi|G−Pi|V∥∥∞≤∥∥Pi|G∥∥∞−∥∥Pi|V∥∥∞.

## 3 General results

### 3.1 Control of the variance term of the L∞-risk

Our first theorem provides a sharp control of the variance term of the risk of . It holds without assumption on the measure or the finite subset .

###### Theorem 3.1.

Let be a probability measure on , let be a finite subset of . Let . There exists an absolut constant such that, for all ,

 P⎛⎜⎝∥∥ˆPi|V−Pi|V∥∥∞>c1 ⎷ln(δ/pV−)npV−⎞⎟⎠≤1δ. (1)

Moreover, let . There exists an absolut constant such that, for all ,

 P⎛⎝∥∥ˆPi|V(x)−Pi|V(x)∥∥∞>c2 ⎷ln(δn)nˆpV−⎞⎠≤1δ. (2)

Remark:

• Let denote the cardinality of , if satisfies NN we have Hence, (1) implies that,

 P⎛⎝∥∥ˆPi|V−Pi|V∥∥∞≤c1√ν|V||V|ln(ν)+2ln(n)n⎞⎠≥1−n−2.

The variance term goes almost surely to if . If in addition satisfies CA and is a growing sequence of sets with limit , the estimator is consistent.

• (1) is only interesting theoretically, because the parameter is unknown in practice. We will use (2) for our model selection algorithm.

### 3.2 Model Selection

We deduce from Theorem 3.1 that the risk of the estimator is bounded in the following way. For all , for all subsets ,

 P⎛⎝∥∥Pi|G−ˆPi|V∥∥∞≤∥∥Pi|G−Pi|V∥∥∞+c2 ⎷ln(δn)nˆpV−⎞⎠≥1−δ−1. (3)

The risk of depends on the approximation properties of through the bias that is typically unknown in practice, and on the complexity of , measured here by . The aim of this section is to provide model selection procedures in order to select a subset of that optimizes the bound (3). In the following, we denote by a finite collection of subsets of , possibly random, and we call optimal or oracle in , any subset in , possibly random, such that,

 P⎛⎝∥∥Pi|G−ˆPi|^G∥∥∞≤KinfV∈Gn⎧⎨⎩∥∥Pi|G−Pi|V∥∥∞+ ⎷ln(δn)nˆpV−⎫⎬⎭⎞⎠≥1−δ−1.

We introduce the following selection rule. Let be an almost sure bound on the cardinality of . For all and for all , let

 ^G(C,δ,Gn)=argminV∈Gn{−∥∥ˆPi|V∥∥∞+Cpen(V)},wherepen(V)≥ ⎷ln(δnNn)nˆpV−. (4)

The following theorem states that is almost an oracle.

###### Theorem 3.2.

Let be a probability measure on satisfying H1. Let be a finite collection of finite subsets of , possibly random, and let be an almost sure bound on the cardinality of . For all , , let be the estimator given by (4). There exists a positive constant such that,

 P(∥∥ˆPi|^Gδ(C)−Pi|G∥∥∞≤KinfV∈Gn{∥∥Pi|G−Pi|V∥∥∞+pen(V)})≥1−1δ.

Remarks:

• Theorem 3.2 states that the risk of the estimator selected by the rule (4) is the best among the collection . It is the main result of the paper and we will discuss in what follows several applications.

• The key idea of the proof is that, by assumption H1, we have , hence, our decision rule consists essentially in minimizing the sum of the bias term and the variance term of the risk, and the selected estimator is then an oracle.

• The constant derived in Theorem 3.1 is very pessimistic. Hence, Theorem 3.2 is more interesting theoretically. In the simulations of Section 5, we will calibrate with the slope algorithm introduced in [5] and illustrate the nice properties of the resulting .

Let us go back to the ON problem. It is solved thanks to the following corollary.

###### Corollary 3.3.

Let be a random field. Let is a finite subset of with cardinality , let and let . For all , , let For all , let , let be the estimator given by (4). With probability larger than , we have

 ∥∥ˆPi|^Gδ(C))−Pi|G∥∥∞≤KinfV⊂VM⎧⎨⎩∥∥Pi|G−Pi|V∥∥∞+ ⎷ΓM(δ)nˆpV−⎫⎬⎭.

Remarks:

• The complexity of the model selection algorithm for the collection is . This collection is used when a uniform bound on the cardinalities of the is known. The complexity is the minimal necessary to recover the interaction graph in this problem [7].

### 3.3 Estimation of the interaction subgraph

Let be an integer and let be a finite subset of , with cardinality . For all subsets of , let us choose Let be a finite subset of , we study in this section the estimators of given by

 ^GVi(c)={j∈V,ωVi,j(ˆP)>cvVn(δ)}. (5)

We introduce the following function.

 Ψ(v)=infV,ˆpV−≥v−2∥∥Pi|V−Pi|G∥∥∞.

represents the minimal value of the bias term at a given value of the variance term. Our assumption concerns the rate of convergence of to .

H2(): There exist , such that, for all , for all ,

 P(Ψ(Kv)≤CΨK−αΨΨ(v))≥1−ϵΨ.
###### Theorem 3.4.

Let be a random field satisfying H1, H2. Let be an integer, let be a finite subset of with cardinality . Let and let . Let . For all in , let

 vVn(δ)= ⎷ΓM(δ)nˆpV−.

Let , and let be the set selected by the selection rule (4). Let and let be the associated set defined by (5). Let be the constant defined in Theorem 3.2 and let . We have

 P( {j∈VM,ωGi,j(P)≥(c+c∞)v^Gn(δ)} ⊂^G^Gδ(C)i(c)⊂{j∈VM,ωGi,j(P)≥(c−c∞)v^Gn(δ)})≥1−δ−1−ϵΨ.

Remark:

• When , contains exactly the sites that have a pairwise interaction with of order the risk of an oracle. It provides a partial solution to the INI problem.

• Theorem 3.4 requires the extra assumption H2 compared to Theorem 3.2. Moreover, the theoretical constant depends on the constants , , .

Let us conclude this section with the two steps algorithm suggested by Theorem 3.4 to estimate .

Estimation algorithm:

• Choose a large subgraph of , typically the nearest neighbors of in .

• Selection step. Choose a model , applying the model selection algorithm of Theorem 3.2 to the collection of all subgraphs of with cardinality smaller than .

• Cutting step. Cut the edges of such that .

## 4 Ising Models

The remaining of the paper is devoted to Ising models. These models are very important in statistical mechanics [12] and neuroscience [19] where they represent the interactions respectively between particles and neurons. In this section, we prove that Ising models satisfy H1, so that all our general results apply in these models. We also define effective algorithms for the ON and INI problems, adapted to this special case.

### 4.1 Verification of H1.

Let us recall the definition of Ising models.

###### Definition 4.1.

Let be a real valued function. For all in and all in , let is said to be a pairwise potential of interaction if, for all in , and if

 r:=supi∈Gsupa∈A∑j∈G∥fai,j∥<∞.

In this case, is called the temperature parameter of the pairwise potential .

###### Definition 4.2.

A probability measure on is called an Ising model with potential if, for all ,

 Pi|G(x)=e∑j∈Gfi,j(x(i),x(j))∑a∈Ae∑j∈Gfi,j(a,x(j))=11+e∑j∈Gfi,j(xi(i),x(j))−fi,j(x(i),x(j)).

The existence of a such a measure is well known [12].

Remark:

• The classical Ising model has potential defined by , , , for all and .

• One of the fundamental questions studied for this class of models is the description of conditions on potential that guarantees uniqueness and non-uniqueness of the Ising model. Usually, high temperature implies conditions for the uniqueness of the Ising model and low temperature implies non-uniqueness [12].

Let , we have then

 Pi|G(x)=11+e−∑j∈Ggi,j(x(i),x(j)).

It is clear that Ising models satisfy CA and NN with .

###### Definition 4.3.

Let be an Ising model, with potential . For all in , for all in , let

 ωi,j(f)=sup(a,b)∈A2{gi,j(a,b)−gi,j(a,−b)}=supb∈A{gi,j(a,b)−gi,j(a,−b)}.

Let us first recall some elementary facts about Ising models.

###### Proposition 4.4.

Let be an Ising model, with potential . For all finite subsets of , for all in , we have

1. .

2. .

The following theorem states that all of our general results apply in Ising models. The key ingredient of the proof is the precise control of the bias term (6).

###### Theorem 4.5.

Let be an Ising model, with potential . There exist two positive constants such that, for all subsets of ,

 c∗r∑j∉Vωi,j(f)≤∥∥Pi|G−Pi|V∥∥∞≤C∗r∑j∉Vωi,j(f). (6)

satisfies assumption H1 i.e. there exists a constant such that, for all finite subsets of ,

 κmin∥∥Pi|G−Pi|V∥∥∞≤∥∥Pi|G∥∥∞−∥∥Pi|V∥∥∞.

### 4.2 A special strategy for Ising models

The model selection algorithm (4) might be computationally demanding in practice when the collection is too large. This is the case of the collection used several times in Section 3, when the values of and are large. The purpose of this section is to show that a special strategy, computationally more attractive, can be adopted in Ising models. The idea comes from [7]. Let us describe the method.

Reduction of the number of sites. Let be the configuration in such that, for all in , .

• Computation of the empirical probabilities. For all in , let

 ˆp(j)=ˆP(x1(j)),ˆp(i,j)=ˆP(x1(i,j)).
• Reduction step. We keep the in such that

 |ˆp(i,j)−ˆp(i)ˆp(j)|>η.

Let also be the smallest such that the number of kept after Step 2 is smaller than .

We denote by the set of kept after Step 2. It is clear that the reduction algorithm has a complexity . Remark that the values do not depend on the configuration since the alphabet has only two letters.

Model selection algorithm. Let .

• Computation of the conditional probabilities. For all in , compute , and .

• Selection Step. We choose and

It is clear that, if , hence

 ˆN=|G|=ˆm∑k=0Ckˆm≤2ˆm≤nκ.

Hence, the complexity of the model selection algorithm is . The global complexity of the algorithm is therefore . As a comparison, the model selection algorithm for was .

#### 4.2.1 Control of the risk of the resulting estimator

###### Theorem 4.6.

Let be an Ising model, with potential . Let

 C1=4r(1+e2r)3e−6r(e4r−1),C2=4r(1+e2r)2e6r(e4r−1).

With probability larger than we have that

 ⎧⎨⎩j∈VM,∣∣ωi,j(f)∣∣≥C1⎛⎝η+3√ln(6Mδ)2n⎞⎠⎫⎬⎭

Furthermore, let us denote by

 V(δ,M)={j∈VM;|ωi,j(f)|≤C1(ηms+3√(2n)−1ln(6Mδ)}.

With probability larger than , we have,

 1K∥∥ˆPi|^G−Pi|G∥∥≤∑j∈V(δ,M)|ωi,j(f)|+infV∈G⎧⎪⎨⎪⎩∑j∈ˆV(η)/V|ωi,j(f)|+ ⎷ln(nκδ)nˆpV−⎫⎪⎬⎪⎭.

Remarks:

• The estimator of the interaction graph has better properties than the one obtained with selection and cutting procedure. The main difference is that there is no term in the rate of convergence.

• The oracle inequality might be a little bit less sharp than the one obtained in (19). This is the price to pay to have a computationally efficient algorithm.

• Our result holds in the Ising model. However, [7] used a similar approach in more general random fields with some additional assumptions and obtained good properties for the INI problem.

## 5 Simulation studies

In this section we illustrate results obtained in Sections 3 and 4 using simulation experiments and introduce the slope heuristic. All these simulation experiments can be reproduced by a set of MATLAB® routines that can be downloaded from www.princeton.edu/ dtakahas/publications/LT10routines.zip.
Let . For the sections 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, and 5.7, we consider an Ising model on , with pairwise potential given by for , , , and . The pair of sites where is shown in Figure 1. For all these experiments, . We simulated independent samples of the Ising model with increasing sample sizes , . For each sample size we have independent replicas.

### 5.1 Variance term of the risk

In the following experiment we will verify Theorem 3.1 in a simulation. For each sample size we computed the normalized variance term for different samples and obtained the average value. The result is summarized in Figure 2.

### 5.2 Slope heuristic

The constant derived from Theorem 3.1 is too pessimistic to be used in practice. The purpose of this section is to present a general method to design this constant. It is based on the slope heuristic, introduced in [5] and proved in several other frameworks in [1, 14]. We refer also to [2] for a large discussion on the practical use of this method. In order to describe it, let us introduce, for all in , a quantity , possibly random, measuring the complexity of the model . The heuristic states the following facts.

1. There exists a positive constant such that when , the complexity of the model selected by the rule (4) is as large as possible.

2. When is slightly larger than the complexity of the selected model is much smaller.

3. When then the risk of the selected model is asymptotically the one of an oracle.

The heuristic yields the following algorithm, defined for all complexity measures .

1. For all , compute , the complexity of the model selected by the rule (4).

2. Choose such that is very large for and much smaller for .

3. Select the final .

The algorithm is based on the idea that and therefore that the final , selected by is an oracle by the third point of the slope heuristic. The actual efficiency of this approach depends highly on the choice of the complexity measure and on the practical way to choose in step 2 of the algorithm. We illustrate the dependence on in the following experiences.

is either the cardinality of (the dimension) or the variance estimator . is selected with the maximum jump criteria [2]: fix an increasing sequence of positive numbers and define

 k=argmaxi{Δ^G(Ci)−Δ^G(Ci−1)},and~Cmin=Ck.

If the maximum is achieved in more than one value, take the biggest of such .

Remark: The calculation of does not yield a significant increase of computational time compared to the evaluation of the model selection criteria for one fixed constant . The only additional cost is due to the fact that one has to keep in the computer memory the conditional probabilities that must be computed only once.

### 5.3 Oracle risk compared to the risk of the estimated model

One way to verify the performance of the slope heuristic proposed in previous section is to compute the ratio

 ∥∥^Pi|^G(2~Cmin)−Pi|G∥∥∞infV⊂G∥∥^Pi|V−Pi|G∥∥∞. (7)

With a reasonable procedure, we expect that the above quantity remains bounded. We applied the model selection procedure (4) with slope heuristic discussed above for the set . For each sample size we computed the ratio (7) for 100 different samples and we obtained the average. The result is summarized in Figure 3.

### 5.4 Discovery rate of the model selection procedure for ON problem

Another way to measure the performance of our model selection procedure is to compute the positive discovery rate

 E[|^G(2~Cmin)∩^Goracle||^Goracle|] (8)

and the negative discovery rate

 E⎡⎢ ⎢⎣|G∖(^G(2~Cmin)∪^Goracle)||G∖^Goracle|⎤⎥ ⎥⎦. (9)

with respect to the oracle .
We estimated (8) and (9) and the result is summurized in Figure 4.

### 5.5 Performance of the model selection procedure for INI problem

A natural question is how well the proposed model selection procedure behaves for the INI problem. Observe that the model selection procedure was designed to solve the ON problem and in principle does not necessary work for the INI problem. To investigate this question for each sample size we estimated the positive discovery rate

 E[|^G(2~Cmin)∩Vi||Vi|]

and the negative discovery rate

 E⎡⎢ ⎢⎣|G∖(^G(2~Cmin)∪Vi)||G∖Vi|⎤⎥ ⎥⎦,

with respect to the interaction neighborhood . The result is summurized in Figure 5.

### 5.6 Relationship between the INI and ON problems

Another interesting question is to understand what is the relationship between the INI and ON problems. Useful quantities for this are the positive discovery rate

 E[|^Goracle∩Vi||Vi|] (10)

and the negative discovery rate

 E⎡⎢ ⎢⎣|G∖(^Goracle∪Vi)||G∖Vi|⎤⎥ ⎥⎦. (11)

We estimated these quantities and the results are summarized in Figure 6.

### 5.7 Select and cut procedure

Here we will show the usefulness of the two-step procedure introduced in Theorem 3.4 by an example. We consider the same independent samples used in previous experiments. We also consider and sample sizes , with independent replicas for each sample size.

Let be the subset of chosen by first applying the model selection procedure for the set . To choose the constant in the model selection procedure, we used the slope heuristic with variance as the complexity measure. Let be the subset of obtained by applying to the subset the cutting procedure with . We first computed the average of the risk ratio

 ∥∥^Pi|^G(SC)−Pi|G∥∥∞infV⊂G∥∥^Pi|V−Pi|G∥∥∞. (12)

for each sample size and compared them with the average of risk ratio (7). The results are summarized in Figure 7.

We also computed the positive and negative discovery rates of and with respect to . The results are presented in Figure 8.

### 5.8 Computationally efficient algorithm

In this section we will illustrate the performance of the strategy introduced in Section 4.2 on the Ising model on , where , with pairwise potential for , , , and independently generated from a Gaussian distribution with and . The pairs of sites with are represented in Figure 9.