
# Asymptotic normality in the maximum entropy models on graphs with an increasing number of parameters

## Abstract

Maximum entropy models, motivated by applications in neuroscience, are natural generalizations of the β-model to weighted graphs. Similar to the β-model, each vertex in maximum entropy models is assigned a potential parameter, and the degree sequence is the natural sufficient statistic. Hillar and Wibisono (2013) proved the consistency of the maximum likelihood estimators. In this paper, we further establish the asymptotic normality for any finite number of the maximum likelihood estimators in the maximum entropy models with three types of edge weights, when the total number of parameters goes to infinity. Simulation studies are provided to illustrate the asymptotic results.

Key words: Maximum entropy models; Maximum likelihood estimator; Asymptotic normality; Increasing number of parameters.

Mathematics Subject Classification: 62E20; 62F12.

Running title: Asymptotic normality in the maximum entropy models

## 1 Introduction

In neuronal networks, neurons in one region of the brain may transmit a continuous signal using sequences of spikes to a second receiver region. The coincidence detectors in the second region capture the absolute difference in spike times between pairs of neurons projecting from the first region. There may be three possible types of timing differences: a zero or nonzero indicator; a countable number of possible values; any nonnegative real value. Exploring how the transmitted signal in the first region can be recovered by the second is a basic question in the analysis of neuronal networks. Maximum entropy models provide a possible solution to this question for the above three possible types of weighted edges. For detailed explanations, see [10]; for their wide applications in biological studies as well as other disciplines such as economics and physics, see [10, 7, 1, 22, 24] and references therein. Maximum entropy models (sometimes under different names) also appear in other fields of network analysis, e.g., community detection and social network analysis. For example, see [6, 2, 3, 14, 25, 16].

In the maximum entropy models, the degree sequence is the exclusive natural sufficient statistic of the exponential family distribution and fully captures the information of an undirected graph. Its study primarily focuses on understanding the generating mechanisms of networks. When network edges take dichotomous values ("0" or "1"), the maximum entropy model becomes the β-model (a name given by Chatterjee, Diaconis and Sly (2011)), an undirected version of the $p_1$ model for directed graphs by Holland and Leinhardt (1981). Rinaldo, Petrović and Fienberg (2013) derived necessary and sufficient conditions for the existence and uniqueness of the maximum likelihood estimate (MLE). As the number of parameters goes to infinity, Chatterjee, Diaconis and Sly (2011) proved that the MLE is uniformly consistent; Yan and Xu (2013) further derived its asymptotic normality. When the maximum entropy models involve finite discrete, infinite discrete or continuous weighted edges, Hillar and Wibisono (2013) obtained explicit conditions for the existence and uniqueness of the MLE and proved that the MLE is uniformly consistent as the number of parameters goes to infinity.

Statistical interest lies not only in the consistency of estimators but also in their asymptotic distributions. The latter can be used to construct confidence intervals for parameters and to perform hypothesis tests. In the asymptotic framework considered in this paper, the number of network vertices goes to infinity and the number of parameters is identical to the number of vertices. Instead of studying the more complicated situation of linear combinations of all MLEs, we describe the central limit theorems for the MLEs through the asymptotic behavior of a finite number of the MLEs, even though the total number of parameters goes to infinity. With this in mind, we aim to establish the asymptotic normality of the MLEs when edges take the three types of weights considered in Hillar and Wibisono (2013). A key step in our proofs applies a highly accurate approximate inverse of the Fisher information matrix due to Yan and Xu (2013).

The remainder of this article is organized as follows. In Section 2, we lay out the asymptotic distributions of the MLEs in the maximum entropy models with finite discrete, infinite discrete and continuous weighted edges in Subsections 2.1, 2.2 and 2.3, respectively. Simulation studies are given in Section 3. Section 4 concludes with a summary and discussion. All proofs are relegated to the Appendix.

## 2 Asymptotic normalities

We first give a brief description of the maximum entropy models. Consider an undirected graph $G_n$ with no self-loops on $n$ vertices labeled by $1,\ldots,n$. Let $a_{ij}$ be the weight of edge $\{i,j\}$ taking values from a set $S$, where $S$ could be a finite discrete, infinite discrete or continuous set. Define $d_i = \sum_{j\ne i} a_{ij}$ as the degree of vertex $i$, and $d = (d_1,\ldots,d_n)^\top$ as the degree sequence of $G_n$. Let $\Omega$ be a $\sigma$-algebra over $S$. Assume there is a canonical $\sigma$-finite probability measure $\nu$ on $\Omega$, and let $\nu^{\binom{n}{2}}$ be the corresponding product measure on $S^{\binom{n}{2}}$. The maximum entropy models assume that the density function of the symmetric adjacency matrix $a = (a_{ij})$ with respect to $\nu^{\binom{n}{2}}$ has exponential form with the degree sequence as natural sufficient statistic, i.e.,

$$p_\theta(a) = \exp\big(-\theta^\top d - z(\theta)\big), \tag{1}$$

where $z(\theta)$ is the normalizing constant,

$$z(\theta) = \log \int_{S^{\binom{n}{2}}} \exp(-\theta^\top d)\, \nu^{\binom{n}{2}}(da) = \log \prod_{1\le i<j\le n} \int_S \exp\big(-(\theta_i+\theta_j)a_{ij}\big)\, \nu(da_{ij}),$$

and, for fixed $n$, the parameter vector $\theta = (\theta_1,\ldots,\theta_n)^\top$ belongs to the natural parameter space (Page 1, Brown, 1984)

$$\Theta = \{\theta \in \mathbb{R}^n : z(\theta) < \infty\}.$$

The parameter $\theta_i$ can be interpreted as the strength of vertex $i$, determining how strongly the vertices are connected to each other. The probability distribution (1) implies that the edges $a_{ij}$, $1\le i<j\le n$, are mutually independent. Since the sample is one realization of a graph, the density function in (1) is also the likelihood function. We can see that the solution to the system of equations $d_i = E(d_i)$, $i=1,\ldots,n$, is the maximum likelihood estimator (MLE) of $\theta$.

We now consider the asymptotic distributions of the MLEs as the number of parameters goes to infinity. Let $V_n$ be the Fisher information matrix of the parameter vector $\theta$. It can be written as

$$V_n = \frac{\partial^2 z(\theta)}{\partial\theta\,\partial\theta^\top}.$$

For the three common types of weights introduced in Section 1, $V_n$ is a diagonally dominant matrix with nonnegative entries. This property is crucially used in the proof of the central limit theorem for the MLE.
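As a quick numerical sanity check (a minimal sketch assuming numpy; the entries follow the continuous-weight formulas given later in Subsection 2.2, and the parameter values are arbitrary choices), the diagonal-dominance structure can be verified directly:

```python
import numpy as np

# Build the continuous-weight Fisher information matrix:
# v_ij = 1/(theta_i + theta_j)^2 for i != j, and v_ii equals the sum of the
# off-diagonal entries of row i. Then check nonnegativity and diagonal dominance.
rng = np.random.default_rng(0)
n = 8
theta = rng.uniform(0.5, 2.0, size=n)  # ensures theta_i + theta_j > 0

V = 1.0 / (theta[:, None] + theta[None, :]) ** 2
np.fill_diagonal(V, 0.0)
np.fill_diagonal(V, V.sum(axis=1))  # v_ii = sum_{j != i} v_ij

assert np.all(V >= 0)
assert np.allclose(np.diag(V), V.sum(axis=1) - np.diag(V))
```

Here diagonal dominance holds with equality, which is exactly the structure exploited in the Appendix.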

### 2.1 Finite discrete weights

When network edges take finite discrete weights, we assume $S = \{0,1,\ldots,q-1\}$ with $q \ge 2$ a fixed integer. In this case, $\nu$ is the counting measure and the edge weights $a_{ij}$ are independent multinomial random variables with probabilities:

$$P(a_{ij}=a) = \frac{e^{a(\theta_i+\theta_j)}}{\sum_{k=0}^{q-1} e^{k(\theta_i+\theta_j)}}, \qquad a = 0,1,\ldots,q-1.$$

This model is a direct generalization of the β-model, which considers only dichotomous edges ($q=2$). The normalizing constant is

$$z(\theta) = \sum_{1\le i<j\le n} \log \sum_{k=0}^{q-1} e^{k(\theta_i+\theta_j)},$$

and the natural parameter space is $\Theta = \mathbb{R}^n$. Let $\hat\theta = (\hat\theta_1,\ldots,\hat\theta_n)^\top$ be the MLE of $\theta$. The likelihood equations are

$$d_i = \sum_{j=1;\,j\ne i}^{n} \frac{\sum_{a=0}^{q-1} a\, e^{a(\hat\theta_i+\hat\theta_j)}}{\sum_{k=0}^{q-1} e^{k(\hat\theta_i+\hat\theta_j)}}, \qquad i=1,\ldots,n, \tag{2}$$

which are identical to the moment estimating equations. The fixed point iteration algorithm by Chatterjee, Diaconis and Sly (2011) or the minorization-maximization algorithm by Hunter (2004) can be used to solve the above system of equations or the analogous systems in the next two subsections. Following Hillar and Wibisono (2013), we assume that $\|\theta\|_\infty = \max_i |\theta_i|$ is bounded by a constant when considering the asymptotic distribution of the MLE. The central limit theorem for the MLE is stated as follows, whose proof is given in Appendix A.
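As an illustration of solving equations (2) (a minimal sketch, using a coordinate-wise bisection rather than the fixed-point or MM algorithms cited above; numpy is assumed, and the support size and parameter values are made-up), one can exploit that each equation is monotone in $\hat\theta_i$:

```python
import numpy as np

def mu(t, q):
    # Mean of an edge weight with pmf proportional to exp(a * t), a = 0..q-1.
    a = np.arange(q)
    w = np.exp(a * t)
    return (a * w).sum() / w.sum()

def solve_mle(d, q, sweeps=100):
    # Nonlinear Gauss-Seidel: each likelihood equation
    # d_i = sum_{j != i} mu(theta_i + theta_j) is solved for theta_i by
    # bisection (mu is increasing in t), holding the other coordinates fixed.
    n = len(d)
    theta = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):
            lo, hi = -10.0, 10.0
            for _ in range(50):
                mid = 0.5 * (lo + hi)
                s = sum(mu(mid + theta[j], q) for j in range(n) if j != i)
                if s > d[i]:
                    hi = mid
                else:
                    lo = mid
            theta[i] = 0.5 * (lo + hi)
    return theta

# Recover hypothetical parameters from their exact expected degrees.
rng = np.random.default_rng(1)
q, n = 3, 6
theta_true = rng.uniform(-0.5, 0.5, size=n)
d = np.array([sum(mu(theta_true[i] + theta_true[j], q)
                  for j in range(n) if j != i) for i in range(n)])
theta_hat = solve_mle(d, q)
```

Because the expected degrees are computed exactly here, the solver recovers `theta_true` up to the bisection tolerance.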

###### Theorem 1.

In the case of finite discrete weights, the diagonal entries of $V_n$ have the following representation:

$$v_{ii} = \sum_{j=1;\,j\ne i}^{n} \frac{\sum_{0\le k<l\le q-1} (l-k)^2\, e^{(k+l)(\theta_i+\theta_j)}}{\big(\sum_{a=0}^{q-1} e^{a(\theta_i+\theta_j)}\big)^2}, \qquad i=1,\ldots,n.$$

Assume that $\|\theta\|_\infty$ is bounded by a fixed constant. Then for any fixed $r$, the vector $\big(v_{11}^{1/2}(\hat\theta_1-\theta_1), \ldots, v_{rr}^{1/2}(\hat\theta_r-\theta_r)\big)$ is asymptotically standard multivariate normal as $n\to\infty$.

Notice that $\hat\theta_1,\ldots,\hat\theta_r$ are asymptotically mutually independent by the above theorem. This is because the maximum entropy models imply mutually independent edges in a graph. It is also reflected in the fact that $V_n^{-1}$ is approximately a diagonal matrix, as shown in Proposition 1, so that $\hat\theta_1,\ldots,\hat\theta_r$ are asymptotically mutually independent.
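The representation of $v_{ii}$ in Theorem 1 can be checked numerically, since each summand is simply the variance of one edge weight (a sketch assuming numpy; the value of $q$ and the values of $t = \theta_i+\theta_j$ are arbitrary choices):

```python
import numpy as np

# Check: sum_{0<=k<l<=q-1} (l-k)^2 e^{(k+l)t} / (sum_a e^{at})^2 equals
# Var(a_ij) for a single edge with t = theta_i + theta_j (finite discrete case).
q = 4
for t in (-0.7, 0.0, 1.3):
    a = np.arange(q)
    w = np.exp(a * t)
    p = w / w.sum()
    var = (a ** 2 * p).sum() - ((a * p).sum()) ** 2  # direct variance

    expr = sum((l - k) ** 2 * np.exp((k + l) * t)
               for k in range(q) for l in range(k + 1, q)) / w.sum() ** 2
    assert abs(var - expr) < 1e-12
```

This is the identity $\mathrm{Var}(a_{ij}) = \sum_{k<l}(l-k)^2 p_k p_l$, which underlies the formula for $v_{ii}$.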

### 2.2 Continuous weights

When network edges take continuous weights, $S = (0,\infty)$, $\nu$ is the Lebesgue measure on $S$, and the normalizing constant is

$$z(\theta) = -\sum_{1\le i<j\le n} \log(\theta_i+\theta_j).$$

Therefore, the corresponding natural parameter space is

$$\Theta = \{(\theta_1,\ldots,\theta_n) \in \mathbb{R}^n : \theta_i+\theta_j > 0 \text{ for } i \ne j\}.$$

The edge weights $a_{ij}$ ($1\le i<j\le n$) are independently distributed as exponential random variables with density

$$p(a) = (\theta_i+\theta_j)\exp\big(-(\theta_i+\theta_j)a\big), \qquad a > 0, \quad 1\le i<j\le n,$$

whose expectation is $1/(\theta_i+\theta_j)$. The likelihood equations are

$$d_i = \sum_{j\ne i} \frac{1}{\hat\theta_i+\hat\theta_j}, \qquad i=1,\ldots,n.$$
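A quick Monte Carlo check of these moment equations (a sketch with numpy; the parameter values, network size and replication count are arbitrary): averaging the observed degrees over independent graphs should approach $\sum_{j\ne i} 1/(\theta_i+\theta_j)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 10, 500
theta = rng.uniform(0.5, 2.0, size=n)

# E(d_i) = sum_{j != i} 1/(theta_i + theta_j) for exponential edge weights.
expected_d = 1.0 / (theta[:, None] + theta[None, :])
np.fill_diagonal(expected_d, 0.0)
expected_d = expected_d.sum(axis=1)

iu = np.triu_indices(n, k=1)
rates = theta[iu[0]] + theta[iu[1]]
avg_d = np.zeros(n)
for _ in range(reps):
    a = np.zeros((n, n))
    a[iu] = rng.exponential(1.0 / rates)  # numpy uses scale = 1/rate
    a += a.T
    avg_d += a.sum(axis=1)
avg_d /= reps

assert np.max(np.abs(avg_d - expected_d) / expected_d) < 0.1
```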

The asymptotic distribution of the MLE is stated as follows, whose proof is given in Appendix B.

###### Theorem 2.

Let $m_n = \min_{i\ne j}(\theta_i+\theta_j)$ and $M_n = \max_{i\ne j}(\theta_i+\theta_j)$. In the case of continuous weights, the diagonal entries of $V_n$ are:

$$v_{ii} = \sum_{j=1,\,j\ne i}^{n} \frac{1}{(\theta_i+\theta_j)^2}, \qquad i=1,\ldots,n.$$

If $M_n/m_n$ grows sufficiently slowly with $n$, then for any fixed $r$, the vector $\big(v_{11}^{1/2}(\hat\theta_1-\theta_1), \ldots, v_{rr}^{1/2}(\hat\theta_r-\theta_r)\big)$ is asymptotically standard multivariate normal as $n\to\infty$.

### 2.3 Infinite discrete weights

When edges take infinite discrete weights, we assume that $S = \{0,1,2,\ldots\}$. In this case, $\nu$ is the counting measure, the normalizing constant is

$$z(\theta) = -\sum_{1\le i<j\le n} \log\big(1-e^{-(\theta_i+\theta_j)}\big),$$

and the natural parameter space is

$$\Theta = \{(\theta_1,\ldots,\theta_n) \in \mathbb{R}^n : \theta_i+\theta_j > 0 \text{ for } i \ne j\}.$$

The edge weights $a_{ij}$ are independent geometric random variables with probability mass function:

$$P(a_{ij}=a) = \big(1-e^{-(\theta_i+\theta_j)}\big)\, e^{-(\theta_i+\theta_j)a}, \qquad a = 0,1,2,\ldots, \quad 1\le i<j\le n.$$

The likelihood equations are

$$d_i = \sum_{j\ne i} \frac{1}{e^{\hat\theta_i+\hat\theta_j}-1}, \qquad i=1,\ldots,n,$$

which are identical to the moment estimating equations.
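These equations can also be solved jointly by a damped Newton iteration whose Jacobian is (minus) the Fisher matrix $V_n$ given in Theorem 3 below (a sketch assuming numpy; the parameter values, starting point and damping factor are arbitrary choices):

```python
import numpy as np

def expected_degrees(theta):
    # d_i(theta) = sum_{j != i} 1 / (exp(theta_i + theta_j) - 1)
    t = theta[:, None] + theta[None, :]
    m = 1.0 / (np.exp(t) - 1.0)
    np.fill_diagonal(m, 0.0)
    return m.sum(axis=1)

def fisher(theta):
    # v_ij = e^{theta_i+theta_j} / (e^{theta_i+theta_j} - 1)^2 for i != j,
    # and v_ii is the sum of the off-diagonal entries of row i.
    t = theta[:, None] + theta[None, :]
    v = np.exp(t) / (np.exp(t) - 1.0) ** 2
    np.fill_diagonal(v, 0.0)
    np.fill_diagonal(v, v.sum(axis=1))
    return v

def newton_mle(d, theta0, steps=60, damp=0.5):
    # The Jacobian of theta -> expected_degrees(theta) is -fisher(theta),
    # so the damped Newton update is theta + damp * V^{-1} (d(theta) - d).
    theta = theta0.copy()
    for _ in range(steps):
        F = expected_degrees(theta) - d
        theta = theta + damp * np.linalg.solve(fisher(theta), F)
    return theta

# Recover hypothetical parameters from their exact expected degrees.
rng = np.random.default_rng(3)
n = 8
theta_true = rng.uniform(0.5, 1.5, size=n)
theta_hat = newton_mle(expected_degrees(theta_true), np.ones(n))
```

With exact expected degrees as input, the iteration recovers `theta_true` to high accuracy.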

###### Theorem 3.

Let $m_n = \min_{i\ne j}(\theta_i+\theta_j)$ and $M_n = \max_{i\ne j}(\theta_i+\theta_j)$. In the case of infinite discrete weights, the diagonal entries of $V_n$ are

$$v_{ii} = \sum_{j=1;\,j\ne i}^{n} \frac{e^{\theta_i+\theta_j}}{(e^{\theta_i+\theta_j}-1)^2}, \qquad i=1,\ldots,n.$$

If $M_n/m_n$ grows sufficiently slowly with $n$, then for any fixed $r$, the vector $\big(v_{11}^{1/2}(\hat\theta_1-\theta_1), \ldots, v_{rr}^{1/2}(\hat\theta_r-\theta_r)\big)$ is asymptotically standard multivariate normal as $n\to\infty$.

###### Remark 1.

By Theorems 1–3, we have: (1) for any fixed $r$, $\hat\theta_1,\ldots,\hat\theta_r$ are asymptotically independent; (2) as $n\to\infty$, the convergence rate of $\hat\theta_i$ is $1/v_{ii}^{1/2}$. If $m_n$ and $M_n$ (the minimum and maximum of $\theta_i+\theta_j$ over $i\ne j$) are constants, then this convergence rate is of the magnitude $n^{-1/2}$; otherwise it depends on how fast $M_n$ grows or $m_n$ vanishes, both when edges take continuous weights and when they take infinite discrete weights. To compare the convergence rates in the continuous and infinite discrete cases, consider the special case in which all $\theta_i+\theta_j$ equal a common value $t_n$. Since $1/t_n^2$ decays more slowly than $e^{t_n}/(e^{t_n}-1)^2$ when $t_n$ is large enough, the former rate is faster than the latter. This can be understood as follows: a lower convergence rate is incurred when the parameter vector approaches the boundary of the mean parameter space more quickly, noting that $E(a_{ij}) = 1/(\theta_i+\theta_j)$ in the continuous case and $E(a_{ij}) = 1/(e^{\theta_i+\theta_j}-1)$ in the infinite discrete case.

###### Remark 2.

In contrast with the conditions on $m_n$ and $M_n$ that guarantee the consistency of the MLE in Hillar and Wibisono (2013), the conditions for asymptotic normality seem much more strict. The simulations in the next section suggest there may be room for improvement. On the other hand, the consistency and asymptotic normality of the MLE in the finite discrete case require the assumption that all parameters are bounded by a constant. This assumption may not be the best possible. We will investigate these problems in the future.

###### Remark 3.

The three theorems in this section only describe the joint asymptotic distribution of the first $r$ estimators, with $r$ a fixed constant. Actually, the starting point of the subscripts is not essential: the theorems hold for any fixed set of $r$ MLEs. Since the usual counting subscript starts from $1$, we only present the case stated in the theorems; our proofs extend directly to any fixed set of $r$ MLEs without difficulty. Another interesting problem, pointed out by one referee, is to investigate the asymptotic distribution of a linear combination of all the MLEs, or of a growing number of them. Are there results similar to part (2) of Propositions 2–4? We will investigate this problem in future work.

###### Remark 4.

According to Theorems 1–3, an approximate $1-\alpha$ confidence interval for $\theta_i-\theta_j$ is $\hat\theta_i-\hat\theta_j \pm z_{\alpha/2}\,(\hat v_{ii}^{-1}+\hat v_{jj}^{-1})^{1/2}$, where $\hat v_{ii}$ and $\hat v_{jj}$ are the natural estimates of $v_{ii}$ and $v_{jj}$ obtained by replacing all $\theta_i$ by their MLEs, and $z_{\alpha/2}$ denotes the upper $\alpha/2$ percentile point of the standard normal distribution. To test whether $\theta_i=\theta_j$ at level $\alpha$, the hypothesis can be rejected if $|\hat\theta_i-\hat\theta_j|/(\hat v_{ii}^{-1}+\hat v_{jj}^{-1})^{1/2} > z_{\alpha/2}$. The confidence intervals for contrasts and the hypothesis test for the equality of two parameters can be generalized to multiple parameters; for example, a Wald-type test statistic can be used to test whether $\theta_1=\cdots=\theta_{r+1}$, and it asymptotically follows the chi-square distribution with $r$ degrees of freedom.
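The two-parameter case can be sketched as follows (stdlib only; the numerical values of the estimates and of $\hat v_{ii}$ are made up for illustration and are not from the paper's simulations):

```python
import math

# Hypothetical MLEs for two vertices and estimated Fisher-information
# diagonal entries v_hat_ii (illustrative numbers only).
theta_hat = [0.82, 0.75]
v_hat = [140.0, 155.0]

z = 1.959964  # standard normal 97.5% quantile
se = math.sqrt(1.0 / v_hat[0] + 1.0 / v_hat[1])
diff = theta_hat[0] - theta_hat[1]
ci = (diff - z * se, diff + z * se)  # approx 95% CI for theta_1 - theta_2

# Wald statistic for H0: theta_1 = theta_2, asymptotically chi-square(1).
w = diff ** 2 / (1.0 / v_hat[0] + 1.0 / v_hat[1])
reject = w > 3.841459  # chi-square(1) 95% quantile (equals z**2)
```

For these illustrative numbers the interval contains zero and the test does not reject, as expected since the two estimates are close relative to their standard errors.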

## 3 Simulations

In this section, we evaluate the asymptotic results for maximum entropy models on weighted graphs with continuous and infinite discrete weights through numerical simulations. The simulation results for finite discrete weights are similar to those in the binary case, which have been shown in Yan and Xu (2013), so we do not repeat them here. First, we study the consistency of the estimation by plotting the estimates $\hat\theta_i$ against the true values $\theta_i$. Second, by Theorems 2 and 3, $\hat v_{ii}^{1/2}(\hat\theta_i-\theta_i)$ is asymptotically standard normal, where $\hat v_{ii}$ is the estimate of $v_{ii}$ obtained by replacing $\theta$ with $\hat\theta$; the corresponding quantile–quantile (QQ) plots are shown. We also report the coverage probabilities of the confidence intervals for pairwise differences of parameters, as well as the frequencies with which the maximum likelihood estimator does not exist in the case of discrete weights. In the simulation studies, the parameter values are specified up to a scaling constant, with several values of the constant chosen to cover both the regimes allowed by the conditions of Theorems 2 and 3 and regimes beyond them; we suppress the subscript $n$ of the scaling constant to display the notation conveniently in the figures. A variety of values of the scaling constant and of the network size $n$ are considered for both continuous and discrete weights.

The plots of $\hat\theta_i$ versus $\theta_i$ are shown in Figure 1. For the case of discrete weights in this figure, we used a nonzero baseline parameter instead of a zero one in order to make the true values vary across vertices. The red lines correspond to the reference case $\hat\theta_i = \theta_i$. For each sub-figure, the rows correspond to different network sizes $n$ and the columns to different values of the scaling constant. From this figure, we can see that as $n$ increases, the estimates become closer to the true parameters. As the scaling constant increases, the estimation error becomes much larger, indicating that controlling the growth rate of $M_n$ (or the decay rate of $m_n$) is necessary. For continuous weights, at the largest scaling the estimates $\hat\theta_i$ become very large; for discrete weights, at the largest scaling the points of $\hat\theta_i$ versus $\theta_i$ diverge, indicating that $\hat\theta$ may not be a consistent estimate of $\theta$ in this case. Therefore, the conditions guaranteeing the consistency results in Theorems 2 and 3 seem reasonable.

The QQ plots in Figures 2 and 3 are based on repeated simulations for each scenario. The horizontal and vertical axes are the empirical and theoretical quantiles, respectively. The red lines correspond to the reference line $y=x$. The coverage frequencies are reported in Table 1. At the largest scaling, the MLEs for the case of discrete weights fail to exist in a large fraction of the repetitions; therefore, the QQ plots for this case are not available. In Figure 2, for the smaller scalings the sample quantiles coincide with the theoretical ones very well (the plot for the intermediate case is similar and is not shown here). On the other hand, at the largest scaling the sample quantiles evidently deviate from the theoretical ones. In this case, the estimated variances of $\hat\theta_i$ are very small, approaching zero, so the central limit theorem cannot be expected to take effect according to classical large sample theory. In Figure 3, the approximation of asymptotic normality is good for the smaller scalings, while there are notable deviations at the largest scaling.

In both the continuous and discrete cases in Table 1, the length of the estimated confidence intervals increases as the scaling constant becomes larger when $n$ is fixed. In the case of continuous weights, at the smaller scalings the length of the estimated confidence intervals decreases as $n$ increases, but at the largest scalings it instead becomes larger. This is because $m_n$ goes to zero as $n$ increases in those settings, leading to larger confidence intervals. In particular, some interval lengths become extremely large, indicating very inaccurate estimates, although the corresponding coverage probabilities remain close to the nominal level. In the case of discrete weights, at the smaller scalings the coverage frequencies are close to the nominal level; at the intermediate scaling the coverage frequencies of some pairs are higher than the nominal level; at the largest scaling, which greatly exceeds the condition of Theorem 3, the MLE almost never exists. These phenomena further suggest that controlling the growth rate of $M_n$ or the decay rate of $m_n$ in Theorems 2 and 3 is necessary.

## 4 Summary and discussion

Investigating asymptotic theory for network models is an open and challenging problem, especially when the number of parameters increases with the size of the network. One reason is that network data are not a standard type of data: in the traditional statistical framework, the number of parameters is fixed and the number of samples goes to infinity, whereas in the asymptotic scenario considered in this paper, the sample is a single realization of a random graph and the number of parameters is identical to the number of vertices. Nevertheless, asymptotic results for the MLE have been derived in some simple undirected models with the degree sequence as the exclusive natural sufficient statistic (i.e., the maximum entropy models). As the number of parameters goes to infinity, we obtain the asymptotic normality of the MLE in the maximum entropy models for a class of weighted edges; the proofs rely on approximate inverses of the Fisher information matrix. We expect that the methods of our proofs can be applied to other high-dimensional cases in which the Fisher information matrix is nonnegative and diagonally dominant, or to other similar cases. For example, Perry and Wolfe (2012) introduced a family of null models for network data in which the entries of the upper triangle of the adjacency matrix are assumed to be independent Bernoulli random variables with success probabilities $p_{ij}$ for $i<j$, where the $p_{ij}$ are smooth functions of vertex parameters. Under some assumptions on the second derivatives of these functions, the Fisher information matrix of the parameters shares properties similar to those in the maximum entropy models.

Finally, we shed some light on why the consistency and asymptotic normality of the MLE can be achieved in the maximum entropy models, even though the dimension of the parameter vector increases with the size of the network and the sample is only one realization of a random graph. First, an undirected random graph on $n$ vertices contains $\binom{n}{2}$ random variables, an order of magnitude more than the $n$ parameters. Second, the Fisher information of each parameter is a sum of variances of edge variables; under some conditions, it goes to infinity as $n$ increases. Third, the assumption of independent edges avoids the degeneracy problem, unlike Markov-dependent exponential random graphs (Frank and Strauss (1986)). The model degeneracy problems of exponential random graphs have received wide attention (e.g., Strauss, 1986; Snijders, 2002; Handcock, 2003; Hunter and Handcock, 2006; Chatterjee and Diaconis, 2011; Schweinberger, 2011). Moreover, considering the case in which the number of parameters is fixed, Shalizi and Rinaldo (2013) demonstrated that exponential random graph models are projective, in the sense that the same parameters can be used simultaneously for the full network and for any of its subnetworks, essentially only for those models with the assumption of dyadic independence, under which the consistency of the MLE is available.

## Appendix A

For fixed $r$, the central limit theorem for the vector $(d_1-E(d_1),\ldots,d_r-E(d_r))$ can be easily derived by noting that $d_1,\ldots,d_r$ are asymptotically independent. In view of the fact that $d$ is the sufficient statistic for $\theta$, $\hat\theta-\theta$ may be approximately represented as a function of $d-E(d)$. If this can be done, then the asymptotic distribution of the MLE may follow. In order to establish the relationship between $\hat\theta-\theta$ and $d-E(d)$, we approximate the inverse of a class of matrices. We say an $n\times n$ matrix $V=(v_{ij})$ belongs to the matrix class $\mathcal{L}_n(m,M)$ if $V$ is a symmetric nonnegative matrix satisfying

$$v_{ii} = \sum_{j=1;\,j\ne i}^{n} v_{ij}; \qquad M \ge v_{ij} = v_{ji} \ge m > 0, \quad i \ne j.$$

Yan and Xu (2013) proposed approximating $V^{-1}$ by a matrix $\bar S_n$ constructed from $\mathrm{diag}(1/v_{11},\ldots,1/v_{nn})$ together with a rank-one correction built from the $n$-dimensional vector whose entries are all $1$, and obtained an upper bound on the approximation error. Here we use the simpler matrix $S_n = \mathrm{diag}(1/v_{11},\ldots,1/v_{nn})$ to approximate $V_n^{-1}$. Let $\|A\| := \max_{i,j}|a_{ij}|$ for a general matrix $A=(a_{ij})$. It is clear that for two $n\times n$ matrices $A$ and $B$, $\|AB\| \le n\|A\|\,\|B\|$ and $\|A+B\| \le \|A\|+\|B\|$. By Proposition A1 in Yan and Xu (2013), we have:

###### Proposition 1.

If $V \in \mathcal{L}_n(m,M)$, then for $n \ge 3$, the following holds:

$$\|W_n\| := \|V_n^{-1}-S_n\| \le \|V_n^{-1}-\bar S_n\| + \|\bar S_n - S_n\| \le \frac{M(nM+(n-2)m)}{2m^3(n-2)(n-1)^2} + \frac{1}{2m(n-1)^2} + \frac{1}{mn(n-1)}.$$
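The order $n^{-2}$ of this bound can be observed empirically (a sketch assuming numpy; the continuous-weight Fisher matrix is used for concreteness, and the parameter range is an arbitrary choice):

```python
import numpy as np

# Check that S_n = diag(1/v_11, ..., 1/v_nn) approximates V_n^{-1} with a
# max-entry error decaying at roughly n^{-2}.
def fisher_cont(theta):
    v = 1.0 / (theta[:, None] + theta[None, :]) ** 2
    np.fill_diagonal(v, 0.0)
    np.fill_diagonal(v, v.sum(axis=1))
    return v

rng = np.random.default_rng(4)
errs = []
for n in (50, 100, 200):
    theta = rng.uniform(0.5, 2.0, size=n)
    V = fisher_cont(theta)
    S = np.diag(1.0 / np.diag(V))
    errs.append(np.max(np.abs(np.linalg.inv(V) - S)))

# Doubling n should cut the error by roughly a factor of four.
assert errs[1] / errs[0] < 0.5 and errs[2] / errs[1] < 0.5
```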

Note that $d_1,\ldots,d_n$ are sums of independent multinomial random variables. By the central limit theorem for the bounded case (Loève, 1977, p. 289), $v_{ii}^{-1/2}(d_i-E(d_i))$ is asymptotically standard normal if $v_{ii}$ diverges. Following Hillar and Wibisono (2013), we assume that $\|\theta\|_\infty$ is bounded by a constant in this appendix. For convenience, we assume that $\|\theta\|_\infty \le L/2$ with $L$ a fixed constant. Thus, $|\theta_i+\theta_j| \le L$. Since

$$e^{2k(\theta_i+\theta_j)} \le e^{(k+(k-1))(\theta_i+\theta_j)+L}, \qquad 1\le k\le q-1,$$

we have

$$\sum_{k=0}^{q-1} e^{2k(\theta_i+\theta_j)} \le e^L \sum_{0\le k\ne l\le q-1} e^{(k+l)(\theta_i+\theta_j)}.$$

Therefore,

$$\frac{\frac12\sum_{k\ne l} e^{(k+l)(\theta_i+\theta_j)}}{\big(\sum_{a=0}^{q-1} e^{a(\theta_i+\theta_j)}\big)^2} = \frac{\frac12\sum_{k\ne l} e^{(k+l)(\theta_i+\theta_j)}}{\sum_{k\ne l} e^{(k+l)(\theta_i+\theta_j)} + \sum_{k=0}^{q-1} e^{2k(\theta_i+\theta_j)}} \ge \frac{\sum_{k\ne l} e^{(k+l)(\theta_i+\theta_j)}}{2(1+e^L)\sum_{k\ne l} e^{(k+l)(\theta_i+\theta_j)}} \ge \frac{1}{2(1+e^L)}. \tag{3}$$

Recall the definition of $v_{ii}$ in Theorem 1. Inequality (3) shows that $v_{ii} \ge (n-1)/(2(1+e^L))$, since $(l-k)^2 \ge 1$ for $k \ne l$. If $L$ is a constant, then $v_{ii} \to \infty$ for all $i$ as $n \to \infty$. If $r$ is a fixed constant, one may replace the statistics $d_1,\ldots,d_r$ by mutually independent random variables when considering their asymptotic behavior, since the finitely many edges shared among the first $r$ vertices are asymptotically negligible. Therefore, we have the following proposition.

###### Proposition 2.

Assume that $\|\theta\|_\infty \le L/2$ with $L$ a constant. Then as $n\to\infty$:
(1) For any fixed $r$, the components of $(d_1-E(d_1),\ldots,d_r-E(d_r))$ are asymptotically independent and normally distributed with variances $v_{11},\ldots,v_{rr}$, respectively.
(2) More generally, $\sum_{i=1}^{n} c_i (d_i-E(d_i))/v_{ii}^{1/2}$ is asymptotically normally distributed with mean zero and variance $\sum_{i=1}^{\infty} c_i^2$ whenever $c_1,c_2,\ldots$ are fixed constants and the latter sum is finite.

Part (2) follows from part (1) and the fact that

$$\lim_{r\to\infty}\ \limsup_{n\to\infty}\ \mathrm{Var}\Big(\sum_{i=r+1}^{n} c_i\, \frac{d_i-E(d_i)}{\sqrt{v_{ii}}}\Big) = 0$$

by Theorem 4.2 of Billingsley (1968). To prove the above equation, it suffices to show that the eigenvalues of the covariance matrix of $(d_1-E(d_1))/v_{11}^{1/2},\ldots,(d_n-E(d_n))/v_{nn}^{1/2}$ are bounded by $2$ (for all $n$). This comes from the well-known Perron–Frobenius theory: if $A$ is a symmetric positive definite matrix whose diagonal elements all equal $1$ and whose off-diagonal elements are nonnegative with row sums less than $1$, then its largest eigenvalue is less than $2$. We will only use part (1) to prove Theorem 1.

Before proving Theorem 1, we present three lemmas. By direct calculations,

$$v_{ij} = \frac{\sum_{0\le k<l\le q-1} (l-k)^2\, e^{(k+l)(\theta_i+\theta_j)}}{\big(\sum_{a=0}^{q-1} e^{a(\theta_i+\theta_j)}\big)^2}, \qquad i \ne j,$$

and $v_{ii} = \sum_{j=1;\,j\ne i}^{n} v_{ij}$. On the other hand, it is easy to see that

$$\frac{\frac12\sum_{k\ne l} (k-l)^2\, e^{(k+l)(\theta_i+\theta_j)}}{\big(\sum_{a=0}^{q-1} e^{a(\theta_i+\theta_j)}\big)^2} \le \frac12 \max_{k\ne l}(k-l)^2 \le \frac{q^2}{2}. \tag{4}$$

In view of inequality (3), if $\|\theta\|_\infty \le L/2$ with $L$ a constant, then $V_n \in \mathcal{L}_n(m,M)$ with $m = 1/(2(1+e^L))$ and $M = q^2/2$ constants. Applying Proposition 1, we have

###### Lemma 1.

Assume that $\|\theta\|_\infty \le L/2$ with $L$ a fixed constant. If $n$ is large enough, then

$$\|V_n^{-1}-S_n\| \le c_1(n-1)^{-2}, \tag{5}$$

where $c_1$ is a constant depending only on $L$ and $q$.

###### Lemma 2.

Assume that $\|\theta\|_\infty \le L/2$ with $L$ a fixed constant. Let $U_n = W_n V_n W_n^\top$, where $W_n = V_n^{-1}-S_n$. Then

$$\|U_n\| \le \|V_n^{-1}-S_n\| + c_2(n-1)^{-2}, \tag{6}$$

where $c_2$ is a constant depending only on $L$ and $q$.

###### Proof.

Note that

$$U_n = W_n V_n W_n^\top = (V_n^{-1}-S_n) - S_n(I_n - V_n S_n),$$

and

$$\{S_n(I_n - V_n S_n)\}_{ij} = (\delta_{ij}-1)\,\frac{v_{ij}}{v_{ii}v_{jj}},$$
where $\delta_{ij}$ denotes the Kronecker delta.

By (3) and (4),

$$\big|\{S_n(I_n - V_n S_n)\}_{ij}\big| \le c_2(n-1)^{-2},$$

where $c_2$ is a constant depending only on $L$ and $q$. Thus,

$$\|U_n\| \le \|V_n^{-1}-S_n\| + \|S_n(I_n - V_n S_n)\| \le \|V_n^{-1}-S_n\| + c_2(n-1)^{-2}.$$

In order to prove the lemma below, we need one theorem due to Hillar and Wibisono (2013).

###### Theorem 4.

Assume that $\|\theta\|_\infty \le L/2$ with $L$ a fixed constant. Then for sufficiently large $n$, with probability approaching one, the MLE $\hat\theta$ exists and satisfies

$$\max_i |\hat\theta_i-\theta_i| \le c_3\sqrt{\frac{\log n}{n}},$$

where $c_3$ is a constant that depends only on $L$ and $q$.

###### Lemma 3.

If $\|\theta\|_\infty \le L/2$ with $L$ a fixed constant, then for $i=1,\ldots,r$ with $r$ a fixed constant,

$$\hat\theta_i-\theta_i = \big[V_n^{-1}\{d-E(d)\}\big]_i + o_p(n^{-1/2}).$$
###### Proof.

Let $A_n$ be the event that the MLE exists and $B_n$ the event that $\max_i|\hat\theta_i-\theta_i| \le c_3\sqrt{\log n/n}$. The derivations in what follows are on the event $A_n \cap B_n$. Let

$$\mu(t) = \frac{\sum_{a=0}^{q-1} a\, e^{at}}{\sum_{k=0}^{q-1} e^{kt}}.$$

It is easy to verify that

$$\mu'(t) = \frac{\sum_{0\le k<l\le q-1} (l-k)^2\, e^{(k+l)t}}{\big(\sum_{a=0}^{q-1} e^{at}\big)^2}$$

and

$$\mu''(t) = \frac{\frac12\sum_{k\ne l,\,a} (k-l)^2(k+l-2a)\, e^{(k+l+a)t}}{\big(\sum_{a=0}^{q-1} e^{at}\big)^3},$$

such that

$$|\mu''(t)| \le (q-1)^3. \tag{7}$$

Applying a Taylor expansion to $\mu(\hat\theta_i+\hat\theta_j)$ at the point $\theta_i+\theta_j$, for $i=1,\ldots,n$, we have

$$d_i - E(d_i) = \sum_{j\ne i}\big(\mu(\hat\theta_i+\hat\theta_j)-\mu(\theta_i+\theta_j)\big) = \sum_{j\ne i}\mu'(\theta_i+\theta_j)\big\{(\hat\theta_i-\theta_i)+(\hat\theta_j-\theta_j)\big\} + h_i,$$

where $h_i = \frac12\sum_{j\ne i}\mu''(t_{ij})\{(\hat\theta_i-\theta_i)+(\hat\theta_j-\theta_j)\}^2$ and $t_{ij}$ lies between $\theta_i+\theta_j$ and $\hat\theta_i+\hat\theta_j$, $i=1,\ldots,n$. Writing the above expressions in matrix form yields

$$d - E(d) = V_n(\hat\theta-\theta) + h,$$

or equivalently,

$$\hat\theta-\theta = V_n^{-1}\big(d-E(d)\big) + V_n^{-1}h, \tag{8}$$

where $h = (h_1,\ldots,h_n)^\top$. By (7) and the bound on $\max_i|\hat\theta_i-\theta_i|$ defining $B_n$, we have $\max_i|h_i| = O(\log n)$ on $A_n \cap B_n$. Therefore, by Lemma 1,

$$\big|(V_n^{-1}h)_i\big| \le \big|(S_n h)_i\big| + \big|(W_n h)_i\big| \le \frac{\max_i|h_i|}{v_{ii}} + \|W_n\|\sum_i|h_i| = O\Big(\frac{\log n}{n}\Big).$$

By Theorem 4, these events hold with probability approaching one as $n\to\infty$. It follows that $|(V_n^{-1}h)_i| = o_p(n^{-1/2})$ for $i=1,\ldots,r$ with $r$ a fixed constant. Consequently, we have for