# Learning latent structure of large random graphs

Roland Diel¹    Sylvain Le Corff²    Matthieu Lerasle²

¹ Laboratoire J.A. Dieudonné, UMR CNRS-UNS 6621, Université de Nice Sophia-Antipolis, 06108 Nice Cedex 2.
² Laboratoire de mathématiques d'Orsay, Univ. Paris-Sud, CNRS, Université Paris-Saclay, 91405 Orsay.
###### Abstract

In this paper, we estimate the distribution of hidden node weights in large random graphs from the observation of very few edge weights. In this very sparse setting, the first non-asymptotic risk bounds for maximum likelihood estimators (MLE) are established. The proof relies on the construction of a graphical model encoding conditional dependencies that is extremely efficient for the study of $n$-regular graphs obtained using a round-robin scheduling. This graphical model allows us to prove geometric loss of memory properties and to deduce the asymptotic behavior of the likelihood function. Following a classical construction in learning theory, the asymptotic likelihood is used to define a measure of performance for the MLE. Risk bounds for the MLE are finally obtained by subgaussian deviation results derived from concentration inequalities for Markov chains applied to our graphical model.

## 1 Introduction

Inference in large random graphs is an important topic due to its applications in many fields, such as data science, sociology or neurobiology. This paper focuses on large random graphs whose heterogeneity is described by latent data models. The nodes are associated with latent random weights, independent and with unknown distribution. The only available information is given by random weights associated with a few edges in the graph, which are independent conditionally on the node weights. The objective is to estimate the unknown distribution of the node weights from these observations. This latent data structure is appealing as it may be used to describe graphs in a wide range of applications. In sports tournaments, nodes represent contestants in a championship and each node weight is the “intrinsic value” of the corresponding player. An edge is drawn between players when they face each other, and the result of a contest is the observed edge weight. The problem is to recover from a few games the distribution of the intrinsic values of the players, for example to make early predictions on the outcome of the championship. In social networks, nodes are members and their weights represent the “popularity” of each member. An edge is drawn between members if a “suggestion of friendship” has been made to one of them. The observed edge weight indicates whether or not these members become connected. The problem here is to estimate the popularity density in a large population where only a few suggestions of friendship can be made compared to the global size of the network. In neurobiology, random graphs may be used to model neural functional connectivity inside the brain. In this case, nodes are neurons and their weights represent their efficiency in diffusing neural information. An edge between neurons is drawn if the activity of these neurons is observed simultaneously. The weight of this edge is a score representing the influence that these neurons exert on each other. The problem is therefore to estimate the functional connectivity density inside the brain from these scores.

The problem studied in this paper has a long history, going back at least to [31], which considered the problem of paired comparisons to evaluate the performance of medicines. In [31] and later in [2], the problem was to recover the weights of a finite number of nodes when the number of measurements on every pair grows to infinity. Further extensions of the so-called Bradley-Terry model have since been studied; see for example [7] for a review. More recently, [22] considered the problem of estimating node weights in Bradley-Terry models based on one measurement per pair of nodes when the number of nodes grows to infinity. This framework led to several developments in computational statistics for the Bradley-Terry model; see [14] and [4] for various extensions of this original model. A related problem was considered in [5], where an edge is inserted between each pair of nodes with a probability depending on the node weights. Each node therefore has a random degree and the observed degrees are used to infer the node weights. When the graph is fully observed, [5] proved that, with high probability, there exists a unique maximum likelihood estimator of the node weights whose estimation error, measured in supremum norm, is suitably small.

This paper strongly departs from these settings where the whole graph is observed, even from [30] where some edges are missing. We consider a very sparse alternative where only a few edges per node are observed. A reason why such a sparse setting has never been considered is probably [31], which proved that the estimation of the weights themselves is impossible in the Bradley-Terry model in this situation. To overcome this issue, we consider the problem of estimating the distribution of the weights rather than the weights themselves. There are several motivations for this new approach. The Bradley-Terry model in “random environment” was applied with success to predict the outcome of a championship by estimating the probability distribution of the teams' weights (strengths), which were assumed to be uniformly distributed; see for example [23] and references therein. Moreover, [6] recently showed that the node with maximal weight can be recovered if the tail of the node weight distribution is sufficiently convex. More generally, the idea of using a Bayesian estimator when a frequentist approach is not available is rather standard. The performance of this estimator depends heavily on the prior distribution of the parameters, and providing a reasonable prior may have a great impact. The study of Bayesian estimators with an estimated prior is known as empirical Bayes theory [21] and is currently a subject of intense research; see for example [13] for a recent overview. The problem presented in this paper can be understood as finding a statistically efficient estimator of the prior in order to design an empirical Bayes estimator for the node weights. The use of latent variables is also at the heart of mixed effects models, widely used in biostatistics; see [15].

This paper establishes the first non-asymptotic risk bounds for non-parametric maximum likelihood estimators (MLE) of the distribution of node weights. The asymptotic properties of the MLE rely heavily on a loss of memory property of the observed random graph, which can be analyzed using a graphical model describing the conditional dependencies between nodes and edges. This graphical model provides a natural parallel with hidden Markov models [3], which is used to study the asymptotic behavior of the likelihood, following [11] in particular. The limit likelihood defines a natural notion of risk to measure the performance of the MLE. This performance is controlled for finite values of the number of nodes using concentration inequalities for Markov chains [10]. The excess risk scales as the entropy of the underlying statistical model (in the sense of Dudley) normalized by a term of order $\sqrt{N}$ when $n$ is fixed and $N\to\infty$. From a learning perspective, Dudley's entropy bound is known to be sub-optimal in general; it can be replaced by a majorizing measure bound [24] if needed, since the bound proposed in this paper is derived from a subgaussian concentration inequality for the underlying process, see Eq. (27).

More generally, we believe that the methodology introduced to prove our results opens exciting research perspectives in various fields. For example, identifiability of non-parametric hidden Markov models with finite state spaces was established very recently, along with the first convergence properties of estimators of the unknown distributions; see [8] for a penalized least-squares estimator of the emission densities, [9, 28, 29] for consistent estimation of the posterior distributions of the states and posterior concentration rates for the parameters, or [16] for order estimation. However, very few theoretical results are available for the non-parametric estimation of general state space hidden Markov models. The arguments leading to our risk bound may probably be extended to this framework. In computational statistics, Bayesian estimators of node weights have been studied in Bradley-Terry models and other extensions [4]. Designing new algorithms to compute the MLE of the prior would therefore be of great interest to derive empirical Bayes estimators of these weights.

The paper is organized as follows. Section 2 details the model and the maximum likelihood estimator of the unknown weight distribution. Section 3 presents preliminary results underlying our analysis: a graphical model encoding conditional dependencies in the original graph is built, and the round-robin algorithm, a widespread method in sports tournaments that builds sparse graphs for which our graphical model is stationary, is also presented. Our main results are given in Section 4: convergence of the likelihood is established when the number of nodes grows to infinity and risk bounds for the MLE are provided. Sections 5 to 7 are devoted to the proofs of these results. Section 5 proves the fundamental properties of the graphical model associated with round-robin graphs. Section 6 establishes the probabilistic tools required for the main results; these tools might be of independent interest and are presented as independent results holding for stationary processes with conditional dependencies encoded in the graphical model. Proofs of the main results are finally gathered in Section 7.

## 2 Setting

### 2.1 Random graphs with latent variables

Let $n$ and $N$ denote two positive integers and let $G_{n,N}=([N],E_{n,N})$ be a connected $n$-regular graph with $N$ nodes. Let $V=(V_i)_{1\le i\le N}$ denote independent and identically distributed (i.i.d.) random variables taking values in a measurable set $\mathcal{V}$ with common (unknown) distribution $\pi^\star$. For all $\{i,j\}\in E_{n,N}$, the observation $X_{i,j}$ takes values in a discrete set $\mathcal{X}$ and, conditionally on $V$, the random variables $(X_{i,j})_{\{i,j\}\in E_{n,N}}$ are independent, with the conditional distribution of $X_{i,j}$ given by:

$$\mathbb{P}\left(X_{i,j}=x\,\middle|\,V\right)=k(x,V_i,V_j).$$

This framework encompasses the following models.

###### Example 1 (Bradley-Terry model [2]).

In this example, $\mathcal{X}=\{0,1\}$, $\mathcal{V}\subset(0,+\infty)$ and, for all $x\in\{0,1\}$,

$$k(x,V_i,V_j)=\left(\frac{V_i}{V_i+V_j}\right)^{x}\left(\frac{V_j}{V_i+V_j}\right)^{1-x}.$$
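As a quick illustration, the Bradley-Terry kernel above can be written in two lines; `bt_kernel` and `sample_edge` are hypothetical names used here only to sketch the model, not part of the paper:

```python
import random

def bt_kernel(x, vi, vj):
    """Bradley-Terry kernel: P(X_{i,j} = x | V_i = vi, V_j = vj),
    with x = 1 if player i beats player j and x = 0 otherwise."""
    p = vi / (vi + vj)
    return p if x == 1 else 1.0 - p

def sample_edge(vi, vj, rng=random):
    """Draw one game outcome X_{i,j} given the latent strengths."""
    return 1 if rng.random() < vi / (vi + vj) else 0
```

For instance, a player of strength 3 beats a player of strength 1 with probability 3/4 under this kernel.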
###### Example 2 (Extensions of Bradley-Terry model).

In [4], the authors proposed several algorithms to perform Bayesian inference for generalized Bradley-Terry models which fit our framework.

1. The Bradley-Terry model with home advantage introduces an additional parameter $\theta>0$ to measure the home-field advantage. In this case, $\mathcal{X}=\{0,1\}$ and, if player $i$ plays at home, for all $x\in\{0,1\}$,

$$k(x,V_i,V_j)=\left(\frac{\theta V_i}{\theta V_i+V_j}\right)^{x}\left(\frac{V_j}{\theta V_i+V_j}\right)^{1-x},$$

which is the standard form of this extension.

2. The Bradley-Terry model with ties [20] introduces an additional parameter $\theta>1$, and

$$k(1,V_i,V_j)=\frac{V_i}{V_i+\theta V_j}\quad\text{and}\quad k(0,V_i,V_j)=\frac{(\theta^2-1)V_iV_j}{(\theta V_i+V_j)(V_i+\theta V_j)}.$$
###### Example 3 (Random graphs with a given degree sequence).

[5] considers random graphs such that, for all $\{i,j\}$, an edge is inserted between nodes $i$ and $j$ with probability $V_iV_j/(1+V_iV_j)$, where $(V_i)_{1\le i\le N}$ are parameters to be estimated using the degrees of the vertices in the observed graph. Such random graphs fit our framework with $\mathcal{X}=\{0,1\}$ and, for all $x\in\{0,1\}$,

$$k(x,V_i,V_j)=\left(\frac{V_iV_j}{1+V_iV_j}\right)^{x}\left(\frac{1}{1+V_iV_j}\right)^{1-x}.$$
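The sampling mechanism of this example is easy to sketch. The function below (`sample_graph`, a name chosen here for illustration) draws each edge independently with the probability above:

```python
import random

def sample_graph(weights, rng=random):
    """Insert an edge between nodes i and j with probability
    v_i v_j / (1 + v_i v_j), independently over pairs (Example 3)."""
    n = len(weights)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p = weights[i] * weights[j] / (1.0 + weights[i] * weights[j])
            if rng.random() < p:
                edges.append((i, j))
    return edges
```

Note that, contrary to the round-robin graphs studied below, the degrees of the resulting graph are random, which is precisely the information exploited by [5].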

### 2.2 Maximum likelihood estimator

The edge weights $(X_{i,j})_{\{i,j\}\in E_{n,N}}$ are observed and the objective is to infer the distribution $\pi^\star$ of the hidden variables $(V_i)_{1\le i\le N}$ from these observations. Let $\Pi$ be a set of probability measures on $\mathcal{V}$. For all $\pi\in\Pi$, the joint distribution of $(X^{n,N},V)$ is given by

$$P^{\pi}_{n,N}(x^{n,N},A)=\int \mathbf{1}_A(v)\prod_{(i,j)\in E_{n,N}}k\big(x^{n,N}_{i,j},v_i,v_j\big)\,\pi^{\otimes N}(\mathrm{d}v).\tag{1}$$

The log-likelihood is then given, for all $\pi\in\Pi$, by

$$\ell_{n,N}(\pi)=\log P^{\pi}_{n,N}\big(X^{n,N}\big),\quad\text{where}\quad P^{\pi}_{n,N}\big(X^{n,N}\big)=P^{\pi}_{n,N}\big(X^{n,N},\mathcal{V}^N\big).$$

In this paper, $\pi^\star$ is estimated by the standard maximum likelihood estimator, defined as any maximizer of the log-likelihood:

$$\hat\pi_{n,N}\in\underset{\pi\in\Pi}{\operatorname{argmax}}\,\{\ell_{n,N}(\pi)\}.$$
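For intuition, the estimator can be computed exactly on a toy instance when $\Pi$ is a finite family of distributions supported on a small grid: the integral in (1) then reduces to a finite sum over latent-weight assignments. The sketch below makes this explicit (all names are hypothetical, and the exhaustive sum is only tractable for a handful of nodes):

```python
import math
from itertools import product

def log_likelihood(pi, support, edges, x, k):
    """Exact log-likelihood for a tiny graph: the integral in (1) becomes a
    finite sum over all assignments of latent weights on `support`.
    pi: probabilities on `support`; edges: list of pairs (i, j);
    x: dict mapping each pair to its observed edge weight; k: kernel."""
    n_nodes = 1 + max(max(e) for e in edges)
    total = 0.0
    for assign in product(range(len(support)), repeat=n_nodes):
        w = 1.0
        for idx in assign:           # prior weight pi^{(x)N}
            w *= pi[idx]
        for (i, j) in edges:         # conditional likelihood of the edges
            w *= k(x[(i, j)], support[assign[i]], support[assign[j]])
        total += w
    return math.log(total)

def mle(candidates, support, edges, x, k):
    """Maximum likelihood estimator over a finite family of priors."""
    return max(candidates,
               key=lambda pi: log_likelihood(pi, support, edges, x, k))
```

This brute-force sum has cost exponential in the number of nodes; the point of the graphical model built in Section 3 is precisely to organize this computation along the structure of the round-robin graph.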

## 3 Round-robin graphical model

Section 3.1 details a graphical model encoding the conditional dependencies between the random variables $(V_i)_{1\le i\le N}$ and $(X_{i,j})_{\{i,j\}\in E_{n,N}}$. This graphical model is studied in the particular case of round-robin graphs in Section 3.2.

### 3.1 Graphical model

Let $d^{n,N}_0$ denote the graph distance in $G_{n,N}$, that is, $d^{n,N}_0(i,j)$ is the minimal length of a path between nodes $i$ and $j$. Write $[N]=\bigcup_q V^{n,N}_q$, where, for any $q$, $V^{n,N}_q$ is a set of nodes at a common distance from node $1$. Let $q_n^N+1$ denote the maximal distance between node $1$ and the other nodes:

$$q_n^N+1=\max_{1\le i\le N}d^{n,N}_0(1,i).$$
1. For all $q$, let

$$X^{n,N}_{q\leftrightarrow q}=\left\{X_{i,j}:\{i,j\}\in E_{n,N},\,i\in V^{n,N}_q,\,j\in V^{n,N}_q\right\}.$$

The set $X^{n,N}_{q\leftrightarrow q}$ gathers all observations $X_{i,j}$ such that $i$ and $j$ both belong to $V^{n,N}_q$.

2. For all $q$, let

$$X^{n,N}_{q\leftrightarrow q+1}=\left\{X_{i,j}:\{i,j\}\in E_{n,N},\,i\in V^{n,N}_q,\,j\in V^{n,N}_{q+1}\right\}.$$

Likewise, the set $X^{n,N}_{q\leftrightarrow q+1}$ gathers all observations $X_{i,j}$ such that $i\in V^{n,N}_q$ and $j\in V^{n,N}_{q+1}$.

Finally, for any $q$, let

$$X^{n,N}_q=X^{n,N}_{q\leftrightarrow q+1}\cup X^{n,N}_{q+1\leftrightarrow q+1}.$$

By (1), the joint distribution of $(V_i)_{1\le i\le N}$ and $(X_{i,j})_{\{i,j\}\in E_{n,N}}$ may be factorized using the conditional independence between some subsets of these variables.

These conditional dependencies are represented in the graphical model of Figure 1, where graph separations represent conditional independences: any path between a given group of variables and the rest of the graph goes through its neighboring groups, which means that each group is independent of all other variables given its neighbors.

### 3.2 Round-Robin Scheduling

There is a large variety of $n$-regular graphs (even up to permutations of the indices); the results of this paper are obtained for the graph built using the round-robin scheduling. At time $1$, this algorithm pairs nodes according to Figure 1(a). At time $2$, one node is fixed and all the others are rotated clockwise as described in Figure 1(b): node $1$ does not move, while every other node takes the place of its clockwise neighbor. Then, each node is paired with the new node it faces, as in Figure 1(c). At each subsequent time step, each node moves once according to the round-robin step detailed in Figure 1(b) and is paired with the new node it faces. The round-robin graph studied in detail in this paper, denoted by $G_{n,N}$, contains all pairs collected during the first $n$ rotations of the round-robin algorithm.
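The rotation described above is the classical circle method. A minimal sketch (assuming an even number of nodes, with a hypothetical function name; the exact node labelling in the paper's figures may differ) is:

```python
def round_robin_pairs(N, n):
    """Pairs collected during the first n rotations of the round-robin
    (circle) scheduling of nodes 1..N; N is assumed even."""
    assert N % 2 == 0
    circle = list(range(1, N + 1))
    edges = set()
    for _ in range(n):
        # pair the node in position k with the node facing it
        for k in range(N // 2):
            i, j = circle[k], circle[N - 1 - k]
            edges.add((min(i, j), max(i, j)))
        # rotate every node one step clockwise, keeping node 1 fixed
        circle = [circle[0]] + [circle[-1]] + circle[1:-1]
    return edges
```

After $n$ rotations each node has been paired with $n$ distinct partners, so the resulting graph is $n$-regular; with $N-1$ rotations one recovers the full round-robin tournament on $N$ players.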

Lemma 1 gathers the results on the graphical model of Figure 1, in the round-robin case, that are central in our analysis.

###### Lemma 1.

Let $G_{n,N}$ be the round-robin graph, with $N$ large enough. Then, $q_n^N$ is the quotient of the Euclidean division of $N$ by $2(n-1)$, that is, $N=2(n-1)q_n^N+r_n^N$ where $0\le r_n^N<2(n-1)$. Moreover, $(V^{n,N}_q,X^{n,N}_q)_q$ is a stationary Markov chain such that, for all $q$,

$$\left|V^{n,N}_q\right|=2(n-1),\qquad\left|X^{n,N}_q\right|=n(n-1).$$

Lemma 1 is proved in Section 5.

## 4 Main results

Section 4.1 computes the limit likelihood function and defines a natural risk function to evaluate the performance of the MLE. Risk bounds for the MLE are obtained in Section 4.2 using non-asymptotic concentration inequalities for Markov chains.

### 4.1 Convergence of the likelihood

The problem being reduced to the analysis of the graphical model of Figure 1, convergence results follow from geometrically decaying mixing rates for the conditional laws of the blocks $X^{n,N}_q$ given the other observations. These rates derive from the following assumption.

• (H4.1) There exists $\varepsilon>0$ such that, for all $x\in\mathcal{X}$ and all $(v,v')\in\mathcal{V}^2$, $k(x,v,v')\ge\varepsilon$.

When $G_{n,N}$ is the round-robin graph, by Lemma 1, the joint sequence $(V^{n,N}_q,X^{n,N}_q)_q$ is a stationary Markov chain which may be extended to a stationary process indexed by $\mathbb{Z}$ with the same transition kernel. This extension is denoted by $X^n=(X^n_q)_{q\in\mathbb{Z}}$.

Define also the shift operator $\vartheta$ by $(\vartheta X^n)_q=X^n_{q+1}$ for all $q\in\mathbb{Z}$.

###### Theorem 2.

Assume H4.1 holds and $G_{n,N}$ is the round-robin graph. There exists a function $\ell^{\pi}_n$ such that, for all $q$,

$$\sup_{\pi\in\Pi}\left|\log P^{\pi}_{n,N}\left(X^{n,N}_q\,\middle|\,X^{n,N}_{q+1:q_n^N-1}\right)-\ell^{\pi}_n\big(\vartheta^{q}X^{n}\big)\right|\underset{N\to\infty}{\longrightarrow}0,\quad\mathbb{P}^{\pi^\star}\text{-a.s.}\tag{2}$$

Moreover, for all $\pi\in\Pi$, $\mathbb{P}^{\pi^\star}$-a.s. and in $\mathrm{L}^1$,

$$\frac{1}{q_n^N}\log P^{\pi}_{n,N}\big(X^{n,N}\big)\underset{N\to\infty}{\longrightarrow}L^{\pi^\star}_n(\pi)=\mathbb{E}^{\pi^\star}\left[\ell^{\pi}_n(X^{n})\right].\tag{3}$$

Theorem 2 establishes convergence to the limit likelihood when the number of nodes $N$ goes to infinity while $n$ remains fixed. The rate of almost sure convergence is proportional to $q_n^N$ by Lemma 1. Eq. (3) is the key to understanding the definition of the risk function used in the next section. We proceed as in Vapnik's learning theory [27, 26], described now to establish a parallel with our framework. Let $Y_1,\dots,Y_N$ denote i.i.d. observations, let $\mathcal{F}$ denote a set of parameters, and let $\ell$ denote a loss function. The empirical risk minimizer is defined in this context by

$$\hat f^{\,\mathrm{ERM}}_N=\underset{f\in\mathcal{F}}{\operatorname{argmin}}\sum_{i=1}^{N}\ell(f,Y_i).$$

If $\ell(f,Y_1)$ is integrable for all $f\in\mathcal{F}$, the risk of any $f\in\mathcal{F}$ is measured by the excess risk [19]

$$R(f)=\mathbb{E}[\ell(f,Y)]-\mathbb{E}[\ell(f^{\ast},Y)],$$

where $Y$ is a copy of $Y_1$, independent of $(Y_i)_{1\le i\le N}$, and $f^{\ast}$ is a minimizer of $f\mapsto\mathbb{E}[\ell(f,Y)]$ over $\mathcal{F}$. Note that, for all $f\in\mathcal{F}$, the normalized empirical criterion satisfies, almost surely,

$$\frac{1}{N}\sum_{i=1}^{N}\ell(f,Y_i)\underset{N\to\infty}{\longrightarrow}\mathbb{E}[\ell(f,Y_1)].$$

Therefore, the excess risk is the difference between the asymptotic normalized empirical criterion evaluated at $f$ and at its minimizer. In this paper, the MLE maximizes the log-likelihood $\ell_{n,N}(\pi)$, which, properly normalized, converges to $L^{\pi^\star}_n(\pi)$. This suggests defining the risk function

$$R^{\pi^\star}_n(\pi)=L^{\pi^\star}_n(\pi^\star)-L^{\pi^\star}_n(\pi),\qquad\forall\pi\in\Pi.\tag{4}$$

By Proposition 13, $\pi^\star$ is actually a maximizer of $L^{\pi^\star}_n$ over $\Pi$. Therefore, $R^{\pi^\star}_n$ is the excess risk associated with the likelihood function.

### 4.2 Risk bounds for the MLE

The following theorem provides non-asymptotic deviation bounds for the excess risk of the MLE. This is the main result of this paper.

###### Theorem 3.

Assume H4.1 holds and $G_{n,N}$ is the round-robin graph. For any probability measures $\pi$ and $\pi'$, let

$$d(\pi,\pi')=\begin{cases}\|\pi-\pi'\|_{\mathrm{tv}}\log\left(\dfrac{1}{\|\pi-\pi'\|_{\mathrm{tv}}}\right)&\text{if }\|\pi-\pi'\|_{\mathrm{tv}}\le e^{-1},\\[2mm] \|\pi-\pi'\|_{\mathrm{tv}}&\text{if }\|\pi-\pi'\|_{\mathrm{tv}}\ge e^{-1}.\end{cases}\tag{5}$$

Assume that $\Pi$ is a compact set for the topology induced by $d$ and let $\mathcal{N}(\Pi,d,\epsilon)$ be the minimal number of balls of $d$-radius $\epsilon$ necessary to cover $\Pi$. Then, there exists a constant $c_n>0$ such that, for any $t>0$,

$$P^{\pi^\star}_{n,N}\left(R^{\pi^\star}_n(\hat\pi_{n,N})>\frac{c_n\,\varepsilon^{-6}n^2}{\sqrt{N}}\left[\int_0^{+\infty}\sqrt{\log\mathcal{N}\big(\Pi\cup\{\pi^\star\},d,\epsilon\big)}\,\mathrm{d}\epsilon+t\right]\right)\le e^{-t^2}.$$

Theorem 3 is proved in Section 7.3. It provides the first non-asymptotic risk bounds for any estimator in a very sparse setting where the number $n$ of edges observed for each node can be very small compared to the number $N$ of nodes. It proves that the problem studied in this paper is fundamentally different from the usual problem of node weight estimation, at least in Bradley-Terry models. While estimating node weights is only possible when $n$ is large [31, 22, 30], some information on their distribution may be recovered for much smaller $n$. This difference is extremely relevant in sports tournaments, for example: it means that one can start to make predictions on the final outcome of a championship after only a few weeks, while predictions on the outcome of each game can only be made when half the year has passed.

The distance $d$ defined in (5) and used to measure the entropy of $\Pi$ is not intuitive. However, it is easy to compare it with the total variation distance. It follows that, for any class with polynomial entropy for the total variation distance, that is, such that $\mathcal{N}(\Pi,\|\cdot\|_{\mathrm{tv}},\epsilon)\le(\alpha/\epsilon)^D$ for small $\epsilon$, Dudley's entropy integral for the distance $d$ satisfies

$$\int_0^{+\infty}\sqrt{\log\mathcal{N}\big(\Pi\cup\{\pi^\star\},d,\epsilon\big)}\,\mathrm{d}\epsilon\lesssim\alpha\sqrt{D}.$$

Therefore, “slow rates” of convergence are obtained for the MLE. The polynomial growth of the entropy is extremely standard; see [25, pp. 271–274] for various examples where this assumption is satisfied and our result applies. On the other hand, “fast” rates of convergence remain an open question. In particular, the margin condition [18] required to prove such rates would hold if the total variation distance between distributions of the node weights were bounded from above by the excess risk derived from the asymptotic likelihood.
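For discrete distributions, the distance in (5) is straightforward to evaluate. The sketch below (hypothetical names) computes the total variation distance and applies the logarithmic correction:

```python
import math

def tv_distance(p, q):
    """Total variation distance between two discrete distributions given as
    probability vectors on a common support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def d(p, q):
    """Distance of Eq. (5): the tv distance, inflated by a logarithmic
    factor when the two measures are close (tv <= 1/e)."""
    t = tv_distance(p, q)
    if 0.0 < t <= math.exp(-1):
        return t * math.log(1.0 / t)
    return t
```

The two branches agree at $\|\pi-\pi'\|_{\mathrm{tv}}=e^{-1}$, so $d$ is continuous; the logarithmic factor only matters for nearby measures, which is where covering numbers for $d$ exceed those for the total variation distance.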

The remainder of the paper is devoted to the proofs of the main results. Section 5 proves Lemma 1, describing precisely the structure of the graphical model given in Figure 1 in the case of a round-robin scheduling. Then, Section 6 establishes central tools in the analysis of the likelihood of stationary processes whose conditional dependencies are encoded in the graphical model of Figure 1. These results, which might be of independent interest, are stated as independent lemmas. These tools are finally used in Section 7 to prove the main theorems.

## 5 Round-robin scheduling

This section details the sets $V^{n,N}_q$ and $X^{n,N}_q$ when $G_{n,N}$ is the round-robin graph (cf. Figures 1(a)-1(c)). In the following, notations for nodes and their weights are identified, i.e. node $i$ is identified with $V_i$ for all $i\in[N]$. Lemma 1 follows directly from Lemmas 4 and 5 below. To prove these lemmas, consider the following notations.

$$E=\left\{4x-1,\,4x:\,x\in[\lfloor N/4\rfloor]\right\}\quad\text{and}\quad O=[N]\setminus E.$$

The notation $E$ (resp. $O$) comes from the fact that $E$ (resp. $O$) contains all nodes paired with node $1$ after an even (resp. odd) number of rotations of the round-robin scheduling. For all $q$, let

$$V^{n,N}_{q,e}=V^{n,N}_q\cap E\quad\text{and}\quad V^{n,N}_{q,o}=V^{n,N}_q\cap O.$$
###### Lemma 4.

Let $G_{n,N}$ be the round-robin graph and write $N=2(n-1)q_n^N+r_n^N$, where $0\le r_n^N<2(n-1)$. Then,

$$V^{n,N}_1=\{V_{2x}:x=1,\dots,n\},\tag{6}$$

and, for any $2\le q\le q_n^N$,

$$V^{n,N}_q=\left\{V_{2x+1}:x\in[(q-2)(n-1)+1,(q-1)(n-1)]\right\}\cup\left\{V_{2x}:x\in[2+(q-1)(n-1),1+q(n-1)]\right\}.\tag{7}$$

Furthermore,

$$V^{n,N}_{q_n^N+1}=\left\{V_{2x+1}:x\in[(q_n^N-1)(n-1)+1,q_n^N(n-1)+r_n^N]\right\}\cup\left\{V_{2x}:x\in[2+q_n^N(n-1),1+r_n^N+q_n^N(n-1)]\right\}.\tag{8}$$

Therefore, $\left|V^{n,N}_1\right|=n$ and, for all $2\le q\le q_n^N$, $\left|V^{n,N}_q\right|=2(n-1)$.

###### Proof.

To ease the reading of this proof, one can check its arguments on Figures 3 and 4.

We proceed by induction on $q$. The definition of $V^{n,N}_1$ given by (6) is straightforward. Then, $V^{n,N}_2$ contains:

1. all paired with some on the first rotation of the algorithm besides that does not belong to . These are all ;

2. all paired with and that are not in . After rotations of the round-robin algorithm, all paired with are and those with are .

Therefore,

$$V^{n,N}_2\supset\{V_{2x+1}:x=1,\dots,n-1\}\cup\{V_{2x}:x=n+1,\dots,2n-1\}.$$

On the other hand, by induction, for all $i$,

$$\text{if }i\text{ is odd, it is paired with }\{V_{i+4x+1}:x=0,\dots,n-1\};\quad\text{if }i\text{ is even, it is paired with }\{V_{i-4x-1}:x=0,\dots,n-1\}.\tag{9}$$

This implies that no other even or odd number belongs to $V^{n,N}_2$, which yields:

$$V^{n,N}_2=\{V_{2x+1}:x=1,\dots,n-1\}\cup\{V_{2x}:x=n+1,\dots,2n-1\}.$$

(7) is obtained by induction using the same arguments, and (8) is a direct consequence of the round-robin algorithm. The last claim follows by noting that, for all $q$,

$$\left|V^{n,N}_{q,e}\right|=\left|V^{n,N}_{q,o}\right|=n-1.$$

Indeed, one of the following cases holds.

1. $n-1=2p$ for some integer $p$. In this case,

$$\left|\{j:V_j\in V^{n,N}_{q,e},\,j\in2\mathbb{Z}\}\right|=\left|\{i:V_i\in V^{n,N}_{q,e},\,i\in2\mathbb{Z}+1\}\right|=p.$$
2. $n-1=2p+1$ for some integer $p$. In this case, either

$$\left|\{j:V_j\in V^{n,N}_{q,e},\,j\in2\mathbb{Z}\}\right|=p\quad\text{and}\quad\left|\{i:V_i\in V^{n,N}_{q,e},\,i\in2\mathbb{Z}+1\}\right|=p+1,$$

or

$$\left|\{j:V_j\in V^{n,N}_{q,e},\,j\in2\mathbb{Z}\}\right|=p+1\quad\text{and}\quad\left|\{i:V_i\in V^{n,N}_{q,e},\,i\in2\mathbb{Z}+1\}\right|=p.$$

###### Lemma 5.

Let $G_{n,N}$ be the round-robin graph. Then, for all $q$,

$$\left|X^{n,N}_q\right|=n(n-1).$$
###### Proof.

The proof essentially consists in building the graphical model of Figure 5 from the one displayed in Figure 1.

Edges involving the first node are decomposed as:

Edges between nodes of $V^{n,N}_1$ that are both different from node $1$ are described as follows.

1. Edges between two nodes in $E$, denoted by:

$$X^{n,N}_{1\leftrightarrow 1,e}=\left\{X_{4x,4y}:(x,y)\in[\lfloor n/2\rfloor]^2,\;x<y\right\}.$$

Note that there is no edge between any and a node for any . In particular, there is no edge between any and . Therefore, describes all edges between nodes in .

2. Edges between