Tailored graph ensembles as proxies or null models for real networks I: tools for quantifying structure
Abstract
We study the tailoring of structured random graph ensembles to real networks, with the objective of generating precise and practical mathematical tools for quantifying and comparing network topologies macroscopically, beyond the level of degree statistics. Our family of ensembles can produce graphs with any prescribed degree distribution and any degree-degree correlation function, and its control parameters can be calculated fully analytically. As a result we can calculate (asymptotically) formulae for entropies and complexities, and for information-theoretic distances between networks, expressed directly and explicitly in terms of their measured degree distribution and degree correlations.
1 Introduction
In the study of natural or synthetic signaling networks, one of the key questions is how network structure relates to the execution of the process which it supports. This is especially true in systems biology, where, for instance, our understanding of how the structure of protein-protein interaction networks (PPIN) relates to their biological functionality is vital in the design of a new generation of intelligent and personalized medical interventions. In recent years, high-throughput proteomics has allowed for the drafting of large PPIN data sets, for different organisms, and with different experimental techniques and degrees of accuracy. With this accumulation of information, we now face the challenge of analyzing these data from a complex networks perspective, and using them optimally in order to increase our understanding of how PPIN control the functioning of cells, both in healthy and in diseased conditions. A prerequisite for achieving this is the availability of precise mathematical tools with which to quantify topological structure in large observed networks, to compare network instances, and to distinguish between meaningful and ‘random’ structural features. These tools have to be both systematic, i.e. with a sound statistical or information-theoretic basis, and practical, i.e. preferably formulated in terms of explicit formulae as opposed to tedious numerical simulations.
Many quantities have been proposed for characterizing the structure of networks, such as degree distributions [1], degree sequences [2], degree correlations [3] and assortativity [4], clustering coefficients [5], and community structures [6]. To assess the relevance of an observed topological feature in a network, a common strategy is to compare it against similar observations in so-called ‘null models’, defined as randomized versions of the original network which retain some features of the original one. The choice of which topological features to conserve in the randomized models has mostly been limited to degree distributions and degree sequences. Such null models were used to assess the statistical relevance of network motifs in real networks, viz. patterns which were observed significantly more often in the real networks than in their randomized counterparts [7, 8, 9]. Whether any such proposed motif is indeed functionally important and/or represents (evolutionarily) arisen principles is, however, not obvious; topological deviations from randomized networks could also be merely irrelevant consequences of some neglected structural property of the network, i.e. the result of an inappropriate null hypothesis rather than of a distinctive feature of the process [10, 11]. The definition and generation of good null models for benchmarking topological measures of real-world graphs (and the dynamical processes which they enable) is a nontrivial issue. Similarly, in comparing observed networks (which, as a result of experimental noise, will usually not even have identical nodes), one would seek to focus on the values of macroscopic topological observables, and to know the typical properties of networks with the observed features.
In recent years there have been efforts to define and generate random graphs whose topological features can be controlled and tailored to experimentally observed networks. In [12] a parametrized random graph ensemble was defined where graphs have a prescribed degree sequence, and links are drawn in a way that allows for preferential attachment on the basis of arbitrary two-degree kernels. In this paper we generalize the definition of this ensemble, and show that it can be tailored asymptotically to the generation of graphs with any prescribed degree distribution and any prescribed degree correlation function (and that it is a maximum entropy ensemble, given the degree correlations). Moreover, in spite of its parameter space being in principle infinitely large, in contrast to most random graph ensembles used to mimic real networks, we can derive explicit analytical formulae for the parameters of the ensemble, to leading order in system size, expressed directly in terms of the observed characteristics of the network given. Graphs from this ensemble are thus ideally suited to be used as either proxies or null models for observed networks, depending on the question to be answered.
Statistical mechanics approaches have been proposed to quantify the information content of network structures. The (Shannon or Boltzmann) entropy in particular has been instrumental in characterizing the complexity of network ensembles [13, 14, 15]. Here, the crucial availability of analytical expressions for the parameters of our ensemble will enable us to derive explicit formulae, in the thermodynamic limit (based on combinatorial and saddle-point arguments), for our ensemble’s Shannon entropy, and hence also for the complexity of its typical graphs. These formulae are compact and transparent, and expressed solely and explicitly in terms of the degree distribution and the degree correlations that our ensemble is targeting. Finally, along similar lines we can obtain an information-theoretic distance between networks, again expressed solely in terms of their degree distributions and degree correlations. A companion paper [16] will be devoted to large scale applications to PPIN data of these complexity and distance measures; here we focus on their mathematical derivation. Although there is no need for numerical sampling in our derivations (all results can be obtained analytically), we note that exact algorithms for generating random graphs from the proposed ensemble exist [17].
2 Definitions and properties of network topology characterizations
2.1 Networks, degree distributions, and degree correlation functions
We study networks (or graphs) of nodes (or vertices), labeled by Roman indices etc, where every vertex can be connected to other vertices by undirected links (or ‘edges’). The microscopic structure of such a network is defined in full by an matrix of binary variables , where the nodes and are connected by a link if and only if . We define and for all , and we abbreviate . Henceforth, unless indicated otherwise, any summation over Roman indices will always run over the set .
A standard way of characterizing the topology of a network , as e.g. observed in a biological or physical system under study, is to measure for each vertex the degree , the number of the links to this vertex. From these numbers then follow the empirical degree distribution and the observed average connectivity :
(1) 
(using the Kronecker symbol for , defined as for and otherwise). Objects such as have the advantage of being macroscopic in nature, allowing for size-independent characterization of network topologies, and for comparing networks that differ in size. However, networks with the same degree distribution can still differ profoundly in their microscopic structures. We need observables that capture additional topological information, in order to discriminate between different networks with the same degree distribution (1).
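To make the two observables of (1) concrete, the following minimal Python sketch measures the degrees, the empirical degree distribution (the fraction of nodes with each degree) and the average connectivity from a symmetric binary connectivity matrix. The function name `degree_statistics` and the small star-shaped example graph are hypothetical, introduced for illustration only.

```python
from collections import Counter


def degree_statistics(c):
    """Return (degrees, p, kbar) for a symmetric 0/1 connectivity matrix c.

    degrees[i] is the number of links attached to node i,
    p[k] is the fraction of nodes that have degree k,
    kbar is the average degree.
    """
    n = len(c)
    degrees = [sum(row) for row in c]
    counts = Counter(degrees)
    p = {k: counts[k] / n for k in sorted(counts)}
    kbar = sum(degrees) / n
    return degrees, p, kbar


# Hypothetical example: a star with one hub (node 0) and three leaves.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
degrees, p, kbar = degree_statistics(star)
# degrees = [3, 1, 1, 1], p = {1: 0.75, 3: 0.25}, kbar = 1.5
```

The average connectivity can equivalently be obtained as the mean of the degree distribution, which offers a quick consistency check on measured data.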
To construct macroscopic observables that quantify network topology beyond the level of degree statistics, it is natural to consider how the likelihood for two nodes of a network to be connected depends on their degrees, which is measured by the degree correlation function
(2) 
Here is the probability for two randomly drawn nodes with degrees to be connected, and is the overall probability for two randomly drawn nodes to be connected, irrespective of their degrees, viz.
(3)  
(4) 
By definition, is symmetric under exchanging and .
For simple networks , with some degree distribution but without any microstructure beyond that required by
(5) 
It follows that those topological properties of a given (large) network , that manifest themselves at the level of degree correlations and cannot be attributed simply to its degree statistics, can be quantified by a deviation from the simple law (5); see also [7, 19, 20]. One is therefore led in a natural way to the introduction of the relative degree correlations
(6) 
By definition, for sufficiently large simple networks , whereas any statistically relevant deviation from signals the presence in network of underlying criteria for connecting nodes beyond their degrees. Just like , is again a macroscopic observable that can be measured directly and at low computational cost. It is therefore a natural tool for quantifying and comparing network structures beyond the level of degree statistics.
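As an illustration of how the relative degree correlations could be measured in practice, the following minimal Python sketch evaluates the two connection probabilities described above by direct counting over ordered pairs of distinct nodes, and then divides their ratio by the simple-network law. It assumes that the law (5) takes the familiar form k k'/k̄² (an assumption on our part, being the standard result for large graphs with prescribed degrees); all function and variable names are hypothetical.

```python
from collections import defaultdict


def relative_degree_correlations(c):
    """Return {(k, k'): Pi(k, k')} for a symmetric 0/1 connectivity matrix c.

    Pi is the ratio P[connected | k, k'] / P[connected], divided by the
    ASSUMED simple-network law k * k' / kbar**2.
    Degree pairs that never occur for distinct nodes are omitted.
    """
    n = len(c)
    deg = [sum(row) for row in c]
    total = sum(deg)
    if total == 0:
        raise ValueError("graph has no links")
    kbar = total / n
    p_conn = total / (n * (n - 1))  # overall connection probability

    pairs = defaultdict(int)  # ordered node pairs (i != j) per degree pair
    links = defaultdict(int)  # connected ordered node pairs per degree pair
    for i in range(n):
        for j in range(n):
            if i != j:
                pairs[(deg[i], deg[j])] += 1
                links[(deg[i], deg[j])] += c[i][j]

    result = {}
    for (k, kp), m in pairs.items():
        if k == 0 or kp == 0:
            continue  # isolated nodes carry no correlation information
        ratio = (links[(k, kp)] / m) / p_conn
        result[(k, kp)] = ratio * kbar ** 2 / (k * kp)
    return result


# Hypothetical examples: a complete graph on four nodes, and a star.
complete = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
pi_complete = relative_degree_correlations(complete)
pi_star = relative_degree_correlations(star)
```

Graphs this small only illustrate the arithmetic; the statements in the text concern sufficiently large networks, where finite size effects in the counting become negligible.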
2.2 Properties of the relative degree correlation function
To prepare the ground for proving some asymptotic mathematical properties of the relative degree correlation function , we first simplify the denominator of (3):
(7)  
Upon inserting the result for (4) together with (3) into (6) we then find that
(8) 
and hence, using for all ,
(9) 
We are now in a position to establish three identities obeyed by . The first two of these, viz. (10,12), are the main ones; they are used frequently in mathematical manipulations of subsequent sections. The third provides the physical intuition behind (10,12). It is assumed implicitly in all proofs that remains finite for and that the limits exist.

Linear constraints:
(10)
These are easily verified for simple graphs , for which . However, they turn out to hold for any graph , as can be proven using (9) as follows:
(11) 
Normalization:
(12)
This follows directly from (10) upon multiplying both sides by , followed by summation over all .

Interpretation of the linear constraints:
(13)
Here is defined as the probability that two randomly drawn nodes, one having degree , are connected:
(14)
The proof is elementary:
(15)
We conclude that our first (proven) identity (10) boils down to the claim that for large one has (modulo irrelevant orders in ), which is easily understood.
We end this section with two further observations. First, the relations (10) involve the degree distribution, so one must expect that the possible values for are dependent upon (or constrained by) . Second, several other useful properties of the kernel can be extracted from (10). For instance, the only separable kernel is for all : a separable kernel is of the form for some function ( being symmetric), and insertion of this form into (10) leads immediately to for all .
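The linear constraints and the normalization can be checked numerically on any given graph. The sketch below is a hedged illustration: it assumes that the asymptotic form (9) reads Π(k,k') = k̄ Σ_ij c_ij δ(k,k_i) δ(k',k_j) / [N k k' p(k) p(k')], that (10) reads Σ_k' p(k') k' Π(k,k') = k̄ for every occurring degree k > 0, and that (12) reads Σ_kk' p(k) p(k') k k' Π(k,k') = k̄². These explicit forms, as well as all names in the code, are our assumptions for the purpose of illustration.

```python
from collections import Counter, defaultdict


def adjacency(n, edges):
    """Build a symmetric 0/1 connectivity matrix from a list of links."""
    c = [[0] * n for _ in range(n)]
    for i, j in edges:
        c[i][j] = c[j][i] = 1
    return c


def asymptotic_pi(c):
    """Return (pi, p, kbar), with pi in the ASSUMED asymptotic form of (9)."""
    n = len(c)
    deg = [sum(row) for row in c]
    kbar = sum(deg) / n
    p = {k: m / n for k, m in Counter(deg).items()}
    links = defaultdict(int)  # connected ordered pairs per degree pair
    for i in range(n):
        for j in range(n):
            if c[i][j]:
                links[(deg[i], deg[j])] += 1
    pi = {
        (k, kp): kbar * m / (n * k * kp * p[k] * p[kp])
        for (k, kp), m in links.items()
    }
    return pi, p, kbar


def constraint_sums(pi, p):
    """Left-hand side of the ASSUMED form of (10), for each degree k > 0."""
    return {
        k: sum(p[kp] * kp * pi.get((k, kp), 0.0) for kp in p)
        for k in p
        if k > 0
    }


# Hypothetical five-node example graph with degrees (3, 2, 2, 2, 1).
graph = adjacency(5, [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)])
pi, p, kbar = asymptotic_pi(graph)
sums = constraint_sums(pi, p)
normalization = sum(p[k] * k * s for k, s in sums.items())
```

Under these assumptions each entry of `sums` should equal `kbar`, and `normalization` should equal `kbar ** 2`, up to floating point rounding and irrespective of the graph chosen.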
3 Random graphs with controlled macroscopic structure
3.1 Definition of the random graph ensembles
To study the signalling properties of real-world networks, or generate ‘null models’ to assess the relevance of observed topological features, one needs random graph ensembles in which one can control the topological characteristics one is interested in and ‘tune’ these to match the characteristics of the observed networks. Most ensembles studied in the literature so far have focused on producing graphs with controlled degree statistics. The suggestion that (6) can be used for identifying network complexity beyond degree statistics goes back at least to [7, 18, 19, 20]. In contrast to these earlier studies, which were mostly limited to measuring (6) for real networks, here we take further mathematical steps that will allow us to use (6) as a systematic tool for quantifying complexity and distances in network structure beyond degree statistics. This requires generating random graphs in which we can control at will both the degree distribution and the relative degree correlations .
It will turn out that we can achieve our objectives with the following random graph ensembles, in which all degrees are drawn randomly and independently from , and where in addition the edges are drawn in a way that allows for preferential attachment on the basis of an arbitrary symmetric function of the degrees of the two vertices concerned:
(16)  
(17) 
Here is a normalization constant that ensures for all , , and the function must obey for all and . The ensemble (17) with prescribed degrees was defined and studied in [12, 21]. We note that in the above ensemble one will have .
Upon making the simplest choice for all one retrieves from (16) the ‘flat’ ensemble, where once the individual degrees are drawn randomly from , all graphs with the prescribed degrees carry equal probability:
(18) 
This follows from the property that for the factor in (16) depends on via the degrees only, and will consequently drop out of the measure (17):
(19)  
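The structure of the ensemble with prescribed degrees can be explored by brute force for very small graphs. The sketch below assumes that the measure (17) assigns to each graph with the prescribed degrees a weight equal to the product over node pairs i < j of (k̄/N) Q(k_i,k_j) if the pair is linked and 1 − (k̄/N) Q(k_i,k_j) otherwise, normalized over all graphs with those degrees. This explicit form is our assumption, chosen to be consistent with the description in the text; the function name and the example kernels are hypothetical.

```python
from itertools import combinations, product


def ensemble_probabilities(degrees, kernel):
    """Enumerate all simple graphs with the given degree sequence.

    Returns {link_pattern: probability}, where link_pattern is a tuple of
    0/1 entries over the node pairs i < j. Weights follow the ASSUMED
    form of the measure (17) described in the lead-in.
    """
    n = len(degrees)
    kbar = sum(degrees) / n
    pairs = list(combinations(range(n), 2))
    weights = {}
    for bits in product((0, 1), repeat=len(pairs)):
        realised = [0] * n
        for (i, j), b in zip(pairs, bits):
            realised[i] += b
            realised[j] += b
        if realised != list(degrees):
            continue  # degree constraints violated
        w = 1.0
        for (i, j), b in zip(pairs, bits):
            q = (kbar / n) * kernel(degrees[i], degrees[j])
            w *= q if b else 1.0 - q
        weights[bits] = w
    z = sum(weights.values())  # normalization constant
    return {g: w / z for g, w in weights.items()}


degree_sequence = [2, 2, 1, 1, 1, 1]

# Flat kernel: every graph with the prescribed degrees is equally likely.
flat = ensemble_probabilities(degree_sequence, lambda k, kp: 1.0)

# Hypothetical assortative kernel: links between equal degrees are favoured.
biased = ensemble_probabilities(
    degree_sequence, lambda k, kp: 2.0 if k == kp else 0.5
)
```

With the flat kernel all admissible graphs receive the same probability, in line with (18), whereas the degree-dependent kernel shifts weight towards graphs whose links join nodes of equal degree. The enumeration is exponential in the number of node pairs, so it serves only as a check for a handful of nodes.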
3.2 Asymptotic properties of the ensembles
One should expect that macroscopic physical observables such as (1) and (8) are self-averaging, and can therefore be calculated, to leading order in , in terms of their expectation values over the ensemble (16)
(20)  
It turns out that (20) can be calculated analytically, and expressed in terms of and . The first published result on this connection, in an appendix of [12], was unfortunately subject to an error; see the corrigendum [21] for the correct relation as given below, of which the actual derivation is given in Appendix A of the present paper:
(21) 
where the function is calculated self-consistently, for any , as the solution of
(22) 
It is satisfactory to observe, upon eliminating from (22) via (21), that (22) becomes identical to the set of relations (10) that we derived earlier for solely on the basis of the latter’s microscopic definition. Clearly, (10) must indeed hold for every single graph of the ensemble (16), provided is sufficiently large. On the other hand, for finite a typical graph of the ensemble (16) will display deviations from (21) that are at least of order (the difference between definition (8) and its asymptotic form (9)), but possibly of order (the typical finite size corrections in empirical averages over independent samples).
Expression (21) also provides en passant the explicit proof that for graphs in which the only structure is that imposed by the degree sequence, viz. those generated from (18) corresponding to for all , one indeed finds for . Upon inserting into condition (22) we find that for all , upon which the desired result follows directly from (21).
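The self-consistent calculation can be mimicked numerically. The sketch below is a hedged illustration that assumes one self-consistent parametrization compatible with the statements above, namely Π(k,k') = Q(k,k') / [F(k) F(k')] for (21) and F(k) = (1/k̄) Σ_k' p(k') k' Q(k,k') / F(k') for (22); the exact expressions may be parametrized differently. The damped (geometric mean) iteration is our own choice, made because the undamped update can oscillate, and is not taken from the source. All names are hypothetical.

```python
def solve_f(p, kernel, tol=1e-13, max_iter=10000):
    """Solve the ASSUMED form of (22) for F by damped fixed-point iteration.

    p maps degrees k to probabilities p(k); kernel(k, kp) returns Q(k, kp).
    Returns (f, kbar), with f a dict over the degrees in p.
    """
    kbar = sum(k * pk for k, pk in p.items())
    f = {k: 1.0 for k in p}
    for _ in range(max_iter):
        rhs = {
            k: sum(p[kp] * kp * kernel(k, kp) / f[kp] for kp in p) / kbar
            for k in p
        }
        # Geometric mean of old value and right-hand side damps oscillations.
        new = {k: (f[k] * rhs[k]) ** 0.5 for k in p}
        change = max(abs(new[k] - f[k]) for k in p)
        f = new
        if change < tol:
            break
    return f, kbar


def predicted_pi(p, kernel):
    """Relative degree correlations via the ASSUMED form of (21)."""
    f, kbar = solve_f(p, kernel)
    pi = {(k, kp): kernel(k, kp) / (f[k] * f[kp]) for k in p for kp in p}
    return pi, kbar


# Hypothetical degree distribution on the degrees 1, 2 and 3.
p = {1: 0.5, 2: 0.3, 3: 0.2}

pi_flat, kbar = predicted_pi(p, lambda k, kp: 1.0)
pi_separable, _ = predicted_pi(p, lambda k, kp: float(k * kp))
pi_diagonal, _ = predicted_pi(p, lambda k, kp: 1.5 if k == kp else 1.0)
constraints = {
    k: sum(p[kp] * kp * pi_diagonal[(k, kp)] for kp in p) for k in p
}
```

Under these assumptions the flat kernel and the separable kernel should both give relative degree correlations equal to one for all degree pairs, while for the third kernel every entry of `constraints` should equal the average degree, which is the assumed form of the linear constraints (10).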
Asymptotically (i.e. in leading relevant orders in ), the probabilities (16) to find graphs with the correct degree statistics, i.e. with degrees drawn randomly from , depend on via the degree distribution and the kernel only. To see this we study the following function for large ,
(23)  
The leading order in was studied in [12]. If one has (the degrees are imposed as strict constraints), whereas for one has
(24)  
where . We introduce the further shorthand , as well as the notation to denote finite size corrections that obey (to determine the exact scaling with of these corrections we would have to inspect e.g. the finite size corrections to (21)). We write the leading orders of (24) in terms of the kernel , using (9) and (21), and substituting into (21) the present degree distribution , and find
(25)  
where in the last step we used the identities (10). Expression (23) can subsequently be written as
(27)  
(29)  
with , , and . The leading order in reflects the property that the number of finitely connected graphs grows asymptotically with as . The next order is found to depend only on the macroscopic characterization of the specific graph , and on the macroscopic characterization of typical graphs from (16), with calculated for the kernel via (21).
3.3 Existence and uniqueness of tailored ensembles
We will now prove that for each degree distribution and each relative degree correlation function there exist kernels such that their associated ensembles (16) will for large be tailored to the production of random graphs with precisely these statistical features. We identify these kernels and show that they all correspond in leading order in to the same random graph ensemble.

Existence of a family of tailored kernels:
For each nonnegative function such that is nonzero for at least one combination , the following kernel satisfies all conditions required to define a random graph ensemble of the family (16) that generates graphs with degree distribution and relative degree correlation function as :
(30)
 is by construction nonnegative, symmetric, and correctly normalized. Also we will always find due to in combination with our conditions on and the normalization (12). Recovering the correct degree distribution is built into the ensemble (16) via the degree constraints. To prove that equations (21,22) are satisfied we define , and use the fact that by virtue of (21) the condition (22) reduces to (10), and is therefore guaranteed to hold, provided indeed represents a relative degree correlation function.
What remains is to show that there exist functions that meet the relevant conditions. The simplest candidate is , for which we find via (12) and which is easily confirmed to meet all criteria. It gives what we will call the canonical kernel:
(31)
Completeness of the family of tailored kernels:
The set of kernels defined by (30) is complete: if a kernel generates random graphs with statistics and , then it must be of the form (30).
The proof is simple. If generates graphs with relative degree correlation function , according to (21) it must be of the form for some function . Since both and must be nonnegative, the same must be true for . Hence is also of the form (30), with and with the formula for in (30) satisfied automatically due to having to be normalized.
A further corollary is that all kernels tailored to the generation of graphs with statistics and are related to the canonical kernel (31) via separable transformations, with suitably normalized nonnegative functions :
(32)
Asymptotic uniqueness of the canonical ensemble:
The random graph ensembles of all kernels of the family (30), tailored to generating random graphs with statistical properties and , are asymptotically (i.e. for large enough ) identical: if all are drawn randomly from , and belongs to the family (30) with canonical member defined in (31), then
(33)
This follows from (27), which tells us that in the two leading orders in the probabilities of graphs generated from (30) depend on the kernel of the ensemble only via its associated function , so that .
The above results imply that we may regard the random graph ensemble (16), equipped with the kernel (31), as the natural ensemble for generating large random graphs with topologies controlled strictly by a prescribed degree distribution and prescribed relative degree correlations . We will call , with , the canonical ensemble for graphs with and . Note that for one has , which is indeed equivalent to the trivial choice (as it is related to the latter by a separable transformation).
We can now also define what we mean by ‘null models’. Given the hypothesis that a network has no structure beyond that imposed by its degree statistics, the appropriate null model is a random graph generated by the canonical ensemble with degree distribution and relative degree correlations (giving the trivial kernel ; these are usually referred to as ‘simple graphs’). Similarly, given the hypothesis that a network has no structure beyond that imposed by its degree statistics and its degree-degree correlations, the appropriate null model is a random graph generated by the canonical ensemble with degree distribution and relative degree correlations .
Finally, self-consistency demands that and the canonical kernel (or a member of its equivalent family, related by separable transformations) are also the most probable pair in a Bayesian sense. The probability that a pair was the ‘generator’ of via (16) can be expressed, via standard Bayesian relations, in terms of the probability of drawing at random from (16):
(34) 
The most probable pair is the one that maximizes (modulo terms independent of ), so in the absence of any prior bias, i.e. if is independent of , it is the kernel that maximizes . Since for any , finding the most probable for a graph boils down to finding the smallest ensemble of graphs compatible with the structure of . Intuitively this makes sense: a more detailed characterization of the topology of an observed graph allows more information to be carried over from the graph to the ensemble, reducing the number of potential graphs allowed for by the ensemble. The smaller the number of graphs in the ensemble, the more accurate these graphs will be when used as proxies for the observed one.
Maximizing over means minimizing in (23), of which the leading orders in are given in (27). Demonstrating Bayesian self-consistency of our canonical graph ensemble for large hence boils down to proving that the maximum of (29) over (subject to the relevant constraints) is obtained for . The constraints include the set (10). There are clearly more, e.g. for all ; however, we show below that maximizing (29) over subject only to (10) and already generates the desired result: . Extremizing (29) with the Lagrange formalism leads to the following equations, which are to be solved in combination with (10) and :