Geometric randomization of real networks with prescribed degree sequence
We introduce a model for the randomization of complex networks with geometric structure. The geometric randomization (GR) model assumes a homogeneous distribution of the nodes in an underlying similarity space and uses rewirings of the links to find configurations that maximize a connection probability akin to that of the $\mathbb{S}^1$ or $\mathbb{H}^2$ geometric network models. However, GR preserves the original degree sequence, as in the configuration model, thus eliminating the fluctuations of the degree cutoff. Moreover, the model does not require the explicit estimation of hidden degree variables, which restricts the number of free parameters to one, controlling the level of clustering in the rewired network. We illustrate the potential of GR as a null model by investigating the effects on modularity that derive from the flattening of geometric communities in both real and synthetic networks. We find that for real networks the geometric and topological communities are consistent, while for the randomized counterparts the topological communities detected are attributable to structural constraints induced by the underlying geometric architecture.
Null models play a central role in network science and statistics to discern regularities and patterns in the fabric of systems that are not attributable to specific constraints. Typically, null models of complex networks are fit with one or several particular structural properties, depending on the question at hand, to predict the organization of a network as the outcome of a random process where other features are allowed to vary. Hence, null models are said to produce maximally random ensembles given some specific features Newman et al. (2001). Successful applications of null models in complex networks include the detection of rich-club ordering Colizza et al. (2006); Serrano (2008), the characterization of structural correlations in weighted networks Garlaschelli and Loffredo (2009), and the quantification of communities using modularity Newman and Girvan (2004).
Intriguingly, the frontier separating models and null models is not so neat, especially when the models remain simple and the null models fix more than one property. In fact, some famed network models, originally born to explain some peculiarity of the structure of networks on the basis of first principles, are often used as null models, for instance, the growing Barabási-Albert model Barabási and Albert (1999), which explains the generation of scale-free degree distributions by implementing a preferential attachment mechanism. Recently, a class of network models in hidden metric spaces Serrano et al. (2008); Boguñá et al. (2009) has been shown to explain many pivotal features of real networks simultaneously (like the small-world property, heterogeneous degree distributions, high levels of clustering, and self-similarity) based only on three parameters, controlling the average degree, the exponent of the power-law degree distribution, and the clustering coefficient.
The key ingredient of the geometric network models is that the probability of connecting two nodes is determined by their effective distance, as measured in a hidden metric space in which the nodes are embedded. The underlying space is defined along two dimensions representing popularity and similarity features of the nodes, such that more popular and more similar nodes are more likely to interact. In the $\mathbb{S}^1$ model Serrano et al. (2008), the hidden degree of a node is a proxy for its popularity, and its angular position in the one-dimensional sphere (or circle) provides the similarity measure. The two coordinates contribute explicitly to the connection probability between two nodes, which increases with the product of their hidden degrees and decreases with their angular distance along the circle. The hidden degree can be estimated from the observed degree and reinterpreted as a radial coordinate in a hyperbolic plane Krioukov et al. (2010), which leads to the formulation of an isomorphic, purely geometric version of the model, the $\mathbb{H}^2$ model. In the $\mathbb{H}^2$ model, popularity takes the form of a radial coordinate in the hyperbolic disk, such that higher degree nodes are placed closer to the center, while the angular coordinate remains as in the similarity space, and the probability of connection decreases with the hyperbolic distance.
In both the $\mathbb{S}^1$ and $\mathbb{H}^2$ models, the angular coordinate of the nodes, representing the similarity dimension, is drawn from a homogeneous distribution, at odds with hyperbolic maps of real networks Boguñá et al. (2010). In fact, geometric communities of nodes lying nearby in the similarity space (referred to as soft communities or latent communities) are typically detected in real networks Boguñá et al. (2010); Serrano et al. (2012); García-Pérez et al. (2016) and can be modeled Zuev et al. (2015); García-Pérez et al. (2018a).
This observation opens the door to the use of geometric models with homogeneous similarity distribution as null models for the investigation of the community organization and other structural properties of geometric networks.
In this paper, we introduce a variant of the popularity-similarity geometric model, which we name the geometric randomization (GR) model, and illustrate its use as a null model for the analysis of the topological properties of real networks, including community structure. The GR model assumes the same form of the connection probability as in the $\mathbb{S}^1$ or $\mathbb{H}^2$ models, and a homogeneous distribution for the similarity coordinate as well. In contrast, it is fit with a given degree sequence, like the configuration model Newman (2010). The use of prescribed degrees makes it possible to skip the step of estimating the hidden degrees from real data. It could also help, for instance, in the analysis of features that are especially sensitive to fluctuations of the degree cutoff, like the behavior of dynamical processes such as epidemic spreading or synchronization, or for high-fidelity reproduction of real network topologies. Based on the premises mentioned above, we propose an algorithm that homogenizes the similarity distribution and rewires the links in a network, preserving the given degrees, to maximize the likelihood that the new topology is generated by the geometric model. We analyze the effects of the GR model on the topological properties of real and synthetic geometric networks, and use it as a null model to explore the effects on modularity of the flattening of geometric communities in the similarity space.
II The geometric randomization model
The GR model operates on networks where nodes have an observed degree and exist in a similarity space. The similarity space is taken to be a circle, as in the $\mathbb{S}^1$ or $\mathbb{H}^2$ models. In those models every node is characterized by a popularity-similarity pair $(\kappa, \theta)$, where $\kappa$ is the node's hidden degree (expected to be proportional to the observed degree $k$) and $\theta$ its angular or similarity coordinate.
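In the GR model, instead, only angular coordinates are assigned to the nodes, chosen uniformly at random from $[0, 2\pi)$. The network is then rewired in order to maximize the likelihood that the new topology is generated by the $\mathbb{S}^1$ model while preserving the observed degrees, and thus the total number of edges $E$. The rewiring procedure is conducted by executing the Metropolis-Hastings algorithm, aimed at finding the network connectivity (i.e. the adjacency matrix $\{a_{ij}\}$) that maximizes the likelihood function
$$\mathcal{L} = \prod_{i<j} \left[p(\Delta\theta_{ij})\right]^{a_{ij}} \left[1 - p(\Delta\theta_{ij})\right]^{1 - a_{ij}}, \tag{1}$$
where $\Delta\theta_{ij}$ stands for the angular distance between nodes $i$ and $j$, and the connection probability reads
$$p(\Delta\theta_{ij}) = \frac{1}{1 + \left(\dfrac{R\,\Delta\theta_{ij}}{\mu \kappa_i \kappa_j}\right)^{\beta}}. \tag{2}$$
Parameter $\mu$ depends on the observed average degree of the network, $\beta$ controls the level of clustering, $\kappa_i$ is the hidden degree of node $i$, and $R$ is the radius of the circle (adjusted to have a density of nodes equal to 1, see Appendix A).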
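The algorithm proceeds by repeating the following steps:
- Compute the current likelihood $\mathcal{L}_c$.
- Two links, between nodes $i$ and $j$, and between nodes $l$ and $m$, are randomly chosen and swapped: the new links connect nodes $i$ and $l$, and nodes $j$ and $m$.
- Compute the new likelihood $\mathcal{L}_n$.
- If $\mathcal{L}_n \geq \mathcal{L}_c$, accept the link swap.
- Otherwise, if $\mathcal{L}_n < \mathcal{L}_c$, accept the link swap with probability $\mathcal{L}_n / \mathcal{L}_c$.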
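The rewiring algorithm is terminated after a prescribed number of link swaps have been attempted, chosen to ensure that the likelihood has reached a plateau. Notice that at the end of the rewiring procedure the degrees of the nodes have not changed, but the resulting network might not be connected. Since the hidden degrees are kept constant (independently of their values), they cancel in the likelihood ratio, and the probability of replacing the links between nodes $i$ and $j$ and between nodes $l$ and $m$ by links between nodes $i$ and $l$ and between nodes $j$ and $m$ simply reads
$$p_{\mathrm{swap}} = \min\left\{1, \left(\frac{\Delta\theta_{ij}\,\Delta\theta_{lm}}{\Delta\theta_{il}\,\Delta\theta_{jm}}\right)^{\beta}\right\}. \tag{3}$$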
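The cancellation of the hidden degrees can be checked in a few lines. The following is a sketch that assumes the $\mathbb{S}^1$ form of the connection probability, $p(\Delta\theta_{ij}) = 1/[1 + (R\,\Delta\theta_{ij}/(\mu\kappa_i\kappa_j))^{\beta}]$, and a swap that replaces links $i$-$j$ and $l$-$m$ by links $i$-$l$ and $j$-$m$, neither of which existed before the swap (the index labels are chosen here for illustration). Only these four node pairs change their contribution to the likelihood, so the ratio between the new likelihood $\mathcal{L}_n$ and the current one $\mathcal{L}_c$ is

```latex
\frac{p(\Delta\theta_{ij})}{1 - p(\Delta\theta_{ij})}
  = \left(\frac{\mu \kappa_i \kappa_j}{R\,\Delta\theta_{ij}}\right)^{\beta},
\qquad
\frac{\mathcal{L}_n}{\mathcal{L}_c}
  = \frac{\dfrac{p(\Delta\theta_{il})}{1 - p(\Delta\theta_{il})}\,
          \dfrac{p(\Delta\theta_{jm})}{1 - p(\Delta\theta_{jm})}}
         {\dfrac{p(\Delta\theta_{ij})}{1 - p(\Delta\theta_{ij})}\,
          \dfrac{p(\Delta\theta_{lm})}{1 - p(\Delta\theta_{lm})}}
  = \left(\frac{\Delta\theta_{ij}\,\Delta\theta_{lm}}
               {\Delta\theta_{il}\,\Delta\theta_{jm}}\right)^{\beta}.
```

The product $\kappa_i\kappa_j\kappa_l\kappa_m$ appears in both numerator and denominator, as do $\mu$ and $R$, so only the four angular distances and $\beta$ survive.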
Therefore, the GR model does not actually require estimating the hidden degrees of the nodes, because they do not enter in any step of the algorithm. Instead, the GR model simply needs to assign uniformly distributed angular coordinates and to fix a value for the clustering parameter $\beta$; see the next Section for details on this part.
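The rewiring procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `angular_distance` and `geometric_randomization` are hypothetical, the acceptance rule is the standard Metropolis one applied to the likelihood ratio (which depends only on four angular distances and $\beta$), and swaps that would create self-loops or multiple links are simply skipped, which is an assumption not stated in the text.

```python
import math
import random

def angular_distance(a, b):
    """Minimum angular separation between two angles on the circle."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def geometric_randomization(edges, n_nodes, beta, n_attempts, seed=0):
    """Degree-preserving Metropolis-Hastings rewiring (illustrative sketch).

    edges: list of (u, v) pairs of an undirected simple graph.
    Returns the uniformly drawn angles and the rewired edge list.
    """
    rng = random.Random(seed)
    theta = [rng.uniform(0.0, 2 * math.pi) for _ in range(n_nodes)]
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_attempts):
        a, b = rng.sample(range(len(edges)), 2)
        (i, j), (l, m) = edges[a], edges[b]
        # Proposed swap: (i, j), (l, m) -> (i, l), (j, m).
        if len({i, j, l, m}) < 4:
            continue  # the two links share a node; skipped for simplicity
        if frozenset((i, l)) in present or frozenset((j, m)) in present:
            continue  # would create a multiple link
        old = angular_distance(theta[i], theta[j]) * angular_distance(theta[l], theta[m])
        new = angular_distance(theta[i], theta[l]) * angular_distance(theta[j], theta[m])
        # Likelihood ratio L_n / L_c = (old / new) ** beta, compared in log space.
        log_ratio = beta * (math.log(old) - math.log(new))
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            present -= {frozenset((i, j)), frozenset((l, m))}
            present |= {frozenset((i, l)), frozenset((j, m))}
            edges[a], edges[b] = (i, l), (j, m)
    return theta, edges
```

Because each accepted swap removes one link from each of the four nodes and adds one back, the degree sequence is preserved by construction. The sketch does not check whether the rewired network remains connected.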
Geometric randomizations of networks can also be obtained using the $\mathbb{S}^1$ model with parameters $\gamma$, $\beta$, and $\mu$ (controlling the exponent of the power-law hidden degree distribution, the clustering coefficient, and the average degree, respectively) estimated from the empirical network. This alternative, however, requires the explicit estimation of the hidden degree sequence or of the exponent of the hidden degree distribution and, thus, it may introduce undesired fluctuations in the degree cutoff, which can induce relevant differences between the topological properties of real and generated networks.
III Tuning clustering through parameter $\beta$
In order to apply the GR model to a real or synthetic network one simply needs to fix parameter $\beta$, which controls the level of clustering in the network Serrano et al. (2008). Clustering is a signature of the metricity of geometric networks Krioukov (2016) and gives the connection between the observed topology and the underlying metric space, as a reflection of the triangle inequality.
Note that the value of $\beta$ affects the probability to accept a link swap (see Eq. (3)), so it determines the final network's structure. We address the role of $\beta$ by applying the GR model to synthetic networks generated by the Geometric Preferential Attachment (GPA) model Zuev et al. (2015) and the soft communities in similarity space (SCSS) model García-Pérez et al. (2018a). Both models are intended to produce synthetic networks with tunable community structure.
The GPA model generates geometric networks with soft communities using a growing mechanism in the hyperbolic plane. The probability of connection depends on parameter $\Lambda$, controlling the initial attractiveness of the different angular regions, such that the heterogeneity of the angular coordinate is a decreasing function of $\Lambda$, with $\Lambda \to \infty$ recovering the homogeneous distribution. Notice that the degree distribution and the clustering coefficient in networks generated by the GPA model are independent of $\Lambda$. However, by construction $\beta \to \infty$ and, thus, the level of clustering is always the maximum possible. The SCSS model is a version of the $\mathbb{S}^1$ model for the generation of soft communities that makes it possible to change the generated level of clustering as a function of $\beta$.
Fig. 1a shows the average clustering coefficient of a GPA network compared with the randomizations obtained by applying the GR model using different values of $\beta$. As expected, the average clustering of the rewired networks strongly depends on the value of $\beta$: the lower $\beta$, the lower the average clustering in the resulting network. A level of clustering similar to GPA values can be obtained in GR networks by using large values of $\beta$, such as .
In Fig. 1b-c, we report the average clustering coefficient obtained by applying the GR model to synthetic networks generated with the SCSS model. The SCSS networks are produced using two different generating values of $\beta$. Fig. 1b-c show that it is possible to fine-tune the value of $\beta$ used by the GR model so that the rewired networks reproduce the same average clustering as the original networks. If the generating value of $\beta$ is used for the rewiring, the level of clustering in the GR instances does not reach that in the original networks and remains smaller. This observation can be understood by noticing the following two points. First, for SCSS networks the average clustering is independent of the level of angular clusterization, so any two SCSS networks with equal $\beta$ and the same distribution of hidden degrees, $\rho(\kappa)$, will have equal average clustering. Second, a GR instance of a SCSS network obtained using the generating $\beta$ would be one with a homogeneous angular distribution and the same observed degree distribution as in the SCSS network. That is, if the observed degrees matched the hidden degrees exactly, then the average clustering reached by the GR instance with the generating $\beta$ would need to match that of the SCSS network. Since we do not observe this matching in Fig. 1b-c, we conclude that the discrepancy is due to differences between the distributions of observed and hidden degrees of the SCSS network.
IV Effects of geometric randomization in empirical networks
In the following, we apply the GR model to real networks. We consider six empirical networks from different domains: the network of flows of goods and services among industrial sectors in the USA (Commodities) Allard et al. (2017), the network of chord transitions in Western popular music (Music) Serrà et al. (2012), the one-mode projection onto metabolites of the human metabolic network at the cell level (Metabolic) Serrano et al. (2012), the word adjacency network in Darwin's book On the Origin of Species (Words) M. Ángeles Serrano (2009), the email communication network within the Enron company (Enron) Klimt and Yang (2004), and the Internet at the autonomous system level (Internet) Claffy et al. (2009); Boguñá et al. (2010), see Table 1 and Appendix B for details.
As described in the previous Section, $\beta$ is the only free parameter of the model, and can be used to tune the clustering coefficient. In the following, we will show results obtained with a value of $\beta$ ensuring that the average clustering of the rewired network is equal to that of the real one. Another possible choice for $\beta$ is the value estimated when embedding the real network into the underlying metric space Boguñá et al. (2010), which is also reported in Table 1. The embedding method estimates the coordinates of the nodes in the underlying geometry by maximizing the likelihood that the observed topology has been produced by the geometric model. In the process, $\beta$ is estimated such that the expected clustering coefficient of the embedded network matches the observed clustering coefficient of the network topology. As explained in the previous Section for synthetic networks, using the embedding value of $\beta$ as the input in GR does not in general produce rewired networks with the same average clustering as in the original networks. For real networks, the two values of $\beta$ are very similar but not always identical, see Table 1. The small difference is related to the fact that, for some real networks, the GR model cannot adjust simultaneously the empirical connection probability and the observed clustering using a single value of $\beta$, see Fig. 2.
IV.1 Clustering and degree correlations
Fig. 3 shows the average clustering of the empirical networks under consideration as compared to the randomized versions obtained by the GR model. We consider both values of $\beta$, the one tuned to reproduce the empirical clustering and the one estimated by the embedding, and we also include a comparison with real network replicas generated by the $\mathbb{S}^1$ model Serrano et al. (2008), see Appendix A. As expected, GR networks obtained with the tuned $\beta$ show an average clustering practically identical to that of the original data, while GR networks obtained with the embedding $\beta$ present mild deviations, and differences are usually more important for $\mathbb{S}^1$ networks due to deviations in the obtained degrees. One exception to the preservation of clustering in GR instances is the Words data set. This empirical network has an embedding $\beta$ extremely close to the minimal threshold $\beta = 1$ defined in hidden metric space network models. The value of $\beta$ necessary to ensure that the GR network has the same level of clustering as the empirical one cannot be achieved, since it would need to be lower than 1. In general, an embedding value of $\beta$ close to 1 suggests that clustering is due to finite size effects, since $\beta = 1$ corresponds to the absence of clustering in the thermodynamic limit of the geometric network models.
The graphs in the top row of Fig. 4 show the clustering spectrum, i.e. the average clustering of nodes of degree $k$, for empirical networks and networks obtained by the GR and $\mathbb{S}^1$ models. In all cases, the functional form of the clustering spectrum is similar, a decreasing function of $k$ with a broad tail. The clustering spectrum of the GR networks is always very close to the original data, while the $\mathbb{S}^1$ networks present important departures in some systems, as a result of the lack of preservation of the empirical degrees. This is especially evident for the $\mathbb{S}^1$ versions of the Music and Words networks, whose clustering spectrum is much lower than that of the original data.
On the other hand, the real networks under consideration are generally disassortative, as revealed by the decreasing form of the average degree of nearest neighbors as a function of the degree $k$, Fig. 4 (bottom). Internet, Music and Words show a decay with power-law form, while other data sets show milder degree correlations. In all cases, GR networks have average nearest-neighbor degree functions very similar to the original data, while $\mathbb{S}^1$ networks exhibit strong deviations, with the exception of the Internet.
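For readers who want to reproduce this kind of comparison, the two degree-dependent quantities discussed above can be computed directly from an edge list. The sketch below is an illustration under simple assumptions (undirected simple graph, no isolated nodes in the edge list); the function name `degree_class_averages` is hypothetical and not taken from the paper.

```python
from collections import defaultdict

def degree_class_averages(edges):
    """Return {k: mean local clustering} and {k: mean neighbor degree}."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    clustering, knn = defaultdict(list), defaultdict(list)
    for node, adj in nbrs.items():
        k = len(adj)
        # Average degree of the neighbors of this node.
        knn[k].append(sum(len(nbrs[w]) for w in adj) / k)
        if k >= 2:
            # Each connected neighbor pair is counted twice, once per endpoint.
            links = sum(len(adj & nbrs[w]) for w in adj) / 2
            clustering[k].append(links / (k * (k - 1) / 2))
    mean = lambda values: sum(values) / len(values)
    return ({k: mean(v) for k, v in clustering.items()},
            {k: mean(v) for k, v in knn.items()})
```

Nodes of degree 1 have no defined local clustering and are left out of the clustering spectrum, while they do contribute to the nearest-neighbor degree average.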
IV.2 Community structure
So far, GR randomized versions of real and synthetic geometric networks seem to be able to preserve topological features beyond the degree distribution, including clustering and the average nearest-neighbor degree. However, the GR randomization homogenizes the distribution of nodes in similarity space, while nodes in real networks are typically heterogeneously distributed, as they are more concentrated in some specific regions Serrano et al. (2012); García-Pérez et al. (2016). This denotes the presence of communities of similar nodes, named soft communities Zuev et al. (2015). The top row of Fig. 5 shows the representations of the empirical networks embedded in the hyperbolic plane, with coordinates $(r, \theta)$ (see Appendix A for the relationship between $r$ and the degree, and Appendix B for references to the sources of the empirical maps). One can clearly see that the angular coordinates are heterogeneously distributed in $[0, 2\pi)$. A different perspective is shown in the bottom row of Fig. 5, displaying the probability density function of the similarity coordinate of the nodes for the six empirical networks.
The heterogeneity of the angular coordinate can be quantified by performing a Kolmogorov-Smirnov (KS) test between the empirical distribution of the angular coordinates and the uniform distribution. The KS statistic measures the difference between two probability distributions, and it is defined as the maximum absolute difference between their cumulative distribution functions. The larger the KS score, the more heterogeneous the angular distribution. Thus, it can be used to reject the null hypothesis that the empirical and synthetic samples (with uniform distribution by construction) present the same angular distribution. The KS distance for the empirical networks under consideration is reported in Table 1. One can see that the null hypothesis is strongly rejected for all real networks.
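As an illustration of the statistic, the following sketch computes the KS distance between a set of angles and the uniform distribution on $[0, 2\pi)$. It is a one-sample variant written in plain Python, which is an assumption on our part: the comparison in the text may instead have been carried out between two samples. The function name `ks_statistic_uniform` is hypothetical.

```python
import math

def ks_statistic_uniform(angles):
    """One-sample KS distance between angles in [0, 2*pi) and the uniform law."""
    xs = sorted(a / (2 * math.pi) for a in angles)  # rescale to [0, 1)
    n = len(xs)
    d = 0.0
    for rank, x in enumerate(xs, start=1):
        # The empirical CDF jumps from (rank - 1)/n to rank/n at x,
        # while the uniform CDF takes the value x.
        d = max(d, rank / n - x, x - (rank - 1) / n)
    return d
```

Angles concentrated in a narrow arc give a distance close to 1, while evenly spread angles give a distance of order $1/N$.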
Soft communities in the geometric domain can then be detected using geometric methods. We use the definition of soft communities given in Zuev et al. (2015), where they are defined as groups of nodes in similarity space separated from the rest by two angular gaps that exceed a certain critical value, $\Delta\theta_c$. The critical gap is calculated as the expected value of the largest gap between two consecutive nodes when the angular coordinates of the $N$ nodes are distributed uniformly at random, which to leading order in $N$ is $\Delta\theta_c \simeq 2\pi \ln N / N$. In the top row of Fig. 5, we use different colors to highlight the deterministic partition into soft communities detected by the critical gap method in the real networks.
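The critical gap method lends itself to a compact implementation. The sketch below is illustrative only: the function name `critical_gap_communities` is hypothetical, and the threshold uses the leading-order estimate $2\pi \ln N / N$ of the expected largest gap for $N$ uniformly distributed angles, which is an assumption about the exact expression used.

```python
import math

def critical_gap_communities(theta):
    """Split nodes into angular groups separated by gaps above the critical one.

    theta: list of angles in [0, 2*pi). Returns a community label per node.
    """
    n = len(theta)
    gap_c = 2 * math.pi * math.log(n) / n  # leading-order expected largest gap
    order = sorted(range(n), key=lambda v: theta[v])
    # Gap between each node and its successor along the circle, wrapping around.
    gaps = [(theta[order[(k + 1) % n]] - theta[order[k]]) % (2 * math.pi)
            for k in range(n)]
    breaks = [k for k in range(n) if gaps[k] > gap_c]
    labels = [0] * n
    if not breaks:
        return labels  # no gap exceeds the threshold: a single group
    # Start right after the first large gap so no group straddles the wrap.
    start = (breaks[0] + 1) % n
    current = 0
    for step in range(n):
        k = (start + step) % n
        labels[order[k]] = current
        if gaps[k] > gap_c and step < n - 1:
            current += 1
    return labels
```

With homogeneously distributed angles, gaps larger than the threshold are rare and tend to isolate only arbitrary arcs, which is consistent with the method finding no meaningful soft communities in that case.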
Next, we compare the community structure of the real networks with their randomized counterparts. To quantify their topological community structure, we apply the widely used Louvain method (LM) Blondel et al. (2008), aimed at maximizing the modularity $Q$, which compares the fraction of links inside communities with the expected fraction for a random distribution of edges with the same node degree distribution as the given network. Interestingly, Fig. 6a shows that in real networks, although the Louvain method identifies topological communities with higher modularity, the soft communities discovered by the critical gap (CG) method display large $Q$ values, in some cases (e.g. the Metabolic or Music data sets) comparable to the modularities given by the purely topological LM.
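For reference, the modularity of any given partition, whether topological or geometric, can be evaluated with the standard Newman-Girvan expression $Q = \sum_c [E_c/E - (D_c/2E)^2]$, where $E_c$ is the number of links inside community $c$, $D_c$ the sum of the degrees of its nodes, and $E$ the total number of links. The sketch below assumes an undirected simple graph; the function name `modularity` is hypothetical.

```python
from collections import defaultdict

def modularity(edges, labels):
    """Newman-Girvan modularity Q of a partition of an undirected simple graph."""
    n_edges = len(edges)
    internal = defaultdict(int)    # links with both ends in community c
    degree_sum = defaultdict(int)  # total degree of the nodes in community c
    for u, v in edges:
        degree_sum[labels[u]] += 1
        degree_sum[labels[v]] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1
    return sum(internal[c] / n_edges - (degree_sum[c] / (2 * n_edges)) ** 2
               for c in degree_sum)
```

The same function can score the partition returned by a community detection method and the one defined by angular gaps, which is the kind of comparison made in Fig. 6.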
This picture is completely different for GR networks, reported in Fig. 6b. GR networks show strong community organization at the topological level, resulting in large values of $Q$ as measured by the Louvain method, which is induced by structural constraints imposed by the geometric models Faqeeh et al. (2018). However, as expected, the critical gap method does not detect soft communities, as demonstrated by the non-significant values of the modularity, compatible with zero, over different realizations of the randomization process.
We study in more detail the relationship between soft communities and topological ones by comparing the partition obtained by the Louvain method with the partition generated by the critical gap method. The overlap between the two partitions can be quantified by the normalized mutual information Cover and Thomas (1991). Fig. 6c shows that the overlap between geometric and topological communities is quite large for real networks, especially for the Metabolic and Internet data sets, meaning that communities identified by purely (deterministic) geometric methods are meaningful, though subject to the degree of congruency of the real network with the hidden metric space. On the contrary, Fig. 6c shows that the overlap between soft and topological communities in GR networks is very low, due to the complete randomization of the angular coordinate operated by GR.
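The overlap measure can be illustrated as follows. The sketch uses the normalization $2I(A;B)/[H(A)+H(B)]$, which is one common convention and an assumption here, since the text does not specify which normalization is adopted; the function name `normalized_mutual_information` is hypothetical.

```python
import math
from collections import Counter

def normalized_mutual_information(labels_a, labels_b):
    """NMI = 2 I(A;B) / (H(A) + H(B)) between two partitions of the same nodes."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    entropy_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in count_ab.items())
    if entropy_a + entropy_b == 0:
        return 1.0  # both partitions are trivial and therefore identical
    return 2 * mutual / (entropy_a + entropy_b)
```

The measure equals 1 when the two partitions coincide up to a relabeling of the communities and 0 when they are statistically independent.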
The rewiring process preserving degrees in the geometric randomization of real networks gives an alternative to their replication using directly the popularity-similarity model as a topology generator. The GR model offers the advantage of avoiding the delicate task of estimating the hidden degree distribution, and it can be especially useful in problems sensitive to fluctuations of the degree cutoff, like the behavior of some dynamical processes, including epidemic spreading.
As a model, GR depends on a single parameter controlling the level of clustering in the resulting networks, so that the clustering coefficient of real networks can be chosen to be replicated or not. Interestingly, the discrepancies between hidden and observed degrees in embedded networks have an effect on the clustering level achieved by the GR model. In particular, the parameter value suggested by the embedding of the original data is, in general, close to but not exactly equal to the value needed to replicate the clustering coefficient of the original network. Our results also indicate that, in some networks, degree-degree correlations can only be replicated by the geometric network models if the observed degrees are preserved.
As a null model, GR can be used to investigate the relevance of geometric communities in real networks. Taken together, our results indicate that geometric communities are meaningful in the real networks analyzed here. At the same time, topological communities, like those detected in GR networks, are not always reliable and can be a result of constraints induced by the underlying geometric architecture. The fact that an underlying geometric organization imposes structural constraints on complex networks, which are strong enough for recreating detectable topological communities even in the absence of geometric ones, is an interesting subject by itself and will be investigated in future work.
We thank Marián Boguñá and Guillermo García-Pérez for helpful discussions. We acknowledge support from a James S. McDonnell Foundation Scholar Award in Complex Systems; Ministerio de Ciencia, Innovación y Universidades of Spain project no. FIS2016-76830-C2-2-P (AEI/FEDER, UE); and the project Mapping Big Data Systems: embedding large complex networks in low-dimensional hidden metric spaces – Ayudas Fundación BBVA a Equipos de Investigación Científica 2017.
Appendix A. The $\mathbb{S}^1$ and $\mathbb{H}^2$ models
In the $\mathbb{S}^1$ model Serrano et al. (2008), every node $i$ is characterized by a hidden degree $\kappa_i$ and an angular coordinate $\theta_i$, representing the popularity (related to the degrees) and similarity dimensions, respectively. The $N$ nodes of the network are distributed at random in the similarity space, which is taken to be a one-dimensional sphere or circle of radius $R = N/(2\pi)$, adjusted to have a density of nodes equal to 1. Every pair of nodes is connected with a probability
$$p_{ij} = \frac{1}{1 + \left(\dfrac{R\,\Delta\theta_{ij}}{\mu \kappa_i \kappa_j}\right)^{\beta}},$$
where $\Delta\theta_{ij}$ stands for the angular separation between nodes $i$ and $j$ in the similarity circle, and the parameters $\mu$ and $\beta$ control the average degree of the network and the level of clustering, respectively.
There exists an isomorphism between the $\mathbb{S}^1$ model and a version in hyperbolic space, the $\mathbb{H}^2$ model Krioukov et al. (2010), where the hidden degrees are transformed into a radial coordinate, $r_i$, in a hyperbolic disk of radius $R_{\mathbb{H}^2}$ such that
$$r_i = R_{\mathbb{H}^2} - 2 \ln\frac{\kappa_i}{\kappa_0},$$
where $\kappa_0$ is the smallest hidden degree. Consequently, nodes closer to the center of the hyperbolic disk have a higher expected degree, and every node $i$ then has a radial coordinate $r_i$ and an angular coordinate $\theta_i$. A link between two nodes $i$ and $j$ exists with a probability that depends on their distance $x_{ij}$, measured in the hyperbolic hidden metric space, such that nodes with higher probabilities of being connected are closely positioned in that space. Therefore, the connection probability must be a decreasing function of the distance between nodes and, specifically, it can be chosen to be
$$p(x_{ij}) = \frac{1}{1 + e^{\frac{\beta}{2}\left(x_{ij} - R_{\mathbb{H}^2}\right)}},$$
where the parameter $\beta$ still controls the network's clustering coefficient. The distance in the hyperbolic plane is calculated using the hyperbolic law of cosines,
$$\cosh x_{ij} = \cosh r_i \cosh r_j - \sinh r_i \sinh r_j \cos\Delta\theta_{ij},$$
where $\Delta\theta_{ij} = \pi - |\pi - |\theta_i - \theta_j||$ is the minimum angular distance between nodes $i$ and $j$.
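The correspondence between the two formulations can be checked numerically. The sketch below is an illustration, not part of the original analysis: the function names are hypothetical, and the disk radius is taken as $R_{\mathbb{H}^2} = 2\ln[2R/(\mu\kappa_0^2)]$, which is the value that makes the two connection probabilities coincide for large radial coordinates and small angular separations and is stated here as an assumption.

```python
import math

def s1_probability(kappa_i, kappa_j, dtheta, n_nodes, mu, beta):
    """Connection probability of the S^1 model with R = N / (2*pi)."""
    radius = n_nodes / (2 * math.pi)
    return 1.0 / (1.0 + (radius * dtheta / (mu * kappa_i * kappa_j)) ** beta)

def h2_probability(kappa_i, kappa_j, dtheta, n_nodes, mu, beta, kappa_0):
    """Same node pair mapped to the hyperbolic disk (assumed disk radius)."""
    radius = n_nodes / (2 * math.pi)
    r_h2 = 2 * math.log(2 * radius / (mu * kappa_0 ** 2))  # assumption, see text
    r_i = r_h2 - 2 * math.log(kappa_i / kappa_0)
    r_j = r_h2 - 2 * math.log(kappa_j / kappa_0)
    # Hyperbolic law of cosines for the distance x between the two nodes.
    cosh_x = (math.cosh(r_i) * math.cosh(r_j)
              - math.sinh(r_i) * math.sinh(r_j) * math.cos(dtheta))
    x = math.acosh(max(cosh_x, 1.0))
    return 1.0 / (1.0 + math.exp(0.5 * beta * (x - r_h2)))
```

For instance, with $N = 10^4$, $\mu = 0.1$, $\beta = 2$, $\kappa_0 = 1$, hidden degrees 2 and 3, and an angular separation of 0.01, the two functions agree to well within one part in a thousand. The agreement degrades for large angular separations, where the approximation $x_{ij} \approx r_i + r_j + 2\ln(\Delta\theta_{ij}/2)$ underlying the mapping becomes less accurate.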
To produce replicas of the real networks using the $\mathbb{S}^1$ model, we extracted the parameters from the empirical networks, namely the size $N$ and the exponent $\gamma$ of the degree distribution, and used the exponent $\beta$ given by the embedding of the network into the hyperbolic disk. In order to generate the hidden degree sequence, we adjusted parameter $\mu$ to obtain the observed average degree $\langle k \rangle$, see Table 1.
Appendix B. Empirical data sets
US Commodities. This network represents the flows of goods and services exchanged (in USD) among industrial sectors in the USA during the year 2007. The hyperbolic embedding was obtained from Ref. Allard et al. (2017).
Enron. This is the network of email messaging activity among employees of the Enron company. We use the network obtained in Refs. Klimt and Yang (2004); Leskovec et al. (2009) and the hyperbolic embedding constructed in Ref. García-Pérez et al. (2018b).
Internet. This network consists of the connectivity data of the Internet at the autonomous systems level collected by the Archipelago project Claffy et al. (2009) during June 2009 and embedded in hyperbolic space in Ref. Boguñá et al. (2010).
Human metabolic. This network is the one-mode projection onto metabolites of the bipartite metabolic network of human cell metabolism, as embedded in Ref. Serrano et al. (2012).
Music. In this network, nodes are chords (sets of musical notes played in a single beat) and links represent observed transitions among them, see Ref. Serrà et al. (2012). We use the hyperbolic embedding of a sparser and undirected version of this network as reconstructed in Ref. García-Pérez et al. (2018b).
Words. This is the network of adjacency between words in the book "The Origin of Species" by Darwin, see Ref. Milo et al. (2004). We use the embedding presented in Ref. García-Pérez et al. (2018b).
- Newman et al. (2001) M. E. J. Newman, S. H. Strogatz, and D. J. Watts, Phys Rev E 64, 026118 (2001).
- Colizza et al. (2006) V. Colizza, A. Flammini, M. Á. Serrano, and A. Vespignani, Nat Phys 2, 110 (2006).
- Serrano (2008) M. Á. Serrano, Phys Rev E 78, 026101 (2008).
- Garlaschelli and Loffredo (2009) D. Garlaschelli and M. Loffredo, Phys Rev Lett 102, 038701 (2009).
- Newman and Girvan (2004) M. E. J. Newman and M. Girvan, Phys Rev E 69, 026113 (2004).
- Barabási and Albert (1999) A. L. Barabási and R. Albert, Science 286, 509 (1999).
- Serrano et al. (2008) M. Á. Serrano, D. Krioukov, and M. Boguñá, Phys Rev Lett 100, 078701 (2008).
- Boguñá et al. (2009) M. Boguñá, D. Krioukov, and K. Claffy, Nat Phys 5, 74 (2009).
- Krioukov et al. (2010) D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguñá, Phys Rev E 82, 036106 (2010).
- Boguñá et al. (2010) M. Boguñá, F. Papadopoulos, and D. Krioukov, Nat Comms 1, 62 (2010).
- Serrano et al. (2012) M. Á. Serrano, M. Boguñá, and F. Sagués, Mol BioSyst 8, 843 (2012).
- García-Pérez et al. (2016) G. García-Pérez, M. Boguñá, A. Allard, and M. Á. Serrano, Sci Rep 6, 33441 (2016).
- Zuev et al. (2015) K. Zuev, M. Boguñá, G. Bianconi, and D. Krioukov, Sci Rep 5, 9421 (2015).
- García-Pérez et al. (2018a) G. García-Pérez, M. Á. Serrano, and M. Boguñá, J Stat Phys 173, 775 (2018a).
- Newman (2010) M. E. J. Newman, Networks: An Introduction, Oxford University Press, Oxford, (2010).
- Krioukov (2016) D. Krioukov, Phys Rev Lett 116, 208302 (2016).
- Serrà et al. (2012) J. Serrà, A. Corral, M. Boguñá, M. Haro, and J. L. Arcos, Sci Rep 2 (2012).
- M. Ángeles Serrano (2009) M. Á. Serrano, A. Flammini, and F. Menczer, PLOS ONE 4 (2009).
- Klimt and Yang (2004) B. Klimt and Y. Yang, in CEAS (2004).
- Claffy et al. (2009) K. Claffy, Y. Hyun, K. Keys, M. Fomenkov, and D. Krioukov, in CATCH (2009).
- Blondel et al. (2008) V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and É. Lefebvre, J Stat Mech 10, P10008 (2008).
- Faqeeh et al. (2018) A. Faqeeh, S. Osat, and F. Radicchi, Phys Rev Lett 121, 098301 (2018).
- Cover and Thomas (1991) T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, (1991).
- Allard et al. (2017) A. Allard, M. Á. Serrano, G. García-Pérez, and M. Boguñá, Nat Comms 8, 14103 (2017).
- Leskovec et al. (2009) J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, Internet Mathematics 6, 29 (2009).
- García-Pérez et al. (2018b) G. García-Pérez, M. Boguñá, and M. Á. Serrano, Nat Phys 14 (2018).
- Milo et al. (2004) R. Milo, S. Itzkovitz, N. Kashtan, R. Levitt, S. Shen-Orr, I. Ayzenshtat, M. Sheffer, and U. Alon, Science 303, 1538 (2004).