Multilayer stochastic block models reveal the multilayer structure of complex networks
In complex systems, the network of interactions we observe between a system's components is the aggregate of the interactions that occur through different mechanisms or layers. Recent studies reveal that the existence of multiple interaction layers can have a dramatic impact on the dynamical processes occurring on these systems. However, these studies assume that the interactions between system components in each one of the layers are known, while typically for real-world systems we do not have that information. Here, we address the issue of uncovering the different interaction layers from aggregate data by introducing multilayer stochastic block models (SBMs), a generalization of single-layer SBMs that considers different mechanisms of layer aggregation. First, we find the complete probabilistic solution to the problem of finding the optimal multilayer SBM for a given aggregate observed network. Because this solution is computationally intractable, we propose an approximation that enables us to verify that multilayer SBMs are more predictive of network structure in real-world complex systems.
The development of tools for the analysis of real-world complex networks has significantly advanced our understanding of complex systems in fields as diverse as molecular and cell biology Barabási and Oltvai (2004), neuroscience Bullmore and Sporns (2009), biomedicine Barabási et al. (2011); Csermely et al. (2013), ecology Thompson et al. (2012); Rohr et al. (2014), economics Schweitzer et al. (2009), and sociology Borgatti et al. (2009). One of the main successes of the network approach has been to unravel the relationship between the modular organization of interactions within a complex system Newman (2011), and the function and temporal evolution of the system Guimerà and Amaral (2005); Arenas et al. (2006); Guimerà et al. (2007); Ahn et al. (2010). As a result, a large body of research has been devoted to the detection of the modular structure (or community structure) of complex networks, that is, to the division of the nodes of the network into densely connected subgroups Fortunato (2010).
Stochastic block models (SBMs) White et al. (1976); Holland et al. (1983); Nowicki and Snijders (2001) are a class of probabilistic generative network models that provide a more general description of the (mesoscopic) group structure of real-world networks than modular models. In SBMs, nodes are assumed to belong to groups and connect to each other with probabilities that depend only on their group memberships. The simple mathematical form of SBMs has enabled not only the identification of generalized community structures in networks Nowicki and Snijders (2001); Karrer and Newman (2011); Decelle et al. (2011); Schmidt and Mørup (2013); Peixoto (2013, 2014a, 2014b); Larremore et al. (2014); Aicher et al. (2014); Yan et al. (2014), but also the use of network inference as a predictive tool: to detect missing and spurious links in empirical network data Guimerà and Sales-Pardo (2009), to predict human decisions Guimerà and Sales-Pardo (2011); Guimerà et al. (2012) and the appearance of conflict in work teams Rovira-Asenjo et al. (2013), and to identify unknown interactions between drugs Guimerà and Sales-Pardo (2013).
While these approaches have pushed forward our understanding of complex network structure, a limitation is that they rely on the premise that there is a single mechanism that describes the connectivity of the network, even though we know that real-world networks are often the result of processes occurring on different “layers” (for example, social networks comprise relationships that arise in the family layer, and others that arise in the professional layer) Kivelä et al. (2014). Moreover, it is increasingly clear that the multilayer structure of complex networks can have a dramatic impact on the dynamical processes that take place on them Morris and Barthelemy (2012); Radicchi and Arenas (2013); Gómez et al. (2013); De Domenico et al. (2013, 2014). Unfortunately, we often lack information about the different layers of interaction and can only observe projections of these multilayer interactions into an aggregate network in which all links are equivalent.
Here, we precisely address the problem of unraveling the underlying multilayer structure in real-world networks. To do so, we first introduce the family of multilayer SBMs that generalizes single-layer SBMs to situations where links arise in different layers and are aggregated through different mechanisms. Although there have been proposals to extend the concept of modularity to multilayer networks Mucha et al. (2010), ours represents a pioneering attempt to extend generative group-based models to multilayer systems, and to study those models rigorously using tools from statistical physics.
Second, we give the probabilistically complete solution to the problem of inferring the optimal multilayer SBM for a given aggregate network. Because this solution is computationally intractable, we propose an approximation which enables us to objectively address the question of whether an observed network is likely to be the projection of multiple layers. Our results suggest that many real-world networks are indeed projections.
I Multilayer stochastic block models
In our approach, nodes interact in $L$ different layers. In each one of these layers we define an SBM as follows: each node $i$ belongs to a specific group $\sigma_i^\ell$ in layer $\ell$, and links between pairs of nodes belonging to groups $\alpha$ and $\beta$ in layer $\ell$ exist with probability $q^\ell_{\alpha\beta}$. The observed adjacency matrix $A^O$ is an aggregate that results from the combination of the links in each of the layers, but where all information about the layers has been lost (Fig. 1). We call this model the multilayer SBM.
Here we consider the simplest case of two layers, $L=2$. In this case, there are two combinations with a plausible physical interpretation: i) the AND combination of layers, in which $A^O_{ij}=1$ if, and only if, nodes $i$ and $j$ are connected in both layers (Fig. 1(a)); ii) the OR combination of layers, in which $A^O_{ij}=1$ if $i$ and $j$ are connected in at least one layer (Fig. 1(b)). Indeed, each of these two mechanisms is plausible for specific scenarios. For example, the AND model is a plausible model for in vivo protein interactions, because in order for proteins to interact in the cell it is necessary for them to be capable of physically interacting (that is, to be linked in the layer of in vitro physical interactions) and to be expressed simultaneously in the same cellular compartment (that is, to be linked in the co-expression layer). The OR model is a plausible model for the effective on-line social network through which memes spread Weng et al. (2013), because some people use Facebook to share memes, others use Twitter, and others use both.
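The two aggregation mechanisms can be illustrated with a minimal Python sketch. It assumes undirected networks without self-loops, represents each layer as a set of node pairs, and uses illustrative names (`sample_layer`, `aggregate_and`, `aggregate_or`) and arbitrary example probabilities that are not taken from this work.

```python
import itertools
import random


def sample_layer(groups, q, rng):
    """Sample one SBM layer.

    groups[i] is the group of node i in this layer; q[a][b] is the
    (symmetric) probability of a link between groups a and b.
    Returns the links as a set of pairs (i, j) with i < j.
    """
    links = set()
    for i, j in itertools.combinations(range(len(groups)), 2):
        if rng.random() < q[groups[i]][groups[j]]:
            links.add((i, j))
    return links


def aggregate_and(layer1, layer2):
    """AND model: a link is observed only if it exists in both layers."""
    return layer1 & layer2


def aggregate_or(layer1, layer2):
    """OR model: a link is observed if it exists in at least one layer."""
    return layer1 | layer2


# Example with arbitrary (assumed) partitions and probabilities.
rng = random.Random(0)
groups1 = [0, 0, 0, 1, 1, 1]          # partition in layer 1
groups2 = [0, 1, 0, 1, 0, 1]          # partition in layer 2
q1 = [[0.9, 0.2], [0.2, 0.9]]
q2 = [[0.8, 0.3], [0.3, 0.8]]
layer1 = sample_layer(groups1, q1, rng)
layer2 = sample_layer(groups2, q2, rng)
observed_and = aggregate_and(layer1, layer2)
observed_or = aggregate_or(layer1, layer2)
```

By construction, the AND aggregate is contained in each layer and each layer is contained in the OR aggregate, so the AND network is never denser than either layer and the OR network is never sparser.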
In principle, we would like to identify the pair of partitions $(P_1, P_2)$ (in layers 1 and 2, respectively) that best describes the observed aggregate topology. The probabilistically complete way to solve this problem is to obtain the joint probability $P(P_1, P_2 | A^O)$ that $P_1$ and $P_2$ are the true partitions of the nodes given the aggregate observed network $A^O$. This distribution is given by
where $Q^\ell$ is a matrix whose elements $q^\ell_{\alpha\beta}$ represent the probability that a link exists between a pair of nodes belonging to groups $\alpha$ and $\beta$ in layer $\ell$, and $\int dQ^1 dQ^2$ is the integral over all possible values of these probabilities.
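For orientation, the marginalization just described can be sketched as Bayes' rule followed by integration over the connection probabilities. The notation is an assumption chosen to match the verbal definitions: $P_1$ and $P_2$ are the layer partitions, $A^O$ is the observed aggregate network, $Q^1$ and $Q^2$ are the matrices of connection probabilities, and $Z$ is a normalization constant.

```latex
P(P_1, P_2 \mid A^O) = \frac{1}{Z} \int dQ^1 \, dQ^2 \;
    P(A^O \mid P_1, P_2, Q^1, Q^2) \, P(P_1, P_2, Q^1, Q^2),
\qquad Z = P(A^O).
```

The first factor in the integrand is the likelihood of the aggregate network given a fully specified two-layer model, and the second is the prior over partitions and connection probabilities.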
This integral can be computed both for AND combinations and for OR combinations of the two layers; for simplicity, here we focus on the AND model and discuss the OR model in the Appendices. Because in an SBM each link is independent of the others, and in the AND model a link has to be present in both layers to appear in the observed aggregate network $A^O$, the AND likelihood is
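where $n^1_{\alpha\beta\gamma\delta}$ is the number of links between pairs of nodes that are in groups $\alpha$ and $\beta$ respectively in layer 1, and in groups $\gamma$ and $\delta$ respectively in layer 2 ($A^O_{ij}=1$); and $n^0_{\alpha\beta\gamma\delta}$ is the number of non-links between such pairs of nodes ($A^O_{ij}=0$).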
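As an illustration of these counts, the following Python sketch tallies links and non-links per combination of groups and evaluates the logarithm of an AND likelihood of the form $\prod (q^1_{\alpha\beta} q^2_{\gamma\delta})^{n^1} (1 - q^1_{\alpha\beta} q^2_{\gamma\delta})^{n^0}$, which is the form implied by the description above. The function names, the data layout, and the symbols $n^1$ and $n^0$ are assumptions made for this example.

```python
import itertools
import math
from collections import Counter


def and_counts(links, part1, part2):
    """Count links (n1) and non-links (n0) per block of the two-layer model.

    A block is an unordered pair of (layer-1 group, layer-2 group) labels,
    one label per endpoint of a node pair. links holds pairs (i, j), i < j.
    """
    n1, n0 = Counter(), Counter()
    for i, j in itertools.combinations(range(len(part1)), 2):
        block = tuple(sorted([(part1[i], part2[i]), (part1[j], part2[j])]))
        if (i, j) in links:
            n1[block] += 1
        else:
            n0[block] += 1
    return n1, n0


def and_log_likelihood(links, part1, part2, q1, q2):
    """Log-likelihood of the observed links under the AND model.

    Assumes symmetric q1, q2 and 0 < q1 * q2 < 1 wherever the
    corresponding count is non-zero (otherwise math.log raises).
    """
    n1, n0 = and_counts(links, part1, part2)
    logl = 0.0
    for block in set(n1) | set(n0):
        (a, c), (b, d) = block  # a, b: layer-1 groups; c, d: layer-2 groups
        p = q1[a][b] * q2[c][d]  # a link must be present in both layers
        if n1[block]:
            logl += n1[block] * math.log(p)
        if n0[block]:
            logl += n0[block] * math.log(1.0 - p)
    return logl
```

Because the block key is built from the pair of (layer-1, layer-2) labels of each node, the counts depend on the two partitions only through their intersection.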
Assuming a uniform distribution for the prior Guimerà and Sales-Pardo (2009) (see note (40) for other possibilities), we can plug Eq. (2) into Eq. (1) and integrate to find (Appendices)
where, for clarity, we have used the shorthand and , and .
Given Eq. (3), which is the complete probabilistic description of the multilayer SBM, one could in principle find the partitions $P_1$ and $P_2$ that maximize $P(P_1, P_2 | A^O)$. If this were possible, one would be able to perfectly disentangle the two SBMs responsible for the observed links, even though the observation did not have explicit information about the layers. It would also be possible to compare regular SBMs to multilayer SBMs to determine if a multilayer model is more or less appropriate to describe a given network. Unfortunately, the expression above becomes numerically intractable even for a small number of groups and therefore one needs to make approximations that simplify the problem.
II Link reliability with approximate multilayer stochastic block models
We propose an approximation that makes it possible to work with multilayer SBMs. We start by noting that any multilayer SBM can be represented as a single-layer SBM (Fig. 2(a); see also note (41)). In the single-layer SBM, each group comprises the nodes that belong to the same pair of groups, one in the layer-1 partition $P_1$ and one in the layer-2 partition $P_2$, in the multilayer SBM (and only those); we call the single-layer partition the intersection partition. Moreover, if group $a$ in the intersection partition corresponds to groups $\alpha$ in $P_1$ and $\gamma$ in $P_2$, and group $b$ in the intersection partition corresponds to groups $\beta$ in $P_1$ and $\delta$ in $P_2$, then the probability of connection between $a$ and $b$ in the single-layer SBM is $q^1_{\alpha\beta}\, q^2_{\gamma\delta}$, where $q^\ell$ denotes the connection probabilities in layer $\ell$ (for simplicity, we again focus on the AND model and leave the OR model for the Appendices). This fully determines the single-layer SBM.
Here, we make the following approximation: we keep the information about the layer partitions in the intersection partition, but consider that the matrix elements of the intersection SBM, while each being the result of the product of two factors, are all independent of each other (see Fig. 2(b)). Since this approximation is equivalent to integrating separately every term with a different combination of group indices in Eq. (2), it follows that the integrated likelihood depends exclusively on the intersection partition. In other words, within this approximation all pairs of partitions with the same intersection partition are equally likely, and it is no longer possible to uniquely determine the multilayer SBM that best describes the observed topology.
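The intersection partition, and the degeneracy it implies, can be made concrete with a short Python sketch. The function name and the convention of labeling groups in order of first appearance are choices made for this example only.

```python
def intersection_partition(part1, part2):
    """Group nodes by their (layer-1 group, layer-2 group) pair.

    Labels are assigned in order of first appearance, so two pairs of
    layer partitions that induce the same grouping of nodes return
    identical label lists.
    """
    labels = {}
    result = []
    for pair in zip(part1, part2):
        if pair not in labels:
            labels[pair] = len(labels)
        result.append(labels[pair])
    return result


# Two different pairs of layer partitions...
pair_a = ([0, 0, 1, 1], [0, 1, 0, 1])
pair_b = ([0, 1, 2, 3], [0, 0, 0, 0])

# ...that share the same intersection partition (every node on its own).
same = intersection_partition(*pair_a) == intersection_partition(*pair_b)
```

In this example, two groups in each layer and four groups in one layer with a single group in the other are indistinguishable within the approximation, because both yield the intersection partition in which each of the four nodes forms its own group.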
Despite this limitation, our approximation still enables us to address the fundamental question of whether real-world networks are better described by single-layer or multilayer models. Specifically, in what follows we compare the predictive power of single-layer and multilayer SBMs in the problem of detecting missing and spurious links in noisy networks Guimerà and Sales-Pardo (2009); we argue that, if (approximate) multilayer SBMs yield better predictions on real networks, then there is evidence to suggest that these networks are likely the outcome of multilayer processes (despite being observed as single-layer aggregates).
In the problem of assessing link reliability Clauset et al. (2008); Guimerà and Sales-Pardo (2009), the goal is to compute the probability that a link between nodes $i$ and $j$ truly exists ($A_{ij}=1$) given a noisy network observation $A^O$, which contains false positives (spurious interactions that are reported but do not truly exist) and false negatives (missing interactions that truly exist but are not reported). We call the probability $R_{ij} = p(A_{ij}=1 | A^O)$ the reliability of the link. In general, for any set of models $\mathcal{M}$ (single-layer SBMs, AND-multilayer SBMs or OR-multilayer SBMs), the reliability is Guimerà and Sales-Pardo (2009)
where $Z$ is a normalization constant.
In the case of multilayer SBMs, the integral over the ensemble of models requires: i) the integration over the connection probabilities $Q^1$ and $Q^2$ of the two layers (akin to what we did to obtain Eq. (1)); ii) the sum over all pairs of partitions $P_1$ and $P_2$. Within our approximation, the first step can be carried out analytically but the second cannot (Appendices). However, always within our approximation, one can exploit the fact that the integral in Eq. (4) depends exclusively on the intersection partition and map the sum over pairs of partitions onto a sum over a single partition. By doing so we obtain the following expression for the link reliability (see Appendices for the analogous expression for the OR model)
where the sum is over all possible intersection partitions $P$ (that is, all single-level partitions), $l^O_{\alpha\beta}$ is the number of links between groups $\alpha$ and $\beta$ in the intersection partition, $r_{\alpha\beta}$ is the number of pairs of nodes in groups $\alpha$ and $\beta$, and the degeneracy is the number of pairs of partitions $(P_1, P_2)$ that have the same intersection partition (see Appendices). The energy function $H(P)$ is
where the sum is over all distinct pairs of groups in the intersection partition.
As in Guimerà and Sales-Pardo (2009), the expression for the link reliability (Eq. 5) is analogous to an ensemble average of an observable in statistical mechanics, giving the function in Eq. (6) the meaning of an energy associated with a specific intersection partition. We can use a Markov chain Monte Carlo algorithm to compute the reliability numerically (see Supplementary Material for details). As it turns out, this energy is equal to the energy obtained assuming a single SBM (Eq. S2, Guimerà and Sales-Pardo (2009)) plus a term that arises because the probability matrix elements associated with the intersection SBM are the result of a product of two probabilities. In a Bayesian context, we can interpret this term and the degeneracy as non-uniform priors for the intersection partitions.
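As a minimal sketch of such an ensemble average, the following Python code evaluates the single-layer reliability exactly for a very small network. It assumes the single-layer form of the energy, $H = \sum_{\alpha \le \beta} [\ln(r_{\alpha\beta}+1) + \ln \binom{r_{\alpha\beta}}{l_{\alpha\beta}}]$, with observable $(l_{\sigma_i\sigma_j}+1)/(r_{\sigma_i\sigma_j}+2)$, and it omits the additional multilayer term and the degeneracy, which are not reproduced here. Exhaustive enumeration of partitions stands in for Markov chain Monte Carlo sampling and is feasible only for a handful of nodes; all function names are illustrative.

```python
import itertools
import math
from collections import Counter


def set_partitions(n):
    """Yield every partition of n nodes once, as a list of group labels."""
    def extend(labels, n_groups):
        if len(labels) == n:
            yield list(labels)
            return
        # The next node joins an existing group or opens a new one.
        for g in range(n_groups + 1):
            labels.append(g)
            yield from extend(labels, max(n_groups, g + 1))
            labels.pop()
    yield from extend([], 0)


def block_counts(links, partition):
    """Return (l, r): observed links and node pairs per pair of groups."""
    l, r = Counter(), Counter()
    for i, j in itertools.combinations(range(len(partition)), 2):
        block = tuple(sorted((partition[i], partition[j])))
        r[block] += 1
        if (i, j) in links:
            l[block] += 1
    return l, r


def energy(l, r):
    """Single-layer energy: sum over blocks of ln(r + 1) + ln C(r, l)."""
    return sum(math.log(r[b] + 1) + math.log(math.comb(r[b], l[b])) for b in r)


def reliability(links, n, i, j):
    """Exact single-layer reliability of the pair (i, j), with i < j."""
    weighted, z = 0.0, 0.0
    for partition in set_partitions(n):
        l, r = block_counts(links, partition)
        weight = math.exp(-energy(l, r))
        block = tuple(sorted((partition[i], partition[j])))
        weighted += weight * (l[block] + 1) / (r[block] + 2)
        z += weight
    return weighted / z
```

For instance, on four nodes with observed links `{(0, 1), (2, 3)}`, the pair `(0, 1)` receives a higher reliability than the unobserved pair `(0, 2)`. The number of partitions grows as the Bell numbers, which is why sampling is needed for networks of realistic size.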
III Validation of link reliability estimation in model networks
Now that we are able to estimate link reliabilities using single-layer SBMs Guimerà and Sales-Pardo (2009) and our approximation to two-layer (AND and OR) SBMs (Eq. (5)), we compare the performance of these approaches at detecting missing and spurious interactions. Our expectation is that if real-world networks are truly the result of the aggregation of multiple layers, assuming a two-layer structure should result in higher accuracy.
To identify the limits of detectability of the 2-layer SBM model, we first construct a set of multilayer test networks that have a clearly differentiated block structure in each of two layers, and that are aggregated using the AND and OR models (see Methods and Fig. 3). We consider the predictive power of each of the approaches at detecting Clauset et al. (2008); Guimerà and Sales-Pardo (2009): i) missing links (we remove a fraction of the links and compute the fraction of times that a removed link has a higher reliability than a link not present in the original network, that is the AUC statistic); ii) spurious links (we add a fraction of links and compute the fraction of times that an added link has a lower reliability than a link present in the original network, that is the AUC statistic).
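The AUC statistic described above can be computed directly from two lists of reliabilities. The following Python sketch is one way to do so; counting ties as one half is an assumption of this example, and the function name is illustrative.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs in which the positive item
    has the higher score; ties contribute one half (an assumption here)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

For missing links, `positive_scores` would hold the reliabilities of the removed links and `negative_scores` those of pairs not linked in the original network. For spurious links, `positive_scores` would hold the reliabilities of links present in the original network and `negative_scores` those of the added links, so that in both cases a value of 1 means perfect detection and 0.5 means chance.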
For AND networks (Fig. 3(a-f)) we find that, for the detection of both missing and spurious links, the 2-layer approach outperforms the single-layer approach, especially: (i) when the number of distinct node groups in the intersection partition and the connectivity grow; (ii) for small or moderate noise levels (fraction of removed/added links). Only when the structure of the blocks becomes very blurry do we observe that the single-layer approach works better (but in this region all approaches do in fact work poorly).
For OR networks (Fig. 3(g-i)), the 2-layer approach again outperforms its single-layer counterpart in most situations. In this case, however, the largest improvements in performance happen for the hard cases with lower connectivity. This can be explained by noting that the OR model tends to generate very dense networks, whereas aggregate AND networks are sparser than the networks in each of the layers. Therefore, in general we expect the AND model to produce better results in real-world networks.
IV Multilayer stochastic block models are more predictive for real networks
Finally, we compare the performance of the single-layer and multilayer approaches on three real-world networks: (i) the air transportation network in Eastern Europe Guimerà et al. (2005); (ii) the neural network of C. elegans White et al. (1986); and (iii) the email network within a university Guimerà et al. (2003). Our results show that the two-layer AND model provides a better description of these real-world networks, since the multilayer SBM approach consistently detects both missing and spurious interactions more accurately (the improvement is slight but, in most cases, significant), especially for low observational noise.
We have introduced the family of multilayer SBMs, which generalizes single-layer SBMs to situations where links arise in different layers and are aggregated through different mechanisms. We have also given the probabilistically complete solution to the problem of inferring the optimal multilayer SBM for a given aggregate network, and proposed a tractable approximation which enables us to objectively address the question of whether an observed network is best described as the projection of multiple layers or as a single layer. Our results suggest that many real-world networks are indeed projections.
Although, as mentioned above, there have been proposals to extend the concept of modularity to multilayer networks Mucha et al. (2010), ours represents a pioneering attempt to extend stochastic block models to multilayer systems. In this regard, it is important to stress that in this work we are concerned with the learning of multilayer models from aggregate networks where all information about the layers has been lost; in this sense, our work is different from previous attempts to do inference of stochastic block models on multigraphs where the layers themselves are observed Guimerà et al. (2012).
Our work is also different from works on link prediction using latent feature models Miller et al. (2009); Palla et al. (2012); Kim and Leskovec (2013). An important difference between latent feature approaches and ours is that the latent feature model considers that the probability of existence of a link is a function of the weighted sum of the interactions at the different layers; therefore, the latent feature model does not allow a physical interpretation of what each layer is and of how layers are combined. All in all, latent feature models are very well suited for the inference of unobserved links, but due to the intricacies of the models and the difficulty of interpreting their “parameters,” it is not clear whether they are appropriate to address the question of whether a real network is really the outcome of a multilayer process or not (and they may also be prone to overfitting when observational data is noisy).
Our multilayer SBM is the simplest group-based multilayer model one can propose. We believe that its detailed analysis will open the door to better understand the structure of real complex networks.
Acknowledgements. We thank the following people for helpful comments and discussions: A. Aguilar-Mogas, A. Arenas, M. De Domenico, A. Godoy-Lorite, T.P. Peixoto, O. Senan-Campos, M. Tarrés-Deulofeu. This work was supported by a James S. McDonnell Foundation Research Award, Spanish Ministerio de Economía y Competitividad (MINECO) Grant FIS2010-18639, European Union Grant PIRG-GA-2010-277166 (to RG), European Union Grant PIRG-GA-2010-268342 (to MSP), and European Union FET Grant 317532 (MULTIPLEX).
- Barabási and Oltvai (2004) A.-L. Barabási and Z. N. Oltvai, Nat. Rev. Genet. 5, 101 (2004).
- Bullmore and Sporns (2009) E. Bullmore and O. Sporns, Nat. Rev. Neurosci. 10, 186 (2009).
- Barabási et al. (2011) A.-L. Barabási, N. Gulbahce, and J. Loscalzo, Nat. Rev. Genet. 12, 58 (2011).
- Csermely et al. (2013) P. Csermely, T. Korcsmáros, H. J. M. Kiss, G. London, and R. Nussinov, Pharmacol. Therapeut. 138, 333 (2013).
- Thompson et al. (2012) R. M. Thompson, U. Brose, J. A. Dunne, R. O. Hall, S. Hladyz, R. L. Kitching, N. D. Martinez, H. Rantala, T. N. Romanuk, D. B. Stouffer, and J. M. Tylianakis, Trends Ecol. Evol. 27, 689 (2012).
- Rohr et al. (2014) R. P. Rohr, S. Saavedra, and J. Bascompte, Science 345, 416 (2014).
- Schweitzer et al. (2009) F. Schweitzer, G. Fagiolo, D. Sornette, F. Vega-Redondo, A. Vespignani, and D. R. White, Science 325, 422 (2009).
- Borgatti et al. (2009) S. P. Borgatti, A. Mehra, D. J. Brass, and G. Labianca, Science 323, 892 (2009).
- Newman (2011) M. E. J. Newman, Nat. Phys. 8, 25 (2011).
- Guimerà and Amaral (2005) R. Guimerà and L. A. N. Amaral, Nature 433, 895 (2005).
- Arenas et al. (2006) A. Arenas, A. Díaz-Guilera, and C. J. Pérez-Vicente, Phys. Rev. Lett. 96, art. no. 114102 (2006).
- Guimerà et al. (2007) R. Guimerà, M. Sales-Pardo, and L. Amaral, Nature Phys. 3, 63 (2007).
- Ahn et al. (2010) Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, Nature 466, 761 (2010).
- Fortunato (2010) S. Fortunato, Phys. Rep. 486, 75 (2010).
- White et al. (1976) H. C. White, S. A. Boorman, and R. L. Breiger, Am. J. Sociol. 81, 730 (1976).
- Holland et al. (1983) P. W. Holland, K. B. Laskey, and S. Leinhardt, Soc. Networks 5, 109 (1983).
- Nowicki and Snijders (2001) K. Nowicki and T. A. B. Snijders, J. Am. Stat. Assoc. 96, 1077 (2001).
- Karrer and Newman (2011) B. Karrer and M. Newman, Phys. Rev. E 83, 016107 (2011).
- Decelle et al. (2011) A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Phys. Rev. Lett. 107, 065701 (2011).
- Schmidt and Mørup (2013) M. N. Schmidt and M. Mørup, IEEE Signal Processing Magazine 30, 110 (2013).
- Peixoto (2013) T. P. Peixoto, Phys. Rev. Lett. 110, 148701 (2013).
- Peixoto (2014a) T. P. Peixoto, Phys. Rev. X 4, 011047 (2014a).
- Peixoto (2014b) T. P. Peixoto, Phys. Rev. E 89, 012804 (2014b).
- Larremore et al. (2014) D. B. Larremore, A. Clauset, and A. Z. Jacobs, Phys. Rev. E 90, 012805 (2014).
- Aicher et al. (2014) C. Aicher, A. Z. Jacobs, and A. Clauset, J. Compl. Netw. (in press, 2014).
- Yan et al. (2014) X. Yan, C. Shalizi, J. E. Jensen, F. Krzakala, C. Moore, L. Zdeborová, P. Zhang, and Y. Zhu, J. Stat. Mech.: Theor. Exp. , P05007 (2014).
- Guimerà and Sales-Pardo (2009) R. Guimerà and M. Sales-Pardo, Proc. Natl. Acad. Sci. U. S. A. 106, 22073 (2009).
- Guimerà and Sales-Pardo (2011) R. Guimerà and M. Sales-Pardo, PLOS ONE 6, e27188 (2011).
- Guimerà et al. (2012) R. Guimerà, A. Llorente, E. Moro, and M. Sales-Pardo, PLOS ONE 7, e44620 (2012).
- Rovira-Asenjo et al. (2013) N. Rovira-Asenjo, T. Gumí, M. Sales-Pardo, and R. Guimerà, Sci. Rep. 3, 1999 (2013).
- Guimerà and Sales-Pardo (2013) R. Guimerà and M. Sales-Pardo, PLOS Comput. Biol. 9, e1003374 (2013).
- Kivelä et al. (2014) M. Kivelä, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno, and M. A. Porter, J. Complex Netw. 2, 203 (2014).
- Morris and Barthelemy (2012) R. Morris and M. Barthelemy, Phys. Rev. Lett. 109, 128703 (2012).
- Radicchi and Arenas (2013) F. Radicchi and A. Arenas, Nat. Phys. 9, 717 (2013).
- Gómez et al. (2013) S. Gómez, A. Díaz-Guilera, J. Gómez-Gardeñes, C. J. Pérez-Vicente, Y. Moreno, and A. Arenas, Phys. Rev. Lett. 110, 028701 (2013).
- De Domenico et al. (2013) M. De Domenico, A. Solé-Ribalta, E. Cozzo, M. Kivelä, Y. Moreno, M. A. Porter, S. Gómez, and A. Arenas, Phys. Rev. X 3, 041022 (2013).
- De Domenico et al. (2014) M. De Domenico, A. Solé-Ribalta, S. Gómez, and A. Arenas, Proc. Natl. Acad. Sci. U.S.A., 8351–8356 (2014).
- Mucha et al. (2010) P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J.-P. Onnela, Science 328, 876 (2010).
- Weng et al. (2013) L. Weng, F. Menczer, and Y.-Y. Ahn, Scientific Reports 3, 2522 (2013).
- (40) Other possibilities include choosing non-uniform priors for the connection probabilities Peixoto (2013, 2014a); Schmidt and Mørup (2013) or different priors for the partitions Decelle et al. (2011); Schmidt and Mørup (2013); Peixoto (2013, 2014a, 2014b); Yan et al. (2014).
- (41) The reverse is also true, so the possible network models one can generate with single-layer SBMs and multilayer SBMs are, in fact, identical. However, it is important to note that each of them gives different weights to different models, so that a model that is relatively probable in the multilayer SBM family might be relatively rare in the single-layer SBM family, and vice versa.
- Clauset et al. (2008) A. Clauset, C. Moore, and M. E. J. Newman, Nature 453, 98 (2008).
- Guimerà et al. (2005) R. Guimerà, S. Mossa, A. Turtschi, and L. A. N. Amaral, Proc. Natl. Acad. Sci. USA 102, 7794 (2005).
- White et al. (1986) J. G. White, E. Southgate, J. N. Thomson, and S. Brenner, Philos. T. R. Soc. B. 314, 1 (1986).
- Guimerà et al. (2003) R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, Phys. Rev. E 68, art. no. 065103 (2003).
- Miller et al. (2009) K. Miller, M. I. Jordan, and T. L. Griffiths, in Advances in Neural Information Processing Systems 22, edited by Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (Curran Associates, Inc., 2009) pp. 1276–1284.
- Palla et al. (2012) K. Palla, D. Knowles, and Z. Ghahramani, in Proceedings of the 29th International Conference on Machine Learning (ICML-12) (2012) pp. 1607–1614.
- Kim and Leskovec (2013) M. Kim and J. Leskovec, in Advances in Neural Information Processing Systems 26, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Curran Associates, Inc., 2013) pp. 1385–1393.